Recently I had to migrate a very old website to a new server. When that website was first deployed the MySQL server was at version 4.0, which had no internal support for character sets. When MySQL 4.1 appeared and gave each table and column its own character set, it was very common for applications to store data in an encoding different from the one declared in the database schema.

This led to problems during data migrations and also when browsing the database with clients that specify their connection encoding. Whenever a connection encoding different from the default (latin1) was specified, MySQL would try to convert the table contents to the requested encoding and fail miserably if the data’s real encoding was different from what was stored in the table definition. In my case I was storing utf8 data while MySQL was convinced it was latin1. In a situation like this you have two options.
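
Before changing anything it is worth confirming what the data really is. A quick check, reusing the pages table and title column from the example further down, is to connect with the encoding that matches the table definition (so MySQL performs no conversion) and dump the raw bytes of a sample value. Real utf8 data shows up as multi-byte sequences (e.g. C3 B3 for ó) instead of single latin1 bytes:

# Connect as latin1 (matching the declared charset) so no conversion
# happens, then inspect the raw bytes of one value.
mysql -u user -p --default-character-set=latin1 \
  -e "SELECT title, HEX(title) FROM pages LIMIT 1;" database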

If you’re able to SSH to the machine with the database, you can work around this problem during export by specifying the default character set for mysqldump. If it matches the table’s declared character set, MySQL doesn’t perform any conversion. Then you only have to adjust the metadata in the data definition statements inside the SQL dump. Something like this might work:

# Dump without any conversion, then rewrite the charset metadata in the
# table definitions and in the SET NAMES directive.
mysqldump -u user --default-character-set=latin1 -p database | \
sed -e "s/ DEFAULT CHARSET=latin1;$/ DEFAULT CHARSET=utf8;/" | \
sed -e "s/!40101 SET NAMES latin1 /!40101 SET NAMES utf8 /" > \
  good-encoding.sql

This is a safe procedure because you’re not modifying any data on the server, only creating a separate dump.
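
Importing the corrected dump on the new server is then a standard restore, for example:

# The rewritten SET NAMES utf8 line inside the dump sets the connection
# encoding, but being explicit doesn't hurt.
mysql -u user -p --default-character-set=utf8 database < good-encoding.sql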

Unfortunately I didn’t have shell access to the machine that was running my app, so I had to modify the data in place before doing the export with phpMyAdmin.

The documentation mentions an ALTER TABLE ... CONVERT TO CHARACTER SET command that can change a table’s character encoding, but that command also performs data conversion and can corrupt the data if its meta information is incorrect. The same page mentions that if you want to skip the conversion, you have to add an intermediate step: convert the column to the corresponding BINARY/VARBINARY/BLOB type, change the table encoding, and then convert the column back to its normal type. Like this:

ALTER TABLE `pages` MODIFY `title` varbinary(30) NOT NULL default '';
ALTER TABLE `pages` CHARACTER SET utf8;
ALTER TABLE `pages` MODIFY `title` varchar(30) NOT NULL default '';

After executing these statements the only thing modified should be the meta information stored in the table definition. The data itself remains unchanged.
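
You can verify that by dumping the raw bytes before and after the round trip and comparing the output; only the SHOW CREATE TABLE result should differ. Again using the pages/title example:

# Run before and after the ALTER TABLE sequence and diff the results.
mysql -u user -p -e "SELECT HEX(title) FROM pages LIMIT 5;" database
mysql -u user -p -e "SHOW CREATE TABLE pages\G" database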

I’ve written a little Ruby script that parses the data definition statements generated by mysqldump or phpMyAdmin (select the “no-data” option during export) and prepares these statements automatically. Note that it will not touch columns that have an explicit character set definition, i.e. ones that don’t use the table default.

#!/usr/bin/ruby

# Usage: script [target_encoding] [dump_file]
# Reads table definitions from the dump (or stdin) and prints the
# ALTER TABLE statements needed for a conversion-free charset change.

target_enc = ARGV.size < 1 ? 'utf8' : ARGV[0]
f = ARGV.size < 2 ? $stdin : File.new(ARGV[1], 'r')

table_name = ''
to_binary = []
to_text = []

until f.eof?
  line = f.readline

  # Read the table name and reset the statement buffers
  if /^CREATE TABLE.*`([a-z_]*)`.*$/i =~ line
    table_name = $1
    to_binary = []
    to_text = []
  end

  # Prepare conversion statements for char, varchar and text columns
  if /^ *`([a-z_]*)` (char|varchar|text)([^,]*),?$/i =~ line
    case $2.downcase
    when 'char' then newcol = 'binary'
    when 'varchar' then newcol = 'varbinary'
    when 'text' then newcol = 'blob'
    end

    # Don't convert columns with an explicit character set or collation
    type = $3.downcase
    next if type.include?("character") || type.include?("collate")

    to_binary << "ALTER TABLE `#{table_name}` MODIFY `#{$1}` #{newcol}#{$3};"
    to_text << "ALTER TABLE `#{table_name}` MODIFY `#{$1}` #{$2}#{$3};"
  end

  # At the end of a table definition print the statements
  if /^\) ENGINE=.*$/ =~ line
    to_binary.each { |s| puts s }
    puts "ALTER TABLE `#{table_name}` CHARACTER SET #{target_enc};"
    to_text.each { |s| puts s }
  end
end
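
Assuming you save the script as fix-charset.rb (the name is mine), a typical run might look like this: export the schema only, generate the conversion statements, review them, and then execute them against the database or paste them into phpMyAdmin’s SQL tab.

# Schema-only dump, then generate the ALTER TABLE statements.
mysqldump -u user -p --no-data database > schema.sql
ruby fix-charset.rb utf8 schema.sql > convert.sql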

This worked for me perfectly, but be careful – you’re modifying your original data.