Learning from your mistakes: mixed character sets in MySQL

Character sets are a big thing these days and luckily we’re slowly evolving to a world where UTF-8 rules. In the meanwhile we’re stuck in a transitional phase where the ISO-8859-1 encoding is still used.

MySQL followed this trend and introduced character set support in version 4.1. Importing data from older versions is easy and SET NAMES ‘utf8′ is quite useful to ensure special characters are correctly encoded in the new databases.

But I learned first hand that the SET NAMES ‘utf8 trick is not a cure for every disease.

The story

Moving up is easy, but moving down is a real nightmare if you don’t take the necessary precautions. I had a case where a database had to be moved from a MySQL 3.23 server to a MySQL 5.0 server. We decided to store all data in UTF-8 encoding which worked out well.

We also knew that the upgrade was a temporary measure: the 5.0 was installed on a high performance dedicated box to handle a peak. Once we were passed the peak, we decided to move back to the old server, which was a shared box. I was always under the assumption that the old server had a MySQL 4.1, but little did I know that I was dealing with an old school 3.23.

The problem

Unaware of any risk, I quickly took a mysqldump and imported the data on the old server. Apparently I had saved UTF-8 data in a database server that was only ISO aware. The consequences weren’t that clear in the beginning since Dutch & English characters are usually the basic ISO characters.

Once I noticed that French characters were suddenly malformatted, I knew something was wrong. I quickly recreated the hosting on a 4.1 server and imported the data again. When reloading the page I was stunned: the old UTF-8 data was correctly displayed, but apparently all the new data since the migration from the dedicated server was still ISO.

The solution

Attempts that failed

The conclusion was that I had tables with records partially in UTF-8 & partially in ISO. I figured this was easy to solve by writing a script that reads every line from the dump and converts it to UTF-8. By doing this, I would be sure all characters were the same.

Again I was faced with a big surprise: converting ISO data to UTF-8 works fine via ‘utf8_encode’ in PHP , but UTF-8 stored in an ISO database cannot be converted to valid UTF-8 via this function.

Later I figured iconv could to the trick for me, but again the results where negative.

The solution that actually worked

Desperate for a solution, I continued browsing http://www.php.net/manual/en/ and stumbled upon a user comment describing a function to detect UTF-8 strings. That function, written by a guy named Anton Vakulchik, contained a fancy regular expression that does all the heavy lifting.

So I created a script containing the regular expression, started reading every line from my dumpfile and converted the string to UTF-8 where the preg_match returned false. To my surprise, it worked perfectly and I managed to output the data line by line in valid UTF-8

The script

The script below reads from stdin line by line and processes it. It detects UTF-8 data and converts lines to UTF-8 when necessary. This CLI script outputs every line to the stdout.

<?php
function detectUTF8($string)
{
	return preg_match('%(?:
	[\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
	|\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
	|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
	|\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
	|\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
	|[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
	|\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
	)+%xs', $string);
}
function readStdInArray()
{
	$stdin = fopen('php://stdin', 'r');
	$out = array();
	if(is_resource($stdin)){
		while(!feof($stdin)){
			$in = trim(fgets($stdin));
			if(strlen($in)>0){
				$out[]= $in;
			}
		}
	}
	fclose($stdin);
	return $out;
}
$std = readStdInArray();
foreach($std as $line){
	if(detectUTF8($line)){
		echo $line.PHP_EOL;
	} else {
		echo utf8_encode($line).PHP_EOL;
	}
}

Using the script

The combination of stdin and stdout allow the use of pipes in Linux which makes it really easy to convert data between 2 MySQL servers.

There are several ways to use the script. Every method consists of 3 steps:

  1. Dump mixed data from the old MySQL server
  2. Convert mixed data into UTF-8 data
  3. Import UTF-8 data into the new MySQL server

You can choose to store to results in a dumpfile or immediately transfer it from source to destination server with the conversion script in between. Here are a couple of examples:

Direct transfer between 2 database servers

mysqldump -hOldHost -uUsername -pPassword -Q -A --skip-extended-insert | php -q utf8.php | mysql -hNewHost -uUsername -pPassword

Storing the dumps in dumfiles

mysqldump -hOldHost -uUsername -pPassword -Q -A --skip-extended-insert -pPassword > mixeddump.sql
cat mixeddump.sql | php -q utf8.php > utf8dump.sql
cat utf8dump.sql | mysql -hNewHost -uUsername -pPassword

Script restriction

The script is quite powerful, but has one big restriction: it can only determine the charset of a line. So every line must be encoded in a single character set. To enforce this, I had to make sure that every database record was stored on 1 single line. The mysqldump –skip-extended-insert parameter helped me to achieve this. So please do use it!

Of course, I could have performed a UTF-8 check on every character, but that wouldn’t have been so resource friendly when using 100+ MB dumpfiles.

Check out the mysqldump manual for more information on mysqldump command line parameters.

Conclusion

What I’ve learned is that working with character sets can get tricky. I will definitely by more cautious in the future, but a least I have a conversion script that can take care of hairy situations.

Thank you Anton Vakulchik for writing the kick-ass regular expressions

3 Comments

  • The mixed character set /UTF-8 nightmare is something we all go through I guess. I remember well what a hair pulling day it was when I encountered something similar a few years back (and how confusing information about UTF-8 was).
    It’s one of the first things I check when starting a project, making sure everything is set to UTF-8, just to avoid these kind of troubles.

  • Simon Schick wrote:

    Hi, all

    I had the same problem but in my case it had been even worse!
    I had a character-mixup even in one line!

    Here’s the code. Please read the comments to know how I fixed this problem:

    [code]
    // Use this option only if you know what you're doing!
    $enableInlineEncoding = TRUE;
    $lineEncoding = "ISO-8859-1";
    $inlineEncoding = "Windows-1252";

    function detectUTF8($string) {
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )+%xs', $string);
    }

    function readStdInArray() {
    $stdin = fopen('php://stdin', 'r');
    $out = array();
    if (is_resource($stdin)) {
    while (!feof($stdin)) {
    $in = trim(fgets($stdin));
    if (strlen($in) > 0) {
    $out[] = $in;
    }
    }
    }
    fclose($stdin);
    return $out;
    }

    $std = readStdInArray();

    //# Here's an example of my code.
    //$std = array(
    // "ä f\xFCr \x80 asdasd", // Sometimes there's a \x80 (€) saved as windows-1252 character and somethimes characters are valid utf-8 characters ...
    // 'asäbc'
    //);

    foreach ($std as $line) {
    if (detectUTF8($line)) {

    if ($enableInlineEncoding) {
    // replace all valid UTF8-character to ''
    $string_without_utf8 = preg_replace('%(?:
    [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )+%xs', '', $line);

    // get a list of all invalid UTF8-characters by counting all non-ascii characters in the string created above.
    $count_invalid = preg_match_all("/[\x80-\xFF]/", $string_without_utf8, $matches);

    // replace all invalid utf-8 characters by their utf8-encoded characters.
    foreach ($matches[0] as $match) {
    $line = str_replace($match, iconv($inlineEncoding, "UTF-8", $match), $line);
    }
    }

    echo $line . PHP_EOL;
    } else {
    echo iconv($lineEncoding, "UTF-8", $line) . PHP_EOL;
    }
    }
    [/code]

  • Simon Schick wrote:

    Hi, all

    Sorry, for this double-post – but I missed some informations :)

    I replaced utf8_decode() by iconv().
    By changing this I’m able to use the charset Windows-1252. I needed this charset to get the correct encoded utf-8 character for the char €.

    Please look at this post to get a detailed information about this unexpected behaviour:
    http://www.unixresources.net/linux/lf/47/archive/00/00/16/76/167628.html#article722496

Leave a Reply

Your email is never shared.Required fields are marked *