<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Learning from your mistakes: mixed character sets in MySQL</title>
	<atom:link href="http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/</link>
	<description>Thijs Feryn's blog</description>
	<lastBuildDate>Sat, 20 Aug 2011 02:21:52 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
	<item>
		<title>By: Simon Schick</title>
		<link>http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/comment-page-1/#comment-724</link>
		<dc:creator>Simon Schick</dc:creator>
		<pubDate>Thu, 12 May 2011 10:04:42 +0000</pubDate>
		<guid isPermaLink="false">http://blog.feryn.eu/?p=659#comment-724</guid>
		<description>Hi, all

Sorry for the double post, but I left out some information :)

I replaced utf8_decode() with iconv().
This lets me use the Windows-1252 charset, which I needed to get the correctly encoded UTF-8 character for €.

Please see this post for a detailed explanation of this unexpected behaviour:
http://www.unixresources.net/linux/lf/47/archive/00/00/16/76/167628.html#article722496</description>
		<content:encoded><![CDATA[<p>Hi, all</p>
<p>Sorry for the double post, but I left out some information <img src='http://blog.feryn.eu/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>I replaced utf8_decode() with iconv().<br />
This lets me use the Windows-1252 charset, which I needed to get the correctly encoded UTF-8 character for €.</p>
<p>Please see this post for a detailed explanation of this unexpected behaviour:<br />
<a href="http://www.unixresources.net/linux/lf/47/archive/00/00/16/76/167628.html#article722496">http://www.unixresources.net/linux/lf/47/archive/00/00/16/76/167628.html#article722496</a></p>
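<p>A minimal sketch of the difference, assuming a PHP build with the iconv extension. The sample byte and the variable names are hypothetical and only serve as illustration. The utf8_* functions only know ISO-8859-1, where the byte 0x80 is a control character, whereas iconv() can be told that the input is Windows-1252, where 0x80 is €.</p>
<pre>
```php
// Hypothetical sample byte: 0x80 is the euro sign in Windows-1252,
// but a C1 control character in ISO-8859-1.
$byte = "\x80";

// Same byte, two different source charsets.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);
$asCp1252 = iconv('Windows-1252', 'UTF-8', $byte);

echo bin2hex($asLatin1), PHP_EOL; // c280   (U+0080, a control character)
echo bin2hex($asCp1252), PHP_EOL; // e282ac (U+20AC, the euro sign)
```
</pre>
<p>Only the second conversion yields the UTF-8 bytes of the euro sign, which is presumably why the source charset has to be named explicitly.</p>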
]]></content:encoded>
	</item>
	<item>
		<title>By: Simon Schick</title>
		<link>http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/comment-page-1/#comment-723</link>
		<dc:creator>Simon Schick</dc:creator>
		<pubDate>Thu, 12 May 2011 09:34:39 +0000</pubDate>
		<guid isPermaLink="false">http://blog.feryn.eu/?p=659#comment-723</guid>
		<description>Hi, all

I had the same problem, but in my case it was even worse:
encodings were mixed even within a single line!

Here&#039;s the code. Please read the comments to see how I fixed the problem:

[code]
// Use this option only if you know what you&#039;re doing!
$enableInlineEncoding = TRUE;
$lineEncoding = &quot;ISO-8859-1&quot;;
$inlineEncoding = &quot;Windows-1252&quot;;

function detectUTF8($string) {
	return preg_match(&#039;%(?:
	[\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
	&#124;\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
	&#124;[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
	&#124;\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
	&#124;\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
	&#124;[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
	&#124;\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
	)+%xs&#039;, $string);
}

function readStdInArray() {
	$stdin = fopen(&#039;php://stdin&#039;, &#039;r&#039;);
	$out = array();
	if (is_resource($stdin)) {
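		// fgets() returns FALSE at end of input, so loop on its return value.
		while (($in = fgets($stdin)) !== false) {
			$in = trim($in);
			if (strlen($in) &gt; 0) {
				$out[] = $in;
			}
		}
		// Close the handle only if it was actually opened.
		fclose($stdin);
	}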
	return $out;
}

$std = readStdInArray();

//# Example input:
//$std = array(
//	&quot;ä f\xFCr \x80 asdasd&quot;, // Sometimes a \x80 (€) is stored as a Windows-1252 byte, and sometimes the characters are valid UTF-8 ...
//	&#039;asäbc&#039;
//);

foreach ($std as $line) {
	if (detectUTF8($line)) {

		if ($enableInlineEncoding) {
			// Walk the line once: keep valid UTF-8 sequences untouched and convert every
			// remaining high byte from $inlineEncoding. A byte-wise str_replace() over the
			// whole line would also hit continuation bytes inside valid sequences.
			$line = preg_replace_callback(&#039;%(?:
			[\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
			&#124;\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
			&#124;[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
			&#124;\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
			&#124;\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
			&#124;[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
			&#124;\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
			)+&#124;([\x80-\xFF])%xs&#039;, function ($m) use ($inlineEncoding) {
				// $m[1] is set only when a stray non-UTF-8 byte was matched.
				return isset($m[1]) ? iconv($inlineEncoding, &quot;UTF-8&quot;, $m[1]) : $m[0];
			}, $line);
		}

		echo $line . PHP_EOL;
	} else {
		echo iconv($lineEncoding, &quot;UTF-8&quot;, $line) . PHP_EOL;
	}
}
[/code]</description>
		<content:encoded><![CDATA[<p>Hi, all</p>
<p>I had the same problem, but in my case it was even worse:<br />
encodings were mixed even within a single line!</p>
<p>Here&#8217;s the code. Please read the comments to see how I fixed the problem:</p>
<p>[code]<br />
// Use this option only if you know what you're doing!<br />
$enableInlineEncoding = TRUE;<br />
$lineEncoding = "ISO-8859-1";<br />
$inlineEncoding = "Windows-1252";</p>
<p>function detectUTF8($string) {<br />
	return preg_match('%(?:<br />
	[\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte<br />
	|\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs<br />
	|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte<br />
	|\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates<br />
	|\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3<br />
	|[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15<br />
	|\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16<br />
	)+%xs', $string);<br />
}</p>
<p>function readStdInArray() {<br />
	$stdin = fopen('php://stdin', 'r');<br />
	$out = array();<br />
	if (is_resource($stdin)) {<br />
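		// fgets() returns FALSE at end of input, so loop on its return value.<br />
		while (($in = fgets($stdin)) !== false) {<br />
			$in = trim($in);<br />
			if (strlen($in) &gt; 0) {<br />
				$out[] = $in;<br />
			}<br />
		}<br />
		// Close the handle only if it was actually opened.<br />
		fclose($stdin);<br />
	}<br />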
	return $out;<br />
}</p>
<p>$std = readStdInArray();</p>
<p>//# Example input:<br />
//$std = array(<br />
//	"ä f\xFCr \x80 asdasd", // Sometimes a \x80 (€) is stored as a Windows-1252 byte, and sometimes the characters are valid UTF-8 ...<br />
//	'asäbc'<br />
//);</p>
<p>foreach ($std as $line) {<br />
	if (detectUTF8($line)) {</p>
<p>		if ($enableInlineEncoding) {<br />
			// Walk the line once: keep valid UTF-8 sequences untouched and convert every<br />
			// remaining high byte from $inlineEncoding. A byte-wise str_replace() over the<br />
			// whole line would also hit continuation bytes inside valid sequences.<br />
			$line = preg_replace_callback('%(?:<br />
			[\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte<br />
			|\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs<br />
			|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte<br />
			|\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates<br />
			|\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3<br />
			|[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15<br />
			|\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16<br />
			)+|([\x80-\xFF])%xs', function ($m) use ($inlineEncoding) {<br />
				// $m[1] is set only when a stray non-UTF-8 byte was matched.<br />
				return isset($m[1]) ? iconv($inlineEncoding, "UTF-8", $m[1]) : $m[0];<br />
			}, $line);<br />
		}</p>
<p>		echo $line . PHP_EOL;<br />
	} else {<br />
		echo iconv($lineEncoding, "UTF-8", $line) . PHP_EOL;<br />
	}<br />
}<br />
[/code]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joris</title>
		<link>http://blog.feryn.eu/2009/12/learning-from-your-mistakes-mixed-character-sets-in-mysql/comment-page-1/#comment-350</link>
		<dc:creator>Joris</dc:creator>
		<pubDate>Sun, 06 Dec 2009 22:30:48 +0000</pubDate>
		<guid isPermaLink="false">http://blog.feryn.eu/?p=659#comment-350</guid>
		<description>The mixed character set/UTF-8 nightmare is something we all go through, I guess. I remember well what a hair-pulling day it was when I encountered something similar a few years back (and how confusing the information about UTF-8 was).
It&#039;s now one of the first things I check when starting a project: making sure everything is set to UTF-8, just to avoid this kind of trouble.</description>
		<content:encoded><![CDATA[<p>The mixed character set/UTF-8 nightmare is something we all go through, I guess. I remember well what a hair-pulling day it was when I encountered something similar a few years back (and how confusing the information about UTF-8 was).<br />
It&#8217;s now one of the first things I check when starting a project: making sure everything is set to UTF-8, just to avoid this kind of trouble.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

