utf-8For those of you who follow me due to my social media strategy, this is a little break from the norm. I am a nerd and I used a lot of resources to develop this script, so in an effort to give back to the community I’m gonna nerd out for a post.

The Problem with ISO-8859-1 aka Latin1

I am not an expert in character encoding, so I’ll break it down in human terms. Latin1 cannot hold as many characters as UTF8. This becomes a problem when you have encoding from applications such as Microsoft Word, or different languages. UTF8 on the other hand can store everything latin1 can and a whole lot more. As an example, and to learn more about how to make your websites UTF-8 compatible, see Geir Berset’s Mother of all UTF-8 Check lists.

Why not just convert to UTF8 from Latin1

Well, that’s the goal, but the problem is there are a ton of variables in this situation if you couldn’t tell from that checklist above, if you get even one messed up or multiple settings messed up you might (or might not) have a problem. IT gets more complex as you can store UTF8 encoded characters into a latin1 database. (Don’t ask me how, but we did – often!)

After scouring the interwebs for things such as mysql convert latin1 to utf8, mysql convert latin1 to utf8 iconv, and bash convert latin1 to utf8 I was left to believe that I was out of luck. Quite a few people suggested that they preformed some of the latin1 to utf-8 conversion steps needed but then spent 60 or more hours cleaning up the mess. This would not have been acceptable to me, and really not feasible in our situation.

The Problem: Our Setup or Why We’re still on Latin1

We’re a small hosting company, which aspire to be as renowned as collectiveray, we have over 200 sites on our infrastructure; some code we control, but half or more is legacy code which needs to run but we don’t control. We have over 100 databases on our mysql server. We were also running an older version of Apache and PHP. In an effort to bring us out of the stone ages into a more compliant system we decided to move it to a current version of Apache, PHP and mysql. Fortunately I can do this is stages, but because I have to do it in stages there is no option to shut down the site, convert all the settings, the data, the code and then turn the server back on, check verify and clean. I would have to do that 200 times. That’s just not feasible. I’ll focus on the mysql portion of this conversion for now, because I found the help out there to be lacking. The rest of the conversion information is easily found online.

I made a separate environment for testing and I tried and tried the lessons I learned from the inter-searches and could not get the right mix. Mostly because I had UTF-8 characters stored in the Latin1 (ISO 8859 1) encoding.

Mysqldump, Iconv

Many sources suggest taking a mysqldump (not what you are thinking!) and then use the linux command iconv to convert the encoding of the dump to UTF8.

Lots of problems occurred here. What I didn’t know is that mysqldump is a client, and thereby has a character set that it requests from mysql. Mysql in it’s genius-ness can convert data to whatever format the client requests. So the UTF8 characters in the db were assumed to be latin1, and the encoding was lost when iconv tried to convert it.

Did you miss it? I did to. But another hint is that mysql itself can preform data encoding conversions. No matter what the data is in the table/database you can request it in a different format and mysql will do the heavy lifting of converting it. This means we don’t need iconv or any other conversion tool.

Once I learned this and with trial and error, and a bit of luck I found a 5 step method that works every time on our tables. It is a 5 step method, but it feels like a dance, kinda like the hokey pokey. Do this do that do this again and it works.

Five Step Mysql Latin1 Encoding to UTF8 Encoding Script Explained

1. Mysqldump, Force it to be UTF8

Grab the data from mysql and force it to be encoded as UTF8. One might assume this makes it all UTF8 and you simple import it, sorry charlie! We still have 4 steps to go. The Conversion engine gets tricked in this step I think. Since this is copied from a bash script you will see that I use variables, you’ll see the whole script in a minute, but hopefully you can understand this.

mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table --default-character-set=utf8 ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE

The key is the

--default-character-set=utf8

2. Lie to Mysql; Tell Mysql the Data is Latin1

Inside the dump file there are two settings which tell the mysql server upon import that this is utf8 encoded. One says this is UTF8 data coming in and the other says this was pulled from utf8 data. We need to change one of those to say latin1. Reason? The Mysql conversion system saw that there were UTF8 characters in the table and then didn’t convert the data in step 1. This way mysql will convert the characters on import.

sed -i 's/utf8/latin1/g' $TEMPFILELOC$TEMPFILE

If you don’t know sed, then you don’t know linux. Sed is a line editor which is doing a search and replace inline. Every instance of utf8 will be replaced with latin1 (the other instance of utf8 in the dump is actually utf-8 so it only replaces one.)

3. Import the Mysql Data

This is what hung me up before, once I do the above steps and then re-import the data I assumed that mysql’s conversion engine would put it into the table correctly. But it never worked. Stay tuned for Step 4, the magic.

mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE

4. Mysqldump do not Specify UTF-8 or Latin1

The magic happens here. Until this point we still have two forms of encoded data, but now on export we have one, and it’ll clean up really nicely. No special magic to this command, just take a dump.

mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE

5. Import Mysql Data to a UTF-8 Environment

Again, nothing special here, except now that we have clean data in the file we need to push it to the server.

mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE

viola, oh but then there is the clean up of Altering the table character set and the collation.

ALTER TABLE $DBTABLE CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

The Full Latin1 to UTF8 Bash Script

Feel free to use this script. You will need to tweak it to your server settings. Let me know in the comments if you have used this.

#!/bin/bash
# Use this to convert a table from latin1 encoding to utf8 enconding
# Written by paul@paulkortman.com see more here: http://paulkortman.com/2009/07/24/mysql-latin1-t…tf8-conversion/
 
# MySQL Hostname
DBHOST='localhost'
 
# MySQL Username
DBUSER='root'
 
# MySQL Password
DBPASSWD='(yeah right, get your own password)'
 
# temp file location
TEMPFILELOC='/tmp/'
# temp file name
TEMPFILE='tempfile.txt'
 
#Verify the Arguments
if [ $# -ne 2 ];
then
echo "USAGE: mysqlconvertutf8.sh Database Table"
exit 1
fi
#Grab the Args
# MySQL Database
DB=$1
# MySQL Database Table
DBTABLE=$2
 
#Alter Command
ALTERCMD="ALTER TABLE $DBTABLE CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;"
 
#meat of the script
#Grab data force UTF8
mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table --default-character-set=utf8 ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE
#change it to think its latin1
sed -i 's/utf8/latin1/g' $TEMPFILELOC$TEMPFILE
#import new data
mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE #Export, without forcing char set mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE
#import final data
mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE #Data is converted now need to change table definitions. echo $ALTERCMD > $TEMPFILELOC$TEMPFILE
mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE
 
#Cleanup after yourself, delete the tempfile
[ -e $TEMPFILELOC$TEMPFILE ]; rm -f $TEMPFILELOC$TEMPFILE
 
exit 0

To use this script follow these steps

  1. Copy this code,
  2. use vi (is there anything else??) to create paste into a file called mysqlconvertutf8.sh
  3. run the command chmod +x mysqlconvertutf8.sh
  4. run the command mysqlconvertutf8.sh databasename tablename  (you have to change the last two to be your database, and your table names!)