MySql Latin1 to UTF8 Conversion
For those of you who follow me due to my social media strategy, this is a little break from the norm. I am a nerd and I used a lot of resources to develop this script, so in an effort to give back to the community I’m gonna nerd out for a post.
The Problem with ISO-8859-1 aka Latin1
I am not an expert in character encoding, so I’ll break it down in human terms. Latin1 cannot hold as many characters as UTF8. This becomes a problem when you have encoding from applications such as Microsoft Word, or different languages. UTF8 on the other hand can store everything latin1 can and a whole lot more. As an example, and to learn more about how to make your websites UTF-8 compatible, see Geir Berset’s Mother of all UTF-8 Check lists.
Why not just convert to UTF8 from Latin1
Well, that’s the goal, but the problem is there are a ton of variables in this situation if you couldn’t tell from that checklist above, if you get even one messed up or multiple settings messed up you might (or might not) have a problem. IT gets more complex as you can store UTF8 encoded characters into a latin1 database. (Don’t ask me how, but we did – often!)
After scouring the interwebs for things such as mysql convert latin1 to utf8, mysql convert latin1 to utf8 iconv, and bash convert latin1 to utf8 I was left to believe that I was out of luck. Quite a few people suggested that they preformed some of the latin1 to utf-8 conversion steps needed but then spent 60 or more hours cleaning up the mess. This would not have been acceptable to me, and really not feasible in our situation.
The Problem: Our Setup or Why We’re still on Latin1
We’re a small hosting company, we have over 200 sites on our infrastructure; some code we control, but half or more is legacy code which needs to run but we don’t control. We have over 100 databases on our mysql server. We were also running an older version of Apache and PHP. In an effort to bring us out of the stone ages into a more compliant system we decided to move it to a current version of Apache, PHP and mysql. Fortunately I can do this is stages, but because I have to do it in stages there is no option to shut down the site, convert all the settings, the data, the code and then turn the server back on, check verify and clean. I would have to do that 200 times. That’s just not feasible. I’ll focus on the mysql portion of this conversion for now, because I found the help out there to be lacking. The rest of the conversion information is easily found online.
I made a separate environment for testing and I tried and tried the lessons I learned from the inter-searches and could not get the right mix. Mostly because I had UTF-8 characters stored in the Latin1 (ISO 8859 1) encoding.
Mysqldump, Iconv
Many sources suggest taking a mysqldump (not what you are thinking!) and then use the linux command iconv to convert the encoding of the dump to UTF8.
Lots of problems occurred here. What I didn’t know is that mysqldump is a client, and thereby has a character set that it requests from mysql. Mysql in it’s genius-ness can convert data to whatever format the client requests. So the UTF8 characters in the db were assumed to be latin1, and the encoding was lost when iconv tried to convert it.
Did you miss it? I did to. But another hint is that mysql itself can preform data encoding conversions. No matter what the data is in the table/database you can request it in a different format and mysql will do the heavy lifting of converting it. This means we don’t need iconv or any other conversion tool.
Once I learned this and with trial and error, and a bit of luck I found a 5 step method that works every time on our tables. It is a 5 step method, but it feels like a dance, kinda like the hokey pokey. Do this do that do this again and it works.
Five Step Mysql Latin1 Encoding to UTF8 Encoding Script Explained
1. Mysqldump, Force it to be UTF8
Grab the data from mysql and force it to be encoded as UTF8. One might assume this makes it all UTF8 and you simple import it, sorry charlie! We still have 4 steps to go. The Conversion engine gets tricked in this step I think. Since this is copied from a bash script you will see that I use variables, you’ll see the whole script in a minute, but hopefully you can understand this.
mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table --default-character-set=utf8 ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE |
The key is the
--default-character-set=utf8 |
2. Lie to Mysql; Tell Mysql the Data is Latin1
Inside the dump file there are two settings which tell the mysql server upon import that this is utf8 encoded. One says this is UTF8 data coming in and the other says this was pulled from utf8 data. We need to change one of those to say latin1. Reason? The Mysql conversion system saw that there were UTF8 characters in the table and then didn’t convert the data in step 1. This way mysql will convert the characters on import.
sed -i 's/utf8/latin1/g' $TEMPFILELOC$TEMPFILE |
If you don’t know sed, then you don’t know linux. Sed is a line editor which is doing a search and replace inline. Every instance of utf8 will be replaced with latin1 (the other instance of utf8 in the dump is actually utf-8 so it only replaces one.)
3. Import the Mysql Data
This is what hung me up before, once I do the above steps and then re-import the data I assumed that mysql’s conversion engine would put it into the table correctly. But it never worked. Stay tuned for Step 4, the magic.
mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE |
4. Mysqldump do not Specify UTF-8 or Latin1
The magic happens here. Until this point we still have two forms of encoded data, but now on export we have one, and it’ll clean up really nicely. No special magic to this command, just take a dump.
mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE |
5. Import Mysql Data to a UTF-8 Environment
Again, nothing special here, except now that we have clean data in the file we need to push it to the server.
mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE |
viola, oh but then there is the clean up of Altering the table character set and the collation.
ALTER TABLE $DBTABLE CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; |
The Full Latin1 to UTF8 Bash Script
Feel free to use this script. You will need to tweak it to your server settings. Let me know in the comments if you have used this.
#!/bin/bash # Use this to convert a table from latin1 encoding to utf8 enconding # Written by paul@paulkortman.com see more here: http://paulkortman.com/2009/07/24/mysql-latin1-t…tf8-conversion/ # MySQL Hostname DBHOST='localhost' # MySQL Username DBUSER='root' # MySQL Password DBPASSWD='(yeah right, get your own password)' # temp file location TEMPFILELOC='/tmp/' # temp file name TEMPFILE='tempfile.txt' #Verify the Arguments if [ $# -ne 2 ]; then echo "USAGE: mysqlconvertutf8.sh Database Table" exit 1 fi #Grab the Args # MySQL Database DB=$1 # MySQL Database Table DBTABLE=$2 #Alter Command ALTERCMD="ALTER TABLE $DBTABLE CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;" #meat of the script #Grab data force UTF8 mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table --default-character-set=utf8 ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE #change it to think its latin1 sed -i 's/utf8/latin1/g' $TEMPFILELOC$TEMPFILE #import new data mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE #Export, without forcing char set mysqldump -u${DBUSER} -p${DBPASSWD} -h${DBHOST} -a -c -e --add-drop-table ${DB} ${DBTABLE} > $TEMPFILELOC$TEMPFILE #import final data mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE #Data is converted now need to change table definitions. echo $ALTERCMD > $TEMPFILELOC$TEMPFILE mysql -u${DBUSER} -p${DBPASSWD} -h${DBHOST} ${DB} < $TEMPFILELOC$TEMPFILE #Cleanup after yourself, delete the tempfile [ -e $TEMPFILELOC$TEMPFILE ]; rm -f $TEMPFILELOC$TEMPFILE exit 0 |
To use this script follow these steps
- Copy this code,
- use vi (is there anything else??) to create paste into a file called mysqlconvertutf8.sh
- run the command chmod +x mysqlconvertutf8.sh
- run the command mysqlconvertutf8.sh databasename tablename (you have to change the last two to be your database, and your table names!)





[...] the Internet has been based on the latin character set. There has been a shift away from this to UTF-8 encoding. But there is now a larger shift. We’re approaching the 20th birthday of the Web, and for the [...]
U saved my life! :-)
Thank you very much…
(@namtrok)
Marco, I’m happy I was able to help.
Thank you for leaving a comment! My site analytics suggests you are not the only one helped by this post :)
HI, thanks a lot for the script, I have not tried it yet but i have a question about the line right after the “final import comment”, it looks like you do the import twice and you have in comment to do the table definitions change to utf8. I am a bit confused there, does changing the table definition also change the column character-set to utf8? Thanks
(@namtrok)replied:
K I’m not sure I’m able to fully explain it, except that it worked for me. I found a bunch of examples in how it did not work for others, but this was the trick to convince it to import the content in the correct character set.
You saved my week!
Great article. I’ve been searching the web for hours, and tried many things but never managed to convert my db without loosing windows Windows-1252 chars.
Merci beaucoup!
[...] En otros casos, sí puede surgir la necesidad de convertir datos de la BD entre latin1 y utf8; os dejo este enlace bastante esclarecedor en http://paulkortman.com/2009/07/24/mysql-latin1-to-utf8-conversion/. [...]
[...] for a process that I could apply to an online database, and luckily I found some good notes by Paul Kortman and fabio, so I combined some of their ideas and automated the process for my site. I hit a couple [...]
[...] En otros casos, sí puede surgir la necesidad de convertir datos de la BD entre latin1 y utf8; os dejo este enlace bastante esclarecedor en http://paulkortman.com/2009/07/24/mysql-latin1-to-utf8-conversion/. [...]
thx bro!
two of the lines in the above script were apparently joined with both its comment and one line above so they appear to be commented out, AFTER a line instead of in the program flow. As above, the program will not do what you want.
[...] ¡ Funcionando ! Recordad que desde vuestras aplicaciones deberéis indicar que escoja esta codificación o sino, regenerar la base de datos almacenando los datos como utf8 (ver este artículo). [...]