Un-%&$-ing MySQL character sets and collations across an entire server

Recently, requests to one of our data-backed web services started timing out. It turned out the problem was that some of our data tables had been (re)created using the wrong character sets and collations. And as everyone should know, and now I do:

Indexes are useless for joins unless the collations match

The carefully optimized queries were running as un-indexed. Ugh. Once I’d spotted the problem, the task was “simple” – find and fix incorrect character set and collation settings through the entire database server. Here’s what worked. Except the RDS updates, all work was done directly from the mysql client command line, and for reference, this service included a couple hundred databases and a couple thousand tables.

First back up

First, I verified that our nightly backups were intact and complete. While we did not have any problems with data loss, garbling, etc, through this process, YMMV. (Note that if you back up using mysqldump with default options, it will save the bogus character sets and collations, so be careful if you have to restore.)

RDS settings

The target databases were running on an AWS RDS instance using MySQL 5.6. Out of the box, the defaults for this version are latin1 and latin1_swedish_ci (yes, really!) so I created and applied an RDS Parameter Group with the following settings:

character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem utf8
character_set_results utf8
character_set_server utf8
collation_connection utf8_general_ci
collation_server utf8_general_ci

Changing these parameters will not modify anything in any of your current databases, but it will set proper defaults for creating new database objects and hopefully keep things from getting messed up again.

Databases

Like the RDS settings, changing database defaults will only affect newly-created data objects. But it’s worth setting proper defaults this to avoid future headaches. I handled this in two steps. First I queried to see which databases needed updating, then I ran the updates. Rinse and repeat until it’s all good. Here’s the SQL I used to find errant databases:

-- List all databases with incorrect character sets or collations
SELECT
  SCHEMA_NAME as `database`,
  DEFAULT_CHARACTER_SET_NAME as `character_set`,
  DEFAULT_COLLATION_NAME as `collation`
FROM information_schema.SCHEMATA
WHERE DEFAULT_CHARACTER_SET_NAME <> "utf8"
  OR DEFAULT_COLLATION_NAME <> "utf8_general_ci";

I got a ton of hits. Instead of handling each one by hand using ALTER DATABASE `<database name>` CHARACTER SET utf8 COLLATE utf8_general_ci; , I wrote a little SQL to create all the commands:

-- ALTER DATABASE `databasename` CHARACTER SET utf8 COLLATE utf8_general_ci;
SELECT
  CONCAT("ALTER DATABASE \`", SCHEMA_NAME, "\` CHARACTER SET utf8 COLLATE utf8_general_ci;")
FROM information_schema.SCHEMATA
WHERE DEFAULT_CHARACTER_SET_NAME <> "utf8"
  OR DEFAULT_COLLATION_NAME <> "utf8_general_ci";

I took the output from this query, cleaned it up with a text editor, and ran it from the command line. I ran the database test query again ,and got zero records. Success

Tables

Next, I needed to update the tables. Finally a step that should affect real data, not just future additions. Here’s how I found tables that needed the fix:

-- Find tables not matching preferred character set and collation
SELECT
  tables.TABLE_SCHEMA AS `database`,
  tables.TABLE_NAME AS `table`,
  collations.CHARACTER_SET_NAME AS `character_set`,
  collations.COLLATION_NAME as `collation`
FROM information_schema.`TABLES` AS tables
JOIN information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` as collations ON collations.collation_name = T.table_collation
WHERE T.TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')
  AND (collations.CHARACTER_SET_NAME <> "utf8" OR collations.COLLATION_NAME <> "utf8_general_ci");

I omitted the mysql, sys, and performance_schema databases because the account I was using lacked permissions anyway. It did not seem to matter in any way. Again I got a boatload of results, so I wrote some SQL to create the update SQL for me for each target table.

-- ALTER TABLE `databasename`.`tablename` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
SELECT
  CONCAT("ALTER TABLE \`", tables.TABLE_SCHEMA, "\`.\`", tables.TABLE_NAME, "\` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;", "ALTER TABLE \`", T.TABLE_SCHEMA, "\`.\`", tables.TABLE_NAME, "\` COLLATE utf8_general_ci;")
FROM information_schema.`TABLES` AS tables
JOIN information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` AS collations ON collations.collation_name = tables.table_collation
WHERE T.TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')
  AND (collations.CHARACTER_SET_NAME <> "utf8" OR collations.COLLATION_NAME <> "utf8_general_ci");

The first SQL “update” query I tried was ALTER TABLE `databasename`.`tablename` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; but this didn’t change tables unless they included string/text fields. I was able to convert the rest of the tables using simply ALTER TABLE `databasename`.`tablename` COLLATE utf8_general_ci; The code example here tries the first version, then the shortened one to cover all the tables. I ran the resulting sql, then ran the table check query again and another success.

Columns

It was not clear whether I needed to explicitly convert table columns after doing the table conversions. So I checked:

-- Find columns that still have bogus character sets and collations
SELECT
  table_schema AS `database`,
  table_name AS `table`,
  column_name AS `column`,
  character_set_name AS `character_set`,
  collation_name AS `collation`
FROM information_schema.`COLUMNS`
WHERE character_set_name <> 'utf8'
  OR collation_name <> 'utf8_general_ci';

And… I got zero rows. So at least for our environment, altering the tables with CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci was sufficient to update the table columns as well. Docs were sparse about the effects of these changes, so I recommend checking to make sure anyway. At this point all the defaults for our entire database service, the defaults for all databases and tables, and the actual parameters for all data objects match our preferred character set and collation. As hoped, our application immediately perked up and the timeouts stopped.