Ansible, python and mysql… untangling the mess

Nearly every “how to” article or tutorial on the web describes one way of using Ansible, python, and connecting to MySQL as if that were the only solution.  Many don’t note the code versions used, or even the pub date, and the Internet is rife with simply bad advice.  I finally gave up researching all this and ran a few quick tests using VirtualBox/Vagrant to see what is really necessary to do a few things we need.  Our situation is:

  • Ubuntu 18.04, building instances using Ansible 2.7
  • We must manage remote mysql users
  • A target machine must be able to
    • run mysql commands from the command line (via a python script)
    • run mysql queries from inside a python script

Here’s what I found:

 

Python 2 vs python 3: just stop using python(2)

Unless you’re invested in a bunch of skanky old python(2) code, there is no reason to NOT use python3.  Python3 is the default version, and comes pre-installed on Ubuntu 18.04 (at least all of the images we are using) and it is accessible as “python3”.  Install it like this on your local machine too.  (Python 2 and python 3 are not compatible, so they are reasonably installed using different executable names. Don’t fight this.  The Homebrew folks added some confusion at first by installing python 3 as “python” in some cases, but that’s fixed now.)

To use python3 with Ansible, you must set the variable  ansible_python_interpreter: "python3" , and then Ansible will just use python3 and you won’t need to mess around installing python(2) at all.  For anything.

What’s needed to use the Ansible mysql_user module: the PyMySQL pip package

Ansible runs on python (specifically, python3, if you are following along.)  If you try and run a mysql_user task without installing the necessary pip packages, you’ll get a surprisingly helpful error message: “The PyMySQL (Python 2.7 and Python 3.X) or MySQL-python (Python 2.X) module is required.”  So to make this work in Ansible, just install the PyMySQL pip package:

What’s needed to run mysql from the command line: the mysql-client linux package

Finally with Ubuntu 18.04 and current versions of mysql packages, we have success by installing a single linux package.  This package makes the “mysql” command available from the command line for all users (and from python scripts.)

What’s needed to access mysql databases from within a python script: just use the PyMySQL pip package

Our batch jobs do most of the mysql work using the mysql command line (using python’s subprocess module with the shell=True option,) but there are times when we need to read some data in the script to determine what tasks to perform, etc.  The two most common pip packages are PyMySQL and mysql-connector-python.  After digging into the code examples, it turns out that there are only a few differences in the interface, and my conclusion is that for most purposes there is no real difference.

I recommend using PyMySQL, because it’s already required for the Ansible mysql_user module, and you can install it on the target host using the same Ansible tasks shown above.
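
As a rough sketch of both approaches (not our production code): the helper names, the database name, and the idea that credentials live in ~/.my.cnf are all assumptions made for illustration.  The command-line helper passes an argument list instead of using shell=True, which sidesteps shell quoting, and it sticks to subprocess options that exist on the python 3.6 that ships with Ubuntu 18.04.

```python
import subprocess


def build_mysql_command(database, sql):
    """Return the argument list for running one statement via the mysql client."""
    # Assumption: credentials come from ~/.my.cnf, so none appear here.
    return ["mysql", "--batch", "--skip-column-names", "-e", sql, database]


def run_mysql(database, sql):
    """Run a statement through the mysql command line and return its stdout."""
    result = subprocess.run(
        build_mysql_command(database, sql),
        check=True,                   # raise CalledProcessError on failure
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,      # decode output to str (python 3.6 friendly)
    )
    return result.stdout


def fetch_rows(sql, params=None, **connect_kwargs):
    """Run a query with PyMySQL and return all rows as dictionaries."""
    import pymysql.cursors  # imported lazily so the helpers above work without it

    connection = pymysql.connect(
        cursorclass=pymysql.cursors.DictCursor, **connect_kwargs
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute(sql, params)
            return cursor.fetchall()
    finally:
        connection.close()
```

A script might call run_mysql("appdb", "SELECT COUNT(*) FROM jobs") for the bulk work and fetch_rows("SELECT id FROM jobs WHERE done = %s", (0,), host="localhost", user="batch", password="...", database="appdb") when it needs the data in python; every name in those calls is hypothetical.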

Retrofitting tests

In this time of TDD, the obvious question is, “why?”
(Or possibly “what the hell is wrong with you?” but I think I can explain.)

Why are there no tests to begin with? It is sort of obvious if you have been at this for a while:  legacy apps. All the Cool Kids have the privilege of working on the Latest Thing all the time, but reality is that a lot of the world runs on legacy code, from the dark times before modern testing tools and dev methodologies were common.  Why do this now?  Again, it’s about legacy code.  It’s live, mission-critical, and we need to upgrade underlying dependencies.  Since the app provides services to live client websites, it’s unwise to either keep running on old iron, or to roll out anything new without testing.

Writing tests after the fact is definitely backward, and there are some pitfalls, like tests that (erroneously) will never fail, so this takes a bit more care.  Since the legacy code is eventually going away, there’s no sense in writing unit tests for something that already works, so the focus is on feature/acceptance tests.

Tests will run both virtually (locally using simulations of client resources,) and live, where we really hit our clients’ web sites to verify the app’s working right.

This is not the best scenario but there are few options since it’s impossible to replicate all our clients’ environments in the lab, and the very slight traffic won’t matter to anyone.  When the test suite’s complete, the plan is:

  1. Run the tests in the virtual environment to uncover and fix any issues
  2. Roll out the updates, then immediately run the tests against our clients’ live sites to make

Automating the live tests means we can get full coverage, and can quickly revert in case of problems.  This will still be a php app (big sigh,) but the platform and code base will be up to date and we’ll be prepared for the time when we need to replace it.

Why you shouldn’t care whether Ansible runs are re-entrant

I recently wrote about a problem I had as a result of imagining that Ansible runs were re-entrant.  (Spoiler: they are generally not.)  After kicking this around a little I realized that you should not care whether Ansible runs are re-entrant.  I like cherry pie so I will explain myself with a pie analogy.

If you are baking a pie for dinner tonight and something goes wrong, you would probably try to salvage it.  If you realize you forgot an essential ingredient you might try to pull the top crust and add it in.  If you  screwed up the oven temp, you can adjust the baking time, temp or both.

But if you screw up while baking pies for the County Fair, it’s very different.  You’re going for the blue ribbon, so no compromises.  To bake the best pie you can,  you would Just Start Over (and keep doing so until you Do Not Screw Up.)

If you’re bothering to automate server setup to begin with, you have already done all the work to make the best pie server you can, so there’s no reason to settle for anything less.  When anything at all goes wrong during a build, why not just start over and get a clean build?  It seems so obvious now.

A note about Chef:

Chef’s a different animal.  The m.o. for a full-on Chef implementation is to continuously run the recipes against your machines from a Chef server, as much for configuration/re-configuration as for initial deployment. We haven’t used Chef for a couple years, but I recall it seemed more tolerant of interrupted runs.  Still, you can’t do better than a clean run, especially on an initial build.

Re-entrant vs idempotent in Ansible roles

I wasted a couple hours tracking down a problem with a raft of new AWS ec2 instances generated using Ansible, and it’s worth explaining because it showcases a problem common in a lot of Ansible roles.  While Ansible docs talk up the concept of “idempotency” (the ability to run a playbook multiple times without screwing up your hosts,) not much is said about being “re-entrant.”  I’ll explain.

Here’s a typical task from an Ansible role:

The notify: items indicate handlers like these, to be executed at a later time:

There are two reasons you’d notify handlers like this instead of just running them inline as tasks:

  • You might encounter a bunch of tasks that all notify a handler to run later.  Handler execution is deferred as long as possible, so it only has to run once even for multiple “notify:” events.
  • The handler only gets called when the notifying task is “changed.”  What that means depends on the task, but in general it means something was actually, well, changed.  If you run this role again, most/many/all of the tasks won’t need to do anything, they won’t be “changed” and won’t have to trigger the handlers, either.

That’s all fine and dandy… until a playbook is interrupted.  All the talk of “idempotency” implies you can just re-run an interrupted Ansible run and everything will be fine, but not so!   Using the examples above:

  1. The Ansible playbook starts up and runs the “Create remote_syslog init.d service” task to create the init.d service.  This notifies to enable and restart the service later in the Ansible run.
  2. The run is interrupted, leaving the host setup incomplete
  3. So you run the playbook again to finish the job.  This time, the “Create remote_syslog init.d service” task doesn’t need to run because the service was already created.  However because the status is not “changed,” the handlers are also not notified.
  4. The Ansible run completes, everything looks fine, but the service has not been restarted or enabled. In this case “enabled” means it will restart on reboot, so now the service is not running, and won’t start even after a reboot.
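
The four steps above can be modelled with a toy simulation.  This is not Ansible code, just a sketch of the bookkeeping: the host is a plain dict, and the function and key names are made up for illustration.

```python
def run_playbook(host, interrupt_before_handlers=False):
    """Toy model of one playbook run against a host (a plain dict)."""
    notified = []  # the handler queue exists only for the duration of this run

    # Task: create the init.d service. It reports "changed" only the first time.
    if not host.get("service_file"):
        host["service_file"] = True
        notified.append("enable_and_restart")

    if interrupt_before_handlers:
        return host  # the run dies here and the notified list is thrown away

    # Handlers fire at the end, and only if they were notified during this run.
    if "enable_and_restart" in notified:
        host["enabled"] = True
        host["running"] = True
    return host


host = {}
run_playbook(host, interrupt_before_handlers=True)  # steps 1 and 2
run_playbook(host)                                  # steps 3 and 4
print(host)  # {'service_file': True}
```

The second run finishes without complaint, yet the host never gains the enabled or running keys, because the task had nothing left to change and so nothing was notified.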

The problem is that “idempotent” is not “re-entrant.”  Re-entrant means the run can be interrupted and re-run and everything will be ok.  Here’s a great post from Pivotal, describing the difference.  I expected the Ansible role to be re-entrant.  It cost me a little time, but could have easily resulted in security holes or worse.

The lesson is this:

Any Ansible playbook or role that uses notify to trigger handlers (which is like, nearly all of them) cannot be fully re-entrant.  Notifications occur only when the notifying task has a changed state, so if a run is interrupted after the task but before the handlers run, the handler queue is lost.  Re-running the task typically won’t result in a changed state, so the handlers will never run.

That’s fine, so long as you are expecting this behavior.  I have a few words to say about that here!

Un-%&$-ing MySQL character sets and collations across an entire server

Recently, requests to one of our data-backed web services started timing out.  It turned out the problem was that some of our data tables had been (re)created using the wrong character sets and collations.  And as everyone should know, and now I do:

Indexes are useless for joins unless the collations match

The carefully optimized queries were running as un-indexed.  Ugh.  Once I’d spotted the problem, the task was “simple” –  find and fix incorrect character set and collation settings through the entire database server.  Here’s what worked.  Except the RDS updates, all work was done directly from the mysql client command line, and for reference, this service included a couple hundred databases and a couple thousand tables.

First back up

First, I verified that our nightly backups were intact and complete.  While we did not have any problems with data loss, garbling, etc, through this process, YMMV.  (Note that if you back up using mysqldump with default options, it will save the bogus character sets and collations, so be careful if you have to restore.)

RDS settings

The target databases were running on an AWS RDS instance using MySQL 5.6.  Out of the box, the defaults for this version are latin1 and latin1_swedish_ci (yes, really!) so I created and applied an RDS  Parameter Group with the following settings:

  • character_set_client utf8
  • character_set_connection utf8
  • character_set_database utf8
  • character_set_filesystem utf8
  • character_set_results utf8
  • character_set_server utf8
  • collation_connection utf8_general_ci
  • collation_server utf8_general_ci

Changing these parameters will not modify anything in any of your current databases, but it will set proper defaults for creating new database objects and hopefully keep things from getting messed up again.
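
One way to confirm the parameter group actually took effect is to compare the server’s global variables against the list above.  A minimal sketch, assuming the rows come from something like SHOW GLOBAL VARIABLES fetched as (name, value) pairs; the function name is hypothetical.

```python
EXPECTED_SETTINGS = {
    "character_set_client": "utf8",
    "character_set_connection": "utf8",
    "character_set_database": "utf8",
    "character_set_filesystem": "utf8",
    "character_set_results": "utf8",
    "character_set_server": "utf8",
    "collation_connection": "utf8_general_ci",
    "collation_server": "utf8_general_ci",
}


def mismatched_settings(rows, expected=EXPECTED_SETTINGS):
    """Return {name: (actual, wanted)} for every setting that differs.

    rows is an iterable of (variable_name, value) pairs, for example the
    result of SHOW GLOBAL VARIABLES fetched with a plain tuple cursor.
    """
    actual = dict(rows)
    return {
        name: (actual.get(name), wanted)
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }
```

An empty result means every listed parameter matches; anything else shows the actual and wanted values side by side.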

Databases

Like the RDS settings, changing database defaults will only affect newly-created data objects.  But it’s worth setting proper defaults to avoid future headaches.  I handled this in two steps.  First I queried to see which databases needed updating, then I ran the updates.  Rinse and repeat until it’s all good.  Here’s the SQL I used to find errant databases:
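
The original query is not reproduced here.  As a stand-in, a check along these lines can be built against information_schema.SCHEMATA; this is a sketch with a hypothetical function name, and skipping the system schemas is an assumption of the sketch.  The %s placeholders are the style PyMySQL’s cursor.execute() expects.

```python
SYSTEM_SCHEMAS = ("information_schema", "mysql", "performance_schema", "sys")


def errant_databases_query(charset="utf8", collation="utf8_general_ci"):
    """Return (sql, params) listing databases whose defaults differ from the target."""
    placeholders = ", ".join(["%s"] * len(SYSTEM_SCHEMAS))
    sql = (
        "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
        "FROM information_schema.SCHEMATA "
        "WHERE (DEFAULT_CHARACTER_SET_NAME <> %s OR DEFAULT_COLLATION_NAME <> %s) "
        f"AND SCHEMA_NAME NOT IN ({placeholders})"  # assumption: skip system schemas
    )
    return sql, (charset, collation) + SYSTEM_SCHEMAS
```

With an open PyMySQL cursor, this would be run as cursor.execute(*errant_databases_query()).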

I got a ton of hits.  Instead of handling each one by hand using  ALTER DATABASE `<database name>` CHARACTER SET utf8 COLLATE utf8_general_ci; , I wrote a little SQL to create all the commands:
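
The generating SQL itself is not shown here.  The same list of commands can also be produced in python from the database names returned by the check; a minimal sketch, with hypothetical helper names:

```python
def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def alter_database_statements(database_names, charset="utf8",
                              collation="utf8_general_ci"):
    """Build one ALTER DATABASE statement per database name."""
    return [
        f"ALTER DATABASE {quote_identifier(name)} "
        f"CHARACTER SET {charset} COLLATE {collation};"
        for name in database_names
    ]
```

Printing the statements one per line gives a script that can be reviewed and then piped into the mysql client.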

I took the output from this query, cleaned it up with a text editor, and ran it from the command line.  I ran the database test query again, and got zero records.  Success.

Tables

Next, I needed to update the tables.  Finally a step that should affect real data, not just future additions.  Here’s how I found tables that needed the fix:
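
The query is not reproduced here.  A comparable check can be sketched against information_schema.TABLES, which records a collation per table; the function name is hypothetical, and restricting the check to base tables (not views) is an assumption of the sketch.

```python
EXCLUDED_SCHEMAS = ("information_schema", "mysql", "performance_schema", "sys")


def errant_tables_query(collation="utf8_general_ci"):
    """Return (sql, params) listing base tables whose collation is not the target."""
    placeholders = ", ".join(["%s"] * len(EXCLUDED_SCHEMAS))
    sql = (
        "SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION "
        "FROM information_schema.TABLES "
        "WHERE TABLE_TYPE = 'BASE TABLE' "      # assumption: views are skipped
        "AND TABLE_COLLATION <> %s "
        f"AND TABLE_SCHEMA NOT IN ({placeholders}) "
        "ORDER BY TABLE_SCHEMA, TABLE_NAME"
    )
    return sql, (collation,) + EXCLUDED_SCHEMAS
```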

I omitted the mysql, sys, and performance_schema databases because the account I was using lacked permissions anyway.  It did not seem to matter in any way.  Again I got a boatload of results, so I wrote some SQL to create the update SQL for me for each target table.

The first SQL “update” query I tried was  ALTER TABLE `databasename`.`tablename` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;  but this didn’t change tables unless they included string/text fields.  I was able to convert the rest of the tables using simply  ALTER TABLE `databasename`.`tablename` COLLATE utf8_general_ci;  The code example here tries the first version, then the shortened one to cover all the tables.  I ran the resulting sql, then ran the table check query again and another success.
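
The generating SQL is not shown here.  As an illustration of the two-statement approach described above, here is a python sketch that emits both forms for each (database, table) pair; the helper names are hypothetical.

```python
def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def alter_table_statements(tables, charset="utf8", collation="utf8_general_ci"):
    """Build the CONVERT TO form, then the plain COLLATE form, for each table.

    tables is an iterable of (database_name, table_name) pairs.
    """
    statements = []
    for database, table in tables:
        target = f"{quote_identifier(database)}.{quote_identifier(table)}"
        # Converts the table default and its string/text columns.
        statements.append(
            f"ALTER TABLE {target} CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
        # Covers tables with no string/text columns.
        statements.append(f"ALTER TABLE {target} COLLATE {collation};")
    return statements
```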

Columns

It was not clear whether I needed to explicitly convert table columns after doing the table conversions.   So I checked:
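
The check itself is not reproduced here.  A similar one can be sketched against information_schema.COLUMNS; the function name is hypothetical.  Non-text columns have NULL in the character set and collation fields, so the <> comparisons skip them without any extra filtering.

```python
EXCLUDED_SCHEMAS = ("information_schema", "mysql", "performance_schema", "sys")


def errant_columns_query(charset="utf8", collation="utf8_general_ci"):
    """Return (sql, params) listing columns whose charset or collation differs."""
    placeholders = ", ".join(["%s"] * len(EXCLUDED_SCHEMAS))
    sql = (
        "SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, "
        "CHARACTER_SET_NAME, COLLATION_NAME "
        "FROM information_schema.COLUMNS "
        "WHERE (CHARACTER_SET_NAME <> %s OR COLLATION_NAME <> %s) "
        f"AND TABLE_SCHEMA NOT IN ({placeholders})"
    )
    return sql, (charset, collation) + EXCLUDED_SCHEMAS
```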

And… I got zero rows.  So at least for our environment, altering the tables with  CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci was sufficient to update the table columns as well.  Docs were sparse about the effects of these changes, so I recommend checking to make sure anyway.  At this point all the defaults for our entire database service, the defaults for all databases and tables, and the actual parameters for all data objects match our preferred character set and collation.  As hoped, our application immediately perked up and the timeouts stopped.