Retrofitting tests

In this time of TDD, the obvious question is, “why?”
(Or possibly “what the hell is wrong with you?” but I think I can explain.)

Why are there no tests to begin with? It is sort of obvious if you have been at this for a while: legacy apps. All the Cool Kids have the privilege of working on the Latest Thing all the time, but the reality is that a lot of the world runs on legacy code, from the dark times before modern testing tools and dev methodologies were common. Why do this now? Again, it's about legacy code. It's live, mission-critical, and we need to upgrade underlying dependencies. Since the app provides services to live client websites, it's unwise either to keep running on old iron or to roll out anything new without testing.

Writing tests after the fact is definitely backward, and there are some pitfalls, like tests that (erroneously) will never fail, so this takes a bit more care.  Since the legacy code is eventually going away, there’s no sense in writing unit tests for something that already works, so the focus is on feature/acceptance tests.

Tests will run both virtually (locally, using simulations of client resources) and live, where we really hit our clients' web sites to verify the app is working right.

This is not the best scenario, but there are few options: it's impossible to replicate all our clients' environments in the lab, and the very slight extra traffic won't matter to anyone. When the test suite's complete, the plan is:

  1. Run the tests in the virtual environment to uncover and fix any issues
  2. Roll out the updates, then immediately run the tests against our clients' live sites to make sure everything still works

Automating the live tests means we can get full coverage and can quickly revert in case of problems. This will still be a PHP app (big sigh), but the platform and code base will be up to date and we'll be prepared for the time when we need to replace it.

Why you shouldn’t care whether Ansible runs are re-entrant

I recently wrote about a problem I had as a result of imagining that Ansible runs were re-entrant.  (Spoiler: they are generally not.)  After kicking this around a little I realized that you should not care whether Ansible runs are re-entrant.  I like cherry pie so I will explain myself with a pie analogy.

If you are baking a pie for dinner tonight and something goes wrong, you would probably try to salvage it. If you realize you forgot an essential ingredient, you might try to pull the top crust and add it in. If you screwed up the oven temp, you can adjust the baking time, the temp, or both.

But if you screw up while baking pies for the County Fair, it’s very different.  You’re going for the blue ribbon, so no compromises.  To bake the best pie you can,  you would Just Start Over (and keep doing so until you Do Not Screw Up.)

If you're bothering to automate server setup to begin with, you have already done all the work to make the best pie (er, server) you can, so there's no reason to settle for anything less. When anything at all goes wrong during a build, why not just start over and get a clean build? It seems so obvious now.

A note about Chef:

Chef's a different animal. The m.o. for a full-on Chef implementation is to continuously run the recipes against your machines from a Chef server, as much for configuration/re-configuration as for initial deployment. We haven't used Chef for a couple of years, but I recall it seemed more tolerant of interrupted runs. Still, you can't do better than a clean run, especially on an initial build.

Re-entrant vs idempotent in Ansible roles

I wasted a couple of hours tracking down a problem with a raft of new AWS EC2 instances generated using Ansible, and it's worth explaining because it showcases a problem common to a lot of Ansible roles. While the Ansible docs talk up the concept of "idempotency" (the ability to run a playbook multiple times without screwing up your hosts), not much is said about being "re-entrant." I'll explain.

Here’s a typical task from an Ansible role:
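
A sketch rather than the exact original: the task name and notify: entries match the walkthrough below, while the template details are illustrative.

    - name: Create remote_syslog init.d service
      template:
        src: remote_syslog.init.j2
        dest: /etc/init.d/remote_syslog
        mode: 0755
      notify:
        - Enable remote_syslog
        - Restart remote_syslog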

The notify: items tell Ansible to run handlers like these at a later point in the run:
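
Again just a sketch; the handler names simply have to line up with the notify: entries above.

    - name: Enable remote_syslog
      service:
        name: remote_syslog
        enabled: yes

    - name: Restart remote_syslog
      service:
        name: remote_syslog
        state: restarted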

There are two reasons you'd notify handlers like this instead of just running them inline as tasks:

  • You might have a bunch of tasks that all notify the same handler.  Handler execution is deferred as long as possible, so the handler only has to run once even for multiple notify: events.
  • The handler only gets called when the notifying task is "changed."  What that means depends on the task, but in general it means something was actually, well, changed.  If you run this role again, most/many/all of the tasks won't need to do anything, so they won't be "changed" and won't trigger the handlers, either.

That's all fine and dandy… until a playbook is interrupted.  All the talk of "idempotency" implies you can just re-run an interrupted Ansible run and everything will be fine, but not so!  Using the examples above:

  1. The Ansible playbook starts up and runs the "Create remote_syslog init.d service" task to create the init.d service.  This notifies handlers to enable and restart the service later in the run.
  2. The run is interrupted, leaving the host setup incomplete.
  3. So you run the playbook again to finish the job.  This time, the "Create remote_syslog init.d service" task doesn't need to run because the service was already created.  However, because the status is not "changed," the handlers are also not notified.
  4. The Ansible run completes and everything looks fine, but the service has not been restarted or enabled.  In this case, "enabled" means it will start on reboot, so now the service is not running and won't start even after a reboot.

The problem is that "idempotent" is not "re-entrant."  Re-entrant means the run can be interrupted and re-run, and everything will still be OK.  Here's a great post from Pivotal describing the difference.  I expected the Ansible role to be re-entrant.  It only cost me a little time, but it could easily have resulted in security holes or worse.

The lesson is this:

Any Ansible playbook or role that uses notify to trigger handlers (which is, like, nearly all of them) cannot be fully re-entrant.  Notifications occur only when the notifying task has a changed state, so if a run is interrupted after the task but before the handlers run, the handler queue is lost.  Re-running the task typically won't result in a changed state, so the handlers will never run.

That’s fine, so long as you are expecting this behavior.  I have a few words to say about that here!

Un-%&$-ing MySQL character sets and collations across an entire server

Recently, requests to one of our data-backed web services started timing out.  It turned out the problem was that some of our data tables had been (re)created using the wrong character sets and collations.  And as everyone should know, and now I do:

Indexes are useless for joins unless the collations match

The carefully optimized queries were running unindexed.  Ugh.  Once I'd spotted the problem, the task was "simple": find and fix incorrect character set and collation settings throughout the entire database server.  Here's what worked.  Except for the RDS updates, all work was done directly from the mysql client command line; for reference, this service included a couple hundred databases and a couple thousand tables.

First, back up

First, I verified that our nightly backups were intact and complete.  While we did not have any problems with data loss, garbling, etc. during this process, YMMV.  (Note that if you back up using mysqldump with default options, it will save the bogus character sets and collations, so be careful if you have to restore.)

RDS settings

The target databases were running on an AWS RDS instance using MySQL 5.6.  Out of the box, the defaults for this version are latin1 and latin1_swedish_ci (yes, really!), so I created and applied an RDS Parameter Group with the following settings:

  • character_set_client utf8
  • character_set_connection utf8
  • character_set_database utf8
  • character_set_filesystem utf8
  • character_set_results utf8
  • character_set_server utf8
  • collation_connection utf8_general_ci
  • collation_server utf8_general_ci

Changing these parameters will not modify anything in any of your current databases, but it will set proper defaults for creating new database objects and hopefully keep things from getting messed up again.
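
If you prefer the AWS CLI to the console, it looks roughly like this; the group and instance names are placeholders, you'd repeat the --parameters entries for each setting above, and static parameters may need ApplyMethod=pending-reboot.

    # Create a parameter group for the MySQL 5.6 family (names are placeholders)
    aws rds create-db-parameter-group \
        --db-parameter-group-name utf8-defaults \
        --db-parameter-group-family mysql5.6 \
        --description "utf8 / utf8_general_ci defaults"

    # Set the character set and collation parameters (repeat for each one listed above)
    aws rds modify-db-parameter-group \
        --db-parameter-group-name utf8-defaults \
        --parameters "ParameterName=character_set_server,ParameterValue=utf8,ApplyMethod=immediate" \
                     "ParameterName=collation_server,ParameterValue=utf8_general_ci,ApplyMethod=immediate"

    # Attach the parameter group to the RDS instance
    aws rds modify-db-instance \
        --db-instance-identifier my-db-instance \
        --db-parameter-group-name utf8-defaults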

Databases

Like the RDS settings, changing database defaults will only affect newly created data objects, but it's worth setting proper defaults to avoid future headaches.  I handled this in two steps: first I queried to see which databases needed updating, then I ran the updates.  Rinse and repeat until it's all good.  Here's the SQL I used to find errant databases:
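
Something like this, straight from information_schema (swap in your preferred character set and collation if they differ):

    SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
    FROM information_schema.SCHEMATA
    WHERE DEFAULT_CHARACTER_SET_NAME <> 'utf8'
       OR DEFAULT_COLLATION_NAME <> 'utf8_general_ci';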

I got a ton of hits.  Instead of handling each one by hand with ALTER DATABASE `<database name>` CHARACTER SET utf8 COLLATE utf8_general_ci;, I wrote a little SQL to generate all the commands:
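
A sketch of the generator; it just wraps each errant database name in the ALTER DATABASE statement above:

    SELECT CONCAT('ALTER DATABASE `', SCHEMA_NAME,
                  '` CHARACTER SET utf8 COLLATE utf8_general_ci;') AS alter_stmt
    FROM information_schema.SCHEMATA
    WHERE DEFAULT_CHARACTER_SET_NAME <> 'utf8'
       OR DEFAULT_COLLATION_NAME <> 'utf8_general_ci';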

I took the output from this query, cleaned it up with a text editor, and ran it from the command line.  Then I ran the database test query again and got zero records.  Success.

Tables

Next, I needed to update the tables.  Finally, a step that affects real data, not just future additions.  Here's how I found tables that needed the fix:
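
Roughly like this, again via information_schema; a table's collation implies its character set, so checking TABLE_COLLATION is enough:

    SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
    FROM information_schema.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE'
      AND TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')
      AND TABLE_COLLATION <> 'utf8_general_ci';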

I omitted the mysql, sys, and performance_schema databases because the account I was using lacked permissions on them anyway; it did not seem to matter in any way.  Again I got a boatload of results, so I wrote some SQL to create the update SQL for me for each target table:
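
Here's a sketch of that generator; this first version emits one CONVERT statement per errant table:

    SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
                  '` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;') AS alter_stmt
    FROM information_schema.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE'
      AND TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')
      AND TABLE_COLLATION <> 'utf8_general_ci';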

The first "update" statement I tried was ALTER TABLE `databasename`.`tablename` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; but this didn't change tables unless they included string/text fields.  I was able to convert the rest of the tables using simply ALTER TABLE `databasename`.`tablename` COLLATE utf8_general_ci;  The code example below tries the first version, then the shortened one, to cover all the tables.  I ran the resulting SQL, then ran the table check query again: another success.
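
A sketch of the combined version; it emits both statements for each errant table, and the second one is a harmless no-op where the first already did the job:

    SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
                  '` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; ',
                  'ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
                  '` COLLATE utf8_general_ci;') AS alter_stmts
    FROM information_schema.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE'
      AND TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')
      AND TABLE_COLLATION <> 'utf8_general_ci';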

Columns

It was not clear whether I needed to explicitly convert table columns after doing the table conversions, so I checked:
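
Something along these lines; CHARACTER_SET_NAME is NULL for non-string columns, so those get skipped:

    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema')
      AND CHARACTER_SET_NAME IS NOT NULL
      AND (CHARACTER_SET_NAME <> 'utf8' OR COLLATION_NAME <> 'utf8_general_ci');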

And… I got zero rows.  So at least for our environment, altering the tables with CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci was sufficient to update the table columns as well.  Docs were sparse about the effects of these changes, so I recommend checking to make sure anyway.  At this point, the defaults for our entire database service, the defaults for all databases and tables, and the actual settings on all data objects matched our preferred character set and collation.  As hoped, our application immediately perked up and the timeouts stopped.

Ubuntu 14.04… apt-get install “disk full” error (aaack!)

Recently we have been considering moving some MySQL database services from our cloud servers to Amazon RDS to simplify our management tasks, and we wanted to run atop on some Ubuntu 14.04 LTS machines to get an initial sizing for RDS instances.

While trying to apt-get install the atop package, I was getting a "disk full" error, but df -h said there were several GB left.  It took a while to figure out that the problem wasn't disk space: I was out of inodes (basically, the filesystem's per-file bookkeeping entries).  Running df -i showed 100% of the inodes were used up, which produces the same "disk full" symptom.  We monitor a lot of things, but inode usage was not one of them!
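
For anyone chasing the same symptom, the two checks look like this:

    df -h /    # reports plenty of free space
    df -i /    # IUse% at 100% -- the real problem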

While figuring this out, I did manage to brick one of our servers, making this the day that all the work building our infrastructure with Chef paid off!  Instead of panicking and having to work all night, I was able to build and configure a new machine in just a few minutes.  But I digress.

The problem

Here's what was going on: most of these servers have been running for a while, and we apply all the Linux/Ubuntu security updates regularly, which often updates the Linux kernel.  What we didn't realize was that the old kernel versions are never automatically deleted.  We ended up with a /usr/src directory that looked like this:
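
It looked something like this, with far more entries in reality; the version numbers here are just illustrative of the 3.13.x series that 14.04 runs:

    $ ls /usr/src
    linux-headers-3.13.0-32          linux-headers-3.13.0-33
    linux-headers-3.13.0-32-generic  linux-headers-3.13.0-33-generic
    linux-headers-3.13.0-34          linux-headers-3.13.0-35
    linux-headers-3.13.0-34-generic  linux-headers-3.13.0-35-generic
    ...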

And each version included a ton of individual, inode-sucking files.  Finally, some of the machines had gotten to the point where kernel updates were failing silently due to the inode shortage.

The solution

It turns out apt does not play well once something's messed up, and as I said, subsequent fiddling bricked a machine before I figured out this solution.

We had to start by making some headroom using dpkg directly.  You can shoot your foot off by deleting the kernel version you are currently running (and possibly recent ones), but we had good success by starting with the oldest versions and only deleting a couple, like this:
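
A sketch of that first dpkg step; check which kernel you're actually running and leave it (and the newest ones) alone. The version numbers here are illustrative:

    # See which kernel is running -- do NOT remove this one
    uname -r

    # Purge a couple of the oldest header packages to free up inodes
    # (the matching old linux-image packages can go the same way)
    sudo dpkg --purge linux-headers-3.13.0-32-generic linux-headers-3.13.0-32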

etc.  This ran OK and freed up enough inodes so we could run apt-get without disk full errors.  From there, the autoremove option did all the work:
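
    # Let apt clean up the remaining unused kernel packages
    sudo apt-get autoremove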

That removed the rest of the unused versions all at once, though we still had to manually “ok” a configuration option while uninstalling some of the packages.

From what I read, Ubuntu 16.04 LTS will remove old kernel versions by default once they are no longer needed.  Until then, I'm writing this down in case it happens again!