Ruby file encoding determination with EncodingSampler gem


Is there a good automatic way to determine file encoding?

If you have had to deal with importing user data from text files, you’ll know a couple things about character encodings:

  • Using the wrong encoding to interpret the file is disastrous.  Once you start to notice this, you’ll see it all over the web.
  • Users have no idea what a character encoding is, and there is zero chance they will actually know the encoding of a file they’re uploading.

My hope was that there was some simple way to actually determine the encoding of a general input file.  I dug around for days looking for a way to handle this situation for Ruby 1.9 and found:

  • rchardet is a Ruby gem that tries to determine the encoding of a data sample based on the original Mozilla algorithms (“The Original Code is Mozilla Communicator client code.”)  It picks the most likely encoding and gives you a “confidence” value to indicate how sure it is.
  • I didn’t dig too deep into charlock_holmes, but it seems to provide the same sort of results by extending String.
  • cmess combines several encoding tools and CMess::GuessEncoding seems to work in a similar way.  I ran a few tests, and found that a lot of the results I got for encodings didn’t match anything in Ruby’s Encoding::name_list, so I gave up on it.
  • I even ran across this well-intentioned “solution” that picks an encoding using just this: “File.open(source_file).read.encoding” (which doesn’t work at all: it reports the encoding Ruby assumed when reading the file, not one detected from the contents).  The rspec test just checks that the result is not nil.
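
To see why that last approach can’t work, here is a minimal sketch (the helper name naive_encoding is made up for illustration).  It writes the same word to two temp files using two different byte encodings, then asks Ruby what it "detected".  Unless you pass an encoding to File.open, Ruby just tags whatever it reads with Encoding.default_external, so both files report the same answer.

```ruby
require "tempfile"

# Hypothetical helper mirroring the naive approach.
def naive_encoding(path)
  File.open(path) { |f| f.read.encoding }
end

byte_samples = [
  "caf\xE9\n".b,      # "café" as ISO-8859-1 bytes
  "caf\xC3\xA9\n".b   # "café" as UTF-8 bytes
]

results = []
byte_samples.each do |bytes|
  Tempfile.create("sample") do |f|
    f.binmode
    f.write(bytes)
    f.flush
    results << naive_encoding(f.path)
  end
end

# Both entries are the same Encoding object, whatever the file contents were.
p results
p results.uniq == [Encoding.default_external]
```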

My problem is, it’s not good enough to get a result and a confidence.  I need The Right Answer so that customer data can be interpreted correctly.  And further research showed that, in the general case, it is impossible to determine the intended encoding without knowing what the data content is supposed to look like in the first place.  For example, the byte 0xA4 could be either the generic currency symbol (¤) using ISO-8859-1 or the Euro symbol (€) using ISO-8859-15.  There is absolutely no way to automatically determine the intent from just programmatically examining the file.
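
The 0xA4 ambiguity is easy to reproduce.  This small sketch decodes the same single byte under both encodings using String#encode with an explicit source encoding.  Both conversions succeed without error, which is exactly the problem: nothing in the byte itself says which result was intended.

```ruby
byte = "\xA4".b  # one raw byte, tagged as binary

as_latin1 = byte.encode("UTF-8", "ISO-8859-1")   # treat the byte as ISO-8859-1
as_latin9 = byte.encode("UTF-8", "ISO-8859-15")  # treat the byte as ISO-8859-15

puts as_latin1  # generic currency sign, U+00A4
puts as_latin9  # Euro sign, U+20AC
```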

EncodingSampler: let the user decide

EncodingSampler (https://github.com/flatrocks/encoding_sampler) solves this problem in a different way: decode the file using the various options, and let the user decide which one looks right.  There are a couple issues that pop up right away.

First, most characters in most (US English) text files are just 7-bit ASCII and will be interpreted identically by the common (US English) encodings.  A decent sample would have to exclude lines that translate identically and find the ones that actually look different.

It’s also common for several encodings to yield identical results for an entire file. In that case, it’s confusing to show multiple, identical samples and ask a user to choose, so encodings with identical samples should be grouped together.

How it works

EncodingSampler is a “brute force” solution.  It works by reading each line in the target file and trying to decode it using each of the target encodings.  When the results differ for any pair of encodings, the line is retained in the sample.  When an encoding error occurs for a target encoding, that encoding is rejected from the possible solutions.  There are three possible results:

  • There may be no valid encodings. This could mean that none of the proposed encodings match the file, but often it means the file is simply malformed. This is generally what you will see if you try to determine the encoding of a non-text binary file.
  • There may be only one group of valid encodings, all of which yield the same decoded data. In this case there are no samples to look at because there are no encodings to differentiate between.
  • There may be more than one set of valid encodings, each of which yields different decoded data. In this case the samples are available so a user can determine which is the correct interpretation.
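
The steps above can be sketched in a few lines of plain Ruby.  To be clear, this is not the gem’s code or its API: the method names try_decode and sample_encodings and the returned hash layout are made up for illustration, and the input is an array of raw binary lines rather than a file.

```ruby
# Decode raw bytes as `enc`; return nil if the bytes are not valid in that encoding.
def try_decode(raw, enc)
  text = raw.dup.force_encoding(enc)
  return nil unless text.valid_encoding?
  text.encode("UTF-8")
rescue EncodingError
  nil
end

def sample_encodings(raw_lines, encodings)
  valid = encodings.dup
  decoded_lines = []  # one { encoding => decoded text } hash per input line

  raw_lines.each do |raw|
    decoded = {}
    valid.dup.each do |enc|
      text = try_decode(raw, enc)
      if text.nil?
        valid.delete(enc)  # any decoding error rejects the encoding for good
      else
        decoded[enc] = text
      end
    end
    decoded_lines << decoded
  end

  # Keep only the lines on which the surviving encodings disagree.
  samples = decoded_lines
            .map { |d| d.select { |enc, _| valid.include?(enc) } }
            .select { |d| d.values.uniq.size > 1 }

  # Encodings that agree on every retained line belong to the same group.
  groups = valid.group_by { |enc| samples.map { |d| d[enc] } }.values

  { groups: groups, samples: samples }
end

lines = ["plain ascii\n".b, "price: \xA4 5\n".b]
result = sample_encodings(lines, %w[US-ASCII ISO-8859-1 ISO-8859-15 UTF-8])
p result[:groups]
result[:samples].each { |s| s.each { |enc, text| puts "#{enc}: #{text}" } }
```

Here US-ASCII and UTF-8 are rejected by the 0xA4 byte, the pure-ASCII line is dropped because the survivors agree on it, and the two remaining encodings end up in separate groups.  An empty groups list corresponds to the first outcome, a single group with no samples to the second, and several groups to the third.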

Since the differences between samples can be subtle, EncodingSampler implements a “diff” solution, making it simple to highlight the differences.  The default options wrap the differences with <span class="difference">…</span>.  With simple CSS you could present the options to the user as something like:

ASCII-8BIT ?ABCDEFabcdef0123456789?ABCDEFabcdef0123456789?
ISO-8859-1 ¤ABCDEFabcdef0123456789¤ABCDEFabcdef0123456789¤
ISO-8859-15 €ABCDEFabcdef0123456789€ABCDEFabcdef0123456789€
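
For a rough idea of how that kind of markup could be produced, here is a simplified sketch.  It is not the gem’s diff implementation: highlight_differences is a hypothetical helper that compares two decoded strings character by character, which only makes sense when both have the same number of characters (true for single-byte encodings like the ones above).  It also HTML-escapes the text, since the content comes from an uploaded file.

```ruby
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Wrap each run of characters in `text` that differs from `other` in a span.
def highlight_differences(text, other)
  pairs = text.chars.zip(other.chars)
  pairs.chunk { |mine, theirs| mine == theirs }.map do |same, run|
    chars = run.map(&:first).join.gsub(/[&<>"]/, HTML_ESCAPES)
    same ? chars : "<span class=\"difference\">#{chars}</span>"
  end.join
end

latin1 = "\u00A4ABC\u00A4"  # sample decoded as ISO-8859-1
latin9 = "\u20ACABC\u20AC"  # same bytes decoded as ISO-8859-15

puts highlight_differences(latin1, latin9)
puts highlight_differences(latin9, latin1)
```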

There’s also an experimental “best” result.  The idea is that a properly-encoded result won’t contain the junk characters you sometimes see in improperly-decoded content.  In that case, the samples that are longer than the shortest one are probably wrong, and if there’s only one shortest sample, it probably represents the one correct encoding.
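
The shortest-sample idea might look something like the sketch below.  Again, best_guess is a hypothetical name and this is my reading of the heuristic, not the gem’s code: it takes a hash of encoding name to decoded sample and returns an encoding only when exactly one sample is the shortest.

```ruby
# Return the encoding with the uniquely shortest decoded sample, or nil on a tie.
def best_guess(samples)
  shortest = samples.values.map(&:length).min
  candidates = samples.select { |_, text| text.length == shortest }.keys
  candidates.size == 1 ? candidates.first : nil
end

raw = "caf\xC3\xA9".b  # "café" as UTF-8 bytes

samples = {
  "UTF-8"      => raw.dup.force_encoding("UTF-8"),  # 4 characters
  "ISO-8859-1" => raw.encode("UTF-8", "ISO-8859-1") # 5 characters, with junk
}

p best_guess(samples)
```

Decoding multi-byte UTF-8 as a single-byte encoding turns each multi-byte character into two or more junk characters, so the wrong sample comes out longer.  When every candidate is a single-byte encoding, all samples have the same length and the heuristic has nothing to say.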

Motivation…

I wrote EncodingSampler because we had an immediate need, and it’s a gem because it’s fairly inelegant by nature and I didn’t want to junk up the project by including it directly.  It’s public because 1) it might be useful, and 2) I hope that smarter people will find better ways to do this, and will create a newer, better solution, or even fork this one.

Sorry, I currently don’t allow comments on this blog, but you can open or comment on issues using the GitHub issues page.