[Update 10/11/2009: I just found a better tool, bindata, to do what I'm describing in this post. It also lists, at the bottom of the page, many links to yet other implementations of binary data packing/unpacking. Worth checking out if this is what you need.]
[Update: presentation from the 4/15/2008 Ruby Meetup is now available here.]
I like reading code. It's like a novel, and I want to read it cover to cover. Some, such as Why’s Camping framework, I struggle to comprehend. But most code that I read comes up slightly short. Like a novel with some misspellings, awkward phrasing or repeated analogies, I mentally mark it as “could be better”. And sometimes I really do sit down and write something better, maybe just for my own amusement but often for a useful purpose.
I recently had the experience of reading some code that parsed a variable-length binary data structure. This sort of thing comes up often when parsing a file format or communications protocol. Most of the code looks fairly similar because it does similar stuff: ignore one byte, read the next four as the length of the following junk, read two important bytes, ignore two more, read another four-byte length and skip past the following N bytes, ad nauseam.
I’ve written it in C, and it looks something like this (ignoring error conditions like getting to the end of the buffer):
    ptr = &data;                    // start at the beginning of our data
    ptr++;                          // skip junk we don't care about
    UInt32 len = *(UInt32 *) ptr;   // get the 4-byte length
    len = ntohl(len);               // convert from network byte ordering
    ptr += sizeof(UInt32);          // skip past the length we just read
    ptr += len;                     // skip past the data we don't care about
    UInt16 cost = *(UInt16 *)ptr;   // read our important two bytes
    cost = ntohs(cost);             // convert to the correct byte ordering
In Ruby, this tends to be shorter thanks to the handy String#unpack method, which takes a concise format string that defines how many bytes to read and what to do with them. “a3” reads 3 bytes as a string, “N” reads 4 bytes as an unsigned integer in network byte order, “n” reads 2 bytes in network byte order, etc. The code above could be rewritten in Ruby like this:
    array = data.unpack("a1N")        # read the junk and the 4 length bytes
    len = array[1]                    # only get the length value we care about
    data = data[5..-1]                # throw away the stuff we just read
    array = data.unpack("a#{len}n")   # define the length to read on the fly
    cost = array[1]                   # get our data in its correct ordering
    data = data[(len+2)..-1]          # again, throw away what we just read
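To try the snippet on concrete bytes, one option is to build a sample buffer with Array#pack, which takes the same format codes and works in the opposite direction. This is only a sketch: the layout values below (a junk byte "x", a five-byte payload, a cost of 1234) are made up for illustration and do not come from any real file format.

```ruby
# Made-up sample data: 1 junk byte, a 4-byte length, `len` bytes of
# payload we want to skip, then the 2-byte value we actually care about.
payload = "hello"
data = ["x", payload.bytesize, payload, 1234].pack("a1Na#{payload.bytesize}n")

array = data.unpack("a1N")        # junk byte plus the 4-byte length
len = array[1]                    # the length, already converted from network order
data = data[5..-1]                # drop the 5 bytes we just read
array = data.unpack("a#{len}n")   # skip `len` bytes, then read the 2-byte cost
cost = array[1]
data = data[(len + 2)..-1]        # drop the payload and the cost

puts "len=#{len} cost=#{cost} remaining=#{data.bytesize}"
```

Building the input with pack keeps the test data and the parsing code in the same notation, which makes a mismatch between the two easy to spot.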
This code works fine, but it's not much more readable than the C code. A first step would be to define a String#unpack! method, where the ‘!’ clues us in that it modifies the object we’re working with. In this case, the modification is to eat (discard) the data we just read. This shortens the code to:
    array = data.unpack!("a1N")         # read the junk and the 4 length bytes
    len = array[1]                      # only get the length value we care about
    array = data.unpack!("a#{len}n")    # define the length to read on the fly
    cost = array[1]                     # get our data in its correct ordering
But again, this isn’t much more readable (in my opinion) than the C code. Additionally, it doesn’t help us understand the code much better in the case where our format string is “a3Nna5” and we need to remember which item in ‘array’ corresponds to the ‘n’ in the string (in this case, it is array[2]). After a test iteration or two, what I finally hit upon was to encapsulate the behavior we want in a separate Unpacker class that automatically eats the data it reads and stores the results in an internal Hash object, mapping the name ‘len’ or ‘cost’ to the data. I also combined the format string and the resulting variable so we can clearly see the relationships. The result looks like this:
    u = Unpacker.new(data)
    u.u! "a1          => unused
          N           => len"
    u.u! "a#{u.len}   => unused
          n           => cost"
Now we can clearly see which values are ignored, which are given meaningful names, and how the format codes relate to the meaning of the data. Changing it to reflect a better understanding of the underlying data will be very easy. Note that the only reason it's in two statements is to define a value for u.len before we use it; blocks of fixed-length data can be one statement.
The code to implement the Unpacker class is only about 30 lines of Ruby – including the string.unpack!() routine that can be reused separately.
    class String
      # Destructive unpack: returns the unpacked values and removes the
      # bytes that were read from the front of the string.
      def unpack! format
        array = self.unpack(format+"a*")   # trailing "a*" captures everything left over
        self.replace array.pop             # keep only the unread remainder
        return array
      end
    end

    class Unpacker < Hash
      attr_reader :data

      def initialize string
        @data = string
        super()   # explicit parens: a bare `super` would pass string to Hash.new as the default value
      end

      # format string is expected to have whitespace between each
      # "unpackCode=>variableName" pairing (which can have whitespace
      # around the "=>"). u! was picked to be short so it would
      # look nice, and to connote a destructive "unpack!" operation.
      def u! format
        format.gsub(/\s*=>\s*/,'=>').strip.split(/\s+/).each do |segment|
          src,dst = segment.split(/=>/)
          self[dst] = @data.unpack!("#{src}")[0]
        end
      end

      # Hash_with_Attrs - For the simplicity of using either u.len or u['len'],
      # makes a hash appear to have members for each hash entry. Many thanks
      # to Why for collecting this handy routine on his RedHanded blog.
      # Note of Caution: 'len' is fine but 'length' would not be since u.length
      # would give the number of entries in the hash, not the just-parsed value.
      def method_missing(meth,*args)
        meth = meth.id2name
        if meth =~ /=$/
          self[meth[0..-2]] = (args.length<2 ? args[0] : args)
        else
          self[meth]
        end
      end
    end
Update: An even cleaner and shorter way would be to implement a DSL as a module so the code above could look like this:
    a 1, :unused
    N :len
    a :len, :unused
    n :cost
(and yes, this is valid Ruby code)
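One way such a DSL might be put together is sketched below. Everything here is an assumption made for illustration: the module name UnpackDSL, the parse entry point, the inner Context class, and the choice to return a plain Hash keyed by symbols are all hypothetical, and only the three format codes used in this post are covered.

```ruby
# Hypothetical sketch of the DSL idea; none of these names come from the post.
module UnpackDSL
  # Evaluates the block against `data` and returns a Hash of name => value.
  def self.parse(data, &block)
    context = Context.new(data)
    context.instance_eval(&block)   # makes a, N and n callable inside the block
    context.fields
  end

  class Context
    attr_reader :fields

    def initialize(data)
      @data = data
      @fields = {}
    end

    # a COUNT, :name reads COUNT raw bytes. COUNT may also be a Symbol
    # naming a field that was parsed earlier, as in `a :len, :unused`.
    def a(count, name)
      count = @fields.fetch(count) if count.is_a?(Symbol)
      take("a#{count}", name)
    end

    # N :name reads a 4-byte unsigned integer in network byte order.
    def N(name)
      take("N", name)
    end

    # n :name reads a 2-byte unsigned integer in network byte order.
    def n(name)
      take("n", name)
    end

    private

    # Unpacks one value, keeps the unread remainder, and stores the value.
    def take(code, name)
      value, rest = @data.unpack("#{code}a*")
      @data = rest
      @fields[name] = value
    end
  end
end

# Made-up sample buffer: junk byte, 4-byte length, 5-byte payload, 2-byte cost.
buffer = ["x", 5, "hello", 1234].pack("a1Na5n")

result = UnpackDSL.parse(buffer) do
  a 1, :unused
  N :len
  a :len, :unused
  n :cost
end

puts result[:cost]
```

Running the block through instance_eval is what lets a bare `N :len` resolve to a method call. As with the hash version, reusing the name :unused simply overwrites the earlier value.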