Tuesday, 26 February 2008

Testing small scripts

Seasoned programmers know this: testing should be an integral part of developing any script/program/software suite. Part and parcel of that is the unit test, where you check each small piece of your program in isolation.

For larger projects using a bunch of library files, the setup for testing basically always looks the same: there's your /lib/ directory with your class definitions and your /test/unit/ directory which holds your tests. See here and here for introductions on full-blown unit tests.

That's all nice and fine, but as a bioinformatician you often just write small scripts for which it would be way too much hassle to create those different directories and to keep the files containing the classes separate from those containing the tests. So what do we do?

Often, you end up running your program and looking out for part of the output that you know should be correct. Let's take a very simple example. Suppose we have a file with just one column that has numbers in it. The same number can occur multiple times, and the ground-breaking script you'll write will just count the occurrences of each. Even though there are thousands of lines, you know from visual inspection that there are 7 1's and 15 2's. So a script could look like this:


occurrences = Hash.new(0)
File.open('data.txt').each do |number|
  number.chomp!
  occurrences[number.to_i] += 1
end
occurrences.keys.each do |k|
  puts k.to_s + "\t" + occurrences[k].to_s
end


So you'd run the script and check if you get the expected values for 1 and 2. If not: revise, rerun and check again.

But this looks like something that would be ideally suited for a unit test, if it weren't for the fact that it'd be too much hassle creating those different files and all. What if we could put the testing code in the script itself?

Actually, with a few adjustments, that's not a problem. Look at the following version of the code.


class Parser
  attr_accessor :occurrences
  def run
    @occurrences = Hash.new(0)
    File.open('data.txt').each do |number|
      number.chomp!
      @occurrences[number.to_i] += 1
    end
  end
end

if ! $test
  # Normal run: print the counts, just like the first version
  p = Parser.new
  p.run
  p.occurrences.keys.each do |k|
    puts k.to_s + "\t" + p.occurrences[k].to_s
  end
else
  # Test run: check the counts we know from visual inspection
  require 'test/unit'
  class TestSimple < Test::Unit::TestCase
    def test_simple
      p = Parser.new
      p.run
      # assert_equal takes the expected value first, then the actual one
      assert_equal(7, p.occurrences[1])
      assert_equal(15, p.occurrences[2])
    end
  end
end


So what happened here?

The original script, like so many scripts we write, actually does two things: (1) it parses a file to extract information, and (2) it prints some things out. In cases like this, we can take the approach shown above. The biggest change you have to make is to put your code in a class, so that the unit test has something to create and call. Secondly, the if ! $test separates the behaviour of the code based on whether you want it tested or just run. I'll explain this line later. When the if ! $test condition is true, the script just dumps the same output as the first version. When it is false, the script loads test/unit and runs one test with two assertions: that the count for 1 is 7 and the count for 2 is 15.
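The test above still reads the real data.txt. As a minimal sketch of one way around that, assuming a hypothetical variation in which run takes its input as an argument (this is not how the script above is written), a test could hand the parser a small in-memory sample whose counts are easy to verify by hand:

```ruby
require 'stringio'

# Hypothetical variant of Parser: the input is handed to run as an
# argument, so a test can pass a small in-memory sample instead of data.txt.
class Parser
  attr_reader :occurrences

  def run(input)
    @occurrences = Hash.new(0)
    input.each_line do |line|
      @occurrences[line.chomp.to_i] += 1
    end
    @occurrences
  end
end

# Three 1's and two 2's, small enough to count by hand.
sample = StringIO.new("1\n2\n1\n1\n2\n")
counts = Parser.new.run(sample)
puts counts[1]  # prints 3
puts counts[2]  # prints 2
```

Since File objects respond to each_line as well, the normal branch of the script could still pass File.open('data.txt') to the same method.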

How does that if ! $test work? If you call your script using

ruby -s my_script.rb -test

instead of

ruby my_script.rb

the -s switch makes Ruby turn the -test that follows the script name into an extra (global) variable for your script: $test, set to true. See man ruby for more information.
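To make the behaviour of the switch concrete, here is a small sketch. The run_mode helper is a made-up name for illustration; only the -s handling itself comes from Ruby.

```ruby
# How the -s switch maps command-line switches to globals:
#   ruby -s my_script.rb -test        # $test is true
#   ruby -s my_script.rb -test=fast   # $test is the string "fast"
#   ruby my_script.rb                 # $test is nil, like any unset global

# Hypothetical helper: decides which branch of the script to take.
def run_mode
  $test ? :test : :run
end

puts "mode: #{run_mode}"
```

Because an unset global is simply nil, the script needs no extra guard for the case where it is started without -s.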

So with this approach you can use test-driven development in your teenie weenie scripts too, and not just in your mammoth software suites.

Wednesday, 6 February 2008

Making Bio::Graphics extendable

One of the issues in a library like Bio::Graphics is the plethora of glyph types that users will want. Here's a little showcase of what's provided by the library:


Features on a DNA sequence can be represented as filled boxes, open boxes, boxes with arrows, lines, triangles, ... In this post, I'll show you (and remind myself) how I came to a version of the Bio::Graphics code that makes adding glyphs straightforward for both myself and the user. WARNING: this post is going to be rather technical... Sorry about that.

First pass
Suppose we want to make it possible to create a picture like this one:

You basically have to tell your script that marker features should be drawn as triangles, and both scaffold and clone features as coloured boxes. The initial version of doing the actual drawing looked like this (only taking the relevant bits):

class Feature
  def initialize(glyph = :generic)
    @glyph = glyph
  end
  attr_accessor :glyph

  def draw
    case @glyph
    when :generic
      drawing.rectangle(left, top, width, height).fill
    when :line
      drawing.move_to(left,top)
      drawing.line_to(right,top)
      drawing.stroke
    when :triangle
      # code to draw triangle
    end
  end
end


This does work, but you see the issue, right? Whenever I or someone else comes up with another idea on how to represent a particular feature, the library code itself has to be changed. So far from extendable, that is...

Second pass: extracting the glyphs
To handle this issue for Perl's Bio::Graphics, Lincoln Stein uses the Factory pattern: he creates a single GlyphFactory object that spits out different Glyph objects for each feature, based on the configuration set at the Feature level. As I didn't know a thing about Design Patterns (i.e. before Russ Olsen's "Design Patterns in Ruby" arrived here at work), I had no idea how to set something like that up and just started coding away. As it turns out, I actually implemented it using a Strategy pattern.

What I basically wanted is to delegate the actual drawing of a feature to a glyph. The Design Patterns in Ruby book gives a good example for formatting text. Here's the code:

class XMLFormatter
  def output_report(title, text)
    puts('<xml>')
    puts("  <title>#{title}</title>") # double quotes, so #{...} is interpolated
    puts("  <text>#{text}</text>")
    puts('</xml>')
  end
end

class PlainTextFormatter
  def output_report(title, text)
    puts("***** #{title} *****")
    puts text
  end
end


This can then be used in e.g. a Report class like this (also from the same book):

class Report
  attr_reader :title, :text
  attr_accessor :formatter

  def initialize(formatter)
    @title = 'Monthly Report'
    @text = 'Things are going pretty well.'
    @formatter = formatter
  end

  def output_report
    @formatter.output_report(@title, @text)
  end
end
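
To see the Strategy pattern at work, here is a self-contained sketch of how such a Report could be used. One deliberate change from the version above, made only so the result is easy to inspect: the formatters return the report as a string instead of printing it.

```ruby
# Formatter variants that return the report as a string instead of
# printing it (a change made for this example only).
class XMLFormatter
  def output_report(title, text)
    "<xml>\n  <title>#{title}</title>\n  <text>#{text}</text>\n</xml>"
  end
end

class PlainTextFormatter
  def output_report(title, text)
    "***** #{title} *****\n#{text}"
  end
end

class Report
  attr_reader :title, :text
  attr_accessor :formatter

  def initialize(formatter)
    @title = 'Monthly Report'
    @text = 'Things are going pretty well.'
    @formatter = formatter
  end

  def output_report
    @formatter.output_report(@title, @text)
  end
end

report = Report.new(PlainTextFormatter.new)
puts report.output_report

# Swap the strategy at runtime; Report itself is untouched.
report.formatter = XMLFormatter.new
puts report.output_report
```

The point to take away is that Report never needs to know which formatter it holds, as long as the object responds to output_report.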


Looks a lot like what we need, doesn't it? Translating this to our purposes, the library code could look like this:

module Glyph; end # the namespace has to exist before Glyph::Common is defined

class Glyph::Common
  def initialize(caller)
    @caller = caller # the Feature that asked to be drawn
  end
  attr_accessor :caller
end

class Glyph::Generic < Glyph::Common
  def draw(left, top, width, height)
    @caller.drawing.rectangle(left, top, width, height).fill
  end
end

class Glyph::Line < Glyph::Common
  def draw(left, top, width, height)
    right = left + width
    @caller.drawing.move_to(left, top)
    @caller.drawing.line_to(right, top)
    @caller.drawing.stroke
  end
end


And use it in the Feature class like this:

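class Feature
  def initialize(glyph_object = Glyph::Generic)
    @glyph_object = glyph_object.new(self)
  end
  attr_accessor :glyph_object

  def draw
    # left, top, width and height are the feature's own coordinates, as in the first pass
    @glyph_object.draw(left, top, width, height)
  end
end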


At least this approach splits out the actual drawing into different simple classes. But the extendability still isn't there: the user still has to open the library file containing all glyph definitions and hack away in there.
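
To make the delegation concrete, here is a runnable sketch of the second pass. The RecordingDrawing class is a made-up stand-in for the real drawing context, and the coordinates in Feature#draw are invented; both exist only so the example can run without a graphics library.

```ruby
# Stand-in for the real drawing context (hypothetical): it only records
# the calls it receives, so the example runs without a graphics library.
class RecordingDrawing
  attr_reader :calls

  def initialize
    @calls = []
  end

  def rectangle(left, top, width, height)
    @calls << [:rectangle, left, top, width, height]
    self # returning self allows chaining .fill, as in the post
  end

  def move_to(x, y)
    @calls << [:move_to, x, y]
  end

  def line_to(x, y)
    @calls << [:line_to, x, y]
  end

  def fill
    @calls << [:fill]
  end

  def stroke
    @calls << [:stroke]
  end
end

module Glyph
  class Common
    def initialize(caller)
      @caller = caller # the feature that delegates its drawing
    end
  end

  class Generic < Common
    def draw(left, top, width, height)
      @caller.drawing.rectangle(left, top, width, height).fill
    end
  end

  class Line < Common
    def draw(left, top, width, _height)
      @caller.drawing.move_to(left, top)
      @caller.drawing.line_to(left + width, top)
      @caller.drawing.stroke
    end
  end
end

# Minimal Feature with made-up coordinates; the real class has more to it.
class Feature
  attr_reader :drawing

  def initialize(drawing, glyph_class = Glyph::Generic)
    @drawing = drawing
    @glyph_object = glyph_class.new(self)
  end

  def draw
    @glyph_object.draw(10, 5, 100, 8)
  end
end

drawing = RecordingDrawing.new
Feature.new(drawing, Glyph::Line).draw
p drawing.calls  # prints [[:move_to, 10, 5], [:line_to, 110, 5], [:stroke]]
```

Feature#draw contains no case statement any more: which shape ends up on the canvas depends entirely on the glyph class handed to the constructor.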

Third pass: loading glyphs automatically
It'd be nice if we could add new glyph types on the fly just by creating a little file containing the code for that glyph's class. Convention over configuration to the rescue...

What I did was create a folder (/lib/bio/graphics/glyphs/) that contains the description of all glyphs in separate files:
generic.rb

class Glyph::Generic < Glyph::Common
  def draw(left, top, width, height)
    @caller.drawing.rectangle(left, top, width, height).fill
  end
end


line.rb

class Glyph::Line < Glyph::Common
  def draw(left, top, width, height)
    right = left + width
    @caller.drawing.move_to(left, top)
    @caller.drawing.line_to(right, top)
    @caller.drawing.stroke
  end
end


So ideally, the only thing needed to make a script work that asks for a feature to be drawn as an empty box (feature = Feature.new(:empty_box)) would be to add a file called 'empty_box.rb' to that directory. Several things have to be taken care of to make that happen:
* loading the new file
* translating the :empty_box to EmptyBox

To load all files in that directory is easy enough. Adding the following code to the main bio-graphics.rb file (which loads the whole library) does the trick:

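# Absolute path, so require does not depend on the load path or the working directory
glyph_dir = File.expand_path('bio/graphics/glyphs', File.dirname(__FILE__))
# common.rb goes first: every other glyph inherits from Glyph::Common
require File.join(glyph_dir, 'common.rb')
full_pattern = File.join(glyph_dir, '*.rb')
Dir.glob(full_pattern).each do |file|
  require file
end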
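
The loading step can be tried out in isolation. The sketch below uses a throwaway temporary directory in place of lib/bio/graphics/glyphs/ and writes two small glyph files into it; the file contents are made up for this example.

```ruby
require 'tmpdir'

# Throwaway directory standing in for lib/bio/graphics/glyphs/;
# the file names and class bodies are made up for this example.
Dir.mktmpdir do |glyph_dir|
  File.write(File.join(glyph_dir, 'common.rb'), <<~RUBY)
    module Glyph
      class Common
        def initialize(caller)
          @caller = caller
        end
      end
    end
  RUBY

  File.write(File.join(glyph_dir, 'empty_box.rb'), <<~RUBY)
    class Glyph::EmptyBox < Glyph::Common
      def draw(left, top, width, height)
        @caller.drawing.rectangle(left, top, width, height).stroke
      end
    end
  RUBY

  # The base class goes first: every other glyph file inherits from it.
  require File.join(glyph_dir, 'common.rb')

  # require skips files that are already loaded, so common.rb is not run twice.
  Dir.glob(File.join(glyph_dir, '*.rb')).sort.each do |file|
    require file
  end
end

puts Glyph.const_defined?(:EmptyBox)  # prints true
```

Requiring common.rb explicitly before the glob matters because the order in which the other files are picked up should not be relied upon; the sort here only makes that order predictable.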


To translate the :empty_box symbol into the EmptyBox class takes a little more work: we need to convert the snake_case symbol into a CamelCase string, and then create an object of the class that has that name. To do that, I extended the String class a bit with these additional methods:

class String
  def snake_case
    return self.to_s.gsub(/::/, '/').gsub(/([A-Z]+)([A-Z][a-z])/,'\1_\2').gsub(/([a-z\d])([A-Z])/,'\1_\2').tr("-", "_").downcase
  end

  def camel_case
    return self.to_s.gsub(/\/(.?)/) { "::" + $1.upcase }.gsub(/(^|_)(.)/) { $2.upcase }
  end

  def to_class
    # Resolve one namespace level at a time, starting from the top
    parts = self.split(/::/)
    klass = Kernel
    parts.each do |part|
      klass = klass.const_get(part)
    end
    return klass
  end
end


Now what happens here? The snake_case and camel_case methods should not be that difficult to understand and are not really where the magic happens. The String#to_class method, however, is a different story. As it happens, every class in Ruby is also represented by a constant (which is why a class name always starts with a capital). To get to the class that has the name MyClass, all you have to do is retrieve the constant with that name: Kernel.const_get("MyClass"). Unfortunately, having namespaces (Bio::Graphics::Glyph::Generic) makes things a bit more difficult. In the Ruby versions around at the time of writing, you can't just do Kernel.const_get("Bio::Graphics::Glyph::Generic") (Ruby 2.0 and later do accept such a qualified name). To get to the Generic class, you have to call the const_get method on Bio::Graphics::Glyph, which you don't have in hand yet either. Therefore we have to loop through all parts of the namespace, resolving one constant at a time: Kernel gives us Bio, Bio gives us Graphics, and so on.
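
The lookup can be sketched without touching the String class at all. In the example below, class_for is a hypothetical helper name, the two glyph classes are empty placeholders, and the CamelCase conversion is a simplified version that only handles plain snake_case symbols.

```ruby
# Empty placeholder classes, just so there is something to look up.
module Bio
  module Graphics
    module Glyph
      class Generic; end
      class EmptyBox; end
    end
  end
end

# Hypothetical helper: turns a glyph symbol into the matching class.
def class_for(glyph_symbol)
  # :empty_box -> "EmptyBox" (simplified: plain snake_case input only)
  camel = glyph_symbol.to_s.split('_').map(&:capitalize).join
  full_name = "Bio::Graphics::Glyph::#{camel}"

  # Walk the namespace one constant at a time:
  # Object -> Bio -> Bio::Graphics -> Bio::Graphics::Glyph -> the glyph class
  full_name.split('::').inject(Object) { |scope, part| scope.const_get(part) }
end

puts class_for(:empty_box)  # prints Bio::Graphics::Glyph::EmptyBox
puts class_for(:generic)    # prints Bio::Graphics::Glyph::Generic
```

Asking for a glyph that has no class, for example class_for(:no_such_glyph), raises a NameError from const_get, which is a reasonable signal that the matching file is missing.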

With this code in place, I rewrote the Feature class to use this functionality:

class Feature
  def initialize(glyph = :generic)
    @glyph = glyph
  end
  attr_accessor :glyph

  def draw
    # :empty_box -> "Bio::Graphics::Glyph::EmptyBox" -> the class itself
    glyph_name = 'Bio::Graphics::Glyph::' + @glyph.to_s.camel_case
    glyph_class = glyph_name.to_class
    glyph_object = glyph_class.new(self)
    glyph_object.draw(left, top, width, height)
  end
end


Now all a user has to do to add a new glyph type to his application is:
* create a file in the lib/bio/graphics/glyphs/ directory that defines the glyph
* make sure that the name he gives to that class is the CamelCase version of the symbol he wants to use (which should be snake_case)

There you go. As I warned at the start: technical. At the moment this setup works for what I need the Bio::Graphics library to do. The approach may well change in the future, as we need to handle subfeatures, subsubfeatures, subsubsubfeatures, ... more elegantly. But that's something for another post.