Wednesday, 10 October 2007

The state of bioruby (or: how can bioruby grow?)

A number of people asked me recently about the usability of ruby/bioruby and whether it would be worthwhile for them to take the plunge and investigate bioruby more. So I thought writing it up here would be a good idea...

First a disclaimer: this is my own personal view on bioruby, based on experiences in the last year-and-a-half. In addition, this is about the bioruby project, not the code or the people.

Let's first see what bioruby does. It's a library of ruby classes and modules that can be used in biological -omics research. Just like ruby itself, bioruby's origins lie in Japan. Version 0.5.0 was released in 2003 and we're at 1.1.0 now. A brief and incomplete overview of what's covered by the library:
  • sequences
  • locations
  • pathways
  • alignments
  • trees
  • databases: GenBank, RefSeq, Ensembl, KEGG, ...
  • applications: fasta, BLAST, HMMER, clustalw, sim4, spidey, ...
  • ...
Now why this post? Because I believe adoption and use of bioruby could be much improved. Is bioruby dead? Far from it. I think it's more like it is growing out of its clothes as any toddler does when it's getting older.

So what's the problem? It's not the quality of the code. It's not that too much stuff is missing. It's a sub-optimal level of communication (between users, from users to core-developers and from core-developers to users) and the low visibility of the project.

How can bioruby be taken forward? Somewhere in March, I sent some suggestions to the bioruby mailing list in response to a post by a very frustrated Trevor Wennblom. What it basically boils down to is getting organized and bringing bioruby much more to the forefront. So what options exist?

Getting bioruby organized
First of all, it wouldn't be bad if there were a small board-like group of people (3 or 4, mixed American/European/Japanese) who could take the executive decisions on releases and on which new modules should be incorporated into the bioruby library (after discussion on the mailing list, obviously). This would take a lot of weight off the shoulders of Toshiaki Katayama, who now carries almost sole responsibility (and the stress) for this.

Secondly, we need something of a playground for experimental modules. Call it bioruby-edge or whatever. These could be any modules that are not ready to go into the core bioruby, but are already really useful. When they reach a good enough quality, they can be moved to bioruby itself. There is already a bioruby-annex project at rubyforge, but according to its description it's only meant to hold rails plugins. (However, I was told to put my Ensembl API there as well...)

In addition, it would be good if the rubyforge project website were used for feature requests and bug reports. This would then be the one-stop shop for development.

And of course there is the documentation. I think we did a good thing in that big push to document the API in 2006, but the community needs more. The bioruby website already hosts the BioRuby in Anger documentation written by Toshiaki and Pjotr Prins. That's great stuff for quick lookups and I often use that information (especially the sequence IO; I can never remember it. Must be an APOE4 mutation.). It would be nice though if it were worked out a bit more. Take a look at the BioPerl documentation: I've always found the howtos really helpful, as they go a bit deeper into how the code works as well.
The rubyforge system provides wiki functionality for its projects, which is apparently not activated in bioruby. There is bioruby-doc maintained by Trevor, but I think it would be good to keep the core things together: put the wiki on rubyforge.

Letting people know about bioruby
Tremendous things are happening in the bioruby code (e.g. the rails thing), but we just don't know about it. Let alone what might be in store for the future. In addition, we know that the library exists, but the community who uses it has up till now not been really talkative about what they used it for. What we need here is communication in all directions.
First of all: a paper in a medium/high-profile journal. And sooner rather than later. This could serve as the starting block for building a wider bioruby community.

Secondly, we need to let each other know what we're doing with bioruby and how we're using it. I'm talking blogs and social networks here. I hope the blog you're reading at the moment might be a small contribution. Both the end-users and the core-developers should get their thoughts and work out in the open. Reports by end-users will keep the developers a bit on their toes and can highlight things that can be improved in bioruby. Core-developers, on the other hand, could shed light on what's in store for bioruby in the future. How do you yourself use the code? Are you contemplating something great? We'd like to know what you're planning... I got a reply from Toshiaki about writing a blog, and he mentioned that it's not straightforward to do that in English. I do understand that that's a hurdle, but what I'd say to everyone having that issue: no problem. So let it be English "with hair on" (ooh, hair-rising-on-my-back translated literally from Dutch, but you hopefully get the idea). It's about us getting the big picture, not about reading poetry. If we get the meaning, that's the main point.

The core-developers have done and are doing a great job. Respect. The only thing is that this toddler is now grown enough to want to play outside and will need additional clothes for that.

Ruby has so much to offer for bioinformatics: it has tremendous functionality and yet is so simple to code in. It would be a shame if the bioinformatics community could not capitalize on that.

Of course, I'd be very interested in your comments. Let's start talking! Especially about how to start that social network.

Note: while writing this entry, there were actually two messages sent to the bioruby mailing list asking for more documentation and easier access for new users (October 10, 2007). One of these messages stated "...BioRuby docs should have a version of more readable/easy-to-use format for beginners apart from the API stuff". Quod erat demonstrandum.

Second note: there have been some comments on the mailing list saying that this post was too much of a criticism of the original contributors to bioruby. That's not what I intended. Instead, it was my intention to look at options for taking bioruby forward and letting it grow from its small niche today into a more widely accepted toolkit. I've changed some phrasing in the text (including the title) to hopefully make that intention clear.

Tuesday, 9 October 2007

Using rake to manage your software project

Do you have some of those projects where you have to jump through the same hoops every time you edit some code? Take a look at the bio-graphics code. Every time I change anything in the code, I have to do the following things:
  1. regenerate the RDoc documentation
  2. regenerate the ruby gem
  3. check SVN status
  4. do an SVN update
  5. perform the SVN commit
  6. upload the new documentation to the website
That's a prime candidate for rake. Rake does the same job as GNU make: dependency-based programming. The major advantage for us over GNU make is of course that it uses ruby syntax. By dependency-based programming, I mean that some tasks rely on other ones. GNU make is best known for managing the compilation of source files, but you can do other things with it as well: if I want to commit to SVN, I want to make sure that the latest RDoc has been generated as well as a new gem. Therefore, you can have the 'SVN commit' task depend on the 'generate RDoc' and 'generate gem' tasks. And the 'generate RDoc' task will in turn depend on the freshness of the actual library files.
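
To make the idea of one task depending on others concrete, here is a minimal sketch. The task names and the RUN_ORDER constant are made up for this illustration and are not the ones from my real Rakefile. The explicit invoke call is only there so the snippet can run as a plain ruby script; in a real Rakefile you would leave it out and type "rake commit" instead.

```ruby
require 'rake'

# Records the order in which tasks run (illustration only).
RUN_ORDER = []

task :generate_rdoc do
  RUN_ORDER << :generate_rdoc
end

task :generate_gem do
  RUN_ORDER << :generate_gem
end

# :commit names its prerequisites; rake runs them first, each at most once.
task :commit => [:generate_rdoc, :generate_gem] do
  RUN_ORDER << :commit
end

# Stand-in for typing "rake commit" on the command line.
Rake::Task[:commit].invoke
puts RUN_ORDER.inspect
```

Asking for the commit task is enough: rake works out that the two prerequisites have to run first.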

How does this work? You basically create a file containing tasks, the Rakefile, and tell rake to execute one or more of them. There are several good tutorials on rake, like the ones from Martin Fowler and from the Rails Envy guys. I'm not going into the nitty-gritty of how Rakefiles are written; those tutorials are much better at that. What I will do here is describe the Rakefile I use for Bio::Graphics. (Someone already asked in the comments on my post on using ActiveRecord outside of rails what the Rakefile was that I used. Actually, the one mentioned in that post was empty and just a placeholder.)

Without further ado, here it is:

#
# Rakefile.rb
#
# Copyright (C):: Jan Aerts
# License:: The Ruby License
#
require 'rake'

task :default => :svn_commit

file_list = Dir.glob("lib/**/*.rb")

desc "Create RDoc documentation"
file 'doc/index.html' => file_list do
  puts "######## Creating RDoc documentation"
  system "rdoc --title 'Bio::Graphics documentation' -m TUTORIAL TUTORIAL README.DEV lib/"
end

desc "An alias for creating the RDoc documentation"
task :rdoc do
  Rake::Task['doc/index.html'].invoke
end

desc "Create a new gem"
file 'bio-graphics-1.0.gem' => file_list do
  puts "######## Creating new gem"
  system "gem build bio-graphics.gemspec"
end

desc "An alias for creating the gem"
task :create_gem do
  Rake::Task['bio-graphics-1.0.gem'].invoke
end

desc "Check SVN status"
task :check_svn_status do
  puts "######## Checking SVN status"
  message = String.new
  message << "# SVN status requires manual intervention\n"
  message << "# For items with '?': either svn add or svn propedit svn:ignore\n"
  message << "# For items with '~': don't know yet\n"
  message << "# Please see http://svnbook.red-bean.com/en/1.4/svn-book.html#svn.ref.svn.c.status"

  output = `svn status`
  puts output

  allowed_status = ['A','D','M','R','X','I'] # See http://svnbook.red-bean.com/en/1.4/svn-book.html#svn.ref.svn.c.status

  # each_line works on every ruby version; String#each was removed in ruby 1.9
  output.each_line do |line|
    status = line.slice(0,1)
    if ! allowed_status.include?(status)
      raise message
    end
  end
end

desc "Check if SVN updates available"
task :check_svn_update do
  puts "######## Checking SVN update"
  output = `svn update`
  puts output
  if output !~ /^At revision [0-9]/
    raise "Please update your working copy first"
  end
end

desc "Commit to SVN repository"
task :svn_commit => [:check_svn_update, :check_svn_status, :create_gem, :rdoc] do
  puts "######## Doing SVN commit"
  system 'svn commit'
end


rake -T lists all available tasks:
rake bio-graphics-1.0.gem  # Create a new gem
rake check_svn_status      # Check SVN status
rake check_svn_update      # Check if SVN updates available
rake create_gem            # An alias for creating the gem
rake doc/index.html        # Create RDoc documentation
rake rdoc                  # An alias for creating the RDoc documentation
rake svn_commit            # Commit to SVN repository
The file_list at the top contains all files in the library itself, and will be used to check all timestamps. The file 'doc/index.html' task looks at the timestamp of the index.html file and if it's older than any of the files in file_list, it will regenerate the documentation. If it's newer, nothing happens. Same goes for bio-graphics-1.0.gem.
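
The timestamp comparison can be sketched in plain ruby. This is only an approximation of what a file task checks, not rake's own code: stale? is a hypothetical helper, and the file names are made up for the example.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper, not part of rake: true when the target is missing
# or older than at least one of its source files.
def stale?(target, sources)
  return true unless File.exist?(target)
  target_time = File.mtime(target)
  sources.any? { |source| File.mtime(source) > target_time }
end

Dir.mktmpdir do |dir|
  source = File.join(dir, 'panel.rb')    # stands in for a library file
  target = File.join(dir, 'index.html')  # stands in for the generated docs
  FileUtils.touch([source, target])

  now = Time.now
  File.utime(now, now, source)
  File.utime(now, now - 60, target)      # docs are a minute older than the code
  puts stale?(target, [source])          # the docs would be regenerated

  File.utime(now, now + 60, target)      # docs are now newer than the code
  puts stale?(target, [source])          # nothing would happen
end
```

In the Rakefile itself none of this has to be written out: declaring the file task with file_list as its prerequisites is enough.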

The check_svn_update and check_svn_status tasks just check whether subversion needs some manual intervention before a commit is possible. This should catch conflicts between the working copy and the repository, or files that you forgot to add to SVN.
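
To show what the status check looks for without needing a working copy, here is a small sketch that applies the same first-column test to a made-up piece of "svn status" output. blocking_lines is a hypothetical helper and the file names are invented; unlike the task in the Rakefile, it collects the offending lines rather than raising on the first one, and it skips empty lines.

```ruby
# Status codes the Rakefile treats as safe to commit.
ALLOWED_STATUS = %w[A D M R X I].freeze

# Hypothetical helper: returns the lines of `svn status` output whose
# first column is not one of ALLOWED_STATUS.
def blocking_lines(status_output)
  status_output.each_line.map(&:chomp).reject do |line|
    line.empty? || ALLOWED_STATUS.include?(line[0, 1])
  end
end

# Made-up sample output, for illustration only.
sample = <<~STATUS
  M       lib/bio/graphics/panel.rb
  ?       lib/bio/graphics/new_glyph.rb
  A       samples/glyph_showcase.rb
STATUS

puts blocking_lines(sample)
```

Only the line starting with '?' is reported: that is the file that still needs an svn add or an svn:ignore entry.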

Note: why didn't I use the special Rake::RDocTask instead of the one I use here? Because the built-in RDoc task first removes your whole doc directory, also deleting the subversion metadata contained in it.