Tuesday, 14 August 2007

A ruby API to the Ensembl database

"Joy to the world, lalaa la laaaa." I can finally announce that I've released the ruby API to the Ensembl core database under the bioruby-annex umbrella. Go here for the release.

What is it?
The Ensembl database stores genetic and genomic data on a variety of species: sequences of chromosomes and positions of features such as genes and polymorphisms. This data is browseable using their genome browser, but is also directly accessible if you connect to their mysql database. A perl API to that database has been available from the start and is used by the ensembl people themselves to handle the data. A java implementation (called Ensj) is also available, but I don't know the status of that one. The ruby version should provide similar functionality to the perl API, albeit for querying only and not for writing to the database.

This API is aimed at the core database. Ensembl also provides the variation and compara databases, but these are not the focus of the current API implementation.

A minimal interface to the data of Ensembl was already available through Mitsuteru Nakao's ensembl.rb library in the bioruby project, and is based on the exportview functionality of Ensembl's web interface. Although very useful, it does not give the full functionality that can be achieved by accessing the database directly.

The ruby API basically provides two things: access to the data in the database, and transformations of those data.
Access to the data. (Virtually) all tables of the database are available through ActiveRecord, with all the automated query methods associated with that ('find_by_anything_you_like'). Say you want to get the object of a transcript with stable_id "ENST00000380593", you'd do
transcript = Ensembl::Core::Transcript.find_by_stable_id('ENST00000380593')
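
If the 'find_by_anything_you_like' part looks like magic, here is a toy sketch of the general idea behind such dynamic finders. It assumes nothing about the database: ToyTranscript, its in-memory rows and the note attribute are made up for illustration, and this is not how ActiveRecord or the Ensembl API is actually implemented.

```ruby
# Toy stand-in for a model class; NOT the real Ensembl::Core::Transcript.
class ToyTranscript
  ATTRIBUTES = [:stable_id, :note].freeze
  attr_reader(*ATTRIBUTES)

  # In-memory "table" used instead of a database.
  @rows = []

  class << self
    attr_reader :rows

    # Store a new row and return it.
    def create(attrs)
      record = new(attrs)
      @rows << record
      record
    end

    # find_by_<attribute>(value) returns the first matching row, or nil.
    def method_missing(name, *args)
      attribute = name.to_s[/\Afind_by_(\w+)\z/, 1]
      return super unless attribute && ATTRIBUTES.include?(attribute.to_sym)

      @rows.find { |row| row.public_send(attribute) == args.first }
    end

    def respond_to_missing?(name, include_private = false)
      attribute = name.to_s[/\Afind_by_(\w+)\z/, 1]
      (!attribute.nil? && ATTRIBUTES.include?(attribute.to_sym)) || super
    end
  end

  def initialize(attrs)
    @stable_id = attrs[:stable_id]
    @note = attrs[:note]
  end
end

# Made-up rows; only the first stable_id comes from the text above.
ToyTranscript.create(stable_id: 'ENST00000380593', note: 'first toy row')
ToyTranscript.create(stable_id: 'TOY00000000002', note: 'second toy row')

hit = ToyTranscript.find_by_stable_id('ENST00000380593')
puts hit.note        # prints "first toy row"

miss = ToyTranscript.find_by_stable_id('no_such_id')
puts miss.inspect    # prints "nil"
```

The point is only that the finder name is parsed at call time, so every column gets a finder without anyone having to write one.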

Transformations of the data. You might have the coordinates of a gene on the chromosome, but actually want them on a contig or supercontig. This is where the Sliceable#transform and Slice#project methods come in. In contrast to the perl API, there is no Sliceable#transfer method, because my interpretation of a 'slice' is slightly different from the perl implementation. Read the tutorial for more information.
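
To give a feel for what going from chromosome to contig coordinates involves, here is a toy sketch of a projection. Everything in it is an assumption made for illustration: the AssemblyLink struct, the two-contig layout and the project_to_contigs helper are hypothetical, strand and orientation are ignored, and the real Slice#project works against the assembly table rather than a hard-coded array.

```ruby
# One row of a toy assembly table: a stretch of a chromosome that is
# covered by a stretch of a contig (1-based, inclusive coordinates).
AssemblyLink = Struct.new(:chr_start, :chr_end, :contig_name, :contig_start)

# Made-up layout: two contigs laid end to end on one chromosome.
TOY_ASSEMBLY = [
  AssemblyLink.new(1,    1000, 'contig_A', 1),
  AssemblyLink.new(1001, 2500, 'contig_B', 51)
].freeze

# Project a chromosome range onto contigs. Returns one hash per contig
# that overlaps the range, in the order the links are listed.
def project_to_contigs(chr_start, chr_end, assembly = TOY_ASSEMBLY)
  assembly.each_with_object([]) do |link, pieces|
    overlap_start = [chr_start, link.chr_start].max
    overlap_end   = [chr_end, link.chr_end].min
    next if overlap_start > overlap_end # this contig does not overlap

    # Shift from chromosome coordinates to contig coordinates.
    offset = link.contig_start - link.chr_start
    pieces << { contig: link.contig_name,
                start: overlap_start + offset,
                stop: overlap_end + offset }
  end
end

pieces = project_to_contigs(900, 1100)
pieces.each { |piece| puts "#{piece[:contig]}:#{piece[:start]}-#{piece[:stop]}" }
# prints:
#   contig_A:900-1000
#   contig_B:51-150
```

A range that crosses a contig boundary comes back as several pieces, which is why a projection returns a list rather than a single set of coordinates.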

Minimal script
Any script using the API has to take these steps:
  1. require the library
  2. include the Ensembl::Core namespace (not strictly necessary, but saves typing)
  3. connect to the database
  4. start doing stuff

So for example:
require 'rubygems'
require_gem 'ensembl-api'

include Ensembl::Core

# Step 3 goes here: connect to the database (see the tutorial for the connection call).

transcript = Transcript.find_by_stable_id('ENST00000380593')
puts "5'UTR: " + transcript.five_prime_utr_seq

How to install
The API has been released as a gem file. You can either download it from the website and install it using the command
gem install ensembl-api-0.9.gem

or export the source from the Subversion repository using the command
svn export svn://rubyforge.org/var/svn/bioruby-annex/ensembl-api

This gem depends on bioruby and ActiveRecord.

UPDATE: The code has been moved from rubyforge to github. Get it from http://github.com/jandot/ruby-ensembl-api

Check the website at rubyforge for the tutorial (based on the perl version) and the rdoc documentation. In addition, your gem directory contains the tests, plus a sample script called examples_perl_tutorial.rb that shows all the functionality of the perl version of the API.

I owe a lot to the Ensembl core team for helping me out when I was at the Ensembl site as a "Geek for a Week"...

Call for help
If anyone is interested in improving the API, don't hesitate to contact me. At the moment, for example, projections between coordinate systems only work if the two systems are directly linked in the assembly table, and projections of the haplotype assembly_exceptions currently raise a NotImplementedError. In addition, it would be very useful if we could add the variation and compara databases to the API.
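
For anyone tempted by the coordinate system problem: one possible starting point (a sketch only, not something the API does today) is to look for a chain of directly linked coordinate systems and project along it step by step. The DIRECT_LINKS list and the projection_path helper below are hypothetical; in the real database the links would have to be read from the assembly table.

```ruby
# Hypothetical, hand-written list of direct links between coordinate systems.
DIRECT_LINKS = [
  %w[chromosome supercontig],
  %w[supercontig contig]
].freeze

# Breadth-first search for the shortest chain of coordinate systems from
# source to target. Returns an array of names, or nil if no chain exists.
def projection_path(source, target, links = DIRECT_LINKS)
  # Links work in both directions, so record each one both ways.
  neighbours = Hash.new { |hash, key| hash[key] = [] }
  links.each do |a, b|
    neighbours[a] << b
    neighbours[b] << a
  end

  queue = [[source]]
  visited = { source => true }
  until queue.empty?
    path = queue.shift
    return path if path.last == target

    neighbours[path.last].each do |nxt|
      next if visited[nxt]

      visited[nxt] = true
      queue << path + [nxt]
    end
  end
  nil
end

p projection_path('chromosome', 'contig') # prints ["chromosome", "supercontig", "contig"]
p projection_path('chromosome', 'clone')  # prints nil
```

Each consecutive pair in the returned path is a direct link, so an indirect projection could in principle be built out of the direct ones that already work.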