Friday, 29 June 2007

Naming conventions

Naming conventions. You bump into them every single minute of the day. Naming new directories in your project folder, naming new tables in your database, ... Recently, the issue of naming conventions came more to the foreground for me as I'm trying to write a ruby API to one of our databases (see later).

Two of the most-often-encountered naming schemes are CamelCase (ThisIsACamelCaseString) and snake_case (this_is_a_snake_case_string). And in the case of CamelCase: do you make the very first letter a capital or not? If I'm not mistaken, variables in Java are often CamelCase with a lower-case first letter (thisCouldBeAJavaVariable).
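To make the difference between these schemes concrete, here is a minimal Ruby sketch that converts between them. The method names are made up for this example and are not part of any library; the regular expressions are a simple approach that assumes plain ASCII names.

```ruby
# Hypothetical helpers, for illustration only.

# "this_is_a_string" -> "ThisIsAString"
def snake_to_camel(name)
  name.split('_').map { |word| word.capitalize }.join
end

# "this_is_a_string" -> "thisIsAString" (CamelCase with a lower-case first letter)
def snake_to_lower_camel(name)
  camel = snake_to_camel(name)
  camel.empty? ? camel : camel[0].downcase + camel[1..-1]
end

# "ThisIsAString" -> "this_is_a_string"
def camel_to_snake(name)
  name.gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2'). # split "ACamel" into "A_Camel"
       gsub(/([a-z\d])([A-Z])/, '\1_\2').      # split "sI" into "s_I"
       downcase
end
```

The first substitution in camel_to_snake is only there to handle one-letter words such as the "A" in ThisIsACamelCaseString.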

When thinking of names for directories and files (see also "Organizing yourself as a dry-lab scientist" on BioinformaticsZen), i.e. when there's no set naming convention that you have to follow (unlike, say, variable naming conventions), I tend to use different schemes for directories versus files for some reason. For naming directories, I use an underscore to separate the different parts of the name: I concatenate the date the folder was created, the RT Task Tracker ticket and a description (the latter in CamelCase), for example ~/20070629_RT12345_ThisIsADirectory. For files, I normally use snake_case, except for scripts... Why? Maybe to distinguish those scripts from the data files. For example:

+- Documents
+- Projects
   +- 20070629_RT12345_ThisIsASampleProject
   |  +- input_file.txt
   |  +- output_file.txt
   |  +- log_file.txt
   |  +- ParseInput.rb
   +- 20070629_RT23446_AnotherProject
      +- input_file.txt
      +- output_file.txt
      +- log_file.txt
      +- ParseInput.rb

It would probably not be a bad idea to start to use a default data folder or something to keep all the input, output and other files, and a script folder for the scripts... I should take a look again at the BioinformaticsZen blog.
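The directory naming scheme (date, RT ticket, CamelCase description) is regular enough to generate and parse automatically. A minimal sketch, assuming the ticket is a plain number; both helper names are hypothetical:

```ruby
require 'date'

# Build a name of the form YYYYMMDD_RT<ticket>_<DescriptionInCamelCase>.
# project_dir_name is a made-up helper, not part of any library.
def project_dir_name(date, ticket, description)
  words = description.scan(/[[:alnum:]]+/)
  camel = words.map { |word| word[0].upcase + word[1..-1] }.join
  "#{date.strftime('%Y%m%d')}_RT#{ticket}_#{camel}"
end

# Split such a name back into its parts; returns nil if the name does not match.
def parse_project_dir_name(name)
  match = /\A(\d{8})_RT(\d+)_(\w+)\z/.match(name)
  return nil unless match
  { :date        => Date.strptime(match[1], '%Y%m%d'),
    :ticket      => match[2].to_i,
    :description => match[3] }
end
```

Because the date comes first in YYYYMMDD form, sorting the directory names alphabetically also sorts them chronologically.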

Of course things are completely different when you're coding or setting up databases. In these cases, your preferred programming language will have its own conventions. In ruby, for example, classes are in CamelCase, while variables should be snake_case. All well and good, until they start to bite you in the you-know. I'm trying to create a ruby API to an existing database that would require polymorphic associations. This requires that the values in the something_something_type column be class names, which are CamelCase and singular. But of course, the database has everything in snake_case. I found a workaround to get the thing up and running with snake_case, except that it requires the value to be plural, which it of course isn't. End result: I'll have to get the actual data values in the database changed to get this thing working.
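To illustrate why the case of those values matters, here is a plain-Ruby sketch (no ActiveRecord involved) of looking up a class from the string stored in a type column. The classes Gene and AlternativeTranscript and both helper methods are made up for this example; this mimics the lookup, it is not how the library implements it.

```ruby
# Stand-in classes; in the real API these would be ActiveRecord models.
class Gene; end
class AlternativeTranscript; end

# "alternative_transcript" -> "AlternativeTranscript"
def type_value_to_class_name(value)
  value.split('_').map { |part| part.capitalize }.join
end

# Return the class for a snake_case type value, or nil if no such class exists.
def class_for_type(value)
  name = type_value_to_class_name(value)
  return nil if name.empty?
  Object.const_defined?(name) ? Object.const_get(name) : nil
end
```

A singular snake_case value can be mapped onto a class this way, but a plural one such as "genes" becomes "Genes", for which there is no class. That is the same kind of mismatch as described above, just in the other direction.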

Tuesday, 26 June 2007

Manual genome annotation tools

An important part of genomics and genetics research is to know where your genes of interest lie on the genome and what the gene model looks like. In other words: to know where the exons start and stop, what the UTR boundaries are, and whether there are any polymorphisms in those genes. That's called genome annotation.

With the sequencing of any new genome, the annotation of its genes is the logical next step. This often consists of 2 main phases: there's the automated annotation by BLASTing against known sequences or even de novo gene annotation. The second phase is the manual curation of those automated annotations: biologists or bioinformaticians with a biology background looking at those gene models and correcting things like "there should be an additional exon here", "this exon-intron boundary is 2 bp off" or "this actually is an alternative transcript".

Over the last two years, I've tried out several software packages to deal with the manual curation or annotation of sequences. And I must say: it hasn't been great. Let's walk through them:

The first tool is a commercial package from Invitrogen. I didn't try this tool recently, partly because my experiences with it at my last job were less than impressive. From what I remember, it becomes unusable when you have to handle larger sequences or when you've got a high density of features. Things might have changed in the meantime, but remarks at the water cooler from colleagues do not support that hope.

The Artemis tool from the Sanger Institute in itself is a great tool with a lot of functionality. You can launch the thing using Java Webstart without having to install it on your own computer. However, I found it to be lacking in user intuitiveness and usability. In addition, it looked like it used non-standard ways to store my annotations. I found that any annotations that I made were stored in the original FASTA-file that I loaded as GenBank annotations at the top. Result: the FASTA-file itself became invalid, and it wasn't a GenBank file either. Still, this tool can be very useful for very small projects.

Apollo is the biggy, developed by the FlyBase people in collaboration with the Sanger Institute in the UK. It's the tool that is referred to the most in the community, but not the tool that I'll use anymore. First of all, it is recommended that you have at least 2 gigs of RAM when you try to run it. That's right: 2,000 MB of the stuff. Not your average desktop PC, then... In addition, many people have reported that it crashed on them at random, leaving their unsaved work, well, unsaved. The tool also has a lot of features. Too many, actually. Finding out how to do something can take really long because you've got a haystack of things to go through.

The biggest drawback for me was that I was not able to import my own externally generated results into the tool. You can only do that by creating a GameXML file, which is way too cumbersome to do.

The HAVANA annotation group uses a tool based on AceDB for annotation. I can be brief about this: I think it's a really good tool, but as it's not available to annotators outside the HAVANA group, it has to be discarded as an option.

And then I found Argo (from the Broad Institute). This finally looks like a tool that does what it should do: it has an intuitive interface and overview of your genomic region, it has import and export filters to GFF 1, 2 and 3 as well as GTF. It also allows you to quickly check for non-standard intron-exon boundaries and translations of your gene models and a whole lot of other features. Same as Apollo does, but without making it confusing and complicated. I would suggest this tool in combination with blixem (to check BLAST results in detail).
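As an illustration of what checking for non-standard intron-exon boundaries involves, here is a small Ruby sketch that flags introns which do not follow the canonical GT...AG rule. This is not how Argo does it; it is a simplified example that assumes a forward-strand gene model, 1-based inclusive exon coordinates sorted by position, and a made-up method name.

```ruby
# Return the [start, stop] coordinates of every intron that does not
# begin with GT and end with AG. Hypothetical helper, forward strand only.
#   sequence: genomic sequence as a String
#   exons:    array of [start, stop] pairs, 1-based, inclusive, sorted
def noncanonical_introns(sequence, exons)
  flagged = []
  exons.each_cons(2) do |left, right|
    intron_start = left[1] + 1   # first base after the upstream exon
    intron_stop  = right[0] - 1  # last base before the downstream exon
    intron = sequence[(intron_start - 1)..(intron_stop - 1)].upcase
    unless intron.start_with?('GT') && intron.end_with?('AG')
      flagged << [intron_start, intron_stop]
    end
  end
  flagged
end
```

A real check would also have to deal with the reverse strand and with rarer but valid splice sites, so treat a flagged intron as "look at this one by eye" rather than "this is wrong".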

Thursday, 21 June 2007

Databases and ruby (without rails)

Just bumped into a really nice O'Reilly blog article that combines two of the things I like to work with: GTD and ruby.

As a bioinformatician, I often have to handle quite a lot of data, which I tend to put into databases. Ruby has a fabulous framework in Rails to access and manipulate data, but after having created rails applications for most of those, it became more and more clear that it would be preferable to just have the power of ActiveRecord without having to deploy the whole rails-thing.

Similar to what is discussed in the blog article by Gregory Brown, I created a directory template including the migration code, a connection configuration and a Rakefile. The directory structure looks like this:
+- config
| +- project_config.yaml
| +- load_config.rb
+- db
| +- migrate
| | +- 001_initial_schema.rb
| +- import
+- lib
| +- models.rb
+- Rakefile

Let's walk through this:

  • The project_config.yaml file contains the project name and connection settings to get to the database. For example:

    project:
      name: MyFunkyProject
    database:
      adapter: sqlite3
      name: db/my_funky_project.s3db

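  • The load_config.rb file uses that information to connect to the database.

    require 'rubygems'
    require 'yaml'
    require 'active_record'

    # Holds the project-wide settings read from project_config.yaml.
    class ProjectConfig
      attr_accessor :project_root
      attr_accessor :project_name
      attr_accessor :db_adapter
      attr_accessor :db_name
    end

    $config = ProjectConfig.new
    $config.project_root = File.dirname(__FILE__) + '/..'

    # Read the project name and database settings from the YAML file.
    settings = YAML.load_file($config.project_root + '/config/project_config.yaml')
    $config.project_name = settings['project']['name']
    $config.db_adapter = settings['database']['adapter']
    $config.db_name = settings['database']['name']

    $connection_settings = {}
    $connection_settings[:adapter] = $config.db_adapter
    if $config.db_adapter == 'sqlite3'
      # SQLite is file-based: point at the file relative to the project root.
      $connection_settings[:database] = $config.project_root + '/' + $config.db_name
    else
      $connection_settings[:database] = $config.db_name
    end

    # Make the connection available to all ActiveRecord models.
    ActiveRecord::Base.establish_connection($connection_settings)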


  • The lib/models.rb file contains the... models, obviously. It should 'require' the load_config.rb file to get the connection.

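    require File.dirname(__FILE__) + '/../config/load_config.rb'

    # A task belongs to exactly one project (tasks.project_id).
    class Task < ActiveRecord::Base
      belongs_to :project
    end

    # A project groups any number of tasks.
    class Project < ActiveRecord::Base
      has_many :tasks
    end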

  • The db/migrate/001_initial_schema.rb is used to create the database.

    require File.dirname(__FILE__) + '/../../config/load_config.rb'

    class InitialSchema < ActiveRecord::Migration
      # Create both tables; tasks.project_id links a task to its project.
      def self.up
        create_table :tasks do |t|
          t.column :description, :string
          t.column :project_id, :integer
        end
        create_table :projects do |t|
          t.column :description, :string
        end
      end

      # Undo the migration by dropping both tables.
      def self.down
        drop_table :tasks
        drop_table :projects
      end
    end

  • The import directory will then hold a group of loading scripts, that 'require' the lib/models.rb file. They look something like this:

    require File.dirname(__FILE__) + '/../../lib/models.rb'

    File.open('my_file.txt').each do |line|
      # do something useful
    end
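As a sketch of what the "something useful" in such a loading script could be: the method below parses lines into a structure that can then be handed to the models. The input format (one "project<TAB>task" pair per line, with # for comments) and the method name are assumptions made for this example, not the format of any real input file.

```ruby
# Group task descriptions by project description.
# Assumed input: lines of the form "project<TAB>task"; blank lines and
# lines starting with '#' are ignored. parse_task_lines is a made-up name.
def parse_task_lines(lines)
  tasks_by_project = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    line = line.chomp
    next if line.empty? || line.start_with?('#')
    project, task = line.split("\t", 2)
    next if task.nil? # skip malformed lines rather than guess
    tasks_by_project[project] << task
  end
  tasks_by_project
end

# With the models from lib/models.rb loaded, the result could be stored
# along these lines (left as a comment because it needs a database):
#
#   parse_task_lines(File.readlines('my_file.txt')).each do |name, tasks|
#     project = Project.create(:description => name)
#     tasks.each { |task| project.tasks.create(:description => task) }
#   end
```

Keeping the parsing separate from the database calls means the parsing part can be tested without a database connection.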

As you can see, this directory template - useful as it is for me at the moment - can be rationalized a bit, and I should add tests and stuff. Gregory Brown's article might just give me the right ideas to do that.

Tuesday, 19 June 2007

Anatomy of data integration

The paper "Anatomy of data integration" by Brazhnik & Jones (J Biomed Inf 40:252-269 (2007)) gives a clear high-level overview of what is involved in acquiring data from different sources and integrating them. Apart from talking about information pipelines and conceptual data models, it delves deeper into the concept of types of data elements (DEs). It really all speaks for itself, but it's nice to be able to name things in a meaningful way: things that you already know, but that are still helpful to write down.

Basically, in the context of a data source, you can distinguish between 2 types of DEs: "focal" and "peripheral". The focal DEs are usually mandatory and of high quality, often because they are the primary reason for building the database in the first place (e.g. the assay sequence in a dbSNP record). In contrast, the peripheral DEs are often optional and much more prone to error (e.g. the number of chromosomes sampled in a dbSNP record). As a remark: making peripheral DEs mandatory is asking for trouble. If I don't know the number of chromosomes sampled for a SNP but I'm forced to fill in some number, then that number is wrong data, which is infinitely worse than having a nil value.
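A small Ruby sketch of why a nil is better than a made-up number for a peripheral DE. The record layout, the SNP identifiers and the numbers are invented for this example; only the distinction between the focal assay sequence and the peripheral chromosome count comes from the discussion above.

```ruby
# A dbSNP-like record: assay_sequence is focal (mandatory),
# chromosomes_sampled is peripheral (optional, nil when unknown).
SnpRecord = Struct.new(:id, :assay_sequence, :chromosomes_sampled) do
  # Only the focal element is required for a record to be valid.
  def valid?
    !assay_sequence.nil? && !assay_sequence.empty?
  end
end

# Average the peripheral value over the records that actually have one.
def mean_chromosomes_sampled(records)
  known = records.map { |record| record.chromosomes_sampled }.compact
  return nil if known.empty?
  known.inject(0) { |sum, n| sum + n }.to_f / known.size
end
```

With values of 120, nil and 80 the mean is computed over the two known values. If the unknown value had been entered as 0 just to satisfy a mandatory field, it would silently drag the mean down.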

Meaningful integration can only occur between data sources that have a shared pool of focal DEs. You obviously don't try to integrate two different datasets based on the number of chromosomes sampled... This also means that in order to study correlations between two distant domains, you'd need to build what they call a multi-step integration staircase.

From the viewpoint of the integration itself, the focal DEs can be subdivided into two distinct groups: "integration keys" and "informative elements".
Integration keys are the backbone of the integration: a combination of DEs that identifies exactly the same entity in two sources. They are chosen from the overlapping focal DEs. The informative elements represent the goal of the integration and contain the information that we actually want.
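In code, integrating on a key boils down to a join. The following Ruby sketch assumes each source is an array of hashes and that the integration key is a combination of fields present in both; the field names used in the usage note and the method name are made up for this example.

```ruby
# Join two sources on an integration key (a combination of fields).
# Records that have no partner in the other source are dropped.
# Where both sources carry the same non-key field, source_a wins.
def integrate(source_a, source_b, key_fields)
  index_b = {}
  source_b.each { |record| index_b[record.values_at(*key_fields)] = record }

  source_a.map do |record|
    partner = index_b[record.values_at(*key_fields)]
    partner ? partner.merge(record) : nil
  end.compact
end
```

For example, with [:chromosome, :position] as the integration key, a record carrying :alleles from one source and a record carrying :gene from the other end up as one record holding both informative elements.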

What if there are different data sources for the same data set? To make things worse, it might even be that particular data elements are similar but not the same. For example: you can get SNP data from NCBI and from Ensembl, but people have noticed recently that the same SNP can be annotated on a different strand depending on what database you look at. In this case, a practice of keeping all redundant data along with the information about the source becomes important. The quality of the data might vary between the different sources, and having all information available makes it possible for the researcher to make decisions based on all available evidence. "My SNP is on the forward strand according to NCBI, but on the reverse strand according to Ensembl. I trust Ensembl more, so..."
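Keeping the redundant data along with its source could look like the Ruby sketch below: every source's claim is stored as its own observation, and the decision about whom to trust is made at query time. The structure, the method names and the SNP identifier in the usage note are invented for this example.

```ruby
# One claim by one source about the strand of one SNP.
StrandObservation = Struct.new(:snp_id, :source, :strand)

# All claims for a SNP, as a hash of source => strand.
def strands_by_source(observations, snp_id)
  result = {}
  observations.each do |obs|
    result[obs.source] = obs.strand if obs.snp_id == snp_id
  end
  result
end

# Do the sources disagree about this SNP?
def conflicting?(observations, snp_id)
  strands_by_source(observations, snp_id).values.uniq.size > 1
end

# Strand according to the first source in trusted_sources that has an
# opinion; nil if none of them does.
def preferred_strand(observations, snp_id, trusted_sources)
  by_source = strands_by_source(observations, snp_id)
  source = trusted_sources.find { |name| by_source.key?(name) }
  source && by_source[source]
end
```

With one observation from NCBI ('+') and one from Ensembl ('-') for the same SNP, conflicting? reports the disagreement, and preferred_strand with ['Ensembl', 'NCBI'] as the order of trust picks the Ensembl value. Nothing is thrown away, so a researcher who trusts NCBI more can simply reverse the order.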