Monday, 30 July 2007

ActiveRecord and mysql: show my databases

Working on a ruby API for the Ensembl databases, I bumped into the issue of having to connect to a database without knowing its name.

The ensembl database server hosts databases for each species. Every two months or so there's a new release, which means a new database for every single species. To see what databases are there, you can log in to the ensembl server with mysql:

mysql -h ensembldb.ensembl.org -u anonymous

(for more information, see here).

The command "show databases;" on the mysql prompt lists a total of 1035 databases at the moment, a short selection looks like this:

bos_taurus_core_41_2d
bos_taurus_core_42_2e
bos_taurus_core_43_3
bos_taurus_core_44_3a
bos_taurus_core_45_3b
bos_taurus_est_36_2
bos_taurus_est_37_2a
homo_sapiens_core_36_35i
homo_sapiens_core_37_35j
homo_sapiens_core_38_36
homo_sapiens_core_39_36a
homo_sapiens_core_40_36b
homo_sapiens_core_41_36c
homo_sapiens_core_42_36d
homo_sapiens_core_43_36e
homo_sapiens_core_44_36f
homo_sapiens_core_45_36g
homo_sapiens_core_expression_est_34_35g
homo_sapiens_core_expression_est_45_36g
homo_sapiens_core_expression_gnf_34_35g
homo_sapiens_core_expression_gnf_45_36g

To connect to the homo_sapiens_core_45_36g database, type "use homo_sapiens_core_45_36g;" at the mysql prompt. However, as all 'core' databases share the same schema, the API applies to all of these species and just has to connect to a different database for each. But how do you go about that? You could provide the full database name in the establish_connection statement, but having to memorize these full names, or having to open mysql connections before writing your scripts, is, to say the least, far from optimal. So how do you query a database server without connecting to a particular database?

Basically, you make a connection to the host without specifying a database, and send the raw sql query "show databases;" over that connection. The code below does just that.


require 'rubygems'
require 'active_record'

ENSEMBL_RELEASE = 45
DB_ADAPTER = 'mysql'
DB_HOST = 'ensembldb.ensembl.org'
DB_USERNAME = 'anonymous'
DB_PASSWORD = ''

# Connects to the server without selecting a database, so we can ask it
# which databases it hosts.
class DummyDBConnection < ActiveRecord::Base
  self.abstract_class = true

  establish_connection(
    :adapter  => DB_ADAPTER,
    :host     => DB_HOST,
    :database => '',
    :username => DB_USERNAME,
    :password => DB_PASSWORD
  )
end

# Abstract parent class for all models in a 'core' database. Call
# CoreDBConnection.connect(species) once to hook it up to the right
# database for the release defined above.
class CoreDBConnection < ActiveRecord::Base
  self.abstract_class = true

  def self.connect(species)
    # Ask the server for all database names and pick the core database
    # for this species and release.
    db_name = DummyDBConnection.connection.select_values('show databases').select do |name|
      name =~ /#{species}_core_#{ENSEMBL_RELEASE}/
    end.first

    if db_name.nil?
      warn "WARNING: No connection to database established. Check that the species is in snake_case (was: #{species})."
    else
      establish_connection(
        :adapter  => DB_ADAPTER,
        :host     => DB_HOST,
        :database => db_name,
        :username => DB_USERNAME,
        :password => DB_PASSWORD
      )
    end
  end
end



And then just have your classes (e.g. CoordSystem, SeqRegion, Gene) inherit from CoreDBConnection instead of ActiveRecord::Base.
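
For example, something along these lines (a sketch only; the set_table_name calls are there because the Ensembl core tables have singular names such as gene and coord_system):

class CoordSystem < CoreDBConnection
  set_table_name 'coord_system'
end

class Gene < CoreDBConnection
  set_table_name 'gene'
end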

To make the actual connection, start your script with:

CoreDBConnection.connect('bos_taurus')
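
From then on, the usual ActiveRecord calls should just work; purely as an illustration, using the Gene class sketched above:

puts Gene.count               # number of genes in the bos_taurus core database
first_gene = Gene.find(:first)
puts first_gene.inspect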


I'm currently at Ensembl for a week ("Geek for a Week") to work on the full-blown ruby API, and am planning to give an introduction on how to use it in a later post.

Wednesday, 25 July 2007

How do you process literature?

A quick glance at the side of my desk reveals two stacks of manuscripts to read; each stack about 20cm high. Sounds familiar? There seems to be a major task in front of me to process all that.
First thing to do is to identify what caused those piles in the first place. The answer: no system that I'm satisfied with for reference management. Of course, there is software like Reference Manager and EndNote, as well as websites like Connotea and CiteULike. But they all have one major flaw: they are not suited to storing the knowledge gained from those papers. Entering a reference to those papers in the software is not the same as going through them and extracting useful information. Sure, they do have a notes field where you can jot down some short remarks, but often knowledge is much more easily recorded and remembered in little graphs and pictures than in words. There's reference management, and there's knowledge management.

What do I want my system to look like? First of all, it should be searchable. The tagging system provided by CiteULike/Connotea seems good for that. Also (and this might seem illogical for a bioinformatician), the system should not be fully automatic or even electronic, but analog. Why? Just pressing a button to, for example, add the abstract from a paper to the system gives a sense of... what's the word in English: volatility? For some things you should use the help of a computer, and for some you shouldn't. There's a difference between using Excel to repeat the same calculation 50 times, and trying to use a PC for storing knowledge. It's me who needs to store that knowledge, not the computer. If it were the computer's job, I could always go back to Google instead of making the effort of using a reference manager in the first place. I've played around with zotero and personal wikis in the past, and they just didn't do the trick: I still ended up just copy-pasting the information instead of absorbing it.

Another advantage of using an analog system is that when you feel your productivity behind your computer is suboptimal, you can always take your cards, find yourself a quiet place, put your feet on a desk, and flick through the things you wrote down. Slippers and pipe are optional.

During my PhD a few years ago, I used a system that was exclusively based on index cards. The inspiration came from Umberto Eco's book "Come si fa una tesi di laurea" ("How to Write a Thesis", 1977), in which he explains how he handles the knowledge for his book research. For each manuscript, I'd make a new card. The front contained an identifier, the paper title, full reference and keywords. On the back I'd write down what I had to remember from that paper, including little graphs, schemas and stuff. I've got to admit that a drawback of using these cards was that they were not easily searchable, but linking them worked quite well with a bit of discipline.
During those years, I used the index card system both as reference manager and as knowledgebase. Although it did work to satisfaction, the role of reference manager should be fulfilled by a better tool.

Now how could I implement something like that in a workflow? Basically, any new paper to be read should be entered in CiteULike and tagged as 'to_read'. When I've got time to read it: see if it's necessary to print it out, or preferably read it from the screen (we want to be nice to the trees, don't we?). When I've read the manuscript and there is interesting information to remember, only then do I create an index card. In case it's a landmark paper and/or I've been adding a lot of comments and markings in the text: keep the printout as well, and note on the index card that I've kept it.
Let's try this out for a few weeks and see where it goes...

BTW: for a knowledgebase system based on index cards taken to the extreme, see PoIC (Pile of Index Cards).

Wednesday, 18 July 2007

Documenting one-off parsers

A lot of day-to-day work consists of parsing data files to transform the contents from one format into another or to create statistics. However, when you have to get back to those scripts at a later stage - you need something similar in another project or you notice that something along the way must have gone horribly wrong - it can often be quite hard to figure out what the script actually did. Having meaningless script filenames like ParseBlast.rb doesn't help either. (Parse BLAST into what?)

I must say that things improved a lot when I switched from perl to ruby. Trying to understand a perl script that I wrote a couple of weeks earlier was a real pain, while I normally have no problems understanding ruby code that I wrote months ago... That has a lot to do with the simple syntax and expressiveness (is that a word?) of that language.

But just understanding the code is not enough. To be able to assess if you could use a script from an earlier project in a new one, you often also have to get hold of information in that script that is not inherently encoded in its code: what was the project? What did the input look like? What did the output have to look like? That's where the script documentation comes in, because having that information easily available greatly reduces the time you need to assess if you should copy-paste-adapt the script, or just start from scratch.

What I try to do, is always use the same template when starting a new parsing script. Even when the parsing itself would only take 5 lines, I try to use this whole template. A quick walk-through:

Lines 2-32: The documentation, consisting of:
  • script name
  • short usage message
  • a description of what the script does
  • list of arguments that can be used
  • input format
  • output format
  • contact information
Lines 38-40: Definitions of classes that are used later on in the script itself
Lines 42-51: Parsing the arguments to the script. If the user (i.e. me when I run it) uses --help or an argument that does not exist, he automatically gets the documentation of the script. If I used the -a flag here (which isn't defined), the output to the screen would be

USAGE
-----
./this_script.rb [ -h | --help ]
                 [ -i | --infile | < ] your_input.txt
                 [ -o | --outfile | > your_output.txt ]


Lines 53-58: Logging of the options that were used in running the script.
Lines 60-70: Create the input and output streams.
Lines 72-75: The actual parsing code. This is the bit that does the work (using the classes described in lines 38-40).
Lines 77-79: Clean up.

Here's the complete template.

#!/usr/bin/ruby
# == NAME
# this_script.rb
#
# == USAGE
# ./this_script.rb [ -h | --help ]
#                  [ -i | --infile | < ] your_input.txt
#                  [ -o | --outfile | > your_output.txt ]
#
# == DESCRIPTION
# Description of what this script does...
#
# == OPTIONS
# -h,--help:: Show help
# -i,--infile=INFILE:: Name of input file. STDIN if not defined.
# -o,--outfile=OUTFILE:: Name of output file. STDOUT if not defined.
#
# == FORMAT INPUT
# >gi|4531835|bla
# ACTTACCGACCGACTGACTACTTATGCCA
# >gi|4861534|blabla
# CTACCCCATCTACCGGGGCTCGACT
# ...
#
# == FORMAT OUTPUT
# 4531835 29
# 4861534 25
# ...
#
# == AUTHOR
# my full contact information

require 'rdoc/usage'
require 'optparse'
require 'ostruct'
require 'logger'
require 'yaml'    # needed for options.to_yaml in the logging section

### Define classes here
class MyClass
end

### Get the script arguments and open relevant files
options = OpenStruct.new()
opts = OptionParser.new()
opts.on("-h","--help",
"Display the usage information") {RDoc::usage}
opts.on("-i","--infile", "=INFILE",
"Input file name") {|argument| options.infile = argument}
opts.on("-o","--outfile", "=OUTFILE",
"Output file name") {|argument| options.outfile = argument}
opts.parse! rescue RDoc::usage('usage')

log = Logger.new(File.new('this_script.log', File::WRONLY | File::TRUNC | File::CREAT))
log.level = Logger::INFO # or: DEBUG, WARN, FATAL, UNKNOWN

log.info('Script this_script.rb started')
log.info('Options:')
log.info(options.to_yaml)

if options.infile
  input_stream = File.open(options.infile)
else
  input_stream = $stdin
end

if options.outfile
  output_stream = File.new(options.outfile,'w')
else
  output_stream = $stdout
end

### Actually do some stuff
input_stream.each_line do |line|
  output_stream.puts line
end

### Wrap everything up
output_stream.close
input_stream.close

(If I wanted this a bit DRYer, I'd have a constant with the script name at the top so I wouldn't have to repeat that filename over and over. To be done...)
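
Something along these lines would do (just a sketch of the idea, with a made-up SCRIPT_NAME constant; it's not in the template above yet):

SCRIPT_NAME = 'this_script.rb'

log = Logger.new(File.new(SCRIPT_NAME.sub(/\.rb$/, '.log'),
                          File::WRONLY | File::TRUNC | File::CREAT))
log.info("Script #{SCRIPT_NAME} started")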

If I wanted to over-organize, I'd create a little database with the descriptions of those scripts so I could search through them. Even though that would have its use from time to time, that would be taking it too far. Having that documentation and a standard way of providing the command line arguments is enough for me for now.

I must admit I let this slip in the last couple of months, which doesn't mean it didn't work in the months before that. That's just what happens when you go on holiday and completely forget the habit afterwards.

UPDATE: InfiniteRed blogged about a similar approach later.

Wednesday, 11 July 2007

A choice of databases, or PostGres vs SQLite

When managing data in bioinformatics, one of the main tools you use is a database to store the stuff. In many cases flatfiles are sufficient, but sometimes you need the flexibility and reliability of a nice relational database. (Note: Excel is not a database.) Actually, in quite a few cases it does make sense to use flat text files instead of databases, but that's the subject of another post.

As there is a plethora of RDBMSs (relational database management systems) available, the first thing you have to ask yourself is which system to use. There's MySQL, Oracle and others. But for the purpose of this post, I'll focus on PostgreSQL (http://www.postgresql.org) and SQLite3 (http://www.sqlite.org), because that's what I normally use.

For the big projects with a lot of data, I normally use postgres, while smaller projects are typically served by sqlite databases. Both of these, however, have their good and bad sides. The choice depends on a number of factors:

Installation. The main difference between postgres and sqlite3 is that one is based on running a server and the other is not. To be able to use postgres, you'll have to install the software, create a postgres user on your system and have that user start the postgres server. Next thing to do is create a user within the database system and grant it the right permissions (e.g. to create other users or databases). That's a bit more complex than installing sqlite3.

Database creation. To create a new database in PostgreSQL, you issue createdb MyDatabase. This will add the database to your RDBMS. You can get into the database with psql MyDatabase. In contrast, sqlite3 databases are just single files like any other. Create a new database simply by calling it: sqlite3 my_database.s3db. This will create a file in your present directory called my_database.s3db. Note: you don't have to use the .s3db extension, but that makes it recognizable for the SQLiteAdmin tool (see below).

Use. Once you've got your databases created, their use is really similar. On the command line, use the psql or sqlite3 command. On Windows, you can use e.g. pgAdmin and SQLiteAdmin.
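
The same goes for talking to them from ruby. With ActiveRecord (as in the Ensembl post above), switching between the two is mostly a matter of changing the connection settings; a minimal sketch, with made-up database and file names and credentials, and assuming the relevant driver gems are installed:

require 'rubygems'
require 'active_record'

# PostgreSQL: talk to the running server
ActiveRecord::Base.establish_connection(
  :adapter  => 'postgresql',
  :host     => 'localhost',
  :database => 'MyDatabase',
  :username => 'me',
  :password => ''
)

# SQLite3: just point at the file on disk
ActiveRecord::Base.establish_connection(
  :adapter  => 'sqlite3',
  :database => 'my_database.s3db'
)

Most of what comes after the connection (models, queries) then stays the same.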

I must say that psql is a bit easier on day-to-day typing, because it's got tab-completion, which I haven't been able to activate yet in sqlite3. I should spend some time one day to configure sqlite exactly like I want it.

Locking. This is what bit me in the ankles with sqlite3... According to the website: "SQLite version 2.8 allowed multiple simultaneous readers or a single writer but not both. SQLite version 3.0 allows one process to begin writing the database while other processes continue to read. The writer must still obtain an exclusive lock on the database for a brief interval in order to commit its changes, but the exclusive lock is no longer required for the entire write operation." Indeed, I could load hundreds/thousands of records while simultaneously querying the database. But at what looked like a random moment, after hours of loading, it would choke, leaving me with a partially populated database. Any small child can tell you that this is not what you want to happen. I've never encountered something like that in postgres, and probably never will.

Backup. Because sqlite is file-based, making a backup of a database is nothing more than taking a copy and moving that copy to another hard disk or burning it on a CD-ROM. To make backups of postgres databases, we typically use pg_dump, which dumps the contents (and, if you want, the schema as well) of a database in a big-ass text file. So the backup is outside of the RDBMS itself.

So both RDBMS have their advantages and disadvantages. For huge projects, I tend to use PostGres, although I've started to use sqlite more and more lately. If only it wouldn't choke on the locking...

Thursday, 5 July 2007

Bioinformatics and labbooks

What I've really struggled with in the first years after I ended up in bioinformatics (still in denial back then) was how to record what I'd done. Researchers in a wet-lab environment typically use a labbook in which they write down what protocols they used for an experiment, together with the results and nice pictures of their PCR electrophoresis gels. This is a bit more difficult for bioinformaticians. Why? I think because the physical actions that are required to do the research are less pronounced. Filling out a PCR plate, putting it into a PCR machine and loading the result on a gel is a bit more effort than pressing Enter. When we code, what we actually do is create the PCR machine hardware. The programming language acts as the nuts and bolts used to create that machine. The effort for the bioinformatician is in building the hardware, not running the tests themselves (that's the sweat and tears of the servers, not ours).

So I was pleasantly surprised when I arrived at my present job 2-and-a-bit years ago when they told me they even had an SOP on how to perform and record bioinformatics work. Now SOPs can be restrictive and more about paperwork than actually helping you out, but coming from an environment where everything (or rather: nothing) goes regarding logging bioinformatics work, this piece of paper was a big relief.

The main principle of a bioinformatics SOP is that you have to be able to record all steps you followed to transform one piece of information into another. Same as what a lab-oriented SOP does: "how do you get from a BAC library and a pair of PCR primers to knowing which BAC is positive for that marker". But of course there are other things to record than in a lab. The main principle is that we have to be able to prove that the data we use are what we say they are. Same goes for the scripts.

The input data. The starting point for most of my dry-lab experiments is data that I downloaded from the internet. So what do I (have to) record? The URL of the file and the md5sum of the file once it's downloaded. That md5sum makes sure that, in case I have to reanalyze, I can check whether that big sequencing centre has changed the contents of their files (of course without telling us).
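
Getting that md5sum from ruby only takes a few lines; a minimal sketch using the standard digest/md5 library (the chunked read is just so genome-sized files don't have to fit in memory):

require 'digest/md5'

file = ARGV[0]                  # e.g. the file you just downloaded
md5 = Digest::MD5.new
File.open(file, 'rb') do |f|
  while chunk = f.read(1024*1024)
    md5.update(chunk)
  end
end
puts file + "\t" + md5.hexdigest

(The md5sum command line tool does the same job, of course; the point is that the checksum ends up in the labbook together with the URL.)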

The scripts. Same goes for the scripts. Apart from the fact that it's always a good thing to document your scripts, you also have to be able to prove afterwards that that actually was the script you used, and not just some other piece of code with the same name. Can be done with md5sums again.
It does happen that I log the md5sum and still change the script afterwards (e.g. setting constants to another value). As long as that is logged as well, we can reconstruct the original script. Of course it is better to circumvent that problem by creating generic code, but sometimes the effort of doing just that outweighs the benefits gained. (Might be a subject of a later post.) Having the md5sum of the script is the equivalent of logging the manufacturer and type of PCR machine you used.

The output. The combination of identity-checked input and scripts should always produce the same output. For completeness' sake, the SOP tells us to also log the md5sum again...

The system around it. So how do we actually do the recording itself? Where I work, the end product has to be a filled-in labbook. What we do, is use a task tracker (RT Request Tracker) to, well, track our tasks (duh). The moment we start a new project (let's define that here the GTD-way as anything that consists of more than one physical action), we create a new ticket (e.g. "Identify syntenic region between species A and B") and log everything in there: the project directory, background information, md5sums, workflows, interpretation of the results. When all is finished, we make a hard copy (well: print it out) and glue it into our labbooks.

In some settings it might be more sensible to log things in a wiki, as explained by Mike at bioinformaticszen (here), where he talks about using a hyperlinked document or a wiki to track what you've done. Of course that can make sense in many environments (i.e. if you don't have to care about audits and are tracking stuff merely for yourself), but the moment you have to be able to present stuff for audits or patent applications, it's critical that the documentation you generated is immutable. Don't get me wrong: I'm a big fan of the wiki-way (and use one at work as well), but not as a labbook. What I do find a wiki useful for is operating guidelines, which are allowed to change over time.

Tuesday, 3 July 2007

Six months of Getting Things Done

It's the start of July and about 6 months after I started implementing the Getting Things Done meme. And, man, did it change things... In this post, I'll share some of my experiences and the obstacles I encountered.

What is that Getting Things Done already?

I'm not going to explain the whole thing here. There are better resources for that (for a list of URLs, see the bottom of this post). Most importantly: get the book "Getting Things Done: The Art of Stress-Free Productivity" by David Allen. Basically, GTD is a way of getting the overwhelming amount of 'stuff' we have to remember to do out of our heads and into a manageable system.

Implementing the GTD system has helped me a lot in the last half year. For starters, I got things done (more than I would normally have, I think) while at the same time it allowed me to have less on my mind. I no longer have that dreaded nagging feeling of "I should remember to do this" while knowing that I would forget something else by remembering it. The time that I couldn't get to sleep because I had to remember this and this and that and the other is now a distant memory. It also made it easier to quickly switch between the different jobs I had to do. I'm not yet at the stage of what the book calls a "mind like water" (i.e. neither over- nor underreacting to anything), but it's started to get a bit fluid. Stuff like responding inappropriately to email, projects or thoughts about what I need to do (the over- or underreacting) leads to less effective results.

* What works / my system

What parts of the GTD system work for me? Up till now, it looks like I mainly focussed on the day-to-day tasks rather than the long-term goals and someday/maybe.

The hardware:

A good pen.

A filofax. After trying different low and high-tech ways to implement GTD, I finally ended up using a filofax. It's a black pocket-size (8x12cm) one called Identity. I renamed the different tabs in it to 'Next Actions', 'Projects', 'Waiting For', 'Someday', 'Calendar' and 'Lists'.

Project folders. I started creating project folders both at work and at home. I used to take the same notebook to all my meetings to take my minutes. That's now shifted to taking a single piece of paper to them, taking my notes and then archiving that paper in the project folder. It's always good to just be able to take a single project folder to a meeting containing all minutes and other supporting information.

How I use that hardware:

Next Actions and Projects. Always having the filofax with me, I instantly write down anything that comes into my head that I have to remember to do. At that moment, I don't make the decision if I have to put it in next actions or projects or someday/maybe. I just jot it down on my next actions list. The moment that I come to actually doing that thing, it might be sensible to convert it into a project or a someday/maybe item. At any one moment, I have 40 or so items on my next actions list.
David Allen talks about having different next action lists for different contexts: a list for stuff to do at home (@home), a list for stuff to do at the office (@office), while at a computer (@pc), while with a phone (@phone) or when going shopping (@shop). That definitely didn't work for me. I ended up having just one next actions list, using the type of bullet to distinguish between the different contexts. The different bullet point types I use are: a triangle (i.e. a stylized tear-drop for the tears of sweat I shed at work), a square with its bottom line missing (a stylized version of a house), a little phone and a dollar-symbol.



I noticed that it's really important to make a real distinction in how you phrase the next actions and projects. Projects are basically things you have to do that will require more than 1 step, while next actions are the indivisible counterpart. It really helps if you phrase the next actions as verbs ("write unit test"), while the projects are phrased as end-points rather than verbs ("API published" instead of "create API").

Calendar. I've been relatively successful in putting all my meetings in the filofax calendar. To remember to do stuff with a given deadline, I write a note in my calendar, for example a fortnight in advance, that says "add to next actions: write poster abstract". I do something similar for recurring events: I'll fill in the next ten or so, and add a note at the last occurrence to fill in the next ten.

Templates. The filofax came with a bunch of empty templates to use. As it didn't take too long to get through those, I created a new template in OpenOffice with a grid and showing where to punch the holes. Just printing it out and making double-sided copies gives me all the empty sheets I want.

* What kinda works

The weekly review. This is one of the cornerstones of the system, where you take a little time once a week to go over your next actions, projects and other lists. Until recently, I've neglected this quite a few times. However, it's easy to pick up again and that's what I did. What I do, is purge the next actions list (taking the time to rewrite the actions on virgin sheets), check that I have at least one next action related to each project, and ask myself if I have to chase people on my waiting_for list.

Empty inbox. One of the things you _can_ do is keep an empty inbox. I managed to keep my inbox empty for a few weeks, but let it slip again. At the moment there are 1060 mails in it, but I know it will be relatively straightforward to do the big reorganize/purge exercise again.

* What doesn't work (yet)

Tickler file. The tickler file is a set of folders to organize paperwork that has to be done by a certain date (see the Wikipedia page for GTD). I won't implement this at work, and we haven't really decided yet whether it would be useful at home.

GTD software. Even though I'm a bioinformatician, I noticed that using software to keep track of my lists doesn't do it for me. I've tried ThinkingRock, BackPack (from 37signals) as well as a host of other little applications. In the end, not being able to use that system while I'm on the bus, in a shop or just sitting in the living room made it clear that I had to go for the analog version.

Literature. This is what I'm annoyed about the most at the moment. Even though it's not really part of GTD, reading literature should fit into the bigger system in some way or another. I haven't found out how yet. I like to print out the papers and make notes directly on them, but that's no way of organizing the information contained in them. I'm thinking about using CiteULike as a reference manager (don't like Reference Manager or Endnote) and making notes using Adobe Acrobat on the PDF. To be continued...

* The future

I noticed that the elements of GTD that I'm using are focused on day-to-day work. I'm not really happy with how I use the someday/maybe list in my filofax, do not have a list of big goals, ... So while I will continue to use this system in the next couple of months, I'll probably try working on the big picture as well. I might keep a couple of lists at home to do that. It's not necessary to carry the list with things you want to do one day or your major goals with you all the time.

Conclusion

Based on what I've experienced, it is really worthwhile to implement the Getting Things Done strategy to get to stress-free productivity. Start with reading the book. It's important to find your own implementation and decide what things you want to implement or not: next_action lists, empty inbox. You don't have to bring every suggestion into practice. There are many things in the book that I decided I wouldn't need.

It's really easy to start following the GTD philosophy (although becoming a black-belt requires a lot of work), and it's no problem if you unintentionally fall off the wagon after a while: it's straightforward to jump back on. Just start using those lists again.

Some really good tips on starting with GTD can be found in this blog entry.

And now back to some real work...

Links

* The book: see Amazon "Getting Things Done: The Art of Stress-Free Productivity"
* Introduction of concepts
* How to start
* Big list of GTD software