Friday, 25 June 2010

Encounter with incanter - about clojure, incanter and bioinformatics


I have been a bit frustrated lately by the fact that for many of my analyses I have to write a ruby script to mangle my data first, then resort to R to add a statistic to each of the datapoints, go back to ruby to mangle the result, repeat, rinse, and finally make plots in R. Of course as a bioinformatician you're used to that and if necessary you write wrapper/pipeline scripts to handle this all for you if you know that this won't be the only time you have to do the analysis. But often you don't know. And breaking the analyses up into chunks just based on the tool you use will help you losing overview.

Being relatively new to R (less than 2 years), some of its idiosyncracies keep biting me. Just last week I had a dataframe to which I wanted to add a column which was the mean of two other columns. So from
1    9
8 16
2 4

I wanted to go to:
1    9    5
8 16 12
2 4 3

So you'd think (at least I would):
df$new_column <- mean(df$first_column, df$second_column)
and you'd be wrong. Just try it out... But it works for adding a column that is the sum of the other two, right? Don't get me wrong: many people are very good in R. Including the colleagues and ex-colleagues who I constantly ask for help. If you're reading this: you know who you are, and thank you.

There has been some talk on FriendFeed lately about the clojure programming language, which is a Lisp-inspired functional language that runs on the JVM. There is also incanter, a "clojure-based, R-like platform for statistical computing and graphics". Apart from the fact that functional languages are said to be good at working with huge data and concurrency, I was interested because, if incanter can do what I need from R, an incanter script is basically just a clojure script. So no switching between ruby/R/ruby/R/ruby/R; just do everything in the same place.

So I borrowed someone's "Programming Clojure" book, got onto some websites, downloaded incanter and gave it a spin this week. And my verdict: I'm impressed. It's a bit of a warp in your brain at first - being a functional language and all - but once you get the hang of it, it is easy to write. Of course I haven't done anything really difficult yet, but

You'll get used to the prefix notation ("+ 1 2") instead of the infix notation ("1 + 2"). At least I did. Maybe it helped that the calculator that I have been using for more than a decade - my trusted HP 42S - also doesn't use infix: "1 2 +".

So let's go for an example. Let's say I have a file with SNPs. For each SNP I have the position, a quality score and the reference and alternative alleles. Suppose I want to know what the ratio of transition over transversion is for a given set of quality score cutoffs. So the file looks like this:
1   123    29   A   G
1 245 93 C G
1 832 51 C T
1 1234 63 T G
1 2345 75 A C
1 8315 9 C A
1 9213 59 T G
...
For a list of quality score cutoffs (let's say [10,40,60,80,100]), I wanted to get the ratio of transitions to transversion. Output would look like this:
10   1.8
20 1.9
40 2.3
60 2.3
80 2.4
100 2.4

I'd be interested to know how people would do this in R. Please feel free to put solutions in the comments. I can only learn from them...

But how about doing this in clojure/incanter? Being new to the language, I first wrote a script that would add "ti" or "tv" as an extra column, and another one that would calculate ti/tv for those cutoff.

Step 1: Assigning "ti" or "tv" to SNPs
(use '(incanter core io charts))
(defn ti?
"Checks if two alleles constitute a transition"
[allele-1 allele-2]
(if (contains? #{"AG" "CT"} (apply str (sort [allele-1 allele-2]))) true false)
)

(defn tv?
"Checks if two alleles constitute a transversion"
[allele-1 allele-2]
(if (contains? #{"AC" "CG" "GT" "AT"} (apply str (sort [allele-1 allele-2]))) true false)
)

(defn ti-or-tv
"Returns 'ti' if two alleles are transition, otherwise returns 'tv'"
[allele-1 allele-2]
(if (ti? allele-1 allele-2) "ti"
(if (tv? allele-1 allele-2) "tv" "other"))
)

(def data
(read-dataset "data_file.tsv" :header true :delim \tab)
)

(def titv-result (map #(ti-or-tv (:ref %) (:alt %)) (:rows data)))

(def data
(col-names
(conj-cols data titv-result)
[:chr :pos :qual :ref :alt :titv]))

(save data "data_file_with_titv.tsv")

This creates a file "data_file_with_titv.tsv" that has an additional column that either says "ti" or "tv".

Step 2: Calculate ti/tv using different quality score cutoffs
(use '(incanter core io charts))

(defn filter-qual
"Returns only rows of a dataset where quality score is better than cutoff"
[ds cutoff]
(sel ds :filter #(> (nth % 2) cutoff)))

(defn count-ti
"Count the number of rows that have 'ti' in the 6th column"
[ds]
(nrow (sel ds :filter #(= (nth % 5) "ti"))))

(defn count-tv
"Count the number of rows that have 'tv' in the 6th column"
[ds]
(nrow (sel ds :filter #(= (nth % 5) "tv"))))


(defn titv
"Calculate the ti/tv ratio in a given dataset"
[ds]
(float (/ (count-ti ds) (count-tv ds))))

(def filename
"data_file_with_titv.tsv")

(def data
(read-dataset filename :header true :delim \tab)
)

(apply max (sel data :cols :qual)) ; => max qual = 527.58

(def cutoffs (range 10 500 10))

(def titv-values (map titv (map #(filter-qual data %) cutoffs)))

(view (xy-plot cutoffs titv-values))

...and this nicely creates the graph that I want.

The code may seem strange, but once you get used to the prefix notation thing you'll be alright.

Of course there are downsides in using clojure/incanter as well. Being a young language, there is no application that can come close to R.app yet. There are plugins for eclipse, netbeans, vi and emacs, but I had trouble installing any of these. There is a one-click incanter Mac application, but it does not provide the same user-friendliness as R.app. My current (well: these few days) setup: I start an interactive clojure/incanter session
java -Xmx800m -cp /Library/Clojure/lib jline.ConsoleRunner clojure.main
and have a TextMate window open. I just write my script in TextMate and copy/paste to the command line. Not ideal, but workable.

I also don't like having to refer to column numbers in some occasions instead of always being able to refer to column names (see the "(nth % 2)" bits in the code). But this might be me: you can probably do that but I just don't know how yet.

Overall, I'd say that my first encounter with incanter has been a great success. The clojure language seems to be very consistent which greatly helps picking it up. You have to make a click in your head, but I'm sure any bioinformatician is easily capable of doing that. I would greatly recommend giving clojure and incanter a try. Get the book. Browse the web. Rewrite one of your small R scripts in incanter. Response on the incanter Google group is also very fast, including from the developer of incanter David Liebke (@liebke)

Further information can be found in the usual places: clojure.org, incanter.org and google. Also check out data-sorcery.org, the incanter blog.

Update: Tim Yates has created an R version of the incanter solution above. I'm impressed. There's no way I would have been able to write that... He's now in my "R guru" box. See here