Being relatively new to R (less than 2 years), some of its idiosyncracies keep biting me. Just last week I had a dataframe to which I wanted to add a column which was the mean of two other columns. So from
1 9
8 16
2 4
I wanted to go to:
1 9 5
8 16 12
2 4 3
So you'd think (at least I would):
df$new_column <- mean(df$first_column, df$second_column)and you'd be wrong. Just try it out... But it works for adding a column that is the sum of the other two, right? Don't get me wrong: many people are very good in R. Including the colleagues and ex-colleagues who I constantly ask for help. If you're reading this: you know who you are, and thank you.
There has been some talk on FriendFeed lately about the clojure programming language, which is a Lisp-inspired functional language that runs on the JVM. There is also incanter, a "clojure-based, R-like platform for statistical computing and graphics". Apart from the fact that functional languages are said to be good at working with huge data and concurrency, I was interested because, if incanter can do what I need from R, an incanter script is basically just a clojure script. So no switching between ruby/R/ruby/R/ruby/R; just do everything in the same place.
So I borrowed someone's "Programming Clojure" book, got onto some websites, downloaded incanter and gave it a spin this week. And my verdict: I'm impressed. It's a bit of a warp in your brain at first - being a functional language and all - but once you get the hang of it, it is easy to write. Of course I haven't done anything really difficult yet, but
You'll get used to the prefix notation ("+ 1 2") instead of the infix notation ("1 + 2"). At least I did. Maybe it helped that the calculator that I have been using for more than a decade - my trusted HP 42S - also doesn't use infix: "1 2 +".
So let's go for an example. Let's say I have a file with SNPs. For each SNP I have the position, a quality score and the reference and alternative alleles. Suppose I want to know what the ratio of transition over transversion is for a given set of quality score cutoffs. So the file looks like this:
I'd be interested to know how people would do this in R. Please feel free to put solutions in the comments. I can only learn from them...
But how about doing this in clojure/incanter? Being new to the language, I first wrote a script that would add "ti" or "tv" as an extra column, and another one that would calculate ti/tv for those cutoff.
Step 1: Assigning "ti" or "tv" to SNPs
This creates a file "data_file_with_titv.tsv" that has an additional column that either says "ti" or "tv".
Step 2: Calculate ti/tv using different quality score cutoffs
...and this nicely creates the graph that I want.
The code may seem strange, but once you get used to the prefix notation thing you'll be alright.
Of course there are downsides in using clojure/incanter as well. Being a young language, there is no application that can come close to R.app yet. There are plugins for eclipse, netbeans, vi and emacs, but I had trouble installing any of these. There is a one-click incanter Mac application, but it does not provide the same user-friendliness as R.app. My current (well: these few days) setup: I start an interactive clojure/incanter session
I also don't like having to refer to column numbers in some occasions instead of always being able to refer to column names (see the "(nth % 2)" bits in the code). But this might be me: you can probably do that but I just don't know how yet.
Overall, I'd say that my first encounter with incanter has been a great success. The clojure language seems to be very consistent which greatly helps picking it up. You have to make a click in your head, but I'm sure any bioinformatician is easily capable of doing that. I would greatly recommend giving clojure and incanter a try. Get the book. Browse the web. Rewrite one of your small R scripts in incanter. Response on the incanter Google group is also very fast, including from the developer of incanter David Liebke (@liebke)
Further information can be found in the usual places: clojure.org, incanter.org and google. Also check out data-sorcery.org, the incanter blog.
So let's go for an example. Let's say I have a file with SNPs. For each SNP I have the position, a quality score and the reference and alternative alleles. Suppose I want to know what the ratio of transition over transversion is for a given set of quality score cutoffs. So the file looks like this:
1 123 29 A GFor a list of quality score cutoffs (let's say [10,40,60,80,100]), I wanted to get the ratio of transitions to transversion. Output would look like this:
1 245 93 C G
1 832 51 C T
1 1234 63 T G
1 2345 75 A C
1 8315 9 C A
1 9213 59 T G
...
10 1.8
20 1.9
40 2.3
60 2.3
80 2.4
100 2.4
I'd be interested to know how people would do this in R. Please feel free to put solutions in the comments. I can only learn from them...
But how about doing this in clojure/incanter? Being new to the language, I first wrote a script that would add "ti" or "tv" as an extra column, and another one that would calculate ti/tv for those cutoff.
Step 1: Assigning "ti" or "tv" to SNPs
(use '(incanter core io charts))
(defn ti?
"Checks if two alleles constitute a transition"
[allele-1 allele-2]
(if (contains? #{"AG" "CT"} (apply str (sort [allele-1 allele-2]))) true false)
)
(defn tv?
"Checks if two alleles constitute a transversion"
[allele-1 allele-2]
(if (contains? #{"AC" "CG" "GT" "AT"} (apply str (sort [allele-1 allele-2]))) true false)
)
(defn ti-or-tv
"Returns 'ti' if two alleles are transition, otherwise returns 'tv'"
[allele-1 allele-2]
(if (ti? allele-1 allele-2) "ti"
(if (tv? allele-1 allele-2) "tv" "other"))
)
(def data
(read-dataset "data_file.tsv" :header true :delim \tab)
)
(def titv-result (map #(ti-or-tv (:ref %) (:alt %)) (:rows data)))
(def data
(col-names
(conj-cols data titv-result)
[:chr :pos :qual :ref :alt :titv]))
(save data "data_file_with_titv.tsv")
This creates a file "data_file_with_titv.tsv" that has an additional column that either says "ti" or "tv".
Step 2: Calculate ti/tv using different quality score cutoffs
(use '(incanter core io charts))
(defn filter-qual
"Returns only rows of a dataset where quality score is better than cutoff"
[ds cutoff]
(sel ds :filter #(> (nth % 2) cutoff)))
(defn count-ti
"Count the number of rows that have 'ti' in the 6th column"
[ds]
(nrow (sel ds :filter #(= (nth % 5) "ti"))))
(defn count-tv
"Count the number of rows that have 'tv' in the 6th column"
[ds]
(nrow (sel ds :filter #(= (nth % 5) "tv"))))
(defn titv
"Calculate the ti/tv ratio in a given dataset"
[ds]
(float (/ (count-ti ds) (count-tv ds))))
(def filename
"data_file_with_titv.tsv")
(def data
(read-dataset filename :header true :delim \tab)
)
(apply max (sel data :cols :qual)) ; => max qual = 527.58
(def cutoffs (range 10 500 10))
(def titv-values (map titv (map #(filter-qual data %) cutoffs)))
(view (xy-plot cutoffs titv-values))
...and this nicely creates the graph that I want.
The code may seem strange, but once you get used to the prefix notation thing you'll be alright.
Of course there are downsides in using clojure/incanter as well. Being a young language, there is no application that can come close to R.app yet. There are plugins for eclipse, netbeans, vi and emacs, but I had trouble installing any of these. There is a one-click incanter Mac application, but it does not provide the same user-friendliness as R.app. My current (well: these few days) setup: I start an interactive clojure/incanter session
java -Xmx800m -cp /Library/Clojure/lib jline.ConsoleRunner clojure.mainand have a TextMate window open. I just write my script in TextMate and copy/paste to the command line. Not ideal, but workable.
I also don't like having to refer to column numbers in some occasions instead of always being able to refer to column names (see the "(nth % 2)" bits in the code). But this might be me: you can probably do that but I just don't know how yet.
Overall, I'd say that my first encounter with incanter has been a great success. The clojure language seems to be very consistent which greatly helps picking it up. You have to make a click in your head, but I'm sure any bioinformatician is easily capable of doing that. I would greatly recommend giving clojure and incanter a try. Get the book. Browse the web. Rewrite one of your small R scripts in incanter. Response on the incanter Google group is also very fast, including from the developer of incanter David Liebke (@liebke)
Further information can be found in the usual places: clojure.org, incanter.org and google. Also check out data-sorcery.org, the incanter blog.
Update: Tim Yates has created an R version of the incanter solution above. I'm impressed. There's no way I would have been able to write that... He's now in my "R guru" box. See here
