Thursday, 7 January 2010

Tipping my toes in mongodb with ruby

Read the excellent post by Neil Saunders on using ruby and mongodb to archive his posts on FriendFeed, prompting me to finally write down my own experiences with mongodb. So here goes...

Let's have a look at the pilot SNP data from the 1000genomes project. The data released in April 2009 contain lists of SNPs from a low-coverage sequencing effort in the CEU (European descent), YRI (African) and JPTCHB (Asian) populations. SNPs can be dowloaded from here; get the files called something.sites.2009_04.gz. The exercise that we'll be performing here, is to get an idea of how many SNPs are in common between those populations.

The input data
The input data contains chromosome, position, reference allele (based on reference sequence), alternative allele and allele frequency in that population. Some sample lines from CEU.sites.2009_04.gz:
1 223336 C G 0.017544
1 224176 C T 0.052632
1 224344 T A 0.824561
1 224419 C T 0.008772
1 224438 C T 0.412281
1 224472 T C 0.122807
What will it look like in a database?
If we would load this into a relational database, we could create a table with the following columns: chromosome, position, ceu_ref, ceu_alt, ceu_maf, yri_ref, yri_alt, yri_maf, jptchb_ref, jptchb_alt, jptchb_maf. However we would end up with a lot of NULLs in that table.

For example: there is a SNP at position 522,311 on chromosome one in the CEU and YRI populations, but not in the JPTCHB population.
chr  pos      ceu_maj   ceu_min   ceu_maf   yri_maj   yri_min   yri_maf   jptchb_maj  jptchb_min  jptchb_maf
1 223336 C G 0.01754 G C 0.47321 C G 0.22034
1 522311 C A 0.05263 C A 0.33036 NULL NULL NULL
(In this example, I have already recoded ref/alt allele to major/minor allele. In some cases such as the SNP at 223,336 the major allele is not the same in each population.)

Actually, of the 21,742,359 SNPs, there are only 4,836,814 (about 22%) is present in each of these populations. Moreover, there are 13,957,866 (64%!) SNPs that are present in only one of the populations. Ergo: loads of NULLs.

This is where you can start thinking of using a document-oriented database for storing these SNP data: each document will be tailored to a specific SNP and will e.g. not refer to the JPTCHB population if it it not present in that population. Enter mongodb. Apparently the "mongo" comes from "humongouos". I hope it will live up to that name...

The above two SNPs could be represented in two json documents like this:
"_id" : "1_223336",
"chr" : "1",
"pos" : 223336,
"kg" : {
"ceu" : {
"major" : "C",
"minor" : "G",
"maf" : 0.017544
"yri" : {
"maf" : 0.473214,
"major" : "G",
"minor" : "C"
"jptchb" : {
"maf" : 0.220339,
"major" : "C",
"minor" : "G"

"_id" : "1_522311",
"chr" : "1",
"pos" : 522311,
"kg" : {
"ceu" : {
"major" : "C",
"minor" : "A",
"maf" : 0.05263
"yri" : {
"maf" : 0.33036,
"major" : "D",
"minor" : "A"

Getting the data in the database
So how do we load this data into a mongo database? You'll obviously need a mongo daemon running, but I'll leave that as an exercise to the reader as it's very easy and well-explained on the mongodb website. My approach was to first import all CEU SNPs straight from a json-formatted file, and then parse the YRI and JPTCHB files

1. Loading the CEU SNPs
The mongoimport tool takes a json-formatted file and - guess what - imports it into a mongo database. Just reformat the lines in the original CEU.sites.2009_04 to json (see below) and run mongoimport --db test --collection snps --file ceu.json. Ceu.json looks like this:
{_id: "1_211", chr: "1", pos: 211, kg: {ceu: {major: "A", minor: "G", maf: 0.201754}}}
{_id: "1_216", chr: "1", pos: 216, kg: {ceu: {major: "T", minor: "A", maf: 0.035088}}}
{_id: "1_229", chr: "1", pos: 229, kg: {ceu: {major: "A", minor: "C", maf: 0.017544}}}
2. Loading the YRI and JPTCHB SNPs
Loading the YRI and JPTCHB SNPs is a bit trickier because we need to check if the SNP is already defined. If it's not we'll just add a new document, but if it is we have to update the existing document and add the population-specific details. For this I use a ruby script with the mongo gem (I already added an additional column to the input files containing a unique ID per SNP consisting of chromosome and position):

require 'rubygems'
require 'mongo'
require 'progressbar'

db ="test")
coll = db.collection('snps')

pbar ='processing', 13759844)'YRI.sites.2009_04.with_uid').each do |line|
uid, chr, pos, allele1, allele2, allele2_freq = line.chomp.split(/\t/)
allele2_freq = allele2_freq.to_f
major, minor, maf = nil, nil, nil
if allele2_freq.to_f <= 0.5
major, minor, maf = allele1, allele2, allele2_freq
major, minor, maf = allele2, allele1, 1 - allele2_freq

snp = coll.find_one(:_id => uid)
if snp.nil?
snp = {:_id => uid, :chr => chr, :pos => pos.to_i, :kg => {:yri => {:major => major, :minor => minor, :maf => maf}}}
kg_data = snp['kg']
kg_data['yri'] = {:major => major, :minor => minor, :maf => maf}

For each SNP, we first get the major and minor allele as well as the minor allele frequency based on the alternative allele frequency. Next, we search the entire SNP collection for the SNP ID (which consists of chromosome_position). If it doesn't exist, we create a completely new document, otherwise we fetch the "kg" data ("kg" stands for "1000genomes") and add a 'yri' key-value pair.

Do the same for JPTCHB and you should get a complete collection.

Summarizing the data
Going back to the question we're asking: how many SNPs are in common among these populations? Again, a ruby script:

require 'rubygems'
require 'mongo'

map = "function() { " +
"var keys = [];" +
"for ( item in this['kg'] ) { keys.push(item) }" +
"emit(keys.sort().join(';'), {count: 1})" +
reduce = "function(key, values) { " +
"var sum = 0; " +
"values.forEach(function(doc) { " +
" sum += doc.count; " +
"}); " +
"return {count: sum}; " +

db ="test")
coll = db.collection("snps")

result = coll.map_reduce(map, reduce)
result.find.to_a.each do |r|
puts ['{', r['_id'], ':', r['value']['count'].to_i, '}'].join(" ")

This script takes 50 minutes to run using a mongo database on my MacBook laptop.

Counting things in a collection makes at least me think of mapreduce. And I had seen mapreduce mentioned on the mongodb website, so clearly wanted to give that a go. Unfortunately, you have to define the map and reduce functions in javascript, which is a bit unsightly within a ruby script, but so be it. For more info, see this blog post by Kyle Banker.

This shows us the following numbers of SNPs:
  • CEU alone: 2,954,418
  • YRI alone: 6,901,757
  • JPTCHB alone: 4,101,691
  • CEU/YRI: 915,476
  • CEU/JPTCHB: 926,406
  • YRI/JPTCHB: 1,105,797
  • all three: 4,836,814
As Pierre would say: that's it.