Thursday, 12 August 2010

VCF, tab-delimited files and bioclojure

A lot of the work I do involves extracting data from VCF files ("Variant Call Format"; see http://bit.ly/apUbi8). It's tab-delimited but not quite: some of the columns contains structured data rather than just a value, and the format of these columns might even be different for every single line.

An example line (with the header):

#CHROM POS   ID REF ALT QUAL   FILTER INFO                                            FORMAT   SAMPLE1
1 12345 . A G 249.00 0 MQ=23.66;DB;DP=89;MQ0=26;LowMQ=0.2921,0.2921,89 GT:DP:GQ 1/1:89:99.00


The INFO field is actually a list of tag/value pairs (except when it's just a tag), and the meaning of the data in the SAMPLE1 column is explained in the FORMAT column. Not only can different INFO tags be present on different lines, but the FORMAT can change line-by-line. Let it be clear that this is quite a bit of a pain to parse. What if I want to know the distribution of the depth at which each SNP is covered (i.e. the DP tag in the INFO field)?

So the first thing anyone does before working with such files: convert to "real" tab-delimited: the INFO field is spread over multiple columns and the same goes for the data in SAMPLE1. There is a vcf2tsv script available in vcftools, but it only extracts part of the data in the INFO field; not all tags in the INFO field will be represented in the tab-delimited output file.

I've tried to write a vcf2tsv method in clojure (see clojure.com and clojure.org). I'm now able to read/write VCF files without the tsv-intermediate using incanter (an R-like statistics environment using clojure), and also have a vcf2tsv command-line script that converts a VCF file to its tab-delimited counterpart, but this time including all the data that is in the file instead of a fixed selection of tags.

The vcf2tsv script convert the above sample data into:

CHROM POS   ID REF ALT QUAL   FILTER INFO-MQ INFO-DB INFO-MQ0 INFO-LowMQ       SAMPLE1-GT SAMPLE1-DP SAMPLE1-GQ
1 12345 . A G 249.00 0 23.66 1 26 0.2921,0.2921,89 1/1 89 99.00


So to generate that histogram of quality scores:

(use 'bioclojure)
(ns bioclojure)
(def snps (load-vcf "data.vcf"))
(with-data snps
(view (histogram ($ :QUAL) :nbins 50)))


The result:


You can download the bioclojure library (which now only has this functionality :-) from github. That page also shows how to use the library. Feel free to fork and add new functionality; I only have time for very small incremental additions :-)