Sunday, 8 March 2009

The good and bad of genome viewers

Back before the human genome was fully sequenced and NCBI, UCSC and Ensembl started working on visualization, it made a lot of sense to go for linear representations and use tracks for annotation. After all: chromosomes are linear. Using different tracks to show different types of annotation is the next logical step.

But there is not just one human genome on earth; according to Wikipedia there are about 6.76 billion copies as of March 2009. So instead of talking about "the human genome" in those browsers, we talk about "the reference genome". Each person on earth is different, and so is each human genome. (That's putting the reasoning on its head, but never mind.)

Differences between humans such as SNPs and microsatellites can still be shown in the track-based browsers.

Things get more difficult when you're looking at structural variation. Structural variation messes up the backbone of the linear genome browser: you can't show differences between individuals in one straight line. Suppose you want to investigate a copy-number variation (CNV) and consult UCSC. You'd find tracks such as this:

Although this does give you quite a bit of information on the CNV in question, it's not an adequate representation of what the different alleles actually look like. It also highlights another issue: the concept of "the reference genome". As more and more genomes are getting sequenced, is the one that was picked first really the best one for visualization and, indeed, as the reference? To be able to handle the different MHC haplotypes in Ensembl, for example, the database contains a table called "assembly_exception" that records the alternative assembly for each haplotype.

I believe that further down the line (although it might be quite a while) we might need to forget the whole notion of a reference genome. Two options come to mind. First of all, we could create an artificial reference that contains all sequence, and let each real sequence we want to look at, well, reference that artificial assembly. That would mean that the different MHC haplotypes, for example, would all be in the same sequence. Similarly, a copy-number variant with, let's say, 3 to 8 copies would have all 8 included in the mock assembly. Unfortunately this still cannot cover structural variation like inter-chromosomal translocations. We can't build a single artificial assembly that would incorporate those. So here's the alternative: de Bruijn graphs. Instead of creating a single linear representation of a reference, let's just not. We could use building blocks to build up each individual. Take a look at this picture:

Suppose that each block is a part of a chromosome and the red and blue lines represent the path to follow to build up the chromosome for a particular individual. In this picture the red individual is missing a part of that chromosome that is present in the blue individual, and another part is inverted. Notice that we don't make any (arbitrary) decision on what the reference sequence is. By dragging the blocks we can place either all red connections or all blue ones on one line, making that individual look like a reference.
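To make the blocks-and-paths idea a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the block names, the toy sequences and the two paths are assumptions, not real data, and it assumes each block is used at most once per individual. It only shows that an individual can be stored as a walk over shared blocks, with a strand flag for inversions, without picking a reference.

```python
# Hypothetical sequence blocks; names and sequences are invented for illustration.
BLOCKS = {
    "A": "ACGTAC",
    "B": "GGGTTT",
    "C": "CATTAG",
    "D": "TTGACA",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Each individual is a path: an ordered list of (block id, strand).
# Here "red" lacks block B and carries block C inverted.
PATHS = {
    "blue": [("A", "+"), ("B", "+"), ("C", "+"), ("D", "+")],
    "red": [("A", "+"), ("C", "-"), ("D", "+")],
}


def build_sequence(path, blocks=BLOCKS):
    """Spell out the chromosome of one individual by walking its path."""
    parts = []
    for block_id, strand in path:
        seq = blocks[block_id]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return "".join(parts)


def compare_paths(first, second):
    """List blocks unique to either path and blocks whose strand differs.

    Assumes a block occurs at most once per path (no duplications).
    """
    a, b = dict(first), dict(second)
    return {
        "only_in_first": sorted(set(a) - set(b)),
        "only_in_second": sorted(set(b) - set(a)),
        "inverted": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


if __name__ == "__main__":
    for name, path in PATHS.items():
        print(name, build_sequence(path))
    print(compare_paths(PATHS["red"], PATHS["blue"]))
```

Neither path is privileged here: comparing red against blue or blue against red gives the same information, just mirrored. Copy-number variants would need paths that visit a block more than once, which this sketch deliberately leaves out.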

If we then added annotations such as genes to this picture, we'd be able to display fusion genes. Suppose that the densely-striped block is on chromosome 7 in the red individual but on chromosome 12 in the blue one. If there's a gene at the right breakpoint, we end up with a fusion gene.
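The fusion-gene case can be sketched in the same spirit. The block names, gene names (GENE1, GENE2) and chromosome layouts below are all hypothetical, and the model is deliberately crude: it assumes we only know which gene, if any, runs off the left and right edge of each block, and it ignores strand and reading frame. A candidate fusion is then simply a junction where two different genes meet.

```python
# Hypothetical annotation: block id -> (gene crossing its left edge,
#                                       gene crossing its right edge).
# None means no gene crosses that edge.
EDGE_GENES = {
    "X": (None, "GENE1"),   # GENE1 runs off the right end of X ...
    "Y": ("GENE1", None),   # ... and continues into Y
    "P": (None, "GENE2"),   # GENE2 runs off the right end of P ...
    "S": ("GENE2", None),   # ... and continues into the striped block S
    "Q": (None, None),
}

# Hypothetical layouts: the striped block S sits on chr12 in blue
# but on chr7 in red.
GENOMES = {
    "blue": {"chr7": ["X", "Y"], "chr12": ["P", "S", "Q"]},
    "red": {"chr7": ["X", "S", "Y"], "chr12": ["P", "Q"]},
}


def candidate_fusions(genome, edge_genes=EDGE_GENES):
    """Report junctions where two different genes are joined together."""
    fusions = []
    for chrom, path in genome.items():
        # Walk over every pair of neighbouring blocks on this chromosome.
        for left, right in zip(path, path[1:]):
            gene_out = edge_genes[left][1]   # gene leaving the left block
            gene_in = edge_genes[right][0]   # gene entering the right block
            if gene_out and gene_in and gene_out != gene_in:
                fusions.append((chrom, left, right, gene_out, gene_in))
    return fusions


if __name__ == "__main__":
    for name, genome in GENOMES.items():
        print(name, candidate_fusions(genome))
```

In this toy setup only the red individual yields a candidate, at the junction between X and S on chromosome 7. Whether such a junction gives a functional fusion gene depends on orientation and reading frame, which this sketch does not model.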

Time permitting, I'm going to investigate how useful this would be for projects like the 1000 Genomes Project, for CNVs in particular.