"Contigs should not know where they are." That's a phrase uttered by James Bonfield when presenting his work on gap5, the successor to gap4, a much-used assembly software suite. So you think: "Wait a second: you're talking about assembly, and the contigs should not store their position?"
This statement addresses a problem we encounter often when working with genomic data: how to handle features. The common approach is to give each feature a 'chromosome', 'start position' and 'stop position'. Seems reasonable, right? So if you want all features on chromosome 1 between positions 6,124,627 and 6,827,197, you just loop over all features and check whether their range overlaps with the query range. Indeed: seems reasonable. Unless your collection of features runs into the millions. Suppose you have arrayCGH data with a resolution of 1 datapoint per 50 basepairs (the green/red bars in the picture). To search for this region, you can restrict yourself to chromosome 1 (if you were smart enough to create a separate collection per chromosome) and compare the start and stop position of each feature, one by one, from the beginning. Chromosome 1 is almost 250 Mb, so it would contain 5 million array probes. That's a lot of features to check.
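A minimal Ruby sketch of that naive approach (the `Feature` struct, probe spacing and `features_in_range` helper here are made up for illustration, and the toy chromosome is far smaller than the real thing):

```ruby
# Naive feature storage: every feature carries its own coordinates.
Feature = Struct.new(:chromosome, :start, :stop)

# One probe every 50 bp on a toy 'chromosome 1' (the real one would
# hold around 5 million of these).
features = (0...100_000).map do |i|
  Feature.new('1', i * 50 + 1, i * 50 + 50)
end

# A range query has no choice but to scan the entire collection.
def features_in_range(features, chrom, from, to)
  features.select do |f|
    f.chromosome == chrom && f.start <= to && f.stop >= from
  end
end

hits = features_in_range(features, '1', 1_000_001, 1_000_500)
```

Every query touches every feature, so the cost grows linearly with the collection, no matter how small the region you ask for.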
What James alluded to was that you should keep the position information at a higher level. He uses an adaptation of an R-Tree, a data structure that groups the raw data into bins of a certain size (say 25 elements). Those bins are themselves binned again into larger bins, and so on until only one bin remains: the root. Each bin knows its start and stop position relative to the bin that contains it. To search for the same region as above, you start at the root bin, which (by definition) overlaps with the query. You then descend a level and check each of the root's child bins. You can reject those that don't overlap with your range (the red bins in the picture) and only descend into the ones that do. Continue until you reach the bottom-most layer, which contains your actual features.
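In Ruby, the binning-and-pruning idea might look something like this. This is a toy sketch under the assumption of position-sorted input, not gap5's or LocusTree's actual code; `Bin`, `build_tree` and `query` are names I invented:

```ruby
Feature = Struct.new(:start, :stop)

# A bin knows its own extent; internal bins hold child bins,
# leaf bins hold the actual features.
Bin = Struct.new(:start, :stop, :children, :features)

# Group sorted features into leaves of `fanout` elements, then keep
# binning each level until a single root bin remains.
def build_tree(features, fanout = 25)
  level = features.each_slice(fanout).map do |group|
    Bin.new(group.first.start, group.last.stop, nil, group)
  end
  while level.length > 1
    level = level.each_slice(fanout).map do |group|
      Bin.new(group.first.start, group.last.stop, group, nil)
    end
  end
  level.first # the root
end

# Descend from the root, pruning every subtree whose extent
# doesn't overlap the query range.
def query(bin, from, to, hits = [])
  return hits if bin.stop < from || bin.start > to
  if bin.features
    hits.concat(bin.features.select { |f| f.start <= to && f.stop >= from })
  else
    bin.children.each { |child| query(child, from, to, hits) }
  end
  hits
end

probes = (0...10_000).map { |i| Feature.new(i * 50 + 1, i * 50 + 50) }
root = build_tree(probes)
hits = query(root, 100_001, 100_500)
```

Instead of scanning all 10,000 probes, the query only visits the handful of bins on the path down to the overlapping leaves.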
This has two advantages for my work on pARP, the visualization tool for aberrant read pair mappings. First of all, it's faster to get the features in a given region. A major feature of pARP is that you can zoom into any region, and with more than 6 million values for read depth, speed is a big issue. But that's not all. Think about this: if my screen is 1200 pixels wide, why would I try to plot 6 million points next to each other? Twelve hundred would be enough, wouldn't it? R-Tree to the rescue again. While building the tree, each bin stores some aggregate data about its members, for example the average read depth. To draw the read depth across the whole genome, I only have to start at the root of the LocusTree and descend to the level that contains 1200 bins. No need to load the complete dataset. At the moment LocusTree bins only take the average of whatever value their members have, but this can be extended to any type of aggregation (e.g. the number of objects).
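The aggregation trick could be sketched like this. Again a hypothetical illustration rather than the real LocusTree API; `Node`, `build_levels` and `level_for` are invented names:

```ruby
# Each node stores the average of its members; internal nodes
# average their children's averages.
Node = Struct.new(:average, :children)

# Build the tree level by level, bottom-up, keeping every level around
# so drawing code can pick the one that matches its resolution.
def build_levels(values, fanout = 25)
  level = values.map { |v| Node.new(v.to_f, nil) }
  levels = [level]
  while level.length > 1
    level = level.each_slice(fanout).map do |group|
      Node.new(group.sum { |n| n.average } / group.length, group)
    end
    levels << level
  end
  levels.reverse # root level first
end

# Pick the shallowest level with at least `pixels` bins: that is all
# the detail a `pixels`-wide plot can show anyway.
def level_for(levels, pixels)
  levels.find { |lvl| lvl.length >= pixels } || levels.last
end
```

So a genome-wide read-depth plot never has to touch the raw values: it just reads the pre-computed averages from one shallow level of the tree.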
LocusTree is not the fastest library around (it's written in Ruby, and by me), but I think it does what I need it to do. That also means it might not do exactly what you need it to do. But the code is on GitHub, so feel free to fork and improve it. I did add a list of to-dos at the bottom of the README file...
UPDATE: I have since realized that, due to DataMapper, this library does not work under JRuby, which is necessary for pARP because it uses ruby-processing. A better approach would be to implement the algorithms in C and expose them through an API, so that we don't need DataMapper at all.