Saturday, 13 September 2008

Data visualization

Today is "Data management, mining, curation and visualization" day at the Genome Informatics conference in Hinxton. It might be one of the more interesting ones for me, because that's what I do: manage, mine, curate and attempt to visualize. And I must say the last bit is the most difficult. It's not difficult to upload results into a genome browser, but is it the best way?

I say we have to break free from the track. Ninety percent of all visualizations in genomics today are track-based (add DAS tracks to Ensembl, upload BED files to UCSC or run your own GBrowse). Tracks are ideal for showing features on a chromosome, but they're used even when they're not the optimal tool (a feature shared with Microsoft Excel, but let's not go there). Why's that? Because that's what there is, and track browsers do provide very useful functionality. But at the same time, having them available tempers the search for new and innovative ways of visualizing data. Having a computer at hand doesn't help either, I think: it's just much easier in PowerPoint to draw a collection of squares than a rich, multifaceted picture. That kind of picture is more easily drawn by hand, but that's not what we do, is it?

One of the articles in Nature's Big Data issue calls for artists and visualization experts to be involved before all data are gathered. This idea got quite a few comments on FriendFeed as well. I do agree with the idea of visualization experts being involved in many projects, but that visualization expert should be you. Well... you don't need to be an expert, but you should still have an idea of how to show the gist of your results. I think that's one of the important things missing in MSc education (apart from a good introduction to data management): some course in visualization concepts. How do you visualize time-series? How do you visualize differences? And what about time-series of differences?

Small example: I've been asked to think about how to visualize copy-number variations between individuals. The most obvious thing to do is what's done in the UCSC browser and any other track browser: show a box where the variation is. But it's a variation, right? So what does this box mean? That some individuals are missing that bit? That it's duplicated? Which individuals? Using a track-based genome browser, you must make one individual the reference.
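To make the problem a bit more concrete, here is a rough sketch of one alternative: one row per individual, one column per genomic bin, and a symbol for loss or gain relative to an expected copy number. Everything in it is made up for illustration: the individuals, the bins, the numbers and the function names. The diploid baseline of 2 is an assumption too, and arguably still a reference of sorts, just not one particular individual.

```python
# Hypothetical copy-number calls: individual -> copy number per genomic bin.
# All names and numbers are invented for illustration only.
calls = {
    "ind_A": [2, 2, 1, 1, 2, 2],
    "ind_B": [2, 2, 3, 4, 2, 2],
    "ind_C": [2, 2, 2, 2, 2, 2],
    "ind_D": [2, 0, 0, 2, 2, 2],
}

EXPECTED = 2  # assumed diploid baseline, not any single individual


def glyph(copies, expected=EXPECTED):
    """Map a copy number to a one-character symbol relative to the baseline."""
    if copies < expected:
        return "x" if copies == 0 else "-"  # complete loss / partial loss
    if copies > expected:
        return "+" if copies == expected + 1 else "#"  # one extra copy / more
    return "."  # same as baseline


def render(calls):
    """Return one text row per individual, one symbol per bin."""
    width = max(len(name) for name in calls)
    return [
        f"{name.ljust(width)} {''.join(glyph(c) for c in row)}"
        for name, row in calls.items()
    ]


def variable_bins(calls):
    """Bins where at least one individual differs from the baseline.

    This is roughly all that a single box on a track can tell you.
    """
    rows = list(calls.values())
    return [i for i in range(len(rows[0])) if any(r[i] != EXPECTED for r in rows)]


for line in render(calls):
    print(line)
print("variable bins:", variable_bins(calls))
```

The single box would span bins 1 to 3 and nothing more. The rows show that within that same box one individual has lost a copy, another has gained copies, and a third has lost a different stretch altogether.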


This is not about tools (there's Processing), but about a mind shift.

As it happens, I hope to get hold of a small tablet in the next week or so to replace my mouse and relieve my RSI a bit, so that might be a good opportunity for me to at least explore a bit.

Keep drawing.