Wednesday, 12 March 2008

Where did I get that data from?

Did you ever have data lying around that you couldn't figure out where you got it from?

You downloaded and imported data from an FTP site into your database ages ago and you actually want to use it now. But if different records come from different sources, it can be really challenging to know what data to trust or how to retrieve additional information afterwards. Not keeping track of the source of the data breaks the chain of provenance.

I've seen it happen.

So what do you do? The obvious thing when you're working with a database is to have an additional column for that table where you can store the FTP URL or the people who sent you the data. Easy peasy. Things get a lot more complicated if different cells in a single row can have different sources. For example: suppose one of your tables (called markers) contains information on STSs.

id name fw_primer rev_primer PCR product length gene
---------------------------------------------------------------------
1 marker_1 AACCGGACGA GACCTCGGAGAC 241 CYP2D6
2 marker_2 TCAATGGAGG GATTCGCTGACTC 183 BRCA2
3 marker_3 CGCTATGACTGC AACTGCGTCATG 221 DAG1
4 ... ... ... ... ...


Let's say that you get the primers and STS length from a couple of inputs (e.g. dbSTS as well as designed by colleagues: marker_1 can come from dbSTS while the other two were created by your colleague Tim) and the gene information was added by two different colleagues (let's say Bert and Pat). In this case it becomes quite impossible to have that information stored in an additional column of the table ("primers: dbSTS; gene: Pat"??).

I've discussed this issue with a database guy during the last Perl Programming course at CSHL, and guess what: there's no easy solution. Having it play in the back of my head for the last couple of months, I finally took the effort to actually draw some possibilities on a piece of paper. And here's what might do the trick: just add an additional table. Let's call it marker_sources. This table has exactly the same columns as the markers table (apart from a foreign key to the data table).


marker_id name fw_primer rev_primer PCR product length gene
---------------------------------------------------------------------------
1 dbSTS dbSTS dbSTS dbSTS Pat
2 Tim Tim Tim Tim Pat
3 Tim Tim Tim Tim Bert
4 ... ... ... ... ...


Result: the resolution of your provenance increases up to single cell level. This does mean additional tables, but for typical use they will not have to be queried if you're just interrogating the data. But at least you can get to that information if needs be...