Tuesday, 13 January 2009

Who-o-o are you? Who who? Who who?

Identity Card - National Registration. Image by Danny McL via Flickr.

There has been quite a lot of discussion lately about author identification: Raf Aerts’ correspondence piece in Nature (doi:10.1038/453979b), discussions on FriendFeed, ... The issue is that it can be hard to identify who the actual author of a paper is if their name is very common. If your name is Gudmundur Thorisson (“hi, mummi”) you’re in luck. But if you are a Li Y, Zhang L or even an Aerts J it’s a bit harder. Searching PubMed for “Aerts J” returns 299 papers. I surely don’t remember writing that many. I wish… So if a future employer searches PubMed for my name, they will not get a list of my papers but a list of papers by authors who share my name. Also, some of my papers mention jan.aerts@bbsrc.ac.uk as the contact email. If you try to reach me there, you’re out of luck, I’m afraid: that email address doesn’t exist anymore because I changed jobs.

The idea exists to create a unique ID for each author, similar to the doi (“digital object identifier”) for a paper. Thomson Reuters have created ResearcherID, but because doi’s are handled through the not-for-profit CrossRef, let’s call the unique author ID a dsi (“digital scientist identifier”). A scientist can then use this dsi to identify himself or herself wherever needed.

Here I’ll try to explain how I think this could work.

But first of all: what are the prerequisites for a dsi-based environment? Obviously, the journals would need to request the dsi of authors on submission rather than just their names and email addresses. They would then be able to get names and email addresses through the dsi. Secondly, we need a service that assigns dsi’s and where scientists can update their details and add information.

The service/website

Let there be a website (for argument’s sake http://www.dsi.org) that assigns new dsi’s to new authors (only one dsi per author). So I could for example be dsi.12345. This service should have additional functionality such as a list of contributions, curriculum vitae, contact details and network. It should also provide a homepage or profile page for each scientist listing at least the name, affiliation and literature list (i.e. what you would get from a PubMed search). So if you went to http://www.dsi.org/dsi.12345 you’d see at least my name, the address of the institute where I work and a list of papers that I co-authored.

Getting a dsi

It’s critical that one researcher only gets one dsi. This is less than straightforward because I believe many researchers will not be interested enough in the whole identity story to even remember if they already had a dsi or not. So if I were to go to the dsi website and request an ID, the website would ask for my name first. It’d also ask if I used different names in author lists (e.g. I’m a woman, got married and started using my married name instead of my maiden name). Using that information the service would then search PubMed for papers that are authored by someone with my name (who might be me). It could present that list to me and ask for each paper whether I’m actually its author or not. This way we’d build up a minimal list of papers. That minimal list would then be checked against the dsi database to see if there isn’t already someone with my name who has claimed these papers. Logically that person would be me and it would appear that I already have a dsi. If no existing dsi is associated with this name and these papers, a new dsi can be assigned.
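As a rough illustration of that last check, here is a minimal Python sketch. Everything in it is hypothetical: the Registry class, the request_dsi method, the in-memory storage and the "pmid:..." paper ids are made up for the example, and a real service would query PubMed and let the researcher confirm papers interactively.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for the dsi database."""
    claims: dict = field(default_factory=dict)  # dsi -> (set of names, set of paper ids)
    next_id: int = 12345

    def request_dsi(self, names, confirmed_papers):
        """Return (dsi, is_new) for a researcher.

        names: every name the researcher has published under.
        confirmed_papers: ids of the papers the researcher confirmed as theirs.
        """
        names = {n.lower() for n in names}
        papers = set(confirmed_papers)
        for dsi, (known_names, known_papers) in self.claims.items():
            # Assumption: a shared name plus at least one shared paper
            # means this is the same person, who already has a dsi.
            if names & known_names and papers & known_papers:
                known_names |= names      # remember any new name variants
                known_papers |= papers    # and any newly confirmed papers
                return dsi, False
        dsi = f"dsi.{self.next_id}"
        self.next_id += 1
        self.claims[dsi] = (names, papers)
        return dsi, True
```

A second request under a name variant that shares a confirmed paper gets the existing dsi back; a namesake with no papers in common gets a new one.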


A central service like this would be ideal for collaborators and possible employers to find out about contributions of a specific researcher to science. Instead of asking for author names and emails (the latter change over time anyway), a journal would ask for the dsi of all authors. If the paper gets accepted, that journal would notify the dsi service to add that paper to each researcher’s publication list. But it goes further than just the papers. It’s a shame that researchers virtually only get marks for their published papers (Publish or Perish) and not for other contributions to scientific research. What about people who submit data to genome annotation databases? What about contributions to discussion in comments to blog posts, FriendFeed, ...? Setting up public databases? Writing APIs for scientific data? Think of a browser button with which you could sign certain contributions anywhere. Signing a contribution would add a link to your list of contributions in the dsi system.

It should obviously be possible to log into the dsi system and edit or remove contributions that you made. That one little API you wrote 5 years ago seemed so important then but you’ve come to see it as insignificant now, for example.

Contact details

People change employer, email, address and even name. So there’s a problem inherent in only listing email address and institute on a paper. Using the unique dsi for the authors would always point to that researcher no matter how many times he or she changed jobs or contact information. When a researcher’s contact details change, he or she would log onto the dsi service (we’ll come to this later) and update them. Other people would then see those details on the researcher’s dsi page (http://www.dsi.org/dsi.12345) or, if the researcher wants to keep them hidden, send a message through the dsi service itself. The researcher’s email address does not have to be visible to the outside world.


Even though you might not want to make your email address visible to the whole world, you wouldn’t mind if the people you know could see it. Your network. I think that a dsi service should contain capabilities like those of LinkedIn. You should be able to build a trusted network (with people that you know well). This network is another important pillar of your contribution to science.

There would ideally be different personas you could set for your profile. The default would for example be that your profile page only shows your name and papers. But you might also have a full profile that is only visible to researchers who are logged into the service and are no further than two steps away in your network. That extended profile might show your contact details (including email), contributions outside of papers (e.g. comments on blog posts) and curriculum vitae.
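The "two steps away" rule could be sketched as a breadth-first search over the trusted network. This is only an illustration under assumptions: the network is a plain dict of dsi to contacts, and the function names and profile fields are hypothetical.

```python
from collections import deque


def network_distance(network, start, target, max_steps=2):
    """Breadth-first search; returns the number of steps from start to
    target, or None if target is further away than max_steps."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, steps = queue.popleft()
        if steps == max_steps:
            continue  # do not look beyond the allowed distance
        for contact in network.get(person, ()):
            if contact == target:
                return steps + 1
            if contact not in seen:
                seen.add(contact)
                queue.append((contact, steps + 1))
    return None


def visible_profile(profile, network, owner, viewer=None):
    """Pick the persona to show: the full profile for logged-in viewers
    within two steps of the owner, the default one for everybody else."""
    public = {"name": profile["name"], "papers": profile["papers"]}
    if viewer is None:  # viewer is not logged in
        return public
    if network_distance(network, viewer, owner) is None:
        return public
    return profile
```

A viewer who is a contact of a contact would see the extended profile; anyone three or more steps away, or not logged in, would only see name and papers.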


The above explains how I would like to see the issue of identification solved. But there is also the problem of authentication. How do I prove that I am dsi.12345? Ideally the dsi service would be an OpenID provider so that it lets me prove that I own http://www.dsi.org/dsi.12345. Hopefully more and more websites (biomedcentral, nature, ...) would allow logging in using OpenID.

Apart from serving as an OpenID provider, the dsi service should obviously also be an OpenID consumer so I don’t have to remember another username and password but can use http://jandot.myopenid.com or http://saaientist.blogspot.com to log in.

I hope this gives a little bit of an idea of the environment I hope we’ll move to. Any comments welcome. Any progress even more…


Tuesday, 6 January 2009

To find structural variation, look at read pairs: introducing pARP

Nextgen sequencing is making a huge impact on how research is done in the genomics field. One of the ways to discover structural variants in a genome, for example, is to create a clone library for an individual, sequence the ends of those clones and then map those ends to the reference genome. Suppose that the clones in the library are all about 150kb in size; then we would expect the ends of each clone to be mapped about 150kb from each other on that reference genome, in a forward/reverse direction. Any read pair that does not follow this pattern might indicate a structural variation. There are of course numerous spurious mapping results, so we need to ignore those.
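To make the expected pattern concrete, here is a small Python sketch that labels a mapped read pair. It is a simplification under stated assumptions, not how pARP or any mapper does it: the function name, the "+"/"-" strand encoding, the 300kb cut-off and the choice to check orientation before distance are all made up for the example.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  max_distance=300_000):
    """Label a mapped read pair as FF, RR, DIST or NORMAL.

    max_distance is an assumed cut-off for "much further apart than the
    ~150kb clone size"; pick a value that suits your own library.
    """
    if strand1 == strand2:
        # Both ends map in the same direction: forward-forward or reverse-reverse.
        return "FF" if strand1 == "+" else "RR"
    if chrom1 != chrom2 or abs(pos2 - pos1) > max_distance:
        # Ends on different chromosomes, or far too distant on the same one.
        return "DIST"
    return "NORMAL"


print(classify_pair("1", 1000, "+", "1", 151000, "-"))  # a well-behaved pair
```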

Suppose that the resulting data look like this:

1 1016287 1 1025027 FF 10
1 54809626 1 54814724 RR 20
1 65970649 1 67123551 DIST 32
1 143840263 1 143841351 RR 34
1 241524162 16 298176281 DIST 36
First two columns are the position of the first read from the pair; third and fourth columns refer to the second read from the pair. Fifth column is FF, RR or DIST: forward-forward, reverse-reverse or distance (i.e. >> 150kb). The last column is some arbitrary quality score assigned to the mapping of this read pair. Notice that the last of these lines shows a readpair where one end is mapped on chr1 and the other is mapped to chr16.
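For readers who want to play with such a file outside pARP, a minimal Python sketch for reading these six columns could look like this. The ReadPair type, its field names and the parse_read_pairs function are hypothetical, and the quality cut-off of 30 is arbitrary.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    kind: str      # "FF", "RR" or "DIST"
    quality: int


def parse_read_pairs(text):
    """Parse whitespace-separated read pair lines into ReadPair records."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        c1, p1, c2, p2, kind, qual = fields
        pairs.append(ReadPair(c1, int(p1), c2, int(p2), kind, int(qual)))
    return pairs


EXAMPLE = """\
1 1016287 1 1025027 FF 10
1 54809626 1 54814724 RR 20
1 65970649 1 67123551 DIST 32
1 143840263 1 143841351 RR 34
1 241524162 16 298176281 DIST 36
"""

pairs = parse_read_pairs(EXAMPLE)
between = [p for p in pairs if p.chrom1 != p.chrom2]   # between-chromosome pairs
good = [p for p in pairs if p.quality >= 30]           # arbitrary quality cut-off
```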

We can do two things: analyze and then create a picture, or create a picture and then interpret (see also one of my previous posts). In the first approach, you'd run a statistical analysis to see if certain regions have a higher prevalence of abnormally mapped read pairs. In the second, you plot the raw data and try to identify abnormalities by eye. Of course ideally you switch between both approaches.
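A very crude version of the first approach is to count abnormal read-pair ends per genomic window and flag windows that stand out. The Python sketch below is only a heuristic for illustration, not a proper statistical test; the function names, the 1Mb window and the factor of 3 are assumptions.

```python
from collections import Counter


def count_per_window(positions, window=1_000_000):
    """Count abnormal read-pair ends per (chromosome, window index)."""
    counts = Counter()
    for chrom, pos in positions:
        counts[(chrom, pos // window)] += 1
    return counts


def enriched_windows(counts, factor=3.0):
    """Return windows whose count exceeds `factor` times the mean count
    over all non-empty windows."""
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(w for w, n in counts.items() if n > factor * mean)
```

A real analysis would also have to account for read depth and mappability, which this sketch ignores.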

To visualize raw read pair information I've written a tool called pARP (Processing Abnormal ReadPairs), which is available from GitHub. Its display is very similar to the one used in [edited] this paper by Hampton et al. to show structural variation using Circos (see picture, taken from the Circos website). But instead of just creating a static picture, pARP is meant to be an interactive tool to browse the data.

Below is a screenshot of pARP running on some test data. It doesn't look as nice as the above image, but remember that this is interactive and thus doesn't have minutes to calculate everything.

Some of the features:
  • pARP can display abnormal readpairs (forward/forward, reverse/reverse or wrong distance), read depth and other features (e.g. segmental duplications).
  • Circular display gives overview of between-chromosome mapped readpairs.
  • Chromosomes can be dragged from the circular display to the upper or lower linear display to show (a) more detail and (b) within-chromosome aberrant readpairs (note: none in the image above).
  • Visible readpairs can be filtered by quality score.
  • Readpairs that are close to the mouse position are highlighted.

Prefiltering of the data should be minimal, and only focussed on getting the amount of data down. For example, the readpair data file could contain all normal readpair mappings, but getting rid of those just makes the display much more visually clear and reduces the amount of data to be loaded by several orders of magnitude (obviously...).

The version just released (tagged v0.8) is workable, but not ready for prime time yet. At this moment the user has to run the tool using jruby instead of just loading it as an applet. Also, the filenames to be loaded have to be changed in the parp.rb code itself. I hope to add functionality so that you can upload your own data into an applet, or use a URI to link to it. But I can't promise anything because other work is waiting. So here's also a call for help: if you're interested in contributing, please do! There's a "features-yet-to-be-implemented" list just below.

Features not yet implemented:
  • pARP should be available as an applet/application.
  • User should be able to point to files or URIs representing files instead of changing filenames in the code itself.
  • Saving an image to disk (also from the applet).
  • Further performance improvements.
  • Fixing of not-yet-identified-but-definitely-present bugs.

And now for some technical stuff. To keep redrawing times low so that the interaction wouldn't suffer too much from the huge amount of data, I had to use a few tricks. First of all, pARP makes heavy use of buffers. Different parts of the image are stored on different buffers. When the user interacts with the display, only the relevant buffers are updated while the others are untouched. For more info, see the github wiki page on the subject. Secondly, I've found out how to use ruby threads to load some data asynchronously. In particular the readdepth data can be a huge hog on performance; there are >6 million datapoints for a genome window size of 500bp. So what happens is that (a) readdepth data for a chromosome is only loaded when that chromosome is displayed in the linear part of the image, and (b) the readdepth data is drawn onto a separate buffer that is only displayed when the thread is finished.
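The second trick (load lazily in a background thread, and only show the buffer once it is complete) can be sketched in a few lines. Note that this is Python rather than the Ruby that pARP is written in, and that the ReadDepthLayer class, its method names and the loader callback are hypothetical; it shows the general pattern, not pARP's actual code.

```python
import threading


class ReadDepthLayer:
    """Loads read depth per chromosome in the background, once, on first request."""

    def __init__(self, loader):
        self._loader = loader    # assumed callback: chromosome -> list of depth values
        self._buffers = {}       # chromosome -> finished buffer
        self._threads = {}       # chromosome -> loading thread
        self._lock = threading.Lock()

    def request(self, chrom):
        """Call when a chromosome is dragged into the linear display."""
        with self._lock:
            if chrom in self._threads:
                return           # already loading or loaded
            thread = threading.Thread(target=self._load, args=(chrom,), daemon=True)
            self._threads[chrom] = thread
            thread.start()

    def _load(self, chrom):
        data = self._loader(chrom)   # the slow part runs off the drawing thread
        with self._lock:
            self._buffers[chrom] = data

    def buffer_for(self, chrom):
        """Return the finished buffer, or None while loading is still in progress."""
        with self._lock:
            return self._buffers.get(chrom)

    def wait(self, chrom, timeout=None):
        """Block until loading for chrom has finished (handy in scripts and tests)."""
        with self._lock:
            thread = self._threads.get(chrom)
        if thread is not None:
            thread.join(timeout)
```

The drawing loop would call buffer_for on every frame and simply skip the read depth track while it returns None, so the display stays responsive during loading.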

Many thanks to:

Update: reference changed for Circos picture