Monday, 23 June 2008

Bioruby with git: how would that work?

Disclaimer: This blog post is the result of several iterations of writing/discussion/rewriting from Anthony Underwood, Michael Barton, Matt Wood and myself, with additional help from Paul Thornthwaite.
Disclaimer nr 2: We are not yet git veterans ourselves, so if you see simpler ways of doing what we describe below (or spot any errors), please let us know so we can update this post and put it onto the bioruby wiki as well.
Disclaimer nr 3: This is a proposal. Bioruby has not moved to git yet. However, we are working on it and trying to get the support from the main developers. Update: bioruby has been converted to git (thanks, Anthony) and is now available on github. So you can clone or fork now. However, the official development is still on CVS.

Update: I have discovered a very good presentation on how to work and collaborate with git. If you're interested in using git, have a look at this talk. You can fast forward to 1hr10min27sec where he starts talking about the practical use. Very strongly recommended.

In this blog post, we try to give some guidelines on how people can contribute to the bioruby code if/when that code becomes available on github. The rationale for what we describe here is very much based on the premise that the job for the maintainer(s) of bioruby should be as simple as possible. Their workload should be as light as possible; this means that there are some additional steps that any contributor has to go through.
What follows is only a proposal. This is not a standard operating procedure; it’s only a guideline. Feel free to deviate from it or use a completely different workflow. But remember: keep it simple for the maintainers.

Git


Distributed source control. Git is a truly distributed source control system, and in contrast with CVS or SVN, there is no central repository. With CVS or SVN, every time someone checks out or exports the repository, his own copy is so-to-speak subordinate to the central one. Not so with git: every single clone is equivalent; none is more important than another. In technical terms, the copy of bioruby on your laptop is as important as the one that for example Toshiaki maintains. One of the big advantages is that continued support is more likely should a key developer move on to pastures new (or github goes up in smoke), since the community can simply elect a new "blessed" repository (see below).

A blessed repository. Did you notice that I said “in technical terms”? In some cases, like for bioruby, we would obviously like to have some repository that we would consider the ‘true’ one. Enter the notion of a “blessed” repository. This is purely by convention: the community appoints one particular repository as the main one.
A good place to put this repository is Github. For bioruby, this blessed repository will start out to be http://github.com/bioruby/bioruby. Official bioruby builds will take place from there. However, development can take place in additional, personal repositories.

Forking. Any development of bioruby would happen in clones of this blessed repository. Using the “fork” button on Github not only creates a clone, but it automatically puts that clone on Github itself as well. (Forking has the added value of the github social aspect where the network of changes can be viewed.) So if I wanted to contribute, I would fork from bioruby/bioruby (that is: username/projectname), which would automatically create http://github.com/jandot/bioruby.

Guidelines for contribution


There are several ways of contributing: you can either create a patch or use a fork/clone.

Here we’ll try to explain how contribution could work with forking for Bioruby, both from the individual contributor’s view and from the view of the person(s) managing the blessed repository. What follows is not a Standard Operating Procedure. You do not have to do it like this. However, it will make it easier on the blessed maintainers to merge your code.

A. Using patches


A.1 The contributor


The simplest way to contribute is to send in patches. RailsCasts has a great screencast explaining this.

Creating a fork
Click the “fork” button on the bioruby/bioruby page. This will create a new repository in your own namespace: jandot/bioruby. It’s on this clone that you will be working; you will not touch bioruby/bioruby itself.

Making changes
To actually start making changes (e.g. you want to add functionality for Ensembl cigar format), you create a local clone on your own computer (step 2 in the picture):
git clone git@github.com:jandot/bioruby.git
This gives you a local master branch. The first thing to do after cloning your own fork is to create an additional branch for the feature you want to work on: add_cigar_format (step 3).

The command to do this:
git checkout -b add_cigar_format

This will create the new branch and check it out so it becomes your active one. From the fluxbox wiki (http://fluxbox-wiki.org/index.php/Git_-_using): “Branching and merging is very powerful in git. You can create thousands of local branches, one for each bug you work on or feature you implement. It is good practice to do this because it saves you from accidentally pushing changes to another repository.”

So you’ll end up with 2 branches (do a “git branch”):
  1. master: a reflection of the master branch of your remote repository
  2. add_cigar_format: is where the actual work is done

The output of “git branch” should show a star in front of add_cigar_format because that’s your current branch. If master is starred, do a “git checkout add_cigar_format” to change to this branch.
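If you want to try the branching step without touching a real clone, here is a minimal sketch that runs in a throwaway directory. The temporary repository and the README file are just stand-ins we made up for the example; only the branch name add_cigar_format comes from the text above.

```shell
set -e
# Throwaway repository standing in for the local clone of your fork (assumption).
demo=$(mktemp -d)
git init -q "$demo/bioruby"
cd "$demo/bioruby"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Demo User"
git config user.email "demo@example.org"
echo "placeholder" > README               # hypothetical file, only there to have a first commit
git add README
git commit -q -m "Initial commit"

# Create the feature branch and switch to it in one step.
git checkout -q -b add_cigar_format

# List the branches; the starred one is the active branch.
git branch

# Switching back and forth between the two branches.
git checkout -q master
git checkout -q add_cigar_format
```

At the end you are back on add_cigar_format, with master untouched.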

Now you can edit and change to your heart’s content. Git keeps an index (also called the staging area) of the changes that will go into your next commit. You can find the current status of your working copy by typing
git status
which will list the files that are modified, staged or untracked. Changes can be added to the local index by using the command
git add file
The index is an intermediary between the working copy files you are editing, and the changes committed to the repository. Changes can be committed from the index to your local repository using the command
git commit
This command will also prompt you for a message describing the commit. Try not to do too much work before committing. A single commit should concern (part of) a single conceptual change with its tests. It’s good practice to commit often (and several commits per conceptual change), but do try not to mix different changes into one commit. This will make it harder afterwards if a commit has to be reverted.

Commits are applied only to the currently checked-out branch (i.e. add_cigar_format), and do not affect any other branches, or the original repository. Also, if you have to make site-specific changes (e.g. hard-coding a proxy server in one of the files), try to put those changes in one single commit. This will make it easier later to remove them.
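The edit/status/add/commit cycle described above could look roughly like this. It is only a sketch in a scratch repository: the file lib/cigar.rb and the commit messages are invented for the example and are not part of bioruby.

```shell
set -e
# Scratch repository with a feature branch (stand-in for your local clone).
demo=$(mktemp -d)
git init -q "$demo/bioruby"
cd "$demo/bioruby"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.org"
echo "placeholder" > README
git add README
git commit -q -m "Initial commit"
git checkout -q -b add_cigar_format

# Edit: add a (hypothetical) file for the new feature.
mkdir -p lib
echo "# cigar parsing goes here" > lib/cigar.rb

git status --short      # the new file is reported as untracked
git add lib/cigar.rb    # stage the change in the index
git status --short      # the file is now reported as staged
git commit -q -m "Add skeleton for Ensembl cigar format"   # record what is in the index

# The feature branch now has one commit more than master.
git --no-pager log --oneline
```

Because the commit went onto add_cigar_format, a “git checkout master” afterwards would show a working copy without lib/cigar.rb.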



Preparing the patch
When you think your change is ready for inclusion in the blessed repository (and you’ve included tests as well), you can create a patch file. To make sure that the blessed repository maintainers will have no problem merging your version, you will want to make the patch reflect the latest version of the blessed repository (step 5).
git remote add blessed git://github.com/bioruby/bioruby.git
git fetch blessed

So now you can check that the patch you will submit will only contain the changes that you want to be included in the blessed repository. One of the things to look out for is that there are no site-specific configurations in your branch (e.g. a hard-coded proxy or directory path, “STDERR.puts” statements, ...). Hopefully, you put all those site-specific changes in a separate commit as described above. To get rid of them, you just revert that commit. “git log” will show you the SHA1 of that particular commit (the long crazy string), and you just run “git revert [that_SHA1]”. After that, check your changes against the blessed master branch:
git log -p blessed/master..add_cigar_format
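To see why keeping site-specific changes in a commit of their own pays off, here is a small sketch of the revert step in a scratch repository. The files settings.txt and cigar.rb and the proxy address are made up for the example; instead of copying the SHA1 from “git log” by hand, the script stores it in a variable.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/bioruby"
cd "$demo/bioruby"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.org"
echo "timeout = 30" > settings.txt        # hypothetical configuration file
git add settings.txt
git commit -q -m "Initial commit"
git checkout -q -b add_cigar_format

# The site-specific change, kept in a commit of its own.
echo "proxy = proxy.example.org:8080" >> settings.txt
git commit -q -a -m "Site-specific: hard-code proxy"
site_commit=$(git rev-parse HEAD)         # the SHA1 that "git log" would show

# The actual work.
echo "# cigar parsing goes here" > cigar.rb
git add cigar.rb
git commit -q -m "Add skeleton for Ensembl cigar format"

# Undo the site-specific commit by adding a new commit that reverses it.
git revert --no-edit "$site_commit"

# What is left compared to master is only the real work.
git --no-pager diff --stat master add_cigar_format
```

Note that the revert does not remove the original commit from history; it adds a new commit that cancels it out.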

When that’s done, you can create the actual patch (step 6):
git format-patch blessed/master..add_cigar_format

This creates one patch file per commit, which you can send to the maintainer (step 7). And you’re done…
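The patch preparation can be rehearsed completely offline. In the sketch below, two local bare repositories stand in for bioruby/bioruby and for your fork on github; those directories, the seed repository and the file cigar.rb are all assumptions made for the example.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"

# Seed repository, only used to give the stand-ins some history.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Maintainer"
git config user.email "maintainer@example.org"
echo "placeholder" > README
git add README
git commit -q -m "Initial commit"
cd "$demo"

git clone -q --bare seed blessed.git       # stand-in for bioruby/bioruby
git clone -q --bare blessed.git fork.git   # stand-in for your fork

# The contributor's local clone with a feature branch.
git clone -q fork.git work
cd work
git config user.name "Demo User"
git config user.email "demo@example.org"
git checkout -q -b add_cigar_format
echo "# cigar parsing goes here" > cigar.rb
git add cigar.rb
git commit -q -m "Add skeleton for Ensembl cigar format"

# Fetch the latest blessed version and compare against it.
git remote add blessed "$demo/blessed.git"
git fetch -q blessed
git --no-pager log -p blessed/master..add_cigar_format

# One numbered .patch file per commit that is not yet in blessed/master.
git format-patch blessed/master..add_cigar_format
```

With a single commit on the feature branch, a single file starting with 0001- ends up in the current directory.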

A.2 The maintainer


The maintainer gets an email from someone containing a patch. The first thing to do is to create a new branch and apply the patch to that branch.
git checkout -b feature_c
git am <0001-feature_C_commit_message.patch

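Of course he would want to check the changes by comparing the new version of the code with the one that is in the blessed repository (i.e. the master). The following lists the commits that are in feature_c but not yet in master, together with their diffs:
git log -p master..feature_c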

If everything looks OK, he can then merge the changes into master itself and push it up onto github.
git checkout master
git merge feature_c
git push

And he’s done. The only thing left to do is remove the branch he created during the process.
git branch -d feature_c
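Here is the maintainer’s side as a sketch that can be run in a scratch directory. The patch is fabricated locally (normally it would arrive by email), the inbox directory and file names are invented, and the final “git push” is left out because there is no github remote in this example.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/blessed"
cd "$demo/blessed"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Maintainer"
git config user.email "maintainer@example.org"
echo "placeholder" > README
git add README
git commit -q -m "Initial commit"

# Fabricate the patch a contributor might send, then throw the branch away.
git checkout -q -b contributor_work
echo "# cigar parsing goes here" > cigar.rb
git add cigar.rb
git commit -q -m "Add skeleton for Ensembl cigar format"
git format-patch -q -o "$demo/inbox" master..contributor_work
git checkout -q master
git branch -q -D contributor_work

# From here on: what the maintainer does with the received patch.
git checkout -q -b feature_c
git am "$demo"/inbox/0001-*.patch           # apply the patch as a commit
git --no-pager log -p master..feature_c     # review what the patch added
git checkout -q master
git merge feature_c                         # fast-forwards master in this example
git branch -d feature_c                     # clean up the temporary branch
```

Applying the patch on a separate branch first means master stays clean if the patch turns out not to apply or not to pass review.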


B. Using a pull request


B.1 The contributor


This type of contribution starts out exactly the same as the one with patches: you fork/clone, create a feature branch and hack away.



Preparing the pull
When you think your change is ready for inclusion in the blessed repository, you will create a branch specific for this pull (e.g. called to_pull; step 5): “git checkout -b to_pull”.

To make sure that the blessed repository maintainers will have no problem merging your version, you have to rebase your branch (steps 6 and 7).
git remote add blessed git://github.com/bioruby/bioruby.git
git fetch blessed
git rebase blessed/master
git checkout blessed/master fileA_for_user_environment_only
git checkout blessed/master fileB_for_user_environment_only

At this point, a “git log -p blessed/master..to_pull” can help you check that the differences between your to_pull branch and the blessed branch only contain the changes that you intend to be pulled (e.g. getting rid of “STDERR.puts” statements).

When you’re satisfied, you can put the to_pull branch onto your remote repository so it becomes available for the maintainers of the blessed repository (step 8):
git push origin to_pull:refs/heads/to_pull
and push the “Send pull request” button on github.

After that, wait for news on whether your change is accepted or not. When your remote to_pull branch becomes obsolete, you can remove it (step 10) with
git push origin :to_pull
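The pull preparation can also be tried offline. As before, this is only a sketch: local bare repositories play the role of bioruby/bioruby and of your fork, the file cigar.rb is invented, and the “Send pull request” button obviously has no local equivalent. The blessed stand-in gets an extra commit halfway through so that the rebase has something to do.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Maintainer"
git config user.email "maintainer@example.org"
echo "placeholder" > README
git add README
git commit -q -m "Initial commit"
cd "$demo"
git clone -q --bare seed blessed.git       # stand-in for bioruby/bioruby
git clone -q --bare blessed.git fork.git   # stand-in for your fork

# Contributor: clone the fork and do some work on a feature branch.
git clone -q fork.git work
cd work
git config user.name "Demo User"
git config user.email "demo@example.org"
git checkout -q -b add_cigar_format
echo "# cigar parsing goes here" > cigar.rb
git add cigar.rb
git commit -q -m "Add skeleton for Ensembl cigar format"

# Meanwhile, the blessed repository moves on.
cd "$demo/seed"
echo "more text" >> README
git commit -q -a -m "Update README"
git push -q "$demo/blessed.git" master

# Contributor: prepare the branch to be pulled.
cd "$demo/work"
git checkout -q -b to_pull
git remote add blessed "$demo/blessed.git"
git fetch -q blessed
git rebase blessed/master                       # replay your commits on top of blessed
git --no-pager log -p blessed/master..to_pull   # should show only your own change
git push -q origin to_pull:refs/heads/to_pull   # publish the branch on the fork

# Later, once the branch is obsolete:
# git push origin :to_pull
```

After the rebase, to_pull sits directly on top of blessed/master, so the maintainer can merge it without conflicts caused by you being out of date.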


B.2 The maintainer



The first thing the maintainer has to do, is get the latest version of his own (i.e. the blessed) repository.
git clone git@github.com:bioruby/bioruby.git

Then he can get a copy of your to_pull branch:
git remote add your_name git://github.com/your_name/bioruby.git
git checkout -b your_name/to_pull
git pull your_name to_pull

...and check what the change looks like.
git log -p master..your_name/to_pull

If he’s satisfied, he can switch back to the blessed master branch and merge your changes into it.
git checkout master
git merge your_name/to_pull

If there are no conflicts, he can then push the new version up onto github:
git push
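For completeness, a sketch of the maintainer’s side of a pull request, again in scratch directories. Note one deliberate substitution: instead of creating a local branch called your_name/to_pull and pulling into it, this version uses “git fetch” and reviews the remote-tracking branch your_name/to_pull directly, which avoids having two refs with the same short name. The repositories and the file cigar.rb are invented, and the final “git push” is omitted because there is no github remote here.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"

# The maintainer's clone of the blessed repository (stand-in).
git init -q blessed_clone
cd blessed_clone
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Maintainer"
git config user.email "maintainer@example.org"
echo "placeholder" > README
git add README
git commit -q -m "Initial commit"
cd "$demo"

# The contributor's repository with a to_pull branch (stand-in for your fork).
git clone -q blessed_clone contributor
cd contributor
git config user.name "Demo User"
git config user.email "demo@example.org"
git checkout -q -b to_pull
echo "# cigar parsing goes here" > cigar.rb
git add cigar.rb
git commit -q -m "Add skeleton for Ensembl cigar format"

# Maintainer: fetch the contributor's branches, review, merge.
cd "$demo/blessed_clone"
git remote add your_name "$demo/contributor"
git fetch -q your_name                              # creates your_name/to_pull
git --no-pager log -p master..your_name/to_pull     # review the incoming change
git merge your_name/to_pull                         # fast-forwards master in this example
```

The maintainer never leaves master in this variant, so there is no risk of merging into the wrong branch.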


Useful links



Who we are


Tuesday, 3 June 2008

Would you want to contribute to a small open-source project?

Just a quick plug to see if I can find people interested in helping me out in some of my projects.

In the last 2 years, I started four open source projects (well: the last one was today...), each of which scratches my own itch and does what it needs to do for me. However, some features will have to be added and bugs be fixed to make them more useful to others (you, that is...).

If you are using one of these projects, please think about contributing. With that all-new fancy git version control system, it should be simpler than ever to get your own copy, tweak things a bit and send the improvements back.

1. Bio::Graphics
The Bio::Graphics library "allows for drawing overviews of genomic regions, similar to the pictures drawn by gbrowse" (from the homepage). I believe dgtized (Charles Comstock) and I have not done a bad job in creating something that is really useful, but of course some features are still missing.

Some things that come to mind that need help:
  • Find a good description on how to install it on a Mac (I've just changed job and am trying to do that, resulting in a major headache and no bio-graphics still).
  • Add new features such as a type of track to show continuous data (e.g. GC-content). We're thinking about ways to implement this, but additional ideas are welcome to get that done.
  • Is someone working on a gbrowse-like application in ruby?
  • Although I haven't found it a bottleneck yet, it might be a good idea to look at the performance of the thing.

A full tutorial and more information can be found here. The actual code repository has been moved from rubyforge to github and can be downloaded at http://github.com/jandot/bio-graphics. To get your copy: git clone git://github.com/jandot/bio-graphics.git

2. Ruby API to the Ensembl database
In late spring 2007, I started the ruby API to the Ensembl database. This API relies on ActiveRecord and very much tries to copy the functionality of the perl API (including transfer of coordinates between coordinate systems). Just recently I was glad to hear people at the Sanger here are looking/have looked into it as well. At the moment, only the core database is covered but it would be nice if the other ones (funcgen, variation) would be added as well. In addition, the API was developed based on Ensembl release 45. With a new release coming out every few months, the API has to be tested against those as well.

What needs help:
  • Add an API for the other databases, including variation and funcgen.
  • Keep the API testing going for each new release of the database.
The full tutorial can be found on rubyforge, but the source control has also moved to github at http://github.com/jandot/ruby-ensembl-api. Get your copy using git clone git://github.com/jandot/ruby-ensembl-api.git

3. Ruby API to the UCSC database
I just started out the API for the UCSC database. There is data available in that database that cannot be found in Ensembl yet, for example copy number variations. So I started a new project (now solely on github) by copy-paste-modifying some code of the Ensembl API. Unfortunately, there are 3415 tables in the hg18 database (yes, that's three thousand four hundred and fifteen). Obviously, I only created interfaces for the tables that I will need at work.

What needs help:
  • Add additional tables to the API.
  • Think about additional functionality that might be added to some of the models.
  • A tutorial.
Again, get your copy from http://github.com/jandot/ruby-ucsc-api.

4. Simple Project Logger
As mentioned in my previous post, I use a simple rails application to keep track of the things I work on; a digital labbook, basically. That application is called Simple Project Logger (or its unix name sprolog). Its only task is to allow me to create tasks within projects and log what I've done for each task. Sprolog works well enough for me at the moment, but there is a bad bug I can't get out. In addition, it just looks ugly.

So if you're interested:
  • The authentication doesn't work yet. You can login using OpenID, but it is still possible to view anyone's projects and tasks by just typing the full URL. I know it will need a "before_filter :login_required", but that just breaks the thing.
  • Sprolog could definitely use some CSS-love.
You can download sprolog from http://github.com/jandot/sprolog.

I'll buy you a beer.