Wednesday, 19 May 2010

Threads in ruby: probably not how to use them

I should create an online labbook with code examples of how I do things. Keep going back to an example script I have to copy/paste the code for handling different threads in ruby. But I'll put it here for the moment :-)

Suppose I have a file with several millions of lines containing information on SNPs. And suppose I have a database that already contains data for those SNP. And suppose I want to update the entries in the database with the data from the input file.

Please note: this is a quick hack.

require 'rubygems'
require 'progressbar'

MAX_NR_OF_THREADS = 5
nr_of_lines = `wc -l input_file.tsv`.split(/ /)[0].to_i
pbar = ProgressBar.new('processing', nr_of_lines.to_f/MAX_NR_OF_THREADS)
File.open(input_file.tsv).each_slice(MAX_NR_OF_THREADS) do |slice|
pbar.inc
threads = Hash.new
slice.each do |line|
threads[line] = Thread.new do
# do the actual line parsing, DB lookup and DB updates
end
end
threads.values.each do |thread|
thread.join
end
end
pbar.finish

I know this is far from perfect:
  • I shouldn't need to create that array.
  • This way all concurrent threads wait for each other before the next slice is taken from the input file. If one of the 5 threads takes a really long time, the other ones will wait but could instead start parsing the next lines in the input file.

Don't think less of me for this code...