Suppose I have a file with several million lines containing information on SNPs. And suppose I have a database that already contains data for those SNPs. And suppose I want to update the entries in the database with the data from the input file.
Please note: this is a quick hack.
require 'progressbar'

MAX_NR_OF_THREADS = 5
nr_of_lines = `wc -l input_file.tsv`.split.first.to_i
pbar = ProgressBar.new('processing', (nr_of_lines.to_f / MAX_NR_OF_THREADS).ceil)
File.open('input_file.tsv').each_slice(MAX_NR_OF_THREADS) do |slice|
  threads = Hash.new
  slice.each do |line|
    threads[line] = Thread.new do
      # do the actual line parsing, DB lookup and DB updates
    end
  end
  threads.values.each { |thread| thread.join }
  pbar.inc
end
pbar.finish
I know this is far from perfect:
- I shouldn't need to collect the threads in that hash just to join them afterwards.
- This way, all concurrent threads wait for each other before the next slice is taken from the input file. If one of the five threads takes a really long time, the other four sit idle when they could already be parsing the next lines in the input file.
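One way around that second point would be a producer/consumer setup: a fixed pool of workers pulling lines off a shared Queue, so a slow line only ever ties up one worker. This is just a sketch, not what I actually ran — the :done sentinel and the upcase call are stand-ins for the real line parsing and DB updates:

```ruby
require 'thread'

NR_OF_WORKERS = 5
queue   = Queue.new   # lines waiting to be processed
results = Queue.new   # thread-safe collector, stand-in for the DB updates

# Fixed pool: each worker keeps pulling lines until it sees the sentinel.
workers = NR_OF_WORKERS.times.map do
  Thread.new do
    while (line = queue.pop) != :done
      # here: parse the line, look up the SNP, update the database
      results << line.upcase   # placeholder for the real work
    end
  end
end

# Stand-in for File.foreach('input_file.tsv') feeding the queue.
["rs123\tA", "rs456\tG"].each { |line| queue << line }
NR_OF_WORKERS.times { queue << :done }   # one sentinel per worker
workers.each(&:join)
```

Because Queue#pop blocks, the workers simply wait when the queue is empty, and no slice-sized barrier is needed.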
Don't think less of me for this code...