As a recap from my previous post: I'll be running an equivalent of the following:
cat snps.txt | ruby snp_mapper.rb | sort | ruby snp_reducer.rb
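The real snp_mapper.rb and snp_reducer.rb are described in that previous post; as a reminder of the shape a streaming pair takes, here is a minimal sketch in the same spirit. The tab-separated "chromosome, position" input format is an assumption for illustration, not necessarily the real snps.txt layout:

```ruby
# Hypothetical streaming mapper/reducer pair. The "chromosome<TAB>position"
# input format is assumed for this sketch; the real snps.txt may differ.

# Mapper: take one input line, emit a "key<TAB>value" pair.
def map_line(line)
  chrom, pos = line.chomp.split("\t")
  "#{chrom}\t#{pos}"
end

# Reducer: receive the sorted mapper output and aggregate per key
# (here: count SNPs per chromosome).
def reduce_lines(lines)
  counts = Hash.new(0)
  lines.each do |line|
    chrom, _pos = line.chomp.split("\t")
    counts[chrom] += 1
  end
  counts.map { |chrom, n| "#{chrom}\t#{n}" }
end

# Simulate the pipeline locally: cat snps.txt | mapper | sort | reducer
input  = ["chr2\t4321", "chr1\t1234", "chr1\t5678"]
sorted = input.map { |l| map_line(l) }.sort
puts reduce_lines(sorted)
```

In an actual streaming script the mapper would loop over STDIN.each_line and puts each pair, and Hadoop's own sort phase replaces the Array#sort call above.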
First thing the Sanger wiki told me was to format my HDFS space:
hadoop namenode -format
This apparently only affects my own space... After that I could start playing with hadoop.
Where am I?
It looks like hadoop installs its own complete filesystem: even though my path on the server is /home/users/a/aerts, a
hadoop fs -lsr /
shows that in the HDFS system I'm at /user/aerts.
Preparing the run
First off: create a directory to work in. Let's call that "locustree" with two subdirectories, called "input" and "output".
hadoop fs -mkdir locustree
hadoop fs -mkdir locustree/input
hadoop fs -mkdir locustree/output
And copy your datafile to the input directory:
hadoop fs -put snps.txt locustree/input/
Running the run
The Hadoop documentation mentions that a shebang line in the mapper and reducer scripts is likely not to work, so you have to call ruby explicitly. Provide both scripts as "-file" arguments to hadoop. Finally, from the local directory containing the scripts, run the following command to launch the mapreduce job:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-input locustree/input/snps.txt \
-mapper "/usr/local/bin/ruby snp_mapper.rb" \
-reducer "/usr/local/bin/ruby snp_reducer.rb" \
-output locustree/output/snp_index \
-file snp_mapper.rb \
-file snp_reducer.rb
Et voila: the new directory locustree/output/snp_index now contains the file spit out by the snp_reducer.rb script.