MapReduce with Hadoop Streaming in bash – Part 1
Hadoop Streaming
Hadoop Streaming allows you to write MapReduce code in any language that can process stdin and stdout. This includes Python, PHP, Ruby, Perl, bash, node.js, and tons of others. I’m a huge fan of node and PHP but not everyone knows those. Python is in demand and I’m working to learn it, but I’m nowhere near ready yet. So I went for bash, since most Oracle-heads and other Linux lovers know it.
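To make that stdin/stdout contract concrete, here’s a minimal sketch of what Streaming actually expects from a script. This trivial mapper isn’t used anywhere below; it’s purely illustrative:

#!/bin/bash
# Minimal Streaming mapper sketch: read lines from stdin, write
# tab-separated key/value lines to stdout. Here each input line
# becomes the key, with a count of 1 as the value.
while read line; do
    printf "%s\t%s\n" "$line" "1"
done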
The algorithm I’m using is TF-IDF, which stands for Term Frequency – Inverse Document Frequency. According to Wikipedia, TF-IDF is “a numerical statistic which reflects how important a word is to a document in a collection or corpus”. It’s useful for search ranking, collaborative filtering, and other tasks. In this article (Part 1), we’re going to calculate term frequency by grabbing the lines of each file, parsing out all the words (map), then summing them up to show the frequency of each word per document (reduce).
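For reference, here’s one common formulation of the full statistic we’re building toward across the series (notation varies from source to source):

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of times term t appears in document d (today’s job), df(t) is the number of documents containing t (Part 2’s job), and N is the total number of documents in the corpus.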
The Setup
To set this up I’m using the Cloudera QuickStart VM. This is a phenomenal resource that is preconfigured with CDH4.3 and tons of extra tools. The data I’m working with is small and simple since I’m running in pseudo-distributed mode on a VM, consisting of 8 Stephen Crane poems (my favorite) in text format.
[training@localhost steve]$ pwd
/home/training/steve
[training@localhost steve]$ ls crane
a_man_said_to_the_universe.txt  a_newspaper.txt    met_a_seer.txt            truth.txt
a_man_saw_a_ball_of_gold.txt    a_spirit_sped.txt  pursuing_the_horizon.txt  walked_in_a_desert.txt
[training@localhost steve]$ cat crane/pursuing_the_horizon.txt
I saw a man pursuing the horizon;
Round and round they sped.
I was disturbed at this;
I accosted the man.
"It is futile," I said,
"You can never -- "
"You lie," he cried,
And ran on.
I had to load this data into Hadoop, so I made a ‘crane’ directory and put the files in there.
[training@localhost steve]$ hadoop fs -mkdir crane
[training@localhost steve]$ hadoop fs -put crane/* crane
[training@localhost steve]$ hadoop fs -ls crane
Found 8 items
-rw-r--r--   1 training supergroup        137 2013-10-01 00:41 crane/a_man_said_to_the_universe.txt
-rw-r--r--   1 training supergroup        322 2013-10-01 00:41 crane/a_man_saw_a_ball_of_gold.txt
-rw-r--r--   1 training supergroup        747 2013-10-01 00:41 crane/a_newspaper.txt
-rw-r--r--   1 training supergroup        439 2013-10-01 00:41 crane/a_spirit_sped.txt
-rw-r--r--   1 training supergroup        350 2013-10-01 00:41 crane/met_a_seer.txt
-rw-r--r--   1 training supergroup        192 2013-10-01 00:41 crane/pursuing_the_horizon.txt
-rw-r--r--   1 training supergroup        452 2013-10-01 00:41 crane/truth.txt
-rw-r--r--   1 training supergroup        208 2013-10-01 00:41 crane/walked_in_a_desert.txt
And we’re set!
The Mapper
So here’s the mapper (maptf.sh), which reads the lines of whatever file is sent to it, tokenizes them, then emits keys and values (tab-separated).
[training@localhost steve]$ cat maptf.sh
#!/bin/bash

exclude="\.\,?!\-_:;\]\[\#\|\$()\""

while read split; do
    for word in $split; do
        term=`echo "${word//[$exclude]/}" | tr [:upper:] [:lower:]`
        if [ -n "$term" ]; then
            printf "%s\t%s\t%s\n" "$term" "$map_input_file" "1"
        fi
    done
done
Let’s go through the code:
- Define the exclude variable. This variable holds the punctuation characters (used as a bracket-expression pattern) that will be stripped out during the map.
- Main loop. This reads stdin (while read) into a variable called ‘split’, one line at a time.
- Inner loop. For each word in the ‘split’ variable (using bash’s native word splitting).
- Set the ‘term’ variable equal to the current word, with the characters from the ‘exclude’ variable stripped out and the result converted to lowercase (there’s a quick prompt-level check of this expansion just after this list).
- Make sure ‘term’ isn’t empty.
- Print the output in the form of: term-inputfile-1 (with tabs instead of dashes). Inputfile in this case comes from the environment variable ‘map_input_file’. This is the standard MapReduce property normally written as map.input.file; Hadoop Streaming turns the periods into underscores because periods aren’t valid in shell variable names.
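One nice thing about that parameter expansion is that you can poke at it directly at the prompt. Here’s a quick check with a made-up word (illustrative only):

[training@localhost steve]$ exclude="\.\,?!\-_:;\]\[\#\|\$()\""
[training@localhost steve]$ word='"futile,"'
[training@localhost steve]$ echo "${word//[$exclude]/}" | tr [:upper:] [:lower:]
futile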
The cool part is that since this is a shell script, we can test it at the command prompt by cat-ing a file and piping it through the script. Note that I’m setting the ‘map_input_file’ variable manually for the test so I get the proper output.
[training@localhost steve]$ export map_input_file=crane/pursuing_the_horizon.txt
[training@localhost steve]$ cat crane/pursuing_the_horizon.txt | ./maptf.sh
i        crane/pursuing_the_horizon.txt  1
saw      crane/pursuing_the_horizon.txt  1
a        crane/pursuing_the_horizon.txt  1
man      crane/pursuing_the_horizon.txt  1
pursuing crane/pursuing_the_horizon.txt  1
the      crane/pursuing_the_horizon.txt  1
horizon  crane/pursuing_the_horizon.txt  1
... (and so on)
At this point it’s no different from a simple word count Mapper, which is essentially what the term frequency portion of this algorithm is, except that it takes the file as well as the term into account.
The Reducer
The Reducer is where we’ll aggregate the data that was emitted. The 8 files that serve as input to this MapReduce job will be split across 8 Mappers, each running maptf.sh against its own input split. The results are then put through the ‘shuffle and sort’ phase, where the keys are sorted (the first two output columns are the key in this case; more on this later) and sent to the reducer(s). The reducer then takes all the data and aggregates it into the final format. Our reducer will take the Map data in the form (term-file-1) and sum it up to (term-file-termfrequency).
[training@localhost steve]$ cat redtf.sh
#!/bin/bash

read currterm currfile currnum

while read term file num; do
    if [[ $term = "$currterm" ]] && [[ $file = "$currfile" ]]; then
        currnum=$(( currnum + num ))
    else
        printf "%s\t%s\t%s\n" "$currterm" "$currfile" "$currnum"
        currterm="$term"
        currfile="$file"
        currnum="$num"
    fi
done

printf "%s\t%s\t%s\n" "$currterm" "$currfile" "$currnum"
- Read the first line, putting the fields into the variables ‘currterm’, ‘currfile’, and ‘currnum’
- Loop through the rest of the file, putting new terms into the variables ‘term’, ‘file’, and ‘num’
- Check to see if the latest term matches the previous term and the latest file matches the previous file. Remember, this works because the input to a reducer is ALWAYS sorted by key! The magic of shuffle and sort.
- Set ‘currnum’ equal to ‘currnum’ plus the latest value of ‘num’ (always 1 in this case)
- Else… (no match, it’s a new term/file combo)
- Print the current term, current file, and current sum in tab delimited format.
- Set ‘currterm’ equal to the latest ‘term’
- Set ‘currfile’ equal to the latest ‘file’
- Set ‘currnum’ equal to the latest ‘num’
- Keep doing that until the loop’s exhausted, then print the final value. (A hand-fed example follows this list.)
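To see the aggregation logic in isolation, you can hand-feed the reducer a few pre-sorted records right at the prompt; the file name here is made up:

[training@localhost steve]$ printf 'and\tfake.txt\t1\nand\tfake.txt\t1\nthe\tfake.txt\t1\n' | ./redtf.sh
and  fake.txt  2
the  fake.txt  1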
Fun, right? What’s cool is that we can test this the same way we tested the mapper, as long as we sort first. Remember, sorting has to be done on the first two columns, which make up the key. So:
[training@localhost steve]$ cat crane/pursuing_the_horizon.txt | ./maptf.sh | sort -k1,2 | ./redtf.sh
accosted  crane/pursuing_the_horizon.txt  1
a         crane/pursuing_the_horizon.txt  1
and       crane/pursuing_the_horizon.txt  2
at        crane/pursuing_the_horizon.txt  1
... (and so on)
And that’s our expected result.
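In fact, since each Mapper sees one whole file here, we can simulate the entire job locally with a quick loop over all eight poems. This is just a sketch of the same pipeline, so the file column will show local paths rather than the full HDFS URIs you’ll see from the real job:

[training@localhost steve]$ for f in crane/*.txt; do map_input_file="$f" ./maptf.sh < "$f"; done | sort -k1,2 | ./redtf.sh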
Hadoop It Up
The time has finally come to run our MapReduce script. To do this we’re going to use the ‘hadoop’ command with the Hadoop Streaming JAR file included with the distro. Here’s the command we’ll use:
[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 -input crane -output crane_out \
    -mapper /home/training/steve/maptf.sh -reducer /home/training/steve/redtf.sh
NOTE: The backslashes (\) are just to say that I’m splitting the command up over multiple lines.
This command is doing a few critical things. First, it says we want to run the hadoop-streaming.jar file. The -D flag then lets us set generic options (configuration properties) for the job.
The first one is absolutely critical: stream.num.map.output.key.fields=2. This tells the MapReduce job that the first two tab-separated fields output by the Mapper make up the key. It’s critical because the shuffle and sort phase has to sort on the full key for the reducer to work properly. Keys matter in MapReduce jobs written in any language, but only Hadoop Streaming needs this parameter: a Java job declares its key type explicitly, while a streaming job just emits lines of text.
The next parameter is the ‘-input’ option, which is the HDFS location of the input files. It can be either a directory or a glob pattern. Next comes ‘-output’, the HDFS location where the output should be written. This directory MUST NOT exist. Then we define the ‘-mapper’ and ‘-reducer’ parameters, pointing them at my shell scripts. Simple.
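One practical consequence of that last rule: to re-run the job you have to clear the output directory first. On CDH4 something like this should do it (be careful, it deletes recursively):

[training@localhost steve]$ hadoop fs -rm -r crane_out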
You can see the output of this command here:
[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -input crane -output crane_out -mapper /home/training/steve/maptf.sh -reducer /home/training/steve/redtf.sh
packageJobJar: [/tmp/hadoop-training/hadoop-unjar4001401820102363860/] [] /tmp/streamjob4042079727913400227.jar tmpDir=null
13/10/01 01:38:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/01 01:38:37 WARN snappy.LoadSnappy: Snappy native library is available
13/10/01 01:38:37 INFO snappy.LoadSnappy: Snappy native library loaded
13/10/01 01:38:37 INFO mapred.FileInputFormat: Total input paths to process : 8
13/10/01 01:38:37 INFO mapred.JobClient: Running job: job_201309292255_0058
13/10/01 01:38:38 INFO mapred.JobClient:  map 0% reduce 0%
13/10/01 01:38:45 INFO mapred.JobClient:  map 25% reduce 0%
13/10/01 01:38:51 INFO mapred.JobClient:  map 50% reduce 0%
13/10/01 01:38:56 INFO mapred.JobClient:  map 75% reduce 16%
13/10/01 01:38:59 INFO mapred.JobClient:  map 87% reduce 16%
13/10/01 01:39:00 INFO mapred.JobClient:  map 100% reduce 16%
13/10/01 01:39:02 INFO mapred.JobClient:  map 100% reduce 100%
13/10/01 01:39:02 INFO mapred.JobClient: Job complete: job_201309292255_0058
13/10/01 01:39:02 INFO mapred.JobClient: Counters: 33
13/10/01 01:39:02 INFO mapred.JobClient:   File System Counters
13/10/01 01:39:02 INFO mapred.JobClient:     FILE: Number of bytes read=34933
13/10/01 01:39:02 INFO mapred.JobClient:     FILE: Number of bytes written=1758451
13/10/01 01:39:02 INFO mapred.JobClient:     FILE: Number of read operations=0
13/10/01 01:39:02 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/10/01 01:39:02 INFO mapred.JobClient:     FILE: Number of write operations=0
13/10/01 01:39:02 INFO mapred.JobClient:     HDFS: Number of bytes read=3750
13/10/01 01:39:02 INFO mapred.JobClient:     HDFS: Number of bytes written=21678
13/10/01 01:39:02 INFO mapred.JobClient:     HDFS: Number of read operations=17
13/10/01 01:39:02 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/10/01 01:39:02 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/10/01 01:39:02 INFO mapred.JobClient:   Job Counters
13/10/01 01:39:02 INFO mapred.JobClient:     Launched map tasks=8
13/10/01 01:39:02 INFO mapred.JobClient:     Launched reduce tasks=1
13/10/01 01:39:02 INFO mapred.JobClient:     Data-local map tasks=8
13/10/01 01:39:02 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=40278
13/10/01 01:39:02 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=16523
13/10/01 01:39:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/10/01 01:39:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/01 01:39:02 INFO mapred.JobClient:   Map-Reduce Framework
13/10/01 01:39:02 INFO mapred.JobClient:     Map input records=114
13/10/01 01:39:02 INFO mapred.JobClient:     Map output records=516
13/10/01 01:39:02 INFO mapred.JobClient:     Map output bytes=33895
13/10/01 01:39:02 INFO mapred.JobClient:     Input split bytes=903
13/10/01 01:39:02 INFO mapred.JobClient:     Combine input records=0
13/10/01 01:39:02 INFO mapred.JobClient:     Combine output records=0
13/10/01 01:39:02 INFO mapred.JobClient:     Reduce input groups=326
13/10/01 01:39:02 INFO mapred.JobClient:     Reduce shuffle bytes=34975
13/10/01 01:39:02 INFO mapred.JobClient:     Reduce input records=516
13/10/01 01:39:02 INFO mapred.JobClient:     Reduce output records=326
13/10/01 01:39:02 INFO mapred.JobClient:     Spilled Records=1032
13/10/01 01:39:02 INFO mapred.JobClient:     CPU time spent (ms)=3520
13/10/01 01:39:02 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1265045504
13/10/01 01:39:02 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3495202816
13/10/01 01:39:02 INFO mapred.JobClient:     Total committed heap usage (bytes)=1300004864
13/10/01 01:39:02 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/10/01 01:39:02 INFO mapred.JobClient:     BYTES_READ=2847
13/10/01 01:39:02 INFO streaming.StreamJob: Output directory: crane_out
Now we can go look at our results to see how the job did. The results will be in the ‘crane_out’ directory as specified by the hadoop command. So let’s take a look:
[training@localhost steve]$ hadoop fs -ls crane_out
Found 3 items
-rw-r--r--   1 training supergroup          0 2013-10-01 01:39 crane_out/_SUCCESS
drwxr-xr-x   - training supergroup          0 2013-10-01 01:38 crane_out/_logs
-rw-r--r--   1 training supergroup      21678 2013-10-01 01:38 crane_out/part-00000
The ‘part-00000’ file is our output. By default, MapReduce ignores files that begin with an underscore (_) or a period (.). The output of the MapReduce job produced two ignorable files and one ‘part’ file which was the output of the single reducer used to aggregate our numbers. If this were a bigger dataset with more reducers, we’d have more part files.
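For what it’s worth, the reducer count is tunable. Here’s a hedged sketch of running with four reducers and then pulling the part files back into one local file; the crane_out4 name is made up, and mapred.reduce.tasks is the classic MR1 property name (which matches the mapred.JobClient output above):

[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 -D mapred.reduce.tasks=4 \
    -input crane -output crane_out4 \
    -mapper /home/training/steve/maptf.sh -reducer /home/training/steve/redtf.sh
[training@localhost steve]$ hadoop fs -getmerge crane_out4 crane_out4.tsv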
So let’s take a look at our final output:
[training@localhost steve]$ hadoop fs -cat crane_out/part-00000
a hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2
a hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4
a hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 14
a hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3
a hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
a hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
a hdfs://0.0.0.0:8020/user/training/crane/truth.txt 12
a hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3
accosted hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
achieved hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
addressed hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
again hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
ages hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
agony hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
ah hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
ah hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
already hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
am hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
and hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
and hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 3
and hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3
and hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
and hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2
and hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
and hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
another's hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
are hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
as hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
at hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
aye hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
aye hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
ball hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4
bawled hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
been hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
before hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
began hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
believed hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
black hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
black hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
blind hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
book hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
boys hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
breath hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
but hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
but hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
by hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
by hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4
called hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
calling hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
can hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
cavern hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
child hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
chronicle hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
clay hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
climbed hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
collection hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2
concentrating hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
court hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
created hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
crevice hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
cried hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
cried hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2
crowd hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
crowned hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
cuddle hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
curious hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
dead hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
death hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
deathslime hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
denial hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
desert hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3
dire hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
disturbed hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
earth hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
echoes hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
error hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
eternal hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
even hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
eventually hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
eventually hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
ever hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
every hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
exist hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
fact hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
families hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
feckless hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
fenceless hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
fireside hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
fleetly hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
for hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
for hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
for hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
fortress hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
freedom hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
from hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
from hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
from hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
from hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
futile hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
game hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
garment hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
god hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 10
god hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
gold hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4
grown hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
had hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
halfinjustices hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
hand hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
hands hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
has hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
have hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
have hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3
he hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
he hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 6
he hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4
he hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
heat hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
heavens hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
held hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
hem hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
highest hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
him hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
him hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
his hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
his hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
hold hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
honest hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
horizon hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
horizon hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
however hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
i hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
i hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 6
i hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 4
i hdfs://0.0.0.0:8020/user/training/crane/truth.txt 5
i hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3
in hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
in hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
in hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
in hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
in hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
in hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
into hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
is hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
is hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 7
is hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
is hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
is hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
is hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2
it hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 5
it hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
it hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
it hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
it hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
it hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2
its hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2
its hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3
joys hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
kindly hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
know hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
let hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
lie hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
life's hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
lived hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
lo hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
lone hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
long hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
looked hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
looks hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
loud hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
mad hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
man hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
man hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
man hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2
market hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
me hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
me hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
me hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
me hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
melons hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
men hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2
merciful hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
met hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
mighty hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
mile hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2
million hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
mocked hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
much hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
never hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
never hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
newspaper hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 5
night hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
no hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
no hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2
not hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
not hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
now hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
obligation hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
of hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
of hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4
of hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 5
of hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3
of hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
of hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
often hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
on hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
one hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
opened hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
opinion hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
part hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
phantom hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
place hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
plains hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
player hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
pursued hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
pursuing hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
ran hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
read hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
remote hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
replied hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
roaming hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
rock hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
round hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2
said hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
said hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
said hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
said hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
said hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2
sand hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
saw hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
saw hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
scores hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
screamed hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
second hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
seer hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
sells hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
sense hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
shadow hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
should hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
sir hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
sir hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
skill hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
sky hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
sky hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
smiled hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
smote hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
sneering hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
so hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
space hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
spaces hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
sped hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
sped hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
spirit hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
spreads hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
spurred hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
squalor hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
strange hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
strange hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
stupidities hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
suddenly hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
swift hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
sword hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
symbol hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
take hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
tale hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
tales hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
that hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
that hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 3
the hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 3
the hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 7
the hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4
the hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
the hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2
the hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2
the hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4
the hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3
their hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
then hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
then hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
there hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
there hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
they hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
think hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
this hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2
this hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
this hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
through hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
through hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
to hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1
to hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
to hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2
to hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3
touched hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
tower hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
traveller hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3
tried hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
truth hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3
unfairly hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
unhaltered hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
universe hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2
vacant hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
valleys hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
victory hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
voice hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2
walked hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
was hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4
was hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1
was hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1
was hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
well hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
went hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
went hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2
when hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1
when hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
whence hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
where hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 3
which hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
which hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
while hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2
wind hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2
wins hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
wisdom hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
wisdom hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
world hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1
world hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
you hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
you hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2
As you can see, each output record consists of a term, a filename, and a count (term frequency).
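As a quick sanity check against the job counters above, the counts in the third column should sum back to the 516 records the Mappers emitted:

[training@localhost steve]$ hadoop fs -cat crane_out/part-00000 | awk -F'\t' '{sum += $3} END {print sum}'
516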
Special Note
Daniel Templeton left a very important note in the comments. In these examples I am running my scripts from the local filesystem; however, it’s a much better practice to load them into HDFS. Running on a VM is great but can make you lazy…once you move on to running on a cluster it will make a huge difference! He offered up this example:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 -input crane -output crane_out \
    -file ./maptf.sh -mapper maptf.sh -file ./redtf.sh -reducer redtf.sh
Conclusion
Now normally you’d want to check your words against a stoplist and rule out all the common ones like ‘a’ and ‘and’ and such. However, since this is a small dataset and Stephen Crane was a man of few words, we’ll leave them in to see how our final algorithm holds up. (A sketch of what a stoplist check could look like follows below.)
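For the curious, one way to bolt a stoplist onto the mapper would be a small case check just before the printf. The word list here is made up and deliberately tiny; this is just a sketch, not something I’ve wired into the job:

# Hypothetical stoplist filter for maptf.sh, placed just before the printf.
# Spaces around each word let the case pattern match whole words only.
stop=" a an and the is it of to in "
case "$stop" in
    *" $term "*) continue ;;    # skip common words
esac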
What we just calculated is the crucial ‘term frequency’ part of the TF-IDF algorithm. In the next part, we’ll be calculating the number of documents each term appears in (document frequency), an important part of the IDF portion of the algorithm. We’ll do this with another MapReduce job using different code, using the output from today’s job as input. See you then!
Oh yeah, one more thing. I’m not the best bash coder out there so if I could have coded the two functions better let me know! I tried using arrays first but that was slooooooow.
Added Note: I uploaded the source data and scripts into GitHub and will add new scripts as the three part blog tutorial moves forward. The Cloudera VM comes preconfigured with git.
Comments
This is a quality blog post for sure, but I think it is worth mentioning that this approach looks a little cumbersome for computing TF-IDF. I get that you chose TF-IDF because it’s a simple and popular computation but a better solution is only a few lines of trivial Pig code (e.g. http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/). Pig will create the optimized M/R jobs to do the actual work. Not to mention it would be hard (and/or slow) to integrate a more sophisticated analysis/tokenization strategy into a streaming job, i.e. using Lucene’s StandardAnalyzer to tokenize and split text. With Pig, you could write a simple UDF to invoke a Lucene Analyzer in a few lines of code.
Timothy, thanks for the input into better tech to use for this purpose! I’m hoping to tackle some Mahout tasks later for blogging and will probably use Mahout/Lucene for the same type of things.
I figured TF-IDF would be a good algorithm to use to demo the fact that shell scripts can be used for MapReduce…as cumbersome as they may be. 😉 Your acronym on LinkedIn was spot on: I’m trying not to create YAWCE (Yet Another Word Count Example)! Love it.
I absolutely LOVE that you did this with Crane’s poetry!
One thing to watch, though, is that you skipped the step where you uploaded your scripts into HDFS. A better approach would be to let streaming handle it for you:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D stream.num.map.output.key.fields=2 -input crane -output crane_out \
-file ./maptf.sh -mapper maptf.sh -file ./redtf.sh -reducer redtf.sh
The -file args tell streaming to upload your scripts into the working directory of the job, so the commands passed to -mapper and -reducer are running with your scripts in the cwd.
A man cried out to the JobTracker, “Sir, my job exists!”
“However,” replied the JobTracker, “that fact has not created in me a guarantee of data locality.”
Daniel, you’re awesome. Thanks for the -file note, that’s definitely something that was missing! I’ll add it into the post.
And that parody is spot on as well. Absolutely hilarious.
I’m afraid that ‘split’ trick doesn’t work too well; it simply eliminates the specified characters, joining the (possibly unrelated) words:
$ map_input_file=test ./maptf.sh
not-connected
notconnected test 1
joe,frank,lisa
joefranklisa test 1
Try this instead:
-----------------------------------------------------------
#!/bin/bash
old_ifs="$IFS"
while read split; do
    IFS=' .,?!-_:;][#|$()"'
    for word in $split; do
        term=`echo $word | tr [:upper:] [:lower:]`
        [ -n "$term" ] && printf "%s\t%s\t%s\n" "$term" "$map_input_file" 1
    done
    IFS="$old_ifs"
done
-----------------------------------------------------------
$ map_input_file=test ./maptf.sh
not-connected
not test 1
connected test 1
joe,frank,lisa
joe test 1
frank test 1
lisa test 1
Good informative post overall, though – thank you!
Ben Okopnik
Steve, this. is. awesome! I’m sharing it with every developer class I have going forward. Hugely useful stuff and I love how you did it in bash. Ben, thank you for the clarifications & code rewrite.
Can’t get past the error java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found while running a MapReduce streaming job.
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.4.0.jar -D stream.num.map.output.key.fields=2 -input crane -output crane_out -file ./maptf.sh -mapper maptf.sh -file ./redtf.sh -reducer redtf.sh -verbose
I have replaced the shared lib under Oozie, but I’m not quite sure what could be causing this. Please ignore if this is not the right forum; I’m new to this area.
thanks.