- MapReduce with Hadoop Streaming in bash – Part 1
- MapReduce with Hadoop Streaming in bash – Part 2
- MapReduce with Hadoop Streaming in bash – Part 3
- Hadoop Streaming, Hue, Oozie Workflows, and Hive
god hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 10
god hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1
gold hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4
grown hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1
had hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1
Today we will be calculating the ‘document frequency’, or the number of documents each word appears in. This will help us calculate the ‘inverse document frequency’ (IDF) portion of our TF-IDF algorithm. To do this, we’ll be using our term frequency output as the input to our document frequency MapReduce job, using term and filename as our input key. The actual key/value transformation will look like this (the key is the first variable or parenthetical):
- {(term, file),tf} -> Map -> {term,(file, tf, 1)}
- {term,(file, tf, 1)} -> Reduce -> {(term, file), (tf, df)}
This is the trickiest part of the TF-IDF calculation, because the reduce job has to span multiple documents in a single read loop and therefore buffer in-progress rows. But more on that later. For now let’s get started!
The Mapper
For the purposes of testing, I’m first going to pull the results of yesterday’s term frequency job to the local filesystem.
[training@localhost steve]$ hadoop fs -get crane_out/part-00000
Now let’s take a look at our document frequency mapper code.
[training@localhost steve]$ cat mapdf.sh
#!/bin/bash
while read term file num; do
  printf "%s\t%s\t%s\t%s\n" "$term" "$file" "$num" "1"
done
This script is exceedingly simple this time because we’re working with more structured input as opposed to yesterday where we had to tokenize unstructured data (plain text). The code above does the following:
- Read each line of the input into three variables: term, file, and num (the tf from our last job’s output).
- Print the variables back out, appending a new column with a value of “1”. All we’re showing here is that yes, this term made an appearance in this file. Since this calculation is for document frequency, each word-per-doc result is just 1.
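If you want to poke at the mapper logic without the output from the last job on hand, here is a minimal sketch of the same two steps wrapped in a function. The function name `map_df` and the shortened sample paths are made up for illustration, and the sketch assumes the input columns are separated by tabs.

```shell
#!/bin/bash
# Sketch only: map_df is a hypothetical name, not one of the scripts in this series.
# Assumes tab-separated input rows of: term, file, tf.
map_df() {
  # IFS=$'\t' splits on tabs only; -r keeps any backslashes literal.
  while IFS=$'\t' read -r term file tf; do
    # Re-emit the row with a trailing "1": this term appeared in this file.
    printf '%s\t%s\t%s\t%s\n' "$term" "$file" "$tf" "1"
  done
}

# Two sample rows shaped like the term frequency output (paths shortened).
sample=$'god\tcrane/a_spirit_sped.txt\t10\ngod\tcrane/walked_in_a_desert.txt\t1'
mapped=$(printf '%s\n' "$sample" | map_df)
printf '%s\n' "$mapped"
```

Each row comes back unchanged apart from the extra fourth column, which is all the reducer needs in order to count documents per term.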
That was easy, right? Let’s test it using the file we grabbed from the last job.
[training@localhost steve]$ cat part-00000 | ./mapdf.sh
a hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2 1
a hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4 1
a hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 14 1
a hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3 1
... (and so on)
Looks good! Moving on.
The Reducer
This is where our feeling of “oh, that was easy” is bashed (pardon the pun) beyond recognition.
[training@localhost steve]$ cat reddf.sh
#!/bin/bash
# Hold the first row of the current term in the curr* variables.
read currterm currfile currtf currdf
while read term file tf df; do
  if [[ $term = "$currterm" ]]; then
    # Same term: add to the count and buffer the row until the total is known.
    currdf=$(( currdf + df ))
    buffer+="${term}\t${file}\t${tf}\n"
  else
    # New term: flush the buffered rows, then the held row, with the final df.
    echo -e -n "$buffer" | while read line; do echo -e "${line}\t${currdf}"; done
    printf "%s\t%s\t%s\t%s\n" "$currterm" "$currfile" "$currtf" "$currdf"
    buffer=""
    currterm="$term"
    currfile="$file"
    currtf="$tf"
    currdf="$df"
  fi
done
# Flush the last term, which is never followed by a "new term" row.
echo -e -n "$buffer" | while read line; do echo -e "${line}\t${currdf}"; done
printf "%s\t%s\t%s\t%s\n" "$currterm" "$currfile" "$currtf" "$currdf"
Alright, let’s slog through it.
- Just like our term frequency example, we’re going to read the first line of the file in as the variables “currterm”, “currfile”, “currtf”, and “currdf”.
- Loop through the rest of the file with the variables “term”, “file”, “tf”, and “df”.
- Remember that the “term” is the only key for the input–the rest count as values. As such, we check to see if the newest value of “term” equals the last one stored in “currterm”.
- If matched:
  - Increment our document frequency (df) by the loop value (always 1 in this case).
  - Add the term, file, and tf to a buffer so we can print it out later (very important).
- If not matched (new term):
  - Print the buffer, appending the total document frequency for the term (accumulated during the increments above) to the end of each line in it.
  - Print the row held in the curr* variables (the first row we read for that term) with its term, file, term frequency, and total document frequency.
  - Reset the buffer and set all the curr* variables to the values just read to begin again.
- Once the loop ends, print out the final buffer and the final held row.
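The same steps can also be sketched with a bash array as the buffer instead of a string of escaped tabs and newlines. This is an alternative take under a few assumptions, not the script used in this article: `reduce_df` is a hypothetical name, the input is assumed to be tab-separated and already sorted by term, and the sample file names are invented. Unlike the script above, this version prints rows in the order they arrived rather than printing the first row of each term last.

```shell
#!/bin/bash
# Sketch only: reduce_df is a hypothetical name, not the reddf.sh script.
# Assumes tab-separated input, sorted by term: term, file, tf, 1.
reduce_df() {
  local currterm="" df=0 term file tf one row
  local -a rows=()
  while IFS=$'\t' read -r term file tf one; do
    if [[ -n $currterm && $term != "$currterm" ]]; then
      # New term: the count for the previous term is final, so emit its rows.
      for row in "${rows[@]}"; do
        printf '%s\t%s\n' "$row" "$df"
      done
      rows=()
      df=0
    fi
    currterm=$term
    # Buffer the row without its df column; the total is not known yet.
    rows+=("$term"$'\t'"$file"$'\t'"$tf")
    df=$(( df + one ))
  done
  # Emit the last group, which no later term change will trigger.
  for row in "${rows[@]}"; do
    printf '%s\t%s\n' "$row" "$df"
  done
}

sample=$'god\tfile_a.txt\t10\t1\ngod\tfile_b.txt\t1\t1\ngold\tfile_c.txt\t4\t1'
reduced=$(printf '%s\n' "$sample" | reduce_df)
printf '%s\n' "$reduced"
```

Both rows for “god” come out with a document frequency of 2 and the single “gold” row with 1. The key point is the same as in the original script: nothing for a term can be printed until the next term (or the end of input) proves the count is complete.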
Trust me, it was more painful to write than to read. One of the tougher parts about Hadoop Streaming is that you are responsible for maintaining the state and scope of the keys, as opposed to Java where it’s done for you. Beyond my bash shortcomings, I had problems early on with this because I was using the wrong key in my conditions–it is absolutely vital that you keep track of the key in your reducer calculations. Let’s see how it looks with a bash test:
[training@localhost steve]$ cat part-00000 | ./mapdf.sh | sort -k1 | ./reddf.sh
accosted hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1
achieved hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1
addressed hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1
again hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1
ages hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1
agony hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1
... (and so on)
Unfortunately not much to see here (and I don’t want to paste the whole thing in the interest of space), but at least it is correct. Remember, the fields here are term, file, tf (frequency of the term within the file), and df (frequency of the term across all files). Those terms only appear once in their associated file and in the document set overall. Thank you Stephen Crane for your uniqueness.
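Since eyeballing a few rows only goes so far, one property is easy to check mechanically: every term should have exactly as many output rows as its df value, because there is one row per document containing it. Below is a rough sketch of such a check. The helper name `check_df` and the sample rows are invented for illustration, and the input is assumed to be tab-separated.

```shell
#!/bin/bash
# Sketch only: check_df is a hypothetical helper, not part of the job itself.
# Reads tab-separated rows of term, file, tf, df and reports any term whose
# row count differs from its df value. Exits non-zero if a mismatch is found.
check_df() {
  awk -F'\t' '
    { rows[$1]++; df[$1] = $4 + 0 }
    END {
      bad = 0
      for (t in rows) if (rows[t] != df[t]) { print "mismatch: " t; bad = 1 }
      exit bad
    }'
}

# Consistent sample: "ah" has two rows and df=2, "am" has one row and df=1.
good=$'ah\tx.txt\t1\t2\nah\ty.txt\t1\t2\nam\tz.txt\t1\t1'
# Inconsistent sample: "ah" has two rows but claims df=3.
bad=$'ah\tx.txt\t1\t3\nah\ty.txt\t1\t3'

if printf '%s\n' "$good" | check_df; then good_status=ok; else good_status=mismatch; fi
bad_report=$(printf '%s\n' "$bad" | check_df) || true

echo "$good_status"
echo "$bad_report"
```

Assuming the reducer output really is tab-separated, the same function could be fed from the local test pipeline above by adding `| check_df` to the end of it.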
Time to Hadoop
Now that the mapper and reducer are done, here’s the command we will use to process it through MapReduce:
[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D stream.num.map.output.key.fields=1 -input crane_out -output crane_out2 \
  -mapper /home/training/steve/mapdf.sh -reducer /home/training/steve/reddf.sh
Remember, the backslashes are only there to say this is a multi-line input. If you type it all on one line you don’t need them. Also note that the stream.num.map.output.key.fields is set to 1 here, as the output from the Mapper has only one column for the key: term. This is important because the shuffle and sort phase needs to sort on the key. The input location is the results from the last job (crane_out/) and the output is a new directory (must not exist) called crane_out2/.
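To make the one-field key a bit more concrete, here is a small local sketch of what “sort on the key only” means. It is an approximation for illustration, not what Hadoop runs internally, and the sample rows are made up. Note that `sort -k1` on its own uses everything from the first field to the end of the line as the sort key, while `-k1,1` restricts it to the first field. The `-s` (stable) flag is an extension found in GNU and BSD sort, so treat its availability as an assumption.

```shell
#!/bin/bash
# Mapper-style rows: term <TAB> file <TAB> tf <TAB> 1 (sample data, made up).
rows=$'gold\tc.txt\t4\t1\ngod\tb.txt\t1\t1\ngod\ta.txt\t10\t1'

# Split each row the way a one-field key would: key = column 1, value = the rest.
keys=$(printf '%s\n' "$rows" | cut -f1)
values=$(printf '%s\n' "$rows" | cut -f2-)

# Sort on the key column only. -s keeps rows with equal keys in arrival order,
# which makes it visible that the value columns play no part in the ordering.
by_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t $'\t' -k1,1 -s)
printf '%s\n' "$by_key"
```

The two “god” rows end up next to each other but stay in the order they arrived (b.txt before a.txt). That mirrors what the reducer can rely on: all rows for a term arrive together, but there is no promise about the order of the values within a term.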
So let’s run it and see what happens!
[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D stream.num.map.output.key.fields=1 -input crane_out -output crane_out2 -mapper /home/training/steve/mapdf.sh -reducer /home/training/steve/reddf.sh
packageJobJar: [/tmp/hadoop-training/hadoop-unjar1827136867538905859/] [] /tmp/streamjob145477386971923155.jar tmpDir=null
13/10/01 07:30:10 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/01 07:30:10 WARN snappy.LoadSnappy: Snappy native library is available
13/10/01 07:30:10 INFO snappy.LoadSnappy: Snappy native library loaded
13/10/01 07:30:10 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/01 07:30:11 INFO mapred.JobClient: Running job: job_201309292255_0065
13/10/01 07:30:12 INFO mapred.JobClient: map 0% reduce 0%
13/10/01 07:30:17 INFO mapred.JobClient: map 100% reduce 0%
13/10/01 07:30:20 INFO mapred.JobClient: map 100% reduce 100%
13/10/01 07:30:21 INFO mapred.JobClient: Job complete: job_201309292255_0065
13/10/01 07:30:21 INFO mapred.JobClient: Counters: 33
13/10/01 07:30:21 INFO mapred.JobClient: File System Counters
13/10/01 07:30:21 INFO mapred.JobClient: FILE: Number of bytes read=22988
13/10/01 07:30:21 INFO mapred.JobClient: FILE: Number of bytes written=420886
13/10/01 07:30:21 INFO mapred.JobClient: FILE: Number of read operations=0
13/10/01 07:30:21 INFO mapred.JobClient: FILE: Number of large read operations=0
13/10/01 07:30:21 INFO mapred.JobClient: FILE: Number of write operations=0
13/10/01 07:30:21 INFO mapred.JobClient: HDFS: Number of bytes read=21785
13/10/01 07:30:21 INFO mapred.JobClient: HDFS: Number of bytes written=22330
13/10/01 07:30:21 INFO mapred.JobClient: HDFS: Number of read operations=3
13/10/01 07:30:21 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/10/01 07:30:21 INFO mapred.JobClient: HDFS: Number of write operations=2
13/10/01 07:30:21 INFO mapred.JobClient: Job Counters
13/10/01 07:30:21 INFO mapred.JobClient: Launched map tasks=1
13/10/01 07:30:21 INFO mapred.JobClient: Launched reduce tasks=1
13/10/01 07:30:21 INFO mapred.JobClient: Data-local map tasks=1
13/10/01 07:30:21 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=5186
13/10/01 07:30:21 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=3343
13/10/01 07:30:21 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/01 07:30:21 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/01 07:30:21 INFO mapred.JobClient: Map-Reduce Framework
13/10/01 07:30:21 INFO mapred.JobClient: Map input records=326
13/10/01 07:30:21 INFO mapred.JobClient: Map output records=326
13/10/01 07:30:21 INFO mapred.JobClient: Map output bytes=22330
13/10/01 07:30:21 INFO mapred.JobClient: Input split bytes=107
13/10/01 07:30:21 INFO mapred.JobClient: Combine input records=0
13/10/01 07:30:21 INFO mapred.JobClient: Combine output records=0
13/10/01 07:30:21 INFO mapred.JobClient: Reduce input groups=226
13/10/01 07:30:21 INFO mapred.JobClient: Reduce shuffle bytes=22988
13/10/01 07:30:21 INFO mapred.JobClient: Reduce input records=326
13/10/01 07:30:21 INFO mapred.JobClient: Reduce output records=326
13/10/01 07:30:21 INFO mapred.JobClient: Spilled Records=652
13/10/01 07:30:21 INFO mapred.JobClient: CPU time spent (ms)=920
13/10/01 07:30:21 INFO mapred.JobClient: Physical memory (bytes) snapshot=199909376
13/10/01 07:30:21 INFO mapred.JobClient: Virtual memory (bytes) snapshot=776908800
13/10/01 07:30:21 INFO mapred.JobClient: Total committed heap usage (bytes)=176492544
13/10/01 07:30:21 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/10/01 07:30:21 INFO mapred.JobClient: BYTES_READ=21678
13/10/01 07:30:21 INFO streaming.StreamJob: Output directory: crane_out2
Looks good! Well…completed at least. Let’s take a look at the output.
[training@localhost steve]$ hadoop fs -cat crane_out2/part-00000 a hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 14 8 a hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2 8 a hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3 8 a hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 8 a hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 8 a hdfs://0.0.0.0:8020/user/training/crane/truth.txt 12 8 a hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3 8 a hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4 8 accosted hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 achieved hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 addressed hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 again hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 ages hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 agony hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 ah hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 2 ah hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 already hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 am hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 and hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 7 and hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 3 7 and hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 7 and hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2 7 and hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 7 and hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 7 and hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3 7 another's hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 are hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 as hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 at 
hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 aye hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 aye hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 ball hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4 1 bawled hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 been hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 before hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 began hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 believed hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 black hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 2 black hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 blind hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 book hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 1 boys hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 breath hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 but hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 2 but hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 2 by hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4 2 by hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 called hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 calling hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 1 can hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 cavern hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 child hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 1 chronicle hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 clay hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 climbed hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 collection hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2 1 concentrating hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 court 
hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 created hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 crevice hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 cried hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 2 cried hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2 2 crowd hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 crowned hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 cuddle hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 curious hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 dead hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 death hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 deathslime hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 denial hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 desert hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3 1 dire hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 disturbed hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 earth hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 echoes hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 error hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 eternal hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 even hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 eventually hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 eventually hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 ever hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 1 every hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 exist hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 fact hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 families hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 feckless 
hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 fenceless hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 fireside hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 fleetly hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 for hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 3 for hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 3 for hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 3 fortress hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 freedom hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 from hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 4 from hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 4 from hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 4 from hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 4 futile hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 game hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 garment hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 god hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 2 god hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 10 2 gold hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4 1 grown hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 had hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 halfinjustices hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 hand hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 hands hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 has hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 have hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3 2 have hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 he hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 6 4 he hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4 4 he 
hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 4 he hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 4 heat hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 heavens hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 held hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 1 hem hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 highest hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 him hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 2 him hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 his hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 his hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 hold hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 honest hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 horizon hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 2 horizon hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 2 however hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 i hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 5 i hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 6 5 i hdfs://0.0.0.0:8020/user/training/crane/truth.txt 5 5 i hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3 5 i hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 4 5 in hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 6 in hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 6 in hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 6 in hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 6 in hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 6 in hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 6 into hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 is hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 6 is 
hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 6 is hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 6 is hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 6 is hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2 6 is hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 7 6 it hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 5 6 it hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 6 it hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 6 it hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 6 it hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2 6 it hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 6 its hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2 2 its hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3 2 joys hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 kindly hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 know hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 let hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 lie hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 life's hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 lived hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 lo hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 lone hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 long hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 looked hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 1 looks hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 loud hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 mad hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 man hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 3 man hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2 3 man 
hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 3 market hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 me hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 4 me hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 4 me hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 4 me hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 4 melons hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 men hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2 1 merciful hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 met hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 mighty hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 mile hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2 1 million hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 mocked hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 much hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 1 never hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 2 never hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 2 newspaper hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 5 1 night hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 no hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2 2 no hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 not hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 2 not hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 now hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 1 obligation hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 of hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4 6 of hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 5 6 of hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 3 6 of hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 6 of 
hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 6 of hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 6 often hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 on hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 one hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 opened hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 opinion hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 part hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 1 phantom hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 place hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 plains hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 player hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 pursued hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 pursuing hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 ran hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 read hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 remote hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 replied hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 roaming hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 rock hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 round hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2 1 said hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 5 said hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 5 said hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 5 said hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2 5 said hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 5 sand hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 saw hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 2 saw 
hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 scores hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 screamed hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 second hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 seer hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 sells hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 sense hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 1 shadow hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 should hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 sir hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 2 sir hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 2 skill hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 sky hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 sky hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 smiled hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 smote hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 sneering hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 so hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 space hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 spaces hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 sped hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 2 sped hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 2 spirit hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 spreads hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 spurred hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 squalor hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 strange hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 strange hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 2 stupidities hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 
suddenly hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 swift hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 sword hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 symbol hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 take hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 tale hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 tales hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 that hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 3 2 that hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 the hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 7 8 the hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4 8 the hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 8 the hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2 8 the hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2 8 the hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4 8 the hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 3 8 the hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 3 8 their hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 then hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 then hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 there hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 2 there hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 they hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 1 think hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 1 this hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 3 this hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 3 this hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2 3 through hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 through hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 
2 to hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2 4 to hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1 4 to hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3 4 to hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 4 touched hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 tower hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 traveller hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3 1 tried hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 truth hdfs://0.0.0.0:8020/user/training/crane/truth.txt 3 1 unfairly hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 unhaltered hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 universe hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2 1 vacant hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 valleys hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 1 victory hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 voice hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2 1 walked hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 was hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4 4 was hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1 4 was hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 4 was hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1 4 well hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 1 went hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 2 went hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 when hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 when hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1 2 whence hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 1 where hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 3 1 which hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 
which hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 while hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2 1 wind hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2 1 wins hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 1 wisdom hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2 wisdom hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 world hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 2 world hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 2 you hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2 2 you hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1 2
Beautiful! Each word/file combination now has an associated term frequency and document frequency. A quick check against the source data with ‘grep’ can confirm whether the output is correct. For example, take a look at the word ‘from’:
from hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2 4
from hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1 4
from hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1 4
from hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1 4
According to my MapReduce job, the word ‘from’ has a document frequency of 4 because it appears in 4 different files. In ‘a_spirit_sped.txt’ it has a term frequency of 2 (appears twice) and in the others it has a term frequency of 1 (appears once). Let’s see if that’s right.
[training@localhost steve]$ grep -i from crane/*
crane/a_newspaper.txt:Which, bawled by boys from mile to mile,
crane/a_spirit_sped.txt:From crevice and cavern
crane/a_spirit_sped.txt:A sword from the sky,
crane/truth.txt:From whence the world looks black."
crane/walked_in_a_desert.txt:"Ah, God, take me from this place!"
Looks good to me! I think we’re set for the day.
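If you want to repeat the grep check for other words, it can be wrapped in a couple of small functions. This is only a rough sketch: the names doc_freq and term_freq are made up for illustration, and it assumes whole-word, case-insensitive matching, which may not line up exactly with how the term frequency job tokenized the text.

```shell
#!/bin/bash
# Hypothetical helpers for spot-checking the MapReduce output by hand.
# ASSUMPTION: whole-word, case-insensitive matching approximates the tokenizer.

# Document frequency: number of files in a directory that contain the word.
doc_freq() {
  local word="$1" dir="$2"
  # -l prints each matching file name once, -i ignores case, -w matches whole words
  grep -ilw -- "$word" "$dir"/* | wc -l | tr -d ' '
}

# Term frequency: number of occurrences of the word in a single file.
term_freq() {
  local word="$1" file="$2"
  # -o prints every match on its own line, so counting lines counts occurrences
  grep -oiw -- "$word" "$file" | wc -l | tr -d ' '
}
```

With the grep output shown above, running doc_freq from crane should agree with the document frequency of 4, and term_freq from crane/a_spirit_sped.txt with the term frequency of 2.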
Conclusion
In Part 1, we completed a MapReduce job to calculate the term frequency of words within documents. In this part, we completed a MapReduce job that goes through that output and appends the document frequency for each term, i.e., the number of documents the term appears in. Both of these numbers are critical for our final calculation in the next article, which will compute the Term Frequency/Inverse Document Frequency (TF-IDF). Stay tuned!
Since your map script really doesn’t do anything, you’d probably be better off using the identity mapper and just always assuming df=1 in the reducer (which is true). It saves you a script, and your command becomes:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D stream.num.map.output.key.fields=1 -input crane_out -output crane_out2 \
-file reddf.sh -reducer reddf.sh \
-inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat
The -inputformat option is needed because the identity mapper passes along whatever it’s handed, and the default input format (TextInputFormat) hands it keys of type LongWritable (the byte offset of each line). The streaming reducer expects Text keys, and unhappiness ensues. KeyValueTextInputFormat produces Text keys, as the reducer expects.
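For illustration, here is roughly what a reducer for this identity-mapper setup could look like. It is a sketch, not the actual reddf.sh from the article: the function name reduce_df is made up, and it assumes each line reaching the reducer is "term<TAB>file<TAB>tf", sorted by term. Every row counts as one document, so the document frequency is just the number of rows buffered for a term.

```shell
#!/bin/bash
# Sketch of a document-frequency reducer for the identity-mapper variant.
# ASSUMPTION: input lines are "term<TAB>file<TAB>tf", already sorted by term.
reduce_df() {
  local current="" df=0 term file tf row
  local rows=()

  while IFS=$'\t' read -r term file tf; do
    # A new term means the buffered rows are complete: emit them with their df.
    if [[ "$term" != "$current" && $df -gt 0 ]]; then
      for row in "${rows[@]}"; do
        printf '%s\t%s\n' "$row" "$df"
      done
      rows=()
      df=0
    fi
    current="$term"
    rows+=("$term"$'\t'"$file"$'\t'"$tf")   # buffer the in-progress row
    df=$((df + 1))                           # each row is one document
  done

  # Emit the rows buffered for the final term.
  for row in "${rows[@]}"; do
    printf '%s\t%s\n' "$row" "$df"
  done
}
```

Ending the script with a bare reduce_df call would make it read from standard input, which is how a streaming reducer receives its rows. It can also be tried locally by piping sorted term frequency output into the function.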
Thank you for pointing that out, Daniel. That makes a lot of sense; I’m not sure why I didn’t think to use the IdentityMapper. Since you brought it up, though, I have a question for you: I tried to use “org.apache.hadoop.mapred.lib.IdentityReducer” in Oozie (through Hue) and it didn’t like that, so I wrote my own IdentityReducer. Do you know if I missed something there, or is it not supported?
Off the top of my head, no idea. My Oozie experience is limited.