- MapReduce with Hadoop Streaming in bash – Part 1
- MapReduce with Hadoop Streaming in bash – Part 2
- MapReduce with Hadoop Streaming in bash – Part 3
- Hadoop Streaming, Hue, Oozie Workflows, and Hive
For this final part, we will use the term frequency and document frequency to build the final Term Frequency/Inverse Document Frequency (TF-IDF) score. To do this, we need to plug our results into the TF-IDF formula.
The formula says that TF-IDF equals the term frequency times the natural logarithm of the total number of documents divided by the document frequency. So for each term/file combination, we need to calculate the TF-IDF based on the values provided by our last MapReduce job. Some people prefer to use a base 10 logarithm to dampen the results; I’ll cover this in the Mapper section.
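Written out, the relationship described above looks roughly like this. The symbols t (term), d (document), and N (total documents) are notation introduced here for illustration only:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \ln\!\left(\frac{N}{\mathrm{df}(t)}\right)
```

As a quick check against the job output later in this article: with N = 8, a term that appears 4 times in one document and in no other document scores 4 × ln(8/1) ≈ 8.31777.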
Setup
So the first thing I’m going to do is get the output from the last job. You should be used to this by now.
[training@localhost steve]$ hadoop fs -get crane_out2/part-00000
The next thing I’m going to do is cheat. See, the algorithm requires the total number of documents that we’ve been analyzing. Sure, I could write a MapReduce job that looks through our latest output and emits a list of unique files; however, that would be inefficient and a waste of resources. Listing the input directory with a simple ‘ls’ command and counting the lines is much more efficient and makes better use of our (pseudo)cluster. To figure out our total document count, we’ll do just that:
[training@localhost steve]$ hadoop fs -ls crane | tail -n +2 | wc -l
8
For our testing we’ll just explicitly set this as a variable. In the Hadoop job we’ll pass it in as a parameter.
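Here is a minimal sketch of that counting step that runs without a cluster. The helper names count_docs and sample_listing are hypothetical, and the listing is made-up stand-in data for what `hadoop fs -ls crane` prints (a "Found N items" header followed by one line per file):

```shell
#!/bin/bash
# Sketch: count documents from a directory listing that starts with a
# "Found N items" header line, as `hadoop fs -ls` prints.
# count_docs is a hypothetical helper name used only for this example.
count_docs() {
  # Skip the header line, count the rest, and strip any padding from wc.
  tail -n +2 | wc -l | tr -d ' '
}

# Stand-in for `hadoop fs -ls crane`: a header plus two made-up entries.
sample_listing() {
  printf 'Found 2 items\n'
  printf -- '-rw-r--r--   1 training supergroup  100 2013-10-01 07:00 crane/one.txt\n'
  printf -- '-rw-r--r--   1 training supergroup  200 2013-10-01 07:00 crane/two.txt\n'
}

N=$(sample_listing | count_docs)
echo "N=$N"
```

On the real cluster you would replace sample_listing with the actual `hadoop fs -ls crane` call.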
On to the Mapper
So let’s go ahead and do our final calculation using the Mapper.
[training@localhost steve]$ cat maptfidf.sh
#!/bin/bash

while read term file tf df; do
  TFIDF=$(echo $N $df $tf | awk '{print $3 * log($1/$2)}')
  printf "%s\t%s\t%s\n" "$term" "$file" "$TFIDF"
done
Simpler than you thought? Let’s look at what we did.
- Read each line into the variables term, file, tf, and df. These represent (from our last job) a unique term/document combination, the number of times the term appeared in that document (tf), and the number of documents the term appears in (df).
- Calculate TFIDF using awk. We do this by passing total documents ($N, calculated with the variable I mentioned in the setup), document frequency, and term frequency. TF times log(total/DF) is the final answer.
- Print the final output as Term, File, and TF-IDF. Term and File make up the unique key for each line of output.
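The steps above start one awk process per input line. As an alternative sketch (not the script used in this series), the same calculation can run in a single awk process. The function name tfidf_map and the shortened file name gold.txt are assumptions made for this example:

```shell
#!/bin/bash
# Sketch: same calculation as the Mapper, done in one awk process.
# tfidf_map is a hypothetical helper name; N is the total document count
# and must be set in the environment, just as the Mapper expects.
tfidf_map() {
  awk -v n="${N:?set N to the total document count}" \
    '{ printf "%s\t%s\t%.6g\n", $1, $2, $3 * log(n / $4) }'
}

# Illustrative input in the term/file/tf/df layout described above.
printf 'ball\tgold.txt\t4\t1\na\tgold.txt\t3\t8\n' | N=8 tfidf_map
```

The first line scores 4 × log(8/1) and the second scores 3 × log(8/8), which is 0.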
The Mapper’s output is actually our final result. This is exactly what we’ve been trying to calculate and the product of our three jobs. As I mentioned in the intro to this article, some people prefer to use log10() instead of the natural logarithm (which uses the constant e) to dampen the results.
If that’s the case, you can replace the awk line in the Mapper with this one:
TFIDF=$(echo $N $df $tf | awk '{print $3 * (log($1/$2)/log(10))}')
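To get a feel for how much the base 10 version dampens a score, here is a small sketch that prints both variants side by side. The helper name tfidf_both is hypothetical; its arguments are N, df, and tf in that order:

```shell
#!/bin/bash
# Sketch: compare the natural-log and base-10 scores for one term/document pair.
# tfidf_both is a hypothetical helper; arguments are N, df, and tf.
tfidf_both() {
  awk -v n="$1" -v df="$2" -v tf="$3" 'BEGIN {
    # awk only provides the natural log, so divide by log(10) to change base.
    printf "ln=%.6g log10=%.6g\n", tf * log(n / df), tf * (log(n / df) / log(10))
  }'
}

tfidf_both 8 1 1
```

For a term that appears once in exactly one of 8 documents, the natural-log score is 2.07944 and the base 10 score is 0.90309.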
Let’s test the Mapper in the shell, first setting the ‘N’ variable required for the algorithm:
[training@localhost steve]$ export N=`hadoop fs -ls crane | tail -n +2 | wc -l`
[training@localhost steve]$ cat part-00000 | ./maptfidf.sh | head -6
a hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0
Not too descriptive with all those 0’s, but if you know how TF-IDF works you know it is working. The lower the value, the less relevant that word is. The letter ‘a’ appears in all 8 documents, so its score is TF times log(8/8), and log(1) is 0, which marks it as an irrelevant word. Usually words like ‘a’ or ‘and’ or ‘the’ would have been filtered out in the beginning via a stoplist.
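This series does not build a stoplist, but a minimal sketch of the idea could look like the following. The names STOPWORDS and drop_stopwords and the three-word list are assumptions for illustration, and in a real pipeline the filter would sit in the first job rather than at the end:

```shell
#!/bin/bash
# Sketch: drop records whose first field (the term) is on a stoplist.
# STOPWORDS and drop_stopwords are hypothetical names for illustration.
STOPWORDS="a and the"

drop_stopwords() {
  awk -v words="$STOPWORDS" '
    # Load the space-separated stoplist into a lookup table once.
    BEGIN { n = split(words, w, " "); for (i = 1; i <= n; i++) stop[w[i]] = 1 }
    # Print only the lines whose term is not in the table.
    !($1 in stop)
  '
}

printf 'a\tx.txt\t0\nball\tx.txt\t8.31777\n' | drop_stopwords
```

Only the ‘ball’ record survives the filter; the ‘a’ record is dropped.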
What Reducer?
Since the Mapper produced our final output, we actually don’t need to worry about a reducer. We could specify no reducer (-reducer NONE in the options) but instead we’ll use something called the IdentityReducer. The IdentityReducer takes its input and writes it out unchanged, with no calculation. This accomplishes two things: 1) the data is sorted/shuffled when it’s sent to the reducer, so with a single reducer it should come out sorted, and 2) we will get a single output file instead of one per mapper, which is easier to work with later.
So you get a break this time. No reducer. Sweet!
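If you want to preview the sorted result locally before running the job, a single identity reducer can be approximated with a plain sort on the two key fields. This is only a sketch: identity_reduce is a hypothetical helper name, the file names are shortened, and Hadoop’s own key ordering may differ in edge cases:

```shell
#!/bin/bash
# Sketch: approximate a single identity reducer locally by sorting mapper
# output on its two key fields (term, then file) and passing it through.
# identity_reduce is a hypothetical helper name.
identity_reduce() {
  # LC_ALL=C gives a plain byte-order sort; the tab is the field separator.
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2
}

printf 'god\tb.txt\t1.38629\nah\tc.txt\t1.38629\ngod\ta.txt\t13.8629\n' | identity_reduce
```

The records come out ordered by term and then by file, with nothing else changed.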
Our Hadoop Command
[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D stream.num.map.output.key.fields=2 \
  -D N=`hadoop fs -ls crane | tail -n +2 | wc -l` \
  -input crane_out2 \
  -output tfidf \
  -mapper /home/training/steve/maptfidf.sh \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer
Just as in the previous articles, the backslashes are just there to show this is a multiline command. If you put it all on one line you don’t need them.
A few things to note here. First, we set the stream.num.map.output.key.fields variable to 2. Even though we don’t have a formal reducer, we still want to tell the job the key field count so it will sort properly. Second, we set a new variable (-D is required for each one) called ‘N’ to the result of an ‘ls’ command in Hadoop against our original document folder. Hadoop Streaming exposes this variable to our shell script as an environment variable, where it denotes the total document count. Third is the -reducer setting. To use the identity reducer, set it to org.apache.hadoop.mapred.lib.IdentityReducer.
Running the streaming job produces the final job output and saves it to the ‘tfidf’ folder in HDFS:
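Because the Mapper depends entirely on N arriving through the environment, a guard at the top of the script can catch a missing or malformed value before any scores are produced. The following is a sketch only, and require_n is a hypothetical helper that is not part of the script in this series:

```shell
#!/bin/bash
# Sketch: fail fast if N is missing from the environment or is not a
# positive integer. require_n is a hypothetical helper name.
require_n() {
  case "${N:-}" in
    ''|*[!0-9]*|0)
      echo "N must be a positive integer, got '${N:-}'" >&2
      return 1
      ;;
  esac
}

# Example: passes when N is set the way the streaming job sets it.
N=8 require_n && echo "N looks usable"
```

Without a check like this, an unset N shifts the arguments passed to awk and the script silently prints meaningless scores.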
[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D N=8 -input crane_out2 -output tfidf -mapper /home/training/steve/maptfidf.sh -reducer org.apache.hadoop.mapred.lib.IdentityReducer
packageJobJar: [/tmp/hadoop-training/hadoop-unjar6684831878608134041/] [] /tmp/streamjob5486308040698764550.jar tmpDir=null
13/10/01 07:33:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/01 07:33:26 WARN snappy.LoadSnappy: Snappy native library is available
13/10/01 07:33:26 INFO snappy.LoadSnappy: Snappy native library loaded
13/10/01 07:33:26 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/01 07:33:26 INFO mapred.JobClient: Running job: job_201309292255_0066
13/10/01 07:33:27 INFO mapred.JobClient: map 0% reduce 0%
13/10/01 07:33:32 INFO mapred.JobClient: map 100% reduce 0%
13/10/01 07:33:35 INFO mapred.JobClient: map 100% reduce 100%
13/10/01 07:33:36 INFO mapred.JobClient: Job complete: job_201309292255_0066
13/10/01 07:33:36 INFO mapred.JobClient: Counters: 33
13/10/01 07:33:36 INFO mapred.JobClient: File System Counters
13/10/01 07:33:36 INFO mapred.JobClient: FILE: Number of bytes read=24244
13/10/01 07:33:36 INFO mapred.JobClient: FILE: Number of bytes written=421188
13/10/01 07:33:36 INFO mapred.JobClient: FILE: Number of read operations=0
13/10/01 07:33:36 INFO mapred.JobClient: FILE: Number of large read operations=0
13/10/01 07:33:36 INFO mapred.JobClient: FILE: Number of write operations=0
13/10/01 07:33:36 INFO mapred.JobClient: HDFS: Number of bytes read=22438
13/10/01 07:33:36 INFO mapred.JobClient: HDFS: Number of bytes written=23586
13/10/01 07:33:36 INFO mapred.JobClient: HDFS: Number of read operations=3
13/10/01 07:33:36 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/10/01 07:33:36 INFO mapred.JobClient: HDFS: Number of write operations=2
13/10/01 07:33:36 INFO mapred.JobClient: Job Counters
13/10/01 07:33:36 INFO mapred.JobClient: Launched map tasks=1
13/10/01 07:33:36 INFO mapred.JobClient: Launched reduce tasks=1
13/10/01 07:33:36 INFO mapred.JobClient: Data-local map tasks=1
13/10/01 07:33:36 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=5617
13/10/01 07:33:36 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=3000
13/10/01 07:33:36 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/01 07:33:36 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/01 07:33:36 INFO mapred.JobClient: Map-Reduce Framework
13/10/01 07:33:36 INFO mapred.JobClient: Map input records=326
13/10/01 07:33:36 INFO mapred.JobClient: Map output records=326
13/10/01 07:33:36 INFO mapred.JobClient: Map output bytes=23586
13/10/01 07:33:36 INFO mapred.JobClient: Input split bytes=108
13/10/01 07:33:36 INFO mapred.JobClient: Combine input records=0
13/10/01 07:33:36 INFO mapred.JobClient: Combine output records=0
13/10/01 07:33:36 INFO mapred.JobClient: Reduce input groups=326
13/10/01 07:33:36 INFO mapred.JobClient: Reduce shuffle bytes=24244
13/10/01 07:33:36 INFO mapred.JobClient: Reduce input records=326
13/10/01 07:33:36 INFO mapred.JobClient: Reduce output records=326
13/10/01 07:33:36 INFO mapred.JobClient: Spilled Records=652
13/10/01 07:33:36 INFO mapred.JobClient: CPU time spent (ms)=840
13/10/01 07:33:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=199655424
13/10/01 07:33:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=776904704
13/10/01 07:33:36 INFO mapred.JobClient: Total committed heap usage (bytes)=176492544
13/10/01 07:33:36 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/10/01 07:33:36 INFO mapred.JobClient: BYTES_READ=22330
13/10/01 07:33:36 INFO streaming.StreamJob: Output directory: tfidf
The Final Results
After three MapReduce jobs, we’re finally ready to see our word/document and associated TF-IDF. Score! (ba dum tss)
[training@localhost steve]$ hadoop fs -cat tfidf/part-00000
a hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0
a hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0
accosted hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
achieved hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
addressed hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
again hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
ages hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
agony hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
ah hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
ah hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1.38629
already hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
am hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
and hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0.267063
and hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0.400594
and hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0.400594
and hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0.133531
and hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.267063
and hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.267063
and hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.133531
another's hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
are hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
as hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
at hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
aye hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
aye hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
ball hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 8.31777
bawled hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
been hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
before hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
began hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
believed hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
black hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
black hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1.38629
blind hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
book hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4.15888
boys hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
breath hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
but hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1.38629
but hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1.38629
by hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
by hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 5.54518
called hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
calling hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 4.15888
can hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
cavern hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
child hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4.15888
chronicle hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
clay hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
climbed hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
collection hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4.15888
concentrating hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
court hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
created hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
crevice hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
cried hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.38629
cried hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.77259
crowd hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
crowned hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
cuddle hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
curious hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
dead hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
death hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
deathslime hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
denial hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
desert hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 6.23832
dire hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
disturbed hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
earth hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
echoes hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
error hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
eternal hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
even hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
eventually hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
eventually hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
ever hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 4.15888
every hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
exist hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
fact hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
families hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
feckless hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
fenceless hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
fireside hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
fleetly hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
for hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0.980829
for hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0.980829
for hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.980829
fortress hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
freedom hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
from hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0.693147
from hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
from hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.693147
from hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.693147
futile hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
game hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
garment hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
god hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 13.8629
god hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1.38629
gold hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 8.31777
grown hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
had hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
halfinjustices hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
hand hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
hands hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
has hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
have hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
have hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
he hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
he hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 4.15888
he hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.77259
he hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.693147
heat hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
heavens hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
held hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4.15888
hem hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
highest hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
him hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.77259
him hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
his hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
his hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
hold hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
honest hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
horizon hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.38629
horizon hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1.38629
however hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
i hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.470004
i hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.82002
i hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.88001
i hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.35002
i hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 1.41001
in hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.287682
in hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0.287682
in hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0.287682
in hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0.287682
in hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0.287682
in hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.287682
into hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
is hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0.575364
is hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.01377
is hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0.287682
is hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.287682
is hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.575364
is hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.575364
it hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.43841
it hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0.287682
it hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0.287682
it hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.287682
it hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.575364
it hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.575364
its hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.77259
its hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
joys hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
kindly hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
know hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
let hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
lie hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
life's hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
lived hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
lo hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
lone hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
long hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
looked hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.07944
looks hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
loud hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
mad hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
man hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.980829
man hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.96166
man hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.96166
market hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
me hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.693147
me hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
me hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.693147
me hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.693147
melons hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
men hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4.15888
merciful hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
met hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
mighty hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
mile hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4.15888
million hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
mocked hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
much hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4.15888
never hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.38629
never hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.77259
newspaper hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 10.3972
night hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
no hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
no hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.77259
not hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1.38629
not hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
now hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4.15888
obligation hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
of hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.287682
of hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.15073
of hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.43841
of hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0.863046
of hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0.575364
of hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.575364
often hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
on hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
one hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
opened hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
opinion hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
part hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 4.15888
phantom hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
place hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
plains hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
player hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
pursued hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
pursuing hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
ran hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
read hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
remote hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
replied hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
roaming hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
rock hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
round hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 4.15888
said hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.470004
said hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0.470004
said hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.470004
said hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.940007
said hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.940007
sand hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
saw hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
saw hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.38629
scores hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
screamed hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
second hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
seer hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
sells hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
sense hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 2.07944
shadow hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
should hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
sir hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 1.38629
sir hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.77259
skill hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
sky hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
sky hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
smiled hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
smote hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
sneering hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
so hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
space hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
spaces hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
sped hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.77259
sped hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 1.38629
spirit hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
spreads hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
spurred hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
squalor hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
strange hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.77259
strange hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
stupidities hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
suddenly hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
swift hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
sword hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
symbol hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
take hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
tale hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
tales hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
that hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
that hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 4.15888
the hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0
the hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0
their hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
then hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
then hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
there hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
there hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 1.38629
they hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.07944
think hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 2.07944
this hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.96166
this hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.980829
this hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 0.980829
through hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
through hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.77259
to hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 0.693147
to hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 0.693147
to hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
to hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
touched hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
tower hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
traveller hdfs://0.0.0.0:8020/user/training/crane/truth.txt 6.23832
tried hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
truth hdfs://0.0.0.0:8020/user/training/crane/truth.txt 6.23832
unfairly hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
unhaltered hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
universe hdfs://0.0.0.0:8020/user/training/crane/a_man_said_to_the_universe.txt 4.15888
vacant hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
valleys hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.07944
victory hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
voice hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 4.15888
walked hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
was hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 2.77259
was hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 0.693147
was hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 0.693147
was hdfs://0.0.0.0:8020/user/training/crane/truth.txt 0.693147
well hdfs://0.0.0.0:8020/user/training/crane/walked_in_a_desert.txt 2.07944
went hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
went hdfs://0.0.0.0:8020/user/training/crane/a_spirit_sped.txt 2.77259
when hdfs://0.0.0.0:8020/user/training/crane/a_man_saw_a_ball_of_gold.txt 1.38629
when hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
whence hdfs://0.0.0.0:8020/user/training/crane/truth.txt 2.07944
where hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 6.23832
which hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
which hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
while hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 4.15888
wind hdfs://0.0.0.0:8020/user/training/crane/truth.txt 4.15888
wins hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 2.07944
wisdom hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
wisdom hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
world hdfs://0.0.0.0:8020/user/training/crane/a_newspaper.txt 1.38629
world hdfs://0.0.0.0:8020/user/training/crane/truth.txt 1.38629
you hdfs://0.0.0.0:8020/user/training/crane/met_a_seer.txt 1.38629
you hdfs://0.0.0.0:8020/user/training/crane/pursuing_the_horizon.txt 2.77259
Just as in our test, “a” is not important at all: it appears in all 8 documents, so it came out with a score of 0. “Where” seems very important for such a common word, but that is because it shows up 3 times in only 1 file. Remember, TF-IDF is “a numerical statistic which reflects how important a word is to a document in a collection or corpus”. Take a look at words like “you” at the end of the file: it shows up in two different files, but has a different TF-IDF weight for each one. That’s because “you” only appears once in the first file but twice in the second file, making it more important to that document in relation to the whole corpus. You can see this with ‘grep’ commands against the original content.
[training@localhost steve]$ grep -i you crane/met_a_seer.txt
Of that which you hold.
[training@localhost steve]$ grep -i you crane/pursuing_the_horizon.txt
"You can never -- "
"You lie," he cried,
And with that, we just built an index. If we were to build a search engine against those 8 Stephen Crane poems, then a search for a word would output files ordered by TF-IDF descending. That way the most pertinent (keyword rich) files would come first in the results.
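A rough sketch of that lookup against the index could look like this. The function name search_term is hypothetical, and the two sample records are the ‘god’ lines from the job output with the HDFS path prefix trimmed for brevity:

```shell
#!/bin/bash
# Sketch: rank files for one search term by TF-IDF, highest first.
# search_term is a hypothetical helper; it reads term/file/score lines on stdin.
search_term() {
  # Keep only records for the requested term, emit "score<TAB>file",
  # then sort numerically on the score in descending order.
  awk -v q="$1" '$1 == q { print $3 "\t" $2 }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr
}

# Two records taken from the job output, fed in ascending order on purpose.
printf 'god\twalked_in_a_desert.txt\t1.38629\ngod\ta_spirit_sped.txt\t13.8629\n' |
  search_term god
```

Against the real index you would feed it the output of `hadoop fs -cat tfidf/part-00000` instead of the printf line.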
Conclusion
Of course, we could have done this project a lot more easily with tools like Lucene and Mahout. They are made for this sort of thing, and have a ton of extra features including automatic stoplisting, weight tuning, etc. But it wouldn’t be nearly as fun, right?
This concludes our TF-IDF with Hadoop Streaming in bash exercise. If you have any feedback on better ways to do these tasks (or errata) please let me know in the comments!