MapReduce with Hadoop Streaming in bash – Bonus!
To conclude my three-part series on writing MapReduce jobs in shell script for use with Hadoop Streaming, I’ve decided to throw together a video tutorial on running the jobs we’ve created in Oozie, a workflow editor for Hadoop that allows jobs
Tag: mapreduce
MapReduce with Hadoop Streaming in bash – Part 3
In our first MapReduce with Hadoop Streaming in bash article, we took a collection of Stephen Crane poems and used a MapReduce job to calculate ‘term frequency’, meaning we counted the number of times each word appeared in the collection. In the second part, we calculated ‘document frequency’
MapReduce with Hadoop Streaming in bash – Part 2
In MapReduce with Hadoop Streaming in bash – Part 1 we found the ‘term frequency’ of words within a collection of documents. For the documents I chose eight Stephen Crane poems, and our bash Map and Reduce jobs tokenized the words and counted their frequency across the entire set. The
MapReduce with Hadoop Streaming in bash – Part 1
So, to commemorate my recent certification, and because my Java absolutely sucks, I decided to implement a common algorithm using Hadoop Streaming.
Hadoop Streaming
Hadoop Streaming allows you to write MapReduce code in any language that can read from stdin and write to stdout. This includes Python, PHP, Ruby, Perl, bash, node.js, and
Just how big is your data?
A while back (2007 to be exact, an eternity in Internet years), Google released a product called Google 411. You could call either 1-800-GOOG-411 or 1-877-GOOG-411 and search for businesses by city and state, category, or other criteria. It was a direct competitor to the expensive local 411 services, and