Yesterday I completed the second day of Cloudera Developer Training for Apache Hadoop. While the first day focused on Hadoop core technology like HDFS, the second day was all about MapReduce. That means it was the day that whole ‘developer’ thing was thrown into sharp relief.
I’ve been a DBA for most of my career, but I believe I still think like a developer in a lot of ways. In fact, when it came to Hadoop I specifically decided to take the Developer course because I figured it would make me more uncomfortable, provide a greater challenge, and teach me more things that I would have a harder time learning on my own.
Hey awesome, I was right! Definite discomfort. But therein lies salvation.
Since I’m a fan of summaries and tidbits, I’m going to stick with the format of yesterday’s Day 1 summary and detail three things I learned during Day 2.
I Don’t Like Java
Okay, this isn’t about the class as much as it is me. But I am not a Java fan. Part of the problem (and this is me being honest here) is that I don’t know it very well. That always leads to a rougher experience. But even the things I knew pretty well were fairly annoying just by virtue of it being Java. Unit testing with MRUnit was a real brain-fryer (I’m a DBA, we do all our testing in production anyways, right guys?). But…
That testing in production thing was a joke by the way. Don’t do that.
So Java it is for now. However, after class I plan to make a good effort to learn Hive and Pig, and to try using Hadoop Streaming to write MapReduce code in Python or other languages. From what I understand, most MapReduce developers use one of these options; if you have any insight on that I’d love to hear it in the comments. Hive lets you run HiveQL (a lot like SQL) queries straight against parsed views of your files stored on HDFS. Pig provides a language (Pig Latin) for writing your own data flows. Both options are translated into MapReduce code on the Hadoop side of things. Python is just so I can sit at the cool kids’ table. I’d consider Ruby, but I’m just not that cool.
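To give a flavor of the Streaming option, here’s a minimal word-count sketch in Python. This is just my illustration, not anything from class, and the script names (mapper.py, reducer.py) are made up:

```python
#!/usr/bin/env python
# mapper.py (hypothetical name) -- a Hadoop Streaming mapper: read raw text
# lines from STDIN, emit one "word<TAB>1" pair per word on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py (hypothetical name) -- a Hadoop Streaming reducer: the framework
# delivers the mapper output sorted by key, so each word's total can be
# accumulated in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

Submitting it looks something like `hadoop jar /path/to/hadoop-streaming-*.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the streaming jar’s location and the paths here will vary by distribution; treat them as placeholders).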
MapReduce sounds better than MapCombinePartitionSortShuffleReduce.
Thanks to my pitiful experience with MapReduce on MongoDB, I was under the impression that Hadoop querying had two parts: Map and Reduce. However, it turns out there’s a lot more:
- Mappers, which take in the original data from HDFS and perform your parsing and calculations
- Combiners, which run on the map side as a mini-reduce to pre-aggregate data (though Hadoop may or may not actually run them)
- Partitioners, which decide how the mappers’ output will be distributed among the reducers
- Sort and Shuffle, the phase where MapReduce sorts all the data emitted from the Mappers by key, then shuffles it into merged per-key lists for the Reducers
- Reducers, which perform the final aggregation if necessary. Reducers are entirely optional, just as GROUP BY is optional in a SQL query; a map-only job is perfectly valid.
So it turns out the name MapReduce implies a lot more than two phases, and knowing them helps you understand what’s actually happening under the hood. It also helps to compare these phases to piped Linux commands: a grep is analogous to a Mapper, and ‘wc -l’ to a Reducer. The biggest difference, of course, is the distributed nature of Hadoop vs. a standard shell. But with Hadoop Streaming you can write MapReduce jobs with anything that can read STDIN and write STDOUT, including shell commands. Time to brush off those old awk scripts, folks!
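To push the pipe analogy one step, here’s what grep-as-Mapper and ‘wc -l’-as-Reducer might look like as Streaming scripts. Again, this is just a sketch of mine; the “ERROR” pattern and the script names are invented:

```python
#!/usr/bin/env python
# grep_mapper.py (hypothetical) -- acts like grep: keep only lines containing
# "ERROR". Emitting one constant key routes every match to a single reducer.
import sys

for line in sys.stdin:
    if "ERROR" in line:
        print("error\t1")
```

```python
#!/usr/bin/env python
# count_reducer.py (hypothetical) -- acts like 'wc -l': total up the matches.
import sys

total = sum(int(line.split("\t", 1)[1]) for line in sys.stdin)
print("error\t%d" % total)
```

And because it’s all STDIN/STDOUT, you can dry-run the whole Map → Sort/Shuffle → Reduce flow right at the shell, with `sort` standing in for the Sort and Shuffle phase: `cat app.log | ./grep_mapper.py | sort | ./count_reducer.py`.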
A Book Literally Is a Thousand Words (and then some)
On the recommendation of the instructor I purchased Tom White’s Hadoop: The Definitive Guide. It really is an outstanding book that reinforced what I’m learning in class and will hopefully help me prepare for the CCDH exam. If you’re interested in learning more about Hadoop, both on the HDFS and MapReduce side, I would highly recommend it. The flow is good and the information is great.
Make sure you get the 3rd Edition! Hadoop is young(ish) and moving quickly. The previous two editions focused on the old MapReduce API (pre-0.20), whereas the 3rd Edition, published in May 2012, covers the new MapReduce API. It also includes information on MapReduce 2 (MRv2) and Yet Another Resource Negotiator (YARN), which is on its way to being ready for production use.
Bonus Lesson!
You have to ask questions to get answers. The hype that Big Data just “finds” all this amazing stuff is ridiculous. Certainly there are tools to help you dashboard and discover what your data contains (Hadoop/Solr/Mahout, Tableau, etc.), but Hadoop is just a filesystem and code framework. If you ask a developer to “find something neat” with MapReduce they might shoot you.
So, what kind of learning can an Oracle DBA take to Big Data? What does an Oracle DBA already know that can be useful in Big Data, and what does he need to learn?
thanks
Gary,
Most DBAs will understand the basics of filesystems and clustered filesystems. Even though HDFS is a shared-nothing cluster, it’s easy to grasp how the local filesystems on the datanodes are presented as a single virtual filesystem by the namenode. We’re also used to a variety of control processes (pmon, smon, etc.), so understanding the JobTracker and TaskTrackers comes pretty easily.
An understanding of normalization helps us grasp the difference between structured and unstructured data. Knowing RDBMS/SQL helps big time with parsable files and Hive; in fact, many DBAs would probably feel very comfortable using Hive. Think of it like external tables.
MapReduce is programmatic, which means a lot of DBAs will be confused by it. But at the same time, if you understand WHERE clauses (data filtering) and GROUP BY (data aggregation), then you’ve already got a lot of the core concepts behind MapReduce. From there it comes down to matching up the concepts and expanding your thinking to take in the capabilities of a full programming language. Imagine a WHERE clause based on any calculation or data you want, including lookups against extra files or calls to image-processing engines. Or a GROUP BY that doesn’t just group by a column, but by part of a column, a combination of columns, or a computed value. MapReduce is kind of (but not quite) like GROUP BY on steroids with tons of customization options built in.
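To make that concrete, here’s a rough Streaming-style sketch in Python. The CSV layout (date, user, amount), the filter, and the grouping key are all invented for illustration:

```python
#!/usr/bin/env python
# sql_analogy_mapper.py (hypothetical) -- the mapper plays both SQL roles:
# the 'if' is a WHERE clause that could run any code at all, and the emitted
# key is a GROUP BY on a computed value (here, the year-month of the date).
import sys

for line in sys.stdin:
    date, user, amount = line.strip().split(",")
    if float(amount) > 100.0:                 # WHERE amount > 100
        print("%s\t%s" % (date[:7], amount))  # GROUP BY year-month
```

A reducer that totals the values per key then finishes the job, just like SUM(amount) with a GROUP BY.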
Of course that’s oversimplifying. But it helps.
Thanks Steve