Wednesday, February 20, 2013



Before we try to understand MapReduce, I would like you to understand the Hadoop Distributed File System (HDFS). Please read my previous blog, Understanding HDFS.

Let's look at a simple requirement. I have to analyze the messages sent on WhatsApp on Feb 14 to find out how many people celebrated Valentine's Day. For the sake of simplicity, let's assume that we have all the messages on a single system. The basic steps to find the word valentine in each message would look like this:

Search through each message for the word valentine.
Increment a counter each time the word is found in a message.
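The two steps above can be sketched on a single machine like this (the message list here is hypothetical; in reality the messages would come from a file or database):

```python
# Hypothetical sample of Feb 14 messages.
messages = [
    "happy valentines day dear",
    "meeting at 5 pm",
    "will you be my valentine",
]

counter = 0
for message in messages:
    # Step 1: search the message for the word "valentine"
    if "valentine" in message.lower():
        # Step 2: increment the counter on a match
        counter += 1

print(counter)  # 2
```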

If I write a program and run it on that system, we can imagine how long it could take to extract even such simple information on a single machine. So what happens when we have data spread across multiple systems? The straightforward answer is parallel processing. But it comes with its own problems.

One problem is that we cannot be sure when the processing will finish on each system. The other problem is consolidating all the outputs. Hadoop helps us with parallel processing by splitting the data into blocks of the same size through HDFS. It takes care of running the tasks on each data block and consolidating the results. For Hadoop to understand our logic, the logic must be provided in the MapReduce format.

Hadoop expects our program to be in the form of MapReduce, so let's try to understand how it works. MapReduce helps in processing data in parallel, using a divide-and-conquer approach.

MapReduce consists of two functions: a map function and a reduce function. Let's look at the map function first. It accepts input in the form of key-value pairs, so the input to our map function will look like (lineOffsetNumber, message).

The key in our case is the line offset number. The point to note here is that the key is generated by Hadoop and is unique; we care only about the values, which here are the messages. So we will have the logic below in our map method.

foreach key-value pair
    search the message (the value) for the word valentine
    if found, emit the pair (valentine, 1)
end

The next important thing to note is that the output of the map function is also a key-value pair. Let's try to understand, from the HDFS perspective, where this map function actually runs. In HDFS we have a TaskTracker at each DataNode, and the map function runs on the block of data stored in that DataNode. The output of the map function would look like (valentine, 1) for every message that contains the word.
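The map step just described can be sketched as follows (this is a simulation, not the Hadoop API; the message block is hypothetical, and Hadoop would supply the line offsets as keys):

```python
# Map step sketch: for each (offset, message) input pair,
# emit a ("valentine", 1) output pair when the word is found.
def map_messages(offset, message):
    pairs = []
    if "valentine" in message.lower():
        pairs.append(("valentine", 1))
    return pairs

# Hypothetical data block: {line offset: message}.
block = {0: "will you be my valentine", 27: "lunch tomorrow?"}

map_output = []
for offset, message in block.items():
    map_output.extend(map_messages(offset, message))

print(map_output)  # [('valentine', 1)]
```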

We have another piece of functionality provided by Hadoop called shuffle and sort. Let's try to understand how it works. We will have several DataNodes, and our map function can run in parallel on the data blocks in those DataNodes. The shuffle and sort step provided by Hadoop merges all the outputs (the processed data) from the map functions and feeds them to the reduce function. The entire process is monitored by the JobTracker; every MapReduce job has to pass through the JobTracker.
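Shuffle and sort can be simulated like this (a sketch of the idea only; the mapper outputs below are hypothetical, and real Hadoop does this across the network between nodes):

```python
from collections import defaultdict

# Shuffle-and-sort sketch: merge the (key, value) pairs emitted by
# every mapper and group the values under each key, sorted by key.
def shuffle_and_sort(all_map_outputs):
    grouped = defaultdict(list)
    for pairs in all_map_outputs:      # one list of pairs per data block
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())     # [(key, [value, value, ...]), ...]

# Hypothetical outputs from three mappers running on three data blocks.
outputs = [[("valentine", 1), ("valentine", 1)], [("valentine", 1)], []]
print(shuffle_and_sort(outputs))  # [('valentine', [1, 1, 1])]
```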

So the final input to the reduce function could look like (valentine, [1, 1, 1, ...]).

So let's try to understand what the reduce function actually does. Like the map function, the reduce function accepts its input in the form of key-value pairs, and it also produces its output as a key-value pair. Let's look at what happens in our reduce method.

foreach value in (key, values)
    add value to totalNoOfLovers
return (key, totalNoOfLovers)
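The reduce step above can be sketched like so (again a simulation, not the Hadoop API; the grouped input is the hypothetical output of shuffle and sort):

```python
# Reduce step sketch: sum the values grouped under a key.
def reduce_valentines(key, values):
    total = 0
    for value in values:   # "add value to totalNoOfLovers"
        total += value
    return (key, total)

# Hypothetical grouped input: 13 matching messages across all blocks.
print(reduce_valentines("valentine", [1] * 13))  # ('valentine', 13)
```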

So our final output would be (valentine, 13).

This is exactly how we use MapReduce in Hadoop.

The logic is mainly written in Java; however, Hadoop also supports other languages.

Since writing programs in the MapReduce model using Java can be difficult (Java is a general-purpose programming language, so the development time can be long), Hadoop provides an ecosystem tool called Pig. Pig is focused on and built for data processing.

We will see more about Pig in my next blog.

Please leave your suggestions and feedback below.
