How Elastic MapReduce Works, with Flow Diagram
How Elastic MapReduce works in Hadoop is easy to learn and very interesting.
Let’s recall some points about the Mapper and Reducer:
- Input is provided to the Mapper as key-value pairs.
- The Mapper contains a user-defined function.
- A MapReduce job usually runs more Mappers than Reducers, so the heavy processing of data is done by the Mappers in parallel. Each Mapper generates intermediate data, which goes to the Reducer as input.
- The output from the Mapper is processed in the Reducer. A user-defined function in the Reducer further processes this input and generates the final output. Processing in the Reducer is light compared with the heavy processing done in the Mapper.
- The final output is stored in HDFS, where it is replicated as usual.
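The points above can be sketched in plain Python. This is a toy, single-process illustration rather than Hadoop code: the word-count task and the names `map_fn`, `reduce_fn` and `run_job` are assumptions chosen for the example.

```python
from collections import defaultdict

def map_fn(key, value):
    """User-defined map function: key is a line offset, value is the line text."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """User-defined reduce function: sums the counts collected for one word."""
    return key, sum(values)

def run_job(records):
    # Map phase: every (key, value) input record produces intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: each key with its grouped values produces one final pair.
    return dict(reduce_fn(k, v) for k, v in sorted(intermediate.items()))

records = [(0, "Hadoop stores data"), (19, "Hadoop processes data")]
result = run_job(records)
print(result)
```

In a real cluster the map calls would run in parallel on different slaves, and the grouping step would be carried out by the framework rather than by a dictionary in memory.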
Working of Mapper and Reducer in the Hadoop MapReduce Process Flow:
Let us look at MapReduce in detail: its data flow, the inputs and outputs of Map and Reduce, the processes involved and their types, and data storage.
In the Elastic MapReduce diagram, each square box is a slave. There are 4 slaves in the figure. Mappers run on all 4 slaves, and a reducer then runs on any one of them.
Let us now discuss the map phase in steps:
- The input to a Mapper is 1 block at a time. (Split = block by default.)
- The output of a Mapper is written to the local disk of the machine on which the Mapper is running. Once the map finishes its work, this intermediate output is passed to the reducer.
- The Reducer is the second phase of processing, where the user can again write custom business logic. The output of the reducer is the final output, which is written to HDFS.
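As a rough sketch of the "split = block" rule, the number of map tasks for a file can be estimated from the file size and the block size. The 128 MB block size below is an assumption (it is the HDFS default in Hadoop 2 and later; older versions used 64 MB), and `estimate_map_tasks` is a hypothetical helper, not a Hadoop API. The real framework allows a little slack on the last split, so treat the result as an estimate.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size

def estimate_map_tasks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """One split per block, so roughly one map task per block."""
    if file_size_mb <= 0:
        return 0  # an empty file yields no splits in this sketch
    return math.ceil(file_size_mb / block_size_mb)

for size in (100, 128, 1024):
    print(size, "MB ->", estimate_map_tasks(size), "map task(s)")
```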
Important points for the Mapper
- By default, 2 Mappers run at a time on a slave; this number can be increased as required.
- The number of Mappers should not be increased beyond the defined limit, as that might decrease the performance of the system.
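The effect of the per-slave limit can be illustrated with simple arithmetic. If each slave runs a fixed number of Mappers at a time, the map tasks execute in rounds, or "waves". The helper `map_waves` and the figures used (4 slaves, as in the diagram, and 20 map tasks) are assumptions for illustration only.

```python
import math

def map_waves(num_map_tasks, num_slaves, mappers_per_slave=2):
    """Rounds needed when only a fixed number of Mappers can run at once."""
    concurrent = num_slaves * mappers_per_slave
    return math.ceil(num_map_tasks / concurrent)

print(map_waves(20, 4))     # 8 Mappers at a time -> prints 3
print(map_waves(20, 4, 5))  # 20 Mappers at a time -> prints 1
```

Fewer waves is not automatically faster: once the CPU, memory, or disk of a slave is saturated, additional concurrent Mappers compete for the same resources, which is why the limit should not be raised arbitrarily.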
How MapReduce works, along with the working of Mapper and Reducer
The Mapper writes its output to the local disk of the machine. This is temporary data, also called intermediate output. All mappers write their output to local disk. When a Mapper finishes its task, the data is sent to the reducer. This movement of output from the Mapper node to the reducer node is called the shuffle.
The Reducer also runs on a data node. The outputs of the Mappers are sent to the reducer, where the outputs from the different Mappers are merged to form the reducer's input. This input is also kept on local disk. The Reducer is another processor where you can write custom business logic; it is the second stage of processing. Reducers usually implement aggregation, summation, and similar functionality. The Reducer produces the final output, which it writes to HDFS.
Map and reduce are the two stages of processing. They run one after the other: the reducer starts processing only after all mappers have completed their processing.
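The shuffle and merge described above can be mimicked in a few lines of Python. This is a sketch under simplifying assumptions: each mapper's output is a sorted in-memory list rather than a file on local disk, there is a single reducer, and the sample data and the names `mapper_outputs` and `reduce_sum` are invented for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Sorted intermediate output of three mappers (stand-ins for local-disk files).
mapper_outputs = [
    [("apple", 1), ("cat", 1)],
    [("apple", 1), ("bat", 1)],
    [("bat", 1), ("cat", 1), ("cat", 1)],
]

def reduce_sum(key, values):
    """Aggregation-style reduce function: total of the values for one key."""
    return key, sum(values)

# Shuffle + merge: combine the sorted mapper outputs into one sorted stream.
merged = heapq.merge(*mapper_outputs, key=itemgetter(0))

# Reduce: consumes the merged stream one key group at a time.
final_output = [
    reduce_sum(key, [value for _, value in group])
    for key, group in groupby(merged, key=itemgetter(0))
]
print(final_output)
```

Merging already-sorted inputs is cheap because only the head of each input has to be compared at any time, which is one reason the reducer input is built by merging rather than by re-sorting everything.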
- Although each block is stored at 3 different locations by default, the framework allows only 1 mapper to process a given block.
- So only one Mapper processes a particular block at a time, reading one of its 3 replicas.
- The output of every mapper goes to every reducer.
- The Partitioner takes the output from the mapper and divides it into partitions.
- Each of these partitions goes to a reducer based on a condition, typically one applied to the key.
Keep in mind that Hadoop Elastic MapReduce works on key-value pairs.
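A minimal sketch of how a partition can be chosen for each key. It mirrors the idea behind Hadoop's default hash partitioning (a non-negative hash of the key, modulo the number of reducers). The helpers `java_string_hash` and `get_partition` are illustrative Python functions, not Hadoop APIs; the choice of 2 reducers is an assumption, and real Hadoop key types may compute their hashes differently.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() for plain text, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h

def get_partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # Masking with 0x7FFFFFFF clears the sign bit so the result is never negative.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

mapper_output = [("apple", 1), ("bat", 1), ("apple", 1), ("cat", 1)]
num_reducers = 2  # assumed number of reducers

partitions = {r: [] for r in range(num_reducers)}
for key, value in mapper_output:
    partitions[get_partition(key, num_reducers)].append((key, value))

print(partitions)
```

Because the partition depends only on the key, all pairs with the same key end up at the same reducer, no matter which mapper produced them.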