Mapper Reducer Hadoop
Before diving into Mapper Reducer Hadoop, let's understand some terminology first.
MapReduce converts a list of input records into a list of output records. Between Map and Reduce there is a small phase called Shuffle & Sort.
Let's start with the Mapper Reducer Hadoop terminology:
- A MapReduce Job is the "full program" a client wants to be performed.
- A Job is nothing but the complete two processing layers, Map and Reduce, starting from the client's input and ending at the client's output.
- It contains the Mapper process and the Reducer process. A full job requires input data, a MapReduce program, and a set of configuration information. Some of the configuration information is provided by the Hadoop setup itself.
- A task is a process performed on a small slice of data on a particular data node.
- A task is the execution of one of the two layers, Map or Reduce, on a slice of data. A task is also called a TIP (Task In Progress), meaning that data processing is in progress in either Map or Reduce.
- An attempt, in general, refers to a try at a process that failed during its earlier runs. (For example, someone might say "this is my fourth attempt at my 12th-grade final exams," meaning he failed three times before and is trying for the fourth time.) Here, an attempt is one instance of trying to execute a task on a node.
- There is always a possibility of a machine failing at any time. In that case the master reschedules the failed task on another machine, but the rescheduling cannot be repeated indefinitely; there is a limit on it.
- The limit set for rescheduling is called the number of attempts. The default number of attempts is 4. If a task fails all 4 times, the whole job is considered failed. The number of attempts per task can be increased through the configuration file.
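For example, the per-task attempt limit can be raised in mapred-site.xml. The property names below are the ones used in Hadoop 2.x and later; verify them against your Hadoop version:

```xml
<!-- mapred-site.xml: raise the retry limit for map and reduce tasks -->
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>8</value>
</property>
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>8</value>
</property>
```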
We now have an idea of the Mapper Reducer Hadoop terminology. We will cover the questions below:
- What is Mapper Reducer Hadoop?
- How does the Mapper work?
- How does the Reducer work?
What is the Mapper or Map Abstraction:
Let us understand:
- What is a Mapper?
- What is the input to the Mapper?
- How are the inputs to the Mapper processed?
- What is the outcome from the Mapper?
The Hadoop MapReduce framework operates on key/value pairs as the input to a job. Whatever the input format, structured or unstructured, the framework converts the input data into keys and values. It also produces the output as a set of key/value pairs.
- The key is a reference to the input value.
- The value is the data set on which to operate.
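As a rough illustration of these two bullets, here is a plain-Python sketch (not the Hadoop API) of how a TextInputFormat-style reader turns raw text into key/value pairs: the key is the byte offset of each line in the file, and the value is the line itself. The function name is illustrative.

```python
def to_key_value_pairs(text):
    """Yield (byte_offset, line) pairs, mimicking how Hadoop's default
    text input presents each line to the Mapper: the key references
    the line's position, the value is the line's content."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

pairs = list(to_key_value_pairs("hello world\nhello hadoop\n"))
# pairs == [(0, "hello world"), (12, "hello hadoop")]
```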
Mapper Processing – The user can define a function to process the data according to the business-logic requirement. The defined function is applied to every value in the input.
Mapper Output – The map produces a new set of key/value pairs as its output. These are called intermediate outputs. This output is stored on the local disk, from where it is shuffled to the reduce nodes.
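To make the Mapper concrete, here is a sketch of the classic word-count map function in plain Python. The real Hadoop API is Java's Mapper class; the function below is purely illustrative of the idea.

```python
def word_count_map(key, value):
    """Map function for word count: ignores the input key (the byte
    offset) and emits an intermediate (word, 1) pair for every word
    in the line."""
    for word in value.split():
        yield word.lower(), 1

intermediate = list(word_count_map(0, "Deer Bear River"))
# intermediate == [("deer", 1), ("bear", 1), ("river", 1)]
```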
What is the Reducer or Reduce Abstraction:
So the second major phase of MapReduce is Reduce. Let us understand:
- What is the input to the Reducer?
- How are the inputs to the Reducer processed?
- What is the outcome from the Reducer?
The Reducer takes key/value pairs as its input: the intermediate output from the Mapper is the input to the Reducer.
- The input given to the reducer is generated by Map (the intermediate output)
- The key/value pairs provided to reduce are sorted by key
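The two bullets above describe the shuffle & sort step. It can be mimicked in plain Python with a dictionary that groups the intermediate pairs by key and then sorts the keys (again, illustrative only; the framework does this for you):

```python
from collections import defaultdict

def shuffle_and_sort(intermediate_pairs):
    """Group intermediate (key, value) pairs by key and return them
    sorted by key, as the framework does before calling reduce."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return sorted(groups.items())

grouped = shuffle_and_sort([("bear", 1), ("deer", 1), ("bear", 1)])
# grouped == [("bear", [1, 1]), ("deer", [1])]
```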
Reducer Processing – It works similarly to the Mapper. A user-defined function implementing the business logic is applied to produce the output. An iterator supplies the values for a given key to the reduce function.
Reducer Output – The Reducer produces the final output and stores the data in HDFS.
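Completing the word-count sketch: a reduce function receives one key together with the values supplied for it and emits the final pair. As before, this is plain Python with illustrative names, not the Hadoop Reducer API.

```python
def word_count_reduce(key, values):
    """Reduce function for word count: sums the counts supplied
    for a single word and emits the final (word, total) pair."""
    return key, sum(values)

# Applied to the grouped output of shuffle & sort:
final = [word_count_reduce(k, vs) for k, vs in [("bear", [1, 1]), ("deer", [1])]]
# final == [("bear", 2), ("deer", 1)]
```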
That's all about "Mapper Reducer Hadoop". Next we are set for the working flow of Hadoop and MapReduce. Let's cover that in the next session.