Hadoop MapReduce (Mapping and Reducing) Workflow

Let us study the Hadoop MapReduce (Mapping and Reducing) workflow by counting the number of colour words present in a given text file.

[Figure: Hadoop MR workflow]

[Figure: Hadoop MR output]
In the MapReduce workflow there are five steps to get the output:

  1. Input: The input is a text file containing colour words in random order, separated by commas.
  2. Splitting: The input is split using the word separator, here a comma (it can be any delimiter, such as a colon, a full stop, etc.).
  3. Mapping: Each word is turned into an intermediate (key, value) pair, e.g. (Red, 1).
  4. Intermediate Splitting (Shuffle): The mapped data may be distributed across the entire cluster; before the Reduce phase, all pairs with the same key must be brought to the same node.
  5. Reducing: The values for each key are grouped and summed, and the results are combined to form the final output.
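The steps above can be sketched in plain Java, with no Hadoop involved — a minimal, single-machine simulation of splitting, mapping, shuffling, and reducing over some sample comma-separated colour words (the input string and class name here are illustrative, not from the original example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    public static void main(String[] args) {
        // 1. Input: comma-separated colour words (sample data for illustration)
        String input = "Red,Blue,Green,Red,Blue,Red,Blue,Blue";

        // 2. Splitting: break the record on the comma delimiter
        String[] words = input.split(",");

        // 3. Mapping: emit a key for every word; each key carries an implicit value of 1
        List<String> mapOutputKeys = new ArrayList<>();
        for (String w : words) {
            mapOutputKeys.add(w.trim().toUpperCase());
        }

        // 4. Shuffling (intermediate splitting): bring all values for the same
        //    key together — Hadoop does this across the cluster between map and reduce
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : mapOutputKeys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }

        // 5. Reducing: sum each key's list of ones to get the final count
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}
```

Running it prints each distinct colour with its count, one per line, in sorted key order — the same shape as the `part-r-00000` file Hadoop produces at the end of this tutorial.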

Example

Now let us write the Java program.

  1. Open Eclipse -> New -> Java Project -> (WordCountDemo) -> Finish
  2. Right Click -> New -> Package -> (WordCountPackageDemo) -> Finish
  3. Right Click on Package -> New -> Class -> WordCount
  4. Add the Hadoop libraries:
    1. hadoop-core.jar
    2. commons-cli-1.2.jar
  5. Write the Java code for counting the words.

Coding

package WordCountPackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            // Convert the input line into a String
            String line = value.toString();

            // Split the line on the comma delimiter
            String[] words = line.split(",");

            // Emit (WORD, 1) for every word on the line
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // The intermediate output (RED, 1), (BLUE, 1), ... is shuffled and sorted,
    // so the reducer receives each key with all of its values, e.g. (RED, [1, 1, 1]).

    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Output, e.g., (RED, 3)
            con.write(word, new IntWritable(sum));
        }
    }
}
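One detail of the mapper worth calling out is the key normalisation: `word.toUpperCase().trim()` means that `"Red"`, `" red "`, and `"RED"` all collapse to the same key and are counted together. A small stand-alone check of that tokenisation logic, in plain Java with no Hadoop dependency (the class name and sample line are illustrative):

```java
public class MapperTokenDemo {
    public static void main(String[] args) {
        // Sample line with inconsistent casing and stray spaces
        String line = "Red, red ,BLUE,blue";

        // Same split and normalisation as the mapper above
        for (String word : line.split(",")) {
            String key = word.toUpperCase().trim();
            System.out.println("(" + key + ", 1)");
        }
    }
}
```

All four tokens normalise to just two distinct keys, RED and BLUE, so the reducer will see two groups of two values each.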

Compile WordCount.java

1. Create a directory, say "BeyondCornerWordCount", to store the compiled Java classes:

mkdir BeyondCornerWordCount

2. Compile using $HADOOP_CLASSPATH or using the jars directory:

javac -classpath $HADOOP_CLASSPATH -d BeyondCornerWordCount/ WordCount.java

OR

[hdfs@beyondCorner-hadoop1 Development]$ ll jars

total 4728

-rw-r--r--. 1 hdfs hadoop   17052 Dec 20 02:22 hadoop-annotations-2.6.0.3.0.0.0-249.jar
-rw-r--r--. 1 hdfs hadoop 3309465 Nov 19 07:05 hadoop-common-2.6.0.3.0.0.0-249.jar
-rw-r--r--. 1 hdfs hadoop 1509831 Nov 19 07:06 hadoop-mapreduce-client-core-2.6.0.3.0.0.0-249.jar

[hdfs@beyondCorner-hadoop3 Development]$ javac -cp "jars/hadoop-annotations-2.6.0.3.0.0.0-249.jar:jars/hadoop-common-2.6.0.3.0.0.0-249.jar:jars/hadoop-mapreduce-client-core-2.6.0.3.0.0.0-249.jar" WordCount.java

Verify the compiled classes in the "BeyondCornerWordCount" directory; it should contain the main class along with the Map and Reduce inner classes (javac creates a subdirectory for the package):

ls BeyondCornerWordCount/WordCountPackageDemo/
WordCount.class  WordCount$MapForWordCount.class  WordCount$ReduceForWordCount.class
Create the JAR file using Eclipse

Right Click on Project -> Export -> Java -> JAR file -> Export destination (BeyondCorner directory) -> Next -> Finish.

Create the input file (wordCountFile) and move it into HDFS using the command below:

[training@beyondcorner ~]$ hadoop fs -put wordCountFile wordCountFile

Run the JAR file

(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)

[training@beyondcorner ~]$ hadoop jar WordCountDemo.jar WordCountPackageDemo.WordCount wordCountFile BeyondCornerWordCount

Open the result

[training@beyondcorner ~]$ hadoop fs -ls BeyondCornerWordCount

Found 3 items

-rw-r--r--   1 training supergroup          0 2016-02-23 03:36 /user/training/BeyondCornerWordCount/_SUCCESS
drwxr-xr-x   - training supergroup          0 2016-02-23 03:36 /user/training/BeyondCornerWordCount/_logs
-rw-r--r--   1 training supergroup         20 2016-02-23 03:36 /user/training/BeyondCornerWordCount/part-r-00000

[training@beyondcorner ~]$ hadoop fs -cat BeyondCornerWordCount/part-r-00000

BLUE   6

GREEN  2

PINK   4

RED    4

"That's all about the Hadoop MapReduce workflow."