Hadoop Configuration Custom Data Types

Let’s take a look at Hadoop custom data types and how to configure Hadoop data types.

Overview

In this tutorial we will cover:

  1. What are the different data types provided by Hadoop?
  2. What are the different data input and output formats provided by Hadoop?
  3. How to configure Hadoop data types.

1. Serialization

The process of converting object data into a byte stream is known as serialization.

* Data lives in memory as objects, but to transfer it over the network from one machine to another in a cluster it must first be serialized.

* Objects cannot travel over pipes or wires directly; we need to convert them into a byte stream.

2. De-serialization

* As the name suggests, de-serialization is the reverse process of serialization.

* In de-serialization, the data arrives as a stream of bytes and is converted back into data objects that can be read from HDFS.
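To make both steps concrete, the short program below serializes an IntWritable into a byte array and then de-serializes it back, using the write() and readFields() methods of Hadoop's Writable interface (a minimal sketch; the class name SerializationDemo is just for illustration):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class SerializationDemo {
    public static void main(String[] args) throws IOException {
        // Serialization: object -> byte stream
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        original.write(new DataOutputStream(out));
        byte[] bytes = out.toByteArray();   // an IntWritable becomes 4 bytes

        // De-serialization: byte stream -> object
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(restored.get()); // prints 42
    }
}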

Why Hadoop data types? Why not Java data types?

1. Why can’t we use Java data types for MapReduce?

We use MapReduce-specific data types to improve performance: the serialization mechanism provided by Hadoop makes its data types more efficient than plain Java types.

2. Why Hadoop serialization? Why not Java serialization?

Because Hadoop serialization is more compact and efficient than Java serialization.

* Java serialization writes a type name with each object. Emitting the type name for every object produces a large amount of data, whereas Hadoop serialization produces far less intermediate data, so it takes less time and space.

* Java serialization always serializes the whole object. For example, if you serialize an employee record, it serializes every field in that object: Emp ID, Emp Name, Dept, Salary and so on.

* Hadoop serialization has a mechanism to choose exactly which fields you want to serialize. It is a customized form of serialization.
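As an illustration of this, a custom Writable can write only the fields it needs to the byte stream. The EmployeeWritable class below is a hypothetical sketch that serializes just the ID and salary and deliberately skips the department name:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class EmployeeWritable implements Writable {
    private int empId;
    private float salary;
    private String deptName;   // kept in memory, but intentionally not serialized

    @Override
    public void write(DataOutput out) throws IOException {
        // Only the fields we choose are written to the byte stream.
        out.writeInt(empId);
        out.writeFloat(salary);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read back in the same order they were written.
        empId = in.readInt();
        salary = in.readFloat();
    }
}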

The input and output data types of a Hadoop MapReduce application must be configured before the MapReduce job can run.

Steps to configure Hadoop Data Types

1. Specify the key-value input data types (key: LongWritable, value: Text) and output data types (key: Text, value: IntWritable) of your mapper using the generic type parameters.

public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context) ... {
        ...
    }
}
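For example, in a word-count style job the map method might be filled in as follows (a minimal sketch assuming whitespace-separated text input):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for each token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}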

2. Specify the key-value input data types (key: Text, value: IntWritable) and output data types (key: Text, value: IntWritable) of your reducer using the generic type parameters. The reducer takes its input from the mapper's output, so the reducer's input key-value data types must match the mapper's output key-value data types.

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        ...
    }
}
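Continuing the word-count example, the reduce method might simply sum the counts for each key (again a sketch, not the only possible implementation):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted by the mapper for this key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}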

3. The output data types of the MapReduce computation should be specified on the Job object. These data types serve as the output types for both the reducer and the mapper. Alternatively, you can configure the mapper output types separately, as done in step 4.

Job job = new Job(..);
....
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

4. Optionally, the mapper and reducer can have different output data types. In that case you can configure the mapper's output key-value data types separately using the following calls.

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
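Putting steps 1 to 4 together, a driver class could wire up the mapper, the reducer, and the data types as sketched below (the class name SampleDriver and the command-line input/output paths are illustrative assumptions; the newer Job.getInstance() factory is used instead of the deprecated Job constructor):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sample job");

        job.setJarByClass(SampleDriver.class);
        job.setMapperClass(SampleMapper.class);
        job.setReducerClass(Reduce.class);

        // Step 3: output types for the job (used by the reducer, and by the mapper by default)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Step 4: mapper output types, if they differ from the job's output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}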

General understanding needed to execute a MapReduce Job

To execute a MapReduce job, you should understand the following about its data types:

  1. Understand the format and type of the data.
  2. What is the data delimiter?
  3. Identify the keys and values: K1,V1 (mapper input), K2,V2 (mapper output / reducer input), K3,V3 (reducer output).
  4. While identifying the data types, decide whether a reducer is needed at all.
  5. Decide whether a single reducer or multiple reducers are required.
  6. Specify the input format; the K1 and V1 data types are identified automatically from it (see the snippet after this list).
  7. Write the MapReduce code according to the identified data types.
  8. Similarly, identify the data types for K2,V2 and K3,V3.
  9. Understand what will be done in the mapper phase and the reducer phase.
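For item 6, the chosen input format is what fixes K1 and V1. With the default TextInputFormat the mapper receives LongWritable keys (byte offsets) and Text values (lines); with KeyValueTextInputFormat it receives Text keys and Text values split on the first tab of each line. A sketch of the relevant calls (both classes live in org.apache.hadoop.mapreduce.lib.input):

job.setInputFormatClass(TextInputFormat.class);             // K1 = LongWritable, V1 = Text

// Or, for tab-separated key-value lines:
// job.setInputFormatClass(KeyValueTextInputFormat.class);  // K1 = Text, V1 = Text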

Test Data Types 

We can exercise the built-in Writable types with the get(), set(), getBytes(), put(), containsKey(), getLength(), and keySet() methods, as in the following program.

Example
import org.apache.hadoop.io.*;
import java.util.*;

public class WritablesTest {

    public static class TextArrayWritable extends ArrayWritable {
        public TextArrayWritable() {
            super(Text.class);
        }
    }

    public static class IntArrayWritable extends ArrayWritable {
        public IntArrayWritable() {
            super(IntWritable.class);
        }
    }

    public static void main(String[] args) {

        // IntWritable: get() and set()
        IntWritable i1 = new IntWritable(2);
        IntWritable i2 = new IntWritable();
        i2.set(5);
        IntWritable i3 = new IntWritable();
        i3.set(i2.get());
        System.out.printf("Int Writables Test I1:%d , I2:%d , I3:%d", i1.get(), i2.get(), i3.get());

        // BooleanWritable and ByteWritable
        BooleanWritable bool1 = new BooleanWritable();
        bool1.set(true);
        ByteWritable byte1 = new ByteWritable((byte) 7);
        System.out.printf("\n Boolean Value:%s Byte Value:%d", bool1.get(), byte1.get());

        // Text: toString(), getLength(), getBytes()
        Text t = new Text("hadoop");
        Text t2 = new Text();
        t2.set("pig");
        System.out.printf("\n t: %s, t.length: %d, t2: %s, t2.length: %d \n",
                t.toString(), t.getLength(), t2.toString(), t2.getBytes().length);

        // ArrayWritable holding IntWritable and Text elements
        ArrayWritable a = new ArrayWritable(IntWritable.class);
        a.set(new IntWritable[]{new IntWritable(10), new IntWritable(20), new IntWritable(30)});

        ArrayWritable b = new ArrayWritable(Text.class);
        b.set(new Text[]{new Text("Hello"), new Text("Writables"), new Text("World !!!")});

        for (IntWritable i : (IntWritable[]) a.get())
            System.out.println(i);

        for (Text i : (Text[]) b.get())
            System.out.println(i);

        // Custom ArrayWritable subclass
        IntArrayWritable ia = new IntArrayWritable();
        ia.set(new IntWritable[]{new IntWritable(100), new IntWritable(300), new IntWritable(500)});

        IntWritable[] ivalues = (IntWritable[]) ia.get();
        for (IntWritable i : ivalues)
            System.out.println(i);

        // MapWritable: put(), containsKey(), get(), keySet()
        MapWritable m = new MapWritable();
        IntWritable key1 = new IntWritable(1);
        NullWritable value1 = NullWritable.get();

        m.put(key1, value1);
        m.put(new VIntWritable(2), new LongWritable(163));
        m.put(new VIntWritable(3), new Text("Mapreduce"));

        System.out.println(m.containsKey(key1));
        System.out.println(m.get(new VIntWritable(3)));

        m.put(new LongWritable(1000000000), key1);
        Set<Writable> keys = m.keySet();

        for (Writable w : keys)
            System.out.println(m.get(w));
    }
}

Compile and run the above Java program; it prints the values stored in each Writable type to the console.


This should give you a working understanding of the built-in Hadoop data types, along with a tested example.

That’s all about Hadoop custom data types and how to configure Hadoop data types.