Hadoop Streaming

 

Let’s study Hadoop Streaming.

Purpose

The main purpose behind designing Hadoop Streaming is this: by default, the MapReduce framework is written in Java and supports only MapReduce programs written in Java. Hadoop Streaming lets the Hadoop framework run programs written in other languages as well.

Overview

Hadoop Streaming is a utility which permits users to write MapReduce programs in any language and run them as Map or Reduce jobs. Any program that reads from standard input and writes to standard output can be used for the Map and Reduce tasks.

As we all know, the core architecture of Hadoop is to have a mapper and a reducer. Hadoop Streaming supports languages such as Python, Ruby, PHP, Perl, bash, etc. Previous versions of Hadoop supported only text processing, whereas the latest versions support both binary and text files.
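Because any program that reads standard input and writes standard output will do, even standard Unix tools can serve as the mapper and reducer. A minimal sketch, assuming a Hadoop 1.2.1 layout and illustrative input/output directory names:

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc

Here /bin/cat passes each input line through unchanged, and /usr/bin/wc counts the lines, words and bytes it receives.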


Example

Let us consider a word count problem for Hadoop Streaming, so code must be written to perform both the mapper and the reducer jobs.

In this example the code is written in Python and run under Hadoop. Similarly, other languages like Perl and Ruby can be used to write the mapper and reducer jobs.

1. Mapper Phase code

#!/usr/bin/python

import sys

# Input is taken from standard input, line by line
for beyondcornerline in sys.stdin:
    # Remove whitespace on either side
    beyondcornerline = beyondcornerline.strip()
    # Break the line into words
    words = beyondcornerline.split()
    # Iterate over the words list
    for myword in words:
        # Write the result to standard output
        print('%s\t%s' % (myword, 1))

 

Execution permission must be granted to the file before executing it; grant it using the “chmod” command (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
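The mapper can be sanity-checked locally before submitting any job. A quick sketch, assuming the script was saved to the path used above (the sample text is arbitrary):

$ echo "beyond corner beyond" | /home/expert/hadoop-1.2.1/mapper.py
beyond	1
corner	1
beyond	1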

2. Reducer Phase code

#!/usr/bin/python

import sys

current_word = ""
current_count = 0
word = ""

# Input is taken from standard input, line by line
for beyondcornerline in sys.stdin:
    # Remove whitespace on either side
    beyondcornerline = beyondcornerline.strip()
    # Split the input we got from mapper.py
    word, count = beyondcornerline.split('\t', 1)
    # Convert the count variable to an integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Both the mapper and the reducer program files are saved in the Hadoop home directory as:

  1. mappertask.py
  2. reducertask.py

As said before, these files must have execution permission (chmod +x mappertask.py and chmod +x reducertask.py).
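With both scripts in place, the whole word count pipeline can be simulated locally, using sort to mimic the shuffle phase. A minimal sketch (the sample text is arbitrary):

$ echo "beyond corner beyond corner hadoop" | ./mappertask.py | sort -k1,1 | ./reducertask.py
beyond	2
corner	2
hadoop	1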

Execution of Word Count Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-input input_dirs \
-output output_dir \
-mapper <path>/mappertask.py \
-reducer <path>/reducertask.py

Note: The “\” is used for line continuation and clear readability.

Example

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py
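Once the job finishes, the result can be inspected directly from HDFS. A quick sketch; part-00000 is the default name of the first reducer’s output file:

$ $HADOOP_HOME/bin/hadoop fs -cat myoutput/part-00000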

How does Streaming work?

What do the mapper and reducer Python scripts written above do?

* They read input from standard input (line by line) and produce output on standard output; streaming is the utility that turns such scripts into a MapReduce job.

* When an executable or a script is specified for mappers, the mappers start launching the mapping process. As we have learnt in previous topics, there are multiple mappers on the slave nodes, so each mapper launches the script as a separate process when it is initialized.

* The reducer works almost like the mapper. When an executable or a script is specified for reducers, the reducers start launching the reduce process; as there are multiple reducers in the MapReduce framework, each reducer launches the script as a separate process when it is initialized.
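Because each node launches the script as its own process, the script files must be available on every node that runs a task. Hadoop Streaming provides the -file option to ship them with the job; a minimal sketch reusing the file names from above:

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-input myinput \
-output myoutput \
-mapper mappertask.py \
-reducer reducertask.py \
-file mappertask.py \
-file reducertask.py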

File Processing in Hadoop Streaming

* The mapper takes its input from HDFS, converts it into lines, and feeds those lines to the STDIN (standard input) of the process.

* The mapper produces intermediate output in the form of key-value pairs, which is used as input for the reducer. The mapper collects the line-oriented outputs from the STDOUT (standard output) of the process and converts each line into a key-value pair.

* The key and value can be customized based on the requirements of the client. By default, everything up to the first tab character in a line is treated as the key, and the remainder of the line (excluding the tab) is the value; this split can be tuned as shown in the sketch after this list.

* The reducer takes the key-value pairs from the mapper as input, converts them into lines, and feeds those lines to the STDIN (standard input) of the process. The reducer then collects the line-oriented outputs from the STDOUT (standard output) of the process and converts each line back into a key-value pair.
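A hedged sketch of customizing the key/value split, using the stream.map.output.field.separator and stream.num.map.output.key.fields properties (the separator and field count here are illustrative); note that the -D generic options come before the streaming options:

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-input myinput \
-output myoutput \
-mapper mappertask.py \
-reducer reducertask.py

With these settings, everything up to the fourth “.” in a mapper output line is treated as the key and the rest of the line as the value.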

Important Commands

Streaming supports both streaming command options as well as generic command options.

Note: The generic options must always be placed before the streaming options; if this order is not followed, the command will fail.

Syntax:

bin/hadoop command [genericOptions] [streamingOptions]
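For example, a generic -D option setting the number of reduce tasks must precede the streaming options. A minimal sketch (mapred.reduce.tasks is the Hadoop 1.x property name):

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.reduce.tasks=2 \
-input myinput \
-output myoutput \
-mapper mappertask.py \
-reducer reducertask.py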

Streaming Commands

The most commonly used streaming options are:

  1. -input directory/file-name : input location for the mapper (required)
  2. -output directory-name : output location for the reducer (required)
  3. -mapper executable or script : mapper executable (required)
  4. -reducer executable or script : reducer executable (required)
  5. -file file-name : makes the mapper or reducer executable available locally on the compute nodes (optional)
  6. -combiner executable or script : combiner executable for map output (optional)
  7. -numReduceTasks : number of reducers to use (optional)

“That’s all about Hadoop Streaming; I hope it is useful for Hadoop beginners.”