Hadoop Environment Setup & Installation

Pre-installation Setup

Before installing Hadoop, the Linux environment must be prepared, including SSH (Secure Shell) access. The steps below walk through this setup.

Creating a User

To isolate the Hadoop file system from the Unix file system, it is recommended to create a separate user for Hadoop. Create one as follows:

  • Open the root account using the command “su”.
  • From the root account, create a new user using the command “useradd username”.
  • An existing user account can be opened using the command “su username”.

Type the following commands in a Linux terminal to create the user.

$ su
password:
# useradd hadoop
# passwd hadoop
New password:
Retype new password:
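
You can optionally verify the new account by switching to it and checking the effective user:

$ su hadoop
$ whoami
hadoop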

SSH Setup and Key Generation

SSH setup is required to perform cluster operations such as starting and stopping the distributed daemons, and it is also used for remote login. To authenticate the different users of Hadoop, a public/private key pair is generated for the Hadoop user and the public key is shared with the other users. Let us now go through the steps involved in SSH installation.

Step 1. Installation of passwordless SSH

sudo apt-get install ssh
sudo apt-get install pdsh

Step 2. Generate Key Pairs

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Step 3. Configure password less SSH

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Step 4. Change the permissions of the file that contains the key

chmod 0600 ~/.ssh/authorized_keys

Step 5. Check SSH to localhost

ssh localhost

Installing Java

Hadoop runs on Java, so Java is the main prerequisite for Hadoop. Check whether Java is already installed on your system using the following command.

$ java -version

If Java is already installed, you will see output similar to the following.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Step 1:

Download Java (JDK <latest version> – X64.tar.gz) from the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Step 2:

Verify the downloaded file and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71   jdk-7u71-linux-x64.gz

Step 3:

Move Java to the location “/usr/local/”, where it will be available to all users, using the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4:

Add the following lines to the ~/.bashrc file to set the PATH and JAVA_HOME variables.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Apply the changes to the currently running shell.

$ source ~/.bashrc
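
To confirm the variables took effect in the current shell, echo JAVA_HOME and re-run the version check:

$ echo $JAVA_HOME
/usr/local/jdk1.7.0_71
$ java -version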

Step 5:

Configure Java alternatives using the commands below (the paths match the /usr/local/jdk1.7.0_71 location used in Step 3).

# alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2
# alternatives --set java /usr/local/jdk1.7.0_71/bin/java
# alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac
# alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar

Use the java -version command to verify the installation.

Install Hadoop

Download Hadoop

http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-3.0.0-alpha2/hadoop-3.0.0-alpha2.tar.gz

(This guide uses hadoop-3.0.0-alpha2.tar.gz; check the Apache mirrors for the latest release.)
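
On a headless machine, the same archive can be fetched from the terminal (assuming wget is installed):

$ wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-3.0.0-alpha2/hadoop-3.0.0-alpha2.tar.gz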

Untar Tarball

tar -xzf hadoop-3.0.0-alpha2.tar.gz

Hadoop Setup Configuration

Edit .bashrc

Open .bashrc:

nano ~/.bashrc

The .bashrc file is located in the user’s home directory. Add the parameters below to it:

export HADOOP_PREFIX="/home/beyondcorner/hadoop-3.0.0-alpha2"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}

Then run:

source ~/.bashrc
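
To confirm the Hadoop binaries are now on the PATH, you can check the version:

$ hadoop version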

Edit hadoop-env.sh

Edit the hadoop-env.sh configuration file (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
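
If you are unsure of the Java location on your machine, one common way to find it (assuming java is on your PATH) is to resolve the real path of the binary and strip the trailing /bin/java:

$ readlink -f $(which java)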

Edit core-site.xml

Edit the core-site.xml configuration file (located in HADOOP_HOME/etc/hadoop) and add the entries below:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/beyondcorner/hdata</value>
  </property>
</configuration>
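
The directory named in hadoop.tmp.dir must exist and be writable by the Hadoop user before HDFS is formatted. Create it to match the value above (adjust the path to your own home directory):

$ mkdir -p /home/beyondcorner/hdata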

Edit hdfs-site.xml

Edit the hdfs-site.xml configuration file (located in HADOOP_HOME/etc/hadoop) and add the entries below. A replication factor of 1 is appropriate for this single-node setup:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Edit mapred-site.xml

If mapred-site.xml is not present, create it from the template using the command below:

cp mapred-site.xml.template mapred-site.xml
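
Note that Hadoop 3.x distributions generally ship mapred-site.xml directly, so the copy may be unnecessary; you can check first (using the HADOOP_PREFIX variable set earlier):

$ ls $HADOOP_PREFIX/etc/hadoop/mapred-site.xml*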

Edit the mapred-site.xml configuration file (located in HADOOP_HOME/etc/hadoop) and add the entries below:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Edit yarn-site.xml

Edit the yarn-site.xml configuration file (located in HADOOP_HOME/etc/hadoop) and add the entries below:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

How to start the Hadoop services

Now let us see how to run the Hadoop cluster.

The first step is to format the Hadoop file system, which is implemented on top of your cluster’s local file system.

Format the NameNode
bin/hdfs namenode -format

Note: Formatting deletes all existing data in HDFS, so it should be done only once, when you first install Hadoop.

Start HDFS Services
sbin/start-dfs.sh

If an error is thrown while starting the HDFS services (typically from pdsh), run:

echo "ssh" | sudo tee /etc/pdsh/rcmd_default
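
Alternatively, the same fix can be applied only to the current shell session by exporting pdsh’s remote-command type:

$ export PDSH_RCMD_TYPE=ssh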
Start YARN Services
sbin/start-yarn.sh
Check how many daemons are running

Check whether the expected Hadoop processes are running:

jps

2961 ResourceManager
2482 DataNode
3077 NodeManager
2366 NameNode
2686 SecondaryNameNode
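
If any daemon is missing from the list, the log files under the installation directory are the first place to look (log names follow the hadoop-<user>-<daemon>-<hostname>.log pattern):

$ ls $HADOOP_PREFIX/logs
$ tail -n 50 $HADOOP_PREFIX/logs/hadoop-*-namenode-*.log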

How to stop the Hadoop services

Stop YARN Services

sbin/stop-yarn.sh

Stop HDFS Services

sbin/stop-dfs.sh

Note

Browse the web interface for the Name Node; by default, it is available at:

Name Node – http://localhost:9870/

Browse the web interface for the Resource Manager; by default, it is available at:

Resource Manager – http://localhost:8088/

Run a MapReduce job

We are now ready to run our first Hadoop MapReduce job, using the Hadoop word count example.
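
A minimal word count run, executed from the Hadoop installation directory, might look like the following (the jar version matches the hadoop-3.0.0-alpha2 release used above, and the HDFS paths are illustrative; adjust both to your setup):

$ bin/hdfs dfs -mkdir -p /user/hadoop/input
$ bin/hdfs dfs -put etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha2.jar wordcount input output
$ bin/hdfs dfs -cat output/part-r-00000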