Apache Pig Reading/Loading Data

Let’s look at how Apache Pig reads data.

Apache Pig is built on top of Hadoop. It is an analytical tool used to analyze large datasets stored in HDFS (Hadoop Distributed File System).

In this article we will study how to read/load data into Apache Pig from HDFS.

Example

Let’s consider ‘employee.txt’ as the sample data, as shown below.

Id    Name      Age   Dept
100   Roshan    23    HR
101   Roy       27    CS
102   Shruthi   31    IT
103   Disha     28    EC
104   Gowri     30    HR
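
The sample rows above could be written to a local file as in the sketch below. It assumes the fields are comma-separated, since the LOAD statement in step 7 uses PigStorage(‘,’), and it writes to the current directory rather than a specific path. No header line is written, because Pig would load it as an ordinary data row.

```shell
# Assumption: fields are comma-separated, to match PigStorage(',') in step 7.
# The file is written to the current directory; adjust the path as needed.
cat > employee.txt <<'EOF'
100,Roshan,23,HR
101,Roy,27,CS
102,Shruthi,31,IT
103,Disha,28,EC
104,Gowri,30,HR
EOF

# Sanity check: every line should have exactly four comma-separated fields.
awk -F',' 'NF != 4 { bad++ } END { exit (bad > 0) }' employee.txt \
  && echo "employee.txt looks well-formed"
```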

Follow the steps below to read/load data into Apache Pig from HDFS.

1. Verify the Hadoop Version

In this step we verify the Hadoop version using the command below.

$ hadoop version

2. Starting HDFS

In the second step we start the HDFS and YARN services using the commands below.

$ cd $HADOOP_HOME/sbin/

$ start-dfs.sh

$ start-yarn.sh
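
One way to confirm that the services came up is to look at the running Java processes with the JDK’s jps tool. The sketch below is only an illustration: missing_daemons is a hypothetical helper (not part of Hadoop), and the daemon list assumes a single-node setup where start-dfs.sh and start-yarn.sh launch everything on the local machine.

```shell
# Hypothetical helper: reads `jps` output on stdin and prints the name of
# each expected daemon that does not appear in it.
missing_daemons() {
  input=$(cat)
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # -w matches whole words, so "SecondaryNameNode" does not count as "NameNode".
    printf '%s\n' "$input" | grep -qw "$daemon" || echo "$daemon"
  done
}

# Run the check only where jps is available; no output means nothing is missing.
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```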

3. Create a Directory in HDFS

In this step we create a directory named “emp_pigdata” in HDFS using the mkdir command, as shown below.

$ cd $HADOOP_HOME/bin/

$ hdfs dfs -mkdir hdfs://localhost:9000/emp_pigdata

4. Moving data to HDFS

In this step we copy “employee.txt” from the local file system into HDFS using the put command. (The mv command only moves files that are already inside HDFS.)

$ cd $HADOOP_HOME/bin

$ hdfs dfs -put /home/Hadoop/employee.txt hdfs://localhost:9000/emp_pigdata/

5. Verifying the Data

In this step we verify the contents of employee.txt in HDFS using the cat command.

$ hdfs dfs -cat hdfs://localhost:9000/emp_pigdata/employee.txt
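
Steps 3 to 5 can also be collected into one small script, sketched below. The paths are the ones used in this article; upload_employee_data is a hypothetical function name, and the -p and -f flags (create the directory if it is absent, overwrite an existing file) are additions so that the script can be re-run safely.

```shell
HDFS_DIR="hdfs://localhost:9000/emp_pigdata"   # HDFS directory from step 3
LOCAL_FILE="/home/Hadoop/employee.txt"         # local file from step 4

# Hypothetical wrapper: create the directory, upload the file, then verify it.
upload_employee_data() {
  hdfs dfs -mkdir -p "$HDFS_DIR" &&
  hdfs dfs -put -f "$LOCAL_FILE" "$HDFS_DIR/" &&
  hdfs dfs -test -e "$HDFS_DIR/employee.txt" &&
  hdfs dfs -cat "$HDFS_DIR/employee.txt"
}

# Only attempt the upload where the hdfs command is installed.
if command -v hdfs >/dev/null 2>&1; then
  upload_employee_data
else
  echo "hdfs command not found; skipping upload" >&2
fi
```

Each command runs only if the previous one succeeded, so a failed upload is not followed by a misleading cat.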

6. Start the Pig Grunt Shell

In this step, start the Pig Grunt shell using the command below.

$ pig

grunt>

7. Loading/Reading data into Pig

In this step we load the “employee.txt” data from HDFS into Apache Pig using the LOAD operator.

grunt> empdata = LOAD 'hdfs://localhost:9000/emp_pigdata/employee.txt' USING PigStorage(',');
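
The same LOAD statement can be given a schema and saved as a script, as in the sketch below. The field names and types are assumptions taken from the column headings of the sample table, and load_employee.pig is a hypothetical file name. Since the script contains no output statement, Pig only parses and validates it and does not launch a job.

```shell
# Write the Pig Latin statement to a script file (file name is hypothetical).
cat > load_employee.pig <<'EOF'
-- Assumed schema: names and types inferred from the sample table's columns.
empdata = LOAD 'hdfs://localhost:9000/emp_pigdata/employee.txt'
          USING PigStorage(',')
          AS (id:int, name:chararray, age:int, dept:chararray);
EOF

# Run the script in MapReduce mode only where Pig is installed.
if command -v pig >/dev/null 2>&1; then
  pig -x mapreduce load_employee.pig
fi
```

Declaring the schema in the AS clause lets later statements refer to fields by name (for example age) instead of by position ($2).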

Note: To verify the loaded data we need to use the diagnostic operators (discussed in the next article).