Filtering in Pig

Let's study filtering in Pig.

Filtering is a way of removing duplicate values and selecting a subset of tuples based on conditions. Apache Pig supports filtering in Pig Latin with the help of three operators, as shown below.

1. Filter Operator

* The FILTER operator is used to select tuples from a relation based on a condition.

Example

In this example, consider a file "employee_details.txt" in the HDFS directory '/beyond_empdata/' as shown below.

employee_details.txt

100,Roshan,23,HR

101,Roy,27,CS

102,Shruthi,31,IT

103,Disha,28,EC

104,Gowri,30,HR

105,Drusya,25,HR

106,manju,34,IT

Step 1: Load the file into Pig using the LOAD operator.

grunt> employee_details = LOAD 'hdfs://localhost:9000/beyond_empdata/employee_details.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, dept:chararray);
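
To confirm that the schema was applied, the relation can be inspected with the DESCRIBE operator. This is an optional check; the output line below is a sketch assuming the schema above:

grunt> DESCRIBE employee_details;

employee_details: {id: int,name: chararray,age: int,dept: chararray}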

Step 2: Use the FILTER operator to get the details of the employees who belong to the department "HR".

grunt> filter_data = FILTER employee_details BY dept == 'HR';

Step 3: Verify the relation filter_data using the DUMP operator.

grunt> DUMP filter_data;
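
Given the input file above, only the tuples whose dept field is 'HR' should be returned, along the lines of:

(100,Roshan,23,HR)

(104,Gowri,30,HR)

(105,Drusya,25,HR)

Conditions can also be combined with AND, OR, and NOT. For example, a sketch (the relation name young_hr is our own) that selects HR employees younger than 30:

grunt> young_hr = FILTER employee_details BY dept == 'HR' AND age < 30;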

2. Distinct Operator

* The DISTINCT operator is used to remove duplicate tuples from a relation.

Example

In this example, let us assume that we have a file called "employee.txt" in the HDFS directory. It contains several duplicate tuples, which we need to remove. The steps below show how to remove them.

employee.txt

100,Roshan,23,HR

101,Roy,27,CS

102,Shruthi,31,IT

103,Disha,28,EC

101,Roy,27,CS

104,Gowri,30,HR

105,Drusya,25,HR

106,manju,34,IT

103,Disha,28,EC

Step 1: Load the "employee.txt" file into Pig using the LOAD operator.

grunt> employee_data = LOAD 'hdfs://localhost:9000/beyond_empdata/employee.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, dept:chararray);

Step 2: Remove the duplicate tuples from the relation "employee_data" using the DISTINCT operator, and store the result in another relation called "distinct_data".

grunt> distinct_data = DISTINCT employee_data;

Step 3: Verify the relation "distinct_data" using the DUMP operator.

grunt> DUMP distinct_data;
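
Given the input file above, the duplicate tuples for employees 101 and 103 appear only once, so the output should be along the lines of the following (DISTINCT may also reorder the tuples):

(100,Roshan,23,HR)

(101,Roy,27,CS)

(102,Shruthi,31,IT)

(103,Disha,28,EC)

(104,Gowri,30,HR)

(105,Drusya,25,HR)

(106,manju,34,IT)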

3. Foreach Operator

* The FOREACH operator is used to generate data transformations based on the columns of a relation.

Example

In this example, consider the file "employee_details.txt" in the HDFS directory '/beyond_empdata/' as shown below. This file contains all the information about the Beyond employees; from it, we transfer a few columns into another relation for job promotion. Let us follow the steps below.

employee_details.txt

100,Roshan,23,HR

101,Roy,27,CS

102,Shruthi,31,IT

103,Disha,28,EC

104,Gowri,30,HR

105,Drusya,25,HR

106,manju,34,IT

Step 1: Load the file "employee_details.txt" into Pig using the LOAD operator for data transformation.

grunt> employee_details = LOAD 'hdfs://localhost:9000/beyond_empdata/employee_details.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, dept:chararray);

Step 2: Take the id, name, and age values of each employee from "employee_details" and store them in the relation "foreach_data".

grunt> foreach_data = FOREACH employee_details GENERATE id, name, age;

Step 3: Verify the relation "foreach_data" using the DUMP operator.

grunt> DUMP foreach_data;
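
Given the input file above, each tuple should now carry only the three projected columns, along the lines of:

(100,Roshan,23)

(101,Roy,27)

(102,Shruthi,31)

(103,Disha,28)

(104,Gowri,30)

(105,Drusya,25)

(106,manju,34)

FOREACH … GENERATE can also apply expressions to columns, not just project them. For example, a sketch (the relation name name_data and the alias upper_name are our own) using Pig's built-in UPPER function:

grunt> name_data = FOREACH employee_details GENERATE id, UPPER(name) AS upper_name;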

That's all about filtering in Pig; this concept is used to analyze subsets of data based on conditions.