Grouping and Joining in Apache Pig

Let's study grouping and joining in Apache Pig.

Grouping in Apache Pig

In Apache Pig, data is grouped using the GROUP operator, which groups the data of one or more relations.

* It collects together the tuples that have the same key.

Grouping in Apache Pig can be performed in three ways:

1. Single Column grouping

In this case we group on a single column of a relation.

Example

In this example we group on the single column "age" of the "empdata" relation.

Let us assume that we have a file called “employee.txt” in HDFS with the following content.

100,Roshan,23,HR

101,Roy,27,CS

102,Shruthi,31,IT

103,Disha,28,EC

104,Gowri,30,HR

Step 1: In this step we load the data into Pig using the LOAD operator. A schema is given so that columns can be referred to by name in the next step.

grunt> empdata = LOAD 'hdfs://localhost:9000/emp_pigdata/employee.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, dept:chararray);

Step 2: Here we group the relation on the single column "age".

grunt> groupdata = GROUP empdata by age;

Step 3: Here we verify the grouped data using the DUMP operator.

grunt> DUMP groupdata;
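Conceptually, GROUP collects the tuples of a relation into bags keyed by the grouping field. A minimal Python sketch of that semantics, using the sample rows from employee.txt (this only illustrates the idea; it is not how Pig executes the operator):

```python
from collections import defaultdict

# Sample rows from employee.txt: (id, name, age, dept)
empdata = [
    (100, "Roshan", 23, "HR"),
    (101, "Roy", 27, "CS"),
    (102, "Shruthi", 31, "IT"),
    (103, "Disha", 28, "EC"),
    (104, "Gowri", 30, "HR"),
]

def group_by(rows, key):
    """Mimic Pig's GROUP: one (key, bag) pair per distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

# GROUP empdata by age: every age here is distinct, so each bag holds one tuple.
groupdata = group_by(empdata, key=lambda r: r[2])
print(groupdata[23])  # [(100, 'Roshan', 23, 'HR')]
```

Each output tuple pairs a key value with the bag of all input tuples carrying that key, which is exactly the shape DUMP prints.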

2. Multiple Column grouping                                

In this case we group on more than one column of a relation.

Example

Let us reuse the file "employee.txt" from the previous example.

In this example we group on two columns, "name" and "age", of the "empdata" relation.

Step 1: In this step we load the data into Pig using the LOAD operator, giving a schema so the columns can be named.

grunt> empdata = LOAD 'hdfs://localhost:9000/emp_pigdata/employee.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, dept:chararray);

Step 2: Here we group on the two columns "name" and "age" of the "empdata" relation.

grunt> groupmultiple = GROUP empdata by (name,age);

Step 3: Here we verify the grouped data using the DUMP operator.

grunt> DUMP groupmultiple;

3. Group All

In this case we group all the tuples of a relation into a single group.

Example

Let us reuse the file "employee.txt" from the previous example.

In this example we group all the tuples of the "empdata" relation.

Step 1: In this step we load the data into Pig using the LOAD operator.

grunt> empdata = LOAD 'hdfs://localhost:9000/emp_pigdata/employee.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, dept:chararray);

Step 2: Here we group all the tuples of the "empdata" relation.

grunt> groupall = GROUP empdata ALL;

Step 3: Here we verify the grouped data using the DUMP operator.

grunt> DUMP groupall;
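The multiple-column and ALL variants differ from single-column grouping only in the key: a tuple of fields in the first case, a single constant in the second. A hedged Python sketch of both, using the same sample rows (illustration only):

```python
from collections import defaultdict

# Sample rows from employee.txt: (id, name, age, dept)
empdata = [
    (100, "Roshan", 23, "HR"),
    (101, "Roy", 27, "CS"),
    (102, "Shruthi", 31, "IT"),
    (103, "Disha", 28, "EC"),
    (104, "Gowri", 30, "HR"),
]

def group_by(rows, key):
    """Mimic Pig's GROUP: one (key, bag) pair per distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

# GROUP empdata by (name, age): the key is the tuple of both fields.
groupmultiple = group_by(empdata, key=lambda r: (r[1], r[2]))

# GROUP empdata ALL: a single constant key, so one bag holds every tuple.
groupall = group_by(empdata, key=lambda r: "all")

print(len(groupmultiple))    # 5 distinct (name, age) keys
print(len(groupall["all"]))  # 5 tuples in the single group
```

GROUP ... ALL is typically the first step when computing a global aggregate such as COUNT over a whole relation.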

Co-group Operator

* COGROUP is similar to the GROUP operator. The only difference is that GROUP normally operates on a single relation, whereas COGROUP operates on two or more relations.

Example

Consider two files, "employee.txt" and "student.txt", in the HDFS directory '/beyondcorner/pigdata/' as shown below.

employee.txt

100,Roshan,23,HR

101,Roy,27,CS

102,Shruthi,31,IT

103,Disha,28,EC

104,Gowri,30,HR

student.txt

10,Rosy,20,Mysore

11,Rama,21,Delhi

12,Sati,19,Mumbai

13,Druha,23,Nagpur

14,Gowthami,22,Bangalore

Step 1: In this step we load the two text files into Pig using the LOAD operator.

grunt> empdata = LOAD 'hdfs://localhost:9000/beyondcorner/pigdata/employee.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, dept:chararray);

grunt> studentdata = LOAD 'hdfs://localhost:9000/beyondcorner/pigdata/student.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);

Step 2: Here we COGROUP the two relations on the key "age".

grunt> cogroupdata = COGROUP empdata by age, studentdata by age;

Step 3: Now we verify the co-grouped data using the DUMP operator.

grunt> DUMP cogroupdata;
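COGROUP produces one tuple per key seen in *either* input, with one bag per input relation; a key missing from one side simply gets an empty bag there. A hedged Python sketch of that semantics using the two sample files (illustration only, not Pig's execution model):

```python
# Sample rows: employee.txt is (id, name, age, dept), student.txt is (id, name, age, city)
empdata = [
    (100, "Roshan", 23, "HR"),
    (101, "Roy", 27, "CS"),
    (102, "Shruthi", 31, "IT"),
    (103, "Disha", 28, "EC"),
    (104, "Gowri", 30, "HR"),
]
studentdata = [
    (10, "Rosy", 20, "Mysore"),
    (11, "Rama", 21, "Delhi"),
    (12, "Sati", 19, "Mumbai"),
    (13, "Druha", 23, "Nagpur"),
    (14, "Gowthami", 22, "Bangalore"),
]

def cogroup(left, right, key):
    """Mimic COGROUP: (key -> (left_bag, right_bag)) for every key in either input."""
    keys = {key(r) for r in left} | {key(r) for r in right}
    return {
        k: ([r for r in left if key(r) == k], [r for r in right if key(r) == k])
        for k in keys
    }

cg = cogroup(empdata, studentdata, key=lambda r: r[2])  # cogroup by age
# Age 23 appears in both inputs, so both bags are non-empty:
print(cg[23])  # ([(100, 'Roshan', 23, 'HR')], [(13, 'Druha', 23, 'Nagpur')])
# Age 20 appears only among students, so the employee bag is empty:
print(cg[20])  # ([], [(10, 'Rosy', 20, 'Mysore')])
```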

Joining in Apache Pig

In Apache Pig, records from two or more relations are joined using the JOIN operator.

A join can be performed in several ways, as described below.

Example

In this example we will see how to perform join operations in Apache Pig.

Let's consider two files, customer.txt and order.txt, in the /pigdata/ directory of HDFS as shown below.

customer.txt

1,Ram,30,Ahmedabad

2,Sam,35,Delhi

3,kushi,22,Bangalore

4,Chaitra,35,Mysore

5,Dhoni,37,Bhopal

6,Kamal,32,Nagpur

order.txt

102,2009-10-08,2

100,2009-10-08,2

101,2009-11-20,3

103,2008-05-20,4

Here we load these two files into Pig as the relations "customers" and "orders", as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pigdata/customer.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

grunt> orders = LOAD 'hdfs://localhost:9000/pigdata/order.txt' USING PigStorage(',')

as (oid:int, date:chararray, customer_id:int);

1. Self join

* A self join joins a relation with itself.

Note: To perform a self join in Pig, load the same data multiple times under different aliases to avoid naming conflicts.

Step 1: In this step we load the same data twice, under the different aliases customer1 and customer2.

grunt> customer1 = LOAD 'hdfs://localhost:9000/pigdata/customer.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

grunt> customer2 = LOAD 'hdfs://localhost:9000/pigdata/customer.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

Step 2: In this step we perform the self join by joining the two relations customer1 and customer2 on id.

grunt> customer3 = JOIN customer1 BY id, customer2 BY id;

Step 3: In this step we verify the relation customer3 using the DUMP operator.

grunt> DUMP customer3;

2. Inner join             

* An inner join joins two or more relations on common field values.

* It is the most frequently used join in Apache Pig.

* An inner join returns only the rows that match in both relations; rows with null keys are dropped.

Step 1: In this step we perform an inner join on the two relations customers and orders.

grunt> result = JOIN customers BY id, orders BY customer_id;

Step 2: In this step we verify the result using the DUMP operator.

grunt> DUMP result;
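The inner join pairs every customer tuple with every order tuple that shares its key and concatenates the two. A hedged Python sketch using the sample customer.txt and order.txt rows (illustration of the semantics only):

```python
# Sample rows: customer.txt is (id, name, age, city), order.txt is (oid, date, customer_id)
customers = [
    (1, "Ram", 30, "Ahmedabad"),
    (2, "Sam", 35, "Delhi"),
    (3, "kushi", 22, "Bangalore"),
    (4, "Chaitra", 35, "Mysore"),
    (5, "Dhoni", 37, "Bhopal"),
    (6, "Kamal", 32, "Nagpur"),
]
orders = [
    (102, "2009-10-08", 2),
    (100, "2009-10-08", 2),
    (101, "2009-11-20", 3),
    (103, "2008-05-20", 4),
]

def inner_join(left, right, lkey, rkey):
    """Mimic Pig's JOIN: concatenate each matching pair of tuples."""
    return [l + r for l in left for r in right if lkey(l) == rkey(r)]

result = inner_join(customers, orders, lkey=lambda c: c[0], rkey=lambda o: o[2])
# Customer 2 matches two orders, customers 3 and 4 one each; 1, 5 and 6 drop out.
print(len(result))  # 4
```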

3. Outer Join

* An outer join joins two or more relations on common field values, keeping unmatched rows from one or both sides.

* There are three kinds of outer join:

  • Left outer join
  • Right outer join
  • Full outer join
Left Outer Join

* A left outer join returns all rows from the left relation, even if there are no matches in the right relation.

Step 1: In this step we perform a left outer join on the two relations customers and orders.

grunt> result = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Step 2: In this step we verify the result using the DUMP operator.

grunt> DUMP result;
Right outer join

* A right outer join returns all rows from the right relation, even if there are no matches in the left relation.

Step 1: In this step we perform a right outer join on the two relations customers and orders.

grunt> result = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

Step 2: In this step we verify the result using the DUMP operator.

grunt> DUMP result;
Full outer join

* A full outer join returns a row when there is a match in either relation; unmatched rows from both sides are kept.

Step 1: In this step we perform a full outer join on the two relations customers and orders.

grunt> result = JOIN customers BY id FULL OUTER, orders BY customer_id;

Step 2: In this step we verify the result using the DUMP operator.

grunt> DUMP result;
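The three outer joins differ only in which unmatched rows survive, padded with nulls on the missing side. A hedged Python sketch of all three over the sample customer/order rows (illustration only; Pig represents the padding as null fields):

```python
# Sample rows: customer.txt is (id, name, age, city), order.txt is (oid, date, customer_id)
customers = [
    (1, "Ram", 30, "Ahmedabad"),
    (2, "Sam", 35, "Delhi"),
    (3, "kushi", 22, "Bangalore"),
    (4, "Chaitra", 35, "Mysore"),
    (5, "Dhoni", 37, "Bhopal"),
    (6, "Kamal", 32, "Nagpur"),
]
orders = [
    (102, "2009-10-08", 2),
    (100, "2009-10-08", 2),
    (101, "2009-11-20", 3),
    (103, "2008-05-20", 4),
]

def outer_join(left, right, lkey, rkey, how):
    """Mimic LEFT/RIGHT/FULL OUTER JOIN, padding misses with None."""
    lwidth, rwidth = len(left[0]), len(right[0])
    lkeys = {lkey(l) for l in left}
    rkeys = {rkey(r) for r in right}
    rows = [l + r for l in left for r in right if lkey(l) == rkey(r)]
    if how in ("left", "full"):   # keep unmatched left rows, null-padded on the right
        rows += [l + (None,) * rwidth for l in left if lkey(l) not in rkeys]
    if how in ("right", "full"):  # keep unmatched right rows, null-padded on the left
        rows += [(None,) * lwidth + r for r in right if rkey(r) not in lkeys]
    return rows

lk, rk = lambda c: c[0], lambda o: o[2]
print(len(outer_join(customers, orders, lk, rk, "left")))   # 4 matches + 3 unmatched customers = 7
print(len(outer_join(customers, orders, lk, rk, "right")))  # every order matches, so 4
print(len(outer_join(customers, orders, lk, rk, "full")))   # 7: no unmatched orders to add
```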

Multiple Keys joining operation

In this case we perform the JOIN operation using multiple keys.

Example

In this example we take two files, employee.txt and employee_details.txt, in the /pigdata/ directory of HDFS as shown below.

employee.txt

100,Roshan,23,HR,12

101,Roy,27,CS,13

102,Shruthi,31,IT,14

103,Disha,28,EC,15

104,Gowri,30,HR,16


employee_details.txt

100,Roshan@gmail.com,23,9945123145,12

101,Roy@yahoo.com,27,9916797944,13

102,Shruthi@gmail.com,31,9945112153,14

103,Disha@gmail.com,28,9916606055,15

104,Gowri@gmail.com,30,9845111231,16

Step 1: In this step we load the two files into Pig using the LOAD operator.

grunt> employee = LOAD 'hdfs://localhost:9000/pigdata/employee.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, dept:chararray, jobid:int);

grunt> employee_details = LOAD 'hdfs://localhost:9000/pigdata/employee_details.txt' USING PigStorage(',') as (id:int, email:chararray, age:int, phone:chararray, jobid:int);

Step 2: In this step we join the contents of the two relations on the composite key (id, jobid) using the JOIN operator.

grunt> result = JOIN employee BY (id,jobid), employee_details BY (id,jobid);

Step 3: In this step we verify the result using the DUMP operator.

grunt> DUMP result;
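Joining on multiple keys simply means the key each side is compared on is a tuple of fields rather than a single field. A hedged Python sketch using two of the sample rows (illustration only):

```python
# Sample rows: employee.txt is (id, name, age, dept, jobid),
# employee_details.txt is (id, email, age, phone, jobid)
employee = [
    (100, "Roshan", 23, "HR", 12),
    (101, "Roy", 27, "CS", 13),
]
employee_details = [
    (100, "Roshan@gmail.com", 23, "9945123145", 12),
    (101, "Roy@yahoo.com", 27, "9916797944", 13),
]

# JOIN ... BY (id, jobid): both fields of the composite key must match.
result = [
    e + d
    for e in employee
    for d in employee_details
    if (e[0], e[4]) == (d[0], d[4])
]
print(len(result))  # 2: each employee matches exactly one details row
```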

Crossing Operation in Apache Pig

In Apache Pig we can compute the cross product of two or more relations using the CROSS operator.

Example

In this example, let us reuse the files "customer.txt" and "order.txt" above to compute the cross product of the two relations.

Step 1: In this step we load the two files into Pig using the LOAD operator.

grunt> customers = LOAD 'hdfs://localhost:9000/pigdata/customer.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

grunt> orders = LOAD 'hdfs://localhost:9000/pigdata/order.txt' USING PigStorage(',')

as (oid:int, date:chararray, customer_id:int);

Step 2: In this step we compute the cross product of the two relations using the CROSS operator.

grunt> crossresult = CROSS customers, orders;

Step 3: In this step we verify crossresult using the DUMP operator.

grunt> DUMP crossresult;
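CROSS pairs every tuple of one relation with every tuple of the other, so the output size is the product of the input sizes; it can therefore blow up quickly on large inputs. A hedged Python sketch of the semantics on a trimmed sample (illustration only):

```python
import itertools

# Two customers and two orders from the sample files, trimmed for brevity.
customers = [(1, "Ram", 30, "Ahmedabad"), (2, "Sam", 35, "Delhi")]
orders = [(102, "2009-10-08", 2), (100, "2009-10-08", 2)]

# CROSS concatenates every (customer, order) pair, regardless of any key.
crossresult = [c + o for c, o in itertools.product(customers, orders)]
print(len(crossresult))  # 2 * 2 = 4
```

With the full sample files (6 customers, 4 orders) the cross product would hold 24 tuples.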

"That's all about grouping and joining in Apache Pig."