Apache Pig Architecture

Let us study about the Apache Pig Architecture,

* Pig Latin is a language used in Apache pig to analyze data in Hadoop.

* It is a high level data processing language which provides a rich set of data types and operators to perform various operations on the data.

* Programmers write a Pig script using the Pig Latin language to perform any task, and execute them using any of the execution mechanisms.

* After execution pig Latin scripts it will go through a series of transformations applied by the Pig Framework, to produce the desired output.

* It makes the programmer’s job easy by converting the scripts internally into a series of MapReduce jobs. The architecture of Apache Pig is shown below.

* Apache pig Architecture has the following components like,

  1. Parser
  2. Optimizer
  3. Compiler
  4. Execution engine

Let us study the major components of Apache pig.

Parser

* Initially Pig Latin Scripts are passed to the Parser, it does type checking and syntax checking of the script.

* The output of the parser will be a DAG (directed acyclic graph), it represents the statements and logical operators of Pig Latin.

Note:  In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer

* The DAG is passing to the optimizer. It performs the optimization activities like split, merge, transform, and reorder operators etc.

* It provides automatic optimization feature to Apache Pig.

Compiler

* The compiler compiles the optimized logical plan into a series of MapReduce jobs.

* It converts Pig jobs automatically into MapReduce jobs.

Execution engine

* Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Then these MapReduce jobs are executed on Hadoop and it will produce the desired results.

Pig Latin Data Model

Pig Latin Data Model is fully nested type and supports complex non-atomic data types like map and tuple. The Pig Latin’s data model is shown below.

Atom

* An Atom is a single value in Pig Latin, with any data type is known as Atom.

* It is stored in the form of string and it can be used as string and number.

* It supports different atomic values of Pig like int, long, float, double, chararray, and byte array.

Note: A piece of data or a simple atomic value is known as a field.

Example: ‘beyondcorner’ or ‘17’

Tuple

* A record which contains an ordered set of fields is known as tuple, the fields can be of any type.

* Tuple is same as the row in a table of RDBMS.

Example:  (beyondcorner, 17)

Bag

* It is a collection of unordered set of tuples is knows as Bag.

* It has a flexible schema.

   i.e each tuple can have any number of fields.

* A bag is represented by symbol ‘{}’, it is similar to a table in RDBMS.

Example: {(Rose, 30), (Roy, 45)}

Map

* A map is a set of key-value pairs. The key must be of type chararray and it should be unique. The value might be of any type.

* It is represented by symbol ‘[]’.

Example: [name#sita, age#20]

Relation

* A relation is a bag of tuples.

* In Pig Latin tuples are unordered in nature.

i.e Tuples are not  processed in particular order.

“That’s all about the Apache Pig – Architecture”