Apache Pig Introduction

What is Apache pig?

Apache Pig is an open source platform, built on the top of Hadoop to analyzing large data sets.

Objective: Apache pig is designed to handle any kind of data like structured, semi-structured and unstructured data.

Introduction

* It is a high-level data flow language called as pig Latin.

* Pig Latin is similar to SQL language, used to execute queries on huge datasets that are stored in HDFS.

* Pig was developed by Yahoo to help Hadoop programmers to analyze large unstructured data.

* It mainly helps to reduce the complexity of writing code in java.

i.e 10 lines of code in Pig equal 200 lines of code in Java.

* Pig operates on Hadoop platform to perform reading, writing and processing data (in detail explained in hands-on article).

 

Why Apache pig is used?

The Apache pig is used for the following reasons like,

* The main reason of using Apache pig is it can handle any kind of data like structured, semi-structured and unstructured data.

* It is simple query language like SQL (structured query language), easily we can learn and write the query to perform the task.

* Apache pig reduces the LOC (line of code) compare to the java programming, so it’s a boon for programmers.

* It reduces the complexity of map reduce task using pigLatin script.

i.e Previously map reduce task was done using java program, it was lengthy program and difficult for coding and teasting. In case of pigLatin, it is simple for coding and debugging.

* Apache Pig provides rich built-in operators to support data operations like joins, filters, ordering, etc.

* It also provides nested data types like tuples, bags, and maps (in detail explained in pig latin basics article).

Reference                                                                                   

https://cwiki.apache.org/confluence/display/PIG/Index

“That’s all about the Apache Pig introduction. I hope this article gives brief overview of Apache pig”