Pig Latin Basics

This section covers the basics of Pig Latin: data types, operators, built-in functions, and user-defined functions.

Data types

The following table describes the simple data types of Pig Latin.

Data type | Description | Example
int | A 32-bit signed integer. | 10
long | A 64-bit signed integer. | 10L
float | A 32-bit floating-point number. | 10.5F or 10.5f
double | A 64-bit floating-point number. | 10.5
chararray | A character array (string) in Unicode UTF-8 format. | 'beyond corner'
bytearray | A byte array (blob). |
boolean | A Boolean value. | true / false

Complex Data Types

Data type | Description | Example
Tuple | An ordered set of fields. | (rosy, 20)
Bag | A collection of tuples. | {(rosy,20),(Mohan,25)}
Map | A set of key-value pairs. | ['name'#'rosy', 'age'#20]

Null Values

* The values of all the above data types can be NULL.

* Apache Pig treats null values much as SQL does.

* A null can represent an unknown value; it is used as a placeholder for optional values.

* These nulls can occur naturally or can be the result of an operation.
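
As in SQL, an arithmetic expression with a null operand evaluates to null. The plain-Java sketch below mimics that rule with nullable Integer values. It is only an analogy (Pig evaluates such expressions inside its own runtime), and the class and method names are hypothetical.

```java
final class NullSketch {
    // Mimics SQL-style null propagation: if either operand is null, so is the result.
    static Integer add(Integer x, Integer y) {
        if (x == null || y == null) {
            return null;
        }
        return x + y;
    }

    public static void main(String[] args) {
        System.out.println(add(10, 30));   // prints 40
        System.out.println(add(10, null)); // prints null
    }
}
```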

Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Let’s take a = 10 and b = 30.

Operator | Description | Example
+ | Addition: adds the values on either side of the operator. | 10 + 30 = 40
- | Subtraction: subtracts the right-hand operand from the left-hand operand. | 10 - 30 = -20
* | Multiplication: multiplies the values on either side of the operator. | 10 * 30 = 300
/ | Division: divides the left-hand operand by the right-hand operand. | 30 / 10 = 3
% | Modulus: divides the left-hand operand by the right-hand operand and returns the remainder. | 30 % 10 = 0
? : | Bincond: evaluates a Boolean expression and returns one of two values. It has three operands: variable x = (expression) ? value1 if true : value2 if false. | b = (a == 1) ? 20 : 30; if a == 1, the value of b is 20; if a != 1, the value of b is 30.
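
The bincond operator reads like the conditional (ternary) operator in Java. As a rough illustration, the plain-Java sketch below mirrors the bincond example above. The class and method names are made up for this sketch, and it does not model how Pig treats a null condition.

```java
final class BincondSketch {
    // Mirrors the Pig Latin expression: b = (a == 1) ? 20 : 30;
    static int pick(int a) {
        return (a == 1) ? 20 : 30;
    }

    public static void main(String[] args) {
        System.out.println(pick(1));  // prints 20
        System.out.println(pick(10)); // prints 30
    }
}
```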

Comparison Operators

The following table describes the comparison operators of Pig Latin. The examples again use a = 10 and b = 30.

Operator | Description | Example
== | Equal: checks whether the values of the two operands are equal; if so, the condition is true. | (a == b) is not true.
!= | Not equal: checks whether the values of the two operands are equal; if they are not, the condition is true. | (a != b) is true.
> | Greater than: checks whether the left operand is greater than the right operand; if so, the condition is true. | (a > b) is not true.
< | Less than: checks whether the left operand is less than the right operand; if so, the condition is true. | (a < b) is true.
>= | Greater than or equal to: checks whether the left operand is greater than or equal to the right operand; if so, the condition is true. | (a >= b) is not true.
<= | Less than or equal to: checks whether the left operand is less than or equal to the right operand; if so, the condition is true. | (a <= b) is true.
matches | Pattern matching: checks whether the string on the left-hand side matches the regular expression constant on the right-hand side. | X = FILTER A BY (f1 matches '.*beyond.*');
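
The matches operator takes a regular expression in Java format, and the pattern has to match the whole string rather than a part of it, which is why the example wraps the word in .* on both sides. The plain-Java sketch below shows that behavior with java.util.regex directly, without Pig; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class MatchesSketch {
    // True only when the regex matches the entire value, not just a substring.
    static boolean fullMatch(String value, String regex) {
        return Pattern.matches(regex, value);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("far beyond corner", ".*beyond.*")); // prints true
        System.out.println(fullMatch("far beyond corner", "beyond"));     // prints false
    }
}
```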

Type Construction Operators

The following table describes the Type construction operators of Pig Latin.

Operator | Description | Example
() | Tuple constructor operator: used to construct a tuple. | (Rosy, 20)
{} | Bag constructor operator: used to construct a bag. | {(Rosy,20),(Moni, 35)}
[] | Map constructor operator: used to construct a map. | [name#Rosy, age#20]

Relational Operations

The following table describes the relational operators of Pig Latin.

Operator | Description

Loading and Storing
LOAD | Loads data from the file system (local/HDFS) into a relation.
STORE | Saves a relation to the file system (local/HDFS).

Filtering
FILTER | Removes unwanted rows from a relation.
DISTINCT | Removes duplicate rows from a relation.
FOREACH, GENERATE | Generates data transformations based on columns of data.
STREAM | Transforms a relation using an external program.

Grouping and Joining
JOIN | Joins two or more relations.
COGROUP | Groups the data in two or more relations.
GROUP | Groups the data in a single relation.
CROSS | Creates the cross product of two or more relations.

Sorting
ORDER | Arranges a relation in sorted order based on one or more fields (ascending or descending).
LIMIT | Gets a limited number of tuples from a relation.

Combining and Splitting
UNION | Combines two or more relations into a single relation.
SPLIT | Splits a single relation into two or more relations.

Diagnostic Operators
DUMP | Prints the contents of a relation on the console.
DESCRIBE | Describes the schema of a relation.
EXPLAIN | Shows the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE | Shows the step-by-step execution of a series of statements.

Built-In Functions

Pig provides various built-in functions, such as eval, load, store, math, string, bag, and tuple functions.

Eval Functions

Given below is the list of eval functions provided by Apache Pig.

Function | Description
AVG() | Computes the average of the numeric values in a bag.
BagToString() | Concatenates the elements of a bag into a string.
CONCAT() | Concatenates two or more expressions of the same type.
COUNT() | Returns the number of elements (tuples) in a bag.
COUNT_STAR() | Similar to COUNT(): returns the number of elements in a bag, but also counts null values.
DIFF() | Compares two bags (fields) in a tuple.
IsEmpty() | Checks whether a bag or map is empty.
MAX() | Returns the highest value (numeric or chararray) of a column in a single-column bag.
MIN() | Returns the lowest value (numeric or chararray) of a column in a single-column bag.
PluckTuple() | Takes a string prefix and filters the columns of a relation, keeping those whose names begin with that prefix.
SIZE() | Computes the number of elements based on any Pig data type.
SUBTRACT() | Subtracts two bags: takes two bags as input and returns a bag containing the tuples of the first bag that are not in the second bag.
SUM() | Returns the total of the numeric values of a column in a single-column bag.
TOKENIZE() | Splits a string (a group of words) in a single tuple and returns a bag containing the output of the split operation.

Load and Store functions

* Load and store functions determine how data goes into and comes out of Pig.

* These functions are used with the LOAD and STORE operators.

Given below is the list of load and store functions available in Pig.

Function | Description
PigStorage | Loads and stores structured files.
TextLoader | Loads unstructured data into Pig.
BinStorage | Loads and stores data in a machine-readable format.
Handling Compression | Pig Latin can load and store compressed data.

Bag and Tuple functions

Given below is the list of Bag and Tuple functions.

Function | Description
TOBAG | Converts two or more expressions into a bag.
TOP | Returns the top N tuples of a relation.
TOTUPLE | Converts one or more expressions into a tuple.
TOMAP | Converts key-value pairs into a map.

String functions

We have the following String functions in Apache Pig.

Function | Description
INDEXOF | Returns the index of the first occurrence of a character in a string, searching forward from a start index.
LAST_INDEX_OF | Returns the index of the last occurrence of a character in a string, searching backward from a start index.
LCFIRST | Converts the first character in a string to lower case.
LOWER | Converts all characters in a string to lower case.
REGEX_EXTRACT | Performs regular expression matching and extracts the matched group defined by an index parameter.
REGEX_EXTRACT_ALL | Performs regular expression matching and extracts all matched groups.
REPLACE | Replaces existing characters in a string with new characters.
STRSPLIT | Splits a string around matches of a given regular expression.
SUBSTRING | Returns a substring from a given string.
TRIM | Returns a copy of a string with leading and trailing white space removed.
UCFIRST | Returns a string with the first character converted to upper case.
UPPER | Returns a string converted to upper case.

Date and Time functions

Apache Pig provides the following Date and Time functions.

Function | Description
AddDuration | Returns the result of adding a Duration object to a DateTime object.
CurrentTime | Returns the DateTime object of the current time.
DaysBetween | Returns the number of days between two DateTime objects.
GetDay | Returns the day of a month from a DateTime object.
GetHour | Returns the hour of a day from a DateTime object.
GetMilliSecond | Returns the millisecond of a second from a DateTime object.
GetMinute | Returns the minute of an hour from a DateTime object.
GetMonth | Returns the month of a year from a DateTime object.
GetSecond | Returns the second of a minute from a DateTime object.
GetWeek | Returns the week of a week year from a DateTime object.
GetWeekYear | Returns the week year from a DateTime object.
GetYear | Returns the year from a DateTime object.
HoursBetween | Returns the number of hours between two DateTime objects.
MilliSecondsBetween | Returns the number of milliseconds between two DateTime objects.
MinutesBetween | Returns the number of minutes between two DateTime objects.
MonthsBetween | Returns the number of months between two DateTime objects.
SecondsBetween | Returns the number of seconds between two DateTime objects.
SubtractDuration | Subtracts a Duration object from a DateTime object and returns the result.
ToDate | Returns a DateTime object according to parameters.
WeeksBetween | Returns the number of weeks between two DateTime objects.
YearsBetween | Returns the number of years between two DateTime objects.

User Defined Functions

Pig supports User Defined Functions (UDFs). With UDFs, we can define our own functions and use them in Pig Latin scripts.

* Pig UDFs can be written in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.

* Java functions have the most extensive support; Python, JavaScript, Ruby, and Groovy functions have limited support.

Note: The repository of Java UDFs for Apache Pig is called Piggybank.

Types of UDFs in Java

Java supports three types of UDFs:

Filter Function

* Filter functions are used as conditions in FILTER statements. They accept a Pig value as input and return a Boolean value.

Eval Function

* Eval functions are used in FOREACH-GENERATE statements. They accept a Pig value as input and return a Pig result.

Algebraic Function

* Algebraic functions act on inner bags in a FOREACH-GENERATE statement. They are used to perform full MapReduce operations on an inner bag.

Note: All UDFs must extend org.apache.pig.EvalFunc, and all functions must override the exec method.

Writing UDFs using Java

This example writes a simple eval function that converts a string to upper case. Compile the code below and package it as myudfs.jar.

Example

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval function that converts its first argument to upper case.
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for a missing or empty input tuple.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
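
One detail of the example above is worth noting: if the first field of the tuple is null, str.toUpperCase() throws a NullPointerException, which the catch block rethrows as an IOException. The sketch below isolates the conversion step with an explicit null check. It is plain Java without the Pig classes, so it can be compiled and tested on its own; the class and method names (UpperLogic, toUpper) are hypothetical and not part of Pig.

```java
final class UpperLogic {
    // Hypothetical helper: returns null for a null field instead of throwing.
    static String toUpper(Object field) {
        if (field == null) {
            return null;
        }
        return field.toString().toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(toUpper("rosy")); // prints ROSY
        System.out.println(toUpper(null));   // prints null
    }
}
```

Under this assumption, the exec method could pass input.get(0) to such a helper and return its result.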

Now write the script in a file and save it as newscript.pig.

/* newscript.pig */
REGISTER myudfs.jar;
A = LOAD 'employee' AS (name: chararray, age: int);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Run newscript.pig from the Grunt shell to get the output.

grunt> run newscript.pig

“That’s all about the Pig Latin Basics”