UNIT-5
PIG: Hadoop
Programming made
easier
Syllabus
Pig: Hadoop Programming Made Easier
Admiring the Pig Architecture,
going with the Pig Latin Application Flow
working through the ABCs of Pig Latin
Evaluating Local and Distributed Modes of
Running Pig Scripts
Checking out the Pig Script Interfaces
Scripting with Pig Latin
Objectives
To admire the Pig Architecture,
To go with the Pig Latin Application Flow,
To work through the ABCs of Pig Latin,
To evaluate Local and Distributed Modes of Running Pig
Scripts
To check out the Pig Script Interfaces,
To script with Pig Latin
Outcomes
At the end of the course the student will be able to
Admire the Pig Architecture
Go with the Pig Latin Application Flow
Work through the ABCs of Pig Latin
Evaluate Local and Distributed Modes of Running Pig
Scripts
Check out the Pig Script Interfaces
Script with Pig Latin
Hadoop Ecosystem
What is PIG?
Pig is a scripting platform or tool that runs on Hadoop clusters, designed to process and analyze large datasets.
(or)
Pig is a high-level platform for creating MapReduce programs used with Hadoop.
It enables users to write complex data transformations without knowing Java.
It provides a high level of abstraction for processing data over MapReduce.
Contd..
• Pig was initially developed at Yahoo! to allow
people using Apache Hadoop to focus more on
analyzing large data sets and spend less time
having to write mapper and reducer programs.
• Like actual pigs, who eat almost anything, the Pig
programming language is designed to handle any
kind of data—That’s why the name, Pig!
Pig Components
Two main components of the Apache Pig tool are:
1. Pig Latin – a language
2. Pig Engine – a runtime environment
It provides a high-level scripting language,
known as Pig Latin which is used to develop the
data analysis codes.
To process the data which is stored in the HDFS,
the programmers will write the scripts using the
Pig Latin Language.
This Pig Latin language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
Contd..
A Pig Latin program consists of a series of operations
or transformations which are applied to the input data to
produce output.
These operations describe a data flow which is translated into an executable representation by the Pig execution environment.
Underneath, the results of these transformations are a series of MapReduce jobs which the programmer is unaware of. So, in a way, Pig allows the programmer to focus on the data rather than the nature of execution.
Internally, the Pig Engine (a component of Apache Pig) accepts the Pig Latin scripts and converts them into a series of Map and Reduce tasks.
Contd..
Pig operates on various types of data: structured, semi-structured, and unstructured.
The results of Pig are always stored in HDFS.
Need of Pig
• While performing any MapReduce task, programmers who are not so good at Java normally struggle to work with Hadoop. Thus, we can say Pig is a boon for all such programmers.
• Without having to type complex code in Java, programmers can perform MapReduce tasks easily using Pig Latin.
One limitation of MapReduce is that the development cycle is very long: writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task.
Contd..
Apache Pig reduces the development time by using the multi-query approach.
It also helps reduce the length of code.
Let's understand it with an example: an operation that would require us to type 200 lines of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig. Hence, Pig reduces the development time by almost 16 times.
Programmers who have SQL knowledge need less effort to learn Pig, because Pig Latin is an SQL-like language.
It offers many built-in operators to support data operations such as joins, filters, ordering, and many more.
Evolution of Pig:
Earlier in 2006, Apache Pig was developed by
Yahoo’s researchers.
At that time, the main idea to develop
Pig was to execute the MapReduce jobs on
extremely large datasets.
In the year 2007, it moved to Apache Software
Foundation(ASF) which makes it an open source
project.
The first version(0.1) of Pig came in the year 2008.
The latest version of Apache Pig is 0.17, which came in the year 2017.
Features of Apache Pig:
Pig is an open-source project developed by Yahoo, having the following features:
Rich set of operators: For performing several operations, Apache Pig provides a rich set of operators like filter, join, sort, etc.
Handles Heterogeneity of Data: Pig can handle
the analysis of both structured and unstructured
data
Ease of Programming: Pig Latin is similar to SQL
and it is easy to write a Pig script if you are good at
SQL. Especially for SQL-programmer, Apache Pig is
a boon.
Create User-defined Functions: Apache Pig is
extensible so that you can make your own user-
defined functions and process.
Features of Apache Pig: (Contd.)
Short Development time as the code is
simpler
Extensibility − Using the existing operators,
users can develop their own functions to read,
process, and write data
Multi Query approach- Apache Pig uses multi-
query approach. Basically, this reduces the length
of the codes to a great extent.
No need for compilation: We do not require any compilation, since every Apache Pig operator is converted internally into a MapReduce job on execution.
Features of Apache Pig: (Contd.)
Optimization opportunities − The tasks in
Apache Pig optimize their execution
automatically, so the programmers need to focus
only on semantics of the language.
Optional Schema − The schema is optional in Apache Pig, so we can store data without designing a schema. In that case, fields are addressed by position as $0, $1, and so on.
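A small illustrative sketch of this (the file name and delimiter are assumed, not from the slides):
A = LOAD 'data.txt' USING PigStorage(',');  -- no schema specified at load time
B = FOREACH A GENERATE $0, $1;  -- address the first two fields by position
DUMP B;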
Difference between Pig and MapReduce
Difference between SQL & PIG
Admiring Pig Architecture
The language used to analyze data in Hadoop
using Pig is known as Pig Latin.
It is a high level data processing language which
provides a rich set of data types and operators to
perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
After execution, these scripts will go through a
series of transformations applied by the Pig
Framework, to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy.
The architecture of Apache Pig is shown
below:
Apache Pig Components
As shown in the figure, there are various components
in the Apache Pig framework. Let us take a look at
the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It
checks the syntax of the script, does type checking,
and other miscellaneous checks.
The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are
represented as the nodes and the data flows are
represented as edges
Optimizer
After the output from the parser is retrieved, a
logical plan for DAG is passed to a logical
optimizer. The optimizer is responsible for
carrying out the logical optimizations such as
projection and pushdown.
Compiler
The compiler compiles the logical plan sent by
the optimizer. The compiler compiles the
optimized logical plan into a series of MapReduce
jobs.
Execution engine
After the logical plan is converted to MapReduce jobs, these jobs are submitted to Hadoop in a sorted order and executed on Hadoop to produce the desired results.
Grunt shell
• The Grunt shell is Pig's command shell.
• The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. A Pig script can be executed in the Grunt shell, which is the native shell provided by Apache Pig to execute Pig queries.
• We can invoke shell commands using sh and fs.
Shell Commands:
sh Command
• We can invoke any shell command from the Grunt shell, using the sh command. But make sure, we cannot execute the commands that are a part of the shell environment (e.g., cd), using sh.
Cont..
Syntax
The syntax of the sh command is:
grunt> sh shell command parameters
fs Command
• Moreover, we can invoke any FsShell command from the Grunt shell by using the fs command.
• The syntax of the fs command is:
grunt> fs File System command parameters
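For illustration, two hedged invocations (the HDFS directory /pig_data is assumed to exist):
grunt> sh ls
(runs the local shell's ls command)
grunt> fs -ls /pig_data
(runs the HDFS shell's ls on the assumed /pig_data directory)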
Cont..
File System Commands:
Command Description
cat prints the contents of one or more files
cd changes the current directory
copyFromLocal copies a local file to a Hadoop file system
copyToLocal copies a file or directory from HDFS to local file
system
cp copies a file or a directory to another directory
fs Accesses Hadoop's file system shell
ls Lists files
mkdir Creates a new directory
mv Moves a file/directory to another directory
pwd prints the path of the current working directory
rm Deletes a file or a directory
Utility Commands:
• clear : clear the screen
– grunt> clear
• help : Provides help about the commands.
• history : Displays a list of statements executed/used so far since the Grunt shell was invoked.
• set : Used to show/assign values to keys used in Pig.
• quit : You can quit from the Grunt shell.
• exec/run: Can execute Pig scripts
– grunt> exec [–param param_name =
param_value] [–param_file file_name] script
• kill : kill a job from the Grunt shell , grunt> kill JobId
Pig Latin Data Model
• Pig’s data types make up the data model for how
Pig thinks of the structure of the data it is
processing.
• With Pig, the data model gets defined when the
data is loaded. Any data you load into Pig from
disk is going to have a particular schema and
structure.
• Pig needs to understand that structure, so when
you do the loading, the data automatically goes
through a mapping.
• The Pig data model is rich enough to handle most
anything thrown its way, including table- like
structures and nested hierarchical data
structures.
Pig Latin Data Model (Cont..)
It consists of 4 types of data models, as follows.
The data model of Pig Latin is fully nested and it allows complex (non-atomic) data types such as Atom, Tuple, Bag, and Map.
• Atom:
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. Pig's atomic values are scalar types that appear in most programming languages: int, long, float, double, chararray, and bytearray are the atomic values of Pig.
• A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple:
A tuple is a record that consists of a sequence of fields, where the fields can be of any type. It is similar to a row in an RDBMS table.
E.g:- (raju, 30)
Bag
A bag is an unordered set of tuples. In other words, a
collection of tuples (non-unique) is known as a bag. Each
tuple can have any number of fields (flexible schema). A
bag is represented by ‘{}’. It is similar to a table in RDBMS,
but unlike a table in RDBMS, it is not necessary that every
tuple contain the same number of fields or that the fields in
the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known
as inner bag.
Example − (1, {(1,2,3)}), a tuple whose second field is an inner bag (see the GROUP example later in this unit).
Map
A map (or data map) is a set of key-value pairs. The key needs to
be of type chararray and should be unique. The value might be
of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
Pig Latin statements work with relations. A relation is the outermost structure of the data model, and it is defined as a bag of tuples.
The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
• A relation is a bag
• A bag is a collection of tuples
• A tuple is an ordered set of fields
Working through the ABCs of Pig Latin
• Pig Latin is the language for Pig programs.
• Pig translates the Pig Latin script into
MapReduce jobs that can be executed
within Hadoop cluster.
• Pig Latin development team followed
three key design principles:
Keep it simple
Make it smart
Don't limit development
Keep it Simple
• Pig Latin is an abstraction for MapReduce that
simplifies the creation of parallel programs on the
Hadoop cluster for data flows and analysis.
• Complex tasks may require a series of
interrelated data transformations — such series
are encoded as data flow sequences.
• Writing Pig Latin scripts instead of Java
MapReduce programs makes these programs
easier to write, understand, and maintain
because
a) you don't have to write the job in Java,
b) you don't have to think in terms of MapReduce, and
c) you don't need to come up with custom code to support rich data types.
Make it smart
• The Pig Latin compiler transforms a Pig Latin program into a series of Java MapReduce jobs.
• The compiler can optimize the execution of these
Java MapReduce jobs automatically, allowing the
user to focus on semantics rather than on how to
optimize and access the data.
• For example, SQL is set up as a declarative query
that you use to access structured data stored in
an RDBMS. The RDBMS engine first translates the
query to a data access method and then looks at
the statistics and generates a series of data
access approaches. The cost-based optimizer
chooses the most efficient approach for execution
Don’t limit development
• Make Pig extensible so that developers can add
functions to address their particular business
problems.
• Traditional RDBMS data warehouses make use of
the ETL data processing pattern, where you extract
data from outside sources, transform it to fit your
operational needs, and then load it into the end
target, whether it’s an operational data store, a
data warehouse, or another variant of database.
• With big data, the language for Pig data flows goes
with ELT instead: Extract the data from your
various sources, load it into HDFS, and then
transform it as necessary to prepare the data for
further analysis
Going with the Pig Latin Application
Flow
Pig Latin is a dataflow language, where you define
a data stream and a series of transformations
that are applied to the data as it flows through
your application
This is in contrast to a control flow language (like
C or Java), where you write a series of
instructions.
In control flow languages, we use constructs like
loops and conditional logic (like an if statement).
You won't find loops and if statements in Pig Latin, which is part of what makes working with Pig significantly easier.
Working of Pig
The basic flow of a Pig program is:
Load: First, load (LOAD) the data we want to manipulate; that data is stored in HDFS or the local file system.
For a Pig program to access the data, you first tell Pig what file
or files to use.
For that task, you use the LOAD 'data_file' command.
Here, 'data_file' can specify either an HDFS file or a directory.
If a directory is specified, all files in that directory are loaded
into the program
If the data is stored in a file format that isn't natively accessible to Pig, you can optionally add the USING clause to the LOAD statement to specify a user-defined function that can read and interpret the data.
Transform: We run the data through a set of transformations
that are translated into a set of Map and Reduce tasks.
The transformation logic is where all the data manipulation
happens. Here, you can FILTER out rows that aren’t of
interest, JOIN two sets of data files, GROUP data to build
aggregations, ORDER results, and do much, much more
Dump: Finally, you dump (DUMP) the results to the screen or
Store (STORE) the results in a file somewhere
You would typically use the DUMP command to send the output to
the screen when you debug your programs. When your program
goes into production, you simply change the DUMP call to a STORE
call so that any results from running your programs are stored in a
file for further processing or analysis
Pig Latin
Statements
•Pig Latin is a data flow language used by Apache Pig to
analyze the data in Hadoop.
•While processing data using Pig Latin, statements
are the basic constructs.
•A Pig Latin statement is an operator that takes a relation
as input and produces another relation as output.
•This definition applies to all Pig Latin operators except the LOAD and STORE commands, which read data from and write data to the file system.
•These statements work with relations. They include
expressions and schemas. Every statement ends with a
semicolon (;).
•We will perform various operations using operators provided
by Pig Latin, through statements.
Except LOAD and STORE, while performing all other
operations, Pig Latin statements take a relation as input
and produce another relation as output.
Pig Latin statements are generally organized in the following
manner:
• A LOAD statement reads data from the file system
• A series of Transformations statements process
the data
• A STORE statement writes output to the file system
OR
• A DUMP statement displays output to the screen
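As a minimal sketch of this organization (the file name, schema, and filter condition are illustrative):
student = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
chennai = FILTER student BY city == 'chennai';
DUMP chennai;  -- display on screen while debugging
STORE chennai INTO 'chennai_out';  -- or write the result to the file system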
Preparing Data (student_data.txt)
Pig Latin statement to load data to
Apache Pig
In MapReduce mode, Pig reads (loads) data from
HDFS and stores the results back in HDFS.
Therefore, let us start HDFS and create the
following sample data in HDFS
grunt> Relation_name= LOAD 'student_data.txt'
USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray,
phone:chararray, city:chararray );
PigStorage() function:
It loads and stores data as structured text files. It takes as a parameter the delimiter using which each entity of a tuple is separated. By default, it takes '\t' as the delimiter.
Pig Data Types
• Pig data types define the data model of how Pig thinks of the structure of the data that it is processing.
• In Pig, the data model gets defined when the data is loaded, and it has a particular schema and structure.
• The Pig model is rich enough to handle most structures, like hierarchical data and table-like structures.
• Pig data types are broken into two categories:
1. Scalar types (simple types)
2. Complex types
• Scalar types contain a single value, whereas complex types contain other types, such as tuples, bags, and maps.
Scalar Types
int- Represents a signed 32-bit integer. Example : 8
long-Represents a signed 64-bit integer. Example : 5L
float-Represents a signed 32-bit floating point. Example : 5.5F
double-Represents a 64-bit floating point. Example : 10.5
chararray- Represents a character array (string) in Unicode UTF-8
format. Example : ‘apachepig’
bytearray- Represents a byte array (blob).
boolean- Represents a Boolean value. Example : true/false
datetime- Represents a date-time.
Example : 1970-01-01T00:00:00.000+00:00
biginteger- Represents a Java BigInteger. Example : 60708090709
bigdecimal- Represents a Java BigDecimal.
Example : 185.98376256272893883
Null Values
Values for all the above data types can be NULL.
Apache Pig treats null values in a similar way as
SQL does.
A null can be an unknown value or a non-existent
value.
It is used as a placeholder for optional values.
These nulls can occur naturally or can be the
result of an operation.
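A hedged one-line example of null handling, reusing the student relation from earlier:
clean = FILTER student BY phone IS NOT NULL;  -- drop rows whose optional phone field is null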
Complex Types
Tuple-
• A tuple is an ordered set of fields.
• A tuple has a fixed length; it is a collection data type that can contain multiple fields.
Example : (raja, 30)
Syntax:
( field [, field …] )
Terms:
( ) - A tuple is enclosed in parentheses ( ).
Field - A piece of data. A field can be of any data type (including tuple and bag).
Bag-
A bag is a collection of tuples which are unordered. Bag constants are constructed using braces, with the tuples in the bag separated by commas.
Syntax: Inner bag
{ tuple [, tuple …] }
Terms:
{ } - An inner bag is enclosed in curly brackets { }
tuple - A tuple
Example :
• {(raju,30),(Mohammad,45)}
Keys Points about Bag:
• A bag can have duplicate tuples.
• A bag can have tuples with differing numbers of fields. However, if
Pig tries to access a field that does not exist, a null value is
substituted.
• A bag can have tuples with fields that have different data types.
However, for Pig to effectively process bags, the schemas of the
tuples within those bags should be the same. For example, if half of
the tuples include chararray fields and the other half include
float fields, only half of the tuples will participate in any kind of
computation because the chararray fields will be converted to null.
• Bags have two forms: outer bag (or relation) and inner bag
Example: Outer Bag
In this example A is a relation or bag of tuples. You can think of this bag as an
outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
Example: Inner Bag
Now, suppose we group relation A by the first field to form relation X.
In this example X is a relation or bag of tuples. The tuples in relation X have
two fields. The first field is type int. The second field is type bag; you can think
of this bag as an inner bag.
X = GROUP A BY f1;
DUMP X;
Output:
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
Map-
• A Map is a set of key-value pairs.
• Keys within a map must be unique
Syntax (<> denotes optional)
• [ key#value <, key#value …> ]
Example : In this example the map includes two key value
pairs
[ ‘name’#’Raju’, ‘age’#30]
Pig Operators
(or)
Data Transformations in Pig
Pig Latin – Arithmetic Operators
+ Addition − Adds values on either side of the operator
− Subtraction − Subtracts right hand operand
from left hand operand
* Multiplication − Multiplies values on either side of the
operator
/ Division − Divides left hand operand by right hand
operand
% Modulus − Divides left hand operand by right hand
operand and returns remainder
? : Bincond − Evaluates the Boolean operators. It has three
operands as shown below. variable x = (expression) ?
value1 if true : value2 if false.
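A hedged sketch using a few of these operators (file and field names assumed):
A = LOAD 'nums.txt' USING PigStorage(',') AS (f1:int, f2:int);
B = FOREACH A GENERATE f1 + f2, f1 % 2, (f1 % 2 == 0 ? 'even' : 'odd');
DUMP B;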
Pig Latin – Comparison Operators
== Equal − Checks if the values of two operands are equal or
not; if yes, then the condition becomes true.
(a == b) is not true
!= Not Equal − Checks if the values of two operands are
equal or not. If the values are not equal, then condition
becomes true.
(a != b) is true.
> Greater than − Checks if the value of the left operand is
greater than the value of the right operand. If yes, then the
condition becomes true.
(a > b) is not true.
< Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true.
(a < b) is true.
>= Greater than or equal to − Checks if the value of the
left operand is greater than or equal to the value of the right
operand. If yes, then the condition becomes true.
Eg. (a >= b) is not true.
<= Less than or equal to − Checks if the value of the left
operand is less than or equal to the value of the right
operand. If yes, then the condition becomes true.
Eg. (a <= b) is true.
matches Pattern matching − Checks whether the string
in the left-
hand side matches with the constant in the right-hand side.
Eg. f1 matches '.*tutorial.*'
Pig Latin – Type Construction Operators
() Tuple constructor operator − This operator is
used to construct a tuple.
Eg.(Raju, 30)
{} Bag constructor operator − This operator is
used to construct a bag.
Eg. {(Raju, 30), (Mohammad, 45)}
[] Map constructor operator − This operator is used to construct a map.
Eg. [name#Raja, age#30]
Pig Latin – Relational Operations
Loading and Storing
LOAD- To Load the data from the file system (local/HDFS)
into a relation.
The load statement consists of two parts divided by the
“=” operator.
On the left-hand side, we need to mention the name of
the relation where we want to store the data, and on the
right-hand side, we have to define how we store the
data.
Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' [USING function] [AS schema];
Cont..
Component Description
Relation_name The relation in which we want to store the data.
Input file path Mention the HDFS directory where the file is stored
function A function from the set of load functions
provided by Apache Pig (BinStorage,
JsonLoader, PigStorage, TextLoader).
schema Define the schema of the data
We can define the required schema as follows:
(column1 : data type, column2 : data type,
column3 : data type);
Note: We can also load the data without specifying the schema. In that case, the fields will be addressed as $0, $1, etc.
Cont..
• grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
The PigStorage() function:
• It loads and stores data as structured text files. It takes a
delimiter using which each entity of a tuple is separated,
as a parameter. By default, it takes ‘\t’ as a parameter.
Pig Latin – Relational Operations (Cont.)
STORE- To save a relation to the file system
(local/HDFS).
Syntax:- STORE relation-name INTO 'file-path' [USING function];
Example:
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Load and Store functions:
Used to determine how the data goes in and comes out of Pig
Filtering Operators
FILTER
DISTINCT
FOREACH, GENERATE
STREAM
FILTER
It is used to select required rows from a relation based on a
condition
Syntax:- relation-name1 = FILTER relation-name2 BY condition;
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FILTER A BY city == 'chennai';
DUMP B;
DISTINCT
• To remove duplicate rows from a relation.
Syntax:-
relation-name1= DISTINCT relation-name2;
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = DISTINCT A;
DUMP B;
FOREACH, GENERATE
To generate data transformations based on columns
of data.
Syntax:
Relation_name2= FOREACH relation-name1
GENERATE (required data)
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FOREACH A GENERATE id, firstname, city;
DUMP B;
STREAM
• The stream operator allows transforming data in a relation
using an external program or script.
• This is possible because Hadoop MapReduce supports “streaming”.
Example:
C = STREAM A THROUGH 'cut -f 2';
which uses the Unix cut command to extract the second field of each tuple in A.
Grouping and Joining
Group:
The group operator is used to group the data in one relation. It collects the data having the same key.
Syntax:- relation-name2= group relation-name1 by key
Group all- This command is used to aggregate all tuples into a single group.
Syntax:- relation-name2 = GROUP relation-name1 ALL;
COGROUP -To group the data in two or more relations.
Syntax:- relation-name3 = COGROUP relation-name1 BY key, relation-name2 BY key, ...;
CROSS -To create the cross product of two or more relations.
Syntax:- relation-name1 = CROSS relation-name2, relation-name3, ...;
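A brief hedged sketch of GROUP and GROUP ALL, reusing the student relation loaded earlier:
by_city = GROUP student BY city;  -- one group per distinct city
everyone = GROUP student ALL;  -- a single group holding every tuple
total = FOREACH everyone GENERATE COUNT(student);
DUMP total;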
JOIN
It is used to combine records from two or more
relations. It is of following types:
1.Inner Join
2.Outer Join
Inner Join:
An inner join returns those rows of tables whose join
predicate is matched
Syntax:
relation-name3 = JOIN relation-name1 by column_name,
relation-name2 by column_name;
Outer Join:
It returns all the rows from at least one of the relations. It can
be carried out in three ways:
Left Outer Join
Right Outer Join
Full Outer Join
Left Outer Join:- returns all the rows from the
left table, even if there are no matches in the
right relation
Syntax:-
relation-name1 = JOIN relation-name2 By key
LEFT OUTER, relation-name3 By key;
Right Outer Join:- returns all the rows from the
right table, even if there are no matches in the left
relation
Syntax:-
relation-name1 = JOIN relation-name2 By key
RIGHT OUTER, relation-name3 By key;
FULL Outer Join:- returns all the rows from both relations, even when there is no match.
Syntax:-
relation-name1 = JOIN relation-name2 By key FULL OUTER, relation-name3 By key;
Sorting
ORDER BY -To arrange a relation in a sorted order
based on one or more fields (ascending or
descending).
Syntax: relation-name2 = ORDER relation-name1 By key (ASC/DESC);
LIMIT -To get a limited number of tuples from a relation.
Syntax:
relation-name2 = LIMIT relation-name1 required-number-of-tuples;
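An illustrative sketch (relation and field names reuse the student schema above):
sorted = ORDER student BY id DESC;
top3 = LIMIT sorted 3;  -- keep only the first three tuples
DUMP top3;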
Combining and Splitting
UNION -To combine two or more relations into a single
relation.
Syntax:
relation-name3 = UNION relation-name1, relation-name2;
SPLIT -To split a single relation into two or more relations.
Syntax:
SPLIT relation-name1 INTO relation-name2 IF (condition1), relation-name3 IF (condition2);
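A hedged sketch of SPLIT and UNION together (conditions assumed):
SPLIT student INTO chennai_students IF city == 'chennai', other_students IF city != 'chennai';
combined = UNION chennai_students, other_students;  -- recombine the two relations
DUMP combined;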
Diagnostic Operators
DUMP - The Dump operator is used to run the Pig Latin
statements and display the results on the screen. It is
generally used for debugging Purpose
Syntax:- grunt> DUMP relation-name;
Example: DUMP student;
Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS.
DESCRIBE – Used to view the schema of a relation or alias
Syntax:- DESCRIBE relation-name;
Example: grunt> DESCRIBE student;
Output: student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }
Diagnostic Operators (Cont..)
EXPLAIN -To view the logical, physical, or MapReduce
execution plans to compute a relation.
This is helpful to know how Pig compiles each command into MapReduce scripts.
Syntax:- EXPLAIN relation-name;
Example: EXPLAIN student;
Logical Plan – The logical plan contains the pipeline of the
operators it needs to be executed
Diagnostic Operators (Cont..)
• Physical Plan - It specifies how the logical operators are converted into backend-specific physical operators
• MapReduce Execution Plan - How the physical operators are grouped together to form the MapReduce jobs
ILLUSTRATE –
To view the step-by-step execution of a Pig script.
If we need to test a script with a small sample of data, we use it.
Syntax: ILLUSTRATE relation-name
Example : ILLUSTRATE student;
Boolean Operators
and- AND operation
or- OR operation
not- NOT operation
Early versions of Pig did not support a boolean data type for fields (boolean was added in later releases, as listed under the scalar types above). However, the result of a boolean expression (an expression that includes boolean and comparison operators) is always of type boolean (true or false).
Cast Operators
Cast operators enable you to cast or convert data from one type to another, as long as the conversion is supported. For example, suppose you have an integer field, myint, which you want to treat as a string: you can cast it with (chararray)myint, just as the example below casts the result of COUNT.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
O/P: (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
B = GROUP A BY f1;
DUMP B;
O/P: (1,{(1,2,3)}) (4,{(4,2,1),(4,3,3)}) (7,{(7,2,5)}) (8,{(8,3,4),(8,4,3)})
DESCRIBE B;
O/P: B: {group: int, A: {f1: int, f2: int, f3: int}}
X = FOREACH B GENERATE group, (chararray)COUNT(A) AS total;
DUMP X;
O/P: (1,1) (4,2) (7,1) (8,2)
Disambiguate Operator
Use the disambiguate operator ( :: ) to identify field names
after JOIN, COGROUP, CROSS, or FLATTEN operators.
In this example, to disambiguate y, use A::y or B::y. In cases
where there is no ambiguity, such as z, the :: is not necessary
but is still supported.
A = load 'data1' as (x, y);
B = load 'data2' as (x, y, z);
C = join A by x, B by x;
D = foreach C generate y; -- which y?
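To resolve the ambiguity explicitly, a minimal continuation of the example above:
D = foreach C generate A::y; -- the y that came from relation A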
Flatten Operator
The FLATTEN operator looks like a UDF syntactically, but it is
actually an operator that changes the structure of tuples and
bags in a way that a UDF cannot. Flatten un-nests tuples as
well as bags. The idea is the same, but the operation and
result is different for each type of structure.
For tuples, flatten substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).
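A hedged sketch of this tuple case (the file name and schema are assumed):
A = LOAD 'data' AS (a:chararray, t:tuple(b:chararray, c:chararray));
B = FOREACH A GENERATE $0, FLATTEN($1);  -- (a,(b,c)) becomes (a,b,c)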
grunt> cat empty.bag
grunt> A = LOAD 'empty.bag' AS (b : bag{}, i : int);
grunt> B = FOREACH A GENERATE flatten(b), i;
grunt> DUMP B;
grunt>
Built-in Functions
• AVG:- Computes the average of the numeric values in a single column of a bag
Syntax:- AVG(expression)
• COUNT:- Compute the Number of elements in a bag
Syntax:- COUNT(expression)
• DIFF:- Compares two fields in a tuple
Syntax:- DIFF(expression, expression)
• CONCAT:- Concatenates two expressions of identical types
Syntax:- CONCAT(expression, expression)
Example:-FOREACH A GENERATE concat(first_name,
second_name)
• MAX:- Computes the maximum of the numeric values or chararrays in a single-column bag. It requires a preceding GROUP statement for group maximums.
Syntax:- MAX(expression)
• MIN:- Computes the minimum of the numeric values or chararrays in a
single column bag.
Syntax:- MIN(expression)
• SIZE:- Used to compute the number of elements based on the data type. It includes null values also.
Syntax:- SIZE(expression)
Example:- FOREACH A GENERATE size(name);
• SUM:- Computes the sum of numeric values in a single column bag
Syntax:- SUM(expression)
• TOKENIZE:- Splits a string in a single tuple (which contains a group of words) and outputs a bag of words
Syntax:- TOKENIZE(expression)
• ISEMPTY:- Checks whether a bag or map is empty or not
Syntax:- ISEMPTY(expression)
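As an illustrative sketch combining several of these functions (a relation sales with fields dept and amount is assumed):
grouped = GROUP sales BY dept;
stats = FOREACH grouped GENERATE group, COUNT(sales), AVG(sales.amount), MAX(sales.amount), SUM(sales.amount);
DUMP stats;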
Comments
• Pig Latin has two types of comment operators:
SQL-style single-line comments (--) and Java-
style multiline comments (/* */).
For example:
--this is a single-line comment
A = load 'foo';
/* This is a multiline comment. */
B = load /* a comment in the middle */'bar';
WordCount Program using Pig
Latin
Steps involved to find the number of
occurrences of the words in a file using the pig
script
Cont..
• Each line is loaded as a single row, but we have to convert it into multiple rows (one word per row) like below
(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
Complete Pig Script for WordCount Program
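The script itself did not survive in these notes; a minimal five-line sketch, assuming the input file is named input.txt:
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;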
Note:
You can see that with just 5 lines of Pig program, we have solved the word count problem very easily.
Calculate maximum recorded temperature by year
for weather dataset in Pig Latin:
A = load 'weather' using PigStorage(',') as
(year:chararray, temp:int);
B = group A by year;
c= foreach B generate group, MAX(A.temp);
store c into 'wout.txt';
Using Pig Latin, Order the movies based on rating
and display the results.
A = load 'movie' using PigStorage(',') as
(id:int,name:chararray,year:int,rating:double,
duration:int);
B = distinct A;
C = order B by rating;
DUMP C;
Pig Execution Modes (or) Evaluating local and
Distributed modes of Running pig Script
Apache pig can be run in two modes:
1. Local Mode(Local Execution Environment)
2. Hadoop Mode(Distributed Execution Environment)
Local Mode:
In this mode all the files are installed and run from your
local host and local file system
Executes in a single JVM
No need of Hadoop or HDFS
This mode is generally used for developing and testing
pig logic.
If you’re using a small set of data or test your code, then
local mode could be faster than going through the
MapReduce Infrastructure.
To start the local mode of execution, the following command is used:
~$ pig -x local
MapReduce Mode (or) Distributed Mode
• In this mode, Apache Pig will take the input from HDFS
paths only,
and after processing data it will put output files on top of
HDFS
• In MapReduce mode of Execution,Pig translates
queries into MapReduce jobs and runs them on a
Hadoop Cluster
• In this mode, whenever we execute the pig latin statements
to process the data, a Mapreduce job is invoked in the back-
end to perform a particular operation on the data that
exists in the HDFS.
• MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.
Syntax:
To start the MapReduce mode of execution, the following command is used:
• pig -x mapreduce (or) pig
Apache Pig Execution Mechanisms (Pig Script Interfaces)
Apache Pig scripts can be executed in three ways,
namely, interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − Grunt acts as a
command interpreter.
It is Pig's interactive shell, which is used to execute all Pig scripts.
Simply put, interactive means coding and executing the script line by line.
You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and
get the output (using Dump operator). This method is useful
for initial development.
Batch Mode (Script)
In batch mode, all statements are coded in a single file with the extension .pig, and the file is executed directly.
This mode allows a single file containing Pig Latin commands, identified by the .pig suffix (FlightData.pig, for example).
Ending your Pig program with the .pig extension is a convention but not required.
The commands are interpreted by the Pig Latin compiler and executed in the order determined by the Pig optimizer.
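For illustration (the script name wordcount.pig is assumed), a script file can be run in either execution mode:
$ pig -x local wordcount.pig (batch run in local mode)
$ pig -x mapreduce wordcount.pig (batch run on the Hadoop cluster)
From the Grunt shell, the same file can be run with: grunt> exec wordcount.pig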
Cont..
• Embedded Mode (UDF) − Apache Pig
provides the provision of defining our own
functions (User Defined Functions) in
programming languages such as Java, and
using them in our script.
• It is useful to execute pig programs from a
java program
Applications of Apache Pig:
Pig scripting is used for exploring large datasets.
It provides support for ad-hoc queries across large datasets.
It is used in the prototyping of algorithms that process large datasets.
It is required to process time-sensitive data loads.
It is used for collecting large amounts of data in the form of search logs and web logs (e.g., error logs).
It is used where analytical insights are needed using sampling.
Pig User Defined Functions(UDF)
• To specify custom processing, Pig provides support for user-
defined functions (UDFs). Thus, Pig allows us to create our
own functions. Currently, Pig UDFs can be implemented using
the following programming languages: -
– Java
– Python
– Jython
– JavaScript
– Ruby
– Groovy
• Among all the languages, Pig provides the most extensive
support for Java functions
Example of Pig UDF
• In Pig:
– All UDFs must extend "org.apache.pig.EvalFunc"
– All functions must override the "exec" method.
• In Apache Pig, we also have a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.
Create a simple EVAL Function to convert the provided string to
uppercase
package com.hadoop;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class TestUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Return null for empty input rows
        if (input == null || input.size() == 0)
            return null;
        try {
            // Take the first field of the tuple and upper-case it
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
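To use such a UDF from a Pig script, the jar containing it must be registered first; a hedged sketch (the jar name myudfs.jar is assumed):
grunt> REGISTER myudfs.jar;
grunt> A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> B = FOREACH A GENERATE com.hadoop.TestUpper(firstname);
grunt> DUMP B;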
Tutorial Questions
1. a) Consider the Departmental Stores data file (stores.txt) in the following
format customerName, deptName, purchaseAmount.
i) Write a Pig script to list total sales per departmental store.
ii) Write a Pig script to list total sales per
customer.
b) Explain the following operators in Pig Latin.
i) flatten operator ii) Relational operators
2. a) Explain the architecture of Apache Pig with neat sketch.
b) Explain about the complex data types in Pig Latin.
3. a) Explain about Pig Latin data model and its data types.
b) Write about the three key design principles of Pig Latin
c) Write about Apache Pig execution modes and mechanism.
4. a) Write the major differences between Apache Pig and SQL
b) List and Explain various operators of Pig Latin.
5. a) Explain the principles to be considered while writing the Pig Scripts
b) Describe two modes for running scripts in Pig
6. a) How can you run the Pig scripts in Local and Distributed mode
b) Write the syntax of a Pig program with suitable example.
7. a)Discuss in brief about the operators supported by PIG with respect to data
access and debugging operations.
b) Explain in brief about the scripting in PIG with suitable example.
8. a) Draw and explain architecture of APACHE PIG in detail.
b) Discuss how Pig data model will help in effective data flow
9. a) List any five commands of pig script.
b) Discuss Pig Latin Application Flow
10 a) Discuss the various data types in Pig.
b) Write a word count program in Pig to count the occurrence of similar words
in a file.
11. a) How the pig programs can be packaged and explain the modes of running a
pig script with a neat sketch.
b) List and explain the relational operators in Pig.
12. a) Write the general PIG Latin program/flow organization.
b) Consider the student data file (st.txt), with data in the following format: Name, District, Age, Gender
i) Write a PIG script to Display Names of all female students
ii) Write a PIG script to find the number of Students from Prakasham District
iii) Write a PIG script to Display District wise count of all male students.
13. a) List the relational operators in Pig?
b) What are the components of Pig Execution Environment?
Example: Pig Script_1
Consider the Departmental Stores data file
(stores.txt) in the following format:
customerName, deptName, purchaseAmount.
i) Write a Pig script to list total sales per
departmental store.
ii) Write a Pig script to list total sales per
customer.
Example: stores.txt
• customerName deptName PurchaseAmount
A S 3.3
A S 4.7
B S 1.2
B T 3.4
C Z 1.1
C T 5.5
D R 1.1
List total sales per department store:
• grunt>
data = LOAD 'Documents/stores.txt' USING PigStorage(',') AS (customerName:chararray, deptName:chararray, purchaseAmount:float);
grp = GROUP data BY deptName;
result = FOREACH grp GENERATE group, COUNT(data.deptName), (float)SUM(data.purchaseAmount);
DUMP result;
Output
• The output has dept. store, customer count,
total sales.
(R,1,1.1)
(S,3,9.2)
(T,2,8.9)
(Z,1,1.1)
List total Sales per customer
• grunt>
data = LOAD 'Documents/stores.txt' USING PigStorage(',') AS (customerName:chararray, deptName:chararray, purchaseAmount:float);
grp = GROUP data BY customerName;
result = FOREACH grp GENERATE group, COUNT(data.customerName), (float)SUM(data.purchaseAmount);
DUMP result;
output
• The output has customer id, total transactions
per customer, total sales.
(A,2,8.0)
(B,2,4.6000004)
(C,2,6.6)
(D,1,1.1)
Example Pig Script-2
Consider The student data File (st.txt), Data in the
following format Name, District, age, gender
i) Write a PIG script to Display Names of all
female students
ii) Write a PIG script to find the number of
Students from Prakasham District
iii) Write a PIG script to Display District wise
count of all male students.
st.txt
SID Name District Age Gender
1 Raju Guntur 28 Male
2 Prudhvi West Godavari 29 Male
3 Indra Prakasham 28 Male
4 Ramana Prakasham 27 Male
5 Nagarjuna Nellore 29 Male
6 Ravindra Krishna 30 Male
7 Jyothi Guntur 27 Female
8 Lahari West Godavari 26 Female
9 Hema Prakasham 27 Female
Write a PIG script to Display Names of all female students
• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
fdata = FILTER data BY gender == 'Female';
result = FOREACH fdata GENERATE name;
DUMP result;
Write a PIG script to find the number of Students from Prakasham District
• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
fdata = FILTER data BY district == 'Prakasham';
grp = GROUP fdata ALL;
std_count = FOREACH grp GENERATE COUNT(fdata);
DUMP std_count;
Write a PIG script to Display District wise count of all male students.
• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
fdata = FILTER data BY gender == 'Male';
stdgrp = GROUP fdata BY district;
std_count = FOREACH stdgrp GENERATE group, COUNT(fdata);
DUMP std_count;
Using Join Operation
customers = LOAD 'customer' USING PigStorage(',') AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'orders' USING PigStorage(',') AS (oid:int, date:chararray, customer_id:int, amount:int);
customer_orders = JOIN customers BY id, orders BY customer_id;
dump customer_orders;
To Exercise more problems on
Pig Scripts-go to below link
• http://howt2talkt2apachepig.blogspot.com/