Big Data Analytics (BAD601) Module -4 PIG
MODULE 4
PIG
What is PIG
Apache Pig is a platform for data analysis. It is an alternative to MapReduce
Programming. Pig was developed as a research project at Yahoo.
Key Features of Pig
1. It provides an engine for executing data flows (how your data should flow). Pig
processes data in parallel on the Hadoop cluster.
2. It provides a language called “Pig Latin” to express data flows.
3. Pig Latin contains operators for many of the traditional data operations such as
join, filter, sort, etc.
4. It allows users to develop their own functions (User Defined Functions) for
reading, processing, and writing data.
Anatomy of PIG
The main components of Pig are as follows:
1. Data flow language (Pig Latin).
2. Interactive shell where you can type Pig Latin statements (Grunt).
3. Pig interpreter and execution engine.
SUNIL G L, Dept. of CSE(DS), RNSIT Page 1
PIG on Hadoop
Pig runs on Hadoop. Pig uses both Hadoop Distributed File System and MapReduce
Programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data
(data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input
from and write output to other sources.
Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.
PIG Philosophy
1. Pigs Eat Anything: Pig can process different kinds of data such as structured and unstructured
data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the same
can be included in the script for complex operations.
4. Pigs Fly: Pig processes data quickly.
Use Cases of PIG
Pig is widely used for “ETL” (Extract, Transform, and Load). Pig can extract data from different
sources such as ERP, Accounting, Flat Files, etc. Pig then makes use of various operators to
perform transformation on the data and subsequently loads it into the data warehouse.
PIG LATIN OVERVIEW
Pig Latin Statements
1. Pig Latin statements are the basic constructs for processing data with Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.
Pig Latin Statements are generally ordered as follows:
1. LOAD statement that reads data from the file system.
2. Series of statements to perform transformations.
3. DUMP or STORE to display/store result.
The following is a simple Pig Latin script to load, filter, and store “student” data.
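A minimal sketch of such a script, assuming a tab-delimited input file student.tsv with the fields rollno, name, and gpa (the file name, field names, output directory, and GPA threshold are assumptions for illustration):

```pig
-- Load tab-delimited student records (hypothetical path and schema)
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Keep only the students whose GPA is above 4.0
B = FILTER A BY gpa > 4.0;

-- Write the filtered relation to an output directory
STORE B INTO 'student_output' USING PigStorage('\t');
```

Replacing the STORE statement with DUMP B; would print the result on the console instead of writing it to the file system.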
Pig Latin keywords are reserved. They cannot be used as names for relations or fields.
Pig Latin: Identifiers
1. Identifiers are names assigned to fields or other data structures.
2. An identifier should begin with a letter and be followed only by letters, digits, and
underscores.
Pig Latin Comments
In Pig Latin two types of comments are supported:
1. Single line comments that begin with “--”.
2. Multiline comments that begin with “/*” and end with “*/”.
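Both comment styles can be sketched in a short script. The file student.tsv and its schema are assumptions for illustration:

```pig
-- This is a single line comment

/* This is a multiline comment.
   It can span several lines. */
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

DUMP A;  -- a single line comment may also follow a statement
```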
Pig Latin: Case Sensitivity
1. Keywords such as LOAD, STORE, GROUP, FOREACH, and DUMP are not case sensitive.
2. Relation names, field names, and paths are case sensitive.
3. Function names such as PigStorage and COUNT are case sensitive.
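The case rules above can be sketched as follows, assuming a hypothetical tab-delimited file student.tsv:

```pig
-- Keywords may be written in lower case: load, using, as
a = load 'student.tsv' using PigStorage('\t')
    as (rollno:int, name:chararray, gpa:float);

-- The relation is named 'a'; referring to it as 'A' would be an error
g = GROUP a ALL;

-- Function names keep their exact case: COUNT works, count would not resolve
c = FOREACH g GENERATE COUNT(a);

DUMP c;
```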
Operators in Pig Latin
Data Types in PIG
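As a brief illustration, the commonly used simple types (int, long, float, double, chararray, bytearray) can be declared in the schema of a LOAD statement. The file employee.tsv and all field names below are assumptions for illustration:

```pig
-- Declare one field of each commonly used simple type (hypothetical file)
emp = LOAD 'employee.tsv' USING PigStorage('\t')
      AS (id:int, salary:long, rating:float, bonus:double,
          name:chararray, payload:bytearray);

-- Print the schema of the relation
DESCRIBE emp;
```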
Execution modes of PIG
Pig can run in two ways:
1. Local Mode: In this mode, Pig runs on a single machine and reads and writes files in the
local file system. Hadoop and HDFS are not required.
pig -x local filename
2. MapReduce Mode: In this mode, Pig loads and processes data that exists in the Hadoop
Distributed File System (HDFS) and runs on the Hadoop cluster. This is the default mode.
pig filename
HDFS Commands:
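HDFS commands can be issued from the Grunt shell or from a Pig script with the fs command. A minimal sketch, assuming a local file student.tsv and a hypothetical HDFS directory /pigdemo:

```pig
-- Create a directory in HDFS
fs -mkdir /pigdemo

-- Copy a local file into the new directory
fs -copyFromLocal student.tsv /pigdemo/student.tsv

-- List the directory and print the file contents
fs -ls /pigdemo
fs -cat /pigdemo/student.tsv
```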
Relational Operators
Filter
FILTER operator is used to select tuples from a relation based on specified conditions.
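A minimal sketch of FILTER, assuming a hypothetical tab-delimited file student.tsv with the fields rollno, name, and gpa:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Select the tuples whose gpa lies in the range [3.0, 4.0)
B = FILTER A BY (gpa >= 3.0) AND (gpa < 4.0);

DUMP B;
```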
Foreach
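FOREACH applies a set of expressions to every tuple of a relation and generates a new relation from the results. A minimal sketch, assuming a hypothetical file student.tsv:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Project two fields and convert the name to upper case
B = FOREACH A GENERATE rollno, UPPER(name) AS name_upper;

DUMP B;
```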
Group
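GROUP collects the tuples that share the same value of a key into a bag. A minimal sketch, assuming a hypothetical file student.tsv and grouping by the gpa field:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Each output tuple holds the key (named 'group') and a bag of matching tuples
B = GROUP A BY gpa;

DESCRIBE B;
DUMP B;
```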
Distinct
DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on
the entire tuple and NOT on individual fields.
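Because DISTINCT compares whole tuples, removing duplicates on a single field requires projecting that field first. A minimal sketch, assuming a hypothetical file student.tsv:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Remove tuples that are duplicated in every field
B = DISTINCT A;

-- To get unique names only, project the field and then apply DISTINCT
names = FOREACH A GENERATE name;
unique_names = DISTINCT names;

DUMP unique_names;
```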
LIMIT
LIMIT operator is used to limit the number of output tuples.
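A minimal sketch of LIMIT, assuming a hypothetical file student.tsv:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Keep at most three tuples of the relation
B = LIMIT A 3;

DUMP B;
```

Which three tuples are returned is not guaranteed unless the relation is sorted with ORDER BY immediately before the LIMIT.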
ORDER BY
ORDER BY is used to sort a relation based on the values of one or more fields.
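A minimal sketch of ORDER BY, assuming a hypothetical file student.tsv:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Sort by gpa from highest to lowest, breaking ties by name
B = ORDER A BY gpa DESC, name ASC;

DUMP B;
```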
Join
It is used to join two or more relations based on values in the common field. By default it
performs an inner join; LEFT, RIGHT, and FULL OUTER joins are also supported.
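A minimal sketch of an inner join, assuming two hypothetical tab-delimited files, student.tsv and department.tsv, that share a rollno field:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);
B = LOAD 'department.tsv' USING PigStorage('\t')
    AS (rollno:int, deptname:chararray);

-- Keep only the tuples whose rollno appears in both relations
C = JOIN A BY rollno, B BY rollno;

DUMP C;
```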
UNION
UNION is used to merge the contents of two or more relations.
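A minimal sketch of UNION, assuming two hypothetical files with the same schema:

```pig
A = LOAD 'student_2023.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);
B = LOAD 'student_2024.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Combine the tuples of both relations; duplicates are not removed
C = UNION A, B;

DUMP C;
```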
SPLIT
SPLIT is used to partition a relation into two or more relations.
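A minimal sketch of SPLIT, assuming a hypothetical file student.tsv and an illustrative GPA threshold of 4.0:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Route each tuple to a relation according to the condition it satisfies
SPLIT A INTO high IF gpa >= 4.0, low IF gpa < 4.0;

DUMP high;
DUMP low;
```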
Sample
It is used to select a random sample of a relation. The sample size is specified as a fraction
between 0 and 1 (for example, 0.01 for roughly 1% of the tuples).
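A minimal sketch of SAMPLE, assuming a hypothetical file student.tsv. The argument is a fraction of the relation, not a number of tuples:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Select roughly 10% of the tuples at random
B = SAMPLE A 0.1;

DUMP B;
```

The sample is probabilistic, so the number of tuples returned can vary from run to run.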
EVAL Function
AVG
AVG is used to compute the average of numeric values in a single column bag.
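A minimal sketch of AVG, assuming a hypothetical file student.tsv. The relation is grouped first because AVG operates on a bag:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Put all tuples into a single bag
G = GROUP A ALL;

-- A.gpa is a single column bag holding every gpa value
R = FOREACH G GENERATE AVG(A.gpa);

DUMP R;
```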
MAX
MAX is used to compute the maximum of numeric values in a single column bag.
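A minimal sketch of MAX, assuming a hypothetical file student.tsv:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Put all tuples into a single bag
G = GROUP A ALL;

-- Compute the highest gpa in the single column bag A.gpa
R = FOREACH G GENERATE MAX(A.gpa);

DUMP R;
```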
COUNT
COUNT is used to count the number of elements in a bag.
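A minimal sketch of COUNT, assuming a hypothetical file student.tsv:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Put all tuples into a single bag
G = GROUP A ALL;

-- Count the tuples in the bag
R = FOREACH G GENERATE COUNT(A);

DUMP R;
```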
Complex Datatypes
Tuple
A TUPLE is an ordered collection of fields.
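A minimal sketch of loading and reading tuples, assuming a hypothetical file tuples.txt in which each line holds two tab-separated tuples such as (John,12) and (Jim,45). The field names are assumptions for illustration:

```pig
-- Declare each column as a tuple with named, typed fields
A = LOAD 'tuples.txt'
    AS (t1:tuple(t1a:chararray, t1b:int),
        t2:tuple(t2a:chararray, t2b:int));

-- Use the dot operator to reach a field inside a tuple
B = FOREACH A GENERATE t1.t1a, t2.t2b;

DUMP B;
```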
MAP
A MAP is a set of key-value pairs. The key must be a chararray; the value can be of any type.
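A minimal sketch of reading a value from a map, assuming a hypothetical file studentcity.tsv in which each line holds a name, a tab, and a map literal such as [city#Bangalore]:

```pig
-- Declare the second column as a map with chararray values
A = LOAD 'studentcity.tsv'
    AS (name:chararray, m:map[chararray]);

-- Use the # operator to look up the value stored under the key 'city'
B = FOREACH A GENERATE name, m#'city' AS city;

DUMP B;
```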
PIGGY BANK
Pig users can use Piggy Bank functions in their Pig Latin scripts, and they can also contribute
their own functions to Piggy Bank.
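A minimal sketch of calling a Piggy Bank function, assuming the Piggy Bank jar is available as piggybank.jar in the working directory (the jar location and the file student.tsv are assumptions; the actual jar name and path depend on the installation):

```pig
-- Make the Piggy Bank functions available to the script
REGISTER ./piggybank.jar;

-- Give the fully qualified function a short alias
DEFINE ToUpper org.apache.pig.piggybank.evaluation.string.UPPER();

A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);

-- Convert each name to upper case with the Piggy Bank function
B = FOREACH A GENERATE ToUpper(name);

DUMP B;
```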
USER DEFINED FUNCTIONS
Pig allows you to create your own functions for complex analysis.
PIG VS HIVE