Chapter 16: Pig
Chapter Index
S.No. Reference Particulars Slide
No. From - To
1 Learning Objectives 3
2 Topic 1 Pig 4-6
3 Topic 2 Modes of running Pig Scripts 7-8
4 Topic 3 Running Pig problems 9
5 Topic 4 Getting started with Pig Latin 10-15
6 Topic 5 Working with Functions in Pig 16-17
Let’s Sum Up 18-19
Learning Objectives
Describe the meaning of Pig
Discuss the benefits of Pig
Describe the properties of Pig
Explain how to execute Pig programs
Explain the working with functions in Pig
1.Pig
Pig was designed and developed for performing a long series of data operations.
The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.
Pig was developed in 2006 at Yahoo. Its aim, as a research project was to
provide a simple way to use Hadoop and focus on examining large datasets.
Pig became an Apache project in 2007. By 2009, other companies started using
pig, making it a top-level Apache project in 2010.
Pig can be divided into three categories: ETL (Extract, Transform, and Load),
research, and interactive data processing.
Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.
1.Pig
Benefits of Pig :
The Pig programming language offers the following benefits:
Ease of coding: Using Pig Latin we can write complex programs. The
code is simple and easy to understand and maintain. It takes complex tasks
involving interrelated data transformations as data flow sequence and
explicitly encodes them.
Optimization: Pig Latin encodes tasks in such a way that they can be
easily optimized for execution. This allows users to concentrate on the data
processing aspects without bothering about efficiency.
Extensibility: Pig Latin is designed in such a way that it allows us to create
our own custom functions. These can be used for performing special tasks.
Custom functions are also called user- defined functions.
1.Pig
Properties of Pig :
Pig programming language is written in Java; therefore, it supports the
properties of Java and uses the Java notation.
Following table shows some important properties of Pig:
Property Description Values Default
stop.on.failure Decides whether to exit at True/ False
the first error False
udf.import.list Refers to a comma-
separated list of imports
for Universal Disk Format
(UDF)
pig.additional. Refers to a command- JAR
jars separated list of files
JavaArchive (JAR) files
2.Modes of Running Pig Scripts
Pig scripts can be run in the following two modes:
Local mode: In this mode, several scripts can run on a single machine without
requiring Hadoop MapReduce and Hadoop Distributed File System (HDFS).
This mode is useful for developing and testing Pig logic.
MapReduce mode: Also known as the Hadoop mode. The pig script, in this
mode, gets converted into a series of MapReduce jobs and then run on the
Hadoop cluster.
2.Modes of Running Pig Scripts
Following figure shows the modes available for running Pig scripts:
3.Running Pig Programs
Before running the Pig program, it is necessary to know about the pig shell. As
we all know, without a shell, no one can access the pig’s in-built characteristics.
Pig shell is known as “Grunt.” Grunt is a command shell, which is graphical in
nature and used for scripting of pig.
Grunt saves the previously used command in “pig_history” file in Home
directory.
There is one handier feature of Grunt: If you are writing a script on grunt, it will
automatically complete the keywords that you are typing.
For example, if you are writing a script and you need to just type “for” and press
Tab button, it will automatically type “foreach” keyword.
4.Getting Started with Pig Latin
Schema :For every scripting language, there is a schema definition that tells
everything about the script.
Pig is a high-level programming platform that uses Pig Latin language for
developing Hadoop MapReduce programs.
Pig Latin abstracts its programming from the Java MapReduce idiom and
extends itself by allowing direct calling of user-defined functions written in
Java, Python, JavaScript, Ruby, or Groovy.
Pig translates the Pig Latin script into MapReduce jobs.
4.Getting Started with Pig Latin
The main reasons for developing Pig Latin are as follows:
Simple: A streamlined method is provided in Pig Latin for interacting with
MapReduce.
Smart: A Pig Latin program is converted into a series of Java MapReduce jobs
using the Pig Latin Compiler.
Extensible: Due to its extensible nature, Pig Latin enables the developers to
address specific business problems by adding functions
4.Getting Started with Pig Latin
Pig Latin Application Flow :
Pig Latin is regarded as a data flow language. This simply means that we can
use Pig Latin to define a data stream and a sequence of transformations, which
can be applied to the data as it moves throughout our application.
In a control flow language, we can write a sequence of instructions. Also We
can use concepts such as conditional logic and loops.
There are no if and loop statements in Pig Latin.
The Pig syntax used for data processing flow is as follows:
X = LOAD ‘file_name.txt’ ...
Y = GROUP ... ;...
Z= FILTER ... ;...
DUMP Y; ..
STORE Z into ‘temp’;
4.Getting Started with Pig Latin
Working with Operators in Pig :
In Pig Latin, relational operators are used for transforming data. Different
types of transformations include grouping, filtering, sorting, and joining.
The following are some basic relational operators used in Pig:
1. FOREACH 2. ASSERT 3. FILTER
4. GROUP 5. ORDER BY 6. DISTINCT
7. JOIN 8. SAMPLE 9. SPLIT
10. LIMIT
4.Getting Started with Pig Latin
Description of Operators:
FOREACH : The FOREACH operator performs iterations over every record to
perform a transformation.
ASSERT : The ASSERT operator asserts a condition on the data.
FILTER : The FILTER operator enables us to use a predicate for selecting the
records that need to be retained in the pipeline.
Working with Operators in Pig :
Description of Operators:
GROUP : GROUP operator is used for grouping data in single or
multiple relations.
4.Getting Started with Pig Latin
Description of Operators:
GROUP : GROUP operator is used for grouping data in single or multiple
relations.
ORDER BY : Depending on one or more fields, a given relation can be
sorted using the ORDER BY operator.
DISTINCT : This operator is used for removing duplicate fields from a
given set of records.
JOIN : For joining two or more relations, JOIN operator is used. Types of
joins can be performed in Pig Latin:
1. Inner join 2. Outer join
LIMIT : The LIMIT operator in Pig allows a user to limit the number of
results.
SA MPLE : SAMPLE operator can be used for selecting a random data
sample in Pig.
SPLIT : SPLIT operator partitions a given relation into two or more relations.
5.Working with Functions in Pig
A set of statements that can be used to perform a specific task or a set of tasks is
called a function.
There are basically two types of functions in Pig: user-defined functions and
built-in functions.
User-defined functions can be created by users as per their requirements. On the
other hand, built-in functions are already defined in Pig.
There are mainly five categories of built-in functions in Pig.
5.Working with Functions in Pig
There are mainly five categories of built-in functions in Pig, which are as
follows :
Eval or Evaluation functions: These functions are used to evaluate a
value by using an expression.
Math functions: These functions are used to perform mathematical
operations.
String functions: These functions are used to perform various operations
on strings.
Bag and Tuple functions: These functions are used to perform operations
on tuples and bags.
Load and Store functions: These functions are used to load and extract
Let’s Sum Up
Pig was designed and developed for performing a long series of data operations.
The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.
Pig enables users to focus more on what to do than on how to do it.
Pig was developed in 2006 at Yahoo.
Pig can be divided into three categories: ETL (Extract, Transform, and Load),
research, and interactive data processing.
Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.
Let’s Sum Up
Pig programming language is written in Java; therefore, it supports the
properties of Java and uses the Java notation.
Pig shell is known as “Grunt.” Grunt is a command shell, which is graphical in
nature and used for scripting of pig.
Grunt saves the previously used command in “pig_history” file in Home
directory.
Pig is a high-level programming platform that uses Pig Latin language for
developing Hadoop MapReduce programs.
THANK YOU