What is Apache Pig
Apache Pig is a high-level data flow platform for executing MapReduce
programs on Hadoop. The language used by Pig is called Pig Latin.
Pig scripts are internally converted to MapReduce jobs and executed
on data stored in HDFS. Apart from MapReduce, Pig can also run its
jobs on Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured, or
unstructured, and stores the corresponding results in the Hadoop
Distributed File System (HDFS). Every task that can be achieved using Pig
can also be achieved by writing MapReduce programs in Java.
Features of Apache Pig
Let's see the various features of Pig technology.
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-
programmers. Pig makes this process easy: in Pig, the queries are
converted to MapReduce jobs internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than
efficiency.
3) Extensibility
Users can write user-defined functions (UDFs) containing their own logic
to execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It provides various types of operators such as sort, filter, and join.
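As a sketch, the built-in operators can be used like this in Pig Latin (the file paths, relation names, and schemas below are hypothetical):

```pig
-- load a hypothetical employee file (path and schema are assumptions)
emps = LOAD '/data/employees.txt' USING PigStorage(',')
       AS (id:int, name:chararray, dept:chararray, salary:double);

-- FILTER: keep only employees earning above a threshold
rich = FILTER emps BY salary > 50000.0;

-- ORDER: sort them by salary, highest first
sorted = ORDER rich BY salary DESC;

-- JOIN: combine with a hypothetical departments relation
depts = LOAD '/data/departments.txt' USING PigStorage(',')
        AS (dept:chararray, location:chararray);
joined = JOIN sorted BY dept, depts BY dept;
```

Each operator takes one or more relations as input and produces a new relation, which is what allows Pig to chain them into a data flow.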
Differences between Apache MapReduce and PIG
Apache MapReduce:
o It is a low-level data processing tool.
o Here, it is required to develop complex programs using Java or Python.
o It is difficult to perform data operations in MapReduce.
o It doesn't allow nested data types.
Apache Pig:
o It is a high-level data flow tool.
o It is not required to develop complex programs.
o It provides built-in operators to perform data operations like union,
sorting, and ordering.
o It provides nested data types like tuple, bag, and map.
Advantages of Apache Pig
o Less code - Pig requires fewer lines of code to perform any
operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides useful nested data types like tuple,
bag, and map.
Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
o It executes in a single JVM and is used for development,
experimenting, and prototyping.
o Here, files are installed and run from the local host.
o The local mode works on the local file system. The input and output
data are stored in the local file system.
The command for the local mode grunt shell:
$ pig -x local
MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o In this mode, Pig translates Pig Latin into MapReduce jobs and
executes them on the cluster.
o It can be executed against a pseudo-distributed or fully distributed
Hadoop installation.
o Here, the input and output data are present on HDFS.
The command for MapReduce mode:
$ pig
Or,
$ pig -x mapreduce
Ways to execute Pig Program
The following are the ways of executing a Pig program in local and
MapReduce mode:
o Interactive Mode - In this mode, Pig is executed in the Grunt
shell. To invoke the Grunt shell, run the pig command. Once the Grunt
shell starts, we can enter Pig Latin statements and commands
interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig
extension. These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions,
called UDFs (User Defined Functions), using programming languages
like Java and Python.
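As a sketch, a Java UDF compiled into a jar might be registered and invoked from Pig Latin like this (the jar name, class name, file path, and schema are all hypothetical):

```pig
-- register the jar containing the hypothetical UDF class
REGISTER myudfs.jar;

-- give the UDF a short alias (the class name is an assumption)
DEFINE UPPER com.example.pig.UpperCase();

-- apply the UDF to a field of a hypothetical relation
names = LOAD '/data/names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE UPPER(name);
```

REGISTER makes the jar's classes available to the script, and DEFINE binds a short alias to the fully qualified class name so it can be called like a built-in function.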
Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyze
data in Hadoop. It is a textual language that abstracts the programming
from the Java MapReduce idiom into a higher-level notation.
Pig Latin Statements
Pig Latin statements are used to process the data. Each statement is an
operator that accepts a relation as input and generates another relation
as output.
o A statement can span multiple lines.
o Each statement must end with a semicolon.
o A statement may include expressions and schemas.
o By default, these statements are processed using multi-query
execution.
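The rules above can be sketched in Pig Latin as follows (the path and schema are hypothetical):

```pig
-- one statement spread over two lines, terminated by a semicolon
A = LOAD '/data/students.txt' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:float);

-- a second statement operating on the relation produced above
B = FOREACH A GENERATE name, gpa;
```

The first statement includes a schema (the AS clause); the second includes expressions (the fields projected by GENERATE).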
Pig Latin Conventions
() - The parentheses can enclose one or more items. They can also be
used to indicate the tuple data type.
Example - (10, xyz, (3,6,9))
[] - The straight brackets can enclose one or more optional items. They
can also be used to indicate the map data type.
Example - [INNER | OUTER]
{} - The curly brackets enclose two or more items, one of which is
required. They can also be used to indicate the bag data type.
Example - { block | nested_block }
... - The horizontal ellipsis points indicate that you can repeat a
portion of the code.
Example - cat path [path ...]
Pig Latin Data Types
Simple Data Types
int - It defines a signed 32-bit integer.
Example - 2
long - It defines a signed 64-bit integer.
Example - 2L or 2l
float - It defines a 32-bit floating-point number.
Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F
double - It defines a 64-bit floating-point number.
Example - 2.5 or 2.5e2 or 2.5E2
chararray - It defines a character array in Unicode UTF-8 format.
Example - javatpoint
bytearray - It defines a byte array (blob).
boolean - It defines boolean values.
Example - true/false
datetime - It defines a date-time value.
Example - 1970-01-01T00:00:00.000+00:00
biginteger - It defines Java BigInteger values.
Example - 5000000000000
bigdecimal - It defines Java BigDecimal values.
Example - 52.232344535345
Complex Types
tuple - It defines an ordered set of fields.
Example - (15,12)
bag - It defines a collection of tuples.
Example - {(15,12), (12,15)}
map - It defines a set of key-value pairs.
Example - [open#apache]
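As a sketch, these complex types can appear together in a relation's schema (the file path and field names below are hypothetical):

```pig
-- load a hypothetical file whose fields use the complex types above
data = LOAD '/data/records.txt'
       AS (t:tuple(a:int, b:int),     -- a tuple of two ints, e.g. (15,12)
           bg:bag{tp:tuple(c:int)},   -- a bag of tuples, e.g. {(15),(12)}
           m:map[chararray]);         -- a map, e.g. [open#apache]
```

Because bags contain tuples and tuples can contain any type, these three types can be nested arbitrarily, which is what the "nested data types" advantage above refers to.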
Pig Example
Use case: Using Pig, find the most frequently occurring start letter.
Solution:
Step 1: Load the data into a bag named "lines". The entire line is loaded
into a single field named line of type chararray.
grunt> lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);
Step 2: The text in the bag lines needs to be tokenized; this produces one
word per row.
grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
Step 3: To retain the first letter of each word, type the command below.
It uses the SUBSTRING function to take the first character.
grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
Step 4: Create a group for each unique letter, where the grouped bag will
contain the same letter for each occurrence of that letter.
grunt> lettergrp = GROUP letters BY letter;
Step 5: The number of occurrences is counted in each group.
grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);
Step 6: Arrange the output according to count in descending order using
the command below.
grunt> OrderCnt = ORDER countletter BY $1 DESC;
Step 7: Limit to one to give the result.
grunt> result = LIMIT OrderCnt 1;
Step 8: Store the result in HDFS. The result is saved in the output
directory under the sonoo folder.
grunt> STORE result INTO 'home/sonoo/output';