Module IV Pig

Apache Pig is a data analysis platform that serves as an alternative to MapReduce, developed at Yahoo. It features a data flow language called Pig Latin, which allows users to express data operations and develop custom functions, and it runs on Hadoop, processing both structured and unstructured data. Common use cases include ETL processes, and Pig supports various operators and execution modes for data manipulation.


Big Data Analytics (BAD601) Module -4 PIG

MODULE 4

PIG
What is PIG

Apache Pig is a platform for data analysis. It is an alternative to MapReduce
programming. Pig was developed as a research project at Yahoo.

Key Features of Pig

1. It provides an engine for executing data flows (how your data should flow). Pig
processes data in parallel on the Hadoop cluster.

2. It provides a language called “Pig Latin” to express data flows.

3. Pig Latin contains operators for many of the traditional data operations such as
join, filter, sort, etc.

4. It allows users to develop their own functions (User Defined Functions) for
reading, processing, and writing data.

Anatomy of PIG

The main components of Pig are as follows:

1. Data flow language (Pig Latin).

2. Interactive shell where you can type Pig Latin statements (Grunt).

3. Pig interpreter and execution engine.

SUNIL G L, Dept. of CSE(DS), RNSIT Page 1



PIG on Hadoop

Pig runs on Hadoop. It uses both the Hadoop Distributed File System (HDFS) and MapReduce
programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data
(data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input
from and write output to other sources.

Pig supports the following:

1. HDFS commands.

2. UNIX shell commands.

3. Relational operators.

4. Positional parameters.

5. Common mathematical functions.

6. Custom functions.

7. Complex data structures.

PIG Philosophy

1. Pigs Eat Anything: Pig can process different kinds of data such as structured and unstructured
data.

2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.

3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the same
can be included in the script for complex operations.

4. Pigs Fly: Pig processes data quickly.


Use Cases of PIG

Pig is widely used for “ETL” (Extract, Transform, and Load). Pig can extract data from different
sources such as ERP, Accounting, Flat Files, etc. Pig then makes use of various operators to
perform transformation on the data and subsequently loads it into the data warehouse.

PIG LATIN OVERVIEW

Pig Latin Statements

1. Pig Latin statements are basic constructs to process data using Pig.

2. A Pig Latin statement is an operator.


3. An operator in Pig Latin takes a relation as input and yields another relation as output.

4. Pig Latin statements include schemas and expressions to process data.

5. Pig Latin statements should end with a semi-colon.

Pig Latin Statements are generally ordered as follows:

1. LOAD statement that reads data from the file system.

2. Series of statements to perform transformations.

3. DUMP or STORE to display/store result.

The following is a simple Pig Latin script to load, filter, and store “student” data.
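
The original script is not reproduced here. A minimal sketch of such a script, assuming a tab-separated input file named student.tsv with the fields rollno, name, and gpa (the file name, field names, and threshold are illustrative assumptions), might look like this:

```pig
-- Assumed input: tab-separated file 'student.tsv' (hypothetical name and schema).
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Keep only students whose GPA is above 4.0 (threshold chosen for illustration).
B = FILTER A BY gpa > 4.0;

-- Write the filtered relation to the output directory 'student_output' (hypothetical).
STORE B INTO 'student_output' USING PigStorage('\t');
```

The script follows the usual ordering: one LOAD, a transformation, and a final STORE. Replacing the STORE line with DUMP B; would print the result to the console instead.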

Pig Latin keywords are reserved. They cannot be used as names (identifiers).

Pig Latin: Identifiers

1. Identifiers are names assigned to fields or other data structures.


2. An identifier should begin with a letter and should be followed only by letters, numbers, and
underscores.
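
As an illustration of the naming rule, the sketch below uses identifiers that start with a letter and contain only letters, digits, and underscores. The file name and schema are assumptions made for the example.

```pig
-- Valid identifiers: begin with a letter, then letters, digits, or underscores.
student_2024 = LOAD 'student.tsv' USING PigStorage('\t')
               AS (roll_no:int, name:chararray, gpa:float);
top_students = FILTER student_2024 BY gpa > 4.0;

-- Names such as 2024_students (starts with a digit) or top-students
-- (contains a hyphen) do not follow the rule and would be rejected.
DUMP top_students;
```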

Pig Latin Comments

In Pig Latin two types of comments are supported:

1. Single line comments that begin with "--".

2. Multiline comments that begin with "/*" and end with "*/".
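
Both comment styles can be sketched in a short script. The file name and schema are illustrative assumptions.

```pig
-- This is a single line comment.

/* This is a multiline comment.
   It can span several lines. */
A = LOAD 'student.tsv' USING PigStorage('\t')
    AS (rollno:int, name:chararray, gpa:float);  -- a comment can follow a statement
DUMP A;
```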


Pig Latin Case Sensitivity

1. Keywords are not case sensitive, such as LOAD, STORE, GROUP, FOREACH, DUMP, etc.
2. Relations and paths are case sensitive.
3. Function names are case sensitive, such as PigStorage, COUNT.
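
The following sketch illustrates these rules together. The file name and schema are assumptions made for the example.

```pig
-- Keywords may be written in any case: load/LOAD, foreach/FOREACH, generate/GENERATE.
A = load 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);
G = GROUP A ALL;

-- Function names are case sensitive: COUNT is the built-in, count would not resolve.
C = foreach G generate COUNT(A);

-- Relation names are case sensitive: C is defined above, c is not.
DUMP C;
```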

Operators in Pig Latin

Data Types in PIG


Execution modes of PIG

Pig can run in two ways:

1. Local Mode: In this mode, Pig runs on your local host and reads and writes files on the local
file system. There is no need for Hadoop or HDFS.

pig -x local filename

2. MapReduce Mode: In this mode, Pig loads and processes data that exists in the Hadoop
Distributed File System (HDFS). This is the default mode.

pig filename

HDFS Commands:
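
The original listing is not reproduced here. As a rough sketch, HDFS commands can be invoked from the Grunt shell or from a Pig script through the fs command. All paths and file names below are hypothetical.

```pig
-- Create a directory in HDFS (hypothetical path).
fs -mkdir /pigdemo

-- Copy a local file into that directory.
fs -copyFromLocal student.tsv /pigdemo

-- List the directory and print the file contents.
fs -ls /pigdemo
fs -cat /pigdemo/student.tsv
```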

Relational Operators
Filter
The FILTER operator is used to select tuples from a relation based on specified conditions.
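
A minimal sketch of FILTER with a compound condition, assuming a tab-separated file student.tsv with the fields rollno, name, and gpa (hypothetical names):

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Keep tuples that satisfy both conditions; all other tuples are dropped.
B = FILTER A BY gpa > 4.0 AND name IS NOT NULL;
DUMP B;
```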


Foreach
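
The original example is not reproduced here. In general, FOREACH ... GENERATE applies a set of expressions to every tuple of a relation and produces a new relation. A minimal sketch, assuming the hypothetical student.tsv file and schema:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Project two fields per tuple; UPPER is a built-in string function.
B = FOREACH A GENERATE rollno, UPPER(name) AS name_upper;
DUMP B;
```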

Group
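
The original example is not reproduced here. In general, GROUP collects tuples that share the same key into a bag. A minimal sketch, assuming the hypothetical student.tsv file and schema:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Each output tuple has two fields: 'group' (the key) and a bag named A
-- holding all tuples of A with that gpa value.
B = GROUP A BY gpa;
DESCRIBE B;
DUMP B;
```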

Distinct

The DISTINCT operator is used to remove duplicate tuples. In Pig, the DISTINCT operator works on
the entire tuple and NOT on individual fields.
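
Because DISTINCT compares whole tuples, a common pattern is to project the fields of interest first. A minimal sketch, assuming the hypothetical student.tsv file and schema:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Project only the name field, so that duplicates are judged on name alone.
N = FOREACH A GENERATE name;
D = DISTINCT N;
DUMP D;
```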


LIMIT

The LIMIT operator is used to limit the number of output tuples.
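
A minimal sketch of LIMIT, assuming the hypothetical student.tsv file and schema. The count of 3 is chosen for illustration.

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Return at most 3 tuples. Without a preceding ORDER BY, which tuples
-- are returned is not guaranteed.
B = LIMIT A 3;
DUMP B;
```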

ORDER BY

ORDER BY is used to sort a relation based on the values of one or more fields.
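
A minimal sketch of ORDER BY, assuming the hypothetical student.tsv file and schema:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Sort by gpa from highest to lowest; ties are ordered by name ascending.
B = ORDER A BY gpa DESC, name ASC;
DUMP B;
```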


Join

It is used to join two or more relations based on values in the common field. By default, it
performs an inner join.
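
A minimal sketch of an inner join on a common field. Both input files and their schemas (student.tsv and department.tsv) are hypothetical.

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);
B = LOAD 'department.tsv' USING PigStorage('\t') AS (rollno:int, deptname:chararray);

-- Only rollno values present in both relations appear in the result.
C = JOIN A BY rollno, B BY rollno;
DUMP C;
```

Each output tuple contains the fields of A followed by the fields of B, so the join key appears twice and can be referenced as A::rollno or B::rollno.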

UNION

UNION is used to merge the contents of two or more relations.
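
A minimal sketch of UNION, assuming two hypothetical input files that share the same schema:

```pig
A = LOAD 'student_2023.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);
B = LOAD 'student_2024.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- The result holds the tuples of both relations. Duplicates are kept,
-- and the order of tuples is not preserved.
C = UNION A, B;
DUMP C;
```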


SPLIT:

SPLIT is used to partition a relation into two or more relations.
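
A minimal sketch of SPLIT, assuming the hypothetical student.tsv file and schema. The relation names and threshold are illustrative.

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Route each tuple to a relation according to its condition.
SPLIT A INTO high_gpa IF gpa >= 4.0, low_gpa IF gpa < 4.0;
DUMP high_gpa;
DUMP low_gpa;
```

A tuple that matches no condition (for example, one with a null gpa) is not placed in any output relation.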

Sample

It is used to select a random sample of data based on the specified sample size, given as a
fraction between 0 and 1.
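
A minimal sketch of SAMPLE, assuming the hypothetical student.tsv file and schema. The 10% rate is chosen for illustration.

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Keep roughly 10% of the tuples. Sampling is probabilistic, so the exact
-- number of tuples returned can differ from run to run.
B = SAMPLE A 0.1;
DUMP B;
```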

EVAL Function

AVG

AVG is used to compute the average of numeric values in a single column bag.
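
A minimal sketch of AVG, assuming the hypothetical student.tsv file and schema. The relation is first grouped so that the gpa values form a single-column bag.

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- GROUP ... ALL puts every tuple of A into one bag.
G = GROUP A ALL;

-- A.gpa is the single-column bag of gpa values inside that group.
R = FOREACH G GENERATE AVG(A.gpa) AS avg_gpa;
DUMP R;
```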


MAX

MAX is used to compute the maximum of numeric values in a single column bag.
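
A minimal sketch of MAX, assuming the hypothetical student.tsv file and schema:

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);
G = GROUP A ALL;

-- Largest gpa value in the single-column bag A.gpa.
R = FOREACH G GENERATE MAX(A.gpa) AS max_gpa;
DUMP R;
```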

COUNT

COUNT is used to count the number of elements in a bag.
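
A minimal sketch of COUNT, assuming the hypothetical student.tsv file and schema. Here the tuples are counted per gpa value.

```pig
A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);
G = GROUP A BY gpa;

-- For each gpa value, count the tuples in the bag A of that group.
R = FOREACH G GENERATE group AS gpa, COUNT(A) AS num_students;
DUMP R;
```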


Complex Datatypes
Tuple
A TUPLE is an ordered collection of fields.
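
A minimal sketch of a tuple-typed field, assuming a hypothetical tab-separated file student_tuple.tsv whose first column holds a tuple written as (name,age) and whose second column holds a gpa value:

```pig
-- Assumed line format (hypothetical): (John,18)<TAB>4.0
A = LOAD 'student_tuple.tsv' USING PigStorage('\t')
    AS (info:tuple(name:chararray, age:int), gpa:float);

-- Fields inside a tuple are accessed with the dot operator.
B = FOREACH A GENERATE info.name, info.age, gpa;
DUMP B;
```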

MAP

A MAP represents a set of key-value pairs.
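
A minimal sketch of a map-typed field, assuming a hypothetical tab-separated file student_map.tsv whose second column holds a map written as [key#value,...]:

```pig
-- Assumed line format (hypothetical): John<TAB>[maths#90,science#85]
A = LOAD 'student_map.tsv' USING PigStorage('\t')
    AS (name:chararray, marks:map[int]);

-- A value is looked up by key with the # operator.
B = FOREACH A GENERATE name, marks#'maths' AS maths_marks;
DUMP B;
```

A lookup on a key that is not present in the map yields null.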


PIGGY BANK
Pig users can use Piggy Bank functions in Pig Latin scripts, and they can also share their own
functions through Piggy Bank.
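
A minimal sketch of calling a Piggy Bank function. The jar location depends on the installation, so the path below is a placeholder, and the input file and schema are hypothetical.

```pig
-- Make the Piggy Bank jar available to the script (placeholder path).
REGISTER /path/to/piggybank.jar;

A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Call a Piggy Bank string function by its fully qualified class name.
B = FOREACH A GENERATE rollno, org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP B;
```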

USER DEFINED FUNCTIONS


Pig allows you to create your own function for complex analysis.
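
On the Pig Latin side, a user-defined function packaged in a jar is registered and then called like a built-in. In the sketch below, the jar name myudfs.jar and the function myudfs.UPPER are hypothetical, as are the input file and schema. The function itself would be written separately, for example in Java.

```pig
-- Register the jar that contains the user-defined function (hypothetical name).
REGISTER myudfs.jar;

A = LOAD 'student.tsv' USING PigStorage('\t') AS (rollno:int, name:chararray, gpa:float);

-- Call the function by its package-qualified name (hypothetical).
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
```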


PIG VS HIVE
