Module IV Pig

Apache Pig is a data analysis platform that serves as an alternative to MapReduce, developed at Yahoo. It features a data flow language called Pig Latin, which allows users to express data operations and develop custom functions, and it runs on Hadoop, processing both structured and unstructured data. Common use cases include ETL processes, and Pig supports various operators and execution modes for data manipulation.

Uploaded by

chinnu.200420

Big Data Analytics (BAD601) Module -4 PIG

MODULE 4

PIG
What is PIG

Apache Pig is a platform for data analysis. It is an alternative to MapReduce
programming. Pig was developed as a research project at Yahoo.

Key Features of Pig

1. It provides an engine for executing data flows (how your data should flow). Pig
processes data in parallel on the Hadoop cluster.

2. It provides a language called “Pig Latin” to express data flows.

3. Pig Latin contains operators for many of the traditional data operations such as
join, filter, sort, etc.

4. It allows users to develop their own functions (User Defined Functions) for
reading, processing, and writing data.

Anatomy of PIG

The main components of Pig are as follows:

1. Data flow language (Pig Latin).

2. Interactive shell where you can type Pig Latin statements (Grunt).

3. Pig interpreter and execution engine.

SUNIL G L, Dept. of CSE(DS), RNSIT Page 1


PIG on Hadoop

Pig runs on Hadoop. Pig uses both the Hadoop Distributed File System (HDFS) and MapReduce
programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data
(data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input
from and write output to other sources.

Pig supports the following:

1. HDFS commands.

2. UNIX shell commands.

3. Relational operators.

4. Positional parameters.

5. Common mathematical functions.

6. Custom functions.

7. Complex data structures.

PIG Philosophy

1. Pigs Eat Anything: Pig can process different kinds of data, both structured and
unstructured.

2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.

3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the same
can be included in the script for complex operations.

4. Pigs Fly: Pig processes data quickly.


Use Cases of PIG

Pig is widely used for “ETL” (Extract, Transform, and Load). Pig can extract data from different
sources such as ERP, Accounting, Flat Files, etc. Pig then makes use of various operators to
perform transformation on the data and subsequently loads it into the data warehouse.

PIG LATIN OVERVIEW

Pig Latin Statements

1. Pig Latin statements are basic constructs to process data using Pig.

2. A Pig Latin statement is an operator.


3. An operator in Pig Latin takes a relation as input and yields another relation as output.

4. Pig Latin statements include schemas and expressions to process data.

5. Pig Latin statements should end with a semi-colon.

Pig Latin Statements are generally ordered as follows:

1. LOAD statement that reads data from the file system.

2. Series of statements to perform transformations.

3. DUMP or STORE to display/store result.

The following is a simple Pig Latin script to load, filter, and store “student” data.
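As a rough stand-in, the same load, filter, store flow can be mirrored in plain Python. This is not Pig Latin; the field names (rollno, name, gpa), the sample rows, and the 4.0 threshold are assumptions chosen only for illustration.

```python
import csv
import io

# Hypothetical "student" data: rollno, name, gpa (all values are assumptions).
RAW = "1001,John,3.0\n1002,Jack,4.5\n1003,Smith,4.2\n"

def load(text):
    """LOAD step: parse comma-separated lines into (rollno, name, gpa) tuples."""
    return [(int(r), n, float(g)) for r, n, g in csv.reader(io.StringIO(text))]

def store(relation):
    """STORE step: serialise the tuples back to comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(relation)
    return out.getvalue()

student = load(RAW)                            # 1. LOAD
toppers = [t for t in student if t[2] > 4.0]   # 2. transformation (a filter)
result = store(toppers)                        # 3. STORE
```

Each step takes a whole relation and produces a new one, which is the shape a Pig Latin script follows.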

Pig Latin keywords are reserved; they cannot be used as names for relations, fields, or other identifiers.

Pig Latin: Identifiers

1. Identifiers are names assigned to fields or other data structures.

2. An identifier should begin with a letter and may be followed only by letters, digits, and
underscores.
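The identifier rule above can be checked with a regular expression. The sketch below is written in Python purely as an illustration of the rule; the function name is hypothetical and it does not check for reserved keywords.

```python
import re

# A letter first, then any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_identifier(name: str) -> bool:
    """Return True if the whole string follows the identifier rule."""
    return IDENTIFIER.fullmatch(name) is not None
```

For example, `student_gpa2` passes, while `2students`, `_gpa`, and `roll-no` do not.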

Pig Latin Comments

In Pig Latin two types of comments are supported:

1. Single-line comments that begin with “--”.

2. Multiline comments that begin with “/*” and end with “*/”.


Pig Latin Case Sensitive

1. Keywords such as LOAD, STORE, GROUP, FOREACH, and DUMP are not case sensitive.
2. Relations and paths are case sensitive.
3. Function names such as PigStorage and COUNT are case sensitive.

Operators in Pig Latin

Data Types in PIG


Execution modes of PIG

Pig can run in two ways:

1. Local Mode: In this mode, Pig runs on your local host and all files are read from and
written to the local file system. There is no need for Hadoop or HDFS.

pig -x local filename

2. MapReduce Mode: In this mode, Pig loads and processes data that exists in the Hadoop
Distributed File System (HDFS). This is the default mode.

pig filename

HDFS Commands:

Relational Operators
Filter
FILTER operator is used to select tuples from a relation based on specified conditions.
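A minimal sketch of what FILTER computes, written in Python rather than Pig Latin. The relation, its (rollno, name, gpa) schema, and the condition are assumptions made for illustration.

```python
def pig_filter(relation, condition):
    """Keep only the tuples for which the condition holds."""
    return [row for row in relation if condition(row)]

# Hypothetical relation with schema (rollno, name, gpa).
students = [(1001, "John", 3.0), (1002, "Jack", 4.5), (1003, "Smith", 4.2)]

# Roughly what a statement such as  B = FILTER A BY gpa > 4.0;  selects.
high_gpa = pig_filter(students, lambda row: row[2] > 4.0)
```

The input relation is left unchanged; the result is a new relation containing the matching tuples.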


Foreach

Group

Distinct

DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on
the entire tuple and NOT on individual fields.
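The whole-tuple behaviour can be sketched in Python as follows. The sample rows are assumptions, and keeping the first-seen order is a choice of this sketch, not something Pig guarantees.

```python
def pig_distinct(relation):
    """Drop tuples that are identical in every field to one already seen."""
    seen = set()
    result = []
    for row in relation:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result

# Hypothetical (name, gpa) rows: two exact duplicates and one partial match.
rows = [("John", 3.0), ("John", 3.0), ("John", 4.5)]
unique = pig_distinct(rows)
```

Here ("John", 4.5) survives even though the name repeats, because the comparison is on the entire tuple.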


LIMIT

LIMIT operator is used to limit the number of output tuples.

ORDER BY

ORDER BY is used to sort a relation based on the values of one or more fields.
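LIMIT and ORDER BY are often used together to pick the top few tuples. A minimal Python sketch of that combination, assuming a hypothetical (rollno, name, gpa) relation:

```python
from operator import itemgetter

# Hypothetical relation with schema (rollno, name, gpa).
students = [(1001, "John", 3.0), (1002, "Jack", 4.5), (1003, "Smith", 4.2)]

# ORDER BY gpa DESC: sort the whole relation on the third field.
by_gpa_desc = sorted(students, key=itemgetter(2), reverse=True)

# LIMIT 2: keep at most two tuples of the sorted relation.
top_two = by_gpa_desc[:2]
```

Without a preceding ORDER BY, Pig does not guarantee which tuples LIMIT returns, so sorting first makes the result predictable.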


Join

JOIN is used to join two or more relations based on values in a common field. By default it
performs an inner join; outer joins are requested with the LEFT, RIGHT, or FULL keyword.
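A minimal Python sketch of an inner join on a common field. The two relations, their contents, and the helper name are assumptions for illustration only.

```python
def inner_join(left, right, left_key, right_key):
    """Pair every left tuple with every right tuple that has the same key."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    # Each output tuple carries all fields of the left tuple, then the right.
    return [l + r for l in left for r in index.get(l[left_key], [])]

# Hypothetical relations keyed on rollno (field 0 in both).
students = [(1001, "John"), (1002, "Jack"), (1003, "Smith")]
departments = [(1001, "CSE"), (1003, "ECE"), (1004, "MECH")]

joined = inner_join(students, departments, 0, 0)
```

Tuples whose key has no match on the other side (1002 and 1004 here) do not appear in the result.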

UNION

UNION merges the contents of two or more relations.
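A short Python sketch of the merge, using assumed sample relations. Pig's UNION keeps duplicate tuples and does not guarantee output order; concatenation is simply the easiest way to model it here.

```python
def pig_union(*relations):
    """Merge the tuples of all input relations; duplicates are kept."""
    merged = []
    for relation in relations:
        merged.extend(relation)
    return merged

# Hypothetical relations that share one identical tuple.
a = [(1, "x"), (2, "y")]
b = [(2, "y"), (3, "z")]
merged = pig_union(a, b)
```

If duplicates are not wanted, DISTINCT can be applied to the merged relation.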


SPLIT:

SPLIT is used to partition a relation into two or more relations.
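A minimal Python sketch of the idea, assuming a hypothetical (rollno, name, gpa) relation and two made-up conditions. Each condition is tested independently, so a tuple may land in more than one output relation, or in none.

```python
def pig_split(relation, conditions):
    """Build one output relation per named condition."""
    return {
        name: [row for row in relation if cond(row)]
        for name, cond in conditions.items()
    }

# Hypothetical relation with schema (rollno, name, gpa).
students = [(1001, "John", 3.0), (1002, "Jack", 4.5), (1003, "Smith", 4.0)]

parts = pig_split(students, {
    "low": lambda r: r[2] < 4.0,
    "high": lambda r: r[2] >= 4.0,
})
```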

Sample

It is used to select a random sample of data based on the specified sample size, given as a
fraction between 0 and 1.
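One way to picture sampling is that each tuple is kept independently with the given probability. The Python sketch below follows that assumption; the helper name and the optional seed parameter are hypothetical, and the number of tuples returned varies from run to run.

```python
import random

def pig_sample(relation, fraction, seed=None):
    """Keep each tuple with probability `fraction` (0.0 to 1.0)."""
    rng = random.Random(seed)
    return [row for row in relation if rng.random() < fraction]

# Hypothetical single-field relation of 100 tuples.
data = [(i,) for i in range(100)]
about_tenth = pig_sample(data, 0.1)
```

Because the selection is random, a fraction of 0.1 gives roughly, not exactly, one tenth of the tuples.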

EVAL Function

1. AVG

AVG is used to compute the average of numeric values in a single column bag.


MAX

MAX is used to compute the maximum of numeric values in a single column bag.

COUNT

COUNT is used to count the number of elements in a bag.
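The three eval functions above (AVG, MAX, COUNT) all reduce a bag to a single value. A small Python sketch, modelling a single-column bag as a list of one-field tuples with assumed gpa values:

```python
def pig_avg(bag):
    """Average of the numeric values in a single-column bag."""
    values = [row[0] for row in bag]
    return sum(values) / len(values)

def pig_max(bag):
    """Largest numeric value in a single-column bag."""
    return max(row[0] for row in bag)

def pig_count(bag):
    """Number of elements (tuples) in the bag."""
    return len(bag)

# Hypothetical single-column bag of gpa values.
gpa_bag = [(3.0,), (4.5,), (4.2,)]
```

In a Pig script these functions are typically applied to the bags produced by GROUP, one result per group.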


Complex Datatypes
Tuple
A TUPLE is an ordered collection of fields.

MAP

A MAP represents a set of key-value pairs.
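Python's built-in tuple and dict are close analogues of these two types, so they make a convenient sketch. All values below are assumptions for illustration.

```python
# TUPLE: an ordered collection of fields, addressed by position.
student = (1001, "John", 3.0)      # hypothetical (rollno, name, gpa)
name = student[1]                  # second field, like $1 in Pig Latin

# MAP: key-value pairs, looked up by key (keys are strings in Pig).
marks = {"maths": 90, "physics": 85}
maths_mark = marks["maths"]        # like marks#'maths' in Pig Latin
```

A tuple's fields are found by their order, while a map's values are found by their key.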


PIGGY BANK
Pig users can use Piggy Bank functions in their Pig Latin scripts, and they can also share
their own functions through Piggy Bank.

USER DEFINED FUNCTIONS

Pig allows you to create your own function for complex analysis.


PIG VS HIVE
