
CP 422 Programming for Big Data
Higher-level API: Pig
Outline
• Introduction to Pig on Hadoop
• Pig as a parallel dataflow language
• How Pig differs from MapReduce
• Use cases
• Pig Philosophy
• Pig's History
• Pig's Data Model
  • Data types
  • Data schemas
  • Casts
Pig on Hadoop
• Pig is a high-level platform for processing big data. It provides a high level of abstraction over MapReduce computation.
• The Apache Pig tool's two primary components are Pig Latin and the Pig Engine.
  • Pig Latin is a high-level scripting language for writing data analysis code.
  • The Pig Engine accepts Pig Latin scripts as input and turns them into MapReduce jobs.
• Programmers write scripts in the Pig Latin language to process data stored in the Hadoop Distributed File System (HDFS).
• Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
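
A minimal sketch of that workflow (the input path and field names below are assumed for illustration):

-- load from HDFS, transform, and write back; path and fields are assumed
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
                              date:chararray, volume:long);
heavy = filter daily by volume > 1000000;   -- keep only high-volume records
store heavy into 'high_volume_days';        -- runs as one or more MapReduce jobs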
Pig as a parallel dataflow language
• Pig Latin lets users describe how data from one or more inputs should be read, processed, and then stored to one or more outputs, in parallel.
• It differs from many programming languages:
  • There are no if statements or for loops in Pig Latin; a script describes a flow of data rather than a sequence of control steps.
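
A sketch of the dataflow style, with one input feeding two outputs in parallel (the threshold and output paths are assumed):

-- one input, two parallel outputs; threshold and paths are assumed
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                                 date:chararray, dividend:float);
split divs into big if dividend >= 1.0, small if dividend < 1.0;
store big into 'large_dividends';
store small into 'small_dividends';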
How Pig differs from MapReduce
• Pig has advantages in:
  • Pig Latin provides all of the standard data-processing operations, such as join, filter, group by, order by, union, etc., some of which programmers must write by hand in MapReduce.
  • Pig provides complex, nontrivial implementations of these operations, optimized for parallel computing.
  • Pig can analyze a Pig Latin script and understand the dataflow the user is describing, which enables error checking and optimization.
• MapReduce has advantages in:
  • More control over execution.
  • It suits less common algorithms or extremely performance-sensitive ones.
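
For instance, a join followed by a group by, each of which would take substantial custom MapReduce code, is only a few lines of Pig Latin (paths and field names are assumed):

-- join then group by in a few lines; paths and fields are assumed
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
divs  = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray);
jnd   = join daily by symbol, divs by symbol;   -- executed as a parallel join
grpd  = group jnd by daily::exchange;           -- group the joined records
cnts  = foreach grpd generate group, COUNT(jnd);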
Use Cases for Pig
• Traditional extract-transform-load (ETL) data pipelines
• Research on raw data
• Iterative processing
• …
Pig Philosophy
• Pigs eat anything
  • Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
• Pigs live anywhere
  • Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework.
• Pigs are domestic animals
  • Pig is designed to be easily controlled and modified by its users.
• Pigs fly
  • Pig processes data quickly.
Pig’s History
• Pig started out as a research project in Yahoo! Research. A
paper SIGMOD in 2008 describes MapReduce as “is too low-
level and rigid, and leads to a great deal of custom user
code that is hard to maintain and reuse”
• The first Pig release came in September 2008.
• Early in 2009 other companies started to use Pig for their
data processing. Amazon also added Pig as part of its Elastic
MapReduce service.
• By the end of 2009 about half of Hadoop jobs at Yahoo!
were Pig jobs.
• In 2010, Pig adoption continued to grow, and Pig graduated
from a Hadoop subproject, becoming its own top-level
Apache project.
Pig’s Data Model
• Data type:
• Scalar types – single value, represented by java.lang
classes
• int
• long
• float
• double
• chararray: java.lang.String
• Bytearray
• Complex types - can contain data of any type, including
other complex types
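
A short sketch of scalar types in use (the path and field names are assumed):

-- declare scalar types in a schema, then compute with them
trades = load 'NYSE_daily' as (symbol:chararray, price:double, volume:long);
value  = foreach trades generate symbol, price * volume;  -- double * long yields double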
Complex Types - map
A map is a mapping from a chararray key to a data element: <key, value>
• Key: a chararray
• Value: any Pig type, including complex types
  • By default, the value is assumed to be a bytearray
  • Programmers can cast the value explicitly
  • Otherwise, Pig will make a best guess based on how the value is used
• There is no requirement that all values in a map be of the same type

Example: ['name'#'bob', 'age'#55]

Brackets delimit the map, a hash separates each key from its value, and commas separate the key-value pairs.
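
A short sketch of projecting a value out of a map field (the same 'baseball' file and 'base_on_balls' key appear on the cast slide at the end; the schema here is assumed):

-- look up a value by key; the value comes back as a bytearray by default
player = load 'baseball' as (name:chararray, bat:map[]);
walks  = foreach player generate name, bat#'base_on_balls';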
Complex Types - tuple
A tuple is a fixed-length, ordered collection of Pig data elements.
• Tuples are divided into fields, with each field containing one data element.
• These elements can be of any type; they do not all need to be the same type.
• Pig can check the data type of each field in the tuple.
• A field can be referred to by its position or by its field name.

Example: ('bob', 55)

Parentheses indicate the tuple, and commas delimit the fields in the tuple.
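
A sketch of both reference styles (the 'people' file and its schema are assumed):

-- refer to fields by name or by position ($0 is the first field)
people = load 'people' as (name:chararray, age:int);
byname = foreach people generate name, age;
bypos  = foreach people generate $0, $1;   -- the same projection, positionally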
Complex Types - bag
• A bag is an unordered collection of tuples.
  • It is not possible to reference tuples in a bag by position.
• The bag is the one type in Pig that is not required to fit into memory.
• Pig does not provide a list or set type that can store items of any type. It is possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one field.

Example: {('bob', 55), ('sally', 52), ('john', 25)}

Braces delimit the bag, with the tuples in the bag separated by commas.
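
Bags arise naturally from grouping: group produces one record per key, holding a bag of the matching tuples (the path and fields are assumed):

-- each grpd record is (group, {bag of divs tuples})
divs = load 'NYSE_dividends' as (symbol:chararray, dividend:float);
grpd = group divs by symbol;
avgs = foreach grpd generate group, AVG(divs.dividend);  -- aggregate over the bag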
Nulls
• Data of any type can be null.
• In Pig, the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, or Python.
• In Pig, a null data element means the value is unknown. This might be because the data is missing, an error occurred while processing it, etc.
• This affects how Pig treats null data: as in SQL, an expression involving a null evaluates to null.
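
A common consequence, also used on the schema slides below, is filtering out records whose value is unknown (the path and fields are assumed):

-- drop records where the dividend is unknown
divs  = load 'NYSE_dividends' as (symbol:chararray, dividend:float);
known = filter divs by dividend is not null;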
Schemas
• Schemas enable programmers to assign names to fields and to declare types for fields.
• Schema handling:
  • You can define a schema that includes both the field name and the field type.
  • You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
  • You can choose not to define a schema; in this case, the field is unnamed and the field type defaults to bytearray.
Communicating the schema to Pig
• Runtime declaration:

dividends = load 'NYSE_dividends' as
    (exchange:chararray, symbol:chararray, date:chararray, dividend:float);

• Pig now expects your data to have four fields with the specified types.
  • If a record has more fields, Pig will truncate the extra ones.
  • If it has fewer, Pig will pad the end of the record with nulls.
• It is also possible to specify the schema without giving explicit data types; in this case, each field's type is assumed to be bytearray:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
Schema syntax
• The schema may be stored with the data rather than declared at load time:
  • It can be stored in a metadata repository such as HCatalog.
  • It can be stored in the data itself (if, for example, the data is stored in JSON format).
• Pig will fetch the schema from the load function before doing error checking on your script:

mdata    = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;

• What if we do not tell Pig the data schema?
  • Fields can be referenced by position, starting from zero (the syntax is a dollar sign followed by the position number).
  • Pig assumes bytearray and guesses each field's type from how it is used in the script:

daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;
Cast in Pig
• The syntax for casts in Pig is the same as in Java: the type name in parentheses before the value.

player     = load 'baseball' as (name:chararray, team:chararray,
                                 pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';
Supported casts
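The general rules, summarized rather than reproduced as the full table:
• A bytearray can be cast to any other type.
• The numeric types (int, long, float, double) can be cast among themselves; narrowing casts (e.g., double to int) truncate the value.
• A chararray can be cast to a numeric type; a value that does not parse as a number becomes null.
• No type can be cast to bytearray, and scalar and complex types cannot be cast to one another.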
