
CP 422 Programming for Big Data
Higher-level API: Pig
Outline
• Introduction to Pig on Hadoop
• Pig as a parallel dataflow language
• How Pig differs from MapReduce
• Use cases
• Pig Philosophy
• Pig's History
• Pig's Data Model
  • Data types
  • Data schemas
  • Casts
Pig on Hadoop
• Pig is a high-level platform for processing big data. It provides a high level of abstraction over MapReduce computation.
• The Apache Pig tool's two primary components are Pig Latin and the Pig Engine.
  • Pig Latin is a high-level scripting language for writing data analysis code.
  • The Pig Engine accepts Pig Latin scripts as input and turns them into MapReduce jobs.
• Programmers write scripts in the Pig Latin language to process data stored in the Hadoop Distributed File System (HDFS).
• Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
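
A minimal sketch of that workflow (the input path and field names below are assumed for illustration):

-- load from HDFS, transform, and write back; path and fields are assumed
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
                              date:chararray, volume:long);
heavy = filter daily by volume > 1000000;   -- keep only high-volume records
store heavy into 'high_volume_days';        -- runs as one or more MapReduce jobs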
Pig as a parallel dataflow language
• Pig Latin lets users describe how data from one or more inputs should be read, processed, and then stored to one or more outputs, in parallel.
• It differs from many programming languages:
  • There are no if statements or for loops in Pig Latin; a script describes a flow of data rather than a sequence of control steps.
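
A sketch of the dataflow style, with one input feeding two outputs in parallel (the threshold and output paths are assumed):

-- one input, two parallel outputs; threshold and paths are assumed
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                                 date:chararray, dividend:float);
split divs into big if dividend >= 1.0, small if dividend < 1.0;
store big into 'large_dividends';
store small into 'small_dividends';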
How Pig differs from MapReduce
• Pig has advantages in:
  • Pig Latin provides all of the standard data-processing operations, such as join, filter, group by, order by, union, etc., some of which programmers must write by hand in MapReduce.
  • Pig provides complex, nontrivial implementations of these operations, optimized for parallel computing.
  • Pig can analyze a Pig Latin script and understand the dataflow the user is describing, which enables error checking and optimization.
• MapReduce has advantages in:
  • More control over execution.
  • It suits less common algorithms or extremely performance-sensitive ones.
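
For instance, a join followed by a group by, each of which would take substantial custom MapReduce code, is only a few lines of Pig Latin (paths and field names are assumed):

-- join then group by in a few lines; paths and fields are assumed
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
divs  = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray);
jnd   = join daily by symbol, divs by symbol;   -- executed as a parallel join
grpd  = group jnd by daily::exchange;           -- group the joined records
cnts  = foreach grpd generate group, COUNT(jnd);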
Use Cases for Pig
• Traditional extract-transform-load (ETL) data pipelines
• Research on raw data
• Iterative processing
• …
Pig Philosophy
• Pigs eat anything
  • Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
• Pigs live anywhere
  • Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework.
• Pigs are domestic animals
  • Pig is designed to be easily controlled and modified by its users.
• Pigs fly
  • Pig processes data quickly.
Pig’s History
• Pig started out as a research project in Yahoo! Research. A
paper SIGMOD in 2008 describes MapReduce as “is too low-
level and rigid, and leads to a great deal of custom user
code that is hard to maintain and reuse”
• The first Pig release came in September 2008.
• Early in 2009 other companies started to use Pig for their
data processing. Amazon also added Pig as part of its Elastic
MapReduce service.
• By the end of 2009 about half of Hadoop jobs at Yahoo!
were Pig jobs.
• In 2010, Pig adoption continued to grow, and Pig graduated
from a Hadoop subproject, becoming its own top-level
Apache project.
Pig’s Data Model
• Data type:
• Scalar types – single value, represented by java.lang
classes
• int
• long
• float
• double
• chararray: java.lang.String
• Bytearray
• Complex types - can contain data of any type, including
other complex types
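
A short sketch of scalar types in use (the path and field names are assumed):

-- declare scalar types in a schema, then compute with them
trades = load 'NYSE_daily' as (symbol:chararray, price:double, volume:long);
value  = foreach trades generate symbol, price * volume;  -- double * long yields double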
Complex Types - map
A map is a mapping from a chararray key to a data element: <key, value>
• Key: a chararray
• Value: any Pig type, including complex types
  • By default, the value is assumed to be a bytearray
  • Programmers can cast the value explicitly
  • Otherwise, Pig will make a best guess based on how the value is used
• There is no requirement that all values in a map be of the same type

Example: ['name'#'bob', 'age'#55]

Brackets delimit the map, a hash separates each key from its value, and commas separate the key-value pairs.
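
A short sketch of projecting a value out of a map field (the same 'baseball' file and 'base_on_balls' key appear on the cast slide at the end; the schema here is assumed):

-- look up a value by key; the value comes back as a bytearray by default
player = load 'baseball' as (name:chararray, bat:map[]);
walks  = foreach player generate name, bat#'base_on_balls';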
Complex Types - tuple
A tuple is a fixed-length, ordered collection of Pig data elements.
• Tuples are divided into fields, with each field containing one data element.
• These elements can be of any type; they do not all need to be the same type.
• Pig can check the data type of each field in the tuple.
• A field can be referred to by its position or by its field name.

Example: ('bob', 55)

Parentheses indicate the tuple, and commas delimit the fields in the tuple.
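
A sketch of both reference styles (the 'people' file and its schema are assumed):

-- refer to fields by name or by position ($0 is the first field)
people = load 'people' as (name:chararray, age:int);
byname = foreach people generate name, age;
bypos  = foreach people generate $0, $1;   -- the same projection, positionally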
Complex Types - bag
• A bag is an unordered collection of tuples.
  • It is not possible to reference tuples in a bag by position.
• The bag is the one type in Pig that is not required to fit into memory.
• Pig does not provide a list or set type that can store items of any type. It is possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one field.

Example: {('bob', 55), ('sally', 52), ('john', 25)}

Braces delimit the bag, with the tuples in the bag separated by commas.
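
Bags arise naturally from grouping: group produces one record per key, holding a bag of the matching tuples (the path and fields are assumed):

-- each grpd record is (group, {bag of divs tuples})
divs = load 'NYSE_dividends' as (symbol:chararray, dividend:float);
grpd = group divs by symbol;
avgs = foreach grpd generate group, AVG(divs.dividend);  -- aggregate over the bag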
Nulls
• Data of any type can be null.
• In Pig, the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, or Python.
• In Pig, a null data element means the value is unknown. This might be because the data is missing, an error occurred while processing it, etc.
• This affects how Pig treats null data: as in SQL, an expression involving a null evaluates to null.
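
A common consequence, also used on the schema slides below, is filtering out records whose value is unknown (the path and fields are assumed):

-- drop records where the dividend is unknown
divs  = load 'NYSE_dividends' as (symbol:chararray, dividend:float);
known = filter divs by dividend is not null;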
Schemas
• Schemas enable programmers to assign names to fields and to declare types for fields.
• Schema handling:
  • You can define a schema that includes both the field name and the field type.
  • You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
  • You can choose not to define a schema; in this case, the field is unnamed and the field type defaults to bytearray.
Communicating the schema to Pig
• Runtime declaration:

dividends = load 'NYSE_dividends' as
    (exchange:chararray, symbol:chararray, date:chararray, dividend:float);

• Pig now expects your data to have four fields with the specified types.
  • If a record has more fields, Pig will truncate the extra ones.
  • If it has fewer, Pig will pad the end of the record with nulls.
• It is also possible to specify the schema without giving explicit data types; in this case, each field's type is assumed to be bytearray:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
Schema syntax
• The schema may be stored with the data rather than declared at load time:
  • It can be stored in a metadata repository such as HCatalog.
  • It can be stored in the data itself (if, for example, the data is stored in JSON format).
• Pig will fetch the schema from the load function before doing error checking on your script:

mdata    = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;

• What if we do not tell Pig the data schema?
  • Fields can be referenced by position, starting from zero (the syntax is a dollar sign followed by the position number).
  • Pig assumes bytearray and guesses each field's type from how it is used in the script:

daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;
Cast in Pig
• The syntax for casts in Pig is the same as in Java: the type name in parentheses before the value.

player     = load 'baseball' as (name:chararray, team:chararray,
                                 pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';
Supported casts
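The general rules, summarized rather than reproduced as the full table:
• A bytearray can be cast to any other type.
• The numeric types (int, long, float, double) can be cast among themselves; narrowing casts (e.g., double to int) truncate the value.
• A chararray can be cast to a numeric type; a value that does not parse as a number becomes null.
• No type can be cast to bytearray, and scalar and complex types cannot be cast to one another.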
