BDA Unit 4

Apache Pig is a high-level platform for processing large datasets using a scripting language called Pig Latin, which simplifies the development cycle compared to traditional MapReduce. It supports various execution modes and mechanisms, allowing users to run scripts interactively, in batches, or embedded within other applications. Additionally, Pig provides extensive support for User Defined Functions (UDFs) and has evolved since its inception in 2006, becoming an open-source project under the Apache Software Foundation.


Introduction to Apache Pig

Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce, along with a high-level scripting language, known as Pig Latin, that is used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts these scripts into MapReduce tasks, but this conversion is not visible to programmers, which is what provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The results of Pig are always stored in HDFS.
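
For illustration, a minimal Pig Latin script of the kind described above might look as follows (the input file and its fields are assumed for the example); the Pig Engine turns these statements into MapReduce jobs behind the scenes:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
        as (id:int, name:chararray, city:chararray);                   -- load the data and attach a schema
grunt> grouped = GROUP student BY city;                                 -- grouping is carried out in the shuffle phase
grunt> city_counts = FOREACH grouped GENERATE group, COUNT(student);    -- aggregation runs in the reducers
grunt> DUMP city_counts;                                                -- triggers execution and prints the result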

Need for Pig: One limitation of MapReduce is that its development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is time-consuming. Apache Pig reduces development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: roughly 200 lines of Java code can be expressed in about 10 lines of Pig Latin. Programmers who already know SQL need less effort to learn Pig Latin.

● It uses a multi-query approach, which reduces the length of the code.
● Pig Latin is an SQL-like language.
● It provides many built-in operators.
● It provides nested data types (tuple, bag, map).

Evolution of Pig: Apache Pig was developed by Yahoo's researchers in 2006. At that time, the main idea behind Pig was to execute MapReduce jobs on extremely large datasets. In 2007 it moved to the Apache Software Foundation (ASF), which made it an open-source project. The first version (0.1) of Pig was released in 2008, and the latest version, 0.17.0, was released in 2017.
Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode
MapReduce mode is where we load or process data that resides in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform the corresponding operation on the data stored in HDFS.
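
For reference, the execution mode is usually selected with the -x flag when starting Pig (this assumes the pig executable from PIG_HOME/bin is on the PATH):

$ pig -x local          # run against the local file system; no Hadoop cluster needed
$ pig -x mapreduce      # default mode; Pig Latin statements are compiled into MapReduce jobs over HDFS data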

Apache Pig Execution Mechanisms


Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.

Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
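
To make the first two mechanisms concrete: the same statements can either be typed at the Grunt prompt or saved in a script file and submitted as a batch job (the file name sample_script.pig is assumed here):

$ pig -x local                         # opens the Grunt shell for interactive mode
grunt> records = LOAD 'student_data.txt' USING PigStorage(',');
grunt> Dump records;                   -- Dump prints the relation to the console

$ pig -x local sample_script.pig       # batch mode: executes every statement in the .pig file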

Pig Latin

Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming away from the Java MapReduce idiom into a higher-level notation.
Pig Latin Statements

Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and produces another relation as output.

○ A statement can span multiple lines.

○ Each statement must end with a semicolon.

○ A statement may include expressions and schemas, as in the example following this list.

○ By default, these statements are processed using multi-query execution.
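
A small, hypothetical example showing these properties (the file name and fields are assumed):

grunt> student_details = LOAD 'student_details.txt' USING PigStorage(',')
        as (id:int, firstname:chararray, age:int, city:chararray);   -- one statement spanning two lines, with a schema, ended by a semicolon
grunt> adults = FILTER student_details BY age >= 18;                 -- takes the relation student_details as input and produces the relation adults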

Pig Latin Conventions

Convention   Description

()           Parentheses enclose one or more items. They are also used to indicate the
             tuple data type.
             Example - (10, xyz, (3,6,9))

[]           Square brackets enclose one or more items. They are also used to indicate
             the map data type.
             Example - [INNER | OUTER]

{}           Curly brackets enclose two or more items. They are also used to indicate
             the bag data type.
             Example - { block | nested_block }

...          The horizontal ellipsis indicates that you can repeat a portion of the code.
             Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types

Type         Description

int          Signed 32-bit integer.
             Example - 2

long         Signed 64-bit integer.
             Example - 2L or 2l

float        32-bit floating-point number.
             Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F

double       64-bit floating-point number.
             Example - 2.5 or 2.5e2 or 2.5E2

chararray    Character array (string) in Unicode UTF-8 format.
             Example - javatpoint

bytearray    Byte array (blob).

boolean      Boolean values.
             Example - true/false

datetime     Date-time values.
             Example - 1970-01-01T00:00:00.000+00:00

biginteger   Java BigInteger values.
             Example - 5000000000000

bigdecimal   Java BigDecimal values.
             Example - 52.232344535345
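
Several of these types typically appear together in a LOAD schema, for example (the file name and field names are assumed):

grunt> orders = LOAD 'orders.txt' USING PigStorage(',')
        as (order_id:long, customer:chararray, amount:double, paid:boolean, placed_at:datetime);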

Apache Pig - User Defined Functions


In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). Using these UDFs, we can define our own functions and use them. UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby and Groovy.

Complete support for writing UDFs is provided in Java, while limited support is provided in the remaining languages. Using Java, you can write UDFs covering all parts of the processing, such as data load/store, column transformation, and aggregation. Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in the other languages.

Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.

Types of UDF’s in Java

While writing UDF’s using Java, we can create and use the following three types of functions −

​ Filter Functions − The filter functions are used as conditions in filter statements. These
functions accept a Pig value as input and return a Boolean value.
​ Eval Functions − The Eval functions are used in FOREACH-GENERATE statements.
These functions accept a Pig value as input and return a Pig result.
​ Algebraic Functions − The Algebraic functions act on inner bags in a
FOREACHGENERATE statement. These functions are used to perform full MapReduce
operations on an inner bag.

Writing UDF’s using Java

To write a UDF using Java, we have to include the jar file pig-0.15.0.jar on the build path. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.

Follow the steps given below to write a UDF function −

● Open Eclipse and create a new project (say myproject).

● Convert the newly created project into a Maven project.

● Add the Maven dependencies for the Apache Pig and hadoop-core jar files to the pom.xml, then write the UDF class. A minimal sketch of such a UDF is given after these steps.
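
The following is a minimal sketch of such an Eval UDF, assuming (to match the rest of this walk-through) that it is named sample_eval and that it upper-cases a chararray value; packaging it with Maven produces the sample_udf.jar used below.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Class name kept as sample_eval to match the DEFINE statement used later in this section
// (not conventional Java naming, but consistent with the walk-through).
public class sample_eval extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for empty or null input tuples.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Upper-case the incoming chararray value.
        String value = (String) input.get(0);
        return value.toUpperCase();
    }
}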

Using the UDF

After writing the UDF and generating the Jar file, follow the steps given below −

Step 1: Registering the Jar file

After writing the UDF (in Java), we have to register the Jar file that contains the UDF using the Register operator. By registering the Jar file, users let Apache Pig know the location of the UDF.

Syntax

Given below is the syntax of the Register operator.

REGISTER path;

Example

As an example, let us register the sample_udf.jar created above.

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.

$ cd PIG_HOME/bin

$ ./pig -x local

grunt> REGISTER '/$PIG_HOME/sample_udf.jar';

Note − it is assumed that the Jar file is available at the path /$PIG_HOME/sample_udf.jar

Step 2: Defining Alias

After registering the UDF, we can define an alias for it using the Define operator.

Syntax

Given below is the syntax of the Define operator.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example

Define the alias for sample_eval as shown below.

DEFINE sample_eval sample_eval();

Step 3: Using the UDF

After defining the alias, you can use the UDF just like the built-in functions. Suppose there is a file named emp1.txt in the HDFS directory /pig_data/ with the following content.

001,Robin,22,newyork

002,BOB,23,Kolkata

003,Maya,23,Tokyo

004,Sara,25,London

005,David,23,Bhuwaneshwar

006,Maggy,22,Chennai
007,Robert,22,newyork

008,Syam,23,Kolkata

009,Mary,25,Tokyo

010,Saran,25,London

011,Stacy,25,Bhuwaneshwar

012,Kelly,22,Chennai

And assume we have loaded this file into Pig as shown below.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

Let us now convert the names of the employees into upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case as shown below.

grunt> Dump Upper_case;

(ROBIN)

(BOB)

(MAYA)

(SARA)

(DAVID)
(MAGGY)

(ROBERT)

(SYAM)

(MARY)

(SARAN)

(STACY)

(KELLY)

Data processing operators - Hive:

In the context of Hive in big data environments, data processing operators typically refer to
various components and functionalities within the Hive ecosystem that enable users to interact
with and process large datasets. Here's an overview:

1. Hive Shell: Hive provides a command-line interface called the Hive shell, which allows
users to interact with Hive using HiveQL (Hive Query Language), a SQL-like language
for querying and managing data stored in Hive. The Hive shell enables users to execute
queries, create tables, load data, and perform various data manipulation tasks.
2. Hive Services: Hive operates as a service within the Hadoop ecosystem. These services
include the Hive Metastore, Hive Server, and other auxiliary services. The Hive
Metastore stores metadata about Hive tables, partitions, and other database objects. It acts
as a central repository for schema information and helps manage table schemas, storage
locations, and other metadata.
3. Hive Metastore: The Hive Metastore is a critical component of the Hive architecture. It
stores metadata such as table and partition definitions, column information, storage
location, and other details needed to manage and query data stored in Hive. The
metastore is typically implemented using a relational database management system
(RDBMS) such as MySQL, PostgreSQL, or Derby. It allows multiple Hive instances to
share metadata and provides a centralized catalog for managing Hive tables.

These operators work together to enable users to define schemas, query data, and perform
various data processing tasks using Hive. The Hive shell provides an interface for users to
interact with Hive, while the Hive Metastore stores metadata about Hive tables and other objects,
enabling efficient query execution and management of data stored in Hive.
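
As a short, hypothetical HiveQL session illustrating these tasks (the table name, columns and input path are assumed):

hive> CREATE TABLE emp (id INT, name STRING, age INT, city STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/data/emp.csv' INTO TABLE emp;     -- moves the HDFS file under the table's warehouse directory
hive> SELECT city, COUNT(*) FROM emp GROUP BY city;        -- runs as a distributed job on the chosen execution engine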

1. Hive Shell: The Hive Shell is a command-line interface (CLI) that allows users to interact
with Hive using HiveQL commands. It provides a familiar environment for users who are
comfortable with SQL-like syntax. Here are some key features and functionalities of the
Hive Shell:
● Query Execution: Users can execute HiveQL queries to perform various
operations such as data selection, filtering, aggregation, joins, and transformations
on datasets stored in Hive tables.
● Table Management: Users can create, drop, alter, and describe tables using DDL
(Data Definition Language) commands within the Hive Shell. They can define
table schemas, specify storage formats, partitioning schemes, and other table
properties.
● Data Loading: The Hive Shell supports data loading from external sources such as
HDFS (Hadoop Distributed File System), local file systems, or other databases
using commands like LOAD DATA INPATH or INSERT INTO TABLE.
● Session Configuration: Users can configure session-level properties and settings
such as the number of MapReduce tasks, input/output formats, compression codecs,
and query execution options.
● Scripting: The Hive Shell supports scripting using languages like Bash or Python,
allowing users to automate tasks, execute multiple queries sequentially, or interact
with external systems.
2. Hive Services: Hive operates as a service within the Hadoop ecosystem, comprising
several components that work together to provide data processing capabilities. Here are
some key Hive services:
● Hive Metastore: As mentioned earlier, the Hive Metastore stores metadata about
Hive tables, partitions, columns, storage formats, and other database objects. It
serves as a centralized catalog for managing and querying data stored in Hive.
The Metastore can be configured to use different backend databases such as
MySQL, PostgreSQL, or Derby.
● Hive Server: The Hive Server provides a remote interface for clients to submit
HiveQL queries and interact with Hive. It supports multiple client interfaces,
including JDBC (Java Database Connectivity), ODBC (Open Database Connectivity),
Thrift, and HTTP. The Hive Server facilitates concurrent query execution, session
management, and authentication for multiple users accessing Hive concurrently;
a JDBC sketch is given after this list.
● Hive Execution Engine: Hive uses MapReduce, Tez, or Spark as its execution
engine for processing HiveQL queries. MapReduce is the traditional execution
engine for Hive, while Tez and Spark offer improved performance and
optimization for complex queries and interactive analysis.
● Hive CLI and Beeline: In addition to the Hive Shell, Hive provides two
command-line interfaces: Hive CLI and Beeline. Hive CLI is the legacy CLI,
while Beeline is a modern JDBC client that provides better support for JDBC
connectivity, security, and user authentication.
3. Hive Metastore: The Hive Metastore is a central component of the Hive architecture
responsible for managing metadata about Hive tables and other database objects. Here's a
closer look at its functionalities:
● Metadata Storage: The Hive Metastore stores metadata such as table definitions,
column names, data types, partitioning schemes, storage formats, and file
locations. It maintains a catalog of all Hive objects and their properties, making it
easier to manage and query data stored in Hive.
● Schema Management: Users can create, alter, drop, and describe tables using
DDL commands, and the Metastore stores the corresponding schema information.
It enforces data integrity constraints and ensures consistency across Hive tables.
● Partition Management: Hive supports partitioning of tables based on one or more
columns, which helps improve query performance and data organization. The
Metastore manages partition metadata and tracks partition locations, making it
efficient to access specific partitions during query execution.
● Concurrency Control: The Metastore handles concurrency control and metadata
locking to ensure data consistency and prevent conflicts when multiple users or
processes access and modify metadata simultaneously.
● Integration with External Systems: The Hive Metastore integrates with various
external systems and tools within the Hadoop ecosystem, including HDFS,
YARN, MapReduce, Tez, Spark, and other components. It provides seamless
interoperability and metadata sharing across different data processing frameworks
and applications.
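
As an illustration of the JDBC interface mentioned above, a client might connect to HiveServer2 roughly as follows (the host, port, database, credentials and the emp table are assumptions for this sketch; the driver class org.apache.hive.jdbc.HiveDriver ships with the Hive JDBC client jars):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical connection details; adjust host, port, database and credentials.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = con.createStatement();
        // Submit a HiveQL query; it is executed by the Hive execution engine.
        ResultSet rs = stmt.executeQuery("SELECT name, age FROM emp LIMIT 5");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}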

Comparison with traditional databases


When comparing Hive with traditional databases in big data environments, several key
differences and considerations arise due to their distinct architectures, use cases, and
design philosophies:
1. Data Model:
● Traditional Databases: These databases adhere to the relational model,
where data is organized into tables with a fixed schema. This means that
data types, column names, and constraints must be defined upfront before
inserting data into the database. This model ensures data integrity and
consistency through normalization and enforced constraints.
● Hive: In contrast, Hive follows a schema-on-read approach. Data is stored
as files in a distributed file system, and the schema is applied at the time of
querying rather than during data ingestion. This allows for flexibility in
handling various data formats, including structured, semi-structured, and
unstructured data. Hive's schema-on-read model is advantageous for
processing diverse datasets where the schema may evolve over time or
where upfront schema definition is not practical; a brief example follows
this comparison.
2. Query Language:
● Traditional Databases: SQL is the standard query language used in
traditional databases. It provides a rich set of features for data
manipulation, including querying, filtering, joining, aggregating, and
transaction management. SQL is optimized for relational data models and
is well-suited for OLTP workloads.
● Hive: HiveQL is Hive's query language, which is similar to SQL but
tailored for distributed data processing. While it shares similarities with
SQL, HiveQL has certain limitations and differences due to the underlying
distributed computing framework. For example, complex queries or joins
may require optimization techniques specific to Hive, and certain SQL
features may not be fully supported or may have performance implications
in a distributed environment.
3. Scalability and Performance:
● Traditional Databases: Traditional databases are typically designed for
vertical scalability, where additional resources are added to a single server
to handle increased workloads. While modern traditional databases may
support some degree of horizontal scalability through clustering or
replication, they are not as inherently scalable as distributed systems like
Hive.
● Hive: Hive is designed for horizontal scalability, allowing users to scale
out by adding more nodes to the cluster. It leverages distributed storage
and processing frameworks such as Hadoop MapReduce, Apache Tez, or
Apache Spark to parallelize data processing across multiple nodes. This
architecture enables Hive to handle large-scale datasets and parallelize
complex analytical queries, making it well-suited for big data processing
and analytics.
4. Metadata Management:
● Traditional Databases: Metadata management in traditional databases is
tightly coupled with the database system itself. Metadata, such as table
definitions, indexes, and constraints, is stored within the database instance
and managed by the database management system (DBMS).
● Hive: Hive separates metadata management from data storage through the
Hive Metastore. The Metastore stores metadata about Hive tables,
partitions, columns, and storage formats in a separate repository, often
using a relational database backend. This separation allows for metadata
sharing across multiple Hive instances and facilitates integration with
other components in the Hadoop ecosystem.
5. Use Cases:
● Traditional Databases: Traditional databases are commonly used for OLTP
applications, where the emphasis is on high transactional throughput, low
latency, and strong consistency guarantees. They are well-suited for
transaction processing tasks such as online retail, banking, inventory
management, and order processing.
● Hive: Hive is primarily used for OLAP and batch processing workloads,
where the focus is on analyzing large volumes of historical data to derive
insights and make data-driven decisions. It is often used in data
warehousing, business intelligence, reporting, and ETL (Extract,
Transform, Load) workflows, where the ability to handle massive datasets
and perform complex analytics is crucial.
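
To make the schema-on-read contrast concrete (as referenced above): in Hive the data files can already sit in HDFS, and a table definition simply overlays a schema on them at query time. The directory path and columns below are assumed:

hive> CREATE EXTERNAL TABLE web_logs (ip STRING, ts STRING, url STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/web_logs/';                          -- no data is moved or validated at this point
hive> SELECT url, COUNT(*) FROM web_logs GROUP BY url;     -- the schema is applied only when the files are read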

In summary, while traditional databases and Hive serve different purposes and have distinct
architectures, both play important roles in managing and analyzing data in various contexts. The
choice between them depends on factors such as the nature of the data, performance
requirements, scalability needs, and the types of applications and workloads being supported.

HiveQL and BigSQL
HiveQL and BigSQL are both SQL-based query languages designed for big data processing, but
they are associated with different platforms and ecosystems. Let's explore each one:
1. HiveQL:
● Platform: HiveQL is the query language used with Apache Hive, a data
warehouse infrastructure built on top of Apache Hadoop. Hive provides a
high-level abstraction for querying and analyzing large datasets stored in Hadoop
Distributed File System (HDFS) or other compatible distributed storage systems.
● Syntax and Features: HiveQL syntax closely resembles SQL, making it relatively
easy for users familiar with SQL to transition to HiveQL. It supports a wide range
of SQL-like operations such as SELECT, INSERT, UPDATE, DELETE, JOIN,
GROUP BY, ORDER BY, and more. Additionally, HiveQL includes extensions
and optimizations specific to distributed computing environments, allowing users
to perform complex analytical queries on massive datasets efficiently.
● Data Formats: One of the key features of HiveQL is its support for various data
formats, including structured, semi-structured, and unstructured data. Hive can
process data stored in formats such as CSV, JSON, Parquet, ORC, Avro, and
more. This flexibility enables users to work with diverse datasets without needing
to perform extensive data preprocessing or conversion.
● Use Cases: HiveQL is primarily used for OLAP and batch processing workloads
in big data environments. It is commonly employed in data warehousing, business
intelligence, reporting, and ETL workflows. Hive's strengths lie in its ability to
handle large-scale datasets and parallelize complex analytical queries across
distributed compute resources.
2. BigSQL:
● Platform: BigSQL is a SQL-on-Hadoop solution provided by IBM as part of its
IBM BigInsights platform. It extends the capabilities of Apache Hadoop with
additional features and optimizations tailored for SQL-based analytics.
● Syntax and Features: BigSQL also offers a SQL-based interface for querying and
analyzing data stored in Hadoop Distributed File System (HDFS) or other
compatible storage systems. Its syntax is similar to standard SQL, making it
accessible to users familiar with relational databases. BigSQL includes features
for advanced analytics, machine learning, and integration with other IBM
BigInsights components.
● Data Formats: Similar to Hive, BigSQL supports various data formats commonly
used in big data environments, including structured and semi-structured formats
such as CSV, JSON, Parquet, ORC, Avro, and more. This flexibility enables users
to work with diverse datasets and perform analytical tasks efficiently.
● Use Cases: BigSQL is suitable for a wide range of use cases, including data
warehousing, business intelligence, ad-hoc querying, and operational analytics. It
is often used in industries such as finance, telecommunications, healthcare, and
retail for processing large volumes of data and deriving insights from it.

In summary, both HiveQL and BigSQL provide SQL-based interfaces for querying and
analyzing data in big data environments. While HiveQL is associated with the Apache Hive
ecosystem and is part of the broader Hadoop ecosystem, BigSQL is offered as part of IBM's
BigInsights platform, with additional features and optimizations tailored for IBM environments.
The choice between them may depend on factors such as existing infrastructure, vendor
preferences, feature requirements, and integration with other tools and systems.
