BDA Unit 4

Apache Pig is a high-level platform for processing large datasets using a scripting language called Pig Latin, which simplifies the development cycle compared to traditional MapReduce. It supports various execution modes and mechanisms, allowing users to run scripts interactively, in batches, or embedded within other applications. Additionally, Pig provides extensive support for User Defined Functions (UDFs) and has evolved since its inception in 2006, becoming an open-source project under the Apache Software Foundation.


Introduction to Apache Pig

Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce, along with a high-level scripting language, known as Pig Latin, that is used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts these scripts into MapReduce tasks, but this conversion is not visible to programmers, which is what provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The results of Pig are always stored in HDFS.
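
For illustration, a minimal Pig Latin script of the kind described above might look as follows (the input file and its fields are assumed for the example); the Pig Engine turns these statements into MapReduce jobs behind the scenes:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
        as (id:int, name:chararray, city:chararray);                   -- load the data and attach a schema
grunt> grouped = GROUP student BY city;                                 -- grouping is carried out in the shuffle phase
grunt> city_counts = FOREACH grouped GENERATE group, COUNT(student);    -- aggregation runs in the reducers
grunt> DUMP city_counts;                                                -- triggers execution and prints the result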

Need for Pig: One limitation of MapReduce is that its development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is time-consuming. Apache Pig reduces development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background: roughly 200 lines of Java code can be expressed in about 10 lines of Pig Latin. Programmers who already know SQL need less effort to learn Pig Latin.

● It uses a multi-query approach, which reduces the length of the code.
● Pig Latin is an SQL-like language.
● It provides many built-in operators.
● It provides nested data types (tuple, bag, map).

Evolution of Pig: Apache Pig was developed by Yahoo's researchers in 2006. At that time, the main idea behind Pig was to execute MapReduce jobs on extremely large datasets. In 2007 it moved to the Apache Software Foundation (ASF), which made it an open-source project. The first version (0.1) of Pig was released in 2008, and the latest version, 0.17.0, was released in 2017.
Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode
MapReduce mode is where we load or process data that resides in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform the corresponding operation on the data stored in HDFS.
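
For reference, the execution mode is usually selected with the -x flag when starting Pig (this assumes the pig executable from PIG_HOME/bin is on the PATH):

$ pig -x local          # run against the local file system; no Hadoop cluster needed
$ pig -x mapreduce      # default mode; Pig Latin statements are compiled into MapReduce jobs over HDFS data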

Apache Pig Execution Mechanisms


Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.

Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
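
To make the first two mechanisms concrete: the same statements can either be typed at the Grunt prompt or saved in a script file and submitted as a batch job (the file name sample_script.pig is assumed here):

$ pig -x local                         # opens the Grunt shell for interactive mode
grunt> records = LOAD 'student_data.txt' USING PigStorage(',');
grunt> Dump records;                   -- Dump prints the relation to the console

$ pig -x local sample_script.pig       # batch mode: executes every statement in the .pig file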

Pig Latin

Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming away from the Java MapReduce idiom into a higher-level notation.
Pig Latin Statements

Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and produces another relation as output.

○ A statement can span multiple lines.

○ Each statement must end with a semicolon.

○ A statement may include expressions and schemas, as in the example following this list.

○ By default, these statements are processed using multi-query execution.
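
A small, hypothetical example showing these properties (the file name and fields are assumed):

grunt> student_details = LOAD 'student_details.txt' USING PigStorage(',')
        as (id:int, firstname:chararray, age:int, city:chararray);   -- one statement spanning two lines, with a schema, ended by a semicolon
grunt> adults = FILTER student_details BY age >= 18;                 -- takes the relation student_details as input and produces the relation adults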

Pig Latin Conventions

Convention   Description

()           Parentheses enclose one or more items. They are also used to indicate the
             tuple data type.
             Example - (10, xyz, (3,6,9))

[]           Square brackets enclose one or more items. They are also used to indicate
             the map data type.
             Example - [INNER | OUTER]

{}           Curly brackets enclose two or more items. They are also used to indicate
             the bag data type.
             Example - { block | nested_block }

...          The horizontal ellipsis indicates that you can repeat a portion of the code.
             Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types

Type         Description

int          Signed 32-bit integer.
             Example - 2

long         Signed 64-bit integer.
             Example - 2L or 2l

float        32-bit floating-point number.
             Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F

double       64-bit floating-point number.
             Example - 2.5 or 2.5e2 or 2.5E2

chararray    Character array (string) in Unicode UTF-8 format.
             Example - javatpoint

bytearray    Byte array (blob).

boolean      Boolean values.
             Example - true/false

datetime     Date-time values.
             Example - 1970-01-01T00:00:00.000+00:00

biginteger   Java BigInteger values.
             Example - 5000000000000

bigdecimal   Java BigDecimal values.
             Example - 52.232344535345
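
Several of these types typically appear together in a LOAD schema, for example (the file name and field names are assumed):

grunt> orders = LOAD 'orders.txt' USING PigStorage(',')
        as (order_id:long, customer:chararray, amount:double, paid:boolean, placed_at:datetime);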

Apache Pig - User Defined Functions


In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). Using these UDFs, we can define our own functions and use them. UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby and Groovy.

Complete support for writing UDFs is provided in Java, while limited support is provided in the remaining languages. Using Java, you can write UDFs covering all parts of the processing, such as data load/store, column transformation, and aggregation. Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in the other languages.

Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.

Types of UDF’s in Java

While writing UDF’s using Java, we can create and use the following three types of functions −

​ Filter Functions − The filter functions are used as conditions in filter statements. These
functions accept a Pig value as input and return a Boolean value.
​ Eval Functions − The Eval functions are used in FOREACH-GENERATE statements.
These functions accept a Pig value as input and return a Pig result.
​ Algebraic Functions − The Algebraic functions act on inner bags in a
FOREACHGENERATE statement. These functions are used to perform full MapReduce
operations on an inner bag.

Writing UDF’s using Java

To write a UDF using Java, we have to include the jar file pig-0.15.0.jar on the build path. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.

Follow the steps given below to write a UDF function −

● Open Eclipse and create a new project (say myproject).

● Convert the newly created project into a Maven project.

● Add the Maven dependencies for the Apache Pig and hadoop-core jar files to the pom.xml, then write the UDF class. A minimal sketch of such a UDF is given after these steps.
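
The following is a minimal sketch of such an Eval UDF, assuming (to match the rest of this walk-through) that it is named sample_eval and that it upper-cases a chararray value; packaging it with Maven produces the sample_udf.jar used below.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Class name kept as sample_eval to match the DEFINE statement used later in this section
// (not conventional Java naming, but consistent with the walk-through).
public class sample_eval extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for empty or null input tuples.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Upper-case the incoming chararray value.
        String value = (String) input.get(0);
        return value.toUpperCase();
    }
}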

Using the UDF

After writing the UDF and generating the Jar file, follow the steps given below −

Step 1: Registering the Jar file

After writing the UDF (in Java), we have to register the Jar file that contains the UDF using the Register operator. By registering the Jar file, users let Apache Pig know the location of the UDF.

Syntax

Given below is the syntax of the Register operator.

REGISTER path;

Example

As an example, let us register the sample_udf.jar created above.

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.

$ cd PIG_HOME/bin

$ ./pig -x local

grunt> REGISTER '/$PIG_HOME/sample_udf.jar';

Note − it is assumed that the Jar file is available at the path /$PIG_HOME/sample_udf.jar

Step 2: Defining Alias

After registering the UDF, we can define an alias for it using the Define operator.

Syntax

Given below is the syntax of the Define operator.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example

Define the alias for sample_eval as shown below.

DEFINE sample_eval sample_eval();

Step 3: Using the UDF

After defining the alias, you can use the UDF just like the built-in functions. Suppose there is a file named emp1.txt in the HDFS directory /pig_data/ with the following content.

001,Robin,22,newyork

002,BOB,23,Kolkata

003,Maya,23,Tokyo

004,Sara,25,London

005,David,23,Bhuwaneshwar

006,Maggy,22,Chennai
007,Robert,22,newyork

008,Syam,23,Kolkata

009,Mary,25,Tokyo

010,Saran,25,London

011,Stacy,25,Bhuwaneshwar

012,Kelly,22,Chennai

And assume we have loaded this file into Pig as shown below.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, city:chararray);

Let us now convert the names of the employees into upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case as shown below.

grunt> Dump Upper_case;

(ROBIN)

(BOB)

(MAYA)

(SARA)

(DAVID)
(MAGGY)

(ROBERT)

(SYAM)

(MARY)

(SARAN)

(STACY)

(KELLY)

Data processing operators - Hive:

In the context of Hive in big data environments, data processing operators typically refer to
various components and functionalities within the Hive ecosystem that enable users to interact
with and process large datasets. Here's an overview:

1. Hive Shell: Hive provides a command-line interface called the Hive shell, which allows
users to interact with Hive using HiveQL (Hive Query Language), a SQL-like language
for querying and managing data stored in Hive. The Hive shell enables users to execute
queries, create tables, load data, and perform various data manipulation tasks.
2. Hive Services: Hive operates as a service within the Hadoop ecosystem. These services
include the Hive Metastore, Hive Server, and other auxiliary services. The Hive
Metastore stores metadata about Hive tables, partitions, and other database objects. It acts
as a central repository for schema information and helps manage table schemas, storage
locations, and other metadata.
3. Hive Metastore: The Hive Metastore is a critical component of the Hive architecture. It
stores metadata such as table and partition definitions, column information, storage
location, and other details needed to manage and query data stored in Hive. The
metastore is typically implemented using a relational database management system
(RDBMS) such as MySQL, PostgreSQL, or Derby. It allows multiple Hive instances to
share metadata and provides a centralized catalog for managing Hive tables.

These operators work together to enable users to define schemas, query data, and perform
various data processing tasks using Hive. The Hive shell provides an interface for users to
interact with Hive, while the Hive Metastore stores metadata about Hive tables and other objects,
enabling efficient query execution and management of data stored in Hive.
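
As a short, hypothetical HiveQL session illustrating these tasks (the table name, columns and input path are assumed):

hive> CREATE TABLE emp (id INT, name STRING, age INT, city STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/data/emp.csv' INTO TABLE emp;     -- moves the HDFS file under the table's warehouse directory
hive> SELECT city, COUNT(*) FROM emp GROUP BY city;        -- runs as a distributed job on the chosen execution engine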

1. Hive Shell: The Hive Shell is a command-line interface (CLI) that allows users to interact
with Hive using HiveQL commands. It provides a familiar environment for users who are
comfortable with SQL-like syntax. Here are some key features and functionalities of the
Hive Shell:
● Query Execution: Users can execute HiveQL queries to perform various
operations such as data selection, filtering, aggregation, joins, and transformations
on datasets stored in Hive tables.
● Table Management: Users can create, drop, alter, and describe tables using DDL
(Data Definition Language) commands within the Hive Shell. They can define
table schemas, specify storage formats, partitioning schemes, and other table
properties.
● Data Loading: The Hive Shell supports data loading from external sources such as
HDFS (Hadoop Distributed File System), local file systems, or other databases
using commands like LOAD DATA INPATH or INSERT INTO TABLE.
● Session Configuration: Users can configure session-level properties and settings
such as the number of MapReduce tasks, input/output formats, compression codecs,
and query execution options.
● Scripting: The Hive Shell supports scripting using languages like Bash or Python,
allowing users to automate tasks, execute multiple queries sequentially, or interact
with external systems.
2. Hive Services: Hive operates as a service within the Hadoop ecosystem, comprising
several components that work together to provide data processing capabilities. Here are
some key Hive services:
● Hive Metastore: As mentioned earlier, the Hive Metastore stores metadata about
Hive tables, partitions, columns, storage formats, and other database objects. It
serves as a centralized catalog for managing and querying data stored in Hive.
The Metastore can be configured to use different backend databases such as
MySQL, PostgreSQL, or Derby.
● Hive Server: The Hive Server provides a remote interface for clients to submit
HiveQL queries and interact with Hive. It supports multiple client interfaces,
including JDBC (Java Database Connectivity), ODBC (Open Database Connectivity),
Thrift, and HTTP. The Hive Server facilitates concurrent query execution, session
management, and authentication for multiple users accessing Hive concurrently;
a JDBC sketch is given after this list.
● Hive Execution Engine: Hive uses MapReduce, Tez, or Spark as its execution
engine for processing HiveQL queries. MapReduce is the traditional execution
engine for Hive, while Tez and Spark offer improved performance and
optimization for complex queries and interactive analysis.
● Hive CLI and Beeline: In addition to the Hive Shell, Hive provides two
command-line interfaces: Hive CLI and Beeline. Hive CLI is the legacy CLI,
while Beeline is a modern JDBC client that provides better support for JDBC
connectivity, security, and user authentication.
3. Hive Metastore: The Hive Metastore is a central component of the Hive architecture
responsible for managing metadata about Hive tables and other database objects. Here's a
closer look at its functionalities:
● Metadata Storage: The Hive Metastore stores metadata such as table definitions,
column names, data types, partitioning schemes, storage formats, and file
locations. It maintains a catalog of all Hive objects and their properties, making it
easier to manage and query data stored in Hive.
● Schema Management: Users can create, alter, drop, and describe tables using
DDL commands, and the Metastore stores the corresponding schema information.
It enforces data integrity constraints and ensures consistency across Hive tables.
● Partition Management: Hive supports partitioning of tables based on one or more
columns, which helps improve query performance and data organization. The
Metastore manages partition metadata and tracks partition locations, making it
efficient to access specific partitions during query execution.
● Concurrency Control: The Metastore handles concurrency control and metadata
locking to ensure data consistency and prevent conflicts when multiple users or
processes access and modify metadata simultaneously.
● Integration with External Systems: The Hive Metastore integrates with various
external systems and tools within the Hadoop ecosystem, including HDFS,
YARN, MapReduce, Tez, Spark, and other components. It provides seamless
interoperability and metadata sharing across different data processing frameworks
and applications.
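
As an illustration of the JDBC interface mentioned above, a client might connect to HiveServer2 roughly as follows (the host, port, database, credentials and the emp table are assumptions for this sketch; the driver class org.apache.hive.jdbc.HiveDriver ships with the Hive JDBC client jars):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical connection details; adjust host, port, database and credentials.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = con.createStatement();
        // Submit a HiveQL query; it is executed by the Hive execution engine.
        ResultSet rs = stmt.executeQuery("SELECT name, age FROM emp LIMIT 5");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}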

Comparison with traditional databases


When comparing Hive with traditional databases in big data environments, several key
differences and considerations arise due to their distinct architectures, use cases, and
design philosophies:
1. Data Model:
● Traditional Databases: These databases adhere to the relational model,
where data is organized into tables with a fixed schema. This means that
data types, column names, and constraints must be defined upfront before
inserting data into the database. This model ensures data integrity and
consistency through normalization and enforced constraints.
● Hive: In contrast, Hive follows a schema-on-read approach. Data is stored
as files in a distributed file system, and the schema is applied at the time of
querying rather than during data ingestion. This allows for flexibility in
handling various data formats, including structured, semi-structured, and
unstructured data. Hive's schema-on-read model is advantageous for
processing diverse datasets where the schema may evolve over time or
where upfront schema definition is not practical; a brief example follows
this comparison.
2. Query Language:
● Traditional Databases: SQL is the standard query language used in
traditional databases. It provides a rich set of features for data
manipulation, including querying, filtering, joining, aggregating, and
transaction management. SQL is optimized for relational data models and
is well-suited for OLTP workloads.
● Hive: HiveQL is Hive's query language, which is similar to SQL but
tailored for distributed data processing. While it shares similarities with
SQL, HiveQL has certain limitations and differences due to the underlying
distributed computing framework. For example, complex queries or joins
may require optimization techniques specific to Hive, and certain SQL
features may not be fully supported or may have performance implications
in a distributed environment.
3. Scalability and Performance:
● Traditional Databases: Traditional databases are typically designed for
vertical scalability, where additional resources are added to a single server
to handle increased workloads. While modern traditional databases may
support some degree of horizontal scalability through clustering or
replication, they are not as inherently scalable as distributed systems like
Hive.
● Hive: Hive is designed for horizontal scalability, allowing users to scale
out by adding more nodes to the cluster. It leverages distributed storage
and processing frameworks such as Hadoop MapReduce, Apache Tez, or
Apache Spark to parallelize data processing across multiple nodes. This
architecture enables Hive to handle large-scale datasets and parallelize
complex analytical queries, making it well-suited for big data processing
and analytics.
4. Metadata Management:
● Traditional Databases: Metadata management in traditional databases is
tightly coupled with the database system itself. Metadata, such as table
definitions, indexes, and constraints, is stored within the database instance
and managed by the database management system (DBMS).
● Hive: Hive separates metadata management from data storage through the
Hive Metastore. The Metastore stores metadata about Hive tables,
partitions, columns, and storage formats in a separate repository, often
using a relational database backend. This separation allows for metadata
sharing across multiple Hive instances and facilitates integration with
other components in the Hadoop ecosystem.
5. Use Cases:
● Traditional Databases: Traditional databases are commonly used for OLTP
applications, where the emphasis is on high transactional throughput, low
latency, and strong consistency guarantees. They are well-suited for
transaction processing tasks such as online retail, banking, inventory
management, and order processing.
● Hive: Hive is primarily used for OLAP and batch processing workloads,
where the focus is on analyzing large volumes of historical data to derive
insights and make data-driven decisions. It is often used in data
warehousing, business intelligence, reporting, and ETL (Extract,
Transform, Load) workflows, where the ability to handle massive datasets
and perform complex analytics is crucial.
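
To make the schema-on-read contrast concrete (as referenced above): in Hive the data files can already sit in HDFS, and a table definition simply overlays a schema on them at query time. The directory path and columns below are assumed:

hive> CREATE EXTERNAL TABLE web_logs (ip STRING, ts STRING, url STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/web_logs/';                          -- no data is moved or validated at this point
hive> SELECT url, COUNT(*) FROM web_logs GROUP BY url;     -- the schema is applied only when the files are read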

In summary, while traditional databases and Hive serve different purposes and have distinct
architectures, both play important roles in managing and analyzing data in various contexts. The
choice between them depends on factors such as the nature of the data, performance
requirements, scalability needs, and the types of applications and workloads being supported.

HiveQL and BigSQL
HiveQL and BigSQL are both SQL-based query languages designed for big data processing, but
they are associated with different platforms and ecosystems. Let's explore each one:
1. HiveQL:
● Platform: HiveQL is the query language used with Apache Hive, a data
warehouse infrastructure built on top of Apache Hadoop. Hive provides a
high-level abstraction for querying and analyzing large datasets stored in Hadoop
Distributed File System (HDFS) or other compatible distributed storage systems.
● Syntax and Features: HiveQL syntax closely resembles SQL, making it relatively
easy for users familiar with SQL to transition to HiveQL. It supports a wide range
of SQL-like operations such as SELECT, INSERT, UPDATE, DELETE, JOIN,
GROUP BY, ORDER BY, and more. Additionally, HiveQL includes extensions
and optimizations specific to distributed computing environments, allowing users
to perform complex analytical queries on massive datasets efficiently.
● Data Formats: One of the key features of HiveQL is its support for various data
formats, including structured, semi-structured, and unstructured data. Hive can
process data stored in formats such as CSV, JSON, Parquet, ORC, Avro, and
more. This flexibility enables users to work with diverse datasets without needing
to perform extensive data preprocessing or conversion.
● Use Cases: HiveQL is primarily used for OLAP and batch processing workloads
in big data environments. It is commonly employed in data warehousing, business
intelligence, reporting, and ETL workflows. Hive's strengths lie in its ability to
handle large-scale datasets and parallelize complex analytical queries across
distributed compute resources.
2. BigSQL:
● Platform: BigSQL is a SQL-on-Hadoop solution provided by IBM as part of its
IBM BigInsights platform. It extends the capabilities of Apache Hadoop with
additional features and optimizations tailored for SQL-based analytics.
● Syntax and Features: BigSQL also offers a SQL-based interface for querying and
analyzing data stored in Hadoop Distributed File System (HDFS) or other
compatible storage systems. Its syntax is similar to standard SQL, making it
accessible to users familiar with relational databases. BigSQL includes features
for advanced analytics, machine learning, and integration with other IBM
BigInsights components.
● Data Formats: Similar to Hive, BigSQL supports various data formats commonly
used in big data environments, including structured and semi-structured formats
such as CSV, JSON, Parquet, ORC, Avro, and more. This flexibility enables users
to work with diverse datasets and perform analytical tasks efficiently.
● Use Cases: BigSQL is suitable for a wide range of use cases, including data
warehousing, business intelligence, ad-hoc querying, and operational analytics. It
is often used in industries such as finance, telecommunications, healthcare, and
retail for processing large volumes of data and deriving insights from it.

In summary, both HiveQL and BigSQL provide SQL-based interfaces for querying and
analyzing data in big data environments. While HiveQL is associated with the Apache Hive
ecosystem and is part of the broader Hadoop ecosystem, BigSQL is offered as part of IBM's
BigInsights platform, with additional features and optimizations tailored for IBM environments.
The choice between them may depend on factors such as existing infrastructure, vendor
preferences, feature requirements, and integration with other tools and systems.
