
UNIT-V

Frameworks and Applications


Frameworks and Applications: Hadoop Ecosystem, Applications on Big Data Using Pig,
Pig Architecture, Data processing operators in Pig, Applications on Big Data Using Hive, Hive
Architecture, HiveQL, Querying Data in Hive, fundamentals of HBase and ZooKeeper.
Visualizations, Visual data analysis techniques, interaction techniques, Systems and
applications.

Hadoop Ecosystem :
The following are the components of Hadoop ecosystem:
1. HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as
possible.
2. HBase: It is Hadoop’s distributed, column-oriented database. It supports structured data storage
for large tables.
3. Hive: It is Hadoop’s data warehouse; it enables analysis of large data sets using a language very
similar to SQL. So, one can access data stored in a Hadoop cluster by using Hive.
4. Pig: Pig is an easy-to-understand data flow language. It helps with the analysis of the large data
sets that are common with Hadoop, without writing code in the MapReduce paradigm.

5. Zookeeper: It is an open source application that configures and synchronizes distributed
systems.
6. Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
7. Mahout: It is a scalable Machine Learning and data mining library.
8. Chukwa: It is a data collection system for managing large distributed systems.
9. Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as
relational databases.
10. Ambari: It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop
clusters.

Hadoop PIG

What is Pig in Hadoop ecosystem?

 Apache Pig is an abstraction over MapReduce. It is a tool/platform that is used to
analyze large sets of data, representing them as data flows. Pig is generally used with
Hadoop; we can perform all the data manipulation operations in Hadoop using
Apache Pig.
 To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache
Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input
and converts those scripts into MapReduce jobs.

Why Do We Need Apache Pig?


Programmers who are not very good at Java normally struggle while working with Hadoop,
especially while performing MapReduce tasks. Apache Pig is an alternative for all such
programmers.
 Using Pig Latin, programmers can perform MapReduce tasks easily without having to
type complex codes in Java.
 Apache Pig uses a multi-query approach, thereby reducing the length of code. For
example, an operation that would require you to type 200 lines of code (LoC) in Java
can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig
reduces the development time by almost 16 times.
 Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar
with SQL.
 Apache Pig provides many built-in operators to support data operations like joins,
filters, ordering, etc. In addition, it also provides nested data types like tuples, bags,
and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features:
 Rich set of operators: It provides many operators to perform operations like join, sort,
filter, etc.
 Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.
 Optimization opportunities: The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.
 Extensibility: Using the existing operators, users can develop their own functions to
read, process, and write data.
 UDF’s: Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
 Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.

Applications on Big Data Using Pig

 Processes large volumes of data
 Supports quick prototyping and ad-hoc queries across large datasets
 Performs data processing in search platforms
 Processes time-sensitive data loads
 Used by telecom companies to de-identify user call data.
PIG ARCHITECTURE

 To perform a particular task, programmers need to write a Pig script using the Pig Latin
language, and execute them using any of the execution mechanisms (Grunt Shell, UDFs,
Embedded).
 After execution, these scripts will go through
a series of transformations applied by the Pig
Framework, to produce the desired output.
 Internally, Apache Pig converts these scripts
into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The
architecture of Apache Pig is shown in figure
below.

 Parser
o Initially the Pig Scripts are handled by
the Parser. It checks the syntax of the
script, does type checking, and other
miscellaneous checks.
o The output of the parser will be a DAG
(Directed Acyclic Graph), which
represents the Pig Latin statements
and logical operators.
o In the DAG, the logical operators of the
script are represented as the nodes
and the data flows are represented as
edges.
 Optimizer
o The logical plan (DAG) is passed to the
logical optimizer, which carries out the
logical optimizations such as
projection and pushdown.
 Compiler
o The compiler compiles the optimized logical plan into a series of MapReduce jobs.
 Execution engine
o Finally, the MapReduce jobs are submitted to Hadoop in sorted order, and these
MapReduce jobs are executed on Hadoop, producing the desired results.

Pig Latin–Data Model


 Pig’s data types make up the data model for the way Pig thinks about the structure of the
data it processes. With Pig, the data model is specified when the data is loaded.
 Any data that you load from the disk into Pig will have a specific schema and structure.
Pig needs to understand the structure so the data can automatically go through a
mapping when you do the loading.
 Pig data types can be divided into two groups in general terms:
o Scalar type
 It contains a single value
o Complex type
 It includes other values, such as Tuple, Bag, and Map, described
below.

Given below is the diagrammatical representation of Pig Latin’s data model.


Atom
 Any single value in Pig Latin, irrespective of its datatype, is known as an Atom. The
atomic values of Pig are of scalar types: int, long, float, double, chararray, and bytearray.
 A piece of data or a simple atomic value is known as a field.
o Example: ‘john’ or ‘30’

Tuple
 A record that is formed by an ordered set of fields is known as a tuple, the fields can
be of any type. A tuple is similar to a row in a table of RDBMS.
o Example: (john, 30)

Bag
 A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema).
 A bag is represented by ‘{ }’.
 It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary
that every tuple contain the same number of fields or that the fields in the same
position (column) have the same type.
o Example: {(Adam, 30), (John, 45)}
 A bag can be a field in a relation; in that context, it is known as inner bag.
o Example: {Adam, 30, {9848022338, [email protected],}}

Relation
 A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).

Map
 A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented
by ‘[ ]’
o Example: [name#john, age#30]
NULL
 Pig’s notion of null means that the value is unknown. Nulls can show up in the data in
cases where values are unreadable or unrecognizable, for example, if you
use the wrong data type in the LOAD statement.

Running Pig Programs


There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
 Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the
commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e
option to run a script specified as a string on the command line.
 Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is
specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from
within Grunt using run and exec.
 Embedded
You can run Pig programs from Java using the PigServer class, much like you can use JDBC
to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
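For illustration, a minimal hedged sketch of invoking each mechanism from the command line; the script name max_temp.pig is a hypothetical placeholder, and /pigtest is simply reused from the examples later in this unit:
 Running a script file
$ pig max_temp.pig
 Running a very short script passed as a string with the -e option
$ pig -e "fs -ls /pigtest"
 Starting the Grunt shell (no script file and no -e option)
$ pig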

DATA PROCESSING OPERATORS IN PIG


Data Types in Pig

Data Type Description Example Implementing class

int Signed 32-bit integer 2 java.lang.Integer
long Signed 64-bit integer 15L or 15l java.lang.Long
float 32-bit floating point 2.5f or 2.5F java.lang.Float
double 64-bit floating point 1.5 or 1.5e2 java.lang.Double
chararray Character array (string) Hi hello java.lang.String
bytearray Byte array (BLOB); the default field type byte[]
tuple Ordered set of fields (12,43) org.apache.pig.data.Tuple
bag Collection of tuples {(12,43),(54,28)} org.apache.pig.data.DataBag
map Set of key-value pairs [open#apache] java.util.Map<Object,Object>

Pig Relational Operators

LOAD
LOAD 'info' [USING FUNCTION] [AS SCHEMA];
o LOAD is a relational operator.
o 'info' is the file that is to be loaded. It can contain any type of data.
o USING is a keyword.
o FUNCTION is a load function.
o AS is a keyword.
o SCHEMA is the schema of the file being passed, enclosed in parentheses.

Example:
 File in local file system
$cat data.txt
1010,10,3
2020,20,4
3030,30,5
4040,40,2
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
 Printing loaded data on console
grunt> DUMP A;
(1010,10,3)
(2020,20,4)
(3030,30,5)
(4040,40,2)
STORE

 Stores or saves results to the file system.

grunt>STORE A INTO 'myoutput' USING PigStorage(',');

CROSS

 File in local file system

$cat data1.txt
1,2
2,3

$cat data2.txt
3,4,5
4,5,6
 Loading data file into HDFS file system
$ hdfs dfs -put data1.txt /pigtest
$ hdfs dfs -put data2.txt /pigtest
 Starting pig grunt shell
$pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data1.txt' USING PigStorage(',') AS (d1:int,d2:int) ;
grunt> B = LOAD '/pigtest/data2.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
 Cross product of data1.txt and data2.txt
grunt> C=CROSS A,B;
 Printing final output
grunt> DUMP C;
(1,2,3,4,5)
(1,2,4,5,6)
(2,3,3,4,5)
(2,3,4,5,6)

DISTINCT
 File in local file system
$cat data.txt
1,10,3
2,20,4
1,10,3
2,20,4
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
 To remove duplicate data
grunt>B = DISTINCT A;
 Printing loaded data on console
grunt> DUMP B;
(1,10,3)
(2,20,4)

FILTER
 File in local file system
$cat data.txt
1,10,3
2,20,4
3,10,3
4,20,4
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int) ;
 To keep only the tuples where d2 equals 10
grunt>B = FILTER A BY d2 == 10;
 Printing loaded data on console
grunt> DUMP B;
(1,10,3)
(3,10,3)

FOREACH
 File in local file system
$cat data.txt
1,2,3,4
5,6,7,8
8,7,6,5
4,3,2,1
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int,d3:int,
d4:int) ;
 To fetch second and fourth columns
grunt>B = FOREACH A GENERATE d2,d4;
 Printing loaded data on console
grunt> DUMP B;
(2,4)
(6,8)
(7,5)
(3,1)

GROUP BY
 File in local file system
$cat data.txt
John,Ram,3
Clark,John,2
Nike,Ram,5
Imran,John,6
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',')
AS (d1:chararray,d2:chararray,d3:int) ;
 To group the data based on d2 column data
grunt>B = GROUP A BY d2;
 Printing loaded data on console
grunt> DUMP B;
(Ram, {(John,Ram,3), (Nike,Ram,5)})
(John, {(Clark,John,2), (Imran,John,6)})

LIMIT

 File in local file system


$cat data.txt
John,Ram,3
Clark,John,2
Nike,Ram,5
Imran,John,6
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',')
AS (d1:chararray,d2:chararray,d3:int) ;
 To print only first two tuples
grunt>B = LIMIT A 2;
 Printing loaded data on console
grunt> DUMP B;
(John,Ram,3)
(Clark,John,2)

ORDER BY

 File in local file system


$cat data.txt
John,Ram,3
Clark,John,2
Nike,Ram,5
Imran,John,6
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',')
AS (d1:chararray,d2:chararray,d3:int) ;
 To sort tuples in an Order
grunt>B = ORDER A BY d3 DESC;
 Printing loaded data on console
grunt> DUMP B;
(Imran,John,6)
(Nike,Ram,5)
(John,Ram,3)
(Clark,John,2)

SPLIT
 File in local file system
$cat data.txt
1,2
2,4
3,6
4,8
5,7
6,5
7,3
8,1
 Loading data file into HDFS file system
$ hdfs dfs -put data.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data.txt' USING PigStorage(',') AS (d1:int,d2:int) ;
 To Split the tuples based on field values
grunt> SPLIT A INTO X IF d1<=5, Y IF d1>=6;
 Printing loaded data on console

grunt> DUMP X;
(1,2)
(2,4)
(3,6)
(4,8)
(5,7)

grunt> DUMP Y;
(6,5)
(7,3)
(8,1)

UNION

 File in local file system

$cat data1.txt
John,Ram,3
Clark,John,2

$cat data2.txt
Nike,Ram,5
Imran,John,6

 Loading data file into HDFS file system


$ hdfs dfs -put data1.txt /pigtest
$ hdfs dfs -put data2.txt /pigtest
 Starting pig grunt shell
$pig -x mapreduce or $pig
 Loading data into Pig by defining a schema; fields are separated by commas.
grunt> A = LOAD '/pigtest/data1.txt' USING PigStorage(',')
AS (d1:chararray,d2:chararray,d3:int) ;
grunt> B = LOAD '/pigtest/data2.txt' USING PigStorage(',')
AS (d1:chararray,d2:chararray,d3:int) ;
 To combine two bags as one bag
grunt>C = UNION A,B;
 Printing loaded data on console
grunt> DUMP C;
(John,Ram,3)
(Clark,John,2)
(Nike,Ram,5)
(Imran,John,6)
Example: Finds the maximum temperature by year

 File containing year wise temperature data


grunt>cat sample.txt
1990,11,1
1991,00,4
1990,24,5
1991,-4,9
1990,20,1
1991,5,4
 Loading data into pig
grunt>records = LOAD '/home/kkr/sample.txt' USING PigStorage(',')
AS (year:chararray, temperature:int, quality:int);

 Filter the records


grunt>filtered_records = FILTER records BY temperature != 9999
AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

 Group the tuples with respective year


grunt>grouped_records = GROUP filtered_records BY year;

 Fetch the max temperature from each group


grunt>max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);

 Print the results on console


grunt>DUMP max_temp;
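For the sample data shown above (all quality values pass the filter), this produces the maximum temperature per year:
(1990,24)
(1991,5)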

Apache Hive
What is Hive?

One of the biggest ingredients in the Information Platform built by Jeff’s team at Facebook
was Hive, a framework for data warehousing on top of Hadoop. Hive grew from a need to
manage and learn from the huge volumes of data that Facebook was producing every day
from its burgeoning social network. After trying a few different systems, the team chose
Hadoop for storage and processing, since it was cost-effective and met their scalability needs.

Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HiveQL (Hive Query Language), which get
internally converted into MapReduce jobs.

Using Hive, we can skip the requirement of the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).

Feature of Hive:

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HiveQL) that are implicitly transformed to
MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) with which users can provide custom
functionality.

Limitations of Hive

o Hive is not capable of handling real-time data.


o It is not designed for online transaction processing.
o Hive queries have high latency.

Differences between Hive and Pig

Hive Pig

Hive is commonly used by Data Analysts. Pig is commonly used by programmers.

It follows a declarative, SQL-like language It follows the procedural data-flow language
called HiveQL. called Pig Latin.

It can handle structured data. It can handle semi-structured data.

It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.

Hive is slower than Pig. Pig is comparatively faster than Hive.

Data Model: Tables, partitions and Buckets Data Model: Atom, Tuple, Bags and Maps

Web interface is supported Web Interface is not supported

Introduced by: Facebook, Amazon, Netflix Introduced by: Yahoo

 Modes of Execution
o Local Mode
 When Hadoop is installed in pseudo-distributed mode (single node/local
machine) and the data resides within that node.
o MapReduce mode
 When Hadoop is installed with multiple data nodes and the data is distributed
across the cluster.

 Hive Architecture

Hive Client

Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients such as:-

o Thrift Server - It is a cross-language service provider platform that serves
requests from all those programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between hive and Java
applications. The JDBC Driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to
connect to Hive.

Hive Services

The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI.
It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure
information of the various tables and partitions in the warehouse. It also includes
metadata of columns and their types, the serializers and deserializers
used to read and write data, and the corresponding HDFS files where
the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request
from different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift,
and JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts
HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of
a DAG of MapReduce tasks and HDFS tasks. The execution engine then
executes these tasks in the order of their dependencies.

 The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided
into two pieces: a service and the backing store for the data. It is configured in three
different ways:
 Embedded Metastore
 Local Metastore
 Remote Metastore

Embedded Metastore
By default, the metastore service runs in the same JVM as the Hive service and
contains an embedded Derby database instance backed by the local disk. This is
called the embedded metastore configuration. Using an embedded metastore is a
simple way to get started with Hive; however, only one embedded Derby database
can access the database files on disk at any one time, which means you can have
only one Hive session open at a time that shares the same metastore. Trying to start
a second session gives the error: Failed to start database 'metastore_db'.
Local Metastore
The solution to supporting multiple sessions (and therefore multiple users) is to use
a standalone database. This configuration is referred to as a local metastore, since
the metastore service still runs in the same process as the Hive service, but connects
to a database running in a separate process, either on the same machine or on a
remote machine.
Remote Metastore
In a remote metastore, one or more metastore servers run in separate processes from the Hive
service. This brings better manageability and security because the
database tier can be completely firewalled off, and the clients no longer need the
database credentials.
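As a hedged illustration of these configurations, a local metastore is typically set up through JDBC properties in hive-site.xml, while a remote metastore is reached through hive.metastore.uris. The host names, database name, and credentials below are placeholders, not values from this unit:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>

For the remote metastore, the client-side hive-site.xml points to the metastore service instead:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>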
APPLICATIONS ON BIG DATA USING HIVE

 Hive Data Types


Hive supports both primitive and complex data types. Primitives include numeric,
Boolean, string, and timestamp types. The complex data types include arrays, maps,
and structs.

Note: the literal forms for arrays, maps, and structs are provided as functions; that is, array(), map(), and struct() are built-in
Hive functions. In the example table below, the columns are named col1, col2, and col3.

Complex data types


Hive has three complex types: ARRAY, MAP, and STRUCT. ARRAY and MAP are like
their namesakes in Java, whereas a STRUCT is a record type that encapsulates a set
of named fields. Complex types permit an arbitrary level of nesting. Complex type
declarations must specify the type of the fields in the collection, using an angled
bracket notation, as illustrated in this table definition with three columns, one for
each complex type:
hive>CREATE TABLE complex (
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>);

hive> SELECT col1[0], col2['b'], col3.c FROM complex;


1 2 1.0

 Operators
The usual set of SQL operators is provided by Hive:
 Relational Operators
Operator Description

A=B It returns true if A equals B, otherwise false.

A <> B, A !=B It returns null if A or B is null; true if A is not equal to B, otherwise false.

A<B It returns null if A or B is null; true if A is less than B, otherwise false.

A>B It returns null if A or B is null; true if A is greater than B, otherwise false.

A<=B It returns null if A or B is null; true if A is less than or equal to B, otherwise false.

A>=B It returns null if A or B is null; true if A is greater than or equal to B, otherwise


false.

A IS NULL It returns true if A evaluates to null, otherwise false.

A IS NOT NULL It returns false if A evaluates to null, otherwise true

 Arithmetic Operators
Operators Description

A+B This is used to add A and B.

A–B This is used to subtract B from A.

A*B This is used to multiply A and B.

A/B This is used to divide A and B and returns the quotient of the operands.

A%B This returns the remainder of A / B.

A|B This is used to determine the bitwise OR of A and B.

A&B This is used to determine the bitwise AND of A and B.

A^B This is used to determine the bitwise XOR of A and B.

~A This is used to determine the bitwise NOT of A.

 Logical operators (such as x OR y for logical OR) are also provided.
 Hive's operators match those in MySQL, which deviates from SQL-92 because || is
logical OR, not string concatenation. Use the concat function for the latter in
both MySQL and Hive.
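For illustration, a short hedged example; the emp table and its Name, Salary, and Desig columns are borrowed from the sample table used later in this unit and assumed to exist here. The first query combines relational and logical operators, and the second uses concat for string concatenation instead of ||:

hive> SELECT Name, Salary FROM emp WHERE Salary >= 30000 OR Desig = 'Manager';
hive> SELECT concat(Name, '-', Desig) FROM emp;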
Hive Database
In Hive, the database is considered as a catalog or namespace of tables. So, we can
maintain multiple tables within a database, where a unique name is assigned to
each table. Hive also provides a default database named default.

 hive> show databases; // to check the existing databases

 hive> create database demo; // to create new database


 hive> create database if not exists demo;
 hive>create database demo
>WITH DBPROPERTIES ('creator' = 'sumit', 'date' = '2019-06-03');
 hive> describe database extended demo;
 hive> drop database demo;
 hive> drop database if exists demo;
 hive> drop database if exists demo cascade;

 Hive Tables

In Hive, we can create a table by using the conventions similar to the SQL. It supports
a wide range of flexibility where the data files for tables are stored. It provides two
types of table: -

o Internal table
o External table

Internal Table

The internal tables are also called managed tables, as the lifecycle of their data is
controlled by Hive. By default, these tables are stored in a subdirectory under
the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse).
The internal tables are not flexible enough to share with other tools like Pig. If we try
to drop an internal table, Hive deletes both the table schema and the data.

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;

hive> create table if not exists demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;

hive> create table demo.new_employee (Id int comment 'Employee Id',
Name string comment 'Employee Name', Salary float comment 'Employee Salary')
comment 'Table Description'
TBLProperties ('creator'='rahul', 'created_at' = '2019-06-06 11:00:00');

hive> create table if not exists demo.copy_employee


like demo.employee;

hive> describe demo.employee;

External Table

The external table allows us to create and access a table and its data externally.
The external keyword is used to specify the external table, whereas
the location keyword is used to determine the location of the loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we
try to drop the table, the metadata of the table will be deleted, but the data still exists.

$hdfs dfs -mkdir /HiveDirectory


$hdfs dfs -put hive/emp_details /HiveDirectory

hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

hive> select * from emplist;

 Hive - Load Data

Once the internal table has been created, the next step is to load the data into it. So,
in Hive, we can easily load data from any file to the database.

o Let's load the data of the file into the database by using the following
command: -

hive>load data local inpath '/home/hive/empdata' into table demo.emp;

Here, empdata is the file name that contains the data.

hive>select * from demo.emp;

If we want to add more data into the current database, execute the same query
again by just updating the new file name.

hive>load data local inpath '/home/hive/empdata1' into table demo.emp;

hive>select * from demo.emp;

o In Hive, if we try to load unmatched data (i.e., one or more column values don't
match the data type of the specified table columns), it will not throw any
exception. However, it stores a null value at the position of the unmatched
value.

 Hive - Drop Table


Hive facilitates us to drop a table by using the SQL drop table command. Let's follow
the below steps to drop the table from the database.
o Let's check the list of existing databases by using the following command:
hive> show databases;
o Now select the database from which we want to delete the table by using the
following command: -
hive> use demo;
o Let's check the list of existing tables in the corresponding database.
hive> show tables;
o Now, drop the table by using the following command: -
hive> drop table demo.emp;
o Let's check whether the table is dropped or not.
hive> show tables;

Hive - Alter Table


In Hive, we can perform modifications in an existing table like changing the table name,
column names, comments, and table properties. It provides SQL-like commands to alter
the table.

Rename a Table
If we want to change the name of an existing table, we can rename that table by using
the following signature: -
hive>ALTER TABLE old_table_name RENAME TO new_table_name;

Let's see the existing tables present in the current database.

hive>SHOW TABLES;

Adding column
In Hive, we can add one or more columns in an existing table by using the following
syntax
hive>ALTER TABLE table_name ADD COLUMNS(column_name datatype);

Ex: hive>Alter table demo.emp add columns (age int);


hive>describe demo.emp;
hive>select * from demo.emp;

Change Column
In Hive, we can rename a column, change its type and position. Here, we are
changing the name of the column by using the following signature: -

hive>ALTER TABLE table_name CHANGE old_column_name new_column_name datatype;

Ex: hive>ALTER TABLE demo.emp CHANGE name fname string;


hive>DESCRIBE demo.emp;
hive>SELECT * FROM demo.emp;

Delete or Replace Column


Hive allows us to delete one or more columns by replacing them with the
new columns. Thus, we cannot drop the column directly.
Let's see the existing schema of the table.

hive>DESCRIBE demo.emp;

Now, drop a column from the table.

hive>ALTER TABLE demo.emp REPLACE COLUMNS( id string, first_name string, age int);

hive>DESCRIBE demo.emp;

Hive Query Language (HiveQL)

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore.

Return type Functions Description

BIGINT round(num) It returns the BIGINT for the rounded value of DOUBLE num.
BIGINT floor(num) It returns the largest BIGINT that is less than or equal to num.
BIGINT ceil(num), It returns the smallest BIGINT that is greater than or equal to num.
ceiling(DOUBLE num)
DOUBLE exp(num) It returns exponential of num.
DOUBLE ln(num) It returns the natural logarithm of num.
DOUBLE log10(num) It returns the base-10 logarithm of num.
DOUBLE sqrt(num) It returns the square root of num.
DOUBLE abs(num) It returns the absolute value of num.
DOUBLE sin(d) It returns the sine of d (d in radians).
DOUBLE asin(d) It returns the arcsine of d, in radians.
DOUBLE cos(d) It returns the cosine of d (d in radians).
DOUBLE acos(d) It returns the arccosine of d, in radians.
DOUBLE tan(d) It returns the tangent of d (d in radians).
DOUBLE atan(d) It returns the arctangent of d, in radians.

Operations in HiveQL

Arithmetic Operations

hive> select id, name, salary + 50 from employee;

hive> select id, name, (salary * 10) /100 from employee; //find 10% salary from each
employee

Relational Operators

hive> select * from employee where salary >= 25000;

hive> select * from employee where salary < 25000;

Math functions
hive> select Id, Name, sqrt(Salary) from employee_data ;
Aggregate Functions in Hive
In Hive, an aggregate function returns a single value resulting from computation
over many rows. Let's see some commonly used aggregate functions: -
Return Operator Description
Type

BIGINT count(*) It returns the count of the number of rows present in the file.
DOUBLE sum(col) It returns the sum of values.
DOUBLE sum(DISTINCT col) It returns the sum of distinct values.
DOUBLE avg(col) It returns the average of values.
DOUBLE avg(DISTINCT col) It returns the average of distinct values.
DOUBLE min(col) It compares the values and returns the minimum
one from them.
DOUBLE max(col) It compares the values and returns the
maximum one from them.
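As a brief hedged illustration, using the employee table from the earlier queries (assumed to have a salary column), several aggregates can be computed in a single query:

hive> select count(*), sum(salary), avg(salary), min(salary), max(salary) from employee;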

Sorting and Aggregating


 Sorting data in Hive can be achieved by using a standard ORDER BY clause, but
there is a catch. ORDER BY produces a result that is totally sorted, as expected,
but to do so it sets the number of reducers to one, making it very inefficient for
large datasets.
 When a globally sorted result is not required (and in many cases it isn't), you can
use Hive's nonstandard extension, SORT BY, instead. SORT BY produces a sorted
file per reducer.
 In some cases, you want to control which reducer a particular row goes to,
typically so you can perform some subsequent aggregation. This is what Hive’s
DISTRIBUTE BY clause does.

GROUP BY and HAVING Clause


The Hive Query Language provides GROUP BY and HAVING clauses that facilitate
similar functionalities as in SQL. Here, we are going to execute these clauses on the
records of the below table:
 Create Table:
hive> create table emp (Id int, Name string, Salary float, Desig string)
>row format delimited
>fields terminated by ',' ;
 Load data into emp table:
hive> load data local inpath '/home/hive/emp_data' into table emp;

ID NAME SALARY DESIG


1 John 40,000 Sr. Manager
2 Harry 30,000 Developer
3 Clark 25,000 Developer
4 Ram 35,000 Manager
5 Imran 50,000 Sr. Manager
6 Lisa 45,000 Manager
7 Syam 35,000 Manager

GROUP BY Clause:
The HQL Group By clause is used to group the data from the multiple records
based on one or more column. It is generally used in conjunction with the aggregate
functions (like SUM, COUNT, MIN, MAX and AVG) to perform an aggregation over
each group.
hive> SELECT desig, sum(salary) from emp GROUP BY desig;
Developer 55,000
Manager 1,15,000
Sr. Manager 90,000

HAVING Clause:
The HQL HAVING clause is used with GROUP BY clause. Its purpose is to
apply constraints on the group of data produced by GROUP BY clause. Thus, it
always returns the data where the condition is TRUE.

hive> SELECT desig, sum(salary) from emp
>GROUP BY desig HAVING sum(salary) >= 90000;
Manager 1,15,000
Sr. Manager 90,000

ORDER BY and SORT BY Clause

By using HiveQL ORDER BY and SORT BY clause, we can apply sort on the column.
It returns the result set either in ascending or descending order. Here, we are going
to execute these clauses on the records of the below table:

ID NAME SALARY DESIG


1 John 40,000 Sr. Manager
2 Harry 30,000 Developer
3 Clark 25,000 Developer
4 Ram 35,000 Manager
5 Imran 50,000 Sr. Manager
6 Lisa 45,000 Manager
7 Syam 35,000 Manager
ORDER BY Clause
In HiveQL, ORDER BY clause performs a complete ordering of the query result set.
Hence, the complete data is passed through a single reducer. This may take much
time in the execution of large datasets. However, we can use LIMIT to minimize the
sorting time.

hive> select * from emp order by salary desc;

SORT BY Clause

The HiveQL SORT BY clause is an alternative of ORDER BY clause. It orders the data
within each reducer. Hence, it performs the local ordering, where each reducer's
output is sorted separately. It may also give a partially ordered result.

hive> select * from emp sort by salary desc;

DISTRIBUTE BY:
The DISTRIBUTE BY clause is used on tables present in Hive. Hive uses the columns in
DISTRIBUTE BY to distribute the rows among reducers. All rows with the same
DISTRIBUTE BY column values go to the same reducer.

 It ensures that each of the N reducers gets non-overlapping sets of the column values
 It doesn't sort the output of each reducer

hive>SELECT Id, Name from emp DISTRIBUTE BY Id;
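DISTRIBUTE BY is often combined with SORT BY, so that rows are partitioned by one column and sorted within each reducer by another. A short hedged sketch, reusing the emp table from the examples above:

hive> SELECT Id, Name, Salary, Desig FROM emp DISTRIBUTE BY Desig SORT BY Salary DESC;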


Fundamentals of HBase
Introduction to HBase
HBase is an open source, multidimensional, distributed, scalable, column-oriented NoSQL
database written in Java. HBase runs on top of HDFS (Hadoop Distributed File System). HBase
achieves high throughput and low latency by providing faster read/write access on huge data
sets. Therefore, HBase is the choice for applications which require fast and random access to
large amounts of data.

NoSQL vs SQL Databases

• SQL Databases

• Define Structure and then insert Data known as Schema on write


• It uses SQL (Declarative Query language)
• Data maintains Table based
• Data stored in tables only when satisfies ACID properties.
• NoSQL (Not Only SQL)
• No need to satisfy ACID properties to store data.
• Relaxed consistency and validations while storing data.
• Define the table first and then define the structure when data is loaded, known as Schema
on read
• No declarative query language
• No null values in NoSQL
• Stores huge amounts of data
Types of NoSQL:

1) Key-Value based DB
2) Document based DB
3) Column based DB
4) Graph based DB
Row Oriented Data Stores vs Column-Oriented Data Stores

A data store is basically a place for storing collections of data, such as a database, a file
system or a directory. In Database system they can be stored in two ways. These are as
follows:
1. Row Oriented Data Stores
2. Column-Oriented Data Stores
Comparisons between Row oriented data stores and Column oriented data stores are as
following:

Row oriented data stores Column oriented data stores

Data is stored and retrieved one row at a time, and In this type of data store, data is stored and
hence unnecessary data may be read if only some of retrieved in columns, and hence only the
the data in a row is required. relevant data is read when required.

Records in Row Oriented Data stores are easy to Read and Write operations are slower as
read and write. compared to row-oriented.

Best suited for online transaction system. Best suited for online analytical processing.

These are not efficient in performing operations These are efficient in performing operations
applicable to the entire datasets and hence applicable to the entire dataset and hence
aggregation in row-oriented is an expensive job or enables aggregation over many rows and
operations. columns.

Typical compression mechanisms provide less These types of data stores permit high
efficient results than what we achieve from compression rates due to few distinct or
column-oriented data stores. unique values in columns.

HBASE DATA MODEL

 Tables: Data is stored in a table format in HBase. But here tables are in column-oriented
format.
 Row Key: Row keys are used to search records which make searches fast.
 Column Families: Various columns are combined in a column family. These column
families are stored together which makes the searching process faster because data
belonging to same column family can be accessed together in a single seek.
 Column Qualifiers: Each column’s name is known as its column qualifier.
 Cell: Data is stored in cells. The data is dumped into cells which are specifically identified
by rowkey and column qualifiers.
 Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is
stored with its timestamp. This makes it easy to search for a particular version of the data.
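The data model above can be illustrated with a few HBase shell commands. This is a minimal hedged sketch in which the table name 'employee', the column families 'personal' and 'professional', and the values are assumptions made for illustration:

hbase> create 'employee', 'personal', 'professional'
hbase> put 'employee', 'row1', 'personal:name', 'John'
hbase> put 'employee', 'row1', 'professional:desig', 'Manager'
hbase> get 'employee', 'row1'
hbase> scan 'employee'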
ARCHITECTURE OF HBASE:
The main components of HBase architecture are

 Region
 Region Server
 HMaster
 Zookeeper

Region: A region contains all the rows between the start key and the end key assigned to that
region. HBase tables can be divided into a number of regions in such a way that all the columns
of a column family is stored in one region. Each region contains the rows in a sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing,
executing reads and writes operations on that set of regions.
So, concluding in a simpler way:

 A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
 A Region has a default size of 256MB which can be configured according to the need.
 A Group of regions is served to the clients by a Region Server.
 A Region Server can serve approximately 1000 regions to the client.

Components of Region Server:

A Region Server maintains various regions running on the top of HDFS. Components of a
Region Server are:

 WAL: The Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
the new data that hasn’t been persisted or committed to permanent storage. It is used
to recover the data sets in case of failure.
 Block Cache: The Block Cache resides at
the top of a Region Server. It stores the frequently read data in memory. If the data in
the BlockCache is least recently used, then that data is removed from the BlockCache.
 MemStore: It is the write cache. It stores all the incoming data before committing it to
the disk or permanent memory. There is one MemStore for each column family in a
region, so there are multiple MemStores for a region because
each region contains multiple column families. The data is sorted in lexicographical
order before committing it to the disk.
HFile: HFiles are stored on HDFS; they store the actual
cells on disk. The MemStore commits its data to an HFile when the size of the MemStore exceeds a configured threshold.
HMaster
The HMaster handles a collection of Region Servers which
reside on DataNodes. Let us understand how HMaster does that.
 HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region Servers.
 It coordinates and manages the Region Server (similar as NameNode manages
DataNode in HDFS).
 It assigns regions to the Region Servers on startup and re-assigns regions to Region
Servers during recovery and load balancing.
 It monitors all the Region Server’s instances in the cluster (with the help of Zookeeper)
and performs recovery activities whenever any Region Server is down.
 It provides an interface for creating, deleting and updating tables.

HBase has a distributed and huge environment where HMaster alone is not sufficient to manage
everything. That is where ZooKeeper comes into the picture: after understanding how HMaster
manages the HBase environment, we will look at how Zookeeper helps HMaster manage it.

HBase Architecture: Compaction

HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for
a read. This process is called compaction. Compaction chooses some HFiles from a region and
combines them. There are two types of compaction.

1. Minor Compaction: HBase automatically picks smaller HFiles and recommits them to
bigger HFiles. This is called Minor Compaction. It
performs a merge sort while committing smaller HFiles to bigger HFiles. This helps in
storage space optimization.
2. Major Compaction: In Major Compaction, HBase
merges and recommits the smaller HFiles of a region into a new HFile. In this process,
the same column families are placed together in the new HFile. It drops deleted and
expired cells in this process. It increases read performance.
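Compactions run automatically, but a major compaction can also be triggered manually from the HBase shell; a hedged sketch, reusing the hypothetical 'employee' table from the earlier example:

hbase> major_compact 'employee'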

Fundamentals of ZooKeeper
Zookeeper is a cluster coordinating, cross-platform software service provided by the Apache
Foundation. It is essentially designed for providing service for distributed systems offering
a hierarchical key-value store, which is used to provide a distributed configuration
service, synchronization service, and naming registry for large distributed systems

Architecture of Zookeeper
Apache Zookeeper basically follows the Client-Server Architecture. Participants in the
Zookeeper architecture can be enlisted as follows.

The Architecture of Apache Zookeeper is categorized into 5 different components as follows:

 Ensemble
 Server
 Server Leader
 Follower
 Client

Ensemble: It is basically the collection of all the Server nodes in the Zookeeper ecosystem.
The Ensemble requires a minimum of three nodes to get itself set up.

Server: It is one amongst the other servers present in the Zookeeper Ensemble, whose
objective is to provide all sorts of services to its clients. It sends its alive status to its clients in
order to inform them about its availability.

Server Leader: Ensemble Leader is elected at the service startup. It has access to recover the
data from any of the failed nodes and performs automatic data recovery for clients.
Follower: A follower is one of the servers in the Ensemble. Its duty is to follow the orders
passed by the Leader.

Client: Clients are the nodes that request services from the servers. Similar to servers, a
client also sends signals to the servers regarding its availability. If a server fails to
respond, clients automatically redirect themselves to the next available server.
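For illustration, a minimal hedged sketch of interacting with a ZooKeeper server through its command-line client; the host name, znode path, and data are placeholders, and 2181 is ZooKeeper's default client port:

$ zkCli.sh -server zkhost:2181
create /myapp "config-v1"
get /myapp
ls /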

HBase Architecture: ZooKeeper – The Coordinator



 Zookeeper acts like a coordinator inside the HBase distributed environment. It helps in
maintaining server state inside the cluster by communicating through sessions.

Every Region Server, along with the HMaster server, sends a continuous heartbeat at regular intervals
to Zookeeper, and Zookeeper checks which servers are alive and available. It
also provides server failure notifications so that recovery measures can be executed.
VISUALIZATIONS
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization techniques
provide an accessible way to see and understand trends, outliers, and patterns in data.

In the modern world of Big Data, we have a lot of data in our hands, and data
visualization tools and technologies are crucial to analyze massive amounts of information and
make data-driven decisions.

The basic uses of the Data Visualization technique are as follows:

 It is a powerful technique to explore the data with presentable and interpretable


results.
 In the data mining process, it acts as a primary step in the pre-processing portion.
 It supports the data cleaning process by finding incorrect data and corrupted or
missing values.
 It also helps to construct and select variables, which means we have to determine
which variable to include and discard in the analysis.
 In the process of Data Reduction, it also plays a crucial role while combining the
categories.

Visual data analysis techniques

Different Types of Analysis for Data Visualization


Mainly, there are three different types of analysis for Data Visualization:

 Univariate Analysis: In the univariate analysis, we will be using a single feature to


analyze almost all of its properties.
 Bivariate Analysis: When we compare the data between exactly 2 features then it is
known as bivariate analysis.
 Multivariate Analysis: In the multivariate analysis, we will be comparing more than
2 variables.

Univariate Analysis Techniques for Data Visualization


Distribution Plot

 It is one of the best univariate plots to know about the distribution of data.
 When we want to analyze the impact on the target variable(output) with respect to an
independent variable(input), we use distribution plots a lot.
 This plot gives us a combination of both a probability density function (pdf) and a
histogram in a single plot.

Box and Whisker Plot

 This plot can be used to obtain more statistical details about the data.
 The straight lines at the maximum and minimum are also called whiskers.
 Points that lie outside the whiskers will be considered as an outlier.
 The box plot also gives us a description of the 25th, 50th, and 75th percentiles (quartiles).
 With the help of a box plot, we can also determine the interquartile range (IQR), where
most of the data is present. Therefore, it can also give us a clear idea
about the outliers in the dataset.

Fig. General Diagram for a Box-plot

Violin Plot

 The violin plots can be considered as a combination of Box plot at the middle and
distribution plots (Kernel Density Estimation) on both sides of the data.
 This can give us the description of the distribution of the dataset like whether the
distribution is multimodal, Skewness, etc.
 It also gives us useful information like a 95% confidence interval.

Fig. General Diagram for a Violin-plot
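As a hedged illustration of these univariate plots, the following Python sketch uses seaborn (a recent version is assumed); the DataFrame and the salary values are made up for the example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical single numeric feature
df = pd.DataFrame({"salary": [25000, 30000, 35000, 35000, 40000, 45000, 50000]})

sns.histplot(df["salary"], kde=True)   # distribution plot: histogram plus density curve
plt.show()

sns.boxplot(x=df["salary"])            # box-and-whisker plot: quartiles, IQR, outliers
plt.show()

sns.violinplot(x=df["salary"])         # violin plot: box plot plus density on both sides
plt.show()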


Bivariate Analysis Techniques for Data Visualization
1. Line Plot

 This is the plot that you can see in almost any sort of analysis between
2 variables.
 A line plot is nothing but the values of a series of data points connected
with straight lines.
 The plot may seem very simple, but it has many applications not only in machine
learning but in many other areas.

2. Bar Plot

 This is one of the widely used plots that we would have seen multiple times, not just in
data analysis; it is used wherever there is a trend analysis, in many fields.
 Though it may seem simple, it is powerful in analysing data like sales figures every
week, revenue from a product, the number of visitors to a site on each day of a week,
etc.
3. Scatter Plot

 It is one of the most commonly used plots used for visualizing simple data in Machine
learning and Data Science.
 This plot represents each point in the entire dataset
with respect to any 2 to 3 features (columns).
 Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the
common one, where we will primarily try to find the patterns, clusters, and separability
of the data.

Some observations that can be drawn from a scatter plot of labelled data:
 Colors are assigned to the data points based on the target column of the dataset.
 In other words, we can color the data points according to their class label, which makes
clusters and class separability easy to see.
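As a hedged sketch of a 2-D scatter plot coloured by class label, using seaborn's bundled iris sample dataset (the dataset and column names are assumptions, and load_dataset may download the data on first use):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")  # columns: sepal_length, sepal_width, petal_length, petal_width, species
sns.scatterplot(data=iris, x="sepal_length", y="petal_length", hue="species")
plt.show()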

INTERACTION TECHNIQUES
Interactive Visual Analysis (IVA) is a set of techniques for combining the computational
power of computers with the perceptive and cognitive capabilities of humans, in order to extract
knowledge from large and complex datasets.

IVA is a suitable technique for analyzing high-dimensional data that has a large number of
data points, where simple graphing and non-interactive techniques give an insufficient
understanding of the information.

The objective of Interactive Visual Analysis is to discover information in data which is not
readily apparent. The goal is to move from the data itself to the information contained in the
data, ultimately uncovering knowledge which was not apparent from looking at the raw
numbers.
The general framework for an interactive data structure visualization project typically follows
these steps: identify your desired goals, understand the challenges presented by data
constraints, and design a conceptual model in which data can be quickly iterated and reviewed.

Some popular libraries for creating your own interactive data visualizations include: Altair,
Bokeh, Celluloid, Matplotlib, nbinteract, Plotly, Pygal, and Seaborn. Libraries are available for
Python, Jupyter, Javascript, and R interactive data visualizations.
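As a hedged sketch of an interactive visualization, the following uses Plotly Express (one of the libraries listed above); the sample dataset and column names come from Plotly's bundled iris data and are assumed here for illustration. The resulting figure supports hovering, zooming, panning, and legend filtering out of the box:

import plotly.express as px

iris = px.data.iris()  # bundled sample DataFrame
fig = px.scatter(iris, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length", "petal_width"])
fig.show()  # opens an interactive figure in the browser or notebook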

Some major benefits of interactive data visualizations include:

 Identify Trends Faster - The majority of human communication is visual, as the human
brain processes graphics orders of magnitude faster than it does text. Direct manipulation of
analyzed data via familiar metaphors and digestible imagery makes it easy to
understand and act on valuable information.
 Identify Relationships More Effectively - The ability to narrowly focus on specific
metrics enables users to identify otherwise overlooked cause-and-effect relationships
throughout definable timeframes. This is especially useful in identifying how daily
operations affect an organization’s goals.
 Useful Data Storytelling - Humans best understand a data story when its development
over time is presented in a clear, linear fashion. A visual data story in which users can
zoom in and out, highlight relevant information, filter, and change the parameters
promotes better understanding of the data by presenting multiple viewpoints of the data.
 Simplify Complex Data - A large data set with a complex data story may present itself
visually as a chaotic, intertwined hairball. Incorporating filtering and zooming controls
can help untangle and make these messes of data more manageable, and can help users
glean better insights.
