Big Data
And Analytics
Seema Acharya
Subhashini Chellappan
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Chapter 9
Introduction to Hive
Learning Objectives and Learning Outcomes
Introduction to Hive
Learning Objectives
1. To study the Hive architecture.
2. To study the Hive file format.
3. To study the Hive Query Language.
Learning Outcomes
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.
Session Plan
Lecture time 90 to 120 minutes
Q/A 15 minutes
Agenda
What is Hive?
Hive Architecture
Hive Data Types
    Primitive Data Types
    Collection Data Types
Hive File Format
    Text File
    Sequence File
    RCFile (Record Columnar File)
Hive Query Language
DDL (Data Definition Language) Statements
DML (Data Manipulation Language) Statements
Database
Tables
Partitions
Buckets
Aggregation
Group By and Having
SerDe
What is Hive?
Hive is a data warehousing tool built on top of Hadoop and used to query
structured data. Facebook created Hive to manage its ever-growing volumes
of data. Hive makes use of the following:
1. HDFS for storage
2. MapReduce for execution
3. An RDBMS for storing metadata
Hive Features
1. Its query language, HQL, is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, arrays (lists), and maps.
4. Hive supports SQL filters and the group-by and order-by clauses.
Hive Data Units
• Databases
• Tables
• Partitions
• Buckets (Clusters)
Hive Architecture
[Figure: Hive architecture]
Hive layer: Command-Line Interface, Hive Web Interface, Hive Server (Thrift);
Driver (Query Compiler, Executor); Metastore
Hadoop layer: JobTracker, TaskTracker, HDFS
Hive Data Types
Numeric Types
TINYINT     1-byte signed integer
SMALLINT    2-byte signed integer
INT         4-byte signed integer
BIGINT      8-byte signed integer
FLOAT       4-byte single-precision floating-point number
DOUBLE      8-byte double-precision floating-point number
String Types
STRING
VARCHAR     Only available starting with Hive 0.12.0
CHAR        Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (").
Miscellaneous Types
BOOLEAN
BINARY Only available starting with Hive
Collection Data Types
STRUCT  Similar to a C struct. Fields are accessed using dot notation.
        E.g.: struct('John', 'Doe')
MAP     A collection of key-value pairs. Values are accessed by key using [] notation.
        E.g.: map('first', 'John', 'last', 'Doe')
ARRAY   An ordered sequence of elements of the same type. Elements are accessed using an array index.
        E.g.: array('John', 'Doe')
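The three access notations can be tried on a small table. The sketch below is illustrative only: the table STUDENT_PROFILE, its columns, and the delimiters are assumptions made for this example and do not appear elsewhere in the chapter.

```sql
-- Hypothetical table combining the three collection types.
CREATE TABLE IF NOT EXISTS STUDENT_PROFILE (
  rollno   INT,
  name     STRUCT<first:STRING, last:STRING>,
  marks    MAP<STRING, INT>,
  subjects ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

SELECT name.first,      -- STRUCT field, dot notation
       marks['Hive'],   -- MAP value, looked up by key
       subjects[0]      -- ARRAY element, zero-based index
FROM STUDENT_PROFILE;
```

Looking up a map key that is absent, or an array index beyond the last element, returns NULL rather than raising an error.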
Hive File Format
• Text File
The default file format is the text file.
• Sequence File
Sequence files are flat files that store binary key-value pairs.
• RCFile (Record Columnar File)
RCFile stores data in a column-oriented manner, which makes aggregation
operations less expensive because only the required columns are read.
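The file format is chosen per table with the STORED AS clause. A minimal sketch follows; the three table names are hypothetical and were picked only to show the clause.

```sql
-- Text file (the default when STORED AS is omitted).
CREATE TABLE IF NOT EXISTS STUDENT_TEXT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Sequence file: binary key-value pairs.
CREATE TABLE IF NOT EXISTS STUDENT_SEQ (rollno INT, name STRING, gpa FLOAT)
STORED AS SEQUENCEFILE;

-- RCFile: column-oriented storage.
CREATE TABLE IF NOT EXISTS STUDENT_RC (rollno INT, name STRING, gpa FLOAT)
STORED AS RCFILE;
```

LOAD DATA only moves files into place without converting them, so a sequence-file or RCFile table is usually filled with INSERT ... SELECT from a text table.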
Hive Query Language (HQL)
HQL can be used to:
1. Create and manage tables and partitions.
2. Apply various relational, arithmetic, and logical operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory, or write the results of
queries to an HDFS directory.
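Item 4 can be sketched with INSERT OVERWRITE ... DIRECTORY. Both output paths below are hypothetical, and the STUDENT table is the managed table created later in this chapter.

```sql
-- Copy the contents of a table to a directory on the local file system.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/student_export'
SELECT * FROM STUDENT;

-- Write the result of a query to a directory in HDFS.
INSERT OVERWRITE DIRECTORY '/user/hive/output/top_students'
SELECT rollno, name FROM STUDENT WHERE gpa >= 3.5;
```

OVERWRITE replaces whatever the target directory already contains, so the directory should be one that is safe to clear.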
DDL and DML statements
Database
To create a database named “STUDENTS” with comments and database
properties.
CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');
To describe a database.
DESCRIBE DATABASE STUDENTS;
To drop a database.
DROP DATABASE STUDENTS;
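A few companion statements are often used alongside the three above. This is a short sketch against the STUDENTS database from the earlier slide, not an exhaustive list.

```sql
SHOW DATABASES;                        -- list all databases
USE STUDENTS;                          -- make STUDENTS the current database
DESCRIBE DATABASE EXTENDED STUDENTS;   -- also shows DBPROPERTIES such as 'creator'

-- A plain DROP DATABASE fails if the database still contains tables.
-- CASCADE drops the tables first; IF EXISTS avoids an error when the database is absent.
DROP DATABASE IF EXISTS STUDENTS CASCADE;
```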
Tables
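Hive provides two kinds of tables:
Managed Table: Hive manages both the metadata and the data. Dropping the table deletes its data.
External Table: Hive manages only the metadata. Dropping the table leaves the data at its location.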
To create a managed table named 'STUDENT'.
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To create an external table named 'EXT_STUDENT'.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
To load data into the table from a local file named student.tsv.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO
TABLE EXT_STUDENT;
To retrieve the student details from the "EXT_STUDENT" table.
SELECT * FROM EXT_STUDENT;
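To check whether a table is managed or external, its metadata can be inspected. The sketch below assumes the STUDENT and EXT_STUDENT tables created on the preceding slides.

```sql
-- The "Table Type" row in the output reads MANAGED_TABLE for STUDENT
-- and EXTERNAL_TABLE for EXT_STUDENT; the "Location" row shows where the data lives.
DESCRIBE FORMATTED STUDENT;
DESCRIBE FORMATTED EXT_STUDENT;
```

Dropping EXT_STUDENT would remove only its entry from the metastore, whereas dropping STUDENT would also delete its data files.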
Partitions
Partitions split a large dataset into smaller, more meaningful chunks.
Hive provides two kinds of partitions: Static Partition and Dynamic Partition.
In a static partition, the partition value is stated explicitly when loading data.
In a dynamic partition, Hive derives the partition values from the data itself.
• To create a static partition table based on the "gpa" column.
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
• To load data into the partition table from another table.
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 4.0;
• To create a dynamic partition table on the "gpa" column.
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
• To load data into a dynamic partition table from another table.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
Note: In strict mode, dynamic partitioning requires at least one static partition
column. Setting hive.exec.dynamic.partition.mode to nonstrict, as done above,
removes this requirement.
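Once data is loaded, the partitions can be listed and queried. The sketch below assumes the two partitioned tables from these slides; the gpa value 3.5 is an arbitrary example value.

```sql
-- List the partitions Hive created from the data (one per distinct gpa value).
SHOW PARTITIONS DYNAMIC_PART_STUDENT;

-- A filter on the partition column lets Hive read only the matching partition.
SELECT rollno, name FROM DYNAMIC_PART_STUDENT WHERE gpa = 4.0;

-- A static partition can also be added explicitly before any data is loaded.
ALTER TABLE STATIC_PART_STUDENT ADD IF NOT EXISTS PARTITION (gpa = 3.5);
```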
Buckets
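• To create a bucketed table having 3 buckets.
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;
• To load data into the bucketed table.
-- Needed on Hive releases before 2.0 so that inserts honor the bucket count.
SET hive.enforce.bucketing = true;
-- STUDENT keeps this value in its gpa column; it maps by position to grade.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, gpa;
• To display the distinct grades in the first bucket.
SELECT DISTINCT grade FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON grade);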
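The same sampling clause can be pointed at any of the three buckets. As a rough sketch, assuming the STUDENT_BUCKET table above has been loaded, the query below summarizes the second bucket; the column aliases are illustrative.

```sql
-- Row count and grade range of the rows that fall into bucket 2 of 3.
SELECT count(*)   AS rows_in_bucket,
       min(grade) AS lowest_grade,
       max(grade) AS highest_grade
FROM STUDENT_BUCKET TABLESAMPLE(BUCKET 2 OUT OF 3 ON grade);
```

Rows with the same grade always land in the same bucket, so the bucket sizes depend on how the grade values are distributed.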
Aggregations
Hive supports aggregation functions such as avg and count.
To compute the average gpa and the number of students:
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
Group by and Having
To use the GROUP BY and HAVING clauses:
SELECT rollno, name,gpa
FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;
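GROUP BY is most often combined with an aggregate function, with HAVING filtering on the aggregated value. The following sketch assumes the STUDENT table from the earlier slides; the thresholds and the alias num_students are arbitrary choices for illustration.

```sql
-- Number of students per gpa value, keeping only gpa values shared by more than one student.
SELECT gpa, count(*) AS num_students
FROM STUDENT
WHERE gpa >= 3.0          -- filters rows before grouping
GROUP BY gpa
HAVING count(*) > 1       -- filters groups after aggregation
ORDER BY gpa DESC;
```

WHERE is applied to individual rows before they are grouped, while HAVING is applied to each group after the aggregate has been computed.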
SerDe
• SerDe stands for Serializer/Deserializer.
• It contains the logic to convert unstructured data into records.
• It is implemented in Java.
• Serializers are used when writing data.
• Deserializers are used at query time (SELECT statement).
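A SerDe is attached to a table with the ROW FORMAT SERDE clause. The sketch below uses OpenCSVSerde, a SerDe bundled with Hive from release 0.14 onward; the table name STUDENT_CSV is hypothetical. This SerDe reads every column as a string, so the columns are declared as STRING here.

```sql
-- Hypothetical table whose rows are parsed from comma-separated text by a SerDe.
CREATE TABLE IF NOT EXISTS STUDENT_CSV (rollno STRING, name STRING, gpa STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;

-- The deserializer runs at query time; cast where a numeric value is needed.
SELECT name, CAST(gpa AS FLOAT) FROM STUDENT_CSV;
```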
Answer a few quick questions …
Fill in the blanks
The metastore consists of ______________ and a ______________.
The most commonly used interface to interact with Hive is ______________.
The default metastore for Hive is ______________.
Metastore contains ______________ of Hive tables.
______________ is responsible for compilation, optimization, and execution
of Hive queries.
Summary please…
Ask a few participants of the learning program to summarize the lecture.
References …
Further Readings
http://en.wikipedia.org/wiki/RCFile
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Thank you