UNIT 5: BIG DATA MODELS


Introduction to NoSQL, Aggregate Data Models, Hbase: Data Model and
Implementations, Hbase Clients Examples, Pig Data Model, Hive Data
Types and File Formats, HiveQL Data Definition - HiveQL Data
Manipulation - HiveQL Queries

Introduction to NoSQL
What is NoSQL?
• NoSQL stands for "Not Only SQL".
• It refers to a non-relational database system designed to handle large volumes of data, especially unstructured, semi-structured, or distributed data.
• NoSQL databases are schema-less, meaning they do not require a predefined schema like traditional RDBMS (Relational Database Management Systems).
Why NoSQL?
Traditional SQL databases have limitations when it comes to:
• Handling big data and high-velocity streaming data.
• Supporting horizontal scalability (adding more servers to handle data).
• Managing unstructured data like images, videos, social media posts, sensor data, etc.
NoSQL solves these issues by offering:
• High scalability
• Flexibility in data modeling
• High performance for specific types of applications (like real-time analytics or IoT)

Characteristics of NoSQL Databases


1. Schema-less Design
o No predefined table structure.
o Each record can have a different structure.
2. Horizontal Scalability
o Easily scalable across multiple servers (distributed systems).
3. High Availability
o Designed for high uptime and fault tolerance.
4. Efficient for Big Data Applications
o Handles petabytes of data and high-throughput applications.
5. Flexible Data Models
o Supports documents, key-value pairs, graphs, wide columns, etc.

Types of NoSQL Databases

• Document-based: Stores data in JSON, BSON, or XML format (as documents). Example: MongoDB
• Key-Value: Data stored as key-value pairs; fast lookups using keys. Examples: Redis, Amazon DynamoDB
• Column-based: Stores data in columns rather than rows; great for analytical queries. Example: Apache Cassandra
• Graph-based: Designed to represent relationships between data using nodes and edges. Example: Neo4j

Advantages of NoSQL
• Scalability: Easily handles large data volumes with horizontal scaling.
• Performance: Optimized for specific access patterns.
• Flexibility: No rigid schema; supports varied data types and formats.
• Cost-effective: Often open-source; works well with commodity hardware.
• Suited for Cloud and Big Data applications.
Disadvantages of NoSQL
• Lacks standardization: No common query language like SQL.
• Complex querying: Not ideal for complex joins or transactions.
• Consistency issues: Often follows Eventual Consistency (CAP Theorem).
• Learning curve: Developers familiar with RDBMS may need time to adapt.

When to Use NoSQL?


Use NoSQL when:
• You're dealing with large-scale or real-time data.
• Data is unstructured or semi-structured.
• Your app requires fast performance and scalability.
• There are frequent schema changes.

CAP Theorem and NoSQL


• CAP Theorem: A distributed database can only guarantee two of the following three:
o C – Consistency
o A – Availability
o P – Partition Tolerance
• NoSQL databases typically prioritize Availability and Partition Tolerance, while compromising on strict consistency (eventual consistency).
Real-World Applications of NoSQL
• Social Media Platforms (Facebook, Instagram)
• E-Commerce Sites (Amazon)
• Real-Time Analytics (IoT devices, Recommendation Engines)
• Content Management Systems
• Chat Applications

Aggregate Data Models


Over the last few years, companies have adopted modern applications with flexible data requirements to run their business activities. Relational databases have predefined schemas and cannot satisfy these changing data needs. NoSQL databases provide a schema-less architecture, making it easier for developers to store data flexibly.

Modern applications and websites use data structures that are very different from relational database modeling. Aggregate data models in NoSQL are used to meet these requirements and store such data appropriately.

Aggregate data models in NoSQL databases allow easy handling of complex, nested records. In this article, you will learn about aggregate data models in NoSQL databases, their different types, and their use cases. You will also go through an example of aggregate data models in NoSQL.

What is a NoSQL Database?


A NoSQL Database, also known as a non-SQL or non-relational database, is a non-tabular database that stores data differently than the tabular relations used in relational databases. Companies widely use NoSQL databases for big data and real-time web applications. NoSQL databases offer a flexible schema, and providing one up front is not mandatory. Since modern applications use a variety of changing data, NoSQL databases are best suited for them.

NoSQL databases offer a simple design, horizontal scaling for clustering machines, and reduce the object-relational impedance mismatch. They use different data structures from those used by relational databases, making some operations faster. NoSQL databases are designed to be flexible, scalable, and capable of rapidly responding to the data management demands of modern businesses.

Key Features of NoSQL Database

Some of the main features of the NoSQL Database are listed below:

• Horizontal Scaling: NoSQL databases can scale horizontally by adding nodes to share the load. As the data grows, hardware can be added while scalability is preserved.
• Performance: Users can increase the performance of a NoSQL database by adding servers to the cluster.
• Flexible Schema: NoSQL databases do not require a fixed schema the way SQL databases do. Documents in the same collection do not need to have the same set of fields and data types.
• High Availability: Unlike relational databases, which rely on primary and secondary nodes for fetching data, many NoSQL databases use a masterless, replicated architecture in which any node can serve requests.

What are Aggregate Data Models in NoSQL?


Aggregate means a collection of objects that is treated as a unit. In NoSQL databases, an aggregate is a collection of data that interacts as a unit. Moreover, these units of data, or aggregates, form the boundaries for ACID operations.

Aggregate data models make it easier for databases to manage data storage over clusters, as an aggregate can reside as a whole on any one of the machines. Whenever data is retrieved from the database, the entire aggregate comes along with it.

NoSQL databases generally do not support ACID transactions that span multiple aggregates; atomicity is guaranteed only within a single aggregate. With the help of aggregate data models, you can still easily perform OLAP operations on the database.

You can achieve high efficiency with aggregate data models in a NoSQL database if the data transactions and interactions take place within the same aggregate.

Types of Aggregate Data Models in NoSQL Databases


Aggregate data models in NoSQL are classified into the four major types listed below:

1. Key-Value Model

The key-value data model stores each aggregate as a value that is accessed or fetched using a key (an ID). The database treats the aggregate as an opaque unit: it can only be read or written as a whole, by its key. A minimal sketch follows the use cases below.

Use Cases:

• These aggregate data models are used for storing user session data.
• Key-value data models are used for maintaining schema-less user profiles.
• They are used for storing user preferences and shopping cart data.
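A minimal, product-neutral sketch of key-value access (the session key and fields are illustrative assumptions):

import json

# A key-value store behaves like a dictionary: the whole aggregate is
# the value, and the only way to reach it is by its key.
store = {}

# Store a session aggregate under a hypothetical session-ID key.
store["session:1234"] = json.dumps({"user": "harini", "cart": ["pen", "book"]})

# Reads fetch the aggregate as a whole; the store cannot query inside it.
session = json.loads(store["session:1234"])
print(session["cart"])  # ['pen', 'book']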

2. Document Model

The document data model allows access to parts of an aggregate. Unlike a pure key-value store, the database can query inside the documents it stores and retrieves, which can be XML, JSON, BSON, etc. In return, there are some restrictions on the structure and data types of the aggregates, since the database relies on that structure. A sketch contrasting this with the key-value model follows the use cases below.

Use Cases:

• Document data models are widely used in e-commerce platforms.
• They are used for storing data from content management systems.
• Document data models are well suited for blogging and analytics platforms.
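A minimal sketch of querying inside document aggregates (the order documents are illustrative assumptions):

# Unlike a key-value store, a document store can reach inside the
# aggregate: here we filter and project on fields of stored documents.
orders = [
    {"id": 99, "customer": "Harini", "total": 120.0},
    {"id": 100, "customer": "Asha", "total": 80.0},
]

# Select only the parts of each aggregate we need, filtering on a field.
big_orders = [o["id"] for o in orders if o["total"] > 100]
print(big_orders)  # [99]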

3. Column Family Model


Column family is an aggregate data model with a Bigtable-style structure, often referred to as a column store. It is also called a two-level map because it offers a two-level aggregate structure: the first level contains keys that act as row identifiers and select the aggregate, while the second-level values are referred to as columns. A sketch of this two-level structure follows.
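A minimal, product-neutral sketch of the two-level map (row keys and columns are illustrative):

# Two-level aggregate structure of a column-family store:
# the first level selects the row (the aggregate), the second the columns.
rows = {
    "user42": {                # row key (first level)
        "name": "Harini",      # columns (second level)
        "city": "Chennai",
    },
    "user43": {
        "name": "Asha",        # sparse: rows need not share the same columns
    },
}

# Select the aggregate by row key, then one column within it.
print(rows["user42"]["city"])  # Chennai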

Use Cases:

• Column-family data models are used in systems that maintain counters.
• They are used for services that have expiring usage.
• They are used in systems that have heavy write requests.

4. Graph-Based Model

Graph-based data models store data in nodes that are connected by edges. These models are widely used for storing huge volumes of complex aggregates and multidimensional data with many interconnections. A small sketch follows the use cases below.

Use Cases:

• Graph-based data models are used in social networking sites to store interconnections.
• They are used in fraud detection systems.
• This data model is also widely used in networks and IT operations.
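A minimal sketch of nodes and edges with a one-hop traversal (the follow graph is an illustrative assumption):

# A graph model stores nodes connected by edges; queries traverse
# relationships instead of joining tables.
follows = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
}

# Friends-of-friends for alice: nodes one hop beyond her connections.
fof = {f for friend in follows["alice"] for f in follows[friend]}
print(fof - set(follows["alice"]) - {"alice"})  # {'carol'}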

Steps to Build Aggregate Data Models in NoSQL Databases


Now that you have a brief knowledge of aggregate data models in NoSQL databases, this section walks through an example of how to design them. For this, the data model of an e-commerce website will be used to explain aggregate data models in NoSQL.

This example of the E-Commerce Data Model has two main aggregates – customer and order. The
customer contains data related to billing addresses while the order aggregate consists of ordered items,
shipping addresses, and payments. The payment also contains the billing address.

Notice that a single logical address record appears three times in the data, but its value is copied each time it is used. The whole address can be copied into an aggregate as needed. There is no predefined format for drawing the aggregate boundaries; it depends solely on how you want to manipulate the data.

The Data Model for customer and order would look like this.
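A minimal sketch in Python-dict form, with illustrative field values (assumed for this example only):

# Customer aggregate: holds the billing address.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_address": [{"city": "Chicago"}],
}

# Order aggregate: holds ordered items, shipping address, and payment;
# the payment repeats the billing address, so the same logical address
# is copied into three places rather than shared by reference.
order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [{"product": "NoSQL Distilled", "price": 32.45}],
    "shipping_address": [{"city": "Chicago"}],
    "order_payment": [{
        "card_number": "1000-1000",
        "billing_address": {"city": "Chicago"},
    }],
}

print(order["order_payment"][0]["billing_address"])  # the copied address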

In these aggregate data models, if you want to access a customer along with all of the customer's orders at once, then designing a single aggregate is preferable. But if you want to access a single order at a time, you should have separate aggregates for each order. It is very context-specific.

Aggregate Data Models


Aggregate data models are a way of structuring data where related information is grouped
together as a single unit called an aggregate. This concept is different from traditional
relational databases where data is normalized and spread across multiple tables. In NoSQL
systems, especially document-based and key-value databases, aggregates allow for storing all
relevant information in one place, which helps improve read and write performance by
reducing the need for joins and complex queries.
Each aggregate is treated as an independent data unit, which means operations like reading,
writing, or updating can be done more efficiently. For example, in a document store like
MongoDB, a complete order record—including customer details, items, prices, and shipping
address—can be stored in one document instead of spreading it across several relational
tables. This design supports scalability and flexibility, making it well-suited for big data and
cloud-based applications.
Aggregate data models are most commonly used in:
• Document Stores (e.g., MongoDB)
• Key-Value Stores (e.g., Redis, DynamoDB)
• Column-Family Stores (e.g., Cassandra, HBase)
One of the powerful systems that supports aggregate-style storage, especially for large-scale
data, is HBase.

HBase (Hadoop Database)


HBase is a column-family NoSQL database built on top of the Hadoop Distributed File
System (HDFS). It is modeled after Google’s Bigtable and is designed to handle large
amounts of sparse data (data with many empty fields) across a distributed environment. It
integrates seamlessly with the Hadoop ecosystem and supports real-time read/write access
to big data.
HBase organizes data into tables, similar to relational databases, but the internal structure is quite different. Each table is made up of rows and column families. A column family groups related columns, and each row has a unique row key. Data within HBase is stored as key-value pairs, where the key is a combination of row key, column family, column qualifier, and timestamp, allowing for versioning of data. A sketch of this cell addressing follows.
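A minimal conceptual sketch of that key structure (values and timestamps are illustrative; this is not the HBase API):

# Conceptually, every HBase cell is addressed by
# (row key, column family, column qualifier, timestamp) -> value.
cells = {
    ("row1", "personal", "name", 1713000000): b"Harini",
    ("row1", "personal", "name", 1713000500): b"Harini R",  # newer version
    ("row1", "personal", "age", 1713000000): b"21",
}

# A read of row1 / personal:name normally returns the latest version.
latest = max(
    (k for k in cells if k[:3] == ("row1", "personal", "name")),
    key=lambda k: k[3],
)
print(cells[latest])  # b'Harini R'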
One of the key advantages of HBase is its horizontal scalability—it can handle billions of
rows and millions of columns by distributing data across multiple nodes. HBase is ideal for
random, real-time access to big data, unlike Hadoop’s MapReduce which is more batch-
oriented. It also supports strong consistency, which sets it apart from some eventually
consistent NoSQL databases.
HBase is widely used in scenarios where fast reads and writes of large volumes of data are
required, such as:
• Sensor and IoT data ingestion
• Social media analytics
• Time-series databases
• Messaging and recommendation systems
Aggregate data models simplify data access by bundling related data, and technologies like
HBase take this a step further by enabling scalable, consistent, and real-time data operations
in big data environments.

Data Model and Implementation


In the context of NoSQL and big data systems, a data model defines how data is structured,
stored, and accessed in a database. Unlike relational databases that use a rigid tabular
structure with predefined schemas, NoSQL databases offer flexible and dynamic data
models suited for large-scale, diverse, and rapidly changing datasets. The choice of data
model influences how efficiently the system handles queries, updates, and storage. NoSQL
data models are generally designed to optimize for performance, scalability, and simplicity
in managing massive data loads.
There are four main types of data models in NoSQL:
1. Document Model – Stores data as structured documents (usually in JSON or BSON).
Each document is self-describing and can have a unique structure. Ideal for use cases
like user profiles, orders, and blogs. (e.g., MongoDB)
2. Key-Value Model – The simplest model where data is stored as a pair of unique keys
and their corresponding values. Great for caching and session data. (e.g., Redis)
3. Column-Family Model – Data is organized into column families and rows, enabling
efficient storage and access of sparse data. Suited for analytical and wide-table use
cases. (e.g., HBase, Cassandra)
4. Graph Model – Designed to represent relationships between data using nodes and
edges. Best for recommendation engines, social networks, and fraud detection. (e.g.,
Neo4j)

Implementation of a data model involves designing the schema (if any), deciding on indexing strategies, and determining how data will be partitioned and replicated across distributed nodes. For example, in HBase, data is presented in a table format but internally managed using row keys, column families, and timestamps. The design choices depend on query patterns: HBase favors denormalized models that reduce read complexity by storing related data together in a row. Row-key design in particular follows the query pattern, as the sketch below illustrates.
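A minimal sketch of query-driven row-key design (the composite sensor-ID/timestamp scheme is an assumption for illustration, not a prescribed HBase layout):

import time

def make_rowkey(sensor_id: str, ts: int) -> bytes:
    """Compose a row key so one sensor's newest readings sort first."""
    # Reversing the timestamp makes recent rows sort ahead of older ones,
    # so a prefix scan on the sensor ID reads the newest data first.
    reversed_ts = 2**63 - 1 - ts
    return f"{sensor_id}#{reversed_ts:020d}".encode()

print(make_rowkey("sensor-7", int(time.time())))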
Efficient implementation also includes tuning performance through configurations like
sharding (data partitioning), replication (for fault tolerance), and caching. Understanding
access patterns early on is critical, because unlike RDBMS, schema and indexing decisions in
NoSQL can significantly impact both performance and storage.
The data model forms the foundation of how data is represented, while implementation deals
with how this model is realized in practice within a NoSQL system to ensure speed,
scalability, and consistency.

HBase Clients – In Detail with Examples


HBase clients are the interfaces or libraries that allow users and applications to interact with
the HBase database. These clients enable reading from and writing to HBase tables by
communicating with the HBase server using the appropriate APIs. Depending on the
programming environment and use case, HBase supports multiple types of clients such as
Java APIs, Thrift, REST, and Avro, making it flexible for integration into various
applications.
1. Java Client (Native API)
The most powerful and widely used client for HBase is its native Java API. Since HBase is
built in Java, the Java client provides direct access to all HBase functionalities, including
advanced operations like scanning large datasets, filtering, and handling bulk loads.
Example (Java API):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Connect to the cluster and open the "student" table.
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("student"));

// Build a Put for row "row1" with two cells in the "personal" family.
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Harini"));
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("age"), Bytes.toBytes("21"));

// Write the row and release the table handle.
table.put(put);
table.close();

This snippet connects to an HBase table named student and inserts data into the personal
column family. Java clients offer full control but require familiarity with the HBase Java
classes.

2. REST Client
HBase also offers a REST API via a gateway service (historically called Stargate), which allows clients to interact with HBase using standard HTTP methods like GET, PUT, POST, and DELETE. REST clients are useful for lightweight applications or when integrating with systems not written in Java.

Example (using cURL):

curl -X PUT \
  -H "Content-Type: application/json" \
  -d '{"Row":[{"key":"cm93MQ==","Cell":[{"column":"cGVyc29uYWw6bmFtZQ==","$":"SGFyaW5p"}]}]}' \
  http://localhost:8080/student/fakerow

Here, data is inserted using the REST interface; the row key ("row1"), column name ("personal:name"), and value ("Harini") are all base64 encoded in the JSON body.
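A companion sketch reading the row back in Python (assuming the REST gateway runs on localhost:8080 and the requests library is installed):

import base64
import requests

# Fetch row1 of the student table through the REST gateway.
resp = requests.get(
    "http://localhost:8080/student/row1",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Every key, column, and value in the response is base64 encoded.
for row in resp.json()["Row"]:
    for cell in row["Cell"]:
        column = base64.b64decode(cell["column"]).decode()
        value = base64.b64decode(cell["$"]).decode()
        print(column, "=", value)  # e.g. personal:name = Harini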
3. Thrift Client
The Thrift client is a cross-language interface that allows HBase operations from languages like Python, PHP, C++, Ruby, etc. Thrift is especially helpful when you're building applications in non-Java environments.
Example (Python via Thrift):
You'd first need to generate Python bindings using Apache Thrift, and then you can do
something like:
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase                    # module generated by Apache Thrift
from hbase.ttypes import ColumnDescriptor  # module generated by Apache Thrift

# Open a buffered binary connection to the HBase Thrift gateway (port 9090).
transport = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)

transport.open()

# Create a table named my_table with a single column family "cf".
client.createTable('my_table', [ColumnDescriptor(name='cf:')])
transport.close()
This lets a Python program create and interact with an HBase table using Thrift.

4. Avro Client
Apache Avro is another serialization system supported by HBase, mainly used for remote
procedure calls (RPCs). Although less commonly used today, it still provides a way for inter-
process communication using a compact and efficient binary format.
5. HBase Shell
Though not a “client” in the programming sense, the HBase Shell is a command-line
interface that allows you to interact with HBase easily. It’s often used for administrative tasks
and quick testing.
Example (HBase Shell commands):
create 'student', 'personal'
put 'student', 'row1', 'personal:name', 'Harini'
get 'student', 'row1'
The shell is easy to use and excellent for learning and debugging.
Summary

Client Type     Language Support          Use Case Example
Java API        Java only                 Full-featured enterprise applications
REST API        Any (via HTTP)            Lightweight web/mobile apps
Thrift Client   Python, PHP, Ruby, etc.   Multi-language application integration
Avro Client     Java (mainly)             High-performance RPCs
HBase Shell     CLI                       Testing, debugging, administration

HBase’s rich set of clients makes it extremely flexible and adaptable for various application
needs, ranging from real-time web apps to big data processing pipelines.

Pig Data Model and Its Function


Apache Pig is a high-level platform developed to simplify processing large datasets in
Hadoop using a language called Pig Latin. At its core, the Pig Data Model represents the
way data is structured and manipulated within Pig scripts. It is designed to handle semi-
structured data, offering more flexibility than traditional relational models.
The Pig Data Model is hierarchical and supports both simple and complex data types. The
simplest unit is an Atom, which holds a single value like a number or a string. A Tuple is an
ordered set of fields (like a row), and a Bag is a collection of tuples (like a table but
unordered and allowing duplicates). A Map is a set of key-value pairs that can be used for
accessing specific fields within a record. This hierarchical design enables Pig to work with
deeply nested and complex datasets without requiring a rigid schema, making it ideal for data
exploration and ETL (Extract, Transform, Load) operations.
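These four types map naturally onto ordinary Python structures; a minimal sketch (with illustrative values) of how atoms, tuples, bags, and maps nest:

# Pig's data model expressed with plain Python structures (illustrative):
atom = 85                          # Atom: a single value
record = (1, "Harini", 85)         # Tuple: an ordered set of fields
bag = [                            # Bag: an unordered collection of tuples;
    (1, "Harini", 85),             # duplicates are allowed
    (2, "Asha", 92),
    (2, "Asha", 92),
]
marks = {"math": 90, "eng": 85}    # Map: a set of key-value pairs

# Nesting is allowed, e.g. a tuple holding a bag and a map:
nested = (3, [("pen",), ("book",)], {"city": "Chennai"})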
Functions of Pig Data Model
1. Flexible Schema Handling: Pig allows data to be loaded without strictly defining
schemas. This is useful when working with loosely structured or evolving data
formats.
2. Supports Nested Structures: Unlike traditional RDBMS, Pig can handle complex
and nested structures like bags within bags, tuples within tuples, etc.
3. Efficient Data Manipulation: Pig Latin provides operators like FOREACH,
FILTER, GROUP, and JOIN, which use the data model to perform transformations on
large datasets.
4. Extensibility: Pig supports user-defined functions (UDFs), which can operate on data
model types and extend Pig’s capability to fit specific processing needs.
5. Compatibility with Hadoop: The model is designed to work seamlessly on Hadoop’s
distributed file system (HDFS) and supports parallel execution of data flows.
Example (Pig Latin)
-- Load each line of the file as a tuple with a declared schema.
A = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
-- Keep only the tuples whose marks exceed 80.
B = FILTER A BY marks > 80;
-- Project just the name and marks fields.
C = FOREACH B GENERATE name, marks;
Here, each line of the dataset is treated as a tuple, and the collection of all tuples forms a bag.
Operations like filtering and projection are directly performed on these data model structures.

Apache Hive
What is Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. It allows users to query and manage large datasets
residing in distributed storage using a SQL-like language called HiveQL. It was initially
developed by Facebook and later contributed to the Apache Software Foundation.
Hive is not a database but a data warehousing tool that translates SQL-like queries into
MapReduce jobs, allowing people familiar with SQL to work with big data without writing
complex Java code.

Key Features of Hive


• SQL-like Query Language (HiveQL): Easy for people with RDBMS knowledge.
• Schema on Read: Data is interpreted during reading, not when it is loaded.
• Extensibility: Supports custom User Defined Functions (UDFs).
• Integration with Hadoop: Executes queries using MapReduce, Tez, or Spark.
• Partitioning & Bucketing: Optimizes query performance.
• Metastore: Stores metadata like table schema, partitions, etc.
Architecture of Hive
Hive has a layered architecture:
1. User Interface (UI): Accepts HiveQL queries from users (CLI, JDBC, Web UI).
2. Driver: Handles lifecycle of a HiveQL query (compilation, optimization, execution).
3. Compiler: Converts HiveQL to DAG (Directed Acyclic Graph) of MapReduce jobs.
4. Metastore: Stores metadata like table definitions, partitions, and schemas.
5. Execution Engine: Executes the DAG using Hadoop’s YARN and returns the result.
Hive Data Model
Hive organizes data into the following components:
• Databases: Logical grouping of tables.
• Tables: Structured like RDBMS tables with rows and columns.
• Partitions: Data is divided into parts based on column values (like country, date).
• Buckets: Partitions are further divided into a fixed number of buckets using a hash function on a column.

Data Types in Hive


• Primitive Types: INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING, DATE, etc.
• Complex Types:
o ARRAY<data_type>
o MAP<key_type, value_type>
o STRUCT<col1:type1, col2:type2>
Hive Query Language (HiveQL)
HiveQL is similar to SQL and supports operations such as:
• DDL (Data Definition Language):
CREATE TABLE students (id INT, name STRING, marks FLOAT);
SHOW TABLES;
DROP TABLE students;
• DML (Data Manipulation Language):
LOAD DATA INPATH '/data/students.csv' INTO TABLE students;
INSERT INTO TABLE students VALUES (1, 'Asha', 88.5);
• Querying:
SELECT name, marks FROM students WHERE marks > 80;
Execution Flow of a Hive Query
1. User submits a HiveQL query.
2. The query is parsed by the compiler and translated into a logical plan.
3. Logical plan is optimized (e.g., filters pushed down).
4. Physical plan (MapReduce/Tez jobs) is created.
5. Execution engine runs the jobs and results are stored or returned.
Advantages of Hive
• Easy to learn for SQL users.
• Handles large datasets efficiently with Hadoop.
• Suitable for batch processing.
• Scalable and extensible with UDFs.
• Good for ETL jobs and data summarization.

Limitations of Hive
• Not suitable for real-time querying (high latency due to MapReduce).
• Not ideal for row-level updates or transactions.
• Slower compared to systems like Impala or Presto for low-latency queries.
Sample Example
-- Creating a table
CREATE TABLE emp (
  id INT,
  name STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Loading data into the table
LOAD DATA INPATH '/user/hive/emp.csv' INTO TABLE emp;

-- Querying the table
SELECT name, salary FROM emp WHERE salary > 50000;

Hive vs Traditional RDBMS

Feature            Hive                      RDBMS
Query Language     HiveQL (SQL-like)         SQL
Data Storage       HDFS                      Local Storage
Execution Engine   MapReduce / Tez / Spark   Relational Engine
Schema             Schema-on-Read            Schema-on-Write
Suitable for       Batch Processing          OLTP
Speed              Slower (batch jobs)       Faster (real-time)

Hive Data Types


Hive supports two categories of data types:
1. Primitive Data Types

Data Type   Description                               Example
TINYINT     1-byte signed integer                     127
SMALLINT    2-byte signed integer                     32767
INT         4-byte signed integer                     1234
BIGINT      8-byte signed integer                     9876543210
FLOAT       4-byte single-precision floating point    3.14
DOUBLE      8-byte double-precision floating point    123.456
DECIMAL     Fixed-point with precision & scale        DECIMAL(10,2)
STRING      Sequence of characters                    'Harini'
VARCHAR     Variable-length string (with max length)  VARCHAR(20)
CHAR        Fixed-length string                       CHAR(10)
BOOLEAN     TRUE or FALSE                             TRUE / FALSE
DATE        Date (YYYY-MM-DD)                         2025-04-11
TIMESTAMP   Date and time                             2025-04-11 12:30:00

2. Complex Data Types

Data Type    Description                       Example
ARRAY<T>     Ordered collection of elements    ARRAY<STRING> = ["apple", "banana"]
MAP<K, V>    Key-value pairs                   MAP<STRING, INT> = {"math":90, "eng":85}
STRUCT       Group of named fields             STRUCT<name:STRING, age:INT> = {"Harini", 21}
UNIONTYPE    Holds one value from many types   Rarely used in practice

Hive File Formats


Hive supports multiple file formats for storing and querying data efficiently. Choosing the
right format impacts performance significantly.
1. Text File (Default)
• Extension: .txt, .csv
• Format: Plain text, row-wise
• Pros: Human-readable, easy to use
• Cons: Slow, no compression, inefficient for large datasets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Sequence File
• Binary format that stores key-value pairs
• Faster than text files, supports compression
• Not human-readable
• Good for intermediate data
STORED AS SEQUENCEFILE;
3. RCFile (Record Columnar File)
• Columnar storage format
• Splits rows into columns and stores them together
• Improves performance for analytical queries on selected columns
STORED AS RCFILE;
4. ORC (Optimized Row Columnar) [Highly recommended]
• Best performance for Hive
• Columnar format with compression, indexing, and predicate pushdown
• Saves space and speeds up read operations
STORED AS ORC;
5. Parquet
• Columnar format widely used in Apache Spark and Hive
• Supports complex nested data
• Highly compressed, efficient for big data processing
STORED AS PARQUET;
6. Avro
• Row-based storage with schema support
• Ideal for data exchange between systems
• Schema stored along with data
STORED AS AVRO;

File Format Comparison Table

Format     Type        Compression   Schema Evolution   Best Use Case
Text       Row-based   ❌            ❌                 Simple datasets
Sequence   Row-based   ✅            ❌                 Intermediate storage
RCFile     Columnar    ✅            ❌                 Old columnar option
ORC        Columnar    ✅✅          ✅                 Hive + performance
Parquet    Columnar    ✅✅          ✅                 Spark, cross-platform
Avro       Row-based   ✅            ✅                 Serialization and exchange

HiveQL Data Definitions – Explanations


1. Databases
In Hive, a database is a logical namespace for tables. It helps in organizing large datasets
across multiple projects or domains. Each database is like a folder that can contain multiple
Hive tables, views, and functions.

2. Tables
Hive tables are similar to RDBMS tables but are backed by HDFS files. You define a table's
schema using columns, and each table is mapped to a directory in HDFS where the actual
data is stored.
• Managed Table: Hive owns both metadata and data. Dropping the table deletes the data.
• External Table: Hive manages only metadata. Data remains even if the table is dropped.
3. Partitions
Partitions improve query performance by dividing data based on a specific column (e.g., year,
state, product type). Queries can target specific partitions to avoid scanning unnecessary data.
4. Buckets
Bucketing evenly distributes data into a fixed number of files based on the hash of a column value. It works well with joins and sampling; a conceptual sketch of bucket assignment follows.
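A minimal conceptual sketch of how rows land in buckets (Hive's actual hash functions are engine-internal; this only illustrates the idea):

NUM_BUCKETS = 4

def bucket_for(customer_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Conceptually: bucket = hash(column value) mod number of buckets.
    return hash(customer_id) % num_buckets

# Rows with the same id always land in the same bucket, which is what
# makes bucketed joins and sampling efficient.
for cid in (101, 102, 103, 101):
    print(cid, "-> bucket", bucket_for(cid))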
5. Views
Views are virtual tables created using a query. They don’t store data but allow users to
simplify complex queries and reuse logic.
6. Alter Table
Used to change the schema of a table, such as renaming it, adding or dropping columns, or changing file formats.
7. Drop Table / Database
These commands delete the metadata (and sometimes data) of Hive objects.
Relevant HiveQL Code Examples
1. Create a Database
CREATE DATABASE IF NOT EXISTS sales_db;
2. Create a Managed Table
CREATE TABLE employees (
id INT,
name STRING,
department STRING,
salary FLOAT
);
3. Create an External Table
CREATE EXTERNAL TABLE logs (
user_id STRING,
access_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/logs/';
4. Create a Partitioned Table
CREATE TABLE sales (
id INT,
amount FLOAT
)
PARTITIONED BY (year INT, region STRING);
5. Create a Bucketed Table
CREATE TABLE customers (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

6. Create a View
CREATE VIEW high_salary_employees AS
SELECT name, department FROM employees WHERE salary > 50000;
7. Alter Table Examples
• Rename a table:
ALTER TABLE employees RENAME TO staff;
• Add a column:
ALTER TABLE staff ADD COLUMNS (joining_date DATE);
8. Drop Table / Database
DROP TABLE IF EXISTS staff;

DROP DATABASE IF EXISTS sales_db CASCADE;



HiveQL Queries – Explanations


1. SELECT Queries
The SELECT statement is used to retrieve data from Hive tables. You can select specific
columns or use * to retrieve all.
2. WHERE Clause
Used to filter rows that meet a specific condition. Helps narrow down the dataset before
further processing.
3. ORDER BY and SORT BY
• ORDER BY: Sorts all data and sends it to a single reducer (can be slow).
• SORT BY: Sorts data within each reducer; faster, but no global ordering guarantee.

4. GROUP BY
Used to aggregate values across groups, like finding average salary per department.
5. HAVING Clause
Used to filter results after aggregation (just like in SQL).

6. JOIN Operations
Hive supports joins similar to SQL:
• INNER JOIN: Matches rows from both tables.
• LEFT/RIGHT OUTER JOIN: Returns matched rows, and unmatched ones with NULLs.
• FULL OUTER JOIN: Includes all rows from both sides.
7. UNION and UNION ALL
• UNION: Merges two result sets and removes duplicates.
• UNION ALL: Keeps duplicates.
8. LIMIT
Restricts the number of output rows. Useful for previewing data.
9. Subqueries
Queries inside other queries, either in the SELECT, FROM, or WHERE clause.

HiveQL Queries – Code Examples


1. Simple SELECT
SELECT name, salary FROM employees;
2. SELECT with WHERE Clause
SELECT name FROM employees WHERE salary > 50000;
3. ORDER BY vs SORT BY
-- ORDER BY (global sort - slow)
SELECT * FROM employees ORDER BY salary DESC;

-- SORT BY (distributed sort - faster)
SELECT * FROM employees SORT BY salary DESC;
4. GROUP BY and HAVING
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING avg_salary > 40000;
5. INNER JOIN
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id;
6. LEFT OUTER JOIN
SELECT e.name, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.id;
7. UNION and UNION ALL
-- Removes duplicates
SELECT name FROM employees
UNION
SELECT name FROM interns;

-- Keeps duplicates
SELECT name FROM employees
UNION ALL
SELECT name FROM interns;
8. LIMIT Clause
SELECT * FROM employees LIMIT 10;
9. Subquery Example
SELECT name FROM (
SELECT name, salary FROM employees
WHERE department = 'IT'
) temp
WHERE salary > 60000;
