UNIT 5: BIG DATA MODELS


Introduction to NoSQL, Aggregate Data Models, Hbase: Data Model and
Implementations, Hbase Clients Examples, Pig Data Model, Hive Data
Types and File Formats, HiveQL Data Definition - HiveQL Data
Manipulation - HiveQL Queries

Introduction to NoSQL
What is NoSQL?
• NoSQL stands for "Not Only SQL".
• It refers to a non-relational database system designed to handle large volumes of data, especially unstructured, semi-structured, or distributed data.
• NoSQL databases are schema-less, meaning they do not require a predefined schema like traditional RDBMS (Relational Database Management Systems).
Why NoSQL?
Traditional SQL databases have limitations when it comes to:
• Handling big data and high-velocity streaming data.
• Supporting horizontal scalability (adding more servers to handle data).
• Managing unstructured data like images, videos, social media posts, sensor data, etc.
NoSQL solves these issues by offering:
• High scalability
• Flexibility in data modeling
• High performance for specific types of applications (like real-time analytics or IoT)

Characteristics of NoSQL Databases


1. Schema-less Design
o No predefined table structure.
o Each record can have a different structure.
2. Horizontal Scalability
o Easily scalable across multiple servers (distributed systems).
3. High Availability
o Designed for high uptime and fault tolerance.
4. Efficient for Big Data Applications
o Handles petabytes of data and high-throughput applications.
5. Flexible Data Models
o Supports documents, key-value pairs, graphs, wide columns, etc.

Types of NoSQL Databases

• Document-based: Stores data in JSON, BSON, or XML format (as documents). Example: MongoDB
• Key-Value: Data stored as key-value pairs; fast lookups using keys. Examples: Redis, Amazon DynamoDB
• Column-based: Stores data in columns rather than rows; great for analytical queries. Example: Apache Cassandra
• Graph-based: Designed to represent relationships between data using nodes and edges. Example: Neo4j

Advantages of NoSQL
• Scalability: Easily handles large data volumes with horizontal scaling.
• Performance: Optimized for specific access patterns.
• Flexibility: No rigid schema; supports varied data types and formats.
• Cost-effective: Often open-source; works well with commodity hardware.
• Suited for Cloud and Big Data applications.
Disadvantages of NoSQL
• Lacks standardization: No common query language like SQL.
• Complex querying: Not ideal for complex joins or transactions.
• Consistency issues: Often follows Eventual Consistency (CAP Theorem).
• Learning curve: Developers familiar with RDBMS may need time to adapt.

When to Use NoSQL?


Use NoSQL when:
• You're dealing with large-scale or real-time data.
• Data is unstructured or semi-structured.
• Your app requires fast performance and scalability.
• There are frequent schema changes.

CAP Theorem and NoSQL


• CAP Theorem: A distributed database can only guarantee two of the following three:
o C – Consistency
o A – Availability
o P – Partition Tolerance
• NoSQL databases typically prioritize Availability and Partition Tolerance, while compromising on strict consistency (eventual consistency).
Real-World Applications of NoSQL
• Social Media Platforms (Facebook, Instagram)
• E-Commerce Sites (Amazon)
• Real-Time Analytics (IoT devices, Recommendation Engines)
• Content Management Systems
• Chat Applications

Aggregate Data Models


Over the last few years, companies have adopted modern applications with flexible data requirements to run their business activities. Relational databases have predefined schemas and cannot satisfy these changing data needs. NoSQL databases provide a schema-less architecture, making it easier for developers to store data flexibly.

Modern applications and websites use data structures that are very different from relational database modeling. Aggregate data models in NoSQL are used to meet these requirements and store such data appropriately.

Aggregate data models in NoSQL databases allow easy handling of complex, nested records. In this article, you will learn about aggregate data models in NoSQL databases, their different types, and their use cases. You will also go through an example of aggregate data models in NoSQL.

What is a NoSQL Database?


A NoSQL Database, also known as a non-SQL or non-relational database, is a non-tabular database that stores data differently than the tabular relations used in relational databases. Companies widely use NoSQL databases for big data and real-time web applications. NoSQL databases offer a flexible schema, and providing one up front is not mandatory. Since modern applications use a variety of changing data, NoSQL databases are best suited for them.

NoSQL databases offer a simple design, horizontal scaling for clustering machines, and reduce the object-relational impedance mismatch. They use different data structures from those used by relational databases, making some operations faster. NoSQL databases are designed to be flexible, scalable, and capable of rapidly responding to the data management demands of modern businesses.

Key Features of NoSQL Database

Some of the main features of the NoSQL Database are listed below:

• Horizontal Scaling: NoSQL databases can scale horizontally by adding nodes to share the load. As the data grows, hardware can be added while scalability is preserved.
• Performance: Users can increase the performance of a NoSQL database by adding servers to the cluster.
• Flexible Schema: NoSQL databases do not require a fixed schema the way SQL databases do. Documents in the same collection do not need to have the same set of fields and data types.
• High Availability: Unlike relational databases, which rely on primary and secondary nodes for fetching data, many NoSQL databases use a masterless, replicated architecture in which any node can serve requests.

What are Aggregate Data Models in NoSQL?


Aggregate means a collection of objects that is treated as a unit. In NoSQL databases, an aggregate is a collection of data that interacts as a unit. Moreover, these units of data, or aggregates, form the boundaries for ACID operations.

Aggregate data models make it easier for databases to manage data storage over clusters, as an aggregate can reside as a whole on any one of the machines. Whenever data is retrieved from the database, the entire aggregate comes along with it.

NoSQL databases generally do not support ACID transactions that span multiple aggregates; atomicity is guaranteed only within a single aggregate. With the help of aggregate data models, you can still easily perform OLAP operations on the database.

You can achieve high efficiency with aggregate data models in a NoSQL database if the data transactions and interactions take place within the same aggregate.

Types of Aggregate Data Models in NoSQL Databases


Aggregate data models in NoSQL are classified into the four major types listed below:

1. Key-Value Model

The key-value data model stores each aggregate as a value that is accessed or fetched using a key (an ID). The database treats the aggregate as an opaque unit: it can only be read or written as a whole, by its key. A minimal sketch follows the use cases below.

Use Cases:

• These aggregate data models are used for storing user session data.
• Key-value data models are used for maintaining schema-less user profiles.
• They are used for storing user preferences and shopping cart data.
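A minimal, product-neutral sketch of key-value access (the session key and fields are illustrative assumptions):

import json

# A key-value store behaves like a dictionary: the whole aggregate is
# the value, and the only way to reach it is by its key.
store = {}

# Store a session aggregate under a hypothetical session-ID key.
store["session:1234"] = json.dumps({"user": "harini", "cart": ["pen", "book"]})

# Reads fetch the aggregate as a whole; the store cannot query inside it.
session = json.loads(store["session:1234"])
print(session["cart"])  # ['pen', 'book']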

2. Document Model

The document data model allows access to parts of an aggregate. Unlike a pure key-value store, the database can query inside the documents it stores and retrieves, which can be XML, JSON, BSON, etc. In return, there are some restrictions on the structure and data types of the aggregates, since the database relies on that structure. A sketch contrasting this with the key-value model follows the use cases below.

Use Cases:

• Document data models are widely used in e-commerce platforms.
• They are used for storing data from content management systems.
• Document data models are well suited for blogging and analytics platforms.
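A minimal sketch of querying inside document aggregates (the order documents are illustrative assumptions):

# Unlike a key-value store, a document store can reach inside the
# aggregate: here we filter and project on fields of stored documents.
orders = [
    {"id": 99, "customer": "Harini", "total": 120.0},
    {"id": 100, "customer": "Asha", "total": 80.0},
]

# Select only the parts of each aggregate we need, filtering on a field.
big_orders = [o["id"] for o in orders if o["total"] > 100]
print(big_orders)  # [99]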

3. Column Family Model


Column family is an aggregate data model with a Bigtable-style structure, often referred to as a column store. It is also called a two-level map because it offers a two-level aggregate structure: the first level contains keys that act as row identifiers and select the aggregate, while the second-level values are referred to as columns. A sketch of this two-level structure follows.
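A minimal, product-neutral sketch of the two-level map (row keys and columns are illustrative):

# Two-level aggregate structure of a column-family store:
# the first level selects the row (the aggregate), the second the columns.
rows = {
    "user42": {                # row key (first level)
        "name": "Harini",      # columns (second level)
        "city": "Chennai",
    },
    "user43": {
        "name": "Asha",        # sparse: rows need not share the same columns
    },
}

# Select the aggregate by row key, then one column within it.
print(rows["user42"]["city"])  # Chennai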

Use Cases:

• Column-family data models are used in systems that maintain counters.
• They are used for services that have expiring usage.
• They are used in systems that have heavy write requests.

4. Graph-Based Model

Graph-based data models store data in nodes that are connected by edges. These models are widely used for storing huge volumes of complex aggregates and multidimensional data with many interconnections. A small sketch follows the use cases below.

Use Cases:

• Graph-based data models are used in social networking sites to store interconnections.
• They are used in fraud detection systems.
• This data model is also widely used in networks and IT operations.
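A minimal sketch of nodes and edges with a one-hop traversal (the follow graph is an illustrative assumption):

# A graph model stores nodes connected by edges; queries traverse
# relationships instead of joining tables.
follows = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
}

# Friends-of-friends for alice: nodes one hop beyond her connections.
fof = {f for friend in follows["alice"] for f in follows[friend]}
print(fof - set(follows["alice"]) - {"alice"})  # {'carol'}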

Steps to Build Aggregate Data Models in NoSQL Databases


Now that you have a brief knowledge of aggregate data models in NoSQL databases, this section walks through an example of how to design them. For this, the data model of an e-commerce website will be used to explain aggregate data models in NoSQL.

This example of the E-Commerce Data Model has two main aggregates – customer and order. The
customer contains data related to billing addresses while the order aggregate consists of ordered items,
shipping addresses, and payments. The payment also contains the billing address.

Notice that a single logical address record appears three times in the data, but its value is copied each time it is used. The whole address can be copied into an aggregate as needed. There is no predefined format for drawing the aggregate boundaries; it depends solely on how you want to manipulate the data.

The Data Model for customer and order would look like this.
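A minimal sketch in Python-dict form, with illustrative field values (assumed for this example only):

# Customer aggregate: holds the billing address.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_address": [{"city": "Chicago"}],
}

# Order aggregate: holds ordered items, shipping address, and payment;
# the payment repeats the billing address, so the same logical address
# is copied into three places rather than shared by reference.
order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [{"product": "NoSQL Distilled", "price": 32.45}],
    "shipping_address": [{"city": "Chicago"}],
    "order_payment": [{
        "card_number": "1000-1000",
        "billing_address": {"city": "Chicago"},
    }],
}

print(order["order_payment"][0]["billing_address"])  # the copied address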

In these aggregate data models, if you want to access a customer along with all of the customer's orders at once, then designing a single aggregate is preferable. But if you want to access a single order at a time, you should have separate aggregates for each order. It is very context-specific.

Aggregate Data Models


Aggregate data models are a way of structuring data where related information is grouped
together as a single unit called an aggregate. This concept is different from traditional
relational databases where data is normalized and spread across multiple tables. In NoSQL
systems, especially document-based and key-value databases, aggregates allow for storing all
relevant information in one place, which helps improve read and write performance by
reducing the need for joins and complex queries.
Each aggregate is treated as an independent data unit, which means operations like reading,
writing, or updating can be done more efficiently. For example, in a document store like
MongoDB, a complete order record—including customer details, items, prices, and shipping
address—can be stored in one document instead of spreading it across several relational
tables. This design supports scalability and flexibility, making it well-suited for big data and
cloud-based applications.
Aggregate data models are most commonly used in:
• Document Stores (e.g., MongoDB)
• Key-Value Stores (e.g., Redis, DynamoDB)
• Column-Family Stores (e.g., Cassandra, HBase)
One of the powerful systems that supports aggregate-style storage, especially for large-scale
data, is HBase.

HBase (Hadoop Database)


HBase is a column-family NoSQL database built on top of the Hadoop Distributed File
System (HDFS). It is modeled after Google’s Bigtable and is designed to handle large
amounts of sparse data (data with many empty fields) across a distributed environment. It
integrates seamlessly with the Hadoop ecosystem and supports real-time read/write access
to big data.
HBase organizes data into tables, similar to relational databases, but the internal structure is quite different. Each table is made up of rows and column families. A column family groups related columns, and each row has a unique row key. Data within HBase is stored as key-value pairs, where the key is a combination of row key, column family, column qualifier, and timestamp, allowing for versioning of data. A sketch of this cell addressing follows.
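A minimal conceptual sketch of that key structure (values and timestamps are illustrative; this is not the HBase API):

# Conceptually, every HBase cell is addressed by
# (row key, column family, column qualifier, timestamp) -> value.
cells = {
    ("row1", "personal", "name", 1713000000): b"Harini",
    ("row1", "personal", "name", 1713000500): b"Harini R",  # newer version
    ("row1", "personal", "age", 1713000000): b"21",
}

# A read of row1 / personal:name normally returns the latest version.
latest = max(
    (k for k in cells if k[:3] == ("row1", "personal", "name")),
    key=lambda k: k[3],
)
print(cells[latest])  # b'Harini R'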
One of the key advantages of HBase is its horizontal scalability—it can handle billions of
rows and millions of columns by distributing data across multiple nodes. HBase is ideal for
random, real-time access to big data, unlike Hadoop’s MapReduce which is more batch-
oriented. It also supports strong consistency, which sets it apart from some eventually
consistent NoSQL databases.
HBase is widely used in scenarios where fast reads and writes of large volumes of data are
required, such as:
• Sensor and IoT data ingestion
• Social media analytics
• Time-series databases
• Messaging and recommendation systems
Aggregate data models simplify data access by bundling related data, and technologies like
HBase take this a step further by enabling scalable, consistent, and real-time data operations
in big data environments.

Data Model and Implementation


In the context of NoSQL and big data systems, a data model defines how data is structured,
stored, and accessed in a database. Unlike relational databases that use a rigid tabular
structure with predefined schemas, NoSQL databases offer flexible and dynamic data
models suited for large-scale, diverse, and rapidly changing datasets. The choice of data
model influences how efficiently the system handles queries, updates, and storage. NoSQL
data models are generally designed to optimize for performance, scalability, and simplicity
in managing massive data loads.
There are four main types of data models in NoSQL:
1. Document Model – Stores data as structured documents (usually in JSON or BSON).
Each document is self-describing and can have a unique structure. Ideal for use cases
like user profiles, orders, and blogs. (e.g., MongoDB)
2. Key-Value Model – The simplest model where data is stored as a pair of unique keys
and their corresponding values. Great for caching and session data. (e.g., Redis)
3. Column-Family Model – Data is organized into column families and rows, enabling
efficient storage and access of sparse data. Suited for analytical and wide-table use
cases. (e.g., HBase, Cassandra)
4. Graph Model – Designed to represent relationships between data using nodes and
edges. Best for recommendation engines, social networks, and fraud detection. (e.g.,
Neo4j)

Implementation of a data model involves designing the schema (if any), deciding on indexing strategies, and determining how data will be partitioned and replicated across distributed nodes. For example, in HBase, data is presented in a table format but internally managed using row keys, column families, and timestamps. The design choices depend on query patterns: HBase favors denormalized models that reduce read complexity by storing related data together in a row. Row-key design in particular follows the query pattern, as the sketch below illustrates.
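A minimal sketch of query-driven row-key design (the composite sensor-ID/timestamp scheme is an assumption for illustration, not a prescribed HBase layout):

import time

def make_rowkey(sensor_id: str, ts: int) -> bytes:
    """Compose a row key so one sensor's newest readings sort first."""
    # Reversing the timestamp makes recent rows sort ahead of older ones,
    # so a prefix scan on the sensor ID reads the newest data first.
    reversed_ts = 2**63 - 1 - ts
    return f"{sensor_id}#{reversed_ts:020d}".encode()

print(make_rowkey("sensor-7", int(time.time())))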
Efficient implementation also includes tuning performance through configurations like
sharding (data partitioning), replication (for fault tolerance), and caching. Understanding
access patterns early on is critical, because unlike RDBMS, schema and indexing decisions in
NoSQL can significantly impact both performance and storage.
The data model forms the foundation of how data is represented, while implementation deals
with how this model is realized in practice within a NoSQL system to ensure speed,
scalability, and consistency.

HBase Clients – In Detail with Examples


HBase clients are the interfaces or libraries that allow users and applications to interact with
the HBase database. These clients enable reading from and writing to HBase tables by
communicating with the HBase server using the appropriate APIs. Depending on the
programming environment and use case, HBase supports multiple types of clients such as
Java APIs, Thrift, REST, and Avro, making it flexible for integration into various
applications.
1. Java Client (Native API)
The most powerful and widely used client for HBase is its native Java API. Since HBase is
built in Java, the Java client provides direct access to all HBase functionalities, including
advanced operations like scanning large datasets, filtering, and handling bulk loads.
Example (Java API):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Connect to the cluster and open the "student" table.
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("student"));

// Build a Put for row "row1" with two cells in the "personal" family.
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Harini"));
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("age"), Bytes.toBytes("21"));

// Write the row and release the table handle.
table.put(put);
table.close();

This snippet connects to an HBase table named student and inserts data into the personal
column family. Java clients offer full control but require familiarity with the HBase Java
classes.

2. REST Client
HBase also offers a REST API via a gateway service (historically called Stargate), which allows clients to interact with HBase using standard HTTP methods like GET, PUT, POST, and DELETE. REST clients are useful for lightweight applications or when integrating with systems not written in Java.

Example (using cURL):

curl -X PUT \
  -H "Content-Type: application/json" \
  -d '{"Row":[{"key":"cm93MQ==","Cell":[{"column":"cGVyc29uYWw6bmFtZQ==","$":"SGFyaW5p"}]}]}' \
  http://localhost:8080/student/fakerow

Here, data is inserted using the REST interface; the row key ("row1"), column name ("personal:name"), and value ("Harini") are all base64 encoded in the JSON body.
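A companion sketch reading the row back in Python (assuming the REST gateway runs on localhost:8080 and the requests library is installed):

import base64
import requests

# Fetch row1 of the student table through the REST gateway.
resp = requests.get(
    "http://localhost:8080/student/row1",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Every key, column, and value in the response is base64 encoded.
for row in resp.json()["Row"]:
    for cell in row["Cell"]:
        column = base64.b64decode(cell["column"]).decode()
        value = base64.b64decode(cell["$"]).decode()
        print(column, "=", value)  # e.g. personal:name = Harini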
3. Thrift Client
The Thrift client is a cross-language interface that allows HBase operations from languages like Python, PHP, C++, Ruby, etc. Thrift is especially helpful when you're building applications in non-Java environments.
Example (Python via Thrift):
You'd first need to generate Python bindings using Apache Thrift, and then you can do
something like:
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase                    # module generated by Apache Thrift
from hbase.ttypes import ColumnDescriptor  # module generated by Apache Thrift

# Open a buffered binary connection to the HBase Thrift gateway (port 9090).
transport = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)

transport.open()

# Create a table named my_table with a single column family "cf".
client.createTable('my_table', [ColumnDescriptor(name='cf:')])
transport.close()
This lets a Python program create and interact with an HBase table using Thrift.

4. Avro Client
Apache Avro is another serialization system supported by HBase, mainly used for remote
procedure calls (RPCs). Although less commonly used today, it still provides a way for inter-
process communication using a compact and efficient binary format.
5. HBase Shell
Though not a “client” in the programming sense, the HBase Shell is a command-line
interface that allows you to interact with HBase easily. It’s often used for administrative tasks
and quick testing.
Example (HBase Shell commands):
create 'student', 'personal'
put 'student', 'row1', 'personal:name', 'Harini'
get 'student', 'row1'
The shell is easy to use and excellent for learning and debugging.
Summary

Client Type     Language Support          Use Case Example
Java API        Java only                 Full-featured enterprise applications
REST API        Any (via HTTP)            Lightweight web/mobile apps
Thrift Client   Python, PHP, Ruby, etc.   Multi-language application integration
Avro Client     Java (mainly)             High-performance RPCs
HBase Shell     CLI                       Testing, debugging, administration

HBase’s rich set of clients makes it extremely flexible and adaptable for various application
needs, ranging from real-time web apps to big data processing pipelines.

Pig Data Model and Its Function


Apache Pig is a high-level platform developed to simplify processing large datasets in
Hadoop using a language called Pig Latin. At its core, the Pig Data Model represents the
way data is structured and manipulated within Pig scripts. It is designed to handle semi-
structured data, offering more flexibility than traditional relational models.
The Pig Data Model is hierarchical and supports both simple and complex data types. The
simplest unit is an Atom, which holds a single value like a number or a string. A Tuple is an
ordered set of fields (like a row), and a Bag is a collection of tuples (like a table but
unordered and allowing duplicates). A Map is a set of key-value pairs that can be used for
accessing specific fields within a record. This hierarchical design enables Pig to work with
deeply nested and complex datasets without requiring a rigid schema, making it ideal for data
exploration and ETL (Extract, Transform, Load) operations.
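These four types map naturally onto ordinary Python structures; a minimal sketch (with illustrative values) of how atoms, tuples, bags, and maps nest:

# Pig's data model expressed with plain Python structures (illustrative):
atom = 85                          # Atom: a single value
record = (1, "Harini", 85)         # Tuple: an ordered set of fields
bag = [                            # Bag: an unordered collection of tuples;
    (1, "Harini", 85),             # duplicates are allowed
    (2, "Asha", 92),
    (2, "Asha", 92),
]
marks = {"math": 90, "eng": 85}    # Map: a set of key-value pairs

# Nesting is allowed, e.g. a tuple holding a bag and a map:
nested = (3, [("pen",), ("book",)], {"city": "Chennai"})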
Functions of Pig Data Model
1. Flexible Schema Handling: Pig allows data to be loaded without strictly defining
schemas. This is useful when working with loosely structured or evolving data
formats.
2. Supports Nested Structures: Unlike traditional RDBMS, Pig can handle complex
and nested structures like bags within bags, tuples within tuples, etc.
3. Efficient Data Manipulation: Pig Latin provides operators like FOREACH,
FILTER, GROUP, and JOIN, which use the data model to perform transformations on
large datasets.
4. Extensibility: Pig supports user-defined functions (UDFs), which can operate on data
model types and extend Pig’s capability to fit specific processing needs.
5. Compatibility with Hadoop: The model is designed to work seamlessly on Hadoop’s
distributed file system (HDFS) and supports parallel execution of data flows.
Example (Pig Latin)
-- Load each line of the file as a tuple with a declared schema.
A = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
-- Keep only the tuples whose marks exceed 80.
B = FILTER A BY marks > 80;
-- Project just the name and marks fields.
C = FOREACH B GENERATE name, marks;
Here, each line of the dataset is treated as a tuple, and the collection of all tuples forms a bag.
Operations like filtering and projection are directly performed on these data model structures.

Apache Hive
What is Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. It allows users to query and manage large datasets
residing in distributed storage using a SQL-like language called HiveQL. It was initially
developed by Facebook and later contributed to the Apache Software Foundation.
Hive is not a database but a data warehousing tool that translates SQL-like queries into
MapReduce jobs, allowing people familiar with SQL to work with big data without writing
complex Java code.

Key Features of Hive


• SQL-like Query Language (HiveQL): Easy for people with RDBMS knowledge.
• Schema on Read: Data is interpreted during reading, not when it is loaded.
• Extensibility: Supports custom User Defined Functions (UDFs).
• Integration with Hadoop: Executes queries using MapReduce, Tez, or Spark.
• Partitioning & Bucketing: Optimizes query performance.
• Metastore: Stores metadata like table schema, partitions, etc.
Architecture of Hive
Hive has a layered architecture:
1. User Interface (UI): Accepts HiveQL queries from users (CLI, JDBC, Web UI).
2. Driver: Handles lifecycle of a HiveQL query (compilation, optimization, execution).
3. Compiler: Converts HiveQL to DAG (Directed Acyclic Graph) of MapReduce jobs.
4. Metastore: Stores metadata like table definitions, partitions, and schemas.
5. Execution Engine: Executes the DAG using Hadoop’s YARN and returns the result.
Hive Data Model
Hive organizes data into the following components:
• Databases: Logical grouping of tables.
• Tables: Structured like RDBMS tables with rows and columns.
• Partitions: Data is divided into parts based on column values (like country, date).
• Buckets: Partitions are further divided into a fixed number of buckets using a hash function on a column.

Data Types in Hive


• Primitive Types: INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING, DATE, etc.
• Complex Types:
o ARRAY<data_type>
o MAP<key_type, value_type>
o STRUCT<col1:type1, col2:type2>
Hive Query Language (HiveQL)
HiveQL is similar to SQL and supports operations such as:
• DDL (Data Definition Language):
CREATE TABLE students (id INT, name STRING, marks FLOAT);
SHOW TABLES;
DROP TABLE students;
• DML (Data Manipulation Language):
LOAD DATA INPATH '/data/students.csv' INTO TABLE students;
INSERT INTO TABLE students VALUES (1, 'Asha', 88.5);
• Querying:
SELECT name, marks FROM students WHERE marks > 80;
Execution Flow of a Hive Query
1. User submits a HiveQL query.
2. The query is parsed by the compiler and translated into a logical plan.
3. Logical plan is optimized (e.g., filters pushed down).
4. Physical plan (MapReduce/Tez jobs) is created.
5. Execution engine runs the jobs and results are stored or returned.
Advantages of Hive
• Easy to learn for SQL users.
• Handles large datasets efficiently with Hadoop.
• Suitable for batch processing.
• Scalable and extensible with UDFs.
• Good for ETL jobs and data summarization.

Limitations of Hive
• Not suitable for real-time querying (high latency due to MapReduce).
• Not ideal for row-level updates or transactions.
• Slower compared to systems like Impala or Presto for low-latency queries.
Sample Example
-- Creating a table
CREATE TABLE emp (
  id INT,
  name STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Loading data into the table
LOAD DATA INPATH '/user/hive/emp.csv' INTO TABLE emp;

-- Querying the table
SELECT name, salary FROM emp WHERE salary > 50000;

Hive vs Traditional RDBMS

Feature            Hive                      RDBMS
Query Language     HiveQL (SQL-like)         SQL
Data Storage       HDFS                      Local Storage
Execution Engine   MapReduce / Tez / Spark   Relational Engine
Schema             Schema-on-Read            Schema-on-Write
Suitable for       Batch Processing          OLTP
Speed              Slower (batch jobs)       Faster (real-time)

Hive Data Types


Hive supports two categories of data types:
1. Primitive Data Types

Data Type   Description                               Example
TINYINT     1-byte signed integer                     127
SMALLINT    2-byte signed integer                     32767
INT         4-byte signed integer                     1234
BIGINT      8-byte signed integer                     9876543210
FLOAT       4-byte single-precision floating point    3.14
DOUBLE      8-byte double-precision floating point    123.456
DECIMAL     Fixed-point with precision & scale        DECIMAL(10,2)
STRING      Sequence of characters                    'Harini'
VARCHAR     Variable-length string (with max length)  VARCHAR(20)
CHAR        Fixed-length string                       CHAR(10)
BOOLEAN     TRUE or FALSE                             TRUE / FALSE
DATE        Date (YYYY-MM-DD)                         2025-04-11
TIMESTAMP   Date and time                             2025-04-11 12:30:00

2. Complex Data Types

Data Type    Description                       Example
ARRAY<T>     Ordered collection of elements    ARRAY<STRING> = ["apple", "banana"]
MAP<K, V>    Key-value pairs                   MAP<STRING, INT> = {"math":90, "eng":85}
STRUCT       Group of named fields             STRUCT<name:STRING, age:INT> = {"Harini", 21}
UNIONTYPE    Holds one value from many types   Rarely used in practice

Hive File Formats


Hive supports multiple file formats for storing and querying data efficiently. Choosing the
right format impacts performance significantly.
1. Text File (Default)
• Extension: .txt, .csv
• Format: Plain text, row-wise
• Pros: Human-readable, easy to use
• Cons: Slow, no compression, inefficient for large datasets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Sequence File
• Binary format that stores key-value pairs
• Faster than text files, supports compression
• Not human-readable
• Good for intermediate data
STORED AS SEQUENCEFILE;
3. RCFile (Record Columnar File)
• Columnar storage format
• Splits rows into columns and stores them together
• Improves performance for analytical queries on selected columns
STORED AS RCFILE;
4. ORC (Optimized Row Columnar) [Highly recommended]
• Best performance for Hive
• Columnar format with compression, indexing, and predicate pushdown
• Saves space and speeds up read operations
STORED AS ORC;
5. Parquet
• Columnar format widely used in Apache Spark and Hive
• Supports complex nested data
• Highly compressed, efficient for big data processing
STORED AS PARQUET;
6. Avro
• Row-based storage with schema support
• Ideal for data exchange between systems
• Schema stored along with data
STORED AS AVRO;

File Format Comparison Table

Format     Type        Compression   Schema Evolution   Best Use Case
Text       Row-based   ❌            ❌                 Simple datasets
Sequence   Row-based   ✅            ❌                 Intermediate storage
RCFile     Columnar    ✅            ❌                 Old columnar option
ORC        Columnar    ✅✅          ✅                 Hive + performance
Parquet    Columnar    ✅✅          ✅                 Spark, cross-platform
Avro       Row-based   ✅            ✅                 Serialization and exchange

HiveQL Data Definitions – Explanations


1. Databases
In Hive, a database is a logical namespace for tables. It helps in organizing large datasets
across multiple projects or domains. Each database is like a folder that can contain multiple
Hive tables, views, and functions.

2. Tables
Hive tables are similar to RDBMS tables but are backed by HDFS files. You define a table's
schema using columns, and each table is mapped to a directory in HDFS where the actual
data is stored.
• Managed Table: Hive owns both metadata and data. Dropping the table deletes the data.
• External Table: Hive manages only metadata. Data remains even if the table is dropped.
3. Partitions
Partitions improve query performance by dividing data based on a specific column (e.g., year,
state, product type). Queries can target specific partitions to avoid scanning unnecessary data.
4. Buckets
Bucketing evenly distributes data into a fixed number of files based on the hash of a column value. It works well with joins and sampling; a conceptual sketch of bucket assignment follows.
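A minimal conceptual sketch of how rows land in buckets (Hive's actual hash functions are engine-internal; this only illustrates the idea):

NUM_BUCKETS = 4

def bucket_for(customer_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Conceptually: bucket = hash(column value) mod number of buckets.
    return hash(customer_id) % num_buckets

# Rows with the same id always land in the same bucket, which is what
# makes bucketed joins and sampling efficient.
for cid in (101, 102, 103, 101):
    print(cid, "-> bucket", bucket_for(cid))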
5. Views
Views are virtual tables created using a query. They don’t store data but allow users to
simplify complex queries and reuse logic.
6. Alter Table
Used to change the schema of a table, such as renaming it, adding or dropping columns, or changing file formats.
7. Drop Table / Database
These commands delete the metadata (and sometimes data) of Hive objects.
Relevant HiveQL Code Examples
1. Create a Database
CREATE DATABASE IF NOT EXISTS sales_db;
2. Create a Managed Table
CREATE TABLE employees (
id INT,
name STRING,
department STRING,
salary FLOAT
);
3. Create an External Table
CREATE EXTERNAL TABLE logs (
user_id STRING,
access_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/logs/';
4. Create a Partitioned Table
CREATE TABLE sales (
id INT,
amount FLOAT
)
PARTITIONED BY (year INT, region STRING);
5. Create a Bucketed Table
CREATE TABLE customers (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

6. Create a View
CREATE VIEW high_salary_employees AS
SELECT name, department FROM employees WHERE salary > 50000;
7. Alter Table Examples
• Rename a table:
ALTER TABLE employees RENAME TO staff;
• Add a column:
ALTER TABLE staff ADD COLUMNS (joining_date DATE);
8. Drop Table / Database
DROP TABLE IF EXISTS staff;

DROP DATABASE IF EXISTS sales_db CASCADE;



HiveQL Queries – Explanations


1. SELECT Queries
The SELECT statement is used to retrieve data from Hive tables. You can select specific
columns or use * to retrieve all.
2. WHERE Clause
Used to filter rows that meet a specific condition. Helps narrow down the dataset before
further processing.
3. ORDER BY and SORT BY
• ORDER BY: Sorts all data and sends it to a single reducer (can be slow).
• SORT BY: Sorts data within each reducer; faster, but no global ordering guarantee.

4. GROUP BY
Used to aggregate values across groups, like finding average salary per department.
5. HAVING Clause
Used to filter results after aggregation (just like in SQL).

6. JOIN Operations
Hive supports joins similar to SQL:
• INNER JOIN: Matches rows from both tables.
• LEFT/RIGHT OUTER JOIN: Returns matched rows, and unmatched ones with NULLs.
• FULL OUTER JOIN: Includes all rows from both sides.
7. UNION and UNION ALL
• UNION: Merges two result sets and removes duplicates.
• UNION ALL: Keeps duplicates.
8. LIMIT
Restricts the number of output rows. Useful for previewing data.
9. Subqueries
Queries inside other queries, either in the SELECT, FROM, or WHERE clause.

HiveQL Queries – Code Examples


1. Simple SELECT
SELECT name, salary FROM employees;
2. SELECT with WHERE Clause
SELECT name FROM employees WHERE salary > 50000;
3. ORDER BY vs SORT BY
-- ORDER BY (global sort - slow)
SELECT * FROM employees ORDER BY salary DESC;

-- SORT BY (distributed sort - faster)
SELECT * FROM employees SORT BY salary DESC;
4. GROUP BY and HAVING
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING avg_salary > 40000;
5. INNER JOIN
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id;
6. LEFT OUTER JOIN
SELECT e.name, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.id;
7. UNION and UNION ALL
-- Removes duplicates
SELECT name FROM employees
UNION
SELECT name FROM interns;

-- Keeps duplicates
SELECT name FROM employees
UNION ALL
SELECT name FROM interns;
8. LIMIT Clause
SELECT * FROM employees LIMIT 10;
9. Subquery Example
SELECT name FROM (
SELECT name, salary FROM employees
WHERE department = 'IT'
) temp
WHERE salary > 60000;
