Unit 5 Notes Big Data
Introduction to NoSQL
What is NoSQL?
NoSQL stands for "Not Only SQL".
It refers to a non-relational database system designed to handle large volumes of
data, especially unstructured, semi-structured, or distributed data.
NoSQL databases are schema-less, meaning they do not require a predefined schema
like traditional RDBMS (Relational Database Management Systems).
Why NoSQL?
Traditional SQL databases have limitations when it comes to:
Handling big data and high-velocity streaming data.
Supporting horizontal scalability (adding more servers to handle data).
Managing unstructured data like images, videos, social media posts, sensor data,
etc.
NoSQL solves these issues by offering:
High scalability
Flexible data models (documents, key-value pairs, graphs, wide columns, etc.)
High performance for specific types of applications (like real-time analytics or IoT)
High availability, with designs aimed at high uptime and fault tolerance
Efficiency for big data applications, handling petabytes of data and high-throughput workloads
Advantages of NoSQL
Scalability: Easily handles large data volumes with horizontal scaling.
Performance: Optimized for specific access patterns.
Flexibility: No rigid schema; supports varied data types and formats.
Cost-effective: Often open-source; works well with commodity hardware.
Suited for Cloud and Big Data applications.
Disadvantages of NoSQL
Lacks standardization: No common query language like SQL.
Complex querying: Not ideal for complex joins or transactions.
Consistency issues: Often provides only eventual consistency, a trade-off explained by the CAP theorem.
Learning curve: Developers familiar with RDBMS may need time to adapt.
Aggregate Data Models in NoSQL
Modern applications and websites use data structures that are very different from relational database modeling. Aggregate data models in NoSQL are used to meet these requirements and to store data appropriately and smoothly. They allow easy handling of complex and nested records. This section covers aggregate data models in NoSQL databases, their different types, and their use cases, along with an example.
NoSQL databases offer a simple design, horizontal scaling across clusters of machines, and a reduced object-relational impedance mismatch. They use data structures different from those used by relational databases, which makes some operations faster. NoSQL databases are designed to be flexible, scalable, and capable of rapidly responding to the data management demands of modern businesses.
Some of the main features of the NoSQL Database are listed below:
Horizontal Scaling: NoSQL databases can scale horizontally by adding nodes to share the load. As the data grows, more hardware can be added while preserving scalability.
Performance: Users can increase the performance of a NoSQL database simply by adding more servers.
Flexible Schema: NoSQL databases do not require a fixed schema the way SQL databases do. Documents in the same collection do not need to have the same set of fields and data types.
High Availability: Unlike relational databases, which use primary and secondary nodes for fetching data, many NoSQL databases use a masterless architecture, so any node can serve requests.
Aggregate data models in NoSQL make it easier for databases to manage data storage over clusters, since an aggregate is a unit of data that can reside on any one of the machines. Whenever data is retrieved from the database, the whole aggregate comes along with it. Aggregate data models generally do not support full ACID transactions: atomicity is guaranteed only within a single aggregate, so updates that span aggregates sacrifice one of the ACID properties. With the help of aggregate data models, you can still perform OLAP-style operations on the database.
Aggregate data models are at their most efficient when data transactions and interactions take place within the same aggregate.
Types of Aggregate Data Models in NoSQL
1. Key-Value Model
The key-value data model uses a key or an ID to access or fetch the aggregate corresponding to that key. In this model the aggregate is opaque to the database: it is stored and retrieved as a single unit, identified only by its key.
Use Cases:
Key-value data models are used for storing user session data.
They are used for maintaining schema-less user profiles.
They are used for storing user preferences and shopping cart data.
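A minimal sketch of the idea in Python, using an in-memory dict in place of a real key-value store (the key format and session fields are illustrative):

# Each aggregate is an opaque value addressed only by its key.
session_store = {}

def put(key, aggregate):
    session_store[key] = aggregate

def get(key):
    return session_store.get(key)

# Store a whole session aggregate and fetch it back by key;
# the store cannot query inside the value, only return it whole.
put("user:1001:session", {"cart": ["book", "pen"], "theme": "dark"})
print(get("user:1001:session"))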
2. Document Model
The document data model allows access to the parts of an aggregate, so the data can be queried in a flexible way. The database stores and retrieves documents, which can be XML, JSON, BSON, etc. There are some restrictions on the data structures and data types of the aggregates used in this model.
Use Cases:
Document data models are commonly used in content management systems and blogging platforms.
They are also used for e-commerce product catalogs and event logging.
3. Column-Family Model
The column-family model is an aggregate data model with a big-table-style layout, usually referred to as a column store. It is also called a two-level map, as it offers a two-level aggregate structure: the first level contains keys that act as row identifiers used to select the aggregate data, while the second-level values are referred to as columns.
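The two-level structure can be pictured as a nested map (a schematic Python sketch; row keys and column names are illustrative):

# First level: row key -> second level: column name -> value
column_family = {
    "user1001": {"name": "Harini", "city": "Chennai", "visits": 42},
    "user1002": {"name": "Arun", "visits": 7},  # rows need not share columns
}
print(column_family["user1001"]["name"])  # pick one column out of a row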
Use Cases:
Column-family data models are used in systems that maintain counters.
They are used for services with expiring usage data.
They suit systems that handle heavy write requests.
4. Graph-Based Model
Graph-based data models store data in nodes that are connected by edges. They are widely used for storing huge volumes of complex aggregates and multidimensional data with many interconnections.
Use Cases:
Graph data models are used in social networks, recommendation engines, and fraud detection, where the relationships between entities matter as much as the entities themselves.
Example of Aggregate Data Models in NoSQL
This example of the E-Commerce Data Model has two main aggregates – customer and order. The
customer contains data related to billing addresses while the order aggregate consists of ordered items,
shipping addresses, and payments. The payment also contains the billing address.
Notice that a single logical address record appears three times in the data, but its value is copied each time it is used: the whole address can be copied into an aggregate as needed. There is no predefined format for drawing aggregate boundaries; it depends solely on how you want to manipulate the data.
The Data Model for customer and order would look like this.
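A sketch of the two aggregates as JSON-style Python dicts (field values are invented for illustration); note the three copies of the address:

customer = {
    "id": 1,
    "name": "Harini",
    "billing_address": {"city": "Chennai"},           # address copy 1
}

order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [{"product": "book", "price": 250.0}],
    "shipping_address": {"city": "Chennai"},          # address copy 2
    "payments": [{
        "card": "visa",
        "billing_address": {"city": "Chennai"},       # address copy 3
    }],
}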
In these aggregate data models, if you want to access a customer along with all of the customer's orders at once, then designing a single aggregate is preferable. But if you want to access a single order at a time, then you should have separate aggregates for each order. It is very context-specific.
Apache HBase
HBase is a distributed, column-oriented NoSQL database built on top of Hadoop's HDFS. An HBase table is organized into column families that group related columns, and each row has a unique row key. Data within HBase is stored as key-value pairs where the key is a combination of row key, column family, column qualifier, and timestamp, allowing for versioning of data.
One of the key advantages of HBase is its horizontal scalability—it can handle billions of
rows and millions of columns by distributing data across multiple nodes. HBase is ideal for
random, real-time access to big data, unlike Hadoop’s MapReduce which is more batch-
oriented. It also supports strong consistency, which sets it apart from some eventually
consistent NoSQL databases.
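Conceptually, each cell is addressed by that composite key, which is what enables versioning. A schematic Python sketch (timestamps and values are made up):

# Every cell is addressed by (row key, column family, qualifier, timestamp)
cells = {
    ("row1", "personal", "name", 1700000001): "Harini",
    ("row1", "personal", "name", 1700000002): "Harini R",  # newer version
}
# A plain read returns the most recent version of the cell
wanted = ("row1", "personal", "name")
latest = max((k for k in cells if k[:3] == wanted), key=lambda k: k[3])
print(cells[latest])  # -> Harini R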
HBase is widely used in scenarios where fast reads and writes of large volumes of data are
required, such as:
Sensor and IoT data ingestion
Social media analytics
Time-series databases
Messaging and recommendation systems
Aggregate data models simplify data access by bundling related data, and technologies like
HBase take this a step further by enabling scalable, consistent, and real-time data operations
in big data environments.
Implementation of a data model involves designing the schema (if any), deciding on the
indexing strategies, and determining how data will be partitioned and replicated across
distributed nodes. For example, in HBase, data is implemented in a table format but
internally managed using row keys, column families, and timestamps. The design choices
depend on query patterns: HBase favors denormalized models to reduce read complexity, storing related data together in a row.
Efficient implementation also includes tuning performance through configurations like
sharding (data partitioning), replication (for fault tolerance), and caching. Understanding
access patterns early on is critical, because unlike RDBMS, schema and indexing decisions in
NoSQL can significantly impact both performance and storage.
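For example, pre-splitting a table at creation time is one such partitioning choice (HBase shell; the table name and split points here are hypothetical):

create 'sales', {NAME => 'cf'}, SPLITS => ['10', '20', '30']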
The data model forms the foundation of how data is represented, while implementation deals
with how this model is realized in practice within a NoSQL system to ensure speed,
scalability, and consistency.
1. Java Client
The Java client is HBase's native API and offers the most control. A minimal sketch using the standard HBase client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Connect to the cluster and open the 'student' table
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("student"));

// Insert a value into the 'personal' column family
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Harini"));
table.put(put);
table.close();
This snippet connects to an HBase table named student and inserts data into the personal
column family. Java clients offer full control but require familiarity with the HBase Java
classes.
2. REST Client
HBase also offers a REST API via a service called stargate, which allows clients to interact
with HBase using standard HTTP methods like GET, PUT, POST, and DELETE. REST
clients are useful for lightweight applications or when integrating with systems not written in
Java.
Example (using cURL):
curl -X PUT \
-H "Content-Type: application/json" \
-d '{"Row":[{"key":"row1","Cell":
[{"column":"cGVyc29uYWw6bmFtZQ==","$":"SGFyaW5p"}]}]}' \
http://localhost:8080/student/fakerow
Here, data is inserted using the REST interface where the column name and value are base64
encoded.
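For reference, those base64 strings decode to the column name and value used in the earlier examples (a quick Python check):

import base64

print(base64.b64decode("cGVyc29uYWw6bmFtZQ=="))  # b'personal:name'
print(base64.b64decode("SGFyaW5p"))              # b'Harini'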
3. Thrift Client
The Thrift client is a cross-language interface that allows HBase operations from languages like Python, PHP, C++, Ruby, etc. Thrift is especially helpful when you're building applications in non-Java environments.
Example (Python via Thrift):
You'd first need to generate Python bindings using Apache Thrift, and then you can do
something like:
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
# Modules generated by Apache Thrift from the HBase IDL:
from hbase import Hbase
from hbase.ttypes import ColumnDescriptor

# Open a buffered binary connection to the HBase Thrift server
transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
transport.open()

# Create a table with a single column family 'cf'
client.createTable('my_table', [ColumnDescriptor(name='cf:')])
transport.close()
This lets a Python program create and interact with an HBase table using Thrift.
4. Avro Client
Apache Avro is another serialization system supported by HBase, mainly used for remote
procedure calls (RPCs). Although less commonly used today, it still provides a way for inter-
process communication using a compact and efficient binary format.
5. HBase Shell
Though not a “client” in the programming sense, the HBase Shell is a command-line
interface that allows you to interact with HBase easily. It’s often used for administrative tasks
and quick testing.
Example (HBase Shell commands):
create 'student', 'personal'
put 'student', 'row1', 'personal:name', 'Harini'
get 'student', 'row1'
The shell is easy to use and excellent for learning and debugging.
Summary
HBase’s rich set of clients makes it extremely flexible and adaptable for various application
needs, ranging from real-time web apps to big data processing pipelines.
Apache Hive
What is Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. It allows users to query and manage large datasets
residing in distributed storage using a SQL-like language called HiveQL. It was initially
developed by Facebook and later contributed to the Apache Software Foundation.
Hive is not a database but a data warehousing tool that translates SQL-like queries into
MapReduce jobs, allowing people familiar with SQL to work with big data without writing
complex Java code.
Buckets: Further divide partitions into a fixed number of buckets using a hash function on a column.
Limitations of Hive
Not suitable for real-time querying (high latency due to MapReduce).
Not ideal for row-level updates or transactions.
Slower compared to systems like Impala or Presto for low-latency queries.
Sample Example
-- Creating a table
CREATE TABLE emp (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
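Loading data into the table and querying it would then look like this (the HDFS path is hypothetical):

-- Load a comma-delimited file into the table
LOAD DATA INPATH '/user/hive/data/emp.csv' INTO TABLE emp;

-- Query it like an ordinary SQL table
SELECT name, salary FROM emp WHERE salary > 30000;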
1. Data Types
Hive supports primitive types (INT, STRING, FLOAT, etc.) as well as complex types:
ARRAY<T>: Ordered collection of elements, e.g., ARRAY<STRING> = ["apple", "banana"]
STRUCT: Group of named fields, e.g., STRUCT<name:STRING, age:INT> = {"Harini", 21}
2. Tables
Hive tables are similar to RDBMS tables but are backed by HDFS files. You define a table's
schema using columns, and each table is mapped to a directory in HDFS where the actual
data is stored.
Managed Table: Hive owns both metadata and data. Dropping the table deletes the
data.
External Table: Hive manages only metadata. Data remains even if the table is
dropped.
3. Partitions
Partitions improve query performance by dividing data based on a specific column (e.g., year,
state, product type). Queries can target specific partitions to avoid scanning unnecessary data.
4. Buckets
Bucketing is used for evenly distributing data into a fixed number of files based on the hash of a column value. It works well with joins and sampling.
5. Views
Views are virtual tables created using a query. They don’t store data but allow users to
simplify complex queries and reuse logic.
6. Alter Table
Used to change the schema of a table, such as renaming it, adding or dropping columns, or changing file formats.
7. Drop Table / Database
These commands delete the metadata (and sometimes data) of Hive objects.
Relevant HiveQL Code Examples
1. Create a Database
CREATE DATABASE IF NOT EXISTS sales_db;
2. Create a Managed Table
CREATE TABLE employees (
id INT,
name STRING,
department STRING,
salary FLOAT
);
3. Create an External Table
CREATE EXTERNAL TABLE logs (
user_id STRING,
access_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/logs/';
4. Create a Partitioned Table
CREATE TABLE sales (
id INT,
amount FLOAT
)
PARTITIONED BY (year INT, region STRING);
5. Create a Bucketed Table
CREATE TABLE customers (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;
6. Create a View
CREATE VIEW high_salary_employees AS
SELECT name, department FROM employees WHERE salary > 50000;
7. Alter Table Examples
Rename a table:
ALTER TABLE employees RENAME TO staff;
Add a column:
ALTER TABLE staff ADD COLUMNS (joining_date DATE);
8. Drop Table / Database
DROP TABLE IF EXISTS staff;
-- CASCADE also removes any tables still inside the database
DROP DATABASE IF EXISTS sales_db CASCADE;
Querying Data in Hive
4. GROUP BY
Used to aggregate values across groups, like finding average salary per department.
5. HAVING Clause
Used to filter results after aggregation (just like in SQL).
6. JOIN Operations
Hive supports joins similar to SQL:
INNER JOIN: Matches rows from both tables.
LEFT/RIGHT OUTER JOIN: Returns matched rows and unmatched ones with
NULLs.
FULL OUTER JOIN: Includes all rows from both sides.
7. UNION and UNION ALL
UNION: Merges two result sets and removes duplicates.
UNION ALL: Keeps duplicates.
8. LIMIT
Restricts the number of output rows. Useful for previewing data.
9. Subqueries
Queries inside other queries, either in the SELECT, FROM, or WHERE clause.
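Illustrative HiveQL for items 4-6 above, reusing the employees table from earlier (dept_locations is a hypothetical table added for the join):

-- 4. GROUP BY: average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- 5. HAVING: keep only groups whose average salary exceeds 50000
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;

-- 6. JOIN: match employees to rows of the dept_locations table
SELECT e.name, d.city
FROM employees e
JOIN dept_locations d ON (e.department = d.department);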
7. UNION ALL Example
-- Keeps duplicates
SELECT name FROM employees
UNION ALL
SELECT name FROM interns;
8. LIMIT Clause
SELECT * FROM employees LIMIT 10;
9. Subquery Example
SELECT name FROM (
SELECT name, salary FROM employees
WHERE department = 'IT'
) temp
WHERE salary > 60000;