Course: MSc DS
Advanced Database Management Systems
Module: 3
Learning Objectives:
1. Understand the distinct roles of MongoDB and HBase within
NoSQL and Hadoop ecosystems.
2. Efficiently design schemas in both databases, noting
benefits and trade-offs.
3. Identify when to use MongoDB vs. HBase based on
specific scenarios.
4. Compare data models, scalability, and consistency
mechanisms.
5. Engage in practical exercises to apply theoretical concepts.
6. Decide which database best suits a given project's needs.
Structure:
3.1 MongoDB: A Document-oriented NoSQL Database
3.2 Deep Dive into HBase
3.3 Comparison Between MongoDB and HBase
3.4 Summary
3.5 Keywords
3.6 Self-Assessment Questions
3.7 Case Study
3.8 References
3.1 MongoDB: A Document-oriented NoSQL Database
MongoDB is an open-source, document-oriented NoSQL database
that offers flexibility, scalability, and the ability to store large
amounts of data in a schema-free environment. Unlike traditional
relational databases which rely on structured tables and rows,
MongoDB stores data in flexible, JSON-like documents, allowing
for varying fields and data structures. Its popularity among
developers and businesses alike stems from its robust features
and its ability to handle big data challenges effectively.
What is MongoDB?
● Document-Oriented Storage: Data is stored as "documents"
in collections instead of rows in tables, offering a flexible
schema.
● JSON-like Documents: Uses the BSON (Binary JSON) format,
which supports embedded arrays and nested sub-documents for
rich data representation.
● NoSQL: "Not only SQL"; rather than SQL, MongoDB exposes a
method-based query language (e.g. collection methods such as
find() and aggregate()).
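To make the document model concrete, the sketch below uses a plain Python dict to stand in for a BSON document; the field names and values are purely illustrative, not from any real MongoDB collection.

```python
import json

# An illustrative MongoDB-style document: field/value pairs with an
# embedded array ("tags") and a nested sub-document ("address").
user_doc = {
    "_id": 101,
    "name": "Asha Rao",
    "tags": ["premium", "beta-tester"],
    "address": {"city": "Mumbai", "pin": "400001"},
}

# To the human eye the document reads as JSON; MongoDB stores the
# equivalent structure in binary form (BSON) behind the scenes.
print(json.dumps(user_doc, indent=2))
```

Note that a single document like this can carry structure that would span several joined tables in an RDBMS.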
History and Evolution of MongoDB
MongoDB was developed by Dwight Merriman and Eliot Horowitz,
who faced scalability and agility challenges with traditional
relational database approaches while building web applications at
DoubleClick, an online advertising company. The name
"MongoDB" stems from the word "humongous", referencing its
ability to handle large amounts of data. Since its inception in 2007,
MongoDB, Inc. (previously 10gen) has continued to refine and
improve MongoDB, ensuring it remains at the forefront of NoSQL
databases. With numerous versions and updates over the years,
MongoDB has consistently provided developers with tools needed
for vast and complex data solutions.
Key Features of MongoDB
● High Performance: MongoDB provides high-performance data
persistence, especially for read and write operations.
● Scalability: Through sharding, MongoDB scales out by
partitioning data across many servers.
● Flexibility: Allows dynamic schema design, enabling the
addition of new fields without disrupting existing data.
● Replication: Supports server-to-server data replication,
ensuring high availability and fault tolerance.
● Full Index Support: For faster queries, MongoDB can index
any field in a document.
● Aggregation Framework: Allows data transformation and
computation tasks to be performed.
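The idea behind an aggregation stage such as $group with $sum can be mimicked in plain Python, with no driver or server required. The "orders" collection and its fields below are hypothetical, and this is a conceptual sketch of what such a pipeline stage computes, not MongoDB's actual implementation.

```python
from collections import defaultdict

# Hypothetical "orders" collection: each record is a document (dict).
orders = [
    {"customer": "alice", "total": 30},
    {"customer": "bob", "total": 20},
    {"customer": "alice", "total": 50},
]

def group_and_sum(docs, key_field, sum_field):
    """Mimic an aggregation stage like
    {"$group": {"_id": "$customer", "spent": {"$sum": "$total"}}}."""
    acc = defaultdict(int)
    for doc in docs:
        acc[doc[key_field]] += doc[sum_field]
    return [{"_id": key, "spent": total} for key, total in acc.items()]

result = group_and_sum(orders, "customer", "total")
# alice: 30 + 50 = 80; bob: 20
```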
Data Modeling in MongoDB
Data modelling in MongoDB is quite distinct from relational
database systems. Given its schema-free nature, data can be
stored based on application needs. Some considerations include:
● Dynamic Schema: Unlike RDBMS, there's no need to define
the structure to create collections. They can be created on
the fly.
● Document Structure: It's essential to consider the type of
queries, updates, and operations that will be run. Depending
on the need, data can be normalised or denormalized.
Understanding BSON (Binary JSON)
BSON, which stands for Binary JSON, is a binary representation of
JSON-like documents. MongoDB uses BSON because:
● It offers more efficient storage and faster scan speeds than
plain JSON.
● It can support additional data types not supported by JSON,
such as Date and binary data types.
However, to the human eye, data stored in MongoDB appears as
JSON. Behind the scenes, this data is stored as BSON, enabling
faster data manipulation and indexing.
MongoDB Collections and Documents
● Collections: Analogous to tables in RDBMS, collections are
groups of MongoDB Documents. They don't enforce any
schema.
● Documents: Within collections, the data is stored as
documents. These are JSON-like structures made up of field
and value pairs.
Schema Design Principles in MongoDB
Given the flexibility of MongoDB's schema-less architecture,
designing an effective schema is crucial:
● Application-driven Schema: Structure data in a way that
complements your application's requirements.
● Think in Documents: Unlike rows in an RDBMS, think about
what data belongs together as a document.
● Balance Needs: Consider trade-offs between normalisation
(reference data) and denormalization (embed data). Factors
include data size, read-write patterns, and data growth.
Embedding vs. Referencing: Making the Right Choice
In MongoDB, you can either embed related data directly within a
single document or reference data stored in another document,
much like a foreign key in RDBMS. Choosing between these
methods is critical:
● Embedding:
Use when:
o You have “contains” relationships between
entities.
o You need to retrieve the whole dataset with a
single query.
o There are one-to-few relationships between
entities.
Benefits: Improved read performance as all data is in a
single location.
● Referencing:
Use when:
o There are one-to-many or many-to-many
relationships.
o Related data grows over time, potentially beyond
the BSON document size limit.
Benefits: Simplified updates as data isn't duplicated
across multiple documents.
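The two designs above can be sketched side by side with plain Python dicts standing in for documents; the blog-post/comment structure is a hypothetical example, not a prescribed schema.

```python
# Embedding: comments live inside the post document, so one query
# retrieves everything — suits "contains" / one-to-few relationships.
post_embedded = {
    "_id": "post1",
    "title": "Intro to NoSQL",
    "comments": [
        {"author": "dev1", "text": "Great read"},
        {"author": "dev2", "text": "Thanks!"},
    ],
}

# Referencing: comments are separate documents pointing back at the
# post by id, much like a foreign key — better when related data
# grows without bound (one-to-many) or is shared across documents.
post_referenced = {"_id": "post1", "title": "Intro to NoSQL"}
comments = [
    {"_id": "c1", "post_id": "post1", "author": "dev1", "text": "Great read"},
    {"_id": "c2", "post_id": "post1", "author": "dev2", "text": "Thanks!"},
]

def comments_for(post, comment_docs):
    """Resolving a reference costs a second lookup (an extra query)."""
    return [c for c in comment_docs if c["post_id"] == post["_id"]]
```

The extra lookup in `comments_for` is exactly the trade-off the text describes: referencing simplifies updates but turns one read into two.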
3.2 Deep Dive into HBase
HBase is a distributed, scalable, big data store, inspired by
Google's Bigtable. It's designed to host very large tables – billions
of rows by millions of columns – atop clusters of commodity
hardware. Unlike traditional relational databases, HBase is a
NoSQL database that's characterised by its column-oriented
design, enabling high-speed read and write access to data.
Overview of HBase:
● Nature of Data: Unlike RDBMSs which store data in rows,
HBase stores data in columns. This column-oriented storage
allows for efficient read/write operations on a subset of the
data.
● Scalability: HBase operates atop the Hadoop Distributed File
System (HDFS) and is designed to be horizontally scalable,
simply by adding more nodes.
● Fault Tolerance: HBase automatically replicates data to
ensure system robustness. It also supports automatic
failover for a seamless user experience.
● Consistency: It provides strong consistency guarantees for
reads and writes. This ensures that once data is written,
subsequent reads will reflect that write.
How HBase Fits into the Hadoop Ecosystem: HBase is an integral
part of the Hadoop ecosystem and provides random, real-time
read/write capabilities. Here's how it dovetails with other Hadoop
components:
● HDFS: HBase relies on the Hadoop Distributed File System
for its storage. HDFS provides the reliable and distributed
storage layer for HBase, ensuring data is replicated across
nodes.
● MapReduce: HBase integrates with MapReduce, enabling
developers to use the MapReduce programming paradigm
for tasks like data analysis and transformation on data
stored in HBase.
● ZooKeeper: HBase employs Apache ZooKeeper for
distributed coordination, which aids in maintaining server
state within the HBase cluster.
The Architecture of HBase:
Understanding the architecture of HBase is vital for leveraging its
full potential:
● Master-Slave Architecture: HBase follows a master-slave
architecture where one HMaster node manages multiple
RegionServer slaves.
● HMaster: It handles DDL operations (like creating or deleting
tables) and coordinates the RegionServers. HMaster ensures
load balancing across RegionServers and manages region
assignment.
● RegionServers: These are the slave daemons responsible for
serving read and write requests, and they manage regions of
data. Each RegionServer runs multiple regions.
● Regions: A region is a subset of a table's rows. It grows over
time and when it exceeds a certain size, it's split into two to
ensure efficient data management.
● WAL (Write Ahead Log): Before data is stored in an HFile,
it’s first written to a WAL. This ensures that in case of a
RegionServer failure, data can be reconstructed.
● HFile: The actual storage location for HBase data. These are
stored on HDFS and hold the data in columnar format.
● MemStore & Block Cache: To improve read and write
performance, HBase employs in-memory storage structures.
MemStore accumulates writes and when full, its content is
flushed to an HFile. Block Cache, on the other hand, is used
for reads, where frequently read data blocks are cached.
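The write path described above — WAL first, then MemStore, then a flush to an HFile — can be sketched as a toy simulation. This is a deliberately simplified model of the mechanism, not HBase's actual implementation; the class name and threshold are invented for illustration.

```python
class MiniRegionStore:
    """Toy model of the HBase write path: append to a write-ahead
    log, buffer in an in-memory MemStore, and flush to an immutable
    'HFile' once the MemStore reaches a size threshold."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # durable log, used for crash recovery
        self.memstore = {}     # in-memory write buffer
        self.hfiles = []       # flushed, immutable files on "HDFS"
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # 1. WAL first: durability
        self.memstore[row_key] = value      # 2. then the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self._flush()                   # 3. flush when full

    def _flush(self):
        # HFiles hold rows sorted by key, so flush in sorted order.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

store = MiniRegionStore(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")   # reaching the threshold triggers a flush
store.put("row3", "c")   # starts filling a fresh MemStore
```

If the process crashed before the flush, the WAL entries would allow the lost MemStore contents to be replayed, which is precisely why the log is written first.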
Data Modeling in HBase:
In traditional relational databases, data modelling is centred on
defining tables, establishing relationships between them, and
normalising the data structure. In contrast, HBase, being a NoSQL,
column-oriented database, emphasises denormalization and
offers a different set of challenges and methodologies for data
modelling.
Concepts of Column Families and Qualifiers:
Column Families:
● Groupings of one or more related columns.
● Each column family is stored separately on disk, which
makes it crucial to design them wisely, taking into
consideration access patterns.
● Columns are physically co-located on storage only if they
are part of the same column family.
● Ideally, infrequently accessed or large columns should be
separated from frequently accessed or smaller ones.
Qualifiers:
● Unlike traditional relational databases where columns
are predefined, in HBase, column qualifiers can be
dynamic.
● Column qualifiers can be thought of as the ‘names’ of
specific columns within a column family.
● This flexible approach allows HBase tables to adapt to
evolving schemas.
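The family:qualifier addressing scheme can be modelled with nested dicts, where each cell is reached by (row key, "family:qualifier"). The table, rows, and column names below are hypothetical stand-ins, not real HBase API calls.

```python
# A cell in HBase is addressed by row key plus column family and
# qualifier. Qualifiers need not be predefined, so two rows may carry
# different columns within the same family.
table = {
    "user#1001": {
        "info:name": "Asha",
        "info:email": "asha@example.com",
        "prefs:seat": "window",   # a qualifier only this row has
    },
    "user#1002": {
        "info:name": "Ravi",      # no prefs:* cells at all — absent
    },                            # cells consume no storage in HBase
}

def get_cell(tbl, row, family, qualifier):
    """Look up one cell; returns None when the cell is absent."""
    return tbl.get(row, {}).get(f"{family}:{qualifier}")
```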
Designing HBase Tables: Row Key Design Strategies:
Row Key Selection: It is critical because HBase stores data in
lexicographical order of the row key, influencing read/write
performance.
● Compound Key: Composing a row key using multiple
attributes can be useful for modelling hierarchical or
multi-dimensional data.
● Salted Key: A strategy where a prefix (or "salt") is
added to the row key to distribute writes across
multiple region servers and avoid hotspotting.
● Reversed Key: Useful for datasets where new records
have monotonically increasing keys, such as
timestamps. Reversing the key helps in evenly
distributing the writes.
● Short and Stable: Keeping keys concise speeds up
access and reduces storage needs.
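The salting strategy above can be sketched in a few lines: a small hash-derived prefix spreads otherwise-adjacent keys across buckets. The bucket count and key format here are illustrative choices, not an HBase-mandated scheme.

```python
import hashlib

def salted_key(row_key, buckets=8):
    """Prefix the key with a hash-derived 'salt' so that keys which
    would otherwise sort adjacently (e.g. sequential IDs) are spread
    across region servers, avoiding a write hotspot."""
    digest = hashlib.md5(row_key.encode()).hexdigest()
    salt = int(digest, 16) % buckets
    return f"{salt:02d}-{row_key}"

# Sequential event IDs no longer cluster into one region:
keys = [salted_key(f"event{i:05d}") for i in range(4)]
```

The cost of salting is that a range scan over the original key order now requires one scan per salt bucket, so it suits write-heavy workloads more than scan-heavy ones.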
Time Series Data Handling in HBase:
HBase's architecture is well-suited for time series data:
● Timestamps: Each cell in HBase has a version marked by its
timestamp. This inbuilt versioning capability makes it
efficient to handle time series data.
● Data Aging: Using HBase's Time-to-Live (TTL) feature, older
data can be automatically pruned, optimising storage for
time series scenarios.
● Row Key Design for Time Series: Combining metric
identifiers with timestamps (perhaps reversed) can provide
efficient time-range scan capabilities.
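A reversed-timestamp row key, as suggested above, can be sketched as follows. The fixed maximum, metric name, and key layout are illustrative assumptions, not a prescribed format.

```python
MAX_TS = 10**13  # fixed ceiling, larger than any epoch-millis value

def ts_row_key(metric_id, epoch_millis):
    """Row key = metric id + reversed timestamp. Subtracting from a
    fixed maximum makes newer events sort FIRST lexicographically,
    so 'latest N readings for a metric' becomes a short forward
    scan from the metric's key prefix."""
    reversed_ts = MAX_TS - epoch_millis
    return f"{metric_id}#{reversed_ts:013d}"

older = ts_row_key("cpu.load", 1_700_000_000_000)
newer = ts_row_key("cpu.load", 1_700_000_100_000)
# Lexicographically, the newer key now sorts before the older one.
```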
Optimising HBase for Query Performance:
● Bloom Filters: They help in determining if an HFile contains a
particular row or column, thus reducing unnecessary disk
seeks and improving read performance.
● Compression: Using compression algorithms like Gzip or
Snappy can help in optimising storage and IO operations,
though the choice might impact CPU usage.
● Block Size Tuning: Adjusting the HFile block size can balance
between memory usage and disk seek costs.
● Caching: Leveraging the Block Cache to store frequently
accessed data in memory can drastically reduce read
latencies.
● Pre-splitting Tables: To avoid region hotspotting, tables can
be pre-split based on expected row key distribution.
● Co-processors: They allow running custom code on region
servers, facilitating operations like data aggregation at
source, reducing data transferred over the network.
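To see why a Bloom filter saves disk seeks, consider this minimal sketch. It captures the essential property — "definitely absent" vs. "possibly present" — but is far simpler than the Bloom filter implementations HBase actually ships.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter: a reader can skip any HFile whose
    filter says a row key is definitely absent; a positive answer
    only means the key *might* be present."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive `hashes` independent bit positions from the key.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bloom = TinyBloom()
bloom.add("row42")
```

A `might_contain` result of False is a guarantee, which is what lets HBase avoid reading an HFile at all; false positives merely cost one unnecessary read.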
3.3 Comparison Between MongoDB and HBase
One of the most intriguing comparisons in the NoSQL world is that
between MongoDB and HBase, two eminent systems with
distinctive features and functionalities.
Data Models: Document vs. Column-family
● MongoDB:
o Document-based: MongoDB is primarily a document-
based NoSQL database. It uses BSON (Binary JSON)
format to store data. In essence, each record in
MongoDB is a document, which can hold arrays and
other documents. This offers high flexibility as fields
can differ from document to document.
o Schema-less: MongoDB is naturally schema-less,
meaning that the database doesn't require a fixed
schema. As a result, as the application's requirements
change, MongoDB can adapt more flexibly.
● HBase:
o Column-family based: HBase is a column-family store.
In this model, data is stored in tables, rows, and
column families. A column family is a collection of
columns and can be thought of as a coarser-grained
unit than individual columns.
o Sparse data storage: HBase thrives in scenarios where
the data schema is sparse. In HBase, absent columns
consume no storage space, which optimises storage
for massive datasets.
Scalability: How MongoDB and HBase Scale Horizontally
MongoDB:
● Sharding: MongoDB achieves horizontal scalability via
sharding. Data is divided into chunks, and these chunks are
distributed across multiple shards or nodes. The distribution
of data ensures that no single node is overwhelmed with
requests.
● Replica sets: To improve availability and fault-tolerance,
MongoDB uses replica sets. A replica set is a group of
MongoDB instances that replicate data, ensuring that if one
node goes down, the system can continue functioning.
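The routing idea behind hashed sharding can be sketched in a few lines. This is a simplified stand-in for MongoDB's actual chunk mechanism (which assigns *ranges* of hashed shard-key values to chunks and migrates them between shards); the shard names are invented.

```python
import hashlib

def shard_for(shard_key_value, shards):
    """Toy hashed routing: hash the shard key and pick a shard by
    modulo — deterministic, so the same key always lands on the
    same shard."""
    h = int(hashlib.sha1(str(shard_key_value).encode()).hexdigest(), 16)
    return shards[h % len(shards)]

shards = ["shard-a", "shard-b", "shard-c"]
placement = {uid: shard_for(uid, shards) for uid in ["u1", "u2", "u3", "u4"]}
```

Hashing the key spreads monotonically increasing values (like timestamps or auto-incremented IDs) evenly, at the cost of making range queries on the shard key scatter across all shards.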
HBase:
● Region-based scaling: HBase scales horizontally by splitting
large tables into regions, and these regions get distributed
among the RegionServers. When a RegionServer reaches its
capacity, the region is split into two to accommodate more
data.
● Distributed architecture: HBase is built on top of Hadoop's
distributed file system (HDFS). This architecture allows
HBase to handle large datasets across a distributed
environment.
Consistency: Trade-offs and Guarantees
MongoDB:
● Eventual consistency: When reads are directed to secondary
replicas, MongoDB offers eventual consistency. Reads from the
primary, however, are strongly consistent.
● Write Concerns: MongoDB provides a "write concern"
feature, which lets users specify the level of
acknowledgment required from the system for write
operations.
HBase:
● Strong consistency: HBase provides strong consistency
guarantees. All reads after a write will see that write. This is
beneficial for applications where consistency is a priority.
● Atomic operations: HBase also provides atomic read-
modify-write operations on rows, ensuring that data
integrity is maintained.
Comparative Analysis of Use Cases
MongoDB:
● Web applications: Due to its flexible schema, MongoDB is
often preferred for web applications where requirements
can change rapidly.
● Content Management Systems: The document-based model
of MongoDB can accommodate varied forms of content,
making it ideal for CMS.
● IoT: MongoDB's ability to handle large volumes of rapidly
changing, diverse data makes it a good fit for IoT use cases.
HBase:
● Time-series data: HBase's columnar structure is conducive
for time-series data, often used in financial analysis or
monitoring systems.
● Massive datasets: Built on HDFS, HBase can handle
petabyte-scale datasets, making it suitable for Big Data
analytics.
● Real-time querying: HBase provides rapid real-time querying
capabilities, making it preferable for systems that require
real-time analytics.
3.4 Summary
❖ MongoDB: A document-oriented NoSQL database that uses
JSON-like documents with optional schemas. It's particularly
favoured for flexibility in schema design and is widely used in
applications requiring quick iterations and flexible data
models.
❖ Data modelling in MongoDB: The process of structuring data
using BSON. It encompasses schema design decisions like
embedding vs. referencing, collections, and document structures.
❖ HBase: A distributed, scalable big data store that runs on top
of the Hadoop Distributed File System (HDFS). HBase provides
real-time read/write access to large datasets, and its
architecture is column-oriented.
❖ Data modelling in HBase: Structuring data in HBase involves
understanding and designing column families, qualifiers, and
row keys. It's crucial to optimise data design for query
performance, especially considering the unique column-family
storage of HBase.
❖ While both databases offer scalability, their data models
differ significantly. MongoDB is document-oriented, while
HBase is column-oriented. They also have different trade-
offs regarding consistency, scalability, and typical use cases.
❖ Case studies: Real-world applications and analyses that delve
into the practical implementations of both MongoDB and HBase,
showcasing their strengths, limitations, and best-fit
scenarios.
3.5 Keywords
● BSON (Binary JSON): BSON is a binary representation of
JSON-like documents. It's used primarily as the data storage
and network transfer format in MongoDB. BSON extends the
JSON model to offer additional data types and to be efficient
for both encoding and decoding.
● Column Families (in HBase): In HBase, a column family is a
collection of columns that are stored together on disk. Each
column family can have any number of columns, and each
column within a column family is identified by a unique
qualifier. Designing column families correctly is critical for
HBase performance since data is read and written to storage
by column families.
● Row Key (in HBase): The Row Key in HBase is the identifier
for a row of data. It's crucial for both data retrieval and
storage. Efficient design of the row key can significantly
impact the performance of HBase operations, as data in
HBase is stored and retrieved in sorted order based on the
row key.
● Document-oriented Database: A document-oriented
database is a type of NoSQL database that is designed to
store, retrieve, and manage document-oriented, or semi-
structured, information. In this context, "document" doesn't
mean traditional documents but refers to data structures
(like JSON or BSON in MongoDB) comprising key-value pairs,
arrays, and nested documents.
● Consistency (in Distributed Systems): Consistency refers to
the state where all nodes in a distributed system show the
same data. In the context of databases like MongoDB and
HBase, it relates to how and when changes made to the data
on one node are reflected on other nodes. Achieving
consistency often comes at the cost of availability or
partition tolerance, as per the CAP theorem.
● Horizontal Scalability: Horizontal scalability is the ability of a
system (like a database) to increase its capacity by adding
more machines or nodes to the system, rather than
upgrading the hardware of an existing machine (which
would be vertical scalability). Both MongoDB and HBase are
designed to scale horizontally, allowing them to handle huge
datasets and high traffic loads by distributing data across
multiple machines.
3.6 Self-Assessment Questions
1. How does MongoDB store its data, and what is the
significance of BSON in this context?
2. What are the key differences in the data modelling
approaches of MongoDB and HBase, especially concerning
schema design and data relationships?
3. Which database is better suited for real-time data analytics,
considering their respective architectures and use cases, and
why?
4. What are the primary trade-offs between MongoDB and
HBase in terms of consistency and scalability?
5. How does the row key design in HBase influence query
performance, and which strategies can be used to optimise
it?
3.7 Case Study
Title: Tokaido Railways: Managing Large-scale Passenger Data
with HBase
Introduction:
In Japan, the Shinkansen, or the "bullet train", is not just a mode
of transport; it's an icon. Tokaido Railways, one of the operators
of the Shinkansen routes, has always been at the forefront of
technological advancements. But, as passenger numbers
skyrocketed, so did the volume of passenger data. This data
included ticket transactions, seating preferences, journey histories,
and more. Traditional relational databases struggled to keep up
with this massive influx of real-time data.
Background:
In 2018, Tokaido Railways sought a solution that would not only
handle this volume but would also enable fast query responses for
operational tasks such as real-time seat allocations, ticketing, and
providing personalised recommendations for passengers. Their
choice was HBase, a distributed, scalable, big data store from the
Hadoop ecosystem.
The shift to HBase brought immediate benefits. Its columnar
storage, designed for sparse data, made it perfect for Tokaido’s
needs. With millions of passengers daily, only a small fraction
would have unique seating preferences or purchase add-ons.
Traditional row-based databases would waste space storing these
sparse attributes, but HBase shone in this regard.
The performance was another area of improvement. Queries that
took minutes in the old system were executed in seconds. This
drastically enhanced the customer experience, as passengers
received instant feedback on seat availability, potential upgrades,
and journey details.
Moreover, as a part of their Data Science initiative, Tokaido
Railways was able to employ machine learning models on the vast
data stored in HBase. This enabled them to predict travel patterns,
optimise train schedules, and even anticipate onboard services
required for specific journeys.
The success of this transition has now made Tokaido Railways a
case study for effective big data management in transportation
sectors across Japan.
Questions:
1. Why did Tokaido Railways decide to shift from traditional
relational databases to HBase?
2. How did HBase's columnar storage prove advantageous for
managing Tokaido Railways' passenger data?
3. Based on the case study, how did the integration of HBase
influence the customer experience and operational
efficiency at Tokaido Railways?
3.8 References
1. "MongoDB: The Definitive Guide" by Kristina Chodorow
2. "HBase: The Definitive Guide" by Lars George
3. "NoSQL Distilled: A Brief Guide to the Emerging World of
Polyglot Persistence" by Pramod J. Sadalage and Martin
Fowler
4. "Designing Data-Intensive Applications: The Big Ideas Behind
Reliable, Scalable, and Maintainable Systems" by Martin
Kleppmann
5. "Mastering MongoDB 4.x: Expert techniques to run high-
volume and fault-tolerant database solutions using
MongoDB 4.x" by Alex Giamas