
Course: MSc DS

Advanced Database Management Systems

Module: 3

Learning Objectives:

1. Understand the distinct roles of MongoDB and HBase within the NoSQL and Hadoop ecosystems.
2. Design schemas efficiently in both databases, noting benefits and trade-offs.
3. Identify when to use MongoDB vs. HBase based on specific scenarios.
4. Compare data models, scalability, and consistency mechanisms.
5. Engage in practical exercises to apply theoretical concepts.
6. Decide which database best suits a given project's needs.

Structure:

3.1 MongoDB: A Document-oriented NoSQL Database
3.2 Deep Dive into HBase
3.3 Comparison Between MongoDB and HBase
3.4 Summary
3.5 Keywords
3.6 Self-Assessment Questions
3.7 Case Study
3.8 References

3.1 MongoDB: A Document-oriented NoSQL Database

MongoDB is an open-source, document-oriented NoSQL database that offers flexibility, scalability, and the ability to store large amounts of data in a schema-free environment. Unlike traditional relational databases, which rely on structured tables and rows, MongoDB stores data in flexible, JSON-like documents, allowing for varying fields and data structures. Its popularity among developers and businesses alike stems from its robust features and its ability to handle big data challenges effectively.

What is MongoDB?

● Document-Oriented Storage: Data is stored as "documents" in collections instead of rows in tables, offering a flexible schema.
● JSON-like Documents: Uses the BSON (Binary JSON) format, allowing embedded arrays and document hierarchies, which enables rich data representation.
● NoSQL: "Not only SQL"; it does not use SQL as its query language, instead exposing a method-oriented query API.
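
To make these points concrete, here is a minimal sketch using pymongo, the official Python driver. The database name "school", the collection name "students", and all field names are assumptions for illustration, not part of these notes.

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (default port 27017).
    client = MongoClient("mongodb://localhost:27017")
    db = client["school"]        # databases are created lazily
    students = db["students"]    # collections are created lazily too

    # Insert a JSON-like document; no schema needs to be declared first.
    students.insert_one({
        "name": "Asha",
        "courses": ["DBMS", "Statistics"],   # embedded array
        "address": {"city": "Pune"},         # embedded sub-document
    })

    # Query with a method-oriented API rather than SQL.
    doc = students.find_one({"name": "Asha"})
    print(doc)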

History and Evolution of MongoDB

MongoDB was developed by Dwight Merriman and Eliot Horowitz, who faced scalability and agility challenges with traditional relational database approaches while building web applications at DoubleClick, an online advertising company. The name "MongoDB" stems from the word "humongous", referencing its ability to handle large amounts of data. Since its inception in 2007, MongoDB, Inc. (previously 10gen) has continued to refine and improve MongoDB, keeping it at the forefront of NoSQL databases. With numerous versions and updates over the years, MongoDB has consistently provided developers with the tools needed for vast and complex data solutions.

Key Features of MongoDB

● High Performance: MongoDB provides high-performance data persistence, especially for read and write operations.
● Scalability: Through sharding, MongoDB scales out by partitioning data across many servers.
● Flexibility: Allows dynamic schema design, enabling the addition of new fields without disrupting existing data.
● Replication: Supports server-to-server data replication, ensuring high availability and fault tolerance.
● Full Index Support: For faster queries, MongoDB can index any field in a document.
● Aggregation Framework: Allows data transformation and computation tasks to be performed (indexing and aggregation are both illustrated in the sketch after this list).
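
As a hedged illustration of the last two features, this pymongo sketch creates an index and runs a small aggregation pipeline; the "students" collection and its fields continue the hypothetical example above.

    from pymongo import MongoClient, ASCENDING

    students = MongoClient()["school"]["students"]

    # Index any field for faster queries; here, an ascending index on "name".
    students.create_index([("name", ASCENDING)])

    # Aggregation pipeline: unwind the embedded "courses" array and
    # count how many students are enrolled in each course.
    pipeline = [
        {"$unwind": "$courses"},
        {"$group": {"_id": "$courses", "enrolled": {"$sum": 1}}},
        {"$sort": {"enrolled": -1}},
    ]
    for row in students.aggregate(pipeline):
        print(row["_id"], row["enrolled"])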

Data Modeling in MongoDB

Data modelling in MongoDB is quite distinct from relational database systems. Given its schema-free nature, data can be stored based on application needs. Some considerations include:

● Dynamic Schema: Unlike in an RDBMS, there is no need to define a structure before creating collections; they can be created on the fly (see the sketch after this list).
● Document Structure: It's essential to consider the types of queries, updates, and operations that will be run. Depending on the need, data can be normalised or denormalized.
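
The sketch below, again using the hypothetical "students" collection, shows the dynamic schema in practice: two documents with different fields coexist in the same collection with no schema migration.

    from pymongo import MongoClient

    students = MongoClient()["school"]["students"]

    # Two documents in the same collection with different shapes:
    # no ALTER TABLE is required to introduce the "scholarship" field.
    students.insert_one({"name": "Ravi", "year": 1})
    students.insert_one({"name": "Mei", "year": 2, "scholarship": True})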

Understanding BSON (Binary JSON)

BSON, which stands for Binary JSON, is a binary representation of JSON-like documents. MongoDB uses BSON because:

● It offers more efficient storage and faster scan speed than plain-text JSON.
● It supports additional data types not available in JSON, such as dates and binary data.

To the human eye, data stored in MongoDB appears as JSON; behind the scenes, it is stored as BSON, enabling faster data manipulation and indexing.
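
For instance, the bson package that ships with pymongo can encode and decode these extended types directly. This minimal sketch (with made-up values) stores a date and raw bytes, neither of which plain JSON can represent.

    import datetime
    import bson

    # A document using BSON-only types: a datetime and raw binary data.
    doc = {
        "captured_at": datetime.datetime(2023, 5, 1, 12, 30),
        "thumbnail": b"\x89PNG\r\n",   # arbitrary bytes
    }

    encoded = bson.encode(doc)           # compact binary representation
    decoded = bson.decode(encoded)       # round-trips with types intact
    print(type(decoded["captured_at"]))  # <class 'datetime.datetime'>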

MongoDB Collections and Documents

● Collections: Analogous to tables in an RDBMS, collections are groups of MongoDB documents. They don't enforce any schema.
● Documents: Within collections, data is stored as documents. These are JSON-like structures made up of field-and-value pairs.

Schema Design Principles in MongoDB

Given the flexibility of MongoDB's schema-less architecture, designing an effective schema is crucial:

● Application-driven Schema: Structure data in a way that complements your application's requirements.
● Think in Documents: Unlike rows in an RDBMS, think about what data belongs together as a document.
● Balance Needs: Consider the trade-offs between normalisation (referencing data) and denormalization (embedding data). Factors include data size, read-write patterns, and data growth.

Embedding vs. Referencing: Making the Right Choice

In MongoDB, you can either embed related data directly within a single document or reference data stored in another document, much like a foreign key in an RDBMS. Choosing between these methods is critical (both are shown in the sketch after this list):

● Embedding:
  Use when:
  o You have "contains" relationships between entities.
  o You need to retrieve the whole dataset with a single query.
  o There are one-to-few relationships between entities.
  Benefits: Improved read performance, as all data is in a single location.

● Referencing:
  Use when:
  o There are one-to-many or many-to-many relationships.
  o Related data grows over time, potentially beyond the BSON document size limit (16 MB per document).
  Benefits: Simplified updates, as data isn't duplicated across multiple documents.
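
The following hedged sketch contrasts the two patterns for a hypothetical blog: comments embedded inside a post document versus comments referencing the post by its _id. Collection and field names are assumptions for illustration.

    from pymongo import MongoClient

    db = MongoClient()["blog"]

    # Embedding: a "contains", one-to-few relationship; a single query
    # retrieves the post together with all of its comments.
    post_id = db.posts.insert_one({
        "title": "Schema design",
        "comments": [
            {"author": "Asha", "text": "Nice overview."},
            {"author": "Ravi", "text": "Very helpful."},
        ],
    }).inserted_id

    # Referencing: comments live in their own collection and point back
    # to the post, like a foreign key; better when comments grow unbounded.
    db.comments.insert_one({"post_id": post_id, "author": "Mei",
                            "text": "What about indexes?"})

    # Reading the referenced data takes a second query.
    for c in db.comments.find({"post_id": post_id}):
        print(c["author"], ":", c["text"])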


3.2 Deep Dive into HBase

HBase is a distributed, scalable big data store, inspired by Google's Bigtable. It's designed to host very large tables – billions of rows by millions of columns – atop clusters of commodity hardware. Unlike traditional relational databases, HBase is a NoSQL database characterised by its column-oriented design, enabling high-speed read and write access to data.

Overview of HBase:

● Nature of Data: Unlike RDBMSs, which store data row by row, HBase stores data by column. This column-oriented storage allows for efficient read/write operations on a subset of the data.
● Scalability: HBase operates atop the Hadoop Distributed File System (HDFS) and is designed to be horizontally scalable, simply by adding more nodes.
● Fault Tolerance: HBase automatically replicates data to ensure system robustness. It also supports automatic failover for a seamless user experience.
● Consistency: It provides strong consistency guarantees for reads and writes, ensuring that once data is written, subsequent reads will reflect that write.

How HBase Fits into the Hadoop Ecosystem:

HBase is an integral part of the Hadoop ecosystem and provides random, real-time read/write capabilities. Here's how it dovetails with other Hadoop components:

● HDFS: HBase relies on the Hadoop Distributed File System for its storage. HDFS provides the reliable, distributed storage layer for HBase, ensuring data is replicated across nodes.
● MapReduce: HBase integrates with MapReduce, enabling developers to use the MapReduce programming paradigm for tasks like data analysis and transformation on data stored in HBase.
● ZooKeeper: HBase employs Apache ZooKeeper for distributed coordination, which aids in maintaining server state within the HBase cluster.


The Architecture of HBase:

Understanding the architecture of HBase is vital for leveraging its full potential:

● Master-Slave Architecture: HBase follows a master-slave architecture in which one HMaster node manages multiple RegionServer slaves.
● HMaster: Handles DDL operations (like creating or deleting tables) and coordinates the RegionServers. The HMaster ensures load balancing across RegionServers and manages region assignment.
● RegionServers: The slave daemons responsible for serving read and write requests; they manage regions of data. Each RegionServer runs multiple regions.
● Regions: A region is a subset of a table's rows. It grows over time, and when it exceeds a certain size it is split into two, to ensure efficient data management.
● WAL (Write-Ahead Log): Before data is stored in an HFile, it is first written to a WAL. This ensures that, in case of a RegionServer failure, the data can be reconstructed.
● HFile: The actual storage format for HBase data. HFiles are stored on HDFS and hold the data for a column family in sorted order.
● MemStore & Block Cache: To improve read and write performance, HBase employs in-memory structures. The MemStore accumulates writes and, when full, its contents are flushed to an HFile. The Block Cache, on the other hand, serves reads by caching frequently read data blocks.

Data Modeling in HBase:

In traditional relational databases, data modelling is centred on defining tables, establishing relationships between them, and normalising the data structure. In contrast, HBase, being a NoSQL, column-oriented database, emphasises denormalization and presents a different set of challenges and methodologies for data modelling.

Concepts of Column Families and Qualifiers:

Column Families:

● Groupings of one or more related columns.
● Each column family is stored separately on disk, which makes it crucial to design them wisely, taking access patterns into consideration.
● Columns are physically co-located in storage only if they are part of the same column family.
● Ideally, infrequently accessed or large columns should be separated from frequently accessed or smaller ones.

Qualifiers:

● Unlike in traditional relational databases, where columns are predefined, in HBase column qualifiers can be dynamic.
● Column qualifiers can be thought of as the 'names' of specific columns within a column family.
● This flexible approach allows HBase tables to adapt to evolving schemas (see the sketch after this list).
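
As a hedged illustration, the sketch below uses happybase, a widely used Python client that talks to HBase through its Thrift gateway; the table name "passengers", the family names, and all qualifiers are assumptions for this example.

    import happybase

    # Connect via the HBase Thrift server (assumed to run on localhost:9090).
    connection = happybase.Connection("localhost")

    # Two column families: frequently read profile data kept apart from
    # bulkier, rarely read history data, per the guidance above.
    connection.create_table("passengers", {
        "profile": dict(),
        "history": dict(),
    })

    table = connection.table("passengers")

    # Qualifiers ("profile:name", "history:2023-05-01") are created on the
    # fly per row; no schema change is needed to add a new qualifier.
    table.put(b"passenger#42", {
        b"profile:name": b"Asha",
        b"history:2023-05-01": b"Tokyo->Osaka",
    })

    print(table.row(b"passenger#42"))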

Designing HBase Tables: Row Key Design Strategies:

Row Key Selection: Row key design is critical because HBase stores data in lexicographical order of the row key, which directly influences read/write performance. Common strategies (sketched in code after this list) include:

● Compound Key: Composing a row key from multiple attributes can be useful for modelling hierarchical or multi-dimensional data.
● Salted Key: A prefix (or "salt") is added to the row key to distribute writes across multiple region servers and avoid hotspotting.
● Reversed Key: Useful for datasets where new records have monotonically increasing keys, such as timestamps. Reversing the key helps distribute the writes evenly.
● Short and Stable: Keeping keys concise speeds up access and reduces storage needs.
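
The helper functions below are hypothetical, showing how such keys might be constructed in Python; the salt-bucket count and key layouts are assumptions, not prescriptions.

    import hashlib

    NUM_SALT_BUCKETS = 16  # assumed number of write buckets / regions

    def compound_key(country: str, city: str, station: str) -> bytes:
        # Hierarchical data: prefix scans retrieve whole sub-trees.
        return f"{country}#{city}#{station}".encode()

    def salted_key(user_id: str) -> bytes:
        # A stable, hash-derived salt spreads sequential writes across regions.
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
        return f"{bucket:02d}#{user_id}".encode()

    def reversed_key(epoch_millis: int) -> bytes:
        # Reversing the digits puts the fastest-changing digit first,
        # spreading monotonically increasing keys across the keyspace.
        return str(epoch_millis)[::-1].encode()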

Time Series Data Handling in HBase:

HBase's architecture is well-suited for time series data:

● Timestamps: Each cell in HBase carries a version marked by its timestamp. This built-in versioning makes handling time series data efficient.
● Data Aging: Using HBase's time-to-live (TTL) feature, older data can be automatically pruned, optimising storage for time series scenarios.
● Row Key Design for Time Series: Combining metric identifiers with timestamps (perhaps reversed) can provide efficient time-range scan capabilities (see the sketch after this list).
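
A hedged happybase sketch of a time-range scan: the "metrics" table, the "v" family with a 30-day TTL, and the metric#timestamp key layout are all assumptions for illustration.

    import happybase

    connection = happybase.Connection("localhost")

    # One family holding values, with a 30-day TTL so old points age out.
    connection.create_table("metrics", {
        "v": dict(time_to_live=30 * 24 * 3600),
    })
    table = connection.table("metrics")

    # Key layout: metric id, then a zero-padded epoch-seconds timestamp,
    # so rows for one metric sort chronologically.
    table.put(b"cpu.load#0001700000000", {b"v:value": b"0.42"})
    table.put(b"cpu.load#0001700000060", {b"v:value": b"0.57"})

    # Because keys are sorted, a time range is a cheap bounded scan.
    for key, data in table.scan(row_start=b"cpu.load#0001700000000",
                                row_stop=b"cpu.load#0001700000999"):
        print(key, data[b"v:value"])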

Optimising HBase for Query Performance:

● Bloom Filters: Help determine whether an HFile contains a particular row or column, reducing unnecessary disk seeks and improving read performance.
● Compression: Compression algorithms like Gzip or Snappy can optimise storage and I/O, though the choice may impact CPU usage.
● Block Size Tuning: Adjusting the HFile block size balances memory usage against disk-seek costs.
● Caching: Leveraging the Block Cache to keep frequently accessed data in memory can drastically reduce read latencies.
● Pre-splitting Tables: To avoid region hotspotting, tables can be pre-split based on the expected row key distribution.
● Co-processors: Allow running custom code on region servers, facilitating operations like aggregating data at the source and reducing the data transferred over the network.

Several of these knobs can be set per column family at table-creation time, as the sketch below shows.
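
A hedged happybase sketch, assuming the Thrift gateway exposes these per-family options (happybase maps them to HBase's column descriptor settings); the table and family names are illustrative, and pre-splitting is not shown since it is typically done via the HBase shell or Java API.

    import happybase

    connection = happybase.Connection("localhost")

    # Per-family tuning at creation time: a row-level bloom filter to cut
    # disk seeks, Snappy compression to shrink I/O, and block caching on.
    connection.create_table("events", {
        "d": dict(
            bloom_filter_type="ROW",
            compression="SNAPPY",
            block_cache_enabled=True,
            max_versions=1,
        ),
    })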

3.3 Comparison Between MongoDB and HBase

One of the most intriguing comparisons in the NoSQL world is that between MongoDB and HBase, two eminent systems with distinctive features and functionalities.

Data Models: Document vs. Column-family

● MongoDB:
  o Document-based: MongoDB is primarily a document-based NoSQL database. It uses the BSON (Binary JSON) format to store data. In essence, each record in MongoDB is a document, which can hold arrays and other documents. This offers high flexibility, as fields can differ from document to document.
  o Schema-less: MongoDB is naturally schema-less, meaning the database doesn't require a fixed schema. As a result, as the application's requirements change, MongoDB can adapt flexibly.

● HBase:
  o Column-family based: HBase is a column-family store. In this model, data is stored in tables, rows, and column families. A column family is a collection of columns and can be thought of as a coarser-grained unit than individual columns.
  o Sparse data storage: HBase thrives in scenarios where the data schema is sparse. In HBase, absent columns consume no storage space, which optimises storage for massive datasets.

Scalability: How MongoDB and HBase Scale Horizontally

MongoDB:

● Sharding: MongoDB achieves horizontal scalability via sharding. Data is divided into chunks, and these chunks are distributed across multiple shards, or nodes. This distribution ensures that no single node is overwhelmed with requests (a sketch of enabling sharding follows this list).
● Replica sets: To improve availability and fault tolerance, MongoDB uses replica sets. A replica set is a group of MongoDB instances that replicate data, ensuring that if one node goes down, the system can continue functioning.
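
As a hedged sketch against a MongoDB sharded cluster (a running mongos query router is assumed; the database, collection, and shard-key names are illustrative), sharding is enabled with admin commands:

    from pymongo import MongoClient

    # Connect to the mongos query router of a sharded cluster.
    client = MongoClient("mongodb://localhost:27017")

    # Enable sharding for a database, then shard a collection on a
    # hashed key so writes spread evenly across shards.
    client.admin.command("enableSharding", "school")
    client.admin.command(
        "shardCollection", "school.students",
        key={"student_id": "hashed"},
    )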

HBase:

● Region-based scaling: HBase scales horizontally by splitting large tables into regions, which are distributed among the RegionServers. When a region reaches its size threshold, it is split in two to accommodate more data.
● Distributed architecture: HBase is built on top of Hadoop's distributed file system (HDFS). This architecture allows HBase to handle large datasets across a distributed environment.

Consistency: Trade-offs and Guarantees

MongoDB:

● Eventual consistency: MongoDB can offer eventual consistency when reads are distributed to secondary replicas; reads and writes through the primary are strongly consistent.
● Write Concerns: MongoDB provides a "write concern" feature, which lets users specify the level of acknowledgment required from the system for write operations (see the sketch after this list).
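
A minimal pymongo sketch of setting a write concern; "majority" is a common choice, and the collection name continues the earlier hypothetical example.

    from pymongo import MongoClient, WriteConcern

    db = MongoClient()["school"]

    # Require acknowledgment from a majority of replica-set members
    # before the insert is considered successful.
    students = db.get_collection(
        "students",
        write_concern=WriteConcern(w="majority", wtimeout=5000),
    )
    students.insert_one({"name": "Noor", "year": 3})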

HBase:

● Strong consistency: HBase provides strong consistency guarantees; all reads after a write will see that write. This is beneficial for applications where consistency is a priority.
● Atomic operations: HBase also provides atomic read-modify-write operations on rows, ensuring that data integrity is maintained.

Comparative Analysis of Use Cases

MongoDB:

● Web applications: Due to its flexible schema, MongoDB is often preferred for web applications where requirements can change rapidly.
● Content Management Systems: The document-based model of MongoDB can accommodate varied forms of content, making it ideal for a CMS.
● IoT: MongoDB's ability to handle large volumes of rapidly changing, diverse data makes it a good fit for IoT use cases.

HBase:

● Time-series data: HBase's columnar structure is conducive to time-series data, often used in financial analysis or monitoring systems.
● Massive datasets: Built on HDFS, HBase can handle petabyte-scale datasets, making it suitable for Big Data analytics.
● Real-time querying: HBase provides rapid real-time querying capabilities, making it preferable for systems that require real-time analytics.

3.4 Summary

❖ MongoDB is a document-oriented NoSQL database that uses JSON-like documents with optional schemas. It is particularly favoured for flexibility in schema design and is widely used in applications requiring quick iterations and flexible data models.
❖ Data modelling in MongoDB is the process of structuring data using BSON. It encompasses schema design decisions such as embedding vs. referencing, collections, and document structures.
❖ HBase is a distributed, scalable big data store that runs on top of the Hadoop Distributed File System (HDFS). HBase provides real-time read/write access to large datasets, and its architecture is column-oriented.
❖ Structuring data in HBase involves understanding and designing column families, qualifiers, and row keys. It's crucial to optimise the data design for query performance, especially given HBase's column-family storage.
❖ While both databases offer scalability, their data models differ significantly: MongoDB is document-oriented, while HBase is column-oriented. They also make different trade-offs regarding consistency, scalability, and typical use cases.
❖ Case studies present real-world applications and analyses that delve into practical implementations of both MongoDB and HBase, showcasing their strengths, limitations, and best-fit scenarios.

3.5 Keywords

● BSON (Binary JSON): A binary representation of JSON-like documents, used primarily as the data storage and network transfer format in MongoDB. BSON extends the JSON model to offer additional data types and to be efficient for both encoding and decoding.
● Column Families (in HBase): A column family is a collection of columns that are stored together on disk. Each column family can have any number of columns, and each column within a family is identified by a unique qualifier. Designing column families correctly is critical for HBase performance, since data is read from and written to storage by column family.
● Row Key (in HBase): The identifier for a row of data, crucial for both data retrieval and storage. Efficient row key design can significantly impact the performance of HBase operations, as data in HBase is stored and retrieved in sorted order by row key.
● Document-oriented Database: A type of NoSQL database designed to store, retrieve, and manage document-oriented, or semi-structured, information. In this context, "document" does not mean a traditional document but refers to data structures (like JSON or BSON in MongoDB) comprising key-value pairs, arrays, and nested documents.
● Consistency (in Distributed Systems): The state in which all nodes in a distributed system show the same data. In the context of databases like MongoDB and HBase, it concerns how and when changes made to the data on one node are reflected on other nodes. Achieving consistency often comes at the cost of availability or partition tolerance, per the CAP theorem.
● Horizontal Scalability: The ability of a system (such as a database) to increase its capacity by adding more machines or nodes, rather than upgrading the hardware of an existing machine (vertical scalability). Both MongoDB and HBase are designed to scale horizontally, allowing them to handle huge datasets and high traffic loads by distributing data across multiple machines.

3.6 Self-Assessment Questions

1. How does MongoDB store its data, and what is the significance of BSON in this context?
2. What are the key differences in the data modelling approaches of MongoDB and HBase, especially concerning schema design and data relationships?
3. Which database is better suited for real-time data analytics, considering their respective architectures and use cases, and why?
4. What are the primary trade-offs between MongoDB and HBase in terms of consistency and scalability?
5. How does row key design in HBase influence query performance, and which strategies can be used to optimise it?

3.7 Case Study

Title: Tokaido Railways: Managing Large-scale Passenger Data with HBase

Introduction:

In Japan, the Shinkansen, or "bullet train", is not just a mode of transport; it's an icon. Tokaido Railways, one of the operators of the Shinkansen routes, has always been at the forefront of technological advancements. But as passenger numbers skyrocketed, so did the volume of passenger data, which included ticket transactions, seating preferences, journey histories, and more. Traditional relational databases struggled to keep up with this massive influx of real-time data.

Background:

In 2018, Tokaido Railways sought a solution that would not only handle this volume but also enable fast query responses for operational tasks such as real-time seat allocation, ticketing, and personalised recommendations for passengers. Their choice was HBase, a distributed, scalable big data store from the Hadoop ecosystem.

The shift to HBase brought immediate benefits. Its columnar storage, designed for sparse data, was well matched to Tokaido's needs: with millions of passengers daily, only a small fraction would have unique seating preferences or purchase add-ons. Traditional row-based databases would waste space storing these sparse attributes, but HBase shone in this regard.

Performance was another area of improvement. Queries that took minutes in the old system executed in seconds. This drastically enhanced the customer experience, as passengers received instant feedback on seat availability, potential upgrades, and journey details.

Moreover, as part of their data science initiative, Tokaido Railways was able to run machine learning models on the vast data stored in HBase. This enabled them to predict travel patterns, optimise train schedules, and even anticipate the onboard services required for specific journeys.

The success of this transition has made Tokaido Railways a case study in effective big data management for transportation sectors across Japan.

Questions:

1. Why did Tokaido Railways decide to shift from traditional relational databases to HBase?
2. How did HBase's columnar storage prove advantageous for managing Tokaido Railways' passenger data?
3. Based on the case study, how did the integration of HBase influence the customer experience and operational efficiency at Tokaido Railways?

3.8 References

1. "MongoDB: The Definitive Guide" by Kristina Chodorow
2. "HBase: The Definitive Guide" by Lars George
3. "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence" by Pramod J. Sadalage and Martin Fowler
4. "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann
5. "Mastering MongoDB 4.x: Expert techniques to run high-volume and fault-tolerant database solutions using MongoDB 4.x" by Alex Giamas
