NoSQL Database – Module Overview
Module 1 – Introduction & Storage Architectures (8 hrs)
Outcome: Students will understand the need for NoSQL, the types of NoSQL DBs, and the internal storage mechanisms.
What NoSQL is: history, why it's needed (big data, scalability)
Types of NoSQL: key/value, document, graph, column-oriented databases (HBase)
Storage architectures: document store internals (MongoDB collections, indexes, reliability, scaling); key/value store internals (Redis, Memcached)
Consistency models: eventually consistent DBs, consistent hashing, gossip protocols, hinted handoff
Module 2 – Indexing & Special Collections (8 hrs)
Outcome: Students will learn how to optimize data access using
indexes and how MongoDB handles special storage needs.
Indexing in MongoDB: Compound indexes, $-operators, cardinality, query
optimizer
Unique, sparse indexes, index administration
Special collections & indexes: Capped collections, tailable cursors, TTL indexes
Full-text search, multilingual search, geospatial indexes (2D, 2DSphere)
GridFS (file storage in MongoDB)
Module 3 – Aggregation & Application Design (8 hrs)
Outcome: Students will be able to perform analytics queries on NoSQL data and design applications with efficient schemas.
Aggregation framework in MongoDB: $match, $project, $group, $unwind, $sort, $limit, $skip
Aggregation commands (count, distinct, group), MapReduce examples
Application design: normalization vs denormalization, cardinality (friends/followers example), optimizations for data manipulation
Schema planning, managing consistency, schema migration
Module 4 – Sharding (Scaling Out in MongoDB) (8 hrs)
Outcome: Students will understand how MongoDB handles massive data by distributing it and how to choose shard keys wisely.
Sharding basics: components of a cluster, test setup
Configuring sharding: mongos, config servers, adding shards, chunk splitting, balancing
Choosing a shard key: strategies (ascending, random, hashed, location-based); rules & limitations
Multi-database/collection clusters and manual sharding
Module 5 – Transactions, Consistency & Performance Tuning (8 hrs)
Outcome: Students will be able to analyze trade-offs in consistency models and apply performance optimization techniques.
Transactions & integrity: RDBMS ACID vs CAP theorem trade-offs, distributed ACID
Consistency in MongoDB, CouchDB, Cassandra, Membase
Performance tuning: reducing latency, increasing throughput, scalability laws (Amdahl's Law, Little's Law)
Partitioning, scheduling, communication overhead, compression, MapReduce tuning
HBase coprocessors, Bloom filters
• NoSQL: What It Is and Why You Need It
Module 1 topics:
• Defining NoSQL
• Setting context by explaining the history of NoSQL's emergence
• Introducing the NoSQL variants
• Listing a few popular NoSQL products
Definition: NoSQL is literally a combination of two words: No and SQL. The implication is that NoSQL is a technology or product that counters SQL. This means NoSQL is not a single product or even a single technology. It represents a class of products and a collection of diverse, and sometimes related, concepts about data storage and manipulation.
Challenges of RDBMS
Rigid Schema Assumption: RDBMS expects a fixed schema
(tables, columns).
Dense & Uniform Data Assumption: Works best when data is
consistent and structured.
Index Dependence: Relational queries rely heavily on
indexes.
Scalability Issues: Vertical scaling (bigger server) is expensive;
horizontal scaling (many servers) is hard for RDBMS.
Workarounds Break the Model: To scale, RDBMS uses denormalization, drops constraints, and relaxes ACID, at which point it starts to behave like NoSQL.
• Bit of History
– Non-relational databases are not new
• Even before SQL and relational databases became popular, there were non-
relational data storage methods.
• Example: In mainframes, data was stored in hierarchical or network-based
structures (like IBM IMS in the 1960s).
– They existed in specialized domains
• For example, LDAP (Lightweight Directory Access Protocol) uses a hierarchical
directory database for storing authentication and authorization credentials.
• These were non-relational, but used for specific, narrow purposes (not
general-purpose like SQL).
– Rooted in distributed & parallel computing
• Modern NoSQL systems are designed to work across many servers (clusters)
rather than a single big machine.
• They use parallelism and distribution to handle huge volumes of data and
millions of users simultaneously.
• Big Data
An IDC Digital Universe report claimed that the total size of digital data created and replicated would grow to 35 zettabytes by 2020.
• Challenges of Big Data
Efficiently storing and accessing large amounts of data is difficult. The additional demands of fault tolerance and backups make things even more complicated.
Manipulating large data sets involves running immensely parallel processes. Gracefully recovering from any failures during such a run and providing results in a reasonably short period of time is complex.
Managing the continuously evolving schema and metadata for semi-structured and unstructured data, generated by diverse sources, is a convoluted problem.
• Scalability
– Vertical scaling
– Horizontal scaling
The MapReduce model provides one of the best-known methods to process large-scale data on a horizontal cluster of machines.
MapReduce is a programming model (introduced by Google)
for processing and analyzing large datasets in a distributed
environment.
• It splits a task into two major steps:
• Map → Break data into smaller chunks and process them in
parallel.
• Reduce → Combine (aggregate) the results from all the
chunks into a final answer.
• MapReduce Architecture
• How it Works (Step-by-Step)
– Input data is split into chunks and distributed across
many computers (nodes).
– Map function processes each chunk independently
(parallel processing).
– Intermediate results are grouped by key.
– Reduce function aggregates the grouped results into a final output.
Applications
• Hadoop MapReduce → early big data framework by Apache.
• Search engines (Google) → used for indexing web pages.
• Log analysis, recommendation engines, fraud detection.
Case study 1: Multiply by 2 function
Step 1: Read input
input_list = [1, 2, 3, 4]
Step 2: Map (Transform Each Item)
map(multiply_by_two, input_list) → [2, 4, 6, 8]
Step 3: Reduce (Aggregate Results)
reduce(sum, [2, 4, 6, 8]) → 20
Step 4: Output
sum=20
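A runnable Python version of this pipeline (note that Python's functools.reduce needs a two-argument function, so operator.add stands in for the sum step above):

from functools import reduce
from operator import add

def multiply_by_two(x):
    return x * 2

input_list = [1, 2, 3, 4]

# Map: transform each item independently
mapped = list(map(multiply_by_two, input_list))  # [2, 4, 6, 8]

# Reduce: fold the mapped values into a single result
total = reduce(add, mapped)

print(mapped, total)  # [2, 4, 6, 8] 20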
The functional programming idea of map and reduce is adapted to key/value data.
Case study 2: Say you want to count how many times each word appears in a huge
set of documents.
1.Input
•Documents: "hello world", "hello NoSQL", "hello students"
2.Map phase (break text into key–value pairs)
•"hello world" → [(hello, 1), (world, 1)]
•"hello NoSQL" → [(hello, 1), (NoSQL, 1)]
•"hello students" → [(hello, 1), (students, 1)]
3.Shuffle & Group (system groups by key)
•hello → [1, 1, 1]
•world → [1]
•NoSQL → [1]
•students → [1]
4.Reduce phase (aggregate values)
•hello → 3
•world → 1
•NoSQL → 1
•students → 1
Final Output:
hello: 3 world: 1 NoSQL: 1 students: 1
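The whole pipeline fits in a few lines of plain Python; a sketch of the three phases:

from collections import defaultdict

docs = ["hello world", "hello NoSQL", "hello students"]

# Map phase: emit (word, 1) for every word in every document
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle & group: collect all values emitted under the same key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce phase: aggregate the list of values for each key
counts = {key: sum(values) for key, values in grouped.items()}

print(counts)  # {'hello': 3, 'world': 1, 'NoSQL': 1, 'students': 1}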
Case study 3: People by Zip Code
Step1: Input (Key/Value Pairs)
[
{"94303": "Tom"},
{"94303": "Jane"},
{"94301": "Arun"},
{"94302": "Chen"}
]
Step 2: Map Function- Group people by zip code.
[
{"94303": ["Tom", "Jane"]},
{"94301": ["Arun"]},
{"94302": ["Chen"]}
]
Step 3: Reduce Function- Apply aggregation (count names per zip code).
[
{"94303": 2},
{"94301": 1},
{"94302": 1}
]
Why types of NoSQL databases?
• Not all data fits neatly into rows and columns (RDBMS model).
Today’s apps handle structured, semi-structured, and unstructured
data. NoSQL provides the right tool for the right job instead of
forcing everything into tables.
• Real-world applications need different data models:
– Key/Value → Simple, fast lookups (e.g., caching, sessions).
– Document → JSON-like flexible data (e.g., user profiles, product catalogs).
– Column-family → Huge datasets with sparse columns (e.g., analytics,
logs).
– Graph → Highly connected data (e.g., social networks, recommendations).
• Each type solves specific problems better than a “one-size-fits-all”
solution.
1) What are Column-Oriented Stores?
• A type of NoSQL database that stores data by
columns instead of rows.
• Inspired by Google Bigtable.
• Data organized using Row Keys and Column
Families.
• Efficient for large, sparse datasets.
Data Model
• Stored as: (Row Key, Column Family:Column Qualifier, Timestamp) → Value
Key Features
• Column families group related columns
• Sparse storage (no empty values stored)
• Flexible schema (different rows may have different columns)
• Time-stamped versions (history tracking)
• Horizontal scalability (petabytes across servers)
Examples of Column-Oriented Stores
- Google Bigtable (proprietary)
- Apache HBase (open-source on Hadoop)
- Hypertable (C++ implementation)
- Cassandra (distributed wide-column store)
- ScyllaDB (C++ high-performance Cassandra clone)
Use Cases
Large-scale analytics (web logs, clickstream)
Time-series data (IoT, sensor data)
Search indexes (inverted index)
User profiles (social networks)
Metadata storage for big data platforms
Row-Oriented vs Column-Oriented
2) Key/Value Store in NoSQL
Concept
• The simplest type of NoSQL
database.
• Stores data as a collection of key–
value pairs, similar to a dictionary
or hash map.
• Each key is unique and retrieves
its associated value.
• Values can be strings, numbers,
JSON, or even binary data
(images, files).
• Key → Value
• user:101 → {name: "Alice", age: 23, city: "New York"}
• user:102 → {name: "Bob", age: 30, city: "London"}
Key Features
• High-speed lookups (direct access by key)
• Schema-free and flexible values
• Horizontally scalable
• Great for caching & session storage
Popular Databases
• Redis → In-memory, caching & queues
• Amazon DynamoDB → Managed
key/value & document store
• Riak KV → Distributed, fault-tolerant; implemented in Erlang, with a bit of C and JavaScript
• Memcached → In-memory caching
system
• Membase → Distributed, memcached-compatible store
• Kyoto Cabinet → Lightweight key/value library
• Cassandra → Implemented in Java
• Voldemort → Implemented in Java
Use Cases
Caching frequently accessed data
User sessions (token storage)
Shopping carts in e-commerce
Leaderboards in gaming apps
IoT data ingestion
Document databases
What are Document Databases?
• A NoSQL database designed to
store, retrieve, and manage
semi-structured data as
documents.
• Documents are usually stored
in formats like JSON, BSON, or
XML.
• Unlike RDBMS tables with rows
& columns, documents are
flexible and can have different
structures.
Example: one document
{
  "id": "101",
  "name": "Alice",
  "email": "alice@example.com",
  "skills": ["Python", "Django"]
}
Example: a collection of users
{
  "_id": "user_101",
  "name": "Alice",
  "email": "alice@example.com",
  "age": 23,
  "interests": ["reading", "traveling"]
}
{
  "_id": "user_102",
  "name": "Bob",
  "email": "bob@example.com",
  "skills": ["Python", "Django"]
}
• Key Features
Schema-less → No predefined schema
needed.
Hierarchical documents → Supports
nested structures (arrays, sub-documents).
Indexing & querying → Index on fields
inside documents.
Horizontal scaling → Distribute documents
across clusters.
Rich queries → Support for filtering,
searching, aggregations.
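A brief PyMongo sketch touching several of these features (assumes a local mongod on the default port; the school database and users collection are illustrative):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
users = client["school"]["users"]  # hypothetical database/collection

# Schema-less: documents in one collection may have different fields
users.insert_one({"_id": "user_101", "name": "Alice", "age": 23,
                  "interests": ["reading", "traveling"]})
users.insert_one({"_id": "user_102", "name": "Bob", "skills": ["Python"]})

# Indexing on a field inside documents
users.create_index([("name", ASCENDING)])

# Rich queries: filter on nested/typed fields
for doc in users.find({"age": {"$gt": 20}}):
    print(doc)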
Popular Document Databases
MongoDB → Most popular, JSON/BSON-based.
CouchDB → Uses JSON to store data & JavaScript
for queries.
RavenDB → .NET-based document database.
Amazon DocumentDB → Managed MongoDB-
compatible database.
Use Cases
Content Management Systems (CMS) → Blogs, product catalogs.
User profiles & personalization → Flexible user attributes.
E-commerce applications → Products with varying attributes.
IoT data storage → Semi-structured sensor data.
Mobile/web apps → Fast, flexible backend storage.
Difference between RDBMS and Document DB
Feature         | RDBMS (SQL)              | Document DB (NoSQL)
Data model      | Tables (rows & columns)  | Documents (JSON-like)
Schema          | Fixed, rigid             | Flexible, schema-less
Relationships   | Normalized (JOINs)       | Embedded/nested documents
Query language  | SQL                      | JSON-like query (MongoDB, etc.)
Scaling         | Vertical (limited)       | Horizontal (sharding, clusters)
Graph Database
What is a Graph Database?
• A NoSQL database that uses graph
structures (nodes, edges,
properties) to represent and store
data.
• Instead of rows/columns (RDBMS)
or documents (MongoDB), data is
represented as a graph.
• Best suited for highly connected
data (like social networks,
recommendations, fraud detection).
• Core Components
Nodes → Represent entities (e.g., Person, Product, Location).
Edges → Represent relationships between nodes (e.g., FRIEND_OF, PURCHASED,
LOCATED_IN).
Properties → Attributes for nodes/edges.
Example:
• (Alice) -[FRIEND_OF]-> (Bob)
• (Bob) -[WORKS_AT]-> (Company X)
Example Data (Social Network):
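A small social network like the one above can be modeled in plain Python as an adjacency map with labeled edges; a sketch (all names illustrative):

# Adjacency structure: node -> list of (relationship, target) edges
graph = {
    "Alice": [("FRIEND_OF", "Bob")],
    "Bob":   [("WORKS_AT", "Company X"), ("FRIEND_OF", "Carol")],
    "Carol": [],
    "Company X": [],
}

def traverse(start, relationship):
    """Follow edges of one relationship type from a starting node."""
    return [target for rel, target in graph.get(start, []) if rel == relationship]

# Friends-of-friends: two FRIEND_OF hops from Alice
friends = traverse("Alice", "FRIEND_OF")
fof = [f2 for f1 in friends for f2 in traverse(f1, "FRIEND_OF")]
print(friends, fof)  # ['Bob'] ['Carol']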
• Key Features
Schema-free → Flexible, evolving structures.
Efficient for relationships → Directly stores and
queries connections.
Query languages → Cypher (Neo4j), Gremlin, SPARQL.
Traversals → Fast navigation of relationships (friends of
friends, etc.).
Great for graph algorithms (shortest path, centrality,
community detection).
• Popular Graph Databases
Neo4j → Most widely used, Cypher query language.
Amazon Neptune → AWS graph DB (supports Gremlin & SPARQL).
OrientDB → Multi-model DB (graph + document).
ArangoDB → Graph + document + key/value.
• Use Cases
Social Networks → Friend recommendations, follower graphs.
Recommendation Systems → “People who bought X also bought Y.”
Fraud Detection → Identify suspicious connections across transactions.
Knowledge Graphs → Google’s Knowledge Graph, semantic search.
Network/IT Operations → Mapping dependencies in systems.
Working with Column-Oriented Databases
Using tables and columns in relational databases (RDBMS)
Contrasting column databases with RDBMS
Distributed systems: horizontal scaling
Column family identifier
Data evolution is recorded
A single table spans multiple machines in a column-oriented database like Cassandra, HBase, or Bigtable.
1. Why One Machine Isn’t Enough
In huge databases, a table can grow to
billions of rows and millions of columns.
One physical machine cannot store or process
this much data (limitations of storage,
memory, CPU).
So, the system automatically splits the table
into smaller pieces and distributes them
across multiple servers.
2. How Splitting Works
The row key uniquely identifies each row.
Rows are stored in sorted order of row keys.
The table is divided into regions (HBase), partitions (Cassandra), or tablets (Bigtable).
Example: HBase
In HBase, a table is divided into regions based on row key ranges.
Each region is hosted on a different RegionServer (machine).
As data grows, regions split into smaller ranges and migrate to
other servers for load balancing.
•Table Customer with row keys from A → Z
Region 1: A–H → stored on Machine 1
Region 2: I–P → stored on Machine 2
Region 3: Q–Z → stored on Machine 3
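As a toy sketch, routing a row key to its region reduces to a sorted-range lookup (real HBase clients consult cluster metadata instead; the ranges mirror the example above):

import bisect

# Upper bound of each region's row-key range, mirroring the example
region_bounds = ["H", "P", "Z"]
region_hosts = ["Machine 1", "Machine 2", "Machine 3"]

def route(row_key):
    """Find the first region whose key range includes this row key."""
    i = bisect.bisect_left(region_bounds, row_key[0].upper())
    return region_hosts[i]

print(route("Alice"))  # Machine 1 (A falls in the A-H range)
print(route("Kumar"))  # Machine 2 (K falls in the I-P range)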
Column Databases as Nested Maps of Key/Value
Pairs
Unlike relational databases (which store data in rows), column databases store
data in columns grouped into families.
But internally, they don’t use fixed schemas — they treat data like flexible key-
value maps.
Table = { RowKey → { ColumnFamily → { ColumnName → Value } } }
1. Row Key
   • Unique identifier for each row (like UserID = "User101").
   • This is the outermost key in the map.
2. Column Family
   • A logical grouping of columns (like personsInfo, Orders).
   • Acts like a sub-map inside each row.
3. Column Name → Value
   • Each column inside the family is itself a key/value pair.
   • Example: Name → "Alice", Age → 25
Example: Imagine a table of users in Cassandra
UserID  | Name  | Age | City
User101 | Alice | 25  | London
User102 | Bob   | 30  | Paris
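The same users table, written as the nested map in Python dictionary form (the personsInfo family name follows the earlier example):

# Table = { RowKey -> { ColumnFamily -> { ColumnName -> Value } } }
table = {
    "User101": {"personsInfo": {"Name": "Alice", "Age": 25, "City": "London"}},
    "User102": {"personsInfo": {"Name": "Bob", "Age": 30, "City": "Paris"}},
}

# Reading one cell walks the nested keys from the outside in
print(table["User101"]["personsInfo"]["Name"])  # Alice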
• Analogy
Think of a column database like a bookshelf:
Shelf = Row Key (User)
Book = Column Family (Personal Info, Orders, etc.)
Pages in the Book = Column → Value pairs (Name=Alice,
Age=25, …)
Technical example: Webtable
HBase Distributed storage architecture
Analogy: Imagine a library:
• The table = Library
• Regions = Different floors of the library (each
stores certain row ranges).
• Column families = Sections in each floor (e.g.,
Fiction, Science).
• Stores = Bookcases inside each section.
• Physical files = Actual books stored in the library.
• Thin wrapper = A librarian who helps you find
books instead of you searching raw shelves.
Document internal storage
MongoDB stores data in documents (like JSON objects), not in rows
and columns like traditional databases.
Collections:
A collection is like a table in relational databases.
Example 1: Collection: Students
Documents inside:
{ "id": 1, "name": "Alice", "age": 21 }
{ "id": 2, "name": "Bob", "age": 22 }
Example 2:
{ "id": 1, "name": "Alice", "age": 21 }
{ "id": 2, "name": "Bob", "email": "bob@example.com" }
Namespaces: Collections can be separated using namespaces (database + collection name, e.g. school.students).
Unique identifier (_id): every document has a unique _id field; MongoDB generates one automatically if it is not supplied.
• What is a Memory-Mapped File?
A memory-mapped file is a way to make a file on disk behave like it is part of your
computer’s memory (RAM).
Instead of reading/writing to the file using system calls (which are slow), the file is
mapped into virtual memory.
This means: your program can read/write the file as if it’s just an array in memory.
• Why is it Fast?
Normally:
To read from a file → app → system call → disk → OS → back to app (lots of
steps).
With memory mapping:
File contents are directly available in memory space of the program.
No repeated system calls.
Since memory access is way faster than disk access, I/O performance improves.
• Kernel’s Role
The operating system’s kernel manages this memory mapping and page cache.
It automatically keeps the file content in sync between disk and RAM.
So, applications don’t have to worry about the details—they just access memory
normally.
import mmap

# Step 1: Create a sample file
with open("example.txt", "wb") as f:
    f.write(b"Hello MongoDB with memory-mapped files!")

# Step 2: Open the file for reading and writing
with open("example.txt", "r+b") as f:
    # Step 3: Memory-map the file
    mm = mmap.mmap(f.fileno(), 0)
    # Step 4: Read from memory (just like reading from RAM)
    print("Original content:", mm[:].decode("utf-8"))
    # Step 5: Modify the file content via memory (same length, so the mapping size is unchanged)
    mm[6:12] = b"Python"  # Replacing 'MongoD' with 'Python'
    # Step 6: Move file pointer and read again
    mm.seek(0)
    print("Modified content:", mm.read().decode("utf-8"))
    # Step 7: Flush changes to disk and close the mapping
    mm.flush()
    mm.close()
• MongoDB’s Memory-Mapped Storage Strategy
Earlier versions of MongoDB (before WiredTiger became the default engine) relied heavily on
memory-mapped files to store data.
This has pros (speed) but also some side effects.
(a) No Separation Between OS Cache & DB Cache
• In some databases (like Oracle, MySQL), there’s a database-managed cache and the OS has its own
cache.
• In MongoDB’s memory-mapped strategy:
– The OS cache = DB cache.
– No duplicate copies, so less redundancy.
• Advantage: Efficient (no wasted memory).
• Disadvantage: MongoDB loses fine-grained control (depends on OS).
(b) Cache Management Controlled by OS
• Since memory-mapped files rely on virtual memory, the OS decides:
– Which data pages to keep in cache.
– Which pages to evict (remove).
• Problem: Different OS (Linux, Windows, macOS) may behave differently.
• So, MongoDB performance can vary across platforms.
(c) MongoDB Can Use All Available Memory
• With memory mapping, MongoDB automatically expands to use as much RAM as available.
• No special tuning needed.
• If you add more RAM, MongoDB cache effectively grows larger → performance boost.
Some other limitation of memory mapping
• On 32-bit systems → MongoDB can only handle 2 GB database size (because of
memory addressing limits).
• On 64-bit systems → This restriction is removed → databases can be much larger.
• Each MongoDB document has a maximum size: 16 MB in current versions (older releases allowed less).
• Why? Because documents are designed to be lightweight and fast to query.
• If you need to store files larger than this limit (e.g., images, videos), use GridFS.
• GridFS breaks the file into smaller chunks and stores them across multiple documents.
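A minimal PyMongo sketch of GridFS usage (assumes a local mongod; the database and file names are illustrative):

import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mediadb"]  # hypothetical database
fs = gridfs.GridFS(db)

# Store a large payload: GridFS splits it into chunk documents automatically
file_id = fs.put(b"...many megabytes of binary data...", filename="video.mp4")

# Read it back as a single stream
data = fs.get(file_id).read()
print(len(data))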
Namespace limit
• In MongoDB, a namespace is just the unique identifier string that points to
a collection or an index inside a database.
• MongoDB indexes are implemented as B-trees
• Reminder:
• 1 collection = 1 namespace
• 1 index = 1 namespace
• Example: If each collection has 2 indexes →
– Each collection = 3 namespaces (1 + 2).
• Namespace storage File (.ns File)
– MongoDB stores namespace metadata in a file called
<dbname>.ns. This file keeps track of collections and
indexes. Max size of .ns file = 2GB.
• Example: for database mydb, the namespace file is mydb.ns.
• Guidelines for Using Collections and Indexes in MongoDB
  – Thumb rule: "Do I often need to query across all this data together?"
  – Capped collections: fixed-size, FIFO behavior (the oldest documents are overwritten first).
  – _id is always indexed. We can add our own indexes as well. Results come in _id order (or insertion order in capped collections).
• MongoDB Reliability and Durability
Traditional databases (like MySQL, PostgreSQL) guarantee ACID properties
(Atomicity, Consistency, Isolation, Durability).
MongoDB does not guarantee full ACID transactions (at least in older versions;
newer ones support multi-document transactions, but with overhead).
So, in concurrent operations (multiple clients updating same data at once),
conflicts can occur.
Some operations are atomic at the document level.
Example: $inc, $set, $push
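For instance, a PyMongo sketch of a document-level atomic update (the collection and field names are illustrative; each update_one call is atomic for that single document):

from pymongo import MongoClient

carts = MongoClient("mongodb://localhost:27017")["shop"]["carts"]  # hypothetical

# $inc, $set, and $push are applied atomically within one document,
# so two concurrent increments cannot lose an update.
carts.update_one(
    {"_id": "cart_1"},
    {"$inc": {"itemCount": 1},
     "$set": {"status": "active"},
     "$push": {"items": "book-42"}},
    upsert=True,
)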
• Replication for safety
To prevent data loss in failures, MongoDB supports replication.
Replication is asynchronous → changes on master may take time to appear on
slaves.
In older versions → master-slave replication:
One master (primary) → handles reads/writes.
One or more slaves (secondary) → keep a copy of master’s data.
In current versions of MongoDB, master-slave replica pairs have been replaced with replica sets, which typically contain three replicas. Replica sets allow automatic recovery and automatic failover.
Why Sharding?
• MongoDB stores huge datasets that may not fit on one
server.
• Sharding = Horizontal Scaling (splitting collections across
servers).
• Example: Instead of 1 billion docs on one server → split
across 10 servers.
• Shard = Portion of collection stored on one machine.
• Shards are replicated for reliability (Replica Sets).
• Example: 4 shards × 3 replicas = 12 MongoDB servers.
• Each shard is divided into chunks.
• Chunk = continuous range of documents based on the shard key.
• Identified by: minKey, maxKey, and collection.
• MongoDB auto-balances chunks across shards.
Shard key:
• Field(s) used to distribute data across shards.
• Can be a single field [{ userId: 1 }] or compound [{ state: 1, city: 1 }].
• Must be chosen carefully to ensure balanced distribution.
• Bad shard key = unbalanced shards.
Config servers:
• Store metadata about shards, chunk distribution, and shard keys.
• Replicated to avoid a single point of failure.
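As a sketch, the same setup can be driven from PyMongo against a mongos router (the cluster address, database, and shard key below are illustrative assumptions):

from pymongo import MongoClient

# Connect to the mongos router, not to an individual shard
admin = MongoClient("mongodb://mongos-host:27017").admin

# Enable sharding for a database, then shard one collection on a key
admin.command("enableSharding", "appdb")
admin.command("shardCollection", "appdb.users", key={"userId": 1})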
• Understanding Key/Values stores
– Memcached
• Memcached is an open-source, high-performance,
distributed memory caching system.
• It is used to store data in RAM (memory) temporarily
to reduce the number of direct database or API calls.
• Think of it as a fast-access shortcut: instead of hitting
the database every time, apps check Memcached first.
• Key features:
• In-Memory Storage
• Key-Value Store
• Volatile (Non-persistent)
• Distributed System
• Simple protocol (set, get, delete)
• The heart of Memcached is its slab allocator, which helps manage memory
efficiently (instead of using traditional malloc/free).
1) Slabs
• Memory in Memcached is divided into slabs.
• Each slab is responsible for storing values of a specific size range.
• Example: one slab stores objects around 1 KB, another slab for 1.25 KB, etc.
2) Pages
• A slab is further divided into pages.
• Each page is 1 MB in size (default).
• Pages contain chunks (or buckets) where the actual objects are stored.
3) Chunks (Buckets)
• A chunk is the smallest unit of memory allocation inside Memcached.
• Each chunk can store one object (a value + metadata).
• Object placement rule:
– An object is stored in the closest larger chunk size.
– Example:
• Object size: 1.4 KB
• Next available chunk size: 1.5625 KB
• Object stored in that chunk → 0.1625 KB wasted
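The placement arithmetic can be sketched in a few lines of Python (the 1.25 growth factor is Memcached's default; chunk sizes here are simplified to whole-KB units):

# Chunk sizes grow geometrically from a base size by a growth factor
base_chunk_kb = 1.0
growth_factor = 1.25
chunk_sizes = [base_chunk_kb * growth_factor ** i for i in range(5)]
# [1.0, 1.25, 1.5625, 1.953125, 2.44140625]

def place(object_kb):
    """Pick the closest chunk size that still fits the object."""
    chunk = next(size for size in chunk_sizes if size >= object_kb)
    return chunk, chunk - object_kb  # chunk used, space wasted

print(place(1.4))  # (1.5625, 0.1625) -> matches the example above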
Redis :
• Redis (REmote DIctionary Server) is an open-source, in-memory
data store.
• It can be used as a database, cache, or message broker.
• Unlike traditional databases, Redis keeps data in RAM, which makes
it extremely fast.
• Key features of Redis:
  – In-memory storage
  – Key-value store (keys are always strings)
  – Rich data structures: String, List, Set, Sorted Set, Hash, Streams, Bitmaps
  – Persistent storage:
    • RDB (Redis Database Backup)
    • AOF (Append-Only File)
  – Replication & high availability: master-replica replication, automatic failover, sharding across multiple nodes
  – Redis can be used as a message broker with publish/subscribe support.
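A short sketch using the redis-py client (assumes a Redis server on localhost; keys and values are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Plain key/value with an expiry (typical cache/session usage)
r.set("session:42", "alice-token", ex=3600)
print(r.get("session:42"))

# Rich data structures: a sorted set as a leaderboard
r.zadd("leaderboard", {"alice": 120, "bob": 95})
print(r.zrevrange("leaderboard", 0, 1, withscores=True))

# Publish/subscribe as a message broker
r.publish("events", "user-logged-in")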
Consistent hashing
Consistent Hashing is a smarter way of distributing
keys across servers that minimizes reassignments
when servers are added or removed.
• How it Works:
– Imagine a hash space arranged in a ring (circular
structure).
– Both servers and keys are hashed onto this ring.
– A key belongs to the first server clockwise from its
position on the ring.
– When a server is added or removed, only the keys
near that server’s position are reassigned, not all keys.
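A compact Python sketch of the ring (MD5 here is just an arbitrary stable hash; production systems also place multiple virtual nodes per server for smoother balancing):

import bisect
import hashlib

def ring_hash(key):
    """Map any string onto the hash ring (a fixed integer space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

servers = ["server-A", "server-B", "server-C"]
ring = sorted((ring_hash(s), s) for s in servers)
points = [p for p, _ in ring]

def lookup(key):
    """A key belongs to the first server clockwise from its position."""
    i = bisect.bisect(points, ring_hash(key)) % len(ring)
    return ring[i][1]

print(lookup("user:101"), lookup("user:102"))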
• Object Versioning:
• When multiple clients update the same object at
the same time, conflicts may occur.
To resolve this, systems use object versioning –
each object is given a version identifier whenever
it is modified.
• Instead of overwriting old values, the system
keeps track of different versions of the object.
Example (object versions after concurrent updates): Node A has performed 2 updates and Node B has performed 1 update, so the object's version records [A: 2, B: 1].
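Counts like these are what a vector clock records; a minimal Python sketch, on the assumption that the versioning scheme here is a Dynamo-style vector clock:

# Vector clock: per-node update counters attached to each object version
def bump(clock, node):
    """Record one update performed by a node."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if version a already includes every update in version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v = {}
v = bump(bump(v, "A"), "A")  # Node A has performed 2 updates
v = bump(v, "B")             # Node B has performed 1 update
print(v)                     # {'A': 2, 'B': 1}

stale = {"A": 1}
print(descends(v, stale))    # True: the stale version can be discarded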