Cassandra Notes
UNIT-1
(10-Marks)
1. Define Data and Information. Explain how they differ with examples.
Definition:
Data: Raw, unprocessed facts or observations collected from various sources, lacking context or
meaning until analyzed. It can be numbers, text, images, or any other form (e.g., sensor readings
from an IoT device).
Information: Processed data that has been organized, analyzed, and given context to provide
meaning and support decision-making. It is the output of interpreting data (e.g., a report
summarizing sensor trends).
Differences:
Structure: Data is unorganized (e.g., a list of temperatures: 25°C, 27°C, 23°C), while information
is structured and meaningful (e.g., "Average temperature today is 25°C, indicating stable
weather").
Processing: Data requires processing (e.g., aggregation, filtering), whereas information is the
result of that processing.
Utility: Data is passive and needs interpretation, while information is actionable (e.g., data from
a smart grid shows power usage; information predicts a peak load at 6 PM).
Example in Context: In an IoT healthcare system, raw data might be heart rate readings (e.g.,
72, 75, 80 bpm) collected every minute. Information emerges when these are analyzed to show
"Patient’s heart rate averaged 75 bpm with a spike at 80 bpm, suggesting possible stress—alert
doctor."
2. What is RDBMS? List its limitations in the context of big data applications.
Definition:
A Relational Database Management System (RDBMS) is a software system that manages data stored in
a structured format using tables (rows and columns) with predefined schemas. It uses SQL (Structured
Query Language) for querying and ensures data integrity through relationships (e.g., primary and foreign
keys). Examples include MySQL, Oracle, and PostgreSQL.
Limitations in Big Data Applications:
Schema Rigidity: RDBMS requires a fixed schema, making it inflexible for handling unstructured
or semi-structured big data (e.g., JSON logs from smart devices) that evolves rapidly.
Performance with High Velocity: The row-based storage and join operations in RDBMS struggle
with real-time processing of high-velocity data streams (e.g., millions of transactions per second
in a smart city).
Limited Parallel Processing: RDBMS lacks native support for distributed computing, unlike big
data frameworks (e.g., Hadoop), leading to bottlenecks when analyzing large datasets across
multiple nodes.
Cost and Complexity: Managing large-scale RDBMS instances (e.g., Oracle RAC) involves high
licensing costs and complex administration, whereas big data solutions like NoSQL databases are
more cost-effective for scale.
Example: In a big data IoT application tracking global shipping, an RDBMS might fail to handle 1
TB of unstructured GPS and sensor data daily, whereas a NoSQL solution like Cassandra could
scale horizontally and process it efficiently.
3. Define the ACID properties. How does Cassandra support them?
Definition:
Atomicity: Ensures that a transaction is fully completed or fully rolled back; no partial updates
occur.
Consistency: Guarantees that a transaction brings the database from one valid state to another,
adhering to defined rules (e.g., constraints).
Isolation: Ensures that transactions are executed independently, preventing interference from
concurrent operations.
Durability: Ensures that once a transaction is committed, it is permanently saved, even in case
of system failure.
How Cassandra Handles ACID:
Atomicity:
o Cassandra ensures atomicity at the row level within a single partition. If a write
operation updates multiple columns in one row, it either succeeds or fails entirely.
However, atomicity across multiple rows or partitions is not guaranteed unless explicitly
managed (e.g., using lightweight transactions with IF NOT EXISTS).
Consistency:
o Cassandra follows an eventual consistency model rather than strict consistency. It uses a
tunable consistency level (e.g., ONE, QUORUM, ALL) where data is consistent only after
replication across nodes is complete. For example, with QUORUM, a majority of nodes
must acknowledge a write for it to be considered consistent.
Isolation:
o Cassandra provides isolation through its write path, where updates are written to a
commit log and memtable, ensuring no interference during a single operation. However,
it does not support full transaction isolation levels (e.g., Serializable) like RDBMS.
Concurrent writes to the same row can lead to "last write wins" conflicts unless resolved
with timestamps or lightweight transactions.
Durability:
o Cassandra ensures durability through its commit log: every write is appended to the
commit log on disk before it is acknowledged, so committed data can be replayed after a
node crash. Replication to other nodes provides a further safeguard.
Example in Context:
In an IoT smart grid application, Cassandra stores power usage data. A write operation updating voltage
and current for a single sensor (row) is atomic and durable. However, if two nodes update the same row
concurrently with different values, the latest timestamp wins (weak isolation), and consistency depends
on the chosen level (e.g., QUORUM ensures eventual consistency across nodes).
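The row-level atomicity and lightweight-transaction behaviour described above can be sketched in CQL; this is a minimal, illustrative example (the smartgrid keyspace, sensor_readings table, column names, and values are assumptions, not part of any standard schema):
-- Row-level atomic write: all columns of this row succeed or fail together
INSERT INTO smartgrid.sensor_readings (sensor_id, reading_time, voltage, current_amps)
VALUES ('S001', toTimestamp(now()), 230.5, 4.2);

-- Lightweight transaction (Paxos-based): insert only if the row does not already exist
INSERT INTO smartgrid.sensor_readings (sensor_id, reading_time, voltage, current_amps)
VALUES ('S001', '2025-06-20 22:00:00+0530', 230.5, 4.2)
IF NOT EXISTS;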
(15-Marks questions)
1. Discuss in detail the evolution and history of Apache Cassandra. Why was it developed and how has
it grown?
Origins (2008): Cassandra was initially developed by Facebook to power its Inbox Search feature,
which required handling massive amounts of data with high availability and fault tolerance. It
was created by Avinash Lakshman (who also co-authored Amazon’s Dynamo) and Prashant
Malik, combining ideas from Amazon’s Dynamo (decentralized design) and Google’s Bigtable
(column-family data model).
Open-Source Release (2009): Facebook open-sourced Cassandra under the Apache License,
contributing it to the Apache Software Foundation (ASF). This marked its transition from a
proprietary tool to a community-driven project.
Apache Incubator (2009–2010): Cassandra entered the Apache Incubator program, gaining
traction among developers. Version 0.6, released in 2010, introduced key features like multi-
data-center replication.
Maturity (2011–2015): With the 1.0 release in 2011, Cassandra became production-ready,
adding support for CQL (Cassandra Query Language), which resembled SQL and improved
usability. Version 2.0 (2013) introduced lightweight transactions and triggers, while version 3.0
(2015) added materialized views and a rewritten storage engine with improved compaction and
storage efficiency.
Recent Developments (2016–Present): Version 4.0 (2021) introduced virtual tables, audit
logging, and improved Java compatibility, and version 5.0 (2024) added storage-attached
indexing (SAI), reflecting its adaptation to modern big data needs. As of 2025, Cassandra
continues to evolve with community contributions, supporting IoT, real-time analytics, and AI
workloads.
Why It Was Developed:
Scalability Needs: Facebook needed a system to scale horizontally across thousands of nodes to
manage billions of inbox messages, which RDBMS struggled to achieve due to vertical scaling
constraints.
High Availability: The requirement for zero downtime during server failures (e.g., during peak
usage) led to a design inspired by Dynamo’s peer-to-peer architecture.
Write-Heavy Workloads: Inbox Search involved frequent writes (e.g., message indexing),
necessitating a database optimized for write performance over complex queries.
Fault Tolerance: With data centers worldwide, Cassandra was designed to handle node failures
and network partitions, ensuring data accessibility.
How It Has Grown:
Community and Ecosystem: The open-source model fostered a global community, with
companies like Netflix, Apple, and Uber adopting Cassandra for real-time applications. The
DataStax enterprise edition further commercialized it, adding tools like OpsCenter.
Use Cases: It grew to support diverse applications, from IoT sensor data storage to e-commerce
order processing, handling petabytes of data with high throughput.
Technological Advancements: Integration with Apache Spark, Kafka, and cloud platforms (e.g.,
Azure Cosmos DB’s Cassandra API) expanded its ecosystem, making it a cornerstone for big data
architectures.
Global Impact: By 2025, Cassandra’s adoption in smart cities (e.g., traffic data) and healthcare
(e.g., patient monitoring) reflects its growth into a robust, globally recognized solution, with
over 2,000 contributors and continuous updates.
Example:
Netflix uses Cassandra to manage 1.3 billion requests daily across multiple data centers, showcasing its
scalability and fault tolerance, a direct result of its original design goals.
2. Compare Cassandra with Traditional RDBMS in terms of scalability, performance, and data
structure.
Scalability:
Cassandra: Excels in horizontal scalability, allowing the addition of nodes to a cluster to handle
increased load without downtime. Its peer-to-peer architecture distributes data across nodes
using consistent hashing, making it ideal for big data (e.g., IoT sensor networks with millions of
devices). For example, adding 10 nodes can linearly increase capacity for a smart grid
application.
Traditional RDBMS: Relies on vertical scaling (upgrading hardware like CPU or RAM), which is
limited by physical constraints and costly. Systems like Oracle or MySQL struggle with large-scale
distributed environments, often requiring sharding or replication with complex management.
Performance:
Cassandra: Optimized for write-heavy workloads with high throughput, leveraging an append-
only log-structured merge-tree (LSM) storage engine. It achieves low-latency writes (e.g., <5ms
for IoT data ingestion) and tunable consistency (e.g., QUORUM), but read performance can
degrade with large datasets unless indexed properly. For instance, it handles 1 million writes per
second at Netflix.
Traditional RDBMS: Excels in read-heavy, transactional workloads with strong consistency (e.g.,
ACID compliance), but write performance suffers under high concurrency due to locking
mechanisms (e.g., MySQL might take 50ms per transaction in a busy e-commerce system).
Complex joins and normalization also slow queries.
Data Structure:
Cassandra: A wide-column store NoSQL database, storing data in tables with flexible schemas
(e.g., rows can have different column sets). It uses a key-value pair approach within column
families, supporting unstructured or semi-structured data (e.g., JSON from smart devices). This
flexibility suits evolving IoT data models.
Traditional RDBMS: Uses a rigid, tabular structure with fixed schemas (e.g., rows and columns in
MySQL), enforcing relationships via foreign keys. This is efficient for structured data (e.g.,
customer records) but inflexible for unstructured big data, requiring schema changes for new
data types.
3. Explain CAP theorem with examples. How does Cassandra address the trade-offs?
CAP Theorem: States that a distributed data store can provide at most two of the following three
guarantees simultaneously:
Consistency: All nodes see the same data at the same time after a write operation.
Availability: Every request receives a response, even if some nodes fail, ensuring no downtime.
Partition Tolerance: The system continues to operate despite network partitions (e.g., node
failures or communication breaks).
In a distributed environment, network partitions are inevitable, forcing a trade-off between consistency
and availability.
Examples:
Consistency over Availability: A banking system using an RDBMS (e.g., Oracle) ensures all
accounts reflect the same balance after a transaction (consistency), but if a partition occurs, it
may reject requests until resolved (unavailable).
Availability over Consistency: A social media platform like Twitter uses eventual consistency,
allowing users to post during a partition (available), but some followers might see updates later
(inconsistent).
Partition Tolerance in Action: An IoT smart grid with Cassandra operates across multiple
regions; if a datacenter fails, it remains functional but may temporarily show inconsistent data
until synchronized.
How Cassandra Addresses the Trade-Offs (see the cqlsh sketch at the end of this answer):
Tunable Consistency: Cassandra allows configuring consistency levels (e.g., ONE, QUORUM, ALL)
per operation. For example, using QUORUM (a majority of replicas) ensures stronger consistency
during writes, while ONE prioritizes availability by accepting writes on a single node during a
partition.
Replication Strategy: With a replication factor (e.g., 3), Cassandra stores copies across nodes,
ensuring partition tolerance. If one node fails, others serve data, maintaining availability.
Example in Context:
In a distributed IoT healthcare system, Cassandra stores patient vitals across three data centers. During
a network partition, it uses a QUORUM consistency level to ensure most nodes agree on critical updates
(e.g., heart rate spikes), maintaining consistency and partition tolerance. For non-critical data (e.g.,
routine logs), it switches to ONE, ensuring availability by accepting writes on any available node, with
consistency restored later. This flexibility addresses trade-offs based on application needs.
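A small cqlsh sketch of this per-operation tuning (the keyspace, tables, and values are illustrative assumptions; CONSISTENCY is a cqlsh session command rather than CQL proper):
-- Critical vitals: require a majority of replicas to acknowledge the write
CONSISTENCY QUORUM;
INSERT INTO health.patient_vitals (patient_id, reading_time, heart_rate)
VALUES ('P123', toTimestamp(now()), 82);

-- Routine logs: favour availability, one replica acknowledgement is enough
CONSISTENCY ONE;
INSERT INTO health.routine_logs (patient_id, log_time, note)
VALUES ('P123', toTimestamp(now()), 'routine check-in');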
UNIT-2
(10-Marks)
1. Explain the differences between a logical and a physical data model with examples.
Logical Data Model: Represents the conceptual structure of data, focusing on entities,
relationships, and attributes without detailing how data is stored or implemented. It is
independent of the database management system (DBMS) and emphasizes business
requirements.
o Key Characteristics: Defines what data exists (e.g., entities like "Customer" and
"Order"), their relationships (e.g., one-to-many), and attributes (e.g., CustomerID,
OrderDate), using tools like Entity-Relationship (ER) diagrams.
o Example: In an IoT smart home system, a logical model might include entities "Sensor"
and "Room," with a relationship "Sensor monitors Room," and attributes like SensorID
and RoomTemperature, abstracting the business need to track temperature per room.
Physical Data Model: Specifies how data is stored, accessed, and managed in a specific DBMS,
including tables, columns, indexes, partitions, and storage details. It translates the logical model
into a technical implementation.
o Key Characteristics: Includes physical storage details (e.g., table names, data types,
indexes), optimization strategies (e.g., partitioning), and performance considerations
(e.g., caching).
o Example: For the same smart home system, a physical model in Cassandra might define
a table "SensorData" with columns (SensorID text, RoomID text, Temperature float,
Timestamp timestamp), partitioned by SensorID, and indexed for quick timestamp-
based queries.
Differences:
Abstraction Level: Logical models are abstract and business-focused, while physical models are
technical and implementation-specific.
Dependency: Logical models are DBMS-agnostic, whereas physical models depend on the
chosen system (e.g., Cassandra vs. MySQL).
Detail: Logical models omit storage or performance details, while physical models include them
(e.g., data types, indexing).
Example in Context: A logical model for an e-commerce system might define "Product" and
"Order" entities with a many-to-one relationship. The physical model in an RDBMS might create
tables with primary keys (ProductID, OrderID) and foreign key constraints, while in Cassandra, it
might use a wide-row structure with OrderID as the partition key and ProductID in a collection
column.
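To make the contrast concrete, here is a hedged CQL sketch of the physical model described for the smart home example (the keyspace, exact column names, and the descending clustering order are assumptions about one reasonable implementation):
CREATE TABLE smarthome.sensor_data (
    sensor_id text,
    room_id text,
    temperature float,
    reading_time timestamp,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);  -- newest readings first for time-based queries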
2. Describe how Cassandra designs differ from RDBMS in terms of data modeling and querying.
Data Modeling Differences:
Cassandra:
o Uses a wide-column store model, organizing data into tables with flexible, sparse
column families. Data is modeled around query patterns rather than normalized
relationships, using partition keys and clustering columns to distribute data across
nodes.
RDBMS:
o Employs a relational model with normalized tables linked by primary and foreign keys to
avoid redundancy. Data is structured into fixed schemas with rows and columns (e.g., a
"Customers" table with CustomerID as the primary key).
o Approach: Normalization (e.g., 3NF) ensures data integrity but may require joins, which
can complicate queries. For instance, linking "Orders" and "Customers" tables via
CustomerID.
Querying Differences:
Cassandra:
o Uses CQL (Cassandra Query Language), a SQL-like language, but optimized for high-
throughput, write-heavy workloads. Queries are designed based on partition keys,
limiting flexibility (e.g., cannot query without specifying the partition key unless
indexed).
o Performance: Excels at single-row or range queries within a partition (e.g., "SELECT *
FROM SensorReadings WHERE SensorID = 'S001' AND Timestamp > '2025-06-20'"), with
tunable consistency (e.g., QUORUM).
o Limitation: Joins are not supported; data must be pre-joined during modeling.
RDBMS:
o Uses SQL with full support for complex queries, including joins, aggregations, and
subqueries (e.g., "SELECT * FROM Orders JOIN Customers ON Orders.CustomerID =
Customers.CustomerID WHERE OrderDate > '2025-06-20'").
o Performance: Strong for read-heavy, transactional workloads but slower for high write
volumes due to locking and normalization overhead.
o Flexibility: Offers rich querying capabilities but may incur latency with large datasets or
complex joins.
Example in Context:
In an IoT traffic monitoring system, Cassandra might model data with a table partitioned by CameraID,
storing timestamped vehicle counts, optimized for queries like "Get counts for Camera C001 today." An
RDBMS might normalize this into "Cameras" and "VehicleCounts" tables, requiring a join to retrieve the
same data, which could be slower with millions of records.
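A hedged sketch of the Cassandra side of this traffic example (keyspace and column names are illustrative): partitioning by camera keeps the quoted query inside a single partition.
CREATE TABLE traffic.vehicle_counts (
    camera_id text,
    observed_at timestamp,
    vehicle_count int,
    PRIMARY KEY (camera_id, observed_at)
) WITH CLUSTERING ORDER BY (observed_at DESC);

-- "Get counts for Camera C001 today" touches only one partition
SELECT observed_at, vehicle_count
FROM traffic.vehicle_counts
WHERE camera_id = 'C001' AND observed_at >= '2025-06-20';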
3. What steps are involved in evaluating and refining a data model in Cassandra?
Define Query Requirements:
o Analyze the application’s read and write requirements (e.g., time-series queries for IoT
sensor data). Identify primary access patterns, such as retrieving data by SensorID and
Timestamp, to design tables around these needs.
Design Tables Around Queries:
o Create tables based on query patterns, selecting partition keys (e.g., SensorID) to
distribute data evenly and clustering columns (e.g., Timestamp) for ordering. Define
column families to denormalize data, avoiding joins. For example, a table "SensorData"
might include (SensorID, Timestamp, Value).
Evaluate Performance:
o Run benchmark tests with realistic workloads (e.g., 1 million inserts, 10,000 reads per
second) using tools like Cassandra Stress Tool. Measure latency, throughput, and error
rates. For instance, check if a query on SensorID takes <10ms.
Refine the Model:
o Adjust partition keys or add secondary indexes (e.g., Materialized Views or Storage-
Attached Indexing) if query performance is suboptimal. For example, add a Materialized
View for reverse lookups (Timestamp to SensorID) if needed.
o Tune consistency levels (e.g., LOCAL_QUORUM) and replication factors (e.g., 3) based on
availability vs. consistency needs. Test failure scenarios to ensure data durability and
availability.
o Gather feedback from application performance and user experience (e.g., slow
dashboard updates). Redesign tables or add new ones (e.g., a summary table for
aggregated data) if query patterns change, such as adding weekly averages.
o Use Cassandra’s monitoring tools (e.g., OpsCenter, Prometheus) to track metrics like
compaction delays or disk usage. Refine the model by adjusting compaction strategies
or adding TTL (time-to-live) for old data (e.g., expire sensor readings after 30 days).
Example in Context:
For an IoT weather system, an initial model with a partition key of StationID and clustering by
Timestamp is tested. If queries for hourly averages are slow, a Materialized View is added to
precompute averages, and performance is re-evaluated, ensuring sub-second response times for 1000
stations.
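The TTL refinement mentioned above can be expressed directly in CQL; a minimal sketch (the table name is illustrative):
-- Expire sensor readings automatically after 30 days (2,592,000 seconds)
ALTER TABLE weather.sensor_readings
WITH default_time_to_live = 2592000;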
(15-Marks)
1. Design and Explain a Complete Data Model for an E-commerce System Using
Cassandra. Include Keyspaces, Tables, and Primary Keys.
Overview:
A data model for an e-commerce system using Cassandra must be query-driven, denormalized,
and optimized for high write throughput and read scalability, typical of online retail platforms
handling product catalogs, orders, and user interactions. Cassandra’s distributed nature suits this
use case, supporting millions of transactions daily.
Keyspace Design:
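Before the tables, the keyspace itself must be defined; a hedged sketch of one reasonable definition (the keyspace name, data center names, and replication factors are assumptions chosen for illustration):
CREATE KEYSPACE ecommerce
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'India-South': 3,
    'US-East': 3
}
AND durable_writes = true;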
1. Users Table
o Purpose: Store customer details for authentication and personalization.
o Structure:
CREATE TABLE users (
    user_id text,
    email text,
    name text,
    address text,
    registration_date timestamp,
    PRIMARY KEY (user_id)
);
o Primary Key: user_id (partition key) ensures each user’s data is uniquely
distributed across nodes.
o Explanation: Partitioned by user_id for efficient lookups (e.g., "Get user profile
for user123"). No clustering key is needed for simple queries.
2. Products Table
o Purpose: Maintain a catalog of products with details.
o Structure:
CREATE TABLE products (
    category text,
    product_id text,
    name text,
    price decimal,
    stock int,
    PRIMARY KEY (category, product_id)
);
o Primary Key: Composite key with category (partition key) and product_id
(clustering column) to distribute data by category and order products within each
category.
o Explanation: Allows queries like "List all electronics products" or "Get product
P001 in electronics," optimizing category-based browsing.
3. Orders Table
o Purpose: Record customer orders with order details.
o Structure:
CREATE TABLE orders (
    user_id text,
    order_date timestamp,
    order_id text,
    total_amount decimal,
    status text,
    PRIMARY KEY (user_id, order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);
o Primary Key: Composite key with user_id (partition key), order_date (clustering
column), and order_id (clustering column) to distribute orders by user and order
them chronologically.
o Explanation: Supports queries like "Get last 10 orders for user123," with
CLUSTERING ORDER BY DESC for recent orders.
4. Order_Items Table
o Purpose: Link orders to products with quantities.
o Structure:
CREATE TABLE order_items (
    order_id text,
    product_id text,
    quantity int,
    unit_price decimal,
    PRIMARY KEY (order_id, product_id)
);
o Primary Key: Composite key with order_id (partition key) and product_id
(clustering column) to store items per order.
o Explanation: Enables queries like "List items in order O001," denormalizing data
to avoid joins.
Explanation of Design:
Query-Driven Approach: Tables are designed around common queries (e.g., user
orders, product listings), avoiding joins by denormalizing data (e.g., duplicating user_id
in orders).
Partitioning Strategy: Partition keys (e.g., user_id, category) ensure even data
distribution, preventing hotspots in a system with 1 million users.
Scalability: Horizontal scaling is supported by adding nodes, handling peak traffic (e.g.,
Black Friday sales).
Example: For a user "user123" placing order "O001" with product "P001" (electronics
category), data is stored across nodes, with queries like "SELECT * FROM orders
WHERE user_id = 'user123' LIMIT 10" returning recent orders efficiently.
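A few hedged example queries against the tables sketched above, matching the stated access patterns (identifiers and values are illustrative):
-- Last 10 orders for a user (single partition, clustered newest-first)
SELECT order_id, order_date, total_amount, status
FROM orders
WHERE user_id = 'user123'
LIMIT 10;

-- Items belonging to one order
SELECT product_id, quantity, unit_price
FROM order_items
WHERE order_id = 'O001';

-- Browse one product category
SELECT product_id, name, price
FROM products
WHERE category = 'electronics';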
2. Compare RDBMS and Cassandra in Terms of Their Design Philosophies and Key Differences.
Overview:
RDBMS (e.g., MySQL, PostgreSQL) and Cassandra represent different design philosophies,
rooted in their intended use cases—transactional systems vs. distributed, high-scale data stores.
Key Differences:
1. Data Model:
o RDBMS: Uses a relational model with normalized tables and fixed schemas,
linked by primary-foreign key relationships (e.g., "Customers" and "Orders"
tables with CustomerID).
o Cassandra: Employs a wide-column store model with flexible, denormalized
schemas, designed around query patterns (e.g., "Orders" table partitioned by
user_id).
2. Scalability:
o RDBMS: Relies on vertical scaling (upgrading hardware), limiting scalability for
large datasets due to single-server constraints.
o Cassandra: Designed for horizontal scaling, adding nodes to a cluster to handle
increased load, ideal for big data (e.g., scaling to 100 nodes for IoT traffic data).
3. Consistency Model:
o RDBMS: Enforces strict ACID consistency, ensuring all transactions maintain
data integrity (e.g., a bank transfer is either fully committed or rolled back).
o Cassandra: Uses eventual consistency with tunable levels (e.g., QUORUM),
prioritizing availability and partition tolerance (e.g., accepting writes during
network splits).
4. Querying Approach:
o RDBMS: Supports complex SQL queries with joins, aggregations, and subqueries
(e.g., "SELECT * FROM Orders JOIN Customers ON Orders.CustomerID =
Customers.CustomerID").
o Cassandra: Uses CQL, optimized for simple, partition-key-based queries without
joins (e.g., "SELECT * FROM orders WHERE user_id = 'user123'"), requiring
pre-joined data.
5. Data Distribution:
o RDBMS: Typically centralized or replicated with master-slave setups, requiring
manual sharding for distribution (e.g., MySQL with Galera cluster).
o Cassandra: Features a peer-to-peer, ring-based distribution using consistent
hashing, automatically balancing data across nodes (e.g., IoT sensor data spread
across a 10-node cluster).
Philosophical Goal: RDBMS prioritizes data integrity and complex querying for
transactional systems (e.g., e-commerce payments), while Cassandra focuses on high
availability and scalability for write-heavy, distributed systems (e.g., real-time analytics).
Trade-Offs: RDBMS sacrifices scalability for consistency, whereas Cassandra trades
strict consistency for availability (CAP theorem), aligning with big data needs.
Example: In an e-commerce system, an RDBMS ensures a customer’s order total is
consistent across tables, but struggles with 1 million concurrent users. Cassandra handles
this load by distributing orders, accepting temporary inconsistencies resolved later.
3. Explain the Entire Process of Building, Evaluating, and Refining a Data Model
in Cassandra from Start to Deployment.
Process Overview:
Building, evaluating, and refining a Cassandra data model is an iterative, query-driven process
tailored to distributed systems, ensuring performance, scalability, and maintainability from
conception to deployment.
Steps:
1. Define Query Requirements:
o Analyze the application’s read and write patterns (e.g., time-series queries for IoT
sensor data) and identify the primary access paths, such as retrieving readings by
SensorID and time range.
2. Design Tables Around Queries:
o Create one table per query pattern, choosing the partition key for even distribution and
clustering columns for ordering. For example:
CREATE TABLE readings (
    sensor_id text,
    timestamp timestamp,
    value float,
    PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
o Denormalize data to avoid joins, ensuring each table serves a specific query
pattern.
3. Data Population and Distribution Testing:
o Load sample data (e.g., 1 million sensor readings) using tools like cqlsh or a data
generator. Use nodetool ring to verify even distribution across nodes, adjusting
partition keys if hotspots occur (e.g., too many readings for one SensorID).
4. Performance Evaluation:
o Benchmark the model with realistic workloads using Cassandra Stress Tool or
custom scripts. Test latency (e.g., <10ms for reads), throughput (e.g., 100,000
writes/second), and consistency levels (e.g., LOCAL_QUORUM). Monitor with
OpsCenter or Prometheus for CPU, memory, and disk usage.
5. Refinement Based on Feedback:
o Analyze performance results. If reads are slow, add secondary indexes (e.g.,
Materialized View for reverse lookups) or adjust clustering order. For example,
create a view:
-- Reverse lookup: find which sensors reported at a given time
CREATE MATERIALIZED VIEW readings_by_time AS
    SELECT sensor_id, timestamp, value
    FROM readings
    WHERE sensor_id IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY (timestamp, sensor_id);
o Optimize partition size (e.g., 100MB–1GB per partition) to balance load and
reduce compaction overhead.
6. Consistency and Replication Tuning:
o Adjust replication factor (e.g., from 3 to 5) and consistency levels based on
availability needs. Test failure scenarios (e.g., node down) to ensure data
durability and accessibility.
7. Integration and Testing with Application:
o Integrate the model with the application (e.g., a Java app using DataStax driver).
Perform end-to-end tests (e.g., simulate 10,000 users querying sensor data) to
validate query performance and data integrity.
8. Deployment and Monitoring:
o Deploy the model to a production cluster using a CI/CD pipeline (e.g., via
Terraform or Ansible). Monitor post-deployment with tools like Grafana, tracking
metrics like read/write latency and node health. Set alerts for anomalies (e.g.,
latency > 50ms).
9. Continuous Refinement:
o Collect user and system feedback (e.g., slow dashboard updates). Refine by
adding new tables (e.g., aggregated hourly data) or adjusting TTL (e.g., expire
data after 30 days) to manage storage.
Example in Context:
For an IoT weather system, the process starts with designing a readings table, populating it with
1 million records, and testing 100 reads/second. If latency exceeds 10ms, a Materialized View
for hourly averages is added. After deployment to a 5-node cluster, monitoring reveals a hotspot,
prompting a partition key change to (SensorID, Region), ensuring scalability and efficiency.
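The final refinement mentioned above, as a hedged CQL sketch (the new table name is an assumption; in practice a partition key cannot be altered in place, so a new table is created and data migrated into it):
-- Composite partition key (sensor_id, region) splits what was one oversized partition
CREATE TABLE readings_by_sensor_region (
    sensor_id text,
    region text,
    timestamp timestamp,
    value float,
    PRIMARY KEY ((sensor_id, region), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);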
UNIT-3
10-Marks
1. Explain the role of the cassandra.yaml file in Cassandra configuration. Discuss at least five critical
properties.
Role: cassandra.yaml is Cassandra’s primary configuration file; it defines how each node
communicates, stores data, and participates in the cluster, and most of its settings take effect only
after a node restart.
Critical Properties:
listen_address: Sets the IP address or hostname for inter-node communication within the
cluster (e.g., "192.168.1.10"). Proper configuration is vital for node discovery and data
replication during peak traffic.
rpc_address: Defines the IP address for client connections (e.g., "0.0.0.0" for all interfaces). This
enables clients to query the node, critical for real-time order processing tonight.
num_tokens: Determines the number of virtual nodes (vnodes) per physical node (256 by default in
older releases, 16 from version 4.0 onward). Setting it consistently and appropriately improves data
distribution across a 10-node cluster handling 1 million orders.
endpoint_snitch: Tells Cassandra how nodes are grouped into racks and data centers (e.g.,
GossipingPropertyFileSnitch), which drives replica placement and efficient cross-region replication.
Explanation:
The cassandra.yaml file acts as the central control point, allowing fine-tuning to match workload
demands. For instance, during the current sale, adjusting num_tokens and endpoint_snitch ensures
balanced data distribution and efficient cross-region replication, while listen_address and rpc_address
maintain cluster connectivity and client access.
2. Describe the importance of proper directory configuration in Cassandra, including the purpose of
data_file_directories, commitlog_directory, and hints_directory.
data_file_directories:
o Purpose: Specifies the directories where SSTables (on-disk data files) are stored (e.g.,
"/var/lib/cassandra/data"). Multiple directories can be listed for load balancing across
disks.
o Importance: Spreading SSTables across multiple disks (ideally SSDs) increases I/O
parallelism and capacity, keeping reads fast as order data grows.
commitlog_directory:
o Purpose: Defines the location for the commit log, a durable record of all writes before
they are flushed to SSTables (e.g., "/var/lib/cassandra/commitlog").
o Importance: Isolating the commit log on a separate, fast device (e.g., another SSD)
improves write performance and recovery speed after a node restart, critical for
maintaining order integrity during peak traffic.
hints_directory:
o Purpose: Stores hint files that track data for nodes that are temporarily down, enabling
repair when they recover (e.g., "/var/lib/cassandra/hints").
o Importance: Allows hinted handoff to replay missed writes once a failed node returns,
restoring consistency without an immediate full repair.
Explanation:
Configuring these directories on separate drives (e.g., SSD for data_file_directories, another for
commitlog_directory) leverages I/O parallelism, enhancing throughput and fault tolerance. For example,
during tonight’s sale, a well-configured setup ensures quick writes to the commit log and efficient
SSTable access, while hints facilitate recovery if a node fails under load.
3. Discuss the use and significance of cassandra-env.sh in JVM tuning for Cassandra performance
optimization.
Memory Configuration:
o Use: Sets JVM heap size via MAX_HEAP_SIZE and HEAP_NEWSIZE (e.g.,
MAX_HEAP_SIZE="8G" for an 8GB heap).
o Significance: A right-sized heap avoids out-of-memory errors and excessive garbage
collection under heavy write loads.
Garbage Collection Tuning:
o Use: Configures GC algorithms (e.g., -XX:+UseG1GC for the G1 collector) and parameters
like -XX:MaxGCPauseMillis=200 to limit pause times.
o Significance: Keeps GC pauses short and predictable, avoiding latency spikes during high
concurrency.
Thread and GC-Thread Settings:
o Use: Passes JVM flags that control how many threads the collector and JVM use for
concurrent work.
o Significance: Enhances parallel processing of writes and reads across a 10-node cluster,
improving throughput for the current sale’s high concurrency.
Heap Dump Settings:
o Use: Enables heap dumps on out-of-memory errors (e.g., -XX:+HeapDumpOnOutOfMemoryError
plus a dump directory).
o Significance: Provides debugging data if memory issues arise, ensuring system stability
under load, which is critical for uninterrupted e-commerce operations.
Explanation:
The cassandra-env.sh file bridges Cassandra’s performance with JVM capabilities, allowing
customization for workload-specific needs. For instance, during tonight’s sale, increasing
MAX_HEAP_SIZE to 8GB and using G1GC ensures the cluster handles 100,000 writes/second without GC-
related delays, while thread tuning maximizes CPU utilization across nodes.
15-Marks
1. Propose a Configuration Strategy for Deploying Cassandra in a Production Multi-Node Cluster.
Overview:
Deploying Cassandra in a production multi-node cluster requires a robust configuration strategy to
ensure scalability, high availability, and performance, especially for a high-traffic scenario like an e-
commerce sale at 10:43 PM IST on June 20, 2025. This strategy balances resource allocation, data
durability, and network efficiency across a 5-node cluster.
Configuration Strategy:
Cluster Name:
o Rationale: A unique name ensures nodes join the intended cluster, preventing
misconfiguration during the sale peak with 1 million orders.
Seed Nodes:
o Rationale: Designates nodes 192.168.1.10 and 192.168.1.11 as seeds for initial node
discovery and gossip communication, ensuring cluster stability across 5 nodes in India-
South and US-East data centers.
IP Settings:
o listen_address: Set per node to its own private IP (e.g., 192.168.1.10) for inter-node gossip.
o rpc_address: 0.0.0.0 to allow client connections from any interface, critical for global
order processing.
Directory Separation:
o Rationale: Separating directories onto different drives (SSDs for data/caches, SSD for
commit log, HDD for hints) optimizes I/O performance and recovery, handling 10 TB of
order data tonight.
Memory Tuning:
o Rationale: Tuning heap size and GC prevents out-of-memory errors, while G1GC
minimizes latency (<200ms) for real-time queries, supporting 10,000 concurrent users.
Implementation Steps:
Configure cassandra.yaml on each node with the above settings, ensuring consistency.
Start nodes sequentially, verifying cluster status with nodetool status at 10:43 PM IST.
Example:
During the sale, the 5-node cluster handles 1 million orders, with seed nodes facilitating join, SSDs
ensuring fast data access, and JVM tuning maintaining sub-10ms latency for order lookups.
2. Elaborate on the Step-by-Step Process to Modify Cassandra to Handle Large-Scale Workloads,
Highlighting the Importance of Each Configuration File Involved.
Overview:
Modifying Cassandra to handle large-scale workloads (e.g., 1 million orders at 10:43 PM IST on June 20,
2025) involves adjusting configuration files to optimize performance, scalability, and reliability. This
process ensures the system adapts to increased data volume and concurrency.
Step-by-Step Process:
1. Assess Workload Requirements:
o Analyze expected load (e.g., 100,000 writes/second, 10 TB data). Identify query patterns
(e.g., order retrieval by user_id).
2. Update cassandra.yaml:
o Modifications: Raise num_tokens for even data distribution, increase concurrent_writes,
and verify seed nodes and endpoint_snitch for the expected 100,000 writes/second.
o Importance: These core settings govern data distribution, throughput, and cluster
membership, so mis-sizing them bottlenecks the whole cluster.
3. Tune cassandra-env.sh:
o Modifications: Increase MAX_HEAP_SIZE (e.g., to 8G) and enable G1GC to keep garbage
collection pauses short.
o Importance: Prevents memory pressure and GC-related latency spikes under high
concurrency.
4. Configure logback.xml:
o Modifications: Set log level to "INFO" and rotate logs daily to /var/log/cassandra to
manage disk space.
o Importance: Prevents log overload during peak traffic, aiding troubleshooting without
performance impact.
5. Adjust cassandra-rackdc.properties:
o Modifications: Define data centers (e.g., dc=India-South, rack=Rack1) and replicate with
NetworkTopologyStrategy (replication factor 3).
o Importance: Ensures data locality and fault tolerance across regions, critical for global
order processing.
6. Benchmark and Validate:
o Use Cassandra Stress Tool to simulate 1 million writes and 100,000 reads, monitoring
latency and throughput with OpsCenter.
o Importance: Validates configuration changes, ensuring the cluster handles the sale load
at 10:43 PM IST.
7. Deploy and Monitor:
o Roll out changes to the 5-node cluster, restarting nodes one at a time with nodetool drain
followed by a service restart (e.g., systemctl restart cassandra). Monitor with Grafana for
metrics (e.g., CPU > 80%, latency > 50ms).
Example:
For the sale, increasing num_tokens and heap size in cassandra.yaml and cassandra-env.sh allows the
cluster to scale to 1.5 million orders, while logback.xml prevents log saturation, and cassandra-
rackdc.properties ensures data availability across India-South and US-East.
3. Discuss the Implications of Incorrect Configuration in cassandra.yaml, with Examples of
Misconfigurations and Their Effects.
Overview:
Incorrect configuration in cassandra.yaml can lead to performance degradation, data loss, or cluster
instability, especially under the high load of an e-commerce sale at 10:43 PM IST on June 20, 2025.
Understanding these implications is key to maintaining a 5-node cluster handling 1 million orders.
Implications:
Performance Degradation: Misconfigured settings can overload nodes, increasing latency and
reducing throughput.
Data Inconsistency: Improper replication or consistency settings may lead to data loss or
unavailability.
Cluster Instability: Wrong network or seed node settings can cause nodes to fail to join,
disrupting operations.
Resource Exhaustion: Poor directory or memory settings can exhaust disk space or memory,
crashing nodes.
Examples of Misconfigurations and Effects:
Wrong listen_address (e.g., left blank or pointing to an unreachable interface):
o Effect: The node cannot gossip with its peers, so replicas stop receiving updates or the
node drops out of the ring.
o Implication: Orders for 500,000 users may fail to sync, causing transaction losses.
Inconsistent or too-low num_tokens across nodes:
o Effect: Causes uneven data distribution, creating hotspots on nodes with fewer tokens.
This overloads disk I/O, increasing latency to >50ms for order queries.
endpoint_snitch left as SimpleSnitch in a multi-data-center deployment:
o Effect: Ignores data center awareness, leading to inefficient replication across India-
South and US-East. Reads may hit remote nodes, adding 100ms latency.
concurrent_writes set far too low:
o Effect: Limits write throughput to 10,000 writes/second, causing queue buildup during
100,000 writes/second demand. This leads to timeouts and failed orders.
o Implication: Results in lost sales and customer complaints during the peak sale.
A single, undersized data_file_directories entry:
o Effect: Forces all SSTables onto a single disk, exhausting 1 TB space with 10 TB of order
data. This causes node crashes and data unavailability.
Mitigation:
Regular validation with nodetool status and monitoring with Grafana can detect these issues. For
example, adjusting listen_address to a node-specific IP and adding multiple data_file_directories on
SSDs ensures stability and performance during the sale.
Example:
If num_tokens is set inconsistently across nodes, ring ownership becomes uneven and one node may
end up holding far more data (e.g., 2 TB of orders) than its disks and I/O can serve, while a consistent
setting balances the load, maintaining sub-10ms latency for 1 million orders at 10:43 PM IST.
UNIT-4
10-Marks
1. Explain the output and significance of the 'nodetool status' command in Cassandra.
The nodetool status command summarizes the state of every node in the cluster, including:
Status: A two-letter code combining up/down and state (e.g., "UN" for Up/Normal, "UL" for
Up/Leaving, "DN" for Down/Normal).
State: The node’s role (e.g., "N" for Normal, "L" for Leaving).
Load: The amount of data stored on the node (e.g., 1.23 TB).
Example Output:
Datacenter: dc1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load     Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.10  1.23 TB  256     60.0%             550e8400-e29b-11d4-a716-446655440000  rack1
DN  192.168.1.12  1.20 TB  256     60.0%             7c2d4f30-aa01-4b2e-9c3d-0f1e2d3c4b5a  rack1
Significance:
Cluster Health: Indicates operational status (e.g., "UN" nodes are healthy, "DN" nodes need
investigation), critical for ensuring 1 million e-commerce orders process smoothly at 10:49 PM
IST.
Load Balancing: Monitors data distribution (e.g., 1.23 TB vs. 1.20 TB) to detect imbalances,
preventing hotspots during peak traffic.
Fault Detection: Identifies down nodes (e.g., 192.168.1.12) for repair or replacement, ensuring
high availability.
Capacity Planning: Helps assess if the 5-node cluster can handle 10 TB of data, guiding node
additions.
Example: During tonight’s sale, nodetool status reveals a "DN" node, prompting immediate
repair to maintain order processing for 10,000 concurrent users.
2. Describe how the 'nodetool info' command helps administrators monitor Cassandra nodes.
The nodetool info command displays statistics for a single node, giving administrators a quick health
snapshot.
Example Output:
ID : 550e8400-e29b-11d4-a716-446655440000
Gossip active : true
Native Transport active : true
Load : 1.23 TB
Generation No : 1624200000
Uptime : 10d 2h
Heap Memory (MB) : 6144 / 8192
Data Center : dc1
Rack : rack1
Significance:
Health Monitoring: Confirms gossip and native transport (client) activity, ensuring node
communication and client access, vital for 100,000 writes/second during the sale.
Resource Utilization: Tracks heap usage (e.g., 6G/8G) to detect memory pressure, guiding JVM
tuning to avoid crashes under 10,000 users.
Load Assessment: Monitors data load (e.g., 1.23 TB) to identify storage constraints, aiding
capacity planning for 10 TB of order data.
Uptime and Stability: Verifies uptime (e.g., 10d 2h) to assess reliability, critical for uninterrupted
e-commerce operations.
Troubleshooting: Identifies issues (e.g., "Native Transport active: false") for quick resolution,
preventing downtime.
Example: At 10:49 PM IST, nodetool info shows 90% heap usage, prompting an increase to 12G
in cassandra-env.sh to handle the sale peak.
3. List and Explain the Different Thread Pool Metrics Available Through 'nodetool tpstats'.
Overview of 'nodetool tpstats':
The nodetool tpstats command provides statistics on Cassandra’s thread pools, which manage various
tasks like reads, writes, and compaction. Executed at 10:49 PM IST, it helps optimize performance for
the current e-commerce workload.
ReadStage:
o Metrics: Active, Pending, Completed, Blocked tasks (e.g., Active: 5, Pending: 2).
o Significance: High pending tasks (e.g., 10) indicate read bottlenecks, suggesting index
optimization for 100,000 order lookups.
MutationStage:
o Significance: Excessive pending writes (e.g., 15) during 1 million order inserts signal the
need to increase concurrent_writes in cassandra.yaml.
CompactionExecutor:
o Significance: High pending compactions (e.g., 5) can slow reads, requiring tuning of
compaction_throughput_mb_per_sec to free resources during peak traffic.
RequestResponseStage:
o Significance: Tracks responses to inter-node requests; a growing pending count points to
slow or overloaded replica nodes.
GossipStage:
o Significance: Handles gossip messages between nodes; a backlog here suggests network
problems that could delay failure detection.
MigrationStage:
o Significance: Pending tasks (e.g., 2) during schema updates can disrupt order processing,
requiring off-peak execution.
Example Output:
Pool Name              Active  Pending  Completed  Blocked
ReadStage              5       2        100000     0
MutationStage          8       3        150000     0
CompactionExecutor     2       1        5000       0
RequestResponseStage   3       0        80000      0
GossipStage            1       0        1000       0
MigrationStage         0       0        10         0
Explanation:
These metrics help diagnose performance issues. For instance, at 10:49 PM IST, 3 pending
MutationStage tasks during 1 million order writes suggest increasing concurrent_writes to 32, while 2
pending ReadStage tasks indicate indexing needs for faster order retrieval, ensuring sub-10ms latency
for 10,000 users.
15-Marks
1. Demonstrate the Use of Major nodetool Commands with Practical Scenarios.
nodetool is a command-line utility for managing and monitoring Cassandra clusters. It is essential for
maintaining a 5-node cluster handling 1 million e-commerce orders at 10:56 PM IST on June 20, 2025.
Below are demonstrations of major commands with practical scenarios.
nodetool status:
o Scenario: During the sale peak, a node (192.168.1.12) shows "DN" (Down/Normal).
o Demonstration: Run nodetool status and note the DN entry for 192.168.1.12.
o Action: Restart the Cassandra service on that node (e.g., sudo systemctl start cassandra) and
run nodetool repair to restore availability for 500,000 order processes.
nodetool info:
Scenario: Heap memory usage reaches 90% (7.2G/8G) on node 192.168.1.10, slowing order
queries.
Demonstration: Run nodetool info on 192.168.1.10 and check the Heap Memory line (e.g., 7.2G of
8G used).
Action: Adjust MAX_HEAP_SIZE to 12G in cassandra-env.sh and restart the node to handle
10,000 concurrent users.
nodetool tpstats:
Scenario: Order writes slow down as MutationStage pending tasks climb during the peak.
Demonstration: Run nodetool tpstats and inspect the Pending column for MutationStage and
ReadStage.
Action: If pending mutations persist, raise concurrent_writes in cassandra.yaml and re-check.
nodetool repair:
o Scenario: After node 192.168.1.12 rejoins the cluster, its replicas may have missed writes.
o Action: Schedule nodetool repair during off-peak hours (e.g., 2 AM IST) to avoid impacting the
sale, ensuring data integrity.
nodetool cleanup:
o Scenario: After adding a new node, old data on 192.168.1.10 causes redundancy.
o Action: Run nodetool cleanup on 192.168.1.10 to drop data the node no longer owns, freeing
disk space.
Explanation:
These commands enable proactive maintenance, ensuring the cluster handles the sale load at 10:56 PM
IST. For example, repair restores consistency, while cleanup optimizes space, maintaining sub-10ms
latency for order processing.
2. Analyze a Scenario Where nodetool tpstats Reveals Thread Pool Congestion. What Corrective
Actions Would You Recommend?
Scenario Analysis:
At 10:56 PM IST on June 20, 2025, during an e-commerce sale with 1 million orders, nodetool tpstats on
a 5-node cluster shows congestion in its thread pools:
Observations:
o ReadStage with 20 pending and 5 blocked tasks indicates read bottlenecks, slowing
order lookups for 10,000 users.
o MutationStage also shows a growing pending queue as 1 million order writes arrive.
Root Causes:
o Insufficient thread pool sizes, high I/O contention, or inadequate JVM memory (e.g., 8G
heap at 90% usage) under peak load.
Corrective Actions:
o Increase concurrent_reads and concurrent_writes in cassandra.yaml (e.g., to 64 and 32)
to give the congested stages more worker threads.
o Rationale: Reduces queue buildup, ensuring sub-10ms latency for reads and writes.
o Use nodetool status to check load (e.g., 1.25 TB vs. 1.20 TB) and nodetool cleanup on
uneven nodes.
o Add a node if pending tasks exceed 50 after tuning, verified by nodetool tpstats.
Implementation:
Apply changes, restart nodes one at a time (nodetool drain, then restart the Cassandra service), and
re-run nodetool tpstats to confirm pending tasks drop (e.g., to 5 for MutationStage). Monitor with
Grafana to ensure stability.
Example:
After tuning, ReadStage pending drops to 5, and MutationStage to 10, reducing latency to 5ms, ensuring
smooth order processing for the sale peak.
3. Explain the Role of Gossip in Cassandra. How Can nodetool info Help Verify Its Status and Issues?
Role of Gossip:
Node Discovery: New nodes join the cluster by contacting seed nodes, sharing state via gossip
messages.
State Propagation: Each node periodically (every second) exchanges state information (e.g.,
uptime, load) with up to three random peers, ensuring all nodes have a consistent view.
Failure Detection: Marks nodes as down if they miss heartbeats, triggering repairs (e.g., hint
replay).
Load Balancing: Distributes data ownership percentages based on node availability, critical for a
5-node cluster handling 1 million orders at 10:56 PM IST.
Example: If node 192.168.1.12 fails, gossip updates other nodes, enabling the cluster to
redistribute 1.20 TB of data.
How nodetool info Helps Verify Gossip Status and Issues:
Gossip active field:
o Verification: Gossip active: true confirms the node is exchanging state with its peers.
o Issue Detection: Gossip active: false signals a failure, possibly due to network issues or
listen_address misconfiguration, halting state updates.
Load and Uptime fields:
o Verification: Consistent load and uptime across nodes (e.g., 1.20–1.25 TB, 10d uptime)
suggest healthy gossip, balancing 10 TB of data.
o Issue Detection: Divergent values (e.g., 0 TB load) indicate a node isn’t receiving gossip,
possibly due to a seed node failure.
Native Transport active field:
o Verification: Ensures client communication aligns with gossip state, supporting 10,000
user queries.
o Issue Detection: false values with active gossip suggest internal inconsistencies,
requiring nodetool drain and a service restart.
Actionable Insights:
o If Gossip active: false, check network (e.g., ping 192.168.1.11) and cassandra.yaml
settings. Use nodetool gossipinfo for detailed state to pinpoint issues.
o Example: At 10:56 PM IST, nodetool info shows Gossip active: true and Load: 1.23 TB,
confirming healthy gossip, while a false state would trigger a network diagnostic.
Explanation:
Gossip ensures cluster cohesion, and nodetool info provides a real-time health check. For the sale,
verifying gossip status prevents data inconsistency, ensuring all 1 million orders are processed reliably.
UNIT-5
10-marks
1. Explain the role and purpose of the commit log and memtable in Cassandra's write path.
Commit Log:
Role: The commit log is a durable log file that records all write operations before they are
applied to the memtable. It is stored on disk (e.g., /var/lib/cassandra/commitlog) and serves as a
crash-recovery mechanism.
Purpose: Guarantees durability, so no acknowledged write is lost even if the node crashes before
the memtable is flushed, critical for 1 million orders.
Process: When a write arrives, it is appended to the commit log synchronously, guaranteeing
that even if the system fails, data up to the last commit is recoverable.
Memtable:
Role: The memtable is an in-memory data structure (e.g., a sorted skip list) that holds recent
write data before it is flushed to disk as an SSTable. It resides in the JVM heap.
Purpose: Provides fast write performance by allowing in-memory storage and quick lookups for
recent data. During the sale peak, the memtable handles 100,000 order writes/second, enabling
sub-10ms latency for updates.
Process: Writes are first written to the commit log, then inserted into the memtable. Once the
memtable reaches a size threshold (e.g., 128MB) or a time limit, it is flushed to an SSTable.
The write path begins with a client request, which is logged in the commit log for durability and
stored in the memtable for performance. This dual mechanism ensures both speed (memtable)
and reliability (commit log), critical for handling 1 million orders tonight.
2. Describe the flushing process in Cassandra. When does it happen, and what are the consequences?
Flushing Process:
The flushing process in Cassandra involves transferring data from the memtable to an SSTable on disk. It
occurs in the following steps:
When the memtable reaches a configurable size threshold (e.g., roughly 128MB, governed by
memtable_heap_space_in_mb and memtable_cleanup_threshold in cassandra.yaml) or a flush
interval set per table via memtable_flush_period_in_ms, it is marked as immutable.
A new memtable is created to handle incoming writes, while the old memtable is written to disk
as an SSTable in the data_file_directories.
The commit log entries for the flushed data are cleared to free space, ensuring the log doesn’t
grow indefinitely.
When It Happens:
Size-Based Trigger: Occurs when the memtable size exceeds the threshold, e.g., after 100,000
order writes accumulate 150MB of data at 11:07 PM IST.
Time-Based Trigger: Happens periodically (e.g., every 10 minutes) even if the size limit isn’t
reached, ensuring regular data persistence.
Manual Trigger: Administrators can force flushing with nodetool flush during maintenance, e.g.,
to stabilize the cluster under load.
Consequences:
Positive:
o Frees up memory in the JVM heap, preventing out-of-memory errors during the sale
peak with 10,000 concurrent users.
o Persists data to SSTables, enabling durable storage and reducing commit log size for
faster recovery.
Negative:
o Increases I/O load on data_file_directories (e.g., SSDs), potentially causing latency spikes
(e.g., >20ms) if disks are slow.
o Triggers compaction later, which can consume CPU and I/O resources, slowing reads if
not tuned (e.g., via compaction_throughput_mb_per_sec).
Example: If flushing occurs for 200MB of order data, it ensures durability but may temporarily
delay reads, requiring SSD optimization to maintain sub-10ms performance.
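One of the flush triggers above is a per-table option that can be adjusted in CQL; a minimal sketch (the table name is illustrative):
-- Flush this table's memtable at least every 10 minutes (600,000 ms)
ALTER TABLE orders
WITH memtable_flush_period_in_ms = 600000;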
3. List and Describe the Main Components of an SSTable and Their Purposes.
Data File:
o Description: Contains the actual data in a sorted order based on the partition key and
clustering columns (e.g., order data sorted by user_id and order_date).
o Purpose: Stores the persistent representation of memtable data, enabling efficient read
access. For the 1 million orders at 11:07 PM IST, it holds 10 TB of sorted records.
Index File:
o Description: A summary index mapping partition keys to file offsets, allowing quick
location of data ranges (e.g., offset for user_id = 'user123').
o Purpose: Speeds up read operations by reducing seek time, critical for sub-10ms order
lookups across a 5-node cluster.
Bloom Filter File:
o Description: A Bloom filter that predicts whether a partition key exists, minimizing disk
reads (e.g., checks if user_id = 'user999' is present).
o Purpose: Enhances read performance by avoiding unnecessary I/O, especially useful for
the 10,000 concurrent queries during the sale.
Statistics File:
o Description: Stores metadata about the SSTable, such as min/max partition keys,
compression ratios, and row counts (e.g., min key = 'user001', max = 'user500').
o Purpose: Assists in compaction and repair processes, optimizing space and ensuring
data integrity across 10 TB of data.
Summary File:
o Description: A sampled index of partition keys at intervals (e.g., every 100th key),
reducing memory usage compared to the full index.
o Purpose: Lets reads seek close to the right position in the index file without scanning it
fully, keeping lookups fast and memory use low.
TOC File:
o Description: Lists all the component files that make up the SSTable.
o Purpose: Facilitates file management and integrity checks during maintenance, ensuring
all 10 TB of SSTables are correctly structured.
Explanation:
These components work together to balance storage efficiency and query performance. For instance,
during the sale, the Bloom filter and index file enable rapid order retrieval, while the statistics file aids in
optimizing compactions, maintaining cluster health under load.
15-Marks
1. Draw and Explain the Complete Write Path in Cassandra, Including Commit Log, Memtable, and
Flushing to SSTables.
Diagram (described): A box labeled "Client Write Request" sits at the top. From it:
1. An arrow points to "Commit Log" (a disk file, e.g., /var/lib/cassandra/commitlog), where the
write is appended synchronously.
2. Another arrow points to "Memtable" (an in-memory structure), where the write is inserted.
From the Memtable, an arrow labeled "Flush Trigger (Size/Time)" leads to "SSTable" (on-disk
file, e.g., /var/lib/cassandra/data), with a feedback loop to "Clear Commit Log" after flushing.
Explanation of the Write Path:
Step 1: Client Write Request: A write operation (e.g., inserting an order with user_id,
order_date, and total_amount) arrives from a client during the e-commerce sale peak.
Step 2: Commit Log Update: The write is synchronously appended to the commit log on disk,
ensuring durability. For 100,000 writes/second, this guarantees recovery if a node fails,
protecting 1 million orders.
Step 3: Memtable Insertion: The write is added to the memtable, a sorted in-memory structure
(e.g., a skip list), enabling fast lookups and writes with sub-10ms latency. The memtable grows
with each write.
Step 4: Flush Trigger: When the memtable reaches a size threshold (e.g., 128MB) or a time limit
(e.g., 10 minutes), it becomes immutable. A new memtable handles new writes, and the old one
is scheduled for flushing.
Step 5: Flushing to SSTable: The immutable memtable is written to an SSTable on disk in the
data_file_directories (e.g., SSDs), sorted by partition key (e.g., user_id). This persists data for 10
TB of order data.
Step 6: Commit Log Clearance: After successful flushing, the corresponding commit log
segments are deleted, freeing space and preparing for the next cycle.
Example: At 11:13 PM IST, an order for "user123" is logged, inserted into the memtable, and
flushed to an SSTable when it hits 150MB, ensuring durability and performance for 10,000
concurrent users.
Significance:
This write path balances speed (memtable) and reliability (commit log), with flushing ensuring persistent
storage, critical for handling the sale’s high throughput.
2. Illustrate and Discuss How Cassandra Reads Data with Optimizations Like Bloom Filters, Row Cache,
and Compression Maps.
4. "SSTable Data File" with "Compression Map" to decompress and retrieve rows.
An arrow loops back to "Row Cache" for future reads if data is cached.
Step 1: Client Read Request: A query (e.g., SELECT * FROM orders WHERE user_id = 'user123')
arrives at 11:13 PM IST during the sale.
Step 2: Bloom Filter Check: The Bloom filter, an in-memory probabilistic data structure, checks if
the partition key exists in the SSTable. If likely present (low false positive rate), it proceeds; if
not, it skips the SSTable, reducing I/O for 10,000 queries/second.
Step 3: Partition Key Lookup: The index file maps the key to an offset in the SSTable data file,
enabling a seek to the relevant data range, optimizing read latency to sub-10ms.
Step 4: Row Cache Utilization: If the requested row (e.g., recent order for "user123") is in the
row cache (configured via row_cache_size_in_mb), it is returned directly from memory,
bypassing disk I/O for frequent accesses.
Step 5: SSTable Data Retrieval with Compression Map: If not cached, the data file is accessed.
The compression map (stored in the SSTable) provides offsets for decompressing compressed
blocks, retrieving the row efficiently from 10 TB of data.
Step 6: Cache Update: The retrieved row is added to the row cache for future reads, improving
performance for repeated queries.
Example: For "user123"’s orders, the Bloom filter skips irrelevant SSTables, the index locates the
data, and the row cache serves a recent order, while the compression map handles a 1GB
compressed block, ensuring fast response times.
Discussion of Optimizations:
Bloom Filters: Reduce unnecessary disk reads, critical for large datasets (e.g., 10 TB), though
false positives may increase I/O slightly.
Row Cache: Enhances read performance for hot data (e.g., popular products), but requires
tuning to avoid memory pressure (e.g., 512MB limit).
Compression Maps: Minimize storage (e.g., 2:1 ratio) and I/O by decompressing only needed
blocks, though excessive compression can slow reads if CPU-bound.
Trade-Offs: These optimizations prioritize speed but require careful configuration (e.g.,
bloom_filter_fp_chance at 0.01) to balance accuracy and performance during the sale peak.
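These optimizations are configured per table in CQL; a hedged sketch using the figures mentioned above (the table name is illustrative, and row caching also requires a non-zero row_cache_size_in_mb in cassandra.yaml):
ALTER TABLE orders
WITH bloom_filter_fp_chance = 0.01                             -- about 1% false positives
AND caching = {'keys': 'ALL', 'rows_per_partition': '100'}     -- keep hot rows in the row cache
AND compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 64};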
3. Analyze the Role of SSTables in Cassandra. What Makes Them Efficient for Reads and How Are They
Structured?
Role of SSTables:
Data Persistence: Store all committed data (e.g., 10 TB of e-commerce orders at 11:13 PM IST)
durably on disk.
Read Optimization: Enable efficient data retrieval through indexing and sorting.
Compaction Support: Facilitate merging and cleanup of data during compaction, maintaining
performance.
Scalability: Support distributed reads across a 5-node cluster, handling 10,000 queries/second
during the sale.
What Makes Them Efficient for Reads:
Sorted Structure: Data is sorted by partition key (e.g., user_id), allowing binary search-like
access, reducing seek time to sub-10ms.
Index File: Provides quick offset lookups, minimizing disk scans for large datasets (e.g., 1 million
orders).
Bloom Filters: Pre-filter non-existent keys, cutting I/O by up to 90% for irrelevant SSTables.
Compression: Reduces storage (e.g., 2:1 ratio) and I/O, with compression maps enabling block-
level decompression, optimizing read bandwidth.
Row Cache Integration: Caches frequently accessed rows in memory, bypassing disk for hot
data (e.g., recent orders).
Immutability: Once written, SSTables are read-only, avoiding write locks and enabling parallel
reads across nodes.
Structure of SSTables:
Data File (Data.db): Contains the sorted key-value data (e.g., user_id, order_date, total_amount),
persisted from the memtable. It holds the bulk of 10 TB of order data.
Index File (Index.db): Stores a summary of partition keys with offsets, enabling rapid location of
data ranges (e.g., offset for user123’s orders).
Filter File (Filter.db): A Bloom filter predicting key existence, reducing unnecessary reads (e.g.,
checks for user999).
Statistics File (Statistics.db): Metadata like min/max keys and row counts, aiding compaction and
repair (e.g., min key = 'user001').
Summary File (Summary.db): A sampled index of keys at intervals (e.g., every 100th key),
optimizing memory usage for range queries.
Compression Info File (CompressionInfo.db): Tracks compression details, including the compression
map for block offsets.
TOC File (TOC.txt): Lists all component files, ensuring integrity during maintenance.
Example: An SSTable for "user123"’s orders includes a .db file with 100 rows, an .index for
offsets, and a .bloom filter, enabling a 5ms read for the latest order.
Analysis:
SSTables’ efficiency stems from their immutable, sorted nature and supporting files, which minimize I/O
and leverage memory caches. However, their read performance depends on proper indexing and
compression tuning (e.g., chunk_length_in_kb set to 64KB). During the sale, their structure
ensures scalability, but excessive SSTable growth requires compaction to prevent read degradation.