Intro to Cassandra for Developers
Housekeeping
Courses: youtube.com/DataStaxDevs
Runtime: dtsx.io/workshop
Questions: bit.ly/cassandra-workshop
Quiz: menti.com
Channels: YouTube, Twitch, Discord
Homework == Achievement Unlocked! - “Introduction to Cassandra”
Fully managed Cassandra
Without the ops!
DataStax Astra
Global Scale: Put your data where you need it without compromising performance, availability, or accessibility.
No Operations: Eliminate the overhead to install, operate, and scale Cassandra.
25 Gig Free Tier: Launch a database in the cloud with a few clicks, no credit card required.
menti.com
Apache Cassandra™ = NoSQL Distributed Database
1 Installation = 1 NODE
✔ Capacity = ~ 2-4TB
✔ Throughput = LOTS Tx/sec/core
DataCenter | Ring: multiple nodes arranged in a ring
Communication:
✔ Gossiping
Apache Cassandra™ = NoSQL Distributed Database
- Big Data Ready
- Highest Availability
- Geographical Distribution
- Read/Write Performance
- Vendor Independent
Data is Distributed
Country City Population
USA New York 8.000.000
USA Los Angeles 4.000.000
FR Paris 2.230.000
DE Berlin 3.350.000
UK London 9.200.000
AU Sydney 4.900.000
DE Nuremberg 500.000
CA Toronto 6.200.000
CA Montreal 4.200.000
FR Toulouse 1.100.000
JP Tokyo 37.430.000
IN Mumbai 20.200.000
Partition Key: Country
Data is Distributed
Rows grouped by partition key (Country) and spread across the nodes:
Country City Population
USA New York 8.000.000
USA Los Angeles 4.000.000
FR Paris 2.230.000
FR Toulouse 1.100.000
DE Berlin 3.350.000
DE Nuremberg 500.000
UK London 9.200.000
AU Sydney 4.900.000
IN Mumbai 20.200.000
JP Tokyo 37.430.000
CA Toronto 6.200.000
CA Montreal 4.200.000
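A minimal sketch of how a partition key could decide where a row lives. Cassandra's default partitioner is Murmur3 and it works with token ranges; here an MD5 hash and a modulo over a hypothetical six-node list stand in for that, purely for illustration.

```python
import hashlib

# Hypothetical node names, not taken from the slides.
NODES = ["node0", "node1", "node2", "node3", "node4", "node5"]


def node_for(partition_key: str) -> str:
    """Hash the partition key and pick a node (simplified: hash modulo node count)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


rows = [
    ("USA", "New York"),
    ("USA", "Los Angeles"),
    ("FR", "Paris"),
    ("FR", "Toulouse"),
]

# Only the partition key (country) is hashed, so rows that share a
# country always end up on the same node.
placement = {city: node_for(country) for country, city in rows}
```

Whatever node the hash selects, New York and Los Angeles share it because both rows carry the partition key USA.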
Data is Replicated
RF = 3
Replication Factor 3 means that every row is stored on 3 different nodes.
[Ring diagram: nodes at tokens 0, 17, 33, 50, 67, 83]
Replication within the Ring
RF = 3
[Ring diagram in three steps: nodes at tokens 0, 17, 33, 50, 67, 83. A row with token 59 (data) arrives at the ring, is stored on the node that owns its token range, and is copied around the ring until 3 nodes hold 59 (data).]
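The placement in the ring diagram can be sketched as follows. This assumes a simple strategy in which the owner is the first node whose token is greater than or equal to the row token, and the remaining replicas are the next nodes clockwise. The ring size of 100 is an assumption read off the diagram's token values.

```python
from bisect import bisect_left
from typing import List

RING = [0, 17, 33, 50, 67, 83]  # node tokens from the ring diagram
RING_SIZE = 100                 # assumption: diagram tokens run from 0 to 99


def replicas(token: int, rf: int = 3) -> List[int]:
    """Return the tokens of the rf nodes that store a row with this token."""
    # Owner: first node whose token is >= the row token, wrapping past the end.
    start = bisect_left(RING, token % RING_SIZE) % len(RING)
    # Remaining replicas: the next nodes clockwise on the ring.
    return [RING[(start + i) % len(RING)] for i in range(rf)]


print(replicas(59))  # [67, 83, 0]
```

With RF = 3, the row with token 59 lands on the nodes at 67, 83 and 0, matching the three copies of 59 (data) in the diagram.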
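Node Failure
RF = 3
[Ring diagram: one of the 3 replica nodes for 59 (data) is down. The other two replicas store the row, and a Hint is kept for the failed node.]
Node Failure Recovered
RF = 3
[Ring diagram: the failed node is back, and the Hint is used to deliver 59 (data) to it, so all 3 replicas hold the row again.]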
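A toy model of the hint mechanism shown in the two diagrams. The `Coordinator` class, the node names, and the choice of which replica is down are all hypothetical. Real hinted handoff involves timeouts and hint expiry that are left out here.

```python
class Coordinator:
    """Toy coordinator that keeps hints for replicas that are down."""

    def __init__(self, replica_names):
        self.up = {name: True for name in replica_names}
        self.data = {name: {} for name in replica_names}
        self.hints = []  # (target replica, key, value) waiting for delivery

    def write(self, key, value):
        for name in self.data:
            if self.up[name]:
                self.data[name][key] = value
            else:
                # Replica unreachable: remember the write as a hint.
                self.hints.append((name, key, value))

    def recover(self, name):
        """Mark a replica as up again and replay the hints addressed to it."""
        self.up[name] = True
        remaining = []
        for target, key, value in self.hints:
            if target == name:
                self.data[name][key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining


coord = Coordinator(["node67", "node83", "node0"])
coord.up["node83"] = False          # assumption: this replica fails
coord.write(59, "data")
stored_before = 59 in coord.data["node83"]
coord.recover("node83")
stored_after = coord.data["node83"].get(59)
```

Before recovery only two replicas hold the row; after `recover` the hint has been replayed and the hint list is empty.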
Immediate Consistency – A Better Way
Client Write: CL = QUORUM
Client Read: CL = QUORUM
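The arithmetic behind quorum reads and writes can be sketched as follows: a quorum is a majority of the replicas, and a read is guaranteed to see the latest write when the read and write replica sets must overlap. The function names are illustrative only.

```python
def quorum(rf: int) -> int:
    """Majority of the replicas for a given replication factor."""
    return rf // 2 + 1


def is_immediately_consistent(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > rf


RF = 3
q = quorum(RF)                                        # 2 of 3 replicas
print(is_immediately_consistent(RF, q, q))            # True: 2 + 2 > 3
print(is_immediately_consistent(RF, 1, 1))            # False: 1 + 1 <= 3
```

With RF = 3, QUORUM on both sides means 2 replicas acknowledge the write and 2 answer the read, so at least one replica is in both sets.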
Data Distributed Everywhere
• Geographic Distribution
• Hybrid-Cloud and Multi-Cloud
On-premise
Understanding Use Cases
Scalability: High Throughput, High Volume | Heavy Writes, Heavy Reads | Event Streaming, Log Analytics, Internet of Things, Other Time Series
Availability: No Data Loss, Always-on | Mission-Critical | Caching, Pricing, Market Data, Inventory
Distributed: Global Presence, Workload Mobility | Compliance / GDPR | Banking, Retail, Tracking / Logistics, Customer Experience
Cloud-native: Modern Cloud Applications | API Layer, Enterprise Data Layer | Hybrid-cloud, Multi-cloud
https://github.com/DataStax-Academy/Intro-to-Cassandra-for-Developers
Intro to Cassandra for Developers
1. Tables, Partitions
2. The Art of Data Modelling
3. What’s NEXT?
1. Tables, Partitions
Data Structure: a Cell
An intersection of a row
and a column, stores data.
Data Structure: a Row
A single, structured
data item in a table.
Data Structure: a Partition
A group of rows having the same partition token, a base unit of access in Cassandra.
IMPORTANT: stored together, all the rows are guaranteed to be neighbors.
ID   First Name  Last Name  Department
1    John        Doe        Wizardry
399  Marisha     Chapez     Wizardry
415  Maximus     Flavius    Wizardry
Data Structure: a Table
A group of columns and rows storing partitions.
ID  First Name  Last Name  Department
1   John        Doe        Wizardry
2   Mary        Smith      Dark Magic
3   Patrick     McFadin    DevRel
Data Structure: Overall
● Tabular data model, with one twist
● Tables are organized in rows and columns
● Groups of related rows called partitions are stored together on the same node (or nodes)
● Each row contains a partition key
  ○ One or more columns that are hashed to determine which node(s) store that data
[Diagram: a Keyspace contains a Table; the table's columns and rows are grouped into partitions x, y, z by Partition key]
Example Data: Users organized by city
Keyspace: killrvideo
Table: users_by_city
City     Last Name  First Name  Address         Email
Phoenix  Hellson    Kevin       23 Jackson St.  [email protected]
Phoenix  Lastfall   Norda       3 Stone St      [email protected]
Phoenix  Smith      Jana        3 Stone St      [email protected]
Seattle  Franklin   George      2 Star St       [email protected]
Seattle  Jackson    Jane        2 Star St       [email protected]
Seattle  Jasons     Judy        2 Star St       [email protected]
Each city is one partition; each line is one row.
Column roles: Partition key column, Clustering columns, Data columns
Creating a Table in CQL
CREATE TABLE killrvideo.users_by_city (  -- keyspace.table
    city text,                           -- column definitions
    last_name text,
    first_name text,
    address text,
    email text,
    PRIMARY KEY ((city), last_name, first_name, email));
Primary key: ((city), last_name, first_name, email)
Partition key: (city)
Clustering columns: last_name, first_name, email
Primary Key
An identifier for a row. Consists of at least one Partition Key and zero or more Clustering Columns.
MUST ENSURE UNIQUENESS.
MAY DEFINE SORTING.
In killrvideo.users_by_city:
PRIMARY KEY ((city), last_name, first_name, email)
Partition key: (city). Clustering columns: last_name, first_name, email.
Good Examples:
PRIMARY KEY ((city), last_name, first_name, email);
PRIMARY KEY (user_id);
Bad Example:
PRIMARY KEY ((city), last_name, first_name);
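A small sketch of why the bad example is bad. A Cassandra INSERT is an upsert: a second row with the same primary key replaces the first instead of raising an error. The dictionary-based `insert` helper and the e-mail addresses below are hypothetical and only model that behaviour.

```python
def insert(table: dict, key_columns: list, row: dict) -> None:
    """Model an upsert: a row with an existing primary key replaces the old row."""
    key = tuple(row[column] for column in key_columns)
    table[key] = row


# Two different people in the same city with the same name (made-up data).
users = [
    {"city": "Phoenix", "last_name": "Smith", "first_name": "Jana",
     "email": "jana@example.com"},
    {"city": "Phoenix", "last_name": "Smith", "first_name": "Jana",
     "email": "j.smith@example.com"},
]

bad_table, good_table = {}, {}
for user in users:
    insert(bad_table, ["city", "last_name", "first_name"], user)
    insert(good_table, ["city", "last_name", "first_name", "email"], user)

print(len(bad_table), len(good_table))  # 1 2
```

With the shorter key the second Jana Smith silently overwrites the first; adding email to the key keeps both rows.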
Partition Key
An identifier for a partition. Consists of at least one column, may have more if needed.
PARTITIONS ROWS.
In killrvideo.users_by_city, the partition key is (city):
PRIMARY KEY ((city), last_name, first_name, email)
Good Examples:
PRIMARY KEY (user_id);
PRIMARY KEY ((video_id), comment_id);
Bad Example:
PRIMARY KEY ((sensor_id), logged_at);
Clustering Column(s)
Used to ensure uniqueness and sorting order. Optional.
In killrvideo.users_by_city, the clustering columns are last_name, first_name, email:
PRIMARY KEY ((city), last_name, first_name, email)
Not Unique: PRIMARY KEY ((city), last_name, first_name);
Unique: PRIMARY KEY ((city), last_name, first_name, email);
Not Sorted: PRIMARY KEY ((video_id), comment_id);
Sorted: PRIMARY KEY ((video_id), created_at, comment_id);
The Slide of the Year Award!
Rules of a Good Partition
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
Rule: store together what you retrieve together
Example: open a video? Get the comments in a single query!
Good: PRIMARY KEY ((video_id), created_at, comment_id);
Bad: PRIMARY KEY ((comment_id), created_at);
Rules of a Good Partition: avoid big partitions
PRIMARY KEY ((video_id), created_at, comment_id);
PRIMARY KEY ((country), user_id); (all users of a country in one partition: too big)
● Up to 2 billion cells per partition
● Up to ~100k rows in a partition
● Up to ~100MB in a partition
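A rough way to check a design against these limits, as a sketch. The cell formula below is a commonly taught estimate (rows times non-key, non-static columns, plus static columns) and is an assumption here, not something stated on the slide. The column counts in the usage lines are hypothetical.

```python
MAX_CELLS = 2_000_000_000  # hard limit quoted on the slide
MAX_ROWS = 100_000         # rule of thumb quoted on the slide


def partition_cells(rows: int, columns: int, primary_key_columns: int,
                    static_columns: int = 0) -> int:
    """Estimate the number of cells stored in one partition."""
    regular_columns = columns - primary_key_columns - static_columns
    return rows * regular_columns + static_columns


def within_guidelines(rows: int, columns: int, primary_key_columns: int) -> bool:
    """Check row count and estimated cell count against the slide's limits."""
    cells = partition_cells(rows, columns, primary_key_columns)
    return rows <= MAX_ROWS and cells <= MAX_CELLS


# Hypothetical table: 5 columns, 3 of them in the primary key.
print(partition_cells(100_000, 5, 3))      # 200000
print(within_guidelines(50_000, 5, 3))     # True
print(within_guidelines(5_000_000, 5, 3))  # False: too many rows
```

The size in megabytes depends on the actual column values, so it is not estimated here.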
Rules of a Good Partition: avoid big and constantly growing partitions!
Example: a huge IoT infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds. Every sensor reports its UUID, timestamp of the report, sensor’s value.
● Sensor ID: UUID
● Timestamp: Timestamp
● Value: float
PRIMARY KEY ((sensor_id), reported_at);
Rules of a Good Partition: BUCKETING
Avoid big and constantly growing partitions by adding a time bucket to the partition key.
Example: a huge IoT infrastructure, hardware all over the world, different sensors reporting their state every 10 seconds. Every sensor reports its UUID, timestamp of the report, sensor’s value.
● Sensor ID: UUID
● MonthYear: Integer or String
● Timestamp: Timestamp
● Value: float
Before: PRIMARY KEY ((sensor_id), reported_at);
After: PRIMARY KEY ((sensor_id, month_year), reported_at);
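A sketch of the bucketing idea with the numbers from the IoT example. The `"%Y-%m"` string format for the bucket and the helper names are assumptions; the slide only says the bucket is an Integer or String.

```python
from datetime import datetime, timezone

REPORT_INTERVAL_SECONDS = 10  # every sensor reports every 10 seconds


def month_year(reported_at: datetime) -> str:
    """Bucket value for a report, e.g. '2021-03' (format is an assumption)."""
    return reported_at.strftime("%Y-%m")


def partition_key(sensor_id: str, reported_at: datetime) -> tuple:
    """Composite partition key (sensor_id, month_year)."""
    return (sensor_id, month_year(reported_at))


rows_per_day = 24 * 60 * 60 // REPORT_INTERVAL_SECONDS  # 8640
rows_per_31_day_month = rows_per_day * 31               # 267840
rows_per_year = rows_per_day * 365                      # 3153600

march = partition_key("sensor-1", datetime(2021, 3, 5, tzinfo=timezone.utc))
april = partition_key("sensor-1", datetime(2021, 4, 5, tzinfo=timezone.utc))
print(march, april)  # ('sensor-1', '2021-03') ('sensor-1', '2021-04')
```

Without a bucket, one sensor's partition gains about 3.15 million rows a year and never stops growing. With a monthly bucket each partition is capped at roughly 268k rows, which is bounded but still above the ~100k rows rule of thumb, so a finer bucket such as a day (8,640 rows) may be worth considering for this reporting rate.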
Rules of a Good Partition: avoid hot partitions
PRIMARY KEY (user_id);
PRIMARY KEY ((video_id), created_at, comment_id);
PRIMARY KEY ((country), user_id);
https://github.com/DataStax-Academy/Intro-to-Cassandra-for-Developers#2-create-a-table
2. The Art of Data Modelling
Normalization
“Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as part of his relational model.”
Employees
userId  deptId  firstName  lastName
1       1       Edgar      Codd
2       1       Raymond    Boyce
Departments
departmentId  department
1             Engineering
2             Math
PROS: Simple write, Data Integrity
CONS: Slow read, Complex Queries
Denormalization
“Denormalization is a strategy used on a database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data”
Employees
userId  firstName  lastName  department
1       Edgar      Codd      Engineering
2       Raymond    Boyce     Engineering
3       Sage       Lahja     Math
4       Juniper    Jones     Botany
PROS: Quick Read, Simple Queries
CONS: Multiple Writes, Manual Integrity
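The trade-off between the two layouts can be sketched with plain dictionaries holding the rows from the two slides. The function names are illustrative and no database is involved.

```python
# Normalized: employees reference departments by id.
employees_normalized = {
    1: {"deptId": 1, "firstName": "Edgar", "lastName": "Codd"},
    2: {"deptId": 1, "firstName": "Raymond", "lastName": "Boyce"},
}
departments = {1: "Engineering", 2: "Math"}

# Denormalized: the department name is copied into every employee row.
employees_denormalized = {
    1: {"firstName": "Edgar", "lastName": "Codd", "department": "Engineering"},
    2: {"firstName": "Raymond", "lastName": "Boyce", "department": "Engineering"},
    3: {"firstName": "Sage", "lastName": "Lahja", "department": "Math"},
    4: {"firstName": "Juniper", "lastName": "Jones", "department": "Botany"},
}


def department_of_normalized(user_id: int) -> str:
    """Two lookups: the employee row, then the department row (the join)."""
    employee = employees_normalized[user_id]
    return departments[employee["deptId"]]


def department_of_denormalized(user_id: int) -> str:
    """One lookup: the answer is already stored in the row."""
    return employees_denormalized[user_id]["department"]


def rename_department(old: str, new: str) -> int:
    """Denormalized cost: every copy of the name must be rewritten."""
    updated = 0
    for employee in employees_denormalized.values():
        if employee["department"] == old:
            employee["department"] = new
            updated += 1
    return updated
```

Reading is a single lookup in the denormalized layout, while renaming Engineering touches two rows instead of the single Departments row in the normalized layout.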
Relational Data Modelling
Data → Models → Application
1. Analyze raw data
2. Identify entities, their properties and relations
3. Design tables, using normalization and foreign keys.
4. Use JOIN when doing queries to join normalized data from multiple tables
NoSQL Data Modelling
Application → Models → Data
1. Analyze user behaviour (customer first!)
2. Identify workflows, their dependencies and needs
3. Define Queries to fulfill these workflows
4. Knowing the queries, design tables, using denormalization.
5. Use BATCH when inserting or updating denormalized data of multiple tables
Designing Process: Step by Step
Entities & Relationships
Queries
Designing Process:
Conceptual Data Model
Designing Process:
Application Workflow
Use-Case I:
● A User opens a Profile
WF2: Find comments related to target user using its identifier, get most recent first
Use-Case II:
● A User opens a Video Page
WF1: Find comments related to target video using its identifier, most recent first
Designing Process:
Mapping
Query I: Find comments posted for a user with a known id (show most recent first) → comments_by_user
Query II: Find comments for a video with a known id (show most recent first) → comments_by_video
Designing Process:
Mapping
comments_by_user:
SELECT * FROM comments_by_user
WHERE userid = <some UUID>
comments_by_video:
SELECT * FROM comments_by_video
WHERE videoid = <some UUID>
Designing Process:
Logical Data Model
comments_by_user        comments_by_video
userid        K         videoid       K
creationdate  C↑        creationdate  C↑
commentid     C↑        commentid     C↑
videoid                 userid
comment                 comment
Designing Process:
Physical Data Model
comments_by_user              comments_by_video
userid     UUID      K        videoid    UUID      K
commentid  TIMEUUID  C↑       commentid  TIMEUUID  C↑
videoid    UUID               userid     UUID
comment    TEXT               comment    TEXT
Designing Process:
Schema DDL
CREATE TABLE IF NOT EXISTS comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY ((userid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
CREATE TABLE IF NOT EXISTS comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY ((videoid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
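A sketch of what these two tables imply for the application: each new comment is written twice, once per table, and each query reads a single partition, newest first. Python's `uuid.uuid1()` stands in for the CQL timeuuid, and the in-memory dictionaries and helper names are hypothetical. In CQL, the two inserts would typically be grouped in a BATCH.

```python
import uuid

comments_by_user = {}   # userid  -> list of (commentid, videoid, comment)
comments_by_video = {}  # videoid -> list of (commentid, userid, comment)


def add_comment(userid: uuid.UUID, videoid: uuid.UUID, comment: str) -> uuid.UUID:
    """Write the same comment into both denormalized tables."""
    commentid = uuid.uuid1()  # time-based UUID, standing in for timeuuid
    comments_by_user.setdefault(userid, []).append((commentid, videoid, comment))
    comments_by_video.setdefault(videoid, []).append((commentid, userid, comment))
    return commentid


def latest_first(partition: list) -> list:
    """Order rows by the timestamp embedded in the time-based UUID, newest first."""
    return sorted(partition, key=lambda row: row[0].time, reverse=True)


alice, bob, video = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
add_comment(alice, video, "First!")
add_comment(bob, video, "Nice video")
add_comment(alice, video, "Watching again")

video_page = latest_first(comments_by_video[video])    # one partition, 3 rows
alice_profile = latest_first(comments_by_user[alice])  # one partition, 2 rows
```

Each page of the application is served from one partition of one table, which is the point of designing the tables from the queries.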
https://github.com/DataStax-Academy/Intro-to-Cassandra-for-Developers#3-execute-crud-operations
menti.com
3. What’s NEXT?
Homework
MORE LEARNING!!!!
Developer site: datastax.com/dev
● Developer Stories
● New hands-on learning scenarios with Katacoda
● Try it Out
● Cassandra Fundamentals: https://www.datastax.com/learn/cassandra-fundamentals
● New Data Modeling course: https://www.datastax.com/dev/modeling
Classic courses available at DataStax Academy
✔ Academy.datastax.com
✔ datastax.com/dev
✔ community.datastax.com
✔ DataStax Developers YouTube Channel
Weekly Workshops: https://www.datastax.com/workshops
Join our 10k Discord Community: https://bit.ly/cassandra-workshop
The Fellowship of the RINGS
Thank you!