Wide-Column Stores

Big Data Management

Phil Bartie
[email protected]
EM G.29

Using material from

Alasdair Gray, HWU
Aidan Hogan, Universidad de Chile
Guillaume Marquis
https://www.tutorialspoint.com/cassandra/
https://pandaforme.gitbooks.io/introduction-to-cassandra/

Database Landscape

 RDF: Virtuoso, Jena, Stardog, RDF4J, GraphDB, Blazegraph
 Object: Caché, Db4o, Versant, ObjectStore
 XML: MarkLogic, Sedna, Tamino, BaseX, eXist-db
 Relational: Oracle, MySQL, MS SQL Server, PostgreSQL, DB2, SQLite, MS Access, Teradata, SAP Adaptive Server, Hive, FileMaker, MariaDB, Informix, Vertica
 NoSQL / Key-Value: Redis, Memcached, Riak KV, Aerospike, SimpleDB
 NoSQL / Document: MongoDB, DynamoDB, CouchBase, Elasticsearch
 NoSQL / Wide-Column: Cassandra, HBase, Accumulo, HyperTable
 NoSQL / Graph: Neo4J, Titan, Giraph, InfiniteGraph
 NewSQL: SAP HANA, Google Spanner, Clustrix, VoltDB, MemSQL, NuoDB

Relational Databases Recap

 Two-dimensional tables
 Relationships between tables
 Fixed schema
 Homogeneous

 Highly structured
 NULLs – arrghh!
Source : http://excel.quebec/attachments/Image/excel-quebec-requete-sql-excel-1.jpg

Key-value and Tabular

Key–Value = a Distributed Map
Countries
Primary Key Value
Afghanistan capital:Kabul,continent:Asia,pop:31108077#2011
Albania capital:Tirana,continent:Europe,pop:3011405#2013
… …

Tabular = Two-dimensional Maps

Countries
Primary Key capital continent pop-value pop-year
Afghanistan Kabul Asia 31108077 2011
Albania Tirana Europe 3011405 2013
… … … … …

Wide-Column Stores
a sparse, distributed, persistent, multi-dimensional, sorted map

 Sparse – not a value for every column

(i.e. not dense square)

 Distributed – each node has the same role

– no single point of failure

 Masterless – each node can service any request

 New nodes can be added without downtime

 Keyspace: container for column families

 Column Family: container for rows

 Rows: ordered columns

https://www.tutorialspoint.com/cassandra/cassandra_data_model.htm
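
The nesting described above (keyspace, column family, row, ordered columns) can be pictured as nested maps. The following is only a minimal in-memory sketch to make the "sparse, sorted map" idea concrete; the class `Row`, the helper `put` and the variable `keyspace` are hypothetical names for illustration and are not part of Cassandra.

```python
from bisect import insort
import time


class Row:
    """One row: a map of column name -> (value, timestamp), kept sorted by name."""

    def __init__(self):
        self._names = []   # column names in sorted order
        self._cells = {}   # column name -> (value, timestamp)

    def put(self, name, value, timestamp=None):
        if name not in self._cells:
            insort(self._names, name)  # keep the column names sorted
        stamp = timestamp if timestamp is not None else time.time()
        self._cells[name] = (value, stamp)

    def get(self, name):
        cell = self._cells.get(name)
        return cell[0] if cell else None  # a missing column is simply absent

    def columns(self):
        return [(name, self._cells[name][0]) for name in self._names]


# keyspace -> column family -> row key -> Row (illustrative structure only)
keyspace = {"Countries": {}}


def put(column_family, row_key, name, value):
    keyspace[column_family].setdefault(row_key, Row()).put(name, value)


put("Countries", "Afghanistan", "continent", "Asia")
put("Countries", "Afghanistan", "capital", "Kabul")
put("Countries", "Albania", "pop", 3011405)
put("Countries", "Albania", "capital", "Tirana")

for row_key, row in keyspace["Countries"].items():
    print(row_key, row.columns())
```

Each row stores only the columns it actually has, which is what "sparse" means here: the Albania row has no continent column and no space is reserved for one.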


Column Family (Table)


Row
Row: smallest unit that stores related data
 Data partition mechanism

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Keys
Composite Row Key

Composite Column Key

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Column Family View: Single-row partitions

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Column Family: Multi-row partitions

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Wide-Column Advantages
 Highly scalable: designed for distributing across:
 Cluster
 Data centres

 Data manipulation: includes limited query language

 Data stored in sorted order

 Wide-columns: increased granularity of operation

 Not affected by increasing number of rows


Cassandra
Wide-Column Store

Meta (Facebook) Stats (2022)

 Messenger
 2.91 billion users

 Search requires an inverted index
 Maps each search term to message ids
 Continuous data arrival
 Instantaneous responses

Cassandra was developed as a solution

https://www.statista.com/topics/4625/facebook-messenger/#dossierKeyfigures
https://www.messenger.com/

Cassandra History
Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009 it became an Apache Incubator project. On February 17, 2010 it graduated to a top-level project.

Facebook developers named their database after the Trojan mythological prophet Cassandra, with classical allusions to a curse on an oracle.


 Free and open-source
 Distributed
 Wide Column Store
 NoSQL database
 Masterless replication
 Each node has same role
 Low latency

 Can add more hardware nodes with no downtime
 Should always be able to read/write to Cassandra
 Consistency can be adjusted – at expense of availability
 Secondary index support is weak (single columns only; equality comparisons only)

https://en.wikipedia.org/wiki/Apache_Cassandra
http://cassandra.apache.org/
CONSISTENT HASHING

https://www.scnsoft.com/blog/cassandra-performance
 Commit log: append-only log = very fast
 MemTable stored in memory
 Acknowledge to client
 Flush MemTable to SSTable (Sorted Strings Table)

https://www.scnsoft.com/blog/cassandra-performance
See intro video: https://youtu.be/B_HTdrTgGNs?t=947
SSTable – Sequential, Immutable

Every so often Cassandra carries out a COMPACTION

Does big sequential READ, MERGE, WRITE

Check video: https://youtu.be/B_HTdrTgGNs?t=1143
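
The write steps and compaction described above can be sketched as a toy, single-node store. This is an illustrative simplification under stated assumptions, not Cassandra's implementation: the class `TinyStore` and its `memtable_limit` parameter are hypothetical, the flush runs inline rather than in the background, and the commit log is simply cleared once its data has been flushed.

```python
class TinyStore:
    """Toy write path: commit log -> memtable -> immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list of (key, value)
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # immutable sorted tuples, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append (fast)
        self.memtable[key] = value            # 2. update the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # simplification: flushed inline
        return "ack"                          # 3. acknowledge to the client

    def flush(self):
        # 4. write the memtable out as a sorted, immutable table
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need the log

    def compact(self):
        # Sequential read of every table, merge, then one sequential write.
        merged = {}
        for table in self.sstables:  # oldest first, so newer values win
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None


store = TinyStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 10), ("c", 3)]:
    store.write(key, value)
print(len(store.sstables), "tables before compaction")
store.compact()
print(store.sstables)
```

Because the flushed tables are never modified in place, an updated key can appear in several tables until compaction merges them and keeps only the newest value.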
Cassandra Write Path

Fully distributed with no single point of failure (masterless)

QUORUM Consistency:
⌊n/2⌋ + 1 (n/2 rounded down, plus 1)
where n = replication factor

https://www.scnsoft.com/blog/cassandra-performance

Distributed, Replicated and Fault Tolerant

 Consistent Hashing
 Hashed to ring
 Order preserving hash function

 Gossip style membership algorithm

 Data replication
 Eventual Consistency
 Merkle Tree
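
The "hashed to ring" idea can be sketched as follows. This is a minimal illustration, not Cassandra's partitioner: it uses MD5 purely as a convenient deterministic hash (so it is not the order-preserving hash function listed above), and the names `Ring`, `ring_position`, `replicas` and the node labels are hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node sits at the ring position given by hashing its name.
        self._ring = sorted((ring_position(node), node) for node in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def replicas(self, key, replication_factor=1):
        """Walk clockwise from the key's position and take the next nodes."""
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.replicas("Afghanistan", replication_factor=2))
```

The useful property is that adding a node only takes over keys from its neighbour on the ring; every other key keeps its previous owner, so new nodes can join without reshuffling all the data.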


Where is Cassandra?
 CA: Guarantees to give a correct response but only while network works fine (Centralised / Traditional)
 CP: Guarantees responses are correct even if there are network failures, but response may fail (Weak availability)
 AP: Always provides a “best-effort” response even in presence of network failures (Eventual consistency)

Cassandra: AP. But (like Dynamo), tables are tunable towards CP

Tuneable Consistency
 Write = Commit Log + Memtable
 Quorum = Majority of replicas: ⌊R/2⌋+1 for R the replication factor
 Hinted handoff: the coordinator keeps a TODO log of writes missed by an unavailable replica, for up to 3 hours by default (hints are not readable)

Levels, from highest availability (top) to highest consistency (bottom):

Level     Explanation
ANY       One replica node or a hinted handoff
ONE       One replica node (hinted handoff not enough)
TWO       Two replica nodes
THREE     Three replica nodes
QUORUM    A quorum of replica nodes
ALL       All replica nodes

Tuneable Consistency
For write operations, ANY is the lowest consistency (but highest availability), and ALL is the highest consistency (but lowest availability).

For read operations, ONE is the lowest consistency (but highest availability), and ALL is the highest consistency (but lowest availability).

QUORUM is a good middle-ground ensuring strong consistency (when used for both reads and writes), yet still tolerating some level of failure.

The size of the quorum is calculated as (replication_factor / 2) + 1, rounded down.

Replication factor is the total number of replicas of each row across the cluster.

https://blog.imaginea.com/consistency-tuning-in-cassandra
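
As a worked example of the quorum formula, the sketch below computes the quorum size for a few replication factors and checks the usual overlap condition (replicas written plus replicas read must exceed the replication factor for a read to be sure of meeting the latest write). The function names `quorum` and `read_overlaps_write` are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_overlaps_write(write_replicas: int, read_replicas: int,
                        replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, replicas that may be down={rf - q}")

rf = 3
# QUORUM writes + QUORUM reads: 2 + 2 > 3, so reads see the latest write.
print(read_overlaps_write(quorum(rf), quorum(rf), rf))
# ONE write + ONE read: 1 + 1 is not > 3, so a read may hit a stale replica.
print(read_overlaps_write(1, 1, rf))
```

With a replication factor of 3 the quorum is 2, so one replica can be down while QUORUM reads and writes still succeed.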

Cassandra Query Language (CQL)

SQL-like declarative query language


CQL: Create Keyspace (Database)

CQL

 Create Keyspace
    CREATE KEYSPACE MyKeySpace
    WITH REPLICATION = {
      'class' : 'SimpleStrategy',
      'replication_factor' : 3 };
 Load in keyspace
    USE MyKeySpace;

MySQL (equivalent)

 Create Database
    CREATE DATABASE MyKeySpace;
 Load in Database
    USE MyKeySpace;

CQL: Create Column Family(Table)

CQL

 Create Column Family
    CREATE COLUMNFAMILY MyColumns (
      id varint,
      lastname varchar,
      firstname varchar,
      PRIMARY KEY (id));
 Load data
    INSERT INTO MyColumns (id, lastname, firstname)
    VALUES (1, 'Doe', 'John');
    (inserts will always overwrite)

MySQL (equivalent)

 Create Table
    CREATE TABLE MyColumns (
      id int NOT NULL,
      lastname varchar(50),
      firstname varchar(100),
      PRIMARY KEY (id));
 Load data
    INSERT INTO MyColumns (id, lastname, firstname)
    VALUES (1, 'Doe', 'John');

CQL: Retrieve data

CQL

 Retrieve all rows
    SELECT * FROM MyColumns;

MySQL (equivalent)

 Retrieve all rows
    SELECT * FROM MyColumns;

CQL: Retrieve data

CQL

 Retrieve row 1
    SELECT * FROM MyColumns
    WHERE id = 1;

MySQL (equivalent)

 Retrieve id 1
    SELECT * FROM MyColumns
    WHERE id = 1;

CQL: Retrieve data

CQL

 Retrieve all Johns
    SELECT * FROM MyColumns
    WHERE firstname = 'John';

  Because firstname is not part of the primary key and has no index, CQL rejects the query:
    Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.

  Creating a secondary index on the column allows the query to run:
    CREATE INDEX ON MyColumns (firstname);

MySQL (equivalent)

 Retrieve all Johns
    SELECT * FROM MyColumns
    WHERE firstname = 'John';

How is this different from RDBMS?

In a static-column storage engine, each row must reserve space for every column

    ALTER TABLE users ADD birth_date INT;

In Cassandra, new columns can be added on the fly while running and processing queries

https://www.datastax.com/dev/blog/schema-in-cassandra-1-1

Using Columns

 RDBMS approach

siteid  date        mean_temp
1       2012-09-01  20.6
1       2012-09-02  21.9
1       2012-09-03  21.7

 CASSANDRA approach

siteid  2012-09-01  2012-09-02  2012-09-03
1       20.6        21.9        21.7
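
The two layouts can be compared with a short sketch that pivots one-reading-per-row data into one wide row per site, with the date acting as the column name. The readings are the values shown for the Cassandra approach; the helpers `to_wide_rows` and `slice_columns` are hypothetical and only illustrate the idea.

```python
# One reading per row, as in the RDBMS approach: (siteid, date, mean_temp)
readings = [
    (1, "2012-09-01", 20.6),
    (1, "2012-09-02", 21.9),
    (1, "2012-09-03", 21.7),
]


def to_wide_rows(rows):
    """Group readings by siteid; each date becomes a column name, kept sorted."""
    wide = {}
    for site_id, date, mean_temp in rows:
        wide.setdefault(site_id, {})[date] = mean_temp
    return {site: dict(sorted(cols.items())) for site, cols in wide.items()}


def slice_columns(row, start, end):
    """Return the columns whose names fall in [start, end]."""
    return {name: value for name, value in row.items() if start <= name <= end}


wide_rows = to_wide_rows(readings)
print(wide_rows[1])
print(slice_columns(wide_rows[1], "2012-09-02", "2012-09-03"))
```

Because the columns of a row are stored in sorted order, a date-range query for one site becomes a contiguous slice of a single row instead of a scan over many rows.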


CQL: Consistency

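    SELECT totalsales
    FROM sales
    USING CONSISTENCY QUORUM
    WHERE customerid=5;

    UPDATE sales
    USING CONSISTENCY ONE
    SET totalsales=50000
    WHERE customerid=4;

Note: USING CONSISTENCY is CQL 2 syntax. In CQL 3 the consistency level is set per request by the client driver, or in cqlsh with the CONSISTENCY command (e.g. CONSISTENCY QUORUM;).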

Limitations of CQL

 No join or subquery support, and limited support for aggregation.

- This is by design, to force you to denormalize into partitions that can be efficiently queried from a single replica,
instead of having to gather data from across the entire cluster.

 A single column value may not be larger than 2GB

- in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob
values.

 The maximum number of cells (rows x columns) in a single partition is 2 billion.

https://wiki.apache.org/cassandra/CassandraLimitations

Using Bloom Filters for Fast Data Retrieval

 Each SSTable (Sorted Strings Table) has an associated Bloom Filter
 Bloom Filter stored in memory
 Highly efficient
 Can produce false positives

Bloom Filters

 Efficient test for data location

 Hash object on insert using k hash functions
 Set bit to 1
 Hash object on read using k hash functions
 Any 0s then not present
 Bit would have been set to 1 on insert
 They can give FALSE POSITIVES, but not FALSE NEGATIVES – so a good way to check if data has been processed before

Video on Bloom Filters: https://youtu.be/bEmBh1HtYrw

Bloom Filters: Insert A

 Hash object on insert using k hash functions
 Set bit to 1

e.g. input word = ‘aardvark’

Output from hash function 1 = 3
Output from hash function 2 = 1
Output from hash function 3 = 14

http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters

Bloom Filters: Insert B

 Hash object on insert using k hash functions
 Set bit to 1

e.g. input word = ‘bat’

Output from hash function 1 = 16
Output from hash function 2 = 1
Output from hash function 3 = 7

http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters

Bloom Filters: Read Y

 Hash object on read using k hash functions
 Any 0s then not present (as bit would have been set to 1 on insert)

e.g. input word = ‘elephant’

Output from Hash 1 = 16
Output from Hash 2 = 2
Output from Hash 3 = 7

Bit 2 is still 0, so ‘elephant’ is definitely not present.

http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters

Bloom Filters: Read X

 Hash object on read using k hash functions
 All 1s then data may be present
 Bit would have been set to 1 on insert

e.g. input word = ‘bat’
Hash results [16,1,7]

e.g. input word = ‘snake’
Hash results [1,14,16] → FALSE POSITIVE (bits 1 and 14 were set by ‘aardvark’, bit 16 by ‘bat’)

http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters
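
The insert and read steps from these slides can be sketched in a few lines. This is an illustrative sketch only: the class `BloomFilter` simulates k hash functions by salting SHA-256 (an assumption, not how Cassandra hashes), and the second part simply replays the hash outputs quoted on the slides on a bit array whose length of 20 is assumed, since the slides do not state it.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=64, k=3):
        self.size = size        # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * size

    def _positions(self, item):
        # Simulate k hash functions by prefixing the item with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any 0 means definitely absent; all 1s means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


# Replay of the slide example, using the hash outputs given on the slides.
SLIDE_HASHES = {
    "aardvark": [3, 1, 14],
    "bat": [16, 1, 7],
    "elephant": [16, 2, 7],
    "snake": [1, 14, 16],
}
slide_bits = [0] * 20  # assumed length; the slides do not give one
for word in ("aardvark", "bat"):       # the two inserted words
    for pos in SLIDE_HASHES[word]:
        slide_bits[pos] = 1


def slide_lookup(word):
    return all(slide_bits[pos] for pos in SLIDE_HASHES[word])


for word in ("bat", "elephant", "snake"):
    print(word, slide_lookup(word))
```

In the replay, ‘elephant’ is rejected because bit 2 was never set, while ‘snake’ is reported as possibly present even though it was never inserted, which is the false positive from the last slide.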
DEMO of a BLOOM Filter

https://llimllib.github.io/bloomfilter-tutorial/

Cassandra vs MongoDB
Problem domain needs a rich data model = MongoDB
Need secondary indexes and flexibility in the query model = MongoDB
(by contrast Cassandra secondary indexes only support single columns and equality comparisons)

100% uptime = Cassandra

Write scalability = Cassandra
Query language support = CQL is similar to SQL

The Apache Cassandra database is a good choice when you need scalability and high availability without compromising performance, and with no single point of failure.

https://scalegrid.io/blog/cassandra-vs-mongodb/

Summary: Wide-column stores

 Conceptually a big table
 Sparse: not all cells have values
 Distributed and persistent
 Multi-dimensional: multiple values
 Sorted Map
 Model:
 Keyspace: container for column families
 Column Family: container for rows
 Rows: Set of ordered columns
 Column: (name, value, timestamp)
 Cassandra:
 Distributed, replicated, and fault tolerant
 SQL-like query language
 Bloom filters: efficient data presence testing

Reading
Cassandra Vs MongoDB In 2018 by Matan Sarig

https://blog.panoply.io/cassandra-vs-mongodb

Cassandra Scales well (linear with more nodes)

Apache Cassandra introduction video

https://www.youtube.com/watch?v=B_HTdrTgGNs
