Wide-Column Stores: Big Data Management Phil Bartie

Wide-Column Stores

Big Data Management


Phil Bartie
EM G.29

Using material from


Alasdair Gray, HWU
Aidan Hogan, Universidad de Chile
Guillaume Marquis
https://www.tutorialspoint.com/cassandra/
https://pandaforme.gitbooks.io/introduction-to-cassandra/

Database Landscape

• RDF: Virtuoso, Jena, Stardog, RDF4J, GraphDB, Blazegraph
• Object: Caché, Db4o, Versant, ObjectStore
• XML: MarkLogic, Sedna, Tamino, BaseX, eXist-db
• Relational: Oracle, MySQL, MS SQL Server, PostgreSQL, DB2, SQLite, MS Access, Teradata, SAP Adaptive Server, Hive, FileMaker, MariaDB, Informix, Vertica
• NoSQL
  - Key-Value: Redis, Memcached, Riak KV, Aerospike, SimpleDB
  - Document: MongoDB, DynamoDB, CouchBase, Elasticsearch
  - Wide-Column: Cassandra, HBase, Accumulo, HyperTable
  - Graph: Neo4J, Titan, Giraph, InfiniteGraph
• NewSQL: SAP HANA, Google Spanner, Clustrix, VoltDB, MemSQL, NuoDB

Relational Databases Recap

• Two-dimensional tables
• Relationships between tables
• Fixed schema
• Homogeneous
• Highly structured
• NULLs – arrghh!
Source : http://excel.quebec/attachments/Image/excel-quebec-requete-sql-excel-1.jpg

Key-value and Tabular

Key–Value = a Distributed Map

Countries
Primary Key    Value
Afghanistan    capital:Kabul,continent:Asia,pop:31108077#2011
Albania        capital:Tirana,continent:Europe,pop:3011405#2013
…              …

Tabular = Two-dimensional Maps

Countries
Primary Key    capital    continent    pop-value    pop-year
Afghanistan    Kabul      Asia         31108077     2011
Albania        Tirana     Europe       3011405      2013
…              …          …            …            …
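
The difference between the two Countries layouts above can be sketched in a few lines of Python. This is only an illustration using in-memory dictionaries; the function names `capital_from_kv` and `capital_from_table` are made up for this example and do not belong to any database API.

```python
# Key-value layout: one opaque value per key; the store cannot see inside it.
kv_countries = {
    "Afghanistan": "capital:Kabul,continent:Asia,pop:31108077#2011",
    "Albania": "capital:Tirana,continent:Europe,pop:3011405#2013",
}

# Tabular layout: a map of maps, addressed by (row key, column name).
tabular_countries = {
    "Afghanistan": {"capital": "Kabul", "continent": "Asia",
                    "pop-value": 31108077, "pop-year": 2011},
    "Albania": {"capital": "Tirana", "continent": "Europe",
                "pop-value": 3011405, "pop-year": 2013},
}


def capital_from_kv(store, country):
    """The client fetches the whole value and has to parse it itself."""
    fields = dict(item.split(":", 1) for item in store[country].split(","))
    return fields["capital"]


def capital_from_table(store, country):
    """The store can address a single cell directly."""
    return store[country]["capital"]
```

In the key-value case the application owns the value format; in the tabular case the column name is part of the address.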

Wide-Column Stores
A sparse, distributed, persistent, multi-dimensional, sorted map

• Sparse – not a value for every column (i.e. not a dense square)
• Distributed – each node has the same role, so there is no single point of failure
• Masterless – each node can service any request
• New nodes can be added without downtime
• Keyspace: container for column families
• Column Family: container for rows
• Rows: ordered columns

https://www.tutorialspoint.com/cassandra/cassandra_data_model.htm
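
The keyspace / column family / row nesting listed above can be pictured as maps inside maps. The following is a minimal single-process sketch, not how Cassandra stores data on disk; the class `ColumnFamily` and the sample users are hypothetical names chosen for the illustration.

```python
class ColumnFamily:
    """Toy column family: row key -> {column name: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        # Sparse: only the cells that are written take up any space.
        self.rows.setdefault(row_key, {})[column] = value

    def get_row(self, row_key):
        # Columns come back sorted by name ("rows: ordered columns").
        return sorted(self.rows.get(row_key, {}).items())


# A keyspace is a container for column families.
keyspace = {"users": ColumnFamily()}
users = keyspace["users"]
users.put("alice", "email", "alice@example.org")
users.put("alice", "city", "Edinburgh")
users.put("bob", "email", "bob@example.org")   # bob has no "city" cell at all
```

Rows in the same column family do not need the same set of columns, which is what makes the map sparse.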



Column Family (Table)


Row
Row: smallest unit that stores related data
• Data partition mechanism

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Keys
Composite Row Key

Composite Column Key

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Column Family View: Single-row partitions

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Column Family: Multi-row partitions

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/understand_the_cassandra_data_model.html

Wide-Column Advantages
• Highly scalable: designed for distribution across:
  - Clusters
  - Data centres
• Data manipulation: includes a limited query language
• Data stored in sorted order
• Wide columns: increased granularity of operation
• Not affected by an increasing number of rows


Cassandra
Wide-Column Store

Meta (Facebook) Stats (2022)

• Messenger
  - 2.91 billion users
• Search requires an inverted index
  - Search term to message id
• Continuous data arrival
• Instantaneous responses

Cassandra was developed as a solution
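
The "search term to message id" mapping mentioned above is an inverted index. A minimal in-memory sketch follows; the sample messages and the function name `build_inverted_index` are invented for the illustration and say nothing about how Facebook implemented inbox search.

```python
from collections import defaultdict


def build_inverted_index(messages):
    """Map each search term to the set of message ids that contain it."""
    index = defaultdict(set)
    for message_id, text in messages.items():
        for term in text.lower().split():
            index[term].add(message_id)
    return index


# Hypothetical messages keyed by message id.
messages = {1: "See you at lunch", 2: "Lunch was great", 3: "See the report"}
index = build_inverted_index(messages)
```

A search for a term is then a single lookup by key, which is why the workload suits a store built around fast key-based reads and continuous writes.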

https://www.statista.com/topics/4625/facebook-messenger/#dossierKeyfigures
https://www.messenger.com/

Cassandra History

Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009 it became an Apache Incubator project. On February 17, 2010 it graduated to a top-level project.

Facebook developers named their database after the Trojan mythological prophet Cassandra, with classical allusions to a curse on an oracle.


• Free and open-source
• Distributed
• Wide Column Store
• NoSQL database
• Masterless replication
• Each node has same role
• Low latency

• Can add more hardware nodes with no downtime
• Should always be able to read/write to Cassandra
• Consistency can be adjusted – at expense of availability
- Secondary index support is weak (single columns only; equality comparisons only)

https://en.wikipedia.org/wiki/Apache_Cassandra
http://cassandra.apache.org/
CONSISTENT HASHING

https://www.scnsoft.com/blog/cassandra-performance
• Commit log: append-only log = very fast
• MemTable stored in memory
• Acknowledge to client
• Flush MemTable to SSTable (Sorted Strings Table)

https://www.scnsoft.com/blog/cassandra-performance
See intro video: https://youtu.be/B_HTdrTgGNs?t=947
Cassandra Write Path

SSTable – Sequential, Immutable

Every so often Cassandra carries out a COMPACTION

Does big sequential READ, MERGE, WRITE

Check video: https://youtu.be/B_HTdrTgGNs?t=1143
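
The write path (commit log, MemTable, acknowledge, flush to SSTable, compaction) can be sketched as a toy single-node model. Everything here is an assumption made for illustration: the class `ToyNode`, the `memtable_limit` parameter, and the fact that flushing happens inline rather than in the background as it would in a real node.

```python
class ToyNode:
    """Toy model of one node's write path; not Cassandra's real storage engine."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only, kept for crash recovery
        self.memtable = {}        # in memory, mutable
        self.sstables = []        # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append to the commit log
        self.memtable[key] = value             # 2. update the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 4. (inline here for simplicity)
        return "ack"                           # 3. acknowledge to the client

    def flush(self):
        # Write the MemTable out as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # Sequential read, merge, write: newer tables overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode(memtable_limit=2)
for key, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()
```

Because SSTables are never modified in place, an overwritten key exists in several tables until compaction merges them and keeps only the newest value.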

Fully distributed with no single point of failure (masterless)

QUORUM Consistency:
(n/2 +1) rounded down
where n= replication factor
https://www.scnsoft.com/blog/cassandra-performance

Distributed, Replicated and Fault Tolerant

• Consistent Hashing
  - Hashed to ring
  - Order preserving hash function
• Gossip style membership algorithm
• Data replication
  - Eventual Consistency
  - Merkle Tree
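
As a rough illustration of "hashed to ring", the sketch below places nodes and keys on a ring and walks clockwise to find the replicas for a key. It is a simplified model under stated assumptions: SHA-256 is used as the hash (it is not order preserving), there are no virtual nodes, and the names `HashRing`, `ring_position` and `node-a` to `node-d` are hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Hash a string to a position on the ring (SHA-256 chosen for illustration)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self.ring, (ring_position(node), node))

    def replicas(self, key, replication_factor=1):
        """Walk clockwise from the key's position and take the next nodes."""
        positions = [p for p, _ in self.ring]
        start = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
        count = min(replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key{i}" for i in range(100)]
owner_before = {k: ring.replicas(k)[0] for k in keys}
ring.add_node("node-d")
owner_after = {k: ring.replicas(k)[0] for k in keys}
moved = [k for k in keys if owner_after[k] != owner_before[k]]
```

The point of the ring is visible in `moved`: when a node joins, the only keys that change owner are the ones the new node takes over, so the rest of the cluster keeps serving its data without reshuffling.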


Where is Cassandra?

The CAP triangle (C = Consistency, A = Availability, P = Partition tolerance):

• CA: Guarantees to give a correct response, but only while the network works fine (Centralised / Traditional)
• CP: Guarantees responses are correct even if there are network failures, but a response may fail (Weak availability)
• AP: Always provides a “best-effort” response even in the presence of network failures (Eventual consistency)

But (like Dynamo), tables are tunable towards CP

Tuneable Consistency
• Write = Commit Log + Memtable
• Quorum = Majority of replicas: ⌊R/2⌋+1 for R the replication factor
• Hinted handoff: central 3 hour TODO log (not readable)

Levels, ordered from highest availability (top) to highest consistency (bottom):

Level     Explanation
ANY       One replica node or a hinted handoff
ONE       One replica node (hinted handoff not enough)
TWO       Two replica nodes
THREE     Three replica nodes
QUORUM    A quorum of replica nodes
ALL       All replica nodes


Tuneable Consistency

For write operations, ANY is the lowest consistency (but highest availability), and ALL is the highest consistency (but lowest availability).

For read operations, ONE is the lowest consistency (but highest availability), and ALL is the highest consistency (but lowest availability).

QUORUM is a good middle-ground ensuring strong consistency, yet still tolerating some level of failure.

The size of the quorum is calculated as (replication_factor / 2) + 1, rounded down.

The replication factor is the total number of replicas of each row across the cluster.
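
The quorum formula is easy to check with a few numbers. The sketch below also includes the commonly quoted overlap rule (reads see the latest write when read replicas + write replicas > replication factor); that rule is not stated on the slide and is included here as background, and both function names are made up for this example.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(replication_factor / 2) + 1."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Overlap rule: the read set and write set must share at least one replica."""
    return read_replicas + write_replicas > replication_factor


quorum_sizes = {rf: quorum(rf) for rf in (1, 2, 3, 4, 5)}
```

With a replication factor of 3, QUORUM needs 2 replicas, so one replica can be down and QUORUM reads and writes still succeed and still overlap.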

https://blog.imaginea.com/consistency-tuning-in-cassandra

Cassandra Query Language (CQL)


SQL-like declarative query language


CQL: Create Keyspace (Database)

CQL

• Create Keyspace
  CREATE KEYSPACE MyKeySpace
  WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 3 };
• Load in keyspace
  USE MyKeySpace;

MySQL (equivalent)

• Create Database
  CREATE DATABASE MyKeySpace;
• Load in Database
  USE MyKeySpace;


CQL: Create Column Family (Table)

CQL

• Create Column Family
  CREATE COLUMNFAMILY MyColumns
  (id varint,
  lastname varchar,
  firstname varchar,
  PRIMARY KEY (id));
• Load data
  INSERT INTO MyColumns
  (id, lastname, firstname)
  VALUES (1, 'Doe', 'John');
  (inserts will always overwrite)

MySQL (equivalent)

• Create Table
  CREATE TABLE MyColumns (
  id int NOT NULL,
  lastname varchar(50),
  firstname varchar(100),
  PRIMARY KEY (id));
• Load data
  INSERT INTO MyColumns
  (id, lastname, firstname)
  VALUES (1, 'Doe', 'John');
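
The remark that CQL inserts always overwrite can be pictured with a dictionary keyed by primary key. This is only a model of the behaviour, assuming a hypothetical helper `cql_style_insert`; it is not driver code and does not talk to a database.

```python
def cql_style_insert(table, row_id, **columns):
    """Model of an upsert: create the row if missing, otherwise overwrite its cells."""
    table.setdefault(row_id, {}).update(columns)


my_columns = {}
cql_style_insert(my_columns, 1, lastname="Doe", firstname="John")
# Same primary key again: no duplicate-key error, the cells are simply replaced.
cql_style_insert(my_columns, 1, lastname="Doe", firstname="Jane")
```

In MySQL the second INSERT with the same primary key would fail with a duplicate-key error, which is the contrast the slide is drawing.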

CQL: Retrieve data

CQL

• Retrieve all rows
  SELECT * FROM MyColumns;

MySQL (equivalent)

• Retrieve all rows
  SELECT * FROM MyColumns;


CQL: Retrieve data

CQL

• Retrieve row 1
  SELECT * FROM MyColumns
  WHERE id = 1;

MySQL (equivalent)

• Retrieve id 1
  SELECT * FROM MyColumns
  WHERE id = 1;


CQL: Retrieve data

CQL

• Retrieve all Johns
  SELECT * FROM MyColumns
  WHERE firstname = 'John';

  Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.

  The query is rejected because firstname is not part of the primary key. Creating a secondary index on the column allows it:
  CREATE INDEX ON MyColumns (firstname);

MySQL (equivalent)

• Retrieve all Johns
  SELECT * FROM MyColumns
  WHERE firstname = 'John';
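
Why a lookup by id is cheap while a filter on firstname is not can be sketched with plain dictionaries. The data and the function names `by_primary_key`, `filter_scan` and `build_secondary_index` are hypothetical; the sketch ignores distribution across nodes, which makes the scan even more expensive in a real cluster.

```python
from collections import defaultdict

rows = {
    1: {"lastname": "Doe", "firstname": "John"},
    2: {"lastname": "Smith", "firstname": "Jane"},
    3: {"lastname": "Roe", "firstname": "John"},
}


def by_primary_key(rows, row_id):
    """Direct lookup: the key says exactly where the row lives."""
    return rows.get(row_id)


def filter_scan(rows, column, value):
    """What filtering implies: every row has to be inspected."""
    return [row_id for row_id, row in rows.items() if row.get(column) == value]


def build_secondary_index(rows, column):
    """Secondary index: column value -> row ids, supporting equality lookups only."""
    index = defaultdict(list)
    for row_id, row in rows.items():
        if column in row:
            index[row[column]].append(row_id)
    return index


firstname_index = build_secondary_index(rows, "firstname")
```

The index turns the equality test on firstname back into a key lookup, at the cost of maintaining an extra structure on every write.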


How is this different from RDBMS?

In a static-column storage engine, each row must reserve space for every column

ALTER TABLE users ADD birth_date INT;

New columns can be added on the fly while running and processing queries

https://www.datastax.com/dev/blog/schema-in-cassandra-1-1

Using Columns

• RDBMS approach

siteid    date          mean_temp
1         2012-09-01    20.6
1         2012-09-02    21.9
1         2012-09-03    21.7

• CASSANDRA approach

siteid    2012-09-01    2012-09-02    2012-09-03
1         20.6          21.9          21.7
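
The two layouts above hold the same readings; the wide-column version uses the date as the column name. A minimal sketch of the pivot, with made-up function names `to_wide_rows` and `date_range`:

```python
# RDBMS-style: one (siteid, date, mean_temp) tuple per reading.
readings = [
    (1, "2012-09-01", 20.6),
    (1, "2012-09-02", 21.9),
    (1, "2012-09-03", 21.7),
]


def to_wide_rows(readings):
    """Pivot to one row per site, with one column per date."""
    wide = {}
    for siteid, date, mean_temp in readings:
        wide.setdefault(siteid, {})[date] = mean_temp
    return wide


def date_range(wide, siteid, start, end):
    """Columns are kept sorted by name, so a date range is a contiguous slice."""
    return [(d, t) for d, t in sorted(wide[siteid].items()) if start <= d <= end]


wide = to_wide_rows(readings)
```

All readings for a site live in one row, so a time-range query for that site touches a single row rather than many.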


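CQL: Consistency

SELECT totalsales
FROM sales
USING CONSISTENCY QUORUM
WHERE customerid=5;

UPDATE sales
USING CONSISTENCY ONE
SET totalsales=50000
WHERE customerid=4;

Note: the USING CONSISTENCY clause belongs to early versions of CQL. In CQL 3 the consistency level is set per session or per request instead, e.g. with the cqlsh command CONSISTENCY QUORUM;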

Limitations of CQL

• No join or subquery support, and limited support for aggregation.
  - This is by design, to force you to denormalize into partitions that can be efficiently queried from a single replica, instead of having to gather data from across the entire cluster.
• A single column value may not be larger than 2GB
  - In practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values.
• The maximum number of cells (rows x columns) in a single partition is 2 billion.

https://wiki.apache.org/cassandra/CassandraLimitations

Using Bloom Filters for Fast Data Retrieval

• Each SSTable (Sorted Strings Table) has an associated Bloom Filter
• Bloom Filter stored in memory
• Highly efficient
• Can produce false positives


Bloom Filters

• Efficient test for data location
• Hash object on insert using k hash functions
  - Set bit to 1
• Hash object on read using k hash functions
  - Any 0s then not present
  - Bit would have been set to 1 on insert
• They can give FALSE POSITIVES, but not FALSE NEGATIVES – so a good way to check if data has been processed before

Video on Bloom Filters: https://youtu.be/bEmBh1HtYrw


Bloom Filters: Insert A

• Hash object on insert using k hash functions
• Set bit to 1

e.g. input word = ‘aardvark’

Output from hash function 1 = 3
Output from hash function 2 = 1
Output from hash function 3 = 14

Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters


Bloom Filters: Insert B

• Hash object on insert using k hash functions
• Set bit to 1

e.g. input word = ‘bat’

Output from hash function 1 = 16
Output from hash function 2 = 1
Output from hash function 3 = 7

Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters



Bloom Filters: Read Y

• Hash object on read using k hash functions
• Any 0s then not present (as bit would have been set to 1 on insert)

e.g. input word = ‘elephant’

Output from Hash 1 = 16
Output from Hash 2 = 2
Output from Hash 3 = 7

Bit 2 was never set by ‘aardvark’ or ‘bat’, so ‘elephant’ is definitely not present.
Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters


Bloom Filters: Read X

• Hash object on read using k hash functions
• All 1s then data may be present
  - Bit would have been set to 1 on insert

e.g. input word = ‘bat’
Hash results [16,1,7]
e.g. input word = ‘snake’
Hash results [1,14,16] → FALSE POSITIVE (‘snake’ was never inserted, but bits 1 and 14 were set by ‘aardvark’ and bit 16 by ‘bat’)

Big Data Management: http://chimera.labs.oreilly.com/books/1234000001802/ch06.html#_bloom_filters
DEMO of a BLOOM Filter

https://llimllib.github.io/bloomfilter-tutorial/
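
The insert and read steps walked through on the preceding slides can be put together in a short sketch. Two parts are shown: a small general Bloom filter whose k hash functions are derived by salting SHA-256 (an assumption made for this example, not Cassandra's actual hashing), and a replay of the slide example using the bit positions given for ‘aardvark’, ‘bat’, ‘elephant’ and ‘snake’. The names `BloomFilter`, `might_contain` and `slide_check` are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=64, k=3):
        self.size = size
        self.k = k
        self.bits = [0] * size

    def _positions(self, item: str):
        # Derive k hash functions by salting one hash with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 means definitely absent; all 1s means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
bloom.add("aardvark")
bloom.add("bat")

# Replay of the slide example, using the hash outputs given on the slides.
slide_bits = [0] * 17
for pos in [3, 1, 14] + [16, 1, 7]:      # insert 'aardvark', then 'bat'
    slide_bits[pos] = 1


def slide_check(positions):
    return all(slide_bits[pos] for pos in positions)


elephant_present = slide_check([16, 2, 7])   # bit 2 is 0, so not present
snake_present = slide_check([1, 14, 16])     # all bits set: a false positive
```

An item that was added always tests positive, so the filter never gives a false negative; a read can therefore skip any SSTable whose filter says the key is absent.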

Cassandra vs MongoDB

Problem domain needs a rich data model = MongoDB
Need secondary indexes and flexibility in the query model = MongoDB
(by contrast, Cassandra secondary indexes only support single columns and equality comparisons)

100% uptime = Cassandra
Write scalability = Cassandra
Query language support = CQL is similar to SQL

The Apache Cassandra database is a good choice when you need scalability and high availability without compromising performance, and with no single point of failure.

https://scalegrid.io/blog/cassandra-vs-mongodb/

Summary: Wide-column stores

• Conceptually a big table
  - Sparse: not all cells have values
  - Distributed and persistent
  - Multi-dimensional: multiple values
  - Sorted Map
• Model:
  - Keyspace: container for column families
  - Column Family: container for rows
  - Rows: Set of ordered columns
  - Column: (name, value, timestamp)
• Cassandra:
  - Distributed, replicated, and fault tolerant
  - SQL-like query language
  - Bloom filters: efficient data presence testing

Reading
Cassandra Vs MongoDB In 2018 by Matan Sarig

https://blog.panoply.io/cassandra-vs-mongodb

Cassandra Scales well (linear with more nodes)


Apache Cassandra introduction video

https://www.youtube.com/watch?v=B_HTdrTgGNs
