Big Data Analytics
Module V-Part 1
Introduction to NoSQL Data Management
By
Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Learning Objectives and Learning Outcomes
Learning Objectives (the big data technology landscape):
1. What are NoSQL databases?
2. Why NoSQL?
3. Key advantages of NoSQL.
Learning Outcomes:
a) To understand the significance of NoSQL databases.
b) To understand the need for NewSQL.
Big Data and Analytics by Seema Acharya and Subhashini
Chellappan
Introduction to NoSQL (Not Only SQL)
• NoSQL provides a platform for schema-free
databases that can handle large amounts of data.
• These databases are scalable, highly available,
support replication, are distributed, and are
often open source.
• Before we develop applications that interact
with NoSQL databases, we should understand the
need to separate data management from data
storage in these databases.
• NoSQL focuses on high-performance, scalable data
storage and provides low-level access to the data
management layer.
• This allows data management tasks to be written
easily in any programming language.
Why NoSQL?
NoSQL databases:
• Are non-relational data storage systems
• Have no fixed table schema
• Have no joins
• Have no multi-document transactions
• Relax one or more ACID properties
Benefits of NoSQL Databases
• Scalable
• Simple data model
• Streaming/Volume
• Reliability
• Schema-less
• Rapid development
• Flexible
• Cheaper than RDBMS
• Creates a caching layer
• Wide data type variety
• Uses large binary objects for storing large data
• Bulk upload
• Graphs
• Lower administration
• Distributed storage
• Real-time analysis
Challenges against NoSQL
• ACID transactions
• Cannot use SQL
• Ecosystem/tools/add-ons
• Cannot perform searches
• Data loss
• No referential integrity
• Lack of availability of expertise
Characteristics of NoSQL
• More than rows in tables: NoSQL systems store and retrieve data in many formats: key-value
stores, graph databases, column-family (Bigtable) stores, document stores, and even
rows in tables.
• Free of joins: NoSQL systems allow you to extract your data using simple interfaces
without joins.
• Schema-free: NoSQL systems allow you to drag-and-drop your data into a folder and
then query it without creating an entity-relational model.
• Works on many processors: NoSQL systems allow you to store your database on
multiple processors and maintain high-speed performance.
• Uses shared-nothing commodity computers: Most (but not all) NoSQL systems
leverage low-cost commodity processors that have separate RAM and disk.
• Supports linear scalability: When you add more processors, you get a consistent
increase in performance.
• Innovative: NoSQL offers alternatives to a single way of storing, retrieving, and
manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive
attitude about NoSQL and recognize SQL solutions as viable options. To the NoSQL
community, NoSQL means “Not only SQL.”
History of NoSQL
• The term NoSQL was first used by Carlo Strozzi in 1998.
• It began as a mechanism for data storage and retrieval.
• Eric Evans reintroduced the term NoSQL in 2009.
• Examples of NoSQL databases are MongoDB, Cassandra, Redis, HBase, Splunk, Neo4j, CouchDB, etc.
Types of NoSQL
• Key-value data model: Riak, Redis, Membase
• Column-oriented data model: Cassandra, HBase, HyperTable
• Document data model: MongoDB, CouchDB, RavenDB
• Graph data model: InfiniteGraph, Neo4j, AllegroGraph
1. Key value data Model
• A key-value database (also known as a key-value store and key-value store database) is a
type of NoSQL database that uses a simple key/value method to store data.
• The key-value part refers to the fact that the database stores data as a collection of
key/value pairs. This is a simple method of storing data, and it is known to scale well.
• The key-value pair is a well-established concept in many programming languages.
Programming languages typically refer to a collection of key-value pairs as an associative
array. It is also commonly called a dictionary or hash.
• Example: Phone directory
Key Value
Bob (123) 456-7890
Jane (234) 567-8901
Tara (345) 678-9012
Tiara (456) 789-0123
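The phone directory above maps directly onto an associative array. A minimal sketch in Python, using a plain dict as a stand-in for a key-value store; the `put` and `get` helper names are illustrative assumptions, not the API of any particular product:

```python
# A plain dict standing in for a key-value store (illustration only).
phone_directory = {}

def put(key, value):
    """Store or overwrite the value for a key."""
    phone_directory[key] = value

def get(key, default=None):
    """Look up a value by its key; return default if the key is absent."""
    return phone_directory.get(key, default)

put("Bob", "(123) 456-7890")
put("Jane", "(234) 567-8901")
put("Tara", "(345) 678-9012")
put("Tiara", "(456) 789-0123")

print(get("Jane"))    # (234) 567-8901
print(get("Nobody"))  # None
```

Every access goes through the key, which is why the model is simple and tends to scale well.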
The Key
• The key in a key-value pair must (or at least, should) be unique. This is the
unique identifier that allows you to access the value associated with that
key.
• In theory, the key could be anything. But this may depend on the DBMS.
One DBMS may impose limitations while another may impose none.
• However, for performance reasons, you should avoid having a key that’s
too long. But too short can cause readability issues too. In any case, the key
should follow an agreed convention in order to keep things consistent.
The Value
• The value in a key-value store can be anything, such as text (long
or short), a number, markup code such as HTML, programming code
such as PHP, an image, etc.
• The value could also be a list, or even another key-value pair
encapsulated in an object.
• Some key-value DBMSs allow you to specify a data type for the
value. For example, you could specify that the value should be an
integer. Other DBMSs don’t provide this functionality and therefore,
the value could be of any type.
Examples of Key-Value Database
Management Systems
• Redis
• Oracle NoSQL Database
• Voldemort
• Aerospike
• Oracle Berkeley DB
2. Column-oriented Data model
• In this model, data is stored in cells grouped in columns rather than in rows.
• Columns are logically grouped into column families. Column families can contain a
virtually unlimited number of columns that can be created at runtime or while defining
the schema.
• Reads and writes are done using columns rather than rows.
• Column families are groups of similar data that are usually accessed together. As an
example, we often access customers’ names and profile information at the same time,
but not the information on their orders.
• The main advantages of storing data in columns over a relational DBMS are fast
search/access and data aggregation.
• Each column family can be compared to a container of rows in an RDBMS table, where
the key identifies the row and the row consists of multiple columns. The difference is
that various rows do not have to have the same columns, and columns can be added to
any row at any time without having to add them to other rows.
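The row-and-column flexibility described above can be sketched with nested dictionaries. This is only a toy model, assuming a made-up `customer_profile` column family and a hypothetical `read_column` helper; real column-family stores add persistence, timestamps, and distribution:

```python
# Column family modelled as: row key -> {column name -> value}.
# Names and data are made up for illustration.
customer_profile = {
    "cust-001": {"name": "Bob", "city": "Tumakuru"},
    "cust-002": {"name": "Jane"},
}

# Add a column to one row only; other rows are unaffected.
customer_profile["cust-002"]["email"] = "jane@example.com"

def read_column(family, column):
    """Return {row key: value} for every row that has the column."""
    return {key: cols[column] for key, cols in family.items() if column in cols}

print(read_column(customer_profile, "name"))
print(read_column(customer_profile, "email"))
```

Note that `cust-001` never gains an `email` column: rows in the same family do not have to share the same columns.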
Use cases of the column-oriented data model
• Content management systems
• Blogging platforms
• Systems that maintain counters
• Services that have expiring usage
• Systems that require heavy write requests (like log
aggregators)
3. Document Data Model
• Documents can be stored in several formats,
such as XML, JSON, and BSON.
• Documents are self-describing, hierarchical tree
data structures that can contain maps,
collections, and scalar values.
• Document databases store documents in the value part of a key-value store.
• To ease the transition from relational databases, document databases provide
indexing, searching, etc.
• They provide good performance and scalability, but typically do not provide ACID
guarantees or data integrity.
• A document database is not a replacement for a relational database, but an
alternative to it.
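To make the idea of a self-describing document concrete, here is a small sketch that stores a JSON document as the value under a key and then searches inside it. The product fields, the `doc_store` dict, and the `find_by_tag` helper are assumptions for illustration; a real document database would use indexes rather than a full scan:

```python
import json

# Hypothetical product document: nested map, a collection, scalar values.
product = {
    "_id": "prod-42",
    "name": "Laptop",
    "price": 55000,
    "specs": {"ram_gb": 16, "storage": "512 GB SSD"},
    "tags": ["electronics", "computers"],
}

# The store keeps key -> serialized document.
doc_store = {product["_id"]: json.dumps(product)}

def find_by_tag(tag):
    """Scan all documents and return those whose 'tags' list contains tag."""
    docs = (json.loads(raw) for raw in doc_store.values())
    return [d for d in docs if tag in d.get("tags", [])]

print(find_by_tag("computers")[0]["specs"]["ram_gb"])  # 16
```

Unlike a pure key-value store, the database can look inside the value, which is what makes querying on fields such as `tags` possible.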
Examples of Document Data model
• MongoDB
• CouchDB
• Terrastore
• OrientDB
• RavenDB
• Lotus Notes
Note: Couchbase now offers ACID Transactions.
4. Graph-based NoSQL database
• A graph database is designed to handle very large datasets. It can
integrate heterogeneous data from many sources and make links
between datasets.
• It focuses on the relationships between entities and can infer new
knowledge from existing information.
• It is built upon the Entity-Attribute-Value model.
• Entities are also known as nodes, which have properties.
• It is a very flexible way to describe how data relates to other data.
• Nodes store data about each entity in the database, relationships
describe a connection between nodes, and a property is simply the node
on the opposite end of the relationship.
• A traditional database stores a description of each possible
relationship in foreign key fields or junction tables, whereas a graph
database allows virtually any relationship to be defined.
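A rough sketch of the node-and-relationship idea, using plain Python structures and a breadth-first search to follow relationships from one node to another. The people, the `KNOWS` relationship type, and the helper names are invented for this example and do not reflect the query interface of any graph database:

```python
from collections import deque

# Nodes with properties, and directed relationships as (source, type, target).
nodes = {
    "alice": {"label": "Person"},
    "bob": {"label": "Person"},
    "carol": {"label": "Person"},
}
relationships = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
]

def neighbours(node):
    """Nodes reachable from node by following one outgoing relationship."""
    return [dst for src, _type, dst in relationships if src == node]

def shortest_path(start, goal):
    """Breadth-first search; returns a list of node ids, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("alice", "carol"))  # ['alice', 'bob', 'carol']
```

New relationship types can be added by appending tuples, with no foreign keys or junction tables to define first.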
Examples of Graph-based NoSQL databases
• Neo4j
• InfoGrid
• InfiniteGraph
Note: A Fortune 500 financial services company uses Neo4j to identify potential fraud more quickly,
stopping millions of fraudulent transactions.
With the advent of the NoSQL movement, businesses of all sizes have a
variety of modern options from which to build solutions relevant to their use
cases.
• Calculating average income? Ask a relational database.
• Building a shopping cart? Use a key-value store.
• Storing structured product information? Store as a document.
• Describing how a user got from point A to point B? Follow a graph.
Advantages of NoSQL
• Cheap and easy to implement
• Easy to distribute
• Can easily scale up and down
• Relaxes the data consistency requirement
• Doesn’t require a predefined schema
• Data can be replicated to multiple nodes and can be partitioned
NoSQL Vendors
Company     Product     Most widely used by
Amazon      DynamoDB    LinkedIn, Mozilla
Facebook    Cassandra   Netflix, Twitter, eBay
Google      BigTable    Adobe Photoshop
Materialized View
• A materialized view differs from a normal view in that its query results are stored rather than
recomputed on each access. It is useful in environments where the source data is in a format that
is not suitable for querying.
• These views are disk-based and are updated periodically, as required by the queries they serve.
• They carry a storage cost.
• They carry an update cost.
• There is no SQL standard for defining a materialized view; the functionality is provided by some
database systems as an extension.
• Materialized views are efficient when the view is accessed frequently, because storing the results
beforehand saves computation time. They suit cases where response time must be very fast.
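The trade-off between fast reads and refresh cost can be sketched with Python's built-in `sqlite3` module. SQLite has no native materialized views, so this example emulates one with an ordinary summary table that is rebuilt on demand; the `orders` and `order_totals` tables and both helper functions are assumptions made for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Bob", 100.0), ("Bob", 50.0), ("Jane", 75.0)],
)

def refresh_order_totals(conn):
    """Recompute the stored summary table from the base table."""
    conn.execute("DROP TABLE IF EXISTS order_totals")
    conn.execute(
        "CREATE TABLE order_totals AS "
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    )
    conn.commit()

def total_for(conn, customer):
    """Read the precomputed total; no aggregation happens at query time."""
    row = conn.execute(
        "SELECT total FROM order_totals WHERE customer = ?", (customer,)
    ).fetchone()
    return row[0] if row else None

refresh_order_totals(conn)
print(total_for(conn, "Bob"))  # 150.0

conn.execute("INSERT INTO orders VALUES ('Bob', 25.0)")
print(total_for(conn, "Bob"))  # still 150.0: the stored result is stale
refresh_order_totals(conn)
print(total_for(conn, "Bob"))  # 175.0 after the refresh
```

The stale read in the middle shows the update cost: the stored result is only as fresh as the last refresh.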
Distribution models
There are two styles of distributing data:
• Sharding distributes different datasets across different sites. It provides horizontal
scalability and reduces the workload of each server.
• Replication copies the same data across different sites.
• Sharding improves both read and write performance, while replication
improves read performance but not write performance.
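The two styles can be combined, as in this toy sketch where each key is routed to one shard by a hash and every shard keeps two replicas. The shard count, replica count, and the `write`/`read` helpers are arbitrary choices for illustration, not a description of how any specific database routes data:

```python
import zlib

NUM_SHARDS = 3
REPLICAS_PER_SHARD = 2

# shards[i][r] is replica r of shard i; each replica is a dict.
shards = [[{} for _ in range(REPLICAS_PER_SHARD)] for _ in range(NUM_SHARDS)]

def shard_for(key):
    """Pick a shard using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def write(key, value):
    # A write touches only one shard, but must reach every replica of it.
    for replica in shards[shard_for(key)]:
        replica[key] = value

def read(key, replica_index=0):
    # Any replica of the owning shard can serve a read.
    return shards[shard_for(key)][replica_index].get(key)

for name in ["Bob", "Jane", "Tara", "Tiara"]:
    write(name, f"profile of {name}")

print(read("Jane", replica_index=1))  # profile of Jane
```

Reads can be spread over replicas, but each write still has to be applied to all of them, which is why replication alone does not help write performance.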
CAP theorem
The CAP theorem states that a distributed system can deliver only two of
three desired characteristics: Consistency, Availability, and Partition
tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP).
• Consistency: Consistency means that all clients see the same data at the same time, no matter
which node they connect to. For this to happen, whenever data is written to one node, it must
be instantly forwarded or replicated to all the other nodes in the system before the write is
deemed ‘successful.’
• Availability: Availability means that any client making a request for data gets a response,
even if one or more nodes are down. Put another way, all working nodes in the
distributed system return a valid response for any request, without exception.
• Partition tolerance: A partition is a communications break within a distributed system—a lost
or temporarily delayed connection between two nodes. Partition tolerance means that the
cluster must continue to work despite any number of communication breakdowns between
nodes in the system.
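The trade-off becomes visible once a partition occurs. The following toy model, with a hypothetical `TwoNodeCluster` class, contrasts a system that stays consistent by refusing writes ("CP") with one that stays available by accepting them on one side ("AP"). It is a deliberately simplified sketch, not a model of any real replication protocol:

```python
class TwoNodeCluster:
    """Toy two-replica cluster; mode is 'CP' or 'AP' (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [{}, {}]
        self.partitioned = False

    def write(self, node_id, key, value):
        if not self.partitioned:
            for node in self.nodes:       # replicate to both nodes
                node[key] = value
            return True
        if self.mode == "CP":
            return False                  # refuse: gives up availability
        self.nodes[node_id][key] = value  # accept locally: gives up consistency
        return True

    def read(self, node_id, key):
        return self.nodes[node_id].get(key)

for mode in ("CP", "AP"):
    cluster = TwoNodeCluster(mode)
    cluster.write(0, "x", 1)
    cluster.partitioned = True
    accepted = cluster.write(0, "x", 2)
    print(mode, accepted, cluster.read(0, "x"), cluster.read(1, "x"))
# CP False 1 1
# AP True 2 1
```

In the CP run both nodes still agree but the write was rejected; in the AP run the write succeeded but the two nodes now return different values.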
ACID Properties
• ACID transactions are a very important feature that most relational
databases have had for decades. They enable you to combine a series of
different database operations into one transaction that provides the
following four guarantees:
• Atomicity - that the operations will all either succeed or fail as a single
unit;
• Consistency - that they won’t violate certain constraints you defined for
the data as a whole;
• Isolation - that each operation is hidden from view until the whole
transaction is complete;
• Durability - that all changes to the data are safely persisted.
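Atomicity and consistency can be demonstrated with Python's built-in `sqlite3` module. In this sketch the `accounts` table, its `CHECK` constraint, and the `transfer` helper are assumptions chosen for the example; the point is that a failed transfer leaves no partial update behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("Bob", 100), ("Jane", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

print(transfer(conn, "Bob", "Jane", 500), balances(conn))
# False {'Bob': 100, 'Jane': 50}
print(transfer(conn, "Bob", "Jane", 30), balances(conn))
# True {'Bob': 70, 'Jane': 80}
```

In the first call the credit to Jane is applied and then undone when the debit violates the constraint, so the pair of operations fails as a single unit.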
BASE: an alternative to ACID
• When it comes to NoSQL databases, data consistency models can
sometimes be strikingly different than those used by relational databases
(as well as quite different from other NoSQL stores).
• The two most common consistency models are known by the acronyms
ACID and BASE. While they are often pitted against each other, both
consistency models come with advantages and disadvantages, and
neither is always a perfect fit.
• In the NoSQL database world, ACID transactions are less fashionable as
some databases have loosened the requirements for immediate
consistency, data freshness and accuracy in order to gain other benefits,
like scale and resilience.
• Here’s how the BASE acronym breaks down:
• Basically available: The database appears to work most of the time.
• Soft-state: Stores don’t have to be write-consistent, nor do different
replicas have to be mutually consistent all the time.
• Eventual consistency: Stores exhibit consistency at some later point (e.g.,
lazily at read time).
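Soft state and eventual consistency can be illustrated with a primary copy that accepts writes immediately and a replica that only catches up later. The `EventuallyConsistentStore` class and its explicit `sync` step are simplifications invented for this sketch; real systems propagate updates in the background or repair them at read time:

```python
class EventuallyConsistentStore:
    """Primary accepts writes; the replica catches up only on sync()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)  # may be stale

    def sync(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

ec_store = EventuallyConsistentStore()
ec_store.write("cart:bob", ["book"])
print(ec_store.read_from_replica("cart:bob"))  # None (replica is stale)
ec_store.sync()
print(ec_store.read_from_replica("cart:bob"))  # ['book']
```

Between the write and the sync the two copies disagree (soft state); after the sync they converge (eventual consistency).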
Sharding
• Sharding is a partitioning pattern for the NoSQL age.
• Sharding is a method of splitting and storing a single logical dataset in
multiple databases.
• Sharding is also referred to as horizontal partitioning. The distinction
between horizontal and vertical comes from the traditional tabular view
of a database.
• A database can be split vertically (storing different tables and columns
in separate databases) or horizontally (storing rows of the same table in
multiple database nodes).
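The difference between the two kinds of split can be shown on a small in-memory table. The `customers` data and both helper functions are made up for this sketch, and the modulo rule is just one simple way to assign rows to nodes:

```python
# A small "table" as a list of rows (made-up data).
customers = [
    {"id": 1, "name": "Bob", "city": "Tumakuru", "phone": "(123) 456-7890"},
    {"id": 2, "name": "Jane", "city": "Mysuru", "phone": "(234) 567-8901"},
    {"id": 3, "name": "Tara", "city": "Bengaluru", "phone": "(345) 678-9012"},
    {"id": 4, "name": "Tiara", "city": "Hubballi", "phone": "(456) 789-0123"},
]

def split_horizontally(rows, num_nodes):
    """Sharding: whole rows are assigned to nodes by id modulo num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row["id"] % num_nodes].append(row)
    return nodes

def split_vertically(rows, column_groups):
    """Vertical partitioning: each part keeps the id plus a subset of columns."""
    return [
        [{"id": row["id"], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]

row_shards = split_horizontally(customers, 2)
column_parts = split_vertically(customers, [["name", "city"], ["phone"]])
print([len(s) for s in row_shards])  # [2, 2]
print(column_parts[1][0])            # {'id': 1, 'phone': '(123) 456-7890'}
```

Each horizontal shard holds complete rows for a subset of customers, while each vertical part holds every customer but only some of the columns.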