Ramaiah Institute of Technology
(Autonomous Institute, Affiliated to VTU)
Department of AIML
Course Name : NoSQL Databases
Course Code : AIE734
Credits : 3:0:0
Unit-2
NoSQL Data Architecture Patterns
Ø An architecture pattern is a logical way of categorizing the data that will be stored in the
database.
Ø In other words, it is a way of organizing data in a logical and structured manner.
Ø NoSQL is a class of databases that stores big data in a suitable format and supports
operations on it.
Ø It is widely used because of its flexibility and the wide variety of services built on it.
Architecture Patterns of NoSQL
Data in a NoSQL database is stored using one of the following four data architecture patterns.
1. Document data model
2. Key Value data model
3. Column data model
4. Graph based data model
Document data model
Ø A document database stores and retrieves data as key-value pairs, but here the values are
called documents.
Ø A document is a complex, self-describing data structure.
Ø A document can be text, an array, a string, JSON, XML, or a similar format.
Ø This is very effective because most data created today is semi-structured, often in JSON form.
Ø Each document contains all the necessary data, and documents can be indexed for easy retrieval.
Advantages
• This type of format is very useful and apt for semi-structured data.
• Storage, retrieval, and management of documents are easy.
• Flexible schema allows for easy changes to data structure.
• High performance for read-heavy workloads.
Disadvantages
• Limited support for joins and multi-document transactions.
• Write-heavy workloads can perform poorly.
Example
MongoDB is a popular document store with support for dynamic schemas and horizontal scaling.
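The document model above can be sketched with plain Python dictionaries. This is a minimal illustration only, with made-up data; real stores such as MongoDB add persistence, rich indexing, and a query language on top of this idea.

```python
# Each document is self-contained and keyed by a unique _id.
documents = {
    "u1": {"_id": "u1", "name": "Ramya", "course": "BE", "tags": ["ai", "ml"]},
    "u2": {"_id": "u2", "name": "Kiran", "course": "BE", "tags": ["db"]},
}

# A simple secondary index on "course", so lookups avoid a full scan.
course_index = {}
for doc in documents.values():
    course_index.setdefault(doc["course"], []).append(doc["_id"])

def find_by_course(course):
    """Look up documents through the index instead of scanning them all."""
    return [documents[_id] for _id in course_index.get(course, [])]

print([d["name"] for d in find_by_course("BE")])  # ['Ramya', 'Kiran']
```

Note how each document carries all its own fields, so no join is needed to assemble a result.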
Figure – Document Store Model in form of JSON documents
Key-Value data model
Ø This model is one of the most basic models of NoSQL databases.
Ø As the name suggests, the data is stored in form of Key-Value Pairs.
Ø The key is usually a string, integer, or character sequence, but it can also be a more advanced data type.
Ø Each value is associated with its key.
Ø The key-value pair storage databases generally store data as a hash table where each key is unique.
Ø The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc).
Ø This type of pattern is usually used in shopping websites or e-commerce applications.
Advantages
• Simple and efficient architecture.
• Allows fast read and write operations.
• Can handle large amounts of data and heavy load.
Disadvantages
• Not suitable for complex data structures.
Example
• DynamoDB
• Berkeley DB
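A minimal key-value store can be sketched with a Python dict, which is itself a hash table. The keys and values here (a session-scoped shopping cart, echoing the e-commerce use case above) are illustrative only; real systems such as DynamoDB add durability, partitioning, and replication.

```python
store = {}

def put(key, value):
    store[key] = value          # each key is unique; put overwrites

def get(key, default=None):
    return store.get(key, default)

# Typical e-commerce usage: a shopping cart keyed by session id.
put("session:42:cart", ["laptop", "mouse"])
print(get("session:42:cart"))   # ['laptop', 'mouse']
```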
Column Data Model
Ø This pattern stores data in column families: rather than storing data in relational tuples (rows),
the data is stored in individual cells grouped into columns.
Ø A relational database stores data in rows and reads it row by row; a column store is organized
as a set of columns.
Ø So if someone wants to run analytics on a small number of columns, those columns can be read
directly without loading the unwanted data into memory.
Ø For example: Cassandra and Apache HBase.
Sl.No | Name    | Course | ID
1     | Ramya   | BE     | 4
2     | Kiran   | BE     | 7
3     | Kapilan | M.Tech | 20
4     | Priya   | M.Tech | 8
Fig: Example of a row-oriented table

Sl.No | Name    | ID
1     | Ramya   | 4
2     | Kiran   | 7
3     | Kapilan | 20
4     | Priya   | 8
Fig: Column-oriented table

Sl.No | Course | ID
1     | BE     | 4
2     | BE     | 7
3     | M.Tech | 20
4     | M.Tech | 8
Fig: Column-oriented table
Working of Column data model
Ø The columnar data model organizes information into columns instead of rows.
Ø Column families superficially resemble tables in a relational database, but the data is stored
and read column by column.
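The column layout can be sketched by keeping the table from the figures above as one list per column, so an analytic query touches only the columns it needs. This is a toy illustration; real column stores add compression, column families, and distribution.

```python
# The table is stored column-by-column instead of row-by-row.
table = {
    "Sl.No":  [1, 2, 3, 4],
    "Name":   ["Ramya", "Kiran", "Kapilan", "Priya"],
    "Course": ["BE", "BE", "M.Tech", "M.Tech"],
    "ID":     [4, 7, 20, 8],
}

def column_sum(col):
    """Read a single column directly; the other columns are never loaded."""
    return sum(table[col])

print(column_sum("ID"))  # 39
```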
Advantages
• Well Structured
• Flexibility
• Scalability
• Load Time
Disadvantages
• Designing indexing schema
• Online transaction processing
• Security vulnerabilities
Graph Based Data Model
Ø The graph-based data model in NoSQL focuses on the relationships between data elements.
Ø As the name suggests, each element is stored as a node, and the associations between these
elements are known as links (edges).
Ø Associations are stored directly, as they are first-class elements of the data model.
Ø This model gives us a conceptual view of the data.
Ø It is based on a network (graph) structure of nodes and edges.
• Nodes: instances of data representing the objects to be tracked.
• Edges: represent the relationships between nodes.
• Properties: information associated with nodes (and, in many systems, with edges).
Fig: A simple graph with two vertices and one edge.
Fig : Image represents Nodes with properties from relationships represented by edges
Working of graph data model.
Ø In this data model, connected nodes are linked physically, and the physical connection between
them is itself stored as a piece of data.
Ø Connecting data this way makes it easy to query a relationship.
Ø The model reads relationships directly from storage instead of computing the connection at
query time.
Ø Like many NoSQL databases, graph databases typically have a flexible (or no fixed) schema,
which keeps the model easy to evolve and edit.
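The working described above can be sketched with adjacency lists: nodes carry properties, and edges are stored directly, so following a relationship is a lookup rather than a computed join. All names here are illustrative; systems like Neo4j provide this model with a full query language.

```python
# Nodes with properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob":   {"label": "Person", "age": 27},
    "acme":  {"label": "Company"},
}

# Edges stored as adjacency lists of (relationship, target-node) pairs.
edges = {
    "alice": [("FRIEND_OF", "bob"), ("WORKS_AT", "acme")],
    "bob":   [("WORKS_AT", "acme")],
}

def neighbours(node, rel):
    """Read relationships straight from storage; no join is computed."""
    return [t for (r, t) in edges.get(node, []) if r == rel]

print(neighbours("alice", "FRIEND_OF"))  # ['bob']
```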
Advantages
• Structure
• Real time output results
Disadvantages
• No standard query language across products.
• Large graphs can become difficult to visualize and manage.
NoSQL system ways to handle big data problems
Ø Datasets that are too large to store and analyze with conventional database tools are referred to as big data.
Ø As data volumes grow, the question arises of how this data can be processed effectively with
current IT approaches.
Ø Ideas, techniques, tools, and technologies are needed to handle such data and transform it into
business value and knowledge.
Ø The major NoSQL features that help handle large amounts of data are stated below.
NoSQL databases that are best for big data are:
• MongoDB
• Cassandra
• CouchDB
• Neo4j
Different ways to handle Big Data problems
1. Moving query to the data, not data to the query.
2. Hash rings to distribute the data on clusters.
3. Replication to scale read.
4. Distributed queries to data nodes.
Fig: One or many databases? These are some of the challenges you face when you move from a single processor to a distributed
computing system. Moving to a distributed environment is a nontrivial endeavor and should be done only if the business problem really
warrants the need to handle large data volumes in a short period of time. This is why platforms like Hadoop exist: they provide a
framework that makes distributed computing easier for the application developer.
Moving query to the data, not data to the query
Ø With the exception of large graph databases, most NoSQL systems use commodity processors that each hold a
subset of the data on their local shared-nothing drives.
Ø When a client wants to send a general query to all nodes that hold data, it’s more efficient to send the query to
each node than it is to transfer large datasets to a central processor.
Ø Keeping all the data within each data node in the form of logical documents means that only the query itself and
the final result need to be moved over a network.
Ø This keeps your big data queries fast.
Hash rings to distribute the data on clusters.
Ø One of the most challenging problems in a distributed database is assigning each document to a
processing node in a consistent way.
Ø Hash rings are common in big data solutions because they deterministically decide which
processor a piece of data is assigned to.
Ø A hash ring takes the leading bits of a document's hash value and uses them to determine which
node the document should be assigned to.
Ø This allows any node in a cluster to know which node a piece of data lives on, and to adapt to
new assignment rules as your data grows.
Ø Hashing each document to a randomly distributed 40-character key is a good way to spread a
big data load evenly over many servers.
Fig: Using a hash ring to assign a node to a key that uses a 40-character hex number (2^160
possible values). The first bits in the hash can be used to map a document directly to a node.
This allows documents to be randomly but evenly assigned to nodes, and assignment rules to be
updated as you add nodes to your cluster.
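Hash-based placement can be sketched as follows: a document's key is hashed with SHA-1 (a 40-character hex value, i.e. 2^160 possibilities) and the leading bits pick the node. This is the simple modulo variant for illustration; a production consistent-hash ring also minimizes data movement when nodes are added.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # illustrative cluster

def assign_node(doc_key):
    """Deterministically map a document key to a node."""
    digest = hashlib.sha1(doc_key.encode()).hexdigest()  # 40 hex chars
    # Use the leading bits of the hash to choose a node.
    return NODES[int(digest[:8], 16) % len(NODES)]

# Every node computes the same answer, so any node in the cluster
# knows where a given document lives.
print(assign_node("order-1001"))
```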
Replication to scale read.
Ø Databases use replication to make backup copies of data in real time.
Ø Replication lets you scale read requests horizontally: reads can be served from any replica.
Ø There are only a few times when you must be concerned about the lag between a write to the
read/write node and a client reading that same record from a replica.
Ø One of the most common operations after a write is a read of that same record.
Ø If a client writes and then immediately reads from that same node, there's no problem.
Ø The problem occurs when the read hits a replica before the update has propagated to it.
Ø The best way to avoid this problem is to direct a client's reads to the write node until its
writes have been replicated (read-your-writes consistency).
Ø This logic can be added to a session or state-management system at the application layer.
Ø If your application needs strict read-after-write consistency, you must deal with it at the
application layer.
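The application-layer fix described above can be sketched like this: after a session writes a key, its reads are routed to the primary until the (simulated) replica has caught up. All names are illustrative; a real system would track replication state with versions or timestamps.

```python
primary, replica = {}, {}

def write(session, key, value):
    primary[key] = value
    session.add(key)             # remember what this session wrote

def replicate(key):
    replica[key] = primary[key]  # replication happens with some lag

def read(session, key):
    # Route to the primary for keys this session just wrote;
    # otherwise the cheaper replica copy is fine.
    return primary[key] if key in session else replica.get(key)

session = set()
write(session, "profile:1", "v2")
print(read(session, "profile:1"))  # 'v2', even before replication
```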
Distributed queries to data nodes.
Ø In order to get high performance from queries that span multiple nodes, it's important to
separate the concerns of query evaluation from query execution.
Ø The NoSQL database moves the query to the data rather than moving the data to the query:
queries are sent to data nodes, but data is never shipped to a query node.
Ø In this design, all incoming queries arrive at query-analyzer nodes.
Ø These nodes then forward the query to each data node.
Ø Data nodes with matching documents return them to the query node.
Ø The query won't return until all data nodes (or a replica standing in for each) have responded
to the original request.
Ø If a data node is down, the query is redirected to a replica of that node.
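The query-analyzer pattern above can be sketched as follows: the query is forwarded to every data node (or its replica if the node is down), and the matching documents are gathered centrally. Node names and records are made up for illustration.

```python
data_nodes = {
    "n1": [{"id": 1, "course": "BE"}, {"id": 2, "course": "M.Tech"}],
    "n2": [{"id": 3, "course": "BE"}],
}
replicas = {"n2": [{"id": 3, "course": "BE"}]}  # replica copy of n2

def distributed_query(predicate, down=frozenset()):
    """Forward the query to each data node; use a replica if a node is down."""
    results = []
    for name, docs in data_nodes.items():
        if name in down:
            docs = replicas[name]        # redirect to the node's replica
        results.extend(d for d in docs if predicate(d))
    return results

# The query still completes even though node n2 is down.
print(distributed_query(lambda d: d["course"] == "BE", down={"n2"}))
# [{'id': 1, 'course': 'BE'}, {'id': 3, 'course': 'BE'}]
```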
Here are some typical big data use cases:
1. Bulk image processing
2. Public web page data
3. Event log data
4. Remote sensor data
5. Mobile phone data
6. Social media data
7. Game data
8. Open linked data
Exercise
1. Create a JSON document for a "user" profile with fields like username, email, age, and an array posts that
contains several objects, each representing a user's post with attributes like post_id, title, and content.
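One possible shape for such a document, built and serialized with Python's json module (all field values here are made up), might look like:

```python
import json

user = {
    "username": "ramya_k",
    "email": "ramya@example.com",
    "age": 21,
    "posts": [
        {"post_id": 1, "title": "Intro to NoSQL", "content": "Four patterns..."},
        {"post_id": 2, "title": "Hash rings", "content": "Consistent hashing..."},
    ],
}

# Serialize the nested structure to a JSON document.
print(json.dumps(user, indent=2))
```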