Bda Unit-2

BDA

Uploaded by

CSE NSIT

NoSQL Data Management

Syllabus
Introduction to NoSQL - aggregate data models - key-value and document data models - relationships - graph databases - schemaless databases - materialized views - distribution models - master-slave replication - consistency - Cassandra - Cassandra data model - Cassandra examples - Cassandra clients

Contents
2.1 Introduction to NoSQL
2.2 Aggregate Data Models
2.3 Schemaless Databases
2.4 Materialized Views
2.5 Distribution Models
2.6 Consistency
2.7 Cassandra
2.8 Two Marks Questions with Answers

2.1 Introduction to NoSQL
• NoSQL means "Not Only SQL". It solves the problem of handling huge volumes of data that relational databases cannot handle. NoSQL databases are schema-free and non-relational. Most NoSQL databases are open source.
• NoSQL is also a type of distributed database, which means that information is copied and stored on various servers, which can be remote or local. This ensures availability and reliability of data. If some of the data goes offline, the rest of the database can continue to run.
• NoSQL encompasses structured data, semi-structured data, unstructured data and polymorphic data.
• A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases.
• NoSQL is a response to the following business data factors :
1. Volume and velocity, referring to the ability to handle large datasets that arrive quickly.
2. Variability, referring to how diverse data types don't fit into structured tables.
3. Agility, referring to how fast an organization responds to business changes.
• NoSQL databases are very often referred to as data stores rather than databases.
• NoSQL systems work on multiple processors and can run on low-cost separate computer systems. There is no need for expensive nodes to get high-speed performance.
• NoSQL supports linear scalability. Every time we add more processors, we get a consistent increase in performance.
History of NoSQL :
• The acronym NoSQL was first used in 1998 by Carlo Strozzi while naming his lightweight, open-source "relational" database that did not use SQL. The name came up again in 2009 when Eric Evans and Johan Oskarsson used it to describe non-relational databases. Relational databases are often referred to as SQL systems. The term NoSQL can mean either "No SQL systems" or the more commonly accepted translation of "Not only SQL", to emphasize the fact that some systems might support SQL-like query languages.
• NoSQL developed, at least in the beginning, as a response to web data, the need for processing unstructured data and the need for faster processing. The NoSQL model uses a distributed database system, meaning a system with multiple computers.
• Agile sprints, quick iteration and frequent code pushes.
• Object-oriented programming that is easy to use and flexible.
• Scale-out architecture.

Types of NoSQL Stores :
1. Column oriented (Accumulo, Cassandra, HBase)
2. Document oriented (MongoDB, Couchbase, Clusterpoint)
3. Key-value (Dynamo, MemcacheDB, Riak)
4. Graph (Allegro, Neo4j, OrientDB)

• NoSQL databases are guaranteed to adhere to two of the CAP properties. Such databases are of several types :
1. Key-value store : Stores data in the form of a hash table. {Example - Riak, Amazon S3 (Dynamo), Redis}
2. Document-based store : Stores objects, mostly JSON, which is web friendly or supports ODM (Object Document Mappings). {Example - CouchDB, MongoDB}
3. Column-based store : Each storage block contains data from only one column. {Example - HBase, Cassandra}
4. Graph-based : Graph representation of relationships, mostly used by social networks. {Example - Neo4j}

The Definition of Four Types of NoSQL Databases
• A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL is often interpreted as Not-only-SQL to emphasize that such systems may also support SQL-like query languages.
• Most NoSQL databases are designed to store large quantities of data in a fault-tolerant way.
• NoSQL is simply the term that is used to describe a family of databases that are all non-relational. While the technologies, data types and use cases vary widely among them, it is generally agreed that there are four types of NoSQL databases. NoSQL databases can manage information using any of four data models : key-value, document-based, column-based and graph.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge

Examples and Advantages
• Examples of NoSQL databases :
a) CouchDB, an open source, JSON document-based database that uses JavaScript as its query language.
b) Elasticsearch, a document-based database that includes a full-text search engine.
c) Couchbase, a key-value and document database that empowers developers to build responsive and flexible applications for cloud, mobile and edge computing.
• Advantages :
a) NoSQL databases have a simple and flexible structure. They are schema-free.
b) NoSQL databases are based on key-value pairs.
c) Store types of NoSQL databases include column store, document store, key-value store, graph store, object store, XML store and other data store modes.
d) Each value in the database has a key. Some NoSQL database stores also allow developers to store serialized objects into the database, not just simple values.
e) Open-source NoSQL databases do not require expensive licensing fees and can run on inexpensive hardware, rendering their deployment cost-effective.
• Disadvantages :
a) Most NoSQL databases do not support reliability features that are natively supported by relational database systems.
b) In order to support reliability and consistency features, developers must implement their own proprietary code, which adds more complexity to the system.

CAP Theorem
• Fig. 2.1.1 shows the three properties of the CAP theorem.
• The theorem states that distributed data systems will offer a trade-off between consistency, availability and partition tolerance.
• The theorem also implies that any database can guarantee only two of the three properties :
a) Consistency : Every node in the cluster responds with the most recent data, even if the system must block the request until all replicas update. If you query a "consistent system" for an item that is currently updating, you will wait for that response until all replicas successfully update. However, you will receive the most current data.
b) Availability : Every node returns an immediate response, even if that response is not the most recent data. If you query an "available system" for an item that is updating, you will get the best possible answer the service can provide at that moment.
c) Partition tolerance : Guarantees that the system continues to operate even if a replicated data node fails or loses connectivity with other replicated data nodes.

Fig. 2.1.1 CAP theorem

Comparison of SQL and NoSQL Databases

Sr. No. | SQL | NoSQL
1. | SQL databases are relational. | NoSQL databases are non-relational.
2. | SQL databases are vertically scalable. | NoSQL databases are horizontally scalable.
3. | SQL databases use structured query language and have a predefined schema. | NoSQL databases have dynamic schemas for unstructured data.
4. | SQL databases are table-based. | NoSQL databases are document, key-value, graph or wide-column stores.
5. | SQL databases are better for multi-row transactions. | NoSQL is better for unstructured data like documents or JSON.

2.2 Aggregate Data Models
• Aggregate means a collection of objects that are treated as a unit. In NoSQL databases, an aggregate is a collection of data that interact as a unit. Moreover, these units of data or aggregates of data form the boundaries for the ACID operations.
• Aggregate data models make it easier for the database to manage data storage over the clusters, as the aggregate data or unit can now reside on any of the machines. Whenever data is retrieved from the database, all the data comes along with the aggregate.
• Aggregate data models in NoSQL do not support ACID transactions and sacrifice one of the ACID properties. With the help of aggregate data models in NoSQL, we can easily perform OLAP operations on the database.
• We can achieve high efficiency with aggregate data models in a NoSQL database if the transactions and interactions take place within the same aggregate.

Key-value store
• In the key-value structure, the key is usually a simple string of characters and the value is a series of uninterrupted bytes that are opaque to the database. A key-value store is like a relational database with only two columns : the key or attribute name and the value.
• Fig. 2.2.1 shows a key-value store.

Fig. 2.2.1 Key-value store (example pair : key "Aadhar", value 222208002022)

• A key-value store saves data as a group of key-value pairs, which are made up of two data items that are linked. The link between the items is a "key", which acts as an identifier for an item within the data, and the "value", which is the data that has been identified.
• The data itself is usually some primitive data type (string, integer, array) or a more complex object that an application needs to persist and access directly.
• This replaces the rigidity of relational schemas with a more flexible data model that allows developers to easily modify fields and object structures as their applications evolve.
• Key-value systems treat the data as a single opaque collection which may have different fields for every record.
• In each key-value pair :
a) The key is represented by an arbitrary string.
b) The value can be any kind of data like an image, file, text or document.
• Much of the appeal of key-value stores comes from the simplicity of this model.
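The key-value model described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not how any particular product is implemented; the class name `KeyValueStore` and the key "City" are assumptions made for the example.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: string keys, opaque byte values."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Retrieval is a direct lookup by key, with no query language involved.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Removing a missing key is treated as a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Aadhar", b"222208002022")
store.put("City", b"Pune")  # "City" is a hypothetical key for illustration
```

Because the value is just bytes, the application decides how to interpret it, which is why such stores cannot filter or join on the contents of a value.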
• Key-value stores are also portable and flexible.
• Advantages of key-value stores :
a) The secret to their speed lies in their simplicity. The path to retrieve data is a direct request to the object in memory or on disk. The relationship between data does not have to be calculated by a query language, so there is no optimization to perform.
b) They can exist on distributed systems and do not need to worry about where to store indexes.
• Disadvantages of key-value stores :
a) No complex query filters.
b) All joins must be done in code.
c) No foreign key constraints.
d) No triggers.

Document-based
• A document is an object whose keys (strings) have values of recognizable types, including numbers, Booleans and strings, as well as nested arrays and dictionaries. All the data for an item is stored together in one document instead of being spread across tables, so there is no need for cross-referencing.
• Fig. 2.2.2 shows the document based data model.

Fig. 2.2.2 Document based data model

• Document databases are designed for flexibility. They are not typically forced to have a schema and are therefore easy to modify.
• If an application requires the ability to store varying attributes along with large amounts of data, document databases are a good option.
• Document stores work with multiple formats including XML and JSON, allowing for storage and retrieval of data without an impedance mismatch.
• Terminologies in a document data store are as follows :
a) A table is called a collection.
b) A row is called a document.
c) A column is called a field.
• Typical use cases for document stores include the storage and retrieval of catalogs, blog posts, news articles and data analysis.
• MongoDB and Apache CouchDB are examples of popular document databases.
• Do not use document databases for transactions across multiple documents (records) or for ad hoc cross-document queries.
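To make the collection / document / field terminology concrete, here is a small sketch that models a collection as a list of Python dictionaries. It is an illustration only and does not use any real document database; the collection name `articles`, the field names and the helper `find` are all hypothetical.

```python
import json
from typing import Any, Dict, List

# A "collection" is modelled as a list of documents (plain dicts).
# Field names and values below are hypothetical examples.
articles: List[Dict[str, Any]] = []

# Documents in the same collection need not share the same fields.
articles.append({"_id": 1, "title": "Intro to NoSQL", "tags": ["nosql", "bda"]})
articles.append({
    "_id": 2,
    "title": "Sharding basics",
    "author": {"name": "A. Writer", "city": "Pune"},  # nested document
})


def find(collection: List[Dict[str, Any]], field: str, value: Any) -> List[Dict[str, Any]]:
    """Return the documents whose top-level field equals value."""
    return [doc for doc in collection if doc.get(field) == value]


# Documents serialise naturally to JSON, one of the formats document stores use.
as_json = json.dumps(articles[1], sort_keys=True)
```

Note that the second document has an `author` field and no `tags` field, which a fixed relational table would not allow without a schema change.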
• Advantages of document based model :
a) Faster retrieval of data.
b) Dynamic architecture for unstructured data and storage options.
c) Sharding for horizontal scalability.
d) Replication is managed internally, so chances of accidental loss of data are negligible.
• Disadvantages of document data model :
a) No views, triggers, scripts or stored procedures.
b) Relationships are not well defined.
c) No support for transactions, which could lead to data corruption.

Column-based
• Column-based models are also called "wide column" models. They enable very quick data access using a row key, column name and cell timestamp.
• The flexible schema of these types of databases means that the columns do not have to be consistent across records and you can add a column to specific rows without having to add it to every single record.
• It is also called a two-level map as it offers a two-level aggregate structure.
• As data is organized into columns, we have better indexing compared to key-value stores. Also, when it comes to updates, multiple column blocks can be aggregated.
• Google's implementation of a column store NoSQL database is called BigTable. Apparently, the data for the well-known Google email service, Gmail, is stored in the Google BigTable NoSQL database.
• The wide, columnar store data model, like that found in Apache Cassandra, is derived from Google's BigTable paper.
• Companies mostly use column data stores for data warehousing and data processing, which is evident in services such as Amazon Redshift.
• Advantages of column data stores :
a) Column stores are very efficient at data compression and/or partitioning.
b) Columnar stores can be loaded extremely fast.
c) Columnar databases are very scalable.
d) Due to their structure, columnar databases perform particularly well with aggregation queries.
• Disadvantages of column data stores :
a) Updates can be inefficient.
The fact that column families group attributes, as opposed to rows of tuples, works against them.
b) If multiple attributes are touched by a join or query, this may also lead to column storage experiencing slower performance.
c) It is also slower when deleting rows from columnar systems, as a record needs to be deleted from each of the record files.

Graph-based
• The modern graph database is a data storage and processing engine that makes the persistence and exploration of data and relationships more efficient.
• Graph-based data models store data in nodes that are connected by edges. These data models are widely used for storing huge volumes of complex aggregates and multidimensional data having many interconnections between them.
• In graph theory, structures are composed of vertices and edges, or what would later be called "data relationships".
• Graphs behave similarly to how people think, in specific relationships between discrete units of data. This database type is particularly useful for visualizing, analyzing or helping to find connections between different pieces of data.
• Graph databases are used to analyze customer interactions and social media, where it is crucial to traverse long relationship graphs.
• Advantages of graph data stores :
a) Greater flexibility in adapting your model.
b) Greater performance when traversing data relationships.

NoSQL Document Database : MongoDB
• MongoDB is an open-source document database that provides high performance, high availability and automatic scaling. MongoDB is one of the most popular open-source NoSQL databases and is written in C++.
• As of February 2015, MongoDB was the fourth most popular database management system. It was developed by the company 10gen, which is now known as MongoDB Inc.
• Why use MongoDB ?
a) Simple queries.
b) Functionality provided is applicable to most web applications.
c) Easy and fast integration of data.
d) No ERD diagram.
e) Not well suited for heavy and complex transaction systems.
• MongoDB does not provide a dedicated command to create a "database". You do not need to create it manually, because MongoDB will create it on the fly the first time you save a value into the defined collection (or table, in SQL terms) and database.
• MongoDB is a document-oriented database which stores data in JSON-like documents with dynamic schema. It means you can store your records without worrying about the data structure, such as the number of fields or the types of fields used to store values. MongoDB documents are similar to JSON objects.
• MongoDB stores data records as BSON documents. BSON is a binary representation of JSON documents, though it contains more data types than JSON.
• MongoDB stores data in documents instead of tables. You can change the structure of records simply by adding new fields or deleting existing ones. This ability of MongoDB helps you to represent hierarchical relationships and to store arrays and other more complex structures easily.
• MongoDB uses the Mongo server and Mongo shell commands to fetch records or other information from the database (i.e. collections).
• A few areas where MongoDB is ideal are big data, user data management, mobile and social infrastructure, content management and delivery and data hubs.
• A MongoDB instance may have zero or more databases. A database may have zero or more "collections". A collection may have zero or more "documents". A document may have one or more "fields". MongoDB "indexes" function much like their RDBMS counterparts.
• A database is a physical container for collections. A collection is a group of documents and is similar to an RDBMS table.
A document is a set of key-value pairs. Documents have dynamic schema.
• MongoDB documents are composed of field-and-value pairs and have the following structure :

   {
      field1: value1,
      field2: value2,
      field3: value3,
      ...
      fieldN: valueN
   }

• The value of a field can be any of the BSON data types, including other documents, arrays and arrays of documents. MongoDB supports many data types such as string, integer, boolean, double, arrays, timestamp, object, null, symbol, date, code and binary data.
• Fig. 2.2.3 shows the relation between SQL terms and MongoDB terms.

Fig. 2.2.3 Relation between SQL terms and MongoDB terms

• MongoDB can operate as a distributed data system. It uses sharding to store data across multiple machines, so that the cluster can grow with respect to the growth of load and demand.
• The sharding arrangement in MongoDB has mainly three components :

Fig. 2.2.4 Sharding by MongoDB

a) Shards or replica sets : Each shard serves as a separate replica set. They aim to increase the consistency and availability of the data.
b) Configuration servers : They are like the managers of the cluster. These servers contain the cluster's metadata. They hold the mapping of the cluster's data to the shards. When a query comes, query routers use these mappings from the config servers to target the required shard.
c) Query routers : The query routers are mongos instances which serve as interfaces for user applications. They take in the user queries from the applications and serve the applications with the required results.
• Advantages of MongoDB :
• MongoDB is a schema-less document type database.
• MongoDB supports field queries, range based queries and regular expression searches on the stored data.
• MongoDB supports primary and secondary indexes on any field.
• MongoDB supports replication of databases.
• MongoDB can be used as a file storage system, which is known as GridFS.

2.3 Schemaless Databases
• Since NoSQL does not require a schema, there is no blueprint on how data should be stored and the storage therefore varies between databases. Generally, there are two ways that NoSQL data storage functions :
1. On disk using B-Trees, with the top of the tree kept permanently in RAM.
2. In memory, where everything is in RAM using RB-Trees and anything stored on the disk is just an append.
• Schemaless databases are a type of NoSQL database that do not have a predefined schema or structure for data. This means that data can be inserted and retrieved without adhering to a specific structure and the database can adapt to changes in data over time without requiring schema migrations or changes.
• A schemaless database manages information without the need for a blueprint. Building a schemaless database does not rely on conforming to certain fields, tables or data model structures. There is no Relational Database Management System (RDBMS) to enforce any specific kind of structure. In other words, it is a non-relational database that can handle any database type, whether that be a key-value store, document store, in-memory, column-oriented or graph data model.
• In actuality, there is no such thing as a schema-less dataset :
1. In a relational database, the schema is explicit and created separately in advance.
2. In column-based databases, we create a fresh schema for each row and, in fact, we often reuse schema fragments from rows that are grouped together. The same is true for document databases.
3. In column-based and also in document databases, users directly query data based on the schema.
4. In graph-based databases, we are in essence building the schema as we build data.
• In all the above conditions, the data itself normally has a fairly consistent structure. With the schemaless MongoDB database, there is some additional structure. The system namespace contains an explicit list of collections and indexes. Collections may be implicitly or explicitly created; indexes must be explicitly declared.
• Benefits of using schemaless databases :
1. Flexibility : Schemaless databases allow for greater flexibility in data modeling.
2. Scalability : Schemaless databases are designed for scalability; they can handle large amounts of unstructured data with ease.
3. Reduced complexity : Schemaless databases can reduce the complexity of data modeling and development.
4. Good support for non-uniform data.
• Disadvantages :
1. Potentially inconsistent names and data types for a single value.
2. Management of the implicit schema migrates into the application layer.

2.4 Materialized Views
• Materialized views solve the problem of views. Views provide a mechanism to hide from the client whether data is derived data or base data. Views are used when data is to be accessed infrequently and the data in a table gets updated on a frequent basis.
• A materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site. A materialized view is like a cache, a copy of the data that can be accessed quickly.
• If a regular view is a saved query, a materialized view is a saved query result stored as a table.
• NoSQL databases do not have views, but they may have precomputed and cached queries and they reuse the term "materialized view" to describe them.
• Materialized views are used to :
1. Ease network loads.
2. Create a mass deployment environment.
3. Enable data subsetting.
4. Enable disconnected computing.
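The difference between a saved query and a saved query result can be illustrated with a short sketch. It assumes a hypothetical order list as base data and keeps a per-product total as the "materialized view", refreshed on every write; the names `orders`, `purchase_totals`, `add_order` and `recompute_totals` are invented for the example and the approach is a simplification, not any database's actual mechanism.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Base data: a list of (product, quantity) orders. Names are hypothetical.
orders: List[Tuple[str, int]] = []

# Materialized view: total quantity purchased per product, stored ready to read.
purchase_totals: Dict[str, int] = defaultdict(int)


def add_order(product: str, quantity: int) -> None:
    """Write the base data and refresh the stored result in the same step."""
    orders.append((product, quantity))
    purchase_totals[product] += quantity


def recompute_totals() -> Dict[str, int]:
    """What a regular view would do: rerun the query on every read."""
    totals: Dict[str, int] = defaultdict(int)
    for product, quantity in orders:
        totals[product] += quantity
    return dict(totals)


add_order("pen", 3)
add_order("book", 1)
add_order("pen", 2)
```

Reading `purchase_totals` costs one lookup regardless of how many orders exist, while `recompute_totals()` scans all of them; the price is extra work on every write, which is why this style pays off mainly when reads are much more frequent than writes.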
• The following methods are used for building a materialized view :
1. Eager approach : Update the materialized view at the same time as the base data for it is updated. In this case, adding an order would also update the purchase history aggregates for each product. This method is used when there are more frequent reads of the materialized view than writes.
2. The application database approach is valuable here as it makes it easier to ensure that any updates to base data also update materialized views.
• Materialized views can be built outside of the database by reading the data, computing the view and saving it back to the database.

2.5 Distribution Models
• A key ability of NoSQL is running a database on a large cluster. As data volumes increase, it becomes more difficult and expensive to scale up, that is, to buy a bigger server to run the database on.

Single Server
• The database is run on a single machine which handles all the reads and writes to the data store. Organizations prefer a single server because it eliminates all the complexities that the other options introduce.
• A single server is easy to manage for application developers. Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is more suited to the application.
• A single-server configuration is suitable for graph databases.
• If data usage is mostly about processing aggregates, then a single-server document or key-value store may be useful.

Sharding
• Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows larger datasets to be split and stored across multiple nodes.
• Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load.
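As a rough illustration of how a dataset might be spread across machines, the sketch below assigns each key to one of several shards using a hash of the key. Hash-based placement is only one possible scheme and is an assumption here, as are the shard count of 5 and the function names `shard_for`, `put` and `get`; real systems use their own partitioning rules.

```python
import hashlib
from typing import Dict, List

NUM_SHARDS = 5  # assumed cluster size, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each shard is modelled as its own dict; in practice each would be a server.
shards: List[Dict[str, str]] = [{} for _ in range(NUM_SHARDS)]


def put(key: str, value: str) -> None:
    # Every key lands in exactly one shard.
    shards[shard_for(key)][key] = value


def get(key: str) -> str:
    # Only the shard that owns the key has to be consulted.
    return shards[shard_for(key)][key]


for i in range(1000):
    put(f"user:{i}", f"profile-{i}")
```

Each key is stored in exactly one shard, so the shards together hold the whole dataset and a lookup touches a single node.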
Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads.
• Sharding is also known as data partitioning. Many NoSQL databases offer auto-sharding.
• Fig. 2.5.1 shows sharding.

Fig. 2.5.1 Sharding

• Sharding is the process of splitting a large dataset into many small partitions which are placed on different machines. Each partition is known as a "shard".
• Each shard has the same database schema as the original database. Most data is distributed such that each row appears in exactly one shard. The combined data from all shards is the same as the data from the original database. The load is balanced out nicely between servers; for example, if we have five servers, each one only has to handle 20 % of the load.
• The NoSQL framework is natively designed to support automatic distribution of the data across multiple servers, including the query load. Both data and query replacements are automatically distributed across multiple servers located in different geographic regions and this facilitates rapid, automatic and transparent replacement of the data or query instances without any disruption.
• Sharding is particularly valuable for performance because it can improve both read and write performance. Using replication, particularly with caching, can improve read performance but does little for applications that have a lot of writes.
• Advantages of sharding :
a) Faster performance : There are more servers available to handle input/output.
b) Horizontal scaling : We can quickly add additional servers to a cluster.
c) Costs : Horizontal scaling can often be less expensive than vertical scaling.
d) Distribution/uptime : A horizontally scaled distributed database can achieve better uptime than a traditional single server.

Master-Slave Replication
• With master-slave distribution, we replicate data across multiple nodes. One node is designated as the primary (master) and the others as secondaries (slaves). The master is responsible for processing any updates to the data and a replication process synchronizes the slaves with the master.
• The master is the authoritative source for the data.
It is responsible for processing any updates to that data. Masters can be appointed manually or automatically.
• In the event of failure of the master, a slave can be appointed as the new master.
• Fig. 2.5.2 shows master-slave replication.

Fig. 2.5.2 Master-slave replication (changes propagate to all slaves)

• Master-slave replication is most helpful for scaling when we have a read-intensive dataset. It will scale horizontally to handle more reads.
• This design offers read resilience. Even if one or more of the servers fails, the remaining servers can keep offering read access. This can help a lot with read-heavy applications, but will offer little benefit to write-intensive applications.
• Since the slaves are exact replicas of the master server, one of them can assume the role of the master in case the master fails. In fact, most of the time you can simply create a set of nodes and have them automatically decide who would be the master. There are some consistency issues that occur due to the delay in updating between master and slaves.
• Disadvantages :
1. Does not help with scalability of writes.
2. Provides resilience against failure of a slave, but not of a master.
3. The master is still a bottleneck.

Peer-to-Peer Replication
• In a peer-to-peer replication setup the various nodes are all "equals". Any node can accept reads as well as writes and the nodes communicate these writes to each other. In peer-to-peer replication, updates on any one server are replicated to all associated servers.
• Fig. 2.5.3 shows peer-to-peer replication.

Fig. 2.5.3 Peer-to-peer replication

• The advantage of this setup is its read and write resilience. One node's failure does not cause problems, as the remaining nodes can continue their work without losing a beat.
• The problem that arises is that of consistency. For example, we may get conflicting write requests that come to different nodes and then those nodes attempt to communicate those requests to the rest of the nodes. This could cause considerable inconsistencies.
This could considerable inconsistencies. « (There are various ways to resolve this problem. The most standard ap “would be to have the replicas communicate their writes first before they " them. \Once a majority of the replicas has confirmed a write, it can no} ideted as having been successfully performed and a response sent {0 | client. This requires a certain amount of network traffic in coordinating # writes. TECHNICAL PUBLIGATIONS® - an up-tmrust for knowledge eeere Ledges 4 rit pheadlnel coe pin Ale aH a combined to get sp better resporuehl-ve tee both " —msstrlave replication With Shard a ace ple rplcaton and Shrdng, ad Peer-to-peer replication with Sharding, ¢ We have multiple masters, but each data item only has a single maser # Anode canbe a master for some data and a slave for others, 2, Peerso-peer replication and Shading : ¢ A common strategy for column-family databases, ‘ « A good starting point for Peer-to-peer replication is to have a replication factor of 3 ali is present on three nodes, } fal Difference between Replication and Sharding i e py = Be eS) The salah onto secondary server ntodes.\ This can across servers using a shard key. ‘help increase data availability-and act as ze ie a pes an case the primary server Sharding distributes different data across. | multiple servers, / Each bit of data can be found in multiple places. é Replicated servers contain identical Sharded database servers each contain a | copies of the entire database. i Bach server acis as the single source for a subset of data part of the overall data, ie, they store different data on separate nodes, | Tt can improve both reads and wail Consistency © The CAP theorem is important when considering a distributed database, since we must make a decision about what we are willing to give up. The database we choose will lose either availability or consistency. 
• When reading about NoSQL databases we come across the concept of a quorum. A quorum is the minimal number of nodes that must respond to a read or write operation for it to be considered complete. Of course, having a maximum quorum and querying all servers is the surest way to determine the correct result.
• Consistency can be simply defined by how the copies of the same data may vary within the same replicated database system.
• Nowadays systems need to scale. The "traditional" monolithic database architecture, based on a powerful server, does not guarantee the high availability and network partition tolerance required by today's web-scale systems, as demonstrated by the CAP theorem. To achieve such requirements, systems cannot impose strong consistency.
• In the past, almost all architectures used in database systems were strongly consistent. In these cases, most architectures would have a single database instance only responding to a few hundred clients. Nowadays, many systems are accessed by hundreds of thousands of clients, so there is a mandatory requirement for system architectures that scale. However, considering the CAP theorem, high availability and consistency do conflict on distributed systems when subject to a network partition event.

Update Consistency
• Two users updating the same data item at the same time is called a write-write conflict.
• When the writes of the two users reach the server, the server will serialize them and decide to apply one, then the other. The first user's update would be applied and immediately overwritten by the second user's.
• In this case the first user's update is a lost update. Even where the lost update itself is not a big problem, we see this as a failure of consistency because the second user's update was based on the state before the first user's update, yet was applied after it.
• Approaches for maintaining consistency in the case of concurrency are often described as pessimistic or optimistic.
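The lost update described above is easy to reproduce, and one common way to detect it is to attach a version number to each record and refuse a write that was based on a stale version. The version-stamp technique is an assumption added here for illustration and is not spelled out in the text; the record name "phone" and the function names are hypothetical.

```python
from typing import Dict, Tuple

# One record stored as (value, version). The record name is hypothetical.
store: Dict[str, Tuple[str, int]] = {"phone": ("111", 1)}


def read(key: str) -> Tuple[str, int]:
    return store[key]


def blind_write(key: str, value: str) -> None:
    """Overwrite without checking what changed since the caller's read."""
    _, version = store[key]
    store[key] = (value, version + 1)


def conditional_write(key: str, value: str, expected_version: int) -> bool:
    """Apply the write only if nobody else has updated the record since."""
    _, version = store[key]
    if version != expected_version:
        return False  # conflict detected; the caller must re-read and retry
    store[key] = (value, version + 1)
    return True


# Both users read the same state, then write one after the other.
_, seen_by_first = read("phone")
_, seen_by_second = read("phone")
first_ok = conditional_write("phone", "222", seen_by_first)
second_ok = conditional_write("phone", "333", seen_by_second)
```

Here the second write is rejected instead of silently overwriting the first; had both users called `blind_write`, the first user's value would simply have been lost.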
* A pessimistic approach works by preventing conflicts from occurring; an optimistic approach lets conflicts occur, but detects them and takes action to sort them out.
* For update conflicts, the most common pessimistic approach is to use write locks : in order to change a value we need to acquire a lock, and the system ensures that only one client can hold the lock at a time.
* So both users would attempt to acquire the write lock, but only the first one would succeed. The second user would then see the result of the first user's write before deciding whether to make his own update.
* If there is more than one server, as in peer-to-peer replication, then two nodes might apply the updates in a different order, resulting in a different value for the same data item (a telephone number, say) on each peer. Distributed systems therefore aim for sequential consistency, i.e. ensuring that all nodes apply operations in the same order.
* Another optimistic way to handle a write-write conflict is to save both updates and record that they are in conflict.
* Replication makes it much more likely to run into write-write conflicts. If different nodes have different copies of some data which can be independently updated, then we will get conflicts unless we take specific measures to avoid them.
* Using a single node as the target for all writes for some data makes it much easier to maintain update consistency.

Read Consistency
* Problem : One user reads in the middle of another user's write. This is called a read-write conflict or an inconsistent read, and it leads to logical inconsistency.
* In NoSQL databases, read consistency refers to the level of consistency between multiple read operations on the same data. In a distributed database, where data can be replicated across multiple nodes, ensuring read consistency can be challenging.
* Aggregate-oriented databases do support atomic updates, but only within a single aggregate.
This means that we will have logical consistency within an aggregate but not between aggregates.
* The length of time an inconsistency is present is called the inconsistency window. A NoSQL system may have a quite short inconsistency window.
* There are different levels of read consistency available in NoSQL databases, ranging from eventual consistency to strong consistency.
* Read-your-writes consistency means that once we have updated a record, all our subsequent reads of that record will return the updated value.
* Session consistency means read-your-writes consistency, but at session level. A session can be identified with a conversation between a client and a server. As long as the conversation continues, we will read everything we have written during this conversation. If the session ends and we start another session with the same server, there is no guarantee that we can read the values we wrote during the previous conversation.
* Quorum consistency is used in systems where consistency is more important than availability (CAP theorem) for writes and reads.
* In systems with multiple replicas there is a possibility that the user reads inconsistent data. Suppose there are 2 replicas, N1 and N2, in a cluster. A user writes value v1 to node N1 and then another user reads from node N2, which is still behind N1 and does not yet have the value v1. The second user will therefore not get the consistent state of the data.
* In order to guarantee that at least one of the nodes we contact has consistent data, we use quorum consistency.
* Fig. 2.6.1 shows write and read quorums.

Fig. 2.6.1 : (a) Write quorums (b) Read quorums

Relaxing Durability
* Strict durability is not always essential and it can be traded for scalability.
* A simple way to relax durability is to store data in memory and flush it to disk regularly. If the system shuts down, we lose the updates that have not yet been flushed.

Cassandra
* Cassandra is a column-oriented NoSQL database.
* Cassandra was initially developed by Facebook to meet the needs of the company's Inbox Search service. In 2009, it became an Apache Incubator project.
* Apache Cassandra is an open source, distributed NoSQL database system that uses a shared-nothing architecture.
* Apache Cassandra was initially designed at Facebook using a Staged Event-Driven Architecture (SEDA) to implement a combination of Amazon's Dynamo distributed storage and replication techniques and Google's Bigtable data and storage engine model.
* A columnar database, also called a column-oriented database, is a database that stores the values of each column together, rather than storing the values of each row together. Cassandra is more precisely a wide-column store : it organizes data into rows and column families and keeps the rows of a partition together, so the two terms should not be treated as exact synonyms.
* Columnar databases are well suited for big data processing, Business Intelligence (BI) and analytics.
* Cassandra provides tunable consistency, i.e. users can choose the consistency level for each read and write operation. Cassandra enables users to configure the number of replicas in a cluster that must acknowledge a read or write operation before the operation is considered successful.
* Cassandra uses a gossip protocol to discover the state of all nodes in a cluster. Cassandra is designed to handle "big data" workloads by distributing data, reads and writes (eventually) across multiple nodes with no single point of failure.
* Features of Cassandra :
a) Scalability and performance : Cassandra is linearly scalable, i.e. throughput increases as we increase the number of nodes in the cluster.
b) Flexible data storage : Cassandra accommodates all possible data formats.
c) Transaction support : Cassandra provides atomicity, isolation and durability for writes within a single partition, together with tunable consistency. It does not provide full ACID transactions across rows or tables in the relational sense.
d) Easy data distribution : Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.

Cassandra Architecture
* Fig. 2.7.1 shows Cassandra architecture.

Fig. 2.7.1 Cassandra architecture

* The components of Cassandra architecture are node, data center, cluster, commit log, memtable, SSTable, Bloom filters and Cassandra Query Language.
* Node : A Cassandra node is a place where data is stored.
* Data center : A data center is a collection of related nodes.
* Cluster : A cluster is a component which contains one or more data centers.
* Commit log : In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is written to the commit log.
* Memtable : A memtable is a memory-resident data structure. After the commit log, the data is written to the memtable. Sometimes, for a single column family, there will be multiple memtables.
* SSTable : It is a disk file to which the data is flushed from the memtable when its contents reach a threshold value.
* Bloom filter : Bloom filters are very fast, probabilistic (nondeterministic) algorithms for testing whether an element is a member of a set. A Bloom filter is a special kind of cache. Bloom filters are accessed after every query.
* Each node in a Cassandra cluster also maintains a sequential commit log of write activity on disk to ensure data integrity. These writes are indexed and written to an in-memory structure called a memtable.
* A memtable can be thought of as a write-back cache where write I/O is directed to the cache and its completion is immediately confirmed to the host. This has the advantage of low latency and high throughput. The memtable structure is kept in Java heap memory by default.
* SSTables : When the commit log gets full, a flush is triggered and the contents of the memtable are written to disk into an SSTable data file. At the completion of this process the memtable is cleared and the commit log is recycled. Cassandra automatically partitions these writes and replicates them throughout the cluster.

Cassandra Data Model
* Some of the features of the Cassandra data model are as follows :
1) Data in Cassandra is stored as a set of rows that are organized into tables.
2) Tables are also called column families.
3) Each row is identified by a primary key value.
4) Data is partitioned by the primary key (more precisely, by its partition-key component).
* Data modelling in Cassandra uses a query-driven approach, in which specific queries are the key to organizing the data.
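To make the statement that data is partitioned by the primary key more concrete, the sketch below hashes a key to a token and uses the token to choose the nodes that store the row. It is a simplified model under stated assumptions : the node names, the use of MD5 as the hash, the modulo placement and the `replicas_for` helper are illustrative choices, not Cassandra's actual partitioner or API (Cassandra's default partitioner is Murmur3-based and works with token ranges).

```python
import hashlib
from typing import List

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names


def token_for(partition_key: str) -> int:
    """Hash the key to a non-negative integer token (MD5 chosen for illustration)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def replicas_for(partition_key: str, replication_factor: int = 3) -> List[str]:
    """Pick the owning node from the token, then the following nodes in order."""
    start = token_for(partition_key) % len(NODES)
    count = min(replication_factor, len(NODES))
    return [NODES[(start + i) % len(NODES)] for i in range(count)]


# Rows with the same key always map to the same nodes.
placement = replicas_for("user:42")
print(placement)
```

Because the placement depends only on the key, any node can work out where a row lives without consulting a central directory, which is why queries that supply the full partition key are the efficient ones.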
* The main goal of Cassandra data modeling is to design a high-performance and well-organized cluster.
* Apache Cassandra data model components include keyspaces, tables and columns :
a) Cassandra stores data as a set of rows organized into tables or column families.
b) A primary key value identifies each row.
c) The primary key partitions the data.
d) We can fetch data in part or in its entirety based on the primary key.
* The Cassandra data model provides a mechanism for data storage. The components of the Cassandra data model are keyspaces, tables and columns.

Fig. 2.7.2 Cassandra data model

Keyspaces :
* At a high level, the Cassandra NoSQL data model consists of data containers called keyspaces. Keyspaces are similar to the schema in a relational database. Typically, there are many tables in a keyspace.
* Features of keyspaces are :
a) A keyspace needs to be defined before creating tables, as there is no default keyspace.
b) A keyspace can contain any number of tables and a table belongs to only one keyspace. This represents a one-to-many relationship.
c) Replication is specified at the keyspace level. For example, a replication factor of three implies that each data row in the keyspace will have three copies.

Tables :
* Tables, also called column families in earlier iterations of Cassandra, are defined in the keyspaces. Tables store data in a set of rows and contain a primary key and a set of columns.
* Cassandra tables are used to hold the actual data in the form of rows and columns. A table in Cassandra must be created with its primary key at table creation time; after that the primary key cannot be altered. To change it, a new table must be created and the existing data copied into it. The primary key is used to locate and order the data.

Columns :
* Columns define the data structure within a table. There are various types of columns, such as Boolean, double, integer and text.
* A column can hold various types of data such as big integer, double, text, float and Boolean. Each column value has a timestamp associated with it that shows the time of update. Cassandra also provides collection types of columns such as list, set and map.
* Some of its features are :
a) Columns consist of various types, such as integer, big integer, text, float, double and Boolean.
b) Cassandra also provides collection types such as set, list and map.
c) Further, column values have an associated timestamp representing the time of update.
d) This timestamp can be retrieved using the function WRITETIME.

Cassandra Clients
* Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of languages. Thrift was developed at Facebook. Note that Thrift is the legacy interface : current Cassandra clients use CQL over the native protocol, and Thrift support was removed in Cassandra 4.0.
* A Client holds connections to a Cassandra cluster, allowing it to be queried. Each Client instance maintains multiple connections to the cluster nodes, provides policies to choose which node to use for each query, handles retries for failed queries, etc.
* Client instances are designed to be long-lived and usually a single instance is enough per application. As a given Client can only be "logged" into one keyspace at a time, it can make sense to create one client per keyspace used. This is however not necessary in order to query multiple keyspaces, since it is always possible to use a single session with fully qualified table names in queries.
* The Cassandra cluster is denoted as a ring. The idea behind this representation is to show token distribution.

Q.4 What is the difference between sharding and replication ?
Ans. : Sharded database servers each contain a part of the overall data, i.e. they store different data on separate nodes. Replicated servers contain identical copies of the entire database.
Q.5 How is sharding different from partitioning ?
Ans. :
All partitions of a table reside on the same server whereas sharding involves multiple servers. Therefore, sharding implies a distributed architecture whereas partitioning does not. Partitions can be horizontal (split by rows) or vertical (split by columns). Shards are usually only horizontal. In other words, all shards share the same schema but contain different records of the original table.
Q.6 What are write-write and read-write conflicts ?
Ans. : Write-write conflicts occur when two clients try to write the same data at the same time. Read-write conflicts occur when one client reads inconsistent data in the middle of another client's write.
Q.7 Define Cassandra.
Ans. : Cassandra is a distributed, fault tolerant, scalable, column oriented data store. Cassandra is a peer-to-peer distributed system made up of a cluster of nodes in which any node can accept a read or write request.
Q.8 What is the use of Bloom filters in Cassandra ?
Ans. : Bloom filters are used as a performance booster. Bloom filters are very fast, nondeterministic algorithms for testing whether an element is a member of a set. They are nondeterministic because it is possible to get a false-positive read from a Bloom filter, but not a false-negative. Bloom filters work by mapping the values in a data set into a bit array and condensing a larger data set into a digest string. A Bloom filter is a special kind of cache.
Q.9 Explain sorted strings table.
Ans. : A sorted strings table (SSTable) is a file format used by Cassandra to store the statistics and the data flushed from the memtables. Cassandra SSTables are immutable, hence any update on the table creates a new SSTable file. The data structure used by SSTables is the Log-Structured Merge (LSM) tree, which is better suited to write-intensive, heavy data sets than the traditional B-tree structure.
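The immutability described in the answer above can be sketched with a toy write path : writes go to an in-memory table, a flush turns its sorted contents into a new read-only segment, and an update to an existing key produces a newer segment instead of modifying an old one. `ToyStore`, its `flush_threshold` and the tuple-based segments are assumptions made for illustration. This is not Cassandra's on-disk format, and it leaves out the commit log, compaction and Bloom filters.

```python
class ToyStore:
    """Toy log-structured store: a memtable plus a list of immutable segments."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memtable = {}   # mutable, held in memory
        self.sstables = []   # each entry is an immutable, key-sorted tuple of pairs

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort by key and freeze the contents; existing segments are never modified.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.sstables):
            for k, v in segment:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
store.write("b", 1)
store.write("a", 2)   # threshold reached: the first segment is flushed
store.write("a", 3)   # the update goes to the memtable; the old segment is untouched
store.flush()         # creates a second segment instead of rewriting the first
```

After these calls the store holds two segments, and a read of `"a"` is answered from the newer one. Appending segments and never rewriting them in place is what makes this design attractive for write-heavy workloads.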
Q. Describe session consistency.
Ans. : Session consistency means read-your-writes consistency, but at session level. A session can be identified with a conversation between a client and a server. As long as the conversation continues, we will read everything we have written during this conversation. If the session ends and we start another session with the same server, there is no guarantee that we can read the values we wrote during the previous conversation.
Q.13 What are schemaless databases ?
Ans. : Schemaless databases are a type of NoSQL database that do not have a predefined schema or structure for data. This means that data can be inserted and retrieved without adhering to a specific structure, and the database can adapt to changes in data over time without requiring schema migrations or changes.
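As a small illustration of the answer above, the sketch below stores records with different fields in the same collection without declaring a schema first. The `collection` list, the `field_values` helper and the field names are made up for the example; a real schemaless store would add persistence, indexing and querying.

```python
collection = []  # stands in for a schemaless collection or bucket

# Records are inserted without a predefined structure.
collection.append({"id": 1, "name": "Asha"})
collection.append({"id": 2, "name": "Ravi", "email": "ravi@example.com"})
collection.append({"id": 3, "tags": ["nosql", "cassandra"]})


def field_values(records, field):
    """Return the values of a field from the records that actually have it."""
    return [record[field] for record in records if field in record]


names = field_values(collection, "name")
emails = field_values(collection, "email")
```

The price of this flexibility is that the application code, as in `field_values`, has to cope with records that lack a given field.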
