Module 5
Concurrency Control Techniques
Concurrency control techniques are used to isolate concurrently executing transactions. Most of these techniques ensure serializability of schedules. Some of the concurrency control techniques are: transaction timestamps, locking of data items, and multiversion concurrency control protocols.
Locking Techniques for Concurrency Control:
Locking of data items is one of the main techniques used to control the concurrent execution of transactions.
1) Define lock. List the different types of lock.
A lock is a variable associated with a data item in the database that describes the status of that item with respect to possible operations that can be applied to the item. Generally, there is one lock for each data item in the database.
Types of Locks:
1. Binary Locks 2. Shared and Exclusive Locks 3. Two-Phase Locking
2) Explain Binary Lock.
A binary lock has two states or values, i.e., locked and unlocked (or 1 and 0), and one of these values is associated with each database item X. If the value of the lock on the data item X is 1 (locked), then item X cannot be accessed by a database operation that requests the item. If the value of the lock on X is 0 (unlocked), then item X can be accessed when requested.
When binary locking is used, transactions include lock_item and unlock_item operations. A transaction requests access to an item X by issuing a lock_item(X) operation. If the lock on X is already set to 1, the transaction is forced to wait; otherwise, the transaction sets the lock to 1 (locks the item) and is allowed to access it. When the transaction is through using the item, it issues an unlock_item(X) operation, which sets the lock back to 0 so that X may be accessed by other transactions. Hence a binary lock enforces mutual exclusion on the data item. The DBMS has a lock manager subsystem to keep track of and control access to locks.
When the binary locking scheme is used, every transaction must obey the following rules:
1. A transaction T must issue the operation lock_item(X) before any read_item(X) or write_item(X) operations are performed in T.
2. A transaction T must issue the operation unlock_item(X) after all read_item(X) and write_item(X) operations are completed in T.
3. A transaction T will not issue a lock_item(X) operation if it already holds the lock on item X.
4. A transaction T will not issue an unlock_item(X) operation unless it already holds the lock on item X.
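As a hedged illustration of rules 1 through 4, the sketch below shows how a lock manager might implement binary locks in Python; the class and method names (BinaryLockManager, lock_item, unlock_item) are illustrative, not a standard DBMS API.

import threading

class BinaryLockManager:
    # Illustrative lock manager: one binary lock (0 or 1) per data item.
    def __init__(self):
        self._locks = {}                       # item -> threading.Lock
        self._table_guard = threading.Lock()   # protects the lock table itself

    def lock_item(self, X):
        with self._table_guard:
            lk = self._locks.setdefault(X, threading.Lock())
        lk.acquire()       # if the lock value is 1, the transaction waits here

    def unlock_item(self, X):
        self._locks[X].release()   # sets the lock value back to 0

# Usage inside a transaction T, following rules 1 and 2:
mgr = BinaryLockManager()
mgr.lock_item("X")
# ... read_item(X) / write_item(X) ...
mgr.unlock_item("X")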
3) Explain Shared and Exclusive Locks.
This technique allows several transactions to access the same item X if they all access X for reading purposes only. However, if a transaction is to write an item X, then that transaction must have exclusive access to X. For this purpose it uses a different kind of lock called a multiple-mode lock. In this scheme there are three locking operations: 1. read_lock(X), 2. write_lock(X), and 3. unlock(X).
A lock associated with an item X, LOCK(X), can have three possible states: "read-locked", "write-locked", or "unlocked". A read-locked item is also called share-locked, because other transactions are allowed to read the item. A write-locked item is called exclusive-locked, because a single transaction exclusively holds the lock on the item.
When we use the multiple-mode locking scheme, the system must enforce the following rules:
1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any read_item(X) operation is performed in T. A transaction T must issue the operation write_lock(X) before any write_item(X) operation is performed in T.
2. A transaction T must issue the operation unlock(X) after all read_item(X) and write_item(X) operations are completed in T.
3. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock or a write (exclusive) lock on item X.
4. A transaction T will not issue a write_lock(X) operation if it already holds a read lock or write lock on item X.
5. A transaction T will not issue an unlock(X) operation unless it already holds a read lock or write lock on item X.
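A hedged sketch of the three operations using a Python condition variable follows; the class name SharedExclusiveLockManager and its bookkeeping are illustrative, not a real DBMS interface.

import threading
from collections import defaultdict

class SharedExclusiveLockManager:
    # Illustrative shared/exclusive (read/write) locks, one per data item.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = defaultdict(int)   # item -> number of read locks held
        self._writer = {}                  # item -> True if write-locked

    def read_lock(self, X):
        with self._cond:
            while self._writer.get(X):                 # wait while X is write-locked
                self._cond.wait()
            self._readers[X] += 1                      # several readers may share X

    def write_lock(self, X):
        with self._cond:
            while self._writer.get(X) or self._readers[X] > 0:
                self._cond.wait()                      # wait for exclusive access to X
            self._writer[X] = True

    def unlock(self, X):
        with self._cond:
            if self._writer.get(X):
                self._writer[X] = False
            elif self._readers[X] > 0:
                self._readers[X] -= 1
            self._cond.notify_all()                    # wake transactions waiting on X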
4) Explain Two-Phase Locking (2PL). Also explain variations of 2PL.
A transaction is said to follow the two-phase locking protocol if all locking operations (read_lock, write_lock) precede the first unlock operation in the transaction. Such a transaction can be divided into two phases:
Expanding phase, during which new locks on items can be acquired but none can be released.
Shrinking phase, during which existing locks can be released, but no new locks can be acquired.
Two-phase locking may limit the amount of concurrency that can occur in a schedule. This is because a transaction T may not be able to release an item X after it is through using it if T must lock an additional item Y later on; that is, item X must remain locked by T until all items that the transaction needs have been locked. Meanwhile, another transaction seeking to access X may be forced to wait, even though transaction T no longer needs X. This type of two-phase locking is also called Basic 2PL.
There are a number of variations of two-phase locking; some of them are:
Conservative 2PL (or Static 2PL): This technique requires a transaction to lock all items it accesses before the transaction begins execution, by predeclaring its read set and write set. The read set of a transaction is the set of all items that the transaction reads, and the write set is the set of all items that the transaction writes.
Strict 2PL: In this variation, a transaction T does not release any of its exclusive (write) locks until after it commits or aborts. Hence no other transaction can read or write an item that is written by T unless T has committed, leading to a strict schedule for recoverability.
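The sketch below illustrates, in a hedged way, a transaction that obeys basic 2PL over two items X and Y using one binary lock per item; the names db, lock_x, lock_y, and transfer are made up for the example.

import threading

lock_x = threading.Lock()
lock_y = threading.Lock()
db = {"X": 100, "Y": 50}

def transfer(amount):
    # Expanding phase: acquire every lock that will be needed; release none.
    lock_x.acquire()
    lock_y.acquire()
    db["X"] -= amount      # write_item(X)
    db["Y"] += amount      # write_item(Y)
    # Shrinking phase: release locks; acquiring any new lock here would violate 2PL.
    lock_x.release()
    lock_y.release()
    # Under Strict 2PL, these releases would instead be deferred until commit or abort.

transfer(10)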
5) How does 2PL guarantee serializability?
Transactions T1 and T2 in Figure 22.3(a) do not follow the two-phase locking protocol because the write_lock(X) operation follows the unlock(Y) operation in T1, and similarly the write_lock(Y) operation follows the unlock(X) operation in T2. If we enforce two-phase locking, the transactions can be rewritten as T1′ and T2′, as shown in Figure 22.4. Now, the schedule shown in Figure 22.3(c) is not permitted for T1′ and T2′ (with their modified order of locking and unlocking operations) under the rules of locking, because T1′ will issue its write_lock(X) before it unlocks item Y; consequently, when T2′ issues its read_lock(X), it is forced to wait until T1′ releases the lock by issuing an unlock(X) in the schedule.
It can be proved that, if every transaction in a schedule follows the two-phase
locking protocol, the schedule is guaranteed to be serializable, obviating the
need to test for serializability of schedules. The locking protocol, by enforcing
two-phase locking rules, also enforces serializability.
6) Define deadlock. List the different types of Deadlock Prevention Protocols.
Deadlock is the situation where each of two transactions is waiting for the other to release the lock on an item. For example, consider two transactions T1 and T2, where T1 is waiting for T2 to release item X and T2 is waiting for T1 to release item Y. Neither transaction can proceed to unlock the item it holds, and no other transaction can access items X and Y.
Different Deadlock Prevention Protocols:
1. Deadlock prevention schemes based on the transaction timestamp concept
2. A group of deadlock prevention protocols that do not require timestamps
7) Explain deadlock prevention schemes based on the transaction timestamp concept.
Transaction timestamp TS(T): The transaction timestamp TS(T) is a unique identifier assigned to each transaction. The timestamps are ordered based on the order in which transactions are started; hence, if transaction T1 starts before transaction T2, then TS(T1) < TS(T2), i.e., the older transaction has the smaller timestamp value. The two deadlock prevention schemes under this concept are Wait-die and Wound-wait.
Wait-die: In wait-die, an older transaction is allowed to wait on a younger transaction, whereas a younger transaction requesting an item held by an older transaction is aborted and restarted. That is, if transaction Ti tries to lock an item X but is not able to, because X is locked by some other transaction Tj, then:
Wait-die: if TS(Ti) < TS(Tj) (Ti is older than Tj)
then Ti is allowed to wait;
else abort Ti (Ti dies) and restart it later.
Wound-wait: In wound-wait, a younger transaction is allowed to wait on an older one, whereas an older transaction requesting an item held by a younger transaction preempts the younger transaction by aborting it. That is, if transaction Ti tries to lock an item X but is not able to, because X is locked by some other transaction Tj, then:
Wound-wait: if TS(Ti) < TS(Tj) (Ti is older than Tj)
then abort Tj (Ti wounds Tj) and restart it later;
else Ti is allowed to wait.
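A minimal sketch of the two decision rules follows, assuming each transaction object carries a ts (timestamp) attribute; the function names and return codes are illustrative, and the actual abort/restart machinery is omitted.

from collections import namedtuple

T = namedtuple("T", "ts")

def wait_die(Ti, Tj):
    # Ti requests an item currently locked by Tj.
    if Ti.ts < Tj.ts:          # Ti is older than Tj
        return "WAIT"          # the older requester is allowed to wait
    return "ABORT_Ti"          # the younger requester dies (restarted later)

def wound_wait(Ti, Tj):
    # Ti requests an item currently locked by Tj.
    if Ti.ts < Tj.ts:          # Ti is older than Tj
        return "ABORT_Tj"      # the older requester wounds (aborts) the younger holder
    return "WAIT"              # the younger requester waits

print(wait_die(T(ts=1), T(ts=2)))     # older requester -> WAIT
print(wound_wait(T(ts=2), T(ts=1)))   # younger requester -> WAIT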
8) Explain the group of deadlock prevention protocols that do not require timestamps.
No waiting algorithm: According to this algorithm, if a transaction is unable to obtain a lock on an item, it is immediately aborted and then restarted after a certain time delay, without checking whether a deadlock will actually occur or not.
Cautious waiting algorithm: This approach tries to reduce the number of needless aborts/restarts. Suppose transaction Ti tries to lock an item X but is not able to do so because X is locked by transaction Tj; then the cautious waiting rule is:
Cautious waiting: if Tj is not blocked (not waiting for some other locked item)
then Ti is blocked and is allowed to wait;
else abort Ti.
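In the same illustrative style, a hedged sketch of the cautious waiting rule, assuming each transaction object exposes a boolean is_blocked attribute:

def cautious_waiting(Ti, Tj):
    # Ti requests an item currently locked by Tj.
    if not Tj.is_blocked:      # Tj is not itself waiting for some other locked item
        return "WAIT"          # Ti is blocked and allowed to wait
    return "ABORT_Ti"          # otherwise abort Ti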
9) Explain how to deal with deadlock with the use of timeouts.
In this scheme, if a transaction waits longer than a system-defined timeout period, the system assumes that the transaction is deadlocked and aborts it, regardless of whether a deadlock situation actually exists.
Livelock: A transaction is said to be in a state of livelock if it cannot proceed for an indefinite period of time while other transactions in the system continue normally. This may occur if the waiting scheme for locked items is unfair, giving priority to some transactions over others.
Solution for livelock: One standard solution for this problem is the first-come-first-served scheme. In this scheme, transactions are enabled to lock an item in the order in which they originally requested to lock the item.
Starvation: Starvation is a problem similar to livelock; it can occur in the algorithms for dealing with deadlock. It occurs if an algorithm selects the same transaction as victim repeatedly, causing it to abort and never finish execution. The wait-die and wound-wait schemes avoid starvation.
10) Explain the Timestamp Ordering (TO) Algorithm.
The idea of this scheme is to order the transactions based on their timestamps. A schedule in which the transactions participate is then serializable, and the only equivalent serial schedule permitted has the transactions in order of their timestamp values. This is called timestamp ordering (TO).
The algorithm must ensure that, for each item accessed by conflicting operations in the schedule, the order in which the item is accessed does not violate the timestamp order. To do this, the algorithm associates with each database item X two timestamp (TS) values:
1. read_TS(X). The read timestamp of item X is the largest timestamp among all the timestamps of transactions that have successfully read item X; that is, read_TS(X) = TS(T), where T is the youngest transaction that has read X successfully.
2. write_TS(X). The write timestamp of item X is the largest of all the timestamps of transactions that have successfully written item X; that is, write_TS(X) = TS(T), where T is the youngest transaction that has written X successfully.
Basic Timestamp Ordering (TO): Whenever some transaction T tries to issue a read_item(X) or a write_item(X) operation, the basic TO algorithm compares the timestamp of T with read_TS(X) and write_TS(X) to ensure that the timestamp order of transaction execution is not violated. If this order is violated, then transaction T is aborted and resubmitted to the system as a new transaction with a new timestamp. If T is aborted and rolled back, any transaction T1 that may have used a value written by T must also be rolled back. Similarly, any transaction T2 that may have used a value written by T1 must also be rolled back, and so on. This effect is known as cascading rollback and is one of the problems associated with basic TO, since the schedules produced are not guaranteed to be recoverable.
An additional protocol must be enforced to ensure that the schedules are recoverable, cascadeless, or strict. We first describe the basic TO algorithm here. The concurrency control algorithm must check whether conflicting operations violate the timestamp ordering in the following two cases:
1. Whenever a transaction T issues a write_item(X) operation, the following is checked:
a. If read_TS(X) > TS(T) or if write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction with a timestamp greater than TS(T), and hence after T in the timestamp ordering, has already read or written the value of item X before T had a chance to write X, thus violating the timestamp ordering.
b. If the condition in part (a) does not occur, then execute the write_item(X) operation of T and set write_TS(X) to TS(T).
2. Whenever a transaction T issues a read_item(X) operation, the following is checked:
a. If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction with a timestamp greater than TS(T), and hence after T in the timestamp ordering, has already written the value of item X before T had a chance to read X.
b. If write_TS(X) ≤ TS(T), then execute the read_item(X) operation of T and set read_TS(X) to the larger of TS(T) and the current read_TS(X).
Whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect
order, it rejects the later of the two operations by aborting the transaction that issued it. The
schedules produced by basic TO are hence guaranteed to be conflict serializable, like the 2PL
protocol.
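A hedged Python sketch of the two basic TO checks above follows; ItemTS and AbortTransaction are illustrative names, and aborting is represented by raising an exception rather than by a real rollback and resubmission.

class ItemTS:
    # Per-item timestamps maintained by the scheduler.
    def __init__(self):
        self.read_TS = 0
        self.write_TS = 0

class AbortTransaction(Exception):
    pass

def write_item(T_ts, item):
    # Case 1: reject the write if a younger transaction already read or wrote X.
    if item.read_TS > T_ts or item.write_TS > T_ts:
        raise AbortTransaction("roll back T; resubmit with a new timestamp")
    item.write_TS = T_ts                       # case 1(b): perform the write

def read_item(T_ts, item):
    # Case 2: reject the read if a younger transaction already wrote X.
    if item.write_TS > T_ts:
        raise AbortTransaction("roll back T; resubmit with a new timestamp")
    item.read_TS = max(item.read_TS, T_ts)     # case 2(b): perform the read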
Strict Timestamp Ordering (TO): A variation of basic TO called strict TO ensures that the schedules are both strict (for easy recoverability) and (conflict) serializable. In this variation, a transaction T that issues a read_item(X) or write_item(X) such that TS(T) > write_TS(X) has its read or write operation delayed until the transaction T′ that wrote the value of X (hence TS(T′) = write_TS(X)) has committed or aborted. To implement this algorithm, it is necessary to simulate the locking of an item X that has been written by transaction T′ until T′ is either committed or aborted. This algorithm does not cause deadlock, since T waits for T′ only if TS(T) > TS(T′).
Thomas's Write Rule: A modification of the basic TO algorithm, known as Thomas's write rule, does not enforce conflict serializability, but it rejects fewer write operations by modifying the checks for the write_item(X) operation as follows:
1. If read_TS(X) > TS(T), then abort and roll back T and reject the operation.
2. If write_TS(X) > TS(T), then do not execute the write operation but continue processing. This is because some transaction with a timestamp greater than TS(T), and hence after T in the timestamp ordering, has already written the value of X. Thus, we must ignore the write_item(X) operation of T because it is already outdated and obsolete. Notice that any conflict arising from this situation would be detected by case (1).
3. If neither the condition in part (1) nor the condition in part (2) occurs, then execute the write_item(X) operation of T and set write_TS(X) to TS(T).
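For comparison with the basic TO write check sketched earlier, a hedged sketch of the write check under Thomas's write rule; item is any object with read_TS and write_TS attributes, as in the earlier sketch.

class AbortTransaction(Exception):
    pass

def write_item_thomas(T_ts, item):
    # Rule 1: a younger transaction has already read X, so abort T.
    if item.read_TS > T_ts:
        raise AbortTransaction("roll back T and reject the operation")
    # Rule 2: a younger transaction has already written X, so silently ignore
    # this outdated write and continue processing.
    if item.write_TS > T_ts:
        return
    # Rule 3: otherwise perform the write.
    item.write_TS = T_ts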
Introduction to NOSQL Systems
Emergence of NOSQL Systems
Many companies and organizations are faced with applications that store vast amounts of data. Consider a free e-mail application, such as Google Mail or Yahoo Mail or another similar service; this application can have millions of users, and each user can have thousands of e-mail messages. There is a need for a storage system that can manage all these e-mails; a structured relational SQL system may not be appropriate because (1) SQL systems offer too many services (powerful query language, concurrency control, etc.), which this application may not need; and (2) a structured data model such as the traditional relational model may be too restrictive. Although newer relational systems do have more complex object-relational modeling options (see Chapter 12), they still require schemas, which are not required by many of the NOSQL systems.
As another example, consider an application such as Facebook, with millions of users who submit posts, many with images and videos; these posts must then be displayed on pages of other users using the social media relationships among the users. User profiles, user relationships, and posts must all be stored in a huge collection of data stores, and the appropriate posts must be made available to the sets of users that have signed up to see these posts. Some of the data for this type of application is not suitable for a traditional relational system and typically needs multiple types of databases and data storage systems.
Some of the organizations that were faced with these data management and storage
applications decided to develop their own systems:
■ Google developed a proprietary NOSQL system known as BigTable, which is used in many of Google's applications that require vast amounts of data storage, such as Gmail, Google Maps, and Web site indexing. Apache Hbase is an open source NOSQL system based on similar concepts. Google's innovation led to the category of NOSQL systems known as column-based or wide column stores; they are also sometimes referred to as column family stores.
■ Amazon developed a NOSQL system called DynamoDB that is available through Amazon's cloud services. This innovation led to the category known as key-value data stores or sometimes key-tuple or key-object data stores.
■ Facebook developed a NOSQL system called Cassandra, which is now open source and known as Apache Cassandra. This NOSQL system uses concepts from both key-value stores and column-based systems.
■ Other software companies started developing their own solutions and making them available to users who need these capabilities, for example MongoDB and CouchDB, which are classified as document-based NOSQL systems or document stores.
■ Another category of NOSQL systems is the graph-based NOSQL systems, or graph databases; these include Neo4J and GraphBase, among others.
■ Some NOSQL systems, such as OrientDB, combine concepts from many of the categories discussed above.
■ In addition to the newer types of NOSQL systems listed above, it is also possible to classify database systems based on the object model (see Chapter 12) or on the native XML model (see Chapter 13) as NOSQL systems, although they may not have the high-performance and replication characteristics of the other types of NOSQL systems.
These are just a few examples of NOSQL systems that have been developed. There are
many systems, and listing all of them is beyond the scope of our presentation.
Characteristics of NOSQL Systems
We now discuss the characteristics of many NOSQL systems and how these systems differ from traditional SQL systems. We divide the characteristics into two categories: those related to distributed databases and distributed systems, and those related to data models and query languages.
NOSQL characteristics related to distributed databases and distributed systems. NOSQL
systems emphasize high availability, so replicating the data is inherent in many of these
systems. Scalability is another important characteristic, because many of the applications
that use NOSQL systems tend to have data that keeps growing in volume. High
performance is another required characteristic, whereas serializable consistency may not
be as important for some of the NOSQL applications. We discuss some of these
characteristics next.
1. Scalability: As we discussed in Section 23.1.4, there are two kinds of scalability in distributed systems: horizontal and vertical. In NOSQL systems, horizontal scalability is generally used, where the distributed system is expanded by adding more nodes for data storage and processing as the volume of data grows. Vertical scalability, on the other hand, refers to expanding the storage and computing power of existing nodes. In NOSQL systems, horizontal scalability is employed while the system is operational, so techniques for distributing the existing data among new nodes without interrupting system operation are necessary. We will discuss some of these techniques in Sections 24.3 through 24.6 when we discuss specific systems.
2. Availability, Replication and Eventual Consistency: Many applications that use NOSQL systems require continuous system availability. To accomplish this, data is replicated over two or more nodes in a transparent manner, so that if one node fails, the data is still available on other nodes. Replication improves data availability and can also improve read performance, because read requests can often be serviced from any of the replicated data nodes. However, write performance becomes more cumbersome because an update must be applied to every copy of the replicated data items; this can slow down write performance if serializable consistency is required (see Section 23.3). Many NOSQL applications do not require serializable consistency, so more relaxed forms of consistency known as eventual consistency are used. We discuss this in more detail in Section 24.2.
3. Replication Models: Two major replication models are used in NOSQL systems: master-slave and master-master replication. Master-slave replication requires one copy to be the master copy; all write operations must be applied to the master copy and then propagated to the slave copies, usually using eventual consistency (the slave copies will eventually be the same as the master copy). For reads, the master-slave paradigm can be configured in various ways. One configuration requires all reads to also be at the master copy, so this would be similar to the primary site or primary copy methods of distributed concurrency control (see Section 23.3.1), with similar advantages and disadvantages. Another configuration would allow reads at the slave copies but would not guarantee that the values are the latest writes, since writes to the slave nodes can be done after they are applied to the master copy. Master-master replication allows reads and writes at any of the replicas but may not guarantee that reads at nodes that store different copies see the same values. Different users may write the same data item concurrently at different nodes of the system, so the values of the item will be temporarily inconsistent. A reconciliation method to resolve conflicting write operations on the same data item at different nodes must be implemented as part of the master-master replication scheme.
4. Sharding of Files: In many NOSQL applications, files (or collections of data objects) can have many millions of records (or documents or objects), and these records can be accessed concurrently by thousands of users. So it is not practical to store the whole file in one node. Sharding (also known as horizontal partitioning; see Section 23.2) of the file records is often employed in NOSQL systems. This serves to distribute the load of accessing the file records to multiple nodes. The combination of sharding the file records and replicating the shards works in tandem to improve load balancing as well as data availability. We will discuss some of the sharding techniques in Sections 24.3 through 24.6 when we discuss specific systems.
5. High-Performance Data Access: In many NOSQL applications, it is necessary to find individual records or objects (data items) from among the millions of data records or objects in a file. To achieve this, most systems use one of two techniques: hashing or range partitioning on object keys. The majority of accesses to an object will be by providing the key value rather than by using complex query conditions. The object key is similar to the concept of an object id (see Section 12.1). In hashing, a hash function h(K) is applied to the key K, and the location of the object with key K is determined by the value of h(K). In range partitioning, the location is determined via a range of key values; for example, location i would hold the objects whose key values K are in the range Kimin ≤ K ≤ Kimax. In applications that require range queries, where multiple objects within a range of key values are retrieved, range partitioning is preferred. Other indexes can also be used to locate objects based on attribute conditions different from the key K. We will discuss some of the hashing, partitioning, and indexing techniques in Sections 24.3 through 24.6 when we discuss specific systems.
NOSQL characteristics related to data models and query languages. NOSQL systems
emphasize performance and flexibility over modeling power and complex querying. We
discuss some of these characteristics next.
1. Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many NOSQL systems by allowing semi-structured, self-describing data (see Section 13.1). The users can specify a partial schema in some systems to improve storage efficiency, but it is not required to have a schema in most of the NOSQL systems. As there may not be a schema to specify constraints, any constraints on the data would have to be programmed in the application programs that access the data items. There are various languages for describing semistructured data, such as JSON (JavaScript Object Notation) and XML (Extensible Markup Language; see Chapter 13). JSON is used in several NOSQL systems, but other methods for describing semi-structured data can also be used. We will discuss JSON in Section 24.3 when we present document-based NOSQL systems.
2. Less Powerful Query Languages: Many applications that use NOSQL systems may not require a powerful query language such as SQL, because search (read) queries in these systems often locate single objects in a single file based on their object keys. NOSQL systems typically provide a set of functions and operations as a programming API (application programming interface), so reading and writing the data objects is accomplished by calling the appropriate operations by the programmer. In many cases, the operations are called CRUD operations, for Create, Read, Update, and Delete. In other cases, they are known as SCRUD because of an added Search (or Find) operation. Some NOSQL systems also provide a high-level query language, but it may not have the full power of SQL; only a subset of SQL querying capabilities would be provided. In particular, many NOSQL systems do not provide join operations as part of the query language itself; the joins need to be implemented in the application programs.
3. Versioning: Some NOSQL systems provide storage of multiple versions of the data items, with
the timestamps of when the data version was created. We will discuss this aspect in Section 24.5 when
we present column-based NOSQL systems.
In the next section, we give an overview of the various categories of NOSQL systems.
Categories of NOSQL Systems
NOSQL systems have been characterized into four major categories, with some additional
categories that encompass other types of systems. The most common categorization lists
the following four major categories:
1. Document-based NOSQL systems: These systems store data in the form of documents using well-
known formats, such as JSON (JavaScript Object Notation). Documents are accessible via their
document id, but can also be accessed rapidly using other indexes.
2. NOSQL key-value stores: These systems have a simple data model based on fast
access by the key to the value associated with the key; the value can be a record or an object or a
document or even have a more complex data structure.
3. Column-based or wide column NOSQL systems: These systems partition a table by column into column families (a form of vertical partitioning; see Section 23.2), where each column family is stored in its own files. They also allow versioning of data values.
4. Graph-based NOSQL systems: Data is represented as graphs, and related nodes can be found
by traversing the edges using path expressions.
Additional categories can be added as follows to include some systems that are not easily categorized into the above four categories, as well as some other types of systems that have been available even before the term NOSQL became widely used.
5. Hybrid NOSQL systems: These systems have characteristics from two or more of the above four categories.
6. Object databases: These systems were discussed in Chapter 12.
7. XML databases: We discussed XML in Chapter 13.
Even keyword-based search engines store large amounts of data with fast search access, so
the stored data can be considered as large NOSQL big data stores.
The rest of this chapter is organized as follows. In each of Sections 24.3 through 24.6, we will discuss one of the four main categories of NOSQL systems and elaborate further on which characteristics each category focuses on. Before that, in Section 24.2, we discuss in more detail the concept of eventual consistency, and we discuss the associated CAP theorem.
The CAP Theorem
When we discussed concurrency control in distributed databases in Section 23.3, we assumed that the distributed database system (DDBS) is required to enforce the ACID properties (atomicity, consistency, isolation, durability) of transactions that are running concurrently (see Section 20.3). In a system with data replication, concurrency control becomes more complex because there can be multiple copies of each data item. So if an update is applied to one copy of an item, it must be applied to all other copies in a consistent manner. The possibility exists that one copy of an item X is updated by a transaction T1 whereas another copy is updated by a transaction T2, so two inconsistent copies of the same item exist at two different nodes in the distributed system. If two other transactions T3 and T4 want to read X, each may read a different copy of item X.
We saw in Section 23.3 that there are distributed concurrency control methods that do not allow this inconsistency among copies of the same data item, thus enforcing serializability and hence the isolation property in the presence of replication. However, these techniques often come with high overhead, which would defeat the purpose of creating multiple copies to improve performance and availability in distributed database systems such as NOSQL. In the field of distributed systems, there are various levels of consistency among replicated data items, from weak consistency to strong consistency. Enforcing serializability is considered the strongest form of consistency, but it has high overhead, so it can reduce the performance of read and write operations and hence adversely affect system performance.
The CAP theorem, which was originally introduced as the CAP principle, can be used to
explain some of the competing requirements in a distributed system with replication. The
three letters in CAP refer to three desirable properties of distributed systems with replicated
data: consistency (among replicated copies), availability (of the system for read and write
operations) and partition tolerance (in the face of the nodes in the system being
partitioned by a network fault). Availability means that each read or write request for a data
item will either be processed successfully or will receive a message that the operation cannot
be completed. Partition tolerance means that the system can continue operating if the
network connecting the nodes has a fault that results in two or more partitions, where the
nodes in each partition can only communicate among each other. Consistency means that
the nodes will have the same copies of a replicated data item visible for various
transactions.
It is important to note here that the use of the word consistency in CAP and its use in
ACID do not refer to the same identical concept. In CAP, the term consistency refers to
the consistency of the values in different copies of the same data item in a replicated
distributed system. In ACID, it refers to the fact that a transaction will not violate the
integrity constraints specified on the database schema. However, if we consider that the
consistency of replicated copies is a specified constraint, then the two uses of the term
consistency would be related.
The CAP theorem states that it is not possible to guarantee all three of the desirable properties (consistency, availability, and partition tolerance) at the same time in a distributed system with data replication. If this is the case, then the distributed system designer would have to choose two properties out of the three to guarantee. It is generally assumed that in many traditional (SQL) applications, guaranteeing consistency through the ACID properties is important. On the other hand, in a NOSQL distributed data store, a weaker consistency level is often acceptable, and guaranteeing the other two properties (availability, partition tolerance) is important. Hence, weaker consistency levels are often used in NOSQL systems instead of guaranteeing serializability. In particular, a form of consistency known as eventual consistency is often adopted in NOSQL systems. In Sections 24.3 through 24.6, we will discuss some of the consistency models used in specific NOSQL systems.
The next four sections of this chapter discuss the characteristics of the four main categories of NOSQL systems. We discuss document-based NOSQL systems in Section 24.3, and we use MongoDB as a representative system. In Section 24.4, we discuss NOSQL systems known as key-value stores. In Section 24.5, we give an overview of column-based NOSQL systems, with a discussion of Hbase as a representative system. Finally, we introduce graph-based NOSQL systems in Section 24.6.
Document-Based NOSQL Systems and MongoDB
Document-based or document-oriented NOSQL systems typically store data as collections of similar documents. These types of systems are also sometimes known as document stores. The individual documents somewhat resemble complex objects (see Section 12.3) or XML documents (see Chapter 13), but a major difference between document-based systems and object, object-relational, and XML systems is that there is no requirement to specify a schema; rather, the documents are specified as self-describing data (see Section 13.1). Although the documents in a collection should be similar, they can have different data elements (attributes), and new documents can have new data elements that do not exist in any of the current documents in the collection. The system basically extracts the data element names from the self-describing documents in the collection, and the user can request that the system create indexes on some of the data elements. Documents can be specified in various formats, such as XML (see Chapter 13). A popular language to specify documents in NOSQL systems is JSON (JavaScript Object Notation).
There are many document-based NOSQL systems, including MongoDB and CouchDB, among many others. We will give an overview of MongoDB in this section. It is important to note that different systems can use different models, languages, and implementation methods, but giving a complete survey of all document-based NOSQL systems is beyond the scope of our presentation.
MongoDB Data Model
MongoDB documents are stored in BSON (Binary JSON) format, which is a variation of JSON with some additional data types and is more efficient for storage than JSON. Individual documents are stored in a collection. We will use a simple example based on the COMPANY database that we used throughout this book. The operation createCollection is used to create each collection. For example, the following command can be used to create a collection called project to hold PROJECT objects from the COMPANY database (see Figures 5.5 and 5.6):
db.createCollection(“project”, { capped : true, size : 1310720, max : 500 } )
The first parameter “project” is the name of the collection, which is followed by an optional document that specifies collection options. In our example, the collection is capped; this means it has upper limits on its storage space (size) and number of documents (max). The capping parameters help the system choose the storage options for each collection. There are other collection options, but we will not discuss them here. For our example, we will create another document collection called worker to hold information about the EMPLOYEEs who work on each project; for example:
db.createCollection(“worker”, { capped : true, size : 5242880, max : 2000 } )
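The two commands above use the MongoDB shell. As a hedged illustration, roughly equivalent collections could be created from Python with the pymongo driver; the connection URI and the database name companydb are assumptions made for this example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
db = client["companydb"]                            # assumed database name

# Capped collections with upper limits on storage size (bytes) and document
# count, mirroring the shell commands above.
db.create_collection("project", capped=True, size=1310720, max=500)
db.create_collection("worker", capped=True, size=5242880, max=2000)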
Each document in a collection has a unique ObjectId field, called _id, which is automatically indexed in the collection unless the user explicitly requests no index for the _id field. The value of ObjectId can be specified by the user, or it can be system-generated if the user does not specify an _id field for a particular document. System-generated ObjectIds have a specific format, which combines the timestamp when the object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes), the process id (2 bytes), and a counter (3 bytes) into a 12-byte Id value. User-generated ObjectIds can have any value specified by the user as long as it uniquely identifies the document, and so these Ids are similar to primary keys in relational systems.
A collection does not have a schema. The structure of the data fields in documents is chosen based on how documents will be accessed and used, and the user can choose a normalized design (similar to normalized relational tuples) or a denormalized design (similar to XML documents or complex objects). Interdocument references can be specified by storing in one document the ObjectId or ObjectIds of other related documents.
Figure 24.1(a) shows a simplified MongoDB document showing some of the data from
Figure 5.6 from the COMPANY database example that is used throughout the book. In our
example, the _id values are user-defined, and the documents whose _id starts with P (for
project) will be stored in the “project” collection, whereas those whose _id starts with W
(for worker) will be stored in the “worker” collection.
In Figure 24.1(a), the workers information is embedded in the project document, so there is no need for the “worker” collection. This is known as the denormalized pattern, which is similar to creating a complex object (see Chapter 12) or an XML document (see Chapter 13). A list of values that is enclosed in square brackets [ … ] within a document represents a field whose value is an array.
Another option is to use the design in Figure 24.1(b), where worker references are embedded in the project document, but the worker documents themselves are stored in a separate “worker” collection. A third option, in Figure 24.1(c), would use a normalized design, similar to First Normal Form relations (see Section 14.3.4). The choice of which design option to use depends on how the data will be accessed.
It is important to note that the simple design in Figure 24.1(c) is not the general normalized design for a many-to-many relationship, such as the one between employees and projects; rather, we would need three collections for “project”, “employee”, and “works_on”, as we discussed in detail in Section 9.1. Many of the design tradeoffs discussed in Chapters 9 and 14 (for first normal form relations and for ER-to-relational mapping options) and in Chapters 12 and 13 (for complex objects and XML) are applicable for choosing the appropriate design for document structures and document collections, so we will not repeat the discussions here.
Figure 24.1 Example of simple documents in MongoDB. (a) Denormalized document design with embedded subdocuments. (b) Embedded array of document references. (c) Normalized documents. (d) Inserting the documents in (c) into their collections.
(a) project document with an array of embedded workers:
{
_id: “P1”,
Pname: “ProductX”,
Plocation: “Bellaire”,
Workers: [
{ Ename: “John Smith”, Hours: 32.5 },
{ Ename: “Joyce English”, Hours: 20.0 }
]
}
(b) project document with an embedded array of worker ids:
{
_id: “P1”,
Pname: “ProductX”,
Plocation: “Bellaire”,
WorkerIds: [ “W1”, “W2” ]
}
{ _id: “W1”,
Ename: “John Smith”,
Hours: 32.5
}
{ _id: “W2”,
Ename: “Joyce English”,
Hours: 20.0
}
(c) normalized project and worker documents (not a fully normalized design for M:N relationships):
{
_id: “P1”,
Pname: “ProductX”,
Plocation: “Bellaire”
}
{
_id: “W1”,
Ename: “John Smith”,
ProjectId: “P1”,
Hours: 32.5
}
{
_id: “W2”,
Ename: “Joyce English”,
ProjectId: “P1”,
Hours: 20.0
}
(d) inserting the documents in (c) into their collections “project” and “worker”:
db.project.insert( { _id: “P1”, Pname: “ProductX”, Plocation: “Bellaire” } )
db.worker.insert( [ { _id: “W1”, Ename: “John Smith”, ProjectId: “P1”, Hours: 32.5 },
{ _id: “W2”, Ename: “Joyce English”, ProjectId: “P1”, Hours: 20.0 } ] )
In the design in Figure 24.1(c), an EMPLOYEE who works on several projects would be represented by multiple worker documents with different _id values; each document would represent the employee as a worker for a particular project. This is similar to the design decisions for XML schema design (see Section 13.6). However, it is again important to note that the typical document-based system does not have a schema, so the design rules would have to be followed whenever individual documents are inserted into a collection.
MongoDB CRUD Operations
MongoDB has several CRUD operations, where CRUD stands for create, read, update, and delete. Documents can be created and inserted into their collections using the insert operation, whose format is:
db.<collection_name>.insert(<document(s)>)
The parameters of the insert operation can include either a single document or an
array of documents, as shown in Figure 24.1(d). The delete operation is called
remove, and the format is:
db.<collection_name>.remove(<condition>)
The documents to be removed from the collection are specified by a Boolean condition on some of the fields in the collection documents. There is also an update operation, which has a condition to select certain documents and a $set clause to specify the update. It is also possible to use the update operation to replace an existing document with another one but keep the same ObjectId.
For read queries, the main command is called find, and the format is:
db.<collection_name>.find(<condition>)
General Boolean conditions can be specified as <condition>, and the documents in the collection that return true are selected for the query result. For a full discussion of the MongoDB CRUD operations, see the MongoDB online documentation in the chapter references.
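As a hedged Python counterpart to the shell operations above, the pymongo driver exposes equivalent CRUD calls (insert_one, insert_many, find, update_one, delete_many); the connection details repeat the assumptions of the earlier pymongo sketch.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["companydb"]   # assumed connection

# Create: insert a single document or an array of documents (data of Figure 24.1(d)).
db.project.insert_one({"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"})
db.worker.insert_many([
    {"_id": "W1", "Ename": "John Smith", "ProjectId": "P1", "Hours": 32.5},
    {"_id": "W2", "Ename": "Joyce English", "ProjectId": "P1", "Hours": 20.0},
])

# Read: find documents matching a Boolean condition.
for w in db.worker.find({"ProjectId": "P1"}):
    print(w["Ename"], w["Hours"])

# Update: a condition selects the documents, and $set specifies the new values.
db.worker.update_one({"_id": "W1"}, {"$set": {"Hours": 40.0}})

# Delete: remove the documents matching a condition.
db.worker.delete_many({"ProjectId": "P1"})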
MongoDB Distributed Systems Characteristics
Most MongoDB updates are atomic if they refer to a single document, but MongoDB also provides a pattern for specifying transactions on multiple documents. Since MongoDB is a distributed system, the two-phase commit method is used to ensure atomicity and consistency of multidocument transactions. We discussed the atomicity and consistency properties of transactions in Section 20.3, and the two-phase commit protocol in Section 22.6.
Replication in MongoDB. The concept of replica set is used in MongoDB to create
multiple copies of the same data set on different nodes in the distributed system, and
it uses a variation of the master-slave approach for replication. For example, suppose
that we want to replicate a particular document collection C. A replica set will have
one primary copy of the collection C stored in one node N1, and at least one
secondary copy (replica) of C stored at another node N2. Additional copies can be
stored in nodes N3, N4, etc., as needed, but the cost of storage and update (write)
increases with the number of replicas. The total number of participants in a replica set
must be at least three, so if only one secondary copy is needed, a participant in the
replica set known as an arbiter must run on the third node N3. The arbiter does not
hold a replica of the collection but participates in elections to choose a new primary if
the node storing the current primary copy fails. If the total number of members in a rep-
lica set is n (one primary plus i secondaries, for a total of n = i + 1), then n must be an odd
number; if it is not, an arbiter is added to ensure the election process works correctly if
the primary fails. We discussed elections in distributed systems in Section 23.3.1.
In MongoDB replication, all write operations must be applied to the primary copy and then propagated to the secondaries. For read operations, the user can choose the particular read preference for their application. The default read preference processes all reads at the primary copy, so all read and write operations are performed at the primary node. In this case, secondary copies are mainly to make sure that the system continues operation if the primary fails, and MongoDB can ensure that every read request gets the latest document value. To increase read performance, it is possible to set the read preference so that read requests can be processed at any replica (primary or secondary); however, a read at a secondary is not guaranteed to get the latest version of a document because there can be a delay in propagating writes from the primary to the secondaries.
Sharding in MongoDB. When a collection holds a very large number of documents or requires a large storage space, storing all the documents in one node can lead to performance problems, particularly if there are many user operations accessing the documents concurrently using various CRUD operations. Sharding of the documents in the collection, also known as horizontal partitioning, divides the documents into disjoint partitions known as shards. This allows the system to add more nodes as needed by a process known as horizontal scaling of the distributed system (see Section 23.1.4), and to store the shards of the collection on different nodes to achieve load balancing. Each node will process only those operations pertaining to the documents in the shard stored at that node. Also, each shard will contain fewer documents than if the entire collection were stored at one node, thus further improving performance.
There are two ways to partition a collection into shards in MongoDB: range partitioning and hash partitioning. Both require that the user specify a particular document field to be used as the basis for partitioning the documents into shards. The partitioning field, known as the shard key in MongoDB, must have two characteristics: it must exist in every document in the collection, and it must have an index. The ObjectId can be used, but any other field possessing these two characteristics can also be used as the basis for sharding. The values of the shard key are divided into chunks either through range partitioning or hash partitioning, and the documents are partitioned based on the chunks of shard key values.
Range partitioning creates the chunks by specifying a range of key values; for example, if the shard key values range from one to ten million, it is possible to create ten ranges (1 to 1,000,000; 1,000,001 to 2,000,000; … ; 9,000,001 to 10,000,000), and each chunk would contain the key values in one range. Hash partitioning applies a hash function h(K) to each shard key K, and the partitioning of keys into chunks is based on the hash values (we discussed hashing and its advantages and disadvantages in Section 16.8). In general, if range queries are commonly applied to a collection (for example, retrieving all documents whose shard key value is between 200 and 400), then range partitioning is preferred because each range query will typically be submitted to a single node that contains all the required documents in one shard. If most searches retrieve one document at a time, hash partitioning may be preferable because it randomizes the distribution of shard key values into chunks.
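A hedged sketch of the two chunk-assignment strategies described above, in plain Python rather than MongoDB's internal mechanism; the chunk width, chunk count, and hash function are illustrative.

import hashlib

NUM_CHUNKS = 10
CHUNK_WIDTH = 1_000_000     # range chunks: 1..1,000,000; 1,000,001..2,000,000; ...

def range_chunk(shard_key):
    # Chunk number under range partitioning of keys 1..10,000,000.
    return (shard_key - 1) // CHUNK_WIDTH

def hash_chunk(shard_key):
    # Chunk number under hash partitioning: h(K) modulo the number of chunks.
    h = int(hashlib.md5(str(shard_key).encode()).hexdigest(), 16)
    return h % NUM_CHUNKS

# A range query such as 200 <= K <= 400 touches a single range chunk (chunk 0),
# but the same keys are scattered across many hash chunks.
print(range_chunk(200), range_chunk(400))   # 0 0
print(hash_chunk(200), hash_chunk(400))     # typically two different chunks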
When sharding is used, MongoDB queries are submitted to a module called the query router, which keeps track of which nodes contain which shards based on the particular partitioning method used on the shard keys. The query (CRUD operation) will be routed to the nodes that contain the shards that hold the documents that the query is requesting. If the system cannot determine which shards hold the required documents, the query will be submitted to all the nodes that hold shards of the collection. Sharding and replication are used together; sharding focuses on improving performance via load balancing and horizontal scalability, whereas replication focuses on ensuring system availability when certain nodes fail in the distributed system.
There are many additional details about the distributed system architecture and components of MongoDB, but a full discussion is outside the scope of our presentation. MongoDB also provides many other services in areas such as system administration, indexing, security, and data aggregation, but we will not discuss these features here. Full documentation of MongoDB is available online (see the bibliographic notes).
NOSQL Key-Value Stores
Key-value stores focus on high performance, availability, and scalability by storing data in a distributed storage system. The data model used in key-value stores is relatively simple, and in many of these systems, there is no query language but rather a set of operations that can be used by the application programmers. The key is a unique identifier associated with a data item and is used to locate this data item rapidly. The value is the data item itself, and it can have very different formats for different key-value storage systems. In some cases, the value is just a string of bytes or an array of bytes, and the application using the key-value store has to interpret the structure of the data value. In other cases, some standard formatted data is allowed; for example, structured data rows (tuples) similar to relational data, or semistructured data using JSON or some other self-describing data format. Different key-value stores can thus store unstructured, semistructured, or structured data items (see Section 13.1). The main characteristic of key-value stores is the fact that every value (data item) must be associated with a unique key, and that retrieving the value by supplying the key must be very fast.
There are many systems that fall under the key-value store label, so rather than provide a lot of details on one particular system, we will give a brief introductory overview of some of these systems and their characteristics.
DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon’s
AWS/SDK platforms (Amazon Web Services/Software Development Kit). It can be
used as part of Amazon’s cloud computing services, for the data storage component.
DynamoDB data model. The basic data model in DynamoDB uses the concepts of tables, items, and attributes. A table in DynamoDB does not have a schema; it holds a collection of self-describing items. Each item will consist of a number of (attribute, value) pairs, and attribute values can be single-valued or multivalued. So basically, a table will hold a collection of items, and each item is a self-describing record (or object). DynamoDB also allows the user to specify the items in JSON format, and the system will convert them to the internal storage format of DynamoDB.
When a table is created, it is required to specify a table name and a primary key; the primary key will be used to rapidly locate the items in the table. Thus, the primary key is the key and the item is the value for the DynamoDB key-value store. The primary key attribute must exist in every item in the table. The primary key can be one of the following two types:
■ A single attribute. The DynamoDB system will use this attribute to build a hash index on the items in the table. This is called a hash type primary key. The items are not ordered in storage on the value of the hash attribute.
■ A pair of attributes. This is called a hash and range type primary key. The primary key will be a pair of attributes (A, B): attribute A will be used for hashing, and because there will be multiple items with the same value of A, the B values will be used for ordering the records with the same A value. A table with this type of key can have additional secondary indexes defined on its attributes. For example, if we want to store multiple versions of some type of items in a table, we could use ItemID as hash and Date or Timestamp (when the version was created) as range in a hash and range type primary key.
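To make the hash and range type primary key concrete, here is a hedged, purely illustrative Python sketch of a tiny in-memory table keyed that way; it does not reproduce DynamoDB's API.

from collections import defaultdict

class HashRangeTable:
    # Toy table: hash attribute A selects a bucket, range attribute B orders items in it.
    def __init__(self):
        self._buckets = defaultdict(dict)   # A -> {B: item}

    def put(self, A, B, item):
        self._buckets[A][B] = item

    def get(self, A, B):
        return self._buckets[A].get(B)

    def query(self, A):
        # All items with hash key A, ordered by the range key B.
        return [item for B, item in sorted(self._buckets[A].items())]

# Storing multiple versions of an item: ItemID as hash, Timestamp as range.
table = HashRangeTable()
table.put("item42", "2024-01-01T10:00", {"price": 10})
table.put("item42", "2024-03-05T09:30", {"price": 12})
print(table.query("item42"))    # versions of item42 in timestamp order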
DynamoDB Distributed Characteristics. Because DynamoDB is proprietary, in the
next subsection we will discuss the mechanisms used for replication, sharding, and
other distributed system concepts in an open source key-value system called
Voldemort. Voldemort is based on many of the techniques proposed for DynamoDB.
Voldemort Key-Value Distributed Data Store
Voldemort is an open source system available through Apache 2.0 open source licensing rules. It is based on Amazon's DynamoDB. The focus is on high performance and horizontal scalability, as well as on providing replication for high availability and sharding for improving latency (response time) of read and write requests. All three of those features (replication, sharding, and horizontal scalability) are realized through a technique to distribute the key-value pairs among the nodes of a distributed cluster; this distribution is known as consistent hashing. Voldemort has been used by LinkedIn for data storage. Some of the features of Voldemort are as follows:
■ Simple basic operations. A collection of (key, value) pairs is kept in a Voldemort store. In our discussion, we will assume the store is called s. The basic interface for data storage and retrieval is very simple and includes three operations: get, put, and delete. The operation s.put(k, v) inserts an item as a key-value pair with key k and value v. The operation s.delete(k) deletes the item whose key is k from the store, and the operation v = s.get(k) retrieves the value v associated with key k. The application can use these basic operations to build its own requirements. At the basic storage level, both keys and values are arrays of bytes (strings).
■ High-level formatted data values. The values v in the (k, v) items can be specified in JSON (JavaScript Object Notation), and the system will convert between JSON and the internal storage format. Other data object formats can also be specified if the application provides the conversion (also known as serialization) between the user format and the storage format as a Serializer class. The Serializer class must be provided by the user and will include operations to convert the user format into a string of bytes for storage as a value, and to convert back a string (array of bytes) retrieved via s.get(k) into the user format. Voldemort has some built-in serializers for formats other than JSON.
■ Consistent hashing for distributing (key, value) pairs. A variation of the data distribution algorithm known as consistent hashing is used in Voldemort for data distribution among the nodes in the distributed cluster of nodes. A hash function h(k) is applied to the key k of each (k, v) pair, and h(k) determines where the item will be stored. The method assumes that h(k) is an integer value, usually in the range 0 to Hmax = 2^n − 1, where n is chosen based on the desired range for the hash values. This method is best visualized by considering the range of all possible integer hash values 0 to Hmax to be evenly distributed on a circle (or ring). The nodes in the distributed system are then also located on the same ring; usually each node will have several locations on the ring (see Figure 24.2). The positioning of the points on the ring that represent the nodes is done in a pseudorandom manner.
■ An item (k, v) will be stored on the node whose position in the ring follows
the position of h(k) on the ring in a clockwise direction. In Figure 24.2(a), we assume
there are three nodes in the distributed cluster labeled A, B, and C, where node C
has a bigger capacity than nodes A and B. In a typical system, there will be many
more nodes. On the circle, two instances each of A and B are placed, and three
instances of C (because of its higher capacity), in a pseudorandom manner to
cover the circle. Figure 24.2(a) indicates which (k, v) items are placed in which
nodes based on the h(k) values.
■ The h(k) values that fall in the parts of the circle marked as range 1 in Fig- ure 24.2(a)
will have their (k, v) items stored in node A because that is the node whose label follows h(k) on the
ring in a clockwise direction; those in range 2are stored in node B; and those in range 3 are stored
in node C. This scheme allows horizontal scalability because when a new node is added to the
distributed system, it can be added in one or more locations on the ring depending on the node
capacity. Only a limited percentage of the (k, v) items will be reassigned to the new node from the
existing nodes based on the consistent hashing placement algorithm. Also, those items assigned
to the new node may not all come from only one of the existing nodes because the new node can
have multiple locations on the ring. For example, if a node D is added and it has two placements on
the ring as shown in Figure 24.2(b), then some of the items from nodes B and C would be moved
to node D. The items whose keys hash to range 4 on the circle (see Figure 24.2(b)) would be
migrated to node D. This scheme also allows replication by placing the number of specified
replicas of an item on successive nodes on the ring in a clockwise direction. The sharding is built
into the method, and different items in the store (file) are located on different nodes in the
distributed cluster, which means the items are horizontally partitioned (sharded) among the
nodes in the distributed system. When a node fails, its load of data items can be distributed to the
other existing nodes whose labels follow the labels of the failed node in the ring. And nodes with
higher capacity can have more locations on the ring, as illustrated by node C in Figure 24.2(a),
and thus store more items than smaller-capacity nodes.
■ Consistency and versioning. Voldemort uses a method similar to the one developed for
DynamoDB for consistency in the presence of replicas. Basically, concurrent write operations
are allowed by different processes so there could exist two or more different values associated
with the same key at different nodes when items are replicated. Consistency is achieved when
the item is read by using a technique known as versioning and read repair. Concurrent writes
are allowed, but each write is associated with a vector clock value. When a read occurs, it is
possible that different versions of the same value (associated with the same key) are read from
different nodes. If the system can reconcile to a single final value, it will pass that value to the
read; otherwise, more than one version can be passed back to the application, which will
reconcile the various versions into one version based on the application semantics and give this
reconciled value back to the nodes.
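To make the consistent hashing placement rule concrete, the following short Python sketch places nodes at several pseudorandom positions on the ring and assigns each key to the node whose position follows the key's hash value in a clockwise direction. This is a toy illustration only, not Voldemort's actual code; the hash function, the number of positions per node, and the node names are assumptions made for the example.

import hashlib
import bisect

class ConsistentHashRing:
    # Toy ring: each node occupies several pseudorandom positions; an item with
    # key k is stored on the node whose position follows h(k) clockwise.
    def __init__(self, hash_bits=32):
        self.hmax = 2 ** hash_bits      # hash values fall in the range 0 to Hmax
        self.positions = []             # sorted ring positions
        self.owner = {}                 # ring position -> node name

    def _h(self, value):
        # Any reasonably uniform hash works; MD5 is used here only for illustration.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % self.hmax

    def add_node(self, node, num_positions):
        # Higher-capacity nodes get more positions (like node C in Figure 24.2(a)).
        for i in range(num_positions):
            pos = self._h(node + '#' + str(i))
            bisect.insort(self.positions, pos)
            self.owner[pos] = node

    def node_for_key(self, key):
        pos = self._h(key)
        idx = bisect.bisect_right(self.positions, pos)
        if idx == len(self.positions):  # wrap around the circle
            idx = 0
        return self.owner[self.positions[idx]]

ring = ConsistentHashRing()
ring.add_node('A', 2)
ring.add_node('B', 2)
ring.add_node('C', 3)                       # larger capacity, so more ring positions
print(ring.node_for_key('employee:123'))    # node that would store this (k, v) item

Adding a new node, say ring.add_node('D', 2), only reassigns the keys whose hash values fall between D's new positions and their predecessors on the ring, which is the limited-reassignment property described above.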
Examples of Other Key-Value Stores
In this section, we briefly review three other key-value stores. It is important to note
that there are many systems that can be classified in this category, and we can only
mention a few of these systems.
Oracle key-value store. Oracle has one of the well-known SQL relational database
systems, and Oracle also offers a system based on the key-value store concept; this
system is called the Oracle NoSQL Database.
Redis key-value cache and store. Redis differs from the other systems discussed
here because it caches its data in main memory to further improve performance. It
offers master-slave replication and high availability, and it also offers persistence
by backing up the cache to disk.
Apache Cassandra. Cassandra is a NOSQL system that is not easily categorized into
one category; it is sometimes listed in the column-based NOSQL category (see
Section 24.5) or in the key-value category. It offers features from several NOSQL
categories and is used by Facebook as well as many other customers.
Column-Based or Wide Column
NOSQL Systems
Another category of NOSQL systems is known as column-based or wide column
systems. The Google distributed storage system for big data, known as BigTable, is
a well-known example of this class of NOSQL systems, and it is used in many
Google applications that require large amounts of data storage, such as Gmail. Big-
Table uses the Google File System (GFS) for data storage and distribution. An
open source system known as Apache Hbase is somewhat similar to Google Big-
Table, but it typically uses HDFS (Hadoop Distributed File System) for data stor-
age. HDFS is used in many cloud computing applications, as we shall discuss in
Chapter 25. Hbase can also use Amazon’s Simple Storage System (known as S3)
for data storage. Another well-known example of column-based NOSQL systems is
Cassandra, which we discussed briefly in Section 24.4.3 because it can also be char-
acterized as a key-value store. We will focus on Hbase in this section as an example
of this category of NOSQL systems.
BigTable (and Hbase) is sometimes described as a sparse multidimensional distrib-
uted persistent sorted map, where the word map means a collection of (key, value)
pairs (the key is mapped to the value). One of the main differences that distinguish
column-based systems from key-value stores (see Section 24.4) is the nature of the
key. In column-based systems such as Hbase, the key is multidimensional and so
has several components: typically, a combination of table name, row key, column,
and timestamp. As we shall see, the column is typically composed of two compo-
nents: column family and column qualifier. We discuss these concepts in more
detail next as they are realized in Apache Hbase.
Hbase Data Model and Versioning
Hbase data model. The data model in Hbase organizes data using the concepts of
namespaces, tables, column families, column qualifiers, columns, rows, and data
cells. A column is identified by a combination of (column family:column qualifier).
Data is stored in a self-describing form by associating columns with data values,
where data values are strings. Hbase also stores multiple versions of a data item,
with a timestamp associated with each version, so versions and timestamps are also
part of the Hbase data model (this is similar to the concept of attribute versioning in
temporal databases, which we shall discuss in Section 26.2). As with other NOSQL
systems, unique keys are associated with stored data items for fast access, but the
keys identify cells in the storage system. Because the focus is on high performance
when storing huge amounts of data, the data model includes some storage-related
concepts. We discuss the Hbase data modeling concepts and define the terminol-
ogy next. It is important to note that the use of the words table, row, and column is
not identical to their use in relational databases, but the uses are related.
■ Tables and Rows. Data in Hbase is stored in tables, and each table has a table name.
Data in a table is stored as self-describing rows. Each row has a unique row key, and row keys
are strings that must have the property that they can be lexicographically ordered, so characters
that do not have a lexicographic order in the character set cannot be used as part of a row key.
■ Column Families, Column Qualifiers, and Columns. A table is associated with one or
more column families. Each column family will have a name, and the column families
associated with a table must be specified when the table is created and cannot be changed later.
Figure 24.3(a) shows how a table may be created; the table name is followed by the names of the
column families associated with the table. When the data is loaded into a table, each column
family can be associated with many column qualifiers, but the column qualifiers are not specified
as part of creating a table. So the column qualifiers make the model a self-describing data model
because the qualifiers can be dynamically specified as new rows are created and inserted into the
table. A column is specified by a combination of ColumnFamily:ColumnQualifier. Basically,
column families are a way of grouping together related columns (attributes in relational
terminology) for storage purposes, except that the column qualifier names are not specified
during table creation. Rather, they are specified when the data is created and stored in rows, so
the data is self-describing since any column qualifier name can be used in a new row of data (see
Figure 24.3(b)). However, it is important that the application programmers know which
column qualifiers belong to each column family, even though they have the flexibility to create
new column qualifiers on the fly when new data rows are created. The concept of column family
is somewhat similar to vertical partitioning (see Section 23.2), because columns (attributes)
that are accessed together because they belong to the same column family are stored in the same
files. Each column family of a table is stored in its own files using the HDFS file system.
■ Versions and Timestamps. Hbase can keep several versions of a data item, along with
the timestamp associated with each version. The timestamp is a long integer number that
represents the system time when the version was created, so newer versions have larger
timestamp values. Hbase uses midnight ‘January 1, 1970 UTC’ as timestamp value zero, and uses
a long integer that measures the number of milliseconds since that time as the system timestamp
value (this is similar to the value returned by the Java utility java.util.Date.getTime() and is also
used in MongoDB). It is also possible for the user to define the timestamp value explicitly in a
Date format rather than using the system-generated timestamp.
Figure 24.3
Examples in Hbase. (a) Creating a table called EMPLOYEE with three column families: Name, Address, and Details.
(b) Inserting some row data in the EMPLOYEE table; different rows can have different self-describing column qualifiers
(Fname, Lname, Nickname, Mname, Minit, Suffix, … for column family Name; Job, Review, Supervisor, Salary for
column family Details). (c) Some CRUD operations of Hbase.
(a) creating a table:
create ‘EMPLOYEE’, ‘Name’, ‘Address’, ‘Details’
(b) inserting some row data in the EMPLOYEE table:
put ‘EMPLOYEE’, ‘row1’, ‘Name:Fname’, ‘John’
put ‘EMPLOYEE’, ‘row1’, ‘Name:Lname’, ‘Smith’
put ‘EMPLOYEE’, ‘row1’, ‘Name:Nickname’, ‘Johnny’
put ‘EMPLOYEE’, ‘row1’, ‘Details:Job’, ‘Engineer’
put ‘EMPLOYEE’, ‘row1’, ‘Details:Review’, ‘Good’
put ‘EMPLOYEE’, ‘row2’, ‘Name:Fname’, ‘Alicia’
put ‘EMPLOYEE’, ‘row2’, ‘Name:Lname’, ‘Zelaya’
put ‘EMPLOYEE’, ‘row2’, ‘Name:MName’, ‘Jennifer’
put ‘EMPLOYEE’, ‘row2’, ‘Details:Job’, ‘DBA’
put ‘EMPLOYEE’, ‘row2’, ‘Details:Supervisor’, ‘James Borg’
put ‘EMPLOYEE’, ‘row3’, ‘Name:Fname’, ‘James’
put ‘EMPLOYEE’, ‘row3’, ‘Name:Minit’, ‘E’
put ‘EMPLOYEE’, ‘row3’, ‘Name:Lname’, ‘Borg’
put ‘EMPLOYEE’, ‘row3’, ‘Name:Suffix’, ‘Jr.’
put ‘EMPLOYEE’, ‘row3’, ‘Details:Job’, ‘CEO’
put ‘EMPLOYEE’, ‘row3’, ‘Details:Salary’, ‘1,000,000’
(c) Some Hbase basic CRUD operations:
Creating a table: create <tablename>, <column family>, <column family>, …
Inserting Data: put <tablename>, <rowid>, <column family>:<column qualifier>, <value>
Reading Data (all data in a table): scan <tablename>
Retrieve Data (one item): get <tablename>,<rowid>
■ Cells. A cell holds a basic data item in Hbase. The key (address) of a cell is
specified by a combination of (table, rowid, columnfamily, columnqualifier, timestamp). If
timestamp is left out, the latest version of the item is retrieved unless a default number of versions
is specified, say the latest three versions. The default number of versions to be retrieved, as well
as the default number of versions that the system needs to keep, are parameters that can be
specified during table creation. (A toy sketch of this cell-addressing and versioning scheme is
given after this list.)
■ Namespaces. A namespace is a collection of tables. A namespace basically
specifies a collection of one or more tables that are typically used together by user applications,
and it corresponds to a database that contains a collection of tables in relational terminology.
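To make the cell-addressing scheme concrete, the toy Python sketch below (our own illustration, not Hbase code; the class and method names are invented) keeps a list of timestamped versions for each (table, row key, column family:qualifier) address and returns the latest version by default, as described for the Versions and Cells concepts above.

import time
from collections import defaultdict

class ToyCellStore:
    # Each (table, row, 'family:qualifier') address maps to a list of
    # (timestamp, value) versions, newest first, as in the Hbase data model.
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = defaultdict(list)

    def put(self, table, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1000)  # ms since Jan 1, 1970 UTC
        versions = self.cells[(table, row, column)]
        versions.append((ts, value))
        versions.sort(reverse=True)            # newest version first
        del versions[self.max_versions:]       # keep only the newest versions

    def get(self, table, row, column, timestamp=None):
        versions = self.cells.get((table, row, column), [])
        if timestamp is None:
            return versions[0][1] if versions else None   # latest version
        for ts, value in versions:
            if ts <= timestamp:                # newest version at or before the given time
                return value
        return None

store = ToyCellStore()
store.put('EMPLOYEE', 'row1', 'Name:Fname', 'John')
store.put('EMPLOYEE', 'row1', 'Details:Job', 'Engineer')
print(store.get('EMPLOYEE', 'row1', 'Name:Fname'))        # -> John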
Hbase CRUD Operations
Hbase has low-level CRUD (create, read, update, delete) operations, as in many of
the NOSQL systems. The formats of some of the basic CRUD operations in Hbase
are shown in Figure 24.3(c).
Hbase only provides low-level CRUD operations. It is the responsibility of the
application programs to implement more complex operations, such as joins
between rows in different tables. The create operation creates a new table and spec-
ifies one or more column families associated with that table, but it does not specify
the column qualifiers, as we discussed earlier. The put operation is used for insert-
ing new data or new versions of existing data items. The get operation is for retriev-
ing the data associated with a single row in a table, and the scan operation retrieves
all the rows.
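The CRUD operations in Figure 24.3 can also be issued from application programs. The sketch below uses happybase, a commonly used Python client that reaches Hbase through its Thrift gateway; the host name and the assumption that a Thrift server is running are ours, so treat this as an illustration of the operations rather than part of Hbase itself.

import happybase

# Assumes an Hbase Thrift server is reachable on this host (an assumption for this sketch)
connection = happybase.Connection('localhost')

# create: the table name and its column families (column qualifiers are not declared here)
connection.create_table('EMPLOYEE', {'Name': dict(max_versions=3),
                                     'Address': dict(),
                                     'Details': dict()})
table = connection.table('EMPLOYEE')

# put: insert self-describing column qualifiers for a row
table.put(b'row1', {b'Name:Fname': b'John',
                    b'Name:Lname': b'Smith',
                    b'Details:Job': b'Engineer'})

# get: retrieve the data associated with a single row
print(table.row(b'row1'))

# scan: retrieve all the rows in the table
for row_key, data in table.scan():
    print(row_key, data)

# delete: remove a row (or selected columns of a row)
table.delete(b'row1')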
Hbase Storage and Distributed System Concepts
Each Hbase table is divided into a number of regions, where each region will hold a
range of the row keys in the table; this is why the row keys must be lexicographically
ordered. Each region will have a number of stores, where each column family is
assigned to one store within the region. Regions are assigned to region servers
(storage nodes) for storage. A master server (master node) is responsible for moni-
toring the region servers and for splitting a table into regions and assigning regions
to region servers.
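Because each region covers a contiguous range of row keys, locating the region (and hence the region server) for a given row key is a simple range lookup over the sorted region start keys. The sketch below is our own toy illustration of that idea, not Hbase code; the start keys and server names are invented.

import bisect

# Sorted region start keys; a row key belongs to the last region whose start key <= the row key
region_start_keys = ['', 'emp2000', 'emp4000', 'emp6000']
region_servers = ['rs1', 'rs2', 'rs1', 'rs3']      # assignment chosen by the master server

def region_for_row(row_key):
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return idx, region_servers[idx]

print(region_for_row('emp1234'))   # -> (0, 'rs1'): falls in the first region
print(region_for_row('emp5555'))   # -> (2, 'rs1')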
Hbase uses the Apache Zookeeper open source system for services related to man-
aging the naming, distribution, and synchronization of the Hbase data on the dis-
tributed Hbase server nodes, as well as for coordination and replication services.
Hbase also uses Apache HDFS (Hadoop Distributed File System) for distributed
file services. So Hbase is built on top of both HDFS and Zookeeper. Zookeeper can
itself have several replicas on several nodes for availability, and it keeps the data it
needs in main memory to speed access to the master servers and region servers.
We will not cover the many additional details about the distributed system architecture
and components of Hbase; a full discussion is outside the scope of our presentation. Full
documentation of Hbase is available online (see the bibliographic notes).
NOSQL Graph Databases and Neo4j
Another category of NOSQL systems is known as graph databases or graph-
oriented NOSQL systems. The data is represented as a graph, which is a collection
of vertices (nodes) and edges. Both nodes and edges can be labeled to indicate the
types of entities and relationships they represent, and it is generally possible to store
data associated with both individual nodes and individual edges. Many systems
can be categorized as graph databases. We will focus our discussion on one
particular system, Neo4j, which is used in many applications. Neo4j is an open
source system, and it is implemented in Java. We will discuss the Neo4j data model
in Section 24.6.1, and give an introduction to the Neo4j querying capabilities in
Section 24.6.2. Section 24.6.3 gives an overview of the distributed systems and
some other characteristics of Neo4j.
Neo4j Data Model
The data model in Neo4j organizes data using the concepts of nodes and relation-
ships. Both nodes and relationships can have properties, which store the data items
associated with nodes and relationships. Nodes can have labels; the nodes that have
the same label are grouped into a collection that identifies a subset of the nodes in
the database graph for querying purposes. A node can have zero, one, or several
labels. Relationships are directed; each relationship has a start node and end node as
well as a relationship type, which serves a similar role to a node label by identifying
similar relationships that have the same relationship type. Properties can be speci-
fied via a map pattern, which is made of one or more “name : value” pairs enclosed
in curly brackets; for example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}.
In conventional graph theory, nodes and relationships are generally called vertices
and edges. The Neo4j graph data model somewhat resembles how data is repre-
sented in the ER and EER models (see Chapters 3 and 4) , but with some notable
differences. Comparing the Neo4j graph model with ER/EER concepts, nodes cor-
respond to entities, node labels correspond to entity types and subclasses, relation-
ships correspond to relationship instances, relationship types correspond to
relationship types, and properties correspond to attributes. One notable difference
is that a relationship is directed in Neo4j, but is not in ER/EER. Another is that a
node may have no label in Neo4j, which is not allowed in ER/EER because every
entity must belong to an entity type. A third crucial difference is that the graph
model of Neo4j is used as a basis for an actual high-performance distributed data-
base system whereas the ER/EER model is mainly used for database design.
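A minimal sketch of the node/relationship/property structure just described can be written directly in Python; the class and field names below are our own invention for illustration and are not Neo4j's internal representation.

from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set = field(default_factory=set)        # zero, one, or several labels
    properties: dict = field(default_factory=dict)  # map of name : value pairs

@dataclass
class Relationship:
    start: Node                                     # relationships are directed
    end: Node
    rel_type: str                                   # relationship type
    properties: dict = field(default_factory=dict)

# A few nodes and relationships resembling part of the COMPANY data
e1 = Node({'EMPLOYEE'}, {'Empid': '1', 'Lname': 'Smith', 'Fname': 'John', 'Minit': 'B'})
d1 = Node({'DEPARTMENT'}, {'Dno': '5', 'Dname': 'Research'})
p1 = Node({'PROJECT'}, {'Pno': '1', 'Pname': 'ProductX'})

works_for = Relationship(e1, d1, 'WorksFor')
works_on = Relationship(e1, p1, 'WorksOn', {'Hours': 32.5})
print(works_on.rel_type, works_on.properties['Hours'])    # -> WorksOn 32.5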
Figure 24.4(a) shows how a few nodes can be created in Neo4j. There are various
ways in which nodes and relationships can be created; for example, by calling appro-
priate Neo4j operations from various Neo4j APIs. We will just show the high-level
syntax for creating nodes and relationships; to do so, we will use the Neo4j CREATE
command, which is part of the high-level declarative query language Cypher. Neo4j
has many options and variations for creating nodes and relationships using various
scripting interfaces, but a full discussion is outside the scope of our presentation.
■ Labels and properties. When a node is created, the node label can be speci-
fied. It is also possible to create nodes without any labels. In Figure 24.4(a), the node labels are
EMPLOYEE, DEPARTMENT, PROJECT, and LOCATION, and the created nodes correspond
to some of the data from the COMPANY database in Figure 5.6 with a few modifications; for
example, we use EmpId instead of SSN, and we only include a small subset of the data for
illustration purposes. Properties are enclosed in curly brackets { … }. It is possible that some
nodes have multiple labels; for example the same node can be labeled as PERSON and
EMPLOYEE and MANAGER by listing all the labels separated by the colon symbol as follows:
PERSON:EMPLOYEE:MANAGER. Having multiple labels is similar to an entity belonging to
an entity type (PERSON) plus some subclasses of PERSON (namely EMPLOYEE and
MANAGER) in the EER model (see Chapter 4) but can also be used for other purposes.
■ Relationships and relationship types. Figure 24.4(b) shows a few example relationships
in Neo4j based on the COMPANY database in Figure 5.6. The arrow (–>) specifies the direction of
the relationship, but the relationship can be traversed in either direction. The relationship types
(labels) in Figure 24.4(b) are WorksFor, Manager, LocatedIn, and WorksOn; only relationships
with the relationship type WorksOn have properties (Hours) in Figure 24.4(b).
■ Paths. A path specifies a traversal of part of the graph. It is typically used as part of a
query to specify a pattern, where the query will retrieve from the graph data that matches the
pattern. A path is typically specified by a start node, followed by one or more relationships,
leading to one or more end nodes that satisfy the pattern. It is somewhat similar to the concepts of
path expressions that we discussed in Chapters 12 and 13 in the context of query languages for
object databases (OQL) and XML (XPath and XQuery).
■ Optional Schema. A schema is optional in Neo4j. Graphs can be created and used
without a schema, but in Neo4j version 2.0, a few schema-related functions were added. The
main features related to schema creation involve creating indexes and constraints based on the
labels and properties. For example, it is possible to create the equivalent of a key constraint on a
prop- erty of a label, so all nodes in the collection of nodes associated with the label must have
unique values for that property.
■ Indexing and node identifiers. When a node is created, the Neo4j system creates an
internal unique system-defined identifier for each node. To retrieve individual nodes using other
properties of the nodes efficiently, the user can create indexes for the collection of nodes that
have a particular label. Typically, one or more of the properties of the nodes in that collection
can be indexed. For example, Empid can be used to index nodes with the EMPLOYEE label,
Dno to index the nodes with the DEPARTMENT label, and Pno to index the nodes with the
PROJECT label.
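As a concrete illustration of the schema and indexing features above, the statements below create an index and a uniqueness (key-like) constraint on Empid for the EMPLOYEE label, sent through the official Neo4j Python driver. The connection URI and credentials are placeholders, and the Cypher shown uses the Neo4j 2.x/3.x syntax that matches the version discussed here; newer releases use a CREATE INDEX ... FOR ... ON form instead.

from neo4j import GraphDatabase

# Placeholder connection details for this sketch
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

with driver.session() as session:
    # Index EMPLOYEE nodes on the Empid property for fast retrieval
    session.run('CREATE INDEX ON :EMPLOYEE(Empid)')
    # Key-like constraint: every node labeled EMPLOYEE must have a unique Empid value
    session.run('CREATE CONSTRAINT ON (e:EMPLOYEE) ASSERT e.Empid IS UNIQUE')

driver.close()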
The Cypher Query Language of Neo4j
Neo4j has a high-level query language, Cypher. There are declarative commands for
creating nodes and relationships (see Figures 24.4(a) and (b)), as well as for finding
nodes and relationships based on specifying patterns. Deletion and modification of
data is also possible in Cypher. We introduced the CREATE command in the previous
section, so we will now give a brief overview of some of the other features of Cypher.
A Cypher query is made up of clauses. When a query has several clauses, the result
from one clause can be the input to the next clause in the query. We will give a fla-
vor of the language by discussing some of the clauses using examples. Our presenta-
tion is not meant to be a detailed presentation on Cypher, just an introduction to
some of the language's features. Figure 24.4(c) summarizes some of the main clauses
that can be part of a Cypher query. The Cypher language can specify complex queries
and updates on a graph database. We will give a few examples to illustrate simple
Cypher queries in Figure 24.4(d).
Figure 24.4
Examples in Neo4j using the Cypher language. (a) Creating some nodes. (b) Creating some relationships.
(a) creating some nodes for the COMPANY data (from Figure 5.6):
CREATE (e1: EMPLOYEE, {Empid: ‘1’, Lname: ‘Smith’, Fname: ‘John’, Minit: ‘B’})
CREATE (e2: EMPLOYEE, {Empid: ‘2’, Lname: ‘Wong’, Fname: ‘Franklin’})
CREATE (e3: EMPLOYEE, {Empid: ‘3’, Lname: ‘Zelaya’, Fname: ‘Alicia’})
CREATE (e4: EMPLOYEE, {Empid: ‘4’, Lname: ‘Wallace’, Fname: ‘Jennifer’, Minit: ‘S’})
…
CREATE (d1: DEPARTMENT, {Dno: ‘5’, Dname: ‘Research’})
CREATE (d2: DEPARTMENT, {Dno: ‘4’, Dname: ‘Administration’})
…
CREATE (p1: PROJECT, {Pno: ‘1’, Pname: ‘ProductX’})
CREATE (p2: PROJECT, {Pno: ‘2’, Pname: ‘ProductY’})
CREATE (p3: PROJECT, {Pno: ‘10’, Pname: ‘Computerization’})
CREATE (p4: PROJECT, {Pno: ‘20’, Pname: ‘Reorganization’})
…
CREATE (loc1: LOCATION, {Lname: ‘Houston’})
CREATE (loc2: LOCATION, {Lname: ‘Stafford’})
CREATE (loc3: LOCATION, {Lname: ‘Bellaire’})
CREATE (loc4: LOCATION, {Lname: ‘Sugarland’})
…
(b) creating some relationships for the COMPANY data (from Figure 5.6):
CREATE (e1) – [ : WorksFor ] –> (d1)
CREATE (e3) – [ : WorksFor ] –> (d2)
…
CREATE (d1) – [ : Manager ] –> (e2)
CREATE (d2) – [ : Manager ] –> (e4)
…
CREATE (d1) – [ : LocatedIn ] –> (loc1)
CREATE (d1) – [ : LocatedIn ] –> (loc3)
CREATE (d1) – [ : LocatedIn ] –> (loc4)
CREATE (d2) – [ : LocatedIn ] –> (loc2)
…
CREATE (e1) – [ : WorksOn, {Hours: ‘32.5’} ] –> (p1)
CREATE (e1) – [ : WorksOn, {Hours: ‘7.5’} ] –> (p2)
CREATE (e2) – [ : WorksOn, {Hours: ‘10.0’} ] –> (p1)
CREATE (e2) – [ : WorksOn, {Hours: 10.0} ] –> (p2)
CREATE (e2) – [ : WorksOn, {Hours: ‘10.0’} ] –> (p3)
CREATE (e2) – [ : WorksOn, {Hours: 10.0} ] –> (p4)
…
Figure 24.4 (continued)
Examples in Neo4j using the Cypher language. (c) Basic syntax of Cypher queries. (d) Examples of Cypher queries.
(c) Basic simplified syntax of some common Cypher clauses:
Finding nodes and relationships that match a pattern: MATCH <pattern>
Specifying aggregates and other query variables: WITH <specifications>
Specifying conditions on the data to be retrieved: WHERE <condition>
Specifying the data to be returned: RETURN <data>
Ordering the data to be returned: ORDER BY <data>
Limiting the number of returned data items: LIMIT <max number>
Creating nodes: CREATE <node, optional labels and properties>
Creating relationships: CREATE <relationship, relationship type and optional properties>
Deletion: DELETE <nodes or relationships>
Specifying property values and labels: SET <property values and labels>
Removing property values and labels: REMOVE <property values and labels>
(d) Examples of simple Cypher queries:
1. MATCH (d : DEPARTMENT {Dno: ‘5’}) – [ : LocatedIn ] –> (loc)
RETURN d.Dname , loc.Lname
2. MATCH (e: EMPLOYEE {Empid: ‘2’}) – [ w: WorksOn ] –> (p)
RETURN e.Ename , w.Hours, p.Pname
3. MATCH (e) – [ w: WorksOn ] –> (p: PROJECT {Pno: 2})
RETURN p.Pname, e.Ename , w.Hours
4. MATCH (e) – [ w: WorksOn ] –> (p)
RETURN e.Ename , w.Hours, p.Pname
ORDER BY e.Ename
5. MATCH (e) – [ w: WorksOn ] –> (p)
RETURN e.Ename , w.Hours, p.Pname
ORDER BY e.Ename
LIMIT 10
6. MATCH (e) – [ w: WorksOn ] –> (p)
WITH e, COUNT(p) AS numOfprojs
WHERE numOfprojs > 2
RETURN e.Ename , numOfprojs
ORDER BY numOfprojs
7. MATCH (e) – [ w: WorksOn ] –> (p)
RETURN e , w, p
ORDER BY e.Ename
LIMIT 10
8. MATCH (e: EMPLOYEE {Empid: ‘2’})
SET e.Job = ‘Engineer’
Query 1 in Figure 24.4(d) shows how to use the MATCH and RETURN clauses in a
query, and the query retrieves the locations for department number 5. Match speci-
fies the pattern and the query variables (d and loc) and RETURN specifies the query
result to be retrieved by referring to the query variables. Query 2 has three variables
(e, w, and p), and returns the projects and hours per week that the employee with
Empid = 2 works on. Query 3, on the other hand, returns the employees and hours
per week who work on the project with Pno = 2. Query 4 illustrates the ORDER BY
clause and returns all employees and the projects they work on, sorted by Ename. It
is also possible to limit the number of returned results by using the LIMIT clause as
in query 5, which only returns the first 10 answers.
Query 6 illustrates the use of WITH and aggregation, although the WITH clause can
be used to separate clauses in a query even if there is no aggregation. Query 6 also illus-
trates the WHERE clause to specify additional conditions, and the query returns the
employees who work on more than two projects, as well as the number of projects each
employee works on. It is also common to return the nodes and relationships them-
selves in the query result, rather than the property values of the nodes as in the previ-
ous queries. Query 7 is similar to query 5 but returns the nodes and relationships only,
and so the query result can be displayed as a graph using Neo4j’s visualization tool. It is
also possible to add or remove labels and properties from nodes. Query 8 shows how to
add more properties to a node by adding a Job property to an employee node.
The above gives a brief flavor of the Cypher query language of Neo4j. The full lan-
guage manual is available online (see the bibliographic notes).
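Cypher queries such as those in Figure 24.4(d) are typically submitted from application programs. The short sketch below runs a query similar to query 1 through the official Neo4j Python driver; the URI and credentials are placeholders for an actual installation, and the labels and properties follow the examples above.

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

# Retrieve the locations of department number 5 (compare query 1 in Figure 24.4(d))
query = (
    "MATCH (d:DEPARTMENT {Dno: '5'})-[:LocatedIn]->(loc) "
    "RETURN d.Dname AS dname, loc.Lname AS location"
)

with driver.session() as session:
    for record in session.run(query):
        print(record['dname'], record['location'])

driver.close()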
Neo4j Interfaces and Distributed System Characteristics
Neo4j has other interfaces that can be used to create, retrieve, and update nodes and
relationships in a graph database. It also has two main versions: the enterprise edi-
tion, which comes with additional capabilities, and the community edition. We dis-
cuss some of the additional features of Neo4j in this subsection.
■ Enterprise edition vs. community edition. Both editions support the Neo4j graph data
model and storage system, as well as the Cypher graph query language, and several other
interfaces, including a high-performance native API, language drivers for several popular
programming languages, such as Java, Python, PHP, and the REST (Representational State
Transfer) API. In addition, both editions support ACID properties. The enterprise edition
supports additional features for enhancing performance, such as caching and clustering of data
and locking.
■ Graph visualization interface. Neo4j has a graph visualization interface, so that a subset
of the nodes and edges in a database graph can be displayed as a graph. This tool can be used to
visualize query results in a graph representation.
■ Master-slave replication. Neo4j can be configured on a cluster of distributed system
nodes (computers), where one node is designated the master node. The data and indexes are
fully replicated on each node in the cluster. Various ways of synchronizing the data between
master and slave nodes can be configured in the distributed cluster.
■ Caching. A main memory cache can be configured to store the graph data for improved
performance.
■ Logical logs. Logs can be maintained to recover from failures.