Dynamodb Part 3

The document discusses the bootstrapping process of a Dynamo system, detailing how nodes are added and removed while managing key distribution and failure detection. It explains the importance of a gossip-based protocol for maintaining membership and preventing logical partitions, as well as the implementation of a pluggable persistence engine to accommodate different storage needs. Additionally, it highlights the balance between performance, durability, and availability in Dynamo's architecture, emphasizing the ability for client applications to adjust parameters for optimal operation.

initially contains only the local node and token set. The mappings stored at different Dynamo nodes are reconciled during the same communication exchange that reconciles the membership change histories. Therefore, partitioning and placement information also propagates via the gossip-based protocol, and each storage node is aware of the token ranges handled by its peers. This allows each node to forward a key's read/write operations to the right set of nodes directly.

4.8.2 External Discovery
The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For example, the administrator could contact node A to join A to the ring, then contact node B to join B to the ring. In this scenario, nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other. To prevent logical partitions, some Dynamo nodes play the role of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely. Seeds can be obtained either from static configuration or from a configuration service. Typically, seeds are fully functional nodes in the Dynamo ring.

4.8.3 Failure Detection
Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node A may consider node B failed if node B does not respond to node A's messages (even if B is responsive to node C's messages). In the presence of a steady rate of client requests generating inter-node communication in the Dynamo ring, node A quickly discovers that node B is unresponsive when B fails to respond to a message; node A then uses alternate nodes to service requests that map to B's partitions and periodically retries B to check for its recovery. In the absence of client requests to drive traffic between two nodes, neither node really needs to know whether the other is reachable and responsive.

Decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the arrival (or departure) of other nodes. For detailed information on decentralized failure detectors and the parameters affecting their accuracy, the interested reader is referred to [8]. Early designs of Dynamo used a decentralized failure detector to maintain a globally consistent view of failure state. Later it was determined that the explicit node join and leave methods obviate the need for a global view of failure state. This is because nodes are notified of permanent node additions and removals by the explicit node join and leave methods, and temporary node failures are detected by the individual nodes when they fail to communicate with others (while forwarding requests).
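As a concrete illustration of the purely local failure detection rule above, the following Java sketch (hypothetical; the class name and retry interval are assumptions, not Dynamo's implementation) marks a peer as suspected when a message to it fails, lets the caller route around it while it is suspected, and allows a periodic retry to detect recovery.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of a purely local failure detector (illustrative only, not
 * Dynamo's code). Node A marks node B as suspected when B stops answering A's
 * messages and periodically retries B to detect recovery; no globally
 * consistent view of failure state is maintained.
 */
public class LocalFailureDetector {

    // Assumed retry interval; the paper does not specify a value.
    private static final long RETRY_INTERVAL_MS = 10_000;

    /** Peers currently considered unreachable, with the time they were marked. */
    private final Map<String, Long> suspected = new ConcurrentHashMap<>();

    /** Called when a message to a peer fails or times out. */
    public void onMessageFailure(String peer) {
        suspected.putIfAbsent(peer, System.currentTimeMillis());
    }

    /** Called when any response from the peer is received. */
    public void onMessageSuccess(String peer) {
        suspected.remove(peer);
    }

    /** True if requests mapping to this peer should be sent to alternate nodes. */
    public boolean isSuspected(String peer) {
        return suspected.containsKey(peer);
    }

    /** True if enough time has passed that the peer should be probed again. */
    public boolean shouldRetry(String peer) {
        Long since = suspected.get(peer);
        return since != null && System.currentTimeMillis() - since >= RETRY_INTERVAL_MS;
    }

    public static void main(String[] args) {
        LocalFailureDetector detector = new LocalFailureDetector();
        detector.onMessageFailure("B");                  // a request to B timed out
        System.out.println(detector.isSuspected("B"));   // true: use alternate nodes for B's partitions
        detector.onMessageSuccess("B");                  // a later probe of B succeeded
        System.out.println(detector.isSuspected("B"));   // false: B is usable again
    }
}
```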
4.9 Adding/Removing Storage Nodes
When a new node (say X) is added into the system, it gets assigned a number of tokens that are randomly scattered on the ring. For every key range that is assigned to node X, there may be a number of nodes (less than or equal to N) that are currently in charge of handling keys that fall within its token range. Due to the allocation of key ranges to X, some existing nodes no longer have to store some of their keys, and these nodes transfer those keys to X. Let us consider a simple bootstrapping scenario where node X is added to the ring shown in Figure 2 between A and B. When X is added to the system, it is in charge of storing keys in the ranges (F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no longer have to store the keys in these respective ranges. Therefore, nodes B, C, and D will offer the appropriate sets of keys to X and, upon confirmation from X, transfer them. When a node is removed from the system, the reallocation of keys happens in a reverse process.

Operational experience has shown that this approach distributes the load of key distribution uniformly across the storage nodes, which is important to meet the latency requirements and to ensure fast bootstrapping. Finally, by adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive any duplicate transfers for a given key range.
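To make the bootstrapping example concrete, the following sketch (illustrative only; the class name and token values are invented, and each node is given a single token for simplicity) computes the N ring segments that end at a newly added node's position, which are exactly the key ranges the node must receive. Running it for a node X placed between A and B on an A–G ring with N=3 prints (A, X], (G, A] and (F, G], matching the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/**
 * Illustrative sketch (not Dynamo's code) of which key ranges a newly added
 * node must receive on a consistent hashing ring with replication factor N.
 * A node stores every key whose preference list it belongs to, i.e. the N
 * consecutive ring segments that end at the node's own position.
 */
public class RingBootstrapSketch {

    /** Token position on the ring -> node name; assumes one token per node for simplicity. */
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String name, long token) {
        ring.put(token, name);
    }

    /** Returns the N ranges (exclusive start, inclusive end] that the node at 'token' must store. */
    public List<long[]> rangesStoredBy(long token, int n) {
        List<long[]> ranges = new ArrayList<>();
        long end = token;
        for (int i = 0; i < n; i++) {
            Long start = ring.lowerKey(end);
            if (start == null) {                      // wrap around the ring
                start = ring.lastKey();
            }
            ranges.add(new long[] { start, end });    // range (start, end]
            end = start;
        }
        return ranges;
    }

    public static void main(String[] args) {
        RingBootstrapSketch sketch = new RingBootstrapSketch();
        // Hypothetical token positions loosely following the paper's Figure 2 layout.
        String[] nodes = { "A", "B", "C", "D", "E", "F", "G" };
        for (int i = 0; i < nodes.length; i++) {
            sketch.addNode(nodes[i], (i + 1) * 100L);
        }
        sketch.addNode("X", 150L);                    // X joins between A (100) and B (200)
        // With N=3, X must receive (A, X], (G, A] and (F, G].
        for (long[] r : sketch.rangesStoredBy(150L, 3)) {
            System.out.println("(" + sketch.ring.get(r[0]) + ", " + sketch.ring.get(r[1]) + "]");
        }
    }
}
```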
5. IMPLEMENTATION
In Dynamo, each storage node has three main software components: request coordination, membership and failure detection, and a local persistence engine. All these components are implemented in Java.

Dynamo's local persistence component allows for different storage engines to be plugged in. Engines that are in use are Berkeley Database (BDB) Transactional Data Store (http://www.oracle.com/database/berkeley-db.html), BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store. The main reason for designing a pluggable persistence component is to choose the storage engine best suited for an application's access patterns. For instance, BDB can handle objects typically in the order of tens of kilobytes whereas MySQL can handle objects of larger sizes. Applications choose Dynamo's local persistence engine based on their object size distribution. The majority of Dynamo's production instances use BDB Transactional Data Store.

The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages similar to the SEDA architecture [24]. All communications are implemented using Java NIO channels. The coordinator executes the read and write requests on behalf of clients by collecting data from one or more nodes (in the case of reads) or storing data at one or more nodes (for writes). Each client request results in the creation of a state machine on the node that received the client request. The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client. Each state machine instance handles exactly one client request. For instance, a read operation implements the following state machine: (i) send read requests to the nodes, (ii) wait for the minimum number of required responses, (iii) if too few replies were received within a given time bound, fail the request, (iv) otherwise gather all the data versions and determine the ones to be returned and (v) if versioning is enabled, perform syntactic reconciliation and generate an opaque write context that contains the vector clock that subsumes all the remaining versions. For the sake of brevity the failure handling and retry states are left out.
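The read path of this state machine can be outlined in a few lines. The sketch below is a simplified, hypothetical rendering of steps (i)–(v), not the actual coordinator: replicas are polled synchronously to stand in for waiting on asynchronous responses, an empty reply stands in for a timed-out replica, and, as in the text, failure handling and retries are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/**
 * Illustrative sketch of the read state machine described above (not Dynamo's
 * coordinator): (i) send read requests, (ii) wait for the minimum number of
 * required responses, (iii) fail if too few arrive, (iv)/(v) otherwise gather
 * the versions and hand them to a reconciliation step.
 */
public class ReadCoordinatorSketch {

    enum State { SEND_REQUESTS, WAIT_FOR_RESPONSES, GATHER_VERSIONS, RECONCILE, DONE, FAILED }

    /** A stored version together with its opaque write context (e.g. a vector clock). */
    public record VersionedValue(String value, Object writeContext) {}

    /** Abstraction over a replica; a real coordinator would issue these reads asynchronously. */
    public interface Replica {
        Optional<VersionedValue> read(String key);
    }

    public Optional<List<VersionedValue>> read(String key, List<Replica> preferenceList, int requiredResponses) {
        State state = State.SEND_REQUESTS;
        List<VersionedValue> responses = new ArrayList<>();

        while (true) {
            switch (state) {
                case SEND_REQUESTS -> state = State.WAIT_FOR_RESPONSES;
                case WAIT_FOR_RESPONSES -> {
                    // Sequential calls stand in for waiting on asynchronous replies;
                    // a replica returning empty stands in for a timed-out reply.
                    for (Replica replica : preferenceList) {
                        replica.read(key).ifPresent(responses::add);
                    }
                    state = responses.size() >= requiredResponses ? State.GATHER_VERSIONS : State.FAILED;
                }
                case GATHER_VERSIONS -> state = State.RECONCILE;
                case RECONCILE -> {
                    // A real coordinator performs syntactic (vector clock) reconciliation here.
                    state = State.DONE;
                }
                case DONE -> { return Optional.of(responses); }
                case FAILED -> { return Optional.empty(); }
            }
        }
    }

    public static void main(String[] args) {
        Replica healthy = key -> Optional.of(new VersionedValue("cart-v1", "vector-clock-stub"));
        Replica down = key -> Optional.empty();   // stands in for an unreachable replica
        ReadCoordinatorSketch coordinator = new ReadCoordinatorSketch();
        // R = 2 out of a preference list of 3, one replica unavailable.
        System.out.println(coordinator.read("cart:123", List.of(healthy, healthy, down), 2));
    }
}
```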
Figure 4: Average and 99.9th percentiles of latencies for read and write requests during our peak request season of December 2006. The intervals between consecutive ticks in the x-axis correspond to 12 hours. Latencies follow a diurnal pattern similar to the request rate, and 99.9th percentile latencies are an order of magnitude higher than averages.

Figure 5: Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours. The intervals between consecutive ticks in the x-axis correspond to one hour.
After the read response has been returned to the caller, the state machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it.

As noted earlier, write requests are coordinated by one of the top N nodes in the preference list. Although it is desirable to always have the first node among the top N coordinate the writes, thereby serializing all writes at a single location, this approach has led to uneven load distribution, resulting in SLA violations. This is because the request load is not uniformly distributed across objects. To counter this, any of the top N nodes in the preference list is allowed to coordinate the writes. In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation, which is stored in the context information of the request. This optimization enables us to pick the node that has the data that was read by the preceding read operation, thereby increasing the chances of getting "read-your-writes" consistency. It also reduces variability in the performance of the request handling, which improves the performance at the 99.9th percentile.

6. EXPERIENCES & LESSONS LEARNED
Dynamo is used by several services with different configurations. These instances differ by their version reconciliation logic and read/write quorum characteristics. The following are the main patterns in which Dynamo is used:

• Business logic specific reconciliation: This is a popular use case for Dynamo. Each data object is replicated across multiple nodes. In case of divergent versions, the client application performs its own reconciliation logic. The shopping cart service discussed earlier is a prime example of this category. Its business logic reconciles objects by merging different versions of a customer's shopping cart.

• Timestamp based reconciliation: This case differs from the previous one only in the reconciliation mechanism. In case of divergent versions, Dynamo performs simple timestamp based reconciliation logic of "last write wins"; i.e., the object with the largest physical timestamp value is chosen as the correct version. The service that maintains customer session information is a good example of a service that uses this mode.

• High performance read engine: While Dynamo is built to be an "always writeable" data store, a few services are tuning its quorum characteristics and using it as a high performance read engine. Typically, these services have a high read request rate and only a small number of updates. In this configuration, R is typically set to 1 and W to N. For these services, Dynamo provides the ability to partition and replicate their data across multiple nodes, thereby offering incremental scalability. Some of these instances function as the authoritative persistence cache for data stored in more heavyweight backing stores. Services that maintain the product catalog and promotional items fit in this category.

The main advantage of Dynamo is that its client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability. For instance, the value of N determines the durability of each object. A typical value of N used by Dynamo's users is 3.

The values of W and R impact object availability, durability and consistency. For instance, if W is set to 1, then the system will never reject a write request as long as there is at least one node in the system that can successfully process a write request. However, low values of W and R can increase the risk of inconsistency, as write requests are deemed successful and returned to the clients even if they are not processed by a majority of the replicas. This also introduces a vulnerability window for durability when a write request is successfully returned to the client even though it has been persisted at only a small number of nodes.
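The interplay of N, R and W described above can be captured in a small configuration sketch. The class below is hypothetical (only the parameter names come from the paper): it records a per-instance (N, R, W) choice and checks the overlap property R + W > N noted earlier in the paper, which holds for the (3,2,2) configuration discussed below but not for a write-availability-oriented (3,1,1) setting.

```java
/**
 * Illustrative sketch (not part of Dynamo) of the per-instance quorum
 * configuration discussed above: N replicas per key, R responses required
 * for a read, W responses required for a write.
 */
public record QuorumConfig(int n, int r, int w) {

    public QuorumConfig {
        if (r < 1 || w < 1 || r > n || w > n) {
            throw new IllegalArgumentException("require 1 <= R <= N and 1 <= W <= N");
        }
    }

    /** R + W > N means every read quorum overlaps every write quorum. */
    public boolean readWriteQuorumsOverlap() {
        return r + w > n;
    }

    public static void main(String[] args) {
        // (N,R,W) = (3,2,2): the common configuration mentioned later in the text.
        QuorumConfig common = new QuorumConfig(3, 2, 2);
        // "High performance read engine" pattern: R = 1, W = N.
        QuorumConfig readEngine = new QuorumConfig(3, 1, 3);
        // W = 1: a write succeeds if any single node accepts it, at some durability risk.
        QuorumConfig writeAvailable = new QuorumConfig(3, 1, 1);

        System.out.println("(3,2,2) overlap: " + common.readWriteQuorumsOverlap());         // true
        System.out.println("(3,1,3) overlap: " + readEngine.readWriteQuorumsOverlap());     // true
        System.out.println("(3,1,1) overlap: " + writeAvailable.readWriteQuorumsOverlap()); // false
    }
}
```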
Figure 6: Fraction of nodes that are out-of-balance (i.e., nodes whose request load is above a certain threshold from the average system load) and their corresponding request load. The interval between ticks in the x-axis corresponds to a time period of 30 minutes.

Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not necessarily true here. For instance, the vulnerability window for durability can be decreased by increasing W. This may increase the probability of rejecting requests (thereby decreasing availability) because more storage hosts need to be alive to process a write request.

The common (N,R,W) configuration used by several instances of Dynamo is (3,2,2). These values are chosen to meet the necessary levels of performance, durability, consistency, and availability SLAs.

All the measurements presented in this section were taken on a live system operating with a configuration of (3,2,2) and running a couple hundred nodes with homogeneous hardware configurations. As mentioned earlier, each instance of Dynamo contains nodes that are located in multiple datacenters. These datacenters are typically connected through high speed network links. Recall that to generate a successful get (or put) response, R (or W) nodes need to respond to the coordinator. Clearly, the network latencies between datacenters affect the response time, and the nodes (and their datacenter locations) are chosen such that the applications' target SLAs are met.

6.1 Balancing Performance and Durability
While Dynamo's principal design goal is to build a highly available data store, performance is an equally important criterion in Amazon's platform. As noted earlier, to provide a consistent customer experience, Amazon's services set their performance targets at higher percentiles (such as the 99.9th or 99.99th percentiles). A typical SLA required of services that use Dynamo is that 99.9% of the read and write requests execute within 300ms.

Since Dynamo is run on standard commodity hardware components that have far less I/O throughput than high-end enterprise servers, providing consistently high performance for read and write operations is a non-trivial task. The involvement of multiple storage nodes in read and write operations makes it even more challenging, since the performance of these operations is limited by the slowest of the R or W replicas. Figure 4 shows the average and 99.9th percentile latencies of Dynamo's read and write operations during a period of 30 days. As seen in the figure, the latencies exhibit a clear diurnal pattern, which is a result of the diurnal pattern in the incoming request rate (i.e., there is a significant difference in request rate between the daytime and night). Moreover, the write latencies are higher than read latencies because write operations always result in disk access. Also, the 99.9th percentile latencies are around 200 ms and are an order of magnitude higher than the averages. This is because the 99.9th percentile latencies are affected by several factors such as variability in request load, object sizes, and locality patterns.

While this level of performance is acceptable for a number of services, a few customer-facing services required higher levels of performance. For these services, Dynamo provides the ability to trade off durability guarantees for performance. In this optimization, each storage node maintains an object buffer in its main memory. Each write operation is stored in the buffer and gets periodically written to storage by a writer thread. In this scheme, read operations first check if the requested key is present in the buffer. If so, the object is read from the buffer instead of the storage engine.

This optimization has resulted in lowering the 99.9th percentile latency by a factor of 5 during peak traffic, even for a very small buffer of a thousand objects (see Figure 5). Also, as seen in the figure, write buffering smoothes out higher percentile latencies. Obviously, this scheme trades durability for performance. In this scheme, a server crash can result in missing writes that were queued up in the buffer. To reduce the durability risk, the write operation is refined to have the coordinator choose one out of the N replicas to perform a "durable write". Since the coordinator waits only for W responses, the performance of the write operation is not affected by the performance of the durable write operation performed by a single replica.
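A minimal sketch of the buffered-write scheme follows, assuming a generic storage-engine interface (the class and method names are invented, not Dynamo's API): writes are acknowledged once buffered, a single writer thread periodically flushes the buffer to the storage engine, and reads consult the buffer first. The durable-write refinement performed by one of the N replicas is deliberately left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative write-back buffer (not Dynamo's implementation): writes land in
 * an in-memory buffer and a background writer thread periodically flushes them
 * to the storage engine; reads check the buffer before the storage engine.
 * A crash loses whatever is still queued in the buffer, which is exactly the
 * durability risk described in the text.
 */
public class BufferedStore {

    /** Minimal storage-engine abstraction standing in for BDB or MySQL. */
    public interface StorageEngine {
        void put(String key, byte[] value);
        byte[] get(String key);
    }

    private final StorageEngine engine;
    private final Map<String, byte[]> buffer = new ConcurrentHashMap<>();
    private final ScheduledExecutorService writer = Executors.newSingleThreadScheduledExecutor();

    public BufferedStore(StorageEngine engine, long flushIntervalMs) {
        this.engine = engine;
        // Writer thread drains the buffer on a fixed schedule.
        writer.scheduleAtFixedRate(this::flush, flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    /** Writes are acknowledged as soon as they are buffered. */
    public void put(String key, byte[] value) {
        buffer.put(key, value);
    }

    /** Reads prefer the buffer so they observe writes that have not been flushed yet. */
    public byte[] get(String key) {
        byte[] buffered = buffer.get(key);
        return buffered != null ? buffered : engine.get(key);
    }

    private void flush() {
        for (Map.Entry<String, byte[]> entry : buffer.entrySet()) {
            engine.put(entry.getKey(), entry.getValue());
            buffer.remove(entry.getKey(), entry.getValue());  // keep newer overwrites buffered
        }
    }

    public void shutdown() {
        writer.shutdown();
        flush();  // best-effort final drain
    }
}
```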
6.2 Ensuring Uniform Load Distribution
Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution. A uniform key distribution can help us achieve uniform load distribution assuming the access distribution of keys is not highly skewed. In particular, Dynamo's design assumes that even where there is a significant skew in the access distribution, there are enough keys in the popular end of the distribution so that the load of handling popular keys can be spread across the nodes uniformly through partitioning. This section discusses the load imbalance seen in Dynamo and the impact of different partitioning strategies on load distribution.

To study the load imbalance and its correlation with request load, the total number of requests received by each node was measured for a period of 24 hours, broken down into intervals of 30 minutes. In a given time window, a node is considered to be "in-balance" if the node's request load deviates from the average load by less than a certain threshold (here, 15%). Otherwise the node was deemed "out-of-balance". Figure 6 presents the fraction of nodes that are "out-of-balance" (henceforth, "imbalance ratio") during this time period. For reference, the corresponding request load received by the entire system during this time period is also plotted. As seen in the figure, the imbalance ratio decreases with increasing load. For instance, during low loads the imbalance ratio is as high as 20% and during high loads it is close to 10%. Intuitively, this can be explained by the fact that under high loads a large number of popular keys are accessed and, due to the uniform distribution of keys, the load is evenly distributed. However, during low loads (where load is 1/8th of the measured peak load), fewer popular keys are accessed, resulting in a higher load imbalance.
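The in-balance/out-of-balance bookkeeping amounts to a simple computation per measurement window. The sketch below is only an illustration of that arithmetic (the 15% threshold comes from the text; the node names and request counts are made up): it returns the fraction of nodes whose load deviates from the window's average by more than the threshold.

```java
import java.util.Map;

/**
 * Illustrative computation (not Dynamo's monitoring code) of the "imbalance
 * ratio": the fraction of nodes whose request load deviates from the average
 * load by more than a given threshold within one measurement window.
 */
public final class ImbalanceRatio {

    private ImbalanceRatio() {}

    public static double compute(Map<String, Long> requestsPerNode, double threshold) {
        double average = requestsPerNode.values().stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0);
        long outOfBalance = requestsPerNode.values().stream()
                .filter(load -> Math.abs(load - average) > threshold * average)
                .count();
        return (double) outOfBalance / requestsPerNode.size();
    }

    public static void main(String[] args) {
        // Hypothetical per-node request counts for one 30-minute window.
        Map<String, Long> window = Map.of("n1", 1000L, "n2", 1100L, "n3", 1300L, "n4", 600L);
        // 15% threshold, as in the text; here n3 and n4 are out-of-balance, so the ratio is 0.5.
        System.out.println("imbalance ratio = " + compute(window, 0.15));
    }
}
```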
Figure 7: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the
preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A,
B, and C form the preference list. Dark arrows indicate the token locations for various nodes.

This section discusses how Dynamo's partitioning scheme has evolved over time and its implications on load distribution.

Strategy 1: T random tokens per node and partition by token value: This was the initial strategy deployed in production (and described in Section 4.2). In this scheme, each node is assigned T tokens (chosen uniformly at random from the hash space). The tokens of all nodes are ordered according to their values in the hash space. Every two consecutive tokens define a range. The last token and the first token form a range that "wraps" around from the highest value to the lowest value in the hash space. Because the tokens are chosen randomly, the ranges vary in size. As nodes join and leave the system, the token set changes and consequently the ranges change. Note that the space needed to maintain the membership at each node increases linearly with the number of nodes in the system.

While using this strategy, the following problems were encountered. First, when a new node joins the system, it needs to "steal" its key ranges from other nodes. However, the nodes handing the key ranges off to the new node have to scan their local persistence store to retrieve the appropriate set of data items. Note that performing such a scan operation on a production node is tricky, as scans are highly resource intensive operations and they need to be executed in the background without affecting the customer performance. This requires us to run the bootstrapping task at the lowest priority. However, this significantly slows the bootstrapping process, and during the busy shopping season, when the nodes are handling millions of requests a day, the bootstrapping has taken almost a day to complete. Second, when a node joins/leaves the system, the key ranges handled by many nodes change and the Merkle trees for the new ranges need to be recalculated, which is a non-trivial operation to perform on a production system. Finally, there was no easy way to take a snapshot of the entire key space due to the randomness in key ranges, and this made the process of archival complicated. In this scheme, archiving the entire key space requires us to retrieve the keys from each node separately, which is highly inefficient.

The fundamental issue with this strategy is that the schemes for data partitioning and data placement are intertwined. For instance, in some cases, it is preferred to add more nodes to the system in order to handle an increase in request load. However, in this scenario, it is not possible to add nodes without affecting data partitioning. Ideally, it is desirable to use independent schemes for partitioning and placement. To this end, the following strategies were evaluated:

Strategy 2: T random tokens per node and equal sized partitions: In this strategy, the hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens. Q is usually set such that Q >> N and Q >> S*T, where S is the number of nodes in the system. In this strategy, the tokens are only used to build the function that maps values in the hash space to the ordered lists of nodes and not to decide the partitioning. A partition is placed on the first N unique nodes that are encountered while walking the consistent hashing ring clockwise from the end of the partition. Figure 7 illustrates this strategy for N=3. In this example, nodes A, B, C are encountered while walking the ring from the end of the partition that contains key k1. The primary advantages of this strategy are: (i) decoupling of partitioning and partition placement, and (ii) enabling the possibility of changing the placement scheme at runtime.

Strategy 3: Q/S tokens per node, equal-sized partitions: Similar to strategy 2, this strategy divides the hash space into Q equally sized partitions and the placement of partitions is decoupled from the partitioning scheme. Moreover, each node is assigned Q/S tokens, where S is the number of nodes in the system. When a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved. Similarly, when a node joins the system, it "steals" tokens from nodes in the system in a way that preserves these properties.
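A short sketch of the placement rule shared by strategies 2 and 3 follows, with invented names and toy numbers (it is not Dynamo's partitioner): the hash space is cut into Q equal partitions up front, and a partition's preference list is the first N unique nodes met while walking the token ring clockwise from the end of the partition, so tokens influence placement but not partitioning.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/**
 * Illustrative sketch (not Dynamo's code) of strategies 2/3: the hash space is
 * divided into Q equal-sized partitions, and each partition is placed on the
 * first N unique nodes encountered while walking the token ring clockwise from
 * the end of the partition. Tokens decide placement only, not partitioning.
 */
public class EqualSizedPartitionSketch {

    private final long hashSpaceSize;      // toy hash space size; a real system would use a large space
    private final int q;                   // number of equal-sized partitions
    private final TreeMap<Long, String> tokenRing = new TreeMap<>();  // token -> node

    public EqualSizedPartitionSketch(long hashSpaceSize, int q) {
        this.hashSpaceSize = hashSpaceSize;
        this.q = q;
    }

    public void addToken(long token, String node) {
        tokenRing.put(token, node);
    }

    /** Partition index for a key position: purely a function of Q, independent of tokens. */
    public int partitionOf(long keyPosition) {
        return (int) (keyPosition / (hashSpaceSize / q));
    }

    /** First N unique nodes clockwise from the end of the given partition. */
    public List<String> preferenceList(int partition, int n) {
        long partitionEnd = (partition + 1L) * (hashSpaceSize / q) - 1;
        Set<String> nodes = new LinkedHashSet<>();
        // Walk clockwise: tokens at or after the partition end, then wrap around.
        for (String node : tokenRing.tailMap(partitionEnd, true).values()) {
            if (nodes.size() < n) nodes.add(node);
        }
        for (String node : tokenRing.values()) {
            if (nodes.size() < n) nodes.add(node);
        }
        return new ArrayList<>(nodes);
    }

    public static void main(String[] args) {
        // Toy setup: hash space of 1000, Q = 10 partitions, a few nodes with arbitrary tokens.
        EqualSizedPartitionSketch sketch = new EqualSizedPartitionSketch(1000, 10);
        sketch.addToken(120, "A");
        sketch.addToken(340, "B");
        sketch.addToken(560, "C");
        sketch.addToken(780, "D");
        sketch.addToken(910, "E");

        int p = sketch.partitionOf(75);                       // key position 75 -> partition 0
        System.out.println("partition " + p + " -> " + sketch.preferenceList(p, 3));
    }
}
```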
The efficiency of these three strategies is evaluated for a system with S=30 and N=3. However, comparing these different strategies in a fair manner is hard, as different strategies have different configurations to tune their efficiency. For instance, the load distribution property of strategy 1 depends on the number of tokens (i.e., T) while strategy 3 depends on the number of partitions (i.e., Q). One fair way to compare these strategies is to evaluate the skew in their load distributions while all strategies use the same amount of space to maintain their membership information.
