
A Scalable Lock-free Stack Algorithm

Danny Hendler, School of Computer Science, Tel-Aviv University, Tel Aviv, Israel 69978
Nir Shavit, Tel-Aviv University & Sun Microsystems Laboratories
Lena Yerushalmi, School of Computer Science, Tel-Aviv University, Tel Aviv, Israel 69978

ABSTRACT
The literature describes two high performance concurrent stack algorithms based on combining funnels and elimination trees. Unfortunately, the funnels are linearizable but blocking, and the elimination trees are non-blocking but not linearizable. Neither is used in practice since they perform well only at exceptionally high loads. The literature also describes a simple lock-free linearizable stack algorithm that works at low loads but does not scale as the load increases. The question of designing a stack algorithm that is non-blocking, linearizable, and scales well throughout the concurrency range has thus remained open.

This paper presents such a concurrent stack algorithm. It is based on the following simple observation: a single elimination array used as a backoff scheme for a simple lock-free stack is lock-free, linearizable, and scalable. As our empirical results show, the resulting elimination-backoff stack performs as well as the simple stack at low loads, and increasingly outperforms all other methods (lock-based and non-blocking) as concurrency increases. We believe its simplicity and scalability make it a viable practical alternative to existing constructions for implementing concurrent stacks.

Categories and Subject Descriptors
C.1.4 [Computer Systems Organization]: Processor Architectures—Parallel Architectures, Distributed Architectures; E.1 [Data]: Data Structures—lists, stacks and queues

General Terms
Algorithms, theory, lock-freedom, scalability

1. INTRODUCTION
Shared stacks are widely used in parallel applications and operating systems. As shown in [21], LIFO-based scheduling not only reduces excessive task creation, but also prevents threads from attempting to dequeue and execute a task which depends on the results of other tasks. A concurrent shared stack is a data structure that supports the usual push and pop operations with linearizable LIFO semantics. Linearizability [11] guarantees that operations appear atomic and can be combined with other operations in a modular way.

When threads running a parallel application on a shared memory machine access the shared stack object simultaneously, a synchronization protocol must be used to ensure correctness. It is well known that concurrent access to a single object by many threads can lead to a degradation in performance [1, 9]. Therefore, in addition to correctness, synchronization methods should offer efficiency in terms of scalability and robustness in the face of scheduling constraints. Scalability at high loads should not, however, come at the price of good performance in the more common low contention cases.

Unfortunately, the two known methods for parallelizing shared stacks do not meet these criteria. The combining funnels of Shavit and Zemach [20] are linearizable [11] LIFO stacks that offer scalability through combining, but perform poorly at low loads because of the combining overhead. They are also blocking and thus not robust in the face of scheduling constraints [12]. The elimination trees of Shavit and Touitou [17] are non-blocking and thus robust, but the stack they provide is not linearizable, and it too has large overheads that cause it to perform poorly at low loads. On the other hand, the results of Michael and Scott [15] show that the best known low load method, the simple linearizable lock-free stack of Treiber [22], scales poorly due to contention and an inherent sequential bottleneck.

This paper presents the elimination-backoff stack, a new concurrent stack algorithm that overcomes the combined drawbacks of all the above methods. The algorithm is linearizable and thus easy to modularly combine with other algorithms; it is lock-free and hence robust; it is parallel and hence scalable; and it utilizes its parallelization construct adaptively, which allows it to perform well at low loads. The elimination-backoff stack is based on the following simple observation: a single elimination array [17], used as a backoff scheme for a lock-free stack [22], is both lock-free and linearizable. The introduction of elimination into the backoff process serves the dual purpose of adding parallelism and reducing contention, which, as our empirical results show, allows the elimination-backoff stack to outperform all algorithms in the literature at both high and low loads. We believe its simplicity and scalability make it a viable practical alternative to existing constructions for implementing concurrent stacks.

*This work was supported in part by a grant from Sun Microsystems.

SPAA'04, June 27–30, 2004, Barcelona, Spain.
Copyright 2004 Sun Microsystems, Inc. All rights reserved.
ACM 1-58113-840-7/04/0006.

1.1 Background
Generally, algorithms for concurrent data structures fall into two categories: blocking and non-blocking. There are several lock-based concurrent stack implementations in the literature. Typically, lock-based stack algorithms are expected to offer limited robustness, as they are susceptible to long delays and priority inversions [7].

Treiber [22] proposed the first non-blocking implementation of a concurrent list-based stack. He represented the stack as a singly-linked list with a top pointer and used compare-and-swap (CAS) to modify the value of the top atomically. No performance results were reported by Treiber for his non-blocking stack. Michael and Scott in [15] compared Treiber's stack to an optimized non-blocking algorithm based on Herlihy's general methodology [8], and to lock-based stacks. They showed that Treiber's algorithm yields the best overall performance, and that the performance gap increases as the amount of multiprogramming in the system increases. However, from their performance data it is clear that because of its inherent sequential bottleneck, the Treiber stack offers little scalability.

Shavit and Touitou [17] introduced elimination trees, scalable tree-like data structures that behave "almost" like stacks. Their elimination technique (which we will elaborate on shortly, as it is key to our new algorithm) allows highly distributed coupling and execution of operations with reverse semantics, like the pushes and pops on a stack. Elimination trees are lock-free, but not linearizable. In a similar fashion, Shavit and Zemach introduced combining funnels [20], and used them to provide scalable stack implementations. Combining funnels employ both combining [5, 6] and elimination [17] to provide scalability. They improve on elimination trees by being linearizable, but unfortunately they are blocking. As noted earlier, both [17] and [20] are directed at high-end scalability, resulting in overheads which severely hinder their performance under low loads.

The question of designing a practical lock-free linearizable concurrent stack that will perform well at both high and low loads has thus remained open.

1.2 The New Algorithm
Consider the following simple observation due to Shavit and Touitou [17]: if a push followed by a pop is performed on a stack, the data structure's state does not change (similarly for a pop followed by a push). This means that if one can cause pairs of pushes and pops to meet and pair up in separate locations, the threads can exchange values without having to touch a centralized structure, since they have anyhow "eliminated" each other's effect on it. Elimination can be implemented by using a collision array in which threads pick random locations in order to try and collide. Pairs of threads that "collide" in some location run through a lock-free synchronization protocol, and all such disjoint collisions can be performed in parallel. If a thread has not met another in the selected location, or if it met a thread with an operation that cannot be eliminated (such as two push operations), an alternative scheme must be used. In the elimination trees of [17], the idea is to build a tree of elimination arrays and use the diffracting tree paradigm of Shavit and Zemach [19] to deal with non-eliminated operations. However, as we noted, the overhead of such mechanisms is high, and they are not linearizable.

Figure 1: Schematic depiction of the elimination-backoff cycle (threads that fail on the lock-free stack back off to an elimination array, doubling or halving the range they use, and re-try the stack).

The new idea (see Figure 1) in this paper is strikingly simple: use a single elimination array as a backoff scheme on a shared lock-free stack. If the threads fail on the stack, they attempt to eliminate on the array, and if they fail in eliminating, they attempt to access the stack again, and so on. The surprising result is that this structure is linearizable: any operation on the shared stack can be linearized at the access point, and any pair of eliminated operations can be linearized when they met. Because it is a backoff scheme, it delivers the same performance as the simple stack at low loads. However, unlike the simple stack, it scales well as load increases because (1) the number of successful eliminations grows, allowing many operations to complete in parallel, and (2) contention on the head of the shared stack is reduced beyond levels achievable by the best exponential backoff schemes [1], since scores of backed-off operations are eliminated in the array and never re-attempt to access the shared structure.

1.3 Performance
We compared our new elimination-backoff stack algorithm to a lock-based implementation using Mellor-Crummey and Scott's MCS lock [13] and to several non-blocking implementations: the linearizable Treiber [22] algorithm with and without backoff, and the elimination tree of Shavit and Touitou [17]. Our comparisons were based on a collection of synthetic microbenchmarks executed on a 14-node shared memory machine. Our results, presented in Section 4, show that the elimination-backoff stack outperforms all three methods, and specifically the two lock-free methods, exhibiting almost three times the throughput at peak load. Unlike the other methods, it maintains constant latency throughout the concurrency range, and performs well also in experiments with unequal ratios of pushes and pops.

The remainder of this paper is organized as follows. In the next section we describe the new algorithm in depth. In Section 3, we sketch the adaptive strategies used in our implementation. In Section 4, we present our empirical results. Finally, in Section 5, we provide a proof that our algorithm has the required properties of a stack, is linearizable, and is lock-free.

2. THE ELIMINATION BACKOFF STACK

2.1 Data Structures
We now present our elimination backoff stack algorithm. Figure 2 specifies some type definitions and global variables.

struct Cell {
    Cell *pnext;
    void *pdata;
};

struct ThreadInfo {
    u_int id;
    char op;
    Cell cell;
    int spin;
};

struct Simple_Stack {
    Cell *ptop;
};

Simple_Stack S;
ThreadInfo **location;
int *collision;

Figure 2: Types and Structures

Our central stack object follows Treiber [22] and is implemented as a singly-linked list with a top pointer. The elimination layer follows Shavit and Touitou [17] and is built of two arrays: a global location[1..n] array, with an element per thread p ∈ {1..n} holding a pointer to p's ThreadInfo structure, and a global collision[1..size] array that holds the ids of the threads trying to collide. Each ThreadInfo record contains the thread id, the type of operation to be performed by the thread (push or pop), and the node for the operation. The spin variable holds the amount of time the thread should delay while waiting to collide.

2.2 Elimination Backoff Stack Code
We now provide the code of our algorithm, shown in Figures 3 and 4. As can be seen from the code, each thread first tries to perform its operation on the central stack object (line P1). If this attempt fails, the thread goes through the collision layer in the manner described below.

void StackOp(ThreadInfo *p) {
P1:  if (TryPerformStackOp(p) == FALSE)
P2:      LesOP(p);
P3:  return;
}

void LesOP(ThreadInfo *p) {
S1:  while (1) {
S2:      location[mypid] = p;
S3:      pos = GetPosition(p);
S4:      him = collision[pos];
S5:      while (!CAS(&collision[pos], him, mypid))
S6:          him = collision[pos];
S7:      if (him != EMPTY) {
S8:          q = location[him];
S9:          if (q != NULL && q->id == him && q->op != p->op) {
S10:             if (CAS(&location[mypid], p, NULL)) {
S11:                 if (TryCollision(p, q) == TRUE)
S12:                     return;
S13:                 else
S14:                     goto stack;
                 }
S15:             else {
S16:                 FinishCollision(p);
S17:                 return;
                 }
             }
         }
S18:     delay(p->spin);
S19:     if (!CAS(&location[mypid], p, NULL)) {
S20:         FinishCollision(p);
S21:         return;
         }
stack:
S22:     if (TryPerformStackOp(p) == TRUE)
             return;
     }
}

boolean TryPerformStackOp(ThreadInfo *p) {
     Cell *phead, *pnext;
T1:  if (p->op == PUSH) {
T2:      phead = S.ptop;
T3:      p->cell.pnext = phead;
T4:      if (CAS(&S.ptop, phead, &p->cell))
T5:          return TRUE;
T6:      else
T7:          return FALSE;
     }
T8:  if (p->op == POP) {
T9:      phead = S.ptop;
T10:     if (phead == NULL) {
T11:         p->cell = EMPTY;
T12:         return TRUE;
         }
T13:     pnext = phead->pnext;
T14:     if (CAS(&S.ptop, phead, pnext)) {
T15:         p->cell = *phead;
T16:         return TRUE;
         }
T17:     else {
T18:         p->cell = EMPTY;
T19:         return FALSE;
         }
     }
}

void FinishCollision(ThreadInfo *p) {
F1:  if (p->op == POP) {
F2:      p->cell = location[mypid]->cell;
F3:      location[mypid] = NULL;
     }
}

Figure 3: Elimination Backoff Stack Code - part 1

Initially, thread p announces its arrival at the collision layer by writing its current information to the location array (line S2). It then chooses a random location in the collision array (line S3). Thread p reads into him the id of the thread written at collision[pos] and tries to write its own id in place (lines S4 and S5). If it fails, it retries until it succeeds (lines S5 and S6).

After that, there are three main scenarios for thread actions, according to the information the thread has read; they are illustrated in Figure 5. If p reads the id of an existing thread q (i.e., him != EMPTY), p attempts to collide with q. The collision is accomplished by p first executing a read operation (line S8) to determine the type of the thread being collided with. As two threads can collide only if they have opposing operations, if q has the same operation as p, p waits for another collision (line S18). If no other thread collides with p during its waiting period, p clears its entry in the location array and tries to perform its operation on the central stack object. If p's entry cannot be cleared, it follows that p has been collided with, in which case p completes its operation and returns.

If q does have a complementary operation, p tries to eliminate by performing two CAS operations on the location array. The first clears p's entry, assuring no other thread will collide with it during its collision attempt (this eliminates race conditions). The second attempts to mark q's entry as "collided with p".
If both CAS operations succeed, the collision is successful, and p can return (in the case of a pop operation, it stores the value of the popped cell).

If the first CAS fails, it follows that some other thread r has already managed to collide with p. In that case, thread p acts as in the case of a successful collision, described above. If the first CAS succeeds but the second fails, then the thread with which p is trying to collide is no longer available for collision. In that case, p tries to perform the operation on the central stack object, returns in case of success, and repeatedly goes through the collision layer in case of failure.

boolean TryCollision(ThreadInfo *p, ThreadInfo *q) {
C1:  if (p->op == PUSH) {
C2:      if (CAS(&location[him], q, p))
C3:          return TRUE;
C4:      else
C5:          return FALSE;
     }
C6:  if (p->op == POP) {
C7:      if (CAS(&location[him], q, NULL)) {
C8:          p->cell = q->cell;
C9:          location[mypid] = NULL;
C10:         return TRUE;
         }
C11:     else
C12:         return FALSE;
     }
}

Figure 4: Elimination Backoff Stack Code - part 2

Figure 5: Collision scenarios (pairs of PUSH and POP operations arriving at the central stack object).

2.3 Memory Management and ABA Issues
In our implementation we use a very simple memory management mechanism: a pool of cells available for restricted use (similar to the pool introduced in [22]). When a thread needs a cell to perform a push operation on a stack, it removes a cell from the pool and uses it. When a thread pops a cell from the stack, it returns the cell to the pool. Note that cells are returned only by threads that performed pop operations, thus ensuring correctness in lines C8 and F2. Without this assumption we would need to copy the contents of the cell and not just its address. Though outside the scope of this paper, we note that one can use techniques such as those of Treiber [22], or more general techniques such as SMR [14] or ROP [10], to detect when a cell in the pool can be reused.
[14] or ROP [10].
As our algorithm is based on the compare-and-swap (CAS) operation, it must deal with the "ABA problem" [4]. If a thread reads the top of the stack, computes a new value, and then attempts a CAS on the top of the stack, the CAS may succeed when it should not: between the read and the CAS, some other thread(s) may have changed the value and then restored it to the previous one. The simplest and most common ABA-prevention mechanism is to include a tag with the target memory location, such that both are manipulated together atomically and the tag is incremented with each update of the target location [4]. The CAS operation is sufficient for such manipulation, as most current architectures that support CAS (Intel x86, Sun SPARC) support its operation on aligned 64-bit blocks. One can also eliminate ABA issues through general memory management techniques such as SMR [14] or ROP [10].
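As a concrete illustration of this tagging scheme (a sketch under our own assumptions, not the paper's code), a 32-bit pointer and a 32-bit tag can be packed into one aligned 64-bit block and updated with a single wide CAS; CAS64 stands in for whatever double-width compare-and-swap primitive the platform provides:

/* Illustrative tag-based ABA prevention; 32-bit pointers assumed. */
typedef struct {
    Cell    *ptr;   /* top-of-stack pointer */
    unsigned tag;   /* incremented on every successful update */
} TaggedTop;        /* occupies one aligned 64-bit block */

boolean PushTagged(TaggedTop *top, Cell *c) {
    TaggedTop old, new;
    old = *top;              /* pointer and tag are read together */
    c->pnext = old.ptr;
    new.ptr = c;
    new.tag = old.tag + 1;   /* a recycled pointer value carries a new
                                tag, so a stale CAS fails as it should */
    return CAS64(top, old, new);
}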
3. ADAPTIVE ELIMINATION BACKOFF
The classical approach to handling load is backoff, and specifically exponential backoff [1]. In a regular backoff scheme, once contention is detected on the central stack, threads back off in time. Here, threads back off in both time and space, in an attempt both to reduce the load on the centralized data structure and to increase the probability of concurrent colliding. Our backoff parameters are thus the width of the collision layer and the delay at the layer.

The elimination backoff stack has a simple structure that naturally fits with a localized adaptive policy for setting parameters, similar to the strategy used by Shavit and Zemach for combining funnels in [20]. Decisions on parameters are made locally by each thread, and the collision layer does not actually grow or shrink. Instead, each thread independently chooses a subrange of the collision layer it will map into, centered around the middle of the array and limited by the maximal array width. It is possible for threads to have different ideas about the collision layer's width, and particularly bad scenarios might lead to bad performance, but as we will show, the overall performance is superior to that of exponential backoff schemes [1]. Our policy is to first attempt to access the central stack object, and only if that fails to back off to the elimination array. This allows us, in case of low loads, to avoid the collision array altogether, thus achieving the latency of a simple stack (in comparison, [20] is at best three times slower than a simple stack).

One way of adaptively changing the width of the collision layer is the following (a sketch appears below). Each thread t keeps a value, 0 < factor < 1, by which it multiplies the collision layer width to choose the interval into which it will randomly map to try and collide (e.g., if factor = 0.5, only half the width is used). When t fails to collide because it did not encounter another thread, it increments a private counter. When the counter exceeds some limit, factor is halved and the counter is reset to its initial value. If, on the other hand, t encountered some other thread u performing the opposite operation type but failed to collide with it (the most probable reason being that some other thread v succeeded in colliding with u before t), the counter is decremented, and when it reaches 0, factor is doubled and the counter is reset to its initial value.
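The following is a minimal sketch of such a policy; the constants, the counter's initial value, and the AdaptWidth name are illustrative assumptions of ours, not parameters from the paper:

#define COUNTER_INIT  4       /* illustrative values */
#define COUNTER_LIMIT 8

static __thread double factor  = 0.5;    /* fraction of the layer used */
static __thread int    counter = COUNTER_INIT;

/* Called after each failed elimination attempt by this thread. */
void AdaptWidth(int met_partner_but_lost) {
    if (met_partner_but_lost) {
        /* a partner was present but another thread collided with it
           first: the subrange is too crowded, so widen it */
        if (--counter == 0) {
            if (factor * 2 <= 1.0) factor *= 2;
            counter = COUNTER_INIT;
        }
    } else {
        /* no partner showed up: the subrange is too sparse, shrink it */
        if (++counter > COUNTER_LIMIT) {
            factor /= 2;
            counter = COUNTER_INIT;
        }
    }
}

GetPosition would then draw its slot from a subrange of width proportional to factor, centered around the middle of the collision array.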
The second part of our strategy is the dynamic update of the delay time for attempting to collide in the array, a technique used by Shavit and Zemach for diffracting trees in [18, 19]. One way of doing that is the following.

Each thread t keeps a value spin, which holds the amount of time that t should delay while waiting to be collided with. The spin value may change within a predetermined range. When t successfully collides, it increments a local counter. When the counter exceeds some limit, t doubles spin. If t fails to collide, it decrements the local counter. When the counter decreases below some limit, spin is halved. This localized version of exponential backoff serves a dual role: it increases the chance of successful eliminations, and it plays the role of a backoff mechanism on the central stack structure. A sketch appears below.
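A minimal sketch of this spin adaptation, again with illustrative constants and names of our own choosing:

#define SPIN_MIN  32      /* bottom of the predetermined range */
#define SPIN_MAX  4096    /* top of the predetermined range */
#define HIT_LIMIT 4

static __thread int hits;    /* per-thread success counter */

/* Called after each elimination attempt; p->spin is the delay
   used in line S18. */
void AdaptSpin(ThreadInfo *p, int collided) {
    if (collided) {
        /* eliminations are paying off: wait longer in the array */
        if (++hits > HIT_LIMIT && p->spin * 2 <= SPIN_MAX) {
            p->spin *= 2;
            hits = 0;
        }
    } else {
        /* eliminations are rare: give up on the array sooner */
        if (--hits < -HIT_LIMIT && p->spin / 2 >= SPIN_MIN) {
            p->spin /= 2;
            hits = 0;
        }
    }
}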
There are obviously other conceivable ways of adaptively updating these parameters, and this is a subject for further research.
4. PERFORMANCE
We evaluated the performance of our elimination-backoff stack algorithm relative to other known methods by running a collection of synthetic benchmarks on a 14-node Sun Enterprise E6500, an SMP machine formed from 7 boards of two 400MHz UltraSPARC processors, connected by a crossbar UPA switch, and running Solaris 9. Our C code was compiled by the Sun cc compiler 5.3, with flags -xO5 -xarch=v8plusa.

4.1 The Benchmarked Algorithms
We compared our stack implementation to the lock-free but non-linearizable elimination tree of Shavit and Touitou [17] and to two linearizable methods: a serial stack protected by an MCS lock [13], and a non-blocking implementation due to Treiber [22].

• MCS: A serial stack protected by an MCS queue-lock [13]. Each processor locks the top of the stack, changes it according to the type of the operation, and then unlocks it. The lock code was taken directly from the article.

• Treiber: Our implementation of Treiber's non-blocking stack followed the code given in [22]. We added to it an exponential backoff scheme, as introduced in [2].

• ETree: An elimination tree [17] based stack. Its parameters were chosen so as to optimize its performance, based on empirical testing.

4.2 The Produce-Consume Benchmark
In the produce-consume benchmark, each thread alternately performs a push or pop operation and then waits for a period of time whose length is chosen uniformly at random from the range [0 . . . workload]. The waiting period simulates the local work that is typically done by threads in real applications between stack operations (see Figure 6; a C rendering of the loop follows the figure). In all our experiments the stack was initialized as sufficiently filled to prevent it from becoming empty during the run.

repeat
    op := random(push, pop)
    perform op
    w := random(0..workload)
    wait w millisecs
until 500000 operations performed

Figure 6: Produce-Consume benchmark
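This is one way the loop of Figure 6 could be written in C; the sketch uses ThreadInfo, StackOp, PUSH and POP from the paper's code, while the per-thread PRNG and the sleep call are our assumptions, and cell recycling through the pool of Section 2.3 is omitted for brevity:

#include <stdlib.h>
#include <unistd.h>

void BenchmarkThread(ThreadInfo *p, int workload) {
    unsigned seed = p->id;                        /* per-thread PRNG state */
    for (int i = 0; i < 500000; i++) {
        p->op = (rand_r(&seed) % 2) ? PUSH : POP; /* op := random(push, pop) */
        StackOp(p);                               /* perform op */
        int w = rand_r(&seed) % (workload + 1);   /* w := random(0..workload) */
        usleep(w * 1000);                         /* wait w milliseconds */
    }
}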
Figure 7: Throughput and latency of different stack implementations with varying number of threads. Each thread performs 50% pushes, 50% pops.

4.3 Measuring the Performance of the Benchmarked Algorithms
We ran the produce-consume benchmark specified above, varying the number of threads and measuring latency, the average amount of time spent per operation, and throughput, the number of operations per second. We compute throughput and latency by measuring the total time required to perform the specified number of operations by each thread, and we refer to the longest such time as the time needed to complete the specified amount of work.

To counteract transient startup effects, we synchronized the start of the threads (i.e., no thread can start before all other threads finished their initialization phase). Each data point is the average of three runs, with the results varying by at most 1.4% throughout all our benchmarks.

4.4 Empirical Results
Figure 7 shows the results of a benchmark in which half a million operations were performed by every working thread, with each thread performing 50% pushes and 50% pops on average. Figure 9 provides a detailed view of the three best performers. From Figure 7 it can be seen that our results for known structures generally conform with those of [15, 16], and that Treiber's algorithm with added exponential backoff is the best among known techniques. It can also be seen that the new algorithm provides superior scalable performance at all tested concurrency levels. The throughput gap between our algorithm and Treiber's algorithm with backoff grows as concurrency increases, and at 32 threads the new algorithm is almost three times faster. Such a significant gap in performance can be explained by reviewing the difference in latency for the two algorithms.

Table 1 shows latency measured on a single dedicated processor.
Table 1: Latency on a single processor (no contention).

    New algorithm           370
    Treiber with backoff    380
    MCS                     546
    Treiber                 380
    ETree                  6850

The new algorithm and Treiber's algorithm with backoff have about the same latency, and outperform all others. The new algorithm achieves this good performance because elimination backoff (unlike the elimination used in structures such as combining funnels and elimination trees) is used only as a backoff scheme and introduces no overhead. The gap between these two algorithms and MCS and ETree is mainly due to the fact that a push or a pop in our algorithm and in Treiber's algorithm typically needs to access only two cache lines in the data structure, while a lock-based algorithm has the overhead of accessing lock variables as well. The ETree has the overhead of travelling through the tree.

As Figure 9 shows, as the level of concurrency increases, the latency of Treiber's algorithm grows, since the head of the stack, even with contention removed, is a sequential bottleneck. The new algorithm, on the other hand, has an increased rate of successful collisions on the elimination array as concurrency increases. As Table 2 shows, the fraction of successfully eliminated operations increases from only 11% for two threads up to 43% for 32 threads. The increased elimination level means that increasing numbers of threads complete their operations quickly and in parallel, keeping latency fixed and increasing overall throughput.

Table 2: Fraction of successfully eliminated operations per concurrency level.

    2 threads     11%
    4 threads     24%
    8 threads     32%
    14 threads    37%
    32 threads    43%

We also tested the robustness of the algorithms under workloads with an imbalanced distribution of push and pop operations. Such imbalanced workloads are not favorable for the new algorithm because of the smaller chance of successful collision. From Figure 8 it can be seen that the new algorithm still scales, but at a slower rate. The slope of the latency curve for our algorithm is 0.13 µsec per thread, while the slope of the latency curve for Treiber's algorithm is 0.3 µsec per thread, explaining the difference in throughput as concurrency increases.

Figure 8: Throughput and latency under varying distribution of operations: 25% pushes, 75% pops.

In Figure 10 we compare the various methods as access patterns become sparse and the load decreases. Under low load, when workload = 1000, all the algorithms (except the elimination tree) maintain an almost constant latency as the level of concurrency increases, because of the low contention. The decrease in the latency of the elimination tree w.r.t. the case of workload = 0 is smaller, because of the lower levels of elimination. In contrast, the adverse effect of the sparse access pattern on our algorithm's latency is small, because our algorithm uses the collision layer only as a backup if it failed to access the central stack object, and the rate of such failures is low when the overall load is low.

To further test the effectiveness of our policy of using elimination as a backoff scheme, we measured the fraction of operations that failed on their first attempt to change the top of the stack. As seen in Figure 11, this fraction is low under low loads (as can be expected) and grows together with load, and, perhaps unexpectedly, is lower than in Treiber's algorithm. This is a result of using the collision layer as the backoff mechanism in the new algorithm, as opposed to regular backoff, since in the new algorithm some of the failed threads are eliminated and do not interfere with the attempts of newly arrived threads to modify the stack. These results further justify the choice of elimination as a backoff scheme.

To study the behavior of our adaptation strategy, we conducted a series of experiments to hand-pick the "optimized parameter set" for each level of concurrency. We then compared the performance of the elimination backoff stack with an adaptive strategy to an optimized elimination backoff stack. These results are summarized in Figure 12. Comparing the latency of the best set of parameters to that achieved using adaptation, we see that the adaptive strategy is about 2.5% to 4% slower.

From these results we conclude that our adaptation techniques appear to work reasonably well. Based on the above benchmarks, we conclude that for the concurrency range we tested, elimination backoff is the algorithm of choice for implementing linearizable stacks.
Figure 9: Detailed graph of latency with threads performing 50% pushes, 50% pops.

Figure 10: Workload = 1000.

Figure 11: Fraction of failures on first attempt.

Figure 12: Comparison of algorithm latency achieved by hand-picked parameters with that achieved by using an adaptive strategy.

5. CORRECTNESS PROOF
This section contains a formal proof that our algorithm is a lock-free linearizable implementation of a stack. For lack of space, the proofs of a few lemmata are omitted and will appear in the full paper.

Our model for multithreaded computation follows [11], though for brevity and accessibility we will use operational style arguments. In our proof we will ignore issues relating to the ABA problem typical of implementations using the CAS operation. As described earlier (Section 2.3), there are several standard techniques for overcoming the ABA problem [10, 14]. A concurrent stack is a data structure whose operations are linearizable [11] to those of the sequential stack as defined in [3]. The following is a sequential specification of a stack object.

Definition 5.1. A stack S is an object that supports two types of operations on it: push and pop. The state of a stack is a sequence of items S = ⟨v0, ..., vk⟩. The stack is initially empty. The push and pop operations induce the following state transitions of the sequence S = ⟨v0, ..., vk⟩, with appropriate return values:

• push(vnew) changes S to be ⟨v0, ..., vk, vnew⟩;

• pop(), if S is not empty, changes S to be ⟨v0, ..., vk-1⟩ and returns vk; if S is empty, it returns empty and S remains unchanged.

We note that a set is a relaxation of a stack that does not require LIFO ordering. We begin by proving that our algorithm implements a concurrent set, without considering a linearization order. We then prove that our stack implementation is linearizable to the sequential stack specification of Definition 5.1. Finally, we prove that our implementation is lock-free.

5.1 Correct Set Semantics
We now prove that our algorithm has correct set semantics, i.e., that pop operations can only pop items that were previously pushed, and that items pushed by push operations are not duplicated. This is formalized in the following definition.¹

Definition 5.2. A stack algorithm has correct set semantics if the following requirements are met for all stack operations:

1. Let Op be a pop operation that returns an item i; then i was previously pushed by a push operation.

2. Let Op be a push operation that pushed an item i to the stack; then there is at most a single pop operation that returns i.

We call any operation that complies with the above requirements a correct set operation.

Lemma 5.1. Operations that modify the central stack object are correct set operations.

Proof. Follows from the correctness of Treiber's algorithm [22].

¹For simplicity we assume all items are unique, but the proof can easily be modified to work without this assumption.
achieved by using an adaptive strategy

In the following, we prove that operations that exchange their values through collisions are also correct set operations; thus we show that our algorithm has correct set semantics. We first need the following definitions.

Definition 5.3. We say that two operations op1 and op2 have collided if they have exchanged their values and have not modified the central stack object; we say that each of op1, op2 is a colliding operation.

Definition 5.4. We say that a colliding operation op is active if it executes a successful CAS in line C2 or C7. We say that a colliding operation is passive if it fails in the CAS of line S10 or S19.

Definition 5.5. A state s of the algorithm in an n-thread system is a vector of size n, with entry i, 1 ≤ i ≤ n, representing the state of thread i. The state of thread i in s consists of the values of thread i's data structures and of the value of thread i's program counter.

Definition 5.6. Let op be an operation performed by thread t. We say that op is trying to collide at state s if, in s, the value of t's program counter is pointing at a statement of one of the following procedures: LesOP, TryCollision, FinishCollision. Otherwise, we say that op is not trying to collide at s.

We next prove that operations can only collide with operations of the opposite type. First we need the following technical lemma.

Lemma 5.2. Every colliding operation op is either active or passive, but not both.

Proof. Clearly from the code, a colliding operation is active and/or passive. We have to show that it cannot be both. Suppose that the operation op is passive; then op fails the CAS of line S10 or that of line S19. Clearly from the code, op then calls FinishCollision and exits; therefore op cannot play an active-collider role after playing a passive-collider role. Suppose now that op is active. From Definition 5.4, it executes a successful CAS in line C2 or C7. It is clear from the code that in this case op returns TRUE from TryCollision and does not reach line S10 or S19 afterwards (it returns in line S12). So op cannot play a passive-collider role after playing an active-collider role.

Lemma 5.3. Operations can only collide with operations of the opposite type: an operation that performs a push can only collide with operations that perform a pop, and vice versa.

Proof. Let us consider some operation, op, that collides. From the code, in order to successfully collide, op must either succeed in performing TryCollision or execute FinishCollision. We now examine both cases.

• TryCollision can succeed only in case of a successful CAS in line C2 (for a push operation) or in line C7 (for a pop operation). Such a CAS changes the value of the other thread's cell in the location array, thus exchanging values with it, and returns without modifying the central stack object. From the code, before calling TryCollision, op has to execute line S9, thus verifying that it collides with an operation of the opposite type.

• If op is a passive colliding operation, then op performs FinishCollision, which implies that op failed in resetting its entry in the location array (in line S10 or S19). Let op1 be the operation that has caused op's failure by writing to its entry. From the code, op1 must have succeeded in TryCollision; thus, it has verified in line S9 that its type is opposite to that of op.

The proofs of the following three technical lemmata are omitted for lack of space.

Lemma 5.4. An operation terminates without modifying the central stack object if and only if it collides with another operation.

Lemma 5.5. For every thread p and in any state s, if p is not trying to collide in s, then it holds in s that the element corresponding to p in the location array is NULL.

Lemma 5.6. Let op be a push operation by some thread p; if location[p] ≠ NULL, then op is trying to push the value location[p]->cell.pdata.

In the next three lemmata, we show that push and pop operations are paired correctly during collisions.

Lemma 5.7. Every passive collider collides with exactly one active collider.

Proof. Assume by contradiction that some passive collider, op1, collides with multiple other operations, and let op2, op3 be the last two operations that succeed in colliding with op1. We denote the element written by op1 to the location array by l_op1. We consider the following two possibilities.

• Assume op1 is a passive collider performing a pop operation. From Lemma 5.3, both op2 and op3 are push operations. From Lemma 5.2, op1 cannot be both active and passive. Thus op1 exchanges values only in line F2, with the last operation that has written to its entry in the location array. As both op2 and op3 are active colliders performing a push, both succeed in the CAS of line C2. As op3 succeeds in colliding with op1 after op2 does, the q parameter used in the CAS of op3 at line C2 must be the value written by op2 in its successful CAS of line C2. This is impossible, because in line S9 op3 verifies that q is of type pop, but op2 is performing a push.

• Otherwise, assume op1 is a passive collider performing a push operation. From Lemma 5.3, both op2 and op3 perform a pop operation. Thus it must be that both op2 and op3 succeed in the CAS of line C7. This implies that both succeed in writing NULL to the entry of op1's thread in the location array. This, however, implies that the q parameter used by op3 in line C7 is NULL, which is impossible, since in this case op3 would have failed the check in line S9.

Lemma 5.8. Every active collider op1 collides with exactly one passive collider.

Proof. The proof is by contradiction. Assume that an active collider, op1, collides with two operations op2 and op3. From Lemma 5.3, both op2 and op3 are passive; hence both op2 and op3 have failed while executing the CAS in line S10 or S19. It follows that op1 must have written its value to the location array twice. From the code, this is impossible, because op1 can perform such a write only in line C2 or C9, and it exits immediately after.

Lemma 5.9. Every colliding operation op participates in exactly one collision with an operation of the opposite type.

Proof. Follows from Lemmata 5.2, 5.7 and 5.8.

We now prove that, when colliding, opposite operations exchange values in a proper way.

Lemma 5.10. If a pop operation collides, it obtains the value of the single push operation it collided with.

Proof. Let op1 and op2 respectively denote the pop operation and the push operation that collided with it. Also, let p1 and p2 respectively denote the threads that perform op1 and op2. We denote the entry corresponding to p1 in the location array by l_p1, and the entry corresponding to p2 by l_p2.

Assume that op1 is a passive collider; then from Lemma 5.9 it collides with a single active push collider, op2. As op1 succeeds in colliding, it obtains in line F2 the cell that was written to its entry in the location array by op2.

Assume that op1 is an active collider; then from Lemma 5.9 it collides with a single passive push collider, op2. As op1 succeeds in colliding, it succeeds in the CAS of line C7 and thus returns the cell that was written by op2.

Lemma 5.11. If a push operation collides, its value is obtained by the single pop operation it collided with.

Proof. Symmetric to the proof of Lemma 5.10.

We can now finally prove that our algorithm has correct set semantics.

Theorem 5.12. The elimination-backoff stack has correct set semantics.

Proof. From Lemma 5.1, all operations that modify the central stack object are correct set operations. From Lemmata 5.10 and 5.11, all colliding operations are correct set operations. Thus, all operations on the elimination-backoff stack are correct set operations, and so, from Definition 5.2, the elimination-backoff stack has correct set semantics.

5.2 Linearizability
Given a sequential specification of a stack, we provide specific linearization points mapping operations in our concurrent implementation to sequential operations so that the histories meet the specification. Specifically, we choose the following linearization points for all operations except passive colliders:

• lines T4 and C2 (for a push operation);

• lines T10, T14 and C7 (for a pop operation).

For a passive-collider operation, we set the linearization point to be at the time of linearization of the matching active-collider operation, with the push colliding operation linearized before the pop colliding operation.

Each push or pop operation consists of a while loop that repeatedly attempts to complete the operation. An iteration is successful if its attempt succeeds, in which case the operation returns at that iteration; otherwise, another iteration is performed. Each completed operation has exactly one successful attempt (its last attempt), and the linearization of the operation occurs in that attempt. In other words, the operations are linearized at the aforementioned linearization points only in case of a successful CAS, which can only be performed in the last iteration of the while loop.

We note that, from Definition 5.3, a successful collision does not change the state of the central stack object. It follows that at any point in time, the state of the stack is determined solely by the state of its central stack object.

To prove that the aforementioned lines are correct linearization points of our algorithm, we need to prove that these are correct linearization points for the two types of operations: operations that complete by modifying the central stack object, and operations that exchange values through collisions.

Lemma 5.13. For operations that do not collide, we can choose the following linearization points:

• line T4 (for a push operation);

• line T10 (in case of an empty stack) or line T14 (for a pop operation).

Proof. Follows directly from the linearizability of Treiber's algorithm [22].

We still have to prove that the linearization points for collider operations are consistent, both with one another and with non-colliding operations. We need the following technical lemma, whose proof is omitted for lack of space.

Lemma 5.14. Let op1, op2 be a colliding operations pair, and assume w.l.o.g. that op1 is the active collider and op2 is the passive collider; then the linearization point of op1 (as defined above) is within the time interval of op2.

Lemma 5.15. The following are legal linearization points for collider operations:

• An active collider, op1, is linearized at line C2 (in case of a push operation) or at line C7 (in case of a pop operation).

• A passive collider, op2, is linearized at the linearization time of the active collider it collided with. If op2 is a push operation, it is linearized immediately before op1; otherwise, it is linearized immediately after op1.

Proof. To simplify the proof and avoid the need for backward-simulation style arguments, we consider only complete execution histories, that is, ones in which all abstract operations have completed, so we can look "back" at the execution and say for each operation where it happened.

We first note that according to Lemma 5.14, the linearization point of the passive collider is well-defined (it is obviously well-defined for the active collider). We need to prove the correct LIFO ordering between two linearized collided operations.

As we linearize the passive collider at the linearization point of its counterpart active collider, no other operations can be linearized between op1 and op2; as the push operation is linearized just before the pop operation, this is a legal LIFO matching that cannot interfere with the LIFO matching of other collider pairs or that of non-collider operations. Finally, from Lemma 5.10, the pop operation indeed obtains the value of the operation it collided with.

Theorem 5.16. The elimination-backoff stack is a correct linearizable implementation of a stack object.

Proof. Immediate from Lemmata 5.13 and 5.15.

5.3 Lock Freedom

Theorem 5.17. The elimination-backoff stack algorithm is lock-free.

Proof. Let op be some operation. We show that in every iteration made by op, some operation performs its linearization point; thus the system as a whole makes progress. If op manages to collide, then op's linearization has occurred, and op does not iterate anymore before returning. Otherwise, op calls TryPerformStackOp. If TryPerformStackOp returns TRUE, op immediately returns, and its linearization has occurred. If, on the other hand, TryPerformStackOp returns FALSE, this implies that the CAS performed by it has failed, and the only possible reason for the failure of the CAS by op is the success of a CAS on phead by some other operation. Thus, whenever op completes a full iteration, some operation is linearized.

6. REFERENCES
[1] A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In Proceedings of the 16th Symposium on Computer Architecture, pages 41–55, June 1989.
[2] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6–16, January 1990.
[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press, Cambridge, Massachusetts, 2002.
[4] IBM Corporation. IBM System/370 Extended Architecture, Principles of Operation. IBM Publication No. SA22-7085, 1983.
[5] J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-III, pages 64–75, 1989.
[6] A. Gottlieb, B. D. Lubachevsky, and L. Rudolph. Efficient techniques for coordinating sequential processors. ACM TOPLAS, 5(2):164–189, April 1983.
[7] M. Greenwald. Non-Blocking Synchronization and System Design. PhD thesis, Stanford University Technical Report STAN-CS-TR-99-1624, Palo Alto, CA, August 1999.
[8] M. Herlihy. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems, 15(5):745–770, November 1993.
[9] M. Herlihy, B.-H. Lim, and N. Shavit. Scalable concurrent counting. ACM Transactions on Computer Systems, 13(4):343–364, 1995.
[10] M. Herlihy, V. Luchangco, and M. Moir. The repeat-offender problem: a mechanism for supporting dynamic-sized, lock-free data structures. Technical Report TR-2002-112, Sun Microsystems, September 2002.
[11] M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3):463–492, 1990.
[12] B.-H. Lim and A. Agarwal. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Transactions on Computer Systems, 11(3):253–294, August 1993.
[13] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(1):21–65, 1991.
[14] M. M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing, pages 21–30. ACM Press, 2002.
[15] M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 51(1):1–26, 1998.
[16] M. Scott and W. Scherer. User-level spin locks for large commercial applications. In SOSP, Work-in-progress talk, 2001.
[17] N. Shavit and D. Touitou. Elimination trees and the construction of pools and stacks. Theory of Computing Systems, (30):645–670, 1997.
[18] N. Shavit, E. Upfal, and A. Zemach. A steady state analysis of diffracting trees. Theory of Computing Systems, 31(4):403–423, 1998.
[19] N. Shavit and A. Zemach. Diffracting trees. ACM Transactions on Computer Systems, 14(4):385–428, 1996.
[20] N. Shavit and A. Zemach. Combining funnels: A dynamic approach to software combining. Journal of Parallel and Distributed Computing, (60):1355–1387, 2000.
[21] K. Taura, S. Matsuoka, and A. Yonezawa. An efficient implementation scheme of concurrent object-oriented languages on stock multicomputers. In Principles and Practice of Parallel Programming, pages 218–228, 1993.
[22] R. K. Treiber. Systems programming: Coping with parallelism. Technical Report RJ 5118, IBM Almaden Research Center, April 1986.
