Good References For Concurrent Programming
Abstract—The actor programming model is gaining popularity due to the prevalence of multi-core systems along with the rising need for highly scalable and distributed applications. Frameworks such as Akka, Orleans, Pony, and the C++ Actor Framework (CAF) have been developed to address these application requirements. Each framework provides a runtime system to schedule and run millions of actors, potentially on multi-socket platforms with non-uniform memory access (NUMA). However, the literature provides only limited research that studies or improves the performance of actor-based applications on NUMA systems. This paper studies the performance penalties that are imposed on actors running on a NUMA system and characterizes applications based on the actor type, behavior, and communication pattern. This information is used to identify workloads that can benefit from improved locality on a NUMA system. In addition, two locality- and NUMA-aware work-stealing schedulers are proposed and their respective execution overhead in CAF is studied on both AMD and Intel machines. The performance of the proposed work-stealing schedulers is evaluated against the default scheduler in CAF.

Keywords-Actors, Scheduling, NUMA, Locality

I. INTRODUCTION

Modern computers utilize multi-core processors to increase performance, because the breakdown of Dennard scaling [1] makes substantial increases of clock frequencies unlikely and because the depth and complexity of instruction pipelines have also reached a breaking point. Other hardware scaling limitations have led to the emergence of non-uniform memory access (NUMA) multi-chip platforms [2] as a trade-off between low-latency and symmetric memory access. Multi-core and multi-chip computing platforms provide a shared memory interface across a non-uniform memory hierarchy by way of hardware-based cache coherence [3].

The actor model of computing [4], [5], [6] is a model for writing concurrent applications for parallel and distributed systems. The actor model provides a high-level abstraction of concurrent tasks where information is exchanged by message passing, in contrast to task parallelism where tasks communicate by sharing memory. A fundamental building block of any software framework implementing the actor model is the proper distribution and scheduling of actors on multiple underlying processors (often represented by kernel/system threads) and the efficient use of system resources to achieve the best possible runtime performance. The C++ Actor Framework (CAF) [7] provides a runtime for the actor model and has a low memory footprint and improved CPU utilization to support a broad range of applications.

This paper studies the challenges for a popular type of actor scheduler in CAF, work-stealing, on NUMA platforms. Actor models have specific characteristics that need to be taken into account when designing scheduling policies. The contributions are: a) a structured presentation of these characteristics, b) an improved hierarchical scheduling policy for actor runtime systems, and c) an experimental evaluation of the new scheduling proposals and a study of their performance in comparison to a randomized work-stealing scheduler.

The rest of the paper is organized as follows. The next section provides background information related to scheduling, the actor programming model, and the C++ Actor Framework (CAF). Section 3 describes existing research work related to the problem studied in this paper, while Section 4 presents a discussion of workload and application characteristics, followed by the actual scheduling proposal. An experimental evaluation is provided in Section 5 and the paper is concluded with brief remarks in Section 6.

II. BACKGROUND

A. The Actor Programming Model

In the actor model, the term actor describes autonomous objects that communicate asynchronously through messages. Each actor has a unique address that is used to send messages to that actor, and each actor has a mailbox that is used to queue received messages before processing. Actors do not share state and only communicate by sending messages to each other. Sending a message is a non-blocking operation and an actor processes each message in a single atomic step. Actors may perform three types of action as a result of receiving a message: (1) send messages to other actors, (2) create new actors, and (3) update their local state [6], [8]. Actors can change their behavior as a result of updating their local state. In principle, message processing in a system of actors is non-deterministic, because reliable, in-order delivery of messages is not guaranteed. This non-deterministic nature of actors makes it hard to predict their behavior based on static compile-time analysis or dynamic analysis at runtime.
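To make these concepts concrete, the sketch below models a single actor as a mailbox plus a handler that processes one message at a time. It is an illustrative, framework-independent sketch, not code from any particular actor library (the names Mailbox and CounterActor are ours): sending is non-blocking, each message is handled in one atomic step, and the handler may update local state or trigger further sends.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Hypothetical minimal mailbox: senders enqueue without blocking on the receiver.
    class Mailbox {
    public:
      void push(std::string msg) {                  // non-blocking send
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
      }
      std::string pop() {                           // receiver dequeues one message
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
      }
    private:
      std::mutex m_;
      std::condition_variable cv_;
      std::queue<std::string> q_;
    };

    // A toy stateful actor: it processes one message at a time (a single atomic
    // step), updates its private state, and never shares that state directly.
    class CounterActor {
    public:
      Mailbox mailbox;
      void run() {
        for (;;) {
          std::string msg = mailbox.pop();
          if (msg == "quit") break;
          ++received_;                              // (3) update local state
          std::cout << "got '" << msg << "' (#" << received_ << ")\n";
          // A real actor could also (1) send messages or (2) spawn new actors here.
        }
      }
    private:
      int received_ = 0;                            // private, unshared state
    };

    int main() {
      CounterActor actor;
      std::thread worker([&] { actor.run(); });     // one thread stands in for the scheduler
      actor.mailbox.push("hello");                  // senders never wait for the receiver
      actor.mailbox.push("world");
      actor.mailbox.push("quit");
      worker.join();
    }

In a real system the dedicated thread above is replaced by a scheduler that multiplexes many such actors over a pool of worker threads; the following paragraphs describe how existing runtimes do this.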
Despite their differences [9], all actor systems provide a runtime that multiplexes actors onto multiple system threads to take advantage of multi-core hardware and provide concurrency and parallelism. For example, Erlang [10] uses a virtual machine along with a work-stealing scheduler to distribute actors and to balance the workload. Akka [11] provides various policies to map actors to threads and by default uses a fork/join policy. Pony [12] provides only a work-stealing scheduler, while CAF provides both work-sharing and work-stealing policies. The type of the actor runtime can influence how tasks should be scheduled in regard to locality. For example, in a managed language such as Erlang, actors have their own private heap and stack, whereas in CAF actor objects and variables are directly allocated from the default global heap.

Scheduling is affected by implementation details of the actor framework and the type of workload. Most importantly, depending on the type of an actor and how memory is allocated and accessed by that actor, scheduling might or might not benefit from locality improvements related to CPU caches or NUMA. Therefore, it is important to identify all factors at play when it comes to locality-aware scheduling. These factors must be studied both individually and in combination to determine scenarios that benefit from locality-aware scheduling in a message-passing software framework.

B. Actor Model vs. Task Parallelism

Actor-based systems can benefit from work-stealing due to the dynamic nature of tasks and their asynchronous execution, which is similar to task parallelism. However, most variations of task parallelism, e.g., dataflow programming, are used to deconstruct strict computational problems to exploit hardware parallelism. As such, interaction patterns between tasks are usually deterministic, because dependencies between tasks are known at runtime. Tasks are primarily concerned with the availability of input data and therefore do not have any need to track their state. In contrast, the actor model provides nondeterministic, asynchronous, message-passing concurrency. Computation in the actor model cannot be considered as a directed acyclic graph (DAG) [7] (e.g., Cilk assumes DAG computation through fork/join [13]) and actors usually have to maintain state. Due to these differences, the internals of an actor runtime, such as scheduling, differ from runtime systems aimed at task parallelism.

Most importantly, applications written using the actor model, such as a chat server, are sensitive to latency and fairness for individual tasks. Therefore, a scheduler designed for an actor system must be both efficient and fair, otherwise applications show a long tail in their latency distribution. On the contrary, for task parallelism, as long as the entire problem space is explored in an acceptable time, fairness and latency of individual tasks do not matter.

Furthermore, in task parallelism many lightweight tasks are created that run from start to end without yielding execution. In contrast, actors in an actor system can wait for events and other messages, or cooperatively yield execution to guarantee fairness. Hence, the execution pattern of actors is different from tasks in a task-parallel workload. This affects both scheduling objectives and locality.

Finally, since actors fully encapsulate their internal state and only communicate through passing messages that are placed into the respective mailboxes of other actors, the resulting memory access pattern is not necessarily the same as the access pattern seen in task-parallel frameworks. For instance, in OpenStream [14] each consumer task has multiple data buffers for each producer task. OpenMP 4.0 [15] allows task dependencies through shared memory; however, this is based on a deterministic sporadic DAG model which only allows dependencies to be defined among sibling tasks. Therefore, although locality-aware scheduling is a well-studied topic for task parallelism, due to those differences, it cannot automatically be assumed that findings for task parallelism directly translate into similar findings for actor models, or vice versa.

C. Work-Stealing Scheduling

Multiprocessor scheduling is a well-known NP-hard problem. In practice, runtime schedulers apply broad strategies to satisfy objectives such as resource utilization, load balancing, and fairness. Work-stealing has emerged as a popular strategy for task placement and also load balancing [16]. Work-stealing primarily addresses resource utilization by stipulating that a processor without work "steals" tasks from another processor.

Work-stealing has been investigated for general multi-threaded programs with arbitrary dependencies [17], generalizing from its classical formulation limited to the fork-join pattern. The main overhead of work-stealing occurs during the stealing phase, when an idle processor polls other deques to find work, which might cause interprocessor communication and lock contention that negatively impact performance. The particulars of victim selection vary among work-stealing schedulers. In Randomized Work-Stealing (RWS), when a worker runs out of work it chooses its victims randomly. The first randomized work-stealing algorithm for fully-strict computations is given in [17]. The algorithm has an expected execution time of T1/P + O(T∞) on P processors, and also has much lower communication cost than work-sharing.

For message-driven applications, such as those built with actor-based programming, these bounds can only be regarded as an approximation. The reason is that deep recursion does not occur in event-based actor systems, since computation is driven by asynchronous message passing and cannot be considered as a DAG.

It has been shown that work-stealing is fundamentally efficient for message-driven applications [18]. However, random victim selection is not scalable [19], because it does not take into account locality, architectural diversity, and the memory hierarchy [18], [20], [21]. In addition, RWS does not consider data distribution and the cost of inter-node task migration on NUMA platforms [21], [22], [23].
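The core loop of randomized work-stealing is compact. The sketch below is a simplified illustration of the policy described above, not any framework's actual implementation (the deque type and helper names are ours): a worker first drains its own deque in LIFO order and, when it runs out of work, repeatedly picks a uniformly random victim and tries to take a task from the opposite (FIFO) end of that victim's deque.

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <random>
    #include <vector>

    using Task = std::function<void()>;

    // Simplified per-worker deque: the owner pushes/pops at the back (LIFO),
    // thieves steal from the front (FIFO). A single mutex keeps the sketch
    // short; production deques use lock-free algorithms instead.
    struct WorkDeque {
      std::mutex m;
      std::deque<Task> tasks;

      void push(Task t) { std::lock_guard<std::mutex> lk(m); tasks.push_back(std::move(t)); }
      std::optional<Task> pop() {               // owner side, LIFO
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back()); tasks.pop_back(); return t;
      }
      std::optional<Task> steal() {             // thief side, FIFO
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front()); tasks.pop_front(); return t;
      }
    };

    // One scheduling step of worker `self` under randomized work-stealing (RWS):
    // run local work if available, otherwise pick a random victim and try to steal.
    inline bool rws_step(std::size_t self, std::vector<WorkDeque>& deques, std::mt19937& rng) {
      if (auto t = deques[self].pop()) { (*t)(); return true; }
      std::uniform_int_distribution<std::size_t> pick(0, deques.size() - 1);
      std::size_t victim = pick(rng);
      if (victim == self) return false;          // try again in the next round
      if (auto t = deques[victim].steal()) { (*t)(); return true; }
      return false;                              // stay idle this round; caller retries or backs off
    }

Because the victim is chosen uniformly at random, this policy is oblivious to CPU caches and NUMA distances; the schedulers proposed later in this paper replace exactly this victim-selection step.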
D. C++ Actor Framework

The C++ Actor Framework (CAF) [7] provides a runtime that multiplexes N actors to M threads on the local system. The number of threads (M) is configurable and by default is equal to the number of cores available on the system, while the number of actors (N) changes during runtime. Actors in CAF transition between four states: ready, waiting, running, and done. An actor changes its state from waiting to ready in reaction to a message being placed in its mailbox. Actors in CAF are modeled as lightweight state machines that are implemented in user space and cannot be preempted.

CAF's work-stealing scheduler uses a double-ended task queue per worker thread. Worker threads treat this deque as LIFO and other threads treat the queue as FIFO. New tasks that are created by an actor are pushed to the head of the local queue of the worker thread where the actor is running. Tasks that are created by spawning actors outside of other actors, e.g., from the main() function, are placed into the task queues in a round-robin manner for load balancing. Actors create new tasks by either spawning a new actor or sending a message to an existing actor with an empty mailbox. If the receiver actor's mailbox is not empty, sending a message to its mailbox does not result in the creation of a new task, since a task is already processing the existing messages.

The RWS scheduler in the CAF runtime uses a uniform random number generator to randomly select victims when a worker thread runs out of work. Although CAF provides a support layer for seamless heterogeneous hardware to bridge architectural design gaps between platforms, it does not yet provide a locality-aware work-stealing scheduler.
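The rule that only a message into an empty mailbox creates a new scheduler task can be made explicit with a small sketch. The code below is our own illustration of that policy, not CAF's actual implementation (names such as ActorTask and Scheduler are ours): the mailbox reports whether it was empty before the push, and only in that case is the actor handed to a worker deque.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct Message { int payload; };

    // Hypothetical runnable actor handle; the scheduler only sees "tasks".
    struct ActorTask {
      std::atomic<std::size_t> mailbox_size{0};

      // Returns true if this push transitioned the mailbox from empty to
      // non-empty, i.e., if the actor was idle and now needs to be (re)scheduled.
      bool enqueue(const Message&) {
        return mailbox_size.fetch_add(1, std::memory_order_acq_rel) == 0;
      }
    };

    struct Scheduler {
      std::vector<std::vector<ActorTask*>> worker_deques;  // stand-in for per-worker deques
      std::size_t next_worker = 0;                         // round-robin cursor for external spawns

      // Send path: a new task is created only when the mailbox was empty.
      void send(ActorTask& receiver, const Message& msg, std::size_t sender_worker) {
        if (receiver.enqueue(msg))
          worker_deques[sender_worker].push_back(&receiver);  // "head of the local deque"
        // otherwise: an existing task will drain the mailbox; nothing to schedule.
      }

      // Spawn from outside any actor (e.g., from main): distribute round-robin.
      void spawn_external(ActorTask& actor) {
        worker_deques[next_worker].push_back(&actor);
        next_worker = (next_worker + 1) % worker_deques.size();
      }
    };

The locality-relevant consequence is visible in send(): by default the unblocked actor lands on the deque of the thread that sent the message, not on the deque of the thread that previously executed it. The LAS/L and LAS/A variants introduced in Section IV differ precisely in this decision.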
E. NUMA Effects

Contemporary large-scale shared-memory computer systems are built using non-uniform memory access (NUMA) hardware where each processor chip has direct access to a part of the overall system memory, while access to the remaining remote memory (which is local to other processor chips) is routed through an interconnect and is thus slower. NUMA provides the means to reach a higher core count than single-chip systems, albeit at the (non-uniform) cost of occasionally accessing remote memory. Clearly, with remote memory, the efficacy of the CPU caching hierarchy becomes even more important. However, CPU caches are shared between cores, with the particular nature of sharing depending on the particulars of the hardware architecture. Therefore, new scheduling algorithms have been proposed that are aware of the memory hierarchy and shared resources to exploit cache and memory locality and minimize overhead [16], [20], [21], [22], [23], [24], [25].

Accordingly, efficient scheduling of actors on NUMA machines requires careful analysis of applications built using actor programming. Application analysis must be combined with a proper understanding of the underlying memory hierarchy to limit the communication overhead and scheduling costs, and achieve the best possible runtime performance.

III. RELATED WORK

There has been very limited research addressing locality-aware or specifically NUMA-aware scheduling for actor runtime systems. Francesquini et al. [22] provide a NUMA-aware runtime environment based on the Erlang virtual machine. They identify actor lifespan and communication cost as information that the actor runtime can use to improve performance. Actors with a longer lifespan that create and communicate with many short-lived actors are called hub actors. The proposed runtime system lowers the communication cost among actors and their hub actor by placing the short-lived actors on the same NUMA node as the hub actor, called the home node. When a worker thread runs out of work, it first tries to steal from workers on the same NUMA node. If unsuccessful, the runtime system tries to migrate previously migrated actors back to that home node. The private heap of each actor is allocated on the home node, so executing on the home node improves locality. As a last resort, the runtime steals actors from other NUMA nodes and moves them to the worker's NUMA node.

Although the evaluation results look promising, the caveat, as stated by the authors, is in assuming that hub actors are responsible for the creation of the majority of actors. This is a strong assumption that only applies to some applications. Also, when multiple hub actors are responsible for creating actors, the communication pattern among non-hub actors can still be complicated. Another assumption is that all NUMA nodes have the same distance from each other, and the scheduler does not take the CPU cache hierarchy into account. The approach presented takes advantage of knowledge that is available within the Erlang virtual machine, but not necessarily available in an unmanaged language runtime, such as CAF.

In contrast, the work presented here is not based on any assumptions about the communication pattern among actors or information available through a virtual machine runtime. Instead, it is solely focused on improving performance by improving the scheduler. Also, the full extent and variability of the memory hierarchy is taken into account.

A simple affinity-type modification to task scheduling is reported in [26]. Tasks in this system block on a channel waiting for messages to arrive, and thus show similar behavior to actors waiting on empty mailboxes. In contrast to basic task scheduling, an existing task that is unblocked is never placed on the local queue. Instead, it is always placed at the end of the queue of the worker thread previously executing the task. A task is only migrated to
another worker thread by stealing. This modification leads to significant performance improvements for certain workloads and thus contradicts assumptions about treating the queues in a LIFO manner. However, the system has a well-defined communication pattern and long task lifespans, in contrast to an actor system that is non-deterministic with a mixture of short- and long-lived actors. We have implemented both a LIFO and an affinity policy using our hierarchical scheduler and present results in Section V.

Various locality-aware work-stealing schedulers have been proposed for other parallel programming models and shown to improve performance. Suksompong et al. [25] investigate localized work-stealing and provide running time bounds when workers try to steal their own work back. Acar et al. [27] study the data locality of work-stealing scheduling on shared-memory machines and provide lower and upper bounds for the number of cache misses, and also provide a locality-guided work-stealing scheduling scheme. Chen et al. [28] present a cache-aware two-tier scheduler that uses an automatic partitioning method to divide an execution DAG into an inter-chip tier and an intra-chip tier.

Olivier et al. [29] provide a hierarchical scheduling strategy where threads on the same chip share a FIFO task queue. In this proposal, load balancing is performed by work sharing within a chip, while work-stealing only happens between chips. In follow-up work, the proposal is improved by using a shared LIFO queue to exploit cache locality between sibling tasks as well as between a parent and a newly created task [23]. Moreover, the work-stealing strategy is changed, so that only a single thread can steal work on behalf of other threads on the same chip to limit the number of costly remote steals. Pilla et al. [30] propose a hierarchical load balancing approach to improve the performance of applications on parallel multi-core systems and show that Charm++ can benefit from such a NUMA-aware load balancing strategy.

Min et al. [21] propose a hierarchical work-stealing scheduler that uses the Hierarchical Victim Selection (HVS) policy to determine from which thread a thief steals work, and the Hierarchical Chunk Selection (HCS) policy that determines how much work a thief steals from the victim. The HVS policy relies on the scheduler having information about the memory hierarchy: cache, socket, and node (this work also considers many-core clusters). Threads first try to steal from the nearest neighbors and only upon failure move up the locality hierarchy. The number of times that each thread tries to steal from different levels of the hierarchy is configurable. The victim selection strategy presented here in Section IV-B is similar to HVS, but takes NUMA distances into account.

Drebes et al. [31], [32] combine topology-aware work-stealing with work pushing and dependence-aware memory allocation to improve NUMA locality and performance for data-flow task parallelism. Work pushing transfers a task to a worker whose node contains the task's input data according to some dependence heuristics. Each worker has a Multiple-Producer-Single-Consumer (MPSC) FIFO queue in addition to a work-stealing deque. The MPSC queue is only processed when the deque is empty. However, this approach is not applicable to latency-sensitive actor applications for two reasons: first, the actor model is nondeterministic and data dependences are difficult to infer at runtime. Second, adding a lower-priority MPSC queue adds complexity and can cause some actors to be inactive for a long time, which violates fairness and thus causes long tail latencies for the application. Moreover, the proposed deferred memory allocation relies on knowing the task dependencies in advance, which is not possible with the actor model. Therefore, this optimization cannot be applied to the actor model. The topology-aware work-stealing introduced by this work is similar to ours, but it is evaluated in combination with deferred memory allocation and work-pushing. Thus, it is not possible to discern the isolated contribution of topology-aware work-stealing.

Majo et al. [2] specify that optimizing for data locality can counteract the benefits of cache contention avoidance and vice versa. In Section V we present results that demonstrate this effect for actor workloads where aggressive optimization for locality increases the last-level cache contention.

IV. LOCALITY-AWARE SCHEDULER FOR ACTORS

This section first discusses the key characteristics of an actor-based application and workload that need to be considered when designing a locality-aware, work-stealing scheduler. These characteristics also provide valuable hints for designing evaluation benchmarks. Based on these findings, a novel hierarchical, locality-aware work-stealing scheduler is presented.

A. Characteristics of Actor Applications

Key operations can be slowed down when an actor migrates to another core on a NUMA system, depending on the NUMA distance. This performance degradation can come from messages that arrive from another core, or from accessing the actor's state that is allocated on a different NUMA node. Depending on the type of actor and the communication pattern, the amount of degradation differs. Therefore, improving locality does not benefit all workloads. We identify the following factors in applications and workloads for actor-based programming that can affect the performance of a work-stealing scheduler on a hierarchical NUMA machine:

1) Memory allocated for an actor and access pattern: Actors sometimes only perform computations on data passed to them through messages. For simplicity, we denote actors that only depend on message data as stateless actors, and actors that do manage local state as stateful actors.

Stateful actors allocate local memory upon creation and access it or perform other computations depending on the type of a message and their state when they receive a
message. A stateful actor with sizable state and intensive memory access to that state is better executed closer to the NUMA node where it was created. Also, for better cache locality, especially if the actor receives messages from its spawner frequently, it is better to keep such an actor on the same core, or on a core that shares a higher-level CPU cache with the spawner core. The reason is that those messages are hot in the cache of the spawner actor.

On the other hand, stateless actors do not allocate any local memory and can be spawned multiple times for better scalability. For such actors, the required memory to process messages is allocated when they start processing a message and deallocated when the processing is done. Therefore, the only substantial memory that is accessed is the memory allocated for the received message by the sender of that message. Such actors are better executed closer to the message sender.

2) Message size and access pattern: The size of messages has a direct impact on the performance and locality of actors on NUMA machines. Messages are allocated on the NUMA node of the sender, but accessed on the core that is executing the receiver actor. If the size of messages is typically larger than the size of the local state of an actor, and the receiving actor accesses the message intensively, such actors are better activated on the same node as the sender of the message.

3) Communication pattern: Since the actor model is non-deterministic, it is difficult to analyze the communication pattern between actors in general. Two actors that are sending messages to each other can go through different states and thus have various memory access patterns. In addition, the type and size of each message can vary depending on the state of the actor. We do not make any assumptions about the communication pattern of actors, unlike others [22].

Aside from illustrating the trade-offs involved in actor scheduling, these observations are also useful to determine which benchmarks realistically demonstrate the benefits of locality-aware work-stealing schedulers and which represent worst-case tests.

B. Locality-Aware Scheduler (LAS)

Our locality-aware scheduler consists of three stages: memory hierarchy detection, work placement, and work stealing. When an application starts running, the scheduler determines the memory hierarchy of the machine. Also, a new actor is placed on the local or a remote NUMA node depending on the type of the actor. Finally, when a worker thread runs out of work, it uses a locality-aware policy to steal work from another worker thread.

1) Memory Hierarchy Detection: Our work-stealing algorithm needs to be aware of the memory hierarchy of the underlying system. In addition to the cache and NUMA hierarchy, differing distances between NUMA nodes are an important factor in deciding where to steal tasks from. Access latencies can vary significantly based on the topological distance between access node and storage node.

The scheduler builds a representation of the locality hierarchy using the hwloc library [33]. hwloc uses hardware information to determine the memory hierarchy, which the scheduler represents as a tree. The root is a place-holder representing the whole system, while intermediate levels represent NUMA nodes and groups, taking into account NUMA distances. Subsequent nodes represent shared caches and the leaves represent the cores on the system. This approach is independent from any particular hardware architecture.
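As an illustration of this detection step, the sketch below uses the hwloc C API to enumerate the cores and record, for each one, the NUMA node that is local to it and its depth in the topology tree. This is a simplified sketch under our own naming (CoreInfo, collect_cores); the scheduler described in this paper additionally groups nodes by NUMA distance, which is omitted here.

    // Build a simple view of the machine topology with hwloc (link with -lhwloc).
    #include <hwloc.h>
    #include <cstdio>
    #include <vector>

    struct CoreInfo {          // our own struct, not part of hwloc
      unsigned logical_index;  // core index as numbered by hwloc
      int numa_node;           // OS index of the NUMA node local to this core
      int tree_depth;          // depth of the core in the topology tree
    };

    static std::vector<CoreInfo> collect_cores(hwloc_topology_t topo) {
      std::vector<CoreInfo> cores;
      int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
      for (int i = 0; i < n; ++i) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        CoreInfo info;
        info.logical_index = core->logical_index;
        // The nodeset of a core tells us which NUMA node(s) are local to it.
        info.numa_node = core->nodeset ? hwloc_bitmap_first(core->nodeset) : 0;
        info.tree_depth = core->depth;
        cores.push_back(info);
      }
      return cores;
    }

    int main() {
      hwloc_topology_t topo;
      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      for (const CoreInfo& c : collect_cores(topo))
        std::printf("core %u: NUMA node %d, depth %d\n",
                    c.logical_index, c.numa_node, c.tree_depth);

      // Walking core->parent up to the root visits shared caches, packages, and
      // NUMA-related objects; the scheduler turns exactly this chain into its tree.
      hwloc_topology_destroy(topo);
    }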
2) Actor Placement: For fully-strict computations, the data dependencies of a task only go to its parent. Thus, the natural placement for new tasks is the core of the parent task. However, actors can communicate arbitrarily and thus, local placement of newly created actors does not guarantee the best performance. For example, actors receiving remote messages pollute the CPU cache for actors that execute later and process messages from their parents. Also, as stated earlier, depending on the size of the message in comparison to the state variables, placing the actor in the sender's NUMA node can help or hurt performance. Determining the best strategy at runtime can add significant overhead, so there is no apparent optimal approach.

The exception is hub actors [22], i.e., long-living actors that spawn many children and communicate with them frequently. Such actors place high demand on the memory allocator and can interfere with each other if placed on the same NUMA node. Furthermore, if a locality-aware affinity policy tries to keep actors on their home node, placing multiple hub actors on the same NUMA node further increases contention over shared resources and thus reduces the performance. Hence, our scheduler uses the same algorithm for the initial placement of hub actors [22] to spread them across different NUMA nodes. The programmer needs to annotate hub actors. The system then tags the corresponding structures at compile time and the runtime scheduler uses this information to place such actors far from each other.

3) Locality-aware Work-stealing: A locality-aware victim selection policy attempts to keep tasks closer to the core that created them or was running them previously to take advantage of better cache locality. Depending on the hardware architecture, cores might share a higher-level CPU cache. Therefore, in our scheduler the thief worker thread first steals from worker threads executing on nearby cores in the same NUMA node with shared caches to improve locality. If there is no work available in the local NUMA node, the hierarchical victim selection policy tries to steal jobs from worker threads of other NUMA nodes with increasing NUMA distance. The goal of NUMA-aware work stealing is to avoid migrating actors between NUMA nodes to the extent possible, and thus to remove the need for remote memory accesses.

Limiting the worker threads to initially choose their victims within their own NUMA node can lead to more frequent contention over deques on the local NUMA node in comparison to using the random victim selection strategy. For example, the case in which a single queue still has work while all other worker threads have run out of work is the worst-case scenario for a work-stealing scheduler. However, this case appears frequently in actor applications, where a hub actor creates multiple actors and other worker threads steal from the local deque of the thread that runs the hub actor. Our investigation shows that when stealing fine-grained tasks with workloads that are 20 µs or shorter, the performance penalty ratio increases exponentially as the number of thief threads increases. For more coarse-grained tasks, the performance penalty is not significant, since the probability of contention decreases.

To alleviate this problem, our scheduler keeps track of the number of threads per NUMA node that are polling the local node. This number is used along with the approximate size of the deques in the node to reduce the number of threads that are simultaneously polling a deque (Algorithm 1). If there is only a single non-empty deque and more than half of the threads under that node are polling that deque, the thief thread backs off and tries again later.

In addition, polling the queues of many other worker threads with empty queues can result in wasted CPU cycles when the number of potential victims is limited. In CAF, a worker thread constantly polls its own deque and, after a certain number of attempts, polls a victim deque. To avoid wasting CPU cycles, we have modified the deque and added an approximate size of the deque using a counter. A thief uses this approximate size when it attempts to steal from other workers executing on the same NUMA node. If there are non-empty queues, it chooses one randomly; otherwise, if all the queues are empty, the thief immediately moves up to the next higher level (Algorithm 1). This approach removes the overhead of polling empty queues on the local NUMA node and thus decreases the number of wasted CPU cycles. Since there is a fixed number of cores on a NUMA node, scanning their queue sizes adds little overhead that remains constant even when the application scales.

When a worker runs out of work, it becomes a thief and uses the memory hierarchy tree provided by the scheduler to perform hierarchical victim selection as described in Algorithm 1. The updated vertex v is passed to the function each time to complete the tree traversal. An empty result or a victim with an empty deque means that the thief has to try again.

Algorithm 1: Hierarchical Victim Selection
  T: memory hierarchy tree
  C: set of cores under v
  p: number of threads polling under the local NUMA node
  r: number of steal attempts for v

  procedure ChooseVictim(v)
    if r = size(C) and v ≠ root(T) then
      v ← parent(v)
    if v is in the local NUMA node then
      S ← {s | all non-empty local deques}
      if S = ∅ then
        v ← parent(v)
      else if size(S) = 1 and p > size(C)/2 then
        return ∅
      else
        return random from S
    return random from C

We have created two variants of LAS that differ in their placement strategy. When an existing actor is unblocked by a message, the local variant (LAS/L) places the actor on the local deque, while the affinity variant (LAS/A) places the actor at the end of the deque of the worker thread previously executing it. In both cases, newly created actors are pushed to the head of the local deque. LAS/L is similar to typical work-stealing placement where all activated and newly created tasks are pushed to the head of the local deque. LAS/A improves actor-to-thread (and thus to-core) affinity, because actors are moved only by stealing. However, it adds overhead to saturated workers and increases contention when placing actors on remote deques, which is further discussed in Section V-C.
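A compact C++ rendering of Algorithm 1 is sketched below. It is illustrative rather than CAF's actual code (the types TreeNode and WorkerDeque and the polling counter are our own): the thief climbs the hwloc-derived tree one level per call, and inside the local NUMA node it consults the approximate deque sizes and the per-node polling counter to implement the back-off rule.

    #include <atomic>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Minimal stand-ins for the scheduler's data structures.
    struct WorkerDeque {
      std::atomic<std::size_t> approx_size{0};   // maintained on push/pop/steal
    };

    struct TreeNode {                            // one vertex of the memory hierarchy tree
      TreeNode* parent = nullptr;
      std::vector<WorkerDeque*> cores;           // deques of all cores under this vertex
      bool local_numa_node = false;              // true for the thief's own NUMA node
      std::size_t steal_attempts = 0;            // r in Algorithm 1
      std::atomic<std::size_t>* polling_threads = nullptr;  // p: shared per-NUMA-node counter
    };

    // One traversal step of Algorithm 1. Returns nullptr if the thief should back
    // off or retry after v has been advanced; the caller loops until it obtains
    // work from the returned victim.
    inline WorkerDeque* choose_victim(TreeNode*& v, std::mt19937& rng) {
      if (v->steal_attempts == v->cores.size() && v->parent != nullptr)
        v = v->parent;                           // exhausted this level: climb the tree

      if (v->local_numa_node) {
        std::vector<WorkerDeque*> non_empty;     // S: non-empty local deques
        for (WorkerDeque* d : v->cores)
          if (d->approx_size.load(std::memory_order_relaxed) > 0)
            non_empty.push_back(d);

        if (non_empty.empty()) {                 // nothing local: go up one level
          if (v->parent != nullptr) v = v->parent;
          return nullptr;
        }
        if (non_empty.size() == 1 && v->polling_threads != nullptr &&
            v->polling_threads->load(std::memory_order_relaxed) > v->cores.size() / 2)
          return nullptr;                        // back off: too many thieves on one deque

        std::uniform_int_distribution<std::size_t> pick(0, non_empty.size() - 1);
        return non_empty[pick(rng)];
      }

      ++v->steal_attempts;
      std::uniform_int_distribution<std::size_t> pick(0, v->cores.size() - 1);
      return v->cores[pick(rng)];                // remote level: plain random choice
    }

Whether a stolen or unblocked actor then lands on the thief's own deque or on the deque of the worker that last executed it is the LAS/L versus LAS/A distinction evaluated in the next section.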
V. EXPERIMENTS AND EVALUATION

Experiments are conducted on an Intel and an AMD machine that have different NUMA topologies and memory hierarchies. The Intel machine is a Xeon with 4 sockets, 4 NUMA nodes, and 32 cores. Each NUMA node has 64 GB of memory for a total of 256 GB. Each socket has 8 cores, running at 2.3 GHz, that share a 16 MB L3 cache, and each core has private L1 and L2 caches. Each NUMA node is only directly connected to two other NUMA nodes. Hyper-threading is disabled. The AMD machine is an Opteron with 4 sockets, 8 NUMA nodes, and 64 cores. Each NUMA node contains 64 GB of memory for a total of 512 GB. Each socket has 8 cores running at 2.5 GHz. Each core has a private L1 data cache, and shares the L1 instruction cache and an L2 cache (2 MB) with a neighbour core. All cores in the same socket share one L3 cache (6 MB). The experiments are performed with CAF version 0.12.2, compiled with GCC 5.4.0, on Ubuntu Linux 16.04 with kernel 4.4.0.

The experiments compare CAF's default Randomized Work-Stealing (RWS) scheduler with LAS/L and LAS/A. We first evaluate the performance using benchmarks to study the effect of the scheduling policy on different communication patterns and message sizes. Next, we use a simple chat server to observe the efficiency of the schedulers for an application that has a large number of actors with non-trivial communication patterns, different behaviors, and various message sizes.

A. Benchmarks

The first set of experiments attempts to isolate the effects of each scheduling policy for different actor communication patterns. A subset of benchmarks from the BenchErl [34] and Savina [35] benchmark suites is chosen that represents typical communication patterns used in actor-based applications. Some of these benchmarks are adopted from task-parallelism benchmarks, but modified to fit the actor model.

• Big (BIG): In a many-to-many message passing scenario, many actors are spawned and each one sends a ping message to all other actors. An actor responds with a pong message to any ping message it receives.

• Bang (BANG): In a many-to-one scenario, multiple senders flood the one receiver with messages. Senders send messages in a loop without waiting for any response (a minimal sketch of this pattern follows the list).

• Logistic Map Series (LOGM): A synchronous request-response benchmark that pairs control actors with compute actors to calculate logistic map polynomials through a sequence of requests and responses between each pair.

• All-Pairs Shortest Path (APSP): This benchmark is a weighted graph exploration application that uses the Floyd-Warshall algorithm to compute the shortest path among all pairs of nodes. The weight matrix is divided into blocks. Each actor performs calculations on a particular block and communicates with the actors holding adjacent blocks.

• Concurrent Dictionary (CDICT): This benchmark maintains a key-value store by spawning a dictionary actor with a constant-time data structure (hash table). It also spawns multiple sender actors that send write and read requests to the dictionary actor. Each request is served with a constant-time operation on the hash table.

• Concurrent Sorted Linked-List (CSLL): This benchmark is similar to CDICT but the data structure has linear access time (linked list). The time to serve each request depends on the type of the operation and the location of the requested item. Also, actors can inquire about the size of the list, which requires iterating through all items.

• NQueens first N Solutions (NQN): A divide-and-conquer style algorithm searches for a solution to the problem: "How can N queens be placed on an N × N chessboard, so that no pair attacks each other?"

• Trapezoid approximation (TRAPR): This benchmark consists of a master actor that partitions an integral and assigns each part to a worker. After receiving all responses, they are added up to approximate the total integral. The message size and computation time is the same for all workers.

• Publish-Subscribe (PUBSUB): Publish/subscribe is an important communication pattern in actor programs that is used extensively in many applications, such as chat servers and message brokers. This benchmark is implemented using CAF's group communication feature and measures the end-to-end latency of individual messages. It represents a one-to-many communication pattern where a publisher actor sends messages to multiple subscribers. Actors can subscribe to more than one publisher.
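As referenced in the BANG item above, the sketch below shows the many-to-one pattern in its simplest form, independent of any actor framework (the ReceiverMailbox type and the message counts are ours): several sender threads push messages into one receiver's queue without ever waiting for a reply, so the receiver's mailbox is the single point of contention.

    #include <atomic>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // One shared mailbox stands in for the single receiver actor in BANG.
    struct ReceiverMailbox {
      std::mutex m;
      std::queue<int> q;
      void push(int msg) { std::lock_guard<std::mutex> lk(m); q.push(msg); }
      bool pop(int& msg) {
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        msg = q.front(); q.pop(); return true;
      }
    };

    int main() {
      constexpr int kSenders = 8;
      constexpr int kMessagesPerSender = 10000;
      ReceiverMailbox mailbox;
      std::atomic<int> remaining{kSenders * kMessagesPerSender};

      // Senders flood the single receiver and never wait for a response.
      std::vector<std::thread> senders;
      for (int s = 0; s < kSenders; ++s)
        senders.emplace_back([&, s] {
          for (int i = 0; i < kMessagesPerSender; ++i) mailbox.push(s);
        });

      // The receiver drains its mailbox; with many senders it becomes the
      // bottleneck, which is why locality-aware scheduling barely helps here.
      std::thread receiver([&] {
        int msg;
        while (remaining.load() > 0)
          if (mailbox.pop(msg)) remaining.fetch_sub(1);
      });

      for (auto& t : senders) t.join();
      receiver.join();
      std::printf("all %d messages received\n", kSenders * kMessagesPerSender);
    }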
B. Experiments

The benchmark results are shown in Figure 1. The execution time is the average of 10 runs and normalized to the slowest scheduler. All experiments are configured to keep all cores busy most of the time, i.e., the system operates at peak load.

Figure 1. Results of running the benchmarks with the various schedulers on (a) AMD Opteron and (b) Intel Xeon. The results are normalized to the slowest scheduler (lower is better).

The RWS scheduler performs relatively better for the BIG benchmark and it outperforms both LAS/L and LAS/A. This benchmark represents a symmetric many-to-many communication pattern where all actors are sending messages to each other. This workload benefits from a symmetric distribution of work. Other experiments (not shown here) show that using the NUMA interleave memory allocation policy improves the performance further. For this particular workload, improving locality does not translate to improving the performance.

The BANG benchmark represents workloads using many-to-one communication. Messages have very small sizes and no computation is performed. Since the receiver's mailbox is the bottleneck, improving locality does not significantly affect the overall performance. LAS/L only improves the performance slightly by allocating more messages on the local node.

LOGM and APSP both create multiple actors during startup and each actor frequently communicates with a limited number of other actors. In addition, computations depend on an actor's local state and message content. For both workloads, LAS/L and LAS/A outperform RWS by a great margin. In such workloads, each actor can only be activated by one of the actors it communicates with. If one of the communicating actors is stolen and executes on another core, in RWS and LAS/L it causes the other actors to follow and execute on the new core upon activation. Since all actors maintain local state that is allocated upon creation of the actor, all actors that are part of the communication group experience longer memory access times if one of them migrates to another NUMA node. Keeping actors on the same NUMA node and closer to the core they were running on before can prevent this. LAS/A improves performance further by preventing other actors from migrating along with the stolen actor. Even though actors are occasionally moved to another NUMA node, the rest of the group stays on their own NUMA node. Thus, LAS/A performs better than LAS/L by preventing group migration of actors. For LOGM, since each pair of actors is isolated from other actors, stealing one actor translates to moving one other actor along with it. However, in APSP each actor communicates with multiple actors, which means stealing one actor can cause a chain reaction and several actors that do not directly communicate with the stolen actor might also migrate. LAS/A therefore has a stronger effect on APSP than LOGM.

CDICT and CSLL represent workloads where a central actor spawns multiple worker actors and communicates with
them frequently. The central actor is responsible for managing a data structure and receives read and write requests from the worker actors. CDICT benefits from the improved locality provided by both locality-aware schedulers. Since the majority of operations are allocating and accessing messages between the central actor and the worker actors, placing the worker actors closer to the central actor leads to improved performance due to faster memory accesses. In CDICT all requests are served from a hash table in roughly constant time. In such a setting, LAS/A can cause an imbalance in service times, because some actors are being placed on cores with higher memory access times. Since the service time for each request is fairly small, this additional overhead can slightly slow down the application.

However, in CSLL LAS/A outperforms LAS/L and RWS. First, the overhead that LAS/A imposes on the central actor becomes negligible in comparison with the linear lookup time into the linked list. Because of the resulting increased service times, most actors ultimately become inactive, waiting for a response from the central actor. The corresponding worker threads end up being idle and seeking work. With LAS/A, the response unblocks a worker actor on its previous worker thread, so that execution can continue right away. However, LAS/L unblocks worker actors on the same worker thread as the central actor. This introduces additional latency until the worker actor is executed or, alternatively, until it is stolen by an idle worker thread.

NQN is a divide-and-conquer algorithm where a master actor is responsible for dividing the work among a fixed number of worker actors. Each worker actor performs a recursive operation on the task assigned to it and further divides the task into smaller subtasks. But instead of spawning new actors, it reports back to the master actor, which assigns new tasks to the worker actors in a round-robin fashion. Therefore, all worker actors are constantly producing and consuming messages. The computation performed for each message depends on the content of the message and all items in each message are accessed during computation. Improving locality and placing worker actors closer to each other and to the master actor has a significant impact on performance. LAS/L and LAS/A perform 5 times faster than RWS in this case.

In TRAPR, worker actors receive a message from a master actor, perform some calculations, send back a message, and exit. LAS/L and LAS/A improve the performance up to 10 times for this benchmark. Since all actors are created on the local deque of the master actor, and the tasks are very fine-grained, locality-aware scheduling increases the chance of local cores stealing and running these tasks closer to the master actor. Since all communication is with the master actor, the performance is improved significantly.

The PUBSUB benchmark shows a significant end-to-end message latency improvement when the LAS/L policy is used in comparison with RWS. LAS/L keeps the subscribers closer to the publisher that sends them a message and improves the locality. However, LAS/A shows worse performance than
the other two policies. Profiling the code reveals that worker threads are stalled by lock contention most of the time. The reason is that with LAS/A, worker threads place the newly created tasks on the deques of other cores rather than the local core. Since publishers are constantly unblocking actors on other cores, this leads to higher contention when there are a large number of publishers and subscribers.

There are minor differences between the results from the AMD machine and the Intel machine. These differences come from the differences in the NUMA setup of each machine, explained at the beginning of this section. The probability that RWS moves tasks to a NUMA node with higher access times is higher for the AMD machine. Thus, locality-aware schedulers are slightly more effective on the AMD machine.

In general, the results indicate that workloads with many-to-many communication patterns (BIG) do not benefit from locality-aware schedulers. On the other hand, workloads where actors are communicating with a small cluster of other actors (LOGM and APSP), actors communicate with a central actor and access message contents (CDICT and CSLL), or actors communicate with a central actor one or multiple times and perform computations that depend on the content of the message (NQN and TRAPR), benefit from locality-aware schedulers. Moreover, in most cases where locality improves performance, LAS/A performs similar to or better than LAS/L. However, in one case (PUBSUB) LAS/A causes high contention and a performance decrease.

We have performed another experiment to study the effect of message size on the performance of locality-aware schedulers. The CDICT benchmark is modified to make the value size configurable for each key-value pair. This affects the size of messages and the size of memory operations performed by the central actor. Worker actors submit write requests 20% of the time. Figure 2 shows the results for this experiment executed on the AMD machine. Experiments on the Intel machine show similar results, but are not shown here due to limited space.

Figure 2. Results for varying the size of the value in the CDICT benchmark. The execution times are normalized to the slowest scheduler (lower is better).

The results show that the LAS/L scheduler outperforms both RWS and LAS/A for value sizes smaller than 256 words. LAS/L improves locality and, since most messages and objects fit into the lower-level caches (L1 and L2), the improved locality further improves the performance. RWS distributes tasks among NUMA nodes and therefore imposes higher memory access times. LAS/A also adds additional overhead, because it causes the dictionary actor to unblock some actors on remote NUMA nodes. However, as the value size gets larger, messages and objects no longer fit into the lower-level caches. Since LAS/L keeps most actors on the same NUMA node as the dictionary actor, this creates contention in the L3 cache, which slows down the dictionary actor. LAS/A, on the other hand, distributes actors to other NUMA nodes as well, which avoids the contention in L3, such that the remote access overhead is compensated by lower contention. RWS also avoids the contention problem and therefore the difference between LAS/L and RWS decreases. In fact, when increasing the percentage of write requests, we have observed that RWS can even outperform LAS/L (not shown due to limited space).

C. Chat Server

To evaluate both variants of LAS using a more realistic scenario, we have implemented a chat server similar to [36] that supports one-to-one and group chats. Each user (session) is represented by an actor that holds the state for the session in the server application. Chat groups are created using the publish/subscribe-based group communication in CAF. To simplify the implementation, the server does not include network operations and the workload is generated and consumed in the same process. However, the chat server implements pre- and post-processing operations that would normally be carried out in the context of communication with remote clients, such as encryption.

Each user has a friend list, group list, and blocked list, which represent the corresponding lists of users, respectively. Information about each session and a log of messages is stored in an in-memory key-value store controlled by a database actor. In addition, each session actor stores its information in a local cache controlled by a local cache actor. When a session actor receives a message, it first decrypts the message, uses the receiver user ID to find the reference to the receiving actor, and forwards the message to that actor. The message is also logged in both the local cache and the central storage. When the receiver actor receives the message, it first checks whether the sender is in its blocked list. If not, it encrypts the message as if it were sent out to a remote client. If a message is sent to a group, an actor representing the group forwards the message to all subscribers, which creates a one-to-many communication pattern.
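The message path through a session can be summarized in a few lines of C++. The sketch below is our own illustration of the described flow, not the paper's implementation (ChatServer, decrypt, and encrypt are hypothetical placeholders): decrypt, resolve the receiver, forward, log to the central storage, and on the receiving side check the blocked list before re-encrypting for delivery.

    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical helpers standing in for the server's building blocks.
    struct ChatMessage { std::uint64_t sender_id; std::uint64_t receiver_id; std::string body; };
    static std::string decrypt(const std::string& cipher) { return cipher; }  // placeholder
    static std::string encrypt(const std::string& plain)  { return plain; }   // placeholder

    struct SessionActor {
      std::uint64_t user_id = 0;
      std::set<std::uint64_t> blocked;          // part of the per-session state held by the actor

      // Receiver side: drop messages from blocked users, otherwise re-encrypt for delivery.
      void handle_incoming(const ChatMessage& msg, std::vector<std::string>& delivered) {
        if (blocked.count(msg.sender_id) != 0) return;
        delivered.push_back(encrypt(msg.body)); // would be sent to the remote client here
      }
    };

    // The directory and log stand in for the lookup structure, cache actor, and database actor.
    struct ChatServer {
      std::unordered_map<std::uint64_t, SessionActor> sessions;
      std::vector<ChatMessage> central_log;     // database actor's storage
      std::vector<std::string> delivered;       // messages handed back to clients

      // Sender side: decrypt, resolve the receiver, forward, and log the message.
      void handle_outgoing(std::uint64_t sender, std::uint64_t receiver, const std::string& cipher) {
        ChatMessage msg{sender, receiver, decrypt(cipher)};
        auto it = sessions.find(receiver);
        if (it != sessions.end())
          it->second.handle_incoming(msg, delivered);  // real server: an asynchronous actor send
        central_log.push_back(msg);                    // local cache + database logging
      }
    };

    int main() {
      ChatServer server;
      server.sessions[1].user_id = 1;
      server.sessions[2].user_id = 2;
      server.sessions[2].blocked.insert(3);     // user 2 blocks user 3
      server.handle_outgoing(1, 2, "hello");    // delivered
      server.handle_outgoing(3, 2, "spam");     // dropped by the blocked-list check
      std::printf("delivered %zu message(s), logged %zu\n",
                  server.delivered.size(), server.central_log.size());
    }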
The chat server is configured to run with 1 million actors and 10000 groups. Each user has a random number of
Figure 3. Throughput and distribution of end-to-end message latency using different scheduling policies on (a) AMD Opteron and (b) Intel Xeon. Latencies are quantified on the left Y-axis, which has a logarithmic scale (lower is better). Throughput is quantified on the right Y-axis (higher is better).
Hence, although LAS/A performs fairly well in simple scenarios and benchmarks, lock contention significantly affects its performance when the application scales and the number of actors increases. Therefore, the proposal in [26] does not apply to large-scale actor-based applications. The LAS/L policy, on the other hand, shows stable performance improvements over RWS and reduces the latency.

REFERENCES

[2] Z. Majo and T. R. Gross, "Memory System Performance in a NUMA Multicore Multiprocessor," in Proc. 4th Ann. Int'l Conf. Systems and Storage, 2011, pp. 12:1–12:10.

[3] D. Molka, D. Hackenberg, R. Schöne, and M. S. Müller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," in Proc. 18th Int'l Conf. Parallel Architectures and Compilation Techniques, 2009, pp. 261–270.
[4] C. Hewitt, P. Bishop, and R. Steiger, "A Universal Modular ACTOR Formalism for Artificial Intelligence," in Advance Papers of the Conf., vol. 3. Stanford Research Inst., 1973, p. 235.

[5] C. Hewitt and H. G. Baker, "Actors and Continuous Functionals," Massachusetts Inst. of Technology, Tech. Rep., 1978.

[6] G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986.

[7] D. Charousset, R. Hiesgen, and T. C. Schmidt, "Revisiting Actor Programming in C++," Computer Languages, Systems & Structures, vol. 45, pp. 105–131, April 2016.

[8] G. Agha, "Concurrent Object-oriented Programming," Commun. ACM, vol. 33, no. 9, pp. 125–141, Sep. 1990.

[9] J. De Koster, T. Van Cutsem, and W. De Meuter, "43 Years of Actors: A Taxonomy of Actor Models and Their Key Properties," in Proc. 6th Int'l Workshop on Programming Based on Actors, Agents, and Decentralized Control, 2016, pp. 31–40.

[10] J. Armstrong, R. Virding, C. Wikström, and M. Williams, "Concurrent Programming in ERLANG," 1993.

[11] J. Bonér, "Introducing Akka - Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Actors," 2010, http://jonasboner.com/introducing-akka.

[12] S. Clebsch, S. Drossopoulou, S. Blessing, and A. McNeil, "Deny Capabilities for Safe, Fast Actors," in Proc. 5th Int'l Workshop on Programming Based on Actors, Agents, and Decentralized Control, 2015, pp. 1–12.

[13] R. D. Blumofe et al., Cilk: An Efficient Multithreaded Runtime System. ACM, 1995, vol. 30, no. 8.

[14] A. Pop and A. Cohen, "OpenStream," ACM Trans. Architecture and Code Optimization, vol. 9, no. 4, pp. 1–25, 2013.

[15] OpenMP, "OpenMP Application Program Interface Version 4.0," 2013.

[16] J. Yang and Q. He, "Scheduling Parallel Computations by Work Stealing: A Survey," Int'l J. Parallel Programming, pp. 1–25, 2017.

[17] R. D. Blumofe and C. E. Leiserson, "Scheduling Multithreaded Computations by Work Stealing," J. ACM (JACM), vol. 46, no. 5, pp. 720–748, Sep. 1999.

[18] Z. Vrba, H. Espeland, P. Halvorsen, and C. Griwodz, "Limits of Work-Stealing Scheduling," in Job Scheduling Strategies for Parallel Processing, 2009, pp. 280–299.

[19] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, "Scalable Work Stealing," in Proc. Conf. High Performance Computing Networking, Storage and Analysis, 2009, pp. 53:1–53:11.

[20] Y. Guo, J. Zhao, V. Cave, and V. Sarkar, "SLAW: A Scalable Locality-aware Adaptive Work-stealing Scheduler," in 2010 IEEE Int'l Symp. Parallel Distributed Processing (IPDPS 10), 2010, pp. 1–12.

[21] S. Min, C. Iancu, and K. Yelick, "Hierarchical Work Stealing on Manycore Clusters," in 5th Conf. Partitioned Global Address Space Programming Models, 2011.

[22] E. Francesquini, A. Goldman, and J. F. Méhaut, "A NUMA-Aware Runtime Environment for the Actor Model," in 2013 42nd Int'l Conf. on Parallel Processing, 2013, pp. 250–259.

[23] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins, "OpenMP Task Scheduling Strategies for Multicore NUMA Systems," The Int'l J. High Performance Computing Applications, vol. 26, no. 2, pp. 110–124, 2012.

[24] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto, "Survey of Scheduling Techniques for Addressing Shared Resources in Multicore Processors," ACM Computing Surveys (CSUR), vol. 45, no. 1, pp. 4:1–4:28, Dec. 2012.

[25] W. Suksompong, C. E. Leiserson, and T. B. Schardl, "On the Efficiency of Localized Work Stealing," Information Processing Letters, vol. 116, no. 2, pp. 100–106, 2016.

[26] Z. Vrba, P. Halvorsen, and C. Griwodz, "A Simple Improvement of the Work-stealing Scheduling Algorithm," in Proc. Int'l Conf. Complex, Intelligent and Software Intensive Systems, 2010, pp. 925–930.

[27] U. A. Acar, G. E. Blelloch, and R. D. Blumofe, "The Data Locality of Work Stealing," in Proc. 12th Ann. ACM Symp. Parallel Algorithms and Architectures, 2000, pp. 1–12.

[28] Q. Chen, M. Guo, and Z. Huang, "Adaptive Cache Aware Bitier Work-Stealing in Multisocket Multicore Architectures," IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 12, pp. 2334–2343, 2013.

[29] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins, "Scheduling Task Parallelism on Multi-socket Multicore Systems," in Proc. 1st Int'l Workshop on Runtime and Operating Systems for Supercomputers, 2011, pp. 49–56.

[30] L. L. Pilla et al., "A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems," in Proc. 41st Int'l Conf. Parallel Processing, 2012, pp. 118–127.

[31] A. Drebes, K. Heydemann, N. Drach, A. Pop, and A. Cohen, "Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages," ACM Trans. Architecture and Code Optimization, vol. 11, no. 3, pp. 1–25, Aug. 2014.

[32] A. Drebes, A. Pop, K. Heydemann, A. Cohen, and N. Drach, "Scalable Task Parallelism for NUMA," in Proc. Int'l Conf. Parallel Architectures and Compilation (PACT 16), 2016.

[33] F. Broquedis et al., "hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications," in 2010 18th Euromicro Conf. on Parallel, Distributed and Network-based Processing. IEEE, 2010, pp. 180–186.

[34] S. Aronis et al., "A Scalability Benchmark Suite for Erlang/OTP," in Proc. 11th ACM SIGPLAN Erlang Workshop, 2012, pp. 33–42.

[35] S. M. Imam and V. Sarkar, "Savina - An Actor Benchmark Suite: Enabling Empirical Evaluation of Actor Libraries," in Proc. 4th Int'l Workshop on Programming Based on Actors, Agents & Decentralized Control, 2014, pp. 67–80.

[36] Riot Games, "Chat Service Architecture: Servers," Sep. 2015, https://engineering.riotgames.com/news/chat-service-architecture-servers.