and production deployments of datacenter workloads, we can ensure that the software architecture of DCPerf's benchmarks closely resembles the production workloads and periodically calibrate their performance to align with that of the production workloads. Second, we regularly use DCPerf to drive decisions in procuring millions of CPUs and guide CPU vendors in optimizing their designs. This regular exercise provides valuable feedback for improving DCPerf. In contrast, many past benchmarks have become outdated due to the absence of a strong mandate for continuous improvement.

The second challenge in developing benchmarks for datacenter workloads is that it is insufficient for a benchmark to simply achieve performance similar to the real-world application it models at the macro level in terms of throughput, latency, and CPU utilization. Instead, its performance characteristics at the microarchitecture level must also be sufficiently close. Otherwise, even if their performance is similar for the current generation of CPUs, as the next-generation CPU evolves with a different microarchitecture, performance may diverge significantly.

The key microarchitecture-level performance characteristics include: (1) instructions per cycle (IPC); (2) cache miss rates; (3) branch misprediction; (4) memory bandwidth usage; (5) effective CPU frequency; (6) overall power consumption and its breakdown across various CPU and server components; (7) instruction stall causes; (8) CPU cycles spent in the OS kernel and user space; and (9) CPU cycles spent in application logic and "datacenter tax" [23], such as libraries for RPC and compression.
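For readers who want to reproduce this kind of measurement, the sketch below shows one way to derive a few of these metrics (IPC and misses per kilo-instruction) from Linux hardware performance counters. It is a minimal illustration rather than DCPerf's actual collection code; the perf event names are common defaults that vary by CPU, and system-wide counting typically requires elevated privileges.

```python
# Sketch: derive IPC and misses-per-kilo-instruction (MPKI) from Linux
# hardware counters sampled while a benchmark runs. Event names are common
# defaults and vary by CPU; this is not DCPerf's actual collection code.
# System-wide counting (-a) usually needs root or a relaxed perf_event_paranoid.
import subprocess

EVENTS = ["instructions", "cycles", "branch-misses", "L1-icache-load-misses"]

def sample_counters(seconds: int = 10) -> dict:
    """Run 'perf stat' system-wide for a fixed window and return raw counts."""
    cmd = ["perf", "stat", "-a", "-x", ",", "-e", ",".join(EVENTS),
           "--", "sleep", str(seconds)]
    # perf writes its CSV counter lines to stderr: value,unit,event,...
    stderr = subprocess.run(cmd, capture_output=True, text=True).stderr
    counts = {}
    for line in stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].replace(".", "", 1).isdigit():
            counts[fields[2]] = float(fields[0])
    return counts

if __name__ == "__main__":
    c = sample_counters()
    ipc = c["instructions"] / c["cycles"]                             # metric (1)
    l1i_mpki = 1000 * c["L1-icache-load-misses"] / c["instructions"]  # metric (2)
    br_mpki = 1000 * c["branch-misses"] / c["instructions"]           # metric (3)
    print(f"IPC={ipc:.2f}  L1-I MPKI={l1i_mpki:.1f}  Branch MPKI={br_mpki:.1f}")
```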
To address these complexities, we devise a holistic approach to evaluate benchmark fidelity against production workloads across all aforementioned aspects. We then iteratively refine the benchmarks to reduce performance gaps relative to production workloads.

We make the following contributions in this paper.

Novelty: We develop a comprehensive approach to (1) faithfully modeling key software features of production workloads, such as highly optimized multi-process or multi-thread concurrency, and (2) measuring and improving benchmark fidelity against production workloads across a broad range of microarchitecture metrics. This novel combination is made possible by our direct access to the source code and production deployments of datacenter workloads. This approach enables the creation of benchmarks that accurately reflect real-world applications with millions of lines of code, within a 3.3% error margin. To our knowledge, this level of comprehensiveness and accuracy has not been reported previously.

Impact: Over the past three years, DCPerf has served as Meta's primary tool to inform CPU selection across x86 and ARM, influencing procurement decisions for millions of CPUs. Additionally, it has guided CPU vendors in optimizing their products effectively. For instance, in 2023, it enabled a CPU vendor to implement approximately 10 microarchitecture optimizations, resulting in an overall 38% performance improvement for our web application that runs on more than half a million servers. Finally, by making DCPerf open-source [9], we hope to inspire industry peers to also share their benchmarks, such as those for search and e-commerce.

Experiences: We share real-world examples of using DCPerf in critical decision-making, such as selecting future CPU SKUs and guiding CPU vendors in optimizing their designs. These kinds of experiences are rarely reported in research literature but offer valuable insights and motivation for future research.

Broad Usage: Although this paper focuses on applying DCPerf to CPU selection and optimization due to space constraints, DCPerf, as a general-purpose benchmark suite for datacenter applications, can help evaluate performance improvements or regressions in common software components it utilizes, including compilers, runtimes (e.g., PHP/HHVM and Python/Django), common libraries (e.g., Thrift and Folly), Memcache, Spark, or the OS kernel. For instance, Section 5.3 demonstrates how DCPerf helps identify scalability bottlenecks in the Linux kernel. Furthermore, DCPerf can help assess the effectiveness of a wide range of research ideas, such as resource allocation, resource isolation, performance modeling, performance optimization, fault diagnosis, and power management. These use cases are similar to existing full-application benchmarks like RUBiS [5, 33], TPC-W [27], and BigDataBench [43], but with the added advantage that DCPerf is well-calibrated with production datacenter workloads.

2 Requirements and Design Considerations
Before delving into the design of DCPerf and its benchmarks, we first outline the requirements and key design considerations.

2.1 Easily Deployable Outside Meta
While DCPerf is designed to accurately model Meta's production workloads, a key requirement is that CPU vendors can independently set up and run DCPerf without dependence on Meta's production environment. This independence enables CPU vendors to optimize the CPU's design, microcode, firmware, and configuration during the early stages of CPU development, even before Meta has access to any CPU samples. Although Meta has a performance testing platform [6] capable of running production code and identifying minor performance differences, it is unsuitable for external use due to its dependencies on Meta's production environment.

Moreover, although DCPerf is designed to model datacenter applications often running at large scale, to make it practical for developers outside Meta to use, it must operate on just one or a few servers without requiring a large-scale setup. In most cases, its benchmarks need only a single server to run. For benchmarks deployed as distributed systems, where the primary component running on one server depends on auxiliary components running on two or three other servers, only the primary component's performance is assessed. This component must be deployed on the server being evaluated, while auxiliary components can be deployed on any server. Overall, we have designed DCPerf to require only a few servers and streamlined the benchmarking process into three simple steps: clone the repository, build, and run the benchmarks.

2.2 High Fidelity with Production Workloads
In the past, some benchmarks were designed to mimic the functionality of real-world applications but did not faithfully replicate their software architectures or traffic patterns due to a lack of access to proprietary information. For instance, RUBiS [5, 33] and TPC-W [27] mimic an auction site and an e-commerce site, respectively, and are widely used in performance studies.
DCPerf ISCA ’25, June 21–25, 2025, Tokyo, Japan
However, we consider such benchmarks insufficient, as they cannot accurately project the modeled application's performance on new CPUs.

In contrast, the DCPerf benchmarks' software architectures and traffic patterns closely resemble those of the production workloads they model. For example, while many caching benchmarks implement a look-aside cache, DCPerf uses a read-through cache because our production systems employ it to simplify application logic. Moreover, we model the "datacenter tax" [23] associated with RPC, compression, and various libraries used in production. We also ensure that the benchmark's threading model matches that of the production system, e.g., using separate thread pools to handle fast and slow code paths depending on factors such as cache hits. On machines with many CPU cores, the benchmark spawns multiple instances to model the production system's multi-tenancy setup and ensure scalability. In contrast, insufficient scalability on many-core machines is a key limitation of CloudSuite [11]. Finally, the benchmark enforces the same service level objectives (SLOs) used in production, such as maximizing throughput while maintaining the 95th-percentile latency under 500ms for our newsfeed benchmark.
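To make the architectural difference concrete, the following minimal sketch contrasts the two cache designs mentioned above. It is illustrative only, not TaoBench's or TAO's implementation, and the db object with a read(key) method is a placeholder.

```python
# Illustrative contrast between a look-aside cache (the application handles the
# miss) and a read-through cache (the cache layer itself fetches from the
# backing store). The `db` object is a placeholder with a read(key) method.

class LookAsideCache:
    """A plain key-value store; the application must handle misses."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)          # returns None on a miss
    def put(self, key, value):
        self.data[key] = value

def lookup_with_lookaside(cache, db, key):
    value = cache.get(key)
    if value is None:                      # miss: the application queries the
        value = db.read(key)               # database and fills the cache itself
        cache.put(key, value)
    return value

class ReadThroughCache:
    """The cache layer loads missing keys from the backing store, so the
    application simply calls get(); this keeps application logic simple."""
    def __init__(self, db):
        self.db = db
        self.data = {}
    def get(self, key):
        if key not in self.data:           # miss handled inside the cache layer
            self.data[key] = self.db.read(key)
        return self.data[key]
```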
In addition to software architecture, DCPerf generates traffic patterns or uses datasets that represent production systems. For example, the distribution of request and response sizes is replicated from production systems. In the benchmark for big-data processing, the dataset is scaled down compared to the production dataset, but we ensure that each server processes an amount of data similar to that in production, and the dataset retains features such as table schema, data types, cardinality, and the number of distinct values.

Moreover, while prior benchmarks often focus on application-level performance, DCPerf strives to ensure characteristics aligned at the microarchitecture level (e.g., IPC, cache misses, and instruction stall causes). To achieve this, DCPerf collects detailed statistics during each benchmarking run and analyzes them afterward. Furthermore, as the set of microarchitecture features evolves over time, DCPerf adopts an extensible framework that facilitates the easy addition of new metrics.

loads. This situation typically arises when some servers must handle a load spike due to another datacenter region failing entirely.

TCO consists of two components: capital expenditures (Capex) and operating expenses (Opex). Capex covers the purchase of physical hardware. Opex represents the ongoing costs required to keep servers operational, such as expenses for power and maintenance.

DCPerf is designed to capture both performance per unit of power consumption (Perf/Watt) and performance per TCO (Perf/$). While higher values of both metrics are preferred, they are not always aligned. For instance, CPU X may offer higher Perf/Watt but lower Perf/$, whereas CPU Y may have lower Perf/Watt but higher Perf/$. The decision depends on business priorities. For example, even though CPU X incurs a higher TCO, it might be preferred if its power efficiency enables the installation of more AI servers in a power-constrained datacenter, potentially delivering substantial business value. To help evaluate the trade-off between Perf/Watt and Perf/$, DCPerf records detailed statistics on CPU clock frequency and power consumption during benchmarking.

3 DCPerf Framework and Benchmarks
In this section, we present the DCPerf framework and the current set of benchmarks included in DCPerf.

[Figure: DCPerf Automation Framework, showing the representative applications (TaoBench, FeedSim, DjangoBench, Mediawiki, SparkBench, VideoTranscode) and the monitoring hooks (Perf, CPU Util, MemStat, NetStat, CPUFreq, Power, uArch Topdown).]
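The hooks listed in the framework figure can be pictured as a small plug-in interface that the framework invokes around each benchmark run. The sketch below is hypothetical: the Hook class and method names are not DCPerf's actual API, and it only illustrates how collectors such as CPUFreq or Power monitoring could be attached.

```python
# Hypothetical sketch of a pluggable monitoring-hook interface; DCPerf's real
# hook API may look different. Each hook starts its collector before a run
# and reports its statistics afterward.
import time

class Hook:
    name = "base"
    def before_run(self):           # e.g., start perf, power, or network sampling
        pass
    def after_run(self) -> dict:    # return whatever the hook collected
        return {}

class CpuFreqHook(Hook):
    name = "cpufreq"
    def before_run(self):
        self.start = time.time()
    def after_run(self):
        # A real hook would sample /sys/devices/system/cpu/*/cpufreq here.
        return {"duration_s": time.time() - self.start}

def run_benchmark(run_fn, hooks):
    """Run a benchmark callable with all registered hooks attached."""
    for h in hooks:
        h.before_run()
    run_fn()
    return {h.name: h.after_run() for h in hooks}

stats = run_benchmark(lambda: time.sleep(0.1), [CpuFreqHook()])
```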
Storage                 256GB SATA    512GB NVMe    512GB NVMe    1TB NVMe
Year of introduction    2018          2021          2022          2023
Table 3: Specification of x86-based production server SKUs.

[Figure 5: Average values for the different causes of stalls (frontend stalls, incorrect speculation, backend stalls, retiring), as a percentage of pipeline slots, for production workloads, DCPerf benchmarks, and SPEC CPU 2017.]
[Figure 6: IPC per physical core (with SMT On).]
[Figure 7: Memory bandwidth consumption, with the maximum system memory bandwidth marked.]
[Figure 8: L1 I-Cache misses (MPKI).]
[Figure 9: CPU utilization.]
[Figure: breakdown of server power across Core, SoC, Non-Core, DRAM, and Other, and of CPU cycles into application components and datacenter-tax components (RPC, Serialization, Compression, KVStore, ThreadManager, Memory, IO, Preparation, Hashing, and others), for production workloads, DCPerf benchmarks, and SPEC CPU 2017.]
[Figure 13: CloudSuite's benchmarking results. (a) Data Caching, (b) Web Serving, (c) In-memory Analysis.]
this is an area for future work. Despite this limitation, DCPerf represents a significant advancement in accounting for and evaluating the datacenter tax, an aspect overlooked by previous benchmarks.

                              SKU-A    SKU-B
Logical cores                 72       160
L1-I cache size (normalized)  4×       1×
RAM (GB)                      256      256
Network bandwidth (Gbps)      50       50
Server Power (Watt)           175      275
Table 4: Specification of the ARM-based new server SKUs.
4.6 Evaluating Alternatives: CloudSuite
To identify reasonable comparison baselines for evaluating DCPerf,
we explored several relevant benchmark suites [10, 11, 14, 24, 34, 41, 44, 46] that share similar goals of benchmarking datacenter applications. However, we found that they all fell short of representing Meta's datacenter workloads. In this section, we present performance results for CloudSuite [11] as an example.

CloudSuite's Data Caching benchmark runs Memcached with the Twitter dataset [29]. Its function is similar to DCPerf's TaoBench, though Data Caching does not implement the behavior of a read-through cache. We ran Data Caching on both SKU4 and SKU-A servers (see Tables 3 and 4 for server configurations). To drive up CPU utilization and maximize throughput, we experimented with different configurations for Data Caching, varying the number of server instances, server threads, and client threads. We report the results for the best configuration in Figure 13a, which shows the requests per second (RPS) achieved on the two server SKUs at various levels of CPU utilization.

On SKU-A with 72 cores, when CPU utilization increases from 12% to 88%, a 7.3-fold increase, the throughput increases by only 26%, showing that Data Caching has limited scalability. On SKU4 with 176 cores, the throughput actually decreases as both the thread pool size and CPU utilization increase, indicating a performance anomaly. In addition, the portion of "datacenter tax" in the benchmark is not modeled accurately to reflect the CPU cycle consumption in Meta's production. Moreover, we encountered segmentation faults in the client when trying to use more than five Memcached server instances. Overall, this experiment shows that Data Caching is not optimized to scale effectively on modern servers with very high core counts.

CloudSuite's Web Serving benchmark runs the open-source social networking engine Elgg [7] using PHP and Nginx, with Memcached providing caching and MariaDB serving as the database. This benchmark is similar to DCPerf's MediaWiki benchmark. We run Web Serving on an SKU4 server while varying the benchmark's load-scale factor from 10 to 400. Figure 13b shows the throughput (Ops/sec), peak CPU utilization, and error rates. Web Serving's throughput slows down after the load scale exceeds 100, even though CPU utilization continues increasing linearly until it reaches 100%. The rate of errors, most notably "504 Gateway Timeout," increases after the load scale exceeds 140, even while CPU utilization is still below 50%. Once again, this experiment shows that Web Serving is not optimized to scale effectively on servers with many cores.

CloudSuite's In-memory Analytics benchmark uses Apache Spark to run an alternating least squares (ALS) filtering algorithm [42] on a user-movie rating dataset called MovieLens [19]. This benchmark shares similarities with both SparkBench (in terms of the software stack) and FeedSim (in terms of its function, focusing on ranking rather than big-data queries). The MovieLens dataset's uncompressed size is around 1.2GB. We run this benchmark on an SKU4 server and compare its execution time and CPU utilization with SparkBench in Figure 13c. This benchmark only achieves about 20% CPU utilization throughout the run. We explored different configurations, such as Spark parameters for parallelism, number of executor workers, and executor cores, but failed to push the CPU utilization of this benchmark higher. Once again, this experiment shows that the benchmark is not optimized to scale effectively on servers with many cores.

5 Using DCPerf: Case Studies
DCPerf is Meta's primary tool for evaluating prospective CPUs, determining server configurations, guiding vendors in optimizing CPU microarchitectures, and identifying systems software inefficiencies. We describe several case studies below.

5.1 Choosing ARM-Based New Server SKUs
While the server SKUs in Table 3 are all based on x86, in 2023, we began designing a new server SKU based on ARM. The two server SKU candidates, SKU-A and SKU-B, shown in Table 4, use ARM CPUs from different vendors. In the early stages of server design, we had only a few testing servers for SKU-A and SKU-B, which were insufficient to set up and run large-scale, real-world production workloads for testing purposes. Therefore, we use DCPerf to evaluate these testing servers and compare them with the existing x86-based SKU1 and SKU4 servers shown in Table 3.
ISCA ’25, June 21–25, 2025, Tokyo, Japan Su et al.
[Figure 14: Comparing Perf/Watt across server SKUs, for TaoBench, Feedsim, DjangoBench, Mediawiki, SparkBench, the DCPerf suite overall, and SPEC 2017.]
[Figure 15: Impact of the vendor's improvement in the CPU's cache replacement algorithm, shown as changes (%) in App Perf, GIPS, IPC, L1-I cache miss, L2 cache miss, LLC miss, and memory BW usage for FB web (prod) and Mediawiki. GIPS on the x-axis means Giga Instructions Per Second.]
As described in Section 2.3, due to the ongoing shortage of power in datacenters, a key metric we consider is the performance that a server delivers per unit of power consumption (Perf/Watt), as opposed to considering absolute performance alone. We calculate the Perf/Watt metric as follows. While running DCPerf on a server, we collect performance numbers and monitor the server's power consumption. Dividing a benchmark's performance number by the average server power consumption during the benchmark's steady-state run yields the Perf/Watt metric. However, as the benchmarks report performance results in different scales and even different units, we need to normalize the results. We divide each benchmark's raw Perf/Watt result on the new server by its Perf/Watt result on the x86-based SKU1 server, which serves as the baseline for comparison. Finally, we calculate the geometric mean of the Perf/Watt metrics across all DCPerf benchmarks to produce a single Perf/Watt number for the DCPerf suite.
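This normalization reduces to a short calculation. The sketch below follows the procedure just described; the benchmark names are real DCPerf benchmarks, but the numbers are illustrative placeholders rather than measured results.

```python
# Sketch of the Perf/Watt aggregation described above. The values below are
# illustrative placeholders, not measured DCPerf results.
from math import prod

def perf_per_watt(perf_score: float, avg_power_watts: float) -> float:
    """Raw Perf/Watt: benchmark score divided by average steady-state power."""
    return perf_score / avg_power_watts

def suite_perf_per_watt(candidate: dict, baseline: dict) -> float:
    """Normalize each benchmark to the baseline SKU (e.g., SKU1), then take
    the geometric mean across all benchmarks to get one suite-level number."""
    ratios = [candidate[b] / baseline[b] for b in baseline]
    return prod(ratios) ** (1.0 / len(ratios))

# Raw Perf/Watt per benchmark (score per watt), per SKU -- placeholder numbers.
sku1 = {"TaoBench": 2.1, "FeedSim": 1.4, "SparkBench": 0.9}
sku_a = {"TaoBench": 2.6, "FeedSim": 1.8, "SparkBench": 1.7}
print(f"Suite Perf/Watt vs. baseline: {suite_perf_per_watt(sku_a, sku1):.2f}x")
```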
Figure 14 compares Perf/Watt across different server SKUs. In terms of Perf/Watt, ARM-based SKU-A outperforms x86-based SKU4, our latest server SKU running in production, by 25% overall, with the largest gain of 92% for SparkBench. In contrast, SKU-B underperforms SKU4 by 57% overall, with the largest losses of 85% for DjangoBench and 63% for Mediawiki, respectively. These results demonstrate that DCPerf is effective in identifying SKU-B's weaknesses in running datacenter applications, especially its inefficiency in handling user-facing web workloads, due to its smaller L1 I-Cache, which is not well-suited for the large code base of web workloads. With DCPerf's help, we decided to choose SKU-A over SKU-B as the next-generation ARM-based server SKU in our fleet.

This evaluation also shows that comparing Perf/Watt may lead to a different conclusion than comparing absolute performance. This is especially important when comparing ARM-based servers, which may be more power-efficient, with x86-based servers, which may offer better absolute performance. Specifically, ARM-based SKU-A has lower absolute performance but better Perf/Watt compared to x86-based SKU4. However, it is important to note that this is not a general conclusion about ARM and x86; SKU selection critically depends on the specific implementation of individual CPU products. For example, ARM-based SKU-B is inferior to x86-based SKU4 in terms of both Perf/Watt and absolute performance.

In this experiment, we also evaluate SPEC 2017's effectiveness in comparing the server SKUs. If we had relied on SPEC for decision-making, we would have incorrectly concluded that the ARM-based SKU-B is better than the x86-based SKU4. Moreover, because SKU-A and SKU-B are comparable in terms of Perf/Watt (1.8 versus 1.6) in the SPEC results, we would not have been able to decisively reject SKU-B and would have had to compare subtle differences in many other factors, such as price, vendor support, and CPU reliability, hoping that one of those factors would provide a significant enough difference to facilitate decision-making.

Finally, Figure 14 also demonstrates that different benchmarks, and their corresponding production workloads, scale performance differently across two server SKUs. Specifically, in terms of Perf/Watt, SKU-A outperforms SKU4 by 92% for SparkBench, while both perform nearly identically for MediaWiki. Although it is theoretically possible to design SKUs optimized for specific applications, this approach incurs high costs for introducing and maintaining additional SKUs in a hyperscale fleet and leads to resource waste due to SKU mismatches. For example, when one application's load grows slower than anticipated, the oversupply of its custom server SKU cannot be efficiently utilized by other applications. Therefore, SKU selection must consider benchmarks representing a wide range of workloads rather than focusing on individual ones. This is why, in Figure 2, we compare the aggregate performance across benchmarks in the suites rather than individual benchmarks.

5.2 Guiding Vendors to Optimize CPU Design
DCPerf not only benefits Meta in server SKU selection but also benefits CPU vendors by allowing them to independently run DCPerf in their in-house development environments to iteratively improve their new CPU products. In 2023, we collaborated with a CPU vendor to introduce their next-generation CPU to our fleet and optimize its performance in the process. One of the microarchitecture optimizations the vendor conducted was to iteratively improve the microcode for managing the CPU's cache replacement algorithm to enhance the cache hit rate.

As shown in Figure 15, in the vendor's development environment, this optimization improved the MediaWiki benchmark's performance by 3.5%, increased IPC by 1.9%, and reduced misses in the L1 I-cache by 36% and in the L2 cache by 28%. Later, we tested this optimization in Meta's production environment and confirmed a 2.9% performance improvement on Meta's Facebook web application, which runs on more than half a million servers and consists of millions of lines of code. Additionally, we confirmed improvements in the Facebook web application's microarchitecture metrics similar to those shown in Figure 15. In contrast, testing on SPEC 2017 revealed no noticeable performance changes. Without DCPerf, the vendor could not have made this optimization relying only on the standard SPEC benchmarks.
DCPerf ISCA ’25, June 21–25, 2025, Tokyo, Japan
[Figure 16: TaoBench's relative performance with different Linux kernels and server SKUs (176-core SKU: 100% on kernel 6.4, 103% on kernel 6.9; 384-core SKU: 162% on kernel 6.4, 249% on kernel 6.9).]

In addition to the optimization described above, the CPU vendor implemented approximately ten other microarchitecture optimizations under the guidance of DCPerf, such as tuning the uncore frequency and policies for managing TLB coherence. Altogether, these optimizations resulted in a 38% performance boost and a 47% Perf/Watt improvement for Meta's Facebook web application. Throughout the process, the vendor's ability to quickly and independently evaluate their optimizations in their development environment using DCPerf, which is easy to deploy yet effectively represents Meta's production workloads, was critical to their success.

5.3 Identifying Issues in the OS Kernel
In addition to optimizing CPU design and server selection, DCPerf also helps identify performance issues in system software such as the OS kernel. In 2024, to proactively prepare for scaling both our production workloads and DCPerf benchmarks to CPUs with significantly more cores, we tested DCPerf on server SKUs with 176 and 384 logical cores, respectively. During this process, we observed abnormal performance results for TaoBench. TaoBench's performance on the 384-core SKU was only 1.6 times its performance on the 176-core SKU, whereas we expected at least 384/176 ≈ 2.2 times higher performance because the 384-core SKU also has other improvements beyond the core count increase.

Among the many investigations we explored, one thing we suspected was that the kernel's limited scalability could be the cause of this performance anomaly. Therefore, we repeated the experiment on different kernel versions. Ultimately, we found that upgrading the Linux kernel from 6.4 to 6.9 resolved the anomaly. TaoBench's performance on the 384-core SKU increased to 2.5 times its performance on the 176-core SKU, aligning with our expectation. This experience indicated that kernel 6.4 had scalability issues when running on servers with many cores.

We further investigated this issue using DCPerf's extensible hooks, which enabled us to analyze hotspots in TaoBench's execution. On the 384-core SKU, we observed significant overhead in kernel 6.4's scheduling functions (e.g., enqueue_task_fair) and the nanosleep() system call invoked by TaoBench. Further analysis revealed that the root cause was lock contention on a counter used for tracking system load, exacerbated by the large number of CPU cores and threads. This issue was mitigated in kernel 6.9 by a patch that reduced the update frequency of the counter [26].
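For intuition about the nature of the fix, the following user-space sketch rate-limits updates to a shared counter so that threads take the contended lock only occasionally rather than on every update. It is a conceptual analogue only, not the actual kernel scheduler patch.

```python
# Conceptual, user-space analogue of rate-limiting updates to a contended
# shared counter (the actual fix is a Linux scheduler patch, not this code).
import threading, time

class RateLimitedCounter:
    def __init__(self, min_interval_s: float = 0.001):
        self.lock = threading.Lock()
        self.shared_total = 0
        self.min_interval_s = min_interval_s
        self.tls = threading.local()        # per-thread pending delta

    def add(self, delta: int) -> None:
        if not hasattr(self.tls, "pending"):
            self.tls.pending, self.tls.last_flush = 0, 0.0
        self.tls.pending += delta
        now = time.monotonic()
        if now - self.tls.last_flush >= self.min_interval_s:
            self.flush()
            self.tls.last_flush = now

    def flush(self) -> None:
        with self.lock:                     # contended path, taken only rarely
            self.shared_total += self.tls.pending
        self.tls.pending = 0
```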
To illustrate the impact of the issue and its resolution, Figure 16 shows TaoBench's performance across various kernel versions and server SKUs. On the 176-core SKU, the performance difference between kernel 6.4 and 6.9 is only 3%. However, on the 384-core SKU, TaoBench achieves 249%/162% − 1 ≈ 54% higher performance with kernel 6.9 compared to kernel 6.4.

Our experience debugging this issue highlights several lessons. First, as CPU core counts continue to grow rapidly, both system software (e.g., OS kernels) and applications are likely to encounter scaling bottlenecks; therefore, organizations must proactively identify and address these issues. For example, limited scalability makes CloudSuite [11] unrepresentative of modern datacenter applications. Second, debugging performance issues in system software with DCPerf is far easier than with complex datacenter applications in production. Without DCPerf, testing production caching workloads on exploratory server SKUs available in very limited quantities would have been impractical.

6 Takeaways: Insights and Lessons
In the previous sections, we shared our insights and lessons learned from developing and using DCPerf. To make these takeaways more accessible, we summarize them in this section.

Limitations of popular benchmarks: Users of SPEC CPU and other popular benchmarks should be aware of their limitations in representing datacenter workloads. Compared to real-world datacenter workloads, SPEC CPU overestimates runtime CPU frequency and significantly overstates the performance of many-core CPUs, while underestimating instruction cache misses, server power consumption, and CPU cycles spent in the OS kernel. Additionally, many full-system benchmarks, such as CloudSuite, fail to scale effectively on many-core CPUs, leading to a significant underestimation of those CPUs' actual performance.

Benchmark generalization: The majority of benchmarks in DCPerf represent common workloads across the industry, including web serving, caching, data analytics, and media processing. Although these benchmarks are common, their specific configurations are derived from Meta's workload characteristics, such as the size distribution of cached objects. If other organizations wish to have DCPerf represent their own workload characteristics, it is possible with some effort to change benchmark configurations to match their workloads.

Many-core CPU: As the core counts of current and future CPUs grow rapidly, system software (e.g., the OS kernel), applications, and benchmarks will all face significant scaling challenges. This is evident in the Linux kernel performance anomaly (Section 5.3) and the limited scalability of CloudSuite (Section 4.6). Organizations must invest sufficiently in software scalability ahead of time.

Perf/Watt: Due to power shortages in datacenters, Perf/Watt is as important as TCO and absolute performance. CPU vendors must prioritize Perf/Watt as a primary metric for optimization.

ARM versus x86: Our evaluation shows that ARM CPUs are now a viable option for datacenter use, with some ARM CPUs offering better Perf/Watt compared to certain x86 CPUs. However, the choice between ARM and x86 depends on the specific CPU implementation, as some ARM CPUs we evaluated are inferior to certain x86 CPUs in both Perf/Watt and absolute performance.

Post-silicon CPU optimization: Even after a CPU's tapeout and manufacturing, significant opportunities remain to enhance the CPU performance through optimizations in microcode, firmware, and various configurations. For instance, as described in Section 5.2, DCPerf enabled a vendor to boost the performance of their next-generation CPU for the Facebook web application by 38%.
ISCA ’25, June 21–25, 2025, Tokyo, Japan Su et al.
This underscores the importance of developing representative benchmarks to support such post-silicon optimizations.

Modeling software architecture: To achieve accurate performance projections, it is insufficient to merely model the functional behavior of datacenter applications; instead, benchmarks must capture key aspects of their software architecture, as these applications tend to be highly optimized. For example, TAO [3] utilizes separate thread pools for fast and slow paths and adopts a read-through cache instead of the more commonly used look-aside cache.

Aligning microarchitecture metrics: Since benchmarks are drastically simplified representations of real-world applications, perfect alignment between their microarchitecture performance characteristics is unrealistic. However, significant misalignment can serve as a useful indicator for identifying areas to improve benchmarks, as their refinement is a never-ending process.

Putting benchmarks on the critical path: Our three years of experience in developing and using DCPerf suggest that benchmark development is an iterative process requiring continuous investments to keep up with hardware evolution, workload changes, and emerging use cases. Through our exploration of existing benchmarks for potential adoption, we found that most had become outdated. Identifying business needs that rely on high-quality benchmarks on the critical path is essential, as it drives continuous improvements, a key factor behind DCPerf's success at Meta.

7 Related Work

Benchmark suites. Over the past decades, there have been continuous efforts to build benchmarks for various workloads [4, 10, 11, 14, 20, 24, 34, 41, 44, 46]. For example, SPEC [4], which has evolved for multiple generations, is the most popular suite for CPU benchmarking. In addition to SPEC CPU, which is focused on CPU core performance, it also has cloud-oriented benchmarks, such as SPEC Cloud IaaS, SPECjbb, and SPECvirt. Another example is TPC [20], which models and benchmarks Online Transaction Processing (OLTP) workloads that use transactions per second as a metric for performance comparisons.

CloudSuite [11] provides a collection of cloud benchmarks and reveals several microarchitectural implications and the differences between scale-out workloads and SPEC workloads. BigDataBench [44] collects benchmarks at different levels, ranging from the simplest micro-benchmarks to full applications. Tailbench [24] focuses on the performance of latency-critical applications. μSuite [41] identifies four RPC-based OLDI applications with three microservice tiers (front-end, mid-tier, and leaf) to study OS and network performance overhead, particularly the mid-tier, which behaves as both an RPC client and an RPC server. DeathStarBench [14] is another recent benchmark suite for large-scale cloud applications with tens of RPC-based or RESTful microservices.

However, as discussed in Section 1 and Section 4.6, these benchmarks have limitations, such as benchmark categories that do not match hyperscale workloads, mismatches with real-world software architecture, missing system components, limited scalability, and inadequate performance and power representativeness. For example, the benchmarks included in SPEC Cloud IaaS and SPECjbb are significantly different from Meta's production workloads. Although CloudSuite contains relevant benchmarks, its benchmarks cannot scale on many-core servers, as described in Section 4.6.

Profiling datacenter workloads. Datacenter workloads have been studied from multiple aspects. Google conducted profiling of fleet-wide CPU usage [23], big-data processing systems [15], and the RPC stack [36]. Meta conducted studies on datacenter networks [2, 32], microservice architecture [21, 39], and hardware optimization and acceleration opportunities [39, 40]. DCPerf benefits from these profiling efforts by incorporating their findings into the benchmarks to represent production workloads.

8 Conclusion and Future Work
DCPerf is the first performance benchmark suite actively used to inform procurement decisions for millions of CPUs in hyperscale datacenters while also remaining open source. Our evaluation demonstrates that DCPerf accurately projects the performance of representative production workloads within a 3.3% error margin across multiple server generations. Our future work includes broadening DCPerf's coverage, especially AI-related workloads [49], whose fleet sizes have been expanding rapidly, and improving DCPerf's projection accuracy. We hope that DCPerf will inspire industry peers to open-source their well-calibrated benchmarks as well, such as those for search and e-commerce.

Acknowledgments
We would like to thank numerous researchers and engineers from various teams at Meta who directly worked and/or provided input on DCPerf's design and implementation. We also thank our external CPU vendors and ISCA anonymous reviewers for their feedback and suggestions.

References
[1] Daniel J Barrett. 2008. MediaWiki: Wikipedia and Beyond. O'Reilly Media, Inc.
[2] Theophilus Benson, Aditya Akella, and David A. Maltz. 2010. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10).
[3] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, et al. 2013. TAO: Facebook's distributed data store for the social graph. In 2013 USENIX Annual Technical Conference (USENIX ATC 13).
[4] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE '18).
[5] Emmanuel Cecchet, Anupam Chanda, Sameh Elnikety, Julie Marguerite, and Willy Zwaenepoel. 2003. Performance comparison of middleware architectures for generating dynamic web content. In Proceedings of the 2003 ACM/IFIP/USENIX International Middleware Conference (Middleware'03).
[6] Mike Chow, Yang Wang, William Wang, Ayichew Hailu, Rohan Bopardikar, Bin Zhang, Jialiang Qu, David Meisner, Santosh Sonawane, Yunqi Zhang, Rodrigo Paim, Mack Ward, Ivor Huang, Matt McNally, Daniel Hodges, Zoltan Farkas, Caner Gocmen, Elvis Huang, and Chunqiang Tang. 2024. ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing. In Proceedings of the 28th Symposium on Operating Systems Principles (SOSP'24).
[7] Cash Costello. 2012. Elgg 1.8 Social Networking. Packt Publishing Ltd.
[8] Custom Market Insights. 2024. Global Data Center CPU Market 2024–2033. https://www.custommarketinsights.com/report/data-center-cpu-market/
[9] DCPerf. 2024. https://github.com/facebookresearch/DCPerf
[10] Andrea Detti, Ludovico Funari, and Luca Petrucci. 2023. uBench: An Open-Source Factory of Benchmark Microservice Applications. IEEE Transactions on Parallel and Distributed Systems 34, 3 (2023).
[11] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'12).
[12] Brad Fitzpatrick. 2004. Distributed caching with memcached. Linux Journal 2004, 124 (2004), 5.
[13] Jeff Fulmer. 2022. Joedog/Siege: Siege is an HTTP load tester and benchmarking utility. https://github.com/JoeDog/siege
[14] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'19).
[15] Abraham Gonzalez, Aasheesh Kolli, Samira Khan, Sihang Liu, Vidushi Dadu, Sagar Karandikar, Jichuan Chang, Krste Asanovic, and Parthasarathy Ranganathan. 2023. Profiling Hyperscale Big Data Processing. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA'23).
[16] Google. [n. d.]. Snappy, a fast compressor/decompressor. https://google.github.io/snappy/
[17] Google. 2015. Benchmarking web search latencies. https://cloudplatform.googleblog.com/2015/03/benchmarking-web-search-latencies.html
[18] Alex Guzman, Kyle Nekritz, and Subodh Iyengar. 2021. Deploying TLS 1.3 at scale with Fizz, a performant open source TLS library. https://engineering.fb.com/2018/08/06/security/fizz/
[19] F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5, 4 (2015).
[20] Trish Hogan. 2009. Overview of TPC Benchmark E: The Next Generation of OLTP Benchmarks. In Performance Evaluation and Benchmarking.
[21] Darby Huye, Yuri Shkuro, and Raja R. Sambasivan. 2023. Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows. In 2023 USENIX Annual Technical Conference (USENIX ATC 23).
[22] Geonhwa Jeong, Bikash Sharma, Nick Terrell, Abhishek Dhanotia, Zhiwei Zhao, Niket Agarwal, Arun Kejariwal, and Tushar Krishna. 2023. Characterization of data compression in datacenters. In Proceedings of the 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'23).
[23] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a warehouse-scale computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA'15).
[24] Harshad Kasture and Daniel Sanchez. 2016. Tailbench: a benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC'16).
[25] Ioannis Katsavounidis. 2015. NETFLIX – "EL Fuente" video sequence details and scenes. https://www.cdvl.org/media/1x5dnwt1/elfuente_summary.pdf
[26] Aaron Lu. 2023. sched/fair: Ratelimit update to tg->load_avg. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1528c661c24b407e92194426b0adbb43de859ce0
[27] Daniel A Menascé. 2002. TPC-W: A benchmark for e-commerce. IEEE Internet Computing 6, 3 (2002).
[28] Guilherme Ottoni. 2018. HHVM JIT: A Profile-guided, Region-based Compiler for PHP and Hack. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'18).
[29] Tapti Palit, Yongming Shen, and Michael Ferdman. 2016. Demystifying cloud benchmarking. In Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'16).
[30] Margaret H Pinson. 2013. The consumer digital video library [best of the web]. IEEE Signal Processing Magazine 30, 4 (2013).
[31] Will Reese. 2008. Nginx: the high-performance web server and reverse proxy. Linux Journal 2008, 173 (2008), 2.
[32] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. 2015. Inside the Social Network's (Datacenter) Network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM'15).
[33] RUBiS. 2024. https://github.com/uillianluiz/RUBiS
[34] Mohammad Reza Saleh Sedghpour, Aleksandra Obeso Duque, Xuejun Cai, Björn Skubic, Erik Elmroth, Cristian Klein, and Johan Tordsson. 2023. HydraGen: A Microservice Benchmark Generator. In 2023 IEEE 16th International Conference on Cloud Computing (CLOUD'23).
[35] James Sedgwick. 2019. Wangle - an asynchronous C++ networking and RPC library. https://engineering.fb.com/2016/04/20/networking-traffic/wangle-an-asynchronous-c-networking-and-rpc-library/
[36] Korakit Seemakhupt, Brent E. Stephens, Samira Khan, Sihang Liu, Hassan Wassel, Soheil Hassas Yeganeh, Alex C. Snoeren, Arvind Krishnamurthy, David E. Culler, and Henry M. Levy. 2023. A Cloud-Scale Characterization of Remote Procedure Calls. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP'23).
[37] Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al. 2019. Presto: SQL on everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE'19).
[38] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. 2007. Thrift: Scalable cross-language services implementation. Facebook White Paper 5, 8 (2007), 127.
[39] Akshitha Sriraman and Abhishek Dhanotia. 2020. Accelerometer: Understanding Acceleration Opportunities for Data Center Overheads at Hyperscale. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20).
[40] Akshitha Sriraman, Abhishek Dhanotia, and Thomas F. Wenisch. 2019. SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale. In Proceedings of the 46th Annual International Symposium on Computer Architecture (ISCA'19).
[41] Akshitha Sriraman and Thomas F. Wenisch. 2018. μSuite: A Benchmark Suite for Microservices. In 2018 IEEE International Symposium on Workload Characterization (IISWC'18).
[42] Gábor Takács and Domonkos Tikk. 2012. Alternating least squares for personalized ranking. In Proceedings of the 6th ACM Conference on Recommender Systems.
[43] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, et al. 2014. BigDataBench: A big data benchmark suite from internet services. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA'14).
[44] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark suite from internet services. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA'14).
[45] Michael Widenius and David Axmark. 2002. MySQL Reference Manual: Documentation from the Source. O'Reilly Media, Inc.
[46] Yanan Yang, Xiangyu Kong, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Heng Qi, and Keqiu Li. 2022. SDCBench: A Benchmark Suite for Workload Colocation and Evaluation in Datacenters. Intelligent Computing 2022 (2022).
[47] Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. In Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14).
[48] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12).
[49] Mark Zhao, Niket Agarwal, Aarti Basant, Buğra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. 2022. Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA'22).