and production deployments of datacenter workloads, we can ensure that the software architecture of DCPerf's benchmarks closely resembles the production workloads and periodically calibrate their performance to align with that of the production workloads. Second, we regularly use DCPerf to drive decisions in procuring millions of CPUs and guide CPU vendors in optimizing their designs. This regular exercise provides valuable feedback for improving DCPerf. In contrast, many past benchmarks have become outdated due to the absence of a strong mandate for continuous improvement.

The second challenge in developing benchmarks for datacenter workloads is that it is insufficient for a benchmark to simply achieve performance similar to the real-world application it models at the macro level in terms of throughput, latency, and CPU utilization. Instead, its performance characteristics at the microarchitecture level must also be sufficiently close. Otherwise, even if their performance is similar for the current generation of CPUs, as the next-generation CPU evolves with a different microarchitecture, performance may diverge significantly.

The key microarchitecture-level performance characteristics include: (1) instructions per cycle (IPC); (2) cache miss rates; (3) branch misprediction; (4) memory bandwidth usage; (5) effective CPU frequency; (6) overall power consumption and its breakdown across various CPU and server components; (7) instruction stall causes; (8) CPU cycles spent in the OS kernel and user space; and (9) CPU cycles spent in application logic and "datacenter tax" [23], such as libraries for RPC and compression.
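For readers who want to reproduce this kind of measurement, the sketch below shows one way to derive a few of these metrics (IPC and misses per kilo-instruction) from Linux hardware performance counters. It is a minimal illustration rather than DCPerf's actual collection code; the perf event names are common defaults that vary by CPU, and system-wide counting typically requires elevated privileges.

```python
# Sketch: derive IPC and misses-per-kilo-instruction (MPKI) from Linux
# hardware counters sampled while a benchmark runs. Event names are common
# defaults and vary by CPU; this is not DCPerf's actual collection code.
# System-wide counting (-a) usually needs root or a relaxed perf_event_paranoid.
import subprocess

EVENTS = ["instructions", "cycles", "branch-misses", "L1-icache-load-misses"]

def sample_counters(seconds: int = 10) -> dict:
    """Run 'perf stat' system-wide for a fixed window and return raw counts."""
    cmd = ["perf", "stat", "-a", "-x", ",", "-e", ",".join(EVENTS),
           "--", "sleep", str(seconds)]
    # perf writes its CSV counter lines to stderr: value,unit,event,...
    stderr = subprocess.run(cmd, capture_output=True, text=True).stderr
    counts = {}
    for line in stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].replace(".", "", 1).isdigit():
            counts[fields[2]] = float(fields[0])
    return counts

if __name__ == "__main__":
    c = sample_counters()
    ipc = c["instructions"] / c["cycles"]                             # metric (1)
    l1i_mpki = 1000 * c["L1-icache-load-misses"] / c["instructions"]  # metric (2)
    br_mpki = 1000 * c["branch-misses"] / c["instructions"]           # metric (3)
    print(f"IPC={ipc:.2f}  L1-I MPKI={l1i_mpki:.1f}  Branch MPKI={br_mpki:.1f}")
```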
To address these complexities, we devise a holistic approach to evaluate benchmark fidelity against production workloads across all aforementioned aspects. We then iteratively refine the benchmarks to reduce performance gaps relative to production workloads.

We make the following contributions in this paper.

Novelty: We develop a comprehensive approach to (1) faithfully modeling key software features of production workloads, such as highly optimized multi-process or multi-thread concurrency, and (2) measuring and improving benchmark fidelity against production workloads across a broad range of microarchitecture metrics. This novel combination is made possible by our direct access to the source code and production deployments of datacenter workloads. This approach enables the creation of benchmarks that accurately reflect real-world applications with millions of lines of code, within a 3.3% error margin. To our knowledge, this level of comprehensiveness and accuracy has not been reported previously.

Impact: Over the past three years, DCPerf has served as Meta's primary tool to inform CPU selection across x86 and ARM, influencing procurement decisions for millions of CPUs. Additionally, it has guided CPU vendors in optimizing their products effectively. For instance, in 2023, it enabled a CPU vendor to implement approximately 10 microarchitecture optimizations, resulting in an overall 38% performance improvement for our web application that runs on more than half a million servers. Finally, by making DCPerf open-source [9], we hope to inspire industry peers to also share their benchmarks, such as those for search and e-commerce.

Experiences: We share real-world examples of using DCPerf in critical decision-making, such as selecting future CPU SKUs and guiding CPU vendors in optimizing their designs. These kinds of experiences are rarely reported in research literature but offer valuable insights and motivation for future research.

Broad Usage: Although this paper focuses on applying DCPerf to CPU selection and optimization due to space constraints, DCPerf, as a general-purpose benchmark suite for datacenter applications, can help evaluate performance improvements or regressions in common software components it utilizes, including compilers, runtimes (e.g., PHP/HHVM and Python/Django), common libraries (e.g., Thrift and Folly), Memcache, Spark, or the OS kernel. For instance, Section 5.3 demonstrates how DCPerf helps identify scalability bottlenecks in the Linux kernel. Furthermore, DCPerf can help assess the effectiveness of a wide range of research ideas, such as resource allocation, resource isolation, performance modeling, performance optimization, fault diagnosis, and power management. These use cases are similar to existing full-application benchmarks like RUBiS [5, 33], TPC-W [27], and BigDataBench [43], but with the added advantage that DCPerf is well-calibrated with production datacenter workloads.

2 Requirements and Design Considerations
Before delving into the design of DCPerf and its benchmarks, we first outline the requirements and key design considerations.

2.1 Easily Deployable Outside Meta
While DCPerf is designed to accurately model Meta's production workloads, a key requirement is that CPU vendors can independently set up and run DCPerf without dependence on Meta's production environment. This independence enables CPU vendors to optimize the CPU's design, microcode, firmware, and configuration during the early stages of CPU development, even before Meta has access to any CPU samples. Although Meta has a performance testing platform [6] capable of running production code and identifying minor performance differences, it is unsuitable for external use due to its dependencies on Meta's production environment.

Moreover, although DCPerf is designed to model datacenter applications often running at large scale, to make it practical for developers outside Meta to use, it must operate on just one or a few servers without requiring a large-scale setup. In most cases, its benchmarks need only a single server to run. For benchmarks deployed as distributed systems, where the primary component running on one server depends on auxiliary components running on two or three other servers, only the primary component's performance is assessed. This component must be deployed on the server being evaluated, while auxiliary components can be deployed on any server. Overall, we have designed DCPerf to require only a few servers and streamlined the benchmarking process into three simple steps: clone the repository, build, and run the benchmarks.

2.2 High Fidelity with Production Workloads
In the past, some benchmarks were designed to mimic the functionality of real-world applications but did not faithfully replicate their software architectures or traffic patterns due to a lack of access to proprietary information. For instance, RUBiS [5, 33] and TPC-W [27] mimic an auction site and an e-commerce site, respectively, and are widely used in performance studies.
DCPerf ISCA ’25, June 21–25, 2025, Tokyo, Japan
However, we consider such benchmarks insufficient, as they cannot accurately project the modeled application's performance on new CPUs.

In contrast, the DCPerf benchmarks' software architectures and traffic patterns closely resemble those of the production workloads they model. For example, while many caching benchmarks implement a look-aside cache, DCPerf uses a read-through cache because our production systems employ it to simplify application logic. Moreover, we model the "datacenter tax" [23] associated with RPC, compression, and various libraries used in production. We also ensure that the benchmark's threading model matches that of the production system, e.g., using separate thread pools to handle fast and slow code paths depending on factors such as cache hits. On machines with many CPU cores, the benchmark spawns multiple instances to model the production system's multi-tenancy setup and ensure scalability. In contrast, insufficient scalability on many-core machines is a key limitation of CloudSuite [11]. Finally, the benchmark enforces the same service level objectives (SLOs) used in production, such as maximizing throughput while maintaining the 95th-percentile latency under 500ms for our newsfeed benchmark.
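To make the architectural difference concrete, the following minimal sketch contrasts the two cache designs mentioned above. It is illustrative only, not TaoBench's or TAO's implementation, and the db object with a read(key) method is a placeholder.

```python
# Illustrative contrast between a look-aside cache (the application handles the
# miss) and a read-through cache (the cache layer itself fetches from the
# backing store). The `db` object is a placeholder with a read(key) method.

class LookAsideCache:
    """A plain key-value store; the application must handle misses."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)          # returns None on a miss
    def put(self, key, value):
        self.data[key] = value

def lookup_with_lookaside(cache, db, key):
    value = cache.get(key)
    if value is None:                      # miss: the application queries the
        value = db.read(key)               # database and fills the cache itself
        cache.put(key, value)
    return value

class ReadThroughCache:
    """The cache layer loads missing keys from the backing store, so the
    application simply calls get(); this keeps application logic simple."""
    def __init__(self, db):
        self.db = db
        self.data = {}
    def get(self, key):
        if key not in self.data:           # miss handled inside the cache layer
            self.data[key] = self.db.read(key)
        return self.data[key]
```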
In addition to software architecture, DCPerf generates traffic patterns or uses datasets that represent production systems. For example, the distribution of request and response sizes is replicated from production systems. In the benchmark for big-data processing, the dataset is scaled down compared to the production dataset, but we ensure that each server processes an amount of data similar to that in production, and the dataset retains features such as table schema, data types, cardinality, and the number of distinct values.

Moreover, while prior benchmarks often focus on application-level performance, DCPerf strives to ensure characteristics aligned at the microarchitecture level (e.g., IPC, cache misses, and instruction stall causes). To achieve this, DCPerf collects detailed statistics during each benchmarking run and analyzes them afterward. Furthermore, as the set of microarchitecture features evolves over time, DCPerf adopts an extensible framework that facilitates the easy addition of new metrics.

loads. This situation typically arises when some servers must handle a load spike due to another datacenter region failing entirely.

TCO consists of two components: capital expenditures (Capex) and operating expenses (Opex). Capex covers the purchase of physical hardware. Opex represents the ongoing costs required to keep servers operational, such as expenses for power and maintenance.

DCPerf is designed to capture both performance per unit of power consumption (Perf/Watt) and performance per TCO (Perf/$). While higher values of both metrics are preferred, they are not always aligned. For instance, CPU X may offer higher Perf/Watt but lower Perf/$, whereas CPU Y may have lower Perf/Watt but higher Perf/$. The decision depends on business priorities. For example, even though CPU X incurs a higher TCO, it might be preferred if its power efficiency enables the installation of more AI servers in a power-constrained datacenter, potentially delivering substantial business value. To help evaluate the trade-off between Perf/Watt and Perf/$, DCPerf records detailed statistics on CPU clock frequency and power consumption during benchmarking.

3 DCPerf Framework and Benchmarks
In this section, we present the DCPerf framework and the current set of benchmarks included in DCPerf.

[Figure: DCPerf Automation Framework, showing the representative applications (TaoBench, FeedSim, DjangoBench, Mediawiki, SparkBench, VideoTranscode) and the monitoring hooks (Perf, CPU Util, MemStat, NetStat, CPUFreq, Power, uArch Topdown).]
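The hooks listed in the framework figure can be pictured as a small plug-in interface that the framework invokes around each benchmark run. The sketch below is hypothetical: the Hook class and method names are not DCPerf's actual API, and it only illustrates how collectors such as CPUFreq or Power monitoring could be attached.

```python
# Hypothetical sketch of a pluggable monitoring-hook interface; DCPerf's real
# hook API may look different. Each hook starts its collector before a run
# and reports its statistics afterward.
import time

class Hook:
    name = "base"
    def before_run(self):           # e.g., start perf, power, or network sampling
        pass
    def after_run(self) -> dict:    # return whatever the hook collected
        return {}

class CpuFreqHook(Hook):
    name = "cpufreq"
    def before_run(self):
        self.start = time.time()
    def after_run(self):
        # A real hook would sample /sys/devices/system/cpu/*/cpufreq here.
        return {"duration_s": time.time() - self.start}

def run_benchmark(run_fn, hooks):
    """Run a benchmark callable with all registered hooks attached."""
    for h in hooks:
        h.before_run()
    run_fn()
    return {h.name: h.after_run() for h in hooks}

stats = run_benchmark(lambda: time.sleep(0.1), [CpuFreqHook()])
```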
Storage                 256GB SATA    512GB NVMe    512GB NVMe    1TB NVMe
Year of introduction    2018          2021          2022          2023
Table 3: Specification of x86-based production server SKUs.

[Figure 5: Average values for the different causes of stalls (frontend stalls, incorrect speculation, backend stalls, retiring), as a percentage of pipeline slots, for production workloads, DCPerf benchmarks, and SPEC CPU 2017.]
[Figure 6: IPC per physical core (with SMT On).]
[Figure 7: Memory bandwidth consumption, with the maximum system memory bandwidth marked.]
[Figure 8: L1 I-Cache misses (MPKI).]
[Figure 9: CPU utilization.]
[Figure: breakdown of server power across Core, SoC, Non-Core, DRAM, and Other, and of CPU cycles into application components and datacenter-tax components (RPC, Serialization, Compression, KVStore, ThreadManager, Memory, IO, Preparation, Hashing, and others), for production workloads, DCPerf benchmarks, and SPEC CPU 2017.]
[Figure 13: CloudSuite's benchmarking results. (a) Data Caching, (b) Web Serving, (c) In-memory Analysis.]
this is an area for future work. Despite this limitation, DCPerf represents a significant advancement in accounting for and evaluating the datacenter tax, an aspect overlooked by previous benchmarks.

                              SKU-A    SKU-B
Logical cores                 72       160
L1-I cache size (normalized)  4×       1×
RAM (GB)                      256      256
Network bandwidth (Gbps)      50       50
Server Power (Watt)           175      275
Table 4: Specification of the ARM-based new server SKUs.
4.6 Evaluating Alternatives: CloudSuite
To identify reasonable comparison baselines for evaluating DCPerf,
we explored several relevant benchmark suites [10, 11, 14, 24, 34, 41, 44, 46] that share similar goals of benchmarking datacenter applications. However, we found that they all fell short of representing Meta's datacenter workloads. In this section, we present performance results for CloudSuite [11] as an example.

CloudSuite's Data Caching benchmark runs Memcached with the Twitter dataset [29]. Its function is similar to DCPerf's TaoBench, though Data Caching does not implement the behavior of a read-through cache. We ran Data Caching on both SKU4 and SKU-A servers (see Tables 3 and 4 for server configurations). To drive up CPU utilization and maximize throughput, we experimented with different configurations for Data Caching, varying the number of server instances, server threads, and client threads. We report the results for the best configuration in Figure 13a, which shows the requests per second (RPS) achieved on the two server SKUs at various levels of CPU utilization.

On SKU-A with 72 cores, when CPU utilization increases from 12% to 88%, a 7.3-fold increase, the throughput increases by only 26%, showing that Data Caching has limited scalability. On SKU4 with 176 cores, the throughput actually decreases as both the thread pool size and CPU utilization increase, indicating a performance anomaly. In addition, the portion of "datacenter tax" in the benchmark is not modeled accurately to reflect the CPU cycle consumption in Meta's production. Moreover, we encountered segmentation faults in the client when trying to use more than five Memcached server instances. Overall, this experiment shows that Data Caching is not optimized to scale effectively on modern servers with very high core counts.

CloudSuite's Web Serving benchmark runs the open-source social networking engine Elgg [7] using PHP and Nginx, with Memcached providing caching and MariaDB serving as the database. This benchmark is similar to DCPerf's MediaWiki benchmark. We run Web Serving on an SKU4 server while varying the benchmark's load-scale factor from 10 to 400. Figure 13b shows the throughput (Ops/sec), peak CPU utilization, and error rates. Web Serving's throughput slows down after the load scale exceeds 100, even though CPU utilization continues increasing linearly until it reaches 100%. The rate of errors, most notably "504 Gateway Timeout," increases after the load scale exceeds 140, even while CPU utilization is still below 50%. Once again, this experiment shows that Web Serving is not optimized to scale effectively on servers with many cores.

CloudSuite's In-memory Analytics benchmark uses Apache Spark to run an alternating least squares (ALS) filtering algorithm [42] on a user-movie rating dataset called MovieLens [19]. This benchmark shares similarities with both SparkBench (in terms of the software stack) and FeedSim (in terms of its function, focusing on ranking rather than big-data queries). The MovieLens dataset's uncompressed size is around 1.2GB. We run this benchmark on an SKU4 server and compare its execution time and CPU utilization with SparkBench in Figure 13c. This benchmark only achieves about 20% CPU utilization throughout the run. We explored different configurations, such as Spark parameters for parallelism, number of executor workers, and executor cores, but failed to push the CPU utilization of this benchmark higher. Once again, this experiment shows that the benchmark is not optimized to scale effectively on servers with many cores.

5 Using DCPerf: Case Studies
DCPerf is Meta's primary tool for evaluating prospective CPUs, determining server configurations, guiding vendors in optimizing CPU microarchitectures, and identifying systems software inefficiencies. We describe several case studies below.

5.1 Choosing ARM-Based New Server SKUs
While the server SKUs in Table 3 are all based on x86, in 2023, we began designing a new server SKU based on ARM. The two server SKU candidates, SKU-A and SKU-B, shown in Table 4, use ARM CPUs from different vendors. In the early stages of server design, we had only a few testing servers for SKU-A and SKU-B, which were insufficient to set up and run large-scale, real-world production workloads for testing purposes. Therefore, we use DCPerf to evaluate these testing servers and compare them with the existing x86-based SKU1 and SKU4 servers shown in Table 3.
ISCA ’25, June 21–25, 2025, Tokyo, Japan Su et al.
[Figure 14: Comparing Perf/Watt across server SKUs, for TaoBench, Feedsim, DjangoBench, Mediawiki, SparkBench, the DCPerf suite overall, and SPEC 2017.]
[Figure 15: Impact of the vendor's improvement in the CPU's cache replacement algorithm, shown as changes (%) in App Perf, GIPS, IPC, L1-I cache miss, L2 cache miss, LLC miss, and memory BW usage for FB web (prod) and Mediawiki. GIPS on the x-axis means Giga Instructions Per Second.]
As described in Section 2.3, due to the ongoing shortage of power in datacenters, a key metric we consider is the performance that a server delivers per unit of power consumption (Perf/Watt), as opposed to considering absolute performance alone. We calculate the Perf/Watt metric as follows. While running DCPerf on a server, we collect performance numbers and monitor the server's power consumption. Dividing a benchmark's performance number by the average server power consumption during the benchmark's steady-state run yields the Perf/Watt metric. However, as the benchmarks report performance results in different scales and even different units, we need to normalize the results. We divide each benchmark's raw Perf/Watt result on the new server by its Perf/Watt result on the x86-based SKU1 server, which serves as the baseline for comparison. Finally, we calculate the geometric mean of the Perf/Watt metrics across all DCPerf benchmarks to produce a single Perf/Watt number for the DCPerf suite.
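This normalization reduces to a short calculation. The sketch below follows the procedure just described; the benchmark names are real DCPerf benchmarks, but the numbers are illustrative placeholders rather than measured results.

```python
# Sketch of the Perf/Watt aggregation described above. The values below are
# illustrative placeholders, not measured DCPerf results.
from math import prod

def perf_per_watt(perf_score: float, avg_power_watts: float) -> float:
    """Raw Perf/Watt: benchmark score divided by average steady-state power."""
    return perf_score / avg_power_watts

def suite_perf_per_watt(candidate: dict, baseline: dict) -> float:
    """Normalize each benchmark to the baseline SKU (e.g., SKU1), then take
    the geometric mean across all benchmarks to get one suite-level number."""
    ratios = [candidate[b] / baseline[b] for b in baseline]
    return prod(ratios) ** (1.0 / len(ratios))

# Raw Perf/Watt per benchmark (score per watt), per SKU -- placeholder numbers.
sku1 = {"TaoBench": 2.1, "FeedSim": 1.4, "SparkBench": 0.9}
sku_a = {"TaoBench": 2.6, "FeedSim": 1.8, "SparkBench": 1.7}
print(f"Suite Perf/Watt vs. baseline: {suite_perf_per_watt(sku_a, sku1):.2f}x")
```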
Figure 14 compares Perf/Watt across different server SKUs. In terms of Perf/Watt, ARM-based SKU-A outperforms x86-based SKU4, our latest server SKU running in production, by 25% overall, with the largest gain of 92% for SparkBench. In contrast, SKU-B underperforms SKU4 by 57% overall, with the largest losses of 85% for DjangoBench and 63% for Mediawiki, respectively. These results demonstrate that DCPerf is effective in identifying SKU-B's weaknesses in running datacenter applications, especially its inefficiency in handling user-facing web workloads, due to its smaller L1 I-Cache, which is not well-suited for the large code base of web workloads. With DCPerf's help, we decided to choose SKU-A over SKU-B as the next-generation ARM-based server SKU in our fleet.

This evaluation also shows that comparing Perf/Watt may lead to a different conclusion than comparing absolute performance. This is especially important when comparing ARM-based servers, which may be more power-efficient, with x86-based servers, which may offer better absolute performance. Specifically, ARM-based SKU-A has lower absolute performance but better Perf/Watt compared to x86-based SKU4. However, it is important to note that this is not a general conclusion about ARM and x86; SKU selection critically depends on the specific implementation of individual CPU products. For example, ARM-based SKU-B is inferior to x86-based SKU4 in terms of both Perf/Watt and absolute performance.

In this experiment, we also evaluate SPEC 2017's effectiveness in comparing the server SKUs. If we had relied on SPEC for decision-making, we would have incorrectly concluded that the ARM-based SKU-B is better than the x86-based SKU4. Moreover, because SKU-A and SKU-B are comparable in terms of Perf/Watt (1.8 versus 1.6) in the SPEC results, we would not have been able to decisively reject SKU-B and would have had to compare subtle differences in many other factors, such as price, vendor support, and CPU reliability, hoping that one of those factors would provide a significant enough difference to facilitate decision-making.

Finally, Figure 14 also demonstrates that different benchmarks, and their corresponding production workloads, scale performance differently across two server SKUs. Specifically, in terms of Perf/Watt, SKU-A outperforms SKU4 by 92% for SparkBench, while both perform nearly identically for MediaWiki. Although it is theoretically possible to design SKUs optimized for specific applications, this approach incurs high costs for introducing and maintaining additional SKUs in a hyperscale fleet and leads to resource waste due to SKU mismatches. For example, when one application's load grows slower than anticipated, the oversupply of its custom server SKU cannot be efficiently utilized by other applications. Therefore, SKU selection must consider benchmarks representing a wide range of workloads rather than focusing on individual ones. This is why, in Figure 2, we compare the aggregate performance across benchmarks in the suites rather than individual benchmarks.

5.2 Guiding Vendors to Optimize CPU Design
DCPerf not only benefits Meta in server SKU selection but also benefits CPU vendors by allowing them to independently run DCPerf in their in-house development environments to iteratively improve their new CPU products. In 2023, we collaborated with a CPU vendor to introduce their next-generation CPU to our fleet and optimize its performance in the process. One of the microarchitecture optimizations the vendor conducted was to iteratively improve the microcode for managing the CPU's cache replacement algorithm to enhance the cache hit rate.

As shown in Figure 15, in the vendor's development environment, this optimization improved the MediaWiki benchmark's performance by 3.5%, increased IPC by 1.9%, and reduced misses in the L1 I-cache by 36% and in the L2 cache by 28%. Later, we tested this optimization in Meta's production environment and confirmed a 2.9% performance improvement on Meta's Facebook web application, which runs on more than half a million servers and consists of millions of lines of code. Additionally, we confirmed improvements in the Facebook web application's microarchitecture metrics similar to those shown in Figure 15. In contrast, testing on SPEC 2017 revealed no noticeable performance changes. Without DCPerf, the vendor could not have made this optimization relying only on the standard SPEC benchmarks.
DCPerf ISCA ’25, June 21–25, 2025, Tokyo, Japan
[Figure 16: TaoBench's relative performance with different Linux kernels and server SKUs (176-core SKU: 100% on kernel 6.4, 103% on kernel 6.9; 384-core SKU: 162% on kernel 6.4, 249% on kernel 6.9).]

In addition to the optimization described above, the CPU vendor implemented approximately ten other microarchitecture optimizations under the guidance of DCPerf, such as tuning the uncore frequency and policies for managing TLB coherence. Altogether, these optimizations resulted in a 38% performance boost and a 47% Perf/Watt improvement for Meta's Facebook web application. Throughout the process, the vendor's ability to quickly and independently evaluate their optimizations in their development environment using DCPerf, which is easy to deploy yet effectively represents Meta's production workloads, was critical to their success.

5.3 Identifying Issues in the OS Kernel
In addition to optimizing CPU design and server selection, DCPerf also helps identify performance issues in system software such as the OS kernel. In 2024, to proactively prepare for scaling both our production workloads and DCPerf benchmarks to CPUs with significantly more cores, we tested DCPerf on server SKUs with 176 and 384 logical cores, respectively. During this process, we observed abnormal performance results for TaoBench. TaoBench's performance on the 384-core SKU was only 1.6 times its performance on the 176-core SKU, whereas we expected at least 384/176 ≈ 2.2 times higher performance because the 384-core SKU also has other improvements beyond the core count increase.

Among the many investigations we explored, one thing we suspected was that the kernel's limited scalability could be the cause of this performance anomaly. Therefore, we repeated the experiment on different kernel versions. Ultimately, we found that upgrading the Linux kernel from 6.4 to 6.9 resolved the anomaly. TaoBench's performance on the 384-core SKU increased to 2.5 times its performance on the 176-core SKU, aligning with our expectation. This experience indicated that kernel 6.4 had scalability issues when running on servers with many cores.

We further investigated this issue using DCPerf's extensible hooks, which enabled us to analyze hotspots in TaoBench's execution. On the 384-core SKU, we observed significant overhead in kernel 6.4's scheduling functions (e.g., enqueue_task_fair) and the nanosleep() system call invoked by TaoBench. Further analysis revealed that the root cause was lock contention on a counter used for tracking system load, exacerbated by the large number of CPU cores and threads. This issue was mitigated in kernel 6.9 by a patch that reduced the update frequency of the counter [26].
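For intuition about the nature of the fix, the following user-space sketch rate-limits updates to a shared counter so that threads take the contended lock only occasionally rather than on every update. It is a conceptual analogue only, not the actual kernel scheduler patch.

```python
# Conceptual, user-space analogue of rate-limiting updates to a contended
# shared counter (the actual fix is a Linux scheduler patch, not this code).
import threading, time

class RateLimitedCounter:
    def __init__(self, min_interval_s: float = 0.001):
        self.lock = threading.Lock()
        self.shared_total = 0
        self.min_interval_s = min_interval_s
        self.tls = threading.local()        # per-thread pending delta

    def add(self, delta: int) -> None:
        if not hasattr(self.tls, "pending"):
            self.tls.pending, self.tls.last_flush = 0, 0.0
        self.tls.pending += delta
        now = time.monotonic()
        if now - self.tls.last_flush >= self.min_interval_s:
            self.flush()
            self.tls.last_flush = now

    def flush(self) -> None:
        with self.lock:                     # contended path, taken only rarely
            self.shared_total += self.tls.pending
        self.tls.pending = 0
```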
To illustrate the impact of the issue and its resolution, Figure 16 shows TaoBench's performance across various kernel versions and server SKUs. On the 176-core SKU, the performance difference between kernel 6.4 and 6.9 is only 3%. However, on the 384-core SKU, TaoBench achieves 249%/162% − 1 ≈ 54% higher performance with kernel 6.9 compared to kernel 6.4.

Our experience debugging this issue highlights several lessons. First, as CPU core counts continue to grow rapidly, both system software (e.g., OS kernels) and applications are likely to encounter scaling bottlenecks; therefore, organizations must proactively identify and address these issues. For example, limited scalability makes CloudSuite [11] unrepresentative of modern datacenter applications. Second, debugging performance issues in system software with DCPerf is far easier than with complex datacenter applications in production. Without DCPerf, testing production caching workloads on exploratory server SKUs available in very limited quantities would have been impractical.

6 Takeaways: Insights and Lessons
In the previous sections, we shared our insights and lessons learned from developing and using DCPerf. To make these takeaways more accessible, we summarize them in this section.

Limitations of popular benchmarks: Users of SPEC CPU and other popular benchmarks should be aware of their limitations in representing datacenter workloads. Compared to real-world datacenter workloads, SPEC CPU overestimates runtime CPU frequency and significantly overstates the performance of many-core CPUs, while underestimating instruction cache misses, server power consumption, and CPU cycles spent in the OS kernel. Additionally, many full-system benchmarks, such as CloudSuite, fail to scale effectively on many-core CPUs, leading to a significant underestimation of those CPUs' actual performance.

Benchmark generalization: The majority of benchmarks in DCPerf represent common workloads across the industry, including web serving, caching, data analytics, and media processing. Although these benchmarks are common, their specific configurations are derived from Meta's workload characteristics, such as the size distribution of cached objects. If other organizations wish to have DCPerf represent their own workload characteristics, it is possible with some effort to change benchmark configurations to match their workloads.

Many-core CPU: As the core counts of current and future CPUs grow rapidly, system software (e.g., the OS kernel), applications, and benchmarks will all face significant scaling challenges. This is evident in the Linux kernel performance anomaly (Section 5.3) and the limited scalability of CloudSuite (Section 4.6). Organizations must invest sufficiently in software scalability ahead of time.

Perf/Watt: Due to power shortages in datacenters, Perf/Watt is as important as TCO and absolute performance. CPU vendors must prioritize Perf/Watt as a primary metric for optimization.

ARM versus x86: Our evaluation shows that ARM CPUs are now a viable option for datacenter use, with some ARM CPUs offering better Perf/Watt compared to certain x86 CPUs. However, the choice between ARM and x86 depends on the specific CPU implementation, as some ARM CPUs we evaluated are inferior to certain x86 CPUs in both Perf/Watt and absolute performance.

Post-silicon CPU optimization: Even after a CPU's tapeout and manufacturing, significant opportunities remain to enhance the CPU performance through optimizations in microcode, firmware, and various configurations. For instance, as described in Section 5.2, DCPerf enabled a vendor to boost the performance of their next-generation CPU for the Facebook web application by 38%.
ISCA ’25, June 21–25, 2025, Tokyo, Japan Su et al.
This underscores the importance of developing representative benchmarks to support such post-silicon optimizations.

Modeling software architecture: To achieve accurate performance projections, it is insufficient to merely model the functional behavior of datacenter applications; instead, benchmarks must capture key aspects of their software architecture, as these applications tend to be highly optimized. For example, TAO [3] utilizes separate thread pools for fast and slow paths and adopts a read-through cache instead of the more commonly used look-aside cache.

Aligning microarchitecture metrics: Since benchmarks are drastically simplified representations of real-world applications, perfect alignment between their microarchitecture performance characteristics is unrealistic. However, significant misalignment can serve as a useful indicator for identifying areas to improve benchmarks, as their refinement is a never-ending process.

Putting benchmarks on the critical path: Our three years of experience in developing and using DCPerf suggest that benchmark development is an iterative process requiring continuous investments to keep up with hardware evolution, workload changes, and emerging use cases. Through our exploration of existing benchmarks for potential adoption, we found that most had become outdated. Identifying business needs that rely on high-quality benchmarks on the critical path is essential, as it drives continuous improvements, a key factor behind DCPerf's success at Meta.

7 Related Work

Benchmark suites. Over the past decades, there have been continuous efforts to build benchmarks for various workloads [4, 10, 11, 14, 20, 24, 34, 41, 44, 46]. For example, SPEC [4], which has evolved for multiple generations, is the most popular suite for CPU benchmarking. In addition to SPEC CPU, which is focused on CPU core performance, it also has cloud-oriented benchmarks, such as SPEC Cloud IaaS, SPECjbb, and SPECvirt. Another example is TPC [20], which models and benchmarks Online Transaction Processing (OLTP) workloads that use transactions per second as a metric for performance comparisons.

CloudSuite [11] provides a collection of cloud benchmarks and reveals several microarchitectural implications and the differences between scale-out workloads and SPEC workloads. BigDataBench [44] collects benchmarks at different levels, ranging from the simplest micro-benchmarks to full applications. Tailbench [24] focuses on the performance of latency-critical applications. μSuite [41] identifies four RPC-based OLDI applications with three microservice tiers (front-end, mid-tier, and leaf) to study OS and network performance overhead, particularly the mid-tier, which behaves as both an RPC client and an RPC server. DeathStarBench [14] is another recent benchmark suite for large-scale cloud applications with tens of RPC-based or RESTful microservices.

However, as discussed in Section 1 and Section 4.6, these benchmarks have limitations, such as benchmark categories that do not match hyperscale workloads, mismatches with real-world software architecture, missing system components, limited scalability, and inadequate performance and power representativeness. For example, the benchmarks included in SPEC Cloud IaaS and SPECjbb are significantly different from Meta's production workloads. Although CloudSuite contains relevant benchmarks, its benchmarks cannot scale on many-core servers, as described in Section 4.6.

Profiling datacenter workloads. Datacenter workloads have been studied from multiple aspects. Google conducted profiling of fleet-wide CPU usage [23], big-data processing systems [15], and the RPC stack [36]. Meta conducted studies on datacenter networks [2, 32], microservice architecture [21, 39], and hardware optimization and acceleration opportunities [39, 40]. DCPerf benefits from these profiling efforts by incorporating their findings into the benchmarks to represent production workloads.

8 Conclusion and Future Work
DCPerf is the first performance benchmark suite actively used to inform procurement decisions for millions of CPUs in hyperscale datacenters while also remaining open source. Our evaluation demonstrates that DCPerf accurately projects the performance of representative production workloads within a 3.3% error margin across multiple server generations. Our future work includes broadening DCPerf's coverage, especially AI-related workloads [49], whose fleet sizes have been expanding rapidly, and improving DCPerf's projection accuracy. We hope that DCPerf will inspire industry peers to open-source their well-calibrated benchmarks as well, such as those for search and e-commerce.

Acknowledgments
We would like to thank numerous researchers and engineers from various teams at Meta who directly worked and/or provided input on DCPerf's design and implementation. We also thank our external CPU vendors and ISCA anonymous reviewers for their feedback and suggestions.

References
[1] Daniel J Barrett. 2008. MediaWiki: Wikipedia and Beyond. O'Reilly Media, Inc.
[2] Theophilus Benson, Aditya Akella, and David A. Maltz. 2010. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10).
[3] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, et al. 2013. TAO: Facebook's distributed data store for the social graph. In 2013 USENIX Annual Technical Conference (USENIX ATC 13).
[4] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE '18).
[5] Emmanuel Cecchet, Anupam Chanda, Sameh Elnikety, Julie Marguerite, and Willy Zwaenepoel. 2003. Performance comparison of middleware architectures for generating dynamic web content. In Proceedings of the 2003 ACM/IFIP/USENIX International Middleware Conference (Middleware'03).
[6] Mike Chow, Yang Wang, William Wang, Ayichew Hailu, Rohan Bopardikar, Bin Zhang, Jialiang Qu, David Meisner, Santosh Sonawane, Yunqi Zhang, Rodrigo Paim, Mack Ward, Ivor Huang, Matt McNally, Daniel Hodges, Zoltan Farkas, Caner Gocmen, Elvis Huang, and Chunqiang Tang. 2024. ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing. In Proceedings of the 28th Symposium on Operating Systems Principles (SOSP'24).
[7] Cash Costello. 2012. Elgg 1.8 Social Networking. Packt Publishing Ltd.
[8] Custom Market Insights. 2024. Global Data Center CPU Market 2024–2033. https://www.custommarketinsights.com/report/data-center-cpu-market/
[9] DCPerf. 2024. https://github.com/facebookresearch/DCPerf
[10] Andrea Detti, Ludovico Funari, and Luca Petrucci. 2023. uBench: An Open-Source Factory of Benchmark Microservice Applications. IEEE Transactions on Parallel and Distributed Systems 34, 3 (2023).
[11] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'12).
[12] Brad Fitzpatrick. 2004. Distributed caching with memcached. Linux Journal 2004, 124 (2004), 5.
[13] Jeff Fulmer. 2022. Joedog/Siege: Siege is an HTTP load tester and benchmarking utility. https://github.com/JoeDog/siege
[14] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'19).
[15] Abraham Gonzalez, Aasheesh Kolli, Samira Khan, Sihang Liu, Vidushi Dadu, Sagar Karandikar, Jichuan Chang, Krste Asanovic, and Parthasarathy Ranganathan. 2023. Profiling Hyperscale Big Data Processing. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA'23).
[16] Google. [n. d.]. Snappy, a fast compressor/decompressor. https://google.github.io/snappy/
[17] Google. 2015. Benchmarking web search latencies. https://cloudplatform.googleblog.com/2015/03/benchmarking-web-search-latencies.html
[18] Alex Guzman, Kyle Nekritz, and Subodh Iyengar. 2021. Deploying TLS 1.3 at scale with Fizz, a performant open source TLS library. https://engineering.fb.com/2018/08/06/security/fizz/
[19] F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS) 5, 4 (2015).
[20] Trish Hogan. 2009. Overview of TPC Benchmark E: The Next Generation of OLTP Benchmarks. In Performance Evaluation and Benchmarking.
[21] Darby Huye, Yuri Shkuro, and Raja R. Sambasivan. 2023. Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows. In 2023 USENIX Annual Technical Conference (USENIX ATC 23).
[22] Geonhwa Jeong, Bikash Sharma, Nick Terrell, Abhishek Dhanotia, Zhiwei Zhao, Niket Agarwal, Arun Kejariwal, and Tushar Krishna. 2023. Characterization of data compression in datacenters. In Proceedings of the 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'23).
[23] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a warehouse-scale computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA'15).
[24] Harshad Kasture and Daniel Sanchez. 2016. Tailbench: a benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC'16).
[25] Ioannis Katsavounidis. 2015. NETFLIX – "EL Fuente" video sequence details and scenes. https://www.cdvl.org/media/1x5dnwt1/elfuente_summary.pdf
[26] Aaron Lu. 2023. sched/fair: Ratelimit update to tg->load_avg. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1528c661c24b407e92194426b0adbb43de859ce0
[27] Daniel A Menascé. 2002. TPC-W: A benchmark for e-commerce. IEEE Internet Computing 6, 3 (2002).
[28] Guilherme Ottoni. 2018. HHVM JIT: A Profile-guided, Region-based Compiler for PHP and Hack. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'18).
[29] Tapti Palit, Yongming Shen, and Michael Ferdman. 2016. Demystifying cloud benchmarking. In Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'16).
[30] Margaret H Pinson. 2013. The consumer digital video library [best of the web]. IEEE Signal Processing Magazine 30, 4 (2013).
[31] Will Reese. 2008. Nginx: the high-performance web server and reverse proxy. Linux Journal 2008, 173 (2008), 2.
[32] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. 2015. Inside the Social Network's (Datacenter) Network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM'15).
[33] RUBiS. 2024. https://github.com/uillianluiz/RUBiS
[34] Mohammad Reza Saleh Sedghpour, Aleksandra Obeso Duque, Xuejun Cai, Björn Skubic, Erik Elmroth, Cristian Klein, and Johan Tordsson. 2023. HydraGen: A Microservice Benchmark Generator. In 2023 IEEE 16th International Conference on Cloud Computing (CLOUD'23).
[35] James Sedgwick. 2019. Wangle - an asynchronous C++ networking and RPC library. https://engineering.fb.com/2016/04/20/networking-traffic/wangle-an-asynchronous-c-networking-and-rpc-library/
[36] Korakit Seemakhupt, Brent E. Stephens, Samira Khan, Sihang Liu, Hassan Wassel, Soheil Hassas Yeganeh, Alex C. Snoeren, Arvind Krishnamurthy, David E. Culler, and Henry M. Levy. 2023. A Cloud-Scale Characterization of Remote Procedure Calls. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP'23).
[37] Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al. 2019. Presto: SQL on everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE'19).
[38] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. 2007. Thrift: Scalable cross-language services implementation. Facebook White Paper 5, 8 (2007), 127.
[39] Akshitha Sriraman and Abhishek Dhanotia. 2020. Accelerometer: Understanding Acceleration Opportunities for Data Center Overheads at Hyperscale. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20).
[40] Akshitha Sriraman, Abhishek Dhanotia, and Thomas F. Wenisch. 2019. SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale. In Proceedings of the 46th Annual International Symposium on Computer Architecture (ISCA'19).
[41] Akshitha Sriraman and Thomas F. Wenisch. 2018. μSuite: A Benchmark Suite for Microservices. In 2018 IEEE International Symposium on Workload Characterization (IISWC'18).
[42] Gábor Takács and Domonkos Tikk. 2012. Alternating least squares for personalized ranking. In Proceedings of the 6th ACM Conference on Recommender Systems.
[43] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, et al. 2014. BigDataBench: A big data benchmark suite from internet services. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA'14).
[44] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark suite from internet services. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA'14).
[45] Michael Widenius and David Axmark. 2002. MySQL Reference Manual: Documentation from the Source. O'Reilly Media, Inc.
[46] Yanan Yang, Xiangyu Kong, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Heng Qi, and Keqiu Li. 2022. SDCBench: A Benchmark Suite for Workload Colocation and Evaluation in Datacenters. Intelligent Computing 2022 (2022).
[47] Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. In Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'14).
[48] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12).
[49] Mark Zhao, Niket Agarwal, Aarti Basant, Buğra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. 2022. Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA'22).