The Performance of Open Source Applications
Product and company names mentioned herein may be the trademarks of their respective owners.
While every precaution has been taken in the preparation of this book, the editors and authors assume no
responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-304-48878-7
Contents
Introduction
by Tavish Armstrong
3. Ninja
by Evan Martin
5. MemShrink
by Kyle Huey
7. Infinispan
by Manik Surtani
8. Talos
by Clint Talbert and Joel Maher
9. Zotonic
by Arjan Scherpenisse and Marc Worrell
11. Warp
by Kazu Yamamoto, Michael Snoyman, and Andreas Voellmy
Introduction

It’s commonplace to say that computer hardware is now so fast that most developers don’t have to
worry about performance. In fact, Douglas Crockford declined to write a chapter for this book for
that reason, citing Donald Knuth’s famous dictum:
We should forget about small efficiencies, say about 97% of the time: premature optimization is
the root of all evil.
but between mobile devices with limited power and memory, and data analysis projects that need
to process terabytes, a growing number of developers do need to make their code faster, their data
structures smaller, and their response times shorter. However, while hundreds of textbooks explain
the basics of operating systems, networks, computer graphics, and databases, few (if any) explain
how to find and fix things in real applications that are simply too damn slow.
This collection of case studies is our attempt to fill that gap. Each chapter is written by real
developers who have had to make an existing system faster or who had to design something to be
fast in the first place. They cover many different kinds of software and performance goals; what they
have in common is a detailed understanding of what actually happens when, and how the different
parts of large applications fit together. Our hope is that this book will—like its predecessor The
Architecture of Open Source Applications—help you become a better developer by letting you look
over these experts’ shoulders.
— Tavish Armstrong
Contributors
Tavish Armstrong (editorial): Tavish studies software engineering at Concordia University and
hopes to graduate in the spring of 2014. His online home is http://tavisharmstrong.com.
Michael Snoyman (Warp): Michael is the lead software engineer at FP Complete. He is the
founder and lead developer of the Yesod Web Framework, which provides a means of creating
robust, high-performance web applications. His formal studies include actuarial science, and he has
previously worked in the US auto and homeowner insurance industry analyzing large data sets.
Kazu Yamamoto (Warp): Kazu is a senior researcher at IIJ Innovation Institute. He has been
working on open source software for around 20 years. His products include Mew, KAME, Firemacs,
and mighty.
Andreas Voellmy (Warp): Andreas is a PhD candidate in Computer Science at Yale University.
Andreas uses Haskell in his research on software-defined networks and has published open source
Haskell packages, such as nettle-openflow, for controlling routers using Haskell programs. Andreas
also contributes to the GHC project and is a maintainer of GHC’s IO manager.
Ilya Grigorik (Chrome): Ilya is a web performance engineer and developer advocate on the Make
The Web Fast team at Google, where he spends his days and nights on making the web fast and
driving adoption of performance best practices. You can find Ilya online on his blog at igvita.com
and under @igrigorik on Twitter.
Evan Martin (Ninja): Evan has been a programmer at Google for nine years. His background
before that includes degrees in computer science and linguistics. He has hacked on many minor free
software projects and a few major ones, including LiveJournal. His website is http://neugierig.
org.
Bryce Howard (Mobile Performance): Bryce is a software architect who obsesses about making
things go fast. He has 15+ years in the industry, and has worked for a number of startups you’ve never
heard of. He is currently taking a stab at this whole “writing” thing and authoring an introductory
Amazon Web Services book for O’Reilly Associates.
Kyle Huey (MemShrink): Kyle works at the Mozilla Corporation on the Gecko rendering engine
that powers the Firefox web browser. He earned a Bachelor’s degree in mathematics from the
University of Florida before moving to San Francisco. He blogs at blog.kylehuey.com.
Clint Talbert (Talos): Clint has been involved in the Mozilla project for almost a decade, first
as a volunteer and then as an employee. He currently leads the Automation and Tools team with a
mandate to automate everything that can be automated, and a personal vendetta to eliminate idle
cycles on any automation machine. You can follow his adventures in open source and writing at
clinttalbert.com.
Joel Maher (Talos): Joel has over 15 years of experience automating software. In the last 5 years
at Mozilla, Joel has hacked on Mozilla’s automation and tools to extend them to mobile phones, and
has taken ownership of Talos to expand test coverage, improve reliability, and improve regression detection. While
his automation is running, Joel likes to get outdoors and tackle new challenges in life. For more
automation adventures, follow along at elvis314.wordpress.com.
Audrey Tang (Ethercalc): A self-educated programmer and translator based in Taiwan, Audrey
currently works at Socialtext with the job title “Untitled Page”, as well as at Apple on localization
and release engineering. Audrey has previously designed and led the Pugs project, the first working
Perl 6 implementation, and served in language design committees for Haskell, Perl 5, and Perl 6,
with numerous contributions to CPAN and Hackage. Follow Audrey on Twitter at @audreyt.
C. Titus Brown (Khmer): Titus has worked in evolutionary modeling, physical meteorology,
developmental biology, genomics, and bioinformatics. He is now an Assistant Professor at Michigan
State University, where he has expanded his interests into several new areas, including reproducibility
and maintainability of scientific software. He is also a member of the Python Software Foundation,
and blogs at http://ivory.idyll.org.
Eric McDonald (Khmer): Eric McDonald is a developer of scientific software with an emphasis
on high performance computing (HPC), the area in which he has worked much of the past 13 years.
Having previously worked with several varieties of physicists, he now helps bioinformaticians. He
holds a bachelor’s degree in Computer Science, Mathematics, and Physics. Eric has been a fan of
FOSS since the mid-nineties.
Douglas C. Schmidt (DaNCE): Dr. Douglas C. Schmidt is a Professor of Computer Science,
Associate Chair of the Computer Science and Engineering program, and a Senior Researcher at the
Institute for Software Integrated Systems, all at Vanderbilt University. Doug has published 10 books
and more than 500 technical papers covering a wide range of software-related topics, and led the
development of ACE, TAO, CIAO, and CoSMIC for the past two decades.
Aniruddha Gokhale (DaNCE): Dr. Aniruddha S. Gokhale is an Associate Professor in the
Department of Electrical Engineering and Computer Science, and Senior Research Scientist at the
Institute for Software Integrated Systems (ISIS) both at Vanderbilt University. He has over 140
technical articles to his credit, and his current research focuses on developing novel solutions to
emerging challenges in cloud computing and cyber physical systems.
William R. Otte (DaNCE): Dr. William R. Otte is a Research Scientist at the Institute for Software
Integrated Systems (ISIS) at Vanderbilt University. He has nearly a decade of experience developing
open source middleware and modeling tools for distributed, real-time and embedded systems, working
with both government and industrial partners including DARPA, NASA, Northrop Grumman and
Lockheed-Martin. He has published numerous technical articles and reports describing these
advances and has participated in the development of open standards for component middleware.
Manik Surtani (Infinispan): Manik is a core R&D engineer at JBoss, Red Hat’s middleware
division. He is the founder of the Infinispan project, and Platform Architect of the JBoss Data Grid.
He is also the spec lead of JSR 347 (Data Grids for the Java Platform), and represents Red Hat on
the Expert Group of JSR 107 (Temporary caching for Java). His interests lie in cloud and distributed
computing, big data and NoSQL, autonomous systems and highly available computing.
Arseny Kapoulkine (Pugixml): Arseny has spent his entire career programming graphics and low-
level systems in video games, ranging from small niche titles to multi-platform AAA blockbusters
such as FIFA Soccer. He enjoys making slow things fast and fast things even faster. He can be
reached at [email protected] or on Twitter @zeuxcg.
Arjan Scherpenisse (Zotonic): Arjan is one of the main architects of Zotonic and manages to
work on dozens of projects at the same time, mostly using Zotonic and Erlang. Arjan bridges the gap
between back-end and front-end Erlang projects. Besides issues like scalability and performance,
Arjan is often involved in creative projects. Arjan is a regular speaker at events.
Marc Worrell (Zotonic): Marc is a respected member of the Erlang community and was the
initiator of the Zotonic project. Marc spends his time consulting for large Erlang projects and on the
development of Zotonic, and is the CTO of Maximonster, the builders of MaxClass and LearnStone.
Acknowledgments
This book would not exist without the help of Amy Brown and Greg Wilson, who asked me to edit
the book and convinced me that it was possible. I’m also grateful to Tony Arkles for his help in the
earlier stages of editing, and to our technical reviewers.
A small army of copyeditors and helpers ensured the book got published this decade.
Amy Brown, Bruno Kinoshita, and Danielle Pham deserve special thanks for their help with the
book’s build process and graphics.
Editing a book is a difficult task, but it gets easier when you have encouraging friends. Natalie
Black, Julia Evans, and Kamal Marhubi were patient and enthusiastic throughout.
Contributing
Dozens of volunteers worked hard to create this book, but there is still a lot to do. If you’d like to help,
you can do so by reporting errors, translating the content into other languages, or describing other
open source systems. Please contact us at [email protected] if you would like to get involved.
[chapter 1]
High Performance Networking in Chrome
Ilya Grigorik
“It was so good that it essentially forced me to change my mind. . . ” - Eric Schmidt, on
his initial reluctance to the idea of developing Google Chrome.
Turns out, they could. Today Chrome is one of the most widely used browsers on the web
(35%+ of the market share according to StatCounter) and is now available on Windows, Linux, OS X,
and Chrome OS, as well as on Android and iOS. Clearly, the features and the functionality
resonated with users, and many of Chrome’s innovations have also found their way into other
popular browsers.
The original 38-page comic book explanation of the ideas and innovations of Google Chrome
offers a great overview of the thinking and design process behind the popular browser. However, this
was only the beginning. The core principles that motivated the original development of the browser
continue to be the guiding principles for ongoing improvements in Chrome:
• Speed: Make the fastest browser.
• Security: Provide the most secure environment to the user.
• Stability: Provide a resilient and stable web application platform.
• Simplicity: Create sophisticated technology, wrapped in a simple user experience.
As the team observed, many of the sites we use today are not just web pages, they are applications.
In turn, the ever more ambitious applications require speed, security, and stability. Each of these
deserves its own dedicated chapter, but since our subject is performance, our focus will be primarily
on speed.
Before a request can even be dispatched, the hostname must be resolved to an IP address via a
DNS lookup. The time taken to do the DNS lookup will vary based on your internet provider, the
popularity of the site and the likelihood of the hostname being in intermediate caches, as well as the
response time of the authoritative servers for that domain. In other words, there are a lot of variables
at play, but it is not unusual for a DNS lookup to take up to several hundred milliseconds. Ouch.
With the resolved IP address in hand, Chrome can now open a new TCP connection to the
destination, which means that we must perform the “three-way handshake”: SYN > SYN-ACK > ACK.
This exchange adds a full round-trip of latency delay to each and every new TCP connection—no
shortcuts. Depending on the distance between the client and the server, as well as the chosen routing
path, this can yield from tens to hundreds, or even thousands, of milliseconds of delay. All of this
work and latency is before even a single byte of application data has hit the wire.
Once the TCP handshake is complete, and if we are connecting to a secure destination (HTTPS),
then the SSL handshake must take place. This can add up to two additional round-trips of latency
delay between client and server. If the SSL session is cached, then we can “escape” with just one
additional round-trip.
Finally, Chrome is able to dispatch the HTTP request (requestStart in Figure 1.1). Once
received, the server can process the request and then stream the response data back to the client. This
incurs a minimum of one network round-trip, plus the processing time on the server. Following that,
we are done—unless the actual response is an HTTP redirect, in which case we may have to repeat
the entire cycle once over. Have a few gratuitous redirects on your pages? You may want to revisit
that decision.
Have you been counting all the delays? To illustrate the problem, let’s assume the worst case
scenario for a typical broadband connection: local cache miss, followed by a relatively fast DNS
lookup (50 ms), TCP handshake, SSL negotiation, and a relatively fast (100 ms) server response
time, with a round-trip time (RTT) of 80 ms (an average round-trip across continental USA):
• 50 ms for DNS
• 80 ms for TCP handshake (one RTT)
• 160 ms for SSL handshake (two RTTs)
• 40 ms for request to server
• 100 ms for server processing
• 40 ms for response from the server
That’s 470 milliseconds for a single request, which translates to over 80% of network latency
overhead as compared to the actual server processing time to fulfill the request—we have some work
to do here. In fact, even 470 milliseconds may be an optimistic estimate.
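To make the arithmetic concrete, here is a small sketch (the numbers are the illustrative ones from the
example above, not measurements) that totals the steps and compares the network-only portion against
the 250 ms rule-of-thumb budget discussed next:

# Illustrative latency budget from the example above, in milliseconds.
steps = {
    "DNS lookup": 50,
    "TCP handshake (1 RTT)": 80,
    "SSL handshake (2 RTTs)": 160,
    "request propagation": 40,
    "server processing": 100,
    "response propagation": 40,
}

total = sum(steps.values())                    # 470 ms end to end
network = total - steps["server processing"]   # 370 ms spent purely on the network
budget = 250                                   # ms: the rule-of-thumb render target

print("total: %d ms, network-only: %d ms" % (total, network))
print("over budget by %d ms before any server work" % (network - budget))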
Table 1.1 also explains the unofficial rule of thumb in the web performance community: render
your pages, or at the very least, provide visual feedback in under 250 ms to keep the user engaged.
This is not speed simply for speed’s sake. Studies at Google, Amazon, Microsoft, as well as thousands
of other sites show that additional latency has a direct impact on the bottom line of your site: faster
sites yield more pageviews, higher engagement from the users, and see higher conversion rates.
So, there you have it, our optimal latency budget is 250 ms, and yet as we saw in the example
above, the combination of a DNS lookup, the TCP and SSL handshakes, and propagation times for
the request add up to 370 ms. We are 50% over budget, and we still have not factored in the server
processing time!
To most users and even web developers, the DNS, TCP, and SSL delays are entirely transparent
and are negotiated at network layers to which few of us descend or think about. However, each of
these steps is critical to the overall user experience, since each extra network request can add tens or
hundreds of milliseconds of latency. This is the reason why Chrome’s network stack is much, much
more than a simple socket handler.
Now that we have identified the problem, let’s dive into the implementation details.
By default, desktop Chrome browsers use the process-per-site model, which isolates different sites
from each other, but groups all instances of the same site into the same process. However, to keep
things simple, let’s assume one of the simplest cases: one distinct process for each open tab. From
the network performance perspective, the differences here are not substantial, but the process-per-tab
model is much easier to understand.
The architecture dedicates one render process to each tab. Each render process contains instances
of the Blink layout engine and the V8 JavaScript engine, along with glue code that bridges these
(and a few other) components.
Each of these render processes is executed within a sandboxed environment that has limited
access to the user’s computer—including the network. To gain access to these resources, each render
process communicates with the main browser (or kernel) process, which is able to impose security
and access policies on each renderer.
Component          Contents
net/android        Bindings to the Android runtime
net/base           Common net utilities, such as host resolution, cookies, network change detection, and SSL certificate management
net/cookies        Implementation of storage, management, and retrieval of HTTP cookies
net/disk_cache     Disk and memory cache implementation for web resources
net/dns            Implementation of an asynchronous DNS resolver
net/http           HTTP protocol implementation
net/proxy          Proxy (SOCKS and HTTP) configuration, resolution, script fetching, etc.
net/socket         Cross-platform implementations of TCP sockets, SSL streams, and socket pools
net/spdy           SPDY protocol implementation
net/url_request    URLRequest, URLRequestContext, and URLRequestJob implementations
net/websockets     WebSockets protocol implementation
All of the network code is, of course, open source and can be found in the src/net subdirectory.
We will not examine each component in detail, but the layout of the code itself tells you a lot about
its capabilities and structure. A few examples are listed in Table 1.2.
The code for each of the components makes for a great read for the curious—it is well documented,
and you will find plenty of unit tests for every component.
Architecture and Performance on Mobile Platforms
Mobile browser usage is growing at an exponential rate and even by modest projections, it will
eclipse desktop browsing in the not so distant future. Needless to say, delivering an optimized mobile
experience has been a top priority for the Chrome team. In early 2012, Chrome for Android was
announced, and a few months later, Chrome for iOS followed.
The first thing to note about the mobile version of Chrome is that it is not simply a direct
adaptation of the desktop browser—that would not deliver the best user experience. By its very
nature, the mobile environment is both much more resource constrained, and has many fundamentally
different operating parameters:
• Desktop users navigate with the mouse, may have overlapping windows, have a large screen,
are mostly not power constrained, usually have a much more stable network connection, and
have access to much larger pools of storage and memory.
• Mobile users use touch and gesture navigation, have a much smaller screen, are battery and
power constrained, are often on metered connections, and have limited local storage and
memory.
Further, there is no such thing as a “typical mobile device”. Instead there is a wide range of
devices with varying hardware capabilities, and to deliver the best performance, Chrome must adapt
to the operating constraints of each and every device. Thankfully, the various execution models allow
Chrome to do exactly that.
On Android devices, Chrome leverages the same multi-process architecture as the desktop
version—there is a browser process, and one or more renderer processes. The one difference is that
due to memory constraints of the mobile device, Chrome may not be able to run a dedicated renderer
for each open tab. Instead, Chrome determines the optimal number of renderer processes based on
the available memory and other constraints of the device, and shares renderer processes between
multiple tabs.
In cases where only minimal resources are available, or if Chrome is unable to run multiple
processes, it can also switch to use a single-process, multi-threaded processing model. In fact, on
iOS devices, due to sandboxing restrictions of the underlying platform, it does exactly that—it runs a
single, but multi-threaded process.
What about network performance? First off, Chrome uses the same network stack on Android
and iOS as it does on all other versions. This enables all of the same network optimizations across all
platforms, which gives Chrome a significant performance advantage. However, what is different, and
is often adjusted based on the capabilities of the device and the network in use, are variables such as
priority of speculative optimization techniques, socket timeouts and management logic, cache sizes,
and more.
For example, to preserve battery, mobile Chrome may opt to lazily close idle sockets—sockets are
closed only when opening new ones, to minimize radio use. Similarly, since prerendering (which we
will discuss below) may require significant network and processing resources, it is often only enabled
when the user is on Wi-Fi.
Optimizing the mobile browsing experience is one of the highest priority items for the Chrome
development team, and we can expect to see a lot of new improvements in the months and years to
come. In fact, it is a topic that deserves its own separate chapter—perhaps in the next installment of
the POSA series.
Name                    Action
DNS pre-resolve         Resolve hostnames ahead of time, to avoid DNS latency
TCP pre-connect         Connect to the destination server ahead of time, to avoid TCP handshake latency
Resource prefetching    Fetch critical resources on the page ahead of time, to accelerate rendering of the page
Page prerendering       Fetch the entire page with all of its resources ahead of time, to enable instant navigation when triggered by the user
Each decision to invoke one or several of these techniques is optimized against a large number of
constraints. After all, each is a speculative optimization, which means that if done poorly, it might
trigger unnecessary work and network traffic, or even worse, have a negative effect on the loading
time for an actual navigation triggered by the user.
How does Chrome address this problem? The predictor consumes as many signals as it can,
which include user generated actions, historical browsing data, as well as signals from the renderer
and the network stack itself.
Not unlike the ResourceDispatcherHost, which is responsible for coordinating all of the
network activity within Chrome, the Predictor object creates a number of filters on user and
network generated activity within Chrome:
• IPC channel filter to monitor for signals from the render processes
• A ConnectInterceptor object is added to each request, such that it can observe the traffic
patterns and record success metrics for each request
As a hands-on example, the render process can trigger a message to the browser process with any
of the following hints, which are conveniently defined in ResolutionMotivation (url_info.h):
enum ResolutionMotivation {
MOUSE_OVER_MOTIVATED, // Mouse-over initiated by the user.
OMNIBOX_MOTIVATED, // Omnibox suggested resolving this.
STARTUP_LIST_MOTIVATED, // This resource is on the top 10 startup list.
EARLY_LOAD_MOTIVATED, // In some cases we use the prefetcher to warm up
// the connection in advance of issuing the real
// request.
// <snip> ...
};
Given such a signal, the goal of the predictor is to evaluate the likelihood of its success, and
then to trigger the activity if resources are available. Every hint may have a likelihood of success, a
priority, and an expiration timestamp, the combination of which can be used to maintain an internal
priority queue of speculative optimizations. Finally, for every dispatched request from within this
queue, the predictor is also able to track its success rate, which allows it to further optimize its future
decisions.
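As a rough illustration of that bookkeeping (not Chrome's actual code; the class and field names below
are invented), a speculative hint can be modeled as an entry in a priority queue keyed on expected
benefit, with outcomes fed back in after each dispatch:

import heapq
import itertools
import time

class SpeculativeHint(object):
    """Hypothetical model of one queued speculative optimization."""
    def __init__(self, host, likelihood, priority, ttl_seconds):
        self.host = host
        self.likelihood = likelihood        # estimated chance the work pays off
        self.priority = priority            # relative importance of the hint
        self.expires = time.time() + ttl_seconds

class HintQueue(object):
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker for equal scores
        self.stats = {}                     # host -> [successes, dispatches]

    def add(self, hint):
        # heapq is a min-heap, so negate the score to pop the best hint first.
        score = -(hint.likelihood * hint.priority)
        heapq.heappush(self._heap, (score, next(self._counter), hint))

    def pop_ready(self):
        # Return the most promising unexpired hint, silently dropping stale ones.
        while self._heap:
            _, _, hint = heapq.heappop(self._heap)
            if time.time() < hint.expires:
                self.stats.setdefault(hint.host, [0, 0])[1] += 1
                return hint
        return None

    def record_success(self, host):
        # Feed results back so future likelihood estimates can be adjusted.
        self.stats.setdefault(host, [0, 0])[0] += 1

Chrome's real predictor is far more elaborate, but the shape of the bookkeeping is the same: score,
expire, dispatch, and measure.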
The screenshot in Figure 1.4 is an example from my own Chrome profile. How do I usually begin
my browsing? Frequently by navigating to Google Docs if I’m working on an article such as this
one. Not surprisingly, we see a lot of Google hostnames in the list.
As you type into the Omnibox, each suggested action is evaluated with respect to the query, as well
as its past performance. In fact, Chrome allows us to inspect this data by visiting chrome://predictors.
Chrome maintains a history of the user-entered prefixes, the actions it has proposed, as well as
the hit rate for each one. For my own profile, you can see that whenever I enter “g” in the Omnibox,
there is a 76% chance that I’m heading to Gmail. Once I add an “m” (for “gm”), then the confidence
rises to 99.8%—in fact, out of the 412 recorded visits, only once did I not end up going to Gmail
after entering “gm”.
What does this have to do with the network stack? The yellow and green colors for the likely
candidates are also important signals for the ResourceDispatcher. If we have a likely candidate
(yellow), Chrome may trigger a DNS prefetch for the target host. If we have a high confidence
candidate (green), then Chrome may also trigger a TCP pre-connect once the hostname has been
resolved. Finally, if both complete while the user is still deliberating, then Chrome may even prerender
the entire page in a hidden tab.
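A sketch of that escalation logic (the thresholds and the Wi-Fi condition here are invented for
illustration; Chrome tunes the real ones per device and network):

def pick_speculative_action(confidence, on_wifi):
    """Map an Omnibox candidate's confidence score to a speculative action.

    The thresholds are illustrative only; just the ordering of the actions
    mirrors the behavior described above.
    """
    if confidence >= 0.90 and on_wifi:
        return "prerender"      # high confidence: render the page in a hidden tab
    if confidence >= 0.75:
        return "preconnect"     # likely: resolve DNS and complete the TCP handshake
    if confidence >= 0.30:
        return "dns-prefetch"   # plausible: just warm the DNS cache
    return "none"

# The "gm" prefix from the example above is a near-certain Gmail visit.
print(pick_speculative_action(0.998, on_wifi=True))    # prerender
print(pick_speculative_action(0.76, on_wifi=False))    # preconnect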
Alternatively, if there is no good match for the entered prefix based on past navigation history
then Chrome may issue a DNS prefetch and TCP pre-connect to your search provider in anticipation
of a likely search request.
An average user takes hundreds of milliseconds to fill in their query and to evaluate the proposed
autocomplete suggestions. In the background, Chrome is able to prefetch, pre-connect, and in certain
cases even prerender the page, so that by the time the user is ready to hit the “enter” key, much of the
network latency has already been eliminated.
If you are ever curious about the state of the Chrome cache you can open a new tab and navigate
to chrome://net-internals/#httpCache. Alternatively, if you want to see the actual HTTP
metadata and the cached response, you can also visit chrome://cache, which will enumerate all of
the resources currently available in the cache. From that page, search for a resource you are looking
for and click on the URL to see the exact, cached headers and response bytes.
When you perform a navigation, Chrome may record the hostnames for the popular resources on the page, and
during a future visit, it may choose to trigger a DNS pre-resolve and even a TCP pre-connect for
some or all of them.
To inspect the subresource hostnames stored by Chrome, navigate to chrome://dns and search
for any popular destination hostname for your profile. In the example above, you can see the six
subresource hostnames that Chrome remembered for Google+, as well as stats for the number of
cases when a DNS pre-resolution was triggered, or a TCP pre-connect was performed, as well as an
expected number of requests that will be served by each. This internal accounting is what enables
the Chrome predictor to perform its optimizations.
In addition to all of the internal signals, the owner of the site is also able to embed additional
markup on their pages to request the browser to pre-resolve a hostname:
<link rel="dns-prefetch" href="//host_name_to_prefetch.com">
Why not simply rely on the automated machinery in the browser? In some cases, you may want
to pre-resolve a hostname which is not mentioned anywhere on the page. A redirect is the canonical
example: a link may point to a host—like an analytics tracking service—which then redirects the
user to the actual destination. By itself, Chrome cannot infer this pattern, but you can help it by
providing a manual hint and get the browser to resolve the hostname of the actual destination ahead
of time.
How is this all implemented under the hood? The answer to this question, just like all other
optimizations we have covered, depends on the version of Chrome, since the team is always experi-
menting with new and better ways to improve performance. However, broadly speaking, the DNS
infrastructure within Chrome has two major implementations. Historically, Chrome has relied on the
platform-independent getaddrinfo() system call, and delegated the actual responsibility for the
lookups to the operating system. However, this approach is in the process of being replaced with
Chrome’s own implementation of an asynchronous DNS resolver.
The original implementation, which relied on the operating system, has its benefits: less and
simpler code, and the ability to leverage the operating system’s DNS cache. However, getaddrinfo()
is also a blocking system call, which meant that Chrome had to create and maintain a dedicated
worker thread-pool to allow it to perform multiple lookups in parallel. This unjoined pool was
capped at six worker threads, an empirical number based on the lowest common denominator
of hardware—it turns out that higher numbers of parallel requests can overload some users’ routers.
For pre-resolution with the worker-pool, Chrome simply dispatches the getaddrinfo() call,
which blocks the worker thread until the response is ready, at which point it just discards the returned
result and begins processing the next prefetch request. The result is cached by the OS DNS cache,
which returns an immediate response to future, actual getaddrinfo() lookups. It’s simple, effective,
and works well enough in practice.
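A minimal sketch of that warm-up trick (Chrome does this in C++ with its own thread pool; the Python
below and the hostnames in it are purely illustrative):

import socket
from concurrent.futures import ThreadPoolExecutor

# Cap parallelism at six lookups, mirroring the empirical limit described
# above, so simultaneous speculative queries don't overload home routers.
_pool = ThreadPoolExecutor(max_workers=6)

def warm_dns_cache(hostname):
    """Fire-and-forget pre-resolution: the result is discarded, but the OS
    DNS cache now holds the answer for the real lookup that follows."""
    def _resolve():
        try:
            socket.getaddrinfo(hostname, 80)   # blocks this worker thread only
        except socket.gaierror:
            pass                               # a failed speculative lookup is harmless
    _pool.submit(_resolve)

for host in ("www.example.com", "static.example.com", "mail.example.com"):
    warm_dns_cache(host)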
Well, effective, but not good enough. The getaddrinfo() call hides a lot of useful information
from Chrome, such as the time-to-live (TTL) timestamps for each record, as well as the state of the
DNS cache itself.
By moving DNS resolution into Chrome, the new async resolver enables a number of new
optimizations:
• better control of retransmission timers, and ability to execute multiple queries in parallel
• visibility into record TTLs, which allows Chrome to refresh popular records ahead of time
• better behavior for dual stack implementations (IPv4 and IPv6)
• failovers to different servers, based on RTT or other signals
All of the above, and more, are ideas for continuous experimentation and improvement within
Chrome. Which brings us to the obvious question: how do we know and measure the impact
of these ideas? Simple: Chrome tracks detailed network performance stats and histograms for
each individual profile. To inspect the collected DNS metrics, open a new tab, and head to
chrome://histograms/DNS (see Figure 1.9).
The above histogram shows the distribution of latencies for DNS prefetch requests: roughly 50%
(rightmost column) of the prefetch queries were finished within 20 ms (leftmost column). Note that
this is data based on a recent browsing session (9869 samples), and is private to the user. If the user
has opted in to report their usage stats in Chrome, then the summary of this data is anonymized and
periodically beaconed back to the engineering team, which is then able to see the impact of their
experiments and adjust accordingly.
Figure 1.10: Showing hosts for which TCP pre-connects have been triggered
When the predictor anticipates a navigation, Chrome can check its socket pools for a warm,
already-established connection to the destination, avoiding the TCP handshake and slow-start
penalties. If no socket is available, then it can initiate the TCP handshake ahead of time and place it
in the pool. Then, when the user initiates the navigation, the HTTP request can be dispatched
immediately.
For the curious, Chrome provides a utility at chrome://net-internals#sockets for exploring
the state of all the open sockets in Chrome. A screenshot is shown in Figure 1.11.
Note that you can also drill into each socket and inspect the timeline: connect and proxy times,
arrival times for each packet, and more. Last but not least, you can also export this data for further
analysis or a bug report. Having good instrumentation is key to any performance optimization,
and chrome://net-internals is the nexus of all things networking in Chrome—if you have not
explored it yet, you should!
Subresource and prefetch look very similar, but have very different semantics. When a link
resource specifies its relationship as “prefetch”, it is an indication to the browser that this resource
might be needed in a future navigation. In other words, it is effectively a cross-page hint. By contrast,
when a resource specifies the relationship as a “subresource”, it is an early indication to the browser
that the resource will be used on the current page, and that it may want to dispatch the request before it
encounters it later in the document.
As you would expect, the different semantics of the hints lead to very different behavior by
the resource loader. Resources marked with prefetch are considered low priority and might be
downloaded by the browser only once the current page has finished loading. Subresource resources
are fetched with high priority as soon as they are encountered and will compete with the rest of the
resources on the current page.
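The markup for the two hints mirrors the dns-prefetch example shown earlier (the URLs here are
placeholders):

<link rel="prefetch" href="//example.com/next-page-styles.css">
<link rel="subresource" href="//example.com/used-later-on-this-page.js">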
What if we went one step further and fetched not just a single resource but the entire page ahead of
the user’s navigation? You guessed it, that is prerendering in Chrome. Instead of just downloading a
single resource, as the “prefetch” hint would have done, the “prerender” attribute indicates to Chrome
that it should, well, prerender the page in a hidden tab, along with all of its subresources. The hidden tab itself
is invisible to the user, but when the user triggers the navigation, the tab is swapped in from the
background for an “instant experience”.
Curious to try it out? You can visit http://prerender-test.appspot.com for a hands-on
demo, and see the history and status of the prerendered pages for your profile by visiting
chrome://net-internals/#prerender. (See Figure 1.12.)
As you would expect, rendering an entire page in a hidden tab can consume a lot of resources,
both CPU and network, and hence should only be used in cases where we have high confidence that
the hidden tab will be used. For example, when you are using the Omnibox, a prerender may be
triggered for a high confidence suggestion. Similarly, Google Search sometimes adds a prerender
hint to its result markup, using the same link element mechanism as the dns-prefetch hint shown
earlier, when it estimates that the first search result is a high-confidence destination (also
known as Google Instant Pages).
Note that you can also add prerender hints to your own site. Before you do, note that prerendering
has a number of restrictions and limitations, which you should keep in mind:
• At most one prerender tab is allowed across all processes
• HTTPS and pages with HTTP authentication are not allowed
• Prerendering is abandoned if the requested resource, or any of its subresources need to make a
non-idempotent request (only GET requests allowed)
• All resources are fetched with lowest network priority
• The page is rendered with lowest CPU priority
• The page is abandoned if memory requirements exceed 100 MB
• Plugin initialization is deferred, and prerendering is abandoned if an HTML5 media element
is present
[chapter 2]
From SocialCalc to EtherCalc
Audrey Tang
EtherCalc is an online spreadsheet system optimized toward simultaneous editing, using SocialCalc
as its in-browser spreadsheet engine. Designed by Dan Bricklin (the inventor of spreadsheets),
SocialCalc is part of the Socialtext platform, a suite of social collaboration tools for enterprise users.
For the Socialtext team, performance was the primary goal behind SocialCalc’s development
in 2006. The key observation was this: Client-side computation in JavaScript, while an order of
magnitude slower than server-side computation in Perl, was still much faster than the network latency
incurred during AJAX round trips.
Figure 2.1: WikiCalc and SocialCalc’s performance model. Since 2009, advances in JavaScript runtimes have
reduced the 50 ms to less than 10 ms.
SocialCalc performs all of its computations in the browser; it uses the server only for loading
and saving spreadsheets. Toward the end of the Architecture of Open Source Applications [BW11]
chapter on SocialCalc, we introduced simultaneous collaboration on spreadsheets using a simple,
chatroom-like architecture.
Design Constraints
The Socialtext platform has both behind-the-firewall and on-the-cloud deployment options, imposing
unique constraints on EtherCalc’s resource and performance requirements.
At the time of this writing, Socialtext requires 2 CPU cores and 4 GB RAM for VMware vSphere-
based intranet deployment. For cloud-based hosting, a typical Amazon EC2 instance provides about
twice that capacity, with 4 cores and 7.5 GB of RAM.
Behind-the-firewall deployment means that we can’t simply throw hardware at the problem in the
same way multi-tenant, hosted-only systems did (e.g., DocVerse, which later became part of Google
Docs); we can assume only a modest amount of server capacity.
Compared to intranet deployments, cloud-hosted instances offer better capacity and on-demand
extension, but network connections from browsers are usually slower and fraught with frequent
disconnections and reconnections.
Therefore, the following resource constraints shaped EtherCalc’s architecture directions:
Memory: An event-based server allows us to scale to thousands of concurrent connections with a
small amount of RAM.
CPU: Based on SocialCalc’s original design, we offload most computations and all content rendering
to client-side JavaScript.
The server logs each command with a timestamp. If a client drops and reconnects, it can resume
by asking for a log of all requests since it was disconnected, then replay those commands locally to
get to the same state as its peers.
This simple design minimized server-side CPU and RAM requirements, and demonstrates
reasonable resiliency against network failure.
When a new client joins an edit session under the naive backlog model, it must replay thousands of
commands, incurring a significant startup delay before it can make any modifications.
To mitigate this issue, we implemented a snapshot mechanism. After every 100 commands
sent to a room, the server will poll the states from each active client, and save the latest snapshot it
receives next to the backlog. A freshly joined client receives the snapshot along with new commands
entered after the snapshot was taken, so it only needs to replay 99 commands at most.
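A toy sketch of that bookkeeping (EtherCalc itself implements this in LiveScript on Node.js; the
names below are invented for illustration):

import time

SNAPSHOT_EVERY = 100

class Room(object):
    """Toy model of one collaborative spreadsheet room."""
    def __init__(self):
        self.backlog = []         # list of (timestamp, command)
        self.snapshot = None      # latest state reported by a client
        self.snapshot_index = 0   # how many commands that snapshot already covers

    def receive(self, command, poll_clients):
        self.backlog.append((time.time(), command))
        if len(self.backlog) % SNAPSHOT_EVERY == 0:
            # Ask an active client for its current state and remember how
            # much of the backlog that state already includes.
            self.snapshot = poll_clients()
            self.snapshot_index = len(self.backlog)

    def join(self):
        """A freshly joined client gets the snapshot plus at most 99 commands."""
        pending = [cmd for _, cmd in self.backlog[self.snapshot_index:]]
        return self.snapshot, pending

    def resume(self, last_seen_time):
        """A reconnecting client replays everything logged since it dropped."""
        return [cmd for ts, cmd in self.backlog if ts > last_seen_time]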
This workaround solved the CPU issue for new clients, but created a network performance
problem of its own, as it taxes each client’s upload bandwidth. Over a slow connection, this delays
the reception of subsequent commands from the client.
Moreover, the server has no way to validate the consistency of snapshots submitted by clients.
Therefore, an erroneous—or malicious—snapshot can corrupt the state for all newcomers, placing
them out of sync with existing peers.
An astute reader may note that both problems are caused by the server’s inability to execute
spreadsheet commands. If the server could update its own state as it receives each command, it would
not need to maintain a command backlog at all.
The in-browser SocialCalc engine is written in JavaScript. We considered translating that
logic into Perl, but that would have carried the steep cost of maintaining two code bases. We
also experimented with embedded JS engines (V8, SpiderMonkey), but they imposed their own
performance penalties when running inside Feersum’s event loop.
Finally, by August 2011, we resolved to rewrite the server in Node.js.
In April 2012, after delivering a talk on EtherCalc at the OSDC.tw conference, I was invited by
Trend Micro to participate in their hackathon, adapting EtherCalc into a programmable visualization
engine for their real-time web traffic monitoring system.
For their use case we created REST APIs for accessing individual cells with GET/PUT as well as
POSTing commands directly to a spreadsheet. During the hackathon, the brand-new REST handler
received hundreds of calls per second, updating graphs and formula cell contents on the browser
without any hint of slowdown or memory leaks.
However, at the end-of-day demo, as we piped traffic data into EtherCalc and started to type
formulas into the in-browser spreadsheet, the server suddenly locked up, freezing all active connections.
We restarted the Node.js process, only to find it consuming 100% CPU, locking up again soon after.
Flabbergasted, we rolled back to a smaller data set, which did work correctly and allowed us to
finish the demo. But I wondered: what caused the lock-up in the first place?
To recreate the heavy background load, we performed high-concurrency REST API calls with
the Apache benchmarking tool ab. For simulating browser-side operations, such as moving cursors
and updating formulas, we used Zombie.js, a headless browser also built with jsdom and Node.js.
Ironically, the bottleneck turned out to be in jsdom itself.
From the report in Figure 2.6, we can see that RenderSheet dominates the CPU use. Each time
the server receives a command, it spends a few milliseconds to redraw the innerHTML of cells to
reflect the effect of each command.
Because all jsdom code runs in a single thread, subsequent REST API calls are blocked until the
previous command’s rendering completes. Under high concurrency, this queue eventually triggered
a latent bug that ultimately resulted in server lock-up.
As we scrutinized the heap usage, we saw that the rendered result is all but unreferenced, as we
don’t really need a real-time HTML display on the server side. The only reference to it is in the
HTML export API, and for that we can always reconstruct each cell’s innerHTML rendering from the
spreadsheet’s in-memory structure.
So, we removed jsdom from the RenderSheet function, re-implemented a minimal DOM in 20
lines of LiveScript1 for HTML export, then ran the profiler again (see Figure 2.7).
Much better! We have improved throughput by a factor of 4, HTML exporting is 20 times faster,
and the lock-up problem is gone.
Figure 2.7: Updated profiler screenshot (without jsdom)
However, while EtherCalc does support multi-server scaling with Redis, the interplay of Socket.io
clustering with RedisStore in a single server would have massively complicated the logic, making
debugging much more difficult.
Moreover, if all processes in the cluster are tied in CPU-bound processing, subsequent connections
would still get blocked.
For our purpose, the W3C Web Worker API is a perfect match. Originally intended for browsers,
it defines a way to run scripts in the background independently. This allows long tasks to be executed
continuously while keeping the main thread responsive.
So we created webworker-threads, a cross-platform implementation of the Web Worker API for
Node.js.
Using webworker-threads, it’s very straightforward to create a new SocialCalc thread and
communicate with it:
{ Worker } = require \webworker-threads
w = new Worker \packed-SocialCalc.js
w.onmessage = (event) -> ...
w.postMessage command
This solution offers the best of both worlds: It gives us the freedom to allocate more CPUs to
EtherCalc whenever needed, and the overhead of background thread creation remains negligible on
single-CPU environments.
Instead, I kept the focus on getting EtherCalc performing well without trading off one resource
requirement for another, thereby minimizing its CPU, RAM and network uses at the same time.
Indeed, since the RAM requirement is under 100 MB, even embedded platforms such as Raspberry
Pi can host it easily.
This self-imposed constraint made it possible to deploy EtherCalc on PaaS environments (e.g.,
DotCloud, Nodejitsu and Heroku) where all three resources are constrained instead of just one.
This made it very easy for people to set up a personal spreadsheet service, thus prompting more
contributions from independent integrators.
Worst is Best
At the YAPC::NA 2006 conference in Chicago, I was invited to predict the open-source landscape,
and this was my entry: 3
I think, but I cannot prove, that next year JavaScript 2.0 will bootstrap itself, complete
self-hosting, compile back to JavaScript 1, and replace Ruby as the Next Big Thing on
all environments.
I think CPAN and JSAN will merge; JavaScript will become the common backend for
all dynamic languages, so you can write Perl to run in the browser, on the server, and
inside databases, all with the same set of development tools.
Because, as we know, worse is better, so the worst scripting language is doomed to
become the best.
The vision turned into reality around 2009 with the advent of new JavaScript engines running
at the speed of native machine instructions. By the time of this writing, JavaScript has become a
“write once, run anywhere” virtual machine—other major languages can compile to it with almost no
performance penalty.
In addition to browsers on the client side and Node.js on the server, JavaScript also made headway
into the Postgres database, enjoying a large collection of freely reusable modules shared by these
runtime environments.
What enabled such sudden growth in the community? During the course of EtherCalc’s development,
from participating in the fledgling NPM community, I reckoned that it was precisely because
JavaScript prescribes very little and bends itself to various uses that innovators could focus on the
vocabulary and idioms (e.g., jQuery and Node.js), each team abstracting their own Good Parts from
a common, liberal core.
New users are offered a very simple subset to begin with; experienced developers are presented
with the challenge to evolve better conventions from existing ones. Instead of relying on a core team
of designers to get a complete language right for all anticipated uses, the grassroots development of
JavaScript echoes Richard P. Gabriel’s well-known maxim of Worse is Better.
LiveScript, Redux
In contrast to the straightforward Perl syntax of Coro::AnyEvent, the callback-based API of Node.js
necessitates deeply nested functions that are difficult to reuse.
3 See http://pugs.blogs.com/pugs/2006/06/my_yapcna_light.html
Conclusion
Unlike the SocialCalc project’s well-defined specification and team development process, EtherCalc
was a solo experiment from mid-2011 to late-2012 and served as a proving ground for assessing
Node.js’s readiness for production use.
This unconstrained freedom afforded an exciting opportunity to explore a wide variety of al-
ternative languages, libraries, algorithms and architectures. I’m very grateful to all contributors,
collaborators and integrators, and especially to Dan Bricklin and my Socialtext colleagues for
encouraging me to experiment with these technologies. Thank you, folks!
[chapter 3]
Ninja
Evan Martin
Ninja is a build system similar to Make. As input you describe the commands necessary to process
source files into target files. Ninja uses these commands to bring targets up to date. Unlike many
other build systems, Ninja’s main design goal was speed.
I wrote Ninja while working on Google Chrome. I started Ninja as an experiment to find out if
Chrome’s build could be made faster. To successfully build Chrome, Ninja’s other main design goal
followed: Ninja needed to be easily embedded within a larger build system.
Ninja has been quietly successful, gradually replacing the other build systems used by Chrome.
After Ninja was made public others contributed code to make the popular CMake build system
generate Ninja files—now Ninja is also used to develop CMake-based projects like LLVM and
ReactOS. Other projects, like TextMate, target Ninja directly from their custom build.
I worked on Chrome from 2007 to 2012, and started Ninja in 2010. There are many factors
contributing to the build performance of a project as large as Chrome (today around 40,000 files of
C++ code generating an output binary around 90 MB in size). During my time I touched many of
them, from distributing compilation across multiple machines to tricks in linking. Ninja primarily
targets only one piece—the front of a build. This is the wait between starting the build and the time
the first compile starts to run. To understand why that is important it is necessary to understand how
we thought about performance in Chrome itself.
3.2 The Design of Ninja
At a high level any build system performs three main tasks. It will (1) load and analyze build goals,
(2) figure out which steps need to run in order to achieve those goals, and (3) execute those steps.
To make startup in step (1) fast, Ninja needed to do a minimal amount of work while loading the
build files. Build systems are typically used by humans, which means they provide a convenient,
high-level syntax for expressing build goals. It also means that when it comes time to actually build
the project the build system must process the instructions further: for example, at some point Visual
Studio must concretely decide based on the build configuration where the output files must go, or
which files must be compiled with a C++ or C compiler.
Because of this, GYP’s work in generating Visual Studio files was effectively limited to translating
lists of source files into the Visual Studio syntax and leaving Visual Studio to do the bulk of the work.
With Ninja I saw the opportunity to do as much work as possible in GYP. In a sense, when GYP
generates Ninja build files, it does all of the above computation once. GYP then saves a snapshot of
that intermediate state into a format that Ninja can quickly load for each subsequent build.
Ninja’s build file language is therefore simple to the point of being inconvenient for humans to
write. There are no conditionals or rules based on file extensions. Instead, the format is just a list of
which exact paths produce which exact outputs. These files can be loaded quickly, requiring almost
no interpretation.
This minimalist design counterintuitively leads to greater flexibility. Because Ninja lacks higher-
level knowledge of common build concepts like an output directory or current configuration, Ninja is
simple to plug into larger systems (e.g., CMake, as we later found) that have different opinions about
how builds should be organized. For example, Ninja is agnostic as to whether build outputs (e.g.,
object files) are placed alongside the source files (considered poor hygiene by some) or in a separate
build output directory (considered hard to understand by others). Long after releasing Ninja I finally
thought of the right metaphor: whereas other build systems are compilers, Ninja is an assembler.
rule compile
  command = gcc -Wall -c $in -o $out

build out/foo.o: compile src/foo.c
build out/bar.o: compile src/bar.c
The second abstraction is the variable. In the example above, these are the dollar-sign-prefixed
identifiers ($in and $out). Variables can represent both the inputs and outputs of a command and
can be used to make short names for long strings. Here is an extended compile definition that makes
use of a variable for compiler flags:
cflags = -Wall

rule compile
  command = gcc $cflags -c $in -o $out
Variable values used in a rule can be shadowed in the scope of a single build block by indenting
their new definition. Continuing the above example, the value of cflags can be adjusted for a single
file.
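For example, a hypothetical src/baz.c could be compiled with extra warnings by shadowing cflags
inside its build block; the indented assignment applies to that one build statement only:

build out/baz.o: compile src/baz.c
  cflags = -Wall -Wextra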
Rules behave almost like functions and variables behave like arguments. These two simple
features are dangerously close to a programming language—the opposite of the “do no work” goal.
But they have the important benefit of reducing repeated strings which is not only useful for humans
but also for computers, reducing the quantity of text to be parsed.
The build file, once parsed, describes a graph of dependencies: the final output binary depends
on linking a number of objects, each of which is the result of compiling sources. Specifically it is a
bipartite graph, where “nodes” (input files) point to “edges” (build commands) which point to nodes
(output files). The build process then traverses this graph.
Given a target output to build, Ninja first walks up the graph to identify the state of each edge’s
input files: that is, whether or not the input files exist and what their modification times are. Ninja
then computes a plan. The plan is the set of edges that need to be executed in order to bring the
final target up to date, according to the modification times of the intermediate files. Finally, the plan
is executed, walking down the graph and checking off edges as they are executed and successfully
completed.
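The data structures involved can be sketched in a few lines (a simplification for illustration, not
Ninja's actual C++; it ignores diamond dependencies, missing inputs, and much else):

import os

class Node(object):
    """A file in the build graph."""
    def __init__(self, path):
        self.path = path
        self.in_edge = None                  # the Edge that produces this file, if any

    def mtime(self):
        try:
            return os.path.getmtime(self.path)
        except OSError:
            return None                      # missing file

class Edge(object):
    """A build command connecting input Nodes to output Nodes."""
    def __init__(self, command, inputs, outputs):
        self.command = command
        self.inputs = inputs
        self.outputs = outputs
        for out in outputs:
            out.in_edge = self

    def dirty(self):
        out_times = [o.mtime() for o in self.outputs]
        if any(t is None for t in out_times):
            return True                      # an output is missing entirely
        in_times = [i.mtime() for i in self.inputs if i.mtime() is not None]
        return bool(in_times) and max(in_times) > min(out_times)

def plan(target):
    """Walk up from a target Node and collect, in order, the edges to run."""
    edge = target.in_edge
    if edge is None:
        return []                            # a plain source file: nothing to do
    edges = []
    for inp in edge.inputs:
        edges.extend(plan(inp))              # bring the inputs up to date first
    if edges or edge.dirty():
        edges.append(edge)                   # stale, or fed by something stale
    return edges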
Once these pieces were in place I could establish a baseline benchmark for Chrome: the time to
run Ninja again after successfully completing a build. That is the time to load the build files, examine
the built state, and determine there was no work to do. The time it took for this benchmark to run
was just under a second. This was my new startup benchmark metric. However, as Chrome grew,
Ninja had to keep getting faster to keep that metric from regressing.
Parsing
Initially Ninja used a hand-written lexer and a recursive descent parser. The syntax was simple
enough, I thought. It turns out that for a large enough project like Chrome8, simply parsing the build
files (named with the extension .ninja) can take a surprising amount of time.
The original function to analyze a single character soon appeared in profiles.
A simple fix—at the time saving 200 ms—was to replace the function with a 256-entry lookup table
that could be indexed by the input character. Such a thing is trivial to generate using Python code
like:
import string

cs = set()
for c in string.ascii_letters + string.digits + r'+,-./\_$':
    cs.add(ord(c))
for i in range(256):
    print '%d,' % (i in cs),
This trick kept Ninja fast for quite a while. Eventually we moved to something more principled:
re2c, the lexer generator used by PHP. It can generate more complex lookup tables and trees of
unintelligible code.
It remains an open question as to whether treating the input as text in the first place is a good
idea. Perhaps we will eventually require Ninja’s input to be generated in some machine-friendly
format that would let us avoid parsing for the most part.
Canonicalization
Ninja avoids using strings to identify paths. Instead, Ninja maps each path it encounters to a unique
Node object and the Node object is used in the rest of the code. Reusing this object ensures that a
given path is only ever checked on disk once, and the result of that check (i.e., the modification time)
can be reused in other code.
The pointer to the Node object serves as a unique identity for that path. To test whether two
Nodes refer to the same path it is sufficient to compare pointers rather than perform a more costly
string comparison. For example, as Ninja walks up the graph of inputs to a build, it keeps a stack of
dependent Nodes to check for dependency loops: if A depends on B depends on C depends on
A, the build can’t proceed. This stack, representing files, can be implemented as a simple array of
pointers, and pointer equality can be used to check for duplicates.
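In sketch form, with Python object identity standing in for C++ pointer comparison (a toy illustration,
not Ninja's code):

class Node(object):
    def __init__(self, path):
        self.path = path

class NodeTable(object):
    """Interns paths: every name for a file maps to exactly one Node object."""
    def __init__(self):
        self._nodes = {}

    def get(self, path):
        # Real Ninja canonicalizes the path first, as described below.
        if path not in self._nodes:
            self._nodes[path] = Node(path)
        return self._nodes[path]

table = NodeTable()
a = table.get("src/foo.c")
b = table.get("src/foo.c")
assert a is b   # identity comparison replaces a costly string comparison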
8 Today’s Chrome build generates over 10 MB of .ninja files.
To always use the same Node for a single file, Ninja must reliably map every possible name
for a file into the same Node object. This requires a canonicalization pass on all paths mentioned
in input files, which transforms a path like foo/../bar.h into just bar.h. Initially Ninja simply
required all paths to be provided in canonical form but that ends up not working for a few reasons.
One is that user-specified paths (e.g., the command-line ninja ./bar.h) are reasonably expected to
work correctly. Another is that variables may combine to make non-canonical paths. Finally, the
dependency information emitted by gcc may be non-canonical.
Thus most of what Ninja ends up doing is path processing, so canonicalizing paths is another hot
point in profiles. The original implementation was written for clarity, not performance, so standard
optimization techniques—like removing a double loop or avoiding memory allocation—helped
considerably.
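A simplified canonicalization pass, for illustration only (Ninja's real implementation works in place
on C strings to avoid allocation and also handles Windows backslashes; leading-slash handling is
omitted here):

def canonicalize(path):
    """Collapse '.' and '..' components: 'foo/../bar.h' becomes 'bar.h'."""
    parts = []
    for component in path.split("/"):
        if component in ("", "."):
            continue                      # skip empty and '.' components
        if component == ".." and parts and parts[-1] != "..":
            parts.pop()                   # 'x/..' cancels out
        else:
            parts.append(component)
    return "/".join(parts) or "."

assert canonicalize("foo/../bar.h") == "bar.h"
assert canonicalize("./src//foo.c") == "src/foo.c"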
Dependency Files
There is an additional store of metadata that must be recorded and used across builds. To correctly
build C/C++ code a build system must accommodate dependencies between header files. Suppose
foo.c contains the line #include "bar.h" and bar.h itself includes the line #include "baz.h".
All three of those files (foo.c, bar.h, baz.h) then affect the result of compilation. For example,
changes to baz.h should still trigger a rebuild of foo.o.
Some build systems use a “header scanner” to extract these dependencies at build time, but this
approach can be slow and is difficult to make exactly correct in the presence of #ifdef directives.
Another alternative is to require the build files to correctly report all dependencies, including headers,
38 Ninja
but this is cumbersome for developers: every time you add or remove an #include statement you
need to modify or regenerate the build.
A better approach relies on the fact that at compile time gcc (and Microsoft’s Visual Studio) can
output which headers were used to build the output. This information, much like the command used
to generate an output, can be recorded and reloaded by the build system so that the dependencies can
be tracked exactly. For a first-time build, before there is any output, all files will be compiled, so no
header dependency information is needed yet. After the first compilation, modifications to any files used by an
output (including modifications that add or remove additional dependencies) will cause a rebuild,
keeping the dependency information up-to-date.
When compiling, gcc writes header dependencies in the format of a Makefile. Ninja then includes
a parser for the (simplified subset) Makefile syntax and loads all of this dependency information at
the next build. Loading this data is a major bottleneck. On a recent Chrome build, the dependency
information produced by gcc sums to 90 MB of Makefiles, all of which reference paths which must
be canonicalized before use.
Much like with the other parsing work, using re2c and avoiding copies where possible helped
with performance. However, much like how work was shifted to GYP, this parsing work can be
pushed to a time other than the critical path of startup. Our most recent work on Ninja (at the time of
this writing, the feature is complete but not yet released) has been to make this processing happen
eagerly during the build.
Once Ninja has started executing build commands, all of the performance-critical work has
been completed and Ninja is mostly idle as it waits for the commands it executes to complete. In
this new approach for header dependencies, Ninja uses this time to process the Makefiles emitted
by gcc as they are written, canonicalizing paths and processing the dependencies into a quickly
deserializable binary format. On the next build Ninja only needs to load this file. The impact is
dramatic, particularly on Windows. (This is discussed further later in this chapter.)
The “dependency log” needs to store thousands of paths and dependencies between those paths.
Loading this log and adding to it needs to be fast. Appending to this log should be safe, even in the
event of an interruption such as a cancelled build.
After considering many database-like approaches I finally came up with a trivial implementation:
the file is a sequence of records and each record is either a path or a list of dependencies. Each path
written to the file is assigned a sequential integer identifier. Dependencies are then lists of integers.
To add dependencies to the file, Ninja first writes new records for each path that doesn’t yet have an
identifier and then writes the dependency record using those identifiers. When loading the file on a
subsequent run Ninja can then use a simple array to map identifiers to their Node pointers.
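An in-memory sketch of that record scheme follows (names are illustrative; the actual on-disk serialization is not shown):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative model of the dependency log described above.
struct DepsLog {
    std::vector<std::string> id_to_path_;                        // id -> path record
    std::unordered_map<std::string, uint32_t> path_to_id_;
    std::unordered_map<uint32_t, std::vector<uint32_t>> deps_;   // output id -> input ids

    // Paths that are not yet known get a new sequential identifier (and, on
    // disk, a new path record appended to the log).
    uint32_t InternPath(const std::string& path) {
        auto it = path_to_id_.find(path);
        if (it != path_to_id_.end())
            return it->second;
        uint32_t id = static_cast<uint32_t>(id_to_path_.size());
        id_to_path_.push_back(path);
        path_to_id_.emplace(path, id);
        return id;
    }

    // A dependency record is just a list of integers referring to path records.
    void RecordDeps(const std::string& output, const std::vector<std::string>& inputs) {
        std::vector<uint32_t> ids;
        for (const std::string& in : inputs)
            ids.push_back(InternPath(in));
        deps_[InternPath(output)] = ids;
    }
};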
Executing a Build
Performance-wise the process of executing the commands judged necessary according to the depen-
dencies discussed above is relatively uninteresting because the bulk of the work that needs to be
done is performed in those commands (i.e., in the compilers, linkers, etc.), not in Ninja itself.10
10 One minor benefit of this is that users on systems with few CPU cores have noticed their end-to-end builds are faster due to
Ninja consuming relatively little processing power while driving the build, which frees up a core for use by build commands.
Ninja runs build commands in parallel by default, based on the number of available CPUs on the
system. Since commands running simultaneously could have their outputs interleave, Ninja buffers
all output from a command until that command completes before printing its output. The resulting
output appears as if the commands were run serially.11
This control over command output allows Ninja to carefully control its total output. During the
build Ninja displays a single line of status while running; if the build completes successfully the
total printed output of Ninja is a single line.12 This doesn’t make Ninja run any quicker but it makes
Ninja feel fast, which is almost as important to the original goal as real speed is.
Supporting Windows
I wrote Ninja for Linux. Nico (mentioned previously) did the work to make it function on Mac OS
X. As Ninja became more widely used people started asking about Windows support.
At a superficial level supporting Windows wasn’t too hard. There were some straightforward
changes like making the path separator a backslash or changing the Ninja syntax to allow colons in
a path (like c:\foo.txt). Once those changes were in place the larger problems surfaced. Ninja
was designed with behavioral assumptions from Linux; Windows is different in small but important
ways.
For example, Windows has a relatively low limit on the length of a command, an issue that comes
up when constructing the command passed to a final link step that may mention most files in the
project. The Windows solution for this is “response” files, and only Ninja (not the generator program
in front of Ninja) is equipped to manage these.
A more important performance problem is that file operations on Windows are slow and Ninja
works with a lot of files. Visual Studio’s compiler emits header dependencies by simply printing
them while compiling, so Ninja on Windows currently includes a tool that wraps the compiler to
make it produce the gcc-style Makefile dependency list required by Ninja. This large number of files,
already a bottleneck on Linux, is much worse on Windows where opening a file is much more costly.
The aforementioned new approach to parsing dependencies at build time fits perfectly on Windows,
allowing us to drop the intermediate tool entirely: Ninja is already buffering the command’s output,
so it can parse the dependencies directly from that buffer, sidestepping the intermediate on-disk
Makefile used with gcc.
Getting the modification time of a file—GetFileAttributesEx() on Windows13 and stat()
on non-Windows platforms—seems to be about 100 times slower on Windows than it is on Linux.14
It is possible this is due to “unfair” factors like antivirus software but in practice those factors exist
on end-user systems so Ninja performance suffers. The Git version control system, which similarly
needs to get the state of many files, can use multiple threads on Windows to execute file checks in
parallel. Ninja ought to adopt this feature.
In fact, that design was my original plan for Ninja. It was only after I saw the first build worked
quickly that I realized it might be possible to make Ninja work without needing a server component.
It may yet be necessary as Chrome continues to grow, but the simpler approach, where we gain speed
by doing less work rather than by building more complex machinery, will always be the most appealing to me. It
is my hope some other restructurings (like the changes we made to use a lexer generator or the new
dependency format on Windows) will be enough.
Simplicity is a virtue in software; the question is always how far it can go. Ninja managed to
cut much of the complexity from a build system by delegating certain expensive tasks to other tools
(GYP or CMake), and because of this it is useful in projects other than the one it was made for.
Ninja’s simple code hopefully encouraged contributions: the majority of work for supporting OS X,
Windows, CMake, and other features was done by contributors. Ninja’s simple semantics have led to
experiments by others to reimplement it in other languages (Scheme and Go, to my knowledge).
Do milliseconds really matter? Among the greater concerns of software it might be silly to worry
about. However, having worked on projects with slower builds, I find more than productivity is
gained; a quick turnaround gives a project a feeling of lightness that makes me happy to play with it.
And code that is fun to hack on is the reason I write software in the first place. In that sense speed is
of primary importance.
3.6 Acknowledgements
Special thanks are due to the many contributors to Ninja, some of whom you can find listed on
Ninja’s GitHub project page.
[chapter 4]
Parsing XML at the Speed of Light
Arseny Kapoulkine
4.1 Preface
XML is a standardized markup language that defines a set of rules for encoding hierarchically
structured documents in a human-readable text-based format. XML is in widespread use, with
documents ranging from very short and simple (such as SOAP queries) to multi-gigabyte documents
(OpenStreetMap) with complicated data relationships (COLLADA). In order to process XML
documents, users typically need a special library: an XML parser, which converts the document
from text to internal representation. XML is a compromise between parsing performance, human
readability and parsing-code complexity; a fast XML parser therefore makes XML a more attractive
choice as the underlying format for an application’s data model.
This chapter describes various performance tricks that allowed the author to write a very high-
performing parser in C++: pugixml. While the techniques were used for an XML parser, most of
them can be applied to parsers of other formats or even unrelated software (e.g., memory management
algorithms are widely applicable beyond parsers).
Since there are several substantially different approaches to XML parsing, and the parser has to
do additional processing that even people familiar with XML do not know about, it is important to
outline the entire task at hand first, before diving into implementation details.
In-place parsing
There are several inefficiencies in the typical implementation of a parser. One of them is copying
string data to the heap. This involves allocating many blocks of varying sizes, from bytes to megabytes,
and requires us to copy all strings from the original stream to the heap. Avoiding the copy operation
allows us to eliminate both sources of overhead. Using a technique known as in-place (or in situ)
parsing, the parser can use data from the stream directly. This is the parsing strategy used by pugixml.
A basic in-place parser takes an input string stored in a contiguous memory buffer, scans the
string as a character stream, and creates the necessary tree structure. Upon encountering a string that
is part of the data model, such as a tag name, the parser saves a pointer to the string and its length
(instead of saving the whole string).3
As such, this is a tradeoff between performance and memory usage. In-place parsing is usually
faster compared to parsing with copying strings to the heap, but it can consume more memory. An
in-place parser needs to hold the original stream in memory in addition to its own data describing the
document’s structure. A non in-place parser can store relevant parts of the original stream instead.
Most in-place parsers have to deal with additional issues. In the case of pugixml, there are two:
simplifying string access and transforming XML data during parsing.
Accessing strings that are parsed in-place is difficult because they are not null-terminated. That is,
the character after the string is not a null byte, but is instead the next character in the XML document,
such as an open angle bracket (<). This makes it difficult to use standard C/C++ string functions that
expect null-terminated strings.
To make sure we can use these functions we’ll have to null-terminate the strings during parsing.
Since we can’t easily insert new characters, the character after the last character of each string will
have to be overwritten with a null character. Fortunately, we can always do this in XML: a character
that follows the end of a string is always a markup character and is never relevant to document
representation in memory.
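As a minimal sketch of these two ideas (this is not pugixml’s actual code; the helper name is made up), scanning a tag name in place and terminating it by overwriting the markup character that follows might look like this:

// Given s pointing at the first character of a tag name inside a mutable
// buffer, find its end, overwrite the following markup character with a null
// terminator, and return a pointer just past it. In well-formed XML the
// overwritten character is always markup, never document data.
inline char* scan_name_inplace(char* s, const char** name_out) {
    *name_out = s;
    while (*s && *s != '<' && *s != '>' && *s != '/' && *s != ' ' &&
           *s != '\t' && *s != '\r' && *s != '\n')
        ++s;
    char next = *s;    // remember what we are about to overwrite
    *s = '\0';         // the saved name now works with strlen, strcmp, ...
    return next ? s + 1 : s;
}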
The second issue is more complicated: often the string value and the representation in the XML
file are different. A conforming parser is expected to perform decoding of the representation. Since
doing this during node object access would make object access performance unpredictable, we prefer
to do this at parsing time. Depending on the content type, an XML parser could be required to
perform the following transformations:
• End-of-line handling: The input document can contain various types of line endings and the
parser should normalize them as follows: any two-character sequence of one carriage return
(ASCII 0xD) and one line feed (ASCII 0xA) and any free-standing carriage return should be
replaced with a line feed character. For example, the line
line1\xD\xAline2\xDline3\xA\xA
should be transformed to
line1\xAline2\xAline3\xA\xA.
• Character reference expansion: XML supports escaping characters using their Unicode code
point with either decimal or hexadecimal representation. For example, &#97; should expand
to a and &#xf8; should expand to ø.
• Entity reference expansion: XML supports generic entity references, where &name; is replaced
with the value of the entity name. There are five predefined entities: &lt; (<), &gt; (>), &quot;
("), &apos; (') and &amp; (&).
• Attribute-value normalization: in addition to expanding references, the parser should perform
whitespace normalization when parsing attribute values. All whitespace characters (space, tab,
carriage return and line feed) should be replaced with a space. Additionally, depending on
the type of the attribute, leading and trailing whitespace should be removed and whitespace
sequences in the middle of the string should be collapsed into a single space.
It is possible to support an arbitrary transformation in an in-place parser by modifying the string
contents, given an important constraint: a transformation must never increase the length of the string.
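As a concrete illustration of that constraint, attribute whitespace normalization (the last transformation in the list above) only ever shrinks the string, so it can be done in place. The sketch below uses the simple read/write-pointer style rather than pugixml’s gap approach described later, and ignores reference expansion:

// In-place whitespace normalization for an attribute value: runs of space,
// tab, CR and LF collapse to a single space, and leading and trailing
// whitespace is dropped, so the result is never longer than the input.
inline bool is_ws(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

inline void normalize_attribute_ws(char* s) {
    char* read = s;
    char* write = s;
    while (is_ws(*read))
        ++read;                      // drop leading whitespace
    bool pending_space = false;
    for (; *read; ++read) {
        if (is_ws(*read)) {
            pending_space = true;    // collapse the run later, if needed
        } else {
            if (pending_space)
                *write++ = ' ';
            pending_space = false;
            *write++ = *read;
        }
    }
    *write = '\0';                   // trailing whitespace ends up truncated
}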
Interestingly, in-place parsing can be used with memory-mapped file I/O.6 Supporting null-
termination and text transformation requires a special memory mapping mode known as copy-on-
write to avoid modifying the file on disk.
Using memory mapped file I/O with in-place parsing has the following benefits:
• The kernel can usually map cache pages directly into the process address space, thus eliminating
a memory copy that would have happened with standard file I/O.
• If the file is not already in the cache, the kernel can prefetch sections of the file from disk,
effectively making I/O and parsing parallel.
• Since only modified pages need to allocate physical memory, memory consumption can be
greatly decreased on documents with large text sections.
three nodes: prefix string before entity reference, a node of a special type that contains the reference id, and a suffix string,
which may have to be split further. This approach is used by the Microsoft XML parser (for different reasons).
6 See http://en.wikipedia.org/wiki/Memory-mapped_file.
(depending on whether the character is in the target set) and the last 128 entries of the table all
share the same value. Because of the way UTF-8 encodes data, all code points above 127 will
be represented as sequences of bytes with values above 127. Furthermore, the first character
of the sequence will also be above 127.
• For UTF-16 or UTF-32, tables of large sizes are usually impractical. Given the same constraint
as the one for optimized UTF-8, we can leave the table to be 128 or 256 entries large, and add
an additional comparison to deal with values outside the range.
Note that we only need one bit to store the true/false value, so it might make sense to use bitmasks
to store eight different character sets in one 256 byte table. Pugixml uses this approach to save cache
space: on the x86 architecture, checking a boolean value usually has the same cost as checking a bit
within a byte, provided that bit position is a compile-time constant. This C code demonstrates this
approach:
enum chartype_t {
    ct_parse_pcdata = 1,  // \0, &, \r, <
    ct_parse_attr = 2,    // \0, &, \r, ', "
    ct_parse_attr_ws = 4, // \0, &, \r, ', ", \n, tab
    // ...
};

static const unsigned char table[256] = {
    55, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12, 0, 0, 63, 0, 0, // 0-15
    // ...
};

bool ischartype_utf8(char c, chartype_t ct) {
    // note: unsigned cast is important to
    // guarantee that the value is in 0..255 range
    return ct & table[(unsigned char)c];
}
If the tested range includes all characters in a certain interval, it might make sense to use a
comparison instead of a table lookup. For example, a straightforward test for a character being a
digit uses two comparisons:
bool isdigit(char ch) { return (ch >= '0' && ch <= '9'); }
With careful use of unsigned arithmetic the same check can be done with a single comparison, as
sketched below.
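A sketch of that one-comparison variant (a common idiom; the helper name is illustrative, not from pugixml):

// One comparison instead of two: for characters below '0' the subtraction
// wraps around to a large unsigned value, so a single range check covers
// both bounds of the interval.
inline bool isdigit_fast(char ch) {
    return static_cast<unsigned>(ch - '0') < 10;
}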
For example, the PCDATA content
A &lt; B.
should be transformed to
A < B.
The PCDATA parsing function takes a pointer to the start of PCDATA value, and proceeds by
reading the rest of the value, converting the value data in-place and null-terminating the result.
Since there are two boolean flags, we have four variations of this function. In order to avoid
expensive run-time checks, we’re using boolean template arguments for these flags—thus we’re
compiling four variations of a function from a single template, and then using runtime dispatch to
obtain the correct function pointer once before the parsing begins. The parser calls the function
using this function pointer.
This allows the compiler to remove condition checks for flags and remove dead code for each
specialization of the function. Importantly, inside the function’s parsing loop we use a fast character
set test to skip all characters that are part of the usual PCDATA content, and only process the
characters we’re interested in. Here’s what the code looks like:
For example, you might be dealing with documents where it is important to preserve the exact type of newline sequences, or
where entity references should be left unexpanded by the XML parser in order to be processed afterwards.
                return s + 1;
            } else if (opt_eol && *s == '\r') { // 0x0d or 0x0d 0x0a pair
                *s++ = '\n'; // replace first one with 0x0a
                if (*s == '\n') g.push(s, 1);
            } else if (opt_escape && *s == '&') {
                s = strconv_escape(s, g);
            } else if (*s == 0) {
                return s;
            } else {
                ++s;
            }
        }
    }
};
An additional function gets a pointer to a suitable implementation based on runtime flags; e.g.,
&strconv_pcdata_impl<false, true>::parse.
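The general shape of this technique, sketched with made-up names (the real strconv_pcdata_impl contains the full conversion loop, part of which is shown above):

// Boolean template flags plus one-time runtime dispatch. Each of the four
// specializations keeps only the branches it actually needs, because opt_eol
// and opt_escape are compile-time constants inside parse().
template <bool opt_eol, bool opt_escape>
struct pcdata_impl {
    static char* parse(char* s) {
        for (; *s; ++s) {
            if (opt_eol && *s == '\r') { /* normalize the newline */ }
            if (opt_escape && *s == '&') { /* expand the reference */ }
        }
        return s;
    }
};

typedef char* (*pcdata_fn)(char*);

// Called once before parsing begins; the parser then calls through the pointer.
inline pcdata_fn get_pcdata_parser(bool eol, bool escape) {
    if (eol)
        return escape ? &pcdata_impl<true, true>::parse : &pcdata_impl<true, false>::parse;
    return escape ? &pcdata_impl<false, true>::parse : &pcdata_impl<false, false>::parse;
}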
One unusual item in this code is the gap class instance. As shown before, if we do string
transformations, the resulting string becomes shorter because some of the characters have to be
removed. There are several ways of doing this.
One strategy (that pugixml doesn’t use) is to keep separate read and write pointers that both
point to the same buffer. In this case the read pointer tracks the current read position, and the write
pointer tracks the current write position. At all times the invariant write <= read should hold. Any
character that has to be a part of the resulting string must be explicitly written to the write pointer.
This technique avoids the quadratic cost of naive character removal, but is still inefficient, since we
now read and write all characters in the string every time, even if we don’t need to modify the string.
An obvious extension of this idea is to skip the prefix of the original string that does not need to
be modified and only start writing characters after that prefix—indeed, that’s how algorithms like
the one behind std::remove_if() commonly operate.
Pugixml follows a different approach (see Figure 4.4). At any time there is at most one gap in
the string. The gap is a sequence of characters that are no longer valid because they are no longer
part of the final string. When a new gap has to be added because another substitution was made (e.g.,
replacing " with " generates a gap of 5 characters), the existing gap (if one exists) is collapsed
by moving the data between two gaps to the beginning of the first gap and then remembering the
new gap. In terms of complexity, this approach is equivalent to the approach with read and write
pointers; however it allows us to use faster routines to collapse gaps. (Pugixml uses memmove which
can copy more efficiently compared to a character-wise loop, depending on the gap length and on C
runtime implementation.)
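A minimal sketch of such a gap helper, modeled on the description above (the names and details are illustrative rather than pugixml’s exact code):

#include <cstddef>
#include <cstring>

// At most one gap of dead characters exists at a time; recording a new gap
// first collapses the old one with memmove, then merges the two.
struct gap {
    char* end;    // first character after the current gap (null if no gap)
    size_t size;  // length of the current gap

    gap() : end(0), size(0) {}

    // Characters [s, s + count) just became dead: slide the live text between
    // the old gap and s to the left, then remember the merged gap.
    void push(char*& s, size_t count) {
        if (end)
            std::memmove(end - size, end, static_cast<size_t>(s - end));
        size += count;
        end = s + count;
        s += count;
    }

    // Collapse the remaining gap; returns where the null terminator goes.
    char* flush(char* s) {
        if (!end)
            return s;
        std::memmove(end - size, end, static_cast<size_t>(s - end));
        return s - size;
    }
};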
to the code that parses the relevant tag. For example, if the first character is <, we have to read at
least one more character to differentiate between a start tag, end tag, comment, or other types of
tags. Pugixml also uses goto statements to avoid going through the dispatch loop in certain cases;
for example, text content parsing stops at the end of the stream or at a < character. In the latter case
we don’t have to go through the dispatch loop only to read the character again and check that it’s <;
we can jump straight to the code that does the tag parsing.
Two important optimizations for such code are branch ordering and code locality.
In the parser, various parts of the code handle various forms of inputs. Some of them (such as
tag name or attribute parsing) execute frequently, while others (such as DOCTYPE parsing) rarely
execute at all. Even within a small section of code, different inputs have different probabilities. For
example, after the parser encounters an open angle bracket (<), the most likely character to appear
next is a character of a tag name. Next most likely is /10, followed by ! and ?.
With this in mind, it is possible to rearrange the code to yield faster execution. First, all “cold”
code, that is, code that rarely or never executes (in the case of pugixml this includes handling of all
XML content except element tags with attributes and text content), has to be moved out of the parser
loop into separate functions. Depending on the function’s contents and
the compiler, adding attributes such as noinline, or specifically marking extra functions as “cold”
might help. The idea is to limit the amount of code inlined into the main parser function to the hot
code. This helps the compiler optimize the function by keeping the control flow graphs small, and
keeps all hot code as close together as possible to minimize instruction cache misses.
After this, in both hot and cold code it makes sense to order any conditional chains you have by
condition probability. For example, code like this is not efficient for typical XML content:
if (data[0] == '<')
{
    if (data[1] == '!') { ... }
    else if (data[1] == '/') { ... }
    else if (data[1] == '?') { ... }
    else { /* start-tag or unrecognized tag */ }
}
A version with the branches reordered by expected frequency is better:
if (data[0] == '<')
{
    if (PUGI__IS_CHARTYPE(data[1], ct_start_symbol)) { /* start-tag */ }
    else if (data[1] == '/') { ... }
    else if (data[1] == '!') { ... }
    else if (data[1] == '?') { ... }
    else { /* unrecognized tag */ }
}
In this version the branches are sorted by probability from most frequent to least frequent. This
minimizes the average number of condition tests and conditional jumps performed.
10 The reason / is less probable than a tag name character is that for every end tag there is a start tag, but there are also
empty-element tags such as <node/>.
skips a run of alphabetical characters and stops at the null terminator or the next non-alphabetic
character without requiring extra checks. Storing the buffer end position everywhere also reduces
the overall speed because it usually requires an extra register. Function calls also get more expensive
since you need to pass two pointers (current position and end position) instead of one.
However, requiring null-terminated input is less convenient for library users: often XML data
gets read into a buffer that might not have space for an extra null terminator. From the client’s point
of view a memory buffer should be a pointer and a size with no null terminator.
Since the internal memory buffer has to be mutable for in-place parsing to work, pugixml solves
this problem in a simple way. Before parsing, it replaces the last character in the buffer with a null
terminator and keeps track of the value of the old character. That way, the only places it has to
account for the value of the last character are places where it is valid for the document to end. For
XML, there are not many11 , so the approach results in a net win.12
This summarizes the most interesting tricks and design decisions that help keep pugixml parser
fast for a wide range of documents. However, there is one last performance-sensitive component of
the parser that is worth discussing.
11 For example, if a tag name scan stopped at the null terminator, then the document is invalid, because there are no valid
XML documents where the next-to-last character is part of a tag name.
12 Of course, the parsing code becomes more complicated, since some comparisons need to account for the value of the last
character, and all others need to skip it for performance reasons. A unit test suite with good coverage and fuzz testing helps
keep the parser correct for all document inputs.
struct Node {
Node* first_child;
Node* last_child;
Node* prev_sibling;
Node* next_sibling;
};
Here, the last_child pointer is necessary to support backwards iteration and appending in O(1)
time.
Note that with this design it is easy to support different node types to reduce memory consumption;
for example, an element node needs an attribute list but a text node does not. The array approach
forces us to keep the size of all node types the same, which prevents such optimization from being
effective.
Pugixml uses a linked list-based approach. That way, node modification is always O(1). Furthermore,
the array approach would force us to allocate blocks of varying sizes, ranging from tens of bytes to
megabytes in the case of a single node with a lot of children, whereas in the linked list approach there
are only a few different allocation sizes needed for the node structures. Designing a fast allocator for
fixed-size allocations is usually easier than designing a fast allocator for arbitrary allocations, which
is another reason pugixml chooses this strategy.
For comparison, an array-based node would store its children in a dynamic array:
struct Node { Node* children; size_t children_size; size_t children_capacity; };
13 Tree modification is important: while there are ways to represent immutable trees much more efficiently compared to what
pugixml is doing, tree mutation is a much needed feature both for constructing documents from scratch and for modifying
existing documents.
14 More complex logic can be used as well.
15 The capacity field is required to implement an amortized constant-time addition. See
http://en.wikipedia.org/wiki/Dynamic_array for more information.
To keep memory consumption closer to the array-based approach, pugixml omits the last_child
pointer, but keeps access to the last child available in O(1) time by making the sibling list partially
cyclic with prev_sibling_cyclic:
struct Node {
Node* first_child;
Node* prev_sibling_cyclic;
Node* next_sibling;
};
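One way to wire such a list (helper names are illustrative): the first child’s prev_sibling_cyclic points at the last child, while the last child’s next_sibling stays null, so both the last child and the real previous sibling remain reachable in O(1):

// Uses the Node definition shown just above.
inline Node* last_child(const Node* n) {
    return n->first_child ? n->first_child->prev_sibling_cyclic : 0;
}

inline Node* prev_sibling(const Node* n) {
    // n is the first child exactly when the cyclic link leads to the last
    // child, whose next_sibling is null.
    return n->prev_sibling_cyclic->next_sibling ? n->prev_sibling_cyclic : 0;
}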
The array-based approach and the linked list approach with the partially-cyclic-sibling-list trick
become equivalent in terms of memory consumption. Using 32-bit types for size/capacity makes the
array-based node smaller on 64-bit systems.16 In the case of pugixml, other benefits of linked lists
outweigh the costs.
With the data structures in place, it is time to talk about the last piece of the puzzle—the memory
allocation algorithm.
It turns out that the simplest allocation scheme possible is the stack allocator. This allocator
works as follows: given a memory buffer and an offset inside that buffer, an allocation only requires
increasing that offset by the allocation size. Of course, it is impossible to predict the memory buffer
size in advance, so an allocator has to be able to allocate new buffers on demand.
This code illustrates the general idea:
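A minimal sketch of such a stack (bump) allocator, assuming pointer-aligned request sizes as in the footnote below; the names and the 32 KB page size are illustrative, not pugixml’s:

#include <cstddef>
#include <cstdlib>

// Minimal bump-allocator sketch. Real implementations also keep per-page
// bookkeeping so memory can eventually be returned; that is omitted here.
struct allocator_page {
    allocator_page* next;
    size_t offset;       // current bump position within data
    char data[1];        // page memory follows the header
};

struct stack_allocator {
    static const size_t page_capacity = 32 * 1024;   // illustrative page size
    allocator_page* current;

    stack_allocator() : current(0) {}

    void* allocate(size_t size) {
        if (!current || current->offset + size > page_capacity) {
            // Start a new page; oversized requests simply get a bigger page.
            size_t capacity = size > page_capacity ? size : page_capacity;
            allocator_page* page = static_cast<allocator_page*>(
                std::malloc(sizeof(allocator_page) + capacity));
            if (!page)
                return 0;
            page->next = current;
            page->offset = 0;
            current = page;
        }
        void* result = current->data + current->offset;
        current->offset += size;    // the allocation itself is just this addition
        return result;
    }
};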
Supporting allocations that are larger than page size is easy. We just allocate a larger memory
block, but treat it in the same way as we would treat a small page.17
This allocator is very fast. It’s probably the fastest allocator possible, given the constraints.
Benchmarks show it to be faster than a free list allocator, which has to do more work to determine
the correct list based on page size and has to link all blocks in a page together. Our allocator also
exhibits almost perfect memory locality. The only case where successive allocations are not adjacent
is when it allocates a new page.
17 For performance reasons this implementation does not adjust the offset to be aligned. Instead it expects that all stored
types need pointer type alignment, and that all allocation requests specify a size aligned to the pointer size.
pages using 64 KB allocations with 64 KB alignment), but it is not possible to use large allocation alignments in a portable
way without huge memory overhead.
an XML node fits: three bits are used for a node type, and two bits are used to specify whether the
node’s name and value reside in the in-place buffer.
The second approach is to store the offset of the allocated element relative to the beginning of
the page, which allows us to get the address of the page pointer in the following way:
(allocator_page*)((char*)(object) -
object->offset - offsetof(allocator_page, data))
If our page size is limited by 2^16 = 65536 bytes, this offset fits in 16 bits, so we can spend 2 bytes
instead of 4 storing it. Pugixml uses this approach for heap-allocated strings.
An interesting feature of the resulting algorithm is that it respects the locality of reference
exhibited by the code that uses the allocator. Locality of allocation requests eventually leads to
locality of allocated data in memory. Locality of deallocation requests leads to memory actually being
released. This means, in the case of tree storage, that deletion of a large subtree usually
releases most of the memory that is used by the subtree.
Of course, for certain usage patterns nothing is ever deleted until the entire document is destroyed.
For example, if a page size is 32000 bytes, we can do one million 32-byte allocations, thus allocating
1000 pages. If we keep every 1000th object alive and delete the remaining objects, each page will
have exactly one object left, which means that, although the cumulative size of live objects is now
1000 · 32 = 32000 bytes, we still keep all pages in memory (consuming 32 million bytes). This
results in an extremely high memory overhead. However, such usage is extremely unlikely, and the
benefits of the algorithm outweigh this problem for pugixml.
4.8 Conclusion
Optimizing software is hard. In order to be successful, optimization efforts almost always involve a
combination of low-level micro-optimizations, high-level performance-oriented design decisions,
careful algorithm selection and tuning, balancing among memory, performance, implementation
complexity, and more. Pugixml is an example of a library that needs all of these approaches to
deliver a very fast production-ready XML parser—even though compromises had to be made to
achieve this. A lot of the implementation details can be adapted to different projects and tasks, be it
another parsing library or something else entirely. The author hopes that the presented tricks were
entertaining and that some of them will be useful for other projects.
MemShrink
Kyle Huey
5.1 Introduction
Firefox has long had a reputation for using too much memory. The accuracy of that reputation has
varied over the years but it has stuck with the browser. Every Firefox release in the past several years
has been met by skeptical users with the question “Did they fix the memory leak yet?” We shipped
Firefox 4 in March 2011 after a lengthy beta cycle and several missed ship dates—and it was met
by the same questions. While Firefox 4 was a significant step forward for the web in areas such as
open video, JavaScript performance, and accelerated graphics, it was unfortunately a significant step
backwards in memory usage.
The web browser space has become very competitive in recent years. With the rise of mobile
devices, the release of Google Chrome, and Microsoft reinvesting in the web, Firefox has found itself
having to contend with a number of excellent and well-funded competitors instead of just a moribund
Internet Explorer. Google Chrome in particular has gone to great lengths to provide a fast and slim
browsing experience. We began to learn the hard way that being a good browser was no longer good
enough; we needed to be an excellent browser. As Mike Shaver, at the time VP of Engineering at
Mozilla and a longtime Mozilla contributor, said, “this is the world we wanted, and this is the world
we made.”
That is where we found ourselves in early 2011. Firefox’s market share was flat or declining
while Google Chrome was enjoying a fast rise to prominence. Although we had begun to close the
gap on performance, we were still at a significant competitive disadvantage on memory consumption
as Firefox 4 invested in faster JavaScript and accelerated graphics often at the cost of increased
memory consumption. After Firefox 4 shipped, a group of engineers led by Nicholas Nethercote
started the MemShrink project to get memory consumption under control. Today, nearly a year and a
half later, that concerted effort has radically altered Firefox’s memory consumption and reputation.
The “memory leak” is a thing of the past in most users’ minds, and Firefox often comes in as one
of the slimmest browsers in comparisons. In this chapter we will explore the efforts we made to
improve Firefox’s memory usage and the lessons we learned along the way.
The GC heap in Spidermonkey contains objects, functions, and most of the other things created
by running JS. We also store implementation details whose lifetimes are linked to these objects in
the GC heap. This heap uses a fairly standard incremental mark-and-sweep collector that has been
heavily optimized for performance and responsiveness. This means that every now and then the
garbage collector wakes up and looks at all the memory in the GC heap. Starting from a set of “roots”
(such as the global object of the page you are viewing) it “marks” all the objects in the heap that are
reachable. It then “sweeps” all the objects that are not marked and reuses that memory when needed.
In Gecko most memory is reference counted. With reference counting the number of references
to a given piece of memory is tracked. When that number reaches zero the memory is freed. While
reference counting is technically a form of garbage collection, for this discussion we distinguish it from
garbage collection schemes that require specialized code (i.e., a garbage collector) to periodically
reclaim memory. Simple reference counting is unable to deal with cycles, where one piece of memory
A references another piece of memory B, and vice versa. In this situation both A and B have reference
counts of 1, and are never freed. Gecko has a specialized tracing garbage collector specifically to
collect these cycles which we call the cycle collector. The cycle collector manages only certain
classes that are known to participate in cycles and opt in to cycle collection, so we can think of the
cycle collected heap as a subset of the reference counted heap. The cycle collector also works with
the garbage collector in Spidermonkey to handle cross-language memory management so that C++
code can hold references to JS objects and vice versa.
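To make the cycle problem concrete, here is a small standalone C++ illustration using the standard library’s shared_ptr reference counting (Gecko’s own refcounting machinery is different, but the failure mode is the same):

#include <memory>

// Two objects that hold owning references to each other: once the local
// pointers go away, each object still keeps the other's count at one, so
// neither destructor ever runs. A tracing collector (like the cycle collector
// described above) is needed to reclaim such cycles.
struct B;
struct A { std::shared_ptr<B> other; };
struct B { std::shared_ptr<A> other; };

void make_cycle() {
    std::shared_ptr<A> a = std::make_shared<A>();
    std::shared_ptr<B> b = std::make_shared<B>();
    a->other = b;
    b->other = a;
}   // a and b go out of scope here, but the A and B objects are never freed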
There is also plenty of manually managed memory in both Spidermonkey and Gecko. This
encompasses everything from the internal memory of arrays and hashtables to buffers of image and
script source data. There are also other specialized allocators layered on top of manually managed
memory. One example is an arena allocator. Arenas are used when a large number of separate
allocations can all be freed simultaneously. An arena allocator obtains chunks of memory from
the main heap allocator and subdivides them as requested. When the arena is no longer needed the
arena returns those chunks to the main heap without having to individually free the many smaller
allocations. Gecko uses an arena allocator for page layout data, which can be thrown away all at once
when a page is no longer needed. Arena allocation also allows us to implement security features such
as poisoning, where we overwrite the deallocated memory so it cannot be used in a security exploit.
There are several other custom memory management systems in small parts of Firefox, used for
a variety of different reasons, but they are not relevant to our discussion. Now that you have a brief
overview of Firefox’s memory architecture, we can discuss the problems we found and how to fix
them.
any situation that results in Firefox being less memory-efficient than it could reasonably be. This
is consistent with the way our users employ the term as well: most users and even web developers
cannot tell if high memory usage is due to a true leak or any number of other factors at work in the
browser.
When MemShrink began we did not have much insight into the browser’s memory usage. Identi-
fying the nature of memory problems often required using complex tools like Massif or lower-level
tools like GDB. These tools have several disadvantages:
• They are designed for developers and are not easy to use.
• They are not aware of Firefox internals (such as the implementation details of the various
heaps).
• They are not “always on”—you have to be using them when the problem happens.
In exchange for these disadvantages you get some very powerful tools. To address these disadvan-
tages over time we built a suite of custom tools to gain more insight with less work into the behavior
of the browser.
The first of these tools is about:memory. First introduced in Firefox 3.6, it originally displayed
simple statistics about the heap, such as the amount of memory mapped and committed. Later
measurements for some things of interest to particular developers were added, such as the memory
used by the embedded SQLite database engine and the amount of memory used by the accelerated
graphics subsystem. We call these measurements memory reporters. Other than these one-off
additions, about:memory remained a primitive tool presenting a few summary statistics on memory
usage. Most memory did not have a memory reporter and was not specifically accounted for in
about:memory. Even so, about:memory can be used by anyone without a special tool or build of
Firefox just by typing it into the browser’s address bar. This would become the “killer feature”.
Well before MemShrink was a gleam in anyone’s eye the JavaScript engine in Firefox was
refactored to split the monolithic global GC heap into a collection of smaller subheaps called com-
partments. These compartments separate things like chrome and content (privileged and unprivileged
code, respectively) memory, as well as the memory of different web sites. The primary motivation
for this change was security, but it turned out to be very useful for MemShrink. Shortly after this
was implemented we prototyped a tool called about:compartments that displayed all of the com-
partments, how much memory they use, and how they use that memory. about:compartments was
never integrated directly into Firefox, but after MemShrink started it was modified and combined
into about:memory.
While adding this compartment reporting to about:memory, we realized that incorporating
similar reporting for other allocations would enable useful heap profiling without specialized tools
like Massif. about:memory was changed so that instead of producing a series of summary statistics
it displayed a tree breaking down memory usage into a large number of different uses. We then
started to add reporters for other types of large heap allocations such as the layout subsystem. One of
our earliest metric-driven efforts was driving down the amount of heap-unclassified, memory that
was not covered by a memory reporter. We picked a pretty arbitrary number, 10% of the total heap,
and set out to get heap-unclassified down to that amount in average usage scenarios. Ultimately it
would turn out that 10% was too low a number to reach. There are simply too many small one-off
allocations in the browser to get heap-unclassified reliably below approximately 15%. Reducing the
amount of heap-unclassified increases the insight into how memory is being used by the browser.
To reduce the amount of heap-unclassified we wrote a tool, christened the Dark Matter Detector
(DMD), that helped track down the unreported heap allocations. It works by replacing the heap
allocator and inserting itself into the about:memory reporting process and matching reported memory
blocks to allocated blocks. It then summarizes the unreported memory allocations by call site.
Running DMD on a Firefox session produces lists of call sites responsible for heap-unclassified.
Once the source of the allocations was identified, finding the responsible component and a developer
to add a memory reporter for it proceeded quickly. Within a few months we had a tool that could tell
you things like “all the Facebook pages in your browser are using 250 MB of memory, and here is
the breakdown of how that memory is being used.”
We also developed another tool (called Measure and Save) for debugging memory problems once
they were identified. This tool dumps representations of both the JS heap and the cycle-collected
C++ heap to a file. We then wrote a series of analysis scripts that can traverse the combined heap
and answer questions like “what is keeping this object alive?” This enabled a lot of useful debugging
techniques, from just examining the heap graph for links that should have been broken to dropping
into a debugger and setting breakpoints on specific objects of interest.
A major benefit of these tools is that, unlike with a tool such as Massif, you can wait until the
problem appears before using the tool. Many heap profilers (including Massif) must be started when
the program starts, not partway through after a problem appears. Another benefit that these tools
have is that the information can be analyzed and used without having the problem reproduced in
front of you. Together they allow users to capture information for the problem they are seeing and
send it to developers when those developers cannot reproduce the problem. Expecting users of a
web browser, even those sophisticated enough to file bugs in a bug tracker, to use GDB or Massif on
the browser is usually asking too much. But loading about:memory or running a small snippet of
JavaScript to get data to attach to a bug report is a much less arduous task. Generic heap profilers
capture a lot of information, but come with a lot of costs. We were able to write a set of tools tailored
to our specific needs that offered us significant benefits over the generic tools.
It is not always worth investing in custom tooling; there is a reason we use GDB instead of writing
a new debugger for each piece of software we build. But for those situations where the existing tools
cannot deliver you the information you need in the way you want it, we found that custom tooling
can be a big win. It took us about a year of part-time work on about:memory to get to a point where
we considered it complete. Even today we are still adding new features and reporters when necessary.
Custom tools are a significant investment. An extensive digression on the subject is beyond the
scope of this chapter, but you should consider carefully the benefits and costs of custom tools before
writing them.
source of zombie compartments was add-ons. Dealing with leaks in add-ons stymied us for several
months before we found a solution that is discussed later in this chapter. Most of these zombie
compartments, both in Firefox and in add-ons, were caused by long-lived JS objects maintaining
references to short-lived JS objects. The long-lived JS objects are typically objects attached to the
browser window, or even global singletons, while the short-lived JS objects might be objects from
web pages.
Because of the way the DOM and JS work, a reference to a single object from a web page will
keep the entire page and its global object (and anything reachable from that) alive. This can easily
add up to many megabytes of memory. One of the subtler aspects of a garbage collected system
is that the GC only reclaims memory when it is unreachable, not when the program is done using
it. It is up to the programmer to ensure that memory that will not be used again is unreachable.
Failing to remove all references to an object has even more severe consequences when the lifetime of
the referrer and the referent are expected to differ significantly. Memory that should be reclaimed
relatively quickly (such as the memory used for a web page) is instead tied to the lifetime of the
longer lived referrer (such as the browser window or the application itself).
Fragmentation in the JS heap was also a problem for us for a similar reason. We often saw that
closing a lot of web pages did not cause Firefox’s memory usage, as reported by the operating system,
to decline significantly. The JS engine allocates memory from the operating system in megabyte-sized
chunks and subdivides that chunk amongst different compartments as needed. These chunks can only
be released back to the operating system when they are completely unused. We found that allocation
of new chunks was almost always caused by web content demanding more memory, but that the
last thing keeping a chunk from being released was often a chrome compartment. Mixing a few
long-lived objects into a chunk full of short-lived objects prevented us from reclaiming that chunk
when web pages were closed. We solved this by segregating chrome and content compartments
so that any given chunk has either chrome or content allocations. This significantly increased the
amount of memory we could return to the operating system when tabs are closed.
We discovered another problem caused in part by a technique to reduce fragmentation. Firefox’s
primary heap allocator is a version of jemalloc modified to work on Windows and Mac OS X.
Jemalloc is designed to reduce memory loss due to fragmentation. One of the techniques it uses to
do this is rounding allocations up to various size classes, and then allocating those size classes in
contiguous chunks of memory. This ensures that when space is freed it can later be reused for a
similar size allocation. It also entails wasting some space for the rounding. We call this wasted space
slop. The worst case for certain size classes can involve wasting almost 50% of the space allocated.
Because of the way jemalloc size classes are structured, this usually happens just after passing a
power of two (e.g., 17 rounds up to 32 and 1025 rounds up to 2048).
Often when allocating memory you do not have much choice in the amount you ask for. Adding
extra bytes to an allocation for a new instance of a class is rarely useful. Other times you have some
flexibility. If you are allocating space for a string you can use extra space to avoid having to reallocate
the buffer if later the string is appended to. When this flexibility presents itself, it makes sense to ask
for an amount that exactly matches a size class. That way memory that would have been “wasted” as
slop is available for use at no extra cost. Usually code is written to ask for powers of two because
those fit nicely into pretty much every allocator ever written and do not require special knowledge of
the allocator.
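A sketch of that idea, for the cases where the caller does have flexibility (an illustrative helper, not Gecko code): round the flexible request up to a power of two so the allocator’s size-class rounding does not turn the surplus into slop.

#include <cstddef>

// Round a flexible allocation request (e.g., a growing string buffer) up to
// a power of two; round_up_pow2(1025) == 2048.
inline size_t round_up_pow2(size_t n) {
    size_t result = 16;    // some sensible minimum allocation
    while (result < n)
        result *= 2;
    return result;
}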
We found lots of code in Gecko that was written to take advantage of this technique, and several
places that tried to and got it wrong. Multiple pieces of code attempted to allocate a nice round
chunk of memory, but got the math slightly wrong, and ended up allocating just beyond what they
intended. Because of the way jemalloc’s size classes are constructed, this often led to wasting nearly
50% of the allocated space as slop. One particularly egregious example was in an arena allocator
implementation used for layout data structures. The arena attempted to get 4 KB chunks from the
heap. It also tacked on a few words for bookkeeping purposes which resulted in it asking for slightly
over 4 KB, which got rounded to 8 KB. Fixing that mistake saved over 3 MB of slop on GMail alone.
On a particularly layout-heavy test case it saved over 700 MB of slop, reducing the browser’s total
memory consumption from 2 GB to 1.3 GB.
We encountered a similar problem with SQLite. Gecko uses SQLite as the database engine for
features such as history and bookmarks. SQLite is written to give the embedding application a lot
of control over memory allocation, and is very meticulous about measuring its own memory usage.
To keep those measurements it adds a couple words which pushes the allocation over into the next
size class. Ironically the instrumentation needed to keep track of memory consumption ends up
doubling consumption while causing significant underreporting. We refer to these sorts of bugs as
“clownshoes” because they are both comically bad and result in lots of wasted empty space, just like
a clown’s shoes.
5.5 Not Your Fault Does Not Mean Not Your Problem
Over the course of several months we made great strides in improving memory consumption and
fixing leaks in Firefox. Not all of our users were seeing the benefits of that work though. It became
clear that a significant number of the memory problems our users were seeing were originating
in add-ons. Our tracking bug for leaky add-ons eventually counted over 100 confirmed reports of
add-ons that caused leaks.
Historically Mozilla has tried to have it both ways with add-ons. We have marketed Firefox as an
extensible browser with a rich selection of add-ons. But when users report performance problems
with those add-ons we simply tell users not to use them. The sheer number of add-ons that caused
memory leaks made this situation untenable. Many Firefox add-ons are distributed through Mozilla’s
addons.mozilla.org (AMO). AMO has review policies intended to catch common problems in
add-ons. We began to get an idea of the scope of the problem when AMO reviewers started testing
add-ons for memory leaks with tools like about:memory. A number of tested add-ons proved to
have problems such as zombie compartments. We began reaching out to add-on authors, and we
put together a list of best practices and common mistakes that caused leaks. Unfortunately this had
rather limited success. While some add-ons did get fixed by their authors, most did not.
There were a number of reasons why this proved ineffective. Not all add-ons are regularly updated.
Add-on authors are volunteers with their own schedules and priorities. Debugging memory leaks
can be hard, especially if you cannot reproduce the problem in the first place. The heap dumping
tool we described earlier is very powerful and makes gathering information easy but analyzing the
output is still complicated and too much to expect add-on authors to do. Finally, there were no strong
incentives to fix leaks. Nobody wants to ship bad software, but you can’t always fix everything.
People may also be more interested in doing what they want to do than what we want them to do.
For a long time we talked about creating incentives for fixing leaks. Add-ons have caused other
performance problems for Mozilla too, so we have discussed making add-on performance data visible
in AMO or in Firefox itself. The theory was that being able to inform users of the performance effects
the add-ons they have installed or are about to install would help them make informed decisions
about the add-ons they use. The first problem with this is that users of consumer-facing software
like web browsers are usually not capable of making informed decisions about those tradeoffs. How
many of Firefox’s 400 million users understand what a memory leak is and can evaluate whether it is
worth suffering through it to be able to use some random add-on? Second, dealing with performance
impacts of add-ons this way required buy-in from a lot of different parts of the Mozilla community.
The people who make up the add-on community, for example, were not thrilled about the idea of
smacking add-ons with a banhammer. Finally, a large percentage of Firefox add-ons are not installed
through AMO at all, but are bundled with other software. We have very little leverage over those
add-ons short of trying to block them. For these reasons we abandoned our attempts to create those
incentives.
The other reason we abandoned creating incentives for add-ons to fix leaks is that we found
a completely different way to solve the problem. We ultimately managed to find a way to “clean
up” after leaky add-ons in Firefox. For a long time we did not think that this was feasible without
breaking lots of add-ons, but we kept experimenting with it anyways. Eventually we were able
to implement a technique that reclaimed memory without adversely affecting most add-ons. We
leveraged the boundaries between compartments to “cut” references from chrome compartments to
content compartments when the page is navigated or the tab is closed. This leaves an object floating
around in the chrome compartment that no longer references anything. We originally thought that
this would be a problem when code tried to use these objects, but we found that most times these
objects are not used later. In effect add-ons were accidentally and pointlessly caching things from
webpages, and cleaning up after them automatically had little downside. We had been looking for a
social solution to a technical problem.
problems we have encountered in the past recur. For example, a significant difference between the
+30 second measurement and the measurement after forcing garbage collection may indicate that our
garbage collection heuristics are too conservative. A significant difference between the measurement
taken before loading anything and the measurement taken after closing all tabs may indicate that
we are leaking memory. We measure a number of quantities at all of these points including the
resident set size, the “explicit” size (the amount of memory that has been asked for via malloc(),
mmap(), etc.), and the amount of memory that falls into certain categories in about:memory such as
heap-unclassified.
Once we put this system together we set it up to run regularly on the latest development versions
of Firefox. We also ran it on previous versions of Firefox back to roughly Firefox 4. The result
is pseudo-continuous integration with a rich set of historical data. With some nice webdev work
we ended up with areweslimyet.com, a public web based interface to all of the data gathered by
our memory testing infrastructure. Since it was finished areweslimyet.com has detected several
regressions caused by work on different parts of the browser.
5.7 Community
A final contributing factor to the success of the MemShrink effort has been the support of the broader
Mozilla community. While most (but certainly not all) of the engineers working on Firefox are
employed by Mozilla these days, Mozilla’s vibrant volunteer community contributes support in the
forms of testing, localization, QA, marketing, and more, without which the Mozilla project would
grind to a halt. We intentionally structured MemShrink to receive community support and that has
paid off considerably. The core MemShrink team consisted of a handful of paid engineers, but the
support from the community that we received through bug reporting, testing, and add-on fixing has
magnified our efforts.
Even within the Mozilla community, memory usage has long been a source of frustration. Some
have experienced the problems first hand. Others have friends or family who have seen the problems.
Those lucky enough to have avoided that have undoubtedly seen complaints about Firefox’s memory
usage or comments asking “is the leak fixed yet?” on new releases that they worked hard on. Nobody
enjoys having their hard work criticized, especially when the criticism concerns something they do not even work on.
Addressing a long-standing problem that most community members can relate to was an excellent
first step towards building support.
Saying we were going to fix things was not enough though. We had to show that we were serious
about getting things done and we could make real progress on the problems. We held public weekly
meetings to triage bug reports and discuss the projects we were working on. Nicholas also blogged a
progress report for each meeting so that people who were not there could see what we were doing.
Highlighting the improvements that were being made, the changes in bug counts, and the new bugs
being filed clearly showed the effort we were putting into MemShrink. And the early improvements
we were able to get from the low-hanging fruit went a long way toward showing that we could tackle these problems.
The final piece was closing the feedback loop between the wider community and the developers
working on MemShrink. The tools that we discussed earlier turned bugs that would have been closed
as unreproducible and forgotten into reports that could be and were fixed. We also turned complaints,
comments, and responses on our progress report blog posts into bug reports and tried to gather the
necessary information to fix them. All bug reports were triaged and given a priority. We also put
forth an effort to investigate all bug reports, even those that were determined to be unimportant to fix.
That investigation made the reporter’s effort feel more valued, and also aimed to leave the bug in a
state where someone with more time could come along and fix it later. Together these actions built
a strong base of support in the community that provided us with great bug reports and invaluable
testing help.
5.8 Conclusions
Over the two years that the MemShrink project has been active, we have made great improvements in Firefox’s memory usage. The MemShrink team has turned memory usage from one of the most common user complaints into a selling point for the browser and significantly improved the user experience for many Firefox users.
I would like to thank Justin Lebar, Andrew McCreight, John Schoenick, Johnny Stenback, Jet Villegas, and Timothy Nikkel for all of their work on MemShrink and the other engineers who have
helped fix memory problems. Most of all I thank Nicholas Nethercote for getting MemShrink off
the ground, working extensively on reducing Spidermonkey’s memory usage, running the project
for two years, and far too many other things to list. I would also like to thank Jet and Andrew for
reviewing this chapter.
[chapter 6]
6.1 Introduction
Distributed, real-time and embedded (DRE) systems are an important class of applications that share
properties of both enterprise distributed systems and resource-constrained real-time and embedded
systems. In particular, applications in DRE systems are similar to enterprise applications, i.e., they
are distributed across a large domain. Moreover, like real-time and embedded systems, applications
in DRE systems are often mission-critical and carry stringent safety, reliability, and quality of service
(QoS) requirements.
In addition to the complexities described above, deployment of application and infrastructure components in DRE systems incurs its own set of unique challenges. First, applications in DRE system domains may have particular dependencies on the target environment, such as specific hardware and software (e.g., GPS units, sensors, actuators, and particular real-time operating systems). Second, the deployment infrastructure of a DRE system must contend with strict resource requirements in environments with finite resources (e.g., CPU, memory, and network bandwidth).
Component-Based Software Engineering (CBSE) [HC01] is increasingly used as a paradigm for
developing applications in both enterprise [ATK05] and DRE systems [SHS+ 06]. CBSE facilitates
systematic software reuse by encouraging developers to create black box components that interact
with each other and their environment through well-defined interfaces. CBSE also simplifies the de-
ployment of highly complex distributed systems [WDS+ 11] by providing standardized mechanisms
to control the configuration and lifecycle of applications. These mechanisms enable the composition
of large-scale, complex applications from smaller, more manageable units of functionality, e.g.,
commercial off-the-shelf components and preexisting application building-blocks. These applica-
tions can be packaged along with descriptive and configuration metadata, and made available for
deployment into a production environment.
Building on expertise gleaned from the development of The ACE ORB (TAO) [SNG+ 02]—
an open-source implementation of the Common Object Request Broker Architecture (CORBA)
standard—we have been applying CBSE principles to DRE systems over the past decade. As a result
of these efforts, we have developed a high-quality open-source implementation of the OMG CORBA
Component Model (CCM), which we call the Component Integrated ACE ORB (CIAO) [Insty].
CIAO implements the so-called Lightweight CCM [OMG04] specification, which is a subset of the
full CCM standard that is tuned for resource-constrained DRE systems.
In the context of our work on applying CBSE principles to DRE systems, we have also been re-
searching the equally challenging problem of facilitating deployment and configuration of component-
based systems in these domains. Managing deployment and configuration of component-based
applications is a challenging problem for the following reasons:
• Component dependency and version management. There may be complex requirements and
relationships amongst individual components. Components may depend on one another for
proper operation, or specifically require or exclude particular versions. If these relationships are not described and enforced, component applications may fail to deploy properly or, even worse, malfunction in subtle and pernicious ways.
• Component configuration management. A component might expose configuration hooks that
change its behavior, and the deployment infrastructure must manage and apply any required
configuration information. Moreover, several components in a deployment may have related
configuration properties, and the deployment infrastructure should ensure that these properties
remain consistent across an entire application.
• Distributed connection and lifecycle management. In the case of enterprise systems, compo-
nents must be installed and have their connection and activation managed on remote hosts.
To address the challenges outlined above, we began developing a deployment engine for CIAO in
2005. This tool, which we call the Deployment and Configuration Engine (DAnCE) [DBO+ 05], is an
implementation of the OMG Deployment and Configuration (D&C) specification [OMG06]. For most
of its history, DAnCE served primarily as a research vehicle for graduate students developing novel
approaches to deployment and configuration, which had two important impacts on its implementation:
• As a research vehicle, DAnCE’s development timeline was largely driven by paper deadlines
and feature demonstrations for sponsors. As a result, its tested use cases were relatively simple
and narrowly focused.
• Custodianship of DAnCE changed hands several times as research projects were completed
and new ones started. As a result, there was often not a unified architectural vision for the
entire infrastructure.
These two factors had several impacts on DAnCE. For example, narrow and focused use-cases
often made evaluating end-to-end performance on real-world application deployments a low priority.
Moreover, the lack of a unified architectural vision combined with tight deadlines often meant that
poor architectural choices were made in the name of expediency, and were not later remedied. These
problems were brought into focus as we began to work with our commercial sponsors to apply
DAnCE to larger-scale deployments, numbering in the hundreds to thousands of components on
tens to hundreds of hardware nodes. While the smaller, focused use cases would have acceptable deployment times, these larger deployments would take unacceptably long amounts of time, on the order of an hour or more, to complete fully.
In response to these problems, we undertook an effort to comprehensively evaluate the architecture,
design, and implementation of DAnCE and create a new implementation that we call Locality-Enabled
DAnCE (LE-DAnCE) [OGS11] [OGST13]. This chapter focuses on documenting and applying
optimization principle patterns that form the core of LE-DAnCE to make it suitable for DRE systems.
Table 6.1 summarizes common optimization patterns [Var05], many of which we apply in LE-DAnCE.
An additional goal of this chapter is to supplement this catalog with new patterns we identified in our work on LE-DAnCE.
Table 6.1: Catalog of Optimization Principles and Known Use Cases in Networking [Var05]
The remainder of this chapter is organized as follows: Section 6.2 provides an overview of the
OMG D&C specification; Section 6.3 identifies the most significant sources of DAnCE performance
problems (parsing deployment information from XML, analysis of deployment information at
run-time, and serialized execution of deployment steps) and uses them as case studies to identify
optimization principles that (1) are generally applicable to DRE systems and (2) we applied to
LE-DAnCE; and Section 6.4 presents concluding remarks.
This architecture consists of (1) a set of global (system-wide) entities used to coordinate deploy-
ment and (2) a set of local (node-level) entities used to instantiate component instances and configure
their connections and QoS properties. Each entity in these global and local tiers corresponds to one of the following three major roles:
Manager This role (known as the ExecutionManager at the global level and as the NodeManager at the node level) is a singleton daemon that coordinates all deployment entities in a single
context. The Manager serves as the entry point for all deployment activity and as a factory
for implementations of the ApplicationManager role.
ApplicationManager This role (known as the DomainApplicationManager at the global level and as the NodeApplicationManager at the node level) coordinates the lifecycle for running
instances of a component-based application. Each ApplicationManager represents exactly
one component-based application and is used to initiate deployment and teardown of that
application. This role also serves as a factory for implementations of the Application role.
Application This role (known as the DomainApplication at the global level and as the NodeApplication at the node level) represents a deployed instance of a component-based application. It
is used to finalize the configuration of the associated component instances that comprise an
application and begin execution of the deployed component-based application.
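These factory relationships can be summarized with a small conceptual sketch; the code below is plain Python with illustrative method names, not the IDL interfaces defined by the D&C specification.

class Application:
    # Represents a deployed instance of a component-based application.
    def __init__(self, plan):
        self.plan = plan
        self.connections = []

    def finish_configuration(self, connections):
        # Finalize configuration of the component instances (e.g., connections).
        self.connections = connections

    def start(self):
        # Begin execution of the deployed application.
        print("starting", len(self.plan["instances"]), "instances")


class ApplicationManager:
    # Coordinates the lifecycle of exactly one component-based application.
    def __init__(self, plan):
        self.plan = plan

    def launch(self):
        # Factory method: produce the Application role for this plan.
        return Application(self.plan)


class Manager:
    # Singleton entry point (ExecutionManager globally, NodeManager per node).
    def prepare_plan(self, plan):
        # Factory method: produce an ApplicationManager for one application.
        return ApplicationManager(plan)


manager = Manager()
app_manager = manager.prepare_plan({"instances": [{"name": "gps"}, {"name": "logger"}]})
app = app_manager.launch()
app.finish_configuration(connections=[])
app.start()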
Problem
Processing these deployment plan files during deployment and even runtime, however, can lead to
substantial performance penalties. These performance penalties stem from the following sources:
• XML deployment plan file sizes grow substantially as the number of component instances and
connections in the deployment increases, which causes significant I/O overhead to load the plan
into memory and to validate the structure against the schema to ensure that it is well-formed.
• The XML document format cannot be directly used by the deployment infrastructure because
the infrastructure is a CORBA application that implements OMG Interface Definition Language
(IDL) interfaces. Hence, the XML document must first be converted into the IDL format used
by the runtime interfaces of the deployment framework.
In DRE systems, component deployments that number in the thousands are not uncommon.
Moreover, component instances in these domains will exhibit a high degree of connectivity. Both
these factors contribute to large plans. Plans need not be large, however, to significantly impact the operation of a system. Though the plans were significantly smaller in the SEAMONSTER case study described above, the extremely limited computational resources meant that the processing overhead for even smaller plans was often too time-consuming.
Problem
While this approach is conceptually simple, it is fraught with accidental complexities that yield the
following inefficiencies in practice:
1. Reference representation in IDL. Deployment plans are typically transmitted over networks,
so they must obey the rules of the CORBA IDL language mapping. Since IDL does not
have any concept of references or pointers, some alternative mechanism must be used to
describe the relationships between plan elements. The deployment plan stores all the major
elements in sequences, so references to other entities can be represented with simple indices
into these sequences. While this implementation can follow references in constant time, it also
means these references become invalidated when plan entities are copied to sub-plans, as their
position in deployment plan sequences will most likely be different. It is also impossible to determine whether the target of a reference has already been copied without searching the sub-plan, which is time-consuming; the sketch after this list illustrates the problem.
2. Memory allocation in deployment plan sequences. The CORBA IDL mapping requires that
sequences be stored in consecutive memory addresses. If a sequence is resized, therefore,
its contents will most likely be copied to another location in memory to accommodate the
increased sequence size. With the approach summarized above, substantial copying overhead
will occur as plan sizes grow. This overhead is especially problematic in resource-constrained
systems (such as our SEAMONSTER case study), whose limited run-time memory must be
conserved for application components. If the deployment infrastructure is inefficient in its
use of this resource, it will either exhaust the available memory or cause significant thrashing of any available virtual memory (impacting both deployment latency and the usable life of flash-based storage).
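A minimal sketch of the reference problem from the first item above: connections refer to instances by index, so copying a subset of instances into a node-specific sub-plan forces every index to be remapped and requires a membership check for targets that were not copied. The plan layout is simplified and illustrative; it is not DAnCE's IDL.

# Simplified, illustrative plan layout (not DAnCE's IDL). Instances live in a
# sequence, and connections refer to them by index into that sequence.

def split_for_node(plan, node):
    """Copy the instances assigned to `node` into a sub-plan, remapping the
    index-based references so they stay valid in the new sequence."""
    sub_plan = {"instances": [], "connections": []}
    index_map = {}  # old index in plan -> new index in sub_plan

    for old_idx, inst in enumerate(plan["instances"]):
        if inst["node"] == node:
            index_map[old_idx] = len(sub_plan["instances"])
            sub_plan["instances"].append(inst)

    # Every connection endpoint must be rewritten; an endpoint whose target was
    # not copied can only be detected with this membership check.
    for src, dst in plan["connections"]:
        if src in index_map and dst in index_map:
            sub_plan["connections"].append((index_map[src], index_map[dst]))

    return sub_plan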
Problem
To minimize initial implementation complexity, we made the (admittedly shortsighted) design choice of using synchronous invocation in the initial DAnCE implementation. This global synchronicity worked fine for relatively small deployments with fewer than about 100 components. As the number of nodes
and instances assigned to those nodes scaled up, however, this global/local serialization imposed a
substantial cost in deployment latency.
This serialized execution yielded the most problematic performance degradation in our SEAMONSTER case study: the limited computational resources available on the field hardware meant that node-level installation would often take several minutes to complete. Such latency at the node level quickly becomes disastrous. In particular, even relatively modest deployments involving tens of nodes quickly escalate the deployment latency of the system to a half hour or more.
The DAnCE architecture shown in Figure 6.2 was problematic with respect to parallelization since
its NodeApplication implementation integrated all logic necessary for installing, configuring, and
connecting instances directly (as shown in Figure 6.3), rather than performing only some processing
and delegating the remainder of the concrete deployment logic to the application process. This
tight integration made it hard to parallelize the node-level installation procedures for the following
reasons:
• The amount of data shared by the generic deployment logic (the portion of the NodeApplication
implementation that interprets the plan) and the specific deployment logic (the portion which
has specific knowledge of how to manipulate components) made it hard to parallelize their
installation in the context of a single component server since that data must be modified during
installation.
• Groups of components installed into separate application processes were treated as separate deployment sub-tasks, so these groupings were handled sequentially, one after the other.
Table 6.2: Catalog of optimization principles and known use cases in LE-DAnCE
Infinispan
Manik Surtani
7.1 Introduction
Infinispan1 is an open source data grid platform. It is a distributed, in-memory key-value NoSQL store.
Software architects typically use data grids like Infinispan either as a performance-enhancing dis-
tributed in-memory cache in front of an expensive, slow data store such as a relational database, or
as a distributed NoSQL data store to replace a relational database. In either case, the main reason to
consider a data grid in any software architecture is performance. The need for fast and low-latency
access to data is becoming increasingly common.
As such, performance is Infinispan’s sole raison d’être. Infinispan’s code base is, in turn, extremely
performance sensitive.
7.2 Overview
Before digging into Infinispan’s depths, let us consider how Infinispan is typically used. Infinispan
falls into a category of software called middleware. According to Wikipedia, middleware “can be
described as software glue”—the components that sit on servers, in between applications, such as
websites, and an operating system or database. Middleware is often used to make an application
developer more productive, more effective and able to turn out—at a faster pace—applications that
are also more maintainable and testable. All of this is achieved by modularisation and component
reuse. Infinispan specifically is often placed in between any application processing or business logic
and the data-storage tier. Data storage (and retrieval) is often the biggest bottleneck, and placing an in-memory data grid in front of a database often makes things much faster. Further, data storage is also often a single point of contention and potential failure. Again, by making use of Infinispan in front of (or even in place of) a more traditional data store, applications can achieve greater elasticity and scalability.
As a Library or as a Server
Infinispan is implemented in Java (and some Scala) and can be used in two different ways. First,
it can be used as a library, embedded into a Java application, by including Infinispan JAR files
and referencing and instantiating Infinispan components programmatically. This way, Infinispan
components sit in the same JVM as the application and a part of the application’s heap memory is
allocated as a data grid node.
Second, it can be used as a remote data grid by starting up Infinispan instances and allowing
them to form a cluster. A client can then connect to this cluster over a socket from one of the many
client libraries available. This way, each Infinispan node exists within its own isolated JVM and has
the entire JVM heap memory at its disposal.
Peer-to-Peer Architecture
In both cases, Infinispan instances detect one another over a network, form a cluster, and start sharing
data to provide applications with an in-memory data structure that transparently spans all the servers
in a cluster. This allows applications to theoretically address an unlimited amount of in-memory
storage as nodes are added to the cluster, increasing overall capacity.
Infinispan is a peer-to-peer technology where each instance in the cluster is equal to every other
instance in the cluster. This means there is no single point of failure and no single bottleneck. Most
importantly, it provides applications with an elastic data structure that can scale horizontally by adding more instances, and can also scale back in by shutting down some instances, all the while allowing the application to continue operating with no loss of overall functionality.
Figure 7.2: Infinispan as a remote data grid
Radar Gun
Radar Gun2 is an open source benchmarking framework that was designed to perform comparative
(as well as competitive) benchmarks, to measure scalability and to generate reports from data points
collected. Radar Gun is specifically targeted at distributed data structures such as Infinispan, and
has been used extensively during the development of Infinispan to identify and fix bottlenecks. See
Section 7.4 for more information on Radar Gun.
2 https://github.com/radargun/radargun/wiki
Yahoo Cloud Serving Benchmark
The Yahoo Cloud Serving Benchmark3 (YCSB) is an open source tool created to test latency when
communicating with a remote data store to read or write data of varying sizes. YCSB treats all data
stores as a single remote endpoint, so it doesn’t attempt to measure scalability as nodes are added
to or removed from a cluster. Since YCSB has no concept of a distributed data structure, it is only
useful to benchmark Infinispan in client/server mode.
Distributed Features
Radar Gun soon expanded to cover distributed data structures. Still focused on embedded libraries, Radar Gun is able to launch multiple instances of the framework on different servers, which in turn launch instances of the distributed caching library. The benchmark is then run in parallel on each node in the cluster. Results are collated and reports are generated by the Radar Gun controller.
The ability to automatically bring up and shut down nodes is crucial to scalability testing, as it
becomes infeasible and impractical to manually run and re-run benchmarks on clusters of varying
sizes, from two nodes all the way to hundreds or even thousands of nodes.
Figure 7.3: Radar Gun
Profiling
Radar Gun is also able to start and attach profiler instances to each data grid node and take profiler
snapshots, for more insight into the goings-on in each node when under load.
Memory Performance
Radar Gun also has the ability to measure the memory consumption of each node. In an in-memory data store, performance isn't only about how fast you read or write data, but also about how well the structure performs with regard to memory consumption.
This is particularly important in Java-based systems, as garbage collection can adversely affect the
responsiveness of a system. Garbage collection is discussed in more detail later.
Metrics
Radar Gun measures performance in terms of transactions per second. This is captured for each node
and then aggregated on the controller. Both reads and writes are measured and charted separately,
even though they are performed simultaneously (to ensure a realistic test where such operations are
interleaved). Radar Gun also captures means, medians, standard deviation, maximum and minimum
values for read and write transactions, and these too are logged although they may not be charted.
Memory performance is also captured, by way of a footprint for any given iteration.
Extensibility
Radar Gun is an extensible framework. It allows you to plug in your own data access patterns, data
types and sizes. Further, it also allows you to add adapters to any data structure, caching library or
NoSQL database that you would like to test.
End-users are often encouraged to use Radar Gun too when attempting to compare the perfor-
mance of different configurations of a data grid.
The Network
Network communication is the single most expensive part of Infinispan, whether used for communi-
cation between peers or between clients and the grid itself.
Peer Network
Infinispan makes use of JGroups8 , an open source peer-to-peer group communication library for
inter-node communication. JGroups can make use of either TCP or UDP network protocols, including
UDP multicast, and provides high level features such as message delivery guarantees, retransmission
and message ordering even over unreliable protocols such as UDP.
It becomes crucially important to tune the JGroups layer correctly, to match the characteristics
of your network and application, such as time-to-live, buffer sizes, and thread pool sizes. It is also
important to account for the way JGroups performs bundling—the combining of multiple small
messages into single network packets—or fragmentation—the reverse of bundling, where large
messages are broken down into multiple smaller network packets.
The network stack on your operating system and your network equipment (switches and routers)
should also be tuned to match this configuration. Operating system TCP send and receive buffer
sizes, frame sizes, jumbo frames, etc. all play a part in ensuring the most expensive component in
your data grid is performing optimally.
Tools such as netstat and wireshark can help analyse packets, and Radar Gun can help drive
load through a grid. Radar Gun can also be used to profile the JGroups layer of Infinispan to help
locate bottlenecks.
Server Sockets
Infinispan makes use of the popular Netty9 framework to create and manage server sockets. Netty is
a wrapper around the asynchronous Java NIO framework, which in turn makes use of the operating
system’s asynchronous network I/O capabilities. This allows for efficient resource utilization at the
expense of some context switching. In general, this performs very well under load.
8 http://www.jgroups.org
9 http://www.netty.io
Netty offers several levels of tuning to ensure optimal performance. These include buffer sizes,
thread pools and the like, and should also be matched up with operating system TCP send and receive
buffers.
Data Serialization
Before putting data on the network, application objects need to be serialized into bytes so that they
can be pushed across a network, into the grid, and then again between peers. The bytes then need
to be de-serialized back into application objects, when read by the application. In most common
configurations, about 20% of the time spent in processing a request is spent in serialization and
de-serialization.
Default Java serialization (and de-serialization) is notoriously slow, both in terms of CPU cycles
and the bytes that are produced—they are often unnecessarily large, which means more data to push
around a network.
Infinispan uses its own serialization scheme, where full class definitions are not written to the
stream. Instead, magic numbers are used for known types where each known type is represented by a
single byte. This greatly improves not just serialization and de-serialization speed, but also produces
a much more compact byte stream for transmission across a network. An externalizer, keyed by its magic number, is registered for each known data type. This externalizer contains the logic to convert an object to bytes and vice versa.
This technique works well for known types, such as internal Infinispan objects that are exchanged
between peer nodes. Internal objects—such as commands, envelopes, etc.—have externalizers and
corresponding unique magic numbers. But what about application objects? By default, if Infinispan
encounters an object type it is unaware of, it falls back to default Java serialization for that object.
This allows Infinispan to work out of the box—albeit in a less efficient manner when dealing with
unknown application object types.
To get around this, Infinispan allows application developers to register externalizers for application
data types as well. This allows powerful, fast, efficient serialization of application objects too, as long
as the application developer can write and register externalizer implementations for each application
object type.
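The idea behind the scheme can be illustrated with a small sketch: a registry maps a one-byte magic number to the logic that writes and reads that type, so no class definition ever reaches the stream. This is a conceptual illustration in Python, not Infinispan's actual externalizer API.

import struct

# Conceptual sketch of magic-number-based serialization; not Infinispan's API.
EXTERNALIZERS = {}  # magic number (one byte) -> (writer, reader)

def register(magic, writer, reader):
    EXTERNALIZERS[magic] = (writer, reader)

def serialize(magic, obj):
    writer, _ = EXTERNALIZERS[magic]
    payload = writer(obj)
    # One byte identifies the type; no class definition is written to the stream.
    return struct.pack("B", magic) + payload

def deserialize(data):
    magic = data[0]
    _, reader = EXTERNALIZERS[magic]
    return reader(data[1:])

# Example: a known "user" type identified by magic number 0x01.
register(0x01,
         writer=lambda user: user["name"].encode("utf-8"),
         reader=lambda payload: {"name": payload.decode("utf-8")})

blob = serialize(0x01, {"name": "alice"})
assert deserialize(blob) == {"name": "alice"}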
This externalizer code has been released as a separate, reusable library, called JBoss Mar-
shalling.10 It is packaged with Infinispan and included in Infinispan distributions, but it is also used
in various other open source projects to improve serialization performance.
Writing to Disk
In addition to being an in-memory data structure, Infinispan can also optionally persist to disk.
Persistence can either be for durability—to survive restarts or node failures, in which case
everything in memory also exists on disk—or it can be configured as an overflow when Infinispan
runs out of memory, in which case it acts in a manner similar to an operating system paging to disk.
In the latter case, data is only written to disk when data needs to be evicted from memory to free up
space.
When persisting for durability, persistence can either be online, where the application thread is
blocked until data is safely written to disk, or offline, where data is flushed to disk periodically and
asynchronously. In the latter case, the application thread is not blocked on the process of persistence,
in exchange for uncertainty as to whether the data was successfully persisted to disk at all.
10 http://www.jboss.org/jbossmarshalling
Infinispan supports several pluggable cache stores—adapters that can be used to persist data to
disk or any form of secondary storage. The current default implementation is a simplistic hash bucket
and linked list implementation, where each hash bucket is represented by a file on the filesystem.
While easy to use and configure, this isn’t the best-performing implementation.
Two high-performance, filesystem-based native cache store implementations are currently on the
roadmap. Both will be written in C, with the ability to make system calls and use direct I/O where
available (such as on Unix systems), to bypass kernel buffers and caches.
One of the implementations will be optimized for use as a paging system, and will therefore need
to have random access, possibly a b-tree structure.
The other will be optimized as a durable store, and will mirror what is stored in memory. As
such, it will be an append-only structure, designed for fast writing but not necessarily for fast
reading/seeking.
Specific areas to look at are the asynchronous transport thread pool (if using asynchronous communications), ensuring this thread pool is at least as large as the number of concurrent updates each node is expected to handle. Similarly, when tuning JGroups, the OOB14 and incoming thread pools should be at least as big as the expected number of concurrent updates.
Garbage Collection
General good practice with regard to JVM garbage collectors is an important consideration for any Java-based software, and Infinispan is no exception. If anything, it is all the
more important for a data grid, since container objects may survive for long periods of time while
lots of transient objects—related to a specific operation or transaction—are also created. Further,
garbage collection pauses can have adverse effects on a distributed data structure, as they can render
a node unresponsive and cause the node to be marked as failed.
These have been taken into consideration when designing and building Infinispan, but at the
same time there is a lot to consider when configuring a JVM to run Infinispan. Each JVM is different.
However, some analysis15 has been done on the optimal settings for certain JVMs when running
Infinispan. For example, if using OpenJDK16 or Oracle’s HotSpot JVM17 , using the Concurrent
Mark and Sweep collector18 alongside large pages19 for JVMs given about 12 GB of heap each
appears to be an optimal configuration.
Further, pauseless garbage collectors—such as C420 , used in Azul’s Zing JVM21 —are worth considering in cases where garbage collection pauses become a noticeable issue.
14 http://www.jgroups.org/manual/html/user-advanced.html#d0e3284
15 http://howtojboss.com/2013/01/08/data-grid-performance-tuning/
16 http://openjdk.java.net/
17 http://www.oracle.com/technetwork/java/javase/downloads/index.html
18 http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#cms
19 http://www.oracle.com/technetwork/java/javase/tech/largememory-jsp-137182.html
20 http://www.azulsystems.com/technology/c4-garbage-collector
21 http://www.azulsystems.com/products/zing/virtual-machine
7.6 Conclusions
Performance-centric middleware like Infinispan has to be architected, designed and developed with performance in mind every step of the way. Using the best non-blocking and lock-free algorithms, understanding the characteristics of the garbage being generated, developing with an appreciation for JVM context-switching overhead, and being able to step outside the JVM where needed (writing native persistence components, for example) are all important parts of the mindset needed when developing Infinispan. Further, the right tools for benchmarking and profiling, as well as running benchmarks in a continuous integration style, help ensure performance is never sacrificed as features are added.
[chapter 8]
Talos
Clint Talbert and Joel Maher
At Mozilla, one of our very first automation systems was a performance testing framework we dubbed
Talos. Talos had been faithfully maintained without substantial modification since its inception in
2007, even though many of the original assumptions and design decisions behind Talos were lost as
ownership of the tool changed hands.
In the summer of 2011, we finally began to look askance at the noise and the variation in the
Talos numbers, and we began to wonder how we could make some small modification to the system
to start improving it. We had no idea we were about to open Pandora’s Box.
In this chapter, we will detail what we found as we peeled back layer after layer of this software,
what problems we uncovered, and what steps we took to address them in hopes that you might learn
from both our mistakes and our successes.
8.1 Overview
Let’s unpack the different parts of Talos. At its heart, Talos is a simple test harness which creates a
new Firefox profile, initializes the profile, calibrates the browser, runs a specified test, and finally
reports a summary of the test results. The tests live inside the Talos repository and are one of two
types: a single page which reports a single number (e.g., startup time via a web page’s onload
handler) or a collection of pages that are cycled through to measure page load times. Internally, a
Firefox extension is used to cycle the pages and collect information such as memory and page load
time, to force garbage collection, and to test different browser modes. The original goal was to create
as generic a harness as possible to allow the harness to perform all manner of testing and measure
some collection of performance attributes as defined by the test itself.
To report its data, the Talos harness can send JSON to Graph Server: an in-house graphing web
application that accepts Talos data as long as that data meets a specific, predefined format for each
test, value, platform, and configuration. Graph Server also serves as the interface for investigating
trends and performance regressions. A local instance of a standard Apache web server serves the pages during a test run.
The final component of Talos is the regression reporting tools. For every check-in to the Firefox
repository, several Talos tests are run, these tests upload their data to Graph Server, and another script
consumes the data from Graph Server and ascertains whether or not there has been a regression. If a
regression is found (i.e., the script’s analysis indicates that the code checked in made performance on
this test significantly worse), the script emails a message to a mailing list as well as to the individual
that checked in the offending code.
While this architecture—summarized in Figure 8.1—seems fairly straightforward, each piece of
Talos has morphed over the years as Mozilla has added new platforms, products, and tests. With
minimal oversight of the entire system as an end to end solution, Talos wound up in need of some
serious work:
• Noise—the script watching the incoming data flagged spikes of test noise as actual regressions so often that it was impossible to trust.
• To determine a regression, the script compared each check-in to Firefox with the values for
three check-ins prior and three afterward. This meant that the Talos results for your check-in
might not be available for several hours.
• Graph Server had a hard requirement that all incoming data be tied to a previously defined
platform, branch, test type, and configuration. This meant that adding new tests was difficult
as it involved running a SQL statement against the database for each new test.
• The Talos harness itself was hard to run because it took its requirement to be generic a little
too seriously—it had a “configure” step to generate a configuration script that it would then
use to run the test in its next step.
conditions. It is also important to have a repeatable environment so you can reproduce results as
needed. But, what is most important is understanding what tests you have and what you measure
from those tests.
A few weeks into our project, we had all learned more about the entire system and had started experimenting with various parameters to run the tests differently. One recurring question was “what
do the numbers mean?” This was not easily answered. Many of the tests had been around for years,
with little to no documentation.
Worse yet, it was not possible to produce the same results locally that were reported from an
automated test run. It became evident that the harness itself performed calculations (it would drop the highest value per page, then report the average of the rest of the cycles) and that Graph Server did as well (drop the highest page value, then average the pages together). The end result was that no
historical data existed that could provide much value, nor did anybody understand the tests we were
running.
We did have some knowledge about one particular test. We knew that this test took the top 100
websites snapshotted in time and loaded each page one at a time, repeating 10 times. Talos loaded
the page, waited for the mozAfterPaint event (a standard event which is fired when Firefox has painted the canvas for the webpage), and then recorded the time from loading the page to receiving this event. Looking at the 1,000 data points produced from a single test run, there was no obvious pattern. Imagine boiling those 1,000 points down to a single number and tracking that number over time. What if we made CSS parsing faster, but image loading slower? How would we detect that? Would it be possible to see page 17 slow down if all 99 other pages remained the same? To showcase how the values were calculated in the original version of Talos, consider the following numbers.
For the following page load values:
• Page 1: 570, 572, 600, 503, 560
• Page 2: 780, 650, 620, 700, 750
• Page 3: 1220, 980, 1000, 1100, 1200
First, the Talos harness itself would drop the first value and calculate the median of the remaining values:
• Page 1: 566
• Page 2: 675
• Page 3: 1050
These values would be submitted to Graph Server. Graph Server would drop the highest per-page value, calculate the mean of the remaining values, and report that one number:
(566 + 675) / 2 = 620.5
This final value would be graphed over time, and as you can see it generates an approximate
value that is not good for anything more than a coarse evaluation of performance. Furthermore, if a
regression is detected using a value like this, it would be extremely difficult to work backwards and
see which pages caused the regression so that a developer could be directed to a specific issue to fix.
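The sketch below reproduces the aggregation just described and shows why working backwards from the single reported number is so hard; the data layout is illustrative rather than the actual harness or Graph Server code.

from statistics import median, mean

# Illustrative reconstruction of the old aggregation pipeline, not the actual
# Talos or Graph Server code.
def old_talos_number(pages):
    """pages maps a page name to its list of raw load times (one per cycle)."""
    # Harness: drop the first value for each page, then take the median.
    per_page = {name: median(values[1:]) for name, values in pages.items()}

    # Graph Server: drop the highest per-page value, then average the rest.
    kept = sorted(per_page.values())[:-1]
    return mean(kept)

pages = {
    "page1": [570, 572, 600, 503, 560],
    "page2": [780, 650, 620, 700, 750],
    "page3": [1220, 980, 1000, 1100, 1200],
}
print(old_talos_number(pages))  # 620.5, one coarse number for the whole run

# A 20% regression on the slowest page is dropped by Graph Server entirely,
# so the reported number does not change at all (still 620.5).
pages["page3"] = [v * 1.2 for v in pages["page3"]]
print(old_talos_number(pages))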
We were determined to prove that we could reduce the noise in the data from this 100 page
test. Since the test measured the time to load a page, we first needed to isolate the test from other
influences in the system like caching. We changed the test to load the same page over and over again,
rather than cycling between pages, so that load times were measured for a page that was mostly
cached. While this approach is not indicative of how end users actually browse the web, it reduced
some of the noise in the recorded data. Unfortunately, looking at only 10 data points for a given page
was not a useful sample size.
singularly bad decision. We should have built a separate harness and then compared the new harness
with the old one.
Trying to support the original flow of data and the new method for measuring data for each page
proved to be difficult. On the positive side, it forced us to restructure much of the code internal to the
framework and to streamline quite a few things. But, we had to do all this piecemeal on a running
piece of automation, which caused us several headaches in our continuous integration rigs.
It would have been far better to develop both Talos (the framework) and Datazilla (its reporting system) in parallel from scratch, leaving all of the old code behind. Especially when it came to staging,
it would have been far easier to stage the new system without attempting to wire in the generation
of development data for the upcoming Datazilla system in running automation. We had thought it
was necessary to do this so that we could generate test data with real builds and real load to ensure
that our design would scale properly. In the end, that build data was not worth the complexity of
modifying a production system. If we had known at the time that we were embarking on a year-long project instead of our projected six-month project, we would have rewritten Talos and the results
framework from scratch.
8.5 Conclusion
In the last year, we dug into every part of performance testing automation at Mozilla. We have
analyzed the test harness, the reporting tools, and the statistical soundness of the results that were
being generated. Over the course of that year, we used what we learned to make the Talos framework
easier to maintain, easier to run, simpler to set up, easier to test experimental patches with, and less
error prone. We have created Datazilla as an extensible system for storing and retrieving all of our
performance metrics from Talos and any future performance automation. We have rebooted our
performance statistical analysis and created statistically viable, per-push regression/improvement
detection. We have made all of these systems easier to use and more open so that any contributor
anywhere can take a look at our code and even experiment with new methods of statistical analysis
on our performance data. Our constant commitment to reviewing the data again and again at each
milestone of the project and our willingness to throw out data that proved inconclusive or invalid
helped us retain our focus as we drove this gigantic project forward. Bringing in people from across
teams at Mozilla as well as many new volunteers helped lend the effort validity and also helped to
establish a resurgence in performance monitoring and data analysis across several areas of Mozilla’s
efforts, resulting in an even more data-driven, performance-focused culture.
1 https://github.com/mozilla/datazilla/blob/2c39a3/vendor/dzmetrics/ttest.py
2 https://github.com/mozilla/datazilla/blob/2c369a/vendor/dzmetrics/fdr.py
3 https://github.com/mozilla/datazilla/blob/2c369a/vendor/dzmetrics/data_smoothing.py
[chapter 9]
Zotonic
Arjan Scherpenisse and Marc Worrell
Erlang is a (mostly) functional programming language and runtime system. Erlang/OTP appli-
cations were originally developed for telephone switches, and are known for their fault-tolerance
and their concurrent nature. Erlang employs an actor-based concurrency model: each actor is a
lightweight “process” (green thread) and the only way to share state between processes is to pass
messages. The Open Telecom Platform is the set of standard Erlang libraries that enable fault tolerance and process supervision, amongst other things.
Fault tolerance is at the core of its programming paradigm: let it crash is the main philosophy
of the system. As processes don’t share any state (to share state, they must send messages to each
other), their state is isolated from other processes. As such, a single crashing process will never take
down the system. When a process crashes, its supervisor process can decide to restart it.
Let it crash also allows you to program for the happy case. Using pattern matching and function guards to assure a sane state means less error-handling code is needed, which usually results in clean, concise, and readable code.
request. Such a module might impact the performance of the entire system. In this chapter we’ll
leave this out of consideration, and instead focus on the core performance issues.
Client-Side Caching
The client-side caching is done by the browser. The browser caches images, CSS and JavaScript
files. Zotonic does not allow client-side caching of HTML pages; it always generates all pages dynamically. It can afford to do this because it is very efficient at generating pages (as described in the previous section), and not caching HTML pages prevents showing stale pages after users log in, log out, or place comments.
Zotonic improves client-side performance in two ways:
1. It allows caching of static files (CSS, JavaScript, images etc.)
2. It includes multiple CSS or JavaScript files in a single response
The first is done by adding the appropriate HTTP headers to the response2 :
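(The listing below is an illustrative sketch, written as a Python dictionary; the values are typical far-future settings rather than Zotonic's exact configuration.)

# Illustrative caching headers for static files; not Zotonic's exact settings.
# Because the generated URLs contain a timestamp (see below), the files can
# safely be cached for a very long time.
cache_headers = {
    "Cache-Control": "public, max-age=31536000",     # cache for up to a year
    "Expires": "Thu, 31 Dec 2037 23:59:59 GMT",      # far-future fallback
    "Last-Modified": "Mon, 01 Jul 2013 12:00:00 GMT",
}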
The second is done by concatenating multiple CSS or JavaScript files into a single file, separating individual files with a tilde and only mentioning paths if they change between files:
http://example.org/lib/bootstrap/css/bootstrap
~bootstrap-responsive~bootstrap-base-site~
/css/jquery.loadmask~z.growl~z.modal~site~63523081976.css
The number at the end is a timestamp of the newest file in the list. The necessary CSS link or
JavaScript script tag is generated using the {% lib %} template tag.
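A rough sketch of how such a combined URL could be generated from a list of library files; the path handling and timestamp are simplified compared to what the {% lib %} tag actually does.

import os

def lib_url(files, mtimes):
    """Build a combined /lib URL: files in the same directory are separated by
    a tilde, the directory is only repeated when it changes, and a timestamp of
    the newest file is appended. Simplified relative to Zotonic's {% lib %} tag."""
    parts = []
    current_dir = None
    for path in files:
        directory, name = os.path.split(path)
        base, _ext = os.path.splitext(name)
        if directory != current_dir:
            parts.append(directory + "/" + base)
            current_dir = directory
        else:
            parts.append(base)
    newest = max(mtimes[path] for path in files)
    ext = os.path.splitext(files[0])[1]
    return "/lib/" + "~".join(parts) + "~" + str(newest) + ext

files = ["bootstrap/css/bootstrap.css", "bootstrap/css/bootstrap-base-site.css",
         "css/jquery.loadmask.css", "css/site.css"]
mtimes = {f: 63523081976 for f in files}
print(lib_url(files, mtimes))
# prints /lib/bootstrap/css/bootstrap~bootstrap-base-site~css/jquery.loadmask~site~63523081976.css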
Server-Side Caching
Zotonic is a large system, and many parts in it do caching in some way. The sections below explain
some of the more interesting parts.
Rendered Templates
Templates are compiled into Erlang modules, after which the byte code is kept in memory. Compiled
templates are called as regular Erlang functions.
The template system detects any changes to templates and will recompile the template at runtime. When compilation is finished, Erlang’s hot code upgrade mechanism is used to load the newly compiled Erlang module.
2 Note that Zotonic does not set an ETag. Some browsers check the ETag for every use of the file by making a request to the server, which defeats the whole idea of caching and making fewer requests.
3 A byte array, or binary, is a native Erlang data type. If it is smaller than 64 bytes it is copied between processes; larger ones are shared between processes. Erlang also shares parts of byte arrays between processes by using references to those parts instead of copying the data itself, making byte arrays an efficient and easy-to-use data type.
In-Memory Caching
All caching is done in memory, in the Erlang VM itself. No communication between computers or
operating system processes is needed to access the cached data. This greatly simplifies and optimizes
the use of the cached data.
As a comparison, accessing a memcache server typically takes 0.5 milliseconds. In contrast, accessing main memory within the same process takes 1 nanosecond on a CPU cache hit and 100 nanoseconds on a CPU cache miss—not to mention the huge speed difference between memory and network.4
Zotonic has two in-memory caching mechanisms 5 :
1. Depcache, the central per-site cache
2. Process Dictionary Memo Cache
Depcache
The central caching mechanism in every Zotonic site is the depcache, which is short for dependency
cache. The depcache is an in-memory key-value store with a list of dependencies for every stored
key.
For every key in the depcache we store:
• the key’s value;
• a serial number, a global integer incremented with every update request;
• the key’s expiration time (counted in seconds);
4 See “Latency Numbers Every Programmer Should Know” at
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html.
5 In addition to these mechanisms, the database server performs some in-memory caching, but that is not within the scope of
this chapter.
• a list of other keys that this key depends on (e.g., a resource ID displayed in a
cached template); and
• if the key is still being calculated, a list of processes waiting for the key’s value.
If a key is requested, the cache checks whether the key is present, not expired, and whether the serial numbers of all the dependency keys are lower than the serial number of the cached key. If the key is still valid, its value is returned; otherwise the key and its value are removed from the cache and undefined is returned.
Alternatively, if the key is still being calculated, the requesting process is added to the key’s waiting list.
The implementation makes use of ETS, the Erlang Term Storage, a standard hash table imple-
mentation which is part of the Erlang OTP distribution. The following ETS tables are created by
Zotonic for the depcache:
• Meta table: the ETS table holding all stored keys, the expiration and the depending keys. A
record in this table is written as #meta{key, expire, serial, deps}.
• Deps table: the ETS table stores the serial for each key.
• Data table: the ETS table that stores each key’s data.
• Waiting PIDs dictionary: the ETS table that stores the IDs of all processes waiting for the
arrival of a key’s value.
The ETS tables are optimized for parallel reads and usually directly accessed by the calling
process. This prevents any communication between the calling process and the depcache process.
The depcache process is called for:
• memoization where processes wait for another process’s value to be calculated;
• put (store) requests, serializing the serial number increments; and
• delete requests, also serializing the depcache access.
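The lookup logic described above can be summarized with a small sketch; plain Python dictionaries stand in for the ETS tables, and the waiting-list and memoization behaviour is omitted.

meta = {}     # key -> {"expire": expiry time, "serial": n, "deps": [dependency keys]}
serials = {}  # key -> serial number of that key's most recent update
data = {}     # key -> cached value

UNDEFINED = object()

def depcache_get(key, now):
    entry = meta.get(key)
    if entry is None:
        return UNDEFINED
    # Valid only if not expired and no dependency has been updated since this
    # key was stored (i.e., every dependency serial is lower than the key's).
    valid = (now < entry["expire"] and
             all(serials.get(dep, 0) < entry["serial"] for dep in entry["deps"]))
    if valid:
        return data[key]
    # Expired or invalidated: evict the key and report a miss.
    meta.pop(key, None)
    data.pop(key, None)
    return UNDEFINED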
The depcache can get quite large. To prevent it from growing too large there is a garbage collector
process. The garbage collector slowly iterates over the complete depcache, evicting expired or
invalidated keys. If the depcache size is above a certain threshold (100 MiB by default) then the
garbage collector speeds up and evicts 10% of all encountered items. It keeps evicting until the cache
is below its threshold size.
100 MiB might sound small in this era of multi-terabyte databases. However, as the cache mostly contains textual data, it will be big enough to contain the hot data for most web sites. Otherwise, the size of the cache can be changed in the configuration.
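A sketch of the eviction policy just described, reusing the simplified tables from the previous sketch; the slow iteration over ETS and the exact size accounting are glossed over.

MAX_SIZE = 100 * 1024 * 1024  # default threshold: 100 MiB

def gc_sweep(meta, data, cache_size, now, is_valid):
    # One slow sweep over the whole cache: expired or invalidated keys are
    # always evicted; above the size threshold, roughly 10% of the keys
    # encountered are evicted as well, until the cache drops below the threshold.
    evicting_extra = cache_size() > MAX_SIZE
    for i, key in enumerate(list(meta)):
        if not is_valid(key, now) or (evicting_extra and i % 10 == 0):
            meta.pop(key, None)
            data.pop(key, None)
        if evicting_extra and cache_size() <= MAX_SIZE:
            evicting_extra = False  # back below the threshold: stop extra eviction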
Separate Heap for Bigger Byte Arrays
There is a big exception for copying data between processes. Byte arrays larger than 64 bytes are not
copied between processes. They have their own heap and are separately garbage collected.
This makes it cheap to send a big byte array between processes, as only a reference to the byte
array is copied. However, it does make garbage collection harder, as all references must be garbage
collected before the byte array can be freed.
Sometimes, references to parts of a big byte array are passed: the bigger byte array can’t be garbage collected until the reference to the smaller part is garbage collected. A consequence is that copying a smaller byte array can be an optimization if doing so frees up the bigger byte array.
<<"Hello World">>.
3. Some callback functions (like last_modified) are called multiple times during request
evaluation.
4. When Webmachine crashes during request evaluation no log entry is made by the request
logger.
5. No support for HTTP Upgrade, making WebSockets support harder.
The first problem (no partitioning of dispatch rules) is only a nuisance. It makes the list of
dispatch rules less intuitive and more difficult to interpret.
The second problem (copying the dispatch list for every request) turned out to be a show stopper for Zotonic. The lists could become so large that copying them could take the majority of the time needed to handle a request.
The third problem (multiple calls to the same functions) forced controller writers to implement
their own caching mechanisms, which is error prone.
The fourth problem (no log on crash) makes it harder to see problems when in production.
The fifth problem (no HTTP Upgrade) prevents us from using the nice abstractions available in
Webmachine for WebSocket connections.
The above problems were so serious that we had to modify Webmachine for our own purposes.
First a new option was added: dispatcher. A dispatcher is a module implementing the dispatch/3
function which matches a request to a dispatch list. The dispatcher also selects the correct site (virtual
host) using the HTTP Host header. When testing a simple “hello world” controller, these changes gave a threefold increase in throughput. We also observed that the gain was much higher on systems with many virtual hosts and dispatch rules.
Webmachine maintains two data structures, one for the request data and one for the internal request processing state. These data structures referred to each other and were almost always used in tandem, so we combined them into a single data structure. This made it easier to remove the use of the process dictionary and to add the new single data structure as an argument to all functions inside Webmachine. This resulted in 20% less processing time per request.
We optimized Webmachine in many other ways that we will not describe in detail here, but the
most important points are:
• Return values of some controller callbacks are cached (charsets_provided,
content_types_provided, encodings_provided, last_modified, and generate_etag).
• More process dictionary use was removed (less global state, clearer code, easier testing).
• Separate logger process per request; even when a request crashes we have a log up to the point
of the crash.
• An HTTP Upgrade callback was added as a step after the forbidden access check to support
WebSockets.
• Originally, a controller was called a “resource”. We changed it to “controller” to make a clear
distinction between the (data-)resources being served and the code serving those resources.
• Some instrumentation was added to measure request speed and size.
A Simplified Benchmark
What a benchmark might do is show where you could optimize the system first.
With this in mind we benchmarked Zotonic using the TechEmpower JSON benchmark, which is
basically testing the request dispatcher, JSON encoder, HTTP request handling and the TCP/IP stack.
The benchmark was performed on an Intel i7 quad-core M620 @ 2.67 GHz. The command was wrk -c 3000 -t 3000 http://localhost:8080/json. The results are shown in Table 9.1.
Zotonic’s dynamic dispatcher and HTTP protocol abstraction give lower scores in such a microbenchmark. These issues are relatively easy to solve, and the solutions were already planned:
• Replace the standard webmachine logger with a more efficient one
• Compile the dispatch rules in an Erlang module (instead of a single process interpreting the
dispatch rule list)
• Replace the MochiWeb HTTP handler with the Elli HTTP handler
• Use byte arrays in Webmachine instead of the current character lists
The solution was a system with four virtual servers, each with 2 GB of RAM and running its own independent Zotonic system. Three nodes handled voting and one node was for administration. All nodes were independent, but each voting node shared every vote with at least two other nodes, so no vote would be lost if a node crashed.
A single vote generated ~30 HTTP requests for dynamic HTML (in multiple languages), Ajax, and static assets like CSS and JavaScript. Multiple requests were needed for selecting the three projects to
vote on and filling in the details of the voter.
When tested, we easily met the customer’s requirements without pushing the system to its limits. The voting simulation was stopped at 500,000 complete voting procedures per hour, using bandwidth of around 400 Mbps, with 99% of request handling times below 200 milliseconds.
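At roughly 30 HTTP requests per vote, 500,000 votes per hour works out to about 15 million requests per hour, on the order of 4,000 requests per second sustained across the voting nodes.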
From the above it is clear that Zotonic can handle popular dynamic web sites. On real hardware we
have observed much higher performance, especially for the underlying I/O and database performance.
9.10 Conclusion
When building a content management system or framework it is important to take the full stack of
your application into consideration, from the web server, the request handling system, the caching
systems, down to the database system. All parts must work well together for good performance.
Much performance can be gained by preprocessing data. An example of preprocessing is pre-escaping and sanitizing data before storing it in the database.
Caching hot data is a good strategy for web sites with a clear set of popular pages followed by a
long tail of less popular pages. Placing this cache in the same memory space as the request handling
code gives a clear edge over using separate caching servers, both in speed and simplicity.
Another optimization for handling sudden bursts in popularity is to dynamically match similar requests and process them once for the same result. When this is well implemented, a caching proxy can be avoided and all HTML pages can be generated dynamically.
Erlang is a great match for building dynamic web-based systems due to its lightweight multiprocessing, failure handling, and memory management.
Using Erlang, Zotonic makes it possible to build a very competent and well-performing con-
tent management system and framework without needing separate web servers, caching proxies,
memcache servers, or e-mail handlers. This greatly simplifies system management tasks.
On current hardware, a single Zotonic server can handle thousands of dynamic page requests per second, thus easily serving the vast majority of web sites on the World Wide Web.
Using Erlang, Zotonic is prepared for the future of multi-core systems with dozens of cores and
many gigabytes of memory.
9.11 Acknowledgements
The authors would like to thank Michiel Klønhammer (Maximonster Interactive Things), Andreas
Stenius, Maas-Maarten Zeeman and Atilla Erdődi.
[chapter 10]
10.1 Introduction
The last few years have brought us significant progress in mobile cellular network performance. But
many mobile applications cannot fully benefit from this advance due to inflated network latencies.
Latency has long been synonymous with mobile networking. Though progress has been made in
recent years, the reductions to network latency have not kept pace with the increases in speed. As a
consequence of this disparity, it is latency, not throughput, that is most often the factor limiting the
performance of network transactions.
This chapter has two logical sections. The first explores the particulars of mobile cellular networking that contribute to the latency problem. The second introduces software techniques that minimize the performance impact of elevated network latency.
Baseband Processor
Inside most mobile devices are actually two very sophisticated computers. The application processor
is responsible for hosting the operating system and applications, and is analogous to your computer
or laptop. The baseband processor is responsible for all wireless network functions, and is analogous
to a computer modem that uses radio waves instead of a phone line.1
The baseband processor is a consistent but usually negligible source of latency. High-speed wireless
networking is a frighteningly complicated affair. The sophisticated signal processing it requires
contributes a fixed delay, in the range of microseconds to milliseconds, to most network communica-
tions.
1 In fact, many mobile phones manage the baseband processor with an AT-like command set. See http://www.3gpp.org/ftp/Specs/html-info/0707.htm
Backhaul Network
A backhaul network is the dedicated WAN connection between a cell site, its controller, and the core
network. Backhaul networks have long been, and continue to be, notorious contributors of latency.
Backhaul network latency classically arises from the circuit-switched, or frame-based transport
protocols employed on older mobile networks (e.g., GSM, EV-DO). Such protocols exhibit latencies
due to their synchronous nature, where logical connections are represented by a channel that may only
receive or transmit data during a brief, pre-assigned time period. In contrast, the latest generation of
mobile networks employ IP-based packet-switched backhaul networks that support asynchronous
data transmission. This switchover has drastically reduced backhaul latency.
The bandwidth limitations of the physical infrastructure are a continuing bottleneck. Many
backhauls were not designed to handle the peak traffic loads that modern high-speed mobile networks
are capable of, and often demonstrate large variances in latency and throughput as they become
congested. Carriers are making efforts to upgrade these networks as quickly as possible, but this
component remains a weak point in many network infrastructures.
Power Conservation
One of the most significant sources of mobile network latency is directly related to the limited
capacity of mobile phone batteries.
The network radio of a high-speed mobile device can consume over 3 Watts of power when in
operation. This figure is large enough to drain the battery of an iPhone 5 in just over one hour. For
this reason mobile devices remove or reduce power to the radio circuitry at every opportunity. This
is ideal for extending battery life but also introduces a startup delay any time the radio circuitry is
repowered to deliver or receive data.
All mobile cellular network standards formalize a radio resource management (RRM) scheme to
conserve power. Most RRM conventions define three states—active, idle and disconnected—that
each represent some compromise between startup latency and power consumption.
Active
Active represents a state where data may be transmitted and received at high speed with minimal
latency.
This state consumes large amounts of power even when no data is being transferred. Short periods of network inactivity, often less than a second, trigger a transition to the lower-power idle state. The performance implication of
this is important to note: sufficiently long pauses during a network transaction can trigger additional
delays as the device fluctuates between the active and idle states.
Idle
Idle is a compromise of lower power usage and moderate startup latency.
The device remains connected to the network, unable to transmit or receive data but capable
of receiving network requests that require the active state to fulfill (e.g., incoming data). After a
reasonable period of network inactivity, usually a minute or less, the device will transition to the
disconnected state.
Disconnected
Disconnected has the lowest power usage with the largest startup delays.
The device is disconnected from the mobile network and the radio is deactivated. The radio is still activated periodically, albeit infrequently, to listen for network requests arriving over a special broadcast channel.
Disconnected shares the same latency sources as idle plus the additional delays of network
reconnection. Connecting to a mobile network is a complicated process involving multiple rounds
of message exchanges (i.e., signaling). At minimum, restoring a connection will take hundreds of
milliseconds, and it’s not unusual to see connection times in the seconds.
Conventionally, outside of connection recycling, there’s been no way to avoid the delay of the
TCP three-way handshake. However, this has changed recently with the introduction of the TCP Fast
Open IETF specification.
TCP Fast Open (TFO) allows the client to start sending data before the connection is logically
established. This effectively negates any round-trip delay from the three-way handshake. The
cumulative effect of this optimization is impressive. According to Google research, TFO can reduce
page load times by as much as 40%. Although still only a draft specification, TFO is already supported
by major browsers (Chrome 22+) and platforms (Linux 3.6+), with other vendors pledging to fully
support it soon.
TCP Fast Open is a modification to the three-way handshake allowing a small data payload (e.g.,
HTTP request) to be placed within the SYN message. This payload is passed to the application server
while the connection handshake is completed as it would otherwise.
Earlier extension proposals like TFO ultimately failed due to security concerns. TFO addresses
this issue with the notion of a secure token, or cookie, assigned to the client during the course of
a conventional TCP connection handshake, and expected to be included in the SYN message of a
TFO-optimized request.
There are some minor caveats to the use of TFO. The most notable is the lack of any idempotency guarantees for the request data supplied with the initiating SYN message. TCP ensures duplicate packets (duplication happens frequently) are ignored by the receiver, but this same assurance does not apply to the connection handshake. There are on-going efforts to standardize a solution in the
draft specification, but in the meantime TFO can still be safely deployed for idempotent transactions.
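To make the mechanics concrete, here is a minimal client-side sketch in Python. It is an illustration only: it assumes a Linux kernel with TFO enabled (the net.ipv4.tcp_fastopen sysctl) and a Python build that exposes socket.MSG_FASTOPEN, and example.com is a placeholder host.

import socket

# TFO client sketch: the request payload rides in the SYN via MSG_FASTOPEN.
# (Assumes Linux with TFO enabled; the kernel falls back to a normal handshake
# if it holds no TFO cookie for this server yet.)
request = b"GET /json HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # sendto() with MSG_FASTOPEN both initiates the connection and queues the
    # data; no separate connect() call is made.
    sock.sendto(request, socket.MSG_FASTOPEN, ("example.com", 80))
    response = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk
finally:
    sock.close()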
Keepalive
Keepalive is an HTTP convention enabling use of the same TCP connection across sequential requests.
At minimum a single round-trip—required for TCP’s three-way handshake—is avoided, saving tens
or hundreds of milliseconds per request. Further, keepalive has the additional, and often unheralded,
performance advantage of preserving the current TCP congestion window between requests, resulting
in far fewer cwnd exhaustion events.
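As a small illustration of the convention, the following Python snippet (host and paths are placeholders) reuses one HTTP/1.1 connection for several sequential requests instead of reconnecting each time:

import http.client

conn = http.client.HTTPConnection("example.com", 80)
try:
    for path in ("/a", "/b", "/c"):
        conn.request("GET", path)      # HTTP/1.1 connections are persistent by default
        resp = conn.getresponse()
        body = resp.read()             # drain the body before reusing the socket
        print(path, resp.status, len(body))
finally:
    conn.close()

Only the first request pays for the TCP handshake; the rest reuse the warm connection and its congestion window.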
Header Reduction
Perhaps surprising to some, many HTTP request types are not formally required to include any
headers. This can save significant space. It is a good rule of thumb to begin with zero headers, and
include only what’s necessary. Be on the lookout for any headers automagically tacked on by the
HTTP client or server. Some configuration may be necessary to disable this behavior.
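For instance, Python's http.client allows manual control over what is sent; the sketch below (placeholder host and path) suppresses the automatically added Accept-Encoding header and sends nothing beyond the request line and the Host header that HTTP/1.1 still requires:

import http.client

conn = http.client.HTTPConnection("example.com")
# putrequest() adds Host automatically; skip_accept_encoding suppresses the
# automatic "Accept-Encoding: identity" header, leaving a minimal request.
conn.putrequest("GET", "/data", skip_accept_encoding=True)
conn.endheaders()
print(conn.getresponse().status)
conn.close()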
Delta Encoding
Delta encoding is a compression technique that leverages the similarities between consecutive messages. A delta-encoded message is represented only by its differences from the previous one. JSON-formatted
messages with consistent formatting are particularly well suited for this technique.
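There is no single standard mechanism for this; as a rough sketch of the idea, the following computes and applies a field-level delta between two flat JSON-style messages (the field names are invented for illustration):

def delta(prev, curr):
    # Fields that changed or were added, plus the names of removed fields.
    changed = {k: v for k, v in curr.items() if prev.get(k) != v}
    removed = [k for k in prev if k not in curr]
    return {"changed": changed, "removed": removed}

def apply_delta(prev, d):
    curr = dict(prev)
    curr.update(d["changed"])
    for k in d["removed"]:
        curr.pop(k, None)
    return curr

prev = {"lat": 37.42, "lng": -122.08, "battery": 81, "state": "idle"}
curr = {"lat": 37.43, "lng": -122.08, "battery": 80, "state": "idle"}
d = delta(prev, curr)                  # only lat and battery need to be sent
assert apply_delta(prev, d) == curr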
Pipelining
Pipelining is an HTTP convention for submitting multiple sequential requests in a single transaction.
This has the performance advantages of HTTP keepalive, while also eliminating the round-trips
typically needed for the additional HTTP requests.
The good news: any technique that preserves the TCP connection between transactions, such as
HTTP’s keepalive convention, also preserves the TLS session. However, it’s not always practical to
maintain a long-lived secure TCP connection. Offered here are two methods that accelerate the TLS
handshake itself.
Session Resumption
The TLS session resumption feature allows a secure session to be preserved between TCP connections.
Session resumption eliminates the initial handshake message exchange reserved for the public key
cryptography that validates the server’s identity and establishes the symmetric encryption key. While
there’s some performance benefit to avoiding computationally expensive public crypto operations,
the greater time savings belongs to eliminating the round-trip delay of a single message exchange.
Earlier revisions of TLS (i.e., SSL) depended upon the server to preserve the session state, which
presented a real challenge to highly distributed server architectures. TLS session tickets offer a much
simpler solution. This extension allows the client to preserve session state in the form of an encrypted
payload (i.e., session ticket) granted by the server during the handshake process. Resuming a session
requires that the client submit this ticket at the beginning of the handshake.
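With Python's ssl module (3.6 or later) the client side of session reuse can be sketched as follows; example.com is a placeholder and the server must support session tickets or a session cache:

import socket
import ssl

host = "example.com"          # placeholder
ctx = ssl.create_default_context()

# First connection: full handshake; keep the resulting session object.
tls1 = ctx.wrap_socket(socket.create_connection((host, 443)), server_hostname=host)
session = tls1.session       # under TLS 1.3 the ticket may only arrive after some data is exchanged
tls1.close()

# Second connection: present the saved session (ticket) to resume.
tls2 = ctx.wrap_socket(socket.create_connection((host, 443)),
                       server_hostname=host, session=session)
print("resumed:", tls2.session_reused)
tls2.close()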
False Start
False start is a protocol modification originating from a clever observation of the TLS handshake:
technically, the client may send encrypted data immediately after transmitting its final handshake
message to the server. Acting on this insight, false start eliminates the round-trip delay normally
occurring as the client awaited the final handshake message from the server.
False start exhibits the same performance benefit as session resumption with the added benefit of
being stateless—client and server are relieved of the burden to manage session state. The majority of
web clients support false start with just minor changes. And surprisingly, in about 99% of the cases,
server support requires no changes at all, making this optimization immediately deployable in most
infrastructures.
Generally, the hosting platform provides a cache implementation to avoid frequent DNS queries.
The semantics of DNS caching are simple. Each DNS response contains a time-to-live (TTL) attribute
declaring how long the result may be cached. TTLs can range from seconds to days but are typically on the order of several minutes. Very low TTL values, usually under a minute, are used to support load distribution or to minimize downtime from server replacement or ISP failover.
The native DNS cache implementations of most platforms don’t account for the elevated round-trip times of mobile networks. Many mobile applications could benefit from a cache implementation that augments or replaces the stock solution. Suggested here are several cache strategies that, if deployed for application use, will eliminate the random and spurious delays caused by unnecessary DNS queries.
Refresh on Failure
Highly-available systems usually rely upon redundant infrastructures hosted within their IP address
space. Low-TTL DNS entries have the benefit of reducing the time a network client may refer to the address of a failed host, but at the same time they trigger many extra DNS queries. The TTL is a
compromise between minimizing downtime and maximizing client performance.
It makes no sense to degrade client performance in general when server failures are the exception to the rule. There is a simple solution to this dilemma: rather than strictly obeying the TTL, a cached DNS entry is refreshed only when a non-recoverable error is detected by a higher-level protocol such as
TCP or HTTP. Under most scenarios this technique emulates the behavior of a TTL-conformant DNS
cache while nearly eliminating the performance penalties normally associated with any DNS-based
high-availability solution.
It should be noted that this cache technique is likely incompatible with any DNS-based load-distribution scheme.
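A minimal sketch of the strategy, not tied to any particular resolver library (the simple IPv4-only gethostbyname lookup is an assumption made for brevity):

import socket

class RefreshOnFailureCache:
    """Cache resolved addresses indefinitely; refresh only after a failure."""

    def __init__(self):
        self._cache = {}

    def resolve(self, host, refresh=False):
        if refresh or host not in self._cache:
            self._cache[host] = socket.gethostbyname(host)
        return self._cache[host]

    def connect(self, host, port, timeout=5.0):
        try:
            return socket.create_connection((self.resolve(host), port), timeout)
        except OSError:
            # Connection-level failure: refresh the cached address and retry once.
            return socket.create_connection(
                (self.resolve(host, refresh=True), port), timeout)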
Asynchronous Refresh
Asynchronous refresh is an approach to DNS caching that (mostly) obeys posted TTLs while largely eliminating the latency of frequent DNS queries. An asynchronous DNS client library, such as c-ares, can be used to re-resolve an expiring entry in the background: requests continue to use the previously cached address while the refresh completes, so they rarely, if ever, wait on a DNS query.
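A rough sketch of the idea, using a background thread in place of an asynchronous resolver library (the fixed TTL and the gethostbyname lookup are simplifications; a real resolver would report the TTL itself):

import socket
import threading
import time

class AsyncRefreshCache:
    def __init__(self, ttl=60.0):
        self._ttl = ttl
        self._cache = {}                       # host -> (address, expiry)
        self._lock = threading.Lock()

    def resolve(self, host):
        now = time.monotonic()
        with self._lock:
            entry = self._cache.get(host)
        if entry is None:
            return self._store(host)           # only the very first lookup blocks
        addr, expiry = entry
        if now >= expiry:
            # Stale: return the old address now, refresh off the request path.
            threading.Thread(target=self._store, args=(host,), daemon=True).start()
        return addr

    def _store(self, host):
        addr = socket.gethostbyname(host)
        with self._lock:
            self._cache[host] = (addr, time.monotonic() + self._ttl)
        return addr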
10.9 Conclusion
Mitigating the impact of mobile networks’ inflated latency requires reducing the network round-
trips that exacerbate its effect. Employing software optimizations solely focused on minimizing
or eliminating round-trip protocol messaging is critical to surmounting this daunting performance
issue.
[chapter 11]
Warp
Kazu Yamamoto, Michael Snoyman, and Andreas Voellmy
Warp is a high-performance HTTP server library written in Haskell, a purely functional programming
language. Both Yesod, a web application framework, and mighty, an HTTP server, are implemented
over Warp. According to our throughput benchmark, mighty provides performance on a par with
nginx. This article will explain the architecture of Warp and how we achieved its performance. Warp
can run on many platforms, including Linux, BSD variants, Mac OS, and Windows. To simplify our
explanation, however, we will only talk about Linux for the remainder of this article.
Native Threads
Traditional servers use a technique called thread programming. In this architecture, each connection
is handled by a single process or native thread (sometimes called an OS thread).
This architecture can be further segmented based on the mechanism used for creating the processes
or native threads. When using a thread pool, multiple processes or native threads are created in
advance. An example of this is the prefork mode in Apache. Otherwise, a process or native thread is
spawned each time a connection is received. Figure 11.1 illustrates this.
The advantage of this architecture is that it enables developers to write clear code. In particular,
the use of threads allows the code to follow a simple and familiar flow of control and to use simple
procedure calls to fetch input or send output. Also, because the kernel assigns processes or native
threads to available cores, we can balance utilization of cores. Its disadvantage is that a large number
Figure 11.1: Native threads
of context switches between kernel and processes or native threads occur, resulting in performance
degradation.
Event-Driven Architecture
In the world of high-performance servers, the recent trend has been to take advantage of event-driven
programming. In this architecture multiple connections are handled by a single process (Figure 11.2).
Lighttpd is an example of a web server using this architecture.
Since there is no need to switch processes, fewer context switches occur, and performance is
thereby improved. This is its chief advantage.
On the other hand, this architecture substantially complicates the network program. In particular,
this architecture inverts the flow of control so that the event loop controls the overall execution of the
program. Programmers must therefore restructure their program into event handlers, each of which
execute only non-blocking code. This restriction prevents programmers from performing I/O using
procedure calls; instead more complicated asynchronous methods must be used. Along the same
lines, conventional exception handling methods are no longer applicable.
Figure 11.3: One process per core
One web server that uses this architecture is nginx. Node.js used the event-driven architecture in
the past, but recently it also implemented the prefork technique. The advantage of this architecture is
that it utilizes all cores and improves performance. However, it does not resolve the issue of programs
having poor clarity, due to the reliance on handler and callback functions.
User Threads
GHC’s user threads can be used to help solve the code clarity issue. In particular, we can handle
each HTTP connection in a new user thread. This thread is programmed in a traditional style, using
logically blocking I/O calls. This keeps the program clear and simple, while GHC handles the
complexities of non-blocking I/O and multi-core work dispatching.
Under the hood, GHC multiplexes user threads over a small number of native threads. GHC’s
run-time system includes a multi-core thread scheduler that can switch between user threads cheaply,
since it does so without involving any OS context switches.
GHC’s user threads are lightweight; modern computers can run 100,000 user threads smoothly.
They are robust; even asynchronous exceptions are caught (this feature is used by the timeout handler,
described in Section 11.2 and in Section 11.7.) In addition, the scheduler includes a multi-core load
balancing algorithm to help utilize capacity of all available cores.
When a user thread performs a logically blocking I/O operation, such as receiving or sending data
on a socket, a non-blocking call is actually attempted. If it succeeds, the thread continues immediately
without involving the I/O manager or the thread scheduler. If the call would block, the thread instead
registers interest for the relevant event with the run-time system’s I/O manager component and then
indicates to the scheduler that it is waiting. Independently, an I/O manager thread monitors events
and notifies threads when their events occur, causing them to be re-scheduled for execution. This all
happens transparently to the user thread, with no effort on the Haskell programmer’s part.
In Haskell, most computation is non-destructive. This means that almost all functions are thread-
safe. GHC uses data allocation as a safe point at which to switch the context of user threads. Because of the functional programming style, new data is created frequently, and it is known that such data allocation occurs regularly enough for context switching.
Though some languages provided user threads in the past, they are not commonly used now
because they were not lightweight or were not robust. Note that some languages provide library-level
coroutines but they are not preemptive threads. Note also that Erlang and Go provide lightweight
processes and lightweight goroutines, respectively.
As of this writing, mighty uses the prefork technique to fork processes in order to use more cores.
(Warp does not have this functionality.) Figure 11.4 illustrates this arrangement in the context of a
We found that the I/O manager component of the GHC run-time system itself has performance
bottlenecks. To solve this problem, we developed a parallel I/O manager that uses per-core event
registration tables and event monitors to greatly improve multi-core scaling. A Haskell program
with the parallel I/O manager is executed as a single process and multiple I/O managers run as native
threads to use multiple cores (Figure 11.5). Each user thread is executed on any one of the cores.
GHC version 7.8—which includes the parallel I/O manager—will be released in the autumn
of 2013. With GHC version 7.8, Warp itself will be able to use this architecture without any
modifications and mighty will not need to use the prefork technique.
Figure 11.6: Web Application Interface (WAI)
In Haskell, argument types of functions are separated by right arrows and the rightmost one is the
type of the return value. So, we can interpret the definition as: a WAI Application takes a Request
and returns a Response, used in the context where I/O is possible and resources are well managed.
After accepting a new HTTP connection, a dedicated user thread is spawned for the connection. It
first receives an HTTP request from a client and parses it to Request. Then, Warp gives the Request
to the WAI application and receives a Response from it. Finally, Warp builds an HTTP response
based on the Response value and sends it back to the client. This is illustrated in Figure 11.7.
The user thread repeats this procedure as necessary and terminates itself when the connection
is closed by the peer or an invalid request is received. The thread also terminates if a significant
amount of data is not received after a certain period of time (i.e., a timeout has occurred).
This means that 1,000 HTTP connections are established, with each connection sending 100
requests. 10 native threads are spawned to carry out these jobs.
The target web servers were compiled on Linux. For all requests, the same index.html file is
returned. We used nginx’s index.html, whose size is 151 bytes.
Since Linux/FreeBSD have many control parameters, we need to configure the parameters
carefully. You can find a good introduction to Linux parameter tuning in ApacheBench and HTTPerf.1
We carefully configured both mighty and nginx as follows:
• enabled file descriptor cache
• disabled logging
• disabled rate limitation
Here is the result:
The x-axis is the number of workers and the y-axis gives throughput, measured in requests per
second.
• mighty 2.8.4 (GHC 7.7): compiled with GHC version 7.7.20130504 (to be GHC version
7.8). It uses the parallel I/O manager with only one worker. GHC run-time option, +RTS -qa
-A128m -N<x> is specified where <x> is the number of cores and 128m is the allocation area
size used by the garbage collector.
• mighty 2.8.4 (GHC 7.6.3): compiled with GHC version 7.6.3 (which is the latest stable version).
1 http://gwan.com/en_apachebench_httperf.html
11.4 Key Ideas
We kept four key ideas in mind when implementing our high-performance server in Haskell:
1. Issuing as few system calls as possible
2. Using specialized function implementations and avoiding recalculation
3. Avoiding locks
4. Using proper data structures
So, we implemented a special formatter to generate GMT date strings. A comparison of our specialized function and the standard Haskell implementation using the criterion benchmark library showed that ours was much faster. But if an HTTP server accepts more than one request per second, even a fast formatter repeats identical work; the formatted date string can simply be cached and regenerated only when the clock advances to the next second.
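Warp's implementation is, of course, Haskell; purely to illustrate the caching idea in a compact form, a sketch in Python might look like this (format_date_time comes from the standard library's wsgiref module):

import time
from wsgiref.handlers import format_date_time

# Cache the formatted HTTP date and regenerate it at most once per second.
_cached = (None, None)        # (integer second, formatted date string)

def http_date():
    global _cached
    now = int(time.time())
    if _cached[0] != now:
        _cached = (now, format_date_time(now))
    return _cached[1]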
Avoiding Locks
Unnecessary locks are evil for programming. Our code sometimes uses unnecessary locks impercep-
tibly because, internally, the run-time systems or libraries use locks. To implement high-performance
servers, we need to identify such locks and avoid them if possible. It is worth pointing out that locks
will become much more critical under the parallel I/O manager. We will talk about how to identify
and avoid locks in Section 11.7 and Section 11.8.
0008
message=
000a
helloworld
0000
GET / HTTP/1.1
The HTTP parser must extract the /some/path pathname and the Content-Type header and
pass these to the application. When the application begins reading the request body, it must
strip off the chunk headers (e.g., 0008 and 000a) and instead provide the actual content, i.e.,
message=helloworld. It must also ensure that no more bytes are consumed after the chunk termi-
nator (0000) so as to not interfere with the next pipelined request.
Conduit
This article has mentioned a few times the concept of passing the request body to the application.
It has also hinted at the issue of the application passing a response back to the server, and the
server receiving data from and sending data to the socket. A final related point not yet discussed
is middleware: components that sit between the server and the application and modify the request or response. The definition of a middleware is simply a function that maps one Application to another Application.
The intuition behind this is that a middleware will take some “internal” application, preprocess
the request, pass it to the internal application to get a response, and then postprocess the response.
For our purposes, a good example would be a gzip middleware, which automatically compresses
response bodies.
A prerequisite for the creation of such middlewares is a means of modifying both incoming and
outgoing data streams. A standard approach historically in the Haskell world has been lazy I/O. With
lazy I/O, we represent a stream of values as a single, pure data structure. As more data is requested
from this structure, I/O actions are performed to grab the data from its source. Lazy I/O provides
a huge level of composability. However, for a high-throughput server, it presents a major obstacle:
resource finalization in lazy I/O is non-deterministic. Using lazy I/O, it would be easy for a server
under high load to quickly run out of file descriptors.
It would also be possible to use a lower-level abstraction, essentially dealing directly with read
and write functions. However, one of the advantages of Haskell is its high-level approach, allowing
us to reason about the behavior of our code. It’s also not obvious how such a solution would deal
with some of the common issues which arise when creating web applications. For example, it’s often
necessary to have a buffering solution, where we read a certain amount of data at one step (e.g., the
request header processing), and read the remainder in a separate part of the code base (e.g., the web
application).
To address this dilemma, the WAI protocol (and therefore Warp) is built on top of the conduit
package. This package provides an abstraction for streams of data. It keeps much of the composability
of lazy I/O, provides a buffering solution, and guarantees deterministic resource handling. Exceptions
are also kept where they belong, in the parts of your code which deal with I/O, instead of hiding
them in a data structure claiming to be pure.
Warp represents the incoming stream of bytes from the client as a Source, and writes data to
be sent to the client to a Sink. The Application is provided a Source with the request body, and
provides a response as a Source as well. Middlewares are able to intercept the Sources for the
request and response bodies and apply transformations to them. Figure 11.10 demonstrates how a
middleware fits between Warp and an application. The composability of the conduit package makes
this an easy and efficient operation.
ResponseFile is used to send a static file while ResponseBuilder and ResponseSource are
for sending dynamic contents created in memory. Each constructor includes both Status and
ResponseHeaders. ResponseHeaders is defined as a list of key/value header pairs.
system calls can be omitted thanks to the cache mechanism described in Section 11.7. The following subsection describes another performance-tuning technique used in the case of ResponseFile.
To send the response header and body in a single TCP packet (when possible), new Warp switched from writev() to send(). It uses send() with the MSG_MORE flag to store the header and sendfile() to send both the stored header and the file. This made the throughput at least 100 times faster according to our throughput benchmark.
Nothing indicates an error (with no reason specified) and Just encloses a successful value a. So,
timeout returns Nothing if an action is not completed in a specified time. Otherwise, a successful
value is returned wrapped with Just. The timeout function eloquently shows how great Haskell’s
composability is.
timeout is useful for many purposes, but its performance is inadequate for implementing high-
performance servers. The problem is that for each timeout created, this function will spawn a new
user thread. While user threads are cheaper than system threads, they still involve an overhead which
can add up. We need to avoid the creation of a user thread for each connection’s timeout handling.
So, we implemented a timeout system which uses only one user thread, called the timeout manager,
to handle the timeouts of all connections. At its core are the following two ideas:
• double IORefs
• safe swap and merge algorithm
Suppose that the status of each connection is described as either Active or Inactive. To clean up inactive connections, the timeout manager repeatedly inspects the status of each connection. If the status is Active, the timeout manager turns it to Inactive. If it is Inactive, the timeout manager kills the associated user thread.
Each status is referred to by an IORef. IORef is a reference whose value can be destructively
updated. In addition to the timeout manager, each user thread repeatedly turns its status to Active
through its own IORef as its connection actively continues.
The timeout manager uses a list of the IORefs to these statuses. A user thread spawned for a new
connection tries to prepend its new IORef for an Active status to the list. So, the list is a critical
section and we need atomicity to keep the list consistent.
Figure 11.12: A list of status values. A and I indicate Active and Inactive, respectively.
A standard way to keep consistency in Haskell is MVar. But MVar is slow, since each MVar
is protected with a home-brewed lock. Instead, we used another IORef to refer to the list and
atomicModifyIORef to manipulate it. atomicModifyIORef is a function for atomically updating an IORef’s value. It is implemented via CAS (compare-and-swap), which is much faster than locks.
The following is the outline of the safe swap and merge algorithm:
do xs  <- atomicModifyIORef ref (\ys -> ([], ys))     -- atomically swap the shared list with an empty list, []
   xs' <- manipulates_status xs                       -- toggle statuses and drop entries for killed threads
   atomicModifyIORef ref (\ys -> (merge xs' ys, ()))  -- atomically merge the pruned list back with any new entries
The timeout manager atomically swaps the list with an empty list. Then it manipulates the list by
toggling thread status or removing unnecessary status for killed user threads. During this process,
new connections may be created and their status values are inserted via atomicModifyIORef by
their corresponding user threads. Then, the timeout manager atomically merges the pruned list and
the new list. Thanks to the lazy evaluation of Haskell, the application of the merge function is done in
O(1) and the merge operation, which is in O(N), is postponed until its values are actually consumed.
Memory Allocation
When receiving and sending packets, buffers are allocated. These buffers are allocated as “pinned”
byte arrays, so that they can be passed to C procedures like recv() and send(). Since it is best
to receive or send as much data as possible in each system call, these buffers are moderately sized.
Unfortunately, GHC’s method for allocating large (larger than 409 bytes in 64 bit machines) pinned
byte arrays takes a global lock in the run-time system. This lock may become a bottleneck when
scaling beyond 16 cores, if each core user thread frequently allocates such buffers.
We performed an initial investigation of the performance impact of large pinned array allocation
for HTTP response header generation. For this purpose, GHC provides eventlog which can record
timestamps of each event. We surrounded a memory allocation function with the function to record
a user event. Then we compiled mighty with it and recorded the eventlog. The resulting eventlog is
illustrated in Figure 11.13.
The small vertical bars in the row labelled “HEC 0” indicate the event created by us. So, the area
surrounded by two bars is the time consumed by memory allocation. It is about 1/10 of an HTTP
session. We are discussing how to implement memory allocation without locks.
Recent network servers tend to use the epoll family. If workers share a listening socket and they
manipulate connections through the epoll family, the thundering herd problem appears again. This is because
the convention of the epoll family is to notify all processes or native threads. nginx and mighty
are victims of this new thundering herd.
The parallel I/O manager is free from the new thundering herd problem. In this architecture,
only one I/O manager accepts new connections through the epoll family. And other I/O managers
handle established connections.
11.9 Conclusion
Warp is a versatile web server library, providing efficient HTTP communication for a wide range
of use cases. In order to achieve its high performance, optimizations have been performed at many
levels, including network communications, thread management, and request parsing.
Haskell has proven to be an amazing language for writing such a code base. Features like
immutability by default make it easier to write thread-safe code and avoid extra buffer copying. The
multi-threaded run time drastically simplifies the process of writing event-driven code. And GHC’s
powerful optimizations mean that in many cases, we can write high-level code and still reap the
benefits of high performance. Yet with all of this performance, our code base is still relatively tiny
(under 1300 SLOC at time of writing). If you are looking to write maintainable, efficient, concurrent
code, Haskell should be a strong consideration.
[chapter 12]
12.1 Introduction
Bioinformatics and Big Data
The field of bioinformatics seeks to provide tools and analyses that facilitate understanding of
the molecular mechanisms of life on Earth, largely by analyzing and correlating genomic and
proteomic information. As increasingly large amounts of genomic information, including both
genome sequences and expressed gene sequences, becomes available, more efficient, sensitive, and
specific analyses become critical.
In DNA sequencing, a chemical and mechanical process essentially “digitizes” the information
present in DNA and RNA. These sequences are recorded using an alphabet of one letter per nucleotide.
Various analyses are performed on this sequence data to determine how it is structured into larger
building blocks and how it relates to other sequence data. This serves as the basis for the study of
biological evolution and development, genetics, and, increasingly, medicine.
Data on nucleotide chains comes from the sequencing process in strings of letters known as
reads. (The use of the term read in the bioinformatics sense is an unfortunate collision with the use
of the term in the computer science and software engineering sense. This is especially true as the
performance of reading reads can be tuned, as we will discuss. To disambiguate this unfortunate
collision we refer to sequences from genomes as genomic reads.) To analyze larger scale structures
and processes, multiple genomic reads must be fit together. This fitting is different than a jigsaw
puzzle in that the picture is often not known a priori and that the pieces may (and often do) overlap.
A further complication is introduced in that not all genomic reads are of perfect fidelity and may
contain a variety of errors, such as insertions or deletions of letters or substitutions of the wrong
letters for nucleotides. While having redundant reads can help in the assembly or fitting of the
puzzle pieces, it is also a hindrance because of this imperfect fidelity in all of the existing sequencing
technologies. The appearance of erroneous genomic reads scales with the volume of data and this
complicates assembly of the data.
As sequencing technology has improved, the volume of sequence data being produced has begun
to exceed the capabilities of computer hardware employing conventional methods for analyzing such
data. (Much of the state-of-the-art in sequencing technology produces vast quantities of genomic
reads, typically tens of millions to billions, each having a sequence of 50 to 100 nucleotides.) This
trend is expected to continue and is part of what is known as the Big Data [Varc] problem in the high
performance computing (HPC), analytics, and information science communities. With hardware
becoming a limiting factor, increasing attention has turned to ways to mitigate the problem with
software solutions. In this chapter, we present one such software solution and how we tuned and
scaled it to handle terabytes of data.
Our research focus has been on efficient pre-processing, in which various filters and binning
approaches trim, discard, and bin the genomic reads, in order to improve downstream analyses. This
approach has the benefit of limiting the changes that need to be made to downstream analyses, which
generally consume genomic reads directly.
In this chapter, we present our software solution and describe how we tuned and scaled it to efficiently handle increasingly large amounts of data.
Figure 12.1: Decomposition of a genomic sequence into 4-mers. In khmer, the forward sequence and reverse
complement of each k-mer are hashed to the same value, in recognition that DNA is double-stranded. See
Future Directions.
Since we want to tell you about how we measured and tuned this piece of open source software,
we’ll skip over much of the theory behind it. Suffice it to say that k-mer counting is central to much
of its operation. To compactly count a large number of k-mers, a data structure known as a Bloom
filter [Vard] is used (Figure 12.2). Armed with k-mer counts, we can then exclude highly redundant
data from further processing, a process known as “digital normalization”. We can also treat low
abundance sequence data as probable errors and exclude it from further processing, in an approach
to error trimming. These normalization and trimming processes greatly reduce the amount of raw
sequence data needed for further analysis, while mostly preserving information of interest.
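As a much-simplified sketch of the idea (khmer's real implementation is in C++; the table sizes below are arbitrary placeholders, whereas khmer uses carefully chosen primes and also hashes a k-mer and its reverse complement identically), counting k-mers with several independent hash tables looks roughly like this:

# Toy Bloom-filter-style k-mer counter: one hash, several tables of distinct sizes.
TABLE_SIZES = [1000003, 1000033, 1000037, 1000039]    # placeholder sizes
tables = [bytearray(size) for size in TABLE_SIZES]    # 8-bit saturating counters

def count_kmer(kmer):
    for size, table in zip(TABLE_SIZES, tables):
        slot = hash(kmer) % size
        if table[slot] < 255:
            table[slot] += 1

def kmer_count(kmer):
    # Counts can only be overestimated, so the estimate is the minimum over tables.
    return min(table[hash(kmer) % size] for size, table in zip(TABLE_SIZES, tables))

def count_sequence(seq, k=20):
    for i in range(len(seq) - k + 1):
        count_kmer(seq[i:i + k])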
Khmer is designed to operate on large data sets of millions to billions of genomic reads, containing tens of billions of unique k-mers. Some of our existing data sets require up to a terabyte of system memory simply to hold the k-mer counts.
Figure 12.2: A Bloom filter is essentially a large, fixed-size hash table, into which elements are inserted or
queried using multiple orthogonal hash functions, with no attempt at collision tracking; they are therefore
probabilistic data structures. Our implementation uses multiple distinct hash tables each with its own hash
function, but the properties are identical. We typically recommend that khmer’s Bloom filters be configured to
use as much main memory as is available, as this reduces collisions maximally.
This is not due to inefficient programming, however:
in [PHCK+ 12] we show that khmer is considerably more memory efficient than any exact set
membership scheme for a wide regime of interesting k-mer problems. It is unlikely that significant
improvements in memory usage can be obtained easily.
Our goal, then, is simple: in the face of these large data sets, we would like to optimize khmer for
processing time, including most especially the time required to load data from disk and count k-mers.
For the curious, the khmer sources and documentation can be cloned from GitHub at
http://github.com/ged-lab/khmer.git. Khmer has been available for about four years, but
only with the posting of several preprint papers have others started to use it; we estimate the user
population at around 100 groups based on e-mail interactions in 2012, although it seems to be
growing rapidly as it becomes clear that a large class of assembly problems is more readily tractable
with khmer [BHZ+ 12].
The core of the software is written in C++. This core consists of a data pump (the component
which moves data from online storage into physical RAM), parsers for genomic reads in several
common formats, and several k-mer counters. An application programming interface (API) is built
around the core. This API can, of course, be used from C++ programs, as we do with some of our test
drivers, but also serves as the foundation for a Python wrapper. A Python package is built upon the
Python wrapper. Numerous Python scripts are distributed along with the package. Thus, the khmer software, in its totality, is the combination of core components written in C++ for speed, higher-level interfaces exposed via Python for ease of manipulation, and an assortment of tool scripts, which provide convenient ways to perform various bioinformatics tasks.
The khmer software supports batch operation in multiple phases, each with separate data inputs
and outputs. For example, it can take a set of genomic reads, count k-mers in these, and then,
optionally, save the Bloom filter hash tables for later use. Later, it can use saved hash tables to
perform k-mer abundance filtering on a new set of genomic reads, saving the filtered data. This
flexibility to reuse earlier outputs and to decide what to keep allows a user to tailor a procedure
specific to his/her needs and storage constraints.
Tools
Profiling tools primarily concern themselves with the amount of time spent in any particular section
of code. To measure this quantity, they inject instrumentation into the code at compile time. This
instrumentation does change the size of functions, which may affect inlining during optimization.
The instrumentation also directly introduces some overhead on the total execution time; in particular,
the profiling of high traffic areas of code may result in a fairly significant overhead. So, if you are
also measuring the total elapsed time of execution for your code, you need to be mindful of how
profiling itself affects this. To gauge this, a simple external data collection mechanism, such as
/usr/bin/time, can be used to compare non-profiling and profiling execution times for an identical
set of optimization flags and operating parameters.
We gauged the effect of profiling by measuring the difference between profiled and non-profiled
code across a range of k sizes—smaller k values lead to more k-mers per genomic read, increasing
profiler-specific effects. For k = 20, we found that non-profiled code ran about 19% faster than
profiled code, and, for k = 30, that non-profiled code ran about 14% faster than profiled code.
Prior to any performance tuning, our profiling data showed that the k-mer counting logic was the
highest-traffic portion of the code, as we had predicted by eye. What was a little surprising was how significant a fraction it was (around 83% of the total time), contrasted with I/O operations against storage (around 5% of the total time, for one particular storage medium under low bandwidth contention).
Manual Instrumentation
Examining the performance of a piece of software with independent, external profilers is a quick and
convenient way to learn something about the execution times of various parts of software at a first
glance. However, profilers are generally not so good at reporting how much time code spends in a
particular spinlock within a particular function or what the input rate of your data is. To augment
or complement external profiling capabilities, manual instrumentation may be needed. Also, manual
1 If the size of a data cache is larger than the data being used in I/O performance benchmarks, then retrieval directly from the
cache rather than the original data source may skew the measurements from successive runs of the benchmarks. Having a
data source larger than the data cache helps guarantee data cycling in the cache, thereby giving the appearance of a continuous
stream of non-repeating data.
12.4 Tuning
Making software work more efficiently is quite a gratifying experience, especially in the face of
trillions of bytes passing through it. Our narrative will now turn to the various measures we took to
improve efficiency. We divide these into two parts: optimization of the reading and parsing of input
data and optimization of the manipulation and writing of the Bloom filter contents.
#define is_valid_dna(ch) \
    ((toupper(ch)) == 'A' || (toupper(ch)) == 'C' || \
     (toupper(ch)) == 'G' || (toupper(ch)) == 'T')

and:

#define twobit_repr(ch) \
    ((toupper(ch)) == 'A' ? 0LL : \
     (toupper(ch)) == 'T' ? 1LL : \
     (toupper(ch)) == 'C' ? 2LL : 3LL)
If you read the manual page for the toupper function or inspect the headers for the GNU C
library, you might find that it is actually a locale-aware function and not simply a macro. So, this
means that there is the overhead of calling a potentially non-trivial function involved—at least when
the GNU C library is being used. But, we are working with an alphabet of four ASCII characters.
A locale-aware function is overkill for our purposes. So, not only do we want to eliminate the
redundancy but we want to use something more efficient.
We decided to normalize the sequences to uppercase letters prior to validating them. (And,
of course, validation happens before attempting to convert them into hash codes.) While it might
be ideal to perform this normalization in the parser, it turns out that sequences can be introduced
to the Bloom filter via other routes. So, for the time being, we chose to normalize the sequences
immediately prior to validating them. This allows us to drop all calls to toupper in both the sequence
validator and in the hash encoders.
Considering that terabytes of genomic data may be passing through the sequence normalizer, it
is in our interests to optimize it as much as we can. One approach is to compare each byte against the lowercase range and, when it falls inside it, apply the offset that converts it to uppercase; for each and every byte, that costs one compare, one branch, and possibly one addition. Can we do better than this? As it turns out, yes. Every lowercase ASCII letter differs from its uppercase counterpart by only a single bit (0x20), so clearing that bit with a single bitwise AND uppercases the letter with no compares and no branches. Uppercase letters pass through unmolested; lowercase letters become uppercase. Perfect, just what we wanted. For our trouble, we gained about a 13% speedup in the runtime of the entire process (!)
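To show the trick itself (khmer's core is C++; this Python fragment only demonstrates the idea), the branch-free normalization relies on that single-bit difference between lowercase and uppercase ASCII letters:

def normalize(seq: bytes) -> bytes:
    # Clearing bit 0x20 uppercases 'a'-'z' and leaves 'A'-'Z' untouched.
    return bytes(ch & 0xDF for ch in seq)

assert normalize(b"acgtACGT") == b"ACGTACGT"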
Our Bloom filter’s hash tables are, to put it mildly, “expansive”. To increment the counts for the hash code of a particular k-mer means hitting almost N different memory pages, where N is the number of hash tables allocated to the filter. In many cases, the memory pages which need to be updated for the next k-mer are entirely different than those for the current one. This can lead to much cycling of memory pages from main memory without being able to utilize the benefits of caching. If we have a genomic read with a 79-character sequence and are scanning k-mers of length 20, and if we have 4 hash tables, then up to 240 (60 × 4) different memory pages are potentially being touched. If we are processing 50 million reads, then it is easy to see how costly this is. What to do about it?
One solution is to batch the hash table updates. By accumulating a number of hash codes for
various k-mers and then periodically using them to increment counts on a table-by-table basis, we can
greatly improve cache utilization. Initial work on this front looks quite promising and, hopefully, by
the time you are reading this, we will have fully integrated this modification into our code. Although
we did not mention it earlier in our discussion of measurement and profiling, cachegrind, a program
which is part of the open-source Valgrind [eac] distribution, is a very useful tool for gauging the
effectiveness of this kind of work.
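A sketch of the batching idea (Python for illustration; khmer's real work happens in its C++ core, and the table sizes are the same placeholders used in the earlier sketch):

# Batched, table-by-table counter updates to improve cache locality.
TABLE_SIZES = [1000003, 1000033, 1000037, 1000039]    # placeholder sizes
tables = [bytearray(size) for size in TABLE_SIZES]

BATCH_SIZE = 4096
pending = []                                          # hash codes awaiting insertion

def count_kmer_batched(kmer):
    pending.append(hash(kmer))
    if len(pending) >= BATCH_SIZE:
        flush()

def flush():
    # Touch one table at a time so its pages stay hot in the cache for the whole batch.
    for size, table in zip(TABLE_SIZES, tables):
        for h in pending:
            slot = h % size
            if table[slot] < 255:                     # saturating 8-bit counter
                table[slot] += 1
    pending.clear()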
12.6 Parallelization
With the proliferation of multi-core architectures in today’s world, it is tempting to try taking
advantage of them. However, unlike many other problem domains, such as computational fluid
dynamics or molecular dynamics, our Big Data problem relies on high throughput processing of
data—it must become essentially I/O-bound beyond a certain point of parallelization. Beyond this
point, throwing additional threads at it does not help as the bandwidth to the storage media is saturated
and the threads simply end up with increased blocking or I/O wait times. That said, utilizing some
threads can be useful, particularly if the data to be processed is held in physical RAM, which generally
has a much higher bandwidth than online storage. As discussed previously, we have implemented
a prefetch buffer in conjunction with direct input. Multiple threads can use this buffer; more will
be said about this below. I/O bandwidth is not the only finite resource which multiple threads must
share. The hash tables used for k-mer counting are another one. Shared access to these will also be
discussed below.
Scaling
Was making the khmer software scalable worth our effort? Yes. Of course, we did not achieve
perfectly linear speedup. But, for every doubling of the number of cores, we presently get about a
factor of 1.9 speedup.
In parallel computing, one must be mindful of Amdahl’s Law [Vara] and the Law of Diminishing
Returns. The common formulation of Amdahl’s Law, in the context of parallel computing, is
S(N) = 1 / ((1 − P) + P/N), where S is the speedup achieved given N CPU cores and P is the proportion of the code which is parallelized. In the limit N → ∞, S approaches 1/(1 − P), a constant. The I/O bandwidth of the storage system, which the software utilizes, is finite and non-scalable; this contributes to a non-zero (1 − P). Moreover, contention for shared resources in the parallelized portion means that the P/N term is, in reality, closer to P/(N · l), where l < 1, versus the ideal case of l = 1. Therefore, returns will diminish over a finite number of cores.
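For a concrete feel for these numbers, a few lines of Python evaluate the idealized formula:

def amdahl_speedup(n_cores, p_parallel):
    # S(N) = 1 / ((1 - P) + P / N)
    return 1.0 / ((1.0 - p_parallel) + p_parallel / n_cores)

for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(n, 0.95), 2))
# Even with 95% of the code parallelized, 16 cores yield only about a 9x speedup,
# and the limit as N grows is 1 / (1 - 0.95) = 20x.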
12.9 Acknowledgements
We thank Alexis Black-Pyrkosz and Rosangela Canino-Koning for comments and discussion.
[ABB+ 86] Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis
Tevanian, and Michael Young. Mach: A New Kernel Foundation for UNIX Devel-
opment. In Proceedings of the Summer 1986 USENIX Technical Conference and
Exhibition, pages 93–112, June 1986.
[AOS+ 00] Alexander B. Arulanthu, Carlos O’Ryan, Douglas C. Schmidt, Michael Kircher, and
Jeff Parsons. The Design and Performance of a Scalable ORB Architecture for
CORBA Asynchronous Messaging. In Proceedings of the Middleware 2000 Con-
ference. ACM/IFIP, April 2000.
[ATK05] Anatoly Akkerman, Alexander Totok, and Vijay Karamcheti. Infrastructure for Auto-
matic Dynamic Deployment of J2EE Applications in Distributed Environments. In
3rd International Working Conference on Component Deployment (CD 2005), pages
17–32, Grenoble, France, November 2005.
[BHZ+ 12] CT Brown, A Howe, Q Zhang, A Pyrkosz, and TH Brom. A reference-free algorithm
for computational normalization of shotgun sequencing data. In review at PLoS One,
July 2012; Preprint at http://arxiv.org/abs/1203.4802, 2012.
[bsd] bit.ly software developers. dablooms: a scalable, counting Bloom filter. http://
github.com/bitly/dablooms.
[BW11] Amy Brown and Greg Wilson. The Architecture Of Open Source Applications. lulu.com,
June 2011.
[CJRS89] David D. Clark, Van Jacobson, John Romkey, and Howard Salwen. An Analysis of
TCP Processing Overhead. IEEE Communications Magazine, 27(6):23–29, June 1989.
[CT90] David D. Clark and David L. Tennenhouse. Architectural Considerations for a New
Generation of Protocols. In Proceedings of the Symposium on Communications Archi-
tectures and Protocols (SIGCOMM), pages 200–208. ACM, September 1990.
[DBCP97] Mikael Degermark, Andrej Brodnik, Svante Carlsson, and Stephen Pink. Small For-
warding Tables for Fast Routing Lookups. In Proceedings of the ACM SIGCOMM ’97
Conference on Applications, Technologies, Architectures, and Protocols for Computer
Communication, pages 3–14. ACM Press, 1997.
[DBO+ 05] Gan Deng, Jaiganesh Balasubramanian, William Otte, Douglas C. Schmidt, and Anirud-
dha Gokhale. DAnCE: A QoS-enabled Component Deployment and Configuration
Engine. In Proceedings of the 3rd Working Conference on Component Deployment
(CD 2005), pages 67–82, November 2005.
[DEG+ 12] Abhishek Dubey, William Emfinger, Aniruddha Gokhale, Gabor Karsai, William Otte,
Jeffrey Parsons, Csanad Czabo, Alessandro Coglio, Eric Smith, and Prasanta Bose. A
Software Platform for Fractionated Spacecraft. In Proceedings of the IEEE Aerospace
Conference, 2012, pages 1–20. IEEE, March 2012.
[DP93] Peter Druschel and Larry L. Peterson. Fbufs: A High-Bandwidth Cross-Domain Trans-
fer Facility. In Proceedings of the 14th Symposium on Operating System Principles
(SOSP), December 1993.
[eaa] A. D. Malony et al. TAU: Tuning and Analysis Utilities. http://www.cs.uoregon.
edu/Research/tau/home.php.
[eab] C. Titus Brown et al. khmer: genomic data filtering and partitioning software. http:
//github.com/ged-lab/khmer.
[eac] Julian Seward et al. Valgrind. http://valgrind.org/.
[EK96] Dawson R. Engler and M. Frans Kaashoek. DPF: Fast, Flexible Message Demulti-
plexing using Dynamic Code Generation. In Proceedings of ACM SIGCOMM ’96
Conference in Computer Communication Review, pages 53–59. ACM Press, August
1996.
[FHHC07] D. R. Fatland, M. J. Heavner, E. Hood, and C. Connor. The SEAMONSTER Sen-
sor Web: Lessons and Opportunities after One Year. AGU Fall Meeting Abstracts,
December 2007.
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns:
Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[GNS+ 02] Aniruddha Gokhale, Balachandran Natarajan, Douglas C. Schmidt, Andrey Nechy-
purenko, Jeff Gray, Nanbor Wang, Sandeep Neema, Ted Bapty, and Jeff Parsons.
CoSMIC: An MDA Generative Tool for Distributed Real-time and Embedded Compo-
nent Middleware and Applications. In Proceedings of the OOPSLA 2002 Workshop on
Generative Techniques in the Context of Model Driven Architecture. ACM, November
2002.
[HC01] George T. Heineman and Bill T. Councill. Component-Based Software Engineering:
Putting the Pieces Together. Addison-Wesley, 2001.
[HP88] Norman C. Hutchinson and Larry L. Peterson. Design of the x-Kernel. In Proceedings
of the SIGCOMM ’88 Symposium, pages 65–75, August 1988.
[HV05] Jahangir Hasan and T. N. Vijaykumar. Dynamic pipelining: Making IP-lookup Truly
Scalable. In SIGCOMM ’05: Proceedings of the 2005 Conference on Applications,
technologies, architectures, and protocols for computer communications, pages 205–
216. ACM Press, 2005.
[Insty] Institute for Software Integrated Systems. Component-Integrated ACE ORB (CIAO).
www.dre.vanderbilt.edu/CIAO, Vanderbilt University.
[KOS+ 08] John S. Kinnebrew, William R. Otte, Nishanth Shankaran, Gautam Biswas, and Dou-
glas C. Schmidt. Intelligent Resource Management and Dynamic Adaptation in a
Distributed Real-time and Embedded Sensor Web System. Technical Report ISIS-08-
906, Vanderbilt University, 2008.
[mem] OpenMP members. OpenMP. http://openmp.org.
[MJ93] Steven McCanne and Van Jacobson. The BSD Packet Filter: A New Architecture for
User-level Packet Capture. In Proceedings of the Winter USENIX Conference, pages
259–270, January 1993.
[MRA87] Jeffrey C. Mogul, Richard F. Rashid, and Michal J. Accetta. The Packet Filter: an Effi-
cient Mechanism for User-level Network Code. In Proceedings of the 11th Symposium
on Operating System Principles (SOSP), November 1987.
[NO88] M. Nelson and J. Ousterhout. Copy-on-Write For Sprite. In USENIX Summer Confer-
ence, pages 187–201. USENIX Association, June 1988.
[Obj06] ObjectWeb Consortium. CARDAMOM - An Enterprise Middleware for Building
Mission and Safety Critical Applications. cardamom.objectweb.org, 2006.
[OGS11] William R. Otte, Aniruddha Gokhale, and Douglas C. Schmidt. Predictable Deployment
in Component-based Enterprise Distributed Real-time and Embedded Systems. In
Proceedings of the 14th international ACM Sigsoft Symposium on Component Based
Software Engineering, CBSE ’11, pages 21–30. ACM, 2011.
[OGST13] William Otte, Aniruddha Gokhale, Douglas Schmidt, and Alan Tackett. Efficient and
Deterministic Application Deployment in Component-based, Enterprise Distributed,
Real-time, and Embedded Systems. Elsevier Journal of Information and Software
Technology (IST), 55(2):475–488, February 2013.
[OMG04] Object Management Group. Lightweight CCM FTF Convenience Document, ptc/04-
06-10 edition, June 2004.
[OMG06] OMG. Deployment and Configuration of Component-based Distributed Applications,
v4.0, Document formal/2006-04-02 edition, April 2006.
[OMG08] Object Management Group. The Common Object Request Broker: Architecture
and Specification Version 3.1, Part 2: CORBA Interoperability, OMG Document
formal/2008-01-07 edition, January 2008.
[PDZ00] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. IO-Lite: A Unified I/O Buffering
and Caching System. ACM Transactions of Computer Systems, 18(1):37–66, 2000.
[PHCK+ 12] J Pell, A Hintze, R Canino-Koning, A Howe, JM Tiedje, and CT Brown. Scaling
metagenome sequence assembly with probabilistic de Bruijn graphs. Accepted at
PNAS, July 2012; Preprint at http://arxiv.org/abs/1112.4193, 2012.
[RDR+ 97] Y. Rekhter, B. Davie, E. Rosen, G. Swallow, D. Farinacci, and D. Katz. Tag Switching
Architecture Overview. Proceedings of the IEEE, 85(12):1973–1983, December 1997.
[SHS+ 06] Dipa Suri, Adam Howell, Nishanth Shankaran, John Kinnebrew, Will Otte, Douglas C.
Schmidt, and Gautam Biswas. Onboard Processing using the Adaptive Network
Architecture. In Proceedings of the Sixth Annual NASA Earth Science Technology
Conference, June 2006.
[SK03] Sartaj Sahni and Kun Suk Kim. Efficient Construction of Multibit Tries for IP Lookup.
IEEE/ACM Trans. Netw., 11(4):650–662, 2003.
[SNG+ 02] Douglas C. Schmidt, Bala Natarajan, Aniruddha Gokhale, Nanbor Wang, and Christo-
pher Gill. TAO: A Pattern-Oriented Object Request Broker for Distributed Real-time
and Embedded Systems. IEEE Distributed Systems Online, 3(2), February 2002.
[SSRB00] Douglas C. Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann. Pattern-
Oriented Software Architecture: Patterns for Concurrent and Networked Objects,
Volume 2. Wiley & Sons, New York, 2000.
[SV95] M. Shreedhar and George Varghese. Efficient Fair Queueing using Deficit Round
Robin. In SIGCOMM ’95: Proceedings of the conference on Applications, technologies,
architectures, and protocols for computer communication, pages 231–242. ACM Press,
1995.
[Vara] Various. Amdahl’s Law. http://en.wikipedia.org/w/index.php?title=
Amdahl%27s_law&oldid=515929929.
[Varb] Various. atomic operations. http://en.wikipedia.org/w/index.php?title=
Linearizability&oldid=511650567.
[WDS+ 11] Jules White, Brian Dougherty, Richard Schantz, Douglas C. Schmidt, Adam Porter,
and Angelo Corsaro. R&D Challenges and Solutions for Highly Complex Distributed
Systems: a Middleware Perspective. the Springer Journal of Internet Services and
Applications special issue on the Future of Middleware, 2(3), December 2011.
[WKNS05] Jules White, Boris Kolpackov, Balachandran Natarajan, and Douglas C. Schmidt.
Reducing Application Code Complexity with Vocabulary-specific XML language
Bindings. In ACM-SE 43: Proceedings of the 43rd annual Southeast regional confer-
ence, 2005.
Colophon
The cover font is Museo from the exljbris foundry, by Jos Buivenga. The text font is TeX Gyre Termes and the heading font is TeX Gyre Heros, both by Bogusław Jackowski and Janusz M. Nowacki.
The code font is Inconsolata by Raph Levien.
The front cover photo is of the former workings of the turret clock of St. Stephen’s Cathedral
in Vienna. The workings can now be seen in the Vienna Clock Museum. The picture was taken by
Michelle Enemark. (http://www.mjenemark.com/)
This book was built with open source software (with the exception of the cover). Programs like
LaTeX, Pandoc, Python, and Calibre (ebook-convert) were especially helpful.