Commit 214eb2c

kmod and corona10 authored
gh-90536: Add support for the BOLT post-link binary optimizer (gh-95908)
* Add support for the BOLT post-link binary optimizer

  Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt) provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.

  It is gated behind an `--enable-bolt` configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt).

  Compared to [a previous attempt](faster-cpython/ideas#224), this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE flags which enable much better optimizations from bolt.

  The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior which seems to be a bit more variable between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance I got a slightly lower speedup (1%/3%).

  The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.

  This change uses the existing pgo profiling task (`python -m test --pgo`), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing pgo task.

* Simplify the build flags
* Add a NEWS entry
* Update Makefile.pre.in

  Co-authored-by: Dong-hee Na <[email protected]>
* Update configure.ac

  Co-authored-by: Dong-hee Na <[email protected]>
* Add myself to ACKS
* Add docs
* Other review comments
* fix tab/space issue
* Make it more clear that --enable-bolt is experimental
* Add link to bolt's github page

Co-authored-by: Dong-hee Na <[email protected]>
1 parent 22a95cb commit 214eb2c

7 files changed: +351 -1 lines changed

Doc/using/configure.rst

Lines changed: 20 additions & 1 deletion
@@ -191,7 +191,8 @@ Performance options
 -------------------

 Configuring Python using ``--enable-optimizations --with-lto`` (PGO + LTO) is
-recommended for best performance.
+recommended for best performance. The experimental ``--enable-bolt`` flag can
+also be used to improve performance.

 .. cmdoption:: --enable-optimizations

@@ -231,6 +232,24 @@ recommended for best performance.
    .. versionadded:: 3.11
       To use ThinLTO feature, use ``--with-lto=thin`` on Clang.

+.. cmdoption:: --enable-bolt
+
+   Enable usage of the `BOLT post-link binary optimizer
+   <https://github.com/llvm/llvm-project/tree/main/bolt>` (disabled by
+   default).
+
+   BOLT is part of the LLVM project but is not always included in their binary
+   distributions. This flag requires that ``llvm-bolt`` and ``merge-fdata``
+   are available.
+
+   BOLT is still a fairly new project so this flag should be considered
+   experimental for now. Because this tool operates on machine code its success
+   is dependent on a combination of the build environment + the other
+   optimization configure args + the CPU architecture, and not all combinations
+   are supported.
+
+   .. versionadded:: 3.12
+
 .. cmdoption:: --with-computed-gotos

    Enable computed gotos in evaluation loop (enabled by default on supported
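The new documentation states that ``--enable-bolt`` requires ``llvm-bolt`` and ``merge-fdata`` to be available. A minimal sketch of a pre-flight check for that requirement follows. It is only an illustration and is not how `configure` itself performs the lookup; the function name `missing_bolt_tools` and its injectable `which` parameter are hypothetical, introduced here so the check can be exercised without the tools installed.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tool names taken from the documentation text in the diff above.
REQUIRED_TOOLS = ("llvm-bolt", "merge-fdata")


def missing_bolt_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
    tools: Iterable[str] = REQUIRED_TOOLS,
) -> List[str]:
    """Return the required tools that `which` cannot locate.

    `which` defaults to shutil.which, which searches PATH and returns
    None when no executable with that name is found. It is a parameter
    (an assumption of this sketch) so the lookup can be stubbed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_bolt_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("llvm-bolt and merge-fdata found; --enable-bolt can be tried")
```

Running the script before `./configure --enable-bolt` gives an early hint when an LLVM binary distribution was installed without BOLT, which the commit message notes was the case for LLVM 14.0.6.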

Doc/whatsnew/3.12.rst

Lines changed: 4 additions & 0 deletions
@@ -133,6 +133,10 @@ Optimizations
   It reduces object size by 8 or 16 bytes on 64bit platform. (:pep:`623`)
   (Contributed by Inada Naoki in :gh:`92536`.)

+* Added experimental support for using the BOLT binary optimizer in the build
+  process, which improves performance by 1-5%.
+  (Contributed by Kevin Modzelewski in :gh:`90536`.)
+

 CPython bytecode changes
 ========================

Makefile.pre.in

Lines changed: 10 additions & 0 deletions
@@ -640,6 +640,16 @@ profile-opt: profile-run-stamp
 	-rm -f profile-clean-stamp
 	$(MAKE) @DEF_MAKE_RULE@ CFLAGS_NODIST="$(CFLAGS_NODIST) $(PGO_PROF_USE_FLAG)" LDFLAGS_NODIST="$(LDFLAGS_NODIST)"

+bolt-opt: @PREBOLT_RULE@
+	rm -f *.fdata
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
+	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+	rm -f *.fdata
+	rm -f $(BUILDPYTHON).bolt_inst
+	mv $(BUILDPYTHON).bolt $(BUILDPYTHON)
+
 # Compile and run with gcov
 .PHONY=coverage coverage-lcov coverage-report
 coverage:
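The `bolt-opt` rule above runs in four phases: instrument the built interpreter, run the profiling task with the instrumented binary, merge the per-process `.fdata` profiles, and re-run `llvm-bolt` with the merged profile to produce the optimized binary. A minimal sketch that spells out those commands as plain strings may make the sequence easier to follow. The function name `bolt_opt_steps` and its default arguments are assumptions made for illustration: the Makefile substitutes `@LLVM_BOLT@`, `@MERGE_FDATA@`, `$(BUILDPYTHON)` and `$(PROFILE_TASK)` at configure and build time, and the `-m test --pgo` default is taken from the commit message rather than from the Makefile.

```python
import os

# Optimization flags copied verbatim from the bolt-opt rule in the diff.
BOLT_OPT_FLAGS = (
    "-update-debug-sections",
    "-reorder-blocks=ext-tsp",
    "-reorder-functions=hfsort+",
    "-split-functions=3",
    "-icf=1",
    "-inline-all",
    "-split-eh",
    "-reorder-functions-use-hot-size",
    "-peepholes=all",
    "-jump-tables=aggressive",
    "-inline-ap",
    "-indirect-call-promotion=all",
    "-dyno-stats",
    "-use-gnu-stack",
    "-frame-opt=hot",
)


def bolt_opt_steps(binary="python", profile_task="-m test --pgo",
                   llvm_bolt="llvm-bolt", merge_fdata="merge-fdata"):
    """Return the shell commands of the bolt-opt rule, in order.

    All parameter defaults are illustrative stand-ins for values the
    real build substitutes. Nothing is executed here.
    """
    # Mirrors $(abspath $(BUILDPYTHON).bolt): profiles are written under
    # an absolute prefix, with the process id appended per process.
    profile_prefix = os.path.abspath(binary + ".bolt")
    return [
        "rm -f *.fdata",
        # Phase 1: emit an instrumented copy of the interpreter.
        f"{llvm_bolt} ./{binary} -instrument -instrumentation-file-append-pid "
        f"-instrumentation-file={profile_prefix} -o {binary}.bolt_inst",
        # Phase 2: run the training workload; failures are tolerated.
        f"./{binary}.bolt_inst {profile_task} || true",
        # Phase 3: merge the per-process profiles into one file.
        f"{merge_fdata} {binary}.*.fdata > {binary}.fdata",
        # Phase 4: optimize the original binary using the merged profile.
        f"{llvm_bolt} ./{binary} -o {binary}.bolt -data={binary}.fdata "
        + " ".join(BOLT_OPT_FLAGS),
        # Cleanup, then replace the interpreter with the optimized one.
        "rm -f *.fdata",
        f"rm -f {binary}.bolt_inst",
        f"mv {binary}.bolt {binary}",
    ]


if __name__ == "__main__":
    for step in bolt_opt_steps():
        print(step)
```

The commands are returned as strings rather than argument lists because the merge step relies on shell globbing and output redirection, just as the Makefile recipe does.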

Misc/ACKS

Lines changed: 1 addition & 0 deletions
@@ -1212,6 +1212,7 @@ Gideon Mitchell
 Tim Mitchell
 Zubin Mithra
 Florian Mladitsch
+Kevin Modzelewski
 Doug Moen
 Jakub Molinski
 Juliette Monsel
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+Use the BOLT post-link optimizer to improve performance, particularly on
+medium-to-large applications.

configure

Lines changed: 261 additions & 0 deletions
Some generated files are not rendered by default.
