ELF: Switch to parallelSort for RELR relocations. #138370

pcc · 2025-05-03T00:26:25Z

For firefox-x64, one of the more time-consuming parts
of finalizeSections() was the call to llvm::sort in
RelrSection::updateAllocSize(). Switching that call to llvm::parallelSort
yielded the following improvement on firefox-x64 with ldflags -S on
an Apple M2 Ultra:

    N           Min           Max        Median           Avg        Stddev
x 512     1.1446024     1.2462944     1.1918706     1.1929871      0.016145
+ 512     1.1142867     1.2350058     1.1858642     1.1858839   0.016219708
Difference at 95.0% confidence
	-0.00710318 +/- 0.00198234
	-0.595412% +/- 0.166166%
	(Student's t, pooled s = 0.0161824)
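
The "Difference at 95.0% confidence" lines can be reproduced from the two summary rows. The worked check below assumes that `x` is the llvm::sort baseline and `+` the parallelSort build, and that the interval uses the large-sample critical value t = 1.960. That value is an inference: it reproduces the printed half-width exactly, while the exact t for 1022 degrees of freedom is slightly larger. With n_1 = n_2 = 512, the factor sqrt(1/n_1 + 1/n_2) is exactly 1/16.

```latex
\begin{aligned}
\Delta &= \bar{x}_{+} - \bar{x}_{x} = 1.1858839 - 1.1929871 \approx -0.0071032 \\
s_p &= \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}
     = \sqrt{\frac{0.016145^2 + 0.016219708^2}{2}} \approx 0.0161824 \\
h &= t \, s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}
   = 1.960 \times 0.0161824 \times \tfrac{1}{16} \approx 0.00198234 \\
\Delta / \bar{x}_{x} &\approx -0.595\%, \qquad h / \bar{x}_{x} \approx 0.166\%
\end{aligned}
```

The interval excludes zero, so the roughly 0.6% reduction is statistically significant at this sample size, even though it is small in absolute terms.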

Created using spr 1.3.6-beta.1

llvmbot · 2025-05-03T00:26:56Z

@llvm/pr-subscribers-lld-elf

Author: Peter Collingbourne (pcc)

Full diff: https://github.com/llvm/llvm-project/pull/138370.diff

1 Files Affected:

(modified) lld/ELF/SyntheticSections.cpp (+1-1)

diff --git a/lld/ELF/SyntheticSections.cpp b/lld/ELF/SyntheticSections.cpp
index 2531227cb99b7..eceb297dbfc0d 100644
--- a/lld/ELF/SyntheticSections.cpp
+++ b/lld/ELF/SyntheticSections.cpp
@@ -2111,7 +2111,7 @@ template <class ELFT> bool RelrSection<ELFT>::updateAllocSize(Ctx &ctx) {
   std::unique_ptr<uint64_t[]> offsets(new uint64_t[relocs.size()]);
   for (auto [i, r] : llvm::enumerate(relocs))
     offsets[i] = r.getOffset();
-  llvm::sort(offsets.get(), offsets.get() + relocs.size());
+  llvm::parallelSort(offsets.get(), offsets.get() + relocs.size());
 
   // For each leading relocation, find following ones that can be folded
   // as a bitmap and fold them.
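
The sort is needed because the folding loop that follows the changed line walks the offsets in ascending order. SHT_RELR stores an address entry (low bit clear) followed by bitmap entries (low bit set), and each bitmap covers the next 63 words on a 64-bit target. Below is a minimal sketch of that encoding, modeled on the RELR scheme rather than copied from lld. `encodeRelr` is a hypothetical helper, and it assumes the offsets are already sorted and 8-byte aligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not lld's API): encode sorted, 8-byte-aligned
// relocation offsets as SHT_RELR entries for a 64-bit target.
// Address entries have bit 0 clear. In a bitmap entry (bit 0 set),
// bit k (k >= 1) marks the word at base + (k - 1) * 8.
std::vector<uint64_t> encodeRelr(const std::vector<uint64_t> &offsets) {
  constexpr uint64_t wordSize = 8;
  constexpr uint64_t nBits = wordSize * 8 - 1; // 63 words per bitmap entry
  assert(std::is_sorted(offsets.begin(), offsets.end()) && "input must be sorted");
  std::vector<uint64_t> out;
  for (size_t i = 0, e = offsets.size(); i != e;) {
    assert(offsets[i] % wordSize == 0 && "offsets must be word aligned");
    out.push_back(offsets[i]);             // leading relocation: address entry
    uint64_t base = offsets[i] + wordSize; // first word a bitmap can cover
    ++i;
    // Fold following relocations into as many bitmaps as they fill.
    for (;;) {
      uint64_t bitmap = 0;
      for (; i != e; ++i) {
        // Unsigned wrap makes d huge if offsets[i] < base (e.g. a duplicate).
        uint64_t d = offsets[i] - base;
        if (d >= nBits * wordSize || d % wordSize)
          break;
        bitmap |= uint64_t(1) << (d / wordSize);
      }
      if (!bitmap)
        break;                          // next offset needs a new address entry
      out.push_back((bitmap << 1) | 1); // tag the bitmap with bit 0
      base += nBits * wordSize;
    }
  }
  return out;
}
```

For example, offsets 0x1000, 0x1008, 0x1010 and 0x1040 encode to two entries, 0x1000 and 0x107. With unsorted input, the same loop would emit far more address entries, which is why the offsets are sorted first.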

MaskRay · 2025-05-03T02:39:24Z

(Still on a trip with limited computer access)

We call updateAllocSize on sections like .relr.dyn and .rela.dyn. Since .relr.dyn is predominant in PIEs, parallelism could be beneficial.
I likely tested parallelSort around 2022 (around commit da0e5b8) while optimizing -z combreloc, but saw no significant improvement, so I didn't pursue it.

Using parallelSort in computeRels is a compromise, due to our basic scheduler's lack of support for nested parallelism (see https://reviews.llvm.org/D61115) and to how lld parallelizes OutputSection and InputSectionBase writes (https://reviews.llvm.org/D131247).

However, updateAllocSize should be fine, as it’s called within finalizeAddressDependentContent without nested parallelism requirements.
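
For readers unfamiliar with what a parallel sort does at this call site, here is a minimal fork-join sketch. It is a plain parallel merge sort written with std::thread, not the algorithm inside llvm::parallelSort (declared in llvm/Support/Parallel.h and run on LLVM's own executor). `forkJoinSort`, `depth` and `cutoff` are hypothetical names with arbitrary defaults. Because the sketch creates its own threads instead of using LLVM's thread pool, it does not model the scheduler constraints discussed above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical illustration, NOT llvm::parallelSort's implementation: a
// fork-join merge sort. At each level the left half is sorted on a new thread
// while the current thread sorts the right half, then the halves are merged.
// Below `cutoff` elements, or once `depth` levels of forking are used up, it
// falls back to a sequential std::sort, so at most 2^depth - 1 extra threads
// are created.
template <class RandomIt>
void forkJoinSort(RandomIt begin, RandomIt end, int depth = 4,
                  std::ptrdiff_t cutoff = 1 << 16) {
  std::ptrdiff_t n = end - begin;
  assert(n >= 0 && "end must not precede begin");
  if (depth <= 0 || n <= cutoff) {
    std::sort(begin, end); // small or deep enough: sort sequentially
    return;
  }
  RandomIt mid = begin + n / 2;
  std::thread left([=] { forkJoinSort(begin, mid, depth - 1, cutoff); });
  forkJoinSort(mid, end, depth - 1, cutoff);
  left.join(); // wait until the forked half is sorted
  std::inplace_merge(begin, mid, end);
}
```

Applied to the buffer from the diff, the call would read `forkJoinSort(offsets.get(), offsets.get() + relocs.size())`. Whether this pays off depends on the input size relative to the thread start-up and merge costs, which is consistent with the modest gain measured above.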

[𝘀𝗽𝗿] initial version

173ed58

Created using spr 1.3.6-beta.1

pcc requested a review from MaskRay May 3, 2025 00:26

llvmbot added lld lld:ELF labels May 3, 2025

pcc requested a review from smithp35 May 3, 2025 00:26
