Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

hz2
Copy link
Contributor

@hz2 hz2 commented Aug 10, 2025

This analysis addresses issue #8061 by implementing parallel number parsing for sort -g (general numeric sort). The change adds multi-core processing to improve performance on large datasets.

Benchmarking Setup

Machine: Mac M1 running macOS Sequoia Version 15.6 (24G84)
Test Data: 387,597 floating-point numbers (matches original issue dataset size)
Tools: Hyperfine (--shell=none --warmup 3 --min-runs 5 --prepare 'sync')

Performance Results

Original Implementation (main branch):

hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' \
  './target/release/sort -g tmp/floatdata' \
  './target/release/sort -n tmp/floatdata'

Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):     117.7 ms ±   7.7 ms    [User: 341.1 ms, System: 13.8 ms]
  Range (min … max):   113.5 ms … 147.0 ms    17 runs

Benchmark 2: ./target/release/sort -n tmp/floatdata
  Time (mean ± σ):      41.2 ms ±   1.0 ms    [User: 93.1 ms, System: 11.5 ms]
  Range (min … max):    38.7 ms …  44.0 ms    34 runs

Summary
  ./target/release/sort -n tmp/floatdata ran
    2.85 ± 0.20 times faster than ./target/release/sort -g tmp/floatdata

Parallel Implementation (sort-parallelize branch):

hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' \
  './target/release/sort -g tmp/floatdata' \
  './target/release/sort -n tmp/floatdata'

Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):      84.3 ms ±   8.3 ms    [User: 350.9 ms, System: 22.3 ms]
  Range (min … max):    77.2 ms … 119.3 ms    22 runs

Benchmark 2: ./target/release/sort -n tmp/floatdata
  Time (mean ± σ):      42.8 ms ±   1.7 ms    [User: 93.7 ms, System: 11.9 ms]
  Range (min … max):    39.6 ms …  47.9 ms    31 runs

Summary
  ./target/release/sort -n tmp/floatdata ran
    1.97 ± 0.21 times faster than ./target/release/sort -g tmp/floatdata

Analysis

  • sort -g performance: 117.7ms → 84.3ms (28.4% speedup)
  • Gap reduction vs sort -n: 2.85× → 1.97× (31% smaller gap).
  • Multi-core utilization: Increased User CPU time shows effective parallelization
Metric Sequential Parallel Change
sort -g time 117.7 ± 7.7 ms 84.3 ± 8.3 ms +28.4%
sort -n time 41.2 ± 1.0 ms 42.8 ± 1.7 ms ~same
Gap (n/g) 2.85× 1.97× −31%
User CPU 341.1 ms 350.9 ms (parallel work)

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)

@hz2 hz2 marked this pull request as ready for review August 10, 2025 22:08
@drinkcat
Copy link
Collaborator

Thanks for picking this up!

I'm getting somewhat strange results here, on an Intel i7-1370P, that has 6 Performance cores (*2 with Hyperthreading) and 8 Efficient cores.

Basically no improvement on average:

hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' -L sort ./target/release/sort,./sort.main  '{sort} -g tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):     131.7 ms ±  17.3 ms    [User: 1134.5 ms, System: 141.6 ms]
  Range (min … max):    92.4 ms … 159.4 ms    21 runs
 
Benchmark 2: ./sort.main -g tmp/floatdata
  Time (mean ± σ):     136.9 ms ±  12.9 ms    [User: 494.2 ms, System: 31.1 ms]
  Range (min … max):   121.3 ms … 165.9 ms    20 runs
 
Summary
  ./target/release/sort -g tmp/floatdata ran
    1.04 ± 0.17 times faster than ./sort.main -g tmp/floatdata

But if I restrict to the P cores only (virtual cores 0-11, there are 6 physical cores), performance is good:

taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' -L sort ./target/release/sort,./sort.main  '{sort} -g tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):      74.2 ms ±   5.0 ms    [User: 459.4 ms, System: 39.5 ms]
  Range (min … max):    70.0 ms …  94.8 ms    37 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
Benchmark 2: ./sort.main -g tmp/floatdata
  Time (mean ± σ):     115.7 ms ±   3.1 ms    [User: 387.1 ms, System: 18.3 ms]
  Range (min … max):   110.4 ms … 121.7 ms    25 runs
 
Summary
  ./target/release/sort -g tmp/floatdata ran
    1.56 ± 0.11 times faster than ./sort.main -g tmp/floatdata

Adding in more and more E cores (12-19) to the mix makes things worse and worse:

taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12 hyperfine ...
  ./target/release/sort -g tmp/floatdata ran
    1.23 ± 0.14 times faster than ./sort.main -g tmp/floatdata
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13 hyperfine ... 
  ./target/release/sort -g tmp/floatdata ran
    1.04 ± 0.16 times faster than ./sort.main -g tmp/floatdata
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14 hyperfine -...
 ./target/release/sort,./sort.main  '{sort} -g tmp/floatdata'
    1.01 ± 0.12 times faster than ./sort.main -g tmp/floatdata

I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@hz2
Copy link
Contributor Author

hz2 commented Aug 11, 2025

Thanks for picking this up!

ofc! not many opportunities to work on parallelism :)

I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.

I suspect that the intel-E cores are much slower than the P-cores, which I believe differs than how apple's E/P cores work. I believe the M1's scheduler and memory system are more forgiving, where on i7-1370P the E-cores contend for bandwidth and extend the tail.

based off your results, i noticed the huge jump in user time with little wall-time gain when E-cores are present:

  • With all cores: 1134.5ms User time but only 1.04x speedup
  • P-cores only: 459.4ms User time with 1.56x speedup

showing that parallel work is being done, but my hunch is that the E-cores are creating a bottleneck. maybe no memory-bandwidth bound + merge i was doing?

so including them in workloads actually hurt the performance, which is great to know, thanks for pointing that out!

i tried making some changes based on this in the latest commit. will continue iterating with any feedback

@hz2 hz2 changed the title parallel processing for sort -g sort: -g parallel processing Aug 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants