sort: -g parallel processing #8459
GNU testsuite comparison:
Thanks for picking this up! I'm getting somewhat strange results here on an Intel i7-1370P, which has 6 Performance cores (*2 with Hyperthreading) and 8 Efficient cores. Basically no improvement on average:
But if I restrict to the P cores only (virtual cores 0-11, there are 6 physical cores), performance is good:
Adding in more and more E cores (12-19) to the mix makes things worse and worse:
I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.
ofc! not many opportunities to work on parallelism :)
I suspect that the Intel E-cores are much slower than the P-cores, which I believe differs from how Apple's E/P cores work. I believe the M1's scheduler and memory system are more forgiving, whereas on the i7-1370P the E-cores contend for bandwidth and extend the tail. based on your results, I noticed the huge jump in user time with little wall-time gain when E-cores are present:
showing that parallel work is being done, but my hunch is that the E-cores are creating a bottleneck. maybe no memory-bandwidth bound + merge i was doing? so including them in the workload actually hurts performance, which is great to know, thanks for pointing that out! I tried making some changes based on this in the latest commit. will continue iterating with any feedback
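To illustrate the "extended tail" hunch: if the input is split into one equal chunk per thread, the wall time is set by the slowest core. One common mitigation, shown here only as a minimal sketch and not as what the latest commit actually does, is to cut the input into many small chunks and let each worker claim the next chunk from a shared counter, so faster cores naturally take more of the work. The function name `parse_dynamic` and its parameters are hypothetical, and the parsing uses Rust's `str::parse::<f64>` as a stand-in for the real `-g` parser.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: parse lines into optional f64 keys using dynamic
/// chunk claiming instead of one fixed chunk per thread.
fn parse_dynamic(lines: &[&str], workers: usize, chunk_len: usize) -> Vec<Option<f64>> {
    let chunk_len = chunk_len.max(1); // `chunks(0)` would panic
    let chunks: Vec<&[&str]> = lines.chunks(chunk_len).collect();

    // Index of the next unclaimed chunk, shared by all workers.
    let next = AtomicUsize::new(0);

    // One output slot per chunk so results keep their input order.
    let slots: Vec<Mutex<Vec<Option<f64>>>> =
        chunks.iter().map(|_| Mutex::new(Vec::new())).collect();

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                // Claim a chunk; a fast core simply comes back here more often.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= chunks.len() {
                    break;
                }
                let parsed: Vec<Option<f64>> = chunks[i]
                    .iter()
                    .map(|l| l.trim().parse::<f64>().ok())
                    .collect();
                *slots[i].lock().unwrap() = parsed;
            });
        }
    });

    // All workers have joined; stitch the per-chunk results back together.
    slots
        .into_iter()
        .flat_map(|m| m.into_inner().unwrap())
        .collect()
}
```

Smaller chunks reduce the tail at the cost of more synchronization, so the chunk size would need tuning against real measurements on both the M1 and the i7-1370P.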
This analysis addresses issue #8061 by implementing parallel number parsing for sort -g (general numeric sort). The change adds multi-core processing to improve performance on large datasets.
Benchmarking Setup
Machine: Mac M1 running macOS Sequoia Version 15.6 (24G84)
Test Data: 387,597 floating-point numbers (matches original issue dataset size)
Tools: Hyperfine (--shell=none --warmup 3 --min-runs 5 --prepare 'sync')
Performance Results
Original Implementation (main branch):
Parallel Implementation (sort-parallelize branch):
Analysis
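The general idea named in the summary (parse the numeric keys in parallel, then sort on the pre-parsed keys) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the function names `parse_keys_parallel` and `sort_general_numeric` are hypothetical, parsing relies on Rust's `str::parse::<f64>` rather than the real `-g` parser, and the ordering (unparsable lines first, then `f64::total_cmp`) does not reproduce every GNU rule, for example NaN placement.

```rust
use std::cmp::Ordering;
use std::thread;

/// Hypothetical helper: parse every line into an optional f64 key,
/// giving each worker thread one contiguous chunk of the input.
fn parse_keys_parallel(lines: &[&str], workers: usize) -> Vec<Option<f64>> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_len = ((lines.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = lines
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|l| l.trim().parse::<f64>().ok())
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order so keys line up with the input lines.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("parser thread panicked"))
            .collect()
    })
}

/// Hypothetical helper: sort lines by their pre-parsed keys.
/// Simplification: non-numeric lines sort first, ties fall back to
/// byte-wise comparison of the original text.
fn sort_general_numeric<'a>(lines: &[&'a str], workers: usize) -> Vec<&'a str> {
    let keys = parse_keys_parallel(lines, workers);
    let mut order: Vec<usize> = (0..lines.len()).collect();
    order.sort_by(|&a, &b| match (keys[a], keys[b]) {
        (None, None) => lines[a].cmp(lines[b]),
        (None, Some(_)) => Ordering::Less,
        (Some(_), None) => Ordering::Greater,
        (Some(x), Some(y)) => x.total_cmp(&y).then_with(|| lines[a].cmp(lines[b])),
    });
    order.into_iter().map(|i| lines[i]).collect()
}
```

Parsing each key once up front means the comparison function only compares floats, which is where a parallel parse step can pay off on inputs of a few hundred thousand lines; whether it does in practice depends on the core mix, as the measurements in this thread show.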