ENH: improve Timsort with powersort merge-policy #29208

Merged
merged 3 commits into from
Jun 17, 2025

moritz-gross
Contributor

our contribution

This PR implements Powersort, a sorting algorithm derived from the Timsort that NumPy currently uses. Powersort has already been implemented in CPython and PyPy.

We benchmarked the implementation on collected data and achieved a strong improvement, in particular on large instances with more than 1 million elements:
plot of benchmark results

Powersort is a drop-in replacement with the same functionality as Timsort, which we tested with spin test -m full (though only spin test -- -k "test_multiarray" should be relevant).

open questions

  • So far, only the numeric sorts are implemented; the string and generic sorts, and the corresponding argsort variants, are not implemented yet. Should we add them in this PR?

Implement the improved merge policy for Timsort,
as developed by Munro and Wild.
Benchmarks show a significant improvement in performance.
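
For readers unfamiliar with the merge policy, here is a minimal Python sketch of how a Powersort-style policy decides when to merge. It is an illustration under stated assumptions, not the C++ code in this PR: the names `node_power`, `next_run`, `merge` and `powersort` are hypothetical, and the sketch leaves out minimum run lengths and galloping. Each boundary between two adjacent runs gets a "power" (roughly, the depth at which that boundary would sit in a nearly balanced merge tree), and runs on the stack are merged while their power is larger than the power of the new boundary.

```python
def node_power(s1, n1, n2, n):
    """Power of the boundary between runs [s1, s1+n1) and [s1+n1, s1+n1+n2).

    Smallest p >= 1 such that the two run midpoints, scaled to [0, 1),
    differ in their first p binary digits.
    """
    a = 2 * s1 + n1        # twice the midpoint of the left run
    b = a + n1 + n2        # twice the midpoint of the right run
    p = 1
    while (a << (p - 1)) // n == (b << (p - 1)) // n:
        p += 1
    return p


def next_run(xs, start):
    """Return the end of the run starting at `start`.

    Strictly descending runs are reversed in place so every run is ascending.
    """
    n = len(xs)
    end = start + 1
    if end == n:
        return end
    if xs[end] < xs[start]:
        while end < n and xs[end] < xs[end - 1]:
            end += 1
        xs[start:end] = xs[start:end][::-1]
    else:
        while end < n and not xs[end] < xs[end - 1]:
            end += 1
    return end


def merge(xs, lo, mid, hi):
    """Stable merge of the sorted slices xs[lo:mid] and xs[mid:hi]."""
    left, right = xs[lo:mid], xs[mid:hi]
    i = j = 0
    k = lo
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            xs[k] = right[j]
            j += 1
        else:                      # ties take the left element (stability)
            xs[k] = left[i]
            i += 1
        k += 1
    xs[k:hi] = left[i:] + right[j:]


def powersort(xs):
    """Sort the list `xs` in place and return it (simplified sketch)."""
    n = len(xs)
    if n < 2:
        return xs
    stack = []                     # entries: (run start, boundary power)
    s1, e1 = 0, next_run(xs, 0)    # current run is xs[s1:e1]
    while e1 < n:
        e2 = next_run(xs, e1)
        p = node_power(s1, e1 - s1, e2 - e1, n)
        # Merge earlier runs whose boundary sits deeper than the new one.
        while stack and stack[-1][1] > p:
            top_start, _ = stack.pop()
            merge(xs, top_start, s1, e1)
            s1 = top_start
        stack.append((s1, p))
        s1, e1 = e1, e2
    while stack:                   # final merges, top of stack first
        top_start, _ = stack.pop()
        merge(xs, top_start, s1, e1)
        s1 = top_start
    return xs
```

For example, `powersort([5, 1, 4, 2, 3])` finds the runs `[5, 1]`, `[4, 2]` and `[3]`, assigns the two boundaries powers 1 and 2, and returns `[1, 2, 3, 4, 5]`.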
@charris
Member

charris commented Jun 14, 2025

The test failures all look lint related.

@moritz-gross
Contributor Author

moritz-gross commented Jun 15, 2025

Hi, I might need some help understanding what the failing checks are about.

Linux tests / lint (pull_request) shows 2 errors concerning numpy/_core/tests/test_multiarray.py if I see that correctly, but I didn't work on that file, and it also doesn't show up in the Diff of the PR.

thanks for the quick response!

@charris
Member

charris commented Jun 15, 2025

Probably #29210. If so, a rebase should fix it.

@charris
Member

charris commented Jun 15, 2025

We aren't able to run ruff on the differences yet, but it is fast enough that we run it on the whole project. Not sure how that failure slipped through.

@jorenham
Member

We aren't able to run ruff on the differences yet, but it is fast enough that we run it on the whole project. Not sure how that failure slipped through.

It must've been an outdated branch that was merged

@charris charris merged commit 32f4afa into numpy:main Jun 17, 2025
74 checks passed
@charris
Member

charris commented Jun 17, 2025

Thanks @moritz-gross . Note that this is safe as long as the names/API in the sort library are not changed. The indirect sort versions would be good to have. In general, all types are desirable, but note that the short integer types are actually handled by radix sort. Is there much to be gained for those types? Argsorts have the disadvantage of accessing memory in a scattered manner and might benefit from handling long runs.
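
To make the point about scattered access concrete, here is a small pure-Python sketch of an indirect sort. It is only an illustration: `indirect_sort` is a hypothetical name and this is not how NumPy's argsort is written. The sort permutes an index list, and every comparison looks up two keys through those indices, so the reads follow the current permutation rather than the memory layout.

```python
from functools import cmp_to_key


def indirect_sort(keys):
    """Return indices that stably sort `keys`; `keys` itself is not modified."""
    def compare(i, j):
        # Each comparison dereferences two indices into `keys`.
        return (keys[i] > keys[j]) - (keys[i] < keys[j])

    return sorted(range(len(keys)), key=cmp_to_key(compare))


keys = [30, 10, 20, 10]
order = indirect_sort(keys)          # [1, 3, 2, 0]
```

When the keys already contain long sorted runs, consecutive indices within a run point at neighbouring keys, which is one reason a run-aware merge policy could help here.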

I suppose the real tradeoff is clarity in what algorithms are used and where. We should probably put together a table of which algorithms handle which types, and another for which algorithms are actually used by NumPy for the different types and sort kinds.
