bpo-36095: Better NaN sorting. #12001

Closed
wants to merge 7 commits into from

brandtbucher
Member

@brandtbucher brandtbucher commented Feb 23, 2019

Sorting sequences containing NaN values produces an incompletely sorted result. Further, because of the complexity of the timsort, this incomplete sort often silently produces unintuitive, unstable-seeming results that are extremely sensitive to the ordering of the inputs:

>>> sorted([3, 1, 2, float('nan'), 2.0, 2, 2.0])
[1, 2, 2.0, 2.0, 3, nan, 2]
>>> sorted(reversed([3, 1, 2, float('nan'), 2.0, 2, 2.0]))
[1, 2.0, 2, 2.0, nan, 2, 3]
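
As background for the outputs above: every ordering comparison that involves NaN evaluates to False, so `<` no longer gives the sort a consistent ordering to work with. A minimal illustration (the helper name `ordering_results` is made up for this sketch):

```python
nan = float("nan")

def ordering_results(x, y):
    """Return the results of the basic comparisons between x and y."""
    return {"x < y": x < y, "y < x": y < x, "x == y": x == y}

# NaN is neither less than, greater than, nor equal to an ordinary number.
print(ordering_results(nan, 2))    # every value is False

# NaN does not even compare equal to itself.
print(ordering_results(nan, nan))  # every value is False
```

Because the sort only ever asks "is a < b?", a NaN looks like a tie with whatever it is compared against, and where it ends up depends on which comparisons happen to be made.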

The patch I have provided addresses these issues, including for lists containing nested lists/tuples with NaN values. Specifically, it stably sorts NaNs to the end of the list with no changes to the timsort itself (just the element-wise comparison functions):

>>> sorted([3, 1, 2, float('nan'), 2.0, 2, 2.0])
[1, 2, 2.0, 2, 2.0, 3, nan]
>>> sorted([[3], [1], [2], [float('nan')], [2.0], [2], [2.0]])
[[1], [2], [2.0], [2], [2.0], [3], [nan]]

It also includes a new regression test for this behavior.
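
For comparison, the flat-list ordering shown above can be reproduced in pure Python with a key function. This is only a sketch and not what the patch does (the patch changes the C comparison functions); the helper name `nan_last_key` is hypothetical, and it assumes a flat sequence of real numbers rather than the nested lists/tuples the patch also covers.

```python
import math

def nan_last_key(value):
    """Sort key that places NaN floats after every other value.

    Hypothetical helper for illustration; handles flat sequences of
    real numbers only.
    """
    is_nan = isinstance(value, float) and math.isnan(value)
    # Tuples compare element by element: False (not NaN) sorts before True (NaN).
    # NaN is replaced by 0 in the second slot so it is never compared directly.
    return (is_nan, 0 if is_nan else value)

data = [3, 1, 2, float("nan"), 2.0, 2, 2.0]
result = sorted(data, key=nan_last_key)
print(result)  # [1, 2, 2.0, 2, 2.0, 3, nan]
```

Since `sorted` is stable and the keys for `2` and `2.0` compare equal, those elements keep their original relative order, matching the patched output.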

Some other benefits to this patch:

  • These changes generally result in a sorting performance improvement across data types. The largest increases here are for nested lists, since we add a new unsafe_list_compare function. Other speed increases are due to safe_object_compare's delegation to unsafe comparison functions for objects of the same type. Specifically, the speed impact (positive is faster, negative is slower) is between:

    • -3% and +3% (10 elements, no PGO)
    • 0% and +4% (10 elements, PGO)
    • 0% and +9% (1000 elements, no PGO)
    • -1% and +9% (1000 elements, PGO)
  • The current weird NaN-sorting behavior is not documented, so this is not a breaking change.

  • IEEE 754 compliance is maintained. The result is still a stable (arguably, more stable), nondecreasing ordering of the original list.
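
One way to read the "nondecreasing" claim is that no element is strictly less than an element that precedes it. The sketch below checks the two outputs quoted in this description under that reading; the function names are made up for illustration, and the lists are copied from the examples above rather than recomputed.

```python
from itertools import combinations

def adjacent_nondecreasing(seq):
    """True if no element is strictly less than its immediate predecessor."""
    return all(not (b < a) for a, b in zip(seq, seq[1:]))

def pairwise_nondecreasing(seq):
    """True if no element is strictly less than ANY earlier element."""
    # combinations() yields pairs in positional order, so a precedes b.
    return all(not (b < a) for a, b in combinations(seq, 2))

nan = float("nan")
current_order = [1, 2, 2.0, 2.0, 3, nan, 2]  # output quoted without the patch
patched_order = [1, 2, 2.0, 2, 2.0, 3, nan]  # output quoted with the patch

print(adjacent_nondecreasing(current_order), pairwise_nondecreasing(current_order))
print(adjacent_nondecreasing(patched_order), pairwise_nondecreasing(patched_order))
```

Both lists pass the adjacent check, because NaN never compares as out of order with its neighbours. Only the patched ordering passes the pairwise check: in the current output the trailing `2` sits after `3`.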

https://bugs.python.org/issue36095

This will be necessary later when sorting sequences of possible NaN values. It is essentially identical to unsafe_tuple_compare.
This behavior stably pushes NaN values to the end of the list.
This includes properly delegating to unsafe_float_compare, unsafe_tuple_compare, and unsafe_list_compare when necessary. As a nice bonus, it also allows us to sometimes use unsafe_object_compare to compare same-typed elements in mixed-type lists.
This includes singly- and doubly-nested lists/tuples.
Fixes whitespace and checks that lists/tuples have at least one element before delegating comparison to their unsafe functions.
Replace ternary extension with proper C, and don't use PyTuple_GET_ITEM when the target could be a list.
@rhettinger
Contributor

Marking as rejected. Tim and Mark concur that special-casing sort/min/max/heaps/bisect etc. is the wrong approach and is an incorrect separation of responsibilities.

@rhettinger rhettinger closed this Feb 24, 2019
@brandtbucher brandtbucher deleted the nan-sorting branch March 21, 2019 03:55