
Commit 1544fc5 (2 parents: 9d95254 + ec8147b)
Author: Tim Peters

    Various clarifications based on feedback & questions over the years.

1 file changed: Objects/listsort.txt (96 additions, 19 deletions)
@@ -100,11 +100,13 @@ Comparison with Python's Samplesort Hybrid
 The algorithms are effectively identical in these cases, except that
 timsort does one less compare in \sort.
 
-Now for the more interesting cases.  lg(n!) is the information-theoretic
-limit for the best any comparison-based sorting algorithm can do on
-average (across all permutations).  When a method gets significantly
-below that, it's either astronomically lucky, or is finding exploitable
-structure in the data.
+Now for the more interesting cases.  Where lg(x) is the logarithm of x to
+the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
+the best any comparison-based sorting algorithm can do on average (across
+all permutations).  When a method gets significantly below that, it's
+either astronomically lucky, or is finding exploitable structure in the
+data.
+
 
       n   lg(n!)   *sort    3sort     +sort   %sort    ~sort     !sort
 -------  -------  ------   -------  -------  ------  -------  --------
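
To make the lg(n!) column concrete, here is a small Python sketch (mine, not
part of the patch) that computes the bound via math.lgamma, which avoids
computing the huge factorial itself:

    import math

    def lg_factorial(n):
        # lg(n!) without forming n!: lgamma(n + 1) == ln(n!), and
        # dividing by ln(2) converts natural log to log base 2.
        return math.lgamma(n + 1) / math.log(2)

    print(lg_factorial(8))   # lg(8!) = lg(40320), about 15.3
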
@@ -251,7 +253,7 @@ Computing minrun
 ----------------
 If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
 array then; it's hard to beat that given the overheads of trying something
-fancier.
+fancier (see note BINSORT).
 
 When N is a power of 2, testing on random data showed that minrun values of
 16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
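
For context on the rule this hunk touches, a Python sketch (mine, mirroring
CPython's merge_compute_minrun in listobject.c): take the six
most-significant bits of N, adding 1 if any of the remaining bits are set:

    def merge_compute_minrun(n):
        # Keep the six most-significant bits of n, adding 1 if any of the
        # bits shifted off were set.  Result is n itself when n < 64,
        # else a value in [32, 64].
        r = 0
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r
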
@@ -379,10 +381,10 @@ with wildly unbalanced run lengths.
 
 Merge Memory
 ------------
-Merging adjacent runs of lengths A and B in-place is very difficult.
-Theoretical constructions are known that can do it, but they're too difficult
-and slow for practical use.  But if we have temp memory equal to min(A, B),
-it's easy.
+Merging adjacent runs of lengths A and B in-place, and in linear time, is
+difficult.  Theoretical constructions are known that can do it, but they're
+too difficult and slow for practical use.  But if we have temp memory equal
+to min(A, B), it's easy.
 
 If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
 and then we can do the obvious merge algorithm left to right, from the temp
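
A minimal Python sketch (mine, with galloping omitted) of the merge_lo
scheme just described: copy the smaller left run A to temp memory, leave B
alone, and merge left to right back into the original array:

    def merge_lo(a, lo, mid, hi):
        # Merge sorted runs a[lo:mid] (A) and a[mid:hi] (B) in place,
        # using temp memory equal to len(A).
        temp = a[lo:mid]          # copy A; B stays where it is
        i = j = 0                 # i indexes temp (A), j indexes B
        dest = lo
        while i < len(temp) and mid + j < hi:
            if a[mid + j] < temp[i]:   # strict < keeps the sort stable
                a[dest] = a[mid + j]
                j += 1
            else:
                a[dest] = temp[i]
                i += 1
            dest += 1
        # Whatever remains of A is copied; leftover B is already in place.
        a[dest:dest + len(temp) - i] = temp[i:]
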
@@ -457,10 +459,10 @@ finding the right spot early in B (more on that later).
 
 After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
 consecutive elements, and a straight binary search requires exactly k-1
-additional comparisons to nail it.  Then we copy all the B's up to that
-point in one chunk, and then copy A[0].  Note that no matter where A[0]
-belongs in B, the combination of galloping + binary search finds it in no
-more than about 2*lg(B) comparisons.
+additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
+copy all the B's up to that point in one chunk, and then copy A[0].  Note
+that no matter where A[0] belongs in B, the combination of galloping + binary
+search finds it in no more than about 2*lg(B) comparisons.
 
 If we did a straight binary search, we could find it in no more than
 ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
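
A Python sketch (mine, simplified from the C code) of the galloping search
described above: probe at offsets 1, 3, 7, 15, ... until the key is
bracketed, then binary-search the region of uncertainty:

    from bisect import bisect_left

    def gallop_left(key, b):
        # Return where key belongs in sorted list b, i.e. the count of
        # elements of b that are < key.
        if not b or not (b[0] < key):
            return 0
        last_ofs, ofs = 0, 1
        while ofs < len(b) and b[ofs] < key:
            last_ofs = ofs
            ofs = ofs * 2 + 1      # probe at 1, 3, 7, 15, ...
        ofs = min(ofs, len(b))
        # Now b[last_ofs] < key <= b[ofs] (treating b[len(b)] as infinity);
        # binary-search the elements strictly between those endpoints.
        return bisect_left(b, key, last_ofs + 1, ofs)
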
@@ -573,11 +575,11 @@ Galloping Complication
 The description above was for merge_lo.  merge_hi has to merge "from the
 other end", and really needs to gallop starting at the last element in a run
 instead of the first.  Galloping from the first still works, but does more
-comparisons than it should (this is significant -- I timed it both ways).
-For this reason, the gallop_left() and gallop_right() functions have a
-"hint" argument, which is the index at which galloping should begin.  So
-galloping can actually start at any index, and proceed at offsets of 1, 3,
-7, 15, ... or -1, -3, -7, -15, ... from the starting index.
+comparisons than it should (this is significant -- I timed it both ways). For
+this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
+functions have a "hint" argument, which is the index at which galloping
+should begin.  So galloping can actually start at any index, and proceed at
+offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
 
 In the code as I type it's always called with either 0 or n-1 (where n is
 the # of elements in a run).  It's tempting to try to do something fancier,
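
Complementing the earlier sketch, here is a leftward-probing variant (mine,
names hypothetical) of the hinted gallop this hunk describes, using offsets
-1, -3, -7, ... from the hint, the direction merge_hi wants:

    from bisect import bisect_left

    def gallop_left_from(key, b, hint):
        # Like the earlier gallop_left sketch, but probing leftward from
        # b[hint]; caller knows key belongs at or left of the hint.
        assert not (b[hint] < key)
        hi, ofs = hint, 1
        while hint - ofs >= 0 and not (b[hint - ofs] < key):
            hi = hint - ofs            # last probe still >= key
            ofs = ofs * 2 + 1          # -1, -3, -7, -15, ... from hint
        lo = max(hint - ofs, -1)       # b[lo] < key (or lo == -1)
        return bisect_left(b, key, lo + 1, hi)
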
@@ -676,3 +678,78 @@ immediately. The consequence is that it ends up using two compares to sort
 [2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
 taught how to deal with mixtures of ascending and descending runs
 efficiently in all cases.
+
+
+NOTES
+-----
+
+BINSORT
+A "binary insertion sort" is just like a textbook insertion sort, but instead
+of locating the correct position of the next item via linear (one at a time)
+search, an equivalent to Python's bisect.bisect_right is used to find the
+correct position in logarithmic time.  Most texts don't mention this
+variation, and those that do usually say it's not worth the bother: insertion
+sort remains quadratic (expected and worst cases) either way.  Speeding the
+search doesn't reduce the quadratic data movement costs.
+
+But in CPython's case, comparisons are extraordinarily expensive compared to
+moving data, and the details matter.  Moving objects is just copying
+pointers.  Comparisons can be arbitrarily expensive (can invoke arbitrary
+user-supplied Python code), but even in simple cases (like 3 < 4) _all_
+decisions are made at runtime: what's the type of the left comparand?  the
+type of the right?  do they need to be coerced to a common type?  where's the
+code to compare these types?  And so on.  Even the simplest Python comparison
+triggers a large pile of C-level pointer dereferences, conditionals, and
+function calls.
+
+So cutting the number of compares is almost always measurably helpful in
+CPython, and the savings swamp the quadratic-time data movement costs for
+reasonable minrun values.
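
A Python sketch (mine, not the C code) of the binary insertion sort the note
describes: bisect_right finds each item's slot in O(lg i) compares (and keeps
equal items in order, preserving stability), while the data movement stays
quadratic:

    from bisect import bisect_right

    def binary_insertion_sort(a):
        for i in range(1, len(a)):
            item = a[i]
            pos = bisect_right(a, item, 0, i)  # a[:i] is already sorted
            a[pos + 1:i + 1] = a[pos:i]        # quadratic data movement...
            a[pos] = item                      # ...but few comparisons
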
+
+
+LEFT OR RIGHT
+gallop_left() and gallop_right() are akin to the Python bisect module's
+bisect_left() and bisect_right(): they're the same unless the slice they're
+searching contains a (at least one) value equal to the value being searched
+for.  In that case, gallop_left() returns the position immediately before the
+leftmost equal value, and gallop_right() the position immediately after the
+rightmost equal value.  The distinction is needed to preserve stability.  In
+general, when merging adjacent runs A and B, gallop_left is used to search
+thru B for where an element from A belongs, and gallop_right to search thru A
+for where an element from B belongs.
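
The bisect analogy can be seen directly (example mine):

    from bisect import bisect_left, bisect_right

    b = [1, 3, 3, 3, 5]
    print(bisect_left(b, 3))    # 1: the slot before the leftmost 3
    print(bisect_right(b, 3))   # 4: the slot after the rightmost 3
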
+
+
+REGION OF UNCERTAINTY
+Two kinds of confusion seem to be common about the claim that after finding
+a k such that
+
+    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
+
+then a binary search requires exactly k-1 tries to find A[0]'s proper
+location.  For concreteness, say k=3, so B[3] < A[0] <= B[7].
+
+The first confusion takes the form "OK, then the region of uncertainty is at
+indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
+3"; or the region is viewed as a Python slice and the objection is "but that's
+the slice B[3:7], so has 7-3 = 4 elements".  Resolution: we've already
+compared A[0] against B[3] and against B[7], so A[0]'s correct location is
+already known wrt _both_ endpoints.  What remains is to find A[0]'s correct
+location wrt B[4], B[5] and B[6], which spans 3 elements.  Or in general, the
+slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
+inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
+
+    (2**k-1)-1 - 2**(k-1) + 1 =
+          2**k-1 - 2**(k-1) =
+    2*2**(k-1)-1 - 2**(k-1) =
+       (2-1)*2**(k-1) - 1 =
+             2**(k-1) - 1
+
+elements.
+
+The second confusion: "k-1 = 2 binary searches can find the correct location
+among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
+could make this more efficient by arranging for the region of uncertainty to
+span 2**(k-1) elements."  Resolution: that confuses "elements" with
+"locations".  In a slice with N elements, there are N+1 _locations_.  In the
+example, with the region of uncertainty B[4], B[5], B[6], there are 4
+locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
+after B[6].  In general, across 2**(k-1)-1 elements, there are 2**(k-1)
+locations.  That's why k-1 binary searches are necessary and sufficient.
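
A quick numeric check (mine) of the note's arithmetic in Python: the region
strictly between the endpoints holds 2**(k-1) - 1 elements, hence 2**(k-1)
locations, exactly what k-1 binary compares can distinguish:

    for k in range(1, 8):
        # Interior indices run from 2**(k-1) through 2**k - 2 inclusive.
        elements = (2**k - 1 - 1) - 2**(k - 1) + 1
        assert elements == 2**(k - 1) - 1
        locations = elements + 1
        assert locations == 2**(k - 1)   # k-1 compares pick one of these
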
