@@ -100,11 +100,13 @@ Comparison with Python's Samplesort Hybrid
100100 The algorithms are effectively identical in these cases, except that
101101 timsort does one less compare in \sort.
102102
103- Now for the more interesting cases. lg(n!) is the information-theoretic
104- limit for the best any comparison-based sorting algorithm can do on
105- average (across all permutations). When a method gets significantly
106- below that, it's either astronomically lucky, or is finding exploitable
107- structure in the data.
103+ Now for the more interesting cases. Where lg(x) is the logarithm of x to
104+ the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
105+ the best any comparison-based sorting algorithm can do on average (across
106+ all permutations). When a method gets significantly below that, it's
107+ either astronomically lucky, or is finding exploitable structure in the
108+ data.
109+
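As a rough illustration of the lg(n!) column, the bound can be computed without forming n! itself. This is a minimal sketch, not part of the sort implementation; the helper name lg_factorial is hypothetical, and it relies on the identity n! = gamma(n+1).

```python
import math

def lg_factorial(n):
    """Return lg(n!), the base-2 logarithm of n factorial.

    Uses the log-gamma function (ln(n!) == lgamma(n + 1)) so that large n
    does not require building a huge integer first.
    """
    return math.lgamma(n + 1) / math.log(2)

# lg(8) = 3, as in the text; lg(n!) is the average-case comparison limit.
print(math.log2(8))
print(lg_factorial(8))
```

A sort that averages noticeably fewer than lg_factorial(n) compares on some input class is exploiting structure in that class.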
108110
109111 n lg(n!) *sort 3sort +sort %sort ~sort !sort
110112------- ------- ------ ------- ------- ------ ------- --------
@@ -251,7 +253,7 @@ Computing minrun
251253----------------
252254If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
253255array then; it's hard to beat that given the overheads of trying something
254- fancier.
256+ fancier (see note BINSORT).
255257
256258When N is a power of 2, testing on random data showed that minrun values of
25725916, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
@@ -379,10 +381,10 @@ with wildly unbalanced run lengths.
379381
380382Merge Memory
381383------------
382- Merging adjacent runs of lengths A and B in-place is very difficult.
383- Theoretical constructions are known that can do it, but they're too difficult
384- and slow for practical use. But if we have temp memory equal to min(A, B),
385- it's easy.
384+ Merging adjacent runs of lengths A and B in-place, and in linear time, is
385+ difficult. Theoretical constructions are known that can do it, but they're
386+ too difficult and slow for practical use. But if we have temp memory equal
387+ to min(A, B), it's easy.
386388
387389If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
388390and then we can do the obvious merge algorithm left to right, from the temp
@@ -457,10 +459,10 @@ finding the right spot early in B (more on that later).
457459
458460After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
459461consecutive elements, and a straight binary search requires exactly k-1
460- additional comparisons to nail it. Then we copy all the B's up to that
461- point in one chunk, and then copy A[0]. Note that no matter where A[0]
462- belongs in B, the combination of galloping + binary search finds it in no
463- more than about 2*lg(B) comparisons.
462+ additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we
463+ copy all the B's up to that point in one chunk, and then copy A[0]. Note
464+ that no matter where A[0] belongs in B, the combination of galloping + binary
465+ search finds it in no more than about 2*lg(B) comparisons.
464466
465467If we did a straight binary search, we could find it in no more than
466468ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
@@ -573,11 +575,11 @@ Galloping Complication
573575The description above was for merge_lo. merge_hi has to merge "from the
574576other end", and really needs to gallop starting at the last element in a run
575577instead of the first. Galloping from the first still works, but does more
576- comparisons than it should (this is significant -- I timed it both ways).
577- For this reason, the gallop_left() and gallop_right() functions have a
578- "hint" argument, which is the index at which galloping should begin. So
579- galloping can actually start at any index, and proceed at offsets of 1, 3,
580- 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
578+ comparisons than it should (this is significant -- I timed it both ways). For
579+ this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
580+ functions have a "hint" argument, which is the index at which galloping
581+ should begin. So galloping can actually start at any index, and proceed at
582+ offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
581583
582584In the code as I type it's always called with either 0 or n-1 (where n is
583585the # of elements in a run). It's tempting to try to do something fancier,
@@ -676,3 +678,78 @@ immediately. The consequence is that it ends up using two compares to sort
676678[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
677679taught how to deal with mixtures of ascending and descending runs
678680efficiently in all cases.
681+
682+
683+ NOTES
684+ -----
685+
686+ BINSORT
687+ A "binary insertion sort" is just like a textbook insertion sort, but instead
688+ of locating the correct position of the next item via linear (one at a time)
689+ search, an equivalent to Python's bisect.bisect_right is used to find the
690+ correct position in logarithmic time. Most texts don't mention this
691+ variation, and those that do usually say it's not worth the bother: insertion
692+ sort remains quadratic (expected and worst cases) either way. Speeding the
693+ search doesn't reduce the quadratic data movement costs.
694+
695+ But in CPython's case, comparisons are extraordinarily expensive compared to
696+ moving data, and the details matter. Moving objects is just copying
697+ pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
698+ user-supplied Python code), but even in simple cases (like 3 < 4) _all_
699+ decisions are made at runtime: what's the type of the left comparand? the
700+ type of the right? do they need to be coerced to a common type? where's the
701+ code to compare these types? And so on. Even the simplest Python comparison
702+ triggers a large pile of C-level pointer dereferences, conditionals, and
703+ function calls.
704+
705+ So cutting the number of compares is almost always measurably helpful in
706+ CPython, and the savings swamp the quadratic-time data movement costs for
707+ reasonable minrun values.
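
The idea in this note can be sketched in pure Python. This is only an illustration built on bisect.bisect_right, not the C code in listobject.c; the function name binary_insertion_sort is an assumption made for the example.

```python
import bisect

def binary_insertion_sort(items):
    """Sort a list in place with binary insertion sort and return it.

    The search for each insertion point takes O(log i) compares, but the
    data movement is still O(i), so the sort stays quadratic overall.
    """
    for i in range(1, len(items)):
        pivot = items[i]
        # Search only the already-sorted prefix items[:i]. bisect_right
        # places pivot after any equal elements, which keeps the sort stable.
        pos = bisect.bisect_right(items, pivot, 0, i)
        # Shift items[pos:i] one slot to the right, then drop pivot in.
        items[pos + 1:i + 1] = items[pos:i]
        items[pos] = pivot
    return items

print(binary_insertion_sort([5, 2, 4, 2, 1]))
```

In CPython the shift is a block move of pointers, which is why the compare savings dominate for small runs.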
708+
709+
710+ LEFT OR RIGHT
711+ gallop_left() and gallop_right() are akin to the Python bisect module's
712+ bisect_left() and bisect_right(): they're the same unless the slice they're
713+ searching contains at least one value equal to the value being searched
714+ for. In that case, gallop_left() returns the position immediately before the
715+ leftmost equal value, and gallop_right() the position immediately after the
716+ rightmost equal value. The distinction is needed to preserve stability. In
717+ general, when merging adjacent runs A and B, gallop_left is used to search
718+ thru B for where an element from A belongs, and gallop_right to search thru A
719+ for where an element from B belongs.
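
The bisect analogy can be checked directly. The snippet below uses only the standard bisect module; the list named run is made-up sample data, and the gallop functions themselves are not shown.

```python
import bisect

run = [10, 20, 20, 20, 30]

# With equal values present, the two searches disagree:
left = bisect.bisect_left(run, 20)    # just before the leftmost 20
right = bisect.bisect_right(run, 20)  # just after the rightmost 20

# With no equal value present, they agree:
same_left = bisect.bisect_left(run, 25)
same_right = bisect.bisect_right(run, 25)

print(left, right, same_left, same_right)
```

Inserting an element taken from A at the "left" position in B, and an element taken from B at the "right" position in A, keeps A's equal elements ahead of B's, which is what stability requires.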
720+
721+
722+ REGION OF UNCERTAINTY
723+ Two kinds of confusion seem to be common about the claim that after finding
724+ a k such that
725+
726+ B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
727+
728+ then a binary search requires exactly k-1 tries to find A[0]'s proper
729+ location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
730+
731+ The first confusion takes the form "OK, then the region of uncertainty is at
732+ indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
733+ 3"; or the region is viewed as a Python slice and the objection is "but that's
734+ the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
735+ compared A[0] against B[3] and against B[7], so A[0]'s correct location is
736+ already known wrt _both_ endpoints. What remains is to find A[0]'s correct
737+ location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
738+ slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
739+ inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
740+ (2**k-1)-1 - 2**(k-1) + 1 =
741+ 2**k-1 - 2**(k-1) =
742+ 2*2**(k-1)-1 - 2**(k-1) =
743+ (2-1)*2**(k-1) - 1 =
744+ 2**(k-1) - 1
745+ elements.
746+
747+ The second confusion: "k-1 = 2 binary searches can find the correct location
748+ among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
749+ could make this more efficient by arranging for the region of uncertainty to
750+ span 2**(k-1) elements." Resolution: that confuses "elements" with
751+ "locations". In a slice with N elements, there are N+1 _locations_. In the
752+ example, with the region of uncertainty B[4], B[5], B[6], there are 4
753+ locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
754+ after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
755+ locations. That's why k-1 binary searches are necessary and sufficient.
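
Both resolutions can be checked with a small experiment. This is a sketch under stated assumptions: region_of_uncertainty and bisect_left_counting are hypothetical helpers written for this note, and the list b is made-up data; none of it is taken from the C implementation.

```python
def region_of_uncertainty(k):
    """Indices of B still unresolved once we know
    B[2**(k-1) - 1] < A[0] <= B[2**k - 1] (both endpoints excluded)."""
    return range(2 ** (k - 1), 2 ** k - 1)

def bisect_left_counting(b, key, lo, hi):
    """Plain binary search over b[lo:hi].

    Returns (insertion position, number of comparisons performed).
    """
    compares = 0
    while lo < hi:
        mid = (lo + hi) // 2
        compares += 1
        if b[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo, compares

k = 3
b = list(range(0, 16, 2))          # indices 0..7, so b[3] = 6 and b[7] = 14
region = region_of_uncertainty(k)  # indices 4, 5, 6
for key in range(b[3] + 1, b[7] + 1):
    pos, compares = bisect_left_counting(b, key, region.start, region.stop)
    print(key, pos, compares)
```

Every key in the loop is resolved to one of the 4 possible locations (4 through 7) using the same number of comparisons, k-1 = 2, over a region of 2**(k-1) - 1 = 3 elements.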