/*-------------------------------------------------------------------------
 *
 * tuplesort.c
 *	  Generalized tuple sorting routines.
 *
 * This module provides a generalized facility for tuple sorting, which can be
 * applied to different kinds of sortable objects.  Implementation of
 * the particular sorting variants is given in tuplesortvariants.c.
 * This module works efficiently for both small and large amounts
 * of data.  Small amounts are sorted in-memory using qsort().  Large
 * amounts are sorted using temporary files and a standard external sort
 * algorithm.
 *
 * See Knuth, volume 3, for more than you want to know about external
 * sorting algorithms.  The algorithm we use is a balanced k-way merge.
 * Before PostgreSQL 15, we used the polyphase merge algorithm (Knuth's
 * Algorithm 5.4.2D), but with modern hardware, a straightforward balanced
 * merge is better.  Knuth assumes that tape drives are expensive
 * beasts, and in particular that there will always be many more runs than
 * tape drives.  The polyphase merge algorithm was good at keeping all the
 * tape drives busy, but in our implementation a "tape drive" doesn't cost
 * much more than a few KB of memory buffers, so we can afford to have
 * lots of them.  In particular, if we can have as many tape drives as
 * sorted runs, we can eliminate any repeated I/O at all.
 *
 * Historically, we divided the input into sorted runs using replacement
 * selection, in the form of a priority tree implemented as a heap
 * (essentially Knuth's Algorithm 5.2.3H), but now we always use quicksort
 * for run generation.
 *
 * The approximate amount of memory allowed for any one sort operation
 * is specified in kilobytes by the caller (most pass work_mem).  Initially,
 * we absorb tuples and simply store them in an unsorted array as long as
 * we haven't exceeded workMem.  If we reach the end of the input without
 * exceeding workMem, we sort the array using qsort() and subsequently return
 * tuples just by scanning the tuple array sequentially.  If we do exceed
 * workMem, we begin to emit tuples into sorted runs in temporary tapes.
 * When tuples are dumped in batch after quicksorting, we begin a new run
 * with a new output tape.  If we reach the max number of tapes, we write
 * subsequent runs on the existing tapes in a round-robin fashion.  We will
 * need multiple merge passes to finish the merge in that case.  After the
 * end of the input is reached, we dump out remaining tuples in memory into
 * a final run, then merge the runs.
 *
 * When merging runs, we use a heap containing just the frontmost tuple from
 * each source run; we repeatedly output the smallest tuple and replace it
 * with the next tuple from its source tape (if any).  When the heap empties,
 * the merge is complete.  The basic merge algorithm thus needs very little
 * memory --- only M tuples for an M-way merge, and M is constrained to a
 * small number.  However, we can still make good use of our full workMem
 * allocation by pre-reading additional blocks from each source tape.  Without
 * prereading, our access pattern to the temporary file would be very erratic;
 * on average we'd read one block from each of M source tapes during the same
 * time that we're writing M blocks to the output tape, so there is no
 * sequentiality of access at all, defeating the read-ahead methods used by
 * most Unix kernels.  Worse, the output tape gets written into a very random
 * sequence of blocks of the temp file, ensuring that things will be even
 * worse when it comes time to read that tape.  A straightforward merge pass
 * thus ends up doing a lot of waiting for disk seeks.  We can improve matters
 * by prereading from each source tape sequentially, loading about workMem/M
 * bytes from each tape in turn, and making the sequential blocks immediately
 * available for reuse.  This approach helps to localize both read and write
 * accesses.  The pre-reading is handled by logtape.c; we just tell it how
 * much memory to use for the buffers.
 *
 * In the current code we determine the number of input tapes M on the basis
 * of workMem: we want workMem/M to be large enough that we read a fair
 * amount of data each time we read from a tape, so as to maintain the
 * locality of access described above.  Nonetheless, with large workMem we
 * can have many tapes.  The logical "tapes" are implemented by logtape.c,
 * which avoids space wastage by recycling disk space as soon as each block
 * is read from its "tape".
 *
 * When the caller requests random access to the sort result, we form
 * the final sorted run on a logical tape which is then "frozen", so
 * that we can access it randomly.  When the caller does not need random
 * access, we return from tuplesort_performsort() as soon as we are down
 * to one run per logical tape.  The final merge is then performed
 * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
 * saves one cycle of writing all the data out to disk and reading it in.
 *
 * This module supports parallel sorting.  Parallel sorts involve coordination
 * among one or more worker processes, and a leader process, each with its own
 * tuplesort state.  The leader process (or, more accurately, the
 * Tuplesortstate associated with a leader process) creates a full tapeset
 * consisting of worker tapes with one run to merge; a run for every
 * worker process.  This is then merged.  Worker processes are guaranteed to
 * produce exactly one output run from their partial input.
 *
 *
 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/backend/utils/sort/tuplesort.c
 *
 *-------------------------------------------------------------------------
 */

#include "postgres.h"

#include <limits.h>

#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "storage/shmem.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/pg_rusage.h"
#include "utils/tuplesort.h"

/*
 * Initial size of memtuples array.  We try to select this size so that
 * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
 * overhead of allocation might possibly be lowered.  However, we don't
 * consider array sizes less than 1024.
 */
#define INITIAL_MEMTUPSIZE Max(1024, \
								ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
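
/*
 * For illustration (a rough sketch, assuming the usual 8 kB value of
 * ALLOCSET_SEPARATE_THRESHOLD and a 24-byte SortTuple on a typical 64-bit
 * build): 8192 / 24 + 1 = 342, so the Max() above selects 1024 initial
 * array slots, about 24 kB.
 */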

/* GUC variables */
bool		trace_sort = false;

#ifdef DEBUG_BOUNDED_SORT
bool		optimize_bounded_sort = true;
#endif


/*
 * During merge, we use a pre-allocated set of fixed-size slots to hold
 * tuples, to avoid palloc/pfree overhead.
 *
 * Merge doesn't require a lot of memory, so we can afford to waste some,
 * by using gratuitously-sized slots.  If a tuple is larger than 1 kB, the
 * palloc() overhead is not significant anymore.
 *
 * 'nextfree' is valid when this chunk is in the free list.  When in use, the
 * slot holds a tuple.
 */
#define SLAB_SLOT_SIZE 1024

typedef union SlabSlot
{
	union SlabSlot *nextfree;
	char		buffer[SLAB_SLOT_SIZE];
} SlabSlot;

/*
 * Possible states of a Tuplesort object.  These denote the states that
 * persist between calls of Tuplesort routines.
 */
typedef enum
{
	TSS_INITIAL,				/* Loading tuples; still within memory limit */
	TSS_BOUNDED,				/* Loading tuples into bounded-size heap */
	TSS_BUILDRUNS,				/* Loading tuples; writing to tape */
	TSS_SORTEDINMEM,			/* Sort completed entirely in memory */
	TSS_SORTEDONTAPE,			/* Sort completed, final run is on tape */
	TSS_FINALMERGE,				/* Performing final merge on-the-fly */
} TupSortStatus;

/*
 * Parameters for calculation of number of tapes to use --- see inittapes()
 * and tuplesort_merge_order().
 *
 * In this calculation we assume that each tape will cost us about one block's
 * worth of buffer space.  This ignores the overhead of all the other data
 * structures needed for each tape, but it's probably close enough.
 *
 * MERGE_BUFFER_SIZE is how much buffer space we'd like to allocate for each
 * input tape, for pre-reading (see discussion at top of file).  This is *in
 * addition to* the 1 block already included in TAPE_BUFFER_OVERHEAD.
 */
#define MINORDER		6		/* minimum merge order */
#define MAXORDER		500		/* maximum merge order */
#define TAPE_BUFFER_OVERHEAD		BLCKSZ
#define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
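
/*
 * For a rough sense of scale (assuming the default 8 kB BLCKSZ):
 * TAPE_BUFFER_OVERHEAD is 8 kB and MERGE_BUFFER_SIZE is 256 kB, so each
 * input tape costs roughly 264 kB of workMem while merging.
 */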

/*
 * Private state of a Tuplesort operation.
 */
struct Tuplesortstate
{
	TuplesortPublic base;
	TupSortStatus status;		/* enumerated value as shown above */
	bool		bounded;		/* did caller specify a maximum number of
								 * tuples to return? */
	bool		boundUsed;		/* true if we made use of a bounded heap */
	int			bound;			/* if bounded, the maximum number of tuples */
	int64		tupleMem;		/* memory consumed by individual tuples.
								 * Storing this separately from what we track
								 * in availMem allows us to subtract the
								 * memory consumed by all tuples when dumping
								 * tuples to tape */
	int64		availMem;		/* remaining memory available, in bytes */
	int64		allowedMem;		/* total memory allowed, in bytes */
	int			maxTapes;		/* max number of input tapes to merge in each
								 * pass */
	int64		maxSpace;		/* maximum amount of space occupied among sort
								 * groups, either in-memory or on-disk */
	bool		isMaxSpaceDisk; /* true when maxSpace is the value for on-disk
								 * space, false when it is the value for
								 * in-memory space */
	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */

	/*
	 * This array holds the tuples now in sort memory.  If we are in state
	 * INITIAL, the tuples are in no particular order; if we are in state
	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
	 * H.  In state SORTEDONTAPE, the array is not used.
	 */
	SortTuple  *memtuples;		/* array of SortTuple structs */
	int			memtupcount;	/* number of tuples currently present */
	int			memtupsize;		/* allocated length of memtuples array */
	bool		growmemtuples;	/* memtuples' growth still underway? */

	/*
	 * Memory for tuples is sometimes allocated using a simple slab allocator,
	 * rather than with palloc().  Currently, we switch to slab allocation
	 * when we start merging.  Merging only needs to keep a small, fixed
	 * number of tuples in memory at any time, so we can avoid the
	 * palloc/pfree overhead by recycling a fixed number of fixed-size slots
	 * to hold the tuples.
	 *
	 * For the slab, we use one large allocation, divided into slots of
	 * SLAB_SLOT_SIZE bytes.  The allocation is sized to have one slot per
	 * tape, plus one additional slot.  We need that many slots to hold all
	 * the tuples kept in the heap during merge, plus the one most recently
	 * returned from the sort with tuplesort_gettuple.
	 *
	 * Initially, all the slots are kept in a linked list of free slots.  When
	 * a tuple is read from a tape, it is put into the next available slot, if
	 * it fits.  If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd
	 * instead.
	 *
	 * When we're done processing a tuple, we return the slot back to the free
	 * list, or pfree() it if it was palloc'd.  We know that a tuple was
	 * allocated from the slab if its pointer value is between slabMemoryBegin
	 * and slabMemoryEnd.
	 *
	 * When the slab allocator is used, the USEMEM/LACKMEM mechanism of
	 * tracking memory usage is not used.
	 */
	bool		slabAllocatorUsed;

	char	   *slabMemoryBegin;	/* beginning of slab memory arena */
	char	   *slabMemoryEnd;	/* end of slab memory arena */
	SlabSlot   *slabFreeHead;	/* head of free list */

	/* Memory used for input and output tape buffers. */
	size_t		tape_buffer_mem;

	/*
	 * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
	 * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE
	 * modes), we remember the tuple in 'lastReturnedTuple', so that we can
	 * recycle the memory on the next gettuple call.
	 */
	void	   *lastReturnedTuple;

	/*
	 * While building initial runs, this is the current output run number.
	 * Afterwards, it is the number of initial runs we made.
	 */
	int			currentRun;

	/*
	 * Logical tapes, for merging.
	 *
	 * The initial runs are written in the output tapes.  In each merge pass,
	 * the output tapes of the previous pass become the input tapes, and new
	 * output tapes are created as needed.  When nInputTapes equals
	 * nInputRuns, there is only one merge pass left.
	 */
	LogicalTape **inputTapes;
	int			nInputTapes;
	int			nInputRuns;

	LogicalTape **outputTapes;
	int			nOutputTapes;
	int			nOutputRuns;

	LogicalTape *destTape;		/* current output tape */

	/*
	 * These variables are used after completion of sorting to keep track of
	 * the next tuple to return.  (In the tape case, the tape's current read
	 * position is also critical state.)
	 */
	LogicalTape *result_tape;	/* actual tape of finished output */
	int			current;		/* array index (only used if SORTEDINMEM) */
	bool		eof_reached;	/* reached EOF (needed for cursors) */

	/* markpos_xxx holds marked position for mark and restore */
	int64		markpos_block;	/* tape block# (only used if SORTEDONTAPE) */
	int			markpos_offset; /* saved "current", or offset in tape block */
	bool		markpos_eof;	/* saved "eof_reached" */

	/*
	 * These variables are used during parallel sorting.
	 *
	 * worker is our worker identifier.  Follows the general convention that
	 * a value of -1 denotes a leader tuplesort, and values >= 0 denote
	 * worker tuplesorts.  (-1 can also be a serial tuplesort.)
	 *
	 * shared is mutable shared memory state, which is used to coordinate
	 * parallel sorts.
	 *
	 * nParticipants is the number of worker Tuplesortstates known by the
	 * leader to have actually been launched, which implies that they must
	 * finish a run that the leader needs to merge.  Typically includes a
	 * worker state held by the leader process itself.  Set in the leader
	 * Tuplesortstate only.
	 */
	int			worker;
	Sharedsort *shared;
	int			nParticipants;

	/*
	 * Additional state for managing "abbreviated key" sortsupport routines
	 * (which currently may be used by all cases except the hash index case).
	 * Tracks the intervals at which the optimization's effectiveness is
	 * tested.
	 */
	int64		abbrevNext;		/* Tuple # at which to next check
								 * applicability */

	/*
	 * Resource snapshot for time of sort start.
	 */
	PGRUsage	ru_start;
};

/*
 * Private mutable state of a parallel tuplesort operation.  This is
 * allocated in shared memory.
 */
struct Sharedsort
{
	/* mutex protects all fields prior to tapes */
	slock_t		mutex;

	/*
	 * currentWorker generates ordinal identifier numbers for parallel sort
	 * workers.  These start from 0, and are always gapless.
	 *
	 * Workers increment workersFinished to indicate having finished.  If this
	 * is equal to state.nParticipants within the leader, leader is ready to
	 * merge worker runs.
	 */
	int			currentWorker;
	int			workersFinished;

	/* Temporary file space */
	SharedFileSet fileset;

	/* Size of tapes flexible array */
	int			nTapes;

	/*
	 * Tapes array used by workers to report back information needed by the
	 * leader to concatenate all worker tapes into one for merging
	 */
	TapeShare	tapes[FLEXIBLE_ARRAY_MEMBER];
};

/*
 * Is the given tuple allocated from the slab memory arena?
 */
#define IS_SLAB_SLOT(state, tuple) \
	((char *) (tuple) >= (state)->slabMemoryBegin && \
	 (char *) (tuple) < (state)->slabMemoryEnd)

/*
 * Return the given tuple to the slab memory free list, or free it
 * if it was palloc'd.
 */
#define RELEASE_SLAB_SLOT(state, tuple) \
	do { \
		SlabSlot *buf = (SlabSlot *) tuple; \
		\
		if (IS_SLAB_SLOT((state), buf)) \
		{ \
			buf->nextfree = (state)->slabFreeHead; \
			(state)->slabFreeHead = buf; \
		} \
		else \
			pfree(buf); \
	} while(0)

#define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
#define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
#define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
#define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
#define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
#define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
#define USEMEM(state,amt)	((state)->availMem -= (amt))
#define FREEMEM(state,amt)	((state)->availMem += (amt))
#define SERIAL(state)		((state)->shared == NULL)
#define WORKER(state)		((state)->shared && (state)->worker != -1)
#define LEADER(state)		((state)->shared && (state)->worker == -1)

/*
 * NOTES about on-tape representation of tuples:
 *
 * We require the first "unsigned int" of a stored tuple to be the total size
 * on-tape of the tuple, including itself (so it is never zero; an all-zero
 * unsigned int is used to delimit runs).  The remainder of the stored tuple
 * may or may not match the in-memory representation of the tuple ---
 * any conversion needed is the job of the writetup and readtup routines.
 *
 * If state->sortopt contains TUPLESORT_RANDOMACCESS, then the stored
 * representation of the tuple must be followed by another "unsigned int" that
 * is a copy of the length --- so the total tape space used is actually
 * sizeof(unsigned int) more than the stored length value.  This allows
 * reading backwards.  When the random access flag was not specified, the
 * write/read routines may omit the extra length word.
 *
 * writetup is expected to write both length words as well as the tuple
 * data.  When readtup is called, the tape is positioned just after the
 * front length word; readtup must read the tuple data and advance past
 * the back length word (if present).
 *
 * The write/read routines can make use of the tuple description data
 * stored in the Tuplesortstate record, if needed.  They are also expected
 * to adjust state->availMem by the amount of memory space (not tape space!)
 * released or consumed.  There is no error return from either writetup
 * or readtup; they should ereport() on failure.
 *
 *
 * NOTES about memory consumption calculations:
 *
 * We count space allocated for tuples against the workMem limit, plus
 * the space used by the variable-size memtuples array.  Fixed-size space
 * is not counted; it's small enough to not be interesting.
 *
 * Note that we count actual space used (as shown by GetMemoryChunkSpace)
 * rather than the originally-requested size.  This is important since
 * palloc can add substantial overhead.  It's not a complete answer since
 * we won't count any wasted space in palloc allocation blocks, but it's
 * a lot better than what we were doing before 7.3.  As of 9.6, a
 * separate memory context is used for caller passed tuples.  Resetting
 * it at certain key increments significantly ameliorates fragmentation.
 * readtup routines use the slab allocator (they cannot use
 * the reset context because it gets deleted at the point that merging
 * begins).
 */
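
/*
 * For illustration, the on-tape layout of one stored tuple per the notes
 * above (the trailing length word is present only when
 * TUPLESORT_RANDOMACCESS was specified):
 *
 *	+--------------+----------------------+----------------------------+
 *	| unsigned int |      tuple data      | unsigned int (length copy, |
 *	| total length |  (writetup's format) |   for reading backwards)   |
 *	+--------------+----------------------+----------------------------+
 *
 * A run is delimited by an all-zero unsigned int length word.
 */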

static void tuplesort_begin_batch(Tuplesortstate *state);
static bool consider_abort_common(Tuplesortstate *state);
static void inittapes(Tuplesortstate *state, bool mergeruns);
static void inittapestate(Tuplesortstate *state, int maxTapes);
static void selectnewtape(Tuplesortstate *state);
static void init_slab_allocator(Tuplesortstate *state, int numSlots);
static void mergeruns(Tuplesortstate *state);
static void mergeonerun(Tuplesortstate *state);
static void beginmerge(Tuplesortstate *state);
static bool mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup);
static void dumptuples(Tuplesortstate *state, bool alltuples);
static void make_bounded_heap(Tuplesortstate *state);
static void sort_bounded_heap(Tuplesortstate *state);
static void tuplesort_sort_memtuples(Tuplesortstate *state);
static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);
static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
static void tuplesort_heap_delete_top(Tuplesortstate *state);
static void reversedirection(Tuplesortstate *state);
static unsigned int getlen(LogicalTape *tape, bool eofOK);
static void markrunend(LogicalTape *tape);
static int	worker_get_identifier(Tuplesortstate *state);
static void worker_freeze_result_tape(Tuplesortstate *state);
static void worker_nomergeruns(Tuplesortstate *state);
static void leader_takeover_tapes(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
static void tuplesort_free(Tuplesortstate *state);
static void tuplesort_updatemax(Tuplesortstate *state);

/*
 * Specialized comparators that we can inline into specialized sorts.  The goal
 * is to try to sort two tuples without having to follow the pointers to the
 * comparator or the tuple.
 *
 * XXX: For now, there is no specialization for cases where datum1 is
 * authoritative and we don't even need to fall back to a callback at all (that
 * would be true for types like int4/int8/timestamp/date, but not true for
 * abbreviations of text or multi-key sorts).  There could be!  Is it worth it?
 */

/* Used if first key's comparator is ssup_datum_unsigned_cmp */
static pg_attribute_always_inline int
qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
{
	int			compare;

	compare = ApplyUnsignedSortComparator(a->datum1, a->isnull1,
										  b->datum1, b->isnull1,
										  &state->base.sortKeys[0]);
	if (compare != 0)
		return compare;

	/*
	 * No need to waste effort calling the tiebreak function when there are no
	 * other keys to sort on.
	 */
	if (state->base.onlyKey != NULL)
		return 0;

	return state->base.comparetup_tiebreak(a, b, state);
}

/* Used if first key's comparator is ssup_datum_signed_cmp */
static pg_attribute_always_inline int
qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
{
	int			compare;

	compare = ApplySignedSortComparator(a->datum1, a->isnull1,
										b->datum1, b->isnull1,
										&state->base.sortKeys[0]);

	if (compare != 0)
		return compare;

	/*
	 * No need to waste effort calling the tiebreak function when there are no
	 * other keys to sort on.
	 */
	if (state->base.onlyKey != NULL)
		return 0;

	return state->base.comparetup_tiebreak(a, b, state);
}

/* Used if first key's comparator is ssup_datum_int32_cmp */
static pg_attribute_always_inline int
qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
{
	int			compare;

	compare = ApplyInt32SortComparator(a->datum1, a->isnull1,
									   b->datum1, b->isnull1,
									   &state->base.sortKeys[0]);

	if (compare != 0)
		return compare;

	/*
	 * No need to waste effort calling the tiebreak function when there are no
	 * other keys to sort on.
	 */
	if (state->base.onlyKey != NULL)
		return 0;

	return state->base.comparetup_tiebreak(a, b, state);
}

/*
 * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
 * any variant of SortTuples, using the appropriate comparetup function.
 * qsort_ssup() is specialized for the case where the comparetup function
 * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
 * and Datum sorts.  qsort_tuple_{unsigned,signed,int32} are specialized for
 * common comparison functions on pass-by-value leading datums.
 */

#define ST_SORT qsort_tuple_unsigned
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state)
#define ST_COMPARE_ARG_TYPE Tuplesortstate
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

#define ST_SORT qsort_tuple_signed
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE(a, b, state) qsort_tuple_signed_compare(a, b, state)
#define ST_COMPARE_ARG_TYPE Tuplesortstate
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

#define ST_SORT qsort_tuple_int32
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE(a, b, state) qsort_tuple_int32_compare(a, b, state)
#define ST_COMPARE_ARG_TYPE Tuplesortstate
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

#define ST_SORT qsort_tuple
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE_RUNTIME_POINTER
#define ST_COMPARE_ARG_TYPE Tuplesortstate
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DECLARE
#define ST_DEFINE
#include "lib/sort_template.h"

#define ST_SORT qsort_ssup
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE(a, b, ssup) \
	ApplySortComparator((a)->datum1, (a)->isnull1, \
						(b)->datum1, (b)->isnull1, (ssup))
#define ST_COMPARE_ARG_TYPE SortSupportData
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

/*
 * tuplesort_begin_xxx
 *
 * Initialize for a tuple sort operation.
 *
 * After calling tuplesort_begin, the caller should call tuplesort_putXXX
 * zero or more times, then call tuplesort_performsort when all the tuples
 * have been supplied.  After performsort, retrieve the tuples in sorted
 * order by calling tuplesort_getXXX until it returns false/NULL.  (If random
 * access was requested, rescan, markpos, and restorepos can also be called.)
 * Call tuplesort_end to terminate the operation and release memory/disk space.
 *
 * Each variant of tuplesort_begin has a workMem parameter specifying the
 * maximum number of kilobytes of RAM to use before spilling data to disk.
 * (The normal value of this parameter is work_mem, but some callers use
 * other values.)  Each variant also has a sortopt which is a bitmask of
 * sort options.  See TUPLESORT_* definitions in tuplesort.h.
 */
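
/*
 * A minimal sketch of that call sequence (tuplesort_begin_XXX,
 * tuplesort_putXXX, and tuplesort_getXXX stand for whichever concrete
 * variant from tuplesortvariants.c the caller uses; arguments elided):
 *
 *		Tuplesortstate *ts = tuplesort_begin_XXX(..., work_mem, NULL,
 *												 TUPLESORT_NONE);
 *		while (... more input ...)
 *			tuplesort_putXXX(ts, ...);
 *		tuplesort_performsort(ts);
 *		while (tuplesort_getXXX(ts, ...))
 *			... consume next tuple in sorted order ...
 *		tuplesort_end(ts);
 */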

Tuplesortstate *
tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
{
	Tuplesortstate *state;
	MemoryContext maincontext;
	MemoryContext sortcontext;
	MemoryContext oldcontext;

	/* See leader_takeover_tapes() remarks on random access support */
	if (coordinate && (sortopt & TUPLESORT_RANDOMACCESS))
		elog(ERROR, "random access disallowed under parallel sort");

	/*
	 * Memory context surviving tuplesort_reset.  This memory context holds
	 * data which is useful to keep while sorting multiple similar batches.
	 */
	maincontext = AllocSetContextCreate(CurrentMemoryContext,
										"TupleSort main",
										ALLOCSET_DEFAULT_SIZES);

	/*
	 * Create a working memory context for one sort operation.  The content of
	 * this context is deleted by tuplesort_reset.
	 */
	sortcontext = AllocSetContextCreate(maincontext,
										"TupleSort sort",
										ALLOCSET_DEFAULT_SIZES);

	/*
	 * Additionally a working memory context for tuples is set up in
	 * tuplesort_begin_batch.
	 */

	/*
	 * Make the Tuplesortstate within the per-sortstate context.  This way, we
	 * don't need a separate pfree() operation for it at shutdown.
	 */
	oldcontext = MemoryContextSwitchTo(maincontext);

	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));

	if (trace_sort)
		pg_rusage_init(&state->ru_start);

	state->base.sortopt = sortopt;
	state->base.tuples = true;
	state->abbrevNext = 10;

	/*
	 * workMem is forced to be at least 64KB, the current minimum valid value
	 * for the work_mem GUC.  This is a defense against parallel sort callers
	 * that divide out memory among many workers in a way that leaves each
	 * with very little memory.
	 */
	state->allowedMem = Max(workMem, 64) * (int64) 1024;
	state->base.sortcontext = sortcontext;
	state->base.maincontext = maincontext;

	/*
	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
	 * see comments in grow_memtuples().
	 */
	state->memtupsize = INITIAL_MEMTUPSIZE;
	state->memtuples = NULL;

	/*
	 * After all of the other non-parallel-related state, we set up all of the
	 * state needed for each batch.
	 */
	tuplesort_begin_batch(state);

	/*
	 * Initialize parallel-related state based on coordination information
	 * from caller
	 */
	if (!coordinate)
	{
		/* Serial sort */
		state->shared = NULL;
		state->worker = -1;
		state->nParticipants = -1;
	}
	else if (coordinate->isWorker)
	{
		/* Parallel worker produces exactly one final run from all input */
		state->shared = coordinate->sharedsort;
		state->worker = worker_get_identifier(state);
		state->nParticipants = -1;
	}
	else
	{
		/* Parallel leader state only used for final merge */
		state->shared = coordinate->sharedsort;
		state->worker = -1;
		state->nParticipants = coordinate->nParticipants;
		Assert(state->nParticipants >= 1);
	}

	MemoryContextSwitchTo(oldcontext);

	return state;
}

/*
 * tuplesort_begin_batch
 *
 * Setup, or reset, all state needed for processing a new set of tuples with
 * this sort state.  Called both from tuplesort_begin_common (the first time
 * sorting with this sort state) and tuplesort_reset (for subsequent usages).
 */
static void
tuplesort_begin_batch(Tuplesortstate *state)
{
	MemoryContext oldcontext;

	oldcontext = MemoryContextSwitchTo(state->base.maincontext);

	/*
	 * Caller tuple (e.g. IndexTuple) memory context.
	 *
	 * A dedicated child context used exclusively for caller passed tuples
	 * eases memory management.  Resetting at key points reduces
	 * fragmentation.  Note that the memtuples array of SortTuples is
	 * allocated in the parent context, not this context, because there is no
	 * need to free memtuples early.  For bounded sorts, tuples may be pfreed
	 * in any order, so we use a regular aset.c context so that it can make
	 * use of free'd memory.  When the sort is not bounded, we make use of a
	 * bump.c context as this keeps allocations more compact with less
	 * wastage.  Allocations are also slightly more CPU efficient.
	 */
	if (TupleSortUseBumpTupleCxt(state->base.sortopt))
		state->base.tuplecontext = BumpContextCreate(state->base.sortcontext,
													 "Caller tuples",
													 ALLOCSET_DEFAULT_SIZES);
	else
		state->base.tuplecontext = AllocSetContextCreate(state->base.sortcontext,
														 "Caller tuples",
														 ALLOCSET_DEFAULT_SIZES);

	state->status = TSS_INITIAL;
	state->bounded = false;
	state->boundUsed = false;

	state->availMem = state->allowedMem;

	state->tapeset = NULL;

	state->memtupcount = 0;

	/*
	 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
	 * see comments in grow_memtuples().
	 */
	state->growmemtuples = true;
	state->slabAllocatorUsed = false;
	if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
	{
		pfree(state->memtuples);
		state->memtuples = NULL;
		state->memtupsize = INITIAL_MEMTUPSIZE;
	}
	if (state->memtuples == NULL)
	{
		state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
		USEMEM(state, GetMemoryChunkSpace(state->memtuples));
	}

	/* workMem must be large enough for the minimal memtuples array */
	if (LACKMEM(state))
		elog(ERROR, "insufficient memory allowed for sort");

	state->currentRun = 0;

	/*
	 * Tape variables (inputTapes, outputTapes, etc.) will be initialized by
	 * inittapes(), if needed.
	 */

	state->result_tape = NULL;	/* flag that result tape has not been formed */

	MemoryContextSwitchTo(oldcontext);
}

/*
 * tuplesort_set_bound
 *
 *	Advise tuplesort that at most the first N result tuples are required.
 *
 * Must be called before inserting any tuples.  (Actually, we could allow it
 * as long as the sort hasn't spilled to disk, but there seems no need for
 * delayed calls at the moment.)
 *
 * This is a hint only.  The tuplesort may still return more tuples than
 * requested.  Parallel leader tuplesorts will always ignore the hint.
 */
void
tuplesort_set_bound(Tuplesortstate *state, int64 bound)
{
	/* Assert we're called before loading any tuples */
	Assert(state->status == TSS_INITIAL && state->memtupcount == 0);
	/* Assert we allow bounded sorts */
	Assert(state->base.sortopt & TUPLESORT_ALLOWBOUNDED);
	/* Can't set the bound twice, either */
	Assert(!state->bounded);
	/* Also, this shouldn't be called in a parallel worker */
	Assert(!WORKER(state));

	/* Parallel leader allows but ignores hint */
	if (LEADER(state))
		return;

#ifdef DEBUG_BOUNDED_SORT
	/* Honor GUC setting that disables the feature (for easy testing) */
	if (!optimize_bounded_sort)
		return;
#endif

	/* We want to be able to compute bound * 2, so limit the setting */
	if (bound > (int64) (INT_MAX / 2))
		return;

	state->bounded = true;
	state->bound = (int) bound;

	/*
	 * Bounded sorts are not an effective target for abbreviated key
	 * optimization.  Disable by setting state to be consistent with no
	 * abbreviation support.
	 */
	state->base.sortKeys->abbrev_converter = NULL;
	if (state->base.sortKeys->abbrev_full_comparator)
		state->base.sortKeys->comparator = state->base.sortKeys->abbrev_full_comparator;

	/* Not strictly necessary, but be tidy */
	state->base.sortKeys->abbrev_abort = NULL;
	state->base.sortKeys->abbrev_full_comparator = NULL;
}

/*
 * tuplesort_used_bound
 *
 * Allow callers to find out if the sort state was able to use a bound.
 */
bool
tuplesort_used_bound(Tuplesortstate *state)
{
	return state->boundUsed;
}

/*
 * tuplesort_free
 *
 *	Internal routine for freeing resources of tuplesort.
 */
static void
tuplesort_free(Tuplesortstate *state)
{
	/* context swap probably not needed, but let's be safe */
	MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
	int64		spaceUsed;

	if (state->tapeset)
		spaceUsed = LogicalTapeSetBlocks(state->tapeset);
	else
		spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;

	/*
	 * Delete temporary "tape" files, if any.
	 *
	 * We don't bother to destroy the individual tapes here.  They will go
	 * away with the sortcontext.  (In TSS_FINALMERGE state, we have closed
	 * finished tapes already.)
	 */
	if (state->tapeset)
		LogicalTapeSetClose(state->tapeset);

	if (trace_sort)
	{
		if (state->tapeset)
			elog(LOG, "%s of worker %d ended, %" PRId64 " disk blocks used: %s",
				 SERIAL(state) ? "external sort" : "parallel external sort",
				 state->worker, spaceUsed, pg_rusage_show(&state->ru_start));
		else
			elog(LOG, "%s of worker %d ended, %" PRId64 " KB used: %s",
				 SERIAL(state) ? "internal sort" : "unperformed parallel sort",
				 state->worker, spaceUsed, pg_rusage_show(&state->ru_start));
	}

	TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);

	FREESTATE(state);
	MemoryContextSwitchTo(oldcontext);

	/*
	 * Free the per-sort memory context, thereby releasing all working memory.
	 */
	MemoryContextReset(state->base.sortcontext);
}

/*
 * tuplesort_end
 *
 *	Release resources and clean up.
 *
 * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
 * pointing to garbage.  Be careful not to attempt to use or free such
 * pointers afterwards!
 */
void
tuplesort_end(Tuplesortstate *state)
{
	tuplesort_free(state);

	/*
	 * Free the main memory context, including the Tuplesortstate struct
	 * itself.
	 */
	MemoryContextDelete(state->base.maincontext);
}

/*
 * tuplesort_updatemax
 *
 *	Update maximum resource usage statistics.
 */
static void
tuplesort_updatemax(Tuplesortstate *state)
{
	int64		spaceUsed;
	bool		isSpaceDisk;

	/*
	 * Note: it might seem we should provide both memory and disk usage for a
	 * disk-based sort.  However, the current code doesn't track memory space
	 * accurately once we have begun to return tuples to the caller (since we
	 * don't account for pfree's the caller is expected to do), so we cannot
	 * rely on availMem in a disk sort.  This does not seem worth the overhead
	 * to fix.  Is it worth creating an API for the memory context code to
	 * tell us how much is actually used in sortcontext?
	 */
	if (state->tapeset)
	{
		isSpaceDisk = true;
		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
	}
	else
	{
		isSpaceDisk = false;
		spaceUsed = state->allowedMem - state->availMem;
	}

	/*
	 * Sort evicts data to the disk when it wasn't able to fit that data into
	 * main memory.  This is why we assume space used on the disk to be more
	 * important for tracking resource usage than space used in memory.  Note
	 * that the amount of space occupied by some tupleset on the disk might be
	 * less than the amount of space occupied by the same tupleset in memory,
	 * due to a more compact on-disk representation.
	 */
	if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
		(isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
	{
		state->maxSpace = spaceUsed;
		state->isMaxSpaceDisk = isSpaceDisk;
		state->maxSpaceStatus = state->status;
	}
}

/*
 * tuplesort_reset
 *
 *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
 *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
 *	a new sort.  This allows avoiding recreation of tuple sort states (and
 *	saves resources) when sorting multiple small batches.
 */
void
tuplesort_reset(Tuplesortstate *state)
{
	tuplesort_updatemax(state);
	tuplesort_free(state);

	/*
	 * After we've freed up per-batch memory, re-setup all of the state common
	 * to both the first batch and any subsequent batch.
	 */
	tuplesort_begin_batch(state);

	state->lastReturnedTuple = NULL;
	state->slabMemoryBegin = NULL;
	state->slabMemoryEnd = NULL;
	state->slabFreeHead = NULL;
}

/*
 * Grow the memtuples[] array, if possible within our memory constraint.  We
 * must not exceed INT_MAX tuples in memory or the caller-provided memory
 * limit.  Return true if we were able to enlarge the array, false if not.
 *
 * Normally, at each increment we double the size of the array.  When doing
 * that would exceed a limit, we attempt one last, smaller increase (and then
 * clear the growmemtuples flag so we don't try any more).  That allows us to
 * use memory as fully as permitted; sticking to the pure doubling rule could
 * result in almost half going unused.  Because availMem moves around with
 * tuple addition/removal, we need some rule to prevent making repeated small
 * increases in memtupsize, which would just be useless thrashing.  The
 * growmemtuples flag accomplishes that and also prevents useless
 * recalculations in this function.
 */
static bool
grow_memtuples(Tuplesortstate *state)
{
	int			newmemtupsize;
	int			memtupsize = state->memtupsize;
	int64		memNowUsed = state->allowedMem - state->availMem;

	/* Forget it if we've already maxed out memtuples, per comment above */
	if (!state->growmemtuples)
		return false;

	/* Select new value of memtupsize */
	if (memNowUsed <= state->availMem)
	{
		/*
		 * We've used no more than half of allowedMem; double our usage,
		 * clamping at INT_MAX tuples.
		 */
		if (memtupsize < INT_MAX / 2)
			newmemtupsize = memtupsize * 2;
		else
		{
			newmemtupsize = INT_MAX;
			state->growmemtuples = false;
		}
	}
	else
	{
		/*
		 * This will be the last increment of memtupsize.  Abandon doubling
		 * strategy and instead increase as much as we safely can.
		 *
		 * To stay within allowedMem, we can't increase memtupsize by more
		 * than availMem / sizeof(SortTuple) elements.  In practice, we want
		 * to increase it by considerably less, because we need to leave some
		 * space for the tuples to which the new array slots will refer.  We
		 * assume the new tuples will be about the same size as the tuples
		 * we've already seen, and thus we can extrapolate from the space
		 * consumption so far to estimate an appropriate new size for the
		 * memtuples array.  The optimal value might be higher or lower than
		 * this estimate, but it's hard to know that in advance.  We again
		 * clamp at INT_MAX tuples.
		 *
		 * This calculation is safe against enlarging the array so much that
		 * LACKMEM becomes true, because the memory currently used includes
		 * the present array; thus, there would be enough allowedMem for the
		 * new array elements even if no other memory were currently used.
		 *
		 * We do the arithmetic in float8, because otherwise the product of
		 * memtupsize and allowedMem could overflow.  Any inaccuracy in the
		 * result should be insignificant; but even if we computed a
		 * completely insane result, the checks below will prevent anything
		 * really bad from happening.
		 */
		double		grow_ratio;

		grow_ratio = (double) state->allowedMem / (double) memNowUsed;
		if (memtupsize * grow_ratio < INT_MAX)
			newmemtupsize = (int) (memtupsize * grow_ratio);
		else
			newmemtupsize = INT_MAX;

		/* We won't make any further enlargement attempts */
		state->growmemtuples = false;
	}

	/* Must enlarge array by at least one element, else report failure */
	if (newmemtupsize <= memtupsize)
		goto noalloc;

	/*
	 * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize.  Clamp
	 * to ensure our request won't be rejected.  Note that we can easily
	 * exhaust address space before facing this outcome.  (This is presently
	 * impossible due to guc.c's MAX_KILOBYTES limitation on work_mem, but
	 * don't rely on that at this distance.)
	 */
	if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(SortTuple))
	{
		newmemtupsize = (int) (MaxAllocHugeSize / sizeof(SortTuple));
		state->growmemtuples = false;	/* can't grow any more */
	}

	/*
	 * We need to be sure that we do not cause LACKMEM to become true, else
	 * the space management algorithm will go nuts.  The code above should
	 * never generate a dangerous request, but to be safe, check explicitly
	 * that the array growth fits within availMem.  (We could still cause
	 * LACKMEM if the memory chunk overhead associated with the memtuples
	 * array were to increase.  That shouldn't happen because we chose the
	 * initial array size large enough to ensure that palloc will be treating
	 * both old and new arrays as separate chunks.  But we'll check LACKMEM
	 * explicitly below just in case.)
	 */
	if (state->availMem < (int64) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
		goto noalloc;

	/* OK, do it */
	FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
	state->memtupsize = newmemtupsize;
	state->memtuples = (SortTuple *)
		repalloc_huge(state->memtuples,
					  state->memtupsize * sizeof(SortTuple));
	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
	if (LACKMEM(state))
		elog(ERROR, "unexpected out-of-memory situation in tuplesort");
	return true;

noalloc:
	/* If for any reason we didn't realloc, shut off future attempts */
	state->growmemtuples = false;
	return false;
}

/*
 * Shared code for tuple and datum cases.
 */
void
tuplesort_puttuple_common(Tuplesortstate *state, SortTuple *tuple,
						  bool useAbbrev, Size tuplen)
{
	MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

	Assert(!LEADER(state));

	/* account for the memory used for this tuple */
	USEMEM(state, tuplen);
	state->tupleMem += tuplen;

	if (!useAbbrev)
	{
		/*
		 * Leave ordinary Datum representation, or NULL value.  If there is a
		 * converter it won't expect NULL values, and cost model is not
		 * required to account for NULL, so in that case we avoid calling
		 * converter and just set datum1 to zeroed representation (to be
		 * consistent, and to support cheap inequality tests for NULL
		 * abbreviated keys).
		 */
	}
	else if (!consider_abort_common(state))
	{
		/* Store abbreviated key representation */
		tuple->datum1 = state->base.sortKeys->abbrev_converter(tuple->datum1,
															   state->base.sortKeys);
	}
	else
	{
		/*
		 * Set state to be consistent with never trying abbreviation.
		 *
		 * Alter datum1 representation in already-copied tuples, so as to
		 * ensure a consistent representation (current tuple was just
		 * handled).  It does not matter if some dumped tuples are already
		 * sorted on tape, since serialized tuples lack abbreviated keys
		 * (TSS_BUILDRUNS state prevents control reaching here in any case).
		 */
		REMOVEABBREV(state, state->memtuples, state->memtupcount);
	}

	switch (state->status)
	{
		case TSS_INITIAL:

			/*
			 * Save the tuple into the unsorted array.  First, grow the array
			 * as needed.  Note that we try to grow the array when there is
			 * still one free slot remaining --- if we fail, there'll still be
			 * room to store the incoming tuple, and then we'll switch to
			 * tape-based operation.
			 */
			if (state->memtupcount >= state->memtupsize - 1)
			{
				(void) grow_memtuples(state);
				Assert(state->memtupcount < state->memtupsize);
			}
			state->memtuples[state->memtupcount++] = *tuple;

			/*
			 * Check if it's time to switch over to a bounded heapsort.  We do
			 * so if the input tuple count exceeds twice the desired tuple
			 * count (this is a heuristic for where heapsort becomes cheaper
			 * than a quicksort), or if we've just filled workMem and have
			 * enough tuples to meet the bound.
			 *
			 * Note that once we enter TSS_BOUNDED state we will always try to
			 * complete the sort that way.  In the worst case, if later input
			 * tuples are larger than earlier ones, this might cause us to
			 * exceed workMem significantly.
			 */
			if (state->bounded &&
				(state->memtupcount > state->bound * 2 ||
				 (state->memtupcount > state->bound && LACKMEM(state))))
			{
				if (trace_sort)
					elog(LOG, "switching to bounded heapsort at %d tuples: %s",
						 state->memtupcount,
						 pg_rusage_show(&state->ru_start));
				make_bounded_heap(state);
				MemoryContextSwitchTo(oldcontext);
				return;
			}

			/*
			 * Done if we still fit in available memory and have array slots.
			 */
			if (state->memtupcount < state->memtupsize && !LACKMEM(state))
			{
				MemoryContextSwitchTo(oldcontext);
				return;
			}

			/*
			 * Nope; time to switch to tape-based operation.
			 */
			inittapes(state, true);

			/*
			 * Dump all tuples.
			 */
			dumptuples(state, false);
			break;

		case TSS_BOUNDED:

			/*
			 * We don't want to grow the array here, so check whether the new
			 * tuple can be discarded before putting it in.  This should be a
			 * good speed optimization, too, since when there are many more
			 * input tuples than the bound, most input tuples can be discarded
			 * with just this one comparison.  Note that because we currently
			 * have the sort direction reversed, we must check for <= not >=.
			 */
			if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
			{
				/* new tuple <= top of the heap, so we can discard it */
				free_sort_tuple(state, tuple);
				CHECK_FOR_INTERRUPTS();
			}
			else
			{
				/* discard top of heap, replacing it with the new tuple */
				free_sort_tuple(state, &state->memtuples[0]);
				tuplesort_heap_replace_top(state, tuple);
			}
			break;

		case TSS_BUILDRUNS:

			/*
			 * Save the tuple into the unsorted array (there must be space)
			 */
			state->memtuples[state->memtupcount++] = *tuple;

			/*
			 * If we are over the memory limit, dump all tuples.
			 */
			dumptuples(state, false);
			break;

		default:
			elog(ERROR, "invalid tuplesort state");
			break;
	}
	MemoryContextSwitchTo(oldcontext);
}

static bool
consider_abort_common(Tuplesortstate *state)
{
	Assert(state->base.sortKeys[0].abbrev_converter != NULL);
	Assert(state->base.sortKeys[0].abbrev_abort != NULL);
	Assert(state->base.sortKeys[0].abbrev_full_comparator != NULL);

	/*
	 * Check effectiveness of abbreviation optimization.  Consider aborting
	 * when still within memory limit.
	 */
	if (state->status == TSS_INITIAL &&
		state->memtupcount >= state->abbrevNext)
	{
		state->abbrevNext *= 2;

		/*
		 * Check opclass-supplied abbreviation abort routine.  It may indicate
		 * that abbreviation should not proceed.
		 */
		if (!state->base.sortKeys->abbrev_abort(state->memtupcount,
												state->base.sortKeys))
			return false;

		/*
		 * Finally, restore authoritative comparator, and indicate that
		 * abbreviation is not in play by setting abbrev_converter to NULL
		 */
		state->base.sortKeys[0].comparator = state->base.sortKeys[0].abbrev_full_comparator;
		state->base.sortKeys[0].abbrev_converter = NULL;
		/* Not strictly necessary, but be tidy */
		state->base.sortKeys[0].abbrev_abort = NULL;
		state->base.sortKeys[0].abbrev_full_comparator = NULL;

		/* Give up - expect original pass-by-value representation */
		return true;
	}

	return false;
}

/*
 * All tuples have been provided; finish the sort.
 */
void
tuplesort_performsort(Tuplesortstate *state)
{
	MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

	if (trace_sort)
		elog(LOG, "performsort of worker %d starting: %s",
			 state->worker, pg_rusage_show(&state->ru_start));

	switch (state->status)
	{
		case TSS_INITIAL:

			/*
			 * We were able to accumulate all the tuples within the allowed
			 * amount of memory, or the leader will take over the worker tapes
			 */
			if (SERIAL(state))
			{
				/* Just qsort 'em and we're done */
				tuplesort_sort_memtuples(state);
				state->status = TSS_SORTEDINMEM;
			}
			else if (WORKER(state))
			{
				/*
				 * Parallel workers must still dump out tuples to tape.  No
				 * merge is required to produce single output run, though.
				 */
				inittapes(state, false);
				dumptuples(state, true);
				worker_nomergeruns(state);
				state->status = TSS_SORTEDONTAPE;
			}
			else
			{
				/*
				 * Leader will take over worker tapes and merge worker runs.
				 * Note that mergeruns sets the correct state->status.
				 */
				leader_takeover_tapes(state);
				mergeruns(state);
			}
			state->current = 0;
			state->eof_reached = false;
			state->markpos_block = 0L;
			state->markpos_offset = 0;
			state->markpos_eof = false;
			break;

		case TSS_BOUNDED:

			/*
			 * We were able to accumulate all the tuples required for output
			 * in memory, using a heap to eliminate excess tuples.  Now we
			 * have to transform the heap to a properly-sorted array.  Note
			 * that sort_bounded_heap sets the correct state->status.
			 */
			sort_bounded_heap(state);
			state->current = 0;
			state->eof_reached = false;
			state->markpos_offset = 0;
			state->markpos_eof = false;
			break;

		case TSS_BUILDRUNS:

			/*
			 * Finish tape-based sort.  First, flush all tuples remaining in
			 * memory out to tape; then merge until we have a single remaining
			 * run (or, if !randomAccess and !WORKER(), one run per tape).
			 * Note that mergeruns sets the correct state->status.
			 */
			dumptuples(state, true);
			mergeruns(state);
			state->eof_reached = false;
			state->markpos_block = 0L;
			state->markpos_offset = 0;
			state->markpos_eof = false;
			break;

		default:
			elog(ERROR, "invalid tuplesort state");
			break;
	}

	if (trace_sort)
	{
		if (state->status == TSS_FINALMERGE)
			elog(LOG, "performsort of worker %d done (except %d-way final merge): %s",
				 state->worker, state->nInputTapes,
				 pg_rusage_show(&state->ru_start));
		else
			elog(LOG, "performsort of worker %d done: %s",
				 state->worker, pg_rusage_show(&state->ru_start));
	}

	MemoryContextSwitchTo(oldcontext);
}
1457
1458/*
1459 * Internal routine to fetch the next tuple in either forward or back
1460 * direction into *stup. Returns false if no more tuples.
1461 * Returned tuple belongs to tuplesort memory context, and must not be freed
1462 * by caller. Note that fetched tuple is stored in memory that may be
1463 * recycled by any future fetch.
1464 */
1465bool
1467 SortTuple *stup)
1468{
1469 unsigned int tuplen;
1470 size_t nmoved;
1471
1472 Assert(!WORKER(state));
1473
1474 switch (state->status)
1475 {
1476 case TSS_SORTEDINMEM:
1477 Assert(forward || state->base.sortopt & TUPLESORT_RANDOMACCESS);
1478 Assert(!state->slabAllocatorUsed);
1479 if (forward)
1480 {
1481 if (state->current < state->memtupcount)
1482 {
1483 *stup = state->memtuples[state->current++];
1484 return true;
1485 }
1486 state->eof_reached = true;
1487
1488 /*
1489 * Complain if caller tries to retrieve more tuples than
1490 * originally asked for in a bounded sort. This is because
1491 * returning EOF here might be the wrong thing.
1492 */
1493 if (state->bounded && state->current >= state->bound)
1494 elog(ERROR, "retrieved too many tuples in a bounded sort");
1495
1496 return false;
1497 }
1498 else
1499 {
1500 if (state->current <= 0)
1501 return false;
1502
1503 /*
1504 * if all tuples are fetched already then we return last
1505 * tuple, else - tuple before last returned.
1506 */
1507 if (state->eof_reached)
1508 state->eof_reached = false;
1509 else
1510 {
1511 state->current--; /* last returned tuple */
1512 if (state->current <= 0)
1513 return false;
1514 }
1515 *stup = state->memtuples[state->current - 1];
1516 return true;
1517 }
1518 break;
1519
1520 case TSS_SORTEDONTAPE:
1521 Assert(forward || state->base.sortopt & TUPLESORT_RANDOMACCESS);
1522 Assert(state->slabAllocatorUsed);
1523
1524 /*
1525 * The slot that held the tuple that we returned in previous
1526 * gettuple call can now be reused.
1527 */
1528 if (state->lastReturnedTuple)
1529 {
1530 RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
1531 state->lastReturnedTuple = NULL;
1532 }
1533
1534 if (forward)
1535 {
1536 if (state->eof_reached)
1537 return false;
1538
1539 if ((tuplen = getlen(state->result_tape, true)) != 0)
1540 {
1541 READTUP(state, stup, state->result_tape, tuplen);
1542
1543 /*
1544 * Remember the tuple we return, so that we can recycle
1545 * its memory on next call. (This can be NULL, in the
1546 * !state->tuples case).
1547 */
1548 state->lastReturnedTuple = stup->tuple;
1549
1550 return true;
1551 }
1552 else
1553 {
1554 state->eof_reached = true;
1555 return false;
1556 }
1557 }
1558
1559 /*
1560 * Backward.
1561 *
1562 * if all tuples are fetched already then we return last tuple,
1563 * else - tuple before last returned.
1564 */
1565 if (state->eof_reached)
1566 {
1567 /*
1568 * Seek position is pointing just past the zero tuplen at the
1569 * end of file; back up to fetch last tuple's ending length
1570 * word. If seek fails we must have a completely empty file.
1571 */
1572 nmoved = LogicalTapeBackspace(state->result_tape,
1573 2 * sizeof(unsigned int));
1574 if (nmoved == 0)
1575 return false;
1576 else if (nmoved != 2 * sizeof(unsigned int))
1577 elog(ERROR, "unexpected tape position");
1578 state->eof_reached = false;
1579 }
1580 else
1581 {
1582 /*
1583 * Back up and fetch previously-returned tuple's ending length
1584 * word. If seek fails, assume we are at start of file.
1585 */
1586 nmoved = LogicalTapeBackspace(state->result_tape,
1587 sizeof(unsigned int));
1588 if (nmoved == 0)
1589 return false;
1590 else if (nmoved != sizeof(unsigned int))
1591 elog(ERROR, "unexpected tape position");
1592 tuplen = getlen(state->result_tape, false);
1593
1594 /*
1595 * Back up to get ending length word of tuple before it.
1596 */
1597 nmoved = LogicalTapeBackspace(state->result_tape,
1598 tuplen + 2 * sizeof(unsigned int));
1599 if (nmoved == tuplen + sizeof(unsigned int))
1600 {
1601 /*
1602 * We backed up over the previous tuple, but there was no
1603 * ending length word before it. That means that the prev
1604 * tuple is the first tuple in the file. It is now the
1605 * next to read in forward direction (not obviously right,
1606 * but that is what in-memory case does).
1607 */
1608 return false;
1609 }
1610 else if (nmoved != tuplen + 2 * sizeof(unsigned int))
1611 elog(ERROR, "bogus tuple length in backward scan");
1612 }
1613
1614 tuplen = getlen(state->result_tape, false);
1615
1616 /*
1617 * Now we have the length of the prior tuple, back up and read it.
1618 * Note: READTUP expects we are positioned after the initial
1619 * length word of the tuple, so back up to that point.
1620 */
1621 nmoved = LogicalTapeBackspace(state->result_tape,
1622 tuplen);
1623 if (nmoved != tuplen)
1624 elog(ERROR, "bogus tuple length in backward scan");
1625 READTUP(state, stup, state->result_tape, tuplen);
1626
1627 /*
1628 * Remember the tuple we return, so that we can recycle its memory
1629 * on next call. (This can be NULL, in the Datum case).
1630 */
1631 state->lastReturnedTuple = stup->tuple;
1632
1633 return true;
1634
1635 case TSS_FINALMERGE:
1636 Assert(forward);
1637 /* We are managing memory ourselves, with the slab allocator. */
1638 Assert(state->slabAllocatorUsed);
1639
1640 /*
1641 * The slab slot holding the tuple that we returned in previous
1642 * gettuple call can now be reused.
1643 */
1644 if (state->lastReturnedTuple)
1645 {
1646 RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
1647 state->lastReturnedTuple = NULL;
1648 }
1649
1650 /*
1651 * This code should match the inner loop of mergeonerun().
1652 */
1653 if (state->memtupcount > 0)
1654 {
1655 int srcTapeIndex = state->memtuples[0].srctape;
1656 LogicalTape *srcTape = state->inputTapes[srcTapeIndex];
1657 SortTuple newtup;
1658
1659 *stup = state->memtuples[0];
1660
1661 /*
1662 * Remember the tuple we return, so that we can recycle its
1663 * memory on next call. (This can be NULL, in the Datum case).
1664 */
1665 state->lastReturnedTuple = stup->tuple;
1666
1667 /*
1668 * Pull next tuple from tape, and replace the returned tuple
1669 * at top of the heap with it.
1670 */
1671 if (!mergereadnext(state, srcTape, &newtup))
1672 {
1673 /*
1674 * If no more data, we've reached end of run on this tape.
1675 * Remove the top node from the heap.
1676 */
1677 tuplesort_heap_delete_top(state);
1678 state->nInputRuns--;
1679
1680 /*
1681 * Close the tape. It'd go away at the end of the sort
1682 * anyway, but better to release the memory early.
1683 */
1684 LogicalTapeClose(srcTape);
1685 return true;
1686 }
1687 newtup.srctape = srcTapeIndex;
1688 tuplesort_heap_replace_top(state, &newtup);
1689 return true;
1690 }
1691 return false;
1692
1693 default:
1694 elog(ERROR, "invalid tuplesort state");
1695 return false; /* keep compiler quiet */
1696 }
1697}
1698
1699
1700/*
1701 * Advance over N tuples in either forward or back direction,
1702 * without returning any data. N==0 is a no-op.
1703 * Returns true if successful, false if ran out of tuples.
1704 */
1705bool
1706tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
1707{
1708 MemoryContext oldcontext;
1709
1710 /*
1711 * We don't actually support backwards skip yet, because no callers need
1712 * it. The API is designed to allow for that later, though.
1713 */
1714 Assert(forward);
1715 Assert(ntuples >= 0);
1716 Assert(!WORKER(state));
1717
1718 switch (state->status)
1719 {
1720 case TSS_SORTEDINMEM:
1721 if (state->memtupcount - state->current >= ntuples)
1722 {
1723 state->current += ntuples;
1724 return true;
1725 }
1726 state->current = state->memtupcount;
1727 state->eof_reached = true;
1728
1729 /*
1730 * Complain if caller tries to retrieve more tuples than
1731 * originally asked for in a bounded sort. This is because
1732 * returning EOF here might be the wrong thing.
1733 */
1734 if (state->bounded && state->current >= state->bound)
1735 elog(ERROR, "retrieved too many tuples in a bounded sort");
1736
1737 return false;
1738
1739 case TSS_SORTEDONTAPE:
1740 case TSS_FINALMERGE:
1741
1742 /*
1743 * We could probably optimize these cases better, but for now it's
1744 * not worth the trouble.
1745 */
1746 oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
1747 while (ntuples-- > 0)
1748 {
1749 SortTuple stup;
1750
1751 if (!tuplesort_gettuple_common(state, forward, &stup))
1752 {
1753 MemoryContextSwitchTo(oldcontext);
1754 return false;
1755 }
1756 CHECK_FOR_INTERRUPTS();
1757 }
1758 MemoryContextSwitchTo(oldcontext);
1759 return true;
1760
1761 default:
1762 elog(ERROR, "invalid tuplesort state");
1763 return false; /* keep compiler quiet */
1764 }
1765}
1766
1767/*
1768 * tuplesort_merge_order - report merge order we'll use for given memory
1769 * (note: "merge order" just means the number of input tapes in the merge).
1770 *
1771 * This is exported for use by the planner. allowedMem is in bytes.
1772 */
1773int
1774tuplesort_merge_order(int64 allowedMem)
1775{
1776 int mOrder;
1777
1778 /*----------
1779 * In the merge phase, we need buffer space for each input and output tape.
1780 * Each pass in the balanced merge algorithm reads from M input tapes, and
1781 * writes to N output tapes. Each tape consumes TAPE_BUFFER_OVERHEAD bytes
1782 * of memory. In addition to that, we want MERGE_BUFFER_SIZE workspace per
1783 * input tape.
1784 *
1785 * totalMem = M * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE) +
1786 * N * TAPE_BUFFER_OVERHEAD
1787 *
1788 * Except for the last and next-to-last merge passes, where there can be
1789 * fewer tapes left to process, M = N. We choose M so that we have the
1790 * desired amount of memory available for the input buffers
1791 * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE), given the total memory
1792 * available for the tape buffers (allowedMem).
1793 *
1794 * Note: you might be thinking we need to account for the memtuples[]
1795 * array in this calculation, but we effectively treat that as part of the
1796 * MERGE_BUFFER_SIZE workspace.
1797 *----------
1798 */
1799 mOrder = allowedMem /
1800 (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE);
1801
1802 /*
1803 * Even in minimum memory, use at least a MINORDER merge. On the other
1804 * hand, even when we have lots of memory, do not use more than a MAXORDER
1805 * merge. Tapes are pretty cheap, but they're not entirely free. Each
1806 * additional tape reduces the amount of memory available to build runs,
1807 * which in turn can cause the same sort to need more runs, which makes
1808 * merging slower even if it can still be done in a single pass. Also,
1809 * high order merges are quite slow due to CPU cache effects; it can be
1810 * faster to pay the I/O cost of a multi-pass merge than to perform a
1811 * single merge pass across many hundreds of tapes.
1812 */
1813 mOrder = Max(mOrder, MINORDER);
1814 mOrder = Min(mOrder, MAXORDER);
1815
1816 return mOrder;
1817}
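/*
 * Worked example (illustrative figures only, assuming the default 8 kB
 * BLCKSZ, so TAPE_BUFFER_OVERHEAD = 8 kB and MERGE_BUFFER_SIZE = 256 kB):
 * with allowedMem = 64 MB,
 *
 *     mOrder = 65536 kB / (2 * 8 kB + 256 kB) = 240
 *
 * which falls between MINORDER (6) and MAXORDER (500), so a 240-way
 * merge would be used.
 */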
1818
1819/*
1820 * Helper function to calculate how much memory to allocate for the read buffer
1821 * of each input tape in a merge pass.
1822 *
1823 * 'avail_mem' is the amount of memory available for the buffers of all the
1824 * tapes, both input and output.
1825 * 'nInputTapes' and 'nInputRuns' are the number of input tapes and runs.
1826 * 'maxOutputTapes' is the max. number of output tapes we should produce.
1827 */
1828static int64
1829merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
1830 int maxOutputTapes)
1831{
1832 int nOutputRuns;
1833 int nOutputTapes;
1834
1835 /*
1836 * How many output tapes will we produce in this pass?
1837 *
1838 * This is nInputRuns / nInputTapes, rounded up.
1839 */
1840 nOutputRuns = (nInputRuns + nInputTapes - 1) / nInputTapes;
1841
1842 nOutputTapes = Min(nOutputRuns, maxOutputTapes);
1843
1844 /*
1845 * Each output tape consumes TAPE_BUFFER_OVERHEAD bytes of memory. All
1846 * remaining memory is divided evenly between the input tapes.
1847 *
1848 * This also follows from the formula in tuplesort_merge_order, but here
1849 * we derive the input buffer size from the amount of memory available,
1850 * and M and N.
1851 */
1852 return Max((avail_mem - TAPE_BUFFER_OVERHEAD * nOutputTapes) / nInputTapes, 0);
1853}
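/*
 * Worked example (illustrative figures only): with avail_mem = 4 MB,
 * nInputTapes = 6, nInputRuns = 13 and maxOutputTapes = 6, we get
 * nOutputRuns = ceil(13 / 6) = 3, hence nOutputTapes = 3, and with an
 * 8 kB TAPE_BUFFER_OVERHEAD each input tape receives roughly
 * (4096 kB - 3 * 8 kB) / 6 = 678 kB of read buffer.
 */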
1854
1855/*
1856 * inittapes - initialize for tape sorting.
1857 *
1858 * This is called only if we have found we won't sort in memory.
1859 */
1860static void
1861inittapes(Tuplesortstate *state, bool mergeruns)
1862{
1863 Assert(!LEADER(state));
1864
1865 if (mergeruns)
1866 {
1867 /* Compute number of input tapes to use when merging */
1868 state->maxTapes = tuplesort_merge_order(state->allowedMem);
1869 }
1870 else
1871 {
1872 /* Workers can sometimes produce a single run, output without merge */
1873 Assert(WORKER(state));
1874 state->maxTapes = MINORDER;
1875 }
1876
1877 if (trace_sort)
1878 elog(LOG, "worker %d switching to external sort with %d tapes: %s",
1879 state->worker, state->maxTapes, pg_rusage_show(&state->ru_start));
1880
1881 /* Create the tape set */
1882 inittapestate(state, state->maxTapes);
1883 state->tapeset =
1884 LogicalTapeSetCreate(false,
1885 state->shared ? &state->shared->fileset : NULL,
1886 state->worker);
1887
1888 state->currentRun = 0;
1889
1890 /*
1891 * Initialize logical tape arrays.
1892 */
1893 state->inputTapes = NULL;
1894 state->nInputTapes = 0;
1895 state->nInputRuns = 0;
1896
1897 state->outputTapes = palloc0(state->maxTapes * sizeof(LogicalTape *));
1898 state->nOutputTapes = 0;
1899 state->nOutputRuns = 0;
1900
1901 state->status = TSS_BUILDRUNS;
1902
1904}
1905
1906/*
1907 * inittapestate - initialize generic tape management state
1908 */
1909static void
1910inittapestate(Tuplesortstate *state, int maxTapes)
1911{
1912 int64 tapeSpace;
1913
1914 /*
1915 * Decrease availMem to reflect the space needed for tape buffers; but
1916 * don't decrease it to the point that we have no room for tuples. (That
1917 * case is only likely to occur if sorting pass-by-value Datums; in all
1918 * other scenarios the memtuples[] array is unlikely to occupy more than
1919 * half of allowedMem. In the pass-by-value case it's not important to
1920 * account for tuple space, so we don't care if LACKMEM becomes
1921 * inaccurate.)
1922 */
1923 tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
1924
1925 if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
1926 USEMEM(state, tapeSpace);
1927
1928 /*
1929 * Make sure that the temp file(s) underlying the tape set are created in
1930 * suitable temp tablespaces. For parallel sorts, this should have been
1931 * called already, but it doesn't matter if it is called a second time.
1932 */
1933 PrepareTempTablespaces();
1934}
1935
1936/*
1937 * selectnewtape -- select next tape to output to.
1938 *
1939 * This is called after finishing a run when we know another run
1940 * must be started. This is used both when building the initial
1941 * runs, and during merge passes.
1942 */
1943static void
1944selectnewtape(Tuplesortstate *state)
1945{
1946 /*
1947 * At the beginning of each merge pass, nOutputTapes and nOutputRuns are
1948 * both zero. On each call, we create a new output tape to hold the next
1949 * run, until maxTapes is reached. After that, we assign new runs to the
1950 * existing tapes in a round robin fashion.
1951 */
1952 if (state->nOutputTapes < state->maxTapes)
1953 {
1954 /* Create a new tape to hold the next run */
1955 Assert(state->outputTapes[state->nOutputRuns] == NULL);
1956 Assert(state->nOutputRuns == state->nOutputTapes);
1957 state->destTape = LogicalTapeCreate(state->tapeset);
1958 state->outputTapes[state->nOutputTapes] = state->destTape;
1959 state->nOutputTapes++;
1960 state->nOutputRuns++;
1961 }
1962 else
1963 {
1964 /*
1965 * We have reached the max number of tapes. Append to an existing
1966 * tape.
1967 */
1968 state->destTape = state->outputTapes[state->nOutputRuns % state->nOutputTapes];
1969 state->nOutputRuns++;
1970 }
1971}
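/*
 * For example, with maxTapes = 4 the first four runs get fresh tapes
 * 0, 1, 2 and 3; the fifth run then lands on tape 4 % 4 = 0, the sixth
 * on tape 1, and so on round-robin.
 */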
1972
1973/*
1974 * Initialize the slab allocation arena, for the given number of slots.
1975 */
1976static void
1977init_slab_allocator(Tuplesortstate *state, int numSlots)
1978{
1979 if (numSlots > 0)
1980 {
1981 char *p;
1982 int i;
1983
1984 state->slabMemoryBegin = palloc(numSlots * SLAB_SLOT_SIZE);
1985 state->slabMemoryEnd = state->slabMemoryBegin +
1986 numSlots * SLAB_SLOT_SIZE;
1987 state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;
1988 USEMEM(state, numSlots * SLAB_SLOT_SIZE);
1989
1990 p = state->slabMemoryBegin;
1991 for (i = 0; i < numSlots - 1; i++)
1992 {
1993 ((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);
1994 p += SLAB_SLOT_SIZE;
1995 }
1996 ((SlabSlot *) p)->nextfree = NULL;
1997 }
1998 else
1999 {
2000 state->slabMemoryBegin = state->slabMemoryEnd = NULL;
2001 state->slabFreeHead = NULL;
2002 }
2003 state->slabAllocatorUsed = true;
2004}
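/*
 * The resulting arena is one palloc'd block carved into equal
 * SLAB_SLOT_SIZE pieces whose first bytes link them together, so both
 * allocation and release are single pointer swaps:
 *
 *     slabFreeHead -> [slot 0] -> [slot 1] -> ... -> [slot n-1] -> NULL
 */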
2005
2006/*
2007 * mergeruns -- merge all the completed initial runs.
2008 *
2009 * This implements the Balanced k-Way Merge Algorithm. All input data has
2010 * already been written to initial runs on tape (see dumptuples).
2011 */
2012static void
2013mergeruns(Tuplesortstate *state)
2014{
2015 int tapenum;
2016
2017 Assert(state->status == TSS_BUILDRUNS);
2018 Assert(state->memtupcount == 0);
2019
2020 if (state->base.sortKeys != NULL && state->base.sortKeys->abbrev_converter != NULL)
2021 {
2022 /*
2023 * If there are multiple runs to be merged, when we go to read back
2024 * tuples from disk, abbreviated keys will not have been stored, and
2025 * we don't care to regenerate them. Disable abbreviation from this
2026 * point on.
2027 */
2028 state->base.sortKeys->abbrev_converter = NULL;
2029 state->base.sortKeys->comparator = state->base.sortKeys->abbrev_full_comparator;
2030
2031 /* Not strictly necessary, but be tidy */
2032 state->base.sortKeys->abbrev_abort = NULL;
2033 state->base.sortKeys->abbrev_full_comparator = NULL;
2034 }
2035
2036 /*
2037 * Reset tuple memory. We've freed all the tuples that we previously
2038 * allocated. We will use the slab allocator from now on.
2039 */
2040 MemoryContextResetOnly(state->base.tuplecontext);
2041
2042 /*
2043 * We no longer need a large memtuples array. (We will allocate a smaller
2044 * one for the heap later.)
2045 */
2046 FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
2047 pfree(state->memtuples);
2048 state->memtuples = NULL;
2049
2050 /*
2051 * Initialize the slab allocator. We need one slab slot per input tape,
2052 * for the tuples in the heap, plus one to hold the tuple last returned
2053 * from tuplesort_gettuple. (If we're sorting pass-by-val Datums,
2054 * however, we don't need to allocate anything.)
2055 *
2056 * In a multi-pass merge, we could shrink this allocation for the last
2057 * merge pass, if it has fewer tapes than previous passes, but we don't
2058 * bother.
2059 *
2060 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism
2061 * to track memory usage of individual tuples.
2062 */
2063 if (state->base.tuples)
2064 init_slab_allocator(state, state->nOutputTapes + 1);
2065 else
2066 init_slab_allocator(state, 0);
2067
2068 /*
2069 * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
2070 * from each input tape.
2071 *
2072 * We could shrink this, too, between passes in a multi-pass merge, but we
2073 * don't bother. (The initial input tapes are still in outputTapes. The
2074 * number of input tapes will not increase between passes.)
2075 */
2076 state->memtupsize = state->nOutputTapes;
2077 state->memtuples = (SortTuple *) MemoryContextAlloc(state->base.maincontext,
2078 state->nOutputTapes * sizeof(SortTuple));
2079 USEMEM(state, GetMemoryChunkSpace(state->memtuples));
2080
2081 /*
2082 * Use all the remaining memory we have available for tape buffers among
2083 * all the input tapes. At the beginning of each merge pass, we will
2084 * divide this memory between the input and output tapes in the pass.
2085 */
2086 state->tape_buffer_mem = state->availMem;
2087 USEMEM(state, state->tape_buffer_mem);
2088 if (trace_sort)
2089 elog(LOG, "worker %d using %zu KB of memory for tape buffers",
2090 state->worker, state->tape_buffer_mem / 1024);
2091
2092 for (;;)
2093 {
2094 /*
2095 * On the first iteration, or if we have read all the runs from the
2096 * input tapes in a multi-pass merge, it's time to start a new pass.
2097 * Rewind all the output tapes, and make them inputs for the next
2098 * pass.
2099 */
2100 if (state->nInputRuns == 0)
2101 {
2102 int64 input_buffer_size;
2103
2104 /* Close the old, emptied, input tapes */
2105 if (state->nInputTapes > 0)
2106 {
2107 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
2108 LogicalTapeClose(state->inputTapes[tapenum]);
2109 pfree(state->inputTapes);
2110 }
2111
2112 /* Previous pass's outputs become next pass's inputs. */
2113 state->inputTapes = state->outputTapes;
2114 state->nInputTapes = state->nOutputTapes;
2115 state->nInputRuns = state->nOutputRuns;
2116
2117 /*
2118 * Reset output tape variables. The actual LogicalTapes will be
2119 * created as needed, here we only allocate the array to hold
2120 * them.
2121 */
2122 state->outputTapes = palloc0(state->nInputTapes * sizeof(LogicalTape *));
2123 state->nOutputTapes = 0;
2124 state->nOutputRuns = 0;
2125
2126 /*
2127 * Redistribute the memory allocated for tape buffers, among the
2128 * new input and output tapes.
2129 */
2130 input_buffer_size = merge_read_buffer_size(state->tape_buffer_mem,
2131 state->nInputTapes,
2132 state->nInputRuns,
2133 state->maxTapes);
2134
2135 if (trace_sort)
2136 elog(LOG, "starting merge pass of %d input runs on %d tapes, " INT64_FORMAT " KB of memory for each input tape: %s",
2137 state->nInputRuns, state->nInputTapes, input_buffer_size / 1024,
2138 pg_rusage_show(&state->ru_start));
2139
2140 /* Prepare the new input tapes for merge pass. */
2141 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
2142 LogicalTapeRewindForRead(state->inputTapes[tapenum], input_buffer_size);
2143
2144 /*
2145 * If there's just one run left on each input tape, then only one
2146 * merge pass remains. If we don't have to produce a materialized
2147 * sorted tape, we can stop at this point and do the final merge
2148 * on-the-fly.
2149 */
2150 if ((state->base.sortopt & TUPLESORT_RANDOMACCESS) == 0
2151 && state->nInputRuns <= state->nInputTapes
2152 && !WORKER(state))
2153 {
2154 /* Tell logtape.c we won't be writing anymore */
2155 LogicalTapeSetForgetFreeSpace(state->tapeset);
2156 /* Initialize for the final merge pass */
2157 beginmerge(state);
2158 state->status = TSS_FINALMERGE;
2159 return;
2160 }
2161 }
2162
2163 /* Select an output tape */
2164 selectnewtape(state);
2165
2166 /* Merge one run from each input tape. */
2167 mergeonerun(state);
2168
2169 /*
2170 * If the input tapes are empty, and we output only one output run,
2171 * we're done. The current output tape contains the final result.
2172 */
2173 if (state->nInputRuns == 0 && state->nOutputRuns <= 1)
2174 break;
2175 }
2176
2177 /*
2178 * Done. The result is on a single run on a single tape.
2179 */
2180 state->result_tape = state->outputTapes[0];
2181 if (!WORKER(state))
2182 LogicalTapeFreeze(state->result_tape, NULL);
2183 else
2184 worker_freeze_result_tape(state);
2185 state->status = TSS_SORTEDONTAPE;
2186
2187 /* Close all the now-empty input tapes, to release their read buffers. */
2188 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
2189 LogicalTapeClose(state->inputTapes[tapenum]);
2190}
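/*
 * A worked example of the pass structure above (illustrative figures
 * only): with maxTapes = 4 and 12 initial runs, the first pass calls
 * mergeonerun() three times, producing ceil(12 / 4) = 3 output runs;
 * those become the inputs of the next pass, and since 3 runs fit on 3
 * input tapes, that pass can be the on-the-fly final merge unless a
 * materialized result tape is required.
 */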
2191
2192/*
2193 * Merge one run from each input tape.
2194 */
2195static void
2196mergeonerun(Tuplesortstate *state)
2197{
2198 int srcTapeIndex;
2199 LogicalTape *srcTape;
2200
2201 /*
2202 * Start the merge by loading one tuple from each active source tape into
2203 * the heap.
2204 */
2205 beginmerge(state);
2206
2207 Assert(state->slabAllocatorUsed);
2208
2209 /*
2210 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
2211 * out, and replacing it with next tuple from same tape (if there is
2212 * another one).
2213 */
2214 while (state->memtupcount > 0)
2215 {
2216 SortTuple stup;
2217
2218 /* write the tuple to destTape */
2219 srcTapeIndex = state->memtuples[0].srctape;
2220 srcTape = state->inputTapes[srcTapeIndex];
2221 WRITETUP(state, state->destTape, &state->memtuples[0]);
2222
2223 /* recycle the slot of the tuple we just wrote out, for the next read */
2224 if (state->memtuples[0].tuple)
2225 RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
2226
2227 /*
2228 * pull next tuple from the tape, and replace the written-out tuple in
2229 * the heap with it.
2230 */
2231 if (mergereadnext(state, srcTape, &stup))
2232 {
2233 stup.srctape = srcTapeIndex;
2234 tuplesort_heap_replace_top(state, &stup);
2235 }
2236 else
2237 {
2238 tuplesort_heap_delete_top(state);
2239 state->nInputRuns--;
2240 }
2241 }
2242
2243 /*
2244 * When the heap empties, we're done. Write an end-of-run marker on the
2245 * output tape.
2246 */
2247 markrunend(state->destTape);
2248}
2249
2250/*
2251 * beginmerge - initialize for a merge pass
2252 *
2253 * Fill the merge heap with the first tuple from each input tape.
2254 */
2255static void
2256beginmerge(Tuplesortstate *state)
2257{
2258 int activeTapes;
2259 int srcTapeIndex;
2260
2261 /* Heap should be empty here */
2262 Assert(state->memtupcount == 0);
2263
2264 activeTapes = Min(state->nInputTapes, state->nInputRuns);
2265
2266 for (srcTapeIndex = 0; srcTapeIndex < activeTapes; srcTapeIndex++)
2267 {
2268 SortTuple tup;
2269
2270 if (mergereadnext(state, state->inputTapes[srcTapeIndex], &tup))
2271 {
2272 tup.srctape = srcTapeIndex;
2273 tuplesort_heap_insert(state, &tup);
2274 }
2275 }
2276}
2277
2278/*
2279 * mergereadnext - read next tuple from one merge input tape
2280 *
2281 * Returns false on EOF.
2282 */
2283static bool
2284mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup)
2285{
2286 unsigned int tuplen;
2287
2288 /* read next tuple, if any */
2289 if ((tuplen = getlen(srcTape, true)) == 0)
2290 return false;
2291 READTUP(state, stup, srcTape, tuplen);
2292
2293 return true;
2294}
2295
2296/*
2297 * dumptuples - remove tuples from memtuples and write initial run to tape
2298 *
2299 * When alltuples = true, dump everything currently in memory. (This case is
2300 * only used at end of input data.)
2301 */
2302static void
2303dumptuples(Tuplesortstate *state, bool alltuples)
2304{
2305 int memtupwrite;
2306 int i;
2307
2308 /*
2309 * Nothing to do if we still fit in available memory and have array slots,
2310 * unless this is the final call during initial run generation.
2311 */
2312 if (state->memtupcount < state->memtupsize && !LACKMEM(state) &&
2313 !alltuples)
2314 return;
2315
2316 /*
2317 * Final call might require no sorting, in rare cases where we just so
2318 * happen to have previously LACKMEM()'d at the point where exactly all
2319 * remaining tuples are loaded into memory, just before input was
2320 * exhausted. In general, short final runs are quite possible, but avoid
2321 * creating a completely empty run. In a worker, though, we must produce
2322 * at least one tape, even if it's empty.
2323 */
2324 if (state->memtupcount == 0 && state->currentRun > 0)
2325 return;
2326
2327 Assert(state->status == TSS_BUILDRUNS);
2328
2329 /*
2330 * It seems unlikely that this limit will ever be exceeded, but take no
2331 * chances
2332 */
2333 if (state->currentRun == INT_MAX)
2334 ereport(ERROR,
2335 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
2336 errmsg("cannot have more than %d runs for an external sort",
2337 INT_MAX)));
2338
2339 if (state->currentRun > 0)
2340 selectnewtape(state);
2341
2342 state->currentRun++;
2343
2344 if (trace_sort)
2345 elog(LOG, "worker %d starting quicksort of run %d: %s",
2346 state->worker, state->currentRun,
2347 pg_rusage_show(&state->ru_start));
2348
2349 /*
2350 * Sort all tuples accumulated within the allowed amount of memory for
2351 * this run using quicksort
2352 */
2353 tuplesort_sort_memtuples(state);
2354
2355 if (trace_sort)
2356 elog(LOG, "worker %d finished quicksort of run %d: %s",
2357 state->worker, state->currentRun,
2358 pg_rusage_show(&state->ru_start));
2359
2360 memtupwrite = state->memtupcount;
2361 for (i = 0; i < memtupwrite; i++)
2362 {
2363 SortTuple *stup = &state->memtuples[i];
2364
2365 WRITETUP(state, state->destTape, stup);
2366 }
2367
2368 state->memtupcount = 0;
2369
2370 /*
2371 * Reset tuple memory. We've freed all of the tuples that we previously
2372 * allocated. It's important to avoid fragmentation when there is a stark
2373 * change in the sizes of incoming tuples. In bounded sorts,
2374 * fragmentation due to AllocSetFree's bucketing by size class might be
2375 * particularly bad if this step wasn't taken.
2376 */
2377 MemoryContextReset(state->base.tuplecontext);
2378
2379 /*
2380 * Now update the memory accounting to subtract the memory used by the
2381 * tuple.
2382 */
2383 FREEMEM(state, state->tupleMem);
2384 state->tupleMem = 0;
2385
2386 markrunend(state->destTape);
2387
2388 if (trace_sort)
2389 elog(LOG, "worker %d finished writing run %d to tape %d: %s",
2390 state->worker, state->currentRun, (state->currentRun - 1) % state->nOutputTapes + 1,
2391 pg_rusage_show(&state->ru_start));
2392}
2393
2394/*
2395 * tuplesort_rescan - rewind and replay the scan
2396 */
2397void
2398tuplesort_rescan(Tuplesortstate *state)
2399{
2400 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
2401
2402 Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);
2403
2404 switch (state->status)
2405 {
2406 case TSS_SORTEDINMEM:
2407 state->current = 0;
2408 state->eof_reached = false;
2409 state->markpos_offset = 0;
2410 state->markpos_eof = false;
2411 break;
2412 case TSS_SORTEDONTAPE:
2413 LogicalTapeRewindForRead(state->result_tape, 0);
2414 state->eof_reached = false;
2415 state->markpos_block = 0L;
2416 state->markpos_offset = 0;
2417 state->markpos_eof = false;
2418 break;
2419 default:
2420 elog(ERROR, "invalid tuplesort state");
2421 break;
2422 }
2423
2424 MemoryContextSwitchTo(oldcontext);
2425}
2426
2427/*
2428 * tuplesort_markpos - saves current position in the merged sort file
2429 */
2430void
2431tuplesort_markpos(Tuplesortstate *state)
2432{
2433 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
2434
2435 Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);
2436
2437 switch (state->status)
2438 {
2439 case TSS_SORTEDINMEM:
2440 state->markpos_offset = state->current;
2441 state->markpos_eof = state->eof_reached;
2442 break;
2443 case TSS_SORTEDONTAPE:
2444 LogicalTapeTell(state->result_tape,
2445 &state->markpos_block,
2446 &state->markpos_offset);
2447 state->markpos_eof = state->eof_reached;
2448 break;
2449 default:
2450 elog(ERROR, "invalid tuplesort state");
2451 break;
2452 }
2453
2454 MemoryContextSwitchTo(oldcontext);
2455}
2456
2457/*
2458 * tuplesort_restorepos - restores current position in merged sort file to
2459 * last saved position
2460 */
2461void
2462tuplesort_restorepos(Tuplesortstate *state)
2463{
2464 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
2465
2466 Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);
2467
2468 switch (state->status)
2469 {
2470 case TSS_SORTEDINMEM:
2471 state->current = state->markpos_offset;
2472 state->eof_reached = state->markpos_eof;
2473 break;
2474 case TSS_SORTEDONTAPE:
2475 LogicalTapeSeek(state->result_tape,
2476 state->markpos_block,
2477 state->markpos_offset);
2478 state->eof_reached = state->markpos_eof;
2479 break;
2480 default:
2481 elog(ERROR, "invalid tuplesort state");
2482 break;
2483 }
2484
2485 MemoryContextSwitchTo(oldcontext);
2486}
2487
2488/*
2489 * tuplesort_get_stats - extract summary statistics
2490 *
2491 * This can be called after tuplesort_performsort() finishes to obtain
2492 * printable summary information about how the sort was performed.
2493 */
2494void
2495tuplesort_get_stats(Tuplesortstate *state,
2496 TuplesortInstrumentation *stats)
2497{
2498 /*
2499 * Note: it might seem we should provide both memory and disk usage for a
2500 * disk-based sort. However, the current code doesn't track memory space
2501 * accurately once we have begun to return tuples to the caller (since we
2502 * don't account for pfree's the caller is expected to do), so we cannot
2503 * rely on availMem in a disk sort. This does not seem worth the overhead
2504 * to fix. Is it worth creating an API for the memory context code to
2505 * tell us how much is actually used in sortcontext?
2506 */
2507 tuplesort_updatemax(state);
2508
2509 if (state->isMaxSpaceDisk)
2510 stats->spaceType = SORT_SPACE_TYPE_DISK;
2511 else
2512 stats->spaceType = SORT_SPACE_TYPE_MEMORY;
2513 stats->spaceUsed = (state->maxSpace + 1023) / 1024;
2514
2515 switch (state->maxSpaceStatus)
2516 {
2517 case TSS_SORTEDINMEM:
2518 if (state->boundUsed)
2519 stats->sortMethod = SORT_TYPE_TOP_N_HEAPSORT;
2520 else
2521 stats->sortMethod = SORT_TYPE_QUICKSORT;
2522 break;
2523 case TSS_SORTEDONTAPE:
2524 stats->sortMethod = SORT_TYPE_EXTERNAL_SORT;
2525 break;
2526 case TSS_FINALMERGE:
2527 stats->sortMethod = SORT_TYPE_EXTERNAL_MERGE;
2528 break;
2529 default:
2530 stats->sortMethod = SORT_TYPE_STILL_IN_PROGRESS;
2531 break;
2532 }
2533}
2534
2535/*
2536 * Convert TuplesortMethod to a string.
2537 */
2538const char *
2539tuplesort_method_name(TuplesortMethod m)
2540{
2541 switch (m)
2542 {
2543 case SORT_TYPE_STILL_IN_PROGRESS:
2544 return "still in progress";
2545 case SORT_TYPE_TOP_N_HEAPSORT:
2546 return "top-N heapsort";
2547 case SORT_TYPE_QUICKSORT:
2548 return "quicksort";
2549 case SORT_TYPE_EXTERNAL_SORT:
2550 return "external sort";
2551 case SORT_TYPE_EXTERNAL_MERGE:
2552 return "external merge";
2553 }
2554
2555 return "unknown";
2556}
2557
2558/*
2559 * Convert TuplesortSpaceType to a string.
2560 */
2561const char *
2562tuplesort_space_type_name(TuplesortSpaceType t)
2563{
2564 Assert(t == SORT_SPACE_TYPE_DISK || t == SORT_SPACE_TYPE_MEMORY);
2565 return t == SORT_SPACE_TYPE_DISK ? "Disk" : "Memory";
2566}
2567
2568
2569/*
2570 * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
2571 */
2572
2573/*
2574 * Convert the existing unordered array of SortTuples to a bounded heap,
2575 * discarding all but the smallest "state->bound" tuples.
2576 *
2577 * When working with a bounded heap, we want to keep the largest entry
2578 * at the root (array entry zero), instead of the smallest as in the normal
2579 * sort case. This allows us to discard the largest entry cheaply.
2580 * Therefore, we temporarily reverse the sort direction.
2581 */
2582static void
2583make_bounded_heap(Tuplesortstate *state)
2584{
2585 int tupcount = state->memtupcount;
2586 int i;
2587
2588 Assert(state->status == TSS_INITIAL);
2589 Assert(state->bounded);
2590 Assert(tupcount >= state->bound);
2591 Assert(SERIAL(state));
2592
2593 /* Reverse sort direction so largest entry will be at root */
2594 reversedirection(state);
2595
2596 state->memtupcount = 0; /* make the heap empty */
2597 for (i = 0; i < tupcount; i++)
2598 {
2599 if (state->memtupcount < state->bound)
2600 {
2601 /* Insert next tuple into heap */
2602 /* Must copy source tuple to avoid possible overwrite */
2603 SortTuple stup = state->memtuples[i];
2604
2605 tuplesort_heap_insert(state, &stup);
2606 }
2607 else
2608 {
2609 /*
2610 * The heap is full. Replace the largest entry with the new
2611 * tuple, or just discard it, if it's larger than anything already
2612 * in the heap.
2613 */
2614 if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
2615 {
2616 free_sort_tuple(state, &state->memtuples[i]);
2617 CHECK_FOR_INTERRUPTS();
2618 }
2619 else
2620 tuplesort_heap_replace_top(state, &state->memtuples[i]);
2621 }
2622 }
2623
2624 Assert(state->memtupcount == state->bound);
2625 state->status = TSS_BOUNDED;
2626}
2627
2628/*
2629 * Convert the bounded heap to a properly-sorted array
2630 */
2631static void
2632sort_bounded_heap(Tuplesortstate *state)
2633{
2634 int tupcount = state->memtupcount;
2635
2636 Assert(state->status == TSS_BOUNDED);
2637 Assert(state->bounded);
2638 Assert(tupcount == state->bound);
2639 Assert(SERIAL(state));
2640
2641 /*
2642 * We can unheapify in place because each delete-top call will remove the
2643 * largest entry, which we can promptly store in the newly freed slot at
2644 * the end. Once we're down to a single-entry heap, we're done.
2645 */
2646 while (state->memtupcount > 1)
2647 {
2648 SortTuple stup = state->memtuples[0];
2649
2650 /* this sifts-up the next-largest entry and decreases memtupcount */
2651 tuplesort_heap_delete_top(state);
2652 state->memtuples[state->memtupcount] = stup;
2653 }
2654 state->memtupcount = tupcount;
2655
2656 /*
2657 * Reverse sort direction back to the original state. This is not
2658 * actually necessary but seems like a good idea for tidiness.
2659 */
2660 reversedirection(state);
2661
2662 state->status = TSS_SORTEDINMEM;
2663 state->boundUsed = true;
2664}
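/*
 * Worked example of the two routines above (illustrative): with a bound
 * of 3 and inputs 5, 1, 4, 2, 8, make_bounded_heap() ends up holding
 * {1, 2, 4} with the largest value, 4, at the root (5 is replaced by 2,
 * and 8 is discarded outright); sort_bounded_heap() then repeatedly
 * moves the root into the slot freed at the end, leaving 1, 2, 4 in
 * ascending order.
 */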
2665
2666/*
2667 * Sort all memtuples using specialized qsort() routines.
2668 *
2669 * Quicksort is used for small in-memory sorts, and external sort runs.
2670 */
2671static void
2672tuplesort_sort_memtuples(Tuplesortstate *state)
2673{
2674 Assert(!LEADER(state));
2675
2676 if (state->memtupcount > 1)
2677 {
2678 /*
2679 * Do we have the leading column's value or abbreviation in datum1,
2680 * and is there a specialization for its comparator?
2681 */
2682 if (state->base.haveDatum1 && state->base.sortKeys)
2683 {
2684 if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
2685 {
2686 qsort_tuple_unsigned(state->memtuples,
2687 state->memtupcount,
2688 state);
2689 return;
2690 }
2691 else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
2692 {
2693 qsort_tuple_signed(state->memtuples,
2694 state->memtupcount,
2695 state);
2696 return;
2697 }
2698 else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
2699 {
2700 qsort_tuple_int32(state->memtuples,
2701 state->memtupcount,
2702 state);
2703 return;
2704 }
2705 }
2706
2707 /* Can we use the single-key sort function? */
2708 if (state->base.onlyKey != NULL)
2709 {
2710 qsort_ssup(state->memtuples, state->memtupcount,
2711 state->base.onlyKey);
2712 }
2713 else
2714 {
2715 qsort_tuple(state->memtuples,
2716 state->memtupcount,
2717 state->base.comparetup,
2718 state);
2719 }
2720 }
2721}
2722
2723/*
2724 * Insert a new tuple into an empty or existing heap, maintaining the
2725 * heap invariant. Caller is responsible for ensuring there's room.
2726 *
2727 * Note: For some callers, tuple points to a memtuples[] entry above the
2728 * end of the heap. This is safe as long as it's not immediately adjacent
2729 * to the end of the heap (ie, in the [memtupcount] array entry) --- if it
2730 * is, it might get overwritten before being moved into the heap!
2731 */
2732static void
2733tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)
2734{
2735 SortTuple *memtuples;
2736 int j;
2737
2738 memtuples = state->memtuples;
2739 Assert(state->memtupcount < state->memtupsize);
2740
2741 CHECK_FOR_INTERRUPTS();
2742
2743 /*
2744 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
2745 * using 1-based array indexes, not 0-based.
2746 */
2747 j = state->memtupcount++;
2748 while (j > 0)
2749 {
2750 int i = (j - 1) >> 1;
2751
2752 if (COMPARETUP(state, tuple, &memtuples[i]) >= 0)
2753 break;
2754 memtuples[j] = memtuples[i];
2755 j = i;
2756 }
2757 memtuples[j] = *tuple;
2758}
2759
2760/*
2761 * Remove the tuple at state->memtuples[0] from the heap. Decrement
2762 * memtupcount, and sift up to maintain the heap invariant.
2763 *
2764 * The caller has already free'd the tuple the top node points to,
2765 * if necessary.
2766 */
2767static void
2768tuplesort_heap_delete_top(Tuplesortstate *state)
2769{
2770 SortTuple *memtuples = state->memtuples;
2771 SortTuple *tuple;
2772
2773 if (--state->memtupcount <= 0)
2774 return;
2775
2776 /*
2777 * Remove the last tuple in the heap, and re-insert it, by replacing the
2778 * current top node with it.
2779 */
2780 tuple = &memtuples[state->memtupcount];
2781 tuplesort_heap_replace_top(state, tuple);
2782}
2783
2784/*
2785 * Replace the tuple at state->memtuples[0] with a new tuple. Sift up to
2786 * maintain the heap invariant.
2787 *
2788 * This corresponds to Knuth's "sift-up" algorithm (Algorithm 5.2.3H,
2789 * Heapsort, steps H3-H8).
2790 */
2791static void
2792tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)
2793{
2794 SortTuple *memtuples = state->memtuples;
2795 unsigned int i,
2796 n;
2797
2798 Assert(state->memtupcount >= 1);
2799
2800 CHECK_FOR_INTERRUPTS();
2801
2802 /*
2803 * state->memtupcount is "int", but we use "unsigned int" for i, j, n.
2804 * This prevents overflow in the "2 * i + 1" calculation, since at the top
2805 * of the loop we must have i < n <= INT_MAX <= UINT_MAX/2.
2806 */
2807 n = state->memtupcount;
2808 i = 0; /* i is where the "hole" is */
2809 for (;;)
2810 {
2811 unsigned int j = 2 * i + 1;
2812
2813 if (j >= n)
2814 break;
2815 if (j + 1 < n &&
2816 COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0)
2817 j++;
2818 if (COMPARETUP(state, tuple, &memtuples[j]) <= 0)
2819 break;
2820 memtuples[i] = memtuples[j];
2821 i = j;
2822 }
2823 memtuples[i] = *tuple;
2824}
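/*
 * The three routines above treat memtuples[] as an implicit binary
 * tree: the entry at index i has its children at 2*i + 1 and 2*i + 2,
 * and the parent of index j is (j - 1) / 2.  That is why insert walks
 * up with i = (j - 1) >> 1 while delete/replace walk down with
 * j = 2 * i + 1.
 */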
2825
2826/*
2827 * Function to reverse the sort direction from its current state
2828 *
2829 * It is not safe to call this when performing hash tuplesorts
2830 */
2831static void
2832reversedirection(Tuplesortstate *state)
2833{
2834 SortSupport sortKey = state->base.sortKeys;
2835 int nkey;
2836
2837 for (nkey = 0; nkey < state->base.nKeys; nkey++, sortKey++)
2838 {
2839 sortKey->ssup_reverse = !sortKey->ssup_reverse;
2840 sortKey->ssup_nulls_first = !sortKey->ssup_nulls_first;
2841 }
2842}
2843
2844
2845/*
2846 * Tape interface routines
2847 */
2848
2849static unsigned int
2850getlen(LogicalTape *tape, bool eofOK)
2851{
2852 unsigned int len;
2853
2854 if (LogicalTapeRead(tape,
2855 &len, sizeof(len)) != sizeof(len))
2856 elog(ERROR, "unexpected end of tape");
2857 if (len == 0 && !eofOK)
2858 elog(ERROR, "unexpected end of data");
2859 return len;
2860}
2861
2862static void
2863markrunend(LogicalTape *tape)
2864{
2865 unsigned int len = 0;
2866
2867 LogicalTapeWrite(tape, &len, sizeof(len));
2868}
2869
2870/*
2871 * Get memory for tuple from within READTUP() routine.
2872 *
2873 * We use next free slot from the slab allocator, or palloc() if the tuple
2874 * is too large for that.
2875 */
2876void *
2877tuplesort_readtup_alloc(Tuplesortstate *state, Size tuplen)
2878{
2879 SlabSlot *buf;
2880
2881 /*
2882 * We pre-allocate enough slots in the slab arena that we should never run
2883 * out.
2884 */
2885 Assert(state->slabFreeHead);
2886
2887 if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)
2888 return MemoryContextAlloc(state->base.sortcontext, tuplen);
2889 else
2890 {
2891 buf = state->slabFreeHead;
2892 /* Reuse this slot */
2893 state->slabFreeHead = buf->nextfree;
2894
2895 return buf;
2896 }
2897}
2898
2899
2900/*
2901 * Parallel sort routines
2902 */
2903
2904/*
2905 * tuplesort_estimate_shared - estimate required shared memory allocation
2906 *
2907 * nWorkers is an estimate of the number of workers (it's the number that
2908 * will be requested).
2909 */
2910Size
2911tuplesort_estimate_shared(int nWorkers)
2912{
2913 Size tapesSize;
2914
2915 Assert(nWorkers > 0);
2916
2917 /* Make sure that BufFile shared state is MAXALIGN'd */
2918 tapesSize = mul_size(sizeof(TapeShare), nWorkers);
2919 tapesSize = MAXALIGN(add_size(tapesSize, offsetof(Sharedsort, tapes)));
2920
2921 return tapesSize;
2922}
2923
2924/*
2925 * tuplesort_initialize_shared - initialize shared tuplesort state
2926 *
2927 * Must be called from leader process before workers are launched, to
2928 * establish state needed up-front for worker tuplesortstates. nWorkers
2929 * should match the argument passed to tuplesort_estimate_shared().
2930 */
2931void
2932tuplesort_initialize_shared(Sharedsort *shared, int nWorkers, dsm_segment *seg)
2933{
2934 int i;
2935
2936 Assert(nWorkers > 0);
2937
2938 SpinLockInit(&shared->mutex);
2939 shared->currentWorker = 0;
2940 shared->workersFinished = 0;
2941 SharedFileSetInit(&shared->fileset, seg);
2942 shared->nTapes = nWorkers;
2943 for (i = 0; i < nWorkers; i++)
2944 {
2945 shared->tapes[i].firstblocknumber = 0L;
2946 }
2947}
2948
2949/*
2950 * tuplesort_attach_shared - attach to shared tuplesort state
2951 *
2952 * Must be called by all worker processes.
2953 */
2954void
2955tuplesort_attach_shared(Sharedsort *shared, dsm_segment *seg)
2956{
2957 /* Attach to SharedFileSet */
2958 SharedFileSetAttach(&shared->fileset, seg);
2959}
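/*
 * A minimal sketch of the leader/worker handshake around these
 * functions (hypothetical caller code; KEY_TUPLESORT, toc, seg and
 * nWorkers are the caller's own):
 *
 * In the leader, before launching workers:
 *
 *     Sharedsort *shared = (Sharedsort *)
 *         shm_toc_allocate(toc, tuplesort_estimate_shared(nWorkers));
 *     tuplesort_initialize_shared(shared, nWorkers, seg);
 *     shm_toc_insert(toc, KEY_TUPLESORT, shared);
 *
 * In each worker, after attaching to the DSM segment:
 *
 *     Sharedsort *shared = (Sharedsort *)
 *         shm_toc_lookup(toc, KEY_TUPLESORT, false);
 *     tuplesort_attach_shared(shared, seg);
 */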
2960
2961/*
2962 * worker_get_identifier - Assign and return ordinal identifier for worker
2963 *
2964 * The order in which these are assigned is not well defined, and should not
2965 * matter; worker numbers across parallel sort participants need only be
2966 * distinct and gapless. logtape.c requires this.
2967 *
2968 * Note that the identifiers assigned from here have no relation to
2969 * ParallelWorkerNumber, to avoid making any assumption about
2970 * caller's requirements. However, we do follow the ParallelWorkerNumber
2971 * convention of representing a non-worker with worker number -1. This
2972 * includes the leader, as well as serial Tuplesort processes.
2973 */
2974static int
2975worker_get_identifier(Tuplesortstate *state)
2976{
2977 Sharedsort *shared = state->shared;
2978 int worker;
2979
2980 Assert(WORKER(state));
2981
2982 SpinLockAcquire(&shared->mutex);
2983 worker = shared->currentWorker++;
2984 SpinLockRelease(&shared->mutex);
2985
2986 return worker;
2987}
2988
2989/*
2990 * worker_freeze_result_tape - freeze worker's result tape for leader
2991 *
2992 * This is called by workers just after the result tape has been determined,
2993 * instead of calling LogicalTapeFreeze() directly. They do so because
2994 * workers require a few additional steps over similar serial
2995 * TSS_SORTEDONTAPE external sort cases, which also happen here. The extra
2996 * steps are around freeing now unneeded resources, and representing to
2997 * leader that worker's input run is available for its merge.
2998 *
2999 * There should only be one final output run for each worker, which consists
3000 * of all tuples that were originally input into worker.
3001 */
3002static void
3003worker_freeze_result_tape(Tuplesortstate *state)
3004{
3005 Sharedsort *shared = state->shared;
3006 TapeShare output;
3007
3008 Assert(WORKER(state));
3009 Assert(state->result_tape != NULL);
3010 Assert(state->memtupcount == 0);
3011
3012 /*
3013 * Free most remaining memory, in case caller is sensitive to our holding
3014 * on to it. memtuples may not be a tiny merge heap at this point.
3015 */
3016 pfree(state->memtuples);
3017 /* Be tidy */
3018 state->memtuples = NULL;
3019 state->memtupsize = 0;
3020
3021 /*
3022 * Parallel worker requires result tape metadata, which is to be stored in
3023 * shared memory for leader
3024 */
3025 LogicalTapeFreeze(state->result_tape, &output);
3026
3027 /* Store properties of output tape, and update finished worker count */
3028 SpinLockAcquire(&shared->mutex);
3029 shared->tapes[state->worker] = output;
3030 shared->workersFinished++;
3031 SpinLockRelease(&shared->mutex);
3032}
3033
3034/*
3035 * worker_nomergeruns - dump memtuples in worker, without merging
3036 *
3037 * This is called as an alternative to mergeruns() with a worker when no
3038 * merging is required.
3039 */
3040static void
3041worker_nomergeruns(Tuplesortstate *state)
3042{
3043 Assert(WORKER(state));
3044 Assert(state->result_tape == NULL);
3045 Assert(state->nOutputRuns == 1);
3046
3047 state->result_tape = state->destTape;
3048 worker_freeze_result_tape(state);
3049}
3050
3051/*
3052 * leader_takeover_tapes - create tapeset for leader from worker tapes
3053 *
3054 * So far, leader Tuplesortstate has performed no actual sorting. By now, all
3055 * sorting has occurred in workers, all of which must have already returned
3056 * from tuplesort_performsort().
3057 *
3058 * When this returns, leader process is left in a state that is virtually
3059 * indistinguishable from it having generated runs as a serial external sort
3060 * might have.
3061 */
3062static void
3063leader_takeover_tapes(Tuplesortstate *state)
3064{
3065 Sharedsort *shared = state->shared;
3066 int nParticipants = state->nParticipants;
3067 int workersFinished;
3068 int j;
3069
3070 Assert(LEADER(state));
3071 Assert(nParticipants >= 1);
3072
3073 SpinLockAcquire(&shared->mutex);
3074 workersFinished = shared->workersFinished;
3075 SpinLockRelease(&shared->mutex);
3076
3077 if (nParticipants != workersFinished)
3078 elog(ERROR, "cannot take over tapes before all workers finish");
3079
3080 /*
3081 * Create the tapeset from worker tapes, including a leader-owned tape at
3082 * the end. Parallel workers are far more expensive than logical tapes,
3083 * so the number of tapes allocated here should never be excessive.
3084 */
3085 inittapestate(state, nParticipants);
3086 state->tapeset = LogicalTapeSetCreate(false, &shared->fileset, -1);
3087
3088 /*
3089 * Set currentRun to reflect the number of runs we will merge (it's not
3090 * used for anything, this is just pro forma)
3091 */
3092 state->currentRun = nParticipants;
3093
3094 /*
3095 * Initialize the state to look the same as after building the initial
3096 * runs.
3097 *
3098 * There will always be exactly 1 run per worker, and exactly one input
3099 * tape per run, because workers always output exactly 1 run, even when
3100 * there were no input tuples for workers to sort.
3101 */
3102 state->inputTapes = NULL;
3103 state->nInputTapes = 0;
3104 state->nInputRuns = 0;
3105
3106 state->outputTapes = palloc0(nParticipants * sizeof(LogicalTape *));
3107 state->nOutputTapes = nParticipants;
3108 state->nOutputRuns = nParticipants;
3109
3110 for (j = 0; j < nParticipants; j++)
3111 {
3112 state->outputTapes[j] = LogicalTapeImport(state->tapeset, j, &shared->tapes[j]);
3113 }
3114
3115 state->status = TSS_BUILDRUNS;
3116}
3117
3118/*
3119 * Convenience routine to free a tuple previously loaded into sort memory
3120 */
3121static void
3122free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
3123{
3124 if (stup->tuple)
3125 {
3126 FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
3127 pfree(stup->tuple);
3128 stup->tuple = NULL;
3129 }
3130}
3131
3132int
3133ssup_datum_unsigned_cmp(Datum x, Datum y, SortSupport ssup)
3134{
3135 if (x < y)
3136 return -1;
3137 else if (x > y)
3138 return 1;
3139 else
3140 return 0;
3141}
3142
3143int
3144ssup_datum_signed_cmp(Datum x, Datum y, SortSupport ssup)
3145{
3146 int64 xx = DatumGetInt64(x);
3147 int64 yy = DatumGetInt64(y);
3148
3149 if (xx < yy)
3150 return -1;
3151 else if (xx > yy)
3152 return 1;
3153 else
3154 return 0;
3155}
3156
3157int
3158ssup_datum_int32_cmp(Datum x, Datum y, SortSupport ssup)
3159{
3160 int32 xx = DatumGetInt32(x);
3161 int32 yy = DatumGetInt32(y);
3162
3163 if (xx < yy)
3164 return -1;
3165 else if (xx > yy)
3166 return 1;
3167 else
3168 return 0;
3169}
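/*
 * These comparators let a datatype's sort support function expose its
 * leading-key comparison to the specialized qsort variants chosen in
 * tuplesort_sort_memtuples().  A minimal sketch of installing one, in
 * the style of the btree sort support routines (assuming the datum is
 * a pass-by-value signed 64-bit integer):
 *
 *     Datum
 *     btint8sortsupport(PG_FUNCTION_ARGS)
 *     {
 *         SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
 *
 *         ssup->comparator = ssup_datum_signed_cmp;
 *         PG_RETURN_VOID();
 *     }
 */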