Header And Logo

PostgreSQL
| The world's most advanced open source database.

tuplesort.c

Go to the documentation of this file.
00001 /*-------------------------------------------------------------------------
00002  *
00003  * tuplesort.c
00004  *    Generalized tuple sorting routines.
00005  *
00006  * This module handles sorting of heap tuples, index tuples, or single
00007  * Datums (and could easily support other kinds of sortable objects,
00008  * if necessary).  It works efficiently for both small and large amounts
00009  * of data.  Small amounts are sorted in-memory using qsort().  Large
00010  * amounts are sorted using temporary files and a standard external sort
00011  * algorithm.
00012  *
00013  * See Knuth, volume 3, for more than you want to know about the external
00014  * sorting algorithm.  We divide the input into sorted runs using replacement
00015  * selection, in the form of a priority tree implemented as a heap
00016  * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
00017  * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D
00018  * are implemented by logtape.c, which avoids space wastage by recycling
00019  * disk space as soon as each block is read from its "tape".
00020  *
00021  * We do not form the initial runs using Knuth's recommended replacement
00022  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
00023  * number of records in memory at all times.  Since we are dealing with
00024  * tuples that may vary considerably in size, we want to be able to vary
00025  * the number of records kept in memory to ensure full utilization of the
00026  * allowed sort memory space.  So, we keep the tuples in a variable-size
00027  * heap, with the next record to go out at the top of the heap.  Like
00028  * Algorithm 5.4.1R, each record is stored with the run number that it
00029  * must go into, and we use (run number, key) as the ordering key for the
00030  * heap.  When the run number at the top of the heap changes, we know that
00031  * no more records of the prior run are left in the heap.
00032  *
00033  * The approximate amount of memory allowed for any one sort operation
00034  * is specified in kilobytes by the caller (most pass work_mem).  Initially,
00035  * we absorb tuples and simply store them in an unsorted array as long as
00036  * we haven't exceeded workMem.  If we reach the end of the input without
00037  * exceeding workMem, we sort the array using qsort() and subsequently return
00038  * tuples just by scanning the tuple array sequentially.  If we do exceed
00039  * workMem, we construct a heap using Algorithm H and begin to emit tuples
00040  * into sorted runs in temporary tapes, emitting just enough tuples at each
00041  * step to get back within the workMem limit.  Whenever the run number at
00042  * the top of the heap changes, we begin a new run with a new output tape
00043  * (selected per Algorithm D).  After the end of the input is reached,
00044  * we dump out remaining tuples in memory into a final run (or two),
00045  * then merge the runs using Algorithm D.
00046  *
00047  * When merging runs, we use a heap containing just the frontmost tuple from
00048  * each source run; we repeatedly output the smallest tuple and insert the
00049  * next tuple from its source tape (if any).  When the heap empties, the merge
00050  * is complete.  The basic merge algorithm thus needs very little memory ---
00051  * only M tuples for an M-way merge, and M is constrained to a small number.
00052  * However, we can still make good use of our full workMem allocation by
00053  * pre-reading additional tuples from each source tape.  Without prereading,
00054  * our access pattern to the temporary file would be very erratic; on average
00055  * we'd read one block from each of M source tapes during the same time that
00056  * we're writing M blocks to the output tape, so there is no sequentiality of
00057  * access at all, defeating the read-ahead methods used by most Unix kernels.
00058  * Worse, the output tape gets written into a very random sequence of blocks
00059  * of the temp file, ensuring that things will be even worse when it comes
00060  * time to read that tape.  A straightforward merge pass thus ends up doing a
00061  * lot of waiting for disk seeks.  We can improve matters by prereading from
00062  * each source tape sequentially, loading about workMem/M bytes from each tape
00063  * in turn.  Then we run the merge algorithm, writing but not reading until
00064  * one of the preloaded tuple series runs out.  Then we switch back to preread
00065  * mode, fill memory again, and repeat.  This approach helps to localize both
00066  * read and write accesses.
00067  *
00068  * When the caller requests random access to the sort result, we form
00069  * the final sorted run on a logical tape which is then "frozen", so
00070  * that we can access it randomly.  When the caller does not need random
00071  * access, we return from tuplesort_performsort() as soon as we are down
00072  * to one run per logical tape.  The final merge is then performed
00073  * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
00074  * saves one cycle of writing all the data out to disk and reading it in.
00075  *
00076  * Before Postgres 8.2, we always used a seven-tape polyphase merge, on the
00077  * grounds that 7 is the "sweet spot" on the tapes-to-passes curve according
00078  * to Knuth's figure 70 (section 5.4.2).  However, Knuth is assuming that
00079  * tape drives are expensive beasts, and in particular that there will always
00080  * be many more runs than tape drives.  In our implementation a "tape drive"
00081  * doesn't cost much more than a few Kb of memory buffers, so we can afford
00082  * to have lots of them.  In particular, if we can have as many tape drives
00083  * as sorted runs, we can eliminate any repeated I/O at all.  In the current
00084  * code we determine the number of tapes M on the basis of workMem: we want
00085  * workMem/M to be large enough that we read a fair amount of data each time
00086  * we preread from a tape, so as to maintain the locality of access described
00087  * above.  Nonetheless, with large workMem we can have many tapes.
00088  *
00089  *
00090  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
00091  * Portions Copyright (c) 1994, Regents of the University of California
00092  *
00093  * IDENTIFICATION
00094  *    src/backend/utils/sort/tuplesort.c
00095  *
00096  *-------------------------------------------------------------------------
00097  */
00098 
00099 #include "postgres.h"
00100 
00101 #include <limits.h>
00102 
00103 #include "access/htup_details.h"
00104 #include "access/nbtree.h"
00105 #include "catalog/index.h"
00106 #include "commands/tablespace.h"
00107 #include "executor/executor.h"
00108 #include "miscadmin.h"
00109 #include "pg_trace.h"
00110 #include "utils/datum.h"
00111 #include "utils/logtape.h"
00112 #include "utils/lsyscache.h"
00113 #include "utils/memutils.h"
00114 #include "utils/pg_rusage.h"
00115 #include "utils/rel.h"
00116 #include "utils/sortsupport.h"
00117 #include "utils/tuplesort.h"
00118 
00119 
00120 /* sort-type codes for sort__start probes */
00121 #define HEAP_SORT       0
00122 #define INDEX_SORT      1
00123 #define DATUM_SORT      2
00124 #define CLUSTER_SORT    3
00125 
00126 /* GUC variables */
00127 #ifdef TRACE_SORT
00128 bool        trace_sort = false;
00129 #endif
00130 
00131 #ifdef DEBUG_BOUNDED_SORT
00132 bool        optimize_bounded_sort = true;
00133 #endif
00134 
00135 
00136 /*
00137  * The objects we actually sort are SortTuple structs.  These contain
00138  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
00139  * which is a separate palloc chunk --- we assume it is just one chunk and
00140  * can be freed by a simple pfree().  SortTuples also contain the tuple's
00141  * first key column in Datum/nullflag format, and an index integer.
00142  *
00143  * Storing the first key column lets us save heap_getattr or index_getattr
00144  * calls during tuple comparisons.  We could extract and save all the key
00145  * columns not just the first, but this would increase code complexity and
00146  * overhead, and wouldn't actually save any comparison cycles in the common
00147  * case where the first key determines the comparison result.  Note that
00148  * for a pass-by-reference datatype, datum1 points into the "tuple" storage.
00149  *
00150  * When sorting single Datums, the data value is represented directly by
00151  * datum1/isnull1.  If the datatype is pass-by-reference and isnull1 is false,
00152  * then datum1 points to a separately palloc'd data value that is also pointed
00153  * to by the "tuple" pointer; otherwise "tuple" is NULL.
00154  *
00155  * While building initial runs, tupindex holds the tuple's run number.  During
00156  * merge passes, we re-use it to hold the input tape number that each tuple in
00157  * the heap was read from, or to hold the index of the next tuple pre-read
00158  * from the same tape in the case of pre-read entries.  tupindex goes unused
00159  * if the sort occurs entirely in memory.
00160  */
00161 typedef struct
00162 {
00163     void       *tuple;          /* the tuple proper */
00164     Datum       datum1;         /* value of first key column */
00165     bool        isnull1;        /* is first key column NULL? */
00166     int         tupindex;       /* see notes above */
00167 } SortTuple;
00168 
00169 
00170 /*
00171  * Possible states of a Tuplesort object.  These denote the states that
00172  * persist between calls of Tuplesort routines.
00173  */
00174 typedef enum
00175 {
00176     TSS_INITIAL,                /* Loading tuples; still within memory limit */
00177     TSS_BOUNDED,                /* Loading tuples into bounded-size heap */
00178     TSS_BUILDRUNS,              /* Loading tuples; writing to tape */
00179     TSS_SORTEDINMEM,            /* Sort completed entirely in memory */
00180     TSS_SORTEDONTAPE,           /* Sort completed, final run is on tape */
00181     TSS_FINALMERGE              /* Performing final merge on-the-fly */
00182 } TupSortStatus;
00183 
00184 /*
00185  * Parameters for calculation of number of tapes to use --- see inittapes()
00186  * and tuplesort_merge_order().
00187  *
00188  * In this calculation we assume that each tape will cost us about 3 blocks
00189  * worth of buffer space (which is an underestimate for very large data
00190  * volumes, but it's probably close enough --- see logtape.c).
00191  *
00192  * MERGE_BUFFER_SIZE is how much data we'd like to read from each input
00193  * tape during a preread cycle (see discussion at top of file).
00194  */
00195 #define MINORDER        6       /* minimum merge order */
00196 #define TAPE_BUFFER_OVERHEAD        (BLCKSZ * 3)
00197 #define MERGE_BUFFER_SIZE           (BLCKSZ * 32)
00198 
00199 typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
00200                                                 Tuplesortstate *state);
00201 
00202 /*
00203  * Private state of a Tuplesort operation.
00204  */
00205 struct Tuplesortstate
00206 {
00207     TupSortStatus status;       /* enumerated value as shown above */
00208     int         nKeys;          /* number of columns in sort key */
00209     bool        randomAccess;   /* did caller request random access? */
00210     bool        bounded;        /* did caller specify a maximum number of
00211                                  * tuples to return? */
00212     bool        boundUsed;      /* true if we made use of a bounded heap */
00213     int         bound;          /* if bounded, the maximum number of tuples */
00214     long        availMem;       /* remaining memory available, in bytes */
00215     long        allowedMem;     /* total memory allowed, in bytes */
00216     int         maxTapes;       /* number of tapes (Knuth's T) */
00217     int         tapeRange;      /* maxTapes-1 (Knuth's P) */
00218     MemoryContext sortcontext;  /* memory context holding all sort data */
00219     LogicalTapeSet *tapeset;    /* logtape.c object for tapes in a temp file */
00220 
00221     /*
00222      * These function pointers decouple the routines that must know what kind
00223      * of tuple we are sorting from the routines that don't need to know it.
00224      * They are set up by the tuplesort_begin_xxx routines.
00225      *
00226      * Function to compare two tuples; result is per qsort() convention, ie:
00227      * <0, 0, >0 according as a<b, a=b, a>b.  The API must match
00228      * qsort_arg_comparator.
00229      */
00230     SortTupleComparator comparetup;
00231 
00232     /*
00233      * Function to copy a supplied input tuple into palloc'd space and set up
00234      * its SortTuple representation (ie, set tuple/datum1/isnull1).  Also,
00235      * state->availMem must be decreased by the amount of space used for the
00236      * tuple copy (note the SortTuple struct itself is not counted).
00237      */
00238     void        (*copytup) (Tuplesortstate *state, SortTuple *stup, void *tup);
00239 
00240     /*
00241      * Function to write a stored tuple onto tape.  The representation of the
00242      * tuple on tape need not be the same as it is in memory; requirements on
00243      * the tape representation are given below.  After writing the tuple,
00244      * pfree() the out-of-line data (not the SortTuple struct!), and increase
00245      * state->availMem by the amount of memory space thereby released.
00246      */
00247     void        (*writetup) (Tuplesortstate *state, int tapenum,
00248                                          SortTuple *stup);
00249 
00250     /*
00251      * Function to read a stored tuple from tape back into memory. 'len' is
00252      * the already-read length of the stored tuple.  Create a palloc'd copy,
00253      * initialize tuple/datum1/isnull1 in the target SortTuple struct, and
00254      * decrease state->availMem by the amount of memory space consumed.
00255      */
00256     void        (*readtup) (Tuplesortstate *state, SortTuple *stup,
00257                                         int tapenum, unsigned int len);
00258 
00259     /*
00260      * Function to reverse the sort direction from its current state. (We
00261      * could dispense with this if we wanted to enforce that all variants
00262      * represent the sort key information alike.)
00263      */
00264     void        (*reversedirection) (Tuplesortstate *state);
00265 
00266     /*
00267      * This array holds the tuples now in sort memory.  If we are in state
00268      * INITIAL, the tuples are in no particular order; if we are in state
00269      * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
00270      * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
00271      * H.  (Note that memtupcount only counts the tuples that are part of the
00272      * heap --- during merge passes, memtuples[] entries beyond tapeRange are
00273      * never in the heap and are used to hold pre-read tuples.)  In state
00274      * SORTEDONTAPE, the array is not used.
00275      */
00276     SortTuple  *memtuples;      /* array of SortTuple structs */
00277     int         memtupcount;    /* number of tuples currently present */
00278     int         memtupsize;     /* allocated length of memtuples array */
00279     bool        growmemtuples;  /* memtuples' growth still underway? */
00280 
00281     /*
00282      * While building initial runs, this is the current output run number
00283      * (starting at 0).  Afterwards, it is the number of initial runs we made.
00284      */
00285     int         currentRun;
00286 
00287     /*
00288      * Unless otherwise noted, all pointer variables below are pointers to
00289      * arrays of length maxTapes, holding per-tape data.
00290      */
00291 
00292     /*
00293      * These variables are only used during merge passes.  mergeactive[i] is
00294      * true if we are reading an input run from (actual) tape number i and
00295      * have not yet exhausted that run.  mergenext[i] is the memtuples index
00296      * of the next pre-read tuple (next to be loaded into the heap) for tape
00297      * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
00298      * points to the last pre-read tuple from each tape.  mergeavailslots[i]
00299      * is the number of unused memtuples[] slots reserved for tape i, and
00300      * mergeavailmem[i] is the amount of unused space allocated for tape i.
00301      * mergefreelist and mergefirstfree keep track of unused locations in the
00302      * memtuples[] array.  The memtuples[].tupindex fields link together
00303      * pre-read tuples for each tape as well as recycled locations in
00304      * mergefreelist. It is OK to use 0 as a null link in these lists, because
00305      * memtuples[0] is part of the merge heap and is never a pre-read tuple.
00306      */
00307     bool       *mergeactive;    /* active input run source? */
00308     int        *mergenext;      /* first preread tuple for each source */
00309     int        *mergelast;      /* last preread tuple for each source */
00310     int        *mergeavailslots;    /* slots left for prereading each tape */
00311     long       *mergeavailmem;  /* availMem for prereading each tape */
00312     int         mergefreelist;  /* head of freelist of recycled slots */
00313     int         mergefirstfree; /* first slot never used in this merge */
00314 
00315     /*
00316      * Variables for Algorithm D.  Note that destTape is a "logical" tape
00317      * number, ie, an index into the tp_xxx[] arrays.  Be careful to keep
00318      * "logical" and "actual" tape numbers straight!
00319      */
00320     int         Level;          /* Knuth's l */
00321     int         destTape;       /* current output tape (Knuth's j, less 1) */
00322     int        *tp_fib;         /* Target Fibonacci run counts (A[]) */
00323     int        *tp_runs;        /* # of real runs on each tape */
00324     int        *tp_dummy;       /* # of dummy runs for each tape (D[]) */
00325     int        *tp_tapenum;     /* Actual tape numbers (TAPE[]) */
00326     int         activeTapes;    /* # of active input tapes in merge pass */
00327 
00328     /*
00329      * These variables are used after completion of sorting to keep track of
00330      * the next tuple to return.  (In the tape case, the tape's current read
00331      * position is also critical state.)
00332      */
00333     int         result_tape;    /* actual tape number of finished output */
00334     int         current;        /* array index (only used if SORTEDINMEM) */
00335     bool        eof_reached;    /* reached EOF (needed for cursors) */
00336 
00337     /* markpos_xxx holds marked position for mark and restore */
00338     long        markpos_block;  /* tape block# (only used if SORTEDONTAPE) */
00339     int         markpos_offset; /* saved "current", or offset in tape block */
00340     bool        markpos_eof;    /* saved "eof_reached" */
00341 
00342     /*
00343      * These variables are specific to the MinimalTuple case; they are set by
00344      * tuplesort_begin_heap and used only by the MinimalTuple routines.
00345      */
00346     TupleDesc   tupDesc;
00347     SortSupport sortKeys;       /* array of length nKeys */
00348 
00349     /*
00350      * This variable is shared by the single-key MinimalTuple case and the
00351      * Datum case (which both use qsort_ssup()).  Otherwise it's NULL.
00352      */
00353     SortSupport onlyKey;
00354 
00355     /*
00356      * These variables are specific to the CLUSTER case; they are set by
00357      * tuplesort_begin_cluster.  Note CLUSTER also uses tupDesc and
00358      * indexScanKey.
00359      */
00360     IndexInfo  *indexInfo;      /* info about index being used for reference */
00361     EState     *estate;         /* for evaluating index expressions */
00362 
00363     /*
00364      * These variables are specific to the IndexTuple case; they are set by
00365      * tuplesort_begin_index_xxx and used only by the IndexTuple routines.
00366      */
00367     Relation    heapRel;        /* table the index is being built on */
00368     Relation    indexRel;       /* index being built */
00369 
00370     /* These are specific to the index_btree subcase: */
00371     ScanKey     indexScanKey;
00372     bool        enforceUnique;  /* complain if we find duplicate tuples */
00373 
00374     /* These are specific to the index_hash subcase: */
00375     uint32      hash_mask;      /* mask for sortable part of hash code */
00376 
00377     /*
00378      * These variables are specific to the Datum case; they are set by
00379      * tuplesort_begin_datum and used only by the DatumTuple routines.
00380      */
00381     Oid         datumType;
00382     /* we need typelen and byval in order to know how to copy the Datums. */
00383     int         datumTypeLen;
00384     bool        datumTypeByVal;
00385 
00386     /*
00387      * Resource snapshot for time of sort start.
00388      */
00389 #ifdef TRACE_SORT
00390     PGRUsage    ru_start;
00391 #endif
00392 };
00393 
00394 #define COMPARETUP(state,a,b)   ((*(state)->comparetup) (a, b, state))
00395 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
00396 #define WRITETUP(state,tape,stup)   ((*(state)->writetup) (state, tape, stup))
00397 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
00398 #define REVERSEDIRECTION(state) ((*(state)->reversedirection) (state))
00399 #define LACKMEM(state)      ((state)->availMem < 0)
00400 #define USEMEM(state,amt)   ((state)->availMem -= (amt))
00401 #define FREEMEM(state,amt)  ((state)->availMem += (amt))
00402 
00403 /*
00404  * NOTES about on-tape representation of tuples:
00405  *
00406  * We require the first "unsigned int" of a stored tuple to be the total size
00407  * on-tape of the tuple, including itself (so it is never zero; an all-zero
00408  * unsigned int is used to delimit runs).  The remainder of the stored tuple
00409  * may or may not match the in-memory representation of the tuple ---
00410  * any conversion needed is the job of the writetup and readtup routines.
00411  *
00412  * If state->randomAccess is true, then the stored representation of the
00413  * tuple must be followed by another "unsigned int" that is a copy of the
00414  * length --- so the total tape space used is actually sizeof(unsigned int)
00415  * more than the stored length value.  This allows read-backwards.  When
00416  * randomAccess is not true, the write/read routines may omit the extra
00417  * length word.
00418  *
00419  * writetup is expected to write both length words as well as the tuple
00420  * data.  When readtup is called, the tape is positioned just after the
00421  * front length word; readtup must read the tuple data and advance past
00422  * the back length word (if present).
00423  *
00424  * The write/read routines can make use of the tuple description data
00425  * stored in the Tuplesortstate record, if needed.  They are also expected
00426  * to adjust state->availMem by the amount of memory space (not tape space!)
00427  * released or consumed.  There is no error return from either writetup
00428  * or readtup; they should ereport() on failure.
00429  *
00430  *
00431  * NOTES about memory consumption calculations:
00432  *
00433  * We count space allocated for tuples against the workMem limit, plus
00434  * the space used by the variable-size memtuples array.  Fixed-size space
00435  * is not counted; it's small enough to not be interesting.
00436  *
00437  * Note that we count actual space used (as shown by GetMemoryChunkSpace)
00438  * rather than the originally-requested size.  This is important since
00439  * palloc can add substantial overhead.  It's not a complete answer since
00440  * we won't count any wasted space in palloc allocation blocks, but it's
00441  * a lot better than what we were doing before 7.3.
00442  */
00443 
00444 /* When using this macro, beware of double evaluation of len */
00445 #define LogicalTapeReadExact(tapeset, tapenum, ptr, len) \
00446     do { \
00447         if (LogicalTapeRead(tapeset, tapenum, ptr, len) != (size_t) (len)) \
00448             elog(ERROR, "unexpected end of data"); \
00449     } while(0)
00450 
00451 
00452 static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
00453 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
00454 static void inittapes(Tuplesortstate *state);
00455 static void selectnewtape(Tuplesortstate *state);
00456 static void mergeruns(Tuplesortstate *state);
00457 static void mergeonerun(Tuplesortstate *state);
00458 static void beginmerge(Tuplesortstate *state);
00459 static void mergepreread(Tuplesortstate *state);
00460 static void mergeprereadone(Tuplesortstate *state, int srcTape);
00461 static void dumptuples(Tuplesortstate *state, bool alltuples);
00462 static void make_bounded_heap(Tuplesortstate *state);
00463 static void sort_bounded_heap(Tuplesortstate *state);
00464 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
00465                       int tupleindex, bool checkIndex);
00466 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
00467 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
00468 static void markrunend(Tuplesortstate *state, int tapenum);
00469 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
00470                 Tuplesortstate *state);
00471 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
00472 static void writetup_heap(Tuplesortstate *state, int tapenum,
00473               SortTuple *stup);
00474 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
00475              int tapenum, unsigned int len);
00476 static void reversedirection_heap(Tuplesortstate *state);
00477 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
00478                    Tuplesortstate *state);
00479 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
00480 static void writetup_cluster(Tuplesortstate *state, int tapenum,
00481                  SortTuple *stup);
00482 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
00483                 int tapenum, unsigned int len);
00484 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
00485                        Tuplesortstate *state);
00486 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
00487                       Tuplesortstate *state);
00488 static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
00489 static void writetup_index(Tuplesortstate *state, int tapenum,
00490                SortTuple *stup);
00491 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
00492               int tapenum, unsigned int len);
00493 static void reversedirection_index_btree(Tuplesortstate *state);
00494 static void reversedirection_index_hash(Tuplesortstate *state);
00495 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
00496                  Tuplesortstate *state);
00497 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
00498 static void writetup_datum(Tuplesortstate *state, int tapenum,
00499                SortTuple *stup);
00500 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
00501               int tapenum, unsigned int len);
00502 static void reversedirection_datum(Tuplesortstate *state);
00503 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
00504 
00505 /*
00506  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
00507  * any variant of SortTuples, using the appropriate comparetup function.
00508  * qsort_ssup() is specialized for the case where the comparetup function
00509  * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
00510  * and Datum sorts.
00511  */
00512 #include "qsort_tuple.c"
00513 
00514 
00515 /*
00516  *      tuplesort_begin_xxx
00517  *
00518  * Initialize for a tuple sort operation.
00519  *
00520  * After calling tuplesort_begin, the caller should call tuplesort_putXXX
00521  * zero or more times, then call tuplesort_performsort when all the tuples
00522  * have been supplied.  After performsort, retrieve the tuples in sorted
00523  * order by calling tuplesort_getXXX until it returns false/NULL.  (If random
00524  * access was requested, rescan, markpos, and restorepos can also be called.)
00525  * Call tuplesort_end to terminate the operation and release memory/disk space.
00526  *
00527  * Each variant of tuplesort_begin has a workMem parameter specifying the
00528  * maximum number of kilobytes of RAM to use before spilling data to disk.
00529  * (The normal value of this parameter is work_mem, but some callers use
00530  * other values.)  Each variant also has a randomAccess parameter specifying
00531  * whether the caller needs non-sequential access to the sort result.
00532  */
00533 
00534 static Tuplesortstate *
00535 tuplesort_begin_common(int workMem, bool randomAccess)
00536 {
00537     Tuplesortstate *state;
00538     MemoryContext sortcontext;
00539     MemoryContext oldcontext;
00540 
00541     /*
00542      * Create a working memory context for this sort operation. All data
00543      * needed by the sort will live inside this context.
00544      */
00545     sortcontext = AllocSetContextCreate(CurrentMemoryContext,
00546                                         "TupleSort",
00547                                         ALLOCSET_DEFAULT_MINSIZE,
00548                                         ALLOCSET_DEFAULT_INITSIZE,
00549                                         ALLOCSET_DEFAULT_MAXSIZE);
00550 
00551     /*
00552      * Make the Tuplesortstate within the per-sort context.  This way, we
00553      * don't need a separate pfree() operation for it at shutdown.
00554      */
00555     oldcontext = MemoryContextSwitchTo(sortcontext);
00556 
00557     state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
00558 
00559 #ifdef TRACE_SORT
00560     if (trace_sort)
00561         pg_rusage_init(&state->ru_start);
00562 #endif
00563 
00564     state->status = TSS_INITIAL;
00565     state->randomAccess = randomAccess;
00566     state->bounded = false;
00567     state->boundUsed = false;
00568     state->allowedMem = workMem * 1024L;
00569     state->availMem = state->allowedMem;
00570     state->sortcontext = sortcontext;
00571     state->tapeset = NULL;
00572 
00573     state->memtupcount = 0;
00574     state->memtupsize = 1024;   /* initial guess */
00575     state->growmemtuples = true;
00576     state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
00577 
00578     USEMEM(state, GetMemoryChunkSpace(state->memtuples));
00579 
00580     /* workMem must be large enough for the minimal memtuples array */
00581     if (LACKMEM(state))
00582         elog(ERROR, "insufficient memory allowed for sort");
00583 
00584     state->currentRun = 0;
00585 
00586     /*
00587      * maxTapes, tapeRange, and Algorithm D variables will be initialized by
00588      * inittapes(), if needed
00589      */
00590 
00591     state->result_tape = -1;    /* flag that result tape has not been formed */
00592 
00593     MemoryContextSwitchTo(oldcontext);
00594 
00595     return state;
00596 }
00597 
00598 Tuplesortstate *
00599 tuplesort_begin_heap(TupleDesc tupDesc,
00600                      int nkeys, AttrNumber *attNums,
00601                      Oid *sortOperators, Oid *sortCollations,
00602                      bool *nullsFirstFlags,
00603                      int workMem, bool randomAccess)
00604 {
00605     Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
00606     MemoryContext oldcontext;
00607     int         i;
00608 
00609     oldcontext = MemoryContextSwitchTo(state->sortcontext);
00610 
00611     AssertArg(nkeys > 0);
00612 
00613 #ifdef TRACE_SORT
00614     if (trace_sort)
00615         elog(LOG,
00616              "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
00617              nkeys, workMem, randomAccess ? 't' : 'f');
00618 #endif
00619 
00620     state->nKeys = nkeys;
00621 
00622     TRACE_POSTGRESQL_SORT_START(HEAP_SORT,
00623                                 false,  /* no unique check */
00624                                 nkeys,
00625                                 workMem,
00626                                 randomAccess);
00627 
00628     state->comparetup = comparetup_heap;
00629     state->copytup = copytup_heap;
00630     state->writetup = writetup_heap;
00631     state->readtup = readtup_heap;
00632     state->reversedirection = reversedirection_heap;
00633 
00634     state->tupDesc = tupDesc;   /* assume we need not copy tupDesc */
00635 
00636     /* Prepare SortSupport data for each column */
00637     state->sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
00638 
00639     for (i = 0; i < nkeys; i++)
00640     {
00641         SortSupport sortKey = state->sortKeys + i;
00642 
00643         AssertArg(attNums[i] != 0);
00644         AssertArg(sortOperators[i] != 0);
00645 
00646         sortKey->ssup_cxt = CurrentMemoryContext;
00647         sortKey->ssup_collation = sortCollations[i];
00648         sortKey->ssup_nulls_first = nullsFirstFlags[i];
00649         sortKey->ssup_attno = attNums[i];
00650 
00651         PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
00652     }
00653 
00654     if (nkeys == 1)
00655         state->onlyKey = state->sortKeys;
00656 
00657     MemoryContextSwitchTo(oldcontext);
00658 
00659     return state;
00660 }
00661 
00662 Tuplesortstate *
00663 tuplesort_begin_cluster(TupleDesc tupDesc,
00664                         Relation indexRel,
00665                         int workMem, bool randomAccess)
00666 {
00667     Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
00668     MemoryContext oldcontext;
00669 
00670     Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
00671 
00672     oldcontext = MemoryContextSwitchTo(state->sortcontext);
00673 
00674 #ifdef TRACE_SORT
00675     if (trace_sort)
00676         elog(LOG,
00677              "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
00678              RelationGetNumberOfAttributes(indexRel),
00679              workMem, randomAccess ? 't' : 'f');
00680 #endif
00681 
00682     state->nKeys = RelationGetNumberOfAttributes(indexRel);
00683 
00684     TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
00685                                 false,  /* no unique check */
00686                                 state->nKeys,
00687                                 workMem,
00688                                 randomAccess);
00689 
00690     state->comparetup = comparetup_cluster;
00691     state->copytup = copytup_cluster;
00692     state->writetup = writetup_cluster;
00693     state->readtup = readtup_cluster;
00694     state->reversedirection = reversedirection_index_btree;
00695 
00696     state->indexInfo = BuildIndexInfo(indexRel);
00697     state->indexScanKey = _bt_mkscankey_nodata(indexRel);
00698 
00699     state->tupDesc = tupDesc;   /* assume we need not copy tupDesc */
00700 
00701     if (state->indexInfo->ii_Expressions != NULL)
00702     {
00703         TupleTableSlot *slot;
00704         ExprContext *econtext;
00705 
00706         /*
00707          * We will need to use FormIndexDatum to evaluate the index
00708          * expressions.  To do that, we need an EState, as well as a
00709          * TupleTableSlot to put the table tuples into.  The econtext's
00710          * scantuple has to point to that slot, too.
00711          */
00712         state->estate = CreateExecutorState();
00713         slot = MakeSingleTupleTableSlot(tupDesc);
00714         econtext = GetPerTupleExprContext(state->estate);
00715         econtext->ecxt_scantuple = slot;
00716     }
00717 
00718     MemoryContextSwitchTo(oldcontext);
00719 
00720     return state;
00721 }
00722 
00723 Tuplesortstate *
00724 tuplesort_begin_index_btree(Relation heapRel,
00725                             Relation indexRel,
00726                             bool enforceUnique,
00727                             int workMem, bool randomAccess)
00728 {
00729     Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
00730     MemoryContext oldcontext;
00731 
00732     oldcontext = MemoryContextSwitchTo(state->sortcontext);
00733 
00734 #ifdef TRACE_SORT
00735     if (trace_sort)
00736         elog(LOG,
00737              "begin index sort: unique = %c, workMem = %d, randomAccess = %c",
00738              enforceUnique ? 't' : 'f',
00739              workMem, randomAccess ? 't' : 'f');
00740 #endif
00741 
00742     state->nKeys = RelationGetNumberOfAttributes(indexRel);
00743 
00744     TRACE_POSTGRESQL_SORT_START(INDEX_SORT,
00745                                 enforceUnique,
00746                                 state->nKeys,
00747                                 workMem,
00748                                 randomAccess);
00749 
00750     state->comparetup = comparetup_index_btree;
00751     state->copytup = copytup_index;
00752     state->writetup = writetup_index;
00753     state->readtup = readtup_index;
00754     state->reversedirection = reversedirection_index_btree;
00755 
00756     state->heapRel = heapRel;
00757     state->indexRel = indexRel;
00758     state->indexScanKey = _bt_mkscankey_nodata(indexRel);
00759     state->enforceUnique = enforceUnique;
00760 
00761     MemoryContextSwitchTo(oldcontext);
00762 
00763     return state;
00764 }
00765 
00766 Tuplesortstate *
00767 tuplesort_begin_index_hash(Relation heapRel,
00768                            Relation indexRel,
00769                            uint32 hash_mask,
00770                            int workMem, bool randomAccess)
00771 {
00772     Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
00773     MemoryContext oldcontext;
00774 
00775     oldcontext = MemoryContextSwitchTo(state->sortcontext);
00776 
00777 #ifdef TRACE_SORT
00778     if (trace_sort)
00779         elog(LOG,
00780         "begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
00781              hash_mask,
00782              workMem, randomAccess ? 't' : 'f');
00783 #endif
00784 
00785     state->nKeys = 1;           /* Only one sort column, the hash code */
00786 
00787     state->comparetup = comparetup_index_hash;
00788     state->copytup = copytup_index;
00789     state->writetup = writetup_index;
00790     state->readtup = readtup_index;
00791     state->reversedirection = reversedirection_index_hash;
00792 
00793     state->heapRel = heapRel;
00794     state->indexRel = indexRel;
00795 
00796     state->hash_mask = hash_mask;
00797 
00798     MemoryContextSwitchTo(oldcontext);
00799 
00800     return state;
00801 }
00802 
00803 Tuplesortstate *
00804 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
00805                       bool nullsFirstFlag,
00806                       int workMem, bool randomAccess)
00807 {
00808     Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
00809     MemoryContext oldcontext;
00810     int16       typlen;
00811     bool        typbyval;
00812 
00813     oldcontext = MemoryContextSwitchTo(state->sortcontext);
00814 
00815 #ifdef TRACE_SORT
00816     if (trace_sort)
00817         elog(LOG,
00818              "begin datum sort: workMem = %d, randomAccess = %c",
00819              workMem, randomAccess ? 't' : 'f');
00820 #endif
00821 
00822     state->nKeys = 1;           /* always a one-column sort */
00823 
00824     TRACE_POSTGRESQL_SORT_START(DATUM_SORT,
00825                                 false,  /* no unique check */
00826                                 1,
00827                                 workMem,
00828                                 randomAccess);
00829 
00830     state->comparetup = comparetup_datum;
00831     state->copytup = copytup_datum;
00832     state->writetup = writetup_datum;
00833     state->readtup = readtup_datum;
00834     state->reversedirection = reversedirection_datum;
00835 
00836     state->datumType = datumType;
00837 
00838     /* Prepare SortSupport data */
00839     state->onlyKey = (SortSupport) palloc0(sizeof(SortSupportData));
00840 
00841     state->onlyKey->ssup_cxt = CurrentMemoryContext;
00842     state->onlyKey->ssup_collation = sortCollation;
00843     state->onlyKey->ssup_nulls_first = nullsFirstFlag;
00844 
00845     PrepareSortSupportFromOrderingOp(sortOperator, state->onlyKey);
00846 
00847     /* lookup necessary attributes of the datum type */
00848     get_typlenbyval(datumType, &typlen, &typbyval);
00849     state->datumTypeLen = typlen;
00850     state->datumTypeByVal = typbyval;
00851 
00852     MemoryContextSwitchTo(oldcontext);
00853 
00854     return state;
00855 }
00856 
00857 /*
00858  * tuplesort_set_bound
00859  *
00860  *  Advise tuplesort that at most the first N result tuples are required.
00861  *
00862  * Must be called before inserting any tuples.  (Actually, we could allow it
00863  * as long as the sort hasn't spilled to disk, but there seems no need for
00864  * delayed calls at the moment.)
00865  *
00866  * This is a hint only. The tuplesort may still return more tuples than
00867  * requested.
00868  */
00869 void
00870 tuplesort_set_bound(Tuplesortstate *state, int64 bound)
00871 {
00872     /* Assert we're called before loading any tuples */
00873     Assert(state->status == TSS_INITIAL);
00874     Assert(state->memtupcount == 0);
00875     Assert(!state->bounded);
00876 
00877 #ifdef DEBUG_BOUNDED_SORT
00878     /* Honor GUC setting that disables the feature (for easy testing) */
00879     if (!optimize_bounded_sort)
00880         return;
00881 #endif
00882 
00883     /* We want to be able to compute bound * 2, so limit the setting */
00884     if (bound > (int64) (INT_MAX / 2))
00885         return;
00886 
00887     state->bounded = true;
00888     state->bound = (int) bound;
00889 }
00890 
00891 /*
00892  * tuplesort_end
00893  *
00894  *  Release resources and clean up.
00895  *
00896  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
00897  * pointing to garbage.  Be careful not to attempt to use or free such
00898  * pointers afterwards!
00899  */
00900 void
00901 tuplesort_end(Tuplesortstate *state)
00902 {
00903     /* context swap probably not needed, but let's be safe */
00904     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
00905 
00906 #ifdef TRACE_SORT
00907     long        spaceUsed;
00908 
00909     if (state->tapeset)
00910         spaceUsed = LogicalTapeSetBlocks(state->tapeset);
00911     else
00912         spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
00913 #endif
00914 
00915     /*
00916      * Delete temporary "tape" files, if any.
00917      *
00918      * Note: want to include this in reported total cost of sort, hence need
00919      * for two #ifdef TRACE_SORT sections.
00920      */
00921     if (state->tapeset)
00922         LogicalTapeSetClose(state->tapeset);
00923 
00924 #ifdef TRACE_SORT
00925     if (trace_sort)
00926     {
00927         if (state->tapeset)
00928             elog(LOG, "external sort ended, %ld disk blocks used: %s",
00929                  spaceUsed, pg_rusage_show(&state->ru_start));
00930         else
00931             elog(LOG, "internal sort ended, %ld KB used: %s",
00932                  spaceUsed, pg_rusage_show(&state->ru_start));
00933     }
00934 
00935     TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
00936 #else
00937 
00938     /*
00939      * If you disabled TRACE_SORT, you can still probe sort__done, but you
00940      * ain't getting space-used stats.
00941      */
00942     TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
00943 #endif
00944 
00945     /* Free any execution state created for CLUSTER case */
00946     if (state->estate != NULL)
00947     {
00948         ExprContext *econtext = GetPerTupleExprContext(state->estate);
00949 
00950         ExecDropSingleTupleTableSlot(econtext->ecxt_scantuple);
00951         FreeExecutorState(state->estate);
00952     }
00953 
00954     MemoryContextSwitchTo(oldcontext);
00955 
00956     /*
00957      * Free the per-sort memory context, thereby releasing all working memory,
00958      * including the Tuplesortstate struct itself.
00959      */
00960     MemoryContextDelete(state->sortcontext);
00961 }
00962 
00963 /*
00964  * Grow the memtuples[] array, if possible within our memory constraint.
00965  * Return TRUE if we were able to enlarge the array, FALSE if not.
00966  *
00967  * Normally, at each increment we double the size of the array.  When we no
00968  * longer have enough memory to do that, we attempt one last, smaller increase
00969  * (and then clear the growmemtuples flag so we don't try any more).  That
00970  * allows us to use allowedMem as fully as possible; sticking to the pure
00971  * doubling rule could result in almost half of allowedMem going unused.
00972  * Because availMem moves around with tuple addition/removal, we need some
00973  * rule to prevent making repeated small increases in memtupsize, which would
00974  * just be useless thrashing.  The growmemtuples flag accomplishes that and
00975  * also prevents useless recalculations in this function.
00976  */
00977 static bool
00978 grow_memtuples(Tuplesortstate *state)
00979 {
00980     int         newmemtupsize;
00981     int         memtupsize = state->memtupsize;
00982     long        memNowUsed = state->allowedMem - state->availMem;
00983 
00984     /* Forget it if we've already maxed out memtuples, per comment above */
00985     if (!state->growmemtuples)
00986         return false;
00987 
00988     /* Select new value of memtupsize */
00989     if (memNowUsed <= state->availMem)
00990     {
00991         /*
00992          * It is surely safe to double memtupsize if we've used no more than
00993          * half of allowedMem.
00994          *
00995          * Note: it might seem that we need to worry about memtupsize * 2
00996          * overflowing an int, but the MaxAllocSize clamp applied below
00997          * ensures the existing memtupsize can't be large enough for that.
00998          */
00999         newmemtupsize = memtupsize * 2;
01000     }
01001     else
01002     {
01003         /*
01004          * This will be the last increment of memtupsize.  Abandon doubling
01005          * strategy and instead increase as much as we safely can.
01006          *
01007          * To stay within allowedMem, we can't increase memtupsize by more
01008          * than availMem / sizeof(SortTuple) elements.  In practice, we want
01009          * to increase it by considerably less, because we need to leave some
01010          * space for the tuples to which the new array slots will refer.  We
01011          * assume the new tuples will be about the same size as the tuples
01012          * we've already seen, and thus we can extrapolate from the space
01013          * consumption so far to estimate an appropriate new size for the
01014          * memtuples array.  The optimal value might be higher or lower than
01015          * this estimate, but it's hard to know that in advance.
01016          *
01017          * This calculation is safe against enlarging the array so much that
01018          * LACKMEM becomes true, because the memory currently used includes
01019          * the present array; thus, there would be enough allowedMem for the
01020          * new array elements even if no other memory were currently used.
01021          *
01022          * We do the arithmetic in float8, because otherwise the product of
01023          * memtupsize and allowedMem could overflow.  (A little algebra shows
01024          * that grow_ratio must be less than 2 here, so we are not risking
01025          * integer overflow this way.)  Any inaccuracy in the result should be
01026          * insignificant; but even if we computed a completely insane result,
01027          * the checks below will prevent anything really bad from happening.
01028          */
01029         double      grow_ratio;
01030 
01031         grow_ratio = (double) state->allowedMem / (double) memNowUsed;
01032         newmemtupsize = (int) (memtupsize * grow_ratio);
01033 
01034         /* We won't make any further enlargement attempts */
01035         state->growmemtuples = false;
01036     }
01037 
01038     /* Must enlarge array by at least one element, else report failure */
01039     if (newmemtupsize <= memtupsize)
01040         goto noalloc;
01041 
01042     /*
01043      * On a 64-bit machine, allowedMem could be more than MaxAllocSize.  Clamp
01044      * to ensure our request won't be rejected by palloc.
01045      */
01046     if ((Size) newmemtupsize >= MaxAllocSize / sizeof(SortTuple))
01047     {
01048         newmemtupsize = (int) (MaxAllocSize / sizeof(SortTuple));
01049         state->growmemtuples = false;   /* can't grow any more */
01050     }
01051 
01052     /*
01053      * We need to be sure that we do not cause LACKMEM to become true, else
01054      * the space management algorithm will go nuts.  The code above should
01055      * never generate a dangerous request, but to be safe, check explicitly
01056      * that the array growth fits within availMem.  (We could still cause
01057      * LACKMEM if the memory chunk overhead associated with the memtuples
01058      * array were to increase.  That shouldn't happen with any sane value of
01059      * allowedMem, because at any array size large enough to risk LACKMEM,
01060      * palloc would be treating both old and new arrays as separate chunks.
01061      * But we'll check LACKMEM explicitly below just in case.)
01062      */
01063     if (state->availMem < (long) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
01064         goto noalloc;
01065 
01066     /* OK, do it */
01067     FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
01068     state->memtupsize = newmemtupsize;
01069     state->memtuples = (SortTuple *)
01070         repalloc(state->memtuples,
01071                  state->memtupsize * sizeof(SortTuple));
01072     USEMEM(state, GetMemoryChunkSpace(state->memtuples));
01073     if (LACKMEM(state))
01074         elog(ERROR, "unexpected out-of-memory situation during sort");
01075     return true;
01076 
01077 noalloc:
01078     /* If for any reason we didn't realloc, shut off future attempts */
01079     state->growmemtuples = false;
01080     return false;
01081 }
01082 
01083 /*
01084  * Accept one tuple while collecting input data for sort.
01085  *
01086  * Note that the input data is always copied; the caller need not save it.
01087  */
01088 void
01089 tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
01090 {
01091     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01092     SortTuple   stup;
01093 
01094     /*
01095      * Copy the given tuple into memory we control, and decrease availMem.
01096      * Then call the common code.
01097      */
01098     COPYTUP(state, &stup, (void *) slot);
01099 
01100     puttuple_common(state, &stup);
01101 
01102     MemoryContextSwitchTo(oldcontext);
01103 }
01104 
01105 /*
01106  * Accept one tuple while collecting input data for sort.
01107  *
01108  * Note that the input data is always copied; the caller need not save it.
01109  */
01110 void
01111 tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup)
01112 {
01113     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01114     SortTuple   stup;
01115 
01116     /*
01117      * Copy the given tuple into memory we control, and decrease availMem.
01118      * Then call the common code.
01119      */
01120     COPYTUP(state, &stup, (void *) tup);
01121 
01122     puttuple_common(state, &stup);
01123 
01124     MemoryContextSwitchTo(oldcontext);
01125 }
01126 
01127 /*
01128  * Accept one index tuple while collecting input data for sort.
01129  *
01130  * Note that the input tuple is always copied; the caller need not save it.
01131  */
01132 void
01133 tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple)
01134 {
01135     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01136     SortTuple   stup;
01137 
01138     /*
01139      * Copy the given tuple into memory we control, and decrease availMem.
01140      * Then call the common code.
01141      */
01142     COPYTUP(state, &stup, (void *) tuple);
01143 
01144     puttuple_common(state, &stup);
01145 
01146     MemoryContextSwitchTo(oldcontext);
01147 }
01148 
01149 /*
01150  * Accept one Datum while collecting input data for sort.
01151  *
01152  * If the Datum is pass-by-ref type, the value will be copied.
01153  */
01154 void
01155 tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
01156 {
01157     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01158     SortTuple   stup;
01159 
01160     /*
01161      * If it's a pass-by-reference value, copy it into memory we control, and
01162      * decrease availMem.  Then call the common code.
01163      */
01164     if (isNull || state->datumTypeByVal)
01165     {
01166         stup.datum1 = val;
01167         stup.isnull1 = isNull;
01168         stup.tuple = NULL;      /* no separate storage */
01169     }
01170     else
01171     {
01172         stup.datum1 = datumCopy(val, false, state->datumTypeLen);
01173         stup.isnull1 = false;
01174         stup.tuple = DatumGetPointer(stup.datum1);
01175         USEMEM(state, GetMemoryChunkSpace(stup.tuple));
01176     }
01177 
01178     puttuple_common(state, &stup);
01179 
01180     MemoryContextSwitchTo(oldcontext);
01181 }
01182 
01183 /*
01184  * Shared code for tuple and datum cases.
01185  */
01186 static void
01187 puttuple_common(Tuplesortstate *state, SortTuple *tuple)
01188 {
01189     switch (state->status)
01190     {
01191         case TSS_INITIAL:
01192 
01193             /*
01194              * Save the tuple into the unsorted array.  First, grow the array
01195              * as needed.  Note that we try to grow the array when there is
01196              * still one free slot remaining --- if we fail, there'll still be
01197              * room to store the incoming tuple, and then we'll switch to
01198              * tape-based operation.
01199              */
01200             if (state->memtupcount >= state->memtupsize - 1)
01201             {
01202                 (void) grow_memtuples(state);
01203                 Assert(state->memtupcount < state->memtupsize);
01204             }
01205             state->memtuples[state->memtupcount++] = *tuple;
01206 
01207             /*
01208              * Check if it's time to switch over to a bounded heapsort. We do
01209              * so if the input tuple count exceeds twice the desired tuple
01210              * count (this is a heuristic for where heapsort becomes cheaper
01211              * than a quicksort), or if we've just filled workMem and have
01212              * enough tuples to meet the bound.
01213              *
01214              * Note that once we enter TSS_BOUNDED state we will always try to
01215              * complete the sort that way.  In the worst case, if later input
01216              * tuples are larger than earlier ones, this might cause us to
01217              * exceed workMem significantly.
01218              */
01219             if (state->bounded &&
01220                 (state->memtupcount > state->bound * 2 ||
01221                  (state->memtupcount > state->bound && LACKMEM(state))))
01222             {
01223 #ifdef TRACE_SORT
01224                 if (trace_sort)
01225                     elog(LOG, "switching to bounded heapsort at %d tuples: %s",
01226                          state->memtupcount,
01227                          pg_rusage_show(&state->ru_start));
01228 #endif
01229                 make_bounded_heap(state);
01230                 return;
01231             }
01232 
01233             /*
01234              * Done if we still fit in available memory and have array slots.
01235              */
01236             if (state->memtupcount < state->memtupsize && !LACKMEM(state))
01237                 return;
01238 
01239             /*
01240              * Nope; time to switch to tape-based operation.
01241              */
01242             inittapes(state);
01243 
01244             /*
01245              * Dump tuples until we are back under the limit.
01246              */
01247             dumptuples(state, false);
01248             break;
01249 
01250         case TSS_BOUNDED:
01251 
01252             /*
01253              * We don't want to grow the array here, so check whether the new
01254              * tuple can be discarded before putting it in.  This should be a
01255              * good speed optimization, too, since when there are many more
01256              * input tuples than the bound, most input tuples can be discarded
01257              * with just this one comparison.  Note that because we currently
01258              * have the sort direction reversed, we must check for <= not >=.
01259              */
01260             if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
01261             {
01262                 /* new tuple <= top of the heap, so we can discard it */
01263                 free_sort_tuple(state, tuple);
01264                 CHECK_FOR_INTERRUPTS();
01265             }
01266             else
01267             {
01268                 /* discard top of heap, sift up, insert new tuple */
01269                 free_sort_tuple(state, &state->memtuples[0]);
01270                 tuplesort_heap_siftup(state, false);
01271                 tuplesort_heap_insert(state, tuple, 0, false);
01272             }
01273             break;
01274 
01275         case TSS_BUILDRUNS:
01276 
01277             /*
01278              * Insert the tuple into the heap, with run number currentRun if
01279              * it can go into the current run, else run number currentRun+1.
01280              * The tuple can go into the current run if it is >= the first
01281              * not-yet-output tuple.  (Actually, it could go into the current
01282              * run if it is >= the most recently output tuple ... but that
01283              * would require keeping around the tuple we last output, and it's
01284              * simplest to let writetup free each tuple as soon as it's
01285              * written.)
01286              *
01287              * Note there will always be at least one tuple in the heap at
01288              * this point; see dumptuples.
01289              */
01290             Assert(state->memtupcount > 0);
01291             if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
01292                 tuplesort_heap_insert(state, tuple, state->currentRun, true);
01293             else
01294                 tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
01295 
01296             /*
01297              * If we are over the memory limit, dump tuples till we're under.
01298              */
01299             dumptuples(state, false);
01300             break;
01301 
01302         default:
01303             elog(ERROR, "invalid tuplesort state");
01304             break;
01305     }
01306 }
01307 
01308 /*
01309  * All tuples have been provided; finish the sort.
01310  */
01311 void
01312 tuplesort_performsort(Tuplesortstate *state)
01313 {
01314     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01315 
01316 #ifdef TRACE_SORT
01317     if (trace_sort)
01318         elog(LOG, "performsort starting: %s",
01319              pg_rusage_show(&state->ru_start));
01320 #endif
01321 
01322     switch (state->status)
01323     {
01324         case TSS_INITIAL:
01325 
01326             /*
01327              * We were able to accumulate all the tuples within the allowed
01328              * amount of memory.  Just qsort 'em and we're done.
01329              */
01330             if (state->memtupcount > 1)
01331             {
01332                 /* Can we use the single-key sort function? */
01333                 if (state->onlyKey != NULL)
01334                     qsort_ssup(state->memtuples, state->memtupcount,
01335                                state->onlyKey);
01336                 else
01337                     qsort_tuple(state->memtuples,
01338                                 state->memtupcount,
01339                                 state->comparetup,
01340                                 state);
01341             }
01342             state->current = 0;
01343             state->eof_reached = false;
01344             state->markpos_offset = 0;
01345             state->markpos_eof = false;
01346             state->status = TSS_SORTEDINMEM;
01347             break;
01348 
01349         case TSS_BOUNDED:
01350 
01351             /*
01352              * We were able to accumulate all the tuples required for output
01353              * in memory, using a heap to eliminate excess tuples.  Now we
01354              * have to transform the heap to a properly-sorted array.
01355              */
01356             sort_bounded_heap(state);
01357             state->current = 0;
01358             state->eof_reached = false;
01359             state->markpos_offset = 0;
01360             state->markpos_eof = false;
01361             state->status = TSS_SORTEDINMEM;
01362             break;
01363 
01364         case TSS_BUILDRUNS:
01365 
01366             /*
01367              * Finish tape-based sort.  First, flush all tuples remaining in
01368              * memory out to tape; then merge until we have a single remaining
01369              * run (or, if !randomAccess, one run per tape). Note that
01370              * mergeruns sets the correct state->status.
01371              */
01372             dumptuples(state, true);
01373             mergeruns(state);
01374             state->eof_reached = false;
01375             state->markpos_block = 0L;
01376             state->markpos_offset = 0;
01377             state->markpos_eof = false;
01378             break;
01379 
01380         default:
01381             elog(ERROR, "invalid tuplesort state");
01382             break;
01383     }
01384 
01385 #ifdef TRACE_SORT
01386     if (trace_sort)
01387     {
01388         if (state->status == TSS_FINALMERGE)
01389             elog(LOG, "performsort done (except %d-way final merge): %s",
01390                  state->activeTapes,
01391                  pg_rusage_show(&state->ru_start));
01392         else
01393             elog(LOG, "performsort done: %s",
01394                  pg_rusage_show(&state->ru_start));
01395     }
01396 #endif
01397 
01398     MemoryContextSwitchTo(oldcontext);
01399 }
01400 
01401 /*
01402  * Internal routine to fetch the next tuple in either forward or back
01403  * direction into *stup.  Returns FALSE if no more tuples.
01404  * If *should_free is set, the caller must pfree stup.tuple when done with it.
01405  */
01406 static bool
01407 tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
01408                           SortTuple *stup, bool *should_free)
01409 {
01410     unsigned int tuplen;
01411 
01412     switch (state->status)
01413     {
01414         case TSS_SORTEDINMEM:
01415             Assert(forward || state->randomAccess);
01416             *should_free = false;
01417             if (forward)
01418             {
01419                 if (state->current < state->memtupcount)
01420                 {
01421                     *stup = state->memtuples[state->current++];
01422                     return true;
01423                 }
01424                 state->eof_reached = true;
01425 
01426                 /*
01427                  * Complain if caller tries to retrieve more tuples than
01428                  * originally asked for in a bounded sort.  This is because
01429                  * returning EOF here might be the wrong thing.
01430                  */
01431                 if (state->bounded && state->current >= state->bound)
01432                     elog(ERROR, "retrieved too many tuples in a bounded sort");
01433 
01434                 return false;
01435             }
01436             else
01437             {
01438                 if (state->current <= 0)
01439                     return false;
01440 
01441                 /*
01442                  * if all tuples are fetched already then we return last
01443                  * tuple, else - tuple before last returned.
01444                  */
01445                 if (state->eof_reached)
01446                     state->eof_reached = false;
01447                 else
01448                 {
01449                     state->current--;   /* last returned tuple */
01450                     if (state->current <= 0)
01451                         return false;
01452                 }
01453                 *stup = state->memtuples[state->current - 1];
01454                 return true;
01455             }
01456             break;
01457 
01458         case TSS_SORTEDONTAPE:
01459             Assert(forward || state->randomAccess);
01460             *should_free = true;
01461             if (forward)
01462             {
01463                 if (state->eof_reached)
01464                     return false;
01465                 if ((tuplen = getlen(state, state->result_tape, true)) != 0)
01466                 {
01467                     READTUP(state, stup, state->result_tape, tuplen);
01468                     return true;
01469                 }
01470                 else
01471                 {
01472                     state->eof_reached = true;
01473                     return false;
01474                 }
01475             }
01476 
01477             /*
01478              * Backward.
01479              *
01480              * if all tuples are fetched already then we return last tuple,
01481              * else - tuple before last returned.
01482              */
01483             if (state->eof_reached)
01484             {
01485                 /*
01486                  * Seek position is pointing just past the zero tuplen at the
01487                  * end of file; back up to fetch last tuple's ending length
01488                  * word.  If seek fails we must have a completely empty file.
01489                  */
01490                 if (!LogicalTapeBackspace(state->tapeset,
01491                                           state->result_tape,
01492                                           2 * sizeof(unsigned int)))
01493                     return false;
01494                 state->eof_reached = false;
01495             }
01496             else
01497             {
01498                 /*
01499                  * Back up and fetch previously-returned tuple's ending length
01500                  * word.  If seek fails, assume we are at start of file.
01501                  */
01502                 if (!LogicalTapeBackspace(state->tapeset,
01503                                           state->result_tape,
01504                                           sizeof(unsigned int)))
01505                     return false;
01506                 tuplen = getlen(state, state->result_tape, false);
01507 
01508                 /*
01509                  * Back up to get ending length word of tuple before it.
01510                  */
01511                 if (!LogicalTapeBackspace(state->tapeset,
01512                                           state->result_tape,
01513                                           tuplen + 2 * sizeof(unsigned int)))
01514                 {
01515                     /*
01516                      * If that fails, presumably the prev tuple is the first
01517                      * in the file.  Back up so that it becomes next to read
01518                      * in forward direction (not obviously right, but that is
01519                      * what in-memory case does).
01520                      */
01521                     if (!LogicalTapeBackspace(state->tapeset,
01522                                               state->result_tape,
01523                                               tuplen + sizeof(unsigned int)))
01524                         elog(ERROR, "bogus tuple length in backward scan");
01525                     return false;
01526                 }
01527             }
01528 
01529             tuplen = getlen(state, state->result_tape, false);
01530 
01531             /*
01532              * Now we have the length of the prior tuple, back up and read it.
01533              * Note: READTUP expects we are positioned after the initial
01534              * length word of the tuple, so back up to that point.
01535              */
01536             if (!LogicalTapeBackspace(state->tapeset,
01537                                       state->result_tape,
01538                                       tuplen))
01539                 elog(ERROR, "bogus tuple length in backward scan");
01540             READTUP(state, stup, state->result_tape, tuplen);
01541             return true;
01542 
01543         case TSS_FINALMERGE:
01544             Assert(forward);
01545             *should_free = true;
01546 
01547             /*
01548              * This code should match the inner loop of mergeonerun().
01549              */
01550             if (state->memtupcount > 0)
01551             {
01552                 int         srcTape = state->memtuples[0].tupindex;
01553                 Size        tuplen;
01554                 int         tupIndex;
01555                 SortTuple  *newtup;
01556 
01557                 *stup = state->memtuples[0];
01558                 /* returned tuple is no longer counted in our memory space */
01559                 if (stup->tuple)
01560                 {
01561                     tuplen = GetMemoryChunkSpace(stup->tuple);
01562                     state->availMem += tuplen;
01563                     state->mergeavailmem[srcTape] += tuplen;
01564                 }
01565                 tuplesort_heap_siftup(state, false);
01566                 if ((tupIndex = state->mergenext[srcTape]) == 0)
01567                 {
01568                     /*
01569                      * out of preloaded data on this tape, try to read more
01570                      *
01571                      * Unlike mergeonerun(), we only preload from the single
01572                      * tape that's run dry.  See mergepreread() comments.
01573                      */
01574                     mergeprereadone(state, srcTape);
01575 
01576                     /*
01577                      * if still no data, we've reached end of run on this tape
01578                      */
01579                     if ((tupIndex = state->mergenext[srcTape]) == 0)
01580                         return true;
01581                 }
01582                 /* pull next preread tuple from list, insert in heap */
01583                 newtup = &state->memtuples[tupIndex];
01584                 state->mergenext[srcTape] = newtup->tupindex;
01585                 if (state->mergenext[srcTape] == 0)
01586                     state->mergelast[srcTape] = 0;
01587                 tuplesort_heap_insert(state, newtup, srcTape, false);
01588                 /* put the now-unused memtuples entry on the freelist */
01589                 newtup->tupindex = state->mergefreelist;
01590                 state->mergefreelist = tupIndex;
01591                 state->mergeavailslots[srcTape]++;
01592                 return true;
01593             }
01594             return false;
01595 
01596         default:
01597             elog(ERROR, "invalid tuplesort state");
01598             return false;       /* keep compiler quiet */
01599     }
01600 }
01601 
01602 /*
01603  * Fetch the next tuple in either forward or back direction.
01604  * If successful, put tuple in slot and return TRUE; else, clear the slot
01605  * and return FALSE.
01606  */
01607 bool
01608 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
01609                        TupleTableSlot *slot)
01610 {
01611     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01612     SortTuple   stup;
01613     bool        should_free;
01614 
01615     if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
01616         stup.tuple = NULL;
01617 
01618     MemoryContextSwitchTo(oldcontext);
01619 
01620     if (stup.tuple)
01621     {
01622         ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
01623         return true;
01624     }
01625     else
01626     {
01627         ExecClearTuple(slot);
01628         return false;
01629     }
01630 }
01631 
01632 /*
01633  * Fetch the next tuple in either forward or back direction.
01634  * Returns NULL if no more tuples.  If *should_free is set, the
01635  * caller must pfree the returned tuple when done with it.
01636  */
01637 HeapTuple
01638 tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
01639 {
01640     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01641     SortTuple   stup;
01642 
01643     if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
01644         stup.tuple = NULL;
01645 
01646     MemoryContextSwitchTo(oldcontext);
01647 
01648     return stup.tuple;
01649 }
01650 
01651 /*
01652  * Fetch the next index tuple in either forward or back direction.
01653  * Returns NULL if no more tuples.  If *should_free is set, the
01654  * caller must pfree the returned tuple when done with it.
01655  */
01656 IndexTuple
01657 tuplesort_getindextuple(Tuplesortstate *state, bool forward,
01658                         bool *should_free)
01659 {
01660     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01661     SortTuple   stup;
01662 
01663     if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
01664         stup.tuple = NULL;
01665 
01666     MemoryContextSwitchTo(oldcontext);
01667 
01668     return (IndexTuple) stup.tuple;
01669 }
01670 
01671 /*
01672  * Fetch the next Datum in either forward or back direction.
01673  * Returns FALSE if no more datums.
01674  *
01675  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
01676  * and is now owned by the caller.
01677  */
01678 bool
01679 tuplesort_getdatum(Tuplesortstate *state, bool forward,
01680                    Datum *val, bool *isNull)
01681 {
01682     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
01683     SortTuple   stup;
01684     bool        should_free;
01685 
01686     if (!tuplesort_gettuple_common(state, forward, &stup, &should_free))
01687     {
01688         MemoryContextSwitchTo(oldcontext);
01689         return false;
01690     }
01691 
01692     if (stup.isnull1 || state->datumTypeByVal)
01693     {
01694         *val = stup.datum1;
01695         *isNull = stup.isnull1;
01696     }
01697     else
01698     {
01699         if (should_free)
01700             *val = stup.datum1;
01701         else
01702             *val = datumCopy(stup.datum1, false, state->datumTypeLen);
01703         *isNull = false;
01704     }
01705 
01706     MemoryContextSwitchTo(oldcontext);
01707 
01708     return true;
01709 }
01710 
01711 /*
01712  * tuplesort_merge_order - report merge order we'll use for given memory
01713  * (note: "merge order" just means the number of input tapes in the merge).
01714  *
01715  * This is exported for use by the planner.  allowedMem is in bytes.
01716  */
01717 int
01718 tuplesort_merge_order(long allowedMem)
01719 {
01720     int         mOrder;
01721 
01722     /*
01723      * We need one tape for each merge input, plus another one for the output,
01724      * and each of these tapes needs buffer space.  In addition we want
01725      * MERGE_BUFFER_SIZE workspace per input tape (but the output tape doesn't
01726      * count).
01727      *
01728      * Note: you might be thinking we need to account for the memtuples[]
01729      * array in this calculation, but we effectively treat that as part of the
01730      * MERGE_BUFFER_SIZE workspace.
01731      */
01732     mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
01733         (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
01734 
01735     /* Even in minimum memory, use at least a MINORDER merge */
01736     mOrder = Max(mOrder, MINORDER);
01737 
01738     return mOrder;
01739 }
01740 
01741 /*
01742  * inittapes - initialize for tape sorting.
01743  *
01744  * This is called only if we have found we don't have room to sort in memory.
01745  */
01746 static void
01747 inittapes(Tuplesortstate *state)
01748 {
01749     int         maxTapes,
01750                 ntuples,
01751                 j;
01752     long        tapeSpace;
01753 
01754     /* Compute number of tapes to use: merge order plus 1 */
01755     maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
01756 
01757     /*
01758      * We must have at least 2*maxTapes slots in the memtuples[] array, else
01759      * we'd not have room for merge heap plus preread.  It seems unlikely that
01760      * this case would ever occur, but be safe.
01761      */
01762     maxTapes = Min(maxTapes, state->memtupsize / 2);
01763 
01764     state->maxTapes = maxTapes;
01765     state->tapeRange = maxTapes - 1;
01766 
01767 #ifdef TRACE_SORT
01768     if (trace_sort)
01769         elog(LOG, "switching to external sort with %d tapes: %s",
01770              maxTapes, pg_rusage_show(&state->ru_start));
01771 #endif
01772 
01773     /*
01774      * Decrease availMem to reflect the space needed for tape buffers; but
01775      * don't decrease it to the point that we have no room for tuples. (That
01776      * case is only likely to occur if sorting pass-by-value Datums; in all
01777      * other scenarios the memtuples[] array is unlikely to occupy more than
01778      * half of allowedMem.  In the pass-by-value case it's not important to
01779      * account for tuple space, so we don't care if LACKMEM becomes
01780      * inaccurate.)
01781      */
01782     tapeSpace = maxTapes * TAPE_BUFFER_OVERHEAD;
01783     if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
01784         USEMEM(state, tapeSpace);
01785 
01786     /*
01787      * Make sure that the temp file(s) underlying the tape set are created in
01788      * suitable temp tablespaces.
01789      */
01790     PrepareTempTablespaces();
01791 
01792     /*
01793      * Create the tape set and allocate the per-tape data arrays.
01794      */
01795     state->tapeset = LogicalTapeSetCreate(maxTapes);
01796 
01797     state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
01798     state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
01799     state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
01800     state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
01801     state->mergeavailmem = (long *) palloc0(maxTapes * sizeof(long));
01802     state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
01803     state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
01804     state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
01805     state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
01806 
01807     /*
01808      * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
01809      * marked as belonging to run number zero.
01810      *
01811      * NOTE: we pass false for checkIndex since there's no point in comparing
01812      * indexes in this step, even though we do intend the indexes to be part
01813      * of the sort key...
01814      */
01815     ntuples = state->memtupcount;
01816     state->memtupcount = 0;     /* make the heap empty */
01817     for (j = 0; j < ntuples; j++)
01818     {
01819         /* Must copy source tuple to avoid possible overwrite */
01820         SortTuple   stup = state->memtuples[j];
01821 
01822         tuplesort_heap_insert(state, &stup, 0, false);
01823     }
01824     Assert(state->memtupcount == ntuples);
01825 
01826     state->currentRun = 0;
01827 
01828     /*
01829      * Initialize variables of Algorithm D (step D1).
01830      */
01831     for (j = 0; j < maxTapes; j++)
01832     {
01833         state->tp_fib[j] = 1;
01834         state->tp_runs[j] = 0;
01835         state->tp_dummy[j] = 1;
01836         state->tp_tapenum[j] = j;
01837     }
01838     state->tp_fib[state->tapeRange] = 0;
01839     state->tp_dummy[state->tapeRange] = 0;
01840 
01841     state->Level = 1;
01842     state->destTape = 0;
01843 
01844     state->status = TSS_BUILDRUNS;
01845 }
01846 
01847 /*
01848  * selectnewtape -- select new tape for new initial run.
01849  *
01850  * This is called after finishing a run when we know another run
01851  * must be started.  This implements steps D3, D4 of Algorithm D.
01852  */
01853 static void
01854 selectnewtape(Tuplesortstate *state)
01855 {
01856     int         j;
01857     int         a;
01858 
01859     /* Step D3: advance j (destTape) */
01860     if (state->tp_dummy[state->destTape] < state->tp_dummy[state->destTape + 1])
01861     {
01862         state->destTape++;
01863         return;
01864     }
01865     if (state->tp_dummy[state->destTape] != 0)
01866     {
01867         state->destTape = 0;
01868         return;
01869     }
01870 
01871     /* Step D4: increase level */
01872     state->Level++;
01873     a = state->tp_fib[0];
01874     for (j = 0; j < state->tapeRange; j++)
01875     {
01876         state->tp_dummy[j] = a + state->tp_fib[j + 1] - state->tp_fib[j];
01877         state->tp_fib[j] = a + state->tp_fib[j + 1];
01878     }
01879     state->destTape = 0;
01880 }
01881 
01882 /*
01883  * mergeruns -- merge all the completed initial runs.
01884  *
01885  * This implements steps D5, D6 of Algorithm D.  All input data has
01886  * already been written to initial runs on tape (see dumptuples).
01887  */
01888 static void
01889 mergeruns(Tuplesortstate *state)
01890 {
01891     int         tapenum,
01892                 svTape,
01893                 svRuns,
01894                 svDummy;
01895 
01896     Assert(state->status == TSS_BUILDRUNS);
01897     Assert(state->memtupcount == 0);
01898 
01899     /*
01900      * If we produced only one initial run (quite likely if the total data
01901      * volume is between 1X and 2X workMem), we can just use that tape as the
01902      * finished output, rather than doing a useless merge.  (This obvious
01903      * optimization is not in Knuth's algorithm.)
01904      */
01905     if (state->currentRun == 1)
01906     {
01907         state->result_tape = state->tp_tapenum[state->destTape];
01908         /* must freeze and rewind the finished output tape */
01909         LogicalTapeFreeze(state->tapeset, state->result_tape);
01910         state->status = TSS_SORTEDONTAPE;
01911         return;
01912     }
01913 
01914     /* End of step D2: rewind all output tapes to prepare for merging */
01915     for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
01916         LogicalTapeRewind(state->tapeset, tapenum, false);
01917 
01918     for (;;)
01919     {
01920         /*
01921          * At this point we know that tape[T] is empty.  If there's just one
01922          * (real or dummy) run left on each input tape, then only one merge
01923          * pass remains.  If we don't have to produce a materialized sorted
01924          * tape, we can stop at this point and do the final merge on-the-fly.
01925          */
01926         if (!state->randomAccess)
01927         {
01928             bool        allOneRun = true;
01929 
01930             Assert(state->tp_runs[state->tapeRange] == 0);
01931             for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
01932             {
01933                 if (state->tp_runs[tapenum] + state->tp_dummy[tapenum] != 1)
01934                 {
01935                     allOneRun = false;
01936                     break;
01937                 }
01938             }
01939             if (allOneRun)
01940             {
01941                 /* Tell logtape.c we won't be writing anymore */
01942                 LogicalTapeSetForgetFreeSpace(state->tapeset);
01943                 /* Initialize for the final merge pass */
01944                 beginmerge(state);
01945                 state->status = TSS_FINALMERGE;
01946                 return;
01947             }
01948         }
01949 
01950         /* Step D5: merge runs onto tape[T] until tape[P] is empty */
01951         while (state->tp_runs[state->tapeRange - 1] ||
01952                state->tp_dummy[state->tapeRange - 1])
01953         {
01954             bool        allDummy = true;
01955 
01956             for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
01957             {
01958                 if (state->tp_dummy[tapenum] == 0)
01959                 {
01960                     allDummy = false;
01961                     break;
01962                 }
01963             }
01964 
01965             if (allDummy)
01966             {
01967                 state->tp_dummy[state->tapeRange]++;
01968                 for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
01969                     state->tp_dummy[tapenum]--;
01970             }
01971             else
01972                 mergeonerun(state);
01973         }
01974 
01975         /* Step D6: decrease level */
01976         if (--state->Level == 0)
01977             break;
01978         /* rewind output tape T to use as new input */
01979         LogicalTapeRewind(state->tapeset, state->tp_tapenum[state->tapeRange],
01980                           false);
01981         /* rewind used-up input tape P, and prepare it for write pass */
01982         LogicalTapeRewind(state->tapeset, state->tp_tapenum[state->tapeRange - 1],
01983                           true);
01984         state->tp_runs[state->tapeRange - 1] = 0;
01985 
01986         /*
01987          * reassign tape units per step D6; note we no longer care about A[]
01988          */
01989         svTape = state->tp_tapenum[state->tapeRange];
01990         svDummy = state->tp_dummy[state->tapeRange];
01991         svRuns = state->tp_runs[state->tapeRange];
01992         for (tapenum = state->tapeRange; tapenum > 0; tapenum--)
01993         {
01994             state->tp_tapenum[tapenum] = state->tp_tapenum[tapenum - 1];
01995             state->tp_dummy[tapenum] = state->tp_dummy[tapenum - 1];
01996             state->tp_runs[tapenum] = state->tp_runs[tapenum - 1];
01997         }
01998         state->tp_tapenum[0] = svTape;
01999         state->tp_dummy[0] = svDummy;
02000         state->tp_runs[0] = svRuns;
02001     }
02002 
02003     /*
02004      * Done.  Knuth says that the result is on TAPE[1], but since we exited
02005      * the loop without performing the last iteration of step D6, we have not
02006      * rearranged the tape unit assignment, and therefore the result is on
02007      * TAPE[T].  We need to do it this way so that we can freeze the final
02008      * output tape while rewinding it.  The last iteration of step D6 would be
02009      * a waste of cycles anyway...
02010      */
02011     state->result_tape = state->tp_tapenum[state->tapeRange];
02012     LogicalTapeFreeze(state->tapeset, state->result_tape);
02013     state->status = TSS_SORTEDONTAPE;
02014 }
02015 
02016 /*
02017  * Merge one run from each input tape, except ones with dummy runs.
02018  *
02019  * This is the inner loop of Algorithm D step D5.  We know that the
02020  * output tape is TAPE[T].
02021  */
02022 static void
02023 mergeonerun(Tuplesortstate *state)
02024 {
02025     int         destTape = state->tp_tapenum[state->tapeRange];
02026     int         srcTape;
02027     int         tupIndex;
02028     SortTuple  *tup;
02029     long        priorAvail,
02030                 spaceFreed;
02031 
02032     /*
02033      * Start the merge by loading one tuple from each active source tape into
02034      * the heap.  We can also decrease the input run/dummy run counts.
02035      */
02036     beginmerge(state);
02037 
02038     /*
02039      * Execute merge by repeatedly extracting lowest tuple in heap, writing it
02040      * out, and replacing it with next tuple from same tape (if there is
02041      * another one).
02042      */
02043     while (state->memtupcount > 0)
02044     {
02045         /* write the tuple to destTape */
02046         priorAvail = state->availMem;
02047         srcTape = state->memtuples[0].tupindex;
02048         WRITETUP(state, destTape, &state->memtuples[0]);
02049         /* writetup adjusted total free space, now fix per-tape space */
02050         spaceFreed = state->availMem - priorAvail;
02051         state->mergeavailmem[srcTape] += spaceFreed;
02052         /* compact the heap */
02053         tuplesort_heap_siftup(state, false);
02054         if ((tupIndex = state->mergenext[srcTape]) == 0)
02055         {
02056             /* out of preloaded data on this tape, try to read more */
02057             mergepreread(state);
02058             /* if still no data, we've reached end of run on this tape */
02059             if ((tupIndex = state->mergenext[srcTape]) == 0)
02060                 continue;
02061         }
02062         /* pull next preread tuple from list, insert in heap */
02063         tup = &state->memtuples[tupIndex];
02064         state->mergenext[srcTape] = tup->tupindex;
02065         if (state->mergenext[srcTape] == 0)
02066             state->mergelast[srcTape] = 0;
02067         tuplesort_heap_insert(state, tup, srcTape, false);
02068         /* put the now-unused memtuples entry on the freelist */
02069         tup->tupindex = state->mergefreelist;
02070         state->mergefreelist = tupIndex;
02071         state->mergeavailslots[srcTape]++;
02072     }
02073 
02074     /*
02075      * When the heap empties, we're done.  Write an end-of-run marker on the
02076      * output tape, and increment its count of real runs.
02077      */
02078     markrunend(state, destTape);
02079     state->tp_runs[state->tapeRange]++;
02080 
02081 #ifdef TRACE_SORT
02082     if (trace_sort)
02083         elog(LOG, "finished %d-way merge step: %s", state->activeTapes,
02084              pg_rusage_show(&state->ru_start));
02085 #endif
02086 }
02087 
02088 /*
02089  * beginmerge - initialize for a merge pass
02090  *
02091  * We decrease the counts of real and dummy runs for each tape, and mark
02092  * which tapes contain active input runs in mergeactive[].  Then, load
02093  * as many tuples as we can from each active input tape, and finally
02094  * fill the merge heap with the first tuple from each active tape.
02095  */
02096 static void
02097 beginmerge(Tuplesortstate *state)
02098 {
02099     int         activeTapes;
02100     int         tapenum;
02101     int         srcTape;
02102     int         slotsPerTape;
02103     long        spacePerTape;
02104 
02105     /* Heap should be empty here */
02106     Assert(state->memtupcount == 0);
02107 
02108     /* Adjust run counts and mark the active tapes */
02109     memset(state->mergeactive, 0,
02110            state->maxTapes * sizeof(*state->mergeactive));
02111     activeTapes = 0;
02112     for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
02113     {
02114         if (state->tp_dummy[tapenum] > 0)
02115             state->tp_dummy[tapenum]--;
02116         else
02117         {
02118             Assert(state->tp_runs[tapenum] > 0);
02119             state->tp_runs[tapenum]--;
02120             srcTape = state->tp_tapenum[tapenum];
02121             state->mergeactive[srcTape] = true;
02122             activeTapes++;
02123         }
02124     }
02125     state->activeTapes = activeTapes;
02126 
02127     /* Clear merge-pass state variables */
02128     memset(state->mergenext, 0,
02129            state->maxTapes * sizeof(*state->mergenext));
02130     memset(state->mergelast, 0,
02131            state->maxTapes * sizeof(*state->mergelast));
02132     state->mergefreelist = 0;   /* nothing in the freelist */
02133     state->mergefirstfree = activeTapes;        /* 1st slot avail for preread */
02134 
02135     /*
02136      * Initialize space allocation to let each active input tape have an equal
02137      * share of preread space.
02138      */
02139     Assert(activeTapes > 0);
02140     slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
02141     Assert(slotsPerTape > 0);
02142     spacePerTape = state->availMem / activeTapes;
02143     for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
02144     {
02145         if (state->mergeactive[srcTape])
02146         {
02147             state->mergeavailslots[srcTape] = slotsPerTape;
02148             state->mergeavailmem[srcTape] = spacePerTape;
02149         }
02150     }
02151 
02152     /*
02153      * Preread as many tuples as possible (and at least one) from each active
02154      * tape
02155      */
02156     mergepreread(state);
02157 
02158     /* Load the merge heap with the first tuple from each input tape */
02159     for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
02160     {
02161         int         tupIndex = state->mergenext[srcTape];
02162         SortTuple  *tup;
02163 
02164         if (tupIndex)
02165         {
02166             tup = &state->memtuples[tupIndex];
02167             state->mergenext[srcTape] = tup->tupindex;
02168             if (state->mergenext[srcTape] == 0)
02169                 state->mergelast[srcTape] = 0;
02170             tuplesort_heap_insert(state, tup, srcTape, false);
02171             /* put the now-unused memtuples entry on the freelist */
02172             tup->tupindex = state->mergefreelist;
02173             state->mergefreelist = tupIndex;
02174             state->mergeavailslots[srcTape]++;
02175         }
02176     }
02177 }
02178 
02179 /*
02180  * mergepreread - load tuples from merge input tapes
02181  *
02182  * This routine exists to improve sequentiality of reads during a merge pass,
02183  * as explained in the header comments of this file.  Load tuples from each
02184  * active source tape until the tape's run is exhausted or it has used up
02185  * its fair share of available memory.  In any case, we guarantee that there
02186  * is at least one preread tuple available from each unexhausted input tape.
02187  *
02188  * We invoke this routine at the start of a merge pass for initial load,
02189  * and then whenever any tape's preread data runs out.  Note that we load
02190  * as much data as possible from all tapes, not just the one that ran out.
02191  * This is because logtape.c works best with a usage pattern that alternates
02192  * between reading a lot of data and writing a lot of data, so whenever we
02193  * are forced to read, we should fill working memory completely.
02194  *
02195  * In FINALMERGE state, we *don't* use this routine, but instead just preread
02196  * from the single tape that ran dry.  There's no read/write alternation in
02197  * that state and so no point in scanning through all the tapes to fix one.
02198  * (Moreover, there may be quite a lot of inactive tapes in that state, since
02199  * we might have had many fewer runs than tapes.  In a regular tape-to-tape
02200  * merge we can expect most of the tapes to be active.)
02201  */
02202 static void
02203 mergepreread(Tuplesortstate *state)
02204 {
02205     int         srcTape;
02206 
02207     for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
02208         mergeprereadone(state, srcTape);
02209 }
02210 
02211 /*
02212  * mergeprereadone - load tuples from one merge input tape
02213  *
02214  * Read tuples from the specified tape until it has used up its free memory
02215  * or array slots; but ensure that we have at least one tuple, if any are
02216  * to be had.
02217  */
02218 static void
02219 mergeprereadone(Tuplesortstate *state, int srcTape)
02220 {
02221     unsigned int tuplen;
02222     SortTuple   stup;
02223     int         tupIndex;
02224     long        priorAvail,
02225                 spaceUsed;
02226 
02227     if (!state->mergeactive[srcTape])
02228         return;                 /* tape's run is already exhausted */
02229     priorAvail = state->availMem;
02230     state->availMem = state->mergeavailmem[srcTape];
02231     while ((state->mergeavailslots[srcTape] > 0 && !LACKMEM(state)) ||
02232            state->mergenext[srcTape] == 0)
02233     {
02234         /* read next tuple, if any */
02235         if ((tuplen = getlen(state, srcTape, true)) == 0)
02236         {
02237             state->mergeactive[srcTape] = false;
02238             break;
02239         }
02240         READTUP(state, &stup, srcTape, tuplen);
02241         /* find a free slot in memtuples[] for it */
02242         tupIndex = state->mergefreelist;
02243         if (tupIndex)
02244             state->mergefreelist = state->memtuples[tupIndex].tupindex;
02245         else
02246         {
02247             tupIndex = state->mergefirstfree++;
02248             Assert(tupIndex < state->memtupsize);
02249         }
02250         state->mergeavailslots[srcTape]--;
02251         /* store tuple, append to list for its tape */
02252         stup.tupindex = 0;
02253         state->memtuples[tupIndex] = stup;
02254         if (state->mergelast[srcTape])
02255             state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
02256         else
02257             state->mergenext[srcTape] = tupIndex;
02258         state->mergelast[srcTape] = tupIndex;
02259     }
02260     /* update per-tape and global availmem counts */
02261     spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
02262     state->mergeavailmem[srcTape] = state->availMem;
02263     state->availMem = priorAvail - spaceUsed;
02264 }
02265 
02266 /*
02267  * dumptuples - remove tuples from heap and write to tape
02268  *
02269  * This is used during initial-run building, but not during merging.
02270  *
02271  * When alltuples = false, dump only enough tuples to get under the
02272  * availMem limit (and leave at least one tuple in the heap in any case,
02273  * since puttuple assumes it always has a tuple to compare to).  We also
02274  * insist there be at least one free slot in the memtuples[] array.
02275  *
02276  * When alltuples = true, dump everything currently in memory.
02277  * (This case is only used at end of input data.)
02278  *
02279  * If we empty the heap, close out the current run and return (this should
02280  * only happen at end of input data).  If we see that the tuple run number
02281  * at the top of the heap has changed, start a new run.
02282  */
02283 static void
02284 dumptuples(Tuplesortstate *state, bool alltuples)
02285 {
02286     while (alltuples ||
02287            (LACKMEM(state) && state->memtupcount > 1) ||
02288            state->memtupcount >= state->memtupsize)
02289     {
02290         /*
02291          * Dump the heap's frontmost entry, and sift up to remove it from the
02292          * heap.
02293          */
02294         Assert(state->memtupcount > 0);
02295         WRITETUP(state, state->tp_tapenum[state->destTape],
02296                  &state->memtuples[0]);
02297         tuplesort_heap_siftup(state, true);
02298 
02299         /*
02300          * If the heap is empty *or* top run number has changed, we've
02301          * finished the current run.
02302          */
02303         if (state->memtupcount == 0 ||
02304             state->currentRun != state->memtuples[0].tupindex)
02305         {
02306             markrunend(state, state->tp_tapenum[state->destTape]);
02307             state->currentRun++;
02308             state->tp_runs[state->destTape]++;
02309             state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
02310 
02311 #ifdef TRACE_SORT
02312             if (trace_sort)
02313                 elog(LOG, "finished writing%s run %d to tape %d: %s",
02314                      (state->memtupcount == 0) ? " final" : "",
02315                      state->currentRun, state->destTape,
02316                      pg_rusage_show(&state->ru_start));
02317 #endif
02318 
02319             /*
02320              * Done if heap is empty, else prepare for new run.
02321              */
02322             if (state->memtupcount == 0)
02323                 break;
02324             Assert(state->currentRun == state->memtuples[0].tupindex);
02325             selectnewtape(state);
02326         }
02327     }
02328 }
02329 
02330 /*
02331  * tuplesort_rescan     - rewind and replay the scan
02332  */
02333 void
02334 tuplesort_rescan(Tuplesortstate *state)
02335 {
02336     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
02337 
02338     Assert(state->randomAccess);
02339 
02340     switch (state->status)
02341     {
02342         case TSS_SORTEDINMEM:
02343             state->current = 0;
02344             state->eof_reached = false;
02345             state->markpos_offset = 0;
02346             state->markpos_eof = false;
02347             break;
02348         case TSS_SORTEDONTAPE:
02349             LogicalTapeRewind(state->tapeset,
02350                               state->result_tape,
02351                               false);
02352             state->eof_reached = false;
02353             state->markpos_block = 0L;
02354             state->markpos_offset = 0;
02355             state->markpos_eof = false;
02356             break;
02357         default:
02358             elog(ERROR, "invalid tuplesort state");
02359             break;
02360     }
02361 
02362     MemoryContextSwitchTo(oldcontext);
02363 }
02364 
02365 /*
02366  * tuplesort_markpos    - saves current position in the merged sort file
02367  */
02368 void
02369 tuplesort_markpos(Tuplesortstate *state)
02370 {
02371     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
02372 
02373     Assert(state->randomAccess);
02374 
02375     switch (state->status)
02376     {
02377         case TSS_SORTEDINMEM:
02378             state->markpos_offset = state->current;
02379             state->markpos_eof = state->eof_reached;
02380             break;
02381         case TSS_SORTEDONTAPE:
02382             LogicalTapeTell(state->tapeset,
02383                             state->result_tape,
02384                             &state->markpos_block,
02385                             &state->markpos_offset);
02386             state->markpos_eof = state->eof_reached;
02387             break;
02388         default:
02389             elog(ERROR, "invalid tuplesort state");
02390             break;
02391     }
02392 
02393     MemoryContextSwitchTo(oldcontext);
02394 }
02395 
02396 /*
02397  * tuplesort_restorepos - restores current position in merged sort file to
02398  *                        last saved position
02399  */
02400 void
02401 tuplesort_restorepos(Tuplesortstate *state)
02402 {
02403     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
02404 
02405     Assert(state->randomAccess);
02406 
02407     switch (state->status)
02408     {
02409         case TSS_SORTEDINMEM:
02410             state->current = state->markpos_offset;
02411             state->eof_reached = state->markpos_eof;
02412             break;
02413         case TSS_SORTEDONTAPE:
02414             if (!LogicalTapeSeek(state->tapeset,
02415                                  state->result_tape,
02416                                  state->markpos_block,
02417                                  state->markpos_offset))
02418                 elog(ERROR, "tuplesort_restorepos failed");
02419             state->eof_reached = state->markpos_eof;
02420             break;
02421         default:
02422             elog(ERROR, "invalid tuplesort state");
02423             break;
02424     }
02425 
02426     MemoryContextSwitchTo(oldcontext);
02427 }
02428 
02429 /*
02430  * tuplesort_get_stats - extract summary statistics
02431  *
02432  * This can be called after tuplesort_performsort() finishes to obtain
02433  * printable summary information about how the sort was performed.
02434  * spaceUsed is measured in kilobytes.
02435  */
02436 void
02437 tuplesort_get_stats(Tuplesortstate *state,
02438                     const char **sortMethod,
02439                     const char **spaceType,
02440                     long *spaceUsed)
02441 {
02442     /*
02443      * Note: it might seem we should provide both memory and disk usage for a
02444      * disk-based sort.  However, the current code doesn't track memory space
02445      * accurately once we have begun to return tuples to the caller (since we
02446      * don't account for pfree's the caller is expected to do), so we cannot
02447      * rely on availMem in a disk sort.  This does not seem worth the overhead
02448      * to fix.  Is it worth creating an API for the memory context code to
02449      * tell us how much is actually used in sortcontext?
02450      */
02451     if (state->tapeset)
02452     {
02453         *spaceType = "Disk";
02454         *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
02455     }
02456     else
02457     {
02458         *spaceType = "Memory";
02459         *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
02460     }
02461 
02462     switch (state->status)
02463     {
02464         case TSS_SORTEDINMEM:
02465             if (state->boundUsed)
02466                 *sortMethod = "top-N heapsort";
02467             else
02468                 *sortMethod = "quicksort";
02469             break;
02470         case TSS_SORTEDONTAPE:
02471             *sortMethod = "external sort";
02472             break;
02473         case TSS_FINALMERGE:
02474             *sortMethod = "external merge";
02475             break;
02476         default:
02477             *sortMethod = "still in progress";
02478             break;
02479     }
02480 }
02481 
02482 
02483 /*
02484  * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
02485  *
02486  * Compare two SortTuples.  If checkIndex is true, use the tuple index
02487  * as the front of the sort key; otherwise, no.
02488  */
02489 
02490 #define HEAPCOMPARE(tup1,tup2) \
02491     (checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
02492      ((tup1)->tupindex) - ((tup2)->tupindex) : \
02493      COMPARETUP(state, tup1, tup2))
02494 
02495 /*
02496  * Convert the existing unordered array of SortTuples to a bounded heap,
02497  * discarding all but the smallest "state->bound" tuples.
02498  *
02499  * When working with a bounded heap, we want to keep the largest entry
02500  * at the root (array entry zero), instead of the smallest as in the normal
02501  * sort case.  This allows us to discard the largest entry cheaply.
02502  * Therefore, we temporarily reverse the sort direction.
02503  *
02504  * We assume that all entries in a bounded heap will always have tupindex
02505  * zero; it therefore doesn't matter that HEAPCOMPARE() doesn't reverse
02506  * the direction of comparison for tupindexes.
02507  */
02508 static void
02509 make_bounded_heap(Tuplesortstate *state)
02510 {
02511     int         tupcount = state->memtupcount;
02512     int         i;
02513 
02514     Assert(state->status == TSS_INITIAL);
02515     Assert(state->bounded);
02516     Assert(tupcount >= state->bound);
02517 
02518     /* Reverse sort direction so largest entry will be at root */
02519     REVERSEDIRECTION(state);
02520 
02521     state->memtupcount = 0;     /* make the heap empty */
02522     for (i = 0; i < tupcount; i++)
02523     {
02524         if (state->memtupcount >= state->bound &&
02525           COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
02526         {
02527             /* New tuple would just get thrown out, so skip it */
02528             free_sort_tuple(state, &state->memtuples[i]);
02529             CHECK_FOR_INTERRUPTS();
02530         }
02531         else
02532         {
02533             /* Insert next tuple into heap */
02534             /* Must copy source tuple to avoid possible overwrite */
02535             SortTuple   stup = state->memtuples[i];
02536 
02537             tuplesort_heap_insert(state, &stup, 0, false);
02538 
02539             /* If heap too full, discard largest entry */
02540             if (state->memtupcount > state->bound)
02541             {
02542                 free_sort_tuple(state, &state->memtuples[0]);
02543                 tuplesort_heap_siftup(state, false);
02544             }
02545         }
02546     }
02547 
02548     Assert(state->memtupcount == state->bound);
02549     state->status = TSS_BOUNDED;
02550 }
02551 
02552 /*
02553  * Convert the bounded heap to a properly-sorted array
02554  */
02555 static void
02556 sort_bounded_heap(Tuplesortstate *state)
02557 {
02558     int         tupcount = state->memtupcount;
02559 
02560     Assert(state->status == TSS_BOUNDED);
02561     Assert(state->bounded);
02562     Assert(tupcount == state->bound);
02563 
02564     /*
02565      * We can unheapify in place because each sift-up will remove the largest
02566      * entry, which we can promptly store in the newly freed slot at the end.
02567      * Once we're down to a single-entry heap, we're done.
02568      */
02569     while (state->memtupcount > 1)
02570     {
02571         SortTuple   stup = state->memtuples[0];
02572 
02573         /* this sifts-up the next-largest entry and decreases memtupcount */
02574         tuplesort_heap_siftup(state, false);
02575         state->memtuples[state->memtupcount] = stup;
02576     }
02577     state->memtupcount = tupcount;
02578 
02579     /*
02580      * Reverse sort direction back to the original state.  This is not
02581      * actually necessary but seems like a good idea for tidiness.
02582      */
02583     REVERSEDIRECTION(state);
02584 
02585     state->status = TSS_SORTEDINMEM;
02586     state->boundUsed = true;
02587 }
02588 
02589 /*
02590  * Insert a new tuple into an empty or existing heap, maintaining the
02591  * heap invariant.  Caller is responsible for ensuring there's room.
02592  *
02593  * Note: we assume *tuple is a temporary variable that can be scribbled on.
02594  * For some callers, tuple actually points to a memtuples[] entry above the
02595  * end of the heap.  This is safe as long as it's not immediately adjacent
02596  * to the end of the heap (ie, in the [memtupcount] array entry) --- if it
02597  * is, it might get overwritten before being moved into the heap!
02598  */
02599 static void
02600 tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
02601                       int tupleindex, bool checkIndex)
02602 {
02603     SortTuple  *memtuples;
02604     int         j;
02605 
02606     /*
02607      * Save the tupleindex --- see notes above about writing on *tuple. It's a
02608      * historical artifact that tupleindex is passed as a separate argument
02609      * and not in *tuple, but it's notationally convenient so let's leave it
02610      * that way.
02611      */
02612     tuple->tupindex = tupleindex;
02613 
02614     memtuples = state->memtuples;
02615     Assert(state->memtupcount < state->memtupsize);
02616 
02617     CHECK_FOR_INTERRUPTS();
02618 
02619     /*
02620      * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
02621      * using 1-based array indexes, not 0-based.
02622      */
02623     j = state->memtupcount++;
02624     while (j > 0)
02625     {
02626         int         i = (j - 1) >> 1;
02627 
02628         if (HEAPCOMPARE(tuple, &memtuples[i]) >= 0)
02629             break;
02630         memtuples[j] = memtuples[i];
02631         j = i;
02632     }
02633     memtuples[j] = *tuple;
02634 }
02635 
02636 /*
02637  * The tuple at state->memtuples[0] has been removed from the heap.
02638  * Decrement memtupcount, and sift up to maintain the heap invariant.
02639  */
02640 static void
02641 tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
02642 {
02643     SortTuple  *memtuples = state->memtuples;
02644     SortTuple  *tuple;
02645     int         i,
02646                 n;
02647 
02648     if (--state->memtupcount <= 0)
02649         return;
02650 
02651     CHECK_FOR_INTERRUPTS();
02652 
02653     n = state->memtupcount;
02654     tuple = &memtuples[n];      /* tuple that must be reinserted */
02655     i = 0;                      /* i is where the "hole" is */
02656     for (;;)
02657     {
02658         int         j = 2 * i + 1;
02659 
02660         if (j >= n)
02661             break;
02662         if (j + 1 < n &&
02663             HEAPCOMPARE(&memtuples[j], &memtuples[j + 1]) > 0)
02664             j++;
02665         if (HEAPCOMPARE(tuple, &memtuples[j]) <= 0)
02666             break;
02667         memtuples[i] = memtuples[j];
02668         i = j;
02669     }
02670     memtuples[i] = *tuple;
02671 }
02672 
02673 
02674 /*
02675  * Tape interface routines
02676  */
02677 
02678 static unsigned int
02679 getlen(Tuplesortstate *state, int tapenum, bool eofOK)
02680 {
02681     unsigned int len;
02682 
02683     if (LogicalTapeRead(state->tapeset, tapenum,
02684                         &len, sizeof(len)) != sizeof(len))
02685         elog(ERROR, "unexpected end of tape");
02686     if (len == 0 && !eofOK)
02687         elog(ERROR, "unexpected end of data");
02688     return len;
02689 }
02690 
02691 static void
02692 markrunend(Tuplesortstate *state, int tapenum)
02693 {
02694     unsigned int len = 0;
02695 
02696     LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
02697 }
02698 
02699 
02700 /*
02701  * Inline-able copy of FunctionCall2Coll() to save some cycles in sorting.
02702  */
02703 static inline Datum
02704 myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
02705 {
02706     FunctionCallInfoData fcinfo;
02707     Datum       result;
02708 
02709     InitFunctionCallInfoData(fcinfo, flinfo, 2, collation, NULL, NULL);
02710 
02711     fcinfo.arg[0] = arg1;
02712     fcinfo.arg[1] = arg2;
02713     fcinfo.argnull[0] = false;
02714     fcinfo.argnull[1] = false;
02715 
02716     result = FunctionCallInvoke(&fcinfo);
02717 
02718     /* Check for null result, since caller is clearly not expecting one */
02719     if (fcinfo.isnull)
02720         elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid);
02721 
02722     return result;
02723 }
02724 
02725 /*
02726  * Apply a sort function (by now converted to fmgr lookup form)
02727  * and return a 3-way comparison result.  This takes care of handling
02728  * reverse-sort and NULLs-ordering properly.  We assume that DESC and
02729  * NULLS_FIRST options are encoded in sk_flags the same way btree does it.
02730  */
02731 static inline int32
02732 inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
02733                         Datum datum1, bool isNull1,
02734                         Datum datum2, bool isNull2)
02735 {
02736     int32       compare;
02737 
02738     if (isNull1)
02739     {
02740         if (isNull2)
02741             compare = 0;        /* NULL "=" NULL */
02742         else if (sk_flags & SK_BT_NULLS_FIRST)
02743             compare = -1;       /* NULL "<" NOT_NULL */
02744         else
02745             compare = 1;        /* NULL ">" NOT_NULL */
02746     }
02747     else if (isNull2)
02748     {
02749         if (sk_flags & SK_BT_NULLS_FIRST)
02750             compare = 1;        /* NOT_NULL ">" NULL */
02751         else
02752             compare = -1;       /* NOT_NULL "<" NULL */
02753     }
02754     else
02755     {
02756         compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
02757                                                     datum1, datum2));
02758 
02759         if (sk_flags & SK_BT_DESC)
02760             compare = -compare;
02761     }
02762 
02763     return compare;
02764 }
02765 
02766 
02767 /*
02768  * Routines specialized for HeapTuple (actually MinimalTuple) case
02769  */
02770 
02771 static int
02772 comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
02773 {
02774     SortSupport sortKey = state->sortKeys;
02775     HeapTupleData ltup;
02776     HeapTupleData rtup;
02777     TupleDesc   tupDesc;
02778     int         nkey;
02779     int32       compare;
02780 
02781     /* Compare the leading sort key */
02782     compare = ApplySortComparator(a->datum1, a->isnull1,
02783                                   b->datum1, b->isnull1,
02784                                   sortKey);
02785     if (compare != 0)
02786         return compare;
02787 
02788     /* Compare additional sort keys */
02789     ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
02790     ltup.t_data = (HeapTupleHeader) ((char *) a->tuple - MINIMAL_TUPLE_OFFSET);
02791     rtup.t_len = ((MinimalTuple) b->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
02792     rtup.t_data = (HeapTupleHeader) ((char *) b->tuple - MINIMAL_TUPLE_OFFSET);
02793     tupDesc = state->tupDesc;
02794     sortKey++;
02795     for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++)
02796     {
02797         AttrNumber  attno = sortKey->ssup_attno;
02798         Datum       datum1,
02799                     datum2;
02800         bool        isnull1,
02801                     isnull2;
02802 
02803         datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
02804         datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
02805 
02806         compare = ApplySortComparator(datum1, isnull1,
02807                                       datum2, isnull2,
02808                                       sortKey);
02809         if (compare != 0)
02810             return compare;
02811     }
02812 
02813     return 0;
02814 }
02815 
02816 static void
02817 copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
02818 {
02819     /*
02820      * We expect the passed "tup" to be a TupleTableSlot, and form a
02821      * MinimalTuple using the exported interface for that.
02822      */
02823     TupleTableSlot *slot = (TupleTableSlot *) tup;
02824     MinimalTuple tuple;
02825     HeapTupleData htup;
02826 
02827     /* copy the tuple into sort storage */
02828     tuple = ExecCopySlotMinimalTuple(slot);
02829     stup->tuple = (void *) tuple;
02830     USEMEM(state, GetMemoryChunkSpace(tuple));
02831     /* set up first-column key value */
02832     htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
02833     htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
02834     stup->datum1 = heap_getattr(&htup,
02835                                 state->sortKeys[0].ssup_attno,
02836                                 state->tupDesc,
02837                                 &stup->isnull1);
02838 }
02839 
02840 static void
02841 writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
02842 {
02843     MinimalTuple tuple = (MinimalTuple) stup->tuple;
02844 
02845     /* the part of the MinimalTuple we'll write: */
02846     char       *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
02847     unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;
02848 
02849     /* total on-disk footprint: */
02850     unsigned int tuplen = tupbodylen + sizeof(int);
02851 
02852     LogicalTapeWrite(state->tapeset, tapenum,
02853                      (void *) &tuplen, sizeof(tuplen));
02854     LogicalTapeWrite(state->tapeset, tapenum,
02855                      (void *) tupbody, tupbodylen);
02856     if (state->randomAccess)    /* need trailing length word? */
02857         LogicalTapeWrite(state->tapeset, tapenum,
02858                          (void *) &tuplen, sizeof(tuplen));
02859 
02860     FREEMEM(state, GetMemoryChunkSpace(tuple));
02861     heap_free_minimal_tuple(tuple);
02862 }
02863 
02864 static void
02865 readtup_heap(Tuplesortstate *state, SortTuple *stup,
02866              int tapenum, unsigned int len)
02867 {
02868     unsigned int tupbodylen = len - sizeof(int);
02869     unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
02870     MinimalTuple tuple = (MinimalTuple) palloc(tuplen);
02871     char       *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
02872     HeapTupleData htup;
02873 
02874     USEMEM(state, GetMemoryChunkSpace(tuple));
02875     /* read in the tuple proper */
02876     tuple->t_len = tuplen;
02877     LogicalTapeReadExact(state->tapeset, tapenum,
02878                          tupbody, tupbodylen);
02879     if (state->randomAccess)    /* need trailing length word? */
02880         LogicalTapeReadExact(state->tapeset, tapenum,
02881                              &tuplen, sizeof(tuplen));
02882     stup->tuple = (void *) tuple;
02883     /* set up first-column key value */
02884     htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
02885     htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
02886     stup->datum1 = heap_getattr(&htup,
02887                                 state->sortKeys[0].ssup_attno,
02888                                 state->tupDesc,
02889                                 &stup->isnull1);
02890 }
02891 
02892 static void
02893 reversedirection_heap(Tuplesortstate *state)
02894 {
02895     SortSupport sortKey = state->sortKeys;
02896     int         nkey;
02897 
02898     for (nkey = 0; nkey < state->nKeys; nkey++, sortKey++)
02899     {
02900         sortKey->ssup_reverse = !sortKey->ssup_reverse;
02901         sortKey->ssup_nulls_first = !sortKey->ssup_nulls_first;
02902     }
02903 }
02904 
02905 
02906 /*
02907  * Routines specialized for the CLUSTER case (HeapTuple data, with
02908  * comparisons per a btree index definition)
02909  */
02910 
02911 static int
02912 comparetup_cluster(const SortTuple *a, const SortTuple *b,
02913                    Tuplesortstate *state)
02914 {
02915     ScanKey     scanKey = state->indexScanKey;
02916     HeapTuple   ltup;
02917     HeapTuple   rtup;
02918     TupleDesc   tupDesc;
02919     int         nkey;
02920     int32       compare;
02921 
02922     /* Compare the leading sort key, if it's simple */
02923     if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
02924     {
02925         compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
02926                                           scanKey->sk_collation,
02927                                           a->datum1, a->isnull1,
02928                                           b->datum1, b->isnull1);
02929         if (compare != 0 || state->nKeys == 1)
02930             return compare;
02931         /* Compare additional columns the hard way */
02932         scanKey++;
02933         nkey = 1;
02934     }
02935     else
02936     {
02937         /* Must compare all keys the hard way */
02938         nkey = 0;
02939     }
02940 
02941     /* Compare additional sort keys */
02942     ltup = (HeapTuple) a->tuple;
02943     rtup = (HeapTuple) b->tuple;
02944 
02945     if (state->indexInfo->ii_Expressions == NULL)
02946     {
02947         /* If not expression index, just compare the proper heap attrs */
02948         tupDesc = state->tupDesc;
02949 
02950         for (; nkey < state->nKeys; nkey++, scanKey++)
02951         {
02952             AttrNumber  attno = state->indexInfo->ii_KeyAttrNumbers[nkey];
02953             Datum       datum1,
02954                         datum2;
02955             bool        isnull1,
02956                         isnull2;
02957 
02958             datum1 = heap_getattr(ltup, attno, tupDesc, &isnull1);
02959             datum2 = heap_getattr(rtup, attno, tupDesc, &isnull2);
02960 
02961             compare = inlineApplySortFunction(&scanKey->sk_func,
02962                                               scanKey->sk_flags,
02963                                               scanKey->sk_collation,
02964                                               datum1, isnull1,
02965                                               datum2, isnull2);
02966             if (compare != 0)
02967                 return compare;
02968         }
02969     }
02970     else
02971     {
02972         /*
02973          * In the expression index case, compute the whole index tuple and
02974          * then compare values.  It would perhaps be faster to compute only as
02975          * many columns as we need to compare, but that would require
02976          * duplicating all the logic in FormIndexDatum.
02977          */
02978         Datum       l_index_values[INDEX_MAX_KEYS];
02979         bool        l_index_isnull[INDEX_MAX_KEYS];
02980         Datum       r_index_values[INDEX_MAX_KEYS];
02981         bool        r_index_isnull[INDEX_MAX_KEYS];
02982         TupleTableSlot *ecxt_scantuple;
02983 
02984         /* Reset context each time to prevent memory leakage */
02985         ResetPerTupleExprContext(state->estate);
02986 
02987         ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
02988 
02989         ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
02990         FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
02991                        l_index_values, l_index_isnull);
02992 
02993         ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
02994         FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
02995                        r_index_values, r_index_isnull);
02996 
02997         for (; nkey < state->nKeys; nkey++, scanKey++)
02998         {
02999             compare = inlineApplySortFunction(&scanKey->sk_func,
03000                                               scanKey->sk_flags,
03001                                               scanKey->sk_collation,
03002                                               l_index_values[nkey],
03003                                               l_index_isnull[nkey],
03004                                               r_index_values[nkey],
03005                                               r_index_isnull[nkey]);
03006             if (compare != 0)
03007                 return compare;
03008         }
03009     }
03010 
03011     return 0;
03012 }
03013 
03014 static void
03015 copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
03016 {
03017     HeapTuple   tuple = (HeapTuple) tup;
03018 
03019     /* copy the tuple into sort storage */
03020     tuple = heap_copytuple(tuple);
03021     stup->tuple = (void *) tuple;
03022     USEMEM(state, GetMemoryChunkSpace(tuple));
03023     /* set up first-column key value, if it's a simple column */
03024     if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
03025         stup->datum1 = heap_getattr(tuple,
03026                                     state->indexInfo->ii_KeyAttrNumbers[0],
03027                                     state->tupDesc,
03028                                     &stup->isnull1);
03029 }
03030 
03031 static void
03032 writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
03033 {
03034     HeapTuple   tuple = (HeapTuple) stup->tuple;
03035     unsigned int tuplen = tuple->t_len + sizeof(ItemPointerData) + sizeof(int);
03036 
03037     /* We need to store t_self, but not other fields of HeapTupleData */
03038     LogicalTapeWrite(state->tapeset, tapenum,
03039                      &tuplen, sizeof(tuplen));
03040     LogicalTapeWrite(state->tapeset, tapenum,
03041                      &tuple->t_self, sizeof(ItemPointerData));
03042     LogicalTapeWrite(state->tapeset, tapenum,
03043                      tuple->t_data, tuple->t_len);
03044     if (state->randomAccess)    /* need trailing length word? */
03045         LogicalTapeWrite(state->tapeset, tapenum,
03046                          &tuplen, sizeof(tuplen));
03047 
03048     FREEMEM(state, GetMemoryChunkSpace(tuple));
03049     heap_freetuple(tuple);
03050 }
03051 
03052 static void
03053 readtup_cluster(Tuplesortstate *state, SortTuple *stup,
03054                 int tapenum, unsigned int tuplen)
03055 {
03056     unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
03057     HeapTuple   tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
03058 
03059     USEMEM(state, GetMemoryChunkSpace(tuple));
03060     /* Reconstruct the HeapTupleData header */
03061     tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
03062     tuple->t_len = t_len;
03063     LogicalTapeReadExact(state->tapeset, tapenum,
03064                          &tuple->t_self, sizeof(ItemPointerData));
03065     /* We don't currently bother to reconstruct t_tableOid */
03066     tuple->t_tableOid = InvalidOid;
03067     /* Read in the tuple body */
03068     LogicalTapeReadExact(state->tapeset, tapenum,
03069                          tuple->t_data, tuple->t_len);
03070     if (state->randomAccess)    /* need trailing length word? */
03071         LogicalTapeReadExact(state->tapeset, tapenum,
03072                              &tuplen, sizeof(tuplen));
03073     stup->tuple = (void *) tuple;
03074     /* set up first-column key value, if it's a simple column */
03075     if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
03076         stup->datum1 = heap_getattr(tuple,
03077                                     state->indexInfo->ii_KeyAttrNumbers[0],
03078                                     state->tupDesc,
03079                                     &stup->isnull1);
03080 }
03081 
03082 
03083 /*
03084  * Routines specialized for IndexTuple case
03085  *
03086  * The btree and hash cases require separate comparison functions, but the
03087  * IndexTuple representation is the same so the copy/write/read support
03088  * functions can be shared.
03089  */
03090 
03091 static int
03092 comparetup_index_btree(const SortTuple *a, const SortTuple *b,
03093                        Tuplesortstate *state)
03094 {
03095     /*
03096      * This is similar to _bt_tuplecompare(), but we have already done the
03097      * index_getattr calls for the first column, and we need to keep track of
03098      * whether any null fields are present.  Also see the special treatment
03099      * for equal keys at the end.
03100      */
03101     ScanKey     scanKey = state->indexScanKey;
03102     IndexTuple  tuple1;
03103     IndexTuple  tuple2;
03104     int         keysz;
03105     TupleDesc   tupDes;
03106     bool        equal_hasnull = false;
03107     int         nkey;
03108     int32       compare;
03109 
03110     /* Compare the leading sort key */
03111     compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
03112                                       scanKey->sk_collation,
03113                                       a->datum1, a->isnull1,
03114                                       b->datum1, b->isnull1);
03115     if (compare != 0)
03116         return compare;
03117 
03118     /* they are equal, so we only need to examine one null flag */
03119     if (a->isnull1)
03120         equal_hasnull = true;
03121 
03122     /* Compare additional sort keys */
03123     tuple1 = (IndexTuple) a->tuple;
03124     tuple2 = (IndexTuple) b->tuple;
03125     keysz = state->nKeys;
03126     tupDes = RelationGetDescr(state->indexRel);
03127     scanKey++;
03128     for (nkey = 2; nkey <= keysz; nkey++, scanKey++)
03129     {
03130         Datum       datum1,
03131                     datum2;
03132         bool        isnull1,
03133                     isnull2;
03134 
03135         datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
03136         datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
03137 
03138         compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
03139                                           scanKey->sk_collation,
03140                                           datum1, isnull1,
03141                                           datum2, isnull2);
03142         if (compare != 0)
03143             return compare;     /* done when we find unequal attributes */
03144 
03145         /* they are equal, so we only need to examine one null flag */
03146         if (isnull1)
03147             equal_hasnull = true;
03148     }
03149 
03150     /*
03151      * If btree has asked us to enforce uniqueness, complain if two equal
03152      * tuples are detected (unless there was at least one NULL field).
03153      *
03154      * It is sufficient to make the test here, because if two tuples are equal
03155      * they *must* get compared at some stage of the sort --- otherwise the
03156      * sort algorithm wouldn't have checked whether one must appear before the
03157      * other.
03158      */
03159     if (state->enforceUnique && !equal_hasnull)
03160     {
03161         Datum       values[INDEX_MAX_KEYS];
03162         bool        isnull[INDEX_MAX_KEYS];
03163 
03164         /*
03165          * Some rather brain-dead implementations of qsort (such as the one in
03166          * QNX 4) will sometimes call the comparison routine to compare a
03167          * value to itself, but we always use our own implementation, which
03168          * does not.
03169          */
03170         Assert(tuple1 != tuple2);
03171 
03172         index_deform_tuple(tuple1, tupDes, values, isnull);
03173         ereport(ERROR,
03174                 (errcode(ERRCODE_UNIQUE_VIOLATION),
03175                  errmsg("could not create unique index \"%s\"",
03176                         RelationGetRelationName(state->indexRel)),
03177                  errdetail("Key %s is duplicated.",
03178                            BuildIndexValueDescription(state->indexRel,
03179                                                       values, isnull)),
03180                  errtableconstraint(state->heapRel,
03181                                  RelationGetRelationName(state->indexRel))));
03182     }
03183 
03184     /*
03185      * If key values are equal, we sort on ItemPointer.  This does not affect
03186      * validity of the finished index, but it may be useful to have index
03187      * scans in physical order.
03188      */
03189     {
03190         BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
03191         BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
03192 
03193         if (blk1 != blk2)
03194             return (blk1 < blk2) ? -1 : 1;
03195     }
03196     {
03197         OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
03198         OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
03199 
03200         if (pos1 != pos2)
03201             return (pos1 < pos2) ? -1 : 1;
03202     }
03203 
03204     return 0;
03205 }
03206 
03207 static int
03208 comparetup_index_hash(const SortTuple *a, const SortTuple *b,
03209                       Tuplesortstate *state)
03210 {
03211     uint32      hash1;
03212     uint32      hash2;
03213     IndexTuple  tuple1;
03214     IndexTuple  tuple2;
03215 
03216     /*
03217      * Fetch hash keys and mask off bits we don't want to sort by. We know
03218      * that the first column of the index tuple is the hash key.
03219      */
03220     Assert(!a->isnull1);
03221     hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
03222     Assert(!b->isnull1);
03223     hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
03224 
03225     if (hash1 > hash2)
03226         return 1;
03227     else if (hash1 < hash2)
03228         return -1;
03229 
03230     /*
03231      * If hash values are equal, we sort on ItemPointer.  This does not affect
03232      * validity of the finished index, but it may be useful to have index
03233      * scans in physical order.
03234      */
03235     tuple1 = (IndexTuple) a->tuple;
03236     tuple2 = (IndexTuple) b->tuple;
03237 
03238     {
03239         BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
03240         BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
03241 
03242         if (blk1 != blk2)
03243             return (blk1 < blk2) ? -1 : 1;
03244     }
03245     {
03246         OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
03247         OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
03248 
03249         if (pos1 != pos2)
03250             return (pos1 < pos2) ? -1 : 1;
03251     }
03252 
03253     return 0;
03254 }
03255 
03256 static void
03257 copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
03258 {
03259     IndexTuple  tuple = (IndexTuple) tup;
03260     unsigned int tuplen = IndexTupleSize(tuple);
03261     IndexTuple  newtuple;
03262 
03263     /* copy the tuple into sort storage */
03264     newtuple = (IndexTuple) palloc(tuplen);
03265     memcpy(newtuple, tuple, tuplen);
03266     USEMEM(state, GetMemoryChunkSpace(newtuple));
03267     stup->tuple = (void *) newtuple;
03268     /* set up first-column key value */
03269     stup->datum1 = index_getattr(newtuple,
03270                                  1,
03271                                  RelationGetDescr(state->indexRel),
03272                                  &stup->isnull1);
03273 }
03274 
03275 static void
03276 writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
03277 {
03278     IndexTuple  tuple = (IndexTuple) stup->tuple;
03279     unsigned int tuplen;
03280 
03281     tuplen = IndexTupleSize(tuple) + sizeof(tuplen);
03282     LogicalTapeWrite(state->tapeset, tapenum,
03283                      (void *) &tuplen, sizeof(tuplen));
03284     LogicalTapeWrite(state->tapeset, tapenum,
03285                      (void *) tuple, IndexTupleSize(tuple));
03286     if (state->randomAccess)    /* need trailing length word? */
03287         LogicalTapeWrite(state->tapeset, tapenum,
03288                          (void *) &tuplen, sizeof(tuplen));
03289 
03290     FREEMEM(state, GetMemoryChunkSpace(tuple));
03291     pfree(tuple);
03292 }
03293 
03294 static void
03295 readtup_index(Tuplesortstate *state, SortTuple *stup,
03296               int tapenum, unsigned int len)
03297 {
03298     unsigned int tuplen = len - sizeof(unsigned int);
03299     IndexTuple  tuple = (IndexTuple) palloc(tuplen);
03300 
03301     USEMEM(state, GetMemoryChunkSpace(tuple));
03302     LogicalTapeReadExact(state->tapeset, tapenum,
03303                          tuple, tuplen);
03304     if (state->randomAccess)    /* need trailing length word? */
03305         LogicalTapeReadExact(state->tapeset, tapenum,
03306                              &tuplen, sizeof(tuplen));
03307     stup->tuple = (void *) tuple;
03308     /* set up first-column key value */
03309     stup->datum1 = index_getattr(tuple,
03310                                  1,
03311                                  RelationGetDescr(state->indexRel),
03312                                  &stup->isnull1);
03313 }
03314 
03315 static void
03316 reversedirection_index_btree(Tuplesortstate *state)
03317 {
03318     ScanKey     scanKey = state->indexScanKey;
03319     int         nkey;
03320 
03321     for (nkey = 0; nkey < state->nKeys; nkey++, scanKey++)
03322     {
03323         scanKey->sk_flags ^= (SK_BT_DESC | SK_BT_NULLS_FIRST);
03324     }
03325 }
03326 
03327 static void
03328 reversedirection_index_hash(Tuplesortstate *state)
03329 {
03330     /* We don't support reversing direction in a hash index sort */
03331     elog(ERROR, "reversedirection_index_hash is not implemented");
03332 }
03333 
03334 
03335 /*
03336  * Routines specialized for DatumTuple case
03337  */
03338 
03339 static int
03340 comparetup_datum(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
03341 {
03342     return ApplySortComparator(a->datum1, a->isnull1,
03343                                b->datum1, b->isnull1,
03344                                state->onlyKey);
03345 }
03346 
03347 static void
03348 copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup)
03349 {
03350     /* Not currently needed */
03351     elog(ERROR, "copytup_datum() should not be called");
03352 }
03353 
03354 static void
03355 writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
03356 {
03357     void       *waddr;
03358     unsigned int tuplen;
03359     unsigned int writtenlen;
03360 
03361     if (stup->isnull1)
03362     {
03363         waddr = NULL;
03364         tuplen = 0;
03365     }
03366     else if (state->datumTypeByVal)
03367     {
03368         waddr = &stup->datum1;
03369         tuplen = sizeof(Datum);
03370     }
03371     else
03372     {
03373         waddr = DatumGetPointer(stup->datum1);
03374         tuplen = datumGetSize(stup->datum1, false, state->datumTypeLen);
03375         Assert(tuplen != 0);
03376     }
03377 
03378     writtenlen = tuplen + sizeof(unsigned int);
03379 
03380     LogicalTapeWrite(state->tapeset, tapenum,
03381                      (void *) &writtenlen, sizeof(writtenlen));
03382     LogicalTapeWrite(state->tapeset, tapenum,
03383                      waddr, tuplen);
03384     if (state->randomAccess)    /* need trailing length word? */
03385         LogicalTapeWrite(state->tapeset, tapenum,
03386                          (void *) &writtenlen, sizeof(writtenlen));
03387 
03388     if (stup->tuple)
03389     {
03390         FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
03391         pfree(stup->tuple);
03392     }
03393 }
03394 
03395 static void
03396 readtup_datum(Tuplesortstate *state, SortTuple *stup,
03397               int tapenum, unsigned int len)
03398 {
03399     unsigned int tuplen = len - sizeof(unsigned int);
03400 
03401     if (tuplen == 0)
03402     {
03403         /* it's NULL */
03404         stup->datum1 = (Datum) 0;
03405         stup->isnull1 = true;
03406         stup->tuple = NULL;
03407     }
03408     else if (state->datumTypeByVal)
03409     {
03410         Assert(tuplen == sizeof(Datum));
03411         LogicalTapeReadExact(state->tapeset, tapenum,
03412                              &stup->datum1, tuplen);
03413         stup->isnull1 = false;
03414         stup->tuple = NULL;
03415     }
03416     else
03417     {
03418         void       *raddr = palloc(tuplen);
03419 
03420         LogicalTapeReadExact(state->tapeset, tapenum,
03421                              raddr, tuplen);
03422         stup->datum1 = PointerGetDatum(raddr);
03423         stup->isnull1 = false;
03424         stup->tuple = raddr;
03425         USEMEM(state, GetMemoryChunkSpace(raddr));
03426     }
03427 
03428     if (state->randomAccess)    /* need trailing length word? */
03429         LogicalTapeReadExact(state->tapeset, tapenum,
03430                              &tuplen, sizeof(tuplen));
03431 }
03432 
03433 static void
03434 reversedirection_datum(Tuplesortstate *state)
03435 {
03436     state->onlyKey->ssup_reverse = !state->onlyKey->ssup_reverse;
03437     state->onlyKey->ssup_nulls_first = !state->onlyKey->ssup_nulls_first;
03438 }
03439 
03440 /*
03441  * Convenience routine to free a tuple previously loaded into sort memory
03442  */
03443 static void
03444 free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
03445 {
03446     FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
03447     pfree(stup->tuple);
03448 }