
Commit 3592e0f

Have ExecFindPartition cache the last found partition
Here we add code which detects when ExecFindPartition() continually finds the same partition and add a caching layer to improve partition lookup performance for such cases.

Both RANGE and LIST partitioned tables traditionally require a binary search for the set of Datums that a partition needs to be found for. This binary search is commonly visible in profiles when bulk loading into a partitioned table. Here we aim to reduce the overhead of bulk-loading into partitioned tables for cases where many consecutive tuples belong to the same partition and make the performance of this operation closer to what it is with a traditional non-partitioned table.

When we find the same partition 16 times in a row, the next search will just check whether the current set of values belongs to the last found partition. For LIST partitioning we record the index into the PartitionBoundInfo's datum array. This allows us to check if the current Datum is the same as the Datum that was last looked up. This means that if any given LIST partition supports storing multiple different Datum values, then the caching only works when we find the same value as we did the last time. For RANGE partitioning we simply check if the given Datums are in the same range as the previously found partition.

We store the details of the cached partition in PartitionDesc (i.e. relcache) so that the cached values are maintained over multiple statements.

No caching is done for HASH partitions. The majority of the cost in HASH partition lookups is in the hashing function(s), which would also have to be executed if we were to try to do caching for HASH partitioned tables. Since most of the cost is already incurred, we just don't bother. We also don't do any caching for LIST partitions when we continually find the values being looked up belong to the DEFAULT partition. We've no corresponding index in the PartitionBoundInfo's datum array for this case. We also don't cache when we find the given values match a LIST partitioned table's NULL partition. This is so cheap that there's no point in doing any caching for this. We also don't cache for a RANGE partitioned table's DEFAULT partition.

There have been a number of different patches submitted to improve partition lookups. Hou, Zhijie submitted a patch to detect when the value belonging to the partition key column(s) were constant and added code to cache the partition in that case. Amit Langote then implemented an idea suggested by me to remember the last found partition and start to check if the current values work for that partition. The final patch here was written by me and was done by taking many of the ideas I liked from the patches in the thread and redesigning other aspects.

Discussion: https://postgr.es/m/OS0PR01MB571649B27E912EA6CC4EEF03942D9%40OS0PR01MB5716.jpnprd01.prod.outlook.com
Author: Amit Langote, Hou Zhijie, David Rowley
Reviewed-by: Amit Langote, Hou Zhijie
1 parent 83f1793 commit 3592e0f
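
The counting scheme described in the commit message can be illustrated outside PostgreSQL with a small, self-contained model. The sketch below is not the committed code: `ToyListBounds`, `toy_list_bsearch` and `toy_list_find_partition` are hypothetical names, the partition key is a plain `int` rather than a Datum compared through a support function, and there is no DEFAULT or NULL partition. It only aims to show the LIST case: the cache remembers a datum offset, so a hit requires the same value as last time, not merely the same partition.

```c
#include <assert.h>

/* Mirrors PARTITION_CACHED_FIND_THRESHOLD from the commit; must be above 0. */
#define TOY_CACHED_FIND_THRESHOLD 16

/* Hypothetical, simplified stand-in for the LIST parts of the bound info. */
typedef struct ToyListBounds
{
    const int  *datums;                 /* sorted list values */
    const int  *indexes;                /* partition index for each datum */
    int         ndatums;
    int         last_found_datum_index; /* -1 if none */
    int         last_found_count;       /* consecutive finds of that datum */
    long        nbsearches;             /* instrumentation for this sketch only */
} ToyListBounds;

/* Plain binary search; returns the datum offset, or -1 if value is absent. */
static int
toy_list_bsearch(const ToyListBounds *b, int value)
{
    int         lo = 0;
    int         hi = b->ndatums - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (b->datums[mid] == value)
            return mid;
        if (b->datums[mid] < value)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

/* Returns the partition index for value, or -1 if no partition accepts it. */
int
toy_list_find_partition(ToyListBounds *b, int value)
{
    int         offset;

    /* Cached path: compare only against the datum found last time. */
    if (b->last_found_count >= TOY_CACHED_FIND_THRESHOLD &&
        b->datums[b->last_found_datum_index] == value)
        return b->indexes[b->last_found_datum_index];

    /* Slow path: binary search over all datums. */
    b->nbsearches++;
    offset = toy_list_bsearch(b, value);
    if (offset < 0)
        return -1;              /* cache fields are left alone on a miss */

    /* Same datum as last time: bump the count. Otherwise start over at 1. */
    if (offset == b->last_found_datum_index)
        b->last_found_count++;
    else
    {
        b->last_found_count = 1;
        b->last_found_datum_index = offset;
    }
    return b->indexes[offset];
}
```

With datums `{10, 20, 30}` mapped to partitions `{0, 0, 1}`, sixteen lookups of `10` each perform a binary search, and the seventeenth is answered from the cached datum offset. A following lookup of `20` lands in the same partition but misses the cache, because the remembered datum is `10`; it falls back to the binary search and resets the count to 1.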

3 files changed

+204
-19
lines changed

src/backend/executor/execPartition.c

+173-19
@@ -1332,48 +1332,134 @@ FormPartitionKeyDatum(PartitionDispatch pd,
         elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a binary search for the given values to just checking if the
+ * values belong to the last found partition. This must be above 0.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD 16
+
 /*
  * get_partition_for_tuple
  *      Finds partition of relation which accepts the partition key specified
- *      in values and isnull
+ *      in values and isnull.
+ *
+ * Calling this function can be quite expensive when LIST and RANGE
+ * partitioned tables have many partitions. This is due to the binary search
+ * that's done to find the correct partition. Many of the use cases for LIST
+ * and RANGE partitioned tables make it likely that the same partition is
+ * found in subsequent ExecFindPartition() calls. This is especially true for
+ * cases such as RANGE partitioned tables on a TIMESTAMP column where the
+ * partition key is the current time. When asked to find a partition for a
+ * RANGE or LIST partitioned table, we record the partition index and datum
+ * offset we've found for the given 'values' in the PartitionDesc (which is
+ * stored in relcache), and if we keep finding the same partition
+ * PARTITION_CACHED_FIND_THRESHOLD times in a row, then we'll enable caching
+ * logic and instead of performing a binary search to find the correct
+ * partition, we'll just double-check that 'values' still belong to the last
+ * found partition, and if so, we'll return that partition index, thus
+ * skipping the need for the binary search. If we fail to match the last
+ * partition when double checking, then we fall back on doing a binary search.
+ * In this case, unless we find 'values' belong to the DEFAULT partition,
+ * we'll reset the number of times we've hit the same partition so that we
+ * don't attempt to use the cache again until we've found that partition at
+ * least PARTITION_CACHED_FIND_THRESHOLD times in a row.
+ *
+ * For cases where the partition changes on each lookup, the amount of
+ * additional work required just amounts to recording the last found partition
+ * and bound offset then resetting the found counter. This is cheap and does
+ * not appear to cause any meaningful slowdowns for such cases.
+ *
+ * No caching of partitions is done when the last found partition is the
+ * DEFAULT or NULL partition. For the case of the DEFAULT partition, there
+ * is no bound offset storing the matching datum, so we cannot confirm the
+ * indexes match. For the NULL partition, this is just so cheap, there's no
+ * sense in caching.
  *
  * Return value is index of the partition (>= 0 and < partdesc->nparts) if one
  * found or -1 if none found.
  */
 static int
 get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 {
-    int         bound_offset;
+    int         bound_offset = -1;
     int         part_index = -1;
     PartitionKey key = pd->key;
     PartitionDesc partdesc = pd->partdesc;
     PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+    /*
+     * In the switch statement below, when we perform a cached lookup for
+     * RANGE and LIST partitioned tables, if we find that the last found
+     * partition matches the 'values', we return the partition index right
+     * away. We do this instead of breaking out of the switch as we don't
+     * want to execute the code about the DEFAULT partition or do any updates
+     * for any of the cache-related fields. That would be a waste of effort
+     * as we already know it's not the DEFAULT partition and have no need to
+     * increment the number of times we found the same partition any higher
+     * than PARTITION_CACHED_FIND_THRESHOLD.
+     */
+
     /* Route as appropriate based on partitioning strategy. */
     switch (key->strategy)
     {
         case PARTITION_STRATEGY_HASH:
             {
                 uint64      rowHash;
 
+                /* hash partitioning is too cheap to bother caching */
                 rowHash = compute_partition_hash_value(key->partnatts,
                                                        key->partsupfunc,
                                                        key->partcollation,
                                                        values, isnull);
 
-                part_index = boundinfo->indexes[rowHash % boundinfo->nindexes];
+                /*
+                 * HASH partitions can't have a DEFAULT partition and we don't
+                 * do any caching work for them, so just return the part index
+                 */
+                return boundinfo->indexes[rowHash % boundinfo->nindexes];
             }
-            break;
 
         case PARTITION_STRATEGY_LIST:
             if (isnull[0])
             {
+                /* this is far too cheap to bother doing any caching */
                 if (partition_bound_accepts_nulls(boundinfo))
-                    part_index = boundinfo->null_index;
+                {
+                    /*
+                     * When there is a NULL partition we just return that
+                     * directly. We don't have a bound_offset so it's not
+                     * valid to drop into the code after the switch which
+                     * checks and updates the cache fields. We perhaps should
+                     * be invalidating the details of the last cached
+                     * partition but there's no real need to. Keeping those
+                     * fields set gives a chance at matching to the cached
+                     * partition on the next lookup.
+                     */
+                    return boundinfo->null_index;
+                }
             }
             else
             {
-                bool        equal = false;
+                bool        equal;
+
+                if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+                {
+                    int         last_datum_offset = partdesc->last_found_datum_index;
+                    Datum       lastDatum = boundinfo->datums[last_datum_offset][0];
+                    int32       cmpval;
+
+                    /* does the last found datum index match this datum? */
+                    cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+                                                             key->partcollation[0],
+                                                             lastDatum,
+                                                             values[0]));
+
+                    if (cmpval == 0)
+                        return boundinfo->indexes[last_datum_offset];
+
+                    /* fall-through and do a manual lookup */
+                }
 
                 bound_offset = partition_list_bsearch(key->partsupfunc,
                                                       key->partcollation,
@@ -1403,23 +1489,64 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
                     }
                 }
 
-                if (!range_partkey_has_null)
+                /* NULLs belong in the DEFAULT partition */
+                if (range_partkey_has_null)
+                    break;
+
+                if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
                 {
-                    bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-                                                                 key->partcollation,
-                                                                 boundinfo,
-                                                                 key->partnatts,
-                                                                 values,
-                                                                 &equal);
+                    int         last_datum_offset = partdesc->last_found_datum_index;
+                    Datum      *lastDatums = boundinfo->datums[last_datum_offset];
+                    PartitionRangeDatumKind *kind = boundinfo->kind[last_datum_offset];
+                    int32       cmpval;
+
+                    /* check if the value is >= to the lower bound */
+                    cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+                                                        key->partcollation,
+                                                        lastDatums,
+                                                        kind,
+                                                        values,
+                                                        key->partnatts);
 
                     /*
-                     * The bound at bound_offset is less than or equal to the
-                     * tuple value, so the bound at offset+1 is the upper
-                     * bound of the partition we're looking for, if there
-                     * actually exists one.
+                     * If it's equal to the lower bound then no need to check
+                     * the upper bound.
                      */
-                    part_index = boundinfo->indexes[bound_offset + 1];
+                    if (cmpval == 0)
+                        return boundinfo->indexes[last_datum_offset + 1];
+
+                    if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)
+                    {
+                        /* check if the value is below the upper bound */
+                        lastDatums = boundinfo->datums[last_datum_offset + 1];
+                        kind = boundinfo->kind[last_datum_offset + 1];
+                        cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+                                                            key->partcollation,
+                                                            lastDatums,
+                                                            kind,
+                                                            values,
+                                                            key->partnatts);
+
+                        if (cmpval > 0)
+                            return boundinfo->indexes[last_datum_offset + 1];
+                    }
+                    /* fall-through and do a manual lookup */
                 }
+
+                bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+                                                             key->partcollation,
+                                                             boundinfo,
+                                                             key->partnatts,
+                                                             values,
+                                                             &equal);
+
+                /*
+                 * The bound at bound_offset is less than or equal to the
+                 * tuple value, so the bound at offset+1 is the upper bound of
+                 * the partition we're looking for, if there actually exists
+                 * one.
+                 */
+                part_index = boundinfo->indexes[bound_offset + 1];
             }
             break;
 
@@ -1433,7 +1560,34 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
      * the default partition, if there is one.
      */
     if (part_index < 0)
-        part_index = boundinfo->default_index;
+    {
+        /*
+         * No need to reset the cache fields here. The next set of values
+         * might end up belonging to the cached partition, so leaving the
+         * cache alone improves the chances of a cache hit on the next lookup.
+         */
+        return boundinfo->default_index;
+    }
+
+    /* we should only make it here when the code above set bound_offset */
+    Assert(bound_offset >= 0);
+
+    /*
+     * Attend to the cache fields. If the bound_offset matches the last
+     * cached bound offset then we've found the same partition as last time,
+     * so bump the count by one. If all goes well, we'll eventually reach
+     * PARTITION_CACHED_FIND_THRESHOLD and try the cache path next time
+     * around. Otherwise, we'll reset the cache count back to 1 to mark that
+     * we've found this partition for the first time.
+     */
+    if (bound_offset == partdesc->last_found_datum_index)
+        partdesc->last_found_count++;
+    else
+    {
+        partdesc->last_found_count = 1;
+        partdesc->last_found_part_index = part_index;
+        partdesc->last_found_datum_index = bound_offset;
+    }
 
     return part_index;
 }
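
The cached RANGE check in the diff above boils down to "at or above the remembered lower bound, and strictly below the next bound". The following is a minimal sketch of that test under simplifying assumptions: `toy_range_cache_hit` is a hypothetical name, the key is a single `int`, `bounds[]` is a sorted array in which each entry is the inclusive lower bound of one range and the exclusive upper bound of the previous one, and plain integer comparison stands in for `partition_rbound_datum_cmp()`.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if 'value' still falls into the range that starts at
 * bounds[last_offset]. A false result means the caller should fall back
 * to a full binary search, as the real code does.
 */
bool
toy_range_cache_hit(const int *bounds, int nbounds, int last_offset, int value)
{
    /* No usable cached offset, or value is below the lower bound. */
    if (last_offset < 0 || last_offset >= nbounds ||
        value < bounds[last_offset])
        return false;

    /* Equal to the lower bound: no need to check the upper bound. */
    if (value == bounds[last_offset])
        return true;

    /* Above the lower bound: must be strictly below the next bound, if any. */
    return last_offset + 1 < nbounds && value < bounds[last_offset + 1];
}
```

With bounds `{0, 100, 200}` and a cached offset of 0, the values 0 and 99 are hits, while 100 and -1 are misses that would go back to the binary search. When there is no next bound to compare against, the sketch reports a miss, matching the `last_datum_offset + 1 < boundinfo->ndatums` guard in the diff.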

src/backend/partitioning/partdesc.c

+6
@@ -290,6 +290,12 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
     {
         oldcxt = MemoryContextSwitchTo(new_pdcxt);
         partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+
+        /* Initialize caching fields for speeding up ExecFindPartition */
+        partdesc->last_found_datum_index = -1;
+        partdesc->last_found_part_index = -1;
+        partdesc->last_found_count = 0;
+
         partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
         partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
src/include/partitioning/partdesc.h

+25
@@ -36,6 +36,31 @@ typedef struct PartitionDescData
                                  * the corresponding 'oids' element belongs to
                                  * a leaf partition or not */
     PartitionBoundInfo boundinfo;   /* collection of partition bounds */
+
+    /* Caching fields to cache lookups in get_partition_for_tuple() */
+
+    /*
+     * Index into the PartitionBoundInfo's datum array for the last found
+     * partition or -1 if none.
+     */
+    int         last_found_datum_index;
+
+    /*
+     * Partition index of the last found partition or -1 if none has been
+     * found yet.
+     */
+    int         last_found_part_index;
+
+    /*
+     * For LIST partitioning, this is the number of times in a row that the
+     * datum we're looking for a partition for matches the datum in the
+     * last_found_datum_index index of the boundinfo->datums array. For RANGE
+     * partitioning, this is the number of times in a row we've found that the
+     * datum we're looking for a partition for falls into the range of the
+     * partition corresponding to the last_found_datum_index index of the
+     * boundinfo->datums array.
+     */
+    int         last_found_count;
 } PartitionDescData;
 
 