Conversation

@RekGRpth RekGRpth commented Dec 3, 2025

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Communicate in the mailing list if needed
  • Pass make installcheck
  • Review a PR in return to support the community

whitehawk and others added 30 commits September 22, 2025 22:53
Problem description:
A query with an 'execute on initplan' function having a parameter that refers
to another relation in the query caused a SIGSEGV during execution.

Root cause:
Since the function was executed in the initplan, the value of the parameter
was not yet defined at execution time, as the initplan is processed before the
main plan.

Fix:
When appending an initplan node for the function scan, set the subplan's
'parent_root' to NULL, isolating it from the outer query's plan params.
Such params no longer get into the valid parameter list in
SS_finalize_plan(), and the planner now emits an error for such a query.
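A hedged reproduction sketch of the rejected pattern (the table and function names are illustrative, not taken from the actual test):

```
-- Illustrative only: an EXECUTE ON INITPLAN function whose argument references
-- another relation of the same query.
CREATE TABLE t1 (a int) DISTRIBUTED BY (a);

CREATE FUNCTION f(x int) RETURNS SETOF int AS $$
    SELECT x;
$$ LANGUAGE sql EXECUTE ON INITPLAN;

-- The parameter t1.a is not yet available when the initplan runs, so this query
-- previously crashed with SIGSEGV; with the fix the planner reports an error.
SELECT * FROM t1, f(t1.a);
```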
The project's Code of Conduct was updated in accordance with the recent changes
approved by the Greengage DB architectural committee.
Reader gangs use a local snapshot to access the catalog; as a result, it is not
synchronized with the sharedSnapshot of the writer gang, which leads to
inconsistent visibility of catalog tables on idle reader gangs.
Consider the case:

select * from t, t t1; -- create a reader gang.
begin;
create role r1;
set role r1;  -- the set command is also dispatched to the idle reader gang

When the set role command is dispatched to the idle reader gang, the reader
gang cannot see the new tuple for role r1 in the catalog table pg_authid.
To fix this issue, we should drop the idle reader gangs after each utility
statement that may modify the catalog.

Reviewed-by: Zhenghua Lyu <[email protected]>
(cherry picked from commit d1ba4da)
Previously, in some cases when a hash join had to be rescanned and it was
executed in multiple batches that were spilled to disk, several issues occurred
when the rescan happened in the middle of a batch:
1. Outer batch files (which contain outer tuples for all the following batches)
were never cleared, and after the rescan they were filled again. When
gp_workfile_compression is enabled, this left a BufFile in an unexpected state
(reading instead of writing), causing an error.
2. When an inner batch file is read into the in-memory hash table, it is
deleted. After processing the batch, the hash table is spilled again into the
batch file in case a rescan is possible. However, if the rescan happened in the
middle of a batch, the hash table was never spilled and its contents were lost,
resulting in missing data in the output.

To fix these issues, when a rescan is executed without rebuilding the whole hash
table, clear all the outer batch files and also spill the current batch to an
inner batch file.
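A hedged sketch of the kind of query that can exercise this path (the tables, their sizes, and the memory setting are illustrative):

```
-- Illustrative only: a small memory limit makes the hash join spill into
-- multiple batches, and the EXISTS subplan may be rescanned per outer row.
SET statement_mem = '2MB';
SET gp_workfile_compression = on;

SELECT *
FROM small_outer o
WHERE EXISTS (
    SELECT 1
    FROM big_a a JOIN big_b b ON a.k = b.k
    WHERE b.k = o.k);
```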

Ticket: ADBDEV-8371
In some cases, processing of an injected panic fault can take a long time,
so queries sent after the fault may either complete successfully or end with
an error.
To remove this uncertainty, a waiting mechanism was introduced in the test:
before sending the next commands after injecting the panic fault, we wait for
the fault to be fully processed, that is, for SIGCHLD to be triggered.
When a coordinator backend encounters an OOM error during a distributed query,
it triggers a cleanup sequence that can execute twice or more times, causing a
use-after-free crash.

The sequence of events is as follows: during normal execution cleanup in
`standard_ExecutorEnd()`, the system calls `cdbdisp_destroyDispatcherState()` to
clean up the dispatcher state. This function frees the results array, including
the `segdbDesc` pointers that reference executor backend connections.

However, an OOM error may occur inside the already running cleanup, because the
cleanup itself still tries to allocate memory; `AbortTransaction()` is then
entered, and `cdbcomponent_recycleIdleQE()` is called later. This abort calls
`cdbdisp_cleanupDispatcherHandle()` to perform the cleanup operations again. The
second cleanup attempt, in `checkDispatchResult()`, tries to read data from
executor backends to cancel the query using the `segdbDesc` pointers that were
already freed during the first cleanup, resulting in a segmentation fault when
dereferencing invalid memory.

Additionally, the `numActiveQEs` counter can become negative during reentrant
calls to `cdbcomponent_recycleIdleQE()`, causing assertion failures.

When a backend encounters an OOM error, it throws an error and aborts the
transaction. If the backend happens to be a coordinator, it would also first
attempt to cancel the distributed query by reading results from executor
backends. This is futile, since the gang is discarded anyway when
`ERRCODE_GP_MEMPROT_KILL` is encountered, which is a sign of OOM.

Fixes:

- Move `numActiveQEs` decrement after a `goto`.

- Prevent dispatcher handle cleanup from cancelling the dispatch in case of
  OOM. It is instead done by `DisconnectAndDestroyAllGangs()`, based on the
  new global variable.

- Avoid `ABORT` dispatch for distributed transactions when the new flag is
  active. Without a response from the coordinator, the segments will eventually
  trigger `ABORT` themselves.

- Whether the gang is destroyed depends on the error code. Specify the error
  code for runaway cleaner cancellations, so they are treated equivalently to
  OOM conditions.

- Unit tests are modified to avoid mocking `in_oom_error_trouble()` for
  every `palloc()` call.

- The test makes sure there aren't any `palloc()` calls at all if the server
  is currently handling an OOM condition.

Ticket: ADBDEV-7916

Co-authored-by: Vladimir Sarmin <[email protected]>
Fix Orca cost model to prefer hashing smaller tables

Previously, Orca could produce bad hash join plans that hashed a much bigger
table. This happened because Orca's cost model associates a cost with the
columns used in the join conditions, and this cost was smaller when tuples were
hashed than when they were fed from the outer child. This doesn't really make
sense, since it could make Orca hash the bigger table if there are enough join
conditions, no matter how much bigger that table is.

To make sure this never happens, increase the cost per join column for the
inner child so that it is bigger than for the outer child (mirroring the
per-byte cost that is already present).

Additionally, Orca increased the cost per join column for the outer child when
spilling was predicted, which doesn't make sense either, since there is no
additional hashing when spilling is enabled. The Postgres planner only imposes
an additional per-byte (or rather per-page) cost when a hash join spills, so
Orca should have the same per-join-column cost for both the spilling and
non-spilling cases.

A lot of tests are affected by this change, but for most of them only the costs
change. For some, hash joins are reordered, swapping the inner and outer
children, since Orca previously hashed the bigger child in some cases. In the
case of LOJNullRejectingZeroPlacePredicates.mdp this actually restored the old
plan specified in the comment. A new regress test is also added.
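A hedged illustration of the kind of query the new regress test targets (the table names and join columns below are made up):

```
-- Illustrative only: with many equality join conditions the old cost model
-- could pick the much bigger relation as the hashed (inner) child; the fixed
-- model hashes the smaller relation regardless of the number of join columns.
EXPLAIN SELECT *
FROM big_fact f
JOIN small_dim d
  ON f.k1 = d.k1 AND f.k2 = d.k2 AND f.k3 = d.k3;
```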

One common change in some tests is that a Hash Semi Join is replaced with a
regular Hash Join + Sort + GroupAggregate. There is only a Left Semi Join, so
swapping the inner and outer children is impossible for semi joins. This means
it is now slightly cheaper to convert the Hash Semi Join into a regular Hash
Join so that the children can be swapped. The opposite conversion also takes
place where a GroupAggregate was used previously.

Another common change is that HashJoin(table1, Broadcast(table2)) gets replaced
with HashJoin(Redistribute(table1), Redistribute(table2)), adding another
slice. This happens because the cost of hashing is now slightly bigger, so Orca
prefers to split hashing table2 across all segments instead of every segment
hashing all rows, as it would with Broadcast.

Below are some notable changes in minidump files:

- ExtractPredicateFromDisjWithComputedColumns.mdp

This patch changed the join order from ((cust, sales), datedim) to ((datedim,
sales), cust). All three tables are identical from Orca's point of view: they
are all empty and all table scans are 24 bytes wide, so there is no reason for
Orca to prefer one join order over the other since they all have the same cost.

- HAWQ-TPCH-Stat-Derivation.mdp

The only change in the plan is that the children of the 3rd Hash Join are
swapped, the one joining lineitem_ao_column_none_level0 with
HashJoin(partsupp_ao_column_none_level0, part_ao_column_none_level0).
lineitem_ao_column_none_level0 is predicted to have approximately 22 billion
rows and the hash join approximately 10 billion rows, so making the hash join
the inner child is good in this case, since the smaller relation is hashed.

- Nested-Setops-2.mdp

Same here. Two swaps were performed between dept and emp in two different
places. dept contains 1 row and emp contains 10001, so it's better if dept is
hashed. A Redistribute Motion was also replaced with Broadcast Motion in both
cases.

- TPCH-Q5.mdp

Probably the best improvement out of these plans. The previous plan had this
join order:

```
-> Hash Join (6,000,000 rows)
   -> Hash Join (300,000,000 rows)
      -> lineitem (1,500,000,000 rows)
      -> Hash Join (500,000 rows)
         -> supplier (2,500,000 rows)
         -> Hash Join (5 rows)
            -> nation (25 rows)
            -> region (1 row)
   -> Hash Join (100,000,000 rows)
      -> customer (40,000,000 rows)
      -> orders (100,000,000 rows)
```

which contains hashing 100 million rows twice (first orders, then its hash join
with customer). The new plan has no such issues:

```
-> Hash Join (6,000,000 rows)
   -> Hash Join (170,000,000 rows)
      -> lineitem (1,500,000,000 rows)
      -> Hash Join (20,000,000 rows)
         -> orders (100,000,000 rows)
         -> Hash Join (7,000,000 rows)
            -> customer (40,000,000 rows)
            -> Hash Join (5 rows)
               -> nation (25 rows)
               -> region (1 row)
   -> supplier (2,500,000 rows)
```

This plan only hashes around 30 million rows in total, much better than 200
million.

Ticket: ADBDEV-8413
When appending statistics to a group, it is first checked whether statistics
already exist in the group. If they exist, the new statistics are appended to
the existing ones using the AppendStats method. If there are no statistics in
the group, the existence of statistics in the duplicate group is also checked.
When appending statistics to existing ones, we take the existing statistics of
the group (or of its duplicate), create a copy, add the statistics to this copy,
and release the old ones. If the group does not have its own statistics and the
duplicate's statistics are used, then we would add statistics to the duplicate
group and try to release the statistics of the current group, which is NULL,
leading to a segmentation fault.

Fix this by calling the AppendStats method on the duplicate.
During the exploration phase, new groups of equivalent expressions are created.
In this process, some groups are marked as duplicates. After exploration,
expressions from duplicate groups are moved into the group they duplicate.

In cases where a duplicate group contains an expression that references the
duplicated group, merging them results in a situation where a group contains an
expression that references the very group it belongs to. This leads to infinite
recursion during statistics derivation.

The fix is to improve the cycle-detection logic so that it can recognize when an
expression references the group it resides in.
- New CI Job for auto build deb-package
- New targets for gpAux Makefile: changelog, changelog-deb, pkg, pkg-deb
- New gpAux/debian folder with package description/rules for `debuild` utility
- Copy the `VERSION` file from source to the main layer in the Docker image
- Do not clean the `VERSION` file if the `.git` directory does not exist
- The deb package name is taken from the `Package` field in the `gpAux/debian/control` file

Ticket: ADBDEV-7873
Change bug_report format to YAML
Updated links to refer to the main branch
Prior to 3ce2e6a, when querying pg_locks (or using pg_lock_status()),
approximately 75% of backend memory allocations for the resulting tuples weren't
registered with Vmtracker or Resource Group Control. This memory would also
leak if the query was cancelled or failed.

This happened because CdbDispatchCommand(), which was previously used by
pg_locks, called libpq to obtain the results, which were allocated as PQresult
structures with bare malloc(), even on the server side.

This patch fixes both untracked memory issues by enforcing Vmtracker routines
for PGresult allocations on the server side.

Including postgres.h in frontend code causes several errcode-related macro
redefinition warnings. They are now undefined first.

Recursive errors due to mishandled OOM errors are addressed in c4e1085.
This PR also adds an additional set of tests, building on top of that commit.

Ticket: ADBDEV-7691
Function XactLockTableWait() calls LocalXidGetDistributedXid(), which may get
the gxid corresponding to the local wait xid from the distributed clog if the
dtx we are waiting for managed to commit by the time we access its gxid. For
such a case there is an assertion introduced by commit 13a1f66. The assert
indicates that the committed transaction was just running in parallel with the
current one, meaning there is no other reason to access the distributed
transaction history. If the transaction had been committed a long time ago,
XactLockTableWait() would never have been called.

However, there is a case when we can't compare the timestamps: a vacuum
operation, which performs an in-place update of pg_database (or pg_class)
without being in a distributed transaction. For this case the patch extends the
assertion by allowing the current timestamp to have a zero value.

The new test related to this case is added in 2c2753a.
Previously, commit 8359bfa reverted changes related to TAP tests that
required hot standby functionality, since it is not available in 6.X.

The standby errored out before it could fully start because several functions
threw an error. Skip the error if the connection is in utility mode.

Ticket: ADBDEV-7948
Problem description:
After sequential execution of the isolation2 tests 'standby_replay_dtx_info' and
'ao_unique_index', the coordinator's standby postmaster process, together with
its child processes, was terminated.

Root cause:
The test 'standby_replay_dtx_info' sets the fault injection
'standby_gxacts_overflow' on the coordinator's standby, which updates the global
variable 'max_tm_gxacts' (the limit of distributed transactions) to 1, but on
reset of this fault the value of 'max_tm_gxacts' was not restored to its
original value. Therefore, on any subsequent test that created more than 2
distributed transactions replayed on the standby, the standby encountered the
fatal error "the limit of 1 distributed transactions has been reached" and was
terminated.

Fix:
Set 'max_tm_gxacts' to its original value when fault injection
'standby_gxacts_overflow' is not set.
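A hedged sketch of the fault-injection pattern involved (the gp_inject_fault arguments and the way the standby is addressed are illustrative):

```
-- Illustrative only: setting the fault lowers max_tm_gxacts to 1 on the standby;
-- with the fix, resetting the fault also restores the original limit.
SELECT gp_inject_fault('standby_gxacts_overflow', 'skip', dbid)
FROM gp_segment_configuration WHERE content = -1 AND role = 'm';

-- ... replay more distributed transactions on the standby ...

SELECT gp_inject_fault('standby_gxacts_overflow', 'reset', dbid)
FROM gp_segment_configuration WHERE content = -1 AND role = 'm';
```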

(cherry picked from commit 423cc57b779bfb8f048f47425b428091a7d959a9)
The planner can execute some functions on segments. However, their contents
will also be planned on the segments, and planning may create motions, which is
unacceptable on a segment. Since the contents of a function may be unavailable
when planning the initial query (for example, a C function that calls SQL via
SPI), it is sufficient to prevent motions from being created when planning a
function on a segment.
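A hedged illustration of the situation (the function and table names are made up; the real case involves a C function issuing SQL through SPI):

```
-- Illustrative only: the function body is planned on the segment at call time,
-- so that plan must not contain Motion nodes.
CREATE FUNCTION seg_rows() RETURNS SETOF bigint AS $$
    SELECT count(*) FROM some_table;
$$ LANGUAGE sql EXECUTE ON ALL SEGMENTS;

SELECT * FROM seg_rows();
```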

Ticket: ADBDEV-8689
(cherry picked from commit 50385e2ebc768a92cda0692a604df80981061512)
## ADBDEV-8787: (v6) Run tests for feature branches with slashes (#115)

- Update pull-request branches to '**'
Target CI jobs to v8
- build
- behave-tests
- regression-tests
- orca-tests
- upload

Ticket: ADBDEV-8833
If the add column + insert gets aborted, pg_aocsseg still holds the aborted
column's vpinfo. We need to read only the committed columns' vpinfo and ignore
all aborted columns' vpinfo.

Before this change, during pg_aocsseg reads we were copying over the whole
vpinfo, which includes entries for aborted columns. This causes memory
corruption, as we copy more than what is needed/committed (the aborted columns'
vpinfo).

This change limits the read of vpinfo to the committed columns.

All the other code paths (those that assert VARSIZE(dv) == aocs_vpinfo_size)
don't encounter failures with aborted columns, as they are only reached after
aocsseg has already been updated in the same transaction.

AOCSFileSegInfoAddVpe doesn't encounter this problem, as we always update the
aocsseg entry with an empty vpe (with the new number of columns) in
aocs_addcol_emptyvpe before we reach AOCSFileSegInfoAddVpe in
aocs_addcol_closefiles.
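A hedged reproduction sketch of the aborted add column + insert scenario (the table name and row counts are illustrative):

```
-- Illustrative only: the abort leaves the added column's vpinfo in pg_aocsseg;
-- reads must consider only the committed column's vpinfo.
CREATE TABLE aocs_t (a int) WITH (appendonly=true, orientation=column);
INSERT INTO aocs_t SELECT generate_series(1, 1000);

BEGIN;
ALTER TABLE aocs_t ADD COLUMN b int DEFAULT 1;
INSERT INTO aocs_t SELECT i, i FROM generate_series(1, 1000) i;
ABORT;

SELECT count(*) FROM aocs_t;
```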

(cherry picked from commit 0a2d3cb)

Changes compared to the original commit:
1. 6x specific change in the test query: changed 'USING ao_column' to
'WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN)'.
2. 6x specific change in the test query: in the queries with
'gp_toolkit.__gp_aocsseg()' removed ordering by segment_id, as it doesn't exist.
3. Updated test answer file according to the changes above.
Problem description:
Vacuum failed to clean up segment files for AOCS tables that were created in a
transaction which was rolled back.

Root cause:
For AOCS tables, a transaction that creates segment files for the very first
time inserts a corresponding record into 'pg_aoseg.pg_aocsseg_<oid>' in
'InsertInitialAOCSFileSegInfo()'. If the transaction was rolled back, this
record in 'pg_aoseg.pg_aocsseg_<oid>' was no longer available, but the physical
segment files still existed.
The vacuum process in AOCSTruncateToEOF() relies on the information from
'pg_aoseg.pg_aocsseg_<oid>' to get all the segments vacuum needs to scan, so
the segment files from the aborted transaction were not visible to it.

Fix:
Add 'heap_freeze_tuple_wal_logged(segrel, segtup)' into
'InsertInitialAOCSFileSegInfo()', so vacuum can now see the new segment files.
This is already done in 7x (refer to 1306d47).
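A hedged reproduction sketch of the scenario (the table name is illustrative):

```
-- Illustrative only: the first INSERT into a new AOCS table creates its segment
-- files; after the rollback, vacuum must still be able to see and truncate them.
CREATE TABLE aocs_v (a int) WITH (appendonly=true, orientation=column);

BEGIN;
INSERT INTO aocs_v SELECT generate_series(1, 1000);
ABORT;

VACUUM aocs_v;
```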

However, this change interferes with commit 9e106f5, as freezing the tuple
partially reverts its logic. Part of this interference is resolved by the
preceding cherry-pick of 0a2d3cb, which handles the same problem as 9e106f5,
but in a different way. For 6x, additional changes are also required in
'UpdateAOCSFileSegInfo()', so this patch adds the use of 'deformAOCSVPInfo()'
there in the manner intended by 0a2d3cb.

This change also requires updating the output of the test
'uao_crash_compaction_column', because the output of 'gp_toolkit.__gp_aocsseg()'
now contains records about the segment files created by the interrupted vacuum
command.
If you create a table and don't insert any data into it, the relation file
is never fsync'd. You don't lose data, because an empty table doesn't have
any data to begin with, but if you crash and lose the file, subsequent
operations on the table will fail with "could not open file" error.
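A conceptual sketch of the failure mode (the crash itself cannot, of course, be scripted in plain SQL; the table name is illustrative):

```
-- Illustrative only: the empty relation's file was created but never fsync'd.
CREATE TABLE empty_t (a int);
CHECKPOINT;
-- If the OS crashes at this point and the file is lost, subsequent access used
-- to fail with "could not open file"; registering an fsync request in mdcreate()
-- makes the checkpoint flush the newly created file first.
SELECT * FROM empty_t;
```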

To fix, register an fsync request in mdcreate(), like we do for mdwrite().

Per discussion, we probably should also fsync the containing directory
after creating a new file. But that's a separate and much wider issue.

Backpatch to all supported versions.

Reviewed-by: Andres Freund, Thomas Munro
Discussion: https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi

6X changes:
In 6X, the smgr_which field is not used and not set. This patch adds setting
this field. To maintain binary compatibility, this field is set outside the
smgropen function.
(cherry picked from commit 1b4f1c6)
Target the following reusable workflows to v9:
- behave-tests
- regression-tests
- orca-tests
- upload
- package

Ticket: [GG-16](https://tracker.yandex.ru/GG-16)
Fix escaping for perfmon metrics export

What occurs?
Whenever new log rows containing database names with certain special characters
were loaded from the _queries_history.dat file into the queries_history external
table, an error about the size of the db column occurred. It prevented any new
logs from being loaded at all.

Why does it occur?
The db column has the type VARCHAR(64), which means a larger string was being
inserted. The obvious reason for this is incorrect escaping, or the lack of it
altogether. Only two symbols were observed to lead to errors: " and |. In the
case of a pipe, it leads to a different error about an incorrect row structure,
which is logical, as pipes are used as delimiters inside the file.
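For illustration, a (hypothetical) database name like the following one triggered the problem:

```
-- Illustrative only: a database name containing both problematic symbols.
CREATE DATABASE "db""with|pipe";
```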

How do we fix it?
Before the patch (1f67d39), the db field was written to _queries_history.dat
without double quotes, so no escaping was present. Now every database name is
enclosed in double quotes, and any double quotes already present in the name
are doubled.

Does it fix the problem?
Yes: we now escape the whole database name string with all of its special (or
not so special) characters. The very same method is used for the SQL query
command, and it works there. Doubling the double quotes is needed so that a
quote inside the name does not end the string too early.

What was changed?
Minor code additions for escaping inside src/backend/gpmon/gpmon.c.

Tests?
A new BDD auto test checks that the aforementioned logs are added to
queries_history correctly (as they appear there). However, not all symbols can
be auto-tested with the present testing functions: ', | and UTF-8 symbols.
Errors occur at different steps of the test for different types of symbols and
are connected to the way commands are issued to the DBMS. Nevertheless, these
symbols passed during manual testing.

Observations?
The dbuser column is not escaped either, but I have not managed to reproduce the
same issue with it. It may be worth adding escaping to it in the future, but for
now it looks like an extreme edge-case scenario.

Ticket: ADBDEV-7272

---------

Co-authored-by: Vladimir Sarmin <[email protected]>
Co-authored-by: Georgy Shelkovy <[email protected]>
RekGRpth and others added 6 commits November 26, 2025 19:01
When the DROP IF EXISTS command is called, it is unconditionally dispatched to
the segments, even if the object doesn't exist on the coordinator. This can
lead to inconsistencies, for example:

- the first session starts a table drop on the coordinator. Even though the
table doesn't exist, the drop is still dispatched to the segments.

- at the same time, the second session creates the table (on the coordinator
and on the segments).

- the first session (on the segments) already sees the table and therefore
deletes it.

Don't dispatch DROP if the object doesn't exist on the coordinator. Exclude the
drop_rename test from the parallel group because it contains fault injectors.
Add a new GUC, gp_dispatch_drop_always, for unconditional DROP dispatch even if
the object isn't present on the coordinator, since this functionality is used by
the gpcheckcat utility, for example, in the orphan_temp_table test.
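A hedged usage sketch (the table name is illustrative; the GUC name is the one introduced by this patch):

```
-- With the fix, a DROP of a missing object is no longer dispatched to segments.
DROP TABLE IF EXISTS missing_tbl;

-- Tools such as gpcheckcat can restore unconditional dispatch via the new GUC.
SET gp_dispatch_drop_always = on;
DROP TABLE IF EXISTS missing_tbl;  -- dispatched to the segments anyway
```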

Ticket: ADBDEV-8867
…ables (#89)

The add_rte_to_flat_rtable function places a RangeTblEntry into
glob->finalrtable with a zeroed list of functions. The
ParallelizeCorrelatedSubPlanMutator function uses ctx->rtable, which is
populated from root->glob->finalrtable in the ParallelizeCorrelatedSubPlan
function. Therefore, in ParallelizeCorrelatedSubPlanMutator the rte->functions
list is always empty, and the old condition for checking whether functions are
correlated did not work. Now, in ParallelizeCorrelatedSubPlanMutator, the
functions are available in the fscan->functions list, which is where they are
taken from for the correlation and location checks. Correlated non-ANY
functions are prohibited in subplans. For other combinations, we now add
broadcasting and materialization.

Another issue arose with correlated subplans over master-only or replicated
tables. The existing condition checked the scan locus only, without checking
the requested locus. As a result, the table scan was performed on segments,
which is unacceptable for the master-only gp_segment_configuration table. Now,
if the requested locus for such tables and the scan locus are not equal to
Entry, we again add broadcasting and materialization.
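A hedged illustration of the second issue (the outer table and its column are hypothetical):

```
-- Illustrative only: the correlated subplan scans the master-only
-- gp_segment_configuration table; that scan must not run on the segments.
SELECT *
FROM my_table t
WHERE EXISTS (
    SELECT 1
    FROM gp_segment_configuration c
    WHERE c.content = t.content_id);
```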

Co-authored-by: Georgy Shelkovy <[email protected]>
Co-authored-by: Maxim Michkov <[email protected]>

Ticket: ADBDEV-6884
Commit 73b889e replaced the use of the gp_tablespace_segment_location function
in the correlated subplan in the new version of the arenadata_toolkit extension
with the pg_tablespace_location and gp_dist_random functions. However, this
function remained in previous versions of the extension, causing the
upgrade_test test to fail. Replace it in previous versions as well.
@RekGRpth RekGRpth closed this Dec 5, 2025
@RekGRpth RekGRpth deleted the ADBDEV-8923 branch December 5, 2025 03:35