[WIP] Adbdev 8923 #2143
Problem description: A query with an 'execute on initplan' function whose parameter references another relation in the query caused a SIGSEGV during execution. Root cause: Because the function was executed in the initplan, the value of the parameter was not yet defined at execution time, as the initplan is processed before the main plan. Fix: When appending an initplan node for the function scan, set the subplan's 'parent_root' to NULL, isolating it from the plan params of outer queries. Such params no longer get into the valid parameter list in SS_finalize_plan(), and the planner emits an error for such a query.
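A minimal sketch of the kind of query described above (the function, table, and column names are hypothetical and not taken from the patch):
```sql
-- Hypothetical reproduction: an initplan function whose argument depends on
-- another relation in the same query. The planner now rejects this instead of
-- crashing at execution time.
CREATE TABLE t (a int) DISTRIBUTED BY (a);

CREATE FUNCTION f(x int) RETURNS SETOF int AS $$
    SELECT generate_series(1, x);
$$ LANGUAGE sql EXECUTE ON INITPLAN;

-- f() runs in an initplan, before t is scanned, so t.a is not available yet.
SELECT * FROM t, f(t.a);
```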
The project's Code of Conduct was updated in accordance with the recent changes approved by the Greengage DB architectural committee.
Reader gangs use a local snapshot to access the catalog; as a result, it is not synchronized with the sharedSnapshot of the writer gang, which leads to inconsistent visibility of catalog tables on idle reader gangs. Consider the case: select * from t, t t1; -- creates a reader gang. begin; create role r1; set role r1; -- the SET command is also dispatched to the idle reader gang. When the SET ROLE command is dispatched to the idle reader gang, the reader gang cannot see the new tuple for role r1 in the catalog table pg_authid. To fix this issue, we drop the idle reader gangs after each utility statement that may modify the catalog. Reviewed-by: Zhenghua Lyu <[email protected]> (cherry picked from commit d1ba4da)
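The same scenario laid out as a session sketch (t is assumed to be any existing table):
```sql
-- The self-join creates a reader gang that then sits idle with its local snapshot.
SELECT * FROM t, t t1;

BEGIN;
CREATE ROLE r1;
-- SET ROLE is also dispatched to the idle reader gang; before the fix it could
-- not see the new pg_authid tuple for r1.
SET ROLE r1;
```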
Previously, when rescanning a hash join was required and the hash join was executed in multiple batches that were spilled to disk, several issues occurred if the rescan happened in the middle of a batch:
1. Outer batch files (which contain outer tuples for all the subsequent batches) were never cleared, and after the rescan they were filled again. When gp_workfile_compression is enabled, this left a BufFile in an unexpected state (reading instead of writing), causing an error.
2. When an inner batch file is read into the in-memory hash table, it is deleted. After processing the batch, the hash table is spilled again into the batch file in case a rescan is possible. However, if the rescan happened in the middle of a batch, the hash table was never spilled and its contents were lost, resulting in missing data in the output.
To fix these issues, when a rescan is executed without rebuilding the whole hash table, clear all the outer batch files and also spill the current batch to an inner batch file.
Ticket: ADBDEV-8371
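A hedged sketch of the settings under which the described failure modes become reachable; the exact query that rescans a multi-batch hash join mid-batch is not given in the description and is omitted here:
```sql
-- Settings only; not a full reproduction.
SET gp_workfile_compression = on;  -- required for failure mode (1)
SET statement_mem = '2MB';         -- small work memory so the hash join spills
                                   -- into multiple batches
```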
In some cases, processing of an injected panic fault can take a long time, so queries sent after injecting the fault could either complete successfully or end with an error. To remove this uncertainty, a waiting mechanism was introduced in the test: before sending the next commands after injecting the panic fault, we wait for it to be fully processed, that is, for SIGCHLD to be triggered.
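One common shape of such a wait in Greenplum tests, shown only as a hedged illustration (the fault name is hypothetical, and the actual test may rely on a different mechanism):
```sql
CREATE EXTENSION IF NOT EXISTS gp_inject_fault;

-- Inject a panic fault on the primary of content 0 (hypothetical fault name).
SELECT gp_inject_fault('some_panic_fault', 'panic', dbid)
  FROM gp_segment_configuration WHERE role = 'p' AND content = 0;

-- ... run the statement that triggers the fault ...

-- Block until the fault has actually been hit before sending further commands.
SELECT gp_wait_until_triggered_fault('some_panic_fault', 1, dbid)
  FROM gp_segment_configuration WHERE role = 'p' AND content = 0;
```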
When a coordinator backend encounters an OOM error during a distributed query, it triggers a cleanup sequence that can execute two or more times, causing a use-after-free crash. The sequence of events is as follows: during normal execution cleanup in `standard_ExecutorEnd()`, the system calls `cdbdisp_destroyDispatcherState()` to clean up the dispatcher state. This function frees the results array, including the `segdbDesc` pointers that reference executor backend connections. However, an OOM error may occur inside the cleanup that is already in progress, since it still tries to allocate more memory; `AbortTransaction()` is then entered, and `cdbcomponent_recycleIdleQE()` is called later. The abort calls `cdbdisp_cleanupDispatcherHandle()` to perform cleanup operations again. The second cleanup attempt tries to read data from executor backends in `checkDispatchResult()` (to cancel the query) using the `segdbDesc` pointers that were already freed during the first cleanup, resulting in a segmentation fault when dereferencing invalid memory. Additionally, the `numActiveQEs` counter can become negative during reentrant calls to `cdbcomponent_recycleIdleQE()`, causing assertion failures.
When a backend encounters an OOM error, it throws an error and aborts the transaction. If the backend happens to be a coordinator, it also attempts to cancel the distributed query first by reading results from executor backends. This is futile, since the gang is discarded anyway when `ERRCODE_GP_MEMPROT_KILL` (a sign of OOM) is encountered.
Fixes:
- Move the `numActiveQEs` decrement after a `goto`.
- Prevent dispatcher handle cleanup from cancelling the dispatch in case of OOM. It is instead done by `DisconnectAndDestroyAllGangs()`, based on a new global variable.
- Avoid the `ABORT` dispatch for distributed transactions when the new flag is active. Without a response from the coordinator, the segments will eventually trigger `ABORT` themselves.
- Whether a gang is destroyed depends on the error code. Specify the error code for runaway cleaner cancellations, so they are treated equivalently to OOM conditions.
- Unit tests are modified to avoid mocking `in_oom_error_trouble()` for every `palloc()` call.
- The test makes sure there are no `palloc()` calls at all while the server is handling an OOM condition.
Ticket: ADBDEV-7916
Co-authored-by: Vladimir Sarmin <[email protected]>
Fix Orca cost model to prefer hashing smaller tables
Previously in Orca it was possible to achieve bad hash join plans that hashed
a much bigger table. This happened because in Orca's cost model there is a
cost associated with columns used in the join conditions, and this cost was
smaller when tuples are hashed than when tuples are fed from the outer child. This
doesn't really make sense since it could make Orca hash a bigger table if
there are enough join conditions, no matter how much bigger this table is.
To make sure this never happens, increase the cost per join column for inner
child, so that it is bigger than for outer child (same as cost per byte
already present).
Additionally, Orca increased cost per join column for outer child when
spilling was predicted, which doesn't make sense either since there is no
additional hashing when spilling is enabled. Postgres planner only imposes
additional per-byte (or rather per-page) cost when spilling hash join, so Orca
should have the same per-join-column cost for both spilling and non-spilling
cases.
A lot of tests are affected by this change, but for most of them only costs
are changed. For some, hash joins are reordered, swapping inner and outer
children, since Orca previously hashed the bigger child in some cases. In case
of LOJNullRejectingZeroPlacePredicates.mdp this actually restored the old plan
specified in the comment. Also add a new regress test.
One common change in some tests is replacing Hash Semi Join with a regular
Hash Join + Sort + GroupAggregate. There is only Left Semi Join, so swapping
the inner and outer children is impossible in case of semi joins. This means
that it's slightly cheaper to convert Hash Semi Join to regular Hash Join to
be able to swap the children. The opposite conversion also takes place where
previously GroupAggregate was used.
Another common change is that HashJoin(table1, Broadcast(table2)) gets
replaced with HashJoin(Redistribute(table1), Redistribute(table2)), adding
another slice. This happens because the cost for hashing is now slightly
bigger, and so Orca prefers to split hashing table2 to all segments, instead
of every segment hashing all rows as it would be with Broadcast.
Below are some notable changes in minidump files:
- ExtractPredicateFromDisjWithComputedColumns.mdp
This patch changed the join order from ((cust, sales), datedim) to ((datedim,
sales), cust). All three tables are identical from Orca's point of view: they
are all empty and all table scans are 24 bytes wide, so there is no reason for
Orca to prefer one join order over the other since they all have the same cost.
- HAWQ-TPCH-Stat-Derivation.mdp
The only change in the plan is swapping the children of the 3rd Hash Join, the
one involving lineitem_ao_column_none_level0 and
HashJoin(partsupp_ao_column_none_level0, part_ao_column_none_level0).
lineitem_ao_column_none_level0 is predicted to have approximately 22 billion
rows and the hash join is predicted to have approximately 10 billion rows, so
making the hash join the inner child is good in this case, since the smaller
relation is hashed.
- Nested-Setops-2.mdp
Same here. Two swaps were performed between dept and emp in two different
places. dept contains 1 row and emp contains 10001, so it's better if dept is
hashed. A Redistribute Motion was also replaced with Broadcast Motion in both
cases.
- TPCH-Q5.mdp
Probably the best improvement out of these plans. The previous plan had this
join order:
```
-> Hash Join (6,000,000 rows)
-> Hash Join (300,000,000 rows)
-> lineitem (1,500,000,000 rows)
-> Hash Join (500,000 rows)
-> supplier (2,500,000 rows)
-> Hash Join (5 rows)
-> nation (25 rows)
-> region (1 row)
-> Hash Join (100,000,000 rows)
-> customer (40,000,000 rows)
-> orders (100,000,000 rows)
```
which hashes 100 million rows twice (first orders, then its hash join
with customer). The new plan has no such issues:
```
-> Hash Join (6,000,000 rows)
-> Hash Join (170,000,000 rows)
-> lineitem (1,500,000,000 rows)
-> Hash Join (20,000,000 rows)
-> orders (100,000,000 rows)
-> Hash Join (7,000,000 rows)
-> customer (40,000,000 rows)
-> Hash Join (5 rows)
-> nation (25 rows)
-> region (1 row)
-> supplier (2,500,000 rows)
```
This plan only hashes around 30 million rows in total, much better than 200
million.
Ticket: ADBDEV-8413
When appending statistics to a group, it is first checked whether statistics already exist in the group. If they exist, the new statistics are appended to the existing ones using the AppendStats method. If there are no statistics in the group, the existence of statistics in the duplicate group is also checked. When appending to existing statistics, we take the existing statistics of the group (or of its duplicate), create a copy, add the new statistics to this copy, and release the old one. If the group does not have its own statistics and the duplicate's statistics are used, we would add statistics to the duplicate group and then try to release the statistics of the current group, which is NULL, leading to a segmentation fault. Fix this by calling the AppendStats method on the duplicate.
During the exploration phase, new groups of equivalent expressions are created. In this process, some groups are marked as duplicates. After exploration, expressions from duplicate groups are moved into the group they duplicate. In cases where a duplicate group contains an expression that references the duplicated group, merging them results in a situation where a group contains an expression that references the very group it belongs to. This leads to infinite recursion during statistics derivation. The fix is to improve the cycle-detection logic so that it can recognize when an expression references the group it resides in.
- New CI job for automatically building the deb package
- New targets for the gpAux Makefile: changelog, changelog-deb, pkg, pkg-deb
- New gpAux/debian folder with the package description/rules for the `debuild` utility
- Copy the `VERSION` file from the source to the main layer in the Docker image
- Do not clean the `VERSION` file if the `.git` directory does not exist
- The deb package name is taken from the `Package` field in the `gpAux/debian/control` file
Ticket: ADBDEV-7873
Change bug_report format to YAML
Updated links to refer to the main branch
Prior to 3ce2e6a, when querying pg_locks (or using pg_lock_status()), approximately 75% of the backend memory allocated for the resulting tuples wasn't registered with Vmtracker or Resource Group Control. This memory would also leak if the query was cancelled or failed. This happened because CdbDispatchCommand(), which was previously used by pg_locks, called libpq to obtain the results, which were allocated as PQresult structures with bare malloc(), even on the server side. This patch fixes both untracked-memory issues by enforcing Vmtracker routines for PGresult allocations on the server side. Including postgres.h in frontend code causes several errcode-related macro redefinition warnings; these macros are now undefined first. Recursive errors due to mishandled OOM errors are addressed in c4e1085. This PR also adds an additional set of tests, building on top of said commit. Ticket: ADBDEV-7691
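For reference, the two entry points named above; a quick way to exercise the now-tracked code path:
```sql
-- Both of these now allocate their per-QE PGresult data on the coordinator
-- through Vmtracker-accounted memory.
SELECT count(*) FROM pg_locks;
SELECT count(*) FROM pg_lock_status();
```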
Function XactLockTableWait() calls LocalXidGetDistributedXid(), which may get the gxid corresponding to the local wait xid from the distributed clog in case the dtx we are waiting for managed to commit by the time we access its gxid. For this case there is an assertion introduced by commit 13a1f66. The assert indicates that the committed transaction was just running in parallel with the current one, meaning there is no other reason to access the distributed transaction history; if the transaction had been committed a long time ago, XactLockTableWait() would never have been called. However, there is a case when we can't compare the timestamps: a vacuum operation, which performs an in-place update of pg_database (or pg_class) without being in a distributed transaction. For this case, this patch extends the assertion by allowing the current timestamp to have a zero value. A new test for this case is added in 2c2753a.
Previously, commit 8359bfa reverted changes related to TAP tests that required hot standby functionality, since it is not available in 6.X. The standby errored out before it could fully start because several functions threw an error. Skip the error if the connection is in utility mode. Ticket: ADBDEV-7948
This reverts commit 8359bfa.
Problem description: After sequential execution of the isolation2 tests 'standby_replay_dtx_info' and 'ao_unique_index', the coordinator's standby postmaster process, together with its child processes, was terminated. Root cause: Test 'standby_replay_dtx_info' sets the fault injection 'standby_gxacts_overflow' on the coordinator's standby, which updates the global variable 'max_tm_gxacts' (the limit of distributed transactions) to 1, but when this fault was reset, 'max_tm_gxacts' was not restored to its original value. Therefore, on any subsequent test that created more than 2 distributed transactions replayed on the standby, the standby encountered the fatal error "the limit of 1 distributed transactions has been reached" and was terminated. Fix: Set 'max_tm_gxacts' to its original value when the fault injection 'standby_gxacts_overflow' is not set. (cherry picked from commit 423cc57b779bfb8f048f47425b428091a7d959a9)
Some functions can be executed by the planner on segments. However, their contents will then also be planned on the segments, and planning may create motions, which is unacceptable on segments. Since the contents of a function may be unavailable when planning the initial query (for example, a C function with a call to SQL via SPI), it is sufficient to prevent motions from being created when planning a function on a segment. Ticket: ADBDEV-8689 (cherry picked from commit 50385e2ebc768a92cda0692a604df80981061512)
## ADBDEV-8787: (v6) Run tests for feature branches with slashes (#115) - Update pull-request branches to '**'
Target CI jobs to v8:
- build
- behave-tests
- regression-tests
- orca-tests
- upload
Ticket: ADBDEV-8833
If the add column + insert gets aborted, pg_aocsseg still holds the aborted column's vpinfo. We need to read only the committed columns' vpinfo and ignore all aborted columns' vpinfo. Before this change, during pg_aocsseg reads we were copying over the whole vpinfo, which includes entries for aborted columns. This creates memory corruption, as we are copying more than what is needed/committed (the aborted columns' vpinfo). This change limits the read of vpinfo to the committed columns. All the other code paths (that assert VARSIZE(dv) == aocs_vpinfo_size) don't encounter failures with aborted columns, as they are only reached after aocsseg has already been updated in the same transaction. AOCSFileSegInfoAddVpe doesn't encounter this problem because we always update the aocsseg entry with an empty vpe (with the new number of columns) in aocs_addcol_emptyvpe before we reach AOCSFileSegInfoAddVpe in aocs_addcol_closefiles. (cherry picked from commit 0a2d3cb) Changes compared to the original commit: 1. 6x-specific change in the test query: changed 'USING ao_column' to 'WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN)'. 2. 6x-specific change in the test query: in the queries with 'gp_toolkit.__gp_aocsseg()', removed ordering by segment_id, as it doesn't exist. 3. Updated the test answer file according to the changes above.
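A hedged sketch of the aborted add-column scenario described above (table and column names are made up; the patched code path is the read of pg_aocsseg, e.g. via gp_toolkit.__gp_aocsseg()):
```sql
CREATE TABLE aocs_t (a int) WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN);
INSERT INTO aocs_t SELECT generate_series(1, 100);

-- Add a column and insert within a transaction, then abort: pg_aocsseg keeps
-- the aborted column's vpinfo.
BEGIN;
ALTER TABLE aocs_t ADD COLUMN b int DEFAULT 0;
INSERT INTO aocs_t SELECT i, i FROM generate_series(1, 100) i;
ABORT;

-- Reading the segment info previously copied the whole vpinfo, including the
-- aborted column's entry.
SELECT * FROM gp_toolkit.__gp_aocsseg('aocs_t'::regclass);
```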
Problem description: Vacuum failed to clean up AOCS segment files that were created in a transaction which was rolled back. Root cause: For AOCS tables, a transaction that creates segment files for the very first time inserts a corresponding record into 'pg_aoseg.pg_aocsseg_<oid>' in 'InsertInitialAOCSFileSegInfo()'. If the transaction was rolled back, this record in 'pg_aoseg.pg_aocsseg_<oid>' was no longer available, while the physical segment files still existed. The vacuum process in AOCSTruncateToEOF() relies on the information from 'pg_aoseg.pg_aocsseg_<oid>' to get all segments vacuum needs to scan; obviously, the segment files from the aborted transaction were not visible to it. Fix: Add 'heap_freeze_tuple_wal_logged(segrel, segtup)' to 'InsertInitialAOCSFileSegInfo()', so vacuum can now see the new segment files. This is already done in 7x (refer to 1306d47). But this change interferes with commit 9e106f5, as freezing the tuple partially reverts its logic. Part of this interference is resolved by the preceding cherry-pick of 0a2d3cb, which handles the same problem as 9e106f5 in a different way. For 6x, however, additional changes are required in 'UpdateAOCSFileSegInfo()', so this patch adds the usage of 'deformAOCSVPInfo()' to it in the manner intended by 0a2d3cb. This change also requires updating the output of the test 'uao_crash_compaction_column', because the output of 'gp_toolkit.__gp_aocsseg()' now contains records about segment files created by the interrupted vacuum command.
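A hedged sketch of the scenario (table name hypothetical):
```sql
CREATE TABLE aocs_t2 (a int) WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN);

-- The first-ever insert creates the physical segment files and the
-- pg_aoseg.pg_aocsseg_<oid> record; the rollback leaves the files on disk
-- but, before the fix, no record visible to vacuum.
BEGIN;
INSERT INTO aocs_t2 SELECT generate_series(1, 1000);
ROLLBACK;

VACUUM aocs_t2;  -- previously could not see, and thus truncate, those files
```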
If you create a table and don't insert any data into it, the relation file is never fsync'd. You don't lose data, because an empty table doesn't have any data to begin with, but if you crash and lose the file, subsequent operations on the table will fail with "could not open file" error. To fix, register an fsync request in mdcreate(), like we do for mdwrite(). Per discussion, we probably should also fsync the containing directory after creating a new file. But that's a separate and much wider issue. Backpatch to all supported versions. Reviewed-by: Andres Freund, Thomas Munro Discussion: https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi 6X changes: In 6X, the smgr_which field is not used and not set. This patch adds setting this field. To maintain binary compatibility, this field is set outside the smgropen function. (cherry picked from commit 1b4f1c6)
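A sketch of the failure mode described above (the crash step obviously cannot be expressed as SQL and is shown as a comment):
```sql
-- Create a table but never insert into it: before the fix, no fsync request
-- was registered for its (empty) relation file.
CREATE TABLE never_filled (a int);

-- <OS crash before the next checkpoint; the empty file can be lost>

-- After recovery, operations on the table could fail with
-- "could not open file ..." even though no data was ever lost.
SELECT * FROM never_filled;
```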
Target the following reusable workflows to v9:
- behave-tests
- regression-tests
- orca-tests
- upload
- package
Ticket: [GG-16](https://tracker.yandex.ru/GG-16)
Fix escaping for perfmon metrics export
What occurs? Whenever new log rows containing database names with certain special characters were loaded from the _queries_history.dat file into the queries_history external table, an error about the size of the db column occurred, preventing any new logs from being loaded at all.
Why does it occur? The db column has the type VARCHAR(64), so a larger string was being inserted. The obvious reason is incorrect escaping, or the lack of it altogether. Only two symbols were observed to lead to errors: " and |. A pipe leads to a different error about incorrect row structure, which is logical, as pipes are used as delimiters inside the file.
How do we fix it? Before the patch (1f67d39), the db field was written to _queries_history.dat without double quotes, so no escaping was present. Now every database name is enclosed in double quotes, and any double quotes already present in the name are doubled.
Does it fix the problem? Yes: we now escape the whole database-name string with all of its special (or not) characters; the very same method is used for the SQL query command, and it works. Doubling the double quotes is needed to escape them so the string does not end too early.
What was changed? Minor code additions for escaping inside src/backend/gpmon/gpmon.c.
Tests? A new BDD auto test checks that the logs mentioned above are added to queries_history correctly (as they appear there). However, not all symbols can be auto-tested using the present testing functions: ', | and UTF-8 symbols. Errors occur at different steps of the test for different types of symbols and are connected to the way commands are issued to the DBMS. Nevertheless, during manual testing these symbols passed.
Observations? The dbuser column is also not escaped, but I have not managed to recreate the same issue with it. It may be worth adding escaping to it in the future, but for now it looks like an extreme edge case.
Ticket: ADBDEV-7272 --------- Co-authored-by: Vladimir Sarmin <[email protected]> Co-authored-by: Georgy Shelkovy <[email protected]>
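A hedged illustration of a database name containing the two problematic characters called out above (the name itself is made up):
```sql
-- Creates a database literally named:  te"st|db
-- Before the patch, its rows in _queries_history.dat broke both the
-- VARCHAR(64) db column (unescaped quote) and the pipe-delimited row format.
CREATE DATABASE "te""st|db";
```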
When the DROP IF EXISTS command is called, it is unconditionally dispatched to the segments, even if the object doesn't exist on the coordinator. This can lead to inconsistencies, for example:
- the first session starts a table drop on the coordinator; even though the table doesn't exist, the drop is still dispatched to the segments;
- at the same time, the second session creates the table (on the coordinator and on the segments);
- the first session (on the segments) already sees the table and therefore deletes it.
Don't dispatch DROP if the object doesn't exist on the coordinator. Exclude the drop_rename test from the parallel group because it contains fault injectors. Add a new GUC, gp_dispatch_drop_always, for unconditional DROP dispatch even if the object isn't present on the coordinator, as this functionality is used by the gpcheckcat utility, for example, in the orphan_temp_table test. Ticket: ADBDEV-8867
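The race above as a sequential sketch (the table name is hypothetical; the GUC shown last is the one added by this patch, and its exact setting context is not stated in the description):
```sql
-- Session 1: t does not exist on the coordinator, yet the DROP used to be
-- dispatched to the segments anyway.
DROP TABLE IF EXISTS t;

-- Session 2, racing with the dispatched DROP: creates t on the coordinator
-- and the segments. The in-flight DROP could then remove the segment-level
-- relations while the coordinator still has t.
CREATE TABLE t (a int);

-- Opt back into unconditional dispatch (needed e.g. by gpcheckcat):
SET gp_dispatch_drop_always = on;
```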
…ables (#89) The add_rte_to_flat_rtable function places a RangeTblEntry into glob->finalrtable with a zeroed list of functions. The ParallelizeCorrelatedSubPlanMutator function uses ctx->rtable, which is populated from root->glob->finalrtable in the ParallelizeCorrelatedSubPlan function. Therefore, in ParallelizeCorrelatedSubPlanMutator the rte->functions list is always empty, and the old condition for checking whether functions are correlated did not work. Now, in ParallelizeCorrelatedSubPlanMutator, the functions are taken from the fscan->functions list, which is where they are checked for correlation and execution location. Correlated non-ANY functions are prohibited in subplans. For other combinations, we now add broadcasting and materialization. Another issue arose with correlated subplans over master-only or replicated tables. The existing condition checked the scan locus only, without checking the requested locus. As a result, the table scan was performed on segments, which is unacceptable for the master-only gp_segment_configuration table. Now, if the requested locus for such tables and the scan locus are not equal to Entry, we again add broadcasting and materialization. Co-authored-by: Georgy Shelkovy <[email protected]> Co-authored-by: Maxim Michkov <[email protected]> Ticket: ADBDEV-6884
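A hedged sketch of the master-only case (the outer table and the exact query shape are hypothetical):
```sql
CREATE TABLE outer_t (a int);

-- The correlated subquery references outer_t, so its scan of the master-only
-- gp_segment_configuration table must not be pushed down to the segments.
SELECT a,
       (SELECT count(*) FROM gp_segment_configuration c WHERE c.content < t.a)
  FROM outer_t t;
```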
Commit 73b889e replaced the use of the gp_tablespace_segment_location function in the correlated subplan in the new version of the arenadata_toolkit extension with the pg_tablespace_location and gp_dist_random functions. However, this function remained in previous versions of the extension, causing the upgrade_test test to fail. Replace it in previous versions as well.