@mos65o2 mos65o2 commented Mar 2, 2025

Add support for parameterized clauses in a subplan with volatile functions (#1240)

Problem:
Queries with parameterized clauses (e.g. LIMIT, ORDER BY) and volatile
functions could return incorrect results. This was due to the construction of
an invalid plan when a Motion was added on top of a subplan with volatile
functions. Since parameters are not passed through a Motion, the parameterized
clause did not work correctly. The Motion was added late in subplan processing,
in fix_subplan_motion(), because the subplan had an Entry locus. Initially, after
the subplan is built, it has a General locus. However, due to the presence of a
volatile function in the subplan, the locus is changed to Entry. This is done to
ensure that the data set in the subplan is identical on all segments (for more
information, see d1f9b96). However, for a parameterized subplan, adding the
Motion on top at that point was too late.

Changes:
First, after "current_rel" is generated and a Result node (with a volatile
function) is added on top of its path, the locus is changed to SingleQE.
This is necessary in order to compute the data set on one segment. The change
is made only for the General locus (data is available on any segment) and only
if the root locus of the subplan is not Replicated (this locus is excluded).
Further, if the subquery is correlated (has parameterized operators), the
data set is distributed to all segments by adding a Motion.
To eliminate the Motion(1:1), the fix_outer_query_motions_mutator function has
been fixed.
A test case has also been added.
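
The decision described above can be pictured with a small standalone model. This is only a sketch under stated assumptions: the names `ModelLocus`, `SubplanInfo`, `PlanDecision`, and `plan_volatile_subplan` are hypothetical and are neither the planner's real types nor the code in this patch. It only mirrors the three conditions listed in the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for planner locus types (not the real CdbLocusType). */
typedef enum {
    MODEL_GENERAL,
    MODEL_SINGLE_QE,
    MODEL_REPLICATED,
    MODEL_HASHED
} ModelLocus;

typedef struct {
    ModelLocus path_locus;    /* locus of the path under the Result node */
    ModelLocus root_locus;    /* root locus of the subplan */
    bool       has_volatile;  /* the Result node evaluates a volatile function */
    bool       is_correlated; /* the subquery has parameterized operators */
} SubplanInfo;

typedef struct {
    ModelLocus locus;         /* locus after the adjustment */
    bool       add_motion;    /* distribute the data set to all segments */
} PlanDecision;

/* Assumed reading of the description: switch General to SingleQE, then add
 * a Motion only when the subquery is correlated. */
static PlanDecision plan_volatile_subplan(SubplanInfo in)
{
    PlanDecision out = { in.path_locus, false };

    if (in.has_volatile &&
        in.path_locus == MODEL_GENERAL &&
        in.root_locus != MODEL_REPLICATED)
    {
        out.locus = MODEL_SINGLE_QE;       /* compute the data set on one segment */
        out.add_motion = in.is_correlated; /* then send it to all segments */
    }
    return out;
}

/* Example inputs for the model above. */
void example_plan_decisions(void)
{
    SubplanInfo correlated = { MODEL_GENERAL, MODEL_GENERAL, true, true };
    SubplanInfo replicated = { MODEL_GENERAL, MODEL_REPLICATED, true, true };

    PlanDecision a = plan_volatile_subplan(correlated);
    PlanDecision b = plan_volatile_subplan(replicated);

    assert(a.locus == MODEL_SINGLE_QE && a.add_motion);
    assert(b.locus == MODEL_GENERAL && !b.add_motion);
}
```

In this model, a subplan whose root locus is Replicated is left untouched, which matches the exclusion stated in the description.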

Ticket: ADBDEV-6886

@mos65o2 mos65o2 marked this pull request as ready for review March 3, 2025 07:52

@silent-observer silent-observer left a comment

This seems like an incomplete solution. For example, SELECT (SELECT a + random() + few.id FROM generate_series(1, 10) a LIMIT 1 OFFSET few.id) FROM few; causes ERROR: illegal rescan of motion node. This is because the + few.id part is placed into the Function Scan node, so Materialize tries to rescan it when the parameter value changes. Is this patch only supposed to fix the narrow issue with parameters in LIMIT/OFFSET clauses?

Author

mos65o2 commented Mar 3, 2025

This seems like an incomplete solution. For example, SELECT (SELECT a + random() + few.id FROM generate_series(1, 10) a LIMIT 1 OFFSET few.id) FROM few; causes ERROR: illegal rescan of motion node. This is because the + few.id part is placed into the Function Scan node, so Materialize tries to rescan it when the parameter value changes. Is this patch only supposed to fix the narrow issue with parameters in LIMIT/OFFSET clauses?

This is a common problem when running volatile functions in a subplan.
Motion is added inside such a subplan so that there is one data set for all segments.
If anything below the Motion contains parameters, this causes problems.

As for the query you provided, it apparently does not work anyway (after the merge of 9a1e48c):
postgres=# SELECT (SELECT a + random() + few.id FROM generate_series(1, 10) a LIMIT 1 OFFSET few.id) FROM few;
ERROR: Passing parameters across motion is not supported. (cdbmutate.c:2051)

Is this patch only supposed to fix the narrow issue with parameters in LIMIT/OFFSET clauses?

Yes, I'm trying to fix Limit-Offset. In my case it is possible to pull up the parameterized Limit above the Motion. For the underlying nodes a different approach is needed.

silent-observer previously approved these changes Mar 4, 2025
Member

bimboterminator1 commented Mar 12, 2025

Yes, I'm trying to fix Limit-Offset. In my case it is possible to pull up the parameterized Limit above the Motion. For the underlying nodes a different approach is needed.

There are probably several plan nodes whose parameterization behaves like LIMIT-OFFSET (i.e. nodes for which the parameterization can be isolated, as is done in the current patch, without parameterizing the underlying nodes). Shouldn't we generalize to them as well? Or should this be done in a separate ticket? For example (without the error-throwing patches):

explain (verbose, costs off) SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;

explain (verbose, costs off) SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;
                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.limit_tbl
         Output: (SubPlan 1)
         SubPlan 1
           ->  Materialize
                 Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                 ->  Broadcast Motion 1:3  (slice2)
                       Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                       ->  Limit
                             Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                             ->  Result
                                   Output: f(a.a), (abs((a.a - limit_tbl.i)))
                                   ->  Sort
                                         Output: (abs((a.a - limit_tbl.i))), a.a
                                         Sort Key: (abs((a.a - limit_tbl.i)))
                                         ->  Function Scan on pg_catalog.generate_series a
                                               Output: abs((a.a - limit_tbl.i)), a.a
                                               Function Call: generate_series(1, 10)
 Optimizer: Postgres-based planner
 Settings: optimizer = 'off'
(21 rows)

postgres=# SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;
 f 
---
 1
 1
 1
(3 rows)

Author

mos65o2 commented Mar 13, 2025

Shouldn't we generalize them as well? Or this should be done in separate ticket?

It looks like the parameterized Order By operator can also be raised above Motion. We can add Motion to Order By when creating its path (similar to the current patch). Or when creating the table scan path (I assume). If we choose the first method, then this can be done in a separate ticket.

Member

bimboterminator1 commented Mar 13, 2025

It looks like the parameterized Order By operator can also be raised above Motion. We can add Motion to Order By when creating its path (similar to the current patch). Or when creating the table scan path (I assume). If we choose the first method, then this can be done in a separate ticket.

First of all, I'd suggest researching whether similar cases exist with other nodes (it's probably not only ORDER BY), and whether your logic can be applied in a more general manner, without concentrating on this specific edge case. Draw some conclusions, then decide whether we should leave everything as it is and cover only the LIMIT case, or make the planning of such queries smarter.

@mos65o2 mos65o2 changed the title from "ADBDEV-6886: Add support parameterized LIMIT in the sub plan with volatile functions" to "ADBDEV-6886: Add support parameterized clauses in the subplan with volatile functions" on Apr 16, 2025
Author

mos65o2 commented Apr 16, 2025

There are probably several plan nodes whose parameterization behaves like LIMIT-OFFSET (i.e. nodes for which the parameterization can be isolated, as is done in the current patch, without parameterizing the underlying nodes). Shouldn't we generalize to them as well? Or should this be done in a separate ticket? For example (without the error-throwing patches):

explain (verbose, costs off) SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;

explain (verbose, costs off) SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;
                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.limit_tbl
         Output: (SubPlan 1)
         SubPlan 1
           ->  Materialize
                 Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                 ->  Broadcast Motion 1:3  (slice2)
                       Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                       ->  Limit
                             Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                             ->  Result
                                   Output: f(a.a), (abs((a.a - limit_tbl.i)))
                                   ->  Sort
                                         Output: (abs((a.a - limit_tbl.i))), a.a
                                         Sort Key: (abs((a.a - limit_tbl.i)))
                                         ->  Function Scan on pg_catalog.generate_series a
                                               Output: abs((a.a - limit_tbl.i)), a.a
                                               Function Call: generate_series(1, 10)
 Optimizer: Postgres-based planner
 Settings: optimizer = 'off'
(21 rows)

postgres=# SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;
 f 
---
 1
 1
 1
(3 rows)

This query still doesn't work. The parameter is passed inside the scan.

explain (verbose, costs off) SELECT (SELECT f(a) FROM generate_series(1,10) a 
        ORDER BY abs(a - limit_tbl.i) limit 1 )
FROM limit_tbl;
                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.limit_tbl
         Output: (SubPlan 1)
         SubPlan 1
           ->  Limit
                 Output: (f(a.a)), (abs((a.a - limit_tbl.i)))
                 ->  Result
                       Output: f(a.a), (abs((a.a - limit_tbl.i)))
                       ->  Sort
                             Output: (abs((a.a - limit_tbl.i))), a.a
                             Sort Key: (abs((a.a - limit_tbl.i)))
                             ->  Materialize
                                   Output: (abs((a.a - limit_tbl.i))), a.a
                                   ->  Broadcast Motion 1:3  (slice2; segments: 1)
                                         Output: (abs((a.a - limit_tbl.i))), a.a
                                         ->  Function Scan on pg_catalog.generate_series a
                                               Output: abs((a.a - limit_tbl.i)), a.a
                                               Function Call: generate_series(1, 10)
 Optimizer: Postgres-based planner
 Settings: optimizer = 'off'
(21 rows)

Author

mos65o2 commented Apr 16, 2025

It works:

explain (verbose, costs off)
select (select f(a) from generate_series(1,10) a order by f(a) limit 1 offset limit_tbl.i) from limit_tbl;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.limit_tbl
         Output: (SubPlan 1)
         SubPlan 1
           ->  Limit
                 Output: (f(a.a))
                 ->  Sort
                       Output: (f(a.a))
                       Sort Key: (f(a.a))
                       ->  Materialize
                             Output: (f(a.a))
                             ->  Broadcast Motion 1:3  (slice2; segments: 1)
                                   Output: (f(a.a))
                                   ->  Function Scan on pg_catalog.generate_series a
                                         Output: f(a.a)
                                         Function Call: generate_series(1, 10)
 Optimizer: Postgres-based planner
 Settings: optimizer = 'off'

select (select f(a) from generate_series(1,10) a order by f(a) limit 1 offset limit_tbl.i) from limit_tbl;
 f 
---
 2
 3
 4

Member

Should we do something about these cases:
1.

explain (costs off, verbose)  SELECT (SELECT (f(a))* random() from generate_series(1, 10)a where a > random() limit 1 offset few.id) FROM few;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.few
         Output: (SubPlan 1)
         SubPlan 1
           ->  Limit
                 Output: (((f(a.a))::double precision * random()))
                 ->  Function Scan on pg_catalog.generate_series a
                       Output: ((f(a.a))::double precision * random())
                       Function Call: generate_series(1, 10)
                       Filter: ((a.a)::double precision > random())
 Optimizer: Postgres-based planner
 Settings: optimizer = 'off'
(13 rows)

2. SegmentGeneral
explain (costs off, verbose)  SELECT (SELECT (f(i)) from t_repl limit 1 offset few.id) FROM few;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.few
         Output: (SubPlan 1)
         SubPlan 1
           ->  Limit
                 Output: (f(t_repl.i))
                 ->  Result
                       Output: f(t_repl.i)
                       ->  Materialize
                             Output: t_repl.i
                             ->  Broadcast Motion 1:3  (slice2; segments: 1)
                                   Output: t_repl.i
                                   ->  Seq Scan on public.t_repl
                                         Output: t_repl.i
 Optimizer: Postgres-based planner
 Settings: optimizer = 'off'
(17 rows)

Comment on lines 989 to +995
   * For non-top slice, if this motion is QE singleton and subplan's locus
   * is CdbLocusType_SegmentGeneral, omit this motion.
   */
- shouldOmit |= context->sliceDepth > 0 &&
-     context->currentPlanFlow->flotype == FLOW_SINGLETON &&
+ shouldOmit |= context->currentPlanFlow->flotype == FLOW_SINGLETON &&
      context->currentPlanFlow->segindex == 0 &&
-     motion->plan.lefttree->flow->locustype == CdbLocusType_SegmentGeneral;
+     (motion->plan.lefttree->flow->locustype == CdbLocusType_SegmentGeneral ||
+      motion->plan.lefttree->flow->locustype == CdbLocusType_SingleQE);
Member

Comment says "non-top slice". Is there the case when we could mistakenly omit the motion in case of context->sliceDepth > 0 && motion->plan.lefttree->flow->locustype == CdbLocusType_SingleQE?

Member

Yep, the condition is weakened. But is there a way to strictly identify our specific case with motion?
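
To make the weakening concrete, the two versions of the term can be compared in a standalone model. This is a sketch, not the planner code: `MotionCase`, `ModelFlowType`, `ModelLocus`, `omit_before`, and `omit_after` are hypothetical names, and each struct field only models the expression named in its comment. The model covers just the term that is OR-ed into shouldOmit, not any earlier value of shouldOmit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the fields read by the condition. */
typedef enum { MODEL_FLOW_SINGLETON, MODEL_FLOW_PARTITIONED } ModelFlowType;
typedef enum { MODEL_SEGMENT_GENERAL, MODEL_SINGLE_QE, MODEL_HASHED } ModelLocus;

typedef struct {
    int           slice_depth; /* models context->sliceDepth */
    ModelFlowType flotype;     /* models context->currentPlanFlow->flotype */
    int           segindex;    /* models context->currentPlanFlow->segindex */
    ModelLocus    child_locus; /* models motion->plan.lefttree->flow->locustype */
} MotionCase;

/* Term OR-ed into shouldOmit before the patch. */
static bool omit_before(MotionCase m)
{
    return m.slice_depth > 0 &&
           m.flotype == MODEL_FLOW_SINGLETON &&
           m.segindex == 0 &&
           m.child_locus == MODEL_SEGMENT_GENERAL;
}

/* Term OR-ed into shouldOmit after the patch. */
static bool omit_after(MotionCase m)
{
    return m.flotype == MODEL_FLOW_SINGLETON &&
           m.segindex == 0 &&
           (m.child_locus == MODEL_SEGMENT_GENERAL ||
            m.child_locus == MODEL_SINGLE_QE);
}

/* Inputs where the two versions disagree, plus one where they agree. */
void example_weakened_condition(void)
{
    MotionCase top_slice = { 0, MODEL_FLOW_SINGLETON, 0, MODEL_SEGMENT_GENERAL };
    MotionCase single_qe = { 1, MODEL_FLOW_SINGLETON, 0, MODEL_SINGLE_QE };
    MotionCase unchanged = { 1, MODEL_FLOW_SINGLETON, 0, MODEL_SEGMENT_GENERAL };

    assert(!omit_before(top_slice) && omit_after(top_slice));
    assert(!omit_before(single_qe) && omit_after(single_qe));
    assert(omit_before(unchanged) && omit_after(unchanged));
}
```

The model shows two newly matched families of cases: a SegmentGeneral child in the top slice (sliceDepth of 0), and a SingleQE child at any slice depth. Whether either can occur outside the volatile-subplan case is the open question in this thread.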

One-Time Filter: ("*VALUES*".column1 = "*VALUES*".column1)
Optimizer: Postgres query optimizer
(9 rows)
-> Materialize
Member

  1. I'll ask again: is the tuplestore of the Material node refilled during SubPlan rescan?

  2. If so, is there a simple way to omit the Material node?

Author

Added a check so that when we omit the Motion, the Materialize node above it is also omitted (in fix_outer_query_motions_mutator()).

- if (CdbPathLocus_IsGeneral(origpath->locus) ||
-     CdbPathLocus_IsOuterQuery(origpath->locus))
+ if (CdbPathLocus_IsGeneral(origpath->locus) || CdbPathLocus_IsOuterQuery(origpath->locus) ||
+     ((CdbPathLocus_IsSegmentGeneral(origpath->locus) || CdbPathLocus_IsSingleQE(origpath->locus))
Member

  1. What does the CdbPathLocus_IsSingleQE(origpath->locus) condition stand for?

  2. Should the difference for join plans (I mean that the plans with and without the patch differ) be taken into account? For example:

explain (costs off, verbose)  SELECT (SELECT f(t_repl.i) from  t_repl join t_strewn using(i) where  t_repl.j < few.id) FROM few;

I suggest simply testing join plans to find any side effects of this condition. At first glance nothing drastic happens.

Author

  1. We postpone changing the locus and adding Motion if we initially have a General or SingleQE locus and we also have a volatile function. Otherwise we may capture unnecessary cases. For example this one.
  2. It seems that the principle of adding Motion on top of Result does not always work. Although the query works without the patch, it doesn't fit into the "single data set" principle, so there should probably be a different plan here:

without patch:

postgres=# explain (costs off, verbose)  SELECT (SELECT f(t_repl.i) from  t_repl join t_strewn using(i) where  t_repl.j < few.id) FROM few;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.few
         Output: (SubPlan 1)
         SubPlan 1
           ->  Hash Join
                 Output: f(t_repl.i)
                 Hash Cond: (t_repl.i = t_strewn.i)
                 ->  Result
                       Output: t_repl.i, t_repl.j
                       Filter: (t_repl.j < few.id)
                       ->  Materialize
                             Output: t_repl.i, t_repl.j
                             ->  Broadcast Motion 1:3  (slice2; segments: 1)
                                   Output: t_repl.i, t_repl.j
                                   ->  Seq Scan on public.t_repl
                                         Output: t_repl.i, t_repl.j
                 ->  Hash
                       Output: t_strewn.i
                       ->  Materialize
                             Output: t_strewn.i
                             ->  Broadcast Motion 3:3  (slice3; segments: 3)
                                   Output: t_strewn.i
                                   ->  Seq Scan on public.t_strewn
                                         Output: t_strewn.i

with patch (checkMotionWithParam is off):

 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Seq Scan on public.few
         Output: (SubPlan 1)
         SubPlan 1
           ->  Result
                 Output: f(t_repl.i)
                 ->  Materialize
                       Output: t_repl.i
                       ->  Broadcast Motion 3:3  (slice2; segments: 3)
                             Output: t_repl.i
                             ->  Hash Join
                                   Output: t_repl.i
                                   Hash Cond: (t_repl.i = t_strewn.i)
                                   ->  Result
                                         Output: t_repl.i, t_repl.j
                                         Filter: (t_repl.j < few.id)
                                         ->  Seq Scan on public.t_repl
                                               Output: t_repl.i, t_repl.j
                                   ->  Hash
                                         Output: t_strewn.i
                                         ->  Seq Scan on public.t_strewn
                                               Output: t_strewn.i

Member

Also, add comments to all newly added code. The logic is really unclear and should be described in detail.
