Thread: Re: Implement waiting for wal lsn replay: reloaded

Re: Implement waiting for wal lsn replay: reloaded

From

Yura Sokolov

Date:

06 February, 11:31:28

27.11.2024 07:08, Alexander Korotkov wrote:
> Present solution
> 
> The present patch implements a new utility command WAIT FOR LSN
> 'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw'].  Unlike previous
> attempts to implement custom syntax, it uses only one extra unreserved
> keyword.  The parameters are implemented as generic_option_list.
> 
> Custom syntax eliminates the problem of running within an empty
> transaction of REPEATABLE READ level or higher.  We don't need to
> look up a system catalog.  Thus, we don't have to set a transaction snapshot.
> 
> Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
> holding a snapshot to return a value.  Therefore, the WAIT command in
> the attached patch returns its result status.
> 
> Also, the attached patch explicitly checks if the standby has been
> promoted to throw the most relevant form of an error.  The issue of
> inaccurate error messages has been previously spotted in [5].
> 
> Any comments?

Good day, Alexander.

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array,
say of 16 elements, and a loop around the remaining function body acting in
batches of 16 wakeups. It is doubtful there will often be more than 16
waiting clients, and even then it won't be much heavier than fetching them
all at once.

2. I'd move the `inHeap` field between `procno` and `phNode` to fill the gap
between fields on 64-bit platforms.
Well, I believe it would be better to tweak `pairingheap_node` to make it
clear whether it is in a heap or not. But such a change would be unrelated to
the purpose of the current patch. So let's stick with `inHeap`, but move it
a bit.
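
To illustrate the padding argument in remark 2, here is a minimal sketch with stand-in types. `FakeHeapNode`, `ProcInfoLoose`, `ProcInfoPacked`, and `bytes_saved()` are hypothetical names; they only mimic the shape of the fields mentioned above, with an 8-byte LSN assumed in front. The real struct in the patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pairingheap_node: assumed to be three pointers. */
typedef struct FakeHeapNode
{
    struct FakeHeapNode *first_child;
    struct FakeHeapNode *next_sibling;
    struct FakeHeapNode *prev_or_parent;
} FakeHeapNode;

/* inHeap placed last: padding after procno and again after inHeap. */
typedef struct ProcInfoLoose
{
    uint64_t     waitLSN;
    int32_t      procno;
    FakeHeapNode phNode;
    bool         inHeap;
} ProcInfoLoose;

/* inHeap moved between procno and phNode: it reuses the existing gap. */
typedef struct ProcInfoPacked
{
    uint64_t     waitLSN;
    int32_t      procno;
    bool         inHeap;
    FakeHeapNode phNode;
} ProcInfoPacked;

/* Bytes saved per entry by the reordering on the current platform. */
static size_t
bytes_saved(void)
{
    assert(sizeof(ProcInfoLoose) >= sizeof(ProcInfoPacked));
    return sizeof(ProcInfoLoose) - sizeof(ProcInfoPacked);
}
```

On a typical 64-bit ABI one would expect `bytes_saved()` to report 8, though the exact figure depends on the platform's alignment rules.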

Non-code question: do you envision reusing the `WAIT` command for other cases?
Is the syntax rule in gram.y convenient enough for such reuse? I believe `LSN`
is not part of the syntax so as not to introduce a new keyword. But is that
the right way? I have no answer or strong opinion.

Otherwise, the patch looks quite strong to me.

-------
regards
Yura Sokolov

Re: Implement waiting for wal lsn replay: reloaded

From

Yura Sokolov

Date:

28 February, 16:03:33

17.02.2025 00:27, Alexander Korotkov wrote:
> On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <[email protected]> wrote:
>> I briefly looked into patch and have couple of minor remarks:
>>
>> 1. I don't like `palloc` in the `WaitLSNWakeup`. I believe it wont issue
>> problems, but still don't like it. I'd prefer to see local fixed array, say
>> of 16 elements, and loop around remaining function body acting in batch of
>> 16 wakeups. Doubtfully there will be more than 16 waiting clients often,
>> and even then it wont be much heavier than fetching all at once.
> 
> OK, I've refactored this to use static array of 16 size.  palloc() is
> used only if we don't fit static array.

I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep the indentation; perhaps `do {} while` would be better?
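
For readers following along, the `do {} while` variant of the batched wakeup could look roughly like the sketch below. It is not the patch's code: a sorted array stands in for the pairing heap, `set_latch()` merely records the wakeup, locking is only indicated in comments, and all names (`WaitQueue`, `queue_add`, `wake_waiters`, `WAKEUP_BATCH`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WAKEUP_BATCH 16         /* size of the fixed on-stack array */
#define MAX_WAITERS  64         /* capacity of this toy queue */

typedef struct Waiter
{
    uint64_t lsn;               /* LSN the process waits for */
    int      procno;            /* who to wake */
} Waiter;

typedef struct WaitQueue
{
    Waiter items[MAX_WAITERS];  /* kept in ascending LSN order */
    size_t count;
    int    wake_log[MAX_WAITERS];   /* demo only: order of wakeups */
    size_t wake_count;
    size_t batches;             /* demo only: non-empty batches processed */
} WaitQueue;

/* Insert a waiter, keeping the array sorted by LSN. */
static void
queue_add(WaitQueue *q, uint64_t lsn, int procno)
{
    size_t i = q->count;

    assert(q->count < MAX_WAITERS);
    while (i > 0 && q->items[i - 1].lsn > lsn)
    {
        q->items[i] = q->items[i - 1];
        i--;
    }
    q->items[i].lsn = lsn;
    q->items[i].procno = procno;
    q->count++;
}

/* Stand-in for setting a process latch. */
static void
set_latch(WaitQueue *q, int procno)
{
    assert(q->wake_count < MAX_WAITERS);
    q->wake_log[q->wake_count++] = procno;
}

/* Wake every waiter whose LSN is <= current_lsn, 16 at a time, no allocation. */
static void
wake_waiters(WaitQueue *q, uint64_t current_lsn)
{
    int    batch[WAKEUP_BATCH];
    size_t n;

    do
    {
        n = 0;

        /* The lock would be held here: detach up to one batch of waiters. */
        while (n < WAKEUP_BATCH && n < q->count &&
               q->items[n].lsn <= current_lsn)
        {
            batch[n] = q->items[n].procno;
            n++;
        }
        memmove(q->items, q->items + n, (q->count - n) * sizeof(Waiter));
        q->count -= n;

        /* The lock would be released here: wake the detached waiters. */
        for (size_t i = 0; i < n; i++)
            set_latch(q, batch[i]);
        if (n > 0)
            q->batches++;
    } while (n == WAKEUP_BATCH);    /* a full batch means more may remain */
}
```

With 40 waiters of which 36 are satisfied, this loop runs three non-empty batches (16, 16, 4) and leaves the remaining four queued.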

-------
regards
Yura Sokolov aka funny-falcon

Attachment

v3-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Yura Sokolov

Date:

28 February, 16:55:21

28.02.2025 16:03, Yura Sokolov wrote:
> 17.02.2025 00:27, Alexander Korotkov wrote:
>> On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <[email protected]> wrote:
>>> I briefly looked into patch and have couple of minor remarks:
>>>
>>> 1. I don't like `palloc` in the `WaitLSNWakeup`. I believe it wont issue
>>> problems, but still don't like it. I'd prefer to see local fixed array, say
>>> of 16 elements, and loop around remaining function body acting in batch of
>>> 16 wakeups. Doubtfully there will be more than 16 waiting clients often,
>>> and even then it wont be much heavier than fetching all at once.
>>
>> OK, I've refactored this to use static array of 16 size.  palloc() is
>> used only if we don't fit static array.
> 
> I've rebased patch and:
> - fixed compiler warning in wait.c ("maybe uninitialized 'result'").
> - made a loop without call to palloc in WaitLSNWakeup. It is with "goto" to
> keep indentation, perhaps `do {} while` would be better?

And fixed:
   'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
gram.y's bare_label_keyword rule

-------
regards
Yura Sokolov aka funny-falcon

Attachment

v4-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Alexander Korotkov

Date:

10 March, 14:30:31

On Fri, Feb 28, 2025 at 3:55 PM Yura Sokolov <[email protected]> wrote:
> 28.02.2025 16:03, Yura Sokolov wrote:
> > 17.02.2025 00:27, Alexander Korotkov wrote:
> >> On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <[email protected]> wrote:
> >>> I briefly looked into patch and have couple of minor remarks:
> >>>
> >>> 1. I don't like `palloc` in the `WaitLSNWakeup`. I believe it wont issue
> >>> problems, but still don't like it. I'd prefer to see local fixed array, say
> >>> of 16 elements, and loop around remaining function body acting in batch of
> >>> 16 wakeups. Doubtfully there will be more than 16 waiting clients often,
> >>> and even then it wont be much heavier than fetching all at once.
> >>
> >> OK, I've refactored this to use static array of 16 size.  palloc() is
> >> used only if we don't fit static array.
> >
> > I've rebased patch and:
> > - fixed compiler warning in wait.c ("maybe uninitialized 'result'").
> > - made a loop without call to palloc in WaitLSNWakeup. It is with "goto" to
> > keep indentation, perhaps `do {} while` would be better?
>
> And fixed:
>    'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
> gram.y's bare_label_keyword rule

Thank you, Yura.  I've further revised the patch.  Mostly, I added the
documentation, including an SQL command reference and a few paragraphs in
the high availability chapter explaining the read-your-writes
consistency concept.

------
Regards,
Alexander Korotkov
Supabase

Attachment

v5-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Yura Sokolov

Date:

12 March, 17:44:28

10.03.2025 14:30, Alexander Korotkov wrote:
> On Fri, Feb 28, 2025 at 3:55 PM Yura Sokolov <[email protected]> wrote:
>> 28.02.2025 16:03, Yura Sokolov wrote:
>>> 17.02.2025 00:27, Alexander Korotkov wrote:
>>>> On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <[email protected]> wrote:
>>>>> I briefly looked into patch and have couple of minor remarks:
>>>>>
>>>>> 1. I don't like `palloc` in the `WaitLSNWakeup`. I believe it wont issue
>>>>> problems, but still don't like it. I'd prefer to see local fixed array, say
>>>>> of 16 elements, and loop around remaining function body acting in batch of
>>>>> 16 wakeups. Doubtfully there will be more than 16 waiting clients often,
>>>>> and even then it wont be much heavier than fetching all at once.
>>>>
>>>> OK, I've refactored this to use static array of 16 size.  palloc() is
>>>> used only if we don't fit static array.
>>>
>>> I've rebased patch and:
>>> - fixed compiler warning in wait.c ("maybe uninitialized 'result'").
>>> - made a loop without call to palloc in WaitLSNWakeup. It is with "goto" to
>>> keep indentation, perhaps `do {} while` would be better?
>>
>> And fixed:
>>    'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
>> gram.y's bare_label_keyword rule
> 
> Thank you, Yura.  I've further revised the patch.  Mostly added the
> documentation including SQL command reference and few paragraphs in
> the high availability chapter explaining the read-your-writes
> consistency concept.

Good day, Alexander.

Looking at the patch "for the last time", I found that the
`pg_wal_replay_wait` function still remains in the documentation and in
one comment. So I fixed it in the documentation and removed the sentence
from the comment.

Otherwise v6 is just a rebased v5.

-------
regards
Yura Sokolov aka funny-falcon

Attachment

v6-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Tomas Vondra

Date:

13 March, 17:15:01

Hi,

I did a quick look at this patch. I haven't found any correctness
issues, but I have some general review comments and questions about the
grammar / syntax.

1) The sgml docs don't really show the syntax very nicely, it only shows
this at the beginning of wait_for.sgml:

   WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
'<replaceable class="parameter">value</replaceable>' [, ... ] ) ]

I kinda understand this comes from using the generic option list (I'll
get to that shortly), but I think it'd be much better to actually show
the "full" syntax here, instead of leaving the "parameters" to later.


2) The syntax description suggests "(" and ")" are required, but that
does not seem to be the case - in fact, it's not even optional, and when
I try using that, I get syntax error.


3) I have my doubts about using the generic_option_list for this. Yes, I
understand this allows using fewer reserved keywords, but it leads to
some weirdness and I'm not sure it's worth it. Not sure what the right
trade off is here.

Anyway, some examples of the weird stuff implied by this approach:

- it forces "," between the options, which is a clear difference from
what we do for every other command

- it forces everything to be a string, i.e. you can't say "TIMEOUT 10",
it has to be "TIMEOUT '10'"

I don't have a very strong opinion on this, but the result seems a bit
strange to me.


4) I'm not sure I understand the motivation of the "throw false" mode,
and I'm not sure I understand this description in the sgml docs:

    On timeout, or if the server is promoted before
    <parameter>lsn</parameter> is reached, an error is emitted,
    as soon as <parameter>throw</parameter> is not specified or set to
    true.
    If <parameter>throw</parameter> is set to false, then the command
    doesn't throw errors.

I find it a bit confusing. What is the use case for this mode?


5) One place in the docs says:

      The target log sequence number to wait for.

   This is literally the only place using "log sequence number" in our
   code base, I'd just use "LSN" just like every other place.


6) The docs for the TIMEOUT parameter say this:

   <varlistentry>
    <term><replaceable class="parameter">timeout</replaceable></term>
    <listitem>
     <para>
      When specified and greater than zero, the command waits until
      <parameter>lsn</parameter> is reached or the specified
      <parameter>timeout</parameter> has elapsed.  Must be a non-
      negative integer, the default is zero.
     </para>
    </listitem>
   </varlistentry>

   That doesn't say what unit the option uses. Is it seconds,
   milliseconds or what?

   In fact, it'd be nice to let users specify that in the value, similar
   to other options (e.g. SET statement_timeout = '10s').


7) One place in the docs says this:

    That is, after this function execution, the value returned by
    <function>pg_last_wal_replay_lsn</function> should be greater ...

  I think the reference to "function execution" is obsolete?


8) I find this confusing:

    However, if <command>WAIT FOR</command> is
    called on primary promoted from standby and <literal>lsn</literal>
    was already replayed, then the <command>WAIT FOR</command> command
    just exits immediately.

  Does this mean running the WAIT command on a primary (after it was
  already promoted) will exit immediately? Why does it matter that it
  was promoted from a standby? Shouldn't it exit immediately even for
  a standalone instance?


9) xlogwait.c

I think this should start with a basic "design" description of how the
wait is implemented, in a comment at the top of the file. That is, what
we keep in the shared memory, what happens during a wait, how it uses
the pairing heap, etc. After reading this comment I should understand
how it all fits together.


10) WaitForLSNReplay / WaitLSNWakeup

I think the function comment should document the important stuff (e.g.
return values for various situations, how it groups waiters into chunks
of 16 elements during wakeup, ...).


11) WaitLSNProcInfo / WaitLSNState

Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?


regards

-- 
Tomas Vondra

Re: Implement waiting for wal lsn replay: reloaded

From

vignesh C

Date:

16 March, 16:32:11

On Wed, 12 Mar 2025 at 20:14, Yura Sokolov <[email protected]> wrote:
>
> Otherwise v6 is just rebased v5.

I noticed that Tomas's comments from [1] are not yet addressed. I have
changed the commitfest status to Waiting on Author; please address
them and update it to Needs review.
[1] - https://www.postgresql.org/message-id/[email protected]

Regards,
Vignesh

Re: Implement waiting for wal lsn replay: reloaded

From

Alexander Korotkov

Date:

29 April, 14:27:25

Hi, Tomas.

Thank you so much for your review!  Please find the revised patchset.

On Thu, Mar 13, 2025 at 4:15 PM Tomas Vondra <[email protected]> wrote:
> I did a quick look at this patch. I haven't found any correctness
> issues, but I have some general review comments and questions about the
> grammar / syntax.
>
> 1) The sgml docs don't really show the syntax very nicely, it only shows
> this at the beginning of wait_for.sgml:
>
>    WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
> '<replaceable class="parameter">value</replaceable>' [, ... ] ) ]
>
> I kinda understand this comes from using the generic option list (I'll
> get to that shortly), but I think it'd be much better to actually show
> the "full" syntax here, instead of leaving the "parameters" to later.

Sounds reasonable, changed to show the full syntax in the synopsis.

> 2) The syntax description suggests "(" and ")" are required, but that
> does not seem to be the case - in fact, it's not even optional, and when
> I try using that, I get syntax error.

Good catch, fixed.

> 3) I have my doubts about using the generic_option_list for this. Yes, I
> understand this allows using fewer reserved keywords, but it leads to
> some weirdness and I'm not sure it's worth it. Not sure what the right
> trade off is here.
>
> Anyway, some examples of the weird stuff implied by this approach:
>
> - it forces "," between the options, which is a clear difference from
> what we do for every other command
>
> - it forces everything to be a string, i.e. you can' say "TIMEOUT 10",
> it has to be "TIMEOUT '10'"
>
> I don't have a very strong opinion on this, but the result seems a bit
> strange to me.

I've improved the syntax.  I still tried to keep the number of new
keywords and grammar rules minimal.  That leads to moving some parser
logic into wait.c.  This is probably a bit awkward, but saves our
grammar from bloat.  Let me know what you think about this
approach.

> 4) I'm not sure I understand the motivation of the "throw false" mode,
> and I'm not sure I understand this description in the sgml docs:
>
>     On timeout, or if the server is promoted before
>     <parameter>lsn</parameter> is reached, an error is emitted,
>     as soon as <parameter>throw</parameter> is not specified or set to
>     true.
>     If <parameter>throw</parameter> is set to false, then the command
>     doesn't throw errors.
>
> I find it a bit confusing. What is the use case for this mode?

The idea here is that an application could do some handling of these
errors without having to parse the error messages (parsing error
messages is inconvenient because of localization etc.).

> 5) One place in the docs says:
>
>       The target log sequence number to wait for.
>
>    Thie is literally the only place using "log sequence number" in our
>    code base, I'd just use "LSN" just like every other place.

OK fixed.

> 6) The docs for the TIMEOUT parameter say this:
>
>    <varlistentry>
>     <term><replaceable class="parameter">timeout</replaceable></term>
>     <listitem>
>      <para>
>       When specified and greater than zero, the command waits until
>       <parameter>lsn</parameter> is reached or the specified
>       <parameter>timeout</parameter> has elapsed.  Must be a non-
>       negative integer, the default is zero.
>      </para>
>     </listitem>
>    </varlistentry>
>
>    That doesn't say what unit does the option use. Is is seconds,
>    milliseconds or what?
>
>    In fact, it'd be nice to let users specify that in the value, similar
>    to other options (e.g. SET statement_timeout = '10s').

The default unit, milliseconds, is now specified.  Also, an alternative
way to specify the timeout is now supported: the timeout may be a string
literal consisting of a number and a unit specifier.
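
As a rough illustration of accepting a timeout either as a bare number of milliseconds or as a number with a unit, here is a minimal standalone sketch. It does not reproduce the patch's parsing code; `parse_timeout_ms()`, the accepted suffixes ("ms", "s", "min", "h"), and the `MAX_TIMEOUT_MS` cap are assumptions loosely modeled on the unit suffixes PostgreSQL accepts for settings such as statement_timeout.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed cap, chosen so the result also fits a 32-bit long. */
#define MAX_TIMEOUT_MS 2147483647.0

/*
 * Parse "250", "10s", "1.5 min", ... into milliseconds.
 * Returns false for empty, negative, unknown-unit, or out-of-range input.
 */
static bool
parse_timeout_ms(const char *text, int64_t *result_ms)
{
    char   *end;
    double  value;
    double  factor;

    errno = 0;
    value = strtod(text, &end);
    if (end == text || errno == ERANGE)
        return false;           /* no number at all, or out of double range */

    while (isspace((unsigned char) *end))
        end++;

    if (*end == '\0' || strcmp(end, "ms") == 0)
        factor = 1.0;           /* default unit: milliseconds */
    else if (strcmp(end, "s") == 0)
        factor = 1000.0;
    else if (strcmp(end, "min") == 0)
        factor = 60.0 * 1000.0;
    else if (strcmp(end, "h") == 0)
        factor = 3600.0 * 1000.0;
    else
        return false;           /* unknown unit */

    value *= factor;

    /* !(value >= 0) rejects negatives and NaN; the cap rejects overflow. */
    if (!(value >= 0.0) || value > MAX_TIMEOUT_MS)
        return false;

    *result_ms = (int64_t) (value + 0.5);   /* round to nearest millisecond */
    return true;
}
```

Rejecting out-of-range values before the integer conversion keeps the cast well defined; whether to reject or clamp such input is a design choice this sketch does not settle.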

> 7) One place in the docs says this:
>
>     That is, after this function execution, the value returned by
>     <function>pg_last_wal_replay_lsn</function> should be greater ...
>
>   I think the reference to "function execution" is obsolete?

Actually, this is just the function that reports the current replay LSN,
not the function introduced by a previous version of this patch.  We refer
to it just to express the constraint that the LSN must be replayed after
execution of the command.

> 8) I find this confusing:
>
>     However, if <command>WAIT FOR</command> is
>     called on primary promoted from standby and <literal>lsn</literal>
>     was already replayed, then the <command>WAIT FOR</command> command
>     just exits immediately.
>
>   Does this mean running the WAIT command on a primary (after it was
>   already promoted) will exit immediately? Why does it matter that it
>   was promoted from a standby? Shouldn't it exit immediately even for
>   a standalone instance?

I think the previous sentence should convey that otherwise an error
gets thrown.  That also happens immediately, for sure.

> 9) xlogwait.c
>
> I think this should start with a basic "design" description of how the
> wait is implemented, in a comment at the top of the file. That is, what
> we keep in the shared memory, what happens during a wait, how it uses
> the pairing heap, etc. After reading this comment I should understand
> how it all fits together.

OK, I've added the header comment.

> 10) WaitForLSNReplay / WaitLSNWakeup
>
> I think the function comment should document the important stuff (e.g.
> return values for various situations, how it groups waiters into chunks
> of 16 elements during wakeup, ...).

Revised header comments for those functions too.

> 11) WaitLSNProcInfo / WaitLSNState
>
> Does this need to be exposed in xlogwait.h? These structs seem private
> to xlogwait.c, so maybe declare it there?

Hmm, I don't remember why I moved them to xlogwait.h.  OK, moved them
back to xlogwait.c.


------
Regards,
Alexander Korotkov
Supabase

Attachment

v6-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Álvaro Herrera

Date:

05 August, 16:47:07

On 2025-Apr-29, Alexander Korotkov wrote:

> > 11) WaitLSNProcInfo / WaitLSNState
> >
> > Does this need to be exposed in xlogwait.h? These structs seem private
> > to xlogwait.c, so maybe declare it there?
> 
> Hmm, I don't remember why I moved them to xlogwait.h.  OK, moved them
> back to xlogwait.c.

This change made the code no longer compile, because
WaitLSNState->minWaitedLSN is used in xlogrecovery.c which no longer has
access to the field definition.  A rebased version with that change
reverted is attached.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)

Attachment

v7-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

07 August, 18:00:50

Hi,

Thanks for working on this.

I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon. In the meantime, I have
a quick question: is WAIT FOR REPLAY intended mainly for user-defined
functions, or can internal code invoke it as well?

During a recent performance run [1] I noticed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
also hints that we could replace this check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by walsender:

/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when the wait is long. I built
a POC patch that swaps polling for CVs, but a single global CV (or
even separate “flush” and “replay” CVs) isn’t ideal:

The wake-up routines don’t know which LSN each waiter cares about, so
they’d have to broadcast on every flush/replay. Caching the minimum
outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
eliminate them—multiple backends might wait for different LSNs
simultaneously. A more precise solution would require a request queue
that maps waiters to target LSNs and issues targeted wake-ups, adding
complexity.
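
To make the trade-off above concrete, here is a small sketch of a request table that maps waiters to target LSNs, caches the minimum outstanding target, and wakes only waiters whose target has been reached. It is a simplification under stated assumptions: a linear scan replaces any heap, there is no locking, and `RequestTable`, `table_add()`, and `table_wake()` are hypothetical names, not part of the patch or of my POC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REQUESTS 32
#define NO_WAITERS   UINT64_MAX     /* cached minimum when nobody waits */

typedef struct WaitRequest
{
    uint64_t target_lsn;
    bool     active;
} WaitRequest;

typedef struct RequestTable
{
    WaitRequest slots[MAX_REQUESTS];
    uint64_t    min_target;         /* cached minimum of active targets */
} RequestTable;

static void
table_init(RequestTable *t)
{
    for (size_t i = 0; i < MAX_REQUESTS; i++)
        t->slots[i].active = false;
    t->min_target = NO_WAITERS;
}

/* Register a waiter; returns its slot index, or -1 if the table is full. */
static int
table_add(RequestTable *t, uint64_t target_lsn)
{
    for (size_t i = 0; i < MAX_REQUESTS; i++)
    {
        if (!t->slots[i].active)
        {
            t->slots[i].target_lsn = target_lsn;
            t->slots[i].active = true;
            if (target_lsn < t->min_target)
                t->min_target = target_lsn;
            return (int) i;
        }
    }
    return -1;
}

/*
 * Called when flush/replay reaches 'lsn'.  Returns how many waiters were
 * woken; each of them has target_lsn <= lsn, so none is woken spuriously.
 */
static size_t
table_wake(RequestTable *t, uint64_t lsn)
{
    size_t   woken = 0;
    uint64_t new_min = NO_WAITERS;

    /* Fast path: nobody is waiting for an LSN this low. */
    if (lsn < t->min_target)
        return 0;

    for (size_t i = 0; i < MAX_REQUESTS; i++)
    {
        WaitRequest *r = &t->slots[i];

        if (!r->active)
            continue;
        if (r->target_lsn <= lsn)
        {
            r->active = false;      /* a real version would set a latch here */
            woken++;
        }
        else if (r->target_lsn < new_min)
            new_min = r->target_lsn;
    }
    t->min_target = new_min;
    return woken;
}
```

The cached minimum keeps the common case (no waiter satisfied yet) to a single comparison, which is the part a plain broadcast on every flush/replay cannot offer.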

Walsender accepts the potential broadcast overhead by using two CVs
for different waiters, so it might be acceptable for
read_local_xlog_page_guts() as well. However, if WAIT FOR REPLAY
becomes available to backend code, we might leverage it to eliminate
the polling for waiting replay in read_local_xlog_page_guts() without
introducing a bespoke dispatcher. I’d appreciate any thoughts on
whether that use case is in scope.

Best,
Xuneng

[1] https://www.postgresql.org/message-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com

Re: Implement waiting for wal lsn replay: reloaded

From

Alexander Korotkov

Date:

08 August, 09:54:18

Hello, Álvaro!

On Wed, Aug 6, 2025 at 6:01 AM Álvaro Herrera <[email protected]> wrote:
>
> On 2025-Apr-29, Alexander Korotkov wrote:
>
> > > 11) WaitLSNProcInfo / WaitLSNState
> > >
> > > Does this need to be exposed in xlogwait.h? These structs seem private
> > > to xlogwait.c, so maybe declare it there?
> >
> > Hmm, I don't remember why I moved them to xlogwait.h.  OK, moved them
> > back to xlogwait.c.
>
> This change made the code no longer compile, because
> WaitLSNState->minWaitedLSN is used in xlogrecovery.c which no longer has
> access to the field definition.  A rebased version with that change
> reverted is attached.

Thank you!  The rebased version looks correct to me.

------
Regards,
Alexander Korotkov
Supabase

Re: Implement waiting for wal lsn replay: reloaded

From

Alexander Korotkov

Date:

08 August, 10:08:49

Hi, Xuneng Zhou!

On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <[email protected]> wrote:
> Thanks for working on this.
>
> I’ve just come across this thread and haven’t had a chance to dig into
> the patch yet, but I’m keen to review it soon.

Great.  Thank you for your attention to this patch.  I appreciate your
intention to review it.

> In the meantime, I have
> a quick question: is WAIT FOR REPLAY intended mainly for user-defined
> functions, or can internal code invoke it as well?

Currently, WaitForLSNReplay() is assumed to be called only from a
backend, as the corresponding shmem is allocated only per-backend.  But
there is absolutely no problem tweaking the patch to allocate shmem
for every Postgres process.  This would make it possible to call
WaitForLSNReplay() wherever it is needed.  There is also no problem
extending this approach to support other kinds of LSNs, not just the
replay LSN.


> During a recent performance run [1] I noticed heavy polling in
> read_local_xlog_page_guts(). Heikki’s comment from a few months ago
> also hints that we could replace this check–sleep–repeat loop with the
> condition-variable (CV) infrastructure used by walsender:
>
> /*
>  * Loop waiting for xlog to be available if necessary
>  *
>  * TODO: The walsender has its own version of this function, which uses a
>  * condition variable to wake up whenever WAL is flushed. We could use the
>  * same infrastructure here, instead of the check/sleep/repeat style of
>  * loop.
>  */
>
> Because read_local_xlog_page_guts() waits for a specific flush or
> replay LSN, polling becomes inefficient when the wait is long. I built
> a POC patch that swaps polling for CVs, but a single global CV (or
> even separate “flush” and “replay” CVs) isn’t ideal:
>
> The wake-up routines don’t know which LSN each waiter cares about, so
> they’d have to broadcast on every flush/replay. Caching the minimum
> outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
> eliminate them—multiple backends might wait for different LSNs
> simultaneously. A more precise solution would require a request queue
> that maps waiters to target LSNs and issues targeted wake-ups, adding
> complexity.
>
> Walsender accepts the potential broadcast overhead by using two cvs
> for different waiters, so it might be acceptable for
> read_local_xlog_page_guts() as well. However, if WAIT FOR REPLAY
> becomes available to backend code, we might leverage it to eliminate
> the polling for waiting replay in read_local_xlog_page_guts() without
> introducing a bespoke dispatcher. I’d appreciate any thoughts on
> whether that use case is in scope.

This looks like a great new use case for the facilities developed in this
patch!  I'll remove the restriction to use WaitForLSNReplay() only in
backends.  I think you can write a patch with an additional pairing heap
for flush LSN and include that in the thread about
read_local_xlog_page_guts() optimization.  Let me know if you need any
assistance.

------
Regards,
Alexander Korotkov
Supabase

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

09 August, 13:52:37

Hi Alexander!

> > In the meantime, I have
> > a quick question: is WAIT FOR REPLAY intended mainly for user-defined
> > functions, or can internal code invoke it as well?
>
> Currently, WaitForLSNReplay() is assumed to only be called from
> backend, as corresponding shmem is allocated only per-backend.  But
> there is absolutely no problem to tweak the patch to allocate shmem
> for every Postgres process.  This would enable to call
> WaitForLSNReplay() wherever it is needed.  There is only no problem to
> extend this approach to support other kinds of LSNs not just replay
> LSN.

Thanks for extending the functionality of the Wait For Replay patch!

> This looks like a great new use-case for facilities developed in this
> patch!  I'll remove the restriction to use WaitForLSNReplay() only in
> backend.  I think you can write a patch with additional pairing heap
> for flush LSN and include that into thread about
> read_local_xlog_page_guts() optimization.  Let me know if you need any
> assistance.

This could be a more elegant approach which would solve the polling
issue well. I'll prepare a follow-up patch for it.

Best,
Xuneng

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

09 August, 14:27:25

Hi,

> On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <[email protected]> wrote:
> > Thanks for working on this.
> >
> > I’ve just come across this thread and haven’t had a chance to dig into
> > the patch yet, but I’m keen to review it soon.
>
> Great.  Thank you for your attention to this patch.  I appreciate your
> intention to review it.

I did a quick pass over v7. There are a few thoughts to share—mostly
around documentation, build, and tests, plus some minor nits. The core
logic looks solid to me. I’ll take a deeper look as I work on a
follow‑up patch to add waiting for flush LSNs. The patch also seems to
need a rebase; it can't be applied to HEAD cleanly for now.

Build
1) Consider adding a comma in `src/test/recovery/meson.build` after
`'t/048_vacuum_horizon_floor.pl'` so the list remains valid.

Core code
2) It may be safer for `WaitLSNWakeup()` to assert against the stack array size:
perhaps `Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);`
rather than `MaxBackends`.

For option parsing UX in `wait.c`, we might prefer:
3) Using `ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR),
errmsg(...)))` instead of `elog(ERROR, ...)` for consistency and
translatability.
4) Explicitly rejecting duplicate `LSN`/`TIMEOUT` options with a syntax error.
5) The result column label could align better with other utility
outputs if shortened to `status` (lowercase, no space).
6) After `parse_real()`, it could help to validate/clamp the timeout
to avoid overflow when converting to `int64` and when passing a `long`
to `WaitLatch()`.
7) If `nodes/print.h` in `src/backend/commands/wait.c` isn’t used, we
might drop the include.
8) A couple of comment nits: “do it this outside” → “do this outside”.

Tests
9) We might consider adding cases for:
- Negative `TIMEOUT` (to exercise the error path).
- Syntax errors (unknown option; duplicate `LSN`/`TIMEOUT`; missing `LSN`).

Documentation
`doc/src/sgml/ref/wait_for.sgml`
10) The index term could be updated to `<primary>WAIT FOR</primary>`.
11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds |
'duration-with-units'> ] [ NO_THROW ]
12) The purpose line might be smoother as “wait for a target LSN to be
replayed, optionally with a timeout”.
13) Return values might use `<literal>` for `success`, `timeout`, `not
in recovery`.
14) Consistently calling this a “command” (rather than
function/procedure) could reduce confusion.
15) The example text might read more cleanly as “If the target LSN is
not reached before the timeout …”.

`doc/src/sgml/high-availability.sgml`
16) The sentence could read “However, it is possible to address this
without switching to synchronous replication.”

`src/backend/utils/activity/wait_event_names.txt`
17) The description for `WAIT_FOR_WAL_REPLAY` might be clearer as
“Waiting for WAL replay to reach a target LSN on a standby.”
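
To make points 4 and 9 above more concrete (rejecting duplicate `LSN`/`TIMEOUT` options, unknown options, and a missing `LSN`), here is a minimal standalone sketch of such a check. It is not the code in `wait.c`: `WaitOption`, `OptCheck`, and `check_wait_options()` are hypothetical, and option names are assumed to arrive already lowercased.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum OptCheck
{
    OPTS_OK,
    OPTS_UNKNOWN_OPTION,
    OPTS_DUPLICATE_OPTION,
    OPTS_MISSING_LSN
} OptCheck;

/* One parsed "name value" pair; names are assumed lowercased by the caller. */
typedef struct WaitOption
{
    const char *name;
    const char *value;
} WaitOption;

/* Validate the option list; the first problem found decides the result. */
static OptCheck
check_wait_options(const WaitOption *opts, size_t nopts)
{
    bool seen_lsn = false;
    bool seen_timeout = false;

    for (size_t i = 0; i < nopts; i++)
    {
        bool *seen;

        assert(opts[i].name != NULL);

        if (strcmp(opts[i].name, "lsn") == 0)
            seen = &seen_lsn;
        else if (strcmp(opts[i].name, "timeout") == 0)
            seen = &seen_timeout;
        else
            return OPTS_UNKNOWN_OPTION;

        if (*seen)
            return OPTS_DUPLICATE_OPTION;   /* would map to a syntax error */
        *seen = true;
    }

    return seen_lsn ? OPTS_OK : OPTS_MISSING_LSN;
}
```

In the backend each non-OK result would presumably be reported through `ereport(ERROR, ...)` with a syntax error code, as suggested in point 3; the sketch only returns an enum so it can be tested in isolation.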

Best,
Xuneng

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

27 August, 18:54:25

Hi all,

I did a rebase for the patch to v8 and incorporated a few changes:

1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.

Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.

[1] https://www.postgresql.org/message-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com

Feedback welcome.

Best,
Xuneng

Attachment

v8-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Alexander Korotkov

Date:

13 September, 22:31:32

Hi, Xuneng!

On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <[email protected]> wrote:
> I did a rebase for the patch to v8 and incorporated a few changes:
>
> 1) Updated documentation, added new tests, and applied minor code
> adjustments based on prior review comments.
> 2) Tweaked the initialization of waitReplayLSNState so that
> non-backend processes can call wait for replay.
>
> Started a new thread [1] and attached a patch addressing the polling
> issue in the function
> read_local_xlog_page_guts built on the infra of patch v8.
>
> [1] https://www.postgresql.org/message-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
>
> Feedbacks welcome.

Thank you for reviewing and revising this patch.

I see you've integrated most of the points you expressed in [1].  I went
through them and I've integrated the rest of them, except this one.

> 11) The synopsis might read more clearly as:
> - WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]

I didn't find examples of how we do similar things elsewhere in the
docs.  This is why I decided to leave this place as it currently is.

Also, I found a mix-up in typedefs.list.  I've reverted the changes to
typedefs.list and re-indented the sources.

I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c.  I think this approach should minimize additional
keywords and states for parsing code.  This comes at the price of more
complex code in wait.c, but I think this is a fair price.

Links.
1. https://www.postgresql.org/message-id/CABPTF7VsoGDMBq34MpLrMSZyxNZvVbgH6-zxtJOg5AwOoYURbw%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Attachment

v9-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

14 September, 16:51:21

Hi Alexander,

On Sun, Sep 14, 2025 at 3:31 AM Alexander Korotkov <[email protected]> wrote:
>
> Hi, Xuneng!
>
> On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <[email protected]> wrote:
> > I did a rebase for the patch to v8 and incorporated a few changes:
> >
> > 1) Updated documentation, added new tests, and applied minor code
> > adjustments based on prior review comments.
> > 2) Tweaked the initialization of waitReplayLSNState so that
> > non-backend processes can call wait for replay.
> >
> > Started a new thread [1] and attached a patch addressing the polling
> > issue in the function
> > read_local_xlog_page_guts built on the infra of patch v8.
> >
> > [1] https://www.postgresql.org/message-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
> >
> > Feedback welcome.
>
> Thank you for your reviewing and revising this patch.
>
> I see you've integrated most of your points expressed in [1].  I went
> through them and I've integrated the rest of them.  Except this one.
>
> > 11) The synopsis might read more clearly as:
> > - WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]
>
> I didn't find examples on how we do the similar things on other places
> of docs.  This is why I decided to leave this place as it currently
> is.

+1. I re-checked other commands with similar parameter patterns, and
they follow the approach in v9.

>
> Also, I found some mess up with typedefs.list.  I've returned the
> changes to typedefs.list back and re-indented the sources.

Thanks for catching and fixing that.

> I'd like to ask your opinion of the way this feature is implemented in
> terms of grammar: generic parsing implemented in gram.y and the rest
> is done in wait.c.  I think this approach should minimize additional
> keywords and states for parsing code.  This comes at the price of more
> complex code in wait.c, but I think this is a fair price.

LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Best,
Xuneng

Re: Implement waiting for wal lsn replay: reloaded

From

Alexander Korotkov

Date:

15 September, 21:59:42

Hi, Xuneng!

On Sun, Sep 14, 2025 at 4:51 PM Xuneng Zhou <[email protected]> wrote:
>
> On Sun, Sep 14, 2025 at 3:31 AM Alexander Korotkov <[email protected]> wrote:
> > On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <[email protected]> wrote:
> > > I did a rebase for the patch to v8 and incorporated a few changes:
> > >
> > > 1) Updated documentation, added new tests, and applied minor code
> > > adjustments based on prior review comments.
> > > 2) Tweaked the initialization of waitReplayLSNState so that
> > > non-backend processes can call wait for replay.
> > >
> > > Started a new thread [1] and attached a patch addressing the polling
> > > issue in the function
> > > read_local_xlog_page_guts built on the infra of patch v8.
> > >
> > > [1] https://www.postgresql.org/message-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
> > >
> > > Feedback welcome.
> >
> > Thank you for your reviewing and revising this patch.
> >
> > I see you've integrated most of your points expressed in [1].  I went
> > through them and I've integrated the rest of them.  Except this one.
> >
> > > 11) The synopsis might read more clearly as:
> > > - WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]
> >
> > I didn't find examples on how we do the similar things on other places
> > of docs.  This is why I decided to leave this place as it currently
> > is.
>
> +1. I re-check other commands with similar parameter patterns, and
> they follow the approach in v9.
>
> >
> > Also, I found some mess up with typedefs.list.  I've returned the
> > changes to typedefs.list back and re-indented the sources.
>
>  Thanks for catching and fixing that.
>
> > I'd like to ask your opinion of the way this feature is implemented in
> > terms of grammar: generic parsing implemented in gram.y and the rest
> > is done in wait.c.  I think this approach should minimize additional
> > keywords and states for parsing code.  This comes at the price of more
> > complex code in wait.c, but I think this is a fair price.
>
> It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
> PUBLICATION - all use minimal grammar rules that produce generic
> option lists, with the actual interpretation done in their respective
> implementation files. The moderate complexity in wait.c seems
> acceptable.

The attached revision of the patch fixes the typo in the comment that
you reported off-list.

------
Regards,
Alexander Korotkov
Supabase

Attachment

v10-0001-Implement-WAIT-FOR-command.patch

Re: Implement waiting for wal lsn replay: reloaded

From

Álvaro Herrera

Date:

15 September, 23:24:05

On 2025-Sep-15, Alexander Korotkov wrote:

> > It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
> > PUBLICATION - all use minimal grammar rules that produce generic
> > option lists, with the actual interpretation done in their respective
> > implementation files. The moderate complexity in wait.c seems
> > acceptable.

Actually I find the code in ExecWaitStmt pretty unusual.  We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up.  Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
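
To illustrate the "loop on the list of DefElems" idea, here is a minimal
standalone sketch.  It is not code from the patch: WaitOption,
WaitSettings and parse_wait_options are hypothetical stand-ins (the real
code would walk a List of DefElem nodes and report problems with
ereport), and the error strings are invented.  Only the option names
timeout and no_throw are taken from the synopsis discussed in this
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for DefElem: a name optionally followed by a value. */
typedef struct WaitOption
{
    const char *name;       /* option name, e.g. "timeout" */
    const char *value;      /* option value, or NULL if none was given */
} WaitOption;

/* Hypothetical container for the interpreted options. */
typedef struct WaitSettings
{
    const char *timeout;    /* raw timeout text, NULL if not specified */
    bool        no_throw;   /* true if no_throw was specified */
} WaitSettings;

/*
 * Walk the option list once and fill *out.
 * Returns NULL on success, or a static error message (assumed wording)
 * for a duplicate, malformed, or unknown option.
 */
static const char *
parse_wait_options(const WaitOption *opts, size_t nopts, WaitSettings *out)
{
    bool        timeout_seen = false;
    bool        no_throw_seen = false;

    assert(out != NULL);
    assert(opts != NULL || nopts == 0);

    out->timeout = NULL;
    out->no_throw = false;

    for (size_t i = 0; i < nopts; i++)
    {
        const WaitOption *opt = &opts[i];

        if (strcmp(opt->name, "timeout") == 0)
        {
            if (timeout_seen)
                return "conflicting or redundant options";
            if (opt->value == NULL)
                return "timeout requires a value";
            timeout_seen = true;
            out->timeout = opt->value;
        }
        else if (strcmp(opt->name, "no_throw") == 0)
        {
            if (no_throw_seen)
                return "conflicting or redundant options";
            no_throw_seen = true;
            out->no_throw = true;
        }
        else
            return "unrecognized option";
    }

    return NULL;
}
```

With this shape, each option is matched by name in one place, so adding
a new option means adding one more branch rather than another
positional element that has to be matched up later.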

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that.  Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved.  Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
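
A rough sketch of the second alternative, checking the identifier in C
instead of adding a keyword.  The helper name wait_target_is_lsn is
hypothetical, and the sketch assumes the scanner has already folded the
unquoted identifier to lower case, so a plain strcmp against "lsn" is
enough.  In the grammar action, a false result would lead to a syntax
error report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the identifier following WAIT FOR names the LSN target.
 * Assumes the identifier was already downcased by the scanner.
 */
static bool
wait_target_is_lsn(const char *ident)
{
    return ident != NULL && strcmp(ident, "lsn") == 0;
}
```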

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

26 September, 14:22:42

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <[email protected]> wrote:
>
> On 2025-Sep-15, Alexander Korotkov wrote:
>
> > > It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
> > > PUBLICATION - all use minimal grammar rules that produce generic
> > > option lists, with the actual interpretation done in their respective
> > > implementation files. The moderate complexity in wait.c seems
> > > acceptable.
>
> Actually I find the code in ExecWaitStmt pretty unusual.  We tend to use
> lists of DefElem (a name optionally followed by a value) instead of
> individual scattered elements that must later be matched up.  Why not
> use utility_option_list instead and then loop on the list of DefElems?
> It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

> Also, we've found that failing to surround the options by parens leads
> to pain down the road, so maybe add that.  Given that the LSN seems to
> be mandatory, maybe make it something like
>
> WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
>
> This requires that you make LSN a keyword, albeit unreserved.  Or you
> could make it
> WAIT FOR Ident [the rest]
> and then ensure in C that the identifier matches the word LSN, such as
> we do for "permissive" and "restrictive" in
> RowSecurityDefaultPermissive.

I will make LSN an unreserved keyword as well.

Best,
Xuneng

Re: Implement waiting for wal lsn replay: reloaded

From

Xuneng Zhou

Date:

28 September, 12:02:43

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi Álvaro,
>
> Thanks for your review.
>
> On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <[email protected]> wrote:
> >
> > On 2025-Sep-15, Alexander Korotkov wrote:
> >
> > > > It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
> > > > PUBLICATION - all use minimal grammar rules that produce generic
> > > > option lists, with the actual interpretation done in their respective
> > > > implementation files. The moderate complexity in wait.c seems
> > > > acceptable.
> >
> > Actually I find the code in ExecWaitStmt pretty unusual.  We tend to use
> > lists of DefElem (a name optionally followed by a value) instead of
> > individual scattered elements that must later be matched up.  Why not
> > use utility_option_list instead and then loop on the list of DefElems?
> > It'd be a lot simpler.
>
> I took a look at commands like VACUUM and EXPLAIN and they do follow
> this pattern. v11 will make use of utility_option_list.
>
> > Also, we've found that failing to surround the options by parens leads
> > to pain down the road, so maybe add that.  Given that the LSN seems to
> > be mandatory, maybe make it something like
> >
> > WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
> >
> > This requires that you make LSN a keyword, albeit unreserved.  Or you
> > could make it
> > WAIT FOR Ident [the rest]
> > and then ensure in C that the identifier matches the word LSN, such as
> > we do for "permissive" and "restrictive" in
> > RowSecurityDefaultPermissive.
>
> Shall make LSN an unreserved keyword as well.

Here's the updated v11.  Many thanks to Jian for the off-list discussions and review.

Best,
Xuneng

Attachment

v11-0001-Implement-WAIT-FOR-command.patch