Conversation

@kpritam (Contributor) commented May 20, 2024

/claim #8792

tapSink now uses merge with HaltStrategy.Both, which should guarantee execution of both sides (left and right) to completion.

This regression was introduced in PR #8311.
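
For context, here is a minimal sketch of the merge-based shape under discussion, assuming ZIO 2's public API (Queue, Take, ZStream#merge with HaltStrategy.Both). The helper name tapSinkSketch is hypothetical; this is not the actual ZStream#tapSink implementation:

```scala
import zio._
import zio.stream._

// Hypothetical sketch: route elements through a queue into the sink and
// merge the two sides with HaltStrategy.Both, so neither side is halted
// before the other completes.
def tapSinkSketch[R, E, A](
  stream: ZStream[R, E, A],
  sink: ZSink[R, E, A, Any, Any]
): ZStream[R, E, A] =
  ZStream.unwrapScoped[R] {
    for {
      queue <- ZIO.acquireRelease(Queue.bounded[Take[E, A]](1))(_.shutdown)
      // Left side: re-emit elements while forwarding them to the queue,
      // signalling end-of-stream when done.
      left = stream
               .mapChunksZIO(chunk => queue.offer(Take.chunk(chunk)).as(chunk))
               .ensuring(queue.offer(Take.end))
      // Right side: run the sink against the queue; emits nothing.
      right = ZStream.fromZIO(ZStream.fromQueue(queue).flattenTake.run(sink)).drain
    } yield left.merge(right, ZStream.HaltStrategy.Both)
  }
```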

@varshith257 (Contributor) commented May 20, 2024

@kpritam I am not authorised to review, but I have also done some trial and error with tapSink to check its behaviour.

Changing from forkIn(scope) to forkDaemon might make the test pass, but it can introduce unpredictability and resource-management issues. The original use of forkIn(scope) ensures that all resources (queues, etc.) are cleaned up predictably within the scope. This is crucial for the reliable and consistent behaviour of tapSink, which is what the Daemon -> Scope change in #8311 introduced.

Using forkDaemon can lead to incomplete processing and resource leaks, as it doesn't guarantee that resources are cleaned up before the scope ends.

@varshith257 (Contributor) commented May 21, 2024

  • forkIn(scope): When a fiber is forked with forkIn(scope), it is tied to the lifecycle of the specified scope. This means the fiber's finalization will be awaited as part of the scope's finalization, which ensures that resources are cleaned up predictably within the scope. This is crucial for avoiding resource leaks and for ensuring that all side effects (such as updating a queue) complete before the scope ends.

  • forkDaemon: When a fiber is forked with forkDaemon, it runs independently of the scope that created it. The main process does not wait for daemon fibers to complete their work, and there is no guarantee that resources will be cleaned up before the scope ends. This can lead to unpredictable behaviour and resource leaks. (See the sketch below.)
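
A minimal, hedged demo of that lifecycle difference, assuming only core ZIO 2 operators (ZIO.scoped, ZIO.scopeWith, forkIn, forkDaemon); the object name ForkLifecycleDemo and the work effect are illustrative, not code from the zio repository:

```scala
import zio._

object ForkLifecycleDemo extends ZIOAppDefault {
  // Stand-in effect that reports whether it finished or was interrupted.
  val work: UIO[Unit] =
    (ZIO.sleep(1.second) *> ZIO.debug("work finished"))
      .onInterrupt(ZIO.debug("work interrupted"))

  val run =
    for {
      // forkIn(scope): the fiber's lifetime is bounded by the scope; when
      // the scope closes, the fiber is interrupted and its finalization is
      // awaited before ZIO.scoped returns.
      _ <- ZIO.scoped {
             ZIO.scopeWith(scope => work.forkIn(scope)) *> ZIO.sleep(10.millis)
           }
      _ <- ZIO.debug("scope closed, forked fiber already finalized")
      // forkDaemon: the fiber lives in the global scope; nothing awaits it,
      // so cleanup is entirely the caller's responsibility.
      fiber <- work.forkDaemon
      _     <- ZIO.debug("daemon fiber may still be running here")
      _     <- fiber.interrupt
    } yield ()
}
```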

cc @eyalfa This is the behaviour I expect from Scope and Daemon.

@kpritam (Contributor, Author) commented May 21, 2024

@varshith257 There are a bunch of things to unfold here. For the record, I understand the difference between forkIn(scope) and forkDaemon quite well, and the use of forkDaemon here is not just to make the test pass; it is intentional.

This is my understanding based on my limited knowledge of the codebase 😉:

  • The primary objective of PR #8311 (Ensure Queue Will Be Shutdown Before Awaiting It In ZStream#tapSink) was to fix a memory leak in the tapSink implementation. Those changes are retained: queues are properly shut down in tapSink as before.
  • Channels are properly closed using (queueReader >>> self).toPullIn(scope) & (queueReader >>> that).toPullIn(scope).
  • Note that I wanted to use pullL.fork.zipWith(pullR.fork), which passes the slow-sink test, but then the "preserves scope of inner fibers" test becomes flaky, which is why I had to fall back to the original implementation that used forkDaemon.
  • Also note that tapSink did guarantee execution of both sides to completion before the introduction of forkIn(scope), which tells me that #8792 (ZStream.tapSink: either a flaky test or flaky implementation) is indeed a valid bug. (A sketch of that guarantee follows this list.)
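
For reference, a hedged reconstruction of the guarantee at stake in #8792, assuming zio-test; the spec name and values are illustrative, not the actual test from the repository:

```scala
import zio._
import zio.stream._
import zio.test._

// Even with a deliberately slow sink, tapSink should deliver every element
// to the sink before the stream completes.
object TapSinkGuaranteeSpec extends ZIOSpecDefault {
  def spec =
    test("tapSink runs the sink to completion even when it is slow") {
      for {
        ref  <- Ref.make(Chunk.empty[Int])
        slow  = ZSink.foreach((i: Int) => ZIO.sleep(10.millis) *> ref.update(_ :+ i))
        _    <- ZStream.range(0, 10).tapSink(slow).runDrain
        seen <- ref.get
      } yield assertTrue(seen == Chunk.fromIterable(0 until 10))
    } @@ TestAspect.withLiveClock
}
```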

@eyalfa (Contributor) commented May 21, 2024

> @kpritam I am not authorised to review, but I have also done some trial and error with tapSink to check its behaviour.
>
> Changing from forkIn(scope) to forkDaemon might make the test pass, but it can introduce unpredictability and resource-management issues. The original use of forkIn(scope) ensures that all resources (queues, etc.) are cleaned up predictably within the scope. This is crucial for the reliable and consistent behaviour of tapSink, which is what the Daemon -> Scope change in #8311 introduced.
>
> Using forkDaemon can lead to incomplete processing and resource leaks, as it doesn't guarantee that resources are cleaned up before the scope ends.

I was about to submit this exact same comment myself

@eyalfa (Contributor) commented May 21, 2024

@kpritam I tend to categorize this as a bug as well; I can't really see how changing to forkDaemon guarantees anything, and I suspect it just changes the probabilities.
I'm not sure why @adamgfraser changed from fork to forkScope in the first place (I might have my thoughts on this, but I don't have the time to dive that deep into the code at the moment), but I wouldn't change it before having a full understanding of the effect of a fiber's runtime scope on the merge operator.

The way I see it, the issue is that the finalization code does not wait for the fiber running the sink. So the trick is first to make sure the sink can complete (by offering the final end marker into the queue) and then to wait for the sink fiber to complete. However, this is not always the correct behaviour: in the case of stream interruption or failure, we want to interrupt the sink as well (unless there's a requirement to guarantee that the sink sees all successful elements...).
I think the merge-based implementation basically attempts to achieve just this, and it does so in most cases, simply because it effectively awaits the sink fiber upon successful completion before finalization, so this await is interruptible.
What happens when the downstream cancels 'early' is that the upstream is never pulled again and its finalizer is invoked; this is when the sink fiber gets interrupted. Identifying this case is a bit unpleasant, as it requires the implementation to keep track of upstream completion/failure/interruption and try to figure out why the finalizer was invoked.
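
A hedged sketch of the finalization order described above: signal end-of-stream first, then await the sink fiber on success, but interrupt it on failure or interruption. The function name finalizeSink and its parameters mirror the discussion, not the exact zio internals:

```scala
import zio._
import zio.stream.Take

// Hypothetical finalizer for a queue-fed sink fiber, dispatching on how
// the stream ended.
def finalizeSink[E, A](
  queue: Queue[Take[E, A]],
  sinkFiber: Fiber[E, Any]
)(exit: Exit[E, Any]): UIO[Unit] =
  exit match {
    case Exit.Success(_) =>
      // Let the sink drain the remaining elements, then wait for it.
      queue.offer(Take.end) *> sinkFiber.await.unit
    case Exit.Failure(_) =>
      // On failure or interruption, don't leave the sink waiting forever.
      queue.shutdown *> sinkFiber.interrupt.unit
  }
```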

@jdegoes (Member) commented May 21, 2024

I think the correct fix for this will come about through a deeper understanding of the underlying race condition causing the bug.

@jdegoes closed this May 21, 2024