Improve workflow task deadlock and eviction #806


Merged: 4 commits into temporalio:main from lost-workflow-task-slots, Apr 4, 2025

Conversation

@cretz (Member) commented Mar 28, 2025

What was changed

TL;DR - Fixed an issue where, after a deadlock or a failed eviction, the slot/thread could never recover and the thread pool could starve even if the underlying problem later resolved itself. Also gave deadlocked threads a bit of a poke to unstick them.


There are issues where deadlocking or eviction-swallowing code eats up both workflow task slots and workflow executor threads. This PR makes some changes to help these situations.

Before this PR (i.e. today as of this writing), we have the following behavior:

  • Default workflow task thread pool max size is max(os.cpu_count(), 4), because in original development we naively assumed workflow tasks are CPU bound and there was no need for more threads than CPUs. But workflow tasks can do IO, and both Python and the OS can yield to other native threads if needed.
  • When a deadlock is detected, we respond to Core saying the task is deadlocked while the workflow task thread stays hung (or maybe it finishes at some point, we don't care). Core then gives us an eviction, which we try to process (on a separate workflow task thread from the pool, since the original thread hasn't been returned to the pool) by cancelling all tasks, but not all tasks get canceled because one is hung. Therefore the slot is never returned, the thread is never reusable, and eventually the worker can run out of slots.
  • When an eviction occurs, all outstanding asyncio tasks are canceled so they can be gracefully collected (because if you don't, GC wakes them up on other workflow threads, which is scary; we fixed that in Safe Eviction #499). But if a user swallows an eviction today (e.g. catching the asyncio cancel exception, which is a base exception, and then accidentally doing more blocking work), we log an exception and then never return an eviction response to Core, so it can't accept any more work for that run ID. This means a worker could never shut down because it had outstanding work that was never resolved. (A minimal sketch of this cancel-and-gather teardown follows below.)
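
The eviction teardown in the last bullet boils down to cancelling every outstanding asyncio task on the workflow's event loop and waiting for all of them to finish. A minimal sketch of that pattern (the helper name is hypothetical, not the SDK's code):

```python
import asyncio


# Hypothetical helper name; this only sketches the general shape of the
# teardown described above, not the SDK's actual internals.
async def _cancel_all_workflow_tasks() -> None:
    current = asyncio.current_task()
    tasks = [t for t in asyncio.all_tasks() if t is not current]
    for task in tasks:
        task.cancel()
    # Every task must acknowledge cancellation before teardown can complete.
    # A task that catches asyncio.CancelledError (a BaseException) and keeps
    # running blocks this gather forever, which is the "swallowed eviction"
    # case described above.
    await asyncio.gather(*tasks, return_exceptions=True)
```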

Problems with this:

  • Threads being fewer than slots means all threads could be consumed while we're still accepting work, causing new work to be enqueued on the thread pool and hit the deadlock timeout (even though it never actually started)
  • Deadlock that eventually is resolved does not free up the thread
  • Eviction that is eventually resolved does not free up the slot and does not make the run ID available for more work

With this PR the behavior now is:

  • Default workflow task thread pool max size is now max_concurrent_workflow_tasks, or 500 if that is unset (e.g. when a workflow tuner is supplied explicitly). This makes sure that deadlocks that eat threads don't affect other workflow tasks.
  • When a deadlock is detected, we make an advanced PyThreadState_SetAsyncExc C call, similar to sync activities, to attempt to unstick the thread (see the ctypes sketch after this list). This works for spinning loops and many IO calls, but not for things like time.sleep() or threading.Event's wait(), which block in C waiting to be woken. Still, this improves our ability to interrupt some stuck Python code. It technically could cause user issues if they were expecting their Python not to be interrupted even after the deadlock timeout, but we don't support code running successfully after deadlock timeout.
  • On eviction, if there was a deadlocked task (eviction comes right after deadlock, because deadlock is a workflow task failure which causes eviction), we wait on it forever, because we must have the thread unstuck to evict. This allows bad workflow code that happens to complete itself after the 2s deadlock timeout to properly complete and give resources back.
  • On eviction, we try to run the eviction (i.e. task teardown) process on a thread, continually, forever until it succeeds (see the retry sketch after this list). We give a descriptive error if it times out (same timeout as deadlock), because that usually means the workflow is swallowing the eviction exceptions (the task cancel or our own workflow-being-deleted-can't-run-workflow-call error). Running continually (with a 2s sleep between each attempt) allows an eviction that eventually succeeds to clear up and allow a worker to shut down if it can. We still log on worker shutdown if there are any evictions still trying to be processed, since they can (by intention) prevent worker shutdown from completing.
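
The deadlock interrupt in the second bullet relies on CPython's PyThreadState_SetAsyncExc. A rough sketch of that mechanism using ctypes (illustrative only; the SDK invokes it through its bridge layer, Runtime._raise_in_thread, rather than this exact code):

```python
import ctypes


class _InterruptDeadlockError(BaseException):
    # Stand-in for the exception type delivered to the stuck thread
    pass


def raise_in_thread(thread_id: int, exc_type: type) -> bool:
    """Ask CPython to raise exc_type asynchronously in the given thread.

    The exception is only delivered between Python bytecode instructions, so
    a spinning loop gets unstuck, while a thread blocked inside a C call
    (e.g. time.sleep() or threading.Event().wait()) won't notice until that
    call returns.
    """
    modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(exc_type)
    )
    if modified > 1:
        # Touched more than one thread state; undo rather than corrupt others
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_ulong(thread_id), None)
        return False
    return modified == 1
```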
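The last bullet's keep-trying-forever behavior has roughly this shape: wait on the teardown with a bounded timeout, log a descriptive error the first time it doesn't finish, and never give up. A sketch assuming a hypothetical evict_once teardown callable (the real implementation differs in detail):

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)

DEADLOCK_TIMEOUT_SECONDS = 2  # same 2s timeout used for deadlock detection


def run_eviction_until_done(
    executor: concurrent.futures.ThreadPoolExecutor,
    evict_once,  # hypothetical teardown callable
) -> None:
    seen_fail = False
    future = executor.submit(evict_once)
    while True:
        try:
            # Bounded wait on the teardown work; the real code also sleeps
            # between attempts, per the description above
            future.result(timeout=DEADLOCK_TIMEOUT_SECONDS)
            return  # Teardown finished: the slot and run ID are usable again
        except concurrent.futures.TimeoutError:
            # Only want to log and mark as could-not-evict once
            if not seen_fail:
                seen_fail = True
                logger.error(
                    "Timed out running eviction; the workflow may be swallowing "
                    "the eviction exception (e.g. catching BaseException)"
                )
            # Loop again: an eviction that eventually succeeds still frees the
            # slot and lets worker shutdown complete
```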

Checklist

  1. Closes [Bug] Avoid losing worker slot on error while processing cache eviction #784

@cretz cretz requested a review from a team as a code owner March 28, 2025 20:38
@Sushisource (Member) left a comment:

> but it doesn't work for things like time.sleep() or threading.Event's wait() which are reactive.

Maybe that could be solved by somehow intercepting those cpython calls or something, but, definitely not worth the effort.
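
For illustration, the reason those calls are immune is that the async exception is only delivered between Python bytecode instructions; a thread parked in a C-level wait never reaches the next instruction until the wait returns. A small standalone demo of the difference (nothing SDK-specific, assumes CPython):

```python
import ctypes
import threading
import time


class _Interrupt(BaseException):
    pass


def _async_raise(thread_id: int) -> None:
    # Deliver _Interrupt at the target thread's next bytecode boundary
    ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(_Interrupt)
    )


def spin() -> None:
    try:
        while True:
            pass  # pure Python bytecode: interruptible
    except _Interrupt:
        pass


def wait() -> None:
    threading.Event().wait()  # parked in a C-level wait: not interruptible this way


for target in (spin, wait):
    t = threading.Thread(target=target, daemon=True)
    t.start()
    time.sleep(0.5)
    _async_raise(t.ident)
    t.join(timeout=2)
    print(f"{target.__name__} unstuck: {not t.is_alive()}")  # spin: True, wait: False
```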

Comment on lines +411 to +412
# Only want to log and mark as could not evict once
if not seen_fail:
Member:

IMO it could be worth logging this repeatedly, every 1 minute or something, just so it's extra obvious to people they've got something stuck.

Not blocking tho.

@cretz (Member, Author):

We've found that either people just haven't enabled logging, or they found this log. We haven't seen situations where the log went unseen by users due to age. I'd rather not add a timer mechanism for this unless I have to.

@THardy98 (Contributor) left a comment:

Thanks for the PR description, really nice.

As I understand it, our approach here is to:

  • issue more threads so there's a buffer in case of lost threads
    • the jump from 4 to 500 threads sounds like a lot, but I'm presuming that memory pressure isn't really a concern (I don't know how memory heavy threads are in Python)
  • keep trying to evict the deadlocked task, under the assumption that it will complete at some point
    • if the task truly is deadlocked, then we're stuck where we are today, but with some optimistic eviction-retry behavior (and some nice log entries)

One question I have:

Is it possible for a deadlocked workflow to continue receiving tasks/activations? (If so, I suppose this would imply they are sent to a different task slot/thread.) Asking because it might be prudent to check whether deadlocked_activation_task is set for tasks that arrive after the deadlocked task (if possible).

@cretz (Member, Author) commented Mar 31, 2025

> Maybe that could be solved by somehow intercepting those cpython calls or something, but, definitely not worth the effort.

Yeah, we're already using a not-commonly-recommended C-only call to interrupt even the interpreter; anything more would be a bit too far (I haven't researched it, but I think it would be rough, if even doable).

> the jump from 4 to 500 threads sounds like a lot, but I'm presuming that memory pressure isn't really a concern (I don't know how memory heavy threads are in Python)

For many people this will be max_concurrent_workflow_tasks, not 500. And this number is a maximum, not the number of threads eagerly created. The amount of concurrent workflow tasks is handled by the user-controlled tuner (or tuner options like max_concurrent_workflow_tasks), so that's where they'd limit tasks and other things that affect memory.
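
To illustrate (a minimal sketch, nothing SDK-specific): max_workers is only an upper bound, and ThreadPoolExecutor spawns worker threads lazily as work arrives.

```python
import concurrent.futures

# A cap of 500 does not allocate 500 threads up front; a worker thread is only
# spawned when work is submitted and no idle worker is available.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=500)
print(executor.submit(sum, [1, 2, 3]).result())  # 6, using a single worker thread
executor.shutdown()
```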

> if the task truly is deadlocked, then we're stuck where we are today, but with some optimistic eviction-retry behavior (and some nice log entries)

Yup, nothing we can do if they are truly deadlocked except continue to eat a thread/task.

> Is it possible for a deadlocked workflow to continue receiving tasks/activations?

Traditionally yes, but in our case, intentionally no. So we do respond to Core on deadlock so that it can fail the workflow task. Then Core comes back and asks us to evict and this is where we intentionally hang. By hanging the ability to evict this run ID, Core will not accept new work for the run ID (i.e. will return eager failure to server if server sends us a task for it).

So server, on workflow task fail, will keep trying the task. And every one that lands on this worker (unsure if sticky is retained in this situation) will be eagerly failed by Core IIUC. Every one that lands on another worker may proceed but may end up with the same deadlock issue on that worker. (cc @Sushisource if I have any of this incorrect)

@THardy98 (Contributor):

> By hanging the ability to evict this run ID, Core will not accept new work for the run ID (i.e. will return eager failure to server if server sends us a task for it).

Ah ok, this is the key bit, I did not know this.

> So server, on workflow task fail, will keep trying the task. And every one that lands on this worker (unsure if sticky is retained in this situation) will be eagerly failed by Core IIUC. Every one that lands on another worker may proceed but may end up with the same deadlock issue on that worker. (cc @Sushisource if I have any of this incorrect)

Interesting. I'm only familiar with sticky queues being reset/dropped (idk the right terminology for this) when the worker dies (i.e. as described in this doc).

if deadlocked_thread_id:
    temporalio.bridge.runtime.Runtime._raise_in_thread(
        deadlocked_thread_id, _InterruptDeadlockError
    )
Contributor:

The timed-out task could have completed between line 275 and line 289. In that case, the event loop in the thread that it was running on may now be processing a task from an unrelated workflow, and we will throw an exception in that workflow. Is that what you were referring to here?

> It technically could cause user issues if they were expecting their Python not to be interrupted even after deadlock timeout, but we don't support code running successfully after deadlock timeout.

@cretz (Member, Author) Mar 31, 2025:

> In that case, the event loop in the thread that it was running on may now be processing a task from an unrelated workflow

Hrmm, this is true. I wonder if there is a reasonable way without too much effort to prevent the thread from being returned to the pool if we encounter a deadlock. Hopefully it doesn't require a mutex or us to hang the activate call open on some thread-blocking thing if we see deadlock. But it may. I will think on it. Any ideas? If it's too complicated we may have to scrap the idea of interrupting the deadlock this way.

> Is that what you were referring to here?

No, this was something else (technically we interrupt their deadlock where we didn't before, possibly causing other code to run)

@cretz (Member, Author) Apr 1, 2025:

Ok, at the likely-negligible cost of creating a mutex for every workflow instance, I added a check to make sure interruption only happens if still in thread. Take a look at 106ed0c and give feedback.
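
Roughly, such a guard takes the following shape (a sketch with made-up names; the actual change is in 106ed0c and may be structured differently): the interrupter checks and raises under the same lock that the activation uses to record and clear its thread id.

```python
import threading
from typing import Callable, Optional


class _InterruptGuardSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._thread_id: Optional[int] = None

    def run_activation(self, activate: Callable[[], None]) -> None:
        with self._lock:
            self._thread_id = threading.get_ident()
        try:
            activate()
        finally:
            # Clear under the same lock so an interrupter can no longer target
            # this thread once the activation is done with it
            with self._lock:
                self._thread_id = None

    def interrupt_if_still_running(self, raise_in_thread: Callable[[int], None]) -> bool:
        with self._lock:
            if self._thread_id is None:
                return False  # Activation already finished; don't touch the thread
            raise_in_thread(self._thread_id)
            return True
```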

@dandavison (Contributor) commented Apr 1, 2025

Here's an explanation of this PR via sequence diagrams:

[sequence diagram image]

@dandavison (Contributor) left a comment:

Very nice. Impressive PR with clever tests that I imagine I'll refer back to. I tried breaking the tests in a few ways and failed.

Comment on lines 67 to 68
workflow_task_executor=workflow_task_executor
or concurrent.futures.ThreadPoolExecutor(),
Contributor:

nit: in syntactic situations like this it's usual to wrap in parens for clearer indentation. Suggest we adopt that style.

Suggested change:

- workflow_task_executor=workflow_task_executor
- or concurrent.futures.ThreadPoolExecutor(),
+ workflow_task_executor=(
+     workflow_task_executor or concurrent.futures.ThreadPoolExecutor()
+ ),

@cretz (Member, Author):

Ruff often does this for me in other multiline expression situations; I wonder why it did not here.

Contributor:

This didn't have parens; I don't think ruff will actually add the missing paren characters.

@cretz (Member, Author):

Hrmm, I think I have seen it add parens in other situations where they weren't present. I can add them here though.

@cretz (Member, Author):

Added

# Wait for task fail
await assert_task_fail_eventually(handle, message_contains="deadlock")
# Confirm could not be interrupted
assert deadlock_uninterruptible_completed == 0
Contributor:

Would it be more correct to assert that waiting for >0 times out? Or is there a reason this line does not need to wait but similar checks in this and the interruptible test use assert_eventually?

@cretz (Member, Author) Apr 4, 2025:

Hrmm, yes, technically it could have been interrupted after the task failed (though that would require a server round trip to go faster than a Python interpreter thread, which is why I'm not too worried about it). All of the eventual assertions are for positive assertions; we usually don't want timing out to be a valid thing that happens in successful tests (for test perf reasons, though sometimes we have to, like for deadlock itself). Can we think of another way to confirm that it cannot be interrupted? Would you settle for removing this assertion and having a whole new test that proves that threading.Event's wait is uninterruptible via our interruption mechanism?

Or can we accept that this may usually fail but not guaranteed fail if Python updates the interruptibility of this call?

@dandavison (Contributor) Apr 4, 2025:

I don't have a strong opinion. Would it help to put a mock object around _handle_cache_eviction? Perhaps waiting for it to be called, confirming that we haven't hit the finally, and then setting the event. Something involving something like that?

@cretz (Member, Author):

I have traditionally tried to avoid internal expectations on these tests and rather test behavior from the outside as a user may. I think it's probably ok to say that this assertion will just "usually" fail (i.e. fail like 99.9% of the time unless the server somehow does its round trip before this Python thread can execute its couple of lines) if Python stdlib changes behavior here instead of "always". It'll still be noticeable by us.

Contributor:

Yes, that's OK with me too. Maybe add a comment explaining the technically possible thing.

@cretz (Member, Author):

👍 Didn't get to this pre-merge, may be able to in another PR

cretz added 2 commits April 4, 2025 11:30, merging main into the lost-workflow-task-slots branch (conflicts resolved in tests/worker/test_workflow.py)
@cretz cretz merged commit 7ffa822 into temporalio:main Apr 4, 2025
13 checks passed
@cretz cretz deleted the lost-workflow-task-slots branch April 4, 2025 19:27