Improve workflow task deadlock and eviction #806


Merged: 4 commits into temporalio:main from lost-workflow-task-slots, Apr 4, 2025

Conversation

@cretz (Member) commented Mar 28, 2025

What was changed

TL;DR - Fixed an issue where, after a deadlock or a failed eviction, the slot/thread could never recover and the thread pool could starve even if the underlying problem later resolved itself. Also gave deadlocked threads a bit of a poke to unstick them.


There are issues where deadlocking or eviction-swallowing code eats up both workflow task slots and workflow executor threads. This PR makes some changes to help these situations.

Before this PR (i.e. today as of this writing), we have the following behavior:

  • Default workflow task thread pool max size is max(os.cpu_count(), 4), because in original development we naively assumed workflow tasks are CPU bound and there was no need for more threads than CPUs. But workflow tasks can do IO, and both Python and the OS can yield to other native threads if needed.
  • When a deadlock is detected, we respond to Core saying the task is deadlocked while the workflow task thread stays hung (or maybe it finishes at some point, we don't care). Core then gives us an eviction, which we try to process (on a separate workflow task thread from the pool, since the original thread hasn't been returned to the pool) by cancelling all tasks, but not all tasks get canceled because one is hung. Therefore the slot is never returned, the thread is never reusable, and eventually the worker can run out of slots.
  • When an eviction occurs, all outstanding asyncio tasks are canceled so they can be gracefully collected (because if you don't, GC wakes them up on other workflow threads, which is scary; we fixed that in Safe Eviction #499). But if a user swallows an eviction today (e.g. catching the asyncio cancel exception, which is a base exception, and then accidentally doing more blocking work), we log an exception and then never return an eviction response to Core, so it can't accept any more work for that run ID. This means a worker could never shut down because it had outstanding work that was never resolved. (A minimal sketch of this cancel-and-gather teardown follows below.)
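
The eviction teardown in the last bullet boils down to cancelling every outstanding asyncio task on the workflow's event loop and waiting for all of them to finish. A minimal sketch of that pattern (the helper name is hypothetical, not the SDK's code):

```python
import asyncio


# Hypothetical helper name; this only sketches the general shape of the
# teardown described above, not the SDK's actual internals.
async def _cancel_all_workflow_tasks() -> None:
    current = asyncio.current_task()
    tasks = [t for t in asyncio.all_tasks() if t is not current]
    for task in tasks:
        task.cancel()
    # Every task must acknowledge cancellation before teardown can complete.
    # A task that catches asyncio.CancelledError (a BaseException) and keeps
    # running blocks this gather forever, which is the "swallowed eviction"
    # case described above.
    await asyncio.gather(*tasks, return_exceptions=True)
```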

Problems with this:

  • Threads being fewer than slots means all threads could be consumed while we're still accepting work, causing new work to be enqueued on the thread pool and hit the deadlock timeout (even though it never actually started)
  • Deadlock that eventually is resolved does not free up the thread
  • Eviction that is eventually resolved does not free up the slot and does not make the run ID available for more work

With this PR the behavior now is:

  • Default workflow task thread pool max size is now max_concurrent_workflow_tasks, or 500 if that is unset (e.g. when a workflow tuner is supplied explicitly). This makes sure that deadlocks that eat threads don't affect other workflow tasks.
  • When a deadlock is detected, we make an advanced PyThreadState_SetAsyncExc C call, similar to sync activities, to attempt to unstick the thread (see the ctypes sketch after this list). This works for spinning loops and many IO calls, but not for things like time.sleep() or threading.Event's wait(), which block in C waiting to be woken. Still, this improves our ability to interrupt some stuck Python code. It technically could cause user issues if they were expecting their Python not to be interrupted even after the deadlock timeout, but we don't support code running successfully after deadlock timeout.
  • On eviction, if there was a deadlocked task (eviction comes right after deadlock, because deadlock is a workflow task failure which causes eviction), we wait on it forever, because we must have the thread unstuck to evict. This allows bad workflow code that happens to complete itself after the 2s deadlock timeout to properly complete and give resources back.
  • On eviction, we try to run the eviction (i.e. task teardown) process on a thread, continually, forever until it succeeds (see the retry sketch after this list). We give a descriptive error if it times out (same timeout as deadlock), because that usually means the workflow is swallowing the eviction exceptions (the task cancel or our own workflow-being-deleted-can't-run-workflow-call error). Running continually (with a 2s sleep between each attempt) allows an eviction that eventually succeeds to clear up and allow a worker to shut down if it can. We still log on worker shutdown if there are any evictions still trying to be processed, since they can (by intention) prevent worker shutdown from completing.
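
The deadlock interrupt in the second bullet relies on CPython's PyThreadState_SetAsyncExc. A rough sketch of that mechanism using ctypes (illustrative only; the SDK invokes it through its bridge layer, Runtime._raise_in_thread, rather than this exact code):

```python
import ctypes


class _InterruptDeadlockError(BaseException):
    # Stand-in for the exception type delivered to the stuck thread
    pass


def raise_in_thread(thread_id: int, exc_type: type) -> bool:
    """Ask CPython to raise exc_type asynchronously in the given thread.

    The exception is only delivered between Python bytecode instructions, so
    a spinning loop gets unstuck, while a thread blocked inside a C call
    (e.g. time.sleep() or threading.Event().wait()) won't notice until that
    call returns.
    """
    modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(exc_type)
    )
    if modified > 1:
        # Touched more than one thread state; undo rather than corrupt others
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_ulong(thread_id), None)
        return False
    return modified == 1
```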
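The last bullet's keep-trying-forever behavior has roughly this shape: wait on the teardown with a bounded timeout, log a descriptive error the first time it doesn't finish, and never give up. A sketch assuming a hypothetical evict_once teardown callable (the real implementation differs in detail):

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)

DEADLOCK_TIMEOUT_SECONDS = 2  # same 2s timeout used for deadlock detection


def run_eviction_until_done(
    executor: concurrent.futures.ThreadPoolExecutor,
    evict_once,  # hypothetical teardown callable
) -> None:
    seen_fail = False
    future = executor.submit(evict_once)
    while True:
        try:
            # Bounded wait on the teardown work; the real code also sleeps
            # between attempts, per the description above
            future.result(timeout=DEADLOCK_TIMEOUT_SECONDS)
            return  # Teardown finished: the slot and run ID are usable again
        except concurrent.futures.TimeoutError:
            # Only want to log and mark as could-not-evict once
            if not seen_fail:
                seen_fail = True
                logger.error(
                    "Timed out running eviction; the workflow may be swallowing "
                    "the eviction exception (e.g. catching BaseException)"
                )
            # Loop again: an eviction that eventually succeeds still frees the
            # slot and lets worker shutdown complete
```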

Checklist

  1. Closes [Bug] Avoid losing worker slot on error while processing cache eviction #784

@cretz cretz requested a review from a team as a code owner March 28, 2025 20:38
@Sushisource (Member) left a comment:

> but it doesn't work for things like time.sleep() or threading.Event's wait() which are reactive.

Maybe that could be solved by somehow intercepting those cpython calls or something, but, definitely not worth the effort.
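
For illustration, the reason those calls are immune is that the async exception is only delivered between Python bytecode instructions; a thread parked in a C-level wait never reaches the next instruction until the wait returns. A small standalone demo of the difference (nothing SDK-specific, assumes CPython):

```python
import ctypes
import threading
import time


class _Interrupt(BaseException):
    pass


def _async_raise(thread_id: int) -> None:
    # Deliver _Interrupt at the target thread's next bytecode boundary
    ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(_Interrupt)
    )


def spin() -> None:
    try:
        while True:
            pass  # pure Python bytecode: interruptible
    except _Interrupt:
        pass


def wait() -> None:
    threading.Event().wait()  # parked in a C-level wait: not interruptible this way


for target in (spin, wait):
    t = threading.Thread(target=target, daemon=True)
    t.start()
    time.sleep(0.5)
    _async_raise(t.ident)
    t.join(timeout=2)
    print(f"{target.__name__} unstuck: {not t.is_alive()}")  # spin: True, wait: False
```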

Comment on lines +411 to +412
# Only want to log and mark as could not evict once
if not seen_fail:
Member:

IMO it could be worth logging this repeatedly, every 1 minute or something, just so it's extra obvious to people they've got something stuck.

Not blocking tho.

@cretz (Member, Author):

We've found that either people just haven't enabled logging, or they found this log. We haven't seen situations where the log went unseen by users due to age. I'd rather not add a timer mechanism for this unless I have to.

@THardy98 (Contributor) left a comment:

Thanks for the PR description, really nice.

As I understand it, our approach here is to:

  • issue more threads so there's a buffer in case of lost threads
    • the jump from 4 to 500 threads sounds like a lot, but I'm presuming that memory pressure isn't really a concern (I don't know how memory heavy threads are in Python)
  • keep trying to evict the deadlocked task, under the assumption that it will complete at some point
    • if the task truly is deadlocked, then we're stuck where we are today, but with some optimistic eviction-retry behavior (and some nice log entries)

One question I have:

Is it possible for a deadlocked workflow to continue receiving tasks/activations? (If so, I suppose this would imply they are sent to a different task slot/thread.) Asking because it might be prudent to check whether deadlocked_activation_task is set for tasks that arrive after the deadlocked task (if possible).

@cretz (Member, Author) commented Mar 31, 2025

> Maybe that could be solved by somehow intercepting those cpython calls or something, but, definitely not worth the effort.

Yeah, we're already using a not-commonly-recommended C-only call to interrupt even the interpreter; anything more would be a bit too far (I haven't researched it, but I think it would be rough, if even doable).

> the jump from 4 to 500 threads sounds like a lot, but I'm presuming that memory pressure isn't really a concern (I don't know how memory heavy threads are in Python)

For many people this will be max_concurrent_workflow_tasks, not 500. And this number is a maximum, not the number of threads eagerly created. The amount of concurrent workflow tasks is handled by the user-controlled tuner (or tuner options like max_concurrent_workflow_tasks), so that's where they'd limit tasks and other things that affect memory.
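
To illustrate (a minimal sketch, nothing SDK-specific): max_workers is only an upper bound, and ThreadPoolExecutor spawns worker threads lazily as work arrives.

```python
import concurrent.futures

# A cap of 500 does not allocate 500 threads up front; a worker thread is only
# spawned when work is submitted and no idle worker is available.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=500)
print(executor.submit(sum, [1, 2, 3]).result())  # 6, using a single worker thread
executor.shutdown()
```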

> if the task truly is deadlocked, then we're stuck where we are today, but with some optimistic eviction-retry behavior (and some nice log entries)

Yup, nothing we can do if they are truly deadlocked except continue to eat a thread/task.

> Is it possible for a deadlocked workflow to continue receiving tasks/activations?

Traditionally yes, but in our case, intentionally no. So we do respond to Core on deadlock so that it can fail the workflow task. Then Core comes back and asks us to evict and this is where we intentionally hang. By hanging the ability to evict this run ID, Core will not accept new work for the run ID (i.e. will return eager failure to server if server sends us a task for it).

So server, on workflow task fail, will keep trying the task. And every one that lands on this worker (unsure if sticky is retained in this situation) will be eagerly failed by Core IIUC. Every one that lands on another worker may proceed but may end up with the same deadlock issue on that worker. (cc @Sushisource if I have any of this incorrect)

@THardy98 (Contributor):

> By hanging the ability to evict this run ID, Core will not accept new work for the run ID (i.e. will return eager failure to server if server sends us a task for it).

Ah ok, this is the key bit, I did not know this.

> So server, on workflow task fail, will keep trying the task. And every one that lands on this worker (unsure if sticky is retained in this situation) will be eagerly failed by Core IIUC. Every one that lands on another worker may proceed but may end up with the same deadlock issue on that worker. (cc @Sushisource if I have any of this incorrect)

Interesting. I'm only familiar with sticky queues being reset/dropped (idk the right terminology for this) when the worker dies (i.e. as described in this doc).

if deadlocked_thread_id:
    temporalio.bridge.runtime.Runtime._raise_in_thread(
        deadlocked_thread_id, _InterruptDeadlockError
    )
Contributor:

The timed-out task could have completed between line 275 and line 289. In that case, the event loop in the thread that it was running on may now be processing a task from an unrelated workflow, and we will throw an exception in that workflow. Is that what you were referring to here?

> It technically could cause user issues if they were expecting their Python not to be interrupted even after deadlock timeout, but we don't support code running successfully after deadlock timeout.

@cretz (Member, Author) Mar 31, 2025:

> In that case, the event loop in the thread that it was running on may now be processing a task from an unrelated workflow

Hrmm, this is true. I wonder if there is a reasonable way without too much effort to prevent the thread from being returned to the pool if we encounter a deadlock. Hopefully it doesn't require a mutex or us to hang the activate call open on some thread-blocking thing if we see deadlock. But it may. I will think on it. Any ideas? If it's too complicated we may have to scrap the idea of interrupting the deadlock this way.

> Is that what you were referring to here?

No, this was something else (technically we interrupt their deadlock where we didn't before, possibly causing other code to run)

@cretz (Member, Author) Apr 1, 2025:

Ok, at the likely-negligible cost of creating a mutex for every workflow instance, I added a check to make sure interruption only happens if still in thread. Take a look at 106ed0c and give feedback.
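
Roughly, such a guard takes the following shape (a sketch with made-up names; the actual change is in 106ed0c and may be structured differently): the interrupter checks and raises under the same lock that the activation uses to record and clear its thread id.

```python
import threading
from typing import Callable, Optional


class _InterruptGuardSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._thread_id: Optional[int] = None

    def run_activation(self, activate: Callable[[], None]) -> None:
        with self._lock:
            self._thread_id = threading.get_ident()
        try:
            activate()
        finally:
            # Clear under the same lock so an interrupter can no longer target
            # this thread once the activation is done with it
            with self._lock:
                self._thread_id = None

    def interrupt_if_still_running(self, raise_in_thread: Callable[[int], None]) -> bool:
        with self._lock:
            if self._thread_id is None:
                return False  # Activation already finished; don't touch the thread
            raise_in_thread(self._thread_id)
            return True
```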

@dandavison (Contributor) commented Apr 1, 2025

Here's an explanation of this PR via sequence diagrams:

[sequence diagram image]

@dandavison (Contributor) left a comment:

Very nice. Impressive PR with clever tests that I imagine I'll refer back to. I tried breaking the tests in a few ways and failed.

Comment on lines 67 to 68
workflow_task_executor=workflow_task_executor
or concurrent.futures.ThreadPoolExecutor(),
Contributor:

nit: in syntactic situations like this it's usual to wrap in parens for clearer indentation. Suggest we adopt that style.

Suggested change:

- workflow_task_executor=workflow_task_executor
- or concurrent.futures.ThreadPoolExecutor(),
+ workflow_task_executor=(
+     workflow_task_executor or concurrent.futures.ThreadPoolExecutor()
+ ),

@cretz (Member, Author):

Ruff often does this for me in other multiline expression situations; I wonder why it did not here.

Contributor:

This didn't have parens; I don't think ruff will actually add the missing paren characters.

@cretz (Member, Author):

Hrmm, I think I have seen it add parens in other situations where they weren't present. I can add them here though.

@cretz (Member, Author):

Added

# Wait for task fail
await assert_task_fail_eventually(handle, message_contains="deadlock")
# Confirm could not be interrupted
assert deadlock_uninterruptible_completed == 0
Contributor:

Would it be more correct to assert that waiting for >0 times out? Or is there a reason this line does not need to wait but similar checks in this and the interruptible test use assert_eventually?

@cretz (Member, Author) Apr 4, 2025:

Hrmm, yes, technically it could have been interrupted after the task failed (though that would require a server round trip to go faster than a Python interpreter thread, which is why I'm not too worried about it). All of the eventual assertions are for positive assertions; we usually don't want timing out to be a valid thing that happens in successful tests (for test perf reasons, though sometimes we have to, like for deadlock itself). Can we think of another way to confirm that it cannot be interrupted? Would you settle for removing this assertion and having a whole new test that proves that threading.Event's wait is uninterruptible via our interruption mechanism?

Or can we accept that this may usually fail but not guaranteed fail if Python updates the interruptibility of this call?

@dandavison (Contributor) Apr 4, 2025:

I don't have a strong opinion. Would it help to put a mock object around _handle_cache_eviction? Perhaps waiting for it to be called, confirming that we haven't hit the finally, and then setting the event. Something involving something like that?

@cretz (Member, Author):

I have traditionally tried to avoid internal expectations on these tests and rather test behavior from the outside as a user may. I think it's probably ok to say that this assertion will just "usually" fail (i.e. fail like 99.9% of the time unless the server somehow does its round trip before this Python thread can execute its couple of lines) if Python stdlib changes behavior here instead of "always". It'll still be noticeable by us.

Contributor:

Yes, that's OK with me too. Maybe add a comment explaining the technically possible thing.

@cretz (Member, Author):

👍 Didn't get to this pre-merge, may be able to in another PR

cretz added 2 commits April 4, 2025 11:30, merging main into the lost-workflow-task-slots branch (conflicts resolved in tests/worker/test_workflow.py)
@cretz cretz merged commit 7ffa822 into temporalio:main Apr 4, 2025
13 checks passed
@cretz cretz deleted the lost-workflow-task-slots branch April 4, 2025 19:27