Improve workflow task deadlock and eviction #806
Conversation
but it doesn't work for things like time.sleep() or threading.Event's wait(), which are not reactive to it.
Maybe that could be solved by somehow intercepting those CPython calls or something, but it's definitely not worth the effort.
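For readers unfamiliar with the mechanism being discussed, here is a minimal standalone sketch (not the SDK's actual implementation) of raising an exception asynchronously in another thread via CPython's PyThreadState_SetAsyncExc; it also illustrates why C-level blocking calls like time.sleep() or Event.wait() are not woken by it:

```python
import ctypes
import threading
import time


class _InterruptError(Exception):
    """Hypothetical exception type used to poke a stuck thread."""


def raise_in_thread(thread_id: int, exc_type: type) -> None:
    # Ask CPython to schedule exc_type to be raised in the target thread.
    # The exception is only delivered at a bytecode boundary, so a thread
    # blocked inside a C call (time.sleep, threading.Event.wait, a real
    # deadlock on a lock, etc.) will not see it until that call returns.
    affected = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_long(thread_id), ctypes.py_object(exc_type)
    )
    if affected > 1:
        # Undo if more than one thread state was (unexpectedly) affected.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), None)


def busy() -> None:
    count = 0
    try:
        while True:
            count += 1  # pure-Python work: interruptible at bytecode boundaries
    except _InterruptError:
        print("interrupted after", count, "iterations")


t = threading.Thread(target=busy, daemon=True)
t.start()
time.sleep(0.5)
raise_in_thread(t.ident, _InterruptError)
t.join(timeout=2)
```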
# Only want to log and mark as could not evict once
if not seen_fail:
IMO it could be worth logging this repeatedly, every 1 minute or something, just so it's extra obvious to people they've got something stuck.
Not blocking tho.
We've found either people just haven't enabled logging or they found this log. We haven't seen situations where the log was not seen by users due to age. Would rather not add a timer mechanism into this unless I have to.
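For context, a rough sketch (with hypothetical evict/logger names, not the SDK's actual code) of the log-once-but-keep-retrying pattern that the quoted seen_fail snippet is part of:

```python
import logging

logger = logging.getLogger(__name__)


def evict_with_retries(try_evict, run_id: str, timeout: float = 5.0) -> None:
    """Keep attempting eviction forever, but only log the failure once.

    try_evict is a hypothetical callable that returns True when the run's
    eviction completed within the timeout.
    """
    seen_fail = False
    while True:
        if try_evict(timeout):
            if seen_fail:
                logger.info("Eviction of run %s eventually succeeded", run_id)
            return
        # Only want to log and mark as could not evict once
        if not seen_fail:
            seen_fail = True
            logger.warning(
                "Timed out evicting run %s; will keep retrying. The slot "
                "remains occupied until eviction completes.",
                run_id,
            )
```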
Thanks for the PR description, really nice.
As I understand it, our approach to this is to:
- issue more threads so there's a buffer in case of lost threads
- the jump from 4 to 500 threads sounds like a lot, but I'm presuming that memory pressure isn't really a concern (I don't know how memory heavy threads are in Python)
- keep trying to evict the deadlocked task, under the assumption that it will complete at some point
- if the task truly is deadlocked, then we're stuck where we are today, but with some optimistic eviction-retry behavior (and some nice log entries)
One question I have:
Is it possible for a deadlocked workflow to continue receiving tasks/activations? (If so, I suppose this would imply they are sent to a different task slot/thread.) Asking because it might be prudent to check whether deadlocked_activation_task is set for tasks that arrive after the deadlocked task (if possible).
Yeah, we're already using a not-commonly-recommended C-only call to interrupt even the interpreter; anything more would be a bit too far (I haven't researched, but I think it would be rough even if doable).
For many people, this will be
Yup, nothing we can do if they are truly deadlocked except continue to eat a thread/task.
Traditionally yes, but in our case, intentionally no. So we do respond to Core on deadlock so that it can fail the workflow task. Then Core comes back and asks us to evict and this is where we intentionally hang. By hanging the ability to evict this run ID, Core will not accept new work for the run ID (i.e. will return eager failure to server if server sends us a task for it). So server, on workflow task fail, will keep trying the task. And every one that lands on this worker (unsure if sticky is retained in this situation) will be eagerly failed by Core IIUC. Every one that lands on another worker may proceed but may end up with the same deadlock issue on that worker. (cc @Sushisource if I have any of this incorrect)
Ah ok, this is the key bit, I did not know this.
Interesting. I'm only familiar with sticky queues being reset/dropped (idk the right terminology for this) when the worker dies (i.e. as described in this doc).
temporalio/worker/_workflow.py (Outdated)
if deadlocked_thread_id:
    temporalio.bridge.runtime.Runtime._raise_in_thread(
        deadlocked_thread_id, _InterruptDeadlockError
    )
The timed-out task could have completed between line 275 and line 289. In that case, the event loop in the thread that it was running on may now be processing a task from an unrelated workflow, and we will throw an exception in that workflow. Is that what you were referring to here?
It technically could cause user issues if they were expecting their Python not to be interrupted even after deadlock timeout, but we don't support code running successfully after deadlock timeout.
In that case, the event loop in the thread that it was running on may now be processing a task from an unrelated workflow
Hrmm, this is true. I wonder if there is a reasonable way without too much effort to prevent the thread from being returned to the pool if we encounter a deadlock. Hopefully it doesn't require a mutex or us to hang the activate call open on some thread-blocking thing if we see deadlock. But it may. I will think on it. Any ideas? If it's too complicated we may have to scrap the idea of interrupting the deadlock this way.
Is that what you were referring to here?
No, this was something else (technically we interrupt their deadlock where we didn't before, possibly causing other code to run)
Ok, at the likely-negligible cost of creating a mutex for every workflow instance, I added a check to make sure interruption only happens if still in thread. Take a look at 106ed0c and give feedback.
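A minimal sketch of the guard described here, with hypothetical names (the actual change is in 106ed0c): the activation thread registers itself under a per-instance lock, and the interrupter only raises if that registration is still present.

```python
import threading
from typing import Callable, Optional


class _DeadlockInterruptGuard:
    """Hypothetical per-workflow-instance guard for deadlock interruption."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._thread_id: Optional[int] = None

    def run_activation(self, activation: Callable[[], None]) -> None:
        with self._lock:
            self._thread_id = threading.get_ident()
        try:
            activation()
        finally:
            # Deregister under the lock so a late interrupter can never target
            # a pool thread that has moved on to an unrelated workflow task.
            with self._lock:
                self._thread_id = None

    def interrupt_if_still_running(self, raise_in_thread: Callable[[int], None]) -> None:
        with self._lock:
            if self._thread_id is not None:
                raise_in_thread(self._thread_id)
```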
Very nice. Impressive PR with clever tests that I imagine I'll refer back to. I tried breaking the tests in a few ways and failed.
temporalio/worker/_replayer.py (Outdated)
workflow_task_executor=workflow_task_executor
or concurrent.futures.ThreadPoolExecutor(),
nit: in syntactic situations like this it's usual to wrap in parens for clearer indentation. Suggest we adopt that style.
Suggested change:
workflow_task_executor=(
    workflow_task_executor or concurrent.futures.ThreadPoolExecutor()
),
Ruff often does this for me in other multiline expression situations; I wonder why it did not here.
This didn't have parens; I don't think ruff will actually add the missing paren characters.
Hrmm, I think I have seen it add parens in other situations where they weren't present. I can add them here though.
Added
# Wait for task fail
await assert_task_fail_eventually(handle, message_contains="deadlock")
# Confirm could not be interrupted
assert deadlock_uninterruptible_completed == 0
Would it be more correct to assert that waiting for >0 times out? Or is there a reason this line does not need to wait but similar checks in this and the interruptible test use assert_eventually?
Hrmm, yes, technically it could have been interrupted after the task failed (though it'd have to be a quick server round trip to go faster than a Python interpreter thread, which is why I'm not too worried about it). All of the eventual assertions are positive assertions; we usually don't want timing out to be a valid thing that happens in successful tests (for test perf reasons, though sometimes we have to, as with deadlock itself). Can we think of another way to confirm that it cannot be interrupted? Would you settle for removing this assertion and having a whole new test that proves that threading.Event's wait is uninterruptible via our interruption mechanism?
Or can we accept that, if Python updates the interruptibility of this call, this assertion will usually fail but is not guaranteed to?
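For reference, a rough sketch of the kind of polling assertion helper being discussed (the repo's actual assert_eventually may differ); it only suits positive assertions, since a negative assertion would have to wait out the whole timeout:

```python
import asyncio
from datetime import timedelta
from typing import Awaitable, Callable


async def assert_eventually(
    check: Callable[[], Awaitable[bool]],
    timeout: timedelta = timedelta(seconds=10),
    interval: timedelta = timedelta(milliseconds=200),
) -> None:
    """Poll an async predicate until it returns True or the timeout elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout.total_seconds()
    while True:
        if await check():
            return
        if loop.time() >= deadline:
            raise AssertionError("Condition did not become true before timeout")
        await asyncio.sleep(interval.total_seconds())
```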
I don't have a strong opinion. Would it help to put a mock object around _handle_cache_eviction? Perhaps waiting for it to be called, confirming that we haven't hit the finally, and then setting the event. Something like that?
I have traditionally tried to avoid internal expectations in these tests and rather test behavior from the outside as a user would. I think it's probably ok to say that, if the Python stdlib changes behavior here, this assertion will just "usually" fail (i.e. fail like 99.9% of the time, unless the server somehow does its round trip before this Python thread can execute its couple of lines) instead of "always" failing. It'll still be noticeable by us.
Yes, that's OK with me too. Maybe add a comment explaining the technically possible thing.
👍 Didn't get to this pre-merge, may be able to in another PR
What was changed
TL;DR - Fixed issue where on deadlock and/or eviction fail, slot/thread could never recover and thread pool could starve even if issues resolved themselves later. Also gave deadlocked threads a bit of a poke to unstick them.
There are issues where deadlocking or eviction-swallowing code eats up both workflow task slots and workflow executor threads. This PR makes some changes to help these situations.
Before this PR (i.e. today as of this writing), we have the following behavior:
- The workflow task thread pool defaults to max(os.cpu_count(), 4) workers, because we naively assumed in original development that workflow tasks are CPU bound and need no more threads than CPUs, but they can use IO, not to mention Python and the OS can both yield to other native threads if needed.

Problems with this:
With this PR the behavior now is:
- The default workflow task thread pool size is now max_concurrent_workflow_tasks, or 500 if unset (e.g. using workflow tuner explicitly). This makes sure that deadlocks that eat threads don't affect other workflow tasks. (A configuration sketch follows this list.)
- On deadlock timeout, we now attempt to interrupt the stuck thread, though this doesn't work for things like time.sleep() or threading.Event's wait(), which are not reactive to it. Still, this improves our ability to interrupt some stuck Python code. It technically could cause user issues if they were expecting their Python not to be interrupted even after deadlock timeout, but we don't support code running successfully after deadlock timeout.
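To make the sizing concrete, here is a hedged example of explicitly bounding both workflow task slots and the executor thread pool when building a worker; parameter names are as I understand them from this PR and the current SDK, and the task queue and workflow are placeholders:

```python
import concurrent.futures

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class ExampleWorkflow:
    @workflow.run
    async def run(self) -> str:
        return "done"


async def run_worker(client: Client) -> None:
    # Bound both the workflow task slots and the thread pool so a few
    # deadlocked runs cannot starve the rest of the worker.
    worker = Worker(
        client,
        task_queue="example-task-queue",  # placeholder value
        workflows=[ExampleWorkflow],
        max_concurrent_workflow_tasks=100,
        workflow_task_executor=concurrent.futures.ThreadPoolExecutor(max_workers=100),
    )
    await worker.run()
```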