Apply eviction before completing activation #466
Conversation
Makes sense 👍
# This should never happen
logger.warn("Cache already exists for activation with start job")
If you have a thing like dbg_assert! that will throw during tests but not during prod, I would definitely use that here.
There isn't a good one. There is `assert`, which is based on `__debug__` being True, which it is by default (you have to opt out with `-O`). However, as these kinds of options grow, I could see a "strict mode" that we enable for tests.
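For reference, a minimal sketch of the `assert`/`__debug__` behavior being discussed (the function and variable names below are illustrative, not the SDK's actual code):

```python
def register_start(running_workflows: dict, run_id: str) -> None:
    # This assert only runs when __debug__ is True, which is the default;
    # invoking Python with -O sets __debug__ to False and strips asserts,
    # so the check costs nothing in optimized/production runs.
    assert run_id not in running_workflows, (
        "Cache already exists for activation with start job"
    )
    running_workflows[run_id] = ...  # stand-in for the real workflow instance
```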
In that case I would at least run the tests once with this as an exception and make sure it's not happening.
If it doesn't take much time, I'd see about adding a way to get that behavior. It's pretty valuable and has saved me more than once on things like this.
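One hedged sketch of how that behavior could be wired into the test suite, assuming the worker loggers live under the `temporalio.worker` namespace (the fixture name is illustrative, not an existing one): an autouse pytest fixture that fails any test which emitted a WARNING-level record from the worker.

```python
import logging

import pytest


@pytest.fixture(autouse=True)
def fail_on_worker_warnings(caplog):
    # Capture WARNING and above for the worker loggers while the test runs...
    with caplog.at_level(logging.WARNING, logger="temporalio.worker"):
        yield
    # ...then fail the test if any such record was emitted.
    worker_warnings = [
        r
        for r in caplog.records
        if r.levelno >= logging.WARNING and r.name.startswith("temporalio.worker")
    ]
    assert not worker_warnings, (
        f"Unexpected worker warnings: {[r.getMessage() for r in worker_warnings]}"
    )
```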
FWIW, I did a quick try at adding this same check in TS, and no tests fail there either.
Also just switched this to an error and ran the full suite locally with no issue here.
temporalio/worker/_workflow.py
    )
    del self._running_workflows[act.run_id]
else:
    logger.debug(
Consider making this one a warn. Just like the "cache already exists" one, falling into this case is most probably a symptom of something incorrect elsewhere, which we're unlikely to ever find out about if it's only printed at debug level.
This isn't new code, I just moved it from below. But I can update this log level.
Updated
That actually turned out cleaner than before... Nice!
What was changed
Today, if we receive an activation with a cache-remove job, we complete the activation in core and then remove the workflow from cache. That's bad: there is the potential for a rare bug in the window between completing the activation and removing the workflow from the cache.
This moves eviction processing to before activation completion. It also adds an optimization to not yield to a codec handler on cache-eviction-only activations, potentially saving cycles and improving performance for codec users. Finally, we now log a warning if an activation contains a start-workflow job for a workflow that is already in cache (which should never happen).
There are no new tests beyond the current suite to confirm no regression, since this is a simple order-of-operations change. We have struggled to reliably replicate the bug in an integration test, and patching/mocking core to inject it is too contrived to represent what could really happen.
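For readers skimming the diff, here is a rough sketch of the new ordering. It is illustrative only: `_handle_activation`, `_run_workflow_jobs`, and `_encode_completion` are placeholder names rather than the SDK's real internals, and the bridge/proto calls are assumptions about the surrounding API. The point is that eviction is applied before the completion is reported back to core, and eviction-only activations skip the codec step.

```python
async def _handle_activation(self, act) -> None:
    # Placeholder names throughout; not the SDK's actual code.
    completion = await self._run_workflow_jobs(act)

    # Does this activation evict the workflow from the cache?
    cache_remove_job = next(
        (j for j in act.jobs if j.HasField("remove_from_cache")), None
    )
    eviction_only = cache_remove_job is not None and len(act.jobs) == 1

    # Apply eviction BEFORE reporting completion to core, so core can never
    # hand us a new activation for a run we still hold in cache.
    if cache_remove_job is not None:
        self._running_workflows.pop(act.run_id, None)

    # Optimization: an eviction-only activation carries no user payloads,
    # so skip yielding to the payload codec for it.
    if not eviction_only and self._data_converter.payload_codec:
        completion = await self._encode_completion(completion)

    await self._bridge_worker().complete_workflow_activation(completion)
```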