Apply eviction before completing activation #466

Merged
cretz merged 2 commits into temporalio:main from evict-before-complete
Feb 2, 2024

Member

@cretz cretz commented Feb 1, 2024

What was changed

Today if we receive an activation with a cache-remove job, we complete the activation in core and then remove the workflow from cache. That's bad. There is the potential for a rare bug where:

  • Python receives activation
  • Activation takes too long, causing a server-side task timeout; the server then sends a new task, which core buffers while the current activation is still in flight
  • Already-timed-out-server-side activation completion is sent to core
  • Core sends eviction activation and Python sends activation completion before removing from cache
  • Core sends the buffered task to run from the beginning, but it reuses the cached workflow from the old activation because the eviction code hasn't been reached yet
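
The sequence above can be reproduced in a toy model. This is only a sketch: `ToyCore`, `ToyWorker`, and `simulate` are hypothetical names, not SDK or core APIs, and the buffered task is handed out synchronously on completion to force the worst-case interleaving that is rare in practice. The `evict_before_complete` flag flips the order of the two steps so both outcomes can be compared.

```python
from typing import Callable, Dict, List, Optional


class ToyCore:
    """Hypothetical stand-in for core that holds one buffered task."""

    def __init__(self) -> None:
        self.on_task: Optional[Callable[[str], None]] = None
        self.buffered_run_id: Optional[str] = None

    def complete_activation(self, run_id: str) -> None:
        # Once the eviction activation is completed, core is free to hand
        # out the buffered task. Doing it immediately forces the race.
        if self.buffered_run_id == run_id and self.on_task is not None:
            self.buffered_run_id = None
            self.on_task(run_id)


class ToyWorker:
    """Hypothetical worker with a run-ID-keyed workflow cache."""

    def __init__(self, core: ToyCore, evict_before_complete: bool) -> None:
        self.core = core
        self.evict_before_complete = evict_before_complete
        self.running_workflows: Dict[str, str] = {}
        self.log: List[str] = []
        core.on_task = self.handle_start

    def handle_start(self, run_id: str) -> None:
        # A task that runs from the beginning expects no cached instance.
        if run_id in self.running_workflows:
            self.log.append("reused stale instance")
        else:
            self.running_workflows[run_id] = "fresh"
            self.log.append("created fresh instance")

    def handle_eviction(self, run_id: str) -> None:
        if self.evict_before_complete:
            # Order after this change: evict, then complete.
            self.running_workflows.pop(run_id, None)
            self.core.complete_activation(run_id)
        else:
            # Order before this change: complete, then evict.
            self.core.complete_activation(run_id)
            self.running_workflows.pop(run_id, None)


def simulate(evict_before_complete: bool) -> List[str]:
    core = ToyCore()
    worker = ToyWorker(core, evict_before_complete)
    worker.running_workflows["run-1"] = "old"  # workflow already cached
    core.buffered_run_id = "run-1"  # task buffered after the timeout
    worker.handle_eviction("run-1")
    return worker.log
```

In this model, `simulate(False)` records that the stale instance was reused, while `simulate(True)` records a fresh instance.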

This moves eviction processing to before activation completion. It also skips yielding to a codec handler on cache-eviction-only activations, potentially saving cycles and improving performance for codec users. Finally, we now log a warning if an activation contains a start-workflow job but the workflow is already in cache (should never happen).
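
A rough sketch of the codec short-circuit, under stated assumptions: `Job`, `CountingCodec`, `is_eviction_only`, and `decode_jobs` are hypothetical names for illustration, and an eviction-only activation is assumed to be one whose single job is a cache removal. The real worker operates on protobuf activations rather than these toy types.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical activation job; kind is e.g. "remove_from_cache"."""

    kind: str
    payload: bytes = b""


class CountingCodec:
    """Hypothetical codec that counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    async def decode(self, payload: bytes) -> bytes:
        self.calls += 1
        await asyncio.sleep(0)  # stands in for real async codec work
        return payload


def is_eviction_only(jobs: List[Job]) -> bool:
    # Assumption: eviction-only means exactly one job, a cache removal.
    return len(jobs) == 1 and jobs[0].kind == "remove_from_cache"


async def decode_jobs(
    jobs: List[Job], codec: Optional[CountingCodec]
) -> List[Job]:
    # Nothing to decode on eviction-only activations, so return without
    # awaiting the codec at all.
    if codec is None or is_eviction_only(jobs):
        return jobs
    return [Job(job.kind, await codec.decode(job.payload)) for job in jobs]
```

With this shape, an eviction-only activation never awaits the codec, while any other activation decodes each job payload once.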

There are no real tests beyond the current suite to confirm no regression since this is a simple order-of-operations change. We have struggled to reliably replicate in an integration sense, and patching/mocking core to inject this bug is too contrived to represent what could really happen.

@cretz cretz requested a review from a team as a code owner February 1, 2024 21:39
Member

@Sushisource Sushisource left a comment

Makes sense 👍

Comment on lines +216 to +217
# This should never happen
logger.warn("Cache already exists for activation with start job")
Member

If you have a thing like dbg_assert! that will throw during tests but not during prod, I would definitely use that here.

Member Author

There isn't a good one. There is assert, which is based on __debug__ being True, which it is by default (you have to opt out with -O). However, as these kinds of options grow, I could see a "strict mode" that we enable for tests.
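
One possible shape for such a "strict mode", as a sketch only: the `WORKER_STRICT_MODE` environment variable, `InvariantViolation`, and `check_invariant` are hypothetical and not part of the SDK. The idea is to raise when the switch is on (for example in a test run) and only log a warning otherwise, without depending on __debug__ or -O.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical switch for illustration; not an actual SDK setting.
STRICT_MODE_ENV = "WORKER_STRICT_MODE"


class InvariantViolation(RuntimeError):
    """Raised in strict mode when a should-never-happen condition occurs."""


def check_invariant(condition: bool, message: str) -> bool:
    """Return True if the invariant holds.

    If it does not hold, raise when strict mode is enabled; otherwise
    log a warning and return False so the caller can carry on.
    """
    if condition:
        return True
    if os.environ.get(STRICT_MODE_ENV) == "1":
        raise InvariantViolation(message)
    logger.warning(message)
    return False
```

A caller could then write something like `check_invariant(run_id not in running_workflows, "Cache already exists for activation with start job")`, where `running_workflows` is whatever cache mapping the worker keeps.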

Member

In that case I would at least run the tests once with this as an exception and make sure it's not happening.

If it doesn't take much time, I'd see about adding a way to get that behavior. It's pretty valuable and has saved me more than once on things like this.

Contributor

FWIW, I quickly tried adding this same check in TS, and no tests fail there either.

Member Author

Also, I just switched this to an error and ran the full suite locally with no issues.

)
del self._running_workflows[act.run_id]
else:
logger.debug(
Contributor

Consider making this one a warn. Just like the "cache already exists" case, falling into this branch is most probably a symptom of something incorrect elsewhere, which we're unlikely to ever find out about if it's only printed out as debug.

Member Author

This isn't new code; I just moved it from below. But I can update this log level.

Member Author

Updated

Contributor

@mjameswh mjameswh left a comment

That actually turned out cleaner than before... Nice!

@cretz cretz merged commit 50768df into temporalio:main Feb 2, 2024
@cretz cretz deleted the evict-before-complete branch February 2, 2024 13:50
3 participants