Apply eviction before completing activation #466
Conversation
Makes sense 👍
# This should never happen
logger.warn("Cache already exists for activation with start job")
If you have a thing like dbg_assert! that will throw during tests but not during prod, I would definitely use that here.
There isn't a good one. There is `assert`, which is based on `__debug__` being True, which it is by default (you have to opt out with `-O`). However, as these kinds of options grow, I could see a "strict mode" that we enable for tests.
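For reference, a minimal sketch of the `assert`/`__debug__` behavior being discussed (the function and variable names below are illustrative, not the SDK's actual code):

```python
def register_start(running_workflows: dict, run_id: str) -> None:
    # This assert only runs when __debug__ is True, which is the default;
    # invoking Python with -O sets __debug__ to False and strips asserts,
    # so the check costs nothing in optimized/production runs.
    assert run_id not in running_workflows, (
        "Cache already exists for activation with start job"
    )
    running_workflows[run_id] = ...  # stand-in for the real workflow instance
```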
In that case I would at least run the tests once with this as an exception and make sure it's not happening.
If it doesn't take much time, I'd see about adding a way to get that behavior. It's pretty valuable and has saved me more than once on things like this.
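One hedged sketch of how that behavior could be wired into the test suite, assuming the worker loggers live under the `temporalio.worker` namespace (the fixture name is illustrative, not an existing one): an autouse pytest fixture that fails any test which emitted a WARNING-level record from the worker.

```python
import logging

import pytest


@pytest.fixture(autouse=True)
def fail_on_worker_warnings(caplog):
    # Capture WARNING and above for the worker loggers while the test runs...
    with caplog.at_level(logging.WARNING, logger="temporalio.worker"):
        yield
    # ...then fail the test if any such record was emitted.
    worker_warnings = [
        r
        for r in caplog.records
        if r.levelno >= logging.WARNING and r.name.startswith("temporalio.worker")
    ]
    assert not worker_warnings, (
        f"Unexpected worker warnings: {[r.getMessage() for r in worker_warnings]}"
    )
```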
FWIW, I did a quick try at adding this same check in TS, and no tests fail there either.
Also just switched this to an error and ran the full suite locally with no issue here.
temporalio/worker/_workflow.py
    )
    del self._running_workflows[act.run_id]
else:
    logger.debug(
Consider making this one a warn. Just like the "cache already exists" one, falling into this case is most probably a symptom of something incorrect elsewhere, which we're unlikely to ever find out about if it's only printed at debug level.
This isn't new code, I just moved it from below. But I can update this log level.
Updated
That actually turned out cleaner than before... Nice!
What was changed
Today, if we receive an activation with a cache-remove job, we complete the activation in core and then remove the workflow from cache. That's bad: there is the potential for a rare bug in the window between completing the activation and removing the workflow from the cache.
This moves eviction processing to before activation completion. It also adds an optimization to not yield to a codec handler on cache-eviction-only activations, potentially saving cycles and improving performance for codec users. Finally, we now log a warning if an activation contains a start-workflow job for a workflow that is already in cache (which should never happen).
There are no new tests beyond the current suite to confirm no regression, since this is a simple order-of-operations change. We have struggled to reliably replicate the bug in an integration test, and patching/mocking core to inject it is too contrived to represent what could really happen.
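For readers skimming the diff, here is a rough sketch of the new ordering. It is illustrative only: `_handle_activation`, `_run_workflow_jobs`, and `_encode_completion` are placeholder names rather than the SDK's real internals, and the bridge/proto calls are assumptions about the surrounding API. The point is that eviction is applied before the completion is reported back to core, and eviction-only activations skip the codec step.

```python
async def _handle_activation(self, act) -> None:
    # Placeholder names throughout; not the SDK's actual code.
    completion = await self._run_workflow_jobs(act)

    # Does this activation evict the workflow from the cache?
    cache_remove_job = next(
        (j for j in act.jobs if j.HasField("remove_from_cache")), None
    )
    eviction_only = cache_remove_job is not None and len(act.jobs) == 1

    # Apply eviction BEFORE reporting completion to core, so core can never
    # hand us a new activation for a run we still hold in cache.
    if cache_remove_job is not None:
        self._running_workflows.pop(act.run_id, None)

    # Optimization: an eviction-only activation carries no user payloads,
    # so skip yielding to the payload codec for it.
    if not eviction_only and self._data_converter.payload_codec:
        completion = await self._encode_completion(completion)

    await self._bridge_worker().complete_workflow_activation(completion)
```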