erts: fix attempt to start timer when executing on dirty scheduler #2024
jhogberg
left a comment
Thanks for the PR! It looks good aside from the lack of a test. Do you think you could add one?
@@ -421,8 +421,15 @@ static void schedule_delete_dist_entry(DistEntry* dep)
 *
 * Note that timeouts do not guarantee thread progress.
Can you add a comment on why we're re-scheduling on the first scheduler?
I found this trick in erts_start_timer_callback, in erl_hl_timer.c.
It is guaranteed that scheduler #1 is always online (and active), even on a system with a single core.
This is a very rare event, because in most cases a dist entry is garbage-collected on a normal scheduler. So it does not seem necessary to pick a random scheduler from those that are online.
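To illustrate the hand-off described above, here is a minimal, self-contained sketch. It is a toy model, not ERTS code: `sched_t`, `first_normal_sched`, `schedule_delete_timer`, `start_delete_timer` and `run_pending_job` are hypothetical names invented for the example, and the single-slot `pending_job` field only stands in for whatever mechanism the emulator really uses to pass a job to scheduler 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is hypothetical, not an ERTS API. */
enum sched_kind { SCHED_NORMAL, SCHED_DIRTY };

typedef struct sched sched_t;
typedef void (*job_fn)(sched_t *self);

struct sched {
    enum sched_kind kind;
    job_fn pending_job;   /* single-slot job queue, enough for the sketch */
    int timers_started;   /* counts timers started on this scheduler */
};

/* Stand-in for scheduler 1, which is assumed to always be online. */
static sched_t first_normal_sched = { SCHED_NORMAL, NULL, 0 };

/* Timers may only be started on a normal scheduler. */
static void start_delete_timer(sched_t *self)
{
    assert(self->kind == SCHED_NORMAL);
    self->timers_started++;
}

/*
 * Start the timer directly on a normal scheduler; on a dirty scheduler,
 * hand the job over to scheduler 1 instead.
 * Returns true if the timer was started immediately, false if deferred.
 */
static bool schedule_delete_timer(sched_t *current)
{
    if (current->kind == SCHED_DIRTY) {
        first_normal_sched.pending_job = start_delete_timer;
        return false;
    }
    start_delete_timer(current);
    return true;
}

/* Run the job handed to this scheduler, if there is one. */
static void run_pending_job(sched_t *self)
{
    job_fn job = self->pending_job;
    if (job != NULL) {
        self->pending_job = NULL;
        job(self);
    }
}
```

In this model a call from a dirty scheduler starts nothing locally; the timer only appears once scheduler 1 runs its pending job, which mirrors why the rare dirty-scheduler case does not need load balancing across schedulers.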
Writing a unit test for this case seems rather complicated. I'd suggest a more general solution for all dirty schedulers: in erts_schedule, in erl_process.c, when aux work is scheduled on a dirty scheduler, either abort the emulator (because the internal state is broken) or silently reschedule the aux work on a normal scheduler.
I think I was a bit unclear: I want a brief comment on why this dance is done in the first place, as it might not be immediately obvious. There's nothing wrong with picking scheduler 1 for this purpose.
As for the test, I think we can live without one if you haven't found a neat way to provoke this.
After discussing it internally, we think that it's best to abort the emulator when aux work is erroneously scheduled on a dirty scheduler. Feel free to add a commit that does this.
Ah, I misread what you wrote.
I added an explanatory comment.
I think crashing the emulator on an attempt to schedule any aux work on a dirty scheduler should be a separate commit/PR, as it potentially touches a lot of other subsystems and may uncover even more issues.
Great!
I'll add the assertion to master; we're a bit too close to the 21.2 release to include it there.
Since OTP 20, a major garbage collection can run on a dirty scheduler, so the DistEntry destructor can be called on a dirty scheduler as well. This, in turn, leads to an attempt to start a timer on a dirty scheduler, which is not possible. A debug build asserts; a release build lets the attempt succeed and ends up in an infinite busy loop, because aux work wakes the scheduler up, but a dirty scheduler cannot execute aux work. A similar approach is already used in erl_hl_timer, see erts_start_timer_callback.
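The busy loop described above can be shown with a deliberately simplified model. This is a sketch under stated assumptions, not the emulator's scheduler loop: `toy_sched` and `wakeups_until_idle` are hypothetical names, and the `max_wakeups` cap exists only so the example terminates where the real scheduler would keep spinning.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not ERTS internals. */
struct toy_sched {
    bool is_dirty;          /* dirty schedulers cannot execute aux work */
    bool aux_work_pending;  /* a set flag wakes the scheduler up */
};

/*
 * Count how many times the scheduler wakes up before the pending
 * aux work is gone. max_wakeups bounds the loop for the sketch.
 */
static int wakeups_until_idle(struct toy_sched *s, int max_wakeups)
{
    int wakeups = 0;
    while (s->aux_work_pending && wakeups < max_wakeups) {
        wakeups++;
        if (!s->is_dirty) {
            /* A normal scheduler handles the aux work and clears the flag. */
            s->aux_work_pending = false;
        }
        /* A dirty scheduler leaves the flag set, so it is woken again. */
    }
    return wakeups;
}
```

With the flag set, a normal scheduler in this model goes idle after one wakeup, while a dirty scheduler always runs into the cap, which is the release-build symptom the fix avoids by never scheduling the timer there.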
Merged, thanks again for the PR!