Conversation

@max-au (Contributor) commented Nov 19, 2018

Since OTP 20, a MAJOR garbage collection can run on a dirty scheduler, so the DistEntry destructor may be called on a dirty scheduler as well. This, in turn, leads to an attempt to schedule a timer on a dirty scheduler too, which is impossible: it asserts on a debug build, while on a release build it appears to succeed but creates an infinite busy loop, since the aux work wakes the scheduler up but a dirty scheduler cannot execute aux work.
There is a similar method in erl_hl_timer; see erts_start_timer_callback.
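
For context, a minimal sketch of the pattern the description refers to, modeled on erts_start_timer_callback in erl_hl_timer.c. The helper start_timer_aux_work and the body details are illustrative assumptions, not the actual code from erl_dist.c:

/* Sketch only: mirrors the erts_start_timer_callback pattern.
 * start_timer_aux_work is a hypothetical helper that runs on
 * scheduler 1 and starts the deletion timer there. */
static void start_timer_aux_work(void *vdep);

static void
schedule_delete_dist_entry(DistEntry* dep)
{
    ErtsSchedulerData *esdp = erts_get_scheduler_data();

    if (esdp && !ERTS_SCHEDULER_IS_DIRTY(esdp)) {
        /* Normal scheduler: safe to start the timer directly. */
        /* ... set up and start the deletion timer here ... */
    }
    else {
        /* Dirty (or no) scheduler: timers cannot be started here,
         * so punt to scheduler 1, which is always online. */
        erts_schedule_misc_aux_work(1, start_timer_aux_work, (void *) dep);
    }
}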

@jhogberg added the team:VM (Assigned to OTP team VM) and testing (currently being tested, tag is used by OTP internal CI) labels Nov 20, 2018
@jhogberg self-assigned this Nov 20, 2018
@jhogberg (Contributor) left a comment


Thanks for the PR! It looks good aside from the lack of a test; do you think you could add one?

@@ -421,8 +421,15 @@ static void schedule_delete_dist_entry(DistEntry* dep)
*
* Note that timeouts do not guarantee thread progress.
@jhogberg (Contributor) commented:

Can you add a comment on why we're re-scheduling on the first scheduler?

@max-au (Contributor Author) commented:

I found this trick in erl_hl_timer.c, erts_start_timer_callback.
It is guaranteed that scheduler #1 is always online (and active), even on a system with a single core.
This is a super-rare event, because in most cases garbage collection of a dist entry happens on a normal scheduler, so it does not seem necessary to pick a random scheduler out of those that are online.
Writing a unit test for this case seems rather complicated. I'd probably suggest a different solution for all dirty schedulers: in erl_process.c, erts_schedule, when there is AUX WORK scheduled for a dirty scheduler, either abort the emulator (because internal state is broken) or silently reschedule the AUX WORK on a normal scheduler.

@jhogberg (Contributor) commented:

I think I was a bit unclear: I want a brief comment on why this dance is done in the first place, as it might not be immediately obvious. There's nothing wrong with picking scheduler 1 for this purpose.

As for the test, I think we can live without one if you haven't found a neat way to provoke this.

@jhogberg (Contributor) commented:

After discussing it internally, we think that it's best to abort the emulator when aux work is erroneously scheduled on a dirty scheduler. Feel free to add a commit that does this.
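
A minimal sketch of what such a guard could look like, assuming placement near the aux-work handling in erts_schedule (erl_process.c); the function name and message below are hypothetical, not the actual commit:

/* Hypothetical guard; the real assertion landed later in master. */
static void
handle_aux_work_checked(ErtsSchedulerData *esdp)
{
    if (ERTS_SCHEDULER_IS_DIRTY(esdp)) {
        /* Aux work must never reach a dirty scheduler; a release
         * build would otherwise busy-loop on it. Fail loudly. */
        erts_exit(ERTS_ABORT_EXIT,
                  "aux work scheduled on a dirty scheduler\n");
    }
    /* ... handle the pending aux-work flags as usual ... */
}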

@max-au (Contributor Author) commented:

Ah, I just misread what you'd written.
I added an explanatory comment.
I think crashing the emulator on an attempt to schedule any AUX work on a dirty scheduler should be a separate commit/PR, as it potentially touches a lot of other subsystems and may uncover even more issues.
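
The added comment is not quoted in this thread; an illustrative version of what it conveys, based on the discussion above:

/* Illustrative wording only; see erl_dist.c for the actual comment.
 *
 * Since OTP 20 a major GC, and hence the DistEntry destructor, can
 * run on a dirty scheduler. Timers cannot be created from dirty
 * schedulers, so the timer start is re-scheduled as misc aux work
 * on scheduler 1, which is guaranteed to always be online. */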

@jhogberg (Contributor) commented:

Great!
I'll add the assertion to master; we're a bit too close to the 21.2 release to include it there.

@jhogberg removed the testing (currently being tested, tag is used by OTP internal CI) label Nov 26, 2018
@max-au force-pushed the fix_aux_work_on_dcpu_sched branch from ad16e3e to 63077f5 on November 26, 2018 19:27
@jhogberg merged commit 39d52f3 into erlang:maint Nov 27, 2018
@jhogberg (Contributor) commented:

Merged, thanks again for the PR!

@max-au deleted the fix_aux_work_on_dcpu_sched branch January 25, 2022 16:19