esp32: Fix hang in taskYIELD() on riscv CPUs when IRQs disabled. #15910
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

    @@           Coverage Diff           @@
    ##           master   #15910   +/-   ##
    =======================================
      Coverage   98.57%   98.57%
    =======================================
      Files         164      164
      Lines       21336    21336
    =======================================
      Hits        21031    21031
      Misses        305      305
Force-pushed from 0315183 to a8a777e.
Replying to this question asked on the linked issue:
It's a bit subtle. The hang happens in this line of the port layer:
This loops until the interrupt for the yield has triggered and cleared the interrupt register. This is needed because otherwise the CPU keeps running for a few cycles, which can lead to weird problems like a yielding function continuing to execute a few instructions before it actually yields. There's a long comment there (that I think I wrote in 2020, lol) with some explanation. If interrupts are disabled then the interrupt never runs to clear that register, so it loops here indefinitely. This PR changes it so that now FreeRTOS knows it's in a critical section, specifically
Force-pushed from a8a777e to 172bca3.
Force-pushed from 172bca3 to d060bc3.
Force-pushed from d060bc3 to 570b8c5.
Testing this again on stm32, unfortunately the new test does not reliably pass. That's because on stm32 the ticks-us counter is effectively frozen when IRQs are disabled. It does still count a little bit but can't wrap around past 1ms.
To a much lesser extent that should apply to the SAMD and MIMXRT ports as well. But there it's the counter roll-over interrupt, which happens more rarely.
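A minimal sketch of why a 1ms busy-wait can fail to terminate on such a port. It assumes a simplified timer in which the microsecond value is built from a millisecond count (advanced only by a tick interrupt) plus a sub-millisecond hardware counter. The class, its fields and the helper function are hypothetical and only model the behaviour described in the comments above; this is not the stm32 implementation.

```python
class ToyTickClock:
    """Toy model: ticks_us = IRQ-driven ms count * 1000 + hardware sub-ms counter."""

    def __init__(self):
        self.irq_enabled = True
        self.ms_count = 0  # advanced only by the (modelled) tick interrupt
        self.sub_ms = 0    # free-running counter that wraps every 1000 us

    def advance(self, us):
        """Let `us` microseconds of time pass."""
        for _ in range(us):
            self.sub_ms += 1
            if self.sub_ms == 1000:
                self.sub_ms = 0
                if self.irq_enabled:
                    self.ms_count += 1  # the tick interrupt gets to run
                # Assumption: with IRQs disabled the rollover is lost entirely.

    def ticks_us(self):
        return self.ms_count * 1000 + self.sub_ms


def busy_wait_reaches_1ms(clock, max_polls=5000):
    """Poll the clock as the original test did; give up after max_polls."""
    t0 = clock.ticks_us()
    for _ in range(max_polls):
        clock.advance(1)
        # Plain subtraction stands in for ticks_diff() in this toy model.
        if clock.ticks_us() - t0 >= 1000:
            return True
    return False
```

With `irq_enabled` left at `True` the helper returns `True` after 1000 polls. With it set to `False` the reported value never gets more than 999us ahead of `t0`, so the wait would spin until `max_polls` runs out.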
tests/extmod/machine_disable_irq.py (outdated)

    else:
        # busy-wait in a tight loop for 1ms, to simulate doing some work in a critical section
        t0 = ticks_us()
        while ticks_diff(ticks_us(), t0) < 1000:
I suggest just making this a fixed-iteration loop calling ticks_us() each time (so it actually does some work). Eg:

    for _ in range(100):
        ticks_us()

That should still trigger the original bug because there are more than 32 iterations, so the VM will try to release the GIL.
Then separately (one day) we could add a specific test for using ticks_us() while interrupts are disabled, and specify which ports that actually works on, and maybe fix other ports.
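A rough sketch of how the suggested fixed-iteration loop might sit inside a test, assuming the standard MicroPython `machine.disable_irq()` / `machine.enable_irq()` pair. The function name `run_critical_section` and the CPython fallback stubs are hypothetical additions for illustration; this is not the test file from the PR.

```python
try:
    import machine  # available on MicroPython ports
    from time import ticks_us

    disable_irq = machine.disable_irq
    enable_irq = machine.enable_irq
except ImportError:
    # Assumption: stand-ins so the sketch also runs under CPython.
    # They do not mask any interrupts.
    import time

    def ticks_us():
        return time.monotonic_ns() // 1000

    def disable_irq():
        return None

    def enable_irq(state):
        pass


def run_critical_section(iterations=100):
    """Do a fixed amount of work with IRQs disabled, then restore them."""
    state = disable_irq()
    try:
        # Fixed iteration count: does not depend on ticks_us() advancing
        # while interrupts are off. Per the review comment, more than 32
        # iterations is enough for the VM to try to release the GIL.
        for _ in range(iterations):
            ticks_us()
    finally:
        enable_irq(state)
    return iterations


print("completed", run_critical_section(), "iterations")
```

The loop bound is the only termination condition, so the sketch behaves the same on ports where the microsecond counter stalls with interrupts disabled.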
Argh, of course!
Have implemented the suggestion, confirmed that the test still fails on ESP32-C3 on master and passes with the fix. Pyboard V1.1 also passes this test now.
Thanks for updating. I also tested on PYBD-SF2 and it works now.
Regression introduced in 337742f.

The hang occurs because the esp32 port was calling "from ISR" port-layer functions to set/clear the interrupt mask. FreeRTOS kernel therefore doesn't know the CPU is in a critical section. In taskYIELD() the riscv port layer blocks after yielding until it knows the yield has happened, and would block indefinitely if IRQs are disabled (until INT WDT triggers).

Moving to the "public" portENTER_CRITICAL/portEXIT_CRITICAL API means that FreeRTOS knows we're in a critical section and can react accordingly.

Adds a regression test for this case (should be safe to run on all ports).

On single core CPUs, this should result in almost exactly the same behaviour apart from fixing this case.

On dual core CPUs, we now have cross-CPU mutual exclusion for atomic sections. This also shouldn't change anything, mostly because all the code which enters an atomic section runs on the same CPU. If it does change something, it will be to fix a thread safety bug.

There is some risk that this change triggers a FreeRTOS crash where there is a call to a blocking FreeRTOS API with interrupts disabled. Previously this code might have worked, but was probably thread unsafe and would have hung in some circumstances.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <[email protected]>
Force-pushed from 570b8c5 to 05ac693.
Summary
Closes #15846, fixing a regression introduced in #15476.
The hang occurs because the esp32 port was calling "from ISR" port-layer functions to set/clear the interrupt mask. FreeRTOS kernel therefore doesn't know the CPU is in a critical section. In taskYIELD() the riscv port layer blocks after yielding until it knows the yield has happened, and would block indefinitely if IRQs are disabled (until INT WDT triggers).
Moving to the "public" portENTER_CRITICAL/portEXIT_CRITICAL API means that FreeRTOS knows we're in a critical section and can react accordingly.
Adds a regression test for this case (should be safe to run on all ports).
On single core CPUs, this should result in almost exactly the same behaviour apart from fixing this case.
On dual core CPUs, we now have cross-CPU mutual exclusion for atomic sections. This also shouldn't change anything, mostly because all the code which enters an atomic section runs on the same CPU. If it does change something, it will be to fix a thread safety bug.
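The difference between the two masking styles can be sketched with a toy model. It assumes the post-yield wait is skipped whenever the kernel's own critical-section nesting count is non-zero, which is one plausible reading of "can react accordingly". The class, the `critical_nesting` field, the method names and the watchdog bound are all hypothetical; this is not FreeRTOS or ESP-IDF code.

```python
class ToyKernel:
    """Toy model of a yield that waits for the yield interrupt to fire."""

    def __init__(self):
        self.irq_enabled = True
        self.critical_nesting = 0  # hypothetical: what the kernel knows about
        self.yield_pending = False

    # "from ISR"-style masking: interrupts off, kernel bookkeeping untouched.
    def mask_irqs_raw(self):
        self.irq_enabled = False

    def unmask_irqs_raw(self):
        self.irq_enabled = True
        self._service_interrupts()

    # "public" critical-section style: the kernel tracks the nesting depth.
    def enter_critical(self):
        self.irq_enabled = False
        self.critical_nesting += 1

    def exit_critical(self):
        self.critical_nesting -= 1
        if self.critical_nesting == 0:
            self.irq_enabled = True
            self._service_interrupts()

    def _service_interrupts(self):
        # The yield interrupt can only run (and clear the flag) when unmasked.
        if self.irq_enabled and self.yield_pending:
            self.yield_pending = False

    def task_yield(self, watchdog_spins=1000):
        self.yield_pending = True
        spins = 0
        # Wait for the interrupt, but only if the kernel believes it can fire.
        while self.critical_nesting == 0 and self.yield_pending:
            self._service_interrupts()
            if self.yield_pending:
                spins += 1
                if spins >= watchdog_spins:
                    return "watchdog"  # stands in for the INT WDT triggering
        return "deferred" if self.yield_pending else "yielded"
```

In this model a yield after `mask_irqs_raw()` spins until the watchdog bound, while a yield after `enter_critical()` returns immediately and the pending yield is serviced once `exit_critical()` brings the nesting count back to zero.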
Testing
Ran unit tests on ESP32, ESP32-C3 and ESP32-S2 and verified all passing (one vfat related test is failing on S2, but it also fails on master - will investigate separately). Tested with IDF V5.2.2 on all three chips, and also V5.0.4 on ESP32-C3 only.
There is some risk that this change triggers a FreeRTOS crash where there is a call to a blocking FreeRTOS API with interrupts disabled. Previously this code might have worked, but was probably thread unsafe and would have hung in some circumstances - now FreeRTOS knows it's in a critical section so it may crash outright.
This work was funded through GitHub Sponsors.
Trade-offs and Alternatives
There's no FreeRTOS API to verify if interrupts are disabled, so the alternative would have been to track this manually as per the workaround commit #15846 (comment). However I think using the "proper" critical section is more future-proof, and might avoid other subtle bugs in SMP configs. There is a little more runtime overhead to using the critical section, but this code is pretty optimised in ESP-IDF.