esp32: Fix hang in taskYIELD() on riscv CPUs when IRQs disabled. #15910

Merged

Conversation

projectgus
Contributor

@projectgus projectgus commented Sep 25, 2024

Summary

Closes #15846, fixing a regression introduced in #15476.

The hang occurs because the esp32 port was calling "from ISR" port-layer functions to set/clear the interrupt mask. FreeRTOS kernel therefore doesn't know the CPU is in a critical section. In taskYIELD() the riscv port layer blocks after yielding until it knows the yield has happened, and would block indefinitely if IRQs are disabled (until INT WDT triggers).

Moving to the "public" portENTER_CRITICAL/portEXIT_CRITICAL API means that FreeRTOS knows we're in a critical section and can react accordingly.

Adds a regression test for this case (should be safe to run on all ports).

On single core CPUs, this should result in almost exactly the same behaviour apart from fixing this case.

On dual core CPUs, we now have cross-CPU mutual exclusion for atomic sections. This also shouldn't change anything, mostly because all the code which enters an atomic section runs on the same CPU. If it does change something, it will be to fix a thread safety bug.

Testing

Ran unit tests on ESP32, ESP32-C3 and ESP32-S2 and verified all passing (one vfat related test is failing on S2, but it also fails on master - will investigate separately). Tested with IDF V5.2.2 on all three chips, and also V5.0.4 on ESP32-C3 only.

There is some risk that this change triggers a FreeRTOS crash where there is a call to a blocking FreeRTOS API with interrupts disabled. Previously this code might have worked, but was probably thread unsafe and would have hung in some circumstances - now FreeRTOS knows it's in a critical section so it may crash outright.

This work was funded through GitHub Sponsors.

Trade-offs and Alternatives

There's no FreeRTOS API to check whether interrupts are disabled; the alternative would have been to track this manually, as per the workaround commit #15846 (comment). However, I think using the "proper" critical section is more future-proof and might avoid other subtle bugs in SMP configs. Using the critical section adds a little runtime overhead, but this code is pretty optimised in ESP-IDF.

codecov bot commented Sep 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.57%. Comparing base (197becb) to head (05ac693).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15910   +/-   ##
=======================================
  Coverage   98.57%   98.57%           
=======================================
  Files         164      164           
  Lines       21336    21336           
=======================================
  Hits        21031    21031           
  Misses        305      305           


@projectgus projectgus force-pushed the bugfix/esp32_yield_ints_disabled branch from 0315183 to a8a777e Compare September 25, 2024 23:24
@projectgus
Contributor Author

projectgus commented Sep 25, 2024

Replying to this question asked on the linked issue:

I've submitted a candidate fix in the linked PR,

Where exactly? I do not see it in the linked PR.

It's a bit subtle. The hang happens in this line of the port layer:

system_cpu_int_reg = SYSTEM_CPU_INTR_FROM_CPU_0_REG;
[...]
while (port_xSchedulerRunning[coreID] && port_uxCriticalNesting[coreID] == 0 && REG_READ(system_cpu_int_reg + 4 * coreID) != 0) {}

This loops until the interrupt for the yield has triggered and cleared the interrupt register. It's needed because otherwise the CPU keeps running for a few cycles, which can lead to weird problems like a yielding function continuing to execute a few instructions before it actually yields. There's a long comment there (that I think I wrote in 2020, lol) with some explanation.

If interrupts are disabled then the interrupt never runs to clear that register, so it loops here indefinitely.

This PR changes it so that FreeRTOS now knows it's in a critical section: specifically, port_uxCriticalNesting is non-zero, so the loop exits immediately.
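
The effect of that loop condition can be modelled in a few lines of Python. This is only a toy simulation, not the ESP-IDF port layer: `yield_wait_spins`, its parameters and the `max_spins` bound are hypothetical names introduced for illustration, and the pending-interrupt register is reduced to a single boolean.

```python
def yield_wait_spins(scheduler_running, critical_nesting, irqs_enabled, max_spins=1000):
    """Toy model of the wait loop that follows a yield request.

    Returns the number of iterations spun before the loop exits, or None if
    it is still spinning after max_spins (standing in for "hangs until the
    interrupt watchdog fires").
    """
    yield_pending = True  # models the non-zero "from CPU" interrupt register
    spins = 0
    while scheduler_running and critical_nesting == 0 and yield_pending:
        if irqs_enabled:
            # The yield interrupt gets to run and clears the register.
            yield_pending = False
        spins += 1
        if spins >= max_spins:
            return None
    return spins


# Normal yield: the interrupt fires, so the loop exits after one pass.
print(yield_wait_spins(True, 0, irqs_enabled=True))
# IRQs masked but nesting count still 0: the loop never exits on its own.
print(yield_wait_spins(True, 0, irqs_enabled=False))
# Nesting count non-zero: the loop condition is false from the start.
print(yield_wait_spins(True, 1, irqs_enabled=False))
```

The second call corresponds to the hang described above, and the third to the behaviour once the nesting count reflects the critical section.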

@dpgeorge dpgeorge added this to the release-1.24.0 milestone Sep 26, 2024
@projectgus projectgus force-pushed the bugfix/esp32_yield_ints_disabled branch from a8a777e to 172bca3 Compare October 1, 2024 08:09
@projectgus projectgus force-pushed the bugfix/esp32_yield_ints_disabled branch from 172bca3 to d060bc3 Compare October 8, 2024 05:52
@projectgus projectgus force-pushed the bugfix/esp32_yield_ints_disabled branch from d060bc3 to 570b8c5 Compare October 9, 2024 06:39
dpgeorge
dpgeorge previously approved these changes Oct 9, 2024
@dpgeorge
Member

dpgeorge commented Oct 9, 2024

Testing this again on stm32, unfortunately the new test does not reliably pass. That's because on stm32 the ticks-us counter is effectively frozen when IRQs are disabled. It does still count a little bit but can't wrap around past 1ms.
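
One way to picture why a 1ms busy-wait can never finish in that situation is the toy model below. It assumes, purely for illustration, a microsecond clock built from a millisecond count that only a tick interrupt advances, plus a sub-millisecond hardware counter. `ToyTicks` and `max_elapsed` are hypothetical names, and this is not the stm32 port's actual implementation.

```python
class ToyTicks:
    """Toy microsecond clock: ms count bumped by an interrupt + sub-ms hardware count."""

    def __init__(self):
        self.ms = 0               # advanced only by the (simulated) tick interrupt
        self.sub_us = 0           # free-running hardware counter, wraps every 1000 us
        self.irqs_enabled = True

    def advance(self, us):
        """Let `us` microseconds of real time pass."""
        for _ in range(us):
            self.sub_us += 1
            if self.sub_us == 1000:
                self.sub_us = 0
                if self.irqs_enabled:
                    self.ms += 1  # tick interrupt runs and records the wrap
                # With IRQs disabled, this model simply loses the wrap.

    def ticks_us(self):
        return self.ms * 1000 + self.sub_us


def max_elapsed(clock, real_us):
    """Largest ticks_us() difference observed while real_us microseconds pass."""
    t0 = clock.ticks_us()
    best = 0
    for _ in range(real_us):
        clock.advance(1)
        best = max(best, clock.ticks_us() - t0)
    return best


running = ToyTicks()
print(max_elapsed(running, 5000))   # clock keeps up with real time

frozen = ToyTicks()
frozen.irqs_enabled = False
print(max_elapsed(frozen, 5000))    # never reaches 1000
```

In this model the observed difference with IRQs disabled tops out below 1000, so a loop waiting for a difference of at least 1000 would spin forever.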

@dpgeorge dpgeorge dismissed their stale review October 9, 2024 13:07

Test fails on stm32.

@robert-hh
Contributor

That's because on stm32 the ticks-us counter is effectively frozen when IRQs are disabled. It does still count a little bit but can't wrap around past 1ms.

To a much lesser extent, that should apply to the SAMD and MIMXRT ports as well. But there it's the counter roll-over interrupt, which happens more rarely.

else:
    # busy-wait in a tight loop for 1ms, to simulate doing some work in a critical section
    t0 = ticks_us()
    while ticks_diff(ticks_us(), t0) < 1000:
Member

I suggest just making this a fixed-iteration loop calling ticks_us() each time (so it actually does some work). Eg:

for _ in range(100):
    ticks_us()

That should still trigger the original bug because there are more than 32 iterations, so the VM will try to release the GIL.

Then separately (one day) we could add a specific test for using ticks_us() while interrupts are disabled, and specify which ports that actually works on, and maybe fix other ports.
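
A sketch of what such a fixed-iteration test might look like follows. It is not the test file from this PR. `machine.disable_irq()`, `machine.enable_irq()` and `time.ticks_us()` are existing MicroPython APIs, while `run_with_irqs_disabled` and the fallback stubs are hypothetical, added only so the snippet also runs off-target under CPython.

```python
try:
    from machine import disable_irq, enable_irq
    from time import ticks_us
except ImportError:
    # Hypothetical stand-ins so the sketch also runs under CPython.
    import time

    def disable_irq():
        return None

    def enable_irq(state):
        pass

    def ticks_us():
        return time.monotonic_ns() // 1000


def run_with_irqs_disabled(iterations=100):
    """Call ticks_us() a fixed number of times with IRQs disabled."""
    state = disable_irq()
    try:
        count = 0
        for _ in range(iterations):
            ticks_us()  # result ignored; the call is just some work to do
            count += 1
    finally:
        # Always restore the previous IRQ state, even on an exception.
        enable_irq(state)
    return count


print(run_with_irqs_disabled())
```

Because the loop length is fixed, it does not depend on the tick counter advancing while interrupts are off, which is the property the suggestion relies on.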

Contributor Author

@projectgus projectgus Oct 10, 2024

Argh, of course!

Have implemented the suggestion, confirmed that the test still fails on ESP32-C3 on master and passes with the fix. Pyboard V1.1 also passes this test now.

Member

Thanks for updating. I also tested on PYBD-SF2 and it works now.

Regression introduced in 337742f.

The hang occurs because the esp32 port was calling "from ISR" port-layer
functions to set/clear the interrupt mask. FreeRTOS kernel therefore
doesn't know the CPU is in a critical section. In taskYIELD() the riscv
port layer blocks after yielding until it knows the yield has happened, and
would block indefinitely if IRQs are disabled (until INT WDT triggers).

Moving to the "public" portENTER_CRITICAL/portEXIT_CRITICAL API means that
FreeRTOS knows we're in a critical section and can react accordingly.

Adds a regression test for this case (should be safe to run on all ports).

On single core CPUs, this should result in almost exactly the same
behaviour apart from fixing this case.

On dual core CPUs, we now have cross-CPU mutual exclusion for atomic
sections. This also shouldn't change anything, mostly because all the code
which enters an atomic section runs on the same CPU. If it does change
something, it will be to fix a thread safety bug.

There is some risk that this change triggers a FreeRTOS crash where there
is a call to a blocking FreeRTOS API with interrupts disabled. Previously
this code might have worked, but was probably thread unsafe and would have
hung in some circumstances.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <[email protected]>
@projectgus projectgus force-pushed the bugfix/esp32_yield_ints_disabled branch from 570b8c5 to 05ac693 Compare October 10, 2024 00:00
@dpgeorge dpgeorge merged commit 05ac693 into micropython:master Oct 10, 2024
29 checks passed
Development

Successfully merging this pull request may close these issues.

esp32c3: calling ticks_us() for >=32 times will cause hang when IRQ is disabled
3 participants