esp32: Fix uneven GIL allocation between Python threads. #15476

Merged

Conversation

projectgus
Contributor

@projectgus projectgus commented Jul 17, 2024

Summary

Explicitly yield each time a thread mutex is unlocked. Closes #15423.

The key to understanding this bug is that Python threads run at equal RTOS priority. Although the ESP-IDF FreeRTOS scheduler (and, I think, vanilla FreeRTOS) will round-robin equal-priority tasks in the ready state, it does not make a similar guarantee for tasks moving between ready and waiting.

In the pathological case of this bug, one Python thread task is busy (i.e. never blocks) and hogs the CPU more than expected, sometimes for an unbounded amount of time. This happens even though it periodically unlocks the GIL to allow another task to run.

Assume T1 is busy and T2 is blocked waiting for the GIL. T1 is executing and hits a condition to yield execution:

  1. T1 calls MP_THREAD_GIL_EXIT
  2. FreeRTOS sees T2 is waiting for the GIL and moves it to the Ready list (but does not preempt, as T2 has the same priority, so T1 keeps running).
  3. T1 immediately calls MP_THREAD_GIL_ENTER and re-takes the GIL.
  4. Pre-emptive context switch happens, T2 wakes up, sees GIL is not available, and goes on the waiting list for the GIL again.

To break this cycle, step 4 must happen before step 3. That window may be very narrow, so it may not happen regularly, and quantisation of the timing of the tick interrupt that triggers a context switch may mean it never happens.

Yielding at the end of step 2 maximises the chance for another task to run.
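
The four steps above can be illustrated with a toy, single-core model. This is only a sketch under stated assumptions, not FreeRTOS or MicroPython code: `simulate`, `gil_period` and `tick_period` are hypothetical names, time advances in whole steps, and the release and re-take of the GIL are assumed to fit inside one step, so a tick never lands between them.

```python
def simulate(yield_on_unlock, total_steps=1000, gil_period=32, tick_period=10):
    """Toy model: two equal-priority tasks share one GIL on one core.

    Returns how many steps each task spent running Python code.
    All names and numbers are illustrative assumptions.
    """
    running = "T1"        # task currently executing (it always holds the GIL)
    waiter_ready = False  # True once the GIL waiter was moved to the ready list
    work = {"T1": 0, "T2": 0}
    since_release = 0     # steps since the running task last released the GIL

    for step in range(1, total_steps + 1):
        work[running] += 1
        since_release += 1

        if since_release == gil_period:
            since_release = 0
            # Steps 1 and 2: GIL released, waiter becomes ready, no preemption.
            waiter_ready = True
            if yield_on_unlock:
                # The fix: yield right after unlocking, so the waiter runs
                # now and takes the GIL while it is still free.
                running = "T2" if running == "T1" else "T1"
                waiter_ready = False
            # Otherwise step 3: the same task re-takes the GIL within this
            # step, so `running` is unchanged.

        if step % tick_period == 0 and waiter_ready:
            # Step 4: the tick switches to the waiter, which finds the GIL
            # taken again and goes back to waiting; no work is done.
            waiter_ready = False

    return work


without_fix = simulate(yield_on_unlock=False)
with_fix = simulate(yield_on_unlock=True)
print("without yield:", without_fix)
print("with yield:   ", with_fix)
```

Under these assumptions T2 never gets to run without the yield, while with the yield the steps are split roughly evenly between the two tasks.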

Testing

Adds a test case based on the code in the linked bug report. This test fails consistently on esp32 without this fix and passes with it. The test case also passes on rp2, which is the only other port where thread tests are currently enabled, and passes when run manually on the unix port.

This PR also includes a commit to enable the thread tests on the esp32 port.
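
A test of this general shape might look like the sketch below. It is not the test added by this PR: the function name, step count and timeout are assumptions chosen for illustration. One thread spins without ever blocking, while a second thread has to re-acquire the GIL after every short sleep in order to make progress.

```python
import _thread
import time


def run_fairness_check(worker_steps=20, timeout_s=10):
    """Return (worker_finished, worker_steps_done) for a busy main thread.

    Hypothetical helper: the main thread never blocks, and the worker can
    only finish if it is given the GIL regularly.
    """
    state = {"steps": 0, "done": False}

    def worker():
        for _ in range(worker_steps):
            state["steps"] += 1
            time.sleep(0.001)  # block briefly, then re-acquire the GIL
        state["done"] = True

    _thread.start_new_thread(worker, ())

    # Busy loop: spin until the worker finishes or the deadline passes.
    # time.time() may have coarse resolution on some ports, so the timeout
    # is only approximate.
    deadline = time.time() + timeout_s
    while not state["done"] and time.time() < deadline:
        pass

    return state["done"], state["steps"]


finished, steps = run_fairness_check()
print("worker finished:", finished, "steps:", steps)
```

If the busy thread starves the worker, the loop exits on the timeout and `finished` is False, which is the failure mode described in the summary.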

Trade-offs and Alternatives

Could make a port-specific MP_THREAD_GIL_EXIT macro that yields, to avoid the overhead of yielding any time a thread mutex is unlocked. However almost all of the thread mutex lock/unlock events in the esp32 port are the GIL (the other thread mutexes are the GC mutex and the mutex that protects thread creation and cleanup).

This work was funded through GitHub Sponsors.


codecov bot commented Jul 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.43%. Comparing base (2994354) to head (46c3df0).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15476   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files         161      161           
  Lines       21275    21275           
=======================================
  Hits        20942    20942           
  Misses        333      333           


@projectgus projectgus force-pushed the bugfix/esp32_thread_starvation branch from 18f13e8 to 13596c6 Compare July 17, 2024 06:41
@dpgeorge
Member

> Could make a port-specific MP_THREAD_GIL_EXIT macro that yields, to avoid the overhead of yielding any time a thread mutex is unlocked. However almost all of the thread mutex lock/unlock events in the esp32 port are the GIL (the other thread mutexes are the GC mutex and the mutex that protects thread creation and cleanup).

When the GIL is enabled, the GC mutex is not used (nor is the qstr mutex). The GIL is "global", so it is enough to make all access across threads exclusive.

But mutexes are also used for _thread.allocate_lock() (regardless of whether the GIL is enabled or disabled). I think it's also good that they will now yield when released; that may improve thread cooperation on esp32 when user code uses locks at the Python level.
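
As a rough illustration of the Python-level case, the sketch below has two threads repeatedly taking and releasing one lock from `_thread.allocate_lock()`. The helper name `contend` and the iteration count are assumptions; the code only shows the usage pattern whose release path would now yield on esp32, and does not measure fairness.

```python
import _thread
import time


def contend(iterations=100):
    """Two threads increment their own counter under one shared lock."""
    lock = _thread.allocate_lock()
    state = {"main": 0, "worker": 0, "worker_done": False}

    def worker():
        for _ in range(iterations):
            with lock:  # acquire, then release on leaving the block
                state["worker"] += 1
        state["worker_done"] = True

    _thread.start_new_thread(worker, ())

    for _ in range(iterations):
        with lock:
            state["main"] += 1

    # Wait for the worker to finish, sleeping so it gets a chance to run.
    while not state["worker_done"]:
        time.sleep(0.001)

    return state["main"], state["worker"]


print(contend())
```

Each release in this loop is immediately followed by another acquire in the same thread, which is the same release-then-retake pattern as the GIL case described in the summary.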

Explicitly yield each time a thread mutex is unlocked.

Adds a test that fails on esp32 before this fix and passes afterwards.

Fixes issue micropython#15423.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <[email protected]>

Before the fix in the parent commit, some of these tests hung indefinitely.

After, they seem to consistently pass.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <[email protected]>
@dpgeorge dpgeorge force-pushed the bugfix/esp32_thread_starvation branch from 13596c6 to 46c3df0 Compare July 23, 2024 02:37
@dpgeorge
Member

Also good that you enabled thread tests on esp32. I think if they were already enabled a long time ago, we probably would have caught this issue when updating the IDF to v5.

@projectgus
Contributor Author

> I think if they were already enabled a long time ago, we probably would have caught this issue when updating the IDF to v5.

Probably. At least one of the existing tests hangs intermittently without this fix.

@dpgeorge dpgeorge merged commit 46c3df0 into micropython:master Jul 23, 2024
28 checks passed
@dpgeorge
Member

I tested this and it works well, fixes the original issue.

Thank you!

Successfully merging this pull request may close these issues.

Inconsistent thread behavior in different versions of MicroPython on ESP32
2 participants