esp32: Fix uneven GIL allocation between Python threads. #15476
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #15476 +/- ##
=======================================
Coverage 98.43% 98.43%
=======================================
Files 161 161
Lines 21275 21275
=======================================
Hits 20942 20942
Misses 333 333
When the GIL is enabled, the GC mutex is not used (and neither is the qstr mutex). The GIL is "global", so it is enough to make access exclusive across all threads. But mutexes are also used for
Explicitly yield each time a thread mutex is unlocked.

Key to understanding this bug is that Python threads run at equal RTOS priority. Although the ESP-IDF FreeRTOS scheduler (and I think the vanilla FreeRTOS scheduler) will round-robin equal-priority tasks in the ready state, it does not make a similar guarantee for tasks moving between ready and waiting.

The pathological case of this bug is when one Python thread task is busy (i.e. never blocks): it will hog the CPU more than expected, sometimes for an unbounded amount of time. This happens even though it periodically unlocks the GIL to allow another task to run.

Assume T1 is busy and T2 is blocked waiting for the GIL. T1 is executing and hits a condition to yield execution:

1. T1 calls MP_THREAD_GIL_EXIT.
2. FreeRTOS sees T2 is waiting for the GIL and moves it to the Ready list (but does not preempt, as T2 is the same priority, so T1 keeps running).
3. T1 immediately calls MP_THREAD_GIL_ENTER and re-takes the GIL.
4. A pre-emptive context switch happens, T2 wakes up, sees the GIL is not available, and goes on the waiting list for the GIL again.

To break this cycle step 4 must happen before step 3, but this may be a very narrow window of time, so it may not happen regularly. Quantisation of the timing of the tick interrupt that triggers a context switch may mean it never happens.

Yielding at the end of step 2 maximises the chance for another task to run.

Adds a test that fails on esp32 before this fix and passes afterwards.

Fixes issue micropython#15423.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <[email protected]>
Before the fix in the parent commit, some of these tests hung indefinitely. After it, they seem to pass consistently.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <[email protected]>
Also good that you enabled thread tests on esp32. I think if they had already been enabled a long time ago, we probably would have caught this issue when updating the IDF to v5.
Probably, at least one of the existing tests hangs intermittently without this fix.
I tested this and it works well, fixes the original issue. Thank you!
Summary
Explicitly yield each time a thread mutex is unlocked. Closes #15423.
Key to understanding this bug is that Python threads run at equal RTOS priority. Although the ESP-IDF FreeRTOS scheduler (and I think the vanilla FreeRTOS scheduler) will round-robin equal-priority tasks in the ready state, it does not make a similar guarantee for tasks moving between ready and waiting.
The pathological case of this bug is when one Python thread task is busy (i.e. never blocks): it will hog the CPU more than expected, sometimes for an unbounded amount of time. This happens even though it periodically unlocks the GIL to allow another task to run.
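Assume T1 is busy and T2 is blocked waiting for the GIL. T1 is executing and hits a condition to yield execution:
1. T1 calls MP_THREAD_GIL_EXIT.
2. FreeRTOS sees T2 is waiting for the GIL and moves it to the Ready list (but does not preempt, as T2 is the same priority, so T1 keeps running).
3. T1 immediately calls MP_THREAD_GIL_ENTER and re-takes the GIL.
4. A pre-emptive context switch happens, T2 wakes up, sees the GIL is not available, and goes on the waiting list for the GIL again.
To break this cycle step 4 must happen before step 3, but this may be a very narrow window of time, so it may not happen regularly. Quantisation of the timing of the tick interrupt that triggers a context switch may mean it never happens.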
Yielding at the end of step 2 maximises the chance for another task to run.
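The cycle described above can be illustrated with a small discrete-time model. This is only a sketch, not FreeRTOS or the port's code: the function name `simulate`, the parameters `tick_len` and `work_len`, and the assumption that an unlock and a lock each take one time unit are all invented for illustration. The tick length is deliberately a multiple of the busy loop's period (8 work units plus unlock plus lock), so the tick always lands while T1 holds the GIL.

```python
from collections import deque


def simulate(yield_on_unlock, total_time=1000, tick_len=100, work_len=8):
    """Toy model of two equal-priority tasks sharing a GIL.

    Returns [T1, T2] counts of GIL acquisitions. All names and timings
    here are hypothetical; this is not the esp32 port's implementation.
    """
    T1, T2 = 0, 1
    pc = ["work", "enter"]     # next action of each task
    left = [work_len, 0]       # work units remaining while holding the GIL
    acquired = [1, 0]          # T1 starts out holding the GIL
    gil_owner = T1
    running = T1
    ready = deque()            # ready tasks, in round-robin order
    waiting = deque([T2])      # tasks blocked on the GIL

    for now in range(total_time):
        # Tick interrupt: round-robin between equal-priority ready tasks.
        if now % tick_len == 0 and ready:
            ready.append(running)
            running = ready.popleft()

        # A task that finds the GIL taken blocks at once (modelled as
        # costing no time) and the next ready task runs instead.
        while pc[running] == "enter" and gil_owner is not None:
            waiting.append(running)
            running = ready.popleft()

        task = running
        if pc[task] == "enter":        # MP_THREAD_GIL_ENTER succeeds
            gil_owner = task
            acquired[task] += 1
            pc[task], left[task] = "work", work_len
        elif pc[task] == "work":       # run bytecode while holding the GIL
            left[task] -= 1
            if left[task] == 0:
                pc[task] = "exit"
        else:                          # MP_THREAD_GIL_EXIT
            gil_owner = None
            if waiting:                # waiter becomes ready, no preemption
                ready.append(waiting.popleft())
            pc[task] = "enter"
            if yield_on_unlock and ready:
                # The idea of the fix: yield right after unlocking.
                ready.append(task)
                running = ready.popleft()
    return acquired


print("without yield:", simulate(False))
print("with yield:   ", simulate(True))
```

In this model the waiting task never acquires the GIL without the yield, while with the yield the two tasks alternate. With a `tick_len` that is not a multiple of the loop period (for example 99), the tick can land in the window between unlock and re-lock, and the GIL can change hands occasionally.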
Testing
Adds a test case which is based on the code in the linked bug report. This test fails consistently on esp32 without this fix, and passes afterwards. The test case also passes on rp2, which is the only other port where thread tests are currently enabled, and passes when run manually on the unix port.
This PR also includes a commit to enable the thread tests on the esp32 port.
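For readers without the linked bug report at hand, a fairness check of this general kind might look like the sketch below. It is not the test added by this PR: the function name `run_fairness_check`, its parameters, and the pass criterion are assumptions made for illustration, and it uses only `_thread.start_new_thread` and `time.sleep`.

```python
import _thread
import time


def run_fairness_check(n_threads=2, duration_s=0.5):
    """Start busy threads and return how many loop iterations each ran.

    Hypothetical sketch: the real test in the PR may be structured
    quite differently.
    """
    counts = [0] * n_threads
    finished = [False] * n_threads
    stop = [False]

    def busy(idx):
        # Busy loop: never sleeps or blocks, so it only gives up the GIL
        # when the VM decides to release it.
        while not stop[0]:
            counts[idx] += 1
        finished[idx] = True

    for i in range(n_threads):
        _thread.start_new_thread(busy, (i,))

    # The main thread blocks here, leaving the busy threads to compete.
    time.sleep(duration_s)
    stop[0] = True
    while not all(finished):
        time.sleep(0.01)
    return counts


counts = run_fairness_check()
# If one busy thread starves the others, some counter stays at zero.
print("all threads made progress:", all(c > 0 for c in counts))
```

A check like this only detects complete starvation. A stricter version could compare the counters against each other, at the cost of being more sensitive to timing differences between ports.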
Trade-offs and Alternatives
Could make a port-specific MP_THREAD_GIL_EXIT macro that yields, to avoid the overhead of yielding any time a thread mutex is unlocked. However, almost all of the thread mutex lock/unlock events in the esp32 port are the GIL (the other thread mutexes are the GC mutex and the mutex that protects thread creation and cleanup).
This work was funded through GitHub Sponsors.