Fix Intermittent ESP32 Crashes by Wrapping longjmp
#10328
- For Xtensa cores, wraps `longjmp` with a patched implementation that protects the register window update in a critical section
- Adds support for `ENABLE_JTAG=1`
Did not have time to test but the code changes look good to me. Great job tracking this down, interrupts during critical sections are never fun to debug.
Nice work! Thank you!
Testing using three pairs of QT Py boards, one of each pair running 10.0.0-alpha.4, the other running artifacts from this PR:
- (2) ESP32-S2 running HTTP server
- (2) ESP32-S2 running HTTPS server
- (2) ESP32-S3 running HTTPS server
Looks good so far. Each of the 10.0.0-alpha.4 boards has already encountered a safemode. None of the PR#10328 boards have had any issues. Will let it run for a few days and report if anything unexpected happens.
HTTPS Server code.py
import time
import os
import microcontroller
import supervisor
import storage
import gc
import board
import digitalio
import traceback
import wifi
import socketpool
import ssl
import adafruit_connection_manager
from adafruit_httpserver import Server, Request, Response, REQUEST_HANDLED_RESPONSE_SENT

time.sleep(3)  # wait for serial
print(f"{gc.mem_free()=}")

# safemode.py leaves 0xAA in nvm[0]; report it once, then clear the flag
if microcontroller.nvm[0] == 0xAA:
    print(f'{"="*25}\nreset from safemode.py')
    microcontroller.nvm[0] = 0x55
else:
    print(f'{"="*25}\nreset clean')

wifi.radio.connect(os.getenv("WIFI_SSID"), os.getenv("WIFI_PASSWORD"))
pool = adafruit_connection_manager.get_radio_socketpool(wifi.radio)

server = Server(
    pool,
    root_path="/static",
    https=True,
    certfile="cert.pem",
    keyfile="key.pem",
    debug=True,
)


@server.route("/")
def base(request: Request):
    # Report the filesystem label, a timestamp, and free heap
    resp = f"{storage.getmount('/').label} t={time.monotonic_ns()} mem={gc.mem_free()}"
    print(resp)
    return Response(request, resp)


server.start(str(wifi.radio.ipv4_address))

while True:
    try:
        server.poll()
    except Exception as ex:
        traceback.print_exception(ex, ex, ex.__traceback__)
HTTP Server is the same, but with the server configured without HTTPS.
No clients are attempting to connect.
Also, there is a `safemode.py` file that will automatically reset the board and set `nvm` to indicate safemode.
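The `safemode.py` used in this test was not posted. A minimal sketch of what such a file could look like is below, assuming it writes the same `0xAA` marker that `code.py` checks in `nvm[0]`. The helper name `flag_safemode` and the import guard are illustrative only, added so the logic can also be exercised off-device.

```python
# safemode.py (sketch, not the file used in the test above)
SAFEMODE_FLAG = 0xAA  # assumed to match the value code.py looks for in nvm[0]


def flag_safemode(nvm, reset):
    """Record that safe mode was entered, then restart the board."""
    nvm[0] = SAFEMODE_FLAG
    reset()


try:
    import microcontroller  # present on CircuitPython only
except ImportError:
    microcontroller = None

if microcontroller is not None:
    # On the board: set the flag in non-volatile memory and reset
    flag_safemode(microcontroller.nvm, microcontroller.reset)
```

After the reset, `code.py` sees `0xAA`, prints the "reset from safemode.py" banner, and rewrites the byte to `0x55` so the next clean boot is reported correctly.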
Result:
With over 60 hours of completely automatic operation, pre-PR S2 & S3 devices experienced numerous intermittent safemodes and resets (as expected), and PR S2 & S3 devices ran continuously without any issues at all (as expected).
Just for fun, started up a client to repeatedly make requests on each device to confirm the servers were still operational: all servers are serving successfully. HTTP response times typically under 50ms, HTTPS response times typically just over 1/3 of a second. Measurements are from the client perspective. The first request to a device usually takes longer, sometimes much longer, probably due to TCP and / or TLS setup.
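The client used for these measurements was not posted. A rough sketch of a client-side timing loop is below; the function names and the example address are hypothetical, and it uses only the Python standard library on a desktop machine.

```python
import time
import urllib.request


def time_requests(fetch, count):
    """Call fetch() `count` times; return each call's duration in milliseconds."""
    durations_ms = []
    for _ in range(count):
        start = time.monotonic()
        fetch()
        durations_ms.append((time.monotonic() - start) * 1000.0)
    return durations_ms


def make_fetcher(url, timeout=10.0):
    """Return a zero-argument callable that performs one GET and reads the body."""
    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    return fetch


# Example use against a hypothetical board address:
#   times = time_requests(make_fetcher("http://192.168.1.50/"), 10)
#   print(f"first={times[0]:.0f} ms, slowest={max(times):.0f} ms")
```

Note that `urllib.request` opens a new connection for every request, so each sample in this sketch includes connection setup. For the HTTPS boards the client would also need to be configured to accept the test certificate, which is not shown here.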
Resounding success, thanks @eightycc !
Is this issue present in MicroPython upstream? If so, maybe worth opening an issue there to point it out to them. EDIT: I misunderstood whether this was RISCV or not. Revised the text above.
Good thought. I'll have a look. @dhalbert Because MicroPython links ESP-IDF using ESP-IDF's CMake (CircuitPython uses its own custom
Thanks for looking!
This PR resolves issues where Espressif parts with Xtensa cores would crash intermittently, typically while idling with socket polling taking place. Crashes were most often double-faults or WDT timeouts. The root cause of the crashes was use of a version of `longjmp` in ROM that exposed a window of about 6 instructions that manipulated register windowing control registers. If an interrupt occurred inside this window, a register windowing control register could become corrupted due to interaction with a FreeRTOS context switch.
This PR:
- Wraps `longjmp` with an Espressif-provided `__wrap_longjmp` function that creates a critical section for the register window manipulation instructions. It does this by briefly disabling interrupts.
- Adds support for `ENABLE_JTAG=1` when building.
Resolves #9937 and #9428. May also resolve #9003 and #9460.
Tested by running the test from #9428 for more than 36 hours without failure. This test typically failed in 2 to 12 hours.
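To illustrate why masking interrupts around the update helps, here is a toy Python model of the race described above. It is not the ROM code or Espressif's wrapper: the two-field state, the invariant, and every name in it are invented for illustration. The only point is that a handler running between two dependent writes can observe (and act on) inconsistent state, while deferring it until both writes are done cannot.

```python
class ToyCore:
    """Toy model: two control fields that must be updated together."""

    def __init__(self):
        self.base = 0            # stands in for one windowing control register
        self.start_mask = 1      # stands in for another; bit `base` must be set
        self.irq_enabled = True
        self.pending_irq = False
        self.corrupt_seen = False

    def consistent(self):
        return bool(self.start_mask & (1 << self.base))

    def _handler(self):
        # A context switch would save and restore state here, assuming it is consistent
        if not self.consistent():
            self.corrupt_seen = True

    def raise_irq(self):
        """Deliver an interrupt now, or hold it while interrupts are masked."""
        if self.irq_enabled:
            self._handler()
        else:
            self.pending_irq = True

    def switch_window(self, new_base, critical):
        if critical:
            self.irq_enabled = False       # enter critical section
        self.base = new_base               # first write
        self.raise_irq()                   # interrupt arrives between the writes
        self.start_mask = 1 << new_base    # second write
        if critical:
            self.irq_enabled = True        # leave critical section
            if self.pending_irq:
                self.pending_irq = False
                self._handler()            # deferred interrupt runs on consistent state


unprotected = ToyCore()
unprotected.switch_window(3, critical=False)

protected = ToyCore()
protected.switch_window(3, critical=True)
```

In this model `unprotected.corrupt_seen` ends up `True` and `protected.corrupt_seen` stays `False`, which mirrors the before and after behavior reported in this PR only in spirit.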