Fix Intermittent ESP32 Crashes by Wrapping longjmp #10328

Merged
merged 2 commits into from
May 9, 2025

Conversation

eightycc
Collaborator

@eightycc eightycc commented May 9, 2025

This PR resolves issues where Espressif parts with Xtensa cores would crash intermittently, typically while idling with socket polling taking place. Crashes were most often double-faults or WDT timeouts. The root cause was the use of a ROM version of longjmp that exposes a window of about 6 instructions that manipulate the register windowing control registers. If an interrupt occurs inside this window, a register windowing control register can become corrupted through interaction with a FreeRTOS context switch.

This PR:

  • Wraps the faulty longjmp with an Espressif-provided __wrap_longjmp function that creates a critical section around the register window manipulation instructions. It does this by briefly disabling interrupts.
  • Adds support for JTAG debugging by allowing use of ENABLE_JTAG=1 when building.

Resolves #9937 and #9428. May also resolve #9003 and #9460.

Tested by running test from #9428 for more than 36 hours without failure. This test typically failed in 2 to 12 hours.

eightycc added 2 commits May 8, 2025 07:17
o For Xtensa cores, wraps `longjmp` with a patched implementation that protects register window update in a critical section
o Adds support for `ENABLE_JTAG=1`
Member

@gamblor21 gamblor21 left a comment

Did not have time to test but the code changes look good to me. Great job tracking this down, interrupts during critical sections are never fun to debug.

Member

@tannewt tannewt left a comment

Nice work! Thank you!

@tannewt tannewt merged commit 781fb9e into adafruit:main May 9, 2025
240 checks passed
Member

@anecdata anecdata left a comment

Testing using three pairs of QT Py boards, one of each pair running 10.0.0-alpha.4, the other running artifacts from this PR:

  • (2) ESP32-S2 running HTTP server
  • (2) ESP32-S2 running HTTPS server
  • (2) ESP32-S3 running HTTPS server

Looks good so far. Each of the 10.0.0-alpha.4 boards has already encountered a safemode. None of the PR#10328 boards have had any issues. Will let it run for a few days and report if anything unexpected happens.

HTTPS Server code.py
import time
import os
import microcontroller
import supervisor
import storage
import gc
import board
import digitalio
import traceback
import wifi
import socketpool
import ssl
import adafruit_connection_manager
from adafruit_httpserver import Server, Request, Response, REQUEST_HANDLED_RESPONSE_SENT

time.sleep(3)  # wait for serial
print(f"{gc.mem_free()=}")

if microcontroller.nvm[0] == 0xAA:
    print(f'{"="*25}\nreset from safemode.py')
    microcontroller.nvm[0] = 0x55
else:
    print(f'{"="*25}\nreset clean')

wifi.radio.connect(os.getenv("WIFI_SSID"), os.getenv("WIFI_PASSWORD"))
pool = adafruit_connection_manager.get_radio_socketpool(wifi.radio)
server = Server(
    pool,
    root_path="/static",
    https=True,
    certfile="cert.pem",
    keyfile="key.pem",
    debug=True,
)

@server.route("/")
def base(request: Request):
    resp = f"{storage.getmount('/').label} t={time.monotonic_ns()} mem={gc.mem_free()}"
    print(resp)
    return Response(request, resp)

server.start(str(wifi.radio.ipv4_address))
while True:
    try:
        server.poll()
    except Exception as ex:
        traceback.print_exception(ex, ex, ex.__traceback__)

HTTP Server is the same, but with the server configured without HTTPS.

No clients are attempting to connect.

Also, there is a safemode.py file that will automatically reset the board, and set nvm to indicate safemode.
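
The safemode.py file itself was not posted. A minimal sketch of what such a file might look like follows, assuming it only needs to set the 0xAA marker that code.py checks in nvm[0] and then reset. The helper name `mark_safemode_and_reset` is hypothetical, and the import guard is there only so the logic can be exercised off-device.

```python
# safemode.py (sketch): flag the safe-mode event in NVM, then reset the board.
# Assumption: code.py reads nvm[0] and treats 0xAA as "reset from safemode.py".

SAFEMODE_FLAG = 0xAA  # same marker value that code.py compares against


def mark_safemode_and_reset(nvm, reset):
    """Store the marker byte in `nvm`, then call `reset` (hypothetical helper)."""
    nvm[0] = SAFEMODE_FLAG
    reset()


try:
    import microcontroller  # only available on CircuitPython
except ImportError:
    microcontroller = None  # running on a host; nothing to reset

if microcontroller is not None:
    mark_safemode_and_reset(microcontroller.nvm, microcontroller.reset)
```

On the next boot, code.py sees 0xAA, prints "reset from safemode.py", and rewrites the byte to 0x55 so a later clean reset can be told apart.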

Result:

With over 60 hours of completely automatic operation, pre-PR S2 & S3 devices experienced numerous intermittent safemodes and resets (as expected), and PR S2 & S3 devices ran continuously without any issues at all (as expected).

Just for fun, started up a client to repeatedly make requests on each device to confirm the servers were still operational: all servers are serving successfully. HTTP response times typically under 50ms, HTTPS response times typically just over 1/3 of a second. Measurements are from the client perspective. The first request to a device usually takes longer, sometimes much longer, probably due to TCP and / or TLS setup.
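
The client used for these measurements was not posted. A rough host-side sketch of one way to time repeated requests is below, assuming plain CPython with `urllib`. The names `fetch` and `time_requests` are hypothetical, and certificate verification is turned off by default on the assumption that the boards serve a self-signed cert.pem.

```python
import ssl
import time
import urllib.request


def fetch(url, timeout=10.0, verify_tls=False):
    """GET `url` and return (status, body). TLS verification is off by default
    (assumption: the device presents a self-signed certificate)."""
    context = None
    if url.startswith("https://"):
        context = ssl.create_default_context()
        if not verify_tls:
            context.check_hostname = False
            context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.status, resp.read()


def time_requests(url, count=5, fetch_fn=fetch, clock=time.monotonic):
    """Issue `count` requests and return each one's elapsed time in seconds,
    measured from the client side."""
    timings = []
    for _ in range(count):
        start = clock()
        fetch_fn(url)
        timings.append(clock() - start)
    return timings
```

Calling `time_requests("https://<device-ip>/", count=20)` and comparing the first entry with the rest would show the slower first request separately from the steady-state times.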

Resounding success, thanks @eightycc !

@dhalbert
Collaborator

dhalbert commented May 13, 2025

Is this issue present in MicroPython upstream? If so, maybe worth opening an issue there to point it out to them.

EDIT: I misunderstood whether this was RISCV or not. Revised the text above.

@eightycc
Collaborator Author

eightycc commented May 13, 2025

> Is this issue present in MicroPython upstream? If so, maybe worth opening an issue there to point it out to them.

Good thought. I'll have a look.

@dhalbert Because MicroPython links ESP-IDF using ESP-IDF's CMake build (CircuitPython uses its own custom Makefile), it automatically picks up the wrapper. Verified by building MicroPython and examining micropython.map.
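
For anyone who wants to repeat that check, a small sketch that lists the lines of a linker map mentioning the wrapper symbol is below. It assumes only that the symbol name `__wrap_longjmp` appears verbatim in the map. The function name `find_symbol` is hypothetical, and the location of the map file depends on the build.

```python
def find_symbol(map_text, symbol="__wrap_longjmp"):
    """Return every line of a linker map's text that mentions `symbol`."""
    return [line.rstrip() for line in map_text.splitlines() if symbol in line]


def find_symbol_in_file(path, symbol="__wrap_longjmp"):
    """Read a map file from `path` and return the matching lines."""
    with open(path, "r", errors="replace") as f:
        return find_symbol(f.read(), symbol)
```

An empty result from `find_symbol_in_file("micropython.map")` would suggest the wrapper was not linked in. A non-empty result shows which object file supplied it.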

@dhalbert
Collaborator

> @dhalbert Because MicroPython links ESP-IDF using ESP-IDF's CMake build (CircuitPython uses its own custom Makefile), it automatically picks up the wrapper. Verified by building MicroPython and examining micropython.map.

Thanks for looking!

Development

Successfully merging this pull request may close these issues.

ESP32-S* Safe Mode when using HTTPS / TLS TCP server MemoryError with HTTPS server and SSLSocket.accept() on ESP32-S2, Pico W (possibly other)
5 participants