fix(tls): Android wss:// CertificateBundleLoadFailure - merge CA trust across WSS + matchmaking (#24) #25
Android release builds intermittently failed WSS with a certificate-bundle load failure: connections worked for some users/devices and failed for others against the same endpoint.

Root cause was the CA trust-store selection in ws_tls_init(), which picked a single source in priority order (system -> bundled Mozilla -> settings):

- On Android the device system store (scanned by Zig's rescan of /system/etc/security/cacerts) shadowed the comprehensive bundled Mozilla set whenever it loaded. If that device-specific store lacked the server's root, or failed to parse, TLS aborted with no fallback to the bundled roots, so success depended on the device's filesystem/cert contents.
- The settings/override source (priority 3) was unreachable because the bundled set (priority 2) is always compiled in and non-empty, so Godot's certificate_bundle_override was silently ignored.

Now every available CA source is merged into one mbedtls chain (mbedtls_x509_crt_parse appends and parses permissively): the bundled Mozilla roots are always loaded as a device-independent baseline, with the system store and any explicit override layered on top for private/enterprise roots. TLS init only fails if the resulting chain is empty, and that failure now surfaces a specific close reason instead of a generic "TLS init failed".

Godot binding: the override loader now resolves res:// and user:// paths via ProjectSettings.globalize_path() before reading (raw fopen() could not open Godot virtual paths, which is why the documented user:// workaround failed on Android). The binding now provides only an explicit override (the core owns the bundled baseline), avoiding a redundant second parse of the bundle.

Assisted-by: Claude Opus 4.8
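The merge policy described in this commit can be sketched in isolation. This is a minimal model, not the SDK's code: `ca_source_t`, `ca_chain_merge`, `ca_chain_usable`, the certificate counts, and the error string are all hypothetical stand-ins, and no mbedTLS call is made. It only shows the rule that a failing source is skipped and init fails solely when the merged chain is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CA source (bundled, system, or override).
 * certs_available stands in for "how many roots parsed from this source";
 * a negative value models a source that failed to load or parse. */
typedef struct {
    const char* name;
    int certs_available;
} ca_source_t;

/* Merge every source into one chain and return the merged root count.
 * A failing or empty source is skipped instead of aborting the init. */
static int ca_chain_merge(const ca_source_t* sources, size_t count) {
    assert(sources != NULL || count == 0);
    int merged = 0;
    for (size_t i = 0; i < count; i++) {
        if (sources[i].certs_available > 0) {
            merged += sources[i].certs_available;
        }
    }
    return merged;
}

/* Init is usable as long as at least one root ended up in the chain.
 * The error text is a placeholder, not the SDK's actual close reason. */
static bool ca_chain_usable(const ca_source_t* sources, size_t count,
                            const char** out_err) {
    if (ca_chain_merge(sources, count) == 0) {
        if (out_err) *out_err = "no CA certificates could be loaded";
        return false;
    }
    return true;
}
```

Under this model, a device whose system store fails to parse still ends up with the bundled roots plus any override, which is the device-independence the commit is after.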
…chmaking

Adds isolated WSS/TLS verification tests and the two fixes needed to make a self-signed wss:// connection actually validatable end to end.

http.zig: HTTPS matchmaking used std.http.Client, which only trusts the OS system store rescan; it ignored both the bundled Mozilla roots and the certificate_bundle_override. So matchmaking could fail where the WSS socket (post-#24) succeeds, notably on Android. setupCaBundle() now seeds the client trust store with system + bundled Mozilla + settings override, mirroring ws_tls_init. Only runs for secure requests; plain http:// is untouched. The settings mirror is extended to the real layout (adds tls_skip_verification, ca_pem_data, ca_pem_len); offsets verified against settings.h.

websocket_transport.c: ws_tls_handshake_tick returned the same false for both "want read/write" and fatal errors, so a certificate verification failure spun the state machine forever instead of erroring. It now returns a status; fatal results close with code 1015 and a specific reason ("TLS certificate verification failed"), surfacing the failure #24 asked to diagnose.

Tests (tests/test_tls.zig) drive the transport directly against a zero-dep self-signed wss echo server (tests/tls/), bypassing matchmaking:

- trusted CA via settings -> handshake succeeds + echo (the #24 regression: the override CA must be honored, not shadowed)
- wrong CA / bundled-only -> verification fails, never opens
- tls_skip_verification -> opens regardless

Certs are generated on demand (gitignored) and wired into tests.yml.

Assisted-by: Claude Opus 4.8
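The bool-to-status change can be illustrated with a small sketch. Only `WS_TLS_HS_CERT_FAILED` and the reason string "TLS certificate verification failed" appear in this PR; the other enum members, the `STEP_*` result codes, the helper names, and the generic "TLS handshake failed" text are assumptions made for illustration and do not correspond to real TLS library constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes for one handshake step. These are stand-ins,
 * not the TLS library's real error values. */
enum {
    STEP_OK = 0,
    STEP_WANT_READ = -1,
    STEP_WANT_WRITE = -2,
    STEP_CERT_VERIFY_FAILED = -3,
    STEP_OTHER_ERROR = -4
};

typedef enum {
    WS_TLS_HS_IN_PROGRESS,  /* assumed name: retry on the next tick */
    WS_TLS_HS_DONE,         /* assumed name: move on to the HTTP upgrade */
    WS_TLS_HS_CERT_FAILED,  /* named in the PR diff */
    WS_TLS_HS_FAILED        /* assumed name: any other fatal error */
} ws_tls_hs_status_t;

/* Separate "try again later" from fatal results, which a single bool
 * return value could not express. */
static ws_tls_hs_status_t classify_handshake_step(int rc) {
    switch (rc) {
        case STEP_OK:                 return WS_TLS_HS_DONE;
        case STEP_WANT_READ:
        case STEP_WANT_WRITE:         return WS_TLS_HS_IN_PROGRESS;
        case STEP_CERT_VERIFY_FAILED: return WS_TLS_HS_CERT_FAILED;
        default:                      return WS_TLS_HS_FAILED;
    }
}

static bool handshake_status_is_fatal(ws_tls_hs_status_t s) {
    return s == WS_TLS_HS_CERT_FAILED || s == WS_TLS_HS_FAILED;
}

/* Close reason for a fatal status; must only be called for fatal ones. */
static const char* handshake_close_reason(ws_tls_hs_status_t s) {
    assert(handshake_status_is_fatal(s));
    return s == WS_TLS_HS_CERT_FAILED
        ? "TLS certificate verification failed"
        : "TLS handshake failed";
}
```

With a status like this, the caller can keep ticking on `WS_TLS_HS_IN_PROGRESS` and close immediately on a fatal value, instead of retrying a verification failure forever.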
Adds test_wss.gd, which joins example-server's "my_room" over wss:// through a TLS proxy (tls-proxy.mjs) that fronts the plain server on 2568. It exercises the full secure path the ws:// tests don't: the binding loading network/tls/certificate_bundle_override (res:// resolved via globalize_path), HTTPS matchmaking trusting that override (http.zig), and the WSS handshake verifying against it (ws_tls_init).

project.godot points the override at res://tls/ca.pem (generated on demand, gitignored; a non-fatal no-op when absent, so other tests are unaffected). run-tests.sh generates the certs and starts/stops the proxy automatically.

The client connects via the `localhost` hostname rather than 127.0.0.1: Zig's std.http (matchmaking) verifies DNS-name SANs but not IP-address SANs, and the proxy listens dual-stack so localhost resolves either way.

Verified locally against Godot 4.6.1 + example-server: 39/39 GUT tests pass.

Assisted-by: Claude Opus 4.8
Bumps the Godot SDK to 0.17.11-rc.1 (pre-release) and documents the #24 TLS fixes in the CHANGELOG: merged CA trust chain so Android wss:// verifies reliably, shared trust for HTTPS matchmaking, surfaced TLS handshake failures, and res:// and user:// certificate_bundle_override support.

Reported by @pierroo in #24.

Assisted-by: Claude Opus 4.8
The Windows CI job failed: tests/tls/gen-certs.sh exited 1 because it used bash process substitution (-extfile <(...)), which the native openssl on the Git-Bash runner can't open; the echo server then never started and the always()-run log step also errored because it ran in PowerShell (cat on a missing wss.log).

- gen-certs.sh: write the SAN extfile to a real temp file instead of <(...).
- build.zig: skip test_tls when the target OS is Windows (the TLS code is OS-agnostic and is exercised on Linux + macOS).
- tests.yml: gate the cert/echo-server fixture steps to non-Windows, and run "Print server logs" under bash so `cat … || true` is safe on all runners.

Assisted-by: Claude Opus 4.8
bsharma-imperium left a comment
LGTM.
Added a few comments.
        data->state = COLYSEUS_WS_HANDSHAKE_SENDING;
        ws_http_handshake_init(data);
    } else if (hs == WS_TLS_HS_CERT_FAILED) {
        ws_close_impl(transport, 1015, "TLS certificate verification failed");
RFC 6455 reserves 1015 for browser-internal TLS signaling, i.e. connection failures caused by a TLS handshake error. Shall we swap this to something from the application-defined range (4000–4999) to be safe? Maybe 4015?
Good point in general, but 1015 never reaches the wire on this path. ws_close_impl only queues a Close frame when state == COLYSEUS_WS_CONNECTED (websocket_transport.c:253), and a TLS handshake failure happens at COLYSEUS_WS_TLS_HANDSHAKE — before the HTTP upgrade and before any wslay context exists — so nothing is transmitted to the peer. It is purely a local code passed to on_close, which is exactly 1015's RFC-intended use, and it matches the SDK already using reserved 1006 as a local code here (including the sibling "TLS init failed"). A 4015 would land inside Colyseus's 4xxx protocol-code range (protocol.h: CONSENTED / SERVER_SHUTDOWN / MAY_TRY_RECONNECT / …) where it has no defined meaning, so it'd actually be less consistent. It's also reconnect-safe: not in room_close_code_is_drop (room.c:518), and the !has_joined guard means a TLS-stage close can't reach the reconnect path anyway. Keeping 1015.
Cool. Thanks for the context.
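The distinction discussed in this thread can be sketched as two small predicates. The code ranges follow RFC 6455 (1005, 1006 and 1015 are reserved and must not be set in a Close frame by an endpoint; 4000-4999 is the private-use range). Everything else is an assumption for illustration: `conn_state_t`, the function names, and in particular the extra reserved-code guard in `should_queue_close_frame`, which the thread does not say `ws_close_impl` performs (it only describes the state check).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RFC 6455, section 7.4.1: these codes are reserved for local reporting
 * and must not be set in a Close frame by an endpoint. */
static bool ws_close_code_is_local_only(uint16_t code) {
    return code == 1005 || code == 1006 || code == 1015;
}

/* RFC 6455, section 7.4.2: 4000-4999 is reserved for private use. */
static bool ws_close_code_is_private_use(uint16_t code) {
    return code >= 4000 && code <= 4999;
}

/* Hypothetical two-state model of the transport. */
typedef enum { STATE_TLS_HANDSHAKE, STATE_CONNECTED } conn_state_t;

/* A Close frame is only queued once the WebSocket is connected, as the
 * reply describes. Refusing local-only codes here is an added assumption,
 * not behaviour confirmed for the SDK. */
static bool should_queue_close_frame(conn_state_t state, uint16_t code) {
    assert(code >= 1000 && code <= 4999);
    return state == STATE_CONNECTED && !ws_close_code_is_local_only(code);
}
```

In this model a 1015 raised during the TLS handshake is reported locally and never queued for the peer, while a 4015 would be a wire-legal private-use code that still needs an application-level meaning.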
        return;

    if (!load_certificates_from_godot_path(override_path)) {
        fprintf(stderr, "[Colyseus] Failed to load certificate override; falling back to built-in CA roots\n");
Maybe also add what the next steps should be if we hit this error path, something like:
"If using a packed res:// asset, copy it to user:// first."
Done in 0cb1f4c — the fallback log now ends with (hint: copy a packed res:// asset to user:// first). Phrased as a hint rather than asserting res:// is the cause, since a res:// override works fine when running from the editor/source (it only fails for packed assets on an exported build).
    }

    static bool ws_tls_init(colyseus_ws_transport_data_t* data, const char** out_err) {
        colyseus_tls_context_t* tls = malloc(sizeof(colyseus_tls_context_t));
Pre-existing nitpick: Should we switch to calloc instead? If tls_skip_verification = true, ca_chain_initialized would be garbage.
ca_chain_initialized is actually set to 0 explicitly at websocket_transport.c:986 — unconditionally, before the tls_skip_verify branch and before data->tls_ctx = tls is published (which is the only way ws_tls_cleanup becomes reachable). So in the skip-verify path it's 0, not garbage, and cleanup correctly skips mbedtls_x509_crt_free. calloc here would just redundantly zero the large mbedTLS sub-structs (ssl/conf/entropy/ctr_drbg) that get re-init'd right after, without fixing a real bug — so leaving it as malloc. Thanks for the careful read though!
OK. It's old code; I didn't really look at the init, it just caught my eye that ca_chain_initialized might not be set. Lol.
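The malloc-versus-calloc point in this thread comes down to initializing the one flag that cleanup reads before the context can be reached by cleanup. A minimal sketch under stated assumptions: `tls_ctx_model_t`, `tls_ctx_create` and `tls_ctx_destroy` are hypothetical, the `scratch` buffer merely stands in for the large mbedTLS sub-structs, and the chain load itself is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the TLS context. */
typedef struct {
    unsigned char scratch[4096];  /* stand-in for sub-structs re-initialized later */
    int ca_chain_initialized;     /* the only field cleanup inspects */
} tls_ctx_model_t;

/* malloc leaves the memory indeterminate, so the flag is set explicitly
 * and unconditionally, before the skip-verify branch. */
static tls_ctx_model_t* tls_ctx_create(bool skip_verify) {
    tls_ctx_model_t* tls = malloc(sizeof *tls);
    if (!tls) return NULL;
    tls->ca_chain_initialized = 0;
    if (!skip_verify) {
        /* the real code would build the CA chain here */
        tls->ca_chain_initialized = 1;
    }
    return tls;
}

/* Returns 1 if a CA chain had to be released, 0 otherwise. */
static int tls_ctx_destroy(tls_ctx_model_t* tls) {
    assert(tls != NULL);
    int released_chain = tls->ca_chain_initialized ? 1 : 0;
    free(tls);
    return released_chain;
}
```

The trade-off is the one given in the reply: calloc would also produce a zero flag, at the cost of zeroing memory that is initialized again right afterwards.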
Addresses PR #25 review feedback (@imperiumplay): when a certificate_bundle_override fails to load — most commonly a packed res:// asset on an exported build, which has no OS path for fopen() — point the user at the fix (copy it to user://). Assisted-by: Claude Opus 4.8
Thank you so much for the review @imperiumplay 🥳
Fixes #24: Android `wss://` release builds intermittently failed with a TLS certificate-bundle load failure (worked for some devices/users, failed for others against the same endpoint). Reported by @pierroo.

Root cause

The SDK runs its own mbedTLS handshake (not Godot's TLS). It picked a single CA source in priority order (system store → bundled Mozilla → settings override), so on Android a device's system trust store could shadow the comprehensive bundled roots and abort TLS entirely if it lacked the server's root or failed to parse. That's the device-dependence. The override was also unreachable (the always-present bundled set sat ahead of it). Separately, HTTPS matchmaking uses a different trust stack (Zig `std.http`) that trusted only the OS store: the same Android exposure, before the socket connects.

Changes

- `websocket_transport.c`: merge all CA sources into one mbedTLS chain (bundled Mozilla always loaded as a device-independent baseline; system + override layered on top). Fail only if the chain is empty.
- `websocket_transport.c`: surface fatal TLS handshake failures. Previously a cert-verification failure spun the state machine forever (silent hang); now it closes with code `1015` + a specific reason.
- `http.zig`: HTTPS matchmaking now seeds `std.http.Client`'s trust store with system + bundled Mozilla + settings override, mirroring the WSS path (secure requests only; `http://` untouched).
- `tls_certificates.c`: `network/tls/certificate_bundle_override` now honors `res://` / `user://` paths via `ProjectSettings.globalize_path()` (raw `fopen()` couldn't open Godot virtual paths, which is why the `user://` workaround silently fell back).

Tests

- `tests/test_tls.zig`: drives the transport directly against a self-signed `wss://` echo server (no matchmaking): trusted-CA-via-settings → connects (the #24 regression), wrong/absent CA → fails verification, `tls_skip_verification` → connects. Wired into `tests.yml`.
- `platforms/godot/tests/test/test_wss.gd`: full `join_or_create` over `wss://` through a self-signed TLS proxy fronting `example-server`. Validated locally on Godot 4.6.1 → 39/39 GUT pass. Runs via `run-tests.sh` (generates certs + starts the proxy).

Notes

- `0.17.11-rc.1` (pre-release) so @pierroo can verify on the affected Android devices before a stable `0.17.11`; the device-level manifestation is the one thing we can't reproduce in CI.
- `std.http` doesn't verify IP-address SANs (only DNS/CN), so the Godot test connects via the `localhost` hostname; real deployments use hostnames, so this is an edge case.

🤖 Generated with Claude Code