fix(tls): Android wss:// CertificateBundleLoadFailure — merge CA trust across WSS + matchmaking (#24)#25

Merged: endel merged 6 commits into main from fix/android-wss-ca-trust on Jun 7, 2026
Conversation

endel (Member) commented Jun 7, 2026

Fixes #24 — Android wss:// release builds intermittently failed with a TLS certificate-bundle load failure (worked for some devices/users, failed for others against the same endpoint). Reported by @pierroo.

Root cause

The SDK runs its own mbedTLS handshake (not Godot's TLS). It picked a single CA source in priority order — system store → bundled Mozilla → settings override — so on Android a device's system trust store could shadow the comprehensive bundled roots and abort TLS entirely if it lacked the server's root or failed to parse. That's the device-dependence. The override was also unreachable (the always-present bundled set sat ahead of it). Separately, HTTPS matchmaking uses a different trust stack (Zig std.http) that trusted only the OS store — same Android exposure, before the socket connects.

Changes

  • Core (websocket_transport.c) — merge all CA sources into one mbedTLS chain (bundled Mozilla always loaded as a device-independent baseline; system + override layered on top). Fail only if the chain is empty.
  • Core (websocket_transport.c) — surface fatal TLS handshake failures: previously a cert-verification failure spun the state machine forever (silent hang); now closes with code 1015 + a specific reason.
  • Core (http.zig) — HTTPS matchmaking now seeds std.http.Client's trust store with system + bundled Mozilla + settings override, mirroring the WSS path (secure requests only; http:// untouched).
  • Godot binding (tls_certificates.c): network/tls/certificate_bundle_override now honors res:// and user:// paths via ProjectSettings.globalize_path() (raw fopen() couldn't open Godot virtual paths, which is why the user:// workaround silently fell back).
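
The difference between picking a single CA source and merging all of them can be modeled in plain C without mbedTLS. This is only an illustrative sketch: `ca_source_t`, `pick_first_source` and `merge_sources` are hypothetical names, and each source is reduced to the number of certificates it yielded (a negative count stands for "failed to load or parse"). The real transport appends the sources to one mbedTLS chain instead.

```c
#include <stddef.h>

/* Hypothetical model of one CA source: how many certificates it yielded.
 * A negative count stands for "failed to load or parse". */
typedef struct {
    const char *name;
    int certs;
} ca_source_t;

/* Old behavior (modeled): the first source that is present wins, and a
 * failure in that source aborts TLS even if later sources are usable.
 * Returns the number of trusted roots, or -1 on failure. */
int pick_first_source(const ca_source_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (src[i].certs == 0) continue;   /* source absent: try the next one */
        return src[i].certs < 0 ? -1 : src[i].certs;
    }
    return -1;                             /* nothing available */
}

/* New behavior (modeled): every source that loads contributes to one chain
 * and broken sources are skipped. Fails only if the chain ends up empty. */
int merge_sources(const ca_source_t *src, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i].certs > 0) total += src[i].certs;
    }
    return total > 0 ? total : -1;
}
```

With a broken system store listed first, `pick_first_source` fails while `merge_sources` still trusts the bundled and override roots. The model deliberately ignores which roots each source contains; it only shows why one bad source no longer decides the outcome.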

Tests

  • tests/test_tls.zig: drives the transport directly against a self-signed wss:// echo server, bypassing matchmaking. Cases: trusted CA via settings → connects (the #24 regression, "Android WSS fails with CertificateBundleLoadFailure in release build"); wrong/absent CA → fails verification; tls_skip_verification → connects. Wired into tests.yml.
  • platforms/godot/tests/test/test_wss.gd — full join_or_create over wss:// through a self-signed TLS proxy fronting example-server. Validated locally on Godot 4.6.1 → 39/39 GUT pass. Runs via run-tests.sh (generates certs + starts the proxy).
  • Core unit tests: 65/65 (no regression). Extension cross-compiles for macOS/Linux/Windows (verified) and iOS/Android/Web targets.

Notes

  • Tagged 0.17.11-rc.1 (pre-release) so @pierroo can verify on the affected Android devices before a stable 0.17.11 — the device-level manifestation is the one thing we can't reproduce in CI.
  • std.http doesn't verify IP-address SANs (only DNS/CN), so the Godot test connects via the localhost hostname; real deployments use hostnames, so this is an edge case.

🤖 Generated with Claude Code

endel added 4 commits June 6, 2026 21:21
Android release builds intermittently failed WSS with a certificate-bundle
load failure: connections worked for some users/devices and failed for
others against the same endpoint.

Root cause was the CA trust-store selection in ws_tls_init(), which picked a
single source in priority order (system -> bundled Mozilla -> settings):

- On Android the device system store (scanned by Zig's rescan of
  /system/etc/security/cacerts) shadowed the comprehensive bundled Mozilla
  set whenever it loaded. If that device-specific store lacked the server's
  root, or failed to parse, TLS aborted with no fallback to the bundled roots
  — so success depended on the device's filesystem/cert contents.
- The settings/override source (priority 3) was unreachable because the
  bundled set (priority 2) is always compiled in and non-empty, so Godot's
  certificate_bundle_override was silently ignored.

Now every available CA source is merged into one mbedtls chain
(mbedtls_x509_crt_parse appends and parses permissively): the bundled Mozilla
roots are always loaded as a device-independent baseline, with the system
store and any explicit override layered on top for private/enterprise roots.
TLS init only fails if the resulting chain is empty, and that failure now
surfaces a specific close reason instead of a generic "TLS init failed".

Godot binding: the override loader now resolves res://-/user:// paths via
ProjectSettings.globalize_path() before reading (raw fopen() could not open
Godot virtual paths, which is why the documented user:// workaround failed on
Android). The binding now provides only an explicit override — the core owns
the bundled baseline — avoiding a redundant second parse of the bundle.
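
As a rough illustration of why the override loader has to treat these paths specially, the check below detects Godot virtual paths by prefix. `is_godot_virtual_path` is a hypothetical helper written for this sketch, not a function from the binding; in the binding, paths like these are resolved via ProjectSettings.globalize_path() before reading.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true for Godot virtual paths, which have no
 * meaning to fopen() and must be resolved to an OS path first. */
bool is_godot_virtual_path(const char *path) {
    if (path == NULL) return false;
    return strncmp(path, "res://", 6) == 0 ||
           strncmp(path, "user://", 7) == 0;
}
```

A loader could use such a check to decide whether a configured path needs to be globalized or can be opened directly.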

Assisted-by: Claude Opus 4.8

…chmaking

Adds isolated WSS/TLS verification tests and the two fixes needed to make a
self-signed wss:// connection actually validatable end to end.

http.zig: HTTPS matchmaking used std.http.Client, which only trusts the OS
system store rescan — it ignored both the bundled Mozilla roots and the
certificate_bundle_override. So matchmaking could fail where the WSS socket
(post-#24) succeeds, notably on Android. setupCaBundle() now seeds the client
trust store with system + bundled Mozilla + settings override, mirroring
ws_tls_init. Only runs for secure requests; plain http:// is untouched. The
settings mirror is extended to the real layout (adds tls_skip_verification,
ca_pem_data, ca_pem_len) — offsets verified against settings.h.

websocket_transport.c: ws_tls_handshake_tick returned the same false for both
"want read/write" and fatal errors, so a certificate verification failure spun
the state machine forever instead of erroring. It now returns a status; fatal
results close with code 1015 and a specific reason ("TLS certificate
verification failed"), surfacing the failure #24 asked to diagnose.
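
The move from a boolean to a status value can be sketched as follows. Only `WS_TLS_HS_CERT_FAILED` appears in the PR diff; the other enumerators, the `RAW_*` constants and both function names are assumptions made for this example, and the raw results stand in for whatever the TLS library actually returns.

```c
#include <stdbool.h>

/* Status of one handshake step. WS_TLS_HS_CERT_FAILED is taken from the PR
 * diff; the other enumerator names are assumed for this sketch. */
typedef enum {
    WS_TLS_HS_IN_PROGRESS,   /* want read/write: call again later */
    WS_TLS_HS_DONE,
    WS_TLS_HS_CERT_FAILED,
    WS_TLS_HS_FAILED         /* any other fatal error */
} ws_tls_hs_status_t;

/* Hypothetical stand-ins for the TLS library's raw results. */
enum { RAW_OK = 0, RAW_WANT_IO = 1, RAW_CERT_ERROR = 2, RAW_OTHER_ERROR = 3 };

/* Map a raw result to a status so the caller can tell "retry" from "fatal". */
ws_tls_hs_status_t classify_handshake(int raw) {
    switch (raw) {
        case RAW_OK:         return WS_TLS_HS_DONE;
        case RAW_WANT_IO:    return WS_TLS_HS_IN_PROGRESS;
        case RAW_CERT_ERROR: return WS_TLS_HS_CERT_FAILED;
        default:             return WS_TLS_HS_FAILED;
    }
}

/* True when the state machine should stop polling and close the transport. */
bool handshake_is_fatal(ws_tls_hs_status_t st) {
    return st == WS_TLS_HS_CERT_FAILED || st == WS_TLS_HS_FAILED;
}
```

With a single `false` for both cases, the caller cannot distinguish `RAW_WANT_IO` from `RAW_CERT_ERROR` and keeps retrying; with a status, the certificate failure becomes a terminal state.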

Tests (tests/test_tls.zig) drive the transport directly against a zero-dep
self-signed wss echo server (tests/tls/), bypassing matchmaking:
- trusted CA via settings -> handshake succeeds + echo (the #24 regression:
  the override CA must be honored, not shadowed)
- wrong CA / bundled-only -> verification fails, never opens
- tls_skip_verification -> opens regardless
Certs are generated on demand (gitignored) and wired into tests.yml.

Assisted-by: Claude Opus 4.8

Adds test_wss.gd, which joins example-server's "my_room" over wss:// through a
TLS proxy (tls-proxy.mjs) that fronts the plain server on 2568. It exercises the
full secure path the ws:// tests don't: the binding loading
network/tls/certificate_bundle_override (res:// resolved via globalize_path),
HTTPS matchmaking trusting that override (http.zig), and the WSS handshake
verifying against it (ws_tls_init).

project.godot points the override at res://tls/ca.pem (generated on demand,
gitignored; a non-fatal no-op when absent, so other tests are unaffected).
run-tests.sh generates the certs and starts/stops the proxy automatically.

The client connects via the `localhost` hostname rather than 127.0.0.1: Zig's
std.http (matchmaking) verifies DNS-name SANs but not IP-address SANs, and the
proxy listens dual-stack so localhost resolves either way.
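
A test harness could guard against accidentally pointing a client at an IP literal, given the SAN limitation described above. The sketch below assumes a POSIX system (it relies on `inet_pton`); `host_is_ip_literal` is a hypothetical helper, not part of the SDK or its tests.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when host is an IPv4 or IPv6 literal, in which
 * case verification would need an IP-address SAN instead of a DNS-name SAN. */
bool host_is_ip_literal(const char *host) {
    unsigned char buf[sizeof(struct in6_addr)];   /* large enough for both families */
    if (host == NULL) return false;
    return inet_pton(AF_INET, host, buf) == 1 ||
           inet_pton(AF_INET6, host, buf) == 1;
}
```

For example, `host_is_ip_literal("127.0.0.1")` is true while `host_is_ip_literal("localhost")` is false, which matches the choice of `localhost` in the test.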

Verified locally against Godot 4.6.1 + example-server: 39/39 GUT tests pass.

Assisted-by: Claude Opus 4.8

Bumps the Godot SDK to 0.17.11-rc.1 (pre-release) and documents the #24 TLS
fixes in the CHANGELOG: merged CA trust chain so Android wss:// verifies
reliably, shared trust for HTTPS matchmaking, surfaced TLS handshake failures,
and res://-/user:// certificate_bundle_override support.

Reported by @pierroo in #24.

Assisted-by: Claude Opus 4.8

endel requested a review from bsharma-imperium as a code owner June 7, 2026 04:17

The Windows CI job failed: tests/tls/gen-certs.sh exited 1 because it used bash
process substitution (-extfile <(...)), which the native openssl on the Git-Bash
runner can't open; the echo server then never started and the always()-run log
step also errored because it ran in PowerShell (cat on a missing wss.log).

- gen-certs.sh: write the SAN extfile to a real temp file instead of <(...).
- build.zig: skip test_tls when the target OS is Windows (the TLS code is
  OS-agnostic and is exercised on Linux + macOS).
- tests.yml: gate the cert/echo-server fixture steps to non-Windows, and run
  "Print server logs" under bash so `cat … || true` is safe on all runners.

Assisted-by: Claude Opus 4.8

bsharma-imperium (Collaborator) left a comment

LGTM.
Added a few comments.

    data->state = COLYSEUS_WS_HANDSHAKE_SENDING;
    ws_http_handshake_init(data);
} else if (hs == WS_TLS_HS_CERT_FAILED) {
    ws_close_impl(transport, 1015, "TLS certificate verification failed");

Collaborator

The RFC reserves 1015 for browser-internal TLS signaling, i.e., connection failures caused by a TLS handshake error. Shall we swap this for something from the application-defined range (4000–4999) to be safe? Maybe 4015?

Member Author

Good point in general, but 1015 never reaches the wire on this path. ws_close_impl only queues a Close frame when state == COLYSEUS_WS_CONNECTED (websocket_transport.c:253), and a TLS handshake failure happens at COLYSEUS_WS_TLS_HANDSHAKE — before the HTTP upgrade and before any wslay context exists — so nothing is transmitted to the peer. It is purely a local code passed to on_close, which is exactly 1015's RFC-intended use, and it matches the SDK already using reserved 1006 as a local code here (including the sibling "TLS init failed"). A 4015 would land inside Colyseus's 4xxx protocol-code range (protocol.h: CONSENTED / SERVER_SHUTDOWN / MAY_TRY_RECONNECT / …) where it has no defined meaning, so it'd actually be less consistent. It's also reconnect-safe: not in room_close_code_is_drop (room.c:518), and the !has_joined guard means a TLS-stage close can't reach the reconnect path anyway. Keeping 1015.

Collaborator

Cool. Thanks for the context.
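
The close-code ranges discussed in this thread can be summarized in a small classifier. The ranges and the local-only codes follow RFC 6455 (sections 7.4.1 and 7.4.2); the type and function names are hypothetical and do not come from the SDK.

```c
#include <stdbool.h>

/* Classes of WebSocket close codes as laid out in RFC 6455 section 7.4.2.
 * The names are invented for this sketch. */
typedef enum {
    WS_CODE_INVALID,      /* outside 1000..4999 */
    WS_CODE_PROTOCOL,     /* 1000..2999: reserved for the protocol itself */
    WS_CODE_REGISTERED,   /* 3000..3999: registered with IANA */
    WS_CODE_PRIVATE       /* 4000..4999: private use between applications */
} ws_code_class_t;

ws_code_class_t ws_close_code_class(int code) {
    if (code >= 1000 && code <= 2999) return WS_CODE_PROTOCOL;
    if (code >= 3000 && code <= 3999) return WS_CODE_REGISTERED;
    if (code >= 4000 && code <= 4999) return WS_CODE_PRIVATE;
    return WS_CODE_INVALID;
}

/* Codes that must never be sent in a Close frame and are only reported
 * locally: 1005 (no status), 1006 (abnormal closure), 1015 (TLS handshake). */
bool ws_close_code_is_local_only(int code) {
    return code == 1005 || code == 1006 || code == 1015;
}
```

Under this classification 1015 is a protocol-range code that is local-only, while 4015 would fall in the private-use range, which is where the thread notes Colyseus already defines its own protocol codes.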

Comment thread platforms/godot/src/tls_certificates.c Outdated
        return;

    if (!load_certificates_from_godot_path(override_path)) {
        fprintf(stderr, "[Colyseus] Failed to load certificate override; falling back to built-in CA roots\n");

Collaborator

Maybe also add what next steps should be if we hit this error path, something like

"If using a packed res:// asset, copy it to user:// first."

Member Author

Done in 0cb1f4c — the fallback log now ends with (hint: copy a packed res:// asset to user:// first). Phrased as a hint rather than asserting res:// is the cause, since a res:// override works fine when running from the editor/source (it only fails for packed assets on an exported build).

}

static bool ws_tls_init(colyseus_ws_transport_data_t* data, const char** out_err) {
    colyseus_tls_context_t* tls = malloc(sizeof(colyseus_tls_context_t));

Collaborator

Pre-existing nitpick: Should we switch to calloc instead? If tls_skip_verification = true, ca_chain_initialized would be garbage.

Member Author

ca_chain_initialized is actually set to 0 explicitly at websocket_transport.c:986 — unconditionally, before the tls_skip_verify branch and before data->tls_ctx = tls is published (which is the only way ws_tls_cleanup becomes reachable). So in the skip-verify path it's 0, not garbage, and cleanup correctly skips mbedtls_x509_crt_free. calloc here would just redundantly zero the large mbedTLS sub-structs (ssl/conf/entropy/ctr_drbg) that get re-init'd right after, without fixing a real bug — so leaving it as malloc. Thanks for the careful read though!

Collaborator

OK. It's old code; I didn't really look at the init, it was just the thought that ca_chain_initialized wouldn't be set that caught my attention. Lol.
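
The pattern defended in this thread (malloc plus an explicit flag reset before any branch) can be illustrated with a reduced model. `tls_ctx_model_t` and both functions are hypothetical stand-ins; only the `ca_chain_initialized` field name comes from the discussion, and the byte array merely represents the large sub-structs that the real code re-initializes right after allocation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Reduced, hypothetical model of the TLS context. */
typedef struct {
    unsigned char big_state[4096];   /* stands in for sub-structs that get re-initialized */
    int ca_chain_initialized;
} tls_ctx_model_t;

tls_ctx_model_t *tls_ctx_model_new(bool skip_verify) {
    tls_ctx_model_t *tls = malloc(sizeof *tls);
    if (tls == NULL) return NULL;
    tls->ca_chain_initialized = 0;   /* set unconditionally, before any branch */
    if (!skip_verify) {
        /* the real code would build the CA chain here */
        tls->ca_chain_initialized = 1;
    }
    return tls;
}

/* Cleanup only touches the chain when the flag says it exists.
 * Returns whether a chain would have been freed. */
bool tls_ctx_model_free(tls_ctx_model_t *tls) {
    if (tls == NULL) return false;
    bool had_chain = tls->ca_chain_initialized != 0;
    free(tls);
    return had_chain;
}
```

Because the flag is written before the skip-verify branch, cleanup never reads an uninitialized value, so `calloc` would only add a redundant zeroing pass over `big_state`.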

Addresses PR #25 review feedback (@imperiumplay): when a certificate_bundle_override
fails to load — most commonly a packed res:// asset on an exported build, which
has no OS path for fopen() — point the user at the fix (copy it to user://).

Assisted-by: Claude Opus 4.8

endel merged commit 1e4068d into main Jun 7, 2026
19 of 20 checks passed

endel added a commit that referenced this pull request Jun 7, 2026
Promotes the Android wss:// TLS trust fixes (#24, PR #25) from 0.17.11-rc.1 to
the stable 0.17.11 release.

Assisted-by: Claude Opus 4.8

endel (Member Author) commented Jun 7, 2026

Thank you so much for the review @imperiumplay 🥳

endel deleted the fix/android-wss-ca-trust branch June 7, 2026 17:47
Development

Successfully merging this pull request may close these issues.

Android WSS fails with CertificateBundleLoadFailure in release build

2 participants