Smithy fleet reliability — rolling tracker (meld side) #139

@avrabe

Rolling tracker for meld's experience of the smithy CI fleet. Companion issue to whatever lands in `pulseengine/smithy`; this issue captures meld-specific observations, failed PRs, and the acceptance bar for releases.

`ubuntu-latest` is not a fallback option per project policy, so smithy stability is a release-path dependency.

Observed failure modes (from v0.7.0 release cycle, 2026-05-03 → 2026-05-11)

1. Whole-fleet offline events

  • 2026-05-03 ~19:00Z: all 8 `pulseengine-ci-01-*` runners reported `status: offline` simultaneously. Recovered ~3 h later (status: online, busy: true).
  • 2026-05-10 ~17:00Z–17:30Z: full offline window again, recovered within ~20 min.

Pattern: when the host(s) go down, every queued job waits indefinitely, because jobs queue against runner labels, not against fleet health. meld saw a 2 h 11 min queue on PR #135 with zero pickup before the fleet came back.
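A minimal sketch of a fleet-health check that could flag this condition before jobs pile up. The record shape (`name`, `status`, `busy`) is an assumption modeled on the runner fields quoted above; `fleet_is_offline` and the default prefix are hypothetical names for illustration, not part of smithy.

```python
from typing import Iterable, Mapping


def fleet_is_offline(
    runners: Iterable[Mapping[str, object]],
    prefix: str = "pulseengine-ci-01-",  # assumed naming scheme, taken from the runner names above
) -> bool:
    """Return True when every runner whose name starts with `prefix` reports offline.

    An empty fleet returns False so that a failed or empty listing is not
    mistaken for an outage.
    """
    fleet = [r for r in runners if str(r.get("name", "")).startswith(prefix)]
    return bool(fleet) and all(r.get("status") == "offline" for r in fleet)
```

A scheduled check built on this could post to this issue when it returns True, rather than relying on someone noticing a stuck queue.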

2. Disk-space failures on rust-cpu runners

Runner pool currently 3× rust-cpu (`pulseengine-ci-01-{5,6,7}`). Two of three show the same pattern:

| Runner | Symptom |
| --- | --- |
| `-5` | Jobs fail in 30–70 s with `error: failed to build archive: No space left on device (os error 28)` |
| `-6` | Same as `-5`; also went offline mid-day twice during the v0.7.0 cycle |
| `-7` | Fuzz + bench succeed here |

Concrete failures: PR #134 (Bench compile), PR #135 (Bench, Coverage, fuzz_resolver_terminates), PR #137 (Bench, Coverage, Test, fuzz_parse_component, fuzz_fusion_roundtrip). Each landed by repeating `gh run rerun --failed` until the failed jobs were assigned to `-7`, except PR #137, which had to merge via `--admin` because the rerun cycle never converged.
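One way to turn the 30–70 s `os error 28` failures into an explicit, immediate signal is a preflight free-space check at the start of the job. This is only a sketch: `has_free_space` is a hypothetical helper, and the 10 GiB default is an arbitrary assumption, not a measured requirement of meld's builds.

```python
import shutil


def has_free_space(path: str = ".", min_free_gib: float = 10.0) -> bool:
    """Return True if the filesystem holding `path` has at least `min_free_gib` GiB free.

    The threshold is an assumed placeholder; tune it to the actual size of
    the build's target directory.
    """
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= min_free_gib
```

Failing fast on this check would also make the affected runner visible in the job log without waiting for the archive step to fail.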

3. Per-runner config drift (sanitizer + musl)

PR #135 fuzz failure on `-5`:

```
error: sanitizer is incompatible with statically linked libc,
disable it using `-C target-feature=-crt-static`
```

The same workflow with the same toolchain install step (`dtolnay/rust-toolchain@nightly` with `targets: x86_64-unknown-linux-musl`) succeeds on `-7` but fails on `-5` and `-6`. This suggests that host-level `.cargo/config.toml` or rustup component state differs between runners.
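To test the drift hypothesis, each runner could emit a small fingerprint of the state that can change Rust linking behavior, and the fingerprints could then be compared. The sketch below is an assumption about what is worth capturing (the Cargo home `config.toml` plus the `RUSTFLAGS` and `CARGO_BUILD_TARGET` environment variables); it does not cover rustup component state, and `config_fingerprint` and `drift` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Mapping, Optional


def config_fingerprint(cargo_home: str, env: Mapping[str, str]) -> dict[str, Optional[str]]:
    """Summarize host-level state that can alter how Cargo builds and links."""
    config = Path(cargo_home) / "config.toml"
    digest = hashlib.sha256(config.read_bytes()).hexdigest() if config.is_file() else None
    return {
        "cargo_config_sha256": digest,  # None means no host-level config file
        "RUSTFLAGS": env.get("RUSTFLAGS"),
        "CARGO_BUILD_TARGET": env.get("CARGO_BUILD_TARGET"),
    }


def drift(a: Mapping[str, Optional[str]], b: Mapping[str, Optional[str]]) -> list[str]:
    """Return the sorted keys whose values differ between two fingerprints."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
```

Running `config_fingerprint` as a job step on `-5`, `-6`, and `-7` and diffing the results against `-7` would narrow down which of these inputs, if any, differs.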

4. Cross-org capacity contention

While meld jobs queued, other org repos held rust-cpu slots:

  • `spar/CI` run on `release/v0.9.2` ran 11.5 h before completing (start 09:29Z, presumed hung)
  • `loom/CI` and `loom/Validate Shared Architecture` both ran 12 h+ before either failing or being cancelled

I cancelled the loom jobs to free slots for meld's release; this was a one-off intervention, not a sustainable answer.
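A rough sketch of how long-running jobs like these could be surfaced automatically instead of being found by hand. The input is assumed to be `(name, started_at)` pairs gathered from whatever run listing is available; `overdue_runs` is a hypothetical helper, and the 6 h default limit is an arbitrary assumption.

```python
from datetime import datetime, timedelta
from typing import Iterable


def overdue_runs(
    runs: Iterable[tuple[str, datetime]],
    now: datetime,
    limit: timedelta = timedelta(hours=6),  # assumed cap, not a project policy
) -> list[str]:
    """Return the names of runs that started more than `limit` before `now`."""
    return [name for name, started_at in runs if now - started_at > limit]
```

Both timestamps should be timezone-aware (or both naive) so the subtraction is valid. The output is a candidate list for review, not an automatic cancellation.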

Acceptance bar for meld (proposed)

Smithy is "stable enough" to be meld's only CI path when all of:

  • Three consecutive meld PRs merge without any `--admin` bypass.
  • No PR sits in the `queued` state for more than 30 min without a runner being assigned.
  • Disk-space (`os error 28`) failures stop appearing in retry-then-pass cycles for at least one full release cycle (~7–10 PRs).
  • Sanitizer + musl fuzz builds succeed on every rust-cpu runner, not just `-7`.
  • No whole-fleet offline event in a calendar week.

Until then, releases are explicitly authorized to merge with `--admin` for known-infra failures, documented in the release PR body.
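The acceptance bar above can be sketched as a mechanical check so that "stable enough" is evaluated the same way each time. Everything here is illustrative: `PrRecord`, its fields, and `bar_status` are hypothetical, and the inputs are assumed to be collected by hand or by a script for the PRs of the release cycle under review, in merge order.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass
class PrRecord:
    number: int
    admin_bypass: bool        # merged with --admin
    max_queue_minutes: float  # longest wait in `queued` without a runner assigned
    saw_os_error_28: bool     # any job hit "No space left on device"


def bar_status(
    prs: Sequence[PrRecord],
    fuzz_ok_by_runner: Mapping[str, bool],
    fleet_offline_events_this_week: int,
) -> dict[str, bool]:
    """Evaluate each acceptance condition separately; all must be True to pass."""
    last_three = prs[-3:]
    return {
        "three_merges_without_admin": len(last_three) == 3
        and not any(p.admin_bypass for p in last_three),
        "no_queue_over_30_min": all(p.max_queue_minutes <= 30 for p in prs),
        "no_disk_space_failures": not any(p.saw_os_error_28 for p in prs),
        "fuzz_ok_on_every_runner": bool(fuzz_ok_by_runner)
        and all(fuzz_ok_by_runner.values()),
        "no_fleet_offline_this_week": fleet_offline_events_this_week == 0,
    }
```

Returning one flag per condition, rather than a single boolean, keeps it visible which conditions still block lifting the admin-bypass policy.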

What this issue is for

  • A single place to drop "I saw this on smithy today" reports from any meld PR.
  • A trail of evidence for `pulseengine/smithy` agents/issues to consume.
  • The release-side definition of "ready to lift the admin-bypass policy."

Pin / keep open. Update the checklist above as conditions are met.
