B2: aorta triage run --mode matrix #151

@oyazdanb

Goal

Productize a contingency-table style triage matrix: for a given workload, run a fixed set of (mitigation × environment) cells, count failures and step-times per cell, and produce a matrix.md table that flags cells where a mitigation may be working only via a speed confound.

Driven by either a checked-in recipe file (primary) or ad-hoc flags (shim) — both modes converge on the same run_recipe(recipe) execution path.

Why this matters — speed-confound detection

Investigations into mitigation effectiveness routinely hit the same trap: a change that eliminates a numeric failure also slows down GPU execution. Unless step-time is measured alongside pass/fail, the matrix cannot distinguish mitigations that "fix" the bug from those that merely delay the failure past the observed window.

Concrete pattern from prior internal RCAs:

  • A mitigation that disables TF32 eliminates a NaN — but is ~25% slower.
  • A mitigation that toggles XNACK eliminates the same NaN — and is not measurably slower.

The first reads "speed confound suspected; verify with profiler before drawing conclusions"; the second reads "trust this cell." Encoding this distinction in the matrix output is the core value-add over a hand-maintained spreadsheet.

The lesson:

  • Capture step-time per trial alongside pass/fail.
  • Compute baseline step-time from the (no-mitigation × baseline-environment) cell.
  • Flag any cell where cell_step_time / baseline_step_time > 1.15 as a potential speed confound.
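The ratio check in the last bullet is plain arithmetic. A minimal sketch, assuming mean step times in milliseconds; the function names are illustrative and are not part of aorta:

```python
def step_time_ratio(cell_ms: float, baseline_ms: float) -> float:
    """Return the cell's step time relative to the baseline cell."""
    if baseline_ms <= 0:
        raise ValueError("baseline step time must be positive")
    return cell_ms / baseline_ms


def is_speed_confound(cell_ms: float, baseline_ms: float, threshold: float = 1.15) -> bool:
    """True when the cell is slower than baseline by more than the threshold."""
    return step_time_ratio(cell_ms, baseline_ms) > threshold


# A cell that is 25% slower than a 400 ms baseline is flagged; a 1% slowdown is not.
print(is_speed_confound(500.0, 400.0))  # True
print(is_speed_confound(404.0, 400.0))  # False
```

The comparison is strict, so a cell sitting exactly at the threshold is not flagged.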

Design shape — recipe primary, flags secondary

A triage matrix run is two pieces of information:

  1. Which cells to run — the cartesian product (or a hand-picked subset) of mitigation × environment, plus per-cell trial/step counts.
  2. What investigation it belongs to — drives output grouping; ties weeks of matrix.md artifacts back to the originating ticket.

A YAML/JSON recipe file captures both, lives in version control, and gives users a one-line invocation: aorta triage run --recipe recipes/<name>.yaml. Flags are kept for ad-hoc one-shots and as an escape hatch — internally they construct an in-memory recipe and reuse the same execution path.

Recipe names resolve through aorta.registry (B3): mitigations: [tf32_off, xnack] is a list of registered names (built-in or contributed via aorta.mitigations entry-points). Same for environment: <name>. Unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError.

Files to create / modify

src/aorta/cli/triage.py             # MODIFY — replace placeholder with real call; add dual-mode CLI
src/aorta/triage/runner.py          # NEW — main orchestration: cell loop, calls run_trials() per cell
src/aorta/triage/recipe.py          # NEW — recipe schema + loader (YAML/JSON), name resolution via aorta.registry,
                                    #       flag-mode helper that builds a Recipe in memory from CLI args
src/aorta/triage/matrix.py          # NEW — contingency table data structure + aggregation
src/aorta/triage/output.py          # NEW — matrix.json writer + matrix.md formatter; per-ticket dir resolution
src/aorta/triage/confound.py        # NEW — speed-confound detection logic
tests/triage/__init__.py            # NEW (empty)
tests/triage/test_recipe.py         # NEW — schema validation, name resolution, flag→recipe builder
tests/triage/test_matrix.py         # NEW
tests/triage/test_confound.py       # NEW
tests/triage/test_output_layout.py  # NEW — per-ticket / per-workload / per-timestamp dir grouping
recipes/                            # NEW — public-friendly example recipes only
recipes/example-fsdp-smoke.yaml     # NEW — minimal smoke recipe wired to the in-tree fsdp workload
recipes/README.md                   # NEW — recipe authoring guide

Recipe schema

Authoritative shape (YAML; JSON accepted with the same keys):

# recipes/example.yaml
schema_version: 1                    # required; loader rejects unknown versions
ticket: EXAMPLE-001                  # optional; drives output grouping when present
workload: fsdp                       # required; resolved via aorta.workloads entry-point group
trials: 8                            # required; per-cell trial count
steps: 5000                          # required; per-cell step count

# Confound detection — overrides defaults when present
confound:
  threshold: 1.15                    # cell_step_time / baseline_step_time above this -> "speed (+N%)"
  baseline_cell: baseline-local      # optional; defaults to first cell named "baseline-*" or first cell with mitigations: [none]

# The cells to run (NOT the cartesian product — explicit list keeps recipes readable
# and lets users omit redundant pairings).
cells:
  - name: baseline-local
    mitigations: [none]              # list resolved through load_mitigations(); env vars unioned
    environment: local               # single string resolved through load_environments()

  - name: tf32_off-local
    mitigations: [tf32_off]
    environment: local

  - name: stack-tf32+xnack-local     # shows mitigation stacking
    mitigations: [tf32_off, xnack]
    environment: local
    trials: 16                       # optional per-cell override of top-level trials
    steps: 8000                      # optional per-cell override of top-level steps

  - name: try-nightly                # inline docker shorthand (Option A) — no registry registration needed
    mitigations: [none]
    environment: { docker: "rocm/pytorch:nightly" }

Note on workload-coupled mitigations. Recipes that need workload-internal env vars (e.g. AMP_DTYPE, SHAMPOO_PRECONDITIONER_DTYPE) must reference mitigations registered by the workload's own package via the aorta.mitigations entry-point group. Public aorta ships only true runtime-level built-ins (none, tf32_off, xnack); see B3 spec for the rationale.

Schema rules

  • schema_version — required int; current value 1. Unknown values → RecipeSchemaError.
  • ticket — optional string; format-free (caller's choice). Drives output dir; absence is allowed for ad-hoc runs.
  • workload — required string; must resolve via aorta.workloads entry-point group (B1's existing dependency). Unknown → UnknownWorkloadError from B1's loader.
  • trials / steps — required ints at top level; per-cell overrides allowed.
  • confound.threshold — optional float, default 1.15.
  • confound.baseline_cell — optional string. Resolution order if absent: (1) first cell with name starting baseline-; (2) first cell where mitigations == ["none"]; (3) error if neither found and >1 cell exists.
  • cells[*].name — required string, unique within the recipe; used as the row label in matrix.md.
  • cells[*].mitigations — required list[str]; each name resolved through aorta.registry.get_mitigation(). Empty list rejected (use ["none"] for the explicit baseline). Multiple names → env-var bundles unioned in list order; collision within a cell raises RecipeCellError.
  • cells[*].environment — required. One of:
    • String — a registered environment name; resolved through aorta.registry.get_environment().
    • Mapping { docker: "<image-ref>" } — inline docker shorthand (Option A). Auto-named _inline_<hash> where <hash> is the first 8 chars of blake2b(image-ref). Behaves identically to a named docker entry from that point on (per-environment probe, recipe.resolved.yaml records the auto-name + ref, output dir uses the auto-name). Intentionally no name: field — anything you'd want to name belongs in the registry. No other keys accepted.
  • Inline ad-hoc env override — supported as cells[*].extra_env: {KEY: VALUE, ...}. Applied AFTER mitigation env-vars in that cell, so it can override a registered mitigation's bundle for one-off experiments without polluting the registry. Logged in matrix.json per cell so the audit trail is preserved.
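The unioning and override order in the last two rules can be sketched as follows. This is an illustration under stated assumptions, not the loader itself: `merge_cell_env` is a hypothetical helper, the variable names are placeholders, and any key set by two mitigations is treated as a collision even when the values agree.

```python
from typing import Dict, List, Mapping, Optional, Tuple


class RecipeCellError(ValueError):
    """Stand-in for the error the rules name for env-var collisions."""


def merge_cell_env(
    bundles: List[Tuple[str, Mapping[str, str]]],
    extra_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Union mitigation env-var bundles in list order, then apply extra_env."""
    merged: Dict[str, str] = {}
    owner: Dict[str, str] = {}
    for mitigation, bundle in bundles:
        for key, value in bundle.items():
            if key in merged:
                raise RecipeCellError(
                    f"{key} is set by both {owner[key]!r} and {mitigation!r}"
                )
            merged[key] = value
            owner[key] = mitigation
    # extra_env is applied last so it can override a mitigation's value.
    merged.update(extra_env or {})
    return merged


# Hypothetical bundles; real mitigations define their own variables.
bundles = [("mit_a", {"VAR_A": "0"}), ("mit_b", {"VAR_B": "1"})]
print(merge_cell_env(bundles, extra_env={"VAR_A": "1"}))
# {'VAR_A': '1', 'VAR_B': '1'}
```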

The loader normalizes the YAML/JSON into an in-memory Recipe dataclass (frozen). Validation happens once, at load time; the runner consumes the validated structure.
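A rough sketch of what the frozen structure and load-time validation could look like. It takes an already-parsed mapping (YAML/JSON parsing and registry name resolution are left out), and the field names, the error used for an empty mitigation list, and the single-cell fallback in `resolve_baseline` are assumptions of this sketch rather than the final design.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Tuple


class RecipeSchemaError(ValueError):
    """Stand-in for the schema error named in the rules above."""


@dataclass(frozen=True)
class Cell:
    name: str
    mitigations: Tuple[str, ...]
    environment: Any  # registered name, or a {"docker": ref} mapping
    trials: int
    steps: int


@dataclass(frozen=True)
class Recipe:
    workload: str
    cells: Tuple[Cell, ...]
    ticket: Optional[str] = None
    threshold: float = 1.15
    baseline_cell: Optional[str] = None


def load_recipe(raw: Mapping[str, Any]) -> Recipe:
    """Validate a parsed recipe mapping once and freeze it."""
    if raw.get("schema_version") != 1:
        raise RecipeSchemaError(f"unsupported schema_version: {raw.get('schema_version')!r}")
    for key in ("workload", "trials", "steps", "cells"):
        if key not in raw:
            raise RecipeSchemaError(f"missing required key: {key}")
    cells, seen = [], set()
    for entry in raw["cells"]:
        name = entry["name"]
        if name in seen:
            raise RecipeSchemaError(f"duplicate cell name: {name}")
        seen.add(name)
        if not entry["mitigations"]:
            raise RecipeSchemaError(f"cell {name}: use ['none'] for the baseline")
        cells.append(Cell(
            name=name,
            mitigations=tuple(entry["mitigations"]),
            environment=entry["environment"],
            trials=entry.get("trials", raw["trials"]),  # per-cell override
            steps=entry.get("steps", raw["steps"]),
        ))
    confound = raw.get("confound") or {}
    return Recipe(
        workload=raw["workload"],
        cells=tuple(cells),
        ticket=raw.get("ticket"),
        threshold=confound.get("threshold", 1.15),
        baseline_cell=confound.get("baseline_cell"),
    )


def resolve_baseline(recipe: Recipe) -> str:
    """Apply the baseline resolution order from the schema rules."""
    names = [cell.name for cell in recipe.cells]
    if recipe.baseline_cell is not None:
        if recipe.baseline_cell not in names:
            raise RecipeSchemaError(f"unknown baseline_cell: {recipe.baseline_cell}")
        return recipe.baseline_cell
    for cell in recipe.cells:
        if cell.name.startswith("baseline-"):
            return cell.name
    for cell in recipe.cells:
        if cell.mitigations == ("none",):
            return cell.name
    if len(names) == 1:
        return names[0]  # assumption: a lone cell is its own baseline
    raise RecipeSchemaError("no baseline cell found")
```

Because both dataclasses are frozen, the runner cannot mutate a cell after validation, which is what lets validation happen only once.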

CLI surface

aorta triage run --recipe <file>              # primary mode
                 [--output-dir DIR]
                 [--dry-run]                  # validate + print resolved cells, do NOT execute

aorta triage run --mode matrix                # secondary mode (flag shim — constructs a Recipe in memory)
                 --workload <name>
                 --mitigation-axis m1,m2,m3   # cartesian product with --environment-axis (each mitigation/environment pair = one matrix row)
                 --environment-axis e1,e2,image:rocm/foo:bar   # bare names = registry lookup; "image:<ref>" = inline docker (Option B)
                 --trials N
                 [--steps S]
                 [--ticket TICKET]            # optional; only effect is output dir grouping
                 [--baseline-cell NAME]       # default: name of (mitigations=[none], environment=first env-axis value)
                 [--confound-threshold 1.15]
                 [--output-dir triage_results]

aorta triage --list-mitigations               # delegates to B3 resolver; tags entries by source_package
aorta triage --list-environments

Flag mode internally builds a Recipe whose cells is the full cartesian product mitigation_axis × environment_axis, with each cell named <mitigation>-<environment>. The runner does not branch on mode after that point — both paths converge on the same run_recipe(recipe) function.
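As an illustration of the flag shim, the cell list could be built like this. The helper name and the dict shape are hypothetical, `image:` items are ignored here, and the mitigation-major row order is an assumption.

```python
from itertools import product
from typing import Dict, List


def build_cells(mitigation_axis: str, environment_axis: str) -> List[Dict[str, object]]:
    """Expand two comma-separated axes into one cell per combination."""
    mitigations = [m.strip() for m in mitigation_axis.split(",") if m.strip()]
    environments = [e.strip() for e in environment_axis.split(",") if e.strip()]
    return [
        {"name": f"{m}-{e}", "mitigations": [m], "environment": e}
        for m, e in product(mitigations, environments)
    ]


for cell in build_cells("none,tf32_off,xnack", "local"):
    print(cell["name"])
# none-local
# tf32_off-local
# xnack-local
```

With this naming, the default baseline described for --baseline-cell is the cell named `none-<first environment>`.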

--environment-axis item parsing (Option B): each comma-separated item is parsed by prefix. image:<ref> becomes an inline-docker cell using the same { docker: "<ref>" } path as Option A — auto-named _inline_<hash>, hash visible in the cell name (<mitigation>-_inline_<hash>) so multiple inline images on one axis are distinguishable. Anything without a recognized prefix is a registered environment name. No other prefixes are defined for MVP — the mental model is "registry name OR image:, that's it."
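The prefix parsing and deterministic auto-name might look like the sketch below. It assumes the hash is the hex digest of the UTF-8 encoded ref using `hashlib.blake2b` with its default digest size; the spec only says "first 8 chars of blake2b(image-ref)", so those details and the helper names are assumptions.

```python
import hashlib
from typing import Dict, List, Tuple, Union

EnvSpec = Union[str, Dict[str, str]]


def inline_name(image_ref: str) -> str:
    """Deterministic auto-name for an inline docker environment."""
    digest = hashlib.blake2b(image_ref.encode("utf-8")).hexdigest()
    return f"_inline_{digest[:8]}"


def parse_environment_item(item: str) -> Tuple[str, EnvSpec]:
    """Return (environment name, spec) for one --environment-axis item."""
    prefix = "image:"
    if item.startswith(prefix):
        ref = item[len(prefix):]  # keep any later colons, e.g. the image tag
        if not ref:
            raise ValueError("image: requires an image reference")
        return inline_name(ref), {"docker": ref}
    return item, item  # bare name: left for the registry to resolve


def parse_environment_axis(axis: str) -> List[Tuple[str, EnvSpec]]:
    return [parse_environment_item(part.strip()) for part in axis.split(",") if part.strip()]


for name, spec in parse_environment_axis("local,image:rocm/pytorch:nightly"):
    print(name, spec)
```

Only the `image:` prefix is stripped, so the tag separator inside the ref survives, and the same ref always maps to the same auto-name.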

There is intentionally no --dockers flag — environments are the abstraction (docker is one component of an environment alongside venv and ROCm version).

Behavior

For each cell (name, mitigations, environment, trials, steps, extra_env):

  1. Build RunRequest (B1's contract):
    • workload = recipe.workload
    • mitigations = cell.mitigations (list of names — B1 resolves via load_mitigations() and unions env vars)
    • environment = cell.environment (single name — B1 resolves via load_environments() to docker / venv / rocm)
    • extra_env = cell.extra_env (passed through if set)
    • trials = cell.trials (with per-cell override falling back to recipe top-level)
    • steps = cell.steps
    • output_dir = <output_dir>/<ticket>/<workload>/<run-timestamp>/cells/<cell.name>/
  2. Call B1's run_trials(RunRequest) -> list[TrialResult] in-process (one Python interpreter for the whole matrix). B1's dispatcher writes per-trial JSON to RunRequest.output_dir; B2 also receives the in-memory list for aggregation.
  3. Aggregate per cell:
    • passed_count / failed_count (failure = workload-defined: NaN, throw, etc.)
    • nan_rate = failed_count / trials
    • mean_step_time_ms, std_step_time_ms, p50, p99
    • mean_wall_clock_sec
    • error: str | None if the whole cell failed (e.g., docker pull failure) — surface in matrix without aborting other cells
  4. After all cells run: locate the baseline cell per the resolution order in §Schema rules. Compute step_time_ratio for every non-baseline cell.
  5. Apply confound rules (see below).
  6. Write matrix.json (full data) and matrix.md (human-readable table) to the run-timestamp directory.
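Step 3 could be sketched with the standard library alone. `TrialResult` below is a hypothetical stand-in (B1 owns the real type), and pooling every step time across trials before taking statistics is an assumption; a mean of per-trial means would be an equally plausible reading.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrialResult:
    """Hypothetical stand-in; B1 defines the real TrialResult."""
    passed: bool
    step_times_ms: List[float]
    wall_clock_sec: float


def aggregate_cell(trials: List[TrialResult]) -> Dict[str, float]:
    """Reduce one cell's trials to the per-cell statistics."""
    if not trials:
        raise ValueError("cell has no trials")
    steps = sorted(ms for t in trials for ms in t.step_times_ms)
    if len(steps) < 2:
        raise ValueError("need at least two step samples")
    failed = sum(1 for t in trials if not t.passed)
    # 99 cut points: index 49 is the median, index 98 the 99th percentile.
    cuts = statistics.quantiles(steps, n=100, method="inclusive")
    return {
        "passed_count": len(trials) - failed,
        "failed_count": failed,
        "nan_rate": failed / len(trials),
        "mean_step_time_ms": statistics.fmean(steps),
        "std_step_time_ms": statistics.stdev(steps),
        "p50": cuts[49],
        "p99": cuts[98],
        "mean_wall_clock_sec": statistics.fmean([t.wall_clock_sec for t in trials]),
    }


stats = aggregate_cell([
    TrialResult(True, [100.0, 100.0], 1.0),
    TrialResult(False, [200.0, 200.0], 3.0),
])
print(stats["nan_rate"], stats["mean_step_time_ms"])  # 0.5 150.0
```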

Confound rules (Confound column overload)

A single Confound column carries one of:

  • (baseline) — the baseline cell.
  • step_time_ratio <= threshold AND nan_rate < baseline_nan_rate (mitigation works without a speed cost; trust the cell).
  • speed (+N%): step_time_ratio > threshold (the mitigation may be suppressing failure via slower iteration).
  • no effect: nan_rate >= baseline_nan_rate AND step_time_ratio <= threshold (the mitigation neither moved the failure rate nor slowed iteration).
  • error — the whole cell failed; row is preserved so the matrix is complete.
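The rules above reduce to a small decision function. A sketch under assumptions: `confound_label` is a hypothetical name, `TRUSTED_LABEL` is placeholder text for the trusted case, and the percentage is rounded to the nearest integer.

```python
from typing import Optional

TRUSTED_LABEL = "trusted"  # placeholder text; an assumption of this sketch


def confound_label(
    ratio: float,
    nan_rate: float,
    baseline_nan_rate: float,
    threshold: float = 1.15,
    is_baseline: bool = False,
    error: Optional[str] = None,
) -> str:
    """Map one cell's aggregates to its Confound column entry."""
    if error is not None:
        return "error"
    if is_baseline:
        return "(baseline)"
    if ratio > threshold:
        # The speed flag wins regardless of what happened to the failure rate.
        return f"speed (+{round((ratio - 1.0) * 100)}%)"
    if nan_rate >= baseline_nan_rate:
        return "no effect"
    return TRUSTED_LABEL


print(confound_label(1.25, 0.0, 0.5))  # speed (+25%)
print(confound_label(1.0, 0.5, 0.5))   # no effect
```

Checking the speed rule before the failure-rate rule mirrors the list: a slow cell is flagged even when its failure rate did not improve.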

Pre-registering kill criteria in the same column (rather than a separate Verdict column) keeps the matrix layout matching the source-of-truth manual table.

Implicit env probes (host + per-environment)

Per A1's contract, aorta.instrumentation.environment.collect_env() is a library function the runner calls directly — never via shelling out to aorta env probe. The runner takes two snapshots beyond what B1 already embeds per trial:

| Scope | When captured | Where written |
|---|---|---|
| Host (kernel, amdgpu module, dmesg, /dev/kfd, KFD version) | Once at runner start, before any cell runs | <run-timestamp>/host_env.json (sibling of matrix.md) |
| Per-environment (ROCm runtime in container, hipBLASLt version, pip freeze, env vars before mitigation injection) | Once per unique --environment-axis value (or unique cells[*].environment in recipe mode), immediately before that environment's first cell runs | <run-timestamp>/environments/<env-name>/env.json (sibling of cells/) |

Host state is invariant across the matrix; per-environment state varies per env cell. Splitting them this way keeps host_env.json a single canonical "what was the box like when this matrix ran" record, while environments/<env-name>/env.json is the file users reach for when a bug looks like ROCm-version drift between two environments.

Per-trial env capture is unchanged — B1 still writes EnvSnapshot into each trial_<id>.json. The two new top-level snapshots are deduplication: the host snapshot would otherwise appear in every trial JSON unchanged; the per-environment snapshot would otherwise appear in every cell's trial JSONs unchanged.

Probe failure never aborts the matrix. collect_env() is contractually fail-soft (A1) — if dmesg is restricted or rdhc isn't installed, the snapshot lands with partial: true plus partial_reasons, and the runner continues. A top-of-file warning in matrix.md surfaces the partial state.
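The once-per-environment behavior and the fail-soft handling can be sketched as a loop. The `collect` and `run_cell` callables are hypothetical stand-ins passed in by the caller (the real `collect_env()` signature belongs to A1), and environments are assumed to be identified by string names, with inline docker cells using their auto-name.

```python
from typing import Any, Callable, Dict, List, Mapping, Tuple


def run_with_probes(
    cells: List[Mapping[str, Any]],
    collect: Callable[[str], Dict[str, Any]],
    run_cell: Callable[[Mapping[str, Any]], None],
) -> Tuple[Dict[str, Dict[str, Any]], List[str]]:
    """Probe each environment right before its first cell, then run the cell."""
    snapshots: Dict[str, Dict[str, Any]] = {}
    for cell in cells:
        env = cell["environment"]
        if env not in snapshots:
            snapshots[env] = collect(env)  # at most one probe per environment
        run_cell(cell)
    # Partial snapshots are kept as-is and only reported, never fatal.
    partial = [env for env, snap in snapshots.items() if snap.get("partial")]
    return snapshots, partial


calls: List[str] = []


def fake_collect(env: str) -> Dict[str, Any]:
    calls.append(env)
    return {"env": env, "partial": env == "nightly"}


cells = [{"name": f"c{i}", "environment": env}
         for i, env in enumerate(["local", "nightly", "local", "nightly"])]
snapshots, partial = run_with_probes(cells, fake_collect, lambda cell: None)
print(calls, partial)  # ['local', 'nightly'] ['nightly']
```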

In-process execution (not subprocess)

Per-cell execution calls B1's run_trials() as a Python function. Why this matters:

  • One Python process for the whole matrix (no interpreter / torch / entry-point-discovery cost per cell)
  • Cell results returned as list[TrialResult] objects in memory — no JSON parse round-trip
  • Workload exceptions surface as Python tracebacks, not exit-code + stderr text
  • Tests mock run_trials() directly; no subprocess plumbing
  • Distributed workloads under torchrun work without extra wiring — every rank executes the matrix loop, B1's dispatcher's existing rank-aware writes apply per cell, only RANK == 0 writes matrix.{json,md} (same pattern B1 uses for trial_<id>.json)

This requires B1 to expose a clean Python entry-point (run_trials(RunRequest) -> list[TrialResult]).

Output layout

triage_results/
├── EXAMPLE-001/                              # <ticket> from recipe; "_no_ticket_" if absent
│   └── fsdp/                                 # <workload>
│       ├── 2026-04-28T14-12-03/              # <run-timestamp>; one dir per invocation, never overwritten
│       │   ├── matrix.md
│       │   ├── matrix.json
│       │   ├── recipe.resolved.yaml          # the recipe AS EXECUTED (registry names already resolved
│       │   │                                 #  to env-var bundles + image refs; reproducibility artifact)
│       │   ├── host_env.json                 # collect_env() snapshot taken once at runner start (host scope)
│       │   ├── environments/
│       │   │   └── local/
│       │   │       └── env.json              # collect_env() snapshot before first cell on this env
│       │   └── cells/
│       │       ├── baseline-local/
│       │       │   ├── trial_0.json          # written by B1
│       │       │   ├── trial_1.json
│       │       │   └── ...
│       │       ├── tf32_off-local/
│       │       └── ...
│       └── 2026-04-29T09-44-17/              # next run; same layout
└── _no_ticket_/                              # ad-hoc runs without a ticket
    └── fsdp/
        └── 2026-04-28T15-02-11/
            └── ...

<ticket>/<workload>/ lets users see the full history of attempts on a given problem at a glance. recipe.resolved.yaml is the post-resolution snapshot: it embeds the actual env-var bundles, docker digests, etc., so re-running it on a different machine executes the same cells with the same env vars and images, even if the registries have changed since.
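Resolving the run directory is a few lines of `pathlib`. A sketch, assuming UTC timestamps formatted as in the tree above; the function names are illustrative.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def run_directory(
    output_dir: str,
    ticket: Optional[str],
    workload: str,
    now: Optional[datetime] = None,
) -> Path:
    """Build <output_dir>/<ticket>/<workload>/<run-timestamp>."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H-%M-%S")
    return Path(output_dir) / (ticket or "_no_ticket_") / workload / stamp


def create_run_directory(path: Path) -> Path:
    # exist_ok=False makes a timestamp collision fail instead of overwriting.
    path.mkdir(parents=True, exist_ok=False)
    return path


when = datetime(2026, 4, 28, 15, 2, 11, tzinfo=timezone.utc)
print(run_directory("triage_results", None, "fsdp", when).as_posix())
# triage_results/_no_ticket_/fsdp/2026-04-28T15-02-11
```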

matrix.md target format

# Triage Matrix — fsdp

**Ticket**: EXAMPLE-001
**Workload**: fsdp
**Recipe**: recipes/example-fsdp.yaml (sha256:abc12...)
**Trials per cell**: 8
**Steps per trial**: 5000
**Run timestamp**: 2026-04-28T14:12:03Z
**Baseline cell**: baseline-local (mean step time = 412 ms)

## Reproduction Summary

| Cell                          | Mitigations            | Environment | NaN rate | Trials | Mean step (ms) | Confound      |
|-------------------------------|------------------------|-------------|----------|--------|----------------|---------------|
| baseline-local                | none                   | local       | 50%      | 4 / 8  | 412            | (baseline)    |
| tf32_off-local                | tf32_off               | local       | 0%       | 0 / 8  | 515            | speed (+25%)  |
| xnack-local                   | xnack                  | local       | 0%       | 0 / 8  | 414            |               |
| stack-tf32+xnack-local        | tf32_off, xnack        | local       | 0%       | 0 / 8  | 518            | speed (+26%)  |

## Notes

- Cell name comes from the recipe; mitigations + environment columns disambiguate when names get terse.
- Confound column legend:
  - `(baseline)` — the cell against which all step-time ratios are computed.
  - `` — the mitigation appears to work without a speed cost. Trust this cell.
  - `speed (+N%)` — the mitigation may be suppressing failure via slower iteration rather than a real fix. Verify with `rocprofv3` dispatch comparison before drawing causal conclusions.
  - `no effect` — the mitigation neither changed the failure rate nor slowed iteration; it likely doesn't apply to this workload (the env vars it sets aren't read).
- Only `mean step (ms)` is shown here. Per-cell `std`, `p50`, `p99`, raw step-time arrays, and per-trial JSON paths are in `matrix.json`.
- `recipe.resolved.yaml` (alongside this file) captures the registry state at run time — re-run it to reproduce.
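A formatter for the Reproduction Summary table could be as small as the sketch below. The row dict keys are hypothetical, columns are not padded, and the Trials column is assumed to read failed / total, as the 0% rows in the sample suggest.

```python
from typing import Any, Dict, List

HEADER = ["Cell", "Mitigations", "Environment", "NaN rate", "Trials", "Mean step (ms)", "Confound"]


def render_summary(rows: List[Dict[str, Any]]) -> str:
    """Render per-cell aggregates as a pipe table."""
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "|" + "|".join("---" for _ in HEADER) + "|",
    ]
    for row in rows:
        fields = [
            row["name"],
            ", ".join(row["mitigations"]),
            row["environment"],
            f"{row['nan_rate']:.0%}",
            f"{row['failed']} / {row['trials']}",  # assumption: failed / total
            f"{row['mean_step_ms']:.0f}",
            row["confound"],
        ]
        lines.append("| " + " | ".join(fields) + " |")
    return "\n".join(lines)


print(render_summary([
    {"name": "baseline-local", "mitigations": ["none"], "environment": "local",
     "nan_rate": 0.5, "failed": 4, "trials": 8, "mean_step_ms": 412.0, "confound": "(baseline)"},
    {"name": "tf32_off-local", "mitigations": ["tf32_off"], "environment": "local",
     "nan_rate": 0.0, "failed": 0, "trials": 8, "mean_step_ms": 515.0, "confound": "speed (+25%)"},
]))
```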

Acceptance criteria

Recipe path (primary)

  • aorta triage run --recipe recipes/example-fsdp-smoke.yaml validates the recipe and runs the matrix
  • aorta triage run --recipe <bad.yaml> --dry-run prints the resolved cell list and validation errors WITHOUT executing
  • Recipe loader rejects unknown schema_version with a clear message
  • Recipe loader resolves all mitigation/environment names through aorta.registry; unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError
  • Per-cell extra_env overrides registered mitigation env vars and is recorded in matrix.json for that cell
  • Per-cell trials / steps overrides take effect; absence falls back to recipe top-level
  • recipe.resolved.yaml is written alongside matrix.md, with all registry names expanded to their underlying env-var bundles + docker refs
  • Inline docker in recipe (Option A): a recipe cell with environment: { docker: "rocm/pytorch:nightly" } runs end-to-end without any registry registration. Auto-name _inline_<8-char-blake2b> appears in recipe.resolved.yaml, in environments/<auto-name>/env.json, and in cells/<cell-name>/. Two cells with the same docker ref get the same auto-name (deterministic). Schema rejects any extra keys in the mapping with a clear RecipeSchemaError.

Flag path (secondary)

  • aorta triage run --mode matrix --workload fsdp --mitigation-axis none,tf32_off,xnack --environment-axis local --trials 8 produces a matrix.md with one row per (mitigation × environment) combination
  • Flag mode internally constructs a Recipe and reuses the same execution path (verified by mocking run_recipe and checking it's called once with a fully-formed Recipe)
  • --ticket SOME-TICKET flag groups output under triage_results/SOME-TICKET/...; absence routes to triage_results/_no_ticket_/...
  • Inline docker on CLI (Option B): --environment-axis local,image:rocm/pytorch:nightly runs as two cells per mitigation; the inline cell uses the same _inline_<hash> auto-name as Option A would for the same ref (verified by parsing recipe.resolved.yaml). Bare names continue to resolve via the registry. An unknown bare name still raises UnknownEnvironmentError; image: items never go through registry lookup.

Output layout

  • Output path: <output-dir>/<ticket>/<workload>/<run-timestamp>/{matrix.md,matrix.json,recipe.resolved.yaml,host_env.json,environments/,cells/}
  • Re-running the same recipe creates a NEW run-timestamp dir; never overwrites prior results
  • cells/<cell.name>/trial_*.json matches B1's per-trial JSON output (B1 writes them; B2 just points B1 there via RunRequest.output_dir)
  • Per-workload tracking: running triage twice with different workloads doesn't conflate results
  • Per-ticket tracking: running triage twice with different tickets keeps history separate per ticket

Implicit env probes

  • Host probe captured once: host_env.json exists in the run-timestamp dir, contains a valid EnvSnapshot (per A1 schema), and is written exactly once per aorta triage run invocation regardless of cell count. Verified by mocking collect_env and asserting it's called exactly once for host scope.
  • Per-environment probe captured once per unique environment: for a recipe whose cells reference 3 distinct environments across 12 cells, environments/<env-name>/env.json exists for each of the 3 environments, written before that environment's first cell runs. collect_env is called exactly 3 times in env scope (not 12).
  • Probes are calls to A1's library function — runner imports collect_env from aorta.instrumentation.environment. Verified by grep: no subprocess.run([...,"aorta","env","probe",...]) anywhere under src/aorta/triage/.
  • Probe failure does not abort the matrix: when collect_env() returns partial=True, the snapshot is persisted as-is, the matrix run continues, and matrix.md includes a top-of-file warning naming which probe scope was partial. Verified by a test that monkeypatches collect_env to return a partial snapshot and asserts (a) matrix.md still writes, (b) the warning text appears, (c) all cells execute.

Matrix correctness

  • Speed confound detection works: a synthetic test where one cell has 1.25× the baseline step-time produces speed (+25%) in that cell's Confound column
  • A cell with step_time_ratio == 1.0 (no slowdown) and nan_rate < baseline shows the "trust this cell" entry in its Confound column, not speed (+N%) or no effect
  • no effect overload: a cell with nan_rate >= baseline_nan_rate AND step_time_ratio <= threshold shows no effect in the same Confound column (no separate Verdict column)
  • matrix.json has full per-cell data: trial JSONs paths, raw step times, aggregated stats, resolved env vars, environment descriptor

Resilience

  • Resilient to single-trial failures within a cell (one fail in a cell of 8 = 7 / 8 not 0 / 8)
  • Resilient to a whole cell failing (e.g., docker not available) — that cell shows error in matrix.md, others still run, baseline detection still works as long as the baseline cell itself succeeded
  • If the baseline cell itself errors, matrix.md is still written with step_time_ratio columns showing n/a and a top-of-file warning; the run does NOT abort silently
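One way to get these resilience properties is to catch failures per cell and carry a missing ratio as `None` when the baseline has no step time. This is a sketch only: `run_trials` is passed in as a hypothetical callable taking a cell, whereas B1's real function takes a RunRequest.

```python
from typing import Any, Callable, Dict, List, Mapping, Optional


def run_cells(
    cells: List[Mapping[str, Any]],
    run_trials: Callable[[Mapping[str, Any]], List[Any]],
) -> Dict[str, Dict[str, Any]]:
    """Run every cell; a failing cell is recorded, not fatal."""
    results: Dict[str, Dict[str, Any]] = {}
    for cell in cells:
        try:
            results[cell["name"]] = {"trials": run_trials(cell), "error": None}
        except Exception as exc:  # whole-cell failure, e.g. image pull error
            results[cell["name"]] = {"trials": [], "error": f"{type(exc).__name__}: {exc}"}
    return results


def step_time_ratios(
    mean_ms: Mapping[str, Optional[float]], baseline: str
) -> Dict[str, Optional[float]]:
    """Ratio per non-baseline cell; None (rendered as n/a) when unavailable."""
    base = mean_ms.get(baseline)
    return {
        name: None if base is None or ms is None else ms / base
        for name, ms in mean_ms.items()
        if name != baseline
    }


def flaky(cell: Mapping[str, Any]) -> List[Any]:
    if cell["name"] == "broken":
        raise RuntimeError("docker not available")
    return ["trial-ok"]


results = run_cells([{"name": "baseline-local"}, {"name": "broken"}, {"name": "xnack-local"}], flaky)
print(results["broken"]["error"])  # RuntimeError: docker not available
print(step_time_ratios({"baseline-local": None, "xnack-local": 414.0}, "baseline-local"))
# {'xnack-local': None}
```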

Plumbing

  • In-process per-cell execution: triage runner calls B1's run_trials(RunRequest) as a Python function; does NOT shell out via subprocess.run(["aorta", "run", ...]). Verified by: (a) no subprocess import in src/aorta/triage/; (b) running the smoke matrix produces no extra child Python processes (one process for the whole matrix).
  • aorta triage --list-mitigations and --list-environments delegate to B3's resolver and tag each row with source_package so users see which entries come from aorta vs which plugin.
  • Tests cover: recipe schema validation, name resolution (mock the registry), flag→recipe builder, matrix aggregation, confound detection thresholds, output formatting, single-cell-failure resilience, baseline-cell-failure handling, per-cell call routes through B1's Python API (mock run_trials and assert it's called once per cell with the expected RunRequest)

Out of scope (P1+)

  • --mode optimize (Optuna-driven mitigation-stack search)
  • Protocol generator (aorta triage generate-protocol)
  • Secondhand matrix builder (aorta triage matrix --import slack-message.txt)
  • aorta triage matrix <dir> re-analysis mode (re-classify cells from existing trial JSONs without re-running)
  • Statistical significance testing (Fisher exact, chi-squared)
  • Auto-pruning of dominated mitigation combinations
  • Matrix sharding / parallel cell execution (--shard i/N, intra-node multi-GPU fan-out, node-level scheduler) — the MVP runner is a sequential nested loop. Don't bake parallelism into B2.
  • Recipe versioning / migration beyond the schema_version reject — once schema_version 2 exists we'll add a migrator.

How to test

# Smoke test on the in-tree fsdp workload — flag mode
aorta triage run --mode matrix --workload fsdp \
  --mitigation-axis none --environment-axis local \
  --trials 2 --steps 100
ls triage_results/_no_ticket_/fsdp/        # one timestamp dir
cat triage_results/_no_ticket_/fsdp/*/matrix.md

# Smoke test — recipe mode
aorta triage run --recipe recipes/example-fsdp-smoke.yaml --dry-run
aorta triage run --recipe recipes/example-fsdp-smoke.yaml

# Full smoke matrix
aorta triage run --mode matrix --workload fsdp \
  --mitigation-axis none,tf32_off,xnack \
  --environment-axis local \
  --trials 4 --steps 1000

# Discoverability
aorta triage --list-mitigations
aorta triage --list-environments

PR template

Title: B2: aorta triage run --mode matrix (recipe + flag modes)
Body: include sample matrix.md output, a small example recipe.yaml, confirm confound detection (synthetic +25% slowdown produces speed (+25%) flag), include matrix.json snippet, link to B3 PR (registries) and B1 PR (run_trials Python API).
