Goal
Productize a contingency-table style triage matrix: for a given workload, run a fixed set of (mitigation × environment) cells, count failures and step-times per cell, and produce a matrix.md table that flags cells where a mitigation may be working only via a speed confound.
Driven by either a checked-in recipe file (primary) or ad-hoc flags (shim) — both modes converge on the same run_recipe(recipe) execution path.
Why this matters — speed-confound detection
Investigations into mitigation effectiveness routinely run into the same trap: a change that eliminates a numeric failure also slows down GPU execution. Without step-time measured alongside pass/fail, the matrix cannot distinguish mitigations that "fix" the bug from ones that merely delay it past the end of the observed run.
Concrete pattern from prior internal RCAs:
A mitigation that disables TF32 eliminates a NaN — but is ~25% slower.
A mitigation that toggles XNACK eliminates the same NaN — and is not measurably slower.
The first reads "speed confound suspected; verify with profiler before drawing conclusions"; the second reads "trust this cell." Encoding this distinction in the matrix output is the core value-add over a hand-maintained spreadsheet.
The lesson:
Capture step-time per trial alongside pass/fail.
Compute baseline step-time from the (no-mitigation × baseline-environment) cell.
Flag any cell where cell_step_time / baseline_step_time > 1.15 as a potential speed confound.
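The three steps above can be sketched in a few lines. This is only an illustration: the function names and the sample step-times are made up for the example, and only the 1.15 threshold comes from the text.

```python
from statistics import mean

CONFOUND_THRESHOLD = 1.15  # default threshold stated in the text


def step_time_ratio(cell_step_ms: list[float], baseline_step_ms: list[float]) -> float:
    """Mean step-time of a cell divided by mean step-time of the baseline cell."""
    return mean(cell_step_ms) / mean(baseline_step_ms)


def is_speed_confound(ratio: float, threshold: float = CONFOUND_THRESHOLD) -> bool:
    """True when the cell is slow enough that its pass rate is suspect."""
    return ratio > threshold


# Illustrative per-trial mean step-times in milliseconds (not real measurements).
baseline = [410.0, 412.0, 414.0]
tf32_off = [513.0, 515.0, 517.0]
xnack = [413.0, 414.0, 415.0]

tf32_ratio = step_time_ratio(tf32_off, baseline)
xnack_ratio = step_time_ratio(xnack, baseline)
```

With these numbers the tf32_off cell crosses the threshold and the xnack cell does not, which mirrors the two RCA cases described earlier.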
Design shape — recipe primary, flags secondary
A triage matrix run is two pieces of information:
Which cells to run — the cartesian product (or a hand-picked subset) of mitigation × environment, plus per-cell trial/step counts.
What investigation it belongs to — drives output grouping; ties weeks of matrix.md artifacts back to the originating ticket.
A YAML/JSON recipe file captures both, lives in version control, and gives users a one-line invocation: aorta triage run --recipe recipes/<name>.yaml. Flags are kept for ad-hoc one-shots and as an escape hatch — internally they construct an in-memory recipe and reuse the same execution path.
Recipe names resolve through aorta.registry (B3): mitigations: [tf32_off, xnack] is a list of registered names (built-in or contributed via aorta.mitigations entry-points). Same for environment: <name>. Unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError.
Files to create / modify
src/aorta/cli/triage.py # MODIFY — replace placeholder with real call; add dual-mode CLI
src/aorta/triage/runner.py # NEW — main orchestration: cell loop, calls run_trials() per cell
src/aorta/triage/recipe.py # NEW — recipe schema + loader (YAML/JSON), name resolution via aorta.registry,
# flag-mode helper that builds a Recipe in memory from CLI args
src/aorta/triage/matrix.py # NEW — contingency table data structure + aggregation
src/aorta/triage/output.py # NEW — matrix.json writer + matrix.md formatter; per-ticket dir resolution
src/aorta/triage/confound.py # NEW — speed-confound detection logic
tests/triage/__init__.py # NEW (empty)
tests/triage/test_recipe.py # NEW — schema validation, name resolution, flag→recipe builder
tests/triage/test_matrix.py # NEW
tests/triage/test_confound.py # NEW
tests/triage/test_output_layout.py # NEW — per-ticket / per-workload / per-timestamp dir grouping
recipes/ # NEW — public-friendly example recipes only
recipes/example-fsdp-smoke.yaml # NEW — minimal smoke recipe wired to the in-tree fsdp workload
recipes/README.md # NEW — recipe authoring guide
Recipe schema
Authoritative shape (YAML; JSON accepted with the same keys):
# recipes/example.yaml
schema_version: 1              # required; loader rejects unknown versions
ticket: EXAMPLE-001            # optional; drives output grouping when present
workload: fsdp                 # required; resolved via aorta.workloads entry-point group
trials: 8                      # required; per-cell trial count
steps: 5000                    # required; per-cell step count

# Confound detection: overrides defaults when present
confound:
  threshold: 1.15              # cell_step_time / baseline_step_time above this -> "speed (+N%)"
  baseline_cell: baseline-local  # optional; defaults to first cell named "baseline-*" or first cell with mitigations: [none]

# The cells to run (NOT the cartesian product; an explicit list keeps recipes readable
# and lets users omit redundant pairings).
cells:
  - name: baseline-local
    mitigations: [none]        # list resolved through load_mitigations(); env vars unioned
    environment: local         # single string resolved through load_environments()
  - name: tf32_off-local
    mitigations: [tf32_off]
    environment: local
  - name: stack-tf32+xnack-local   # shows mitigation stacking
    mitigations: [tf32_off, xnack]
    environment: local
    trials: 16                 # optional per-cell override of top-level trials
    steps: 8000                # optional per-cell override of top-level steps
  - name: try-nightly          # inline docker shorthand (Option A); no registry registration needed
    mitigations: [none]
    environment: { docker: "rocm/pytorch:nightly" }
Note on workload-coupled mitigations. Recipes that need workload-internal env vars (e.g. AMP_DTYPE, SHAMPOO_PRECONDITIONER_DTYPE) must reference mitigations registered by the workload's own package via the aorta.mitigations entry-point group. Public aorta ships only true runtime-level built-ins (none, tf32_off, xnack); see B3 spec for the rationale.
Schema rules
schema_version: required int; current value 1. Unknown values raise RecipeSchemaError.
ticket: optional string; format-free (caller's choice). Drives the output directory; may be omitted for ad-hoc runs.
workload: required string; must resolve via the aorta.workloads entry-point group (B1's existing dependency). Unknown names raise UnknownWorkloadError from B1's loader.
trials / steps: required ints at top level; per-cell overrides allowed.
confound.threshold: optional float; default 1.15.
confound.baseline_cell: optional string. Resolution order if absent: (1) the first cell whose name starts with baseline-; (2) the first cell where mitigations == ["none"]; (3) error if neither is found and more than one cell exists.
cells[*].name: required string, unique within the recipe; used as the row label in matrix.md.
cells[*].mitigations: required list[str]; each name is resolved through aorta.registry.get_mitigation(). An empty list is rejected (use ["none"] for the explicit baseline). With multiple names, the env-var bundles are unioned in list order; a collision within a cell raises RecipeCellError.
cells[*].environment: required. One of:
String: a registered environment name, resolved through aorta.registry.get_environment().
Mapping { docker: "<image-ref>" }: inline docker shorthand (Option A). Auto-named _inline_<hash>, where <hash> is the first 8 chars of blake2b(image-ref). Behaves identically to a named docker entry from that point on (per-environment probe, recipe.resolved.yaml records the auto-name + ref, output dir uses the auto-name). There is intentionally no name: field; anything you'd want to name belongs in the registry. No other keys are accepted.
cells[*].extra_env: optional inline ad-hoc env override, written as {KEY: VALUE, ...}. Applied AFTER the mitigation env-vars in that cell, so it can override a registered mitigation's bundle for one-off experiments without polluting the registry. Logged in matrix.json per cell so the audit trail is preserved.
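The merge order for a cell's environment variables (mitigation bundles first, extra_env last) could look roughly like the sketch below. The source does not define what counts as a collision; this sketch assumes it means two mitigations setting the same key to different values. The function name, the bundle representation, and the EXAMPLE_* variable names are all hypothetical.

```python
from typing import Optional


class RecipeCellError(ValueError):
    """Stand-in for the error the schema rules name for intra-cell collisions."""


def merge_cell_env(
    bundles: list[tuple[str, dict[str, str]]],
    extra_env: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Union mitigation env-var bundles in list order, then apply extra_env on top."""
    merged: dict[str, str] = {}
    owner: dict[str, str] = {}  # which mitigation first set each key
    for mitigation, bundle in bundles:
        for key, value in bundle.items():
            # Assumption: a collision is the same key with a different value.
            if key in merged and merged[key] != value:
                raise RecipeCellError(
                    f"{key!r} is set by both {owner[key]!r} and {mitigation!r}"
                )
            merged[key] = value
            owner.setdefault(key, mitigation)
    # extra_env is applied last, so it may override a registered bundle.
    merged.update(extra_env or {})
    return merged


# Hypothetical bundles; the real registry contents are not shown in this spec.
stacked = merge_cell_env(
    [("tf32_off", {"EXAMPLE_TF32": "0"}), ("xnack", {"EXAMPLE_XNACK": "1"})],
    extra_env={"EXAMPLE_TF32": "1"},
)
```

Because extra_env is merged after the collision check, a one-off override never trips RecipeCellError; only two registered mitigations disagreeing does.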
The loader normalizes the YAML/JSON into an in-memory Recipe dataclass (frozen). Validation happens once, at load time; the runner consumes the validated structure.
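A minimal sketch of validate-once-then-freeze, assuming the YAML/JSON has already been parsed into a plain dict. The field layout of Cell and Recipe is a guess for illustration, registry name resolution is omitted, and only string environments are handled (the inline docker mapping is left out).

```python
from dataclasses import dataclass
from typing import Optional


class RecipeSchemaError(ValueError):
    """Stand-in for the loader error named in the schema rules."""


@dataclass(frozen=True)
class Cell:
    name: str
    mitigations: tuple[str, ...]
    environment: str
    trials: int
    steps: int


@dataclass(frozen=True)
class Recipe:
    workload: str
    ticket: Optional[str]
    threshold: float
    cells: tuple[Cell, ...]


def load_recipe(raw: dict) -> Recipe:
    """Validate a parsed recipe mapping and return an immutable Recipe."""
    if raw.get("schema_version") != 1:
        raise RecipeSchemaError(f"unsupported schema_version: {raw.get('schema_version')!r}")
    for key in ("workload", "trials", "steps", "cells"):
        if key not in raw:
            raise RecipeSchemaError(f"missing required key: {key}")
    cells: list[Cell] = []
    seen: set[str] = set()
    for entry in raw["cells"]:
        name = entry.get("name")
        if not name or name in seen:
            raise RecipeSchemaError(f"cell name missing or duplicated: {name!r}")
        seen.add(name)
        if not entry.get("mitigations"):
            raise RecipeSchemaError(f"{name}: mitigations must be a non-empty list")
        if "environment" not in entry:
            raise RecipeSchemaError(f"{name}: environment is required")
        cells.append(
            Cell(
                name=name,
                mitigations=tuple(entry["mitigations"]),
                environment=entry["environment"],
                # Per-cell overrides fall back to the recipe's top-level values.
                trials=entry.get("trials", raw["trials"]),
                steps=entry.get("steps", raw["steps"]),
            )
        )
    threshold = raw.get("confound", {}).get("threshold", 1.15)
    return Recipe(raw["workload"], raw.get("ticket"), threshold, tuple(cells))
```

Freezing the result means the runner cannot mutate the recipe mid-run, so what is written out as the executed recipe matches what was validated.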
CLI surface
aorta triage run --recipe <file> # primary mode
[--output-dir DIR]
[--dry-run] # validate + print resolved cells, do NOT execute
aorta triage run --mode matrix # secondary mode (flag shim — constructs a Recipe in memory)
--workload <name>
--mitigation-axis m1,m2,m3 # cartesian product with --environment-axis (each value = one matrix row)
--environment-axis e1,e2,image:rocm/foo:bar # bare names = registry lookup; "image:<ref>" = inline docker (Option B)
--trials N
[--steps S]
[--ticket TICKET] # optional; only effect is output dir grouping
[--baseline-cell NAME] # default: name of (mitigations=[none], environment=first env-axis value)
[--confound-threshold 1.15]
[--output-dir triage_results]
aorta triage --list-mitigations # delegates to B3 resolver; tags entries by source_package
aorta triage --list-environments
Flag mode internally builds a Recipe whose cells is the full cartesian product mitigation_axis × environment_axis, with each cell named <mitigation>-<environment>. The runner does not branch on mode after that point — both paths converge on the same run_recipe(recipe) function.
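As a rough illustration of the flag shim, the cartesian product and the default baseline name can be built with itertools.product. The function names and the dict-shaped cells are placeholders for this sketch, and "staging" is a made-up environment name; the actual Recipe type lives in recipe.py.

```python
from itertools import product


def split_axis(axis: str) -> list[str]:
    """Split a comma-separated axis flag into its non-empty items."""
    return [item.strip() for item in axis.split(",") if item.strip()]


def build_cells(mitigation_axis: str, environment_axis: str) -> list[dict]:
    """One cell per (mitigation, environment) pair, named <mitigation>-<environment>."""
    return [
        {"name": f"{m}-{e}", "mitigations": [m], "environment": e}
        for m, e in product(split_axis(mitigation_axis), split_axis(environment_axis))
    ]


def default_baseline_name(environment_axis: str) -> str:
    """Default --baseline-cell: mitigations=[none] on the first env-axis value."""
    return f"none-{split_axis(environment_axis)[0]}"


cells = build_cells("none,tf32_off,xnack", "local,staging")
```

product() iterates the rightmost axis fastest, so the cells for one mitigation appear together across all environments.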
--environment-axis item parsing (Option B): each comma-separated item is parsed by prefix. image:<ref> becomes an inline-docker cell using the same { docker: "<ref>" } path as Option A — auto-named _inline_<hash>, hash visible in the cell name (<mitigation>-_inline_<hash>) so multiple inline images on one axis are distinguishable. Anything without a recognized prefix is a registered environment name. No other prefixes are defined for MVP — the mental model is "registry name OR image:, that's it."
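The prefix rule and the deterministic auto-name can be sketched with hashlib. The text only says "first 8 chars of blake2b(image-ref)"; this sketch assumes that means the hex digest of the UTF-8 encoded ref at the default digest size. The function names and the returned dict shape are hypothetical.

```python
import hashlib


def inline_name(image_ref: str) -> str:
    """Auto-name for an inline docker environment: _inline_<first 8 hex chars>."""
    # Assumption: hex digest of the UTF-8 bytes, default blake2b digest size.
    digest = hashlib.blake2b(image_ref.encode("utf-8")).hexdigest()
    return f"_inline_{digest[:8]}"


def parse_environment_item(item: str) -> dict:
    """Parse one comma-separated --environment-axis item by prefix."""
    prefix = "image:"
    if item.startswith(prefix):
        ref = item[len(prefix):]  # keep everything after the prefix, colons included
        return {"name": inline_name(ref), "docker": ref}
    # No recognized prefix: treat as a registered environment name,
    # to be resolved through the registry later.
    return {"name": item, "docker": None}


items = [parse_environment_item(i) for i in "local,image:rocm/pytorch:nightly".split(",")]
```

Only the prefix is stripped, so tags and registry ports inside the image ref survive intact.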
There is intentionally no --dockers flag — environments are the abstraction (docker is one component of an environment alongside venv and ROCm version).
Behavior
For each cell (name, mitigations, environment, trials, steps, extra_env):
Build RunRequest (B1's contract):
workload = recipe.workload
mitigations = cell.mitigations (list of names — B1 resolves via load_mitigations() and unions env vars)
environment = cell.environment (single name — B1 resolves via load_environments() to docker / venv / rocm)
extra_env = cell.extra_env (passed through if set)
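trials = cell.trials (with per-cell override falling back to recipe top-level)
steps = cell.steps
output_dir = <output_dir>/<ticket>/<workload>/<run-timestamp>/cells/<cell.name>/
Call B1's run_trials(RunRequest) -> list[TrialResult] in-process (one Python interpreter for the whole matrix). B1's dispatcher writes per-trial JSON to RunRequest.output_dir; B2 also receives the in-memory list for aggregation.
Aggregate per cell:
passed_count / failed_count (failure = workload-defined: NaN, throw, etc.)
nan_rate = failed_count / trials
mean_step_time_ms, std_step_time_ms, p50, p99
mean_wall_clock_sec
error: str | None if the whole cell failed (e.g., docker pull failure); surfaced in the matrix without aborting other cells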
After all cells run: locate the baseline cell per the resolution order in §Schema rules. Compute step_time_ratio for every non-baseline cell.
Apply confound rules (see below).
Write matrix.json (full data) and matrix.md (human-readable table) to the run-timestamp directory.
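The aggregation and baseline-lookup steps could be sketched as follows. The trial dict shape (failed, mean_step_ms) is invented for the example since B1's TrialResult fields are not shown here, percentiles are omitted, and the single-cell case is left as None because the schema rules do not say what happens there.

```python
from statistics import mean, pstdev
from typing import Optional


def aggregate(trials: list[dict]) -> dict:
    """Collapse one cell's trials into the per-cell counters and step-time stats."""
    failed = sum(1 for t in trials if t["failed"])
    step_ms = [t["mean_step_ms"] for t in trials]
    return {
        "passed_count": len(trials) - failed,
        "failed_count": failed,
        "nan_rate": failed / len(trials),
        "mean_step_time_ms": mean(step_ms),
        "std_step_time_ms": pstdev(step_ms),
    }


def resolve_baseline(cells: list[dict], explicit: Optional[str] = None) -> Optional[str]:
    """Pick the baseline cell name using the resolution order in the schema rules."""
    if explicit is not None:
        if explicit not in {c["name"] for c in cells}:
            raise ValueError(f"baseline_cell {explicit!r} is not a cell in this recipe")
        return explicit
    for cell in cells:  # (1) first cell named baseline-*
        if cell["name"].startswith("baseline-"):
            return cell["name"]
    for cell in cells:  # (2) first cell whose mitigations are exactly ["none"]
        if cell["mitigations"] == ["none"]:
            return cell["name"]
    if len(cells) > 1:  # (3) error when neither rule matched
        raise ValueError("no baseline cell found")
    return None  # single-cell recipe: behaviour not specified in the source


summary = aggregate(
    [{"failed": i < 4, "mean_step_ms": 412.0} for i in range(8)]
)
```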
Confound rules (Confound column overload)
A single Confound column carries one of:
(baseline): the baseline cell.
— (a lone dash): step_time_ratio <= threshold AND nan_rate < baseline_nan_rate (the mitigation works without a speed cost; trust the cell).
speed (+N%): step_time_ratio > threshold (the mitigation may be suppressing failure via slower iteration).
no effect: nan_rate >= baseline_nan_rate AND step_time_ratio <= threshold (the mitigation neither moved the failure rate nor slowed iteration).
error: the whole cell failed; the row is preserved so the matrix is complete.
Pre-registering kill criteria in the same column (rather than a separate Verdict column) keeps the matrix layout matching the source-of-truth manual table.
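The rules for the Confound column translate fairly directly into one function. This is a sketch under assumptions: the cell dict keys are placeholders, N is taken to be the slowdown rounded to the nearest percent, and the case where the baseline itself errored is not handled.

```python
TRUSTED = "\u2014"  # the lone dash shown in the Confound column


def classify(
    cell: dict,
    baseline_nan_rate: float,
    baseline_step_ms: float,
    threshold: float = 1.15,
    is_baseline: bool = False,
) -> str:
    """Return the Confound column value for one cell."""
    if cell.get("error"):
        return "error"
    if is_baseline:
        return "(baseline)"
    ratio = cell["mean_step_time_ms"] / baseline_step_ms
    if ratio > threshold:
        # Slower than the threshold: a pass may be a speed artifact.
        return f"speed (+{round((ratio - 1) * 100)}%)"
    if cell["nan_rate"] < baseline_nan_rate:
        return TRUSTED
    return "no effect"


# Numbers taken from the sample table in this spec (baseline: 50% NaN, 412 ms).
tf32 = classify({"nan_rate": 0.0, "mean_step_time_ms": 515.0}, 0.5, 412.0)
stack = classify({"nan_rate": 0.0, "mean_step_time_ms": 518.0}, 0.5, 412.0)
xnack = classify({"nan_rate": 0.0, "mean_step_time_ms": 414.0}, 0.5, 412.0)
```

Checking the speed rule before the failure-rate rule matters: a slow cell is flagged as a speed confound whether or not its NaN rate dropped.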
Implicit env probes (host + per-environment)
Per A1's contract, aorta.instrumentation.environment.collect_env() is a library function the runner calls directly, never by shelling out to aorta env probe. The runner takes two snapshots beyond what B1 already embeds per trial:
Host: taken once at runner start and written to <run-timestamp>/host_env.json (sibling of matrix.md).
Per-environment (ROCm runtime in container, hipBLASLt version, pip freeze, env vars before mitigation injection): taken once per unique --environment-axis value (or unique cells[*].environment in recipe mode), immediately before that environment's first cell runs, and written to <run-timestamp>/environments/<env-name>/env.json (sibling of cells/).
Host state is invariant across the matrix; per-environment state varies per env cell. Splitting them this way keeps host_env.json a single canonical "what was the box like when this matrix ran" record, while environments/<env-name>/env.json is the file users reach for when a bug looks like ROCm-version drift between two environments.
Per-trial env capture is unchanged — B1 still writes EnvSnapshot into each trial_<id>.json. The two new top-level snapshots are deduplication: the host snapshot would otherwise appear in every trial JSON unchanged; the per-environment snapshot would otherwise appear in every cell's trial JSONs unchanged.
Probe failure never aborts the matrix. collect_env() is contractually fail-soft (A1): if dmesg is restricted or rdhc isn't installed, the snapshot lands with partial: true plus partial_reasons, and the runner continues. A top-of-file warning in matrix.md surfaces the partial state.
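One way to get "once per unique environment, fail-soft" is a small memoizing wrapper that the cell loop calls before each cell. The collect callable here is injected and takes an environment name, which is an assumption for the sketch; the actual collect_env() signature is defined by A1 and is not shown in this spec.

```python
from typing import Callable


def make_env_prober(collect: Callable[[str], dict]):
    """Return (probe_once, snapshots, warnings) sharing one per-environment cache."""
    snapshots: dict[str, dict] = {}
    warnings: list[str] = []

    def probe_once(env_name: str) -> dict:
        # Only the first cell of each environment triggers a real probe.
        if env_name not in snapshots:
            snap = collect(env_name)
            snapshots[env_name] = snap  # persisted as-is, even when partial
            if snap.get("partial"):
                reasons = snap.get("partial_reasons", [])
                warnings.append(f"environment {env_name}: partial snapshot {reasons}")
        return snapshots[env_name]

    return probe_once, snapshots, warnings


calls: list[str] = []


def fake_collect(env_name: str) -> dict:
    """Test double standing in for the real probe."""
    calls.append(env_name)
    if env_name == "env-b":
        return {"partial": True, "partial_reasons": ["dmesg restricted"]}
    return {"partial": False}


probe_once, snapshots, warnings = make_env_prober(fake_collect)
for i in range(12):
    probe_once(["env-a", "env-b", "env-c"][i % 3])
```

The collected warnings list is what a formatter would later turn into the top-of-file notice in matrix.md.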
In-process execution (not subprocess)
Per-cell execution calls B1's run_trials() as a Python function. Why this matters:
One Python process for the whole matrix (no interpreter / torch / entry-point-discovery cost per cell)
Cell results returned as list[TrialResult] objects in memory — no JSON parse round-trip
Workload exceptions surface as Python tracebacks, not exit-code + stderr text
Tests mock run_trials() directly; no subprocess plumbing
Distributed workloads under torchrun work without extra wiring — every rank executes the matrix loop, B1's dispatcher's existing rank-aware writes apply per cell, only RANK == 0 writes matrix.{json,md} (same pattern B1 uses for trial_<id>.json)
This requires B1 to expose a clean Python entry-point (run_trials(RunRequest) -> list[TrialResult]).
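A hedged sketch of the in-process cell loop: run_trials is passed in as a plain callable (standing in for B1's function, whose real argument is a RunRequest), a whole-cell exception is recorded instead of aborting the matrix, and the rank check reads the RANK variable that torchrun sets. The cell and result shapes are placeholders.

```python
import os
from typing import Callable


def run_cells(cells: list[dict], run_trials: Callable[[dict], list]) -> dict[str, dict]:
    """Run every cell in this process; a failing cell becomes an error row."""
    results: dict[str, dict] = {}
    for cell in cells:
        try:
            results[cell["name"]] = {"trials": run_trials(cell), "error": None}
        except Exception as exc:  # e.g. a docker pull failure for this cell only
            results[cell["name"]] = {
                "trials": [],
                "error": f"{type(exc).__name__}: {exc}",
            }
    return results


def is_writer_rank() -> bool:
    """Only rank 0 writes matrix.json / matrix.md; RANK is unset outside torchrun."""
    return int(os.environ.get("RANK", "0")) == 0


def fake_run_trials(cell: dict) -> list:
    """Test double: one cell fails as a whole, the others return one trial."""
    if cell["name"] == "bad-env":
        raise RuntimeError("docker pull failed")
    return [{"failed": False}]


results = run_cells(
    [{"name": "baseline-local"}, {"name": "bad-env"}, {"name": "xnack-local"}],
    fake_run_trials,
)
```

Because run_trials is an ordinary callable, the tests described in this spec can substitute a mock without any subprocess plumbing.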
Output layout
triage_results/
├── EXAMPLE-001/ # <ticket> from recipe; "_no_ticket_" if absent
│ └── fsdp/ # <workload>
│ ├── 2026-04-28T14-12-03/ # <run-timestamp>; one dir per invocation, never overwritten
│ │ ├── matrix.md
│ │ ├── matrix.json
│ │ ├── recipe.resolved.yaml # the recipe AS EXECUTED (registry names already resolved
│ │ │ # to env-var bundles + image refs; reproducibility artifact)
│ │ ├── host_env.json # collect_env() snapshot taken once at runner start (host scope)
│ │ ├── environments/
│ │ │ └── local/
│ │ │ └── env.json # collect_env() snapshot before first cell on this env
│ │ └── cells/
│ │ ├── baseline-local/
│ │ │ ├── trial_0.json # written by B1
│ │ │ ├── trial_1.json
│ │ │ └── ...
│ │ ├── tf32_off-local/
│ │ └── ...
│ └── 2026-04-29T09-44-17/ # next run; same layout
└── _no_ticket_/ # ad-hoc runs without a ticket
└── fsdp/
└── 2026-04-28T15-02-11/
└── ...
<ticket>/<workload>/ lets users see the full history of attempts on a given problem at a glance. recipe.resolved.yaml is the post-resolution snapshot — it embeds the actual env-var bundles, docker digests, etc., so re-running it on a different machine reproduces the exact same matrix even if the registries have changed since.
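The directory grouping can be sketched with pathlib. The helper names are hypothetical; the timestamp format is inferred from the example directory names in the tree, and exist_ok=False is one simple way to honour "never overwritten".

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

NO_TICKET = "_no_ticket_"  # bucket for ad-hoc runs, as in the tree


def run_dir(output_root: Path, ticket: Optional[str], workload: str, now: datetime) -> Path:
    """Compute <root>/<ticket>/<workload>/<run-timestamp> for one invocation."""
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S")  # colons avoided in directory names
    return output_root / (ticket or NO_TICKET) / workload / stamp


def create_run_dir(path: Path) -> Path:
    """Create the run directory; fails rather than reusing an existing one."""
    path.mkdir(parents=True, exist_ok=False)
    (path / "cells").mkdir()
    (path / "environments").mkdir()
    return path


ticketed = run_dir(Path("triage_results"), "EXAMPLE-001", "fsdp", datetime(2026, 4, 28, 14, 12, 3))
adhoc = run_dir(Path("triage_results"), None, "fsdp", datetime(2026, 4, 28, 15, 2, 11))
```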
matrix.md target format
# Triage Matrix — fsdp
**Ticket**: EXAMPLE-001
**Workload**: fsdp
**Recipe**: recipes/example-fsdp.yaml (sha256:abc12...)
**Trials per cell**: 8
**Steps per trial**: 5000
**Run timestamp**: 2026-04-28T14:12:03Z
**Baseline cell**: baseline-local (mean step time = 412 ms)
## Reproduction Summary
| Cell                   | Mitigations     | Environment | NaN rate | Trials | Mean step (ms) | Confound     |
|------------------------|-----------------|-------------|----------|--------|----------------|--------------|
| baseline-local         | none            | local       | 50%      | 4 / 8  | 412            | (baseline)   |
| tf32_off-local         | tf32_off        | local       | 0%       | 0 / 8  | 515            | speed (+25%) |
| xnack-local            | xnack           | local       | 0%       | 0 / 8  | 414            | —            |
| stack-tf32+xnack-local | tf32_off, xnack | local       | 0%       | 0 / 8  | 518            | speed (+26%) |
## Notes
- Cell name comes from the recipe; mitigations + environment columns disambiguate when names get terse.
- Confound column legend:
  - `(baseline)`: the cell against which all step-time ratios are computed.
  - `—`: the mitigation appears to work without a speed cost. Trust this cell.
  - `speed (+N%)`: the mitigation may be suppressing failure via slower iteration rather than a real fix. Verify with `rocprofv3` dispatch comparison before drawing causal conclusions.
  - `no effect`: the mitigation neither changed the failure rate nor slowed iteration; it likely doesn't apply to this workload (the env vars it sets aren't read).
- Only `mean step (ms)` is shown here. Per-cell `std`, `p50`, `p99`, raw step-time arrays, and per-trial JSON paths are in `matrix.json`.
- `recipe.resolved.yaml` (alongside this file) captures the registry state at run time; re-run it to reproduce.
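A formatter for the summary table might look like the sketch below. The cell dict keys are placeholders, and the Trials column is assumed to be failed / total, which is what the sample rows suggest (0% NaN rate alongside 0 / 8) but is not stated outright.

```python
HEADER = ["Cell", "Mitigations", "Environment", "NaN rate", "Trials", "Mean step (ms)", "Confound"]


def render_row(cell: dict) -> str:
    """Format one aggregated cell as a markdown table row."""
    values = [
        cell["name"],
        ", ".join(cell["mitigations"]),
        cell["environment"],
        f"{cell['nan_rate']:.0%}",
        f"{cell['failed_count']} / {cell['trials']}",  # assumed: failed / total
        f"{cell['mean_step_time_ms']:.0f}",
        cell["confound"],
    ]
    return "| " + " | ".join(values) + " |"


def render_table(cells: list[dict]) -> str:
    """Header, separator, then one row per cell in recipe order."""
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "|" + "|".join("---" for _ in HEADER) + "|",
    ]
    lines.extend(render_row(cell) for cell in cells)
    return "\n".join(lines)


baseline_row = render_row(
    {
        "name": "baseline-local",
        "mitigations": ["none"],
        "environment": "local",
        "nan_rate": 0.5,
        "failed_count": 4,
        "trials": 8,
        "mean_step_time_ms": 412.0,
        "confound": "(baseline)",
    }
)
```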
Acceptance criteria
Recipe path (primary)
aorta triage run --recipe recipes/example-fsdp-smoke.yaml validates the recipe and runs the matrix
aorta triage run --recipe <bad.yaml> --dry-run prints the resolved cell list and validation errors WITHOUT executing
Recipe loader rejects unknown schema_version with a clear message
Recipe loader resolves all mitigation/environment names through aorta.registry; unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError
Per-cell extra_env overrides registered mitigation env vars and is recorded in matrix.json for that cell
Per-cell trials / steps overrides take effect; absence falls back to recipe top-level
recipe.resolved.yaml is written alongside matrix.md, with all registry names expanded to their underlying env-var bundles + docker refs
Inline docker in recipe (Option A): a recipe cell with environment: { docker: "rocm/pytorch:nightly" } runs end-to-end without any registry registration. Auto-name _inline_<8-char-blake2b> appears in recipe.resolved.yaml, in environments/<auto-name>/env.json, and in cells/<cell-name>/. Two cells with the same docker ref get the same auto-name (deterministic). Schema rejects any extra keys in the mapping with a clear RecipeSchemaError.
Flag path (secondary)
aorta triage run --mode matrix --workload fsdp --mitigation-axis none,tf32_off,xnack --environment-axis local --trials 8 produces a matrix.md with one row per (mitigation × environment) combination
Flag mode internally constructs a Recipe and reuses the same execution path (verified by mocking run_recipe and checking it's called once with a fully-formed Recipe)
--ticket SOME-TICKET flag groups output under triage_results/SOME-TICKET/...; absence routes to triage_results/_no_ticket_/...
Inline docker on CLI (Option B): --environment-axis local,image:rocm/pytorch:nightly runs as two cells per mitigation; the inline cell uses the same _inline_<hash> auto-name as Option A would for the same ref (verified by parsing recipe.resolved.yaml). Bare names continue to resolve via the registry. An unknown bare name still raises UnknownEnvironmentError; image: items never go through registry lookup.
Re-running the same recipe creates a NEW run-timestamp dir; never overwrites prior results
cells/<cell.name>/trial_*.json matches B1's per-trial JSON output (B1 writes them; B2 just points B1 there via RunRequest.output_dir)
Per-workload tracking: running triage twice with different workloads doesn't conflate results
Per-ticket tracking: running triage twice with different tickets keeps history separate per ticket
Implicit env probes
Host probe captured once: host_env.json exists in the run-timestamp dir, contains a valid EnvSnapshot (per A1 schema), and is written exactly once per aorta triage run invocation regardless of cell count. Verified by mocking collect_env and asserting it's called exactly once for host scope.
Per-environment probe captured once per unique environment: for a recipe whose cells reference 3 distinct environments across 12 cells, environments/<env-name>/env.json exists for each of the 3 environments, written before that environment's first cell runs. collect_env is called exactly 3 times in env scope (not 12).
Probes are calls to A1's library function — runner imports collect_env from aorta.instrumentation.environment. Verified by grep: no subprocess.run([...,"aorta","env","probe",...]) anywhere under src/aorta/triage/.
Probe failure does not abort the matrix: when collect_env() returns partial=True, the snapshot is persisted as-is, the matrix run continues, and matrix.md includes a top-of-file warning naming which probe scope was partial. Verified by a test that monkeypatches collect_env to return a partial snapshot and asserts (a) matrix.md still writes, (b) the warning text appears, (c) all cells execute.
Matrix correctness
Speed confound detection works: a synthetic test where one cell has 1.25× the baseline step-time produces speed (+25%) in that cell's Confound column
A cell with step_time_ratio == 1.0 (no slowdown) and nan_rate < baseline shows —
no effect overload: a cell with nan_rate >= baseline_nan_rate AND step_time_ratio <= threshold shows no effect in the same Confound column (no separate Verdict column)
matrix.json has full per-cell data: trial JSONs paths, raw step times, aggregated stats, resolved env vars, environment descriptor
Resilience
Resilient to single-trial failures within a cell (one fail in a cell of 8 = 7 / 8 not 0 / 8)
Resilient to a whole cell failing (e.g., docker not available) — that cell shows error in matrix.md, others still run, baseline detection still works as long as the baseline cell itself succeeded
If the baseline cell itself errors, matrix.md is still written with step_time_ratio columns showing n/a and a top-of-file warning; the run does NOT abort silently
Plumbing
In-process per-cell execution: triage runner calls B1's run_trials(RunRequest) as a Python function; does NOT shell out via subprocess.run(["aorta", "run", ...]). Verified by: (a) no subprocess import in src/aorta/triage/; (b) running the smoke matrix produces no extra child Python processes (one process for the whole matrix).
aorta triage --list-mitigations and --list-environments delegate to B3's resolver and tag each row with source_package so users see which entries come from aorta vs which plugin.
Tests cover: recipe schema validation, name resolution (mock the registry), flag→recipe builder, matrix aggregation, confound detection thresholds, output formatting, single-cell-failure resilience, baseline-cell-failure handling, per-cell call routes through B1's Python API (mock run_trials and assert it's called once per cell with the expected RunRequest)
Out of scope
Matrix sharding / parallel cell execution (--shard i/N, intra-node multi-GPU fan-out, node-level scheduler): the MVP runner is a sequential nested loop. Don't bake parallelism into B2.
Recipe versioning / migration beyond the schema_version reject — once schema_version 2 exists we'll add a migrator.
How to test
# Smoke test on the in-tree fsdp workload — flag mode
aorta triage run --mode matrix --workload fsdp \
--mitigation-axis none --environment-axis local \
--trials 2 --steps 100
ls triage_results/_no_ticket_/fsdp/ # one timestamp dir
cat triage_results/_no_ticket_/fsdp/*/matrix.md
# Smoke test — recipe mode
aorta triage run --recipe recipes/example-fsdp-smoke.yaml --dry-run
aorta triage run --recipe recipes/example-fsdp-smoke.yaml
# Full smoke matrix
aorta triage run --mode matrix --workload fsdp \
--mitigation-axis none,tf32_off,xnack \
--environment-axis local \
--trials 4 --steps 1000
# Discoverability
aorta triage --list-mitigations
aorta triage --list-environments
PR template
Title: B2: aorta triage run --mode matrix (recipe + flag modes)
Body: include sample matrix.md output, a small example recipe.yaml, confirm confound detection (synthetic +25% slowdown produces speed (+25%) flag), include matrix.json snippet, link to B3 PR (registries) and B1 PR (run_trials Python API).
Goal
Productize a contingency-table style triage matrix: for a given workload, run a fixed set of
(mitigation × environment)cells, count failures and step-times per cell, and produce amatrix.mdtable that flags cells where a mitigation may be working only via a speed confound.Driven by either a checked-in recipe file (primary) or ad-hoc flags (shim) — both modes converge on the same
run_recipe(recipe)execution path.Why this matters — speed-confound detection
Investigations into mitigation effectiveness routinely run into the same trap: a change that eliminates a numeric failure also slows down GPU execution. Without measuring step-time alongside pass/fail, the matrix lies about which mitigations "fix" the bug versus which ones merely delay it long enough to not observe.
Concrete pattern from prior internal RCAs:
The first reads "speed confound suspected; verify with profiler before drawing conclusions"; the second reads "trust this cell." Encoding this distinction in the matrix output is the core value-add over a hand-maintained spreadsheet.
The lesson:
cell_step_time / baseline_step_time > 1.15as a potential speed confound.Design shape — recipe primary, flags secondary
A triage matrix run is two pieces of information:
matrix.mdartifacts back to the originating ticket.A YAML/JSON recipe file captures both, lives in version control, and gives users a one-line invocation:
aorta triage run --recipe recipes/<name>.yaml. Flags are kept for ad-hoc one-shots and as an escape hatch — internally they construct an in-memory recipe and reuse the same execution path.Recipe names resolve through
aorta.registry(B3):mitigations: [tf32_off, xnack]is a list of registered names (built-in or contributed viaaorta.mitigationsentry-points). Same forenvironment: <name>. Unknown names raise B3'sUnknownMitigationError/UnknownEnvironmentError.Files to create / modify
Recipe schema
Authoritative shape (YAML; JSON accepted with the same keys):
Schema rules
schema_version— requiredint; current value1. Unknown values →RecipeSchemaError.ticket— optional string; format-free (caller's choice). Drives output dir; absence is allowed for ad-hoc runs.workload— required string; must resolve viaaorta.workloadsentry-point group (B1's existing dependency). Unknown →UnknownWorkloadErrorfrom B1's loader.trials/steps— required ints at top level; per-cell overrides allowed.confound.threshold— optional float, default1.15.confound.baseline_cell— optional string. Resolution order if absent: (1) first cell with name startingbaseline-; (2) first cell wheremitigations == ["none"]; (3) error if neither found and >1 cell exists.cells[*].name— required string, unique within the recipe; used as the row label inmatrix.md.cells[*].mitigations— requiredlist[str]; each name resolved throughaorta.registry.get_mitigation(). Empty list rejected (use["none"]for the explicit baseline). Multiple names → env-var bundles unioned in list order; collision within a cell raisesRecipeCellError.cells[*].environment— required. One of:aorta.registry.get_environment().{ docker: "<image-ref>" }— inline docker shorthand (Option A). Auto-named_inline_<hash>where<hash>is the first 8 chars ofblake2b(image-ref). Behaves identically to a named docker entry from that point on (per-environment probe,recipe.resolved.yamlrecords the auto-name + ref, output dir uses the auto-name). Intentionally noname:field — anything you'd want to name belongs in the registry. No other keys accepted.cells[*].extra_env: {KEY: VALUE, ...}. Applied AFTER mitigation env-vars in that cell, so it can override a registered mitigation's bundle for one-off experiments without polluting the registry. Logged inmatrix.jsonper cell so the audit trail is preserved.The loader normalizes the YAML/JSON into an in-memory
Recipedataclass (frozen). Validation happens once, at load time; the runner consumes the validated structure.CLI surface
Flag mode internally builds a
Recipewhosecellsis the full cartesian productmitigation_axis × environment_axis, with each cell named<mitigation>-<environment>. The runner does not branch on mode after that point — both paths converge on the samerun_recipe(recipe)function.--environment-axisitem parsing (Option B): each comma-separated item is parsed by prefix.image:<ref>becomes an inline-docker cell using the same{ docker: "<ref>" }path as Option A — auto-named_inline_<hash>, hash visible in the cell name (<mitigation>-_inline_<hash>) so multiple inline images on one axis are distinguishable. Anything without a recognized prefix is a registered environment name. No other prefixes are defined for MVP — the mental model is "registry name ORimage:, that's it."There is intentionally no
--dockersflag — environments are the abstraction (docker is one component of an environment alongside venv and ROCm version).Behavior
For each cell
(name, mitigations, environment, trials, steps, extra_env):RunRequest(B1's contract):workload = recipe.workloadmitigations = cell.mitigations(list of names — B1 resolves viaload_mitigations()and unions env vars)environment = cell.environment(single name — B1 resolves viaload_environments()to docker / venv / rocm)extra_env = cell.extra_env(passed through if set)trials = cell.trials(with per-cell override falling back to recipe top-level)steps = cell.stepsoutput_dir = <output_dir>/<ticket>/<workload>/<run-timestamp>/cells/<cell.name>/run_trials(RunRequest) -> list[TrialResult]in-process (one Python interpreter for the whole matrix). B1's dispatcher writes per-trial JSON toRunRequest.output_dir; B2 also receives the in-memory list for aggregation.passed_count/failed_count(failure = workload-defined: NaN, throw, etc.)nan_rate = failed_count / trialsmean_step_time_ms,std_step_time_ms,p50,p99mean_wall_clock_secerror: str | Noneif the whole cell failed (e.g., docker pull failure) — surface in matrix without aborting other cellsstep_time_ratiofor every non-baseline cell.matrix.json(full data) andmatrix.md(human-readable table) to the run-timestamp directory.Confound rules (Confound column overload)
A single Confound column carries one of:
(baseline)— the baseline cell.——step_time_ratio <= thresholdANDnan_rate < baseline_nan_rate(mitigation works without a speed cost; trust the cell).speed (+N%)—step_time_ratio > threshold(the mitigation may be suppressing failure via slower iteration).no effect—nan_rate >= baseline_nan_rateANDstep_time_ratio <= threshold(the mitigation neither moved the failure rate nor slowed iteration).error— the whole cell failed; row is preserved so the matrix is complete.Pre-registering kill criteria in the same column (rather than a separate Verdict column) keeps the matrix layout matching the source-of-truth manual table.
Implicit env probes (host + per-environment)
Per A1's contract,
aorta.instrumentation.environment.collect_env()is a library function the runner calls directly — never via shelling out toaorta env probe. The runner takes two snapshots beyond what B1 already embeds per trial:<run-timestamp>/host_env.json— sibling ofmatrix.md--environment-axisvalue (or uniquecells[*].environmentin recipe mode), immediately before that environment's first cell runs<run-timestamp>/environments/<env-name>/env.json— sibling ofcells/Host state is invariant across the matrix; per-environment state varies per env cell. Splitting them this way keeps
host_env.jsona single canonical "what was the box like when this matrix ran" record, whileenvironments/<env-name>/env.jsonis the file users reach for when a bug looks like ROCm-version drift between two environments.Per-trial env capture is unchanged — B1 still writes
EnvSnapshotinto eachtrial_<id>.json. The two new top-level snapshots are deduplication: the host snapshot would otherwise appear in every trial JSON unchanged; the per-environment snapshot would otherwise appear in every cell's trial JSONs unchanged.Probe failure never aborts the matrix.
collect_env()is contractually fail-soft (A1) — if dmesg is restricted or rdhc isn't installed, the snapshot lands withpartial: truepluspartial_reasons, and the runner continues. A top-of-file warning inmatrix.mdsurfaces the partial state.In-process execution (not subprocess)
Per-cell execution calls B1's
run_trials()as a Python function. Why this matters:torch/ entry-point-discovery cost per cell)list[TrialResult]objects in memory — no JSON parse round-triprun_trials()directly; no subprocess plumbingRANK == 0writesmatrix.{json,md}(same pattern B1 uses fortrial_<id>.json)This requires B1 to expose a clean Python entry-point (
run_trials(RunRequest) -> list[TrialResult]).Output layout
<ticket>/<workload>/ lets users see the full history of attempts on a given problem at a glance. recipe.resolved.yaml is the post-resolution snapshot: it embeds the actual env-var bundles, docker digests, etc., so re-running it on a different machine reproduces the exact same matrix even if the registries have changed since.
matrix.md target format
Acceptance criteria
Recipe path (primary)
aorta triage run --recipe recipes/example-fsdp-smoke.yaml validates the recipe and runs the matrix
aorta triage run --recipe <bad.yaml> --dry-run prints the resolved cell list and validation errors WITHOUT executing
…schema_version with a clear message
…aorta.registry; unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError
extra_env overrides registered mitigation env vars and is recorded in matrix.json for that cell
trials/steps overrides take effect; absence falls back to recipe top-level
recipe.resolved.yaml is written alongside matrix.md, with all registry names expanded to their underlying env-var bundles + docker refs
environment: { docker: "rocm/pytorch:nightly" } runs end-to-end without any registry registration. Auto-name _inline_<8-char-blake2b> appears in recipe.resolved.yaml, in environments/<auto-name>/env.json, and in cells/<cell-name>/. Two cells with the same docker ref get the same auto-name (deterministic). Schema rejects any extra keys in the mapping with a clear RecipeSchemaError.
Flag path (secondary)
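Both paths depend on a deterministic auto-name for inline docker refs, and the flag path additionally has to tell image: items apart from bare registry names. The sketch below shows one way this could work. It assumes the 8 characters are the hex form of a 4-byte BLAKE2b digest of the UTF-8 docker ref; the spec only says "8-char-blake2b", so the exact hash input and truncation are assumptions, and both function names are hypothetical.

```python
import hashlib
from typing import Any, Dict, List


def inline_env_name(docker_ref: str) -> str:
    """Deterministic auto-name for an inline docker ref (assumed scheme)."""
    # digest_size=4 yields 4 bytes, i.e. 8 hex characters.
    digest = hashlib.blake2b(docker_ref.encode("utf-8"), digest_size=4).hexdigest()
    return f"_inline_{digest}"


def parse_environment_axis(value: str) -> List[Dict[str, Any]]:
    """Split an --environment-axis value into registry names and inline refs."""
    items: List[Dict[str, Any]] = []
    for raw in value.split(","):
        item = raw.strip()
        if item.startswith("image:"):
            # Inline refs bypass registry lookup and get an auto-name.
            ref = item[len("image:"):]
            items.append({"name": inline_env_name(ref), "docker": ref, "inline": True})
        else:
            # Bare names are left for the registry to resolve (or reject).
            items.append({"name": item, "inline": False})
    return items


if __name__ == "__main__":
    for entry in parse_environment_axis("local,image:rocm/pytorch:nightly"):
        print(entry)
```

Because the name depends only on the docker ref, a recipe cell and a flag-mode image: item that point at the same ref resolve to the same directory names.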
aorta triage run --mode matrix --workload fsdp --mitigation-axis none,tf32_off,xnack --environment-axis local --trials 8 produces a matrix.md with one row per (mitigation × environment) combination
…Recipe and reuses the same execution path (verified by mocking run_recipe and checking it's called once with a fully-formed Recipe)
--ticket SOME-TICKET flag groups output under triage_results/SOME-TICKET/...; absence routes to triage_results/_no_ticket_/...
--environment-axis local,image:rocm/pytorch:nightly runs as two cells per mitigation; the inline cell uses the same _inline_<hash> auto-name as Option A would for the same ref (verified by parsing recipe.resolved.yaml). Bare names continue to resolve via the registry. An unknown bare name still raises UnknownEnvironmentError; image: items never go through registry lookup.
Output layout
<output-dir>/<ticket>/<workload>/<run-timestamp>/{matrix.md,matrix.json,recipe.resolved.yaml,host_env.json,environments/,cells/}
cells/<cell.name>/trial_*.json matches B1's per-trial JSON output (B1 writes them; B2 just points B1 there via RunRequest.output_dir)
Implicit env probes
host_env.json exists in the run-timestamp dir, contains a valid EnvSnapshot (per A1 schema), and is written exactly once per aorta triage run invocation regardless of cell count. Verified by mocking collect_env and asserting it's called exactly once for host scope.
…environments/<env-name>/env.json exists for each of the 3 environments, written before that environment's first cell runs. collect_env is called exactly 3 times in env scope (not 12).
…collect_env from aorta.instrumentation.environment. Verified by grep: no subprocess.run([...,"aorta","env","probe",...]) anywhere under src/aorta/triage/.
When collect_env() returns partial=True, the snapshot is persisted as-is, the matrix run continues, and matrix.md includes a top-of-file warning naming which probe scope was partial. Verified by a test that monkeypatches collect_env to return a partial snapshot and asserts (a) matrix.md still writes, (b) the warning text appears, (c) all cells execute.
Matrix correctness
…speed (+25%) in that cell's Confound column
…step_time_ratio == 1.0 (no slowdown) and nan_rate < baseline shows —
no effect overload: a cell with nan_rate >= baseline_nan_rate AND step_time_ratio <= threshold shows no effect in the same Confound column (no separate Verdict column)
matrix.json has full per-cell data: trial JSON paths, raw step times, aggregated stats, resolved env vars, environment descriptor
Resilience
…7 / 8 not 0 / 8)
…error in matrix.md, others still run, baseline detection still works as long as the baseline cell itself succeeded
…step_time_ratio columns showing n/a and a top-of-file warning; the run does NOT abort silently
Plumbing
…run_trials(RunRequest) as a Python function; does NOT shell out via subprocess.run(["aorta", "run", ...]). Verified by: (a) no subprocess import in src/aorta/triage/; (b) running the smoke matrix produces no extra child Python processes (one process for the whole matrix).
aorta triage --list-mitigations and --list-environments delegate to B3's resolver and tag each row with source_package so users see which entries come from aorta vs which plugin.
…run_trials and assert it's called once per cell with the expected RunRequest)
Out of scope (P1+)
--mode optimize (Optuna-driven mitigation-stack search)
…aorta triage generate-protocol)
…aorta triage matrix --import slack-message.txt)
aorta triage matrix <dir> re-analysis mode (re-classify cells from existing trial JSONs without re-running)
…--shard i/N, intra-node multi-GPU fan-out, node-level scheduler): the MVP runner is a sequential nested loop. Don't bake parallelism into B2.
…schema_version reject: once schema_version 2 exists we'll add a migrator.
How to test
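One way to exercise the plumbing criterion (the trial function is called once per cell) is a mock-based unit test. The sketch below uses unittest.mock against a hypothetical stand-in loop, run_cells; the real runner, its module path, and the RunRequest type are not shown in this spec, so cells are represented here as plain tuples.

```python
from unittest import mock


def run_cells(cells, run_trials):
    """Hypothetical stand-in for the matrix runner's per-cell loop."""
    return {cell: run_trials(cell) for cell in cells}


def test_run_trials_called_once_per_cell():
    cells = [("none", "local"), ("tf32_off", "local"), ("xnack", "local")]
    fake_run_trials = mock.Mock(return_value=[])

    results = run_cells(cells, fake_run_trials)

    # One call per cell, in order, each with that cell as the only argument.
    assert fake_run_trials.call_count == len(cells)
    fake_run_trials.assert_has_calls([mock.call(c) for c in cells])
    assert set(results) == set(cells)


if __name__ == "__main__":
    test_run_trials_called_once_per_cell()
    print("ok")
```

In a real test the mock would replace B1's run_trials at its import site and the assertion would compare against the expected RunRequest for each cell.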
PR template
Title: B2: aorta triage run --mode matrix (recipe + flag modes)
Body: include sample matrix.md output, a small example recipe.yaml, confirm confound detection (synthetic +25% slowdown produces speed (+25%) flag), include matrix.json snippet, link to B3 PR (registries) and B1 PR (run_trials Python API).