Summary
Add aorta reduce: a customer-side, guided loop that takes a failing
workload and iteratively shrinks it along customer-declared axes until
the workload is small enough to share with AMD while still tripping the
same failure. The output is a minimal reproducer bundle (script +
env.json + recipe) that drops cleanly into the existing
aorta triage run / aorta bundle flow.
Today the only path from "customer hits a bug in a 100k-line training
script" to "AMD has a runnable reproducer" is: customer pastes the full
script into a private workload directory and AMD wraps it. That step is
slow, NDA-heavy, and most of the script is dead weight relative to the
bug. aorta reduce is the missing upstream step that lets customers do
the shrinking themselves, on their hardware, before anything leaves
their site.
Why a new command (not a flag on triage or run)
- aorta triage --mode optimize (roadmap, P2) searches for the
minimal mitigation stack that makes a bug go away on a fixed
workload. Runs on AMD's side, after a repro exists.
- aorta reduce runs on the customer's side, treats the workload
as the variable and the bug as the invariant, and outputs a smaller
workload. Different actor, different success criterion, different
artifact.
Overloading triage or run would hide that asymmetry. A dedicated
verb keeps the contract honest.
Proposed UX
Customer authors a reduction spec next to their script:
    # aorta.yaml
    workload:
      command: "bash run_repro.sh"
      cwd: ./
    oracle:
      kind: regex          # regex | exit_code | python
      pattern: "(?m)NaN detected at step"
      stream: stdout
    budget:
      trials_per_candidate: 3
      k_of_n_to_count_as_reproducing: 2
      max_wall_clock: 4h
    axes:
      - name: NUM_STEPS
        kind: env
        type: int
        initial: 5000
        floor: 50
        strategy: binary
        monotone: true
      - name: BATCH_SIZE
        kind: env
        type: int
        initial: 2048
        floor: 16
        strategy: binary
      - name: SHAPES_FILE
        kind: file_subset
        initial: shapes.txt
        strategy: dd-min
Then run the reduction and bundle the result:
    aorta reduce --spec aorta.yaml --output reduce/MY-TICKET
    aorta bundle reduce/MY-TICKET/<timestamp>
For wrapped workloads (aorta.workloads entry point), axes can be
declared in Python on the Workload subclass and --spec becomes
optional. This is the M2 path; M1 ships the generic-oracle flow only.
Oracle contract
"Still reproduces?" must be answerable without leaking customer data.
Three kinds:
- exit_code: any non-zero exit status is "reproduces" (or a configurable set).
- regex: match against stdout/stderr.
- python: for registered Workload plugins, reuses
WorkloadResult.failure_count > 0 (or a workload-defined predicate
over WorkloadResult.failure_details).
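The oracle returns a single bool per trial. The per-candidate verdict is
k-of-n: a candidate counts as reproducing only if the oracle fires in at
least k of its n trials. This filters out intermittent hits and avoids
reducing past the point where the bug stops being reliably observable.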
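A minimal Python sketch of the exit_code and regex oracles and the k-of-n verdict described above. The function names (`exit_code_oracle`, `regex_oracle`, `candidate_reproduces`) are hypothetical and not part of any existing aorta API; a trial is modeled as a `subprocess.CompletedProcess` with captured text output.

```python
from __future__ import annotations

import re
import subprocess
from typing import Callable, Optional

# One trial's observable result: exit status plus captured stdout/stderr text.
Trial = subprocess.CompletedProcess
Oracle = Callable[[Trial], bool]


def exit_code_oracle(codes: Optional[frozenset] = None) -> Oracle:
    """Reproduces on any non-zero exit status, or only on `codes` if given."""
    def check(trial: Trial) -> bool:
        if codes is None:
            return trial.returncode != 0
        return trial.returncode in codes
    return check


def regex_oracle(pattern: str, stream: str = "stdout") -> Oracle:
    """Reproduces when `pattern` matches the chosen stream ("stdout" or "stderr")."""
    compiled = re.compile(pattern)

    def check(trial: Trial) -> bool:
        text = getattr(trial, stream) or ""
        return compiled.search(text) is not None
    return check


def candidate_reproduces(run_trial: Callable[[], Trial], oracle: Oracle,
                         trials: int, k: int) -> bool:
    """k-of-n verdict: True if the oracle fires in at least k of `trials` runs.

    Stops early once the verdict can no longer change, which saves trials.
    """
    hits = 0
    for done in range(1, trials + 1):
        if oracle(run_trial()):
            hits += 1
        if hits >= k:
            return True                      # enough hits already
        if hits + (trials - done) < k:
            return False                     # k is out of reach
    return False
```

In a real run, `run_trial` would launch the workload command with `subprocess.run(..., capture_output=True, text=True)`. The early exit is a design choice of this sketch: with 2-of-3, two misses in a row already decide the candidate, so the third trial is skipped.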
Reduction axes
Open-ended; the spec author declares them. Suggested starter set:
- numeric env vars (NUM_STEPS, BATCH_SIZE, SEQ_LEN, model dim, ...)
- world size / rank count
- file subsets (shapes lists, dataset shards) via DD-min over lines
- step ranges (skip first K, run only middle window)
Each axis declares: type, initial value, floor, search strategy
(binary | linear | dd-min), and an optional monotonicity hint
that lets the search assume "smaller is at-most-as-interesting": if a
value fails to reproduce, no smaller value will reproduce either.
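For the file-subset axis, the DD-min strategy can be sketched as below. This is a simplified version of the classic ddmin algorithm over a list of lines, not aorta's implementation; `ddmin` and the `reproduces` callback are hypothetical names, and the callback is assumed to already wrap the k-of-n verdict (in practice it would write the candidate lines to a file and run the trials).

```python
from __future__ import annotations

from typing import Callable, List, Sequence


def ddmin(items: Sequence[str],
          reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Return a 1-minimal sublist of `items` that still reproduces.

    1-minimal means: removing any single remaining item stops the
    failure. Item order is preserved.
    """
    current = list(items)
    if not reproduces(current):
        raise ValueError("the full input must reproduce before reducing")

    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        chunk = (len(current) + n - 1) // n          # ceil(len / n)
        subsets = [current[i:i + chunk]
                   for i in range(0, len(current), chunk)]
        reduced = False

        # 1) Try each chunk on its own.
        for subset in subsets:
            if reproduces(subset):
                current, n, reduced = subset, 2, True
                break

        # 2) Otherwise try dropping one chunk at a time.
        if not reduced:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets)
                              if j != i for x in s]
                if reproduces(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # 3) Nothing shrank: refine granularity, or stop at single items.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current
```

For example, with 16 shape lines of which only `shape_5` and `shape_11` are needed together, `ddmin` returns exactly those two lines. Unlike binary search, this makes no monotonicity assumption, at the price of more oracle calls.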
Search loop (M1)
- For each axis in priority order, run binary search to the floor,
accepting only candidates where the oracle fires k-of-n trials.
- After a 1D pass over all axes, run DD-min style fixed-point over the
axis set (re-shrink each axis given the others' new values) until
no axis moves.
- Per candidate: capture env.json via collect_env(), record
accept/reject + oracle outcome to a reduction_trace.jsonl.
Wall-clock and trial budgets bound the search; partial results are
always written.
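The loop above can be sketched as follows, assuming integer env axes that are monotone and a `reproduces` callback that already applies the k-of-n verdict. `Axis`, `shrink_axis`, and `reduce_workload` are hypothetical names; wall-clock and trial budgets and the per-candidate collect_env() capture are left out to keep the sketch short.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Values = Dict[str, int]


@dataclass
class Axis:
    name: str
    initial: int   # known-reproducing starting value
    floor: int     # never shrink below this


def shrink_axis(axis: Axis, values: Values,
                reproduces: Callable[[Values], bool],
                trace: List[dict]) -> int:
    """Binary-search the smallest reproducing value of one axis.

    Invariant: `hi` always reproduces. Assumes the axis is monotone.
    """
    lo, hi = axis.floor, values[axis.name]
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = {**values, axis.name: mid}
        accepted = reproduces(candidate)
        trace.append({"axis": axis.name, "candidate": candidate,
                      "accepted": accepted})
        if accepted:
            hi = mid          # mid still reproduces, keep shrinking
        else:
            lo = mid + 1      # mid lost the bug, back off
    return hi


def reduce_workload(axes: List[Axis],
                    reproduces: Callable[[Values], bool]
                    ) -> Tuple[Values, List[dict]]:
    """1D pass per axis, repeated until no axis moves (fixed point)."""
    values = {a.name: a.initial for a in axes}
    trace: List[dict] = []
    moved = True
    while moved:
        moved = False
        for axis in axes:
            best = shrink_axis(axis, values, reproduces, trace)
            if best < values[axis.name]:
                values[axis.name] = best
                moved = True
    return values, trace


def trace_to_jsonl(trace: List[dict]) -> str:
    """Serialize trace records one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in trace)
```

With a stand-in oracle that fires only when `NUM_STEPS >= 120` and `BATCH_SIZE >= 64`, starting from 5000 and 2048 with floors 50 and 16, the loop settles on exactly 120 and 64. The final pass re-probes each axis and finds nothing to shrink, which is what terminates the fixed-point iteration.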
Output bundle
    reduce/<ticket>/<timestamp>/
      spec.yaml                # the input spec
      reduction_trace.jsonl    # every candidate, accepted or not
      shrunken/
        run_repro.sh           # final script with shrunken axis values baked in
        aorta.yaml             # final values for handoff to triage
        env.json               # pinned env at the shrunken point
        recipe.resolved.yaml   # ready-to-run by `aorta triage`
      SHIP.md                  # checklist of what's being shared + scrub status
The bundle is consumable by the existing / planned aorta triage run
and aorta bundle commands without modification — reduce is purely
additive.
Confidentiality affordances
- --scrub PATH_PATTERN[,PATTERN...] redacts matching strings from
captured logs before they enter the bundle.
- --dry-run runs the loop without persisting any bundle artifacts
(e.g. for a customer's first try).
- SHIP.md is the explicit consent checklist; nothing leaves the
customer's site automatically.
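A small sketch of how scrubbing could work on captured log lines. It assumes the --scrub patterns are regular expressions separated by commas, with no commas inside a pattern; the proposal does not fix the pattern syntax, so that is an assumption, as are the names `compile_scrubbers`, `scrub_line`, and the `[REDACTED]` placeholder.

```python
from __future__ import annotations

import re
from typing import Iterable, List

REDACTED = "[REDACTED]"  # hypothetical placeholder text


def compile_scrubbers(spec: str) -> List[re.Pattern]:
    """Split a comma-separated --scrub value into compiled regexes."""
    return [re.compile(part) for part in spec.split(",") if part]


def scrub_line(line: str, scrubbers: Iterable[re.Pattern]) -> str:
    """Replace every match of every pattern with the placeholder."""
    for pattern in scrubbers:
        line = pattern.sub(REDACTED, line)
    return line


def scrub_lines(lines: Iterable[str],
                scrubbers: List[re.Pattern]) -> List[str]:
    """Scrub a whole log (or serialized trace) before it enters the bundle."""
    return [scrub_line(line, scrubbers) for line in lines]
```

For instance, scrubbing `loading /home/alice/ckpt on acme-prod7` with the spec `/home/[\w.-]+,acme-\w+` yields `loading [REDACTED]/ckpt on [REDACTED]`. Applying the same function to each serialized line of reduction_trace.jsonl would cover the trace as well as the logs.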
Non-goals
- Not a fuzzer; does not synthesize inputs the customer didn't declare.
- Not a source-level delta-debugger; does not edit script bodies (no
AST shrinking). Reduction is over declared axes only.
- Does not replace aorta triage; it produces the artifact triage
consumes.
- Does not modify any source/ tree of a registered private workload.
Milestones
- M1: Generic-oracle path. command + regex/exit_code oracle.
Numeric env axes only (int/float). Binary search per axis.
Bundle output. CI demo: shrink a recom_repro-style script's
NUM_STEPS from 5000 -> minimum reproducing value.
- M2: Workload-declared axes (typed, validated). DD-min over
multiple axes. File-subset axis (shapes / shard files).
- M3: Distributed reduction (world size, rank subset) via
torchrun-aware launcher; multi-rank oracle aggregation.
Open design questions
- Where does the spec live? Standalone aorta.yaml (proposed) vs.
reusing the existing triage recipe schema with a new reduce:
section. Standalone is simpler; recipe-extension keeps one source
of truth.
- Should the oracle have access to step-time data? Some perf-flavored
bugs only reproduce under load. Adding a min_step_time_ms gate
would catch that but conflates correctness and perf.
- Search strategy default per axis: binary is right for monotone
numeric axes, wrong for non-monotone ones (e.g. layer-index subset
where only layer 17 is load-bearing). DD-min covers both at higher
cost. Default to which?
- Interaction with mitigations: should reduce honor an active
mitigation set during the search, or always reduce against the
bare baseline? Probably honor, since the bug-under-test may
only manifest with the customer's prod env. Worth confirming.
- Failure mode when nothing shrinks: emit a "minimal" bundle
equal to the input? Refuse to write a bundle? Suggest enabling
more axes? UX call.
Related roadmap items (aorta-internal/README.md)
- aorta bundle (Planned P1): consumes reduce/ output.
- aorta diverge (Planned P1): runs on the shrunken bundle.
- aorta triage --mode optimize (Planned P2): distinct, opposite
direction; this issue does not change its scope.
- tracelens_proxy (Planned P1, public): different no-source path;
reduce is for cases where source exists and can run.
Acceptance criteria
- aorta reduce --spec aorta.yaml --output <dir> shrinks a known
reproducer's NUM_STEPS to the minimum reproducing value with
bounded trial count.
- Output bundle is consumed by aorta triage run --recipe without
manual editing.
- env.json from the final accepted candidate is byte-identical
(modulo timestamps) to one produced by aorta env probe against
the same shrunken script.
- Scrub patterns are applied to reduction_trace.jsonl and any
captured logs in the bundle.
- Dry-run produces no on-disk artifacts.
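The env.json criterion could be checked in CI along these lines. This is only a sketch: the env.json schema is not specified here, so the timestamp key names in `VOLATILE_KEYS` are hypothetical, and the comparison normalizes key order before comparing bytes, which is slightly looser than strict byte identity of the files on disk.

```python
from __future__ import annotations

import json

# Hypothetical names for timestamp-like fields; adjust to the real schema.
VOLATILE_KEYS = frozenset({"timestamp", "collected_at"})


def strip_volatile(obj):
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in obj.items()
                if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [strip_volatile(v) for v in obj]
    return obj


def canonical_env_bytes(text: str) -> bytes:
    """Canonical byte form of an env.json document, minus timestamps."""
    cleaned = strip_volatile(json.loads(text))
    return json.dumps(cleaned, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def same_env(reduce_env: str, probe_env: str) -> bool:
    """True if the two env.json documents agree modulo timestamps."""
    return canonical_env_bytes(reduce_env) == canonical_env_bytes(probe_env)
```

A test would read the env.json written for the final accepted candidate and the one produced by aorta env probe, then assert `same_env(...)` on their contents.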