Proposal: aorta reduce — customer-side guided workload reduction to produce a shippable minimal reproducer #192

@amd-vivekag

Summary

Add aorta reduce: a customer-side, guided loop that takes a failing
workload and iteratively shrinks it along customer-declared axes until
the workload is small enough to share with AMD while still tripping the
same failure. The output is a minimal reproducer bundle (script +
env.json + recipe) that drops cleanly into the existing
aorta triage run / aorta bundle flow.

Today the only path from "customer hits a bug in a 100k-line training
script" to "AMD has a runnable reproducer" is: customer pastes the full
script into a private workload directory and AMD wraps it. That step is
slow, NDA-heavy, and most of the script is dead weight relative to the
bug. aorta reduce is the missing upstream step that lets customers do
the shrinking themselves, on their hardware, before anything leaves
their site.

Why a new command (not a flag on triage or run)

  • aorta triage --mode optimize (roadmap, P2) searches for the
    minimal mitigation stack that makes a bug go away on a fixed
    workload. Runs on AMD's side, after a repro exists.
  • aorta reduce runs on the customer's side, treats the workload
    as the variable and the bug as the invariant, and outputs a smaller
    workload. Different actor, different success criterion, different
    artifact.

Overloading triage or run would hide that asymmetry. A dedicated
verb keeps the contract honest.

Proposed UX

Customer authors a reduction spec next to their script:

# aorta.yaml
workload:
  command: "bash run_repro.sh"
  cwd: ./
oracle:
  kind: regex                       # regex | exit_code | python
  pattern: "(?m)NaN detected at step"
  stream: stdout
budget:
  trials_per_candidate: 3
  k_of_n_to_count_as_reproducing: 2
  max_wall_clock: 4h
axes:
  - name: NUM_STEPS
    kind: env
    type: int
    initial: 5000
    floor: 50
    strategy: binary
    monotone: true
  - name: BATCH_SIZE
    kind: env
    type: int
    initial: 2048
    floor: 16
    strategy: binary
  - name: SHAPES_FILE
    kind: file_subset
    initial: shapes.txt
    strategy: dd-min

Then run the reduction and bundle the result:

aorta reduce --spec aorta.yaml --output reduce/MY-TICKET
aorta bundle reduce/MY-TICKET/<timestamp>

For wrapped workloads (aorta.workloads entry point), axes can be
declared in Python on the Workload subclass and --spec becomes
optional. This is the M2 path; M1 ships the generic-oracle flow only.
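
As a rough illustration of what Python-declared axes might look like, the sketch below mirrors the aorta.yaml keys in a small dataclass. Everything here is hypothetical: the Axis class, the reduce_axes attribute, and the stand-in Workload base class are assumptions for illustration, not the existing aorta.workloads API.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class Axis:
    """One reduction axis; field names mirror the aorta.yaml keys (hypothetical)."""
    name: str
    kind: str                      # "env" | "file_subset"
    initial: object
    type: str = "int"              # assumed default, not specified in the proposal
    floor: Optional[int] = None
    strategy: str = "binary"       # binary | linear | dd-min
    monotone: bool = False

    def __post_init__(self):
        # Validate eagerly so a bad declaration fails at import time.
        if self.strategy not in {"binary", "linear", "dd-min"}:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if self.floor is not None and self.floor > self.initial:
            raise ValueError(f"{self.name}: floor exceeds initial value")


class Workload:
    """Stand-in for the real Workload base class, which this issue does not show."""
    reduce_axes: ClassVar[tuple] = ()


class MyTrainingWorkload(Workload):
    # Same three axes as the aorta.yaml example above.
    reduce_axes = (
        Axis("NUM_STEPS", kind="env", initial=5000, floor=50, monotone=True),
        Axis("BATCH_SIZE", kind="env", initial=2048, floor=16),
        Axis("SHAPES_FILE", kind="file_subset", initial="shapes.txt",
             strategy="dd-min"),
    )
```

Declaring axes as frozen, validated values would let aorta reduce reject a malformed declaration before any trial runs, which is presumably the point of "typed, validated" in M2.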

Oracle contract

"Still reproduces?" must be answerable without leaking customer data.
Three kinds:

  1. exit_code — any non-zero is "reproduces" (or a configurable set).
  2. regex — match against stdout/stderr.
  3. python — for registered Workload plugins, reuses
    WorkloadResult.failure_count > 0 (or a workload-defined predicate
    over WorkloadResult.failure_details).

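The oracle output is a single bool per trial. The per-candidate verdict
is k-of-n: a candidate counts as reproducing only if the oracle fires in
at least k of its n trials. This filters out intermittent hits and avoids
reducing past the point where the bug stops being reliably observable.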
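
A minimal sketch of the exit_code and regex oracles plus the k-of-n verdict, assuming the workload is launched as a subprocess. The function names (run_trial, make_regex_oracle, reproduces) are illustrative and not part of aorta; the early exit once the verdict is decided is an optimization I have assumed, not something the proposal specifies.

```python
import re
import shlex
import subprocess


def run_trial(command, env=None):
    """Run the workload once; return (exit_code, stdout, stderr)."""
    proc = subprocess.run(shlex.split(command), env=env,
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def exit_code_oracle(exit_code, stdout, stderr):
    """Any non-zero exit code counts as 'reproduces'."""
    return exit_code != 0


def make_regex_oracle(pattern, stream="stdout"):
    """Build an oracle that fires when pattern matches the chosen stream."""
    compiled = re.compile(pattern)

    def oracle(exit_code, stdout, stderr):
        text = stdout if stream == "stdout" else stderr
        return compiled.search(text) is not None

    return oracle


def reproduces(command, oracle, trials=3, k=2, run=run_trial):
    """k-of-n verdict: True if the oracle fires in at least k of `trials` runs."""
    hits = 0
    for done in range(1, trials + 1):
        if oracle(*run(command)):
            hits += 1
        if hits >= k:
            return True               # verdict already decided
        if hits + (trials - done) < k:
            return False              # k is no longer reachable
    return False
```

With the spec above this would be called roughly as reproduces("bash run_repro.sh", make_regex_oracle("(?m)NaN detected at step"), trials=3, k=2). Only the boolean leaves the oracle, so log contents never need to enter the verdict.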

Reduction axes

Open-ended; the spec author declares them. Suggested starter set:

  • numeric env vars (NUM_STEPS, BATCH_SIZE, SEQ_LEN, model dim, ...)
  • world size / rank count
  • file subsets (shapes lists, dataset shards) via DD-min over lines
  • step ranges (skip first K, run only middle window)

Each axis declares: type, initial value, floor, search strategy
(binary | linear | dd-min), and an optional monotonicity hint
that lets the search assume "smaller is at-most-as-interesting," i.e.
once a value fails to reproduce, no smaller value will reproduce either.
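
For the file-subset axis, the dd-min strategy could look like the sketch below: a simplified delta-debugging minimizer over a list of lines. It is an assumption about how the axis might be implemented, not code from aorta. The test callable stands in for "write these lines to the shapes file, run the workload, apply the k-of-n oracle".

```python
def ddmin(items, test):
    """Return a 1-minimal sublist of `items` for which test() is still True.

    1-minimal means removing any single remaining item makes test() False.
    """
    if not test(items):
        raise ValueError("the full input must reproduce before reduction")
    n = 2                                   # current number of chunks
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except chunk i, in original order.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if test(subset):                # one chunk alone still reproduces
                items, n, reduced = subset, 2, True
                break
            if test(complement):            # chunk i was not needed
                items, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(items):             # already at single-item granularity
                break
            n = min(len(items), n * 2)      # retry with finer chunks
    return items


# Illustrative use: only two of twenty shape lines are needed to trip the bug.
lines = [f"shape_{i}" for i in range(20)]
needed = ddmin(lines, lambda s: "shape_7" in s and "shape_13" in s)
# needed == ["shape_7", "shape_13"]
```

Unlike binary search, this makes no monotonicity assumption, at the price of more oracle calls per axis.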

Search loop (M1)

  1. For each axis in priority order, run binary search to the floor,
    accepting only candidates where the oracle fires k-of-n trials.
  2. After a 1D pass over all axes, run DD-min style fixed-point over the
    axis set (re-shrink each axis given the others' new values) until
    no axis moves.
  3. Per candidate: capture env.json via collect_env(), record
    accept/reject + oracle outcome to a reduction_trace.jsonl.

Wall-clock and trial budgets bound the search; if a budget runs out,
the partial results gathered so far are still written.
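
Steps 1 and 2 of the loop can be sketched as follows for numeric axes, assuming each axis is monotone (if a value reproduces, every larger value does too). The names shrink_axis and reduce_workload are hypothetical, reproduces stands in for the k-of-n oracle applied to one candidate, and budget handling and trace writing are left out for brevity.

```python
def shrink_axis(values, name, floor, reproduces):
    """Binary-search the smallest value of one axis that still reproduces.

    Holds the other axes fixed at their current values. Assumes the current
    value of `name` is known to reproduce and that the axis is monotone.
    """
    lo, hi = floor, values[name]
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces({**values, name: mid}):
            hi = mid                  # mid still trips the bug: go smaller
        else:
            lo = mid + 1              # mid lost the bug: answer is above it
    return hi


def reduce_workload(initial, floors, reproduces, max_rounds=10):
    """Shrink each axis in turn, then repeat until no axis moves."""
    values = dict(initial)
    if not reproduces(values):
        raise ValueError("the initial workload does not reproduce")
    for _ in range(max_rounds):
        moved = False
        for name in list(values):     # dict order doubles as priority order
            new_value = shrink_axis(values, name, floors[name], reproduces)
            if new_value != values[name]:
                values[name] = new_value
                moved = True
        if not moved:                 # fixed point: nothing shrank this round
            break
    return values


# Illustrative bug that needs at least 137 steps and a batch of at least 64.
result = reduce_workload(
    {"NUM_STEPS": 5000, "BATCH_SIZE": 2048},
    {"NUM_STEPS": 50, "BATCH_SIZE": 16},
    lambda v: v["NUM_STEPS"] >= 137 and v["BATCH_SIZE"] >= 64,
)
# result == {"NUM_STEPS": 137, "BATCH_SIZE": 64}
```

The second round exists because shrinking one axis can change how far another can go; in this toy case the axes are independent, so the fixed point is reached after one extra confirming pass.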

Output bundle

reduce/<ticket>/<timestamp>/
  spec.yaml                 # the input spec
  reduction_trace.jsonl     # every candidate, accepted or not
  shrunken/
    run_repro.sh            # final script with shrunken axis values baked in
    aorta.yaml              # final values for handoff to triage
    env.json                # pinned env at the shrunken point
    recipe.resolved.yaml    # ready-to-run by `aorta triage`
  SHIP.md                   # checklist of what's being shared + scrub status

The bundle is consumable by the existing / planned aorta triage run
and aorta bundle commands without modification — reduce is purely
additive.
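
One possible shape for reduction_trace.jsonl is one JSON object per candidate, appended as the search proceeds so that an interrupted run still leaves a readable trace. The record fields below (candidate, trial_outcomes, accepted) are assumptions for illustration; the proposal does not fix a schema.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class TraceRecord:
    """One line of the trace (hypothetical schema)."""
    candidate: dict        # axis name -> value tried
    trial_outcomes: list   # one bool per trial, straight from the oracle
    accepted: bool         # k-of-n verdict for this candidate


def append_record(fp, record):
    """Write one record as a single JSON line."""
    fp.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def read_trace(fp):
    """Parse a JSONL trace back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Illustrative round trip through an in-memory file.
buf = io.StringIO()
append_record(buf, TraceRecord({"NUM_STEPS": 2525}, [True, True], True))
append_record(buf, TraceRecord({"NUM_STEPS": 1287}, [False, False], False))
buf.seek(0)
trace = read_trace(buf)
```

Because each line only holds axis values and booleans, the trace itself carries no log text; captured logs would still need the scrub pass described under the confidentiality affordances.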

Confidentiality affordances

  • --scrub PATH_PATTERN[,PATTERN...] redacts matching strings from
    captured logs before they enter the bundle.
  • --dry-run runs the loop without persisting any bundle artifacts
    (e.g. for a customer's first try).
  • The SHIP.md is the explicit consent checklist; nothing leaves the
    customer's site automatically.
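
A minimal sketch of the scrub step, assuming the patterns passed to --scrub are regular expressions (the proposal does not say whether they are regexes or globs). The helper names and the "[REDACTED]" placeholder are illustrative.

```python
import re


def parse_scrub_arg(arg):
    """Split 'PATTERN[,PATTERN...]' into a list of patterns.

    Naive split: a pattern that itself contains a comma would need escaping,
    which this sketch does not handle.
    """
    return [p for p in arg.split(",") if p]


def scrub(text, patterns, placeholder="[REDACTED]"):
    """Replace every match of every pattern with the placeholder."""
    for pattern in patterns:
        text = re.sub(pattern, placeholder, text)
    return text


# Illustrative log line with a private path and an internal hostname.
line = "loading /data/acme/shard-003.bin on host gpu-17.acme.internal"
clean = scrub(line, parse_scrub_arg(r"/data/\S+,\S+\.acme\.internal"))
# clean == "loading [REDACTED] on host [REDACTED]"
```

Applying the same function to every captured log and to reduction_trace.jsonl before they are copied into the bundle keeps the redaction in one place.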

Non-goals

  • Not a fuzzer; does not synthesize inputs the customer didn't declare.
  • Not a source-level delta-debugger; does not edit script bodies (no
    AST shrinking). Reduction is over declared axes only.
  • Does not replace aorta triage; it produces the artifact triage
    consumes.
  • Does not modify any source/ tree of a registered private workload.

Milestones

  • M1 — Generic-oracle path. command + regex/exit_code oracle.
    Numeric env axes only (int/float). Binary search per axis.
    Bundle output. CI demo: shrink recom_repro-style script's
    NUM_STEPS from 5000 -> minimum reproducing value.
  • M2 — Workload-declared axes (typed, validated). DD-min over
    multiple axes. File-subset axis (shapes / shard files).
  • M3 — Distributed reduction (world size, rank subset) via
    torchrun-aware launcher; multi-rank oracle aggregation.

Open design questions

  1. Where does the spec live? Standalone aorta.yaml (proposed) vs
    reuse the existing triage recipe schema with a new reduce:
    section. Standalone is simpler; recipe-extension keeps one source
    of truth.
  2. Should the oracle have access to step-time data? Some perf-flavored
    bugs only reproduce under load. Adding a min_step_time_ms gate
    would catch that but conflates correctness and perf.
  3. Search strategy default per axis — binary is right for monotone
    numeric axes, wrong for non-monotone ones (e.g. layer-index subset
    where only layer 17 is load-bearing). DD-min covers both at higher
    cost. Default to which?
  4. Interaction with mitigations — should reduce honor an active
    mitigation set during the search, or always reduce against the
    bare baseline? Probably honor, since the bug-under-test may
    only manifest with the customer's prod env. Worth confirming.
  5. Failure mode when nothing shrinks — emit a "minimal" bundle
    equal to the input? Refuse to write a bundle? Suggest enabling
    more axes? UX call.

Related roadmap items (aorta-internal/README.md)

  • aorta bundle (Planned P1) — consumes reduce/ output.
  • aorta diverge (Planned P1) — runs on the shrunken bundle.
  • aorta triage --mode optimize (Planned P2) — distinct, opposite
    direction; this issue does not change its scope.
  • tracelens_proxy (Planned P1, public) — different no-source path;
    reduce is for cases where source exists and can run.

Acceptance criteria

  • aorta reduce --spec aorta.yaml --output <dir> shrinks a known
    reproducer's NUM_STEPS to the minimum reproducing value with
    bounded trial count.
  • Output bundle is consumed by aorta triage run --recipe without
    manual editing.
  • env.json from the final accepted candidate is byte-identical
    (modulo timestamps) to one produced by aorta env probe against
    the same shrunken script.
  • Scrub patterns are applied to reduction_trace.jsonl and any
    captured logs in the bundle.
  • Dry-run produces no on-disk artifacts.
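
The "byte-identical (modulo timestamps)" criterion could be checked along these lines: drop the timestamp fields, serialize both documents canonically, and compare bytes. The ignored key names (timestamp, captured_at) and the sample values are assumptions; the real env.json schema is not shown in this issue.

```python
import json


def strip_keys(obj, ignored):
    """Recursively drop dict keys listed in `ignored`."""
    if isinstance(obj, dict):
        return {k: strip_keys(v, ignored)
                for k, v in obj.items() if k not in ignored}
    if isinstance(obj, list):
        return [strip_keys(v, ignored) for v in obj]
    return obj


def canonical_bytes(env, ignored=("timestamp", "captured_at")):
    """Serialize with sorted keys and fixed separators after dropping timestamps."""
    cleaned = strip_keys(env, set(ignored))
    return json.dumps(cleaned, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def env_equal(a, b):
    """True if two env documents match once timestamp fields are ignored."""
    return canonical_bytes(a) == canonical_bytes(b)


# Illustrative documents that differ only in timestamps and key order.
from_reduce = {"driver": "x.y", "timestamp": "2025-01-01T00:00:00Z",
               "gpus": [{"name": "gpu0", "captured_at": 1}]}
from_probe = {"gpus": [{"captured_at": 2, "name": "gpu0"}],
              "timestamp": "2025-01-02T00:00:00Z", "driver": "x.y"}
same = env_equal(from_reduce, from_probe)
```

Sorting keys before comparing means the check tolerates differences in field order between the two producers, which a raw file diff would not.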
