Goal
Universal workload runner that works identically for first-party and third-party workloads via the aorta.workloads entry-point group. Owns the per-trial result schema and persistence.
Phasing: B1.0 (interface stub) → B1.1 (implementation)
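B1 ships as two PRs against this single B1 issue so downstream work (the triage matrix scaffold and the workload tasks) can start against real types instead of local stubs. Only B1.1 should carry Closes #<B1-issue>: GitHub closes a linked issue as soon as any PR with a closing keyword merges into the default branch, and other open referencing PRs do not keep it open. B1.0 should reference the issue without a closing keyword (for example Refs #<B1-issue>) so that its earlier merge leaves the issue open.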
B1.0 - interface stub (~½ day, 1 PR)
PR title: B1.0: aorta run interface stub (TrialResult, RunRequest, run_trials signature)
Deliverables:
src/aorta/run/results.py - TrialResult dataclass with the full schema documented below (frozen, with all fields), plus JSON round-trip helpers (to_dict, from_dict).
src/aorta/run/dispatcher.py - RunRequest dataclass + run_trials(request: RunRequest) -> list[TrialResult] whose body is raise NotImplementedError("B1.1").
src/aorta/run/collectors.py - KNOWN_RECIPES: frozenset[str] constant (see "Collector flag reservation" section).
src/aorta/cli/run.py - Click handler that parses all CLI args (including --collect with KNOWN_RECIPES validation), builds a RunRequest, and calls run_trials(). The handler propagates the NotImplementedError for now, which is expected; B1.1 replaces the stub body.
Acceptance criteria for B1.0 (subset of the full list below):
from aorta.run.dispatcher import run_trials, RunRequest imports cleanly
from aorta.run.results import TrialResult imports cleanly
aorta run --workload fsdp --collect rocprof --trials 1 parses and reaches run_trials() (which raises NotImplementedError); --collect bogus rejects at parse time with available-name list
run_trials() body is exactly raise NotImplementedError("B1.1") with a non-empty docstring
TrialResult round-trips through to_dict() / from_dict() losslessly on a hand-built fixture
No subprocess code, no env probe call, no entry-point lookup yet — those land in B1.1
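The B1.0 contract can be sketched roughly as follows. This is a minimal illustration, assuming the top-level schema fields are held as plain dicts plus one tuple; the field types and the bodies of to_dict / from_dict are assumptions, not the shipped module, and run_trials takes a loosely typed argument here only to keep the sketch self-contained.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class TrialResult:
    """Illustrative stand-in for aorta.run.results.TrialResult (field types are assumptions)."""

    schema_version: str
    trial_id: str
    workload: str
    execution_env: dict[str, Any]
    mitigations_applied: tuple[str, ...]
    config: dict[str, Any]
    env: dict[str, Any]
    result: dict[str, Any]
    wall_clock_sec: float
    exit_status: str

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)  # deep-copies the nested dicts
        # JSON has no tuple type, so emit a list and convert back in from_dict.
        data["mitigations_applied"] = list(self.mitigations_applied)
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "TrialResult":
        fields = dict(data)
        fields["mitigations_applied"] = tuple(fields["mitigations_applied"])
        return cls(**fields)


def run_trials(request: object) -> list[TrialResult]:
    """Run N trials for one cell. Interface stub only; the body lands in B1.1."""
    raise NotImplementedError("B1.1")


# Hand-built fixture, pushed through JSON text to exercise the full round trip.
fixture = TrialResult(
    schema_version="0.1",
    trial_id="fsdp_d0_m0_t0",
    workload="fsdp",
    execution_env={"kind": "local", "name": "local"},
    mitigations_applied=("tf32_off",),
    config={"steps": 100},
    env={"env_vars": {"DISABLE_TF32": "1"}},
    result={"passed": True, "failure_count": 0},
    wall_clock_sec=425.1,
    exit_status="ok",
)
round_tripped = TrialResult.from_dict(json.loads(json.dumps(fixture.to_dict())))
assert round_tripped == fixture
```

Going through json.dumps / json.loads rather than comparing dicts directly is what catches tuple-versus-list drift, which is the most likely way a "lossless" round trip quietly breaks.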
B1.1 — implementation (~2-3 days, 1 PR)
PR title: B1.1: aorta run implementation. PR body references B1.0's PR.
Deliverables: everything else in this spec — the dispatcher loop, mitigation injection, env probe wiring, JSON write-out, launch-mode validation, rank-aware writes, plugin discovery, all integration tests.
Why split it
B1.0 PR merge is the unblock signal for downstream task owners — when it lands, switch from local stubs to real imports.
Same-day unblock. B1.0 lands day 1; downstream work codes against real imports from day 2 onward instead of stubbing in their own files.
Reviews split cleanly. B1.0 review is "is this contract right?", which is fast. B1.1 review is "is the implementation correct?", which is slower and focuses on subprocess handling, env injection, and error paths. A single combined PR would blur the two review modes.
Why this matters
A recurring pattern in numerical / correctness investigations on GPU stacks: weeks of effort go into characterizing a problem, but progress only accelerates once a standalone, parameterized reproducer exists that can be run unattended across docker images and mitigation combinations. The platform's job is to make "run this workload across N trials × M dockers" a one-liner. aorta run is that one-liner.
The platform never owns the workload itself, only the orchestration around it. Workload owners (first-party or third-party) implement the Workload ABC and register via the aorta.workloads entry-point group.
Files to create / modify
src/aorta/cli/run.py # MODIFY — replace ClickException with real call
src/aorta/run/__init__.py # NEW (empty)
src/aorta/run/dispatcher.py # NEW — main orchestration loop
src/aorta/run/results.py # NEW — TrialResult schema + JSON writers
src/aorta/run/discovery.py # NEW — entry-point lookup helper
tests/run/__init__.py # NEW (empty)
tests/run/test_dispatcher.py # NEW
tests/run/test_results.py # NEW
Behavior
aorta run --workload <name>
--trials N
[--environment NAME] # registered environment name (default: "local"); resolved via aorta.registry
[--mitigations m1,m2] # default: ["none"] (baseline); each name resolved via aorta.registry
[--extra-env KEY=VAL,...] # ad-hoc env override applied AFTER mitigation env-vars (one-off experiments)
[--collect r1,r2] # in-trial capture recipes — name-validating no-op in MVP, see "Collector flag reservation" below
[--steps S] # workload-specific override
[--results-dir DIR] # default: results
A single aorta run invocation is one cell (one environment, one mitigation set, N trials). Multi-environment fan-out lives in aorta triage — B1 stays the per-cell unit so triage can call run_trials() once per cell without inheriting axis logic.
Per trial in the request:
Capture environment by calling aorta.instrumentation.environment.collect_env() (the A1 library function) → returns an EnvSnapshot embedded in the trial's TrialResult.env. NEVER shell out to aorta env probe, and never re-implement env capture inside the dispatcher — there is exactly one env-probe code path and B1 calls it as a function.
Resolve mitigations and environment through B3's public resolver:
from aorta.registry import get_mitigation, get_environment
For each name in request.mitigations, call get_mitigation(name) and union the returned dict[str, str] env-var bundles in list order.
Call get_environment(request.environment) to obtain the Environment descriptor (docker / venv / rocm).
Apply the unioned mitigation env-vars on top of the current process environment, then layer request.extra_env on top (one-off override path).
Unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError; the dispatcher does NOT catch — surfaces directly so the CLI exits non-zero with B3's actionable message.
Discover workload class via importlib.metadata.entry_points(group="aorta.workloads")
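Instantiate WorkloadClass(config) where config = workload defaults + CLI overrides
Run setup() → run() → cleanup()
Wrap WorkloadResult in TrialResult (adds trial_id, environment, mitigations applied, env snapshot, wall-clock, exit info)
Write <results-dir>/<workload>/trial_<id>.json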
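The env-var layering in the steps above (process environment, then mitigation bundles in list order, then extra_env) can be sketched as a pure function. This is only an illustration: build_trial_env and the fake registry are hypothetical names, and the real resolver is aorta.registry.get_mitigation, which is passed in here so the sketch runs without the package installed.

```python
from __future__ import annotations

from typing import Callable, Mapping, Sequence


def build_trial_env(
    base_env: Mapping[str, str],
    mitigations: Sequence[str],
    extra_env: Mapping[str, str],
    get_mitigation: Callable[[str], Mapping[str, str]],
) -> dict[str, str]:
    """Layer env vars: base process env < mitigations (list order) < extra_env."""
    env = dict(base_env)
    for name in mitigations:
        # Later mitigations override earlier ones on conflicting keys.
        # Resolver errors for unknown names are deliberately not caught here.
        env.update(get_mitigation(name))
    env.update(extra_env)  # one-off overrides always win
    return env


# Hypothetical registry contents, for demonstration only.
_FAKE_REGISTRY: dict[str, dict[str, str]] = {
    "none": {},
    "a": {"FLAG": "from_a", "ONLY_A": "1"},
    "b": {"FLAG": "from_b"},
}


def fake_get_mitigation(name: str) -> Mapping[str, str]:
    return _FAKE_REGISTRY[name]


env = build_trial_env(
    base_env={"PATH": "/usr/bin"},
    mitigations=("a", "b"),
    extra_env={"FLAG": "from_extra"},
    get_mitigation=fake_get_mitigation,
)
assert env == {"PATH": "/usr/bin", "FLAG": "from_extra", "ONLY_A": "1"}
```

Keeping the layering free of os.environ access makes the override order directly unit-testable, which matches the acceptance test described later (second mitigation wins, then extra_env wins).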
Collector flag reservation (--collect, MVP no-op)
aorta run owns the live training subprocess and is therefore where in-trial data-capture recipes (e.g., rocprofv3, numerics dumps, AMD_LOG instrumentation) must attach: they wrap the subprocess from inside run's already-existing process tree. The recipes themselves are P1, but B1 reserves the CLI surface now so P1 doesn't have to refactor aorta run's argument parsing.
MVP behavior:
--collect <name1>,<name2> parses into a tuple of recipe names.
Names are validated against a known set ({"rocprof", "numerics", "amd_log"}; these are placeholders, defined once in aorta.run.collectors.KNOWN_RECIPES so the list has one home).
Unknown names raise a clear error listing valid names.
Known names are accepted and silently no-op in MVP. No subprocess wrapping, no extra files written. The flag is reserved surface only.
A separate top-level aorta bundle <results-dir> command (P1) handles post-hoc artifact packaging — NOT a --collect recipe and NOT in scope for B1.
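A minimal sketch of the MVP --collect behavior described above, assuming the flag arrives as one comma-separated string. parse_collect is a hypothetical helper name and the exact error wording is illustrative; only KNOWN_RECIPES and its three placeholder names come from this spec.

```python
from __future__ import annotations

# Placeholder recipe names from the spec; the real constant lives in aorta.run.collectors.
KNOWN_RECIPES: frozenset[str] = frozenset({"rocprof", "numerics", "amd_log"})


def parse_collect(raw: str | None, known: frozenset[str] = KNOWN_RECIPES) -> tuple[str, ...]:
    """Split a comma-separated --collect value and validate every name.

    Known names are returned unchanged; nothing is wrapped or written (MVP no-op).
    """
    if not raw:
        return ()
    names = tuple(part.strip() for part in raw.split(",") if part.strip())
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(
            f"unknown --collect recipe(s): {', '.join(unknown)}; "
            f"valid names: {', '.join(sorted(known))}"
        )
    return names


assert parse_collect("rocprof,numerics") == ("rocprof", "numerics")
assert parse_collect(None) == ()
```

In the Click handler this would typically run as an option callback so that a bad name fails at parse time, before a RunRequest is ever built.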
Python API contract (consumed by B2)
B2 (aorta triage) calls B1's dispatcher in-process — NOT via subprocess. To make this clean, B1 exposes a single public function alongside the Click handler:
# src/aorta/run/dispatcher.py
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from aorta.run.results import TrialResult

@dataclass(frozen=True)
class RunRequest:
    workload: str
    trials: int
    environment: str = "local"  # registered name; resolved via aorta.registry.get_environment
    mitigations: tuple[str, ...] = ("none",)  # registered names; each resolved via aorta.registry.get_mitigation
    extra_env: dict[str, str] = field(default_factory=dict)  # one-off override applied after mitigation env-vars
    steps: int | None = None
    config_overrides: dict[str, Any] = field(default_factory=dict)
    results_dir: Path = Path("results")

def run_trials(request: RunRequest) -> list[TrialResult]:
    """Run N trials for a single (workload, environment, mitigation-set) combination.

    Returns the in-memory TrialResult list. JSONs are still written to disk as a
    side-effect (per the per-workload tracking criterion).

    Trial-level failures (workload returns passed=False, throws, or times out)
    surface as TrialResult entries with exit_status set accordingly; they do NOT raise.

    Infrastructure failures that prevent any trial from running (e.g., workload
    entry-point not found) raise so the caller can decide what to do (CLI exits
    non-zero; B2 marks the cell as `error` and continues).
    """
The Click handler in cli/run.py becomes a thin shell: parse CLI args → build RunRequest → call run_trials() → derive exit code from results. All orchestration logic lives in run_trials(). No business logic in the Click handler.
Why this matters:
B2's runner calls run_trials() once per cell — no subprocess overhead, no JSON parse round-trip, one Python process for the whole matrix
Infrastructure failures surface as Python exceptions that B2 can catch and report per cell; trial-level workload failures arrive as TrialResult entries rather than tracebacks
Unit tests mock run_trials() directly; no subprocess plumbing in tests
Single source of truth for "what does one logical run do?" — both CLI and triage go through the same function
Distributed launch validation (next section) lives inside run_trials() so B2 inherits it for free
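The "trial failures do not raise" half of this contract can be illustrated with a small sketch. It assumes that a workload exception maps to exit_status "workload_failed" and a TimeoutError maps to "timeout"; the spec names these status values but does not pin that mapping, and the run_one_trial helper and the "error" key are hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable


def run_one_trial(run_workload: Callable[[], dict[str, Any]]) -> dict[str, Any]:
    """Run one trial and translate its outcome into an exit_status instead of raising."""
    try:
        result = run_workload()
    except TimeoutError as exc:
        return {"result": None, "exit_status": "timeout", "error": repr(exc)}
    except Exception as exc:  # assumption: any other workload exception is a workload failure
        return {"result": None, "exit_status": "workload_failed", "error": repr(exc)}
    status = "ok" if result.get("passed") else "workload_failed"
    return {"result": result, "exit_status": status, "error": None}


def _passes() -> dict[str, Any]:
    return {"passed": True}


def _fails() -> dict[str, Any]:
    return {"passed": False}


def _raises() -> dict[str, Any]:
    raise RuntimeError("boom")


def _times_out() -> dict[str, Any]:
    raise TimeoutError("trial exceeded its time budget")


# One bad trial must not stop the ones after it.
outcomes = [run_one_trial(fn) for fn in (_passes, _fails, _raises, _times_out)]
statuses = [outcome["exit_status"] for outcome in outcomes]
assert statuses == ["ok", "workload_failed", "workload_failed", "timeout"]
```

Infrastructure failures (for example, a missing entry point) would be raised before this loop starts, so the caller still sees them as exceptions.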
Distributed launch contract
aorta run defaults to single-process execution. Bare aorta run --workload X runs once, no torchrun, no distributed init — that's the floor behavior. Workloads that need multiple ranks opt in via the Workload ABC's class attributes; the user then wraps aorta run with torchrun.
Launch-mode validation (B1's job)
The Workload ABC carries two class attributes:
from typing import ClassVar, Literal

class Workload:
    launch_mode: ClassVar[Literal["single_process", "distributed"]] = "single_process"
    min_world_size: ClassVar[int] = 1  # only consulted when launch_mode == "distributed"
Before calling setup(), B1's dispatcher reads WORLD_SIZE from env (default 1 if unset) and validates against the workload's declaration:
| Declared | WORLD_SIZE | Outcome |
|---|---|---|
| single_process (default) | 1 | ✓ proceed |
| single_process (default) | > 1 | ✗ raise: "workload <name> is single_process; do not wrap with torchrun" |
| distributed, min_world_size=N | ≥ N | ✓ proceed |
| distributed, min_world_size=N | < N | ✗ raise: "workload <name> requires WORLD_SIZE >= N (got W); launch with torchrun --nproc_per_node=N -m aorta run …" |
This catches the four-way footgun (workload mode × launch mode mismatch) at one well-known spot with one consistent error message — workloads don't repeat the check in setup().
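The validation rules above could be implemented along these lines. This is a sketch under stated assumptions: validate_launch and LaunchModeError are hypothetical names (the spec fixes only the message text), and the environment is passed in as a mapping so the check can be tested without mutating os.environ.

```python
from __future__ import annotations

import os
from typing import Mapping


class LaunchModeError(RuntimeError):
    """Hypothetical error type; the spec only fixes the message text."""


def validate_launch(
    name: str,
    launch_mode: str,
    min_world_size: int = 1,
    env: Mapping[str, str] | None = None,
) -> int:
    """Check WORLD_SIZE against the workload's declaration; return the world size."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))  # unset means single process
    if launch_mode == "single_process":
        if world_size > 1:
            raise LaunchModeError(
                f"workload {name} is single_process; do not wrap with torchrun"
            )
    elif launch_mode == "distributed":
        if world_size < min_world_size:
            raise LaunchModeError(
                f"workload {name} requires WORLD_SIZE >= {min_world_size} "
                f"(got {world_size}); launch with "
                f"torchrun --nproc_per_node={min_world_size} -m aorta run …"
            )
    else:
        raise LaunchModeError(f"workload {name} declares unknown launch_mode {launch_mode!r}")
    return world_size


assert validate_launch("demo", "single_process", env={}) == 1
assert validate_launch("fsdp", "distributed", 2, env={"WORLD_SIZE": "8"}) == 8
```

Calling this before setup() is what guarantees that a mismatch never leaves a partially initialized process group or a partial trial JSON behind.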
Runtime rules under torchrun
For workloads that DO declare launch_mode = "distributed":
Every rank runs the full trial lifecycle (env probe → workload setup/run/cleanup). The workload's setup() calls init_process_group() itself.
Only RANK==0 writes trial_<id>.json. Other ranks compute their own WorkloadResult (needed for the local lifecycle), but discard it on the way out.
RANK env var unset → treat as rank 0. This is what makes single-process workloads "just work" with no torchrun.
Trial-id collisions across ranks are impossible because only rank 0 writes; no per-rank suffix needed.
env probe still runs on every rank so the local WorkloadResult is well-formed, but only rank 0's env probe ends up in the persisted TrialResult.env.
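A rough sketch of the rank-0-only write rule, assuming the file is named by trial index (as in the trial_0.json examples elsewhere in this spec) and that the payload is already a JSON-serializable dict. current_rank and write_trial_json are hypothetical helper names.

```python
from __future__ import annotations

import json
import os
from pathlib import Path
from typing import Any, Mapping


def current_rank(env: Mapping[str, str] | None = None) -> int:
    """Global rank from the RANK env var; unset means rank 0 (single-process default)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def write_trial_json(
    results_dir: Path,
    workload: str,
    trial_index: int,
    payload: dict[str, Any],
    env: Mapping[str, str] | None = None,
) -> Path | None:
    """Write <results-dir>/<workload>/trial_<index>.json on rank 0 only."""
    if current_rank(env) != 0:
        return None  # non-zero ranks ran the lifecycle but discard their result
    out_dir = results_dir / workload
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"trial_{trial_index}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Because only one rank ever writes, the file name needs no per-rank suffix, which is the property the collision rule above relies on.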
Single-node example (8 GPUs, one node)
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100
# Writes results/fsdp/trial_0.json once (from global rank 0).
Multi-node works the same way — aorta run is launch-agnostic, so use whatever multi-node torchrun invocation your team is set up for. Only global rank 0 writes the trial JSON regardless of node count. MVP testing is single-node; multi-node verification deferred until a real multi-node consumer asks for it.
TrialResult schema
{
"schema_version": "0.1",
"trial_id": "fsdp_d0_m0_t0",
"workload": "fsdp",
"execution_env": {
"kind": "docker", // mirrors Environment descriptor: "docker" | "venv" | "rocm" | "local"
"name": "<environment-name>", // the registered name resolved through aorta.registry
"image": "<image-ref>", // when Environment.docker is set
"digest": "sha256:...", // when docker inspect resolves it; null otherwise
"venv": null, // when Environment.venv is set
"rocm": null, // when Environment.rocm is set
"source_package": "aorta" // which package registered this environment (B3 plumbs this)
},
"mitigations_applied": ["tf32_off"],
"config": {...},
"env": { ... full env.json ... },
"result": {
"passed": true,
"failure_count": 0,
"first_failure_iteration": null,
"failure_details": [],
"total_iterations": 5000,
"step_times_ms": [...],
"elapsed_sec": 412.5,
"metrics": {...}
},
"wall_clock_sec": 425.1,
"exit_status": "ok"
}
exit_status ∈ {"ok", "workload_failed", "infrastructure_failed", "timeout"}.
Schema is unstable (0.1) until at least one external consumer pins it. Field renames and additions are allowed without a major-version bump during MVP. Bump to 1.0 when triage validation lands AND an external reader (downstream tool, analysis notebook outside the team, customer script) starts depending on the shape.
execution_env.kind is derived from the resolved Environment descriptor: "docker" when Environment.docker is set, "venv" when only venv is set, "local" for the default local environment with no overrides. The wrapping object exists so future kinds (slurm, singularity, conda) can be added without a v2 schema. image/digest populate when Environment.docker resolves; digest is filled best-effort via docker inspect and is null if unresolved — never block the trial on digest resolution.
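One way to derive the execution_env block from a resolved descriptor is sketched below. The Environment dataclass is a hypothetical stand-in for B3's descriptor, and the "rocm" branch is an assumption: the schema comment lists "rocm" as a kind, but the derivation rule for it is not spelled out in this spec.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Environment:
    """Hypothetical stand-in for the descriptor returned by aorta.registry.get_environment."""

    name: str
    docker: Optional[str] = None
    venv: Optional[str] = None
    rocm: Optional[str] = None
    source_package: str = "aorta"


def execution_env_block(environment: Environment, digest: Optional[str] = None) -> dict[str, Any]:
    """Build the TrialResult.execution_env dict; digest is best-effort and may be None."""
    if environment.docker is not None:
        kind = "docker"
    elif environment.venv is not None:
        kind = "venv"
    elif environment.rocm is not None:
        kind = "rocm"  # assumption: not specified by the derivation rule in the spec
    else:
        kind = "local"
    return {
        "kind": kind,
        "name": environment.name,
        "image": environment.docker,
        # Only meaningful for docker environments; never block the trial on it.
        "digest": digest if environment.docker is not None else None,
        "venv": environment.venv,
        "rocm": environment.rocm,
        "source_package": environment.source_package,
    }


assert execution_env_block(Environment(name="local"))["kind"] == "local"
```

Passing the digest in as an optional argument keeps the docker inspect lookup outside this function, so a failed lookup simply leaves the field null.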
Acceptance criteria
aorta run --workload fsdp --trials 2 writes results/fsdp/trial_0.json and results/fsdp/trial_1.json (default --environment local, default --mitigations none)
aorta run --workload fsdp --trials 2 --environment local --mitigations tf32_off writes 2 JSONs whose mitigations_applied is ["tf32_off"] and whose env.env_vars.DISABLE_TF32 is "1"
No --dockers flag. Multi-environment fan-out is aorta triage's job; B1 takes a single --environment NAME. Verified by reading the Click handler: only one environment-related option exists, and it accepts a single string.
Mitigation resolution routes through aorta.registry: dispatcher imports get_mitigation from aorta.registry (not from any plugin package). Verified by grep on src/aorta/run/.
Environment resolution routes through aorta.registry: dispatcher imports get_environment from aorta.registry; the resolved Environment descriptor populates TrialResult.execution_env.
Mitigation env-var union order: with --mitigations a,b, env-var bundles are unioned in list order (later names override earlier). With --extra-env KEY=VAL, extra_env is layered on top of the union. Verified by a unit test passing two registered mitigations whose env vars conflict and asserting the second wins, then setting extra_env to override and asserting extra_env wins.
Unknown name surfaces B3's error: aorta run --workload fsdp --mitigations not_a_real_thing --trials 1 exits non-zero with B3's UnknownMitigationError message text (which lists available names). Same for --environment not_a_real_env.
Mitigation injection end-to-end (after B3 lands): aorta run --workload fsdp --mitigations tf32_off --environment local --trials 1 injects DISABLE_TF32=1 into the workload's env via aorta.registry.get_mitigation. Once a downstream plugin package registers entries, the same command with that plugin's --environment <name> resolves the descriptor without B1 changing.
Per-workload result tracking — different workloads write to results/<workload>/, never aggregated
Resilient: one trial failing (workload returns passed=False OR throws OR times out) does NOT kill remaining trials. Mark exit_status accordingly.
env probe runs once per trial (not once per command) — captures any per-trial env drift
env probe is the A1 library call: dispatcher imports collect_env from aorta.instrumentation.environment and uses its return value. Verified by: (a) a grep showing collect_env is imported in dispatcher.py; (b) no subprocess invocation of aorta env probe anywhere under src/aorta/run/.
env probe failure does NOT kill the trial — collect_env() itself never raises (A1 contract); if its returned snapshot has partial=True, the trial proceeds and TrialResult.env records the partial snapshot. Trials never carry env: null unless every probe failed AND the snapshot itself was unobtainable (which A1 contractually excludes).
Workload not found in entry-points → clear error message listing available workloads
CLI flag --workload matches entry-point name exactly (case-sensitive)
--collect flag reserved (MVP no-op): aorta run --workload fsdp --collect rocprof --trials 1 parses the flag and validates each name against the known-recipe set; unknown names raise a clear error listing valid names. Known names are accepted and silently no-op (no recipes implemented in MVP). Verified by unit tests: (a) known name parses into the request without error; (b) unknown name raises with available list; (c) known name does NOT cause subprocess wrapping or extra files written.
execution_env block in TrialResult: each trial JSON's execution_env populates kind, name, image/digest/venv/rocm (whichever the resolved Environment carries), and source_package. The top-level docker field from earlier drafts is gone. Verified by schema round-trip test asserting key paths.
Python API exposed for B2: from aorta.run.dispatcher import run_trials, RunRequest works. cli/run.py's Click handler builds a RunRequest from CLI args and calls run_trials() — it contains no orchestration logic of its own. Verified by: (a) a unit test that imports run_trials() directly, calls it with a RunRequest against a mock workload, and asserts a list[TrialResult] of the expected length is returned; (b) the Click handler in cli/run.py is under ~30 lines and contains no for trial in range(...) loop.
Launch-mode validation: dispatcher reads Workload.launch_mode / Workload.min_world_size and WORLD_SIZE env (default 1) before setup(). single_process workloads under torchrun (WORLD_SIZE > 1) raise with the "do not wrap with torchrun" message; distributed workloads with WORLD_SIZE < min_world_size raise with the "launch with torchrun" message naming the required N. Both errors fire BEFORE the workload's setup() runs.
Rank-aware JSON writes: under torchrun (RANK env var set), every rank runs the trial lifecycle but only RANK==0 writes trial_<id>.json. With RANK unset, treated as rank 0 (single-process default). Verified by launching torchrun --nproc_per_node=2 -m aorta run --workload fsdp --trials 1 and confirming exactly one trial_0.json is written.
Plugin discovery validated end-to-end across package boundaries: with aorta editable-installed alongside a second package that registers a workload via the aorta.workloads entry-point group (e.g. pip install -e <path-to-other-package>), aorta run --workload <workload-name> --trials 1 against that externally-registered workload discovers and dispatches it correctly. Direct dispatch IS the proof — if it runs and writes a trial result, the entry-point bridge works. This is the canonical end-to-end check that the aorta.workloads entry-point bridge spans package boundaries — not just in-tree workloads.
Tests cover: dispatcher loop, result schema, entry-point discovery (with both in-tree and externally-installed packages), mitigation env injection, timeout handling
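The discovery criteria above (exact, case-sensitive name match, and a clear error listing available workloads) might look roughly like this. load_workload_class and WorkloadNotFoundError are hypothetical names; the candidates parameter exists only so the lookup can be tested with fake entry points, and the entry_points(group=...) call assumes Python 3.10 or newer.

```python
from __future__ import annotations

from importlib.metadata import entry_points
from typing import Any, Iterable


class WorkloadNotFoundError(LookupError):
    """Hypothetical error type for this sketch."""


def load_workload_class(name: str, candidates: Iterable[Any] | None = None) -> Any:
    """Resolve a workload class by exact entry-point name."""
    if candidates is None:
        # Keyword selection on entry_points() requires Python 3.10+.
        candidates = entry_points(group="aorta.workloads")
    by_name = {ep.name: ep for ep in candidates}
    if name not in by_name:  # dict lookup keeps the match case-sensitive
        available = ", ".join(sorted(by_name)) or "(none installed)"
        raise WorkloadNotFoundError(
            f"workload {name!r} not found; available workloads: {available}"
        )
    return by_name[name].load()


class _FakeEntryPoint:
    """Test double exposing the two attributes the lookup uses."""

    def __init__(self, name: str, obj: Any) -> None:
        self.name = name
        self._obj = obj

    def load(self) -> Any:
        return self._obj


class DemoWorkload:
    """Placeholder workload class for the demonstration."""


fakes = [_FakeEntryPoint("fsdp", DemoWorkload)]
assert load_workload_class("fsdp", fakes) is DemoWorkload
```

Since entry points are read from installed package metadata, the same lookup covers in-tree workloads and workloads registered by a second editable-installed package.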
Out of scope (P1+)
--search flag for adaptive perf search — Optuna integration is P1
Retry-on-failure logic (workload owns its retries)
Container lifecycle (docker pull, run, etc.) — aorta run does NOT manage containers. The --environment value resolves to a descriptor (image ref, venv path, ROCm version) that lands in TrialResult.execution_env as a label; the user is responsible for actually being inside that environment when they invoke aorta run. Multi-environment fan-out (running the same workload across N environments) is aorta triage's job, not B1's.
Distributed trial parallelism (one trial at a time is fine for MVP)
How to test
Tests are split into two groups by what dependencies they need:
B1-dev tests (runnable while building B1, no other tasks needed)
These verify the dispatcher in isolation. Use the in-tree fsdp workload (after the FSDP workload task lands) or a tiny test-only workload registered via a local entry-point in tests/.
# Distributed workload, single node, 8 GPUs: exercises launch_mode="distributed"
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100
ls results/fsdp/ # exactly one trial_0.json (only global rank 0 wrote it)
# Mismatch: bare `aorta run` against a distributed workload; should raise BEFORE setup()
# with: "fsdp requires WORLD_SIZE >= 2; launch with `torchrun --nproc_per_node=N ...`"
aorta run --workload fsdp --trials 1 # expect: clean error, no partial trial JSON
# Mismatch: torchrun-wrapping a single_process workload; should raise BEFORE setup()
# with: "<workload> is single_process; do not wrap with torchrun"
torchrun --nproc_per_node=2 -m aorta run --workload <single_process_workload> --trials 1
# Multi-trial, single environment (default --environment local)
aorta run --workload fsdp --trials 2
ls results/fsdp/ # 2 JSONs
# Explicit environment label (resolved through aorta.registry; no docker management)
aorta run --workload fsdp --trials 2 --environment local
ls results/fsdp/ # 2 JSONs whose execution_env.name == "local"
# Unknown environment name → B3 error surfaces with available list
aorta run --workload fsdp --trials 1 --environment not_a_real_env
# expect: UnknownEnvironmentError with sorted list of registered names
# Unknown mitigation name → B3 error surfaces with available list
aorta run --workload fsdp --trials 1 --mitigations not_a_real_thing
# expect: UnknownMitigationError with sorted list of registered names
# Workload-not-found error
aorta run --workload nonexistent --trials 1
# Should error cleanly listing available workloads, not stack-trace
Integration verification (run when downstream tasks land)
Each block below is gated on its prerequisite. Don't expect to run these during B1 development.
# When a second package registers a workload via the aorta.workloads entry-point
# group: install both packages and dispatch. Direct dispatch IS the proof that
# the entry-point bridge spans package boundaries.
pip install -e <path-to-aorta>
pip install -e <path-to-other-package>
aorta run --workload <workload-name> --trials 1
ls results/<workload-name>/ # expect: trial_0.json
jq '.workload' results/<workload-name>/trial_0.json # expect: "<workload-name>"
# After B3 lands: end-to-end mitigation + environment injection via the real registry
aorta run --workload <wl> --mitigations tf32_off --environment local --trials 1 --steps 200
jq '.mitigations_applied' results/<wl>/trial_0.json # ["tf32_off"]
jq '.env.env_vars.DISABLE_TF32' results/<wl>/trial_0.json # "1"
jq '.execution_env.name' results/<wl>/trial_0.json # "local"
jq '.execution_env.source_package' results/<wl>/trial_0.json # "aorta"
# Same command targeting an environment registered by a downstream plugin package
# resolves the descriptor (image ref / venv / ROCm version) via the entry-point bridge.
# Without that plugin installed, B3 raises UnknownEnvironmentError, proving the public
# package has zero hard imports of any plugin's package.
PR template
Title: B1: aorta run — universal workload runner
Body: include sample TrialResult JSON, confirm per-workload tracking, demo of resilience to single-trial failure.
Goal
Universal workload runner that works identically for first-party and third-party workloads via the
aorta.workloadsentry-point group. Owns the per-trial result schema and persistence.Phasing — B1.0 (interface stub) → B1.1 (implementation)
B1 ships as two PRs against this single B1 issue so downstream work (the triage matrix scaffold and the workload tasks) can start against real types instead of local stubs. Both PRs use
Closes #<B1-issue>— GitHub auto-closes the issue when B1.1 merges; B1.0's earlier merge does not close it (other open referencing PRs keep it alive).B1.0 — interface stub (~½ day, 1 PR)
PR title:
B1.0: aorta run interface stub (TrialResult, RunRequest, run_trials signature)Deliverables:
src/aorta/run/results.py—TrialResultdataclass with the full schema documented below (frozen, with all fields). JSON round-trip helpers (to_dict,from_dict).src/aorta/run/dispatcher.py—RunRequestdataclass +run_trials(req: RunRequest) -> list[TrialResult]whose body israise NotImplementedError("B1.1").src/aorta/run/collectors.py—KNOWN_RECIPES: frozenset[str]constant (see "Collector flag reservation" section).src/aorta/cli/run.py— Click handler that parses all CLI args (including--collectwithKNOWN_RECIPESvalidation), builds aRunRequest, callsrun_trials(). The handler propagates theNotImplementedErrorfor now — that's fine; B1.1 lifts the body.Acceptance criteria for B1.0 (subset of the full list below):
from aorta.run.dispatcher import run_trials, RunRequestimports cleanlyfrom aorta.run.results import TrialResultimports cleanlyaorta run --workload fsdp --collect rocprof --trials 1parses and reachesrun_trials()(which raisesNotImplementedError);--collect bogusrejects at parse time with available-name listrun_trials()body is exactlyraise NotImplementedError("B1.1")with a non-empty docstringTrialResultround-trips throughto_dict()/from_dict()losslessly on a hand-built fixtureB1.1 — implementation (~2-3 days, 1 PR)
PR title:
B1.1: aorta run implementation. PR body references B1.0's PR.Deliverables: everything else in this spec — the dispatcher loop, mitigation injection, env probe wiring, JSON write-out, launch-mode validation, rank-aware writes, plugin discovery, all integration tests.
Why split it
Why this matters
A recurring pattern in numerical / correctness investigations on GPU stacks: weeks of effort go into characterizing a problem, but progress only inflects once a standalone, parameterized reproducer exists that can be run unattended across docker images and mitigation combinations. The platform's job is to make "run this workload across N trials × M dockers" a one-liner.
aorta runis that one-liner.The platform never owns the workload itself, only the orchestration around it. Workload owners (first-party or third-party) implement the
WorkloadABC and register via theaorta.workloadsentry-point group.Files to create / modify
Behavior
A single
aorta runinvocation is one cell (one environment, one mitigation set, N trials). Multi-environment fan-out lives inaorta triage— B1 stays the per-cell unit so triage can callrun_trials()once per cell without inheriting axis logic.Per trial in the request:
aorta.instrumentation.environment.collect_env()(the A1 library function) → returns anEnvSnapshotembedded in the trial'sTrialResult.env. NEVER shell out toaorta env probe, and never re-implement env capture inside the dispatcher — there is exactly one env-probe code path and B1 calls it as a function.from aorta.registry import get_mitigation, get_environmentrequest.mitigations, callget_mitigation(name)and union the returneddict[str, str]env-var bundles in list order.get_environment(request.environment)to obtain theEnvironmentdescriptor (docker / venv / rocm).request.extra_envon top (one-off override path).UnknownMitigationError/UnknownEnvironmentError; the dispatcher does NOT catch — surfaces directly so the CLI exits non-zero with B3's actionable message.importlib.metadata.entry_points(group="aorta.workloads")WorkloadClass(config)where config = workload defaults + CLI overridessetup()→run()→cleanup()WorkloadResultinTrialResult(adds trial_id, environment, mitigations applied, env snapshot, wall-clock, exit info)<results-dir>/<workload>/trial_<id>.jsonCollector flag reservation (
--collect, MVP no-op)aorta runowns the live training subprocess and is therefore where in-trial data-capture recipes (e.g., rocprofv3, numerics dumps, AMD_LOG instrumentation) must attach — they wrap the subprocess from insiderun's already-existing process tree. Recipes themselves are P1 — but B1 reserves the CLI surface now so P1 doesn't have to refactoraorta run's argument parsing.MVP behavior:
--collect <name1>,<name2>parses into a tuple of recipe names.{"rocprof", "numerics", "amd_log"}— placeholders, lowered intoaorta.run.collectors.KNOWN_RECIPESso the list has one home).aorta bundle <results-dir>command (P1) handles post-hoc artifact packaging — NOT a--collectrecipe and NOT in scope for B1.Python API contract (consumed by B2)
B2 (
aorta triage) calls B1's dispatcher in-process — NOT via subprocess. To make this clean, B1 exposes a single public function alongside the Click handler:The Click handler in
cli/run.pybecomes a thin shell: parse CLI args → buildRunRequest→ callrun_trials()→ derive exit code from results. All orchestration logic lives inrun_trials(). No business logic in the Click handler.Why this matters:
run_trials()once per cell — no subprocess overhead, no JSON parse round-trip, one Python process for the whole matrixrun_trials()directly; no subprocess plumbing in testsrun_trials()so B2 inherits it for freeDistributed launch contract
aorta rundefaults to single-process execution. Bareaorta run --workload Xruns once, no torchrun, no distributed init — that's the floor behavior. Workloads that need multiple ranks opt in via theWorkloadABC's class attributes; the user then wrapsaorta runwithtorchrun.Launch-mode validation (B1's job)
The
WorkloadABC carries two class attributes:Before calling
setup(), B1's dispatcher readsWORLD_SIZEfrom env (default 1 if unset) and validates against the workload's declaration:WORLD_SIZEsingle_process(default)single_process(default)<name>is single_process; do not wrap with torchrun"distributed,min_world_size=Ndistributed,min_world_size=N<name>requires WORLD_SIZE >= N (got W); launch withtorchrun --nproc_per_node=N -m aorta run …"This catches the four-way footgun (workload mode × launch mode mismatch) at one well-known spot with one consistent error message — workloads don't repeat the check in
setup().Runtime rules under torchrun
For workloads that DO declare
launch_mode = "distributed":setup()callsinit_process_group()itself.RANK==0writestrial_<id>.json. Other ranks compute their ownWorkloadResult(needed for the local lifecycle), but discard it on the way out.RANKenv var unset → treat as rank 0. This is what makes single-process workloads "just work" with no torchrun.WorkloadResultis well-formed, but only rank 0's env probe ends up in the persistedTrialResult.env.Single-node example (8 GPUs, one node)
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100 # Writes results/fsdp/trial_0.json once (from global rank 0).Multi-node works the same way —
aorta runis launch-agnostic, so use whatever multi-nodetorchruninvocation your team is set up for. Only global rank 0 writes the trial JSON regardless of node count. MVP testing is single-node; multi-node verification deferred until a real multi-node consumer asks for it.TrialResult schema
{ "schema_version": "0.1", "trial_id": "fsdp_d0_m0_t0", "workload": "fsdp", "execution_env": { "kind": "docker", // mirrors Environment descriptor: "docker" | "venv" | "rocm" | "local" "name": "<environment-name>", // the registered name resolved through aorta.registry "image": "<image-ref>", // when Environment.docker is set "digest": "sha256:...", // when docker inspect resolves it; null otherwise "venv": null, // when Environment.venv is set "rocm": null, // when Environment.rocm is set "source_package": "aorta" // which package registered this environment (B3 plumbs this) }, "mitigations_applied": ["tf32_off"], "config": {...}, "env": { ... full env.json ... }, "result": { "passed": true, "failure_count": 0, "first_failure_iteration": null, "failure_details": [], "total_iterations": 5000, "step_times_ms": [...], "elapsed_sec": 412.5, "metrics": {...} }, "wall_clock_sec": 425.1, "exit_status": "ok" }exit_status∈{"ok", "workload_failed", "infrastructure_failed", "timeout"}.Schema is unstable (
0.1) until at least one external consumer pins it. Field renames and additions are allowed without a major-version bump during MVP. Bump to1.0when triage validation lands AND an external reader (downstream tool, analysis notebook outside the team, customer script) starts depending on the shape.execution_env.kindis derived from the resolvedEnvironmentdescriptor:"docker"whenEnvironment.dockeris set,"venv"when onlyvenvis set,"local"for the defaultlocalenvironment with no overrides. The wrapping object exists so future kinds (slurm,singularity,conda) can be added without a v2 schema.image/digestpopulate whenEnvironment.dockerresolves;digestis filled best-effort viadocker inspectand isnullif unresolved — never block the trial on digest resolution.Acceptance criteria
aorta run --workload fsdp --trials 2 writes results/fsdp/trial_0.json and results/fsdp/trial_1.json (default --environment local, default --mitigations none)
aorta run --workload fsdp --trials 2 --environment local --mitigations tf32_off writes 2 JSONs whose mitigations_applied is ["tf32_off"] and whose env.env_vars.DISABLE_TF32 is "1"
No --dockers flag. Multi-environment fan-out is aorta triage's job; B1 takes a single --environment NAME. Verified by reading the Click handler: only one environment-related option exists, and it accepts a single string.
aorta.registry: dispatcher imports get_mitigation from aorta.registry (not from any plugin package). Verified by grep on src/aorta/run/.
aorta.registry: dispatcher imports get_environment from aorta.registry; the resolved Environment descriptor populates TrialResult.execution_env.
With --mitigations a,b, env-var bundles are unioned in list order (later names override earlier). With --extra-env KEY=VAL, extra_env is layered on top of the union. Verified by a unit test passing two registered mitigations whose env vars conflict and asserting the second wins, then setting extra_env to override and asserting extra_env wins.
aorta run --workload fsdp --mitigations not_a_real_thing --trials 1 exits non-zero with B3's UnknownMitigationError message text (which lists available names). Same for --environment not_a_real_env.
aorta run --workload fsdp --mitigations tf32_off --environment local --trials 1 injects DISABLE_TF32=1 into the workload's env via aorta.registry.get_mitigation. Once a downstream plugin package registers entries, the same command with that plugin's --environment <name> resolves the descriptor without B1 changing.
Results are written per workload under results/<workload>/, never aggregated
A failed trial (passed=False OR throws OR times out) does NOT kill remaining trials. Mark exit_status accordingly.
The dispatcher imports collect_env from aorta.instrumentation.environment and uses its return value. Verified by: (a) a grep showing collect_env is imported in dispatcher.py; (b) no subprocess invocation of aorta env probe anywhere under src/aorta/run/. collect_env() itself never raises (A1 contract); if its returned snapshot has partial=True, the trial proceeds and TrialResult.env records the partial snapshot. Trials never carry env: null unless every probe failed AND the snapshot itself was unobtainable (which A1 contractually excludes).
--workload matches entry-point name exactly (case-sensitive)
--collect flag reserved (MVP no-op): aorta run --workload fsdp --collect rocprof --trials 1 parses the flag and validates each name against the known-recipe set; unknown names raise a clear error listing valid names. Known names are accepted and silently no-op (no recipes implemented in MVP). Verified by unit tests: (a) known name parses into the request without error; (b) unknown name raises with available list; (c) known name does NOT cause subprocess wrapping or extra files written.
execution_env block in TrialResult: each trial JSON's execution_env populates kind, name, image / digest / venv / rocm (whichever the resolved Environment carries), and source_package. The top-level docker field from earlier drafts is gone. Verified by schema round-trip test asserting key paths.
from aorta.run.dispatcher import run_trials, RunRequest works. cli/run.py's Click handler builds a RunRequest from CLI args and calls run_trials(); it contains no orchestration logic of its own. Verified by: (a) a unit test that imports run_trials() directly, calls it with a RunRequest against a mock workload, and asserts a list[TrialResult] of the expected length is returned; (b) the Click handler in cli/run.py is under ~30 lines and contains no for trial in range(...) loop.
The dispatcher validates Workload.launch_mode / Workload.min_world_size and WORLD_SIZE env (default 1) before setup(). single_process workloads under torchrun (WORLD_SIZE > 1) raise with the "do not wrap with torchrun" message; distributed workloads with WORLD_SIZE < min_world_size raise with the "launch with torchrun" message naming the required N. Both errors fire BEFORE the workload's setup() runs.
Under torchrun (RANK env var set), every rank runs the trial lifecycle but only RANK==0 writes trial_<id>.json. With RANK unset, treated as rank 0 (single-process default). Verified by launching torchrun --nproc_per_node=2 -m aorta run --workload fsdp --trials 1 and confirming exactly one trial_0.json is written.
With aorta editable-installed alongside a second package that registers a workload via the aorta.workloads entry-point group (e.g. pip install -e <path-to-other-package>), aorta run --workload <workload-name> --trials 1 against that externally-registered workload discovers and dispatches it correctly. Direct dispatch IS the proof: if it runs and writes a trial result, the entry-point bridge works. This is the canonical end-to-end check that the aorta.workloads entry-point bridge spans package boundaries, not just in-tree workloads.
Out of scope (P1+)
--search flag for adaptive perf search: Optuna integration is P1
aorta run does NOT manage containers. The --environment value resolves to a descriptor (image ref, venv path, ROCm version) that lands in TrialResult.execution_env as a label; the user is responsible for actually being inside that environment when they invoke aorta run. Multi-environment fan-out (running the same workload across N environments) is aorta triage's job, not B1's.
How to test
Tests are split into two groups by what dependencies they need:
B1-dev tests (runnable while building B1, no other tasks needed)
These verify the dispatcher in isolation. Use the in-tree fsdp workload (after the FSDP workload task lands) or a tiny test-only workload registered via a local entry-point in tests/.
Plus pytest-level unit tests under tests/run/:
Integration verification (run when downstream tasks land)
Each block below is gated on its prerequisite. Don't expect to run these during B1 development.
PR template
Title: B1: aorta run — universal workload runner
Body: include sample TrialResult JSON, confirm per-workload tracking, demo of resilience to single-trial failure.
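For the resilience demo, a minimal sketch of a trial loop that classifies each outcome and keeps going may help. It rests on several assumptions: run_one is a hypothetical callable standing in for the workload lifecycle, the record is a trimmed subset of the schema, and the mapping of exceptions to exit_status values (TimeoutError to "timeout", any other exception to "infrastructure_failed") is a guess, since this issue only says to mark exit_status accordingly.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable, Optional


def classify(result: Optional[dict], error: Optional[BaseException]) -> str:
    """Map one trial outcome to an exit_status value (mapping is an assumption)."""
    if isinstance(error, TimeoutError):
        return "timeout"
    if error is not None:
        return "infrastructure_failed"
    return "ok" if result and result.get("passed") else "workload_failed"


def run_trials_sketch(
    workload: str,
    trials: int,
    run_one: Callable[[int], dict],  # hypothetical: runs trial i, returns "result"
    out_root: Path = Path("results"),
) -> list[dict]:
    # RANK unset is treated as rank 0 (single-process default).
    is_rank0 = int(os.environ.get("RANK", "0")) == 0
    records = []
    for i in range(trials):
        start = time.monotonic()
        result, error = None, None
        try:
            result = run_one(i)
        except Exception as exc:  # one bad trial must not stop the rest
            error = exc
        record = {
            "schema_version": "0.1",
            "workload": workload,
            "result": result,
            "wall_clock_sec": time.monotonic() - start,
            "exit_status": classify(result, error),
        }
        records.append(record)
        if is_rank0:  # every rank runs the trial; only rank 0 persists it
            out_dir = out_root / workload
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / f"trial_{i}.json").write_text(json.dumps(record, indent=2))
    return records
```

Each trial gets its own file under results/<workload>/, so a failure in one trial leaves the earlier and later trial files intact and nothing is aggregated.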