B1: aorta run #148

@oyazdanb

Goal

Universal workload runner that works identically for first-party and third-party workloads via the aorta.workloads entry-point group. Owns the per-trial result schema and persistence.

Phasing — B1.0 (interface stub) → B1.1 (implementation)

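B1 ships as two PRs against this single B1 issue so downstream work (the triage matrix scaffold and the workload tasks) can start against real types instead of local stubs. Only the B1.1 PR should use Closes #<B1-issue>; the B1.0 PR should reference the issue without a closing keyword (for example, Refs #<B1-issue>). GitHub closes a linked issue as soon as any PR carrying a closing keyword merges into the default branch, regardless of other open PRs that reference it, so a closing keyword on B1.0 would close the issue before B1.1 lands.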

B1.0 — interface stub (~½ day, 1 PR)

PR title: B1.0: aorta run interface stub (TrialResult, RunRequest, run_trials signature)

Deliverables:

  • src/aorta/run/results.py: TrialResult dataclass with the full schema documented below (frozen, with all fields). JSON round-trip helpers (to_dict, from_dict).
  • src/aorta/run/dispatcher.py: RunRequest dataclass + run_trials(request: RunRequest) -> list[TrialResult] whose body is raise NotImplementedError("B1.1").
  • src/aorta/run/collectors.py: the KNOWN_RECIPES: frozenset[str] constant (see "Collector flag reservation" section).
  • src/aorta/cli/run.py: Click handler that parses all CLI args (including --collect with KNOWN_RECIPES validation), builds a RunRequest, and calls run_trials(). The handler propagates the NotImplementedError for now, which is fine; B1.1 fills in the body.

Acceptance criteria for B1.0 (subset of the full list below):

  • from aorta.run.dispatcher import run_trials, RunRequest imports cleanly
  • from aorta.run.results import TrialResult imports cleanly
  • aorta run --workload fsdp --collect rocprof --trials 1 parses and reaches run_trials() (which raises NotImplementedError); --collect bogus is rejected at parse time with an error that lists the valid recipe names
  • run_trials() body is exactly raise NotImplementedError("B1.1") with a non-empty docstring
  • TrialResult round-trips through to_dict() / from_dict() losslessly on a hand-built fixture
  • No subprocess code, no env probe call, no entry-point lookup yet — those land in B1.1
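The round-trip criterion above can be sketched as follows. This is a minimal sketch, not the B1.0 implementation: TrialResultSketch is a hypothetical, reduced stand-in for TrialResult that carries only a handful of the schema fields, and storing result as a plain dict is an assumption made to keep the example short.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class TrialResultSketch:
    """Reduced stand-in for TrialResult (hypothetical subset of fields)."""

    trial_id: str
    workload: str
    exit_status: str
    wall_clock_sec: float
    mitigations_applied: tuple[str, ...] = ("none",)
    result: dict[str, Any] = field(default_factory=dict)
    schema_version: str = "0.1"

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # JSON has no tuple type, so emit a list explicitly.
        data["mitigations_applied"] = list(self.mitigations_applied)
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "TrialResultSketch":
        fields = dict(data)
        # Restore the tuple so the rebuilt object compares equal to the original.
        fields["mitigations_applied"] = tuple(fields["mitigations_applied"])
        return cls(**fields)


# Hand-built fixture, round-tripped through real JSON text (not just a dict).
fixture = TrialResultSketch(
    trial_id="fsdp_d0_m0_t0",
    workload="fsdp",
    exit_status="ok",
    wall_clock_sec=425.1,
    mitigations_applied=("tf32_off",),
    result={"passed": True, "failure_count": 0},
)
restored = TrialResultSketch.from_dict(json.loads(json.dumps(fixture.to_dict())))
```

Going through json.dumps / json.loads rather than comparing dicts in memory is what catches tuple-versus-list drift, which is the most likely way a frozen dataclass fails a "lossless" check.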

B1.1 — implementation (~2-3 days, 1 PR)

PR title: B1.1: aorta run implementation. PR body references B1.0's PR.

Deliverables: everything else in this spec — the dispatcher loop, mitigation injection, env probe wiring, JSON write-out, launch-mode validation, rank-aware writes, plugin discovery, all integration tests.

Why split it

  • B1.0 PR merge is the unblock signal for downstream task owners — when it lands, switch from local stubs to real imports.
  • Same-day unblock. B1.0 lands day 1; downstream work codes against real imports from day 2 onward instead of stubbing in their own files.
  • Reviews split cleanly. B1.0 review asks "is this contract right?" and is fast. B1.1 review asks "is the implementation correct?" and is slower, focusing on subprocess handling, env injection, and error paths. A single combined PR would blur the two review modes.

Why this matters

A recurring pattern in numerical / correctness investigations on GPU stacks: weeks of effort go into characterizing a problem, but progress only inflects once a standalone, parameterized reproducer exists that can be run unattended across docker images and mitigation combinations. The platform's job is to make "run this workload across N trials × M dockers" a one-liner. aorta run is that one-liner.

The platform never owns the workload itself, only the orchestration around it. Workload owners (first-party or third-party) implement the Workload ABC and register via the aorta.workloads entry-point group.
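Entry-point lookup for the aorta.workloads group could look roughly like the sketch below. It uses the standard-library importlib.metadata API; the helper names (select_workload, discover_workload) and the WorkloadNotFoundError type are hypothetical, since the spec only requires exact, case-sensitive name matching and a clear error listing the available workloads.

```python
from importlib.metadata import EntryPoint, entry_points

GROUP = "aorta.workloads"


class WorkloadNotFoundError(LookupError):
    """Hypothetical error type; the spec only requires a clear message."""


def select_workload(name: str, candidates: list[EntryPoint]) -> EntryPoint:
    """Pick the entry point whose name matches exactly (case-sensitive)."""
    by_name = {ep.name: ep for ep in candidates}
    if name not in by_name:
        available = ", ".join(sorted(by_name)) or "<none registered>"
        raise WorkloadNotFoundError(
            f"workload {name!r} not found; available workloads: {available}"
        )
    return by_name[name]


def discover_workload(name: str) -> type:
    """Resolve a workload class from whatever packages are installed."""
    # The group= keyword form of entry_points() requires Python 3.10+.
    ep = select_workload(name, list(entry_points(group=GROUP)))
    # load() imports the module and returns the registered object.
    return ep.load()
```

Splitting selection from the entry_points() call keeps the matching and error-message logic testable with hand-built EntryPoint objects, without installing a second package.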

Files to create / modify

src/aorta/cli/run.py                # MODIFY — replace ClickException with real call
src/aorta/run/__init__.py           # NEW (empty)
src/aorta/run/collectors.py         # NEW (KNOWN_RECIPES constant)
src/aorta/run/dispatcher.py         # NEW — main orchestration loop
src/aorta/run/results.py            # NEW — TrialResult schema + JSON writers
src/aorta/run/discovery.py          # NEW — entry-point lookup helper
tests/run/__init__.py               # NEW (empty)
tests/run/test_dispatcher.py        # NEW
tests/run/test_results.py           # NEW

Behavior

aorta run --workload <name>
          --trials N
          [--environment NAME]        # registered environment name (default: "local"); resolved via aorta.registry
          [--mitigations m1,m2]       # default: ["none"] (baseline); each name resolved via aorta.registry
          [--extra-env KEY=VAL,...]   # ad-hoc env override applied AFTER mitigation env-vars (one-off experiments)
          [--collect r1,r2]           # in-trial capture recipes — name-validating no-op in MVP, see "Collector flag reservation" below
          [--steps S]                 # workload-specific override
          [--results-dir results]     # default

A single aorta run invocation is one cell (one environment, one mitigation set, N trials). Multi-environment fan-out lives in aorta triage — B1 stays the per-cell unit so triage can call run_trials() once per cell without inheriting axis logic.

Per trial in the request:

  1. Capture environment by calling aorta.instrumentation.environment.collect_env() (the A1 library function) → returns an EnvSnapshot embedded in the trial's TrialResult.env. NEVER shell out to aorta env probe, and never re-implement env capture inside the dispatcher — there is exactly one env-probe code path and B1 calls it as a function.
  2. Resolve mitigations and environment through B3's public resolver:
    • from aorta.registry import get_mitigation, get_environment
    • For each name in request.mitigations, call get_mitigation(name) and union the returned dict[str, str] env-var bundles in list order (on key conflicts, later names override earlier ones).
    • Call get_environment(request.environment) to obtain the Environment descriptor (docker / venv / rocm).
    • Apply the unioned mitigation env-vars on top of the current process environment, then layer request.extra_env on top (one-off override path).
    • Unknown names raise B3's UnknownMitigationError / UnknownEnvironmentError; the dispatcher does NOT catch — surfaces directly so the CLI exits non-zero with B3's actionable message.
  3. Discover workload class via importlib.metadata.entry_points(group="aorta.workloads")
  4. Instantiate WorkloadClass(config) where config = workload defaults + CLI overrides
  5. Call setup() → run() → cleanup()
  6. Wrap WorkloadResult in TrialResult (adds trial_id, environment, mitigations applied, env snapshot, wall-clock, exit info)
  7. Write <results-dir>/<workload>/trial_<id>.json
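The env layering in step 2 can be sketched as a pure function. This is an illustration under assumptions: build_trial_env is a hypothetical helper name, the resolver is passed in as a callable so the sketch does not depend on aorta.registry, and the fake registry contents (including "none" mapping to an empty bundle) are made up.

```python
from __future__ import annotations

from typing import Callable, Mapping, Sequence


def build_trial_env(
    base_env: Mapping[str, str],
    mitigations: Sequence[str],
    extra_env: Mapping[str, str],
    get_mitigation: Callable[[str], Mapping[str, str]],
) -> dict[str, str]:
    """Layer env vars: process env, then mitigations in order, then extra_env."""
    env = dict(base_env)
    for name in mitigations:
        # Later mitigations override earlier ones on key conflicts.
        env.update(get_mitigation(name))
    # extra_env is the one-off override path and always wins.
    env.update(extra_env)
    return env


# Fake registry standing in for aorta.registry (names and values are made up).
FAKE_REGISTRY: dict[str, dict[str, str]] = {
    "none": {},
    "a": {"FLAG": "from_a", "ONLY_A": "1"},
    "b": {"FLAG": "from_b"},
}


def fake_get_mitigation(name: str) -> Mapping[str, str]:
    # A KeyError here stands in for B3's UnknownMitigationError; as in the
    # spec, the caller does not catch it.
    return FAKE_REGISTRY[name]


layered = build_trial_env({"PATH": "/usr/bin"}, ["a", "b"], {}, fake_get_mitigation)
overridden = build_trial_env(
    {"PATH": "/usr/bin"}, ["a", "b"], {"FLAG": "from_extra"}, fake_get_mitigation
)
```

Returning a new dict rather than mutating os.environ keeps one trial's mitigation set from leaking into the next trial.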

Collector flag reservation (--collect, MVP no-op)

aorta run owns the live training subprocess and is therefore where in-trial data-capture recipes (e.g., rocprofv3, numerics dumps, AMD_LOG instrumentation) must attach — they wrap the subprocess from inside run's already-existing process tree. Recipes themselves are P1 — but B1 reserves the CLI surface now so P1 doesn't have to refactor aorta run's argument parsing.

MVP behavior:

  • --collect <name1>,<name2> parses into a tuple of recipe names.
  • Names are validated against a known set of placeholder names, {"rocprof", "numerics", "amd_log"}, defined once in aorta.run.collectors.KNOWN_RECIPES so the list has a single home.
  • Unknown names raise a clear error listing valid names.
  • Known names are accepted and silently no-op in MVP. No subprocess wrapping, no extra files written. The flag is reserved surface only.
  • A separate top-level aorta bundle <results-dir> command (P1) handles post-hoc artifact packaging — NOT a --collect recipe and NOT in scope for B1.
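A minimal sketch of the MVP --collect behavior described above. KNOWN_RECIPES and its three placeholder names come from this spec; parse_collect is a hypothetical helper name, and raising ValueError is an assumption (a Click handler would more likely translate this into click.BadParameter).

```python
from __future__ import annotations

KNOWN_RECIPES: frozenset[str] = frozenset({"rocprof", "numerics", "amd_log"})


def parse_collect(raw: str | None) -> tuple[str, ...]:
    """Parse a comma-separated --collect value into validated recipe names."""
    if not raw:
        return ()
    names = tuple(part.strip() for part in raw.split(",") if part.strip())
    unknown = sorted(set(names) - KNOWN_RECIPES)
    if unknown:
        raise ValueError(
            f"unknown --collect recipe(s): {', '.join(unknown)}; "
            f"valid names: {', '.join(sorted(KNOWN_RECIPES))}"
        )
    # Known names are accepted and otherwise ignored in MVP (reserved surface).
    return names
```

For example, parse_collect("rocprof,numerics") returns ("rocprof", "numerics"), while parse_collect("bogus") raises with the sorted list of valid names in the message.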

Python API contract (consumed by B2)

B2 (aorta triage) calls B1's dispatcher in-process — NOT via subprocess. To make this clean, B1 exposes a single public function alongside the Click handler:

# src/aorta/run/dispatcher.py
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

from aorta.run.results import TrialResult

@dataclass(frozen=True)
class RunRequest:
    workload: str
    trials: int
    environment: str = "local"             # registered name; resolved via aorta.registry.get_environment
    mitigations: tuple[str, ...] = ("none",)  # registered names; each resolved via aorta.registry.get_mitigation
    extra_env: dict[str, str] = field(default_factory=dict)  # one-off override applied after mitigation env-vars
    steps: int | None = None
    config_overrides: dict[str, Any] = field(default_factory=dict)
    results_dir: Path = Path("results")


def run_trials(request: RunRequest) -> list[TrialResult]:
    """Run N trials for a single (workload, environment, mitigation-set) combination.

    Returns the in-memory TrialResult list. JSONs are still written to disk
    as a side-effect (per the per-workload tracking criterion).

    Trial-level failures (workload returns passed=False, throws, or times out)
    surface as TrialResult entries with exit_status set accordingly — they do
    NOT raise. Infrastructure failures that prevent any trial from running
    (e.g., workload entry-point not found) raise so the caller can decide what
    to do (CLI exits non-zero; B2 marks the cell as `error` and continues).
    """

The Click handler in cli/run.py becomes a thin shell: parse CLI args → build RunRequest → call run_trials() → derive exit code from results. All orchestration logic lives in run_trials(). No business logic in the Click handler.

Why this matters:

  • B2's runner calls run_trials() once per cell — no subprocess overhead, no JSON parse round-trip, one Python process for the whole matrix
  • Workload exceptions surface as Python tracebacks B2 can catch and report per cell
  • Unit tests mock run_trials() directly; no subprocess plumbing in tests
  • Single source of truth for "what does one logical run do?" — both CLI and triage go through the same function
  • Distributed launch validation (next section) lives inside run_trials() so B2 inherits it for free
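The "trial-level failures do not raise" rule from the run_trials() docstring can be illustrated with a small sketch. The names classify_trial and run_all are hypothetical, run_once stands in for the whole setup/run/cleanup lifecycle, and two choices here are assumptions the spec does not settle: a timeout is signalled by TimeoutError, and a workload that throws is recorded as workload_failed.

```python
from __future__ import annotations

from typing import Callable


def classify_trial(run_once: Callable[[], bool]) -> str:
    """Run one trial callable and map its outcome to an exit_status string.

    run_once returns the workload's `passed` flag.
    """
    try:
        passed = run_once()
    except TimeoutError:
        return "timeout"
    except Exception:
        # Assumption: a workload that throws is recorded as workload_failed.
        return "workload_failed"
    return "ok" if passed else "workload_failed"


def run_all(trials: list[Callable[[], bool]]) -> list[str]:
    """Run every trial; one failing trial never stops the remaining ones."""
    return [classify_trial(trial) for trial in trials]


def _throws() -> bool:
    raise RuntimeError("simulated workload crash")


def _times_out() -> bool:
    raise TimeoutError("simulated timeout")


statuses = run_all([lambda: True, _throws, lambda: False, _times_out, lambda: True])
```

Infrastructure failures such as a missing entry point are deliberately outside this loop: they happen before any trial starts and are allowed to raise to the caller.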

Distributed launch contract

aorta run defaults to single-process execution. Bare aorta run --workload X runs once, no torchrun, no distributed init — that's the floor behavior. Workloads that need multiple ranks opt in via the Workload ABC's class attributes; the user then wraps aorta run with torchrun.

Launch-mode validation (B1's job)

The Workload ABC carries two class attributes:

from typing import ClassVar, Literal

class Workload:
    launch_mode: ClassVar[Literal["single_process", "distributed"]] = "single_process"
    min_world_size: ClassVar[int] = 1   # only consulted when launch_mode == "distributed"

Before calling setup(), B1's dispatcher reads WORLD_SIZE from env (default 1 if unset) and validates against the workload's declaration:

| Declared | WORLD_SIZE | Outcome |
|---|---|---|
| single_process (default) | 1 | ✓ proceed |
| single_process (default) | > 1 | ✗ raise: "workload <name> is single_process; do not wrap with torchrun" |
| distributed, min_world_size=N | ≥ N | ✓ proceed |
| distributed, min_world_size=N | < N | ✗ raise: "workload <name> requires WORLD_SIZE >= N (got W); launch with torchrun --nproc_per_node=N -m aorta run …" |

This catches the four-way footgun (workload mode × launch mode mismatch) at one well-known spot with one consistent error message — workloads don't repeat the check in setup().
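The validation table can be expressed as one function. This is a sketch under assumptions: validate_launch and LaunchModeError are hypothetical names, and the environment is passed in as a mapping (rather than read from os.environ) so the four cases are easy to test. The message text follows the table.

```python
from __future__ import annotations

from typing import Mapping


class LaunchModeError(RuntimeError):
    """Hypothetical error type; the spec fixes only the message text."""


def validate_launch(
    name: str,
    launch_mode: str,
    min_world_size: int,
    env: Mapping[str, str],
) -> int:
    """Check WORLD_SIZE against the workload's declaration; return WORLD_SIZE."""
    world_size = int(env.get("WORLD_SIZE", "1"))  # unset means single process
    if launch_mode == "single_process":
        if world_size > 1:
            raise LaunchModeError(
                f"workload {name} is single_process; do not wrap with torchrun"
            )
    elif world_size < min_world_size:
        # min_world_size is only consulted for distributed workloads.
        raise LaunchModeError(
            f"workload {name} requires WORLD_SIZE >= {min_world_size} "
            f"(got {world_size}); launch with "
            f"torchrun --nproc_per_node={min_world_size} -m aorta run …"
        )
    return world_size
```

In the dispatcher this would be called with the workload class's launch_mode and min_world_size attributes and os.environ, before setup() runs.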

Runtime rules under torchrun

For workloads that DO declare launch_mode = "distributed":

  • Every rank runs the full trial lifecycle (env probe → workload setup/run/cleanup). The workload's setup() calls init_process_group() itself.
  • Only RANK==0 writes trial_<id>.json. Other ranks compute their own WorkloadResult (needed for the local lifecycle), but discard it on the way out.
  • RANK env var unset → treat as rank 0. This is what makes single-process workloads "just work" with no torchrun.
  • Trial-id collisions across ranks are impossible because only rank 0 writes; no per-rank suffix needed.
  • env probe still runs on every rank so the local WorkloadResult is well-formed, but only rank 0's env probe ends up in the persisted TrialResult.env.
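The rank-0-only write rule can be sketched as below. The helper names (current_rank, write_trial_json) are hypothetical, and passing the environment as a mapping is a choice made for testability; the real dispatcher would presumably read os.environ.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any, Mapping


def current_rank(env: Mapping[str, str]) -> int:
    """RANK unset is treated as rank 0 (the single-process default)."""
    return int(env.get("RANK", "0"))


def write_trial_json(
    results_dir: str | Path,
    workload: str,
    trial_id: str,
    payload: dict[str, Any],
    env: Mapping[str, str],
) -> Path | None:
    """Write <results-dir>/<workload>/trial_<id>.json on rank 0 only."""
    if current_rank(env) != 0:
        return None  # non-zero ranks discard their result
    out_dir = Path(results_dir) / workload
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"trial_{trial_id}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path


# Simulate two ranks of the same trial writing into a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_trial_json(tmp, "fsdp", "0", {"workload": "fsdp"}, {"RANK": "0"})
    skipped = write_trial_json(tmp, "fsdp", "0", {"workload": "fsdp"}, {"RANK": "1"})
    files = sorted(p.name for p in (Path(tmp) / "fsdp").iterdir())
```

Because only one rank ever reaches the write, no file locking or per-rank filename suffix is needed.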

Single-node example (8 GPUs, one node)

torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100
# Writes results/fsdp/trial_0.json once (from global rank 0).

Multi-node works the same way — aorta run is launch-agnostic, so use whatever multi-node torchrun invocation your team is set up for. Only global rank 0 writes the trial JSON regardless of node count. MVP testing is single-node; multi-node verification deferred until a real multi-node consumer asks for it.

TrialResult schema

{
  "schema_version": "0.1",
  "trial_id": "fsdp_d0_m0_t0",
  "workload": "fsdp",
  "execution_env": {
    "kind": "docker",                          // mirrors Environment descriptor: "docker" | "venv" | "rocm" | "local"
    "name": "<environment-name>",              // the registered name resolved through aorta.registry
    "image": "<image-ref>",                    // when Environment.docker is set
    "digest": "sha256:...",                    // when docker inspect resolves it; null otherwise
    "venv": null,                              // when Environment.venv is set
    "rocm": null,                              // when Environment.rocm is set
    "source_package": "aorta"                  // which package registered this environment (B3 plumbs this)
  },
  "mitigations_applied": ["tf32_off"],
  "config": {...},
  "env": { ... full env.json ... },
  "result": {
    "passed": true,
    "failure_count": 0,
    "first_failure_iteration": null,
    "failure_details": [],
    "total_iterations": 5000,
    "step_times_ms": [...],
    "elapsed_sec": 412.5,
    "metrics": {...}
  },
  "wall_clock_sec": 425.1,
  "exit_status": "ok"
}

exit_status is one of {"ok", "workload_failed", "infrastructure_failed", "timeout"}.

Schema is unstable (0.1) until at least one external consumer pins it. Field renames and additions are allowed without a major-version bump during MVP. Bump to 1.0 when triage validation lands AND an external reader (downstream tool, analysis notebook outside the team, customer script) starts depending on the shape.

execution_env.kind is derived from the resolved Environment descriptor: "docker" when Environment.docker is set, "venv" when only venv is set, "rocm" when only rocm is set, and "local" for the default local environment with no overrides. The wrapping object exists so future kinds (slurm, singularity, conda) can be added without a v2 schema. image and digest populate when Environment.docker resolves; digest is filled best-effort via docker inspect and is null if unresolved. Never block the trial on digest resolution.
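A sketch of how the execution_env block might be built from a resolved descriptor. EnvironmentSketch is a hypothetical stand-in for B3's Environment type (its real fields are not defined in this issue), and treating "only rocm set" as kind "rocm" is an assumption inferred from the list of kind values in the schema comment.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EnvironmentSketch:
    """Hypothetical stand-in for B3's Environment descriptor."""

    name: str
    docker: str | None = None
    venv: str | None = None
    rocm: str | None = None
    source_package: str = "aorta"


def derive_kind(env: EnvironmentSketch) -> str:
    """docker takes precedence, then venv, then rocm; otherwise local."""
    if env.docker is not None:
        return "docker"
    if env.venv is not None:
        return "venv"
    if env.rocm is not None:
        return "rocm"  # assumption: inferred from the schema's kind values
    return "local"


def build_execution_env(
    env: EnvironmentSketch, digest: str | None = None
) -> dict[str, Any]:
    """Build the execution_env block; digest is best-effort and may be None."""
    return {
        "kind": derive_kind(env),
        "name": env.name,
        "image": env.docker,
        # A digest is only meaningful when a docker image is set.
        "digest": digest if env.docker is not None else None,
        "venv": env.venv,
        "rocm": env.rocm,
        "source_package": env.source_package,
    }
```

The digest is taken as an optional argument so that a failed docker inspect simply passes None and the trial proceeds.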

Acceptance criteria

  • aorta run --workload fsdp --trials 2 writes results/fsdp/trial_0.json and results/fsdp/trial_1.json (default --environment local, default --mitigations none)
  • aorta run --workload fsdp --trials 2 --environment local --mitigations tf32_off writes 2 JSONs whose mitigations_applied is ["tf32_off"] and whose env.env_vars.DISABLE_TF32 is "1"
  • No --dockers flag. Multi-environment fan-out is aorta triage's job; B1 takes a single --environment NAME. Verified by reading the Click handler: only one environment-related option exists, and it accepts a single string.
  • Mitigation resolution routes through aorta.registry: dispatcher imports get_mitigation from aorta.registry (not from any plugin package). Verified by grep on src/aorta/run/.
  • Environment resolution routes through aorta.registry: dispatcher imports get_environment from aorta.registry; the resolved Environment descriptor populates TrialResult.execution_env.
  • Mitigation env-var union order: with --mitigations a,b, env-var bundles are unioned in list order (later names override earlier). With --extra-env KEY=VAL, extra_env is layered on top of the union. Verified by a unit test passing two registered mitigations whose env vars conflict and asserting the second wins, then setting extra_env to override and asserting extra_env wins.
  • Unknown name surfaces B3's error: aorta run --workload fsdp --mitigations not_a_real_thing --trials 1 exits non-zero with B3's UnknownMitigationError message text (which lists available names). Same for --environment not_a_real_env.
  • Mitigation injection end-to-end (after B3 lands): aorta run --workload fsdp --mitigations tf32_off --environment local --trials 1 injects DISABLE_TF32=1 into the workload's env via aorta.registry.get_mitigation. Once a downstream plugin package registers entries, the same command with that plugin's --environment <name> resolves the descriptor without B1 changing.
  • Per-workload result tracking — different workloads write to results/<workload>/, never aggregated
  • Resilient: one trial failing (workload returns passed=False OR throws OR times out) does NOT kill remaining trials. Mark exit_status accordingly.
  • env probe runs once per trial (not once per command) — captures any per-trial env drift
  • env probe is the A1 library call: dispatcher imports collect_env from aorta.instrumentation.environment and uses its return value. Verified by: (a) a grep showing collect_env is imported in dispatcher.py; (b) no subprocess invocation of aorta env probe anywhere under src/aorta/run/.
  • env probe failure does NOT kill the trial — collect_env() itself never raises (A1 contract); if its returned snapshot has partial=True, the trial proceeds and TrialResult.env records the partial snapshot. Trials never carry env: null unless every probe failed AND the snapshot itself was unobtainable (which A1 contractually excludes).
  • Workload not found in entry-points → clear error message listing available workloads
  • CLI flag --workload matches entry-point name exactly (case-sensitive)
  • --collect flag reserved (MVP no-op): aorta run --workload fsdp --collect rocprof --trials 1 parses the flag and validates each name against the known-recipe set; unknown names raise a clear error listing valid names. Known names are accepted and silently no-op (no recipes implemented in MVP). Verified by unit tests: (a) known name parses into the request without error; (b) unknown name raises with available list; (c) known name does NOT cause subprocess wrapping or extra files written.
  • execution_env block in TrialResult: each trial JSON's execution_env populates kind, name, image/digest/venv/rocm (whichever the resolved Environment carries), and source_package. The top-level docker field from earlier drafts is gone. Verified by schema round-trip test asserting key paths.
  • Python API exposed for B2: from aorta.run.dispatcher import run_trials, RunRequest works. cli/run.py's Click handler builds a RunRequest from CLI args and calls run_trials() — it contains no orchestration logic of its own. Verified by: (a) a unit test that imports run_trials() directly, calls it with a RunRequest against a mock workload, and asserts a list[TrialResult] of the expected length is returned; (b) the Click handler in cli/run.py is under ~30 lines and contains no for trial in range(...) loop.
  • Launch-mode validation: dispatcher reads Workload.launch_mode / Workload.min_world_size and WORLD_SIZE env (default 1) before setup(). single_process workloads under torchrun (WORLD_SIZE > 1) raise with the "do not wrap with torchrun" message; distributed workloads with WORLD_SIZE < min_world_size raise with the "launch with torchrun" message naming the required N. Both errors fire BEFORE the workload's setup() runs.
  • Rank-aware JSON writes: under torchrun (RANK env var set), every rank runs the trial lifecycle but only RANK==0 writes trial_<id>.json. With RANK unset, treated as rank 0 (single-process default). Verified by launching torchrun --nproc_per_node=2 -m aorta run --workload fsdp --trials 1 and confirming exactly one trial_0.json is written.
  • Plugin discovery validated end-to-end across package boundaries: with aorta editable-installed alongside a second package that registers a workload via the aorta.workloads entry-point group (e.g. pip install -e <path-to-other-package>), aorta run --workload <workload-name> --trials 1 against that externally-registered workload discovers and dispatches it correctly. Direct dispatch IS the proof — if it runs and writes a trial result, the entry-point bridge works. This is the canonical end-to-end check that the aorta.workloads entry-point bridge spans package boundaries — not just in-tree workloads.
  • Tests cover: dispatcher loop, result schema, entry-point discovery (with both in-tree and externally-installed packages), mitigation env injection, timeout handling

Out of scope (P1+)

  • --search flag for adaptive perf search — Optuna integration is P1
  • Retry-on-failure logic (workload owns its retries)
  • Container lifecycle (docker pull, run, etc.) — aorta run does NOT manage containers. The --environment value resolves to a descriptor (image ref, venv path, ROCm version) that lands in TrialResult.execution_env as a label; the user is responsible for actually being inside that environment when they invoke aorta run. Multi-environment fan-out (running the same workload across N environments) is aorta triage's job, not B1's.
  • Running multiple trials in parallel (sequential, one-trial-at-a-time execution is fine for MVP)

How to test

Tests are split into two groups by what dependencies they need:

B1-dev tests (runnable while building B1, no other tasks needed)

These verify the dispatcher in isolation. Use the in-tree fsdp workload (after the FSDP workload task lands) or a tiny test-only workload registered via a local entry-point in tests/.

# Distributed workload, single node, 8 GPUs — exercises launch_mode="distributed"
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100
ls results/fsdp/   # exactly one trial_0.json (only global rank 0 wrote it)

# Mismatch: bare `aorta run` against a distributed workload — should raise BEFORE setup()
# with: "workload fsdp requires WORLD_SIZE >= N (got 1); launch with torchrun --nproc_per_node=N -m aorta run …"
aorta run --workload fsdp --trials 1   # expect: clean error, no partial trial JSON

# Mismatch: torchrun-wrapping a single_process workload — should raise BEFORE setup()
# with: "workload <name> is single_process; do not wrap with torchrun"
torchrun --nproc_per_node=2 -m aorta run --workload <single_process_workload> --trials 1

# Multi-trial, single environment (default --environment local)
aorta run --workload fsdp --trials 2
ls results/fsdp/   # 2 JSONs

# Explicit environment label (resolved through aorta.registry — no docker management)
aorta run --workload fsdp --trials 2 --environment local
ls results/fsdp/   # 2 JSONs whose execution_env.name == "local"

# Unknown environment name → B3 error surfaces with available list
aorta run --workload fsdp --trials 1 --environment not_a_real_env
# expect: UnknownEnvironmentError with sorted list of registered names

# Unknown mitigation name → B3 error surfaces with available list
aorta run --workload fsdp --trials 1 --mitigations not_a_real_thing
# expect: UnknownMitigationError with sorted list of registered names

# Workload-not-found error
aorta run --workload nonexistent --trials 1
# Should error cleanly listing available workloads, not stack-trace

Plus pytest-level unit tests under tests/run/:

  • Dispatcher loop (mock workload, assert lifecycle ordering)
  • TrialResult schema round-trip
  • Entry-point discovery (in-tree workload only)
  • Mitigation env-injection against a fake registry dict (no external registry needed)
  • Timeout handling
  • Launch-mode validation table (4 cases — each declared mode × WORLD_SIZE 1/N)
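The lifecycle-ordering test in the list above could be approached with a recording fake. Both names here are hypothetical: RecordingWorkload is a test double, and run_lifecycle stands in for the relevant slice of the dispatcher. Running cleanup() in a finally block is an assumption; the spec states the call order but not the behavior when run() raises.

```python
from __future__ import annotations

from typing import Any


class RecordingWorkload:
    """Test double that records the order of lifecycle calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def setup(self) -> None:
        self.calls.append("setup")

    def run(self) -> dict[str, Any]:
        self.calls.append("run")
        return {"passed": True}

    def cleanup(self) -> None:
        self.calls.append("cleanup")


def run_lifecycle(workload: Any) -> dict[str, Any]:
    """Call setup, run, cleanup in order and return run()'s result."""
    workload.setup()
    try:
        return workload.run()
    finally:
        # Assumption: cleanup still runs if run() raises.
        workload.cleanup()


fake = RecordingWorkload()
outcome = run_lifecycle(fake)
```

A pytest test would then assert that fake.calls equals ["setup", "run", "cleanup"] and that outcome carries the workload's passed flag.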

Integration verification (run when downstream tasks land)

Each block below is gated on its prerequisite. Don't expect to run these during B1 development.

# When a second package registers a workload via the aorta.workloads entry-point
# group: install both packages and dispatch — direct dispatch IS the proof that
# the entry-point bridge spans package boundaries.
pip install -e <path-to-aorta>
pip install -e <path-to-other-package>
aorta run --workload <workload-name> --trials 1
ls results/<workload-name>/   # expect: trial_0.json
jq '.workload' results/<workload-name>/trial_0.json   # expect: "<workload-name>"

# After B3 lands: end-to-end mitigation + environment injection via the real registry
aorta run --workload <wl> --mitigations tf32_off --environment local --trials 1 --steps 200
jq '.mitigations_applied'           results/<wl>/trial_0.json   # ["tf32_off"]
jq '.env.env_vars.DISABLE_TF32'     results/<wl>/trial_0.json   # "1"
jq '.execution_env.name'            results/<wl>/trial_0.json   # "local"
jq '.execution_env.source_package'  results/<wl>/trial_0.json   # "aorta"

# Same command targeting an environment registered by a downstream plugin package
# resolves the descriptor (image ref / venv / ROCm version) via the entry-point bridge.
# Without that plugin installed, B3 raises UnknownEnvironmentError — proving the public
# package has zero hard imports of any plugin's package.

PR template

Title: B1: aorta run — universal workload runner
Body: include sample TrialResult JSON, confirm per-workload tracking, demo of resilience to single-trial failure.
