B1: aorta run

## Goal

Universal workload runner that works identically for first-party and third-party workloads via the `aorta.workloads` entry-point group. Owns the per-trial result schema and persistence.

## Phasing — B1.0 (interface stub) → B1.1 (implementation)

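B1 ships as **two PRs against this single B1 issue** so downstream work (the triage matrix scaffold and the workload tasks) can start against real types instead of local stubs. Only the B1.1 PR uses `Closes #<B1-issue>`; the B1.0 PR links with a non-closing reference such as `Refs #<B1-issue>`. GitHub closes a linked issue as soon as any PR carrying a closing keyword merges into the default branch, regardless of other open PRs that reference it, so a `Closes` in B1.0 would close the issue before the implementation lands.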

### B1.0 — interface stub (~½ day, 1 PR)

PR title: `B1.0: aorta run interface stub (TrialResult, RunRequest, run_trials signature)`

Deliverables:
- `src/aorta/run/results.py`: `TrialResult` dataclass with the full schema documented below (frozen, with all fields), plus JSON round-trip helpers (`to_dict`, `from_dict`).
- `src/aorta/run/dispatcher.py`: `RunRequest` dataclass + `run_trials(request: RunRequest) -> list[TrialResult]` whose body is `raise NotImplementedError("B1.1")`.
- `src/aorta/run/collectors.py`: `KNOWN_RECIPES: frozenset[str]` constant (see "Collector flag reservation" section).
- `src/aorta/cli/run.py`: Click handler that parses all CLI args (including `--collect` with `KNOWN_RECIPES` validation), builds a `RunRequest`, and calls `run_trials()`. The handler propagates the `NotImplementedError` for now, which is fine; B1.1 fills in the body.

Acceptance criteria for B1.0 (subset of the full list below):
- [ ] `from aorta.run.dispatcher import run_trials, RunRequest` imports cleanly
- [ ] `from aorta.run.results import TrialResult` imports cleanly
- [ ] `aorta run --workload fsdp --collect rocprof --trials 1` parses and reaches `run_trials()` (which raises `NotImplementedError`); `--collect bogus` rejects at parse time with available-name list
- [ ] `run_trials()` body is exactly `raise NotImplementedError("B1.1")` with a non-empty docstring
- [ ] `TrialResult` round-trips through `to_dict()` / `from_dict()` losslessly on a hand-built fixture
- [ ] No subprocess code, no env probe call, no entry-point lookup yet — those land in B1.1

### B1.1 — implementation (~2-3 days, 1 PR)

PR title: `B1.1: aorta run implementation`. PR body references B1.0's PR.

Deliverables: everything else in this spec — the dispatcher loop, mitigation injection, env probe wiring, JSON write-out, launch-mode validation, rank-aware writes, plugin discovery, all integration tests.

### Why split it

- **B1.0 PR merge is the unblock signal** for downstream task owners — when it lands, switch from local stubs to real imports.
- **Same-day unblock.** B1.0 lands day 1; downstream work codes against real imports from day 2 onward instead of stubbing in their own files.
- **Reviews split cleanly.** B1.0 review asks "is this contract right?" and is fast. B1.1 review asks "is the implementation correct?" and is slower, focusing on subprocess handling, env injection, and error paths. A single combined PR would force reviewers to do both at once.

## Why this matters

A recurring pattern in numerical / correctness investigations on GPU stacks: weeks of effort go into characterizing a problem, but progress only inflects once a standalone, parameterized reproducer exists that can be run unattended across docker images and mitigation combinations. The platform's job is to make "run this workload across N trials × M dockers" a one-liner. `aorta run` is that one-liner.

The platform never owns the workload itself, only the orchestration around it. Workload owners (first-party or third-party) implement the `Workload` ABC and register via the `aorta.workloads` entry-point group.

## Files to create / modify

```
src/aorta/cli/run.py                # MODIFY — replace ClickException with real call
src/aorta/run/__init__.py           # NEW (empty)
src/aorta/run/dispatcher.py         # NEW — main orchestration loop
src/aorta/run/collectors.py         # NEW (KNOWN_RECIPES constant for --collect validation)
src/aorta/run/results.py            # NEW — TrialResult schema + JSON writers
src/aorta/run/discovery.py          # NEW — entry-point lookup helper
tests/run/__init__.py               # NEW (empty)
tests/run/test_dispatcher.py        # NEW
tests/run/test_results.py           # NEW
```

## Behavior

```
aorta run --workload <name>
          --trials N
          [--environment NAME]        # registered environment name (default: "local"); resolved via aorta.registry
          [--mitigations m1,m2]       # default: ["none"] (baseline); each name resolved via aorta.registry
          [--extra-env KEY=VAL,...]   # ad-hoc env override applied AFTER mitigation env-vars (one-off experiments)
          [--collect r1,r2]           # in-trial capture recipes — name-validating no-op in MVP, see "Collector flag reservation" below
          [--steps S]                 # workload-specific override
          [--results-dir results]     # default
```

A single `aorta run` invocation is **one cell** (one environment, one mitigation set, N trials). Multi-environment fan-out lives in `aorta triage` — B1 stays the per-cell unit so triage can call `run_trials()` once per cell without inheriting axis logic.

Per trial in the request:
1. Capture environment by calling `aorta.instrumentation.environment.collect_env()` (the A1 library function) → returns an `EnvSnapshot` embedded in the trial's `TrialResult.env`. **NEVER** shell out to `aorta env probe`, and never re-implement env capture inside the dispatcher — there is exactly one env-probe code path and B1 calls it as a function.
2. Resolve mitigations and environment through B3's public resolver:
   - `from aorta.registry import get_mitigation, get_environment`
   - For each name in `request.mitigations`, call `get_mitigation(name)` and merge the returned `dict[str, str]` env-var bundles in list order, so a later mitigation overrides an earlier one on conflicting keys.
   - Call `get_environment(request.environment)` to obtain the `Environment` descriptor (docker / venv / rocm).
   - Apply the unioned mitigation env-vars on top of the current process environment, then layer `request.extra_env` on top (one-off override path).
   - Unknown names raise B3's `UnknownMitigationError` / `UnknownEnvironmentError`; the dispatcher does NOT catch — surfaces directly so the CLI exits non-zero with B3's actionable message.
3. Discover workload class via `importlib.metadata.entry_points(group="aorta.workloads")`
4. Instantiate `WorkloadClass(config)` where config = workload defaults + CLI overrides
5. Call `setup()` → `run()` → `cleanup()`
6. Wrap `WorkloadResult` in `TrialResult` (adds trial_id, environment, mitigations applied, env snapshot, wall-clock, exit info)
7. Write `<results-dir>/<workload>/trial_<id>.json`

## Collector flag reservation (`--collect`, MVP no-op)

`aorta run` owns the live training subprocess and is therefore where in-trial data-capture recipes (e.g., rocprofv3, numerics dumps, AMD_LOG instrumentation) must attach — they wrap the subprocess from inside `run`'s already-existing process tree. Recipes themselves are P1 — but B1 reserves the CLI surface now so P1 doesn't have to refactor `aorta run`'s argument parsing.

MVP behavior:
- `--collect <name1>,<name2>` parses into a tuple of recipe names.
- Names are validated against a known set (`{"rocprof", "numerics", "amd_log"}`; these are placeholders). The set is defined once, in `aorta.run.collectors.KNOWN_RECIPES`, so the list has one home.
- Unknown names raise a clear error listing valid names.
- Known names are accepted and **silently no-op** in MVP. No subprocess wrapping, no extra files written. The flag is reserved surface only.
- A separate top-level `aorta bundle <results-dir>` command (P1) handles post-hoc artifact packaging — NOT a `--collect` recipe and NOT in scope for B1.

## Python API contract (consumed by B2)

B2 (`aorta triage`) calls B1's dispatcher **in-process** — NOT via subprocess. To make this clean, B1 exposes a single public function alongside the Click handler:

```python
# src/aorta/run/dispatcher.py
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

from aorta.run.results import TrialResult


@dataclass(frozen=True)
class RunRequest:
    workload: str
    trials: int
    environment: str = "local"             # registered name; resolved via aorta.registry.get_environment
    mitigations: tuple[str, ...] = ("none",)  # registered names; each resolved via aorta.registry.get_mitigation
    extra_env: dict[str, str] = field(default_factory=dict)  # one-off override applied after mitigation env-vars
    collect: tuple[str, ...] = ()          # recipe names validated against KNOWN_RECIPES; no-op in MVP
    steps: int | None = None
    config_overrides: dict[str, Any] = field(default_factory=dict)
    results_dir: Path = Path("results")


def run_trials(request: RunRequest) -> list[TrialResult]:
    """Run N trials for a single (workload, environment, mitigation-set) combination.

    Returns the in-memory TrialResult list. JSONs are still written to disk
    as a side-effect (per the per-workload tracking criterion).

    Trial-level failures (workload returns passed=False, throws, or times out)
    surface as TrialResult entries with exit_status set accordingly; they do
    NOT raise. Infrastructure failures that prevent any trial from running
    (e.g., workload entry-point not found) raise so the caller can decide what
    to do (CLI exits non-zero; B2 marks the cell as `error` and continues).
    """
```

The Click handler in `cli/run.py` becomes a **thin shell**: parse CLI args → build `RunRequest` → call `run_trials()` → derive exit code from results. **All orchestration logic lives in `run_trials()`.** No business logic in the Click handler.

Why this matters:
- B2's runner calls `run_trials()` once per cell — no subprocess overhead, no JSON parse round-trip, one Python process for the whole matrix
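- Infrastructure failures surface as Python exceptions that B2 can catch and report per cell; workload failures arrive as `TrialResult` entries, so no traceback parsing is needed
- Unit tests call `run_trials()` directly against a mock workload; no subprocess plumbing in tests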
- Single source of truth for "what does one logical run do?" — both CLI and triage go through the same function
- Distributed launch validation (next section) lives inside `run_trials()` so B2 inherits it for free
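
The spec says the thin handler derives its exit code from the results but does not pin the mapping. One plausible reading, assumed here and not confirmed by the spec, is "exit 0 only when every trial reports `ok`". The helper name `exit_code_for` is hypothetical.

```python
from collections.abc import Iterable


def exit_code_for(exit_statuses: Iterable[str]) -> int:
    """Map per-trial exit_status values to a process exit code.

    Assumption: any status other than "ok" makes the whole invocation fail.
    """
    return 0 if all(status == "ok" for status in exit_statuses) else 1
```

The Click handler would then reduce to roughly `results = run_trials(request)` followed by `sys.exit(exit_code_for(r.exit_status for r in results))`, which keeps the trial loop out of `cli/run.py`.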

## Distributed launch contract

`aorta run` defaults to **single-process** execution. Bare `aorta run --workload X` runs once, no torchrun, no distributed init — that's the floor behavior. Workloads that need multiple ranks opt in via the `Workload` ABC's class attributes; the user then wraps `aorta run` with `torchrun`.

### Launch-mode validation (B1's job)

The `Workload` ABC carries two class attributes:

```python
class Workload:
    launch_mode: ClassVar[Literal["single_process", "distributed"]] = "single_process"
    min_world_size: ClassVar[int] = 1   # only consulted when launch_mode == "distributed"
```

Before calling `setup()`, B1's dispatcher reads `WORLD_SIZE` from env (default 1 if unset) and validates against the workload's declaration:

| Declared | `WORLD_SIZE` | Outcome |
|---|---|---|
| `single_process` (default) | 1 | ✓ proceed |
| `single_process` (default) | > 1 | ✗ raise: *"workload `<name>` is single_process; do not wrap with torchrun"* |
| `distributed`, `min_world_size=N` | ≥ N | ✓ proceed |
| `distributed`, `min_world_size=N` | < N | ✗ raise: *"workload `<name>` requires WORLD_SIZE >= N (got W); launch with `torchrun --nproc_per_node=N -m aorta run …`"* |

This catches the four-way footgun (workload mode × launch mode mismatch) at one well-known spot with one consistent error message — workloads don't repeat the check in `setup()`.
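
The validation table can be sketched as a single function. This is an illustrative version under assumptions: `validate_launch` and `LaunchModeError` are hypothetical names, and the environment is passed in as a mapping rather than read from `os.environ` so each row of the table is easy to test.

```python
from collections.abc import Mapping


class LaunchModeError(RuntimeError):
    """Hypothetical error type for workload mode / launch mode mismatches."""


def validate_launch(
    name: str, launch_mode: str, min_world_size: int, env: Mapping[str, str]
) -> None:
    """Raise before setup() if WORLD_SIZE contradicts the workload's declaration."""
    world_size = int(env.get("WORLD_SIZE", "1"))  # unset means single process
    if launch_mode == "single_process":
        if world_size > 1:
            raise LaunchModeError(
                f"workload `{name}` is single_process; do not wrap with torchrun"
            )
        return
    # launch_mode == "distributed": min_world_size is only consulted here.
    if world_size < min_world_size:
        raise LaunchModeError(
            f"workload `{name}` requires WORLD_SIZE >= {min_world_size} "
            f"(got {world_size}); launch with "
            f"`torchrun --nproc_per_node={min_world_size} -m aorta run …`"
        )
```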

### Runtime rules under torchrun

For workloads that DO declare `launch_mode = "distributed"`:

- **Every rank runs the full trial lifecycle** (env probe → workload setup/run/cleanup). The workload's `setup()` calls `init_process_group()` itself.
- **Only `RANK==0` writes `trial_<id>.json`.** Other ranks compute their own `WorkloadResult` (needed for the local lifecycle), but discard it on the way out.
- **`RANK` env var unset → treat as rank 0.** This is what makes single-process workloads "just work" with no torchrun.
- **Trial-id collisions across ranks are impossible** because only rank 0 writes; no per-rank suffix needed.
- **env probe still runs on every rank** so each rank's in-memory `TrialResult` is well-formed, but only rank 0's snapshot ends up in the persisted `TrialResult.env`.

### Single-node example (8 GPUs, one node)
```bash
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100
# Writes results/fsdp/trial_0.json once (from global rank 0).
```

Multi-node works the same way — `aorta run` is launch-agnostic, so use whatever multi-node `torchrun` invocation your team is set up for. Only global rank 0 writes the trial JSON regardless of node count. MVP testing is single-node; multi-node verification deferred until a real multi-node consumer asks for it.

## TrialResult schema

```json
{
  "schema_version": "0.1",
  "trial_id": "fsdp_d0_m0_t0",
  "workload": "fsdp",
  "execution_env": {
    "kind": "docker",                          // mirrors Environment descriptor: "docker" | "venv" | "rocm" | "local"
    "name": "<environment-name>",              // the registered name resolved through aorta.registry
    "image": "<image-ref>",                    // when Environment.docker is set
    "digest": "sha256:...",                    // when docker inspect resolves it; null otherwise
    "venv": null,                              // when Environment.venv is set
    "rocm": null,                              // when Environment.rocm is set
    "source_package": "aorta"                  // which package registered this environment (B3 plumbs this)
  },
  "mitigations_applied": ["tf32_off"],
  "config": {...},
  "env": { ... full env.json ... },
  "result": {
    "passed": true,
    "failure_count": 0,
    "first_failure_iteration": null,
    "failure_details": [],
    "total_iterations": 5000,
    "step_times_ms": [...],
    "elapsed_sec": 412.5,
    "metrics": {...}
  },
  "wall_clock_sec": 425.1,
  "exit_status": "ok"
}
```

`exit_status` ∈ `{"ok", "workload_failed", "infrastructure_failed", "timeout"}`.

**Schema is unstable (`0.1`)** until at least one external consumer pins it. Field renames and additions are allowed without a major-version bump during MVP. Bump to `1.0` when triage validation lands AND an external reader (downstream tool, analysis notebook outside the team, customer script) starts depending on the shape.

`execution_env.kind` is derived from the resolved `Environment` descriptor: `"docker"` when `Environment.docker` is set, `"venv"` when only `venv` is set, `"rocm"` when only `rocm` is set, and `"local"` for the default `local` environment with no overrides. The wrapping object exists so future kinds (`slurm`, `singularity`, `conda`) can be added without a v2 schema. `image`/`digest` populate when `Environment.docker` resolves; `digest` is filled best-effort via `docker inspect` and is `null` if unresolved. Never block the trial on digest resolution.
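
To make the round-trip requirement concrete, here is a trimmed sketch of a frozen dataclass with `to_dict` / `from_dict`. It is not the full schema: `TrialResultSketch` is a hypothetical name, the `config` and `env` blocks are omitted for brevity, and nested objects are kept as plain dicts, which is an assumption about how the real `TrialResult` might model them.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class TrialResultSketch:
    trial_id: str
    workload: str
    execution_env: dict[str, Any]
    mitigations_applied: tuple[str, ...]
    result: dict[str, Any]
    wall_clock_sec: float
    exit_status: str
    schema_version: str = "0.1"

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # JSON has no tuple type; emit a list so the dict matches the wire shape.
        data["mitigations_applied"] = list(self.mitigations_applied)
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "TrialResultSketch":
        fields = dict(data)
        # Restore the tuple so the rebuilt object compares equal to the original.
        fields["mitigations_applied"] = tuple(fields["mitigations_applied"])
        return cls(**fields)
```

The tuple/list conversion is the one place where a naive `asdict` round trip through JSON would lose equality, so the round-trip test should include a non-empty `mitigations_applied`.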

## Acceptance criteria

- [ ] `aorta run --workload fsdp --trials 2` writes `results/fsdp/trial_0.json` and `results/fsdp/trial_1.json` (default `--environment local`, default `--mitigations none`)
- [ ] `aorta run --workload fsdp --trials 2 --environment local --mitigations tf32_off` writes 2 JSONs whose `mitigations_applied` is `["tf32_off"]` and whose `env.env_vars.DISABLE_TF32` is `"1"`
- [ ] **No `--dockers` flag.** Multi-environment fan-out is `aorta triage`'s job; B1 takes a single `--environment NAME`. Verified by reading the Click handler: only one environment-related option exists, and it accepts a single string.
- [ ] **Mitigation resolution routes through `aorta.registry`**: dispatcher imports `get_mitigation` from `aorta.registry` (not from any plugin package). Verified by grep on `src/aorta/run/`.
- [ ] **Environment resolution routes through `aorta.registry`**: dispatcher imports `get_environment` from `aorta.registry`; the resolved `Environment` descriptor populates `TrialResult.execution_env`.
- [ ] **Mitigation env-var union order**: with `--mitigations a,b`, env-var bundles are unioned in list order (later names override earlier). With `--extra-env KEY=VAL`, `extra_env` is layered on top of the union. Verified by a unit test passing two registered mitigations whose env vars conflict and asserting the second wins, then setting `extra_env` to override and asserting `extra_env` wins.
- [ ] **Unknown name surfaces B3's error**: `aorta run --workload fsdp --mitigations not_a_real_thing --trials 1` exits non-zero with B3's `UnknownMitigationError` message text (which lists available names). Same for `--environment not_a_real_env`.
- [ ] **Mitigation injection end-to-end (after B3 lands)**: `aorta run --workload fsdp --mitigations tf32_off --environment local --trials 1` injects `DISABLE_TF32=1` into the workload's env via `aorta.registry.get_mitigation`. Once a downstream plugin package registers entries, the same command with that plugin's `--environment <name>` resolves the descriptor without B1 changing.
- [ ] **Per-workload result tracking** — different workloads write to `results/<workload>/`, never aggregated
- [ ] **Resilient**: one trial failing (workload returns `passed=False` OR throws OR times out) does NOT kill remaining trials. Mark `exit_status` accordingly.
- [ ] env probe runs once per trial (not once per command) — captures any per-trial env drift
- [ ] **env probe is the A1 library call**: dispatcher imports `collect_env` from `aorta.instrumentation.environment` and uses its return value. Verified by: (a) a grep showing `collect_env` is imported in `dispatcher.py`; (b) no `subprocess` invocation of `aorta env probe` anywhere under `src/aorta/run/`.
- [ ] env probe failure does NOT kill the trial — `collect_env()` itself never raises (A1 contract); if its returned snapshot has `partial=True`, the trial proceeds and `TrialResult.env` records the partial snapshot. Trials never carry `env: null` unless every probe failed AND the snapshot itself was unobtainable (which A1 contractually excludes).
- [ ] Workload not found in entry-points → clear error message listing available workloads
- [ ] CLI flag `--workload` matches entry-point name exactly (case-sensitive)
- [ ] **`--collect` flag reserved (MVP no-op)**: `aorta run --workload fsdp --collect rocprof --trials 1` parses the flag and validates each name against the known-recipe set; unknown names raise a clear error listing valid names. Known names are accepted and silently no-op (no recipes implemented in MVP). Verified by unit tests: (a) known name parses into the request without error; (b) unknown name raises with available list; (c) known name does NOT cause subprocess wrapping or extra files written.
- [ ] **`execution_env` block in TrialResult**: each trial JSON's `execution_env` populates `kind`, `name`, `image`/`digest`/`venv`/`rocm` (whichever the resolved Environment carries), and `source_package`. The top-level `docker` field from earlier drafts is gone. Verified by schema round-trip test asserting key paths.
- [ ] **Python API exposed for B2**: `from aorta.run.dispatcher import run_trials, RunRequest` works. `cli/run.py`'s Click handler builds a `RunRequest` from CLI args and calls `run_trials()` — it contains no orchestration logic of its own. Verified by: (a) a unit test that imports `run_trials()` directly, calls it with a `RunRequest` against a mock workload, and asserts a `list[TrialResult]` of the expected length is returned; (b) the Click handler in `cli/run.py` is under ~30 lines and contains no `for trial in range(...)` loop.
- [ ] **Launch-mode validation**: dispatcher reads `Workload.launch_mode` / `Workload.min_world_size` and `WORLD_SIZE` env (default 1) before `setup()`. `single_process` workloads under torchrun (`WORLD_SIZE > 1`) raise with the "do not wrap with torchrun" message; `distributed` workloads with `WORLD_SIZE < min_world_size` raise with the "launch with torchrun" message naming the required N. Both errors fire BEFORE the workload's `setup()` runs.
- [ ] **Rank-aware JSON writes**: under torchrun (`RANK` env var set), every rank runs the trial lifecycle but only `RANK==0` writes `trial_<id>.json`. With `RANK` unset, treated as rank 0 (single-process default). Verified by launching `torchrun --nproc_per_node=2 -m aorta run --workload fsdp --trials 1` and confirming exactly one `trial_0.json` is written.
- [ ] **Plugin discovery validated end-to-end across package boundaries**: with `aorta` editable-installed alongside a second package that registers a workload via the `aorta.workloads` entry-point group (e.g. `pip install -e <path-to-other-package>`), `aorta run --workload <workload-name> --trials 1` against that externally-registered workload discovers and dispatches it correctly. Direct dispatch IS the proof — if it runs and writes a trial result, the entry-point bridge works. This is the canonical end-to-end check that the `aorta.workloads` entry-point bridge spans package boundaries — not just in-tree workloads.
- [ ] Tests cover: dispatcher loop, result schema, entry-point discovery (with both in-tree and externally-installed packages), mitigation env injection, timeout handling
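
The resilience criterion above can be sketched as a per-trial classifier wrapped in a loop. Several things here are assumptions not fixed by the spec: that a timeout surfaces as the built-in `TimeoutError`, that any other exception from the workload maps to `"workload_failed"`, and the helper names `classify_trial` and `run_all`. Each trial is modelled as a zero-argument callable returning `passed`.

```python
from collections.abc import Callable


def classify_trial(run_once: Callable[[], bool]) -> str:
    """Run one trial and map its outcome to an exit_status string; never raises."""
    try:
        passed = run_once()
    except TimeoutError:
        return "timeout"
    except Exception:
        # Assumption: a workload that throws counts as a workload failure.
        return "workload_failed"
    return "ok" if passed else "workload_failed"


def run_all(trials: list[Callable[[], bool]]) -> list[str]:
    """A failing trial yields a status entry; the remaining trials still run."""
    return [classify_trial(trial) for trial in trials]
```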

## Out of scope (P1+)

- `--search` flag for adaptive perf search — Optuna integration is P1
- Retry-on-failure logic (workload owns its retries)
- Container lifecycle (docker pull, run, etc.) — `aorta run` does NOT manage containers. The `--environment` value resolves to a descriptor (image ref, venv path, ROCm version) that lands in `TrialResult.execution_env` as a label; the user is responsible for actually being inside that environment when they invoke `aorta run`. Multi-environment fan-out (running the same workload across N environments) is `aorta triage`'s job, not B1's.
- Distributed trial parallelism (one trial at a time is fine for MVP)

## How to test

Tests are split into two groups by what dependencies they need:

### B1-dev tests (runnable while building B1, no other tasks needed)

These verify the dispatcher in isolation. Use the in-tree `fsdp` workload (after the FSDP workload task lands) or a tiny test-only workload registered via a local entry-point in `tests/`.

```bash
# Distributed workload, single node, 8 GPUs — exercises launch_mode="distributed"
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 1 --steps 100
ls results/fsdp/   # exactly one trial_0.json (only global rank 0 wrote it)

# Mismatch: bare `aorta run` against a distributed workload; should raise BEFORE setup()
# with: "workload `fsdp` requires WORLD_SIZE >= 2 (got 1); launch with `torchrun --nproc_per_node=2 -m aorta run …`"
aorta run --workload fsdp --trials 1   # expect: clean error, no partial trial JSON

# Mismatch: torchrun-wrapping a single_process workload — should raise BEFORE setup()
# with: "<workload> is single_process; do not wrap with torchrun"
torchrun --nproc_per_node=2 -m aorta run --workload <single_process_workload> --trials 1

# Multi-trial, single environment (default --environment local).
# fsdp is a distributed workload, so it stays wrapped with torchrun (see the mismatch case above).
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 2
ls results/fsdp/   # 2 JSONs

# Explicit environment label (resolved through aorta.registry; no docker management)
torchrun --nproc_per_node=8 -m aorta run --workload fsdp --trials 2 --environment local
ls results/fsdp/   # 2 JSONs whose execution_env.name == "local"

# Unknown environment name → B3 error surfaces with available list
aorta run --workload fsdp --trials 1 --environment not_a_real_env
# expect: UnknownEnvironmentError with sorted list of registered names

# Unknown mitigation name → B3 error surfaces with available list
aorta run --workload fsdp --trials 1 --mitigations not_a_real_thing
# expect: UnknownMitigationError with sorted list of registered names

# Workload-not-found error
aorta run --workload nonexistent --trials 1
# Should error cleanly listing available workloads, not stack-trace
```

Plus pytest-level unit tests under `tests/run/`:
- Dispatcher loop (mock workload, assert lifecycle ordering)
- TrialResult schema round-trip
- Entry-point discovery (in-tree workload only)
- **Mitigation env-injection** against a fake registry dict (no external registry needed)
- Timeout handling
- Launch-mode validation table (4 cases — each declared mode × WORLD_SIZE 1/N)

### Integration verification (run when downstream tasks land)

Each block below is gated on its prerequisite. Don't expect to run these during B1 development.

```bash
# When a second package registers a workload via the aorta.workloads entry-point
# group: install both packages and dispatch — direct dispatch IS the proof that
# the entry-point bridge spans package boundaries.
pip install -e <path-to-aorta>
pip install -e <path-to-other-package>
aorta run --workload <workload-name> --trials 1
ls results/<workload-name>/   # expect: trial_0.json
jq '.workload' results/<workload-name>/trial_0.json   # expect: "<workload-name>"

# After B3 lands: end-to-end mitigation + environment injection via the real registry
aorta run --workload <wl> --mitigations tf32_off --environment local --trials 1 --steps 200
jq '.mitigations_applied'           results/<wl>/trial_0.json   # ["tf32_off"]
jq '.env.env_vars.DISABLE_TF32'     results/<wl>/trial_0.json   # "1"
jq '.execution_env.name'            results/<wl>/trial_0.json   # "local"
jq '.execution_env.source_package'  results/<wl>/trial_0.json   # "aorta"

# Same command targeting an environment registered by a downstream plugin package
# resolves the descriptor (image ref / venv / ROCm version) via the entry-point bridge.
# Without that plugin installed, B3 raises UnknownEnvironmentError — proving the public
# package has zero hard imports of any plugin's package.
```

## PR template

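Title: use the per-phase titles from the Phasing section (`B1.0: aorta run interface stub (TrialResult, RunRequest, run_trials signature)` and `B1.1: aorta run implementation`).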
Body: include sample TrialResult JSON, confirm per-workload tracking, demo of resilience to single-trial failure.

