# A2: FSDP Workload (#149)

## Goal

One in-tree workload that proves the platform's plugin contract works end-to-end. After this, `aorta run --workload fsdp` produces a trial JSON.

## Why this workload first

`src/aorta/race/modes/fsdp.py` is the most workload-shaped code in the repo today, and it exercises the patterns the platform is designed for (multi-stream H2D, all_gather / reduce_scatter, GEMM compute, per-iteration verification). Building `FsdpWorkload` first validates the entire MVP plumbing with the smallest amount of new design.

## Reference material (read for understanding — do NOT mechanically transform)

The race code is being absorbed into the new platform, but **"absorbed" means the engineer reads it to understand the algorithm and writes `FsdpWorkload` fresh against the new `Workload` ABC.** Don't copy + rename — that would carry race-specific naming and BaseReproducer assumptions into the new file. The new implementation is a clean class that happens to do similar work.

| File | Why it's worth reading |
|---|---|
| `src/aorta/race/modes/fsdp.py` | Algorithm reference: per-layer all_gather, reduce_scatter, double-buffered H2D, known-pattern verification |
| `src/aorta/race/base.py` | Run loop reference: warmup + verification phase pattern. The new `FsdpWorkload.run()` borrows the *idea* (warmup, then measured loop) but writes fresh against `WorkloadResult`. |
| `src/aorta/race/compute.py` | Compute simulation primitives. These are not deprecated, so importing and using them is fine; don't copy them into the new file. |

## Files to create / modify

```
src/aorta/workloads/fsdp/__init__.py       # NEW — exports FsdpWorkload
src/aorta/workloads/fsdp/workload.py       # NEW — FsdpWorkload(Workload) class
src/aorta/workloads/fsdp/configs/          # NEW
  default.yaml                             # sample config
tests/workloads/__init__.py                # NEW (empty)
tests/workloads/test_fsdp.py               # NEW — smoke + result schema
```

The `pyproject.toml` entry point already exists from L0:
```toml
[project.entry-points."aorta.workloads"]
fsdp = "aorta.workloads.fsdp:FsdpWorkload"
```
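
For orientation, the entry point above can be resolved with the standard library's `importlib.metadata`. This is only a sketch of how a loader might look up a workload by name; the helper name `load_workload_class` is hypothetical, and how `aorta run` actually discovers plugins is not specified here.

```python
from importlib.metadata import entry_points


def load_workload_class(name: str, group: str = "aorta.workloads"):
    """Return the object registered under `name` in an entry-point group.

    Hypothetical helper for illustration; not part of the aorta codebase.
    """
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a dict keyed by group.
    if hasattr(eps, "select"):
        candidates = eps.select(group=group)
    else:
        candidates = eps.get(group, [])
    for ep in candidates:
        if ep.name == name:
            # For "fsdp" this imports aorta.workloads.fsdp and returns FsdpWorkload.
            return ep.load()
    raise LookupError(f"no entry point named {name!r} in group {group!r}")
```

With the package installed, `load_workload_class("fsdp")` would return the `FsdpWorkload` class; an unregistered name raises `LookupError`.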

## Contract details

- **`WorkloadResult` and `Workload` ABC** live in `src/aorta/workloads/_base.py`. Import from `aorta.workloads`.
- **Distributed launch contract**: `FsdpWorkload` assumes a torchrun-style launch — `setup()` reads `RANK`, `WORLD_SIZE`, `LOCAL_RANK` from env vars and calls `torch.distributed.init_process_group(...)` itself. The platform (`aorta run`) does NOT initialize distributed for the workload.
- **`--steps` flow**: the `--steps S` CLI flag on `aorta run` (B1) is merged into the workload config dict and arrives as `self.config["steps"]`. `FsdpWorkload` uses it as the measured-loop iteration count (overriding any value from `default.yaml`).
- **GPU count**: FSDP requires `WORLD_SIZE >= 2`. On a single-GPU box the workload should `raise RuntimeError` from `setup()` with a clear message, NOT silently degenerate to single-rank.
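
The launch and `--steps` rules above can be sketched as two small pure functions. This is a minimal sketch, not the required implementation: the names `LaunchEnv`, `read_launch_env`, and `merge_steps` are hypothetical, and raising `RuntimeError` for missing env vars is an assumption (the contract only specifies the error for `WORLD_SIZE < 2`).

```python
import os
from typing import Mapping, NamedTuple, Optional


class LaunchEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


def read_launch_env(env: Optional[Mapping[str, str]] = None) -> LaunchEnv:
    """Parse the torchrun-style env vars and enforce the multi-GPU rule."""
    env = os.environ if env is None else env
    missing = [k for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if k not in env]
    if missing:
        # Assumption: a missing variable is treated as a launch error.
        raise RuntimeError(
            "FsdpWorkload expects a torchrun-style launch; missing: "
            + ", ".join(missing)
        )
    launch = LaunchEnv(
        int(env["RANK"]), int(env["WORLD_SIZE"]), int(env["LOCAL_RANK"])
    )
    if launch.world_size < 2:
        raise RuntimeError(
            f"FsdpWorkload requires WORLD_SIZE >= 2, got {launch.world_size}"
        )
    # setup() would call torch.distributed.init_process_group(...) after this.
    return launch


def merge_steps(yaml_config: dict, cli_steps: Optional[int]) -> dict:
    """Return a copy of the config in which a CLI --steps value wins."""
    merged = dict(yaml_config)
    if cli_steps is not None:
        merged["steps"] = cli_steps
    return merged
```

Keeping the env parsing free of any `torch` import makes the single-rank failure path testable without GPUs or a process group.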

## Class skeleton (illustrative, not prescriptive)

```python
from aorta.workloads import Workload, WorkloadResult


class FsdpWorkload(Workload):
    def setup(self) -> None:
        # init torch.distributed, allocate param shards, set up streams
        ...

    def run(self) -> WorkloadResult:
        # warmup loop + measured loop
        # populate WorkloadResult(passed, failure_count, step_times_ms, ...)
        ...

    def cleanup(self) -> None:
        # destroy_process_group, free buffers
        ...
```
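
The "warmup loop + measured loop" shape of `run()` might look roughly like the sketch below. `ResultSketch` is a stand-in dataclass that only mirrors the field names listed in this issue; the real `WorkloadResult` lives in `aorta.workloads` and its constructor may differ. The `timed_loop` helper and the choice to count only measured iterations in `total_iterations` are assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResultSketch:
    """Stand-in for WorkloadResult, used only so this sketch is runnable."""
    passed: bool
    failure_count: int
    step_times_ms: List[float]
    total_iterations: int
    metrics: Dict[str, float] = field(default_factory=dict)


def timed_loop(step: Callable[[int], bool], warmup: int, steps: int) -> ResultSketch:
    """Run `warmup` untimed iterations, then `steps` timed and verified ones.

    `step(i)` performs one iteration and returns True if verification passed.
    """
    for i in range(warmup):
        step(i)  # warmup results are neither timed nor counted

    step_times_ms: List[float] = []
    failures = 0
    for i in range(steps):
        start = time.perf_counter()
        ok = step(i)
        # On GPU, synchronize (or use CUDA events) before reading the clock,
        # otherwise only the kernel launch time is measured.
        step_times_ms.append((time.perf_counter() - start) * 1000.0)
        if not ok:
            failures += 1

    return ResultSketch(
        passed=failures == 0,
        failure_count=failures,
        step_times_ms=step_times_ms,
        total_iterations=steps,  # assumption: warmup is excluded
    )
```

For example, `timed_loop(lambda i: True, warmup=10, steps=100)` yields a passing result with 100 recorded step times.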

## Acceptance criteria

- [ ] `FsdpWorkload(Workload)` implements `setup()`, `run()`, `cleanup()`
- [ ] `run()` returns `WorkloadResult` with at minimum: `passed: bool`, `failure_count: int`, `step_times_ms: list[float]`, `total_iterations: int`
- [ ] Optional metrics in `WorkloadResult.metrics`: `gemm_throughput`, `step_time_p50`, `step_time_p99`
- [ ] Reads workload-specific config from `self.config` dict (warmup, verify, gemm_size, fsdp_shard_size, etc.); `--steps` from `aorta run` arrives as `self.config["steps"]` and overrides the YAML default
- [ ] **Distributed init is owned by the workload**: `setup()` reads `RANK` / `WORLD_SIZE` / `LOCAL_RANK` from env and calls `init_process_group()`; the platform doesn't do it for you
- [ ] **Multi-GPU required**: `WORLD_SIZE >= 2`. Single-rank invocations raise `RuntimeError` from `setup()` with a clear message; do not silently degenerate
- [ ] **`python -m aorta.race --mode fsdp ...` keeps working** — deprecation shim still routes through the old race path. Do NOT delete `src/aorta/race/`.
- [ ] After A2 lands, `aorta run --workload fsdp --trials 1` (once B1 is implemented) produces a valid trial JSON
- [ ] `tests/workloads/test_fsdp.py` covers: instantiation with empty config, setup→run→cleanup lifecycle (mocked distributed), WorkloadResult schema completeness
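
As one way to fill the optional `step_time_p50` / `step_time_p99` metrics, the sketch below derives them from `step_times_ms` with a nearest-rank percentile. The percentile definition is an assumption (the issue does not specify one), and the helper names are hypothetical. `gemm_throughput` is left out because its units are not defined here.

```python
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in the range 1..100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    # Integer form of ceil(pct * n / 100), which avoids float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def step_time_metrics(step_times_ms: List[float]) -> Dict[str, float]:
    """Build the optional metrics dict from measured step times."""
    return {
        "step_time_p50": percentile(step_times_ms, 50),
        "step_time_p99": percentile(step_times_ms, 99),
    }
```

A schema test could then assert that both keys are present and that each value is one of the recorded step times.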

## Out of scope (defer to P1+)

- `workloads/ddp/`, `workloads/torchrec/`, `workloads/hsdp/` — separate tasks, P1+
- Custom verification modes beyond what the workload needs to report pass/fail
- Performance tuning of the workload itself
- `search_space()` / `objective()` methods — deferred (adaptive search is P1+)

## How to test

```bash
# Direct (no platform):
torchrun --nproc_per_node=8 -m aorta.race --mode fsdp --warmup 10 --verify 100
# Should still work post-A2 (race deprecation shim remains)

# Via platform (after B1 lands):
aorta run --workload fsdp --trials 1 --steps 100
python -m json.tool results/fsdp/trial_0.json
```

## PR template

Title: `A2: workloads/fsdp/ — FsdpWorkload(Workload) plugin`
Body: confirm race deprecation shim still works (regression test), include sample trial JSON output.

