Goal
One in-tree workload that proves the platform's plugin contract works end-to-end. After this, aorta run --workload fsdp produces a trial JSON.
Why this workload first
src/aorta/race/modes/fsdp.py is the most workload-shaped code in the repo today and exercises the patterns the platform is designed for (multi-stream H2D, all_gather / reduce_scatter, GEMM compute, per-iteration verification). Building FsdpWorkload first validates the entire MVP plumbing with the smallest amount of new design.
Reference material (read for understanding — do NOT mechanically transform)
The race code is being absorbed into the new platform, but "absorbed" means the engineer reads it to understand the algorithm and writes FsdpWorkload fresh against the new Workload ABC. Don't copy + rename — that would carry race-specific naming and BaseReproducer assumptions into the new file. The new implementation is a clean class that happens to do similar work.
src/aorta/race/modes/fsdp.py
src/aorta/race/base.py
Run loop reference: warmup + verification phase pattern. The new FsdpWorkload.run() borrows the idea (warmup, then measured loop) but is written fresh against WorkloadResult.
src/aorta/race/compute.py
Compute simulation primitives. Fine to USE via import (not deprecated); don't copy.
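The warmup-then-measured-loop idea can be sketched without any GPU or distributed dependency. This is a minimal illustration, not the race code: timed_loop, step_fn, and the warmup / steps parameters are hypothetical names, and a real workload would synchronize the device before reading the clock.

```python
import time
from typing import Callable, List


def timed_loop(step_fn: Callable[[int], None], warmup: int, steps: int) -> List[float]:
    """Run `warmup` untimed iterations, then time `steps` measured iterations.

    Returns the per-step wall-clock durations in milliseconds.
    """
    if warmup < 0 or steps < 0:
        raise ValueError("warmup and steps must be non-negative")

    # Warmup: timings are discarded so one-time costs do not skew the results.
    for i in range(warmup):
        step_fn(i)

    step_times_ms: List[float] = []
    for i in range(steps):
        start = time.perf_counter()
        step_fn(warmup + i)
        # Assumption: a GPU workload would synchronize here before stopping the clock.
        step_times_ms.append((time.perf_counter() - start) * 1000.0)
    return step_times_ms


# Usage: record which iteration indices were executed.
calls: List[int] = []
times = timed_loop(calls.append, warmup=2, steps=5)
```

Only the measured iterations contribute to the returned list, which is the shape step_times_ms needs.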
Files to create / modify
src/aorta/workloads/fsdp/__init__.py # NEW — exports FsdpWorkload
src/aorta/workloads/fsdp/workload.py # NEW — FsdpWorkload(Workload) class
src/aorta/workloads/fsdp/configs/ # NEW
    default.yaml # sample config
tests/workloads/__init__.py # NEW (empty)
tests/workloads/test_fsdp.py # NEW — smoke + result schema
The pyproject.toml entry-point already exists from L0.
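Whether that entry-point is visible to the installed package can be checked with the standard library. This is a sketch only: the group name aorta.workloads is an assumption (the actual group is whatever pyproject.toml declares), and registered_workloads is a hypothetical helper.

```python
from importlib.metadata import entry_points
from typing import List


def registered_workloads(group: str = "aorta.workloads") -> List[str]:
    """Return the names of entry points registered under `group`.

    The default group name is an assumption, not taken from the repo.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the returned object supports select().
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9: a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


names = registered_workloads()
print("fsdp registered:", "fsdp" in names)
```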
Contract details
WorkloadResult and the Workload ABC live in src/aorta/workloads/_base.py. Import from aorta.workloads.
Distributed launch contract: FsdpWorkload assumes a torchrun-style launch — setup() reads RANK, WORLD_SIZE, LOCAL_RANK from env vars and calls torch.distributed.init_process_group(...) itself. The platform (aorta run) does NOT initialize distributed for the workload.
--steps flow: the --steps S CLI flag on aorta run (B1) is merged into the workload config dict and arrives as self.config["steps"]. FsdpWorkload uses it as the measured-loop iteration count (overriding any value from default.yaml).
GPU count: FSDP requires WORLD_SIZE >= 2. On a single-GPU box the workload should raise RuntimeError from setup() with a clear message, NOT silently degenerate to single-rank.
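The three contract points above (torchrun-style environment, the --steps override, and the WORLD_SIZE guard) could be sketched as follows. This is illustrative, not prescriptive: DistEnv, read_dist_env, merge_steps, FsdpWorkloadSketch, and the backend config key are hypothetical names, the real class would subclass Workload from aorta.workloads, and raising on missing environment variables is an added assumption.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    rank: int
    world_size: int
    local_rank: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read torchrun-style variables and enforce WORLD_SIZE >= 2."""
    env = os.environ if env is None else env
    missing = [k for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if k not in env]
    if missing:
        # Assumption: a missing variable is treated as a launch error.
        raise RuntimeError(f"FSDP needs a torchrun-style launch; missing: {missing}")
    world_size = int(env["WORLD_SIZE"])
    if world_size < 2:
        raise RuntimeError(f"FSDP requires WORLD_SIZE >= 2, got {world_size}")
    return DistEnv(int(env["RANK"]), world_size, int(env["LOCAL_RANK"]))


def merge_steps(yaml_config: Mapping[str, Any], cli_steps: Optional[int]) -> Dict[str, Any]:
    """Platform-side merge: a --steps value overrides the YAML default."""
    merged = dict(yaml_config)
    if cli_steps is not None:
        merged["steps"] = cli_steps
    return merged


class FsdpWorkloadSketch:
    """Stand-in for FsdpWorkload(Workload); the real base class is not shown."""

    def __init__(self, config: Optional[Dict[str, Any]] = None) -> None:
        self.config = dict(config or {})
        self.dist_env: Optional[DistEnv] = None

    def setup(self) -> None:
        # Fails fast on a single-rank launch, before touching torch.
        self.dist_env = read_dist_env()
        # Imported lazily so the helpers above can be tested without torch.
        import torch.distributed as dist

        if not dist.is_initialized():
            # "backend" is a hypothetical config key; "nccl" is an assumed default.
            dist.init_process_group(backend=self.config.get("backend", "nccl"))
```

Keeping the environment parsing in a plain function means the single-rank failure path can be unit-tested by passing a dict, with no process group involved.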
Acceptance criteria
FsdpWorkload(Workload) implements setup(), run(), cleanup()
run() returns WorkloadResult with at minimum: passed: bool, failure_count: int, step_times_ms: list[float], total_iterations: int
Optional metrics in WorkloadResult.metrics: gemm_throughput, step_time_p50, step_time_p99
Reads workload-specific config from self.config dict (warmup, verify, gemm_size, fsdp_shard_size, etc.); --steps from aorta run arrives as self.config["steps"] and overrides the YAML default
Distributed init is owned by the workload: setup() reads RANK / WORLD_SIZE / LOCAL_RANK from env and calls init_process_group(); the platform doesn't do it for you
Multi-GPU required: WORLD_SIZE >= 2. Single-rank invocations raise RuntimeError from setup() with a clear message; do not silently degenerate
python -m aorta.race --mode fsdp ... keeps working; the deprecation shim still routes through the old race path. Do NOT delete src/aorta/race/.
After A2 lands, aorta run --workload fsdp --trials 1 (once B1 is implemented) produces a valid trial JSON
tests/workloads/test_fsdp.py covers: instantiation with empty config, setup→run→cleanup lifecycle (mocked distributed), WorkloadResult schema completeness
Out of scope (defer to P1+)
workloads/ddp/, workloads/torchrec/, workloads/hsdp/: separate tasks, P1+
search_space() / objective() methods: deferred (adaptive search is P1+)
How to test
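The schema-completeness check named in the acceptance criteria can be prototyped before the real classes exist. The sketch below uses a stand-in dataclass rather than the real WorkloadResult. WorkloadResultSketch, build_result, percentile, and missing_fields are hypothetical names, and deriving passed from failure_count == 0 and using nearest-rank percentiles are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List

REQUIRED_FIELDS = ("passed", "failure_count", "step_times_ms", "total_iterations")


@dataclass
class WorkloadResultSketch:
    """Stand-in for WorkloadResult; the real class lives in aorta.workloads."""

    passed: bool
    failure_count: int
    step_times_ms: List[float]
    total_iterations: int
    metrics: Dict[str, float] = field(default_factory=dict)


def percentile(values: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def build_result(step_times_ms: List[float], failure_count: int) -> WorkloadResultSketch:
    result = WorkloadResultSketch(
        passed=failure_count == 0,  # assumption: any failure fails the trial
        failure_count=failure_count,
        step_times_ms=list(step_times_ms),
        total_iterations=len(step_times_ms),
    )
    if step_times_ms:
        result.metrics["step_time_p50"] = percentile(step_times_ms, 50)
        result.metrics["step_time_p99"] = percentile(step_times_ms, 99)
    return result


def missing_fields(payload: dict) -> List[str]:
    """Return the required result fields that `payload` lacks."""
    return [name for name in REQUIRED_FIELDS if name not in payload]


def test_result_schema_is_complete() -> None:
    result = build_result([4.0, 1.0, 3.0, 2.0], failure_count=0)
    assert missing_fields(asdict(result)) == []
    assert result.total_iterations == 4


test_result_schema_is_complete()
```

In the real test file the same assertions would run against the object returned by FsdpWorkload.run() with distributed calls mocked out.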
PR template
Title: A2: workloads/fsdp/ — FsdpWorkload(Workload) plugin
Body: confirm race deprecation shim still works (regression test), include sample trial JSON output.