A2: FSDP Workload #149

@oyazdanb

Goal

One in-tree workload that proves the platform's plugin contract works end-to-end. After this lands (and once B1 is implemented), aorta run --workload fsdp produces a trial JSON.

Why this workload first

src/aorta/race/modes/fsdp.py is the most workload-shaped code in the repo today and exercises the patterns the platform is designed for (multi-stream H2D, all_gather / reduce_scatter, GEMM compute, per-iteration verification). Building FsdpWorkload first validates the entire MVP plumbing with the smallest amount of new design.

Reference material (read for understanding — do NOT mechanically transform)

The race code is being absorbed into the new platform, but "absorbed" means the engineer reads it to understand the algorithm and writes FsdpWorkload fresh against the new Workload ABC. Don't copy + rename — that would carry race-specific naming and BaseReproducer assumptions into the new file. The new implementation is a clean class that happens to do similar work.

File                          Why it's worth reading
src/aorta/race/modes/fsdp.py  Algorithm reference: per-layer all_gather, reduce_scatter, double-buffered H2D, known-pattern verification
src/aorta/race/base.py        Run loop reference: warmup + verification phase pattern. The new FsdpWorkload.run() borrows the idea (warmup, then measured loop) but writes fresh against WorkloadResult.
src/aorta/race/compute.py     Compute simulation primitives. Fine to USE if you import (not deprecated); don't copy.

Files to create / modify

src/aorta/workloads/fsdp/__init__.py       # NEW — exports FsdpWorkload
src/aorta/workloads/fsdp/workload.py       # NEW — FsdpWorkload(Workload) class
src/aorta/workloads/fsdp/configs/          # NEW
  default.yaml                             # sample config
tests/workloads/__init__.py                # NEW (empty)
tests/workloads/test_fsdp.py               # NEW — smoke + result schema

The pyproject.toml entry-point already exists from L0:

[project.entry-points."aorta.workloads"]
fsdp = "aorta.workloads.fsdp:FsdpWorkload"

Contract details

  • WorkloadResult and Workload ABC live in src/aorta/workloads/_base.py. Import from aorta.workloads.
  • Distributed launch contract: FsdpWorkload assumes a torchrun-style launch — setup() reads RANK, WORLD_SIZE, LOCAL_RANK from env vars and calls torch.distributed.init_process_group(...) itself. The platform (aorta run) does NOT initialize distributed for the workload.
  • --steps flow: the --steps S CLI flag on aorta run (B1) is merged into the workload config dict and arrives as self.config["steps"]. FsdpWorkload uses it as the measured-loop iteration count (overriding any value from default.yaml).
  • GPU count: FSDP requires WORLD_SIZE >= 2. On a single-GPU box the workload should raise RuntimeError from setup() with a clear message, NOT silently degenerate to single-rank.
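
The launch contract and the GPU-count guard above can be sketched as a small, torch-free helper that setup() might call first. LaunchEnv and read_launch_env are hypothetical names, and raising RuntimeError for missing variables is an assumption (the issue only specifies the error for WORLD_SIZE < 2).

```python
import os
from typing import Mapping, NamedTuple, Optional


class LaunchEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


def read_launch_env(env: Optional[Mapping[str, str]] = None) -> LaunchEnv:
    """Parse torchrun-style env vars and enforce WORLD_SIZE >= 2."""
    env = os.environ if env is None else env
    names = ("RANK", "WORLD_SIZE", "LOCAL_RANK")
    missing = [name for name in names if name not in env]
    if missing:
        # Assumption: missing variables are treated as a launch error.
        raise RuntimeError(
            "FsdpWorkload expects a torchrun-style launch; missing env vars: "
            + ", ".join(missing)
        )
    launch = LaunchEnv(*(int(env[name]) for name in names))
    if launch.world_size < 2:
        raise RuntimeError(
            f"FsdpWorkload requires WORLD_SIZE >= 2, got {launch.world_size}; "
            "launch with torchrun on a multi-GPU machine"
        )
    # setup() would go on to call torch.distributed.init_process_group(...)
    # itself, since the platform does not initialize distributed.
    return launch
```

Accepting the environment as a mapping keeps the guard testable without real GPUs: a test can pass a plain dict instead of patching os.environ.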

Class skeleton (illustrative, not prescriptive)

from aorta.workloads import Workload, WorkloadResult


class FsdpWorkload(Workload):
    def setup(self) -> None:
        # init torch.distributed, allocate param shards, set up streams
        ...

    def run(self) -> WorkloadResult:
        # warmup loop + measured loop
        # populate WorkloadResult(passed, failure_count, step_times_ms, ...)
        ...

    def cleanup(self) -> None:
        # destroy_process_group, free buffers
        ...
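
To make the "warmup loop + measured loop" comment concrete, here is a minimal sketch of the loop shape. ResultSketch is a stand-in dataclass, not the real WorkloadResult from aorta.workloads; run_measured_loop, the step callable, and the fallback defaults (10 warmup, 100 steps) are assumptions for illustration only.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResultSketch:
    # Stand-in for WorkloadResult; field names mirror the acceptance criteria.
    passed: bool = True
    failure_count: int = 0
    step_times_ms: List[float] = field(default_factory=list)
    total_iterations: int = 0


def run_measured_loop(step: Callable[[int], bool], config: Dict[str, int]) -> ResultSketch:
    """Run `warmup` untimed iterations, then `steps` timed and verified ones."""
    warmup = int(config.get("warmup", 10))   # placeholder default
    steps = int(config.get("steps", 100))    # placeholder default
    for i in range(warmup):
        step(i)  # warmup: timings and verification results are discarded
    result = ResultSketch()
    for i in range(steps):
        start = time.perf_counter()
        ok = step(i)  # returns True when this iteration verified correctly
        # On a GPU, synchronize (or use CUDA events) before reading the
        # clock; kernel launches are asynchronous.
        result.step_times_ms.append((time.perf_counter() - start) * 1000.0)
        if not ok:
            result.failure_count += 1
    result.total_iterations = steps
    result.passed = result.failure_count == 0
    return result
```

A real run() would replace step with the per-iteration FSDP work (all_gather, GEMM, reduce_scatter, verification) and return the platform's WorkloadResult.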

Acceptance criteria

  • FsdpWorkload(Workload) implements setup(), run(), cleanup()
  • run() returns WorkloadResult with at minimum: passed: bool, failure_count: int, step_times_ms: list[float], total_iterations: int
  • Optional metrics in WorkloadResult.metrics: gemm_throughput, step_time_p50, step_time_p99
  • Reads workload-specific config from self.config dict (warmup, verify, gemm_size, fsdp_shard_size, etc.); --steps from aorta run arrives as self.config["steps"] and overrides the YAML default
  • Distributed init is owned by the workload: setup() reads RANK / WORLD_SIZE / LOCAL_RANK from env and calls init_process_group(); the platform doesn't do it for you
  • Multi-GPU required: WORLD_SIZE >= 2. Single-rank invocations raise RuntimeError from setup() with a clear message; do not silently degenerate
  • python -m aorta.race --mode fsdp ... keeps working — deprecation shim still routes through the old race path. Do NOT delete src/aorta/race/.
  • After A2 lands, aorta run --workload fsdp --trials 1 (once B1 is implemented) produces a valid trial JSON
  • tests/workloads/test_fsdp.py covers: instantiation with empty config, setup→run→cleanup lifecycle (mocked distributed), WorkloadResult schema completeness
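
The optional step_time_p50 and step_time_p99 metrics can be derived directly from step_times_ms. The sketch below uses the nearest-rank percentile definition, which is one reasonable choice rather than something the issue prescribes; percentile and step_time_metrics are hypothetical helper names.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def step_time_metrics(step_times_ms: List[float]) -> Dict[str, float]:
    """Build the optional latency entries for WorkloadResult.metrics."""
    return {
        "step_time_p50": percentile(step_times_ms, 50),
        "step_time_p99": percentile(step_times_ms, 99),
    }
```

Nearest-rank always returns an observed sample, so a single slow iteration shows up unchanged in p99 once there are enough steps; interpolating definitions would smooth it.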

Out of scope (defer to P1+)

  • workloads/ddp/, workloads/torchrec/, workloads/hsdp/ — separate tasks, P1+
  • Custom verification modes beyond what the workload needs to report pass/fail
  • Performance tuning of the workload itself
  • search_space() / objective() methods — deferred (adaptive search is P1+)

How to test

# Direct (no platform):
torchrun --nproc_per_node=8 -m aorta.race --mode fsdp --warmup 10 --verify 100
# Should still work post-A2 (race deprecation shim remains)

# Via platform (after B1 lands):
aorta run --workload fsdp --trials 1 --steps 100
python -m json.tool results/fsdp/trial_0.json

PR template

Title: A2: workloads/fsdp/ — FsdpWorkload(Workload) plugin
Body: confirm race deprecation shim still works (regression test), include sample trial JSON output.
