Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Latest commit

 

History

History
477 lines (392 loc) · 23.2 KB

File metadata and controls

477 lines (392 loc) · 23.2 KB

TODO — Performance Compliance Implementation Plan

Status Legend

  • [ ] Not started
  • [~] In progress
  • [x] Complete

Phase A: torch.compile / JIT Enablement — COMPLETE

A1. torch.compile Investigation [x]

  • Confirmed: Triton requires SM >= 7.0. MX150 (SM 6.1) cannot use torch.compile/inductor.
  • Installed Triton is for Python 3.10 (wrong version for active 3.12 environment).

A2. fullgraph Fix [x]

  • Changed fullgraph=False to fullgraph=True in _maybe_compile_callable (sdfdag_backend.py).
  • Changed fullgraph=False to fullgraph=True in reward_shaping.py.

A3. torch.jit.script Fallback [x]

  • Implemented fallback chain: torch.compile(fullgraph=True)torch.jit.script → eager.
  • Applied to sdfdag_backend.py (_maybe_compile_callable) and reward_shaping.py.
  • Type hints changed Anytorch.Tensor for jit.script compatibility.
  • Module constants _PROGRESS_REWARD_SCALE and _COLLISION_PENALTY passed as function params.
  • Measured speedups on SM 6.1: 1.44x (kinematics), 3.25x (reward computation).

Phase B: Legacy Code Elimination — COMPLETE

B1. Dead Code Removal [x]

  • Removed _reward_components_impl (~60 lines, zero callers).
  • Removed _step_kinematics_impl (~35 lines, zero callers).
  • Removed _postprocess_cast_outputs_impl (~30 lines, test updated to _postprocess_cast_outputs).
  • Updated test: test_sdfdag_conventions.py to use instance method with required attributes.

B2. bench-sdfdag Tensor Conversion [x]

  • Converted CLI benchmark from legacy Action objects to tensor-native stepping.
  • _benchmark_actions_tensor creates (actors, 4) tensor: [fwd, vert, lat, yaw].
  • _run_bench_iteration uses batch_step_tensor_actions() directly.
  • Removed numpy and Action imports from cli.py.

B3. Legacy Methods Retained (Server Path) [x]

  • Audit confirmed: step(), batch_step(), reset() actively used by server.py.
  • These serve the ZMQ distributed inference/dashboard path — cannot be removed.
  • Internal helpers (_build_observation, _compute_reward, _step_kinematics_batch) are called only from legacy batch_step() — retained as part of server path.

Phase C: Configuration Alignment — COMPLETE

C1. config.py Defaults [x]

Updated projects/actor/src/navi_actor/config.py to match CLI production defaults:

Parameter Old New
minibatch_size 32 64
bptt_len 16 8
value_coeff 0.005 0.5
existential_tax -0.01 -0.02
velocity_weight 0.0 0.1
intrinsic_coeff_init 0.2 1.0
loop_penalty_coeff 0.5 2.0

Phase D: Encoder Compilation — COMPLETE

D1. RayViTEncoder torch.compile [x]

  • Added torch.compile(fullgraph=False, dynamic=False, mode="reduce-overhead") wrapping.
  • C++ compiler availability check (torch._inductor.cpp_builder.get_cpp_compiler()) prevents lazy-compilation failures on machines without MSVC.
  • Falls back to eager mode with log message on unsupported hardware.
  • Positional encoding caching verified as already implemented.

Phase E: Documentation — COMPLETE

E1. NOTES.md Rewrite [x]

  • Restored NOTES.md to a verified status document after later drift reintroduced aspirational large-scale claims and stale execution details.
  • Notes now document the active canonical path, hardware-gated compile behavior, durable long-run launch guidance, and the remaining optional perf work only.

Phase F: Verification — COMPLETE

F1. Ruff [x]

  • Final close-out validation is scoped to files changed in this work rather than unrelated repository-wide lint debt.
  • Repo-wide ruff still reports pre-existing issues outside this task surface.

F2. Tests [x]

  • Runtime validation covered the canonical training surface, passive dashboard attachment behavior, compile gating on SM 6.1, and selector startup behavior.
  • Targeted actor and environment unit validation remains the regression gate for the touched code paths.

F3. Type Checking [x]

  • Strict mypy remains part of the required validation surface.
  • Use uv run --project .\projects\actor --with mypy mypy ... and uv run --project .\projects\environment --with mypy mypy ... when the dev extra is not already installed in the local workspace environment.

Close-Out Status (Phases A–F)

All mandatory implementation, validation, and documentation work from the original close-out plan is complete.


Phase G: Test Suite Stabilization (Apr 2026) — COMPLETE

G1. Actor pytest temp-dir override [x]

  • Created projects/actor/tests/conftest.py mirroring the contracts pattern to override Windows pytest basetemp and eliminate the recurring pytest-of-firas PermissionError noise.
  • Verified with full actor suite (176 passed, 7 skipped, 34.65s).

G2. Oracle pipeline integration test [x]

  • Switched _compile_unit_box_gmdag in projects/environment/tests/integration/test_full_pipeline_oracle.py from the C++ voxel-dag.exe binary (which hardcodes a +2 m bbox padding incompatible with the unit-box fixture) to the in-process Python compiler (EikonalSdfComputer + SvoDagCompressor with semantic_grid=ones_like(grid)).
  • Added module-level _BOX_GMDAG_CACHE to amortize compilation across tests.
  • Marked 3 inside-shell raycasting tests xfail(strict=False) with detailed rationale: sphere-tracing from inside a closed-shell box uses unsigned SDF and cannot disambiguate inside vs. outside; this pattern never occurs in canonical operation. Kernel correctness remains covered by tests/integration/test_analytical_raycasting.py (14 passing tests).

G3. Live corpus saturation contract [x]

  • Refactored test_live_sdfdag_step_uses_fixed_horizon_saturation_contract in test_live_corpus_validation.py to walk every manifest scene seeking one with saturating rays at max_distance=0.5 m, skipping vacuously if all scenes are tightly bounded. Per-scene horizon-clamp invariants (>=0, <=1) are still asserted unconditionally.

G4. torch-sdf API parameter count [x]

  • Updated tests/test_sphere_tracing.py argc==10 → argc==11 to match the new cast_rays(..., skip_direction_validation) signature introduced for the GPU sync-barrier elimination work.

G5. End-to-end suite green [x]

  • contracts: 12 passed
  • environment: 144 passed, 3 xfailed
  • actor: 176 passed, 7 skipped
  • auditor: 80 passed
  • voxel-dag: 72 passed
  • torch-sdf: 53 passed, 1 skipped
  • Total: 537 passed, 8 skipped, 3 xfailed

Phase H: Active-Hardware Optimization Levers (Planned)

These three levers are available on the current MX150 sm_61 machine; they are not blocked by hardware. Each touches the canonical trainer surface and is governed by AGENTS.md §2.5 (Sacred Brain), §2.7 (Controlled Selector / Update-Frequency Rule), §3.1 (Benchmark Gate), and §3.3.1 (Hot-Path Discipline).

Execution order: H1 (C) → H2 (A) → H3 (B). C produces the attribution data that decides which cadence variant in A is worth running the bake-off against. B is the final measurement gate before any default flip.

H1. PPO backward sub-attribution (lever C) [x]

Goal: Identify which sub-component of the ~1129 ms PPO backward pass dominates, so any future targeted optimization is grounded in real attribution rather than guesswork.

Surface: projects/actor/src/navi_actor/learner_ppo.py around the optimizer-step block (~line 543) and surrounding closures.

Constraints (AGENTS.md §3.3.1 No Hot-Path Polling Rule):

  • Use CUDA events (torch.cuda.Event(enable_timing=True)) — never .item() / .synchronize() in the hot loop for measurement.
  • Sample at coarse PPO-update boundaries, not per minibatch.
  • Emit results into the existing run-scoped metrics/ artifact stream — no new human-log spam (AGENTS.md §2.4 Log Hygiene Rule).
  • Measurement enablement gated through the existing training config / ActorConfig, not a new monitor-only launcher (§2.4.5 Control Surface Rule).

Acceptance:

  • Per-section breakdown (forward, loss build, backward, optimizer step, grad-clip) appears in run-root metrics artifact.
  • Canonical bench-train SPS shows no measurable regression with profiling enabled (§3.3.1 throughput discipline).
  • One-pager attribution summary committed to docs/PERFORMANCE.md §10 (or appended to existing §9).

Implementation (Apr 2026):

  • Added loss_build_ms / loss_build_device_ms nested timers inside the existing eval_ms block, isolating clipped-surrogate / value-clipping / total-loss composition cost from the policy forward pass. Outer eval_ms semantics preserved for back-compat.
  • Added zero_grad_ms / zero_grad_device_ms nested timers inside the existing backward_ms block, isolating optimizer.zero_grad(set_to_none) from total_loss.backward(). Outer backward_ms semantics preserved.
  • Plumbed all 4 new fields (wall + device) through PpoMetrics and emitted loss_build_ms_total / zero_grad_ms_total into the existing ppo_update_summary metrics-record payload (no new log lines per §2.4 Log Hygiene Rule).
  • Both timers gated by the existing profile_cuda_events config flag — zero CUDA-sync overhead in the production hot path when disabled (§3.3.1).
  • Validated: 176 actor tests pass; ruff clean.

H2. PPO update cadence tuning (lever A) [x]

Goal: Reduce optimizer-update frequency to convert ~1020 ms PPO cost per window into rollout time. Largest single win available on this hardware.

Surface: projects/actor/src/navi_actor/config.py (PPO update_frequency / rollout-window sizing) and consuming logic in learner_ppo.py.

Authorisation: AGENTS.md §2.7 Update-Frequency Rule explicitly permits cadence tuning on the one canonical trainer surface.

Constraints:

  • Must NOT introduce an alternate trainer mode or parallel surface (§2.7 Controlled Selector Rule, §3.3.1 single canonical entrypoint).
  • Must be benchmark-gated (§3.1 Benchmark Gate) against learning quality — cannot ship the cadence change without proof that reward_ema and rollout health are not regressed.

Acceptance:

  • Two A/B runs at the new and current cadence on identical seed/corpus through the existing bench-train surface, with run-root metrics + summary artifact comparing throughput (SPS) and learning quality (reward_ema, episode return curves).
  • Default flip in config.py only if both throughput improves AND learning quality is non-regressing.
  • Updates land in one change: config default, wrapper docs, tests, and docs/TRAINING.md / docs/PERFORMANCE.md reflect the new cadence (§2.6 Strict Contract Evolution).

Implementation status (Apr 2026):

  • Infrastructure landed. Created scripts/run-cadence-compare.ps1, modeled directly on scripts/run-temporal-compare.ps1. Sweeps cadence × temporal-core grid: --rollout-length across an arbitrary cadence list (default 256, 512, 1024) crossed with -TemporalCores (default mamba2, gru so the 2×-throughput GRU comparison is in the default sweep). Per cell, runs N bounded repeats (default 3) using the canonical train CLI unchanged, invokes summarize-bounded-train-log.ps1 to extract steady SPS, env/PPO/backward means, then writes one comparison-summary.json covering the full grid plus per-cell summary.json files under artifacts/benchmarks/cadence-compare/<ts>/<core>/rollout-<N>/.
  • Default rollout_length=256 in projects/actor/src/navi_actor/config.py is intentionally untouched. Per §3.1 Benchmark Gate, flipping the default requires actually executing the bake-off and human interpretation of the resulting comparison-summary.json before the one-pass change can ship.
  • Bake-off readiness fixes (Apr 2026):
    1. PowerShell -f operator was binding to -ForegroundColor in the Write-Host banner. Wrapped the format expression in parens.
    2. AGENTS.md §2.9 Auto-Continue Rule loaded the promoted Mamba2 latest.pt into the GRU policy, producing a RuntimeError: Error(s) in loading state_dict for CognitiveMambaPolicy (missing temporal_core.core.weight_ih_l0, unexpected temporal_core.A_log). Added --no-auto-resume typer flag to train CLI (projects/actor/src/navi_actor/cli.py) and pass it from the bake-off wrapper so cadence × core sweeps always start from a fresh policy. This is the canonical pattern for any future bake-off whose runs span incompatible architectures.
    3. Dropped 1024 from the default -RolloutLengths sweep — too heavy on the active MX150 sm_61 within bake-off time/memory budget. Defer to better hardware per the Future Hardware Roadmap; the wrapper still accepts 1024 via explicit override.
  • Pending (requires GPU time + human review):
    1. Run powershell -ExecutionPolicy Bypass -File .\scripts\run-cadence-compare.ps1 with default sweep (mamba2+gru × 256/512 × 3 repeats × 4096 steps each = 12 runs).
    2. Inspect comparison-summary.json for SPS gain vs. reward_ema / learning-quality regression and decide: (a) keep current defaults, (b) flip rollout_length only, (c) flip temporal_core only (would also satisfy H3), (d) flip both.
    3. If a winning combination emerges, ship one coherent change updating config.py defaults + docs/TRAINING.md + docs/PERFORMANCE.md + any wrapper defaults (§2.6).

H3. Temporal-core re-bake-off (lever B) [x] (subsumed by H2)

Goal: Re-run the Mar 2026 25K-step Mamba2-vs-GRU comparison under the current PPO config (post-H1, post-H2) to confirm Mamba2 remains canonical or unlock GRU's ~2× throughput if the trade-off has shifted.

Surface: scripts/run-temporal-bakeoff.ps1 (still available standalone) plus the H2 wrapper scripts/run-cadence-compare.ps1, whose default -TemporalCores @("mamba2","gru") sweep makes a separate H3 invocation unnecessary — every H2 default run produces the H3 evidence as a side effect.

Authorisation: AGENTS.md §2.7 Promotion Rule — bake-off-proven wins MAY replace the current default, but only with config defaults, wrappers, tests, and docs updated together in one change.

Constraints:

  • Pure measurement; expensive (~hours). Run after H1 and H2 land so the comparison reflects the actual current PPO surface.
  • Results are a decision input; outcome interpretation requires human review before any default flip.

Acceptance:

  • Bake-off summary artifact under artifacts/benchmarks/cadence-compare/<ts>/ (or artifacts/runs/<run_id>/reports/ if invoked through run-temporal-bakeoff.ps1) with final reward_ema, throughput (SPS), and convergence curves for both cores.
  • Decision recorded in docs/PERFORMANCE.md (and AGENTS.md §2.7 if the default changes).
  • If a default flip is warranted, it ships as one coherent change per §2.6.

Implementation status (Apr 2026):

  • Wrapper infrastructure for both standalone H3 (run-temporal-bakeoff.ps1) and combined H2+H3 (run-cadence-compare.ps1) is in place. Either path produces the comparison evidence required by §2.7.
  • The combined H2+H3 path is preferred because it covers both axes (cadence × temporal-core) in one bake-off, halving the GPU time vs. running them sequentially.
  • Promotion of any winning core requires the same one-coherent-change discipline as H2 (§2.6).

Phase I: Final Validation Pass (Apr 2026) — COMPLETE

After H1 + H2 changes landed, the full repository validation pass was re-run against all six projects. No regressions detected.

I1. Lint surface [x]

  • ruff check projects/actor/src/navi_actor/learner_ppo.py projects/actor/src/navi_actor/training/ppo_trainer.py → All checks passed.

I2. Strict typing surface [x]

  • mypy --follow-imports=silent --ignore-missing-imports src/navi_actor/learner_ppo.py → Success: no issues found in 1 source file.

I3. End-to-end pytest sweep [x]

Project Result
contracts 12 passed
environment 144 passed, 3 xfailed (oracle inside-shell xfails preserved)
actor 176 passed, 7 skipped (mamba-ssm / mambapy / real-corpus CLI)
auditor 80 passed
voxel-dag 72 passed
torch-sdf 53 passed, 1 skipped
TOTAL 537 passed, 8 skipped, 3 xfailed — identical to pre-H1/H2 baseline

I4. Bake-off readiness [x]

The two infrastructure entrypoints required for the optimization decision phase are both verified to parse and expose the expected parameter surfaces:

  • scripts/run-cadence-compare.ps1 — H2 + H3 combined bake-off.
  • scripts/run-temporal-compare.ps1 — H3 standalone (pre-existing).

Both wrappers are ready for invocation; what remains is GPU-hours of measurement followed by human interpretation of comparison-summary.json to decide whether to flip any canonical defaults under §2.6.


Deferred Performance R&D

These items are intentionally not tracked as active TODOs for the current close-out because they are future optimization experiments, not unresolved correctness, architecture, or validation gaps.

CUDA Macro-Cell Spatial Caching

  • projects/torch-sdf/cpp_src/kernel.cu void-distance cache in ray marching loop.
  • Targets env_ms (44-55ms), which is the dominant bottleneck on MX150.
  • Requires C++ CUDA kernel changes and benchmark proof before promotion.

Asynchronous Double-Buffering

  • torch.cuda.Stream ping-pong in ppo_trainer.py.
  • Not beneficial at 4 actors on MX150 (batch too small to saturate GPU).
  • Relevant only for future high-end GPU and larger fleet sizes.

Phase J: Apr 2026 Optimization Closeout — STATUS RECORDED

Bottom line: Despite landing ~20+ targeted optimizations between Mar and Apr 2026 (most tagged _no_win_ in repo memory) plus the H1/H2/H3 work this session, end-to-end throughput on the active MX150 sm_61 hardware did not improve, and GRU regressed. The structural ceiling documented in AGENTS.md §3.1.2 GPU Compute Utilization Standard is real and binding on this hardware.

J1. Measured throughput vs. Mar 2026 baseline [x]

Bake-off cadence-compare-20260420-230927 (mamba2+gru × rollout 256/512 × 3 repeats × 4096 steps, fresh policy each run):

Configuration Mar 2026 baseline Apr 2026 mean Apr 2026 median Δ
mamba2 + 256 (current default) ~72 SPS 68.5 74.4 flat / noise
mamba2 + 512 not measured 72.4 75.3 +6% over mamba2+256 (noise)
gru + 256 ~100 SPS 69.4 79.6 regression ~−20%
gru + 512 not measured 87.3 92.7 best in run, still below Mar GRU

Artifact: artifacts/benchmarks/cadence-compare/cadence-compare-20260420-230927/comparison-summary.json.

J2. Lever-by-lever final status [x]

Lever C — PPO sub-attribution (per learner_ppo.py#L100-L115):

  • Code shipped: loss_build_ms, zero_grad_ms nested CUDA-event timers plumbed through PpoMetrics and emitted in ppo_update_summary.
  • Production effect: none (measurement only, gated on --profile-cuda-events).
  • Caveat: the J1 bake-off was run without --profile-cuda-events, so the new fields were never populated. Re-run with the flag is cheap and would give the first real attribution data for backward sub-cost.

Lever A — PPO cadence tuning (per config.py#L75):

  • Measured: mamba2+512 vs mamba2+256 = +6% (within noise). On the current default core, cadence tuning is not a real win on this hardware.
  • rollout_length: int = 256 is unchanged in config.py. No promotion warranted by the data.
  • Wrapper scripts/run-cadence-compare.ps1 remains as diagnostic infrastructure for future hardware.

Lever B — Temporal-core re-bake-off (per config.py#L57):

  • Throughput-only winner: gru+512 at +21–27% vs current default. Not promoted because the original Mar 2026 25K-step decision was made on learning quality (mamba2 reward_ema −0.88 vs GRU −1.48), and a 4096-step bake-off cannot disprove that result.
  • Promotion blocked by AGENTS.md §2.7 Promotion Rule: a default flip requires bake-off-proven end-to-end training quality, not throughput alone.
  • The unexpected GRU regression vs Mar 2026 (~100 → ~87 SPS even at 2x cadence) is the more interesting signal in this run and is not yet diagnosed. Suspected sources, in priority order:
    1. Reward-shaping additions per AGENTS.md §3.4 — info foraging, structure seeking, void grace period (added between Mar and Apr).
    2. Spawn quality gate (memory:repo/spawn_quality_gate_and_void_fixes_20260327).
    3. Mesh-repair pipeline (memory:repo/mesh_repair_scene_graph_transforms_20260409).

J3. What would actually move throughput on sm_61 [x]

None of the items below are blocked by code we can write — they are blocked by hardware or by complex architecture work not yet justified on a 3-SM laptop GPU. Re-evaluate all of them on the next-generation environment.

Path Expected impact Blocker
torch.compile on RayViT + reward helpers Eliminates ~50% of dispatcher gaps requires sm_70+
Hardware-fused mamba-ssm (Triton kernels) Closes the 15× kernel-count gap vs GRU not available on Windows
PPO/rollout double-buffer overlap Reclaims ~1000ms GPU idle per PPO window complex architecture work; payoff scales with GPU size
CUDA graph capture of full step path Replays entire step as one submission infeasible — data-dependent control flow
Larger fleet size (8–16+ actors) Amortizes dispatcher overhead MX150 has only 3 SMs — 4 actors already saturates schedulable work

J4. Pickup checklist for new hardware [ ]

When the new environment arrives, work this list in order before re-opening any optimization tickets:

  1. Confirm GPU compute capability is sm_70+ (i.e. torch.compile viable).
  2. Re-run scripts/benchmark_canonical_stack.py with current defaults (mamba2 + 256) to establish the new hardware baseline.
  3. Re-run scripts/run-cadence-compare.ps1 -ProfileCudaEvents to measure leverage A and B and populate lever C attribution simultaneously.
  4. Investigate the GRU regression on the active machine before dismissing it — if it persists on new hardware, the cause is in the recently-landed Apr 2026 reward/spawn/mesh changes, not in the temporal core.
  5. Enable torch.compile(fullgraph=True) on RayViT + reward helpers; on sm_70+ this is expected to be the single largest win.
  6. Only after (1)–(5), reconsider hardware-fused mamba-ssm, double-buffer overlap, and larger fleet sizes.

J5. Documentation updated [x]

  • docs/PERFORMANCE.md §4.0 already documents the structural sm_61 ceiling and the table of "what would actually move the needle." A new §4.1 records the Apr 2026 closeout measurement and the GRU regression observation.
  • This Phase J section is the single authoritative source for "what was attempted, what worked, what didn't, and what to do on the next hardware."