- keep PPO update attribution on the canonical learner path with explicit sub-stage means for minibatch fetch, minibatch prep, policy evaluation, backward, gradient clip, optimizer step, RND step, and progress-callback overhead
- when diagnostic CUDA event profiling is enabled, canonical PPO learner stage timings must synchronize around eval, backward, clip, optimizer, and RND so device time is attributed truthfully rather than inferred from host dispatch latency
Subsystem: Brain Layer - Sacred Cognitive Engine
Package: navi-actor
Status: Active canonical specification
Policy: See AGENTS.md for implementation rules and non-negotiables
The actor is the sacred cognitive core of Navi. It transforms spherical observations into continuous 4-DOF motion commands while maintaining temporal context and exploration pressure.
The core design rule is unchanged:
- environment and compiler work may improve the runtime beneath the actor
- but the actor-side cognitive pipeline itself remains fixed
The active production architecture therefore improves the system below the actor boundary while preserving the actor's observation and action semantics.
The actor domain owns:
- the Ray-ViT perception stack
- temporal sequence modeling
- actor-critic action and value heads
- intrinsic curiosity via RND
- episodic memory and loop avoidance
- rollout buffer management
- PPO update logic
- coarse actor telemetry publication
The actor domain does not own:
- corpus compilation
- raw dataset normalization
- world stepping implementation
- dashboard rendering
# Runtime service
uv run navi-actor serve --sub tcp://localhost:5559 --pub tcp://*:5557 --mode step --step-endpoint tcp://localhost:5560
# Shortcut
uv run brain
# Canonical training
uv run navi-actor train

Repository wrappers mirror the same surfaces:
./scripts/train.ps1
./scripts/train-all-night.ps1
./scripts/run-dashboard.ps1 --matrix-sub tcp://localhost:5559 --actor-sub tcp://localhost:5557 --step-endpoint tcp://localhost:5560

By default, canonical training:
- discovers the full available corpus
- prepares or reuses compiled `.gmdag` assets
- uses the canonical `256x48` observation contract
- runs continuously until explicitly stopped or bounded by user override
train is the single canonical actor training entrypoint.
It means:
- direct in-process stepping of `SdfDagBackend`
- batched observation and action flow
- tensor-native preference when runtime seams are available
- preallocated batched rollout storage on the active device when the canonical trainer has a fixed rollout horizon
- masked batched hidden-state, auxiliary-state, and episode bookkeeping instead of per-actor Python state in the rollout hot path
- one production rollout/update loop
Alternate training architectures are intentionally not preserved as equal modes.
The actor consumes DistanceMatrix as the public observation contract.
That contract preserves:
- spherical depth
- semantic ids
- validity mask
- pose metadata for diagnostics and replay
The canonical training path stacks public observation semantics into a tensor of
shape (B, 3, Az, El):
- channel `0`: normalized depth
- channel `1`: semantic ids
- channel `2`: valid mask
This is the actor's true runtime input for the hot path.
Canonical default resolution for production training, benchmarks, wrappers, and
diagnostic entrypoints is Az=256, El=48. Lower-resolution defaults are not
part of the active production surface.
The actor internally reasons in 4-DOF form:
- forward
- vertical
- lateral
- yaw
The public Action model remains stable for service mode and diagnostics.
DistanceMatrix semantics
-> stacked observation tensor
-> encoder (rayvit default / spherical_cnn experimental)
-> latent embedding z_t
-> RND and episodic-memory side channels
-> temporal core
-> actor-critic heads
-> action sample and value estimate
This pipeline is sacred and immutable at the architectural level.
Compiler/runtime improvements must preserve it rather than pressure it into new
sensor-specific branches. The encoder is selectable via --encoder-backend
or NAVI_ACTOR_ENCODER_BACKEND; see ENCODER_ARCHITECTURE.md.
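A minimal sketch of how the encoder selector could be resolved from the two surfaces named above. The precedence order (CLI flag, then environment variable, then the `rayvit` default) and the helper name `resolve_encoder_backend` are assumptions for illustration, not the repository's implementation.

```python
import argparse

# Backends named in this document; the tuple itself is illustrative.
SUPPORTED_ENCODERS = ("rayvit", "spherical_cnn")


def resolve_encoder_backend(argv, environ):
    """Pick the encoder backend: CLI flag, then env var, then the default."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--encoder-backend", default=None)
    args, _unknown = parser.parse_known_args(argv)

    # Assumed precedence: an explicit flag overrides the environment.
    choice = (
        args.encoder_backend
        or environ.get("NAVI_ACTOR_ENCODER_BACKEND")
        or "rayvit"
    )
    if choice not in SUPPORTED_ENCODERS:
        raise ValueError(f"unsupported encoder backend: {choice!r}")
    return choice


print(resolve_encoder_backend([], {}))
print(resolve_encoder_backend([], {"NAVI_ACTOR_ENCODER_BACKEND": "spherical_cnn"}))
print(resolve_encoder_backend(
    ["--encoder-backend", "rayvit"],
    {"NAVI_ACTOR_ENCODER_BACKEND": "spherical_cnn"},
))
```

In a real entrypoint the arguments would be `sys.argv[1:]` and `os.environ`. Passing them in explicitly keeps the resolution logic testable.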
Module: perception.py
The canonical default is rayvit — a Vision Transformer that treats spherical
observations as structured patches with fixed positional meaning. An
experimental spherical_cnn encoder replaces quadratic-cost self-attention with
convolutions on the spherical grid, whose cost is linear in ray count, using circular azimuth padding.
Current properties (all encoders):
- input shape `(B, 3, Az, El)`
- output latent embedding `z_t` with configurable embedding dimension
- encoder contract: must consume `(B, 3, Az, El)` and produce `(B, embedding_dim)`
Ray-ViT (rayvit) properties:
- strided `Conv2d` patch projection with `patch_size=8`
- `nn.TransformerEncoder` over patch tokens plus one `[CLS]` token
- fixed spherical positional structure rather than ad hoc flattened features
SphericalCNN (spherical_cnn) properties:
- `CircularAzimuthConv2d` for azimuth wrap, zero-padded elevation
- depthwise-separable convolution blocks for FLOP efficiency
- global average pooling instead of massive linear projection
- ~226K params (vs RayViT's ~306K), ~29M FLOPs forward (vs ~150M)
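The padding scheme behind azimuth wrap with zero-padded elevation can be sketched on plain lists. This illustrates only the boundary handling, not the convolution itself. The helper `pad_spherical` and the `grid[az][el]` layout are assumptions, and the real module presumably operates on tensors.

```python
def pad_spherical(grid, pad):
    """Pad an Az x El grid: wrap around in azimuth, zero-fill in elevation."""
    az = len(grid)
    if not 0 < pad <= az:
        raise ValueError("pad must be between 1 and Az")
    zeros = [0.0] * pad
    # Elevation does not wrap, so each azimuth row gets zeros on both ends.
    rows = [zeros + list(row) + zeros for row in grid]
    # Azimuth is periodic: the last rows precede row 0 and the first rows
    # follow the final row.
    wrapped = rows[-pad:] + rows + rows[:pad]
    return [list(row) for row in wrapped]


grid = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # Az=3, El=2
padded = pad_spherical(grid, 1)
for row in padded:
    print(row)
```

After padding, the first output row is a copy of the last azimuth row and the last output row is a copy of the first. A kernel sliding across the azimuth seam therefore sees its true neighbours instead of an artificial edge.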
Architectural consequence:
- runtime upgrades must preserve spherical observation semantics
- imported ideas such as foveation remain optional future work and do not alter the current production contract
The current encoder is not resolution-linear.
Patch-token count grows as:
(Az / 8) * (El / 8)
and the encoder then applies full self-attention over those tokens, whose cost grows with the square of the token count. Encoder cost therefore grows much faster than raw ray count once the profile is widened.
Concrete token counts on the active benchmark profiles are:
- `256x48` -> `192` patch tokens
- `384x72` -> `432` patch tokens
- `512x96` -> `768` patch tokens
- `768x144` -> `1728` patch tokens
That is why the environment runtime and the actor do not scale the same way. The environment mostly tracks ray-count growth; the actor additionally pays the transformer token-growth penalty during PPO evaluation.
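The token counts above can be reproduced with a few lines of arithmetic. The sketch assumes non-overlapping patches (stride equal to `patch_size`) and counts attention pairs as `(tokens + 1) ** 2` to include the `[CLS]` token. The helper `patch_tokens` is illustrative.

```python
def patch_tokens(az, el, patch_size=8):
    """Number of patch tokens for an Az x El observation (excludes [CLS])."""
    if az % patch_size or el % patch_size:
        raise ValueError("resolution must be divisible by patch_size")
    return (az // patch_size) * (el // patch_size)


base_rays = 256 * 48
base_pairs = (patch_tokens(256, 48) + 1) ** 2  # +1 for the [CLS] token

for az, el in [(256, 48), (384, 72), (512, 96), (768, 144)]:
    tokens = patch_tokens(az, el)
    pairs = (tokens + 1) ** 2  # full self-attention compares every token pair
    print(
        f"{az}x{el}: {tokens} tokens, "
        f"rays x{az * el / base_rays:.2f}, "
        f"attention pairs x{pairs / base_pairs:.2f}"
    )
```

Ray count and token count grow by the same factor, but the attention-pair count grows by roughly the square of that factor. That is the token-growth penalty the actor pays and the environment does not.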
The March 2026 resolution sweep established the following on the active 4-actor canonical trainer surface:
- `256x48`: about `49.6 SPS`, `ppo_update_ms` about `1019.68`
- `384x72`: about `49.34 SPS`, `ppo_update_ms` about `1249.87`
- `512x96`: about `43.96 SPS`, `ppo_update_ms` about `17731.88`
- `768x144`: trainer OOM on the active MX150 surface inside transformer self-attention during PPO update
Those results come from the canonical bounded trainer artifacts under
artifacts/benchmarks/resolution-compare/ and they matter architecturally:
- the environment runtime is no longer the only performance story
- the actor encoder is already a first-class scaling limit
- future temporal-core promotion alone will not make the full trainer scale linearly with observation resolution
Module: rnd.py
RND provides intrinsic novelty pressure by comparing a frozen target projection against a trainable predictor.
Current role in the production system:
- encourage exploration in novel latent regions
- remain separate from the public wire model
- contribute to the total reward formulation without widening service contracts
The predictor architecture was already simplified in earlier repo work to avoid collapsing the intrinsic signal.
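A toy sketch of the frozen-target versus trainable-predictor comparison. It uses small linear projections in place of the real networks, so the shapes, learning rate, and helper names are all assumptions. The point is only that prediction error on a repeatedly visited embedding shrinks while the target stays fixed.

```python
import random


def make_projection(in_dim, out_dim, rng):
    """Random linear projection stored as out_dim rows of in_dim weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, z):
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def novelty(target, predictor, z):
    """Intrinsic signal: mean squared error between predictor and frozen target."""
    t, p = project(target, z), project(predictor, z)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)


def train_predictor(target, predictor, z, lr=0.05):
    """One SGD step on the predictor only; the target is never updated."""
    t, p = project(target, z), project(predictor, z)
    for row, pi, ti in zip(predictor, p, t):
        grad_scale = 2.0 * (pi - ti) / len(t)
        for j, x in enumerate(z):
            row[j] -= lr * grad_scale * x


rng = random.Random(0)
target = make_projection(4, 3, rng)     # frozen
predictor = make_projection(4, 3, rng)  # trainable
z_seen = [0.5, -0.2, 0.1, 0.9]

before = novelty(target, predictor, z_seen)
for _ in range(200):
    train_predictor(target, predictor, z_seen)
after = novelty(target, predictor, z_seen)
print(before, after)  # the second value is far smaller than the first
```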
Module: memory/episodic.py
The current canonical system uses a tensor-native cosine-similarity memory buffer with fixed-capacity overwrite behavior.
Architecturally important rules:
- embeddings stay on the active policy device
- exclusion windows avoid trivial self-matches
- capacity enforcement must stay amortized
- full rebuilds on every post-capacity insert are not acceptable on the canonical training path
- when the same rollout embedding batch is both queried and inserted in one trainer tick, normalization must be computed once and reused across both memory operations
- when sparse step telemetry is enabled, action and done mirrors should be batch-copied once for the selected telemetry tick rather than extracted via per-actor device synchronization
This subsystem is now part of the performance conversation, not just the behavior conversation.
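The rules above can be illustrated with a small list-based ring buffer. This is a sketch under stated assumptions: the class `RingCosineMemory`, its parameters, and the choice to floor similarity at zero are hypothetical, and the canonical implementation is tensor-native rather than pure Python.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class RingCosineMemory:
    """Fixed-capacity memory; the oldest slot is overwritten in O(1)."""

    def __init__(self, capacity, exclusion_window):
        self.capacity = capacity
        self.exclusion_window = exclusion_window
        self.slots = [None] * capacity  # unit-norm embeddings
        self.steps = [None] * capacity  # step at which each slot was written
        self.cursor = 0

    def max_similarity(self, unit_vec, step):
        """Best cosine similarity (floored at 0) outside the exclusion window."""
        best = 0.0
        for stored, written in zip(self.slots, self.steps):
            if stored is None or step - written <= self.exclusion_window:
                continue  # empty slot, or too recent to count as a loop
            best = max(best, sum(a * b for a, b in zip(stored, unit_vec)))
        return best

    def insert(self, unit_vec, step):
        # Overwrite in place: no rebuild or reshuffle once capacity is reached.
        self.slots[self.cursor] = unit_vec
        self.steps[self.cursor] = step
        self.cursor = (self.cursor + 1) % self.capacity


memory = RingCosineMemory(capacity=4, exclusion_window=2)
trajectory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.0, -1.0], [2.0, 0.0]]
for step, embedding in enumerate(trajectory):
    unit = normalize(embedding)  # normalized once per tick
    similarity = memory.max_similarity(unit, step)
    memory.insert(unit, step)    # the same normalized vector is reused here
    print(step, round(similarity, 3))
```

The last embedding points in the same direction as the first, so it is the only step that reports a loop. Its insert then overwrites slot 0 without touching the other entries.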
The current production direction for trainer-side throughput is:
- keep rollout observations, actions, values, rewards, dones, and auxiliary tensors in reusable batched device storage
- batch hidden-state carry and reset with done masks
- reserve host extraction for sparse telemetry, checkpoints, and passive observation publication only
- when completed episodes must be published, actor id, return, and length should be batch-copied once for the done set rather than mirrored through fragmented per-field host transfers
- keep aggregate reward accounting on-device across rollout ticks and materialize it on the host only when producing final training metrics
- when PPO update summaries are needed, learner loss metrics should be materialized through one packed epoch-end host copy rather than repeated scalar extraction calls
- keep PPO update attribution on the canonical learner path with explicit sub-stage means for minibatch prep, policy evaluation, backward, gradient clip, optimizer step, and RND step
- when `seq_len > 0`, canonical PPO minibatches must already carry sequence-native observation, action, and auxiliary views rather than asking the learner to rebuild them from flat tensors
- canonical rollout buffers must normalize advantages once per finalized rollout and reuse that cached tensor across PPO epochs rather than re-normalizing during each sampling pass or leaving the batched path raw
- canonical PPO learner minibatch prep must not rebuild or copy hidden-state batches on the update path while the production trainer keeps one hidden-free sequence seam across supported temporal backends
- canonical trainer rollout code should keep hidden-state storage out of the hot path unless a benchmark-proven production path requires it for correctness
- batched rollout buffers and minibatches should avoid hidden-state slabs on the canonical path while still preserving the recurrent API seam for supported alternate backends and future promotions
- when `MultiTrajectoryBuffer` is running in its canonical batched mode (`capacity` set), it must not also allocate per-actor fallback `TrajectoryBuffer` wrappers; the production rollout store is the batched slab itself
- completed-episode host extraction must happen only for actors whose sparse episode telemetry will actually be published; canonical reset bookkeeping for done actors stays on-device
- initial tensor resets and live runtime steps must materialize `DistanceMatrix` observations only for actors whose dashboard observation stream will actually publish; when observation streaming is disabled, canonical training must request no published observations at all
This preserves the sacred actor topology while removing avoidable Python orchestration above the already-batched CUDA environment step.
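One of the rules above, normalizing advantages once per finalized rollout and reusing the cached result across PPO epochs, can be sketched as follows. The class `RolloutAdvantages`, the population standard deviation, and the `eps` value are assumptions for illustration. The production buffer holds device tensors, not Python lists.

```python
import statistics


class RolloutAdvantages:
    """Holds raw advantages and normalizes them once at rollout finalization."""

    def __init__(self, raw_advantages, eps=1e-8):
        self.raw = list(raw_advantages)
        self.eps = eps
        self._normalized = None
        self.normalize_calls = 0

    def finalize(self):
        mean = statistics.fmean(self.raw)
        std = statistics.pstdev(self.raw)
        self._normalized = [(a - mean) / (std + self.eps) for a in self.raw]
        self.normalize_calls += 1

    def minibatch(self, indices):
        if self._normalized is None:
            raise RuntimeError("finalize() must run before sampling")
        # Sampling only indexes the cached list; it never re-normalizes.
        return [self._normalized[i] for i in indices]


adv = RolloutAdvantages([1.0, 2.0, 3.0, 4.0])
adv.finalize()
for epoch in range(3):  # several PPO epochs, one normalization
    for indices in ([0, 1], [2, 3]):
        batch = adv.minibatch(indices)
print(adv.normalize_calls)
```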
Module: cognitive_policy.py
Canonical runtime now exposes one controlled selector on the same sacred actor topology: mamba2 is the default backend, with gru and mambapy available as supported comparison backends.
Current architectural status:
- one canonical actor topology and one canonical trainer surface in production
- all production training and inference surfaces share the same temporal-core selector contract
- `mamba2` is the default backend, proven by 25K-step training comparison to deliver significantly better learning quality
- `gru` and `mambapy` are supported alternate backends for controlled comparisons on that same surface
- fused Mamba-2 (`mamba-ssm`) remains a future hardware-fused upgrade target, not the current production dependency
The temporal core sits directly on the actor's hottest remaining wall-clock
path: CognitiveMambaPolicy.evaluate_sequence() during PPO BPTT updates.
A controlled 25K-step head-to-head training comparison on the active MX150 machine established the decision:
| Metric | GRU (cuDNN) | Mamba2 (Pure-PyTorch SSD) |
|---|---|---|
| Final reward_ema | -1.48 | -0.88 |
| Rollout SPS | ~100 | ~72 |
| Forward pass | 6–8 ms | 13–18 ms |
| Wall-clock (25K steps) | 7m 28s | 9m 37s |
Mamba2 SSD's selective state-space mechanism (8,192-dim effective state vs GRU's 128-dim hidden) provides meaningfully better long-range situational awareness across partial observability, which shows up as more stable and higher-quality late-game reward.
The throughput gap is modest (1.29x wall-clock) because PPO optimizer cost dominates total training time regardless of temporal core choice.
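The quoted ratio follows from the table's wall-clock figures. The short calculation below also compares it with the rollout SPS ratio, using the table's approximate SPS values.

```python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds


gru_s = to_seconds(7, 28)     # GRU wall-clock for 25K steps
mamba2_s = to_seconds(9, 37)  # Mamba2 wall-clock for 25K steps

wall_clock_ratio = mamba2_s / gru_s
rollout_sps_ratio = 100 / 72  # approximate rollout SPS values from the table

print(f"wall-clock ratio:  {wall_clock_ratio:.2f}x")
print(f"rollout SPS ratio: {rollout_sps_ratio:.2f}x")
```

The wall-clock ratio comes out smaller than the rollout SPS ratio. This is consistent with a shared PPO update cost diluting the temporal core's rollout-time difference.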
The pure-PyTorch SSD implementation requires no Triton, no causal-conv1d wheels, and no custom C++ extensions — it uses only standard PyTorch ops (einsum, cumsum, exp, tril) and works on any CUDA-capable PyTorch install.
The selector contract is now the architecture policy, not ad hoc tuning.
- canonical actor environments must expose one temporal-core selector across config, CLI, and wrappers
- `mamba2` is the default backend and must continue to work with no extra extension build requirement
- `gru` and `mambapy` are supported only through that same selector on the same canonical trainer and serve surfaces
- fused `mamba-ssm` remains a migration target and may be benchmarked, but it is not part of the current supported production selector set
Before a future temporal-core candidate can replace the default, it must demonstrate better learning quality or throughput in a controlled head-to-head comparison. The upgrade checklist is:
- run a bounded training comparison (25K+ steps) on the canonical trainer surface
- compare final reward_ema, throughput SPS, and wall-clock time
- verify the candidate works on the active training machine with no extra build requirements
- update config defaults, wrappers, tests, and docs together in one pass
- update AGENTS.md to codify the new canonical runtime
The temporal core remains critical because it gives the actor memory across partial observability without reopening transformer-scale quadratic costs.
Module: actor_critic.py
The actor-critic heads project temporal features into:
- Gaussian action distribution parameters
- scalar value estimates
Architectural consequence:
- action semantics remain stable even if the environment stepping seam becomes more tensor-native underneath
- PPO remains able to evaluate log-probabilities and clipped importance ratios over the same action meaning
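For reference, the standard PPO clipped surrogate for a single sample is sketched below. The `clip_eps=0.2` default is a common choice and an assumption here, and the helper name is illustrative. The document does not state the trainer's actual clip range.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO objective (to be maximized) for one action."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# The ratio is only meaningful if both log-probabilities score the same
# 4-DOF action under the same action semantics.
unchanged = clipped_surrogate(logp_new=-1.0, logp_old=-1.0, advantage=2.0)
capped = clipped_surrogate(logp_new=-0.5, logp_old=-1.0, advantage=2.0)
uncapped = clipped_surrogate(logp_new=-0.5, logp_old=-1.0, advantage=-2.0)
print(unchanged, capped, uncapped)
```

With a positive advantage the gain is capped at `(1 + clip_eps) * advantage`. With a negative advantage the unclipped, more pessimistic term is kept.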
The total training signal combines:
- environment-side reward and penalties
- intrinsic novelty pressure from RND
- loop-avoidance penalties from episodic memory
The environment and actor each own different parts of the shaping problem. That split is intentional:
- environment owns geometry-derived shaping directly tied to observations
- actor owns latent novelty and memory-derived shaping
The environment reward engine computes nine tensor components per step:
- Exploration reward: decays with spatial visit count; includes heading novelty and frontier adjacency bonuses.
- Progress reward: proportional to displacement, discounted by proximity ratio so approaching walls yields diminishing credit.
- Clearance-delta reward: positive when the actor increases free space while within the obstacle clearance window (`3.0 m`).
- Starvation penalty: fires when horizon-saturated rays dominate the spherical observation.
- Proximity penalty: scales with the fraction of near-field valid hits below the proximity threshold (`2.0 m`).
- Structure-band reward: rewards stable mid-range geometry visibility.
- Forward-structure reward: rewards informative geometry in the forward sector.
- Inspection reward: rewards orientation changes that gain structure information.
- Collision penalty: velocity-scaled as `penalty * (1 + speed_norm)`, so fast crashes are punished more severely than gentle grazing.
All terms are batched CUDA tensors derived from the existing spherical observation. They do not require a second sensing pipeline.
Exploration rewards are additionally clearance-gated: rewards are multiplied by
clamp(current_clearance, 0, 1) so exploring into tight spaces yields
diminishing exploration credit.
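The two formulas quoted in this section are sketched as scalar functions below. The numeric inputs (a base penalty of `-1.0`, an exploration reward of `0.5`) are placeholders, and the production terms are batched CUDA tensors rather than Python floats.

```python
def clamp(value, low, high):
    return max(low, min(high, value))


def collision_penalty(base_penalty, speed_norm):
    """Velocity-scaled collision term: penalty * (1 + speed_norm)."""
    return base_penalty * (1.0 + speed_norm)


def gated_exploration(exploration_reward, current_clearance):
    """Clearance gate: scale exploration credit by clamp(clearance, 0, 1)."""
    return exploration_reward * clamp(current_clearance, 0.0, 1.0)


print(collision_penalty(-1.0, speed_norm=0.0))          # gentle graze
print(collision_penalty(-1.0, speed_norm=1.0))          # full-speed crash
print(gated_exploration(0.5, current_clearance=0.25))   # tight space
print(gated_exploration(0.5, current_clearance=3.0))    # open space
```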
The actor's RewardShaper adds:
- Existential tax: constant per-step cost (`-0.02`).
- Velocity reward: disabled by default (`weight=0.0`) to remove forward approach bias near obstacles.
- Intrinsic novelty: RND prediction-error signal, annealed over training.
- Loop penalty: cosine-similarity detection against episodic memory.
- Loop temporal decay: exponential half-life weighting on loop detection.
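A half-life weighting can be written in one line. How `RewardShaper` applies it is not specified here, so the sketch below assumes the weight scales a loop penalty by the age of the matched memory entry. The function names, the penalty coefficient, and the half-life value are all hypothetical.

```python
def half_life_weight(age_steps, half_life_steps):
    """Exponential decay that halves every half_life_steps."""
    return 0.5 ** (age_steps / half_life_steps)


def decayed_loop_penalty(similarity, age_steps, coefficient=1.0, half_life_steps=100):
    """Assumed form: penalty shrinks as the matched memory entry gets older."""
    return -coefficient * similarity * half_life_weight(age_steps, half_life_steps)


for age in (0, 100, 200):
    print(age, half_life_weight(age, 100), decayed_loop_penalty(0.9, age))
```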
The actor reasons in 4-DOF continuous form:
- `forward`: forward/backward translation
- `vertical`: up/down translation
- `lateral`: left/right strafe
- `yaw`: horizontal rotation
This covers all six movement directions (forward, backward, up, down, left, right) plus heading control through yaw. Pitch and roll are intentionally excluded to keep the action space simple and the observation contract stable.
Module: training/ppo_trainer.py
The current production trainer is a direct in-process PPO runtime with a strong preference for runtime tensor seams.
Current production traits:
- tensor-native reset seeding when available
- tensor-native batched stepping when available
- selective publication only for actors that need passive observation
- inline PPO updates at rollout boundaries
- coarse dashboard heartbeat publication during optimizer windows
- configurable passive dashboard observation cadence with a canonical default of about `10 Hz` for the selected actor
The previous async optimizer-worker design is no longer the production runtime.
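The passive observation cadence listed among the production traits amounts to a rate limiter on publication. A minimal sketch follows, assuming integer millisecond timestamps and a hypothetical `PublishThrottle` helper. The trainer's actual clock source and publication call are not shown.

```python
class PublishThrottle:
    """Allow at most `hz` publications per second for the selected actor."""

    def __init__(self, hz=10):
        self.min_interval_ms = 1000 // hz
        self.last_publish_ms = None

    def should_publish(self, now_ms):
        due = (
            self.last_publish_ms is None
            or now_ms - self.last_publish_ms >= self.min_interval_ms
        )
        if due:
            self.last_publish_ms = now_ms
        return due


throttle = PublishThrottle(hz=10)
# Simulate rollout ticks every 20 ms across one second of training.
published = sum(throttle.should_publish(tick * 20) for tick in range(50))
print(published)
```

Integer timestamps avoid floating-point drift in the interval comparison. Ticks that are not due cost only a subtraction and a comparison, which keeps publication subordinate to rollout throughput.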
Module: rollout_buffer.py
The rollout buffer stores trajectory information for PPO and sequential minibatching.
Current production rules:
- keep actor trajectories logically separate for PPO and BPTT correctness
- cache stacked rollout tensors once per PPO update when practical
- keep minibatch shuffle indices on the same device as sampled tensors
- avoid rebuilding Python transition objects unnecessarily when batched tensor fields already exist
The current actor bottleneck map includes:
- remaining host extraction inside rollout steps
- observation or action materialization performed only for passive publication
- episodic-memory query and append cost
- PPO minibatch assembly and optimizer wall time
- telemetry publication overhead
This means actor-side runtime code must now be treated like systems code. The environment is no longer the only place where performance decisions matter.
The actor publishes coarse telemetry for:
- policy and value optimization metrics
- reward summaries
- throughput and timing summaries
- dashboard heartbeats when optimizer windows would otherwise starve the UI
Canonical observer rule:
- passive matrix publication stays low-volume and actor-filtered
- the default cadence is intentionally much lower than UI render cadence but high enough to feel live during training inspection
Important rule:
- observability is allowed
- but it must remain coarse, safe, and subordinate to rollout throughput
The canonical trainer uses one end-to-end pipeline only:
- one compiled-scene backend
- one sacred actor pipeline
- one PPO update surface
- one benchmarked production runtime
There are no equal-status alternate production trainers.
Module: training/bc_trainer.py
Behavioral cloning provides a supervised pre-training path that trains the same
sacred CognitiveMambaPolicy from human navigation demonstrations. The BC
trainer shares the evaluate_sequence() forward pass with PPO but uses
maximum-likelihood loss instead of the clipped surrogate objective.
Architectural properties:
- trains the identical pipeline: encoder → `TemporalCore` → `ActorCriticHeads`
- uses BPTT sequences to preserve temporal-core context
- freezes `log_std` during BC to preserve exploration capacity for subsequent PPO fine-tuning
- produces standard v3 checkpoints loadable by `PpoTrainer.load_training_state()`
- supports `--checkpoint` for incremental improvement across scenes
Demonstration capture occurs in the auditor project (DemonstrationRecorder),
not in the actor. This preserves the boundary: the auditor records passively,
the actor trains. The data exchange format is .npz archives with observations
matching (N, 3, Az, El) and actions matching (N, 4) in normalised policy
space.
- Load all `.npz` demonstration files from the demonstrations directory.
- Chunk concatenated observations and actions into BPTT sequences.
- Shuffle and iterate in minibatches through the full dataset per epoch.
- Compute loss: `L = -E[log pi(a|o)] - beta * H(pi)` where `H` is entropy.
- Gradient clip and Adam optimiser step.
- Save v3 checkpoint with fresh RND weights.
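The loss in the steps above can be written out for a diagonal Gaussian policy over the four action dimensions. This is a scalar sketch under stated assumptions: a state-independent `log_std`, a hypothetical `beta=0.01`, and illustrative helper names. The trainer itself computes this over batched sequence tensors.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        z = (a - mu) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * LOG_2PI
    return total


def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian; depends only on log_std."""
    return sum(0.5 + 0.5 * LOG_2PI + ls for ls in log_std)


def bc_loss(batch, log_std, beta=0.01):
    """L = -E[log pi(a|o)] - beta * H(pi) over (demo_action, policy_mean) pairs."""
    nll = -sum(gaussian_log_prob(a, mu, log_std) for a, mu in batch) / len(batch)
    return nll - beta * gaussian_entropy(log_std)


log_std = [0.0, 0.0, 0.0, 0.0]  # held fixed, as BC freezes log_std
demo = [0.2, 0.0, -0.1, 0.5]    # forward, vertical, lateral, yaw
perfect = bc_loss([(demo, demo)], log_std)
off = bc_loss([(demo, [0.0, 0.0, 0.0, 0.0])], log_std)
print(perfect, off)
```

The loss is lowest when the policy mean reproduces the demonstrated action. With unit standard deviation the gap between the two cases is half the squared error between mean and demonstration.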
# Single-scene workflow
uv run --project projects/auditor explore --record --gmdag-file <scene.gmdag>
uv run --project projects/actor brain bc-pretrain
# Incremental multi-scene
uv run --project projects/actor brain bc-pretrain --checkpoint artifacts/checkpoints/bc_base_model.pt
# Multi-scene exploration then training
./scripts/run-explore-scenes.ps1
./scripts/run-bc-pretrain.ps1

All training sources (RL, BC, inference) emit v3 checkpoints with enriched
metadata including step_id, episode_count, reward_ema, parent_checkpoint,
training_source, temporal_core, and corpus_summary. Only v3 checkpoints
are accepted; loading v2 or older will fail fast.
When no --checkpoint is specified, both the train CLI and
run-ghost-stack.ps1 -Train automatically resume from
artifacts/models/latest.pt if it exists. This enables seamless accumulation
across BC pre-training, RL sessions, and nightly validation.
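The resume rule reduces to a small resolution function. The sketch below assumes an explicit `--checkpoint` always wins and that a missing `latest.pt` means a fresh start. The helper name `resolve_resume_checkpoint` is hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_resume_checkpoint(cli_checkpoint, latest=Path("artifacts/models/latest.pt")):
    """Return the checkpoint to resume from, or None to start fresh."""
    if cli_checkpoint is not None:
        return Path(cli_checkpoint)  # an explicit --checkpoint always wins
    return latest if latest.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    latest = Path(tmp) / "latest.pt"
    print(resolve_resume_checkpoint(None, latest))  # None: nothing to resume yet
    latest.touch()
    print(resolve_resume_checkpoint(None, latest) == latest)
    print(resolve_resume_checkpoint("bc_base_model.pt", latest))
```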
After training completes, the trainer auto-promotes the final checkpoint to the
model registry when its reward_ema exceeds the current latest.
Every checkpoint records parent_checkpoint to maintain the full training
lineage:
BC Baseline (v001) → RL Fine-tuned (v002) → RL Continued (v003)
source: bc           source: rl             source: rl
parent: null         parent: v001           parent: v002
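Because each checkpoint records its parent, the full lineage can be recovered by following `parent_checkpoint` links. The in-memory `registry` dict and the `lineage` helper below are illustrative assumptions. Only the `training_source` and `parent_checkpoint` field names come from the checkpoint metadata described above.

```python
def lineage(registry, version):
    """Follow parent_checkpoint links from `version` back to the root."""
    chain = []
    seen = set()
    while version is not None:
        if version in seen:
            raise ValueError(f"cycle in checkpoint lineage at {version}")
        seen.add(version)
        meta = registry[version]
        chain.append((version, meta["training_source"]))
        version = meta["parent_checkpoint"]
    return list(reversed(chain))  # oldest ancestor first


# Metadata mirrors the three-checkpoint example; real checkpoints carry more fields.
registry = {
    "v001": {"training_source": "bc", "parent_checkpoint": None},
    "v002": {"training_source": "rl", "parent_checkpoint": "v001"},
    "v003": {"training_source": "rl", "parent_checkpoint": "v002"},
}
print(lineage(registry, "v003"))
```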
| Command | Purpose |
|---------|---------|
| `brain promote <path>` | Register a checkpoint in the model registry |
| `brain models` | List all promoted models with metadata |
| `brain evaluate <path>` | Bounded inference with quality metrics |
| `brain compare <a> <b>` | Side-by-side checkpoint comparison |
See docs/TRAINING.md § 6 for format details and registry structure.
- docs/ARCHITECTURE.md
- docs/SIMULATION.md
- docs/DATAFLOW.md
- docs/PERFORMANCE.md
- docs/TRAINING.md