ACTOR.md

keep PPO update attribution on the canonical learner path with explicit sub-stage means for minibatch fetch, minibatch prep, policy evaluation, backward, gradient clip, optimizer step, RND step, and progress-callback overhead
when diagnostic CUDA event profiling is enabled, canonical PPO learner stage timings must synchronize around eval, backward, clip, optimizer, and RND so device time is attributed truthfully rather than inferred from host dispatch latency

ACTOR.md - Canonical Cognitive Actor Architecture

Subsystem: Brain Layer - Sacred Cognitive Engine Package: navi-actor Status: Active canonical specification Policy: See AGENTS.md for implementation rules and non-negotiables

1. Executive Summary

The actor is the sacred cognitive core of Navi. It transforms spherical observations into continuous 4-DOF motion commands while maintaining temporal context and exploration pressure.

The core design rule is unchanged:

environment and compiler work may improve the runtime beneath the actor
but the actor-side cognitive pipeline itself remains fixed

The active production architecture therefore improves the system below the actor boundary while preserving the actor's observation and action semantics.

2. Canonical Responsibilities

The actor domain owns:

the Ray-ViT perception stack
temporal sequence modeling
actor-critic action and value heads
intrinsic curiosity via RND
episodic memory and loop avoidance
rollout buffer management
PPO update logic
coarse actor telemetry publication

The actor domain does not own:

corpus compilation
raw dataset normalization
world stepping implementation
dashboard rendering

3. Canonical Launch Surfaces

# Runtime service
uv run navi-actor serve --sub tcp://localhost:5559 --pub tcp://*:5557 --mode step --step-endpoint tcp://localhost:5560

# Shortcut
uv run brain

# Canonical training
uv run navi-actor train

Repository wrappers mirror the same surfaces:

./scripts/train.ps1
./scripts/train-all-night.ps1
./scripts/run-dashboard.ps1 --matrix-sub tcp://localhost:5559 --actor-sub tcp://localhost:5557 --step-endpoint tcp://localhost:5560

By default, canonical training:

discovers the full available corpus
prepares or reuses compiled .gmdag assets
uses the canonical 256x48 observation contract
runs continuously until explicitly stopped or bounded by user override

4. Canonical Training Surface

train is the single canonical actor training entrypoint. It means:

direct in-process stepping of SdfDagBackend
batched observation and action flow
tensor-native preference when runtime seams are available
preallocated batched rollout storage on the active device when the canonical trainer has a fixed rollout horizon
masked batched hidden-state, auxiliary-state, and episode bookkeeping instead of per-actor Python state in the rollout hot path
one production rollout/update loop

Alternate training architectures are intentionally not preserved as equal modes.

5. Input And Output Contracts

5.1 Public Observation Boundary

The actor consumes DistanceMatrix as the public observation contract. That contract preserves:

spherical depth
semantic ids
validity mask
pose metadata for diagnostics and replay

5.2 Internal Observation Tensor

The canonical training path stacks public observation semantics into a tensor of shape (B, 3, Az, El):

channel 0: normalized depth
channel 1: semantic ids
channel 2: valid mask

This is the actor's true runtime input for the hot path.

Canonical default resolution for production training, benchmarks, wrappers, and diagnostic entrypoints is Az=256, El=48. Lower-resolution defaults are not part of the active production surface.

5.3 Action Boundary

The actor internally reasons in 4-DOF form:

forward
vertical
lateral
yaw

The public Action model remains stable for service mode and diagnostics.

6. Cognitive Pipeline

DistanceMatrix semantics
  -> stacked observation tensor
  -> encoder (rayvit default / spherical_cnn experimental)
  -> latent embedding z_t
  -> RND and episodic-memory side channels
  -> temporal core
  -> actor-critic heads
  -> action sample and value estimate

This pipeline is sacred and immutable at the architectural level. Compiler/runtime improvements must preserve it rather than pressure it into new sensor-specific branches. The encoder is selectable via --encoder-backend or NAVI_ACTOR_ENCODER_BACKEND; see ENCODER_ARCHITECTURE.md.

7. Perception: Encoder (Ray-ViT / SphericalCNN)

Module: perception.py

The canonical default is rayvit — a Vision Transformer that treats spherical observations as structured patches with fixed positional meaning. An experimental spherical_cnn encoder replaces quadratic self-attention with linear convolutions on the spherical grid with circular azimuth padding.

Current properties (all encoders):

input shape (B, 3, Az, El)
output latent embedding z_t with configurable embedding dimension
encoder contract: must consume (B, 3, Az, El) and produce (B, embedding_dim)

Ray-ViT (rayvit) properties:

strided Conv2d patch projection with patch_size=8
nn.TransformerEncoder over patch tokens plus one [CLS] token
fixed spherical positional structure rather than ad hoc flattened features

SphericalCNN (spherical_cnn) properties:

CircularAzimuthConv2d for azimuth wrap, zero-padded elevation
depthwise-separable convolution blocks for FLOP efficiency
global average pooling instead of massive linear projection
~226K params (vs RayViT's ~306K), ~29M FLOPs forward (vs ~150M)

Architectural consequence:

runtime upgrades must preserve spherical observation semantics
imported ideas such as foveation remain optional future work and do not alter the current production contract

7.1 Resolution Scaling Boundary

The current encoder is not resolution-linear.

Patch-token count grows as:

(Az / 8) * (El / 8)

and the encoder then applies full self-attention over those tokens. That means encoder cost grows much faster than raw ray count once the profile is widened.

Concrete token counts on the active benchmark profiles are:

256x48 -> 192 patch tokens
384x72 -> 432 patch tokens
512x96 -> 768 patch tokens
768x144 -> 1728 patch tokens

That is why the environment runtime and the actor do not scale the same way. The environment mostly tracks ray-count growth; the actor additionally pays the transformer token-growth penalty during PPO evaluation.

7.2 Current Measured Consequence

The March 2026 resolution sweep established the following on the active 4-actor canonical trainer surface:

256x48: about 49.6 SPS, ppo_update_ms about 1019.68
384x72: about 49.34 SPS, ppo_update_ms about 1249.87
512x96: about 43.96 SPS, ppo_update_ms about 17731.88
768x144: trainer OOM on the active MX150 surface inside transformer self-attention during PPO update

Those results come from the canonical bounded trainer artifacts under artifacts/benchmarks/resolution-compare/ and they matter architecturally:

the environment runtime is no longer the only performance story
the actor encoder is already a first-class scaling limit
future temporal-core promotion alone will not make the full trainer scale linearly with observation resolution

8. Curiosity: RND Module

Module: rnd.py

RND provides intrinsic novelty pressure by comparing a frozen target projection against a trainable predictor.

Current role in the production system:

encourage exploration in novel latent regions
remain separate from the public wire model
contribute to the total reward formulation without widening service contracts

The predictor architecture was already simplified in earlier repo work to avoid collapsing the intrinsic signal.

9. Loop Avoidance: Episodic Memory

Module: memory/episodic.py

The current canonical system uses a tensor-native cosine-similarity memory buffer with fixed-capacity overwrite behavior.

Architecturally important rules:

embeddings stay on the active policy device
exclusion windows avoid trivial self-matches
capacity enforcement must stay amortized
full rebuilds on every post-capacity insert are not acceptable on the canonical training path
when the same rollout embedding batch is both queried and inserted in one trainer tick, normalization must be computed once and reused across both memory operations
when sparse step telemetry is enabled, action and done mirrors should be batch-copied once for the selected telemetry tick rather than extracted via per-actor device synchronization

This subsystem is now part of the performance conversation, not just the behavior conversation.

9.1 Rollout Storage Direction

The current production direction for trainer-side throughput is:

keep rollout observations, actions, values, rewards, dones, and auxiliary tensors in reusable batched device storage
batch hidden-state carry and reset with done masks
reserve host extraction for sparse telemetry, checkpoints, and passive observation publication only
when completed episodes must be published, actor id, return, and length should be batch-copied once for the done set rather than mirrored through fragmented per-field host transfers
keep aggregate reward accounting on-device across rollout ticks and materialize it on the host only when producing final training metrics
when PPO update summaries are needed, learner loss metrics should be materialized through one packed epoch-end host copy rather than repeated scalar extraction calls
keep PPO update attribution on the canonical learner path with explicit sub-stage means for minibatch prep, policy evaluation, backward, gradient clip, optimizer step, and RND step
when seq_len > 0, canonical PPO minibatches must already carry sequence-native observation, action, and auxiliary views rather than asking the learner to rebuild them from flat tensors
canonical rollout buffers must normalize advantages once per finalized rollout and reuse that cached tensor across PPO epochs rather than re-normalizing during each sampling pass or leaving the batched path raw
canonical PPO learner minibatch prep must not rebuild or copy hidden-state batches on the update path while the production trainer keeps one hidden-free sequence seam across supported temporal backends
canonical trainer rollout code should keep hidden-state storage out of the hot path unless a benchmark-proven production path requires it for correctness
batched rollout buffers and minibatches should avoid hidden-state slabs on the canonical path while still preserving the recurrent API seam for supported alternate backends and future promotions
when MultiTrajectoryBuffer is running in its canonical batched mode (capacity set), it must not also allocate per-actor fallback TrajectoryBuffer wrappers; the production rollout store is the batched slab itself
completed-episode host extraction must happen only for actors whose sparse episode telemetry will actually be published; canonical reset bookkeeping for done actors stays on-device
initial tensor resets and live runtime steps must materialize DistanceMatrix observations only for actors whose dashboard observation stream will actually publish; when observation streaming is disabled, canonical training must request no published observations at all

This preserves the sacred actor topology while removing avoidable Python orchestration above the already-batched CUDA environment step.

10. Temporal Core

Module: cognitive_policy.py

Canonical runtime now exposes one controlled selector on the same sacred actor topology: mamba2 is the default backend, with gru and mambapy available as supported comparison backends.

Current architectural status:

one canonical actor topology and one canonical trainer surface in production
all production training and inference surfaces share the same temporal-core selector contract
mamba2 is the default backend, proven by 25K-step training comparison to deliver significantly better learning quality
gru and mambapy are supported alternate backends for controlled comparisons on that same surface
fused Mamba-2 (mamba-ssm) remains a future hardware-fused upgrade target, not the current production dependency

10.1 Why Mamba2 SSD Is Canonical

The temporal core sits directly on the actor's hottest remaining wall-clock path: CognitiveMambaPolicy.evaluate_sequence() during PPO BPTT updates.

A controlled 25K-step head-to-head training comparison on the active MX150 machine established the decision:

Metric	GRU (cuDNN)	Mamba2 (Pure-PyTorch SSD)
Final reward_ema	-1.48	-0.88
Rollout SPS	~100	~72
Forward pass	6–8 ms	13–18 ms
Wall-clock (25K steps)	7m 28s	9m 37s

Mamba2 SSD's selective state-space mechanism (8,192-dim effective state vs GRU's 128-dim hidden) provides meaningfully better long-range situational awareness across partial observability, which shows up as more stable and higher-quality late-game reward.

The throughput gap is modest (1.29x wall-clock) because PPO optimizer cost dominates total training time regardless of temporal core choice.

The pure-PyTorch SSD implementation requires no Triton, no causal-conv1d wheels, and no custom C++ extensions — it uses only standard PyTorch ops (einsum, cumsum, exp, tril) and works on any CUDA-capable PyTorch install.

10.2 Canonical Enforcement

The selector contract is now the architecture policy, not ad hoc tuning.

canonical actor environments must expose one temporal-core selector across config, CLI, and wrappers
mamba2 is the default backend and must continue to work with no extra extension build requirement
gru and mambapy are supported only through that same selector on the same canonical trainer and serve surfaces
fused mamba-ssm remains a migration target and may be benchmarked, but it is not part of the current supported production selector set

10.3 How To Switch Back Later

If a future temporal core candidate proves better learning quality or throughput in a controlled head-to-head comparison, the upgrade checklist is:

run a bounded training comparison (25K+ steps) on the canonical trainer surface
compare final reward_ema, throughput SPS, and wall-clock time
verify the candidate works on the active training machine with no extra build requirements
update config defaults, wrappers, tests, and docs together in one pass
update AGENTS.md to codify the new canonical runtime

The temporal core remains critical because it gives the actor memory across partial observability without reopening transformer-scale quadratic costs.

11. Decision Layer: Actor-Critic Heads

Module: actor_critic.py

The actor-critic heads project temporal features into:

Gaussian action distribution parameters
scalar value estimates

Architectural consequence:

action semantics remain stable even if the environment stepping seam becomes more tensor-native underneath
PPO remains able to evaluate log-probabilities and clipped importance ratios over the same action meaning

12. Reward Formulation In Actor Context

The total training signal combines:

environment-side reward and penalties
intrinsic novelty pressure from RND
loop-avoidance penalties from episodic memory

The environment and actor each own different parts of the shaping problem. That split is intentional:

environment owns geometry-derived shaping directly tied to observations
actor owns latent novelty and memory-derived shaping

12.1 Environment-Side Components (9 terms)

The environment reward engine computes nine tensor components per step:

Exploration reward — decays with spatial visit count; includes heading novelty and frontier adjacency bonuses.
Progress reward — proportional to displacement, discounted by proximity ratio so approaching walls yields diminishing credit.
Clearance-delta reward — positive when the actor increases free space while within the obstacle clearance window (3.0 m).
Starvation penalty — fires when horizon-saturated rays dominate the spherical observation.
Proximity penalty — scales with the fraction of near-field valid hits below the proximity threshold (2.0 m).
Structure-band reward — rewards stable mid-range geometry visibility.
Forward-structure reward — rewards informative geometry in the forward sector.
Inspection reward — rewards orientation changes that gain structure information.
Collision penalty — velocity-scaled: penalty * (1 + speed_norm) so fast crashes are punished more severely than gentle grazing.

All terms are batched CUDA tensors derived from the existing spherical observation. They do not require a second sensing pipeline.

Exploration rewards are additionally clearance-gated: rewards are multiplied by clamp(current_clearance, 0, 1) so exploring into tight spaces yields diminishing exploration credit.

12.2 Actor-Side Components (5 terms)

The actor's RewardShaper adds:

Existential tax — constant per-step cost (-0.02).
Velocity reward — disabled by default (weight=0.0) to remove forward approach bias near obstacles.
Intrinsic novelty — RND prediction-error signal, annealed over training.
Loop penalty — cosine-similarity detection against episodic memory.
Loop temporal decay — exponential half-life weighting on loop detection.

12.3 Action Space

The actor reasons in 4-DOF continuous form:

forward — forward/backward translation
vertical — up/down translation
lateral — left/right strafe
yaw — horizontal rotation

This covers all six movement directions (forward, backward, up, down, left, right) plus heading control through yaw. Pitch and roll are intentionally excluded to keep the action space simple and the observation contract stable.

13. Canonical PPO Runtime

Module: training/ppo_trainer.py

The current production trainer is a direct in-process PPO runtime with a strong preference for runtime tensor seams.

Current production traits:

tensor-native reset seeding when available
tensor-native batched stepping when available
selective publication only for actors that need passive observation
inline PPO updates at rollout boundaries
coarse dashboard heartbeat publication during optimizer windows
configurable passive dashboard observation cadence with a canonical default of about 10 Hz for the selected actor

The previous async optimizer-worker design is no longer the production runtime.

14. Rollout Buffer Architecture

Module: rollout_buffer.py

The rollout buffer stores trajectory information for PPO and sequential minibatching.

Current production rules:

keep actor trajectories logically separate for PPO and BPTT correctness
cache stacked rollout tensors once per PPO update when practical
keep minibatch shuffle indices on the same device as sampled tensors
avoid rebuilding Python transition objects unnecessarily when batched tensor fields already exist

15. Actor-Side Performance Reality

The current actor bottleneck map includes:

remaining host extraction inside rollout steps
observation or action materialization performed only for passive publication
episodic-memory query and append cost
PPO minibatch assembly and optimizer wall time
telemetry publication overhead

This means actor-side runtime code must now be treated like systems code. The environment is no longer the only place where performance decisions matter.

16. Actor Telemetry And Observability

The actor publishes coarse telemetry for:

policy and value optimization metrics
reward summaries
throughput and timing summaries
dashboard heartbeats when optimizer windows would otherwise starve the UI

Canonical observer rule:

passive matrix publication stays low-volume and actor-filtered
the default cadence is intentionally much lower than UI render cadence but high enough to feel live during training inspection

Important rule:

observability is allowed
but it must remain coarse, safe, and subordinate to rollout throughput

17. Single-Pipeline Invariant

The canonical trainer uses one end-to-end pipeline only:

one compiled-scene backend
one sacred actor pipeline
one PPO update surface
one benchmarked production runtime

There are no equal-status alternate production trainers.

18. Behavioral Cloning Pre-Training

Module: training/bc_trainer.py

Behavioral cloning provides a supervised pre-training path that trains the same sacred CognitiveMambaPolicy from human navigation demonstrations. The BC trainer shares the evaluate_sequence() forward pass with PPO but uses maximum-likelihood loss instead of the clipped surrogate objective.

Architectural properties:

trains the identical pipeline: encoder → TemporalCore → ActorCriticHeads
uses BPTT sequences to preserve temporal-core context
freezes log_std during BC to preserve exploration capacity for subsequent PPO fine-tuning
produces standard v3 checkpoints loadable by PpoTrainer.load_training_state()
supports --checkpoint for incremental improvement across scenes

Demonstration capture occurs in the auditor project (DemonstrationRecorder), not in the actor. This preserves the boundary: the auditor records passively, the actor trains. The data exchange format is .npz archives with observations matching (N, 3, Az, El) and actions matching (N, 4) in normalised policy space.

18.1 BC Training Algorithm

Load all .npz demonstration files from the demonstrations directory.
Chunk concatenated observations and actions into BPTT sequences.
Shuffle and iterate in minibatches through the full dataset per epoch.
Compute loss: L = -E[log pi(a|o)] - beta * H(pi) where H is entropy.
Gradient clip and Adam optimiser step.
Save v3 checkpoint with fresh RND weights.

18.2 BC Commands

# Single-scene workflow
uv run --project projects/auditor explore --record --gmdag-file <scene.gmdag>
uv run --project projects/actor brain bc-pretrain

# Incremental multi-scene
uv run --project projects/actor brain bc-pretrain --checkpoint artifacts/checkpoints/bc_base_model.pt

# Multi-scene exploration then training
./scripts/run-explore-scenes.ps1
./scripts/run-bc-pretrain.ps1

19. Model Management & Checkpoint Lineage

All training sources (RL, BC, inference) emit v3 checkpoints with enriched metadata including step_id, episode_count, reward_ema, parent_checkpoint, training_source, temporal_core, and corpus_summary. Only v3 checkpoints are accepted; loading v2 or older will fail fast.

Auto-Continue

When no --checkpoint is specified, both the train CLI and run-ghost-stack.ps1 -Train automatically resume from artifacts/models/latest.pt if it exists. This enables seamless accumulation across BC pre-training, RL sessions, and nightly validation.

Auto-Promote

After training completes, the trainer auto-promotes the final checkpoint to the model registry when its reward_ema exceeds the current latest.

Checkpoint Lineage

Every checkpoint records parent_checkpoint to maintain the full training lineage:

BC Baseline (v001)  →  RL Fine-tuned (v002)  →  RL Continued (v003)
  source: bc              source: rl               source: rl
  parent: null            parent: v001             parent: v002

CLI Commands

| Command | Purpose | |---------|---------|| | brain promote <path> | Register a checkpoint in the model registry | | brain models | List all promoted models with metadata | | brain evaluate <path> | Bounded inference with quality metrics | | brain compare <a> <b> | Side-by-side checkpoint comparison |

See docs/TRAINING.md § 6 for format details and registry structure.

20. Related Docs

docs/ARCHITECTURE.md
docs/SIMULATION.md
docs/DATAFLOW.md
docs/PERFORMANCE.md
docs/TRAINING.md

FilesExpand file tree

ACTOR.md

Latest commit

History

ACTOR.md

File metadata and controls

ACTOR.md - Canonical Cognitive Actor Architecture

1. Executive Summary

2. Canonical Responsibilities

3. Canonical Launch Surfaces

4. Canonical Training Surface

5. Input And Output Contracts

5.1 Public Observation Boundary

5.2 Internal Observation Tensor

5.3 Action Boundary

6. Cognitive Pipeline

7. Perception: Encoder (Ray-ViT / SphericalCNN)

7.1 Resolution Scaling Boundary

7.2 Current Measured Consequence

8. Curiosity: RND Module

9. Loop Avoidance: Episodic Memory

9.1 Rollout Storage Direction

10. Temporal Core

10.1 Why Mamba2 SSD Is Canonical

10.2 Canonical Enforcement

10.3 How To Switch Back Later

11. Decision Layer: Actor-Critic Heads

12. Reward Formulation In Actor Context

12.1 Environment-Side Components (9 terms)

12.2 Actor-Side Components (5 terms)

12.3 Action Space

13. Canonical PPO Runtime

14. Rollout Buffer Architecture

15. Actor-Side Performance Reality

16. Actor Telemetry And Observability

17. Single-Pipeline Invariant

18. Behavioral Cloning Pre-Training

18.1 BC Training Algorithm

18.2 BC Commands

19. Model Management & Checkpoint Lineage

Auto-Continue

Auto-Promote

Checkpoint Lineage

CLI Commands

20. Related Docs