Tested on NVIDIA GeForce RTX 3060 (12 GB VRAM) with full parameter training (no LoRA).
| Model | Optimizer | Max Context (tokens) | Throughput | Memory Breakdown |
|---|---|---|---|---|
| RWKV7-1.5B | Per-param BF16 AdamW | 3,072 | 1,174 tok/s | Model 2.9 GB + Opt 5.7 GB |
| RWKV7-1.5B | Per-param 8-bit AdamW | 7,168 | 1,320 tok/s | Model 2.9 GB + Opt ~3 GB |
| RWKV7-2.9B | Per-param SGD | 7,168 | 716 tok/s | Model 5.5 GB + Opt 0 GB |
| RWKV7-2.9B | Per-param 8-bit Lion | 3,072 | 598 tok/s | Model 5.5 GB + Opt ~2.8 GB |
Note: Standard AdamW OOMs - its optimizer states alone require ~11.6 GB for the 1.5B model.
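
As a rough sanity check on that figure, assuming standard AdamW keeps two FP32 moment tensors per parameter (the exact total depends on the model's true parameter count):

```python
# Rough AdamW optimizer-state estimate for a ~1.5B-parameter model:
# exp_avg + exp_avg_sq, 4 bytes each per parameter (FP32 states assumed).
params = 1.5e9
state_bytes = params * 2 * 4
print(f"{state_bytes / 2**30:.1f} GiB")  # ~11.2 GiB, before gradients or activations
```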
- **Gradient Checkpointing** (`grad_cp=1`): Recompute activations during the backward pass (sketched below)
  - Saves ~7x activation memory
  - Faster at high memory utilization (less allocation overhead)
- **Per-Parameter Optimizer**: Run the optimizer step during backward via `register_post_accumulate_grad_hook` (sketched below)
  - Each parameter is updated immediately when its gradient is computed
  - The gradient is freed right after the update (`param.grad = None`)
  - Avoids storing all gradients in VRAM: only one gradient exists at a time
- **Infinite Context Mode** (`train_type="infctx"`): This project always trains with infinite context length (sketched below)
  - Model names like `rwkv7-g1a-0.1b-20250728-ctx4096` indicate the original training context length
  - This does NOT limit inference or training context; we always use unbounded context
  - Both `ctx_len=sys.maxsize` and `chunk_ctx=sys.maxsize` are set to allow arbitrarily long sequences
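
A minimal sketch of what `grad_cp=1` does under the hood, using PyTorch's `torch.utils.checkpoint` over a generic block stack; the module names here are illustrative, not this project's actual model classes:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Illustrative block stack: activations inside each block are recomputed
    during backward instead of being kept in VRAM."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the recommended checkpointing mode in recent PyTorch
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Only each block's input has to stay resident; everything computed inside the block is recomputed during backward, which is where the large activation-memory saving comes from.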
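
A minimal sketch of the per-parameter optimizer pattern, assuming PyTorch 2.1+ for `register_post_accumulate_grad_hook`; the tiny linear model and SGD choice are placeholders, not this project's actual setup:

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the real model

# One small optimizer per parameter, keyed by the parameter tensor itself.
optimizers = {p: torch.optim.SGD([p], lr=1e-4) for p in model.parameters()}

def optimizer_hook(param):
    # Runs as soon as this parameter's gradient has been accumulated in backward.
    optimizers[param].step()
    param.grad = None  # free the gradient immediately

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

# Parameters are updated inside loss.backward(); there is no separate
# optimizer.step() loop, and each parameter's gradient is freed as soon as
# that parameter has been updated.
x = torch.randn(8, 1024)
model(x).sum().backward()
```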
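
A conceptual sketch of infctx-style chunked training: a long sequence is split into chunks and the recurrent state is carried between them, so the effective context is unbounded. The toy model, the chunk loop, and the state detach are illustrative assumptions, not this project's actual trainer code (the real implementation may propagate gradients through the carried state):

```python
import sys
import torch

ctx_len = sys.maxsize    # this project sets both of these to sys.maxsize
chunk_ctx = sys.maxsize

class ToyRecurrentLM(torch.nn.Module):
    """Toy stand-in for an RWKV-style model: consumes a chunk of tokens plus a
    carried state, returns a next-token loss and the updated state."""
    def __init__(self, dim=64, vocab=256):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.mix = torch.nn.Linear(dim, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, chunk, state):
        h = self.emb(chunk)                               # (B, T, dim)
        if state is None:
            state = torch.zeros(chunk.size(0), h.size(-1))
        outs = []
        for t in range(h.size(1)):                        # naive recurrence over the chunk
            state = torch.tanh(self.mix(h[:, t] + state))
            outs.append(self.head(state))
        logits = torch.stack(outs, dim=1)                 # (B, T, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1))
        return loss, state

def train_on_long_sequence(model, tokens, chunk_len=512):
    state, total = None, 0.0
    for start in range(0, tokens.size(1), chunk_len):
        chunk = tokens[:, start:start + chunk_len]
        loss, state = model(chunk, state)
        loss.backward()            # gradients within the current chunk
        state = state.detach()     # simplification: cut backprop at chunk boundaries
        total += loss.item()
    return total

# Usage: any sequence length works, since only one chunk is in memory at a time.
model = ToyRecurrentLM()
tokens = torch.randint(0, 256, (1, 4096))
train_on_long_sequence(model, tokens, chunk_len=512)
```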