Tested on NVIDIA GeForce RTX 3060 (12 GB VRAM) with full parameter training (no LoRA).
| Model | Optimizer | Max Context (tokens) | Throughput | Memory Breakdown |
|---|---|---|---|---|
| RWKV7-1.5B | Per-param BF16 AdamW | 3,072 | 1,174 tok/s | Model 2.9 GB + Opt 5.7 GB |
| RWKV7-1.5B | Per-param 8-bit AdamW | 7,168 | 1,320 tok/s | Model 2.9 GB + Opt ~3 GB |
| RWKV7-2.9B | Per-param SGD | 7,168 | 716 tok/s | Model 5.5 GB + Opt 0 GB |
| RWKV7-2.9B | Per-param 8-bit Lion | 3,072 | 598 tok/s | Model 5.5 GB + Opt ~2.8 GB |
Note: Standard AdamW OOMs - its optimizer states alone require ~11.6 GB for the 1.5B model.
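
As a rough sanity check on that figure, assuming standard AdamW keeps two FP32 moment tensors per parameter (the exact total depends on the model's true parameter count):

```python
# Rough AdamW optimizer-state estimate for a ~1.5B-parameter model:
# exp_avg + exp_avg_sq, 4 bytes each per parameter (FP32 states assumed).
params = 1.5e9
state_bytes = params * 2 * 4
print(f"{state_bytes / 2**30:.1f} GiB")  # ~11.2 GiB, before gradients or activations
```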
- **Gradient Checkpointing** (`grad_cp=1`): Recompute activations during the backward pass (sketched below)
  - Saves ~7x activation memory
  - Faster at high memory utilization (less allocation overhead)
- **Per-Parameter Optimizer**: Run the optimizer step during backward via `register_post_accumulate_grad_hook` (sketched below)
  - Each parameter is updated immediately when its gradient is computed
  - The gradient is freed right after the update (`param.grad = None`)
  - Avoids storing all gradients in VRAM: only one gradient exists at a time
- **Infinite Context Mode** (`train_type="infctx"`): This project always trains with infinite context length (sketched below)
  - Model names like `rwkv7-g1a-0.1b-20250728-ctx4096` indicate the original training context length
  - This does NOT limit inference or training context; we always use unbounded context
  - Both `ctx_len=sys.maxsize` and `chunk_ctx=sys.maxsize` are set to allow arbitrarily long sequences
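
A minimal sketch of what `grad_cp=1` does under the hood, using PyTorch's `torch.utils.checkpoint` over a generic block stack; the module names here are illustrative, not this project's actual model classes:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Illustrative block stack: activations inside each block are recomputed
    during backward instead of being kept in VRAM."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the recommended checkpointing mode in recent PyTorch
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Only each block's input has to stay resident; everything computed inside the block is recomputed during backward, which is where the large activation-memory saving comes from.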
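
A minimal sketch of the per-parameter optimizer pattern, assuming PyTorch 2.1+ for `register_post_accumulate_grad_hook`; the tiny linear model and SGD choice are placeholders, not this project's actual setup:

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the real model

# One small optimizer per parameter, keyed by the parameter tensor itself.
optimizers = {p: torch.optim.SGD([p], lr=1e-4) for p in model.parameters()}

def optimizer_hook(param):
    # Runs as soon as this parameter's gradient has been accumulated in backward.
    optimizers[param].step()
    param.grad = None  # free the gradient immediately

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

# Parameters are updated inside loss.backward(); there is no separate
# optimizer.step() loop, and each parameter's gradient is freed as soon as
# that parameter has been updated.
x = torch.randn(8, 1024)
model(x).sum().backward()
```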
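
A conceptual sketch of infctx-style chunked training: a long sequence is split into chunks and the recurrent state is carried between them, so the effective context is unbounded. The toy model, the chunk loop, and the state detach are illustrative assumptions, not this project's actual trainer code (the real implementation may propagate gradients through the carried state):

```python
import sys
import torch

ctx_len = sys.maxsize    # this project sets both of these to sys.maxsize
chunk_ctx = sys.maxsize

class ToyRecurrentLM(torch.nn.Module):
    """Toy stand-in for an RWKV-style model: consumes a chunk of tokens plus a
    carried state, returns a next-token loss and the updated state."""
    def __init__(self, dim=64, vocab=256):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.mix = torch.nn.Linear(dim, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, chunk, state):
        h = self.emb(chunk)                               # (B, T, dim)
        if state is None:
            state = torch.zeros(chunk.size(0), h.size(-1))
        outs = []
        for t in range(h.size(1)):                        # naive recurrence over the chunk
            state = torch.tanh(self.mix(h[:, t] + state))
            outs.append(self.head(state))
        logits = torch.stack(outs, dim=1)                 # (B, T, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1))
        return loss, state

def train_on_long_sequence(model, tokens, chunk_len=512):
    state, total = None, 0.0
    for start in range(0, tokens.size(1), chunk_len):
        chunk = tokens[:, start:start + chunk_len]
        loss, state = model(chunk, state)
        loss.backward()            # gradients within the current chunk
        state = state.detach()     # simplification: cut backprop at chunk boundaries
        total += loss.item()
    return total

# Usage: any sequence length works, since only one chunk is in memory at a time.
model = ToyRecurrentLM()
tokens = torch.randint(0, 256, (1, 4096))
train_on_long_sequence(model, tokens, chunk_len=512)
```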