LLM Architectures, techniques, and research papers for experimentation and learning — from scratch.
- DeepSeek V3.2 GRPO optimizations: Off-policy masking, Unbiased KL estimate
- Nvidia LatentMoE from the Nemotron 3 white paper
- Qwen SAPO (Soft Adaptive Policy Optimization) loss implementation
More details are available in each subfolder's README.md.
| Architecture | Key Components |
|---|---|
| GPT-2* | • MHA • LayerNorm • FFN • GeLU • KVCache |
| GPT to Llama 3.2* | • GQA • RoPE + YaRN • RMS Norm • SwiGLU |
| Llama 3.2 to DeepSeek V3/R1 | • MLA • MTP • DeepSeek MoE |
| Llama 3.2 to Gemma 3 (text-only) | • GeGLU • Local/Global attention • SWA • QK norm • Pre+Post RMSNorm • Logit softcapping (Gemma 2) |
| Qwen3 (dense and MoE) | — |
| Qwen3-Next | • Gated DeltaNet • Gated Attention • Zero-Centered RMSNorm • Weighted shared expert • Partial RoPE |
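As a flavour of the components listed above, here is a minimal RMSNorm sketch (the normalization used by most of the post-GPT-2 variants in the table). It is an illustrative, self-contained version, not the repo's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: scale-only, no mean centering (unlike LayerNorm)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```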
| Variant | Notes |
|---|---|
| Sparse MoE | Classic auxiliary load-balancing loss + router z-loss (router sketch below this table) |
| DeepSeek MoE | Fine-grained experts + shared expert isolation + auxiliary-loss-free load balancing |
| Nvidia LatentMoE | Latent/low-rank compression + expert rebalancing |
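To make the routing entries concrete, here is a minimal sketch of a token-choice top-k router with a Switch-style auxiliary load-balancing loss and an ST-MoE-style router z-loss. Module names and the exact loss weighting are illustrative assumptions, not the repo's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Token-choice top-k router with load-balancing and z-loss terms (illustrative)."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, dim) -> routing decisions per token
        logits = self.gate(x)                                   # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # experts each token is sent to

        # Auxiliary load-balancing loss (Switch Transformer): fraction of tokens whose
        # top-1 expert is i, times the mean routing probability assigned to expert i.
        dispatch = F.one_hot(topk_idx[:, 0], self.num_experts).float()
        tokens_per_expert = dispatch.mean(dim=0)                # f_i
        mean_probs = probs.mean(dim=0)                          # P_i
        aux_loss = self.num_experts * torch.sum(tokens_per_expert * mean_probs)

        # Router z-loss (ST-MoE): keeps gate logits small so the softmax stays well-behaved.
        z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

        return topk_probs, topk_idx, aux_loss, z_loss
```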
| Method | Notes |
|---|---|
| DPO* | With cDPO for noisy labels, step by step |
| RLHF with GRPO | Including the Dr. GRPO, DAPO, GSPO, and SAPO variants (group-relative advantage sketch below this table) |
| RLVR with GRPO | — |
| Qwen GSPO | Transition from the GRPO implementation |
| Reinforcement Pretraining (RPT) | — |
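To anchor the GRPO family above: each prompt gets a group of sampled completions, and a completion's advantage is its reward normalized against the rest of its group, with no learned critic. The sketch below is illustrative (shapes and names are assumptions); Dr. GRPO, for example, drops the standard-deviation division.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled completion.

    Returns GRPO-style advantages: each reward is standardized against the other
    completions sampled for the same prompt, replacing a learned value function.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # Dr. GRPO would skip the division by std

# Example: 2 prompts with 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.9, 0.1, 0.4]])
print(group_relative_advantages(rewards))
```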
| Part | Key Components |
|---|---|
| Part 1: GPT to ViT | • Image patches + learnable CLS token + positional encoding • Full Attention • Classification head |
| Part 2: VLM | • ViT-LLM adapter (multimodal alignment/fine-tuning) • Early fusion (image + text embeddings) |
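A minimal sketch of the Part 1 patch-embedding step: slice the image into patches, project each patch, prepend a learnable CLS token, and add positional embeddings before the attention blocks. Sizes, initializations, and names are illustrative defaults, not the repo's exact code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens with a CLS token and positional embeddings."""

    def __init__(self, img_size: int = 224, patch_size: int = 16, in_chans: int = 3, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to slicing non-overlapping patches and projecting them linearly.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        patches = self.proj(x).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (batch, 1, dim)
        tokens = torch.cat([cls, patches], dim=1)
        return tokens + self.pos_embed                       # ready for full attention
```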
| Type | Method |
|---|---|
| Classifier | Hidden state retrieval for the last real token (see the sketch below this table) |
| Instruction* | — |
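A minimal sketch of the classifier retrieval step above, assuming right-padded batches and an attention mask with ones over real tokens; the function name is illustrative.

```python
import torch

def last_real_token_hidden(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.

    Returns the hidden state of the last real (non-padding) token of each sequence,
    which is then fed to the classification head.
    """
    last_idx = attention_mask.long().sum(dim=1) - 1          # index of last real token per sequence
    batch_idx = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                 # (batch, dim)
```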
| Technique | Notes |
|---|---|
| QK-Clip | Query-Key clipping (naive and per-head, GQA-compatible) from Moonshot AI's MuonClip, plus an experimental "Magnitude" variant |
| Speculative Decoding | Google's original version |
| Dynamic Tanh | Normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025) |
| RoPE + YaRN | NTK-aware + by-part/wavelength scaling |
| LoRA* | — |
| Number Token Loss | Regression-like loss on number tokens — Wasserstein Distance variant (Zausinger et al., 2025) |
| generate.py | Common sampling functions: temperature, top-k, top-p, min-p (sampling sketch below this table) |
| experimental | — |
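For the generate.py entry, a minimal sketch of temperature scaling with top-k and top-p (nucleus) filtering over next-token logits; the function name and defaults are illustrative, and min-p is omitted for brevity.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """logits: (vocab_size,) raw next-token logits. Returns a sampled token id of shape (1,)."""
    logits = logits / max(temperature, 1e-6)

    if top_k > 0:
        # Keep only the k highest logits, mask the rest.
        kth = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth, torch.full_like(logits, float("-inf")), logits)

    if top_p < 1.0:
        # Nucleus sampling: keep the smallest set of tokens whose cumulative probability >= top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cutoff = cum_probs > top_p
        cutoff[1:] = cutoff[:-1].clone()   # always keep the first token that crosses the threshold
        cutoff[0] = False
        sorted_logits[cutoff] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```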
* : Already covered by @rasbt; my code is similar.
1 : The original GPT-2 implementation only included causal masks, not attention masks. (In OpenAI's code, the causal mask is called an "attention mask", which can be confusing.)
Most notably, the open-source community, without which none of this would have been possible.
Whether from academia, top AI labs, or independent researchers, I am grateful for the knowledge and research they share.
Research papers used in the repo are always cited and linked in the relevant readmes or code comments.
Special mention to @rasbt for the LLMs-from-scratch book/repo, which kickstarted this repo and serves as a base for verbose re-implementations of various research papers.