LLM Architectures, techniques, and research papers for experimentation and learning — from scratch.
- DeepSeek V3.2 GRPO optimizations: Off-policy masking, Unbiased KL estimate
- Nvidia LatentMoE from the Nemotron 3 white paper
- Qwen SAPO (Soft Adaptive Policy Optimization) loss implementation
More details are available in each subfolder's README.md.
| Architecture | Key Components |
|---|---|
| GPT-2* | • MHA • LayerNorm • FFN • GeLU • KVCache |
| GPT to Llama 3.2* | • GQA • RoPE + YaRN • RMS Norm • SwiGLU |
| Llama 3.2 to DeepSeek V3/R1 | • MLA • MTP • DeepSeek MoE |
| Llama 3.2 to Gemma 3 (text-only) | • GeGLU • Local/Global attention • SWA • QK norm • Pre+Post RMSNorm • Logit softcapping (Gemma 2) |
| Qwen3 (dense and MoE) | — |
| Qwen3-Next | • Gated DeltaNet • Gated Attention • Zero-Centered RMSNorm • Weighted shared expert • Partial RoPE |
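As a flavour of the components listed above, here is a minimal RMSNorm sketch (the normalization used by most of the post-GPT-2 variants in the table). It is an illustrative, self-contained version, not the repo's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: scale-only, no mean centering (unlike LayerNorm)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```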
| Variant | Notes |
|---|---|
| Sparse MoE | Classic auxiliary load-balancing loss + router z-loss (router sketch below this table) |
| DeepSeek MoE | Fine-grained experts + shared expert isolation + auxiliary-loss-free load balancing |
| Nvidia LatentMoE | Latent/low-rank compression + expert rebalancing |
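To make the routing entries concrete, here is a minimal sketch of a token-choice top-k router with a Switch-style auxiliary load-balancing loss and an ST-MoE-style router z-loss. Module names and the exact loss weighting are illustrative assumptions, not the repo's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Token-choice top-k router with load-balancing and z-loss terms (illustrative)."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, dim) -> routing decisions per token
        logits = self.gate(x)                                   # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # experts each token is sent to

        # Auxiliary load-balancing loss (Switch Transformer): fraction of tokens whose
        # top-1 expert is i, times the mean routing probability assigned to expert i.
        dispatch = F.one_hot(topk_idx[:, 0], self.num_experts).float()
        tokens_per_expert = dispatch.mean(dim=0)                # f_i
        mean_probs = probs.mean(dim=0)                          # P_i
        aux_loss = self.num_experts * torch.sum(tokens_per_expert * mean_probs)

        # Router z-loss (ST-MoE): keeps gate logits small so the softmax stays well-behaved.
        z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

        return topk_probs, topk_idx, aux_loss, z_loss
```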
| Method | Notes |
|---|---|
| DPO* | With cDPO for noisy labels, step by step |
| RLHF with GRPO | Including the Dr. GRPO, DAPO, GSPO, and SAPO variants (group-relative advantage sketch below this table) |
| RLVR with GRPO | — |
| Qwen GSPO | Transition from the GRPO implementation |
| Reinforcement Pretraining (RPT) | — |
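To anchor the GRPO family above: each prompt gets a group of sampled completions, and a completion's advantage is its reward normalized against the rest of its group, with no learned critic. The sketch below is illustrative (shapes and names are assumptions); Dr. GRPO, for example, drops the standard-deviation division.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled completion.

    Returns GRPO-style advantages: each reward is standardized against the other
    completions sampled for the same prompt, replacing a learned value function.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # Dr. GRPO would skip the division by std

# Example: 2 prompts with 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.9, 0.1, 0.4]])
print(group_relative_advantages(rewards))
```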
| Part | Key Components |
|---|---|
| Part 1: GPT to ViT | • Image patches + learnable CLS token + positional encoding • Full Attention • Classification head |
| Part 2: VLM | • ViT-LLM adapter (multimodal alignment/fine-tuning) • Early fusion (image + text embeddings) |
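A minimal sketch of the Part 1 patch-embedding step: slice the image into patches, project each patch, prepend a learnable CLS token, and add positional embeddings before the attention blocks. Sizes, initializations, and names are illustrative defaults, not the repo's exact code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens with a CLS token and positional embeddings."""

    def __init__(self, img_size: int = 224, patch_size: int = 16, in_chans: int = 3, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to slicing non-overlapping patches and projecting them linearly.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        patches = self.proj(x).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (batch, 1, dim)
        tokens = torch.cat([cls, patches], dim=1)
        return tokens + self.pos_embed                       # ready for full attention
```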
| Type | Method |
|---|---|
| Classifier | Hidden state retrieval for the last real token (see the sketch below this table) |
| Instruction* | — |
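A minimal sketch of the classifier retrieval step above, assuming right-padded batches and an attention mask with ones over real tokens; the function name is illustrative.

```python
import torch

def last_real_token_hidden(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.

    Returns the hidden state of the last real (non-padding) token of each sequence,
    which is then fed to the classification head.
    """
    last_idx = attention_mask.long().sum(dim=1) - 1          # index of last real token per sequence
    batch_idx = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                 # (batch, dim)
```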
| Technique | Notes |
|---|---|
| QK-Clip | Query-Key clipping (naive and per-head, GQA-compatible) from Moonshot AI's MuonClip, plus an experimental "Magnitude" variant |
| Speculative Decoding | Google's original version |
| Dynamic Tanh | Normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025) |
| RoPE + YaRN | NTK-aware + by-part/wavelength scaling |
| LoRA* | — |
| Number Token Loss | Regression-like loss on number tokens — Wasserstein Distance variant (Zausinger et al., 2025) |
| generate.py | Common sampling functions: temperature, top-k, top-p, min-p (sampling sketch below this table) |
| experimental | — |
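For the generate.py entry, a minimal sketch of temperature scaling with top-k and top-p (nucleus) filtering over next-token logits; the function name and defaults are illustrative, and min-p is omitted for brevity.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """logits: (vocab_size,) raw next-token logits. Returns a sampled token id of shape (1,)."""
    logits = logits / max(temperature, 1e-6)

    if top_k > 0:
        # Keep only the k highest logits, mask the rest.
        kth = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth, torch.full_like(logits, float("-inf")), logits)

    if top_p < 1.0:
        # Nucleus sampling: keep the smallest set of tokens whose cumulative probability >= top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cutoff = cum_probs > top_p
        cutoff[1:] = cutoff[:-1].clone()   # always keep the first token that crosses the threshold
        cutoff[0] = False
        sorted_logits[cutoff] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```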
* : Already covered by @rasbt; my code is similar.
1 : The original GPT-2 implementation only included causal masks, not attention masks. (In OpenAI's code, the causal mask is called an "attention mask", which can be confusing.)
Most notably, the open-source community, without which none of this would have been possible.
Whether from academia, top AI labs, or independent researchers, I am grateful for the knowledge and research they share.
Research papers used in the repo are always cited and linked in the relevant readmes or code comments.
Special mention to @rasbt for the LLMs-from-scratch book/repo, which kickstarted this repo and serves as a base for verbose re-implementations of various research papers.