
LLM Quest

LLM architectures, techniques, and research papers, implemented from scratch for experimentation and learning.

 

Latest 3 updates

  • DeepSeek V3.2 GRPO optimizations: Off-policy masking, Unbiased KL estimate
  • Nvidia LatentMoE from the Nemotron 3 white paper
  • Qwen SAPO (Soft Adaptive Policy Optimization) loss implementation
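
The "unbiased KL estimate" above presumably refers to the k3-style per-token estimator commonly paired with GRPO (non-negative and unbiased when sampling from the policy). A minimal, generic sketch under that assumption, with hypothetical tensor names, not the repo's V3.2-specific code:

```python
import torch

def k3_kl_estimate(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # Per-token estimate of KL(pi_theta || pi_ref) from samples of pi_theta:
    # k3 = r - log(r) - 1, with r = pi_ref / pi_theta (unbiased, always >= 0)
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```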

 

Content

More details are available in each subfolder's README.md

 

Architectures

| Architecture | Key Components |
| --- | --- |
| GPT-2* | MHA • LayerNorm • FFN • GeLU • KVCache |
| GPT to Llama 3.2* | GQA • RoPE + YaRN • RMS Norm • SwiGLU |
| Llama 3.2 to DeepSeek V3/R1 | MLA • MTP • DeepSeek MoE |
| Llama 3.2 to Gemma 3 (text-only) | GeGLU • Local/Global attention • SWA • QK norm • Pre+Post RMSNorm • Logit softcapping (Gemma 2) |
| Qwen3 (dense and MoE) | |
| Qwen3-Next | Gated DeltaNet • Gated Attention • Zero-Centered RMSNorm • Weighted shared expert • Partial RoPE |
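
As a taste of the Llama-style building blocks listed above, here is a minimal, generic RMSNorm + SwiGLU sketch (illustrative only, not the repo's implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale features by their RMS, with no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated FFN: down(silu(gate(x)) * up(x)), as used in Llama-style blocks."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```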

 

Mixture of Experts (MoE)

| Variant | Notes |
| --- | --- |
| Sparse MoE | Classic auxiliary loss + router z-loss |
| DeepSeek MoE | Fine-grained experts + shared expert isolation + auxiliary-loss-free load balancing |
| Nvidia LatentMoE | Latent/low-rank compression + expert rebalancing |
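
The classic sparse-MoE recipe from the table above, sketched generically: top-k routing plus the auxiliary load-balancing loss and router z-loss. Names and shapes are illustrative assumptions, not the repo's code:

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, k=2):
    # hidden: (num_tokens, dim), router_weight: (dim, num_experts)
    logits = hidden @ router_weight                         # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)            # experts chosen per token

    num_experts = logits.size(-1)
    # classic auxiliary load-balancing loss (Switch-style):
    # fraction of tokens routed to each expert (top-1) times mean router probability
    dispatch_frac = F.one_hot(topk_idx[:, 0], num_experts).float().mean(dim=0)
    mean_prob = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(dispatch_frac * mean_prob)

    # router z-loss: discourage very large router logits (numerical stability)
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

    return topk_probs, topk_idx, aux_loss, z_loss
```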

 

Alignment & Reasoning

| Method | Notes |
| --- | --- |
| DPO* | With cDPO for noisy labels, step by step |
| RLHF with GRPO | Including variants: Dr. GRPO, DAPO, GSPO, SAPO |
| RLVR with GRPO | |
| Qwen GSPO | Transition from the GRPO implementation |
| Reinforcement Pretraining (RPT) | |
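
The core idea shared by the GRPO-family methods above is group-relative advantages: sample several completions per prompt and normalize their rewards within the group, so no learned critic is needed. A generic sketch (hypothetical shapes, not the repo's code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one scalar reward per sampled completion
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # GRPO normalizes by the group std; Dr. GRPO drops this division to avoid its bias
    return (rewards - mean) / (std + eps)
```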

 

Multimodal

| Part | Key Components |
| --- | --- |
| Part 1: GPT to ViT | Image patches + learnable CLS token + positional encoding • Full Attention • Classification head |
| Part 2: VLM | ViT-LLM adapter (multimodal alignment/fine-tuning) • Early fusion (image + text embeddings) |
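
A minimal sketch of the ViT front-end from Part 1 (image patches + learnable CLS token + positional encoding); the hyperparameters and names are illustrative assumptions, not the repo's exact code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens, prepend a learnable CLS token, add positions."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a strided conv is equivalent to splitting into patches + a linear projection
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        patches = self.proj(images).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        cls = self.cls_token.expand(patches.size(0), -1, -1)
        tokens = torch.cat([cls, patches], dim=1)
        return tokens + self.pos_embed
```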

 

Fine-tuning (SFT)

| Type | Method |
| --- | --- |
| Classifier | Hidden state retrieval for the last real token |
| Instruction* | |
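
For the classifier row, "last real token" means the final non-padding position of each sequence; its hidden state feeds the classification head. A generic sketch assuming right padding (not the repo's code):

```python
import torch

def last_real_token_hidden(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim), attention_mask: (batch, seq_len) with 1 = real token
    last_idx = attention_mask.long().sum(dim=1) - 1              # index of the last real token per row
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                    # (batch, dim)
```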

 

Other Model-Agnostic Techniques and Papers

| Technique | Notes |
| --- | --- |
| QK-Clip | Query-Key clipping (naive & per-head, GQA compatible) from Moonshot.ai's MuonClip, plus an experimental "Magnitude" variant |
| Speculative Decoding | Google's original version |
| Dynamic Tanh | Normalization-free alternative to RMSNorm/LayerNorm (Zhu et al., 2025) |
| RoPE + YaRN | NTK-aware + by-part/wavelength scaling |
| LoRA* | |
| Number Token Loss | Regression-like loss on number tokens; Wasserstein Distance variant (Zausinger et al., 2025) |
| generate.py | Common sampling functions: temperature, top-k, top-p, min-p |
| experimental | |
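
A generic sketch of the sampling recipe listed for generate.py above (temperature, then top-k, then top-p filtering); this is the standard approach, not the repo's actual generate.py:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: (batch, vocab_size) for the last position
    logits = logits / max(temperature, 1e-8)

    if top_k is not None:
        kth_value = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cum_probs = sorted_probs.cumsum(dim=-1)
        # drop tokens once the cumulative mass *before* them exceeds top_p
        # (this always keeps at least the most likely token)
        remove = (cum_probs - sorted_probs) > top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (batch, 1) sampled token ids
```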

 

* : Already covered by @rasbt; my code is similar.

1 : The original GPT-2 implementation only included causal masks, not (padding) attention masks. (In OpenAI's code, the causal mask is called the "attention mask", which can be confusing.)
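
To make the distinction concrete, a minimal generic sketch (not the repo's code): the causal mask blocks attention to future positions, the padding attention mask blocks attention to pad tokens, and the two are usually combined:

```python
import torch

def build_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    batch, seq_len = attention_mask.shape
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # no attending to future positions
    padding = attention_mask.bool()[:, None, :]                          # no attending to pad keys
    return causal[None, :, :] & padding                                  # (batch, seq_len, seq_len), True = attend
```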

 

Acknowledgements

Most notably, the open-source community, without whom none of this would have been possible.
Whether from academia, top AI labs, or independent researchers, I am grateful for the shared knowledge and research.

Research papers used in the repo are always cited and linked in the relevant readmes or code comments.

Special mention to @rasbt for the LLMs-from-scratch book/repo, which kickstarted this repo and became the base for verbose re-implementations of various research papers.
