"True understanding comes from building the engine, not just driving the car." This repository documents my systematic exploration of advanced DL concepts – implementing every component from scratch without high-level APIs to force fundamental understanding.
A checklist of concepts and models to implement.
- Normalization Layers (NormalizationLayers.ipynb)
- Batch Normalization
- Layer Normalization (see the sketch after this group)
- Instance Normalization
- Group Normalization
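
As a taste of the from-scratch approach, here is a minimal layer-normalization sketch (assuming PyTorch; `MyLayerNorm` and the tensor shapes are illustrative, not taken from the notebook):

```python
import torch
import torch.nn as nn

class MyLayerNorm(nn.Module):
    """Layer normalization over the last dimension, written from scratch."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

x = torch.randn(2, 4, 8)
print(MyLayerNorm(8)(x).shape)  # torch.Size([2, 4, 8])
```
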
- Activation Functions (ActivationFunctions.ipynb)
- ReLU Variants: Leaky ReLU, Parametric ReLU (PReLU), Exponential Linear Unit (ELU)
- Advanced: GELU (Gaussian Error Linear Unit), SiLU, Mish (sketch after this group)
- Efficient: Hard Sigmoid, Hard Swish, Hard Tanh
- Gated Activations: Gated Linear Unit (GLU)
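
A small sketch of the GELU, SiLU, and Mish activations written directly from their definitions (assuming PyTorch tensors; the exact GELU uses the Gaussian CDF via `torch.erf`):

```python
import torch
import torch.nn.functional as F

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x * torch.sigmoid(x)

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-3, 3, 7)
print(gelu(x), silu(x), mish(x), sep="\n")
```
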
- Convolutional Variants
- Dilated (Atrous) Convolution
- Depthwise Separable Convolution (sketch after this group)
- Deformable Convolution
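
A minimal depthwise separable convolution sketch (assuming PyTorch): a depthwise convolution with `groups=in_channels` followed by a 1x1 pointwise convolution; `DepthwiseSeparableConv2d` is an illustrative name.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 16, 32, 32)
print(DepthwiseSeparableConv2d(16, 32)(x).shape)  # torch.Size([1, 32, 32, 32])
```
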
- Attention Mechanisms (beyond basic self-attention)
- Self-Attention: Multi-head attention mechanism with causal masking
- Multi-Head Latent Attention (MLA): DeepSeek-style MLA (SimpleDeepseek.ipynb)
- Cross-Attention
- FlashAttention (I/O-aware implementation)
- Sparse/Linear Attention (e.g., in Longformer, Performer)
- Variational Autoencoders (VAEs)
- Core VAE (Reparameterization trick, ELBO loss: $\log p(x) \ge \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x)\,\|\,p(z))$; see the sketch after this group)
- Conditional VAE (CVAE)
- Vector Quantized VAE (VQ-VAE)
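
A minimal sketch of the reparameterization trick and the (negative) ELBO loss from the Core VAE item above, assuming PyTorch and a Bernoulli decoder; `TinyVAE` and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: Gaussian encoder, reparameterization trick, simple decoder."""
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs [mu, log_var]
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + eps * torch.exp(0.5 * log_var)   # reparameterization trick
        return self.dec(z), mu, log_var

def neg_elbo(x, x_logits, mu, log_var):
    # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I)), closed form for Gaussians
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

x = torch.rand(8, 784)
x_logits, mu, log_var = TinyVAE()(x)
print(neg_elbo(x, x_logits, mu, log_var))
```
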
- Generative Adversarial Networks (GANs)
- Wasserstein GAN (WGAN-GP)
- CycleGAN
- StyleGAN
- Diffusion Models
- Denoising Diffusion Probabilistic Models (DDPM)
- Latent Diffusion Models (LDM)
- Autoregressive Models
- PixelCNN
- Normalizing Flows
- RealNVP or GLOW
- Core Transformer & GPT Implementation
- SimpleGPT: Complete GPT implementation from scratch with multi-head attention, transformer blocks, and autoregressive text generation
- Multi-Head Attention: Self-attention mechanism with causal masking (see the sketch after this group)
- Transformer Architecture: Positional embeddings, transformer blocks, feed-forward networks
- Rotary Position Embeddings (RoPE): position-dependent rotation of query/key vectors (see "Model Creation (MLA with RoPE positional Encoding)")
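
A minimal sketch of multi-head self-attention with causal masking, the core of the SimpleGPT item above (assuming PyTorch; `CausalSelfAttention` and the dimensions are illustrative, not the notebook's exact code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal (lower-triangular) mask."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        att = att.masked_fill(~mask, float("-inf"))   # block attention to future tokens
        out = F.softmax(att, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

x = torch.randn(2, 10, 64)
print(CausalSelfAttention()(x).shape)  # torch.Size([2, 10, 64])
```
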
- Efficient Transformers
- Longformer / BigBird
- Reformer
- Modern LLM Architectures
- Mixture of Experts (MoE) layer
- State Space Models (Mamba)
- Parameter-Efficient Fine-Tuning (PEFT)
- Low-Rank Adaptation (LoRA)
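
A minimal LoRA sketch (assuming PyTorch): the pretrained linear layer is frozen and a trainable low-rank update `B @ A` is added to its output; `LoRALinear` and the rank/alpha values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(128, 128))
print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```
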
- LLM Application Paradigms
- Retrieval-Augmented Generation (RAG) System
- Object Detection
- YOLO (You Only Look Once)
- DETR (DEtection TRansformer)
- Segmentation
- U-Net
- Vision Transformer for segmentation
- Transformers in Vision
- Vision Transformer (ViT)
- Swin Transformer
- 3D Vision & Scene Representation
- Neural Radiance Fields (NeRF)
- PointNet / PointNet++
- Graph Neural Networks (GNNs)
- Graph Convolutional Networks (GCN)
- GraphSAGE
- Graph Attention Networks (GAT)
- Value-Based Methods
- Double Dueling DQN
- Advanced Actor-Critic Methods
- Proximal Policy Optimization (PPO)
- Soft Actor-Critic (SAC)
- Contrastive Learning
- SimCLR
- MoCo
- Masked Modeling
- Masked Autoencoders (MAE) for vision
- Multi-Modal Models
- CLIP (replicating the training approach on a smaller scale)
- Knowledge Distillation
- Network Pruning (e.g., Lottery Ticket Hypothesis)
- Quantization (Post-Training and Quantization-Aware)
- Bayesian Deep Learning
- Uncertainty Estimation via MC Dropout
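
A minimal MC Dropout sketch (assuming PyTorch): dropout is kept active at inference and the mean/std over repeated stochastic forward passes serve as the prediction and its uncertainty; the model and sample count are illustrative.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=50):
    """Keep dropout active at inference and average stochastic forward passes."""
    model.train()                      # enables dropout (no gradient updates happen here)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # predictive mean and uncertainty estimate

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 1))
mean, std = mc_dropout_predict(model, torch.randn(5, 10))
print(mean.shape, std.shape)  # torch.Size([5, 1]) torch.Size([5, 1])
```
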
- Advanced Optimizers
- Adaptive Learning Rate: AdaGrad, RMSprop
- Adam Variants: AdamW (Decoupled Weight Decay, sketched after this group), RAdam (Rectified Adam)
- Recent Developments: Lion (EvoLved Sign Momentum), Lookahead
- Regularization-based: Sharpness-Aware Minimization (SAM)
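
A minimal sketch of AdamW's decoupled weight decay, referenced in the Adam Variants item above (assuming PyTorch tensors; a single hand-rolled update step for illustration, not `torch.optim.AdamW`):

```python
import torch

def adamw_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: weight decay is applied directly to the weights,
    decoupled from the gradient-based Adam update."""
    state["t"] += 1
    t = state["t"]
    for p, g, m, v in zip(params, grads, state["m"], state["v"]):
        p.mul_(1 - lr * weight_decay)                         # decoupled weight decay
        m.mul_(betas[0]).add_(g, alpha=1 - betas[0])          # first moment
        v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])   # second moment
        m_hat = m / (1 - betas[0] ** t)                       # bias correction
        v_hat = v / (1 - betas[1] ** t)
        p.add_(-lr * m_hat / (v_hat.sqrt() + eps))            # Adam update

params = [torch.randn(3)]
grads = [torch.randn(3)]
state = {"t": 0, "m": [torch.zeros(3)], "v": [torch.zeros(3)]}
adamw_step(params, grads, state)
print(params[0])
```
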