Future Proof Yourself - An AI Era Survival Guide
2 Optimization 28
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Loss Functions: Measuring Model Error . . . . . . . . . . . . . 29
2.2.1 Why We Need a Loss Function . . . . . . . . . . . . . 29
2.2.2 Probability and Likelihood in ML (A Deeper Look) . . 29
2.2.3 Information Theory and Entropy . . . . . . . . . . . . 30
2.2.4 Cross-Entropy in Classification . . . . . . . . . . . . . 31
2.3 Gradient Descent: An Overview of Optimization . . . . . . . . 31
2.3.1 Local Minima, Global Minima, and Saddle Points . . . 32
2.3.2 Basic Gradient Descent . . . . . . . . . . . . . . . . . . 32
2.3.3 Stochastic and Mini-Batch Gradient Descent . . . . . . 33
2.3.4 Variants: Momentum, Adam, and Beyond . . . . . . . 33
2.4 Backpropagation: Computing Gradients in Deep Networks . . 34
2.4.1 Chain Rule Refresher . . . . . . . . . . . . . . . . . . . 34
2.4.2 Backpropagation Step-by-Step . . . . . . . . . . . . . . 34
2.4.3 A Simple Example . . . . . . . . . . . . . . . . . . . . 35
2.5 Challenges in Training Deep Networks . . . . . . . . . . . . . 35
2.5.1 The Vanishing Gradient Problem . . . . . . . . . . . . 35
2.5.2 When Adding Layers Degrades Performance . . . . . . 36
2.6 Skip Connections and Residual Networks . . . . . . . . . . . . 36
2.6.1 Why Skip Connections Help . . . . . . . . . . . . . . . 37
2.6.2 Beyond ResNet . . . . . . . . . . . . . . . . . . . . . . 37
2.7 Putting It All Together: Practical Training Steps . . . . . . . 38
2.8 Extended Explanations of Tricky Concepts . . . . . . . . . . . 39
2.8.1 Local Minima, Saddle Points, and High-Dimensional Landscapes 39
2.8.2 Chain Rule and Partial Derivatives: Why It’s Intuitive 39
2.8.3 Vanishing Gradients: A Numerical Example . . . . . . 39
2.8.4 Skip Connections in More Depth . . . . . . . . . . . . 40
2.9 Conclusion and Next Steps . . . . . . . . . . . . . . . . . . . . 40
2.9.1 Chapter Recap . . . . . . . . . . . . . . . . . . . . . . 40
3.5.4 Receptive Field and Deep CNNs . . . . . . . . . . . . . 48
3.6 Recurrent Neural Networks (RNNs) . . . . . . . . . . . . . . . 48
3.6.1 Sequential Data and the Need for Memory . . . . . . . 48
3.6.2 Parameter Sharing Over Time . . . . . . . . . . . . . . 48
3.6.3 Backpropagation Through Time (BPTT) . . . . . . . . 49
3.7 Transformers and Self-Attention . . . . . . . . . . . . . . . . . 49
3.7.1 Limitations of RNNs and the Rise of Transformers . . 49
3.7.2 Self-Attention Mechanism . . . . . . . . . . . . . . . . 49
3.7.3 Parallelization and Multi-Head Attention . . . . . . . . 50
3.8 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . 50
3.8.1 Motivation and Basic Ideas . . . . . . . . . . . . . . . 50
3.8.2 Entropy Minimization . . . . . . . . . . . . . . . . . . 51
3.8.3 Pseudo-Labeling . . . . . . . . . . . . . . . . . . . . . 51
3.9 Self-Supervised Learning . . . . . . . . . . . . . . . . . . . . . 51
3.9.1 Overview and Motivation . . . . . . . . . . . . . . . . 51
3.9.2 Masked Modeling: BERT . . . . . . . . . . . . . . . . 52
3.9.3 Autoregressive Generation: GPT . . . . . . . . . . . . 52
3.9.4 Contrastive Learning . . . . . . . . . . . . . . . . . . . 52
3.10 Vision Transformers (ViT) . . . . . . . . . . . . . . . . . . . . 52
3.10.1 Adapting Transformers to Images . . . . . . . . . . . . 52
3.10.2 Performance Considerations . . . . . . . . . . . . . . . 53
3.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 History of AI 56
4.1 Machine Learning vs. Deep Learning . . . . . . . . . . . . . . 56
4.2 Information Theory and Cross-Entropy . . . . . . . . . . . . . 56
4.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Backpropagation and Optimizer . . . . . . . . . . . . . . . . . 57
4.5 Loss Landscape and Skip-Connection . . . . . . . . . . . . . . 58
4.6 Hopfield Network . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.7 Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . 61
4.9 Multi-Layer Perceptron . . . . . . . . . . . . . . . . . . . . . . 61
4.10 Convolutional Neural Network . . . . . . . . . . . . . . . . . . 62
4.11 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . 62
4.12 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.13 Self-Supervised Learning . . . . . . . . . . . . . . . . . . . . . 63
4.13.1 Autoregressive Generation . . . . . . . . . . . . . . . . 63
4.13.2 Generative Pre-Training (GPT) . . . . . . . . . . . . . 63
4.13.3 Masked Generation / Prediction . . . . . . . . . . . . . 63
4.13.4 Vision Transformer (ViT) . . . . . . . . . . . . . . . . 64
4.13.5 Contrastive Learning . . . . . . . . . . . . . . . . . . . 64
4.14 Evolution of AI . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.14.1 Foundational Neural Network Theories (1982 to 2011) . 64
4.14.2 Supervised Learning and Specialized Architectures (2012 to 2016) 65
4.14.3 Attention-Based Universal Architectures (2017 to 2024) 65
4.15 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.16 Reinforcement Learning from Human Feedback (RLHF) . . . . 68
4.17 Red Teaming . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.18 Chain-of-Thought (CoT) . . . . . . . . . . . . . . . . . . . . . 71
4.19 Self-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.20 Direct Preference Optimization (DPO) . . . . . . . . . . . . . 72
4.21 Retrieval-Augmented Generation (RAG) . . . . . . . . . . . . 73
5 Model Scaling 75
5.1 Introduction to Model Scaling . . . . . . . . . . . . . . . . . . 75
5.2 Depth-Based Scaling of Networks . . . . . . . . . . . . . . . . 77
5.2.1 Going Deeper: VGG Networks . . . . . . . . . . . . . . 77
5.2.2 Residual Networks (ResNets) . . . . . . . . . . . . . . 78
5.2.3 Inception Modules and Going Wider . . . . . . . . . . 79
5.2.4 Dense Connections: DenseNets . . . . . . . . . . . . . 80
5.3 Efficient Model Scaling: Neural Architecture Search and Compound Scaling 81
5.3.1 Neural Architecture Search (NAS) for Scaling . . . . . 82
5.3.2 Compound Scaling and EfficientNet . . . . . . . . . . . 82
5.3.3 Designing Network Families: RegNet . . . . . . . . . . 84
5.4 Scaling Transformers and Language Models . . . . . . . . . . 85
5.4.1 From Millions to Billions of Parameters . . . . . . . . . 86
5.4.2 Emergent Abilities and Limits of Scaling . . . . . . . . 88
5.5 Test-Time Compute: Scaling Reasoning at Inference . . . . . . 89
5.6 The Role of Normalization in Stable Scaling . . . . . . . . . . 92
5.7 Efficient Building Blocks for Scalable Models . . . . . . . . . . 95
5.7.1 Depthwise Separable Convolutions and MobileNets . . 95
5.7.2 Other Efficient Layer Techniques . . . . . . . . . . . . 97
5.8 Distributed Training for Scalable Deep Learning . . . . . . . . 99
5.9 Environmental and Computational Trade-offs . . . . . . . . . 102
5.10 AI Accelerators and Inference-Time Optimization . . . . . . . 105
5.10.1 AI Accelerators: GPUs, TPUs, and more . . . . . . . . 105
5.10.2 Inference-Time Optimization Techniques . . . . . . . . 107
5.11 Summary and Outlook . . . . . . . . . . . . . . . . . . . . . . 110
7.10 Advanced GAN Models . . . . . . . . . . . . . . . . . . . . . . 135
7.10.1 Deep Convolutional GAN (DCGAN) . . . . . . . . . . 135
7.10.2 Conditional GAN (CGAN) . . . . . . . . . . . . . . . . 135
7.10.3 Pix2Pix . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.10.4 CycleGAN . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.10.5 StarGAN . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.11 Normalizing Flow . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.11.1 RealNVP and Glow . . . . . . . . . . . . . . . . . . . . 137
7.12 Diffusion Models: From Noise to Clarity . . . . . . . . . . . . 137
7.12.1 Connection to Denoising Autoencoders . . . . . . . . . 137
7.13 Denoising Diffusion Models (DDPM, DDIM) . . . . . . . . . . 137
7.13.1 DDPM (Ho et al.) . . . . . . . . . . . . . . . . . . . . 137
7.13.2 DDIM (Song et al.) . . . . . . . . . . . . . . . . . . . . 138
7.14 Guided Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.15 Latent Diffusion Model (LDM) . . . . . . . . . . . . . . . . . 138
7.16 Summary and Final Thoughts . . . . . . . . . . . . . . . . . . 139
8 Transformers 140
8.1 Introduction to Transformers and Motivation . . . . . . . . . . 140
8.2 Attention Mechanism: Queries, Keys, Values . . . . . . . . . . 141
8.3 Self-Attention: Contextual Interpretation and Word Disambiguation 142
8.4 Scaled Dot-Product Attention: Formula Walkthrough with Example 143
8.5 Multi-Head Attention: How Multiple Heads Help . . . . . . . 145
8.6 Positional Encoding: Sinusoidal Encoding and Intuition . . . . 147
8.7 Feed-Forward Network (FFN): Structure and Purpose . . . . . 149
8.8 Look-Ahead Mask: Preventing Use of Future Information . . . 151
8.9 Byte-Pair Encoding (BPE): Subword Tokenization Method . . 152
8.10 GPT Architecture: Decoder-Only Transformers and Autoregressive Modeling 155
8.11 BERT and RoBERTa: Masked Language Modeling and Improvements 157
8.12 Vision Transformer (ViT): Transformers for Images as Patches 160
8.13 CLIP: Contrastive Training of Image and Text Encoders . . . 163
8.14 InstructGPT and Reward Models: Fine-Tuning with Human Preference 166
8.15 Reinforcement Learning with PPO: Fine-Tuning Language Models 168
8.16 Direct Preference Optimization (DPO): A New Approach Without RL 171
8.17 Language Models as Reward Models: Concept and Future Implications 174
9 Object-Oriented AI Development Based on MCP 177
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.2 Object-Oriented Programming (OOP) . . . . . . . . . . . . . 178
9.2.1 Classes and Objects . . . . . . . . . . . . . . . . . . . 178
9.2.2 Data Abstraction and Encapsulation . . . . . . . . . . 178
9.2.3 Inheritance . . . . . . . . . . . . . . . . . . . . . . . . 179
9.2.4 Polymorphism . . . . . . . . . . . . . . . . . . . . . . . 179
9.2.5 Benefits of Object-Oriented Programming . . . . . . . 179
9.2.6 Example: Linking to Real-World Projects . . . . . . . 180
9.2.7 Why an Object-Oriented Mindset? . . . . . . . . . . . 180
9.3 Model Context Protocol (MCP) . . . . . . . . . . . . . . . . . 180
9.3.1 Overview and Motivation . . . . . . . . . . . . . . . . 180
9.3.2 Core Concepts of MCP . . . . . . . . . . . . . . . . . . 181
9.3.3 Benefits for AI Systems . . . . . . . . . . . . . . . . . . 181
9.3.4 Illustrating an MCP Workflow with Text . . . . . . . . 182
9.3.5 Example: Minimal Pseudocode for MCP Calls . . . . . 183
9.4 Hierarchical Ontology for Multimodal Systems . . . . . . . . . 184
9.4.1 What is an Ontology? . . . . . . . . . . . . . . . . . . 184
9.4.2 Why Hierarchy Matters for Multimodality . . . . . . . 184
9.4.3 Illustrative Textual Example: Animal Hierarchy . . . . 184
9.4.4 Class-Based Ontology in Simple Code . . . . . . . . . . 185
9.4.5 Task Ontologies . . . . . . . . . . . . . . . . . . . . . . 186
9.4.6 Real-World Examples of Ontologies . . . . . . . . . . . 186
9.4.7 Why This Matters for Object-Oriented AI . . . . . . . 187
9.5 LLMs as AI Engines . . . . . . . . . . . . . . . . . . . . . . . 187
9.5.1 Parallel with Game Engines . . . . . . . . . . . . . . . 187
9.5.2 Tool-Oriented LLM Workflows . . . . . . . . . . . . . . 188
9.5.3 Function Calling in Modern LLMs . . . . . . . . . . . . 188
9.5.4 How an AI Engine Might Loop Internally . . . . . . . . 188
9.5.5 Code Snippet: LangChain-like Workflow . . . . . . . . 189
9.5.6 Benefits of the Engine Approach . . . . . . . . . . . . . 190
9.5.7 Multiple Agents in One Engine . . . . . . . . . . . . . 190
9.6 Implementation Tips and Considerations . . . . . . . . . . . . 191
9.6.1 Define Clear Interfaces . . . . . . . . . . . . . . . . . . 191
9.6.2 Use Ontologies Early . . . . . . . . . . . . . . . . . . . 191
9.6.3 Permission and Security Boundaries . . . . . . . . . . . 191
9.6.4 Controlling Context Size . . . . . . . . . . . . . . . . . 192
9.6.5 Iterative Development and Testing . . . . . . . . . . . 192
9.7 Conclusion and Future Directions . . . . . . . . . . . . . . . . 192
10 AI and the Metaverse: Digital Twins, Egocentric Multimodal AI, and Decentralized GPU Clusters
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
10.2 Digital Twin-Based Physical AI . . . . . . . . . . . . . . . . . 195
10.2.1 What is a Digital Twin? . . . . . . . . . . . . . . . . . 195
10.2.2 Why Combine AI with Digital Twins? . . . . . . . . . 195
10.2.3 Examples of Digital Twin Use . . . . . . . . . . . . . . 196
10.3 Egocentric Multimodal AI Agents for AR Glasses . . . . . . . 196
10.3.1 First-Person Perspective AI . . . . . . . . . . . . . . . 196
10.3.2 Sensors and Multimodal Inputs . . . . . . . . . . . . . 197
10.3.3 AI Models for Egocentric Intelligence . . . . . . . . . . 198
10.3.4 Industry Examples . . . . . . . . . . . . . . . . . . . . 198
10.4 Decentralized GPU Clusters for Training and Inference . . . . 199
10.4.1 The Need for Scalable AI Computation . . . . . . . . . 199
10.4.2 What is a Decentralized GPU Cluster? . . . . . . . . . 199
10.4.3 Benefits of Decentralization . . . . . . . . . . . . . . . 199
10.4.4 Key Technologies . . . . . . . . . . . . . . . . . . . . . 200
10.4.5 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . 201
10.5 Discussion and Future Outlook . . . . . . . . . . . . . . . . . 201
10.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
B.1.3 Why Git? . . . . . . . . . . . . . . . . . . . . . . . . . 208
B.1.4 Git vs. GitHub . . . . . . . . . . . . . . . . . . . . . . 209
B.2 Installing Git and Setting Up a Local Repository . . . . . . . 209
B.2.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . 209
B.2.2 Configuration . . . . . . . . . . . . . . . . . . . . . . . 209
B.2.3 Initializing a New Repository . . . . . . . . . . . . . . 210
B.3 Basic Git Workflow . . . . . . . . . . . . . . . . . . . . . . . . 210
B.3.1 The Edit-Stage-Commit Cycle . . . . . . . . . . . . . . 210
B.4 Branching and Merging . . . . . . . . . . . . . . . . . . . . . . 211
B.4.1 Why Use Branches? . . . . . . . . . . . . . . . . . . . 211
B.4.2 Creating and Switching Branches . . . . . . . . . . . . 211
B.4.3 Merging Branches . . . . . . . . . . . . . . . . . . . . . 212
B.4.4 Resolving Merge Conflicts . . . . . . . . . . . . . . . . 212
B.5 GitHub: Pull Requests and Issues . . . . . . . . . . . . . . . . 212
B.5.1 Pull Requests (PRs) . . . . . . . . . . . . . . . . . . . 212
B.5.2 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
B.6 Advanced Topics: Rebasing and Workflows . . . . . . . . . . . 213
B.6.1 Rebasing . . . . . . . . . . . . . . . . . . . . . . . . . . 213
B.6.2 Common Git Workflows . . . . . . . . . . . . . . . . . 214
B.6.3 Best Practices . . . . . . . . . . . . . . . . . . . . . . . 214
B.7 Homework Assignment: “Git in Practice” . . . . . . . . . . . 214
B.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
C.6.2 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
C.6.3 Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . 222
C.6.4 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
C.7 File Handling Basics . . . . . . . . . . . . . . . . . . . . . . . 223
C.8 Introduction to Object-Oriented Programming (OOP) . . . . . 223
C.8.1 Classes and Objects . . . . . . . . . . . . . . . . . . . 223
C.9 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . 224
C.10 Homework Assignment: Contact Book Project . . . . . . . . . 225
E.10 Step 6: Free Tier Constraints . . . . . . . . . . . . . . . . . . 239
E.11 Step 7: Putting It All Together . . . . . . . . . . . . . . . . . 239
E.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Chapter 1
• Unsupervised learning: The algorithm is given data without labels and is
trying to discover hidden structures or patterns. Here there are no
explicit correct outputs given. Clustering and dimensionality reduction
are examples of unsupervised learning methods.
• Reinforcement learning: The algorithm (often called an agent)
learns by interacting with an environment. Instead of correct input-
output pairs, the agent receives rewards or penalties for its actions and
aims to learn a strategy (policy) to maximize cumulative reward. This
paradigm is inspired by behavioral learning and is used in scenarios like
game-playing, robotics, and control systems.
Each of these paradigms addresses different kinds of problems. In the fol-
lowing sections, we will delve into supervised learning (using a simple exam-
ple of classifying fish), then contrast it with unsupervised and reinforcement
learning, all while introducing key concepts such as model generalization,
overfitting, features, and the recent advances brought by deep learning.
separate the salmon from the bass.
image), it might accidentally pick up meaningless patterns, like a reflection in
the water in one specific salmon image, and treat that as a feature of “salmon-
ness.” This model might correctly classify every training example (since it
effectively memorized them), but then classify new fish in a nonsensical way.
A common analogy for underfitting vs. overfitting is fitting a curve to
data points. Suppose the true relationship is a smooth curve. A very low-
degree polynomial (like a straight line) will underfit (it cannot bend to fit
the data), whereas a very high-degree polynomial can pass through every
training point and will overfit (wiggling wildly between points).
The goal is to find a model complexity that is “just right” – complex
enough to capture the true structure of the data but simple enough to avoid
modeling noise. Techniques like cross-validation (testing the model on
held-out data during training), and regularization methods (which penalize
overly complex models) are commonly used to combat overfitting [23]. We
won’t dive into those techniques here, but it’s important to be aware that
they exist as part of the toolbox for building robust models.
Another example: for face detection (deciding if an image contains a face),
the Viola-Jones algorithm [93] was a groundbreaking approach in the early
2000s. It relied on very simple rectangular pattern features (reminiscent
of Haar wavelets) that basically measure contrasts (like the difference in
pixel intensity between adjacent regions, which can capture things like “the
eye region is darker than the cheeks”). Viola and Jones used a machine
learning method called AdaBoost to automatically select a small set of these
features that are most useful for distinguishing faces from non-faces, and then
combined them in a cascade of simple classifiers for efficient detection [93].
This approach was efficient enough to run in real time and was the first to
enable things like real-time face detection in consumer cameras. It’s a good
illustration of how much thought went into designing and selecting the right
features in traditional computer vision.
In general, feature selection refers to the process of selecting a subset
of relevant features for use in model construction. Especially when you have
a very large number of candidate features, some of them may be redundant
or irrelevant (and including them could actually hurt the model by making
it more prone to overfitting or by slowing it down). Research has shown that
eliminating useless features and focusing on the most informative ones can
improve learning algorithms’ performance [28]. Feature selection can be done
through statistical tests, through algorithms that try different combinations,
or via regularization techniques that implicitly drive weights of unimportant
features to zero.
To summarize: in traditional machine learning, a lot of the “intelligence”
of the solution was in the human-driven step of deciding how to represent
the data (which features to use). A famous saying was “data is the fuel,
but feature engineering is the rocket” – meaning that with the right features,
even a simple algorithm can do very well.
learn increasingly abstract features from the raw data.
The key advantage of deep learning is that if we provide a large amount
of raw data, a deep neural network can learn good features at multiple levels
of abstraction, automatically. This has led to extraordinary breakthroughs.
For example, in image recognition benchmarks, deep learning approaches
started to dominate around 2012. The catalyst was a convolutional neural
network by Krizhevsky et al. (commonly known as AlexNet) that won the
ImageNet image classification challenge by a large margin [51]. AlexNet was
trained on millions of images and was able to learn edge detectors in its
first layer, simple shape detectors in the next, and eventually very complex
structures in deeper layers (as we will discuss in the next section). This was
a departure from earlier methods that might use hand-crafted features like
HOG or SIFT and then feed them into, say, a support vector machine (SVM)
for classification.
To contrast traditional ML with deep learning:
• In deep learning, the pipeline is: collect data (usually a lot more
data is needed) → feed the raw data into a neural network which auto-
matically learns multiple layers of feature transformations → the final
layer of the network produces the predictions. The model (the neural
network) is complex but it discovers for itself what features to use.
Deep learning has been especially successful in fields like computer vision,
speech recognition, and natural language processing, where raw data is high-
dimensional and complex, and where we have benefitted from improvements
in computing power (GPUs) and the availability of big datasets. LeCun,
Bengio, and Hinton (2015) provide a good overview of why deep learning
works and what impact it has had [53]. One key reason is that deep networks
can express very complicated functions (they are very flexible models, with
many tunable parameters), and if regularized and trained on enough data,
they can learn the correct complex patterns rather than overfitting random
noise. Another reason is that by learning features incrementally (one layer
builds on the output of the previous), they can build a hierarchy of concepts
– exactly what human engineers were trying to do with multi-stage feature
engineering, but now it happens automatically.
However, deep learning models are not always the answer to every prob-
lem. They typically require large amounts of data and computational re-
sources to train, and it can be harder to interpret why they make a given
decision (they can feel like a “black box”). For many simpler problems with
limited data, a well-chosen set of features and a simple model might still be
the more practical solution.
Next, we focus on one of the most important types of deep learning ar-
chitecture in the context of computer vision: the Convolutional Neural
Network (CNN).
• Pooling: CNNs often include pooling layers, which down-sample the
image representation to make it smaller and more manageable, and
to aggregate information. For example, a common operation is max-
pooling which takes a 2 × 2 block of neurons in one layer and outputs
the maximum value in that block to the next layer. Pooling helps make
the representation roughly invariant to small translations or distortions
(e.g., if an image shifts by one pixel, the pooled representation might
stay the same).
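To make the 2 × 2 max-pooling operation just described concrete, here is a minimal NumPy sketch; the 4 × 4 input values are arbitrary, and the helper assumes even spatial dimensions:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by keeping the max of each 2x2 block."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 3, 2],
              [2, 6, 1, 1]])

print(max_pool_2x2(x))
# [[4 5]
#  [6 3]]
```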
1.5.1 Softmax and Multi-Class Classification
Many machine learning tasks, including image classification, are not just
yes/no decisions but involve choosing between multiple categories. For ex-
ample, a single CNN might be trained to classify images into 100 different
object categories. How does the network output a decision among many
classes?
Typically, the final layer of a classification neural network uses a func-
tion called softmax. The softmax function converts a vector of raw scores
(sometimes called logits) from the network into a set of probabilities for each
class. Suppose the network’s final outputs (before softmax) are numbers
[z1 , z2 , ..., zK ] for K classes. The softmax for class i outputs:
P(y = i) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}.
Each P (y = i) is in the range (0, 1) and all the P ’s sum to 1, so they can be
interpreted as the predicted probability that the input belongs to class i. The
model’s predicted class would usually be the one with highest probability.
For example, if you input a picture into a trained model and it outputs
(after softmax) [P (cat) = 0.1, P (dog) = 0.7, P (car) = 0.2], then the model
is saying it’s 70% confident the image is a dog, 20% a car, 10% a cat (and it
would choose “dog” as the final answer).
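A minimal NumPy sketch of this computation follows; the logit values are invented so the output roughly matches the example above:

```python
import numpy as np

def softmax(z):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([1.0, 2.9, 1.7])   # hypothetical scores for [cat, dog, car]
probs = softmax(logits)
print(probs)            # roughly [0.10, 0.69, 0.21]
print(probs.argmax())   # 1 -> "dog" would be the predicted class
```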
The training process for such a network uses a loss function called cross-
entropy, which measures the discrepancy between the predicted probabili-
ties and the true answer. Without going into formula details, the network
adjusts its weights via backpropagation to increase the probability of the
correct class for each training example. Over time, the network becomes
better at outputting high probabilities for the right class and low for others.
The softmax function is widely used in classification tasks because it
neatly generalizes the idea of a sigmoid (logistic) function (which covers the
two-class case) to multiple classes, and it provides a probabilistic interpreta-
tion of the network’s output. Understanding softmax is important because
it’s a core component in many AI systems – from image classifiers to language
models.
1.6 Unsupervised Learning and Clustering
Not all learning is about predicting a label. In unsupervised learning, we
try to make sense of data without any labeled examples guiding us. One
common unsupervised task is clustering: dividing data into groups (clus-
ters) such that items in the same group are more similar to each other than
to those in other groups.
Let’s illustrate clustering with a playful example. Imagine we have a
collection of animal images that include ducks, rabbits, and hedgehogs, but
we have no labels for which image is which animal. A clustering algorithm,
when given all the images, might group them into three clusters: one cluster
containing mostly ducks, another with rabbits, and another with hedgehogs.
In this case, the algorithm has discovered the natural grouping correspond-
ing to the animal types, without ever being told what a “duck” or “rabbit”
or “hedgehog” is. We as humans could then look at the clusters and as-
sign meaning (“Cluster 1 seems to be ducks, cluster 2 rabbits, cluster 3
hedgehogs”). This kind of task is useful when you want to find structure in
data—like grouping customers by purchasing behavior, grouping news arti-
cles by topic, or grouping genes by expression patterns—without pre-specified
categories.
A very well-known clustering algorithm is k-means clustering [58]. The
way k-means works is:
1. Choose the number of clusters, k.
2. Initialize k points in the data space (these will serve as initial centroids of clusters).
3. Assign each data point to the nearest centroid (using a distance metric, typically Euclidean distance).
4. Recompute each centroid as the mean of all data points assigned to it.
5. Repeat steps 3 and 4 until the assignments stop changing (or a maximum number of iterations is reached).
The result is a partition of the dataset into k clusters. This algorithm is sim-
ple yet often effective, and it’s been around for a long time (first introduced
by MacQueen in 1967 [58]).
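Here is a minimal NumPy sketch of those steps on toy 2-D data. In practice one would typically use a library implementation (for example, scikit-learn's KMeans); this sketch also ignores corner cases such as a cluster losing all of its points:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Tiny k-means: returns cluster assignments and centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))   # toy unlabeled data
labels, centroids = kmeans(X, k=3)
```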
Clustering doesn’t give you definitive answers (because without labels,
there isn’t a single “correct” clustering—there can be multiple ways to group
data). However, it can be a great exploratory tool. In our animal exam-
ple, maybe the clustering algorithm actually grouped animals by background
color instead of species—then we’d realize we need to extract better features
(e.g., focus on the animal shape, not the whole image) for meaningful clus-
tering.
Unsupervised learning includes other techniques beyond clustering, like
dimensionality reduction (e.g., Principal Component Analysis) which
simplifies data while preserving as much structure as possible, or anomaly
detection where the goal is to find unusual data points that don’t fit any
cluster well. But clustering is a cornerstone concept to grasp because it con-
trasts with classification: clustering creates its own labels (cluster IDs) based
on inherent similarity, whereas classification needs given labels to learn from.
In many supervised tasks, you have a fixed dataset. The learning algo-
rithm passively observes examples and tries to generalize. By contrast, in
RL, the agent’s actions influence the data it sees next, because the envi-
ronment responds to those actions. This creates a feedback loop where
the agent’s decisions directly affect future inputs and future possibilities for
earning rewards.
1.6.2 The Potential of Simulation-Based Learning
A key advantage of reinforcement learning is that it does not necessarily
require a static dataset. Instead, the agent can generate its own experience
by exploring an environment. This environment could be:
• A real physical system (for example, a robot interacting with the real
world).
• Exploitation: The agent uses its current knowledge to pick the best
known action in a given state. If the agent already believes one action
has the highest payoff, it consistently chooses that action to maximize
immediate reward.
• Exploration: The agent tries actions that it is less certain about, to
gain more information. Even if these actions seem worse initially, they
might lead to discovering better strategies that yield higher rewards in
the long run.
Another example is training agents to play Atari video games from raw
pixels [20]. The agent uses the game score as its reward, exploring differ-
ent joystick actions to improve performance over time. In robotics, RL can
be used to learn locomotion skills, manipulate objects, or navigate around
obstacles [49].
1.7 Conclusion
In this chapter, we covered a broad spectrum of fundamental concepts in ma-
chine learning and introduced how deep learning extends these ideas. We be-
gan with the notion of learning from data, emphasizing the difference between
supervised learning (learning from labeled examples), unsupervised learning
(finding structure without labels), and reinforcement learning (learning via
reward and punishment through interaction). Through the example of the
salmon vs. bass fish classifier, we illustrated what it means for a model to
learn a decision boundary and the pitfalls of underfitting (model too simple)
and overfitting (model too complex and tuned to noise). We discussed the
importance of features in traditional ML and how methods like feature se-
lection and hand-crafted descriptors (e.g., HOG for image detection or Haar
features for face detection) were crucial in earlier approaches.
We then described the paradigm shift brought by deep learning, where
feature extraction is no longer manual but learned by the layers of a neural
network. Convolutional Neural Networks exemplify this by learning low-
level to high-level features directly from image pixels, enabling state-of-the-
art performance in vision tasks. We explained how CNNs work, in intuitive
terms, and introduced the softmax function as the key to making multi-class
predictions with neural networks.
In unsupervised learning, we saw how algorithms like k-means can let us
discover hidden groupings (like clustering animals by similarity), which is a
different goal than predicting a specific label. And in reinforcement learning,
we saw a completely different learning setup where an agent learns from
trial and error using feedback in the form of rewards, allowing AI to achieve
goals in dynamic environments (for example, mastering games or controlling
robots).
To a newcomer, this might seem like a lot of diverse concepts, but the
unifying theme is learning from data. Whether it’s fitting a curve, choosing
features, adjusting millions of neural network weights, clustering data points,
or updating an agent’s strategy – in all cases the system is improving its
performance by observing examples or feedback. The specific techniques
differ, but they all move beyond explicitly programming every detail, and
instead, the programmer provides a framework (like a model structure or a
reward function) and the data or experience, and the algorithm figures out
a good solution.
As you continue in AI and machine learning, you’ll delve deeper into each
of these topics. Supervised learning will lead you to numerous algorithms
(from linear regression and decision trees to SVMs and deep networks) and
practices for model evaluation. Unsupervised learning will introduce you
to methods for discovering patterns and compressing data. Reinforcement
learning will teach you about balancing exploration and exploitation and
optimizing long-term returns. And deep learning will open up a range of
specialized architectures (CNNs for images, RNNs and transformers for se-
quences, etc.) and tricks for training them.
The concepts covered here form a foundation: understanding what it
means to overfit, why features matter, how learning paradigms differ, and
what makes deep learning powerful. With this foundation, you can better
appreciate both the potential and the limitations of AI systems. Machine
learning is a fast-evolving field, but these core ideas remain relevant and will
help you navigate more advanced material and real-world applications.
Chapter 2
Optimization
2.1 Introduction
In the field of artificial intelligence (AI), neural networks are computational
models inspired by the brain’s interconnected neurons. Modern networks
can be very deep, containing many layers, and they can solve complex tasks
such as image classification, language translation, or speech recognition. But
how do these networks actually “learn” appropriate parameters (weights and
biases) to make accurate predictions?
The core answer involves two key ideas:
• Difficulties in Training Deep Networks. We address the vanishing
gradient problem and see why deeper networks can be harder to train.
by considering the product of probabilities assigned to each observed
outcome.
In classification, we typically assume our neural network outputs “proba-
bilities” for each class. Then, maximizing likelihood of the observed training
data is equivalent to minimizing the negative log-likelihood, which leads us
directly to cross-entropy loss.
2.2.4 Cross-Entropy in Classification
Consider a classification problem with C possible classes. We denote the
true label of a training example by a one-hot vector y. For example, if the
correct class is 2 out of {1, 2, 3}, then y = [0, 1, 0]. Let ŷ = [ŷ_1, . . . , ŷ_C] be
the network's predicted probabilities for each class, with \sum_{c=1}^{C} ŷ_c = 1.
The cross-entropy loss for this one example is:
L_{CE}(y, ŷ) = -\sum_{c=1}^{C} y_c \log ŷ_c = -\log ŷ_{\text{correct}},
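A minimal NumPy sketch of this loss for one example follows; the predicted probabilities are invented, and a small epsilon guards against log(0):

```python
import numpy as np

def cross_entropy(y_onehot, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -np.sum(y_onehot * np.log(y_pred + eps))

y = np.array([0.0, 1.0, 0.0])        # true class is class 2 of 3
y_hat = np.array([0.2, 0.7, 0.1])    # hypothetical network output (sums to 1)
print(cross_entropy(y, y_hat))       # = -log(0.7) ≈ 0.357
```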
2.3.1 Local Minima, Global Minima, and Saddle Points
It helps to visualize the loss function as a high-dimensional “landscape.” The
shape can be quite complicated, with many peaks (high loss) and valleys (low
loss). A global minimum is the absolute lowest point. A local minimum is a
point where the loss cannot be decreased by any small step, but it might not
be the global lowest. Additionally, neural network loss landscapes can have
saddle points, where the gradient is zero yet the point is neither a true local
minimum nor a maximum.
A key fact in modern neural networks is that we rarely find a single global
minimum. Instead, we converge to a “good enough” local minimum or a low-
loss region. Empirically, in high-dimensional spaces, many local minima or
saddle regions can yield similarly good generalization performance, so the
distinction between a local vs. global minimum is not always as critical as
once feared.
2.3.3 Stochastic and Mini-Batch Gradient Descent
Batch gradient descent recalculates the gradient using the entire dataset
at each step, which can be computationally expensive. Thus, two common
alternatives are:
• Stochastic Gradient Descent (SGD): update the parameters using the gradient of a single training example at a time, trading noisier updates for much cheaper steps.
• Mini-Batch Gradient Descent: update using a small batch of examples (commonly tens to a few hundred), balancing the noise of SGD against the cost of full-batch updates.
Regardless of the specific variant, all rely on the gradient, which must
be computed efficiently. For neural networks, this is done through back-
propagation.
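To make the distinction concrete, here is a minimal mini-batch gradient descent loop for a linear model in plain NumPy. Setting batch_size to 1 recovers stochastic gradient descent, while setting it to the full dataset size recovers batch gradient descent; the learning rate, batch size, and toy data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # toy inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)     # toy targets

w = np.zeros(3)                                  # parameters to learn
lr, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # gradient of mean squared error
        w -= lr * grad                              # gradient descent update

print(w)   # ends up close to [2.0, -1.0, 0.5]
```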
2.4 Backpropagation: Computing Gradients
in Deep Networks
2.4.1 Chain Rule Refresher
In a multi-layer network, each layer transforms its input via a function, and
the final output is used to compute the loss. To find ∂L/∂θ for each parameter
θ, we repeatedly use the chain rule of calculus:
\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x},
for a function y(u(x)). In a deep network, we have many compositions:
h(1) = f1 (x), h(2) = f2 (h(1) ), . . .. The chain rule tracks how changes in early
layers propagate through subsequent layers to alter the final loss.
Why is Backprop so Efficient? If you tried to compute each partial
derivative from scratch for millions of parameters, you would do a lot of
redundant work. Backprop reuses intermediate gradients in a systematic
way, letting you compute all needed partial derivatives in about the same
order of time as a few forward passes.
z = (x + y) \cdot w.
We want ∂z/∂x, ∂z/∂y, and ∂z/∂w. The forward pass is: first compute the
intermediate value u = x + y, then the output z = u \cdot w.
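The same computation can be sketched in a few lines of Python, with the backward pass written out by hand via the chain rule (the scalar values are arbitrary):

```python
# Forward pass: compute the intermediate value u and the output z.
x, y, w = 2.0, 3.0, 4.0
u = x + y          # u = 5
z = u * w          # z = 20

# Backward pass: apply the chain rule, reusing the local derivatives.
dz_du = w          # ∂z/∂u = w
dz_dw = u          # ∂z/∂w = u = x + y
dz_dx = dz_du * 1  # ∂z/∂x = ∂z/∂u · ∂u/∂x = w
dz_dy = dz_du * 1  # ∂z/∂y = ∂z/∂u · ∂u/∂y = w

print(dz_dx, dz_dy, dz_dw)   # 4.0 4.0 5.0
```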
Example of Vanishing
If each layer’s typical derivative magnitude is 0.9, and you have 30 layers,
the gradient shrinks roughly like 0.9^{30} ≈ 0.042. That is less than 5% of its
original magnitude, severely slowing or halting meaningful learning in early
layers.
Common Solutions
• ReLU-type Activations: ReLU activations do not saturate in the
positive domain, so their derivative is 1 for positive inputs. This helps
keep gradients from systematically decaying across layers.
• Weight Initialization: Careful initialization ensures that signals nei-
ther explode nor vanish at the start of training. Methods like Xavier
(Glorot) or He initialization set the variance of weights to keep the
standard deviation of activations stable.
• Normalization Layers: Techniques such as Batch Normalization or
Layer Normalization rescale activations so that each layer’s outputs
have consistent means and variances, preventing extreme values and
helping maintain stable gradients.
Here, F (x) might consist of a few convolutional layers, batch normalization,
and ReLU activations. Meanwhile, x is passed directly to the output and
added on—hence the name “skip” or “shortcut.”
• Empirical Success. ResNets with 50, 101, and 152 layers became fea-
sible to train and outperformed shallower networks in ImageNet recog-
nition tasks [31]. This “deeper is better” approach unleashed a new
wave of highly deep architectures in both vision and language models.
In all these cases, the principle remains: letting information “skip” or bypass
certain transformations can greatly ease the optimization difficulties of deep
networks.
2.7 Putting It All Together: Practical Train-
ing Steps
Bringing the concepts together, training a typical deep neural network in-
volves the following:
Note on Overfitting. If your model fits the training set extremely well but
fails on validation or test data, you are overfitting. Techniques like dropout,
weight decay, or data augmentation are then used to regularize training.
2.8 Extended Explanations of Tricky Concepts
2.8.1 Local Minima, Saddle Points, and High-Dimensional
Landscapes
Newcomers often worry that gradient descent might get stuck in “bad” local
minima. In low-dimensional problems (like a 2D function), local minima can
be a big issue. But neural networks are typically very high-dimensional (often
millions of parameters). In such huge spaces, local minima that are truly
“bad” everywhere are statistically rare. More common are flat or saddle-like
regions where gradients are very small, causing slow progress.
Practically, modern gradient-based optimizers still find solutions that gen-
eralize well, even if they are not global minima. Researchers have shown that
many local minima yield similarly good performance. Consequently, while
local minima and saddle points exist, they do not typically ruin training the
way one might initially fear. Tuning hyperparameters (e.g., learning rate,
batch size) often has a more direct impact on final performance than stress-
ing over local vs. global minima.
you try to pass a gradient backward:
0.8^{10} ≈ 0.11.
Only around 10% of the original signal remains. Increase layers to 30, and
you end up with about 0.8^{30} ≈ 0.001. That is barely 0.1% of the original
gradient. Those earliest layers do not get enough signal to update effectively.
In practice, factors can be even smaller, leading to near-total vanishing.
y = x + F (x),
the gradient from y back to x has a direct path with derivative 1, circum-
venting the repeated multiplication by α. This architecture ensures deeper
networks (such as 50 or 100 layers) can still learn effectively because early
layers receive strong gradient signals.
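A tiny numerical sketch of this effect, treating each block's transformation F as having a small scalar derivative a (the value of a and the depth are arbitrary illustrative choices):

```python
# Compare how a gradient signal survives 30 "plain" layers vs 30 residual layers.
a = 0.1      # assumed per-layer derivative of F, chosen for illustration
depth = 30

plain = a ** depth            # plain stacking: factors multiply to ~1e-30
residual = (1 + a) ** depth   # y = x + F(x): per-block derivative is 1 + a

print(plain)      # ~1e-30 -> the signal is effectively gone
print(residual)   # ~17.4  -> the identity path keeps the signal alive
```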
• Challenges in Deep Networks. Vanishing gradients can hamper
learning in early layers. Merely adding more layers sometimes degrades
performance without additional strategies.
Chapter 3
3.1 Introduction
Artificial Neural Networks (ANNs) are computational models inspired by the
neural connections in the human brain. They serve as a powerful approach
to approximating a wide range of functions, which has led to their success in
tasks such as image recognition, language understanding, speech processing,
and more. Over the past decade, improvements in hardware and algorithms
have brought deep learning to the forefront of machine learning research and
real-world applications.
In these expanded lecture notes, we will begin by recalling the concept of
linear regression and see why a neural network without activation functions
behaves similarly to a single linear model. We will then explore how acti-
vation functions transform these stacked linear layers into highly expressive
models capable of modeling non-linear relationships.
Moving forward, we will introduce Multi-Layer Perceptrons (MLPs) and
discuss how the parameter count can explode when dealing with high-dimensional
data. This naturally motivates Convolutional Neural Networks (CNNs),
which reduce the number of parameters by sharing weights in local regions
of the input. We will also turn to Recurrent Neural Networks (RNNs), which
maintain a hidden state across time for sequential data, before exploring
Transformers that rely on attention mechanisms for parallelizable, long-
range sequence modeling. Finally, we will see how modern deep learning
techniques handle data-scarce situations through semi-supervised and self-
supervised learning, culminating in the introduction of Vision Transformers
(ViT) for image tasks.
These notes aim to be accessible to undergraduate students and begin-
ners with no prior background in AI or advanced mathematics. Wherever
possible, we will focus on intuitive explanations that build a solid foundation
for further study.
z1 = W1 x + b1 , z2 = W2 z1 + b2 , ...
W2 (W1 x+ b1 ) + b2 is just another linear mapping from x to the output. This
means that no matter how many such layers you stack, you never increase
the true modeling capacity beyond that of a single linear regression model
[25].
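A quick NumPy check of this collapse: two stacked linear layers are exactly equivalent to a single linear layer with W = W2 W1 and b = W2 b1 + b2 (the layer sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

# Two stacked linear layers with no activation in between
z1 = W1 @ x + b1
z2 = W2 @ z1 + b2

# One equivalent linear layer
W = W2 @ W1
b = W2 @ b1 + b2
z_single = W @ x + b

print(np.allclose(z2, z_single))   # True: stacking added no expressive power
```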
This observation underscores the reason why activation functions are cru-
cial to the power of neural networks. By introducing non-linearity, activation
functions enable the network to learn more complex, nonlinear relationships
in the data.
• Tanh: Similar “S”-shaped curve but outputs in the range (-1, 1). This
can help center the data around zero. However, it still saturates at
large positive or negative inputs.
• ReLU (Rectified Linear Unit): Defined as max(0, x). This is piece-
wise linear: it outputs 0 for negative x and x itself for positive x. ReLU
is computationally efficient and has a gradient of 1 for all positive in-
puts, which reduces the vanishing gradient issue [29]. It has become
the default choice for many hidden layers in modern deep networks.
Choosing the right activation can affect both convergence speed and final
performance, but the common thread is that any non-linear activation helps
a network escape the limitations of purely linear models.
size 224 × 224 pixels with three color channels, i.e., roughly 150,000 input values. If we
feed these values directly into a single hidden layer of 1000 neurons, that
first layer alone has about 150,000 × 1,000 weights, which is 150 million
parameters, plus 1000 biases. Even with modern hardware, this is large and
can lead to overfitting or slow training [52].
Furthermore, images exhibit spatial structure (nearby pixels often relate
to each other) that fully-connected layers do not exploit. This inefficiency in
parameter usage is a key motivation for more specialized architectures. Still,
MLPs remain relevant for structured data or lower-dimensional data where
this explosive growth is less severe.
This approach dramatically reduces the number of parameters while also
leveraging the fact that certain features (like edges) can appear anywhere in
an image.
Padding When the filter approaches the edges of an image, part of the
filter would lie outside the image if we did not add padding. A common
strategy is zero-padding, where extra rows and columns of zeros are added
around the image. This influences the output size and ensures that the filter
can be applied at the boundary. Without padding, the output shrinks after
each convolution, which might be undesirable in some designs.
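A minimal NumPy sketch of sliding one 3 × 3 filter over an image with zero-padding of one pixel, so the output keeps the input's spatial size. The "image" and filter values are arbitrary, and deep learning libraries provide heavily optimized versions of this operation (which, strictly speaking, compute cross-correlation, as below):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Slide one kernel over a 2D image with zero-padding ('same' output size)."""
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))   # zero-padding
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            window = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)   # same weights at every position
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[-1, 0, 1],                     # crude vertical-edge filter
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
print(conv2d_same(image, kernel).shape)            # (6, 6): size preserved
```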
achieves a degree of translational invariance. Small shifts in the input image
result in less dramatic changes in the pooled representation.
position in the sequence, allowing the network to generalize across different
parts of the input [10].
each query, we compute scores with every key to gauge how much attention
the model should pay to that key’s corresponding value. These scores are
normalized via a softmax, resulting in attention weights. The output is then
a weighted sum of the value vectors.
This is often referred to as Scaled Dot-Product Attention:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
where dk is the dimensionality of keys. Because every element can attend
to every other element in parallel, Transformers can capture long-range re-
lationships in a single layer, rather than through multiple recurrent steps
[90].
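A minimal NumPy sketch of scaled dot-product attention for a single sequence, using random Q, K, and V, a single head, and no masking:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
print(attention(Q, K, V).shape)   # (5, 8): one output vector per position
```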
3.8.2 Entropy Minimization
One simple idea is entropy minimization, which encourages the model to
produce confident (low-entropy) predictions on unlabeled data. Suppose the
network’s output layer is a softmax distribution over classes. If the dis-
tribution for an unlabeled sample is uniform or uncertain, the entropy is
high. By penalizing high entropy, the model is nudged to produce sharper,
low-entropy predictions without explicit labels. This technique can push the
decision boundary into low-density regions of the data space.
3.8.3 Pseudo-Labeling
Another popular method is pseudo-labeling or self-training [56]. The model
uses its own high-confidence predictions on unlabeled samples as if they were
ground-truth labels. Then, these pseudo-labeled samples are added to the
training set. This simple approach can be surprisingly effective, though it
requires care, because if the model is confident yet wrong, it can reinforce its
mistakes. Researchers often combine pseudo-labeling with confidence thresh-
olds or repeated refinement to mitigate error propagation.
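A minimal sketch of the pseudo-label selection step, assuming a hypothetical predict_proba(model, X) helper that returns per-class probabilities for each sample; the 0.95 confidence threshold is an arbitrary choice:

```python
import numpy as np

def make_pseudo_labels(model, X_unlabeled, predict_proba, threshold=0.95):
    """Keep only the unlabeled samples the model is very confident about,
    and treat its predicted class as if it were a ground-truth label."""
    probs = predict_proba(model, X_unlabeled)    # shape: (n_samples, n_classes)
    confidence = probs.max(axis=1)
    pseudo_labels = probs.argmax(axis=1)
    keep = confidence >= threshold               # confidence threshold
    return X_unlabeled[keep], pseudo_labels[keep]

# X_pseudo, y_pseudo = make_pseudo_labels(model, X_unlabeled, predict_proba)
# The pseudo-labeled pairs would then be added to the labeled training set
# and the model retrained, typically over several rounds.
```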
By combining these ideas with strong data augmentation and consistency
constraints, modern semi-supervised methods narrow the gap between fully
supervised training and training with limited labeled data.
• Distinguishing augmented views of the same sample from views of dif-
ferent samples (contrastive methods).
breaks an image into patches, each patch treated like a token in the Trans-
former pipeline. For instance, a 224 × 224 image can be split into 16 × 16
patches, leading to (224/16)^2 = 14^2 = 196 patches. Each patch is flattened
and projected into a vector embedding.
A special “class token” can be prepended, and standard Transformer
layers process all these patch embeddings in parallel using multi-head self-
attention. This lets each patch “attend” to other patches regardless of spatial
distance, allowing the model to capture both local and global patterns in the
first layer.
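A minimal NumPy sketch of the patch-extraction step, using the 224 × 224 image and 16 × 16 patch sizes from above; the linear projection is just a random matrix for illustration, and the embedding width of 256 is an arbitrary choice:

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Split an H x W x C image into flattened non-overlapping patches."""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (14, 14, 16, 16, 3)
    return patches.reshape(-1, patch * patch * C)     # (196, 768)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))     # stand-in for a real image
tokens = image_to_patches(image)
print(tokens.shape)                        # (196, 768): 196 patch "tokens"

embed = tokens @ rng.normal(size=(768, 256))   # linear projection to embeddings
print(embed.shape)                             # (196, 256)
```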
3.11 Conclusion
We have traversed a broad landscape of deep learning approaches:
• Multi-Layer Perceptrons (MLPs): Showed how parameters grow
rapidly with input size, motivating more specialized architectures for
high-dimensional inputs.
• Convolutional Neural Networks (CNNs): Leveraged local con-
nectivity and weight sharing for images, drastically reducing parame-
ters and respecting spatial structure.
• Recurrent Neural Networks (RNNs): Provided a way to handle
sequential data through hidden states passed over time. BPTT al-
lowed training but brought issues of vanishing gradients, mitigated by
LSTM/GRU cells.
• Transformers: Replaced recurrence with self-attention, enabling par-
allel processing of sequences and better handling of long-range depen-
dencies. Became a core architecture in NLP and are expanding to other
domains.
• Semi-Supervised Learning: Employed entropy minimization and
pseudo-labeling to utilize unlabeled data, bridging the gap between
supervised and unsupervised settings.
• Self-Supervised Learning: Created training signals from data itself,
as seen in masked language modeling (BERT), autoregressive modeling
(GPT), and contrastive learning (SimCLR).
• Vision Transformers (ViT): Extended the Transformer architecture
to images by splitting them into patches, enabling global attention in
a single layer and achieving competitive performance with CNNs.
Taken together, these developments underscore the flexibility and strength
of deep learning. The same broad principles – layering transformations with
non-linear activations, leveraging shared parameters for efficiency, and learn-
ing from data directly – drive the architecture design in various modali-
ties (images, text, time series, etc.). Where labeled data is scarce, semi-
supervised and self-supervised approaches continue to expand the reach of
AI, enabling large models to learn universal features from unlabeled corpora
or images.
As you progress, you may dive deeper into each architecture’s detailed
mathematics or attempt to implement them from scratch. While new tech-
niques and variations will surely arise, the foundations covered here remain
essential to understanding modern deep learning systems. Mastery of these
concepts will prepare you to adapt and innovate in the constantly evolving
field of AI.
Chapter 4
History of AI
Cross-Entropy in Model Training In supervised learning for classifica-
tion, a common loss function is the cross-entropy:
H(p, q) = -\sum_{x} p(x) \log q(x),
where p(x) is the true distribution (often represented as one-hot labels) and
q(x) is the model’s predicted distribution. Cross-entropy reaches its mini-
mum when p(x) and q(x) match exactly.
4.5 Loss Landscape and Skip-Connection
Loss Landscape A neural network’s loss landscape describes how the loss
function changes as we move through the space of possible parameter values.
Difficult landscapes with many “cliffs” or sharp local minima can slow or
derail training.
Key Characteristics
• Energy Function: Each configuration of the network is assigned an energy
E(s) = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j - \sum_{i} b_i s_i,
where s denotes the states of all neurons, w_{ij} the weight between neu-
rons i and j, and b_i the bias term for neuron i. The network naturally
evolves toward states that minimize this energy.
• Iterative Update: Neurons update their states one at a time or in
small groups. Each update rule typically involves switching the neu-
ron’s state to the sign of the weighted sum of its inputs.
where v represents the visible units, h the hidden units, wij the connections
among visible units, wkl the connections among hidden units, wik the cross-
connections, and bi , ck the biases for visible and hidden units respectively.
The network defines a probability distribution over the states (v, h) via the
Boltzmann distribution:
P(v, h) = \frac{1}{Z} \exp\big(-E(v, h)\big),
where Z is the partition function ensuring that all probabilities sum to 1.
• Limitations:
Examples
• Sigmoid: σ(x) = \frac{1}{1 + e^{-x}}
• tanh: tanh(x)
4.10 Convolutional Neural Network
Convolutional Layers Instead of full connections, convolutional layers
use small filters (kernels) that slide across the input. These filters are shared
across locations, greatly reducing parameters and capturing local structures
(e.g., edges in images).
Receptive Field In a CNN, the receptive field is the region of the input
that influences a neuron in the output. Deeper layers combine information
from earlier layers, expanding the receptive field. This allows higher layers
to detect more complex features.
4.12 Transformer
Transformers remove the RNN-style recurrence. They rely on attention,
which compares all pairs of tokens in a sequence simultaneously:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,
where Q, K, and V are query, key, and value matrices respectively, and d
is a scaling factor. This parallel approach speeds up training, especially for
long sequences.
4.13 Self-Supervised Learning
Self-supervised learning methods allow models to train on unlabeled data by
creating proxy tasks. One such task is self-prediction, where certain parts of
the data are masked or removed, and the model tries to predict them. This
reduces the need for extensive human-generated labels.
This approach underlies many modern language models, such as GPT, which
generate text one token at a time.
In-Context Learning Once trained, GPT can adapt to new tasks simply
by reading example prompts in the input:
4.13.4 Vision Transformer (ViT)
Vision Transformers adapt the Transformer architecture to images by split-
ting them into patches and treating each patch as a token. The attention
mechanism then learns how patches relate to one another, enabling the model
to classify or interpret an entire image.
4.14 Evolution of AI
AI has progressed through several major stages, each marked by key theo-
retical and practical advances as well as new, larger datasets.
a systematic way to compute how each weight contributes to network
error. This solved a crucial training bottleneck for multi-layer neural
networks.
• Scale and Generality: Larger models, especially in language tasks,
demonstrated unprecedented capabilities. Models like BERT (2018),
GPT-2 (2019), GPT-3 (2020), and subsequent generations performed
well on a wide variety of tasks without specialized architectures for each
task type.
Across each era, larger datasets have consistently enabled bigger models
and more advanced techniques, acting as a key driver in the evolution of AI.
4.15 Datasets
Datasets form the foundation for training and evaluating AI models. Below
are some major datasets that have propelled research in machine learning
and computer vision.
MNIST
CIFAR-10 and CIFAR-100
ImageNet
OpenImages
• Variety: Covers diverse categories and object types with multiple an-
notations per image.
• Usage: Encourages research in object detection, instance segmentation,
and other computer vision tasks at scale.
Common Crawl
• Description: A massive collection of raw web data, continually updated
from billions of webpages.
• Relevance to Language Models: Used in large-scale text pre-training
for models like GPT. Contains diverse, real-world text samples.
• Size: Petabytes of data, making it invaluable for training state-of-the-
art natural language processing systems.
LAION
• Description: A publicly available dataset of millions of image-text pairs.
• Purpose: Aims to advance multimodal research, enabling tasks where
both visual and textual understanding are crucial.
• Use Cases: Training large-scale models that can associate images with
their captions. Models trained on LAION data can generate text de-
scriptions for images, perform image-based question answering, and
more.
These datasets each address different needs in AI development. Whether
it’s benchmarking simple classification (MNIST), tackling complex image
tasks (ImageNet, MS-COCO), or training vast language models (Common
Crawl), they have collectively shaped the progress and capabilities of modern
AI systems.
loop of a model. Rather than relying on numerical reward signals from an
environment (as in traditional Reinforcement Learning), RLHF uses data col-
lected directly from human evaluators who compare different model outputs
and indicate which one is better.
Training Steps
1. Supervised Fine-Tuning:
• Using the reward model as a proxy for human approval, the sys-
tem optimizes the policy (i.e., the original language model) to
maximize the reward score.
• This is often done with methods such as Proximal Policy Opti-
mization (PPO), although other RL algorithms can be used.
• During this process, the model learns to produce outputs that
humans find more acceptable or useful.
• After training, the reward model can be applied to new model outputs
to assign scores indicating how closely they match human desires.
Examples of Red Teaming Strategies
• Provocative questions: Asking offensive or controversial questions
to test model responses.
Key Benefits
• Enhanced reasoning: By sequentially explaining its thought process,
the model can tackle tasks requiring multiple steps (e.g., multi-stage
math problems or logical puzzles).
Implementation Approaches
• Prompt-based: The user asks the model to “explain step-by-step” or
to “show your reasoning.”
4.19 Self-Instruct
Self-Instruct is a method where a language model uses a small set of “seed”
instructions and outputs. It then generates new instructions and potential
answers, effectively teaching itself how to handle various requests.
Advantages
Comparison to RLHF
Motivation
1. Finetuning:
• Improves performance on a particular type of task but may reduce
generality if too narrowly fine-tuned.
2. Prompt Engineering:
3. Retrieval-Augmented Generation:
Conclusion
Throughout the history of AI, new algorithms and architectural innovations have often led to significant leaps in capability, and each breakthrough has been closely tied to the availability of larger or more specialized datasets.
By understanding these key developments and techniques, students can ap-
preciate both where AI comes from and where it might go next.
Chapter 5
Model Scaling
embedding sizes or sequence lengths. This can increase the amount of
information the model can handle.
(data/model parallelism and tools like DeepSpeed) that allow training mas-
sive models across many devices. We must also consider the environmental
and computational tradeoffs: larger models can be extremely costly to
train and deploy, so efficiency and sustainability are growing concerns. We’ll
discuss how specialized AI accelerators (GPUs, TPUs, etc.) and inference-
time optimizations (quantization, pruning, etc.) are used to manage these
costs. Finally, we’ll conclude with a summary and outlook on the future of
model scaling.
scaling. The model had very large numbers of parameters (e.g., about 138
million in VGG-16, largely due to the fully-connected layers at the end) and
was computationally expensive, making it slow to train and use. The large
parameter count meant VGG was prone to overfitting and heavily reliant on
regularization and huge datasets. Nevertheless, the success of VGG was a
turning point: it demonstrated that deeper (and conceptually simpler) networks can outperform shallower ones, inspiring further exploration into depth.
Thanks to architectural innovations (such as residual blocks with normalization, as we’ll discuss later), depth scaling has essentially no fundamental barrier: networks with 100+ layers can not only be
trained, but also generalize well, as long as they’re designed to mitigate opti-
mization issues. This opened the door to going deeper and deeper whenever
more performance was needed, without being stuck by training divergence.
After ResNet, researchers even tried thousand-layer networks as experiments
(e.g., ResNet-1001) and found they could still train, although in practice,
other limitations like data and compute become the bottleneck.
The Inception family illustrates another aspect of scaling: you can scale depth, but
also width and multi-branch complexity, to get better performance. While
ResNet showed raw depth can be pushed, Inception showed that a thought-
fully structured wide architecture can extract richer features without blowing
up computation.
layer produces something useless, a later layer could learn to ignore it since
all features are available.
In summary, depth-based scaling has been a fundamental driver of progress:
• ResNet solved the key optimization problem, enabling very deep net-
works to train successfully through residual connections.
• DenseNet proved that dense feature reuse can allow going deep without a blow-up in parameters, by weaving layers together.
5.3.1 Neural Architecture Search (NAS) for Scaling
Neural Architecture Search (NAS) refers to techniques that automate the de-
sign of neural network architectures. Instead of a human manually specifying
the number of layers, filter sizes, etc., NAS uses an algorithm (like a genetic
algorithm, reinforcement learning agent, or gradient-based method) to search
through the space of possible architectures and find high-performing ones.
Early NAS work by Zoph and Le (2017) [96] demonstrated that it’s possible
to learn convolutional architectures that rival or even beat manually-designed
networks. They employed a reinforcement learning controller to sample ar-
chitecture descriptions (like how many filters, kernel sizes, skip connections,
etc.), trained each candidate on data, and used the performance as a reward
signal to improve the controller. This process, while conceptually straight-
forward, was extremely computationally expensive at first (requiring tens of
thousands of training runs). Nonetheless, it proved the point that algorithms
can discover non-intuitive architectures. For instance, NAS found motifs like
skip connections and convolutions of varying sizes, some of which resembled
Inception-like modules or other patterns.
One notable result of NAS was the NASNet-A architecture (Zoph et
al., 2018) and the AmoebaNet series (Real et al., 2018, evolved with evo-
lutionary strategies). NASNet-A, when scaled up to a large model for Ima-
geNet, slightly outperformed human-designed models of similar cost. How-
ever, the discovered architecture was quite complex and irregular, which
made it harder to interpret or implement efficiently. NASNet did introduce
the idea of cells: rather than searching the entire large network structure,
they searched for a small cell (a subgraph of layers) that can be stacked re-
peatedly to form the full network. This cell-based design made it easier to
scale the discovered architecture to different depths and widths.
coefficient φ that uniformly scales the network’s depth, width, and resolution according to preset ratios. For example, in EfficientNet, increasing φ by 1 might mean: increase depth by factor α, width by factor β, and image size by factor γ, where α, β, γ are chosen such that the overall FLOPs roughly multiply by a certain amount (e.g., 2^φ growth in FLOPs).
Concretely, EfficientNet started from a small but efficient baseline model
(EfficientNet-B0), which itself was found via NAS (they used a mobile-sized
NAS search similar to how MnasNet was developed, focusing on depthwise
separable conv blocks with squeeze-and-excitation SE layers for efficiency).
Then they set α = 1.2, β = 1.1, γ = 1.15 (these numbers were determined via a small grid search under constraints) such that when φ increases, depth ≈ α^φ, width ≈ β^φ, and resolution ≈ γ^φ. By scaling up from B0 to B1, B2, ..., up to B7 using this compound rule, they obtained a family of models from small to large, each roughly optimal in accuracy for its model size.
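A minimal sketch of the compound rule (using the α, β, γ values quoted above; the baseline depth, width, and resolution numbers are illustrative assumptions, not the actual B0 configuration):

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution ratios from the text

def compound_scale(phi, base_layers=18, base_channels=32, base_resolution=224):
    # Scale all three factors together as phi grows.
    depth = round(base_layers * ALPHA ** phi)
    width = round(base_channels * BETA ** phi)
    resolution = round(base_resolution * GAMMA ** phi)
    return depth, width, resolution

for phi in range(8):  # roughly B0 through B7
    print(phi, compound_scale(phi))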
The results were impressive: EfficientNet models significantly outper-
formed other networks at the same level of compute. For instance, EfficientNet-B4 reached substantially higher accuracy than ResNet-50 at a roughly similar compute cost, and
the largest model EfficientNet-B7 (with about 66M parameters) attained
state-of-the-art ImageNet accuracy of ∼84.4% top-1 at the time, while
being an order of magnitude smaller and faster than previous best models.
To put it in perspective, one contemporary model called GPipe (an extremely
large NAS-based model with 557M parameters) had slightly lower accuracy
(84.3%) than EfficientNet-B7, despite B7 being 8× smaller and faster [87].
This demonstrated that a well-scaled mid-sized model can beat an unscaled
giant model.
EfficientNet’s approach has two key takeaways: First, when scaling up a
model, it is beneficial to balance multiple factors (depth, width, input size)
rather than just one. Intuitively, if you only make the network deeper but
keep it very narrow, it might become too bottlenecked at each layer; if you
only make it wider but not deeper, it might not have enough sequential layers
to abstract high-level concepts; if you use huge images but the network is too
small, it can’t take advantage of the extra detail. EfficientNet formalized one
way to achieve this balance. Second, using NAS to find a good starting block
or cell can complement scaling: EfficientNet didn’t search every model size,
it just searched for a good base architecture (B0) in a small regime, and then
scaled that up with a rule. This was much more computationally feasible
than doing a full NAS for a large model, and it gave excellent results.
Today, the EfficientNet paper and models are a standard reference for
how to do principled model scaling. Variants like EfficientNetV2 have further
improved aspects like training speed and used progressive learning (smaller
resolution to larger during training). But the core idea remains widely influ-
ential, even beyond vision: the notion of compound scaling can be seen in
some transformer scaling setups too (e.g., scaling width vs depth of trans-
former layers).
designed regime like RegNet might be slightly less optimal in accuracy per
FLOP, but win in actual efficiency and simplicity.
RegNet and EfficientNet families both illustrate the concept of model
families that span from small to large. Instead of designing a one-off model,
researchers now think in terms of scalable families: provide a recipe to get
a model at 50M FLOPs, 200M FLOPs, 1B FLOPs, etc., all with similar
design DNA. This is very useful in practice because one often needs models
of different sizes for different use cases (mobile vs server, etc.).
In summary, efficient scaling approaches aim to get the most out of each
parameter or each operation:
• NAS methods automate the search for good architectures. They have
found novel architectures, especially for smaller models, that can then
be scaled up. However, pure NAS is expensive and can yield compli-
cated designs.
Ultimately, these approaches share the goal of pushing accuracy higher with-
out simply throwing exponentially more resources – they try to use parameters
and compute in a smarter way. This has become increasingly important as
we reach scales where each new model can cost millions of dollars to train;
we want to ensure that such investments are as optimal as possible.
Transformers have become the dominant architecture for sequence models due to their parallelizability and stable training dynam-
ics (thanks in part to layer normalization and residual connections in every
layer). Over the past few years, we have seen an unprecedented growth in
the size of language models, leading to qualitatively new capabilities.
to a mixture-of-experts design). Each time, these models set new records on
language benchmarks and demonstrated increasingly sophisticated behavior
(e.g., better understanding of nuance, some reasoning ability, code genera-
tion, etc.).
One interesting observation made during this period is that larger models
often follow a scaling law: performance (e.g., measured in perplexity or
accuracy on some task) tends to improve as a power-law as we increase
model size, dataset size, and compute. Initially, studies (e.g., by Kaplan et
al., 2020 at OpenAI) suggested that model performance improves predictably
with more parameters if you also feed it enough data, and they extrapolated
that trend outwards. This provided a theoretical justification for building
bigger models: if you can afford 10x more compute, a 10x bigger model
(with appropriately more data) will reliably give better results, following a
log-linear trend.
However, simply making the model huge without adjusting other factors
can be suboptimal. A pivotal study from DeepMind in 2022, often referred
to by the codename Chinchilla [37], revisited these scaling laws by consid-
ering compute budget as the fundamental constraint. They asked: for a given
amount of compute (FLOPs spent in training), what is the optimal model
size and amount of training data? The surprising finding was that many
existing large models were actually undertrained in terms of data. For instance,
GPT-3 used 300 billion tokens for training; Chinchilla analysis suggested that
with the compute GPT-3 used, it should have used about 4 times more data
and a smaller model to get the best results. The rule of thumb they found
was to scale model size and training data in tandem — roughly, parameter
count should be proportional to the number of training tokens (specifically,
for every doubling of model parameters, also double the dataset size).
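As a back-of-the-envelope sketch (using the common approximations that training cost is about 6 · N · D FLOPs for N parameters and D tokens, and that roughly 20 tokens per parameter is compute-optimal; both numbers are rules of thumb, not exact figures from the paper):

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    # Solve C = 6 * N * (tokens_per_param * N) for the parameter count N.
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

params, tokens = compute_optimal_allocation(1e24)  # a hypothetical compute budget
print(f"~{params/1e9:.0f}B parameters, ~{tokens/1e12:.1f}T tokens")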
To prove this, they trained Chinchilla, a model with only 70B param-
eters (much smaller than GPT-3’s 175B), but on 1.4 trillion tokens (about
4.7x more data than GPT-3). Importantly, the total compute used was kept
similar to that of a larger model like Gopher (280B) or GPT-3. The result
was that Chinchilla outperformed Gopher (280B), GPT-3 (175B), and other
models on a wide range of language tasks [37]. In other words, if you have
e.g. X GPU-days to train a model, you’re better off with a moderately sized
model trained on lots of data than an extremely large model trained on a
limited data budget. This was a course-correction in the scaling narrative:
bigger is not better unless you also increase the training duration/data.
The Chinchilla finding has important implications. It suggests current
large models might not be fully utilizing their capacity because they haven’t
seen enough data to learn all that they could. For future projects, it advises
an optimal balance: don’t just blindly push parameter counts; also invest in
gathering more data or training for more steps. It also means that if one is
willing to train for longer, one could get away with a smaller (cheaper) model
with equal or better performance, which has downstream benefits like faster
inference and easier deployment.
• Scaling laws provide a rough roadmap, but careful tuning of data vs
model size is required to truly get optimal performance.
answer. For example, instead of directly asking the model a math word
problem and expecting an answer, we prompt it to first produce a step-
by-step solution. This effectively multiplies the amount of computation
(each step is another forward pass or another segment of output it must
produce) but greatly improves accuracy on tasks requiring reasoning.
The model is using more compute per query to think things through.
but it can dramatically improve accuracy and keep the model itself
smaller since it doesn’t have to memorize everything.
The concept of scaling test-time compute is closely tied to the idea of
reasoning. Instead of relying purely on what the static network weights
encode, the model can engage in a computation process (potentially involving
sequences of reasoning steps or multiple passes) to arrive at an answer. This
can sometimes compensate for not having an extremely large parametric
memory. For example, a 6-billion-parameter model with a well-implemented
reasoning strategy and multiple inference steps can potentially solve tasks
that a 6-billion parameter model in one-shot cannot, and might even rival a
larger 100B model on some complex tasks, by virtue of “thinking harder”.
One concrete demonstration is in mathematical problem solving. A big
model like GPT-3 (175B) might only solve a certain fraction of multi-step
math problems if it answers in one go. But a much smaller model that is
allowed to do scratch work via chain-of-thought and even check intermediate
results can solve a higher fraction of those problems, albeit taking a few passes
to do so. Essentially, compute is an alternative currency to parameters: you
can either pre-compute a lot (big training, big weights) or compute more on
the fly (reasoning steps) to achieve an outcome.
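A tiny illustration of the idea (the question and prompt wording are invented for the example; no particular model API is assumed):

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

direct_prompt = f"Q: {question}\nA:"
cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step and show the reasoning before the final answer."
)
# The chain-of-thought prompt spends more output tokens (more test-time compute)
# working through intermediate steps before committing to an answer.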
It’s worth noting that scaling test-time compute has its own challenges.
The model needs to be guided to use the extra compute effectively (hence
methods like chain-of-thought prompting explicitly tell it to output interme-
diate steps). If the model isn’t trained or prompted properly to do multi-step
reasoning, simply giving it a loop or more time might not help. There’s active
research in training models to plan, to self-reflect, and to use tools so that
when confronted with a new task, they can break it down into manageable
subtasks.
Also, more compute at inference means slower responses, which might be
a trade-off. In some applications, you can’t afford to have the model think
for a whole minute if the user expects an answer in one second. But if you
do have the luxury (like offline analysis, or non-real-time tasks), then you
can squeeze more quality out of the model by letting it churn longer on the
problem.
In summary, reasoning-based scaling at test time is an exciting com-
plement to the traditional parameter-based scaling:
• It allows even a fixed-size model to become more accurate by using
additional computation per input (like an ensemble of one model with
itself, or an internal dialogue).
that BatchNorm could enable training of networks that completely failed to
converge otherwise, and it often also improved final accuracy. It also had a
mild regularization effect (because each batch’s statistics add some noise),
often reducing the need for other regularizers like Dropout in convolutional
nets.
BatchNorm was a key ingredient in the success of VGG-like networks and
ResNets. For example, ResNets insert a BatchNorm after each convolution
(and before adding the residual) which keeps the residual addition stable.
Without BatchNorm, extremely deep ResNets might still have struggled. In
fact, after BatchNorm’s introduction, virtually all high-performance CNNs
incorporated it (or a variant) – it became almost implied when scaling depth.
However, BatchNorm has a limitation: it depends on batch statistics,
which means the behavior can be tricky during inference (when you typi-
cally switch to using accumulated moving averages of means/variances) and
it doesn’t work as well for very small batch sizes or certain tasks like recur-
rent sequence modeling. For such cases, other normalization methods were
developed:

• Layer Normalization (LayerNorm) [1] (Ba et al., 2016) normalizes across the neurons in a layer for each single example (rather than across examples). This doesn’t depend on other examples in a batch and is suitable for RNNs/Transformers. In Transformers, every sub-layer (attention or feed-forward) is preceded or followed by a LayerNorm. This keeps the scale of activations under control even as the model depth (number of transformer layers) grows. Without layer norm, training large Transformers might diverge or be very sensitive to the learning rate.

• Instance Normalization and Group Normalization (Wu & He, 2018), etc., are other variants that normalize over different dimensions, useful in specific contexts (instance norm for style transfer, group norm as a replacement for BN when batch sizes are small).

• More recently, RMSNorm (Root Mean Square Norm) and other normalization tweaks have been used, especially in very large language models (some LLMs use RMSNorm, which is like layer norm without the mean subtraction, to simplify things); a minimal sketch follows this list.
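A minimal RMSNorm sketch (normalizing over the last dimension with an assumed epsilon; not a copy of any particular library implementation):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Rescale by the root-mean-square of activations with a learned gain;
    # unlike LayerNorm, there is no mean subtraction.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * (x / rms)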
Normalization helps with scaling in another way: it prevents activation
magnitudes from blowing up or collapsing when networks get deeper or when
learning rates are high. For example, if you tried to stack 100 fully connected
layers without any normalization or special initialization, the activations and
gradients might either explode to infinity or shrink to zero by the time they
reach the end, due to multiplicative effects. Techniques like careful weight
initialization (e.g., Xavier/He initialization) can mitigate this to some extent,
but normalization actively keeps things in check throughout training.
In practice, when designing a scaled-up model, adding normalization layers is now a standard part of the recipe:

• In a CNN, we usually do Conv → BatchNorm → ReLU as a basic trio, repeated.

• In a Transformer block, we do LayerNorm → Attention → LayerNorm → FFN (plus residual adds in between); a minimal sketch of such a block follows this list.

• Even in very deep MLPs or other architectures, some form of normalization or scaled initialization is used to ensure gradient flow.
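A minimal sketch of such a pre-norm Transformer block (the embedding width, head count, and GELU feed-forward are illustrative choices, not a specific published configuration):

import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    def __init__(self, dim=512, heads=8, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual add around attention
        x = x + self.ffn(self.norm2(x))                    # residual add around the FFN
        return x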
It’s worth noting that normalization itself has some computational and memory overhead; at inference time, BatchNorm can be folded into the preceding linear layer (since it’s a linear operation when using fixed mean/var), so it’s not a big cost there. But the benefits during training far outweigh the slight cost.
There have been attempts to remove the need for batch normalization, for
instance self-normalizing networks (SELUs) or other normalization-free
networks. For example, there’s a concept of Normalization-Free ResNets
(Brock et al., 2021) that use careful initialization and activation scaling (and
sometimes smaller learning rates) to train deep nets without BatchNorm.
They did manage to train 1000-layer nets without BN. This is interesting
academically, but in most cases it’s just easier to use BN or LN.
In summary, normalization techniques like BatchNorm and LayerNorm
have been unsung heroes in enabling stable scaling:
• LayerNorm proved essential for Transformers, which are now the back-
bone of large language models.
5.7 Efficient Building Blocks for Scalable Models
Another aspect of model scaling is the design of more efficient layer build-
ing blocks that allow the creation of large networks without proportional
increases in computation. If each layer of a network can be made cheaper (in
terms of FLOPs or parameters) while still expressive, you can afford to have
more layers or a wider network under the same resource constraints. This is
particularly important for deploying models on limited hardware (like mobile
phones) or when trying to train very large models with fixed computing bud-
gets. We’ll discuss a few such building block innovations, notably depthwise
separable convolutions popularized by MobileNet [39], as well as others
like bottleneck layers and group convolutions.
The total number of weights in a depthwise separable conv is M × k² + M × N. Compare this to a standard convolution’s M × N × k². For typical values (say k = 3 and N on the order of M), the separable conv is much cheaper. For example, if M = N = 128 and k = 3, a standard conv uses 128 × 128 × 9 ≈ 147k weights, whereas a depthwise separable conv uses 128 × 9 + 128 × 128 ≈ 1.2k + 16.4k ≈ 17.6k weights, roughly 8 times smaller. Similar reductions occur in computation.
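A minimal PyTorch sketch of a depthwise separable convolution (kernel size and padding are illustrative):

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv.
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))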
This idea was actually used in some earlier architectures in part (like
Inception modules effectively used 1x1 convs to reduce channels then a spa-
tial conv). MobileNet (Howard et al., 2017) [39] was the architecture
that fully embraced depthwise separable convolutions to create an extremely
lightweight model for mobile devices. MobileNet v1 is basically a streamlined
CNN where every convolution is replaced by a depthwise conv + pointwise
conv pair (except the very first conv layer). This allowed the network to be
very deep and still very fast. The original MobileNet had 28 layers of convs
(depthwise + pointwise counted separately) and only 4.2 million parameters,
yet achieved around 70% top-1 accuracy on ImageNet. In contrast, a much
larger model like VGG-16 had 138 million parameters and about 74% accu-
racy. So MobileNet delivered reasonable accuracy at a tiny fraction of the
size by using an efficient building block.
MobileNet v1 also introduced a width multiplier hyperparameter α
that could scale down every layer’s channel counts by a factor (to trade
accuracy for even more speed if needed) and a resolution multiplier for input
image size. These gave developers flexibility to deploy smaller variants if 70%
accuracy was not needed and 60% could suffice for an even lighter model.
Following MobileNet v1, there was MobileNet v2 (Sandler et al., 2018),
which further refined the block by introducing inverted residual blocks
with linear bottlenecks. This sounds complex, but the idea was as follows (a rough sketch follows this list):

• Instead of doing a depthwise conv on a narrow set of channels (which might become a bottleneck for information), they first expand the number of channels with a 1 × 1 conv (say from M to t · M, where t is an expansion factor like 6), then do a depthwise conv on this larger space, then project down with a linear 1 × 1 conv back to a smaller number of channels (possibly even smaller than M, hence “bottleneck”).

• They also added a residual connection around each such block (if input and output dimensions were the same) — hence “inverted residual”, inverted because a traditional ResNet bottleneck first reduces then expands, whereas this block first expands then reduces back.

• The use of a linear activation at the final projection (no ReLU on the last layer of the block) was important to not destroy information during the bottleneck projection (ReLU could kill information when collapsing dimensions).
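A rough sketch of such a block (assuming stride 1 and equal input/output channels so the residual add applies; the batch-norm placement and ReLU6 follow common practice but are assumptions here):

import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),     # expand with 1x1 conv
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),           # depthwise 3x3 conv
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),     # linear 1x1 projection (no ReLU)
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # inverted residual connection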
MobileNet v2 achieved even higher accuracy (∼72% on ImageNet) with
similar or fewer operations than v1, making it one of the most efficient models
for its time. These MobileNets were instrumental for on-device AI, and they
also influenced larger model design by showing the effectiveness of depthwise
convs.
another pointwise group conv, with shuffling in between.

• Squeeze-and-Excitation (SE) blocks: These were introduced in the SENet model (Hu et al., 2018). SE blocks are not about reducing the computation of a single conv, but about adding a tiny neural network that does channel-wise attention. Specifically, an SE block takes the output of some layers, pools it to a vector (of length equal to the number of channels), passes it through a small bottleneck MLP and sigmoid to produce weights for each channel, and multiplies those weights into the channels (re-weighting them). This is a lightweight way for the network to recalibrate channel importance and yielded significant accuracy improvements (roughly +1-2% ImageNet accuracy) for a very minor cost (maybe 0.5-2% extra compute); a minimal sketch follows this list. EfficientNet and many other models incorporated SE blocks because they improve efficiency (accuracy per parameter) even though they add a few parameters.

• Transformer efficient blocks: In the transformer world, efficient building blocks mean things like optimized attention mechanisms (like sparse attention patterns for long sequences, or replacing softmax attention with linear attention approximations for better scaling). For example, the Performer or Linformer try to reduce the O(n²) cost of attention for long sequence length n. That’s another kind of efficiency, which is about scaling to longer inputs rather than scaling model size.
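A minimal SE-block sketch (the reduction ratio is an illustrative choice):

import torch.nn as nn

class SqueezeExcitation(nn.Module):
    # Pool to a per-channel vector, squeeze through a small MLP, and re-weight channels.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # channel-wise re-weighting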
In summary, efficient building blocks allow networks to scale to either
larger depths or to work within constrained environments:
• Attention to efficiency at the micro level (like SE blocks adding a big ac-
curacy boost for a few extra ops) yields models that dominate accuracy-
vs-compute benchmarks.
(like to more layers). Thus, model scaling isn’t just about macro architecture
(how many layers) but also about the micro operations chosen for each layer.
hold different parts of the model. For example, if you have a 100-layer
network, you might put 50 layers on one GPU and the next 50 on an-
other; or for a single giant layer (like a huge fully-connected layer with
a massive weight matrix), you might split the neurons between GPUs.
During a forward pass, the data has to move from one GPU to the
next for each layer (like pipelining through the layers), or if splitting
within a layer, intermediate results need to be shared. Model paral-
lelism is more complex because it requires partitioning the computation
graph and managing communication of activations and gradients be-
tween devices. It’s typically not as efficient as data parallelism due to
communication overhead, but it’s indispensable for ultra-large models.
For example, GPT-3 (175B) model weights could not fit in one GPU memory (which might be 16 GB for a V100 or 40 GB for an A100, while GPT-3 175B in half precision would be >300 GB), so it must be sharded
across many GPUs. One scheme is tensor slicing: break big matri-
ces so each GPU stores a slice and compute collectively. Another is
pipeline parallelism: assign contiguous layers to different GPUs and
pass micro-batches sequentially through them (this keeps each GPU
busy with different samples in different pipeline stages).
• Mixed Parallelism: In practice, large-scale training uses a combina-
tion. For instance, one might use data parallelism across nodes, and
within each node use model parallelism to split a large model. Or
use pipeline parallelism combined with data parallel groups to strike a
balance. The combination is often necessary to scale to many devices
without hitting network bandwidth limits or memory limits.
Managing all this complexity led to the development of software frameworks and libraries:

• Horovod (Uber) was an early library to ease data-parallel training across many GPUs by abstracting the all-reduce communications.

• PyTorch Distributed Data Parallel (DDP) is now a built-in way to do data-parallel training in PyTorch efficiently with minimal code changes (a minimal sketch follows this list).

• TensorFlow’s Distribution strategies similarly handle multi-GPU or multi-node training.
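A minimal data-parallel sketch with PyTorch DDP (the toy linear model and random data stand in for a real setup; it assumes a launch via torchrun, which sets the RANK/LOCAL_RANK/WORLD_SIZE environment variables, and NVIDIA GPUs with the NCCL backend):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 10).cuda()           # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across workers

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # DDP overlaps the gradient all-reduce with the backward pass
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()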
However, as model parallelism and memory sharding became more important, more specialized libraries emerged:

• Mesh TensorFlow (Google) and Megatron-LM (NVIDIA) provided patterns to split transformer models across many GPUs, handling the details of splitting matrices for multi-GPU operations (like splitting the heads of multi-head attention across GPUs, etc.).

• DeepSpeed (Microsoft) [71] and FairScale (Facebook) introduced the concept of ZeRO (Zero Redundancy Optimizer) and Fully Sharded Data Parallel (FSDP). The idea of ZeRO is to reduce memory usage in data parallelism by not replicating all the training states on each GPU. In normal data parallelism, if you have K GPUs, you have K copies of the model and optimizer states (gradients, momentum, etc.), which is wasteful. ZeRO partitions these states across GPUs so each GPU might only store 1/K of the gradients, 1/K of the optimizer moments, etc., while each GPU still has the full model for forward/backward. This sharding allows, say, 8 GPUs each storing 1/8 of the optimizer states, thus collectively handling a model 8x larger than one GPU could with the same memory. DeepSpeed and similar systems implement this transparently, along with offloading parts of the model not in active use to CPU memory or NVMe, gradient checkpointing (trading compute for memory by not storing some intermediates), etc.

• Pipeline parallel frameworks: DeepSpeed and others also allow defining pipeline stages easily and manage the scheduling (like the 1F1B algorithm for efficient pipeline utilization). Pipeline parallelism slices the mini-batch into micro-batches and overlaps the computation of different micro-batches on different stages, to keep all GPUs busy most of the time.
Through these tools, researchers trained models with tens or hundreds
of billions of parameters. For example, the 175B GPT-3 was trained using
model parallelism (sharding each matrix across multiple GPUs) combined
with data parallel across many nodes. The specifics: they might have used
8-way model parallel per model, and 128-way data parallel (just as an illus-
trative breakdown). The end result is as if training one giant model on one
giant “virtual GPU” that is an aggregate of 1024 physical GPUs.
Another aspect of distributed training is communication and synchroniza-
tion. There is an overhead when GPUs sync gradients or exchange activa-
tions. Techniques like gradient compression or lazy communication can
reduce overhead (e.g., quantize gradients before sending, or overlap communi-
cation with computation). Also, using faster network hardware (InfiniBand,
NVLink, NVSwitch) is crucial for keeping scaling efficient. At large scale, one
can measure how training speed scales with number of GPUs: ideally linear
(100 GPUs = 100x speed of 1 GPU), but often sub-linear due to overhead.
Finally, training large models often requires careful learning rate schedul-
ing and batch size tuning. With data parallel, if you increase total batch
size (say 32 GPUs each with batch 32, so total batch 1024), you often need
to adjust the learning rate (linear scaling rule: multiply LR by number of
GPUs, sometimes with a small warmup). Too large a batch can hurt general-
ization, so there’s active research on how far you can scale batch size without
losing accuracy, or how to adjust optimization hyperparameters accordingly.
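A small sketch of the linear scaling rule with warmup (the base learning rate, batch sizes, and warmup length are illustrative assumptions):

def scaled_lr(step, base_lr=0.1, base_batch=256, total_batch=1024, warmup_steps=500):
    target_lr = base_lr * (total_batch / base_batch)  # linear scaling rule
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps  # ramp up gradually at the start
    return target_lr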
In short, distributed training is the backbone of modern deep learning at
scale:
GPT-3. As another data point, training PaLM (540B parameters) likely
used even more compute (Google hasn’t disclosed exact cost, but one can
imagine it’s higher). This puts such efforts out of reach for most academic
labs and startups, raising concerns about the democratization of AI research.
If only giant tech companies can afford to train the most powerful models,
progress could become more siloed. However, the open-source community
and some academic consortiums are working on reproducing large models
collaboratively (e.g., the EleutherAI group reproduced GPT-like models at
smaller scales).
Environmentally, the energy consumption and resulting carbon footprint
of large-scale training runs are substantial. A well-cited study by Strubell et al.
(2019) [83] highlighted that training a big NLP model with hyperparameter
tuning could emit on the order of hundreds of thousands of pounds
of CO2 . Specifically, one experiment they analyzed (a Transformer trained
with neural architecture search) was estimated to emit 626,000 pounds of
CO2 (approximately 284 metric tons), which they noted is roughly five times
the lifetime emissions of an average car. Even training a single large model
without extensive tuning can use as much electricity as several households
would use in a year. And these numbers have likely grown with the size of
models in 2020-2023.
These environmental costs have sparked a movement towards Green AI,
which calls for more focus on efficiency and for reporting the compute/energy
used in research publications. Researchers are encouraged to consider the
computational cost vs. benefit of model improvements. For example, if a
new model achieves 1% higher accuracy but requires 10x more computation,
is it worth it? Could we find a more efficient way to get that improvement?
There’s also the perspective of diminishing returns. Often, scaling up
yields smaller and smaller improvements: the first 10x increase in model size
might give a huge jump in performance, but the next 10x might only give
a marginal gain. At some point, the gain might not justify the cost. For
instance, going from a 1B to a 10B parameter model might yield a big boost,
but going from 100B to 1T might yield relatively less new capability (unless
new emergent behaviors appear, which is uncertain). Understanding where
these inflection points are is important for decision-making. The Chinchilla
result we discussed is an example where bigger was not better because re-
sources were misallocated; it showed a way to be more compute-efficient by
balancing data and model size.
Another trade-off is inference cost. A model that’s huge not only costs
a lot to train, but also to deploy: running GPT-3 (175B) for a single user
query can require multiple GPU seconds of compute. If you have millions of
users, that quickly becomes untenable. This is why companies often distill or
compress large models for deployment, or why they invest in super-efficient
serving infrastructure. There’s a direct cost (in electricity and hardware
wear) for each inference as well.
However, we also see that investing in a large model can sometimes re-
duce costs in other ways: for example, a powerful model might handle many
tasks (reducing the need to train separate specialized models for each task).
There’s a notion of model reuse and foundation models – train one giant
model and use it for many purposes. The computational cost is front-loaded
in training, and then you get a general model. This could be efficient at a
societal level if managed well (instead of everyone training their own medium
model from scratch, they fine-tune a shared giant model). But it also con-
centrates the cost at the initial training.
From an educational perspective: when you plan an AI project, you
should consider if you really need the largest model, or if a smaller effi-
cient model can solve the problem. There’s often an elegance in achieving
the same result with less. A well-known phrase by Google researchers is
“the best model is the one that is the most efficient while meeting the task
requirements.”
Research is actively addressing these trade-offs:

• Techniques like model pruning, quantization, and distillation (discussed in the next section) aim to reduce model size and compute while preserving performance.

• Algorithms like gradient checkpointing can reduce memory (thus enabling fewer GPUs to train a model, albeit with more computation).

• New training methods (like retrieval-based training that doesn’t have to store everything in weights, or using smaller models with external memory) might bypass the need for gargantuan monolithic networks.

• Use of renewable energy and more efficient hardware can mitigate the carbon footprint. For example, some companies schedule training for times when renewable electricity is abundant, or site their data centers in regions with clean energy.

• Finally, there’s interest in algorithms that could make use of large models more sample-efficiently or allow training to converge faster so we don’t waste as much energy on trial-and-error.
In summary, while model scaling has delivered incredible results, it comes
with heavy computational and environmental costs:
• Training large models consumes a lot of electricity and hardware re-
sources, sometimes only accessible to large institutions.
• Researchers must weigh if the accuracy gains justify the resource usage,
and seek innovative ways to get more out of less.
Ultimately, the goal is to keep advancing AI in a way that is not only effective
but also sustainable and broadly accessible.
speedups. A single modern GPU can be tens or hundreds of times faster
than a CPU for training a neural net.
As deep learning took off, GPU manufacturers like NVIDIA started opti-
mizing their hardware specifically for AI. For example, they introduced Ten-
sor Cores (first in the Volta architecture, around 2017) which are units that
perform matrix multiply–accumulate operations very fast at lower precision
(like FP16 or BF16). These cores are tailored for the kind of computations
in training deep nets, and using them can provide another 5-10x speed boost
compared to using normal GPU cores, albeit requiring using mixed-precision
training (which is now standard; it trains faster and uses less memory, with
no loss in model quality in most cases).
Meanwhile, Google developed the Tensor Processing Unit (TPU)
[46], an ASIC (application-specific integrated circuit) specifically for neural
network workloads. TPUs were deployed in Google’s datacenters starting
mid-2010s. They excel at matrix operations and are used both for training
(Cloud TPU) and for inference (Edge TPU for smaller devices, for instance).
One advantage of TPUs is that they can be built into very large pods with
fast interconnect, allowing Google to train very large models (such as the 11B-parameter T5 in 2019; more recently, the 540B-parameter PaLM was trained on TPU v4 pods).
Other companies/efforts have produced accelerators:

• FPGAs (Field Programmable Gate Arrays) can be configured to run neural nets efficiently and are sometimes used in low-latency environments (like high-frequency trading with AI, because FPGAs can get very low latency).

• ASICs from startups: There are Graphcore’s IPU, Cerebras’s Wafer-Scale Engine (which is essentially a huge chip containing many cores for deep learning), Habana Labs’ Gaudi (now owned by Intel), etc. These often promise either more speed, better energy efficiency, or memory advantages.

• Neuromorphic chips: Though not mainstream for deep learning, some research chips model brain-like spiking neural nets (IBM TrueNorth, Intel Loihi) which are very power efficient for certain tasks, but they’re a bit outside the typical deep learning deployment.
Even at smaller scale, modern smartphones come with NPUs or DSPs
optimized for AI. Apple’s A-series chips have a “Neural Engine”, Qualcomm
has the Hexagon DSP that accelerates neural nets, etc. They enable running
moderately complex models on-device in real time (think of face recognition
in the camera app, or voice assistants). These mobile accelerators are why
we can have things like real-time video filters or AR effects using neural nets
on a phone without killing the battery immediately.
In essence, hardware has co-evolved with model scaling: bigger models
needed better hardware, and better hardware enabled even bigger models.
Without GPUs and TPUs, we simply could not have trained models like
ResNet-152 or GPT-3 in any reasonable timeframe.
quantized inference of ResNet-50 can be 2-3x faster than float16 infer-
ence on CPUs. Even more extreme, there are research works on 4-bit
and 2-bit networks, or even binary neural networks (1-bit weights).
Those often see larger accuracy hits, but for some applications they
can work. Modern toolkits such as TensorFlow Lite and PyTorch Mobile all have quantization support to help deploy models in low precision.
you can often batch multiple inputs together to make use of the hard-
ware more efficiently. GPUs, for instance, are great at throughput
if you provide a large batch, although that introduces some latency.
Balancing batch size for throughput vs. latency is an engineering con-
cern. But for offline inference (processing large datasets), batching can
drastically reduce total compute time by amortizing overheads.
also specialized hardware like content-addressable memory for fast retrieval
that could integrate with neural nets in future.
To summarize this section:
MobileNet) made it possible to use parameters more effectively. Neu-
ral architecture search and compound scaling gave systematic ways to
scale models without wasting computation.
What might the future hold for model scaling? On one hand, we might
continue to see growth in model size, especially as organizations compete
to build more capable general AI systems. It’s possible we’ll see trillion-
parameter models become more common (some already exist in sparse form).
With improved algorithms, those could be trained efficiently (for example,
using mixture-of-experts to activate only parts of the model per input, so
not all trillion parameters are used every time).
We may also discover new paradigms that break the current scaling
mold. For instance, there’s interest in sparsity and conditional compu-
tation: instead of a monolithic dense model, have a very large network but
only a small subset is used for any given task or input (expert networks,
dynamic routing, etc.). This way, capacity grows without linear growth in
computation. Another direction is neurosymbolic or hybrid systems
that combine neural networks with explicit reasoning or search components,
which ties into our discussion on test-time compute. These could achieve
better performance without needing exponentially more parameters.
The role of data is also becoming more prominent. If truly massive
models are under-trained, one bottleneck might be high-quality data. We
may see efforts to create larger and more diverse training corpora, or synthetic
data generation to feed these models.
From a research perspective, a fascinating question is: How far can scaling
take us? Some in the field (inspired by results like GPT-3) suspect that
simply making models bigger and training on more data will eventually yield
very powerful general AI. Others believe algorithmic advances will be needed
beyond a point, because some aspects of intelligence might not emerge just
from scale. The likely reality is a combination: scaling will continue to
produce gains, but we’ll also augment models with new ideas to make them
more efficient and robust.
For you, as future practitioners or researchers, understanding model scal-
ing gives you a powerful lens. If your model isn’t performing well enough,
consider if making it larger or giving it more data might help, and know the
techniques to do so effectively (like adding normalization or using a known
scalable architecture). Conversely, if you need to deploy a model, know how
to compress and optimize it, and consider if you can achieve the same with
a smaller model for that context.
In conclusion, model scaling has been a driving force in the progress of
deep learning. We’ve gone from relatively shallow nets that could recognize
handwritten digits, to networks tens of layers deep mastering ImageNet, to
networks so large they can write coherent essays or have conversational abil-
ity. Each jump required not just more computation, but also ingenuity in
design to make that computation count. As we continue to scale, issues of ef-
ficiency, cost, and sustainability will be paramount, but so will the potential
for truly remarkable AI capabilities. The journey of scaling is not just about
making things bigger; it’s about learning how to grow our models wisely and
responsibly to reach new heights of performance.
Chapter 6
Model Compression and Cost Optimization
6.1 Introduction
As AI models become increasingly powerful, their computational and storage
costs have grown dramatically. For example, the language model GPT-3 has
175 billion parameters, requiring hundreds of gigabytes of memory and mas-
sive compute power to train and run [6]. Such large models are challenging
to deploy on everyday devices or within limited data centers. Model com-
pression and cost optimization techniques address this challenge by making
models smaller, faster, and more efficient, without significantly sacrificing
accuracy. These methods are crucial for:
experiment with advanced models without requiring prohibitive com-
putational resources.
In the following sections, we will explore several key techniques for model
compression and efficiency. Each section introduces a concept in accessible
terms and highlights why it matters, along with references to seminal works
or widely-used methods in that area.
not rely too heavily on any one neuron, thereby reducing overfitting. While
the primary purpose of dropout is to improve generalization, one useful side
effect is that it encourages the network to develop redundant, distributed
representations of features. In practice, this means the network’s effective
capacity is used more efficiently, and at inference time (when we typically
remove dropout), we can often get away with smaller models or sparser ac-
tivations without losing much accuracy.
Beyond dropout, research has shown that neural networks often contain much smaller sparse sub-networks that can be trained to achieve performance comparable to the full model. This is sometimes referred to as the “lottery ticket hypothesis,” which suggests that within a large randomly-initialized network, there exist sparse winning sub-networks that, when trained in isolation, can reach nearly the original accuracy [21]. This
finding implies that many weights in large models are redundant. If we can
identify and extract these efficient sub-networks, we have an opportunity to
drastically reduce model size and computation.
In summary, techniques that promote sparsity (whether through regular-
ization like dropout or through explicit discovery of sparse structures) hint
that we can slim down models. They lay the groundwork for methods like
pruning, which we discuss next, by indicating which parts of a network are
less necessary.
remaining connections to fine-tune the model. The result is a much sparser network. In a follow-up work, Deep Compression, Han and colleagues combined pruning with quantization and even Huffman coding to compress neural networks by an order of magnitude or more [30].
Pruned models are especially useful for deployment on resource-limited
hardware. With far fewer active weights, these models can execute faster
on CPUs and GPUs (since there is less work to do) and can be stored in
smaller flash or disk storage. Pruning also benefits energy efficiency: skip-
ping unnecessary calculations means less power is consumed. As an added
benefit, some hardware and libraries can exploit sparsity by only computing
the needed operations.
In practice, pruning can be applied at different levels of granularity:
• Weight pruning: Remove individual connections (weights) that are
low in magnitude.
• Neuron or filter pruning: Remove entire neurons or convolutional
filters that have little impact (this results in smaller layers and can
directly reduce computation in those layers).
• Structured pruning: Remove structured parts of the network (like
whole channels or attention heads) so that the resulting network can
still be efficiently implemented without irregular memory access pat-
terns.
After pruning, the model is usually fine-tuned or retrained, allowing the
remaining weights to adjust and sometimes recover any lost accuracy.
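A small sketch of magnitude pruning with PyTorch’s built-in pruning utilities (the toy model and the 50% sparsity level are illustrative):

import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 50% of weights with the smallest absolute value.
        prune.l1_unstructured(module, name="weight", amount=0.5)

# ... fine-tune the sparse model here, then make the pruning permanent:
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")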
but only use a small fraction of them for each data sample, thereby not
dramatically increasing the computation required for each inference [77]. The
gating network learns to route each input to the most appropriate experts.
Because only a few experts process any given input, the effective computation
(and cost) per input remains manageable, even though the total parameter
count is very high.
From a cost-optimization perspective, MoEs are appealing because they
offer a way to scale model size (and therefore potential accuracy or capacity)
without a proportional scaling of computational cost. This is a form of
conditional computation: the model “decides” where to allocate resources for
each example. In deployed systems, MoEs could be used to save computation
by, for instance, only running a complex portion of a model when needed.
If certain inputs are easy, the gate might route them to a simple expert,
whereas only challenging inputs invoke the full capacity of larger experts.
It’s worth noting that while MoEs reduce the average computation, they
introduce complexity in training and load-balancing the use of experts. How-
ever, as shown by recent large-scale implementations of MoE in natural lan-
guage processing, these challenges can be managed to achieve state-of-the-art
results efficiently [77].
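A toy sketch of a sparse mixture-of-experts layer with top-2 gating (the sizes, the simple routing loop, and the absence of load-balancing losses are all simplifications for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # scores each expert for each input
        self.top_k = top_k

    def forward(self, x):                        # x: [batch, dim]
        scores = self.gate(x)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # only the selected experts run per input
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out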
Another method, called LoRA (Low-Rank Adaptation), injects trainable low-rank matrices into each layer of the model [40]. During fine-tuning, only these low-rank matrices are updated. LoRA has shown that it can match the performance of full fine-tuning while training a tiny fraction of the parameters [40]. This again means huge memory savings, especially when one model must be fine-tuned to many tasks.
Parameter-efficient fine-tuning is important for multi-task and continual
learning scenarios. If you have a single big model that needs to serve dozens
of different tasks or domains, using these techniques allows you to adapt to
each task without the overhead of a full model copy. It also often speeds up
training (fewer parameters to update) and reduces the risk of catastrophic
forgetting by limiting how much of the model is altered for each new task.
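A rough sketch of a LoRA-style adapter wrapped around a linear layer (the rank, scaling, and initialization choices here are illustrative assumptions):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # The pretrained weight is frozen; only the low-rank A/B matrices are trained.
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # The adapter starts as a zero update (B is zero-initialized) and is learned during fine-tuning.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)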
computations line up with these units (for example, using matrix sizes
that are multiples of certain values) can improve efficiency.
• 8-bit integers (INT8) for inference: After training a model in float32,
one can convert weights (and even activations) to 8-bit integers for
inference. This shrinks the model size by 4x and often allows using fast
integer math pipelines in hardware.
• During each training step, cast weights to 16-bit and compute forward
pass and gradients in 16-bit.
• Compute weight updates (like gradient accumulations) in 32-bit to pre-
serve precision, then update the 32-bit master weights.
By doing this, one can often get nearly a 2x speedup in training and reduce
memory usage, all while maintaining model quality [60]. Mixed precision
has quickly become a standard in training large models because it provides
significant efficiency gains with very little downside.
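A minimal mixed-precision training loop with PyTorch’s automatic mixed precision (a toy model and random data stand in for a real setup; a CUDA device is assumed):

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()  # keeps fp32 master state and rescales fp16 gradients

for step in range(100):
    x = torch.randn(64, 1024, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    with autocast():                      # forward math runs in float16 where it is safe
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscale gradients, then update fp32 weights
    scaler.update()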
In summary, mixed precision training exploits the insight that not all
calculations need full 32-bit precision. Many parts of neural network com-
putation are tolerant to lower precision, as long as critical parts (like weight
updates) retain accuracy. This technique directly cuts down training time
and resource usage, enabling us to train bigger models or train models faster
on existing hardware.
6.10 Quantization
Quantization is the process of converting a neural network’s parameters and
computations from high precision (e.g., 32-bit float) to lower precision (e.g.,
8-bit integer). Unlike the mixed precision approach which still kept some
high precision around, quantization often focuses on making a model purely
low-bit for inference after it has been trained. The main goal is efficiency:
an 8-bit model uses a quarter of the memory of a 32-bit model, and integer
arithmetic can be executed much faster on many processors (especially those
with specialized DSP or vector units).
A simple form is post-training quantization: after training a model with full precision, you convert the weights to 8-bit. In many cases, you also quantize activations (the intermediate results) to 8-bit as the model runs.
Modern libraries and frameworks have support for this, often with minimal
changes needed from the user. The challenge is ensuring the quantized model
still performs well. Some networks can lose accuracy when naively quantized
because the 8-bit approximation of their weights/activations is not exact.
One influential work by Jacob et al. (2018) described a quantization
scheme used for efficient inference on mobile CPUs, where both weights and
activations are int8, and the model achieves nearly the same accuracy as the
float version [44]. The approach involves choosing appropriate scaling factors
for each layer so that 8-bit values cover the range of the 32-bit tensors as
effectively as possible. For example, if a layer’s weights range from -2.5 to
2.5 in float32, you might map that to the -128 to 127 range of int8. During
inference, operations are carried out in integer math, and results are scaled
back to normal ranges at the end.
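A small sketch of symmetric post-training quantization of a single weight tensor (a simplified scheme; real toolchains use per-layer or per-channel scales and zero points):

import torch

def quantize_int8(w):
    scale = w.abs().max() / 127.0                           # map the observed range onto int8
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.empty(128, 128).uniform_(-2.5, 2.5)               # toy weights in [-2.5, 2.5]
q, scale = quantize_int8(w)
print("max reconstruction error:", (dequantize(q, scale) - w).abs().max().item())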
Quantization can provide tremendous speed-ups. Many CPUs and accel-
erators have special instructions for 8-bit matrix multiplication which runs
much faster than 32-bit math. Also, the reduced memory bandwidth is a
big win: reading 8-bit values from memory is 4 times faster than reading 32-
bit values, which often is a bottleneck. Thus, quantization not only shrinks
model size on disk, but often yields real-time speed improvements and energy
savings during inference.
However, aggressive quantization (like going below 8 bits) can hurt accu-
racy significantly if not done carefully. That’s where techniques like quantization-aware training come in. Quantization-aware training typically involves:
• Starting with a pre-trained float model (or training from scratch with
quantization awareness).
• During each forward pass, before computing each layer’s output, insert
a simulated quantization step (rounding to 8-bit representation).
• Compute the loss as usual and do backpropagation. The gradients
flow through these simulated quantization operations (often using a
straight-through estimator for the non-differentiable rounding).
• The model learns to tolerate or correct for the quantization. For in-
stance, weight values might shift slightly to better align with repre-
sentable 8-bit numbers.
After such training, when the weights are finally quantized and the model
is deployed in 8-bit, the accuracy is typically much closer to the original
full-precision model compared to naive post-training quantization. This is
especially useful for networks that are sensitive to quantization (e.g., those
with very small or very large weight distributions, or very tight tolerance in
certain layers like last-layer classifiers).
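A minimal sketch of the simulated-quantization (“fake quant”) step with a straight-through estimator (symmetric int8 and a per-tensor scale are simplifying assumptions):

import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale                  # simulate int8 rounding in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # straight-through: pass gradients unchanged

def fake_quant(x):
    scale = x.detach().abs().max() / 127.0
    return FakeQuantSTE.apply(x, scale)

w = torch.randn(64, 64, requires_grad=True)
loss = fake_quant(w).pow(2).mean()
loss.backward()                           # gradients still flow despite the rounding step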
Quantization-aware training requires more effort during the training phase
and sometimes a specialized training pipeline, but it pays off by enabling the
efficiency of quantized models without sacrificing predictive performance. It’s
a critical technique for pushing models to low bit-widths in scenarios where
every bit and every joule of energy matters (like on-device speech recognition
or vision on smartphones).
• PyTorch Mobile – a set of tools within PyTorch to package models for
mobile, including support for quantized models.
Conclusion
Model compression and cost optimization techniques are essential for bring-
ing the power of AI to practical applications. By distilling knowledge, prun-
ing unneeded parts, using clever architectures like mixtures of experts, tuning
only what we need, leveraging hardware capabilities, and reducing numerical
precision, we can create models that are both powerful and efficient. This
means AI systems can be deployed more widely — from cloud servers to tiny
edge devices — and operate under real-world constraints.
The continued development of these techniques, along with supportive
frameworks and hardware advances, will allow us to scale AI models further
while keeping them affordable and energy-efficient. As you explore these
methods, remember that the goal is to achieve the right balance between
model complexity and operational efficiency for your specific task and con-
straints.
Chapter 7
Manifold Learning
2. Connect loss functions to probability and likelihood, building to-
ward variational autoencoders (VAEs) that allow sampling from
learned manifolds.
• Feature Extraction: The encoder can serve as a learned feature ex-
tractor for classification or other tasks.
7.3.1 L1 Loss (Mean Absolute Error)
L1(x, x′) = ‖x − x′‖₁ = Σ_j |x_j − x′_j|.
Commonly used when xj ∈ {0, 1} or [0, 1]; interprets x′j as a Bernoulli pa-
rameter. Especially popular in classification or binary data reconstructions.
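For reference, the L1 reconstruction loss above is a one-liner in PyTorch (a sketch, assuming x and x_recon are tensors of the same shape):

import torch.nn.functional as F

loss = F.l1_loss(x_recon, x)   # mean absolute error over all elements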
When we have a simple parametric form p_θ(x), maximizing Σ_i log p_θ(x^(i)) is straightforward. But with complex latent-variable models like autoencoders or neural nets, it is not always trivial to compute or differentiate
log pθ (x). This leads to variational methods and adversarial approaches
we’ll see later.
In practice:
• Directly modeling pθ (x) is what VAEs and flows attempt; GANs im-
plicitly learn it by generating samples, and we measure sample realism
with a discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder layers (assumed here to mirror the decoder below): 64x64x3 -> 8x8x64
        self.conv1 = nn.Conv2d(3, 16, 4, 2, 1)
        self.conv2 = nn.Conv2d(16, 32, 4, 2, 1)
        self.conv3 = nn.Conv2d(32, 64, 4, 2, 1)
        self.fc = nn.Linear(8*8*64, latent_dim)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        z = self.fc(x)
        return z

class ConvDecoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 8*8*64)
        self.deconv1 = nn.ConvTranspose2d(64, 32, 4, 2, 1)
        self.deconv2 = nn.ConvTranspose2d(32, 16, 4, 2, 1)
        self.deconv3 = nn.ConvTranspose2d(16, 3, 4, 2, 1)

    def forward(self, z):
        x = F.relu(self.fc(z))
        x = x.view(-1, 64, 8, 8)
        x = F.relu(self.deconv1(x))
        x = F.relu(self.deconv2(x))
        x = torch.sigmoid(self.deconv3(x))
        return x
In training, we feed images [batch, 3, 64, 64] to the encoder, get a latent z, and decode back to reconstruct the image. Convolutional AEs excel at reconstructing images and learning reusable features. But we still have no direct way to sample new images, because the distribution of z in the latent space is unknown. This leads us to probabilistic autoencoders.
x̃ = Corrupt(x),   x′ = g_θ(f_φ(x̃)),
and we minimize L(x, x′ ) where x′ tries to match the original, uncorrupted
x.
for x in data_loader:
    noise = torch.randn_like(x) * 0.1
    x_noisy = x + noise
    h = encoder(x_noisy)
    x_recon = decoder(h)
    loss = ((x_recon - x)**2).mean()
    ...
1. Good reconstruction (log p_θ(x|z)).
2. A latent posterior that stays close to the prior (the KL-divergence term).
However, VAEs sometimes produce blurry samples due to the pixel-wise like-
lihood objective. We’ll see alternative approaches (GANs) or improvements
(e.g., Vector Quantization) to mitigate this.
7.8 Vector Quantization and VQ-VAE
VQ-VAE [89] replaces continuous latents with discrete codes drawn from a
learned codebook {e1 , . . . , eK }.
1. Encoder outputs a continuous h.
2. Quantization: find the nearest code ek in the codebook. The latent is
ek (discrete index k).
3. Decoder reconstructs from ek .
Gradients bypass the non-differentiable nearest-neighbor step via a straight-through (stop-gradient) trick. The loss has three parts:
Recon loss + ‖sg[h] − e_k‖² + β‖h − sg[e_k]‖².
The middle term pulls the codebook vectors toward the encoder outputs, while the β-weighted commitment term keeps the encoder outputs close to their chosen codes.
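A minimal sketch of the quantization step and these losses (assuming h is the encoder output of shape [batch, d] and codebook is a learnable [K, d] matrix):

import torch

def vector_quantize(h, codebook, beta=0.25):
    # Find the nearest codebook entry for each encoder output
    distances = torch.cdist(h, codebook)          # [batch, K]
    indices = distances.argmin(dim=1)
    e_k = codebook[indices]                       # quantized latents
    # Straight-through: forward pass uses e_k, gradients flow back to h unchanged
    z_q = h + (e_k - h).detach()
    codebook_loss = ((h.detach() - e_k) ** 2).mean()
    commitment_loss = beta * ((h - e_k.detach()) ** 2).mean()
    return z_q, indices, codebook_loss + commitment_loss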
Why discrete latents?
• Natural for compression or symbolic domains (speech, text).
• Can train a separate discrete prior model (like PixelCNN) over latent
indices, enabling powerful generation.
• Often avoids posterior collapse seen in continuous VAEs.
7.9.1 GAN Training Loop (Pseudocode)
for real_data in data_loader:
    # Update D
    z = torch.randn(batch_size, latent_dim)
    fake_data = G(z).detach()
    d_loss = -( torch.log(D(real_data) + eps).mean()
              + torch.log(1 - D(fake_data) + eps).mean() )
    optimizerD.zero_grad(); d_loss.backward(); optimizerD.step()

    # Update G
    z = torch.randn(batch_size, latent_dim)
    gen_data = G(z)
    g_loss = -torch.log(D(gen_data) + eps).mean()   # non-saturating
    optimizerG.zero_grad(); g_loss.backward(); optimizerG.step()
GANs can generate extremely realistic images but are trickier to train:
• Mode collapse: generator repeatedly produces similar outputs.
• No explicit latent inference: we cannot easily get z for a given real x.
• Hyperparameter sensitivity: balancing G and D is delicate.
7.10.3 Pix2Pix
[43] Uses CGAN for image-to-image translation with paired data: (A, B)
pairs. The generator learns A → B, and the discriminator sees (A, B) vs.
(A, B̂). Adding an L1 or L2 term ensures overall structure is preserved, while
the GAN loss provides sharp details.
7.10.4 CycleGAN
[95] Handles unpaired translation across domains A and B by training two
generators G : A → B and F : B → A plus two discriminators. Enforces
cycle consistency: F (G(a)) ≈ a and G(F (b)) ≈ b. This allows, e.g.,
turning horse images into zebra images without one-to-one pairs.
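A sketch of the cycle-consistency term, assuming G and F are the two generators, a and b are batches from domains A and B, and gan_loss and lambda_cyc stand in for the adversarial term and its weighting:

# L1 cycle consistency: translating to the other domain and back should recover the input
cycle_loss = (F(G(a)) - a).abs().mean() + (G(F(b)) - b).abs().mean()
total_generator_loss = gan_loss + lambda_cyc * cycle_loss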
7.10.5 StarGAN
[11] Extends multi-domain image translation in a single model. Condition
the generator on a target domain label. Discriminator must classify domain
and authenticity. Reduces complexity from training many pairwise models.
GANs remain popular for tasks needing crisp visuals or domain transla-
tion. But they lack a likelihood function and can be unstable. Now we turn
to normalizing flows, which do have tractable likelihoods.
zK = fK ◦ · · · ◦ f1 (z0 ).
With careful design (e.g. coupling layers), the Jacobian determinant of each transformation is easy to compute. Flows allow:
• Exact log-likelihood evaluation.
• Direct sampling by drawing z0 and applying the transformations.
• Inversion to find latent codes of real data (unlike vanilla GAN).
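Concretely, assuming the usual convention that z_0 is drawn from a simple base density p_0 (e.g. a standard Gaussian) and the data is modeled as z_K, the change-of-variables formula gives the exact log-likelihood

log p(z_K) = log p_0(z_0) − Σ_{k=1}^{K} log |det ∂f_k/∂z_{k−1}|,

where z_0 is obtained by inverting the chain of transformations. This is why coupling layers are designed so that each Jacobian determinant is cheap to evaluate.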
where x_t is a noisy version of x_0, and ε_θ is the network output. At inference, sampling each step t yields high-fidelity images but can be slow (hundreds of steps).
3. Decode final latent ẑ to image x̂ = D(ẑ).
This two-stage approach underlies Stable Diffusion: it’s far more efficient
than pixel-based diffusion while maintaining image fidelity.
Applications go beyond text-to-image:
Chapter 8
Transformers
an encoder-decoder structure similar to previous seq2seq models, but each
layer in a Transformer uses a technique called self-attention to look at other
positions in the same sequence. This design enables much more paralleliza-
tion during training (since you don’t have to process tokens one-by-one as
in RNNs) and can capture long-range dependencies more directly. When
[91] introduced Transformers in the paper “Attention Is All You Need”, they
achieved state-of-the-art results in translation, and the model has since be-
come the foundation for many advanced language models. In summary,
Transformers were motivated by the need for models that handle long se-
quences with flexible, learned alignments (attention) and that can be effi-
ciently trained with parallel computation.
the queried word. We then take a weighted sum of all value vectors, using
these scores (after normalization) as weights. The resulting vector is the new
representation of the query word, enriched by information from the words it
found relevant.
This mechanism allows the model to flexibly blend information from dif-
ferent parts of the sequence. For example, if the word “it” is the query, its
query vector might match strongly with the key of another word like “bank”
or “animal” depending on context, and thus pull in the value (features) of
those words. The attention output for “it” will then incorporate clues from
those related words, helping the model figure out what “it” refers to. All of
this is done with learned vectors and is differentiable, so the model learns
to set queries, keys, and values in a way that useful connections get high
attention weights.
focus on words like “river” or “slippery,” leading to a different interpretation
(landform by a river). In essence, self-attention allows each word to adapt
its representation based on the other words present, giving the model a way
to disambiguate words with multiple meanings using context.
Self-attention is also useful for understanding relationships like pronoun
references. For example, in the sentence “Alice gave her keys to Bob be-
cause she trusted him,” the pronoun “she” should be linked to “Alice” and
“him” to “Bob.” A Transformer can handle this by having the query vector
for “she” strongly attend to the key of “Alice,” and similarly the query for
“him” attend to “Bob.” This means the model’s representation of “she” will
incorporate information from “Alice” (e.g., gender or identity cues), helping
it keep track of who is who. Traditional left-to-right models (like standard
RNNs) would have to carry along context in a fixed-size hidden state, which
can be challenging over long distances. Self-attention, by contrast, creates
direct connections between relevant words, no matter how far apart they are
in the sentence.
Another benefit is that self-attention is computed in parallel for all words.
The model doesn’t have to process word by word sequentially; it can look
at the whole sentence at once. This means it can capture long-range de-
pendencies (like a connection between the first and last word of a sentence)
without difficulty. In summary, self-attention gives Transformers the ability
to read a sentence and dynamically decide, for each word, which other words
to pay attention to. This leads to rich, context-dependent representations
that make tasks like translation, comprehension, and summarization much
more accurate.
Let’s break this down:
• QKᵀ: This computes the dot product between each query vector and each key vector. If Q has dimension (n × d_k) and K is (m × d_k) (where n could be the number of query positions and m the number of key positions, and d_k is the dimensionality of keys/queries), then QKᵀ is an n × m matrix of raw attention scores. The entry at position (i, j) is basically q_i · k_j, the similarity between the ith query and jth key.
• 1/√d_k: We divide the dot products by the square root of the key dimension. This is a scaling factor. Without it, when d_k is large, the dot products tend to have large magnitude, pushing the softmax toward extremely peaked distributions and therefore very small gradients. Scaling by √d_k keeps the variance of the dot products more normalized, which empirically leads to more stable training [91].
Example: Suppose we have a single query and two key-value pairs for
illustration. Let the dot products (after scaling) between the query and
the two keys be [2.0, 1.0]. Applying softmax, we get weights of approximately [0.73, 0.27] (since e^2.0 ≈ 7.39 and e^1.0 ≈ 2.72, and normalizing gives 7.39/(7.39 + 2.72) ≈ 0.73 and 2.72/(7.39 + 2.72) ≈ 0.27). This means the
first key is considered about three times more relevant than the second for
answering the query. The attention output will be 0.73·v1 +0.27·v2 . In other
words, it’s mostly leaning on the information from the first value vector, but
also mixing in a bit of the second. If v1 represented, say, the context “Paris”
and v2 represented “London,” the output would be a vector that is closer to
the content of “Paris.” Essentially, the mechanism decided “Paris” was more
relevant to our query, and thus the resulting representation emphasizes that
content.
We can outline the computation in pseudocode for clarity:
# Given: list of queries Q, list of keys K, list of values V (aligned by index)
for each query_index i:
    for each key_index j:
        score[i][j] = dot_product(Q[i], K[j]) / sqrt(d_k)
    weights[i] = softmax(score[i])   # softmax over j
    output[i] = sum_j(weights[i][j] * V[j])
This process happens for every query (in practice we compute it as ma-
trix operations for efficiency). The outcome is that each query vector qi is
transformed into a new vector oi that is a blend of the values, with more
weight coming from those values whose keys matched qi closely. This is the
core operation that allows Transformers to route information flexibly around
the sequence.
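The same computation in vectorized PyTorch (a minimal sketch; Q, K, and V are assumed to be tensors of shapes [n, d_k], [m, d_k], and [m, d_v]):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # [n, m] raw scores
    weights = F.softmax(scores, dim=-1)                        # normalize over the keys
    return weights @ V                                          # [n, d_v] blended values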
for each head:
using the scaled dot-product attention formula. Each head will produce its
own output vector for each position (of size dv , which is often chosen to be
dk for simplicity). Then, the h outputs for each position are concatenated
and projected again with another weight matrix W O :
• Projects the input into multiple sets of queries, keys, values (one set
per head).
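As a quick illustration, PyTorch ships a module that performs the per-head projections, the scaled dot-product attention, and the final W^O projection in one call (a sketch with assumed sizes of 512 dimensions and 8 heads):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)        # [batch, sequence length, embedding]
out, weights = attn(x, x, x)       # self-attention: queries, keys, values all come from x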
positional differences), whereas higher i yields slower oscillations (capturing
coarse positional information).
These positional encoding vectors are added to the token’s word em-
bedding vectors at the input of the Transformer. By addition, each word
embedding is slightly shifted in a unique way depending on its position. The
model can then use these signals to infer the relative or absolute position
of words. For example, the difference P E(pos2 ) − P E(pos1 ) between two
position encodings has meaningful information about how far apart pos2 is
from pos1 . The sinusoidal scheme has a nice property: it’s periodic and con-
tinuous, so the model can potentially generalize to sequence lengths longer
than those seen in training (though in practice, there’s a limit), and it can
learn to attend to relative positions by combining sin and cos values (like
learning to detect phase differences between positions).
Let’s build some intuition. Think of each position encoding as a kind of
unique fingerprint for that position, created by mixing different frequencies
of sine and cosine waves. No two positions (within a reasonable range) will
have the exact same encoding because all the sinusoids align differently for
each index. The model doesn’t “know” the math of these encodings, but
it can learn to interpret them. For instance, it might learn that certain
patterns in the positional encoding correspond to the token being early in
the sentence vs late, or that if you subtract one token’s positional encoding
from another’s, the resulting difference vector might correlate with how far
apart they are.
In practice, other approaches to positional encoding exist (like learned
positional embeddings where you just have a trainable vector for each position
up to a maximum length). The sinusoidal method has the advantage of not
adding any new parameters and providing a kind of generalization for very
long sequences (you could extrapolate beyond the positions seen in training,
in principle). Another form of positional encoding used in some Transformer
variants is “relative positional encoding,” which encodes relative distances
rather than absolute positions, but the basic need remains: give the model
information about order.
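A minimal sketch of the sinusoidal scheme described above (the standard formulation from the original paper, with sines on even dimensions and cosines on odd dimensions; an even d_model is assumed):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)        # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                   # one frequency per pair
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                       # added to the token embeddings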
To visualize sinusoidal position encoding, imagine one dimension is a sine
wave that completes one full cycle every 100 positions, another completes a
cycle every 1000 positions, another every 10,000, etc. At position 0, all sinu-
soids start at well-defined values (sin(0)=0, cos(0)=1). As position increases,
each sinusoid oscillates. Any specific position will have a unique combination
of sine and cosine values across these frequencies, which the model can use
as a signature of that position.
In summary, positional encoding injects order information into the oth-
erwise order-agnostic Transformer. The sinusoidal formulation is a clever,
continuous way to do this, ensuring that each position has a unique code
and that the model can learn to attend to specific relative positions if useful
(since, say, shifting the query by k positions causes a predictable phase shift
in the encoding).
FFN(x) = W2 max(0, W1 x + b1 ) + b2 ,
The FFN can create new features out of the attended output for each token.
For example, if after attention a token’s representation has information from
itself and related words, the FFN might compute something like “given this
combined info, what is a good higher-level feature representation for this to-
ken?” It’s like a little neural network “brain” at each position that further
refines the representation.
Moreover, by increasing the dimensionality to dff (like 4 times larger)
in the middle, the FFN can capture complex combinations of features in
a higher-dimensional space, then project them back down. This expansion
gives the model more capacity at each layer to model relationships or patterns
that are local to the token.
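In code, the position-wise feed-forward network is just two linear layers applied independently at every position (a sketch using d_model = 512 and d_ff = 2048, the sizes from the original paper):

import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(512, 2048),   # expand to the larger inner dimension
    nn.ReLU(),              # the max(0, .) non-linearity from the formula above
    nn.Linear(2048, 512),   # project back down to the model dimension
)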
A complete Transformer layer (in the encoder or decoder) typically goes:
multi-head attention → add & normalize → feed-forward → add & nor-
malize. The “add & normalize” refers to the residual connection and layer
normalization. The residual connection means that the input of each sublayer
(attention or FFN) is added to its output, and then a layer normalization
is applied. For instance, if h is the input to the FFN sublayer and h′ is
the output of FFN, we actually output LayerNorm(h + h′ ). This residual
addition helps preserve the original information and makes training easier (a
technique inspired by ResNets in vision [91]). The layer normalization helps
stabilize and smooth the training by normalizing the output at each layer.
From a beginner’s perspective, one can view each Transformer layer as
doing two things:
1. Mix information across the sequence with attention (so each word learns something from other words).
2. Let each position independently transform the information it has gathered, using the feed-forward network.
The combination of these is powerful: the attention ensures the model has
a rich soup of information at each position, and the FFN then turns that
soup into something useful for the next layer or final prediction. Without
the FFN, the model would be linear combinations of values only; the FFN
introduces non-linearity and interactions among the combined features.
In summary, the feed-forward network in Transformers gives each po-
sition a chance to independently process the information it gathered from
others and increase the model’s overall expressiveness. It’s a simple but cru-
cial component that works with attention to build deep representations of
sequences.
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
where row i corresponds to query position i and column j to key position j. Row 3 (0-indexed) is [1, 1, 1, 1], meaning the 4th token can attend to positions 1, 2, 3 and itself (position 4, since we include keys with j <= i), and there is no position 5 to look ahead to; row 2 is [1, 1, 1, 0], so the 3rd token is blocked from the future position 4. In this matrix, 1 indicates “can attend” and 0 indicates “masked out.” When applying this mask to the computed attention scores, any position with 0 will get −∞ before softmax, resulting in 0 probability assigned.
What this means conceptually is that when the model is computing the
representation for the 5th word in a sequence, it will only be allowed to
incorporate information from the 1st through 5th words. It cannot sneak a
look at the 6th word. This property is crucial for generative tasks so that
the model doesn’t cheat by looking ahead. During training, we present the
model with the full target sentence (for example, in translation the decoder
is given the shifted target sentence), but thanks to masking, each position’s
prediction is made using only earlier target words.
If we didn’t apply a look-ahead mask, the self-attention in the decoder
could attend to the future words, and the model would essentially see the
answer before predicting it, making training meaningless for generation pur-
poses. The mask forces the decoder to behave in an autoregressive manner.
At inference time, we generate one word at a time: feed the first word, get the
next, then append it, feed the first two, get the third, etc., which naturally
ensures we never see future words. The training-time mask just mirrors this
process in a parallelized way.
To sum up, the look-ahead mask is a simple trick to enforce causality in
sequence generation. By zeroing out attention to future tokens, it ensures
the model can be used to generate coherent text one token after another
without inadvertently using information that should not be known yet. In
the Transformer implementation, this masking is often done by adding a mask matrix M (with −∞ for masked positions) to the QKᵀ/√d_k scores before softmax. The result is that attention weights for future positions become
0. Thus, the decoder can be trained on full sequences while maintaining the
principle that each position predicts the next token using only past context.
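In practice the mask can be built once per sequence length and applied to the scores (a minimal PyTorch sketch, assuming scores is the [n, n] matrix of QKᵀ/√d_k values):

import torch

n = scores.size(-1)                                     # sequence length
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()  # True above the diagonal = future positions
masked_scores = scores.masked_fill(mask, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)          # future positions receive weight 0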
builds a vocabulary of common subword chunks so that frequent words are
usually one or a few tokens, while rare or unseen words can be constructed
from smaller pieces.
The BPE algorithm for text works roughly as follows [76]:
1. Start with all words in the training corpus broken down into individual
characters (plus a special end-of-word symbol so that we know where
one word ends).
2. Count the frequency of every pair of symbols that appear next to each
other. A “symbol” is initially a character, but as we merge, symbols
can become sequences of characters.
3. Find the most frequent adjacent pair of symbols.
4. Merge that pair into a single new symbol (effectively, add a new token to the vocabulary which is the combination of those two).
5. Replace all occurrences of that pair in the text with the merged symbol.
6. Repeat steps 2-5 until we have reached the desired vocabulary size or there are no more pairs to merge. (A minimal code sketch of this merge loop appears after the worked example below.)
Next, maybe “lo” + “w” is frequent (in “low” and “lowest”), merge to get
“low”. Now:
Next, perhaps “e” + “r” is frequent (in “newer”), merge into “er”. Now:
And then “low” + “est” might merge depending on frequencies, etc. Eventu-
ally, we might end up with vocabulary tokens like “low”, “est”, “new”, “er”,
etc., such that “lowest” is tokenized as “low” + “est” and “newer” as “new”
+ “er”. We achieved an open vocabulary: even if we encounter a new word
like “newest” later, it can be tokenized as “new” + “est” which are in our
vocabulary.
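A minimal sketch of the merge loop from the algorithm above; the word-frequency table here is a toy stand-in, and "</w>" marks word endings:

import collections
import re

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence (one word) to its corpus frequency
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the adjacent symbol pair with a single merged symbol
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2, 'n e w e r </w>': 6}
for _ in range(10):                       # perform up to 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
print(vocab)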
The reason this is helpful is that the model doesn’t have to learn from
scratch that “est” is a suffix meaning something like “most” in superlatives,
or that “low” is a root – it sees those as separate tokens. It also drastically
reduces the number of unknown or out-of-vocabulary words. Almost any
word can be expressed as a sequence of BPE tokens from a good vocabulary.
The merges effectively incorporate frequent letter combinations (including
whole words for very common ones) so the model can treat them as a single
token, which is more efficient and often more meaningful.
Byte-Pair Encoding was originally a compression algorithm [22] (it re-
placed common byte pairs in data with shorter codes). In NLP, [76] adapted
it for word segmentation. One advantage of BPE is that it’s deterministic
given the learned merges: any new text will be tokenized in a consistent way
by greedily applying the longest possible merge rules (so it always prefers
longer known subwords over splitting into characters).
In modern Transformers:
• GPT models use variants of BPE (GPT-2 used BPE on byte sequences,
treating text as bytes to include any Unicode).
• BERT-style models use WordPiece, a closely related subword tokenization method.
Both ensure that common words are usually one token (“the”, “apple”), slightly less common words might be two tokens (“ap@@” + “ple” in BPE-style notation, or “ap” + “##ple” in WordPiece notation), and rare words break into several pieces or characters.
For a beginner, think of BPE as teaching the model a “syllabary” or “al-
phabet” of word pieces. Instead of single letters (too slow to spell everything
out) or whole words (can’t cover all words, especially misspellings or names),
it learns common chunks. This way, the model sees “nationalization” bro-
ken into “national”, “ization” for example, understanding those parts, and
it can recombine parts for new words (“internationalization” would share
“national” and “ization”). BPE makes training faster (fewer time steps than
character-level) and generalization better (no out-of-vocab errors).
prompt), and then it will continue generating additional text.
GPT-1 [67] was the first such model, demonstrating that a Transformer
language model pre-trained on a large corpus (BooksCorpus) and then fine-
tuned on specific tasks could yield good results. It had on the order of 117
million parameters and was a proof of concept that pre-training on unla-
beled text can help downstream tasks (this was around the same time the
BERT model came out, which is different in approach but also based on
Transformers).
GPT-2 [68] scaled this idea up dramatically. GPT-2 had up to 1.5 bil-
lion parameters in its largest version and was trained on a very large dataset
(around 8 million web pages, called WebText). GPT-2 showed remarkable
ability to generate fluent and coherent paragraphs of text, perform rudimen-
tary reading comprehension, translation, and question answering in a zero-
shot way (without task-specific fine-tuning) just by being prompted with an
example. For instance, if prompted with an English sentence followed by its
French translation a couple of times, GPT-2 could continue and translate the
next sentence (even though it was not explicitly trained for translation, it
picked up some ability from its huge training corpus). The release of GPT-
2 was staged carefully due to concerns about misuse (like generating fake
news), highlighting how powerful the approach is.
GPT-3 [7] pushed the scale to an unprecedented level: 175 billion param-
eters, trained on an even larger corpus of internet text. Instead of fine-tuning
for each task, the focus with GPT-3 was on few-shot learning via prompting.
Users could give GPT-3 a prompt with a few examples of a task (like a cou-
ple of math problems and solutions, or a question and its answer) and then
ask a new question; GPT-3 often could continue the pattern and produce a
correct or at least plausible answer. This showed that with enough capacity
and data, a language model could learn to perform many tasks implicitly.
The GPT-3 architecture was similar to GPT-2 (decoder-only Transformer
with masked self-attention), just much larger and with some improvements
in training techniques and initialization. It uses Byte-Pair Encoding for to-
kenization with a 50,000 token vocabulary, and it’s so large that it captures
a lot of world knowledge and linguistic patterns.
An example of using a GPT-style model for autoregressive generation, in pseudocode:

prompt = "Once upon a time"
output = prompt
for i in range(100):                               # generate 100 tokens
    logits = model(output)                         # unnormalized scores (logits) for each position
    next_token = sample_from_softmax(logits[-1])   # sample from the last position's logits
    output = output + next_token                   # append the generated token
print(output)
Here, the model is a trained GPT-like Transformer. We start with a prompt
and iteratively sample the next token and append it. The look-ahead mask
during training ensured that at generation time this works correctly (the
model always conditions only on what’s already generated).
In summary, GPT’s architecture is a stack of Transformer decoder lay-
ers that use self-attention (with masking) and feed-forward networks. It is
trained as a language model on massive text data. Over the iterations from
GPT-1 to GPT-3 (and beyond), the trend has been: bigger models + more
data = better performance and emergent abilities. The “decoder-only” de-
sign is extremely effective for generating text, and it forms the backbone of
many state-of-the-art systems, including the famous ChatGPT (which is a
further fine-tuned version of a GPT model). GPT models highlight the power
of the Transformer architecture for generative tasks and have opened the era
of large-scale “foundation models” that can be adapted to many tasks.
technique called masked language modeling (MLM). Instead of predict-
ing the next word, BERT is trained to predict randomly masked words in a
sentence using both left and right context.
Here’s how BERT’s training works:
• You take an input sentence (or pair of sentences). You randomly choose
some of the tokens (e.g., 15% of them) and replace them with a special
[MASK] token. For example: “The [MASK] was very hungry.”
• The model is trained to output the correct identity of the masked to-
kens. In our example, if the original sentence was “The cat was very
hungry,” the model should produce “cat” in place of the mask.
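A simplified sketch of the masking step (real BERT additionally replaces some chosen tokens with random tokens or leaves them unchanged rather than always using [MASK]; mask_token_id is whatever id the tokenizer assigns to [MASK]):

import torch

def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
    labels = input_ids.clone()
    chosen = torch.rand(input_ids.shape) < mlm_prob   # pick roughly 15% of positions
    labels[~chosen] = -100                            # ignore unmasked positions in the loss
    masked_inputs = input_ids.clone()
    masked_inputs[chosen] = mask_token_id             # replace chosen tokens with [MASK]
    return masked_inputs, labels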
RoBERTa (Robustly Optimized BERT) [57] is an improved version of
BERT introduced by Facebook AI. RoBERTa didn’t change the architecture,
but rather the training procedure and hyperparameters to get more out of
the model:
• They trained on much more data (including longer training with bigger
batches). BERT was trained on BookCorpus and Wikipedia (16GB of
text); RoBERTa used those plus news, web text, etc., totaling over
160GB.
• They removed the Next Sentence Prediction objective. RoBERTa found
that NSP was not helpful and that one can just train on the MLM ob-
jective alone and still get great results. In fact, removing NSP allowed
them to use uninterrupted text sequences and vary input length.
• They used dynamic masking: Instead of deciding once which words
to mask and keeping that fixed for an example throughout training (as
BERT did), they would change which tokens are masked on different
passes. This means the model sees more variety – e.g., in one epoch the
word “dog” might be masked in a sentence, in another epoch maybe
“barks” is masked for the same sentence. This leads to more robust
learning of token representations.
• Other optimizations: larger batch size, different learning rate schedule,
etc., to make training more effective.
The result was that RoBERTa outperformed BERT on many benchmarks,
essentially by “training the heck out of it” with better practices. The name
“Robustly Optimized” reflects that they did a thorough job of finding how
to get the most out of the BERT architecture.
From a beginner’s perspective, BERT and RoBERTa show another paradigm
of Transformer use: not for generating text, but for encoding text into a deep
understanding. They produce contextual embeddings – each token’s output
is an embedding that reflects the meaning of that token in context. These
embeddings can then be used for downstream tasks: classification, span ex-
traction, etc. For example, to do sentiment analysis, you can take the [CLS]
token’s embedding from BERT (which is like a summary of the sentence)
and put a classifier on top. Or for QA, you can take two sentences (passage
and question) fed together with a [SEP] separator, and train the model to
mark the start and end tokens of the answer span in the passage.
RoBERTa’s improvements illustrate how important the training setup is.
It wasn’t that BERT’s idea was flawed; it was that you could push it further
with more data and some simplifications. Indeed, RoBERTa removed NSP
entirely and still did better, implying the bidirectional MLM was doing most
of the heavy lifting in BERT’s learning.
In summary, BERT introduced the concept of masked language modeling
to get bidirectional context in Transformer encoders, and RoBERTa fine-
tuned that approach to yield even better pre-trained models. These models
are not used to generate free text (they’re not typically asked to continue a
story), but rather to provide powerful language understanding that can be
specialized to many tasks with a bit of fine-tuning.
• Flatten each patch into a vector. A 16x16 patch with 3 color channels
has 16 × 16 × 3 = 768 pixel values. You then linearly project this
768-dimensional vector to the model’s embedding dimension (let’s say
768 as well for convenience). This gives a patch embedding, analogous
to a word embedding in NLP.
• Also add positional encodings to each patch embedding, because we
want to give the model information about where each patch is located
in the image (top-left, bottom-right, etc.). They used learned posi-
tional embeddings or fixed ones (the ViT paper used learned position
embeddings since image positions are fixed grid locations).
• Now you have a sequence of tokens: [CLS] + patch 1 + patch 2 +
... + patch N. This sequence is fed into a Transformer encoder. The
Transformer layers then perform self-attention among all these patches
(and the CLS token).
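A minimal sketch of the patch-embedding step, assuming 224x224 inputs, 16x16 patches, and an embedding dimension of 768; a convolution with kernel and stride equal to the patch size is equivalent to flattening each patch and applying the same linear projection to all of them:

import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

imgs = torch.randn(8, 3, 224, 224)                  # a batch of images
patches = proj(imgs)                                 # [8, 768, 14, 14]
tokens = patches.flatten(2).transpose(1, 2)          # [8, 196, 768] patch embeddings
cls = nn.Parameter(torch.zeros(1, 1, embed_dim))     # learnable [CLS] token
tokens = torch.cat([cls.expand(8, -1, -1), tokens], dim=1)   # [8, 197, 768]
# Positional embeddings would be added to `tokens` before the Transformer encoder.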
The self-attention mechanism will allow the model to globally reason
about the image. Each patch can attend to any other patch. For instance, if
part of the image contains an eye of a cat and another part contains an ear of
the cat, the model can make connections between those patches via attention,
which might help it realize the overall object is a cat. This global receptive
field is a stark contrast to convolutional neural networks, which typically only
mix information locally and need many layers to achieve global interaction.
A single Transformer attention layer is global (any patch can directly look
at any other patch’s features).
After passing through multiple Transformer encoder layers, we take the
final hidden state corresponding to the [CLS] token. We then attach a simple
feed-forward neural network (like a single linear layer or a small MLP) on top
of that [CLS] representation to produce a classification (for example, which
category the image belongs to). We train the whole system end-to-end on
image classification loss (like cross-entropy for the correct label).
Some important notes and intuition:
• Each patch is like a “word” describing a part of the image. At first, this
might seem lossy (we threw away spatial resolution within each patch
by flattening), but if the patches are small (like 16x16), the network can
still reconstruct spatial information by looking at neighboring patches
and their relations.
• The positional encoding ensures the model knows patch #5 is, say, top
right corner and patch #200 is bottom left, etc. Otherwise, it would
treat the image as a bag of patches with no order.
• ViT doesn’t inherently know that patches near each other are also
spatially related unless it learns that via attention. This means it has
less built-in inductive bias than a CNN (which assumes locality and
translation invariance), so ViT typically needs a lot of data to train
from scratch successfully. In the ViT paper, they pre-trained on very
large datasets (like ImageNet-21k or JFT-300M) to get good results,
and then fine-tuned to smaller ones.
8.13 CLIP: Contrastive Training of Image and
Text Encoders
CLIP (Contrastive Language-Image Pretraining) is a model by OpenAI that
connects vision and language by training an image encoder and a text encoder
together in a joint multimodal space [64]. The goal is for the model to learn
to associate images with their correct descriptions. Once trained, CLIP can
understand images and captions in a versatile way (for example, you can use
it for zero-shot image classification).
How does CLIP work? It uses a contrastive learning approach:
• There are two networks: one takes an image and produces an image
embedding (a vector representation), and another takes a text (for
example, a caption) and produces a text embedding.
• The diagonal of this matrix (image i paired with text i) holds the “correct” image-text pairs, and off-diagonals are mismatched pairs. CLIP uses a contrastive loss (often an InfoNCE loss) that basically says: each image should be most similar to its own caption and less similar to other captions, and vice versa for each caption (a minimal sketch of this loss follows the list below).
product of true image-caption pairs while minimizing it for incorrect
pairs.
• CLIP can also be used for image search (retrieval). If you encode a
bunch of images and a query text, you can find which image embedding
has the highest similarity to the query text embedding—essentially
finding images that best match a description. Conversely, you can
encode an image and search through a corpus of text embeddings for
the best matching description.
• Because CLIP has a rich joint space, it has even been used in generative
tasks. For instance, models like DALL-E 2 use CLIP’s image encoder
as part of a feedback mechanism to generate images that match a text
prompt.
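A minimal sketch of the symmetric contrastive loss referenced above, assuming image_emb and text_emb are matched [batch, dim] outputs of the two encoders:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # pairwise similarities
    targets = torch.arange(logits.size(0))             # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)          # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)      # text -> image direction
    return (loss_i + loss_t) / 2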
8.14 InstructGPT and Reward Models: Fine-
Tuning with Human Preference
Large language models like GPT-3 are very powerful but they are trained
purely to predict the next token, not necessarily to follow human instruc-
tions or produce helpful, correct answers. InstructGPT [63] is an approach
by OpenAI to fine-tune GPT-3 models so that they follow instructions bet-
ter and align with what humans expect. The core of InstructGPT is using
human feedback to train a reward model and then using reinforcement
learning (specifically RL from human feedback, RLHF) to optimize the lan-
guage model with that reward model.
The InstructGPT process can be summarized in three phases:
tecture as the language model, but smaller) with an output head that
produces a scalar value instead of predicting language.
This combination of techniques is known as RLHF (Reinforcement Learning from Human Feedback). The concept was inspired by earlier work in which an agent could be trained to satisfy human preferences [12]. In InstructGPT, it is applied to conversational AI. The success of InstructGPT was clear: the fine-tuned models were much more aligned – they were less likely to produce irrelevant or harmful answers, and users found them more useful. In fact, the publicly known ChatGPT is essentially this kind of model.
It’s important to note that the reward model can have flaws or biases
depending on the data. The model will optimize for whatever the reward
model (and thus the human labelers) favor. If not careful, it can learn to
trick the reward model (model outputs that look good to the reward model
but might be nonsensical to a human if the reward model has weaknesses).
To mitigate that, the iterative approach and careful checks are used.
In summary, InstructGPT augments a pre-trained language model with an extra round of training that involves humans in the loop:
• Humans provide example responses (to prime the model).
• Humans compare model outputs to train a reward model.
• The model is then fine-tuned using reinforcement learning to maximize the reward (i.e., human satisfaction).
This process produces a model that more reliably does what you want when you prompt it – essentially making the model follow instructions and align with human preferences for response quality. It demonstrates how we can steer large models to be more useful and safe by defining what we want (via the reward model) and optimizing for it.
prompt or conversation history), outputs an action (a probability distribution over the next token, and sequentially a whole response).
• Generating a full response is like the policy producing a sequence of actions until a termination (perhaps a special end-of-sequence token or reaching a length limit).
• After the model produces a response, we can compute a reward R for that output using the reward model (or other criteria). For simplicity, think of R as a single number evaluating the whole response.
• Now, we want to adjust the policy πθ to increase the expected reward E[R] over the distribution of prompts (and the model's own stochastic outputs). This is a reinforcement learning problem.
Vanilla policy gradient methods would tell us to nudge the model’s output
probabilities to make high-reward outputs more likely. However, directly
doing that can be unstable if the changes are too large (the model might
diverge and drop performance, or the language might become repetitive or
collapse). PPO offers a solution by using a clipped objective:
L(θ) = E_{prompt, response}[ min( r(θ)·A, clip(r(θ), 1 − ε, 1 + ε)·A ) ],
the probability of good responses and decrease that of bad responses, but
limited by the PPO clipping to avoid going too far.
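A sketch of this clipped objective for a batch of sampled responses; log-probabilities and advantages are assumed to be precomputed, and the sign is flipped so the result can be minimized with gradient descent:

import torch

def ppo_clipped_loss(logp_new, logp_old, advantage, eps=0.2):
    # Probability ratio between the updated policy and the old policy for the same actions
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Take the pessimistic (minimum) objective, then negate it for minimization
    return -torch.min(unclipped, clipped).mean()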
We also often include the KL divergence penalty as part of the reward or
as a separate term. This effectively acts like a penalty if the new policy πθ
drifts too far from the original distribution πθold . It’s like saying “don’t forget
how to speak fluent language or start outputting nonsense just to tweak the
reward.” It keeps the language style and general knowledge anchored.
Imagine the reward model strongly prefers very enthusiastic answers (maybe
it learned humans like exclamation marks). If unpenalized, the policy might
start outputting exclamation-laden answers everywhere. The KL penalty
would discourage it from straying too far from the more neutral base distri-
bution unless it truly improves reward.
The end result of using PPO is a balanced update that improves the de-
sired behavior incrementally without wrecking the model’s language ability.
PPO is favored because: - It’s relatively straightforward to implement on top
of existing policy gradient code. - It doesn’t require second-order derivatives
or complex math (like TRPO, an older method, does). - It’s been found to
be stable across many domains. - The clipping mechanism is a heuristic that
works well to prevent oscillations and ensure the training doesn’t collapse
even if the reward signal is sometimes noisy or imperfect.
For a beginner: you can think of PPO as a careful teacher for the model.
Instead of just saying “this output was good, so massively boost it,” PPO
says “this output was good, let’s make it a bit more likely next time, but
not too much, and this other output was bad, let’s make it a bit less likely.”
Over many iterations, these nudges add up to a noticeable behavior change.
PPO just ensures the nudges are not too big each time (proximal = nearby,
meaning the new policy stays close to the old one after each update).
In code or algorithm form, each iteration of PPO for language model fine-tuning might look like:
1. Sample a batch of prompts from a dataset.
2. For each prompt, have the model generate a response (maybe multiple responses) by sampling.
3. For each (prompt, response), compute reward via the reward model.
4. Compute the advantage A = R − b (where b could be a baseline from a value function).
5. Compute gradients of the PPO loss with respect to θ and update θ (possibly with multiple mini-batches from the collected data, known as epochs in PPO).
6. Optionally update the baseline (value function) to better predict reward.
7. Rinse and repeat with new data from the updated policy.
Using PPO in this way was pivotal to making InstructGPT work. With-
out RL, if they had tried to directly fine-tune on a scalar reward with super-
vised learning (which doesn’t make sense directly because there’s no target
output), or if they tried to treat the highest-ranked output as a “correct
answer” for cross-entropy, that would ignore the nuanced feedback. RLHF
with PPO uses all the information (relative preferences) and finds an optimal
policy that maximizes expected reward.
In summary, PPO is the reinforcement learning algorithm that takes the
feedback from the reward model and turns it into a policy improvement for
the language model, doing so in a stable, controlled fashion. It ensures that
the fine-tuning process doesn’t ruin the model while trying to align it with
human preferences, leading to a high-quality tuned model that behaves better
according to those preferences.
regression on pairs of outputs:
• For a given prompt x, suppose we have two responses: y⁺ (the human-preferred one) and y⁻ (the dispreferred one).
• We want π_θ(y⁺|x) to be higher than π_θ(y⁻|x). Specifically, from the formula of π*, the ratio should relate to the exponent of the reward difference:

π*(y⁺|x) / π*(y⁻|x) = [π₀(y⁺|x) / π₀(y⁻|x)] · exp{β(R(x, y⁺) − R(x, y⁻))}.

or conceptually

−log [ π_θ(y⁺|x) / (π_θ(y⁺|x) + π_θ(y⁻|x)) ],
which is basically saying “we want πθ to give the pair (y + , y − ) the correct
ordering, with as high confidence as possible.”
In simpler terms, DPO treats the preference comparison as a training ex-
ample: “For prompt x, output y + should have a higher score than y − . Make
the model’s logits reflect that.” This is reminiscent of how we might train a
binary classifier to pick which output is better, except here the “classifier” is
embedded in the generative model’s probabilities.
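A sketch of that conceptual pairwise objective in code; logp_preferred and logp_dispreferred are assumed to be the summed log-probabilities the model assigns to the two complete responses, and the full DPO loss additionally involves the reference model's probabilities and a β coefficient:

import torch.nn.functional as F

def pairwise_preference_loss(logp_preferred, logp_dispreferred):
    # -log sigmoid(difference): push the preferred response to a higher log-probability
    return -F.logsigmoid(logp_preferred - logp_dispreferred).mean()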
The surprising result from the DPO authors was that optimizing this kind
of loss (plus some regularization to not drift too far from the original model)
yields a model that is as good as the PPO-trained model in following human
preferences, but it’s much simpler to implement. It doesn’t require sampling
from the model and calculating advantages in a loop; you just need pairs of
outputs with a preference label.
Some advantages of DPO:
• It doesn't explicitly use a reward model during training (though in practice you need one to get the comparisons, or you already have the comparisons from human data).
• It avoids the instability of RL and can be done with standard gradient descent (it's like doing a form of pairwise logistic regression).
• It's computationally simpler: you can assemble a dataset of preferred vs dispreferred output pairs and just fine-tune the model on that dataset.
We can also see DPO as treating the problem as “the model should classify
which output is better.” Because the model is generative, making πθ (y + |x)
higher than πθ (y − |x) across all such pairs essentially tunes the model to
generate y + -like outputs more frequently.
It’s interesting to note that DPO’s derivation leverages the assumption
that the original model’s distribution π0 can act like a prior. DPO doesn’t
throw away the original knowledge; it’s adjusting it by the human preferences.
In a way, it’s doing the same thing as RLHF (which also often includes a
term to keep the new policy close to the original), but baked into a single
loss function.
For a beginner: imagine you have many examples of “In this situation,
humans liked Response A more than Response B.” DPO says: “Okay, I
will fine-tune the model so that it gives Response A a higher score than
Response B for that situation.” Doing this for many examples teaches the
model generally what humans prefer.
Comparing to PPO:
• PPO indirectly achieves the same effect but requires fiddling with reward scaling, advantage estimation, and many iterations of sampling and optimizing.
• DPO just requires a dataset of comparisons (which we typically have from the same process as training the reward model).
• DPO doesn't need to maintain a value function or sample multiple times; it's a direct supervised learning approach to an RL problem.
The research found DPO to be competitive with PPO on tasks like sum-
marization and dialogue alignment, suggesting it’s a promising alternative.
However, it’s fairly new and being studied further. It simplifies alignment
work because practitioners can avoid the RL part, which is often the trickiest
and most finicky stage.
In summary, Direct Preference Optimization is a method that bypasses
reinforcement learning by converting preference data into a direct training
objective for the language model. It seeks to combine the strengths of the
original model with the feedback data in a single, stable training phase. If
PPO was like using a trial-and-error loop to gradually reinforce good behav-
ior, DPO is like jumping straight to fitting the model to what’s considered
good vs bad. It’s a great example of how insights from one approach (RL)
can inform a simpler approach that achieves a similar outcome.
8.17 Language Models as Reward Models: Con-
cept and Future Implications
In the context of alignment and fine-tuning, an intriguing idea has emerged:
using large language models themselves as reward models or judges. We’ve
seen that to align a model with human preferences, we often train a separate
reward model on human data. But what if the language model could directly
model the reward? Or put differently, what if a language model could be used
to evaluate other outputs?
There are a few angles to “language models as reward models”:
based on a set of principles (essentially the model is used to measure
how well outputs follow the principles). That critique can be seen as
a reward signal. In a sense, the language model is being used to ap-
proximate a reward function (the degree to which the output violates
or follows the written constitution).
• Unification of policy and reward: The phrase “Your Language
Model is Secretly a Reward Model” from the DPO paper hints at a
future where the line between the policy (the one generating answers)
and the reward model might blur. Perhaps large models can internally
estimate human preference as they generate text. If a model can predict
“what would a human think of this response?” accurately, it could
steer itself toward better responses without needing an external reward
model.
What are future implications of treating language models as reward models?
• It could greatly streamline the alignment process. If we don't have to train a separate reward network and run RL, and can instead rely on large models' evaluation capabilities, we can more quickly fine-tune or even prompt models to behave well.
• One day, a language model might be able to improve itself by evaluating its own outputs. For example, it could generate several candidate answers to an instruction, then internally “think” or “vote” on which is best (using an internal reward heuristic), and output that. In some sense, this is already happening in techniques like “chain-of-thought” prompting where the model is asked to critique or refine its answer.
• Using language models as reward models also ties into safety considerations: we must ensure the model's notion of “reward” aligns truly with human values and not some proxy that can be exploited. If a model learns to please an AI judge that isn't perfectly aligned with humans, it might still output things humans don't actually want (but the AI judge does). This is called the alignment of the reward model. If the AI judge is a large model that itself was aligned using human data (like GPT-4, presumably aligned), then we are stacking alignment on alignment.
• Another implication is scalability. Human feedback is a bottleneck. If AI models can serve as automatic judges, we can create massive amounts of training data for preferences without direct human labor. For instance, you could generate thousands of summaries and have a model pick the best ones to train a better summarizer.
• There's also research indicating that as models get more capable, they might develop a form of “knowledge” of ethics or human values just from their training data. If that's the case, perhaps future language models could have a built-in “conscience” scoring mechanism (hard to measure, but conceptually) that we could tap into instead of training separate classifiers.
One example of using a model as a reward model is the idea of “GPT-4 as
a judge for fine-tuning ChatGPT.” If GPT-4 consistently gives high ratings
to certain types of responses, we train ChatGPT to produce those. This is
essentially using GPT-4’s complex understanding as the reward model. This
approach has to be careful – we don’t want to oversimplify alignment to just
mimic a bigger model, but it’s a powerful tool.
In reinforcement learning terms, language models as reward models is like
having the critic be another model. If the critic is good, the actor (policy
model) can learn quickly. If the critic is flawed, the actor will learn bad
habits. So a lot of future work will likely focus on how to ensure these
AI-based reward models truly match human intent (maybe by periodically
auditing them with real human feedback or mixing the two).
Looking further ahead: if we manage to align models well and they truly
understand our preferences, one could envision that the distinction between
“model doing task” and “model checking task” goes away – a sufficiently
advanced model might do both in one go. For example, a single model could
be prompted with a request and, using its knowledge, just directly give a
response it knows is helpful and aligned (because it knows what we’d prefer
without needing explicit feedback). That’s the ideal scenario: the model
inherently acts as if it has an internal reward for helping us, rather than us
externally imposing it.
In conclusion, the concept of language models as reward models is about
leveraging the models’ own intelligence and knowledge to evaluate outputs,
thereby reducing reliance on separate systems or human input. It’s an active
research area and an exciting one: it suggests a future where AI can align AI,
under human guidance – a sort of bootstrap of alignment. This could make
developing helpful and safe AI assistants much more efficient. But careful
oversight will be needed to ensure the “AI feedback” remains grounded in
actual human values. It’s a bit like training a student to grade their own exam
correctly; if you succeed, the student can largely guide their own learning,
but you need to verify that their grading criteria are truly correct and not
self-serving.
Chapter 9
Object-Oriented AI
Development Based on MCP
9.1 Introduction
In recent years, artificial intelligence (AI) systems have become increasingly
modular and interconnected. This shift has emerged because people want
AI models not only to solve singular tasks but also to coordinate multiple
capabilities, access external tools or data sources, and support complex work-
flows. Creating such systems can be challenging if the design is not organized
from the start, which is why an object-oriented mindset is so valuable.
In this chapter, we will investigate three core ideas:
• Model Context Protocol (MCP). This is a newly introduced stan-
dard that defines how AI models can talk to external resources in a
reliable and consistent manner.
• Hierarchical Ontologies. These help manage different types of data,
tasks, or concepts in a structured way, particularly when dealing with
multimodal inputs like text, images, and audio.
• LLMs as AI Engines. Large Language Models can serve as central
orchestrators of tool usage and data flow, much like a game engine
coordinates graphics, physics, and audio in video games.
In the sections that follow, we delve into each concept in detail, always
paying attention to how they reinforce each other in an object-oriented ap-
proach.
9.2 Object-Oriented Programming (OOP)
Learning to code involves understanding fundamental concepts that guide
how we structure, organize, and collaborate on software projects. In par-
ticular, Object-Oriented Programming (OOP) helps us break down complex
problems into manageable parts, and version control systems like Git enable
teams (and individuals) to work together smoothly. Let us walk through
these ideas step by step.
Example Imagine a “Car” class specifying attributes like color and speed,
and methods such as accelerate() or brake(). Any individual car on the
road is an object, created using this blueprint.
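A tiny sketch of this blueprint in Python:

class Car:
    def __init__(self, color, speed=0):
        self.color = color          # attribute
        self.speed = speed          # attribute

    def accelerate(self, amount):   # method
        self.speed += amount

    def brake(self):                # method
        self.speed = 0

my_car = Car("red")                 # an object created from the blueprint
my_car.accelerate(30)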
9.2.3 Inheritance
Inheritance allows us to create a new class based on an existing class, so we
can reuse code rather than writing it from scratch.
9.2.4 Polymorphism
Polymorphism enables a function or method to behave differently depend-
ing on the context or the type of data it is handling. In simpler terms, one
function name can perform different tasks.
• It allows for more generic code that can handle various data without
rewriting.
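Continuing the Car class sketched earlier, a short sketch of both ideas:

class ElectricCar(Car):                     # inheritance: reuses Car's attributes and methods
    def __init__(self, color, battery=100):
        super().__init__(color)
        self.battery = battery

    def accelerate(self, amount):           # polymorphism: same method name, different behavior
        self.speed += amount
        self.battery -= 1

for vehicle in [Car("red"), ElectricCar("blue")]:
    vehicle.accelerate(10)                  # the same call works for either type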
9.2.6 Example: Linking to Real-World Projects
One way to strengthen your understanding of these concepts is by explor-
ing real project code. For instance, the minGPT repository by Karpathy
demonstrates how object-oriented principles and clean coding can come to-
gether in practice. Browsing open-source projects like this illustrates how
classes, objects, inheritance, and polymorphism appear in actual programs.
is not scalable. For example, a chatbot that needs stock market data, weather
data, or access to the user’s documents would require separate code to han-
dle each integration. This leads to repetitive development and maintenance
overhead.
Anthropic introduced the Model Context Protocol (MCP) to address this
issue. MCP is an attempt to provide a universal interface that AI models can
use to request data or actions from an external service. One might consider
it analogous to the way TCP/IP standardized networking, except that here
the focus is on bridging AI clients with specialized services.
• Greater consistency. Standardizing the interface ensures that requests
and responses follow a predictable format, making the system easier to
debug.
• Stronger security. Access rules can be enforced at the protocol layer.
For instance, if the AI should not be allowed to modify certain docu-
ments, the MCP server can reject any write operations to that docu-
ment, regardless of how the model was prompted.
9.3.5 Example: Minimal Pseudocode for MCP Calls
We can imagine a simple Python-like interface for making MCP requests.
Suppose we have a client class:
class MCPClient:
    def __init__(self, server_url):
        self.server_url = server_url

    def call_method(self, method_name, parameters):
        # This function might send an HTTP request or use another protocol.
        # The server interprets the request, performs the action, and returns JSON.
        response = send_request(self.server_url, method_name, parameters)
        return response

# In practice, you'd expand this to handle authentication, error handling, etc.
A typical call might look like:
# Example usage
if __name__ == "__main__":
    # Suppose we want a summarization service
    summarizer = MCPClient("https://summarize.example.com/mcp")

    text_data = {
        "text": "A long text about renewable energy and new findings ..."
    }
    summary = summarizer.call_method("summarize_text", text_data)
    print("Summary of the text:")
    print(summary)
Although this is a simplified illustration, it captures the essence of how a
client might make MCP calls to a server.
9.4 Hierarchical Ontology for Multimodal Systems
9.4.1 What is an Ontology?
In AI, an ontology is a structured representation of concepts and their re-
lationships. Traditional ontologies can be complex, involving formal logic,
taxonomies, and metadata. For many practical AI applications, we want
something slightly simpler but still beneficial: a hierarchical structure that
helps us categorize data, tasks, and domain concepts.
Animal
    Mammal
        Dog
        Cat
    Bird
        Eagle
        Sparrow
If a user provides an image of a cat, the system can store it under “Mam-
mal → Cat.” If a user writes a sentence about dogs, that text is stored under
“Mammal → Dog.” By doing so, we can unify or cross-reference them if the
user later asks for “all mammal data” in the system.
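One lightweight way to implement such a hierarchy is to key stored items by their ontology path, so that a query at any level of the tree matches everything beneath it. The paths and items in the sketch below are only illustrative:

from collections import defaultdict

store = defaultdict(list)
store["Animal/Mammal/Cat"].append({"type": "image", "file": "cat.jpg"})
store["Animal/Mammal/Dog"].append({"type": "text", "content": "Dogs are loyal."})

def query(prefix):
    # Return every item whose ontology path starts with the given prefix
    return [item for path, items in store.items()
            if path.startswith(prefix) for item in items]

print(query("Animal/Mammal"))  # matches both the cat image and the dog sentence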
dimensions, or source.
• AudioSet. Google’s AudioSet classifies hundreds of audio events and
arranges them in a tree. Examples include broad classes like “Music”
with subcategories for “Guitar” or “Piano,” and “Animal sounds” with
subcategories for “Dog bark,” “Bird song,” and others.
• The context or “world state” is a place to store memory or relevant
information about the ongoing session or environment.
• If the user says, “Retrieve the weather in Berlin next Tuesday,” the
model can output something like: {"name": "get_weather", "arguments":
{"location": "Berlin", "date": "2025-04-01"}}.
• The application code sees this JSON, calls the actual function get_weather
with the specified arguments, obtains the result, and feeds it back to
the model (this dispatch step is sketched below). The model then composes
the final response to the user based on that data.
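In the following sketch, get_weather is a hypothetical local tool and the JSON string stands in for the model's raw output:

import json

def get_weather(location, date):
    # Hypothetical tool; a real implementation would call a weather API
    return {"location": location, "date": date, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

model_output = '{"name": "get_weather", "arguments": {"location": "Berlin", "date": "2025-04-01"}}'
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # fed back to the model so it can phrase the final answer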
1. The user’s request enters the system. Possibly it includes text, images,
or other data.
2. The AI engine (where the LLM and orchestrator logic resides) checks if
the request can be answered directly. If not, the LLM identifies which
tool or service must be called.
3. The engine calls that service, possibly using MCP if the service is ex-
ternal, or function calling if it is local.
4. The tool returns the result, which is fed back into the LLM.
5. The LLM updates its internal reasoning or “chain of thought” and then
either calls another tool or produces a final answer.
6. That final answer is returned to the user, possibly along with logs of
what steps were taken.
engine = LLMEngine(
    language_model="SomeLLM",
    tools={
        "web_search": lambda query: "Fake search results for " + query,
        "calculator": lambda expr: eval(expr)
    }
)

user_query = "What is 52 * 2?"
final_answer = engine.handle_request(user_query)
print(final_answer)
In a real application, the LLM’s logic would produce something like, “I
need to call the calculator tool with the expression '52 * 2'.” The rest of the
system would interpret that as a request to run eval("52 * 2"), return 104,
feed it back to the model, and so forth.
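The engine class behind this snippet is not spelled out here; the following is only a minimal sketch of what such a class might look like, with a hard-coded stand-in for the LLM's tool-selection step:

class LLMEngine:
    def __init__(self, language_model, tools):
        self.language_model = language_model
        self.tools = tools

    def handle_request(self, query):
        # Stand-in for asking the LLM which tool to call; a real engine would
        # prompt the model and parse a structured tool-call from its output.
        if any(ch.isdigit() for ch in query):
            expression = query.rstrip("?").split("is")[-1].strip()
            result = self.tools["calculator"](expression)
        else:
            result = self.tools["web_search"](query)
        # The tool result would normally be fed back to the model; here we
        # simply format it as the final answer.
        return f"The answer is {result}."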
higher-level engine can orchestrate these agents, passing intermediate results
between them as needed. Each agent might also have its own set of tools or
ontological knowledge relevant to its domain.
This setup resembles multi-agent systems in robotics or simulations. The
key difference is that these agents exchange text or data rather than physi-
cally interacting, although one could integrate real-world sensors or actuators
if desired.
9.6.4 Controlling Context Size
LLMs have context window limits. If your AI agent constantly accumulates
new information, you may run into token length constraints. Techniques
like retrieval augmentation, chunking, or summarization can help manage
this. Ensuring the most relevant details are available, while discarding or
summarizing older context, is often necessary in long-running systems.
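As an illustration, a minimal sketch of budget-based trimming follows; word counts stand in for real token counts, and in practice older messages could be summarized rather than dropped outright:

def trim_context(messages, max_tokens=3000):
    # Keep the most recent messages that fit the token budget
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg.split())  # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))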
• The LLM’s internal reasoning step or plan (if accessible through chain-
of-thought or a partial logging approach).
These logs help you quickly diagnose which step might have failed or
which tool returned unexpected data.
• LLMs as AI Engines. A perspective that treats large language mod-
els as central orchestrators that can call tools, manage states, and co-
operate with other models or agents. This design parallels how game
engines coordinate various subsystems.
Putting these ideas together results in systems that are significantly more
capable than a simple stand-alone AI model. The object-oriented paradigm
provides boundaries, clarity, and extensibility. Developers can add new mod-
ules (such as a speech recognizer, an image classifier, or a robotics controller)
without overhauling the entire architecture. Each module is treated as an
object or component, with the LLM deciding how and when to use it.
Future directions in this domain may include:
Chapter 10
10.1 Introduction
The term metaverse is commonly used to describe a network of immersive
virtual environments that blend digital and physical realities. It promises
opportunities for interactive experiences and social connectivity beyond tra-
ditional screens. The arrival of advanced artificial intelligence (AI) and aug-
mented reality (AR) technologies means these virtual worlds are not only
visually compelling, but also deeply intelligent and context-aware. This
chapter explores three key themes central to the convergence of AI and the
metaverse: digital twin-based physical AI, egocentric multimodal AI agents
for AR glasses, and decentralized GPU clusters for energy-efficient training
and inference.
Digital twins are virtual representations of physical assets or systems that
continuously synchronize with real-world data. When enhanced by AI, digital
twins become powerful tools for monitoring, predicting, and optimizing the
behavior of their real-world counterparts. Egocentric multimodal AI agents
refer to intelligent assistants embedded in AR glasses or similar wearable de-
vices. These agents perceive the world from the user’s point of view and help
facilitate natural, context-driven interactions. Decentralized GPU clusters
address the growing demand for massive compute power by distributing AI
workloads across many nodes, often located near the data source, to achieve
efficiency and privacy gains.
After reading, one should understand how digital twins simulate real-
world environments in virtual spaces, how AR glasses can benefit from AI
that sees and hears from a user’s perspective, and why decentralization of
GPU resources might be essential for the next phase of AI-driven immersive
applications.
changes. These benefits increase efficiency and unlock the potential of contin-
uous optimization in many industries, including manufacturing, healthcare,
and energy.
including what objects the user looks at, what environment they are in, and
what they are trying to do.
When a user asks, “Where did I put my keys?” the AI can look back
at first-person video (if recording is enabled and privacy settings allow it)
to find the moment the user placed the keys somewhere. If the user says,
“What is that painting?” the AI can analyze the camera feed of whatever
the user is currently viewing. This new perspective drastically changes how
AI can respond to everyday tasks and information queries.
10.3.3 AI Models for Egocentric Intelligence
Bringing this to life involves various AI components:
Computer Vision. Deep learning models recognize objects, faces, and
text from first-person footage. They must be efficient enough to run in real
time. This often involves specialized hardware or partial cloud offloading.
Natural Language Understanding. The device interprets user voice
commands and dialogues, possibly aided by large language models. It can
combine textual understanding with visual context to answer queries such
as, “Does this snack contain peanuts?” while analyzing the label the wearer
is looking at.
Contextual Reasoning. The system maintains an internal represen-
tation of the user’s environment, tasks, and recent actions. If the user was
assembling furniture, the AI might recall the user’s last step and propose the
next step, placing holographic instructions in the AR view.
Interaction Design. AR glasses must balance helpfulness with minimal
intrusion. The system decides when to proactively offer information (for
safety or assistance) or wait for a prompt. The user typically interacts via
speech, gaze, or gestures instead of mouse and keyboard.
These techniques are under development by major companies like Meta
(formerly Facebook), Microsoft, Apple, and various startups. Their shared
objective is to make AR wearables into natural extensions of human capa-
bility rather than mere displays.
Others. Magic Leap focuses on enterprise solutions. Snap Spectacles
experiment with first-person video for social media. Google has shown pro-
totypes of real-time translation glasses. Startups explore specialized uses like
sports training, medical guidance, and more.
directly. Only relevant summaries or model updates need to go back to any
central server, reducing network loads.
Better Resource Utilization. Many GPUs sit idle when gamers are
not using them or when research labs have off-peak hours. A decentralized
approach could turn these idle resources into a global AI compute pool. This
also helps smaller organizations collaborate to match large-scale compute
capabilities.
Data Privacy and Compliance. With federated learning, data can
remain on local nodes. Users or companies contribute model updates
rather than raw data (a minimal sketch of this idea follows below). This
aligns with regulations that mandate data residency in particular regions.
Resilience. If one node or data center fails, others can continue working.
This distributed structure avoids single points of failure.
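The toy sketch below averages weight updates from several nodes in the style of federated averaging; only these updates, never the raw training data, leave the nodes. The numbers are illustrative:

import numpy as np

def federated_average(local_updates):
    # Each entry is one node's locally trained weight vector
    return np.mean(np.stack(local_updates), axis=0)

node_updates = [np.array([0.9, 1.1]), np.array([1.1, 0.9]), np.array([1.0, 1.0])]
print(federated_average(node_updates))  # [1. 1.]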
10.4.5 Case Studies
Telecom operators are exploring decentralized inference for real-time analyt-
ics. A phone user running an AR application can offload some computations
to a local edge server. This reduces network traffic and speeds up responses.
Blockchain-based AI marketplaces, such as SingularityNET or DeepBrain
Chain, propose token-based incentives for individuals to rent out GPU re-
sources. Healthcare federated learning initiatives let hospitals create robust
AI models for disease detection while respecting patient privacy. Projects
combining these approaches show the potential for a new wave of distributed
AI research.
10.6 Conclusion
This chapter introduced three pillars that connect AI research with the meta-
verse. First, digital twin-based physical AI leverages virtual replicas to help
monitor and optimize real-world systems. Second, egocentric multimodal AI
agents bring personal context and assistance to AR glasses. Third, decentral-
ized GPU clusters address the growing computational demands of AI while
reducing latency and respecting data privacy constraints.
An undergraduate or new learner in AI should now appreciate how these
areas come together to shape next-generation platforms. Digital twins extend
our ability to study and improve the physical realm. Egocentric AI fosters
more natural human-computer interaction through a first-person perspective.
Decentralized compute enables scalable and efficient AI services by sharing
the load among many nodes. Collectively, these trends point toward a future
in which physical and digital worlds merge seamlessly, aided by intelligent
infrastructure that is globally distributed yet intimately personal.
Appendix A
How to Code
A.1 Introduction
Learning to code involves understanding fundamental concepts that guide
how we structure, organize, and collaborate on software projects. In par-
ticular, Object-Oriented Programming (OOP) helps us break down complex
problems into manageable parts, and version control systems like Git enable
teams (and individuals) to work together smoothly. Let us walk through
these ideas step by step.
Example Imagine a “Car” class specifying attributes like color and speed,
and methods such as accelerate() or brake(). Any individual car on the
road is an object, created using this blueprint.
A.3 Data Abstraction and Encapsulation
Two foundational OOP concepts that help keep code organized and secure
are data abstraction and encapsulation.
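A small sketch of encapsulation: the account balance is hidden behind methods, so every change to it can be validated in one place. The class and names are illustrative:

class BankAccount:
    def __init__(self, balance=0):
        self._balance = balance  # internal detail, accessed only via methods

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self._balance += amount

    def get_balance(self):
        return self._balance

account = BankAccount()
account.deposit(100)
print(account.get_balance())  # 100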
A.4 Inheritance
Inheritance allows us to create a new class based on an existing class, so we
can reuse code rather than writing it from scratch.
A.5 Polymorphism
Polymorphism enables a function or method to behave differently depend-
ing on the context or the type of data it is handling. In simpler terms, one
function name can perform different tasks.
A.6 Benefits of Object-Oriented Programming
Object-oriented programming simplifies both the creation and maintenance
of software projects:
• It allows for more generic code that can handle various data without
rewriting.
2. Organizational GitHub Page: A shared space representing a com-
pany, research group, or community, which helps organize team-based
or collective projects.
• git fetch: Retrieves changes from the remote but does not merge them.
• git rebase: An advanced method of integrating changes by rewriting commit history.
A.8 Conclusion
Coding is more than simply writing instructions for a computer. By em-
bracing Object-Oriented Programming, you learn to break your code into
logical, secure, and reusable pieces. Understanding and using Git keeps your
projects organized and collaborative, whether you work alone or in a team.
Together, these concepts form a solid foundation for anyone looking to write
efficient, maintainable, and collaborative software.
Feel free to explore open-source projects like the minGPT repository for
practical examples of OOP in action, and set up a personal GitHub page to
share your own projects. With these skills and tools, you will be well on your
way to coding more effectively and confidently.
Appendix B
Exercise 1: Git
B.1 Introduction
B.1.1 What is Version Control?
• Definition: A system that records changes to files, enabling you to
recall specific versions later.
• Benefits:
• Prevents the confusion of multiple file versions like project_final_v2_backup.
• Created by Linus Torvalds in 2005 to manage the Linux kernel.
• Linux: Use your package manager, e.g., sudo apt install git.
Check installation:
git --version
B.2.2 Configuration
• Set your username and email so that each commit is correctly at-
tributed:
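For example:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"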
• Optionally configure a default editor:
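For instance, to use nano:

git config --global core.editor "nano"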
git init
• git status checks which files are tracked/untracked and shows repos-
itory status.
Working with Remotes
• Remote: A copy of the repository hosted on another server (e.g.,
GitHub).
• git remote add origin <URL> connects your local repo to a remote.
• git push uploads local commits to the remote; git pull retrieves and
merges any changes from it.
Cloning a Repository
• Clone: Download an existing remote repository locally.
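For example:

git clone <URL>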
Exercise:
1. Create a new GitHub repository.
2. Clone it locally.
3. Add and commit a file, then push to see the changes on GitHub.
B.4.3 Merging Branches
git checkout main
git merge feature1
If both branches changed the same lines, Git pauses the merge and marks
the conflict in the affected file:
<<<<<<< HEAD
(your changes)
=======
(other changes)
>>>>>>> feature1
• Decide what to keep or combine, then stage and commit the resolution.
Exercise:
1. Create and switch to a branch called experiment, make a commit.
• Typical Steps:
1. Push a branch to the remote repository.
2. On GitHub, open a PR by clicking “Compare & pull request.”
3. Describe your changes; reference issues if relevant.
4. Teammates review and comment.
5. Merge the PR into main.
B.5.2 Issues
• Used to track bugs, enhancements, or discussions.
• Can be referenced in commits (Fixes #3) to close them automatically
upon merging.
• Can include labels, milestones, assignees for organizing project tasks.
• Useful for avoiding merge commits but rewrites commit history, so use
carefully.
2. Initial Commit
Create and commit a README.md, then push to main.
4. Issue and Fix
Open an issue (e.g., “Typo in README”), create a fix branch refer-
encing the issue, merge, and confirm the issue is closed.
5. Rebase Practice
Simulate two branches diverging, then rebase one onto main. Resolve
conflicts if they occur.
6. Reflection
Write a short note (200–300 words) on your experience, what you
learned, and any challenges faced.
B.8 Conclusion
By completing these exercises, you will have practiced the core Git workflow:
configuring Git, committing, branching, merging, opening pull requests, and
resolving conflicts.
Over time, practice will help you master more advanced features and work-
flows. Good luck, and enjoy exploring version control with Git!
Appendix C
C.1 Introduction
Python is a high-level, interpreted programming language known for its read-
ability and versatility. It is widely used in various fields such as web devel-
opment, data science, machine learning, scripting, and more. Python’s large
standard library and active community make it a powerful tool for both
beginners and experienced developers.
Key features of Python include:
Best Practices
• Practice coding daily to build familiarity with the syntax.
• Follow PEP 8 style guidelines (indent using four spaces, limit line
length, etc.).
• NoneType (None)
Example:
x = 10             # int
y = 3.5            # float
name = "Alice"     # str
is_student = True  # bool

print(type(x))     # <class 'int'>
print(type(name))  # <class 'str'>

# Type conversion (casting):
z = int("123")     # 123 (int)
w = str(5)         # "5" (str)
Best Practices
• Use meaningful variable names that reflect the stored value.
• Be consistent with naming and avoid using reserved keywords (e.g., if,
for, and).
Example:
x = 5
print(x > 0 and x < 10)  # True
print(not (x > 0))       # False
Best Practices
• Use parentheses for clarity in complex expressions.
C.4.1 if-elif-else
x = 7
if x > 0:
    print("x is positive")
elif x == 0:
    print("x is zero")
else:
    print("x is negative")
    count -= 1
print("Blast off!")
Best Practices
• Indent consistently using 4 spaces.
C.5.1 Functions
def greet(name):
    """Return a greeting message."""
    return f"Hello, {name}!"

print(greet("Alice"))  # Hello, Alice!
Functions can have default parameter values:
def power(base, exponent=2):
    return base ** exponent

print(power(5))     # 25
print(power(5, 3))  # 125
Best Practices
• Keep functions focused on a single task.
C.6.1 Lists
Ordered, mutable collections:
1 fruits = ["apple", "banana", "cherry"]
2 fruits[0] = "avocado"
3 fruits.append("date")
Common methods include append(), remove(), pop(), sort(), etc.
C.6.2 Tuples
Ordered, immutable collections:
point = (3, 4)
# point[0] = 5  # Error (immutable)
C.6.3 Dictionaries
Mutable key-value mappings (ordered by insertion since Python 3.7):
1 student = {"name": "Alice", "age": 25}
2 student["age"] = 26
3 student["grade"] = "A"
C.6.4 Sets
Unordered collections of unique items:
nums = {1, 2, 3, 2, 1}
print(nums)  # {1, 2, 3}
nums.add(4)
Best Practices
• Use lists for ordered, mutable data; tuples for immutable sequences;
dictionaries for key-value mappings; sets for unique elements.
C.7 File Handling Basics
Python handles file I/O through the open() function. Common modes are
"r" (read), "w" (write), and "a" (append). Use with to ensure proper clo-
sure.
1 with open("example.txt", "w") as f:
2 f.write("Hello, file!\n")
3
4 with open("example.txt", "r") as f:
5 content = f.read()
6 print(content)
Best Practices
• Use with open(...) to automatically close files.
dog2 = Dog("Max", 5)

dog1.bark()  # Buddy says Woof!
Best Practices
• Name classes in PascalCase (e.g., Dog, Student).
Best Practices
• Catch specific exceptions (e.g., ValueError, ZeroDivisionError).
C.10 Homework Assignment: Contact Book
Project
Task: Build a small Contact Book application that allows a user to:
• Add a new contact (name and phone number).
• View all contacts.
• Search for a contact by name.
• Save contacts to a file (and optionally load existing contacts on startup).
Requirements:
• Use a dictionary or similar structure to store contacts, with names as
keys.
• Write at least one function to avoid code repetition.
• Use file handling to save or load contact data.
• (Optional) Use a simple class to represent a Contact.
• Use try/except to handle errors (e.g., file not found).
Implementation Hints:
1. Present a menu in a loop: 1) Add, 2) View, 3) Search, 4) Quit.
2. For “Add Contact”, prompt for name and phone number, and store
them.
3. For “View Contacts”, list them all, or report if none.
4. For “Search Contact”, prompt for a name, then find it in the dictionary.
5. Use with open(..., "w") to save contacts to a text file before quit-
ting.
6. Optionally, load existing contacts at program start if the file exists.
This exercise reinforces core Python concepts (control flow, data structures,
file I/O, OOP, exception handling) and practices your skills in structuring
and sharing a project.
Appendix D
D.1 Introduction
Multimodal AI applications often require handling multiple data types (e.g.,
text, images, audio, sensor data). Preparing these data sources involves
various tasks such as reading files, cleaning noise, normalizing formats, and
storing data in efficient structures. Proper data ingestion and preprocessing
are critical steps that can greatly influence model performance and reliability.
This exercise provides an overview of:
• Files (e.g., CSV, JSON, text documents, image folders, WAV files)
• Streaming or real-time sensors (e.g., camera feeds, microphones, wear-
able devices)
1. Identify Sources: Determine all possible input formats and data lo-
cations.
D.3.1 Reading and Cleaning Text
To illustrate, assume you have a large text file (e.g., dataset.txt):
import string

def load_text_file(filepath):
    lines = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Remove punctuation (simple approach)
            line = line.translate(str.maketrans('', '', string.punctuation))
            # Convert to lower case
            line = line.lower()
            if line:
                lines.append(line)
    return lines

text_data = load_text_file("dataset.txt")
print(len(text_data), "lines loaded.")
Here, we:
• strip() each line to remove leading and trailing whitespace
• remove punctuation with str.maketrans() and translate()
• convert each line to lower case and skip empty lines
D.3.2 Tokenization
After cleaning, the next step is splitting into tokens:
def tokenize(text):
    return text.split()  # simplest form of tokenization

all_tokens = []
for line in text_data:
    tokens = tokenize(line)
    all_tokens.append(tokens)

print(all_tokens[0])  # example of first line's token list
Advanced tokenizers (e.g., for subwords, Byte Pair Encoding) can be used
for sophisticated NLP pipelines, but even simple token splits provide a foun-
dation for basic text tasks.
D.4.2 Preprocessing and Normalization
Common steps:
• Normalize pixel values, e.g., scale to [0,1] or subtract mean and divide
by standard deviation (common in pretrained models).
import numpy as np

def preprocess_image(img, desired_size=(224, 224)):
    # Resize
    img_resized = img.resize(desired_size)
    # Convert to numpy array
    arr = np.array(img_resized, dtype=np.float32)
    # Scale to [0,1]
    arr /= 255.0
    # (Optional) if using a model that needs mean/std normalization:
    # arr = (arr - mean) / std
    return arr
If you use frameworks like PyTorch, you can leverage torchvision.transforms
(e.g., transforms.Resize, transforms.ToTensor()) to automate these steps.
import librosa
import numpy as np

def load_audio_file(file_path, sr=16000):
    # sr is the target sampling rate
    audio, sample_rate = librosa.load(file_path, sr=sr)
    return audio, sample_rate

audio_data, sr = load_audio_file("example.wav")
print(f"Loaded audio with {len(audio_data)} samples at {sr} Hz")
D.7 Common Best Practices in Data Ingestion
• Check data integrity: Ensure paths are valid and files are not cor-
rupt.
    for item in metadata:
        img_path = item["image_file"]
        img = Image.open(img_path).convert("RGB")
        arr = preprocess_image(img)
        image_records.append(arr)

    # Step 4: Process audio
    audio_records = []
    for item in metadata:
        audio_path = item["audio_file"]
        audio, sr = load_audio_file(audio_path, sr=16000)
        mfcc = extract_mfcc(audio, sr)
        audio_records.append(mfcc)

    # Step 5: Save or return preprocessed data
    # e.g., save to .npy or custom format
    # ...
    return text_records, image_records, audio_records

text_data, img_data, audio_data = pipeline_run()
print("Text, image, and audio data processed.")
This skeleton helps keep each stage of ingestion explicit and organized.
Appendix E
Exercise 4: Building a
Multimodal LLM with Groq
E.1 Introduction
This exercise guides you in using the Groq API to build a simple multimodal
large language model pipeline. The pipeline combines:
• LLM inference: Feed the extracted text into a large language model
for additional reasoning or summarization.
The project is inference only: we will not train or fine-tune the model.
Instead, we’ll tap into a pre-existing LLM or relevant sub-models, running
them on Groq hardware/resources with a free tier account.
E.2 Objectives
1. Familiarize yourself with Groq’s free-tier environment.
2. Understand how to set up and query Groq endpoints for image and
speech-based tasks.
3. Chain the outputs from speech or image modules into an LLM for
advanced text-based responses.
E.3 Prerequisites
• Python 3.7+ or a version compatible with your chosen Groq SDK/clients.
5. API Security: Store and load your credentials safely (e.g., environ-
ment variables, do not upload keys to GitHub).
E.5 Step 1: Environment Setup
E.5.1 Python Dependencies
A sample requirements.txt might include:
groq-api==<your_version>
requests
python-dotenv
Pillow
librosa
# ... any others you need ...
E.7 Step 3: Image Understanding Endpoint
Some multimodal pipelines rely on an image captioning model or vision trans-
former that outputs a textual summary:
1. Load & preprocess the image in Python (optional resizing or format
conversion).
2. Send it to Groq’s image endpoint with your API key in the header
or query param.
3. Receive textual output describing the image.
Pseudocode:
import requests
from PIL import Image

def generate_image_caption(img_path, api_key):
    url = "https://api.groq.com/v1/image-caption"
    headers = {"Authorization": f"Bearer {api_key}"}

    # Convert image to bytes
    with open(img_path, "rb") as f:
        img_data = f.read()

    # Possibly a multipart/form-data request
    files = {"file": (img_path, img_data, "image/png")}

    response = requests.post(url, headers=headers, files=files)
    if response.status_code == 200:
        return response.json()["caption"]
    else:
        raise ValueError(f"Error: {response.text}")
Notes
• Actual endpoint and request format may vary depending on Groq’s
APIs. Check documentation.
• If the endpoint returns features (embeddings) instead of text, you might
forward those to your LLM in the next step.
E.8 Step 4: Speech-to-Text Module
If Groq provides a speech recognition endpoint:
Example:
def speech_to_text(audio_path, api_key):
    url = "https://api.groq.com/v1/speech2text"
    headers = {"Authorization": f"Bearer {api_key}"}

    with open(audio_path, "rb") as f:
        audio_data = f.read()
    files = {"file": (audio_path, audio_data, "audio/wav")}

    response = requests.post(url, headers=headers, files=files)
    if response.status_code == 200:
        return response.json()["transcript"]
    else:
        raise ValueError(f"Error: {response.text}")
Example:
def query_groq_llm(prompt, api_key):
    url = "https://api.groq.com/v1/llm"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"prompt": prompt, "max_tokens": 100}

    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json()["text"]
    else:
        raise ValueError(f"Error: {response.text}")
Then you might chain them:
# 1) Get image caption
img_caption = generate_image_caption("image.jpg", GROQ_API_KEY)

# 2) Convert audio to text
speech_text = speech_to_text("audio.wav", GROQ_API_KEY)

# 3) Combine for final LLM request
combined_prompt = (f"The user described an image as '{img_caption}' "
                   f"and said '{speech_text}'. Provide a response:")
llm_response = query_groq_llm(combined_prompt, GROQ_API_KEY)
print("LLM says:", llm_response)
• Rate Limits: Groq may impose a rate limit; handle HTTP 429 re-
sponses by waiting or retrying.
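A simple way to handle such 429 responses is exponential backoff; the helper below is only a sketch and wraps the same requests calls used earlier:

import time
import requests

def post_with_retry(url, max_retries=5, **kwargs):
    # Retry with exponential backoff while the service answers 429 (Too Many Requests)
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)
    return response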
3. Generate intermediate text from image or audio.
Security Reminder
• Always list .env in your .gitignore so your credentials are never committed.
• If you deploy or share your project, keep the key in a secure location or
use environment variables in your production environment (e.g., Docker
secrets).
E.12 Conclusion
By completing this exercise, you gain experience:
• Handling security best practices for API keys, especially when using a
public repository.
Appendix F
Exercise 5: Multimodal
Chatbot with Gradio
F.1 Introduction
This final exercise brings together the previous lessons and demonstrates how
to build a fully functional, multimodal chatbot interface using Gradio.
Throughout earlier modules, you learned:
• Data ingestion & preprocessing: Techniques for reading, cleaning,
and preparing text, images, and audio.
• Groq-based inference: Leveraging accelerated hardware or endpoints
for tasks like image captioning, speech-to-text, and large language
model (LLM) queries.
• API key handling: Avoiding the exposure of credentials by storing
them in .env files or environment variables.
In this project, you will integrate these components into one seamless
user experience, where the chatbot can:
1. Accept textual queries
2. Process uploaded images (via image captioning or other vision tasks)
3. Convert audio to text (speech-to-text)
4. Combine all inputs into a coherent conversation with an LLM
By adding Gradio, you provide a simple and interactive web interface for
users, allowing them to upload files, speak directly, and receive immediate
responses. This real-time demonstration underscores the power of multi-
modal AI and helps you practice essential skills for modern AI develop-
ment—particularly bridging data preprocessing and model inference with
accessible UI components.
F.2 Objectives
By completing this exercise, students will be able to:
F.3 Prerequisites
• Python 3.7+ (or compatible with Gradio and Groq libraries)
multimodal_chatbot/
|- .env
|- .gitignore
|- main.py # Main Gradio script
|- modules/
| |- groq_inference.py
| |- data_utils.py
| |- ...
|- requirements.txt
|- README.md
The exact organization depends on your earlier exercises, but keep separate
files for inference logic, data utilities, and the main Gradio app. Ensure .env
is listed in your .gitignore file to prevent committing credentials.
F.6 Step 2: Creating the Multimodal Processing Function
Assume you have helper functions from previous lessons:
• query_groq_llm(prompt, api_key): returns a text response from the LLM
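Since the combining function is imported by the Gradio script below, here is a minimal sketch of what multimodal_chat in modules/groq_inference.py might look like, assuming the Appendix E helpers generate_image_caption, speech_to_text, and query_groq_llm are defined in the same module:

def multimodal_chat(text_input, image_path, audio_path, api_key):
    # Build one prompt from whichever inputs the user actually provided
    parts = []
    if image_path:
        parts.append(f"The image shows: {generate_image_caption(image_path, api_key)}")
    if audio_path:
        parts.append(f"The user said: {speech_to_text(audio_path, api_key)}")
    if text_input:
        parts.append(text_input)
    prompt = "\n".join(parts) or "Hello!"
    return query_groq_llm(prompt, api_key)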
F.7 Step 3: Building the Gradio Interface
In main.py, you can create a Gradio demo that collects three inputs: text,
image, and audio. For instance:
import gradio as gr
import os
from dotenv import load_dotenv
from modules.groq_inference import multimodal_chat

load_dotenv()
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

def gradio_multimodal_interface(text_input, image_input, audio_input):
    # image_input and audio_input will be file paths if you set type="filepath"
    return multimodal_chat(text_input, image_input, audio_input, GROQ_API_KEY)

with gr.Blocks() as demo:
    gr.Markdown("# Multimodal Chatbot using Groq + Gradio")

    text_box = gr.Textbox(label="Enter your text prompt")
    image_uploader = gr.Image(type="filepath", label="Upload an image (optional)")
    audio_recorder = gr.Audio(type="filepath", label="Record or upload audio (optional)")

    chat_button = gr.Button("Send")
    output_box = gr.Textbox(label="Chatbot Response")

    chat_button.click(
        fn=gradio_multimodal_interface,
        inputs=[text_box, image_uploader, audio_recorder],
        outputs=output_box
    )

if __name__ == "__main__":
    demo.launch()
Usage
• Run python main.py
• Implement a session-based approach in Gradio using State to track
multi-turn dialogues.
F.12 Conclusion
By integrating Gradio with your Groq-based inference functions, you create
an accessible, user-friendly multimodal chatbot that serves as a capstone
demonstration of everything learned throughout the course. This hands-on
experience combines data preprocessing, model orchestration, and interac-
tive user interfaces, providing the foundational skills for real-world AI appli-
cations.
Bibliography
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer nor-
malization. arXiv preprint arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural ma-
chine translation by jointly learning to align and translate. In 3rd In-
ternational Conference on Learning Representations (ICLR), 2015.
[3] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jack-
son Kernion, Andy Jones, et al. Constitutional ai: Harmlessness from
ai feedback. arXiv preprint arXiv:2212.08073, 2022.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning.
Springer, New York, 2006.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Ka-
plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, et al. Language models are few-shot learn-
ers. Advances in Neural Information Processing Systems (NeurIPS),
33:1877–1901, 2020.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, et al. Language models are few-shot learn-
ers. Advances in Neural Information Processing Systems, 33:1877–1901,
2020.
[7] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, et al. Language models are few-shot learners.
In Advances in Neural Information Processing Systems (NeurIPS), vol-
ume 33, 2020.
[8] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan,
Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze,
Carlos Guestrin, and Arvind Krishnamurthy. Tvm: An automated end-
to-end optimizing compiler for deep learning. In 13th USENIX Sym-
posium on Operating Systems Design and Implementation (OSDI 18),
pages 578–594. USENIX Association, 2018.
[9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hin-
ton. A simple framework for contrastive learning of visual representa-
tions. In International Conference on Machine Learning (ICML), pages
1597–1607, 2020.
[10] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using rnn encoder-decoder for statistical machine
translation. In Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734, 2014.
[11] Yunjey Choi, Minje Choi, Munyoung Kim, et al. Stargan: Unified gen-
erative adversarial networks for multi-domain image-to-image transla-
tion. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 8789–8797, 2018.
[12] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg,
and Dario Amodei. Deep reinforcement learning from human prefer-
ences. In Advances in Neural Information Processing Systems (NIPS),
volume 30, 2017.
[13] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for
human detection. In Proceedings of the 2005 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages 886–893, 2005.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language un-
derstanding. arXiv preprint arXiv:1810.04805, 2018.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language un-
derstanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
[16] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density esti-
mation using real nvp. arXiv:1605.08803, 2017.
[19] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classifi-
cation (2nd edition). Wiley-Interscience, 2001.
[21] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis:
Finding sparse, trainable neural networks. In International Conference
on Learning Representations (ICLR), 2019.
[22] Philip Gage. A new algorithm for data compression. C Users Journal,
12(2):23–38, 1994.
[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016.
[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, Cambridge, MA, 2016.
[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016.
[27] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish
Narayanan. Deep learning with limited numerical precision. In Proceed-
ings of the 32nd International Conference on Machine Learning (ICML),
pages 1737–1746. PMLR, 2015.
[30] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both
weights and connections for efficient neural networks. In Advances in
Neural Information Processing Systems (NIPS), pages 1135–1143, 2015.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep resid-
ual learning for image recognition. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), pages
770–778. IEEE, 2016.
[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep resid-
ual learning for image recognition. In Conference on Computer Vision
and Pattern Recognition (CVPR), pages 770–778, 2016.
[33] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge
in a neural network. NIPS Deep Learning and Representation Learning
Workshop, 2015.
[34] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge
in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[36] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion proba-
bilistic models. In Advances in Neural Information Processing Systems
33, pages 6840–6851, 2020.
[37] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al.
Training compute-optimal large language models. arXiv preprint
arXiv:2203.15556 (DeepMind), 2022.
[40] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi
Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adapta-
tion of large language models. arXiv preprint arXiv:2106.09685, 2021.
[41] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Wein-
berger. Densely connected convolutional networks. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), pages 2261–
2269, 2017.
[43] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-
image translation with conditional adversarial networks. In IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), pages
5967–5976, 2017.
[45] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
An Introduction to Statistical Learning. Springer, 2013.
[47] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gau-
rav Agrawal, Raminder Bajwa, and et al. In-datacenter performance
analysis of a tensor processing unit. In Proceedings of the 44th Annual
International Symposium on Computer Architecture (ISCA), pages 1–
12, 2017.
[49] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning
in robotics: A survey. The International Journal of Robotics Research,
32(11):1238–1274, 2013.
[52] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi-
fication with deep convolutional neural networks. In Advances in Neural
Information Processing Systems (NIPS), pages 1097–1105, 2012.
[53] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Na-
ture, 521(7553):436–444, 2015.
[54] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.
[55] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278–2324, 1998.
[57] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi
Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoy-
anov. Roberta: A robustly optimized bert pretraining approach. arXiv
preprint arXiv:1907.11692, 2019.
[59] Jonathan Masci, Ueli Meier, Dan Ciresan, and J Schmidhuber. Stacked
convolutional auto-encoders for hierarchical feature extraction. In In-
ternational Conference on Artificial Neural Networks (ICANN), pages
52–59, 2011.
[63] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama,
Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller,
Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan
Leike, and Ryan Lowe. Training language models to follow instructions
with human feedback. arXiv preprint arXiv:2203.02155, 2022.
[64] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel
Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin,
Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transfer-
able visual models from natural language supervision. arXiv preprint
arXiv:2103.00020, 2021.
[65] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised repre-
sentation learning with deep convolutional generative adversarial net-
works. In 4th International Conference on Learning Representations
(ICLR), 2016.
[66] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
Improving language understanding by generative pre-training. OpenAI
preprint, 2018.
[67] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
Improving language understanding by generative pre-training. Technical
report, OpenAI, 2018.
[68] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei,
and Ilya Sutskever. Language models are unsupervised multitask learn-
ers. Technical report, OpenAI, 2019.
[69] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He,
and Piotr Dollár. Designing network design spaces. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 10428–
10436, 2020.
[70] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christo-
pher D Manning, and Chelsea Finn. Direct preference optimization:
Your language model is secretly a reward model. arXiv preprint
arXiv:2305.18290, 2023.
[71] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He.
Zero: Memory optimization towards training a trillion parameter mod-
els. In International Conference for High Performance Computing, Net-
working, Storage and Analysis (SC), pages 1–16, 2020.
models. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10674–10685, 2022.
[75] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and
Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint
arXiv:1707.06347, 2017.
[76] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine
translation of rare words with subword units. In Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL), pages 1715–1725, 2016.
[78] David Silver, Aja Huang, and Chris J. Maddison et al. Mastering
the game of go with deep neural networks and tree search. Nature,
529(7587):484–489, 2016.
[79] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. In International Conference on
Learning Representations (ICLR), 2015.
[81] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion
implicit models. In International Conference on Learning Representa-
tions (ICLR), 2021.
[82] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: A simple way to prevent neural net-
works from overfitting. Journal of Machine Learning Research, 15:1929–
1958, 2014.
[83] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and
policy considerations for deep learning in nlp. In 57th Annual Meeting of
the Association for Computational Linguistics (ACL), pages 3645–3650,
2019.
[84] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence
learning with neural networks. In Advances in Neural Information Pro-
cessing Systems (NIPS), volume 27, 2014.
[86] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. Going deeper with convolutions. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
[87] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for
convolutional neural networks. In International Conference on Machine
Learning (ICML), pages 6105–6114, 2019.
[89] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural
discrete representation learning. In Advances in Neural Information
Processing Systems 30, 2017.
[90] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In Advances in Neural Information Processing Systems
(NIPS), pages 5998–6008, 2017.
[91] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In Advances in Neural Information Processing Systems
(NIPS), volume 30, 2017.
[92] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and
Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning
useful representations in a deep network with a local denoising criterion.
Journal of Machine Learning Research, 11:3371–3408, 2010.
[93] Paul Viola and Michael Jones. Rapid object detection using a boosted
cascade of simple features. In Proceedings of the 2001 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 511–518,
2001.
[95] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired
image-to-image translation using cycle-consistent adversarial networks.
In IEEE International Conference on Computer Vision (ICCV), pages
2223–2232, 2017.
[96] Barret Zoph and Quoc V Le. Neural architecture search with reinforce-
ment learning. In International Conference on Learning Representations
(ICLR), 2017.