The most complete, interview-focused ML/AI reference on GitHub.
Cracking interviews at Google DeepMind, OpenAI, Meta AI, Amazon, Microsoft.
| Section | Content |
|---|---|
| 🧠 ML Fundamentals | Bias-variance, overfitting, regularization, loss functions |
| 🔢 Math & Statistics | Linear algebra, probability, calculus for ML |
| 🤖 Deep Learning | CNNs, RNNs, transformers, attention, training tricks |
| 🗣️ NLP | BERT, GPT, RAG, embeddings, tokenization |
| 👁️ Computer Vision | YOLO, ResNet, image augmentation, segmentation |
| 🏗️ ML System Design | Recommendation systems, search, fraud detection |
| 🐍 Python & Libraries | NumPy, Pandas, scikit-learn, PyTorch one-liners |
| 🧩 Coding Patterns | Data preprocessing, model eval, cross-validation |
| 💼 Behavioral | STAR answers, research discussion, project walkthrough |
Q: Explain the bias-variance tradeoff.
Bias = error from wrong assumptions (underfitting — model too simple). Variance = error from sensitivity to training data fluctuations (overfitting — model too complex).
Total Error = Bias² + Variance + Irreducible Noise
| | High Bias | High Variance |
|---|---|---|
| Training error | High | Low |
| Test error | High | High |
| Fix | More features, complex model | Regularization, more data, dropout |
Interview tip: Draw the U-shaped test error curve. Explain that the goal is to find the sweet spot.
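A quick way to see the tradeoff is to fit polynomials of increasing degree to noisy data. The sketch below is illustrative only: the sine target, noise level, sample sizes, and degrees are assumptions chosen for the demo, not part of the original answer. Training error can only go down as the degree grows, while test error typically falls and then rises again.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed toy problem: y = sin(pi * x) + Gaussian noise
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)

def mse_for_degree(degree):
    # Least-squares polynomial fit on the training set only
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# degree 1: too simple (high bias); degree 10: flexible enough to chase noise
results = {d: mse_for_degree(d) for d in (1, 3, 10)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree={d:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```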
Q: What is regularization? Compare L1 vs L2.
Regularization adds a penalty to the loss function to prevent overfitting.
| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | λΣ\|wᵢ\| | λΣwᵢ² |
| Effect | Produces sparse weights (zeros out features) | Shrinks weights toward zero, keeps all |
| Use when | Feature selection needed | All features relevant |
| Gradient of penalty | λ·sign(w) | 2λw |
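The "sparse vs. shrink" row can be checked on a single weight. Assuming a simple quadratic loss 0.5·(v − w)² around an unregularized weight w (an assumption made for illustration), the penalized minimizer has a closed form in both cases: L1 gives soft-thresholding, which sets small weights exactly to zero, while L2 only rescales. The function names below are hypothetical.

```python
import numpy as np

def l1_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*|v| (soft-thresholding)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*v**2 (uniform rescaling)."""
    return w / (1.0 + 2.0 * lam)

w = np.array([3.0, -0.4, 0.05, -2.0])
lam = 0.5

w_l1 = l1_shrink(w, lam)  # weights with |w| <= lam become exactly zero
w_l2 = l2_shrink(w, lam)  # every weight is scaled by 1 / (1 + 2*lam)

print("L1:", w_l1, "non-zeros:", np.count_nonzero(w_l1))
print("L2:", w_l2, "non-zeros:", np.count_nonzero(w_l2))
```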
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso(alpha=0.1) # L1
ridge = Ridge(alpha=1.0) # L2
Q: How do you handle imbalanced datasets?
Resampling:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Oversample minority class
X_res, y_res = SMOTE().fit_resample(X, y)
# Undersample majority class
X_res, y_res = RandomUnderSampler().fit_resample(X, y)
Class weights:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced')
Metrics: Avoid accuracy. Use:
- Precision, Recall, F1-score
- ROC-AUC
- PR-AUC (better for severe imbalance)
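To see why accuracy misleads here, consider a made-up dataset with 1% positives and a "model" that always predicts the majority class. The numbers are assumptions picked for illustration.

```python
# Assumed toy labels: 10 positives (e.g. fraud) out of 1000 samples
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # degenerate model: always predict the majority class

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

# Accuracy looks excellent even though no positive is ever found
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```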
Q: Explain cross-validation. Why use k-fold?
K-fold CV splits data into k subsets. Train on k-1, test on 1. Repeat k times. Average the scores.
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
Why k-fold? Reduces variance in evaluation vs a single train/test split. Stratified k-fold preserves class distribution in each fold.
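If asked to implement the splitting by hand, something along these lines is usually enough. It is a minimal, unstratified sketch; `k_fold_indices` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle once up front
    folds = np.array_split(indices, k)     # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```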
Q: What is gradient descent? Compare SGD, Mini-batch, Adam.
Gradient descent minimizes loss by updating parameters in the direction of steepest descent:
θ = θ - α · ∇L(θ)
| | Batch GD | SGD | Mini-batch |
|---|---|---|---|
| Update per | Full dataset | 1 sample | Batch (32-256) |
| Speed | Slow | Fast | Fast |
| Noise | Low | High | Medium |
| Memory | High | Low | Medium |
Adam (Adaptive Moment Estimation) — most popular optimizer:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
Adam combines momentum + RMSProp. Adapts learning rate per parameter.
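Interviewers sometimes ask for the Adam update itself. The sketch below follows the standard update rule (first and second moment estimates with bias correction) on a toy quadratic. The objective, learning rate, and step count are assumptions for the demo, and `adam_minimize` is a hypothetical helper, not a PyTorch API.

```python
import numpy as np

def adam_minimize(grad_fn, theta0, lr=0.1, betas=(0.9, 0.999), eps=1e-8, steps=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # first moment: running mean of gradients (momentum)
    v = np.zeros_like(theta)  # second moment: running mean of squared gradients (RMSProp)
    b1, b2 = betas
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return theta

# Toy objective: L(theta) = ||theta - target||^2, so grad = 2 * (theta - target)
target = np.array([3.0, -2.0])
theta = adam_minimize(lambda th: 2 * (th - target), theta0=[0.0, 0.0])
print(theta)
```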
Q: What is the dot product and why does it matter in ML?
a · b = Σ aᵢbᵢ = |a||b|cos(θ)
In ML: measures similarity (cosine similarity in embeddings), is the core of every linear layer:
output = X @ W.T + b # Matrix multiplication = stacked dot products
Q: Explain PCA intuitively and mathematically.
PCA finds directions (principal components) of maximum variance in data.
Steps:
- Center data: X_centered = X - mean(X)
- Compute covariance matrix: C = (X_centered.T @ X_centered) / (n-1)
- Eigendecompose: C = V Λ Vᵀ
- Project onto the top-k eigenvectors: X_pca = X_centered @ V[:, :k]
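The steps above translate almost line for line into NumPy. This is a minimal sketch on synthetic data (the data and the `pca` function name are assumptions); note that `np.linalg.eigh` returns eigenvalues in ascending order, so they must be re-sorted before taking the top k.

```python
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)                       # 1. center
    C = (X_centered.T @ X_centered) / (X.shape[0] - 1)    # 2. covariance
    eigvals, eigvecs = np.linalg.eigh(C)                  # 3. eigendecompose (C is symmetric)
    order = np.argsort(eigvals)[::-1]                     # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return X_centered @ eigvecs[:, :k], eigvals           # 4. project

rng = np.random.default_rng(0)
# Synthetic data with very different variance per axis
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
X_pca, eigvals = pca(X, k=2)

print("explained variance ratio:", eigvals[:2] / eigvals.sum())
```

The variance of each projected column equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".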
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
Q: What is the difference between MLE and MAP?
MLE (Maximum Likelihood Estimation):
θ_MLE = argmax P(data | θ)
Finds parameters that make observed data most probable. No prior assumption.
MAP (Maximum A Posteriori):
θ_MAP = argmax P(θ | data) = argmax P(data | θ) · P(θ)
Incorporates prior belief about θ. MAP with Gaussian prior = L2 regularization. MAP with Laplace prior = L1 regularization.
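A coin-flip example makes the difference concrete. Assuming a Bernoulli likelihood and a Beta(a, b) prior (an illustrative choice, not something the answer above prescribes), the posterior is Beta(a + heads, b + tails) and the MAP estimate is its mode. With only three flips, all heads, MLE says the coin always lands heads, while the prior pulls MAP back toward 0.5.

```python
def mle_bernoulli(heads, n):
    # argmax of P(data | theta) for a Bernoulli likelihood
    return heads / n

def map_bernoulli(heads, n, a=2.0, b=2.0):
    """Mode of the Beta(a + heads, b + n - heads) posterior (requires a, b > 1)."""
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 3, 3  # assumed tiny sample: three flips, all heads
theta_mle = mle_bernoulli(heads, n)
theta_map = map_bernoulli(heads, n)

print(f"MLE={theta_mle:.2f}  MAP={theta_map:.2f}")
```

As n grows, the prior terms become negligible and the two estimates converge.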
Q: How does backpropagation work?
Backprop computes gradients of the loss with respect to all parameters using the chain rule.
Forward: x → [L1] → h → [L2] → ŷ → loss
Backward: ∂loss/∂W₂ → ∂loss/∂h → ∂loss/∂W₁
loss = criterion(output, target)
loss.backward() # Compute all gradients
optimizer.step() # Update weights: W = W - lr * W.grad
optimizer.zero_grad() # Clear for next batch
Chain rule: ∂L/∂W₁ = (∂L/∂ŷ) · (∂ŷ/∂h) · (∂h/∂W₁)
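To show the chain rule without autograd, here is a hand-written backward pass for a two-layer network, verified against a finite-difference estimate. The layer sizes, tanh activation, and MSE loss are assumptions chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # batch of 4 samples, 3 features
y = rng.normal(size=(4, 1))
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 1))

def forward(W1, W2):
    h = np.tanh(x @ W1)                       # hidden layer
    y_hat = h @ W2                            # linear output
    loss = 0.5 * np.mean((y_hat - y) ** 2)
    return h, y_hat, loss

h, y_hat, loss = forward(W1, W2)

# Backward pass: apply the chain rule from the loss back to W1
d_y_hat = (y_hat - y) / y.size                # dL/dy_hat
dW2 = h.T @ d_y_hat                           # dL/dW2
d_h = d_y_hat @ W2.T                          # dL/dh
dW1 = x.T @ (d_h * (1 - h ** 2))              # tanh'(z) = 1 - tanh(z)^2

# Central finite difference on a single entry of W1 as a sanity check
eps = 1e-5
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (forward(W_plus, W2)[2] - forward(W_minus, W2)[2]) / (2 * eps)

print(f"analytic={dW1[0, 0]:.8f}  numeric={numeric:.8f}")
```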
Q: Explain the vanishing gradient problem and solutions.
In deep networks, gradients shrink exponentially as they propagate backward through many sigmoid/tanh layers → early layers learn very slowly.
Solutions:
| Fix | How |
|---|---|
| ReLU activation | Gradient = 1 for positive inputs (no squashing) |
| Batch Normalization | Normalizes activations, keeps gradients stable |
| Residual connections | Gradient flows directly via skip connections |
| LSTM/GRU | Gating mechanisms preserve long-range gradients |
| Gradient clipping | torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) |
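A back-of-the-envelope calculation shows the scale of the problem. The sketch ignores the weight matrices and looks only at the activation derivatives, which is a simplifying assumption; the depth of 20 is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 20
z = 0.0  # sigmoid'(z) peaks at z = 0, so this is the best case for sigmoid

sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))  # at most 0.25
relu_grad = 1.0                               # ReLU derivative for positive inputs

# The backward signal is multiplied by one activation derivative per layer
sigmoid_chain = sigmoid_grad ** depth
relu_chain = relu_grad ** depth

print(f"sigmoid: {sigmoid_chain:.2e}   relu: {relu_chain:.1f}")
```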
Q: What is attention mechanism / self-attention?
Attention lets the model focus on relevant parts of the input for each output position.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
- Q (Query): what we're looking for
- K (Key): what each position offers
- V (Value): actual content to aggregate
# Scaled dot-product attention
import torch.nn.functional as F
def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ V
Self-attention: Q, K, and V are all projections of the same sequence, so each token can attend to every other token.
Q: Compare CNN vs RNN vs Transformer for sequence tasks.
| | CNN | RNN/LSTM | Transformer |
|---|---|---|---|
| Parallelizable | ✅ Yes | ❌ Sequential | ✅ Yes |
| Long-range deps | ❌ Fixed window | ⚠️ Limited (vanishing gradients) | ✅ Global attention |
| Memory | Low | Medium | High (O(n²)) |
| Best for | Local patterns, CV | Short sequences | NLP, long sequences |
| Speed | Fast | Slow | Fast (parallelized) |
Q: What is batch normalization? Why does it help?
BatchNorm normalizes activations within each mini-batch to have zero mean and unit variance, then applies learnable scale/shift:
# PyTorch
self.bn = nn.BatchNorm2d(num_channels)
# What it computes:
# μ = mean(x), σ² = var(x)
# x_norm = (x - μ) / √(σ² + ε)
# output = γ * x_norm + β (γ, β are learned)
Benefits: Reduces internal covariate shift, acts as regularization, allows higher learning rates, reduces sensitivity to weight initialization.
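The commented formulas can be run directly in NumPy. This sketch covers only the training-mode forward pass over a (batch, features) array; running statistics for inference and the per-channel averaging that `BatchNorm2d` does over spatial dimensions are left out. The input data is synthetic.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature (biased) variance
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=4.0, size=(64, 8))  # badly scaled activations
out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))

print("mean:", out.mean(axis=0).round(4))
print("var: ", out.var(axis=0).round(4))
```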
Q: How does BERT work? What makes it different from GPT?
BERT (Bidirectional Encoder Representations from Transformers):
- Encoder-only transformer
- Bidirectional: sees both left and right context simultaneously
- Pre-trained with: Masked Language Model (MLM) + Next Sentence Prediction (NSP)
- Best for: classification, NER, Q&A (understanding tasks)
GPT (Generative Pre-trained Transformer):
- Decoder-only transformer
- Unidirectional (causal): only sees left context
- Pre-trained with: Next token prediction
- Best for: text generation, summarization, chat
BERT: [CLS] The [MASK] sat on the mat [SEP] → predicts "cat"
GPT: The cat sat on → predicts "the"
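Mechanically, "bidirectional vs. causal" comes down to the attention mask. The sketch below uses uniform scores so the effect of the mask is easy to read; the sequence length and helper name are assumptions.

```python
import numpy as np

seq_len = 5
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)       # BERT-style: see everything
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))     # GPT-style: see self and the past

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)                 # masked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((seq_len, seq_len))  # uniform scores to isolate the mask's effect
bert_weights = masked_softmax(scores, bidirectional_mask)
gpt_weights = masked_softmax(scores, causal_mask)

print(bert_weights.round(2))  # every row spreads weight over all 5 tokens
print(gpt_weights.round(2))   # row i spreads weight over tokens 0..i only
```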
Q: What is RAG (Retrieval-Augmented Generation)?
RAG combines a retriever (finds relevant documents) with a generator (LLM) to answer questions grounded in external knowledge.
Query → [Embed] → Vector DB search → Top-k docs
↓
Prompt: "Using these docs: {docs}\nAnswer: {query}"
↓
LLM generates answer
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)
Why RAG? Overcomes LLM knowledge cutoff, reduces hallucination, keeps knowledge updatable without retraining.
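To explain the retrieve-then-prompt flow without any framework, a toy version fits in a few lines. Everything here is a stand-in: the bag-of-words "embedding" replaces a real embedding model, the list scan replaces a vector database, and the documents and function names are made up. No LLM is called; the sketch stops at the assembled prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "Adam is an optimizer that adapts the learning rate per parameter",
    "BERT is an encoder-only transformer trained with masked language modeling",
    "SMOTE oversamples the minority class",
]

def retrieve(query, docs, k=1):
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    context = "\n".join(retrieved)
    return f"Using these docs:\n{context}\nAnswer: {query}"

query = "How is BERT trained"
top = retrieve(query, docs, k=1)
prompt = build_prompt(query, top)
print(prompt)
```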
Q: Explain word embeddings. Word2Vec vs GloVe vs FastText.
Embeddings map words to dense vectors where similar words are close.
| | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Method | Neural (CBOW/Skip-gram) | Matrix factorization on co-occurrence | Word2Vec + subword n-grams |
| OOV handling | ❌ No | ❌ No | ✅ Yes (via n-grams) |
| Morphology | ❌ No | ❌ No | ✅ Yes |
| Best for | General NLP | General NLP | Morphologically rich languages |
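The OOV row is easier to explain with the subword decomposition in hand. FastText wraps each word in `<` and `>` boundary markers and breaks it into character n-grams; the sketch below shows only that decomposition. The hashing of n-grams into buckets and the summing of their vectors are omitted, and `char_ngrams` is a hypothetical helper, not a gensim function.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

ngrams = char_ngrams("where", n_min=3, n_max=3)
print(ngrams)

# An unseen word still shares n-grams with words seen during training
shared = set(char_ngrams("unlearnable")) & set(char_ngrams("learning"))
print(sorted(shared))
```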
from gensim.models import Word2Vec, FastText
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv['python'] # shape: (100,)
Q: How does YOLO work? What makes YOLOv8 better?
YOLO (You Only Look Once) divides image into S×S grid. Each cell predicts B bounding boxes + class probabilities in a single forward pass.
Input image (640×640)
↓
Backbone (feature extraction)
↓
Neck (FPN/PAN — multi-scale features)
↓
Head (predict boxes + classes for 3 scales)
↓
NMS (remove overlapping boxes)
↓
Final detections
YOLOv8 improvements over v5:
- Anchor-free detection (no pre-defined anchors)
- Decoupled head (separate classification and regression)
- C2f module replaces C3 (better gradient flow)
- New loss: Distribution Focal Loss for bounding box regression
- ~35% fewer parameters than YOLOv5 at same accuracy
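The NMS step in the pipeline above is a common whiteboard question. Below is a minimal greedy, class-agnostic version in NumPy; production detectors usually apply it per class and in vectorized form. Boxes are assumed to be in (x1, y1, x2, y2) format, and the sample boxes and scores are made up.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    # Intersection rectangle between one box and each box in `boxes`
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-8)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]  # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]  # drop boxes that overlap the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```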
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
results = model('image.jpg')
results[0].boxes # bboxes, confidences, classes
Q: What is transfer learning? When to freeze layers?
Transfer learning: use a model trained on large dataset (ImageNet) as starting point for your task.
Strategies:
| Scenario | Approach |
|---|---|
| Small dataset, similar domain | Freeze backbone, train head only |
| Small dataset, different domain | Freeze early layers, fine-tune later layers |
| Large dataset, any domain | Fine-tune entire network |
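# PyTorch: freeze backbone
import torch.nn as nn
import torchvision
# `pretrained=True` is deprecated in torchvision >= 0.13; pass `weights=` instead
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False # Freeze all
# Replace the head; a new layer has requires_grad=True by default
model.fc = nn.Linear(2048, num_classes) # Only fc trains
Q: Design a real-time fraud detection system.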
Transaction → [Feature Engineering] → [ML Model] → Decision
                       ↑                   ↓
                 Feature Store         Risk Score
                 (user history,        → Block / Flag / Pass
                  merchant profile,
                  velocity features)
Key components:
- Features: amount, merchant category, location delta, time of day, velocity (5 txns in 1 min?), device fingerprint
- Model: XGBoost / LightGBM (low latency), backed by deep learning for complex patterns
- Threshold: p(fraud) > 0.7 → block, 0.3-0.7 → MFA challenge, < 0.3 → pass
- Latency: < 100ms P99 via feature precomputation + model serving (TorchServe/TFServing)
- Feedback loop: labeled outcomes → retrain weekly
Metrics: Precision (false positives = bad UX), Recall (false negatives = fraud loss), F1, AUC
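The threshold policy and the velocity feature from the list above can be sketched as two small functions. The function names, the inclusive handling of the 0.3 and 0.7 boundaries, and the 60-second window are assumptions made for illustration; in practice the thresholds would be tuned on business cost.

```python
def decide(p_fraud, block_at=0.7, challenge_at=0.3):
    """Map a fraud probability to an action using the thresholds above."""
    if p_fraud > block_at:
        return "block"
    if p_fraud >= challenge_at:
        return "mfa_challenge"
    return "pass"

def txn_velocity(timestamps, now, window_seconds=60):
    """Number of a user's transactions within the last `window_seconds`."""
    return sum(1 for t in timestamps if 0 <= now - t <= window_seconds)

# Hypothetical transaction times (in seconds) for one user
recent = [10, 100, 130, 150]
print(txn_velocity(recent, now=155))            # transactions in the last minute
print(decide(0.85), decide(0.5), decide(0.05))
```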
Q: Design a recommendation system (YouTube/Netflix style).
Two-stage architecture:
100M items → [Retrieval (fast)] → 1000 candidates
↓
[Ranking (accurate)]
↓
Top 50 shown to user
Retrieval: Matrix factorization / two-tower neural network
# Two-tower: embed user and item separately, dot product for score
user_embed = user_tower(user_features) # (batch, 128)
item_embed = item_tower(item_features) # (batch, 128)
scores = (user_embed * item_embed).sum(-1)
Ranking: Wide & Deep / DIN / Transformer on (user, item, context) features
Metrics: Click-through rate, Watch time, NDCG, Coverage, Diversity
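Of these metrics, NDCG is the one candidates are most often asked to compute by hand. The sketch below uses linear gain (the relevance value itself); some definitions use 2^rel − 1 instead, so state which one you mean. The relevance list is made up.

```python
import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1) for rank = 1..k
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)  # best possible ordering
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the items in the order the system ranked them
ranked_relevance = [3, 2, 0, 1]
score = ndcg_at_k(ranked_relevance, k=4)
print(f"NDCG@4 = {score:.4f}")
```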
Essential NumPy one-liners
import numpy as np
# Shape manipulation
x = np.random.randn(100, 3)
x.reshape(50, 6) # Reshape
x.T # Transpose
x[:, np.newaxis] # Add dimension
np.squeeze(x) # Remove size-1 dims
# Math
np.dot(A, B) # Matrix multiply (2D)
A @ B # Same, cleaner syntax
np.linalg.norm(x) # L2 norm
np.linalg.eig(A) # Eigendecomposition
np.linalg.svd(A) # SVD
# Stats
np.mean(x, axis=0) # Column means
np.std(x, ddof=1) # Sample std
np.percentile(x, 75) # 75th percentile
np.corrcoef(x[:, 0], x[:, 1]) # Correlation
# Boolean ops
np.where(x > 0, x, 0) # ReLU!
x[x > 0] # Boolean indexing
np.any(x > 5), np.all(x > 0)
Essential Pandas one-liners
import pandas as pd
# Load
df = pd.read_csv('data.csv')
df.info() # Shape, dtypes, nulls
df.describe() # Stats summary
df.head(), df.tail()
# Missing values
df.isnull().sum() # Null count per col
df.fillna(df.mean(numeric_only=True), inplace=True) # Fill numeric cols with their means
df.dropna(subset=['target']) # Drop rows with null target
df['col'].fillna(df['col'].mode()[0]) # Fill with mode
# Feature engineering
df['log_price'] = np.log1p(df['price'])
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,60,100], labels=['teen','young','mid','senior'])
pd.get_dummies(df['category'], prefix='cat', drop_first=True) # One-hot encode
# Aggregation
df.groupby('city')['salary'].agg(['mean','median','count'])
df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
Scikit-learn full pipeline template
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
num_features = ['age', 'salary', 'experience']
cat_features = ['city', 'department']
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])
full_pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=100))
])
# Cross-validate
scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
# Grid search
param_grid = {'model__n_estimators': [100, 200], 'model__max_depth': [3, 5]}
gs = GridSearchCV(full_pipeline, param_grid, cv=5, n_jobs=-1, scoring='roc_auc')
gs.fit(X_train, y_train)
PyTorch training loop template
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct = 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        output = model(X)
        total_loss += criterion(output, y).item()
        correct += (output.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)
# Training loop with early stopping
# (model, train_loader, val_loader, optimizer, criterion, scheduler, device are defined elsewhere)
best_val_loss, patience, wait = float('inf'), 5, 0
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    scheduler.step(val_loss)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f'Early stopping at epoch {epoch}')
            break
    print(f'Epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} acc={val_acc:.4f}')
Custom Dataset class (PyTorch)
import torch
from torch.utils.data import Dataset
from PIL import Image
import torchvision.transforms as T
class ImageDataset(Dataset):
    def __init__(self, df, img_dir, transform=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        # Default: resize and normalize with ImageNet statistics
        self.transform = transform or T.Compose([
            T.Resize((224, 224)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        img = Image.open(f"{self.img_dir}/{self.df.loc[idx, 'filename']}").convert('RGB')
        label = self.df.loc[idx, 'label']
        return self.transform(img), torch.tensor(label, dtype=torch.long)
Evaluation metrics from scratch
import numpy as np
def precision_recall_f1(y_true, y_pred):
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f1
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def iou(box1, box2):
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    intersection = max(0, x2-x1) * max(0, y2-y1)
    union = (box1[2]-box1[0])*(box1[3]-box1[1]) + (box2[2]-box2[0])*(box2[3]-box2[1]) - intersection
    return intersection / (union + 1e-8)
Project walkthrough template (STAR format)
Situation: "At [company/university], we faced [problem] — [metric showing scale]."
Task: "My role was to [responsibility] — specifically [what you owned]."
Action:
- "First I [explored/analyzed] the data and found [insight]."
- "I chose [model/approach] because [reason over alternatives]."
- "Key challenge was [X] — I solved it by [Y]."
Result: "[Metric improvement] — e.g., reduced inference time by 40%, improved F1 from 0.72 to 0.89."
Tip: Always quantify. "Improved accuracy" < "Improved F1 from 0.72 to 0.89 on a 100K sample test set."
Questions to ask the interviewer
- "What does the ML infrastructure look like — on-prem, cloud, internal tooling?"
- "How do you handle model monitoring and drift detection in production?"
- "What's the typical iteration cycle from idea to model in production?"
- "What's the biggest unsolved ML challenge on the team right now?"
- "How does the team balance research vs engineering vs product priorities?"
Planned additions:
- LLM fine-tuning section (LoRA, QLoRA, RLHF)
- MLOps questions (MLflow, DVC, Kubeflow, feature stores)
- Reinforcement Learning fundamentals
- Time series deep dive
- More system design case studies
Found an error? Have a great question/answer pair? PRs are very welcome.
git checkout -b add/new-question
# Add your Q&A in the relevant section
git commit -m 'Add: <topic> question on <concept>'
git push origin add/new-question