AI and Machine Learning Jargon

A playful, practitioner-first glossary of AI/ML terms I actually use. Plain English, punchy definitions, and quick mental models—sprinkled with Rick & Morty asides when things get weird. Built for real work: code reviews, design docs, and late‑night debugging with a portal gun in one hand and coffee in the other.

Minimal fluff, maximum signal. When it helps, I drop tiny Python snippets, gotchas, and the occasional “don’t overfit, Morty” so you can apply ideas immediately.


Table of Contents

Supervised Learning

Learning with labeled data: the model maps inputs to known target outputs (classification, regression). The goal is to minimize a loss that measures prediction error.

Rick & Morty: What's this?

Morty: "So supervised learning is like having the answers on the back of the book?"

Rick: "Yeah, Morty. Labeled inputs, known targets, minimize a loss. Training wheels for pattern‑matching—now let's not flunk the cosmos."

Morty: "Okay, so supervised means we already know the answers?"

Rick: "Labeled input–output pairs, Morty. The model learns a mapping that minimizes a loss. Think of it like teaching someone to recognize cats by showing them thousands of pictures labeled 'cat' and 'not cat'. The algorithm finds patterns in the pixels that correlate with the labels."

Morty: "And the loss is like a score of how wrong we are?"

Rick: "Exactly. We adjust weights to make that score smaller until generalization stops improving. Lower loss means better predictions, but don't get cocky—what matters is how well it works on new, unseen data."

Morty: "Any gotchas?"

Rick: "Overfitting. Regularize, validate, and stop before you memorize the homework key. The model might learn to recognize the exact training examples instead of the underlying patterns. It's like memorizing answers without understanding the concepts—works great on the practice test, fails spectacularly on the real exam."

Morty: "So we need to test it on data it's never seen?"

Rick: "That's the validation set, Morty. Keep some data hidden during training, then see if the model can handle surprises. If training accuracy is high but validation accuracy tanks, you've got yourself a classic overfitting situation."

Unsupervised Learning

Learning patterns from unlabeled data (e.g., clustering, density modeling, dimensionality reduction). Often used for exploration or as preprocessing.

Rick & Morty: What's this?

Morty: "Uh, so… no labels?"

Rick: "Right. We toss data into a void and let it find structure—clusters, manifolds, whatever. Curiosity without supervision, Morty."

Morty: "No labels… so what are we learning?"

Rick: "Structure, Morty. Groups, manifolds, directions of variance—whatever patterns float to the top. Imagine you have a bunch of customer data but no idea what makes them different. Unsupervised learning might discover that some customers buy luxury items while others are bargain hunters, even though nobody told it to look for that."

Morty: "How do we check if it's good?"

Rick: "Qualitative checks, downstream performance, or metrics like silhouette. Don't expect a single 'right' answer. Unlike supervised learning where you can check against ground truth, here you're exploring the unknown. Maybe the clusters make business sense, maybe they reveal hidden customer segments, or maybe they're just mathematical artifacts."

Morty: "So it's like exploring a new dimension?"

Rick: "Yeah, and trying not to get eaten by your own assumptions. The algorithm might find patterns that are real but useless, or useful but not obvious. It's like being an explorer without a map—you might discover treasure or just end up lost in a swamp of irrelevant correlations."

Morty: "When would we actually use this?"

Rick: "Data exploration, feature engineering, anomaly detection, or when you need to understand your data before building supervised models. It's reconnaissance for your data science mission, Morty."

Reinforcement Learning

Learning through trial and error by receiving rewards for actions taken in an environment; aims to learn a policy that maximizes expected return.

Rick & Morty: What's this?

Morty: "We get points for doing good stuff?"

Rick: "Rewards, Morty. Agent learns a policy to maximize return by trial and error. It's like arcade tokens but with Bellman equations."

Morty: "We just try stuff and see what pays off?"

Rick: "Agent, environment, rewards. Learn a policy that maximizes expected return, Morty. Think of it like learning to play a video game—you don't know the rules at first, but you get points for good moves and lose points for bad ones. Eventually you figure out the strategy that gets you the highest score."

Morty: "Do we plan ahead or just react?"

Rick: "Both—value functions evaluate, policies act. Bootstrapping stitches it together. The value function is like your inner voice saying 'this situation looks promising' or 'this is probably a bad idea.' The policy is your actual decision-making process. Sometimes you plan several moves ahead, sometimes you just react."

Morty: "And exploration?"

Rick: "Essential. Otherwise you get stuck milking mediocre rewards forever. It's the classic explore-exploit dilemma—do you keep doing what you know works, or try something new that might work better? Most RL algorithms have some mechanism to encourage trying new things, otherwise they get stuck in local optima like a rat hitting the same lever."

Morty: "So it's like learning to ride a bike?"

Rick: "Exactly, but the bike is on fire and the road keeps changing. You start terrible, crash a lot, but gradually learn what actions lead to staying upright. The reward signal is your main teacher—pain when you fall, satisfaction when you succeed."

Model

A function with learnable parameters that transforms inputs into predictions or decisions. Can be parametric (fixed-size, e.g., linear regression) or nonparametric (flexible size, e.g., k-NN).

Rick & Morty: What's this?

Morty: "A model is like a brain?"

Rick: "A function, Morty. Parameters in, predictions out. Some are simple, some are monstrosities. All are hungry for data."

Morty: "So a model is like a brain-in-a-jar?"

Rick: "It's a function with tunable parameters, Morty. Some tiny, some so big they write papers. A linear regression model might have just a few parameters—maybe one for each input feature plus a bias term. A large language model might have billions of parameters, each one a tiny weight that helps the model understand language."

Morty: "How do we pick one?"

Rick: "Start simple, watch validation, escalate complexity only if it earns its keep. Begin with something like logistic regression—it's interpretable and fast. If it's not performing well enough, maybe try a random forest. Still not good enough? Neural networks. But each step up the complexity ladder means more data requirements, longer training times, and harder debugging."

Morty: "And k‑NN isn't even training?"

Rick: "Lazy learning. You store data and suffer at query time. k-NN doesn't build a model during training—it just memorizes all the examples. When you ask for a prediction, it searches through all stored examples to find the k nearest neighbors and averages their labels. Fast training, slow inference."

Morty: "So bigger models are always better?"

Rick: "Hell no, Morty. Bigger models overfit easier, cost more to run, and are harder to understand. Sometimes a simple model that captures the main pattern is better than a complex one that memorizes noise. It's about finding the sweet spot between underfitting and overfitting."

Parameters

Internal values learned from data (e.g., weights in neural networks, coefficients in linear models). Adjusted by optimizers using gradients.

Rick & Morty: What's this?

Morty: "Are parameters the knobs?"

Rick: "They're the weights, Morty. The numbers the optimizer cranks so the model stops embarrassing itself."

Morty: "Parameters change during training, right?"

Rick: "Yeah. Gradients nudge them toward lower loss each update. Think of parameters as the model's memory—they encode everything the model has learned about the patterns in your data. In a neural network, they're the connection strengths between neurons."

Morty: "Do all parameters get updated the same?"

Rick: "Not necessarily—different layers, schedules, and weight decay can treat them differently. Early layers in a neural network might learn slowly with small learning rates, while the final layer learns faster. Some parameters might be frozen during transfer learning, others might have different regularization."

Morty: "So they're the dials the model learns to twist?"

Rick: "Exactly, Morty. Imagine the model as a complex machine with millions of tiny dials. During training, the optimizer figures out how to adjust each dial to make better predictions. Some dials barely move, others get cranked hard. The art is in getting them all to work together."

Morty: "How many parameters do models usually have?"

Rick: "Ranges from dozens to trillions. A simple linear model might have one parameter per feature. GPT-3 has 175 billion parameters. More parameters mean more capacity to learn complex patterns, but also more risk of overfitting and higher computational costs."

Hyperparameters

External configuration values set before training (e.g., learning rate, tree depth, regularization strength). Tuned via validation or cross-validation.

Rick & Morty: What's this?

Morty: "So these are settings we pick?"

Rick: "Yeah. Learning rate, depth, regularization. You tune 'em or watch your model crash like a rickety portal gun."

Morty: "We pick hyperparameters before we start?"

Rick: "Yep. They shape the learning process—too spicy or too bland ruins dinner. Think of hyperparameters as the recipe settings for cooking your model. Learning rate is how fast you turn up the heat, batch size is how much you cook at once, number of layers is how complex your dish gets."

Morty: "How do we choose them?"

Rick: "Search: grid, random, Bayesian, or bandits. Validate honestly, avoid leakage. Grid search tries every combination like a methodical scientist. Random search is like throwing darts—surprisingly effective and way faster. Bayesian optimization is the smart approach that learns from previous attempts."

Morty: "And automate it?"

Rick: "If you like compute bills, sure. Automated hyperparameter tuning can burn through cloud credits faster than Jerry burns through excuses. But it's often worth it—the difference between good and great hyperparameters can make or break your model."

Morty: "What happens if we get them wrong?"

Rick: "Learning rate too high? Your loss function bounces around like a ping-pong ball. Too low? Training takes forever and might never converge. Wrong architecture depth? Either underfitting or overfitting hell. It's like tuning a musical instrument—everything has to be just right for harmony."

Feature

An input variable describing aspects of the data (e.g., age, pixels). Features can be raw or engineered; quality strongly affects model performance.

Rick & Morty: What's this?

Morty: "Features are like clues?"

Rick: "Inputs, Morty. Pixels, ages, frequencies. Better clues, better models—unless you leak future info and blow up reality."

Morty: "Features are our clues to the answer?"

Rick: "Clues and context. Good ones make learning easy; bad ones make models hallucinate. Think of features as the sensory inputs for your model—they're how the algorithm perceives the world. In image recognition, features might be pixel values, edges, or textures. For predicting house prices, features could be square footage, neighborhood, and number of bedrooms."

Morty: "So engineering matters?"

Rick: "Often more than model choice, Morty. But don't leak the future. A brilliant feature engineer can make a simple linear model outperform a fancy neural network with bad features. Features are where domain expertise meets machine learning—knowing what matters in your problem space."

Morty: "Leak the… future?"

Rick: "Using info unavailable at prediction time. Multiverse-breaking mistake. Like using tomorrow's stock price to predict today's stock price—technically perfect accuracy, completely useless in practice. Always ask: 'Would I have this information when making real predictions?'"

Morty: "What makes a good feature?"

Rick: "Relevance, availability, and stability, Morty. It should be correlated with your target, available when you need predictions, and not change meaning over time. A feature that's perfect in your training data but unavailable in production is worse than useless—it's a trap."

Label

The target variable the model is trained to predict. Can be numeric (regression) or categorical (classification).

Rick & Morty: What's this?

Morty: "Labels are answers?"

Rick: "Targets, Morty. The thing you're trying to predict. Numbers or categories. No label, no supervision."

Morty: "Labels are the answers we compare to?"

Rick: "Exactly. Noisy labels mislead, so curate. Garbage in, garbage universes out. Labels are your ground truth—what you want the model to learn to predict. In spam detection, labels are 'spam' or 'not spam'. In house pricing, the label is the actual sale price."

Morty: "What about class imbalance?"

Rick: "Reweight, resample, or tune thresholds—metrics must match reality. If 99% of your emails are not spam, your model will just predict 'not spam' for everything and be 99% accurate but completely useless. You need to balance the training data or adjust how you evaluate performance."

Morty: "So labels can be the bottleneck?"

Rick: "Frequently, Morty. Getting high-quality labels is often the hardest part of machine learning. Think medical diagnosis—you need expert doctors to label X-rays, which is expensive and time-consuming. Bad labels create bad models, period."

Morty: "How do we get better labels?"

Rick: "Multiple annotators, clear guidelines, quality checks, and sometimes active learning where the model asks for labels on the most uncertain examples. And always remember—your model can only be as good as your labels allow it to be."

Dataset

Collection of examples used to train and evaluate models, typically split into train/validation/test sets to measure generalization.

Rick & Morty: What's this?

Morty: "A dataset is just… a bunch of examples?"

Rick: "Collections of reality slices, Morty. Train, validate, test—don't mix 'em or you'll contaminate timelines."

Morty: "Dataset equals rows of reality?"

Rick: "Rows and context. Quality, coverage, and drift resistance matter more than size alone. A dataset is your model's entire understanding of the world. If it's biased, incomplete, or outdated, your model will be too. Think of it as a representative sample of all possible situations your model might encounter."

Morty: "How do we keep it clean?"

Rick: "Version it, audit it, and track provenance like it's plutonium. Every time you update your dataset, version it. Know where each data point came from, when it was collected, and how it was processed. Data lineage is crucial for debugging and compliance."

Morty: "And splits?"

Rick: "Disjoint, time‑aware when needed, consistent preprocessing across them. Your train, validation, and test sets should never overlap—that's cheating. For time series data, respect temporal order. And whatever preprocessing you do to training data, do exactly the same to validation and test data."

Morty: "How big should datasets be?"

Rick: "Depends on the complexity, Morty. Simple problems might need hundreds of examples, complex deep learning might need millions. But remember: 1000 high-quality, relevant examples often beat 100,000 noisy, irrelevant ones."

Train/Validation/Test Split

Common split to train models, tune hyperparameters, and estimate generalization. Avoid leakage by ensuring disjoint splits and consistent preprocessing.

Rick & Morty: What's this?

Morty: "We divide the data into three piles?"

Rick: "Train to learn, validation to tune, test to judge. Keep 'em disjoint or you're just flattering yourself, Morty."

Morty: "Three piles, three purposes?"

Rick: "Train learns, validation tunes, test judges. Keep them separate or you flatter yourself. Training data teaches the model patterns, validation data helps you pick the best hyperparameters, and test data gives you an honest assessment of real-world performance."

Morty: "Time series, too?"

Rick: "Use temporal splits. Random shuffles can lie to you, Morty. With time series, you can't randomly mix past and future—always split chronologically. Train on older data, validate on recent data, test on the most recent data. Otherwise you're cheating by using future information."

Morty: "And never peek?"

Rick: "Never. The test set is sacred, Morty. Look at it once, at the very end, after everything else is finalized. If you keep peeking and adjusting based on test performance, you're essentially training on your test set."

Morty: "What about the split ratios?"

Rick: "Common rule is 60/20/20 or 70/15/15, but it depends on your data size. With millions of examples, you can get away with smaller validation and test sets. With hundreds, you might need larger portions to get reliable estimates."

Overfitting

Model fits noise or spurious patterns in training data and performs poorly on unseen data. Mitigate with regularization, augmentation, and early stopping.

Rick & Morty: What's this?

Morty: "Overfitting is memorizing, right?"

Rick: "Yeah. The model writes the answers on its arm and fails the real exam. Regularize, augment, stop early, Morty."

Morty: "Memorizing the homework key, huh?"

Rick: "Yeah. Low training error, high test regret. The model learns the training data so well that it memorizes irrelevant details and noise instead of the underlying patterns. It's like a student who memorizes the exact wording of practice problems but can't solve new ones with different wording."

Morty: "How do we fix it?"

Rick: "Simplify, regularize, augment, or get more data—and watch validation like a hawk. Use a simpler model, add regularization penalties, augment your training data with variations, or collect more diverse examples. The key is monitoring validation performance—when it starts getting worse while training improves, you're overfitting."

Morty: "Early stopping?"

Rick: "Classic and effective. Stop training when validation performance plateaus or starts degrading, even if training performance could still improve. Save the model state from when validation was best, not when training finished."

Morty: "How do we spot it?"

Rick: "Large gap between training and validation performance, Morty. If your model gets 99% accuracy on training data but only 70% on validation, that's a classic overfitting red flag. The model is too complex for the amount of data you have."

Underfitting

Model is too simple or undertrained to capture underlying patterns. Mitigate by increasing capacity, training longer, or improving features.

Rick & Morty: What's this?

Morty: "Underfitting means it's too simple?"

Rick: "Too weak to capture the signal. Give it capacity, time, or smarter features."

Morty: "The model's just not smart enough?"

Rick: "Or not trained enough. Add capacity, train longer, or build better features. Underfitting is the opposite of overfitting—the model is too simple to learn even the basic patterns in your data. It's like trying to fit a curved line with a straight line—it just can't capture the complexity."

Morty: "Any risks?"

Rick: "Swing too far and you overfit. It's a bias‑variance seesaw. The art is finding the sweet spot between too simple (underfitting) and too complex (overfitting). You want just enough model complexity to capture the real patterns without memorizing noise."

Morty: "So tune, don't blindly crank."

Rick: "Bingo. Start simple and gradually increase complexity while monitoring validation performance. Add layers, parameters, or training time incrementally. Stop when validation performance plateaus or starts degrading."

Morty: "How do we recognize underfitting?"

Rick: "Both training and validation performance are poor, Morty. If your model can't even learn the training data well, it's probably underfitting. The validation performance will be bad but close to training performance—they're both struggling."

Bias-Variance Trade-off

Balance between error from overly simple assumptions (bias) and sensitivity to noise (variance). Proper model/regularization choice aims to minimize total error.

Rick & Morty: What’s this? Morty: "Two kinds of being wrong?"
Rick: "Bias is systematic error; variance is sensitivity to noise."
Morty: "Can't we squash both?"
Rick: "We minimize total error, Morty—trade bias for variance or vice versa."
Morty: "How do we shift the balance?"
Rick: "Regularization, data size, and model capacity—all nudge the seesaw."
Morty: "So there's no free lunch?"
Rick: "Only trade‑offs. Measure, don't guess."

Regularization

Techniques to prevent overfitting by penalizing complexity or adding noise (e.g., L1/L2, dropout, early stopping). Encourages simpler, more generalizable models.

Rick & Morty: What’s this? Morty: "Regularization keeps models humble?"
Rick: "Penalize big weights, drop neurons, stop early—anything to tame variance."
Morty: "Does it hurt training loss?"
Rick: "Usually. But it helps test error, which is what matters."
Morty: "So it's a controlled handicap?"
Rick: "Exactly."

Loss Function

Quantifies prediction error for optimization (e.g., MSE for regression, cross-entropy for classification). Lower loss indicates better fit to data.

Rick & Morty: What's this?

Morty: "It's how wrong we are?"

Rick: "A scalar shame-meter. MSE, cross-entropy—lower it and maybe you'll impress one timeline. The loss function is how you teach the model what 'wrong' means. It's the feedback signal that drives learning."

Morty: "Why different loss functions?"

Rick: "Different problems, different measures of wrongness, Morty. Mean Squared Error for regression penalizes big mistakes heavily. Cross-entropy for classification cares about probability distributions. Huber loss is robust to outliers. The choice shapes how your model learns."

Morty: "How does it actually work?"

Rick: "Compare predictions to truth, compute badness, backpropagate to update weights. The optimizer uses the loss gradient to figure out which direction to nudge each parameter. Lower loss means better predictions, in theory."

Morty: "Any gotchas?"

Rick: "Loss can be gamed, Morty. A model might minimize training loss by memorizing noise. That's why we use validation loss to judge real performance. Also, some losses don't match real-world objectives—optimizing for accuracy might not optimize for fairness."

Objective Function

The function optimized during training (often the loss plus regularization).

Rick & Morty: What’s this? Morty: "Objective equals target of training?"
Rick: "Yep. Loss plus any penalties you tack on."
Morty: "Why add penalties?"
Rick: "To keep models from overfitting—simplicity tax, Morty."
Morty: "So the optimizer chases this function?"
Rick: "Relentlessly, with your compute bill screaming."

Gradient

Vector of partial derivatives indicating the direction of steepest increase of a function.

Rick & Morty: What’s this? Morty: "Gradients tell us how to change?"
Rick: "Direction and magnitude for each parameter."
Morty: "And we go downhill to minimize?"
Rick: "Negative gradient, Morty—gravity for math."
Morty: "Any issues?"
Rick: "Vanishing or exploding. Normalize, clip, or redesign."

Gradient Descent

Optimization algorithm that updates parameters in the negative gradient direction. Variants include SGD with momentum, Adam, RMSProp.

Rick & Morty: What’s this? Morty: "We take steps guided by gradients?"
Rick: "SGD is the basic stride; momentum and Adam smooth or adapt it."
Morty: "Which one should we use?"
Rick: "Start with Adam; switch to SGD for fine polish if needed."
Morty: "And tune learning rate?"
Rick: "Always."

Learning Rate

Step size controlling how much parameters change per update.

Rick & Morty: What’s this? Morty: "Too big and we overshoot?"
Rick: "Too small and we crawl. Schedules help—warmup, decay."
Morty: "Adaptive optimizers fix it?"
Rick: "They help, but you still tune."
Morty: "So LR is the most important knob?"
Rick: "Usually, Morty."

Epoch

One full pass through the training dataset.

Rick & Morty: What’s this? Morty: "A season of training, basically?"
Rick: "Complete pass. Count them, but watch validation—not just episode numbers."
Morty: "Early stopping watches validation?"
Rick: "And stops the show before it jumps the shark."

Batch / Batch Size

Subset of the dataset used per gradient update; size controls memory and stability.

Rick & Morty: What’s this? Morty: "Bigger batches, smoother updates?"
Rick: "Smoother but costlier. Small batches add noise that sometimes helps."
Morty: "Any rule of thumb?"
Rick: "Fit your memory and keep throughput high."
Morty: "Gradient accumulation?"
Rick: "A hack to pretend the batch is bigger."

Optimizer

Algorithm to update parameters (e.g., SGD, Adam, RMSProp). Choice affects convergence speed and stability.

Rick & Morty: What’s this? Morty: "Optimizers are update strategies?"
Rick: "Exactly. They decide how to move in parameter space."
Morty: "And hyperparameters matter here too?"
Rick: "Learning rate, betas, weight decay—tune 'em or suffer."
Morty: "So optimizer choice isn't magic?"
Rick: "It's taste plus evidence, Morty."

Backpropagation

Method to compute gradients in neural networks via the chain rule.

Rick & Morty: What’s this? Morty: "We backtrack errors through layers?"
Rick: "Chain rule lets each layer know how it messed up."
Morty: "Why do gradients vanish?"
Rick: "Deep chains with saturating activations. Use residuals, norms, and better activations."
Morty: "So architecture fights calculus?"
Rick: "It negotiates, Morty."

Activation Function

Nonlinear function applied to neuron outputs (e.g., ReLU, sigmoid, tanh).

Rick & Morty: What’s this? Morty: "Nonlinearity gives power?"
Rick: "Otherwise it's just a linear stack. ReLU's the workhorse."
Morty: "Sigmoid and tanh?"
Rick: "Squashers—use carefully or drown in saturation."
Morty: "Newer ones?"
Rick: "GELU, Swish—smoother vibes."

Softmax

Normalizes logits into a probability distribution.

Rick & Morty: What’s this? Morty: "Exponentiate then normalize?"
Rick: "Yep. Softmax turns raw scores into a proper distribution."
Morty: "Temperature ties in?"
Rick: "Divide logits before softmax to control sharpness."
Morty: "Calibration?"
Rick: "Check it—confidence isn't accuracy."

Logits

Pre-activation scores output by a model before normalization (e.g., before softmax).

Rick & Morty: What’s this? Morty: "Raw scores, unnormalized?"
Rick: "Logits. Big differences → confident softmax; tiny ones → indecision."
Morty: "We can inspect them?"
Rick: "For debugging or margin tricks, yeah."
Morty: "So they’re pre‑probabilities?"
Rick: "Exactly."

Embeddings

Dense, low-dimensional vector representations of discrete items (e.g., words, products, users) that capture semantic relationships. Learned via tasks like next-token prediction or matrix factorization.

Rick & Morty: What’s this? Morty: "Vectors that capture meaning?"
Rick: "Geometry that encodes relationships—neighbors mean similar."
Morty: "How do we train them?"
Rick: "Self‑supervised tasks or downstream fine‑tuning."
Morty: "And we reuse them?"
Rick: "Transfer them across tasks like a portal pass."

Tokenization

Converting raw text into tokens (words, subwords, characters) for modeling. Modern LLMs use subword tokenizers (e.g., BPE, WordPiece) to balance vocabulary size and expressivity.

Rick & Morty: What’s this? Morty: "Breaking text into chunks?"
Rick: "Subwords balance vocabulary size and flexibility."
Morty: "Why not characters?"
Rick: "Longer sequences; models cry."
Morty: "And words?"
Rick: "Too many, rare ones explode."

Attention

Mechanism allowing models to focus on relevant parts of the input when producing outputs by computing weighted combinations of values based on query–key similarity.

Rick & Morty: What's this?

Morty: "Paying attention—literally?"

Rick: "Queries, keys, values. We weight what's relevant so the model stops staring into space, Morty."

Morty: "Queries point to keys?"

Rick: "Scores pick values to mix—attention directs computation. Think of it like this: you have a question (query), a bunch of topics (keys), and detailed answers (values). The attention mechanism figures out which topics are most relevant to your question, then mixes together the corresponding answers based on relevance."

Morty: "Multi‑head?"

Rick: "Parallel focuses for different patterns. Each head learns to pay attention to different aspects—one might focus on grammar, another on meaning, another on relationships between words. It's like having multiple experts each with their own specialty, then combining their insights."

Morty: "Self vs cross?"

Rick: "Self looks within; cross looks across modalities or sequences. Self-attention lets words in a sentence pay attention to other words in the same sentence—like 'it' referring back to 'cat'. Cross-attention looks between different sources, like when you're translating and need to figure out which English word corresponds to which French word."

Morty: "Why is this such a big deal?"

Rick: "Because it solved the long-range dependency problem, Morty. Before attention, models would forget important context from earlier in the sequence. Now they can directly connect to any relevant information, regardless of distance. It's what made transformers possible and kicked off the modern AI revolution."

Morty: "So it's like having perfect memory?"

Rick: "Perfect memory with smart indexing. The model doesn't just remember everything—it learns what to remember and when it's relevant. It's the difference between a cluttered attic and a well-organized library."

Transformer

Neural architecture built on attention mechanisms (self-attention, cross-attention) and feed-forward layers, often with residual connections and normalization. Dominant in NLP, vision, and multimodal tasks.

Rick & Morty: What's this?

Morty: "The big attention architecture?"

Rick: "Stacks of attention and feed-forward layers with skips and norms. Dominates NLP, vision—pretty much your homework, Morty."

Morty: "Stacks of attention blocks?"

Rick: "Plus feed‑forwards, skips, and norms. Each transformer block has a self-attention layer that lets tokens talk to each other, followed by a feed-forward network that processes each token independently. Residual connections and layer normalization keep gradients flowing during training."

Morty: "Why so dominant?"

Rick: "Scales well and models long‑range dependencies. Unlike RNNs that process sequences step-by-step, transformers can look at all positions simultaneously. This parallelization makes training much faster, and the attention mechanism captures relationships between distant tokens that RNNs often forget."

Morty: "Any downsides?"

Rick: "Context window limits and compute hunger. Attention complexity scales quadratically with sequence length, so transformers hit memory and compute walls with very long sequences. They also need lots of data and computation to train effectively."

Morty: "What made them revolutionary?"

Rick: "The 'Attention is All You Need' paper, Morty. They showed you could ditch recurrence and convolution entirely, just use attention. This unlocked massive parallelization and led to the current AI revolution—GPT, BERT, Vision Transformers, everything."

LLM

Large Language Model: a transformer-based model with many parameters trained on large text corpora to model next-token probabilities. Supports prompting and fine-tuning for downstream tasks.

Rick & Morty: What's this?

Morty: "A huge text-predictor brain?"

Rick: "Gigantic transformers trained on oceans of text to guess the next token. Prompt it right or it rambles like Jerry."

Morty: "Huge text predictors?"

Rick: "Autoregressive token guessers. Prompt savvy makes them useful, Morty. Think of them as incredibly sophisticated autocomplete systems—they've read basically everything on the internet and learned to predict what word comes next in any context."

Morty: "Fine‑tuning or prompts?"

Rick: "Both—choose based on data, control, and cost. Prompting is like asking a really smart friend for help—you describe what you want and hope they understand. Fine-tuning is like hiring that friend and training them specifically for your exact job. Prompting is cheap and flexible, fine-tuning gives you more control but costs more."

Morty: "Safety?"

Rick: "Guardrails or you'll get cosmic nonsense. These models will confidently tell you that sharks are mammals or help you make explosives if you ask nicely enough. They need safety training, content filters, and careful prompt engineering to behave responsibly."

Morty: "How do they actually work?"

Rick: "They break text into tokens, encode them as vectors, then use attention mechanisms to understand relationships between all the tokens. When generating, they sample from a probability distribution over all possible next tokens. The magic is in the training—they learn patterns from billions of examples."

Morty: "Why are they so good at everything?"

Rick: "Scale and emergent abilities, Morty. Train a big enough model on enough data, and it starts showing capabilities nobody explicitly taught it—like reasoning, coding, and creative writing. It's like crossing a threshold where quantity becomes quality."

Prompt

Text or structured input used to guide an LLM's output. Good prompts provide context, constraints, and examples (few-shot) to steer model behavior.

Rick & Morty: What's this?

Morty: "We tell it what to do?"

Rick: "Context, constraints, examples. Good prompts steer; bad ones summon gibberish from the abyss, Morty."

Morty: "We steer with text?"

Rick: "Context + constraints + examples—give it rails to run on. Think of prompting as programming with natural language. You're not coding logic, you're describing the task, providing context, and showing examples of what good output looks like."

Morty: "Few‑shot helps?"

Rick: "Shows patterns to imitate. Zero-shot is 'translate this French text'. Few-shot is 'here are three examples of French-to-English translations, now translate this new one'. The examples teach the model the specific style and format you want."

Morty: "And system messages?"

Rick: "Set behavior—like telling me to be nice. System prompts define the model's role and personality. 'You are a helpful assistant' vs 'You are a cynical scientist' will produce very different responses to the same user question."

Morty: "What makes prompts work well?"

Rick: "Clarity, specificity, and structure, Morty. Be explicit about what you want, provide relevant context, use clear formatting, and give examples when needed. The model can't read your mind—it only has your prompt to work with."

Morty: "Any tricks?"

Rick: "Chain of thought prompting—ask the model to think step by step. Role playing—have it act as an expert. Temperature tuning—adjust randomness. And always test variations to see what works best for your use case."

Context Window

Maximum number of tokens (input + generated) an LLM can process at once; exceeding it truncates inputs or requires special strategies (e.g., chunking, retrieval).

Rick & Morty: What’s this? Morty: "Memory limit in tokens?"
Rick: "Context window—go past it, you lose information."
Morty: "Workarounds?"
Rick: "Chunking, retrieval, or models with bigger windows."
Morty: "So planning matters?"
Rick: "Always."

Temperature / Top-p

Sampling parameters controlling randomness (temperature) and nucleus sampling (top-p) during generation. Lower temperature/top-p increases determinism; higher values increase diversity.

Rick & Morty: What’s this? Morty: "Knobs for chaos?"
Rick: "Temperature sets noise; top‑p keeps only the likely mass."
Morty: "Defaults?"
Rick: "Start modest—then tune for creativity or reliability."
Morty: "Combine both?"
Rick: "Sure, but don't over‑randomize."

Beam Search

Search strategy that explores multiple candidate sequences in parallel and keeps the best beams by cumulative log-probability. Useful for translation and structured generation.

Rick & Morty: What’s this? Morty: "Parallel guesses with pruning?"
Rick: "Keep top beams by score, expand until done."
Morty: "Any trade‑offs?"
Rick: "Diversity drops; add penalties or sampling for variety."
Morty: "Use cases?"
Rick: "Translation, constrained generation, planning."

Accuracy, Precision, Recall, F1

Common classification metrics; F1 balances precision and recall.

Rick & Morty: What’s this? Morty: "Accuracy vs precision vs recall?"
Rick: "Accuracy is overall rightness; precision avoids false positives; recall avoids false negatives."
Morty: "And F1?"
Rick: "Harmonic mediator—use when you need balance."
Morty: "Pick based on stakes?"
Rick: "Always, Morty."

Confusion Matrix

Table showing counts of true/false positives/negatives.

Rick & Morty: What’s this? Morty: "Four boxes, huh?"
Rick: "TP, FP, TN, FN—see where the model fails."
Morty: "Threshold changes it?"
Rick: "Yes—move it and the boxes reshuffle."
Morty: "So context matters."
Rick: "Yep."

ROC / AUC

Receiver Operating Characteristic curve and Area Under the Curve; evaluate ranking quality across thresholds. Use predicted probabilities/scores, not hard labels.

Rick & Morty: What’s this? Morty: "Plot TPR vs FPR?"
Rick: "Across thresholds—AUC summarizes ranking performance."
Morty: "Use scores, not labels?"
Rick: "Right—labels are binary; we need the continuum."
Morty: "And PR curves?"
Rick: "For imbalanced data, often more telling."

Perplexity

Exponentiated average negative log-likelihood; measures language model uncertainty (lower is better). Sensitive to tokenization and dataset domain.

Rick & Morty: What’s this? Morty: "Perplexity is confusion?"
Rick: "Lower means the model predicts well."
Morty: "But tokenization matters?"
Rick: "Change tokens, change numbers—compare apples to apples."
Morty: "So domain shifts break comparisons."
Rick: "Exactly."

Cross-Validation

Resampling technique to estimate generalization by training/validating across multiple splits (e.g., k-fold, stratified). Helps reduce variance in performance estimates.

Rick & Morty: What’s this? Morty: "We rotate the validation set?"
Rick: "Train on folds, validate on the held‑out one—repeat and average."
Morty: "Stratify for class balance?"
Rick: "Yep, or risk misleading scores."
Morty: "Expensive?"
Rick: "Computationally, yes."

Data Augmentation

Synthetic transformations to expand training data (e.g., flips, crops, noise); improves robustness and reduces overfitting, especially in vision/audio.

Rick & Morty: What’s this? Morty: "We mutate images and audio?"
Rick: "And text—paraphrases, masking. Toughen models against reality."
Morty: "Any pitfalls?"
Rick: "Don't change labels or inject artifacts."
Morty: "So realistic transforms only."
Rick: "Right."

Normalization / Standardization

Scaling features to comparable ranges (e.g., min–max, z-score) to stabilize training and accelerate convergence. Fit on training data, apply consistently to validation/test.

Rick & Morty: What’s this? Morty: "We scale for stability?"
Rick: "And speed. Fit on train, reuse on val/test to avoid leakage."
Morty: "Per‑feature or per‑batch?"
Rick: "Depends—preprocessing vs BatchNorm/LayerNorm."
Morty: "Keep pipelines consistent."
Rick: "Always."

Dropout

Randomly zeroing activations during training to reduce overfitting by preventing co-adaptation of features. Disabled at inference.

Rick & Morty: What’s this? Morty: "We drop neurons randomly?"
Rick: "Force redundancy so features don't collude."
Morty: "At inference?"
Rick: "Turn it off; scale weights accordingly."
Morty: "Rates?"
Rick: "Tune 0.1–0.5, context‑dependent."

Weight Decay

L2 regularization applied to weights during optimization, discouraging large weights and smoothing solutions. Implemented via optimizer weight_decay.

Rick & Morty: What’s this? Morty: "Shrink weights to smooth?"
Rick: "Adds a penalty—curbs complexity and improves generalization."
Morty: "Same as L2?"
Rick: "Equivalent effect within the optimizer—mind exceptions like AdamW's decoupling."
Morty: "So use AdamW."
Rick: "Often, yes."

Batch Norm / Layer Norm

Normalization strategies that stabilize and accelerate training in deep nets: BatchNorm normalizes over batch statistics; LayerNorm normalizes over feature dimensions and is common in transformers.

Rick & Morty: What’s this? Morty: "Normalize to keep training sane?"
Rick: "BatchNorm uses batch stats; LayerNorm uses per‑feature."
Morty: "When do we pick which?"
Rick: "Transformers love LayerNorm; conv nets often use BatchNorm."
Morty: "Any caveats?"
Rick: "Batch size sensitivity and inference behavior—mind the stats."

Transfer Learning

Using a pretrained model as a starting point for a new task. Typically freeze early layers, replace task head, then fine-tune selectively.

Rick & Morty: What’s this? Morty: "Shortcut to good features?"
Rick: "Borrow general representations, adapt the head to your task."
Morty: "Freeze or not?"
Rick: "Freeze early layers, unfreeze later when you have data."
Morty: "Watch for forgetting?"
Rick: "Regularize and use small LRs."

Fine-Tuning

Further training of a pretrained model on task-specific data using lower learning rates and careful regularization to avoid catastrophic forgetting.

Rick & Morty: What’s this? Morty: "We tune gently?"
Rick: "Small steps, strong priors—retain the useful generality."
Morty: "Layer‑wise learning rates?"
Rick: "Lower for early layers, higher for task head."
Morty: "Checkpoint often?"
Rick: "And validate constantly."

Model Distillation

Training a small “student” model to mimic a large “teacher” model.

Rick & Morty: What’s this? Morty: "Shrink without losing brains?"
Rick: "Distill softened targets/logits and sometimes features."
Morty: "Why softened?"
Rick: "They carry dark knowledge—fine‑grained class relations."
Morty: "Deploy the student, retire the teacher?"
Rick: "If metrics hold, yes."

Pruning / Quantization

Compressing models by removing weights (pruning) or using lower precision (quantization). Reduces model size and latency at some accuracy cost.

Rick & Morty: What’s this? Morty: "Trade size for accuracy?"
Rick: "Sparse weights or fewer bits—measure the hit and the speedup."
Morty: "Post‑training or during?"
Rick: "Both exist—calibrate carefully for quantization."
Morty: "Edge devices?"
Rick: "These tricks are their lifeline."

Feature Engineering

The process of transforming raw data into meaningful, machine‑learnable inputs using domain knowledge and statistical techniques. It includes handling missing values, encoding categorical variables, scaling, creating interaction or aggregate features, and time‑aware features. Good feature engineering often yields bigger gains than model changes and must avoid leakage by using only information available at prediction time.

Rick & Morty: What’s this? Morty: "Make inputs smarter?"
Rick: "Clean, encode, aggregate, and respect time—no peeking ahead."
Morty: "Why so powerful?"
Rick: "Signal quality beats model complexity nine times out of ten."
Morty: "Share features?"
Rick: "Use a feature store to avoid chaos."

One-Hot Encoding

A method to represent categorical variables as sparse binary vectors, one column per category with a 1 for the observed category and 0 otherwise. It lets algorithms that expect numeric inputs treat categories as unordered (every pair of categories is equally distant) instead of imposing an arbitrary ordering. Beware of high‑cardinality explosion; consider target encoding or embeddings when categories are numerous.

Rick & Morty: What’s this? Morty: "Binary flags per category?"
Rick: "Yep—simple and reliable until cardinality explodes."
Morty: "Alternatives?"
Rick: "Target encoding or learn embeddings."
Morty: "Watch leakage?"
Rick: "Always, especially with target encoding."

Dimensionality Reduction

Techniques that reduce the number of input variables while preserving as much information as possible. Linear methods (e.g., PCA) capture variance along orthogonal directions; nonlinear methods (e.g., t‑SNE, UMAP) capture manifold structure for visualization or preprocessing. Benefits include noise reduction, speedups, and mitigation of the curse of dimensionality.

Rick & Morty: What’s this? Morty: "Fewer features, same gist?"
Rick: "Compress dimensions to keep signal and drop noise."
Morty: "For modeling or visuals?"
Rick: "Both—PCA for models, t‑SNE/UMAP for plots."
Morty: "Beware distortions?"
Rick: "Especially with t‑SNE globally."

PCA / t-SNE / UMAP

PCA is a linear projection maximizing variance and enabling fast compression and whitening. t‑SNE preserves local neighbor relationships for high‑quality visualizations but distorts global distances and is not ideal for downstream learning. UMAP models data as a fuzzy topological graph to preserve local/global structure better than t‑SNE in many cases and scales well.

Rick & Morty: What’s this? Morty: "Pick the right reducer?"
Rick: "PCA for linear variance; UMAP for structure with speed; t‑SNE for pretty plots."
Morty: "Use UMAP over t‑SNE?"
Rick: "Often for scalability and global coherence."
Morty: "Downstream learning?"
Rick: "Prefer PCA embeddings there."

Linear / Logistic Regression

Linear regression models a continuous target as a weighted sum of features under assumptions like linearity and homoscedastic errors. Logistic regression models the log‑odds of class membership to produce calibrated probabilities for binary or multiclass tasks. Both are interpretable, support regularization (L1/L2), and serve as strong baselines.

Rick & Morty: What’s this? Morty: "Numbers vs probabilities?"
Rick: "Linear for continuous; logistic for classes with odds."
Morty: "Regularization?"
Rick: "L1 sparsifies, L2 smooths—pick your poison."
Morty: "Calibration?"
Rick: "Logistic often does it well."

SVM / k-NN / Decision Trees

SVMs find a maximum‑margin separator (with kernels for nonlinearity) and can be robust in high dimensions but require tuning C/kernel parameters. k‑NN classifies by majority vote among nearest neighbors and is simple yet sensitive to scaling and k choice. Decision trees learn hierarchical rules; they are interpretable but prone to overfitting without pruning.

Rick & Morty: What’s this? Morty: "Three classics, three vibes?"
Rick: "Margins (SVM), neighbors (k‑NN), rules (trees)."
Morty: "Scaling matters?"
Rick: "For k‑NN and SVM—normalize features."
Morty: "Trees overfit?"
Rick: "Prune or bag 'em."

Random Forest / XGBoost

Random Forest averages many decorrelated decision trees (bagging) to reduce variance and improve generalization with minimal tuning. XGBoost (gradient boosting) builds trees sequentially to correct residual errors, often delivering state‑of‑the‑art tabular performance with careful regularization. Boosting is powerful but more sensitive to hyperparameters than bagging.

Rick & Morty: What’s this? Morty: "Bag vs boost?"
Rick: "Forest reduces variance; boosting reduces bias—tune more carefully."
Morty: "When to use which?"
Rick: "Forest for quick baselines; XGBoost when squeezing leaderboard points."
Morty: "Feature importance?"
Rick: "Mind biases—permutation importance helps."

Naive Bayes

A family of probabilistic classifiers that assume conditional independence of features given the class (e.g., Gaussian, Multinomial, Bernoulli variants). Despite the strong assumption, they work surprisingly well for text and other high‑dimensional sparse data. They are fast to train, require little data, and yield calibrated posteriors under model correctness.

Rick & Morty: What’s this? Morty: "Cheap and cheerful?"
Rick: "Fast, decent for text, naive in assumptions—know when to stop."
Morty: "Feature independence is false, right?"
Rick: "Often, but it still works."
Morty: "Use as baseline?"
Rick: "Always a good start."

Entropy / KL Divergence

Entropy quantifies the uncertainty of a random variable; higher entropy means more unpredictability. KL divergence measures how one probability distribution diverges from a reference distribution and is asymmetric. They are foundational in information theory, variational inference, and regularization of probabilistic models.

Rick & Morty: What’s this? Morty: "Surprise and mismatch?"
Rick: "Entropy is average surprise; KL is directed difference."
Morty: "Symmetric?"
Rick: "No—KL(P||Q) ≠ KL(Q||P)."
Morty: "Use cases?"
Rick: "VI, regularization, and diagnostics."

Bayesian Inference

A principled framework that combines prior beliefs with data likelihood to produce a posterior distribution over parameters or latent variables. Exact posteriors are rare, so conjugacy, MCMC, or variational inference are used for approximation. Bayesian methods enable uncertainty quantification and coherent decision‑making under uncertainty.

Rick & Morty: What’s this? Morty: "Beliefs updated by evidence?"
Rick: "Prior × likelihood → posterior."
Morty: "Exact answers?"
Rick: "Rare—approximate with MCMC or VI."
Morty: "Why bother?"
Rick: "Uncertainty you can reason about."

MAP / MLE

MLE chooses parameters that maximize the likelihood of the observed data, yielding consistent estimates as the sample grows. MAP incorporates a prior and maximizes the posterior, acting like regularized MLE (e.g., Gaussian prior → L2 penalty). They coincide when the prior is uniform and differ when prior beliefs meaningfully constrain parameters.

Rick & Morty: What’s this? Morty: "Likelihood vs prior influence?"
Rick: "MLE ignores priors; MAP bakes them in."
Morty: "Regularization link?"
Rick: "MAP with Gaussian prior looks like L2."
Morty: "Pick based on beliefs?"
Rick: "And data scarcity."

Markov Chains / HMM

Markov chains model sequences where the next state depends only on the current state (memoryless property). Hidden Markov Models add latent states emitting observations with state‑dependent probabilities, enabling speech, bioinformatics, and time‑series modeling. Inference typically uses the Forward‑Backward and Viterbi algorithms.

Rick & Morty: What’s this? Morty: "Memoryless steps?"
Rick: "Markov property—next depends on now."
Morty: "Hidden states?"
Rick: "HMMs infer unseen causes of observations."
Morty: "Algorithms?"
Rick: "Forward‑Backward and Viterbi."

MCMC / Variational Inference

MCMC constructs a Markov chain whose stationary distribution is the target posterior, producing asymptotically exact samples at the cost of compute and mixing diagnostics. Variational inference turns inference into optimization over a tractable family, trading bias for speed and scalability. Modern practice often mixes both, e.g., using VI for initialization and MCMC for refinement.

Rick & Morty: What’s this? Morty: "Exact by wandering vs fast approximations?"
Rick: "MCMC samples; VI optimizes an approximation."
Morty: "Which to use?"
Rick: "Start with VI for scale; refine with MCMC if needed."
Morty: "Diagnostics?"
Rick: "Check mixing, ELBO, and autocorrelation."

Autoencoder

A neural network trained to reconstruct inputs through a bottleneck, forcing a compact latent representation. The encoder maps inputs to a latent code; the decoder reconstructs inputs from that code, with reconstruction loss guiding learning. Uses include denoising, dimensionality reduction, pretraining, and anomaly detection.

VAE

A probabilistic autoencoder that learns a distribution over latent variables and decodes samples to data space. Trained by maximizing the ELBO, it balances reconstruction accuracy with a KL penalty that regularizes the latent space toward a prior (often standard normal). VAEs enable interpolation, sampling, and controlled generation with continuous latents.

Rick & Morty: What’s this? Morty: "Probabilistic latent spaces?"
Rick: "Encode to distributions, sample with reparameterization, decode—learn a smooth latent map."
Morty: "Why the KL term?"
Rick: "Regularizes latents toward a prior so you can sample and interpolate sensibly."
Morty: "Tuning tips?"
Rick: "Balance reconstruction vs KL—β‑VAE trades disentanglement for sharpness."

GAN

An adversarial framework where a generator produces samples and a discriminator distinguishes real from fake, trained in a minimax game. GANs can generate sharp images but are prone to instability, mode collapse, and require careful architecture, normalization, and loss choices. Variants (WGAN, StyleGAN) improve training dynamics and controllability.

Rick & Morty: What’s this?

Morty: "Adversaries that teach each other?"

Rick: "Generator learns to fool; discriminator learns to detect—training is a knife‑edge balance."

Morty: "Stability hacks?"

Rick: "Spectral norm, gradient penalties, better losses like WGAN."

Morty: "Mode collapse?"

Rick: "Diversify with techniques like minibatch discrimination."

Diffusion Model

A generative model that learns to invert a forward noising process via a sequence of denoising steps. Training fits a noise predictor across timesteps; sampling iteratively refines from noise to data, yielding high‑fidelity, diverse outputs. They are compute‑intensive at inference but amenable to acceleration (DDIM, distillation).

Rick & Morty: What’s this? Morty: "We teach the model to un‑noise?"
Rick: "Learn a noise predictor across timesteps, then sample by gradually denoising from pure noise."
Morty: "Why so slow?"
Rick: "Many steps—accelerate with schedulers, DDIM, or distilled samplers."
Morty: "Quality vs speed?"
Rick: "Trade‑off central—pick your sampler and schedule wisely."

Policy / Value Function

In RL, a policy maps states to actions (stochastic or deterministic), while value functions estimate expected returns for states or state‑action pairs. Policies can be learned directly (policy gradient) or derived from value estimates (e.g., greedy w.r.t. Q). The interplay between acting and evaluating underpins most RL algorithms.

Rick & Morty: What’s this? Morty: "Decide vs judge?"
Rick: "Policy acts, value estimates reward."
Morty: "Learn both?"
Rick: "Actor‑critic pairs them nicely."
Morty: "Exploration still needed?"
Rick: "Always."

Exploration vs. Exploitation

The tension between gathering information (exploration) and maximizing reward using current knowledge (exploitation). Practical strategies include ε‑greedy, softmax over action values, optimism/UCB, and intrinsic motivation bonuses. Effective exploration reduces regret and avoids premature convergence to suboptimal policies.

Rick & Morty: What’s this? Morty: "Try new, or farm known?"
Rick: "Balance. ε‑greedy is simple; UCB is clever."
Morty: "Intrinsic bonuses?"
Rick: "Curiosity signals to seek novelty."
Morty: "Measure regret?"
Rick: "Lower is better, Morty."

Q-Learning / TD Learning

Q‑learning learns optimal action‑values off‑policy by bootstrapping from estimated future returns, enabling learning from replayed experiences. Temporal‑difference methods update estimates using a mix of observed rewards and bootstrap predictions, balancing bias and variance. Stability often relies on target networks, experience replay, and careful learning‑rate schedules.

Rick & Morty: What’s this? Morty: "Update values from experience?"
Rick: "TD methods blend observed rewards with predictions."
Morty: "Why target networks?"
Rick: "Stability—reduce moving‑target chaos."
Morty: "Replay buffers?"
Rick: "Decorrelate and reuse data."

Replay Buffer / Actor-Critic

A replay buffer stores past transitions to decorrelate updates and improve sample efficiency by reusing data. Actor‑critic methods pair a policy (actor) with a value estimator (critic), combining low‑variance value updates with flexible policy optimization. Modern variants (A2C/A3C, PPO, SAC) add stability through constraints or entropy regularization.

Rick & Morty: What’s this? Morty: "Save experiences and split duties?"
Rick: "Buffer for reuse; actor chooses, critic evaluates."
Morty: "Variants?"
Rick: "PPO clips updates; SAC adds entropy."
Morty: "Why?"
Rick: "Stability and exploration."

Inference

The deployment‑time phase where a trained model processes new inputs to produce outputs under latency, memory, and cost constraints. Optimizations include batching, quantization, graph compilation, and hardware acceleration. Observability and correctness (schema validation, canaries) are critical to safe operation.

Rick & Morty: What’s this? Morty: "Use the model in production?"
Rick: "Serve predictions fast and correctly—optimize and observe."
Morty: "Batching helps throughput?"
Rick: "And hurts latency—trade‑offs everywhere."
Morty: "Ship safely?"
Rick: "Validate schemas and canary changes."

Latency / Throughput

Latency measures time per prediction, while throughput measures predictions per unit time for a system. They trade off via batching and parallelism, and both are constrained by model size, I/O, and hardware. SLOs commonly set tail‑latency targets; monitoring captures warm vs cold start and queuing effects.

Rick & Morty: What’s this? Morty: "Fast vs many?"
Rick: "Latency is speed; throughput is volume—tune batching and parallelism."
Morty: "Watch tails?"
Rick: "Tail latency ruins SLOs—optimize cold starts and queues."
Morty: "Hardware matters?"
Rick: "Always—CPU/GPU/TPU change the game."

MLOps

Engineering practices to reliably build, train, evaluate, deploy, and operate ML systems at scale. It emphasizes reproducibility (data and code versioning), automated pipelines (CI/CD for ML), governance (approvals/audit), and monitoring (performance, drift, fairness) across the lifecycle. Collaboration between data science and platform teams is central.

Rick & Morty: What’s this? Morty: "Ops but for ML?"
Rick: "Pipelines, registries, monitoring—make models repeatable and observable."
Morty: "Governance?"
Rick: "Approvals, audit trails—keep regulators off your back."
Morty: "Teams?"
Rick: "Data science plus platform—no silos, Morty."

Feature Store

A centralized system that defines, computes, and serves features consistently to training and online inference. Key capabilities include point‑in‑time correctness, offline/online parity, and low‑latency retrieval keyed by entity IDs. It reduces leakage bugs and duplication while enabling feature reuse across teams.

Rick & Morty: What’s this? Morty: "One source of feature truth?"
Rick: "Consistent definitions online/offline—avoid leakage and mismatches."
Morty: "Keys?"
Rick: "Entity IDs for fast lookups."
Morty: "Reuse?"
Rick: "Share across teams without chaos."

Model Registry

A catalog that tracks models, versions, lineage, metrics, and deployment stages (e.g., staging, production, archived). It supports approvals, rollbacks, and governance by linking artifacts to code, data, and evaluations. Registries integrate with CI/CD to automate promotion and deployment.

Rick & Morty: What’s this? Morty: "Where we keep model history?"
Rick: "Versions, metrics, lineage—know what runs and why."
Morty: "Rollbacks?"
Rick: "Push a button and undo the bad."
Morty: "CI/CD?"
Rick: "Automate promotions safely."

A/B Test / Canary / Shadow

A/B testing splits traffic to compare candidate vs control models with statistical rigor. Canary gradually shifts a small fraction of live traffic to a new model to detect issues before full rollout; shadow sends mirrored traffic to a model without affecting users to gather metrics safely. Choice depends on risk tolerance, evaluation time, and observability.

Rick & Morty: What’s this? Morty: "Split, trickle, or mirror?"
Rick: "A/B compares; canary tests safely; shadow observes silently."
Morty: "Pick one?"
Rick: "Based on risk and how fast you need answers."
Morty: "Metrics?"
Rick: "Collect thoroughly—no surprises."

Monitoring / Drift

Production monitoring tracks prediction quality, data quality, fairness, and system health, triggering alerts on anomalies. Data drift (input distribution change) and concept drift (target relationship change) degrade performance if unaddressed. Mitigations include retraining, feature recalibration, and adaptive thresholds.

Rick & Morty: What’s this? Morty: "Keep an eye on models?"
Rick: "Metrics and alerts—catch drift early."
Morty: "Data vs concept drift?"
Rick: "Inputs shift vs relationships shift—both hurt."
Morty: "Fixes?"
Rick: "Retrain or adapt features/thresholds."

Fairness / Bias / Explainability

Fairness assesses whether outcomes are equitable across groups via metrics like demographic parity, equalized odds, or calibration. Bias can stem from data, labels, or modeling choices; mitigation techniques include reweighting, debiasing, and constraint‑aware training. Explainability tools (SHAP, LIME, saliency) provide local/global insight to support trust and compliance.

Rick & Morty: What’s this? Morty: "Fair and explainable outcomes?"
Rick: "Measure across groups; mitigate bias; explain decisions."
Morty: "Which metrics?"
Rick: "Depends on policy—parity, odds, calibration."
Morty: "Tools?"
Rick: "SHAP, LIME, and saliency maps."

Adversarial Examples / Robustness

Adversarial examples are inputs with small, targeted perturbations that cause large model errors; threat models range from white‑box to black‑box. Robustness techniques include adversarial training, certified defenses, and randomized smoothing, with trade‑offs in accuracy and compute. Robust evaluation requires adaptive attacks and realistic constraints.

Rick & Morty: What’s this? Morty: "Tiny tweaks, big mistakes?"
Rick: "Attacks craft perturbations; defenses toughen models—at a cost."
Morty: "White vs black box?"
Rick: "Access to internals vs just outputs—changes attack strength."
Morty: "Evaluate how?"
Rick: "Adaptive attacks and real constraints."

Code Examples


Practical snippets grouped by concept. Use these alongside the glossary for hands-on intuition.

```python
# Supervised Learning
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_val)
```

```python
# Unsupervised Learning
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
labels = km.fit_predict(X)
```

```python
# Reinforcement Learning (Q-learning)
# Tabular update: move Q[s, a] toward the bootstrapped TD target
Q[s, a] = Q[s, a] + alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```

```python
# Regularization (L2 / Weight Decay)
import torch
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```

```python
# Loss Function (Cross-Entropy)
import torch.nn.functional as F
loss = F.cross_entropy(logits, targets)
```

```python
# Gradient Descent
w = w - lr * grad_w
```

```python
# Optimizer (Adam)
import torch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```

```python
# Softmax
import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

```python
# Classification Metrics
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
precision = precision_score(y_true, y_pred)
recall    = recall_score(y_true, y_pred)
f1        = f1_score(y_true, y_pred)
acc       = accuracy_score(y_true, y_pred)
```

```python
# Confusion Matrix
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
```

```python
# ROC / AUC
from sklearn.metrics import RocCurveDisplay, roc_auc_score
RocCurveDisplay.from_predictions(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```

```python
# Tokenization (Transformer tokenizers)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["Hello world", "ML is fun"], padding=True, return_tensors="pt")
```

```python
# Attention (Scaled Dot-Product)
import numpy as np
def scaled_dot_product_attention(Q, K, V):
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

```python
# Transformer (PyTorch built-in)
import torch
import torch.nn as nn
model = nn.Transformer(d_model=128, nhead=8, num_encoder_layers=2, num_decoder_layers=2)
src = torch.randn(10, 32, 128)
tgt = torch.randn(9,  32, 128)
out = model(src, tgt)
```

```python
# LLM Generation (Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
lm  = AutoModelForCausalLM.from_pretrained("gpt2")
inp = tok("Hello, I'm a language model", return_tensors="pt")
# do_sample=True is needed for temperature/top_p to take effect
gen = lm.generate(**inp, max_length=50, do_sample=True, temperature=0.8, top_p=0.9)
print(tok.decode(gen[0], skip_special_tokens=True))
```

```python
# Beam Search / Sampling Controls
lm.generate(**inp, num_beams=5)                                  # deterministic beam search
lm.generate(**inp, do_sample=True, temperature=0.7, top_p=0.9)   # stochastic sampling
```

```python
# Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
```

```python
# Data Augmentation (Vision)
from torchvision import transforms
aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.1),
])
```

```python
# Normalization / Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_val_std   = scaler.transform(X_val)
```

```python
# Dropout / BatchNorm / LayerNorm
import torch.nn as nn
layer = nn.Sequential(
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.LayerNorm(128),
)
```

```python
# Transfer Learning / Fine-Tuning
import torch
import torchvision.models as models
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
for p in model.layer4.parameters():
    p.requires_grad = True
opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
```

```python
# One-Hot Encoding
import pandas as pd
X = pd.DataFrame({"color": ["red", "green", "blue"]})
X_oh = pd.get_dummies(X, columns=["color"])  # one-hot columns
```

```python
# PCA / Dimensionality Reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
```

```python
# Linear / Logistic Regression
from sklearn.linear_model import LinearRegression, LogisticRegression
reg = LinearRegression().fit(X, y_cont)
clf = LogisticRegression().fit(X, y_bin)
```

```python
# SVM / k-NN / Decision Trees
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
svm = SVC().fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
dt  = DecisionTreeClassifier().fit(X, y)
```

```python
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_depth=10).fit(X, y)
```

```python
# Pruning / Quantization
import torch
import torch.nn.utils.prune as prune
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.2)
        prune.remove(module, "weight")
```

```python
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB().fit(X, y)
```

```python
# Autoencoder (Skeleton)
import torch.nn as nn
class AE(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 16))
        self.dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, d))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z)
```

```python
# VAE (Reparameterization Trick)
import torch
def reparam(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std
```

```python
# GAN (Training Loop Sketch)
# D/G are the discriminator/generator; D_loss/G_loss are placeholder adversarial losses
for x in dataloader:
    z = torch.randn(x.size(0), latent_dim)                # noise from the prior
    d_loss = D_loss(D(x), D(G(z).detach()))               # discriminator: real vs fake
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    g_loss = G_loss(D(G(z)))                               # generator: fool the discriminator
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

```python
# Diffusion (Denoising Step Sketch)
# One reverse step: the noise predictor eps_theta refines x_t into a slightly cleaner x_{t-1}
x_prev = denoise(x_t, t, eps_theta)
```

```python
# Inference / Latency
import time
start = time.perf_counter()
_ = model(x)
latency_ms = (time.perf_counter() - start) * 1000
```


About

AI/ML jargon explained in simple terms, Rick and Morty style.
