Below are exam-ready notes for PSCS511 – Machine Learning: easy to understand, yet detailed enough for M.Sc. CS exams. Each topic follows this structure:
Definition
Explanation
Example (Simple + Real-world)
Types (if applicable)
UNIT 1: The Fundamentals of Machine Learning
1. What is Machine Learning?
Definition:
Machine Learning (ML) is a method where computers learn from data to make decisions without
being directly programmed.
Explanation:
Instead of giving step-by-step instructions, we give the computer a lot of examples. It finds
patterns on its own and uses them to make future predictions.
Example:
If we show a computer many pictures of cats and dogs, it learns what a cat looks like vs a dog.
Later, it can tell if a new picture is a cat or dog without being told how.
Types:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
2. Need and Importance of Machine Learning
Definition:
ML is needed to solve problems where writing exact rules is difficult, or where the system
should improve over time.
Explanation:
Some tasks are too complex for normal programming (like recognizing faces or driving cars). ML
learns automatically and adapts as it gets more data.
Example:
Google Maps learns traffic patterns to suggest faster routes, based on data from millions of
phones.
3. Types of Machine Learning
A. Supervised Learning
Definition:
Learning from data that already has the right answers (labels).
Explanation:
You teach the model with examples: each input has a correct output. The model tries to learn
the link.
Example:
Emails labeled “spam” or “not spam” help the model learn how to classify new emails.
Types:
• Classification (output is a category like Yes/No, Male/Female)
• Regression (output is a number like salary or price)
B. Unsupervised Learning
Definition:
Learning from data without any labels or correct answers.
Explanation:
The model explores the data and finds patterns or groups on its own.
Example:
Netflix groups users with similar movie tastes even if they never said what genre they like.
Types:
• Clustering (grouping similar data)
• Dimensionality Reduction (simplifying data without losing meaning)
C. Reinforcement Learning
Definition:
A model learns by doing actions and receiving rewards or penalties.
Explanation:
It improves step by step by trying, failing, and learning—like a child learning to ride a bike.
Example:
A robot learns to walk by getting points when it moves forward and losing points when it falls.
Types:
• Model-free (Q-learning)
• Model-based (plans using known environment)
4. Challenges of Machine Learning
Definition:
Things that make ML hard to build or use effectively.
Explanation:
• Overfitting: model remembers training data too well but fails on new data
• Underfitting: model is too simple and can’t learn well
• Bad data: missing, incorrect, or imbalanced data can confuse models
• Interpretability: some models are like “black boxes” (hard to understand)
Example:
If a face-recognition app only trains on pictures of light-skinned people, it may fail on dark-
skinned faces — that’s biased data.
5. Testing and Validation
Definition:
Steps to check how well the model is working on new (unseen) data.
Explanation:
• Training Set: to teach the model
• Validation Set: to tune the model (like testing before the final test)
• Test Set: to evaluate the final model’s real accuracy
Example:
A student studies from a textbook (training), solves a practice test (validation), and then gives a
real exam (testing).
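Code sketch (Python, scikit-learn):
A minimal sketch of the three-way split, assuming scikit-learn is installed; the Iris dataset and the 80/20 proportions are only placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # small built-in dataset, used only for illustration

# First split off 20% as the final test set (touched only once, at the very end)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# From the remaining 80%, split off a validation set for tuning (25% of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 90 / 30 / 30 samples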
6. Classification
Definition:
Predicting a category (label) based on input data.
Explanation:
The model puts things into classes based on features.
Example:
Medical diagnosis: Given symptoms, classify as “flu” or “not flu”.
Types:
• Binary (Yes/No)
• Multiclass (e.g., digits 0–9)
7. MNIST Dataset
Definition:
A famous dataset of handwritten digits (0 to 9) used for image classification tasks.
Explanation:
It contains 70,000 grayscale images (60,000 for training and 10,000 for testing) of digits written by different people. Each image is 28x28 pixels.
Example:
Used to train models that can read postal codes or numbers from scanned forms.
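Code sketch (Python):
One common way to load MNIST, here via the Keras dataset helper; this assumes TensorFlow is installed and downloads the data on first use.
from tensorflow.keras.datasets import mnist

# 60,000 training images and 10,000 test images of handwritten digits
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)   # (60000, 28, 28) -> 28x28 grayscale images
print(y_train[:5])     # digit labels, e.g. [5 0 4 1 9]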
8. Performance Measures
Definition:
Metrics used to check how good or bad a model is.
Explanation:
• Accuracy = correct predictions ÷ total predictions
• Precision = how many predicted positives were correct
• Recall = how many actual positives were found
• F1 Score = balance between precision and recall
Example:
If your model says 100 people have COVID, but only 80 actually do, precision = 80%.
9. Confusion Matrix
Definition:
A table that shows how many predictions were right and wrong.
Explanation:
• TP (True Positive) = correctly predicted positive
• TN (True Negative) = correctly predicted negative
• FP (False Positive) = wrongly predicted as positive
• FN (False Negative) = wrongly predicted as negative
Example:
                   Predicted Positive   Predicted Negative
Actual Positive    TP (e.g., 90)        FN (e.g., 10)
Actual Negative    FP (e.g., 20)        TN (e.g., 80)
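Code sketch (Python, scikit-learn):
A minimal sketch computing the confusion matrix and the metrics above on made-up labels; the numbers are hypothetical.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Toy labels: 1 = positive, 0 = negative (invented for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))          # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
Note that scikit-learn orders rows and columns by label value (0 first), so the matrix prints as [[TN, FP], [FN, TP]], which is flipped relative to the table above.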
10. Precision & Recall
Definition:
• Precision: How many predicted positives are correct
• Recall: How many actual positives were found
Explanation:
High precision = few false alarms
High recall = few missed cases
Example:
In cancer testing:
• High recall = catch most sick patients
• High precision = avoid falsely telling healthy people they’re sick
11. Precision-Recall Tradeoff
Definition:
Improving one (precision or recall) often worsens the other.
Explanation:
Changing the threshold (cut-off score) can make the model stricter or more generous.
Example:
A strict spam filter has high precision (only spam marked as spam), but may miss some spam
(low recall).
12. ROC Curve
Definition:
A graph that shows how well the model separates the classes.
Explanation:
ROC stands for Receiver Operating Characteristic. It plots the true positive rate (recall) against the false positive rate at different decision thresholds.
AUC (Area Under the Curve) summarizes overall quality: the closer to 1, the better the model.
Example:
If model A has AUC 0.95 and model B has 0.65, model A is better.
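Code sketch (Python, scikit-learn):
A small sketch of computing ROC points and AUC from hypothetical predicted probabilities.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print("AUC =", roc_auc_score(y_true, y_score))      # closer to 1 = better separation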
13. Multiclass Classification
Definition:
Classifying data into more than two categories.
Explanation:
Instead of just Yes/No, the model decides among 3 or more options.
Example:
Digit recognition (0 to 9) = 10 classes.
Techniques:
• One-vs-Rest: One class vs all others
• Softmax: Outputs probabilities for each class
14. Error Analysis
Definition:
Carefully studying where and why the model made mistakes.
Explanation:
Helps improve the model by finding patterns in errors.
Example:
If your digit model often confuses 4 and 9, maybe it needs better features to tell them apart.
Unit 1: DONE
UNIT 2: Training Models
1. Linear Regression
Definition:
A method to predict a continuous value based on the relationship between input (X) and output
(Y) using a straight line.
Explanation:
It fits the best straight line (Y = mX + b) through the data points by minimizing the gap between
predicted and actual values.
Example:
Predicting house prices based on size: bigger house → higher price.
Type:
• Simple Linear Regression (1 feature)
• Multiple Linear Regression (many features)
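Code sketch (Python, scikit-learn):
A minimal sketch of simple linear regression on made-up house-price data; the sizes and prices are invented numbers.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (hundreds of sq. ft) vs price (lakhs)
X = np.array([[5], [8], [10], [12], [15]])
y = np.array([25, 40, 50, 62, 75])

model = LinearRegression().fit(X, y)
print("slope m =", model.coef_[0], " intercept b =", model.intercept_)
print("predicted price for size 11:", model.predict([[11]])[0])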
2. Gradient Descent
Definition:
An optimization algorithm used to minimize the error in a model by updating weights step by
step.
Explanation:
It checks how the output error changes with small changes in model weights and adjusts them
to reduce the error.
Example:
Like rolling a ball down a hill to reach the lowest point (minimum error).
Types:
• Batch GD – entire dataset per update (accurate but slow)
• Stochastic GD (SGD) – one sample per update (fast but noisy)
• Mini-batch GD – small batches (balance between speed and accuracy)
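Code sketch (Python, NumPy):
A minimal batch gradient descent for a one-feature linear model, written directly from the update rule; the toy data and learning rate are arbitrary choices.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # true relation: y = 2x + 1

w, b, alpha = 0.0, 0.0, 0.01                # start at zero, learning rate 0.01
for step in range(2000):
    error = (w * x + b) - y
    dw = (2 / len(x)) * np.sum(error * x)   # gradient of MSE w.r.t. w
    db = (2 / len(x)) * np.sum(error)       # gradient of MSE w.r.t. b
    w -= alpha * dw                         # update rule: w := w - alpha * dJ/dw
    b -= alpha * db

print(round(w, 2), round(b, 2))             # approaches w = 2, b = 1
Replacing the full-data gradient with one random sample per step would give SGD; using small random batches would give mini-batch GD.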
3. Polynomial Regression
Definition:
A type of regression that fits a curved line by adding powers of input (like X², X³).
Explanation:
Used when the data doesn't follow a straight-line trend but we still want to use a regression technique.
Example:
Predicting salary based on experience where salary increases faster after some years.
Type:
• Quadratic Regression (X²)
• Cubic Regression (X³)
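Code sketch (Python, scikit-learn):
A quadratic fit using PolynomialFeatures plus LinearRegression; the experience/salary numbers are invented.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical curved data: salary grows roughly with the square of experience
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([3, 5, 9, 16, 26, 38])

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[7]]))   # prediction follows the fitted curve, not a straight line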
4. Learning Curves
Definition:
A graph showing the model’s performance (error) on training and validation sets over time or
data size.
Explanation:
Helps in checking whether the model is overfitting or underfitting.
Example:
If training error is low and validation error is high → overfitting.
5. Bias–Variance Tradeoff
Definition:
A balance between two types of model errors: bias (too simple) and variance (too complex).
Explanation:
• High bias: model can't learn well → underfitting
• High variance: model memorizes data too much → overfitting
Example:
A student who guesses answers (high bias) vs. one who memorizes the book but fails
application (high variance).
6. Ridge Regression
Definition:
A regularized version of linear regression that adds an L2 penalty (squares of weights) to reduce
overfitting.
Explanation:
Keeps the model weights small to prevent it from fitting noise in the data.
Example:
Useful when input features are many and related (multicollinearity).
7. Lasso Regression
Definition:
Similar to ridge regression but uses an L1 penalty (absolute values of weights).
Explanation:
Can shrink some weights to zero, effectively removing unnecessary features (feature selection).
Example:
If only 5 out of 20 features really matter, lasso keeps those 5.
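Code sketch (Python, scikit-learn):
A small comparison of Ridge (L2) and Lasso (L1) on synthetic data where only 5 of 20 features matter; the alpha values are arbitrary.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all weights a little
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can push weights exactly to zero

print("non-zero weights (ridge):", sum(abs(w) > 1e-6 for w in ridge.coef_))
print("non-zero weights (lasso):", sum(abs(w) > 1e-6 for w in lasso.coef_))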
8. Early Stopping
Definition:
A technique to stop training when the model starts overfitting (validation error increases).
Explanation:
Checks validation performance and stops before model starts to memorize the training data.
Example:
Like stopping cooking when the food is perfectly done, not overcooked.
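Code sketch (Python, Keras):
One common way to apply early stopping, using the Keras EarlyStopping callback on a tiny synthetic problem; this assumes TensorFlow is installed, and the architecture is arbitrary.
import numpy as np
import tensorflow as tf

# Tiny synthetic regression task, just to demonstrate the callback
X = np.random.rand(200, 3)
y = X.sum(axis=1) + 0.1 * np.random.randn(200)

model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# Stop when validation loss has not improved for 5 epochs; keep the best weights
stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[stopper], verbose=0)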
9. Logistic Regression
Definition:
A classification algorithm that predicts the probability of a binary outcome using the sigmoid
function.
Explanation:
Instead of predicting numbers, it predicts a value between 0 and 1, which can be interpreted as
probability.
Example:
Predicting whether an email is spam (1) or not (0).
Types:
1. Binary Logistic Regression
2. Multiclass via Softmax Regression
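Code sketch (Python, scikit-learn):
A minimal binary logistic regression on made-up spam data; the single feature (count of suspicious words) is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0], [1], [2], [3], [8], [9], [10], [12]])   # suspicious-word counts
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                     # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)
print(clf.predict([[7]]))         # predicted class (0 or 1)
print(clf.predict_proba([[7]]))   # sigmoid probabilities: [P(not spam), P(spam)]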
10. Decision Boundaries
Definition:
A line (in 2D) or surface (in higher dimensions) that separates classes in classification tasks.
Explanation:
Based on model predictions, this is the line where the model decides whether an input belongs
to class A or B.
Example:
In a plot of height vs. weight, the boundary line can separate "fit" and "overweight" categories.
11. Softmax Regression
Definition:
A generalization of logistic regression for multiclass classification problems.
Explanation:
It gives a probability for each class and the class with the highest probability is selected.
Example:
Handwriting recognition from digits 0–9: softmax gives probabilities for each digit.
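Code sketch (Python, NumPy):
A direct implementation of the softmax function itself; the class scores are made-up.
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())            # probabilities that sum to 1; the highest score wins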
12. Cross-Entropy
Definition:
A loss function used in classification to measure the difference between predicted probabilities
and actual labels.
Explanation:
Helps the model learn how far off its predictions are, especially for probability-based outputs
like softmax.
Example:
If the model predicts 0.9 for class A but actual is class B, cross-entropy gives a high error.
Unit 2: DONE
UNIT 3: Support Vector Machines (SVM) & Decision Trees
1. Support Vector Machine (SVM)
Definition:
SVM is a supervised learning algorithm used for classification and regression. It finds the best
boundary (hyperplane) that separates classes with the maximum margin.
Explanation:
It selects data points (support vectors) closest to the decision boundary and tries to maximize
the distance between them. This helps the model generalize better.
Example:
In classifying emails as spam or not, SVM will find the best dividing line (or surface) based on
features like words or links.
Types:
• Linear SVM
• Non-Linear SVM (with kernels)
• SVM for regression (SVR)
2. Linear SVM Classification
Definition:
A type of SVM that works when the data can be separated by a straight line (or plane in higher
dimensions).
Explanation:
It finds a hyperplane (line or surface) that best separates the data into classes with maximum
margin.
Example:
Classifying apples and oranges based on size and color, when they are clearly separable.
3. Soft Margin Classification
Definition:
A variation of SVM that allows some misclassifications to prevent overfitting.
Explanation:
Not all data can be perfectly separated. Soft margin lets some points be on the wrong side of
the line but controls it using a parameter C.
Example:
In noisy email data, a few wrong classifications are allowed so the model doesn't become too
rigid.
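Code sketch (Python, scikit-learn):
A soft-margin linear SVM on slightly overlapping synthetic clusters, showing how the C parameter changes tolerance; the C values are arbitrary.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

soft = SVC(kernel="linear", C=0.1).fit(X, y)   # small C: wider margin, more violations allowed
hard = SVC(kernel="linear", C=100).fit(X, y)   # large C: narrower margin, fewer violations

print("support vectors (C=0.1):", len(soft.support_vectors_))
print("support vectors (C=100):", len(hard.support_vectors_))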
4. Non-Linear SVM Classification
Definition:
An SVM that can classify data which is not linearly separable by using kernel functions.
Explanation:
It transforms input data into a higher-dimensional space where it becomes linearly separable.
Example:
Separating concentric circles (like a donut shape) which cannot be done using a straight line.
5. Polynomial Kernel
Definition:
A kernel function that maps input data into a higher-dimensional space using polynomial
equations.
Explanation:
It allows the SVM to create curved boundaries instead of straight lines.
Example:
If a linear boundary fails, a polynomial kernel can draw a curved line to separate overlapping
classes.
6. Gaussian RBF Kernel (Radial Basis Function)
Definition:
A kernel function that uses distance between points to measure similarity. It maps input data
into infinite-dimensional space.
Explanation:
The RBF kernel creates a flexible boundary that fits complex patterns.
Example:
Used in face recognition systems to handle non-linear and varied facial features.
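Code sketch (Python, scikit-learn):
Comparing a linear SVM and an RBF-kernel SVM on the concentric-circles data mentioned earlier; the noise level is arbitrary.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm    = SVC(kernel="rbf", gamma="scale").fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))   # around 0.5, no better than guessing
print("RBF accuracy   :", rbf_svm.score(X, y))      # close to 1.0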
7. SVM Regression (SVR)
Definition:
An SVM model used for predicting continuous values, not classes.
Explanation:
It tries to fit a line within a margin of tolerance and ignores small errors, focusing only on points that lie outside the margin.
Example:
Predicting stock prices where small prediction errors are acceptable.
8. Decision Trees
Definition:
A tree-like model used for classification and regression. It splits data into subsets based on
feature values.
Explanation:
Each internal node tests a feature, each branch represents the result, and each leaf node gives
the prediction.
Example:
A decision tree to decide if a person should play tennis based on weather conditions.
9. Training and Visualizing a Decision Tree
Definition:
Creating a tree by choosing the best features to split the data at each level using a criterion like
Gini or Entropy.
Explanation:
The goal is to build a tree that separates data as cleanly as possible.
Example:
If the first best split is "weather = sunny", the tree creates branches for "yes" and "no".
10. Making Predictions (Decision Tree)
Definition:
To predict, the model follows the tree path based on input feature values.
Explanation:
It starts from the root, checks each condition, and moves down to a leaf that contains the final
answer.
Example:
Input: “Humidity = High” → “Wind = Strong” → “Play = No”.
11. CART Training Algorithm
Definition:
CART (Classification and Regression Tree) is a standard algorithm for building decision trees.
Explanation:
It uses Gini Impurity for classification and splits data to reduce impurity at each step.
Example:
Used to build simple binary trees for decision-making in applications like loan approval.
12. Gini Impurity vs Entropy
Definition:
Both are measures of impurity (mixing) in a dataset.
Explanation:
• Gini Impurity: Measures how often a randomly chosen element would be incorrectly
labeled.
• Entropy: Measures the disorder or unpredictability in the dataset.
Example:
If a node has 50% yes and 50% no, both Gini and Entropy are high (uncertain). If all are yes, both
are 0 (pure).
13. Regularization Hyperparameters (Decision Trees)
Definition:
Settings that control the size and complexity of the tree to avoid overfitting.
Explanation:
• max_depth: limits how deep the tree can grow
• min_samples_split: sets minimum samples to make a split
• min_samples_leaf: sets minimum samples in each leaf
Example:
Setting max_depth=3 limits the tree to 3 levels → smaller and simpler model.
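Code sketch (Python, scikit-learn):
A regularized decision tree on the Iris dataset using the hyperparameters above; the specific values are arbitrary.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                              criterion="gini", random_state=42)
tree.fit(X, y)

# Text view of the learned splits (a quick way to inspect the tree)
print(export_text(tree, feature_names=load_iris().feature_names))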
Unit 3: DONE
UNIT 4: Fundamentals of Deep Learning
1. What is Deep Learning?
Definition:
Deep Learning is a subset of Machine Learning that uses multi-layered neural networks to learn
from large amounts of data.
Explanation:
It mimics the way the human brain processes information using layers of "neurons" that
transform input data step by step into accurate predictions.
Example:
Face recognition in smartphones uses deep learning to identify your face from millions of pixels.
Types:
Deep Learning mainly uses:
• Artificial Neural Networks (ANN)
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN)
2. Need for Deep Learning
Definition:
Deep Learning is needed when traditional ML methods cannot handle the complexity or size of
the data.
Explanation:
DL can learn hidden patterns in high-dimensional data (like images, audio, video) and performs
better as the dataset grows.
Example:
Detecting diseases from X-ray images, translating speech to text, or self-driving car vision
systems—all use DL because simpler models fail to capture the complexity.
3. Artificial Neural Network (ANN)
Definition:
ANN is a network of interconnected nodes (neurons) inspired by the human brain, used to solve
tasks like classification, prediction, and pattern recognition.
Explanation:
Each neuron takes inputs, applies a weight, adds a bias, and passes the result through an
activation function. Layers of such neurons allow complex transformations.
Example:
An ANN predicting handwritten digits (0–9) by learning patterns in pixel values.
Types of Layers:
• Input Layer: Takes raw data
• Hidden Layers: Learn patterns
• Output Layer: Gives final result
4. Core Components of Neural Networks
Definition:
Basic building blocks that make up a neural network.
Explanation:
Includes:
• Neurons: processing units
• Weights: importance of inputs
• Bias: adjustable constant
• Activation Functions: decide output
• Loss Function: measures error
• Optimizer: improves learning
Example:
A single neuron gets 3 inputs with weights and bias, applies an activation function, and gives an
output.
5. Multi-Layer Perceptron (MLP)
Definition:
A type of ANN with multiple layers of neurons between input and output.
Explanation:
MLP is fully connected, meaning each neuron in one layer connects to all neurons in the next. It
is used for tasks like classification and regression.
Example:
An MLP trained on MNIST dataset learns to recognize digits by adjusting weights across layers.
Types:
• Shallow MLP: 1 hidden layer
• Deep MLP: multiple hidden layers
6. Activation Functions
Definition:
Functions that decide whether a neuron should be activated (i.e., pass information to the next
layer).
Explanation:
They introduce non-linearity, allowing the network to learn complex patterns.
Common Types:
• Sigmoid: outputs between 0 and 1
• ReLU (Rectified Linear Unit): outputs 0 if input < 0, else returns input
• Tanh: outputs between -1 and 1
Example:
ReLU is used in image models because it helps with faster training and avoids the vanishing
gradient problem.
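Code sketch (Python, NumPy):
Direct implementations of the three activation functions listed above, applied to a few sample values.
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-x))   # squashes values into (0, 1)
def relu(x):    return np.maximum(0, x)       # zero for negatives, identity for positives
def tanh(x):    return np.tanh(x)             # squashes values into (-1, 1)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))   # approximately [0.12 0.5 0.95]
print(relu(z))      # [0. 0. 3.]
print(tanh(z))      # approximately [-0.96 0. 1.]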
7. Tensors and Tensor Operations
Definition:
Tensors are multi-dimensional arrays used to store and operate on data in deep learning.
Explanation:
Like numbers (0D), vectors (1D), and matrices (2D), tensors can be extended to higher
dimensions. Operations like addition, multiplication, reshaping happen on tensors.
Example:
A colored image is stored as a 3D tensor: height × width × color_channels (e.g., 28×28×3).
Types:
• 0D: Scalar (single value)
• 1D: Vector
• 2D: Matrix
• 3D+: Tensor (e.g., batches of images)
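Code sketch (Python, NumPy):
Small examples of tensors of different ranks and a couple of basic operations; the shapes are arbitrary.
import numpy as np

scalar = np.array(7)                        # 0D tensor
vector = np.array([1, 2, 3])                # 1D tensor
matrix = np.array([[1, 2], [3, 4]])         # 2D tensor
images = np.zeros((32, 28, 28, 3))          # 4D tensor: a batch of 32 colour images

print(scalar.ndim, vector.ndim, matrix.ndim, images.ndim)   # 0 1 2 4
print(matrix + matrix)                      # element-wise addition
print(matrix.reshape(4, 1).shape)           # reshaping: (4, 1)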
8. TensorFlow Framework
Definition:
TensorFlow is an open-source deep learning library developed by Google for building and
training machine learning models.
Explanation:
It supports both simple and complex DL models using Python. TensorFlow uses computational
graphs and automatic differentiation to speed up training on CPU/GPU/TPU.
Example:
You can build a neural network in TensorFlow with just a few lines: define layers, loss function,
optimizer, and train using .fit().
Features:
• Easy to use with Keras
• Works on all platforms
• Supports visualization with TensorBoard
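Code sketch (Python, TensorFlow/Keras):
A minimal MLP for MNIST built and trained with .fit(), roughly as described in the example above; the layer sizes and epoch count are arbitrary choices.
import tensorflow as tf

# Load MNIST and scale pixel values to the 0-1 range
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Flatten the 28x28 image, one hidden ReLU layer, softmax output over the 10 digits
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))   # [test loss, test accuracy]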
UNIT 4: DONE
Important Machine Learning Formulas (Units 1–4)
UNIT 1: Fundamentals of ML
1. Accuracy
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Explanation:
Tells how many total predictions were correct (both positives and negatives).
Example:
If 80 emails were classified correctly (spam or not) out of 100 total:
\text{Accuracy} = \frac{80}{100} = 0.8 = 80\%
2. Precision
\text{Precision} = \frac{TP}{TP + FP}
Explanation:
Out of all predicted positives, how many were actually correct.
Example:
Predicted 50 spam, and 40 were actually spam:
\text{Precision} = \frac{40}{50} = 0.8 = 80\%
3. Recall (Sensitivity)
\text{Recall} = \frac{TP}{TP + FN}
Explanation:
Out of all actual positives, how many the model correctly identified.
Example:
There were 50 spam emails, and model caught 40:
\text{Recall} = \frac{40}{50} = 0.8 = 80\%
4. F1 Score
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
Explanation:
Harmonic mean of Precision and Recall – used when we want balance between them.
Example:
Precision = 0.8, Recall = 0.6:
F1 = 2 \times \frac{0.8 \times 0.6}{0.8 + 0.6} = 2 \times \frac{0.48}{1.4} = 0.6857
UNIT 2: Training Models
5. Linear Regression Equation
Y = wX + b
Explanation:
• Y: Predicted value
• X: Input feature
• w: Weight (slope)
• b: Bias (intercept)
Example:
If weight = 2 and bias = 3:
Y = 2X + 3
6. Mean Squared Error (MSE)
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
Explanation:
Average squared difference between actual and predicted values. Used as loss function in
regression.
Example:
Actual: [4, 5], Predicted: [3, 6]
MSE = ((4−3)² + (5−6)²)/2 = (1 + 1)/2 = 1
7. Gradient Descent Update Rule
w := w - \alpha \cdot \frac{\partial J}{\partial w}
Explanation:
• Update weight by subtracting the gradient (slope of error)
• α (alpha) = learning rate
Example:
If gradient = 0.5 and learning rate = 0.1, then:
w := w - 0.1 \times 0.5 = w - 0.05
8. Logistic Regression (Sigmoid Function)
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = wX + b
Explanation:
Outputs probability between 0 and 1. Used in binary classification.
Example:
If z = 2,
\sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.88
9. Cross-Entropy Loss (Binary)
\text{Loss} = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]
Explanation:
Measures how far the prediction probability is from the actual label.
Example:
If y = 1 and prediction = 0.9,
\text{Loss} = -\log(0.9) \approx 0.105
UNIT 3: SVM & Decision Trees
10. SVM Optimization Objective (Hard Margin)
\text{Minimize: } \frac{1}{2} \|w\|^2
Subject to:
y_i(w \cdot x_i + b) \geq 1
Explanation:
Tries to find the hyperplane with the largest margin (distance between classes).
11. Gini Impurity (for Decision Trees)
G = 1 - \sum_{i=1}^{n} p_i^2
Explanation:
Measures how “mixed” the classes are. Lower Gini = purer node.
Example:
50% yes, 50% no → Gini = 1 - (0.5² + 0.5²) = 0.5
100% yes → Gini = 0
12. Entropy (Information Gain)
H = -\sum_{i=1}^{n} p_i \log_2(p_i)
Explanation:
Used to measure uncertainty. Used in ID3 decision tree algorithm.
Example:
p(Yes) = 0.8, p(No) = 0.2
H = -[0.8 \log_2(0.8) + 0.2 \log_2(0.2)] \approx 0.72
UNIT 4: Deep Learning
13. Neuron Output
z = w_1x_1 + w_2x_2 + ... + w_nx_n + b
\text{Output} = \text{Activation}(z)
Explanation:
A neuron computes weighted sum of inputs and passes it through an activation function.
Example:
For inputs x₁=1, x₂=2; weights w₁=0.5, w₂=0.3, b=0.2:
z = 0.5 \times 1 + 0.3 \times 2 + 0.2 = 1.3
14. ReLU Activation Function
f(x) = \max(0, x)
Explanation:
If input is negative, output is 0. Else, return the input. Used for speed and performance.
Example:
f(3) = 3, f(-5) = 0
15. Softmax Function (for Multiclass Output)
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
Explanation:
Gives probabilities for each class in multiclass classification. All outputs sum to 1.
Example:
If z = [2, 1, 0], softmax returns probabilities of about [0.67, 0.24, 0.09]
16. Categorical Cross-Entropy (for Softmax)
\text{Loss} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)
Explanation:
Used to measure how close predicted class probabilities are to actual class.
That's All the Important Formulas!