UNIT-III: STATISTICAL LEARNING
1. Explain Bayesian reasoning in Machine Learning with an example. (10 marks)
Bayesian reasoning is a probabilistic approach in machine learning based on Bayes’ Theorem, which
describes how to update the probability of a hypothesis as more evidence or information becomes available.
Bayes’ Theorem:
P(H | E) = [P(E | H) · P(H)] / P(E)
Where:
P(H | E): Posterior probability (probability of hypothesis H given evidence E)
P(E | H): Likelihood (probability of evidence given hypothesis)
P(H): Prior probability (initial belief before evidence)
P(E): Marginal probability of evidence
Bayesian Reasoning in ML:
Used in Bayesian classification, Naïve Bayes algorithms, Bayesian networks, and probabilistic
inference.
Example: Email Spam Classification
Let’s say:
H : Email is spam
E : Email contains the word “lottery”
We want to compute P(spam | lottery).
Assume:
P(spam) = 0.4
P(lottery | spam) = 0.8
P(lottery) = 0.5
Apply Bayes' Theorem:
P(spam | lottery) = (0.8 × 0.4) / 0.5 = 0.64
So, there’s a 64% chance that the email is spam if it contains “lottery”.
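A minimal sketch of this calculation in Python (the probability values are the assumed figures from the example above):

```python
def posterior(prior_h, likelihood_e_given_h, marginal_e):
    """Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood_e_given_h * prior_h / marginal_e

# Spam example: H = "email is spam", E = "email contains 'lottery'"
p_spam_given_lottery = posterior(prior_h=0.4,               # P(spam)
                                 likelihood_e_given_h=0.8,  # P(lottery | spam)
                                 marginal_e=0.5)            # P(lottery)
print(p_spam_given_lottery)  # 0.64
```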
Applications:
Naïve Bayes Classifier (text classification, sentiment analysis)
Bayesian Optimization (hyperparameter tuning)
Bayesian Networks (modeling uncertainty in medical diagnosis)
Advantages:
Handles uncertainty and prior knowledge.
Works well with small datasets.
Computationally efficient.
Limitations:
Requires strong assumptions (e.g., feature independence in Naïve Bayes).
Computing priors can be difficult in complex domains.
Conclusion:
Bayesian reasoning provides a powerful mathematical framework for dealing with uncertainty, making it
widely applicable in many ML tasks involving probability-based decision-making.
2. Describe the K-Nearest Neighbor (KNN) classifier. What are its advantages and
limitations? (10 marks)
K-Nearest Neighbor (KNN) is a non-parametric, instance-based supervised learning algorithm used for
classification and regression.
How KNN Works:
1. Choose K (number of neighbors).
2. Calculate distance (usually Euclidean) between the test point and all training points.
3. Select the K closest neighbors.
4. Vote for classification (majority class) or average for regression.
Formula (Euclidean Distance):
d(x, y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
Where x and y are data points in n-dimensional space.
Example:
Suppose we want to classify a fruit as apple or orange based on features like weight and color. If K = 3 and
among the 3 closest fruits, 2 are apples and 1 is orange, the new fruit is classified as apple.
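A minimal KNN sketch in Python for this fruit example, with hypothetical (weight, colour-score) features and K = 3:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                      # majority class among the neighbours

# Hypothetical fruit data: [weight in grams, colour score]
X_train = np.array([[150, 0.80], [160, 0.70], [120, 0.30], [155, 0.75], [115, 0.35]])
y_train = np.array(["apple", "apple", "orange", "apple", "orange"])
print(knn_predict(X_train, y_train, np.array([152, 0.72])))  # -> "apple"
```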
Advantages:
Simple and easy to implement
No explicit training phase (lazy learning); new training data can be added at any time
Naturally handles multi-class classification
Adaptable to non-linear decision boundaries
Limitations:
Slow for large datasets (needs to compute distance to all points)
Sensitive to irrelevant or unscaled features
Memory-intensive — stores entire dataset
Choice of K and distance metric affects performance
Applications:
Handwritten digit recognition (e.g., MNIST)
Recommendation systems
Medical diagnosis (predicting disease type)
Conclusion:
KNN is a powerful yet simple algorithm for classification and regression, best used when the dataset is small
and well-preprocessed. Its performance depends heavily on feature scaling and choice of K.
3. Derive the Least Square Error Criterion for Linear Regression. (10 marks)
Linear Regression is a supervised learning algorithm used to model the relationship between an input
variable X and an output variable Y by fitting a linear equation:
ŷ = w₀ + w₁x
Where:
ŷ: predicted output
w₀: intercept (bias)
w₁: slope (weight)
Objective:
To find the best-fitting line by minimizing the error between predicted and actual values using the Least
Squares Error (LSE) criterion.
1. Define Error (Residual):
For each data point i:
eᵢ = yᵢ − ŷᵢ = yᵢ − (w₀ + w₁xᵢ)
2. Define the Cost Function (Sum of Squared Errors):
J(w₀, w₁) = Σᵢ₌₁ⁿ (yᵢ − (w₀ + w₁xᵢ))²
This function measures how well the model fits the data. The goal is to minimize J(w₀, w₁).
3. Derive with Respect to Parameters:
To find the optimal values of w₀ and w₁, we use calculus to minimize the cost function by setting its partial derivatives to zero.
Partial Derivatives:
∂J/∂w₀ = −2 Σᵢ₌₁ⁿ (yᵢ − w₀ − w₁xᵢ)
∂J/∂w₁ = −2 Σᵢ₌₁ⁿ xᵢ(yᵢ − w₀ − w₁xᵢ)
4. Solve the Normal Equations:
Using the above derivatives, solve the following system of equations:
Σyᵢ = n·w₀ + w₁·Σxᵢ          (1)
Σxᵢyᵢ = w₀·Σxᵢ + w₁·Σxᵢ²     (2)
Solving these gives the optimal parameters:
w₁ = ( n·Σxᵢyᵢ − Σxᵢ·Σyᵢ ) / ( n·Σxᵢ² − (Σxᵢ)² )
w₀ = ȳ − w₁·x̄
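A minimal sketch of these closed-form estimates in Python, on a small hypothetical dataset generated roughly as y = 1 + 2x:

```python
import numpy as np

def least_squares_fit(x, y):
    """Closed-form simple linear regression; returns (w0, w1)."""
    n = len(x)
    w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    w0 = np.mean(y) - w1 * np.mean(x)
    return w0, w1

# Hypothetical data roughly following y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])
w0, w1 = least_squares_fit(x, y)
print(w0, w1)  # intercept ≈ 1, slope ≈ 2
```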
Conclusion:
Least Squares Error criterion helps determine the best-fitting line by minimizing the total squared
differences between actual and predicted outputs.
4. Explain Logistic Regression for classification tasks. How does it differ from Linear
Regression? (10 marks)
Logistic Regression is a supervised learning algorithm used for binary (or multi-class) classification, not
regression, despite its name.
Purpose:
To predict the probability that a given input belongs to a certain class (e.g., yes/no, spam/ham).
1. Problem with Linear Regression in Classification:
Linear regression can output any real number, but classification problems need probabilities in the range
[0, 1].
2. Logistic Function (Sigmoid):
To map output to a probability, logistic regression uses the sigmoid function:
σ(z) = 1 / (1 + e^(−z)), where z = w₀ + w₁x
⇒ P(y = 1 | x) = 1 / (1 + e^(−(w₀ + w₁x)))
3. Classification Rule:
ŷ = 1 if P(y = 1 | x) > 0.5, otherwise ŷ = 0
4. Cost Function (Cross-Entropy Loss):
Since squared error is not suitable for classification, logistic regression uses log-loss:
J(w) = −(1/n) Σᵢ₌₁ⁿ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]
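A minimal sketch of the sigmoid, the 0.5-threshold rule, and the cross-entropy loss in Python; the weights and data points below are hypothetical illustrations, not a fitted model:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w0, w1):
    """P(y = 1 | x) for one-feature logistic regression."""
    return sigmoid(w0 + w1 * x)

def cross_entropy(y_true, y_prob):
    """Average log-loss over a set of predictions."""
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical weights and labelled points
w0, w1 = -4.0, 1.5
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
p = predict_proba(x, w0, w1)
y_hat = (p > 0.5).astype(int)  # classification rule with threshold 0.5
print(y_hat, cross_entropy(y, p))
```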
Differences Between Logistic and Linear Regression:
| Feature           | Linear Regression   | Logistic Regression               |
|-------------------|---------------------|-----------------------------------|
| Output            | Continuous values   | Probabilities in [0, 1]           |
| Use Case          | Regression problems | Classification problems           |
| Activation        | None                | Sigmoid function                  |
| Cost Function     | Mean squared error  | Cross-entropy loss                |
| Decision Boundary | Not defined         | Defined via threshold (e.g., 0.5) |
Applications:
Spam detection
Credit risk prediction
Medical diagnosis (e.g., predicting disease presence)
Conclusion:
Logistic regression is a robust, interpretable method for classification. It differs from linear regression by
outputting probabilities and using a sigmoid activation and cross-entropy loss.
5. Discuss Fisher’s Linear Discriminant and its role in classification. (10 marks)
What is Fisher’s Linear Discriminant?
Fisher’s Linear Discriminant (FLD) is a supervised dimensionality reduction technique used in
classification problems. It projects high-dimensional data onto a line in such a way that class separability is
maximized.
Unlike Principal Component Analysis (PCA), which finds directions of maximum variance, FLD finds the direction that maximizes class separation.
Objective:
To find a projection vector w such that:
y = wᵀx
where y is the 1D projection and x is a data point.
The goal is to maximize the distance between class means while minimizing the variance within each
class.
Fisher’s Criterion:
J(w) = (μ₁ − μ₂)² / (s₁² + s₂²)
Where:
μ₁, μ₂ are the projected class means,
s₁², s₂² are the variances of each class after projection.
This can be expressed in matrix form as:
J(w) = (wᵀS_B w) / (wᵀS_W w)
Where:
S_B: Between-class scatter matrix
S_W: Within-class scatter matrix
Solution:
The optimal projection vector w* is obtained by:
w* = S_W⁻¹ (μ₁ − μ₂)
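A minimal sketch of this computation in Python, using small hypothetical 2-D samples for the two classes:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Compute the FLD projection vector w* = S_W^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two classes' scatter matrices
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    S_W = S1 + S2
    return np.linalg.solve(S_W, mu1 - mu2)  # same as S_W^{-1} (mu1 - mu2)

# Hypothetical 2-D features (e.g., petal length, petal width) for two classes
X1 = np.array([[1.4, 0.20], [1.3, 0.20], [1.5, 0.30], [1.4, 0.25]])
X2 = np.array([[4.5, 1.50], [4.7, 1.40], [4.2, 1.30], [4.6, 1.45]])
w = fisher_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w  # 1-D projections y = w^T x; the classes separate along this line
print(w)
```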
Role in Classification:
Projects data to 1D (or low-dimensional) space for better class separation.
Especially useful in binary classification.
Used as a preprocessing step before applying a classifier like logistic regression, SVM, or KNN.
Example:
Classifying two types of flowers (e.g., Setosa vs. Versicolor) based on features like petal width and length.
FLD finds the line along which the classes are most separated.
Advantages:
Improves class separability.
Works well with small datasets.
Easy to implement and interpret.
Limitations:
Assumes a linear projection is sufficient to separate the classes.
Works best with normally distributed classes with equal covariance.
Conclusion:
Fisher’s Linear Discriminant helps reduce dimensionality while preserving class discrimination, making it
an effective tool for classification and preprocessing in ML pipelines.
6. What is the Minimum Description Length (MDL) principle? How is it used in model
selection? (10 marks)
What is MDL Principle?
The Minimum Description Length (MDL) principle is a formalization of Occam’s Razor in information
theory. It states:
"The best model is the one that compresses the data most effectively."
In other words, the optimal model is the one that minimizes the total description length of:
1. The model itself, and
2. The data given the model (i.e., the errors or residuals).
Mathematically:
L(D, M) = L(M) + L(D | M)
Where:
L(M): Length (complexity) of the model,
L(D | M): Length of the data when encoded with the model (i.e., how well the model explains the data),
L(D, M): Total length of the description.
Interpretation:
Simple models have short L(M) but may fit poorly (long L(D | M)).
Complex models fit well (short L(D | M)) but are hard to describe (long L(M)).
MDL aims to balance model simplicity and data fit.
Use in Model Selection:
1. Compare multiple models (e.g., polynomial regression of degree 1, 2, 3...).
2. Calculate L(M) + L(D | M) for each.
3. Select the model with the smallest total length.
Example:
Choosing between:
A linear regression model with fewer coefficients,
A higher-degree polynomial that fits training data better.
MDL may favor the simpler model if the added complexity of the polynomial does not justify the
improvement in fit (to avoid overfitting).
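A minimal sketch of such a comparison in Python, using a crude two-part score (a hypothetical 32 bits per parameter for L(M) plus a Gaussian-style code length for the residuals as L(D | M)); real MDL codings are more careful, so this only illustrates the trade-off:

```python
import numpy as np

def description_length(x, y, degree, bits_per_param=32):
    """Crude two-part MDL score L(M) + L(D | M), in bits (illustrative only)."""
    coeffs = np.polyfit(x, y, degree)             # candidate model M: polynomial of given degree
    rss = np.sum((y - np.polyval(coeffs, x))**2) + 1e-12
    n = len(x)
    L_M = bits_per_param * (degree + 1)           # cost of encoding the model parameters
    L_D_given_M = 0.5 * n * np.log2(rss / n)      # cost of encoding residuals (up to constants)
    return L_M + L_D_given_M

# Hypothetical noisy linear data: MDL should favour the low-degree model
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
scores = {d: description_length(x, y, d) for d in (1, 2, 3, 5)}
print(min(scores, key=scores.get), scores)        # smallest total length wins
```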
Applications:
Model selection in regression and classification.
Decision tree pruning.
Feature selection.
Comparing probabilistic models.
Advantages:
Theoretically sound and general.
Prevents overfitting by penalizing model complexity.
Limitations:
Requires a way to quantify model and data length (encoding scheme).
May be computationally intensive for large models.
Conclusion:
The MDL principle provides a principled way to select models that generalize well by trading off accuracy
and complexity, aligning closely with the goals of machine learning.