Machine Learning
Dr. Indu Joshi
Assistant Professor at
Indian Institute of Technology Mandi
18 August 2025
Supervised Learning
• Let’s start by talking about a few examples of supervised
learning problems. Suppose we have a dataset giving the
living areas and prices of some houses from Portland:
Living area (feet²)    Price ($1000s)
2104                   400
1600                   330
2400                   369
1416                   232
3000                   540
• Given data like this, how can we learn to predict the prices of
other houses in Portland, as a function of the size of their
living areas?
Classification vs. Regression
• Objective:
• Regression: Predicts a continuous numerical value.
• Classification: Predicts a discrete category or class label.
• Output Type:
• Regression: Real numbers (e.g., price, temperature, salary).
• Classification: Class labels (e.g., spam/ham, disease/no
disease).
Classification vs. Regression (contd.)
• Error Metrics:
• Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R² score (quantifies how well a regression model explains the variance in the dependent variable).
• Classification: Accuracy, Precision, Recall, F1-score, AUC-ROC.
• Learning:
• Regression: Learns a mapping function that outputs a continuous value.
• Classification: Learns a decision boundary that separates the different classes.
Supervised Learning
• To establish notation for future use, we’ll use $x^{(i)}$ to denote the “input” variables (living area in this example), also called input features, and $y^{(i)}$ to denote the “output” or target variable that we are trying to predict (price).
• A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we’ll be using to learn (a list of m training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$) is called a training set.
• We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ the space of output values. In this example, $\mathcal{X} = \mathcal{Y} = \mathbb{R}$.
Linear Regression
To make our housing example more interesting, let’s consider a
slightly richer dataset in which we also know the number of
bedrooms in each house:
Living area (feet²)    # bedrooms    Price ($1000s)
2104                   3             400
1600                   3             330
2400                   3             369
1416                   2             232
3000                   4             540
Here, the x’s are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms.
Linear Regression (contd.)
• To perform supervised learning, we must decide how to represent functions/hypotheses h in a computer. As an initial choice, let’s approximate y as a linear function of x:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
• Here, the $\theta_i$’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. For simplicity, we will drop the θ subscript in $h_\theta(x)$ and write it more simply as h(x).
• To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that
$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$
Linear Regression (contd.)
• where on the right-hand side above we are viewing θ and x both as vectors, and n is the number of input variables (not counting $x_0$).
• Now, given a training set, how do we pick or learn the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have.
• To formalize this, we define a function that measures, for each value of the θ’s, how close the $h_\theta(x^{(i)})$’s are to the corresponding $y^{(i)}$’s.
• We define the cost function:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$
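As a concrete illustration (not part of the original slides), the short NumPy sketch below evaluates $h_\theta(x)$ and $J(\theta)$ on the five-house dataset above; the intercept feature $x_0 = 1$ is prepended to each example, and the particular θ used is an arbitrary illustrative choice.

```python
import numpy as np

# Training set from the housing table: [x0 = 1 (intercept), living area, # bedrooms].
X = np.array([
    [1.0, 2104, 3],
    [1.0, 1600, 3],
    [1.0, 2400, 3],
    [1.0, 1416, 2],
    [1.0, 3000, 4],
])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])  # prices in $1000s

def h(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = h(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

theta = np.array([0.0, 0.15, 10.0])   # arbitrary illustrative parameters
print(J(theta, X, y))                 # cost for this choice of theta
```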
Gradient Descent
• We want to choose θ so as to minimize J(θ).
• To do so, let’s use a search algorithm that starts with an
initial guess for θ and repeatedly changes θ to make J(θ)
smaller, until we converge to a value that minimizes J(θ).
• Specifically, we use the gradient descent algorithm:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta),$$
where α is the learning rate. The update is performed simultaneously for all values of j = 0, …, n.
Gradient Descent
• To compute $\frac{\partial}{\partial \theta_j} J(\theta)$ for a single training example (x, y), we have:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j}\, \frac{1}{2}\left( h_\theta(x) - y \right)^2$$
$$= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)$$
$$= \left( h_\theta(x) - y \right) x_j.$$
Thus, for a single training example, the update rule becomes:
$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.$$
This rule is called the LMS update rule (Least Mean Squares) and is also known as the Widrow-Hoff learning rule.
Gradient Descent
This rule has several natural and intuitive properties. For example:
• The magnitude of the update is proportional to the error term $y^{(i)} - h_\theta(x^{(i)})$.
• If the prediction $h_\theta(x^{(i)})$ is close to the actual value $y^{(i)}$, the parameter update will be small.
• If the prediction error is large, the update will be correspondingly larger.
Gradient Descent: Numerical Example
Minimize the function
$$f(x) = x^2$$
using gradient descent with 5 iterations.
Gradient: $f'(x) = 2x$
• Initial value: $x_0 = 1.0$
• Learning rate: $\alpha = 0.2$
• Number of iterations: 5
Update Rule:
$$x_{k+1} = x_k - \alpha f'(x_k) = x_k - 0.2(2x_k) = x_k - 0.4\,x_k = 0.6\,x_k$$
Each step reduces $x_k$ by 40%.
Iterations Step-by-Step
Iteration k    x_k       Gradient 2x_k    Updated x_{k+1} = 0.6 x_k
0              1.0000    2.0000           0.6000
1              0.6000    1.2000           0.3600
2              0.3600    0.7200           0.2160
3              0.2160    0.4320           0.1296
4              0.1296    0.2592           0.0778
5              0.0778    0.1555           0.0467
The table lists six updates (rows k = 0 through 5). After the five stated iterations, $x_5 = 0.6^5 \approx 0.0778$ with $f(x_5) \approx 0.0060$; the extra update in the last row gives $x_6 \approx 0.0467$ with $f(x_6) \approx 0.0022$. Either way, the iterate is very close to the minimizer $x = 0$, indicating successful convergence.
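The loop below is a minimal Python sketch (not from the slides) that reproduces the six updates listed in the table.

```python
# Gradient descent on f(x) = x**2 with f'(x) = 2x.
x = 1.0        # initial value x0
alpha = 0.2    # learning rate
for k in range(6):            # the six updates shown in the table
    grad = 2 * x              # f'(x_k)
    x = x - alpha * grad      # x_{k+1} = x_k - alpha * f'(x_k) = 0.6 * x_k
    print(k, round(x, 4))     # 0.6, 0.36, 0.216, 0.1296, 0.0778, 0.0467
print(round(x ** 2, 4))       # f(x) ≈ 0.0022
```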
Role of Hyperparameter α
• Too Small α: Convergence is very slow, requiring many iterations. On non-convex objectives, such small updates can also leave the model stuck in flat regions or poor local minima.
• Too Large α: May cause the optimization process to
overshoot the minimum, leading to divergence instead of
convergence.
• Optimal α: Ensures steady progress towards the minimum,
balancing convergence speed and stability.
Numerical Example: High Learning Rate Causing Instability
Consider the function
$$f(x) = (x - 3)^2$$
with gradient
$$\nabla f(x) = 2(x - 3).$$
Applying gradient descent with:
• Initial value: $x_0 = 0$
• Learning rate: $\alpha = 1.5$
• Maximum iterations: 5
Iteration Steps:
$$x_1 = 0 - 1.5 \times 2(0 - 3) = 9$$
$$x_2 = 9 - 1.5 \times 2(9 - 3) = -9$$
$$x_3 = -9 - 1.5 \times 2(-9 - 3) = 27$$
$$x_4 = 27 - 1.5 \times 2(27 - 3) = -45$$
$$x_5 = -45 - 1.5 \times 2(-45 - 3) = 99$$
Observations
• The values of x oscillate instead of converging to 3.
• A high learning rate causes the updates to overshoot.
• This results in instability and prevents convergence.
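A few lines of Python (a sketch, not part of the slides) reproduce this oscillating, diverging trajectory.

```python
# Gradient descent on f(x) = (x - 3)**2 with an overly large learning rate.
x = 0.0        # initial value x0
alpha = 1.5    # learning rate (too large for this problem)
for k in range(5):
    x = x - alpha * 2 * (x - 3)   # gradient step with f'(x) = 2(x - 3)
    print(k + 1, x)               # 9.0, -9.0, 27.0, -45.0, 99.0
```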
Impact of Initial Point
• A good initial value can speed up convergence.
• A bad initial value can slow down learning, cause oscillations,
or even make the algorithm diverge.
• For non-convex problems, the initial value may determine
which local minimum the algorithm converges to.
Poor Initialization
• Consider the function:
$$f(x) = x^3 - 3x, \qquad f'(x) = 3x^2 - 3$$
Setting $f'(x) = 0$:
$$3x^2 - 3 = 0 \implies x^2 = 1 \implies x = \pm 1$$
• $f''(1) = 6(1) = 6 > 0 \Rightarrow$ local minimum at $x = 1$
• $f''(-1) = 6(-1) = -6 < 0 \Rightarrow$ local maximum at $x = -1$
• Learning rate: $\alpha = 0.1$
• A poor choice is $x_0 = -1.5$, which lies just beyond the local maximum at $x = -1$, on the opposite side from the local minimum.
Poor Initialization
$$x_1 = x_0 - \alpha f'(x_0) = -1.5 - 0.1\left(3(-1.5)^2 - 3\right) = -1.875$$
$$x_2 = x_1 - 0.1\left(3(-1.875)^2 - 3\right) = -2.6297$$
$$x_3 = x_2 - 0.1\left(3(-2.6297)^2 - 3\right) = -4.4043$$
The iterates grow rapidly in magnitude: because $x_0$ lies on the far side of the local maximum, every step moves downhill away from the local minimum at $x = 1$, and since $f(x) \to -\infty$ as $x \to -\infty$, the algorithm diverges due to the poor initialization.
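The same kind of sketch (again, not from the slides) reproduces these iterates.

```python
# Gradient descent on f(x) = x**3 - 3x from a poor starting point.
x = -1.5       # initial value x0, just beyond the local maximum at x = -1
alpha = 0.1    # learning rate
for k in range(3):
    x = x - alpha * (3 * x**2 - 3)   # gradient step with f'(x) = 3x^2 - 3
    print(k + 1, round(x, 4))        # -1.875, -2.6297, -4.4043
```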
Batch Gradient Descent
• The LMS rule for a single training example can be generalized
to a training set with more than one example. The update
rule becomes:
Repeat until convergence: {
$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$
}
• This method iterates over the entire training set on every step
and is called batch gradient descent.
• The reader can verify that the summation in the update rule above is simply $-\frac{\partial J(\theta)}{\partial \theta_j}$ (the negative of the partial derivative of the original cost function J).
• Hence, this update is exactly gradient descent on J.
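A compact NumPy sketch of batch gradient descent for the housing data above; the feature standardization, learning rate, and iteration count are illustrative assumptions rather than values from the slides.

```python
import numpy as np

# Housing data: [living area, # bedrooms] and price ($1000s).
X_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Standardize features so a single learning rate works for both dimensions (assumption).
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), X_std])      # prepend intercept column x0 = 1

theta = np.zeros(X.shape[1])
alpha, n_iters = 0.1, 500                          # illustrative hyperparameters

for _ in range(n_iters):
    errors = y - X @ theta                         # y^(i) - h_theta(x^(i)) for all i
    theta = theta + alpha * (X.T @ errors)         # batch LMS update for every j

print(theta)        # learned parameters
print(X @ theta)    # fitted prices on the training set
```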
Properties of Gradient Descent
• Gradient descent can be susceptible to local minima in
general. However, for linear regression, the optimization
problem is convex, meaning there is only one global minimum
and no local minima.
• For linear regression, gradient descent therefore always converges to the global minimum, provided the learning rate α is not too large.
• The cost function J(θ) for linear regression is a convex
quadratic function.
Stochastic Gradient Descent Algorithm
Loop {
  for i = 1 to m {
    $$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$
  }
}
• In this algorithm, we repeatedly run through the training set,
and each time we encounter a training example, we update
the parameters according to the gradient of the error with
respect to that single training example only.
• This algorithm is called stochastic gradient descent (also
incremental gradient descent).
• By contrast, batch gradient descent has to scan through the entire training set before taking a single step, a costly operation when m is large.
Stochastic Gradient Descent Algorithm
• Stochastic gradient descent can start making progress right
away, and continues to make progress with each example it
looks at.
• Often, stochastic gradient descent gets θ “close” to the
minimum much faster than batch gradient descent.
• Note however that it may never “converge” to the minimum,
and the parameters θ will keep oscillating around the minimum
of J(θ); but in practice most of the values near the minimum
will be reasonably good approximations to the true minimum.
• For these reasons, particularly when the training set is large,
stochastic gradient descent is often preferred over batch
gradient descent.
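For comparison, a minimal NumPy sketch of stochastic gradient descent on the same standardized housing features; the random shuffling, learning rate, and epoch count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same housing data, standardized, with an intercept column.
X_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), X_std])

theta = np.zeros(X.shape[1])
alpha, n_epochs = 0.05, 200                        # illustrative hyperparameters

for _ in range(n_epochs):
    for i in rng.permutation(len(y)):              # visit the examples in random order
        error = y[i] - X[i] @ theta                # y^(i) - h_theta(x^(i))
        theta = theta + alpha * error * X[i]       # update using this single example

print(theta)   # hovers near the least-squares solution rather than converging exactly
```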
Probabilistic Interpretation
• Under suitable probabilistic assumptions on the data (made precise below), least-squares regression corresponds to finding the maximum likelihood estimate of θ.
• These assumptions justify least-squares as a natural method
for regression.
Maximum Likelihood Estimation (MLE)
• Maximum Likelihood Estimation (MLE) is a method for
estimating the parameters of a probability distribution by
maximizing the likelihood function.
• The idea is to find the parameter values that make the
observed data most probable.
Probabilistic Interpretation (contd.)
Assume that the target variables and the inputs are related as
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$
where $\epsilon^{(i)}$ is a random noise term, assumed to be IID Gaussian with mean zero and variance $\sigma^2$:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2).$$
The density of $\epsilon^{(i)}$ is:
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right).$$
Probabilistic Interpretation (contd.)
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).$$
Here, $p(y^{(i)} \mid x^{(i)}; \theta)$ represents the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by θ. We can also express this as:
$$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2).$$
Likelihood Function
• Given the design matrix X (containing all the $x^{(i)}$) and θ, the probability of the data y is $p(y \mid X; \theta)$.
• When this is treated as a function of θ, it is referred to as the likelihood function:
$$L(\theta) = L(\theta; X, y) = p(y \mid X; \theta).$$
• By the independence assumption on the $\epsilon^{(i)}$ (and consequently the $y^{(i)}$ given the $x^{(i)}$), we can write:
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).$$
Maximum Likelihood Estimation
• To find the best θ, we use the principle of maximum
likelihood:
Maximize L(θ).
• Instead of maximizing L(θ) directly, we can maximize any
strictly increasing function of L(θ).
• A common choice is to maximize the log-likelihood, which
simplifies derivations.
• The log-likelihood function $\ell(\theta)$ is given by:
$$\ell(\theta) = \log L(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).$$
Maximum Likelihood Estimation (contd.)
$$\ell(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{i=1}^{m} \frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2}$$
$$\ell(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$
Hence, maximizing $\ell(\theta)$ is equivalent to minimizing
$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,$$
which is the least-squares cost function $J(\theta)$.
Maximum Likelihood Estimation
• To summarize, under the previous probabilistic assumptions
on the data, least-squares regression corresponds to finding
the maximum likelihood estimate of θ.
Ridge Regression
• Ridge regression is a regularized version of Ordinary Least
Squares (OLS).
• It modifies the loss function by adding a penalty on the size of
parameters (weights).
• The L2 penalty shrinks the parameters, yielding a more stable model that typically generalizes better on test data.
$$\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{i=1}^{n} \left( y_i - \theta^\top x_i \right)^2 + \lambda \|\theta\|_2^2 \right\}$$
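A small NumPy sketch (not from the slides) of ridge regression via its closed-form solution $(X^\top X + \lambda I)^{-1} X^\top y$ on the housing data; the value λ = 1.0 and the standardization step are illustrative assumptions, and, matching the objective as written, the intercept is penalized along with the other weights.

```python
import numpy as np

# Housing data with standardized features and an intercept column.
X_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), X_std])

lam = 1.0                       # illustrative regularization strength
d = X.shape[1]

# Minimizer of sum_i (y_i - theta^T x_i)^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # ordinary least squares, for comparison

print(theta_ridge)   # shrunk toward zero relative to the OLS solution
print(theta_ols)
```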
Maximum A Posteriori (MAP) Estimation
Ridge regression can also be derived as a maximum a posteriori estimate, by placing a zero-mean Gaussian prior on θ:
$$P(\theta) = \frac{1}{\sqrt{2\pi\tau^2}} \, e^{-\frac{\theta^\top \theta}{2\tau^2}}$$
$$\hat{\theta} = \arg\max_{\theta}\ P(\theta \mid y_1, x_1, \dots, y_n, x_n)$$
$$= \arg\max_{\theta}\ \frac{P(y_1, x_1, \dots, y_n, x_n \mid \theta)\, P(\theta)}{P(y_1, x_1, \dots, y_n, x_n)}$$
$$= \arg\max_{\theta}\ P(y_1, x_1, \dots, y_n, x_n \mid \theta)\, P(\theta)$$
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i, x_i \mid \theta) \right] P(\theta)$$
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta)\, P(x_i \mid \theta) \right] P(\theta)$$
Maximum A Posteriori (MAP) Estimation (contd.)
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta)\, P(x_i) \right] P(\theta)$$
(the inputs $x_i$ do not depend on θ, so $P(x_i \mid \theta) = P(x_i)$, and these factors do not affect the maximization over θ)
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \right] P(\theta)$$
$$= \arg\max_{\theta}\ \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta) + \log P(\theta)$$
$$= \arg\min_{\theta}\ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( x_i^\top \theta - y_i \right)^2 + \frac{1}{2\tau^2}\, \theta^\top \theta$$
(substituting the Gaussian likelihood and the Gaussian prior, dropping additive constants, and turning the maximization of the log-posterior into a minimization of its negative)
Maximum A Posteriori (MAP) Estimation (contd.)
Rescaling the objective by $\frac{2\sigma^2}{n}$ (which does not change the minimizer) gives the ridge regression problem:
$$= \arg\min_{\theta}\ \frac{1}{n} \sum_{i=1}^{n} \left( x_i^\top \theta - y_i \right)^2 + \lambda \|\theta\|_2^2, \qquad \lambda = \frac{\sigma^2}{n\tau^2}$$
Thank You
Contact: [email protected]