
Machine Learning

Dr. Indu Joshi

Assistant Professor at
Indian Institute of Technology Mandi

18 August 2025
Supervised Learning

• Let’s start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of some houses from Portland:

  Living area (feet²)   Price ($1000s)
  2104                  400
  1600                  330
  2400                  369
  1416                  232
  3000                  540
• Given data like this, how can we learn to predict the prices of
other houses in Portland, as a function of the size of their
living areas?
Classification vs. Regression

• Objective:
• Regression: Predicts a continuous numerical value.
• Classification: Predicts a discrete category or class label.
• Output Type:
• Regression: Real numbers (e.g., price, temperature, salary).
• Classification: Class labels (e.g., spam/ham, disease/no disease).
Classification vs. Regression

• Error Metrics:
• Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R² score (quantifies how well a regression model explains the variance in the dependent variable).
• Classification: Accuracy, Precision, Recall, F1-score, AUC-ROC.
• Learning:
• Regression: Learns a mapping function that outputs a continuous value.
• Classification: Learns a decision boundary that separates the different classes.
Supervised Learning

• To establish notation for future use, we’ll use x^(i) to denote the “input” variables (living area in this example), also called input features, and y^(i) to denote the “output” or target variable that we are trying to predict (price).
• A pair (x^(i), y^(i)) is called a training example, and the dataset that we’ll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, . . . , m}, is called a training set.
• We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = ℝ.
Linear Regression

To make our housing example more interesting, let’s consider a slightly richer dataset in which we also know the number of bedrooms in each house:

  Living area (feet²)   # bedrooms   Price ($1000s)
  2104                  3            400
  1600                  3            330
  2400                  3            369
  1416                  2            232
  3000                  4            540

Here, the x’s are two-dimensional vectors in ℝ². For instance, x_1^(i) is the living area of the i-th house in the training set, and x_2^(i) is its number of bedrooms.
Linear Regression (contd.)
• To perform supervised learning, we must decide how to represent functions/hypotheses h in a computer. As an initial choice, let’s approximate y as a linear function of x:

  h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2

• Here, the θ_i’s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. For simplicity, we will drop the θ subscript in h_θ(x) and write it more simply as h(x).
• To simplify our notation, we also introduce the convention of letting x_0 = 1 (this is the intercept term), so that

  h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,
Linear Regression

[Figure slide; graphic not preserved in the text extraction.]
Linear Regression (contd.)

• where on the right-hand side above we are viewing θ and x both as vectors, and n is the number of input variables (not counting x_0).
• Now, given a training set, how do we pick or learn the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have.
• To formalize this, we define a function that measures, for each value of the θ’s, how close the h(x^(i))’s are to the corresponding y^(i)’s.
• We define the cost function:

  J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.
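
As a concrete illustration, here is a minimal NumPy sketch (not part of the original slides; the θ values are arbitrary placeholders) that evaluates the hypothesis h_θ(x) = θ^T x and the cost J(θ) on the five-house dataset above:

```python
import numpy as np

# Five training examples: [x_0 = 1 (intercept), living area, bedrooms]
X = np.array([[1, 2104, 3],
              [1, 1600, 3],
              [1, 2400, 3],
              [1, 1416, 2],
              [1, 3000, 4]], dtype=float)
y = np.array([400, 330, 369, 232, 540], dtype=float)  # price in $1000s

theta = np.array([0.0, 0.15, 20.0])  # arbitrary illustrative parameters

def h(theta, X):
    """Hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) * sum_i (h(x^(i)) - y^(i))^2."""
    residuals = h(theta, X) - y
    return 0.5 * residuals @ residuals

print(J(theta, X, y))  # cost for this particular choice of theta
```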
Gradient Descent

• We want to choose θ so as to minimize J(θ).
• To do so, let’s use a search algorithm that starts with an initial guess for θ and repeatedly changes θ to make J(θ) smaller, until we converge to a value that minimizes J(θ).
• Specifically, we use the gradient descent algorithm:

  \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta),

  where α is the learning rate. The update is performed for all values of j = 0, . . . , n simultaneously.
Gradient Descent
• To compute ∂/∂θ_j J(θ) for a single training example (x, y), we have:

  \frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2
  = \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)
  = \left( h_\theta(x) - y \right) x_j.

Thus, for a single training example, the update rule becomes:

  \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.

This rule is called the LMS update rule (Least Mean Squares) and is also known as the Widrow-Hoff learning rule.
Gradient Descent

This rule has several natural and intuitive properties. For example:
• The magnitude of the update is proportional to the error term (y^(i) − h_θ(x^(i))).
• If the prediction h_θ(x^(i)) is close to the actual value y^(i), the parameter update will be small.
• If the prediction error is large, the update will be correspondingly larger.
Gradient Descent: Numerical Example

Minimize the function

  f(x) = x²

using gradient descent. Gradient: f′(x) = 2x
• Initial value: x_0 = 1.0
• Learning rate: α = 0.2
• Number of iterations: 6
Update Rule:

  x_{k+1} = x_k − α f′(x_k)
  x_{k+1} = x_k − 0.2(2x_k) = x_k − 0.4 x_k = 0.6 x_k

Each step reduces x_k by 40%.
Iterations Step-by-Step

Iteration k   x_k      Gradient 2x_k   Updated x_{k+1} = 0.6 x_k
0             1.0000   2.0000          0.6000
1             0.6000   1.2000          0.3600
2             0.3600   0.7200          0.2160
3             0.2160   0.4320          0.1296
4             0.1296   0.2592          0.0778
5             0.0778   0.1555          0.0467

After 6 iterations (rows k = 0 through 5), we get:

  x_6 ≈ 0.0467

Function value:

  f(0.0467) = (0.0467)² ≈ 0.0022

This is very close to zero, indicating successful convergence.
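
A short Python sketch (not from the slides) reproduces the table above:

```python
# Gradient descent on f(x) = x^2, with gradient f'(x) = 2x.
x = 1.0        # initial value x_0
alpha = 0.2    # learning rate

for k in range(6):
    grad = 2 * x                 # gradient at the current point
    x_next = x - alpha * grad    # x_{k+1} = x_k - alpha * f'(x_k) = 0.6 * x_k
    print(f"k={k}  x_k={x:.4f}  grad={grad:.4f}  x_next={x_next:.4f}")
    x = x_next

print(f"final x = {x:.4f}, f(x) = {x * x:.4f}")  # x ≈ 0.0467, f(x) ≈ 0.0022
```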


Role of Hyperparameter α

• Too Small α: Convergence is very slow, requiring many iterations; with such small updates, the optimizer may also stall in shallow local minima or flat regions before reaching a good solution.
• Too Large α: May cause the optimization process to
overshoot the minimum, leading to divergence instead of
convergence.
• Optimal α: Ensures steady progress towards the minimum,
balancing convergence speed and stability.
Numerical Example: High Learning Rate Causing Instability
Consider the function:

  f(x) = (x − 3)²

with gradient:

  ∇f(x) = 2(x − 3)

Applying gradient descent with:
• Initial value: x_0 = 0
• Learning rate: α = 1.5
• Maximum iterations: 5
Iteration Steps:

  x_1 = 0 − 1.5 × 2(0 − 3) = 9
  x_2 = 9 − 1.5 × 2(9 − 3) = −9
  x_3 = −9 − 1.5 × 2(−9 − 3) = 27
Observations

  x_4 = 27 − 1.5 × 2(27 − 3) = −45
  x_5 = −45 − 1.5 × 2(−45 − 3) = 99

• The values of x oscillate instead of converging to 3.
• A high learning rate causes the updates to overshoot.
• This results in instability and prevents convergence.
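
For this quadratic the behavior is easy to see algebraically: the update gives x_{k+1} − 3 = (1 − 2α)(x_k − 3), so the iterates converge only when |1 − 2α| < 1, i.e. 0 < α < 1; with α = 1.5 the error is multiplied by −2 at every step. A small sketch (not from the slides) reproduces the divergence:

```python
# Gradient descent on f(x) = (x - 3)^2 with an overly large step size.
x = 0.0
alpha = 1.5  # |1 - 2*alpha| = 2 > 1, so the error doubles each step

for k in range(1, 6):
    x = x - alpha * 2 * (x - 3)  # x_{k+1} = x_k - alpha * grad f(x_k)
    print(f"x_{k} = {x:g}")      # prints 9, -9, 27, -45, 99
```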
Impact of Initial Point

• A good initial value can speed up convergence.
• A bad initial value can slow down learning, cause oscillations, or even make the algorithm diverge.
• For non-convex problems, the initial value may determine
which local minimum the algorithm converges to.
Poor Initialization

• Consider the function:

  f(x) = x³ − 3x
  f′(x) = 3x² − 3
  3x² − 3 = 0
  x² = 1 ⟹ x = ±1

• f″(1) = 6(1) = 6 > 0 ⇒ local minimum at x = 1
• f″(−1) = 6(−1) = −6 < 0 ⇒ local maximum at x = −1
• Learning rate: α = 0.1
• A poor choice would be x_0 = −1.5, which is close to the local maximum at x = −1.
Poor Initialization

  x_1 = x_0 − α∇f(x_0) = −1.5 − 0.1(3(−1.5)² − 3) = −1.875
  x_2 = x_1 − 0.1(3(−1.875)² − 3) = −2.6297
  x_3 = x_2 − 0.1(3(−2.6297)² − 3) = −4.4043

The iterates grow rapidly in magnitude, indicating divergence caused by the poor initialization: x_0 lies on the far side of the local maximum at x = −1, so every downhill step moves further from the local minimum at x = 1, and since f(x) = x³ − 3x is unbounded below, the iterates head toward −∞.
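
A sketch (not from the slides) of the same run:

```python
# Gradient descent on f(x) = x^3 - 3x from a poor starting point.
def grad(x):
    return 3 * x**2 - 3  # f'(x)

x, alpha = -1.5, 0.1  # x_0 lies to the left of the local maximum at x = -1

for k in range(1, 6):
    x = x - alpha * grad(x)
    print(f"x_{k} = {x:.4f}")  # -1.8750, -2.6297, -4.4043, ... (diverging)
```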
Batch Gradient Descent

• The LMS rule for a single training example can be generalized to a training set with more than one example. The update rule becomes:

  Repeat until convergence: {
    \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}   (for every j)
  }

• This method iterates over the entire training set on every step and is called batch gradient descent.
• The reader can verify that the summation term in the update rule above is simply −∂J(θ)/∂θ_j for the original cost function J, so the update is exactly θ_j := θ_j − α ∂J(θ)/∂θ_j.
• Hence, this is equivalent to performing gradient descent on J.
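
As a sketch (assuming NumPy; the toy data and hyperparameters are illustrative, not from the slides), batch gradient descent for linear regression looks like:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=5000):
    """Batch GD for linear regression; X must include the x_0 = 1 column."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        errors = y - X @ theta          # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * X.T @ errors   # sum the LMS update over all m examples
    return theta

# Tiny synthetic dataset: y = 1 + 2x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 1 + 2 * x + 0.05 * rng.standard_normal(50)
X = np.column_stack([np.ones_like(x), x])  # prepend the intercept column

print(batch_gradient_descent(X, y))  # approximately [1, 2]
```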
Properties of Gradient Descent

• Gradient descent can be susceptible to local minima in general. However, for linear regression the optimization problem is convex, meaning there is only one global minimum and no local minima.
• For linear regression, gradient descent therefore always converges to the global minimum, as long as the learning rate α is not too large.
• The cost function J(θ) for linear regression is a convex
quadratic function.
Stochastic Gradient Descent Algorithm
Loop {
  for i = 1 to m {
    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}   (for every j)
  }
}
• In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.
• This algorithm is called stochastic gradient descent (also incremental gradient descent).
• By contrast, batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if m is large.
Stochastic Gradient Descent Algorithm

• Stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at.
• Often, stochastic gradient descent gets θ “close” to the
minimum much faster than batch gradient descent.
• Note however that it may never “converge” to the minimum,
and the parameters θ will keep oscillating around the minimum
of J(θ); but in practice most of the values near the minimum
will be reasonably good approximations to the true minimum.
• For these reasons, particularly when the training set is large,
stochastic gradient descent is often preferred over batch
gradient descent.
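
A matching sketch (again assuming NumPy; the data and hyperparameters are illustrative):

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=200):
    """SGD for linear regression: one parameter update per training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta    # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]  # update from this one example only
    return theta

# Same kind of toy data as before: y = 1 + 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 1 + 2 * x + 0.05 * rng.standard_normal(50)
X = np.column_stack([np.ones_like(x), x])

print(sgd(X, y))  # hovers near [1, 2] rather than converging exactly
```

With a fixed learning rate the estimate keeps jittering around the minimum, which is exactly the oscillation described above; slowly decaying α over time is the usual remedy.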
Probabilistic Interpretation

• Under the probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ.
• These assumptions justify least-squares as a natural method
for regression.
Maximum Likelihood Estimation (MLE)

• Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a probability distribution by maximizing the likelihood function.
• The idea is to find the parameter values that make the observed data most probable.
Probabilistic Interpretation (contd.)
Assume that the target variables and the inputs are related as

  y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},

where ε^(i) represents random noise, assumed to be IID Gaussian with mean zero and variance σ²:

  \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2).

The density of ε^(i) is:

  p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right).
Probabilistic Interpretation (contd.)

  p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).

Here, p(y^(i) | x^(i); θ) represents the distribution of y^(i) given x^(i), parameterized by θ. We can also express:

  y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2).
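
Concretely, data generated under this model can be simulated as follows (a sketch, assuming NumPy; the θ and σ values are arbitrary illustrative choices):

```python
import numpy as np

# Simulate y^(i) = theta^T x^(i) + eps^(i), with eps^(i) ~ N(0, sigma^2) IID
rng = np.random.default_rng(4)
theta, sigma = np.array([1.0, 2.0]), 0.5

x = rng.uniform(0, 1, size=5)
X = np.column_stack([np.ones(5), x])  # x_0 = 1 intercept term
eps = sigma * rng.standard_normal(5)  # IID Gaussian noise
y = X @ theta + eps                   # so y^(i) | x^(i) ~ N(theta^T x^(i), sigma^2)
print(y)
```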


Likelihood Function
• Given the design matrix X (containing all the x^(i)) and θ, the probability of the data y is p(y | X; θ).
• When this is treated as a function of θ, it is referred to as the likelihood function:

  L(\theta) = L(\theta; X, y) = p(y \mid X; \theta).

• By the independence assumption on the ε^(i) (and consequently on the y^(i) given the x^(i)), we can write:

  L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)
            = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).
Maximum Likelihood Estimation

• To find the best θ, we use the principle of maximum likelihood: choose θ to maximize L(θ).
• Instead of maximizing L(θ) directly, we can maximize any strictly increasing function of L(θ).
• A common choice is to maximize the log-likelihood, which simplifies derivations.
• The log-likelihood function ℓ(θ) is given by:

  \ell(\theta) = \log L(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).
Maximum Likelihood Estimation (contd.)

  \ell(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} + \sum_{i=1}^{m} \left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right)
              = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

Hence, maximizing ℓ(θ) is equivalent to minimizing

  \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,

which is the least-squares cost function J(θ).
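
A quick numerical sanity check (a sketch, assuming NumPy; the data are synthetic) makes the equivalence tangible: since ℓ(θ) = const − J(θ)/σ², J is smallest exactly where ℓ is largest.

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 100, 0.5
X = np.column_stack([np.ones(m), rng.uniform(0, 1, m)])
y = X @ np.array([1.0, 2.0]) + sigma * rng.standard_normal(m)

def J(theta):
    r = y - X @ theta
    return 0.5 * r @ r  # least-squares cost

def ell(theta):
    r = y - X @ theta   # log-likelihood under the Gaussian noise model
    return m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - (r @ r) / (2 * sigma**2)

# Sweep theta_1 with theta_0 fixed: J decreases exactly where ell increases.
for t1 in (1.5, 2.0, 2.5):
    th = np.array([1.0, t1])
    print(f"theta_1={t1}: J={J(th):.2f}, ell={ell(th):.2f}")
```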


Maximum Likelihood Estimation

• To summarize, under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ.
Linear Regression

[Figure slides; graphics not preserved in the text extraction.]
Ridge Regression

• Ridge regression is a regularized version of ordinary least squares (OLS).
• It modifies the loss function by adding a penalty on the size of the parameters (weights).
• The L2 penalty shrinks the parameters, yielding a more stable model that often generalizes better to test data.

  \arg\min_{\theta} \left\{ \sum_{i=1}^{n} \left( y_i - \theta^\top x_i \right)^2 + \lambda \|\theta\|_2^2 \right\}
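
For illustration (a sketch assuming NumPy, not from the slides): the ridge objective above has the standard closed-form minimizer θ = (XᵀX + λI)⁻¹Xᵀy, which makes the stabilizing effect easy to see on nearly collinear features.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution: theta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy example with two nearly collinear features, where OLS is unstable
rng = np.random.default_rng(2)
x1 = rng.standard_normal(30)
x2 = x1 + 0.01 * rng.standard_normal(30)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.standard_normal(30)

print(ridge_fit(X, y, lam=0.0))  # OLS (lam = 0): large, unstable coefficients
print(ridge_fit(X, y, lam=1.0))  # ridge: coefficients shrunk toward [1, 1]
```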
Ridge Regression

[Figure slide; graphic not preserved in the text extraction.]
Maximum a Posteriori Estimation

Assume a Gaussian prior on θ:

  P(\theta) = \frac{1}{\sqrt{2\pi\tau^2}} \, e^{-\frac{\theta^\top \theta}{2\tau^2}}

The MAP estimate is the θ that maximizes the posterior:

  \hat{\theta} = \arg\max_{\theta} P(\theta \mid y_1, x_1, \ldots, y_n, x_n)
  = \arg\max_{\theta} \frac{P(y_1, x_1, \ldots, y_n, x_n \mid \theta) \, P(\theta)}{P(y_1, x_1, \ldots, y_n, x_n)}
  = \arg\max_{\theta} P(y_1, x_1, \ldots, y_n, x_n \mid \theta) \, P(\theta)
  = \arg\max_{\theta} \left[ \prod_{i=1}^{n} P(y_i, x_i \mid \theta) \right] P(\theta)
  = \arg\max_{\theta} \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \, P(x_i \mid \theta) \right] P(\theta)
Maximum a Posteriori Estimation (contd.)

Since x_i does not depend on θ, P(x_i | θ) = P(x_i) is constant in θ and can be dropped:

  = \arg\max_{\theta} \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \, P(x_i) \right] P(\theta)
  = \arg\max_{\theta} \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \right] P(\theta)
  = \arg\max_{\theta} \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta) + \log P(\theta)

Substituting the Gaussian densities (log P(y_i | x_i, θ) = −(x_i^⊤θ − y_i)²/(2σ²) + const and log P(θ) = −θ^⊤θ/(2τ²) + const) and flipping the sign:

  = \arg\min_{\theta} \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( x_i^\top \theta - y_i \right)^2 + \frac{1}{2\tau^2} \theta^\top \theta
Maximum a Posteriori Estimation (contd.)

Multiplying through by 2σ²/n (which does not change the minimizer):

  = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left( x_i^\top \theta - y_i \right)^2 + \lambda \|\theta\|_2^2, \qquad \lambda = \frac{\sigma^2}{n\tau^2}

Thus MAP estimation with a Gaussian prior on θ recovers exactly the ridge regression objective.
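
A closing numerical check (a sketch assuming NumPy; the data are synthetic) ties the lecture together: gradient descent on the MAP objective lands on the same θ as the equivalent closed-form ridge solution with λ′ = σ²/τ².

```python
import numpy as np

# MAP objective: (1/(2*sigma^2)) * sum_i (x_i^T theta - y_i)^2
#              + (1/(2*tau^2)) * theta^T theta
rng = np.random.default_rng(3)
n, sigma, tau = 40, 0.3, 1.0
X = rng.standard_normal((n, 2))
y = X @ np.array([1.0, -0.5]) + sigma * rng.standard_normal(n)

# Equivalent ridge closed form with lam = sigma^2 / tau^2
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Gradient descent on the MAP objective reaches the same point
theta, alpha = np.zeros(2), 1e-3
for _ in range(5000):
    grad = X.T @ (X @ theta - y) / sigma**2 + theta / tau**2
    theta -= alpha * grad

print(theta_map, theta)  # the two estimates agree
```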
Thank You

Contact: [email protected]
