Machine Learning
Dr. Indu Joshi
Assistant Professor at
Indian Institute of Technology Mandi
18 August 2025
Supervised Learning
• Let’s start by talking about a few examples of supervised
learning problems. Suppose we have a dataset giving the
living areas and prices of some houses from Portland:
Living area (feet²)    Price ($1000s)
2104                   400
1600                   330
2400                   369
1416                   232
3000                   540
• Given data like this, how can we learn to predict the prices of
other houses in Portland, as a function of the size of their
living areas?
Classification vs. Regression
• Objective:
• Regression: Predicts a continuous numerical value.
• Classification: Predicts a discrete category or class label.
• Output Type:
• Regression: Real numbers (e.g., price, temperature, salary).
• Classification: Class labels (e.g., spam/ham, disease/no
disease).
Classification vs. Regression (contd.)
• Error Metrics:
• Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R² score (quantifies how well a regression model explains the variance in the dependent variable).
• Classification: Accuracy, Precision, Recall, F1-score, AUC-ROC.
• Learning:
• Regression: Learns a mapping function that outputs a continuous value.
• Classification: Learns a decision boundary that separates the different classes.
Supervised Learning
• To establish notation for future use, we’ll use $x^{(i)}$ to denote the “input” variables (living area in this example), also called input features, and $y^{(i)}$ to denote the “output” or target variable that we are trying to predict (price).
• A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we’ll be using to learn (a list of m training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$) is called a training set.
• We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ the space of output values. In this example, $\mathcal{X} = \mathcal{Y} = \mathbb{R}$.
Linear Regression
To make our housing example more interesting, let’s consider a
slightly richer dataset in which we also know the number of
bedrooms in each house:
Living area (feet²)    # bedrooms    Price ($1000s)
2104                   3             400
1600                   3             330
2400                   3             369
1416                   2             232
3000                   4             540
Here, the x’s are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms.
Linear Regression (contd.)
• To perform supervised learning, we must decide how to represent functions/hypotheses h in a computer. As an initial choice, let’s approximate y as a linear function of x:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
• Here, the $\theta_i$’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. For simplicity, we will drop the θ subscript in $h_\theta(x)$ and write it more simply as h(x).
• To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that
$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$
Linear Regression (contd.)
• where on the right-hand side above we are viewing θ and x both as vectors, and n is the number of input variables (not counting $x_0$).
• Now, given a training set, how do we pick or learn the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have.
• To formalize this, we define a function that measures, for each value of the θ’s, how close the $h_\theta(x^{(i)})$’s are to the corresponding $y^{(i)}$’s.
• We define the cost function:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$
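As a concrete illustration (not part of the original slides), the short NumPy sketch below evaluates $h_\theta(x)$ and $J(\theta)$ on the five-house dataset above; the intercept feature $x_0 = 1$ is prepended to each example, and the particular θ used is an arbitrary illustrative choice.

```python
import numpy as np

# Training set from the housing table: [x0 = 1 (intercept), living area, # bedrooms].
X = np.array([
    [1.0, 2104, 3],
    [1.0, 1600, 3],
    [1.0, 2400, 3],
    [1.0, 1416, 2],
    [1.0, 3000, 4],
])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])  # prices in $1000s

def h(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = h(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

theta = np.array([0.0, 0.15, 10.0])   # arbitrary illustrative parameters
print(J(theta, X, y))                 # cost for this choice of theta
```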
Gradient Descent
• We want to choose θ so as to minimize J(θ).
• To do so, let’s use a search algorithm that starts with an
initial guess for θ and repeatedly changes θ to make J(θ)
smaller, until we converge to a value that minimizes J(θ).
• Specifically, we use the gradient descent algorithm:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta),$$
where α is the learning rate. The update is performed simultaneously for all values of j = 0, …, n.
Gradient Descent
• To compute $\frac{\partial}{\partial \theta_j} J(\theta)$ for a single training example (x, y), we have:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j}\, \frac{1}{2}\left( h_\theta(x) - y \right)^2$$
$$= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)$$
$$= \left( h_\theta(x) - y \right) x_j.$$
Thus, for a single training example, the update rule becomes:
$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.$$
This rule is called the LMS update rule (Least Mean Squares) and is also known as the Widrow-Hoff learning rule.
Gradient Descent
This rule has several natural and intuitive properties. For example:
• The magnitude of the update is proportional to the error term $y^{(i)} - h_\theta(x^{(i)})$.
• If the prediction $h_\theta(x^{(i)})$ is close to the actual value $y^{(i)}$, the parameter update will be small.
• If the prediction error is large, the update will be correspondingly larger.
Gradient Descent: Numerical Example
Minimize the function
$$f(x) = x^2$$
using gradient descent with 5 iterations.
Gradient: $f'(x) = 2x$
• Initial value: $x_0 = 1.0$
• Learning rate: $\alpha = 0.2$
• Number of iterations: 5
Update Rule:
$$x_{k+1} = x_k - \alpha f'(x_k) = x_k - 0.2(2x_k) = x_k - 0.4\,x_k = 0.6\,x_k$$
Each step reduces $x_k$ by 40%.
Iterations Step-by-Step
Iteration k    x_k       Gradient 2x_k    Updated x_{k+1} = 0.6 x_k
0              1.0000    2.0000           0.6000
1              0.6000    1.2000           0.3600
2              0.3600    0.7200           0.2160
3              0.2160    0.4320           0.1296
4              0.1296    0.2592           0.0778
5              0.0778    0.1555           0.0467
The table lists six updates (rows k = 0 through 5). After the five stated iterations, $x_5 = 0.6^5 \approx 0.0778$ with $f(x_5) \approx 0.0060$; the extra update in the last row gives $x_6 \approx 0.0467$ with $f(x_6) \approx 0.0022$. Either way, the iterate is very close to the minimizer $x = 0$, indicating successful convergence.
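The loop below is a minimal Python sketch (not from the slides) that reproduces the six updates listed in the table.

```python
# Gradient descent on f(x) = x**2 with f'(x) = 2x.
x = 1.0        # initial value x0
alpha = 0.2    # learning rate
for k in range(6):            # the six updates shown in the table
    grad = 2 * x              # f'(x_k)
    x = x - alpha * grad      # x_{k+1} = x_k - alpha * f'(x_k) = 0.6 * x_k
    print(k, round(x, 4))     # 0.6, 0.36, 0.216, 0.1296, 0.0778, 0.0467
print(round(x ** 2, 4))       # f(x) ≈ 0.0022
```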
Role of Hyperparameter α
• Too Small α: Convergence is very slow, requiring many iterations. On non-convex objectives, such small updates can also leave the model stuck in flat regions or poor local minima.
• Too Large α: May cause the optimization process to
overshoot the minimum, leading to divergence instead of
convergence.
• Optimal α: Ensures steady progress towards the minimum,
balancing convergence speed and stability.
Numerical Example: High Learning Rate Causing Instability
Consider the function
$$f(x) = (x - 3)^2$$
with gradient
$$\nabla f(x) = 2(x - 3).$$
Applying gradient descent with:
• Initial value: $x_0 = 0$
• Learning rate: $\alpha = 1.5$
• Maximum iterations: 5
Iteration Steps:
$$x_1 = 0 - 1.5 \times 2(0 - 3) = 9$$
$$x_2 = 9 - 1.5 \times 2(9 - 3) = -9$$
$$x_3 = -9 - 1.5 \times 2(-9 - 3) = 27$$
$$x_4 = 27 - 1.5 \times 2(27 - 3) = -45$$
$$x_5 = -45 - 1.5 \times 2(-45 - 3) = 99$$
Observations
• The values of x oscillate instead of converging to 3.
• A high learning rate causes the updates to overshoot.
• This results in instability and prevents convergence.
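A few lines of Python (a sketch, not part of the slides) reproduce this oscillating, diverging trajectory.

```python
# Gradient descent on f(x) = (x - 3)**2 with an overly large learning rate.
x = 0.0        # initial value x0
alpha = 1.5    # learning rate (too large for this problem)
for k in range(5):
    x = x - alpha * 2 * (x - 3)   # gradient step with f'(x) = 2(x - 3)
    print(k + 1, x)               # 9.0, -9.0, 27.0, -45.0, 99.0
```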
Impact of Initial Point
• A good initial value can speed up convergence.
• A bad initial value can slow down learning, cause oscillations,
or even make the algorithm diverge.
• For non-convex problems, the initial value may determine
which local minimum the algorithm converges to.
Poor Initialization
• Consider the function:
$$f(x) = x^3 - 3x, \qquad f'(x) = 3x^2 - 3$$
Setting $f'(x) = 0$:
$$3x^2 - 3 = 0 \implies x^2 = 1 \implies x = \pm 1$$
• $f''(1) = 6(1) = 6 > 0 \Rightarrow$ local minimum at $x = 1$
• $f''(-1) = 6(-1) = -6 < 0 \Rightarrow$ local maximum at $x = -1$
• Learning rate: $\alpha = 0.1$
• A poor choice is $x_0 = -1.5$, which lies just beyond the local maximum at $x = -1$, on the opposite side from the local minimum.
Poor Initialization
$$x_1 = x_0 - \alpha f'(x_0) = -1.5 - 0.1\left(3(-1.5)^2 - 3\right) = -1.875$$
$$x_2 = x_1 - 0.1\left(3(-1.875)^2 - 3\right) = -2.6297$$
$$x_3 = x_2 - 0.1\left(3(-2.6297)^2 - 3\right) = -4.4043$$
The iterates grow rapidly in magnitude: because $x_0$ lies on the far side of the local maximum, every step moves downhill away from the local minimum at $x = 1$, and since $f(x) \to -\infty$ as $x \to -\infty$, the algorithm diverges due to the poor initialization.
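The same kind of sketch (again, not from the slides) reproduces these iterates.

```python
# Gradient descent on f(x) = x**3 - 3x from a poor starting point.
x = -1.5       # initial value x0, just beyond the local maximum at x = -1
alpha = 0.1    # learning rate
for k in range(3):
    x = x - alpha * (3 * x**2 - 3)   # gradient step with f'(x) = 3x^2 - 3
    print(k + 1, round(x, 4))        # -1.875, -2.6297, -4.4043
```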
Batch Gradient Descent
• The LMS rule for a single training example can be generalized
to a training set with more than one example. The update
rule becomes:
Repeat until convergence: {
$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$
}
• This method iterates over the entire training set on every step
and is called batch gradient descent.
• The reader can verify that the summation in the update rule above is simply $-\frac{\partial J(\theta)}{\partial \theta_j}$ (the negative of the partial derivative of the original cost function J).
• Hence, this update is exactly gradient descent on J.
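A compact NumPy sketch of batch gradient descent for the housing data above; the feature standardization, learning rate, and iteration count are illustrative assumptions rather than values from the slides.

```python
import numpy as np

# Housing data: [living area, # bedrooms] and price ($1000s).
X_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Standardize features so a single learning rate works for both dimensions (assumption).
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), X_std])      # prepend intercept column x0 = 1

theta = np.zeros(X.shape[1])
alpha, n_iters = 0.1, 500                          # illustrative hyperparameters

for _ in range(n_iters):
    errors = y - X @ theta                         # y^(i) - h_theta(x^(i)) for all i
    theta = theta + alpha * (X.T @ errors)         # batch LMS update for every j

print(theta)        # learned parameters
print(X @ theta)    # fitted prices on the training set
```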
Properties of Gradient Descent
• Gradient descent can be susceptible to local minima in
general. However, for linear regression, the optimization
problem is convex, meaning there is only one global minimum
and no local minima.
• For linear regression, gradient descent therefore always converges to the global minimum, provided the learning rate α is not too large.
• The cost function J(θ) for linear regression is a convex
quadratic function.
Stochastic Gradient Descent Algorithm
Loop {
  for i = 1 to m {
    $$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$
  }
}
• In this algorithm, we repeatedly run through the training set,
and each time we encounter a training example, we update
the parameters according to the gradient of the error with
respect to that single training example only.
• This algorithm is called stochastic gradient descent (also
incremental gradient descent).
• By contrast, batch gradient descent has to scan through the entire training set before taking a single step, a costly operation when m is large.
Stochastic Gradient Descent Algorithm
• Stochastic gradient descent can start making progress right
away, and continues to make progress with each example it
looks at.
• Often, stochastic gradient descent gets θ “close” to the
minimum much faster than batch gradient descent.
• Note however that it may never “converge” to the minimum,
and the parameters θ will keep oscillating around the minimum
of J(θ); but in practice most of the values near the minimum
will be reasonably good approximations to the true minimum.
• For these reasons, particularly when the training set is large,
stochastic gradient descent is often preferred over batch
gradient descent.
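For comparison, a minimal NumPy sketch of stochastic gradient descent on the same standardized housing features; the random shuffling, learning rate, and epoch count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same housing data, standardized, with an intercept column.
X_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), X_std])

theta = np.zeros(X.shape[1])
alpha, n_epochs = 0.05, 200                        # illustrative hyperparameters

for _ in range(n_epochs):
    for i in rng.permutation(len(y)):              # visit the examples in random order
        error = y[i] - X[i] @ theta                # y^(i) - h_theta(x^(i))
        theta = theta + alpha * error * X[i]       # update using this single example

print(theta)   # hovers near the least-squares solution rather than converging exactly
```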
Probabilistic Interpretation
• Under suitable probabilistic assumptions on the data (made precise below), least-squares regression corresponds to finding the maximum likelihood estimate of θ.
• These assumptions justify least-squares as a natural method
for regression.
Maximum Likelihood Estimation (MLE)
• Maximum Likelihood Estimation (MLE) is a method for
estimating the parameters of a probability distribution by
maximizing the likelihood function.
• The idea is to find the parameter values that make the
observed data most probable.
Probabilistic Interpretation (contd.)
Assume that the target variables and the inputs are related as
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$
where $\epsilon^{(i)}$ is a random noise term, assumed to be IID Gaussian with mean zero and variance $\sigma^2$:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2).$$
The density of $\epsilon^{(i)}$ is:
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right).$$
Probabilistic Interpretation (contd.)
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).$$
Here, $p(y^{(i)} \mid x^{(i)}; \theta)$ represents the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by θ. We can also express this as:
$$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2).$$
Likelihood Function
• Given the design matrix X (containing all the $x^{(i)}$) and θ, the probability of the data y is $p(y \mid X; \theta)$.
• When this is treated as a function of θ, it is referred to as the likelihood function:
$$L(\theta) = L(\theta; X, y) = p(y \mid X; \theta).$$
• By the independence assumption on the $\epsilon^{(i)}$ (and consequently the $y^{(i)}$ given the $x^{(i)}$), we can write:
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).$$
Maximum Likelihood Estimation
• To find the best θ, we use the principle of maximum
likelihood:
Maximize L(θ).
• Instead of maximizing L(θ) directly, we can maximize any
strictly increasing function of L(θ).
• A common choice is to maximize the log-likelihood, which
simplifies derivations.
• The log-likelihood function $\ell(\theta)$ is given by:
$$\ell(\theta) = \log L(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right).$$
Maximum Likelihood Estimation (contd.)
$$\ell(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{i=1}^{m} \frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2}$$
$$\ell(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$
Hence, maximizing $\ell(\theta)$ is equivalent to minimizing
$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,$$
which is the least-squares cost function $J(\theta)$.
Maximum Likelihood Estimation
• To summarize, under the previous probabilistic assumptions
on the data, least-squares regression corresponds to finding
the maximum likelihood estimate of θ.
Ridge Regression
• Ridge regression is a regularized version of Ordinary Least
Squares (OLS).
• It modifies the loss function by adding a penalty on the size of
parameters (weights).
• The L2 penalty shrinks the parameters, yielding a more stable model that typically generalizes better on test data.
$$\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{i=1}^{n} \left( y_i - \theta^\top x_i \right)^2 + \lambda \|\theta\|_2^2 \right\}$$
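A small NumPy sketch (not from the slides) of ridge regression via its closed-form solution $(X^\top X + \lambda I)^{-1} X^\top y$ on the housing data; the value λ = 1.0 and the standardization step are illustrative assumptions, and, matching the objective as written, the intercept is penalized along with the other weights.

```python
import numpy as np

# Housing data with standardized features and an intercept column.
X_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), X_std])

lam = 1.0                       # illustrative regularization strength
d = X.shape[1]

# Minimizer of sum_i (y_i - theta^T x_i)^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # ordinary least squares, for comparison

print(theta_ridge)   # shrunk toward zero relative to the OLS solution
print(theta_ols)
```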
Maximum A Posteriori (MAP) Estimation
Ridge regression can also be derived as a maximum a posteriori estimate, by placing a zero-mean Gaussian prior on θ:
$$P(\theta) = \frac{1}{\sqrt{2\pi\tau^2}} \, e^{-\frac{\theta^\top \theta}{2\tau^2}}$$
$$\hat{\theta} = \arg\max_{\theta}\ P(\theta \mid y_1, x_1, \dots, y_n, x_n)$$
$$= \arg\max_{\theta}\ \frac{P(y_1, x_1, \dots, y_n, x_n \mid \theta)\, P(\theta)}{P(y_1, x_1, \dots, y_n, x_n)}$$
$$= \arg\max_{\theta}\ P(y_1, x_1, \dots, y_n, x_n \mid \theta)\, P(\theta)$$
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i, x_i \mid \theta) \right] P(\theta)$$
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta)\, P(x_i \mid \theta) \right] P(\theta)$$
Maximum A Posteriori (MAP) Estimation (contd.)
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta)\, P(x_i) \right] P(\theta)$$
(the inputs $x_i$ do not depend on θ, so $P(x_i \mid \theta) = P(x_i)$, and these factors do not affect the maximization over θ)
$$= \arg\max_{\theta}\ \left[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \right] P(\theta)$$
$$= \arg\max_{\theta}\ \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta) + \log P(\theta)$$
$$= \arg\min_{\theta}\ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( x_i^\top \theta - y_i \right)^2 + \frac{1}{2\tau^2}\, \theta^\top \theta$$
(substituting the Gaussian likelihood and the Gaussian prior, dropping additive constants, and turning the maximization of the log-posterior into a minimization of its negative)
Maximum A Posteriori (MAP) Estimation (contd.)
Rescaling the objective by $\frac{2\sigma^2}{n}$ (which does not change the minimizer) gives the ridge regression problem:
$$= \arg\min_{\theta}\ \frac{1}{n} \sum_{i=1}^{n} \left( x_i^\top \theta - y_i \right)^2 + \lambda \|\theta\|_2^2, \qquad \lambda = \frac{\sigma^2}{n\tau^2}$$
Thank You
Contact: [email protected]