
Training Models

Ways of training a Linear Regression Model:

⚫ Using a direct “closed-form” equation that computes the model
parameters that best fit the model to the training set (i.e., the
model parameters that minimize the cost function over the training
set).

⚫ Using an iterative optimization approach, called Gradient Descent
(GD), that gradually tweaks the model parameters to minimize the
cost function over the training set, eventually converging to the
same set of parameters as the first method.
Linear Regression
Computational Complexity
⚫ The Normal Equation computes the inverse of Xᵀ · X, which is an
n × n matrix (where n is the number of features). The computational
complexity of inverting such a matrix is typically about O(n³).

⚫ In other words, if you double the number of features, you multiply
the computation time by roughly 2³ = 8.
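
A minimal NumPy sketch (illustrative data and variable names) of the closed-form solution, θ̂ = (Xᵀ · X)⁻¹ · Xᵀ · y, known as the Normal Equation:

```python
import numpy as np

# Illustrative data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]   # add the bias feature x0 = 1 to each instance

# Normal Equation: closed-form computation of the best-fit parameters
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)                    # should be close to [[4.], [3.]]
```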
Gradient Descent

⚫ Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.

⚫ The general idea of Gradient Descent is to tweak parameters
iteratively in order to minimize a cost function.

⚫ Concretely, you start by filling θ with random values (this is called
random initialization), and then you improve it gradually, taking one
baby step at a time, each step attempting to decrease the cost
function (e.g., the MSE), until the algorithm converges to a minimum.
Batch Gradient Descent
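A minimal sketch of Batch Gradient Descent for Linear Regression, assuming the MSE cost; the learning rate, iteration count, and data are illustrative choices:

```python
import numpy as np

# Illustrative data: same setup as the Normal Equation sketch above
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]   # add the bias feature x0 = 1

eta = 0.1            # learning rate (illustrative)
n_iterations = 1000
m = len(X_b)         # number of training instances

theta = np.random.randn(2, 1)        # random initialization of the parameters

for iteration in range(n_iterations):
    # gradient of the MSE cost computed over the *whole* training set
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```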
Stochastic Gradient Descent
⚫ The main problem with Batch Gradient Descent is the fact
that it uses the whole training set to compute the gradients
at every step, which makes it very slow when the training
set is large.

⚫ At the opposite extreme, Stochastic Gradient Descent just picks a
random instance in the training set at every step and computes the
gradients based only on that single instance. Obviously this makes
the algorithm much faster since it has very little data to manipulate
at every iteration.

⚫ It also makes it possible to train on huge training sets, since only
one instance needs to be in memory at each iteration (SGD can be
implemented as an out-of-core algorithm).
⚫ When the cost function is very irregular this can actually help the
algorithm jump out of local minima, so Stochastic Gradient Descent has a
better chance of finding the global minimum than Batch Gradient Descent
does.

⚫ Therefore randomness is good to escape from local optima, but bad
because it means that the algorithm can never settle at the minimum.
One solution to this dilemma is to gradually reduce the learning rate.
The steps start out large (which helps make quick progress and escape
local minima), then get smaller and smaller, allowing the algorithm to
settle at the global minimum.

⚫ This process is called simulated annealing, because it resembles the
process of annealing in metallurgy where molten metal is slowly cooled
down.

⚫ The function that determines the learning rate at each iteration is called
the learning schedule. If the learning rate is reduced too quickly, you may
get stuck in a local minimum, or even end up frozen halfway to the
minimum. If the learning rate is reduced too slowly, you may jump around
the minimum for a long time and end up with a suboptimal solution if you
halt training too early.
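
A minimal sketch of Stochastic Gradient Descent with a simple learning schedule; the schedule constants t0 and t1, the epoch count, and the data are illustrative:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
t0, t1 = 5, 50                      # learning-schedule hyperparameters (illustrative)

def learning_schedule(t):
    # gradually reduce the learning rate as training progresses
    return t0 / (t + t1)

theta = np.random.randn(2, 1)       # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)            # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)   # gradient from a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```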
Mini-batch Gradient Descent
⚫ At each step, instead of computing the gradients based on the
full training set (as in Batch GD) or based on just one instance
(as in Stochastic GD), Mini-batch GD computes the gradients on
small random sets of instances called mini-batches.
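
A minimal sketch of Mini-batch Gradient Descent; the batch size, learning rate, epoch count, and data are illustrative:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
batch_size = 20                      # illustrative mini-batch size
eta = 0.1                            # fixed learning rate, kept simple for the sketch

theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)           # shuffle the instances each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]
        yi = y_shuffled[start:start + batch_size]
        # gradient of the MSE cost over one small random set of instances
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients
```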
Comparison
Polynomial Regression
⚫ A simple way to fit nonlinear data with a linear model is to add
powers of each feature as new features, then train a linear model on
this extended set of features. This technique is called Polynomial
Regression.
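
A minimal scikit-learn sketch on illustrative quadratic data: PolynomialFeatures adds the squared feature, then a plain LinearRegression is trained on the extended feature set:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative nonlinear data: y = 0.5 x^2 + x + 2 + noise
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)       # each instance now has [x, x^2]

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                        # linear model on the extended features
print(lin_reg.intercept_, lin_reg.coef_)
```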
Learning Curves
Overfitting example
Regularized Linear Models
Ridge Regression
Lasso Regression
⚫ Least Absolute Shrinkage and Selection Operator Regression (simply
called Lasso Regression) is another regularized version of Linear
Regression: just like Ridge Regression, it adds a regularization term
to the cost function.

⚫ It uses the ℓ1 norm of the weight vector instead of half the square
of the ℓ2 norm.
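
A minimal scikit-learn sketch contrasting the two penalties; the alpha values (regularization strengths) and data are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

ridge_reg = Ridge(alpha=1.0)     # penalizes half the square of the l2 norm of the weights
ridge_reg.fit(X, y.ravel())

lasso_reg = Lasso(alpha=0.1)     # penalizes the l1 norm of the weights
lasso_reg.fit(X, y.ravel())

print(ridge_reg.predict([[1.5]]), lasso_reg.predict([[1.5]]))
```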
Elastic Net
⚫ Elastic Net is a middle ground between Ridge Regression and
Lasso Regression. The regularization term is a simple mix of
both Ridge and Lasso’s regularization terms, and you can
control the mix ratio r.

⚫ When r = 0, Elastic Net is equivalent to Ridge Regression, and when
r = 1, it is equivalent to Lasso Regression.
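
A minimal scikit-learn sketch; ElasticNet's l1_ratio parameter plays the role of the mix ratio r, and the alpha and l1_ratio values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# l1_ratio is the mix ratio r: 0 behaves like Ridge, 1 behaves like Lasso
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y.ravel())
print(elastic_net.predict([[1.5]]))
```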
Early Stopping
⚫ A very different way to regularize iterative learning algorithms such
as Gradient Descent is to stop training as soon as the validation
error reaches a minimum. This is called early stopping.
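
A minimal sketch of early stopping with scikit-learn's SGDRegressor, trained one epoch at a time via warm_start; the data, split, and hyperparameters are illustrative, and the model with the lowest validation error is kept:

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative data and train/validation split
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# warm_start=True makes each .fit() continue from the previous weights (one epoch at a time)
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)                      # one more epoch of training
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:                     # keep the model at the validation minimum
        best_val_error = val_error
        best_model = deepcopy(sgd_reg)
```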
Logistic Regression
⚫ Logistic Regression (also called Logit Regression) is commonly
used to estimate the probability that an instance belongs to a
particular class (e.g., what is the probability that this email is
spam?)

⚫ If the estimated probability is greater than 50%, then the model
predicts that the instance belongs to that class (called the positive
class, labeled “1”).

⚫ Else it predicts that it does not (i.e., it belongs to the negative
class, labeled “0”). This makes it a binary classifier.
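
A minimal scikit-learn sketch of such a binary classifier on the iris dataset (detecting Iris virginica from petal width, an illustrative choice of feature and class):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width as the single feature
y = (iris.target == 2).astype(int)   # 1 if Iris virginica (positive class), else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities; the model predicts the positive class when p > 50%
print(log_reg.predict_proba([[1.7], [1.5]]))
print(log_reg.predict([[1.7], [1.5]]))
```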
Training and Cost Function
Decision Boundaries
Softmax Regression
⚫ The Logistic Regression model can be generalized to
support multiple classes directly, without having to train
and combine multiple binary classifiers.
⚫ This is called Softmax Regression, or Multinomial Logistic
Regression.
⚫ The idea is quite simple: when given an instance x, the Softmax
Regression model first computes a score sₖ(x) for each class k, then
estimates the probability of each class by applying the softmax
function (also called the normalized exponential) to the scores.
⚫ The equation to compute sₖ(x) should look familiar, as it is just
like the equation for Linear Regression prediction.
Note that each class has its own dedicated parameter vector θₖ. All these
vectors are typically stored as rows in a parameter matrix Θ.

Once you have computed the score of every class for the instance x, you can
estimate the probability pₖ that the instance belongs to class k by running the
scores through the softmax function. It computes the exponential of every
score, then normalizes them (dividing by the sum of all the exponentials).
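
A minimal sketch of the softmax computation in NumPy, plus scikit-learn's LogisticRegression used as a multiclass (Softmax) classifier on the iris dataset; the feature choice and C value are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def softmax(scores):
    # exponentiate every class score, then normalize by the sum of the exponentials
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # class probabilities summing to 1

iris = load_iris()
X = iris.data[:, (2, 3)]      # petal length, petal width
y = iris.target               # three classes

# With the default lbfgs solver, LogisticRegression handles multiclass targets
# as Softmax (multinomial) regression; C=10 is an illustrative regularization strength.
softmax_reg = LogisticRegression(C=10)
softmax_reg.fit(X, y)
print(softmax_reg.predict_proba([[5, 2]]))
```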
