Week 6

The document provides an overview of linear regression, focusing on univariate and multivariate approaches, cost functions, and optimization techniques such as gradient descent. It discusses the importance of minimizing the cost function to achieve the best fit line for predictions, as well as various regularization techniques to prevent overfitting. Additionally, it introduces different optimizers and their roles in adjusting model parameters to minimize loss functions.

Uploaded by

ibrahimsani2560

Linear Regression

Linear regression with one variable

Univariate LR
Predict output y from one variable/feature x
Model representation
A line of best fit (regression line) is a straight line that best approximates a scatter plot of data points
Estimated/predicted value ($\hat{y}$); actual/true value ($y$), also called the ground truth
$m$ = the number of training examples/samples/instances
$n$ = the number of features
Dataset size: $m \times n$
$(x, y)$ = one training example
$(x^{(i)}, y^{(i)})$ = the i-th training example
$x^{(1)} = 2104$, $x^{(2)} = 1416$
$y^{(1)} = 460$, $y^{(2)} = 232$
Training Set -> Learning Algorithm -> h (hypothesis)
Size of house (new/unseen data) -> h -> Estimated house price
How do we represent h?
$$\hat{y}(x) = h_\theta(x) = \theta_0 + \theta_1 x$$
θ0 and θ1: parameters/weights that will be trained/determined by the ML model (they are not hyperparameters)
θ0 = intercept/bias/constant
θ1 = slope/coefficient/gradient
This is linear regression with one variable (univariate linear regression).
A cost function lets us figure out how to fit
the best straight line to our data.
Objective: Find the best fit regression line
Reduce the cost function
Training Set
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
θ's: parameters
How to choose the θ's?
[Figure: three example plots of h(x) against x for different parameter choices; both axes run from 0 to 3]

Idea: Choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y)
To formalize this:
We want to solve a minimization problem
Minimize (hθ(x) - y)²
i.e., minimize the difference between h(x) and y for every example
Sum this over the whole training set:
$$\text{SSE} = \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
where m is the number of samples
Minimize the squared difference between predicted house price and actual house price, scaled by 1/2m
• 1/m means we take the average
• 1/2m: the 2 makes the math a bit easier and doesn't change the parameters we determine at all (i.e., half the smallest value is still the smallest value!)
Minimizing over θ0 and θ1 means we get the values of θ0 and θ1 that give, on average, the minimal deviation of hθ(x) from y when we use those parameters in our hypothesis function
More cleanly, this is a cost function:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Hypothesis: like your prediction machine; throw in an x value, get a putative y value. It maps from x to y.
Cost: a way to use your training data to determine the θ values that make the hypothesis as accurate as possible
A cost function is a measure of the error between the value your model predicts and the actual value
The loss function (error) is for a single training example, while the cost function is over the entire training set
This cost function is also called the squared error cost function.
It is a reasonable choice for most regression problems and is probably the most commonly used one.
Various regression loss functions: Mean Squared, Mean Squared Logarithmic, Mean Absolute, Root Mean Square
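As a rough illustration of the squared error cost function described above, the sketch below evaluates J(θ0, θ1) on the four training examples from the table. The function name `squared_error_cost` and the two trial values of θ1 are assumptions made only for this example.

```python
import numpy as np

def squared_error_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x   # h_theta(x) for every example
    errors = predictions - y            # prediction minus ground truth
    return np.sum(errors ** 2) / (2 * m)

# The four training examples from the table (size in feet^2, price in $1000's).
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])

# The theta values below are arbitrary guesses chosen only for illustration.
cost_a = squared_error_cost(0.0, 0.2, x, y)
cost_b = squared_error_cost(0.0, 0.1, x, y)
print(cost_a, cost_b)
```

The lower of the two costs belongs to the line that sits closer to the data on average, which is the sense in which the cost function ranks candidate lines.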
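Original model:
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: θ0, θ1
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: find the best values of θ0 and θ1 so that J(θ0,θ1) is minimum

Simplified model (θ0 = 0):
Hypothesis: $h_\theta(x) = \theta_1 x$
Parameters: θ1
Cost Function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: find the best value of θ1 so that J(θ1) is minimum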
[Figure: left panel, hθ(x) against x (for fixed θ1, this is a function of x); right panel, J(θ1) against θ1 (a function of the parameter θ1)]
So for example:
θ1 = 1: J(θ1) = 0
θ1 = 0.5: J(θ1) ≈ 0.58
θ1 = 0: J(θ1) ≈ 2.3
The line which has the least sum of squares of errors is the best fit line
If we compute J(θ1) for a range of θ1 values and plot J(θ1) vs θ1, we get a parabola (J is quadratic in θ1)

The optimization objective for the learning algorithm is to find the value of θ1 which minimizes J(θ1)
• So, here θ1 = 1 is the best value for θ1
The line which has the least sum of squares of errors is the best fit line
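The sweep over θ1 can be sketched in a few lines. The three data points below, lying on the line y = x, are an assumption; they were chosen because they reproduce the J(θ1) values quoted in the slides (0, about 0.58, and about 2.3).

```python
import numpy as np

# Assumed toy data: three points on the line y = x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(theta1):
    """Simplified cost J(theta1) with theta0 fixed at 0."""
    return np.sum((theta1 * x - y) ** 2) / (2 * len(x))

# Evaluate J over the same range shown on the plot's horizontal axis.
thetas = np.linspace(-0.5, 2.5, 31)
costs = np.array([cost(t) for t in thetas])
best_theta1 = thetas[np.argmin(costs)]

print(round(cost(1.0), 2), round(cost(0.5), 2), round(cost(0.0), 2))
print(best_theta1)
```

Plotting `costs` against `thetas` (for instance with matplotlib) would give the bowl-shaped curve with its lowest point at θ1 = 1.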
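Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

Parameters: θ0, θ1

Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

Goal: minimize J(θ0, θ1) over θ0 and θ1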
[Figure: left panel, training data with hθ(x) (for fixed θ0, θ1, this is a function of x), plotted as Price ($) against Size in feet² (x); right panel, J(θ0, θ1) (a function of the parameters θ0, θ1)]
Previously we plotted our cost function by plotting J(θ1) against θ1
Now we have two parameters, so the plot becomes a bit more complicated
It generates a 3D surface plot where the axes are:
X = θ1
Z = θ0
Y = J(θ0,θ1)
The height (Y) indicates the value of the cost function, so we look for the point where Y is at a minimum
A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant-z slices, called contours, in a 2-dimensional format
[Figures: hθ(x) for fixed θ0, θ1 (a function of x), shown next to contour plots of J(θ0, θ1) (a function of the parameters)]
• Minimize the cost function J
• Gradient descent
  • Used all over machine learning for minimization
• Start by looking at a general J() function
• Problem
  • We have J(θ0, θ1)
  • We want to get min J(θ0, θ1)
• Gradient descent applies to more general functions
  • Have some function J(θ0, θ1, θ2 .... θn)
  • Want min J(θ0, θ1, θ2 .... θn)

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0,θ1) until we hopefully end up at a minimum
• Start with initial guesses
  • Start at 0,0 (or any other value)
  • Keep changing θ0 and θ1 a little bit to try and reduce J(θ0,θ1)
• Each time you change the parameters, you step in the direction that reduces J(θ0,θ1) the most (the negative gradient)
• Repeat
• Do so until you converge to a local minimum
• Has an interesting property
  • Where you start can determine which minimum you end up in
  • Here we can see one initialization point led to one local minimum
  • The other led to a different one
[Figures: surface plots of J(θ0,θ1) over θ0 and θ1, showing two descent paths from different starting points]
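Gradient descent is used to minimize the MSE by calculating the gradient of the cost function
Repeat until convergence, for j = 0 and j = 1:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
Correct (simultaneous update): compute both new values from the current θ0 and θ1, then assign them together
temp0 := θ0 - α ∂J/∂θ0
temp1 := θ1 - α ∂J/∂θ1
θ0 := temp0
θ1 := temp1
Incorrect: assigning θ0 := temp0 before computing temp1, so that temp1 is computed from the already-updated θ0

α (alpha): the learning rate, a hyperparameter
Controls how big a step you take
If α is big, gradient descent is aggressive
If α is small, it takes tiny steps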
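A minimal sketch of batch gradient descent with a simultaneous update, assuming the univariate hypothesis h(x) = θ0 + θ1·x and the squared error cost. The toy data, the function name `gradient_descent`, and the values of `alpha` and `iterations` are assumptions for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0           # initial guesses
    m = len(x)
    for _ in range(iterations):
        errors = theta0 + theta1 * x - y
        # Compute both gradients from the *old* parameters ...
        grad0 = np.sum(errors) / m
        grad1 = np.sum(errors * x) / m
        # ... then update both at once (simultaneous update).
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Assumed toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)
```

The tuple assignment is what makes the update simultaneous: the right-hand side is evaluated in full before either parameter changes.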

Learning Rate Annealing/adaptive learning rates

Rather than using a constant learning rate throughout training, we may
anneal the learning rate, i.e., have it decline as training progresses
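One simple way to anneal the learning rate is inverse-time decay. This is only one of several possible schedules, and the function name and constants below are assumptions for illustration.

```python
def annealed_learning_rate(initial_rate, decay, step):
    """Inverse-time decay: the rate shrinks as the step count grows."""
    return initial_rate / (1.0 + decay * step)

# Learning rate at the start, after 100 steps, and after 1000 steps.
rates = [annealed_learning_rate(0.1, 0.01, step) for step in (0, 100, 1000)]
print(rates)
```

Early steps are large, which speeds up initial progress, while later steps are small, which reduces the risk of overshooting near a minimum.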
Gradient descent algorithm

To understand gradient descent, we'll return to a simpler function where we minimize one parameter to help explain the algorithm in more detail.
• min θ1 J(θ1) where θ1 is a real number
• In neural networks, computing the gradients by passing the error backwards through the layers before adjusting the weights is known as Backward Error Propagation (backpropagation)
Notation nuances
• Partial derivative vs. derivative
• Use the partial derivative when the function has multiple variables but we differentiate with respect to only one of them
• Use the ordinary derivative when the function has a single variable
• Momentum: adding a fraction of the past weight update to the current weight update. This can help prevent the model from getting stuck in local minima. With momentum, the movements along the error surface are generally smoother, and the network can move more quickly through shallow local minima, which makes it more likely (though not guaranteed) to reach the global minimum. Momentum considers previous weight changes when updating the current weights
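The momentum idea can be sketched as follows, assuming the classical form in which a "velocity" term carries a fraction `beta` of the previous update. The one-parameter cost, the function name, and the values of `alpha`, `beta`, and `steps` are hypothetical.

```python
def minimize_with_momentum(gradient, theta, alpha=0.1, beta=0.9, steps=500):
    """Gradient descent where each update adds a fraction of the last one."""
    velocity = 0.0                       # the previous weight update
    for _ in range(steps):
        # New update = fraction of the old update + current gradient step.
        velocity = beta * velocity - alpha * gradient(theta)
        theta = theta + velocity
    return theta

# Hypothetical one-parameter cost J(theta) = (theta - 3)^2, gradient 2(theta - 3).
theta_min = minimize_with_momentum(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)
```

Setting `beta=0` recovers plain gradient descent, which makes it easy to compare the two behaviours on the same cost.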
[Figures: error curves marking the side of the minimum where we want to reduce the parameter and the side where we want to increase it]
If α is too small, gradient descent can be slow (higher training time).

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
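Local minimum: the value of the loss function is minimal at that point within a local region.
Global minimum: the value of the loss function is minimal across the entire domain of the loss function.
[Figure: loss curve marking the global minimum and the current value of θ1 at a local optimum]
At a local optimum the derivative is zero, so the update leaves the current value of θ1 unchanged.
Vanishing gradient problem: the gradient keeps getting smaller, so the weight updates become tiny and learning stalls.
Exploding gradient is the opposite of vanishing: the gradient keeps getting larger, which causes large weight updates and makes gradient descent diverge.
Gradient descent can converge to a local minimum even with the learning rate α fixed, because the derivative shrinks as we approach the minimum and the steps shrink with it.
Partial derivatives:
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$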
sklearn.linear_model.LinearRegression
Ordinary least squares Linear Regression.

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

fit_intercept : bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e., data is expected to be centered)

normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False
Note: normalize was deprecated in scikit-learn 1.0 and removed in 1.2; in current versions, scale the features with StandardScaler instead.

copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten
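A short usage sketch for `LinearRegression`, fitted on the four house-price examples from the training-set table. The 1200 ft² query house is a hypothetical value added only to show `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training examples from the table: size in feet^2 -> price in $1000's.
X = np.array([[2104.0], [1416.0], [1534.0], [852.0]])  # shape (m, n) = (4, 1)
y = np.array([460.0, 232.0, 315.0, 178.0])

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

theta0 = model.intercept_      # intercept / bias
theta1 = model.coef_[0]        # slope / coefficient
predicted = model.predict(np.array([[1200.0]]))[0]  # hypothetical unseen house
print(theta0, theta1, predicted)
```

Note that `LinearRegression` solves the least-squares problem directly rather than by gradient descent, so there is no learning rate to tune.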
Polynomial Regression
sklearn.preprocessing.PolynomialFeatures

class sklearn.preprocessing.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C')

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]

degree : int or tuple (min_degree, max_degree), default=2
If a single int is given, it specifies the maximal degree of the polynomial features. If a tuple (min_degree, max_degree) is passed, then min_degree is the minimum and max_degree is the maximum polynomial degree of the generated features. Note that min_degree=0 and min_degree=1 are equivalent as outputting the degree zero term is determined by include_bias.
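The [a, b] example from the documentation can be checked directly. The sample values a = 2 and b = 3 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                 # one sample of the form [a, b]
poly = PolynomialFeatures(degree=2, include_bias=True)
features = poly.fit_transform(X)           # columns: 1, a, b, a^2, ab, b^2
print(features)                            # [[1. 2. 3. 4. 6. 9.]]
```

Polynomial regression then amounts to fitting an ordinary linear model on these expanded features.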
In Stochastic Gradient Descent (SGD), we consider just one example/sample at a time to take a single step: calculate the gradient for that example and use it to update the weights
Mini-batch gradient descent: we use a batch of a fixed number of training examples, smaller than the full dataset, and call it a mini-batch
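The difference between the variants comes down to how many examples feed each update. The sketch below assumes the univariate hypothesis and squared error cost used earlier; the function name, the toy data, and the default hyperparameters are assumptions for illustration.

```python
import numpy as np

def minibatch_gradient_descent(x, y, alpha=0.1, batch_size=2, epochs=1000, seed=0):
    """Mini-batch gradient descent for h(x) = theta0 + theta1 * x."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            errors = theta0 + theta1 * x[batch] - y[batch]
            # Gradients are averaged over the mini-batch only.
            grad0 = errors.mean()
            grad1 = (errors * x[batch]).mean()
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Assumed noise-free toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
theta0, theta1 = minibatch_gradient_descent(x, y)
print(theta0, theta1)
```

With `batch_size=1` this reduces to stochastic gradient descent, and with `batch_size=len(x)` it reduces to batch gradient descent.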
Different Optimizers
Optimizers are used to adjust the parameters of a model. The purpose of an optimizer is to adjust model weights to minimize a loss function.

Examples: Gradient Descent/Batch Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient Descent with momentum, Mini-Batch Gradient Descent, Adagrad, and Adam

Stochastic Gradient Descent with momentum: adds a fraction of the previous update to the current update of the parameters
Adagrad: uses a different, adaptive learning rate for each parameter, scaled down according to the accumulated squared gradients of that parameter
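A rough sketch of the Adagrad update, assuming its standard form in which each parameter's step is divided by the square root of its accumulated squared gradients. The two-parameter cost, the function name, and the hyperparameter values are hypothetical.

```python
import numpy as np

def adagrad(gradient, theta, alpha=0.5, steps=500, eps=1e-8):
    """Adagrad: each parameter's step is scaled by its own gradient history."""
    theta = np.asarray(theta, dtype=float)
    accumulated = np.zeros_like(theta)       # running sum of squared gradients
    for _ in range(steps):
        g = gradient(theta)
        accumulated += g ** 2
        # Parameters with large past gradients get smaller effective rates.
        theta = theta - alpha * g / (np.sqrt(accumulated) + eps)
    return theta

# Hypothetical cost J = (t0 - 1)^2 + 10 * (t1 + 2)^2 with very different scales.
def grad(t):
    return np.array([2.0 * (t[0] - 1.0), 20.0 * (t[1] + 2.0)])

theta = adagrad(grad, [0.0, 0.0])
print(theta)
```

Because the scaling is per parameter, the steep and the shallow direction of this cost are handled with different effective learning rates without any manual tuning.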
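Linear Regression with Multiple Features:
Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$, with $x_0 = 1$

Parameters: $\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_n$

Cost Function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

Goal: minimize $J(\theta)$ over $\theta_0, \theta_1, \cdots, \theta_n$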
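With several features, the hypothesis, cost, and gradient are usually written in vectorized form. The sketch below assumes a design matrix `X` whose first column is all ones (the x0 = 1 term); the data and `true_theta` are made up for illustration.

```python
import numpy as np

def predict(theta, X):
    """h_theta(x) = theta^T x for every row of X (first column of X is 1)."""
    return X @ theta

def cost_and_gradient(theta, X, y):
    """Squared error cost J(theta) and its gradient with respect to theta."""
    m = len(y)
    errors = predict(theta, X) - y
    cost = (errors @ errors) / (2 * m)
    gradient = (X.T @ errors) / m
    return cost, gradient

# Hypothetical data: two features plus the bias column x0 = 1.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 0.0, 3.0]])
true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta

theta = np.zeros(3)
for _ in range(5000):
    cost, gradient = cost_and_gradient(theta, X, y)
    theta = theta - 0.1 * gradient     # all parameters updated simultaneously
print(theta, cost)
```

The single line `theta = theta - 0.1 * gradient` updates every θj at once, which is the vector form of the simultaneous update.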
Overfit and Underfit problem:
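Ridge (L2) Regularization
Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the loss function and prevent overfitting
Ridge adds a penalty equivalent to the sum of the squares of the magnitudes of the coefficients
If the penalty term increases, the cost function J increases
Hyperparameter λ: known as the regularization constant; it is greater than zero
Regularized cost = loss function + penalty

Cost Function (one common form): $J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

Goal: minimize $J(\theta)$ over $\theta$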
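L1 (Lasso) Regularization
Lasso adds a penalty equivalent to the sum of the absolute values of the coefficients
If the penalty term increases, the cost function J increases

Cost Function (one common form): $J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} |\theta_j| \right]$

Goal: minimize $J(\theta)$ over $\theta$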
Elastic Net Regularization
Elastic net linear regression uses the penalties from both the lasso and ridge techniques to regularize regression models.

Elastic net combines lasso and ridge, and its behavior is controlled by a hyperparameter known as the L1 ratio, denoted r, with 0 ≤ r ≤ 1. At r = 0.5 the lasso and ridge penalties are weighted equally. As r decreases from 0.5 toward 0, elastic net behaves increasingly like ridge, and at r = 0 it is exactly ridge regression. As r increases from 0.5 toward 1, it behaves increasingly like lasso, and at r = 1 it is exactly lasso regression.
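The three penalties can be compared side by side with scikit-learn's `Ridge`, `Lasso`, and `ElasticNet`. The synthetic data and the `alpha` values below are assumptions for illustration; in scikit-learn, `alpha` plays the role of the regularization constant and `l1_ratio` is the r described above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only the first two features matter in this synthetic target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)                    # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)                   # L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)                    # L1 penalty
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso), ("ElasticNet", enet)]:
    print(name, np.round(model.coef_, 3))
```

Each penalty shrinks the coefficients relative to ordinary least squares; the L1 penalty in particular tends to push the coefficients of irrelevant features to exactly zero.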
sklearn.linear_model.SGDRegressor
Linear model fitted by minimizing a regularized empirical loss with SGD

class sklearn.linear_model.SGDRegressor(loss='squared_error', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)

max_iter : int. The maximum number of passes over the training data (epochs)
early_stopping: whether to use early stopping to terminate training when the validation score is not improving
loss: Huber loss (loss='huber') is a loss function that is less sensitive to outliers in the data than the squared error loss
alpha: constant that multiplies the regularization term
l1_ratio: the Elastic Net mixing parameter (only used when penalty='elasticnet')
Evaluation Metrics for Regression Model
1. Mean Absolute Error (MAE)
MAE is the mean of the absolute differences between actual and predicted values

from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))

2. Mean Squared Error (MSE)
MSE is the mean of the squared differences between actual and predicted values

from sklearn.metrics import mean_squared_error
print("MSE", mean_squared_error(y_test, y_pred))

3. Root Mean Squared Error (RMSE)
RMSE is the square root of the mean squared error

import numpy as np
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))
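The three metrics can be tried on a small example. The `y_test` and `y_pred` values below are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values, for illustration only.
y_test = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_test, y_pred)     # mean of |y - y_hat| = 4 / 4
mse = mean_squared_error(y_test, y_pred)      # mean of (y - y_hat)^2 = 6 / 4
rmse = np.sqrt(mse)                           # same units as y
print("MAE", mae, "MSE", mse, "RMSE", rmse)
```

MSE and RMSE weight the single error of size 2 more heavily than MAE does, which is why they are more sensitive to outliers.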

