
Week 6

The document provides an overview of linear regression, focusing on univariate and multivariate approaches, cost functions, and optimization techniques such as gradient descent. It discusses the importance of minimizing the cost function to achieve the best fit line for predictions, as well as various regularization techniques to prevent overfitting. Additionally, it introduces different optimizers and their roles in adjusting model parameters to minimize loss functions.

Uploaded by

ibrahimsani2560

Linear Regression

Linear regression with one variable


Univariate LR
Predict output y from one variable/feature x.

Model representation
A line of best fit (regression line) is a straight line that best approximates a scatter plot of data points.
ŷ: estimated/predicted value
y: actual/true value (ground truth)
m: the number of samples/instances
n: the number of features
Dataset size: m samples × n features
(x, y) = one training example
(x(i), y(i)) = the i-th training example
Example: x(1) = 2104, x(2) = 1416, y(1) = 460, y(2) = 232

[Diagram: Training Set → Learning Algorithm → h. New/unseen data (size of house) → h → estimated house price.]

How do we represent h?
$\hat{y}(x) = h_\theta(x) = \theta_0 + \theta_1 x$
θ0 and θ1: parameters/weights that are trained/determined by the ML model (not hyperparameters)
θ0 = intercept/bias/constant
θ1 = slope/coefficient/gradient
This is linear regression with one variable, also called univariate linear regression.

A cost function lets us figure out how to fit the best straight line to our data.
Objective: find the best-fit regression line by reducing the cost function.
Training Set
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

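Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
θi's: parameters
How to choose the θi's?

[Figure: three example plots of h(x) for different parameter choices, with both axes running from 0 to 3]

Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).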
To formalize this, we want to solve a minimization problem:
Minimize (hθ(x) − y)², i.e. minimize the difference between h(x) and y for every training example.
Sum this over the whole training set:
$\mathrm{SSE} = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, where m is the number of samples.
Minimize the squared difference between the predicted house price and the actual house price, scaled by 1/2m:
• 1/m means we take the average.
• 1/2m: the 2 makes the math a bit easier and doesn't change the parameters we determine at all (half the smallest value is still the smallest value!).
Minimizing over θ0 and θ1 means we get the values of θ0 and θ1 that give, on average, the minimal deviation of hθ(x) from y when we use those parameters in our hypothesis function.
More cleanly, this is a cost function:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Hypothesis: like your prediction machine; throw in an x value, get a putative y value. It models the mapping from x to y.
Cost: a way to use your training data to determine the θ values that make the hypothesis as accurate as possible.
A cost function is a measure of the error between the value your model predicts and the actual value.
The loss function (error) is for a single training example, while the cost function is over the entire training set.
This cost function is also called the squared error cost function. It is a reasonable choice for most regression problems and is probably the most commonly used one.
Various regression loss functions: mean squared error, mean squared logarithmic error, mean absolute error, root mean squared error.
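The squared error cost can be sketched in a few lines of plain Python. This is a minimal illustration, not the slides' own code: the function names `hypothesis` and `squared_error_cost` are made up here, and the two candidate lines are arbitrary choices used only to compare costs on the four training rows listed earlier.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def squared_error_cost(theta0, theta1, xs, ys):
    """Squared error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# The four (size, price) rows listed in the training set.
sizes = [2104, 1416, 1534, 852]
prices = [460, 232, 315, 178]

# Two arbitrary candidate lines, picked only for illustration.
cost_flat = squared_error_cost(0.0, 0.0, sizes, prices)
cost_sloped = squared_error_cost(0.0, 0.2, sizes, prices)
print(cost_flat, cost_sloped)
```

The sloped line follows the data far more closely than the flat one, so its cost is much lower, which is exactly the comparison the cost function is meant to make.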
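Full model
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: θ0, θ1
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: find the values of θ0 and θ1 for which J(θ0, θ1) is minimum.

Simplified (θ0 = 0)
Hypothesis: $h_\theta(x) = \theta_1 x$
Parameters: θ1
Cost function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right)^2$
Goal: find the value of θ1 for which J(θ1) is minimum.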
[Figure: left, hθ(x) for a fixed θ1 as a function of x, with the training points marked; right, J(θ1) as a function of the parameter θ1]
So for example:
• θ1 = 1 gives J(θ1) = 0
• θ1 = 0.5 gives J(θ1) ≈ 0.58
• θ1 = 0 gives J(θ1) ≈ 2.3
The line with the least sum of squared errors is the best-fit line.
If we compute J(θ1) for a range of values of θ1 and plot J(θ1) against θ1, we get a bowl-shaped curve: J is a quadratic polynomial in θ1.
• The optimization objective for the learning algorithm is to find the value of θ1 that minimizes J(θ1).
• Here θ1 = 1 is the best value for θ1.
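The quoted values of J(θ1) can be reproduced with a short sketch. It assumes the training set is the three points (1, 1), (2, 2), (3, 3); the slides do not list the points explicitly, but this choice gives the same numbers. The function name `cost` and the grid spacing are illustrative.

```python
def cost(theta1, xs, ys):
    """J(theta1) = (1 / 2m) * sum((theta1 * x - y)^2) for the simplified model."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed training set: three points on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Evaluate J on a grid of theta1 values from -0.5 to 2.5, as on the plot axis.
grid = [-0.5 + 0.25 * k for k in range(13)]
costs = {theta1: cost(theta1, xs, ys) for theta1 in grid}

best_theta1 = min(costs, key=costs.get)
print(best_theta1, costs[best_theta1])
for theta1 in (1.0, 0.5, 0.0):
    print(theta1, round(costs[theta1], 2))
```

Scanning a grid like this only works for one or two parameters; gradient descent replaces the scan when there are more.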
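Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: θ0, θ1
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: minimize J(θ0, θ1) over θ0 and θ1
[Figure: left, hθ(x) for fixed θ0, θ1 as a function of x, plotted over the housing data (x-axis: size in feet², 500 to 3000; y-axis: price ($) in 1000's); right, J(θ0, θ1) as a function of the parameters θ0, θ1]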
Previously we plotted our cost function as J(θ1) against θ1.
Now we have two parameters, so the plot becomes a bit more complicated.
It generates a 3D surface plot where the axes are:
• X = θ1
• Z = θ0
• Y = J(θ0,θ1)
The height (Y) indicates the value of the cost function, so we look for the point where Y is at a minimum.
A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format.
[Figures: a series of paired plots; left, hθ(x) for fixed θ0, θ1 as a function of x; right, J(θ0, θ1) as a function of the parameters θ0, θ1]
• Minimize the cost function J
• Gradient descent
  • Used all over machine learning for minimization
• Start by looking at a general J() function
• Problem
  • We have J(θ0, θ1)
  • We want to get min J(θ0, θ1)
• Gradient descent applies to more general functions
  • Have J(θ0, θ1, θ2, ..., θn)
  • Want min J(θ0, θ1, θ2, ..., θn)

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum
• Start with initial guesses
  • Start at 0,0 (or any other value)
• Keep changing θ0 and θ1 a little bit to try to reduce J(θ0,θ1)
• Each time you change the parameters, step in the direction that reduces J(θ0,θ1) the most (the negative gradient)
• Repeat until you converge to a local minimum
• Gradient descent has an interesting property: where you start can determine which minimum you end up in
  • One initialization point led to one local minimum
  • Another led to a different one
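Gradient descent is used to minimize the MSE by calculating the gradient of the cost function and repeatedly stepping against it:

repeat until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for j = 0 and j = 1)

Correct (simultaneous update): compute the new values of θ0 and θ1 from the current parameters first, then assign both.
Incorrect: assign the new θ0 first and then use it when computing the new θ1.

α (alpha): the learning rate, a hyperparameter
Controls how big a step you take
If α is big, gradient descent is aggressive
If α is small, it takes tiny steps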
Learning Rate Annealing/adaptable learning rates


Rather than using a constant learning rate throughout training, we may
anneal the learning rate, and have it decline as time progresses
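One way to anneal the learning rate is inverse-time decay, sketched here. The slides do not name a specific schedule, so this formula, the function name `annealed_learning_rate` and the constants are all assumptions for illustration.

```python
def annealed_learning_rate(alpha0, decay, step):
    """Inverse-time decay: alpha_t = alpha0 / (1 + decay * t)."""
    return alpha0 / (1.0 + decay * step)


# Hypothetical settings chosen only for illustration.
alpha0, decay = 0.3, 0.1
schedule = [annealed_learning_rate(alpha0, decay, t) for t in range(5)]
print([round(a, 4) for a in schedule])

# Use the schedule while minimizing J(theta) = theta^2, whose gradient is 2 * theta.
theta = 4.0
for t in range(50):
    theta -= annealed_learning_rate(alpha0, decay, t) * 2 * theta
print(theta)
```

Early steps are large and later steps shrink, so the iterate settles near the minimum instead of bouncing around it.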
Gradient descent algorithm
To understand gradient descent, we'll return to a simpler function where we minimize over one parameter, which helps explain the algorithm in more detail.
• min θ1 J(θ1), where θ1 is a real number
• Adjusting the weights or parameters from the error in this way is known as backward error propagation
Notation nuances
• Partial derivative vs. derivative
  • Use a partial derivative when the function has multiple variables but we differentiate with respect to only one of them
  • Use an ordinary derivative when the function has a single variable
• Momentum: adds a fraction of the past weight update to the current weight update. This helps prevent the model from getting stuck in local minima. With momentum, movement along the error surface is generally smoother, and the network can pass through local minima more quickly and is more likely to reach the global minimum. In short, momentum takes previous weight changes into account when updating the current weights.
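The momentum idea can be sketched on a single parameter. This uses the common "velocity" formulation, which is an assumption because the slides do not give a formula; `momentum_descent`, the momentum fraction `beta = 0.9` and the quadratic cost are illustrative choices.

```python
def momentum_descent(gradient, theta, alpha=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a single parameter."""
    velocity = 0.0  # the previous weight update
    for _ in range(steps):
        # New update = fraction of the past update minus the gradient step.
        velocity = beta * velocity - alpha * gradient(theta)
        theta = theta + velocity
    return theta


# Illustrative cost J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_final = momentum_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_final)
```

Setting `beta = 0` recovers plain gradient descent, which makes it easy to compare the two.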
[Figure: plots of the error against the parameter, with annotations marking where we want to reduce the parameter and where we want to increase it]
If α is too small, gradient descent can be slow, which means a higher training time.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
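The effect of the learning rate is easy to see on a toy cost. The sketch below assumes J(θ) = θ², a start at θ = 1 and three hand-picked values of α; none of these come from the slides.

```python
def run(alpha, steps=20, theta=1.0):
    """Gradient descent on J(theta) = theta^2 (gradient 2 * theta)."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
    return theta


too_small = run(alpha=0.001)   # barely moves away from the start
reasonable = run(alpha=0.1)    # approaches the minimum at 0
too_large = run(alpha=1.1)     # every step overshoots and grows
print(too_small, reasonable, too_large)
```

With α = 1.1 each step multiplies θ by −1.2, so the iterate flips sign and grows, which is the divergence described above.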
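Local minimum: the value of the loss function is minimum at that point within a local region.
Global minimum: the value of the loss function is minimum across the entire domain of the loss function.
[Figure: a loss curve marking the global minimum, a local optimum, and the current parameter value]
Vanishing gradient problem: the gradient keeps getting smaller, so the weight updates become tiny and learning stalls.
Exploding gradient problem: the opposite of vanishing. The gradient keeps getting larger, which causes large weight updates and makes gradient descent diverge.
Gradient descent can converge to a local minimum even with the learning rate α fixed, because the gradient shrinks as we approach the minimum and the steps shrink with it.
Partial derivatives:
$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$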
sklearn.linear_model.LinearRegression
Ordinary least squares Linear Regression.

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

fit_intercept : bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e., data is expected to be centered).

normalize : bool, default=False
Deprecated since scikit-learn 1.0 and removed in 1.2. This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False.

copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten.
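A short usage sketch for `LinearRegression` follows. The data are hypothetical points on y = 1 + 2x, chosen so the fitted intercept and slope are easy to check; the `normalize` argument is left out because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

intercept = model.intercept_   # plays the role of theta0
slope = model.coef_[0]         # plays the role of theta1
prediction = model.predict(np.array([[4.0]]))[0]
print(intercept, slope, prediction)
```

`LinearRegression` solves the least squares problem directly, so there is no learning rate to tune here.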
Polynomial Regression
sklearn.preprocessing.PolynomialFeatures

class sklearn.preprocessing.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C')

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

degree : int or tuple (min_degree, max_degree), default=2
If a single int is given, it specifies the maximal degree of the polynomial features. If a tuple (min_degree, max_degree) is passed, then min_degree is the minimum and max_degree is the maximum polynomial degree of the generated features. Note that min_degree=0 and min_degree=1 are equivalent as outputting the degree zero term is determined by include_bias.
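The [a, b] example can be checked directly. This sketch assumes a = 2 and b = 3, values picked only so that each output column is recognizable.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One two-dimensional sample [a, b] with a = 2 and b = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=True)
features = poly.fit_transform(X)

# Columns follow the order [1, a, b, a^2, ab, b^2].
print(features[0].tolist())
```

Feeding these expanded features into a linear model is what turns linear regression into polynomial regression.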
In Stochastic Gradient Descent (SGD), we consider just one example/sample at a time to take a single step: calculate the gradient for that example and use it to update the weights.
Mini-batch gradient descent: we use a batch of a fixed number of training examples, smaller than the full dataset, and call it a mini-batch.
Different Optimizers
Optimizers adjust the parameters of a model. The purpose of an optimizer is to adjust the model weights to minimize a loss function.

Examples: Gradient Descent/Batch Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient Descent with momentum, Mini-Batch Gradient Descent, Adagrad, and Adam

Stochastic Gradient Descent with momentum: adds a fraction of the previous update to the current update of the parameters.
Adagrad: uses an adaptive learning rate for each parameter, shrinking it as squared gradients accumulate over the iterations.
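Stochastic, mini-batch and batch gradient descent differ only in how many examples feed each update, which a single sketch can show. The helper name `minibatch_gradient_descent`, the noise-free data and the settings below are assumptions for illustration; on noisy data the stochastic variants would fluctuate around the minimum rather than land on it.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, alpha=0.05, epochs=2000, seed=0):
    """Fit h(x) = theta0 + theta1 * x using batches of `batch_size` examples."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            grad0 = sum(errors) / len(batch)
            grad1 = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical noise-free data on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x for x in xs]

sgd_fit = minibatch_gradient_descent(xs, ys, batch_size=1)          # stochastic
mini_fit = minibatch_gradient_descent(xs, ys, batch_size=2)         # mini-batch
batch_fit = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # full batch
print(sgd_fit, mini_fit, batch_fit)
```

A batch size of 1 gives SGD, the full dataset gives batch gradient descent, and anything in between is a mini-batch.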
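Linear Regression with Multiple Features:
Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$ (with $x_0 = 1$)
Parameters: $\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_n$
Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: minimize J(θ) over θ0, θ1, ..., θn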
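With several features the hypothesis and cost are usually written as matrix operations. The sketch below uses NumPy and assumes a column of ones is prepended to the features so that θ0 acts as the intercept; the data and the names `predict` and `cost` are hypothetical.

```python
import numpy as np


def predict(theta, X):
    """h(x) = theta^T x for every row of X (X already has a leading 1 column)."""
    return X @ theta


def cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum((X theta - y)^2)."""
    m = len(y)
    residuals = predict(theta, X) - y
    return float(residuals @ residuals) / (2 * m)


# Hypothetical data with two features, generated from y = 1 + 2*x1 + 3*x2.
features = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 4.0]])
X = np.column_stack([np.ones(len(features)), features])  # prepend x0 = 1
y = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

theta_true = np.array([1.0, 2.0, 3.0])
theta_zero = np.zeros(3)
print(cost(theta_true, X, y), cost(theta_zero, X, y))
```

The same two functions work for any number of features, since only the width of `X` and the length of `theta` change.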
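Overfit and Underfit problem:
Ridge (L2) Regularization
Regularization refers to techniques used to calibrate machine learning models so that they minimize the loss function while avoiding overfitting.
Regularized cost = loss function + penalty
Ridge adds a penalty equal to the sum of the squares of the coefficients.
λ: hyperparameter known as the regularization constant; it is greater than zero.
If λ increases, the cost function J increases (for the same coefficients).

Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$
Goal: minimize J(θ) over θ

Lasso (L1) Regularization
Lasso adds a penalty equal to the sum of the absolute values of the coefficients.
If λ increases, the cost function J increases (for the same coefficients).

Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$
Goal: minimize J(θ) over θ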
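The two penalties can be compared with a small sketch. It assumes the regularization constant multiplies the penalty directly and that the intercept is not penalized; both are common conventions rather than something the slides state, and `regularized_cost` is a made-up helper.

```python
def regularized_cost(theta, rows, ys, lam, penalty):
    """Squared error cost plus an L2 (ridge) or L1 (lasso) penalty.

    theta[0] is the intercept and is left out of the penalty.
    """
    m = len(ys)
    predictions = [theta[0] + sum(t * x for t, x in zip(theta[1:], row)) for row in rows]
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / (2 * m)
    if penalty == "l2":
        return loss + lam * sum(t ** 2 for t in theta[1:])
    if penalty == "l1":
        return loss + lam * sum(abs(t) for t in theta[1:])
    raise ValueError("penalty must be 'l1' or 'l2'")


# Hypothetical data and coefficients, for illustration only.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
ys = [5.0, 4.0, 9.0]
theta = [0.0, 1.0, 2.0]

for lam in (0.0, 0.1, 1.0):
    print(lam, regularized_cost(theta, rows, ys, lam, "l2"),
          regularized_cost(theta, rows, ys, lam, "l1"))
```

These coefficients fit the data exactly, so everything printed is pure penalty, and it grows with the regularization constant as described above.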
Elastic Net Regularization
Elastic net linear regression uses the penalties from both the lasso and ridge techniques to regularize regression models.

We control the mix with a hyperparameter known as the L1 ratio, denoted r, which is set between 0 and 1:
• r = 0: elastic net behaves exactly like ridge regression.
• 0 < r < 0.5: the ridge penalty dominates.
• r = 0.5: the lasso and ridge penalties contribute equally.
• 0.5 < r < 1: the lasso penalty dominates.
• r = 1: elastic net behaves exactly like lasso regression.
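The role of the L1 ratio can be sketched as a weighted mix of the two penalties. The weighting below is a simplified assumption; scikit-learn's `ElasticNet`, for instance, scales the L2 part by an extra factor of one half. The name `elastic_net_penalty` and the coefficients are hypothetical.

```python
def elastic_net_penalty(coefficients, lam, r):
    """Mix of the L1 and L2 penalties, weighted by the L1 ratio r in [0, 1]."""
    l1 = sum(abs(c) for c in coefficients)
    l2 = sum(c ** 2 for c in coefficients)
    return lam * (r * l1 + (1.0 - r) * l2)


coefficients = [1.0, -2.0]  # hypothetical coefficients
ridge_only = elastic_net_penalty(coefficients, lam=1.0, r=0.0)
balanced = elastic_net_penalty(coefficients, lam=1.0, r=0.5)
lasso_only = elastic_net_penalty(coefficients, lam=1.0, r=1.0)
print(ridge_only, balanced, lasso_only)
```

At r = 0 only the squared term survives and at r = 1 only the absolute-value term does, matching the two extremes described above.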
sklearn.linear_model.SGDRegressor
Linear model fitted by minimizing a regularized empirical loss with SGD.

class sklearn.linear_model.SGDRegressor(loss='squared_error', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)

max_iter : int. The maximum number of passes over the training data (epochs).
early_stopping: whether to use early stopping to terminate training when the validation score is not improving.
loss='huber': Huber loss is a loss function that is less sensitive to outliers in the data than the squared error loss.
alpha: constant that multiplies the regularization term.
l1_ratio: the Elastic Net mixing parameter; only used when penalty='elasticnet'.
Evaluation Metrics for Regression Model
1. Mean Absolute Error (MAE)
MAE is the mean of the absolute differences between the actual and predicted values.

from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))

2. Mean Squared Error (MSE)
MSE is the mean of the squared differences between the actual and predicted values.

from sklearn.metrics import mean_squared_error
print("MSE", mean_squared_error(y_test, y_pred))

3. Root Mean Squared Error (RMSE)
RMSE is the square root of the mean squared error.

import numpy as np
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))
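The three metrics can be run end to end on a tiny example. The arrays `y_test` and `y_pred` below are made-up values chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions, for illustration only.
y_test = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mae = mean_absolute_error(y_test, y_pred)   # mean of |1|, |0|, |2|
mse = mean_squared_error(y_test, y_pred)    # mean of 1, 0, 4
rmse = np.sqrt(mse)
print("MAE", mae, "MSE", mse, "RMSE", rmse)
```

RMSE is in the same units as the target, which often makes it easier to interpret than MSE.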
