
Training Models

Ways of training a Linear Regression Model:

⚫ Using a direct “closed-form” equation that computes the model
parameters that best fit the model to the training set (i.e., the
model parameters that minimize the cost function over the training
set).

⚫ Using an iterative optimization approach, called Gradient Descent
(GD), that gradually tweaks the model parameters to minimize the
cost function over the training set, eventually converging to the
same set of parameters as the first method.
Linear Regression
Computational Complexity
⚫ The Normal Equation computes the inverse of Xᵀ · X, which is an
n × n matrix (where n is the number of features). The computational
complexity of inverting such a matrix is typically about O(n³).

⚫ In other words, if you double the number of features, you multiply
the computation time by roughly 2³ = 8.
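
A minimal NumPy sketch (illustrative data and variable names) of the closed-form solution, θ̂ = (Xᵀ · X)⁻¹ · Xᵀ · y, known as the Normal Equation:

```python
import numpy as np

# Illustrative data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]   # add the bias feature x0 = 1 to each instance

# Normal Equation: closed-form computation of the best-fit parameters
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)                    # should be close to [[4.], [3.]]
```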
Gradient Descent

⚫ Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.

⚫ The general idea of Gradient Descent is to tweak parameters
iteratively in order to minimize a cost function.

⚫ Concretely, you start by filling θ with random values (this is called
random initialization), and then you improve it gradually, taking one
baby step at a time, each step attempting to decrease the cost
function (e.g., the MSE), until the algorithm converges to a minimum.
Batch Gradient Descent
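A minimal sketch of Batch Gradient Descent for Linear Regression, assuming the MSE cost; the learning rate, iteration count, and data are illustrative choices:

```python
import numpy as np

# Illustrative data: same setup as the Normal Equation sketch above
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]   # add the bias feature x0 = 1

eta = 0.1            # learning rate (illustrative)
n_iterations = 1000
m = len(X_b)         # number of training instances

theta = np.random.randn(2, 1)        # random initialization of the parameters

for iteration in range(n_iterations):
    # gradient of the MSE cost computed over the *whole* training set
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```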
Stochastic Gradient Descent
⚫ The main problem with Batch Gradient Descent is the fact
that it uses the whole training set to compute the gradients
at every step, which makes it very slow when the training
set is large.

⚫ At the opposite extreme, Stochastic Gradient Descent just picks a
random instance in the training set at every step and computes the
gradients based only on that single instance. Obviously this makes
the algorithm much faster since it has very little data to manipulate
at every iteration.

⚫ It also makes it possible to train on huge training sets, since only
one instance needs to be in memory at each iteration (SGD can be
implemented as an out-of-core algorithm).
⚫ When the cost function is very irregular this can actually help the
algorithm jump out of local minima, so Stochastic Gradient Descent has a
better chance of finding the global minimum than Batch Gradient Descent
does.

⚫ Therefore randomness is good to escape from local optima, but bad
because it means that the algorithm can never settle at the minimum.
One solution to this dilemma is to gradually reduce the learning rate.
The steps start out large (which helps make quick progress and escape
local minima), then get smaller and smaller, allowing the algorithm to
settle at the global minimum.

⚫ This process is called simulated annealing, because it resembles the
process of annealing in metallurgy where molten metal is slowly cooled
down.

⚫ The function that determines the learning rate at each iteration is called
the learning schedule. If the learning rate is reduced too quickly, you may
get stuck in a local minimum, or even end up frozen halfway to the
minimum. If the learning rate is reduced too slowly, you may jump around
the minimum for a long time and end up with a suboptimal solution if you
halt training too early.
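
A minimal sketch of Stochastic Gradient Descent with a simple learning schedule; the schedule constants t0 and t1, the epoch count, and the data are illustrative:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
t0, t1 = 5, 50                      # learning-schedule hyperparameters (illustrative)

def learning_schedule(t):
    # gradually reduce the learning rate as training progresses
    return t0 / (t + t1)

theta = np.random.randn(2, 1)       # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)            # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)   # gradient from a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```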
Mini-batch Gradient Descent
⚫ At each step, instead of computing the gradients based on the
full training set (as in Batch GD) or based on just one instance
(as in Stochastic GD), Mini-batch GD computes the gradients on
small random sets of instances called mini-batches.
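
A minimal sketch of Mini-batch Gradient Descent; the batch size, learning rate, epoch count, and data are illustrative:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
batch_size = 20                      # illustrative mini-batch size
eta = 0.1                            # fixed learning rate, kept simple for the sketch

theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)           # shuffle the instances each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]
        yi = y_shuffled[start:start + batch_size]
        # gradient of the MSE cost over one small random set of instances
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients
```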
Comparison
Polynomial Regression
⚫ A simple way to fit nonlinear data with a linear model is to add
powers of each feature as new features, then train a linear model on
this extended set of features. This technique is called Polynomial
Regression.
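
A minimal scikit-learn sketch on illustrative quadratic data: PolynomialFeatures adds the squared feature, then a plain LinearRegression is trained on the extended feature set:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative nonlinear data: y = 0.5 x^2 + x + 2 + noise
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)       # each instance now has [x, x^2]

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                        # linear model on the extended features
print(lin_reg.intercept_, lin_reg.coef_)
```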
Learning Curves
Overfitting example
Regularized Linear Models
Ridge Regression
Lasso Regression
⚫ Least Absolute Shrinkage and Selection Operator Regression (simply
called Lasso Regression) is another regularized version of Linear
Regression: just like Ridge Regression, it adds a regularization term
to the cost function.

⚫ It uses the ℓ1 norm of the weight vector instead of half the square
of the ℓ2 norm.
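
A minimal scikit-learn sketch contrasting the two penalties; the alpha values (regularization strengths) and data are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

ridge_reg = Ridge(alpha=1.0)     # penalizes half the square of the l2 norm of the weights
ridge_reg.fit(X, y.ravel())

lasso_reg = Lasso(alpha=0.1)     # penalizes the l1 norm of the weights
lasso_reg.fit(X, y.ravel())

print(ridge_reg.predict([[1.5]]), lasso_reg.predict([[1.5]]))
```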
Elastic Net
⚫ Elastic Net is a middle ground between Ridge Regression and
Lasso Regression. The regularization term is a simple mix of
both Ridge and Lasso’s regularization terms, and you can
control the mix ratio r.

⚫ When r = 0, Elastic Net is equivalent to Ridge Regression, and when
r = 1, it is equivalent to Lasso Regression.
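
A minimal scikit-learn sketch; ElasticNet's l1_ratio parameter plays the role of the mix ratio r, and the alpha and l1_ratio values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# l1_ratio is the mix ratio r: 0 behaves like Ridge, 1 behaves like Lasso
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y.ravel())
print(elastic_net.predict([[1.5]]))
```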
Early Stopping
⚫ A very different way to regularize iterative learning algorithms such
as Gradient Descent is to stop training as soon as the validation
error reaches a minimum. This is called early stopping.
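
A minimal sketch of early stopping with scikit-learn's SGDRegressor, trained one epoch at a time via warm_start; the data, split, and hyperparameters are illustrative, and the model with the lowest validation error is kept:

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative data and train/validation split
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# warm_start=True makes each .fit() continue from the previous weights (one epoch at a time)
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

best_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)                      # one more epoch of training
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:                     # keep the model at the validation minimum
        best_val_error = val_error
        best_model = deepcopy(sgd_reg)
```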
Logistic Regression
⚫ Logistic Regression (also called Logit Regression) is commonly
used to estimate the probability that an instance belongs to a
particular class (e.g., what is the probability that this email is
spam?)

⚫ If the estimated probability is greater than 50%, then the model
predicts that the instance belongs to that class (called the positive
class, labeled “1”).

⚫ Else it predicts that it does not (i.e., it belongs to the negative
class, labeled “0”). This makes it a binary classifier.
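
A minimal scikit-learn sketch of such a binary classifier on the iris dataset (detecting Iris virginica from petal width, an illustrative choice of feature and class):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width as the single feature
y = (iris.target == 2).astype(int)   # 1 if Iris virginica (positive class), else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities; the model predicts the positive class when p > 50%
print(log_reg.predict_proba([[1.7], [1.5]]))
print(log_reg.predict([[1.7], [1.5]]))
```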
Training and Cost Function
Decision Boundaries
Softmax Regression
⚫ The Logistic Regression model can be generalized to
support multiple classes directly, without having to train
and combine multiple binary classifiers.
⚫ This is called Softmax Regression, or Multinomial Logistic
Regression.
⚫ The idea is quite simple: when given an instance x, the Softmax
Regression model first computes a score sₖ(x) for each class k, then
estimates the probability of each class by applying the softmax
function (also called the normalized exponential) to the scores.
⚫ The equation to compute sₖ(x) should look familiar, as it is just
like the equation for Linear Regression prediction.
Note that each class has its own dedicated parameter vector θₖ. All these
vectors are typically stored as rows in a parameter matrix Θ.

Once you have computed the score of every class for the instance x, you can
estimate the probability pₖ that the instance belongs to class k by running the
scores through the softmax function. It computes the exponential of every
score, then normalizes them (dividing by the sum of all the exponentials).
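
A minimal sketch of the softmax computation in NumPy, plus scikit-learn's LogisticRegression used as a multiclass (Softmax) classifier on the iris dataset; the feature choice and C value are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def softmax(scores):
    # exponentiate every class score, then normalize by the sum of the exponentials
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # class probabilities summing to 1

iris = load_iris()
X = iris.data[:, (2, 3)]      # petal length, petal width
y = iris.target               # three classes

# With the default lbfgs solver, LogisticRegression handles multiclass targets
# as Softmax (multinomial) regression; C=10 is an illustrative regularization strength.
softmax_reg = LogisticRegression(C=10)
softmax_reg.fit(X, y)
print(softmax_reg.predict_proba([[5, 2]]))
```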
