MODULE 2
LEARNING WITH
REGRESSION AND TREES
2.1 Learning with Regression: Linear Regression,
Multivariate Linear Regression, Logistic Regression.
Darakhshan Khan
LOGISTIC REGRESSION
• Logistic regression is a statistical model (also known as the logit model) that is often used for classification
• Logistic regression estimates the probability of an event occurring, such as voted or
didn’t vote, based on a given dataset of independent variables.
• Since the outcome is a probability, the dependent variable is bounded between 0 and
1.
• Consider y variable (binary classification)
• 0: negative class
• 1: positive class
• Examples
• Email: spam / not spam
• Online transactions: fraudulent / not fraudulent
• Tumor: malignant / not malignant
LOGISTIC REGRESSION (1)
• Issue 1 of Linear Regression
• Using linear regression and then thresholding the classifier output (i.e. anything over some value is yes, else no)
• In our example, linear regression with thresholding seems to work, i.e. it does a reasonable job of stratifying the data points into one of two classes
• But what if we had a single Yes with a very large tumour size?
• This outlier would tilt the fitted line and shift the threshold, so existing yeses would now be classified as nos
LOGISTIC REGRESSION (2)
Issue 2 of Linear Regression
• We know y is 0 or 1
• A linear model can give values larger than 1 or less than 0
• So, logistic regression generates a value that is always between 0 and 1
• Despite its name, logistic regression is a classification algorithm - don't be confused
LOGISTIC REGRESSION – MODEL REPRESENTATION
• What function is used to represent the model in classification?
• The aim of this classifier is to output values between 0 and 1
• Using linear regression, hθ(x) = θT x
• For the classification hypothesis representation we use hθ(x) = g(θT x)
• Where g(z) is applied to z = θT x, a real number
• g(z) = 1 / (1 + e^(-z))
• This is the sigmoid function, or the logistic function
• If we combine these equations we can write out the hypothesis as hθ(x) = 1 / (1 + e^(-θT x)) (sketched in code below)
• What does the sigmoid function look like?
• Crosses 0.5 at z = 0, then flattens out
• Asymptotes at 0 and 1
• Given this we need to fit θ to our data
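As an illustration, a minimal Python sketch of the sigmoid and the hypothesis (the function names `sigmoid` and `hypothesis`, and the example values, are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); the output always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1 for input x."""
    return sigmoid(np.dot(theta, x))

# Example parameters and a feature vector x with x0 = 1 (bias term)
theta = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.5])
print(hypothesis(theta, x))  # a value between 0 and 1
```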
LOGISTIC REGRESSION – MODEL REPRESENTATION (1)
Interpreting output
• When hθ(x) outputs a number, treat that value as the estimated probability that y = 1 on input x
• Example: if x is a feature vector with x0 = 1 (as always) and x1 = tumourSize (some value)
• hθ(x) = 0.7 tells the patient they have a 70% chance of the tumor being malignant
• More formal notation, hθ(x) = P(y=1|x ; θ)
• Probability that y=1, given x, parameterized by θ
• Since this is a binary classification task we know y = 0 or 1
• So the following must be true:
• P(y=1|x ; θ) + P(y=0|x ; θ) = 1
• P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
LOGISTIC REGRESSION – DECISION BOUNDARY
• To better understand what the hypothesis function (model) looks like
• One way of using the sigmoid function is:
• When the probability of y being 1 is greater than 0.5 then we can predict y = 1
• Else we predict y = 0
• Looking at the sigmoid function, g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
• So if z is positive, g(z) is greater than 0.5, where z = θT x
• So when θT x >= 0, then hθ(x) >= 0.5
• So what we've shown is that the hypothesis predicts y = 1 when θT x >= 0
• The corollary is that when θT x < 0 the hypothesis predicts y = 0
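A small sketch of this decision rule, assuming NumPy vectors (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Predict y = 1 when theta^T x >= 0 (equivalently, when g(theta^T x) >= 0.5)."""
    return 1 if np.dot(theta, x) >= 0 else 0
```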
LOGISTIC REGRESSION – DECISION BOUNDARY (1)
• Example: hθ(x) = g(θ0 + θ1x1 + θ2x2)
• Assume θ0 = -3, θ1 = 1, θ2 = 1
• So our parameter vector is a column vector with the above values, i.e. θT is a row vector = [-3, 1, 1]
• What does this mean? The z here becomes θT x
• We predict "y = 1" if -3x0 + 1x1 + 1x2 >= 0
• -3 + 1x1 + 1x2 >= 0
• We can also re-write this as If (x1 + x2 >= 3) then we predict y = 1
• If we plot x1 + x2 = 3, we graphically plot our decision boundary
• Means we have these two regions on the graph
• Blue = false, Magenta = true
• Line = decision boundary
• Concretely, the straight line is the set of points where hθ(x) = 0.5 exactly
• The decision boundary is a property of the hypothesis
• Means we can create the boundary with the hypothesis (function) and parameters, without any data
• Later, we use the data to determine the parameter values
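A quick numerical check of the worked example above (θ values as on the slide; the test points are illustrative):

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # theta0 = -3, theta1 = 1, theta2 = 1

def predict(theta, x1, x2):
    # z = theta^T x with x = [1, x1, x2]; predict y = 1 when z >= 0,
    # i.e. when x1 + x2 >= 3
    z = theta @ np.array([1.0, x1, x2])
    return 1 if z >= 0 else 0

print(predict(theta, 1, 1))  # 0: below the boundary line x1 + x2 = 3
print(predict(theta, 2, 2))  # 1: on or above the boundary line x1 + x2 = 3
```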
LOGISTIC REGRESSION – NON-LINEAR DECISION BOUNDARY
• Get logistic regression to fit a complex non-linear data set, i.e. add higher-order terms to the hypothesis function
• hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²)
• We take the transpose of the θ vector times the input vector, e.g. θT = [-1, 0, 0, 1, 1]
• So, predict "y = 1" if -1 + x1² + x2² >= 0, or equivalently x1² + x2² >= 1
• If we plot x1² + x2² = 1, this gives a circle with a radius of 1 around the origin (checked numerically below)
• This indicates more complex decision boundaries can be built by fitting complex parameters to this (relatively) simple hypothesis
• More complex decision boundaries?
• By using higher order polynomial terms, we can get even more complex
decision boundaries
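A brief sketch of the circular boundary example above (θ values as on the slide; the test points are illustrative):

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # matches theta^T = [-1, 0, 0, 1, 1]

def predict(theta, x1, x2):
    # Features include the squared terms x1^2 and x2^2
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    # Predict y = 1 when theta^T x >= 0, i.e. when x1^2 + x2^2 >= 1
    return 1 if theta @ features >= 0 else 0

print(predict(theta, 0.5, 0.5))  # 0: inside the unit circle
print(predict(theta, 1.0, 1.0))  # 1: outside the unit circle
```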
LOGISTIC REGRESSION – COST FUNCTION (ERROR/LOSS)
• Fit θ parameters
• Define the optimization objective for the cost function and fit the parameters
LOGISTIC REGRESSION – COST FUNCTION (ERROR/LOSS)
• Linear regression uses the following function to determine θ: J(θ) = (1/2m) Σ (hθ(x(i)) - y(i))², summed over the m training examples
• Instead of writing the squared error term, we can write it as "cost()":
• cost(hθ(x(i)), y(i)) = (1/2)(hθ(x(i)) - y(i))²
• Which evaluates to the cost for an individual example, using the same measure as used in linear regression
• Redefine J(θ) as J(θ) = (1/m) Σ cost(hθ(x(i)), y(i)), summed over i = 1..m
• To further simplify it we can get rid of the superscripts
• If we use this function for logistic regression, it is a non-convex function for parameter optimization
• We have some function - J(θ) - for determining the parameters
• Our hypothesis function has a non-linearity (the sigmoid function inside hθ(x))
• This is a complicated non-linear function
• If you take hθ(x) and plug it into the cost() function, then plug the cost() function into J(θ) and plot J(θ), we find many local optima -> a non-convex function
• Lots of local minima mean gradient descent may not find the global optimum - it may get stuck in a local minimum
LOGISTIC REGRESSION – COST FUNCTION (ERROR/LOSS) (1)
• We need a different, convex logistic regression cost, where we can apply gradient descent
• This is our logistic regression cost function, i.e. the penalty the algorithm pays:
• cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0
• Plot the function: for y = 1, the cost evaluates as -log(hθ(x))
• So when hθ(x) = 1 (correct), the cost is 0; otherwise the cost increases as we become "more" wrong, i.e. as hθ(x) approaches 0
• X axis is what we predict
• Y axis is the cost associated with that prediction
• This cost function has some interesting properties
• If y = 1 and hθ(x) = 1
• If the hypothesis predicts exactly 1 and that's exactly correct, then the cost is 0 (exactly, not nearly 0)
• As hθ(x) goes to 0
• Cost goes to infinity. This captures the intuition that if hθ(x) = 0 (predict P(y=1|x; θ) = 0) but y = 1, this penalizes the learning algorithm with a massive cost
• What about if y = 0? Then the cost is evaluated as -log(1 - hθ(x))
• Just the inverse of the other function (both cases are sketched below)
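A minimal sketch of this per-example cost (the clipping with a small `eps` is an added numerical safeguard, not part of the slide's formula):

```python
import numpy as np

def cost_per_example(h, y, eps=1e-12):
    """-log(h) when y = 1, -log(1 - h) when y = 0; h is the model output h_theta(x)."""
    h = np.clip(h, eps, 1 - eps)  # avoid log(0) when h is exactly 0 or 1
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost_per_example(0.99, 1))  # near 0: confident and correct
print(cost_per_example(0.01, 1))  # large: confident and wrong
```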
LOGISTIC REGRESSION – SIMPLIFIED COST FUNCTION
• Define a simpler way to write the cost function and apply gradient descent to the logistic regression
• Rather than writing the cost function as two lines/two cases, we can compress them into one equation - more efficient
• cost(hθ(x), y) = -y log(hθ(x)) - (1-y) log(1 - hθ(x))
• If y = 1,then -log(hθ(x)) - (0)log(1 - hθ(x)) = -log(hθ(x))
• Which is what we had before when y = 1
• If y = 0, then -(0)log(hθ(x)) - (1)log(1 - hθ(x)) = -log(1- hθ(x))
• Which is what we had before when y = 0
• The cost function for the θ parameters can be defined as J(θ) = -(1/m) Σ [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ], summed over i = 1..m (computed in the sketch below)
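A vectorized sketch of J(θ), assuming X is an m x n design matrix whose first column is all ones and y is a length-m vector of 0/1 labels (names and the `eps` safeguard are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    m = len(y)
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)  # numerical safety for log
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```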
LOGISTIC REGRESSION – SIMPLIFIED COST FUNCTION (1)
• To fit parameters θ:
• Find parameters θ which minimize J(θ)
• This means we have a set of parameters to use in our model for future
predictions
• Then, if we're given some new example with a set of features x, we can take the θ which we generated, and output our prediction using hθ(x) = 1 / (1 + e^(-θT x))
• Which is P(y=1 | x ; θ), the probability that y = 1, given x, parameterized by θ
LOGISTIC REGRESSION – GRADIENT DESCENT
How to minimize the logistic regression cost function J(θ)
• Use gradient descent as before
• Repeatedly update each parameter using a learning rate α: θj := θj - α (1/m) Σ (hθ(x(i)) - y(i)) xj(i), summed over i = 1..m (for the derivation, refer to the notes; see the sketch below)
• It looks identical, but the hypothesis for Logistic Regression is different from Linear Regression
• Ensuring gradient descent is running correctly: plot J(θ) against the number of iterations and check that it decreases
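A hedged sketch of batch gradient descent for this cost; the learning rate and iteration count are illustrative choices, not values prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Repeatedly update theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)           # current predictions
        grad = (X.T @ (h - y)) / m       # vectorized gradient of J(theta)
        theta -= alpha * grad
    return theta
```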
LOGISTIC REGRESSION – MULTICLASS CLASSIFICATION PROBLEMS
• Similar terms: One-vs-all or One-vs-rest
• Examples
• Email folders or tags (4 classes): Work, Friends, Family, Hobby
• Medical Diagnosis (3 classes): Not ill, Cold, Flu
• Weather (4 classes): Sunny, Cloudy, Rainy, Snow
• Binary vs Multi-class
LOGISTIC REGRESSION – MULTICLASS CLASSIFICATION PROBLEMS
One-vs-all (One-vs-rest)
• Split the data into 3 distinct groups; for each class, train a classifier that compares that class against the rest
• If you have k classes, you need to train k logistic regression classifiers
• To classify a new example, run all k classifiers and predict the class whose classifier outputs the highest probability (see the sketch below)
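A compact sketch of one-vs-all training and prediction, reusing the gradient-descent update from above (class labels assumed to be 0..k-1; names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, num_iters=1000):
    """Train one logistic regression classifier per class (class c vs the rest)."""
    m, n = X.shape
    all_theta = np.zeros((num_classes, n))
    for c in range(num_classes):
        y_c = (y == c).astype(float)      # relabel: 1 for class c, 0 for all other classes
        theta = np.zeros(n)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)
            theta -= alpha * (X.T @ (h - y_c)) / m
        all_theta[c] = theta
    return all_theta

def predict_one_vs_all(all_theta, x):
    """Pick the class whose classifier reports the highest probability."""
    return int(np.argmax(sigmoid(all_theta @ x)))
```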