Cost Function of Logistic Regression
A cost function is a mathematical function that measures the difference
between the actual target values (ground truth) and the values predicted by
the model. Such a function, which assesses a machine learning model’s
performance, is also referred to as a loss function or objective function.
The objective of a machine learning algorithm is usually to minimize the
output of the cost function, i.e. the error.
Log Loss and Cost Function for Logistic Regression
One of the popular metrics for evaluating classification models that output
probabilities is log loss.
Log Loss = −(1/M) Σ(i=1 to M) [ yi log(hθ(xi)) + (1 − yi) log(1 − hθ(xi)) ]
For comparison, the squared-error cost function used in linear regression can be written as:
F(θ) = (1/n) Σ(i=1 to n) (1/2) [hθ(xi) − yi]²
For logistic regression,
hθ(x) = g(θᵀx)
where g is the sigmoid function, g(z) = 1/(1 + e^(−z)). Substituting this
hypothesis into the squared-error cost above leads to a non-convex function.
The cost function used for logistic regression is therefore log loss, and it is
summarized below.
cost(hθ(x), y) = -log(hθ(x)) , when y=1
and
cost(hθ(x), y) = -log(1 - hθ(x)) , when y=0
where
y is the actual value of the target variable,
hθ(x) is the predicted probability that y = 1 given x, parameterized by θ, and
yi is the actual label for the i-th training example.
This cost function penalizes the model with a higher loss when its prediction
diverges from the actual label. Specifically, it imposes a large penalty when
the model confidently predicts the wrong class (i.e., high probability for the
incorrect class).
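As a concrete illustration, here is a minimal NumPy sketch of this cost; the function name log_loss and the tiny example arrays are made up for demonstration.

```python
import numpy as np

def log_loss(y_true, y_pred_proba, eps=1e-15):
    # Average log loss over M examples:
    # -(1/M) * sum( y*log(h) + (1-y)*log(1-h) )
    # Clipping avoids log(0) for predictions of exactly 0 or 1.
    p = np.clip(y_pred_proba, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy batch: actual labels and predicted probabilities that y = 1
y = np.array([1, 0, 1, 0])
h = np.array([0.9, 0.1, 0.6, 0.8])   # the last prediction is confidently wrong

print(log_loss(y, h))                            # moderate average loss (~0.58)
print(log_loss(np.array([0]), np.array([0.99]))) # one very confident mistake (~4.6)
```

Note how sharply the penalty grows as a wrong prediction becomes more confident, which is exactly the behaviour described above.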
Why is Mean Squared Error suitable for Linear Regression?
Because in linear regression the MSE cost is a convex function of the model
parameters: there is a single point of minimum error, the global minimum,
and gradient descent can reliably reach it.
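To make this concrete, here is a small numerical check (a sketch using a made-up one-feature dataset): the MSE cost of a linear model, evaluated over a grid of slope values, never bends downward, so it forms a single bowl with one global minimum.

```python
import numpy as np

# Made-up 1-D data that roughly follows y = 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.1, 3.9, 6.2, 7.8])

# MSE cost of the linear model h(x) = theta * x, over a grid of slope values
thetas = np.linspace(-4.0, 8.0, 601)
cost = np.array([0.5 * np.mean((t * x - y) ** 2) for t in thetas])

# Nonnegative second differences everywhere => the curve is convex along this slice
second_diff = np.diff(cost, 2)
print(second_diff.min() >= 0)   # True: no downward bends, hence no extra local minima
print(thetas[np.argmin(cost)])  # the single minimum, near the true slope of about 2
```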
Why is Mean Squared Error not suitable for Logistic Regression?
Let’s consider the Mean Squared Error (MSE) as a cost function for logistic
regression. It is not suitable because of the nonlinearity introduced by the
sigmoid function.
MSE = (1/2m) Σ(i=1 to m) (hθ(xi) − yi)²
In logistic regression, if we substitute the sigmoid function into the above
MSE equation, we get
MSE = (1/2m) Σ(i=1 to m) (1/(1 + e^(−θᵀxi)) − yi)²
The sigmoid 1/(1 + e^(−z)) is a nonlinear transformation, and evaluating this
term within the Mean Squared Error formula results in a non-convex cost function.
A non-convex function can have multiple local minima, which makes it difficult
to optimize with traditional gradient descent algorithms, as illustrated below.
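A quick numerical check can illustrate the difference (a sketch with made-up single-feature data): along a one-dimensional slice of the parameter, the second differences of the sigmoid-plus-MSE cost change sign (the curve bends both ways, so it is not convex), while those of the log-loss cost stay nonnegative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up single-feature data; theta is a single scalar weight
x = np.array([1.0, 2.0, 0.5, 1.5])
y = np.array([1, 1, 0, 0])

thetas = np.linspace(-8.0, 8.0, 1601)
mse_cost, log_cost = [], []
for t in thetas:
    h = sigmoid(t * x)
    mse_cost.append(0.5 * np.mean((h - y) ** 2))
    log_cost.append(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

# A convex curve has nonnegative second differences everywhere.
mse_dd = np.diff(np.array(mse_cost), 2)
log_dd = np.diff(np.array(log_cost), 2)
print("sigmoid + MSE bends both ways:", mse_dd.min() < 0 < mse_dd.max())  # non-convex
print("log loss never bends downward:", log_dd.min() >= -1e-12)           # convex slice
```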
Imagine you have a function that looks like a series of hills and valleys, with
multiple peaks and troughs scattered throughout. This type of function is
called non-convex because it doesn't have a single, well-defined minimum
point; instead, it has multiple local minima (valleys) and potentially even
some local maxima (peaks).
When you're trying to optimize such a function, the goal is to find the lowest
point, which corresponds to the global minimum. However, because of the
presence of multiple local minima, traditional gradient descent algorithms
can encounter difficulties.
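To see this behaviour in isolation, here is a small sketch on a made-up one-dimensional non-convex function (unrelated to logistic regression): plain gradient descent started from two different points ends up in two different valleys.

```python
import numpy as np

# A made-up non-convex function with two valleys:
# f(x) = x^4 - 3x^2 + x has a deep minimum near x = -1.30 and a shallow one near x = +1.13
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move against the gradient (steepest descent)
    return x

x_left = gradient_descent(-2.0)
x_right = gradient_descent(+2.0)
print(x_left, f(x_left))    # about -1.30, f ≈ -3.51 (the global minimum)
print(x_right, f(x_right))  # about +1.13, f ≈ -1.07 (stuck in a worse local minimum)
```

Which valley the algorithm lands in depends only on the starting point, which is exactly the difficulty with non-convex cost functions.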
Why is it challenging?
1. Getting Stuck in Local Minima: Gradient descent algorithms, like the
one used in logistic regression, work by iteratively moving in the direction of
the steepest descent of the function. However, if they start from an initial
point that is not the global minimum and there are multiple local minima,
they might get trapped in one of the local minima instead of reaching the
global minimum. Once stuck in a local minimum, the algorithm cannot
escape it to find the true minimum.
2. Plateaus and Saddle Points: In addition to local minima, non-convex
functions may have plateaus (flat regions) and saddle points (points where
the gradient is zero but not a minimum or maximum). These features can
slow down or stall the convergence of gradient descent algorithms, making
optimization even more challenging. The log-loss cost used by logistic
regression is convex and therefore free of spurious local minima and saddle
points; a minimal gradient-descent sketch on it follows this list.
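Because the log-loss cost is convex in θ, plain batch gradient descent reaches essentially the same solution no matter where it starts. Below is a minimal sketch with made-up toy data and hypothetical names (train_logistic is illustrative, not a reference implementation); the update uses the standard gradient of the average log loss, (1/m)·Xᵀ(hθ(X) − y).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, theta0, lr=0.1, steps=20000):
    # Batch gradient descent on the (convex) log-loss cost.
    theta = np.array(theta0, dtype=float)
    m = len(y)
    for _ in range(steps):
        h = sigmoid(X @ theta)              # predicted probabilities that y = 1
        theta -= lr * (X.T @ (h - y)) / m   # gradient of the average log loss
    return theta

# Toy data: an intercept column plus one feature; the two classes overlap slightly
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 3.0], [1, 3.5], [1, 4.0]])
y = np.array([0, 0, 1, 0, 1, 1])

# Two very different starting points converge to essentially the same parameters,
# because the log-loss surface has a single (global) minimum.
print(train_logistic(X, y, [10.0, -10.0]))
print(train_logistic(X, y, [-10.0, 10.0]))
```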