Regression Analysis in Machine Learning
Context:
In order to understand the motivation behind regression, let's consider
the following simple example. The scatter plot below shows the
number of college graduates in the US from the year 2001 to 2012.
Now, based on the available data, what if someone asks you how many
college graduates with master's degrees there will be in the year 2018?
It can be seen that the number of college graduates with master’s
degrees increases almost linearly with the year. So by simple visual
analysis, we can get a rough estimate of that number: somewhere between
2.0 and 2.1 million. Let's look at the actual numbers. The graph below plots
the same variable from the year 2001 to the year 2018. It can be seen
that our predicted number was in the ballpark of the actual value.
Since this was a simple problem (fitting a line to data), our minds were
able to do it easily. This process of fitting a function to a set of data
points is known as regression analysis.
What is Regression Analysis?
Regression analysis is the process of estimating the relationship
between a dependent variable and independent variables. In simpler
words, it means fitting a function from a selected family of functions to
the sampled data under some error function. Regression analysis is one
of the most basic tools in the area of machine learning used for
prediction. Using regression, you fit a function to the available data
and try to predict the outcome for future or held-out data points.
Fitting this function serves two purposes:
1. You can estimate missing data within your data range
(Interpolation)
2. You can estimate future data outside your data range
(Extrapolation)
Some real-world examples for regression analysis include predicting
the price of a house given house features, predicting the impact of
SAT/GRE scores on college admissions, predicting sales based on
input parameters, predicting the weather, etc.
Let's consider the previous example of college graduates.
1. Interpolation: Let's assume we have access to somewhat sparse
data where we know the number of college graduates every 4 years,
as shown in the scatter plot below.
We want to estimate the number of college graduates for all the
missing years in between. We can do this by fitting a line to the limited
available data points. This process is called interpolation.
2. Extrapolation: Let's assume we have access to limited data from the
year 2001 to the year 2012, and we want to predict the number of
college graduates from the year 2013 to 2018.
It can be seen that the number of college graduates with master’s
degrees increases almost linearly with the year. Hence, it makes sense
to fit a line to the dataset. Using the 12 available points to fit a line,
and then testing that line's predictions on the 6 future points, we can see
that the predictions are very close to the actual values.
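To make the interpolation and extrapolation ideas concrete, here is a minimal sketch using NumPy. The graduate counts below are synthetic placeholders (the article's actual figures are not reproduced here), so treat the numbers purely as an illustration.

```python
# Fit a line to yearly data, then read it off at in-range years (interpolation)
# and at future years (extrapolation). The counts are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2001, 2013)                                   # 2001 .. 2012
grads = 0.046 * years - 90.8 + rng.normal(0, 0.01, years.size)  # millions, synthetic

# Interpolation: fit a line on sparse data (every 4th year) and estimate the rest.
sparse = slice(0, None, 4)                                      # 2001, 2005, 2009
m, c = np.polyfit(years[sparse], grads[sparse], deg=1)          # returns [slope, intercept]
print("interpolated 2003:", round(c + m * 2003, 3))

# Extrapolation: fit a line on all 12 points and predict a future year.
m, c = np.polyfit(years, grads, deg=1)
print("extrapolated 2018:", round(c + m * 2018, 3))
```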
Mathematically speaking
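The original post presents the general formulation as an equation image, which is not reproduced here. Reconstructing it in the notation used below (family of functions f_beta, loss function l, and data points (x_i, y_i) for i = 1, ..., P), regression finds

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{P} l\big(f_{\beta}(x_i),\, y_i\big)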
Types of regression analysis
Now let’s talk about different ways in which we can carry out
regression. Based on the family of functions f_beta and the loss function l
used, we can categorize regression into the following types.
1. Linear Regression
In linear regression, the objective is to fit a hyperplane (a line for 2D
data points) by minimizing the sum of squared errors over the data points.
Mathematically speaking, linear regression solves the following
problem
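Reconstructing the missing equation for the one-dimensional case implied by the two parameters mentioned below:

\min_{\beta_0,\, \beta_1} \sum_{i=1}^{P} \big(y_i - \beta_0 - \beta_1 x_i\big)^2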
Hence we need to find 2 variables denoted by beta that parameterize
the linear function f(.). An example of linear regression can be seen in
figure 4 above, where P = 5. The figure also shows the fitted linear
function with beta_0 = -90.798 and beta_1 = 0.046.
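As a quick sketch of how this looks in code, the snippet below solves the least-squares problem above on synthetic data shaped to resemble the example (the coefficients quoted above came from the article's own data, which is not reproduced here):

```python
# Solve the least-squares problem above directly with a linear solver.
# The data here is synthetic and only illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(2001, 2013).astype(float)
y = 0.046 * x - 90.798 + rng.normal(0, 0.01, x.size)    # millions, synthetic

A = np.column_stack([np.ones_like(x), x])                # design matrix [1, x]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)             # [beta_0, beta_1]
print(beta)                                              # close to (-90.798, 0.046)
```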
2. Polynomial Regression
Linear regression assumes that the relationship between the dependent (y)
and independent (x) variables is linear. It fails to fit the
data points when the relationship between them is not linear.
Polynomial regression expands the fitting capabilities of linear
regression by fitting a polynomial of degree m to the data points
instead. The richer the function under consideration, the better (in
general) its fitting capabilities. Mathematically speaking, polynomial
regression solves the following problem.
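Reconstructing the missing equation from the description below (a polynomial of degree m with m+1 parameters beta_0, ..., beta_m):

\min_{\beta_0, \ldots, \beta_m} \sum_{i=1}^{P} \Big(y_i - \sum_{j=0}^{m} \beta_j x_i^{\,j}\Big)^2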
Hence we need to find (m+1) variables denoted by beta_0, …,beta_m.
It can be seen that linear regression is a special case of polynomial
regression with degree m = 1.
Consider the following set of data points plotted as a scatter plot. If we
use linear regression, we get a fit that clearly fails to capture the data
points. But if we use polynomial regression with degree 6, we get a much
better fit, as shown below.
[Left] Scatter plot of data — [Center] Linear regression on data — [Right] Polynomial regression of
degree 6
Since the data points did not have a linear relationship between
the dependent and independent variables, linear regression failed to
estimate a good fitting function. On the other hand, polynomial
regression was able to capture the non-linear relationship.
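A minimal sketch of this comparison, assuming synthetic non-linear data in place of the scatter plot above and using scikit-learn's polynomial feature map:

```python
# Compare a plain linear fit with a degree-6 polynomial fit on non-linear data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.1, x.shape[0])   # clearly non-linear relationship

linear_fit = LinearRegression().fit(x, y)
poly_fit = make_pipeline(PolynomialFeatures(degree=6), LinearRegression()).fit(x, y)

# The polynomial model tracks the curvature far better than the line.
print("linear R^2:  ", round(linear_fit.score(x, y), 3))
print("degree-6 R^2:", round(poly_fit.score(x, y), 3))
```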
3. Ridge Regression
Ridge regression addresses the issue of overfitting in regression
analysis. To understand this, consider the same example as above. When a
polynomial of degree 25 is fit to the data with 10 training points, it fits
the red data points perfectly (center figure below). But in doing so, it
compromises the points in between (note the spike between the last two data
points). Ridge regression tries to address this issue: it tries to minimize
the generalization error by compromising the fit on the training points.
[Left] Scatter plot of data — [Center] Polynomial regression of degree 25 — [Right] Polynomial Ridge
regression of degree 25
Mathematically speaking, ridge regression solves the following
problem by modifying the loss function.
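The equation is not reproduced here; a standard formulation consistent with the description that follows is

\min_{\beta} \sum_{i=1}^{P} \big(y_i - f_{\beta}(x_i)\big)^2 + \alpha \lVert \beta \rVert_2^2, \qquad \alpha > 0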
The function f(x) can either be linear or polynomial. Without ridge
regularization, when the function overfits the data points, the learned
weights tend to be quite large. Ridge regression avoids overfitting by
limiting the norm of the learned weights, adding the scaled L2 norm of the
weights (beta) to the loss function. Hence the trained model trades off
between fitting the data points perfectly (a large norm of the learned
weights) and keeping the norm of the weights small. The
scaling constant alpha > 0 is used to control this trade-off. A small value
of alpha will result in higher norm weights and overfitting the training
data points. On the other hand, a large alpha value will result in a
function with a poor fit to the training data points but a very small
norm of the weights. Choosing the value of alpha carefully will yield the
best trade-off.
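A minimal sketch of this trade-off, assuming synthetic data with 10 training points and a degree-25 polynomial feature map (the alpha value is illustrative):

```python
# Compare an unregularized degree-25 polynomial fit with a ridge-regularized one.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10).reshape(-1, 1)                  # only 10 training points
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 10)

degree = 25
plain = make_pipeline(PolynomialFeatures(degree), StandardScaler(), LinearRegression()).fit(x, y)
ridge = make_pipeline(PolynomialFeatures(degree), StandardScaler(), Ridge(alpha=0.1)).fit(x, y)

# Ridge trades a slightly looser fit on the training points for a smaller weight norm.
print("weight norm, no ridge:", float(np.linalg.norm(plain[-1].coef_)))
print("weight norm, ridge:   ", float(np.linalg.norm(ridge[-1].coef_)))
print("train R^2, no ridge:", plain.score(x, y), " ridge:", ridge.score(x, y))
```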
4. LASSO regression
LASSO regression is similar to Ridge regression as both of them are
used as regularizers against overfitting on the training data points. But
LASSO comes with an additional benefit. It enforces sparsity on the
learned weights.
Ridge regression enforces the norm of the learned weights to be small
yielding a set of weights where the total norm is reduced. Most of the
weights (if not all) will be non-zero. LASSO on the other hand tries to
find a set of weights by driving most of them to zero or very close to zero.
This yields a sparse set of weights whose implementation can be much more
energy-efficient than a non-sparse one, while maintaining similar accuracy
in terms of fitting the data points.
The figure below tries to visualize this idea on the same example as
above. The data points are fit using both Ridge and LASSO regression, and
the corresponding fits and weights are plotted, with the weights sorted in
ascending order. It can be seen that most of the weights in the LASSO
regression are very close to zero.
Mathematically speaking, LASSO regression solves the following
problem by modifying the loss function.
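Reconstructing the missing equation, with the L1 norm in place of ridge regression's L2 norm:

\min_{\beta} \sum_{i=1}^{P} \big(y_i - f_{\beta}(x_i)\big)^2 + \alpha \lVert \beta \rVert_1, \qquad \alpha > 0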
The difference between LASSO and Ridge regression is that LASSO
uses the L1 norm of the weights instead of the L2 norm. This L1 norm
in the loss function tends to increase sparsity in the learned weights.
The constant alpha>0 is used to control the tradeoff between the fit
and the sparsity in the learned weights. A large value of alpha results in
poor fit but a sparser learned set of weights. On the other hand, a small
value of alpha results in a tight fit on training data points (might lead
to over-fitting) but with a less sparse set of weights.
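A minimal sketch of the sparsity effect, assuming synthetic data and illustrative alpha values (scikit-learn's Lasso scales its penalty slightly differently, but the qualitative behavior is the same):

```python
# Compare how many weights Ridge vs LASSO leave non-zero on the same features.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 30)

features = make_pipeline(PolynomialFeatures(degree=15, include_bias=False), StandardScaler())
X = features.fit_transform(x)

ridge = Ridge(alpha=0.1).fit(X, y)
lasso = Lasso(alpha=0.01, max_iter=50_000).fit(X, y)

# LASSO typically drives many coefficients exactly to zero; Ridge does not.
print("non-zero ridge weights:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
print("non-zero lasso weights:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```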
5. ElasticNet Regression
ElasticNet regression is a combination of Ridge and LASSO regression.
The loss term includes both the L1 and L2 norm of the weights with
their respective scaling constants. It is often used to address the
limitations of LASSO regression, such as its penalty not being strictly
convex. ElasticNet adds a quadratic penalty on the weights, which makes the
overall loss strictly convex and hence gives it a unique minimum.
Mathematically speaking, ElasticNet regression solves the following
problem by modifying the loss function.
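Reconstructing the missing equation, with alpha_1 and alpha_2 denoting the two scaling constants mentioned above:

\min_{\beta} \sum_{i=1}^{P} \big(y_i - f_{\beta}(x_i)\big)^2 + \alpha_1 \lVert \beta \rVert_1 + \alpha_2 \lVert \beta \rVert_2^2, \qquad \alpha_1, \alpha_2 > 0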
6. Bayesian Regression
For the types of regression discussed above (the frequentist approach), the
goal is to find a set of deterministic values of the weights (beta) that
explain the data. In Bayesian regression, rather than finding one value for
each weight, we try to find the distribution of these weights, assuming a
prior.
So we start off with an initial distribution of the weights and, based on
the data, nudge the distribution in the right direction by making use of
Bayes' theorem, which relates the prior distribution to the posterior
distribution through the likelihood and the evidence.
When we have infinitely many data points, the posterior distribution of the
weights becomes an impulse at the ordinary least squares solution, i.e., its
variance approaches zero.
Finding the distribution of the weights instead of a single set of
deterministic values serves two purposes:
1. It naturally guards against the issue of overfitting, hence acting as a
regularizer.
2. It provides a confidence measure and a range for the weights, which makes
more logical sense than just returning one value.
Let us mathematically formulate the problem and state its solution. Let us
assume a Gaussian prior on the weights with mean μ and covariance Σ, i.e.,
β ∼ N(μ, Σ).
Based on the available data D, we update this distribution. For the
problem at hand, the posterior will be a Gaussian distribution with the
following parameters.
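The parameters from the original post are not reproduced here. Assuming a linear model y = Xβ + ε with Gaussian noise ε ∼ N(0, σ²I) (the noise variance σ² is an assumption not stated explicitly above) and the prior β ∼ N(μ, Σ), the standard conjugate-Gaussian result is

\Sigma_{\text{post}} = \Big(\Sigma^{-1} + \tfrac{1}{\sigma^2} X^{\top} X\Big)^{-1}, \qquad \mu_{\text{post}} = \Sigma_{\text{post}} \Big(\Sigma^{-1}\mu + \tfrac{1}{\sigma^2} X^{\top} y\Big)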
7. Logistic Regression
Logistic regression comes in handy in classification tasks where the output
needs to be the conditional probability of the output class given the input.
Mathematically speaking, logistic regression solves the following
problem.
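The equation is missing here; a standard formulation consistent with the sigmoid-on-top-of-f_beta description below is the cross-entropy minimization

\min_{\beta} \sum_{i=1}^{P} \Big[-y_i \log \sigma\big(f_{\beta}(x_i)\big) - (1 - y_i) \log\big(1 - \sigma(f_{\beta}(x_i))\big)\Big], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

where y_i ∈ {0, 1} and σ(f_β(x_i)) models the conditional probability that y_i = 1 given x_i.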
Consider the following example where the data points belong to one of
the two categories: {0 (red), 1 (yellow)} as shown in the scatter plot
below.
[Left] Scatter plot of data points — [Right] Logistic regression trained on
the data points, plotted in blue
Logistic regression uses a sigmoid function at the output of the linear or
polynomial function to map the output from (−∞, ∞) to (0, 1). A threshold
(usually 0.5) is then used to categorize the test data into one of the two
categories.
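A minimal sketch on synthetic one-dimensional data with labels {0, 1}, standing in for the red/yellow points above:

```python
# Fit logistic regression and inspect the sigmoid output and thresholded class.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.concatenate([np.zeros(50), np.ones(50)])     # 0 = red, 1 = yellow

clf = LogisticRegression().fit(X, y)

# predict_proba applies the sigmoid to the linear score, mapping it into (0, 1);
# predict then applies the usual 0.5 threshold.
print(clf.predict_proba([[0.2]]))
print(clf.predict([[0.2]]))
```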
It may seem that logistic regression is not regression but a classification
algorithm; however, that is not the case. You can find more about it in
Adrian's post.
https://towardsdatascience.com/a-beginners-guide-to-regression-analysis-in-machine-learning-8a828b491bbf