Machine Learning - Multi Linear Regression Analysis
Machine Learning - Multi Linear Regression Analysis
In our previous tutorial, we explored the topic of Linear Regression Analysis which attempts to model the
relationship between two variables by fitting a linear equation to the observed data. In this simple regression
analysis, we have one explanatory variable and one dependent variable. However, what happens if we believe
there is more than one explanatory variable that impacts the dependent variable? How would we model this?
Welcome to the world of multiple regression analysis. In this type of model, we attempt to model the relationship
between multiple explanatory variables to a single dependent variable. While adding more variables allows us to
model more complex phenomenons there are also additional steps we must take to make sure our model is
sound and robust.
In this tutorial, we will be performing a multiple regression analysis on South Korea's GDP growth. South Korea
in the 1950s came out of the Korean War, which left it's country ravaged and in extreme poverty. However, South
Korea would go through one most significant economic developments the World has seen, taking it from a
country in poverty to one of the top 15 economies in the World today.
Our goal is to be able to predict what the GDP growth rate will be in any year, given a few explanatory variables
that we will define below.
I will be explaining these assumptions in more detail as we arrive at each of them in the tutorial. At this point,
however, we need to have an idea of what they are.
Section One: Import our Libraries
The first thing we need to do is import the libraries we will be using in this tutorial. To visualize our data, we will
be using matplotlib and seaborn to create heatmaps and a scatter matrix. To build our model, we will be
using the sklearn library, and the evaluation will be taking place with the statsmodels library. I've also
added a few additional modules to help calculate certain metrics
In [1]: import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
%matplotlib inline
This dataset was downloaded from the World Bank website; if you would like to visit the site yourself, I
encourage you to visit the link provided below. There is a tremendous amount of data available for free, that can
be used across a wide range of models.
From here, we will set the index of our data frame using the set_index() function to the Year column. The
reasoning behind this is because it will make selecting the data easier. After we've defined the index, we convert
the entire data frame to a float data type and then select years 1969 to 2016 . These years were selected
because they do not contain any missing values.
To make selecting the columns a little easier, we will rename all of our columns. I'll create a dictionary where the
keys represent the old column names and the values associated with those keys are the new column names. I'll
then call the rename() method and pass through the new columns dictionary.
Finally, I'll check one last time for any missing values using isnull().any() , which will return true for a given
column if any values are missing, and then print the head of the data frame.
In [3]: # load the data and replace the '..' with nan
econ_df = pd.read_excel('korea_data.xlsx')
econ_df = econ_df.replace('..','nan')
# rename columns
econ_df = econ_df.rename(columns = column_names)
gdp_growth False
gross_capital_formation False
pop_growth False
birth_rate False
broad_money_growth False
final_consum_growth False
gov_final_consum_growth False
gross_cap_form_growth False
hh_consum_growth False
unemployment False
dtype: bool
'----------------------------------------------------------------------------
------------------------'
Year
What is multicollinearity?
One of the assumptions of our model is that there isn't any Perfect multicollinearity. Multicollinearity is where one
of the explanatory variables is highly correlated with another explanatory variable. In essence, one of the X
variables is almost perfectly correlated with another or multiple X variables.
Another way we can look at this problem is by using an analogy. Imagine we ask you to go to a concert and
determine who was the best singer. This task would become very challenging if you couldn't distinguish the two
singers because they are singing at the same volume. The idea is the same in our analysis, how can we
determine which variable is playing a role in our model if we can't distinguish the two? The problem is
we can't.
Now a little correlation is fine, but if it gets too high, we can effectively distinguish the two variables. The other
issue that arises is that when we have highly correlated exploratory variables is that, in a sense, we have
duplicates. This means that we can remove one of them and we haven't lost anything; the model would still
perform the same.
The first thing we can do is create a correlation matrix using the corr() function; this will create a matrix with
each variable having its correlation calculated for all the other variables. Keep in mind, if you travel diagonally
down the matrix all the associations should be one, as it is calculating the correlation of the variable with itself.
When we have multiple variables as we do, I sometimes prefer to use a correlation heatmap this way I can
quickly identify the highly correlated variables, by just looking for the darker colors.
In [27]: # calculate the correlation matrix
corr = econ_df.corr()
However, we should be more systematic in our approach to removing highly correlated variables. One method
we can use is the variance_inflation_factor which is a measure of how much a particular variable is
contributing to the standard error in the regression model. When significant multicollinearity exists, the
variance inflation factor will be huge for the variables in the calculation.
A general recommendation is that if any of our variables come back with a value of 5 or higher, then they
should be removed from the model. I decided to show you how the VFI comes out before we drop the highly
correlated variables and after we remove the highly correlated variables. Going forward in the tutorial we will only
be using the econ_df_after data frame.
In [28]: # define two data frames one before the drop and one after the drop
econ_df_before = econ_df
econ_df_after = econ_df.drop(['gdp_growth','birth_rate', 'final_consum_growth'
,'gross_capital_formation'], axis = 1)
# the VFI does expect a constant term in the data, so we need to add one using
the add_constant method
X1 = sm.tools.add_constant(econ_df_before)
X2 = sm.tools.add_constant(econ_df_after)
print('DATA AFTER')
print('-'*100)
display(series_after)
DATA BEFORE
-----------------------------------------------------------------------------
-----------------------
C:\Users\Alex\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: Fut
ureWarning: Method .ptp is deprecated and will be removed in a future versio
n. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)
const 314.550195
gdp_growth 9.807879
gross_capital_formation 2.430057
pop_growth 25.759263
birth_rate 26.174368
broad_money_growth 1.633079
final_consum_growth 2305.724583
gov_final_consum_growth 32.527332
gross_cap_form_growth 3.796420
hh_consum_growth 2129.093634
unemployment 2.800008
dtype: float64
DATA AFTER
-----------------------------------------------------------------------------
-----------------------
const 27.891150
pop_growth 1.971299
broad_money_growth 1.604644
gov_final_consum_growth 1.232229
gross_cap_form_growth 2.142992
hh_consum_growth 2.782698
unemployment 1.588410
dtype: float64
Looking at the data above we now get some confirmation about our suspicion. It makes sense to remove either
birth_rate or pop_growth and some of the consumption growth metrics. Once we remove those metrics
and recalculate the VFI, we get a passing grade and can move forward.
I also want to demonstrate another way to visualize our data to check for multicollinearity. Inside of pandas ,
there is a scatter_matrix chart that will create a scatter plot for each variable in our dataset against another
variable. This is a great tool for visualizing the correlation of one variable across all the other variables in the
dataset. I'll take my econ_df_after and pass it through the scatter_matrix method. What you're looking for is
a more random distribution, there shouldn't be any strong trends in the scatter matrix as this would be identifying
correlated variables. Now, for our explanatory variable, we want to see trends!
In [29]: # define the plot
pd.plotting.scatter_matrix(econ_df_after, alpha = 1, figsize = (30, 20))
# display it
desc_df
Out[30]:
gdp_growth gross_capital_formation pop_growth birth_rate broad_money_growth final_
Looking at the data frame up above, a few values are standing out, for example, the maximum value in the
broad_money_growth column is almost four standard deviations above the mean. Such an enormous value
would qualify as an outlier.
Imagine if we wanted to remove the values that have an amount exceeding three standard deviations. How
would we approach this? Well, if we leverage the numpy module and the scipy module we can filter out the
rows using the stats.zscore function. The Z-score is the number of standard deviations from the mean a data
point is, so if it's less than 3 we keep it otherwise we drop it. From here, I also provided a way to let us know
what rows were removed by using the index.difference the function which will show the difference between
the two datasets.
In [31]: # filter the data frame to remove the values exceeding 3 standard deviations
econ_remove_df = econ_df[(np.abs(stats.zscore(econ_df)) < 3).all(axis=1)]
After splitting the data, we will create an instance of the linear regression model and pass through the X_train
and y_train variables using the fit() function.
X = econ_df_after.drop('gdp_growth', axis = 1)
Y = econ_df_after[['gdp_growth']]
The intercept term is the value of the dependent variable when all the independent variables are equal to
zero. For each slope coefficient, it is the estimated change in the dependent variable for a one unit
change in that particular independent variable, holding the other independent variables constant.
For example, if all the independent variables were equal to zero, then the gdp_growth would be 2.08%. If we
looked at the gross_cap_form_growth while holding all the other independent variables constant, then we
would say for a 1 unit increase in gross_cap_form_growth would lead to a 0.14% increase in GDP growth.
We can also now make predictions with our newly trained model. The process is simple; we call the predict
method and then pass through some values. In this case, we have some values predefined with the x_test
variable so we will pass that through. Once we do that, we can select the predictions by slicing the array.
C:\Users\Alex\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: Fut
ureWarning: Method .ptp is deprecated and will be removed in a future versio
n. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)
Checking for Heteroscedasticity
What is Heteroscedasticity?
One of the assumptions of our model is that there is no heteroscedasticity. What exactly does this mean? Well,
to give a simple definition it merely means the standard errors of a variable, monitored over a specific amount of
time, are non-constant. Let's imagine a situation where heteroscedasticity could exist.
Imagine we modeled household consumption based on income, something we would probably notice is how the
variability of expenditures changes depending on how much income you have. In simple terms, we would see
that households with more income spend money on a broader set of items compared to lower income
households that would only be able to focus on the main staples. This results in standard errors that change over
income levels.
1. While heteroscedasticity does not cause bias in the coefficient estimates, it causes the coefficient
estimates to be less precise. The Lower precision increases the likelihood that the coefficient estimates
are further from the correct population value.
2. Heteroscedasticity tends to produce p-values that are smaller than they should be. This effect occurs
because heteroscedasticity increases the variance of the coefficient estimates, but the OLS procedure does
not detect this increase. Consequently, OLS calculates the t-values and F-values using an underestimated
amount of variance. This problem can lead you to conclude that a model term is statistically significant when
it is not significant.
The null hypothesis for both the White’s test and the Breusch-Pagan test is that the variances for the errors
are equal:
H0 = σ2i = σ2
The alternate hypothesis (the one you’re testing), is that the variances are not equal:
H1 = σ2i ≠ σ2
Our goal is to fail to reject the null hypothesis, have a high p-value because that means we have no
heteroscedasticity.
In [36]: # Run the White's test
_, pval, __, f_pval = diag.het_white(est.resid, est.model.exog, retres = False
)
print(pval, f_pval)
print('-'*100)
else:
print("For the White's Test")
print("The p-value was {:.4}".format(pval))
print("We reject the null hypthoesis, so there is heterosecdasticity. \n")
else:
print("For the Breusch-Pagan's Test")
print("The p-value was {:.4}".format(pval))
print("We reject the null hypthoesis, so there is heterosecdasticity.")
0.43365711028667386 0.509081191858663
-----------------------------------------------------------------------------
-----------------------
For the White's Test
The p-value was 0.4337
We fail to reject the null hypthoesis, so there is no heterosecdasticity.
0.25183646701201695 0.2662794557854012
-----------------------------------------------------------------------------
-----------------------
For the Breusch-Pagan's Test
The p-value was 0.2518
We fail to reject the null hypthoesis, so there is no heterosecdasticity.
Checking for Autocorrelation
What is autocorrelation?
Autocorrelation is a characteristic of data in which the correlation between the values of the same variables is
based on related objects. It violates the assumption of instance independence, which underlies most of
conventional models.
When you have a series of numbers, and there is a pattern such that values in the series can be predicted based
on preceding values in the series, the set of numbers is said to exhibit autocorrelation. This is also known as
serial correlation and serial dependence. It generally exists in those types of data-sets in which the data, instead
of being randomly selected, are from the same source.
That means we want to fail to reject the null hypothesis, have a large p-value because then it means we have no
autocorrelation. To use the Ljung-Box test, we will call the acorr_ljungbox function, pass through the
est.resid and then define the lags.
The lags can either be calculated by the function itself, or we can calculate them. If the function handles it the
max lag will be min((num_obs // 2 - 2), 40) , however, there is a rule of thumb that for non-seasonal time
series the lag is min(10, (num_obs // 5)) .
We also can visually check for autocorrelation by using the statsmodels.graphics module to plot a graph of
the autocorrelation factor.
In [37]: # test for autocorrelation
from statsmodels.stats.stattools import durbin_watson
# plot autocorrelation
sm.graphics.tsa.plot_acf(est.resid)
plt.show()
Visually what we are looking for is the data hugs the line tightly; this would give us confidence in our assumption
that the residuals are normally distributed. Now, it is highly unlikely that the data will perfectly hug the line, so this
is where we have to be subjective.
Mean Absolute Error (MAE): Is the mean of the absolute value of the errors. This gives an idea of
magnitude but no sense of direction (too high or too low).
Mean Squared Error (MSE): Is the mean of the squared errors. MSE is more popular than MAE because
MSE "punishes" more significant errors.
Root Mean Squared Error (RMSE): Is the square root of the mean of the squared errors. RMSE is even
more favored because it allows us to interpret the output in y-units.
Luckily for us, sklearn and statsmodel both contain functions that will calculate these metrics for us. The
examples below were calculated using the sklearn library and the math library.
MSE 0.707
MAE 0.611
RMSE 0.841
R-Squared
The R-Squared metric provides us a way to measure the goodness of fit or, in other words, how well our data fits
the model. The higher the R-Squared metric, the better the data fit our model. However, one limitation is that R-
Square increases as the number of features increase in our model, so if I keep adding variables even if they're
poor choices R-Squared will still go up! A more popular metric is the adjusted R-Square which penalizes
more complex models, or in other words models with more exploratory variables. In the example below, I
calculate the regular R-Squared value, however, the statsmodel summary will calculate the Adjusted R-
Squared below.
In [40]: model_r2 = r2_score(y_test, y_predict)
print("R2: {:.2}".format(model_r2))
R2: 0.86
Confidence Intervals
Let's look at our confidence intervals. Keep in mind that by default confidence intervals are calculated using 95%
intervals. We interpret confidence intervals by saying if the population from which this sample was drawn was
sampled 100 times. Approximately 95 of those confidence intervals would contain the "true" coefficient.
Why do we provide a confidence range? Well, it comes from the fact that we only have a sample of the
population, not the entire population itself. Because of this, it means that the "true" coefficient could exist in the
interval below or it couldn't, but we cannot say for sure. We provide some uncertainty by providing a range,
usually 95%, where the coefficient is probably in.
Out[41]:
0 1
Null Hypothesis: There is no relationship between the exploratory variables and the explanatory variable.
Alternative Hypothesis: There is a relationship between the exploratory variables and the explanatory
variable.
If we reject the null, we are saying there is a relationship, and the coefficients do not equal 0.
If we fail to reject the null, we are saying there is no relationship, and the coefficients do equal 0
Here it's a little hard to tell, but we have a few insignificant coefficients. The first is the constant itself, so
technically this should be dropped. However, we will see that once we remove the irrelevant variables that the
intercept becomes significant. If it still wasn't significant, we could have our intercept start at 0 and assume
that the cumulative effect of X on Y begins from the origin (0,0). Along with the constant, we have
unemployment and broad_money_growth both come out as insignificant.
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correc
tly specified.
The first thing we notice is that the p-values from up above are now easier to read and we can now determine
that the coefficients that have a p-value greater than 0.05 can be removed. We also have our 95% confidence
interval (described up above), our coefficient estimates (described up above), the standard errors, and t-values.
The other metric that stands out is our Adjusted R-Squared value which is .878, lower than our R-Squared value.
This makes sense as we were probably docked for the complexity of our model. However, an R-Squared over
.878 is still very strong.
The only additional metrics we will describe here is the t-value which is the coefficient divided by the standard
error. The higher the t-value, the more evidence we have to reject the null hypothesis. Also the standard error,
the standard error is the approximate standard deviation of a statistical sample population.
X = econ_df_after.drop('gdp_growth', axis = 1)
Y = econ_df_after[['gdp_growth']]
print(est.summary())
OLS Regression Results
=============================================================================
=
Dep. Variable: gdp_growth R-squared: 0.89
3
Model: OLS Adj. R-squared: 0.88
3
Method: Least Squares F-statistic: 89.9
4
Date: Sat, 27 Apr 2019 Prob (F-statistic): 2.61e-2
0
Time: 15:57:05 Log-Likelihood: -82.90
4
No. Observations: 48 AIC: 175.
8
Df Residuals: 43 BIC: 185.
2
Df Model: 4
Covariance Type: nonrobust
=============================================================================
==============
coef std err t P>|t| [0.0
25 0.975]
-----------------------------------------------------------------------------
--------------
const 1.9229 0.573 3.356 0.002 0.7
67 3.078
pop_growth 2.1704 0.477 4.546 0.000 1.2
08 3.133
gov_final_consum_growth -0.1889 0.087 -2.162 0.036 -0.3
65 -0.013
gross_cap_form_growth 0.1293 0.024 5.346 0.000 0.0
81 0.178
hh_consum_growth 0.4976 0.076 6.526 0.000 0.3
44 0.651
=============================================================================
=
Omnibus: 0.831 Durbin-Watson: 1.58
9
Prob(Omnibus): 0.660 Jarque-Bera (JB): 0.66
6
Skew: 0.282 Prob(JB): 0.71
7
Kurtosis: 2.882 Cond. No. 51.
9
=============================================================================
=
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correc
tly specified.
C:\Users\Alex\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: Fut
ureWarning: Method .ptp is deprecated and will be removed in a future versio
n. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)
Looking at the output, we now see that all of the independent variables are significant and even our
constant is significant. We could rerun our test for autocorrelation and, but the tests will take you to the same
conclusions we found above so I decided to leave that out of the tutorial. At this point, we can interpret our
formula and begin making predictions. Looking at the coefficents, we would say pop_growth ,
gross_cap_form_growth , and hh_consum_growth all have a positive effect on GDP growth. Additionally, we
would say that gov_final_consum_growth has a negative effect on GDP growth. That's a little surprising to
see, but we would have to see why that might be the case.
# load it back in
with open('my_mulitlinear_regression.sav', 'rb') as pickle_file:
regression_model_2 = pickle.load(pickle_file)
Out[46]: array([[7.6042968]])