Exercise 4 : Linear Regression
Essential Libraries
Let us begin by importing the essential Python Libraries.
NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
Setup : Import the Dataset
Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
houseData = pd.read_csv('train.csv')
houseData.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub ...        0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]
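Beyond head(), it can also help to check the overall size of the dataset and which columns have missing values. A minimal sketch using standard Pandas calls:
# Dimensions of the dataset : (rows, columns)
print(houseData.shape)
# Columns with the most missing values
print(houseData.isnull().sum().sort_values(ascending=False).head(10))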
Problem 1 : Predicting SalePrice using GrLivArea
Extract the required variables from the dataset, as mentioned in the problem.
houseGrLivArea = pd.DataFrame(houseData['GrLivArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
Plot houseSalePrice against houseGrLivArea using a standard jointplot.
trainDF = pd.concat([houseGrLivArea, houseSalePrice], axis = 1).reindex(houseGrLivArea.index)
sb.jointplot(data=trainDF, x='GrLivArea', y='SalePrice', height = 12)
<seaborn.axisgrid.JointGrid at 0x25b36d1db50>
Import the LinearRegression model from sklearn.linear_model.
# Import essential models and functions from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Create a Linear Regression object
linreg = LinearRegression()
Prepare both DataFrames by splitting them into Train and Test sets:
a Train Set with 1100 samples and a Test Set with 360 samples.
# Split the dataset into Train and Test
#houseGrLivArea_train = pd.DataFrame(houseGrLivArea[:1100])
#houseGrLivArea_test = pd.DataFrame(houseGrLivArea[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])
# Split the Dataset into Train and Test
houseGrLivArea_train, houseGrLivArea_test, houseSalePrice_train, houseSalePrice_test = train_test_split(
    houseGrLivArea, houseSalePrice, test_size = 360/1460)
# Check the sample sizes
print("Train Set :", houseGrLivArea_train.shape,
houseSalePrice_train.shape)
print("Test Set :", houseGrLivArea_test.shape,
houseSalePrice_test.shape)
Train Set : (1100, 1) (1100, 1)
Test Set : (360, 1) (360, 1)
Fit the Linear Regression model on houseGrLivArea_train and houseSalePrice_train.
linreg.fit(houseGrLivArea_train, houseSalePrice_train)
LinearRegression()
Visual Representation of the Linear Regression Model
Check the coefficients of the Linear Regression model you just fit.
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
Intercept : b = [16608.46906887]
Coefficients : a = [[108.93750382]]
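In other words, the fitted model in this run is approximately SalePrice ≈ 16608.47 + 108.94 × GrLivArea; the exact numbers will vary from run to run because the Train-Test split is random.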
Plot the regression line using the intercept and coefficient obtained above.
# Formula for the Regression line
# Alternative code for below: regline_x = houseGrLivArea_train.to_numpy()
regline_x = houseGrLivArea_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x
# Plot the Linear Regression line
f = plt.figure(figsize=(16, 8))
plt.scatter(houseGrLivArea_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), "r-", linewidth = 3)
plt.show()
Prediction of Response based on the Predictor
Predict SalePrice given GrLivArea in the Test dataset.
# Predict SalePrice values corresponding to GrLivArea
houseSalePrice_test_pred = linreg.predict(houseGrLivArea_test)
# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(houseGrLivArea_test, houseSalePrice_test, color = "green")
plt.scatter(houseGrLivArea_test, houseSalePrice_test_pred, color = "red")
plt.show()
Goodness of Fit of the Linear Regression Model
Check how good the predictions are on the Train Set.
Metric : Explained Variance or R^2 on the Train Set.
print("Explained Variance (R^2) \t:",
linreg.score(houseGrLivArea_train, houseSalePrice_train))
Explained Variance (R^2) : 0.5083530900052877
Check how good the predictions are on the Test Set.
Metric : Explained Variance or R^2 on the Test Set.
print("Explained Variance (R^2) \t:",
linreg.score(houseGrLivArea_test, houseSalePrice_test))
Explained Variance (R^2) : 0.4778441365777981
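The mean_squared_error function imported above can complement R^2; a minimal sketch using the Test Set predictions computed earlier:
# Mean Squared Error (MSE) of the predictions on the Test Set; lower is better
mse_test = mean_squared_error(houseSalePrice_test, houseSalePrice_test_pred)
print("Mean Squared Error (MSE) \t:", mse_test)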
You should also try the following
• Convert SalePrice to log(SalePrice) in the beginning and then use it for the Regression
  Code : houseSalePrice = pd.DataFrame(np.log(houseData['SalePrice']))
• Perform a random Train-Test split on the dataset before you start with the Regression (see the sketch after this list)
Note : Check the preparation notebook M3 LinearRegression.ipynb for the code
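A minimal sketch combining both suggestions, assuming the variable names used above; random_state = 42 is an arbitrary choice that simply makes the random split reproducible:
# Log-transform the response before Regression
houseSalePriceLog = pd.DataFrame(np.log(houseData['SalePrice']))
# Random, reproducible Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    houseGrLivArea, houseSalePriceLog, test_size = 360/1460, random_state = 42)
# Fit and score the model on the log scale
linregLog = LinearRegression()
linregLog.fit(X_train, y_train)
print("Explained Variance (R^2) on Test Set \t:", linregLog.score(X_test, y_test))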
Problem 2 : Predicting SalePrice using LotArea
Extract the required variables from the dataset, as mentioned in the problem.
housePredictor = pd.DataFrame(houseData['LotArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
trainDF = pd.concat([housePredictor, houseSalePrice], axis = 1).reindex(housePredictor.index)
sb.jointplot(data=trainDF, x='LotArea', y='SalePrice', height = 12)
<seaborn.axisgrid.JointGrid at 0x25b374c2e50>
Linear Regression on SalePrice vs Predictor
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression
# Split the dataset into Train and Test
#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])
# Split the Dataset into Train and Test
housePredictor_train, housePredictor_test, houseSalePrice_train, houseSalePrice_test = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)
# Create a Linear Regression object
linreg = LinearRegression()
# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)
LinearRegression()
Visual Representation of the Linear Regression Model
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
# Formula for the Regression line
# Alternative code for below: regline_x = housePredictor_train.to_numpy()
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x
# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth = 3)
plt.show()
Intercept : b = [162100.36069809]
Coefficients : a = [[1.83081318]]
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to LotArea
houseSalePrice_test_pred = linreg.predict(housePredictor_test)
# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color = "red")
plt.show()
Goodness of Fit of the Linear Regression Model
print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))
Explained Variance (R^2) on Train Set : 0.062378129532616566
Explained Variance (R^2) on Test Set : 0.08979920486099247
Problem 2 (continued) : Predicting SalePrice using TotalBsmtSF
Extract the required variables from the dataset, as mentioned in the problem.
housePredictor = pd.DataFrame(houseData['TotalBsmtSF'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
trainDF = pd.concat([housePredictor, houseSalePrice], axis = 1).reindex(housePredictor.index)
sb.jointplot(data=trainDF, x='TotalBsmtSF', y='SalePrice', height = 12)
<seaborn.axisgrid.JointGrid at 0x25b384d3a60>
Linear Regression on SalePrice vs Predictor
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression
# Split the dataset into Train and Test
#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])
# Split the Dataset into Train and Test
housePredictor_train, housePredictor_test, houseSalePrice_train, houseSalePrice_test = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)
# Create a Linear Regression object
linreg = LinearRegression()
# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)
LinearRegression()
Visual Representation of the Linear Regression Model
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
# Formula for the Regression line
# Alternative code for below: regline_x = housePredictor_train.to_numpy()
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x
# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth = 3)
plt.show()
Intercept : b = [66517.88779439]
Coefficients : a = [[107.22623803]]
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to TotalBsmtSF
houseSalePrice_test_pred = linreg.predict(housePredictor_test)
# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color = "red")
plt.show()
Goodness of Fit of the Linear Regression Model
print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))
Explained Variance (R^2) on Train Set : 0.37367019826026027
Explained Variance (R^2) on Test Set : 0.38123042547864716
Problem 2 (continued) : Predicting SalePrice using GarageArea
Extract the required variables from the dataset, as mentioned in the problem.
housePredictor = pd.DataFrame(houseData['GarageArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
trainDF = pd.concat([housePredictor, houseSalePrice], axis = 1).reindex(housePredictor.index)
sb.jointplot(data=trainDF, x='GarageArea', y='SalePrice', height = 12)
<seaborn.axisgrid.JointGrid at 0x25b39697280>
Linear Regression on SalePrice vs Predictor
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression
# Split the dataset into Train and Test
#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])
# Split the Dataset into Train and Test
housePredictor_train, housePredictor_test, houseSalePrice_train, houseSalePrice_test = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)
# Create a Linear Regression object
linreg = LinearRegression()
# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)
LinearRegression()
Visual Representation of the Linear Regression Model
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
# Formula for the Regression line
# Alternative code for below: regline_x = housePredictor_train.to_numpy()
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x
# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth = 3)
plt.show()
Intercept : b = [73619.02664658]
Coefficients : a = [[227.27199019]]
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to GarageArea
houseSalePrice_test_pred = linreg.predict(housePredictor_test)
# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color = "red")
plt.show()
Goodness of Fit of the Linear Regression Model
print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))
Explained Variance (R^2) on Train Set : 0.3949854232045428
Explained Variance (R^2) on Test Set : 0.37249611550040873
Extra : Predicting SalePrice using Multiple Variables
Extract the required variables from the dataset, and then perform Multi-Variate Regression.
housePredictor = pd.DataFrame(houseData[['GrLivArea','LotArea','TotalBsmtSF','GarageArea']])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
Linear Regression on SalePrice vs Predictor
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression
# Split the dataset into Train and Test
#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])
# Split the Dataset into Train and Test
housePredictor_train, housePredictor_test, houseSalePrice_train, houseSalePrice_test = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)
# Create a Linear Regression object
linreg = LinearRegression()
# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)
LinearRegression()
Coefficients of the Linear Regression Model
Note that you CANNOT visualize the model as a line on a 2D plot, as it is a multi-dimensional
surface.
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
Intercept : b = [-18973.0369382]
Coefficients : a = [[70.07255406  0.19864354 43.29137855 96.31552172]]
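With several predictors, it helps to pair each coefficient with its column name; a minimal sketch using the fitted model and the training DataFrame from above:
# Pair each coefficient with its predictor name (column order follows housePredictor_train)
coefficients = pd.Series(linreg.coef_[0], index = housePredictor_train.columns)
print(coefficients)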
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to the four predictors
houseSalePrice_train_pred = linreg.predict(housePredictor_train)
houseSalePrice_test_pred = linreg.predict(housePredictor_test)
# Plot the Predictions vs the True values
f, axes = plt.subplots(1, 2, figsize=(24, 12))
axes[0].scatter(houseSalePrice_train, houseSalePrice_train_pred, color = "blue")
axes[0].plot(houseSalePrice_train.to_numpy(), houseSalePrice_train.to_numpy(), 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(houseSalePrice_test, houseSalePrice_test_pred, color = "green")
axes[1].plot(houseSalePrice_test.to_numpy(), houseSalePrice_test.to_numpy(), 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()
Goodness of Fit of the Linear Regression Model
print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))
Explained Variance (R^2) on Train Set : 0.6503512298667333
Explained Variance (R^2) on Test Set : 0.6970077531306746
Interpretation and Discussion
Now that you have performed Linear Regression of SalePrice against the four variables GrLivArea, LotArea, TotalBsmtSF, and GarageArea, compare and contrast the Explained Variance (R^2) of the models to determine which one predicts SalePrice best. What do you think?
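One way to answer this systematically is to refit each uni-variate model on a common random split and tabulate the scores; a minimal sketch, with random_state = 42 chosen arbitrarily for reproducibility (values will differ slightly from the ones reported above, which used independent random splits):
# Compare the four uni-variate models on the same random Train-Test split
predictors = ['GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']
for column in predictors:
    X = pd.DataFrame(houseData[column])
    y = pd.DataFrame(houseData['SalePrice'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360/1460, random_state = 42)
    model = LinearRegression().fit(X_train, y_train)
    print(column, "\tTrain R^2 :", round(model.score(X_train, y_train), 4),
          "\tTest R^2 :", round(model.score(X_test, y_test), 4))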