INTRODUCTION TO LINEAR REGRESSION
Linear regression is a machine learning algorithm used to predict the value of
a dependent variable y, given the value of an independent variable x. It is
categorized into two main forms: simple linear regression and multiple
linear regression. The goal of learning in this algorithm is to determine the
values of the coefficients of the independent variables.
Assumptions of Linear Regression
The algorithm makes the following assumptions about the relationship between x
and y:
i That x and y have a linear relationship. This means that a unit change in x
results in a constant change in y.
ii That the observations are independent. This means that one observation
of x does not depend on another observation of x.
iii Homoscedasticity, which means that different samples of data drawn from
the same distribution have similar variance.
iv Normality, which requires that the error terms of the data be normally
distributed.
Simple Linear Regression
In simple linear regression, there is only one independent variable. This is
illustrated in the following equation.
y = a + bx
Here, a is known as the intercept (the value of y when x is zero), and b is
the coefficient of x.
Since linear regression is a supervised learning algorithm, for learning to occur,
the algorithm is provided with many examples of x and y (training data) which
it uses to calculate the values of a and b.
The values of the coefficients are solved using a method referred to as ordinary
least squares. In this method, given the dataset, the value of b is given by
the following equation:

b = cov(x, y) / var(x)
The value for a is given by
a = ȳ − bx̄
Where ȳ is the mean of y and x̄ is the mean of x .
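As a sketch of how these two formulas can be applied, the following Python snippet (the function name fit_simple is illustrative, not from the text) recovers the intercept and slope from data generated by the exact line y = 2 + 3x:

```python
import numpy as np

def fit_simple(x, y):
    """Ordinary least squares for simple linear regression:
    b = cov(x, y) / var(x), then a = mean(y) - b * mean(x)."""
    dx = x - x.mean()
    dy = y - y.mean()
    b = (dx * dy).mean() / (dx * dx).mean()  # cov(x, y) / var(x)
    a = y.mean() - b * x.mean()
    return a, b

# data generated from the exact line y = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 + 3 * x
a, b = fit_simple(x, y)
print(a, b)  # recovers the intercept 2.0 and slope 3.0
```

Because the data lies exactly on a line, the fitted a and b match the generating equation; on noisy data they would only approximate it.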
Mean Squared Error
Given the prediction yi′ of the fitted model for an input xi and the actual value
yi, the residual (error) for the example is given by ei = yi − yi′. The mean
squared error is the mean of the sum of squares of the residuals. It is given by
the formula below:

MSE = (1/n) (e1² + e2² + . . . + en²)
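As a quick illustration of the formula, using made-up actual and predicted values (not the example data below):

```python
import numpy as np

# actual values and model predictions (illustrative numbers)
y_actual = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 7.5])

residuals = y_actual - y_pred  # e_i = y_i - y_i'
mse = np.mean(residuals ** 2)  # mean of the squared residuals
print(mse)  # (0.25 + 0.0 + 0.25) / 3, about 0.1667
```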
Example 1. The following is a dataset of weight given the height of a per-
son. Calculate the linear regression coefficients, the predictions of the resulting
formula, and the mean squared error.
x = Height (feet)    y = Weight (kg)
5.4 51
5.9 61
5.7 69
5.6 64
5.5 65
Solution
The heights are first converted from feet to inches by multiplying by 12 (for
example, 5.4 feet × 12 = 64.8 inches).

x = Height (inches)  y = Weight (kg)  Prediction (a + bx)  Residual  Squared residual
64.8                 51               58.58                -7.58     57.47
70.8                 61               66.35                -5.35     28.64
68.4                 69               63.24                 5.76     33.14
67.2                 64               61.69                 2.31      5.34
66.0                 65               60.14                 4.86     23.67

Mean of x = 67.44, mean of y = 62
Cov(x, y) = 5.52
Var(x) = 4.2624
b = Cov(x, y) / Var(x) = 5.52 / 4.2624 ≈ 1.295
a = mean(y) − b × mean(x) = 62 − 1.295 × 67.44 ≈ −25.34
Sum of squared residuals ≈ 148.26
MSE = 148.26 / 5 ≈ 29.65
RMSE ≈ 5.45
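The hand calculation can be checked with a short numpy sketch; it uses the population (divide-by-n) covariance and variance, matching the formulas given earlier, so hand-rounded figures may differ slightly:

```python
import numpy as np

# heights in inches and weights in kg from Example 1
x = np.array([64.8, 70.8, 68.4, 67.2, 66.0])
y = np.array([51.0, 61.0, 69.0, 64.0, 65.0])

dx = x - x.mean()
dy = y - y.mean()
cov_xy = (dx * dy).mean()              # population covariance of x and y
var_x = (dx * dx).mean()               # population variance of x
b = cov_xy / var_x                     # slope
a = y.mean() - b * x.mean()            # intercept
mse = np.mean((y - (a + b * x)) ** 2)  # mean squared error of the fit
print(b, a, mse)
```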
Multiple Linear Regression
In multiple linear regression, there is more than one independent variable. This
is illustrated in the following equation.
y = a + b1 x1 + b2 x2 + . . . + bm xm
Here, a is the intercept (the value of y when all the x ’s are zero), and
b1 . . . bm are the coefficients. The ordinary least squares method is modified to
cater for the increased number of coefficients. The equation below yields a
system of equations that is used to calculate the coefficients b1 . . . bm, one
equation for each j = 1, . . . , m:

cov(y, xj) = b1 cov(xj, x1) + b2 cov(xj, x2) + . . . + bm cov(xj, xm)
The resulting system of equations can then be solved using the Gauss
method.
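A minimal numpy sketch of this procedure, using illustrative data generated from y = 1 + 2x1 − x2 (here np.linalg.solve plays the role of the Gauss method):

```python
import numpy as np

# illustrative data generated from y = 1 + 2*x1 - 1*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([1.0, 4.0, 3.0, 6.0])

Xc = X - X.mean(axis=0)  # centred independent variables
yc = y - y.mean()        # centred dependent variable

C = Xc.T @ Xc / len(y)   # C[j, i] = cov(x_j, x_i)
v = Xc.T @ yc / len(y)   # v[j] = cov(y, x_j)

b = np.linalg.solve(C, v)          # solve the system of equations
a = y.mean() - b @ X.mean(axis=0)  # intercept
print(a, b)  # recovers intercept 1 and coefficients [2, -1]
```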
Example 2. IMPLEMENTATION OF LINEAR REGRESSION USING PYTHON
You can implement linear regression in Python using the LinearRegression
class from the linear_model module of the scikit-learn package. This is
illustrated below.
Using the equation y = 4 + 2x1 + 3x2, we can generate the following data.
x1 x2 y
5 7 35
9 2 28
1 5 21
4 1 15
5 5 29
2 2 14
4 3 21
5 3 23
Our goal is to find out if our fitted model will have good approximations of
a, b1 and b2.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# define the data
x = np.array([[5, 7], [9, 2], [1, 5], [4, 1], [5, 5], [2, 2], [4, 3], [5, 3]])
y = np.array([35, 28, 21, 15, 29, 14, 21, 23])

# fit the model and print the coefficients
model = LinearRegression()
model.fit(x, y)
# the data was generated from y = 4 + 2*x1 + 3*x2, so we expect
# coefficients close to [2, 3] and an intercept close to 4
print("Coefficients:" + str(model.coef_))
print("Intercept:" + str(model.intercept_))

# predict a few values
x1 = np.array([[11, 7], [2, 4]])
y1 = model.predict(x1)
print(y1)

# calculate the MSE on the training data
y_pred = model.predict(x)
mse = mean_squared_error(y, y_pred)
print("Mean squared error:", mse)