Linear Regression
S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology
MA613 Data Mining
Introduction
Regression Task
Approximating function is a hyperplane
Inner Product
a = (1, −1, 0)^T , b = (2, 1, 6)^T
⟨a, b⟩ = a^T b = 1 ∗ 2 + (−1) ∗ 1 + 0 ∗ 6 = 1
Hyperplane
Equation of a line passing through the point (x, y ) ∈ R2
y = mx + c
y − mx − c = 0
w0 + w1 x + w2 y = 0
Equation of the hyperplane passing through the point
x_i = (x_{i1}, x_{i2}, . . . x_{in})^T :
w_0 + w_1 x_{i1} + w_2 x_{i2} + . . . + w_n x_{in} = w^T x_i + w_0 = 0, where
w = (w_1, w_2, . . . w_n)^T
Hyperplane: Classification and Regression
Classification
Decision Boundary: w^T x_i + w_0 = 0
Regression
y_i = w^T x_i + w_0
Hyperplane: Regression
Set: Hyperplane
H = {x_i ∈ R^n : w^T x_i + w_0 = 0}
Divides the space into two halves
Linear Regression:
Formulation
Introductory Facts
Data:{(x1 , y1 ), (x2 , y2 )...(xN , yN )} , xi ∈ D ⊆ Rn and yi ∈ R.
Model (relation)
Let f : X → Y, where f(x_i) = w_0 + w^T x_i, be the function
that generates the data
f(x_i) is the model output, which is known as the predicted
value
y_i is the given output
Parameters
f(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + . . . + w_n x_{in}
By taking x_{i0} = 1,
f(x_i) = w_0 x_{i0} + w_1 x_{i1} + w_2 x_{i2} + . . . + w_n x_{in} = w^T x_i , where
x_i := (1, x_{i1}, x_{i2}, . . . x_{in})^T ∈ R^{n+1} and
w = (w_0, w_1, . . . , w_n)^T ∈ R^{n+1}
Here
w_0, w_1, . . . w_n are the unknown parameters
Model Output: System of Linear Equations
N data points, N predicted values, N output values
f(x_1) = w_0 x_{10} + w_1 x_{11} + w_2 x_{12} + . . . + w_n x_{1n} = y_1
f(x_2) = w_0 x_{20} + w_1 x_{21} + w_2 x_{22} + . . . + w_n x_{2n} = y_2
...
f(x_N) = w_0 x_{N0} + w_1 x_{N1} + w_2 x_{N2} + . . . + w_n x_{Nn} = y_N
Design Matrix
Define the design matrix to be
    X = ( x_{10}  x_{11}  . . .  x_{1n}
          x_{20}  x_{21}  . . .  x_{2n}
           ...     ...           ...
          x_{N0}  x_{N1}  . . .  x_{Nn} )
Output Vector
y = (y_1, y_2, . . . , y_N)^T
Matrix Representation
Xw = y
where X : R^{n+1} → R^N
Range Space
Let R(X) = {y′ ∈ R^N : y′ = Xw, w ∈ R^{n+1}} be the range space of X.
Is R(X) a subspace of R^N ?
Vector Space
V1 = {x ∈ R^n : Ax = 0}
V2 = {x ∈ R^n : Ax = b, b ≠ 0}
Vector Space
A vector space over a field K is a nonempty set V on which two
operations, vector addition and scalar multiplication, are defined
such that the following conditions are satisfied ∀ u, v, w ∈ V :
Closed under vector addition: u + v ∈ V
Associative under vector addition:
(u + v ) + w = u + (v + w)
Commutative under vector addition: u + v = v + u
Existence of additive identity: ∃ 0 ∈ V , such that 0 + u = u
Existence of additive inverse: ∃s ∈ V , such that u + s = 0
Closed under scalar multiplication: ∀α ∈ K , αv ∈ V
Associative under scalar multiplication:
α(βv ) = (αβ)v , α, β ∈ K
Distributivity of scalar multiplication with respect to vector
and field addition: α(u + v) = αu + αv,
(α + β)u = αu + βu, α, β ∈ K
Identity element of scalar multiplication: 1u = u, 1 ∈ K
Subspace
A subset S of a vector space V is called a subspace if it is
itself a vector space.
Equivalently: if x, y ∈ S, then αx + βy ∈ S, α, β ∈ K
Basis
Let V be a vector space and v1 , v2 , . . . vn ∈ V . A linear
combination of v1 , v2 , . . . vn is the vector
α1 v1 + α2 v2 + . . . αn vn where α1 , α2 , . . . αn ∈ K .
Let S be a nonempty subset of V . Then the set of all linear
combinations of elements of S is called the span of S, and
is denoted by Span(S). Span(S) = { Σ_i α_i v_i : α_i ∈ K , v_i ∈ S }
{v1 , v2 , . . . vn } is linearly independent iff
α1 v1 + · · · + αn vn = 0 implies α1 = α2 = . . . αn = 0
S spans V if Span(S) = V
A linearly independent subset of V that spans V is called a
basis of V
The number of elements in a basis of V is called the
dimension of V .
A vector space V is called finite dimensional if it has a
finite basis. Otherwise V is called infinite dimensional.
For a finite dimensional vector space V , any two bases for
V have the same number of vectors.
Properties of Basis
Theorem
If a set V′ = {v_1, v_2, . . . v_n} is a basis of V , then every element in
V can be uniquely expressed as a linear combination of
elements of V′.
Proof.
Given V′ is a basis of V . Suppose the expression of some v ∈ V
using the elements of V′ is not unique, say
v = α_1 v_1 + α_2 v_2 + . . . + α_n v_n = β_1 v_1 + β_2 v_2 + . . . + β_n v_n
Then (α_1 − β_1)v_1 + (α_2 − β_2)v_2 + · · · + (α_n − β_n)v_n = 0.
As {v_1, v_2, . . . v_n} is a basis, its elements are linearly independent,
so α_i = β_i , i = 1, 2, . . . n. Hence the theorem.
Properties of Basis
Theorem
If every element in V can be uniquely expressed as a linear
combination of elements in V ′ = {v1 , v2 , . . . vn }, then V ′ is a
basis of V .
Proof.
Given that every element in V can be uniquely expressed as a linear
combination of elements of V′. To prove V′ is a basis. As V is a
vector space, 0 ∈ V . Therefore 0 = 0 ∗ v_1 + 0 ∗ v_2 + . . . + 0 ∗ v_n .
Let α_1 v_1 + α_2 v_2 + . . . + α_n v_n = 0. As the expression is unique,
α_i = 0 ∀i. Therefore V′ consists of linearly independent
elements that span V , and hence is a basis.
Properties of Basis
A nonempty subset S of a vector space V is a basis of V iff
every element of V can be expressed in a unique way as a
linear combination of elements of S.
Linear Regression:
Formulation
Range Space
Theorem
R(X ) is a subspace of RN .
Proof.
Let y1 , y2 ∈ R(X ). To prove αy1 + βy2 ∈ R(X ), α, β ∈ R. Now
y1 = Xw ′ , y2 = Xw ′′ , w ′ , w ′′ ∈ Rn+1 . Therefore
αy1 + βy2 = αXw ′ + βXw ′′ = X (αw ′ + βw ′′ ) = Xw, where w =
αw ′ + βw ′′ ∈ Rn+1 . This means αy1 + βy2 ∈ R(X ). Hence
R(X ) is a subspace of RN .
Range Space: Representation
Let y′ ∈ R(X). Then ∃ w ∈ R^{n+1} such that Xw = y′ :

    y′ = w_0 (1, 1, . . . , 1)^T + w_1 (x_{11}, x_{21}, . . . , x_{N1})^T + . . . + w_n (x_{1n}, x_{2n}, . . . , x_{Nn})^T

that is, y′ is a linear combination of the columns of X .
Question
Using N=5, n=3 express y ′ .
Question
If a set S spans a vector space V , then the dimension of V
1. is equal to the number of elements in S
2. is less than or equal to the number of elements in S
3. is greater than or equal to the number of elements in S
4. can be greater than or less than the number of elements in S
Dimension of Range Space
Theorem
dim(R(X )) ≤ n + 1
Proof.
Let S = {v_0, v_1, v_2, . . . v_n} be the set of column vectors of X . For every
y′ ∈ R(X) ∃ w = (w_0, w_1, . . . w_n)^T ∈ R^{n+1} such that y′ =
w_0 v_0 + w_1 v_1 + . . . + w_n v_n . Therefore S ⊆ R(X) and R(X) is
spanned by the columns of X . Hence the dimension of
R(X) (dim(R(X))) equals the number of linearly
independent columns of X , so dim(R(X)) ≤ n + 1.
Conditions: R(X )
X : Rn+1 → RN such that Xw = y . X can be
Not 1-1,
Not onto
1-1
Onto
X is 1-1
If X is not 1-1, ∃ y′ ∈ R(X) such that Xw = y′ has more
than one solution.
If X is 1-1, for every y ′ ∈ R(X ), there exists a unique
w ∈ Rn+1 such that Xw = y ′ . That is every y ′ ∈ R(X ) can
be uniquely represented by elements of S.
X is 1-1
If X is 1-1, then the dimension of R(X ) is n + 1.
Proof.
As X is 1-1, every element in R(X ) has a unique preimage and
hence it can be uniquely expressed using elements in S
(column vectors of X ). Therefore on the basis of previous
theorem, S is a basis and dimension of R(X ) is |S| = n + 1 .
X is onto
If X is onto, then the dimension of R(X ) is N.
Proof.
As X is onto, every element in RN has a preimage. Therefore
R(X ) = RN . Hence the result.
Methods to find the solution
Linear Regression: Matrix Equation
Solve
Xw = y , X : Rn+1 → RN
X and y are given
w is an unknown parameter
Characteristics of X
The matrix X can be of three types
N =n+1
n+1<N
N <n+1
N =n+1
X is a square matrix. The solution exists if X^{-1} exists, that
is, X is 1-1 and onto. Therefore the dimension of R(X) is
n + 1 = N
X^{-1} : R^N → R^{n+1}
n+1<N
R(X) is spanned by S, that is, by the column vectors of X .
Therefore the dimension of R(X) ≤ n + 1. As n + 1 < N,
R(X) is a proper subset of R^N , that is R(X) ⊂ R^N . Hence
X is not onto.
Xw = y
If X is not onto, it is not guaranteed that y ∈ R(X), and if
y ∉ R(X), Xw = y has no solution. In such cases, find an
approximate solution, that is, find the solution of Xw′ = y′ where
y′ ∈ R(X) is such that y′ ≈ y. For finding such a y′, find the
projection of y onto R(X).
Norm of a Vector
a = (1, 2, 0)^T
⟨a, a⟩ = a^T a = 1 ∗ 1 + 2 ∗ 2 + 0 ∗ 0 = 5 = ||a||^2
Question
1 d(x, x ′ ): distance between x and x ′ .
2 Find d(x, x ′ ) where x = (1, 2, −1)T and x ′ = (−1, 2, 1)T
3 Find ||x − x ′ ||
Relationship between Distance and Norm
d(x, x ′ ) = ∥x − x ′ ∥
S = {10, 35, −10, 7}, x = 17
Find arg min_{s∈S} d(x, s)
Projection
Definition
The projection of y onto R(X), denoted P(y), is that vector in R(X)
which is at the smallest distance from y . That is,
P(y) = arg min_{y′ ∈ R(X)} d(y, y′) = arg min_{y′ ∈ R(X)} ||y − y′||
Projection vector is unique
P(y ) is the best approximation to y out of R(X )
Best Approximation
For every given x in R^m and every given subspace Y of R^m
there is a unique best approximation to x out of Y (namely, y =
Px, where P : R^m → Y is the projection of R^m onto Y ).
Projection of (0,3) onto {(x, y ) ∈ R2 : y = 0}
d((-1,0),(0,3))
d((1,0),(0,3))
Preimage of P(y )
When X is not onto, the matrix equation under
consideration is
Xw = P(y )
The preimage of P(y ) has to be found out
Projection
Find the minimum of a function
Notation: min_x f(x) or arg min_x f(x)
1. Find the minimum of f(x) = x^2 + 1, or arg min_x f(x)
2. Find arg min_x f(x) where f(x) = (x + 1)^2
3. Find arg min_x f(x) where f(x) = (1/2)(x + 1)^2
4. Find arg min_x f(x) where f(x) = x + 1
5. Find arg min_x f(x) where f(x) = x
6. Find arg min_x f(x) where f(x) = x^2
7. Find arg min_x f(x) where f(x) = x, x ≥ 0
8. Find arg min_x f(x) where f(x) = x^2 , x ≥ 0
Composite Function
gf (x)
f : D(x) → R(f )
g : R(f ) → R(g)
Question
f(x) = ∥x∥, g(x) = x^2 , x ∈ R. What is gf(x)? Is
g : R(f) → R(g) a monotonically increasing function?
f(x) = x, g(x) = x^2 , x ∈ R. What is gf(x)? Is
g : R(f) → R(g) a monotonically increasing function?
Finding min_{x∈R^m} f(x) is equivalent to finding min_{x∈R^m} gf(x) if g is a
monotonically increasing function defined on the range of f .
Proof.
Let x∗ be a minimizer of f . This means f(x∗) ≤ f(x) ∀x ∈ R^m .
As g is a monotonically increasing function defined on the
range of f , gf(x∗) ≤ gf(x) ∀x ∈ R^m . Therefore x∗ is a
minimizer of gf(x).
Cost Function
Cost Function: A function that is used to measure the
discrepancy between the given output and predicted
values.
Least Square Cost Function
Let w∗ be the pre-image of P(y). Then

    min_{y′ ∈ R(X)} ||y′ − y|| = min_{w ∈ R^{n+1}} ||Xw − y||

Also, minimizing ||Xw − y|| over w ∈ R^{n+1} is equivalent to minimizing
(1/2)||Xw − y||^2 , since g(t) = t^2/2 is monotonically increasing for t ≥ 0. Therefore

    w∗ = arg min_{w ∈ R^{n+1}} J(w)

where

    J(w) = (1/2)||Xw − y||^2

J(w) is called the least square cost function.

    J(w) = (1/2)||Xw − y||^2 = (1/2)(d(Xw, y))^2

Xw = (f(x_1), f(x_2), . . . f(x_N))^T (Prediction vector)
y = (y_1, y_2, . . . y_N)^T (Given output vector)

    J(w) = (1/2) Σ_{i=1}^N (f(x_i) − y_i)^2

J(w) is half the square of the Euclidean distance between the prediction
and output vectors.
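As a concrete check of the formula, here is a minimal NumPy sketch that computes J(w) for made-up toy values of X, y and a candidate w (all values are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical toy data: N = 3 points, n = 1 attribute, first column is x_i0 = 1
X = np.array([[1.0,  2.0],
              [1.0,  0.5],
              [1.0, -1.0]])
y = np.array([3.0, 1.5, -0.5])
w = np.array([0.5, 1.0])           # a candidate parameter vector

# J(w) = (1/2) ||Xw - y||^2 = (1/2) sum_i (f(x_i) - y_i)^2
residual = X @ w - y
J = 0.5 * residual @ residual
print(J)
```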
Gradient of a Vector
x = (x_1, x_2, . . . x_n)^T

    ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, . . . , ∂f(x)/∂x_n )^T

Find ∇(3x^2 + 2y + 5z)
Gradient of Inner Product
    ∂⟨a, b⟩/∂a = ∂(a^T b)/∂a = b

    ∂(a^T b)/∂b = a

    ∇_w ||w||^2 = ∂(w^T w)/∂w = w + w = 2w
J(w)
To find the minimum of J(w), ∇J(w) has to be found.

    J(w) = (1/2)||Xw − y||^2
         = (1/2)⟨Xw − y, Xw − y⟩
         = (1/2)[⟨Xw, Xw⟩ − ⟨Xw, y⟩ − ⟨y, Xw⟩ + ⟨y, y⟩]
         = (1/2)[w^T X^T X w − w^T X^T y − y^T X w + y^T y]

    ∇_w (w^T X^T X w) = X^T X w + X^T X w = 2 X^T X w
    ∇_w (w^T X^T y) = X^T y
    ∇_w (y^T X w) = X^T y

    ∇_w J = (1/2)(2 X^T X w − 2 X^T y) = X^T X w − X^T y
Optimal Solution
At the minimum value of w, ∇J = 0. That is,

    ∇J = X^T X w − X^T y = 0

Hence,

    X^T X w = X^T y

This is called the normal equation. Using this,

    w = (X^T X)^{-1} X^T y

The solution exists if (X^T X)^{-1} exists, that is, X is 1-1. If X is
1-1, then (X^T X)^{-1} X^T is a left inverse of X , as
(X^T X)^{-1} X^T X = I. It is also the pseudoinverse of X .
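A minimal NumPy sketch of solving the normal equation, assuming X^T X is invertible; the synthetic data and true parameter vector are illustrative only, and np.linalg.solve is used instead of forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n))])  # design matrix with x_i0 = 1
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)                  # given outputs (with small noise)

# Normal equation: X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The pseudoinverse gives the same solution when X is 1-1
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_hat, w_pinv))
```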
Existence of Solution
Question
{ ((1, 2)^T , 1), ((−2, 3)^T , −2), ((−1, 3)^T , −1), ((4, −1)^T , 3) }
Iterative Algorithms
For determining w using the derivative method, the inverse of
X^T X has to be found, which is not computationally efficient
for large data sets. Hence we resort to iterative algorithms.
An iterative search algorithm that minimizes J(w) starts
with an initial guess of w and then repeatedly changes w to
make J(w) smaller, until it converges to the value that
minimizes J(w).
Gradient Descent
Gradient Descent
The gradient vector can be interpreted as the "direction
and rate of fastest increase". If the gradient of a function is
non-zero at a point p, the direction of the gradient is the
direction in which the function increases most quickly from
p, and the magnitude of the gradient is the rate of increase
in that direction.
The size of the steps taken to reach the solution is called the
learning rate (step length).
Gradient Descent
Gradient descent: If a real valued function F(x) is defined
and differentiable in a neighbourhood of a point a, then F(x)
decreases fastest if one goes from a in the direction of the
negative gradient of F at a, −∇F(a).

    w_new = w_current − α [∇J(w)]_{w = w_current}

where α > 0 is called the step length.
Updation of w
For applying gradient descent, consider the following steps.
Choose an initial w = (w_0, w_1, . . . w_n)^T ∈ R^{n+1}. Then repeatedly
perform the update

    w := w − α ∇J

J is a function of w_0, w_1, . . . , w_n. Therefore,

    ∇J = ( ∂J/∂w_0, ∂J/∂w_1, . . . , ∂J/∂w_n )^T

    (w_0, w_1, . . . , w_n)^T := (w_0, w_1, . . . , w_n)^T − α ( ∂J/∂w_0, ∂J/∂w_1, . . . , ∂J/∂w_n )^T

    J(w) = (1/2) Σ_{i=1}^N (f(x_i) − y_i)^2 = (1/2) Σ_{i=1}^N (w^T x_i − y_i)^2

    ∇J(w) = Σ_{i=1}^N (w^T x_i − y_i) x_i

    w := w + α Σ_{i=1}^N (y_i − w^T x_i) x_i
Algorithm 1 Updation of w using Gradient Descent
Initialize the weight vector w
Choose a learning rate α
while not converged do
    w := w + α Σ_{i=1}^N (y_i − f(x_i)) x_i
end while
Algorithm 2 Updation of w: Gradient Descent
Initialize w
Iterate until convergence {
    w_j := w_j + α Σ_{i=1}^N (y_i − f(x_i)) x_{ij} , j = 0, 1, . . . n
}
Stopping Criteria
||w_new − w_current|| < ϵ
Batch Gradient Descent
For updating the parameter, the algorithm looks at every
data point in the training set at every step and hence it is
called batch gradient descent.
In general, gradient descent does not guarantee a global
minimum. Since J is a convex quadratic function, the
algorithm converges to the global minimum (assuming α is
not too large).
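A sketch of batch gradient descent using the update w := w + α Σ_i (y_i − w^T x_i) x_i; the learning rate, tolerance and iteration cap below are illustrative choices, not values prescribed in the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, eps=1e-6, max_iter=10000):
    """Minimize J(w) = 0.5 * ||Xw - y||^2 using the full data set at every step."""
    w = np.zeros(X.shape[1])                 # initial guess of w
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y)             # gradient: sum_i (w^T x_i - y_i) x_i
        w_new = w - alpha * grad
        if np.linalg.norm(w_new - w) < eps:  # stopping criterion ||w_new - w_current|| < eps
            return w_new
        w = w_new
    return w
```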
Stochastic Gradient Descent
The online version of gradient descent is called stochastic
gradient descent.
In contrast to batch gradient descent, stochastic gradient
descent processes only one training point at each step. Hence
when N becomes large, that is, for large data sets, stochastic
gradient descent is more computationally efficient than
batch gradient descent.
Algorithm 3 Updation of w using Stochastic Gradient Descent
Choose an initial weight vector w and learning rate α
while not converged do
for each i = 1, 2, . . . , N do
w := w + α(yi − f (xi ))xi
end for
Randomly shuffle the data
end while
Algorithm 4 Updation of w using Stochastic Gradient Descent
Choose an initial w and learning parameter α
Iterate until convergence {
    for i = 1, 2, . . . N {
        w_j := w_j + α (y_i − f(x_i)) x_{ij} , j = 0, 1, 2, . . . n
    }
    Randomly shuffle the data
}
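A corresponding sketch of stochastic gradient descent, processing one point per update and reshuffling after every pass; the fixed number of epochs is a hypothetical stopping rule standing in for the convergence test:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100, seed=0):
    """One training point per update; the data are reshuffled after every pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    idx = np.arange(X.shape[0])
    for _ in range(n_epochs):
        for i in idx:
            w = w + alpha * (y[i] - X[i] @ w) * X[i]  # w := w + alpha (y_i - f(x_i)) x_i
        rng.shuffle(idx)                              # randomly shuffle the data
    return w
```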
Hyperparameters and Parameters
Hyperparameters: Those whose values have to be given
before starting the algorithm. They play a critical role in
determining the performance of the algorithm.
α
Parameters: Those whose values are determined by
the algorithm.
w
N <n+1
As R(X) ⊆ R^N , the dimension of R(X) ≤ N < n + 1
S, the set of column vectors of X , spans R(X). |S| = n + 1 > N.
Therefore the elements of S are linearly dependent. Hence
the expression of elements of R(X) using S is not unique.
So X is not 1-1.
n+1>N
N <n+1
As there may be more than one w that satisfies the given
equation Xw = y, choose the solution with the lowest norm.
That is, the following constrained optimization problem has to
be considered:

    minimize_{w ∈ R^{n+1}} ||w||^2
    subject to Xw = y

For this to work, y should have at least one pre-image. Let
X be onto, that is, let the dimension of R(X) be N.
Constrained Optimization Problem: Equality
Constraints
Given functions f, g_i, i = 1, . . . m defined on a domain Ω ⊆ R^n ,

    minimize_{w ∈ Ω} f(w)
    subject to g_i(w) = 0, i = 1, 2, . . . m

    L(w, λ) = f(w) + Σ_{i=1}^m λ_i g_i(w)

where λ_i, i = 1, 2, . . . m are the Lagrangian parameters and L is
called the Lagrangian function.
Lagrangian Formulation
λ_1 (w^T x_1 − y_1)
...
λ_N (w^T x_N − y_N)

    Σ_{i=1}^N λ_i (w^T x_i − y_i), λ_i ∈ R
N <n+1: Lagrangian Formulation
By applying Lagrangian theory,

    L(w, λ) = ||w||^2 + λ^T (Xw − y)

where λ = (λ_1, λ_2, . . . λ_N)^T and λ_i, i = 1, 2, . . . N are the
Lagrangian parameters. By equating ∂L/∂w = 0,

    2w + X^T λ = 0

Hence

    w = − X^T λ / 2        (1)

By equating ∂L/∂λ = 0 we get

    Xw − y = 0             (2)

Using (1), the above equation becomes

    − X X^T λ / 2 = y

Therefore

    λ = −2 (X X^T)^{-1} y   (3)

Substituting (3) in (1),

    w = X^T (X X^T)^{-1} y

provided (X X^T)^{-1} exists, that is, X is onto. If the solution exists,
X^T (X X^T)^{-1} is a right inverse of X , as X X^T (X X^T)^{-1} = I. It is
also the pseudoinverse of X .
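A minimal sketch of this minimum-norm solution for an underdetermined system, assuming X X^T is invertible (that is, X is onto); the random data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_plus_1 = 3, 6                              # N < n + 1: more unknowns than equations
X = rng.normal(size=(N, n_plus_1))
y = rng.normal(size=N)

# w = X^T (X X^T)^{-1} y, the minimum-norm solution of Xw = y
w = X.T @ np.linalg.solve(X @ X.T, y)

print(np.allclose(X @ w, y))                    # the constraint Xw = y is satisfied
print(np.allclose(w, np.linalg.pinv(X) @ y))    # agrees with the pseudoinverse solution
```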
Overdetermined System: N > n + 1
Underdetermined System: N < n + 1
Overfitting and Underfitting
(Figure illustrating overfitting and underfitting, taken from Bishop's book.)
Performance Measure
Testing Points: {(x_{t1}, y_{t1}), (x_{t2}, y_{t2}), . . . (x_{tm}, y_{tm})}

    Mean Square Error = (1/m) Σ_{i=1}^m (f(x_{ti}) − y_{ti})^2
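The mean square error on a test set can be computed directly; Xt and yt below stand for a hypothetical test design matrix and test output vector:

```python
import numpy as np

def mean_square_error(w, Xt, yt):
    """MSE = (1/m) * sum_i (f(x_ti) - y_ti)^2 over the m testing points."""
    return np.mean((Xt @ w - yt) ** 2)
```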
Cross Validation
Performance of the model; optimal hyperparameters
Holdout Method
Random Subsampling or Monte Carlo Cross Validation
k-fold Cross Validation
Holdout Method
Randomly choose 70% of the data for training and the
remaining for testing
Develop the model using the training data
Check the performance using the testing data
For each value of the hyperparameter (e.g. 0.1, 0.2, . . . 1)
repeat the process and select the best value
If the performance of the model is good enough, take the
entire data and develop a single model
Random Subsampling or Monte Carlo Crossvalidation
Randomly choose 70% of the data for training and
remaining for testing
Develop the model using training data
Check the performance using testing data
Repeat the process m times
For each value of the hyperparameter (e.g. 0.1, 0.2, . . . 1)
repeat the process
If the performance of the model is good enough, take the
entire data and develop a single model
Algorithm 5 Random Subsampling or Monte Carlo Cross-
Validation
for each value of the hyperparameter do
for i = 1 to m do
Randomly select 70% of the data for training, and use the
remaining 30% for testing
Develop the model using the training data
Calculate the performance measure using the testing
data
end for
Calculate the average performance measure over all m iter-
ations
end for
Choose the hyperparameter that yields the best average per-
formance measure
if the model’s performance is satisfactory then
Train the final model on the entire dataset using the selected
hyperparameter
end if
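A sketch of random subsampling for choosing the learning rate α, reusing the batch_gradient_descent and mean_square_error helpers sketched earlier; the 70/30 split follows the slides, while m and the candidate α values passed in are illustrative:

```python
import numpy as np

def monte_carlo_cv(X, y, alphas, m=10, train_frac=0.7, seed=0):
    """Average the test MSE over m random splits for each candidate hyperparameter."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_train = int(train_frac * N)
    avg_mse = {}
    for alpha in alphas:
        errors = []
        for _ in range(m):
            perm = rng.permutation(N)
            tr, te = perm[:n_train], perm[n_train:]
            w = batch_gradient_descent(X[tr], y[tr], alpha=alpha)  # model from training data
            errors.append(mean_square_error(w, X[te], y[te]))      # performance on testing data
        avg_mse[alpha] = np.mean(errors)
    return min(avg_mse, key=avg_mse.get)  # hyperparameter with the best average MSE
```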
Cross Validation: k Fold Cross Validation
Divide the data into k folds
Training points: k-1 folds
Testing points: 1 fold
Each fold serves as the validation set exactly once
Algorithm 6 k fold Cross Validation
for each value of the hyperparameter do
Divide the dataset S into k mutually exclusive and exhaus-
tive folds (S1 , S2 , . . . , Sk )
for i = 1 to k do
Training set: S − Si ; Testing set: Si
Develop the model using the training data
Calculate the performance measure using the testing
data
end for
Calculate the average performance measure across all k
folds
end for
Choose the hyperparameter that gives the best average per-
formance measure
if the model’s performance is satisfactory then
Use the entire dataset S to develop the final model with
selected hyperparameters
end if
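A matching sketch of k-fold cross validation with the same helpers; k = 5 and the candidate α values are illustrative choices:

```python
import numpy as np

def k_fold_cv(X, y, alphas, k=5, seed=0):
    """Each fold serves as the validation set exactly once for every hyperparameter value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)  # S_1, ..., S_k
    avg_mse = {}
    for alpha in alphas:
        errors = []
        for i in range(k):
            te = folds[i]                                                # testing set: S_i
            tr = np.concatenate([folds[j] for j in range(k) if j != i])  # training set: S - S_i
            w = batch_gradient_descent(X[tr], y[tr], alpha=alpha)
            errors.append(mean_square_error(w, X[te], y[te]))
        avg_mse[alpha] = np.mean(errors)
    return min(avg_mse, key=avg_mse.get)
```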
Training, Validation & Testing
Set aside 20% of the data for testing
Apply random subsampling or k-fold cross validation on the
remaining data
Develop a model using the data set apart for training &
validation and apply it to the testing data
Repeat the process
Algorithm 7 Training, Validation & Testing
for t = 1 to T do
Randomly select 20% of the dataset (St ) for testing
Use the remaining 80% (S ′ ) as follows:
Apply cross-validation techniques on S ′ to determine
the optimal hyperparameter
Develop the model using S ′ with the chosen optimal
hyperparameter
Evaluate the model performance using the test set St
end for
In general, in the train-validation-test split, only one test set is
used, that is, T = 1. However, T > 1 provides deeper
insights, especially for models requiring extensive tuning.
Question
Apply cross validation
{(x1 , y1 ), (x2 , y2 ), (x3 , y3 ), (x4 , y4 ), (x5 , y5 )}
Normalization
Normalization is done to make the attribute values lie
in the same range so that no attribute dominates in
decision making.

    max min: x_{ik} := (x_{ik} − min(A_k)) / (max(A_k) − min(A_k))

    z score: x_{ik} := (x_{ik} − mean(A_k)) / (std. deviation(A_k))
xi = (xi1 , xi2 , . . . xin )T
Ak is the k th attribute of the data
Use the same min(Ak ) and max(Ak ) values computed from
the training data to transform the test data in the case of
max min
Use the same mean(Ak ) and std. deviation(Ak ) values
computed from the training data to transform the test data
in the case of z score
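A minimal sketch of both normalizations; the key point is that the statistics are computed from the training data only and then reused on the test data. The function names and array shapes (rows = points, columns = attributes) are assumptions for illustration:

```python
import numpy as np

def min_max_normalize(X_train, X_test):
    """Column-wise max-min scaling using statistics from the training data only."""
    col_min = X_train.min(axis=0)
    col_max = X_train.max(axis=0)
    scale = col_max - col_min
    return (X_train - col_min) / scale, (X_test - col_min) / scale

def z_score_normalize(X_train, X_test):
    """Column-wise z-score scaling using statistics from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std
```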
Question
Apply normalization:
A_k = (10, 25, 15)^T