Linear Regression
S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology
MA613 Data Mining
Introduction
Regression Task
Approximating function is a hyperplane
Inner Product
a = (1, −1, 0)^T , b = (2, 1, 6)^T
⟨a, b⟩ = a^T b = 1 ∗ 2 + (−1) ∗ 1 + 0 ∗ 6 = 1
Hyperplane
Equation of a line passing through the point (x, y ) ∈ R2
y = mx + c
y − mx − c = 0
w0 + w1 x + w2 y = 0
Equation of the hyperplane passing through the point
x_i = (x_{i1}, x_{i2}, . . . x_{in})^T :
w_0 + w_1 x_{i1} + w_2 x_{i2} + . . . + w_n x_{in} = w^T x_i + w_0 = 0, where
w = (w_1, w_2, . . . w_n)^T
Hyperplane: Classification and Regression
Classification
Decision Boundary: w^T x_i + w_0 = 0
Regression
y_i = w^T x_i + w_0
Hyperplane: Regression
Set: Hyperplane
H = {x_i ∈ R^n : w^T x_i + w_0 = 0}
Divides the space into two halves
Linear Regression:
Formulation
Introductory Facts
Data:{(x1 , y1 ), (x2 , y2 )...(xN , yN )} , xi ∈ D ⊆ Rn and yi ∈ R.
Model (relation)
Let f : X → Y, where f(x_i) = w_0 + w^T x_i, be the function
that generates the data
f(x_i) is the model output, which is known as the predicted
value
y_i is the given output
Parameters
f(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + . . . + w_n x_{in}
By taking x_{i0} = 1,
f(x_i) = w_0 x_{i0} + w_1 x_{i1} + w_2 x_{i2} + . . . + w_n x_{in} = w^T x_i , where
x_i := (1, x_{i1}, x_{i2}, . . . x_{in})^T ∈ R^{n+1} and
w = (w_0, w_1, . . . , w_n)^T ∈ R^{n+1}
Here
w_0, w_1, . . . w_n are the unknown parameters
Model Output: System of Linear Equations
N data points, N predicted values, N output values
f(x_1) = w_0 x_{10} + w_1 x_{11} + w_2 x_{12} + . . . + w_n x_{1n} = y_1
f(x_2) = w_0 x_{20} + w_1 x_{21} + w_2 x_{22} + . . . + w_n x_{2n} = y_2
...
f(x_N) = w_0 x_{N0} + w_1 x_{N1} + w_2 x_{N2} + . . . + w_n x_{Nn} = y_N
Design Matrix
Define the design matrix to be
    X = ( x_{10}  x_{11}  . . .  x_{1n}
          x_{20}  x_{21}  . . .  x_{2n}
           ...     ...           ...
          x_{N0}  x_{N1}  . . .  x_{Nn} )
Output Vector
y = (y_1, y_2, . . . , y_N)^T
Matrix Representation
Xw = y
where X : R^{n+1} → R^N
Range Space
Let R(X) = {y′ ∈ R^N : y′ = Xw, w ∈ R^{n+1}} be the range space of X.
Is R(X) a subspace of R^N ?
Vector Space
V1 = {x ∈ R^n : Ax = 0}
V2 = {x ∈ R^n : Ax = b, b ≠ 0}
Vector Space
A vector space over a field K is a nonempty set V on which two
operations, vector addition and scalar multiplication, are defined
such that the following conditions are satisfied ∀ u, v, w ∈ V :
Closed under vector addition: u + v ∈ V
Associative under vector addition:
(u + v ) + w = u + (v + w)
Commutative under vector addition: u + v = v + u
Existence of additive identity: ∃ 0 ∈ V , such that 0 + u = u
Existence of additive inverse: ∃s ∈ V , such that u + s = 0
Closed under scalar multiplication: ∀α ∈ K , αv ∈ V
Associative under scalar multiplication:
α(βv ) = (αβ)v , α, β ∈ K
Distributivity of scalar multiplication with respect to vector
and field addition: α(u + v) = αu + αv,
(α + β)u = αu + βu, α, β ∈ K
Identity element of scalar multiplication: 1u = u, 1 ∈ K
Subspace
A subset S of a vector space V is called a subspace if it is
itself a vector space.
Equivalently: if x, y ∈ S, then αx + βy ∈ S, α, β ∈ K
Basis
Let V be a vector space and v1 , v2 , . . . vn ∈ V . A linear
combination of v1 , v2 , . . . vn is the vector
α1 v1 + α2 v2 + . . . αn vn where α1 , α2 , . . . αn ∈ K .
Let S be a nonempty subset of V . Then the set of all linear
combinations of elements of S is called the span of S, and
is denoted by Span(S). Span(S) = { Σ_i α_i v_i : α_i ∈ K , v_i ∈ S }
{v1 , v2 , . . . vn } is linearly independent iff
α1 v1 + · · · + αn vn = 0 implies α1 = α2 = . . . αn = 0
S spans V if Span(S) = V
A linearly independent subset of V that spans V is called a
basis of V
The number of elements in a basis of V is called the
dimension of V .
A vector space V is called finite dimensional if it has a
finite basis. Otherwise V is called infinite dimensional.
For a finite dimensional vector space V , any two bases for
V have the same number of vectors.
Properties of Basis
Theorem
If a set V′ = {v_1, v_2, . . . v_n} is a basis of V , then every element in
V can be uniquely expressed as a linear combination of
elements of V′.
Proof.
Given V′ is a basis of V . Suppose the expression of some v ∈ V
using the elements of V′ is not unique, say
v = α_1 v_1 + α_2 v_2 + . . . + α_n v_n = β_1 v_1 + β_2 v_2 + . . . + β_n v_n
Then (α_1 − β_1)v_1 + (α_2 − β_2)v_2 + · · · + (α_n − β_n)v_n = 0.
As {v_1, v_2, . . . v_n} is a basis, its elements are linearly independent,
so α_i = β_i , i = 1, 2, . . . n. Hence the theorem.
Properties of Basis
Theorem
If every element in V can be uniquely expressed as a linear
combination of elements in V ′ = {v1 , v2 , . . . vn }, then V ′ is a
basis of V .
Proof.
Given that every element in V can be uniquely expressed as a linear
combination of elements of V′. To prove V′ is a basis. As V is a
vector space, 0 ∈ V . Therefore 0 = 0 ∗ v_1 + 0 ∗ v_2 + . . . + 0 ∗ v_n .
Let α_1 v_1 + α_2 v_2 + . . . + α_n v_n = 0. As the expression is unique,
α_i = 0 ∀i. Therefore V′ consists of linearly independent
elements that span V , and hence is a basis.
Properties of Basis
A nonempty subset S of a vector space V is a basis of V iff
every element of V can be expressed in a unique way as a
linear combination of elements of S.
Linear Regression:
Formulation
Range Space
Theorem
R(X ) is a subspace of RN .
Proof.
Let y1 , y2 ∈ R(X ). To prove αy1 + βy2 ∈ R(X ), α, β ∈ R. Now
y1 = Xw ′ , y2 = Xw ′′ , w ′ , w ′′ ∈ Rn+1 . Therefore
αy1 + βy2 = αXw ′ + βXw ′′ = X (αw ′ + βw ′′ ) = Xw, where w =
αw ′ + βw ′′ ∈ Rn+1 . This means αy1 + βy2 ∈ R(X ). Hence
R(X ) is a subspace of RN .
Range Space: Representation
Let y′ ∈ R(X). Then ∃ w ∈ R^{n+1} such that Xw = y′ :

    y′ = w_0 (1, 1, . . . , 1)^T + w_1 (x_{11}, x_{21}, . . . , x_{N1})^T + . . . + w_n (x_{1n}, x_{2n}, . . . , x_{Nn})^T

that is, y′ is a linear combination of the columns of X .
Question
Using N=5, n=3 express y ′ .
Question
If a set S spans a vector space V , then the dimension of V
1. is equal to the number of elements in S
2. is less than or equal to the number of elements in S
3. is greater than or equal to the number of elements in S
4. can be greater than or less than the number of elements in S
Dimension of Range Space
Theorem
dim(R(X )) ≤ n + 1
Proof.
Let S = {v_0, v_1, v_2, . . . v_n} be the set of column vectors of X . For every
y′ ∈ R(X) ∃ w = (w_0, w_1, . . . w_n)^T ∈ R^{n+1} such that y′ =
w_0 v_0 + w_1 v_1 + . . . + w_n v_n . Therefore S ⊆ R(X) and R(X) is
spanned by the columns of X . Hence the dimension of
R(X) (dim(R(X))) equals the number of linearly
independent columns of X , so dim(R(X)) ≤ n + 1.
Conditions: R(X )
X : Rn+1 → RN such that Xw = y . X can be
Not 1-1,
Not onto
1-1
Onto
X is 1-1
If X is not 1-1, ∃ y′ ∈ R(X) such that Xw = y′ has more
than one solution.
If X is 1-1, for every y ′ ∈ R(X ), there exists a unique
w ∈ Rn+1 such that Xw = y ′ . That is every y ′ ∈ R(X ) can
be uniquely represented by elements of S.
X is 1-1
If X is 1-1, then the dimension of R(X ) is n + 1.
Proof.
As X is 1-1, every element in R(X ) has a unique preimage and
hence it can be uniquely expressed using elements in S
(column vectors of X ). Therefore on the basis of previous
theorem, S is a basis and dimension of R(X ) is |S| = n + 1 .
X is onto
If X is onto, then the dimension of R(X ) is N.
Proof.
As X is onto, every element in RN has a preimage. Therefore
R(X ) = RN . Hence the result.
Methods to find the solution
Linear Regression: Matrix Equation
Solve
Xw = y , X : Rn+1 → RN
X and y are given
w is an unknown parameter
Characteristics of X
The matrix X can be of three types
N =n+1
n+1<N
N <n+1
N =n+1
X is a square matrix. The solution exists if X^{-1} exists, that
is, X is 1-1 and onto. Therefore the dimension of R(X) is
n + 1 = N
X^{-1} : R^N → R^{n+1}
n+1<N
R(X) is spanned by S, that is, by the column vectors of X .
Therefore the dimension of R(X) ≤ n + 1. As n + 1 < N,
R(X) is a proper subset of R^N , that is R(X) ⊂ R^N . Hence
X is not onto.
Xw = y
If X is not onto, it is not guaranteed that y ∈ R(X), and if
y ∉ R(X), Xw = y has no solution. In such cases, find an
approximate solution, that is, find the solution of Xw′ = y′ where
y′ ∈ R(X) is such that y′ ≈ y. For finding such a y′, find the
projection of y onto R(X).
Norm of a Vector
a = (1, 2, 0)^T
⟨a, a⟩ = a^T a = 1 ∗ 1 + 2 ∗ 2 + 0 ∗ 0 = 5 = ||a||^2
Question
1 d(x, x ′ ): distance between x and x ′ .
2 Find d(x, x ′ ) where x = (1, 2, −1)T and x ′ = (−1, 2, 1)T
3 Find ||x − x ′ ||
Relationship between Distance and Norm
d(x, x ′ ) = ∥x − x ′ ∥
S = {10, 35, −10, 7}, x = 17
Find arg min_{s∈S} d(x, s)
Projection
Definition
The projection of y onto R(X), denoted P(y), is that vector in R(X)
which is at the smallest distance from y . That is,
P(y) = arg min_{y′ ∈ R(X)} d(y, y′) = arg min_{y′ ∈ R(X)} ||y − y′||
Projection vector is unique
P(y ) is the best approximation to y out of R(X )
Best Approximation
For every given x in R^m and every given subspace Y of R^m
there is a unique best approximation to x out of Y (namely, y =
Px, where P : R^m → Y is the projection of R^m onto Y ).
Projection of (0,3) onto {(x, y ) ∈ R2 : y = 0}
d((-1,0),(0,3))
d((1,0),(0,3))
Preimage of P(y )
When X is not onto, the matrix equation under
consideration is
Xw = P(y )
The preimage of P(y ) has to be found out
Projection
Find the minimum of a function
Notation: min_x f(x) or arg min_x f(x)
1. Find the minimum of f(x) = x^2 + 1, or arg min_x f(x)
2. Find arg min_x f(x) where f(x) = (x + 1)^2
3. Find arg min_x f(x) where f(x) = (1/2)(x + 1)^2
4. Find arg min_x f(x) where f(x) = x + 1
5. Find arg min_x f(x) where f(x) = x
6. Find arg min_x f(x) where f(x) = x^2
7. Find arg min_x f(x) where f(x) = x, x ≥ 0
8. Find arg min_x f(x) where f(x) = x^2 , x ≥ 0
Composite Function
gf (x)
f : D(x) → R(f )
g : R(f ) → R(g)
Question
f(x) = ∥x∥, g(x) = x^2 , x ∈ R. What is gf(x)? Is
g : R(f) → R(g) a monotonically increasing function?
f(x) = x, g(x) = x^2 , x ∈ R. What is gf(x)? Is
g : R(f) → R(g) a monotonically increasing function?
Finding min_{x∈R^m} f(x) is equivalent to finding min_{x∈R^m} gf(x) if g is a
monotonically increasing function defined on the range of f .
Proof.
Let x∗ be a minimizer of f . This means f(x∗) ≤ f(x) ∀x ∈ R^m .
As g is a monotonically increasing function defined on the
range of f , gf(x∗) ≤ gf(x) ∀x ∈ R^m . Therefore x∗ is a
minimizer of gf(x).
Cost Function
Cost Function: A function that is used to measure the
discrepancy between the given output and predicted
values.
Least Square Cost Function
Let w∗ be the pre-image of P(y). Then

    min_{y′ ∈ R(X)} ||y′ − y|| = min_{w ∈ R^{n+1}} ||Xw − y||

Also, minimizing ||Xw − y|| over w ∈ R^{n+1} is equivalent to minimizing
(1/2)||Xw − y||^2 , since g(t) = t^2/2 is monotonically increasing for t ≥ 0. Therefore

    w∗ = arg min_{w ∈ R^{n+1}} J(w)

where

    J(w) = (1/2)||Xw − y||^2

J(w) is called the least square cost function.

    J(w) = (1/2)||Xw − y||^2 = (1/2)(d(Xw, y))^2

Xw = (f(x_1), f(x_2), . . . f(x_N))^T (Prediction vector)
y = (y_1, y_2, . . . y_N)^T (Given output vector)

    J(w) = (1/2) Σ_{i=1}^N (f(x_i) − y_i)^2

J(w) is half the square of the Euclidean distance between the prediction
and output vectors.
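As a concrete check of the formula, here is a minimal NumPy sketch that computes J(w) for made-up toy values of X, y and a candidate w (all values are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical toy data: N = 3 points, n = 1 attribute, first column is x_i0 = 1
X = np.array([[1.0,  2.0],
              [1.0,  0.5],
              [1.0, -1.0]])
y = np.array([3.0, 1.5, -0.5])
w = np.array([0.5, 1.0])           # a candidate parameter vector

# J(w) = (1/2) ||Xw - y||^2 = (1/2) sum_i (f(x_i) - y_i)^2
residual = X @ w - y
J = 0.5 * residual @ residual
print(J)
```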
Gradient of a Vector
x = (x_1, x_2, . . . x_n)^T

    ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, . . . , ∂f(x)/∂x_n )^T

Find ∇(3x^2 + 2y + 5z)
Gradient of Inner Product
    ∂⟨a, b⟩/∂a = ∂(a^T b)/∂a = b

    ∂(a^T b)/∂b = a

    ∇_w ||w||^2 = ∂(w^T w)/∂w = w + w = 2w
J(w)
To find the minimum of J(w), ∇J(w) has to be found.

    J(w) = (1/2)||Xw − y||^2
         = (1/2)⟨Xw − y, Xw − y⟩
         = (1/2)[⟨Xw, Xw⟩ − ⟨Xw, y⟩ − ⟨y, Xw⟩ + ⟨y, y⟩]
         = (1/2)[w^T X^T X w − w^T X^T y − y^T X w + y^T y]

    ∇_w (w^T X^T X w) = X^T X w + X^T X w = 2 X^T X w
    ∇_w (w^T X^T y) = X^T y
    ∇_w (y^T X w) = X^T y

    ∇_w J = (1/2)(2 X^T X w − 2 X^T y) = X^T X w − X^T y
Optimal Solution
At the minimum value of w, ∇J = 0. That is,

    ∇J = X^T X w − X^T y = 0

Hence,

    X^T X w = X^T y

This is called the normal equation. Using this,

    w = (X^T X)^{-1} X^T y

The solution exists if (X^T X)^{-1} exists, that is, X is 1-1. If X is
1-1, then (X^T X)^{-1} X^T is a left inverse of X , as
(X^T X)^{-1} X^T X = I. It is also the pseudoinverse of X .
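A minimal NumPy sketch of solving the normal equation, assuming X^T X is invertible; the synthetic data and true parameter vector are illustrative only, and np.linalg.solve is used instead of forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n))])  # design matrix with x_i0 = 1
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)                  # given outputs (with small noise)

# Normal equation: X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The pseudoinverse gives the same solution when X is 1-1
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_hat, w_pinv))
```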
Existence of Solution
Question
{ ((1, 2)^T , 1), ((−2, 3)^T , −2), ((−1, 3)^T , −1), ((4, −1)^T , 3) }
Iterative Algorithms
For determining w using the derivative method, the inverse of
X^T X has to be found, which is not computationally efficient
for large data sets. Hence we resort to iterative algorithms.
An iterative search algorithm that minimizes J(w) starts
with an initial guess of w and then repeatedly changes w to
make J(w) smaller, until it converges to the value that
minimizes J(w).
Gradient Descent
Gradient Descent
The gradient vector can be interpreted as the "direction
and rate of fastest increase". If the gradient of a function is
non-zero at a point p, the direction of the gradient is the
direction in which the function increases most quickly from
p, and the magnitude of the gradient is the rate of increase
in that direction.
The size of the steps taken to reach the solution is called the
learning rate (step length).
Gradient Descent
Gradient descent: If a real valued function F(x) is defined
and differentiable in a neighbourhood of a point a, then F(x)
decreases fastest if one goes from a in the direction of the
negative gradient of F at a, −∇F(a).

    w_new = w_current − α [∇J(w)]_{w = w_current}

where α > 0 is called the step length.
Updation of w
For applying gradient descent, consider the following steps.
Choose an initial w = (w_0, w_1, . . . w_n)^T ∈ R^{n+1}. Then repeatedly
perform the update

    w := w − α ∇J

J is a function of w_0, w_1, . . . , w_n. Therefore,

    ∇J = ( ∂J/∂w_0, ∂J/∂w_1, . . . , ∂J/∂w_n )^T

    (w_0, w_1, . . . , w_n)^T := (w_0, w_1, . . . , w_n)^T − α ( ∂J/∂w_0, ∂J/∂w_1, . . . , ∂J/∂w_n )^T

    J(w) = (1/2) Σ_{i=1}^N (f(x_i) − y_i)^2 = (1/2) Σ_{i=1}^N (w^T x_i − y_i)^2

    ∇J(w) = Σ_{i=1}^N (w^T x_i − y_i) x_i

    w := w + α Σ_{i=1}^N (y_i − w^T x_i) x_i
Algorithm 1 Updation of w using Gradient Descent
Initialize the weight vector w
Choose a learning rate α
while not converged do
    w := w + α Σ_{i=1}^N (y_i − f(x_i)) x_i
end while
Algorithm 2 Updation of w: Gradient Descent
Initialize w
Iterate until convergence {
    w_j := w_j + α Σ_{i=1}^N (y_i − f(x_i)) x_{ij} , j = 0, 1, . . . n
}
Stopping Criteria
||w_new − w_current|| < ϵ
Batch Gradient Descent
For updating the parameter, the algorithm looks at every
data point in the training set at every step and hence it is
called batch gradient descent.
In general, gradient descent does not guarantee a global
minimum. Since J is a convex quadratic function, the
algorithm converges to the global minimum (assuming α is
not too large).
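A sketch of batch gradient descent using the update w := w + α Σ_i (y_i − w^T x_i) x_i; the learning rate, tolerance and iteration cap below are illustrative choices, not values prescribed in the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, eps=1e-6, max_iter=10000):
    """Minimize J(w) = 0.5 * ||Xw - y||^2 using the full data set at every step."""
    w = np.zeros(X.shape[1])                 # initial guess of w
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y)             # gradient: sum_i (w^T x_i - y_i) x_i
        w_new = w - alpha * grad
        if np.linalg.norm(w_new - w) < eps:  # stopping criterion ||w_new - w_current|| < eps
            return w_new
        w = w_new
    return w
```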
Stochastic Gradient Descent
The online version of gradient descent is called stochastic
gradient descent.
In contrast to batch gradient descent, stochastic gradient
descent processes only one training point at each step. Hence
when N becomes large, that is, for large data sets, stochastic
gradient descent is more computationally efficient than
batch gradient descent.
Algorithm 3 Updation of w using Stochastic Gradient Descent
Choose an initial weight vector w and learning rate α
while not converged do
for each i = 1, 2, . . . , N do
w := w + α(yi − f (xi ))xi
end for
Randomly shuffle the data
end while
Algorithm 4 Updation of w using Stochastic Gradient Descent
Choose an initial w and learning parameter α
Iterate until convergence {
    for i = 1, 2, . . . N {
        w_j := w_j + α (y_i − f(x_i)) x_{ij} , j = 0, 1, 2, . . . n
    }
    Randomly shuffle the data
}
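A corresponding sketch of stochastic gradient descent, processing one point per update and reshuffling after every pass; the fixed number of epochs is a hypothetical stopping rule standing in for the convergence test:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100, seed=0):
    """One training point per update; the data are reshuffled after every pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    idx = np.arange(X.shape[0])
    for _ in range(n_epochs):
        for i in idx:
            w = w + alpha * (y[i] - X[i] @ w) * X[i]  # w := w + alpha (y_i - f(x_i)) x_i
        rng.shuffle(idx)                              # randomly shuffle the data
    return w
```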
Hyperparameters and Parameters
Hyperparameters: Those whose values have to be given
before starting the algorithm. They play a critical role in
determining the performance of the algorithm.
α
Parameters: Those whose values are determined by
the algorithm.
w
N <n+1
As R(X) ⊆ R^N , the dimension of R(X) ≤ N < n + 1
S, the set of column vectors of X , spans R(X). |S| = n + 1 > N.
Therefore the elements of S are linearly dependent. Hence
the expression of elements of R(X) using S is not unique.
So X is not 1-1.
n+1>N
N <n+1
As there may be more than one w that satisfies the given
equation Xw = y, choose the solution with the lowest norm.
That is, the following constrained optimization problem has to
be considered:

    minimize_{w ∈ R^{n+1}} ||w||^2
    subject to Xw = y

For this to work, y should have at least one pre-image. Let
X be onto, that is, let the dimension of R(X) be N.
Constrained Optimization Problem: Equality
Constraints
Given functions f, g_i, i = 1, . . . m defined on a domain Ω ⊆ R^n ,

    minimize_{w ∈ Ω} f(w)
    subject to g_i(w) = 0, i = 1, 2, . . . m

    L(w, λ) = f(w) + Σ_{i=1}^m λ_i g_i(w)

where λ_i, i = 1, 2, . . . m are the Lagrangian parameters and L is
called the Lagrangian function.
Lagrangian Formulation
λ_1 (w^T x_1 − y_1)
...
λ_N (w^T x_N − y_N)

    Σ_{i=1}^N λ_i (w^T x_i − y_i), λ_i ∈ R
N <n+1: Lagrangian Formulation
By applying Lagrangian theory,

    L(w, λ) = ||w||^2 + λ^T (Xw − y)

where λ = (λ_1, λ_2, . . . λ_N)^T and λ_i, i = 1, 2, . . . N are the
Lagrangian parameters. By equating ∂L/∂w = 0,

    2w + X^T λ = 0

Hence

    w = − X^T λ / 2        (1)

By equating ∂L/∂λ = 0 we get

    Xw − y = 0             (2)

Using (1), the above equation becomes

    − X X^T λ / 2 = y

Therefore

    λ = −2 (X X^T)^{-1} y   (3)

Substituting (3) in (1),

    w = X^T (X X^T)^{-1} y

provided (X X^T)^{-1} exists, that is, X is onto. If the solution exists,
X^T (X X^T)^{-1} is a right inverse of X , as X X^T (X X^T)^{-1} = I. It is
also the pseudoinverse of X .
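A minimal sketch of this minimum-norm solution for an underdetermined system, assuming X X^T is invertible (that is, X is onto); the random data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_plus_1 = 3, 6                              # N < n + 1: more unknowns than equations
X = rng.normal(size=(N, n_plus_1))
y = rng.normal(size=N)

# w = X^T (X X^T)^{-1} y, the minimum-norm solution of Xw = y
w = X.T @ np.linalg.solve(X @ X.T, y)

print(np.allclose(X @ w, y))                    # the constraint Xw = y is satisfied
print(np.allclose(w, np.linalg.pinv(X) @ y))    # agrees with the pseudoinverse solution
```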
Overdetermined System: N > n + 1
Underdetermined System: N < n + 1
Overfitting and Underfitting
(Figure illustrating overfitting and underfitting, taken from Bishop's book.)
Performance Measure
Testing Points: {(x_{t1}, y_{t1}), (x_{t2}, y_{t2}), . . . (x_{tm}, y_{tm})}

    Mean Square Error = (1/m) Σ_{i=1}^m (f(x_{ti}) − y_{ti})^2
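The mean square error on a test set can be computed directly; Xt and yt below stand for a hypothetical test design matrix and test output vector:

```python
import numpy as np

def mean_square_error(w, Xt, yt):
    """MSE = (1/m) * sum_i (f(x_ti) - y_ti)^2 over the m testing points."""
    return np.mean((Xt @ w - yt) ** 2)
```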
Cross Validation
Performance of the model; optimal hyperparameters
Holdout Method
Random Subsampling or Monte Carlo Cross Validation
k-fold Cross Validation
Holdout Method
Randomly choose 70% of the data for training and the
remaining for testing
Develop the model using the training data
Check the performance using the testing data
For each value of the hyperparameter (e.g. 0.1, 0.2, . . . 1)
repeat the process and select the best value
If the performance of the model is good enough, take the
entire data and develop a single model
Random Subsampling or Monte Carlo Crossvalidation
Randomly choose 70% of the data for training and
remaining for testing
Develop the model using training data
Check the performance using testing data
Repeat the process m times
For each value of the hyperparameter (e.g. 0.1, 0.2, . . . 1)
repeat the process
If the performance of the model is good enough, take the
entire data and develop a single model
Algorithm 5 Random Subsampling or Monte Carlo Cross-
Validation
for each value of the hyperparameter do
for i = 1 to m do
Randomly select 70% of the data for training, and use the
remaining 30% for testing
Develop the model using the training data
Calculate the performance measure using the testing
data
end for
Calculate the average performance measure over all m iter-
ations
end for
Choose the hyperparameter that yields the best average per-
formance measure
if the model’s performance is satisfactory then
Train the final model on the entire dataset using the selected
hyperparameter
end if
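A sketch of random subsampling for choosing the learning rate α, reusing the batch_gradient_descent and mean_square_error helpers sketched earlier; the 70/30 split follows the slides, while m and the candidate α values passed in are illustrative:

```python
import numpy as np

def monte_carlo_cv(X, y, alphas, m=10, train_frac=0.7, seed=0):
    """Average the test MSE over m random splits for each candidate hyperparameter."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_train = int(train_frac * N)
    avg_mse = {}
    for alpha in alphas:
        errors = []
        for _ in range(m):
            perm = rng.permutation(N)
            tr, te = perm[:n_train], perm[n_train:]
            w = batch_gradient_descent(X[tr], y[tr], alpha=alpha)  # model from training data
            errors.append(mean_square_error(w, X[te], y[te]))      # performance on testing data
        avg_mse[alpha] = np.mean(errors)
    return min(avg_mse, key=avg_mse.get)  # hyperparameter with the best average MSE
```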
Cross Validation: k Fold Cross Validation
Divide the data into k folds
Training points: k-1 folds
Testing points: 1 fold
Each fold serves as the validation set exactly once
Algorithm 6 k fold Cross Validation
for each value of the hyperparameter do
Divide the dataset S into k mutually exclusive and exhaus-
tive folds (S1 , S2 , . . . , Sk )
for i = 1 to k do
Training set: S − Si ; Testing set: Si
Develop the model using the training data
Calculate the performance measure using the testing
data
end for
Calculate the average performance measure across all k
folds
end for
Choose the hyperparameter that gives the best average per-
formance measure
if the model’s performance is satisfactory then
Use the entire dataset S to develop the final model with
selected hyperparameters
end if
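A matching sketch of k-fold cross validation with the same helpers; k = 5 and the candidate α values are illustrative choices:

```python
import numpy as np

def k_fold_cv(X, y, alphas, k=5, seed=0):
    """Each fold serves as the validation set exactly once for every hyperparameter value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)  # S_1, ..., S_k
    avg_mse = {}
    for alpha in alphas:
        errors = []
        for i in range(k):
            te = folds[i]                                                # testing set: S_i
            tr = np.concatenate([folds[j] for j in range(k) if j != i])  # training set: S - S_i
            w = batch_gradient_descent(X[tr], y[tr], alpha=alpha)
            errors.append(mean_square_error(w, X[te], y[te]))
        avg_mse[alpha] = np.mean(errors)
    return min(avg_mse, key=avg_mse.get)
```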
Training, Validation & Testing
Set aside 20% of the data for testing
Apply random subsampling or k-fold cross validation on the
remaining data
Develop a model using the data set apart for training &
validation and apply it to the testing data
Repeat the process
Algorithm 7 Training, Validation & Testing
for t = 1 to T do
Randomly select 20% of the dataset (St ) for testing
Use the remaining 80% (S ′ ) as follows:
Apply cross-validation techniques on S ′ to determine
the optimal hyperparameter
Develop the model using S ′ with the chosen optimal
hyperparameter
Evaluate the model performance using the test set St
end for
In general, in the train-validation-test split, only one test set is
used, that is, T = 1. However, T > 1 provides deeper
insights, especially for models requiring extensive tuning.
Question
Apply cross validation
{(x1 , y1 ), (x2 , y2 ), (x3 , y3 ), (x4 , y4 ), (x5 , y5 )}
Normalization
Normalization is done to make the attribute values lie
in the same range so that no attribute dominates in
decision making.

    max min: x_{ik} := (x_{ik} − min(A_k)) / (max(A_k) − min(A_k))

    z score: x_{ik} := (x_{ik} − mean(A_k)) / (std. deviation(A_k))
xi = (xi1 , xi2 , . . . xin )T
Ak is the k th attribute of the data
Use the same min(Ak ) and max(Ak ) values computed from
the training data to transform the test data in the case of
max min
Use the same mean(Ak ) and std. deviation(Ak ) values
computed from the training data to transform the test data
in the case of z score
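A minimal sketch of both normalizations; the key point is that the statistics are computed from the training data only and then reused on the test data. The function names and array shapes (rows = points, columns = attributes) are assumptions for illustration:

```python
import numpy as np

def min_max_normalize(X_train, X_test):
    """Column-wise max-min scaling using statistics from the training data only."""
    col_min = X_train.min(axis=0)
    col_max = X_train.max(axis=0)
    scale = col_max - col_min
    return (X_train - col_min) / scale, (X_test - col_min) / scale

def z_score_normalize(X_train, X_test):
    """Column-wise z-score scaling using statistics from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std
```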
Question
Apply normalization:
A_k = (10, 25, 15)^T