CSC 576: Mathematical Foundations I
Ji Liu
Department of Computer Sciences, University of Rochester
September 20, 2016
1 Notations and Assumptions
In most cases (unless defined otherwise locally), we use
• Greek letters such as α, β, and γ to denote real numbers;
• Lowercase letters such as x, y, and z to denote vectors;
• Capital letters such as A, B, and C to denote matrices.
Other notations:
• R is the one dimensional Euclidean space;
• Rn is the n dimensional vector Euclidean space;
• Rm×n is the m × n dimensional matrix Euclidean space;
• R+ denotes the range [0, +∞);
• 1n ∈ Rn denotes a vector with 1 in all entries;
• For any vector $x \in \mathbb{R}^n$, we use $|x|$ to denote the entrywise absolute value vector, that is, $|x|_i = |x_i|$ for all $i = 1, \cdots, n$;
• $\odot$ denotes the component-wise product, that is, for any vectors $x$ and $y$, $(x \odot y)_i = x_i y_i$.
Some assumptions:
• Unless explicitly defined otherwise (locally), we always assume that all vectors are column vectors.
2 Vector norms, Inner product
A function $f : \mathbb{R}^n \to \mathbb{R}_+$ is called a "norm" if the following three conditions are satisfied:
• (Zero element) f (x) ≥ 0 and f (x) = 0 if and only if x = 0;
• (Homogeneous) For any α ∈ R and x ∈ Rn , f (αx) = |α|f (x);
• (Triangle inequality) Any x, y ∈ Rn satisfy f (x) + f (y) ≥ f (x + y).
1
The $\ell_2$ norm "$\|\cdot\|_2$" (a special "$f(\cdot)$") in $\mathbb{R}^n$ is defined as
$$\|x\|_2 = \left(|x_1|^2 + |x_2|^2 + \cdots + |x_n|^2\right)^{1/2}.$$
Because the $\ell_2$ norm is the most commonly used norm (also known as the Euclidean norm), we sometimes denote it by $\|\cdot\|$ for short. (Think about it: how about $f([x_1, x_2]) = 2x_1^2 + x_2^2$?)
A general $\ell_p$ norm ($p \ge 1$) is defined as
$$\|x\|_p = \left(|x_1|^p + |x_2|^p + \cdots + |x_n|^p\right)^{1/p}.$$
Note that for $p < 1$, this is not a "norm" since the triangle inequality is violated. The $\ell_\infty$ norm is defined as
$$\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_n|\}.$$
One may notice that the $\ell_\infty$ norm is the limit of the $\ell_p$ norm, that is, for any $x \in \mathbb{R}^n$, $\|x\|_\infty = \lim_{p \to +\infty} \|x\|_p$. In addition, people use $\|x\|_0$ to denote the $\ell_0$ "norm", which counts the number of nonzero entries of $x$ (it is not a true norm since homogeneity fails).
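As a quick numerical sanity check, the following short sketch (assuming NumPy is available) computes a few $\ell_p$ norms of the same vector and illustrates that $\|x\|_p$ approaches $\|x\|_\infty$ as $p$ grows; the vector is an arbitrary illustrative example.
\begin{verbatim}
import numpy as np

x = np.array([3.0, -4.0, 1.0, 0.0])

# l_p norms for several p; np.linalg.norm computes (sum |x_i|^p)^(1/p)
for p in [1, 2, 4, 10, 100]:
    print(f"||x||_{p} = {np.linalg.norm(x, ord=p):.6f}")

# l_infinity norm: the largest absolute entry
print("||x||_inf =", np.linalg.norm(x, ord=np.inf))   # 4.0

# l_0 "norm": the number of nonzero entries (not a true norm)
print("||x||_0   =", np.count_nonzero(x))             # 3
\end{verbatim}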
The inner product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^n$ is defined as
$$\langle x, y \rangle = \sum_i x_i y_i.$$
One can show that $\langle x, x \rangle = \|x\|^2$. Two vectors $x$ and $y$ are orthogonal if $\langle x, y \rangle = 0$. That is one reason why the $\ell_2$ norm is so special.
If $p \ge q$, then for any $x \in \mathbb{R}^n$ we have $\|x\|_p \le \|x\|_q$. In particular, we have
$$\|x\|_1 \ge \|x\|_2 \ge \|x\|_\infty.$$
To bound from the other side, we have
$$\|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad \|x\|_2 \le \sqrt{n}\,\|x\|_\infty.$$
Proof. To see the first one, we have
$$\|x\|_1 = \langle 1_n, |x| \rangle \le \|1_n\|_2 \,\| |x| \|_2 = \sqrt{n}\,\|x\|_2,$$
where the inequality uses the Cauchy-Schwarz inequality. The proof of the second inequality is left as homework.
Given a norm "$\|\cdot\|_A$", its dual norm is defined as
$$\|x\|_{A^*} = \max_{\|y\|_A \le 1} \langle x, y \rangle = \max_{\|y\|_A = 1} \langle x, y \rangle = \max_{z \ne 0} \frac{\langle x, z \rangle}{\|z\|_A}.$$
Several important properties of the dual norm:
• The dual norm's dual norm is itself, that is, $\|x\|_{(A^*)^*} = \|x\|_A$;
• The $\ell_2$ norm is self-dual, that is, the dual norm of the $\ell_2$ norm is still the $\ell_2$ norm;
• The dual norm of the $\ell_p$ norm ($p \ge 1$) is the $\ell_q$ norm, where $p$ and $q$ satisfy $1/p + 1/q = 1$. In particular, the $\ell_1$ norm and the $\ell_\infty$ norm are dual to each other;
• (Hölder inequality) $\langle x, y \rangle \le \|x\|_A \|y\|_{A^*}$.
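As an illustrative sketch (assuming NumPy is available), the code below estimates the dual norm of the $\ell_1$ norm by maximizing $\langle x, y\rangle$ over the extreme points $\pm e_i$ of the $\ell_1$ ball, recovering $\|x\|_\infty$, and checks the Hölder inequality $\langle x, y\rangle \le \|x\|_1 \|y\|_\infty$ on random pairs; all data here are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n)

# Dual norm of l_1: maximize <x, y> over ||y||_1 <= 1.
# The maximum is attained at an extreme point y = +/- e_i,
# so it equals max_i |x_i| = ||x||_inf.
extreme_points = np.vstack([np.eye(n), -np.eye(n)])    # the vectors +/- e_i
dual_l1 = (extreme_points @ x).max()
print(dual_l1, np.linalg.norm(x, np.inf))              # the two values agree

# Hölder inequality <x, y> <= ||x||_1 ||y||_inf on random pairs
for _ in range(1000):
    y = rng.standard_normal(n)
    assert x @ y <= np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf) + 1e-12
print("Hölder inequality verified on random samples")
\end{verbatim}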
3 Linear space, subspace, linear transformation
A set S is a linear space if
• 0 ∈ S;
• given any two points $x, y \in S$ and any two scalars $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}$, we have $\alpha x + \beta y \in S$.
Note that ∅ is not a linear space. Examples: vector space Rn , matrix space Rm×n . How about the
following things:
• 0; (no)
• {0}; (yes)
• {x | Ax = b} where A is a matrix and b is a vector. (b = 0 yes; otherwise, no)
Let S be a linear space. A set S 0 is a subspace if S 0 is a linear space and also a subset of S.
Actually, “subspace” is equivalent to “linear space”, because any subspace is a linear space and
any linear space is a subspace. They are indeed talking about the same thing.
Let S be a linear space. A function L(·) is a linear transformation if given any two points
x, y ∈ S and two scalars α ∈ R and β ∈ R, one has
L(αx + βy) = αL(x) + βL(y).
For vector spaces, there exists a 1-1 correspondence between linear transformations and matrices. Therefore, we can simply say "a matrix is a linear transformation".
• Prove that $\{L(x) \mid x \in S\}$ is a linear space if $S$ is a linear space and $L$ is a linear transformation.
• Prove that $\{x \mid L(x) \in S\}$ is a linear space, assuming $S$ is a linear space and $L$ is a linear transformation.
How do we express a subspace? The most intuitive way is to use a bunch of vectors. A subspace can be expressed by
$$\mathrm{span}\{x_1, x_2, \cdots, x_n\} = \left\{\sum_{i=1}^n \alpha_i x_i \;\middle|\; \alpha_i \in \mathbb{R}\right\} = \{X\alpha \mid \alpha \in \mathbb{R}^n\},$$
where $X = [x_1, x_2, \cdots, x_n]$; this is called the range space of the matrix $X$. A subspace can also be represented by the null space of $X$:
$$\{\alpha \mid X\alpha = 0\}.$$
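As a minimal sketch (assuming NumPy is available), one can compute an orthonormal basis of the null space $\{\alpha \mid X\alpha = 0\}$ from the SVD of $X$, and check whether a given vector lies in the range space of $X$ via a least-squares residual; the matrix and vector below are illustrative.
\begin{verbatim}
import numpy as np

# Columns x1, x2, x3 with x3 = x1 + x2, so the range space has dimension 2.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

# Null space {alpha | X alpha = 0}: right singular vectors whose singular value is ~0
U, s, Vt = np.linalg.svd(X)
null_basis = Vt[s < 1e-10].T           # works here because X is square
print("null space basis:\n", null_basis)   # spanned by (1, 1, -1)/sqrt(3), up to sign

# Is b in the range space of X?  Check the least-squares residual.
b = np.array([1.0, 2.0, 3.0])          # = x1 + 2*x2, so yes
alpha, res, rank, _ = np.linalg.lstsq(X, b, rcond=None)
print("residual:", np.linalg.norm(X @ alpha - b))   # ~0, so b is in the range
\end{verbatim}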
4 Eigenvalues / eigenvectors, rank, SVD, inverse
The transpose of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as $A^T \in \mathbb{R}^{n\times m}$ with
$$(A^T)_{ij} = A_{ji}.$$
One can verify that
$$(AB)^T = B^T A^T.$$
A matrix B ∈ Rn×n is the inverse of an invertible matrix A ∈ Rn×n if
AB = I and BA = I.
$B$ is denoted by $A^{-1}$. $A$ has an inverse if and only if $A$ has full rank (the definition of "rank" will be given very soon). Note that the inverse of a matrix is unique. One can also verify
that if both A and B are invertible, then
(AB)−1 = B −1 A−1 .
The “transpose” and the “inverse” are exchangeable:
(AT )−1 = (A−1 )T .
When we write A−1 , we have to make sure that A is invertible.
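These identities are easy to verify numerically; here is a minimal sketch (assuming NumPy is available) with random matrices shifted along the diagonal so that they are safely invertible.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted to be safely invertible
B = rng.standard_normal((n, n)) + n * np.eye(n)

inv = np.linalg.inv
# (AB)^{-1} = B^{-1} A^{-1}
print(np.allclose(inv(A @ B), inv(B) @ inv(A)))    # True
# (A^T)^{-1} = (A^{-1})^T
print(np.allclose(inv(A.T), inv(A).T))             # True
\end{verbatim}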
Given a square matrix $A \in \mathbb{R}^{n\times n}$, $x \in \mathbb{R}^n$ ($x \ne 0$) is called its eigenvector and $\lambda \in \mathbb{R}$ is called its eigenvalue if the following relationship is satisfied:
$$Ax = \lambda x.$$
(The effect of applying the linear transformation $A$ on $x$ is nothing but scaling it.)
Note that
• If $\{\lambda, x\}$ is an eigenvalue-eigenvector pair, then so is $\{\lambda, \alpha x\}$ for any $\alpha \ne 0$.
• One eigenvalue may correspond to multiple different eigenvectors. "Different" means the eigenvectors are still different after normalization.
If the matrix $A$ is symmetric, then any two eigenvectors corresponding to different eigenvalues are orthogonal, that is, if $A^T = A$, $Ax_1 = \lambda_1 x_1$, $Ax_2 = \lambda_2 x_2$, and $\lambda_1 \ne \lambda_2$, then
$$x_1^T x_2 = 0.$$
Proof. Consider $x_1^T A x_2$. We have
$$x_1^T A x_2 = x_1^T (A x_2) = x_1^T (\lambda_2 x_2) = \lambda_2 x_1^T x_2,$$
and
$$x_1^T A x_2 = (x_1^T A) x_2 = (A^T x_1)^T x_2 \overset{A = A^T}{=} (A x_1)^T x_2 = \lambda_1 x_1^T x_2.$$
Therefore, we have
$$\lambda_2 x_1^T x_2 = \lambda_1 x_1^T x_2.$$
Since $\lambda_1 \ne \lambda_2$, we obtain $x_1^T x_2 = 0$.
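A quick numerical illustration of this fact (a sketch assuming NumPy is available): the eigenvectors of a random symmetric matrix, as returned by numpy.linalg.eigh, are mutually orthogonal.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T                           # a random symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)  # columns of eigvecs are eigenvectors
# Pairwise inner products form the identity matrix, i.e., the eigenvectors are orthonormal.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(5)))   # True
# Each column indeed satisfies A x = lambda x.
print(np.allclose(A @ eigvecs, eigvecs * eigvals))   # True
\end{verbatim}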
A matrix $A \in \mathbb{R}^{m\times n}$ is a "rank-1" matrix if $A$ can be expressed as
$$A = xy^T,$$
where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, with $x \ne 0$ and $y \ne 0$. The rank of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as
$$\mathrm{rank}(A) = \min\left\{ r \;\middle|\; A = \sum_{i=1}^r x_i y_i^T,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}^n \right\} = \min\left\{ r \;\middle|\; A = \sum_{i=1}^r B_i,\ B_i \text{ is a "rank-1" matrix} \right\}.$$
Examples: $[1, 1; 1, 1]$, $[1, 1; 2, 2]$, and many natural images have the low-rank property. "Low rank" implies that the matrix contains only a small amount of information.
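The examples above can be checked directly; a sketch (assuming NumPy is available) showing that $[1, 1; 2, 2]$ has rank 1 and can be written as a single outer product $xy^T$:
\begin{verbatim}
import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 2.0]])
print(np.linalg.matrix_rank(A))      # 1

# A single rank-1 term reproduces A: A = x y^T with x = (1, 2)^T, y = (1, 1)^T
x = np.array([[1.0], [2.0]])
y = np.array([[1.0], [1.0]])
print(np.allclose(A, x @ y.T))       # True
\end{verbatim}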
We say "$U \in \mathbb{R}^{m\times n}$ has orthogonal columns" if $U^T U = I$, that is, any two columns $U_{\cdot i}$ and $U_{\cdot j}$ of $U$ satisfy
$$U_{\cdot i}^T U_{\cdot j} = 0 \text{ if } i \ne j; \qquad U_{\cdot i}^T U_{\cdot i} = 1.$$
If we swap any two columns of $U$ to obtain $U'$, then $U'$ still satisfies $U'^T U' = I$.
• $\|Ux\| = \|x\|$ for all $x$.
• $\|U^T y\| \le \|y\|$ for all $y$.
If $U$ is a square matrix and has orthogonal columns, then we call it an "orthogonal matrix". It has some nice properties:
• $U^{-1} = U^T$ (which means that $UU^T = U^T U = I$).
• $U^T$ is also an orthogonal matrix.
• The effect of applying the transformation $U$ to a vector $x$ is to rotate it, that is, $\|Ux\| = \|x\| = \|U^T x\|$.
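As a small sketch (assuming NumPy is available), an orthogonal matrix obtained from a QR factorization of a random square matrix satisfies the properties listed above:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# QR factorization of a random square matrix yields an orthogonal Q
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

print(np.allclose(Q.T @ Q, np.eye(4)))            # Q^T Q = I
print(np.allclose(np.linalg.inv(Q), Q.T))         # Q^{-1} = Q^T
x = rng.standard_normal(4)
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))    # ||Qx|| = ||x||
print(np.isclose(np.linalg.norm(Q.T @ x), np.linalg.norm(x)))  # ||Q^T x|| = ||x||
\end{verbatim}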
"SVD" is short for "singular value decomposition", which is the most important concept in linear algebra and matrix analysis. The SVD reveals almost all of the structure of a matrix. Any matrix $A \in \mathbb{R}^{m\times n}$ can be decomposed into
$$A = U \Sigma V^T = \sum_{i=1}^r \sigma_i U_{\cdot i} V_{\cdot i}^T,$$
where $U \in \mathbb{R}^{m\times r}$ and $V \in \mathbb{R}^{n\times r}$ have orthogonal columns, and $\Sigma = \mathrm{diag}\{\sigma_1, \sigma_2, \cdots, \sigma_r\}$ is a diagonal matrix with positive diagonal elements. The $\sigma_i$'s are called singular values; they are positive and arranged in decreasing order.
• rank(A) = r;
• $\|Ax\| \le \sigma_1 \|x\|$. Why?
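A brief sketch (assuming NumPy is available) that computes the SVD of a rank-deficient matrix and checks the two bullet points above; the matrix is built as a sum of two random rank-1 terms.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# Build a 6 x 4 matrix of rank 2 as a sum of two rank-1 matrices
A = np.outer(rng.standard_normal(6), rng.standard_normal(4)) \
  + np.outer(rng.standard_normal(6), rng.standard_normal(4))

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)
print("rank(A) =", r, "=", np.linalg.matrix_rank(A))     # both equal 2

# ||Ax|| <= sigma_1 ||x|| for random x (equality when x is the top right singular vector)
for _ in range(1000):
    x = rng.standard_normal(4)
    assert np.linalg.norm(A @ x) <= s[0] * np.linalg.norm(x) + 1e-10
print("||Ax|| <= sigma_1 ||x|| verified")
print(np.isclose(np.linalg.norm(A @ Vt[0]), s[0]))       # equality at the top singular vector
\end{verbatim}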
A matrix $B \in \mathbb{R}^{n\times n}$ is positive semi-definite (PSD) if the following conditions are satisfied:
• B is symmetric;
• ∀x ∈ Rn , we have xT Bx ≥ 0.
A positive definite matrix is defined by adding one more condition:
• xT Bx = 0 ⇔ x = 0.
We can also use the following equivalent definition for PSD matrices: a matrix $B \in \mathbb{R}^{n\times n}$ is positive semi-definite (PSD) if the SVD of $B$ can be written as
$$B = U \Sigma U^T,$$
where $U^T U = I$ and $\Sigma$ is a diagonal matrix with nonnegative diagonal elements. Examples of PSD matrices: $I$, $A^T A$.
Assume matrices A and B are invertible. We have the following identity:
B −1 = A−1 − B −1 (B − A)A−1 .
The Sherman-Morrison-Woodbury formula is very useful for calculating matrix inverses:
$$(A + UV^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}.$$
This result is especially important from the perspective of computation. A special case is when $U$ and $V$ are two vectors $u$ and $v$; then it takes the form
$$(A + uv^T)^{-1} = A^{-1} - (1 + v^T A^{-1} u)^{-1} A^{-1} u v^T A^{-1},$$
which can be calculated with complexity $O(n^2)$ if $A^{-1}$ is known.
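A sketch (assuming NumPy is available) of the rank-1 special case: the update of $A^{-1}$ matches a direct inversion of $A + uv^T$ while using only matrix-vector and outer products, hence $O(n^2)$ once $A^{-1}$ is known.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted to be safely invertible
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

A_inv = np.linalg.inv(A)                          # assumed to be known already
# Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - (1 + v^T A^{-1} u)^{-1} A^{-1} u v^T A^{-1}
correction = (A_inv @ u) @ (v.T @ A_inv) / (1.0 + v.T @ A_inv @ u)
updated_inv = A_inv - correction

print(np.allclose(updated_inv, np.linalg.inv(A + u @ v.T)))   # True
\end{verbatim}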
Sylvester's determinant theorem states that, for $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times m}$,
$$\det(I_m + AB) = \det(I_n + BA).$$
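A one-line numerical check of this identity (a sketch assuming NumPy is available), with random rectangular $A$ and $B$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))
# det(I_m + A B) = det(I_n + B A)
print(np.isclose(np.linalg.det(np.eye(m) + A @ B),
                 np.linalg.det(np.eye(n) + B @ A)))   # True
\end{verbatim}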
5 Matrix norms (spectral norm, nuclear norm, Frobenius norm)
The Frobenius norm (F-norm) of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as
$$\|A\|_F = \left(\sum_{1\le i\le m,\ 1\le j\le n} |A_{ij}|^2\right)^{1/2} = \left(\sum_i \sigma_i^2\right)^{1/2}.$$
If $A$ is a vector, one can verify that $\|A\|_F = \|A\|_2$.
The inner product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^{m\times n}$ is defined as
$$\langle X, Y \rangle = \sum_{i,j} X_{ij} Y_{ij} = \mathrm{trace}(X^T Y) = \mathrm{trace}(Y X^T) = \mathrm{trace}(X Y^T) = \mathrm{trace}(Y^T X).$$
An important property of $\mathrm{trace}(AB)$:
$$\mathrm{trace}(AB) = \mathrm{trace}(BA) = \mathrm{trace}(A^T B^T) = \mathrm{trace}(B^T A^T).$$
One may notice that $\langle X, X \rangle = \|X\|_F^2$.
The spectral (operator) norm of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as
$$\|A\|_{\mathrm{spec}} = \max_{\|x\|=1} \|Ax\| = \max_{\|x\|=1,\ \|y\|=1} y^T A x = \sigma_1(A).$$
The nuclear norm of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as
$$\|A\|_{\mathrm{tr}} = \sum_i \sigma_i(A) = \mathrm{trace}(\Sigma),$$
where $\Sigma$ is the diagonal matrix in the SVD $A = U\Sigma V^T$.
An important relationship:
$$\|A\|_{\mathrm{spec}} \le \|A\|_F \le \|A\|_{\mathrm{tr}} \qquad \text{and} \qquad \mathrm{rank}(A)\,\|A\|_{\mathrm{spec}} \ge \sqrt{\mathrm{rank}(A)}\,\|A\|_F \ge \|A\|_{\mathrm{tr}}.$$
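A sketch (assuming NumPy is available) that evaluates all three norms of a random low-rank matrix and confirms the chain of inequalities above:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# Random 8 x 6 matrix of rank 3
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))

s = np.linalg.svd(A, compute_uv=False)
spec = s[0]                      # spectral norm = largest singular value
fro = np.linalg.norm(A, 'fro')   # Frobenius norm = sqrt of sum of squared singular values
nuc = s.sum()                    # nuclear (trace) norm = sum of singular values
r = np.linalg.matrix_rank(A)

print(spec <= fro <= nuc)                         # True
print(r * spec >= np.sqrt(r) * fro >= nuc)        # True
\end{verbatim}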
The dual norm of a matrix norm $\|\cdot\|_A$ is defined as
$$\|Y\|_{A^*} := \max_{X \ne 0} \frac{\langle X, Y \rangle}{\|X\|_A} = \max_{\|X\|_A \le 1} \langle X, Y \rangle. \qquad (1)$$
We have the following properties (think about why they are true):
$$\|X\|_{\mathrm{spec}^*} = \|X\|_{\mathrm{tr}}, \qquad \|X\|_{F^*} = \|X\|_F.$$
6 Matrix and Vector Differential
Let $f(X) : \mathbb{R}^{m\times n} \to \mathbb{R}$ be a function of a matrix $X \in \mathbb{R}^{m\times n}$. Its differential (or gradient) is defined as
$$\frac{\partial f(X)}{\partial X} = \begin{bmatrix}
\frac{\partial f(X)}{\partial X_{11}} & \cdots & \frac{\partial f(X)}{\partial X_{1j}} & \cdots & \frac{\partial f(X)}{\partial X_{1n}} \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\frac{\partial f(X)}{\partial X_{i1}} & \cdots & \frac{\partial f(X)}{\partial X_{ij}} & \cdots & \frac{\partial f(X)}{\partial X_{in}} \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\frac{\partial f(X)}{\partial X_{m1}} & \cdots & \frac{\partial f(X)}{\partial X_{mj}} & \cdots & \frac{\partial f(X)}{\partial X_{mn}}
\end{bmatrix}.$$
We provide a few examples below:
$$f(X) = \mathrm{trace}(A^T X) = \langle A, X \rangle \quad\Longrightarrow\quad \frac{\partial f(X)}{\partial X} = A;$$
$$f(X) = \mathrm{trace}(X^T A X) \quad\Longrightarrow\quad \frac{\partial f(X)}{\partial X} = (A + A^T) X;$$
$$f(X) = \frac{1}{2}\|AX - B\|_F^2 \quad\Longrightarrow\quad \frac{\partial f(X)}{\partial X} = A^T (AX - B);$$
$$f(X) = \frac{1}{2}\mathrm{trace}(B^T X^T X B) \quad\Longrightarrow\quad \frac{\partial f(X)}{\partial X} = X B B^T;$$
$$f(X) = \frac{1}{2}\mathrm{trace}(B^T X^T A X B) \quad\Longrightarrow\quad \frac{\partial f(X)}{\partial X} = \frac{1}{2}(A + A^T) X B B^T.$$
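These formulas can be validated with a finite-difference check. Below is a sketch (assuming NumPy is available) for the third example, $f(X) = \frac{1}{2}\|AX - B\|_F^2$ with gradient $A^T(AX - B)$; the same pattern works for the other examples, and the matrix sizes are arbitrary illustrations.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 3))
X = rng.standard_normal((4, 3))

f = lambda X: 0.5 * np.linalg.norm(A @ X - B, 'fro') ** 2
grad = A.T @ (A @ X - B)              # claimed closed-form gradient

# Finite-difference approximation of each partial derivative d f / d X_ij
eps = 1e-6
num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(num_grad - grad)))   # tiny, so the formula checks out
\end{verbatim}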