Matrix Calculus
Two competing notational conventions split the field of matrix calculus into two separate
groups. The two groups can be distinguished by whether they write the derivative of a scalar
with respect to a vector as a column vector or a row vector. Both of these conventions are
possible even when the common assumption is made that vectors should be treated as
column vectors when combined with matrices (rather than row vectors). A single convention
can be somewhat standard throughout a single field that commonly uses matrix calculus
(e.g. econometrics, statistics, estimation theory and machine learning). However, even within
a given field different authors can be found using competing conventions. Authors of both
groups often write as though their specific conventions were standard. Serious mistakes can
result when combining results from different authors without carefully verifying that
compatible notations have been used. Definitions of these two conventions and comparisons
between them are collected in the layout conventions section.
Scope
Matrix calculus refers to a number of different notations that use matrices and vectors to
collect the derivative of each component of the dependent variable with respect to each
component of the independent variable. In general, the independent variable can be a scalar,
a vector, or a matrix while the dependent variable can be any of these as well. Each different
situation will lead to a different set of rules, or a separate calculus, using the broader sense of
the term. Matrix notation serves as a convenient way to collect the many derivatives in an
organized way.
As a first example, consider the gradient from vector calculus. For a scalar function of three
independent variables, f(x_1, x_2, x_3), the gradient is given by the vector equation

$$\nabla f = \frac{\partial f}{\partial x_1} \hat{\mathbf{x}}_1 + \frac{\partial f}{\partial x_2} \hat{\mathbf{x}}_2 + \frac{\partial f}{\partial x_3} \hat{\mathbf{x}}_3,$$

where \hat{\mathbf{x}}_i represents a unit vector in the x_i direction for 1 ≤ i ≤ 3. This type of generalized
derivative can be seen as the derivative of a scalar, f, with respect to a vector, x, and its result
can be easily collected in vector form.
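Since each component of the gradient is just a partial derivative, it is easy to approximate numerically and collect into a vector. A minimal Python sketch (the function f below is an arbitrary example chosen here, not one from the text):

```python
import numpy as np

def grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar
    function f at the point x, collected as a vector."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: f(x1, x2, x3) = x1**2 + x2*x3 has gradient (2*x1, x3, x2).
f = lambda x: x[0]**2 + x[1]*x[2]
print(grad(f, np.array([1.0, 2.0, 3.0])))  # ~ [2., 3., 2.]
```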
More complicated examples include the derivative of a scalar function with respect to a
matrix, known as the gradient matrix, which collects the derivative with respect to each
matrix element in the corresponding position in the resulting matrix. In that case the scalar
must be a function of each of the independent variables in the matrix. As another example, if
we have an n-vector of dependent variables, or functions, of m independent variables we
might consider the derivative of the dependent vector with respect to the independent vector.
The result could be collected in an m×n matrix consisting of all of the possible derivative
combinations.
There are a total of nine possibilities using scalars, vectors, and matrices. Notice that as we
consider higher numbers of components in each of the independent and dependent variables
we can be left with a very large number of possibilities. The six kinds of derivatives that can
be most neatly organized in matrix form are collected in the following table.[1]
| | Scalar y | Vector y (size m×1) | Matrix Y (size m×n) |
|---|---|---|---|
| Scalar x | ∂y/∂x | ∂y/∂x | ∂Y/∂x |
| Vector x (size n×1) | ∂y/∂x | ∂y/∂x | |
| Matrix X (size p×q) | ∂y/∂X | | |
Here, we have used the term "matrix" in its most general sense, recognizing that vectors are
simply matrices with one column (and scalars are simply vectors with one row). Moreover, we
have used bold letters to indicate vectors and bold capital letters for matrices. This notation
is used throughout.
Notice that we could also talk about the derivative of a vector with respect to a matrix, or any
of the other unfilled cells in our table. However, these derivatives are most naturally organized
in a tensor of rank higher than 2, so that they do not fit neatly into a matrix. In the following
three sections we will define each one of these derivatives and relate them to other branches
of mathematics. See the layout conventions section for a more detailed table.
Notation
The vector and matrix derivatives presented in the sections to follow take full advantage of
matrix notation, using a single variable to represent a large number of variables. In what
follows we will distinguish scalars, vectors and matrices by their typeface. We will let M(n,m)
denote the space of real n×m matrices with n rows and m columns. Such matrices will be
denoted using bold capital letters: A, X, Y, etc. An element of M(n,1), that is, a column vector,
is denoted with a boldface lowercase letter: a, x, y, etc. An element of M(1,1) is a scalar,
denoted with lowercase italic typeface: a, t, x, etc. X^T denotes the matrix transpose, tr(X) is the
trace, and det(X) or |X| is the determinant. All functions are assumed to be of
differentiability class C^1 unless otherwise noted. Generally letters from the first half of the
alphabet (a, b, c, ...) will be used to denote constants, and from the second half (t, x, y, ...) to
denote variables.
NOTE: As mentioned above, there are competing notations for laying out systems of partial
derivatives in vectors and matrices, and no standard appears to be emerging yet. The next
two introductory sections use the numerator layout convention simply for the purposes of
convenience, to avoid overly complicating the discussion. The section after them discusses
layout conventions in more detail. It is important to realize the following:
The notations developed here can accommodate the usual operations of vector calculus by
identifying the space M(n,1) of n-vectors with the Euclidean space R^n, and the scalar M(1,1)
is identified with R. The corresponding concept from vector calculus is indicated at the end of
each subsection.
NOTE: The discussion in this section assumes the numerator layout convention for
pedagogical purposes. Some authors use different conventions. The section on layout
conventions discusses this issue in greater detail. The identities given further down are
presented in forms that can be used in conjunction with all common layout conventions.
Vector-by-scalar
The derivative of a vector , by a scalar x is written (in numerator
layout notation) as
In vector calculus the derivative of a vector y with respect to a scalar x is known as the
Example Simple examples of this include the velocity vector in Euclidean space, which is the
tangent vector of the position vector (considered as a function of time). Also, the
acceleration is the tangent vector of the velocity.
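A small numerical illustration, using a hypothetical trajectory p(t) chosen here for the example: differentiating each component of the position with respect to time gives the velocity, and differentiating again gives the acceleration.

```python
import numpy as np

# Hypothetical trajectory: p(t) = (cos t, sin t, t).
# Velocity dp/dt = (-sin t, cos t, 1); acceleration d2p/dt2 = (-cos t, -sin t, 0).
def p(t):
    return np.array([np.cos(t), np.sin(t), t])

def d_dt(f, t, h=1e-5):
    # Central difference, applied componentwise to the vector output.
    return (f(t + h) - f(t - h)) / (2 * h)

t = 1.0
velocity = d_dt(p, t)                          # ~ [-0.841, 0.540, 1.0]
acceleration = d_dt(lambda s: d_dt(p, s), t)   # ~ [-0.540, -0.841, 0.0]
print(velocity, acceleration)
```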
Scalar-by-vector
The derivative of a scalar y by a vector x = [x_1 x_2 ... x_n]^T, is written (in numerator
layout notation) as

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}.$$

In vector calculus, the gradient of a scalar field f : R^n → R (whose independent coordinates
are the components of x) is the transpose of the derivative of a scalar by a vector.
For example, in physics, the electric field is the negative vector gradient of the electric
potential.
The directional derivative of a scalar function f(x) of the space vector x in the direction of the
unit vector u (represented in this case as a column vector) is defined using the gradient as
follows:

$$\nabla_{\mathbf{u}} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h} = \nabla f(\mathbf{x}) \cdot \mathbf{u}.$$

Using the notation just defined for the derivative of a scalar with respect to a vector, we can
rewrite the directional derivative as $\nabla_{\mathbf{u}} f = \frac{\partial f}{\partial \mathbf{x}} \mathbf{u}$. This type of notation will be nice when
proving product rules and chain rules that come out looking similar to what we are familiar
with for the scalar derivative.
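As a quick sanity check of this formula, one can compare (∂f/∂x)u against the limit definition directly. A minimal numpy sketch, with f and u chosen arbitrarily for the example:

```python
import numpy as np

def grad_row(f, x, h=1e-6):
    """Numerator-layout derivative of scalar f by vector x: a row vector."""
    return np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(x.size)])

f = lambda x: x[0]*x[1] + x[2]                 # arbitrary example function
x = np.array([1.0, 2.0, 3.0])
u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)   # unit direction vector

h = 1e-6
from_gradient = grad_row(f, x) @ u                # (∂f/∂x) u
from_limit = (f(x + h*u) - f(x - h*u)) / (2*h)    # limit definition
print(from_gradient, from_limit)                  # both ~ 4/sqrt(3) ~ 2.309
```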
Vector-by-vector
Each of the previous two cases can be considered as an application of the derivative of a
vector with respect to a vector, using a vector of size one appropriately. Similarly we will find
that the derivatives involving matrices will reduce to derivatives involving vectors in a
corresponding way.
The derivative of a vector function y = [y_1 y_2 ... y_m]^T, with respect to an input vector
x = [x_1 x_2 ... x_n]^T, is written (in numerator layout notation) as

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}.$$

In vector calculus, the derivative of a vector function y with respect to a vector x whose
components represent a space is known as the pushforward (or differential), or the Jacobian
matrix.
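In numerator layout the (i, j) entry of ∂y/∂x is ∂y_i/∂x_j, i.e. exactly the Jacobian matrix, which can be built column by column with finite differences. A small sketch (the map f is an arbitrary example):

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Numerator-layout derivative dy/dx of a vector function f:
    entry (i, j) is dy_i/dx_j (the Jacobian matrix)."""
    cols = [(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(x.size)]
    return np.column_stack(cols)

# Example map y = (x1*x2, x1 + x2**2); its Jacobian is [[x2, x1], [1, 2*x2]].
f = lambda x: np.array([x[0]*x[1], x[0] + x[1]**2])
print(jacobian(f, np.array([2.0, 3.0])))  # ~ [[3, 2], [1, 6]]
```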
Matrix-by-scalar
The derivative of a matrix function Y by a scalar x is known as the tangent matrix and is
given (in numerator layout notation) by

$$\frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{m1}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x} \end{bmatrix}.$$
Scalar-by-matrix
The derivative of a scalar function y, with respect to a p×q matrix X of independent variables,
is given (in numerator layout notation) by

$$\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}} \end{bmatrix}$$

(a q×p matrix, laid out according to X^T). Important examples of scalar functions of matrices
include the trace of a matrix and the determinant.

In analog with vector calculus this derivative is often written as the following:

$$\nabla_{\mathbf{X}} y(\mathbf{X}) = \frac{\partial y(\mathbf{X})}{\partial \mathbf{X}}.$$

Also in analog with vector calculus, the directional derivative of a scalar f(X) of a matrix X in
the direction of matrix Y is given by

$$\nabla_{\mathbf{Y}} f = \operatorname{tr}\!\left(\frac{\partial f}{\partial \mathbf{X}} \mathbf{Y}\right).$$
It is the gradient matrix, in particular, that finds many uses in minimization problems in
estimation theory, particularly in the derivation of the Kalman filter algorithm, which is of
great importance in the field.
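As an illustration, the gradient matrix can be approximated entrywise by perturbing each element of X. The sketch below checks Jacobi's formula ∂det(X)/∂X = det(X) X^{-T} (stated here in denominator layout, so the result has the same shape as X):

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Denominator-layout derivative of scalar f by matrix X:
    entry (i, j) is df/dX_ij, same shape as X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2*h)
    return G

X = np.array([[2.0, 1.0], [0.5, 3.0]])
print(grad_matrix(np.linalg.det, X))          # finite-difference gradient matrix
print(np.linalg.det(X) * np.linalg.inv(X).T)  # Jacobi's formula: should agree
```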
Layout conventions
This section discusses the similarities and differences between notational conventions that
are used in the various fields that take advantage of matrix calculus. Although there are
largely two consistent conventions, some authors find it convenient to mix the two
conventions in forms that are discussed below. After this section, equations will be listed in
both competing forms separately.
The fundamental issue is that the derivative of a vector with respect to a vector, i.e. ∂y/∂x, is
often written in two competing ways. If the numerator y is of size m and the denominator x
of size n, then the result can be laid out as either an m×n matrix or an n×m matrix, i.e. with the
elements of y laid out in columns and the elements of x laid out in rows, or vice versa. This
leads to the following possibilities:
1. Numerator layout, i.e. lay out according to y and x^T (i.e. contrarily to x). This is
sometimes known as the Jacobian formulation.
2. Denominator layout, i.e. lay out according to y^T and x (i.e. contrarily to y). This is
sometimes known as the Hessian formulation, and is the transpose of the numerator layout.
3. A third possibility sometimes seen is to insist on writing the derivative as ∂y/∂x^T (i.e. the
derivative is taken with respect to the transpose of x) and follow the numerator layout. This
makes it possible to claim that the matrix is laid out according to both numerator and
denominator. In practice this produces results the same as the numerator layout.
When handling the gradient ∂y/∂x (scalar y, vector x) and the opposite case ∂y/∂x (vector y,
scalar x), we have the same issues. To be consistent, we should do one of the following:
1. If we choose numerator layout for ∂y/∂x, we should lay out the gradient ∂y/∂x as a row
vector, and the vector-by-scalar derivative ∂y/∂x as a column vector.
2. If we choose denominator layout for ∂y/∂x, we should lay out the gradient ∂y/∂x as a
column vector, and the vector-by-scalar derivative ∂y/∂x as a row vector.
3. In the third possibility above, we write ∂y/∂x^T and ∂y/∂x, and use
numerator layout.
Not all math textbooks and papers are consistent in this respect throughout. That is,
sometimes different conventions are used in different contexts within the same book or
paper. For example, some choose denominator layout for gradients (laying them out as
column vectors) but numerator layout for the vector-by-vector derivative ∂y/∂x.
Similarly, when it comes to scalar-by-matrix derivatives ∂y/∂X and matrix-by-scalar derivatives
∂Y/∂x, consistent numerator layout lays out according to Y and X^T, while consistent
denominator layout lays out according to Y^T and X. In practice, however, following a
denominator layout for ∂Y/∂x, and laying the result out according to Y^T, is rarely seen because
it makes for ugly formulas that do not correspond to the scalar formulas. As a result, the
following layouts can often be found:
1. Consistent numerator layout, which lays out ∂Y/∂x according to Y and ∂y/∂X
according to X^T.
2. Mixed layout, which lays out ∂Y/∂x according to Y and ∂y/∂X according
to X.
3. Use the notation ∂y/∂X^T, with results the same as consistent numerator layout.
In the formulas below, we handle the five possible combinations ∂y/∂x (scalar-by-vector),
∂y/∂x (vector-by-scalar), ∂y/∂x (vector-by-vector), ∂y/∂X (scalar-by-matrix) and ∂Y/∂x
(matrix-by-scalar) separately. We also handle cases of scalar-by-scalar derivatives that involve an
intermediate vector or matrix. (This can arise, for example, if a multi-dimensional parametric
curve is defined in terms of a scalar variable, and then a derivative of a scalar function of the
curve is taken with respect to the scalar that parameterizes the curve.) For each of the
various combinations, we give numerator-layout and denominator-layout results, except in the
cases above where denominator layout rarely occurs. In cases involving matrices where it
makes sense, we give numerator-layout and mixed-layout results. As noted above, cases
where vector and matrix denominators are written in transpose notation are equivalent to
numerator layout with the denominators written without the transpose.
Keep in mind that various authors use different combinations of numerator and denominator
layouts for different types of derivatives, and there is no guarantee that an author will
consistently use either numerator or denominator layout for all types. Match up the formulas
below with those quoted in the source to determine the layout used for that particular type of
derivative, but be careful not to assume that derivatives of other types necessarily follow the
same kind of layout.
When taking derivatives with an aggregate (vector or matrix) denominator in order to find a
maximum or minimum of the aggregate, it should be kept in mind that using numerator
layout will produce results that are transposed with respect to the aggregate. For example, in
attempting to find the maximum likelihood estimate of a multivariate normal distribution
using matrix calculus, if the domain is a k×1 column vector, then the result using the
numerator layout will be in the form of a 1×k row vector. Thus, either the results should be
transposed at the end or the denominator layout (or mixed layout) should be used.
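The transposition issue is easy to see in code. The sketch below stores the derivative of f(x) = x^T A x both ways; the two layouts hold the same numbers, only the shapes differ:

```python
import numpy as np

# For f(x) = x^T A x, numerator layout gives the 1×k row x^T(A + A^T);
# denominator layout gives its transpose, the k×1 column (A + A^T)x.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
x = np.array([[3.0], [4.0]])            # k×1 column vector

numerator_layout = x.T @ (A + A.T)      # shape (1, k): row vector
denominator_layout = (A + A.T) @ x      # shape (k, 1): column vector

assert np.allclose(numerator_layout.T, denominator_layout)
print(numerator_layout.shape, denominator_layout.shape)  # (1, 2) (2, 1)
```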
The results of differentiating the various kinds of aggregates under each convention are
summarized in the following table.

| Derivative | Numerator layout | Denominator layout |
|---|---|---|
| ∂y/∂x: vector y (size m), scalar x | size-m column vector | size-m row vector |
| ∂y/∂x: scalar y, vector x (size n) | size-n row vector | size-n column vector |
| ∂y/∂x: vector y (size m), vector x (size n) | m×n matrix | n×m matrix |
| ∂y/∂X: scalar y, matrix X (size p×q) | q×p matrix | p×q matrix |
| ∂Y/∂x: matrix Y (size m×n), scalar x | m×n matrix | (rarely used) |

The results of operations will be transposed when switching between numerator-layout and
denominator-layout notation.
Numerator-layout notation
Using numerator-layout notation, we have:[1]

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}, \qquad \frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \vdots \\ \frac{\partial y_m}{\partial x} \end{bmatrix},$$

$$\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial y_i}{\partial x_j}, \qquad \left(\frac{\partial y}{\partial \mathbf{X}}\right)_{ij} = \frac{\partial y}{\partial x_{ji}}.$$

The following definitions are only provided in numerator-layout notation:

$$\left(\frac{\partial \mathbf{Y}}{\partial x}\right)_{ij} = \frac{\partial y_{ij}}{\partial x}, \qquad (d\mathbf{X})_{ij} = dx_{ij}.$$
Denominator-layout notation
Using denominator-layout notation, we have:[2]

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix}, \qquad \frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \cdots & \frac{\partial y_m}{\partial x} \end{bmatrix},$$

$$\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial y_j}{\partial x_i}, \qquad \left(\frac{\partial y}{\partial \mathbf{X}}\right)_{ij} = \frac{\partial y}{\partial x_{ij}}.$$
Identities
As noted above, in general, the results of operations will be transposed when switching
between numerator-layout and denominator-layout notation.
To help make sense of all the identities below, keep in mind the most important rules: the
chain rule, product rule and sum rule. The sum rule applies universally, and the product rule
applies in most of the cases below, provided that the order of matrix products is maintained,
since matrix products are not commutative. The chain rule applies in some of the cases, but
unfortunately does not apply in matrix-by-scalar derivatives or scalar-by-matrix derivatives (in
the latter case, mostly involving the trace operator applied to matrices). In the latter case, the
product rule can't quite be applied directly, either, but the equivalent can be done with a bit
more work using the differential identities.
Vector-by-vector identities
This is presented first because all of the operations that apply to vector-by-vector
differentiation apply directly to vector-by-scalar or scalar-by-vector differentiation simply by
reducing the appropriate vector in the numerator or denominator to a scalar.
Identities: vector-by-vector (here v is a scalar function, u and a are vectors, and A is a matrix)

| Condition | Expression | Numerator layout, i.e. by y and x^T | Denominator layout, i.e. by y^T and x |
|---|---|---|---|
| a is not a function of x | ∂a/∂x | 0 | 0 |
| | ∂x/∂x | I | I |
| A is not a function of x | ∂(Ax)/∂x | A | A^T |
| A is not a function of x | ∂(x^T A)/∂x | A^T | A |
| a is not a function of x, u = u(x) | ∂(au)/∂x | a ∂u/∂x | a ∂u/∂x |
| v = v(x), a is not a function of x | ∂(va)/∂x | a ∂v/∂x | ∂v/∂x a^T |
| v = v(x), u = u(x) | ∂(vu)/∂x | v ∂u/∂x + u ∂v/∂x | v ∂u/∂x + ∂v/∂x u^T |
| A is not a function of x, u = u(x) | ∂(Au)/∂x | A ∂u/∂x | ∂u/∂x A^T |
| u = u(x), v = v(x) | ∂(u + v)/∂x | ∂u/∂x + ∂v/∂x | ∂u/∂x + ∂v/∂x |
| u = u(x) | ∂g(u)/∂x | (∂g(u)/∂u)(∂u/∂x) | (∂u/∂x)(∂g(u)/∂u) |
| u = u(x) | ∂f(g(u))/∂x | (∂f(g)/∂g)(∂g(u)/∂u)(∂u/∂x) | (∂u/∂x)(∂g(u)/∂u)(∂f(g)/∂g) |
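A numerical spot-check of two of the fundamental rows above (in numerator layout), using a randomly chosen A and x:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Numerator-layout dy/dx: entry (i, j) = dy_i/dx_j."""
    return np.column_stack([(f(x + h*e) - f(x - h*e)) / (2*h)
                            for e in np.eye(x.size)])

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

# d(Ax)/dx = A in numerator layout (A^T in denominator layout).
print(np.allclose(jacobian(lambda x: A @ x, x), A, atol=1e-5))       # True
# dx/dx = I in either layout.
print(np.allclose(jacobian(lambda x: x, x), np.eye(3), atol=1e-5))   # True
```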
Scalar-by-vector identities
The fundamental identities are listed first, followed by the special forms built from them.

Identities: scalar-by-vector

| Condition | Expression | Numerator layout, i.e. by x^T (row vector) | Denominator layout, i.e. by x (column vector) |
|---|---|---|---|
| a is not a function of x | ∂a/∂x | 0^T | 0 |
| a is not a function of x, u = u(x) | ∂(au)/∂x | a ∂u/∂x | a ∂u/∂x |
| u = u(x), v = v(x) | ∂(u + v)/∂x | ∂u/∂x + ∂v/∂x | ∂u/∂x + ∂v/∂x |
| u = u(x), v = v(x) | ∂(uv)/∂x | u ∂v/∂x + v ∂u/∂x | u ∂v/∂x + v ∂u/∂x |
| u = u(x) | ∂g(u)/∂x | (∂g(u)/∂u)(∂u/∂x) | (∂g(u)/∂u)(∂u/∂x) |
| u = u(x) | ∂f(g(u))/∂x | (∂f(g)/∂g)(∂g(u)/∂u)(∂u/∂x) | (∂f(g)/∂g)(∂g(u)/∂u)(∂u/∂x) |
| u = u(x), v = v(x) | ∂(u · v)/∂x = ∂(u^T v)/∂x | u^T ∂v/∂x + v^T ∂u/∂x (∂u/∂x, ∂v/∂x in numerator layout) | (∂u/∂x) v + (∂v/∂x) u (∂u/∂x, ∂v/∂x in denominator layout) |
| u = u(x), v = v(x), A is not a function of x | ∂(u^T A v)/∂x | u^T A ∂v/∂x + v^T A^T ∂u/∂x (∂u/∂x, ∂v/∂x in numerator layout) | (∂u/∂x) A v + (∂v/∂x) A^T u (∂u/∂x, ∂v/∂x in denominator layout) |
| | ∂²f/(∂x ∂x^T) | H(f), the Hessian matrix[3] | |
| a is not a function of x | ∂(a · x)/∂x = ∂(x · a)/∂x | a^T | a |
| A is not a function of x, b is not a function of x | ∂(b^T A x)/∂x | b^T A | A^T b |
| A is not a function of x | ∂(x^T A x)/∂x | x^T (A + A^T) | (A + A^T) x |
| A is not a function of x, A is symmetric | ∂(x^T A x)/∂x | 2 x^T A | 2 A x |
| A is not a function of x | ∂²(x^T A x)/∂x² | A + A^T | A + A^T |
| A is not a function of x, A is symmetric | ∂²(x^T A x)/∂x² | 2A | 2A |
| | ∂(x · x)/∂x = ∂||x||²/∂x | 2 x^T | 2 x |
| a is not a function of x, u = u(x) | ∂(a · u)/∂x | a^T ∂u/∂x (∂u/∂x in numerator layout) | (∂u/∂x) a (∂u/∂x in denominator layout) |
| a, b are not functions of x | ∂(a^T x x^T b)/∂x | x^T (a b^T + b a^T) | (a b^T + b a^T) x |
| A, b, C, D, e are not functions of x | ∂[(Ax + b)^T C (Dx + e)]/∂x | (Dx + e)^T C^T A + (Ax + b)^T C D | D^T C^T (Ax + b) + A^T C (Dx + e) |
| a is not a function of x | ∂||x - a||/∂x | (x - a)^T / ||x - a|| | (x - a) / ||x - a|| |
Vector-by-scalar identities
Identities: vector-by-scalar

| Condition | Expression | Numerator layout, i.e. by y (result is column vector) | Denominator layout, i.e. by y^T (result is row vector) |
|---|---|---|---|
| a is not a function of x | ∂a/∂x | 0 | 0 |
| a is not a function of x, u = u(x) | ∂(au)/∂x | a ∂u/∂x | a ∂u/∂x |
| A is not a function of x, u = u(x) | ∂(Au)/∂x | A ∂u/∂x | (∂u/∂x) A^T |
| u = u(x), v = v(x) | ∂(u + v)/∂x | ∂u/∂x + ∂v/∂x | ∂u/∂x + ∂v/∂x |
| u = u(x), v = v(x) | ∂(u × v)/∂x | (∂u/∂x) × v + u × (∂v/∂x) (numerator layout) | |
| u = u(x) | ∂g(u)/∂x | (∂g(u)/∂u)(∂u/∂x) (assumes consistent matrix layout; see note below) | (∂u/∂x)(∂g(u)/∂u) (assumes consistent matrix layout; see note below) |
| u = u(x) | ∂f(g(u))/∂x | (∂f(g)/∂g)(∂g(u)/∂u)(∂u/∂x) (assumes consistent matrix layout; see note below) | (∂u/∂x)(∂g(u)/∂u)(∂f(g)/∂g) (assumes consistent matrix layout; see note below) |
| U = U(x), v = v(x) | ∂(Uv)/∂x | U ∂v/∂x + (∂U/∂x) v (numerator layout) | |

NOTE: The formulas involving the vector-by-vector derivatives ∂g(u)/∂u and ∂f(g)/∂g (whose
outputs are matrices) assume the matrices are laid out consistent with the vector layout, i.e.
numerator-layout matrix when numerator-layout vector and vice versa; otherwise, transpose
the vector-by-vector derivatives.
Scalar-by-matrix identities
Note that exact equivalents of the scalar product rule and chain rule do not exist when
applied to matrix-valued functions of matrices. However, the product rule of this sort does
apply to the differential form (see below), and this is the way to derive many of the identities
below involving the trace function, combined with the fact that the trace function allows
transposing and cyclic permutation, i.e.:

$$\operatorname{tr}(\mathbf{A}) = \operatorname{tr}(\mathbf{A}^T), \qquad \operatorname{tr}(\mathbf{ABC}) = \operatorname{tr}(\mathbf{CAB}) = \operatorname{tr}(\mathbf{BCA}).$$

For example, to compute ∂tr(AXBX^T C)/∂X, write the differential and apply these rules:

$$d\operatorname{tr}(\mathbf{AXBX}^T\mathbf{C}) = \operatorname{tr}(\mathbf{A}\,d\mathbf{X}\,\mathbf{BX}^T\mathbf{C}) + \operatorname{tr}(\mathbf{AXB}\,d\mathbf{X}^T\,\mathbf{C}) = \operatorname{tr}\big((\mathbf{BX}^T\mathbf{CA} + \mathbf{B}^T\mathbf{X}^T\mathbf{A}^T\mathbf{C}^T)\,d\mathbf{X}\big),$$

so that

$$\frac{\partial \operatorname{tr}(\mathbf{AXBX}^T\mathbf{C})}{\partial \mathbf{X}} = \mathbf{A}^T\mathbf{C}^T\mathbf{XB}^T + \mathbf{CAXB} \quad \text{(denominator layout)}.$$

(For the last step, see the Conversion from differential to derivative form section.)
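The derived identity can be confirmed numerically; the sketch below compares the entrywise finite-difference gradient with the denominator-layout closed form:

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Denominator-layout df/dX via central differences."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2*h)
    return G

rng = np.random.default_rng(1)
A, X, B, C = (rng.standard_normal((3, 3)) for _ in range(4))

f = lambda X: np.trace(A @ X @ B @ X.T @ C)
closed_form = A.T @ C.T @ X @ B.T + C @ A @ X @ B   # denominator layout
print(np.allclose(grad_matrix(f, X), closed_form, atol=1e-4))  # True
```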
Identities: scalar-by-matrix

| Condition | Expression | Numerator layout, i.e. by X^T | Denominator layout, i.e. by X |
|---|---|---|---|
| a is not a function of X | ∂a/∂X | 0[nb 2] | 0[nb 2] |
| a is not a function of X, u = u(X) | ∂(au)/∂X | a ∂u/∂X | a ∂u/∂X |
| u = u(X), v = v(X) | ∂(u + v)/∂X | ∂u/∂X + ∂v/∂X | ∂u/∂X + ∂v/∂X |
| u = u(X), v = v(X) | ∂(uv)/∂X | u ∂v/∂X + v ∂u/∂X | u ∂v/∂X + v ∂u/∂X |
| u = u(X) | ∂g(u)/∂X | (∂g(u)/∂u) ∂u/∂X | (∂g(u)/∂u) ∂u/∂X |
| u = u(X) | ∂f(g(u))/∂X | (∂f(g)/∂g)(∂g(u)/∂u) ∂u/∂X | (∂f(g)/∂g)(∂g(u)/∂u) ∂u/∂X |
| U = U(X) | ∂g(U)/∂x_ij | tr((∂g(U)/∂U)(∂U/∂x_ij))[3] | tr((∂g(U)/∂U)(∂U/∂x_ij))[3] (both forms assume numerator layout for ∂U/∂x_ij, i.e. mixed layout if denominator layout for X is being used) |
| a and b are not functions of X | ∂(a^T X b)/∂X | b a^T | a b^T |
| a and b are not functions of X | ∂(a^T X^T b)/∂X | a b^T | b a^T |
| a, b and C are not functions of X | ∂((Xa)^T C (Xb))/∂X | a b^T X^T C^T + b a^T X^T C | C X b a^T + C^T X a b^T |
| a, b and C are not functions of X | ∂((Xa + b)^T C (Xa + b))/∂X | a (Xa + b)^T (C + C^T) | (C + C^T)(Xa + b) a^T |
| | ∂tr(X)/∂X | I | I |
| A is not a function of X | ∂tr(AX)/∂X = ∂tr(XA)/∂X | A[3] | A^T[3] |
| A is not a function of X | ∂tr(AX^T)/∂X = ∂tr(X^T A)/∂X | A^T[3] | A[3] |
| A is not a function of X | ∂tr(X^T A X)/∂X | X^T (A + A^T)[3] | (A + A^T) X[3] |
| A is not a function of X | ∂tr(X^{-1} A)/∂X | -X^{-1} A X^{-1}[3] | -(X^{-1} A X^{-1})^T[3] |
| A, B are not functions of X | ∂tr(AXB)/∂X | B A | A^T B^T |
| A, B, C are not functions of X | ∂tr(AXBX^T C)/∂X | B X^T C A + B^T X^T A^T C^T[3] | A^T C^T X B^T + C A X B[3] |
| n is a positive integer | ∂tr(X^n)/∂X | n X^{n-1}[3] | n (X^{n-1})^T[3] |
| A is not a function of X, n is a positive integer | ∂tr(A X^n)/∂X | Σ_{i=0}^{n-1} X^i A X^{n-i-1}[3] | Σ_{i=0}^{n-1} (X^i A X^{n-i-1})^T[3] |
| | ∂tr(X^T X)/∂X = ∂||X||_F²/∂X | 2 X^T[3] | 2 X[3] |
| a, b are not functions of X | ∂(a^T X^{-1} b)/∂X | -X^{-1} b a^T X^{-1}[3][nb 3] | -X^{-T} a b^T X^{-T}[3][nb 3] |
| g(X) is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. e^X, sin(X), cos(X), ln(X), etc. using a Taylor series); g(x) is the equivalent scalar function, g'(x) is its derivative, and g'(X) is the corresponding matrix function | ∂tr(g(X))/∂X | g'(X)[4] | (g'(X))^T[4] |
| | ∂det(X)/∂X | det(X) X^{-1}[3] | det(X) X^{-T}[3] |
| A, B are not functions of X | ∂det(AXB)/∂X | det(AXB) X^{-1}[3] | det(AXB) X^{-T}[3] |
| n is a positive integer | ∂det(X^n)/∂X | n det(X^n) X^{-1}[3] | n det(X^n) X^{-T}[3] |
| | ∂ln det(X)/∂X | X^{-1}[3] | X^{-T}[3] |
| | ∂det(X^T X)/∂X | 2 det(X^T X) X^+ (see pseudo-inverse)[3] | 2 det(X^T X) (X^+)^T (see pseudo-inverse) |
| | ∂ln det(X^T X)/∂X | 2 X^+ (see pseudo-inverse)[5] | 2 (X^+)^T (see pseudo-inverse) |
| A is not a function of X, X is square and invertible | ∂det(X^T A X)/∂X | 2 det(X^T A X) X^{-1} | 2 det(X^T A X) X^{-T} |
| A is not a function of X, X is non-square, A is symmetric | ∂det(X^T A X)/∂X | 2 det(X^T A X) (X^T A X)^{-1} X^T A | 2 det(X^T A X) A X (X^T A X)^{-1} |
| A is not a function of X, X is non-square, A is non-symmetric | ∂det(X^T A X)/∂X | det(X^T A X) ((X^T A X)^{-1} X^T A + (X^T A^T X)^{-1} X^T A^T) | det(X^T A X) (A X (X^T A X)^{-1} + A^T X (X^T A^T X)^{-1}) |
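Any row of the table can be checked the same way as before; here is the log-determinant row, in denominator layout:

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Denominator-layout df/dX via central differences."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2*h)
    return G

X = np.array([[3.0, 1.0], [1.0, 2.0]])

# d(ln det X)/dX = X^{-T} in denominator layout (X^{-1} in numerator layout).
logdet = lambda X: np.log(np.linalg.det(X))
print(np.allclose(grad_matrix(logdet, X), np.linalg.inv(X).T, atol=1e-5))  # True
```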
Matrix-by-scalar identities
Identities: matrix-by-scalar (results in numerator layout; a denotes a constant scalar)

| Condition | Expression | Result |
|---|---|---|
| U = U(x) | ∂(aU)/∂x | a ∂U/∂x |
| A, B are not functions of x, U = U(x) | ∂(AUB)/∂x | A (∂U/∂x) B |
| U = U(x), V = V(x) | ∂(U + V)/∂x | ∂U/∂x + ∂V/∂x |
| U = U(x), V = V(x) | ∂(UV)/∂x | U (∂V/∂x) + (∂U/∂x) V |
| U = U(x), V = V(x) | ∂(U ⊗ V)/∂x | U ⊗ (∂V/∂x) + (∂U/∂x) ⊗ V |
| U = U(x), V = V(x) | ∂(U ∘ V)/∂x | U ∘ (∂V/∂x) + (∂U/∂x) ∘ V |
| U = U(x) | ∂U^{-1}/∂x | -U^{-1} (∂U/∂x) U^{-1} |
| U = U(x, y) | ∂²U^{-1}/(∂x ∂y) | U^{-1} [ (∂U/∂x) U^{-1} (∂U/∂y) - ∂²U/(∂x ∂y) + (∂U/∂y) U^{-1} (∂U/∂x) ] U^{-1} |
| A is not a function of x, g(X) is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. e^X, sin(X), cos(X), ln(X), etc.); g(x) is the equivalent scalar function, g'(x) is its derivative, and g'(X) is the corresponding matrix function | ∂g(xA)/∂x | A g'(xA) = g'(xA) A |
| A is not a function of x | ∂e^{xA}/∂x | A e^{xA} = e^{xA} A |
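A quick check of the inverse rule ∂U^{-1}/∂x = -U^{-1}(∂U/∂x)U^{-1}, with U(x) a hypothetical matrix-valued function chosen for the example:

```python
import numpy as np

# Hypothetical matrix-valued function of a scalar x, and its entrywise derivative.
U = lambda x: np.array([[2.0 + x, 1.0], [0.0, 1.0 + x**2]])
dU = lambda x: np.array([[1.0, 0.0], [0.0, 2.0*x]])   # dU/dx

x, h = 0.7, 1e-6
numeric = (np.linalg.inv(U(x + h)) - np.linalg.inv(U(x - h))) / (2*h)
closed = -np.linalg.inv(U(x)) @ dU(x) @ np.linalg.inv(U(x))
print(np.allclose(numeric, closed, atol=1e-6))  # True
```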
Scalar-by-scalar identities
With vectors involved

| Condition | Expression | Result |
|---|---|---|
| u = u(x) | ∂g(u)/∂x | (∂g(u)/∂u)(∂u/∂x) (the two factors laid out in opposite layouts, so that the product is a scalar) |
| u = u(x), v = v(x) | ∂(u · v)/∂x | u · ∂v/∂x + (∂u/∂x) · v |

With matrices involved

| Condition | Expression | Result |
|---|---|---|
| U = U(x) | ∂det(U)/∂x | det(U) tr(U^{-1} ∂U/∂x) |
| U = U(x) | ∂ln det(U)/∂x | tr(U^{-1} ∂U/∂x) |
| U = U(x) | ∂²det(U)/∂x² | det(U) [ tr(U^{-1} ∂²U/∂x²) + (tr(U^{-1} ∂U/∂x))² - tr((U^{-1} ∂U/∂x)²) ] |
| U = U(x) | ∂g(U)/∂x | tr((∂g(U)/∂U)(∂U/∂x)) (with ∂g(U)/∂U in numerator layout)[3] |
| A is not a function of x, g(X) is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. e^X, sin(X), cos(X), ln(X), etc.); g(x) is the equivalent scalar function, g'(x) is its derivative, and g'(X) is the corresponding matrix function | ∂tr(g(xA))/∂x | tr(A g'(xA)) |
| A is not a function of x | ∂tr(e^{xA})/∂x | tr(A e^{xA}) |

Identities in differential form

It is often easier to work in differential form and then convert back to normal derivatives. This
works well using the numerator layout.

Differential identities: scalar involving matrix

| Expression | Result |
|---|---|
| d(tr(X)) | tr(dX) |
| d(det(X)) | det(X) tr(X^{-1} dX) = tr(adj(X) dX) |
| d(ln det(X)) | tr(X^{-1} dX) |

Differential identities: matrix

| Condition | Expression | Result |
|---|---|---|
| A is not a function of X | dA | 0 |
| a is not a function of X | d(aX) | a dX |
| | d(X + Y) | dX + dY |
| | d(XY) | (dX) Y + X (dY) |
| | d(X ⊗ Y) | (dX) ⊗ Y + X ⊗ (dY) (Kronecker product) |
| | d(X ∘ Y) | (dX) ∘ Y + X ∘ (dY) (Hadamard product) |
| | d(X^T) | (dX)^T (transpose) |
| | d(X^H) | (dX)^H (conjugate transpose) |
| | d(X^{-1}) | -X^{-1} (dX) X^{-1} |
| n is a positive integer | d(X^n) | Σ_{i=0}^{n-1} X^i (dX) X^{n-i-1} |
| X is diagonalizable, f is differentiable at every eigenvalue | d f(X) | Σ_i Σ_j P_i (dX) P_j f[λ_i, λ_j], where f[λ_i, λ_j] = (f(λ_i) - f(λ_j))/(λ_i - λ_j) if λ_i ≠ λ_j, and f[λ_i, λ_i] = f'(λ_i) |

In the last row, the P_k are the orthogonal projection operators that project onto the k-th eigenvector of X. Q is the matrix of
eigenvectors of X = QΛQ^{-1}, and the λ_i are the eigenvalues. The matrix function f(X)
is defined in terms of the scalar function f(x) for diagonalizable matrices by f(X) = Q f(Λ) Q^{-1},
where f(Λ) = diag(f(λ_1), ..., f(λ_n)).
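The definition f(X) = Q f(Λ) Q^{-1} translates directly into code; a minimal sketch for diagonalizable X, verified here on f(x) = x², where f(X) must equal X·X:

```python
import numpy as np

def matfunc(f, X):
    """Evaluate f(X) = Q f(Lambda) Q^{-1} for a diagonalizable matrix X,
    applying the scalar function f to each eigenvalue."""
    lam, Q = np.linalg.eig(X)
    return Q @ np.diag(f(lam)) @ np.linalg.inv(Q)

X = np.array([[2.0, 1.0], [1.0, 2.0]])                 # symmetric, hence diagonalizable
print(np.allclose(matfunc(lambda t: t**2, X), X @ X))  # True: f(x)=x^2 gives X^2
print(matfunc(np.exp, X))                              # the matrix exponential e^X
```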
Conversion from differential to derivative form

To convert to normal derivative form, first convert the differential to one of the following
canonical forms, and then use these identities (results given in numerator layout):

| Canonical differential form | Equivalent derivative form |
|---|---|
| dy = a dx (scalar y, scalar x) | ∂y/∂x = a |
| dy = a^T dx (scalar y, vector x) | ∂y/∂x = a^T |
| dy = tr(A dX) (scalar y, matrix X) | ∂y/∂X = A |
| dy = a dx (vector y, scalar x) | ∂y/∂x = a |
| dy = A dx (vector y, vector x) | ∂y/∂x = A |
| dY = A dx (matrix Y, scalar x) | ∂Y/∂x = A |
Applications
Matrix differential calculus is used in statistics and econometrics, particularly for the
statistical analysis of multivariate distributions, especially the multivariate normal distribution
and other elliptical distributions.[8][9][10]
It is used in regression analysis to compute, for example, the ordinary least squares
regression formula for the case of multiple explanatory variables.[11] It is also used in random
matrices, statistical moments, local sensitivity and statistical diagnostics.[12][13]
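For instance, the ordinary least squares formula follows from setting a scalar-by-vector derivative to zero: minimizing ||y - Xβ||² gives ∂/∂β = -2X^T(y - Xβ) = 0 in denominator layout, i.e. the normal equations X^T X β = X^T y. A sketch on synthetic data (all names and values below are illustrative):

```python
import numpy as np

# Synthetic regression problem: y = X @ beta_true + noise.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

# Solve the normal equations X^T X beta = X^T y derived above.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [1.0, -2.0, 0.5]
```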
See also
Derivative (generalizations)
Product integral
Ricci calculus
Notes
References
Software
MatrixCalculus.org (http://www.matrix
calculus.org/) , a website for
evaluating matrix calculus expressions
symbolically
NCAlgebra (https://math.ucsd.edu/~n
calg/) , an open-source Mathematica
package that has some matrix calculus
functionality
SymPy supports symbolic matrix
derivatives in its matrix expression
module (https://docs.sympy.org/lates
t/modules/matrices/expressions.htm
l) , as well as symbolic tensor
derivatives in its array expression
module (https://docs.sympy.org/lates
t/modules/tensor/array_expressions.h
tml) .