Tensors in computations
Lek-Heng Lim
Computational and Applied Mathematics Initiative,
University of Chicago, Chicago, IL 60637, USA
E-mail: [email protected]
The notion of a tensor captures three great ideas: equivariance, multilinearity, separability. But trying to be three things at once makes the notion difficult to understand.
We will explain tensors in an accessible and elementary way through the lens of linear
algebra and numerical linear algebra, elucidated with examples from computational
and applied mathematics.
CONTENTS
1 Introduction 1
2 Tensors via transformation rules 11
3 Tensors via multilinearity 41
4 Tensors via tensor products 85
5 Odds and ends 191
References 195
1. Introduction
We have two goals in this article: the first is to explain in detail and in the simplest
possible terms what a tensor is; the second is to discuss the main ways in which
tensors play a role in computations. The two goals are interwoven: what defines
a tensor is also what makes it useful in computations, so it is important to gain
a genuine understanding of tensors. We will take the reader through the three
common definitions of a tensor: as an object that satisfies certain transformation
rules, as a multilinear map, and as an element of a tensor product of vector spaces.
We will explain the motivations behind these definitions, how one definition leads
to the next, and how they all fit together. All three definitions are useful in compu-
tational mathematics but in different ways; we will intersperse our discussions of
each definition with considerations of how it is employed in computations, using
the latter as impetus for the former.
Tensors arise in computations in one of three ways: equivariance under coordin-
a large part of our article; the title of our article also includes the use of tensors
as a tool in the analysis of algorithms (e.g. self-concordance in polynomial-time
convergence), providing intractability guarantees (e.g. cryptographic multilinear
maps), reducing complexity of models (e.g. equivariant neural networks), quanti-
fying computational complexity (e.g. exponent of matrix multiplication) and in yet
other ways. It would be prudent to add that while tensors are an essential ingredient
in the aforementioned computational tools, they are rarely the only ingredient: it is
usually in combination with other concepts and techniques – calculus of variations,
Gauss quadrature, multiresolution analysis, the power method, reproducing kernel
Hilbert spaces, singular value decomposition, etc. – that they become potent tools.
The article is written with accessibility and simplicity in mind. Our exposition
assumes only standard undergraduate linear algebra: vector spaces, linear maps,
dual spaces, change of basis, etc. Knowledge of numerical linear algebra is a plus
since this article mainly targets computational mathematicians. As physicists have
played an outsize role in the development of tensors, it is inevitable that motivations
for certain aspects, explanations for certain definitions, etc., are best understood
from a physicist’s perspective and to this end we will include some discussions
to provide context. When discussing applications or the physics origin of certain
ideas, it is inevitable that we have to assume slightly more background, but we
strive to be self-contained and limit ourselves to the most pertinent basic ideas. In
particular, it is not our objective to provide a comprehensive survey of all things
tensorial. We focus on the foundational and fundamental; results from current
research make an appearance only if they illuminate these basic aspects. We have
avoided a formal textbook-style treatment and opted for a more casual exposition
that hopefully makes for easier reading (and in part to keep open the possibility of
a future book project).
1.1. Overview
There are essentially three ways to define a tensor, reflecting the chronological
evolution of the notion through the last 140 years or so:
➀ as a multi-indexed object that satisfies certain transformation rules,
➁ as a multilinear map,
➂ as an element of a tensor product of vector spaces.
The key to the coordinate-dependent definition in ➀ is the emphasized part: the
transformation rules are what define a tensor when one chooses to view it as a
multi-indexed object, akin to the group laws in the definition of a group. The
more modern coordinate-free definitions ➁ and ➂ automatically encode these
transformation rules. The multi-indexed object, which could be a hypermatrix, a
polynomial, a differential form, etc., is then a coordinate representation of either ➁
or ➂ with respect to a choice of bases, and the corresponding transformation rule
is a change-of-basis theorem.
Take the following case familiar to anyone who has studied linear algebra. Let V
matrix representation 𝐴 ′ ∈ R𝑚×𝑛 . The change-of-basis theorem states that the two
matrices are related via the transformation rule 𝐴 ′ = 𝑋 −1 𝐴𝑌 , where 𝑋 and 𝑌 are
the change-of-basis matrices on W and V respectively. Definition ➀ is essentially
an attempt to define a linear map using its change-of-basis theorem – possible but
awkward. The reason for such an awkward definition is one of historical necessity:
definition ➀ had come before any of the modern notions we now take for granted in
linear algebra – vector space, dual space, linear map, bases, etc. – were invented.
We stress that if one chooses to work with definition ➀, then it is the transform-
ation rule/change-of-basis theorem and not the multi-indexed object that defines a
tensor. For example, depending on whether the transformation rule takes 𝐴 ∈ R𝑛×𝑛
to 𝑋 𝐴𝑌 −1 , 𝑋 𝐴𝑌 T , 𝑋 −1 𝐴𝑌 −T , 𝑋 𝐴𝑋 −1 , 𝑋 𝐴𝑋 T or 𝑋 −1 𝐴𝑋 −T , we obtain different
tensors with entirely different properties. Also, while we did not elaborate on the
change-of-basis matrices 𝑋 and 𝑌 , they play an important role in the transformation
rule. If V and W are vector spaces without any additional structures, then 𝑋 and
𝑌 are just required to be invertible; but if V and W are, say, norm or inner product
or symplectic vector spaces, then 𝑋 and 𝑌 would need to preserve these structures
too. More importantly, every notion we define, every property we study for any
tensor – rank, determinant, norm, product, eigenvalue, eigenvector, singular value,
singular vector, positive definiteness, linear system, least-squares problems, eigen-
value problems, etc. – must conform to the respective transformation rule. This is
a point that is often lost; it is not uncommon to find mentions of ‘tensor such and
such’ in recent literature that makes no sense for a tensor.
As we will see, the two preceding paragraphs extend in a straightforward way to
order-𝑑 tensors (henceforth 𝑑-tensors) of contravariant order 𝑝 and covariant order
𝑑 − 𝑝 for any integers 𝑑 ≥ 𝑝 ≥ 0, of which a linear map corresponds to the case
𝑝 = 1, 𝑑 = 2.
While discussing tensors, we will also discuss their role in computations.
The most salient applications are often variations of the familiar separation-of-
variables technique that one encounters when solving ordinary and partial differ-
ential equations, integral equations or even integro-differential equations. Here
the relevant perspective of a tensor is that in definition ➂; we will see that
𝐿 2 (𝑋1 ) ⊗ 𝐿 2 (𝑋2 ) ⊗ · · · ⊗ 𝐿 2 (𝑋𝑑 ) = 𝐿 2 (𝑋1 × 𝑋2 × · · · × 𝑋𝑑 ), that is, multivariate
or, equivalently,
\[ f = \sum_{i,j,k=1}^{p,q,r} \varphi_{ij} \otimes \psi_{jk} \otimes \theta_{ki}. \qquad (1.4) \]
This is the celebrated matrix product state in the tensor network literature. Tech-
niques such as DMRG simplify computations of eigenfunctions of Schrödinger
operators by imposing such structures on the desired eigenfunction. Note that
(1.4), like (1.3), is also a decomposition into a sum of separable functions but the
indices are captured by the following graph:
(Figure omitted: a triangle whose vertices correspond to the variables 𝑥, 𝑦, 𝑧 and whose edges carry the shared indices 𝑖, 𝑗, 𝑘 of (1.4).)
2 Research papers that describe AlphaFold 2 are still unavailable at the time of writing but
the fact that it uses an equivariant neural network (Fuchs, Worrall, Fischer and Welling
2020) may be found in Jumper’s slides at https://predictioncenter.org/casp14/doc/presentations/
2020_12_01_TS_predictor_AlphaFold2.pdf.
focus on tensors in infinite-dimensional spaces, and with that come the many
complications that can be avoided in a finite-dimensional setting. None of the
aforementioned references are easy reading, being aimed more at specialists than
beginners, with little discussion of elementary questions.
Our article takes a middle road, giving equal footing to both algebraic and analytic
aspects, that is, we discuss tensors over vector spaces (algebraic) as well as tensors
over norm or inner product spaces (analytic), and we explain why they are different
and how they are related. As we target readers whose main interests are in one way
or another related to computations – numerical linear algebra, numerical PDEs,
optimization, scientific computing, theoretical computer science, machine learning,
etc. – and such a typical target reader would tend to be far more conversant with
analysis than with algebra, the manner in which we approach algebraic and analytic
topics is calibrated accordingly. We devote considerable length to explaining
and motivating algebraic notions such as modules or commutative diagrams, but
tend to gloss over analytic ones such as distributions or compact operators. We
assume some passing familiarity with computational topics such as quadrature
or von Neumann stability analysis but none with ‘purer’ ones such as formal
power series or Littlewood–Richardson coefficients. We also draw almost all our
examples from topics close to the heart of computational mathematicians: kernel
SVMs, Krylov subspaces, Hamilton equations, multipole expansions, perturbation
theory, quantum chemistry and physics, semidefinite programming, wavelets, etc.
As a result almost all our examples tend to have an analytic bent, although there
are also exceptions such as cryptographic multilinear maps or equivariant neural
networks, which are more algebraic.
Our article is intended to be completely elementary. Modern studies of tensors
largely fall into four main areas of mathematics: algebraic geometry, differential
geometry, representation theory and functional analysis. We have avoided the first
three almost entirely and have only touched upon the most rudimentary aspects
of the last. The goal is to show that even without bringing in highbrow subjects,
there is still plenty that could be said about tensors; all we really need is standard
undergraduate linear algebra and some multivariate calculus – vector spaces, linear
maps, change of basis, direct sums and products, dual spaces, derivatives as linear
maps, etc. Among this minimal list of prerequisites, one would find that dual
spaces tend to be a somewhat common blind spot of our target readership, primarily
because they tend to work over inner product spaces where duality is usually a non-
issue. Unfortunately, duality cannot be avoided if one hopes to gain any reasonably
complete understanding of tensors, as tensors defined over inner product spaces
represent a relatively small corner of the subject. A working familiarity with dual
spaces is a must: how a dual basis behaves under a change of basis, how to define
the transpose and adjoint of abstract linear maps, how taking the dual changes direct
sums into direct products, etc. Fortunately these are all straightforward, and we
will remind readers of the relevant facts as and when they are needed. Although we
will discuss tensors over infinite-dimensional spaces, modules, bundles, operator
3 No doubt a very convenient shorthand, and its avoidance of the summation symbol ∑ a big typesetting advantage in the pre-TeX era.
embedded in the model as a special case. If one could efficiently solve this problem,
then the range of earth-shattering things that would follow is limitless.4 Therefore,
these claims of extraordinariness are hardly surprising given such an enormous
caveat; they are simply consequences of the fact that if one could efficiently solve
any NP-hard problem, one could efficiently solve all NP-complete problems.
Nearly all the material in this article is classical. It could have been written
twenty years ago. We would not have been able to mention the resolution of the sal-
mon conjecture or discuss applications such as AlphaFold, the matrix multiplication
exponent and complexity of integer multiplication would be slightly higher, and
some of the books cited would be in their earlier editions. But 95%, if not more, of
the content would have remained intact. We limit our discussion in this article to
results that are by-and-large rigorous, usually in the mathematical sense of having a
proof but occasionally in the sense of having extensive predictions consistent with
results of scientific experiments (like the effectiveness of DMRG). While this art-
icle contains no original research, we would like to think that it offers abundant new
insights on existing topics: our treatment of multipole expansions in Example 4.45,
separation of variables in Examples 4.32–4.35, stress tensors in Example 4.8, our
interpretation of the universal factorization property in Section 4.3, discussions of
the various forms of higher-order derivatives in Examples 3.2, 3.4, 4.29, the way
we have presented and motivated the tensor transformation rules in Section 2, etc.,
have never before appeared elsewhere to the best of our knowledge. In fact, we
4 In the words of a leading authority (Fortnow 2013, pp. ix and 11), ‘society as we know it would
change dramatically, with immediate, amazing advances in medicine, science, and entertainment
and the automation of nearly every human task,’ and ‘the world will change in ways that will make
the Internet seem like a footnote in history.’
have made it a point to not reproduce anything verbatim from existing literature;
even standard materials are given a fresh take.
While the idea of a tensor had appeared before in works of Cauchy and Riemann
(Conrad 2018, Yau 2020) and also around the same time in Ricci and Levi-Civita
(1900), this is believed to be the earliest appearance of the word ‘tensor’. Although
we will refer to the quote above as ‘Voigt’s definition’, it is not a direct translation
of a specific sentence from Voigt (1898) but a paraphrase attributed to Voigt (1898)
in the Oxford English Dictionary. Voigt’s definition is essentially the one adopted
in all early textbooks on tensors such as Brand (1947), Hay (1954), Lovelock and
Rund (1975), McConnell (1957), Michal (1947), Schouten (1951), Spain (1960),
Synge and Schild (1978) and Wrede (1963).5 This is not an easy definition to work
with and likely contributed to the reputation of tensors being a tough subject to
master. Famously, Einstein struggled with tensors (Earman and Glymour 1978)
and the definition he had to dabble with would invariably have been definition ➀.
Nevertheless, we should be reminded that linear algebra as we know it today was
an obscure art in its infancy in the 1900s when Einstein was learning about tensors.
Those trying to learn about tensors in modern times enjoy the benefit of a hundred
years of pedagogical progress. By building upon concepts such as vector spaces,
linear transformations, change of basis, etc., that we take for granted today but were
not readily accessible a century ago, the task of explaining tensors is significantly
simplified.
In retrospect, the main issue with definition ➀ is that it is a ‘physicist’s definition’,
that is, it attempts to define a quantity by describing the change-of-coordinates rules
that the quantity must satisfy without specifying the quantity itself. This approach
may be entirely natural in physics where one is interested in questions such as
‘Is stress a tensor?’ or ‘Is electromagnetic field strength a tensor?’, that is, the
definition is always applied in a way where some physical quantity such as stress
takes the place of the unspecified quantity, but it makes for an awkward definition in
mathematics. The modern definitions ➁ and ➂ remedy this by stating unequivocally
what this unspecified quantity is.
for any orthogonal 𝑋 ∈ R𝑚×𝑚 and invertible 𝑌 ∈ R𝑛×𝑛 . Unsurprisingly, the normal
equation 𝐴T 𝐴𝑣 = 𝐴T 𝑏 has the same property:
(𝑋 𝐴𝑌 −1 )T (𝑋 𝐴𝑌 −1 )𝑌 𝑣 = (𝑋 𝐴𝑌 −1 )T 𝑋 𝑏.
The total least-squares problem is defined by
min{‖𝐸‖² + ‖𝑟‖² : (𝐴 + 𝐸)𝑣 = 𝑏 + 𝑟}
= min{‖𝑋𝐸𝑌 T‖² + ‖𝑋𝑟‖² : (𝑋𝐴𝑌 T + 𝑋𝐸𝑌 T)𝑌𝑣 = 𝑋𝑏 + 𝑋𝑟}
for any orthogonal 𝑋 ∈ R𝑚×𝑚 and orthogonal 𝑌 ∈ R𝑛×𝑛 . Here the minimization is
over 𝐸 ∈ R𝑚×𝑛 , 𝑟 ∈ R𝑚 , and the constraint is interpreted as the linear system being
consistent. Both ordinary and total least-squares are defined on tensors – minimum
values transform as invariant 0-tensors, 𝑣, 𝑏, 𝑟 as contravariant 1-tensors and 𝐴, 𝐸
as mixed 2-tensors – but the 2-tensors involved are different as 𝑌 is not required to
be orthogonal in (2.3).
Rank, determinant, norm. Let 𝐴 ∈ R𝑚×𝑛 . Then
rank(𝑋𝐴𝑌 −1) = rank(𝐴), det(𝑋𝐴𝑌 −1) = det(𝐴), ‖𝑋𝐴𝑌 −1‖ = ‖𝐴‖,
where 𝑋 and 𝑌 are, respectively, invertible, special linear or orthogonal matrices.
Here k · k may denote either the spectral, nuclear or Frobenius norm and the
determinant is regarded as identically zero whenever 𝑚 ≠ 𝑛. Rank, determinant
and norm are defined on mixed 2-tensors, special linear mixed 2-tensors and
Cartesian mixed 2-tensors respectively.
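These invariance claims are easy to check numerically. Below is a minimal sketch assuming NumPy; the helper random_rotation is our own. We draw X and Y from SO(4), which lies in GL(4), SL(4) and O(4) simultaneously, so a single choice of change-of-basis matrices exercises all three statements at once.

    import numpy as np

    def random_rotation(n, rng):
        # a random element of SO(n): orthogonal with determinant one, hence also special linear
        Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
        if np.linalg.det(Q) < 0:
            Q[:, 0] *= -1
        return Q

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    X, Y = random_rotation(4, rng), random_rotation(4, rng)
    B = X @ A @ np.linalg.inv(Y)        # mixed 2-tensor rule  A' = X A Y^{-1}

    print(np.linalg.matrix_rank(B) == np.linalg.matrix_rank(A))   # rank: needs X, Y invertible
    print(np.isclose(np.linalg.det(B), np.linalg.det(A)))         # determinant: needs X, Y special linear
    print(np.isclose(np.linalg.norm(B), np.linalg.norm(A)))       # Frobenius norm: needs X, Y orthogonal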
The point that we hope to make with these familiar examples is the following.
The most fundamental and important concepts, equations, properties and problems
in linear algebra and numerical linear algebra – notions that extend far beyond linear
algebra into other areas of mathematics, science and engineering – all conform to
tensor transformation rules. These transformation rules are not merely accessories
to the definition of a tensor: they are the very crux of it and are what make tensors
useful.
On the one hand, it is intuitively clear what the transformation rules are in
the examples on pages 13–14: They are transformations that preserve either the
form of an equation (e.g. if we write 𝐴 ′ = 𝑋 𝐴𝑋 −1 , 𝑣 ′ = 𝑋𝑣, 𝜆 ′ = 𝜆, then (2.1)
becomes 𝐴 ′𝑣 ′ = 𝜆 ′ 𝑣 ′), or the value of a quantity such as rank/determinant/norm,
or a property such as positive definiteness. In these examples, the ‘multi-indexed
object’ in definition ➀ can be a matrix like 𝐴 or 𝐸, a vector like 𝑢, 𝑣, 𝑏, 𝑟 or a scalar
like 𝜆; the coordinates of the matrix or vector, i.e. 𝑎𝑖 𝑗 and 𝑣 𝑖 , are sometimes called
the ‘components of the tensor’. The matrices 𝑋, 𝑌 play a different role in these
transformation rules and should be distinguished from the multi-indexed objects;
for reasons that will be explained in Section 3.1, we call them change-of-basis
matrices.
On the other hand, these examples also show why the transformation rules can
be confusing: they are ambiguous in multiple ways.
(a) The transformation rules can take several different forms. For example, 1-
tensors may transform as
𝑣 ′ = 𝑋𝑣 or 𝑣 ′ = 𝑋 −T 𝑣,
2-tensors may transform as
𝐴 ′ = 𝑋 𝐴𝑌 −1 , 𝐴 ′ = 𝑋 𝐴𝑌 T , 𝐴 ′ = 𝑋 𝐴𝑋 −1 , 𝐴 ′ = 𝑋 𝐴𝑋 T ,
or yet other possibilities we have not discussed.
(b) The change-of-basis matrices in these transformation rules may also take
several different forms, most commonly invertible or orthogonal. In the
examples, the change-of-basis matrices may be a single matrix6
𝑋 ∈ GL(𝑛), SL(𝑛), O(𝑛),
or a pair of them
(𝑋, 𝑌 ) ∈ GL(𝑚) × GL(𝑛), SL(𝑚) × SL(𝑛), O(𝑚) × O(𝑛),
or, as we saw in the case of ordinary least-squares (2.3), (𝑋, 𝑌 ) ∈ O(𝑚) ×
GL(𝑛). There are yet other possibilities for change-of-basis matrices we have
not discussed, such as Lorentz, symplectic, unitary, etc.
(c) Yet another ambiguity on top of (a) is that the roles of 𝑣 and 𝑣 ′ or 𝐴 and 𝐴 ′
are sometimes swapped and the transformation rules written as
𝑣 ′ = 𝑋 −1 𝑣, 𝑣′ = 𝑋 T𝑣
6 For those unfamiliar with this matrix group notation, it will be defined in (2.14).
or
𝐴 ′ = 𝑋 −1 𝐴𝑌 , 𝐴 ′ = 𝑋 −1 𝐴𝑌 −T , 𝐴 ′ = 𝑋 −1 𝐴𝑋, 𝐴 ′ = 𝑋 −1 𝐴𝑋 −T
respectively. Note that the transformation rules here and those in (a) are all
but identical: the only difference is in whether we label the multi-indexed
object on the left or that on the right with a prime.
We may partly resolve the ambiguity in (a) by introducing covariance and contravariance type: tensors with transformation rules of the form 𝐴 ′ = 𝑋 𝐴𝑋 T are contravariant 2-tensors or (2, 0)-tensors; those of the form 𝐴 ′ = 𝑋 𝐴𝑋 −1 are mixed 2-tensors or (1, 1)-tensors; those of the form 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 are covariant 2-tensors or (0, 2)-tensors. Invariance, covariance and contravariance are all special cases of equivariance that we will discuss later. Nevertheless, we are still unable to distinguish between 𝐴 ′ = 𝑋 𝐴𝑋 T and 𝐴 ′ = 𝑋 𝐴𝑌 T : both are legitimate transformation rules for contravariant 2-tensors.
These ambiguities (a), (b), (c) are the result of a defect in definition ➀. One ought
to be asking: What is the tensor in definition ➀? The answer is that it is actually
missing from the definition. The multi-indexed object 𝐴 represents the tensor
whereas the transformation rules on 𝐴 defines the tensor but the tensor itself has
been left unspecified. This is a key reason why definition ➀ has been so confusing
to early learners. Fortunately, it is easily remedied by definition ➁ or ➂. We should,
however, bear in mind that definition ➀ predated modern notions of vector spaces
and linear maps, which are necessary for definitions ➁ and ➂.
When we introduce definitions ➁ or ➂, we will see that the ambiguity in (a)
is just due to different transformation rules for different tensors. Getting ahead
of ourselves, for vector spaces V and W, the transformation rule 𝐴 ′ = 𝑋 𝐴𝑋 −1
applies to tensors in V ⊗ V∗ whereas 𝐴 ′ = 𝑋 𝐴𝑌 −1 applies to those in V ⊗ W∗ ;
𝐴 ′ = 𝑋 𝐴𝑋 T and 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 apply to tensors in V ⊗ V and V∗ ⊗ V∗ respectively.
The matrices 𝐴 and 𝐴 ′ are representations of a tensor with respect to two different
bases, and the ambiguity in (c) is just a result of which basis is regarded as the ‘old’
basis and which as the ‘new’ one.
The ambiguity in (b) is also easily resolved with definitions ➁ or ➂. The matrices
𝑋 and 𝑌 are change-of-basis matrices on V and W and are always invertible, but they
also have to preserve additional structures on V and W such as inner products or
norms. For example, singular values and singular vectors are only defined for inner
product spaces and so we require the matrices 𝑋 and 𝑌 to preserve inner products;
for the Euclidean inner product, this simply means that 𝑋 and 𝑌 are orthogonal
matrices. Tensors with transformation rules involving orthogonal matrices are
sometimes called ‘Cartesian tensors’.
These examples on pages 13–14 illustrate why the transformation rules in defin-
ition ➀ are as crucial in mathematics as they are in physics. Unlike in physics, we
do not use the transformation rules to check whether a physical quantity such as
stress or strain is a tensor; instead we use them to ascertain whether an equation
for 𝑑 = 2, it reduces to
(𝑋, 𝑌 ) · 𝐴 = 𝑋 𝐴𝑌 T .
For now, we will let the hypermatrix 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 be the multi-indexed object in
definition ➀ to keep things simple; we will later see that this multi-indexed object
does not need to be a hypermatrix.
Let 𝑋1 ∈ GL(𝑛1 ), 𝑋2 ∈ GL(𝑛2 ), . . . , 𝑋𝑑 ∈ GL(𝑛𝑑 ). The covariant 𝑑-tensor
transformation rule is
𝐴 ′ = (𝑋1T , 𝑋2T , . . . , 𝑋𝑑T ) · 𝐴. (2.7)
The contravariant 𝑑-tensor transformation rule is
𝐴 ′ = (𝑋1−1 , 𝑋2−1 , . . . , 𝑋𝑑−1 ) · 𝐴. (2.8)
The mixed 𝑑-tensor transformation rule is7
\[ A' = (X_1^{-1}, \ldots, X_p^{-1}, X_{p+1}^{T}, \ldots, X_d^{T}) \cdot A. \qquad (2.9) \]
We say that the transformation rule in (2.9) is of type or valence (𝑝, 𝑑 − 𝑝) or, more
verbosely, of contravariant order 𝑝 and covariant order 𝑑 − 𝑝. As such, covariance
is synonymous with type (0, 𝑑) and contravariance with type (𝑑, 0).
For 𝑑 = 1, the hypermatrix is just 𝑎 ∈ R𝑛 . If it transforms as 𝑎 ′ = 𝑋 −1 𝑎,
then it is the coordinate representation of a contravariant 1-tensor or contravariant
vector; if it transforms as 𝑎 ′ = 𝑋 T 𝑎, then it is the coordinate representation of
a covariant 1-tensor or covariant vector. For 𝑑 = 2, the hypermatrix is just
a matrix 𝐴 ∈ R𝑚×𝑛 ; writing 𝑋1 = 𝑋, 𝑋2 = 𝑌 , the transformations in (2.7),
(2.8), (2.9) become 𝐴 ′ = 𝑋 T 𝐴𝑌 , 𝐴 ′ = 𝑋 −1 𝐴𝑌 −T , 𝐴 ′ = 𝑋 −1 𝐴𝑌 , which are the
transformation rules for a covariant, contravariant, mixed 2-tensor respectively.
These look different from the transformation rules in the examples on pages 13–14
as a result of the ambiguity (c) on page 15, which we elaborate below.
Clearly, the equalities in the middle and right columns below only differ in which
side we label with a prime but are otherwise identical:
contravariant 1-tensor 𝑎 ′ = 𝑋 −1 𝑎, 𝑎 ′ = 𝑋𝑎,
covariant 1-tensor 𝑎 ′ = 𝑋 T 𝑎, 𝑎 ′ = 𝑋 −T 𝑎,
contravariant 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑌 −T , 𝐴 ′ = 𝑋 𝐴𝑌 T ,
covariant 2-tensor 𝐴 ′ = 𝑋 T 𝐴𝑌 , 𝐴 ′ = 𝑋 −T 𝐴𝑌 −1 ,
mixed 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑌 , 𝐴 ′ = 𝑋 𝐴𝑌 −1 .
We will see in Section 3.1 after introducing definition ➁ that these transformation
rules come from the change-of-basis theorems for vectors, linear functionals, dyads,
7 We state it in this form for simplicity. For example, we do not distinguish between the 3-tensor
transformation rules 𝐴 ′ = (𝑋 −1 , 𝑌 T , 𝑍 T ) · 𝐴, 𝐴 ′ = (𝑋 T , 𝑌 −1 , 𝑍 T ) · 𝐴 and 𝐴 ′ = (𝑋 T , 𝑌 T , 𝑍 −1 ) · 𝐴.
All three are of type (1, 2).
bilinear functionals and linear operators, respectively, with 𝑋 and 𝑌 the change-of-
basis matrices. The versions in the middle and right column differ in terms of how
we write our change-of-basis theorem. For example, take the standard change-of-
basis theorem in a vector space. Do we write 𝑎new = 𝑋 −1 𝑎old or 𝑎old = 𝑋𝑎new ?
Observe, however, that there is no repetition in either column: the transformation
rule, whether in the middle or right column, uniquely identifies the type of tensor.
It is not possible to confuse, say, a contravariant 2-tensor with a covariant 2-tensor
just because we use the transformation rule in the middle column to describe one
and the transformation rule in the right column to describe the other.
The version in the middle column is consistent with the names ‘covariance’
and ‘contravariance’, which are based on whether the hypermatrix ‘co-varies’, i.e.
transforms with the same 𝑋, or ‘contra-varies’, i.e. transforms with its inverse 𝑋 −1 .
This is why we have stated our transformation rules in (2.7), (2.8), (2.9) to be
consistent with those in the middle column. But there are also occasions, as in the
examples on pages 13–14, when it is more natural to express the transformation
rules as those in the right column.
When 𝑛1 = 𝑛2 = · · · = 𝑛𝑑 = 𝑛, the transformation rules in (2.7), (2.8), (2.9) may
take on a different form with a single change-of-basis matrix 𝑋 ∈ GL(𝑛) as opposed
to 𝑑 of them. In this case the hypermatrix 𝐴 ∈ R𝑛×···×𝑛 is hypercubical, i.e. the
higher-order equivalent of a square matrix, and the covariant tensor transformation
rule is
𝐴 ′ = (𝑋 T , 𝑋 T , . . . , 𝑋 T ) · 𝐴, (2.10)
the contravariant tensor transformation rule is
𝐴 ′ = (𝑋 −1 , 𝑋 −1 , . . . , 𝑋 −1 ) · 𝐴, (2.11)
and the mixed tensor transformation rule is
𝐴 ′ = (𝑋 −1 , . . . , 𝑋 −1 , 𝑋 T , . . . , 𝑋 T ) · 𝐴. (2.12)
Again, we will see in Section 3.1 after introducing definition ➁ that the difference
between, say, (2.7) and (2.10) is that the former expresses the change-of-basis
theorem for a multilinear map 𝑓 : V1 × · · · × V𝑑 → R whereas the latter is for a
multilinear map 𝑓 : V × · · · × V → R.
For 𝑑 = 2, we have 𝐴 ∈ R𝑛×𝑛 , and each transformation rule again takes two
different forms that are identical in substance:
covariant 2-tensor 𝐴 ′ = 𝑋 T 𝐴𝑋, 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 ,
contravariant 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑋 −T , 𝐴 ′ = 𝑋 𝐴𝑋 T ,
mixed 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑋, 𝐴 ′ = 𝑋 𝐴𝑋 −1 .
Note that either form uniquely identifies the tensor type.
Example 2.1 (covariance versus contravariance). The three transformation rules
for 2-tensors are 𝐴 ′ = 𝑋 𝐴𝑋 T , 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 , 𝐴 ′ = 𝑋 𝐴𝑋 −1 . We know from linear
algebra that there is a vast difference between the first two (congruence) and the
last one (similarity). For example, eigenvalues and eigenvectors are defined for
mixed 2-tensors by virtue of (2.1) but undefined for covariant or contravariant
2-tensors since the eigenvalue/eigenvector equation 𝐴𝑣 = 𝜆𝑣 is incompatible with
the first two transformation rules. While both the contravariant and covariant
transformation rules describe congruence of matrices, the difference between them
is best seen to be the difference between the quadratic form and the second-order
partial differential operator:
\[ \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, v_i v_j \qquad\text{and}\qquad \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, \frac{\partial^2}{\partial v_i\, \partial v_j}. \]
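A small numerical experiment, a sketch assuming NumPy with randomly generated matrices, makes the gulf between similarity and congruence in this example concrete: eigenvalues survive the mixed 2-tensor rule but not the congruence rule unless X happens to be orthogonal.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 3))
    X = rng.standard_normal((3, 3))            # generic invertible matrix, not orthogonal

    similar   = X @ A @ np.linalg.inv(X)       # mixed 2-tensor rule
    congruent = X @ A @ X.T                    # congruence, as for covariant/contravariant 2-tensors

    eig = lambda M: np.sort_complex(np.linalg.eigvals(M))
    print(np.allclose(eig(similar), eig(A)))       # True: eigenvalues are defined for mixed 2-tensors
    print(np.allclose(eig(congruent), eig(A)))     # False in general: congruence does not preserve them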
Evidently they differ from the tensor transformation rule (2.12) by a scalar factor.
The power 𝑝 ∈ Z is called the weight with 𝑝 = 0 giving us (2.12). An alternative
name for a pseudotensor is an ‘axial tensor’ and in this case a true tensor is called
a ‘polar tensor’ (Hartmann 1984, Section 4). Note that the transformation rules in
(2.16) are variations of (2.12). While it is conceivable to define similar variations
of (2.9) with det(𝑋1 · · · 𝑋𝑑 ) = det 𝑋1 · · · det 𝑋𝑑 in place of det 𝑋, we are unable to
find a reference for this and would therefore refrain from stating it formally.
A word about notation. Up to this point we have introduced only one notion that
is not from linear algebra, namely (2.6) for multilinear matrix multiplication; this
notion dates back to at least Hitchcock (1927) and the notation is standard too.
We denote it so that (2.7), (2.8), (2.9) reduce to the standard notation 𝑔 · 𝑥 for a
group element 𝑔 ∈ 𝐺 acting on an element 𝑥 of a 𝐺-set. Here our group 𝐺 is
a product of other groups 𝐺 = 𝐺 1 × · · · × 𝐺 𝑑 , and so an element takes the form
𝑔 = (𝑔1 , . . . , 𝑔𝑑 ). We made a conscious decision to cast everything in this article in
terms of left action so as not to introduce another potential source of confusion. We
could have introduced a right multilinear matrix multiplication of 𝐴 ∈ R𝑛1 ×···×𝑛𝑑
by matrices 𝑋 ∈ R𝑛1 ×𝑚1 , 𝑌 ∈ R𝑛2 ×𝑚2 , . . . , 𝑍 ∈ R𝑛𝑑 ×𝑚𝑑 ; note that the dimensions
of these matrices are now the transposes of those in (2.6),
𝐴 · (𝑋, 𝑌 , . . . , 𝑍) = 𝐵,
with 𝐵 ∈ R𝑚1 ×···×𝑚𝑑 given by
\[ b_{i_1 \cdots i_d} = \sum_{j_1=1}^{n_1} \sum_{j_2=1}^{n_2} \cdots \sum_{j_d=1}^{n_d} a_{j_1 \cdots j_d}\, x_{j_1 i_1} y_{j_2 i_2} \cdots z_{j_d i_d}. \]
This would have allowed us to denote (2.7) and (2.9) without transposes, with the
latter in a two-sided product. Nevertheless, we do not think it is worth the trouble.
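For readers who wish to experiment, the left multilinear matrix multiplication (2.6) is a single contraction on a hypermatrix stored as a d-dimensional array. The following is a minimal sketch assuming NumPy; the name multilinear_mult is ours, and the einsum line is just (2.6) written out index-wise for d = 3.

    import numpy as np

    def multilinear_mult(matrices, A):
        # left multilinear matrix multiplication (X, Y, ..., Z) . A of (2.6):
        # contract the k-th matrix against the k-th index of the hypermatrix A
        B = A
        for k, X in enumerate(matrices):
            B = np.tensordot(X, B, axes=([1], [k]))   # new index appears in front
            B = np.moveaxis(B, 0, k)                  # move it back to position k
        return B

    rng = np.random.default_rng(2)
    A = rng.standard_normal((2, 3, 4))
    X = rng.standard_normal((5, 2))
    Y = rng.standard_normal((6, 3))
    Z = rng.standard_normal((7, 4))

    B = multilinear_mult((X, Y, Z), A)                # B has shape (5, 6, 7)
    B2 = np.einsum('ia,jb,kc,abc->ijk', X, Y, Z, A)   # (2.6) written index-wise
    print(np.allclose(B, B2))                         # True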
not add two matrices representing two tensors of different types because they sat-
isfy different transformation rules; for example, if 𝐴 ∈ R𝑛×𝑛 represents a mixed
2-tensor and 𝐵 ∈ R𝑛×𝑛 a covariant 2-tensor, then 𝐴 + 𝐵 is rarely if ever meaningful.
Example 2.8 (‘identity tensors’). The identity matrix 𝐼 ∈ R3×3 is of course
\[ I = \sum_{i=1}^{3} e_i \otimes e_i \in \mathbb{R}^{3\times 3}, \qquad (2.17) \]
where 𝑒1 , 𝑒2 , 𝑒3 ∈ R3 are the standard basis vectors. What should be the extension
of this notion to 𝑑-tensors? Take 𝑑 = 3 for illustration. It would appear that
of this notion to 𝑑-tensors? Take 𝑑 = 3 for illustration. It would appear that
\[ A = \sum_{i=1}^{3} e_i \otimes e_i \otimes e_i \in \mathbb{R}^{3\times 3\times 3} \qquad (2.18) \]
is the obvious generalization but, as with the Hadamard product, obviousness does
not necessarily imply correctness when it comes to matters tensorial. One needs
to check the transformation rules. Note that (2.17) is independent of the choice
of orthonormal basis: we obtain the same matrix with any orthonormal basis
𝑞 1 , 𝑞 2 , 𝑞 3 ∈ R3 , a consequence of
(𝑄, 𝑄) · 𝐼 = 𝑄𝐼𝑄 T = 𝐼
for any 𝑄 ∈ O(3), that is, the identity matrix is well-defined as a Cartesian tensor.
On the other hand the hypermatrix 𝐴 in (2.18) does not have this property. One
may show that up to scalar multiples, 𝑀 = 𝐼 is the unique matrix satisfying
(𝑄, 𝑄) · 𝑀 = 𝑀 (2.19)
for any 𝑄 ∈ O(3), a property known as isotropic. An isotropic 3-tensor would then
be one that satisfies
(𝑄, 𝑄, 𝑄) · 𝑇 = 𝑇 . (2.20)
Up to scalar multiples, (2.20) has a unique solution given by the hypermatrix
\[ J = \sum_{i=1}^{3}\sum_{j=1}^{3}\sum_{k=1}^{3} \varepsilon_{ijk}\, e_i \otimes e_j \otimes e_k \in \mathbb{R}^{3\times 3\times 3}, \]
and 𝑑 ≤ 8 in Kearsley and Fong (1975), and studied for arbitrary values of 𝑛 and 𝑑
in Weyl (1997). From a tensorial perspective, isotropic tensors, not a ‘hypermatrix
with ones on the diagonal’ like (2.18), extend the notion of an identity matrix.
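The claims of Example 2.8 can be tested numerically. The sketch below assumes NumPy and draws Q from SO(3), so that det Q = 1: it checks that the identity matrix is isotropic, that the hypermatrix A in (2.18) is not, and that the Levi-Civita hypermatrix J is fixed by every rotation.

    import numpy as np

    rng = np.random.default_rng(3)
    Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1                              # use a proper rotation, det Q = 1

    mlm2 = lambda Q, M: np.einsum('ia,jb,ab->ij', Q, Q, M)            # (Q, Q) . M
    mlm3 = lambda Q, T: np.einsum('ia,jb,kc,abc->ijk', Q, Q, Q, T)    # (Q, Q, Q) . T

    I = np.eye(3)
    A = np.zeros((3, 3, 3))
    for i in range(3):
        A[i, i, i] = 1.0                           # the 'obvious' generalization (2.18)

    J = np.zeros((3, 3, 3))                        # Levi-Civita hypermatrix
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        J[i, j, k], J[i, k, j] = 1.0, -1.0

    print(np.allclose(mlm2(Q, I), I))              # True:  the identity matrix is isotropic
    print(np.allclose(mlm3(Q, A), A))              # False: (2.18) is not
    print(np.allclose(mlm3(Q, J), J))              # True:  J is fixed by every rotation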
Example 2.9 (hyperdeterminants). The simplest order-3 analogue for the deter-
minant, called the Cayley hyperdeterminant for 𝐴 = [𝑎𝑖 𝑗𝑘 ] ∈ R2×2×2 , is given by
\[
\begin{aligned}
\operatorname{Det}(A) ={}& a_{000}^2 a_{111}^2 + a_{001}^2 a_{110}^2 + a_{010}^2 a_{101}^2 + a_{011}^2 a_{100}^2 \\
&- 2(a_{000}a_{001}a_{110}a_{111} + a_{000}a_{010}a_{101}a_{111} + a_{000}a_{011}a_{100}a_{111} \\
&\quad\; + a_{001}a_{010}a_{101}a_{110} + a_{001}a_{011}a_{110}a_{100} + a_{010}a_{011}a_{101}a_{100}) \\
&+ 4(a_{000}a_{011}a_{101}a_{110} + a_{001}a_{010}a_{100}a_{111}),
\end{aligned}
\]
which looks nothing like the usual expression for a matrix determinant. Just as the
matrix determinant is preserved under a transformation 𝐴 ′ = (𝑋, 𝑌 ) · 𝐴 = 𝑋 𝐴𝑌 T for
𝑋, 𝑌 ∈ SL(𝑛), this is preserved under a transformation of the form 𝐴 ′ = (𝑋, 𝑌 , 𝑍)· 𝐴
for any 𝑋, 𝑌 , 𝑍 ∈ SL(2). This notion of hyperdeterminant has been extended to
any 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 with
\[ n_i - 1 \le \sum_{j \neq i} (n_j - 1), \qquad i = 1, \ldots, d, \]
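The stated invariance is also easy to verify numerically. The sketch below assumes NumPy; cayley_hyperdet and random_sl2 are our own illustrative helpers. It evaluates the formula above on a random 2 × 2 × 2 hypermatrix and on its transform under a triple of SL(2) matrices.

    import numpy as np

    def cayley_hyperdet(A):
        # hyperdeterminant of a 2 x 2 x 2 hypermatrix A = [a_ijk], transcribing the formula above
        a = lambda i, j, k: A[i, j, k]
        return (a(0,0,0)**2*a(1,1,1)**2 + a(0,0,1)**2*a(1,1,0)**2
                + a(0,1,0)**2*a(1,0,1)**2 + a(0,1,1)**2*a(1,0,0)**2
                - 2*(a(0,0,0)*a(0,0,1)*a(1,1,0)*a(1,1,1) + a(0,0,0)*a(0,1,0)*a(1,0,1)*a(1,1,1)
                     + a(0,0,0)*a(0,1,1)*a(1,0,0)*a(1,1,1) + a(0,0,1)*a(0,1,0)*a(1,0,1)*a(1,1,0)
                     + a(0,0,1)*a(0,1,1)*a(1,1,0)*a(1,0,0) + a(0,1,0)*a(0,1,1)*a(1,0,1)*a(1,0,0))
                + 4*(a(0,0,0)*a(0,1,1)*a(1,0,1)*a(1,1,0) + a(0,0,1)*a(0,1,0)*a(1,0,0)*a(1,1,1)))

    def random_sl2(rng):
        # a random 2 x 2 matrix rescaled to have determinant one, i.e. an element of SL(2)
        M = rng.standard_normal((2, 2))
        if np.linalg.det(M) < 0:
            M[:, [0, 1]] = M[:, [1, 0]]            # swap columns to make the determinant positive
        return M / np.sqrt(np.linalg.det(M))

    rng = np.random.default_rng(4)
    A = rng.standard_normal((2, 2, 2))
    X, Y, Z = (random_sl2(rng) for _ in range(3))
    B = np.einsum('ia,jb,kc,abc->ijk', X, Y, Z, A)              # A' = (X, Y, Z) . A
    print(np.isclose(cayley_hyperdet(B), cayley_hyperdet(A)))   # True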
Of course, one might argue that anyone is free to concoct any formula for
‘tensor multiplication’ or call any hypermatrix an ‘identity tensor’ even if they are
undefined on tensors: there is no right or wrong. But there is. They would be
wrong in the same way adding fractions as 𝑎/𝑏 + 𝑐/𝑑 = (𝑎 + 𝑐)/(𝑏 + 𝑑) is wrong,
or at least far less useful than the standard way of adding fractions. Fractions are
real, and adding half a cake to a third of a cake does not give us two-fifths of a
cake. Likewise tensors are real: we discovered the transformation rules for tensors
just as we discovered the arithmetic rules for adding fractions, that is, we did not
invent them arbitrarily.
While we have limited our examples to mathematical ones, we will end with a
note about the physics perspective. In physics, these transformation rules are just as
if not more important. One does not even have to go to higher-order tensors to see
this: a careful treatment of vectors in physics already requires such an approach.
Example 2.10 (tensor transformation rules in physics). As we saw at the be-
Figure 2.1. Linear transformation of coordinate axes.
Indeed, contravariant 1-tensors are how many physicists would regard vectors
(Weinreich 1998): a vector is an object represented by 𝑎 ∈ R𝑛 that satisfies
the transformation rule 𝑎 ′ = 𝑋 −1 𝑎, possibly with the additional requirement that
the change-of-basis matrix 𝑋 be in O(𝑛) (Feynman, Leighton and Sands 1963,
Chapter 11) or 𝑋 ∈ O(𝑝, 𝑞) (Rindler 2006, Chapter 4).
The transformation rules perspective has proved to be very useful in physics. For
example, special relativity is essentially the observation that the laws of physics
are invariant under Lorentz transformations 𝑋 ∈ O(1, 3) (Einstein 2002). In fact, a
study of the contravariant 1-tensor transformation rule under the O(1, 3)-analogue
of Givens rotations,
\[
\begin{bmatrix} \cosh\theta & -\sinh\theta & 0 & 0 \\ -\sinh\theta & \cosh\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},\quad
\begin{bmatrix} \cosh\theta & 0 & -\sinh\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sinh\theta & 0 & \cosh\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},\quad
\begin{bmatrix} \cosh\theta & 0 & 0 & -\sinh\theta \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\sinh\theta & 0 & 0 & \cosh\theta \end{bmatrix},
\]
is enough to derive most standard results of special relativity; see Friedberg, Insel
and Spence (2003, Section 6.9) and Woodhouse (2003, Chapters 4 and 5).
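Membership in O(1, 3) is the condition X^T η X = η with η the Minkowski metric, the Lorentz analogue of Q^T Q = I for O(n). A one-line check for the first boost above, a sketch assuming NumPy and the signature convention η = diag(1, −1, −1, −1):

    import numpy as np

    theta = 0.7
    c, s = np.cosh(theta), np.sinh(theta)
    boost = np.array([[ c, -s, 0, 0],
                      [-s,  c, 0, 0],
                      [ 0,  0, 1, 0],
                      [ 0,  0, 0, 1]])
    eta = np.diag([1.0, -1.0, -1.0, -1.0])              # Minkowski metric

    print(np.allclose(boost.T @ eta @ boost, eta))       # True: the boost lies in O(1, 3)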
Covariant 1-tensors are also commonplace in physics. If the coordinates of 𝑞
transform as 𝑎 ′ = 𝑋 −1 𝑎, then the coordinates of its derivative 𝜕/𝜕𝑞 or ∇𝑞 would
transform as 𝑎 ′ = 𝑋 T 𝑎. So physical quantities that satisfy the covariant 1-tensor
transformation rule tend to have a contravariant 1-tensor ‘in the denominator’ such
as conjugate momentum 𝑝 = 𝜕𝐿/𝜕 𝑞¤ or electric field 𝐸 = −∇𝜙 − 𝜕 𝐴/𝜕𝑡. In the
former, 𝐿 is the Lagrangian, and we have the velocity 𝑞¤ ‘in the denominator’; in the
latter, 𝜙 and 𝐴, respectively, are the scalar and vector potentials, and the gradient
∇ is a derivative with respect to spatial variables, and thus have displacement 𝑞 ‘in
the denominator’.
More generally, what we said above applies to more complex physical quantities:
when the axes – also called the reference frame in physics – are changed, coordinates
of physical quantities must change in a way that preserves the laws of physics;
for tensorial quantities, this would be (2.7)–(2.12), the ‘definite way’ in Voigt’s
definition on page 11. This is the well-known maxim that the laws of physics
should not depend on coordinates, a version of which the reader will find on
page 41. The reader will also find higher-order examples in Examples 2.4, 3.12
and 4.8.
• Similarity of matrices: 𝐴 ′ = 𝑋 𝐴𝑋 −1 .
• Congruence of matrices: 𝐴 ′ = 𝑋 𝐴𝑋 T .
The most common choices for 𝑋 and 𝑌 are either orthogonal matrices or invertible
ones (similarity and congruence are identical if 𝑋 is orthogonal). These transform-
ation rules have canonical forms, and this is an important feature in linear algebra
and numerical linear algebra alike, notwithstanding the celebrated fact (Golub
and Wilkinson 1976) that some of these cannot be computed in finite precision:
Smith normal form for equivalence, singular value decomposition for orthogonal
equivalence, Jordan and Weyr canonical forms for similarity (O’Meara, Clark and
Vinsonhaler 2011), Turnbull–Aitken and Hodge–Pedoe canonical forms for con-
gruence (De Terán 2016). One point to emphasize is that the transformation rules
determine the canonical form for the tensor: it makes no sense to speak of Jordan
form for a contravariant 2-tensor or a Turnbull–Aitken form for a mixed 2-tensor,
even if both tensors are represented by exactly the same matrix 𝐴 ∈ R𝑛×𝑛 .
At this juncture it is appropriate to highlight a key difference between 2-tensors
and higher-order ones: while 2-tensors have canonical forms, higher-order tensors
in general do not (Landsberg 2012, Chapter 10). This is one of several reasons why
we should not expect an extension of linear algebra or numerical linear algebra to
𝑑-dimensional hypermatrices in a manner that resembles the 𝑑 = 2 versions.
Before we continue, we will highlight two simple properties.
(i) The change-of-basis matrices may be multiplied and inverted: if 𝑋 and 𝑌 are
orthogonal or invertible, then so is 𝑋𝑌 and so is 𝑋 −1 , that is, the set of all
change-of-basis matrices O(𝑛) or GL(𝑛) forms a group.
(ii) The transformation rules may be composed: if we have 𝑎 ′ = 𝑋 −T 𝑎 and
𝑎 ′′ = 𝑌 −T 𝑎 ′ , then 𝑎 ′′ = (𝑌 𝑋)−T 𝑎; if we have 𝐴 ′ = 𝑋 𝐴𝑋 −1 and 𝐴 ′′ = 𝑌 𝐴 ′𝑌 −1 ,
then 𝐴 ′′ = (𝑌 𝑋)𝐴(𝑌 𝑋)−1 , that is, the transformation rule defines a group
action.
These innocuous observations say that to get a matrix 𝐴 into a desired form 𝐵, we
may just work on a ‘small part’ of the matrix 𝐴, e.g. a 2 × 2-submatrix or a fragment
of a column, by applying a transformation that affects that ‘small part’. We then
repeat it on other parts to obtain a sequence of transformations:
𝐴 → 𝑋1 𝐴 → 𝑋2 (𝑋1 𝐴) → · · · → 𝐵,
𝐴 → 𝑋1−T 𝐴 → 𝑋2−T (𝑋1−T 𝐴) → · · · → 𝐵,
𝐴 → 𝑋1 𝐴𝑋1T → 𝑋2 (𝑋1 𝐴𝑋1T )𝑋2T → · · · → 𝐵,
𝐴 → 𝑋1 𝐴𝑋1−1 → 𝑋2 (𝑋1 𝐴𝑋1−1 )𝑋2−1 → · · · → 𝐵,
𝐴 → 𝑋1 𝐴𝑌1−1 → 𝑋2 (𝑋1 𝐴𝑌1−1 )𝑌2−1 → · · · → 𝐵, (2.21)
and piece all change-of-basis matrices together to get the required 𝑋 as either
𝑋𝑚 𝑋𝑚−1 . . . 𝑋1 or its limit as 𝑚 → ∞ (likewise for 𝑌 ). Algorithms for computing
standard matrix decompositions such as LU, QR, EVD, SVD, Cholesky, Schur, etc.,
all involve applying a sequence of such transformation rules (Golub and Van Loan
2013). In numerical linear algebra, if 𝑚 is finite, i.e. 𝑋 may be obtained in finitely
many steps (in exact arithmetic), then the algorithm is called direct, whereas if it
requires 𝑚 → ∞, i.e. 𝑋 may only be approximated with a limiting process, then
it is called iterative. Furthermore, in numerical linear algebra, one tends to see
such transformations as giving a matrix decomposition, which may then be used to
solve other problems involving 𝐴. This is sometimes called ‘the decompositional
approach to matrix computation’ (Stewart 2000).
Designers of algorithms for matrix computations, even if they were not explicitly
aware of these transformation rules and properties, were certainly observing them
implicitly. For instance, it is rare to find algorithms that mix different transformation
rules for different types of tensors, since what is incompatible for tensors tends to
lead to meaningless results. Also, since eigenvalues are defined for mixed 2-tensors
but not for contravariant or covariant 2-tensors, the transformation 𝐴 ′ = 𝑋 𝐴𝑋 −1
is pervasive in algorithms for eigenvalue decomposition but we rarely if ever find
𝐴 ′ = 𝑋 𝐴𝑋 T (unless of course 𝑋 is orthogonal).
In numerical linear algebra, the use of the transformation rules in (2.21) goes
hand in hand with a salient property of the group of change-of-basis matrices.
Example 2.12 (Givens rotations, Householder reflectors, Gauss transforms).
Recall that these are defined by
\[
G = \begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & \cos\theta & \cdots & -\sin\theta & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & \sin\theta & \cdots & \cos\theta & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix} \in \mathrm{SO}(n),
\]
\[ H = I - \frac{2vv^{T}}{v^{T}v} \in \mathrm{O}(n), \]
8 If 𝑚 = 𝛼𝑒 𝑗 , then 𝑀 𝐴 adds an 𝛼 multiple of the 𝑗th row to the 𝑖th row of 𝐴; so the Gauss transform
includes the elementary matrices that perform this operation.
Kahan bidiagonalization for SVD, etc., all rely in part on applying a sequence of
transformation rules as in (2.21) with one of these matrices playing the role of the
change-of-basis matrices. The reason this is possible is that:
• any 𝑋 ∈ SO(𝑛) is a product of Givens rotations,
• any 𝑋 ∈ O(𝑛) is a product of Householder reflectors,
• any 𝑋 ∈ GL(𝑛) is a product of elementary matrices,
• any unit lower triangular 𝑋 ∈ GL(𝑛) is a product of Gauss transforms.
In group-theoretic lingo, these matrices are generators of the respective matrix Lie
groups; in the last case, the set of all unit lower triangular matrices, i.e. ones on the
diagonal, is also a subgroup of GL(𝑛).
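As a small illustration of both (2.21) and the generator property, the sketch below (assuming NumPy; the function givens is our own) triangularizes a 3 × 3 matrix by a sequence of Givens rotations and accumulates their product, which lies in SO(3).

    import numpy as np

    def givens(n, i, j, a, b):
        # an n x n Givens rotation acting in the (i, j) plane, chosen so that applying it
        # to a vector with values (a, b) in positions (i, j) zeroes out position j
        r = np.hypot(a, b)
        if r == 0:
            return np.eye(n)
        c, s = a / r, b / r
        G = np.eye(n)
        G[i, i], G[j, j] = c, c
        G[i, j], G[j, i] = s, -s
        return G

    rng = np.random.default_rng(5)
    A = rng.standard_normal((3, 3))

    R, Q_accum = A.copy(), np.eye(3)
    for j in range(2):                         # the sequence A -> X1 A -> X2 (X1 A) -> ... of (2.21)
        for i in range(2, j, -1):
            G = givens(3, i - 1, i, R[i - 1, j], R[i, j])
            R, Q_accum = G @ R, G @ Q_accum

    print(np.allclose(np.tril(R, -1), 0))      # R is upper triangular
    print(np.allclose(Q_accum.T @ R, A))       # A = Q R with Q = Q_accum^T, a product of Givens rotations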
Whether one seeks to solve a system of linear equations or find a least-squares
solution or compute eigenvalue or singular value decompositions, the basic under-
lying principle in numerical linear algebra is to transform the problem in such a
way that the solution of the transformed problem is related to the original solution
in a definite way; note that this is practically a paraphrase of Voigt’s definition of a
tensor on page 11. Any attempt to give a comprehensive list of examples will simply
result in our reproducing a large fraction of Golub and Van Loan (2013), so we will
just give a familiar example viewed through the lens of tensor transformation rules.
Example 2.13 (full-rank least-squares). As we saw in Section 2.1, the least-
squares problem (2.3) satisfies the transformation rule of a mixed 2-tensor 𝐴 ′ =
𝑋 𝐴𝑌 −1 with change-of-basis matrices (𝑋, 𝑌 ) ∈ O(𝑚) × GL(𝑛). Suppose rank(𝐴) =
𝑛. Then, applying a sequence of covariant 1-tensor transformation rules
\[ A \to Q_1^{T} A \to Q_2^{T}(Q_1^{T} A) \to \cdots \to Q^{T} A = \begin{bmatrix} R \\ 0 \end{bmatrix} \]
given by the Householder QR algorithm, we get
\[ A = Q \begin{bmatrix} R \\ 0 \end{bmatrix}. \]
As the minimum value is an invariant Cartesian 0-tensor,
\[
\min \|Av - b\|^2 = \min \|Q^{T}(Av - b)\|^2
= \min \left\| \begin{bmatrix} R \\ 0 \end{bmatrix} v - Q^{T} b \right\|^2
= \min \left\| \begin{bmatrix} R \\ 0 \end{bmatrix} v - \begin{bmatrix} c \\ d \end{bmatrix} \right\|^2
= \min \bigl( \|Rv - c\|^2 + \|d\|^2 \bigr) = \|d\|^2,
\]
where we have written
\[ Q^{T} b = \begin{bmatrix} c \\ d \end{bmatrix}. \]
In this case the solution of the transformed problem 𝑅𝑣 = 𝑐 is in fact equal to that of
the least-squares problem, and may be obtained through back-substitution, that is,
another sequence of contravariant 1-tensor transformation rules
𝑐 → 𝑌1−1 𝑐 → 𝑌2−1 (𝑌1−1 𝑐) → · · · → 𝑅 −1 𝑐 = 𝑣,
where the 𝑌𝑖 are Gauss transforms. As noted above, the solution method reflects
Voigt’s definition: we transform the problem mink 𝐴𝑣 − 𝑏k 2 into a form where the
solution of the transformed problem 𝑅𝑣 = 𝑐 is related to the original solution in a
definite way. Here we obtained the change-of-basis matrices 𝑋 = 𝑄 ∈ O(𝑚) and
𝑌 = 𝑅 −1 ∈ GL(𝑛) via Householder QR and back-substitution respectively.
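The whole of Example 2.13 fits comfortably in a few lines of code. The following is a minimal sketch assuming NumPy; householder_ls is our own name, the matrix is assumed to have full column rank, and we omit the refinements (pivoting, blocking) a production QR would include.

    import numpy as np

    def householder_ls(A, b):
        # solve min ||Av - b|| for A with full column rank, following Example 2.13:
        # reduce A to [R; 0] with Householder reflectors, then back-substitute R v = c
        m, n = A.shape
        R, b = A.astype(float).copy(), b.astype(float).copy()
        for j in range(n):
            x = R[j:, j]
            u = x.copy()
            u[0] += np.copysign(np.linalg.norm(x), x[0])      # Householder vector for column j
            u /= np.linalg.norm(u)
            R[j:, j:] -= 2.0 * np.outer(u, u @ R[j:, j:])     # apply H = I - 2uu^T, i.e. A -> Q^T A
            b[j:] -= 2.0 * u * (u @ b[j:])
        c, d = b[:n], b[n:]
        v = np.zeros(n)
        for i in range(n - 1, -1, -1):                        # back-substitution, R v = c
            v[i] = (c[i] - R[i, i+1:] @ v[i+1:]) / R[i, i]
        return v, np.linalg.norm(d)                           # minimizer and residual ||d||

    rng = np.random.default_rng(6)
    A, b = rng.standard_normal((8, 3)), rng.standard_normal(8)
    v, res = householder_ls(A, b)
    print(np.allclose(v, np.linalg.lstsq(A, b, rcond=None)[0]))    # agrees with NumPy's least-squares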
As we just saw, there are distinguished choices for the change-of-basis matrices
that aid the solution of a numerical linear algebra problem. We will mention one
of the most useful choices below.
Example 2.14 (Krylov subspaces). Suppose we have 𝐴 ∈ R𝑛×𝑛 , with all eigen-
values distinct and non-zero for simplicity. Take an arbitrary 𝑏 ∈ R𝑛 . Then the
matrix 𝐾 whose columns are Krylov basis vectors of the form
𝑏, 𝐴𝑏, 𝐴2 𝑏, . . . , 𝐴𝑛−1 𝑏
is invertible, i.e. 𝐾 ∈ GL(𝑛), and using it as the change-of-basis matrix gives us
\[
A = K \begin{bmatrix}
0 & 0 & \cdots & 0 & -c_0 \\
1 & 0 & \cdots & 0 & -c_1 \\
0 & 1 & \cdots & 0 & -c_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -c_{n-1}
\end{bmatrix} K^{-1}. \qquad (2.22)
\]
This is a special case of the rational canonical form, a canonical form under
similarity. The seemingly trivial observation (2.22), when combined with other
techniques, becomes a powerful iterative method for a wide variety of computa-
tional tasks such as solving linear systems, least-squares, eigenvalue problems or
evaluating various matrix functions (van der Vorst 2000). Readers unfamiliar with
numerical linear algebra may find it odd that we do not use another obvious canon-
ical form, one that makes the aforementioned problems trivial to solve, namely, the
eigenvalue decomposition
\[
A = X \begin{bmatrix}
\lambda_1 & 0 & 0 & \cdots & 0 \\
0 & \lambda_2 & 0 & \cdots & 0 \\
0 & 0 & \lambda_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_n
\end{bmatrix} X^{-1}, \qquad (2.23)
\]
where the change-of-basis matrix 𝑋 ∈ GL(𝑛) has columns given by the eigenvectors
of 𝐴. The issue is that this is more difficult to compute than (2.22). In fact, to
compute it, one way is to implicitly exploit (2.22) and the relation between the two
canonical forms:
\[
\begin{bmatrix}
\lambda_1 & 0 & 0 & \cdots & 0 \\
0 & \lambda_2 & 0 & \cdots & 0 \\
0 & 0 & \lambda_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_n
\end{bmatrix}
= V \begin{bmatrix}
0 & 0 & \cdots & 0 & -c_0 \\
1 & 0 & \cdots & 0 & -c_1 \\
0 & 1 & \cdots & 0 & -c_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & -c_{n-1}
\end{bmatrix} V^{-1},
\]
where
\[
V = \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{n-1} \\
1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{n-1} \\
1 & \lambda_3 & \lambda_3^2 & \cdots & \lambda_3^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \lambda_n & \lambda_n^2 & \cdots & \lambda_n^{n-1}
\end{bmatrix}. \qquad (2.24)
\]
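The observation (2.22) is easy to reproduce numerically: form the Krylov matrix K and check that K^{-1}AK is a companion matrix whose last column carries the negated coefficients of the characteristic polynomial. A sketch assuming NumPy (Krylov matrices are notoriously ill-conditioned, so we keep n small):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 5
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    # Krylov matrix K = [b, Ab, A^2 b, ..., A^{n-1} b]
    K = np.column_stack([np.linalg.matrix_power(A, k) @ b for k in range(n)])
    C = np.linalg.solve(K, A @ K)                    # the representation of A in the Krylov basis

    # C is the companion matrix of (2.22): ones on the subdiagonal, and the last column
    # holds -c_0, ..., -c_{n-1}, the negated coefficients of the characteristic polynomial
    print(np.allclose(C[:, :-1], np.eye(n, n - 1, k=-1), atol=1e-6))
    print(np.allclose(np.poly(A)[::-1][:-1], -C[:, -1], atol=1e-6))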
The pertinent point here is that the approach of solving a problem by finding an
appropriate computational basis is also an instance of the 2-tensor transformation
rule. Aside from the Krylov basis, there are simpler examples such as diagonalizing
a circulant matrix
\[
C = \begin{bmatrix}
c_0 & c_{n-1} & \cdots & c_2 & c_1 \\
c_1 & c_0 & c_{n-1} & & c_2 \\
\vdots & c_1 & c_0 & \ddots & \vdots \\
c_{n-2} & & \ddots & \ddots & c_{n-1} \\
c_{n-1} & c_{n-2} & \cdots & c_1 & c_0
\end{bmatrix}
\]
by expressing it in the Fourier basis, i.e. with 𝜆_𝑗^𝑘 = e^{2(𝑗−1)𝑘𝜋i/𝑛} in (2.24), and
there are far more complex examples such as the wavelet approach to the fast
multipole method of Beylkin, Coifman and Rokhlin (1991). This computes, in 𝑂(𝑛)
operations and to arbitrary accuracy, a matrix–vector product 𝑣 ↦→ 𝐴𝑣 with a dense
𝐴 ∈ R𝑛×𝑛 that is a finite-dimensional approximation of certain special integral
transforms. The algorithm iteratively computes a special wavelet basis 𝑋 ∈ GL(𝑛)
so that ultimately 𝑋 −1 𝐴𝑋 = 𝐵 gives a banded matrix 𝐵 where 𝑣 ↦→ 𝐵𝑣 can be
computed in time 𝑂(𝑛 log(1/𝜀)) to 𝜀-accuracy, and where 𝑣 ↦→ 𝑋𝑣, 𝑣 ↦→ 𝑋 −1 𝑣 are
both computable in time 𝑂(𝑛). One may build upon this algorithm to obtain an
𝑂(𝑛 log2 𝑛 log 𝜅(𝐴)) algorithm for the pseudoinverse (Beylkin 1993, Section 6) and
potentially also to compute other matrix functions such as square root, exponential,
sine and cosine, etc. (Beylkin, Coifman and Rokhlin 1992, Section X). We will
discuss some aspects of this basis in Example 4.47, which is constructed as the
tensor product of multiresolution analyses.
Our intention is to highlight the role of the tensor transformation rules in numer-
ical linear algebra but we do not wish to overstate it. These rules are an important
component of various algorithms but almost never the only one. Furthermore,
for a strongly convex 𝑓 ∈ 𝐶²(Ω) with 𝛽𝐼 ⪯ ∇²𝑓(𝑣) ⪯ 𝛾𝐼. The Newton step Δ𝑣 ∈ R𝑛 is defined as the solution to
\[ \begin{bmatrix} \nabla^2 f(v) & A^{T} \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta v \\ \Delta\lambda \end{bmatrix} = \begin{bmatrix} -\nabla f(v) \\ 0 \end{bmatrix} \]
and the Newton decrement 𝜆(𝑣) ∈ R is defined as
\[ \lambda(v)^2 := \nabla f(v)^{T}\, \nabla^2 f(v)^{-1}\, \nabla f(v). \]
Let 𝑋 ∈ GL(𝑛) and suppose we perform a linear change of coordinates 𝑋𝑣 ′ = 𝑣.
Then the Newton step in these new coordinates is given by
\[ \begin{bmatrix} X^{T}\nabla^2 f(Xv')X & X^{T}A^{T} \\ AX & 0 \end{bmatrix} \begin{bmatrix} \Delta v' \\ \Delta\lambda' \end{bmatrix} = \begin{bmatrix} -X^{T}\nabla f(Xv') \\ 0 \end{bmatrix}. \]
We may check that 𝑋Δ𝑣 ′ = Δ𝑣 and thus the iterates are related by 𝑋𝑣 ′𝑘 = 𝑣 𝑘 for
all 𝑘 ∈ N as long as we initialize with 𝑋𝑣 0′ = 𝑣 0 (Boyd and Vandenberghe 2004,
Section 10.2.1). Note that steepest descent satisfies no such property no matter
which 1-tensor transformation rule we use: 𝑣 ′ = 𝑋𝑣, 𝑣 ′ = 𝑋 −1 𝑣, 𝑣 ′ = 𝑋 T 𝑣, or 𝑣 ′ =
𝑋 −T 𝑣. We also have that 𝜆(𝑋𝑣)2 = 𝜆(𝑣)2 , which is used in the stopping condition
of Newton’s method; thus the iterations stop at the same point 𝑋𝑣 ′𝑘 = 𝑣 𝑘 when
𝜆(𝑣 ′𝑘 )² = 𝜆(𝑣 𝑘 )² ≤ 2𝜀 for a given 𝜀 > 0. In summary, if we write 𝑔(𝑣 ′) = 𝑓 (𝑋𝑣 ′),
then we have the following relations:
coordinates          contravariant 1-tensor    𝑣 ′ = 𝑋 −1 𝑣,
gradient             covariant 1-tensor        ∇𝑔(𝑣 ′) = 𝑋 T ∇ 𝑓 (𝑋𝑣 ′),
Hessian              covariant 2-tensor        ∇²𝑔(𝑣 ′) = 𝑋 T ∇² 𝑓 (𝑋𝑣 ′) 𝑋,
Newton step          contravariant 1-tensor    Δ𝑣 ′ = 𝑋 −1 Δ𝑣,
Newton iterate       contravariant 1-tensor    𝑣 ′𝑘 = 𝑋 −1 𝑣 𝑘,
Newton decrement     invariant 0-tensor        𝜆(𝑣 ′𝑘 ) = 𝜆(𝑣 𝑘 ).
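The affine invariance summarized above can be checked directly. The sketch below assumes NumPy; the test function, its derivatives and the bare Newton iteration are our own choices, and we drop the equality constraint (no A) to keep the check short. Running Newton's method on f and on g(v') = f(Xv') confirms that the iterates are related by v_k = Xv'_k.

    import numpy as np

    # a strongly convex test function f(v) = exp(v1) + exp(-v1) + v1^2 + 2 v2^2 + v1 v2
    def grad_f(v):
        return np.array([np.exp(v[0]) - np.exp(-v[0]) + 2*v[0] + v[1], 4*v[1] + v[0]])

    def hess_f(v):
        return np.array([[np.exp(v[0]) + np.exp(-v[0]) + 2, 1.0], [1.0, 4.0]])

    def newton(grad, hess, v0, steps=8):
        v = v0.copy()
        for _ in range(steps):
            v = v - np.linalg.solve(hess(v), grad(v))    # v <- v + Delta v
        return v

    rng = np.random.default_rng(8)
    X = rng.standard_normal((2, 2)) + 3*np.eye(2)        # an invertible change of coordinates
    grad_g = lambda w: X.T @ grad_f(X @ w)               # covariant 1-tensor rule for the gradient
    hess_g = lambda w: X.T @ hess_f(X @ w) @ X           # covariant 2-tensor rule for the Hessian

    v0 = np.array([1.0, -2.0])
    v_star = newton(grad_f, hess_f, v0)
    w_star = newton(grad_g, hess_g, np.linalg.solve(X, v0))    # initialize with X v0' = v0
    print(np.allclose(X @ w_star, v_star))                     # iterates related by v_k = X v'_k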
Strictly speaking, the gradient and Hessian are tensor fields, and we will explain
the difference in Example 3.12. We may extend the above discussion with the
However, if 𝛽𝐼 ⪯ ∇²𝑓(𝑣) ⪯ 𝛾𝐼, then 𝜅({𝑣 ∈ Ω : 𝑓 (𝑣) ≤ 𝛼}) ≤ 𝛾/𝛽 (Boyd and
Vandenberghe 2004, Section 9.1.2), and so it is ultimately controlled by 𝜅(∇2 𝑓 (𝑣)).
Our last example shows that the tensor transformation rules are as important in
the information sciences as they are in the physical sciences.
Example 2.16 (equivariant neural networks). A feed-forward neural network
is usually regarded as a function 𝑓 : R𝑛 → R𝑛 obtained by alternately composing
affine maps 𝛼𝑖 : R𝑛 → R𝑛 , 𝑖 = 1, . . . , 𝑘, with a non-linear function 𝜎 : R𝑛 → R𝑛 :
\[
f = \alpha_k \circ \sigma \circ \alpha_{k-1} \circ \cdots \circ \sigma \circ \alpha_2 \circ \sigma \circ \alpha_1:\qquad
\mathbb{R}^n \xrightarrow{\alpha_1} \mathbb{R}^n \xrightarrow{\sigma} \mathbb{R}^n \xrightarrow{\alpha_2} \mathbb{R}^n \xrightarrow{\sigma} \cdots \xrightarrow{\sigma} \mathbb{R}^n \xrightarrow{\alpha_k} \mathbb{R}^n.
\]
The depth, also known as the number of layers, is 𝑘 and the width, also known
as the number of neurons, is 𝑛. We assume that our neural network has constant
width throughout all layers. The non-linear function 𝜎 is called activation and we
may assume that it is given by the ReLU function 𝜎(𝑡) ≔ max(𝑡, 0) for 𝑡 ∈ R and,
by convention,
𝜎(𝑣) = (𝜎(𝑣 1 ), . . . , 𝜎(𝑣 𝑛 )), 𝑣 = (𝑣 1 , . . . , 𝑣 𝑛 ) ∈ R𝑛 . (2.26)
The affine function is defined by 𝛼𝑖 (𝑣) = 𝐴𝑖 𝑣 + 𝑏𝑖 for some 𝐴𝑖 ∈ R𝑛×𝑛 called the
weight matrix and some 𝑏𝑖 ∈ R𝑛 called the bias vector. We assume that 𝑏 𝑘 = 0 in
the last layer.
Although convenient, it is somewhat misguided to be lumping the bias and weight
together in an affine function. The biases 𝑏𝑖 are intended to serve as thresholds
for the activation function 𝜎 (Bishop 2006, Section 4.1.7) and should be part of
it, detached from the weights 𝐴𝑖 that transform the input. If one would like to
incorporate translations, one could do so with weights from a matrix group such
as SE(𝑛) in (2.15). Hence a better but mathematically equivalent description of 𝑓
would be as
plays the role of a threshold for activation as was intended by Rosenblatt (1958,
p. 392) and McCulloch and Pitts (1943, p. 120).
A major computational issue with neural networks is the large number of un-
known parameters, namely the 𝑘𝑛² + (𝑘 − 1)𝑛 entries of the weights and biases, that
have to be fitted with data, especially for deep neural networks where 𝑘 is large.
Thus successful applications of neural networks require that we identify, based on
the problem at hand, an appropriate low-dimensional subset of R𝑛×𝑛 from which we
will find our weights 𝐴1 , . . . , 𝐴 𝑘 . For instance, the very successful convolutional
neural networks for image recognition (Krizhevsky, Sutskever and Hinton 2012)
rely on restricting 𝐴1 , . . . , 𝐴 𝑘 to some block-Toeplitz–Toeplitz-block or BTTB
matrices (Ye and Lim 2018a, Section 13) determined by a very small number of
parameters. It turns out that convolutional neural networks are a quintessential ex-
ample of equivariant neural networks (Cohen and Welling 2016), and in fact every
equivariant neural network may be regarded as a generalized convolutional neural
network in an appropriate sense (Kondor and Trivedi 2018). We will describe a
simplified version that captures its essence and illustrates the tensor transformation
rules.
Let 𝐺 ⊆ R𝑛×𝑛 be a matrix group. A function 𝑓 : R𝑛 → R𝑛 is said to be
equivariant if it satisfies the condition that 𝑓 (𝑋𝑣) = 𝑋 𝑓 (𝑣) for all 𝑣 ∈ R𝑛 and all 𝑋 ∈ 𝐺.
for convolutional neural networks; the 𝑝4 group that augments translations with
right-angle rotations,
\[
H = \left\{ \begin{bmatrix} \cos(k\pi/2) & -\sin(k\pi/2) & m_1 \\ \sin(k\pi/2) & \cos(k\pi/2) & m_2 \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3\times 3} : k = 0, 1, 2, 3,\ m_1, m_2 \in \mathbb{Z} \right\}
\]
in Cohen and Welling (2016, Section 4.2); the 𝑝4𝑚 group that further augments
𝑝4 with reflections,
\[
H = \left\{ \begin{bmatrix} (-1)^{j}\cos(k\pi/2) & (-1)^{j+1}\sin(k\pi/2) & m_1 \\ \sin(k\pi/2) & \cos(k\pi/2) & m_2 \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3\times 3} : k = 0, 1, 2, 3,\ j = 0, 1,\ m_1, m_2 \in \mathbb{Z} \right\}
\]
in Cohen and Welling (2016, Section 4.3). Other possibilities for 𝐻 include the
rotation group SO(3) for 3D shape recognition (Kondor et al. 2018), the rigid
motion group SE(3) for chemical property (Fuchs et al. 2020) and protein structure
(see page 6) predictions and the Lorentz group SO(1, 3) for identifying top quarks
in high-energy physics experiments (Bogatskiy et al. 2020), etc.
When people speak of SO(3)- or SE(3)- or Lorentz-equivariant neural networks,
they are referring to the group 𝐻 and not 𝐺 = 𝜌(𝐻). A key step of these works
is the construction of an appropriate representation 𝜌 for the problem at hand, or,
equivalently, constructing a linear action of 𝐻 on R𝑛 . In these applications R𝑛
should be regarded as the set of real-valued functions on a set 𝑆 of cardinality
𝑛, a perspective that we will introduce in Example 4.5. For concreteness, take
an image recognition problem on a collection of 60 000 28 × 28-pixel images of
handwritten digits in greyscale levels 0, 1, . . . , 255 (Deng 2012). Then 𝑆 ⊆ Z2 is
the set of 𝑛 = 282 = 784 pixel indices and an image is encoded as 𝑣 ∈ R784 whose
coordinates take values from 0 (pitch black) to 255 (pure white). Note that this is
why the first three 𝐻 above are discrete: the elements ℎ ∈ 𝐻 act on the pixel indices
𝑆 ⊆ Z2 instead of R2 . These 60 000 images are then used to fit the neural network
𝑓 , i.e. to find the parameters 𝐴1 , . . . , 𝐴 𝑘 ∈ R784×784 and 𝑏1 , . . . , 𝑏 𝑘−1 ∈ R784 .
We end this example with a note on why a non-linear (and not even multilinear)
function like ReLU can satisfy the covariant 2-tensor transformation rule, which
sounds incredible but is actually obvious once explained. Take a greyscale image
𝑣 ∈ R𝑛 drawn with a black outline (greyscale value 0) but filled with varying shades
of grey (greyscale values 1, . . . , 255) and consider the activation
\[
\sigma(t) = \begin{cases} 255 & t > 0, \\ 0 & t \le 0, \end{cases} \qquad t \in \mathbb{R},
\]
so that applying 𝜎 to the image 𝑣 produces an image 𝜎(𝑣) ∈ R𝑛 with all shadings
removed, leaving just the black outline. Now take a 45◦ rotation matrix 𝑅 ∈ SO(2)
and let 𝑋 = 𝜌(𝑅) ∈ GL(𝑛) be the corresponding matrix that rotates any image 𝑣 by
45◦ to 𝑋𝑣:
(Figure omitted: applying 𝜎 to the rotated image gives the same result as rotating the 𝜎-processed image.)
The bottom line is that 𝑅 acts on the indices of 𝑣 whereas 𝜎 acts on the values of
𝑣 and the two actions are always independent, which is why 𝜎 ′ = 𝑋 −1 𝜎 𝑋 = 𝜎.
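This independence of the two actions is easy to demonstrate when ρ(R) permutes the coordinates, as it does for the pixel examples above. A short sketch assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(9)
    n = 6
    relu = lambda v: np.maximum(v, 0)                 # applied coordinatewise as in (2.26)
    v = rng.standard_normal(n)

    P = np.eye(n)[rng.permutation(n)]                 # a permutation matrix: it moves indices, not values
    print(np.allclose(relu(P @ v), P @ relu(v)))      # True: sigma commutes with the permutation action

    X = rng.standard_normal((n, n))                   # a generic invertible matrix mixes values
    print(np.allclose(relu(X @ v), X @ relu(v)))      # False in general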
space of 𝑛×𝑛 symmetric matrices – making the linear algebra notation we have been
using to describe definition ➀ awkward and unnatural. Instead we should make
provision to work with tensors over arbitrary vector spaces, for example the space
of Toeplitz or Hankel or Toeplitz-plus-Hankel matrices, the space of polynomials
or differential forms or differential operators, or, in the case of equivariant neural
networks, the space of 𝐿 2 -functions on homogeneous spaces (Cohen and Welling
2016, Kondor and Trivedi 2018). This serves as another motivation for definitions ➁
and ➂.
Secondly, if one subscribes to this fallacy, then one would tend to miss important
tensors hiding in plain sight. The object of essence in each of the following
examples is a 3-tensor, but one sees no triply indexed quantities anywhere.
(i) Multiplication of complex numbers:
(𝑎 + 𝑖𝑏)(𝑐 + 𝑖𝑑) = (𝑎𝑐 − 𝑏𝑑) + 𝑖(𝑏𝑐 + 𝑎𝑑) (see the sketch after this list).
(ii) Matrix–matrix products:
\[
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
= \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}.
\]
(iii) Grothendieck’s inequality:
\[
\max_{\|x_i\| = \|y_j\| = 1} \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} \langle x_i, y_j\rangle \le K_{\mathrm{G}} \max_{|\varepsilon_i| = |\delta_j| = 1} \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}\, \varepsilon_i \delta_j.
\]
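To make the 3-tensor in (i) visible, note that complex multiplication is a bilinear map B : R² × R² → R², so its coordinate representation is a 2 × 2 × 2 hypermatrix; items (ii) and (iii) can be treated the same way. A sketch assuming NumPy, with our own explicit encoding of that hypermatrix:

    import numpy as np

    # structure hypermatrix of complex multiplication: B(u, v)_k = sum_ij T[i, j, k] u_i v_j,
    # where u = (a, b) and v = (c, d) encode a + ib and c + id
    T = np.zeros((2, 2, 2))
    T[0, 0, 0], T[1, 1, 0] = 1.0, -1.0                # real part:      ac - bd
    T[1, 0, 1], T[0, 1, 1] = 1.0, 1.0                 # imaginary part: bc + ad

    u, v = np.array([3.0, 2.0]), np.array([-1.0, 4.0])     # 3 + 2i and -1 + 4i
    w = np.einsum('ijk,i,j->k', T, u, v)
    print(w)                                  # [-11.  10.], i.e. (3 + 2i)(-1 + 4i) = -11 + 10i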
Linearity principle. Almost any natural process is linear in small amounts almost
everywhere (Kostrikin and Manin 1997, p. vii).
Multilinearity principle. If we keep all but one factor constant, the varying factor
obeys the linearity principle.
It is probably fair to say that the world around us is understandable largely
because of a combination of these two principles. The universal importance of
linearity needs no elaboration; in this section we will discuss the importance of
multilinearity, particularly in computations.
In mathematics, definition ➁ appeals because it is the simplest among the three
definitions: a multilinear map is trivial to define for anyone who knows about linear
maps. More importantly, definition ➁ allows us to define tensors over arbitrary
vector spaces.9 It is a misconception that in applications there is no need to discuss
more general vector spaces because one can always get by with just two of them,
R𝑛 and C𝑛 . The fact of the matter is that other vector spaces often carry structures
that are destroyed when one artificially identifies them with R𝑛 or C𝑛 . Just drawing
on the examples we have mentioned in Section 2, semidefinite programming and
equivariant neural networks already indicate why it is not a good idea to identify
the vector space of symmetric 𝑛 × 𝑛 matrices with R𝑛(𝑛+1)/2 or an 𝑚-dimensional
subspace of real-valued functions 𝑓 : Z2 → R with R𝑚 . We will elaborate on these
later in the context of definition ➁. Nevertheless, we will also see that definition ➁
has one deficiency that will serve as an impetus for definition ➂.
10 Following convention, linear and multilinear maps are called functionals if they are scalar-valued,
i.e. the codomain is R, and operators if they are vector-valued, i.e. the codomain is a vector space
of arbitrary dimension.
Here the juxtaposed 𝑢𝑖 𝑣 𝑗 is not given any further interpretation, and is simply taken
to be an element (called a dyad) in a new vector space of dimension 𝑚𝑛 called the
dyadic product of U and V. While this appears to be a basis-dependent notion,
it is actually not, and dyadics satisfy a contravariant 2-tensor transformation rule.
This is a precursor to definition ➂; we will see that once we insert a ⊗ between
the vectors to get 𝑢𝑖 ⊗ 𝑣 𝑗 , the dyadic product of U and V is just the tensor product
U ⊗ V that we will discuss in Section 4.
Before moving to higher-order tensors, we highlight another impetus for defin-
ition ➂. Note that there are many other maps that are also 2-tensors according to
Note that this practically mirrors the discussion for the bilinear functional case on
page 45 and can be easily extended to any multilinear functional 𝜑 : V1 ×· · ·×V𝑑 →
R to get a hypermatrix representation 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 with respect to any bases ℬ𝑖
on V𝑖 , 𝑖 = 1, . . . , 𝑑.
The argument for the bilinear operator B requires an additional step. By bilinearity, we get
\[
\mathrm{B}(u, v) = \sum_{i=1}^{m}\sum_{j=1}^{n} a_i b_j\, \mathrm{B}(u_i, v_j), \qquad \mathrm{B}(u_i, v_j) = \sum_{k=1}^{p} a_{ijk}\, w_k.
\]
By the fact that 𝒞 is a basis, the above equation uniquely defines the values 𝑎𝑖 𝑗𝑘
and the hypermatrix representation of B with respect to bases 𝒜, ℬ and 𝒞 is
[B] 𝒜,ℬ,𝒞 = 𝐴 ∈ R𝑚×𝑛× 𝑝 . (3.17)
The hypermatrices in (3.15) and (3.17), even if they are identical, represent
different types of 3-tensors. Let 𝐴 ′ = [𝑎𝑖′ 𝑗𝑘 ] ∈ R𝑚×𝑛× 𝑝 be a hypermatrix repres-
entation of 𝜏 with respect to bases 𝒜 ′, ℬ′, 𝒞 ′ on U, V, W, i.e. 𝑎𝑖′ 𝑗𝑘 = 𝜏(𝑢𝑖′ , 𝑣 ′𝑗 , 𝑤 ′𝑘 ).
Then it is straightforward to deduce that
𝐴 ′ = (𝑋 T , 𝑌 T , 𝑍 T ) · 𝐴
with change-of-basis matrices 𝑋 ∈ GL(𝑚), 𝑌 ∈ GL(𝑛), 𝑍 ∈ GL(𝑝) similarly
defined as in (3.3). Hence, by (2.7), trilinear functionals are covariant 3-tensors.
On the other hand, had 𝐴 ′ ∈ R𝑚×𝑛× 𝑝 been a hypermatrix representation of B, then
𝐴 ′ = (𝑋 T , 𝑌 T , 𝑍 −1 ) · 𝐴.
Hence, by (2.9), bilinear operators are mixed 3-tensors of covariant order 2 and
contravariant order 1. The extension to other types of 3-tensors in (3.10)–(3.13)
and to arbitrary 𝑑-tensors may be carried out similarly.
For completeness, we state a formal definition of multilinear maps, if only to
serve as a glossary of terms and notation and as a pretext for interesting examples.
Definition 3.1 (tensors via multilinearity). Let V1 , . . . , V𝑑 and W be real vector
spaces. A multilinear map, or more precisely a 𝑑-linear map, is a map Φ : V1 ×
· · · × V𝑑 → W that satisfies
Φ(𝑣 1 , . . . , 𝜆𝑣 𝑘 + 𝜆 ′ 𝑣 ′𝑘 , . . . , 𝑣 𝑑 ) = 𝜆Φ(𝑣 1 , . . . , 𝑣 𝑘 , . . . , 𝑣 𝑑 ) + 𝜆 ′Φ(𝑣 1 , . . . , 𝑣 ′𝑘 , . . . , 𝑣 𝑑 )
(3.18)
for all 𝑣 1 ∈ V1 , . . . , 𝑣 𝑘 , 𝑣 ′𝑘 ∈ V 𝑘 , . . . , 𝑣 𝑑 ∈ V𝑑 , 𝜆, 𝜆 ′ ∈ R, and all 𝑘 = 1, . . . , 𝑑.
The set of all such maps will be denoted M𝑑 (V1 , . . . , V𝑑 ; W).
This is a passable but awkward definition. For 𝑑 = 2 and 3, the bilinear and
trilinear functionals in (3.8) and (3.11) are tensors, but the linear and bilinear
operators in (3.7) and (3.10) strictly speaking are not; sometimes this means one
needs to convert operators to functionals, for example by identifying a linear
operator Φ : V → W as the bilinear functional defined by V × W∗ → R, (𝑣, 𝜑) ↦→
𝜑(Φ(𝑣)), before one may apply a result. Definition 3.3 is also peculiar considering
that by far the most common 1-, 2- and 3-tensors are vectors, linear operators
and bilinear operators respectively, but the definition excludes them at the outset.
So instead of simply speaking of 𝑣 ∈ V, one would need to regard it as a linear
functional on the space of linear functionals, i.e. 𝑣 ∗∗ : V∗ → R, in order to regard
it as a 1-tensor in the sense of Definition 3.3. In fact, it is often more useful to
do the reverse, that is, given a 𝑑-linear functional, we prefer to convert it into a
(𝑑 − 1)-linear operator, and we will give an example.
provide little insight when functions are defined on vector spaces other than R𝑛 .
On the other hand, using 𝑓 (𝑋) = tr(𝑋 −1 ) for illustration, with (3.20) and (3.21),
we get
𝐷 𝑓 (𝑋) : S𝑛 → R, 𝐻 ↦→ − tr(𝑋 −1 𝐻 𝑋 −1 ),
𝐷 2 𝑓 (𝑋) : S𝑛 × S𝑛 → R, (𝐻1 , 𝐻2 ) ↦→ tr(𝑋 −1 𝐻1 𝑋 −1 𝐻2 𝑋 −1 + 𝑋 −1 𝐻2 𝑋 −1 𝐻1 𝑋 −1 ),
and more generally
\[
D^d f(X) : \mathbb{S}^n \times \cdots \times \mathbb{S}^n \to \mathbb{R},\qquad
(H_1, \ldots, H_d) \mapsto (-1)^d \sum_{\sigma \in \mathfrak{S}_d} \operatorname{tr}\bigl(X^{-1} H_{\sigma(1)} X^{-1} H_{\sigma(2)} X^{-1} \cdots X^{-1} H_{\sigma(d)} X^{-1}\bigr).
\]
and the mixed 𝑑-tensor transformation rules (2.9) for the general case in (3.25).
Furthermore,
[𝜑] ℬ1 ,...,ℬ𝑑 = 𝐴 ∈ R𝑛1 ×···×𝑛𝑑
is exactly the hypermatrix representation of 𝜑 in Definition 2.5. The important
special case where V1 = · · · = V𝑑 = V and where we pick only a single basis ℬ
gives us the transformation rules in (2.10), (2.11) and (2.12).
At this juncture it is appropriate to highlight two previously mentioned points
(page 40) regarding the feasibility and usefulness of representing a multilinear map
as a hypermatrix.
Example 3.5 (writing down a hypermatrix is #P-hard). As we saw in (3.17),
given bases 𝒜, ℬ and 𝒞, a bilinear operator B may be represented as a hypermatrix
𝐴. Writing down the entries 𝑎 𝑖 𝑗 𝑘 as in (3.16) appears to be a straightforward process,
but this is an illusion: the task is #P-hard in general. Let 0 ≤ 𝑑1 ≤ 𝑑2 ≤ · · · ≤ 𝑑𝑛
be integers. Define the generalized Vandermonde matrix
\[
V_{(d_1,\ldots,d_n)}(x) \coloneqq
\begin{bmatrix}
x_1^{d_1} & x_2^{d_1} & \cdots & x_n^{d_1}\\
x_1^{d_2} & x_2^{d_2} & \cdots & x_n^{d_2}\\
\vdots & \vdots & \ddots & \vdots\\
x_1^{d_{n-1}} & x_2^{d_{n-1}} & \cdots & x_n^{d_{n-1}}\\
x_1^{d_n} & x_2^{d_n} & \cdots & x_n^{d_n}
\end{bmatrix},
\]
observing in particular that
\[
V_{(0,1,\ldots,n-1)}(x) =
\begin{bmatrix}
1 & 1 & \cdots & 1\\
x_1 & x_2 & \cdots & x_n\\
\vdots & \vdots & \ddots & \vdots\\
x_1^{n-2} & x_2^{n-2} & \cdots & x_n^{n-2}\\
x_1^{n-1} & x_2^{n-1} & \cdots & x_n^{n-1}
\end{bmatrix}
\]
is the usual Vandermonde matrix. Suppose 𝑑𝑖 ≥ 𝑖 for each 𝑖 = 1, . . . , 𝑛; then it is
not hard to show, using the well-known formula $\det V_{(0,1,\ldots,n-1)}(x) = \prod_{i<j} (x_i - x_j)$,
that det 𝑉(𝑑1 ,𝑑2 ,...,𝑑𝑛 ) (𝑥) is divisible by det 𝑉(0,1,...,𝑛−1) (𝑥). So, for any integers
0 ≤ 𝑝1 ≤ 𝑝2 ≤ · · · ≤ 𝑝𝑛,
\[
s_{(p_1, p_2, \ldots, p_n)}(x) \coloneqq \frac{\det V_{(p_1,\, p_2+1,\, \ldots,\, p_n+n-1)}(x)}{\det V_{(0,1,\ldots,n-1)}(x)}
\]
is a multivariate polynomial in the variables 𝑥1 , . . . , 𝑥 𝑛 . These are symmetric
polynomials, i.e. homogeneous polynomials 𝑠 with
𝑠(𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ) = 𝑠(𝑥 𝜎(1) , 𝑥 𝜎(2) , . . . , 𝑥 𝜎(𝑛) )
for any 𝜎 ∈ 𝔖𝑛 , the permutation group on 𝑛 objects. Let U, V, W be the vector
spaces of symmetric polynomials of degrees 𝑑, 𝑑 ′ and 𝑑 + 𝑑 ′ respectively, and
let B : U × V → W be the bilinear operator given by polynomial multiplication,
i.e. B(𝑠(𝑥), 𝑡(𝑥)) = 𝑠(𝑥)𝑡(𝑥) for any symmetric polynomials 𝑠(𝑥) of degree 𝑑 and
𝑡(𝑥) of degree 𝑑 ′. A well-known basis of the vector space of degree-𝑑 symmetric
polynomials is the Schur basis given by
{𝑠( 𝑝1 , 𝑝2 ,..., 𝑝𝑛 ) (𝑥) ∈ U : 𝑝 1 ≤ 𝑝 2 ≤ · · · ≤ 𝑝 𝑛 is an integer partition of 𝑑}.
Let 𝒜, ℬ, 𝒞 be the respective Schur bases for U, V, W. In this case the coefficients
in (3.16) are called Littlewood–Richardson coefficients, and determining their val-
ues is a #P-complete problem11 (Narayanan 2006). In other words, determining the
hypermatrix representation 𝐴 of the bilinear operator B is #P-hard. Littlewood–
Richardson coefficients are not as esoteric as one might think but have significance
in linear algebra and numerical linear algebra alike. Among other things they play
a central role in the resolution of Horn’s conjecture about the eigenvalues of a sum
of Hermitian matrices (Klyachko 1998, Knutson and Tao 1999).
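For readers who wish to experiment, the following Python sketch (using SymPy; illustrative only, and far removed from how such computations are done at scale) builds Schur polynomials from the generalized Vandermonde ratio above and verifies a single Littlewood–Richardson identity, 𝑠₍₀,₀,₁₎ · 𝑠₍₀,₀,₁₎ = 𝑠₍₀,₀,₂₎ + 𝑠₍₀,₁,₁₎, in three variables.

```python
import sympy as sp

x = sp.symbols('x1 x2 x3')
n = len(x)

def gen_vandermonde(d):
    """V_(d1,...,dn)(x) with (i, j) entry x_j**d_i."""
    return sp.Matrix(n, n, lambda i, j: x[j]**d[i])

def schur(p):
    """s_(p1,...,pn) = det V_(p1, p2+1, ..., pn+n-1) / det V_(0,1,...,n-1)."""
    num = gen_vandermonde([p[i] + i for i in range(n)]).det()
    den = gen_vandermonde(list(range(n))).det()
    return sp.expand(sp.cancel(num / den))

s1, s2, s11 = schur((0, 0, 1)), schur((0, 0, 2)), schur((0, 1, 1))

# two entries of the hypermatrix of B (Littlewood-Richardson coefficients) equal 1:
assert sp.expand(s1 * s1 - (s2 + s11)) == 0
```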
As we mentioned earlier, a side benefit of definition ➁ is that it allows us to
work with arbitrary vector spaces. In the previous example, U, V, W are vector
spaces of degree-𝑑 symmetric polynomials for various values of 𝑑; in the next one
we will have U = V = W = S𝑛 , the vector space of 𝑛 × 𝑛 symmetric matrices.
Aside from showing that hypermatrix representations may be neither feasible nor
useful, Examples 3.5 and 3.6 also show that there are usually good reasons to work
intrinsically with whatever vector spaces one is given.
Example 3.6 (higher gradients of log determinant). The key to all interior point
methods is a barrier function that traps iterates within the feasible region of a convex
program. In semidefinite programming, the optimal barrier function with respect
to iteration complexity is the log barrier function for the cone of positive definite
matrices $\mathbb{S}^n_{++}$,
\[
f : \mathbb{S}^n_{++} \to \mathbb{R}, \qquad f(X) = -\log\det X.
\]
Using the characterization in Example 3.4 with the inner product $\mathbb{S}^n \times \mathbb{S}^n \to \mathbb{R}$,
$(H_1, H_2) \mapsto \operatorname{tr}(H_1^{\mathsf{T}} H_2)$, we may show that its gradient is given by
\[
\nabla f : \mathbb{S}^n_{++} \to \mathbb{S}^n, \qquad \nabla f(X) = -X^{-1}, \tag{3.29}
\]
and its Hessian at any $X \in \mathbb{S}^n_{++}$ is the linear map
\[
\nabla^2 f(X) : \mathbb{S}^n \to \mathbb{S}^n, \qquad H \mapsto X^{-1} H X^{-1}, \tag{3.30}
\]
expressions we may also find in Boyd and Vandenberghe (2004) and Renegar
(2001). While we may choose bases on S𝑛 and artificially write the gradient and
Hessian of 𝑓 in the forms (3.28), interested readers may check that they are a horrid
mess that obliterates all insights and advantages proffered by (3.29) and (3.30).
11 This means it is as intractable as evaluating the permanent of a matrix whose entries are zeros
and ones (Valiant 1979). A #P-complete problem is at least as hard as any NP-complete problem:
for example, deciding whether a graph is 3-colourable is NP-complete, but counting its proper
3-colourings is #P-complete.
Among other things, (3.29) and (3.30) allow one to exploit specialized algorithms
for matrix product and inversion.
The third-order gradient ∇3 𝑓 also plays an important role as we need it to
ascertain self-concordance in Example 3.16. By our discussion in Example 3.4,
for any $X \in \mathbb{S}^n_{++}$, this is a bilinear operator
\[
\nabla^3 f(X) : \mathbb{S}^n \times \mathbb{S}^n \to \mathbb{S}^n,
\]
and by (3.27), we may differentiate (3.30) to get
[∇3 𝑓 (𝑋)](𝐻1 , 𝐻2 ) = −𝑋 −1 𝐻1 𝑋 −1 𝐻2 𝑋 −1 − 𝑋 −1 𝐻2 𝑋 −1 𝐻1 𝑋 −1 . (3.31)
Repeatedly applying (3.27) gives the $(d-1)$-linear operator
\[
\nabla^d f(X) : \mathbb{S}^n \times \cdots \times \mathbb{S}^n \to \mathbb{S}^n, \qquad
(H_1, \ldots, H_{d-1}) \mapsto (-1)^d \sum_{\sigma \in \mathfrak{S}_{d-1}} X^{-1} H_{\sigma(1)} X^{-1} H_{\sigma(2)} X^{-1} \cdots X^{-1} H_{\sigma(d-1)} X^{-1},
\]
as the 𝑑th gradient of 𝑓 . As interested readers may again check for themselves,
expressing this as a 𝑑-dimensional hypermatrix is even less illuminating than ex-
pressing (3.29) and (3.30) as one- and two-dimensional hypermatrices. Multilinear
maps are essential for discussing higher derivatives and gradients of multivariate
functions.
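As a sanity check that costs nothing, the following Python sketch (NumPy only; ours, not from the references) verifies (3.29)–(3.31) by finite differences at a random 𝑋 ∈ S^𝑛₊₊, working directly with symmetric matrices rather than a vectorized hypermatrix representation.

```python
import numpy as np

rng = np.random.default_rng(0)
sym = lambda M: (M + M.T) / 2

n, t = 5, 1e-6
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)                         # a point in S^n_{++}
H1, H2 = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))
Xinv = np.linalg.inv(X)
f = lambda X: -np.log(np.linalg.det(X))

# (3.29): directional derivative of f along H1 equals <-X^{-1}, H1> = -tr(X^{-1} H1)
fd = (f(X + t * H1) - f(X - t * H1)) / (2 * t)
assert np.isclose(fd, -np.trace(Xinv @ H1), rtol=1e-5)

# (3.30): derivative of grad f = -X^{-1} along H1 equals X^{-1} H1 X^{-1}
fd = (-np.linalg.inv(X + t * H1) + np.linalg.inv(X - t * H1)) / (2 * t)
assert np.allclose(fd, Xinv @ H1 @ Xinv, rtol=1e-5, atol=1e-8)

# (3.31): derivative of the Hessian along H1, applied to H2
fd = (np.linalg.inv(X + t * H1) @ H2 @ np.linalg.inv(X + t * H1)
      - np.linalg.inv(X - t * H1) @ H2 @ np.linalg.inv(X - t * H1)) / (2 * t)
assert np.allclose(fd, -(Xinv @ H1 @ Xinv @ H2 @ Xinv + Xinv @ H2 @ Xinv @ H1 @ Xinv),
                   rtol=1e-4, atol=1e-7)
```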
In Examples 3.4 and 3.6, the matrices in S𝑛 are 1-tensors even though they are
doubly indexed objects. This should come as no surprise after Example 2.3: the
order of a tensor is not determined by the number of indices. We will see this in
Example 3.7 again, where we will encounter a hypermatrix with 3𝑛 + 3 indices
which nonetheless represents a 3-tensor.
We conclude this section with some infinite-dimensional examples. Much like
linear operators, there is not much that one could say about multilinear operators
over infinite-dimensional vector spaces that is purely algebraic. The topic becomes
much more interesting when one brings in analytic notions by equipping the vector
spaces with norms or inner products.
As is well known, it is easy to ascertain the continuity of linear operators between
Banach spaces: Φ : V → W is continuous if and only if it is bounded in the sense
of kΦ(𝑣)k ≤ 𝑐k𝑣k for some constant 𝑐 > 0 and for all 𝑣 ∈ V, i.e. if and only if
Φ ∈ B(V, W). We have slightly abused notation by not distinguishing the norms
on different spaces and will continue to do so below. Almost exactly the same
proof extends to multilinear operators on Banach spaces: Φ : V1 × · · · × V𝑑 → W
is continuous if and only if it is bounded in the sense of
kΦ(𝑣 1 , . . . , 𝑣 𝑑 )k ≤ 𝑐k𝑣 1 k · · · k𝑣 𝑑 k
for some constant 𝑐 > 0 and for all 𝑣 1 ∈ V1 , . . . , 𝑣 𝑑 ∈ V𝑑 (Lang 1993, Chapter IV,
Section 1), i.e. if and only if its spectral norm as defined in (3.19) is finite.
So if V1 , . . . , V𝑑 are finite-dimensional, then Φ is automatically continuous; in
\[
\|\Phi\|'' = \sup_{v \neq 0} \frac{\|\Phi(v)\|'}{\|v\|}, \qquad
\|\mathrm{M}\|_\sigma = \sup_{\Phi \neq 0,\, v \neq 0} \frac{\|\mathrm{M}(\Phi, v)\|'}{\|\Phi\|''\, \|v\|} = 1,
\]
and thus it is continuous. Next we will look at an actual Banach space of functions.
Another quintessential bilinear operator is the convolution of two functions 𝑓 , 𝑔 ∈
𝐿 1 (R𝑛 ), defined by
\[
f * g(x) \coloneqq \int_{\mathbb{R}^n} f(x - y)\, g(y)\, \mathrm{d}y.
\]
The corresponding bilinear operator $\mathrm{B}_*$ has spectral norm
\[
\|\mathrm{B}_*\|_\sigma = \sup_{f, g \neq 0} \frac{\|f * g\|_1}{\|f\|_1 \|g\|_1} \le 1,
\]
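A discrete analogue is easy to check in Python (NumPy; a sketch of the ℓ¹ case for finite sequences, not of the integral operator itself): ‖𝑓 ∗ 𝑔‖₁ ≤ ‖𝑓‖₁‖𝑔‖₁, with equality when 𝑓, 𝑔 ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(1)
f, g = rng.standard_normal(100), rng.standard_normal(80)

conv = np.convolve(f, g)
assert np.abs(conv).sum() <= np.abs(f).sum() * np.abs(g).sum() + 1e-10
assert np.isclose(np.convolve(np.abs(f), np.abs(g)).sum(),
                  np.abs(f).sum() * np.abs(g).sum())
```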
orthogonal expansions:
\[
\Phi(f) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \langle \Phi(e_i), e_j\rangle \langle f, e_i\rangle\, e_j, \qquad
\mathrm{B}(f, g) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} \langle \mathrm{B}(e_i, e_j), e_k\rangle \langle f, e_i\rangle \langle g, e_j\rangle\, e_k
\]
for any 𝑓 , 𝑔 ∈ H. Convergence of these infinite series is, as usual, in the norm
induced by the inner product, and it follows from convergence that the hypermatrices
representing Φ and B with respect to ℬ are 𝑙 2 -summable, that is,
\[
\bigl(\langle \Phi(e_i), e_j\rangle\bigr)_{i,j=1}^{\infty} \in l^2(\mathbb{N} \times \mathbb{N}), \qquad
\bigl(\langle \mathrm{B}(e_i, e_j), e_k\rangle\bigr)_{i,j,k=1}^{\infty} \in l^2(\mathbb{N} \times \mathbb{N} \times \mathbb{N})
\]
Banach spaces such as 𝐿 𝑝 (R𝑛 ), 𝑝 ∈ [1, ∞). Its continuous dual space,
𝑆 ′ (R𝑛 ) ≔ B(𝑆(R𝑛 ); R),
is the space of tempered distributions. It follows from Schwartz’s kernel theorem
(see Example 4.2) that any continuous bilinear operator B : 𝑆(R𝑛 )×𝑆(R𝑛 ) → 𝑆 ′ (R𝑛 )
that converges in the 𝐿 𝑝 -norm, much like Parseval’s identity for a Hilbert space,
even though we do not have a Hilbert space and ℬ𝜓 is not an orthonormal basis (in
the language of Example 4.47, ℬ𝜓 is a tight wavelet frame with frame constant 1).
Furthermore, if the matrix¹²
\[
A : \mathbb{Z}^{n+1} \times \mathbb{Z}^{n+1} \to \mathbb{R}
\]
satisfies an ‘almost diagonal’ growth condition that essentially says that $a_{(i,\lambda),(j,\mu)}$
is small whenever $(i,\lambda)$ and $(j,\mu)$ are far apart, then defining
\[
\Phi_A(f) \coloneqq \sum_{(i,\lambda) \in \mathbb{Z}^{n+1}} \sum_{(j,\mu) \in \mathbb{Z}^{n+1}} a_{(i,\lambda),(j,\mu)} \langle f, \psi_{i,\lambda}\rangle\, \psi_{j,\mu}
\]
(Grafakos and Torres 2002a, Theorem 1). We refer the reader to these references
for the exact statement of the ‘almost diagonal’ growth conditions.
Aside from convolution, the best-known continuous bilinear operator is probably
the bilinear Hilbert transform
\[
\mathrm{H}(f, g)(x) \coloneqq \lim_{\varepsilon \to 0} \int_{|y| > \varepsilon} f(x + y)\, g(x - y)\, \frac{\mathrm{d}y}{y}, \tag{3.32}
\]
being a bilinear extension of the Hilbert transform
\[
\mathrm{H}(f)(x) \coloneqq \lim_{\varepsilon \to 0} \int_{|y| > \varepsilon} f(x - y)\, \frac{\mathrm{d}y}{y}.
\]
The latter, according to Krantz (2009, p. 15) ‘is, without question, the most im-
portant operator in analysis.’ While it is a standard result that the Hilbert transform
is continuous as a linear operator H : 𝐿 𝑝 (R) → 𝐿 𝑝 (R), 𝑝 ∈ (1, ∞), and we even
know the exact value of its operator/spectral norm (Grafakos 2014, Remark 5.1.8),
\[
\|\mathrm{H}\|_\sigma = \begin{cases} \tan\dfrac{\pi}{2p} & 1 < p \le 2,\\[6pt] \cot\dfrac{\pi}{2p} & 2 \le p < \infty, \end{cases}
\]
the continuity of its bilinear counterpart had been a long-standing open problem.
It was resolved by Lacey and Thiele (1997, 1999), who showed that as a bilinear
operator, H : 𝐿 𝑝 (R) × 𝐿 𝑞 (R) → 𝐿 𝑟 (R) is continuous whenever
\[
\frac{1}{p} + \frac{1}{q} = \frac{1}{r}, \qquad 1 < p, q \le \infty, \qquad \frac{2}{3} < r < \infty,
\]
that is, there exists 𝑐 > 0 such that
kH( 𝑓 , 𝑔)k 𝑟 ≤ 𝑐k 𝑓 k 𝑝 k𝑔k 𝑞
for all 𝑓 ∈ 𝐿 𝑝 (R) and 𝑔 ∈ 𝐿 𝑞 (R). The special case 𝑝 = 𝑞 = 2, 𝑟 = 1, open for
more than thirty years, was known as the Calderón conjecture.
The study of infinite-dimensional multilinear operators along the above lines has
become a vast undertaking, sometimes called multilinear harmonic analysis (Mus-
calu and Schlag 2013), with a multilinear Calderón–Zygmund theory (Grafakos and
Torres 2002b) and profound connections to wavelets (Meyer and Coifman 1997)
among its many cornerstones.
by linearity of 𝑣 ∗𝑖 and the fact that 𝑣 ∗𝑖 (𝑣 𝑗 ) = 𝛿𝑖 𝑗 . Since this holds for all 𝑢 ∈ U, we
have
𝑣 ∗𝑖 ◦ Φ = 𝑎𝑖 .
So 𝑎𝑖 : U → R is a linear functional as 𝑣 ∗𝑖 and Φ are both linear. Switching back
to our usual notation of denoting linear functionals as 𝜑𝑖 instead of 𝑎𝑖 , we see that
every linear operator Φ : U → V takes the form
\[
\Phi(u) = \sum_{j=1}^{n} \varphi_j(u)\, v_j
\]
so we must have
\[
\beta(u, \cdot) = \sum_{i=1}^{n} a_i\, v_i^*.
\]
Evaluating at 𝑣 𝑗 gives
\[
\beta(u, v_j) = \sum_{i=1}^{n} a_i(u)\, v_i^*(v_j) = a_j(u)
\]
where the last step is simply relabelling the indices, noting that both are sums of
terms of the form 𝜑(𝑢)𝜓(𝑣)𝑤, with 𝑟 = 𝑛𝑝. The smallest 𝑟, that is,
\[
\operatorname{rank}(\mathrm{B}) = \min\biggl\{ r : \mathrm{B}(u, v) = \sum_{i=1}^{r} \varphi_i(u)\, \psi_i(v)\, w_i \biggr\}, \tag{3.37}
\]
is called the tensor rank of B, a notion that may be traced to Hitchcock (1927,
equations 2 and 2𝑎 ) and will play a critical role in the next section.
The same line of argument may be repeated on a trilinear functional 𝜏 : U × V ×
W → R to show that they are just sums of products of linear functionals
\[
\tau(u, v, w) = \sum_{i=1}^{r} \varphi_i(u)\, \psi_i(v)\, \theta_i(w) \tag{3.38}
\]
and with it a corresponding notion of tensor rank. More generally, any 𝑑-linear
map Φ : V1 × V2 × · · · × V𝑑 → W is built up of linear functionals and vectors,
\[
\Phi(v_1, \ldots, v_d) = \sum_{i=1}^{r} \varphi_i(v_1)\, \psi_i(v_2) \cdots \theta_i(v_d)\, w_i,
\]
where the last 𝑤 𝑖 may be dropped if it is a 𝑑-linear functional with W = R.
Consequently, we see that the ‘multilinearness’ in any multilinear map comes from
that of
R𝑑 ∋ (𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 ) ↦→ 𝑥1 𝑥2 · · · 𝑥 𝑑 ∈ R.
Take (3.38) as illustration: where does the ‘trilinearity’ of 𝜏 come from? Say we
look at the middle argument; then
\[
\begin{aligned}
\tau(u, \lambda v + \lambda' v', w) &= \sum_{i=1}^{r} \varphi_i(u)\, \psi_i(\lambda v + \lambda' v')\, \theta_i(w)\\
&= \sum_{i=1}^{r} \varphi_i(u)\, [\lambda \psi_i(v) + \lambda' \psi_i(v')]\, \theta_i(w)\\
&= \lambda \sum_{i=1}^{r} \varphi_i(u)\, \psi_i(v)\, \theta_i(w) + \lambda' \sum_{i=1}^{r} \varphi_i(u)\, \psi_i(v')\, \theta_i(w)\\
&= \lambda \tau(u, v, w) + \lambda' \tau(u, v', w).
\end{aligned}
\]
The reason why it is linear in the middle argument is simply a result of
𝑥(𝜆𝑦 + 𝜆 ′ 𝑦 ′)𝑧 = 𝜆𝑥𝑦𝑧 + 𝜆 ′𝑥𝑦 ′ 𝑧,
which is in turn a result of the trilinearity of (𝑥, 𝑦, 𝑧) ↦→ 𝑥𝑦𝑧. All ‘multilinearness’
for some 𝑟 ∈ N with the smallest possible 𝑟 given by the tensor rank of B. Note that
any decomposition of the form in (3.41) gives us an explicit algorithm for computing
B with 𝑟 multiplications, and thus rank(B) gives us the bilinear complexity or least
number of multiplications required to compute B. This relation between tensor
rank and evaluation of bilinear operators first appeared in Strassen (1973).
As numerical computations go, there is no need to compute a quantity exactly.
What if we just require the right-hand side (remember this gives an algorithm) of
(3.41) to be an 𝜀-approximation of the left-hand side? This leads to the notion of
border rank:
\[
\underline{\operatorname{rank}}(\mathrm{B}) = \min\biggl\{ r : \mathrm{B}(u, v) = \lim_{\varepsilon \to 0^+} \sum_{i=1}^{r} \varphi_i^{\varepsilon}(u)\, \psi_i^{\varepsilon}(v)\, w_i^{\varepsilon} \biggr\}. \tag{3.42}
\]
This was first proposed by Bini, Capovani, Romani and Lotti (1979) and Bini, Lotti
and Romani (1980) and may be regarded as providing an algorithm (remembering
that every such decomposition gives an algorithm)
\[
\mathrm{B}^{\varepsilon}(u, v) = \sum_{i=1}^{r} \varphi_i^{\varepsilon}(u)\, \psi_i^{\varepsilon}(v)\, w_i^{\varepsilon}
\]
The left-hand side clearly has rank no more than two; one may show that as long
as 𝜑1 , 𝜑2 are not collinear, and likewise for 𝜓1 , 𝜓2 and 𝑤 1 , 𝑤 2 , the right-hand side
of (3.43) must have rank three, that is, it defines a bilinear operator with rank three
and border rank two.
Tensor rank and border rank are purely algebraic notions defined over any vector
spaces, or even modules, which are generalizations of vector spaces that we will
soon introduce. However, if U, V, W are norm spaces, we may introduce various
notions of norms on bilinear operators B : U × V → W. We will slightly abuse
notation by denoting the norms on all three spaces by k · k. Recall that for a linear
functional 𝜓 : V → R, its dual norm is just
\[
\|\psi\|_* \coloneqq \sup_{v \neq 0} \frac{|\psi(v)|}{\|v\|}. \tag{3.44}
\]
and we call this a tensor nuclear norm. This defines a norm dual to the spectral
norm in (3.19), which in this case becomes
\[
\|\mathrm{B}\|_\sigma = \sup_{u, v \neq 0} \frac{\|\mathrm{B}(u, v)\|}{\|u\| \|v\|}. \tag{3.46}
\]
We will argue later that the tensor nuclear norm, in an appropriate sense, quantifies
the optimal numerical stability of computing B just as tensor rank quantifies bilinear
complexity.
It will be instructive to begin from some low-dimensional examples where U, V,
W are of dimensions two and three.
One may show that $\operatorname{rank}(\mathrm{B}_{\mathbb{C}}) = 3 = \underline{\operatorname{rank}}(\mathrm{B}_{\mathbb{C}})$, that is, Gauss's algorithm has
optimal bilinear complexity whether in the exact or approximate sense. While
using Gauss’s algorithm for actual multiplication of complex numbers is pointless
overkill, it is actually useful in practice (Higham 1992) as one may use it for the
multiplication of complex matrices:
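A runnable Python sketch of this idea (NumPy; we use one standard arrangement of the three real matrix products, which is one of several equivalent variants) is as follows.

```python
import numpy as np

def gauss_complex_matmul(A, B, C, D):
    """(A + iB)(C + iD) with three real matrix multiplications instead of four."""
    T1 = (A + B) @ C          # AC + BC
    T2 = A @ (D - C)          # AD - AC
    T3 = B @ (C + D)          # BC + BD
    return T1 - T3, T1 + T2   # real part AC - BD, imaginary part AD + BC

rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((50, 50)) for _ in range(4))
E, F = gauss_complex_matmul(A, B, C, D)
Z = (A + 1j * B) @ (C + 1j * D)
assert np.allclose(E, Z.real) and np.allclose(F, Z.imag)
```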
An algorithm similar to Gauss’s algorithm would yield the result in three multi-
plications, and it is optimal going by either rank or border rank. For bilinear
maps on three-dimensional vector spaces, a natural example is the bilinear operator
B∧ : ∧2 (R𝑛 ) × R𝑛 → R𝑛 given by the 3 × 3 skew-symmetric matrix–vector product
\[
\begin{bmatrix} 0 & a & b\\ -a & 0 & c\\ -b & -c & 0 \end{bmatrix}
\begin{bmatrix} x\\ y\\ z \end{bmatrix}
=
\begin{bmatrix} ay + bz\\ -ax + cz\\ -bx - cy \end{bmatrix}.
\]
In this case $\operatorname{rank}(\mathrm{B}_{\wedge}) = 5 = \underline{\operatorname{rank}}(\mathrm{B}_{\wedge})$; see Ye and Lim (2018a, Proposition 12)
and Krishna and Makam (2018, Theorem 1.3). For a truly interesting example,
one would have to look at bilinear maps on four-dimensional vector spaces and we
shall do so next.
As in the case of Gauss’s algorithm, the saving of one multiplication comes at the
cost of an increase in the number of additions/subtractions from eight to fifteen.
In fact Strassen’s original version had eighteen; the version presented here is the
well-known but unpublished Winograd variant discussed in Knuth (1998, p. 500)
and Higham (2002, equation 23.6). Recursively applying this algorithm to 2 × 2
block matrices produces an algorithm for multiplying 𝑛 × 𝑛 matrices with 𝑂(𝑛log2 7 )
multiplications. More generally, the bilinear operator defined by
\[
\mathrm{M}_{m,n,p} : \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p} \to \mathbb{R}^{m \times p}, \qquad (A, B) \mapsto AB,
\]
is called either the matrix multiplication tensor or Strassen's tensor. Every decomposition
\[
\mathrm{M}_{m,n,p}(A, B) = \sum_{i=1}^{r} \varphi_i(A)\, \psi_i(B)\, W_i
\]
Strassen’s work showed that 𝜔 < log2 7 ≈ 2.807 354 9, and this has been improved
over the years to 𝜔 < 2.372 859 6 at the time of writing (Alman and Williams 2021).
Recall that any linear functional 𝜑 : R𝑚×𝑛 → R must take the form 𝜑(𝐴) = tr(𝑉 T 𝐴)
for some matrix 𝑉 ∈ R𝑚×𝑛 , a consequence of the Riesz representation theorem
for an inner product space. For concreteness, when 𝑚 = 𝑛 = 𝑝 = 2, Winograd’s
variant of Strassen’s algorithm is given by
\[
\mathrm{M}_{2,2,2}(A, B) = \sum_{i=1}^{7} \varphi_i(A)\, \psi_i(B)\, W_i,
\]
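For concreteness, here is a recursive Python sketch (NumPy; it uses the classical seven-product variant of Strassen's algorithm rather than the Winograd arrangement, assumes 𝑛 is a power of two, and falls back to ordinary multiplication on small blocks).

```python
import numpy as np

def strassen(A, B, leaf=64):
    n = A.shape[0]
    if n <= leaf:
        return A @ B                          # ordinary multiplication on small blocks
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
assert np.allclose(strassen(A, B), A @ B)
```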
measures the total size of the intermediate quantities in the algorithm (3.52) and
its minimum value, given by the nuclear norm of B as defined in (3.45), provides a
measure of optimal numerical stability in the sense of Higham’s second guideline.
This was first discussed in Ye and Lim (2018a, Section 3.2). Using complex
multiplication BC in Example 3.8 for illustration, one may show that the nuclear
norm (Friedland and Lim 2018, Lemma 6.1) is given by
kBC k 𝜈 = 4.
gives an algorithm that attains both rank(BC ) and kBC k 𝜈 . In conventional notation,
this algorithm multiplies two complex numbers via
\[
\begin{aligned}
(a + bi)(c + di) = {}& \frac12\biggl[\Bigl(a + \frac{1}{\sqrt3} b\Bigr)\Bigl(c + \frac{1}{\sqrt3} d\Bigr) + \Bigl(a - \frac{1}{\sqrt3} b\Bigr)\Bigl(c - \frac{1}{\sqrt3} d\Bigr) - \frac{8}{3} bd\biggr]\\
& + \frac{i\sqrt3}{2}\biggl[\Bigl(a + \frac{1}{\sqrt3} b\Bigr)\Bigl(c + \frac{1}{\sqrt3} d\Bigr) - \Bigl(a - \frac{1}{\sqrt3} b\Bigr)\Bigl(c - \frac{1}{\sqrt3} d\Bigr)\biggr].
\end{aligned}
\]
We stress that the nuclear norm is solely a measure of stability in the sense of
Higham’s second guideline. Numerical stability is too complicated an issue to
be adequately captured by a single number. For instance, from the perspective of
cancellation errors, our algorithm above also suffers from the same issue pointed
out in Higham
(2002, Section 23.2.4) for Gauss's algorithm. By choosing $z = w$ and $b = \sqrt3/a$, our algorithm computes
\[
\frac12\biggl[\Bigl(a + \frac1a\Bigr)^2 + \Bigl(a - \frac1a\Bigr)^2 - \frac{8}{a^2}\biggr] + \frac{i\sqrt3}{2}\biggl[\Bigl(a + \frac1a\Bigr)^2 - \Bigl(a - \frac1a\Bigr)^2\biggr] \eqqcolon x + iy.
\]
There will be cancellation error in the computed real part $\hat{x}$ when $|a|$ is small
15 We have gone to some lengths to avoid the tensor product ⊗ in this section, preferring to defer it
to Section 4. The decompositions (3.48), (3.49), (3.53) would be considerably neater if expressed
in tensor product form, giving another impetus for definition ➂.
and likewise in the computed imaginary part $\hat{y}$ when $|a|$ is large. Nevertheless,
as discussed in Higham (2002, Section 23.2.4), the algorithm is still stable in the
weaker sense of having acceptably small $|x - \hat{x}|/|z|$ and $|y - \hat{y}|/|z|$ even if $|x - \hat{x}|/|x|$
or $|y - \hat{y}|/|y|$ might be large.
As is the case for Example 3.8, the algorithm for complex multiplication above
is useful only when applied to complex matrices. When 𝐴, 𝐵, 𝐶, 𝐷 ∈ R𝑛×𝑛 , the
algorithms in Examples 3.8 and 3.11 provide substantial savings when used to
multiply 𝐴 + 𝑖𝐵, 𝐶 + 𝑖𝐷 ∈ C𝑛×𝑛 . This gives a good reason for extending multilinear
maps and tensors to modules (Lang 2002), i.e. vector spaces whose field of scalars
is replaced by other more general rings. Formally, if 𝑅 is a ring with multiplicative
identity 1 (which we will assume henceforth), an 𝑅-module V is a commutative
group under a group operation denoted + and has a scalar multiplication operation
denoted · such that, for all 𝑎, 𝑏 ∈ 𝑅 and 𝑣, 𝑤 ∈ V,
(i) 𝑎 · (𝑣 + 𝑤) = 𝑎 · 𝑣 + 𝑎 · 𝑤,
(ii) (𝑎 + 𝑏) · 𝑣 = 𝑎 · 𝑣 + 𝑏 · 𝑣,
(iii) (𝑎𝑏) · 𝑣 = 𝑎 · (𝑏 · 𝑣),
(iv) 1 · 𝑣 = 𝑣.
Clearly, when 𝑅 = R or C or any other field, a module just becomes a vector
space. What is useful about the notion is that it allows us to include rings of objects
we would not normally consider as ‘scalars’. For example, in (3.47) we regard
C as a two-dimensional vector space over R, but in (3.50) we regard C𝑛×𝑛 as a
two-dimensional16 module over R𝑛×𝑛 . So in the latter the ‘scalars’ are actually
matrices, i.e. 𝑅 = R𝑛×𝑛 . When we consider block matrix operations on square
matrices such as those on page 70, we are implicitly doing linear algebra over the
ring 𝑅 = R𝑛×𝑛 , which is not even commutative.
Many standard notions in linear and multilinear algebra carry through from
vector spaces to modules with little or no change. For example, the multilinear
maps and multilinear functionals of Definitions 3.1 and 3.3 apply verbatim to
modules with the field of scalars R replaced by any ring 𝑅. In other words, the
notion of tensors in the sense of definition ➁ applies equally well over modules.
We will discuss three examples of multilinear maps over modules: tensor fields,
fast integer multiplication algorithms and cryptographic multilinear maps.
When people speak of tensors in physics or geometry, they often really mean a
tensor field. As a result one may find statements of the tensor transformation rules
that bear little resemblance to our version. The next two examples are intended
to describe the analogues of definitions ➀ and ➁ for a tensor field and show how
they fit into the narrative of this article. Also, outside pure mathematics, defining
a tensor field is by far the biggest reason for considering multilinear maps over
modules.
16 Strictly speaking, the terminology over modules should be length instead of dimension.
Example 3.12 (modules and tensor fields). A tensor field is – roughly speaking
– a tensor-valued function on a manifold. Let 𝑀 be a smooth manifold and
𝐶 ∞(𝑀) ≔ { 𝑓 : 𝑀 → R : 𝑓 is smooth}. Note that 𝐶 ∞ (𝑀) is a ring, as products
and linear combinations of smooth real-valued functions are smooth (henceforth
we drop the word ‘smooth’ as all functions and fields in this example are assumed
smooth). Thus 𝑅 = 𝐶 ∞ (𝑀) may play the role of the ring of scalars. A 0-tensor
field is just a function in 𝐶 ∞ (𝑀). A contravariant 1-tensor field is a vector field,
i.e. a function 𝑓 whose value at 𝑥 ∈ 𝑀 is a vector in the tangent space at 𝑥, denoted
T 𝑥 (𝑀). A covariant 1-tensor field is a covector field, i.e. a function 𝑓 whose value
at 𝑥 ∈ 𝑀 is a vector in the cotangent space at 𝑥, denoted T∗𝑥 (𝑀). Here T 𝑥 (𝑀) is a
vector space and T∗𝑥 (𝑀) is its dual vector space, in which case a 𝑑-tensor field of
contravariant order 𝑝 and covariant order 𝑑 − 𝑝 is simply a function 𝜑 whose value
at 𝑥 ∈ 𝑀 is such a 𝑑-tensor, that is, by Definition 3.3, a multilinear functional
\[
\varphi(x) : \underbrace{\mathrm{T}^*_x(M) \times \cdots \times \mathrm{T}^*_x(M)}_{p \text{ copies}} \times \underbrace{\mathrm{T}_x(M) \times \cdots \times \mathrm{T}_x(M)}_{d - p \text{ copies}} \to \mathbb{R}.
\]
This seems pretty straightforward; the catch is that 𝜑 is not a function in the usual
sense, which has a fixed domain and a fixed codomain, but the codomain of 𝜑
depends on the point 𝑥 where it is evaluated. So we have only defined a tensor field
at a point 𝑥 ∈ 𝑀 but we still need a way to ‘glue’ all these pointwise definitions
together. The customary way to do this is via coordinate charts and transition maps,
but an alternative is to simply define the tangent bundle
T(𝑀) ≔ {(𝑥, 𝑣) : 𝑥 ∈ 𝑀, 𝑣 ∈ T 𝑥 (𝑀)}
and cotangent bundle
T∗ (𝑀) ≔ {(𝑥, 𝜑) : 𝑥 ∈ 𝑀, 𝜑 ∈ T∗𝑥 (𝑀)}
and observe that these are 𝐶 ∞ (𝑀)-modules with scalar product given by pointwise
multiplication of real-valued functions with vector/covector fields. A 𝑑-tensor
field of contravariant order 𝑝 and covariant order 𝑑 − 𝑝 is then defined to be the
multilinear functional
\[
\varphi : \underbrace{\mathrm{T}^*(M) \times \cdots \times \mathrm{T}^*(M)}_{p \text{ copies}} \times \underbrace{\mathrm{T}(M) \times \cdots \times \mathrm{T}(M)}_{d - p \text{ copies}} \to C^\infty(M).
\]
Note that this is a multilinear functional in the sense of modules: the ‘scalars’ are
drawn from a ring 𝑅 = 𝐶 ∞ (𝑀) and the ‘vector spaces’ are the 𝑅-modules T(𝑀)
and T∗ (𝑀). This is another upside of defining tensors via definition ➁; it may be
easily extended to include tensor fields.
What about definition ➀? The first thing to note is that not every result in linear
algebra over vector spaces carries over to modules. An example is the notion of
basis. While some modules do have a basis – for example, when we speak of
C𝑛×𝑛 as a two-dimensional R𝑛×𝑛 -module, it is with the basis {1, 𝑖} in mind – others
such as the 𝐶 ∞ (𝑀)-module of 𝑑-tensor fields may not have a basis when 𝑑 ≥ 1.
with entries in 𝑅 = 𝐶 ∞(R𝑛 ) that holds for all 𝑣 ∈ R𝑛 . The analogue of the
transformation rules is a bit more complicated since we are now allowed a non-
linear change of coordinates 𝑣 ′ = 𝐹(𝑣) as opposed to merely a linear change of
basis as in Section 2.2. Here the change-of-coordinates function 𝐹 : 𝑁(𝑣) → R𝑛
is any smooth function defined on a neighbourhood 𝑁(𝑣) ⊆ R𝑛 of 𝑣 that is locally
invertible, that is, the derivative 𝐷𝐹(𝑣) as defined in Example 3.2 is an invertible
linear map in a neighbourhood of 𝑣. This is sometimes called a curvilinear change
of coordinates. The analogue of the tensor transformation rule (2.12) for a 𝑑-tensor
field on R𝑛, equation (3.54), is expressed in terms of the Jacobian matrices
\[
DF(v) = \begin{bmatrix} \dfrac{\partial v_1'}{\partial v_1} & \cdots & \dfrac{\partial v_1'}{\partial v_n}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial v_n'}{\partial v_1} & \cdots & \dfrac{\partial v_n'}{\partial v_n} \end{bmatrix}, \qquad
DF(v)^{-1} = \begin{bmatrix} \dfrac{\partial v_1}{\partial v_1'} & \cdots & \dfrac{\partial v_1}{\partial v_n'}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial v_n}{\partial v_1'} & \cdots & \dfrac{\partial v_n}{\partial v_n'} \end{bmatrix}.
\]
For example, the change between Cartesian and spherical coordinates on R3 is given by
\[
\begin{aligned}
r &= \sqrt{x^2 + y^2 + z^2}, & x &= r\sin\theta\cos\varphi,\\
\theta &= \arctan\bigl(\sqrt{x^2 + y^2}/z\bigr), & y &= r\sin\theta\sin\varphi,\\
\varphi &= \arctan(y/x), & z &= r\cos\theta,
\end{aligned} \tag{3.55}
\]
and
\[
DF(v) = \frac{\partial(r, \theta, \varphi)}{\partial(x, y, z)} =
\begin{bmatrix}
\dfrac{x}{r} & \dfrac{y}{r} & \dfrac{z}{r}\\[6pt]
\dfrac{xz}{r^2\sqrt{x^2 + y^2}} & \dfrac{yz}{r^2\sqrt{x^2 + y^2}} & \dfrac{-(x^2 + y^2)}{r^2\sqrt{x^2 + y^2}}\\[6pt]
\dfrac{-y}{x^2 + y^2} & \dfrac{x}{x^2 + y^2} & 0
\end{bmatrix},
\]
\[
DF(v)^{-1} = \frac{\partial(x, y, z)}{\partial(r, \theta, \varphi)} =
\begin{bmatrix}
\sin\theta\cos\varphi & r\cos\theta\cos\varphi & -r\sin\theta\sin\varphi\\
\sin\theta\sin\varphi & r\cos\theta\sin\varphi & r\sin\theta\cos\varphi\\
\cos\theta & -r\sin\theta & 0
\end{bmatrix}.
\]
Note that either of the last two matrices may be expressed solely in terms of 𝑥, 𝑦, 𝑧
or solely in terms of 𝑟, 𝜃, 𝜑; the forms above are chosen for convenience. The
transformation rule (3.54) then allows us to transform a hypermatrix in 𝑥, 𝑦, 𝑧 to
one in 𝑟, 𝜃, 𝜑 or vice versa.
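The two Jacobians can be checked mechanically; the short SymPy sketch below (ours, and purely a numerical spot check at one point) verifies that they are indeed inverse to each other once (3.55) is substituted.

```python
import sympy as sp

x, y, z = sp.symbols('x y z', positive=True)
r = sp.sqrt(x**2 + y**2 + z**2)
theta = sp.atan(sp.sqrt(x**2 + y**2) / z)
phi = sp.atan(y / x)
DF = sp.Matrix([r, theta, phi]).jacobian([x, y, z])        # d(r,theta,phi)/d(x,y,z)

R, T, P = sp.symbols('rho t p', positive=True)
DFinv = sp.Matrix([R*sp.sin(T)*sp.cos(P),
                   R*sp.sin(T)*sp.sin(P),
                   R*sp.cos(T)]).jacobian([R, T, P])        # d(x,y,z)/d(r,theta,phi)

pt = {x: 1, y: 2, z: 3}
vals = {R: r.subs(pt), T: theta.subs(pt), P: phi.subs(pt)}
prod = (DF.subs(pt) * DFinv.subs(vals)).evalf()
assert (prod - sp.eye(3)).norm() < 1e-12
```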
Example 3.13 (tensor fields over manifolds). Since a smooth manifold 𝑀 loc-
ally resembles R𝑛 , we may take (3.54) as the local transformation rule for a tensor
field 𝜑. It describes how, for a point 𝑥 ∈ 𝑀, a hypermatrix 𝐴(𝑣(𝑥)) representing 𝜑
in one system of local coordinates 𝑣 : 𝑁(𝑥) → R𝑛 on a neighbourhood 𝑁(𝑥) ⊆ 𝑀
is related to another hypermatrix 𝐴(𝑣 ′(𝑥)) representing 𝜑 in another system of local
coordinates 𝑣 ′ : 𝑁 ′(𝑥) → R𝑛 . As is the case for the tensor transformation rule
(2.12), one may use the tensor field transformation rule (3.54) to ascertain whether
or not a physical or mathematical object represented locally by a hypermatrix
with entries in 𝐶 ∞(𝑀) is a tensor field; this is how one would usually show that
connection forms or Christoffel symbols are not tensor fields (Simmonds 1994,
pp. 64–65).
In geometry, tensor fields play a central role, as geometric structures are nearly
always tensor fields (exceptions such as Finsler metrics tend to be less studied).
The most common ones include Riemannian metrics, which are symmetric 2-tensor
fields 𝑔 : T(𝑀) × T(𝑀) → 𝐶 ∞ (𝑀), and symplectic forms, which are alternating
2-tensor fields 𝜔 : T(𝑀) × T(𝑀) → 𝐶 ∞(𝑀), toy examples of which we have
seen in (3.51). As we saw in the previous example, the change-of-coordinates
maps for tensors are invertible linear maps, but for tensor fields they are locally
invertible linear maps; these are called diffeomorphisms, Riemannian isometries,
symplectomorphisms, etc., depending on what geometric structures they preserve.
The analogues of the finite-dimensional matrix groups in (2.14) then become
infinite-dimensional Lie groups such as Diff(𝑀), Iso(𝑀, 𝑔), Symp(𝑀, 𝜔), etc.
Although beyond the scope of our article, there are two tensor fields that are too
important to be left completely unmentioned: the Ricci curvature tensor and the
Riemann curvature tensor. Without introducing additional materials, a straight-
forward way to define them is to use a special system of local coordinates called
Riemannian normal coordinates (Chern, Chen and Lam 1999, Chapter 5). For a
These are tensor fields – note that the coefficients depend on 𝑥 – so 𝑅(𝑥) will
generally be a different multilinear functional at different points 𝑥 ∈ 𝑀. The Ricci
curvature tensor will make a brief appearance in Example 4.33 on separation of
variables for PDEs. Riemann curvature, being a 4-tensor, is difficult to handle as
is, but when 𝑀 is embedded in Euclidean space, it appears implicitly in the form
of a 2-tensor called the Weingarten map (Bürgisser and Cucker 2013, Chapter 21)
or second fundamental form (Niyogi, Smale and Weinberger 2008), whose eigen-
values, called principal curvatures, give us condition numbers. There are other
higher-order tensor fields in geometry (Dodson and Poston 1991) such as the tor-
sion tensor, the Nijenhuis tensor (𝑑 = 3) and the Weyl curvature tensor (𝑑 = 4), all
of which are unfortunately beyond our scope.
In physics, it is probably fair to say that (i) most physical fields are tensor fields
and (ii) most tensors are tensor fields. For (i), while there are important exceptions
such as spinor fields, the most common fields such as temperature, pressure and
Higgs fields are scalar fields; electric, magnetic and flow velocity fields are vector
fields; the Cauchy stress tensor, Einstein tensor and Faraday electromagnetic tensor
are 2-tensor fields; higher-order tensor fields are rarer in physics but there are also
examples such as the Cotton tensor (García, Hehl, Heinicke and Macías 2004) and
the Lanczos tensor (Roberts 1995), both 3-tensor fields. The last five named tensors
also make the case for (ii): a ‘tensor’ in physics almost always means a tensor field
of order two or (more rarely) higher. We will describe the Cauchy stress tensor and
mention a few higher-order tensors related to it in Example 4.8.
The above examples are analytic in nature but the next two will be algebraic.
They show why it is useful to consider bilinear and more generally multilinear
maps for modules over Z/𝑚Z, the ring of integers modulo 𝑚.
Example 3.14 (modules and integer multiplication). Trivially, integer multipli-
cation B : Z × Z → Z, (𝑎, 𝑏) ↦→ 𝑎𝑏 is a bilinear map over the Z-module Z, but this
is not the relevant module structure that one exploits in fast integer multiplication
algorithms. Instead they are based primarily on two key ideas. The first idea is that
integers (assumed unsigned) may be represented as polynomials,
\[
a = \sum_{i=0}^{p-1} a_i \theta^i \eqqcolon a(\theta) \qquad\text{and}\qquad b = \sum_{j=0}^{p-1} b_j \theta^j \eqqcolon b(\theta)
\]
for some number base 𝜃, and the product has coefficients given by convolutions,
\[
ab = \sum_{k=0}^{2p-2} c_k \theta^k \eqqcolon c(\theta) \qquad\text{with}\qquad c_k = \sum_{i=0}^{k} a_i b_{k-i}.
\]
Let 𝑛 = 2𝑝 − 1 and pad the vectors of coefficients with enough zeros so that we may
consider (𝑎0 , . . . , 𝑎 𝑛−1 ), (𝑏0 , . . . , 𝑏 𝑛−1 ), (𝑐0 , . . . , 𝑐 𝑛−1 ) on an equal footing. The
second idea is to use the discrete Fourier transform17 (DFT) for some root of unity
𝜔 to perform the convolution,
\[
\begin{bmatrix} a_0'\\ a_1'\\ a_2'\\ \vdots\\ a_{n-1}' \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1\\
1 & \omega & \omega^2 & \cdots & \omega^{n-1}\\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(n-1)}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & \omega^{n-1} & \omega^{2(n-1)} & \cdots & \omega^{(n-1)(n-1)}
\end{bmatrix}
\begin{bmatrix} a_0\\ a_1\\ a_2\\ \vdots\\ a_{n-1} \end{bmatrix},
\qquad
\begin{bmatrix} b_0'\\ b_1'\\ b_2'\\ \vdots\\ b_{n-1}' \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1\\
1 & \omega & \omega^2 & \cdots & \omega^{n-1}\\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(n-1)}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & \omega^{n-1} & \omega^{2(n-1)} & \cdots & \omega^{(n-1)(n-1)}
\end{bmatrix}
\begin{bmatrix} b_0\\ b_1\\ b_2\\ \vdots\\ b_{n-1} \end{bmatrix},
\]
\[
\begin{bmatrix} c_0\\ c_1\\ c_2\\ \vdots\\ c_{n-1} \end{bmatrix} = \frac{1}{n}
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1\\
1 & \omega^{-1} & \omega^{-2} & \cdots & \omega^{-(n-1)}\\
1 & \omega^{-2} & \omega^{-4} & \cdots & \omega^{-2(n-1)}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & \omega^{-(n-1)} & \omega^{-2(n-1)} & \cdots & \omega^{-(n-1)(n-1)}
\end{bmatrix}
\begin{bmatrix} a_0' b_0'\\ a_1' b_1'\\ a_2' b_2'\\ \vdots\\ a_{n-1}' b_{n-1}' \end{bmatrix},
\]
taking advantage of the following well-known property: a Fourier transform F turns
convolution product ∗ into pointwise product · and the inverse Fourier transform
turns it back, that is,
𝑎 ∗ 𝑏 = F−1 (F(𝑎) · F(𝑏)).
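A toy Python sketch of these two ideas (ours; it uses NumPy's complex FFT purely for illustration, whereas practical implementations work with a modular, number-theoretic transform) is given below.

```python
import numpy as np

def digits(x, theta):
    d = []
    while x:
        d.append(x % theta)
        x //= theta
    return d or [0]

def fft_multiply(a, b, theta=10):
    da, db = digits(a, theta), digits(b, theta)
    n = len(da) + len(db) - 1                                 # length of the convolution
    fa, fb = np.fft.fft(da, n), np.fft.fft(db, n)
    c = np.rint(np.fft.ifft(fa * fb).real).astype(np.int64)   # c_k = sum_i a_i b_{k-i}
    out, carry = 0, 0
    for k, ck in enumerate(c):                                # carry propagation
        total = int(ck) + carry
        out += (total % theta) * theta**k
        carry = total // theta
    return out + carry * theta**n

assert fft_multiply(123456789, 987654321) == 123456789 * 987654321
```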
Practical considerations inform the way we choose 𝜃 and 𝜔. We choose 𝜃 = 2^𝑠
17 A slight departure from (3.40) is that we dropped the 1/√𝑛 coefficient from our DFT and instead
put a 1/𝑛 with our inverse DFT to avoid surds.
primitive root of unity 𝑔 ∈ (Z/𝑝Z)× , that is, 𝑔 generates the multiplicative group
of integers modulo 𝑝 in the sense that every non-zero 𝑥 ∈ Z/𝑝Z may be expressed
as 𝑥 = 𝑔 𝑎 (group theory notation) or 𝑥 ≡ 𝑔 𝑎 (mod 𝑝) (number theory notation)
for some 𝑎 ∈ Z. Alice will pick a secret 𝑎 ∈ Z and send 𝑔 𝑎 publicly to Bob;
Bob will pick a secret 𝑏 ∈ Z and send 𝑔 𝑏 publicly to Alice. Alice, knowing the
value of 𝑎, may compute 𝑔 𝑎𝑏 = (𝑔 𝑏 )𝑎 from the 𝑔 𝑏 she received from Bob, and
Bob, knowing the value of 𝑏, may compute 𝑔 𝑎𝑏 = (𝑔 𝑎 )𝑏 from the 𝑔 𝑎 he received
from Alice. They now share the secure password 𝑔 𝑎𝑏 . This is the renowned
Diffie–Hellman key exchange. The security of the version described is based on
the intractability of the discrete log problem: determining the value 𝑎 = log𝑔 (𝑔 𝑎 )
from 𝑔 𝑎 and 𝑝 is believed to be intractable. Although the problem has a well-known
polynomial-time quantum algorithm (Shor 1994) and has recently been shown to
be quasi-polynomial-time in 𝑛 (Kleinjung and Wesolowski 2021) for a finite field
F 𝑝 𝑛 when 𝑝 is fixed (note that for 𝑛 = 1, F 𝑝 = Z/𝑝Z) the technology required for
the former is still in its infancy, whereas the latter does not apply in our case where
complexity is measured in terms of 𝑝 and not 𝑛 (for us, 𝑛 = 1 always but 𝑝 → ∞).
Now observe that (Z/𝑝Z)× is a commutative group under the group operation of
multiplication modulo 𝑝, and it is a Z-module as we may check that it satisfies the
properties on page 73: for any 𝑥, 𝑦 ∈ Z/𝑝Z and 𝑎, 𝑏 ∈ Z,
(i) (𝑥𝑦)𝑎 = 𝑥 𝑎 𝑦 𝑎 ,
(ii) 𝑥 (𝑎+𝑏) = 𝑥 𝑎 𝑥 𝑏 ,
(iii) 𝑥 (𝑎𝑏) = (𝑥 𝑏 )𝑎 ,
(iv) 𝑥 1 = 𝑥.
Furthermore, the Diffie–Hellman key exchange is a Z-bilinear map
B : (Z/𝑝Z)× × (Z/𝑝Z)× → (Z/𝑝Z)× , (𝑔 𝑎 , 𝑔 𝑏 ) ↦→ 𝑔 𝑎𝑏
since, for any 𝜆, 𝜆 ′ ∈ Z and 𝑔 𝑎 , 𝑔 𝑏 ∈ (Z/𝑝Z)× ,
\[
\mathrm{B}(g^{\lambda a + \lambda' a'}, g^b) = g^{(\lambda a + \lambda' a') b} = (g^{ab})^{\lambda} (g^{a' b})^{\lambda'} = \mathrm{B}(g^a, g^b)^{\lambda}\, \mathrm{B}(g^{a'}, g^b)^{\lambda'},
\]
and likewise for the second argument. That the notation is written multiplicatively
with coefficients appearing in the power is immaterial; if anything it illustrates
the power of abstract algebra in recognizing common structures across different
scenarios. While one may express everything in additive notation by taking discrete
logs whenever necessary, the notation 𝑔 𝑎 serves as a useful mnemonic: anything
appearing in the power is hard to extract, while using additive notation means having
to constantly remind ourselves that extracting 𝑎 from 𝑎𝑔 and 𝜆𝑎 is intractable for
the former and trivial for the latter.
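A toy run in Python (ours; the Mersenne prime 2³¹ − 1 with primitive root 7 is far too small for real security and is used purely for illustration) is shown below.

```python
p = 2**31 - 1        # the Mersenne prime 2147483647; 7 is a primitive root modulo p
g = 7

a = 123_456_789      # Alice's secret exponent
b = 987_654_321      # Bob's secret exponent

A = pow(g, a, p)     # Alice publishes g^a
B = pow(g, b, p)     # Bob publishes g^b

assert pow(B, a, p) == pow(A, b, p) == pow(g, a * b, p)   # shared secret g^{ab}
```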
What if 𝑑 + 1 different parties need to establish a common password (say, in
a 1000-participant Zoom session)? In principle one may successively apply the
two-party Diffie–Hellman key exchange 𝑑 + 1 times with the 𝑑 + 1 parties each
doing 𝑑 + 1 exponentiations, which is obviously expensive. One may reduce the
18 Of course they are all isomorphic to each other, but in actual cryptographic applications, it matters
how the groups are realized: one might be an elliptic curve over a finite field and another a cyclic
subgroup of the Jacobian of a hyperelliptic curve. Also, an explicit isomorphism may not be easy
to identify or compute in practice.
of them would have arrived at Φ(𝑔, . . . , 𝑔)^{𝑎1···𝑎𝑑+1} in a different way with their
own password but their results are guaranteed to be equal as a consequence of
multilinearity. There are several candidates for such cryptographic multilinear
maps (Gentry, Gorbunov and Halevi 2015, Garg, Gentry and Halevi 2013) and
a variety of cryptographic applications that go beyond multipartite key exchange
(Boneh and Silverberg 2003).
We return to multilinear maps over familiar vector spaces. The next two examples
are about trilinear functionals.
Example 3.16 (trilinear functionals and self-concordance). The definition of
self-concordance is usually stated over R𝑛 . Let 𝑓 : Ω → R be a convex 𝐶 3 -
function on an open convex subset Ω ⊆ R𝑛 . Then 𝑓 is said to be self-concordant
at 𝑥 ∈ Ω if
|𝐷 3 𝑓 (𝑥)(ℎ, ℎ, ℎ)| ≤ 2[𝐷 2 𝑓 (𝑥)(ℎ, ℎ)] 3/2 (3.56)
for all ℎ ∈ R𝑛 (Nesterov and Nemirovskii 1994). As we discussed in Example 3.2,
for any fixed 𝑥 ∈ Ω, the higher derivatives 𝐷 2 𝑓 (𝑥) and 𝐷 3 𝑓 (𝑥) in this case are
bilinear and trilinear functionals on R𝑛 given by
\[
[D^2 f(x)](h, h) = \sum_{i,j=1}^{n} \frac{\partial^2 f(x)}{\partial x_i \partial x_j} h_i h_j, \qquad
[D^3 f(x)](h, h, h) = \sum_{i,j,k=1}^{n} \frac{\partial^3 f(x)}{\partial x_i \partial x_j \partial x_k} h_i h_j h_k.
\]
The affine invariance (Nesterov and Nemirovskii 1994, Proposition 2.1.1) of self-
concordance implies that self-concordance is a tensorial property in the sense
of definition ➀. For the convergence and complexity analysis of interior point
methods, it goes hand in hand with the affine invariance of Newton’s method in
Example 2.15. Such analysis in turn allows one to establish the celebrated result that
a convex optimization problem may be solved to arbitrary 𝜀-accuracy in polynomial
time using interior point methods if it has a self-concordant barrier function whose
first and second derivatives may be evaluated in polynomial time (Nesterov and
Nemirovskii 1994, Chapter 6). These conditions are satisfied for many common
problems including linear programming, convex quadratic programming, second-
order cone programming, semidefinite programming and geometric programming
(Boyd and Vandenberghe 2004). Contrary to popular belief, polynomial-time
solvability to 𝜀-accuracy is not guaranteed by convexity alone: copositive and
completely positive programming are convex optimization problems but both are
known to be NP-hard (Murty and Kabadi 1987, Dickinson and Gijben 2014).
By Example 3.12, we may view 𝐷 2 𝑓 and 𝐷 3 𝑓 as covariant tensor fields on the
manifold Ω (any open subset of a manifold is itself a manifold) with
𝐷 2 𝑓 (𝑥) : T 𝑥 (Ω) × T 𝑥 (Ω) → R, 𝐷 3 𝑓 (𝑥) : T 𝑥 (Ω) × T 𝑥 (Ω) × T 𝑥 (Ω) → R
for any fixed 𝑥 ∈ Ω. While this tensor field perspective is strictly speaking
unnecessary, it helps us formulate (3.56) concretely in situations when we are not
working over R𝑛 . For instance, among the aforementioned optimization problems,
semidefinite, completely positive and copositive programming require that we work
over the space of symmetric matrices S𝑛 with Ω given respectively by the following
open convex cones:
\[
\begin{aligned}
\mathbb{S}^n_{++} &= \{A \in \mathbb{S}^n : x^{\mathsf{T}} A x > 0,\ 0 \neq x \in \mathbb{R}^n\} = \{BB^{\mathsf{T}} \in \mathbb{S}^n : B \in \mathrm{GL}(n)\},\\
\mathbb{S}^n_{+++} &= \{BB^{\mathsf{T}} \in \mathbb{S}^n : B \in \mathrm{GL}(n) \cap \mathbb{R}^{n \times n}_+\},\\
\mathbb{S}^{n*}_{+++} &= \{A \in \mathbb{S}^n : x^{\mathsf{T}} A x > 0,\ 0 \neq x \in \mathbb{R}^n_+\}.
\end{aligned}
\]
In all three cases we have T𝑋 (Ω) = S𝑛 for all 𝑋 ∈ Ω, equipped with the usual
trace inner product. To demonstrate self-concordance for the log barrier function
𝑓 (𝑋) = − log det(𝑋), Examples 3.4 and 3.6 give us
as required. For comparison, take the inverse barrier function 𝑔(𝑋) = tr(𝑋 −1 ), which,
as a convex function that blows up near the boundary of $\mathbb{S}^n_{++}$ and has easily
computable gradient and Hessian as we saw in Example 3.4, appears to be a perfect
barrier function for semidefinite programming. Nevertheless, using the derivatives
calculated in Example 3.4,
calculated in Example 3.4,
𝐷 2 𝑔(𝑋)(𝐻, 𝐻) = 2 tr(𝐻 𝑋 −1 𝐻 𝑋 −2 ),
𝐷 3 𝑔(𝑋)(𝐻, 𝐻, 𝐻) = −6 tr(𝐻 𝑋 −1 𝐻 𝑋 −1 𝐻 𝑋 −2 ),
and since $6|h|^3/x^4 > 2(2h^2/x^3)^{3/2}$ as $x \to 0^+$, (3.56) will not be satisfied when 𝑋
is near singular. Thus the inverse barrier function for $\mathbb{S}^n_{++}$ is not self-concordant.
What about $\mathbb{S}^n_{+++}$, the completely positive cone, and its dual cone $\mathbb{S}^{n*}_{+++}$, the copositive cone?
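The inequality can also be probed numerically; the Python sketch below (NumPy; ours) evaluates both sides of (3.56) for the log barrier at random symmetric positive definite 𝑋 and random directions 𝐻, using 𝐷²𝑓(𝑋)(𝐻,𝐻) = tr(𝑋⁻¹𝐻𝑋⁻¹𝐻) and 𝐷³𝑓(𝑋)(𝐻,𝐻,𝐻) = −2 tr(𝑋⁻¹𝐻𝑋⁻¹𝐻𝑋⁻¹𝐻), which follow from (3.30) and (3.31).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
for _ in range(100):
    B = rng.standard_normal((n, n))
    X = B @ B.T + 1e-3 * np.eye(n)                 # random, possibly ill-conditioned X in S^n_{++}
    H = rng.standard_normal((n, n)); H = (H + H.T) / 2
    Xinv = np.linalg.inv(X)
    d2 = np.trace(Xinv @ H @ Xinv @ H)
    d3 = -2 * np.trace(Xinv @ H @ Xinv @ H @ Xinv @ H)
    assert abs(d3) <= 2 * d2**1.5 * (1 + 1e-10)    # (3.56) holds for the log barrier
```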
The last example, though more of a curiosity, is a personal favourite of the author
(Friedland, Lim and Zhang 2019).
Example 3.17 (spectral norm and Grothendieck’s inequality). One of the most
fascinating inequalities in matrix and operator theory is the following. For any
which is in fact a reason why the inequality is so useful, the (∞, 1)-norm being
ubiquitous in combinatorial optimization but NP-hard to compute, and the left-hand
side of (3.57) being readily computable via semidefinite programming. Writing 𝑥𝑖
and 𝑦 𝑗 , respectively, as columns and rows of matrices 𝑋 = [𝑥1 , . . . , 𝑥 𝑚 ] ∈ R 𝑝×𝑚
and 𝑌 = [𝑦 T1 , . . . , 𝑦 T𝑛 ] T ∈ R𝑛× 𝑝 , we next observe that the constraints on the left-hand
side of (3.57) may be expressed in terms of their (1, 2)-norm and (2, ∞)-norm:
\[
\|X\|_{1,2} = \max_{v \neq 0} \frac{\|Xv\|_2}{\|v\|_1} = \max_{i=1,\ldots,m} \|x_i\|_2, \qquad
\|Y\|_{2,\infty} = \max_{v \neq 0} \frac{\|Yv\|_\infty}{\|v\|_2} = \max_{j=1,\ldots,n} \|y_j\|_2,
\]
namely, as k 𝑋 k 1,2 ≤ 1 and k𝑌 k 2,∞ ≤ 1. Lastly, we observe that for the standard
inner product on R 𝑝 ,
\[
\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} \langle x_i, y_j\rangle = \operatorname{tr}(XAY),
\]
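This last identity is a one-liner to confirm in Python (NumPy; a sketch with arbitrary small dimensions).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 4, 5, 3
A = rng.standard_normal((m, n))
X = rng.standard_normal((p, m))       # columns x_1, ..., x_m in R^p
Y = rng.standard_normal((n, p))       # rows y_1, ..., y_n in R^p

lhs = sum(A[i, j] * (X[:, i] @ Y[j, :]) for i in range(m) for j in range(n))
assert np.isclose(lhs, np.trace(X @ A @ Y))
```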
by the transformation rules they satisfy. In physics, the perspective may be some-
what different. Instead of viewing them as a series of pedagogical improvements,
definition ➀ was discovered alongside (most notably) general relativity, with defin-
ition ➁ as its addendum, and definition ➂ was discovered alongside (most notably)
quantum mechanics. Both remain useful in different ways for different purposes
and each is used independently of the other.
One downside of definition ➂ is that it relies on the tensor product construction
and/or the universal factorization property; both have a reputation of being abstract.
We will see that the reputation is largely undeserved; with the proper motivations
they are far easier to appreciate than, say, definition ➀. One reason for its ‘abstract’
reputation is that the construction is often cast in a way that lends itself to vast
generalizations. In fact, tensor products of Hilbert spaces (Reed and Simon 1980),
modules (Lang 2002), vector bundles (Milnor and Stasheff 1974), operator algeb-
ras (Takesaki 2002), representations (Fulton and Harris 1991), sheaves (Hartshorne
1977), cohomology rings (Hatcher 2002), etc., have become foundational materials
covered in standard textbooks, with more specialized tensor product constructions,
such as those of Banach spaces (Ryan 2002), distributions (Trèves 2006), operads
(Dunn 1988) or more generally objects in any tensor category (Etingof et al. 2015),
readily available in various monographs. This generality is a feature, not a bug.
In our (much more modest) context, it allows us to define tensor products of norm
spaces and inner product spaces, which in turn allows us to define norms and
inner products for tensors, to view multivariate functions as tensors, an enorm-
ously useful perspective in computation, and to identify ‘separation of variables’
as the common underlying thread in a disparate array of well-known algorithms.
In physics, the importance of tensor products cannot be over-emphasized: it is
one of the fundamental postulates of quantum mechanics (Nielsen and Chuang
2000, Section 2.2.8), the source behind many curious quantum phenomena that lie
at the heart of the subject, and is indispensable in technologies such as quantum
computing and quantum cryptography.
Considering its central role, we will discuss three common approaches to con-
structing tensor products, increasing in abstraction:
(i) via tensor products of function spaces,
(ii) via tensor products of more general vector spaces,
(iii) via the universal factorization property.
As in the case of the three definitions of tensor, each of these constructions is
useful in its own way and each may be taken to be a variant of definition ➂. We
will motivate each construction with concrete examples but we defer all examples
of computational relevance to Section 4.4. The approach of defining a tensor as
an element of a tensor product of vector spaces likely first appeared in the first
French edition of Bourbaki (1998) and is now standard in graduate algebra texts
(Dummit and Foote 2004, Lang 2002, Vinberg 2003). It has also caught on in
physics (Geroch 1985) and in statistics (McCullagh 1987). For further historical
information we refer the reader to Conrad (2018) and the last chapter of Kostrikin
and Manin (1997).
having a unique finite linear combination of the form (4.1) is the very definition of
a basis. When used in this sense, such a basis for a vector space is called a Hamel
basis.
Since any vector space is isomorphic to a function space, by defining tensor
products on function spaces we define tensor products on all vector spaces. This
is the most straightforward approach to defining tensor products. In this regard, a
𝑑-tensor is just a real-valued function 𝑓 (𝑥1 , . . . , 𝑥 𝑑 ) of 𝑑 variables. For two sets 𝑋
and 𝑌 , we define the tensor product R𝑋 ⊗ R𝑌 of R𝑋 and R𝑌 to be the subspace of
R𝑋 ×𝑌 comprising all functions that can be written as a finite sum of product of two
univariate functions, one of 𝑥 and another of 𝑦:
\[
f(x, y) = \sum_{i=1}^{r} \varphi_i(x)\, \psi_i(y). \tag{4.2}
\]
Figure 4.1. Separable function 𝑓 (𝑥, 𝑦) = 𝜑(𝑥)𝜓(𝑦) with 𝜑(𝑥) = 1 + sin(180𝑥) and
𝜓(𝑦) = exp(−(𝑦 − 3/2)²).
degrees. This is one of the few infinite-dimensional vector spaces that has a
countable Hamel basis, namely the set of all monic monomials
\[
\mathcal{B} = \{x_1^{d_1} x_2^{d_2} \cdots x_m^{d_m} : d_1, d_2, \ldots, d_m \in \mathbb{N} \cup \{0\}\}.
\]
For multivariate polynomials, it is clearly true that19
R[𝑥1 , . . . , 𝑥 𝑚 ] ⊗ R[𝑦 1 , . . . , 𝑦 𝑛 ] = R[𝑥1 , . . . , 𝑥 𝑚 , 𝑦 1 , . . . , 𝑦 𝑛 ], (4.5)
as polynomials are sums of finitely many monomials and monomials are always
separable, for example
\[
7 x_1^2 x_2\, y_2^3 y_3 - 6 x_2^4\, y_1 y_2^5 y_3 = (7 x_1^2 x_2) \cdot (y_2^3 y_3) + (-6 x_2^4) \cdot (y_1 y_2^5 y_3).
\]
But (4.4) is also clearly false in general for other infinite-dimensional vector
spaces. Take continuous real-valued functions for illustration. While sin(𝑥 + 𝑦) =
sin 𝑥 cos 𝑦 + cos 𝑥 sin 𝑦 and log(𝑥𝑦) = log 𝑥 + log 𝑦, we can never have
\[
\sin(xy) = \sum_{i=1}^{r} \varphi_i(x)\, \psi_i(y) \qquad\text{or}\qquad \log(x - y) = \sum_{i=1}^{r} \varphi_i(x)\, \psi_i(y)
\]
for any continuous functions 𝜑𝑖 , 𝜓𝑖 and finite 𝑟 ∈ N. But if we allow 𝑟 → ∞, then
\[
\sin(xy) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!} x^{2n+1} y^{2n+1},
\]
and by Taylor’s theorem, sin(𝑥𝑦) can be approximated to arbitrary accuracy by
sums of separable functions. Likewise,
\[
\log(x - y) = \log(-y) - \frac{x}{y} - \frac{x^2}{2y^2} - \frac{x^3}{3y^3} - \cdots - \frac{x^n}{n y^n} + O(x^{n+1})
\]
as 𝑥 → 0, or, more relevant for our later purposes,
\[
\log(x - y) = \log(x) - \frac{y}{x} - \frac{y^2}{2x^2} - \frac{y^3}{3x^3} - \cdots - \frac{y^n}{n x^n} + O\Bigl(\frac{1}{x^{n+1}}\Bigr),
\]
as 𝑥 → ∞. In other words, we may approximate log(𝑥 − 𝑦) by sums of separable
functions when 𝑥 is very small or very large; the latter will be important when we
discuss the fast multipole method in Example 4.45.
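The practical upshot is easy to see numerically; the Python sketch below (NumPy; ours) approximates the non-separable function sin(𝑥𝑦) on a grid by a short sum of separable terms obtained from a truncated SVD of its sampled values, and the error decays rapidly with the number of terms 𝑟.

```python
import numpy as np

x = np.linspace(0, 2, 200)
y = np.linspace(0, 2, 200)
F = np.sin(np.outer(x, y))                       # F[i, j] = sin(x_i * y_j)

U, s, Vt = np.linalg.svd(F)
for r in (1, 2, 4, 8):
    Fr = (U[:, :r] * s[:r]) @ Vt[:r, :]          # sum of r separable terms
    print(r, np.max(np.abs(F - Fr)))             # maximum error over the grid
```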
Given that (4.5) holds for polynomials, the Stone–Weierstrass theorem gives an
indication that we might be able to extend (4.4) to other infinite-dimensional spaces,
say, the space of continuous functions 𝐶(𝑋) on some nice domain 𝑋. However,
we will need to relax the definition of tensor product to allow for limits, i.e. taking
completion with respect to some appropriate choice of norm. For instance, we may
establish, for 1 ≤ 𝑝 < ∞,
\[
C(X)\, \widehat{\otimes}\, C(Y) = C(X \times Y), \qquad L^p(X)\, \widehat{\otimes}\, L^p(Y) = L^p(X \times Y),
\]
19 One may extend (4.5) to monomials involving arbitrary real powers, i.e. Laurent polynomials,
posynomials, signomials, etc., as they are all finite sums of monomials.
where 𝑋 and 𝑌 are locally compact Hausdorff spaces in the left equality and 𝜎-finite
measure spaces in the right equality (Light and Cheney 1985, Corollaries 1.14 and
1.52). The topological tensor product $C(X)\, \widehat{\otimes}\, C(Y)$ above refers to the completion of
\[
C(X) \otimes C(Y) = \biggl\{ f \in C(X \times Y) : f = \sum_{i=1}^{r} \varphi_i \otimes \psi_i,\ \varphi_i \in C(X),\ \psi_i \in C(Y),\ r \in \mathbb{N} \biggr\}
\]
with $(a_i)_{i=1}^{\infty} \in l^1(\mathbb{N})$, and $\varphi_i \in D'(X)$, $\psi_i \in D'(Y)$ satisfying $\lim_{i\to\infty} \varphi_i = 0 =
\lim_{i\to\infty} \psi_i$, among other properties (Schaefer and Wolff 1999, Theorem 9.5). These
results are all consequences of Schwartz’s kernel theorem, which, alongside Mer-
This gives the unique distribution all of whose cumulant 𝑑-tensors, i.e. the 𝑑th
derivative as in Example 3.2 of log 𝜑 𝑋 , vanish when 𝑑 ≥ 3, a fact from which
of basic block operations, notably in Strassen’s laser method (Bürgisser et al. 1997,
Chapter 15), or of row and column operations, notably in various variations of slice
rank (Lovett 2019), although there is nothing as sophisticated as the kind of block
matrix operations on page 70.
Usually when we recast a problem in the form of a matrix problem, we are on our
way to a solution: matrices are the tool that gets us there. The same is not true with
hypermatrices. For instance, while we may easily capture the adjacency structure
of a 𝑑-hypergraph 𝐺 = (𝑉, 𝐸), 𝐸 ⊆ 𝑉^𝑑, with a hypermatrix 𝑓 : 𝑉 × · · · × 𝑉 → R,
\[
f(i_1, \ldots, i_d) \coloneqq \begin{cases} 1 & \{i_1, \ldots, i_d\} \in E,\\ 0 & \{i_1, \ldots, i_d\} \notin E, \end{cases}
\]
the analogy with spectral graph theory stops here; the hypermatrix view (Friedman
1991, Friedman and Wigderson 1995) does not get us anywhere close to what
is possible in the 𝑑 = 2 case. Almost every mathematically sensible problem
involving hypermatrices is a challenge, even for 𝑑 = 3 and very small dimensions
such as 4 × 4 × 4. For example, we know that the 𝑚 × 𝑛 matrices of rank not more
than 𝑟 are precisely the ones with vanishing (𝑟 + 1) × (𝑟 + 1) minors. Resolving the
equivalent problem for 4 × 4 × 4 hypermatrices of rank not more than 4 requires
that we first reduce it to a series of questions about matrices and then throw every
tool in the matrix arsenal at them (Friedland 2013, Friedland and Gross 2012).
The following resolves a potential point of confusion that the observant reader
might have already noticed. In Definition 3.3, a 𝑑-tensor is a multilinear functional,
that is,
𝜑 : V1 × · · · × V 𝑑 → R (4.18)
satisfying (3.18), but in Definition 4.4 a 𝑑-tensor is merely a multivariate function
𝑓 : 𝑋1 × · · · × 𝑋 𝑑 → R (4.19)
that is not required to be multilinear, a requirement that in fact makes no sense as the
sets 𝑋1 , . . . , 𝑋𝑑 may not be vector spaces. One might think that (4.18) is a special
case of (4.19) as a multilinear functional is a special case of a multivariate function,
but this would be incorrect: they are different types of tensors. By Definition 3.3, 𝜑
is a covariant tensor whereas, as we pointed out in Example 4.2, 𝑓 is a contravariant
tensor. It suffices to explain this issue over finite dimensions and so we will assume
that V1 , . . . , V𝑑 are finite-dimensional vector spaces and 𝑋1 , . . . , 𝑋𝑑 are finite sets.
We will use slightly different notation below to avoid having to introduce more
indices.
Example 4.6 (multivariate versus multilinear). The relation between (4.18) and
(4.19) is subtler than meets the eye. Up to a choice of bases, we can construct a
unique 𝜑 from any 𝑓 and vice versa. One direction is easy. Given vector spaces
for all 𝑥 ′ ∈ 𝑋, and the point evaluation linear functional 𝜀 𝑥 ∈ (R𝑋 )∗ given by
𝜀 𝑥 : R𝑋 → R, 𝜀 𝑥 ( 𝑓 ) = 𝑓 (𝑥)
for all 𝑓 ∈ R𝑋 . These are bases of their respective spaces and for any set we have
the following:
𝑋 = {𝑥1 , . . . , 𝑥 𝑚 }, R𝑋 = span{𝛿 𝑥1 , . . . , 𝛿 𝑥𝑚 }, (R𝑋 )∗ = span{𝜀 𝑥1 , . . . , 𝜀 𝑥𝑚 }.
Given any 𝑑 sets,
𝑋 = {𝑥1 , . . . , 𝑥 𝑚 }, 𝑌 = {𝑦 1 , . . . , 𝑦 𝑛 }, . . . , 𝑍 = {𝑧1 , . . . , 𝑧 𝑝 },
a real-valued function 𝑓 : 𝑋 × 𝑌 × · · · × 𝑍 → R gives us a multilinear functional
𝜑 𝑓 : R𝑋 × R𝑌 × · · · × R 𝑍 → R
defined by
\[
\varphi_f = \sum_{i=1}^{m} \sum_{j=1}^{n} \cdots \sum_{k=1}^{p} f(x_i, y_j, \ldots, z_k)\, \varepsilon_{x_i} \otimes \varepsilon_{y_j} \otimes \cdots \otimes \varepsilon_{z_k}.
\]
The next example is intended to serve as further elaboration for Examples 4.3,
4.5 and 4.6, and as partial impetus for Sections 4.2 and 4.3.
Example 4.7 (quantum spin and covalent bonds). As we saw in Example 4.3,
the quantum state of a spinless particle like the Gaussian wave function 𝜓 𝑚,𝑛, 𝑝 is
an element of the Hilbert space21 𝐿 2 (R3 ). But even for a single particle, tensor
products come in handy when we need to discuss particles with spin, which are
crucial in chemistry. The simplest but also the most important case is when
a quantum particle has spin −1/2, 1/2, called a spin-half particle for short, and its
quantum state is
Ψ ∈ 𝐿 2 (R3 ) ⊗ C2 . (4.21)
Going by Definition 4.4, this is a function Ψ : R3 × {−1/2, 1/2} → C, where Ψ(𝑥, 𝜎)
is an 𝐿 2 (R3 )-integrable function in the argument 𝑥 for each 𝜎 = −1/2 or 𝜎 = 1/2. It is
also not hard to see that when we have a tensor product of an infinite-dimensional
space with a finite-dimensional one, every element must be a finite sum of separable
terms, so Ψ in (4.21) may be expressed as
Ψ = ∑_{i=1}^{𝑟} 𝜓𝑖 ⊗ 𝜒𝑖   or   Ψ(𝑥, 𝜎) = ∑_{i=1}^{𝑟} 𝜓𝑖 (𝑥)𝜒𝑖 (𝜎)    (4.22)
with 𝜓𝑖 ∈ 𝐿 2 (R3 ) and 𝜒𝑖 : {−1/2, 1/2} → C. In physics parlance, Ψ is the total wave
function of the particle and 𝜓𝑖 and 𝜒𝑖 are its spatial and spin wave functions; the
variables that Ψ depends on are called degrees of freedom, with those describing
position and momentum called external degrees of freedom and others like spin
called internal degrees of freedom (Cohen-Tannoudji et al. 2020a, Chapter II,
Section F).
This also illustrates why, as we discussed in Example 4.5, it is often desirable to
view C𝑛 as the set of complex-valued functions on some finite sets of 𝑛 elements.
Here the finite set is {−1/2, 1/2} and C2 in (4.21) is just shorthand for the spin state
space
C^{−1/2,1/2} = { 𝜒 : {−1/2, 1/2} → C }.
The domain R3 × {−1/2, 1/2} is called the position–spin space (Pauli 1980) and plays
a role in the Pauli exclusion principle. While two classical particles cannot simul-
taneously occupy the exact same location in R3 , two quantum particles can as long
as they have different spins, as that means they are occupying different locations
in R3 × {−1/2, 1/2}. A consequence is that two electrons with opposite spins can
occupy the same molecular orbital, and when they do, we have a covalent bond in
21 To avoid differentiability and integrability issues, we need 𝜓𝑚,𝑛, 𝑝 to be in the Schwartz space
𝑆(R3 ) ⊆ 𝐿 2 (R3 ) if we take a rigged Hilbert space approach (de la Madrid 2005) or the Sobolev
space 𝐻 2 (R3 ) ⊆ 𝐿 2 (R3 ) if we take an unbounded self-adjoint operators approach (Teschl 2014,
Chapter 7), but we disregard these to keep things simple.
the molecule. The Pauli exclusion principle implies the converse: if two electrons
occupy the same molecular orbital, then they must have opposite spins. We will
see that this is a consequence of the antisymmetry of the total wave function.
In fact, ‘two electrons with opposite spins occupying the same molecular orbital’
is the quantum mechanical definition of a covalent bond in chemistry. When this
happens, quantum mechanics mandates that we may not speak of each electron
individually but need to consider both as a single entity described by a single wave
function
Ψ ∈ (𝐿 2 (R3 ) ⊗ C2 ) b⊗ (𝐿 2 (R3 ) ⊗ C2 ),    (4.23)
with each copy of 𝐿 2 (R3 ) ⊗ C2 associated with one of the electrons. The issue with
writing the tensor product in the form (4.23) is that Ψ will not be a finite sum of
separable terms since it is now a tensor product of two infinite-dimensional spaces.
As we will see in Section 4.3, we may arrange the factors in a tensor product in
arbitrary order and obtain isomorphic spaces
(𝐿 2 (R3 ) ⊗ C2 ) b⊗ (𝐿 2 (R3 ) ⊗ C2 ) ≅ (𝐿 2 (R3 ) b⊗ 𝐿 2 (R3 )) ⊗ C2 ⊗ C2 ,    (4.24)
but this is also obvious from Definition 4.4 by observing that for finite sums
∑_{i=1}^{𝑟} 𝜓𝑖 (𝑥)𝜒𝑖 (𝜎)𝜑𝑖 (𝑦)𝜉𝑖 (𝜏) = ∑_{i=1}^{𝑟} 𝜓𝑖 (𝑥)𝜑𝑖 (𝑦)𝜒𝑖 (𝜎)𝜉𝑖 (𝜏),
Nevertheless, we would like more flexibility than Definition 4.4 provides. For
instance, instead of writing the state of a spin-half particle as a sum of real-valued
functions in (4.22), we might prefer to view it as a C2 -valued 𝐿 2 -vector field on R3
indexed by spin:
Ψ(𝑥) = (𝜓−1/2 (𝑥), 𝜓1/2 (𝑥))ᵀ    (4.29)
with
‖Ψ‖² = ∫_{R3} |𝜓−1/2 (𝑥)|² d𝑥 + ∫_{R3} |𝜓1/2 (𝑥)|² d𝑥 < ∞.
vectors to bring it in line with modern treatments in algebra, which we will describe
later. Note that the ⊗ in 𝑣 ⊗ 𝑤 is solely used as a delimiter; we could have written
it as 𝑣𝑤 as in Hitchcock (1927) and Morse and Feshbach (1953) or |𝑣⟩|𝑤⟩ or |𝑣, 𝑤⟩
in Dirac notation (Cohen-Tannoudji et al. 2020a, Chapter II, Section F2c). Indeed,
𝑣 and 𝑤 may not have coordinates and there is no ‘formula for 𝑣 ⊗ 𝑤’. Also, the
order matters as 𝑣 and 𝑤 may denote vectors of different nature, so in general
𝑣 ⊗ 𝑤 ≠ 𝑤 ⊗ 𝑣. (4.30)
The reason for this will become clear in Example 4.8. Used in this sense, ⊗ is
called an abstract tensor product or dyadic product in the older literature (Chou
and Pagano 1992, Morse and Feshbach 1953, Tai 1997).
With hindsight, we see that the extension of a scalar in the sense of an object
with magnitude, and a vector in the sense of one with magnitude and direction, is
not a tensor but a rank-one tensor. To get tensors that are not rank-one we will have
to look at the arithmetic of dyads and polyads. As we have discussed, each dyad is
essentially a placeholder for three pieces of information:
(magnitude, first direction, second direction)
or, in our notation, (‖𝑣 ⊗ 𝑤‖, 𝑣̂, 𝑤̂). We are trying to devise a consistent system of
arithmetic for such objects that (a) preserves their information content, (b) merges
information whenever possible, and (c) is consistent with scalar and vector arith-
metic.
Scalar products are easy. The role of scalars is that they scale vectors, as reflected
in its name. We expect the same for dyads. If any of the vectors in a dyad is scaled
by 𝑎, we expect its magnitude to be scaled by 𝑎 but the two directions to remain
unchanged. So we require
(𝑎𝑣) ⊗ 𝑤 = 𝑣 ⊗ (𝑎𝑤) ≕ 𝑎𝑣 ⊗ 𝑤, (4.31)
where the last term is defined to be the common value of the first two. This
seemingly innocuous property is the main reason tensor products, not direct sums,
are used to combine quantum state spaces. In quantum mechanics, a quantum state
is described not so much by a vector 𝑣 but the entire one-dimensional subspace
spanned by 𝑣; thus (4.31) ensures that when combining two quantum states in the
form of two one-dimensional subspaces, it does not matter which (non-zero) vector
in the subspace we pick to represent the state. On the other hand, for direct sums,
(𝑎𝑣) ⊕ 𝑤 ≠ 𝑣 ⊕ (𝑎𝑤) ≠ 𝑎(𝑣 ⊕ 𝑤) (4.32)
in general; for example, with 𝑣 = (1, 0) ∈ R2 , 𝑤 = 1 ∈ R1 and 𝑎 = 2 ∈ R, we get
(2, 0, 1) ≠ (1, 0, 2) ≠ (2, 0, 2).
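The contrast between (4.31) and (4.32) is easy to check numerically; in the sketch below ⊗ is realized on coordinate vectors with np.kron and ⊕ with concatenation, purely for illustration.

```python
import numpy as np

v = np.array([1.0, 0.0])    # v in R^2
w = np.array([1.0])         # w in R^1
a = 2.0

# Tensor product: scaling either factor scales the dyad, as in (4.31).
print(np.allclose(np.kron(a * v, w), np.kron(v, a * w)))    # True
print(np.allclose(np.kron(a * v, w), a * np.kron(v, w)))    # True

# Direct sum (concatenation): the three vectors in (4.32) really differ.
print(np.concatenate([a * v, w]))    # [2. 0. 1.]
print(np.concatenate([v, a * w]))    # [1. 0. 2.]
print(a * np.concatenate([v, w]))    # [2. 0. 2.]
```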
To ensure consistency with the usual scalar and vector arithmetic, one also needs
the assumption that scalar multiplication is always distributive and associative in
the following sense:
(𝑎 + 𝑏)𝑣 ⊗ 𝑤 = 𝑎𝑣 ⊗ 𝑤 + 𝑏𝑣 ⊗ 𝑤, (𝑎𝑏)𝑣 ⊗ 𝑤 = 𝑎(𝑏𝑣 ⊗ 𝑤).
22 We could of course add corresponding vectors 𝑣 1 + 𝑣 2 and 𝑤 1 + 𝑤 2 , but that leads to direct sums
of vector spaces, which is inappropriate because of (4.32).
23 We assume ‘space’ here means R3 with coordinate axes labelled 𝑥, 𝑦, 𝑧. The older literature would
use i, j, k instead of 𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 .
Figure 4.2. Depiction of the stress tensor. The cube represents a single point in
R3 and should be regarded as infinitesimally small. In particular, the origins of the
three coordinate frames on the surface of the cube are the same point.
2-tensor. This provides yet another example of why we do not want to identify a 2-
tensor with the matrix that represents it: stripped of the basis vectors, the physical
connotations of (4.34) and (4.35) that we discussed earlier are irretrievably lost.
What we have described above is the behaviour of stress at a single point in space,
assumed to be R3 , and in Cartesian coordinates. As we mentioned in Example 3.13,
stress is a tensor field and 𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 really form a basis of the tangent space T 𝑣 (R3 )
at the point 𝑣 = (𝑥, 𝑦, 𝑧) ∈ R3 , and we could have used any coordinates. Note that
our discussion above merely used stress in a purely nominal way: our focus is on
the dyadic that represents stress. The same discussion applies mutatis mutandis
to any contravariant 2-tensors describing inertia, polarization, strain, tidal force,
viscosity, etc. (Borg 1990). We refer the readers to Borg (1990, Chapter 4), Chou
and Pagano (1992, Chapter 1) and Irgens (2019, Chapter 2) for the actual physical
details regarding stress, including how to derive (4.34) from first principles.
Stress is a particularly vital notion, as many tensors describing physical prop-
erties in various constitutive equations (Hartmann 1984, Table 1) are defined as
derivatives with respect to it. In fact, this is one way in which tensors of higher order
arise in physics – as second-order derivatives of real-valued functions with respect
to first- and second-order tensors. For instance, in Cartesian coordinates, the piezo-
electric tensor, the piezo-magnetic tensor and the elastic tensor are represented as
hypermatrices 𝐷, 𝑄 ∈ R3×3×3 and 𝑆 ∈ R3×3×3×3 , where
𝑑𝑖 𝑗𝑘 = −𝜕²𝐺/(𝜕𝜎𝑖 𝑗 𝜕𝑒 𝑘 ),   𝑞 𝑖 𝑗𝑘 = −𝜕²𝐺/(𝜕𝜎𝑖 𝑗 𝜕ℎ 𝑘 ),   𝑠𝑖 𝑗𝑘𝑙 = −𝜕²𝐺/(𝜕𝜎𝑖 𝑗 𝜕𝜎𝑘𝑙 ),
and 𝑖, 𝑗, 𝑘, 𝑙 = 1, 2, 3. Here 𝐺 = 𝐺(Σ, 𝐸, 𝐻, 𝑇 ) is the Gibbs potential, a real-
valued function of the second-order stress tensor Σ, the first-order tensors 𝐸 and 𝐻
representing electric and magnetic field respectively, and the zeroth-order tensor 𝑇
representing temperature (Hartmann 1984, Section 3).
It is straightforward to extend the above construction of a dyadic to arbitrary
dimensions and arbitrary vectors in abstract vector spaces U and V. A dyadic is a
‘linear combination’ of vectors in V,
𝜎 = 𝜎1 𝑣 1 + · · · + 𝜎𝑛 𝑣 𝑛 ,
whose coefficients 𝜎 𝑗 are vectors in U:
𝜎 𝑗 = 𝜎1 𝑗 𝑢1 + · · · + 𝜎𝑚 𝑗 𝑢 𝑚 , 𝑗 = 1, . . . , 𝑛.
Thus by (4.31) and (4.33) we have
𝜎 = ∑_{i=1}^{𝑚} ∑_{j=1}^{𝑛} 𝜎𝑖 𝑗 𝑢𝑖 𝑣 𝑗 = ∑_{i=1}^{𝑚} ∑_{j=1}^{𝑛} 𝜎𝑖 𝑗 𝑢𝑖 ⊗ 𝑣 𝑗
in old (pre-⊗) and modern notation respectively. Note that the coefficients 𝜎𝑖 𝑗
are now scalars and a dyadic is an honest linear combination of dyads with scalar
coefficients. We denote the set of all such dyadics as U ⊗ V.
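Once bases of U and V are fixed, a dyadic is represented by its coefficient matrix [𝜎𝑖 𝑗 ] and each dyad 𝑢𝑖 ⊗ 𝑣 𝑗 by an outer product. A minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
S = rng.standard_normal((m, n))   # coefficients sigma_ij of a dyadic
U_basis = np.eye(m)               # u_1, ..., u_m  (standard basis, for illustration)
V_basis = np.eye(n)               # v_1, ..., v_n

# sigma = sum_{i,j} sigma_ij u_i (x) v_j, with u (x) v realized as an outer product.
sigma = sum(S[i, j] * np.outer(U_basis[i], V_basis[j])
            for i in range(m) for j in range(n))

print(np.allclose(sigma, S))   # True: the dyadic is represented by its coefficient matrix
```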
Thus we have
𝜏 = ∑_{i=1}^{𝑚} ∑_{j=1}^{𝑛} ∑_{k=1}^{𝑝} 𝜏𝑖 𝑗𝑘 𝑢𝑖 𝑣 𝑗 𝑤 𝑘 = ∑_{i=1}^{𝑚} ∑_{j=1}^{𝑛} ∑_{k=1}^{𝑝} 𝜏𝑖 𝑗𝑘 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ 𝑤 𝑘
in old and modern notation respectively. Henceforth we will ditch the old notation.
Again a triadic is an honest linear combination of triads and the set of all triadics
will be denoted U ⊗ V ⊗ W. Strictly speaking, we have constructed (U ⊗ V) ⊗ W and
an analogous construction considering ‘linear combinations’ of dyadics in V ⊗ W
with coefficients in U would give us U ⊗ (V ⊗ W), but it does not matter for us as
we have implicitly imposed that
(𝑢 ⊗ 𝑣) ⊗ 𝑤 = 𝑢 ⊗ (𝑣 ⊗ 𝑤) ≕ 𝑢 ⊗ 𝑣 ⊗ 𝑤 (4.37)
for all 𝑢 ∈ U, 𝑣 ∈ V, 𝑤 ∈ W.
The observant reader may have noticed that aside from (4.37) we have also
implicitly imposed the third-order analogues of (4.31) and (4.33) in the construction
above. For completeness we will state them formally but in a combined form:
(𝜆𝑢 + 𝜆 ′𝑢 ′) ⊗ 𝑣 ⊗ 𝑤 = 𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ⊗ 𝑤,
𝑢 ⊗ (𝜆𝑣 + 𝜆 ′ 𝑣 ′) ⊗ 𝑤 = 𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ⊗ 𝑣 ′ ⊗ 𝑤, (4.38)
𝑢 ⊗ 𝑣 ⊗ (𝜆𝑤 + 𝜆 ′ 𝑤 ′) = 𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ⊗ 𝑣 ⊗ 𝑤 ′
for all vectors 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, 𝑤, 𝑤 ′ ∈ W and scalars 𝜆, 𝜆 ′.
The construction described above makes the set U ⊗ V ⊗ W into an algebraic
object with scalar product, tensor addition + and tensor product ⊗ interacting in
a consistent manner according to the algebraic rules (4.37) and (4.38). We will
call it a tensor product of U, V, W, or an abstract tensor product if we need to
emphasize that it refers to this particular construction. Observe that this is an
extremely general construction.
(i) It does not depend on what we call ‘scalars’ so long as we may add and multiply
them, that is, this construction works for arbitrary modules U, V, W over
a ring 𝑅.
(ii) It does not require U, V, W to be finite-dimensional (vector spaces) or finitely
generated (modules); the whole construction is about specifying how linear
combinations behave under ⊗, and there is no need for elements to be linear
combinations of some basis or generating set.
(iii) It does not call for separate treatments of covariant and mixed tensors; the
definition is agnostic to having some or all of U, V, W replaced by their dual
spaces U∗ , V∗ , W∗ .
It almost goes without saying that the construction can be readily extended to
arbitrary order 𝑑. We may consider ‘linear combinations’ of dyadics with dyadic
coefficients to get 𝑑 = 4, ‘linear combinations’ of triadics with dyadic coefficients
to get 𝑑 = 5, etc. But looking at the end results of the 𝑑 = 2 and 𝑑 = 3 constructions,
if we have 𝑑 vector spaces U, V, . . . , W, a more direct way is to simply consider
the set of all 𝑑-adics, i.e. finite sums of 𝑑-ads,
U ⊗ V ⊗ · · · ⊗ W ≔ { ∑_{i=1}^{𝑟} 𝑢𝑖 ⊗ 𝑣 𝑖 ⊗ · · · ⊗ 𝑤 𝑖 : 𝑢𝑖 ∈ U, 𝑣 𝑖 ∈ V, . . . , 𝑤 𝑖 ∈ W, 𝑟 ∈ N },    (4.39)
and decree that ⊗ is associative, + is associative and commutative, and ⊗ is
distributive over + in the sense of
(𝜆𝑢 + 𝜆 ′𝑢 ′) ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 = 𝜆𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ⊗ · · · ⊗ 𝑤,
𝑢 ⊗ (𝜆𝑣 + 𝜆 ′ 𝑣 ′) ⊗ · · · ⊗ 𝑤 = 𝜆𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 + 𝜆 ′𝑢 ⊗ 𝑣 ′ ⊗ · · · ⊗ 𝑤,
⋮                                                              (4.40)
𝑢 ⊗ 𝑣 ⊗ · · · ⊗ (𝜆𝑤 + 𝜆′𝑤′) = 𝜆𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 + 𝜆′𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤′
construction, but where we have both vector spaces and dual spaces, as another
definition for tensors.
Definition 4.9 (tensors as elements of tensor spaces). Let 𝑝 ≤ 𝑑 be non-negative
integers and let V1 , . . . , V𝑑 be vector spaces. A tensor of contravariant order 𝑝
and covariant order 𝑑 − 𝑝 is an element
𝑇 ∈ V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 .
The set above is as defined in (4.39) and is called a tensor product of vector spaces
or a tensor space for short. The tensor 𝑇 is said to be of type (𝑝, 𝑑 − 𝑝) and of order
𝑑. A tensor of type (𝑑, 0), i.e. 𝑇 ∈ V1 ⊗ · · · ⊗ V𝑑 , is called a contravariant 𝑑-tensor,
and a tensor of type (0, 𝑑), i.e. 𝑇 ∈ V∗1 ⊗ · · · ⊗ V∗𝑑 is called a covariant 𝑑-tensor.
Vector spaces have bases and a tensor space V1 ⊗ · · · ⊗ V𝑑 has a tensor product
basis given by
ℬ1 ⊗ ℬ2 ⊗ · · · ⊗ ℬ𝑑 ≔ {𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 : 𝑢 ∈ ℬ1 , 𝑣 ∈ ℬ2 , . . . , 𝑤 ∈ ℬ𝑑 },
(4.43)
where ℬ𝑖 is any basis of V𝑖 , 𝑖 = 1, . . . , 𝑑. This is the only way to get a basis of
rank-one tensors on a tensor space. If ℬ1 ⊗ · · · ⊗ ℬ𝑑 is a basis for V1 ⊗ · · · ⊗ V𝑑 ,
then ℬ1 , . . . , ℬ𝑑 must be bases of V1 , . . . , V𝑑 , respectively. More generally, for
mixed tensor spaces, ℬ1 ⊗ · · · ⊗ ℬ𝑝 ⊗ ℬ∗𝑝+1 ⊗ · · · ⊗ ℬ𝑑∗ is a basis for V1 ⊗ · · · ⊗
V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 , where ℬ∗ denotes the dual basis of ℬ. There is no need to
assume finite dimension; these bases may well be uncountable. If the vector spaces
are finite-dimensional, then (4.43) implies that
dim(V1 ⊗ · · · ⊗ V𝑑 ) = dim(V1 ) · · · dim(V𝑑 ). (4.44)
Aside from (4.31), relation (4.44) is another reason the tensor product is the
proper operation for combining quantum states of different systems: if a quantum
system is made up of superpositions (i.e. linear combinations) of 𝑚 distinct (i.e.
linearly independent) states and another is made up of superpositions of 𝑛 distinct
states, then it is physically reasonable for the combined system to be made up
of superpositions of 𝑚𝑛 distinct states (Nielsen and Chuang 2000, p. 94). As a
result, we expect the combination of an 𝑚-dimensional quantum state space and an
𝑛-dimensional quantum state space to be an 𝑚𝑛-dimensional quantum state space,
and by (4.44), the tensor product fits the bill perfectly. This of course is hand-
waving and assumes finite-dimensionality. For a fully rigorous justification, we
refer readers to Aerts and Daubechies (1978, 1979a,b).
Note that there is no contradiction between (4.42) and (4.43): in the former the
objects are vector spaces; in the latter they are bases of vector spaces. The tensor
product symbol ⊗ is used in at least a dozen different ways, but fortunately there
is little cause for confusion as its meaning is almost always unambiguous from the
context.
One downside of Definition 4.4 in Section 4.1 is that it requires us to convert
everything into real-valued functions before we can discuss tensor products. What
we gain from the abstraction in Definition 4.9 is generality: the construction above
allows us to form tensor products of any objects as is, so long as they belong to
some vector space (or module).
obvious points but they are often obfuscated when higher-order tensors are brought
into the picture.
The Kronecker product above may also be viewed as another manifestation of
Definition 4.9, but a more fruitful way is to deduce it as a tensor product of
linear operators that naturally follows from Definition 4.9. This is important and
sufficiently interesting to warrant separate treatment.
Example 4.11 (Kronecker product). Given linear operators
Φ1 : V1 → W1 and Φ2 : V2 → W2 ,
forming the tensor products V1 ⊗ V2 and W1 ⊗ W2 automatically25 gives us a linear
operator
Φ1 ⊗ Φ2 : V1 ⊗ V2 → W1 ⊗ W2
defined on rank-one elements by
Φ1 ⊗ Φ2 (𝑣 1 ⊗ 𝑣 2 ) ≔ Φ1 (𝑣 1 ) ⊗ Φ2 (𝑣 2 )
and extended linearly to all elements of V1 ⊗ V2 , which are all finite linear com-
binations of rank-one elements. The way it is defined, Φ1 ⊗ Φ2 is clearly unique
and we will call it the Kronecker product of linear operators Φ1 and Φ2 . It extends
easily to an arbitrary number of linear operators via
Φ1 ⊗ · · · ⊗ Φ𝑑 (𝑣 1 ⊗ · · · ⊗ 𝑣 𝑑 ) ≔ Φ1 (𝑣 1 ) ⊗ · · · ⊗ Φ𝑑 (𝑣 𝑑 ), (4.47)
and the result obeys (4.40):
(𝜆Φ1 + 𝜆 ′Φ1′ ) ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 = 𝜆Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 + 𝜆 ′Φ1′ ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 ,
Φ1 ⊗ (𝜆Φ2 + 𝜆 ′Φ2′ ) ⊗ · · · ⊗ Φ𝑑 = 𝜆Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 + 𝜆 ′Φ1 ⊗ Φ2′ ⊗ · · · ⊗ Φ𝑑 ,
⋮
Φ1 ⊗ Φ2 ⊗ · · · ⊗ (𝜆Φ𝑑 + 𝜆′Φ𝑑′) = 𝜆Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 + 𝜆′Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑′ .
In other words, the Kronecker product defines a tensor product in the sense of
Definition 4.9 on the space of linear operators,
L(V1 ; W1 ) ⊗ · · · ⊗ L(V𝑑 ; W𝑑 ),
that for finite-dimensional vector spaces equals
L(V1 ⊗ · · · ⊗ V𝑑 ; W1 ⊗ · · · ⊗ W𝑑 ).
In addition, as linear operators they may be composed and Moore–Penrose-inverted,
25 This is called functoriality in category theory. The fact that the tensor product of vector spaces
in Definition 4.9 gives a tensor product of linear operators on these spaces says that the tensor
product in Definition 4.9 is functorial.
and have adjoints and ranks, images and null spaces, all of which work in tandem
with the Kronecker product:
(Φ1 ⊗ · · · ⊗ Φ𝑑 )(Ψ1 ⊗ · · · ⊗ Ψ𝑑 ) = Φ1 Ψ1 ⊗ · · · ⊗ Φ𝑑 Ψ𝑑 , (4.48)
(Φ1 ⊗ · · · ⊗ Φ𝑑 )† = Φ†1 ⊗ · · · ⊗ Φ†𝑑 ,    (4.49)
(Φ1 ⊗ · · · ⊗ Φ𝑑 )∗ = Φ∗1 ⊗ · · · ⊗ Φ∗𝑑 ,    (4.50)
rank(Φ1 ⊗ · · · ⊗ Φ𝑑 ) = rank(Φ1 ) · · · rank(Φ𝑑 ), (4.51)
im(Φ1 ⊗ · · · ⊗ Φ𝑑 ) = im(Φ1 ) ⊗ · · · ⊗ im(Φ𝑑 ). (4.52)
Observe that the ⊗ on the left of (4.52) is the Kronecker product of operators
whereas that on the right is the tensor product of vector spaces. For null spaces,
we have
ker(Φ1 ⊗ Φ2 ) = ker(Φ1 ) ⊗ V2 + V1 ⊗ ker(Φ2 ),
when 𝑑 = 2 and more generally
ker(Φ1 ⊗ · · · ⊗ Φ𝑑 ) = ker(Φ1 ) ⊗ V2 ⊗ · · · ⊗ V𝑑 + V1 ⊗ ker(Φ2 ) ⊗ · · · ⊗ V𝑑
+ · · · + V1 ⊗ V2 ⊗ · · · ⊗ ker(Φ𝑑 ).
Therefore injectivity and surjectivity are preserved by taking Kronecker products.
If V𝑖 = W𝑖 , 𝑖 = 1, . . . , 𝑑, then the eigenpairs of Φ1 ⊗ · · · ⊗ Φ𝑑 are exactly those
given by
(𝜆 1 𝜆 2 · · · 𝜆 𝑑 , 𝑣 1 ⊗ 𝑣 2 ⊗ · · · ⊗ 𝑣 𝑑 ), (4.53)
where (𝜆 𝑖 , 𝑣 𝑖 ) is an eigenpair of Φ𝑖 , 𝑖 = 1, . . . , 𝑑. All of these are straightforward
consequences of (4.47) (Berberian 2014, Section 13.2).
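Several of these properties are easy to verify numerically for matrices; the sketch below checks (4.48), (4.51), (4.53) and (4.54) on random matrices of arbitrary sizes (the eigenvalue comparison sorts both lists, which is fine for generic random inputs).

```python
import numpy as np

rng = np.random.default_rng(2)
X1, X2 = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
Y1, Y2 = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))

K = np.kron(X1, X2)

# (4.48): (X1 (x) X2)(Y1 (x) Y2) = X1 Y1 (x) X2 Y2.
print(np.allclose(K @ np.kron(Y1, Y2), np.kron(X1 @ Y1, X2 @ Y2)))      # True

# (4.51): rank is multiplicative.
print(np.linalg.matrix_rank(K) ==
      np.linalg.matrix_rank(X1) * np.linalg.matrix_rank(X2))            # True

# (4.53): eigenvalues of X1 (x) X2 are products of eigenvalues of X1 and X2.
ev = np.sort_complex(np.linalg.eigvals(K))
prod = np.sort_complex(np.array([a * b for a in np.linalg.eigvals(X1)
                                        for b in np.linalg.eigvals(X2)]))
print(np.allclose(ev, prod))                                            # True

# (4.54): trace and determinant, with p_1 = 12/3 = 4 and p_2 = 12/4 = 3.
print(np.isclose(np.trace(K), np.trace(X1) * np.trace(X2)))             # True
print(np.isclose(np.linalg.det(K),
                 np.linalg.det(X1) ** 4 * np.linalg.det(X2) ** 3))      # True
```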
The Kronecker product of matrices simply expresses how the matrix representing
Φ1 ⊗ · · · ⊗ Φ𝑑 is related to those representing Φ1 , . . . , Φ𝑑 . If we pick bases ℬ𝑖 for
V𝑖 and 𝒞𝑖 for W𝑖 , then each linear operator Φ𝑖 has a matrix representation
[Φ𝑖 ] ℬ𝑖 ,𝒞𝑖 = 𝑋𝑖 ∈ R𝑚𝑖 ×𝑛𝑖
as in (3.5), 𝑖 = 1, . . . , 𝑑, and
[Φ1 ⊗ · · · ⊗ Φ𝑑 ] ℬ1 ⊗ ···⊗ℬ𝑑 , 𝒞1 ⊗ ···⊗𝒞𝑑 = 𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ∈ R𝑚1 𝑚2 ···𝑚𝑑 ×𝑛1 𝑛2 ···𝑛𝑑 .
Note that the tensor products on the left of the equation are Kronecker products as
defined in (4.47) and those on the right are matrix Kronecker products as defined
in Example 4.10(ii), applied 𝑑 times in any order (order does not matter as ⊗ is
associative). The tensor product of 𝑑 bases is as defined in (4.43).
Clearly the matrix Kronecker product inherits the properties (4.48)–(4.53), and
when 𝑚 𝑖 = 𝑛𝑖 , the eigenvalue property in particular gives
tr(𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ) = tr(𝑋1 ) · · · tr(𝑋𝑑 ), (4.54)
det(𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ) = det(𝑋1 ) 𝑝1 · · · det(𝑋𝑑 ) 𝑝𝑑 ,
where 𝑝 𝑖 = (𝑛1 𝑛2 · · · 𝑛𝑑 )/𝑛𝑖 , 𝑖 = 1, . . . , 𝑑. Incidentally these last two properties
This clearly extends the matrix–matrix product, which is just the case when 𝑑 = 1.
We have in fact already encountered Example 4.11 in a different form: this is
(4.48), the composition of two Kronecker products of 𝑑 linear operators. To view
it in hypermatrix form, we just use the fact that L(V; W) = V∗ ⊗ W. Indeed, we
may also view it as a tensor contraction of
𝑇 ∈ (U∗1 ⊗ V1 ) ⊗ · · · ⊗ (U∗𝑑 ⊗ V𝑑 ), 𝑇 ′ ∈ (V∗1 ⊗ W1 ) ⊗ · · · ⊗ (V∗𝑑 ⊗ W𝑑 ),
from which we obtain
h𝑇 , 𝑇 ′i ∈ (U∗1 ⊗ W1 ) ⊗ · · · ⊗ (U∗𝑑 ⊗ W𝑑 ).
This is entirely within expectation as the first fundamental theorem of invariant the-
ory, the result that prevents the existence of a 𝑑-hypermatrix–hypermatrix product
for odd 𝑑, also tells us that essentially the only way to multiply tensors without
increasing order is via contractions. This product is well-defined for 2𝑑-tensors
and does not depend on bases: in terms of Kronecker products of matrices,
((𝑋1 𝐴1𝑌1−1 ) ⊗ · · · ⊗ (𝑋𝑑 𝐴𝑑𝑌𝑑−1 ))((𝑌1 𝐵1 𝑍1−1 ) ⊗ · · · ⊗ (𝑌𝑑 𝐵 𝑑 𝑍 𝑑−1 ))
= (𝑋1 𝐴1 𝐵1 𝑍1−1 ) ⊗ · · · ⊗ (𝑋𝑑 𝐴𝑑 𝐵 𝑑 𝑍 𝑑−1 ),
and in terms of hypermatrix–hypermatrix products,
((𝑋1 , 𝑌1−1 , . . . , 𝑋𝑑 , 𝑌𝑑−1 ) · 𝐴)((𝑌1 , 𝑍1−1 , . . . , 𝑌𝑑 , 𝑍 𝑑−1 ) · 𝐵)
= (𝑋1 , 𝑍1−1 , . . . , 𝑋𝑑 , 𝑍 𝑑−1 ) · (𝐴𝐵),
that is, they satisfy the higher-order analogue of (2.2). Frankly, we do not see any
advantage in formulating such a product as 2𝑑-hypermatrices, but an abundance of
disadvantages.
Just as it is possible to deduce the tensor transformation rules in definition ➀ from
definition ➁, we can do likewise with definition ➂ in the form of Definition 4.9.
Example 4.14 (tensor transformation rules revisited). As stated on page 110,
the tensor product construction in this section does not require that U, V, . . . , W
have bases, that is, they could be modules, which do not have bases in general (those
with bases are called free modules). Nevertheless, in the event where they do have
bases, their change-of-basis theorems would lead us directly to the transformation
rules discussed in Section 2.
We first remind the reader of a simple notion, discussed in standard linear
algebra textbooks such as Berberian (2014, Section 3.9) and Friedberg et al. (2003,
Section 2.6) but often overlooked. Any linear operator Φ : V → W induces a
transpose linear operator on the dual spaces defined as
ΦT : W∗ → V∗ , ΦT (𝜑) ≔ 𝜑 ◦ Φ
for any linear functional 𝜑 ∈ W∗ . Note that the composition 𝜑 ◦ Φ : V → R is
indeed a linear functional in V∗ . The reason for its name is that
[Φ] ℬ,𝒞 = 𝐴 ∈ R𝑚×𝑛 if and only if [ΦT ] 𝒞∗ ,ℬ∗ = 𝐴T ∈ R𝑛×𝑚 .
One may show that Φ is injective if and only if ΦT is surjective, and Φ is surjective if
and only if ΦT is injective.26 So if Φ : V → W is invertible, then so is ΦT : W∗ → V∗
and its inverse is a linear operator,
Φ−T : V∗ → W∗ .
Another name for ‘invertible linear operator’ is vector space isomorphism, es-
pecially when used in the following context. Any basis ℬ = {𝑣 1 , . . . , 𝑣 𝑛 }
26 Not a typo. Injectivity and surjectivity are in fact dual notions in this sense.
Φ1 ⊗ · · · ⊗ Φ 𝑝 ⊗ Φ_{𝑝+1}^{−T} ⊗ · · · ⊗ Φ𝑑^{−T}(𝑇 ) = 𝐴 ∈ R^{𝑛1 ×···×𝑛𝑑 },
Ψ1 ⊗ · · · ⊗ Ψ 𝑝 ⊗ Ψ_{𝑝+1}^{−T} ⊗ · · · ⊗ Ψ𝑑^{−T}(𝑇 ) = 𝐴′ ∈ R^{𝑛1 ×···×𝑛𝑑 }.
𝐴 = Φ1 Ψ1^{−1} ⊗ · · · ⊗ Φ 𝑝 Ψ 𝑝^{−1} ⊗ (Φ_{𝑝+1} Ψ_{𝑝+1}^{−1})^{−T} ⊗ · · · ⊗ (Φ𝑑 Ψ𝑑^{−1})^{−T}(𝐴′),
where we have used (4.48) and (4.49). Note that each Φ𝑖 Ψ𝑖^{−1} : R^{𝑛𝑖 } → R^{𝑛𝑖 } is an
invertible linear operator and so it must be given by Φ𝑖 Ψ𝑖^{−1}(𝑣) = 𝑋𝑖 𝑣 for some
𝑋𝑖 ∈ GL(𝑛𝑖 ). Using (4.55), the above relation between 𝐴 and 𝐴′ in terms of
multilinear matrix multiplication is just
𝐴 = (𝑋1 , . . . , 𝑋 𝑝 , 𝑋_{𝑝+1}^{−T}, . . . , 𝑋𝑑^{−T}) · 𝐴′,
which gives us the tensor transformation rule (2.9). To get the isomorphism in
(4.57), we had to identify R𝑛 with (R𝑛 )∗ , and this is where we lost all information
pertaining to covariance and contravariance; a hypermatrix cannot perfectly rep-
resent a tensor, which is one reason why we need to look at the transformation rules
to ascertain the tensor.
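As a sketch of how the multilinear matrix multiplication in Example 4.14 can be carried out in coordinates, the following realizes (𝑋1 , 𝑋2 , 𝑋3 ) · 𝐴 with einsum for a 3-tensor with all factors contravariant (an illustrative special case), and checks that two successive changes of basis compose as expected.

```python
import numpy as np

rng = np.random.default_rng(3)

def mlmult(Xs, A):
    # (X1, X2, X3) . A, i.e. A'_{ijk} = sum_{pqr} X1_{ip} X2_{jq} X3_{kr} A_{pqr}.
    X1, X2, X3 = Xs
    return np.einsum('ip,jq,kr,pqr->ijk', X1, X2, X3, A)

dims = (2, 3, 4)
A = rng.standard_normal(dims)
Xs = [rng.standard_normal((k, k)) for k in dims]
Ys = [rng.standard_normal((k, k)) for k in dims]

# Two successive changes of basis compose into the change of basis by the products,
# the d = 3 contravariant analogue of the rule discussed above.
lhs = mlmult(Xs, mlmult(Ys, A))
rhs = mlmult([X @ Y for X, Y in zip(Xs, Ys)], A)
print(np.allclose(lhs, rhs))   # True
```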
The example above essentially discusses change of basis without any mention
of bases. In pure mathematics, sweeping such details under the rug is generally
regarded as a good thing; in applied and computational mathematics, such details
are usually unavoidable and we will work them out below.
which has the form in (4.58). Since an arbitrary 𝑇 is simply a finite sum of rank-one
By (4.59) and (4.60), any 𝑇 must be expressible in the form (4.58). For a different
choice of bases ℬ1′ , ℬ2′ , . . . , ℬ𝑑′ we obtain a different hypermatrix representation
𝐴 ′ ∈ R𝑚×𝑛×···× 𝑝 for 𝑇 that is related to 𝐴 by a multilinear matrix multiplication as
in Example 4.14.
Note that (4.59) and (4.60), respectively, give the formulas for outer product and
linear combination:
with bases of U ′, V ′, . . . , W ′
ℬ1′ = {𝑢1′ , . . . , 𝑢′_{𝑚′} },   ℬ2′ = {𝑣 1′ , . . . , 𝑣 ′_{𝑛′} },   . . . ,   ℬ𝑑′ = {𝑤 1′ , . . . , 𝑤 ′_{𝑝′} },
that 𝛽 is a bilinear functional and Φ a linear operator (the dyad case is discussed
in Example 4.8), information that is lost if one just looks at the matrix 𝐴 = [𝑎𝑖 𝑗 ].
Furthermore, the 𝑢𝑖 and 𝑣 𝑗 are sometimes more important than the 𝑎𝑖 𝑗 . We might
have
where the 𝑎𝑖 𝑗 are all ones, and all the important information is in the basis vectors,
which are the point evaluation functionals we encountered in Example 4.6, that is,
𝜀 𝑥 ( 𝑓 ) = 𝑓 (𝑥) for any 𝑓 : [−1, 1] → R. The decomposition (4.61) is in fact a
four-point Gauss quadrature formula on the domain [−1, 1] × [−1, 1]. We will
revisit this in Example 4.48.
Example 4.16 (multilinear rank and tensor rank). For any 𝑑 vector spaces U, V,
. . . , W and any subspaces U ′ ⊆ U, V ′ ⊆ V, . . . , W ′ ⊆ W, it follows from (4.39)
that
U ′ ⊗ V ′ ⊗ · · · ⊗ W ′ ⊆ U ⊗ V ⊗ · · · ⊗ W.
For a given 𝑇 , the vectors on the right of (4.63) are not unique but the subspaces
U ′, V ′, . . . , W ′ that they span are unique, and thus 𝜇rank(𝑇 ) is well-defined.27 One
may view these subspaces as generalizations of the row and column spaces of a
matrix and multilinear rank as a generalization of the row and column ranks. Note
that the coefficient hypermatrix 𝐶 = [𝑐𝑖 𝑗 ···𝑘 ] ∈ R 𝑝×𝑞×···×𝑟 is not unique either,
but two such coefficient hypermatrices 𝐶 and 𝐶 ′ must be related by the tensor
transformation rule (2.9), as we saw in Example 4.15.
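For a hypermatrix, the subspaces U′, V′, . . . , W′ can be read off from the spans of its slices, so the multilinear rank of a 3-tensor is a triple of ordinary matrix ranks of its unfoldings. A short sketch (the unfolding convention below is one of several and is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Build a 3-tensor with multilinear rank at most (2, 3, 2):
# A = (X1, X2, X3) . C with a 2 x 3 x 2 coefficient hypermatrix C.
C = rng.standard_normal((2, 3, 2))
X1 = rng.standard_normal((5, 2))
X2 = rng.standard_normal((6, 3))
X3 = rng.standard_normal((7, 2))
A = np.einsum('ip,jq,kr,pqr->ijk', X1, X2, X3, C)

def multilinear_rank(T):
    # Rank of the mode-k unfolding = dimension of the k-th subspace spanned by T.
    ranks = []
    for k in range(3):
        Tk = np.moveaxis(T, k, 0).reshape(T.shape[k], -1)
        ranks.append(np.linalg.matrix_rank(Tk))
    return tuple(ranks)

print(multilinear_rank(A))   # (2, 3, 2) for generic random factors
```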
The decomposition in (4.63) is a sum of rank-one tensors, and if we simply use
this as impetus and define
rank(𝑇 ) ≔ min{ 𝑟 ∈ N : 𝑇 = ∑_{i=1}^{𝑟} 𝑢𝑖 ⊗ 𝑣 𝑖 ⊗ · · · ⊗ 𝑤 𝑖 , 𝑢1 , . . . , 𝑢𝑟 ∈ U,
𝑣 1 , . . . , 𝑣𝑟 ∈ V, . . . , 𝑤 1 , . . . , 𝑤 𝑟 ∈ W },
27 While N𝑑 is only partially ordered, this shows that the minimum in (4.62) is unique.
Vector spaces become enormously more interesting when equipped with in-
ner products and norms; it is the same with tensor spaces. The tensor product
construction leading to Definition 4.9 allows us to incorporate them easily.
Example 4.17 (tensor products of inner product and norm spaces). We let
U, V, . . . , W be 𝑑 vector spaces equipped with either norms
k · k 1 : U → [0, ∞), k · k 2 : V → [0, ∞), . . . , k · k 𝑑 : W → [0, ∞)
or inner products
h · , · i1 : U × U → R, h · , · i2 : V × V → R, . . . , h · , · i𝑑 : W × W → R.
We would like to define a norm or an inner product on the tensor space U⊗V⊗· · ·⊗W
as defined in (4.39). Note that we do not assume that these vector spaces are finite-
dimensional.
The inner product is easy. First define it on rank-one tensors,
h𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤, 𝑢 ′ ⊗ 𝑣 ′ ⊗ · · · ⊗ 𝑤 ′i ≔ h𝑢, 𝑢 ′i1 h𝑣, 𝑣 ′i2 · · · h𝑤, 𝑤 ′i𝑑 (4.64)
for all 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, . . . , 𝑤, 𝑤 ′ ∈ W. Then extend it bilinearly, that is, decree
that for rank-one tensors 𝑆, 𝑇 , 𝑆 ′, 𝑇 ′ and scalars 𝜆, 𝜆 ′,
h𝜆𝑆 + 𝜆 ′ 𝑆 ′, 𝑇 i ≔ 𝜆h𝑆, 𝑇 i + 𝜆 ′ h𝑆 ′, 𝑇 i,
h𝑆, 𝜆𝑇 + 𝜆 ′𝑇 ′i ≔ 𝜆h𝑆, 𝑇 i + 𝜆 ′ h𝑆, 𝑇 ′i.
As U ⊗ V ⊗ · · · ⊗ W comprises finite linear combinations of rank-one tensors
by definition, this defines h · , · i on the whole tensor space. It is then routine to
check that the axioms of inner product are satisfied by h · , · i. This construction
applies verbatim to other bilinear functionals such as the Lorentzian scalar product
in Example 2.4.
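For hypermatrices, the inner product so defined is just the entrywise one; a quick numerical check of (4.64) on random vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
u, up = rng.standard_normal(3), rng.standard_normal(3)
v, vp = rng.standard_normal(4), rng.standard_normal(4)
w, wp = rng.standard_normal(5), rng.standard_normal(5)

outer = lambda a, b, c: np.einsum('i,j,k->ijk', a, b, c)   # u (x) v (x) w as a hypermatrix

lhs = np.sum(outer(u, v, w) * outer(up, vp, wp))   # entrywise inner product
rhs = (u @ up) * (v @ vp) * (w @ wp)               # <u,u'><v,v'><w,w'> as in (4.64)
print(np.isclose(lhs, rhs))   # True
```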
Norms are slightly trickier because there are multiple ways to define them on
tensor spaces. We will restrict ourselves to the three most common constructions.
If we have an inner product as above, then the norm induced by the inner product,
‖𝑇 ‖F ≔ √⟨𝑇 , 𝑇 ⟩,
is called the Hilbert–Schmidt norm. Note that by virtue of (4.64), the Hilbert–
Schmidt norm would satisfy
k𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 k = k𝑢k 1 k𝑣k 2 · · · k𝑤 k 𝑑 , (4.65)
a property that we will require of all tensor norms. As we mentioned in Example 4.1,
such norms are called cross-norms. We have to extend (4.65) so that k · k is defined
on not just rank-one tensors but on all finite sums of these. Naively we might just
define it to be the sum of the norms of its rank-one summands, but that does not
work as we get different values with different sums of rank-one tensors. Fortunately,
taking the infimum over all such decompositions,
‖𝑇 ‖𝜈 ≔ inf{ ∑_{i=1}^{𝑟} ‖𝑢𝑖 ‖1 ‖𝑣 𝑖 ‖2 · · · ‖𝑤 𝑖 ‖𝑑 : 𝑇 = ∑_{i=1}^{𝑟} 𝑢𝑖 ⊗ 𝑣 𝑖 ⊗ · · · ⊗ 𝑤 𝑖 , 𝑟 ∈ N },    (4.66)
fixes the issue and gives us a norm. This is the nuclear norm, special cases of
which we have encountered in (3.45) and in Examples 3.11 and 4.1.
There is an alternative. We may also take the supremum over all rank-one tensors
in the dual tensor space:
‖𝑇 ‖𝜎 ≔ sup{ |𝜑 ⊗ 𝜓 ⊗ · · · ⊗ 𝜃(𝑇 )| / (‖𝜑‖1∗ ‖𝜓‖2∗ · · · ‖𝜃‖𝑑∗ ) : 𝜑 ∈ U∗ , 𝜓 ∈ V∗ , . . . , 𝜃 ∈ W∗ },    (4.67)
where k · k 𝑖∗ is the dual norm of k · k 𝑖 as defined by (3.44), 𝑖 = 1, . . . , 𝑑. This is the
spectral norm, special cases of which we have encountered in (3.19) and (3.46) and
in Example 3.17. There are many more cross-norms (Diestel et al. 2008, Chapter 4)
but these three are the best known. The nuclear and spectral norms are also special
in that any cross-norm k · k must satisfy
k𝑇 k 𝜎 ≤ k𝑇 k ≤ k𝑇 k 𝜈 for all 𝑇 ∈ U ⊗ V ⊗ · · · ⊗ W; (4.68)
and conversely any norm k · k that satisfies (4.68) must be a cross-norm (Ryan
2002, Proposition 6.1). In this sense, the nuclear and spectral norms, respectively,
are the largest and smallest cross-norms. If V is a Hilbert space, then by the Riesz
representation theorem, a linear functional 𝜑 : V → R takes the form 𝜑 = h𝑣, ·i for
some 𝑣 ∈ V and the Hilbert space norm on V and its dual norm may be identified.
Thus, if U, V, . . . , W are Hilbert spaces, then the spectral norm in (4.67) takes the
form
‖𝑇 ‖𝜎 = sup{ |⟨𝑇 , 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤⟩| / (‖𝑢‖1 ‖𝑣‖2 · · · ‖𝑤‖𝑑 ) : 𝑢 ∈ U, 𝑣 ∈ V, . . . , 𝑤 ∈ W }.    (4.69)
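For 𝑑 = 2 with Euclidean norms these cross-norms are familiar matrix norms: the Hilbert–Schmidt norm is the Frobenius norm, the nuclear norm is the sum of singular values and the spectral norm is the largest singular value. A quick check of (4.68) in that case:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 6))
s = np.linalg.svd(A, compute_uv=False)

nuclear = s.sum()                      # ||A||_nu
frobenius = np.sqrt((s ** 2).sum())    # ||A||_F, a cross-norm
spectral = s[0]                        # ||A||_sigma

print(np.isclose(frobenius, np.linalg.norm(A)))   # True
print(spectral <= frobenius <= nuclear)           # True, as in (4.68)
```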
The definition of inner products on tensor spaces fits perfectly with other tensorial
notions such as the tensor product basis and the Kronecker product discussed in
Example 4.11. If ℬ1 , ℬ2 , . . . , ℬ𝑑 are orthonormal bases for U, V, . . . , W, then the
tensor product basis ℬ1 ⊗ ℬ2 ⊗ · · · ⊗ ℬ𝑑 as defined in (4.43) is an orthonormal basis
for U⊗V⊗· · ·⊗W. Here we have no need to assume finite dimension or separability:
these orthonormal bases may be uncountable. If Φ1 : U → U, . . . , Φ𝑑 : W → W
are orthogonal linear operators in the sense that
hΦ1 (𝑢), Φ1 (𝑢 ′)i1 = h𝑢, 𝑢 ′i1 , . . . , hΦ𝑑 (𝑤), Φ𝑑 (𝑤 ′)i𝑑 = h𝑤, 𝑤 ′i𝑑
for all 𝑢, 𝑢 ′ ∈ U, . . . , 𝑤, 𝑤 ′ ∈ W, then the Kronecker product Φ1 ⊗ · · · ⊗ Φ𝑑 : U ⊗
· · · ⊗ W → U ⊗ · · · ⊗ W is an orthogonal linear operator, that is,
hΦ1 ⊗ · · · ⊗ Φ𝑑 (𝑆), Φ1 ⊗ · · · ⊗ Φ𝑑 (𝑇 )i = h𝑆, 𝑇 i
for all 𝑆, 𝑇 ∈ U ⊗ · · · ⊗ W. For Kronecker products of operators defined on norm
spaces, we will defer the discussion to the next example.
where 𝜎𝑖 (𝐴) denotes the 𝑖th singular value of 𝐴 ∈ R𝑚×𝑛 and 𝑟 = rank(𝐴).
We emphasize that the discussion in the above paragraph requires the bases in
Example 4.15 to be orthonormal. Nevertheless, the values of the inner product
and norms do not depend on which orthonormal bases we choose. In the termin-
ology of Section 2 they are invariant under multilinear matrix multiplication by
(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) ∈ O(𝑚) × O(𝑛) × · · · × O(𝑝), or equivalently, they are defined on
Cartesian tensors. More generally, if h𝑋𝑖 𝑣, 𝑋𝑖 𝑣i𝑖 = h𝑣, 𝑣i𝑖 , 𝑖 = 1, . . . , 𝑑, then
h(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴, (𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐵i = h𝐴, 𝐵i,
and thus k(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴k F = k 𝐴k F; if k 𝑋𝑖 𝑣k 𝑖 = k𝑣k 𝑖 , 𝑖 = 1, . . . , 𝑑, then
k(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴k 𝜈 = k 𝐴k 𝜈 , k(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴k 𝜎 = k 𝐴k 𝜎 .
This explains why, when discussing definition ➀ in conjunction with inner products
or norms, we expect the change-of-basis matrices in the transformation rules to
preserve these inner products or norms.
Inner product and norm spaces become enormously more interesting when they
are completed into Hilbert and Banach spaces. The study of cross-norms was in
fact started by Grothendieck (1955) and Schatten (1950) in order to define tensor
products of Banach spaces, and this has grown into a vast subject (Defant and Floret
1993, Diestel et al. 2008, Light and Cheney 1985, Ryan 2002, Trèves 2006) that
we are unable to survey at any reasonable level of detail. The following example
is intended to convey an idea of how cross-norms allow one to complete the tensor
spaces in a suitable manner.
Example 4.18 (tensor product of Hilbert and Banach spaces). We revisit our
discussion in Example 4.1, properly defining topological tensor products b⊗ using
continuous and integrable functions as illustrations.
(i) If 𝑋 and 𝑌 are compact Hausdorff topological spaces, then by the Stone–
Weierstrass theorem, 𝐶(𝑋) ⊗ 𝐶(𝑌 ) is a dense subset of 𝐶(𝑋 × 𝑌 ) with respect
to the uniform norm
‖ 𝑓 ‖∞ = sup_{(𝑥,𝑦)∈𝑋×𝑌 } | 𝑓 (𝑥, 𝑦)|.    (4.71)
(ii) If 𝑋 and 𝑌 are 𝜎-finite measure spaces, then by the Fubini–Tonelli theorem,
𝐿 1 (𝑋) ⊗ 𝐿 1 (𝑌 ) is a dense subset of 𝐿 1 (𝑋 × 𝑌 ) with respect to the 𝐿 1 -norm
‖ 𝑓 ‖1 = ∫_{𝑋×𝑌 } | 𝑓 (𝑥, 𝑦)| d𝑥 d𝑦.    (4.72)
Here a tensor product of vector spaces is as defined in (4.39), and as we just saw in
Example 4.17, there are several ways to equip it with a norm and with respect to any norm
we may complete it (i.e. by adding the limits of all Cauchy sequences) to obtain
Banach and Hilbert spaces out of norm and inner product spaces. The completed
space depends on the choice of norms; with a judicious choice, we get
𝐶(𝑋) b⊗𝜎 𝐶(𝑌 ) = 𝐶(𝑋 × 𝑌 ),   𝐿 1 (𝑋) b⊗𝜈 𝐿 1 (𝑌 ) = 𝐿 1 (𝑋 × 𝑌 ),    (4.73)
as was first discovered in Grothendieck (1955). Here b⊗𝜎 and b⊗𝜈 denote completion
in the spectral and nuclear norm respectively and are called the injective tensor
product and projective tensor product respectively. To be clear, the first equality
in (4.73) says that if we equip 𝐶(𝑋) ⊗ 𝐶(𝑌 ) with the spectral norm (4.67) and
complete it to obtain 𝐶(𝑋) b ⊗ 𝜎 𝐶(𝑌 ), then the resulting space is 𝐶(𝑋 × 𝑌 ) equipped
with the uniform norm (4.71), and likewise for the second equality. In particular,
(4.73) also tells us that the uniform and spectral norms are equal on 𝐶(𝑋 × 𝑌 ), and
likewise for the 𝐿 1 -norm in (4.72) and nuclear norm in (4.66). For 𝑓 ∈ 𝐿 1 (𝑋 × 𝑌 ),
∫_{𝑋×𝑌 } | 𝑓 (𝑥, 𝑦)| d𝑥 d𝑦 = inf{ ∑_{i=1}^{𝑟} (∫_𝑋 |𝜑𝑖 (𝑥)| d𝑥)(∫_𝑌 |𝜓𝑖 (𝑦)| d𝑦) : 𝑓 = ∑_{i=1}^{𝑟} 𝜑𝑖 ⊗ 𝜓𝑖 , 𝜑𝑖 ∈ 𝐿 1 (𝑋), 𝜓𝑖 ∈ 𝐿 1 (𝑌 ), 𝑟 ∈ N }.
These are examples of topological tensor products that involve completing the
(algebraic) tensor product in (4.39) with respect to a choice of cross-norm to
obtain a complete topological vector space. These also suggest why it is desirable
to have a variety of different cross-norms, and with each a different topological
tensor product, as the ‘right’ cross-norm to choose for a class of functions F(𝑋) is
usually the one that gives us F(𝑋) b ⊗ F(𝑌 ) = F(𝑋 × 𝑌 ) as we discussed after (4.4).
For example, to get the corresponding result for 𝐿 2 -functions, we have to use the
Hilbert–Schmidt norm
𝐿 2 (𝑋) b⊗F 𝐿 2 (𝑌 ) = 𝐿 2 (𝑋 × 𝑌 ).    (4.74)
Essentially the proof relies on the fact that the completion of 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) with
respect to the Hilbert–Schmidt norm is the closure of 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) as a subspace
of 𝐿 2 (𝑋 × 𝑌 ), and as the orthogonal complement of 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) is zero, its
closure is the whole space (Light and Cheney 1985, Theorem 1.39). Nevertheless,
such results are not always possible: there is no cross-norm that will complete
𝐿 ∞ (𝑋) ⊗ 𝐿 ∞ (𝑌 ) into 𝐿 ∞ (𝑋 × 𝑌 ) for all 𝜎-finite 𝑋 and 𝑌 (Light and Cheney 1985,
Theorem 1.53). Also, we should add that a ‘right’ cross-norm that guarantees
(4.73) may be less interesting than a ‘wrong’ cross-norm that gives us a new tensor
space. For instance, had we used b ⊗ 𝜈 to form a tensor product of 𝐶(𝑋) and 𝐶(𝑌 ),
we would have obtained a smaller subset:
𝐶(𝑋) b⊗𝜈 𝐶(𝑌 ) ⊊ 𝐶(𝑋 × 𝑌 ).
This smaller tensor space of continuous functions on 𝑋 × 𝑌 , more generally the
tensor space 𝐶(𝑋1 ) b⊗𝜈 · · · b⊗𝜈 𝐶(𝑋𝑛 ), is called the Varopoulos algebra and it turns
out to be very interesting and useful in harmonic analysis (Varopoulos 1965, 1967).
A point worth highlighting is the difference between 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) and 𝐿 2 (𝑋) b⊗F 𝐿 2 (𝑌 ).
While the former contains only finite sums of separable functions
∑_{i=1}^{𝑟} 𝑓𝑖 ⊗ 𝑔𝑖 ,
that is, the right-hand side converges to some limit in 𝐿 2 (𝑋 × 𝑌 ) in the Hilbert–
Schmidt norm k · k F . The equality in (4.74) says that every function in 𝐿 2 (𝑋 × 𝑌 )
is given by such a limit, but if we had taken completion with respect to some other
norms such as ‖ · ‖𝜈 or ‖ · ‖𝜎 , their topological tensor products 𝐿 2 (𝑋) b⊗𝜈 𝐿 2 (𝑌 ) or
𝐿 2 (𝑋) b⊗𝜎 𝐿 2 (𝑌 ) would in general be smaller or larger than 𝐿 2 (𝑋 × 𝑌 ) respectively.
Whatever the choice of b ⊗ , the (algebraic) tensor product U ⊗ V ⊗ · · · ⊗ W should
be regarded as the subset of all finite-rank tensors in the topological tensor product
Ub⊗Vb ⊗ ···b⊗ W.
The tensor product of Hilbert spaces invariably refers to the topological tensor
product with respect to the Hilbert–Schmidt norm because the result is always a
Hilbert space. In particular, the b⊗ in (4.13) should be interpreted as b⊗F . However,
as we pointed out above, other topological tensor products with respect to other
cross-norms may also be very interesting.
Example 4.19 (trace-class, Hilbert–Schmidt, compact operators). For a sep-
arable Hilbert space H with inner product h · , · i and induced norm k · k, completing
H ⊗ H∗ with respect to the nuclear, Hilbert–Schmidt and spectral norms, we obtain
different types of bounded linear operators on H as follows:
trace-class:      H b⊗𝜈 H∗ = { Φ ∈ B(H) : ∑_{i∈𝐼} ∑_{j∈𝐼} |⟨Φ(𝑒𝑖 ), 𝑓 𝑗 ⟩| < ∞ },
Hilbert–Schmidt:  H b⊗F H∗ = { Φ ∈ B(H) : ∑_{i∈𝐼} ‖Φ(𝑒𝑖 )‖² < ∞ },
compact:          H b⊗𝜎 H∗ = { Φ ∈ B(H) : 𝑋 ⊆ H bounded ⇒ Φ(𝑋) ⊆ H compact }.
The series convergence is understood to mean ‘for some orthonormal bases {𝑒𝑖 : 𝑖 ∈ 𝐼}
and { 𝑓𝑖 : 𝑖 ∈ 𝐼} of H’, although for trace-class operators the condition could be
simplified to ∑_{i∈𝐼} |⟨Φ(𝑒𝑖 ), 𝑒𝑖 ⟩| < ∞ provided that ‘for some’ is replaced by ‘for all’.
See Schaefer and Wolff (1999, p. 278) for the trace-class result and Trèves (2006,
Theorem 48.3) for the compact result.
If H is separable, then such operators are characterized by their having a Schmidt
decomposition:
Φ = ∑_{i=1}^{∞} 𝜎𝑖 𝑢𝑖 ⊗ 𝑣 ∗𝑖   or   Φ(𝑥) = ∑_{i=1}^{∞} 𝜎𝑖 ⟨𝑣 𝑖 , 𝑥⟩𝑢𝑖   for all 𝑥 ∈ H,    (4.75)
In other words, this is the infinite-dimensional version of the relation between the
various norms and matrix singular values in Example 4.17, and the Schmidt decom-
position is an infinite-dimensional generalization of singular value decomposition.
Unlike the finite-dimensional case, where one may freely speak of the nuclear,
Frobenius and spectral norms of any matrix, for infinite-dimensional H we have
H ⊗ H∗ ⊊ H b⊗𝜈 H∗ ⊊ H b⊗F H∗ ⊊ H b⊗𝜎 H∗ ,   ‖Φ‖𝜎 ≤ ‖Φ‖F ≤ ‖Φ‖𝜈 ,
and the inclusions are strict, for example, a compact operator Φ may have kΦk 𝜈 =
∞. By our discussion at the end of Example 4.18, H ⊗ H∗ is the subset of finite-rank
operators in any of these larger spaces. The inequality relating the three norms is
a special case of (4.68) as ‖ · ‖F is a cross-norm. On H b⊗𝜈 H∗ , nuclear and spectral
norms are dual norms as in the finite-dimensional case (4.70):
‖Φ‖∗𝜈 = sup_{‖Ψ‖𝜈 ≤1} |⟨Φ, Ψ⟩F | = ‖Φ‖𝜎 ,
where 𝐴 = (𝑎𝑖 𝑗 )_{𝑖, 𝑗=1}^{∞} ∈ 𝑙 2 (N2 ), and now observe that
tr(Φ) = ∑_{k=1}^{∞} ⟨Φ(𝑒 𝑘 ), 𝑒 𝑘 ⟩ = ∑_{k=1}^{∞} ⟨∑_{i=1}^{∞} 𝑎𝑖𝑘 𝑒𝑖 , 𝑒 𝑘 ⟩ = ∑_{k=1}^{∞} 𝑎 𝑘 𝑘 .    (4.77)
One may show that an operator is trace-class if and only if it is a product of two
Hilbert–Schmidt operators. A consequence is that
⟨Φ, Ψ⟩F ≔ tr(Φ∗ Ψ)    (4.78)
is always finite and defines an inner product on H b⊗F H∗ that gives ‖ · ‖F as its
induced norm:
⟨Φ, Φ⟩F = tr(Φ∗ Φ) = ‖Φ‖F².
While (H b⊗𝜈 H∗ , ‖ · ‖𝜈 ), (H b⊗F H∗ , ‖ · ‖F ), (H b⊗𝜎 H∗ , ‖ · ‖𝜎 ) are all Banach spaces,
Banach spaces. Note that (4.66) and (4.67) are defined without any reference
to inner products, so for H1 b⊗𝜎 H∗2 and H1 b⊗𝜈 H∗2 we do not need H1 and H2 to
be Hilbert spaces; compact and trace-class operators may be defined for any pair
of Banach spaces, although the latter are usually called nuclear operators in this
context.
Higher order. One may define order-𝑑 ≥ 3 analogues of bounded, compact,
Hilbert–Schmidt, trace-class operators in a straightforward manner, but corres-
ponding results are more difficult; one reason is that Schmidt decomposition (4.75)
no longer holds (Bényi and Torres 2013, Cobos, Kühn and Peetre 1992, 1999).
We remind the reader that many of the topological vector spaces considered
in Examples 4.1 and 4.2, such as 𝑆(𝑋), 𝐶 ∞(𝑋), 𝐶𝑐∞(𝑋), 𝐻(𝑋), 𝑆 ′(𝑋), 𝐸 ′(𝑋),
𝐷 ′(𝑋) and 𝐻 ′(𝑋), are not Banach spaces (and thus not Hilbert spaces) but so-
called nuclear spaces. Nevertheless, similar ideas apply to yield topological tensor
products. In fact nuclear spaces have the nice property that topological tensor
products with respect to the nuclear and spectral norms, i.e. b⊗𝜈 and b⊗𝜎 , are always
equal and thus independent of the choice of cross-norms.
Figure 4.3. Depiction of the feature map 𝐹 : 𝑋 → 𝑙 2 (N) in the context of support-
vector machines.
A key reason density operators are important is that the postulates of quantum
mechanics may be reformulated in terms of them (Nielsen and Chuang 2000,
The evolution of the definition of quantum states from vectors in Hilbert spaces to
density operators to positive linear functionals on 𝐶 ∗ -algebras is not unlike the three
increasingly sophisticated definitions of tensors or the three increasingly abstract
definitions of tensor products. As in the case of tensors and tensor products, each
definition of quantum states is useful in its own way, and all three remain in use.
Our discussion of tensor product will not be complete without mentioning the
tensor algebra. We will make this the last example of this section.
Example 4.22 (tensor algebra). Let V be a vector space and 𝑣 ∈ V. We intro-
duce the shorthand
V^{⊗𝑑} ≔ V ⊗ · · · ⊗ V,   𝑣^{⊗𝑑} ≔ 𝑣 ⊗ · · · ⊗ 𝑣   (𝑑 copies in each)
for any 𝑑 ∈ N. We also define V ⊗0 ≔ R and 𝑣 ⊗0 ≔ 1. There is a risk of construing
erroneous relations from notation like this. We caution that V ⊗𝑑 is not the set of
tensors of the form 𝑣 ⊗𝑑 , as we pointed out in (4.42), but neither is it the set of
linear combinations of such tensors; V ⊗𝑑 contains all finite linear combinations of
any 𝑣 1 ⊗ · · · ⊗ 𝑣 𝑑 , including but not limited to those of the form 𝑣 ⊗𝑑 . As V ⊗𝑑 ,
𝑑 = 0, 1, 2, . . . , are all vector spaces, we may form their direct sum to obtain an
infinite-dimensional vector space:
T(V) = ⨁_{k=0}^{∞} V^{⊗k} = R ⊕ V ⊕ V^{⊗2} ⊕ V^{⊗3} ⊕ · · · ,    (4.83)
called the tensor algebra of V. Those unfamiliar with direct sums may simply
regard T(V) as the set of all finite sums of the tensors of any order:
T(V) = { ∑_{k=0}^{𝑑} 𝑇𝑘 : 𝑇𝑘 ∈ V^{⊗k} , 𝑑 ∈ N }.    (4.84)
However, we prefer the form in (4.84). The reason T(V) is called an algebra is that
it inherits a product operation given by the product of tensors in (4.41):
V^{⊗𝑑} × V^{⊗𝑑′} ∋ (𝑇 , 𝑇 ′) ↦→ 𝑇 ⊗ 𝑇 ′ ∈ V^{⊗(𝑑+𝑑′)} ,
which can be extended to T(V) by
(∑_{j=0}^{𝑑} 𝑇 𝑗 ) ⊗ (∑_{k=0}^{𝑑′} 𝑇𝑘′ ) ≔ ∑_{j=0}^{𝑑} ∑_{k=0}^{𝑑′} 𝑇 𝑗 ⊗ 𝑇𝑘′ .
where all terms have degree 𝑑. This is a superior representation compared to merely
representing the tensor as a hypermatrix (𝑎𝑖1 𝑖2 ···𝑖𝑑 )_{𝑖1 ,𝑖2 ,...,𝑖𝑑 =1}^{𝑛} ∈ R^{𝑛×𝑛×···×𝑛} because,
like usual polynomials, we can take derivatives and integrals of non-commutative
polynomials and evaluate them on matrices, that is, we can take 𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ∈
R𝑚×𝑚 and plug them into 𝑓 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) to get 𝑓 (𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ) ∈ R𝑚×𝑚 .
This last observation alone leads to a rich subject, often called ‘non-commutative
sums-of-squares’, that has many engineering applications (Helton and Putinar
2007).
What if we need to speak of an infinite sum of tensors of different orders? This
is not just of theoretical interest; as we will see in Example 4.45, a multipole
expansion is such an infinite sum. A straightforward way to do this would be to
replace the direct sums in (4.83) with direct products; the difference, if the reader
recalls, is that while the elements of a direct sum are zero for all but a finite number
of summands, those of a direct product may contain an infinite number of non-
zero summands. If there are just a finite number of vector spaces V1 , . . . , V𝑑 , the
direct sum V1 ⊕ · · · ⊕ V𝑑 and the direct product V1 × · · · × V𝑑 are identical, but
if we have infinitely many vector spaces, a direct product is much larger than a
direct sum. For instance, a direct sum of countably many two-dimensional vector
spaces ⨁_{k∈N} V𝑘 has countable dimension, whereas a direct product ∏_{k∈N} V𝑘 has
uncountable dimension.
Alternatively, one may also describe b T(V) as the Hilbert space direct sum of the
Hilbert spaces V ⊗𝑘 , 𝑘 = 0, 1, 2, . . . . If ℬ is an orthonormal basis of V, then the
tensor product basis (4.43), denoted ℬ^{⊗k} , is an orthonormal basis on V^{⊗k} , and
⋃_{k=0}^{∞} ℬ^{⊗k} is a countable orthonormal basis on bT(V), i.e. it is a separable Hilbert
space. This can be used in a ‘tensor trick’ for linearizing a non-linear function with
convergent Taylor series. For example, for any 𝑥, 𝑦 ∈ R𝑛 ,
exp(⟨𝑥, 𝑦⟩) = ∑_{k=0}^{∞} (1/𝑘!) ⟨𝑥, 𝑦⟩^{𝑘} = ∑_{k=0}^{∞} (1/𝑘!) ⟨𝑥^{⊗k} , 𝑦^{⊗k} ⟩
            = ⟨ ∑_{k=0}^{∞} (1/√𝑘!) 𝑥^{⊗k} , ∑_{k=0}^{∞} (1/√𝑘!) 𝑦^{⊗k} ⟩ = ⟨𝑆(𝑥), 𝑆(𝑦)⟩,
where the map 𝑆(𝑥) ≔ ∑_{k=0}^{∞} (1/√𝑘!) 𝑥^{⊗k} ∈ bT(R𝑛 ) is well-defined since
‖𝑆(𝑥)‖² = ∑_{k=0}^{∞} ‖(1/√𝑘!) 𝑥^{⊗k} ‖² = ∑_{k=0}^{∞} (1/𝑘!) ‖𝑥‖^{2𝑘} = exp(‖𝑥‖²) < ∞.
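A hedged numerical illustration of the tensor trick above: truncating 𝑆(𝑥) at a finite degree and flattening each 𝑥^{⊗k} via iterated Kronecker products gives an explicit finite-dimensional feature map whose inner products approximate exp(⟨𝑥, 𝑦⟩); the truncation degree and the scaling of 𝑥, 𝑦 below are arbitrary choices.

```python
import math
import numpy as np

def features(x, degree=10):
    # Truncated feature map: concatenate x^{(x)k} / sqrt(k!), k = 0, ..., degree,
    # with each tensor power flattened into a vector via iterated Kronecker products.
    feats, power = [np.array([1.0])], np.array([1.0])
    for k in range(1, degree + 1):
        power = np.kron(power, x)
        feats.append(power / math.sqrt(math.factorial(k)))
    return np.concatenate(feats)

rng = np.random.default_rng(7)
x, y = 0.5 * rng.standard_normal(3), 0.5 * rng.standard_normal(3)
print(features(x) @ features(y), np.exp(x @ y))   # nearly equal for small x and y
```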
This was used, with sin in place of exp, in an elegant proof (Krivine 1979) of
Grothendieck’s inequality that also yields
𝐾G ≤ 𝜋/(2 log(1 + √2)) ≈ 1.78221,
the upper bound we saw in Example 3.17. This is in fact the best explicit upper
bound for the Grothendieck constant over R, although it is now known that it is not
sharp.
29 There is no standard notation. It has been variously denoted as 𝜎 (Harris 1995), as 𝜑 (Bourbaki
1998, Lang 2002), as ⊗ (Conrad 2018) and as Seg (Landsberg 2012).
multilinear maps. This can be stated in less precise but more intuitive terms as
follows.
(a) All the ‘multilinearness’ in a multilinear map Φ may be factored out of Φ,
leaving behind only the ‘linearness’ that is encapsulated in 𝐹Φ .
(b) The ‘multilinearness’ extracted from any multilinear map is identical, that is,
all multilinear maps are multilinear because they contain a copy of the Segre
map 𝜎⊗ .
Actually, 𝜎⊗ does depend on 𝑑 and so we should have said it is universal for 𝑑-linear
maps. An immediate consequence of the universal factorization property is that
M𝑑 (V1 , . . . , V𝑑 ; W) = L(V1 ⊗ · · · ⊗ V𝑑 ; W).
In principle one could use the universal factorization property to avoid multilinear
maps entirely and discuss only linear maps, but of course there is no reason to do
that: as we saw in Section 3, multilinearity is a very useful notion in its own right.
The universal factorization property should be viewed as a way to move back and
forth between the multilinear realm and the linear one. While we have assumed
contravariant tensors for simplicity, the discussions above apply verbatim to mixed
tensor spaces V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 and with modules in place of vector
spaces.
Pure mathematicians accustomed to commutative diagrams would swear by
(4.88) but it expresses the same thing as (4.89), which is more palatable to applied
and computational mathematicians as it reminds us of matrix factorizations like
Example 2.13, where we factor a matrix 𝐴 ∈ R𝑚×𝑛 into 𝐴 = 𝑄𝑅 with an orthogonal
component 𝑄 ∈ O(𝑚) that captures the ‘orthogonalness’ in 𝐴, leaving behind a
triangular component 𝑅. There is one big difference, though: both 𝑄 and 𝑅 depend
on 𝐴, but in (4.89), whatever our choice of Φ, the multilinear component will always
be 𝜎⊗ . The next three examples will bear witness to this intriguing property.
Example 4.23 (trilinear functionals). The universal factorization property ex-
plains why (3.9) and (4.38) look alike. A trilinear functional 𝜏 : U × V × W → R
can be factored as
𝜏 = 𝐹𝜏 ◦ 𝜎 ⊗ (4.90)
and the multilinearity of 𝜎⊗ in (4.38) accounts for that of 𝜏 in (3.9). For example,
take the first equation in (3.9). We may view it as arising from
𝜏(𝜆𝑢 + 𝜆 ′𝑢 ′, 𝑣, 𝑤) = 𝐹𝜏 (𝜎⊗ (𝜆𝑢 + 𝜆 ′𝑢 ′ , 𝑣, 𝑤)) by (4.90),
= 𝐹𝜏 ((𝜆𝑢 + 𝜆 ′𝑢 ′) ⊗ 𝑣 ⊗ 𝑤) by (4.87),
= 𝐹𝜏 (𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ⊗ 𝑤) by (4.38),
= 𝜆𝐹𝜏 (𝑢 ⊗ 𝑣 ⊗ 𝑤) + 𝜆 ′ 𝐹𝜏 (𝑢 ′ ⊗ 𝑣 ⊗ 𝑤) 𝐹𝜏 is linear,
= 𝜆𝐹𝜏 (𝜎⊗ (𝑢, 𝑣, 𝑤)) + 𝜆 ′ 𝐹𝜏 (𝜎⊗ (𝑢 ′, 𝑣, 𝑤)) by (4.87),
= 𝜆𝜏(𝑢, 𝑣, 𝑤) + 𝜆 ′ 𝜏(𝑢 ′ , 𝑣, 𝑤) by (4.90),
and similarly for the second and third equations in (3.9). More generally, if we
reread Section 3.2 with the hindsight of this section, we will realize that it is simply
a discussion of the universal factorization property without invoking the ⊗ symbol.
We next see how the universal factorization property ties together the three
matrix products that have made an appearance in this article.
Example 4.24 (matrix, Hadamard and Kronecker products). Let us consider
the standard matrix product in (2.5) and Hadamard product in (2.4) on 2 × 2
matrices. Let 𝜇, 𝜂 : R2×2 × R2×2 → R2×2 be, respectively,
𝜇([𝑎11 𝑎12; 𝑎21 𝑎22], [𝑏11 𝑏12; 𝑏21 𝑏22]) = [𝑎11 𝑏11 + 𝑎12 𝑏21   𝑎11 𝑏12 + 𝑎12 𝑏22;  𝑎21 𝑏11 + 𝑎22 𝑏21   𝑎21 𝑏12 + 𝑎22 𝑏22],
𝜂([𝑎11 𝑎12; 𝑎21 𝑎22], [𝑏11 𝑏12; 𝑏21 𝑏22]) = [𝑎11 𝑏11   𝑎12 𝑏12;  𝑎21 𝑏21   𝑎22 𝑏22]
(writing 2 × 2 matrices row by row, with rows separated by semicolons).
They look nothing alike and, as discussed in Section 2, are of entirely different
natures. But the universal factorization property tells us that, by virtue of the fact
that both are bilinear operators from R2×2 ×R2×2 to R2×2 , the ‘bilinearness’ in 𝜇 and 𝜂
is the same, encapsulated in the Segre map 𝜎⊗ : R2×2 × R2×2 → R2×2 ⊗ R2×2 ≅ R4×4 ,
𝜎⊗ ([𝑎11 𝑎12; 𝑎21 𝑎22], [𝑏11 𝑏12; 𝑏21 𝑏22]) = [𝑎11 𝑎12; 𝑎21 𝑎22] ⊗ [𝑏11 𝑏12; 𝑏21 𝑏22]
= [𝑎11 𝑏11  𝑎11 𝑏12  𝑎12 𝑏11  𝑎12 𝑏12;  𝑎11 𝑏21  𝑎11 𝑏22  𝑎12 𝑏21  𝑎12 𝑏22;  𝑎21 𝑏11  𝑎21 𝑏12  𝑎22 𝑏11  𝑎22 𝑏12;  𝑎21 𝑏21  𝑎21 𝑏22  𝑎22 𝑏21  𝑎22 𝑏22],
i.e. the Kronecker product in Example 4.10(ii). The difference between 𝜇 and 𝜂 is
due entirely to their linear components 𝐹𝜇 , 𝐹𝜂 : R4×4 → R2×2 : writing 𝐶 = [𝑐𝑖 𝑗 ] ∈ R4×4 ,
𝐹𝜇 (𝐶) = [𝑐11 + 𝑐23   𝑐12 + 𝑐24;  𝑐31 + 𝑐43   𝑐32 + 𝑐44],   𝐹𝜂 (𝐶) = [𝑐11   𝑐14;  𝑐41   𝑐44],
which are indeed very different. As a sanity check, the reader may like to verify
from these formulas that
𝜇 = 𝐹𝜇 ◦ 𝜎 ⊗ , 𝜂 = 𝐹𝜂 ◦ 𝜎⊗ .
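Carrying out that sanity check numerically is a one-liner once 𝜎⊗ is realized with np.kron; the functions F_mu and F_eta below implement the formulas reconstructed above.

```python
import numpy as np

def F_mu(C):   # linear component of the matrix product
    return np.array([[C[0, 0] + C[1, 2], C[0, 1] + C[1, 3]],
                     [C[2, 0] + C[3, 2], C[2, 1] + C[3, 3]]])

def F_eta(C):  # linear component of the Hadamard product
    return np.array([[C[0, 0], C[0, 3]],
                     [C[3, 0], C[3, 3]]])

rng = np.random.default_rng(8)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
C = np.kron(A, B)                      # sigma_tensor(A, B), the Kronecker product

print(np.allclose(F_mu(C), A @ B))     # True:  mu  = F_mu  o sigma_tensor
print(np.allclose(F_eta(C), A * B))    # True:  eta = F_eta o sigma_tensor
```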
More generally, the same argument shows that any products we define on 2 × 2
matrices must share something in common, namely the Kronecker product that
linear and multilinear maps between spaces, the tensor product space above and
in Example 4.25 is formed with b ⊗ 𝜈 , the topological tensor product with respect to
the nuclear norm. Indeed, b ⊗ 𝜈 is an amazing tensor product in that for any norm
spaces U, V, W, if we have the commutative diagram
U × V —𝜎⊗→ U b⊗𝜈 V —𝐹Φ→ W,   Φ = 𝐹Φ ◦ 𝜎⊗ ,
We will now examine a non-example to see that Definition 4.26 is not omnipotent.
The tensor product of Hilbert spaces is an important exception that does not satisfy
the universal factorization property. The following example is adapted from Garrett
(2010).
Example 4.28 (no universal factorization for Hilbert spaces). We begin with
a simple observation. For a finite-dimensional vector space V, the map defined
by 𝛽 : V × V∗ → R, (𝑣, 𝜑) ↦→ 𝜑(𝑣) is clearly bilinear. So we may apply universal
factorization to get
V × V∗ —𝜎⊗→ V ⊗ V∗ —tr→ R,   𝛽 = tr ◦ 𝜎⊗ .
This observation is key to our subsequent discussion. We will see that when V is
replaced by an infinite-dimensional Hilbert space and we require all maps to be
continuous, then there is no continuous linear map that can take the place of trace.
Let H be a separable infinite-dimensional Hilbert space. As discussed in Ex-
amples 4.18 and 4.19, the tensor product of H and H∗ is interpreted to be the
topological tensor product H b⊗F H∗ with respect to the Hilbert–Schmidt norm k · k F .
So we must have 𝜎⊗ : H × H∗ → H b⊗F H∗ , (𝑣, 𝜑) ↦→ 𝑣 ⊗ 𝜑. Take W = C and consider
the continuous bilinear functional 𝛽 : H × H∗ → C, (𝑣, 𝜑) ↦→ 𝜑(𝑣). We claim that
there is no continuous linear map 𝐹𝛽 : H b ⊗ F H∗ → C such that 𝛽 = 𝐹𝛽 ◦ 𝜎⊗ , that is,
Now take any Φ ∈ H b⊗F H∗ and express it as in (4.76). Then since 𝐹𝛽 is continuous
and linear, we have
𝐹𝛽 (Φ) = ∑_{k=1}^{∞} 𝑎 𝑘 𝑘 = tr(Φ),
that is, 𝐹𝛽 must be the trace as defined in (4.77). This is a contradiction as trace is
unbounded on H b ⊗F H∗ ; just take
Φ = ∑_{k=1}^{∞} 𝑘 −1 𝑒 𝑘 ⊗ 𝑒∗𝑘 ,
which is Hilbert–Schmidt as ‖Φ‖F² = ∑_{k=1}^{∞} 𝑘 −2 < ∞ but tr(Φ) = ∑_{k=1}^{∞} 𝑘 −1 = ∞.
Hence no such map exists.
A plausible follow-up question is: instead of b⊗F , why not take topological
tensor products with respect to the nuclear norm b⊗𝜈 ? As we mentioned at the
end of Example 4.27, this automatically guarantees continuity of all maps. The
problem, as we pointed out in Example 4.19, is that the resulting tensor space, i.e.
the trace-class operators H b⊗𝜈 H∗ , is not a Hilbert space. In fact, one may show
that no such Hilbert space exists. Suppose there is a Hilbert space T and a Segre
map 𝜎T : H × H∗ → T so that the universal factorization property in (4.91) holds.
Then we may choose W = H b ⊗F H∗ and Φ = 𝜎⊗ to get
H × H∗ —𝜎T→ T —𝐹⊗→ H b⊗F H∗ ,   𝜎⊗ = 𝐹⊗ ◦ 𝜎T ,
from which one may deduce (see Garrett 2010 for details) that if 𝐹⊗ is a continuous
linear map, then
h · , · iT = 𝑐h · , · iF
for some 𝑐 > 0, that is, 𝐹⊗ is up to a constant multiple an isometry (Hilbert space
isomorphism). So T and H b ⊗ F H∗ are essentially the same Hilbert space up to
scaling and thus the same trace argument above shows that T does not satisfy the
universal factorization property.
While the goal of this example is to demonstrate that Hilbert spaces generally do
not satisfy the universal factorization property, there is a point worth highlighting
in this construction. As we saw at the beginning, if we do not care about continuity,
then everything goes through without a glitch: the evaluation bilinear functional
𝛽 : V × V∗ → R, (𝑣, 𝜑) ↦→ 𝜑(𝑣) satisfies the universal factorization property
V × V∗ —𝜎⊗→ V ⊗ V∗ —tr→ R,   𝛽 = tr ◦ 𝜎⊗ ,
with the linear functional tr : V ⊗ V∗ → R given by
tr(∑_{i=1}^{𝑟} 𝑣 𝑖 ⊗ 𝜑𝑖 ) = ∑_{i=1}^{𝑟} 𝜑𝑖 (𝑣 𝑖 )
denoted them differently for easy distinction, it is not uncommon to see 𝐷 𝑑 𝑓 (𝑣)
and 𝜕 𝑑 𝑓 (𝑣) used interchangeably, sometimes in the same sentence.
Take the log barrier 𝑓 : S^{𝑛}_{++} → R, 𝑓 (𝑋) = − log det(𝑋), discussed in Examples 3.6
and 3.16. We have
where the latter denotes the Kronecker product as discussed in Example 4.11. The
universal factorization property
S𝑛 × S𝑛 —𝜎⊗→ S𝑛 ⊗ S𝑛 —𝜕 𝑑 𝑓 (𝑋)→ R,   𝐷 𝑑 𝑓 (𝑋) = 𝜕 𝑑 𝑓 (𝑋) ◦ 𝜎⊗ ,
gives us
Note that here it does not matter whether we use vec or vecℓ , since the matrices
involved are all symmetric.
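The derivative formulas for the log barrier are not reproduced in this excerpt, but assuming the standard expressions Df(X)[H] = −tr(X⁻¹H) and D²f(X)[H₁, H₂] = tr(X⁻¹H₁X⁻¹H₂), a finite-difference sketch shows how the bilinear map D²f(X) is evaluated on a pair of symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)                      # a point in S^n_{++}

def sym(H):
    return (H + H.T) / 2

H1, H2 = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))
Xinv = np.linalg.inv(X)

# Assumed closed form of the second derivative as a bilinear functional on S^n x S^n.
closed_form = np.trace(Xinv @ H1 @ Xinv @ H2)

# Central finite difference of the directional derivative t -> Df(X + t H2)[H1].
t = 1e-5
dfp = -np.trace(np.linalg.inv(X + t * H2) @ H1)
dfm = -np.trace(np.linalg.inv(X - t * H2) @ H1)
finite_diff = (dfp - dfm) / (2 * t)

print(np.isclose(closed_form, finite_diff, rtol=1e-5))   # True
```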
Taylor’s theorem for a vector-valued function 𝑓 : Ω ⊆ V → W in Example 3.2
may alternatively be expressed in the form
𝑓 (𝑣) = 𝑓 (𝑣 0 ) + [𝜕 𝑓 (𝑣 0 )](𝑣 − 𝑣 0 ) + (1/2)[𝜕 2 𝑓 (𝑣 0 )](𝑣 − 𝑣 0 )^{⊗2} + · · ·
        + (1/𝑑!)[𝜕 𝑑 𝑓 (𝑣 0 )](𝑣 − 𝑣 0 )^{⊗𝑑} + 𝑅(𝑣 − 𝑣 0 ),
where 𝑣 0 ∈ Ω and k𝑅(𝑣 − 𝑣 0 )k/k𝑣 − 𝑣 0 k 𝑑 → 0 as 𝑣 → 𝑣 0 . Here the ‘Taylor
coefficients’ are linear maps 𝜕 𝑑 𝑓 (𝑣) : V ⊗𝑑 → W. For the important special case
W = R, i.e. a real-valued function 𝑓 : V → R, we have that 𝜕 𝑑 𝑓 (𝑣) : V ⊗𝑑 → R is
a linear functional. So the Taylor coefficients are covariant 𝑑-tensors
𝜕 𝑑 𝑓 (𝑣) ∈ (V∗ )^{⊗𝑑} .
The tensor geometric series 𝑆(𝑣) is clearly a well-defined element of bT(V) whenever
k𝑣k < 1 and the tensor Taylor series 𝜕 𝑓 (𝑣) is well-defined as long as k𝜕 𝑘 𝑓 (𝑣)k ≤ 𝐵
for some uniform bound 𝐵 > 0. We will see this in action when we discuss multipole
expansion in Example 4.45, which is essentially (4.93) applied to 𝑓 (𝑣) = 1/k𝑣k.
Example 4.30 (vector-valued objects). For any real vector space V and any set
𝑋, we denote the set of all functions taking values in V by
V𝑋 ≔ { 𝑓 : 𝑋 → V}.
Φ : R𝑋 × V → V𝑋 , ( 𝑓 , 𝑣) ↦→ 𝑓 · 𝑣,
If we use ⊗̂_σ in place of ⊗̂_ν, we obtain a larger subspace, as we will see next.
Unconditionally summable sequences. Absolute summability implies uncondi-
tional summability but the converse is generally false when B is infinite-dimen-
sional. We have
l¹(N) ⊗̂_σ B = { (x_i)_{i=1}^∞ : x_i ∈ B, Σ_{i=1}^∞ ε_i x_i < ∞ for any ε_i = ±1 }.
The last condition is also equivalent to Σ_{i=1}^∞ x_{σ(i)} < ∞ for any bijection σ : N → N.
Integrable functions. Let Ω be 𝜎-finite. Then
L¹(Ω) ⊗̂_ν B = L¹(Ω; B) = { f : Ω → B : ‖f‖₁ < ∞ }.
This is called the Lebesgue–Bochner space. Here the L¹-norm is defined as
‖f‖₁ = ∫_Ω ‖f(x)‖ dx. In fact the proof of the second half of (4.73) is via
L¹(X) ⊗̂_ν L¹(Y) = L¹(X; L¹(Y)) = L¹(X × Y).
and H2 . Partial traces are an indispensable tool for working with the density
operators discussed in Example 4.21; we refer readers to Cohen-Tannoudji et al.
(2020a, Chapter III, Complement E, Section 5b) and Nielsen and Chuang (2000,
Section 2.4.3) for more information.
Our discussion about partial traces contains quite a bit of hand-waving; we will
justify some of it with the next example, where we discuss some properties of
tensor products that we have used liberally above.
Example 4.31 (calculus of tensor products). The universal factorization prop-
erty is particularly useful for establishing other properties (Greub 1978, Chapter I)
of the tensor product operation ⊗ such as how it interacts with itself,
U ⊗ V ≅ V ⊗ U,    U ⊗ (V ⊗ W) ≅ (U ⊗ V) ⊗ W ≅ U ⊗ V ⊗ W,
with direct sum ⊕,
U ⊗ (V ⊕ W) ≅ (U ⊗ V) ⊕ (U ⊗ W),    (U ⊕ V) ⊗ W ≅ (U ⊗ W) ⊕ (V ⊗ W),
and with intersection ∩,
(V ⊗ W) ∩ (V ′ ⊗ W ′) = (V ∩ V ′) ⊗ (W ∩ W ′),
as well as how it interacts with linear and multilinear maps,
L(U ⊗ V; W) ≅ L(U; L(V; W)) ≅ M₂(U, V; W),
with duality V∗ = L(V; R) a special case,
(V ⊗ W)* ≅ V* ⊗ W*,    V* ⊗ W ≅ L(V; W),
and with the Kronecker product,
L(V; W) ⊗ L(V′; W′) ≅ L(V ⊗ V′; W ⊗ W′).
Collectively, these properties form a system of calculus for manipulating tensor
products of vector spaces. When pure mathematicians speak of multilinear algebra,
this is often what they have in mind, that is, the subject is less about manipulating
individual tensors and more about manipulating whole spaces of tensors. Many,
but not all, of these properties may be extended to other vector-space-like objects
such as modules and vector bundles, or to vector spaces with additional structures
such as metrics, products and topologies. One needs to exercise some caution in
making such extensions. For instance, if V and W are infinite-dimensional Hilbert
spaces, then
V* ⊗ W ≇ B(V; W),
no matter what notion of tensor product or topological tensor product one uses
for ⊗, that is, in this context B(V; W) is not the infinite-dimensional analogue of
L(V; W). As we saw in Examples 4.19 and 4.28, the proper interpretation of ⊗ for
Hilbert spaces is more subtle.
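A finite-dimensional sanity check of two of these isomorphisms can be carried out directly (a sketch in Python with NumPy; the dimensions and random data are arbitrary): V* ⊗ W ≅ L(V; W) identifies Σ_k w_k ⊗ φ_k with the matrix Σ_k w_k φ_kᵀ, and L(U ⊗ V; W) ≅ L(U; L(V; W)) ≅ M₂(U, V; W) amounts to reshaping, i.e. currying, one and the same array of coefficients.

```python
# Finite-dimensional sketch (assumes NumPy) of two tensor-product isomorphisms.
import numpy as np

rng = np.random.default_rng(0)
dU, dV, dW = 2, 3, 4

# (i) V* ⊗ W ≅ L(V; W): Σ_k w_k ⊗ φ_k acts on v as the matrix Σ_k w_k φ_k^T.
phis = rng.standard_normal((5, dV))            # functionals φ_k ∈ V*
ws   = rng.standard_normal((5, dW))            # vectors w_k ∈ W
M    = np.einsum('kw,kv->wv', ws, phis)        # the corresponding linear map
v    = rng.standard_normal(dV)
print(np.allclose(M @ v, sum((phis[k] @ v) * ws[k] for k in range(5))))   # True

# (ii) L(U ⊗ V; W) ≅ L(U; L(V; W)) ≅ M_2(U, V; W): reshaping "curries" the map.
B  = rng.standard_normal((dW, dU * dV))        # a linear map U ⊗ V → W
B3 = B.reshape(dW, dU, dV)                     # the same data as a bilinear map
u  = rng.standard_normal(dU)
print(np.allclose(B @ np.kron(u, v), np.einsum('wuv,u,v->w', B3, u, v)))  # True
```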
For instance, recall the matrix–matrix product in Example 3.9 and triple product
trace in Example 3.17, respectively,
and (R^{p×m})* ≅ R^{m×p}, the bilinear operator M_{m,n,p} and the trilinear functional
𝜏𝑚,𝑛, 𝑝 may be regarded as elements of the same tensor space. Checking their
values on the standard bases shows that they are in fact the same tensor. More
and Φ being linear, any sum, linear combination or, under the right conditions,
even integral of 𝑣 1 ⊗ 𝑣 2 ⊗ · · · ⊗ 𝑣 𝑑 is also a solution. The constants 𝜆 1 , . . . , 𝜆 𝑑−1
are called separation constants. The technique relies only on one easy fact about
tensor products: for any non-zero 𝑣 ∈ V and 𝑤 ∈ W,
𝑣 ⊗ 𝑤 = 𝑣′ ⊗ 𝑤′ ⇒ 𝑣 = 𝜆𝑣 ′, 𝑤 = 𝜆−1 𝑤 ′ (4.98)
we get
[Δ ⊗ I + I ⊗ (−∂_t²)](φ ⊗ ψ) = 0  ⟶  Δφ = −ω²φ,   −∂_t²ψ = ω²ψ,
with separation constant −𝜔2 . In the remainder of this example, we will focus on
the first equation, called the 𝑛-dimensional Helmholtz equation,
Δ 𝑓 + 𝜔2 𝑓 = 0, (4.103)
which may also be obtained by taking the Fourier transform of (4.102) in time. For
𝑛 = 2, (4.103) in Cartesian and polar coordinates are given by
∂²f/∂x² + ∂²f/∂y² + ω²f = 0,    ∂²f/∂r² + (1/r) ∂f/∂r + (1/r²) ∂²f/∂θ² + ω²f = 0,   (4.104)
respectively. Separation of variables works in both cases but gives entirely different
solutions. Applying (4.97) to 𝜕𝑥2 ⊗ 𝐼 + 𝐼 ⊗ (𝜕𝑦2 + 𝜔2 𝐼) gives us
d²φ/dx² + k²φ = 0,    d²ψ/dy² + (ω² − k²)ψ = 0,
with separation constant k² and therefore the solution
f_k(x, y) ≔ a₁ e^{i[kx+(ω²−k²)^{1/2}y]} + a₂ e^{i[−kx+(ω²−k²)^{1/2}y]} + a₃ e^{i[kx−(ω²−k²)^{1/2}y]} + a₄ e^{i[−kx−(ω²−k²)^{1/2}y]}.
Applying (4.97) to [(𝑟 2 𝜕𝑟2 + 𝜔2 𝑟 2 𝐼) ⊗ 𝐼 + 𝐼 ⊗ 𝜕𝜃2 ](𝜑 ⊗ 𝜓) = 0 gives us
r² d²φ/dr² + r dφ/dr + (ω²r² − k²)φ = 0,    d²ψ/dθ² + k²ψ = 0,
with separation constant 𝑘 2 and therefore the solution
f_k(r, θ) ≔ a₁ e^{ikθ} J_k(ωr) + a₂ e^{−ikθ} J_k(ωr) + a₃ e^{ikθ} J_{−k}(ωr) + a₄ e^{−ikθ} J_{−k}(ωr),
where 𝐽𝑘 is a Bessel function. Any solution of the two-dimensional Helmholtz
equation in Cartesian coordinates is a sum or integral of 𝑓 𝑘 (𝑥, 𝑦) over 𝑘 and
any solution in polar coordinates is one of 𝑓 𝑘 (𝑟, 𝜃) over 𝑘. Analytic solutions
in different coordinate systems provide different insights. There are exactly two
more such coordinate systems where separation of variables works; we call these
separable coordinates. For 𝑛 = 2, (4.103) has exactly four systems of separable
coordinates: Cartesian, polar, parabolic and elliptic. For 𝑛 = 3, there are exactly
eleven (Eisenhart 1934).
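The separated solutions are easy to verify symbolically; the following sketch (in Python with SymPy, which we assume is available) checks that one Cartesian mode e^{i[kx+(ω²−k²)^{1/2}y]} indeed satisfies (4.103) for n = 2. The polar modes can be checked similarly using the Bessel differential equation.

```python
# Symbolic check (assumes SymPy) that a separated Cartesian mode satisfies the
# two-dimensional Helmholtz equation (4.103).
import sympy as sp

x, y, w, k = sp.symbols('x y omega k', real=True)
f = sp.exp(sp.I * (k * x + sp.sqrt(w**2 - k**2) * y))
helmholtz = sp.diff(f, x, 2) + sp.diff(f, y, 2) + w**2 * f
print(sp.simplify(helmholtz))    # 0
```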
The fundamental result that allows us to deduce these numbers is the Stäckel
condition: the 𝑛-dimensional Helmholtz equation in coordinates 𝑥1 , . . . , 𝑥 𝑛 can be
solved using the separation-of-variables technique if and only if (a) the Euclidean
metric tensor g is a diagonal matrix in this coordinate system, and (b) if g =
The matrices on the left are Stäckel matrices for 𝑔 in the respective coordinate
system, that is, the entries in the first row of 𝑆−1 are exactly the reciprocal of the
entries on the diagonal of 𝑔.
What we have ascertained is that the three-dimensional Helmholtz equation can
be solved by separation of variables in these four coordinate systems, without
writing down a single differential equation. In fact, with more effort, one can show
that there are exactly eleven such separable coordinate systems:
(i) Cartesian,
(ii) cylindrical,
(iii) spherical,
(iv) parabolic,
(v) paraboloidal,
(vi) ellipsoidal,
(vii) conical,
(viii) prolate spheroidal,
(ix) oblate spheroidal,
(x) elliptic cylindrical,
(xi) parabolic cylindrical.
32 We did not need to worry about the Ricci tensor because R^n is a so-called Einstein manifold,
where g and R̄ differ by a scalar multiple. See Example 3.13 for a cursory discussion of g and R̄.
we get
[Φ_k ⊗ I + I ⊗ (−Ψ_n)](a ⊗ b) = 0  ⟶  Φ_k(a_k) = λa_k,   −Ψ_n(b_n) = −λb_n,
with separation constant 𝜆. We write these out in full:
𝑟𝑎 𝑘−1 + (1 − 2𝑟)𝑎 𝑘 + 𝑟𝑎 𝑘+1 = 𝜆𝑎 𝑘 , 𝑘 = 1, . . . , 𝑚 − 1,
𝑏 𝑛+1 = 𝜆𝑏 𝑛 , 𝑛 = 0, 1, 2, . . . .
The second equation is trivial to solve: 𝑏 𝑛 = 𝜆 𝑛 𝑏0 . Noting that the boundary
conditions 𝑢0,𝑛+1 = 0 = 𝑢 𝑚,𝑛+1 give 𝑎0 = 0 = 𝑎 𝑚 , we see that the first equation is a
tridiagonal eigenproblem:
\begin{pmatrix}
1-2r & r & & & \\
r & 1-2r & r & & \\
 & r & 1-2r & \ddots & \\
 & & \ddots & \ddots & r \\
 & & & r & 1-2r
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{m-1} \end{pmatrix}
= \lambda
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{m-1} \end{pmatrix}.
The eigenvalues and eigenvectors of a tridiagonal Toeplitz matrix have well-known
closed-form expressions:
λ_j = 1 − 4r sin²(jπ/2m),    a_{jk} = sin(jkπ/m),    j, k = 1, . . . , m − 1,
where a_{jk} is the kth coordinate of the jth eigenvector. Hence we get
u_{k,n} = Σ_{j=1}^{m−1} c_j b₀ (1 − 4r sin²(jπ/2m))^n sin(jkπ/m).
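These closed-form eigenpairs are easily checked numerically; the sketch below (Python with NumPy; the values of m and r are arbitrary choices) compares them with the eigendecomposition of the tridiagonal Toeplitz matrix computed directly.

```python
# Numerical check (assumes NumPy) of the closed-form eigenpairs of the
# (m-1) x (m-1) tridiagonal Toeplitz matrix with 1-2r on the diagonal and r
# off the diagonal, as in the separated solution of the discretized heat equation.
import numpy as np

m, r = 8, 0.4
T = np.diag((1 - 2*r) * np.ones(m - 1)) + np.diag(r * np.ones(m - 2), 1) \
    + np.diag(r * np.ones(m - 2), -1)

k = np.arange(1, m)
lam = 1 - 4*r*np.sin(k*np.pi/(2*m))**2                  # closed-form eigenvalues
print(np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(T))))   # True

a1 = np.sin(1*k*np.pi/m)                                # claimed eigenvector for j = 1
print(np.allclose(T @ a1, lam[0] * a1))                 # True
```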
consider the following integro-differential equation arising from the study of het-
erogeneous heat transfer:
∂f/∂t = a ∂²f/∂x² + b ∫₀ˣ f(y, t) dy − f,   (4.108)
with 𝑎, 𝑏 ≥ 0 and 𝑓 : [0, 1] × R → R. Note that at this juncture, if we simply
differentiate both sides to eliminate the integral, we will introduce mixed derivatives
and thus prevent ourselves from using separation of variables. Nevertheless, our
interpretation of separation of variables in Example 4.32 allows for integrals. If we
let
Φ_x(f) ≔ a ∂²f/∂x² − f + b ∫₀ˣ f(y, t) dy,    Ψ_t(f) ≔ −∂f/∂t,
then (4.97) gives us
[Φ_x ⊗ I + I ⊗ Ψ_t](φ ⊗ ψ) = 0  ⟶  Φ_x(φ) = λφ,   Ψ_t(ψ) = −λψ,
with separation constant 𝜆. Writing these out in full, we have
a d²φ/dx² + (λ − 1)φ + b ∫₀ˣ φ(y) dy = 0,    dψ/dt + λψ = 0.
The second equation is easy: 𝜓(𝑡) = 𝑐 e−𝜆𝑡 for an arbitrary constant 𝑐 that could
be determined with an initial condition. Kostoglou (2005) solved the first equation
in a convoluted manner involving Laplace transforms and partial fractions, but this
is unnecessary; at this point it is harmless to simply differentiate and eliminate
the integral. With this, we obtain a third-order homogeneous ODE with constant
coefficients, 𝑎𝜑 ′′′ + (𝜆 − 1)𝜑 ′ + 𝑏𝜑 = 0, whose solution is standard. Physical
considerations show that its characteristic polynomial
r³ + ((λ − 1)/a) r + b/a = 0
must have one real and two complex roots 𝑟 1 , 𝑟 2 ± 𝑖𝑟 3 , and thus the solution is given
by 𝑐1 e𝑟1 𝑥 + e𝑟2 𝑥 (𝑐2 cos 𝑟 3 𝑥 + 𝑐3 sin 𝑟 3 𝑥) with arbitrary constants 𝑐1 , 𝑐2 , 𝑐3 that could
be determined with appropriate boundary conditions.
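With made-up values of the constants a, b and λ (ours, purely for illustration), the nature of the roots of this cubic is easily checked numerically in Python with NumPy.

```python
# Sketch (assumes NumPy; a, b, lam are made-up constants): the characteristic
# polynomial r^3 + ((λ-1)/a) r + b/a has one real root and a complex-conjugate
# pair, which gives the stated form of the solution.
import numpy as np

a, b, lam = 1.0, 2.0, 0.5
roots = np.roots([1.0, 0.0, (lam - 1) / a, b / a])
print(roots)     # one real root and a pair r2 ± i r3
```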
The last four examples are about exploiting separability in the structure of the
solutions; the next few are about exploiting separability in the structure of the
problems.
Example 4.36 (separable ODEs). The easiest ODEs to solve are probably the
separable ones,
dy/dx = f(x)g(y),   (4.109)
with special cases d𝑦/d𝑥 = 𝑓 (𝑥) and d𝑦/d𝑥 = 𝑔(𝑦) when one of the functions is
simple solutions, but unfortunately such kernels are not too common in practice; in
fact they are regarded as a degenerate case in the study of integral equations. The
following discussion is adapted from Kanwal (1997, Chapter 2).
Example 4.37 (separable integral equations). Let us consider Fredholm integ-
ral equations of the first and second kind:
g(x) = ∫_a^b K(x, y) f(y) dy,    f(x) = g(x) + λ ∫_a^b K(x, y) f(y) dy,   (4.112)
with given constants 𝑎 < 𝑏 and functions 𝑔 ∈ 𝐶 [𝑎, 𝑏] and 𝐾 ∈ 𝐶([𝑎, 𝑏] × [𝑎, 𝑏]).
The goal is to solve for 𝑓 ∈ 𝐶 [𝑎, 𝑏] and, in the second case, also 𝜆 ∈ R. The kernel
𝐾 is said to be degenerate or separable if
K(x, y) = Σ_{i=1}^n φ_i(x) ψ_i(y)   (4.113)
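A separable kernel reduces the second equation in (4.112) to an n × n linear system by the standard degenerate-kernel argument: writing c_i ≔ ∫ψ_i(y)f(y) dy gives (I − λA)c = b with A_{ij} = ∫ψ_iφ_j and b_i = ∫ψ_i g, after which f = g + λΣ_i c_iφ_i. The following sketch (Python with NumPy and SciPy; the particular φ_i, ψ_i, g and λ are made up for illustration) carries this out and checks the residual.

```python
# Degenerate-kernel sketch (assumes NumPy/SciPy): solve a Fredholm equation of
# the second kind with K(x,y) = Σ_i φ_i(x)ψ_i(y) via a small linear system.
import numpy as np
from scipy.integrate import quad

lo, hi, lam = 0.0, 1.0, 0.5
phi = [lambda x: 1.0, lambda x: x]            # φ_1, φ_2  (made up)
psi = [lambda y: y, lambda y: y**2]           # ψ_1, ψ_2  (made up)
g = lambda x: np.exp(x)

n = len(phi)
A = np.array([[quad(lambda y: psi[i](y)*phi[j](y), lo, hi)[0] for j in range(n)]
              for i in range(n)])
bvec = np.array([quad(lambda y: psi[i](y)*g(y), lo, hi)[0] for i in range(n)])
c = np.linalg.solve(np.eye(n) - lam*A, bvec)
f = lambda x: g(x) + lam*sum(c[i]*phi[i](x) for i in range(n))

# residual f(x) - g(x) - λ ∫ K(x,y) f(y) dy should vanish
K = lambda x, y: sum(phi[i](x)*psi[i](y) for i in range(n))
x0 = 0.3
print(f(x0) - g(x0) - lam*quad(lambda y: K(x0, y)*f(y), lo, hi)[0])   # ~0
```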
finite bit length inputs with A, b, c having rational entries, by clearing denominat-
ors, there is no loss of generality in assuming that they have integer entries, that is,
𝐴 ∈ Z𝑚×𝑛 , 𝑏 ∈ Z𝑚 , 𝑐 ∈ Z𝑛 .
LP is famously solvable in polynomial time to arbitrary 𝜀-relative accuracy
(Khachiyan 1979), with its time complexity improved over the years, notably in
Karmarkar (1984) and Vaidya (1990), to its current bound in Cohen, Lee and Song
(2019) that is essentially in terms of 𝜔, the exponent of matrix multiplication we saw
in Example 3.9. The natural question is: Polynomial in what? The aforementioned
complexity bounds invariably involve the bit length of the input,
L ≔ Σ_{i=1}^m Σ_{j=1}^n log₂(|a_{ij}| + 1) + Σ_{i=1}^m log₂(|b_i| + 1) + log₂(mn) + 1,
but ideally we want an algorithm that runs in time polynomial in the size of the
structure, rather than the size of the actual numbers involved. This is called a
strongly polynomial-time algorithm. Whether it exists for LP is still famously
unresolved (Smale 1998, Problem 9), although a result of Tardos (1986) shows that
it is plausible: there is a polynomial-time algorithm for LP whose time complexity
is independent of the vectors 𝑏 and 𝑐 and depends only on 𝑚, 𝑛 and the largest
subdeterminant33 of 𝐴,
Δ ≔ max{|det 𝐴 𝜎×𝜏 | : 𝜎 ⊆ [𝑚], 𝜏 ⊆ [𝑛]}.
Note that 𝐿 depends on 𝐴, 𝑏, 𝑐 and Δ only on 𝐴.
We will now assume that we have a box constraint 0 ≤ 𝑥 ≤ 1 in our LP and
ILP. In this case ILP becomes a zero-one ILP with 𝑥 ∈ {0, 1}𝑛 ; such an ILP is
also known to have time complexity that does not depend on 𝑏 and 𝑐 although
it generally still depends on the entries of 𝐴 and not just Δ (Frank and Tardos
1987). We will let the time complexity of the box-constrained LP be LP(𝑚, 𝑛, Δ)
and let that of the zero-one ILP be ILP(𝑚, 𝑛, 𝐴). The former has time complexity
polynomial in 𝑚, 𝑛, Δ by Tardos (1986); the latter is polynomial-time if 𝐴 is totally
unimodular, i.e. when Δ = 1 (Schrijver 1986, Chapter 19).
Observe that the linear objective
𝑐T 𝑥 = 𝑐 1 𝑥 1 + · · · + 𝑐 𝑛 𝑥 𝑛
is additively separable. It turns out that this separability, not so much that it is
linear, is the key to polynomial-time solvability. Consider an objective function
𝑓 : R𝑛 → R of the form
𝑓 (𝑥) = 𝑓1 (𝑥1 ) + · · · + 𝑓𝑛 (𝑥 𝑛 ),
where 𝑓1 , . . . , 𝑓𝑛 : R → R are all convex. Note that if we set 𝑓𝑖 (𝑥𝑖 ) = 𝑐𝑖 𝑥𝑖 , which is
33 Recall from Example 4.5 that a matrix is a function 𝐴 : [𝑚] × [𝑛] → R and here 𝐴 𝜎, 𝜏 is the
function restricted to the subset 𝜎 × 𝜏 ⊆ [𝑚] × [𝑛]. Recall from page 14 that a non-square
determinant is identically zero.
linear and therefore convex, we recover the linear objective. A surprising result of
Hochbaum and Shanthikumar (1990) shows that the separable convex programming
problem
(SP)   minimize f₁(x₁) + ⋯ + f_n(x_n)   subject to Ax ≤ b, 0 ≤ x ≤ 1,
and its zero-one variant
(ISP)   minimize f₁(x₁) + ⋯ + f_n(x_n)   subject to Ax ≤ b, x ∈ {0, 1}^n,
are solvable to ε-accuracy in time complexities
SP(m, n, Δ) = log₂(β/(2ε)) LP(m, 8n²Δ, Δ),
ISP(m, n, A) = log₂(β/(2nΔ)) LP(m, 8n²Δ, Δ) + ILP(m, 4n²Δ, A ⊗ 1ᵀ_{4nΔ}),
respectively. Here 1𝑛 ∈ R𝑛 denotes the vector of all ones, ⊗ the Kronecker product,
and 𝛽 > 0 is some constant. A consequence is that SP is always polynomial-time
solvable and ISP is polynomial-time solvable for totally unimodular 𝐴. The latter is
a particularly delicate result, and even a slight deviation yields an NP-hard problem
(Baldick 1995). While the objective 𝑓 = 𝑓1 + · · · + 𝑓𝑛 is convex, we remind readers
of Example 3.16: it is a mistake to think that all convex optimization problems
have polynomial-time algorithms.
Note that we could have stated the results above in terms of multiplicatively
separable functions: if 𝑔1 , . . . , 𝑔𝑛 : R → R++ are log convex, and 𝑔 : R𝑛 → R++ is
defined by 𝑔(𝑥1 , . . . , 𝑥 𝑛 ) = 𝑔1 (𝑥1 ) · · · 𝑔𝑛 (𝑥 𝑛 ), then SP and ISP are equivalent to
minimize log g(x₁, . . . , x_n) subject to Ax ≤ b, 0 ≤ x ≤ 1,   and   minimize log g(x₁, . . . , x_n) subject to Ax ≤ b, x ∈ {0, 1}^n,
although it would be unnatural to discuss LP and ILP in these forms.
We will discuss another situation where additive separability arises.
Example 4.41 (separable Hamiltonians). The Hamilton equations
dp_i/dt = −∂H/∂x_i (x, p),    dx_i/dt = ∂H/∂p_i (x, p),    i = 1, . . . , n,   (4.116)
are said to be separable if the Hamiltonian 𝐻 : R𝑛 ×R𝑛 → R is additively separable:
𝐻(𝑥, 𝑝) = 𝑉(𝑥) + 𝑇 (𝑝), (4.117)
that is, the kinetic energy 𝑇 : R𝑛 → R depends only on momentum 𝑝 = (𝑝 1 , . . . , 𝑝 𝑛 )
and the potential energy 𝑉 : R𝑛 → R depends only on position 𝑥 = (𝑥1 , . . . , 𝑥 𝑛 ),
a common scenario. In this case the system (4.116) famously admits a finite
difference scheme that is both explicit, that is, iteration depends only on quantities
already computed in a previous step, and symplectic, that is, it conforms to the
tensor transformation rules with change-of-coordinates matrices from Sp(2𝑛, R)
on page 22. Taking 𝑛 = 1 for illustration, the equations in (4.116) are
dp/dt = −V′(x),    dx/dt = T′(p),
and with backward Euler on the first, forward Euler on the second, we obtain the
finite difference scheme
p^(k+1) = p^(k) − V′(x^(k+1))Δt,    x^(k+1) = x^(k) + T′(p^(k))Δt,
which is easily seen to be explicit as 𝑥 (𝑘+1) may be computed before 𝑝 (𝑘+1) ; to show
that it is a symplectic integrator requires a bit more work and the reader may consult
Stuart and Humphries (1996, pp. 580–582). Note that we could have written (4.117)
in an entirely equivalent multiplicatively separable form with e𝐻 (𝑥, 𝑝) = e𝑉 (𝑥) e𝑇 ( 𝑝) ,
although our subsequent discussion would be somewhat awkward.
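For concreteness, here is a minimal implementation of the scheme above (Python with NumPy; the harmonic oscillator V(x) = x²/2, T(p) = p²/2 and the step size are our choices for illustration). The energy error remains bounded over long times rather than drifting, the hallmark of a symplectic integrator.

```python
# Sketch (assumes NumPy): explicit symplectic scheme for a separable Hamiltonian
# H(x, p) = V(x) + T(p); here V(x) = x^2/2 and T(p) = p^2/2.  Update x first
# with forward Euler, then p with backward Euler.
import numpy as np

Vp = lambda x: x          # V'(x)
Tp = lambda p: p          # T'(p)
H  = lambda x, p: 0.5*x**2 + 0.5*p**2

x, p, dt = 1.0, 0.0, 0.01
energies = []
for _ in range(100_000):
    x = x + Tp(p) * dt            # x^(k+1) = x^(k) + T'(p^(k)) Δt
    p = p - Vp(x) * dt            # p^(k+1) = p^(k) - V'(x^(k+1)) Δt
    energies.append(H(x, p))

print(max(energies) - min(energies))   # bounded, O(Δt); no secular drift
```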
We now present an example with separability structures in both problem and
solution and where both additive and multiplicative separability play a role.
Example 4.42 (one-electron and Hartree–Fock approximations). The time-
dependent Schrödinger equation for a system of 𝑑 particles in R3 is
iℏ ∂f(x, t)/∂t = (−(ℏ²/2m) Δ + V(x)) f(x, t).
Here x = (x₁, . . . , x_d) ∈ R^{3d} represents the positions of the d particles, V a real-
valued function representing potential, and
Δ = Δ1 + Δ2 + · · · + Δ𝑑
with each Δ𝑖 : 𝐿 2 (R3 ) → 𝐿 2 (R3 ) a copy of the Laplacian on R3 corresponding to
the 𝑖th particle. Note that these do not need to be in Cartesian coordinates. For
instance, we might have
Δ_i = (1/r_i²) ∂/∂r_i ( r_i² ∂/∂r_i ) + (1/(r_i² sin θ_i)) ∂/∂θ_i ( sin θ_i ∂/∂θ_i ) + (1/(r_i² sin² θ_i)) ∂²/∂φ_i²
with 𝑥𝑖 = (𝑟 𝑖 , 𝜃 𝑖 , 𝜙𝑖 ), and 𝑖 = 1, . . . , 𝑑.
We will drop the constants, which are unimportant to our discussions, and just
keep their signs:
(−Δ + 𝑉) 𝑓 − 𝑖𝜕𝑡 𝑓 = 0. (4.118)
Separation of variables (4.97) applies to give us
[(−Δ + V) ⊗ I + I ⊗ (−i∂_t)](φ ⊗ ψ) = 0  ⟶  (−Δ + V)φ = Eφ,   −i∂_tψ = −Eψ,
where we have written our separation constant as −𝐸. The second equation is
trivial to solve, 𝜓(𝑡) = e−i𝐸𝑡 , and as (4.118) is linear, the solution 𝑓 is given by a
linear combination of 𝜓 ⊗ 𝜑 over all possible values of 𝐸 and our main task is to
determine 𝜑 and 𝐸 from the first equation, called the time-independent Schrödinger
equation for 𝑑 particles. Everything up to this stage is similar to our discussion for
the wave equation in Example 4.33. The difference is that we now have an extra 𝑉
term: (𝐸, 𝜑) are an eigenpair of −Δ + 𝑉.
The motivation behind the one-electron, Hartree–Fock and other approximations
is as follows. If the potential 𝑉 is additively separable,
𝐸 = 𝐸1 + 𝐸2 + · · · + 𝐸 𝑑 .
[(−Δ₁ + V₁) ⊗ I ⊗ ⋯ ⊗ I + I ⊗ (−Δ₂ + V₂) ⊗ ⋯ ⊗ I + ⋯ + I ⊗ ⋯ ⊗ I ⊗ (−Δ_d + V_d − E)](φ₁ ⊗ φ₂ ⊗ ⋯ ⊗ φ_d) = 0
⟶  (−Δ₁ + V₁)φ₁ = E₁φ₁,   (−Δ₂ + V₂)φ₂ = E₂φ₂,   . . . ,   (−Δ_d + V_d)φ_d = (E − E₁ − ⋯ − E_{d−1})φ_d.   (4.120)
Note that we may put the −𝐸 𝜑 term with any −Δ𝑖 + 𝑉𝑖 ; we just chose 𝑖 = 𝑑 for
convenience. If we write 𝐸 𝑑 ≔ 𝐸 − 𝐸 1 − · · · − 𝐸 𝑑−1 and 𝜑 = 𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 , we
obtain the required expressions. So (4.120) transforms a 𝑑-particle Schrödinger
equation into 𝑑 one-particle Schrödinger equations.
If one could solve the 𝑑-particle Schrödinger equation, then one could in principle
determine all chemical and physical properties of atoms and molecules, so the above
discussion seems way too simple, and indeed it is: the additivity in (4.119) can only
happen if the particles do not interact. In this context an operator of the form in
(4.96) is called a non-interacting Hamiltonian, conceptually useful but unrealistic.
It is, however, the starting point from which various approximation schemes are
developed. To see the necessity of approximation, consider what appears to be a
that is, no higher-order terms of the form 𝑉𝑖 𝑗𝑘 (𝑥𝑖 , 𝑥 𝑗 , 𝑥 𝑘 ), and we may even fix
𝑉𝑖 𝑗 (𝑥𝑖 , 𝑥 𝑗 ) = 1/k𝑥𝑖 − 𝑥 𝑗 k. But the equation (−Δ + 𝑉)𝜑 = 𝐸 𝜑 then becomes
computationally intractable in multiple ways (Whitfield, Love and Aspuru-Guzik
2013).
The one-electron approximation and Hartree–Fock approximation are based on
the belief that
V(x) ≈ V₁(x₁) + ⋯ + V_d(x_d)  ⇒  φ(x) ≈ φ₁(x₁) ⋯ φ_d(x_d)  and  E ≈ E₁ + ⋯ + E_d,
with ‘≈’ interpreted differently and with different tools: the former uses perturb-
ation theory and the latter calculus of variations. While these approximation
methods are only tangentially related to the topic of this section, they do tell us
how additive and multiplicative separability can be approximated and thus we will
briefly describe the key ideas. Our discussion below is based on Fischer (1977),
Faddeev and Yakubovskiı̆ (2009, Chapters 33, 34, 50, 51) and Hannabuss (1997,
Chapter 12).
In most scenarios, 𝑉(𝑥) will have an additively separable component 𝑉1 (𝑥1 ) +
· · · + 𝑉𝑑 (𝑥 𝑑 ) and an additively inseparable component comprising the higher-order
interactions 𝑉𝑖 𝑗 (𝑥𝑖 , 𝑥 𝑗 ), 𝑉𝑖 𝑗𝑘 (𝑥𝑖 , 𝑥 𝑗 , 𝑥 𝑘 ), . . . . We will write
H₀ ≔ Σ_{i=1}^d (−Δ_i + V_i),    H₁ ≔ −Δ + V = H₀ + W,
and 𝜓 (𝑘) may be recursively calculated from (4.122), the hope is that
𝜆(0) + 𝜆(1) + 𝜆(2) + · · · , 𝜓 (0) + 𝜓 (1) + 𝜓 (2) + · · ·
would converge to the desired eigenpair of 𝐻1 . For instance, if 𝐻0 has an orthonor-
mal basis of eigenfunctions {𝜓𝑖 : 𝑖 ∈ N} with eigenvalues {𝜆 𝑖 : 𝑖 ∈ N} and 𝜆 𝑗 is a
simple eigenvalue, then (Hannabuss 1997, Theorem 12.4.3)
λ_j^(1) = ⟨Wψ_j, ψ_j⟩,    ψ_j^(1) = Σ_{i≠j} (⟨Wψ_j, ψ_i⟩/(λ_j − λ_i)) ψ_i,    λ_j^(2) = Σ_{i≠j} |⟨Wψ_j, ψ_i⟩|²/(λ_j − λ_i).
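A quick numerical sanity check of the first-order formula (Python with NumPy; the finite symmetric matrices H₀ and W below merely stand in for the operators): for a simple eigenvalue, the jth eigenvalue of H₀ + εW behaves like λ_j + ε⟨Wψ_j, ψ_j⟩ + O(ε²).

```python
# Sketch (assumes NumPy): first-order eigenvalue perturbation for a
# finite-dimensional stand-in H0 + εW with a simple spectrum.
import numpy as np

rng = np.random.default_rng(1)
n = 6
H0 = np.diag(np.arange(1.0, n + 1))        # simple, well-separated eigenvalues
W = rng.standard_normal((n, n)); W = (W + W.T) / 2
j, eps = 2, 1e-4

lam0, psi = np.linalg.eigh(H0)
first_order = psi[:, j] @ W @ psi[:, j]    # <W ψ_j, ψ_j>

lam_eps = np.linalg.eigvalsh(H0 + eps * W)
print((lam_eps[j] - lam0[j]) / eps, first_order)   # agree to O(ε)
```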
These equations make physical sense. For a fixed 𝑖, (4.123) is the Schrödinger
equation for particle 𝑖 in a potential field due to the charge of particle 𝑗; this charge
is spread over space with density |𝜑 𝑗 | 2 , and we have to sum over the potential
fields created by all particles 𝑗 = 1, . . . , 𝑖 − 1, 𝑖 + 1, . . . , 𝑑. While the system of
coupled integro-differential equations (4.123) generally would not have an analytic
solution like the one in Example 4.35, it readily lends itself to numerical solution
via a combination of quadrature and finite difference (Fischer 1977, Chapter 6,
Section 7).
We have ignored the spin variables in the last and the next examples, as spin has
already been addressed in Examples 4.7 and would only be an unnecessary distrac-
tion here. So, strictly speaking, these examples are about Hartree approximation
(Hartree 1928), i.e. no spin, and not Hartree–Fock approximation (Fock 1930), i.e.
with spin.
with orthogonal factors (De Lathauwer, De Moor and Vandewalle 2000, equa-
tion 13)
⟨φ_i, φ_j⟩ = ⟨ψ_i, ψ_j⟩ = ⋯ = ⟨θ_i, θ_j⟩ = 0 if i ≠ j,  1 if i = j.
Note that when our spaces have inner products, any multilinear rank decomposition
(4.63) may have its factors orthogonalized (De Lathauwer et al. 2000, Theorem 2),
that is, the orthogonality constraints do not limit the range of possibilities in (4.125).
By definition of multilinear rank, there exist subspaces H1′ , . . . , H𝑑′ of dimensions
𝑟 1 , . . . , 𝑟 𝑑 with 𝑓 ∈ H1′ ⊗ · · · ⊗ H𝑑′ that attain the minimum in (4.62). As such,
we may replace H𝑖 with H𝑖′ at the outset, and to simplify our discussion we may as
well assume that H1 , . . . , H𝑑 are of dimensions 𝑟 1 , . . . , 𝑟 𝑑 .
The issue with (4.125) is that there is an exponential number of rank-one terms
as 𝑑 increases. Suppose 𝑟 1 = · · · = 𝑟 𝑑 = 𝑟; then there are 𝑟 𝑑 summands in (4.125).
This is not unexpected because (4.125) is the most general form a finite-rank tensor
can take. Here the ansatz in (4.125) describes the whole space H1 ⊗ · · · ⊗ H𝑑 and
does not quite serve its purpose; an ansatz is supposed to be an educated guess,
typically based on physical insights, that captures a small region of the space where
the solution likely lies. The goal of tensor networks is to provide such an ansatz by
limiting the coefficients [𝑐𝑖 𝑗 ···𝑘 ] ∈ R𝑟1 ×···×𝑟𝑑 to a much smaller set. The first and
best-known example is the matrix product states tensor network (Anderson 1959,
White 1992, White and Huse 1993), which imposes on the coefficients the structure
𝑐𝑖 𝑗 ···𝑘 = tr(𝐴𝑖 𝐵 𝑗 · · · 𝐶 𝑘 ), 𝐴𝑖 ∈ R𝑛1 ×𝑛2 , 𝐵 𝑗 ∈ R𝑛2 ×𝑛3 , . . . , 𝐶 𝑘 ∈ R𝑛𝑑 ×𝑛1
for 𝑖 = 1, . . . , 𝑟 1 , 𝑗 = 1, . . . , 𝑟 2 , . . . , 𝑘 = 1, . . . , 𝑟 𝑑 . An ansatz of the form
f = Σ_{i=1}^{r₁} Σ_{j=1}^{r₂} ⋯ Σ_{k=1}^{r_d} tr(A_i B_j ⋯ C_k) φ_i ⊗ ψ_j ⊗ ⋯ ⊗ θ_k   (4.126)
is called a matrix product state or MPS (Affleck, Kennedy, Lieb and Tasaki 1987).
Note that the coefficients are now parametrized by 𝑟 1 + 𝑟 2 + · · · + 𝑟 𝑑 matrices of
various sizes. For easy comparison, if 𝑟 1 = · · · = 𝑟 𝑑 = 𝑟 and 𝑛1 = · · · = 𝑛𝑑 = 𝑛,
then the coefficients in (4.125) have 𝑟 𝑑 degrees of freedom whereas those in (4.126)
only have 𝑟𝑑𝑛2 . When 𝑛1 = 1, the first and last matrices in (4.126) are a row and a
column vector respectively; as the trace of a 1 × 1 matrix is itself, we may drop the
‘tr’ in (4.126). This special case with 𝑛1 = 1 is sometimes called MPS with open
boundary conditions (Anderson 1959) and the more general case is called MPS
with periodic conditions.
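The parameter count is easy to see in a small example (a sketch in Python with NumPy; the values of r₁, r₂, r₃ and n₁, n₂, n₃ are arbitrary): for d = 3, the r₁r₂r₃ coefficients c_{ijk} = tr(A_iB_jC_k) are generated from only r₁ + r₂ + r₃ small matrices.

```python
# MPS sketch (assumes NumPy): build the coefficient hypermatrix
# c[i, j, k] = tr(A_i B_j C_k) of (4.126) and compare parameter counts.
import numpy as np

rng = np.random.default_rng(0)
r1, r2, r3 = 6, 6, 6
n1, n2, n3 = 2, 2, 2
A = rng.standard_normal((r1, n1, n2))
B = rng.standard_normal((r2, n2, n3))
C = rng.standard_normal((r3, n3, n1))

c = np.einsum('iab,jbc,kca->ijk', A, B, C)     # contraction pattern is a triangle

print(np.isclose(c[1, 2, 3], np.trace(A[1] @ B[2] @ C[3])))   # True
print(c.size, A.size + B.size + C.size)        # 216 coefficients from 72 parameters
```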
The above discussion of matrix product states conceals an important structure.
Take d = 3 and denote the entries of the matrices as A_i = [a^(i)_{αβ}], B_j = [b^(j)_{βγ}] and
C_k = [c^(k)_{γα}]. Then
f = Σ_{i,j,k=1}^{r₁,r₂,r₃} tr(A_i B_j C_k) φ_i ⊗ ψ_j ⊗ θ_k
  = Σ_{i,j,k=1}^{r₁,r₂,r₃} Σ_{α,β,γ=1}^{n₁,n₂,n₃} a^(i)_{αβ} b^(j)_{βγ} c^(k)_{γα} φ_i ⊗ ψ_j ⊗ θ_k
  = Σ_{α,β,γ=1}^{n₁,n₂,n₃} ( Σ_{i=1}^{r₁} a^(i)_{αβ} φ_i ) ⊗ ( Σ_{j=1}^{r₂} b^(j)_{βγ} ψ_j ) ⊗ ( Σ_{k=1}^{r₃} c^(k)_{γα} θ_k )
  = Σ_{α,β,γ=1}^{n₁,n₂,n₃} φ_{αβ} ⊗ ψ_{βγ} ⊗ θ_{γα},
where
φ_{αβ} ≔ Σ_{i=1}^{r₁} a^(i)_{αβ} φ_i,    ψ_{βγ} ≔ Σ_{j=1}^{r₂} b^(j)_{βγ} ψ_j,    θ_{γα} ≔ Σ_{k=1}^{r₃} c^(k)_{γα} θ_k.
In other words, the indices have the incidence structure of an undirected graph,
in this case a triangle. This was first observed in Landsberg, Qi and Ye (2012)
and later generalized in Ye and Lim (2018b), the bottom line being that any tensor
network state is a sum of separable functions indexed by a graph. In the following,
we show some of the most common tensor network states, written in this simplified
form, together with the graphs they correspond to in Figure 4.4.
Periodic matrix product states:
f(x, y, z) = Σ_{i,j,k=1}^{n₁,n₂,n₃} φ_{ij}(x) ψ_{jk}(y) θ_{ki}(z).
The second and the last, often abbreviated to TTNS and PEPS, were proposed by
Shi, Duan and Vidal (2006) and Verstraete and Cirac (2004) respectively. More
generally, any periodic MPS corresponds to a cycle graph and any open MPS
corresponds to a line graph. We have deliberately written them without the ⊗
symbol to emphasize that all these ansätze are just sums of separable functions,
differing only in terms of how their factors are indexed.
[Figure 4.4 here: graphs for the tensor networks mps (open), ttns, peps and mps (periodic).]
Figure 4.4. Graphs associated with common tensor networks.
34 Strictly speaking, the names are used for the case when 𝑠 = dim V − 2, and for more general 𝑠 they
are called the Riesz kernel and Riesz potentials. Because of the singularity in 𝜅, the integral is
interpreted in a principal value sense as in (3.32) but we ignore such details (Stein 1993, Chapter 1,
Section 8.18). We have also dropped some multiplicative constants.
[D⁴κ(v)](h₁, h₂, h₃, h₄)
 = (s(s+2)/‖v‖^{s+4}) [⟨h₁,h₂⟩⟨h₃,h₄⟩ + ⟨h₁,h₃⟩⟨h₂,h₄⟩ + ⟨h₁,h₄⟩⟨h₂,h₃⟩]
 − (s(s+2)(s+4)/‖v‖^{s+6}) [⟨v,h₁⟩⟨v,h₂⟩⟨h₃,h₄⟩ + ⟨v,h₁⟩⟨v,h₃⟩⟨h₂,h₄⟩ + ⟨v,h₁⟩⟨v,h₄⟩⟨h₂,h₃⟩ + ⟨v,h₂⟩⟨v,h₃⟩⟨h₁,h₄⟩ + ⟨v,h₂⟩⟨v,h₄⟩⟨h₁,h₃⟩ + ⟨v,h₃⟩⟨v,h₄⟩⟨h₁,h₂⟩]
 + (s(s+2)(s+4)(s+6)/‖v‖^{s+8}) ⟨v,h₁⟩⟨v,h₂⟩⟨v,h₃⟩⟨v,h₄⟩.
To obtain these, all we need is the expression for [𝐷𝜅(𝑣)](ℎ), which follows from
binomial expanding to linear terms in ℎ, that is,
κ(v + h) = (1/‖v‖^s) (1 + 2⟨v, h⟩/‖v‖² + ‖h‖²/‖v‖²)^{−s/2} = κ(v) − (s/‖v‖^{s+2}) ⟨v, h⟩ + O(‖h‖²),
along with the observation that [D⟨v, h′⟩](h) = ⟨h, h′⟩ and the product rule
D(f · g)(v) = f(v) · Dg(v) + g(v) · Df(v). Applying these repeatedly gives us D^k κ(v) in
any coordinate system without having to calculate a single partial derivative.
As we saw in Example 4.29, these derivatives may be linearized via the universal
factorization property. With an inner product, this is particularly simple. We will
take 𝑠 = 1, as this gives us the Coulomb potential, the most common case. We may
rewrite the above expressions as
[Dκ(v)](h) = ⟨ −v/‖v‖³, h ⟩,
[D²κ(v)](h₁, h₂) = ⟨ −I/‖v‖³ + 3 v⊗v/‖v‖⁵, h₁ ⊗ h₂ ⟩,
[D³κ(v)](h₁, h₂, h₃) = ⟨ (3/‖v‖⁵)( v ⊗ I + \overset{I}{\underset{v}{\otimes}} + I ⊗ v ) − 15 v⊗v⊗v/‖v‖⁷, h₁ ⊗ h₂ ⊗ h₃ ⟩.
The symbol where I and v appear above and below ⊗ is intended to mean the
following. Take any orthonormal basis e₁, . . . , e_n of V; then I = Σ_{i=1}^n e_i ⊗ e_i, and
we have
v ⊗ I = Σ_{i=1}^n v ⊗ e_i ⊗ e_i,    \overset{I}{\underset{v}{\otimes}} = Σ_{i=1}^n e_i ⊗ v ⊗ e_i,    I ⊗ v = Σ_{i=1}^n e_i ⊗ e_i ⊗ v.
The tensors appearing in the first argument of the inner products are precisely the
linearized derivatives and they carry important physical meanings:
monopole:   κ(v) = 1/‖v‖,
dipole:   ∂κ(v) = −v̂/‖v‖²,
quadrupole:   ∂²κ(v) = (1/‖v‖³)(3 v̂ ⊗ v̂ − I),
octupole:   ∂³κ(v) = (−1/‖v‖⁴)( 15 v̂ ⊗ v̂ ⊗ v̂ − 3( v̂ ⊗ I + \overset{I}{\underset{\hat v}{\otimes}} + I ⊗ v̂ ) ),
where 𝑣ˆ ≔ 𝑣/k𝑣k. The norm-scaled 𝑘th linearized derivative
M^k κ(v) ≔ ‖v‖^{k+1} ∂^k κ(v) ∈ V^{⊗k}
is called a multipole tensor or 𝑘-pole tensor. The Taylor expansion (4.92) may then
be written in terms of the multipole tensors,
Õ𝑑
1 𝑘 (𝑣 − 𝑣 0 ) ⊗𝑘
𝜅(𝑣) = 𝜅(𝑣 0 ) 𝑀 𝜅(𝑣 0 ), + 𝑅(𝑣 − 𝑣 0 ),
𝑘=0
𝑘! k𝑣 0 k 𝑘
and this is called a multipole expansion of 𝜅(𝑣) = 1/k𝑣k. We may rewrite this in
terms of a tensor Taylor series and a tensor geometric series as in (4.93),
∂κ(v₀) = Σ_{k=0}^∞ (1/k!) M^k κ(v₀) ∈ T̂(V),    S((v − v₀)/‖v₀‖) = Σ_{k=0}^∞ (v − v₀)^{⊗k}/‖v₀‖^k ∈ T̂(V).
Note that
∂κ(v) = 1 − v̂ + (1/2!)(3 v̂ ⊗ v̂ − I) − (1/3!)( 15 v̂ ⊗ v̂ ⊗ v̂ − 3( v̂ ⊗ I + \overset{I}{\underset{\hat v}{\otimes}} + I ⊗ v̂ ) ) + ⋯
is an expansion in the tensor algebra T̂(V).
Let 𝑓 : V → R be a compactly supported function that usually represents a
charge distribution confined to a region within V. The tensor-valued integral
(M^k κ) ∗ f(v) = ∫_V M^k κ(v − w) f(w) dw
is called a multipole moment or 𝑘-pole moment, and the Taylor expansion
κ ∗ f(v) = κ(v₀) Σ_{k=0}^d (1/k!) ⟨ (M^k κ) ∗ f(v₀), (v − v₀)^{⊗k}/‖v₀‖^k ⟩ + R(v − v₀),
is called a multipole expansion of 𝑓 . One of the most common scenarios is when
we have a finite collection of 𝑛 point charges at positions 𝑣 1 , . . . , 𝑣 𝑛 ∈ V with
charges 𝑞 1 , . . . , 𝑞 𝑛 ∈ R. Then 𝑓 is given by
𝑓 (𝑣) = 𝑞 1 𝛿(𝑣 − 𝑣 1 ) + · · · + 𝑞 𝑛 𝛿(𝑣 − 𝑣 𝑛 ). (4.127)
Our goal is to estimate the potential at a point 𝑣 some distance away from 𝑣 1 , . . . , 𝑣 𝑛 .
Now consider a distinct point 𝑣 0 that is much nearer to 𝑣 1 , . . . , 𝑣 𝑛 than to 𝑣 in the
sense that k𝑣 𝑖 −𝑣 0 k ≤ 𝑐k𝑣−𝑣 0 k, 𝑖 = 1, . . . , 𝑛, for some 𝑐 < 1. Suppose there is just a
single point charge with 𝑓 (𝑣) = 𝑞 𝑖 𝛿(𝑣 −𝑣 𝑖 ); then (𝑀 𝑘 𝜅)∗ 𝑓 (𝑣 −𝑣 0 ) = 𝑞 𝑖 𝑀 𝑘 𝜅(𝑣 𝑖 −𝑣 0 )
and the multipole expansion of 𝑓 at the point 𝑣 − 𝑣 𝑖 = (𝑣 − 𝑣 0 ) + (𝑣 0 − 𝑣 𝑖 ) about the
Tensors in computations 181
point 𝑣 − 𝑣 0 is
κ ∗ f(v − v_i) = κ(v − v₀) Σ_{k=0}^d (q_i/k!) ⟨ M^k κ(v_i − v₀), (v₀ − v_i)^{⊗k}/‖v − v₀‖^k ⟩ + R(v₀ − v_i)
 = Σ_{k=0}^d φ_k(v_i − v₀) ψ_k(v − v₀) + O(c^{d+1}),
This sum can be computed in 𝑂(𝑛𝑑) complexity or 𝑂(𝑛 log(1/𝜀)) for an 𝜀-accurate
algorithm. While the high level idea in (4.128) is still one of approximating a
function by a sum of 𝑑 separable functions, the fast multipole method involves a
host of other clever ideas, not least among which are the techniques for performing
such sums in more general situations (Demaine et al. 2005), for subdividing the
region containing 𝑣 1 , . . . , 𝑣 𝑛 into cubic cells (when V = R3 ) and thereby organizing
these computations into a tree-like multilevel algorithm (Barnes and Hut 1986).
Clearly the approximation is good only when 𝑣 is far from 𝑣 1 , . . . , 𝑣 𝑛 relative to 𝑣 0
but the algorithm allows one to circumvent this requirement. We refer to Greengard
and Rokhlin (1987) for further information.
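The far-field behaviour is easy to observe numerically; the sketch below (Python with NumPy; the charges, positions and evaluation point are made up) approximates Σ_i q_i/‖v − v_i‖ by the expansion truncated after the quadrupole term, using the monopole, dipole and quadrupole tensors above, and the relative error is on the order of c³.

```python
# Far-field sketch (assumes NumPy): charges clustered near v0, potential
# evaluated at a distant point v, truncated expansion up to the quadrupole term.
import numpy as np

rng = np.random.default_rng(0)
n = 50
v0 = np.zeros(3)
vs = v0 + 0.1 * rng.standard_normal((n, 3))       # charge positions near v0
qs = rng.random(n)                                # positive charges (made up)
v = np.array([4.0, 1.0, -2.0])                    # distant evaluation point

exact = sum(q / np.linalg.norm(v - vi) for q, vi in zip(qs, vs))

u = v - v0; nu = np.linalg.norm(u)
kappa, grad = 1 / nu, -u / nu**3                  # monopole and dipole terms
hess = -np.eye(3) / nu**3 + 3 * np.outer(u, u) / nu**5   # quadrupole term

approx = 0.0
for q, vi in zip(qs, vs):
    w = v0 - vi                                   # small displacement
    approx += q * (kappa + grad @ w + 0.5 * w @ hess @ w)

print(abs(approx - exact) / abs(exact))           # small, on the order of c^3
```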
Multipole moments, multipole tensors and multipole expansions are usually
discussed in terms of coordinates (Jackson 1999, Chapter 4). The coordinate-free
approach, the multipole expansion as an element of the tensor algebra, etc., are
results of our working with definitions ➁ and ➂.
Although non-separable kernels make for more interesting examples, separable
kernels arise as some of the most common multidimensional integral transforms,
as we saw in Example 4.38. Also, they warrant a mention if only to illustrate why
separability in a kernel is computationally desirable.
Example 4.46 (discrete multidimensional transforms). Three of the best-known
discrete transforms are the discrete Fourier transform we encountered in Ex-
ample 3.14, the discrete Z-transform and the discrete cosine transform:
F(x₁, . . . , x_d) = Σ_{k₁=−∞}^∞ ⋯ Σ_{k_d=−∞}^∞ f(k₁, . . . , k_d) e^{−ik₁x₁−⋯−ik_d x_d},
F(z₁, . . . , z_d) = Σ_{k₁=−∞}^∞ ⋯ Σ_{k_d=−∞}^∞ f(k₁, . . . , k_d) z₁^{−k₁} ⋯ z_d^{−k_d},
F(j₁, . . . , j_d) = Σ_{k₁=0}^{n₁−1} ⋯ Σ_{k_d=0}^{n_d−1} f(k₁, . . . , k_d) cos(π(2k₁+1)j₁/2n₁) ⋯ cos(π(2k_d+1)j_d/2n_d),
where the 𝑥𝑖 are real, the 𝑧𝑖 are complex, and 𝑗 𝑖 and 𝑘 𝑖 are integer variables. We
refer the reader to Dudgeon and Mersereau (1983, Sections 2.2 and 4.2) for the first
two and Rao and Yip (1990, Chapter 5) for the last. While we have stated them for
general 𝑑, in practice 𝑑 = 2, 3 are the most useful. The separability of these kernels
is exploited in the row–column decomposition for their evaluation (Dudgeon and
Mersereau 1983, Section 2.3.2). Assuming, for simplicity, that we have 𝑑 = 2, a
kernel 𝐾( 𝑗, 𝑘) = 𝜑( 𝑗)𝜓(𝑘) and all integer variables, then
F(x, y) = Σ_{j=−∞}^∞ Σ_{k=−∞}^∞ φ(j)ψ(k) f(x − j, y − k) = Σ_{j=−∞}^∞ φ(j) [ Σ_{k=−∞}^∞ ψ(k) f(x − j, y − k) ].
We store the sum in the bracket, which we then re-use when evaluating 𝐹 at other
points (𝑥 ′, 𝑦) where only the first argument 𝑥 ′ = 𝑥 + 𝛿 is changed:
F(x + δ, y) = Σ_{j=−∞}^∞ φ(j) [ Σ_{k=−∞}^∞ ψ(k) f(x − j + δ, y − k) ]
 = Σ_{j=−∞}^∞ φ(j + δ) [ Σ_{k=−∞}^∞ ψ(k) f(x − j, y − k) ].
We have assumed that the indices run over all integers to avoid having to deal with
boundary complications. In reality, when we have a finite sum as in the discrete
cosine transform, evaluating F in the direct manner would have taken n₁²n₂² ⋯ n_d²
additions and multiplications, whereas the row–column decomposition would just
require
𝑛1 𝑛2 · · · 𝑛𝑑 (𝑛1 + 𝑛2 + · · · + 𝑛𝑑 )
additions and multiplications. In cases where there are fast algorithms available
for the one-dimensional transform, say, if we employ one-dimensional FFT in an
evaluation of the 𝑑-dimensional DFT via row–column decomposition, the number
of additions and multiplications could be further reduced to
(1/2) n₁n₂ ⋯ n_d log₂(n₁ + n₂ + ⋯ + n_d)   and   n₁n₂ ⋯ n_d log₂(n₁ + n₂ + ⋯ + n_d)
respectively. Supposing d = 2 and n₁ = n₂ = 2^{10}, the approximate number of multi-
plications required to evaluate a two-dimensional DFT using the direct method, the
row–column decomposition method and the row–column decomposition with FFT
Tensors in computations 183
method are 10^{12}, 2 × 10^9, and 10^7 respectively (Dudgeon and Mersereau 1983,
Section 2.3.2).
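The row–column decomposition can be carried out with one-dimensional FFTs along each axis; the following two-line check (Python with NumPy; the array size is arbitrary) confirms that this reproduces the two-dimensional DFT.

```python
# Row-column sketch (assumes NumPy): because the 2-D DFT kernel separates,
# the 2-D transform is a 1-D FFT along each axis in turn.
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((64, 32))

rowcol = np.fft.fft(np.fft.fft(f, axis=0), axis=1)   # 1-D FFTs, rows then columns
print(np.allclose(rowcol, np.fft.fft2(f)))           # True
```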
Example 4.47 (tensor product wavelets and splines). We let H be any separ-
able Banach space with norm k · k. Then ℬ = {𝜑𝑖 ∈ H : 𝑖 ∈ N} is said to be a
Schauder basis of H if, for every f ∈ H, there is a unique sequence (a_i)_{i=1}^∞ such that
f = Σ_{i=1}^∞ a_i φ_i,
where, as usual, this means the series on the right converges to 𝑓 in k · k. A Banach
space may not have a Schauder basis but a Hilbert space always does. If H is a
Hilbert space with inner product h · , · i and k · k its induced norm, then its Schauder
basis has specific names when it satisfies additional conditions; two of the best-
known ones are the orthonormal basis and the Riesz basis. Obviously the elements
of a Schauder basis must be linearly independent, but this can be unnecessarily
restrictive since overcompleteness can be a desirable feature (Mallat 2009), leading
us to the notion of frames. For easy reference, we define them as follows (Heil
2011):
orthonormal basis:   ‖f‖² = Σ_{i=1}^∞ |⟨f, φ_i⟩|²   for all f ∈ H,
Riesz basis:   α Σ_{i=1}^∞ |a_i|² ≤ ‖ Σ_{i=1}^∞ a_i φ_i ‖² ≤ β Σ_{i=1}^∞ |a_i|²   for all (a_i)_{i=1}^∞ ∈ l²(N),
frame:   α‖f‖² ≤ Σ_{i=1}^∞ |⟨f, φ_i⟩|² ≤ β‖f‖²   for all f ∈ H,
where the constants 0 < α ≤ β are called frame constants and if α = β, then the
frame is tight. Clearly every orthonormal basis is a Riesz basis and every Riesz
basis is a frame.
Let H1 , . . . , H𝑑 be separable Hilbert spaces and let ℬ1 , . . . , ℬ𝑑 be countable
dense spanning sets. Let H1 b ⊗···b⊗ H𝑑 be their Hilbert–Schmidt tensor product as
discussed in Examples 4.18 and 4.19 and let ℬ1 ⊗ · · · ⊗ ℬ𝑑 be as defined in (4.43).
Then ℬ₁, . . . , ℬ_d are orthonormal bases (respectively Riesz bases, frames) if and only
if ℬ₁ ⊗ ⋯ ⊗ ℬ_d is an orthonormal basis (respectively a Riesz basis, a frame).
The forward implication is straightforward, and if the frame constants of ℬ_i are
α_i and β_i, i = 1, . . . , d, then the frame constants of ℬ₁ ⊗ ⋯ ⊗ ℬ_d are Π_{i=1}^d α_i
and Π_{i=1}^d β_i (Feichtinger and Gröchenig 1994, Lemma 8.18). The converse is,
however, more surprising (Bourouihiya 2008).
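The forward implication for orthonormal bases already has a transparent finite-dimensional analogue (a sketch in Python with NumPy; the dimensions are arbitrary): Kronecker products of the columns of two orthogonal matrices form an orthonormal basis of the tensor product space.

```python
# Sketch (assumes NumPy): if the columns of Q1 and Q2 are orthonormal bases,
# the Kronecker products of the columns are an orthonormal basis of the
# tensor product space.
import numpy as np

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q = np.kron(Q1, Q2)                       # columns are q1_i ⊗ q2_j
print(np.allclose(Q.T @ Q, np.eye(12)))   # True
```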
If ψ ∈ L²(R) is a wavelet, that is, ℬ_ψ ≔ {ψ_{m,n} : (m, n) ∈ Z × Z} with
ψ_{m,n}(x) ≔ 2^{m/2} ψ(2^m x − n) is an orthonormal basis, Riesz basis or frame of L²(R), then
vector product representing an integral transform like the ones in Example 4.45,
the Beylkin et al. (1991) wavelet-variant of the fast multipole algorithm mentioned
in Example 2.14 runs in time 𝑂(𝑛 log 𝑛) using the basis in (4.129) and in time 𝑂(𝑛)
using that in (4.130), both impressive compared to the usual 𝑂(𝑛2 ), but the latter is
clearly superior when 𝑛 is very large.
The main advantage of (4.129) is its generality; with other types of bases or
frames we also have similar constructions. For instance, the simplest types of
multivariate B-splines (Höllig and Hörner 2013) are constructed out of tensor
products of univariate B-splines; expressed in a multilinear rank decomposition
(4.63), we have
B_{l,m,n}(x, y, z) = Σ_{i=1}^p Σ_{j=1}^q Σ_{k=1}^r a_{ijk} B_{i,l}(x) B_{j,m}(y) B_{k,n}(z),   (4.131)
where 𝐵𝑖,𝑙 , 𝐵 𝑗,𝑚 , 𝐵 𝑘,𝑛 are univariate B-splines of degrees 𝑙, 𝑚, 𝑛. Nevertheless, the
main drawback of straightforward tensor product constructions is that they attach
undue importance to the directions of the coordinate axes (Cohen and Daubechies
1993). There are often better alternatives such as box splines or beamlets, curvelets,
ridgelets, shearlets, wedgelets, etc., that exploit the geometry of R2 or R3 .
We next discuss a covariant counterpart to the contravariant example above: a
univariate quadrature on [−1, 1] is a covariant 1-tensor, and a multivariate quad-
rature on [−1, 1] 𝑑 is a covariant 𝑑-tensor.
Example 4.48 (quadrature). Let 𝑓 : [−1, 1] → R be a univariate polynomial
function of degree not more than 𝑛. Without knowing anything else about 𝑓 , we
know that there exist n + 1 distinct points x₀, x₁, . . . , x_n ∈ [−1, 1], called nodes, and
coefficients w₀, w₁, . . . , w_n ∈ R, called weights, so that
∫_{−1}^1 f(x) dx = w₀ f(x₀) + w₁ f(x₁) + ⋯ + w_n f(x_n).
In fact, since 𝑓 is arbitrary, the formula holds with the same nodes and weights
for all polynomials of degree 𝑛 or less. This is called a quadrature formula and
its existence is simply a consequence of the following observations. A definite
integral is a linear functional,
I : C([−1, 1]) → R,    I(f) = ∫_{−1}^1 f(x) dx,
as is point evaluation, introduced in Example 4.6,
𝜀 𝑥 : 𝐶([−1, 1]) → R, 𝜀 𝑥 ( 𝑓 ) = 𝑓 (𝑥).
Since V = { 𝑓 ∈ 𝐶([−1, 1]) : 𝑓 (𝑥) = 𝑐0 + 𝑐1 𝑥 + · · · + 𝑐 𝑛 𝑥 𝑛 } is a vector space of
dimension 𝑛 + 1, its dual space V∗ also has dimension 𝑛 + 1, and the 𝑛 + 1 linear
functionals 𝜀 𝑥0 , 𝜀 𝑥1 , . . . , 𝜀 𝑥𝑛 , obviously linearly independent, form a basis of V∗ .
𝐼 = 𝑤 0 𝜀 𝑥0 + 𝑤 1 𝜀 𝑥1 + · · · + 𝑤 𝑛 𝜀 𝑥 𝑛 . (4.132)
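In the simplest case the weights may be computed by imposing exactness on the monomial basis, i.e. by solving a Vandermonde system; a sketch follows (Python with NumPy; the equispaced nodes and the test polynomial are ours, chosen only for illustration).

```python
# Sketch (assumes NumPy): once the nodes are fixed, the weights are the
# coordinates of the integration functional I in the basis of point evaluations,
# obtained by requiring exactness on 1, x, ..., x^n.
import numpy as np

n = 5
x = np.linspace(-1, 1, n + 1)                       # any n+1 distinct nodes
V = np.vander(x, increasing=True).T                 # V[k, j] = x_j**k
moments = np.array([(1 + (-1)**k) / (k + 1) for k in range(n + 1)])  # ∫ x^k dx
w = np.linalg.solve(V, moments)

# the rule integrates any polynomial of degree ≤ n exactly, e.g.:
p = np.array([0.3, -1.0, 0.0, 2.0, 0.5, -0.2])      # coefficients, increasing degree
exact = sum(p[k] * moments[k] for k in range(n + 1))
print(np.isclose(w @ np.polyval(p[::-1], x), exact))   # True
```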
but this does not appear to work as we want multivariate quadratures to be defined
on 𝐶([−1, 1] 𝑑 ). Fortunately the universal factorization property (4.88) gives us a
which is called a sparse grid (Gerstner and Griebel 1998). Smolyak quadrat-
ure has led to further developments under the heading of sparse grids; we refer
interested readers to Garcke (2013) for more information. Nevertheless, many
multivariate quadratures tend to be ‘grid-free’ and involve elements of random-
ness or pseudorandomness, essentially (4.132) with the nodes 𝑥0 , . . . , 𝑥 𝑛 chosen
(pseudo)randomly in [−1, 1] 𝑑 to have low discrepancy, as defined in Niederreiter
(1992). As a result, modern study of multivariate quadratures tends to be quite
different from the classical approach described above.
We revisit a point that we made in Example 4.15: it is generally not a good idea
to identify tensors in V1 ⊗ · · · ⊗ V𝑑 with hypermatrices in R𝑛1 ×···×𝑛𝑑 , as the bases
often carry crucial information, that is, in a decomposition such as
𝑝 Õ
Õ 𝑞 Õ
𝑟
𝑇= ··· 𝑎𝑖 𝑗 ···𝑘 𝑒𝑖 ⊗ 𝑓 𝑗 ⊗ · · · ⊗ 𝑔 𝑘 ,
𝑖=1 𝑗=1 𝑘=1
the basis elements 𝑒𝑖 , 𝑓 𝑗 , . . . , 𝑔 𝑘 can be far more important than the coefficients
𝑎𝑖 𝑗 ···𝑘 . Quadrature provides a fitting example. The basis elements are the point
evaluation functionals at the nodes and these are key to the quadrature: once
they are chosen, the coefficients, i.e. the weights, can be determined almost as an
afterthought. Furthermore, while these basis elements are by definition linearly
independent, in the context of quadrature there are non-linear relations between
them that will be lost if one just looks at the coefficients.
All our examples in this article have focused on computations in the classical
sense; we will end with a quantum computing example, adapted from Nakahara
and Ohmi (2008, Chapter 7), Nielsen and Chuang (2000, Chapter 6) and Wallach
(2008, Section 2.3). While tensors appear at the beginning, ultimately Grover’s
quantum search algorithm (Grover 1996) reduces to the good old power method.
Example 4.49 (Grover’s quantum search). Let C2 be equipped with its usual
Hermitian inner product h𝑥, 𝑦i = 𝑥 ∗ 𝑦 and let 𝑒0 , 𝑒1 be any pair of orthonormal
that is, 𝑓 takes the value −1 for exactly one 𝑗 ∈ {0, 1, . . . , 𝑛 − 1} but we do not
know which 𝑗. Also, 𝑓 is only accessible as a black box; we may evaluate 𝑓 (𝑖)
for any given i and so to find j in this manner would require O(n) evaluations in
the worst case. Grover's algorithm finds j in O(√n) complexity with a quantum
computer.
We begin by observing that the 𝑑-tensor 𝑢 below may be expanded as
u ≔ ((1/√2)(e₀ + e₁))^{⊗d} = (1/√n) Σ_{i=0}^{n−1} u_i ∈ (C²)^{⊗d},   (4.137)
with
𝑢𝑖 ≔ 𝑒𝑖1 ⊗ 𝑒𝑖2 ⊗ · · · ⊗ 𝑒𝑖𝑑 ,
where 𝑖 1 , . . . , 𝑖 𝑑 ∈ {0, 1} are given by expressing the integer 𝑖 in binary:
[𝑖] 2 = 𝑖 𝑑 𝑖 𝑑−1 · · · 𝑖 2 𝑖 1 .
Furthermore, with respect to the inner product on (C2 ) ⊗𝑑 given by (4.64) and its
induced norm, 𝑢 is of unit norm and {𝑢0 , . . . , 𝑢 𝑛−1 } is an orthonormal basis. Recall
the notion of a Householder reflector in Example 2.12. We define two Householder
reflectors 𝐻 𝑓 and 𝐻𝑢 : (C2 ) ⊗𝑑 → (C2 ) ⊗𝑑 ,
𝐻 𝑓 (𝑣) = 𝑣 − 2h𝑢 𝑗 , 𝑣i𝑢 𝑗 , 𝐻𝑢 (𝑣) = 𝑣 − 2h𝑢, 𝑣i𝑢,
reflecting 𝑣 about the hyperplane orthogonal to 𝑢 𝑗 and 𝑢 respectively. Like the
function 𝑓 , the Householder reflectors 𝐻 𝑓 and 𝐻𝑢 are only accessible as black
boxes. Given any 𝑣 ∈ (C2 ) ⊗𝑑 , a quantum computer allows us to evaluate 𝐻 𝑓 (𝑣)
and 𝐻𝑢 (𝑣) in logarithmic complexity 𝑂(log 𝑛) = 𝑂(𝑑) (Nielsen and Chuang 2000,
p. 251).
Grover’s algorithm is essentially the power method with 𝐻𝑢 𝐻 𝑓 applied to 𝑢 as
the initial vector. Note that 𝐻𝑢 𝐻 𝑓 is unitary and thus norm-preserving and we may
skip the normalization step. If we write
(H_u H_f)^k u = a_k u_j + b_k Σ_{i≠j} u_i,
then we may show (Nakahara and Ohmi 2008, Propositions 7.2–7.4) that
a_k = ((2 − n)/n) a_{k−1} + (2(1 − n)/n) b_{k−1},    b_k = (2/n) a_{k−1} + ((2 − n)/n) b_{k−1},    a₀ = b₀ = 1/√n,
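Iterating this two-term recurrence is enough to see the quadratic speed-up; the sketch below (Python with NumPy; n = 1024 is an arbitrary choice) shows that the probability on the marked state peaks after roughly (π/4)√n iterations.

```python
# Sketch (assumes NumPy): Grover's iteration as a power-method-like recursion in
# the 2-D invariant subspace spanned by u_j and Σ_{i≠j} u_i.
import numpy as np

n = 1024
G = np.array([[(2 - n) / n, 2 * (1 - n) / n],
              [2 / n,       (2 - n) / n]])
v = np.array([1 / np.sqrt(n), 1 / np.sqrt(n)])      # (a_0, b_0)

probs = []
for _ in range(60):
    v = G @ v
    probs.append(v[0] ** 2)                          # success probability |a_k|^2

print(int(np.argmax(probs)) + 1, round(np.pi / 4 * np.sqrt(n)))   # both about 25
```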
5.1. Omissions
There are three major topics in our original plans for this article that we did not
manage to cover: (i) symmetric and alternating tensors, (ii) algorithms based on
tensor contractions, and (iii) provably correct algorithms for higher-order tensor
problems.
In (i) we would have included a discussion of the following: polynomials and dif-
ferential forms as coordinate representations of symmetric and alternating tensors;
polynomial kernels as Veronese embedding of symmetric tensors, Slater determ-
inants as Plücker embedding of alternating tensors; moments and sums-of-squares
theory for symmetric tensors; matrix–matrix multiplication as a special case of
the Cauchy–Binet formula for alternating tensors. We also planned to discuss the
symmetric and alternating analogues of the three definitions of a tensor, the three
constructions of tensor products and the tensor algebra. The last in particular gives
us the symmetric and alternating Fock spaces with their connections to bosons
and fermions. We would also have discussed various notions of symmetric and
alternating tensor ranks.
In (ii) we had intended to demonstrate how the Cooley–Tukey FFT, the multi-
dimensional DFT, the Walsh transform, wavelet packet transform, Yates’ method
in factorial designs, even FFT on finite non-Abelian groups, etc., are all tensor con-
tractions. We would also show how the Strassen tensor in Example 3.9, the tensors
corresponding to the multiplication operations in Lie, Jordan and Clifford algebras,
and any tensor network in Example 4.44 may each be realized as the self-contraction
of a rank-one tensor. There would also be an explanation of matchgate tensors and
holographic algorithms tailored for a numerical linear algebra readership.
In (iii) we had planned to provide a reasonably complete overview of the handful
of provably correct algorithms for various NP-hard problems involving higher-
order tensors from three different communities: polynomial optimization, symbolic
computing/computer algebra and theoretical computer science. We will have more
to say about these below.
problems do exist in the literature – notably the line of work in Brachat, Comon,
Mourrain and Tsigaridas (2010) and Nie (2017) that extended Sylvester (1886)
and Reznick (1992) to give a provably correct algorithm for symmetric tensor
decomposition – they tend to be the exception rather than the rule. It is more
common to find ‘algorithms’ for tensor problems that are demonstrably wrong.
The scarcity of provably correct algorithms goes hand in hand with the NP-
hardness of tensor problems. For illustration, we consider the best rank-one ap-
proximation problem for a hypermatrix 𝐴 ∈ R𝑛×𝑛×𝑛 :
min_{x,y,z ∈ R^n} ‖A − x ⊗ y ⊗ z‖_F.   (5.1)
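For concreteness, the following is a minimal sketch (Python with NumPy; the tensor and starting point are random) of the standard alternating least-squares heuristic for (5.1), a higher-order analogue of the power method. In line with the discussion here, it typically converges to a stationary point but carries no guarantee of global optimality; it is not a provably correct algorithm.

```python
# Sketch (assumes NumPy): alternating least squares for a best rank-one
# approximation of a 5 x 5 x 5 hypermatrix; a common heuristic, not a provably
# correct algorithm.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5, 5))
x, y, z = (rng.standard_normal(5) for _ in range(3))

for _ in range(200):
    x = np.einsum('ijk,j,k->i', A, y, z); x /= np.linalg.norm(x)
    y = np.einsum('ijk,i,k->j', A, x, z); y /= np.linalg.norm(y)
    z = np.einsum('ijk,i,j->k', A, x, y)          # keep the scale in z

print(np.linalg.norm(A - np.einsum('i,j,k->ijk', x, y, z)))   # residual at a stationary point
```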
A pertinent example for us is the Strassen tensor M3,3,3 in Example 3.9 with
𝑚 = 𝑛 = 𝑝 = 3. With respect to the standard basis on R3×3 , it is a simple 9 × 9 × 9
hypermatrix with mostly zeros and a few ones, but its tensor rank is still unknown
today.
rank of M2,2,2 ∈ C4×4×4 (Landsberg 2006) and finding the equations that define
border rank-4 tensors {𝐴 ∈ C4×4×4 : rank(𝐴) ≤ 4} (Friedland 2013) – were both
about 4 × 4 × 4 hypermatrices and both required Herculean efforts, notwithstanding
the fact that border rank is a simpler notion than rank.
Acknowledgement
Most of what I learned about tensors in the last ten years I learned from my two
former postdocs Yang Qi and Ke Ye; I gratefully acknowledge the education they
have given me. Aside from them, I would like to thank Keith Conrad, Zhen
Dai, Shmuel Friedland, Edinah Gnang, Shenglong Hu, Risi Kondor, and J. M.
Landsberg, Visu Makam, Peter McCullagh, Emily Riehl and Thomas Schultz for
answering my questions and/or helpful discussions. I would like to express heartfelt
gratitude to Arieh Iserles for his kind encouragement and Glennis Starling for her
excellent copy-editing. Figure 4.1 is adapted from Walmes M. Zeviani’s gallery
of TikZ art and Figure 4.3 is reproduced from Yifan Peng’s blog. This work is
partially supported by DARPA HR00112190040, NSF DMS-11854831, and the
Eckhardt Faculty Fund.
This article was written while taking refuge from Covid-19 at my parents’ home
in Singapore. After two decades in the US, it is good to be reminded of how
fortunate I am to have such wonderful parents. I dedicate this article to them: my
father Lim Pok Beng and my mother Ong Aik Kuan.
References
R. Abraham, J. E. Marsden and T. Ratiu (1988), Manifolds, Tensor Analysis, and Applica-
tions, Vol. 75 of Applied Mathematical Sciences, second edition, Springer.
D. Aerts and I. Daubechies (1978), Physical justification for using the tensor product to
describe two quantum systems as one joint system, Helv. Phys. Acta 51, 661–675.
D. Aerts and I. Daubechies (1979a), A characterization of subsystems in physics, Lett.
Math. Phys. 3, 11–17.
D. Aerts and I. Daubechies (1979b), A mathematical condition for a sublattice of a pro-
positional system to represent a physical subsystem, with a physical interpretation, Lett.
Math. Phys. 3, 19–27.
I. Affleck, T. Kennedy, E. H. Lieb and H. Tasaki (1987), Rigorous results on valence-bond
ground states in antiferromagnets, Phys. Rev. Lett. 59, 799–802.
S. S. Akbarov (2003), Pontryagin duality in the theory of topological vector spaces and in
topological algebra, J. Math. Sci. 113, 179–349.
J. Alman and V. V. Williams (2021), A refined laser method and faster matrix multiplication,
in Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA)
(D. Marx, ed.), Society for Industrial and Applied Mathematics (SIAM), pp. 522–539.
P. W. Anderson (1959), New approach to the theory of superexchange interactions, Phys.
Rev. (2) 115, 2–13.
A. Arias and J. D. Farmer (1996), On the structure of tensor products of 𝑙 𝑝 -spaces, Pacific
J. Math. 175, 13–37.
C. T. J. Dodson and T. Poston (1991), Tensor Geometry: The Geometric Viewpoint and its
Uses, Vol. 130 of Graduate Texts in Mathematics, second edition, Springer.
J. Dongarra and F. Sullivan (2000), Guest editors’ introduction to the top 10 algorithms,
Comput. Sci. Eng. 2, 22–23.
D. E. Dudgeon and R. M. Mersereau (1983), Multidimensional Digital Signal Processing,
Prentice Hall.
D. S. Dummit and R. M. Foote (2004), Abstract Algebra, third edition, Wiley.
N. Dunford and J. T. Schwartz (1988), Linear Operators, Part I, Wiley Classics Library,
Wiley.
G. Dunn (1988), Tensor product of operads and iterated loop spaces, J. Pure Appl. Algebra
50, 237–258.
H. Dym (2013), Linear Algebra in Action, Vol. 78 of Graduate Studies in Mathematics,
second edition, American Mathematical Society.
J. Earman and C. Glymour (1978), Lost in the tensors: Einstein’s struggles with covariance
principles 1912–1916, Stud. Hist. Philos. Sci. A 9, 251–278.
A. Einstein (2002), Fundamental ideas and methods of the theory of relativity, presented
in its development, in The Collected Papers of Albert Einstein (M. Janssen et al., eds),
Vol. 7: The Berlin Years, 1918–1921, Princeton University Press, pp. 113–150.
L. P. Eisenhart (1934), Separable systems of Stackel, Ann. of Math. (2) 35, 284–305.
P. Etingof, S. Gelaki, D. Nikshych and V. Ostrik (2015), Tensor Categories, Vol. 205 of
Mathematical Surveys and Monographs, American Mathematical Society.
L. D. Faddeev and O. A. Yakubovskiı̆ (2009), Lectures on Quantum Mechanics for Math-
ematics Students, Vol. 47 of Student Mathematical Library, American Mathematical
Society.
C. L. Fefferman (2006), Existence and smoothness of the Navier–Stokes equation, in The
Millennium Prize Problems (J. Carlson, A. Jaffe and A. Wiles, eds), Clay Mathematics
Institute, pp. 57–67.
H. G. Feichtinger and K. Gröchenig (1994), Theory and practice of irregular sampling, in
Wavelets: Mathematics and Applications, CRC, pp. 305–363.
R. P. Feynman, R. B. Leighton and M. Sands (1963), The Feynman Lectures on Physics,
Vol. 1: Mainly Mechanics, Radiation, and Heat, Addison-Wesley.
C. F. Fischer (1977), The Hartree–Fock Method for Atoms, Wiley.
V. Fock (1930), Näherungsmethode zur Lösung des quantenmechanischen Mehrkörper-
problems, Z. Physik 61, 126–148.
L. Fortnow (2013), The Golden Ticket: P, NP, and the Search for the Impossible, Princeton
University Press.
A. Frank and E. Tardos (1987), An application of simultaneous Diophantine approximation
in combinatorial optimization, Combinatorica 7, 49–65.
M. Frazier and B. Jawerth (1990), A discrete transform and decompositions of distribution
spaces, J. Funct. Anal. 93, 34–170.
S. H. Friedberg, A. J. Insel and L. E. Spence (2003), Linear Algebra, fourth edition,
Prentice Hall.
S. Friedland (2013), On tensors of border rank 𝑙 in C𝑚×𝑛×𝑙 , Linear Algebra Appl. 438,
713–737.
S. Friedland and E. Gross (2012), A proof of the set-theoretic version of the salmon
conjecture, J. Algebra 356, 374–379.
S. Friedland and L.-H. Lim (2018), Nuclear norm of higher-order tensors, Math. Comp.
87, 1255–1281.
S. Friedland, L.-H. Lim and J. Zhang (2019), Grothendieck constant is norm of Strassen
matrix multiplication tensor, Numer. Math. 143, 905–922.
J. Friedman (1991), The spectra of infinite hypertrees, SIAM J. Comput. 20, 951–961.
J. Friedman and A. Wigderson (1995), On the second eigenvalue of hypergraphs, Combin-
atorica 15, 43–65.
F. Fuchs, D. E. Worrall, V. Fischer and M. Welling (2020), SE(3)-transformers: 3D roto-
translation equivariant attention networks, in Advances in Neural Information Processing
Systems 33 (NeurIPS 2020) (H. Larochelle et al., eds), Curran Associates, pp. 1970–
1981.
W. Fulton and J. Harris (1991), Representation Theory: A First Course, Vol. 129 of
Graduate Texts in Mathematics, Springer.
M. Fürer (2009), Faster integer multiplication, SIAM J. Comput. 39, 979–1005.
A. A. García, F. W. Hehl, C. Heinicke and A. Macías (2004), The Cotton tensor in
Riemannian spacetimes, Classical Quantum Gravity 21, 1099–1118.
J. Garcke (2013), Sparse grids in a nutshell, in Sparse Grids and Applications, Vol. 88 of
Lecture Notes in Computational Science and Engineering, Springer, pp. 57–80.
M. R. Garey and D. S. Johnson (1979), Computers and Intractability: A Guide to the
Theory of NP-Completeness, W. H. Freeman.
S. Garg, C. Gentry and S. Halevi (2013), Candidate multilinear maps from ideal lattices, in
Advances in Cryptology (EUROCRYPT 2013), Vol. 7881 of Lecture Notes in Computer
Science, Springer, pp. 1–17.
P. Garrett (2010), Non-existence of tensor products of Hilbert spaces. Available at http://www-users.math.umn.edu/~garrett/m/v/nonexistence_tensors.pdf.
I. M. Gel′fand, M. M. Kapranov and A. V. Zelevinsky (1992), Hyperdeterminants, Adv.
Math. 96, 226–263.
I. M. Gel′fand, M. M. Kapranov and A. V. Zelevinsky (1994), Discriminants, Resultants,
and Multidimensional Determinants, Mathematics: Theory & Applications, Birkhäuser.
C. Gentry, S. Gorbunov and S. Halevi (2015), Graph-induced multilinear maps from
lattices, in Theory of Cryptography (TCC 2015), part II, Vol. 9015 of Lecture Notes in
Computer Science, Springer, pp. 498–527.
R. Geroch (1985), Mathematical Physics, Chicago Lectures in Physics, University of
Chicago Press.
T. Gerstner and M. Griebel (1998), Numerical integration using sparse grids, Numer.
Algorithms 18, 209–232.
G. H. Golub and C. F. Van Loan (2013), Matrix Computations, Johns Hopkins Studies in
the Mathematical Sciences, fourth edition, Johns Hopkins University Press.
G. H. Golub and J. H. Welsch (1969), Calculation of Gauss quadrature rules, Math. Comp.
23, 221–230, A1–A10.
G. H. Golub and J. H. Wilkinson (1976), Ill-conditioned eigensystems and the computation
of the Jordan canonical form, SIAM Rev. 18, 578–619.
R. Goodman and N. R. Wallach (2009), Symmetry, Representations, and Invariants, Vol.
255 of Graduate Texts in Mathematics, Springer.
L. Grafakos (2014), Classical Fourier Analysis, Vol. 249 of Graduate Texts in Mathematics,
third edition, Springer.
L. Grafakos and R. H. Torres (2002a), Discrete decompositions for bilinear operators and
almost diagonal conditions, Trans. Amer. Math. Soc. 354, 1153–1176.
L. Grafakos and R. H. Torres (2002b), Multilinear Calderón–Zygmund theory, Adv. Math.
165, 124–164.
L. Greengard and V. Rokhlin (1987), A fast algorithm for particle simulations, J. Comput.
Phys. 73, 325–348.
W. Greub (1978), Multilinear Algebra, Universitext, second edition, Springer.
D. Griffiths (2008), Introduction to Elementary Particles, second edition, Wiley-VCH.
A. Grothendieck (1953), Résumé de la théorie métrique des produits tensoriels topo-
logiques, Bol. Soc. Mat. São Paulo 8, 1–79.
A. Grothendieck (1955), Produits Tensoriels Topologiques et Espaces Nucléaires, Vol. 16
of Memoirs of the American Mathematical Society, American Mathematical Society.
L. K. Grover (1996), A fast quantum mechanical algorithm for database search, in Proceed-
ings of the 28th Annual ACM Symposium on the Theory of Computing (STOC 1996),
ACM, pp. 212–219.
K. Hannabuss (1997), An Introduction to Quantum Theory, Vol. 1 of Oxford Graduate
Texts in Mathematics, The Clarendon Press, Oxford University Press.
J. Harris (1995), Algebraic Geometry: A First Course, Vol. 133 of Graduate Texts in
Mathematics, Springer.
E. Hartmann (1984), An Introduction to Crystal Physics, Vol. 18 of Commission on Crystal-
lographic Teaching: Second series pamphlets, International Union of Crystallography,
University College Cardiff Press.
D. R. Hartree (1928), The wave mechanics of an atom with a non-Coulomb central field,
I: Theory and methods, Proc. Cambridge Philos. Soc. 24, 89–132.
R. Hartshorne (1977), Algebraic Geometry, Vol. 52 of Graduate Texts in Mathematics,
Springer.
D. Harvey and J. van der Hoeven (2021), Integer multiplication in time 𝑂(𝑛 log 𝑛), Ann. of
Math. (2) 193, 563–617.
S. Hassani (1999), Mathematical Physics: A Modern Introduction to its Foundations,
Springer.
T. J. Hastie and R. J. Tibshirani (1990), Generalized Additive Models, Vol. 43 of Mono-
graphs on Statistics and Applied Probability, Chapman & Hall.
A. Hatcher (2002), Algebraic Topology, Cambridge University Press.
R. A. Hauser and Y. Lim (2002), Self-scaled barriers for irreducible symmetric cones,
SIAM J. Optim. 12, 715–723.
G. E. Hay (1954), Vector and Tensor Analysis, Dover Publications.
C. Heil (2011), A Basis Theory Primer, Applied and Numerical Harmonic Analysis,
expanded edition, Birkhäuser / Springer.
F. Heiss and V. Winschel (2008), Likelihood approximation by numerical integration on
sparse grids, J. Econometrics 144, 62–80.
S. Helgason (1978), Differential Geometry, Lie Groups, and Symmetric Spaces, Vol. 80 of
Pure and Applied Mathematics, Academic Press.
J. W. Helton and M. Putinar (2007), Positive polynomials in scalar and matrix variables,
the spectral theorem, and optimization, in Operator Theory, Structured Matrices, and
Dilations, Vol. 7 of Theta Ser. Adv. Math., Theta, Bucharest, pp. 229–306.
N. J. Higham (1992), Stability of a method for multiplying complex matrices with three
real matrix multiplications, SIAM J. Matrix Anal. Appl. 13, 681–687.
N. P. Sokolov (1960), Spatial Matrices and their Applications, Gosudarstv. Izdat. Fiz.-Mat.
Lit., Moscow. In Russian.
N. P. Sokolov (1972), Introduction to the Theory of Multidimensional Matrices, Izdat.
‘Naukova Dumka’, Kiev. In Russian.
B. Spain (1960), Tensor Calculus: A Concise Course, third edition, Oliver and Boyd /
Interscience.
E. M. Stein (1993), Harmonic Analysis: Real-Variable Methods, Orthogonality, and Oscillatory Integrals, Vol. 43 of Princeton Mathematical Series, Princeton University Press.
I. Steinwart and A. Christmann (2008), Support Vector Machines, Information Science and
Statistics, Springer.
G. W. Stewart (2000), The decompositional approach to matrix computation, Comput. Sci.
Eng. 2, 50–59.
G. Strang (1980), Linear Algebra and its Applications, second edition, Academic Press.
V. Strassen (1969), Gaussian elimination is not optimal, Numer. Math. 13, 354–356.
V. Strassen (1973), Vermeidung von Divisionen, J. Reine Angew. Math. 264, 184–202.
V. Strassen (1987), Relative bilinear complexity and matrix multiplication, J. Reine Angew.
Math. 375/376, 406–443.
V. Strassen (1990), Algebraic complexity theory, in Handbook of Theoretical Computer
Science, Vol. A: Algorithms and Complexity, Elsevier, pp. 633–672.
A. M. Stuart and A. R. Humphries (1996), Dynamical Systems and Numerical Analysis, Vol. 2 of Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press.
J. J. Sylvester (1886), Sur une extension d’un théorème de Clebsch relatif aux courbes du
quatrième degré, C. R. Math. Acad. Sci. Paris 102, 1532–1534.
J. L. Synge and A. Schild (1978), Tensor Calculus, Dover Publications.
C.-T. Tai (1997), Generalized Vector and Dyadic Analysis, second edition, IEEE Press.
M. Takesaki (2002), Theory of Operator Algebras I, Vol. 124 of Encyclopaedia of Mathematical Sciences, Springer.
L. A. Takhtajan (2008), Quantum Mechanics for Mathematicians, Vol. 95 of Graduate
Studies in Mathematics, American Mathematical Society.
E. Tardos (1986), A strongly polynomial algorithm to solve combinatorial linear programs,
Oper. Res. 34, 250–256.
G. Temple (1960), Cartesian Tensors: An Introduction, Methuen’s Monographs on Physical
Subjects, Methuen / Wiley.
G. Teschl (2014), Mathematical Methods in Quantum Mechanics: With Applications to
Schrödinger Operators, Vol. 157 of Graduate Studies in Mathematics, second edition,
American Mathematical Society.
J. W. Thomas (1995), Numerical Partial Differential Equations: Finite Difference Methods,
Vol. 22 of Texts in Applied Mathematics, Springer.
K. S. Thorne and R. D. Blandford (2017), Modern Classical Physics: Optics, Fluids,
Plasmas, Elasticity, Relativity, and Statistical Physics, Princeton University Press.
A. L. Toom (1963), The complexity of a scheme of functional elements simulating the
multiplication of integers, Dokl. Akad. Nauk SSSR 150, 496–498.
L. N. Trefethen and D. Bau, III (1997), Numerical Linear Algebra, Society for Industrial
and Applied Mathematics (SIAM).
F. Trèves (2006), Topological Vector Spaces, Distributions and Kernels, Dover Publications.
P. M. Vaidya (1990), An algorithm for linear programming which requires O(((m + n)n^2 + (m + n)^1.5 n)L) arithmetic operations, Math. Program. 47, 175–201.
L. G. Valiant (1979), The complexity of computing the permanent, Theoret. Comput. Sci.
8, 189–201.
H. A. van der Vorst (2000), Krylov subspace iteration, Comput. Sci. Eng. 2, 32–37.
N. T. Varopoulos (1965), Sur les ensembles parfaits et les séries trigonométriques, C. R.
Acad. Sci. Paris 260, 4668–4670, 5165–5168, 5997–6000.
N. T. Varopoulos (1967), Tensor algebras and harmonic analysis, Acta Math. 119, 51–112.
S. A. Vavasis (1991), Nonlinear Optimization: Complexity Issues, Vol. 8 of International
Series of Monographs on Computer Science, The Clarendon Press, Oxford University
Press.
F. Verstraete and J. I. Cirac (2004), Renormalization algorithms for quantum-many body
systems in two and higher dimensions. Available at arXiv:cond-mat/0407066.
E. B. Vinberg (2003), A Course in Algebra, Vol. 56 of Graduate Studies in Mathematics,
American Mathematical Society.
W. Voigt (1898), Die fundamentalen physikalischen Eigenschaften der Krystalle in elementarer Darstellung, Von Veit.
R. M. Wald (1984), General Relativity, University of Chicago Press.
N. R. Wallach (2008), Quantum computing and entanglement for mathematicians, in Representation Theory and Complex Analysis, Vol. 1931 of Lecture Notes in Mathematics, Springer, pp. 345–376.
W. Walter (1998), Ordinary Differential Equations, Vol. 182 of Graduate Texts in Mathematics, Springer.
A.-M. Wazwaz (2011), Linear and Nonlinear Integral Equations: Methods and Applications, Higher Education Press (Beijing) / Springer.
G. Weinreich (1998), Geometrical Vectors, Chicago Lectures in Physics, University of
Chicago Press.
H. Weyl (1997), The Classical Groups: Their Invariants and Representations, Princeton
Landmarks in Mathematics, Princeton University Press.
S. R. White (1992), Density matrix formulation for quantum renormalization groups, Phys.
Rev. Lett. 69, 2863–2866.
S. R. White and D. A. Huse (1993), Numerical renormalization-group study of low-lying
eigenstates of the antiferromagnetic 𝑆 = 1 Heisenberg chain, Phys. Rev. B 48, 3844–
3853.
J. D. Whitfield, P. J. Love and A. Aspuru-Guzik (2013), Computational complexity in
electronic structure, Phys. Chem. Chem. Phys. 15, 397–411.
N. M. J. Woodhouse (2003), Special Relativity, Springer Undergraduate Mathematics
Series, Springer.
R. C. Wrede (1963), Introduction to Vector and Tensor Analysis, Wiley.
S.-T. Yau (2020), Shiing-Shen Chern: A great geometer of 20th century. Available at https://cmsa.fas.harvard.edu/wp-content/uploads/2020/05/2020-04-22-essay-on-Chern-english-v1.pdf.
K. Ye and L.-H. Lim (2018a), Fast structured matrix computations: Tensor rank and
Cohn–Umans method, Found. Comput. Math. 18, 45–95.
K. Ye and L.-H. Lim (2018b), Tensor network ranks. Available at arXiv:1801.02662.