
Acta Numerica (2021), pp. 1–208 © Cambridge University Press, 2021
doi:10.1017/S0962492921000076
arXiv:2106.08090v1 [math.NA] 15 Jun 2021

Tensors in computations

Lek-Heng Lim
Computational and Applied Mathematics Initiative,
University of Chicago, Chicago, IL 60637, USA
E-mail: [email protected]

The notion of a tensor captures three great ideas: equivariance, multilinearity, separ-
ability. But trying to be three things at once makes the notion difficult to understand.
We will explain tensors in an accessible and elementary way through the lens of linear
algebra and numerical linear algebra, elucidated with examples from computational
and applied mathematics.

CONTENTS
1 Introduction 1
2 Tensors via transformation rules 11
3 Tensors via multilinearity 41
4 Tensors via tensor products 85
5 Odds and ends 191
References 195

1. Introduction
We have two goals in this article: the first is to explain in detail and in the simplest
possible terms what a tensor is; the second is to discuss the main ways in which
tensors play a role in computations. The two goals are interwoven: what defines
a tensor is also what makes it useful in computations, so it is important to gain
a genuine understanding of tensors. We will take the reader through the three
common definitions of a tensor: as an object that satisfies certain transformation
rules, as a multilinear map, and as an element of a tensor product of vector spaces.
We will explain the motivations behind these definitions, how one definition leads
to the next, and how they all fit together. All three definitions are useful in compu-
tational mathematics but in different ways; we will intersperse our discussions of
each definition with considerations of how it is employed in computations, using
the latter as impetus for the former.
Tensors arise in computations in one of three ways: equivariance under coordinate
changes, multilinear maps and separation of variables, each corresponding to
one of the aforementioned definitions.
Separability. The last of the three is the most prevalent, arising in various forms
in problems, solutions and algorithms. One may exploit separable structures in
ordinary differential equations, in integral equations, in Hamiltonians, in objective
functions, etc. The separation-of-variables technique may be applied to solve
partial differential and integral equations, or even finite difference and integro-
differential equations. Separability plays an indispensable role in Greengard and
Rokhlin’s fast multipole method, Grover’s quantum search algorithm, Hartree–
Fock approximations of various stripes, Hochbaum and Shanthikumar’s non-linear
separable convex optimization, Smolyak’s quadrature, and beyond. It also underlies
tensor product construction of various objects – bases, frames, function spaces,
kernels, multiresolution analyses, operators and quadratures among them.
Multilinearity. Many common operations, ranging from multiplying two complex
numbers or matrices to the convolution or bilinear Hilbert transform of functions,
are bilinear operators. Trilinear functionals arise in self-concordance and numerical
stability, and in fast integer and fast matrix multiplications. The last leads us to the
matrix multiplication exponent, which characterizes the computational complexity
of not just matrix multiplication but also inversion, determinant, null basis, linear
systems, LU/QR/eigenvalue/Hessenberg decompositions, and more. Higher-order
multilinear maps arise in the form of multidimensional Fourier/Laplace/Z/discrete
cosine transforms and as cryptographic multilinear maps. Even if a multivariate
function is not multilinear, its derivatives always are, and thus Taylor and multipole
series are expansions in multilinear maps.
Equivariance. Equivariance is an idea as old as tensors and has been implicitly used
throughout algorithms in numerical linear algebra. Every time we transform the
scalars, vectors or matrices in a problem from one form to another via a sequence of
Givens or Jacobi rotations, Householder reflectors or Gauss transforms, or through
a Krylov or Fourier or wavelet basis, we are employing equivariance in the form of
various 0-, 1- or 2-tensor transformation rules. Beyond numerical linear algebra,
the equivariance of tools from interior point methods to deep neural networks
is an important reason for their success. The equivariance of Newton’s method
allows it to solve arbitrarily ill-conditioned optimization problems in theory and
highly ill-conditioned ones in practice. AlphaFold 2 conquered the protein structure
prediction problem with an equivariant neural network.
Incidentally, out of the so-called ‘top ten algorithms of the twentieth century’
(Dongarra and Sullivan 2000), fast multipole (Board and Schulten 2000), fast
Fourier transform (Rockmore 2000), Krylov subspaces (van der Vorst 2000), Fran-
cis’s QR (Parlett 2000) and matrix decompositions (Stewart 2000) all draw from
these three aforementioned tensorial ideas in one way or another. Nevertheless,
by ‘computations’ we do not mean just algorithms, although these will constitute
a large part of our article; the title of our article also includes the use of tensors
as a tool in the analysis of algorithms (e.g. self-concordance in polynomial-time
convergence), providing intractability guarantees (e.g. cryptographic multilinear
maps), reducing complexity of models (e.g. equivariant neural networks), quanti-
fying computational complexity (e.g. exponent of matrix multiplication) and in yet
other ways. It would be prudent to add that while tensors are an essential ingredient
in the aforementioned computational tools, they are rarely the only ingredient: it is
usually in combination with other concepts and techniques – calculus of variations,
Gauss quadrature, multiresolution analysis, the power method, reproducing kernel
Hilbert spaces, singular value decomposition, etc. – that they become potent tools.
The article is written with accessibility and simplicity in mind. Our exposition
assumes only standard undergraduate linear algebra: vector spaces, linear maps,
dual spaces, change of basis, etc. Knowledge of numerical linear algebra is a plus
since this article mainly targets computational mathematicians. As physicists have
played an outsize role in the development of tensors, it is inevitable that motivations
for certain aspects, explanations for certain definitions, etc., are best understood
from a physicist’s perspective and to this end we will include some discussions
to provide context. When discussing applications or the physics origin of certain
ideas, it is inevitable that we have to assume slightly more background, but we
strive to be self-contained and limit ourselves to the most pertinent basic ideas. In
particular, it is not our objective to provide a comprehensive survey of all things
tensorial. We focus on the foundational and fundamental; results from current
research make an appearance only if they illuminate these basic aspects. We have
avoided a formal textbook-style treatment and opted for a more casual exposition
that hopefully makes for easier reading (and in part to keep open the possibility of
a future book project).

1.1. Overview
There are essentially three ways to define a tensor, reflecting the chronological
evolution of the notion through the last 140 years or so:
➀ as a multi-indexed object that satisfies certain transformation rules,
➁ as a multilinear map,
➂ as an element of a tensor product of vector spaces.
The key to the coordinate-dependent definition in ➀ is the emphasized part: the
transformation rules are what define a tensor when one chooses to view it as a
multi-indexed object, akin to the group laws in the definition of a group. The
more modern coordinate-free definitions ➁ and ➂ automatically encode these
transformation rules. The multi-indexed object, which could be a hypermatrix, a
polynomial, a differential form, etc., is then a coordinate representation of either ➁
or ➂ with respect to a choice of bases, and the corresponding transformation rule
is a change-of-basis theorem.
Take the following case familiar to anyone who has studied linear algebra. Let V
and W be finite-dimensional vector spaces over R. A linear map 𝑓 : V → W is an
order-2 tensor of covariant order 1 and contravariant order 1; this is the description
according to definition ➁. We may construct a new vector space V∗ ⊗ W, where
V∗ denotes the dual space of V, and show that 𝑓 corresponds to a unique element
𝑇 ∈ V∗ ⊗ W; this is the description according to definition ➂. With a choice of bases
ℬV and ℬW , 𝑓 or 𝑇 may be represented as a matrix 𝐴 ∈ R𝑚×𝑛 , where 𝑚 = dim W
and 𝑛 = dim V. Note, however, that the matrix 𝐴 is not unique and depends
on our choice of bases, and different bases ℬV′ and ℬW′ would give a different
matrix representation 𝐴′ ∈ R𝑚×𝑛 . The change-of-basis theorem states that the two
matrices are related via the transformation rule1 𝐴′ = 𝑋 −1 𝐴𝑌 , where 𝑋 and 𝑌 are
the change-of-basis matrices on W and V respectively. Definition ➀ is essentially
an attempt to define a linear map using its change-of-basis theorem – possible but
awkward. The reason for such an awkward definition is one of historical necessity:
definition ➀ had come before any of the modern notions we now take for granted in
linear algebra – vector space, dual space, linear map, bases, etc. – were invented.

1 Equivalently, 𝐴 = 𝑋 𝐴′𝑌 −1 . In numerical linear algebra, there is a tendency to view this as matrix
decomposition as opposed to change of basis.
We stress that if one chooses to work with definition ➀, then it is the transform-
ation rule/change-of-basis theorem and not the multi-indexed object that defines a
tensor. For example, depending on whether the transformation rule takes 𝐴 ∈ R𝑛×𝑛
to 𝑋 𝐴𝑌 −1 , 𝑋 𝐴𝑌 T , 𝑋 −1 𝐴𝑌 −T , 𝑋 𝐴𝑋 −1 , 𝑋 𝐴𝑋 T or 𝑋 −1 𝐴𝑋 −T , we obtain different
tensors with entirely different properties. Also, while we did not elaborate on the
change-of-basis matrices 𝑋 and 𝑌 , they play an important role in the transformation
rule. If V and W are vector spaces without any additional structures, then 𝑋 and
𝑌 are just required to be invertible; but if V and W are, say, norm or inner product
or symplectic vector spaces, then 𝑋 and 𝑌 would need to preserve these structures
too. More importantly, every notion we define, every property we study for any
tensor – rank, determinant, norm, product, eigenvalue, eigenvector, singular value,
singular vector, positive definiteness, linear system, least-squares problems, eigen-
value problems, etc. – must conform to the respective transformation rule. This is
a point that is often lost; it is not uncommon to find mentions of ‘tensor such and
such’ in recent literature that make no sense for a tensor.
As we will see, the two preceding paragraphs extend in a straightforward way to
order-𝑑 tensors (henceforth 𝑑-tensors) of contravariant order 𝑝 and covariant order
𝑑 − 𝑝 for any integers 𝑑 ≥ 𝑝 ≥ 0, of which a linear map corresponds to the case
𝑝 = 1, 𝑑 = 2.
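To make the change-of-basis theorem concrete, here is a minimal numpy sketch (our illustration, not from the article) with 𝑚 = dim W and 𝑛 = dim V: it checks that 𝐴′ = 𝑋 −1 𝐴𝑌 represents the same linear map in the new bases, and that a quantity defined on the tensor, such as rank, does not depend on the choice of bases.

# A minimal numpy sketch of the change-of-basis rule A' = X^{-1} A Y for the
# matrix of a linear map f: V -> W; Y and X are the change-of-basis matrices
# on V and W respectively (random matrices are invertible with probability one).
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                        # m = dim W, n = dim V
A = rng.standard_normal((m, n))    # matrix of f in the old bases
Y = rng.standard_normal((n, n))    # columns: new basis of V in old V-coordinates
X = rng.standard_normal((m, m))    # columns: new basis of W in old W-coordinates

A_new = np.linalg.solve(X, A @ Y)  # A' = X^{-1} A Y

# Applying f in the new coordinates agrees with applying it in the old ones.
v_new = rng.standard_normal(n)     # coordinates of some v in the new V-basis
v_old = Y @ v_new                  # the same vector in the old V-basis
assert np.allclose(X @ (A_new @ v_new), A @ v_old)

# Quantities defined on the tensor, e.g. rank, are unchanged.
assert np.linalg.matrix_rank(A_new) == np.linalg.matrix_rank(A)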
While discussing tensors, we will also discuss their role in computations.
The most salient applications are often variations of the familiar separation-of-
variables technique that one encounters when solving ordinary and partial differ-
ential equations, integral equations or even integro-differential equations. Here
the relevant perspective of a tensor is that in definition ➂; we will see that
𝐿 2 (𝑋1 ) ⊗ 𝐿 2 (𝑋2 ) ⊗ · · · ⊗ 𝐿 2 (𝑋𝑑 ) = 𝐿 2 (𝑋1 × 𝑋2 × · · · × 𝑋𝑑 ), that is, multivariate
functions are tensors in an appropriate sense. Take 𝑑 = 3 for simplicity. Separation
of variables, in its most basic manifestation, exploits functions 𝑓 : 𝑋 × 𝑌 × 𝑍 → R
that take a multiplicatively separable form
𝑓 (𝑥, 𝑦, 𝑧) = 𝜑(𝑥)𝜓(𝑦)𝜃(𝑧), (1.1)
or, equivalently,
𝑓 = 𝜑 ⊗ 𝜓 ⊗ 𝜃. (1.2)
For real-valued functions, this defines the tensor product ⊗, that is, (1.2) simply
means (1.1). A tensor taking the form in (1.2) is called decomposable or pure
or rank-one (the last only if it is non-zero). This deceptively simple version is
already an essential ingredient in some of the most important algorithms, such
as Greengard and Rokhlin’s fast multipole method, Grover’s quantum search al-
gorithm, Hochbaum and Shanthikumar’s non-linear separable convex optimization,
and more.
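Even this simplest separable form is easy to see in code. The following is a small numpy sketch of ours (not from the article): sampling a separable trivariate function on a grid gives a rank-one hypermatrix, the outer product of three vectors, and storing the three factors is far cheaper than storing all grid values.

# Samples of f(x,y,z) = phi(x) psi(y) theta(z) on a grid form a rank-one
# hypermatrix, i.e. the outer product of three vectors, as in (1.1)-(1.2).
import numpy as np

x = np.linspace(0.0, 1.0, 30)
y = np.linspace(0.0, 2.0, 40)
z = np.linspace(-1.0, 1.0, 50)

phi, psi, theta = np.exp(-x), np.sin(y), z**2      # any univariate factors
F = np.einsum('i,j,k->ijk', phi, psi, theta)       # samples of f on the grid

# Storing the three factors costs 30 + 40 + 50 numbers instead of 30*40*50,
# the basic saving behind separation-of-variables techniques.
i, j, k = 3, 7, 11
assert np.isclose(F[i, j, k], phi[i] * psi[j] * theta[k])
assert F.shape == (30, 40, 50)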
A slightly more involved version assumes that 𝑓 is a sum of separable functions,
𝑓 (𝑥, 𝑦, 𝑧) = ∑_{𝑖=1}^{𝑟} 𝜑𝑖 (𝑥)𝜓𝑖 (𝑦)𝜃 𝑖 (𝑧),
or, equivalently,
𝑓 = ∑_{𝑖=1}^{𝑟} 𝜑𝑖 ⊗ 𝜓𝑖 ⊗ 𝜃 𝑖 , (1.3)
and this underlies various fast algorithms for evaluating bilinear operators, which
is a 3-tensor in the sense of definition ➁. If one allows for a slight extension of
➁ to modules (essentially a vector space where the field of scalars is a ring like
Z), then various fast integer multiplication algorithms such as those of Karatsuba,
Toom and Cook, Schönhage and Strassen, and Fürer may also be viewed in the light
of ➁. But even with the standard definition of allowing only scalars from R or C,
algorithms that exploit (1.3) include Gauss’s algorithm for complex multiplication,
Strassen’s algorithm for fast matrix multiplication/inversion and various algorithms
for structured matrix–vector products (e.g. Toeplitz, Hankel, circulant matrices).
These algorithms show that a combination of both perspectives in definitions ➁
and ➂ can be fruitful.
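As a concrete instance of a decomposition like (1.3) at work, here is a short Python sketch of Gauss's three-multiplication algorithm for complex multiplication; the grouping into three products is the classical one, while the test values below are our own.

# Gauss's algorithm: (a+bi)(c+di) with 3 real multiplications instead of 4,
# the simplest example of exploiting a short decomposition of a bilinear operator.
def gauss_complex_mult(a: float, b: float, c: float, d: float) -> tuple[float, float]:
    k1 = c * (a + b)          # the three multiplications
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2   # (real part, imaginary part)

# Check against the usual four-multiplication formula.
a, b, c, d = 1.5, -2.0, 0.5, 3.0
assert gauss_complex_mult(a, b, c, d) == (a * c - b * d, a * d + b * c)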
An even more involved version of the separation-of-variables technique allows
more complicated structures on the indices such as
𝑓 (𝑥, 𝑦, 𝑧) = ∑_{𝑖,𝑗,𝑘=1}^{𝑝,𝑞,𝑟} 𝜑𝑖 𝑗 (𝑥)𝜓 𝑗𝑘 (𝑦)𝜃 𝑘𝑖 (𝑧),

or, equivalently,
𝑓 = ∑_{𝑖,𝑗,𝑘=1}^{𝑝,𝑞,𝑟} 𝜑𝑖 𝑗 ⊗ 𝜓 𝑗𝑘 ⊗ 𝜃 𝑘𝑖 . (1.4)
This is the celebrated matrix product state in the tensor network literature. Tech-
niques such as DMRG simplify computations of eigenfunctions of Schrödinger
operators by imposing such structures on the desired eigenfunction. Note that
(1.4), like (1.3), is also a decomposition into a sum of separable functions but the
indices are captured by the following graph:
[Figure: a triangle whose vertices correspond to the variables 𝑥, 𝑦, 𝑧 and whose edges carry the summation indices of (1.4): 𝑗 between 𝑥 and 𝑦, 𝑘 between 𝑦 and 𝑧, and 𝑖 between 𝑧 and 𝑥.]

More generally, for any undirected graph 𝐺 on 𝑑 vertices, we may decompose a
𝑑-variate function into a sum of separable functions in a similar manner whereby the
indices are summed over the edges of 𝐺 and the different variables correspond to the
different vertices. Such representations of a tensor are often called tensor network
states and the graph 𝐺 a tensor network. They often make an effective ansatz for
analytical and numerical solutions of PDEs arising from quantum chemistry.
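To see (1.4) in action, note that evaluating such an 𝑓 at a single point reduces to the trace of a product of three small matrices. The following is a numpy sketch of ours (the sizes 𝑝, 𝑞, 𝑟 are arbitrary).

# Evaluating the matrix product state (1.4) at one point (x, y, z). If Phi, Psi,
# Theta hold the values phi_ij(x), psi_jk(y), theta_ki(z), then
#   f(x,y,z) = sum_{i,j,k} phi_ij(x) psi_jk(y) theta_ki(z) = trace(Phi Psi Theta).
import numpy as np

p, q, r = 4, 5, 6
rng = np.random.default_rng(1)
Phi = rng.standard_normal((p, q))    # [phi_ij(x)]
Psi = rng.standard_normal((q, r))    # [psi_jk(y)]
Theta = rng.standard_normal((r, p))  # [theta_ki(z)]

f1 = np.einsum('ij,jk,ki->', Phi, Psi, Theta)   # the triple sum in (1.4)
f2 = np.trace(Phi @ Psi @ Theta)
assert np.isclose(f1, f2)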
The oldest definition of a tensor, i.e. definition ➀, encapsulates a fundamental
notion, namely that of equivariance (and also invariance, as a special case) under
coordinate transformations, which continues to be relevant in modern applications.
For example, the affine invariance of Newton’s method and self-concordant barrier
functions plays an important role in interior point methods for convex optimization
(Nesterov and Nemirovskii 1994); exploiting equivariance in deep convolutional
neural networks has led to groundbreaking performance on real-world tasks such as
classifying images in the CIFAR10 database (Cohen and Welling 2016) and, more
recently, predicting the shape of protein molecules in the CASP14 protein-folding
competition (Callaway 2020, Service 2020).2 More generally, we will see that
the notion of equivariance has guided the design of algorithms and methodologies
through the ages, and is implicit in many classical algorithms in numerical linear
algebra.

2 Research papers that describe AlphaFold 2 are still unavailable at the time of writing but
the fact that it uses an equivariant neural network (Fuchs, Worrall, Fischer and Welling 2020)
may be found in Jumper’s slides at https://predictioncenter.org/casp14/doc/presentations/2020_12_01_TS_predictor_AlphaFold2.pdf.
The three notions we raised – separability, multilinearity, equivariance – largely
account for the usefulness of tensors in computations, but there are also various
measures of the ‘size’ of a tensor, notably tensor ranks and tensor norms, that play
an indispensable auxiliary role. Of these, two notions of rank and three notions of
norm stand out for their widespread applicability. Given the decomposition (1.3),
a natural way to define rank is evidently

rank( 𝑓 ) ≔ min{𝑟 : 𝑓 = ∑_{𝑖=1}^{𝑟} 𝜑𝑖 ⊗ 𝜓𝑖 ⊗ 𝜃 𝑖 }, (1.5)

and a natural way to define a corresponding norm is

‖ 𝑓 ‖ 𝜈 ≔ inf{∑_{𝑖=1}^{𝑟} ‖𝜑𝑖 ‖ ‖𝜓𝑖 ‖ ‖𝜃 𝑖 ‖ : 𝑓 = ∑_{𝑖=1}^{𝑟} 𝜑𝑖 ⊗ 𝜓𝑖 ⊗ 𝜃 𝑖 }. (1.6)

If we have an inner product, then its dual norm is

‖ 𝑓 ‖ 𝜎 ≔ sup |⟨ 𝑓 , 𝜑 ⊗ 𝜓 ⊗ 𝜃⟩| / (‖𝜑‖ ‖𝜓 ‖ ‖𝜃 ‖). (1.7)
Definitions (1.5), (1.6) and (1.7) are the tensor rank, nuclear norm and spectral
norm of 𝑓 respectively. They generalize to tensors of arbitrary order 𝑑 in the
obvious way, and for 𝑑 = 2 reduce to rank, nuclear norm and spectral norm of
linear operators and matrices. The norms in (1.6) and (1.7) are in fact the two
standard ways to turn a tensor product of Banach spaces into a Banach space in
functional analysis, and in this context they are called projective and injective norms
respectively.
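For 𝑑 = 2 all three quantities can be read off the singular value decomposition, which makes for a quick sanity check of the reductions just mentioned. A numpy sketch of ours:

# For d = 2 the definitions (1.5)-(1.7) reduce to matrix rank, nuclear norm
# (sum of singular values) and spectral norm (largest singular value).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))  # rank <= 3

s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
rank = int(np.sum(s > 1e-12 * s[0]))        # numerical rank, cf. (1.5)
nuclear = s.sum()                           # nuclear norm, cf. (1.6)
spectral = s[0]                             # spectral norm, cf. (1.7)

assert rank == np.linalg.matrix_rank(A)
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))
assert np.isclose(spectral, np.linalg.norm(A, 2))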
From a computational perspective, we will see in Sections 3 and 4 that in many
instances there is a tensor, usually of order three, at the heart of a computational
problem, and issues regarding complexity, stability, approximability, etc., of that
problem can often be translated to finding or bounding (1.5), (1.6) and (1.7). One
issue with (1.5), (1.6) and (1.7) is that they are all NP-hard to compute when 𝑑 ≥ 3.
As computationally tractable alternatives, we also consider the multilinear rank

𝜇rank( 𝑓 ) ≔ min{(𝑝, 𝑞, 𝑟) : 𝑓 = ∑_{𝑖,𝑗,𝑘=1}^{𝑝,𝑞,𝑟} 𝑎𝑖𝑗𝑘 𝜑𝑖 ⊗ 𝜓 𝑗 ⊗ 𝜃 𝑘 },

and the Frobenius norm (also called the Hilbert–Schmidt norm)

‖ 𝑓 ‖ F = √⟨ 𝑓 , 𝑓 ⟩.
While there are many other notions in linear algebra such as eigenvalues and eigen-
vectors, singular values and singular vectors, determinants, characteristic polyno-
mials, etc., that do generalize to higher orders 𝑑 ≥ 3, their value in computation is
somewhat limited relative to ranks and norms. As such we will only mention these
other notions in passing.

1.2. Goals and scope


We view our article in part as an elementary alternative to the heavier treatments in
Bürgisser, Clausen and Shokrollahi (1997) and Landsberg (2012, 2017, 2019), all
excellent treatises on tensors with substantial discussions of their role in computa-
tions. Nevertheless they require a heavy dose of algebra and algebraic geometry,
or at least a willingness to learn these topics. They also tend to gloss over analytic
aspects such as tensor norms. There are also excellent treatments such as those
of Light and Cheney (1985), Defant and Floret (1993), Diestel, Fourie and Swart
(2008) and Ryan (2002) that cover the analytic perspective, but they invariably
focus on tensors in infinite-dimensional spaces, and with that comes the many
complications that can be avoided in a finite-dimensional setting. None of the
aforementioned references are easy reading, being aimed more at specialists than
beginners, with little discussion of elementary questions.
Our article takes a middle road, giving equal footing to both algebraic and analytic
aspects, that is, we discuss tensors over vector spaces (algebraic) as well as tensors
over norm or inner product spaces (analytic), and we explain why they are different
and how they are related. As we target readers whose main interests are in one way
or another related to computations – numerical linear algebra, numerical PDEs,
optimization, scientific computing, theoretical computer science, machine learning,
etc. – and such a typical target reader would tend to be far more conversant with
analysis than with algebra, the manner in which we approach algebraic and analytic
topics is calibrated accordingly. We devote considerable length to explaining
and motivating algebraic notions such as modules or commutative diagrams, but
tend to gloss over analytic ones such as distributions or compact operators. We
assume some passing familiarity with computational topics such as quadrature
or von Neumann stability analysis but none with ‘purer’ ones such as formal
power series or Littlewood–Richardson coefficients. We also draw almost all our
examples from topics close to the heart of computational mathematicians: kernel
SVMs, Krylov subspaces, Hamilton equations, multipole expansions, perturbation
theory, quantum chemistry and physics, semidefinite programming, wavelets, etc.
As a result almost all our examples tend to have an analytic bent, although there
are also exceptions such as cryptographic multilinear maps or equivariant neural
networks, which are more algebraic.
Our article is intended to be completely elementary. Modern studies of tensors
largely fall into four main areas of mathematics: algebraic geometry, differential
geometry, representation theory and functional analysis. We have avoided the first
three almost entirely and have only touched upon the most rudimentary aspects
of the last. The goal is to show that even without bringing in highbrow subjects,
there is still plenty that could be said about tensors; all we really need is standard
undergraduate linear algebra and some multivariate calculus – vector spaces, linear
maps, change of basis, direct sums and products, dual spaces, derivatives as linear
maps, etc. Among this minimal list of prerequisites, one would find that dual
spaces tend to be a somewhat common blind spot of our target readership, primarily
because they tend to work over inner product spaces where duality is usually a non-
issue. Unfortunately, duality cannot be avoided if one hopes to gain any reasonably
complete understanding of tensors, as tensors defined over inner product spaces
represent a relatively small corner of the subject. A working familiarity with dual
spaces is a must: how a dual basis behaves under a change of basis, how to define
the transpose and adjoint of abstract linear maps, how taking the dual changes direct
sums into direct products, etc. Fortunately these are all straightforward, and we
will remind readers of the relevant facts as and when they are needed. Although we
will discuss tensors over infinite-dimensional spaces, modules, bundles, operator
spaces, etc., we only do so when they are relevant to computational applications.
When discussing the various properties and notions pertaining to tensors, we put
them against a backdrop of familiar analogues in linear algebra and numerical linear
algebra, by either drawing a parallel or a contrast between the multilinear and the
linear.
Our approach towards explaining the roles of tensors in computations is to present
them as examples alongside our discussions of the three standard definitions of
tensors on page 3. The hope is that our reader would gain both an appreciation of
these three definitions, each capturing a different feature of a tensor, as well as how
each is useful in computations. While there are a very small number of inevitable
overlaps with Bürgisser et al. (1997) and Landsberg (2012, 2017, 2019) such as
fast matrix multiplication, we try to offer a different take on these topics. Also,
most of the examples in our article cannot be found in them.
Modern textbook treatments of tensors in algebra and geometry tend to sweep
definition ➀ under the rug of the coordinate-free approaches in definitions ➁ and ➂,
preferring not to belabour the coordinate-dependent point of view in definition ➀.
For example, there is no trace of it in the widely used textbook on algebra by Lang
(2002) and that on differential geometry by Lee (2013). Definition ➀ tends to be
found only in much older treatments (Borisenko and Tarapov 1979, Brand 1947, Hay
1954, Lovelock and Rund 1975, McConnell 1957, Michal 1947, Schouten 1951,
Spain 1960, Synge and Schild 1978, Wrede 1963) that are often more confusing
than illuminating to a modern reader. This has resulted in a certain amount of
misinformation. As a public service, we will devote significant space to definition ➀
so as to make it completely clear. For us, belabouring the point is the point. By
and large we will discuss definitions ➀, ➁ and ➂ in chronological order, as we
think there is some value in seeing how the concept of tensors evolved through
time. Nevertheless we will occasionally get ahead of ourselves to point out how a
later definition would clarify an earlier one or revisit an earlier definition with the
hindsight of a later one.
There is no lack of reasonably elementary, concrete treatments of tensors in
physics texts (Abraham, Marsden and Ratiu 1988, Martin 1991, Misner, Thorne and
Wheeler 1973, Wald 1984) but the ultimate goal in these books is invariably tensor
fields and not tensors. The pervasive use of Einstein’s summation convention,3
the use of both upper and lower indices (sometimes also left and right, i.e. indices
on all four corners) in these books can also be confusing, especially when one
needs to switch between Einstein’s conventions and the usual conventions in linear
algebra. Do the raised indices refer to the inverse of the matrix or are they there
simply because they needed to be summed over? How does one tell the inverse
apart from the inverse transpose of a matrix? The answers depend on the source,
somewhat defeating the purpose of a convention. In this article, tensor fields are
mentioned only in passing and we do not use Einstein’s summation convention.
Instead, we frame our discussions of tensors in notation standard in linear algebra
and numerical linear algebra or slight extensions of these and, by so doing, hope
that our readers would find it easier to see where tensors fit in.

3 No doubt a very convenient shorthand, and its avoidance of the summation symbol ∑ a big
typesetting advantage in the pre-TEX era.
Our article is not about algorithms for solving various problems related to higher-
order tensors. Nevertheless, we take the opportunity to make a few points in
Section 5.2 about the prospects of such algorithms. The bottom line is that we
do not think such problems, even in low-order (𝑑 = 3 or 4) low-dimensional
(𝑛 < 10) instances, have efficient or even just provably correct algorithms. This is
unfortunate as there are various ‘tensor models’ for problems in areas from quantum
chemistry to data analysis that are based on the optimistic assumption that there
exist such algorithms. It is fairly common to see claims of extraordinary things
that follow from such ‘tensor models’, but upon closer examination one would
invariably find an NP-hard problem such as
min_{𝜑,𝜓,𝜃} ‖ 𝑓 − 𝜑 ⊗ 𝜓 ⊗ 𝜃 ‖

embedded in the model as a special case. If one could efficiently solve this problem,
then the range of earth-shattering things that would follow is limitless.4 Therefore,
these claims of extraordinariness are hardly surprising given such an enormous
caveat; they are simply consequences of the fact that if one could efficiently solve
any NP-hard problem, one could efficiently solve all NP-complete problems.

4 In the words of a leading authority (Fortnow 2013, pp. ix and 11), ‘society as we know it would
change dramatically, with immediate, amazing advances in medicine, science, and entertainment
and the automation of nearly every human task,’ and ‘the world will change in ways that will make
the Internet seem like a footnote in history.’
Nearly all the materials in this article are classical. It could have been written
twenty years ago. We would not have been able to mention the resolution of the sal-
mon conjecture or discuss applications such as AlphaFold, the matrix multiplication
exponent and complexity of integer multiplication would be slightly higher, and
some of the books cited would be in their earlier editions. But 95%, if not more, of
the content would have remained intact. We limit our discussion in this article to
results that are by-and-large rigorous, usually in the mathematical sense of having a
proof but occasionally in the sense of having extensive predictions consistent with
results of scientific experiments (like the effectiveness of DMRG). While this art-
icle contains no original research, we would like to think that it offers abundant new
insights on existing topics: our treatment of multipole expansions in Example 4.45,
separation of variables in Examples 4.32–4.35, stress tensors in Example 4.8, our
interpretation of the universal factorization property in Section 4.3, discussions of
the various forms of higher-order derivatives in Examples 3.2, 3.4, 4.29, the way
we have presented and motivated the tensor transformation rules in Section 2, etc.,
have never before appeared elsewhere to the best of our knowledge. In fact, we
have made it a point to not reproduce anything verbatim from existing literature;
even standard materials are given a fresh take.

1.3. Notation and convention


Almost all results, unless otherwise noted, will hold for tensors over both R and C
alike. In order to not have to state ‘R or C’ at every turn, we will denote our field
as R with the implicit understanding that all discussions remain true over C unless
otherwise stated. We will adopt numerical linear algebra convention and regard
vectors in R𝑛 as column vectors, i.e. R𝑛 ≡ R𝑛×1 . Row vectors in R1×𝑛 will always
be denoted as 𝑣 T with 𝑣 ∈ R𝑛 . To save space, we also adopt the convention that an
𝑛-tuple delimited with parentheses always denotes a column vector, that is,

(𝑎1 , . . . , 𝑎 𝑛 ) = [𝑎1 , . . . , 𝑎 𝑛 ]T ∈ R𝑛 .
All bases in this article will be ordered bases. We write GL(𝑛) for the general linear
group, that is, the set of 𝑛 × 𝑛 invertible matrices and O(𝑛) for the orthogonal group
of 𝑛 × 𝑛 orthogonal matrices. Other matrix groups will be introduced later. For
any 𝑋 ∈ GL(𝑛), we write
𝑋 −T ≔ (𝑋 −1 )T = (𝑋 T )−1 .
The vector space of polynomials in variables 𝑥1 , . . . , 𝑥 𝑛 with real coefficients
will be denoted by R[𝑥1 , . . . , 𝑥 𝑛 ]. For Banach spaces V and W, we write B(V; W)
for the set of all bounded linear operators Φ : V → W. We abbreviate B(V; V)
as B(V). A word of caution may be in order. We use the term ‘separable’ in two
completely different and unrelated senses with little risk of confusion: a Banach or
Hilbert space is separable if it has a countable dense subset whereas a function is
separable if it takes the form in (1.1).
We restrict some of our discussions to 3-tensors for a few reasons: (i) the
sacrifice in generality brings about an enormous reduction in notational clutter,
(ii) the generalization to 𝑑-tensors for 𝑑 > 3 is straightforward once we present
the 𝑑 = 3 case, (iii) 𝑑 = 3 is the first unfamiliar case for most readers given that
𝑑 = 0, 1, 2 are already well treated in linear algebra, and (iv) 𝑑 = 3 is the most
important case for many of our applications to computational mathematics.

2. Tensors via transformation rules


We will begin with definition ➀, which first appeared in a book on crystallography
by the physicist Woldemar Voigt (1898):
An abstract entity represented by an array of components that are functions of coordinates
such that, under a transformation of coordinates, the new components are related to the
transformation and to the original components in a definite way.
While the idea of a tensor had appeared before in works of Cauchy and Riemann
(Conrad 2018, Yau 2020) and also around the same time in Ricci and Levi-Civita
(1900), this is believed to be the earliest appearance of the word ‘tensor’. Although
we will refer to the quote above as ‘Voigt’s definition’, it is not a direct translation
of a specific sentence from Voigt (1898) but a paraphrase attributed to Voigt (1898)
in the Oxford English Dictionary. Voigt’s definition is essentially the one adopted
in all early textbooks on tensors such as Brand (1947), Hay (1954), Lovelock and
Rund (1975), McConnell (1957), Michal (1947), Schouten (1951), Spain (1960),
Synge and Schild (1978) and Wrede (1963).5 This is not an easy definition to work
with and likely contributed to the reputation of tensors being a tough subject to
master. Famously, Einstein struggled with tensors (Earman and Glymour 1978)
and the definition he had to dabble with would invariably have been definition ➀.
Nevertheless, we should be reminded that linear algebra as we know it today was
an obscure art in its infancy in the 1900s when Einstein was learning about tensors.
Those trying to learn about tensors in modern times enjoy the benefit of a hundred
years of pedagogical progress. By building upon concepts such as vector spaces,
linear transformations, change of basis, etc., that we take for granted today but were
not readily accessible a century ago, the task of explaining tensors is significantly
simplified.

5 These were originally published in the 1940s–1960s but all have been reprinted by Dover
Publications and several remain in print.
In retrospect, the main issue with definition ➀ is that it is a ‘physicist’s definition’,
that is, it attempts to define a quantity by describing the change-of-coordinates rules
that the quantity must satisfy without specifying the quantity itself. This approach
may be entirely natural in physics where one is interested in questions such as
‘Is stress a tensor?’ or ‘Is electromagnetic field strength a tensor?’, that is, the
definition is always applied in a way where some physical quantity such as stress
takes the place of the unspecified quantity, but it makes for an awkward definition in
mathematics. The modern definitions ➁ and ➂ remedy this by stating unequivocally
what this unspecified quantity is.

2.1. Transformation rules illustrated with linear algebra


The defining property of a tensor in definition ➀ is the ‘transformation rules’ alluded
to in the second and third lines of Voigt’s definition, that is, how the coordinates
transform under change of basis. In modern parlance, these transformation rules
express the notion of equivariance under a linear action of a matrix group, an
idea that has proved to be important even in trendy applications such as deep
convolutional and attention networks. There is no need to go to higher order
tensors to appreciate them; we will see that all along we have been implicitly
working with these transformation rules in linear algebra and numerical linear
algebra. Instead of stating them in their most general forms right away, we will
allow readers to discover these transformation rules for themselves via a few simple
familiar examples; the meaning of the terminology appearing in italics may be
surmised from the examples and will be properly defined in due course.
Eigenvalues and eigenvectors. An eigenvalue 𝜆 ∈ C and a corresponding eigen-
vector 0 ≠ 𝑣 ∈ C𝑛 of a matrix 𝐴 ∈ C𝑛×𝑛 satisfy the eigenvalue/eigenvector equation
𝐴𝑣 = 𝜆𝑣. For any invertible 𝑋 ∈ C𝑛×𝑛 ,
(𝑋 𝐴𝑋 −1 )𝑋𝑣 = 𝜆𝑋𝑣. (2.1)
The transformation rules here are 𝐴 ↦→ 𝑋 𝐴𝑋 −1 , 𝑣 ↦→ 𝑋𝑣 and 𝜆 ↦→ 𝜆. An
eigenvalue is an invariant 0-tensor and an eigenvector is a contravariant 1-tensor
of a mixed 2-tensor, that is, eigenvalues of 𝐴 and 𝑋 𝐴𝑋 −1 are always the same
whereas an eigenvector 𝑣 of 𝐴 transforms into an eigenvector 𝑋𝑣 of 𝑋 𝐴𝑋 −1 . This
says that eigenvalues and eigenvectors are defined for mixed 2-tensors.
Matrix product. For matrices 𝐴 ∈ C𝑚×𝑛 and 𝐵 ∈ C𝑛× 𝑝 , we have
(𝑋 𝐴𝑌 −1 )(𝑌 𝐵𝑍 −1 ) = 𝑋(𝐴𝐵)𝑍 −1 (2.2)
for any invertible 𝑋 ∈ C𝑚×𝑚 , 𝑌 ∈ C𝑛×𝑛 , 𝑍 ∈ C 𝑝× 𝑝 . The transformation rule is
again of the form 𝐴 ↦→ 𝑋 𝐴𝑌 −1 . This says that the standard matrix–matrix product
is defined on mixed 2-tensors.
Positive definiteness. A matrix 𝐴 ∈ R𝑛×𝑛 is positive definite if and only if
𝑋 𝐴𝑋 T or 𝑋 −T 𝐴𝑋 −1
is positive definite for any invertible 𝑋 ∈ R𝑛×𝑛 . This says that positive definiteness
is a property of covariant 2-tensors that have transformation rule 𝐴 ↦→ 𝑋 𝐴𝑋 T or of
contravariant 2-tensors that have transformation rule 𝐴 ↦→ 𝑋 −T 𝐴𝑋 −1 .
Singular values and singular vectors. A singular value 𝜎 ∈ R and its corres-
ponding left singular vector 0 ≠ 𝑢 ∈ R𝑚 and right singular vector 0 ≠ 𝑣 ∈ R𝑛 of a
matrix 𝐴 ∈ R𝑚×𝑛 satisfy the singular value/singular vector equation

𝐴𝑣 = 𝜎𝑢,
𝐴T 𝑢 = 𝜎𝑣.
For any orthogonal matrices 𝑋 ∈ R𝑚×𝑚 , 𝑌 ∈ R𝑛×𝑛 , we have

(𝑋 𝐴𝑌 T )𝑌 𝑣 = 𝜎 𝑋𝑢,
(𝑋 𝐴𝑌 T )T 𝑋𝑢 = 𝜎𝑌 𝑣.
Singular values are invariant and singular vectors are contravariant as they transform
as 𝜎 ↦→ 𝜎, 𝑢 ↦→ 𝑋𝑢, 𝑣 ↦→ 𝑌 𝑣 for a matrix that transforms as 𝐴 ↦→ 𝑋 𝐴𝑌 T . This
tells us that singular values are Cartesian 0-tensors, left and right singular vectors
are contravariant Cartesian 1-tensors, defined on contravariant Cartesian 2-tensors.
‘Cartesian’ means that these transformations use orthogonal matrices instead of
invertible ones.
Linear equations. Let 𝐴 ∈ R𝑚×𝑛 and 𝑏 ∈ R𝑚 . Clearly 𝐴𝑣 = 𝑏 has a solution if
and only if
(𝑋 𝐴𝑌 −1 )𝑌 𝑣 = 𝑋 𝑏
has a solution for any invertible 𝑋 ∈ R𝑚×𝑚 and 𝑌 ∈ R𝑛×𝑛 . If 𝑚 = 𝑛, then we may
instead want to consider
(𝑋 𝐴𝑋 −1 )𝑋𝑣 = 𝑋 𝑏,
and if furthermore 𝐴 is symmetric, then
(𝑋 𝐴𝑋 T )𝑋 −T 𝑣 = 𝑋 𝑏,
where 𝑋 ∈ R𝑛×𝑛 is either invertible or orthogonal (for the latter 𝑋 −T = 𝑋). We will
examine this ambiguity in transformation rules later. The solution in the last case
is a covariant 1-tensor, that is, it transforms as 𝑣 ↦→ 𝑋 −T 𝑣 with an invertible 𝑋.
Ordinary and total least-squares. Let 𝐴 ∈ R𝑚×𝑛 and 𝑏 ∈ R𝑚 . The ordinary
least-squares problem is given by
min_{𝑣∈R𝑛 } ‖ 𝐴𝑣 − 𝑏‖² = min_{𝑣∈R𝑛 } ‖(𝑋 𝐴𝑌 −1 )𝑌 𝑣 − 𝑋 𝑏‖² (2.3)

for any orthogonal 𝑋 ∈ R𝑚×𝑚 and invertible 𝑌 ∈ R𝑛×𝑛 . Unsurprisingly, the normal
equation 𝐴T 𝐴𝑣 = 𝐴T 𝑏 has the same property:
(𝑋 𝐴𝑌 −1 )T (𝑋 𝐴𝑌 −1 )𝑌 𝑣 = (𝑋 𝐴𝑌 −1 )T 𝑋 𝑏.
The total least-squares problem is defined by
min{‖𝐸 ‖² + ‖𝑟 ‖² : (𝐴 + 𝐸)𝑣 = 𝑏 + 𝑟}
= min{‖ 𝑋 𝐸𝑌 T ‖² + ‖ 𝑋𝑟 ‖² : (𝑋 𝐴𝑌 T + 𝑋 𝐸𝑌 T )𝑌 𝑣 = 𝑋 𝑏 + 𝑋𝑟}
for any orthogonal 𝑋 ∈ R𝑚×𝑚 and orthogonal 𝑌 ∈ R𝑛×𝑛 . Here the minimization is
over 𝐸 ∈ R𝑚×𝑛 , 𝑟 ∈ R𝑚 , and the constraint is interpreted as the linear system being
consistent. Both ordinary and total least-squares are defined on tensors – minimum
values transform as invariant 0-tensors, 𝑣, 𝑏, 𝑟 as contravariant 1-tensors and 𝐴, 𝐸
as mixed 2-tensors – but the 2-tensors involved are different as 𝑌 is not required to
be orthogonal in (2.3).
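The invariance in (2.3) can be checked directly; here is a numpy sketch of ours with a random orthogonal 𝑋 (obtained from a QR factorization) and a random invertible 𝑌.

# The least-squares problem for (X A Y^{-1}, X b) has the same minimum value as
# that for (A, b), and its minimizer is Y v, with X orthogonal and Y invertible.
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
X, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal X
Y = rng.standard_normal((n, n))                    # invertible with probability one
Yi = np.linalg.inv(Y)

v = np.linalg.lstsq(A, b, rcond=None)[0]           # minimizer of ||Av - b||
w = np.linalg.lstsq(X @ A @ Yi, X @ b, rcond=None)[0]

assert np.allclose(w, Y @ v)                       # v -> Y v
assert np.isclose(np.linalg.norm(A @ v - b),       # same minimum value
                  np.linalg.norm((X @ A @ Yi) @ w - X @ b))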
Rank, determinant, norm. Let 𝐴 ∈ R𝑚×𝑛 . Then
rank(𝑋 𝐴𝑌 −1 ) = rank(𝐴), det(𝑋 𝐴𝑌 −1 ) = det(𝐴), ‖ 𝑋 𝐴𝑌 −1 ‖ = ‖ 𝐴‖,
where 𝑋 and 𝑌 are, respectively, invertible, special linear or orthogonal matrices.
Here ‖ · ‖ may denote either the spectral, nuclear or Frobenius norm and the
determinant is regarded as identically zero whenever 𝑚 ≠ 𝑛. Rank, determinant
and norm are defined on mixed 2-tensors, special linear mixed 2-tensors and
Cartesian mixed 2-tensors respectively.
The point that we hope to make with these familiar examples is the following.
The most fundamental and important concepts, equations, properties and problems
in linear algebra and numerical linear algebra – notions that extend far beyond linear
algebra into other areas of mathematics, science and engineering – all conform to
tensor transformation rules. These transformation rules are not merely accessories
to the definition of a tensor: they are the very crux of it and are what make tensors
useful.
On the one hand, it is intuitively clear what the transformation rules are in
the examples on pages 13–14: They are transformations that preserve either the
form of an equation (e.g. if we write 𝐴 ′ = 𝑋 𝐴𝑋 −1 , 𝑣 ′ = 𝑋𝑣, 𝜆 ′ = 𝜆, then (2.1)
becomes 𝐴 ′𝑣 ′ = 𝜆 ′ 𝑣 ′), or the value of a quantity such as rank/determinant/norm,
or a property such as positive definiteness. In these examples, the ‘multi-indexed
object’ in definition ➀ can be a matrix like 𝐴 or 𝐸, a vector like 𝑢, 𝑣, 𝑏, 𝑟 or a scalar
like 𝜆; the coordinates of the matrix or vector, i.e. 𝑎𝑖 𝑗 and 𝑣 𝑖 , are sometimes called
the ‘components of the tensor’. The matrices 𝑋, 𝑌 play a different role in these
transformation rules and should be distinguished from the multi-indexed objects;
for reasons that will be explained in Section 3.1, we call them change-of-basis
matrices.
On the other hand, these examples also show why the transformation rules can
be confusing: they are ambiguous in multiple ways.
(a) The transformation rules can take several different forms. For example, 1-
tensors may transform as
𝑣 ′ = 𝑋𝑣 or 𝑣 ′ = 𝑋 −T 𝑣,
2-tensors may transform as
𝐴 ′ = 𝑋 𝐴𝑌 −1 , 𝐴 ′ = 𝑋 𝐴𝑌 T , 𝐴 ′ = 𝑋 𝐴𝑋 −1 , 𝐴 ′ = 𝑋 𝐴𝑋 T ,
or yet other possibilities we have not discussed.
(b) The change-of-basis matrices in these transformation rules may also take
several different forms, most commonly invertible or orthogonal. In the
examples, the change-of-basis matrices may be a single matrix6
𝑋 ∈ GL(𝑛), SL(𝑛), O(𝑛),
and a pair of them
(𝑋, 𝑌 ) ∈ GL(𝑚) × GL(𝑛), SL(𝑚) × SL(𝑛), O(𝑚) × O(𝑛),
or, as we saw in the case of ordinary least-squares (2.3), (𝑋, 𝑌 ) ∈ O(𝑚) ×
GL(𝑛). There are yet other possibilities for change-of-basis matrices we have
not discussed, such as Lorentz, symplectic, unitary, etc.
(c) Yet another ambiguity on top of (a) is that the roles of 𝑣 and 𝑣 ′ or 𝐴 and 𝐴 ′
are sometimes swapped and the transformation rules written as
𝑣 ′ = 𝑋 −1 𝑣, 𝑣′ = 𝑋 T𝑣
or
𝐴 ′ = 𝑋 −1 𝐴𝑌 , 𝐴 ′ = 𝑋 −1 𝐴𝑌 −T , 𝐴 ′ = 𝑋 −1 𝐴𝑋, 𝐴 ′ = 𝑋 −1 𝐴𝑋 −T
respectively. Note that the transformation rules here and those in (a) are all
but identical: the only difference is in whether we label the multi-indexed
object on the left or that on the right with a prime.

6 For those unfamiliar with this matrix group notation, it will be defined in (2.14).

We may partly resolve the ambiguity in (a) by introducing covariance and con-
travariance type: tensors with transformation rules of the form 𝐴 ′ = 𝑋 𝐴𝑋 T are
covariant 2-tensors or (0, 2)-tensors; those of the form 𝐴 ′ = 𝑋 𝐴𝑋 −1 are mixed
2-tensors or (1, 1)-tensors; those of the form 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 are contravariant
2-tensors or (2, 0)-tensors. Invariance, covariance and contravariance are all spe-
cial cases of equivariance that we will discuss later. Nevertheless, we are still
unable to distinguish between 𝐴 ′ = 𝑋 𝐴𝑋 T and 𝐴 ′ = 𝑋 𝐴𝑌 T : both are legitimate
transformation rules for covariant 2-tensors.
These ambiguities (a), (b), (c) are the result of a defect in definition ➀. One ought
to be asking: What is the tensor in definition ➀? The answer is that it is actually
missing from the definition. The multi-indexed object 𝐴 represents the tensor
whereas the transformation rules on 𝐴 defines the tensor but the tensor itself has
been left unspecified. This is a key reason why definition ➀ has been so confusing
to early learners. Fortunately, it is easily remedied by definition ➁ or ➂. We should,
however, bear in mind that definition ➀ predated modern notions of vector spaces
and linear maps, which are necessary for definitions ➁ and ➂.
When we introduce definitions ➁ or ➂, we will see that the ambiguity in (a)
is just due to different transformation rules for different tensors. Getting ahead
of ourselves, for vector spaces V and W, the transformation rule 𝐴 ′ = 𝑋 𝐴𝑋 −1
applies to tensors in V ⊗ V∗ whereas 𝐴 ′ = 𝑋 𝐴𝑌 −1 applies to those in V ⊗ W∗ ;
𝐴 ′ = 𝑋 𝐴𝑋 T and 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 apply to tensors in V ⊗ V and V∗ ⊗ V∗ respectively.
The matrices 𝐴 and 𝐴 ′ are representations of a tensor with respect to two different
bases, and the ambiguity in (c) is just a result of which basis is regarded as the ‘old’
basis and which as the ‘new’ one.
The ambiguity in (b) is also easily resolved with definitions ➁ or ➂. The matrices
𝑋 and 𝑌 are change-of-basis matrices on V and W and are always invertible, but they
also have to preserve additional structures on V and W such as inner products or
norms. For example, singular values and singular vectors are only defined for inner
product spaces and so we require the matrices 𝑋 and 𝑌 to preserve inner products;
for the Euclidean inner product, this simply means that 𝑋 and 𝑌 are orthogonal
matrices. Tensors with transformation rules involving orthogonal matrices are
sometimes called ‘Cartesian tensors’.
These examples on pages 13–14 illustrate why the transformation rules in defin-
ition ➀ are as crucial in mathematics as they are in physics. Unlike in physics, we
do not use the transformation rules to check whether a physical quantity such as
stress or strain is a tensor; instead we use them to ascertain whether an equation
(e.g. an eigenvalue/eigenvector equation), a property (e.g. positive definiteness), an
operation (e.g. a matrix product) or a mathematical quantity (e.g. rank) is defined
on a tensor. In other words, when working with tensors, it is not a matter of
simply writing down an expression involving multi-indexed quantities, because if
the expression does not satisfy the transformation rules, it is undefined on a tensor.
For example, multiplying two matrices via the Hadamard (also known as Schur)
product

[𝑎11 𝑎12 ; 𝑎21 𝑎22 ] ∘ [𝑏11 𝑏12 ; 𝑏21 𝑏22 ] = [𝑎11 𝑏11 𝑎12 𝑏12 ; 𝑎21 𝑏21 𝑎22 𝑏22 ] (2.4)
seems a lot more obvious than the standard matrix product

[𝑎11 𝑎12 ; 𝑎21 𝑎22 ] [𝑏11 𝑏12 ; 𝑏21 𝑏22 ] = [𝑎11 𝑏11 + 𝑎12 𝑏21 𝑎11 𝑏12 + 𝑎12 𝑏22 ; 𝑎21 𝑏11 + 𝑎22 𝑏21 𝑎21 𝑏12 + 𝑎22 𝑏22 ]. (2.5)
However, (2.5) defines a product on tensors whereas (2.4) does not, the reason being
that the standard matrix product satisfies the transformation rules in (2.2) whereas
the Hadamard product does not. While there may be occasional scenarios where
the Hadamard product could be useful, the standard matrix product is far more
prevalent in all areas of mathematics, science and engineering. This is not limited
to matrix products; as we can see from the list of examples, the most useful and
important notions we encounter in linear algebra are invariably tensorial notions
that satisfy various transformation rules for 0-, 1- and 2-tensors. We will have more
to say about these issues in the next few sections.
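The contrast between (2.4) and (2.5) can also be seen numerically; the numpy sketch below (ours) checks that the standard product obeys (2.2) while the Hadamard product, for generic matrices, does not.

# The ordinary matrix product respects the mixed 2-tensor rule (2.2),
# (X A Y^{-1})(Y B Z^{-1}) = X (A B) Z^{-1}; the Hadamard product does not.
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))
Z = rng.standard_normal((n, n))
Yi, Zi = np.linalg.inv(Y), np.linalg.inv(Z)

# Standard product: (2.2) holds.
assert np.allclose((X @ A @ Yi) @ (Y @ B @ Zi), X @ (A @ B) @ Zi)

# Hadamard (entrywise) product: the analogous identity fails generically.
assert not np.allclose((X @ A @ Yi) * (Y @ B @ Zi), X @ (A * B) @ Zi)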

2.2. Transformation rules for 𝑑-tensors


We now state the transformation rules in their most general form, laying out the
full details of definition ➀. Let 𝐴 = [𝑎 𝑗1 ··· 𝑗𝑑 ] ∈ R𝑛1 ×···×𝑛𝑑 be a 𝑑-dimensional
matrix or hypermatrix. Readers who require a rigorous definition may take it as
a real-valued function 𝐴 : {1, . . . , 𝑛1 } × · · · × {1, . . . , 𝑛𝑑 } → R, a perspective we
will discuss in Example 4.5, but those who are fine with a ‘𝑑-dimensional matrix’
may picture it as such. For matrices, i.e. the usual two-dimensional matrices,
𝑋 = [𝑥𝑖1 𝑗1 ] ∈ R𝑚1 ×𝑛1 , 𝑌 = [𝑦 𝑖2 𝑗2 ] ∈ R𝑚2 ×𝑛2 , . . . , 𝑍 = [𝑧𝑖𝑑 𝑗𝑑 ] ∈ R𝑚𝑑 ×𝑛𝑑 , we define
(𝑋, 𝑌 , . . . , 𝑍) · 𝐴 = 𝐵, (2.6)
where 𝐵 = [𝑏𝑖1 ···𝑖𝑑 ] ∈ R𝑚1 ×···×𝑚𝑑 is given by
𝑏𝑖1 ···𝑖𝑑 = ∑_{𝑗1 =1}^{𝑛1 } ∑_{𝑗2 =1}^{𝑛2 } · · · ∑_{𝑗𝑑 =1}^{𝑛𝑑 } 𝑥𝑖1 𝑗1 𝑦 𝑖2 𝑗2 · · · 𝑧𝑖𝑑 𝑗𝑑 𝑎 𝑗1 ··· 𝑗𝑑

for 𝑖 1 = 1, . . . , 𝑚 1 , 𝑖 2 = 1, . . . , 𝑚 2 , . . . , 𝑖 𝑑 = 1, . . . , 𝑚 𝑑 . We call the operation
(2.6) multilinear matrix multiplication; as we will discuss after stating the tensor
transformation rules, the notation (2.6) is chosen to be consistent with standard
notation for group action. For 𝑑 = 1, it reduces to 𝑋𝑎 = 𝑏 for 𝑎 ∈ R𝑛 , 𝑏 ∈ R𝑚 , and
for 𝑑 = 2, it reduces to
(𝑋, 𝑌 ) · 𝐴 = 𝑋 𝐴𝑌 T .
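In code, (2.6) is a single einsum; the sketch below (ours, not from the article) implements the 𝑑 = 3 case and checks the 𝑑 = 2 reduction (𝑋, 𝑌 ) · 𝐴 = 𝑋 𝐴𝑌 T.

# Multilinear matrix multiplication (2.6) for d = 3, via einsum.
import numpy as np

def mmm(X, Y, Z, A):
    """(X, Y, Z) . A for a 3-way hypermatrix A, cf. (2.6)."""
    return np.einsum('ia,jb,kc,abc->ijk', X, Y, Z, A)

rng = np.random.default_rng(6)
X = rng.standard_normal((2, 3))
Y = rng.standard_normal((6, 4))
Z = rng.standard_normal((7, 5))
A = rng.standard_normal((3, 4, 5))

B = mmm(X, Y, Z, A)
assert B.shape == (2, 6, 7)

# The d = 2 case reduces to (X, Y) . A = X A Y^T.
A2 = rng.standard_normal((3, 4))
assert np.allclose(np.einsum('ia,jb,ab->ij', X, Y, A2), X @ A2 @ Y.T)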
For now, we will let the hypermatrix 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 be the multi-indexed object in
definition ➀ to keep things simple; we will later see that this multi-indexed object
does not need to be a hypermatrix.
Let 𝑋1 ∈ GL(𝑛1 ), 𝑋2 ∈ GL(𝑛2 ), . . . , 𝑋𝑑 ∈ GL(𝑛𝑑 ). The covariant 𝑑-tensor
transformation rule is
𝐴 ′ = (𝑋1T , 𝑋2T , . . . , 𝑋𝑑T ) · 𝐴. (2.7)
The contravariant 𝑑-tensor transformation rule is
𝐴 ′ = (𝑋1−1 , 𝑋2−1 , . . . , 𝑋𝑑−1 ) · 𝐴. (2.8)
The mixed 𝑑-tensor transformation rule is7
𝐴 ′ = (𝑋1−1 , . . . , 𝑋 𝑝−1 , 𝑋 𝑝+1T , . . . , 𝑋𝑑T ) · 𝐴. (2.9)
We say that the transformation rule in (2.9) is of type or valence (𝑝, 𝑑 − 𝑝) or, more
verbosely, of contravariant order 𝑝 and covariant order 𝑑 − 𝑝. As such, covariance
is synonymous with type (0, 𝑑) and contravariance with type (𝑑, 0).

7 We state it in this form for simplicity. For example, we do not distinguish between the 3-tensor
transformation rules 𝐴 ′ = (𝑋 −1 , 𝑌 T , 𝑍 T ) · 𝐴, 𝐴 ′ = (𝑋 T , 𝑌 −1 , 𝑍 T ) · 𝐴 and 𝐴 ′ = (𝑋 T , 𝑌 T , 𝑍 −1 ) · 𝐴.
All three are of type (1, 2).
For 𝑑 = 1, the hypermatrix is just 𝑎 ∈ R𝑛 . If it transforms as 𝑎 ′ = 𝑋 −1 𝑎,
then it is the coordinate representation of a contravariant 1-tensor or contravariant
vector; if it transforms as 𝑎 ′ = 𝑋 T 𝑎, then it is the coordinate representation of
a covariant 1-tensor or covariant vector. For 𝑑 = 2, the hypermatrix is just
a matrix 𝐴 ∈ R𝑚×𝑛 ; writing 𝑋1 = 𝑋, 𝑋2 = 𝑌 , the transformations in (2.7),
(2.8), (2.9) become 𝐴 ′ = 𝑋 T 𝐴𝑌 , 𝐴 ′ = 𝑋 −1 𝐴𝑌 −T , 𝐴 ′ = 𝑋 −1 𝐴𝑌 , which are the
transformation rules for a covariant, contravariant, mixed 2-tensor respectively.
These look different from the transformation rules in the examples on pages 13–14
as a result of the ambiguity (c) on page 15, which we elaborate below.
Clearly, the equalities in the middle and right columns below only differ in which
side we label with a prime but are otherwise identical:
contravariant 1-tensor 𝑎 ′ = 𝑋 −1 𝑎, 𝑎 ′ = 𝑋𝑎,
covariant 1-tensor 𝑎 ′ = 𝑋 T 𝑎, 𝑎 ′ = 𝑋 −T 𝑎,
contravariant 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑌 −T , 𝐴 ′ = 𝑋 𝐴𝑌 T ,
covariant 2-tensor 𝐴 ′ = 𝑋 T 𝐴𝑌 , 𝐴 ′ = 𝑋 −T 𝐴𝑌 −1 ,
mixed 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑌 , 𝐴 ′ = 𝑋 𝐴𝑌 −1 .
We will see in Section 3.1 after introducing definition ➁ that these transformation
rules come from the change-of-basis theorems for vectors, linear functionals, dyads,
bilinear functionals and linear operators, respectively, with 𝑋 and 𝑌 the change-of-
basis matrices. The versions in the middle and right column differ in terms of how
we write our change-of-basis theorem. For example, take the standard change-of-
basis theorem in a vector space. Do we write 𝑎new = 𝑋 −1 𝑎old or 𝑎old = 𝑋𝑎new ?
Observe, however, that there is no repetition in either column: the transformation
rule, whether in the middle or right column, uniquely identifies the type of tensor.
It is not possible to confuse, say, a contravariant 2-tensor with a covariant 2-tensor
just because we use the transformation rule in the middle column to describe one
and the transformation rule in the right column to describe the other.
The version in the middle column is consistent with the names ‘covariance’
and ‘contravariance’, which are based on whether the hypermatrix ‘co-varies’, i.e.
transforms with the same 𝑋, or ‘contra-varies’, i.e. transforms with its inverse 𝑋 −1 .
This is why we have stated our transformation rules in (2.7), (2.8), (2.9) to be
consistent with those in the middle column. But there are also occasions, as in the
examples on pages 13–14, when it is more natural to express the transformation
rules as those in the right column.
When 𝑛1 = 𝑛2 = · · · = 𝑛𝑑 = 𝑛, the transformation rules in (2.7), (2.8), (2.9) may
take on a different form with a single change-of-basis matrix 𝑋 ∈ GL(𝑛) as opposed
to 𝑑 of them. In this case the hypermatrix 𝐴 ∈ R𝑛×···×𝑛 is hypercubical, i.e. the
higher-order equivalent of a square matrix, and the covariant tensor transformation
rule is
𝐴 ′ = (𝑋 T , 𝑋 T , . . . , 𝑋 T ) · 𝐴, (2.10)
the contravariant tensor transformation rule is
𝐴 ′ = (𝑋 −1 , 𝑋 −1 , . . . , 𝑋 −1 ) · 𝐴, (2.11)
and the mixed tensor transformation rule is
𝐴 ′ = (𝑋 −1 , . . . , 𝑋 −1 , 𝑋 T , . . . , 𝑋 T ) · 𝐴. (2.12)
Again, we will see in Section 3.1 after introducing definition ➁ that the difference
between, say, (2.7) and (2.10) is that the former expresses the change-of-basis
theorem for a multilinear map 𝑓 : V1 × · · · × V𝑑 → R whereas the latter is for a
multilinear map 𝑓 : V × · · · × V → R.
For 𝑑 = 2, we have 𝐴 ∈ R𝑛×𝑛 , and each transformation rule again takes two
different forms that are identical in substance:
covariant 2-tensor 𝐴 ′ = 𝑋 T 𝐴𝑋, 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 ,
contravariant 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑋 −T , 𝐴 ′ = 𝑋 𝐴𝑋 T ,
mixed 2-tensor 𝐴 ′ = 𝑋 −1 𝐴𝑋, 𝐴 ′ = 𝑋 𝐴𝑋 −1 .
Note that either form uniquely identifies the tensor type.
Example 2.1 (covariance versus contravariance). The three transformation rules
for 2-tensors are 𝐴 ′ = 𝑋 𝐴𝑋 T , 𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 , 𝐴 ′ = 𝑋 𝐴𝑋 −1 . We know from linear
algebra that there is a vast difference between the first two (congruence) and the
last one (similarity). For example, eigenvalues and eigenvectors are defined for
mixed 2-tensors by virtue of (2.1) but undefined for covariant or contravariant
2-tensors since the eigenvalue/eigenvector equation 𝐴𝑣 = 𝜆𝑣 is incompatible with
the first two transformation rules. While both the contravariant and covariant
transformation rules describe congruence of matrices, the difference between them
is best seen to be the difference between the quadratic form and the second-order
partial differential operator:
∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} 𝑎𝑖 𝑗 𝑣 𝑖 𝑣 𝑗   and   ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} 𝑎𝑖 𝑗 𝜕²/(𝜕𝑣 𝑖 𝜕𝑣 𝑗 ).

𝐴 ′ = 𝑋 −T 𝐴𝑋 −1 and 𝐴 ′ = 𝑋 𝐴𝑋 T , respectively, describe how their coefficient
matrices transform under a change of coordinates 𝑣 ′ = 𝑋𝑣.
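To spell out where these two rules come from, here is the short computation, written in LaTeX for definiteness (our summary of the standard argument): substitute 𝑣 = 𝑋 −1 𝑣′ into the quadratic form, and use the chain rule 𝜕/𝜕𝑣 = 𝑋 T 𝜕/𝜕𝑣′ in the differential operator.

% Quadratic form under v' = Xv (so v = X^{-1} v'):
\sum_{i,j=1}^{n} a_{ij} v_i v_j = v^{\mathsf{T}} A v
  = (X^{-1}v')^{\mathsf{T}} A (X^{-1}v')
  = (v')^{\mathsf{T}} \bigl(X^{-\mathsf{T}} A X^{-1}\bigr) v',
  \qquad \text{so } A' = X^{-\mathsf{T}} A X^{-1}.
% Differential operator, using \partial/\partial v = X^{\mathsf{T}}\,\partial/\partial v':
\sum_{i,j=1}^{n} a_{ij} \frac{\partial^2}{\partial v_i \,\partial v_j}
  = \Bigl(\frac{\partial}{\partial v}\Bigr)^{\mathsf{T}} A \Bigl(\frac{\partial}{\partial v}\Bigr)
  = \Bigl(\frac{\partial}{\partial v'}\Bigr)^{\mathsf{T}} \bigl(X A X^{\mathsf{T}}\bigr) \Bigl(\frac{\partial}{\partial v'}\Bigr),
  \qquad \text{so } A' = X A X^{\mathsf{T}}.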
As we have mentioned in Section 2.1, definition ➀ leaves the tensor unspecified.
In physics, this is to serve as a placeholder for a physical quantity under discussion
such as deformation or rotation or strain. In mathematics, if we insist on not
bringing in definitions ➁ or ➂, then the way to use definition ➀ is best illustrated
with an example.
Example 2.2 (transformation rule determines tensor type). We say the eigen-
value/eigenvector equation 𝐴𝑣 = 𝜆𝑣 conforms to the transformation rules since
(𝑋 𝐴𝑋 −1 )𝑋𝑣 = 𝜆𝑋𝑣, and thus 𝐴 transforms like a mixed 2-tensor 𝐴 ′ = 𝑋 𝐴𝑋 −1 ,
the eigenvector 𝑣 transforms like a contravariant 1-tensor 𝑣 ′ = 𝑋𝑣 and the eigenvalue
𝜆 transforms like an invariant 0-tensor. It is only with context (here we have
eigenvalues/eigenvectors) that it makes sense to speak of 𝐴, 𝑣, 𝜆 as tensors. Is
𝐴 = [1 2 3; 2 3 4; 3 4 5] a tensor? It is a tensor if it represents, say, stress (in which case 𝐴
transforms as a contravariant 2-tensor) or, say, we are interested in its eigenvalues
and eigenvectors (in which case 𝐴 transforms as a mixed 2-tensor). Stripped of
such contexts, the question as to whether a matrix or hypermatrix is a tensor is not
meaningful.
We have stressed that the transformation rules define the tensor but one might
think that this only applies to the type of the tensor, i.e. whether it is covariant or
contravariant or mixed. This is not the case: it defines everything about the tensor.
Example 2.3 (transformation rule determines tensor order). Does 𝐴 = [1 2 3; 2 3 4; 3 4 5]
represent a 1-tensor or a 2-tensor? One might think that since a matrix is a
doubly indexed object, it must represent a 2-tensor. Again this is a fallacy. If the
transformation rules we apply to a matrix 𝐴 ∈ R𝑚×𝑛 are of the forms
𝐴 ′ = 𝑋 𝐴 = [𝑋𝑎1 , . . . , 𝑋𝑎 𝑛 ], 𝐴 ′ = 𝑋 −T 𝐴 = [𝑋 −T 𝑎1 , . . . , 𝑋 −T 𝑎 𝑛 ], (2.13)
where 𝑎1 , . . . , 𝑎 𝑛 ∈ R𝑚 are the columns of 𝐴 and 𝑋 ∈ GL(𝑚), then 𝐴 represents a
contravariant or covariant 1-tensor respectively, that is, the order of a tensor is also
determined by the transformation rules. In numerical linear algebra, algorithms
such as Gaussian elimination and Householder or Givens QR are all based on trans-
formations of a matrix as a 1-tensor as in (2.13); we will have more to say about this
in Section 2.4. In equivariant neural networks, it does not matter that filters are rep-
resented as five-dimensional arrays (Cohen and Welling 2016, Section 7.1) or that
inputs are a matrix of vectors of ‘irreducible fragments’ (Kondor, Lin and Trivedi
2018, Definition 1): the transformation rules involved are ultimately just those of
contravariant 1-tensors and mixed 2-tensors, as we will see in Example 2.16.
The change-of-basis matrices 𝑋1 , . . . , 𝑋𝑑 in (2.7), (2.8), (2.9) and 𝑋 in (2.10),
(2.11), (2.12) do not need to be just invertible matrices, i.e. elements of GL(𝑛):
they could instead come from any classical groups, most commonly the orthogonal
group O(𝑛) or unitary group U(𝑛), possibly with unit-determinant constraints, i.e.
SL(𝑛), SO(𝑛), SU(𝑛). The reason is that as the name implies, these are change-of-
basis matrices for the vector spaces in definitions ➁ or ➂, and if those vector spaces
are equipped with additional structures, we expect the change-of-basis matrices to
preserve those structures.
Example 2.4 (Cartesian and Lorentzian tensors). For the tensor transformation
rule over R4 equipped with the Euclidean inner product
⟨𝑥, 𝑦⟩ = 𝑥0 𝑦 0 + 𝑥1 𝑦 1 + 𝑥2 𝑦 2 + 𝑥3 𝑦 3 ,
we want the change-of-basis matrices to be from the orthogonal group O(4) or
special orthogonal group SO(4), but if R4 is equipped instead with the Lorentzian
scalar product in relativity,
⟨𝑥, 𝑦⟩ = 𝑥0 𝑦 0 − 𝑥1 𝑦 1 − 𝑥2 𝑦 2 − 𝑥3 𝑦 3 ,
then the Lorentz group O(1, 3) or proper Lorentz group SO(1, 3) or restricted
Lorentz group SO+ (1, 3) is more appropriate. Names like Cartesian tensors or
Lorentzian tensors are sometimes used to indicate whether the transformation rules
involve orthogonal or Lorentz groups but sometimes physicists would assume that
these are self-evident from the context and are left unspecified, adding to the
confusion for the uninitiated. The well-known electromagnetic field tensor or
Faraday tensor, usually represented in matrix form by
     [ 0         −𝐸 𝑥 /𝑐   −𝐸 𝑦 /𝑐   −𝐸 𝑧 /𝑐 ]
     [ 𝐸 𝑥 /𝑐     0         −𝐵 𝑧       𝐵 𝑦   ]
𝐹 =  [ 𝐸 𝑦 /𝑐     𝐵 𝑧        0        −𝐵 𝑥   ] ,
     [ 𝐸 𝑧 /𝑐    −𝐵 𝑦        𝐵 𝑥       0     ]
where we have written (𝑥0 , 𝑥1 , 𝑥2 , 𝑥3 ) = (𝑐𝑡, 𝑥, 𝑦, 𝑧), is a contravariant Lorentz
2-tensor as it satisfies a transformation rule of the form 𝐹 ′ = 𝑋 𝐹 𝑋 T with a change-
of-basis matrix 𝑋 ∈ O(1, 3).
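As a concrete illustration (a sketch of ours, not part of the original text), the statements above are easy to check numerically; the rapidity and the field values below are arbitrary.

    import numpy as np

    I13 = np.diag([1.0, -1, -1, -1])                        # matrix of the Lorentzian scalar product
    th = 0.3                                                # rapidity of a boost along x
    X = np.eye(4)
    X[:2, :2] = [[np.cosh(th), -np.sinh(th)],
                 [-np.sinh(th), np.cosh(th)]]
    print(np.allclose(X.T @ I13 @ X, I13))                  # True: X is in O(1, 3)

    x, y = np.array([1.0, 2, 3, 4]), np.array([0.0, 1, 0, 2])
    print(np.isclose((X @ x) @ I13 @ (X @ y), x @ I13 @ y)) # Lorentzian scalar product preserved

    c, (Ex, Ey, Ez), (Bx, By, Bz) = 1.0, (1.0, 2, 3), (4.0, 5, 6)
    F = np.array([[0, -Ex/c, -Ey/c, -Ez/c],
                  [Ex/c, 0, -Bz, By],
                  [Ey/c, Bz, 0, -Bx],
                  [Ez/c, -By, Bx, 0]])
    Fp = X @ F @ X.T                                        # contravariant 2-tensor rule F' = X F X^T
    print(np.allclose(Fp, -Fp.T))                           # antisymmetry of F is preserved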
For easy reference, here is a list of the most common matrix Lie groups that the
change-of-basis matrices 𝑋1 , . . . , 𝑋𝑑 or 𝑋 may belong to:
GL(𝑛) = {𝑋 ∈ R𝑛×𝑛 : det(𝑋) ≠ 0},
SL(𝑛) = {𝑋 ∈ R𝑛×𝑛 : det(𝑋) = 1},
O(𝑛) = {𝑋 ∈ R𝑛×𝑛 : 𝑋 T 𝑋 = 𝐼},
SO(𝑛) = {𝑋 ∈ R𝑛×𝑛 : 𝑋 T 𝑋 = 𝐼, det(𝑋) = 1},
U(𝑛) = {𝑋 ∈ C𝑛×𝑛 : 𝑋 ∗ 𝑋 = 𝐼},
SU(𝑛) = {𝑋 ∈ C𝑛×𝑛 : 𝑋 ∗ 𝑋 = 𝐼, det(𝑋) = 1},
O(𝑝, 𝑞) = {𝑋 ∈ R𝑛×𝑛 : 𝑋 T 𝐼 𝑝,𝑞 𝑋 = 𝐼 𝑝,𝑞 },
SO(𝑝, 𝑞) = {𝑋 ∈ R𝑛×𝑛 : 𝑋 T 𝐼 𝑝,𝑞 𝑋 = 𝐼 𝑝,𝑞 , det(𝑋) = 1},
Sp(2𝑛, R) = {𝑋 ∈ R2𝑛×2𝑛 : 𝑋 T 𝐽 𝑋 = 𝐽},
Sp(2𝑛) = {𝑋 ∈ C2𝑛×2𝑛 : 𝑋 T 𝐽 𝑋 = 𝐽, 𝑋 ∗ 𝑋 = 𝐼}. (2.14)
In the above, R may be replaced by C to obtain corresponding groups of complex
matrices, 𝑝 and 𝑞 are positive integers with 𝑝 + 𝑞 = 𝑛, and 𝐼 = 𝐼𝑛 denotes the 𝑛 × 𝑛
identity matrix. The matrices 𝐼 𝑝,𝑞 ∈ C𝑛×𝑛 and 𝐽 ∈ C2𝑛×2𝑛 , respectively, are
𝐼 𝑝,𝑞 ≔ [ 𝐼 𝑝  0 ; 0  −𝐼𝑞 ] ,        𝐽 ≔ [ 0  𝐼 ; 𝐼  0 ] .
There are some further possibilities that deserve a mention here, if only for cultural
reasons. While not as common as the classical Lie groups above, depending on
the application at hand, the groups of invertible lower/upper or unit lower/upper
triangular matrices, Heisenberg groups, etc., may all serve as the group of change-
of-basis matrices. Another three particularly important examples are the general
affine, Euclidean and special Euclidean groups
GA(𝑛) = { [ 𝑋  𝑦 ; 0  1 ] ∈ R(𝑛+1)×(𝑛+1) : 𝑋 ∈ GL(𝑛), 𝑦 ∈ R𝑛 },
E(𝑛) = { [ 𝑋  𝑦 ; 0  1 ] ∈ R(𝑛+1)×(𝑛+1) : 𝑋 ∈ O(𝑛), 𝑦 ∈ R𝑛 },        (2.15)
SE(𝑛) = { [ 𝑋  𝑦 ; 0  1 ] ∈ R(𝑛+1)×(𝑛+1) : 𝑋 ∈ SO(𝑛), 𝑦 ∈ R𝑛 },
which encode a linear transformation by 𝑋 and translation by 𝑦 in R𝑛 . Note that
even though the matrices involved are (𝑛 + 1) × (𝑛 + 1), they act on vectors 𝑣 ∈ R𝑛
embedded as (𝑣, 1) ∈ R𝑛+1 . As we will see in Example 2.16, the last two groups
play a special role in equivariant neural networks for image classification (𝑛 = 2)
and protein structure predictions (𝑛 = 3).
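The following small NumPy sketch (ours, with an arbitrary rotation angle and translation) spells out this action in homogeneous coordinates for SE(2).

    import numpy as np

    th, y = np.pi / 6, np.array([2.0, -1.0])
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])               # R is in SO(2)
    g = np.block([[R, y[:, None]],
                  [np.zeros((1, 2)), np.ones((1, 1))]])     # an element of SE(2)

    v = np.array([1.0, 3.0])
    v_hom = np.append(v, 1.0)                               # embed v as (v, 1) in R^3
    w_hom = g @ v_hom
    print(np.allclose(w_hom[:2], R @ v + y))                # True: rotate, then translate
    print(np.isclose(w_hom[2], 1.0))                        # the homogeneous coordinate stays 1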
The above discussions extend to an infinite-dimensional setting in two different
ways. In general relativity, the tensors are replaced by tensor fields and the change-
of-basis matrices are replaced by change-of-coordinates matrices whose entries are
functions of coordinates; these are called diffeomorphisms and they form infinite-
dimensional Lie groups. In quantum mechanics, the tensors over finite-dimensional
vector spaces are replaced by tensors over infinite-dimensional Hilbert spaces and
the change-of-basis matrices are replaced by unitary operators, which may be
regarded as infinite-dimensional matrices (see Example 4.5). In this case the tensor
transformation rules are particularly important as they describe the evolution of the
quantum system. We will have more to say about these in Examples 3.13 and 4.49.
We are now ready to state a slightly modernized version of definition ➀.
Definition 2.5 (tensors via transformation rules). A hypermatrix 𝐴 ∈ R𝑛1 ×···×𝑛𝑑
represents a 𝑑-tensor if it satisfies one of the transformation rules in (2.7)–(2.12)
for some matrix groups in (2.14).
Notice that this only defines what it means for a hypermatrix to represent a
tensor: it does not fix the main drawback of definition ➀, i.e. the lack of an object
to serve as the tensor. Definition ➀ cannot be easily turned into a standalone
definition and is best understood in conjunction with either definition ➁ or ➂; these
transformation rules then become change-of-basis theorems for a multilinear map
or an element in a tensor product of vector spaces. Nevertheless, definition ➀ has
its value: the transformation rules encode the important notion of equivariance that
we will discuss next over Sections 2.3 and 2.4.
There are some variants of the tensor transformation rules that come up in
various areas of physics. We state them below for completeness and as examples
of hypermatrices that do not represent tensors.
Example 2.6 (pseudotensors and tensor densities). For a change-of-basis mat-
rix 𝑋 ∈ GL(𝑛) and a hypermatrix 𝐴 ∈ R𝑛×···×𝑛 , the pseudotensor transformation
rule (Borisenko and Tarapov 1979, Section 3.7.2), the tensor density transforma-
tion rule (Plebański and Krasiński 2006, Section 3.8) and the pseudotensor density
transformation rule are, respectively,
𝐴 ′ = sgn(det 𝑋) (𝑋 −1 , . . . , 𝑋 −1 , 𝑋 T , . . . , 𝑋 T ) · 𝐴,
𝐴 ′ = (det 𝑋) 𝑝 (𝑋 −1 , . . . , 𝑋 −1 , 𝑋 T , . . . , 𝑋 T ) · 𝐴, (2.16)
𝐴 ′ = sgn(det 𝑋)(det 𝑋) 𝑝 (𝑋 −1 , . . . , 𝑋 −1 , 𝑋 T , . . . , 𝑋 T ) · 𝐴.
Evidently they differ from the tensor transformation rule (2.12) by a scalar factor.
The power 𝑝 ∈ Z is called the weight with 𝑝 = 0 giving us (2.12). An alternative
name for a pseudotensor is an ‘axial tensor’ and in this case a true tensor is called
a ‘polar tensor’ (Hartmann 1984, Section 4). Note that the transformation rules in
(2.16) are variations of (2.12). While it is conceivable to define similar variations
of (2.9) with det(𝑋1 · · · 𝑋𝑑 ) = det 𝑋1 · · · det 𝑋𝑑 in place of det 𝑋, we are unable to
find a reference for this and would therefore refrain from stating it formally.
A word about notation. Up to this point we have introduced only one notion that
is not from linear algebra, namely (2.6) for multilinear matrix multiplication; this
notion dates back to at least Hitchcock (1927) and the notation is standard too.
We denote it so that (2.7), (2.8), (2.9) reduce to the standard notation 𝑔 · 𝑥 for a
group element 𝑔 ∈ 𝐺 acting on an element 𝑥 of a 𝐺-set. Here our group 𝐺 is
a product of other groups 𝐺 = 𝐺 1 × · · · × 𝐺 𝑑 , and so an element takes the form
𝑔 = (𝑔1 , . . . , 𝑔𝑑 ). We made a conscious decision to cast everything in this article in
terms of left action so as not to introduce another potential source of confusion. We
could have introduced a right multilinear matrix multiplication of 𝐴 ∈ R𝑛1 ×···×𝑛𝑑
by matrices 𝑋 ∈ R𝑛1 ×𝑚1 , 𝑌 ∈ R𝑛2 ×𝑚2 , . . . , 𝑍 ∈ R𝑛𝑑 ×𝑚𝑑 ; note that the dimensions
of these matrices are now the transposes of those in (2.6),
𝐴 · (𝑋, 𝑌 , . . . , 𝑍) = 𝐵,
with 𝐵 ∈ R𝑚1 ×···×𝑚𝑑 given by
𝑏𝑖1 ···𝑖𝑑 = ∑_{𝑗1 =1}^{𝑛1 } ∑_{𝑗2 =1}^{𝑛2 } · · · ∑_{𝑗𝑑 =1}^{𝑛𝑑 } 𝑎 𝑗1 ··· 𝑗𝑑 𝑥 𝑗1 𝑖1 𝑦 𝑗2 𝑖2 · · · 𝑧 𝑗𝑑 𝑖𝑑 .
This would have allowed us to denote (2.7) and (2.9) without transposes, with the
latter in a two-sided product. Nevertheless, we do not think it is worth the trouble.
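In code, multilinear matrix multiplication is a single einsum call; the sketch below (ours, with arbitrary dimensions and random data) implements the left action (2.6) for 𝑑 = 3 and checks that it reduces to 𝑋 𝐴𝑌 T when 𝑑 = 2.

    import numpy as np

    rng = np.random.default_rng(1)

    def multilinear_left(A, X, Y, Z):
        # (X, Y, Z) . A : b_{ijk} = sum_{pqr} x_{ip} y_{jq} z_{kr} a_{pqr}
        return np.einsum('ip,jq,kr,pqr->ijk', X, Y, Z, A)

    A = rng.standard_normal((2, 3, 4))                      # A in R^{n1 x n2 x n3}
    X = rng.standard_normal((5, 2))                         # X in R^{m1 x n1}, etc.
    Y = rng.standard_normal((6, 3))
    Z = rng.standard_normal((7, 4))
    print(multilinear_left(A, X, Y, Z).shape)               # (5, 6, 7)

    M = rng.standard_normal((3, 4))                         # for d = 2, (X, Y) . M equals X M Y^T
    X2, Y2 = rng.standard_normal((5, 3)), rng.standard_normal((6, 4))
    print(np.allclose(np.einsum('ip,jq,pq->ij', X2, Y2, M), X2 @ M @ Y2.T))   # True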
2.3. Importance of the transformation rules
When we say that a group is a set with an operation satisfying certain axioms, these
axioms play a crucial role and cannot be disregarded. It is the same with tensors.
The importance of the transformation rules in definition ➀ should be abundantly
clear at this point. In general, a formula in terms of the entries of the hypermatrix
is undefined for the tensor since that formula is unlikely to conform to the stringent
transformation rules; this mistake is unfortunately commonplace. Here we will
discuss some erroneous ideas that result from treating a tensor as no more than the
hypermatrix that represents it, oblivious of the transformation rules it must satisfy.
Example 2.7 (‘multiplying higher-order tensors’). There have been many re-
cent attempts at finding a formula for ‘multiplying higher-order tensors’ that pur-
ports to extend the standard matrix–matrix product (2.5). These proposed formu-
las are justified with a demonstration that they are associative, distributive, have
a multiplicative identity, etc., just like matrix multiplication. This is misguided.
We need go no further than 2-tensors to see the problem: both (2.4) and (2.5)
are associative, distributive and have multiplicative identities, but only the formula
in (2.5) defines a product of 2-tensors because it conforms to the transformation
rules (2.2). In fact no other formula aside from the standard matrix–matrix product
defines a product for 2-tensors, a consequence of the first fundamental theorem of
invariant theory; see De Concini and Procesi (2017, Chapter 9) or Goodman and
Wallach (2009, Section 5.3). This theorem also implies that there is no formula
that will yield a product for any odd-ordered tensors. In particular, one may stop
looking for a formula to multiply two 𝑛 × 𝑛 × 𝑛 matrices to get a third 𝑛 × 𝑛 × 𝑛
hypermatrix that defines a product for 3-tensors, because such a formula does not
exist. Incidentally, the transformation rules also apply to adding tensors: we may
not add two matrices representing two tensors of different types because they sat-
isfy different transformation rules; for example, if 𝐴 ∈ R𝑛×𝑛 represents a mixed
2-tensor and 𝐵 ∈ R𝑛×𝑛 a covariant 2-tensor, then 𝐴 + 𝐵 is rarely if ever meaningful.
Example 2.8 (‘identity tensors’). The identity matrix 𝐼 ∈ R3×3 is of course
𝐼 = ∑_{𝑖=1}^{3} 𝑒𝑖 ⊗ 𝑒𝑖 ∈ R3×3 , (2.17)
where 𝑒1 , 𝑒2 , 𝑒3 ∈ R3 are the standard basis vectors. What should be the extension
of this notion to 𝑑-tensors? Take 𝑑 = 3 for illustration. It would appear that
𝐴 = ∑_{𝑖=1}^{3} 𝑒𝑖 ⊗ 𝑒𝑖 ⊗ 𝑒𝑖 ∈ R3×3×3 (2.18)
is the obvious generalization but, as with the Hadamard product, obviousness does
not necessarily imply correctness when it comes to matters tensorial. One needs
to check the transformation rules. Note that (2.17) is independent of the choice
of orthonormal basis: we obtain the same matrix with any orthonormal basis
𝑞 1 , 𝑞 2 , 𝑞 3 ∈ R3 , a consequence of
(𝑄, 𝑄) · 𝐼 = 𝑄𝐼𝑄 T = 𝐼
for any 𝑄 ∈ O(3), that is, the identity matrix is well-defined as a Cartesian tensor.
On the other hand the hypermatrix 𝐴 in (2.18) does not have this property. One
may show that up to scalar multiples, 𝑀 = 𝐼 is the unique matrix satisfying
(𝑄, 𝑄) · 𝑀 = 𝑀 (2.19)
for any 𝑄 ∈ O(3), a property known as isotropic. An isotropic 3-tensor would then
be one that satisfies
(𝑄, 𝑄, 𝑄) · 𝑇 = 𝑇 . (2.20)
Up to scalar multiples, (2.20) has a unique solution given by the hypermatrix
𝐽 = ∑_{𝑖=1}^{3} ∑_{𝑗=1}^{3} ∑_{𝑘=1}^{3} 𝜀 𝑖 𝑗𝑘 𝑒𝑖 ⊗ 𝑒 𝑗 ⊗ 𝑒 𝑘 ∈ R3×3×3 ,
where 𝜀 𝑖 𝑗𝑘 is the Levi-Civita symbol
𝜀 𝑖 𝑗𝑘 = +1 if (𝑖, 𝑗, 𝑘) = (1, 2, 3), (2, 3, 1), (3, 1, 2),
       −1 if (𝑖, 𝑗, 𝑘) = (1, 3, 2), (2, 1, 3), (3, 2, 1),
        0 if 𝑖 = 𝑗, 𝑗 = 𝑘 or 𝑘 = 𝑖,
which is evidently quite different from (2.18). More generally, 𝑛-dimensional
isotropic 𝑑-tensors, i.e. tensors with the same hypermatrix representation regardless
of choice of orthonormal bases in R𝑛 , have been explicitly determined for 𝑛 = 3
and 𝑑 ≤ 8 in Kearsley and Fong (1975), and studied for arbitrary values of 𝑛 and 𝑑
in Weyl (1997). From a tensorial perspective, isotropic tensors, not a ‘hypermatrix
with ones on the diagonal’ like (2.18), extend the notion of an identity matrix.
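A short NumPy sketch (ours; the random rotation is an arbitrary choice) makes the contrast explicit: the Levi-Civita hypermatrix passes the isotropy test under rotations, while the 'diagonal' hypermatrix in (2.18) does not.

    import numpy as np

    rng = np.random.default_rng(2)
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))        # Q is orthogonal
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                                  # flip a column so that Q is in SO(3)

    act = lambda Q, T: np.einsum('ia,jb,kc,abc->ijk', Q, Q, Q, T)   # (Q, Q, Q) . T

    J = np.zeros((3, 3, 3))                                 # Levi-Civita hypermatrix
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        J[i, j, k], J[i, k, j] = 1.0, -1.0
    print(np.allclose(act(Q, J), J))                        # True: unchanged under rotations

    D = np.zeros((3, 3, 3))                                 # the hypermatrix in (2.18)
    D[0, 0, 0] = D[1, 1, 1] = D[2, 2, 2] = 1.0
    print(np.allclose(act(Q, D), D))                        # generally False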
Example 2.9 (hyperdeterminants). The simplest order-3 analogue for the deter-
minant, called the Cayley hyperdeterminant for 𝐴 = [𝑎𝑖 𝑗𝑘 ] ∈ R2×2×2 , is given by
Det(𝐴) = 𝑎000^2 𝑎111^2 + 𝑎001^2 𝑎110^2 + 𝑎010^2 𝑎101^2 + 𝑎011^2 𝑎100^2
− 2(𝑎000 𝑎001 𝑎110 𝑎111 + 𝑎000 𝑎010 𝑎101 𝑎111 + 𝑎000 𝑎011 𝑎100 𝑎111
+ 𝑎001 𝑎010 𝑎101 𝑎110 + 𝑎001 𝑎011 𝑎110 𝑎100 + 𝑎010 𝑎011 𝑎101 𝑎100 )
+ 4(𝑎000 𝑎011 𝑎101 𝑎110 + 𝑎001 𝑎010 𝑎100 𝑎111 ),
which looks nothing like the usual expression for a matrix determinant. Just as the
matrix determinant is preserved under a transformation 𝐴 ′ = (𝑋, 𝑌 ) · 𝐴 = 𝑋 𝐴𝑌 T for
𝑋, 𝑌 ∈ SL(𝑛), this is preserved under a transformation of the form 𝐴 ′ = (𝑋, 𝑌 , 𝑍)· 𝐴
for any 𝑋, 𝑌 , 𝑍 ∈ SL(2). This notion of hyperdeterminant has been extended to
any 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 with
𝑛𝑖 − 1 ≤ ∑_{𝑗≠𝑖} (𝑛 𝑗 − 1), 𝑖 = 1, . . . , 𝑑,
in Gel′fand, Kapranov and Zelevinsky (1992).
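Here is a small NumPy sketch (ours, with arbitrary random data) of the invariance just described: Det is unchanged when 𝐴 is transformed by a triple of matrices in SL(2).

    import numpy as np

    def hyperdet(a):
        # Cayley hyperdeterminant of a 2 x 2 x 2 hypermatrix, term by term as above
        return (a[0,0,0]**2*a[1,1,1]**2 + a[0,0,1]**2*a[1,1,0]**2
                + a[0,1,0]**2*a[1,0,1]**2 + a[0,1,1]**2*a[1,0,0]**2
                - 2*(a[0,0,0]*a[0,0,1]*a[1,1,0]*a[1,1,1] + a[0,0,0]*a[0,1,0]*a[1,0,1]*a[1,1,1]
                     + a[0,0,0]*a[0,1,1]*a[1,0,0]*a[1,1,1] + a[0,0,1]*a[0,1,0]*a[1,0,1]*a[1,1,0]
                     + a[0,0,1]*a[0,1,1]*a[1,1,0]*a[1,0,0] + a[0,1,0]*a[0,1,1]*a[1,0,1]*a[1,0,0])
                + 4*(a[0,0,0]*a[0,1,1]*a[1,0,1]*a[1,1,0] + a[0,0,1]*a[0,1,0]*a[1,0,0]*a[1,1,1]))

    def random_sl2(rng):
        X = rng.standard_normal((2, 2))
        if np.linalg.det(X) < 0:
            X = X[::-1]                                     # swap rows to make det(X) > 0
        return X / np.sqrt(np.linalg.det(X))                # rescale so that det(X) = 1

    rng = np.random.default_rng(3)
    A = rng.standard_normal((2, 2, 2))
    X, Y, Z = (random_sl2(rng) for _ in range(3))
    Ap = np.einsum('ia,jb,kc,abc->ijk', X, Y, Z, A)         # A' = (X, Y, Z) . A
    print(np.isclose(hyperdet(Ap), hyperdet(A)))            # True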
As these three examples reveal, extending any common linear algebraic notion
to higher order will require awareness of the transformation rules. Determining the
higher-order tensorial analogues of notions such as linear equations or eigenvalues
or determinant is not a matter of simply adding more indices to
∑_{𝑗=1}^{𝑛} 𝑎𝑖 𝑗 𝑥 𝑗 = 𝑏𝑖 ,    ∑_{𝑗=1}^{𝑛} 𝑎𝑖 𝑗 𝑥 𝑗 = 𝜆𝑥𝑖 ,    ∑_{𝜎 ∈𝔖𝑛 } sgn(𝜎) ∏_{𝑖=1}^{𝑛} 𝑎𝑖 𝜎(𝑖) .
Of course, one might argue that anyone is free to concoct any formula for
‘tensor multiplication’ or call any hypermatrix an ‘identity tensor’ even if they are
undefined on tensors: there is no right or wrong. But there is. They would be
wrong in the same way adding fractions as 𝑎/𝑏 + 𝑐/𝑑 = (𝑎 + 𝑏)/(𝑐 + 𝑑) is wrong,
or at least far less useful than the standard way of adding fractions. Fractions are
real, and adding half a cake to a third of a cake does not give us two-fifths of a
cake. Likewise tensors are real: we discovered the transformation rules for tensors
just as we discovered the arithmetic rules for adding fractions, that is, we did not
invent them arbitrarily.
While we have limited our examples to mathematical ones, we will end with a
note about the physics perspective. In physics, these transformation rules are just as
if not more important. One does not even have to go to higher-order tensors to see
this: a careful treatment of vectors in physics already requires such an approach.
Example 2.10 (tensor transformation rules in physics). As we saw at the be-
ginning of Section 2, the tensor transformation rules originated from physics as
change-of-coordinates rules for certain physical quantities. Not all physical quantit-
ies satisfy tensor transformation rules; those that do are called ‘tensorial quantities’
or simply ‘tensors’ (Voigt 1898, p. 20). We will take the simplest physical quant-
ity, displacement, i.e. the location 𝑞 of a point particle in space, to illustrate.
Once we select our 𝑥-, 𝑦- and 𝑧-axes, space becomes R3 and 𝑞 gets coordinates
𝑎 = (𝑎1 , 𝑎2 , 𝑎3 ) ∈ R3 with respect to these axes. If we change our axes to a new set
of 𝑥 ′-, 𝑦 ′- and 𝑧 ′ -axes – note that nothing physical has changed, the point 𝑞 is still
where it was, the only difference is in the directions we decided to pick as axes –
the coordinates must therefore change accordingly to some 𝑎 ′ = (𝑎1′ , 𝑎2′ , 𝑎3′ ) ∈ R3
to compensate for this change in the choice of axes (see Figure 2.1). If the matrix
𝑋 ∈ GL(𝑛) transforms the 𝑥-, 𝑦- and 𝑧-axes to the 𝑥 ′-, 𝑦 ′- and 𝑧 ′-axes, then the
coordinates must transform in the opposite way 𝑎 ′ = 𝑋 −1 𝑎 to counteract this change
so that on the whole everything stays the same. This is the transformation rule for
a contravariant 1-tensor on page 18, and we are of course just paraphrasing the
change-of-basis theorem for a vector in a vector space. This same transformation
rule applies to many other physical quantities: any time derivatives of displacement
such as velocity 𝑞, ¤ acceleration 𝑞,
¥ momentum 𝑚 𝑞, ¤ force 𝑚 𝑞.
¥ So the most familiar
physical quantities we encountered in introductory mechanics are contravariant
1-tensors.
Figure 2.1. Linear transformation of coordinate axes.
Indeed, contravariant 1-tensors are how many physicists would regard vectors
(Weinreich 1998): a vector is an object represented by 𝑎 ∈ R𝑛 that satisfies
the transformation rule 𝑎 ′ = 𝑋 −1 𝑎, possibly with the additional requirement that
the change-of-basis matrix 𝑋 be in O(𝑛) (Feynman, Leighton and Sands 1963,
Chapter 11) or 𝑋 ∈ O(𝑝, 𝑞) (Rindler 2006, Chapter 4).
The transformation rules perspective has proved to be very useful in physics. For
example, special relativity is essentially the observation that the laws of physics
are invariant under Lorentz transformations 𝑋 ∈ O(1, 3) (Einstein 2002). In fact, a
study of the contravariant 1-tensor transformation rule under the O(1, 3)-analogue
of Givens rotations,
[ cosh 𝜃   − sinh 𝜃   0   0 ]   [ cosh 𝜃    0   − sinh 𝜃   0 ]   [ cosh 𝜃    0   0   − sinh 𝜃 ]
[ − sinh 𝜃   cosh 𝜃   0   0 ]   [    0      1      0       0 ]   [    0      1   0       0    ]
[    0          0     1   0 ]   [ − sinh 𝜃  0   cosh 𝜃     0 ]   [    0      0   1       0    ]
[    0          0     0   1 ]   [    0      0      0       1 ]   [ − sinh 𝜃  0   0    cosh 𝜃  ]
is enough to derive most standard results of special relativity; see Friedberg, Insel
and Spence (2003, Section 6.9) and Woodhouse (2003, Chapters 4 and 5).
Covariant 1-tensors are also commonplace in physics. If the coordinates of 𝑞
transform as 𝑎 ′ = 𝑋 −1 𝑎, then the coordinates of its derivative 𝜕/𝜕𝑞 or ∇𝑞 would
transform as 𝑎 ′ = 𝑋 T 𝑎. So physical quantities that satisfy the covariant 1-tensor
transformation rule tend to have a contravariant 1-tensor ‘in the denominator’ such
as conjugate momentum 𝑝 = 𝜕𝐿/𝜕 𝑞¤ or electric field 𝐸 = −∇𝜙 − 𝜕 𝐴/𝜕𝑡. In the
former, 𝐿 is the Lagrangian, and we have the velocity 𝑞¤ ‘in the denominator’; in the
latter, 𝜙 and 𝐴, respectively, are the scalar and vector potentials, and the gradient
∇ is a derivative with respect to spatial variables, and thus have displacement 𝑞 ‘in
the denominator’.
More generally, what we said above applies to more complex physical quantities:
when the axes – also called the reference frame in physics – are changed, coordinates
of physical quantities must change in a way that preserves the laws of physics;
for tensorial quantities, this would be (2.7)–(2.12), the ‘definite way’ in Voigt’s
definition on page 11. This is the well-known maxim that the laws of physics
should not depend on coordinates, a version of which the reader will find on
page 41. The reader will also find higher-order examples in Examples 2.4, 3.12
and 4.8.
2.4. Tensors in computations I: equivariance
These transformation rules are no less important in computations. The most
fundamental algorithms in numerical linear algebra are invariably based implicitly
on one of these transformation rules combined with special properties of the change-
of-basis matrices. Further adaptations for numerical stability and efficiency can
sometimes obscure what is going on in these algorithms, but the basic principle is
always to involve one of these transformation rules, or, in modern lingo, to take
advantage of equivariance.
Example 2.11 (equivalence, similarity, congruence). The transformation rules
for 2-tensors have better-known names in linear algebra and numerical linear al-
gebra.
• Equivalence of matrices: 𝐴 ′ = 𝑋 𝐴𝑌 −1 .
• Similarity of matrices: 𝐴 ′ = 𝑋 𝐴𝑋 −1 .
• Congruence of matrices: 𝐴 ′ = 𝑋 𝐴𝑋 T .
The most common choices for 𝑋 and 𝑌 are either orthogonal matrices or invertible
ones (similarity and congruence are identical if 𝑋 is orthogonal). These transform-
ation rules have canonical forms, and this is an important feature in linear algebra
and numerical linear algebra alike, not withstanding the celebrated fact (Golub
and Wilkinson 1976) that some of these cannot be computed in finite precision:
Smith normal form for equivalence, singular value decomposition for orthogonal
equivalence, Jordan and Weyr canonical forms for similarity (O’Meara, Clark and
Vinsonhaler 2011), Turnbull–Aitken and Hodge–Pedoe canonical forms for con-
gruence (De Terán 2016). One point to emphasize is that the transformation rules
determine the canonical form for the tensor: it makes no sense to speak of Jordan
form for a contravariant 2-tensor or a Turnbull–Aitken form for a mixed 2-tensor,
even if both tensors are represented by exactly the same matrix 𝐴 ∈ R𝑛×𝑛 .
At this juncture it is appropriate to highlight a key difference between 2-tensors
and higher-order ones: while 2-tensors have canonical forms, higher-order tensors
in general do not (Landsberg 2012, Chapter 10). This is one of several reasons why
we should not expect an extension of linear algebra or numerical linear algebra to
𝑑-dimensional hypermatrices in a manner that resembles the 𝑑 = 2 versions.
Before we continue, we will highlight two simple properties.
(i) The change-of-basis matrices may be multiplied and inverted: if 𝑋 and 𝑌 are
orthogonal or invertible, then so is 𝑋𝑌 and so is 𝑋 −1 , that is, the set of all
change-of-basis matrices O(𝑛) or GL(𝑛) forms a group.
(ii) The transformation rules may be composed: if we have 𝑎 ′ = 𝑋 −T 𝑎 and
𝑎 ′′ = 𝑌 −T 𝑎 ′ , then 𝑎 ′′ = (𝑌 𝑋)−T 𝑎; if we have 𝐴 ′ = 𝑋 𝐴𝑋 −1 and 𝐴 ′′ = 𝑌 𝐴 ′𝑌 −1 ,
then 𝐴 ′′ = (𝑌 𝑋)𝐴(𝑌 𝑋)−1 , that is, the transformation rule defines a group
action.
These innocuous observations say that to get a matrix 𝐴 into a desired form 𝐵, we
may just work on a ‘small part’ of the matrix 𝐴, e.g. a 2 × 2-submatrix or a fragment
of a column, by applying a transformation that affects that ‘small part’. We then
repeat it on other parts to obtain a sequence of transformations:
𝐴 → 𝑋1 𝐴 → 𝑋2 (𝑋1 𝐴) → · · · → 𝐵,
𝐴 → 𝑋1−T 𝐴 → 𝑋2−T (𝑋1−T 𝐴) → · · · → 𝐵,
𝐴 → 𝑋1 𝐴𝑋1T → 𝑋2 (𝑋1 𝐴𝑋1T )𝑋2T → · · · → 𝐵,
𝐴 → 𝑋1 𝐴𝑋1−1 → 𝑋2 (𝑋1 𝐴𝑋1−1 )𝑋2−1 → · · · → 𝐵,
𝐴 → 𝑋1 𝐴𝑌1−1 → 𝑋2 (𝑋1 𝐴𝑌1−1 )𝑌2−1 → · · · → 𝐵, (2.21)
and piece all change-of-basis matrices together to get the required 𝑋 as either
𝑋𝑚 𝑋𝑚−1 . . . 𝑋1 or its limit as 𝑚 → ∞ (likewise for 𝑌 ). Algorithms for computing
standard matrix decompositions such as LU, QR, EVD, SVD, Cholesky, Schur, etc.,
all involve applying a sequence of such transformation rules (Golub and Van Loan
2013). In numerical linear algebra, if 𝑚 is finite, i.e. 𝑋 may be obtained in finitely
many steps (in exact arithmetic), then the algorithm is called direct, whereas if it
requires 𝑚 → ∞, i.e. 𝑋 may only be approximated with a limiting process, then
it is called iterative. Furthermore, in numerical linear algebra, one tends to see
such transformations as giving a matrix decomposition, which may then be used to
solve other problems involving 𝐴. This is sometimes called ‘the decompositional
approach to matrix computation’ (Stewart 2000).
Designers of algorithms for matrix computations, even if they were not explicitly
aware of these transformation rules and properties, were certainly observing them
implicitly. For instance, it is rare to find algorithms that mix different transformation
rules for different types of tensors, since what is incompatible for tensors tends to
lead to meaningless results. Also, since eigenvalues are defined for mixed 2-tensors
but not for contravariant or covariant 2-tensors, the transformation 𝐴 ′ = 𝑋 𝐴𝑋 −1
is pervasive in algorithms for eigenvalue decomposition but we rarely if ever find
𝐴 ′ = 𝑋 𝐴𝑋 T (unless of course 𝑋 is orthogonal).
In numerical linear algebra, the use of the transformation rules in (2.21) goes
hand in hand with a salient property of the group of change-of-basis matrices.
Example 2.12 (Givens rotations, Householder reflectors, Gauss transforms).
Recall that these are defined by
1 ··· 0 ··· 0 ··· 0
. .. .. .. 
. ..
. . . . .
 
0 ··· cos 𝜃 · · · − sin 𝜃 · · · 0
 ..  ∈ SO(𝑛),
𝐺 =  ... ..
.
..
.
..
. .
 
0 ··· sin 𝜃 · · · cos 𝜃 ··· 0
. .. .. .. .. 
 .. . . . .

0
··· 0 ··· 0 ··· 1

2𝑣𝑣 T

𝐻 = 𝐼 − T ∈ O(𝑛), 𝑀 = 𝐼 − 𝑣𝑒T𝑖 ∈ GL(𝑛),


𝑣 𝑣
with the property that the Givens rotation 𝑎 ′ = 𝐺𝑎 is a rotation of 𝑎 in the (𝑖, 𝑗)-
plane by an angle 𝜃, the Householder reflection 𝑎 ′ = 𝐻𝑎 is the reflection of 𝑎 in the
hyperplane with normal 𝑣/k𝑣k and, for a judiciously chosen 𝑣, the Gauss transform
𝑎 ′ = 𝑀𝑎 is in span{𝑒1 , . . . , 𝑒𝑖 }, that is, it has (𝑖 + 1)th to 𝑛th coordinates zero.8
Standard algorithms in numerical linear algebra such as Gaussian elimination for
LU decomposition and Cholesky decomposition, Givens and Householder QR for
QR decomposition, Francis’s QR or Rutishauser’s LR algorithms for EVD, Golub–
8 If 𝑣 = −𝛼𝑒 𝑗 , then 𝑀 𝐴 adds an 𝛼 multiple of the 𝑖th row to the 𝑗th row of 𝐴; so the Gauss transform
includes the elementary matrices that perform this operation.
Kahan bidiagonalization for SVD, etc., all rely in part on applying a sequence of
transformation rules as in (2.21) with one of these matrices playing the role of the
change-of-basis matrices. The reason this is possible is that:
• any 𝑋 ∈ SO(𝑛) is a product of Givens rotations,
• any 𝑋 ∈ O(𝑛) is a product of Householder reflectors,
• any 𝑋 ∈ GL(𝑛) is a product of elementary matrices,
• any unit lower triangular 𝑋 ∈ GL(𝑛) is a product of Gauss transforms.
In group-theoretic lingo, these matrices are generators of the respective matrix Lie
groups; in the last case, the set of all unit lower triangular matrices, i.e. ones on the
diagonal, is also a subgroup of GL(𝑛).
Whether one seeks to solve a system of linear equations or find a least-squares
solution or compute eigenvalue or singular value decompositions, the basic under-
lying principle in numerical linear algebra is to transform the problem in such a
way that the solution of the transformed problem is related to the original solution
in a definite way; note that this is practically a paraphrase of Voigt’s definition of a
tensor on page 11. Any attempt to give a comprehensive list of examples will simply
result in our reproducing a large fraction of Golub and Van Loan (2013), so we will
just give a familiar example viewed through the lens of tensor transformation rules.
Example 2.13 (full-rank least-squares). As we saw in Section 2.1, the least-
squares problem (2.3) satisfies the transformation rule of a mixed 2-tensor 𝐴 ′ =
𝑋 𝐴𝑌 −1 with change-of-basis matrices (𝑋, 𝑌 ) ∈ O(𝑚) × GL(𝑛). Suppose rank(𝐴) =
𝑛. Then, applying a sequence of covariant 1-tensor transformation rules
𝐴 → 𝑄 T1 𝐴 → 𝑄 T2 (𝑄 T1 𝐴) → · · · → 𝑄 T 𝐴 = [ 𝑅 ; 0 ]
given by the Householder QR algorithm, we get
𝐴 = 𝑄 [ 𝑅 ; 0 ] .
As the minimum value is an invariant Cartesian 0-tensor,
min ‖𝐴𝑣 − 𝑏‖ 2 = min ‖𝑄 T (𝐴𝑣 − 𝑏)‖ 2 = min ‖ [ 𝑅 ; 0 ] 𝑣 − 𝑄 T 𝑏 ‖ 2
              = min ‖ [ 𝑅 ; 0 ] 𝑣 − [ 𝑐 ; 𝑑 ] ‖ 2 = min (‖𝑅𝑣 − 𝑐‖ 2 + ‖𝑑‖ 2 ) = ‖𝑑‖ 2 ,
where we have written 𝑄 T 𝑏 = [ 𝑐 ; 𝑑 ] .
In this case the solution of the transformed problem 𝑅𝑣 = 𝑐 is in fact equal to that of
the least-squares problem, and may be obtained through back-substitution, that is,
another sequence of contravariant 1-tensor transformation rules
𝑐 → 𝑌1−1 𝑐 → 𝑌2−1 (𝑌1−1 𝑐) → · · · → 𝑅 −1 𝑐 = 𝑣,
where the 𝑌𝑖 are Gauss transforms. As noted above, the solution method reflects
Voigt’s definition: we transform the problem mink 𝐴𝑣 − 𝑏k 2 into a form where the
solution of the transformed problem 𝑅𝑣 = 𝑐 is related to the original solution in a
definite way. Here we obtained the change-of-basis matrices 𝑋 = 𝑄 ∈ O(𝑚) and
𝑌 = 𝑅 −1 ∈ GL(𝑛) via Householder QR and back-substitution respectively.
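The whole computation fits in a few lines of NumPy (a sketch of ours; np.linalg.qr and np.linalg.solve stand in for Householder QR and back-substitution, and the random 𝐴, 𝑏 are arbitrary).

    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 8, 3
    A, b = rng.standard_normal((m, n)), rng.standard_normal(m)   # rank(A) = n generically

    Q, R = np.linalg.qr(A, mode='complete')                 # A = Q [R; 0] with Q in O(m)
    Qb = Q.T @ b
    c, d = Qb[:n], Qb[n:]                                   # Q^T b = [c; d]
    v = np.linalg.solve(R[:n], c)                           # solve R v = c (back-substitution)

    print(np.allclose(v, np.linalg.lstsq(A, b, rcond=None)[0]))        # the least-squares solution
    print(np.isclose(np.linalg.norm(A @ v - b), np.linalg.norm(d)))    # minimum value is ||d||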
As we just saw, there are distinguished choices for the change-of-basis matrices
that aid the solution of a numerical linear algebra problem. We will mention one
of the most useful choices below.
Example 2.14 (Krylov subspaces). Suppose we have 𝐴 ∈ R𝑛×𝑛 , with all eigen-
values distinct and non-zero for simplicity. Take an arbitrary 𝑏 ∈ R𝑛 . Then the
matrix 𝐾 whose columns are Krylov basis vectors of the form
𝑏, 𝐴𝑏, 𝐴2 𝑏, . . . , 𝐴𝑛−1 𝑏
is invertible, i.e. 𝐾 ∈ GL(𝑛), and using it as the change-of-basis matrix gives us
        [ 0  0  · · ·  0  −𝑐0   ]
        [ 1  0  · · ·  0  −𝑐1   ]
𝐴 = 𝐾   [ 0  1  · · ·  0  −𝑐2   ]  𝐾 −1 .        (2.22)
        [ ⋮  ⋮    ⋱    ⋮   ⋮    ]
        [ 0  0  · · ·  1  −𝑐𝑛−1 ]
This is a special case of the rational canonical form, a canonical form under
similarity. The seemingly trivial observation (2.22), when combined with other
techniques, becomes a powerful iterative method for a wide variety of computa-
tional tasks such as solving linear systems, least-squares, eigenvalue problems or
evaluating various matrix functions (van der Vorst 2000). Readers unfamiliar with
numerical linear algebra may find it odd that we do not use another obvious canon-
ical form, one that makes the aforementioned problems trivial to solve, namely, the
eigenvalue decomposition
        [ 𝜆1  0   0   · · ·  0  ]
        [ 0   𝜆2  0   · · ·  0  ]
𝐴 = 𝑋   [ 0   0   𝜆3  · · ·  0  ]  𝑋 −1 ,        (2.23)
        [ ⋮   ⋮   ⋮     ⋱    ⋮  ]
        [ 0   0   0   · · ·  𝜆𝑛 ]
where the change-of-basis matrix 𝑋 ∈ GL(𝑛) has columns given by the eigenvectors
of 𝐴. The issue is that this is more difficult to compute than (2.22). In fact, to
compute it, one way is to implicitly exploit (2.22) and the relation between the two
canonical forms:
[ 𝜆1  0   0   · · ·  0  ]       [ 0  0  · · ·  0  −𝑐0   ]
[ 0   𝜆2  0   · · ·  0  ]       [ 1  0  · · ·  0  −𝑐1   ]
[ 0   0   𝜆3  · · ·  0  ]  = 𝑉  [ 0  1  · · ·  0  −𝑐2   ]  𝑉 −1 ,
[ ⋮   ⋮   ⋮     ⋱    ⋮  ]       [ ⋮  ⋮    ⋱    ⋮   ⋮    ]
[ 0   0   0   · · ·  𝜆𝑛 ]       [ 0  0  · · ·  1  −𝑐𝑛−1 ]
where
     [ 1  𝜆1  𝜆1^2  . . .  𝜆1^{𝑛−1} ]
     [ 1  𝜆2  𝜆2^2  . . .  𝜆2^{𝑛−1} ]
𝑉 =  [ 1  𝜆3  𝜆3^2  . . .  𝜆3^{𝑛−1} ] .        (2.24)
     [ ⋮  ⋮    ⋮      ⋱      ⋮      ]
     [ 1  𝜆𝑛  𝜆𝑛^2  . . .  𝜆𝑛^{𝑛−1} ]
The pertinent point here is that the approach of solving a problem by finding an
appropriate computational basis is also an instance of the 2-tensor transformation
rule. Aside from the Krylov basis, there are simpler examples such as diagonalizing
a circulant matrix
 𝑐0 𝑐 𝑛−1 . . . 𝑐2 𝑐1 
 
 𝑐1 𝑐0 𝑐 𝑛−1 𝑐2 
 . .. .. 
 
𝐶 =  .. 𝑐1 𝑐0 . . 
 .. .. 
𝑐 . . 𝑐 𝑛−1 
 𝑛−2
𝑐 𝑐0 
 𝑛−1 𝑐 𝑛−2 . . . 𝑐1
by expressing it in the Fourier basis, i.e. with 𝜆 𝑘𝑗 = e2( 𝑗−1)𝑘 𝜋i/𝑛 in (2.24), and
there are far more complex examples such as the wavelet approach to the fast
multipole method of Beylkin, Coifman and Rokhlin (1991). This computes, in 𝑂(𝑛)
operations and to arbitrary accuracy, a matrix–vector product 𝑣 ↦→ 𝐴𝑣 with a dense
𝐴 ∈ R𝑛×𝑛 that is a finite-dimensional approximation of certain special integral
transforms. The algorithm iteratively computes a special wavelet basis 𝑋 ∈ GL(𝑛)
so that ultimately 𝑋 −1 𝐴𝑋 = 𝐵 gives a banded matrix 𝐵 where 𝑣 ↦→ 𝐵𝑣 can be
computed in time 𝑂(𝑛 log(1/𝜀)) to 𝜀-accuracy, and where 𝑣 ↦→ 𝑋𝑣, 𝑣 ↦→ 𝑋 −1 𝑣 are
both computable in time 𝑂(𝑛). One may build upon this algorithm to obtain an
𝑂(𝑛 log2 𝑛 log 𝜅(𝐴)) algorithm for the pseudoinverse (Beylkin 1993, Section 6) and
potentially also to compute other matrix functions such as square root, exponential,
sine and cosine, etc. (Beylkin, Coifman and Rokhlin 1992, Section X). We will
discuss some aspects of this basis in Example 4.47, which is constructed as the
tensor product of multiresolution analyses.
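The observation (2.22) itself is easy to verify numerically; the sketch below (ours, with an arbitrary generic 𝐴 and 𝑏) builds the Krylov basis and checks that 𝐾 −1 𝐴𝐾 is in companion form, with the coefficients of the characteristic polynomial in the last column.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 4
    A, b = rng.standard_normal((n, n)), rng.standard_normal(n)

    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(n)])   # Krylov basis
    C = np.linalg.solve(K, A @ K)                           # C = K^{-1} A K
    print(np.allclose(C[1:, :-1], np.eye(n - 1)))           # identity block below the first row
    print(np.allclose(C[0, :-1], 0))                        # zeros in the first row
    # last column holds -c_0, ..., -c_{n-1}, where det(xI - A) = x^n + c_{n-1} x^{n-1} + ... + c_0
    print(np.allclose(C[:, -1], -np.poly(A)[1:][::-1]))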
Our intention is to highlight the role of the tensor transformation rules in numer-
ical linear algebra but we do not wish to overstate it. These rules are an important
component of various algorithms but almost never the only one. Furthermore,
far more work is required to guarantee correctness in finite-precision arithmetic
(Higham 2002), although the tensor transformation rules can sometimes help with
that too, as we will see next.

Example 2.15 (affine invariance of Newton’s method). The discussion below
is essentially the standard treatment in Boyd and Vandenberghe (2004), supple-
mented by commentary pointing out the tensor transformation rules. Consider the
equality-constrained convex optimization problem
minimize 𝑓 (𝑣)
subject to 𝐴𝑣 = 𝑏        (2.25)
for a strongly convex 𝑓 ∈ 𝐶 2 (Ω) with 𝛽𝐼 ⪯ ∇2 𝑓 (𝑣) ⪯ 𝛾𝐼. The Newton step
Δ𝑣 ∈ R𝑛 is defined as the solution to
[ ∇2 𝑓 (𝑣)  𝐴T ; 𝐴  0 ] [ Δ𝑣 ; Δ𝜆 ] = [ −∇ 𝑓 (𝑣) ; 0 ]
and the Newton decrement 𝜆(𝑣) ∈ R is defined as
𝜆(𝑣)2 ≔ ∇ 𝑓 (𝑣)T ∇2 𝑓 (𝑣)−1 ∇ 𝑓 (𝑣).
Let 𝑋 ∈ GL(𝑛) and suppose we perform a linear change of coordinates 𝑋𝑣 ′ = 𝑣.
Then the Newton step in these new coordinates is given by
[ 𝑋 T ∇2 𝑓 (𝑋𝑣 ′)𝑋   𝑋 T 𝐴T ; 𝐴𝑋   0 ] [ Δ𝑣 ′ ; Δ𝜆 ′ ] = [ −𝑋 T ∇ 𝑓 (𝑋𝑣 ′) ; 0 ] .
We may check that 𝑋Δ𝑣 ′ = Δ𝑣 and thus the iterates are related by 𝑋𝑣 ′𝑘 = 𝑣 𝑘 for
all 𝑘 ∈ N as long as we initialize with 𝑋𝑣 0′ = 𝑣 0 (Boyd and Vandenberghe 2004,
Section 10.2.1). Note that steepest descent satisfies no such property no matter
which 1-tensor transformation rule we use: 𝑣 ′ = 𝑋𝑣, 𝑣 ′ = 𝑋 −1 𝑣, 𝑣 ′ = 𝑋 T 𝑣, or 𝑣 ′ =
𝑋 −T 𝑣. We also have that 𝜆(𝑋𝑣)2 = 𝜆(𝑣)2 , which is used in the stopping condition
of Newton’s method; thus the iterations stop at the same point 𝑋𝑣 ′𝑘 = 𝑣 𝑘 when
𝜆(𝑣 ′𝑘 )2 = 𝜆(𝑣 𝑘 )2 ≤ 2𝜀 for a given 𝜀 > 0. In summary, if we write 𝑔(𝑣 ′) = 𝑓 (𝑋𝑣 ′),
then we have the following relations:
coordinates contravariant 1-tensor 𝑣 ′ = 𝑋 −1 𝑣,
gradient covariant 1-tensor ∇𝑔(𝑣 ′ ) = 𝑋 T ∇ 𝑓 (𝑋𝑣 ′ ),
Hessian covariant 2-tensor ∇2 𝑔(𝑣 ′ ) = 𝑋 T ∇2 𝑓 (𝑋𝑣 ′ )𝑋,
Newton step contravariant 1-tensor Δ𝑣 ′ = 𝑋 −1 Δ𝑣,
Newton iterate contravariant 1-tensor 𝑣 ′𝑘 = 𝑋 −1 𝑣 𝑘 ,
Newton decrement invariant 0-tensor 𝜆(𝑣 ′𝑘 ) = 𝜆(𝑣 𝑘 ).
Strictly speaking, the gradient and Hessian are tensor fields, and we will explain
the difference in Example 3.12. We may extend the above discussion with the
general affine group GA(𝑛) in (2.15) in place of GL(𝑛) to incorporate translation
by a vector.
To see the implications on computations, it suffices to take, for simplicity, an
unconstrained problem with 𝐴 = 0 and 𝑏 = 0 in (2.25). Observe that the condition
number of 𝑋 T ∇2 𝑓 (𝑋𝑣)𝑋 can be scaled to any desired value in [1, ∞) with an appro-
priate 𝑋 ∈ GL(𝑛). In other words, any Newton step is independent of the condition
number of ∇2 𝑓 (𝑣) in exact arithmetic. In finite precision arithmetic, this manifests
itself as insensitivity to condition number. Newton’s method gives solutions of high
accuracy for condition number as high as 𝜅 ≈ 10¹⁰ when steepest descent already
stops working at 𝜅 ≈ 20 (Boyd and Vandenberghe 2004, Section 9.5.4). Strictly
speaking, 𝜅 refers to the condition number of sublevel sets {𝑣 ∈ Ω : 𝑓 (𝑣) ≤ 𝛼}.
For any convex set 𝐶 ⊆ R𝑛 , this is given by
𝜅(𝐶) ≔ sup_{‖𝑣‖=1} 𝑤(𝐶, 𝑣)2 / inf_{‖𝑣‖=1} 𝑤(𝐶, 𝑣)2 ,        𝑤(𝐶, 𝑣) ≔ sup_{𝑢 ∈𝐶} 𝑢T 𝑣 − inf_{𝑢 ∈𝐶} 𝑢T 𝑣.
However, if 𝛽𝐼 ⪯ ∇2 𝑓 (𝑣) ⪯ 𝛾𝐼, then 𝜅({𝑣 ∈ Ω : 𝑓 (𝑣) ≤ 𝛼}) ≤ 𝛾/𝛽 (Boyd and
Vandenberghe 2004, Section 9.1.2), and so it is ultimately controlled by 𝜅(∇2 𝑓 (𝑣)).
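The following NumPy sketch (ours; the objective 𝑓 (𝑣) = ∑𝑖 exp(𝑎T𝑖 𝑣) and all data are arbitrary choices) checks the unconstrained version of these relations for a single Newton step.

    import numpy as np

    rng = np.random.default_rng(6)
    m, n = 6, 3
    A_ = rng.standard_normal((m, n))                        # f(v) = sum_i exp(a_i^T v)
    grad = lambda v: A_.T @ np.exp(A_ @ v)
    hess = lambda v: A_.T @ (np.exp(A_ @ v)[:, None] * A_)

    X = rng.standard_normal((n, n))                         # change of coordinates X v' = v
    g_grad = lambda w: X.T @ grad(X @ w)                    # covariant 1-tensor
    g_hess = lambda w: X.T @ hess(X @ w) @ X                # covariant 2-tensor

    v0 = rng.standard_normal(n)
    w0 = np.linalg.solve(X, v0)                             # v0' = X^{-1} v0
    dv = -np.linalg.solve(hess(v0), grad(v0))               # Newton step for f at v0
    dw = -np.linalg.solve(g_hess(w0), g_grad(w0))           # Newton step for g at v0'
    print(np.allclose(X @ dw, dv))                          # contravariant: X dv' = dv
    lam2 = lambda gr, he, v: gr(v) @ np.linalg.solve(he(v), gr(v))
    print(np.isclose(lam2(grad, hess, v0), lam2(g_grad, g_hess, w0)))   # invariant 0-tensor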
Our last example shows that the tensor transformation rules are as important in
the information sciences as they are in the physical sciences.
Example 2.16 (equivariant neural networks). A feed-forward neural network
is usually regarded as a function 𝑓 : R𝑛 → R𝑛 obtained by alternately composing
affine maps 𝛼𝑖 : R𝑛 → R𝑛 , 𝑖 = 1, . . . , 𝑘, with a non-linear function 𝜎 : R𝑛 → R𝑛 :
𝑓 = 𝛼𝑘 ◦ 𝜎 ◦ 𝛼𝑘−1 ◦ · · · ◦ 𝜎 ◦ 𝛼2 ◦ 𝜎 ◦ 𝛼1 ,
R𝑛 −−𝛼1−→ R𝑛 −−𝜎−→ R𝑛 −−𝛼2−→ R𝑛 −−𝜎−→ · · · −−𝜎−→ R𝑛 −−𝛼𝑘−→ R𝑛 .
The depth, also known as the number of layers, is 𝑘 and the width, also known
as the number of neurons, is 𝑛. We assume that our neural network has constant
width throughout all layers. The non-linear function 𝜎 is called activation and we
may assume that it is given by the ReLU function 𝜎(𝑡) ≔ max(𝑡, 0) for 𝑡 ∈ R and,
by convention,
𝜎(𝑣) = (𝜎(𝑣 1 ), . . . , 𝜎(𝑣 𝑛 )), 𝑣 = (𝑣 1 , . . . , 𝑣 𝑛 ) ∈ R𝑛 . (2.26)
The affine function is defined by 𝛼𝑖 (𝑣) = 𝐴𝑖 𝑣 + 𝑏𝑖 for some 𝐴𝑖 ∈ R𝑛×𝑛 called the
weight matrix and some 𝑏𝑖 ∈ R𝑛 called the bias vector. We assume that 𝑏 𝑘 = 0 in
the last layer.
Although convenient, it is somewhat misguided to be lumping the bias and weight
together in an affine function. The biases 𝑏𝑖 are intended to serve as thresholds
for the activation function 𝜎 (Bishop 2006, Section 4.1.7) and should be part of
it, detached from the weights 𝐴𝑖 that transform the input. If one would like to
incorporate translations, one could do so with weights from a matrix group such
as SE(𝑛) in (2.15). Hence a better but mathematically equivalent description of 𝑓
would be as
𝑓 = 𝐴𝑘 𝜎𝑏𝑘−1 𝐴𝑘−1 · · · 𝜎𝑏2 𝐴2 𝜎𝑏1 𝐴1 ,
R𝑛 −−𝐴1−→ R𝑛 −−𝜎𝑏1−→ R𝑛 −−𝐴2−→ R𝑛 −−𝜎𝑏2−→ · · · −−𝜎𝑏𝑘−1−→ R𝑛 −−𝐴𝑘−→ R𝑛 ,
where we identify 𝐴𝑖 ∈ R𝑛×𝑛 with the linear operator R𝑛 → R𝑛 , 𝑣 ↦→ 𝐴𝑖 𝑣, and
for any 𝑏 ∈ R𝑛 we define 𝜎𝑏 : R𝑛 → R𝑛 by 𝜎𝑏 (𝑣) = 𝜎(𝑣 + 𝑏) ∈ R𝑛 . We have
dropped the composition symbol ◦ to avoid clutter and potential confusion with the
Hadamard product (2.4), and will continue to do so below. For a fixed 𝜃 ∈ R,
𝜎𝜃 (𝑡) = 𝑡 − 𝜃 if 𝑡 ≥ 𝜃, and 𝜎𝜃 (𝑡) = 0 if 𝑡 < 𝜃,
plays the role of a threshold for activation as was intended by Rosenblatt (1958,
p. 392) and McCulloch and Pitts (1943, p. 120).
A major computational issue with neural networks is the large number of un-
known parameters, namely the 𝑘𝑛2 + 𝑘(𝑛 − 1) entries of the weights and biases, that
have to be fitted with data, especially for deep neural networks where 𝑘 is large.
Thus successful applications of neural networks require that we identify, based on
the problem at hand, an appropriate low-dimensional subset of R𝑛×𝑛 from which we
will find our weights 𝐴1 , . . . , 𝐴 𝑘 . For instance, the very successful convolutional
neural networks for image recognition (Krizhevsky, Sutskever and Hinton 2012)
relies on restricting 𝐴1 , . . . , 𝐴 𝑘 to some block-Toeplitz–Toeplitz-block or BTTB
matrices (Ye and Lim 2018a, Section 13) determined by a very small number of
parameters. It turns out that convolutional neural networks are a quintessential ex-
ample of equivariant neural networks (Cohen and Welling 2016), and in fact every
equivariant neural network may be regarded as a generalized convolutional neural
network in an appropriate sense (Kondor and Trivedi 2018). We will describe a
simplified version that captures its essence and illustrates the tensor transformation
rules.
Let 𝐺 ⊆ R𝑛×𝑛 be a matrix group. A function 𝑓 : R𝑛 → R𝑛 is said to be
equivariant if it satisfies the condition that
𝑓 (𝑋𝑣) = 𝑋 𝑓 (𝑣) for all 𝑣 ∈ R𝑛 , 𝑋 ∈ 𝐺. (2.27)
An equivariant neural network is simply a feed-forward neural network 𝑓 : R𝑛 →
R𝑛 that satisfies (2.27). The key to constructing an equivariant neural network is
the trivial observation that
𝑓 (𝑋𝑣) = 𝐴 𝑘 𝜎𝑏𝑘−1 𝐴 𝑘−1 · · · 𝜎𝑏2 𝐴2 𝜎𝑏1 𝐴1 𝑋𝑣
= 𝑋(𝑋 −1 𝐴 𝑘 𝑋)(𝑋 −1 𝜎𝑏𝑘−1 𝑋)(𝑋 −1 𝐴 𝑘−1 𝑋) · · ·
· · · (𝑋 −1 𝜎𝑏2 𝑋)(𝑋 −1 𝐴2 𝑋)(𝑋 −1 𝜎𝑏1 𝑋)(𝑋 −1 𝐴1 𝑋)𝑣
= 𝑋 𝐴′𝑘 𝜎 ′𝑏𝑘−1 𝐴′𝑘−1 · · · 𝜎 ′𝑏2 𝐴2′ 𝜎 ′𝑏1 𝐴1′ 𝑣
and the last expression equals 𝑋 𝑓 (𝑣) as long as we have
𝐴𝑖′ = 𝑋 −1 𝐴𝑖 𝑋 = 𝐴𝑖 , 𝜎𝑏′ 𝑖 = 𝑋 −1 𝜎𝑏𝑖 𝑋 = 𝜎𝑏𝑖 (2.28)
for all 𝑖 = 1, . . . , 𝑘, and for all 𝑋 ∈ 𝐺. The condition on the right is satisfied by
any activation as long as it is applied coordinate-wise (Cohen and Welling 2016,
Section 6.2) as we did in (2.26). We will illustrate this below. The condition on the
left, which says that 𝐴𝑖 is invariant under the mixed 2-tensor transformation rule
on page 19, is what is important. The condition limits the possible weights for 𝑓
to a (generally) much smaller subspace of matrices that commute with all elements
of 𝐺. However, finding this subspace (in fact a subalgebra) of intertwiners,
{𝐴 ∈ R𝑛×𝑛 : 𝐴𝑋 = 𝑋 𝐴 for all 𝑋 ∈ 𝐺}, (2.29)
is usually where the challenge lies, although it is also a classical problem with
numerous powerful tools from group representation theory (Bröcker and tom Dieck
1995, Fulton and Harris 1991, Mackey 1978).
There are many natural candidates for the group 𝐺 from the list in (2.14) and
their subgroups, depending on the problem at hand. But we caution the reader that
𝐺 will generally be a very low-dimensional subset of R𝑛×𝑛 . It will be pointless to
pick, say, 𝐺 = SO(𝑛) as the set in (2.29) will then be just {𝜆𝐼 ∈ R𝑛×𝑛 : 𝜆 ∈ R},
clearly too small to serve as meaningful weights for any neural network. Indeed,
𝐺 is usually chosen to be an image of a much lower-dimensional group 𝐻,
𝐺 = 𝜌(𝐻) = {𝜌(ℎ) ∈ GL(𝑛) : ℎ ∈ 𝐻}
for some group homomorphism 𝜌 : 𝐻 → GL(𝑛); here 𝜌 is called a representation
of the group 𝐻. In image recognition applications, possibilities for 𝐻 are often
discrete subgroups of SE(2) or E(2) such as the group of translations
𝐻 = { [ 1  0  𝑚1 ; 0  1  𝑚2 ; 0  0  1 ] ∈ R3×3 : 𝑚1 , 𝑚2 ∈ Z }
for convolutional neural networks; the 𝑝4 group that augments translations with
right-angle rotations,
𝐻 = { [ cos(𝑘 𝜋/2)  − sin(𝑘 𝜋/2)  𝑚1 ; sin(𝑘 𝜋/2)  cos(𝑘 𝜋/2)  𝑚2 ; 0  0  1 ] ∈ R3×3 : 𝑘 = 0, 1, 2, 3, 𝑚1 , 𝑚2 ∈ Z }
in Cohen and Welling (2016, Section 4.2); the 𝑝4𝑚 group that further augments
𝑝4 with reflections,
𝐻 = { [ (−1) 𝑗 cos(𝑘 𝜋/2)  (−1) 𝑗+1 sin(𝑘 𝜋/2)  𝑚1 ; sin(𝑘 𝜋/2)  cos(𝑘 𝜋/2)  𝑚2 ; 0  0  1 ] ∈ R3×3 : 𝑘 = 0, 1, 2, 3, 𝑗 = 0, 1, 𝑚1 , 𝑚2 ∈ Z }
in Cohen and Welling (2016, Section 4.3). Other possibilities for 𝐻 include the
rotation group SO(3) for 3D shape recognition (Kondor et al. 2018), the rigid
motion group SE(3) for chemical property (Fuchs et al. 2020) and protein structure
(see page 6) predictions and the Lorentz group SO(1, 3) for identifying top quarks
in high-energy physics experiments (Bogatskiy et al. 2020), etc.
When people speak of SO(3)- or SE(3)- or Lorentz-equivariant neural networks,
they are referring to the group 𝐻 and not 𝐺 = 𝜌(𝐻). A key step of these works
is the construction of an appropriate representation 𝜌 for the problem at hand, or,
equivalently, constructing a linear action of 𝐻 on R𝑛 . In these applications R𝑛
should be regarded as the set of real-valued functions on a set 𝑆 of cardinality
𝑛, a perspective that we will introduce in Example 4.5. For concreteness, take
an image recognition problem on a collection of 60 000 28 × 28-pixel images of
handwritten digits in greyscale levels 0, 1, . . . , 255 (Deng 2012). Then 𝑆 ⊆ Z2 is
the set of 𝑛 = 282 = 784 pixel indices and an image is encoded as 𝑣 ∈ R784 whose
coordinates take values from 0 (pitch black) to 255 (pure white). Note that this is
why the first three 𝐻 above are discrete: the elements ℎ ∈ 𝐻 act on the pixel indices
𝑆 ⊆ Z2 instead of R2 . These 60 000 images are then used to fit the neural network
𝑓 , i.e. to find the parameters 𝐴1 , . . . , 𝐴 𝑘 ∈ R784×784 and 𝑏1 , . . . , 𝑏 𝑘−1 ∈ R784 .
We end this example with a note on why a non-linear (and not even multilinear)
function like ReLU can satisfy the covariant 2-tensor transformation rule, which
sounds incredible but is actually obvious once explained. Take a greyscale image
𝑣 ∈ R𝑛 drawn with a black outline (greyscale value 0) but filled with varying shades
of grey (greyscale values 1, . . . , 255) and consider the activation
𝜎(𝑡) = 255 if 𝑡 > 0, and 𝜎(𝑡) = 0 if 𝑡 ≤ 0, for 𝑡 ∈ R,
so that applying 𝜎 to the image 𝑣 produces an image 𝜎(𝑣) ∈ R𝑛 with all shadings
removed, leaving just the black outline. Now take a 45◦ rotation matrix 𝑅 ∈ SO(2)
and let 𝑋 = 𝜌(𝑅) ∈ GL(𝑛) be the corresponding matrix that rotates any image 𝑣 by
45◦ to 𝑋𝑣:
(figure: a commutative square of images with maps 𝜎, 𝑋, 𝜎, 𝑋 −1 )
The bottom line is that 𝑅 acts on the indices of 𝑣 whereas 𝜎 acts on the values of
𝑣 and the two actions are always independent, which is why 𝜎 ′ = 𝑋 −1 𝜎 𝑋 = 𝜎.
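A tiny NumPy sketch (ours) illustrates both points for a toy representation, with the cyclic shift on R𝑛 standing in for a pixel permutation 𝜌(ℎ): a coordinate-wise activation commutes with the permutation, and the intertwiners (2.29) of the cyclic group are the circulant matrices.

    import numpy as np

    n = 6
    S = np.roll(np.eye(n), 1, axis=0)                       # X = rho(h): cyclic shift, a permutation matrix
    relu = lambda v: np.maximum(v, 0)

    rng = np.random.default_rng(7)
    v = rng.standard_normal(n)
    print(np.allclose(relu(S @ v), S @ relu(v)))            # sigma' = X^{-1} sigma X = sigma

    c = rng.standard_normal(n)
    C = np.column_stack([np.roll(c, k) for k in range(n)])  # a circulant matrix, C = sum_k c_k S^k
    print(np.allclose(C @ S, S @ C))                        # C commutes with S: a valid equivariant weight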
Our definition of equivariance in (2.27) is actually a simplification but suffices
for our purposes. We state a formal definition here for completeness. A function
𝑓 : V → W is said to be 𝐻-equivariant if there are two homomorphisms 𝜌1 : 𝐻 →
GL(V) and 𝜌2 : 𝐻 → GL(W) so that
𝑓 (𝜌1 (ℎ)𝑣) = 𝜌2 (ℎ) 𝑓 (𝑣) for all 𝑣 ∈ V, ℎ ∈ 𝐻.
In (2.27), we chose V = W = R𝑛 and 𝜌1 = 𝜌2 . The condition 𝑓 (𝑋𝑣) = 𝑋 𝑓 (𝑣)
could have been replaced by other tensor or pseudotensor transformation rules,
say, 𝑓 (𝑋𝑣) = det(𝑋)𝑋 −T 𝑓 (𝑣), as long as they are given by homomorphisms. In
fact, with slight modifications we may even allow for antihomomorphisms, i.e.
𝜌(ℎ1 ℎ2 ) = 𝜌(ℎ2 )𝜌(ℎ1 ).
There are many other areas in computations where the tensor transformation
rules make an appearance. For example, both the primal and dual forms of a cone
programming problem over a symmetric cone K ⊆ V – which include as special
cases linear programming (LP), convex quadratic programming (QP), second-order
cone programming (SOCP) and semidefinite programming (SDP) – conform to the
transformation rules for Cartesian 0-, 1- and 2-tensors. However, the change-of-
basis matrices would have to be replaced by a linear map from the orthogonal
group of the cone (Hauser and Lim 2002):
O(K) ≔ {𝜑 : V → V : 𝜑 linear, invertible and 𝜑∗ = 𝜑−1 }.
Here the vector space V involved may not be R𝑛 – for SDP, K = S𝑛++ and V = S𝑛 , the
space of 𝑛×𝑛 symmetric matrices – making the linear algebra notation we have been
using to describe definition ➀ awkward and unnatural. Instead we should make
provision to work with tensors over arbitrary vector spaces, for example the space
of Toeplitz or Hankel or Toeplitz-plus-Hankel matrices, the space of polynomials
or differential forms or differential operators, or, in the case of equivariant neural
networks, the space of 𝐿 2 -functions on homogeneous spaces (Cohen and Welling
2016, Kondor and Trivedi 2018). This serves as another motivation for definitions ➁
and ➂.
2.5. Fallacy of ‘tensor = hypermatrix’
We conclude our discussion of definition ➀ with a few words about the fallacy of
identifying a tensor with the hypermatrix that represents it. While we have alluded
to this from time to time over the last few sections, there are three points that we
have largely deferred until now.
Firstly, the ‘multi-indexed object’ in definition ➀ is not necessarily a hypermatrix;
they could be polynomials or exterior forms or non-commutative polynomials. For
example, it is certainly more fruitful to represent symmetric tensors as polynomials
and alternating tensors as exterior forms. For one, we may take derivatives and
integrals or evaluate them at a point – operations that become awkward if we simply
regard them as hypermatrices with symmetric or skew-symmetric entries.
Secondly, if one subscribes to this fallacy, then one would tend to miss important
tensors hiding in plain sight. The object of essence in each of the following
examples is a 3-tensor, but one sees no triply indexed quantities anywhere.
(i) Multiplication of complex numbers:
(𝑎 + 𝑖𝑏)(𝑐 + 𝑖𝑑) = (𝑎𝑐 − 𝑏𝑑) + 𝑖(𝑏𝑐 + 𝑎𝑑).
(ii) Matrix–matrix products:
[ 𝑎11  𝑎12 ; 𝑎21  𝑎22 ] [ 𝑏11  𝑏12 ; 𝑏21  𝑏22 ] = [ 𝑎11 𝑏11 + 𝑎12 𝑏21   𝑎11 𝑏12 + 𝑎12 𝑏22 ; 𝑎21 𝑏11 + 𝑎22 𝑏21   𝑎21 𝑏12 + 𝑎22 𝑏22 ] .
(iii) Grothendieck’s inequality:
max_{‖𝑥𝑖 ‖=‖𝑦 𝑗 ‖=1} ∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑛} 𝑎𝑖 𝑗 ⟨𝑥𝑖 , 𝑦 𝑗 ⟩ ≤ 𝐾G max_{|𝜀𝑖 |=|𝛿 𝑗 |=1} ∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑛} 𝑎𝑖 𝑗 𝜀𝑖 𝛿 𝑗 .
(iv) Separable representations:
sin(𝑥 + 𝑦 + 𝑧) = sin(𝑥) cos(𝑦) cos(𝑧) + cos(𝑥) cos(𝑦) sin(𝑧)
+ cos(𝑥) sin(𝑦) cos(𝑧) − sin(𝑥) sin(𝑦) sin(𝑧).
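For instance, the 3-tensor behind item (i) can be made explicit in a few lines of NumPy (a sketch of ours): complex multiplication is the bilinear map whose coordinate hypermatrix with respect to the basis {1, i} of R2 is 2 × 2 × 2.

    import numpy as np

    T = np.zeros((2, 2, 2))
    T[0, 0, 0], T[0, 1, 1] = 1.0, -1.0                      # real part: ac - bd
    T[1, 0, 1], T[1, 1, 0] = 1.0, 1.0                       # imaginary part: ad + bc
    mult = lambda u, v: np.einsum('kij,i,j->k', T, u, v)    # the bilinear map R^2 x R^2 -> R^2

    u, v = np.array([2.0, 3.0]), np.array([-1.0, 4.0])      # 2 + 3i and -1 + 4i
    z = complex(*u) * complex(*v)
    print(np.allclose(mult(u, v), [z.real, z.imag]))        # True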
Thirdly, the value of representing a 𝑑-tensor as a 𝑑-dimensional hypermatrix is
often overrated.
(a) It may not be meaningful. For tensors in, say, 𝐿 2 (R) ⊗ 𝐿 2 (R) ⊗ 𝐿 2 (R), where
the ‘indices’ are continuous, the hypermatrix picture is not meaningful.
(b) It may not be computable. Given a bilinear operator, writing down its entries
as a three-dimensional hypermatrix with respect to bases is in general a #P-
hard problem, as we will see in Example 3.5.
(c) It may not be useful. In applications such as tensor networks, 𝑑 is likely large.
While there may be some value in picturing a three-dimensional hypermatrix,
it is no longer the case when 𝑑 ≫ 3.
(d) It may not be possible. There are applications such as Examples 3.8 and
3.14 that involve tensors defined over modules (vector spaces whose field of
scalars is replaced by a ring). Unlike vector spaces, modules may not have
bases and such tensors cannot be represented as hypermatrices.
By far the biggest issue with thinking that a tensor is just a hypermatrix is that such
an overly simplistic view disregards the transformation rules, which is the key to
definition ➀ and whose importance we have already discussed at length. As we
saw, [1 2 3; 2 3 4; 3 4 5] may represent a 1-tensor or a 2-tensor, it may represent a covariant or
contravariant or mixed tensor, or it may not represent a tensor at all. What matters in
definition ➀ is the transformation rules, not the multi-indexed object. If one thinks
that a fraction is just a pair of numbers, adding them as 𝑎/𝑏 + 𝑐/𝑑 = (𝑎 + 𝑐)/(𝑏 + 𝑑)
is going to seem perfectly fine. It is the same with tensors.
3. Tensors via multilinearity
The main deficiency of definition ➀, namely that it leaves the tensor unspecified, is
immediately remedied by definition ➁, which states unequivocally that the tensor
is a multilinear map. The multi-indexed object in definition ➀ is then a coordinate
representation of the multilinear map with respect to a choice of bases and the
transformation rules become a change-of-basis theorem. By the 1980s, definition ➁
had by and large supplanted definition ➀ as the standard definition of a tensor in
textbooks on mathematical methods for physicists (Abraham et al. 1988, Choquet-
Bruhat, DeWitt-Morette and Dillard-Bleick 1982, Hassani 1999, Martin 1991),
geometry (Boothby 1986, Helgason 1978), general relativity (Misner et al. 1973,
Wald 1984) and gauge theory (Bleecker 1981), although it dates back to at least
Temple (1960).
Definition ➁ appeals to physicists for two reasons. The first is the well-known
maxim that the laws of physics do not depend on coordinates, and thus the tensors
used to express these laws should not depend on coordinates either. This has been
articulated in various forms in many places, but we quote a modern version that
appears as the very first sentence in Kip Thorne’s lecture notes (now a book).
Geometric principle. The laws of physics must all be expressible as geometric
(coordinate-independent and reference-frame-independent) relationships between
geometric objects (scalars, vectors, tensors, . . . ) that represent physical entities
(Thorne and Blandford 2017).
Most well-known equations in physics express a relation between tensors. For
example, Newton’s second law 𝐹 = 𝑚𝑎 is a relation between two contravariant
1-tensors – force 𝐹 and acceleration 𝑎 – and Einstein’s field equation 𝐺 + Λ𝑔 = 𝜅𝑇
is a relation between three covariant 2-tensors – the Einstein tensor 𝐺, the metric
tensor 𝑔 and the energy–momentum tensor 𝑇 , with constants Λ and 𝜅. Since
such laws of physics should not depend on coordinates, a tensor ought to be a
coordinate-free object, and definition ➁ meets this requirement: whether or not a
map 𝑓 : V1 × · · · × V𝑑 → W is multilinear does not depend on a choice of bases
on V1 , . . . , V𝑑 , W.
For the same reason, we expect laws of physics expressed in terms of the Hada-
mard product to be rare if not non-existent, precisely because the Hadamard product
is coordinate-dependent, as we saw at the end of Section 2.3. Take two linear maps
𝑓 , 𝑔 : V → V: it will not be possible to define the Hadamard product of 𝑓 and 𝑔
without choosing some basis on V and the value of the product will depend on that
chosen basis. So one advantage of using a coordinate-free definition of tensors
such as definitions ➁ or ➂ is that it automatically avoids pitfalls such as performing
coordinate-dependent operations that are undefined on tensors.
The second reason for favouring definition ➁ in physics is that the notion of
multilinearity is central to the subject. We highlight another two maxims.
Linearity principle. Almost any natural process is linear in small amounts almost
everywhere (Kostrikin and Manin 1997, p. vii).
Multilinearity principle. If we keep all but one factor constant, the varying factor
obeys the linearity principle.
It is probably fair to say that the world around us is understandable largely
because of a combination of these two principles. The universal importance of
linearity needs no elaboration; in this section we will discuss the importance of
multilinearity, particularly in computations.
In mathematics, definition ➁ appeals because it is the simplest among the three
definitions: a multilinear map is trivial to define for anyone who knows about linear
maps. More importantly, definition ➁ allows us to define tensors over arbitrary
vector spaces.9 It is a misconception that in applications there is no need to discuss
more general vector spaces because one can always get by with just two of them,
R𝑛 and C𝑛 . The fact of the matter is that other vector spaces often carry structures
that are destroyed when one artificially identifies them with R𝑛 or C𝑛 . Just drawing
on the examples we have mentioned in Section 2, semidefinite programming and
equivariant neural networks already indicate why it is not a good idea to identify
the vector space of symmetric 𝑛 × 𝑛 matrices with R𝑛(𝑛+1)/2 or an 𝑚-dimensional
subspace of real-valued functions 𝑓 : Z2 → R with R𝑚 . We will elaborate on these
later in the context of definition ➁. Nevertheless, we will also see that definition ➁
has one deficiency that will serve as an impetus for definition ➂.
3.1. Multilinear maps
Again we will begin from tensors of orders zero, one and two, drawing on familiar
concepts from linear algebra – vector spaces, linear maps, dual vector spaces –
before moving into less familiar territory with order-three tensors followed by the
most general order-𝑑 version. In the following we let U, V, W be vector spaces
over a field of scalars, assumed to be R for simplicity but it may also be replaced
by C or indeed any ring such as Z or Z/𝑛Z, and we will see that such cases are also
important in computations and applications.
Definition ➁ assumes that we are given at least one vector space V. A tensor of
order zero is simply defined to be a scalar, i.e. an element of the field R. A tensor
of order one may either be a vector, i.e. an element of the vector space 𝑣 ∈ V, or a
covector, i.e. an element of a dual vector space 𝜑 ∈ V∗ . A covector, also called a
dual vector or linear functional, is a map 𝜑 : V → R that satisfies
𝜑(𝜆𝑣 + 𝜆 ′ 𝑣 ′) = 𝜆𝜑(𝑣) + 𝜆 ′ 𝜑(𝑣 ′) (3.1)
for all 𝑣, 𝑣 ′ ∈ V, 𝜆, 𝜆 ′ ∈ R. To distinguish between these two types of tensors,
9 Note that we do not mean abstract vector spaces. There are many concrete vector spaces aside
from R𝑛 and C𝑛 that are useful in applications and computations. Definitions ➁ and ➂ allow us
to define tensors over these.
a vector is called a tensor of contravariant order one and a covector a tensor of


covariant order one. To see how these relate to definition ➀, we pick a basis
ℬ = {𝑣 1 , . . . , 𝑣 𝑚 } of V. This gives us a dual basis ℬ∗ = {𝑣 ∗1 , . . . , 𝑣 ∗𝑚 } on V∗ ,
that is, 𝑣 ∗𝑖 : V → R is a linear functional satisfying
𝑣 ∗𝑖 (𝑣 𝑗 ) = 1 if 𝑖 = 𝑗, and 0 if 𝑖 ≠ 𝑗,        𝑗 = 1, . . . , 𝑚.
Any vector 𝑣 in V may be uniquely represented by a vector in R𝑚 ,
V ∋ 𝑣 = 𝑎1 𝑣 1 + · · · + 𝑎 𝑚 𝑣 𝑚 ←→ [𝑣] ℬ ≔ (𝑎1 , . . . , 𝑎 𝑚 ) ∈ R𝑚 , (3.2)
and any covector 𝜑 in V∗ may also be uniquely represented by a vector in R𝑚 ,
V∗ ∋ 𝜑 = 𝑏1 𝑣 ∗1 + · · · + 𝑏 𝑚 𝑣 ∗𝑚 ←→ [𝜑] ℬ∗ ≔ (𝑏1 , . . . , 𝑏 𝑚 ) ∈ R𝑚 .
The vector [𝑣] ℬ ∈ R𝑚 is the coordinate representation of 𝑣 with respect to the
basis ℬ; likewise [𝜑] ℬ∗ ∈ R𝑚 is the coordinate representation of 𝜑 with respect to
ℬ∗ . Here [𝑣] ℬ and [𝜑] ℬ∗ are the multi-indexed (in this case single-indexed) object
in definition ➀ but, as we have emphasized repeatedly, the multi-indexed object is
not the crux of the definition. For example, take 𝑚 = 3 and (3, −1, 2) ∈ R3 : which
vector does it represent? The answer is that it can represent any vector except the
zero vector in V or any covector except the zero covector in V∗ , that is, given any
non-zero 𝑣 ∈ V, there exists a basis {𝑣 1 , 𝑣 2 , 𝑣 3 } such that 𝑣 = 3𝑣 1 − 𝑣 2 + 2𝑣 3 ; ditto
for non-zero covectors. Knowing that the multi-indexed object is (3, −1, 2) tells
us absolutely nothing except that it is non-zero. One needs to know (i) the basis
and (ii) the transformation rules, which are given by the change-of-basis theorem,
stated below for easy reference.
Recall that if 𝒞 = {𝑣 1′ , . . . , 𝑣 ′𝑚 } is another basis of V and 𝑋 ∈ R𝑚×𝑚 is such that
𝑣 ′𝑗 = ∑_{𝑖=1}^{𝑚} 𝑥𝑖 𝑗 𝑣 𝑖 , 𝑗 = 1, . . . , 𝑚, (3.3)
then we must have 𝑋 ∈ GL(𝑚) and
[𝑣] 𝒞 = 𝑋 −1 [𝑣] ℬ , [𝜑] 𝒞∗ = 𝑋 T [𝜑] ℬ∗ (3.4)
for any 𝑣 ∈ V and any 𝜑 ∈ V∗ . More precisely, if
𝑎1 𝑣 1 + · · · + 𝑎 𝑚 𝑣 𝑚 = 𝑎1′ 𝑣 1′ + · · · + 𝑎 ′𝑚 𝑣 ′𝑚 ,
𝑏1 𝑣 ∗1 + · · · + 𝑏 𝑚 𝑣 ∗𝑚 = 𝑏1′ 𝑣 1′∗ + · · · + 𝑏 ′𝑚 𝑣 ′∗𝑚 ,
then the vectors of coefficients 𝑎, 𝑎 ′ , 𝑏, 𝑏 ′ ∈ R𝑚 must satisfy
𝑎 ′ = 𝑋 −1 𝑎, 𝑏 ′ = 𝑋 T 𝑏.
We recover the transformation rules for contravariant and covariant 1-tensors,
respectively, in the middle column of the table on page 18, which we know are
simply the change-of-basis theorems for vectors and covectors in linear algebra
(Berberian 2014, Friedberg et al. 2003). It is now clear why we have been calling
the matrix 𝑋 a change-of-basis matrix when discussing the transformation rules in
Section 2.
The most familiar 2-tensor is a linear operator,10 i.e. a map Φ : U → V that sat-
isfies (3.1). Let dim U = 𝑛 and dim V = 𝑚. Then, for any bases 𝒜 = {𝑢1 , . . . , 𝑢 𝑛 }
on U and ℬ = {𝑣 1 , . . . , 𝑣 𝑚 } on V, the linear operator Φ has a matrix representation
with respect to these bases:
[Φ] 𝒜,ℬ = 𝐴 ∈ R𝑚×𝑛 , (3.5)
where the entries 𝑎𝑖 𝑗 are the coefficients in
Õ
𝑚
Φ(𝑢 𝑗 ) = 𝑎𝑖 𝑗 𝑣 𝑖 , 𝑗 = 1, . . . , 𝑛.
𝑖=1

Let 𝒜′ be another basis on U and ℬ′ another basis on V. Let


[Φ] 𝒜′ ,ℬ′ = 𝐴 ′ ∈ R𝑚×𝑛 .
The change-of-basis theorem for linear operators tells us that if 𝑋 ∈ GL(𝑚) is a
change-of-basis matrix on V, i.e. defined according to (3.3), and likewise 𝑌 ∈ GL(𝑛)
is a change-of-basis matrix on U, then
𝐴 ′ = 𝑋 −1 𝐴𝑌 .
For the special case U = V with the same bases 𝒜 = ℬ and 𝒜 ′ = ℬ′ , we obtain
𝐴 ′ = 𝑋 −1 𝐴𝑋.
Thus we have recovered the transformation rules for mixed 2-tensors in the middle
column of the tables on pages 18 and 19.
The next most common 2-tensor is a bilinear functional, that is, a map 𝛽 : U ×
V → R that satisfies
𝛽(𝜆𝑢 + 𝜆 ′𝑢 ′, 𝑣) = 𝜆𝛽(𝑢, 𝑣) + 𝜆 ′ 𝛽(𝑢 ′ , 𝑣),
(3.6)
𝛽(𝑢, 𝜆𝑣 + 𝜆 ′ 𝑣 ′) = 𝜆𝛽(𝑢, 𝑣) + 𝜆 ′ 𝛽(𝑢, 𝑣 ′)
for all 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, 𝜆, 𝜆 ′ ∈ R. In this case the matrix representation of 𝛽

10 Following convention, linear and multilinear maps are called functionals if they are scalar-valued,
i.e. the codomain is R, and operator if they are vector-valued, i.e. the codomain is a vector space
of arbitrary dimension.
Tensors in computations 45

with respect to bases 𝒜 and ℬ is particularly easy to describe:


[𝛽] 𝒜,ℬ = 𝐴 ∈ R𝑚×𝑛
is simply given by
𝑎𝑖 𝑗 = 𝛽(𝑢𝑖 , 𝑣 𝑗 ), 𝑖 = 1, . . . , 𝑚, 𝑗 = 1, . . . , 𝑛.
We have the corresponding change-of-basis theorem for bilinear forms: if
[𝛽] 𝒜′ ,ℬ′ = 𝐴 ′ ∈ R𝑚×𝑛 ,
then
𝐴 ′ = 𝑋 T 𝐴𝑌 ,
and when U = V, 𝒜 = ℬ, 𝒜 ′ = ℬ′, it reduces to
𝐴 ′ = 𝑋 T 𝐴𝑋.
We have recovered the transformation rules for covariant 2-tensors in the middle
column of the tables on pages 18 and 19.
Everything up to this point is more or less standard in linear algebra (Berberian
2014, Friedberg et al. 2003), summarized here for easy reference and to fix nota-
tion. The goal is to show that, at least in these familiar cases, the transformation
rules in definition ➀ are the change-of-basis theorems of the multilinear maps in
definition ➁.
What about contravariant 2-tensors? A straightforward way is simply to say that
they are bilinear functionals on dual spaces 𝛽 : U∗ × V∗ → R and we will recover
the transformation rules
𝐴 ′ = 𝑋 −1 𝐴𝑌 −T and 𝐴 ′ = 𝑋 −1 𝐴𝑋 −T
via similar change-of-basis arguments. We would like to revive an older alternative,
useful in engineering (Chou and Pagano 1992, Tai 1997) and physics (Morse and
Feshbach 1953, Part I, Section 1.6), which has become somewhat obscure. Given
vector spaces U and V, we define the dyadic product by
𝑛 Õ
Õ 𝑚
(𝑎1 𝑢1 + · · · + 𝑎 𝑛 𝑢 𝑛 )(𝑏1 𝑣 1 + · · · + 𝑏 𝑚 𝑣 𝑚 ) ≔ 𝑎𝑖 𝑏 𝑗 𝑢𝑖 𝑣 𝑗 .
𝑖=1 𝑗=1

Here the juxtaposed 𝑢𝑖 𝑣 𝑗 is not given any further interpretation, and is simply taken
to be an element (called a dyad) in a new vector space of dimension 𝑚𝑛 called the
dyadic product of U and V. While this appears to be a basis-dependent notion,
it is actually not, and dyadics satisfy a contravariant 2-tensor transformation rule.
This is a precursor to definition ➂; we will see that once we insert a ⊗ between
the vectors to get 𝑢𝑖 ⊗ 𝑣 𝑗 , the dyadic product of U and V is just the tensor product
U ⊗ V that we will discuss in Section 4.
Before moving to higher-order tensors, we highlight another impetus for defin-
ition ➂. Note that there are many other maps that are also 2-tensors according to
46 L.-H. Lim

definition ➁. Aside from Φ : U → V, we have other linear operators


Φ1 : U∗ → V, Φ2 : U → V∗ , Φ3 : U∗ → V∗ , (3.7)
and, aside from 𝛽 : U × V → R, other bilinear functionals
𝛽1 : U∗ × V → R, 𝛽2 : U × V∗ → R, 𝛽3 : U∗ × V∗ → R, (3.8)
among other more esoteric maps we would not go into here. We begin to see an
inkling of why definition ➁, while certainly the simplest definition of a tensor, may
not be the best one. After all, according to definition ➀, there ought to be only three
types of 2-tensors: covariant, contravariant and mixed. If we choose bases on U
and V, we will find that a matrix representing the bilinear functional 𝛽 : U × V → R
and a matrix representing the linear operator Φ2 : U → V∗ satisfy exactly the same
transformation rules, namely that of a covariant 2-tensor 𝐴 ′ = 𝑋 T 𝐴𝑌 . One benefit
of definition ➀ is that it classifies tensors according to their transformation rules,
but this is lost in definition ➁: there are many more types of multilinear maps than
there are types of tensors. For example, the linear and bilinear maps above that are
constructed out of two vector spaces U and V may be neatly categorized into three
different types of 2-tensors:
covariant 2-tensor Φ2 : U → V∗ , 𝛽: U × V → R,
contravariant 2-tensor Φ1 : U∗ → V, 𝛽3 : U∗ × V∗ → R,
mixed 2-tensor Φ: U → V, 𝛽2 : U × V∗ → R,
Φ3 : U∗ → V∗ , 𝛽1 : U∗ × V → R.
Any two maps in the same category satisfy the same change-of-basis theorem,
that is, the matrices that represent them satisfy the same tensor transformation rule.
The number of possibilities grows as the order of the tensor increases. A 3-tensor
is most commonly a bilinear operator, i.e. a map B : U × V → W that satisfies the
rules in (3.6) but is vector-valued. The next most common 3-tensor is a trilinear
functional, i.e. a map 𝜏 : U × V × W → R that satisfies
𝜏(𝜆𝑢 + 𝜆 ′𝑢 ′, 𝑣, 𝑤) = 𝜆𝜏(𝑢, 𝑣, 𝑤) + 𝜆 ′ 𝜏(𝑢 ′, 𝑣, 𝑤),
𝜏(𝑢, 𝜆𝑣 + 𝜆 ′ 𝑣 ′, 𝑤) = 𝜆𝜏(𝑢, 𝑣, 𝑤) + 𝜆 ′ 𝜏(𝑢, 𝑣 ′ , 𝑤), (3.9)
𝜏(𝑢, 𝑣, 𝜆𝑤 + 𝜆 ′ 𝑤 ′) = 𝜆𝜏(𝑢, 𝑣, 𝑤) + 𝜆 ′ 𝜏(𝑢, 𝑣, 𝑤 ′)
for all 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, 𝑤, 𝑤 ′ ∈ W, 𝜆, 𝜆 ′ ∈ R. However, there are also many
other possibilities, including bilinear operators
B : U∗ × V → W, B : U × V∗ → W, . . . , B : U∗ × V∗ → W∗ (3.10)
and trilinear functionals
𝜏 : U∗ × V × W → R, 𝜏 : U × V∗ × W → R, . . . , 𝜏 : U∗ × V∗ × W∗ → R, (3.11)
among yet more complicated maps. For example, if L(U; V) denotes the vector
space of all linear maps from U to V, then operator-valued linear operators or linear
Tensors in computations 47

operators on operator spaces such as


Φ1 : U → L(V; W), Φ2 : L(U; V) → W (3.12)
or bilinear functionals such as
𝛽1 : U × L(V; W) → R, 𝛽2 : L(U; V) × W → R (3.13)
are also 3-tensors. We may substitute any of U, V, W with U∗ , V∗ , W∗ in (3.12)
and (3.13) to obtain yet other 3-tensors.
Organizing this myriad of multilinear maps, which increases exponentially with
order, is the main reason why we would ultimately want to adopt definition ➂, so
that we may classify the many different types of multilinear maps into a smaller
number of tensors of different types. For the treatments that rely on definition ➁ – a
formal version of which will appear later as Definition 3.1 – such as Abraham et al.
(1988), Bleecker (1981), Boothby (1986), Choquet-Bruhat et al. (1982), Hassani
(1999), Helgason (1978), Martin (1991), Misner et al. (1973) and Wald (1984),
this myriad of possibilities for a 𝑑-tensor is avoided by requiring that the codomain
of the multilinear maps be R or C, that is, only multilinear functionals are allowed;
we will see why this approach makes for an awkward definition of tensors.
The change-of-basis theorems for order-3 tensors are no longer standard in linear
algebra, but it is easy to extrapolate from the standard change-of-basis theorems
for vectors, linear functionals, linear operators and bilinear functionals we have
discussed above. Let U, V, W be vector spaces over R of finite dimensions 𝑚, 𝑛, 𝑝,
and let
𝒜 = {𝑢1 , . . . , 𝑢 𝑚 }, ℬ = {𝑣 1 , . . . , 𝑣 𝑛 }, 𝒞 = {𝑤 1 , . . . , 𝑤 𝑝 } (3.14)
be arbitrary bases. We limit ourselves to bilinear operators B : U × V → W and
trilinear functionals 𝜏 : U × V × W → R, as other cases are easy to deduce from
these. Take any vectors 𝑢 ∈ U, 𝑣 ∈ V, 𝑤 ∈ W and express them in terms of linear
combinations of basis vectors
Õ 𝑚 Õ
𝑛 Õ𝑝
𝑢= 𝑎𝑖 𝑢𝑖 , 𝑣 = 𝑏 𝑗𝑣 𝑗, 𝑤 = 𝑐𝑘 𝑤 𝑘 .
𝑖=1 𝑗=1 𝑘=1

Since 𝜏 is trilinear, we get


Õ 𝑝
𝑛 Õ
𝑚 Õ
𝜏(𝑢, 𝑣, 𝑤) = 𝑎𝑖 𝑏 𝑗 𝑐 𝑘 𝜏(𝑢𝑖 , 𝑣 𝑗 , 𝑤 𝑘 ),
𝑖=1 𝑗=1 𝑘=1

that is, 𝜏 is completely determined by its values on 𝒜 × ℬ × 𝒞 in (3.14). So if we


let
𝑎𝑖 𝑗𝑘 ≔ 𝜏(𝑢𝑖 , 𝑣 𝑗 , 𝑤 𝑘 ), 𝑖 = 1, . . . , 𝑚, 𝑗 = 1, . . . , 𝑛, 𝑘 = 1, . . . , 𝑝,
then the hypermatrix representation of 𝜏 with respect to bases 𝒜, ℬ and 𝒞 is
[𝜏] 𝒜,ℬ,𝒞 = 𝐴 ∈ R𝑚×𝑛× 𝑝 . (3.15)
48 L.-H. Lim

Note that this practically mirrors the discussion for the bilinear functional case on
page 45 and can be easily extended to any multilinear functional 𝜑 : V1 ×· · ·×V𝑑 →
R to get a hypermatrix representation 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 with respect to any bases ℬ𝑖
on V𝑖 , 𝑖 = 1, . . . , 𝑑.
The argument for the bilinear operator B requires an additional step. By bilin-
earity, we get
Õ 𝑛 Õ
𝑚 Õ 𝑝
B(𝑢, 𝑣) = 𝑎𝑖 𝑏 𝑗 𝑐 𝑘 B(𝑢𝑖 , 𝑣 𝑗 ),
𝑖=1 𝑗=1 𝑘=1

and since B(𝑢𝑖 , 𝑣 𝑗 ) ∈ W, we may express it as a linear combination


𝑝
Õ
B(𝑢𝑖 , 𝑣 𝑗 ) = 𝑎𝑖 𝑗𝑘 𝑤 𝑘 , 𝑖 = 1, . . . , 𝑚, 𝑗 = 1, . . . , 𝑛, 𝑘 = 1, . . . , 𝑝. (3.16)
𝑖=1

By the fact that 𝒞 is a basis, the above equation uniquely defines the values 𝑎𝑖 𝑗𝑘
and the hypermatrix representation of B with respect to bases 𝒜, ℬ and 𝒞 is
[B] 𝒜,ℬ,𝒞 = 𝐴 ∈ R𝑚×𝑛× 𝑝 . (3.17)
The hypermatrices in (3.15) and (3.17), even if they are identical, represent
different types of 3-tensors. Let 𝐴 ′ = [𝑎𝑖′ 𝑗𝑘 ] ∈ R𝑚×𝑛× 𝑝 be a hypermatrix repres-
entation of 𝜏 with respect to bases 𝒜 ′, ℬ′, 𝒞 ′ on U, V, W, i.e. 𝑎𝑖′ 𝑗𝑘 = 𝜏(𝑢𝑖′ , 𝑣 ′𝑗 , 𝑤 ′𝑘 ).
Then it is straightforward to deduce that
𝐴 ′ = (𝑋 T , 𝑌 T , 𝑍 T ) · 𝐴
with change-of-basis matrices 𝑋 ∈ GL(𝑚), 𝑌 ∈ GL(𝑛), 𝑍 ∈ GL(𝑝) similarly
defined as in (3.3). Hence, by (2.7), trilinear functionals are covariant 3-tensors.
On the other hand, had 𝐴 ′ ∈ R𝑚×𝑛× 𝑝 been a hypermatrix representation of B, then
𝐴 ′ = (𝑋 T , 𝑌 T , 𝑍 −1 ) · 𝐴.
Hence, by (2.9), bilinear operators are mixed 3-tensors of covariant order 2 and
contravariant order 1. The extension to other types of 3-tensors in (3.10)–(3.13)
and to arbitrary 𝑑-tensors may be carried out similarly.
For completeness, we state a formal definition of multilinear maps, if only to
serve as a glossary of terms and notation and as a pretext for interesting examples.
Definition 3.1 (tensors via multilinearity). Let V1 , . . . , V𝑑 and W be real vector
spaces. A multilinear map, or more precisely a 𝑑-linear map, is a map Φ : V1 ×
· · · × V𝑑 → W that satisfies
Φ(𝑣, . . . , 𝜆𝑣 𝑘 + 𝜆 ′ 𝑣 ′𝑘 , . . . , 𝑣 𝑑 ) = 𝜆Φ(𝑣 1 , . . . , 𝑣 𝑘 , . . . , 𝑣 𝑑 ) + 𝜆 ′Φ(𝑣 1 , . . . , 𝑣 ′𝑘 , . . . , 𝑣 𝑑 )
(3.18)
for all 𝑣 1 ∈ V1 , . . . , 𝑣 𝑘 , 𝑣 ′𝑘 ∈ V 𝑘 , . . . , 𝑣 𝑑 ∈ V𝑑 , 𝜆, 𝜆 ′ ∈ R, and all 𝑘 = 1, . . . , 𝑑.
The set of all such maps will be denoted M𝑑 (V1 , . . . , V𝑑 ; W).
Tensors in computations 49

Note that M𝑑 (V1 , . . . , V𝑑 ; W) is itself a vector space: a linear combination of


two 𝑘-linear maps is again a 𝑘-linear map; in particular M1 (V; W) = L(V; W). If
the vector spaces V1 , . . . , V𝑑 and W are endowed with norms, then
kΦ(𝑣 1 , . . . , 𝑣 𝑑 )k
kΦk 𝜎 ≔ sup = sup kΦ(𝑣 1 , . . . , 𝑣 𝑑 )k (3.19)
𝑣1 ,...,𝑣𝑑 ≠0 k𝑣 1 k · · · k𝑣 𝑑 k k 𝑣1 k=···= k 𝑣𝑑 k=1

defines a norm on M𝑑 (V1 , . . . , V𝑑 ; W) that we will call a spectral norm; we have


slightly abused notation by using the same k · k to denote norms on different spaces.
In Section 3.3, we will see various examples of why (higher-order) multilinear
maps are useful in computations. In part to provide necessary background for those
examples, we will review their role as higher derivatives in multivariate calculus.
Example 3.2 (higher-order derivatives). Let V and W be norm spaces and let
Ω ⊆ V be an open subset. For any function 𝐹 : Ω → W, recall that the (total)
derivative at 𝑣 ∈ Ω is a linear operator 𝐷𝐹(𝑣) : V → W that satisfies
k𝐹(𝑣 + ℎ) − 𝐹(𝑣) − [𝐷𝐹(𝑣)](ℎ)k
lim = 0, (3.20)
ℎ→0 kℎk
or, if there is no such linear operator, then 𝐹 is not differentiable at 𝑣. The definition
may be recursively applied to obtain derivatives of arbitrary order, assuming that
they exist on Ω: since 𝐷𝐹(𝑣) ∈ L(V; W), we may apply the same definition to
𝐷𝐹 : Ω → L(V; W) and get 𝐷 2 𝐹(𝑣) : V → L(V; W) as 𝐷(𝐷𝐹), that is,
k𝐷𝐹(𝑣 + ℎ) − 𝐷𝐹(𝑣) − [𝐷 2 𝐹(𝑣)](ℎ)k
lim = 0. (3.21)
ℎ→0 kℎk
Doing this recursively, we obtain
𝐷𝐹(𝑣) ∈ L(V; W), 𝐷 2 𝐹(𝑣) ∈ L(V; L(V; W)), 𝐷 3 𝐹(𝑣) ∈ L(V; L(V; L(V; W))),
and so on. Avoiding such nested spaces of linear maps is a good reason to introduce
multilinear maps. Note that we have
L(V; M𝑑−1 (V, . . . , V; W)) = M𝑑 (V, . . . , V; W) (3.22)
since, for a linear map Φ on V taking values in M𝑑−1 (V, . . . , V; W),
[Φ(ℎ)](ℎ1 , . . . , ℎ 𝑑−1 )
must be linear in ℎ for any fixed ℎ1 , . . . , ℎ 𝑑−1 and (𝑑 − 1)-linear in ℎ1 , . . . , ℎ 𝑑−1 for
any fixed ℎ, that is, it is 𝑑-linear in the arguments ℎ, ℎ1 , . . . , ℎ 𝑑−1 . Thus we obtain
the 𝑑th-order derivative of 𝐹 : Ω → W at a point 𝑣 ∈ Ω as a 𝑑-linear map,
𝐷 𝑑 𝐹(𝑣) : V × · · · × V → W. (3.23)
With some additional arguments, we may show that as long as 𝐷 𝑑 𝐹(𝑣) exists in an
open neighbourhood of 𝑣, then it must be a symmetric multilinear map in the sense
that
[𝐷 𝑑 𝐹(𝑣)](ℎ1 , ℎ2 , . . . , ℎ 𝑑 ) = [𝐷 𝑑 𝐹(𝑣)](ℎ 𝜎(1) , ℎ 𝜎(2) , . . . , ℎ 𝜎(𝑑) ) (3.24)
50 L.-H. Lim

for any permutation 𝜎 ∈ 𝔖𝑑 , the permutation group on 𝑑 objects; and in addition


we have Taylor’s theorem,
1
𝐹(𝑣 + ℎ) = 𝐹(𝑣) + [𝐷𝐹(𝑣)](ℎ) + [𝐷 2 𝐹(𝑣)](ℎ, ℎ) + · · ·
2
1 𝑑
· · · + [𝐷 𝐹(𝑣)](ℎ, . . . , ℎ) + 𝑅(ℎ),
𝑑!
where the remainder term k𝑅(ℎ)k/kℎk 𝑑 → 0 as ℎ → 0 (Lang 2002, Chapter XIII,
Section 6). Strictly speaking, the ‘=’ in (3.22) should be ‘’, but elements in both
spaces satisfy the same tensor transformation rules, so as tensors, the ‘=’ is perfectly
justified; making this statement without reference to the tensor transformation rules
is a reason why we would ultimately want to bring in definition ➂.

Following the convention in Abraham et al. (1988), Bleecker (1981), Boothby


(1986), Choquet-Bruhat et al. (1982), Hassani (1999), Helgason (1978), Martin
(1991), Misner et al. (1973) and Wald (1984), a 𝑑-tensor is then defined as a
𝑑-linear functional, i.e. by setting W = R. The following is a formal version of
definition ➁.

Definition 3.3 (tensors as multilinear functionals). Let 𝑝 ≤ 𝑑 be non-negative


integers and let V1 , . . . , V𝑑 be vector spaces over R. A tensor of contravariant
order 𝑝 and covariant order 𝑑 − 𝑝 is a multilinear functional
𝜑 : V∗1 × · · · × V∗𝑝 × V 𝑝+1 × · · · × V𝑑 → R. (3.25)

We say that 𝜑 is a tensor of type (𝑝, 𝑑 − 𝑝) and of order 𝑑.

This is a passable but awkward definition. For 𝑑 = 2 and 3, the bilinear and
trilinear functionals in (3.8) and (3.11) are tensors, but the linear and bilinear
operators in (3.7) and (3.10) strictly speaking are not; sometimes this means one
needs to convert operators to functionals, for example by identifying a linear
operator Φ : V → W as the bilinear functional defined by V × W∗ → R, (𝑣, 𝜑) ↦→
𝜑(Φ(𝑣)), before one may apply a result. Definition 3.3 is also peculiar considering
that by far the most common 1-, 2- and 3-tensors are vectors, linear operators
and bilinear operators respectively, but the definition excludes them at the outset.
So instead of simply speaking of 𝑣 ∈ V, one would need to regard it as a linear
functional on the space of linear functionals, i.e. 𝑣 ∗∗ : V∗ → R, in order to regard
it as a 1-tensor in the sense of Definition 3.3. In fact, it is often more useful to
do the reverse, that is, given a 𝑑-linear functional, we prefer to convert it into a
(𝑑 − 1)-linear operator, and we will give an example.

Example 3.4 (higher-order gradients). We consider the higher-order derivat-


ives of a real-valued function 𝑓 : Ω → R, where Ω ⊆ V is an open set and where
V is equipped with an inner product h · , · i. This is a special case of Example 3.2
Tensors in computations 51

with W = R and V an inner product space. Since


𝐷 𝑑 𝑓 (𝑣) : V × · · · × V → R
𝑑 copies

is a multilinear functional, for any fixed first 𝑑 − 1 arguments,


[𝐷 𝑑 𝑓 (𝑣)](ℎ1 , . . . , ℎ 𝑑−1 , ·) : V → R
is a linear functional and, by the Riesz representation theorem, there must be a
vector, denoted [∇𝑑 𝑓 (𝑣)](ℎ1 , . . . , ℎ 𝑑−1 ) ∈ V as it depends on ℎ1 , . . . , ℎ 𝑑−1 , such
that
[𝐷 𝑑 𝑓 (𝑣)](ℎ1 , . . . , ℎ 𝑑−1 , ℎ 𝑑 ) = h[∇𝑑 𝑓 (𝑣)](ℎ1 , . . . , ℎ 𝑑−1 ), ℎ 𝑑 i. (3.26)
Since 𝐷 𝑑 𝑓 (𝑣) is a 𝑑-linear functional,
∇𝑑 𝑓 (𝑣) : V × · · · × V → V
𝑑−1 copies

is a (𝑑 − 1)-linear operator. The symmetry in (3.24) shows that our argument


does not depend on our having fixed the first 𝑑 − 1 arguments: we could have
fixed any 𝑑 − 1 arguments and still obtained the same ∇𝑑 𝑓 (𝑣), which must itself
be symmetric in its 𝑑 − 1 arguments. We will call ∇𝑑 𝑓 (𝑣) the 𝑑th gradient of
𝑓 ; note that it depends on the choice of inner product. For 𝑑 = 1, the argument
above is essentially how one would define a gradient ∇ 𝑓 : Ω → V in Riemannian
geometry. For 𝑑 = 2, we see that ∇2 𝑓 = 𝐷(∇ 𝑓 ), which is how a Hessian is defined
in optimization (Boyd and Vandenberghe 2004, Renegar 2001). In fact, we may
show more generally that, for 𝑑 ≥ 2,
∇𝑑 𝑓 = 𝐷 𝑑−1 (∇ 𝑓 ), (3.27)
which may be taken as an alternative definition of higher gradients. The utility
of a multilinear map approach is familiar to anyone who has attempted to find the
gradient and Hessian of functions such as
𝑓 (𝑋) = tr(𝑋 −1 ) or 𝑔(𝑥, 𝑌 ) = 𝑥 T𝑌 −1 𝑥,
where 𝑥 ∈ R𝑛 and 𝑋, 𝑌 ∈ S++ 𝑛 (Boyd and Vandenberghe 2004, Examples 3.4 and

3.46). The definitions in terms of partial derivatives,


 𝜕𝑓   𝜕2 𝑓 𝜕2 𝑓 
   · · · 
 𝜕𝑥1   𝜕𝑥 2 𝜕𝑥1 𝜕𝑥 𝑛 
 .   1
 ..  ,
∇ 𝑓 =  ..  , ∇2 𝑓 =  ... ..
. . 
 𝜕𝑓  
   𝜕2 𝑓 𝜕 𝑓 
  
𝜕𝑥
 𝑛  𝜕𝑥 𝜕𝑥 · · · 
 𝑛 1 𝜕𝑥 𝑛2 
  𝑛
𝜕𝑑 𝑓
∇𝑑 𝑓 = , (3.28)
𝜕𝑥𝑖 𝜕𝑥 𝑗 · · · 𝜕𝑥 𝑘 𝑖, 𝑗,... ,𝑘=1
52 L.-H. Lim

provide little insight when functions are defined on vector spaces other than R𝑛 .
On the other hand, using 𝑓 (𝑋) = tr(𝑋 −1 ) for illustration, with (3.20) and (3.21),
we get
𝐷 𝑓 (𝑋) : S𝑛 → R, 𝐻 ↦→ − tr(𝑋 −1 𝐻 𝑋 −1 ),
𝐷 2 𝑓 (𝑋) : S𝑛 × S𝑛 → R, (𝐻1 , 𝐻2 ) ↦→ tr(𝑋 −1 𝐻1 𝑋 −1 𝐻2 𝑋 −1 + 𝑋 −1 𝐻2 𝑋 −1 𝐻1 𝑋 −1 ),
and more generally
𝐷 𝑑 𝑓 (𝑋) : S𝑛 × · · · × S𝑛 → R,
 Õ 
𝑑 −1 −1 −1 −1 −1
(𝐻1 , . . . , 𝐻 𝑑 ) ↦→ (−1) tr 𝑋 𝐻 𝜎(1) 𝑋 𝐻 𝜎(2) 𝑋 · · · 𝑋 𝐻 𝜎(𝑑) 𝑋 .
𝜎 ∈𝔖𝑑

By the cyclic invariance of trace,


[𝐷 𝑓 (𝑋)](𝐻) = tr[(−𝑋 −2 )𝐻],
[𝐷 2 𝑓 (𝑋)](𝐻1 , 𝐻2 ) = tr[(𝑋 −1 𝐻1 𝑋 −2 + 𝑋 −2 𝐻1 𝑋 −1 )𝐻2 ],
[𝐷 3 𝑓 (𝑋)](𝐻1 , 𝐻2 , 𝐻3 ) = tr[(−𝑋 −1 𝐻1 𝑋 −1 𝐻2 𝑋 −2 − 𝑋 −1 𝐻2 𝑋 −2 𝐻1 𝑋 −1
− 𝑋 −2 𝐻1 𝑋 −1 𝐻2 𝑋 −1 − 𝑋 −1 𝐻2 𝑋 −1 𝐻1 𝑋 −2
− 𝑋 −1 𝐻1 𝑋 −2 𝐻2 𝑋 −1 − 𝑋 −2 𝐻2 𝑋 −1 𝐻1 𝑋 −1 )𝐻3 ],
and so by (3.26),
∇ 𝑓 (𝑋) = −𝑋 −2 ,
[∇2 𝑓 (𝑋)](𝐻) = 𝑋 −1 𝐻 𝑋 −2 + 𝑋 −2 𝐻 𝑋 −1 ,
[∇3 𝑓 (𝑋)](𝐻1 , 𝐻2 ) = −𝑋 −1 𝐻1 𝑋 −1 𝐻2 𝑋 −2 − 𝑋 −1 𝐻2 𝑋 −2 𝐻1 𝑋 −1
− 𝑋 −2 𝐻1 𝑋 −1 𝐻2 𝑋 −1 − 𝑋 −1 𝐻2 𝑋 −1 𝐻1 𝑋 −2
− 𝑋 −1 𝐻1 𝑋 −2 𝐻2 𝑋 −1 − 𝑋 −2 𝐻2 𝑋 −1 𝐻1 𝑋 −1 ,
all without having to calculate a single partial derivative. Example 3.6 will provide
a similar illustration of how the use of multilinear maps makes such calculations
routine.
Returning to Definition 3.3, suppose we choose any bases ℬ1 = {𝑢1 , . . . , 𝑢 𝑛1 },
ℬ2 = {𝑢2 , . . . , 𝑢 𝑛2 } . . . , ℬ𝑑 = {𝑤 1 , . . . , 𝑤 𝑛𝑑 } for the vector spaces V1 , V2 , . . . , V𝑑
and let 𝐴 ∈ R𝑛1 ×𝑛2 ×···×𝑛𝑑 be given by
𝑎𝑖 𝑗 ···𝑘 = 𝜑(𝑢𝑖 , 𝑣 𝑗 , . . . , 𝑤 𝑘 ).
Then a consideration of change of basis similar to the various cases we discussed
for 𝑑 = 1, 2, 3 recovers the covariant 𝑑-tensor transformation rules (2.7) for
𝜑 : V1 × · · · × V𝑑 → R,
the contravariant 𝑑-tensor transformation rules (2.8) for
𝜑 : V∗1 × · · · × V∗𝑑 → R,
Tensors in computations 53

and the mixed 𝑑-tensor transformation rules (2.9) for the general case in (3.25).
Furthermore,
[𝜑] ℬ1 ,...,ℬ𝑑 = 𝐴 ∈ R𝑛1 ×···×𝑛𝑑
is exactly the hypermatrix representation of 𝜑 in Definition 2.5. The important
special case where V1 = · · · = V𝑑 = V and where we pick only a single basis ℬ
gives us the transformation rules in (2.10), (2.11) and (2.12).
At this juncture it is appropriate to highlight two previously mentioned points
(page 40) regarding the feasibility and usefulness of representing a multilinear map
as a hypermatrix.
Example 3.5 (writing down a hypermatrix is #P-hard). As we saw in (3.17),
given bases 𝒜, ℬ and 𝒞, a bilinear operator B may be represented as a hypermatrix
𝐴. Writing down the entries 𝑎 𝑖 𝑗 𝑘 as in (3.16) appears to be a straightforward process,
but this is an illusion: the task is #P-hard in general. Let 0 ≤ 𝑑1 ≤ 𝑑2 ≤ · · · ≤ 𝑑𝑛
be integers. Define the generalized Vandermonde matrix
 𝑥 𝑑1 𝑥2𝑑1 ... 𝑥 𝑛𝑑1 
 1𝑑 
𝑥 2 𝑥2𝑑2 ... 𝑥 𝑛𝑑2 
 1 
 
𝑉(𝑑1 ,...,𝑑𝑛 ) (𝑥) ≔  ... ..
.
..
.
..
. ,
 𝑑 
𝑥 𝑛−1 𝑥2𝑑𝑛−1 ... 𝑑𝑛−1 
𝑥𝑛 
 1
 𝑥 𝑑𝑛 𝑥2𝑑𝑛 ... 𝑥 𝑛𝑑𝑛 
 1
observing in particular that
 1 1 ... 1 

 𝑥1 𝑥2 . . . 𝑥 𝑛 

 .. .. .. .. 
𝑉(0,1,... ,𝑛−1) (𝑥) =  . . . . 
 𝑛−2 𝑛−2
𝑥 𝑛−2 
 1𝑛−1 𝑥2𝑛−1 . . . 𝑥 𝑛𝑛−1 
𝑥 𝑥2 . . . 𝑥 𝑛 
 1
is the usual Vandermonde matrix. Suppose 𝑑𝑖 ≥ 𝑖 for each 𝑖 = 1, . . Î . , 𝑛; then it is
not hard to show using the well-known formula det 𝑉(0,1,...,𝑛−1) (𝑥) = 𝑖< 𝑗 (𝑥𝑖 − 𝑥 𝑗 )
that det 𝑉(𝑑1 ,𝑑2 ,...,𝑑𝑛 ) (𝑥) is divisible by det 𝑉(0,1,...,𝑛−1) (𝑥). So, for any integers
0 ≤ 𝑝1 ≤ 𝑝2 ≤ · · · ≤ 𝑝𝑛,
det 𝑉( 𝑝1 , 𝑝2 +1,..., 𝑝𝑛 +𝑛−1) (𝑥)
𝑠( 𝑝1 , 𝑝2 ,..., 𝑝𝑛 ) (𝑥) ≔
det 𝑉(0,1,... ,𝑛−1) (𝑥)
is a multivariate polynomial in the variables 𝑥1 , . . . , 𝑥 𝑛 . These are symmetric
polynomials, i.e. homogeneous polynomials 𝑠 with
𝑠(𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ) = 𝑠(𝑥 𝜎(1) , 𝑥 𝜎(2) , . . . , 𝑥 𝜎(𝑛) )
for any 𝜎 ∈ 𝔖𝑛 , the permutation group on 𝑛 objects. Let U, V, W be the vector
spaces of symmetric polynomials of degrees 𝑑, 𝑑 ′ and 𝑑 + 𝑑 ′ respectively, and
let B : U × V → W be the bilinear operator given by polynomial multiplication,
54 L.-H. Lim

i.e. B(𝑠(𝑥), 𝑡(𝑥)) = 𝑠(𝑥)𝑡(𝑥) for any symmetric polynomials 𝑠(𝑥) of degree 𝑑 and
𝑡(𝑥) of degree 𝑑 ′. A well-known basis of the vector space of degree-𝑑 symmetric
polynomials is the Schur basis given by
{𝑠( 𝑝1 , 𝑝2 ,..., 𝑝𝑛 ) (𝑥) ∈ U : 𝑝 1 ≤ 𝑝 2 ≤ · · · ≤ 𝑝 𝑛 is an integer partition of 𝑑}.
Let 𝒜, ℬ, 𝒞 be the respective Schur bases for U, V, W. In this case the coefficients
in (3.16) are called Littlewood–Richardson coefficients, and determining their val-
ues is a #P-complete problem11 (Narayanan 2006). In other words, determining the
hypermatrix representation 𝐴 of the bilinear operator B is #P-hard. Littlewood–
Richardson coefficients are not as esoteric as one might think but have significance
in linear algebra and numerical linear algebra alike. Among other things they play
a central role in the resolution of Horn’s conjecture about the eigenvalues of a sum
of Hermitian matrices (Klyachko 1998, Knutson and Tao 1999).
As we mentioned earlier, a side benefit of definition ➁ is that it allows us to
work with arbitrary vector spaces. In the previous example, U, V, W are vector
spaces of degree-𝑑 symmetric polynomials for various values of 𝑑; in the next one
we will have U = V = W = S𝑛 , the vector space of 𝑛 × 𝑛 symmetric matrices.
Aside from showing that hypermatrix representations may be neither feasible nor
useful, Examples 3.5 and 3.6 also show that there are usually good reasons to work
intrinsically with whatever vector spaces one is given.
Example 3.6 (higher gradients of log determinant). The key to all interior point
methods is a barrier function that traps iterates within the feasible region of a convex
program. In semidefinite programming, the optimal barrier function with respect
to iteration complexity is the log barrier function for the cone of positive definite
matrices S++𝑛,
𝑛
𝑓 : S++ → R, 𝑓 (𝑋) = − log det 𝑋.
Using the characterization in Example 3.4 with the inner product S𝑛 × S𝑛 → R,
(𝐻1 , 𝐻2 ) ↦→ tr(𝐻1T 𝐻2 ), we may show that its gradient is given by
𝑛
∇ 𝑓 : S++ → S𝑛 , ∇ 𝑓 (𝑋) = −𝑋 −1 , (3.29)
𝑛 is the linear map
and its Hessian at any 𝑋 ∈ S++
∇2 𝑓 (𝑋) : S𝑛 → S𝑛 , 𝐻 ↦→ 𝑋 −1 𝐻 𝑋 −1 , (3.30)
expressions we may also find in Boyd and Vandenberghe (2004) and Renegar
(2001). While we may choose bases on S𝑛 and artificially write the gradient and
Hessian of 𝑓 in the forms (3.28), interested readers may check that they are a horrid
mess that obliterates all insights and advantages proffered by (3.29) and (3.30).
11 This means it is as intractable as evaluating the permanent of a matrix whose entries are zeros
and ones (Valiant 1979). #P-complete problems include all NP-complete problems, for example,
deciding whether a graph is 3-colourable is NP-complete, but computing its chromatic number is
#P-complete.
Tensors in computations 55

Among other things, (3.29) and (3.30) allow one to exploit specialized algorithms
for matrix product and inversion.
The third-order gradient ∇3 𝑓 also plays an important role as we need it to
ascertain self-concordance in Example 3.16. By our discussion in Example 3.4,
for any 𝑋 ∈ S++ , this is a bilinear operator
∇3 𝑓 (𝑋) : S𝑛 × S𝑛 → S𝑛 ,
and by (3.27), we may differentiate (3.30) to get
[∇3 𝑓 (𝑋)](𝐻1 , 𝐻2 ) = −𝑋 −1 𝐻1 𝑋 −1 𝐻2 𝑋 −1 − 𝑋 −1 𝐻2 𝑋 −1 𝐻1 𝑋 −1 . (3.31)
Repeatedly applying (3.27) gives the (𝑑 − 1)-linear operator
∇𝑑 𝑓 (𝑋) : S𝑛 × · · · × S𝑛 → S𝑛 ,
Õ
(𝐻1 , . . . , 𝐻 𝑑−1 ) ↦→ (−1)𝑑 𝑋 −1 𝐻 𝜎(1) 𝑋 −1 𝐻 𝜎(2) 𝑋 −1 · · · 𝑋 −1 𝐻 𝜎(𝑑−1) 𝑋 −1 ,
𝜎 ∈𝔖𝑑−1

as the 𝑑th gradient of 𝑓 . As interested readers may again check for themselves,
expressing this as a 𝑑-dimensional hypermatrix is even less illuminating than ex-
pressing (3.29) and (3.30) as one- and two-dimensional hypermatrices. Multilinear
maps are essential for discussing higher derivatives and gradients of multivariate
functions.
In Examples 3.4 and 3.6, the matrices in S𝑛 are 1-tensors even though they are
doubly indexed objects. This should come as no surprise after Example 2.3: the
order of a tensor is not determined by the number of indices. We will see this in
Example 3.7 again, where we will encounter a hypermatrix with 3𝑛 + 3 indices
which nonetheless represents a 3-tensor.
We conclude this section with some infinite-dimensional examples. Much like
linear operators, there is not much that one could say about multilinear operators
over infinite-dimensional vector spaces that is purely algebraic. The topic becomes
much more interesting when one brings in analytic notions by equipping the vector
spaces with norms or inner products.
As is well known, it is easy to ascertain the continuity of linear operators between
Banach spaces: Φ : V → W is continuous if and only if it is bounded in the sense
of kΦ(𝑣)k ≤ 𝑐k𝑣k for some constant 𝑐 > 0 and for all 𝑣 ∈ V, i.e. if and only if
Φ ∈ B(V, W). We have slightly abused notation by not distinguishing the norms
on different spaces and will continue to do so below. Almost exactly the same
proof extends to multilinear operators on Banach spaces: Φ : V1 × · · · × V𝑑 → W
is continuous if and only if it is bounded in the sense of
kΦ(𝑣 1 , . . . , 𝑣 𝑑 )k ≤ 𝑐k𝑣 1 k · · · k𝑣 𝑑 k
for some constant 𝑐 > 0 and for all 𝑣 1 ∈ V1 , . . . , 𝑣 𝑑 ∈ V𝑑 (Lang 1993, Chapter IV,
Section 1), i.e. if and only if its spectral norm as defined in (3.19) is finite.
So if V1 , . . . , V𝑑 are finite-dimensional, then Φ is automatically continuous; in
56 L.-H. Lim

this case it does not matter whether W is finite-dimensional. We will write


B(V1 , . . . , V𝑑 ; W) for the set of all bounded/continuous multilinear operators.
Then B(V1 , . . . , V𝑑 ; W) is itself a Banach space when equipped with the spectral
norm.

Example 3.7 (infinite-dimensional bilinear operators). Possibly the simplest


continuous bilinear operator is the infinite-dimensional generalization of matrix–
vector product R𝑚×𝑛 × R𝑛 → R𝑚 , (𝐴, 𝑣) ↦→ 𝐴𝑣, but where we instead have

M : B(V, W) × V → W, (Φ, 𝑣) ↦→ Φ(𝑣).

If V and W are Banach spaces equipped with norms k · k and k · k ′ respectively,


then B(V, W) is a Banach space equipped with the operator norm

kΦ(𝑣)k ′
kΦk ′′ = sup .
𝑣≠0 k𝑣k

It is straightforward to see that the spectral norm of M is then

kM(Φ, 𝑣)k ′
kMk 𝜎 = sup =1
Φ≠0, 𝑣≠0 kΦk ′′ k𝑣k

and thus it is continuous. Next we will look at an actual Banach space of functions.
Another quintessential bilinear operator is the convolution of two functions 𝑓 , 𝑔 ∈
𝐿 1 (R𝑛 ), defined by

𝑓 ∗ 𝑔(𝑥) ≔ 𝑓 (𝑥 − 𝑦)𝑔(𝑦) d𝑦.
R𝑛

The result is an 𝐿 1 -function because k 𝑓 ∗ 𝑔k 1 ≤ k 𝑓 k 1 k𝑔k 1 . Thus we have a


well-defined bilinear operator

B∗ : 𝐿 1 (R𝑛 ) × 𝐿 1 (R𝑛 ) → 𝐿 1 (R𝑛 ), ( 𝑓 , 𝑔) ↦→ 𝑓 ∗ 𝑔.

Since kB∗ ( 𝑓 , 𝑔)k 1 = k 𝑓 ∗ 𝑔k 1 ≤ k 𝑓 k 1 k𝑔k 1 , we have

k 𝑓 ∗ 𝑔k 1
kB∗ k 𝜎 = sup ≤ 1,
𝑓 , 𝑔≠0 k 𝑓 k 1 k𝑔k 1

and equality is attained by, say, choosing 𝑓 , 𝑔 ∈ 𝐿 1 (R𝑛 ) to be non-negative and


applying Fubini. So kB∗ k 𝜎 = 1 and in particular B∗ is continuous.
For a continuous linear operator Φ ∈ B(H; H) and a continuous bilinear operator
B ∈ B(H, H; H) on a separable Hilbert space HÍ with a countable orthonormal basis
ℬ = {𝑒𝑖 : 𝑖 ∈ N}, Parseval’s identity 𝑓 = ∞ 𝑖=1 h 𝑓 , 𝑒 𝑖 i𝑒 𝑖 gives us the following
Tensors in computations 57

orthogonal expansions:
Õ
∞ Õ

Φ( 𝑓 ) = hΦ(𝑒𝑖 ), 𝑒 𝑗 ih 𝑓 , 𝑒 𝑗 i𝑒 𝑗 ,
𝑖=1 𝑗=1
Õ∞ Õ ∞ Õ∞
B( 𝑓 , 𝑔) = hB(𝑒𝑖 , 𝑒 𝑗 ), 𝑒 𝑘 ih 𝑓 , 𝑒𝑖 ih𝑔, 𝑒 𝑗 i𝑒 𝑘
𝑖=1 𝑗=1 𝑘=1

for any 𝑓 , 𝑔 ∈ H. Convergence of these infinite series is, as usual, in the norm
induced by the inner product, and it follows from convergence that the hypermatrices
representing Φ and B with respect to ℬ are 𝑙 2 -summable, that is,
2 2
(hΦ(𝑒𝑖 ), 𝑒 𝑗 i)∞
𝑖, 𝑗=1 ∈ 𝑙 (N × N), (hB(𝑒𝑖 , 𝑒 𝑗 ), 𝑒 𝑘 i)∞
𝑖, 𝑗,𝑘=1 ∈ 𝑙 (N × N × N)

(we discuss infinite-dimensional hypermatrices formally in Example 4.5). In fact


the converse is also true. If we define a linear operator and a bilinear operator by
∞ Õ
Õ ∞
Φ( 𝑓 ) ≔ 𝑎𝑖 𝑗 h 𝑓 , 𝑒 𝑗 i𝑒 𝑗 ,
𝑖=1 𝑗=1
Õ∞ Õ ∞ Õ ∞
B( 𝑓 , 𝑔) ≔ 𝑏𝑖 𝑗𝑘 h 𝑓 , 𝑒𝑖 ih𝑔, 𝑒 𝑗 i𝑒 𝑘
𝑖=1 𝑗=1 𝑘=1

for any 𝑓 , 𝑔 ∈ H and where


2 2
(𝑎𝑖 𝑗 )∞
𝑖, 𝑗=1 ∈ 𝑙 (N × N), (𝑏𝑖 𝑗𝑘 )∞
𝑖, 𝑗,𝑘=1 ∈ 𝑙 (N × N × N),

then Φ ∈ B(H; H) and B ∈ B(H, H; H). So the continuity of these operators


is determined by the 𝑙 2 -summability of the corresponding hypermatrices. What
is perhaps surprising is that this also holds to various extents for Banach spaces
with ‘almost orthogonal bases’ and other growth conditions for the coefficient
hypermatrices, as we will see below.
Going beyond elementary examples like M or B∗ requires somewhat more back-
ground. The space of Schwartz functions is
 𝑞1 +···+𝑞𝑛 𝑓 
𝑝 𝑝 𝜕
𝑆(R𝑛 ) ≔ 𝑓 ∈ 𝐶 ∞(R𝑛 ) : sup 𝑥1 1 · · · 𝑥 𝑛 𝑛 𝑞1 < ∞, 𝑝 𝑖 , 𝑞 𝑗 ∈ N .
𝑥 ∈R𝑛 𝜕𝑥1 · · · 𝜕𝑥 𝑛𝑞𝑛
The defining condition essentially states that all its derivatives decay rapidly to zero
faster than any negative powers. Examples include compactly supported smooth
functions and functions of the form 𝑝(𝑥) exp(−𝑥 T 𝐴𝑥), where 𝑝 is a polynomial and
𝐴 ∈ S++𝑛 . While 𝑆(R𝑛 ) is not a Banach space, it is a dense subset of many common

Banach spaces such as 𝐿 𝑝 (R𝑛 ), 𝑝 ∈ [1, ∞). Its continuous dual space,
𝑆 ′ (R𝑛 ) ≔ B(𝑆(R𝑛 ); R),
is the space of tempered distributions. It follows from Schwartz’s kernel theorem
(see Example 4.2) that any continuous bilinear operator B : 𝑆(R𝑛 )×𝑆(R𝑛 ) → 𝑆 ′ (R𝑛 )
58 L.-H. Lim

must take the form


∫ ∫
B( 𝑓 , 𝑔)(𝑥) = 𝐾(𝑥, 𝑦, 𝑧) 𝑓 (𝑦)𝑔(𝑧) d𝑦 d𝑧
R𝑛 R𝑛
for some tempered distribution 𝐾 ∈ 𝑆 ′(R𝑛 × R𝑛 × R𝑛 ) (Grafakos and Torres 2002a).
It is a celebrated result of Frazier and Jawerth (1990) that if 𝜓 ∈ 𝑆(R𝑛 ) has Fourier
transform 𝜓 b(𝜉) vanishing outside the annulus 𝜋/4 ≤ |𝜉 | ≤ 𝜋 and bounded away
from zero on a smaller annulus 𝜋/4+𝜀 ≤ |𝜉 | ≤ 𝜋 −𝜀, then it is an almost orthogonal
wavelet for 𝐿 𝑝 (R𝑛 ), 𝑝 ∈ (1, ∞). This means that if we set
ℬ𝜓 ≔ {𝜓 𝑘,𝜈 : (𝑘, 𝜈) ∈ Z × Z𝑛 }, 𝜓 𝑘,𝜈 (𝑥) ≔ 2𝑘𝑛/2 𝜓(2𝑘 𝑥 − 𝜈),
then any 𝑓 ∈ 𝐿 𝑝 (R𝑛 ) has an expansion
Õ
𝑓 = h 𝑓 , 𝜓 𝑘,𝜈 i𝜓 𝑘,𝜈
(𝑘,𝜈)∈Z𝑛+1

that converges in the 𝐿 𝑝 -norm, much like Parseval’s identity for a Hilbert space,
even though we do not have a Hilbert space and ℬ𝜓 is not an orthonormal basis (in
the language of Example 4.47, ℬ𝜓 is a tight wavelet frame with frame constant 1).
Furthermore, if the matrix12
𝐴 : Z𝑛+1 × Z𝑛+1 → R
satisfies an ‘almost diagonal’ growth condition that essentially says that 𝑎(𝑖,𝜆),( 𝑗,𝜇)
is small whenever (𝑖, 𝜆) and ( 𝑗, 𝜇) are far apart, then defining
Õ Õ
Φ 𝐴( 𝑓 ) ≔ 𝑎(𝑖,𝜆),( 𝑗,𝜇) h 𝑓 , 𝜓𝑖,𝜆 i𝜓 𝑗,𝜇
(𝑖,𝜆)∈Z𝑛+1 ( 𝑗,𝜈)∈Z𝑛+1

gives us a continuous linear operator Φ 𝐴 : 𝐿 𝑝 (R𝑛 ) → 𝐿 𝑝 (R𝑛 ), 𝑝 ∈ (1, ∞). This


extends to bilinear operators. If the 3-hypermatrix
𝐵 : Z𝑛+1 × Z𝑛+1 × Z𝑛+1 → R
satisfies an analogous ‘almost diagonal’ growth condition with 𝑏(𝑖,𝜆),( 𝑗,𝜇),(𝑘,𝜈) small
whenever (𝑖, 𝜆), ( 𝑗, 𝜇), (𝑘, 𝜈) are far apart from each other, then defining
Õ Õ Õ
Φ𝐵 ( 𝑓 , 𝑔) ≔ 𝑏(𝑖,𝜆),( 𝑗,𝜇),(𝑘,𝜈) h 𝑓 , 𝜓𝑖,𝜆 ih𝑔, 𝜓 𝑗,𝜇 i𝜓 𝑘,𝜈
(𝑖,𝜆)∈Z𝑛+1 ( 𝑗,𝜇)∈Z𝑛+1 (𝑘,𝜈)∈Z𝑛+1

gives us a continuous linear operator Φ𝐵 : 𝐿 𝑝 (R𝑛 ) × 𝐿 𝑞 (R𝑛 ) → 𝐿 𝑟 (R𝑛 ) whenever


1 1 1
+ = , 1 < 𝑝, 𝑞, 𝑟 < ∞.
𝑝 𝑞 𝑟
The linear result is due to Frazier and Jawerth (1990) but our description is based
on Grafakos and Torres (2002a, Theorem A), which also contains the bilinear result
12 Note that 𝐴 is a matrix in the sense of Example 4.5. Here the row and column indices take values
in Z𝑛+1 . Likewise for the hypermatrix 𝐵 later.
Tensors in computations 59

(Grafakos and Torres 2002a, Theorem 1). We refer the reader to these references
for the exact statement of the ‘almost diagonal’ growth conditions.
Aside from convolution, the best-known continuous bilinear operator is probably
the bilinear Hilbert transform

d𝑦
H( 𝑓 , 𝑔)(𝑥) ≔ lim 𝑓 (𝑥 + 𝑦)𝑔(𝑥 − 𝑦) , (3.32)
𝜀→0 | 𝑦 |> 𝜀 𝑦
being a bilinear extension of the Hilbert transform

d𝑦
H( 𝑓 )(𝑥) ≔ lim 𝑓 (𝑥 − 𝑦) .
𝜀→0 | 𝑦 |> 𝜀 𝑦
The latter, according to Krantz (2009, p. 15) ‘is, without question, the most im-
portant operator in analysis.’ While it is a standard result that the Hilbert transform
is continuous as a linear operator H : 𝐿 𝑝 (R) → 𝐿 𝑝 (R), 𝑝 ∈ (1, ∞), and we even
know the exact value of its operator/spectral norm (Grafakos 2014, Remark 5.1.8),

 
 𝜋

 𝜋 tan 2𝑝

 1 < 𝑝 ≤ 2,
kHk 𝜎 =

 
 𝜋
 𝜋 cot
 2 ≤ 𝑝 < ∞,
 2𝑝
the continuity of its bilinear counterpart had been a long-standing open problem.
It was resolved by Lacey and Thiele (1997, 1999), who showed that as a bilinear
operator, H : 𝐿 𝑝 (R) × 𝐿 𝑞 (R) → 𝐿 𝑟 (R) is continuous whenever
1 1 1 2
+ = , 1 < 𝑝, 𝑞 ≤ ∞, < 𝑟 < ∞,
𝑝 𝑞 𝑟 3
that is, there exists 𝑐 > 0 such that
kH( 𝑓 , 𝑔)k 𝑟 ≤ 𝑐k 𝑓 k 𝑝 k𝑔k 𝑞
for all 𝑓 ∈ 𝐿 𝑝 (R) and 𝑔 ∈ 𝐿 𝑞 (R). The special case 𝑝 = 𝑞 = 2, 𝑟 = 1, open for
more than thirty years, was known as the Calderón conjecture.
The study of infinite-dimensional multilinear operators along the above lines has
become a vast undertaking, sometimes called multilinear harmonic analysis (Mus-
calu and Schlag 2013), with a multilinear Calderón–Zygmund theory (Grafakos and
Torres 2002b) and profound connections to wavelets (Meyer and Coifman 1997)
among its many cornerstones.

3.2. Decompositions of multilinear maps


For simplicity, we will assume a finite-dimensional setting in this section. We
will see that every multilinear map can be constructed out of linear functionals
and vectors. This is a simple observation but often couched in abstract terms
under headings like the ‘universal factorization property’ or ‘universal mapping
60 L.-H. Lim

property’. Its simplicity notwithstanding, this observation is a fruitful one that is


the basis of the notion of tensor rank.
In the context of definition ➁, 0- and 1-tensors are building blocks, and there is
little more that we may say about them. The definition really begins at 𝑑 = 2. Let
Φ : U → V be a linear operator and let ℬ = {𝑣 1 , . . . , 𝑣 𝑛 } be a basis of V. Then,
for any 𝑢 ∈ U, we have
Õ𝑛
Φ(𝑢) = 𝑎𝑗𝑣𝑗 (3.33)
𝑗=1

for some 𝑎1 , . . . , 𝑎 𝑛 ∈ R. Clearly, if we change 𝑢 the coefficients 𝑎1 , . . . , 𝑎 𝑛 must


in general change too, that is, they are functions of 𝑢 and we ought to have written
Õ
𝑛
Φ(𝑢) = 𝑎 𝑗 (𝑢)𝑣 𝑗 (3.34)
𝑗=1

to indicate this. What kind of function is 𝑎𝑖 : U → R? It is easy to see if we take


the linear functional 𝑣 ∗𝑖 : V → R∗ from the dual basis ℬ∗ = {𝑣 ∗1 , . . . , 𝑣 ∗𝑛 } and hit
(3.34) on the left with it; then
Õ
𝑛
𝑣 ∗𝑖 (Φ(𝑢)) = 𝑎𝑖 (𝑢)𝑣 ∗𝑖 (𝑣 𝑗 ) = 𝑎𝑖 (𝑢)
𝑗=1

by linearity of 𝑣 ∗𝑖 and the fact that 𝑣 ∗𝑖 (𝑣 𝑗 ) = 𝛿𝑖 𝑗 . Since this holds for all 𝑢 ∈ U, we
have
𝑣 ∗𝑖 ◦ Φ = 𝑎𝑖 .
So 𝑎𝑖 : U → R is a linear functional as 𝑣 ∗𝑖 and Φ are both linear. Switching back
to our usual notation of denoting linear functionals as 𝜑𝑖 instead of 𝑎𝑖 , we see that
every linear operator Φ : U → V takes the form
Õ
𝑛
Φ(𝑢) = 𝜑 𝑗 (𝑢)𝑣 𝑗
𝑗=1

for some linear functionals 𝜑1 , . . . , 𝜑 𝑛 ∈ V∗ , that is, every linear operator is


constructed out of linear functionals and vectors. Now observe that in order to get
(3.34) we really did not need 𝑣 1 , . . . , 𝑣 𝑛 to be a basis of V: we just need 𝑣 1 , . . . , 𝑣𝑟
to be a basis of im(Φ) with 𝑟 = dim im(Φ) = rank(Φ). Since 𝑟 cannot be any
smaller or else (3.34) would not hold, we must have
 Õ𝑟 
rank(Φ) = min 𝑟 : Φ(𝑢) = 𝜑𝑖 (𝑢)𝑣 𝑖 .
𝑖=1

The argument is similar for a bilinear functional 𝛽 : U × V → R but also different


enough to warrant going over. For any 𝑢 ∈ U, 𝛽(𝑢, ·) : V → R is a linear functional,
Tensors in computations 61

so we must have
Õ
𝑛
𝛽(𝑢, ·) = 𝑎𝑖 𝑣 ∗𝑖
𝑖=1

as ℬ∗ = {𝑣 ∗1 , . . . , 𝑣 ∗𝑛 } is a basis for V∗ . Since 𝑎𝑖 depends on 𝑢, we write


Õ
𝑛
𝛽(𝑢, ·) = 𝑎𝑖 (𝑢)𝑣 ∗𝑖 .
𝑖=1

Evaluating at 𝑣 𝑗 gives
Õ
𝑛
𝛽(𝑢, 𝑣 𝑗 ) = 𝑎𝑖 (𝑢)𝑣 ∗𝑖 (𝑣 𝑗 ) = 𝑎 𝑗 (𝑢)
𝑖=1

for any 𝑢 ∈ U. So 𝑎 𝑗 : U → R is a linear functional. Since 𝑣 ∗𝑖 : V → R is also a


linear functional, switching to our usual notation for linear functionals, we conclude
that a bilinear functional 𝛽 : U × V → R must take the form
Õ
𝑛
𝛽(𝑢, 𝑣) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣) (3.35)
𝑖=1

for some 𝜑1 , . . . , 𝜑 𝑛 ∈ U∗ and 𝜓1 , . . . , 𝜓 𝑛 ∈ V∗ , that is, every bilinear functional


is constructed out of linear functionals. In addition,
 Õ 𝑟 
rank(𝛽) = min 𝑟 : 𝛽(𝑢, 𝑣) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)
𝑖=1

defines a notation of rank for bilinear functionals.


For a bilinear operator B : U × V → W and any 𝑢 ∈ U and 𝑣 ∈ V,
𝑝
Õ
B(𝑢, 𝑣) = 𝑎𝑗𝑤 𝑗
𝑗=1

with 𝒞 = {𝑤 1 , . . . , 𝑤 𝑝 } a basis of W, but the difference now is that 𝑎 𝑗 depends on


both 𝑢 and 𝑣 and so is a function 𝑎 𝑗 : U × V → R, that is,
𝑝
Õ
B(𝑢, 𝑣) = 𝑎 𝑗 (𝑢, 𝑣)𝑤 𝑗 .
𝑗=1

Hitting this equation on the left by 𝑤 ∗𝑖 gives


𝑤 ∗𝑖 (B(𝑢, 𝑣)) = 𝑎𝑖 (𝑢, 𝑣),
B 𝑤𝑖∗
and since composing a bilinear operator and a linear functional U × V − → W −−→ R
gives a bilinear functional, we conclude that 𝑎𝑖 is a bilinear functional. Applying
the result in the previous paragraph, 𝑎𝑖 must take the form in (3.35), and changing
62 L.-H. Lim

to our notation for linear functionals we get


𝑛 Õ
Õ 𝑝 Õ
𝑟
B(𝑢, 𝑣) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝑤 𝑗 = 𝜑 𝑘 (𝑢)𝜓 𝑘 (𝑣)𝑤 𝑘 , (3.36)
𝑖=1 𝑗=1 𝑘=1

where the last step is simply relabelling the indices, noting that both are sums of
terms of the form 𝜑(𝑢)𝜓(𝑣)𝑤, with 𝑟 = 𝑛𝑝. The smallest 𝑟, that is,
 Õ𝑟 
rank(B) = min 𝑟 : B(𝑢, 𝑣) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝑤 𝑖 , (3.37)
𝑖=1
is called the tensor rank of B, a notion that may be traced to Hitchcock (1927,
equations 2 and 2𝑎 ) and will play a critical role in the next section.
The same line of argument may be repeated on a trilinear functional 𝜏 : U × V ×
W → R to show that they are just sums of products of linear functionals
Õ𝑟
𝜏(𝑢, 𝑣, 𝑤) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝜃 𝑖 (𝑤) (3.38)
𝑖=1
and with it a corresponding notion of tensor rank. More generally, any 𝑑-linear
map Φ : V1 × V2 × · · · × V𝑑 → W is built up of linear functionals and vectors,
Õ
𝑟
Φ(𝑣 1 , . . . , 𝑣 𝑑 ) = 𝜑𝑖 (𝑣 1 )𝜓𝑖 (𝑣 2 ) · · · 𝜃 𝑖 (𝑣 𝑑 )𝑤 𝑖 ,
𝑖=1
where the last 𝑤 𝑖 may be dropped if it is a 𝑑-linear functional with W = R.
Consequently, we see that the ‘multilinearness’ in any multilinear map comes from
that of
R𝑑 ∋ (𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 ) ↦→ 𝑥1 𝑥2 · · · 𝑥 𝑑 ∈ R.
Take (3.38) as illustration: where does the ‘trilinearity’ of 𝜏 come from? Say we
look at the middle argument; then
Õ
𝑟
𝜏(𝑢, 𝜆𝑣 + 𝜆 ′ 𝑣 ′, 𝑤) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝜆𝑣 + 𝜆 ′ 𝑣 ′)𝜃 𝑖 (𝑤)
𝑖=1
Õ
𝑟
= 𝜑𝑖 (𝑢)[𝜆𝜓𝑖 (𝑣) + 𝜆 ′𝜓(𝑣 ′)]𝜃 𝑖 (𝑤)
𝑖=1

𝑟  Õ
𝑟 
′ ′
=𝜆 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝜃 𝑖 (𝑤) + 𝜆 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣 )𝜃 𝑖 (𝑤)
𝑖=1 𝑖=1
= 𝜆𝜏(𝑢, 𝑣, 𝑤) + 𝜆 ′ 𝜏(𝑢, 𝑣 ′, 𝑤).
The reason why it is linear in the middle argument is simply a result of
𝑥(𝜆𝑦 + 𝜆 ′ 𝑦 ′)𝑧 = 𝜆𝑥𝑦𝑧 + 𝜆 ′𝑥𝑦 ′ 𝑧,
which is in turn a result of the trilinearity of (𝑥, 𝑦, 𝑧) ↦→ 𝑥𝑦𝑧. All ‘multilinearness’
Tensors in computations 63

in all multilinear maps arises in this manner. In addition, this ‘multilinearness’


may be ‘factored out’ of a multilinear map, and once we do that, whatever is left
behind is a linear map. This is essentially the simple idea behind the universal
factorization property of tensor products that we will formally state and discuss in
Section 4.3.

3.3. Tensors in computations II: multilinearity


A basic utility of the multilinear map perspective of tensors is that it allows one to
recognize 3-tensors in situations like the examples (i), (ii), (iii) on page 40, and in
turn allows one to apply tensorial notions such as tensor ranks and tensor norms
to analyse them. The most common 3-tensors are bilinear operators and this is the
case in computations too, although we will see that trilinear functionals do also
appear from time to time. When the bilinear operator is matrix multiplication,
study of its tensor rank leads us to the fabled exponent of matrix multiplication
𝜔. While most are aware that 𝜔 is important in the complexity of matrix–matrix
products and matrix inversion, it actually goes far beyond. The complexity of com-
puting 𝐿𝑈 decompositions, symmetric eigenvalue decompositions, determinants
and characteristic polynomials, bases for null spaces, matrix sparsification – note
that none of these are bilinear operators – all have asymptotic complexity either
exactly 𝜔 or bounded by it (Bürgisser et al. 1997, Chapter 16). Determining the
exact value of 𝜔 truly deserves the status of a Holy Grail problem in numerical
linear algebra. Especially relevant to this article is the fact that 𝜔 is a tensorial
notion.
Given three vector spaces U, V, W, how could one construct a bilinear operator
B : U × V → W?
Taking a leaf from Section 3.2, a simple and natural way is to take a linear functional
𝜑 : U → R, a linear functional 𝜓 : V → R, a vector 𝑤 ∈ W, and then define
B(𝑢, 𝑣) = 𝜑(𝑢)𝜓(𝑣)𝑤 (3.39)
for any 𝑢 ∈ U and 𝑣 ∈ V; it is easy to check that this is bilinear. We will call a
non-zero bilinear operator of such a form rank-one.
An important point to note is that in evaluating B(𝑢, 𝑣) in (3.39), only multi-
plication of variables matters, and for this B(𝑢, 𝑣) requires exactly one multi-
plication. Take for example U = V = W = R3 and 𝜑(𝑢) = 𝑢1 + 2𝑢2 + 3𝑢3 ,
𝜓(𝑣) = 2𝑣 1 + 3𝑣 2 + 4𝑣 3 , 𝑤 = (3, 4, 5); then
3(𝑢1 + 2𝑢2 + 3𝑢3 )(2𝑣 1 + 3𝑣 2 + 4𝑣 3 )
 
B(𝑢, 𝑣) = 4(𝑢1 + 2𝑢2 + 3𝑢3 )(2𝑣 1 + 3𝑣 2 + 4𝑣 3 ) .
5(𝑢1 + 2𝑢2 + 3𝑢3 )(2𝑣 1 + 3𝑣 2 + 4𝑣 3 )
 
This appears to require far more than one multiplication, but multiplications such
as 2𝑢2 or 4𝑣 3 are all scalar multiplications, that is, one of the factors is a constant,
and these are discounted.
64 L.-H. Lim

This is the notion of bilinear complexity introduced in Strassen (1987) and is a


very fruitful idea. Instead of trying to design algorithms that minimize all arithmetic
operations at once, we focus on arguably the most expensive one, the multiplication
of variables, as this is the part that cannot be hardwired or hardcoded. On the other
hand, once we have fixed 𝜑, 𝜓, 𝑤, the evaluation of these linear functionals can be
implemented as specialized algorithms in hardware for maximum efficiency. One
example is the discrete Fourier transform,
 𝑥′  1 1 1 1 ··· 1   𝑥0 
 0    
 𝑥′  1 𝜔 𝜔 2 𝜔 3 · · · 𝜔 𝑛−1   𝑥1 
 1    
 𝑥′  1 1 𝜔 2 𝜔 4 𝜔 6 · · · 𝜔 2(𝑛−1)   𝑥 
2
 2′    
 𝑥  = √ 1 𝜔 3
 3  𝑛  𝜔6 𝜔9 ··· 𝜔3(𝑛−1)   𝑥3  . (3.40)
 ..   .. .. .. .. .. ..   .. 
 .  . . . . . .  . 
 ′    
𝑥   𝑛−1 2(𝑛−1) 3(𝑛−1) (𝑛−1)(𝑛−1)  𝑥 𝑛−1 
 𝑛−1  1 𝜔 𝜔 𝜔 ··· 𝜔  
Fast Fourier transform is a specialized algorithm that gives us the value of 𝑥 𝑘′ by
Í
evaluating the linear functional 𝜑(𝑥) = 𝑛−1 𝑗𝑘
𝑗=0 𝜔 𝑥 𝑗 but not in the obvious way, and
it is often built into software libraries or hardware. It is instructive to ask about the
bilinear complexity of (3.40): the answer is in fact zero, as it involves only scalar
multiplications with constants 𝜔 𝑗𝑘 .
By focusing only on variable multiplications, we may neatly characterize bilinear
complexity in terms of its tensor rank in (3.37); in fact, bilinear complexity is
tensor rank. Furthermore, upon finding the algorithms that are optimal in terms of
variable multiplications – these are almost never unique – we may then seek among
them the ones that are optimal in terms of scalar multiplications (i.e. sparsest),
additions, numerical stability, communication cost, energy cost, etc. In fact, one
may often bound the number of additions and scalar multiplications in terms of
the number of variable multiplications. For example, if an algorithm takes 𝑛 𝑝
variable multiplications, we may often show that it takes at most 𝑐𝑛 𝑝 additions and
scalar multiplications for some constant 𝑐 > 0 as each variable multiplication in the
algorithm is accompanied by at most 𝑐 other additions and scalar multiplications;
as a result the algorithm is still 𝑂(𝑛 𝑝 ) even when we count all arithmetic operations.
This will be the case for all problems considered in Examples 3.9 and 3.10, that is,
the operation counts therein include all arithmetic operations. Henceforth, by
‘multiplication’ in the remainder of this section we will always mean ‘variable
multiplication’ unless specified otherwise.
In general, a bilinear operator would not be rank-one as in (3.39). Nevertheless,
as we saw in (3.36), it will always be decomposable into a sum of rank-one terms
Õ
𝑟
B(𝑢, 𝑣) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝑤 𝑖 (3.41)
𝑖=1

for some 𝑟 ∈ N with the smallest possible 𝑟 given by the tensor rank of B. Note that
any decomposition of the form in (3.41) gives us an explicit algorithm for computing
Tensors in computations 65

B with 𝑟 multiplications, and thus rank(B) gives us the bilinear complexity or least
number of multiplications required to compute B. This relation between tensor
rank and evaluation of bilinear operators first appeared in Strassen (1973).
As numerical computations go, there is no need to compute a quantity exactly.
What if we just require the right-hand side (remember this gives an algorithm) of
(3.41) to be an 𝜀-approximation of the left-hand side? This leads to the notion of
border rank:
 Õ𝑟 
𝜀 𝜀 𝜀
rank(B) = min 𝑟 : B(𝑢, 𝑣) = lim+ 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝑤 𝑖 . (3.42)
𝜀→0
𝑖=1

This was first proposed by Bini, Capovani, Romani and Lotti (1979) and Bini, Lotti
and Romani (1980) and may be regarded as providing an algorithm (remembering
that every such decomposition gives an algorithm)
Õ
𝑟
B 𝜀 (𝑢, 𝑣) = 𝜑𝑖𝜀 (𝑢)𝜓𝑖𝜀 (𝑣)𝑤 𝑖𝜀
𝑖=1

that approximates B(𝑢, 𝑣) up to any arbitrary 𝜀-accuracy. While (3.42) relies on a


limit, it may also be defined purely algebraically over any ring as in Knuth (1998,
p. 522) and Bürgisser et al. (1997, Definition 15.19), which are in turn based on
Bini (1980). Clearly rank(B) ≤ rank(B), but one may wonder if there are indeed
instances where the inequality is strict. There are explicit examples in Bini et al.
(1979, 1980), Bürgisser et al. (1997) and Knuth (1998), but they do not reveal the
familiar underlying phenomenon, namely that of a difference quotient converging
to a derivative (De Silva and Lim 2008):
(𝜑1 (𝑢) + 𝜀𝜑2 (𝑢))(𝜓1 (𝑣) + 𝜀𝜓2 (𝑣))(𝑤 1 + 𝜀𝑤 2 ) − 𝜑1 (𝑢)𝜓1 (𝑣)𝑤
lim+
𝜀→0 𝜀
= 𝜑2 (𝑢)𝜓2 (𝑣)𝑤 1 + 𝜑1 (𝑢)𝜓2 (𝑣)𝑤 1 + 𝜑1 (𝑢)𝜓1 (𝑣)𝑤 2 . (3.43)

The left-hand side clearly has rank no more than two; one may show that as long
as 𝜑1 , 𝜑2 are not collinear, and likewise for 𝜓1 , 𝜓2 and 𝑤 1 , 𝑤 2 , the right-hand side
of (3.43) must have rank three, that is, it defines a bilinear operator with rank three
and border rank two.
Tensor rank and border rank are purely algebraic notions defined over any vector
spaces, or even modules, which are generalizations of vector spaces that we will
soon introduce. However, if U, V, W are norm spaces, we may introduce various
notions of norms on bilinear operators B : U × V → W. We will slightly abuse
notation by denoting the norms on all three spaces by k · k. Recall that for a linear
functional 𝜓 : V → R, its dual norm is just
|𝜓(𝑣)|
k𝜓 k ∗ ≔ sup . (3.44)
𝑣≠0 k𝑣k
66 L.-H. Lim

A continuous analogue of tensor rank in (3.37) may then be defined by



𝑟 Õ
𝑟 
kBk 𝜈 ≔ inf k𝜑𝑖 k ∗ k𝜓𝑖 k ∗ k𝑤 k : B(𝑢, 𝑣) = 𝜑𝑖 (𝑢)𝜓𝑖 (𝑣)𝑤 𝑖 (3.45)
𝑖=1 𝑖=1

and we call this a tensor nuclear norm. This defines a norm dual to the spectral
norm in (3.19), which in this case becomes
kB(𝑢, 𝑣)k
kBk 𝜎 = sup . (3.46)
𝑢,𝑣≠0 k𝑢k k𝑣k
We will argue later that the tensor nuclear norm, in an appropriate sense, quantifies
the optimal numerical stability of computing B just as tensor rank quantifies bilinear
complexity.
It will be instructive to begin from some low-dimensional examples where U, V,
W are of dimensions two and three.

Example 3.8 (Gauss’s algorithm for complex multiplication). As is fairly well


known, one may multiply a pair of complex numbers with three instead of the usual
four real multiplications:
(𝑎 + 𝑏𝑖)(𝑐 + 𝑑𝑖) = (𝑎𝑐 − 𝑏𝑑) + 𝑖(𝑏𝑐 + 𝑎𝑑)
= (𝑎𝑐 − 𝑏𝑑) + 𝑖[(𝑎 + 𝑏)(𝑐 + 𝑑) − 𝑎𝑐 − 𝑏𝑑]. (3.47)
The latter algorithm is usually attributed to Gauss. Note that while Gauss’s al-
gorithm uses three multiplications, it involves five additions/subtractions where
the usual algorithm has only two, but in bilinear complexity we only care about
multiplication of variables, which in this case are 𝑎, 𝑏, 𝑐, 𝑑. Observe that complex
multiplication BC : C × C → C, (𝑧, 𝑤) ↦→ 𝑧𝑤 is an R-bilinear map when we identify
C  R2 . So we may regard BC : R2 × R2 → R2 and rewrite (3.47) as
       
𝑎 𝑐 𝑎𝑐 − 𝑏𝑑 𝑎𝑐 − 𝑏𝑑
BC , = = .
𝑏 𝑑 𝑏𝑐 + 𝑎𝑑 (𝑎 + 𝑏)(𝑐 + 𝑑) − 𝑎𝑐 − 𝑏𝑑

The standard basis vectors in R2 ,


   
1 0
𝑒1 = , 𝑒2 = ,
0 1

correspond to 1, 𝑖 ∈ C and they have dual basis 𝑒∗1 , 𝑒∗2 : R2 → R given by


   
∗ 𝑎 ∗ 𝑎
𝑒1 = 𝑎, 𝑒2 = 𝑏.
𝑏 𝑏
Then the standard algorithm is given by the decomposition
BC (𝑧, 𝑤) = [𝑒∗1 (𝑧)𝑒∗1 (𝑤) − 𝑒∗2 (𝑧)𝑒∗2 (𝑤)]𝑒1 + [𝑒∗1 (𝑧)𝑒∗2 (𝑤) + 𝑒∗2 (𝑧)𝑒∗1 (𝑤)]𝑒2 , (3.48)
Tensors in computations 67

whereas Gauss’s algorithm is given by the decomposition

BC (𝑧, 𝑤) = [(𝑒∗1 + 𝑒∗2 )(𝑧)(𝑒∗1 + 𝑒∗2 )(𝑤)]𝑒2


+ [𝑒∗1 (𝑧)𝑒∗1 (𝑤)](𝑒1 − 𝑒2 ) − [𝑒∗2 (𝑧)𝑒∗2 (𝑤)](𝑒1 + 𝑒2 ). (3.49)

One may show that rank(BC ) = 3 = rank(BC ), that is, Gauss’s algorithm has
optimal bilinear complexity whether in the exact or approximate sense. While
using Gauss’s algorithm for actual multiplication of complex numbers is pointless
overkill, it is actually useful in practice (Higham 1992) as one may use it for the
multiplication of complex matrices:

(𝐴 + 𝑖𝐵)(𝐶 + 𝑖𝐷) = (𝐴𝐶 − 𝐵𝐷) + 𝑖[(𝐴 + 𝐵)(𝐶 + 𝐷) − 𝐴𝐶 − 𝐵𝐷] (3.50)

for any 𝐴 + 𝑖𝐵, 𝐶 + 𝑖𝐷 ∈ C𝑛×𝑛 with 𝐴, 𝐵, 𝐶, 𝐷 ∈ R𝑛×𝑛 .

For bilinear maps on two-dimensional vector spaces, Example 3.8 is essentially


the only example. We might ask for, say, parallel evaluation of the standard inner
product and the standard symplectic form on R2 , that is,

𝑔(𝑥, 𝑦) = 𝑥1 𝑦 1 + 𝑥2 𝑦 2 and 𝜔(𝑥, 𝑦) = 𝑥1 𝑦 2 − 𝑥2 𝑦 1 . (3.51)

An algorithm similar to Gauss’s algorithm would yield the result in three multi-
plications, and it is optimal going by either rank or border rank. For bilinear
maps on three-dimensional vector spaces, a natural example is the bilinear operator
B∧ : ∧2 (R𝑛 ) × R𝑛 → R𝑛 given by the 3 × 3 skew-symmetric matrix–vector product
0 𝑎 𝑏  𝑥   𝑎𝑦 + 𝑏𝑧 

−𝑎 0 𝑐   𝑦  =  −𝑎𝑥 + 𝑐𝑧  .
    
−𝑏 −𝑐 0  𝑧  −𝑏𝑥 − 𝑐𝑦 
    

In this case rank(B∧ ) = 5 = rank(B∧ ); see Ye and Lim (2018a, Proposition 12)
and Krishna and Makam (2018, Theorem 1.3). For a truly interesting example,
one would have to look at bilinear maps on four-dimensional vector spaces and we
shall do so next.

Example 3.9 (Strassen’s algorithm for matrix multiplication). Strassen (1969)


discovered an algorithm for inverting an 𝑛 × 𝑛 matrix in fewer than 5.64𝑛log 2 7
arithmetic operations (counting both additions and multiplications). This was a
huge surprise at that time as there were results proving that one may not do this with
fewer multiplications than the 𝑛3 /3 required by Gaussian elimination (Kljuev and
Kokovkin-Ščerbak 1965). The issue is that such impossibility results invariably
assume that one is limited to row and column operations. Strassen’s algorithm, on
the other hand, is based on block operations combined with a novel algorithm that
68 L.-H. Lim

multiplies two 2 × 2 matrices with seven multiplications:


  
𝑎1 𝑎2 𝑏1 𝑏2
𝑎3 𝑎4 𝑏3 𝑏4
 
𝑎1 𝑏1 + 𝑎2 𝑏3 𝛽 + 𝛾 + (𝑎1 + 𝑎2 − 𝑎3 − 𝑎4 )𝑏4
=
𝛼 + 𝛾 + 𝑎4 (𝑏2 + 𝑏3 − 𝑏1 − 𝑏4 ) 𝛼+𝛽+𝛾

with

𝛼 = (𝑎3 − 𝑎1 )(𝑏2 − 𝑏4 ), 𝛽 = (𝑎3 + 𝑎4 )(𝑏2 − 𝑏1 ),


𝛾 = 𝑎1 𝑏1 + (𝑎3 + 𝑎4 − 𝑎1 )(𝑏1 − 𝑏2 + 𝑏4 ).

As in the case of Gauss’s algorithm, the saving of one multiplication comes at the
cost of an increase in the number of additions/subtractions from eight to fifteen.
In fact Strassen’s original version had eighteen; the version presented here is the
well-known but unpublished Winograd variant discussed in Knuth (1998, p. 500)
and Higham (2002, equation 23.6). Recursively applying this algorithm to 2 × 2
block matrices produces an algorithm for multiplying 𝑛 × 𝑛 matrices with 𝑂(𝑛log2 7 )
multiplications. More generally, the bilinear operator defined by

M𝑚,𝑛, 𝑝 : R𝑚×𝑛 × R𝑛× 𝑝 → R𝑚× 𝑝 , (𝐴, 𝐵) ↦→ 𝐴𝐵,

is called either the matrix multiplication tensor or Strassen’s tensor. Every decom-
position
Õ
𝑟
M𝑚,𝑛, 𝑝 (𝐴, 𝐵) = 𝜑𝑖 (𝐴)𝜓𝑖 (𝐵)𝑊𝑖
𝑖=1

with linear functionals 𝜑𝑖 : R𝑚×𝑛 → R, 𝜓𝑖 : R𝑛× 𝑝 → R and 𝑊𝑖 ∈ R𝑚× 𝑝 gives us


an algorithm for multiplying an 𝑚 × 𝑛 matrix by an 𝑛 × 𝑝 matrix that requires just 𝑟
multiplications, with the smallest possible 𝑟 given by rank(M𝑚,𝑛, 𝑝 ). The exponent
of matrix multiplication is defined to be

𝜔 ≔ inf{𝑝 ∈ R : rank(M𝑛,𝑛,𝑛 ) = 𝑂(𝑛 𝑝 )}.

Strassen’s work showed that 𝜔 < log2 7 ≈ 2.807 354 9, and this has been improved
over the years to 𝜔 < 2.372 859 6 at the time of writing (Alman and Williams 2021).
Recall that any linear functional 𝜑 : R𝑚×𝑛 → R must take the form 𝜑(𝐴) = tr(𝑉 T 𝐴)
for some matrix 𝑉 ∈ R𝑚×𝑛 , a consequence of the Riesz representation theorem
for an inner product space. For concreteness, when 𝑚 = 𝑛 = 𝑝 = 2, Winograd’s
variant of Strassen’s algorithm is given by
7
Õ
M2,2,2 (𝐴, 𝐵) = 𝜑𝑖 (𝐴)𝜓𝑖 (𝐵)𝑊𝑖 ,
𝑖=1
Tensors in computations 69

where 𝜑𝑖 (𝐴) = tr(𝑈𝑖T 𝐴) and 𝜓𝑖 (𝐵) = tr(𝑉𝑖T 𝐵) with


     
−1 0 1 −1 0 1
𝑈1 = , 𝑉1 = , 𝑊1 = ,
1 1 0 1 1 1
     
1 0 1 0 1 1
𝑈2 = , 𝑉2 = , 𝑊2 = ,
0 0 0 0 1 1
     
0 1 0 0 1 0
𝑈3 = , 𝑉3 = , 𝑊3 = ,
0 0 1 0 0 0
     
1 0 0 −1 0 0
𝑈4 = , 𝑉4 = , 𝑊4 = ,
−1 0 0 1 1 1
     
0 0 −1 1 0 1
𝑈5 = , 𝑉5 = , 𝑊5 = ,
1 1 0 0 0 1
     
1 1 0 0 0 1
𝑈6 = , 𝑉6 = , 𝑊6 = ,
−1 −1 0 1 0 0
     
0 0 1 −1 0 0
𝑈7 = , 𝑉7 = , 𝑊7 = .
0 1 −1 1 −1 0
The name ‘exponent of matrix multiplication’ is in retrospect a misnomer: 𝜔
should rightly be called the exponent of nearly all matrix computations, as we will
see next.
Example 3.10 (beyond matrix multiplication). This example is a summary of
the highly informative book by Bürgisser et al. (1997, Chapter 16), with a few more
items drawn from Schönhage (1972/73), but stripped of the technical details to make
the message clear, namely that 𝜔 pervades numerical linear algebra. Consider the
following problems.13
(i) Inversion. Given 𝐴 ∈ GL(𝑛), find 𝐴−1 ∈ GL(𝑛).
(ii) Determinant. Given 𝐴 ∈ GL(𝑛), find det(𝐴) ∈ R.
(iii) Null basis. Given 𝐴 ∈ R𝑛×𝑛 , find a basis 𝑣 1 , . . . , 𝑣 𝑚 ∈ R𝑛 of ker(𝐴).
(iv) Linear system. Given 𝐴 ∈ GL(𝑛) and 𝑏 ∈ R𝑛 , find 𝑣 ∈ R𝑛 so that 𝐴𝑣 = 𝑏.
(v) LU decomposition. Given 𝐴 ∈ R𝑚×𝑛 of full rank, find permutation 𝑃, unit
lower triangular 𝐿 ∈ R𝑚×𝑚 , upper triangular 𝑈 ∈ R𝑚×𝑛 so that 𝑃 𝐴 = 𝐿𝑈.
(vi) QR decomposition. Given 𝐴 ∈ R𝑛×𝑛 , find orthogonal 𝑄 ∈ O(𝑛), upper
triangular 𝑈 ∈ R𝑛×𝑛 so that 𝐴 = 𝑄𝑅.
(vii) Eigenvalue decomposition. Given 𝐴 ∈ S𝑛 , find 𝑄 ∈ O(𝑛) and a diagonal
matrix Λ ∈ R𝑛×𝑛 so that 𝐴 = 𝑄Λ𝑄 T .
(viii) Hessenberg decomposition. Given 𝐴 ∈ R𝑛×𝑛 , find 𝑄 ∈ O(𝑛) and an upper
Hessenberg matrix 𝐻 ∈ R𝑛×𝑛 so that 𝐴 = 𝑄𝐻𝑄 T .
13 An upper Hessenberg matrix 𝐻 is one where ℎ𝑖 𝑗 = 0 whenever 𝑖 > 𝑗 + 1 and nnz(𝐴) ≔
#{(𝑖, 𝑗) : 𝑎 𝑖 𝑗 ≠ 0} counts the number of non-zero entries.
70 L.-H. Lim

(ix) Characteristic polynomial. Given 𝐴 ∈ R𝑛×𝑛 , find (𝑎0 , . . . , 𝑎 𝑛−1 ) ∈ R𝑛 so that


det(𝑥𝐼 − 𝐴) = 𝑥 𝑛 + 𝑎 𝑛−1 𝑥 𝑛−1 + · · · + 𝑎1 𝑥 + 𝑎0 .
(x) Sparsification. Given 𝐴 ∈ R𝑛×𝑛 and 𝑐 ∈ [1, ∞), find 𝑋, 𝑌 ∈ GL(𝑛) so that
nnz(𝑋 𝐴𝑌 −1 ) ≤ 𝑐𝑛.
For each of problems (i)–(x), if there is an algorithm that computes the 𝑛 × 𝑛
matrix product in 𝑂(𝑛 𝑝 ) multiplications, then there is an algorithm that solves that
problem in 𝑂(𝑛 𝑝 ) or possibly 𝑂(𝑛 𝑝 log 𝑛) arithmetic operations, i.e. inclusive of
additions and scalar multiplications. By the definition of 𝜔 in Example 3.9, there
is an algorithm for the 𝑛 × 𝑛 matrix product in 𝑂(𝑛 𝜔+𝜀 ) multiplications for any
𝜀 > 0. So the ‘exponents’ of problems (i)–(x), which may be properly defined
(Bürgisser et al. 1997, Definition 16.1) even though they are not bilinear operators,
are all equal to or bounded by 𝜔. These results are known to be sharp, that is, one
cannot do better than 𝜔, except for problems (iv), (vi) and (viii). In particular, it
is still an open problem whether one might solve a non-singular linear system with
fewer than 𝑂(𝑛 𝜔 ) arithmetic operations asymptotically (Strassen 1990).
To get an inkling of why these results hold, a key idea, mentioned in Example 3.9
and well known to practitioners of numerical linear algebra, is to avoid working
with rows and columns and work with blocks instead, making recursive use of
formulas for block factorization, inversion and determinant similar to
      
[𝐴 𝐵; 𝐶 𝐷] = [𝐼 0; 𝐶𝐴^{−1} 𝐼][𝐴 𝐵; 0 𝑆],    det[𝐴 𝐵; 𝐶 𝐷] = det(𝐴) det(𝑆),

[𝐴 𝐵; 𝐶 𝐷]^{−1} = [𝐴^{−1} + 𝐴^{−1}𝐵𝑆^{−1}𝐶𝐴^{−1}   −𝐴^{−1}𝐵𝑆^{−1};   −𝑆^{−1}𝐶𝐴^{−1}   𝑆^{−1}],
where 𝑆 = 𝐷 − 𝐶 𝐴−1 𝐵 is the Schur complement. For illustration, Strassen’s
algorithm for matrix inversion (Strassen 1969) in 𝑂(𝑛 𝜔+𝜀 ) complexity computes
[𝐴 𝐵; 𝐶 𝐷]^{−1} = [𝑋_1 − 𝑋_3𝑋_6𝑋_2   𝑋_3𝑋_6;   𝑋_6𝑋_2   −𝑋_6],
𝑋_1 = 𝐴^{−1}, 𝑋_2 = 𝐶𝑋_1, 𝑋_3 = 𝑋_1𝐵, 𝑋_4 = 𝐶𝑋_3, 𝑋_5 = 𝑋_4 − 𝐷, 𝑋_6 = 𝑋_5^{−1},
with the inversion in 𝑋1 and 𝑋6 performed recursively using the same algorithm
(Higham 2002, p. 449).
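A minimal Python sketch of this recursion is given below; it assumes that every leading principal block encountered along the way is invertible (a practical implementation would add pivoting and the stability safeguards discussed next), and its two recursive calls mirror the inversions in 𝑋_1 and 𝑋_6.

```python
import numpy as np

def block_inverse(M):
    """Sketch of Strassen's recursive block inversion; assumes the leading
    principal blocks encountered are invertible."""
    n = M.shape[0]
    if n == 1:
        return np.array([[1.0 / M[0, 0]]])
    k = n // 2
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    X1 = block_inverse(A)          # A^{-1}, computed recursively
    X2 = C @ X1
    X3 = X1 @ B
    X4 = C @ X3
    X5 = X4 - D                    # minus the Schur complement
    X6 = block_inverse(X5)         # X5^{-1}, computed recursively
    top = np.hstack([X1 - X3 @ X6 @ X2, X3 @ X6])
    bottom = np.hstack([X6 @ X2, -X6])
    return np.vstack([top, bottom])

M = np.random.rand(8, 8) + 8 * np.eye(8)   # a well-conditioned test matrix
print(np.allclose(block_inverse(M) @ M, np.eye(8)))
```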
We would like to stress that the results in Examples 3.9 and 3.10, while certainly
easy to state, took the combined efforts of a multitude of superb scholars over
many decades to establish. We refer interested readers to Bürgisser et al. (1997,
Sections 15.13 and 16.12), Knuth (1998, Section 4.6.4) and Landsberg (2017,
Chapters 1–5) for further information on the work involved.
The examples above are primarily about the tensor rank of a bilinear operator
over vector spaces. We next consider three variations on this main theme:
• nuclear/spectral norms instead of tensor rank,
• modules instead of vector spaces,


• trilinear functionals instead of bilinear operators.
An algorithm, however fast, is generally not practical if it lacks numerical stability,
as the correctness of its output may no longer be guaranteed in finite precision.
Fortunately, Strassen’s algorithm in Example 3.9 is only slightly less stable than
the standard algorithm for matrix multiplication, an unpublished result14 of Brent
(1970) that has been reproduced and extended in Higham (2002, Theorems 23.2 and
23.3). Numerical stability is, however, a complicated issue that depends on many
factors and on hardware architecture. Designing numerically stable algorithms
is as much an art as it is a science, but the six Higham guidelines for numerical
stability (Higham 2002, Section 1.18) capture the most salient aspects. The second
guideline, to ‘minimize the size of intermediate quantities relative to the final
solution’, together with the fifth are the two most unequivocal ones. We will
discuss how such considerations naturally lead us to the definition of tensor nuclear
norms in (3.45).

Example 3.11 (nuclear norm and numerical stability). As we have discussed, an algorithm for evaluating a bilinear operator is a decomposition of the form (3.41),

B(𝑢, 𝑣) = ∑_{𝑖=1}^{𝑟} 𝜑_𝑖(𝑢) 𝜓_𝑖(𝑣) 𝑤_𝑖,    (3.52)

where 𝑟 may or may not be minimal. An intermediate quantity has a natural interpretation as a rank-one term 𝜑_𝑖(𝑢)𝜓_𝑖(𝑣)𝑤_𝑖 among the summands and its size also has a natural interpretation as its norm. Note that such a rank-one bilinear operator has nuclear and spectral norms equal to each other and to ‖𝜑_𝑖‖_∗ ‖𝜓_𝑖‖_∗ ‖𝑤_𝑖‖.
Hence the sum
∑_{𝑖=1}^{𝑟} ‖𝜑_𝑖‖_∗ ‖𝜓_𝑖‖_∗ ‖𝑤_𝑖‖

measures the total size of the intermediate quantities in the algorithm (3.52) and
its minimum value, given by the nuclear norm of B as defined in (3.45), provides a
measure of optimal numerical stability in the sense of Higham’s second guideline.
This was first discussed in Ye and Lim (2018a, Section 3.2). Using complex
multiplication BC in Example 3.8 for illustration, one may show that the nuclear
norm (Friedland and Lim 2018, Lemma 6.1) is given by
‖BC‖_𝜈 = 4.

14 For 𝐶 = 𝐴𝐵 ∈ R𝑛×𝑛, we have ‖𝐶 − 𝐶̂‖ ≤ 𝑐_𝑛 ‖𝐴‖ ‖𝐵‖ 𝜀 + 𝑂(𝜀²) with 𝑐_𝑛 = 𝑛² (standard), 𝑐𝑛^{log2 12} (Strassen), 𝑐′𝑛^{log2 18} (Winograd), using ‖𝑋‖ = max_{𝑖,𝑗=1,...,𝑛} |𝑥_{𝑖𝑗}|.

The standard algorithm (3.48) attains this minimum value:

‖𝑒∗1‖_∗ ‖𝑒∗1‖_∗ ‖𝑒1‖ + ‖−𝑒∗2‖_∗ ‖𝑒∗2‖_∗ ‖𝑒1‖ + ‖𝑒∗1‖_∗ ‖𝑒∗2‖_∗ ‖𝑒2‖ + ‖𝑒∗2‖_∗ ‖𝑒∗1‖_∗ ‖𝑒2‖ = 4 = ‖BC‖_𝜈

but not Gauss’s algorithm (3.49), which has total size

‖𝑒∗1 + 𝑒∗2‖_∗ ‖𝑒∗1 + 𝑒∗2‖_∗ ‖𝑒2‖ + ‖𝑒∗1‖_∗ ‖𝑒∗1‖_∗ ‖𝑒1 − 𝑒2‖ + ‖−𝑒∗2‖_∗ ‖𝑒∗2‖_∗ ‖𝑒1 + 𝑒2‖ = 2(1 + √2) > ‖BC‖_𝜈.
So Gauss’s algorithm is faster (by bilinear complexity) but less stable (by Higham’s
second guideline) than the standard algorithm. One might ask if it is possible to
have an algorithm for complex multiplication that is optimal in both measures. It
turns out that the decomposition15

BC(𝑧, 𝑤) = (4/3)[(√3/2 𝑒∗1(𝑧) + 1/2 𝑒∗2(𝑧))(√3/2 𝑒∗1(𝑤) + 1/2 𝑒∗2(𝑤))(1/2 𝑒1 + √3/2 𝑒2)
         + (√3/2 𝑒∗1(𝑧) − 1/2 𝑒∗2(𝑧))(√3/2 𝑒∗1(𝑤) − 1/2 𝑒∗2(𝑤))(1/2 𝑒1 − √3/2 𝑒2)
         − 𝑒∗2(𝑧)𝑒∗2(𝑤)𝑒1]    (3.53)

gives an algorithm that attains both rank(BC ) and kBC k 𝜈 . In conventional notation,
this algorithm multiplies two complex numbers via
(𝑎 + 𝑏𝑖)(𝑐 + 𝑑𝑖)
= (1/2)[(𝑎 + 𝑏/√3)(𝑐 + 𝑑/√3) + (𝑎 − 𝑏/√3)(𝑐 − 𝑑/√3) − (8/3)𝑏𝑑]
  + (𝑖√3/2)[(𝑎 + 𝑏/√3)(𝑐 + 𝑑/√3) − (𝑎 − 𝑏/√3)(𝑐 − 𝑑/√3)].
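The following sketch implements this three-(real-)multiplication algorithm and checks it against the usual formula 𝑎𝑐 − 𝑏𝑑, 𝑎𝑑 + 𝑏𝑐 for random inputs; multiplications by the constants 1/√3, 1/2 and 8/3 are scalar multiplications and are not counted in the bilinear complexity.

```python
import numpy as np

def cmul3(a, b, c, d):
    """Multiply (a+bi)(c+di) with three real multiplications, following the
    rank- and nuclear-norm-optimal decomposition (3.53); a sketch, not a
    production routine."""
    s = 1.0 / np.sqrt(3.0)
    p = (a + s * b) * (c + s * d)      # multiplication 1
    q = (a - s * b) * (c - s * d)      # multiplication 2
    r = b * d                          # multiplication 3
    real = 0.5 * (p + q - (8.0 / 3.0) * r)
    imag = 0.5 * np.sqrt(3.0) * (p - q)
    return real, imag

a, b, c, d = np.random.randn(4)
print(np.allclose(cmul3(a, b, c, d), (a*c - b*d, a*d + b*c)))
```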
We stress that the nuclear norm is solely a measure of stability in the sense of
Higham’s second guideline. Numerical stability is too complicated an issue to
be adequately captured by a single number. For instance, from the perspective of
cancellation errors, our algorithm above also suffers from the same issue pointed
out in Higham
(2002, Section 23.2.4) for Gauss’s algorithm. By choosing 𝑧 = 𝑤 and 𝑏 = √3/𝑎, our algorithm computes

(1/2)[(𝑎 + 1/𝑎)² + (𝑎 − 1/𝑎)² − 8/𝑎²] + (𝑖√3/2)[(𝑎 + 1/𝑎)² − (𝑎 − 1/𝑎)²] ≕ 𝑥 + 𝑖𝑦.
There will be cancellation error in the computed real part x̂ when |𝑎| is small and likewise in the computed imaginary part ŷ when |𝑎| is large. Nevertheless, as discussed in Higham (2002, Section 23.2.4), the algorithm is still stable in the weaker sense of having acceptably small |𝑥 − x̂|/|𝑧| and |𝑦 − ŷ|/|𝑧| even if |𝑥 − x̂|/|𝑥| or |𝑦 − ŷ|/|𝑦| might be large.

15 We have gone to some lengths to avoid the tensor product ⊗ in this section, preferring to defer it to Section 4. The decompositions (3.48), (3.49), (3.53) would be considerably neater if expressed in tensor product form, giving another impetus for definition ➂.
As is the case for Example 3.8, the algorithm for complex multiplication above
is useful only when applied to complex matrices. When 𝐴, 𝐵, 𝐶, 𝐷 ∈ R𝑛×𝑛 , the
algorithms in Examples 3.8 and 3.11 provide substantial savings when used to
multiply 𝐴 + 𝑖𝐵, 𝐶 + 𝑖𝐷 ∈ C𝑛×𝑛 . This gives a good reason for extending multilinear
maps and tensors to modules (Lang 2002), i.e. vector spaces whose field of scalars
is replaced by other more general rings. Formally, if 𝑅 is a ring with multiplicative
identity 1 (which we will assume henceforth), an 𝑅-module V is a commutative
group under a group operation denoted + and has a scalar multiplication operation
denoted · such that, for all 𝑎, 𝑏 ∈ 𝑅 and 𝑣, 𝑤 ∈ V,
(i) 𝑎 · (𝑣 + 𝑤) = 𝑎 · 𝑣 + 𝑎 · 𝑤,
(ii) (𝑎 + 𝑏) · 𝑣 = 𝑎 · 𝑣 + 𝑏 · 𝑣,
(iii) (𝑎𝑏) · 𝑣 = 𝑎 · (𝑏 · 𝑣),
(iv) 1 · 𝑣 = 𝑣.
Clearly, when 𝑅 = R or C or any other field, a module just becomes a vector
space. What is useful about the notion is that it allows us to include rings of objects
we would not normally consider as ‘scalars’. For example, in (3.47) we regard
C as a two-dimensional vector space over R, but in (3.50) we regard C𝑛×𝑛 as a
two-dimensional16 module over R𝑛×𝑛 . So in the latter the ‘scalars’ are actually
matrices, i.e. 𝑅 = R𝑛×𝑛 . When we consider block matrix operations on square
matrices such as those on page 70, we are implicitly doing linear algebra over the
ring 𝑅 = R𝑛×𝑛 , which is not even commutative.
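As an illustration of working over the ring 𝑅 = R𝑛×𝑛, the sketch below multiplies 𝐴 + 𝑖𝐵 and 𝐶 + 𝑖𝐷 with three real matrix products; this is one common variant of the three-multiplication trick of Example 3.8 (not necessarily the exact form of (3.49)), with matrices playing the role of scalars.

```python
import numpy as np

def cmatmul3(A, B, C, D):
    """Compute (A+iB)(C+iD) with three real n-by-n matrix products; one
    common variant of Gauss-style complex multiplication, applied over the
    ring of n-by-n real matrices."""
    P1 = A @ C
    P2 = B @ D
    P3 = (A + B) @ (C + D)
    return P1 - P2, P3 - P1 - P2      # real part, imaginary part

n = 50
A, B, C, D = (np.random.randn(n, n) for _ in range(4))
E, F = cmatmul3(A, B, C, D)
Z = (A + 1j * B) @ (C + 1j * D)
print(np.allclose(E, Z.real), np.allclose(F, Z.imag))
```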
Many standard notions in linear and multilinear algebra carry through from
vector spaces to modules with little or no change. For example, the multilinear
maps and multilinear functionals of Definitions 3.1 and 3.3 apply verbatim to
modules with the field of scalars R replaced by any ring 𝑅. In other words, the
notion of tensors in the sense of definition ➁ applies equally well over modules.
We will discuss three examples of multilinear maps over modules: tensor fields,
fast integer multiplication algorithms and cryptographic multilinear maps.
When people speak of tensors in physics or geometry, they often really mean a
tensor field. As a result one may find statements of the tensor transformation rules
that bear little resemblance to our version. The next two examples are intended
to describe the analogues of definitions ➀ and ➁ for a tensor field and show how
they fit into the narrative of this article. Also, outside pure mathematics, defining
a tensor field is by far the biggest reason for considering multilinear maps over
modules.
16 Strictly speaking, the terminology over modules should be length instead of dimension.

Example 3.12 (modules and tensor fields). A tensor field is – roughly speaking
– a tensor-valued function on a manifold. Let 𝑀 be a smooth manifold and
𝐶 ∞(𝑀) ≔ { 𝑓 : 𝑀 → R : 𝑓 is smooth}. Note that 𝐶 ∞ (𝑀) is a ring, as products
and linear combinations of smooth real-valued functions are smooth (henceforth
we drop the word ‘smooth’ as all functions and fields in this example are assumed
smooth). Thus 𝑅 = 𝐶 ∞ (𝑀) may play the role of the ring of scalars. A 0-tensor
field is just a function in 𝐶 ∞ (𝑀). A contravariant 1-tensor field is a vector field,
i.e. a function 𝑓 whose value at 𝑥 ∈ 𝑀 is a vector in the tangent space at 𝑥, denoted
T 𝑥 (𝑀). A covariant 1-tensor field is a covector field, i.e. a function 𝑓 whose value
at 𝑥 ∈ 𝑀 is a vector in the cotangent space at 𝑥, denoted T∗𝑥 (𝑀). Here T 𝑥 (𝑀) is a
vector space and T∗𝑥 (𝑀) is its dual vector space, in which case a 𝑑-tensor field of
contravariant order 𝑝 and covariant order 𝑑 − 𝑝 is simply a function 𝜑 whose value
at 𝑥 ∈ 𝑀 is such a 𝑑-tensor, that is, by Definition 3.3, a multilinear functional
𝜑(𝑥) : T∗_𝑥(𝑀) × · · · × T∗_𝑥(𝑀) × T_𝑥(𝑀) × · · · × T_𝑥(𝑀) → R
(𝑝 copies of T∗_𝑥(𝑀) followed by 𝑑 − 𝑝 copies of T_𝑥(𝑀)).

This seems pretty straightforward; the catch is that 𝜑 is not a function in the usual
sense, which has a fixed domain and a fixed codomain, but the codomain of 𝜑
depends on the point 𝑥 where it is evaluated. So we have only defined a tensor field
at a point 𝑥 ∈ 𝑀 but we still need a way to ‘glue’ all these pointwise definitions
together. The customary way to do this is via coordinate charts and transition maps,
but an alternative is to simply define the tangent bundle
T(𝑀) ≔ {(𝑥, 𝑣) : 𝑥 ∈ 𝑀, 𝑣 ∈ T 𝑥 (𝑀)}
and cotangent bundle
T∗ (𝑀) ≔ {(𝑥, 𝜑) : 𝑥 ∈ 𝑀, 𝜑 ∈ T∗𝑥 (𝑀)}
and observe that these are 𝐶 ∞ (𝑀)-modules with scalar product given by pointwise
multiplication of real-valued functions with vector/covector fields. A 𝑑-tensor
field of contravariant order 𝑝 and covariant order 𝑑 − 𝑝 is then defined to be the
multilinear functional
𝜑 : T∗(𝑀) × · · · × T∗(𝑀) × T(𝑀) × · · · × T(𝑀) → 𝐶^∞(𝑀)
(𝑝 copies of T∗(𝑀) followed by 𝑑 − 𝑝 copies of T(𝑀)).

Note that this is a multilinear functional in the sense of modules: the ‘scalars’ are
drawn from a ring 𝑅 = 𝐶 ∞ (𝑀) and the ‘vector spaces’ are the 𝑅-modules T(𝑀)
and T∗ (𝑀). This is another upside of defining tensors via definition ➁; it may be
easily extended to include tensor fields.
What about definition ➀? The first thing to note is that not every result in linear
algebra over vector spaces carries over to modules. An example is the notion of
basis. While some modules do have a basis – for example, when we speak of
C𝑛×𝑛 as a two-dimensional R𝑛×𝑛 -module, it is with the basis {1, 𝑖} in mind – others
such as the 𝐶 ∞ (𝑀)-module of 𝑑-tensor fields may not have a basis when 𝑑 ≥ 1.
An explicit and well-known example is given by vector fields on the 2-sphere 𝑆2 ,


which as a 𝐶 ∞ (𝑆2 )-module does not have a basis because of the hairy ball theorem
(Passman 1991, Proposition 17.7). Consequently, given a 𝑑-tensor field on 𝑀, the
idea that one may just choose bases and represent it as 𝑑-dimensional hypermatrix
with entries in 𝐶 ∞(𝑀) is false even when 𝑑 = 1. There is, however, one important
exception, namely when 𝑀 = R𝑛 , in which case, given a 𝑑-tensor field 𝜑, it is
possible to write down a 𝑑-hypermatrix

𝐴(𝑣) = [𝜑_{𝑖𝑗···𝑘}(𝑣)]_{𝑖,𝑗,...,𝑘=1}^{𝑛} ∈ 𝑅^{𝑛×𝑛×···×𝑛}

with entries in 𝑅 = 𝐶 ∞(R𝑛 ) that holds for all 𝑣 ∈ R𝑛 . The analogue of the
transformation rules is a bit more complicated since we are now allowed a non-
linear change of coordinates 𝑣 ′ = 𝐹(𝑣) as opposed to merely a linear change of
basis as in Section 2.2. Here the change-of-coordinates function 𝐹 : 𝑁(𝑣) → R𝑛
is any smooth function defined on a neighbourhood 𝑁(𝑣) ⊆ R𝑛 of 𝑣 that is locally
invertible, that is, the derivative 𝐷𝐹(𝑣) as defined in Example 3.2 is an invertible
linear map in a neighbourhood of 𝑣. This is sometimes called a curvilinear change
of coordinates. The analogue of tensor transformation rule (2.12) for a 𝑑-tensor
field on R𝑛 is then

𝐴(𝑣 ′) = (𝐷𝐹(𝑣)−1 , . . . , 𝐷𝐹(𝑣)−1 , 𝐷𝐹(𝑣)T , . . . , 𝐷𝐹(𝑣)T ) · 𝐴(𝑣). (3.54)

If we have 𝑣 ′ = 𝑋𝑣 for some 𝑋 ∈ GL(𝑛), then 𝐷𝐹(𝑣) = 𝑋 and we recover (2.12).


More generally, if we write 𝑣 = (𝑣 1 , . . . , 𝑣 𝑛 ) and 𝑣 ′ = (𝑣 1′ , . . . , 𝑣 𝑛′ ), then 𝐷𝐹(𝑣) and
𝐷𝐹(𝑣)−1 may be written down as Jacobian matrices,

𝐷𝐹(𝑣) = [ ∂𝑣′_1/∂𝑣_1 ··· ∂𝑣′_1/∂𝑣_𝑛 ;  ⋮ ⋱ ⋮ ;  ∂𝑣′_𝑛/∂𝑣_1 ··· ∂𝑣′_𝑛/∂𝑣_𝑛 ],
𝐷𝐹(𝑣)^{−1} = [ ∂𝑣_1/∂𝑣′_1 ··· ∂𝑣_1/∂𝑣′_𝑛 ;  ⋮ ⋱ ⋮ ;  ∂𝑣_𝑛/∂𝑣′_1 ··· ∂𝑣_𝑛/∂𝑣′_𝑛 ],

where the latter is a consequence of the inverse function theorem: 𝐷𝐹(𝑣)^{−1} = 𝐷𝐹^{−1}(𝑣′) for 𝑣′ = 𝐹(𝑣).
Take, for instance, Cartesian coordinates 𝑣 = (𝑥, 𝑦, 𝑧) and spherical coordinates
𝑣 = (𝑟, 𝜃, 𝜑) on R3 . Then

𝑟 = √(𝑥² + 𝑦² + 𝑧²),        𝑥 = 𝑟 sin 𝜃 cos 𝜑,
𝜃 = arctan(√(𝑥² + 𝑦²)/𝑧),   𝑦 = 𝑟 sin 𝜃 sin 𝜑,        (3.55)
𝜑 = arctan(𝑦/𝑥),            𝑧 = 𝑟 cos 𝜃,
and

𝐷𝐹(𝑣) = ∂(𝑟, 𝜃, 𝜑)/∂(𝑥, 𝑦, 𝑧)
       = [ 𝑥/𝑟,  𝑦/𝑟,  𝑧/𝑟 ;
           𝑥𝑧/(𝑟²√(𝑥²+𝑦²)),  𝑦𝑧/(𝑟²√(𝑥²+𝑦²)),  −(𝑥²+𝑦²)/(𝑟²√(𝑥²+𝑦²)) ;
           −𝑦/(𝑥²+𝑦²),  𝑥/(𝑥²+𝑦²),  0 ],

𝐷𝐹(𝑣)^{−1} = ∂(𝑥, 𝑦, 𝑧)/∂(𝑟, 𝜃, 𝜑)
       = [ sin 𝜃 cos 𝜑,  𝑟 cos 𝜃 cos 𝜑,  −𝑟 sin 𝜃 sin 𝜑 ;
           sin 𝜃 sin 𝜑,  𝑟 cos 𝜃 sin 𝜑,  𝑟 sin 𝜃 cos 𝜑 ;
           cos 𝜃,  −𝑟 sin 𝜃,  0 ].
Note that either of the last two matrices may be expressed solely in terms of 𝑥, 𝑦, 𝑧
or solely in terms of 𝑟, 𝜃, 𝜑; the forms above are chosen for convenience. The
transformation rule (3.54) then allows us to transform a hypermatrix in 𝑥, 𝑦, 𝑧 to
one in 𝑟, 𝜃, 𝜑 or vice versa.
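As a quick sanity check of the two Jacobians above, the following sketch evaluates them at a sample point and verifies 𝐷𝐹(𝑣)𝐷𝐹(𝑣)^{−1} = 𝐼, as the inverse function theorem asserts; arctan is replaced by arctan2 only to get the quadrants right.

```python
import numpy as np

def DF(x, y, z):
    # Jacobian of (x, y, z) -> (r, theta, phi), written in terms of x, y, z
    r = np.sqrt(x*x + y*y + z*z)
    rho2 = x*x + y*y
    rho = np.sqrt(rho2)
    return np.array([
        [x/r,            y/r,            z/r            ],
        [x*z/(r*r*rho),  y*z/(r*r*rho), -rho2/(r*r*rho) ],
        [-y/rho2,        x/rho2,         0.0            ],
    ])

def DFinv(r, theta, phi):
    # Jacobian of (r, theta, phi) -> (x, y, z)
    st, ct, sp, cp = np.sin(theta), np.cos(theta), np.sin(phi), np.cos(phi)
    return np.array([
        [st*cp, r*ct*cp, -r*st*sp],
        [st*sp, r*ct*sp,  r*st*cp],
        [ct,   -r*st,     0.0    ],
    ])

x, y, z = 0.3, -1.2, 0.7
r = np.sqrt(x*x + y*y + z*z)
theta = np.arctan2(np.sqrt(x*x + y*y), z)
phi = np.arctan2(y, x)
print(np.allclose(DF(x, y, z) @ DFinv(r, theta, phi), np.eye(3)))
```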
Example 3.13 (tensor fields over manifolds). Since a smooth manifold 𝑀 loc-
ally resembles R𝑛 , we may take (3.54) as the local transformation rule for a tensor
field 𝜑. It describes how, for a point 𝑥 ∈ 𝑀, a hypermatrix 𝐴(𝑣(𝑥)) representing 𝜑
in one system of local coordinates 𝑣 : 𝑁(𝑥) → R𝑛 on a neighbourhood 𝑁(𝑥) ⊆ 𝑀
is related to another hypermatrix 𝐴(𝑣 ′(𝑥)) representing 𝜑 in another system of local
coordinates 𝑣 ′ : 𝑁 ′(𝑥) → R𝑛 . As is the case for the tensor transformation rule
(2.12), one may use the tensor field transformation rule (3.54) to ascertain whether
or not a physical or mathematical object represented locally by a hypermatrix
with entries in 𝐶 ∞(𝑀) is a tensor field; this is how one would usually show that
connection forms or Christoffel symbols are not tensor fields (Simmonds 1994,
pp. 64–65).
In geometry, tensor fields play a central role, as geometric structures are nearly
always tensor fields (exceptions such as Finsler metrics tend to be less studied).
The most common ones include Riemannian metrics, which are symmetric 2-tensor
fields 𝑔 : T(𝑀) × T(𝑀) → 𝐶 ∞ (𝑀), and symplectic forms, which are alternating
2-tensor fields 𝜔 : T(𝑀) × T(𝑀) → 𝐶 ∞(𝑀), toy examples of which we have
seen in (3.51). As we saw in the previous example, the change-of-coordinates
maps for tensors are invertible linear maps, but for tensor fields they are locally
invertible linear maps; these are called diffeomorphisms, Riemannian isometries,
symplectomorphisms, etc., depending on what geometric structures they preserve.
The analogues of the finite-dimensional matrix groups in (2.14) then become
infinite-dimensional Lie groups such as Diff(𝑀), Iso(𝑀, 𝑔), Symp(𝑀, 𝜔), etc.
Although beyond the scope of our article, there are two tensor fields that are too
important to be left completely unmentioned: the Ricci curvature tensor and the
Riemann curvature tensor. Without introducing additional materials, a straight-
forward way to define them is to use a special system of local coordinates called
Riemannian normal coordinates (Chern, Chen and Lam 1999, Chapter 5). For a
point 𝑥 ∈ 𝑀, an 𝑛-dimensional Riemannian manifold, we choose normal coordinates (𝑥_1, . . . , 𝑥_𝑛) : 𝑁(𝑥) → R^𝑛 in a neighbourhood 𝑁(𝑥) ⊆ 𝑀 with 𝑥 at the origin 0. Then the Riemannian metric 𝑔 takes values in S^𝑛_{++} and we have

𝑟_{𝑖𝑗𝑘𝑙}(𝑥) = (3/2) ∂²𝑔_{𝑖𝑗}/∂𝑥_𝑘∂𝑥_𝑙 (0),    𝑟̄_{𝑖𝑗}(𝑥) = (3/√det(𝑔)) ∂²√det(𝑔)/∂𝑥_𝑘∂𝑥_𝑙 (0).
The Riemann and Ricci curvature tensors at 𝑥, respectively, are the quadrilinear
and bilinear functionals
𝑅(𝑥) : T_𝑥(𝑀) × T_𝑥(𝑀) × T_𝑥(𝑀) × T_𝑥(𝑀) → R,    𝑅̄(𝑥) : T_𝑥(𝑀) × T_𝑥(𝑀) → R,

given by

𝑅(𝑥)(𝑢, 𝑣, 𝑤, 𝑧) = ∑_{𝑖,𝑗,𝑘,𝑙=1}^{𝑛} 𝑟_{𝑖𝑗𝑘𝑙}(𝑥) 𝑢_𝑖 𝑣_𝑗 𝑤_𝑘 𝑧_𝑙,    𝑅̄(𝑥)(𝑢, 𝑣) = ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} 𝑟̄_{𝑖𝑗}(𝑥) 𝑢_𝑖 𝑣_𝑗.

These are tensor fields – note that the coefficients depend on 𝑥 – so 𝑅(𝑥) will
generally be a different multilinear functional at different points 𝑥 ∈ 𝑀. The Ricci
curvature tensor will make a brief appearance in Example 4.33 on separation of
variables for PDEs. Riemann curvature, being a 4-tensor, is difficult to handle as
is, but when 𝑀 is embedded in Euclidean space, it appears implicitly in the form
of a 2-tensor called the Weingarten map (Bürgisser and Cucker 2013, Chapter 21)
or second fundamental form (Niyogi, Smale and Weinberger 2008), whose eigen-
values, called principal curvatures, give us condition numbers. There are other
higher-order tensor fields in geometry (Dodson and Poston 1991) such as the tor-
sion tensor, the Nijenhuis tensor (𝑑 = 3) and the Weyl curvature tensor (𝑑 = 4), all
of which are unfortunately beyond our scope.
In physics, it is probably fair to say that (i) most physical fields are tensor fields
and (ii) most tensors are tensor fields. For (i), while there are important exceptions
such as spinor fields, the most common fields such as temperature, pressure and
Higgs fields are scalar fields; electric, magnetic and flow velocity fields are vector
fields; the Cauchy stress tensor, Einstein tensor and Faraday electromagnetic tensor
are 2-tensor fields; higher-order tensor fields are rarer in physics but there are also
examples such as the Cotton tensor (García, Hehl, Heinicke and Macías 2004) and
the Lanczos tensor (Roberts 1995), both 3-tensor fields. The last five named tensors
also make the case for (ii): a ‘tensor’ in physics almost always means a tensor field
of order two or (more rarely) higher. We will describe the Cauchy stress tensor and
mention a few higher-order tensors related to it in Example 4.8.
The above examples are analytic in nature but the next two will be algebraic.
They show why it is useful to consider bilinear and more generally multilinear
maps for modules over Z/𝑚Z, the ring of integers modulo 𝑚.
Example 3.14 (modules and integer multiplication). Trivially, integer multipli-
cation B : Z × Z → Z, (𝑎, 𝑏) ↦→ 𝑎𝑏 is a bilinear map over the Z-module Z, but this
is not the relevant module structure that one exploits in fast integer multiplication
algorithms. Instead they are based primarily on two key ideas. The first idea is that
integers (assumed unsigned) may be represented as polynomials,
𝑎 = ∑_{𝑖=0}^{𝑝−1} 𝑎_𝑖 𝜃^𝑖 ≕ 𝑎(𝜃)    and    𝑏 = ∑_{𝑗=0}^{𝑝−1} 𝑏_𝑗 𝜃^𝑗 ≕ 𝑏(𝜃)

for some number base 𝜃, and the product has coefficients given by convolutions,

𝑎𝑏 = ∑_{𝑘=0}^{2𝑝−2} 𝑐_𝑘 𝜃^𝑘 ≕ 𝑐(𝜃)    with    𝑐_𝑘 = ∑_{𝑖=0}^{𝑘} 𝑎_𝑖 𝑏_{𝑘−𝑖}.

Let 𝑛 = 2𝑝 − 1 and pad the vectors of coefficients with enough zeros so that we may
consider (𝑎0 , . . . , 𝑎 𝑛−1 ), (𝑏0 , . . . , 𝑏 𝑛−1 ), (𝑐0 , . . . , 𝑐 𝑛−1 ) on an equal footing. The
second idea is to use the discrete Fourier transform17 (DFT) for some root of unity
𝜔 to perform the convolution,
[𝑎′_0; 𝑎′_1; 𝑎′_2; … ; 𝑎′_{𝑛−1}] = [1 1 1 ⋯ 1;  1 𝜔 𝜔² ⋯ 𝜔^{𝑛−1};  1 𝜔² 𝜔⁴ ⋯ 𝜔^{2(𝑛−1)};  ⋮ ;  1 𝜔^{𝑛−1} 𝜔^{2(𝑛−1)} ⋯ 𝜔^{(𝑛−1)(𝑛−1)}] [𝑎_0; 𝑎_1; 𝑎_2; … ; 𝑎_{𝑛−1}],

[𝑏′_0; 𝑏′_1; 𝑏′_2; … ; 𝑏′_{𝑛−1}] = [1 1 1 ⋯ 1;  1 𝜔 𝜔² ⋯ 𝜔^{𝑛−1};  1 𝜔² 𝜔⁴ ⋯ 𝜔^{2(𝑛−1)};  ⋮ ;  1 𝜔^{𝑛−1} 𝜔^{2(𝑛−1)} ⋯ 𝜔^{(𝑛−1)(𝑛−1)}] [𝑏_0; 𝑏_1; 𝑏_2; … ; 𝑏_{𝑛−1}],

[𝑐_0; 𝑐_1; 𝑐_2; … ; 𝑐_{𝑛−1}] = (1/𝑛) [1 1 1 ⋯ 1;  1 𝜔^{−1} 𝜔^{−2} ⋯ 𝜔^{−(𝑛−1)};  1 𝜔^{−2} 𝜔^{−4} ⋯ 𝜔^{−2(𝑛−1)};  ⋮ ;  1 𝜔^{−(𝑛−1)} 𝜔^{−2(𝑛−1)} ⋯ 𝜔^{−(𝑛−1)(𝑛−1)}] [𝑎′_0𝑏′_0; 𝑎′_1𝑏′_1; 𝑎′_2𝑏′_2; … ; 𝑎′_{𝑛−1}𝑏′_{𝑛−1}],

where [𝑣_0; 𝑣_1; … ] denotes a column vector and the rows of each matrix are separated by semicolons,
taking advantage of the following well-known property: a Fourier transform F turns
convolution product ∗ into pointwise product · and the inverse Fourier transform
turns it back, that is,
𝑎 ∗ 𝑏 = F^{−1}(F(𝑎) · F(𝑏)).

17 A slight departure from (3.40) is that we dropped the 1/√𝑛 coefficient from our DFT and instead put a 1/𝑛 with our inverse DFT to avoid surds.
Practical considerations inform the way we choose 𝜃 and 𝜔. We choose 𝜃 = 2^𝑠
so that 𝑎_𝑖, 𝑏_𝑖 ∈ {0, 1, 2, . . . , 2^𝑠 − 1} are 𝑠-bit binary numbers that can be readily


handled. We choose 𝜔 to be a 2𝑝th root of unity in a ring Z/𝑚Z with 𝑚 > 𝑐 𝑘 for all
𝑘 = 0, 1, . . . , 𝑛 − 1. The 2𝑝th root of unity ensures that the powers 1, 𝜔, . . . , 𝜔 𝑛−1
are distinct and 𝑚 > 𝑐 𝑘 prevents any ‘carrying’ when computing 𝑐 𝑘 . It turns out
that such a choice is always possible with, say, 𝑚 = 23𝑑 +1. Practical considerations
aside, the key ideas are to convert integer multiplication to a bilinear operator,
B1 : (Z/2𝑠 Z)[𝜃] × (Z/2𝑠 Z)[𝜃] → (Z/2𝑠 Z)[𝜃],
(𝑎(𝜃), 𝑏(𝜃)) ↦→ 𝑎(𝜃)𝑏(𝜃),
followed by a Fourier conversion into another bilinear operator,
B2 : (Z/𝑚Z)𝑛 × (Z/𝑚Z)𝑛 → (Z/𝑚Z)𝑛 ,
((𝑎0 , . . . , 𝑎 𝑛−1 ), (𝑏0 , . . . , 𝑏 𝑛−1 )) ↦→ (𝑎0 𝑏0 , . . . , 𝑎 𝑛−1 𝑏 𝑛−1 ).
In the former, the univariate polynomial ring (Z/2𝑠 Z)[𝜃] is a Z/2𝑠 Z-module with
𝜃 regarded as the indeterminate of the polynomials. In the latter (Z/𝑚Z)𝑛 is a
Z/𝑚Z-module, in fact it is the direct sum of 𝑛 copies of Z/𝑚Z. The algorithms of
Karatsuba and Ofman (1962), Cook and Aanderaa (1969), Toom (1963), Schön-
hage and Strassen (1971) and Fürer (2009) are all variations of one or both of these
two ideas: incorporating a divide-and-conquer strategy, computing the DFT with
fast Fourier transform, replacing the DFT over Z/𝑚Z with one over C, etc. The
recent breakthrough of Harvey and van der Hoeven (2021) that led to an 𝑂(𝑛 log 𝑛)
algorithm for 𝑛-bit integer multiplications was achieved with a multidimensional
variation: (i) replacing the univariate polynomials (Z/2𝑠 Z)[𝜃] with an 𝑅-module
of multivariate polynomials 𝑅[𝜃 1 , . . . , 𝜃 𝑑 ] with a slightly more complicated coef-
ficient ring 𝑅, and (ii) replacing the one-dimensional DFT with a 𝑑-dimensional
DFT,
𝑎′(𝜙_1, 𝜙_2, . . . , 𝜙_𝑑) = ∑_{𝜃_1=0}^{𝑛_1} · · · ∑_{𝜃_𝑑=0}^{𝑛_𝑑} 𝜔_1^{𝜙_1𝜃_1} 𝜔_2^{𝜙_2𝜃_2} · · · 𝜔_𝑑^{𝜙_𝑑𝜃_𝑑} 𝑎(𝜃_1, 𝜃_2, . . . , 𝜃_𝑑),

and we will discuss some tensorial features of such multidimensional transforms in Example 4.46. Their resulting algorithm is not only the fastest ever but the
fastest possible. As in the case of matrix multiplication, the implications of an
𝑂(𝑛 log 𝑛) integer multiplication algorithm stretch far beyond integer arithmetic.
Among other things, we may use the algorithm of Harvey and van der Hoeven (2021) to compute to 𝑛-bit precision 𝑥/𝑦 or ᵏ√𝑥 in time 𝑂(𝑛 log 𝑛) and e^𝑥 or 𝜋 in time 𝑂(𝑛 log² 𝑛), for real inputs 𝑥, 𝑦 (Brent and Zimmermann 2011).
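A toy illustration of the two key ideas, base-𝜃 digits plus convolution via a Fourier transform, is sketched below; for simplicity it uses an off-the-shelf complex FFT in floating point rather than a number-theoretic transform over Z/𝑚Z, so it resembles the Harvey–van der Hoeven algorithm only in spirit, not in performance.

```python
import numpy as np

def int_multiply_fft(a, b, s=8):
    """Toy integer multiplication: write a, b in base theta = 2^s, convolve
    the digit sequences with an FFT, then evaluate at theta."""
    theta = 1 << s
    da = []                        # base-2^s digits of a, least significant first
    while a:
        da.append(a % theta); a //= theta
    db = []
    while b:
        db.append(b % theta); b //= theta
    n = len(da) + len(db) - 1      # n = 2p - 1 coefficients in the product
    N = 1 << (n - 1).bit_length()  # pad to a power of two for the FFT
    fa = np.fft.fft(np.array(da, dtype=float), N)
    fb = np.fft.fft(np.array(db, dtype=float), N)
    c = np.rint(np.fft.ifft(fa * fb).real).astype(np.int64)   # the convolution
    result, power = 0, 1
    for ck in c[:n]:               # evaluate c(theta), i.e. propagate carries
        result += int(ck) * power
        power *= theta
    return result

a, b = 123456789123456789, 987654321987654321
print(int_multiply_fft(a, b) == a * b)
```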
While the previous example exploits the computational efficiency of multilinear
maps over modules, the next one exploits their computational intractability.
Example 3.15 (cryptographic multilinear maps). Suppose Alice and Bob want
to generate a (secure) common password they may both use over the (insecure)
internet. One way they may do so is to pick a large prime number 𝑝 and pick a
primitive root of unity 𝑔 ∈ (Z/𝑝Z)× , that is, 𝑔 generates the multiplicative group
of integers modulo 𝑝 in the sense that every non-zero 𝑥 ∈ Z/𝑝Z may be expressed
as 𝑥 = 𝑔 𝑎 (group theory notation) or 𝑥 ≡ 𝑔 𝑎 (mod 𝑝) (number theory notation)
for some 𝑎 ∈ Z. Alice will pick a secret 𝑎 ∈ Z and send 𝑔 𝑎 publicly to Bob;
Bob will pick a secret 𝑏 ∈ Z and send 𝑔 𝑏 publicly to Alice. Alice, knowing the
value of 𝑎, may compute 𝑔 𝑎𝑏 = (𝑔 𝑏 )𝑎 from the 𝑔 𝑏 she received from Bob, and
Bob, knowing the value of 𝑏, may compute 𝑔 𝑎𝑏 = (𝑔 𝑎 )𝑏 from the 𝑔 𝑎 he received
from Alice. They now share the secure password 𝑔 𝑎𝑏 . This is the renowned
Diffie–Hellman key exchange. The security of the version described is based on
the intractability of the discrete log problem: determining the value 𝑎 = log𝑔 (𝑔 𝑎 )
from 𝑔 𝑎 and 𝑝 is believed to be intractable. Although the problem has a well-known
polynomial-time quantum algorithm (Shor 1994) and has recently been shown to
be quasi-polynomial-time in 𝑛 (Kleinjung and Wesolowski 2021) for a finite field
F 𝑝 𝑛 when 𝑝 is fixed (note that for 𝑛 = 1, F 𝑝 = Z/𝑝Z), the technology required for
the former is still in its infancy, whereas the latter does not apply in our case where
complexity is measured in terms of 𝑝 and not 𝑛 (for us, 𝑛 = 1 always but 𝑝 → ∞).
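A toy version of the exchange, with an illustrative Mersenne prime modulus and a small 𝑔 assumed to generate a large enough subgroup, is sketched below; real deployments of course use much larger, carefully chosen parameters.

```python
import secrets

# Toy two-party Diffie-Hellman exchange (illustrative parameters only).
p = 2**127 - 1                       # a Mersenne prime, used as the modulus
g = 43                               # assumed generator for illustration

a = secrets.randbelow(p - 2) + 1     # Alice's secret exponent
b = secrets.randbelow(p - 2) + 1     # Bob's secret exponent
ga = pow(g, a, p)                    # Alice broadcasts g^a
gb = pow(g, b, p)                    # Bob broadcasts g^b
key_alice = pow(gb, a, p)            # (g^b)^a
key_bob = pow(ga, b, p)              # (g^a)^b
print(key_alice == key_bob)          # both hold the shared secret g^{ab}
```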
Now observe that (Z/𝑝Z)× is a commutative group under the group operation of
multiplication modulo 𝑝, and it is a Z-module as we may check that it satisfies the
properties on page 73: for any 𝑥, 𝑦 ∈ Z/𝑝Z and 𝑎, 𝑏 ∈ Z,
(i) (𝑥𝑦)^𝑎 = 𝑥^𝑎 𝑦^𝑎,
(ii) 𝑥 (𝑎+𝑏) = 𝑥 𝑎 𝑥 𝑏 ,
(iii) 𝑥 (𝑎𝑏) = (𝑥 𝑏 )𝑎 ,
(iv) 𝑥 1 = 𝑥.
Furthermore, the Diffie–Hellman key exchange is a Z-bilinear map
B : (Z/𝑝Z)× × (Z/𝑝Z)× → (Z/𝑝Z)× , (𝑔 𝑎 , 𝑔 𝑏 ) ↦→ 𝑔 𝑎𝑏
since, for any 𝜆, 𝜆 ′ ∈ Z and 𝑔 𝑎 , 𝑔 𝑏 ∈ (Z/𝑝Z)× ,
B(𝑔^{𝜆𝑎+𝜆′𝑎′}, 𝑔^𝑏) = 𝑔^{(𝜆𝑎+𝜆′𝑎′)𝑏} = (𝑔^{𝑎𝑏})^𝜆 (𝑔^{𝑎′𝑏})^{𝜆′} = B(𝑔^𝑎, 𝑔^𝑏)^𝜆 B(𝑔^{𝑎′}, 𝑔^𝑏)^{𝜆′}
and likewise for the second argument. That the notation is written multiplicatively
with coefficients appearing in the power is immaterial; if anything it illustrates
the power of abstract algebra in recognizing common structures across different
scenarios. While one may express everything in additive notation by taking discrete
logs whenever necessary, the notation 𝑔 𝑎 serves as a useful mnemonic: anything
appearing in the power is hard to extract, while using additive notation means having
to constantly remind ourselves that extracting 𝑎 from 𝑎𝑔 and 𝜆𝑎 is intractable for
the former and trivial for the latter.
What if 𝑑 + 1 different parties need to establish a common password (say, in
a 1000-participant Zoom session)? In principle one may successively apply the
two-party Diffie–Hellman key exchange 𝑑 + 1 times with the 𝑑 + 1 parties each
doing 𝑑 + 1 exponentiations, which is obviously expensive. One may reduce the
number of exponentiations to log2(𝑑 + 1) with a more sophisticated protocol, but it was discovered in Joux (2004) that with some mild assumptions a bilinear map
like the one above already allows us to generate a tripartite password in just one
round since
B(𝑔 𝑎 , 𝑔 𝑏 )𝑐 = B(𝑔 𝑎 , 𝑔 𝑐 )𝑏 = B(𝑔 𝑏 , 𝑔 𝑐 )𝑎 = B(𝑔, 𝑔)𝑎𝑏𝑐 .
Boneh and Silverberg (2003) generalized and formalized this idea as a crypto-
graphic multilinear map to allow a one-round (𝑑 + 1)-partite password. Let 𝑝 be
a prime and let 𝐺 1 , . . . , 𝐺 𝑑 , 𝐺 be cyclic groups of 𝑝 elements, written multiplic-
atively.18 Then, as in the 𝑑 = 2 case, these groups are Z-modules and we may
consider a 𝑑-linear map over Z-modules:
Φ : 𝐺 1 × · · · × 𝐺 𝑑 → 𝐺.
The only property that matters is the following consequence of multilinearity:
Φ(𝑔1𝑎1 , 𝑔2𝑎2 , . . . , 𝑔𝑑𝑎𝑑 ) = Φ(𝑔1 , 𝑔2 , . . . , 𝑔𝑑 )𝑎1 𝑎2 ···𝑎𝑑
for any 𝑔1 ∈ 𝐺 1 , . . . , 𝑔𝑑 ∈ 𝐺 𝑑 and 𝑎1 , . . . , 𝑎 𝑑 ∈ Z. Slightly abusing notation,
we will write 1 for the identity element in all groups. To exclude trivialities, we
assume that Φ is non-degenerate, that is, if Φ(𝑔1 , . . . , 𝑔𝑑 ) = 1, then we must have
𝑔𝑖 = 1 for some 𝑖 (the converse is always true). For a non-degenerate multilinear
map Φ to be cryptographic, it needs to have two additional properties:
(i) the discrete log problem in each of 𝐺 1 , . . . , 𝐺 𝑑 is intractable,
(ii) Φ(𝑔1 , . . . , 𝑔𝑑 ) is efficiently computable for any 𝑔1 ∈ 𝐺 1 , . . . , 𝑔𝑑 ∈ 𝐺 𝑑 .
Note that these assumptions imply that the discrete log problem in 𝐺 must also
be intractable. Given, say, 𝑔1 and 𝑔1𝑎1 , since Φ(𝑔1 , 1, . . . , 1) and Φ(𝑔1𝑎1 , 1, . . . , 1)
may be efficiently computed, if we could efficiently solve the discrete log problem
Φ(𝑔1𝑎1 , 1, . . . , 1) = Φ(𝑔1 , 1, . . . , 1)𝑎1 in 𝐺 to get 𝑎1 , we would have solved the
discrete log problem in 𝐺 1 efficiently.
Suppose we have a cryptographic 𝑑-linear map, Φ : 𝐺 × · · · × 𝐺 → 𝐺, where
we have assumed 𝐺 1 = · · · = 𝐺 𝑑 = 𝐺 for simplicity. For 𝑑 + 1 parties to generate a
common password, the 𝑖th party just needs to pick a password 𝑎𝑖 , perform a single
exponentiation to get 𝑔 𝑎𝑖 , and broadcast 𝑔 𝑎𝑖 to the other parties, who will each do
likewise so that every party now has 𝑔 𝑎1 , . . . , 𝑔 𝑎𝑑+1 . With these and the password
𝑎𝑖 , the 𝑖th party will now compute
Φ(𝑔 𝑎1 , . . . , 𝑔 𝑎𝑖−1 , 𝑔 𝑎𝑖+1 , . . . , 𝑔 𝑎𝑑+1 )𝑎𝑖 = Φ(𝑔, . . . , 𝑔)𝑎1 ···𝑎𝑑+1 ,
which will serve as the common password for the 𝑑 + 1 parties. Note that each of them would have arrived at Φ(𝑔, . . . , 𝑔)^{𝑎_1···𝑎_{𝑑+1}} in a different way with their own password but their results are guaranteed to be equal as a consequence of multilinearity. There are several candidates for such cryptographic multilinear maps (Gentry, Gorbunov and Halevi 2015, Garg, Gentry and Halevi 2013) and a variety of cryptographic applications that go beyond multipartite key exchange (Boneh and Silverberg 2003).

18 Of course they are all isomorphic to each other, but in actual cryptographic applications, it matters how the groups are realized: one might be an elliptic curve over a finite field and another a cyclic subgroup of the Jacobian of a hyperelliptic curve. Also, an explicit isomorphism may not be easy to identify or compute in practice.
We return to multilinear maps over familiar vector spaces. The next two examples
are about trilinear functionals.
Example 3.16 (trilinear functionals and self-concordance). The definition of
self-concordance is usually stated over R𝑛 . Let 𝑓 : Ω → R be a convex 𝐶 3 -
function on an open convex subset Ω ⊆ R𝑛 . Then 𝑓 is said to be self-concordant
at 𝑥 ∈ Ω if
|𝐷 3 𝑓 (𝑥)(ℎ, ℎ, ℎ)| ≤ 2[𝐷 2 𝑓 (𝑥)(ℎ, ℎ)] 3/2 (3.56)
for all ℎ ∈ R𝑛 (Nesterov and Nemirovskii 1994). As we discussed in Example 3.2,
for any fixed 𝑥 ∈ Ω, the higher derivatives 𝐷 2 𝑓 (𝑥) and 𝐷 3 𝑓 (𝑥) in this case are
bilinear and trilinear functionals on R𝑛 given by
[𝐷²𝑓 (𝑥)](ℎ, ℎ) = ∑_{𝑖,𝑗=1}^{𝑛} (∂²𝑓 (𝑥)/∂𝑥_𝑖∂𝑥_𝑗) ℎ_𝑖 ℎ_𝑗,

[𝐷³𝑓 (𝑥)](ℎ, ℎ, ℎ) = ∑_{𝑖,𝑗,𝑘=1}^{𝑛} (∂³𝑓 (𝑥)/∂𝑥_𝑖∂𝑥_𝑗∂𝑥_𝑘) ℎ_𝑖 ℎ_𝑗 ℎ_𝑘.

The affine invariance (Nesterov and Nemirovskii 1994, Proposition 2.1.1) of self-
concordance implies that self-concordance is a tensorial property in the sense
of definition ➀. For the convergence and complexity analysis of interior point
methods, it goes hand in hand with the affine invariance of Newton’s method in
Example 2.15. Such analysis in turn allows one to establish the celebrated result that
a convex optimization problem may be solved to arbitrary 𝜀-accuracy in polynomial
time using interior point methods if it has a self-concordant barrier function whose
first and second derivatives may be evaluated in polynomial time (Nesterov and
Nemirovskii 1994, Chapter 6). These conditions are satisfied for many common
problems including linear programming, convex quadratic programming, second-
order cone programming, semidefinite programming and geometric programming
(Boyd and Vandenberghe 2004). Contrary to popular belief, polynomial-time
solvability to 𝜀-accuracy is not guaranteed by convexity alone: copositive and
completely positive programming are convex optimization problems but both are
known to be NP-hard (Murty and Kabadi 1987, Dickinson and Gijben 2014).
By Example 3.12, we may view 𝐷 2 𝑓 and 𝐷 3 𝑓 as covariant tensor fields on the
manifold Ω (any open subset of a manifold is itself a manifold) with
𝐷 2 𝑓 (𝑥) : T 𝑥 (Ω) × T 𝑥 (Ω) → R, 𝐷 3 𝑓 (𝑥) : T 𝑥 (Ω) × T 𝑥 (Ω) × T 𝑥 (Ω) → R
for any fixed 𝑥 ∈ Ω. While this tensor field perspective is strictly speaking
unnecessary, it helps us formulate (3.56) concretely in situations when we are not
working over R𝑛 . For instance, among the aforementioned optimization problems,
semidefinite, completely positive and copositive programming require that we work
over the space of symmetric matrices S𝑛 with Ω given respectively by the following
open convex cones:
S^𝑛_{++} = {𝐴 ∈ S^𝑛 : 𝑥^T𝐴𝑥 > 0, 0 ≠ 𝑥 ∈ R^𝑛} = {𝐵𝐵^T ∈ S^𝑛 : 𝐵 ∈ GL(𝑛)},
S^𝑛_{+++} = {𝐵𝐵^T ∈ S^𝑛 : 𝐵 ∈ GL(𝑛) ∩ R_+^{𝑛×𝑛}},
S^{𝑛∗}_{+++} = {𝐴 ∈ S^𝑛 : 𝑥^T𝐴𝑥 > 0, 0 ≠ 𝑥 ∈ R_+^𝑛}.

In all three cases we have T𝑋 (Ω) = S𝑛 for all 𝑋 ∈ Ω, equipped with the usual
trace inner product. To demonstrate self-concordance for the log barrier function
𝑓 (𝑋) = − log det(𝑋), Examples 3.4 and 3.6 give us

𝐷 2 𝑓 (𝑋)(𝐻, 𝐻) = tr(𝐻 T [∇2 𝑓 (𝑋)](𝐻)) = tr(𝐻 𝑋 −1 𝐻 𝑋 −1 ),


𝐷 3 𝑓 (𝑋)(𝐻, 𝐻, 𝐻) = tr(𝐻 T [∇3 𝑓 (𝑋)](𝐻, 𝐻)) = −2 tr(𝐻 𝑋 −1 𝐻 𝑋 −1 𝐻 𝑋 −1 ),

and it follows from Cauchy–Schwarz that

|𝐷 3 𝑓 (𝑋)(𝐻, 𝐻, 𝐻)| ≤ 2k𝐻 𝑋 −1 k 3 = 2[𝐷 2 𝑓 (𝑋)(𝐻, 𝐻)] 3/2 ,

as required. For comparison, take the inverse barrier function 𝑔(𝑋) = tr(𝑋^{−1}) which, as a convex function that blows up near the boundary of S^𝑛_{++} and has easily
computable gradient and Hessian as we saw in Example 3.4, appears to be a perfect
barrier function for semidefinite programming. Nevertheless, using the derivatives
calculated in Example 3.4,

𝐷 2 𝑔(𝑋)(𝐻, 𝐻) = 2 tr(𝐻 𝑋 −1 𝐻 𝑋 −2 ),
𝐷 3 𝑔(𝑋)(𝐻, 𝐻, 𝐻) = −6 tr(𝐻 𝑋 −1 𝐻 𝑋 −1 𝐻 𝑋 −2 ),

and since 6|ℎ|³/𝑥⁴ > 2(2ℎ²/𝑥³)^{3/2} as 𝑥 → 0⁺, (3.56) will not be satisfied when 𝑋 is near singular. Thus the inverse barrier function for S^𝑛_{++} is not self-concordant.
What about S^𝑛_{+++}, the completely positive cone, and its dual cone S^{𝑛∗}_{+++}, the copositive cone? It turns out that it is possible to construct convex self-concordant barrier functions for these (Nesterov and Nemirovskii 1994, Theorem 2.5.1), but the problem is that the gradients and Hessians of these barrier functions are not computable in polynomial time.
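The self-concordance of the log barrier can also be spot-checked numerically: the sketch below samples random 𝑋 ∈ S^𝑛_{++} and symmetric directions 𝐻 and verifies (3.56) using the closed-form second and third derivatives computed above.

```python
import numpy as np

# Spot-check (3.56) for f(X) = -log det(X) using
# D^2 f(X)(H,H) = tr(H X^{-1} H X^{-1}) and
# D^3 f(X)(H,H,H) = -2 tr(H X^{-1} H X^{-1} H X^{-1}).
rng = np.random.default_rng(0)
n = 5
for _ in range(1000):
    B = rng.standard_normal((n, n))
    X = B @ B.T + 0.1 * np.eye(n)        # a random point in S^n_{++}
    H = rng.standard_normal((n, n))
    H = (H + H.T) / 2                    # a random direction in S^n
    M = H @ np.linalg.inv(X)
    d2 = np.trace(M @ M)
    d3 = -2 * np.trace(M @ M @ M)
    assert abs(d3) <= 2 * d2**1.5 + 1e-9
print('self-concordance inequality (3.56) holds on all samples')
```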

The last example, though more of a curiosity, is a personal favourite of the author
(Friedland, Lim and Zhang 2019).

Example 3.17 (spectral norm and Grothendieck’s inequality). One of the most
fascinating inequalities in matrix and operator theory is the following. For any
𝐴 ∈ R𝑚×𝑛 , there exists a constant 𝐾G > 0 such that


max_{‖𝑥_𝑖‖=‖𝑦_𝑗‖=1} ∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑛} 𝑎_{𝑖𝑗} ⟨𝑥_𝑖, 𝑦_𝑗⟩ ≤ 𝐾_G max_{|𝜀_𝑖|=|𝛿_𝑗|=1} ∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑛} 𝑎_{𝑖𝑗} 𝜀_𝑖 𝛿_𝑗,    (3.57)

with the maximum on the left taken over vectors 𝑥_1, . . . , 𝑥_𝑚, 𝑦_1, . . . , 𝑦_𝑛 ∈ R^𝑝 of unit 2-norm and that on the right taken over 𝜀_1, . . . , 𝜀_𝑚, 𝛿_1, . . . , 𝛿_𝑛 ∈ {−1, +1}.
Here 𝑝 ≥ 𝑚 + 𝑛 can be arbitrarily large. What is remarkable is that the constant
𝐾G is universal, i.e. independent of 𝑚, 𝑛, 𝑝 or the matrix 𝐴. It is christened
the Grothendieck constant after the discoverer of the inequality (Grothendieck
1953), which has found widespread applications in combinatorial optimization,
complexity theory and quantum information (Pisier 2012). The exact value of the
constant is unknown but there are excellent bounds (Davie 1985, Krivine 1979):
1.67696 ≤ 𝐾G ≤ 1.78221.
Nevertheless the nature of the Grothendieck constant remains somewhat myster-
ious. One clue is that since the constant is independent of the matrix 𝐴 and
obviously also independent of the 𝑥𝑖 , 𝑦 𝑗 , 𝜀 𝑖 , 𝛿 𝑗 , as these are dummy variables
that are maximized over, whatever underlying object the Grothendieck constant
measures must have nothing to do with any of the quantities appearing in (3.57).
We will show that this object is in fact the Strassen matrix multiplication tensor
M𝑚,𝑛, 𝑝 in Example 3.9. First, observe that the right-hand side of (3.57) is just the
matrix (∞, 1)-norm,
‖𝐴‖_{∞,1} = max_{𝑣≠0} ‖𝐴𝑣‖_1/‖𝑣‖_∞ = max_{𝜀_𝑖,𝛿_𝑗=±1} ∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑛} 𝑎_{𝑖𝑗} 𝜀_𝑖 𝛿_𝑗,

which is in fact a reason why the inequality is so useful, the (∞, 1)-norm being
ubiquitous in combinatorial optimization but NP-hard to compute, and the left-hand
side of (3.57) being readily computable via semidefinite programming. Writing 𝑥𝑖
and 𝑦 𝑗 , respectively, as columns and rows of matrices 𝑋 = [𝑥1 , . . . , 𝑥 𝑚 ] ∈ R 𝑝×𝑚
and 𝑌 = [𝑦 T1 , . . . , 𝑦 T𝑛 ] T ∈ R𝑛× 𝑝 , we next observe that the constraints on the left-hand
side of (3.57) may be expressed in terms of their (1, 2)-norm and (2, ∞)-norm:
‖𝑋‖_{1,2} = max_{𝑣≠0} ‖𝑋𝑣‖_2/‖𝑣‖_1 = max_{𝑖=1,...,𝑚} ‖𝑥_𝑖‖_2,

‖𝑌‖_{2,∞} = max_{𝑣≠0} ‖𝑌𝑣‖_∞/‖𝑣‖_2 = max_{𝑗=1,...,𝑛} ‖𝑦_𝑗‖_2,

namely, as ‖𝑋‖_{1,2} ≤ 1 and ‖𝑌‖_{2,∞} ≤ 1. Lastly, we observe that for the standard inner product on R^𝑝,

∑_{𝑖=1}^{𝑚} ∑_{𝑗=1}^{𝑛} 𝑎_{𝑖𝑗} ⟨𝑥_𝑖, 𝑦_𝑗⟩ = tr(𝑋𝐴𝑌 ),
and thus (3.57) may be written as

max_{𝐴,𝑋,𝑌≠0} tr(𝑋𝐴𝑌 )/(‖𝐴‖_{∞,1} ‖𝑋‖_{1,2} ‖𝑌‖_{2,∞}) ≤ 𝐾_G.
As we saw in (3.19), the expression on the left is precisely the spectral norm of the
trilinear functional
𝜏𝑚,𝑛, 𝑝 : R𝑚×𝑛 × R𝑛× 𝑝 × R 𝑝×𝑚 → R, (𝐴, 𝐵, 𝐶) ↦→ tr(𝐴𝐵𝐶),
if we equip the three vector spaces of matrices with the matrix (1, 2), (∞, 1)
and (2, ∞)-norms respectively. We shall denote this norm as k · k 1,2,∞ . The last
step is to observe that as 3-tensors, the trilinear functional 𝜏𝑚,𝑛, 𝑝 and the matrix
multiplication tensor
M𝑚,𝑛, 𝑝 : R𝑚×𝑛 × R𝑛× 𝑝 → R𝑚× 𝑝 , (𝐴, 𝐵) ↦→ 𝐴𝐵
are one and the same, assuming we identify (R 𝑝×𝑛 )∗ = R𝑛× 𝑝 . One way to see this
is that the same type of tensor transformation rule applies:
(𝑋 𝐴𝑌 −1 )(𝑌 𝐵𝑍 −1 ) = 𝑋(𝐴𝐵)𝑍 −1 ,
tr((𝑋 𝐴𝑌 −1 )(𝑌 𝐵𝑍 −1 )(𝑍𝐶 𝑋 −1 )) = tr(𝐴𝐵𝐶)
for any (𝑋, 𝑌 , 𝑍) ∈ GL(𝑚) × GL(𝑛) × GL(𝑝). As we have alluded, recognizing that
two different multilinear maps correspond to the same tensor without reference
to the transformation rule is one benefit of definition ➂, which we will discuss
immediately after this example. Hence we have shown that
𝐾_G = sup_{𝑚,𝑛,𝑝∈N} ‖M_{𝑚,𝑛,𝑝}‖_{1,2,∞}.

In particular, the Grothendieck constant is an asymptotic value of the spectral norm of M𝑚,𝑛,𝑝 just as the exponent of matrix multiplication 𝜔 in Example 3.9 is an
asymptotic value of the tensor rank of M𝑚,𝑛, 𝑝 .
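As an aside, the right-hand side of (3.57), i.e. ‖𝐴‖_{∞,1}, can be computed by brute force for small matrices, which makes its combinatorial (and NP-hard) character plain; a sketch:

```python
import numpy as np
from itertools import product

def norm_inf_1(A):
    """Brute-force ||A||_{inf,1} = max over sign vectors of sum a_ij eps_i delta_j;
    the exhaustive search is exponential in m, reflecting NP-hardness."""
    m, n = A.shape
    best = -np.inf
    for eps in product([-1.0, 1.0], repeat=m):
        v = A.T @ np.array(eps)
        best = max(best, np.abs(v).sum())   # optimal delta_j = sign((A^T eps)_j)
    return best

A = np.random.randn(6, 7)
print(norm_inf_1(A))
```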

4. Tensors via tensor products


In the last two sections we have alluded to various inadequacies of definitions ➀
and ➁. The modern definition of tensors, i.e. definition ➂, remedies them by regard-
ing a tensor as an element of a tensor product of vector spaces (or, more generally,
modules). We may view the development of the notion of tensor via these three
definitions as a series of improvements. Definition ➀ leaves the tensor unspecified
and instead just describes its change-of-basis theorems as transformation rules.
Definition ➁ fixes this by supplying a multilinear map as a candidate for the un-
specified object but the problem becomes one of oversupply: with increasing order
𝑑, there is an exponential multitude of different 𝑑-linear maps for each transform-
ation rule. Definition ➂ fixes both issues by providing an object for definition ➀
that at the same time also neatly classifies the multilinear maps in definition ➁
by the transformation rules they satisfy. In physics, the perspective may be some-
what different. Instead of viewing them as a series of pedagogical improvements,
definition ➀ was discovered alongside (most notably) general relativity, with defin-
ition ➁ as its addendum, and definition ➂ was discovered alongside (most notably)
quantum mechanics. Both remain useful in different ways for different purposes
and each is used independently of the other.
One downside of definition ➂ is that it relies on the tensor product construction
and/or the universal factorization property; both have a reputation of being abstract.
We will see that the reputation is largely undeserved; with the proper motivations
they are far easier to appreciate than, say, definition ➀. One reason for its ‘abstract’
reputation is that the construction is often cast in a way that lends itself to vast
generalizations. In fact, tensor products of Hilbert spaces (Reed and Simon 1980),
modules (Lang 2002), vector bundles (Milnor and Stasheff 1974), operator algeb-
ras (Takesaki 2002), representations (Fulton and Harris 1991), sheaves (Hartshorne
1977), cohomology rings (Hatcher 2002), etc., have become foundational materials
covered in standard textbooks, with more specialized tensor product constructions,
such as those of Banach spaces (Ryan 2002), distributions (Trèves 2006), operads
(Dunn 1988) or more generally objects in any tensor category (Etingof et al. 2015),
readily available in various monographs. This generality is a feature, not a bug.
In our (much more modest) context, it allows us to define tensor products of norm
spaces and inner product spaces, which in turn allows us to define norms and
inner products for tensors, to view multivariate functions as tensors, an enorm-
ously useful perspective in computation, and to identify ‘separation of variables’
as the common underlying thread in a disparate array of well-known algorithms.
In physics, the importance of tensor products cannot be over-emphasized: it is
one of the fundamental postulates of quantum mechanics (Nielsen and Chuang
2000, Section 2.2.8), the source behind many curious quantum phenomena that lie
at the heart of the subject, and is indispensable in technologies such as quantum
computing and quantum cryptography.

Considering its central role, we will discuss three common approaches to con-
structing tensor products, increasing in abstraction:
(i) via tensor products of function spaces,
(ii) via tensor products of more general vector spaces,
(iii) via the universal factorization property.
As in the case of the three definitions of tensor, each of these constructions is
useful in its own way and each may be taken to be a variant of definition ➂. We
will motivate each construction with concrete examples but we defer all examples
of computational relevance to Section 4.4. The approach of defining a tensor as
an element of a tensor product of vector spaces likely first appeared in the first
French edition of Bourbaki (1998) and is now standard in graduate algebra texts
(Dummit and Foote 2004, Lang 2002, Vinberg 2003). It has also caught on in
physics (Geroch 1985) and in statistics (McCullagh 1987). For further historical
information we refer the reader to Conrad (2018) and the last chapter of Kostrikin
and Manin (1997).

4.1. Tensor products of function spaces


This is the most intuitive of our three tensor product constructions as it reflects
the way we build functions of more variables out of functions of fewer variables.
In many situations, especially analytic ones, this tensor-as-a-multivariate-function
view suffices and one need go no further.
For any set 𝑋, the set of all real-valued functions on 𝑋,
R𝑋 ≔ { 𝑓 : 𝑋 → R},
is always a vector space under pointwise addition and scalar multiplication of
functions, that is, by defining
(𝑎1 𝑓1 + 𝑎2 𝑓2 )(𝑥) ≔ 𝑎1 𝑓1 (𝑥) + 𝑎2 𝑓2 (𝑥) for each 𝑥 ∈ 𝑋.
Note that the left-hand side is a real-valued function whose value at 𝑥 is defined
by the right-hand side, which involves only additions and multiplications of real
numbers 𝑎1 , 𝑎2 , 𝑓1 (𝑥), 𝑓2 (𝑥). For any finite set 𝑋, dim R𝑋 = #𝑋, the cardinality
of 𝑋.
It turns out that the converse is also true: any vector space V may be regarded
as a function space R𝑋 for some 𝑋. It is easiest to see this for finite-dimensional
vector spaces, so we first assume that dim V = 𝑛 ∈ N. In this case V has a basis
ℬ = {𝑣 1 , . . . , 𝑣 𝑛 } and every element 𝑣 ∈ V may be written as
𝑣 = 𝑎1 𝑣 1 + · · · + 𝑎 𝑛 𝑣 𝑛 for some unique 𝑎1 , . . . , 𝑎 𝑛 ∈ R. (4.1)
So we simply take 𝑋 = ℬ and regard 𝑣 ∈ V as the function 𝑓 𝑣 ∈ R𝑋 defined by
𝑓 𝑣 : 𝑋 → R, 𝑓 𝑣 (𝑣 𝑖 ) = 𝑎𝑖 , 𝑖 = 1, . . . , 𝑛.
The map V → Rℬ , 𝑣 ↦→ 𝑓 𝑣 is linear and invertible and so the two vector spaces
are isomorphic. The function 𝑓 𝑣 depends on our choice of basis, and any two such
functions 𝑓 𝑣 and 𝑓 𝑣′ corresponding to different bases ℬ and ℬ′ will be related via
the contravariant 1-tensor transformation rule if we represent 𝑓 𝑣 as a vector in R𝑛
with coordinates given by its values
[𝑓_𝑣]_ℬ ≔ (𝑓_𝑣(𝑣_1), . . . , 𝑓_𝑣(𝑣_𝑛))^T = (𝑎_1, . . . , 𝑎_𝑛)^T ∈ R^𝑛
and do likewise for 𝑓 𝑣′ . This is of course just restating (3.2) and (3.3) in slightly
different language. As long as we assume that every vector space V has a basis
ℬ, or, equivalently, the axiom of choice (Blass 1984), the above construction for
V  Rℬ in principle applies verbatim to infinite-dimensional vector spaces, as
having a unique finite linear combination of the form (4.1) is the very definition of
a basis. When used in this sense, such a basis for a vector space is called a Hamel
basis.
Since any vector space is isomorphic to a function space, by defining tensor
products on function spaces we define tensor products on all vector spaces. This
is the most straightforward approach to defining tensor products. In this regard, a
𝑑-tensor is just a real-valued function 𝑓 (𝑥1 , . . . , 𝑥 𝑑 ) of 𝑑 variables. For two sets 𝑋
and 𝑌 , we define the tensor product R𝑋 ⊗ R𝑌 of R𝑋 and R𝑌 to be the subspace of
R𝑋 ×𝑌 comprising all functions that can be written as a finite sum of product of two
univariate functions, one of 𝑥 and another of 𝑦:
𝑓 (𝑥, 𝑦) = ∑_{𝑖=1}^{𝑟} 𝜑_𝑖(𝑥) 𝜓_𝑖(𝑦).    (4.2)

For univariate functions 𝜑 : 𝑋 → R and 𝜓 : 𝑌 → R, we will write 𝜑 ⊗ 𝜓 : 𝑋 × 𝑌 → R for the bivariate function defined by
(𝜑 ⊗ 𝜓)(𝑥, 𝑦) = 𝜑(𝑥)𝜓(𝑦) for all 𝑥 ∈ 𝑋, 𝑦 ∈ 𝑌 .
Such a function is called a separable function, and when ⊗ is used in this sense, we will refer to it as a separable product; see Figure 4.1 for a simple example. In notation,

R^𝑋 ⊗ R^𝑌 ≔ { 𝑓 ∈ R^{𝑋×𝑌} : 𝑓 = ∑_{𝑖=1}^{𝑟} 𝜑_𝑖 ⊗ 𝜓_𝑖, 𝜑_𝑖 ∈ R^𝑋, 𝜓_𝑖 ∈ R^𝑌, 𝑟 ∈ N }.

Note that R^{𝑋×𝑌} = { 𝑓 : 𝑋 × 𝑌 → R} is the set of all bivariate R-valued functions on 𝑋 × 𝑌 , separable or not. In reality, when V is infinite-dimensional, its Hamel
basis ℬ is often uncountably infinite and impossible to write down explicitly for
many common infinite-dimensional vector spaces (e.g. any complete topological
vector space), and thus this approach of constructing tensor products by identifying
a vector space V with Rℬ is not practically useful for most infinite-dimensional
vector spaces. Furthermore, if we take this route we will inevitably have to discuss
a ‘change of Hamel bases’, a tricky undertaking that has never been done, as far
as we know. As such, tensor products of infinite-dimensional vector spaces are
almost always discussed in the context of topological tensor products, which we
will describe in Example 4.1.
If 𝑋 and 𝑌 are finite sets, then in fact we have
R𝑋 ⊗ R𝑌 = R𝑋 ×𝑌 (4.3)
as, clearly,
dim(R𝑋 ⊗ R𝑌 ) = #𝑋 · #𝑌 = #(𝑋 × 𝑌 ) = dim R𝑋 ×𝑌 ,
and so every function 𝑓 : 𝑋 × 𝑌 → R can be expressed in the form (4.2) for
some finite 𝑟. In other words, elements of R𝑋 ⊗ R𝑌 give all functions of two
variables (𝑥, 𝑦) ∈ 𝑋 × 𝑌 .

Figure 4.1. Separable function 𝑓 (𝑥, 𝑦) = 𝜑(𝑥)𝜓(𝑦) with 𝜑(𝑥) = 1 + sin(180𝑥) and 𝜓(𝑦) = exp(−(𝑦 − 3/2)²). [The figure shows the surface plot of 𝑓 = 𝜑 ⊗ 𝜓 together with the graphs of the two univariate factors 𝜑(𝑥) and 𝜓(𝑦).]

If 𝑋 and 𝑌 are infinite sets, then (4.3) is demonstrably


false. Nevertheless, function spaces over infinite sets almost always have additional
properties imposed (e.g. continuity, differentiability, integrability, etc.) and/or carry
additional structures (e.g. metrics, norms, inner products, completeness in the
topology induced by these, etc.). If we constrain our functions in some manner
(instead of allowing them to be completely arbitrary) and limit our attention to
some smaller subspace F(𝑋) ⊆ R𝑋 , then it turns out that the analogue of (4.3)
often holds, that is,
F(𝑋) ⊗ F(𝑌 ) = F(𝑋 × 𝑌 ) (4.4)
for the right function class F and the right interpretation of ⊗, as we will see in the
next two examples. For infinite-dimensional vector spaces, there are many ways
to define (topological) tensor products, and the ‘right tensor product’ is usually
regarded as the one that gives us (4.4).
Example 4.1 (tensor products in infinite dimensions). We write R[𝑥1 , . . . , 𝑥 𝑚 ]
for the set of all multivariate polynomials in 𝑚 variables, with no restriction on
degrees. This is one of the few infinite-dimensional vector spaces that has a
countable Hamel basis, namely the set of all monic monomials
ℬ = {𝑥_1^{𝑑_1} 𝑥_2^{𝑑_2} · · · 𝑥_𝑚^{𝑑_𝑚} : 𝑑_1, 𝑑_2, . . . , 𝑑_𝑚 ∈ N ∪ {0}}.
For multivariate polynomials, it is clearly true that19
R[𝑥1 , . . . , 𝑥 𝑚 ] ⊗ R[𝑦 1 , . . . , 𝑦 𝑛 ] = R[𝑥1 , . . . , 𝑥 𝑚 , 𝑦 1 , . . . , 𝑦 𝑛 ], (4.5)
as polynomials are sums of finitely many monomials and monomials are always
separable, for example
7𝑥_1^2 𝑥_2 𝑦_2^3 𝑦_3 − 6𝑥_2^4 𝑦_1 𝑦_2^5 𝑦_3 = (7𝑥_1^2 𝑥_2) · (𝑦_2^3 𝑦_3) + (−6𝑥_2^4) · (𝑦_1 𝑦_2^5 𝑦_3).

19 One may extend (4.5) to monomials involving arbitrary real powers, i.e. Laurent polynomials, posynomials, signomials, etc., as they are all finite sums of monomials.
But (4.4) is also clearly false in general for other infinite-dimensional vector
spaces. Take continuous real-valued functions for illustration. While sin(𝑥 + 𝑦) =
sin 𝑥 cos 𝑦 + cos 𝑥 sin 𝑦 and log(𝑥𝑦) = log 𝑥 + log 𝑦, we can never have
sin(𝑥𝑦) = ∑_{𝑖=1}^{𝑟} 𝜑_𝑖(𝑥)𝜓_𝑖(𝑦)    or    log(𝑥 − 𝑦) = ∑_{𝑖=1}^{𝑟} 𝜑_𝑖(𝑥)𝜓_𝑖(𝑦)
for any continuous functions 𝜑𝑖 , 𝜓𝑖 and finite 𝑟 ∈ N. But if we allow 𝑟 → ∞, then
sin(𝑥𝑦) = ∑_{𝑛=0}^{∞} ((−1)^𝑛/(2𝑛 + 1)!) 𝑥^{2𝑛+1} 𝑦^{2𝑛+1},
and by Taylor’s theorem, sin(𝑥𝑦) can be approximated to arbitrary accuracy by
sums of separable functions. Likewise,
log(𝑥 − 𝑦) = log(−𝑦) − 𝑥/𝑦 − 𝑥²/(2𝑦²) − 𝑥³/(3𝑦³) − · · · − 𝑥^𝑛/(𝑛𝑦^𝑛) + 𝑂(𝑥^{𝑛+1})

as 𝑥 → 0, or, more relevant for our later purposes,

log(𝑥 − 𝑦) = log(𝑥) − 𝑦/𝑥 − 𝑦²/(2𝑥²) − 𝑦³/(3𝑥³) − · · · − 𝑦^𝑛/(𝑛𝑥^𝑛) + 𝑂(1/𝑥^{𝑛+1}),
as 𝑥 → ∞. In other words, we may approximate log(𝑥 − 𝑦) by sums of separable
functions when 𝑥 is very small or very large; the latter will be important when we
discuss the fast multipole method in Example 4.45.
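For instance, the truncated expansion above already gives a very accurate separable approximation when 𝑥 ≫ |𝑦|, as the following sketch (with arbitrarily chosen 𝑥, 𝑦 and truncation order) illustrates.

```python
import numpy as np

def log_diff_approx(x, y, n=10):
    """Truncated separable expansion of log(x - y) for |y| << x:
    log(x) - sum_{k=1}^{n} y^k / (k x^k); each term is separable in x and y."""
    return np.log(x) - sum(y**k / (k * x**k) for k in range(1, n + 1))

x, y = 50.0, 3.0
print(log_diff_approx(x, y), np.log(x - y))    # agree to many digits when x >> |y|
```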
Given that (4.5) holds for polynomials, the Stone–Weierstrass theorem gives an
indication that we might be able to extend (4.4) to other infinite-dimensional spaces,
say, the space of continuous functions 𝐶(𝑋) on some nice domain 𝑋. However,
we will need to relax the definition of tensor product to allow for limits, i.e. taking
completion with respect to some appropriate choice of norm. For instance, we may
establish, for 1 ≤ 𝑝 < ∞,
𝐶(𝑋) ⊗̂ 𝐶(𝑌 ) = 𝐶(𝑋 × 𝑌 ),    𝐿^𝑝(𝑋) ⊗̂ 𝐿^𝑝(𝑌 ) = 𝐿^𝑝(𝑋 × 𝑌 ),
where 𝑋 and 𝑌 are locally compact Hausdorff spaces in the left equality and 𝜎-finite
measure spaces in the right equality (Light and Cheney 1985, Corollaries 1.14 and
1.52). The topological tensor product 𝐶(𝑋) ⊗̂ 𝐶(𝑌 ) above refers to the completion of

𝐶(𝑋) ⊗ 𝐶(𝑌 ) = { 𝑓 ∈ 𝐶(𝑋 × 𝑌 ) : 𝑓 = ∑_{𝑖=1}^{𝑟} 𝜑_𝑖 ⊗ 𝜓_𝑖, 𝜑_𝑖 ∈ 𝐶(𝑋), 𝜓_𝑖 ∈ 𝐶(𝑌 ), 𝑟 ∈ N }

with respect to some cross-norm, i.e. a norm ‖·‖ that is multiplicative on rank-one elements,

‖𝜑 ⊗ 𝜓 ⊗ · · · ⊗ 𝜃‖ = ‖𝜑‖ ‖𝜓‖ · · · ‖𝜃‖,
like those in (1.6), (1.7), (3.45), (3.46); we will discuss cross-norms in greater
detail in Examples 4.17 and 4.18 for Banach and Hilbert spaces. As we have
seen, without taking completion, 𝐶(𝑋) ⊗ 𝐶(𝑌 ) will be a strictly smaller subset of
𝐶(𝑋 × 𝑌 ). The same goes for 𝐿 𝑝 (𝑋) b ⊗ 𝐿 𝑝 (𝑌 ), but the completion is with respect
to some other cross-norm dependent on the value of 𝑝.
There many important infinite-dimensional vector spaces that are not Banach
(and thus not Hilbert) spaces. The topology of these spaces is usually defined by
families of seminorms. Fortunately, the construction outlined above also works with
such spaces to give topological tensor products that satisfy (4.4). If we limit our
consideration of 𝑋 and 𝑌 to subsets of Euclidean spaces, then for spaces of rapidly
decaying Schwartz functions (see Example 3.7), smooth functions, compactly
supported smooth functions and holomorphic functions, we have (Trèves 2006,
Theorem 51.6)
𝑆(𝑋) ⊗̂ 𝑆(𝑌 ) = 𝑆(𝑋 × 𝑌 ),    𝐶^∞(𝑋) ⊗̂ 𝐶^∞(𝑌 ) = 𝐶^∞(𝑋 × 𝑌 ),
𝐶_𝑐^∞(𝑋) ⊗̂ 𝐶_𝑐^∞(𝑌 ) = 𝐶_𝑐^∞(𝑋 × 𝑌 ),    𝐻(𝑋) ⊗̂ 𝐻(𝑌 ) = 𝐻(𝑋 × 𝑌 ),
where 𝑋 and 𝑌 , respectively, are R𝑚 and R𝑛 (Schwartz), open subsets of R𝑚 and R𝑛
(smooth), compact subsets of R𝑚 and R𝑛 (compactly supported) and open subsets
of C𝑚 and C𝑛 (holomorphic).
Similar results have been established for more complicated function spaces such
as Sobolev and Besov spaces (Sickel and Ullrich 2009) or extended to more general
domains such as having 𝑋 and 𝑌 be smooth manifolds or Stein manifolds (Akbarov
2003). The bottom line is that while it is not always true that all (𝑋, 𝑌 )-variate
functions are separable products of 𝑋-variate functions and 𝑌 -variate functions
– 𝐿 ∞ -functions being a notable exception – (4.4) holds true for many common
function spaces and many common types of domains when ⊗ is interpreted as an
appropriate topological tensor product b⊗.
To keep our promise of simplicity, our discussion in Example 4.1 is necessarily
terse as the technical details are quite involved. Nevertheless, the readers will
get a fairly good idea of the actual construction for the special cases of Banach
and Hilbert spaces in Examples 4.17 and 4.18. We will briefly mention another
example.

Example 4.2 (tensor products of distributions). Strictly speaking, the construction of tensor products via sums of separable functions gives us contravariant
tensors, as we will see more clearly in Example 4.5. Nevertheless, we may get cov-
ariant tensors by taking the continuous dual of the function spaces in Example 4.1.
For a given V equipped with a norm, this is the set of all linear functionals 𝜑 ∈ V∗
whose dual norm in (3.44) is finite. For a function space F(𝑋), the continuous
dual space will be denoted F ′(𝑋) and it is a subspace of the algebraic dual space
F(𝑋)∗ . Elements of F ′ (𝑋) are called distributions or generalized functions with
the best-known example being the Dirac delta ‘function’, intuitively defined by
∫_{−∞}^{∞} 𝑓 (𝑥)𝛿(𝑥) d𝑥 = 𝑓 (0),
and rigorously as the continuous linear functional 𝛿 : 𝐶𝑐∞ (R) → R, 𝜑 → 𝜑(0), i.e.
given by evaluation at zero. The continuous duals of 𝐶𝑐∞(𝑋), 𝑆(𝑋), 𝐶 ∞ (𝑋) and
𝐻(𝑋) are denoted by 𝐷 ′(𝑋) (distributions), 𝑆 ′(𝑋) (tempered distributions), 𝐸 ′(𝑋)
(compactly supported distributions) and 𝐻 ′(𝑋) (analytic functionals) respectively,
and one may obtain the following dual analogues (Trèves 2006, Theorem 51.7):
𝑆′(𝑋) ⊗̂ 𝑆′(𝑌 ) = 𝑆′(𝑋 × 𝑌 ),    𝐸′(𝑋) ⊗̂ 𝐸′(𝑌 ) = 𝐸′(𝑋 × 𝑌 ),
𝐷′(𝑋) ⊗̂ 𝐷′(𝑌 ) = 𝐷′(𝑋 × 𝑌 ),    𝐻′(𝑋) ⊗̂ 𝐻′(𝑌 ) = 𝐻′(𝑋 × 𝑌 ),    (4.6)
with 𝑋 and 𝑌 as in Example 4.1. These apparently innocuous statements conceal
some interesting facts. While every linear operator Φ : V → W and bilinear
functional 𝛽 : V × W → R on finite-dimensional vector spaces may be represented
by a matrix 𝐴 ∈ R𝑚×𝑛 , as we discussed in Section 3.1, the continuous analogue
on infinite-dimensional spaces is clearly false. For instance, not every continuous
linear operator Φ : 𝐿 2 (R) → 𝐿 2 (R) may be expressed as
[Φ( 𝑓 )](𝑥) = ∫_{−∞}^{∞} 𝐾(𝑥, 𝑦) 𝑓 (𝑦) d𝑦    (4.7)
with an integral kernel 𝐾 in place of the matrix 𝐴, as the identity operator is
already a counterexample. This is the impetus behind the idea of a reproducing
kernel Hilbert space that we will discuss in Example 4.20. What (4.6) implies is that
this is true for the function spaces in Example 4.1. Any continuous linear operator
Φ : 𝑆(𝑋) → 𝑆 ′ (𝑌 ) has a unique kernel 𝐾 ∈ 𝑆 ′(𝑋 × 𝑌 ) and likewise with 𝐶𝑐∞, 𝐶 ∞,
𝐻 in place of 𝑆 (Trèves 2006, Chapters 50 and 51). For bilinear functionals, (4.6)
implies that for these spaces, say, 𝛽 : 𝐶𝑐∞ (𝑋) × 𝐶𝑐∞(𝑌 ) → R, we may decompose it
into
𝛽 = ∑_{𝑖=1}^{∞} 𝑎_𝑖 𝜑_𝑖 ⊗ 𝜓_𝑖    (4.8)

with (𝑎_𝑖)_{𝑖=1}^{∞} ∈ 𝑙¹(N), and 𝜑_𝑖 ∈ 𝐷′(𝑋), 𝜓_𝑖 ∈ 𝐷′(𝑌 ) satisfying lim_{𝑖→∞} 𝜑_𝑖 = 0 = lim_{𝑖→∞} 𝜓_𝑖, among other properties (Schaefer and Wolff 1999, Theorem 9.5). These
results are all consequences of Schwartz’s kernel theorem, which, alongside Mercer’s kernel theorem in Example 4.20, represent different approaches to resolving


the falsity of (4.7). While Mercer restricted to smaller spaces (reproducing kernel
Hilbert spaces), Schwartz expanded to larger ones (spaces of distributions), but
both are useful. Among other things, the former allows for the 𝜑𝑖 and 𝜓𝑖 in (4.8)
to be functions instead of distributions, from which we obtain feature maps for
support vector machines (Scholköpf and Smola 2002, Steinwart and Christmann
2008); on the other hand the latter would give us distributional solutions for PDEs
(Trèves 2006, Theorem 52.6).
The discussion up to this point extends to any 𝑑 sets 𝑋1 , 𝑋2 , . . . , 𝑋𝑑 , either by
repeated application of the 𝑑 = 2 case or directly:
\[ \mathbb{R}^{X_1} \otimes \mathbb{R}^{X_2} \otimes \cdots \otimes \mathbb{R}^{X_d} \coloneqq \Bigl\{ f \in \mathbb{R}^{X_1 \times X_2 \times \cdots \times X_d} : f = \sum_{i=1}^{r} \varphi_i \otimes \psi_i \otimes \cdots \otimes \theta_i,\ \varphi_i \in \mathbb{R}^{X_1},\ \psi_i \in \mathbb{R}^{X_2},\ \ldots,\ \theta_i \in \mathbb{R}^{X_d},\ i = 1, \ldots, r \in \mathbb{N} \Bigr\}. \tag{4.9} \]
Each summand is a separable function defined by
(𝜑 ⊗ 𝜓 ⊗ · · · ⊗ 𝜃)(𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 ) ≔ 𝜑(𝑥1 )𝜓(𝑥2 ) · · · 𝜃(𝑥 𝑑 ) (4.10)
for 𝑥1 ∈ 𝑋1 , 𝑥2 ∈ 𝑋2 , . . . , 𝑥 𝑑 ∈ 𝑋𝑑 , and
R𝑋1 ×···×𝑋𝑑 = { 𝑓 : 𝑋1 × · · · × 𝑋𝑑 → R}
is the set of 𝑑-variate real-valued functions on 𝑋1 × · · · × 𝑋𝑑 . We will next look at
some concrete examples of separable functions.
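Before doing so, here is a minimal numerical sketch of (4.9) and (4.10): sampling a separable function $\varphi \otimes \psi \otimes \theta$ on a finite grid gives a rank-one 3-way array, i.e. an outer product of the samples of the univariate factors (the grids and factor functions below are arbitrary choices made for illustration).
\begin{verbatim}
import numpy as np

x = np.linspace(0.0, 1.0, 6)       # samples of X1
y = np.linspace(-1.0, 1.0, 5)      # samples of X2
z = np.linspace(0.0, 2.0, 4)       # samples of X3

phi, psi, theta = np.sin, np.cos, np.exp   # arbitrary univariate factors

# (phi (x) psi (x) theta)(x, y, z) = phi(x) psi(y) theta(z), on the grid
F = phi(x)[:, None, None] * psi(y)[None, :, None] * theta(z)[None, None, :]

# the same array as an outer product of the three sample vectors
G = np.multiply.outer(np.multiply.outer(phi(x), psi(y)), theta(z))
assert np.allclose(F, G)
print(F.shape)   # (6, 5, 4): an element of R^{X1 x X2 x X3}
\end{verbatim}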
Example 4.3 (multivariate normal and quantum harmonic oscillator). The
quintessential example of an 𝑛-variate separable function is a Gaussian:
𝑓 (𝑥) = exp(𝑥 ∗ 𝐴𝑥 + 𝑏∗ 𝑥 + 𝑐)
for 𝐴 ∈ C𝑛×𝑛 , 𝑏 ∈ C𝑛 , 𝑐 ∈ C, which is separable under a change of variables
𝑥 ↦→ 𝑉 ∗ 𝑥 where the columns of 𝑉 ∈ U(𝑛) are an orthonormal eigenbasis of
(𝐴 + 𝐴∗ )/2. Usually one requires that (𝐴 + 𝐴∗ )/2 is negative definite and 𝑏 is
purely imaginary to ensure that it is an 𝐿 2 -function. This is a singularly important
separable function.
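A quick numerical check of the change-of-variables argument, in the special case of a real symmetric negative definite $A$ (an assumption made only to keep the sketch short): after rotating coordinates by an orthonormal eigenbasis of $A$, the Gaussian factors into univariate Gaussians.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n))
A = -(M @ M.T + n * np.eye(n))          # symmetric negative definite
lam, V = np.linalg.eigh(A)              # A = V diag(lam) V^T, V orthogonal

f = lambda x: np.exp(x @ A @ x)

y = rng.standard_normal(n)              # coordinates in the rotated frame
lhs = f(V @ y)                          # f after the change of variables x = V y
rhs = np.prod(np.exp(lam * y**2))       # product of univariate Gaussians
assert np.allclose(lhs, rhs)
\end{verbatim}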
In statistics, it arises as the probability density, moment generating and charac-
teristic functions of a normal random variable 𝑋 ∼ 𝑁(𝜇, Σ), with the last of the
three given by
\[ \varphi_X(x) = \exp\Bigl(\mathrm{i}\,\mu^{\mathsf T} x - \tfrac{1}{2}\, x^{\mathsf T} \Sigma x\Bigr). \]
This gives the unique distribution all of whose cumulant 𝑑-tensors, i.e. the 𝑑th
derivative as in Example 3.2 of log 𝜑 𝑋 , vanish when 𝑑 ≥ 3, a fact from which
the central limit theorem follows almost as an afterthought (McCullagh 1987,
Section 2.7). It is an undisputed fact that this is the single most important probability
distribution in statistics.
In physics, it arises as Gaussian wave functions for the quantum harmonic
oscillator. For a spinless particle in R3 of mass 𝜇 subjected to the potential
\[ V(x,y,z) = \tfrac{1}{2}\,\mu\bigl(\omega_x^2 x^2 + \omega_y^2 y^2 + \omega_z^2 z^2\bigr), \]
if we assume that 𝑉 is isotropic, i.e. 𝜔 𝑥 = 𝜔 𝑦 = 𝜔 𝑧 = 𝜔, then the states are given by
\[ \psi_{m,n,p}(x,y,z) = \Bigl(\frac{\beta^2}{\pi}\Bigr)^{3/4} \frac{H_m(\beta x)\,H_n(\beta y)\,H_p(\beta z)}{\sqrt{2^{m+n+p}\, m!\, n!\, p!}} \exp\Bigl(-\frac{\beta^2}{2}(x^2 + y^2 + z^2)\Bigr), \]
where $\beta^2 \coloneqq \mu\omega/\hbar$ and $H_n$ denotes a Hermite polynomial of degree $n$ (Cohen-
Tannoudji, Diu and Laloë 2020a, Chapter VII, Complement B). The ground state,
i.e. 𝑚 = 𝑛 = 𝑝 = 0, is a Gaussian but note that even excited states, i.e. 𝑚 + 𝑛 + 𝑝 > 0,
are separable functions. If one subscribes to the dictum that ‘physics is that
subset of human experience which can be reduced to coupled harmonic oscillators’
(attributed to Michael Peskin), then this has a status not unlike that of the normal
distribution in statistics.
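As a sketch of this separability, the following evaluates $\psi_{m,n,p}$ both from the displayed formula and as a product of three one-dimensional Hermite functions; it uses numpy.polynomial.hermite.hermval for the physicists' Hermite polynomials $H_n$, and the value $\beta = 1$ is an arbitrary choice made for illustration.
\begin{verbatim}
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

beta = 1.0   # beta^2 = mu * omega / hbar; the value is arbitrary here

def H(n, t):
    """Physicists' Hermite polynomial H_n(t)."""
    c = np.zeros(n + 1); c[n] = 1.0
    return hermval(t, c)

def phi(n, t):
    """One-dimensional harmonic oscillator eigenfunction."""
    norm = (beta**2 / pi) ** 0.25 / sqrt(2.0**n * factorial(n))
    return norm * H(n, beta * t) * np.exp(-0.5 * beta**2 * t**2)

def psi(m, n, p, x, y, z):
    """Three-dimensional eigenfunction, straight from the displayed formula."""
    norm = (beta**2 / pi) ** 0.75 \
        / sqrt(2.0**(m + n + p) * factorial(m) * factorial(n) * factorial(p))
    return norm * H(m, beta * x) * H(n, beta * y) * H(p, beta * z) \
        * np.exp(-0.5 * beta**2 * (x**2 + y**2 + z**2))

x, y, z = 0.3, -1.2, 0.7
m, n, p = 2, 0, 3
assert np.isclose(psi(m, n, p, x, y, z), phi(m, x) * phi(n, y) * phi(p, z))
\end{verbatim}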
As in the case 𝑑 = 2, when 𝑋1 , . . . , 𝑋𝑑 are finite sets, we have
R𝑋1 ⊗ · · · ⊗ R𝑋𝑑 = R𝑋1 ×···×𝑋𝑑 , (4.11)
that is, 𝑑-tensors in R𝑋1 ⊗ · · · ⊗ R𝑋𝑑 are just functions of 𝑑 variables (𝑥1 , . . . , 𝑥 𝑑 ) ∈
𝑋1 × · · · × 𝑋𝑑 . In particular, the dimension of the tensor space is
dim(R𝑋1 ⊗ · · · ⊗ R𝑋𝑑 ) = 𝑛1 · · · 𝑛𝑑 ,
where 𝑛𝑖 ≔ #𝑋𝑖 , 𝑖 = 1, . . . , 𝑑. As in the 𝑑 = 2 case, constructing a tensor
product of infinite-dimensional vector spaces V1 , . . . , V𝑑 by picking Hamel bases
ℬ1 , . . . , ℬ𝑑 and then defining
V1 ⊗ · · · ⊗ V𝑑 ≔ Rℬ1 ⊗ · · · ⊗ Rℬ𝑑 (4.12)
is rarely practical, with multivariate polynomials just about the only exception:
Clearly,
R[𝑥1 , . . . , 𝑥 𝑚 ] ⊗ R[𝑦 1 , . . . , 𝑦 𝑛 ] ⊗ · · · ⊗ R[𝑧1 , . . . , 𝑧 𝑝 ]
= R[𝑥1 , . . . , 𝑥 𝑚 , 𝑦 1 , . . . , 𝑦 𝑛 , . . . , 𝑧1 , . . . , 𝑧 𝑝 ],
just as in (4.5). However, if b⊗ is interpreted to be an appropriate topological
tensor product, then repeated applications of the results in Example 4.1 lead us to
equalities like
\[ L^2(X_1)\,\widehat{\otimes}\,L^2(X_2)\,\widehat{\otimes}\cdots\widehat{\otimes}\,L^2(X_d) = L^2(X_1 \times X_2 \times \cdots \times X_d). \tag{4.13} \]
We will furnish more details in Examples 4.17 and 4.18.
We now attempt to summarize the discussions up to this point with a definition
that is intended to capture all examples in this section. Note that it is difficult to be
more precise without restricting the scope.
Definition 4.4 (tensors as multivariate functions). Let F𝑖 (𝑋𝑖 ) ⊆ R𝑋𝑖 be a sub-
space of real-valued functions on 𝑋𝑖 , 𝑖 = 1, . . . , 𝑑. The topological tensor product
of F1 (𝑋1 ), . . . , F 𝑑 (𝑋𝑑 ) is a subspace
\[ \mathrm{F}_1(X_1)\,\widehat{\otimes}\cdots\widehat{\otimes}\,\mathrm{F}_d(X_d) \subseteq \mathbb{R}^{X_1 \times \cdots \times X_d} \]
that comprises finite sums of real-valued separable functions 𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 with
𝜑𝑖 ∈ F 𝑖 (𝑋𝑖 ), 𝑖 = 1, . . . , 𝑑, and their limits with respect to some topology. A
𝑑-tensor is an element of this space.
Some explanations are in order. Definition 4.4 includes the special cases of
functions on finite sets and multivariate polynomials by simply taking the topology
to be the discrete topology. It is not limited to norm topology, as numerous tensor
product constructions require seminorms (Trèves 2006) or quasinorms (Sickel and
Ullrich 2009). The definition allows for tensor products of different types of
function spaces so as to accommodate different regularities in different arguments.
For example, a classical solution for the fluid velocity 𝑣 in the Navier–Stokes
equation (Fefferman 2006)
\[ \frac{\partial v_i}{\partial t} + \sum_{j=1}^{3} v_j \frac{\partial v_i}{\partial x_j} = -\frac{1}{\rho}\frac{\partial p}{\partial x_i} + \nu \sum_{j=1}^{3} \frac{\partial^2 v_i}{\partial x_j^2} + f_i, \qquad i = 1, 2, 3, \tag{4.14} \]
may be regarded as a tensor $v \in C^2(\mathbb{R}^3)\,\widehat{\otimes}\,C^1[0,\infty) \otimes \mathbb{R}^3$, that is, twice continuously
differentiable in its spatial arguments, once in its temporal argument, with a discrete
third argument capturing the fact that it is a three-dimensional vector field. We will
have more to say about the last point in Example 4.30.
Definition 4.4 allows us to view numerous types of multivariate functions as
tensors. The variables involved may be continuous or discrete or a mix of both.
The functions may represent anything from solutions to PDEs to target functions
in machine learning. In short, Definition 4.4 is extremely versatile and is by far the
most common manifestation of tensors.
We will reflect on how Definition 4.4 relates to definitions ➀ and ➁ by way of
two examples.
Example 4.5 (hypermatrices). A hypermatrix has served as our ‘multi-indexed
object’ in definition ➀ but may now be properly defined as an element of R𝑋1 ×···×𝑋𝑑
when 𝑋1 , . . . , 𝑋𝑑 are finite or countable discrete sets. For any 𝑛 ∈ N, we will write
[𝑛] ≔ {1, 2, . . . , 𝑛}.
We begin from 𝑑 = 1. While R𝑛 is usually taken as a Cartesian product R𝑛 =
R × · · · × R, it is often fruitful to regard it as the set of functions
R [𝑛] = { 𝑓 : [𝑛] → R}. (4.15)
Any 𝑛-tuple (𝑎1 , . . . , 𝑎 𝑛 ) ∈ R𝑛 defines a function 𝑓 : [𝑛] → R with 𝑓 (𝑖) = 𝑎𝑖 ,
𝑖 = 1, . . . , 𝑛, and any function 𝑓 ∈ R [𝑛] defines an 𝑛-tuple ( 𝑓 (1), . . . , 𝑓 (𝑛)) ∈ R𝑛 .
So we may as well assume that R𝑛 ≡ R [𝑛] , given that any difference is superficial.
There are two advantages. The first is that it allows us to speak of sequence spaces
in the same breath, as the space of all sequences is then
\[ \mathbb{R}^{\mathbb{N}} = \{\, f : \mathbb{N} \to \mathbb{R} \,\} = \{ (a_i)_{i=1}^{\infty} : a_i \in \mathbb{R} \}, \]
and this in turn allows us to treat various Banach spaces of sequences 𝑙 𝑝 (N), 𝑐(N),
𝑐0 (N) as special cases of their function spaces counterpart 𝐿 𝑝 (Ω), 𝐶(Ω), 𝐶0 (Ω).
This perspective is of course common knowledge in real and functional analysis.
Given that there is no such thing as a ‘two-way Cartesian product’, a second
advantage is that it allows us to define matrix spaces via
R𝑚×𝑛 ≔ R [𝑚]×[𝑛] = { 𝑓 : [𝑚] × [𝑛] → R}, (4.16)
where we regard $(a_{ij})_{i,j=1}^{m,n}$ as shorthand for the function with $f(i,j) = a_{ij}$, $i \in [m]$,
𝑗 ∈ [𝑛]. Why not just say that it is a two-dimensional array? The answer is that
an ‘array’ is an undefined term in mathematics; at best we may view it as a type
of data structure, and as soon as we try to define it rigorously, we will end up with
essentially (4.16). Again this perspective is known; careful treatments of matrices
in linear algebra such as Berberian (2014, Definition 4.1.3) or Brualdi (1992)
would define a matrix in the same vein. An infinite-dimensional variation gives
us double sequences or equivalently infinite-dimensional matrices in 𝑙 2 (N × N).
These are essential in Heisenberg’s matrix mechanics, the commutation relation
𝑃𝑄 − 𝑄𝑃 = 𝑖ℏ𝐼 being false for finite-dimensional matrices (trace zero on the left,
non-zero on the right), requiring infinite-dimensional ones instead:
\[
Q = \sqrt{\frac{\hbar}{2\nu\omega}}
\begin{bmatrix}
0 & 1 & 0 & 0 & \cdots\\
1 & 0 & \sqrt{2} & 0 & \cdots\\
0 & \sqrt{2} & 0 & \sqrt{3} & \cdots\\
0 & 0 & \sqrt{3} & 0 & \cdots\\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix},
\qquad
P = \sqrt{\frac{\hbar\nu\omega}{2}}
\begin{bmatrix}
0 & -i & 0 & 0 & \cdots\\
i & 0 & -i\sqrt{2} & 0 & \cdots\\
0 & i\sqrt{2} & 0 & -i\sqrt{3} & \cdots\\
0 & 0 & i\sqrt{3} & 0 & \cdots\\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}.
\]
𝑃 and 𝑄 are in fact the position and momentum operators for the harmonic oscillator
in Example 4.3 restricted to one dimension with all 𝑦- and 𝑧-terms dropped (Jordan
2005, Chapter 18).
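The remark about the commutation relation is easy to check numerically. The sketch below truncates $Q$ and $P$ to $N \times N$ matrices, using the standard convention $Q \propto a + a^{\dagger}$, $P \propto i(a^{\dagger} - a)$ and the arbitrary simplification $\hbar = \nu = \omega = 1$: the canonical commutation relation holds in every diagonal entry except the last, whose value $1 - N$ is exactly what forces the trace to vanish, so no finite truncation can satisfy it.
\begin{verbatim}
import numpy as np

N = 6
hbar = nu = omega = 1.0                      # units chosen for convenience

k = np.arange(1, N)
a = np.diag(np.sqrt(k), k=1)                 # annihilation operator, truncated to N x N
adag = a.conj().T

Q = np.sqrt(hbar / (2 * nu * omega)) * (a + adag)
P = 1j * np.sqrt(hbar * nu * omega / 2) * (adag - a)

C = Q @ P - P @ Q
expected = 1j * hbar * np.diag([1.0] * (N - 1) + [1.0 - N])
assert np.allclose(C, expected)
assert np.isclose(np.trace(C), 0.0)  # trace zero: no finite matrices give QP - PQ = i*hbar*I
\end{verbatim}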
It is straightforward to generalize. Let 𝑛1 , . . . , 𝑛𝑑 ∈ N. Then
R𝑛1 ×···×𝑛𝑑 ≔ R [𝑛1 ]×···×[𝑛𝑑 ] = { 𝑓 : [𝑛1 ] × · · · × [𝑛𝑑 ] → R}, (4.17)
that is, a 𝑑-dimensional hypermatrix or more precisely an 𝑛1 × · · · × 𝑛𝑑 hyper-
matrix is a real-valued function on [𝑛1 ] × · · · × [𝑛𝑑 ]. We introduce the shorthand
$(a_{i_1 \cdots i_d})_{i_1, \ldots, i_d = 1}^{n_1, \ldots, n_d}$ for a hypermatrix with $f(i_1, \ldots, i_d) = a_{i_1 \cdots i_d}$. We say that a
hypermatrix is hypercubical when 𝑛1 = · · · = 𝑛𝑑 . These terms originated in the
Russian mathematical literature, notably in Gel′fand, Kapranov and Zelevinsky
(1992, 1994), although the notion itself has appeared implicitly as coefficients
of multilinear forms in Cayley (1845), alongside the term ‘hyperdeterminant’.
The names ‘multidimensional matrix’ (Sokolov 1972) or ‘spatial matrix’ (Sokolov
1960) have also been used for hypermatrices. We would have preferred simply
‘matrix’ or ‘𝑑-matrix’ instead of ‘hypermatrix’ or ‘𝑑-hypermatrix’, but the con-
ventional view of the term ‘matrix’ as exclusively two-dimensional has become too
deeply entrenched to permit such a change in nomenclature.
We will allow for a further generalization of the above, namely, 𝑋1 , . . . , 𝑋𝑑 may
be any discrete sets.
(a) Instead of restricting 𝑖1 , . . . , 𝑖 𝑑 to ordinal variables with values in [𝑛 𝑘 ] =
{1, . . . , 𝑛 𝑘 }, we allow for them to be nominal, e.g. {↑, ↓}, {A, C, G, T }, or
irreducible representations of some group (see Example 2.16).
(b) The discrete set may have additional structures, e.g. they might be graphs or
finite groups or combinatorial designs.
(c) We permit sets of countably infinite cardinality, e.g. N or Z (see Example 3.7).
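To make this function point of view concrete, here is a minimal sketch with invented index sets: a hypermatrix is just a real-valued function on a product of discrete sets, and nominal sets such as {A, C, G, T} are handled simply by fixing an arbitrary ordering.
\begin{verbatim}
import numpy as np

X1 = ['up', 'down']                 # a nominal set, e.g. spins
X2 = ['A', 'C', 'G', 'T']           # another nominal set
X3 = [1, 2, 3]                      # an ordinal set [3]

def f(x1, x2, x3):                  # any real-valued function on X1 x X2 x X3 ...
    return len(x1) * X2.index(x2) + x3

# ... is the same data as an element of R^{2 x 4 x 3}, once orderings are fixed
A = np.array([[[f(x1, x2, x3) for x3 in X3] for x2 in X2] for x1 in X1], dtype=float)

i, j, k = X1.index('down'), X2.index('G'), X3.index(2)
assert A[i, j, k] == f('down', 'G', 2)
print(A.shape)                      # (2, 4, 3)
\end{verbatim}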
Readers may have noted that we rarely invoke a 𝑑-hypermatrix with 𝑑 ≥ 3 in
actual examples. The reason is that whatever we can do with a 𝑑-hypermatrix
𝐴 ∈ R𝑛1 ×···×𝑛𝑑 we can do without or do better with 𝑓 : [𝑛1 ] × · · · × [𝑛𝑑 ] → R.
If anything, hypermatrices give a misplaced sense of familiarity by furnishing a
false analogy with standard two-dimensional matrices; they are nothing alike, as
we explain next.
In conjunction with specific bases, that is, when a 𝑑-hypermatrix is a coordinate
representation of a 𝑑-tensor, the values 𝑎𝑖1 ···𝑖𝑑 convey information about those basis
vectors such as near orthogonality (Example 3.7) or incidence (below). But on their
own, the utility of 𝑑-hypermatrices is far from that of the usual two-dimensional
matrices, one of the most powerful and universal mathematical tools. While there is
a dedicated calculus20 for working with two-dimensional matrices – a rich collection
of matrix operations, functions, invariants, decompositions, canonical forms, etc.,
that are consistent with the tensor transformation rules and most of them readily
computable – there is no equivalent for 𝑑-hypermatrices. There are no canonical
forms when 𝑑 ≥ 3 (Landsberg 2012, Chapter 10), no analogue of matrix–matrix
product when 𝑑 is odd (Example 2.7) and thus no analogue of matrix inverse, and
almost every calculation is computationally intractable even for small examples (see
Section 5.2); here ‘no’ is in the sense of mathematically proved to be non-existent.
The most useful aspect of hypermatrices is probably in supplying some analogues
20 In the sense of a system of calculation and reasoning.
of basic block operations, notably in Strassen’s laser method (Bürgisser et al. 1997,
Chapter 15), or of row and column operations, notably in various variations of slice
rank (Lovett 2019), although there is nothing as sophisticated as the kind of block
matrix operations on page 70.
Usually when we recast a problem in the form of a matrix problem, we are on our
way to a solution: matrices are the tool that gets us there. The same is not true with
hypermatrices. For instance, while we may easily capture the adjacency structure
of a 𝑑-hypergraph 𝐺 = (𝑉, 𝐸), 𝐸 ⊆ 𝑉𝑑 , with a hypermatrix 𝑓 : 𝑉 × · · · × 𝑉 → R,
\[ f(i_1, \ldots, i_d) \coloneqq \begin{cases} 1 & \{i_1, \ldots, i_d\} \in E,\\ 0 & \{i_1, \ldots, i_d\} \notin E, \end{cases} \]
the analogy with spectral graph theory stops here; the hypermatrix view (Friedman
1991, Friedman and Wigderson 1995) does not get us anywhere close to what
is possible in the 𝑑 = 2 case. Almost every mathematically sensible problem
involving hypermatrices is a challenge, even for 𝑑 = 3 and very small dimensions
such as 4 × 4 × 4. For example, we know that the 𝑚 × 𝑛 matrices of rank not more
than 𝑟 are precisely the ones with vanishing (𝑟 + 1) × (𝑟 + 1) minors. Resolving the
equivalent problem for 4 × 4 × 4 hypermatrices of rank not more than 4 requires
that we first reduce it to a series of questions about matrices and then throw every
tool in the matrix arsenal at them (Friedland 2013, Friedland and Gross 2012).
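For concreteness, the adjacency hypermatrix of a small 3-hypergraph could be assembled as in the sketch below (the vertex set and hyperedges are invented for illustration); each hyperedge $\{i, j, k\}$ contributes a 1 at every permutation of its indices, so the resulting array is symmetric.
\begin{verbatim}
import numpy as np
from itertools import permutations

V = range(5)                                   # vertices 0, ..., 4
E = [{0, 1, 2}, {1, 3, 4}, {0, 2, 4}]          # hyperedges of a 3-hypergraph

A = np.zeros((5, 5, 5))
for edge in E:
    for i, j, k in permutations(edge):
        A[i, j, k] = 1.0

assert A[2, 0, 1] == 1.0 and A[0, 1, 3] == 0.0
assert np.allclose(A, A.transpose(1, 0, 2))    # symmetric under permuting indices
\end{verbatim}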
The following resolves a potential point of confusion that the observant reader
might have already noticed. In Definition 3.3, a 𝑑-tensor is a multilinear functional,
that is,
𝜑 : V1 × · · · × V 𝑑 → R (4.18)
satisfying (3.18), but in Definition 4.4 a 𝑑-tensor is merely a multivariate function
𝑓 : 𝑋1 × · · · × 𝑋 𝑑 → R (4.19)
that is not required to be multilinear, a requirement that in fact makes no sense as the
sets 𝑋1 , . . . , 𝑋𝑑 may not be vector spaces. One might think that (4.18) is a special
case of (4.19) as a multilinear functional is a special case of a multivariate function,
but this would be incorrect: they are different types of tensors. By Definition 3.3, 𝜑
is a covariant tensor whereas, as we pointed out in Example 4.2, 𝑓 is a contravariant
tensor. It suffices to explain this issue over finite dimensions and so we will assume
that V1 , . . . , V𝑑 are finite-dimensional vector spaces and 𝑋1 , . . . , 𝑋𝑑 are finite sets.
We will use slightly different notation below to avoid having to introduce more
indices.
Example 4.6 (multivariate versus multilinear). The relation between (4.18) and
(4.19) is subtler than meets the eye. Up to a choice of bases, we can construct a
unique 𝜑 from any 𝑓 and vice versa. One direction is easy. Given vector spaces
U, V, . . . , W, choose any bases
𝒜 = {𝑢1 , . . . , 𝑢 𝑚 }, ℬ = {𝑣 1 , . . . , 𝑣 𝑛 }, . . . , 𝒞 = {𝑤 1 , . . . , 𝑤 𝑝 }.
Then any multilinear functional 𝜑 : U × V × · · · × W → R gives us a real-valued
function
𝑓 𝜑 : 𝒜 × ℬ × · · · × 𝒞 → R, (𝑢𝑖 , 𝑣 𝑗 , . . . , 𝑤 𝑘 ) ↦→ 𝜑(𝑢𝑖 , 𝑣 𝑗 , . . . , 𝑤 𝑘 ).
The other direction is a bit more involved. For any finite set 𝑋 and a point 𝑥 ∈ 𝑋,
we define the delta function 𝛿 𝑥 ∈ R𝑋 given by
\[ \delta_x : X \to \mathbb{R}, \qquad \delta_x(x') = \begin{cases} 1 & x = x',\\ 0 & x \neq x', \end{cases} \tag{4.20} \]
for all 𝑥 ′ ∈ 𝑋, and the point evaluation linear functional 𝜀 𝑥 ∈ (R𝑋 )∗ given by
𝜀 𝑥 : R𝑋 → R, 𝜀 𝑥 ( 𝑓 ) = 𝑓 (𝑥)
for all 𝑓 ∈ R𝑋 . These are bases of their respective spaces and for any set we have
the following:
𝑋 = {𝑥1 , . . . , 𝑥 𝑚 }, R𝑋 = span{𝛿 𝑥1 , . . . , 𝛿 𝑥𝑚 }, (R𝑋 )∗ = span{𝜀 𝑥1 , . . . , 𝜀 𝑥𝑚 }.
Given any 𝑑 sets,
𝑋 = {𝑥1 , . . . , 𝑥 𝑚 }, 𝑌 = {𝑦 1 , . . . , 𝑦 𝑛 }, . . . , 𝑍 = {𝑧1 , . . . , 𝑧 𝑝 },
a real-valued function 𝑓 : 𝑋 × 𝑌 × · · · × 𝑍 → R gives us a multilinear functional
𝜑 𝑓 : R𝑋 × R𝑌 × · · · × R 𝑍 → R
defined by
\[ \varphi_f = \sum_{i=1}^{m} \sum_{j=1}^{n} \cdots \sum_{k=1}^{p} f(x_i, y_j, \ldots, z_k)\, \varepsilon_{x_i} \otimes \varepsilon_{y_j} \otimes \cdots \otimes \varepsilon_{z_k}. \]
Evaluating 𝜑 𝑓 on basis vectors, we get
𝜑 𝑓 (𝛿 𝑥𝑖 , 𝛿 𝑦 𝑗 , . . . , 𝛿 𝑧𝑘 ) = 𝑓 (𝑥𝑖 , 𝑦 𝑗 , . . . , 𝑧 𝑘 ),
taken together with the earlier definition of 𝑓 𝜑 ,
𝑓 𝜑 (𝑢𝑖 , 𝑣 𝑗 , . . . , 𝑤 𝑘 ) = 𝜑(𝑢𝑖 , 𝑣 𝑗 , . . . , 𝑤 𝑘 ),
and we obtain
𝜑 𝑓𝜑 = 𝜑, 𝑓𝜑 𝑓 = 𝑓 .
Roughly speaking, a real-valued multivariate function is a tensor in the sense of
Definition 3.3 because one may view it as a linear combination of point evaluation
linear functionals. Such point evaluation functionals are the basis behind numerical
quadrature (Example 4.48).
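In coordinates, the correspondence of this example is a contraction: identifying $\mathbb{R}^X$ with $\mathbb{R}^m$ via an enumeration of $X$, the hypermatrix of values $f(x_i, y_j, z_k)$ defines the multilinear functional $\varphi_f$, and evaluating $\varphi_f$ on delta functions recovers $f$. A small sketch with arbitrary data:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 3, 4, 2
F = rng.standard_normal((m, n, p))        # values f(x_i, y_j, z_k)

def phi_f(u, v, w):
    """Multilinear functional on R^X x R^Y x R^Z induced by f."""
    return np.einsum('ijk,i,j,k->', F, u, v, w)

# evaluating on delta functions (standard basis vectors) recovers f
i, j, k = 1, 3, 0
delta = lambda dim, idx: np.eye(dim)[idx]
assert np.isclose(phi_f(delta(m, i), delta(n, j), delta(p, k)), F[i, j, k])

# multilinearity in, say, the second argument
u, w = rng.standard_normal(m), rng.standard_normal(p)
v1, v2 = rng.standard_normal(n), rng.standard_normal(n)
a, b = 2.0, -3.0
assert np.isclose(phi_f(u, a * v1 + b * v2, w),
                  a * phi_f(u, v1, w) + b * phi_f(u, v2, w))
\end{verbatim}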
The next example is intended to serve as further elaboration for Examples 4.3,
4.5 and 4.6, and as partial impetus for Sections 4.2 and 4.3.
Example 4.7 (quantum spin and covalent bonds). As we saw in Example 4.3,
the quantum state of a spinless particle like the Gaussian wave function 𝜓 𝑚,𝑛, 𝑝 is
an element of the Hilbert space21 𝐿 2 (R3 ). But even for a single particle, tensor
products come in handy when we need to discuss particles with spin, which are
crucial in chemistry. The simplest but also the most important case is when
a quantum particle has spin $-\tfrac12, \tfrac12$, called a spin-half particle for short, and its
quantum state is
Ψ ∈ 𝐿 2 (R3 ) ⊗ C2 . (4.21)
Going by Definition 4.4, this is a function $\Psi : \mathbb{R}^3 \times \{-\tfrac12, \tfrac12\} \to \mathbb{C}$, where $\Psi(x, \sigma)$
is an 𝐿 2 (R3 )-integrable function in the argument 𝑥 for each 𝜎 = − 21 or 𝜎 = 21 . It is
also not hard to see that when we have a tensor product of an infinite-dimensional
space with a finite-dimensional one, every element must be a finite sum of separable
terms, so Ψ in (4.21) may be expressed as
\[ \Psi = \sum_{i=1}^{r} \psi_i \otimes \chi_i \quad\text{or}\quad \Psi(x, \sigma) = \sum_{i=1}^{r} \psi_i(x)\,\chi_i(\sigma) \tag{4.22} \]
with $\psi_i \in L^2(\mathbb{R}^3)$ and $\chi_i : \{-\tfrac12, \tfrac12\} \to \mathbb{C}$. In physics parlance, Ψ is the total wave
function of the particle and 𝜓𝑖 and 𝜒𝑖 are its spatial and spin wave functions; the
variables that Ψ depends on are called degrees of freedom, with those describing
position and momentum called external degrees of freedom and others like spin
called internal degrees of freedom (Cohen-Tannoudji et al. 2020a, Chapter II,
Section F).
This also illustrates why, as we discussed in Example 4.5, it is often desirable to
view C𝑛 as the set of complex-valued functions on some finite sets of 𝑛 elements.
Here the finite set is $\{-\tfrac12, \tfrac12\}$ and C2 in (4.21) is just shorthand for the spin state space
\[ \mathbb{C}^{\{-1/2, 1/2\}} = \bigl\{ \chi : \bigl\{-\tfrac12, \tfrac12\bigr\} \to \mathbb{C} \bigr\}. \]
The domain $\mathbb{R}^3 \times \{-\tfrac12, \tfrac12\}$ is called the position–spin space (Pauli 1980) and plays
a role in the Pauli exclusion principle. While two classical particles cannot simul-
taneously occupy the exact same location in R3 , two quantum particles can as long
as they have different spins, as that means they are occupying different locations
in $\mathbb{R}^3 \times \{-\tfrac12, \tfrac12\}$. A consequence is that two electrons with opposite spins can
occupy the same molecular orbital, and when they do, we have a covalent bond in
21 To avoid differentiability and integrability issues, we need 𝜓𝑚,𝑛, 𝑝 to be in the Schwartz space
𝑆(R3 ) ⊆ 𝐿 2 (R3 ) if we take a rigged Hilbert space approach (de la Madrid 2005) or the Sobolev
space 𝐻 2 (R3 ) ⊆ 𝐿 2 (R3 ) if we take an unbounded self-adjoint operators approach (Teschl 2014,
Chapter 7), but we disregard these to keep things simple.
the molecule. The Pauli exclusion principle implies the converse: if two electrons
occupy the same molecular orbital, then they must have opposite spins. We will
see that this is a consequence of the antisymmetry of the total wave function.
In fact, ‘two electrons with opposite spins occupying the same molecular orbital’
is the quantum mechanical definition of a covalent bond in chemistry. When this
happens, quantum mechanics mandates that we may not speak of each electron
individually but need to consider both as a single entity described by a single wave
function
\[ \Psi \in (L^2(\mathbb{R}^3) \otimes \mathbb{C}^2)\,\widehat{\otimes}\,(L^2(\mathbb{R}^3) \otimes \mathbb{C}^2), \tag{4.23} \]
with each copy of 𝐿 2 (R3 ) ⊗ C2 associated with one of the electrons. The issue with
writing the tensor product in the form (4.23) is that Ψ will not be a finite sum of
separable terms since it is not a tensor product of two infinite-dimensional spaces.
As we will see in Section 4.3, we may arrange the factors in a tensor product in
arbitrary order and obtain isomorphic spaces
\[ (L^2(\mathbb{R}^3) \otimes \mathbb{C}^2)\,\widehat{\otimes}\,(L^2(\mathbb{R}^3) \otimes \mathbb{C}^2) \cong L^2(\mathbb{R}^3)\,\widehat{\otimes}\,L^2(\mathbb{R}^3) \otimes \mathbb{C}^2 \otimes \mathbb{C}^2, \tag{4.24} \]
but this is also obvious from Definition 4.4 by observing that for finite sums
\[ \sum_{i=1}^{r} \psi_i(x)\chi_i(\sigma)\varphi_i(y)\xi_i(\tau) = \sum_{i=1}^{r} \psi_i(x)\varphi_i(y)\chi_i(\sigma)\xi_i(\tau), \]
and then taking completion. Since $L^2(\mathbb{R}^3)\,\widehat{\otimes}\,L^2(\mathbb{R}^3) = L^2(\mathbb{R}^3 \times \mathbb{R}^3)$ by (4.13) and
$\mathbb{C}^2 \otimes \mathbb{C}^2 = \mathbb{C}^{2 \times 2}$ by Example 4.5, again the latter is shorthand for
\[ \mathbb{C}^{\{-1/2, 1/2\} \times \{-1/2, 1/2\}} = \bigl\{ \chi : \bigl\{ (-\tfrac12, -\tfrac12), (-\tfrac12, \tfrac12), (\tfrac12, -\tfrac12), (\tfrac12, \tfrac12) \bigr\} \to \mathbb{C} \bigr\}, \]
we deduce that the state space in (4.24) is 𝐿 2 (R6 ) ⊗ C2×2 . The total wave function
for two electrons may now be expressed as a finite sum of separable terms
\[ \Psi(x, y, \sigma, \tau) = \sum_{i=1}^{r} \varphi_i(x, y)\,\chi_i(\sigma, \tau), \]
nicely separated into external degrees of freedom (𝑥, 𝑦) ∈ R3 × R3 and internal
ones (𝜎, 𝜏) as before. Henceforth we will write Ψ(𝑥, 𝜎; 𝑦, 𝜏) if we view Ψ as an
element of the space on the left of (4.24) and Ψ(𝑥, 𝑦, 𝜎, 𝜏) if we view it as an
element of the space on the right. The former emphasizes that Ψ is a wave function
of two electrons, one with degrees of freedom (𝑥, 𝜎) and the other (𝑦, 𝜏); the latter
emphasizes the separation of the wave function into external and internal degrees
of freedom.
Electrons are fermions, which are quantum particles with antisymmetric wave
functions, that is,
Ψ(𝑥, 𝜎; 𝑦, 𝜏) = −Ψ(𝑦, 𝜏; 𝑥, 𝜎). (4.25)
For simplicity, suppose we have a separable wave function
Ψ(𝑥, 𝜎; 𝑦, 𝜏) = Ψ(𝑥, 𝑦, 𝜎, 𝜏) = 𝜑(𝑥, 𝑦)𝜒(𝜎, 𝜏).
Aside from their spins, identical electrons in the same orbital are indistinguishable
and we must have 𝜑(𝑥, 𝑦) = 𝜑(𝑦, 𝑥), so the antisymmetry (4.25) must be a con-
sequence of 𝜒(𝜎, 𝜏) = −𝜒(𝜏, 𝜎). As in Example 4.6, a basis for C2 is given by the
delta functions 𝛿−1/2 and 𝛿1/2 and thus a basis for C2×2 is given by
𝛿−1/2 ⊗ 𝛿−1/2 , 𝛿−1/2 ⊗ 𝛿1/2 , 𝛿1/2 ⊗ 𝛿−1/2 , 𝛿1/2 ⊗ 𝛿1/2 .
We may thus express
𝜒 = 𝑎𝛿−1/2 ⊗ 𝛿−1/2 + 𝑏𝛿−1/2 ⊗ 𝛿1/2 + 𝑐𝛿1/2 ⊗ 𝛿−1/2 + 𝑑𝛿1/2 ⊗ 𝛿1/2 ,
and if $\chi(\sigma, \tau) = -\chi(\tau, \sigma)$ for all $\sigma, \tau$, then it implies that $a = d = 0$ and $b = -c$.
Choosing $b = 1/\sqrt{2}$ to get unit norms, we have
\[ \chi = \frac{1}{\sqrt{2}}\,(\delta_{-1/2} \otimes \delta_{1/2} - \delta_{1/2} \otimes \delta_{-1/2}). \]
In other words, the two electrons in the same molecular orbital must have opposite
spins – either $(-\tfrac12, \tfrac12)$ or $(\tfrac12, -\tfrac12)$.
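In coordinates the argument is a two-line computation: writing $\chi$ via its $2 \times 2$ coefficient matrix in the basis $\delta_{-1/2}, \delta_{1/2}$, antisymmetry forces that matrix to be antisymmetric, and every antisymmetric $2 \times 2$ matrix is a multiple of the one below. A minimal numerical sketch:
\begin{verbatim}
import numpy as np

# coefficient matrix of chi in the basis {delta_{-1/2}, delta_{1/2}}:
# chi = a d(x)d + b d(x)u + c u(x)d + e u(x)u  with  C = [[a, b], [c, e]]
C = np.array([[0.0, 1.0], [-1.0, 0.0]]) / np.sqrt(2.0)   # b = -c = 1/sqrt(2)

assert np.allclose(C, -C.T)              # chi(sigma, tau) = -chi(tau, sigma)
assert np.isclose(np.linalg.norm(C), 1)  # unit norm

# any antisymmetric 2 x 2 coefficient matrix is a scalar multiple of this one
rng = np.random.default_rng(3)
M = rng.standard_normal((2, 2))
A = (M - M.T) / 2                        # antisymmetric part of an arbitrary matrix
assert np.isclose(abs(np.vdot(A, C))**2,
                  np.linalg.norm(A)**2 * np.linalg.norm(C)**2)
\end{verbatim}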
The discussion above may be extended to more particles (𝑑 > 2) and more spin
values ($s > \tfrac12$). When we have $d$ quantum particles with spins $-\tfrac12, \tfrac12$, the total wave
function satisfies
\[ \Psi \in (L^2(\mathbb{R}^3) \otimes \mathbb{C}^2)\,\widehat{\otimes}\cdots\widehat{\otimes}\,(L^2(\mathbb{R}^3) \otimes \mathbb{C}^2) \cong L^2(\mathbb{R}^{3d}) \otimes \mathbb{C}^{2 \times \cdots \times 2} \tag{4.26} \]
via the same argument. Again we may group the degrees of freedom either by
particles or by external/internal, that is,
Ψ(𝑥1 , 𝜎1 ; 𝑥2 , 𝜎2 ; . . . ; 𝑥 𝑑 , 𝜎𝑑 ) = Ψ(𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 , 𝜎1 , 𝜎2 , . . . , 𝜎𝑑 ),
depending on whether we want to view Ψ as an element of the left or right space
in (4.26), with antisymmetry usually expressed in the former,
Ψ(𝑥1 , 𝜎1 ; . . . ; 𝑥 𝑑 , 𝜎𝑑 ) = (−1)sgn( 𝜋) Ψ(𝑥 𝜋(1) , 𝜎𝜋(1) ; . . . ; 𝑥 𝜋(𝑑) , 𝜎𝜋(𝑑) ) (4.27)
for all 𝜋 ∈ 𝔖𝑑 , and separability usually expressed in the latter,
\[ \Psi(x_1, \ldots, x_d, \sigma_1, \ldots, \sigma_d) = \sum_{i=1}^{r} \psi_i(x_1, \ldots, x_d)\,\chi_i(\sigma_1, \ldots, \sigma_d). \]
As before, C2×···×2 is to be regarded as
\[ \mathbb{C}^{\{-1/2, 1/2\} \times \cdots \times \{-1/2, 1/2\}} = \Bigl\{ \chi : \bigl\{-\tfrac12, \tfrac12\bigr\}^{d} \to \mathbb{C} \Bigr\}. \]
We may easily extend our discussions to a spin-𝑠 quantum particle for any
\[ s \in \bigl\{ 0, \tfrac12, 1, \tfrac32, 2, \tfrac52, \ldots \bigr\}, \]
in which case its state space would be
𝐿 2 (R3 ) ⊗ C2𝑠+1 ,
where C2𝑠+1 is shorthand for the space of functions 𝜒 : {−𝑠, −𝑠 + 1, . . . , 𝑠 − 1, 𝑠} →
C. The particle is called a fermion or a boson depending on whether 𝑠 is a
half-integer or an integer. The key difference is that the total wave function of 𝑑
identical fermions is antisymmetric as in (4.27) whereas that of 𝑑 identical bosons
is symmetric, that is,
Ψ(𝑥1 , 𝜎1 ; . . . ; 𝑥 𝑑 , 𝜎𝑑 ) = Ψ(𝑥 𝜋(1) , 𝜎𝜋(1) ; . . . ; 𝑥 𝜋(𝑑) , 𝜎𝜋(𝑑) ) (4.28)
for all 𝜋 ∈ 𝔖𝑑 . Collectively, (4.27) and (4.28) are the modern form of the Pauli
exclusion principle. Even more generally, we may consider 𝑑 quantum particles
with different spins 𝑠1 , . . . , 𝑠 𝑑 and state space
\[ (L^2(\mathbb{R}^3) \otimes \mathbb{C}^{2s_1+1})\,\widehat{\otimes}\cdots\widehat{\otimes}\,(L^2(\mathbb{R}^3) \otimes \mathbb{C}^{2s_d+1}) \cong L^2(\mathbb{R}^{3d}) \otimes \mathbb{C}^{(2s_1+1) \times \cdots \times (2s_d+1)}. \]
We recommend the very lucid exposition in Faddeev and Yakubovskiı̆ (2009,
Sections 46–48) and Takhtajan (2008, Chapter 4) for further information.
For most purposes in applied and computational mathematics, 𝑠 = 1/2, that is,
the spin-half case is literally the only one that matters, as particles that constitute or-
dinary matter (electrons, protons, neutrons, quarks, etc.) are all spin-half fermions.
Spin-zero (e.g. Higgs) and spin-one bosons (e.g. W and Z, photons, gluons) are
already the realm of particle physics, and higher-spin particles, whether fermions
or bosons, tend to be hypothetical (e.g. gravitons, gravitinos).
In Example 4.7 we have implicitly relied on the following three background
assumptions for a physical system in quantum mechanics.
(a) The state space of a system is a Hilbert space.
(b) The state space of a composite system is the tensor product of the state spaces
of the component systems.
(c) The state space of a composite system of identical particles lies either in the
symmetric or alternating tensor product of the state spaces of the component
systems.
These assumptions are standard; see Cohen-Tannoudji et al. (2020a, Chapter III,
Section B1), Nielsen and Chuang (2000, Section 2.2.8) and Cohen-Tannoudji, Diu
and Laloë (2020b, Chapter XIV, Section C).
Example 4.7 shows the simplicity of Definition 4.4. If our quantum system has yet
other degrees of freedom, say, flavour or colour, we just include the corresponding
wave functions as factors in the separable product. For instance, the total wave
function of a proton or neutron is a finite sum of separable terms of the form
(Griffiths 2008, Section 5.6.1)
𝜓spatial ⊗ 𝜓spin ⊗ 𝜓flavour ⊗ 𝜓colour .
Nevertheless, we would like more flexibility than Definition 4.4 provides. For
instance, instead of writing the state of a spin-half particle as a sum of real-valued
functions in (4.22), we might prefer to view it as a C2 -valued 𝐿 2 -vector field on R3
indexed by spin:
\[ \Psi(x) = \begin{bmatrix} \psi_{-1/2}(x)\\ \psi_{1/2}(x) \end{bmatrix} \tag{4.29} \]
with
\[ \|\Psi\|^2 = \int_{\mathbb{R}^3} |\psi_{-1/2}(x)|^2\,\mathrm{d}x + \int_{\mathbb{R}^3} |\psi_{1/2}(x)|^2\,\mathrm{d}x < \infty. \]
In other words, we want to allow 𝐿 2 (R3 ; C2 ) to be a possible interpretation for
𝐿 2 (R3 ) ⊗ C2 . The next two sections will show how to accomplish this.
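On a discretized domain the two interpretations differ only by bookkeeping. The sketch below, with a one-dimensional grid standing in for $\mathbb{R}^3$ and arbitrary wave functions, stores Ψ once as a function on grid $\times \{-\tfrac12, \tfrac12\}$ and once as a $\mathbb{C}^2$-valued field, and checks that the squared norm above is the same either way.
\begin{verbatim}
import numpy as np

# a crude one-dimensional stand-in for R^3: n grid points with spacing dx
n = 200
x = np.linspace(-5.0, 5.0, n)
dx = x[1] - x[0]

psi_minus = np.exp(-x**2) + 0j          # spin -1/2 component
psi_plus = x * np.exp(-x**2) + 0j       # spin +1/2 component

# interpretation 1: a function on grid x {-1/2, +1/2}, stored as an n x 2 array
Psi = np.stack([psi_minus, psi_plus], axis=1)

# interpretation 2: a C^2-valued field, one 2-vector Psi[i, :] per grid point
spinor_at_x0 = Psi[0, :]
assert spinor_at_x0.shape == (2,)

norm_sq_split = dx * (np.sum(np.abs(psi_minus)**2) + np.sum(np.abs(psi_plus)**2))
norm_sq_field = dx * np.sum(np.abs(Psi)**2)
assert np.isclose(norm_sq_split, norm_sq_field)
print(Psi.shape)    # (200, 2): L^2(X) tensor C^2 realized as L^2(X; C^2)
\end{verbatim}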
4.2. Tensor products of abstract vector spaces
Most readers’ first encounter with scalars and vectors will likely take the following
forms:
• a scalar is an object that only has magnitude,
• a vector is an object with both magnitude and direction.
In this section we will construct tensors of arbitrary order starting from these. For
a vector $v$ in this sense, we will denote its magnitude by $\|v\|$ and its direction by $\hat{v}$. Note that here $\|\cdot\|$ does not need to be a norm, just some notion of magnitude.
Likewise we will write |𝑎| ∈ [0, ∞) for the magnitude of a scalar 𝑎 and sgn(𝑎) for
its sign if it has one.
If we go along this line of reasoning, what should the next object be? One
possible answer is that it ought to be an object with a magnitude and two directions;
this is called a dyad and we have in fact encountered it in Section 3.1. Henceforth
it is straightforward to generalize to objects with a magnitude and an arbitrary
number of directions:
object: property
scalar $a$: has magnitude $|a|$
vector $v$: has magnitude $\|v\|$ and a direction $\hat{v}$
dyad $v \otimes w$: has magnitude $\|v \otimes w\|$ and two directions $\hat{v}, \hat{w}$
triad $u \otimes v \otimes w$: has magnitude $\|u \otimes v \otimes w\|$ and three directions $\hat{u}, \hat{v}, \hat{w}$
$\vdots$
$d$-ad $u \otimes v \otimes \cdots \otimes w$: has magnitude $\|u \otimes v \otimes \cdots \otimes w\|$ and $d$ directions $\hat{u}, \hat{v}, \ldots, \hat{w}$
Such objects are called polyads and have been around for over a century; the first
definition of tensor rank was in fact in terms of polyads (Hitchcock 1927). We
have slightly modified the notation by adding the tensor product symbol ⊗ between
vectors to bring it in line with modern treatments in algebra, which we will describe
later. Note that the ⊗ in 𝑣 ⊗ 𝑤 is solely used as a delimiter; we could have written
it as $vw$ as in Hitchcock (1927) and Morse and Feshbach (1953) or $|v\rangle|w\rangle$ or $|v, w\rangle$
in Dirac notation (Cohen-Tannoudji et al. 2020a, Chapter II, Section F2c). Indeed,
𝑣 and 𝑤 may not have coordinates and there is no ‘formula for 𝑣 ⊗ 𝑤’. Also, the
order matters as 𝑣 and 𝑤 may denote vectors of different nature, so in general
𝑣 ⊗ 𝑤 ≠ 𝑤 ⊗ 𝑣. (4.30)
The reason for this would be clarified with Example 4.8. Used in this sense, ⊗ is
called an abstract tensor product or dyadic product in the older literature (Chou
and Pagano 1992, Morse and Feshbach 1953, Tai 1997).
With hindsight, we see that the extension of a scalar in the sense of an object
with magnitude, and a vector in the sense of one with magnitude and direction, is
not a tensor but a rank-one tensor. To get tensors that are not rank-one we will have
to look at the arithmetic of dyads and polyads. As we have discussed, each dyad is
essentially a placeholder for three pieces of information:
(magnitude, first direction, second direction)
or, in our notation, $(\|v \otimes w\|, \hat{v}, \hat{w})$. We are trying to devise a consistent system of
arithmetic for such objects that (a) preserves their information content, (b) merges
information whenever possible, and (c) is consistent with scalar and vector arith-
metic.
Scalar products are easy. The role of scalars is that they scale vectors, as reflected
in its name. We expect the same for dyads. If any of the vectors in a dyad is scaled
by 𝑎, we expect its magnitude to be scaled by 𝑎 but the two directions to remain
unchanged. So we require
(𝑎𝑣) ⊗ 𝑤 = 𝑣 ⊗ (𝑎𝑤) ≕ 𝑎𝑣 ⊗ 𝑤, (4.31)
where the last term is defined to be the common value of the first two. This
seemingly innocuous property is the main reason tensor products, not direct sums,
are used to combine quantum state spaces. In quantum mechanics, a quantum state
is described not so much by a vector 𝑣 but the entire one-dimensional subspace
spanned by 𝑣; thus (4.31) ensures that when combining two quantum states in the
form of two one-dimensional subspaces, it does not matter which (non-zero) vector
in the subspace we pick to represent the state. On the other hand, for direct sums,
(𝑎𝑣) ⊕ 𝑤 ≠ 𝑣 ⊕ (𝑎𝑤) ≠ 𝑎(𝑣 ⊕ 𝑤) (4.32)
in general; for example, with 𝑣 = (1, 0) ∈ R2 , 𝑤 = 1 ∈ R1 and 𝑎 = 2 ∈ R, we get
(2, 0, 1) ≠ (1, 0, 2) ≠ (2, 0, 2).
To ensure consistency with the usual scalar and vector arithmetic, one also needs
the assumption that scalar multiplication is always distributive and associative in
the following sense:
(𝑎 + 𝑏)𝑣 ⊗ 𝑤 = 𝑎𝑣 ⊗ 𝑤 + 𝑏𝑣 ⊗ 𝑤, (𝑎𝑏)𝑣 ⊗ 𝑤 = 𝑎(𝑏𝑣 ⊗ 𝑤).
106 L.-H. Lim


𝑣 1 , 𝑣 2 are vectors, then 𝑣 1 + 𝑣 2 is a vector according to either the parallelogram law
of addition (for geometrical or physical vectors) or the axioms of vector spaces (for
abstract vectors). But for dyads 𝑣 1 ⊗ 𝑤 1 and 𝑣 2 ⊗ 𝑤 2 , their sum 𝑣 1 ⊗ 𝑤 1 + 𝑣 2 ⊗ 𝑤 2
is no longer a dyad; in general there is no way to simplify the sum further and it
should be regarded as an object with two magnitudes and four directions:22
(magnitude 1, first direction 1, second direction 1)
& (magnitude 2, first direction 2, second direction 2).
Just as the ⊗ is used as a delimiter where order matters, the + in the sum of dyads
is used as a different type of delimiter where order does not matter:
𝑣1 ⊗ 𝑤1 + 𝑣2 ⊗ 𝑤2 = 𝑣2 ⊗ 𝑤2 + 𝑣1 ⊗ 𝑤1,
that is, + is commutative. We also want + to be associative so that we may
unambiguously define, for any finite 𝑟 ∈ N, a sum of 𝑟 dyads
𝑣 1 ⊗ 𝑤 1 + 𝑣 2 ⊗ 𝑤 2 + · · · + 𝑣𝑟 ⊗ 𝑤𝑟
that we will call a dyadic. To ensure consistency with the usual scalar and vector
arithmetic, we assume throughout this article that addition + is always commutative
and associative. This assumption is also perfectly reasonable as a dyadic is just a
placeholder for the information contained in its dyad summands:
(magnitude 1, first direction 1, second direction 1)
& (magnitude 2, first direction 2, second direction 2) & · · ·
& (magnitude 𝑟, first direction 𝑟, second direction 𝑟).
There is one scenario where a sum can be simplified and the directional information
in two dyads merged. Whenever 𝑣 1 = 𝑣 2 or 𝑤 1 = 𝑤 2 , we want
𝑣 ⊗ 𝑤 1 + 𝑣 ⊗ 𝑤 2 = 𝑣 ⊗ (𝑤 1 + 𝑤 2 ),
(4.33)
𝑣 1 ⊗ 𝑤 + 𝑣 2 ⊗ 𝑤 = (𝑣 1 + 𝑣 2 ) ⊗ 𝑤,
and these may be applied to combine any pair of summands in a dyadic that share
a common factor. The following example provides further motivation for why we
would like these arithmetical properties to hold.
Example 4.8 (stress tensor). In physics, a dyadic is essentially a vector whose
components are themselves vectors. The standard example is the stress tensor,
sometimes called the Cauchy stress tensor to distinguish it from other related
models. The stress 𝜎 at a point in space23 has three components in the directions
22 We could of course add corresponding vectors 𝑣 1 + 𝑣 2 and 𝑤 1 + 𝑤 2 , but that leads to direct sums
of vector spaces, which is inappropriate because of (4.32).
23 We assume ‘space’ here means R3 with coordinate axes labelled 𝑥, 𝑦, 𝑧. The older literature would
use i, j, k instead of 𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 .
given by unit vectors 𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 :
𝜎 = 𝜎𝑥 𝑒 𝑥 + 𝜎𝑦 𝑒 𝑦 + 𝜎𝑧 𝑒 𝑧 . (4.34)
Now the deviation from a usual linear combination of vectors is that each of the
coefficients 𝜎𝑥 , 𝜎𝑦 , 𝜎𝑧 is not a scalar but a vector. This is the nature of stress: in
every direction, the stress in that direction has a normal component in that direction
and two shear components in the plane perpendicular to it. For example,
𝜎𝑥 = 𝜎𝑥 𝑥 𝑒 𝑥 + 𝜎𝑦 𝑥 𝑒 𝑦 + 𝜎𝑧 𝑥 𝑒 𝑧 , (4.35)
the component 𝜎𝑥 𝑥 in the direction of 𝑒 𝑥 is called normal stress whereas the two
components 𝜎𝑦 𝑥 and 𝜎𝑧 𝑥 in the plane perpendicular to 𝑒 𝑥 , i.e. span{𝑒 𝑦 , 𝑒 𝑧 }, is
called shear stress. Here 𝜎𝑥 𝑥 , 𝜎𝑦 𝑥 , 𝜎𝑧 𝑥 are scalars and (4.35) is an honest linear
combination. It is easiest to represent 𝜎 pictorially, as in Borg (1990, Figure 4.10)
or (Chou and Pagano 1992, Figure 1.5), adapted here as Figure 4.2, where the
point is blown up into a cube to show the fine details. The coefficients 𝜎𝑥 , 𝜎𝑦 ,
[Figure 4.2 appears here: a point of $\mathbb{R}^3$ blown up into a cube, with coordinate vectors $e_x, e_y, e_z$ and the nine stress components $\sigma_{xx}, \sigma_{yx}, \sigma_{zx}, \sigma_{xy}, \sigma_{yy}, \sigma_{zy}, \sigma_{xz}, \sigma_{yz}, \sigma_{zz}$ marked on its faces.]
Figure 4.2. Depiction of the stress tensor. The cube represents a single point in
R3 and should be regarded as infinitesimally small. In particular, the origins of the
three coordinate frames on the surface of the cube are the same point.
𝜎𝑧 are called stress vectors in the directions 𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 respectively. As Figure 4.2
indicates, each stress vector has a normal component pointing in the same direction
as itself and two shear components in the plane perpendicular to it:
𝜎𝑦 = 𝜎𝑥 𝑦 𝑒 𝑥 + 𝜎𝑦 𝑦 𝑒 𝑦 + 𝜎𝑧 𝑦 𝑒 𝑧 ,
𝜎𝑧 = 𝜎𝑥𝑧 𝑒 𝑥 + 𝜎𝑦𝑧 𝑒 𝑦 + 𝜎𝑧 𝑧 𝑒 𝑧 .
Since the ‘coefficients’ 𝜎𝑥 , 𝜎𝑦 , 𝜎𝑧 in the ‘linear combination’ in (4.34) are vectors,
to conform to modern notation we just insert ⊗:
𝜎 = 𝜎𝑥 ⊗ 𝑒 𝑥 + 𝜎𝑦 ⊗ 𝑒 𝑦 + 𝜎𝑧 ⊗ 𝑒 𝑧 .
If we then plug the expressions for 𝜎𝑥 , 𝜎𝑦 , 𝜎𝑧 into that of 𝜎 and use (4.33), we get
𝜎 = 𝜎𝑥 𝑥 𝑒 𝑥 ⊗ 𝑒 𝑥 + 𝜎𝑦 𝑥 𝑒 𝑦 ⊗ 𝑒 𝑥 + 𝜎𝑧 𝑥 𝑒 𝑧 ⊗ 𝑒 𝑥
+ 𝜎𝑥 𝑦 𝑒 𝑥 ⊗ 𝑒 𝑦 + 𝜎𝑦 𝑦 𝑒 𝑦 ⊗ 𝑒 𝑦 + 𝜎𝑧 𝑦 𝑒 𝑧 ⊗ 𝑒 𝑦
+ 𝜎𝑥𝑧 𝑒 𝑥 ⊗ 𝑒 𝑧 + 𝜎𝑦𝑧 𝑒 𝑦 ⊗ 𝑒 𝑧 + 𝜎𝑧 𝑧 𝑒 𝑧 ⊗ 𝑒 𝑧 . (4.36)
Without the ⊗, this is the expression of stress as a dyadic in Morse and Feshbach
(1953, pp. 70–71). This also explains why we want (4.30), that is, ⊗ should be
non-commutative. For instance, 𝑒 𝑧 ⊗ 𝑒 𝑦 , the 𝑧-shear component for the 𝑦-normal
direction, and 𝑒 𝑧 ⊗ 𝑒 𝑦 , the 𝑦-shear component for the 𝑧-normal direction mean
completely different things.
If we set a basis ℬ = {𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 }, we may represent 𝜎 as a 3 × 3 matrix, that is,
\[ \Sigma \coloneqq [\sigma]_{\mathcal{B}} = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz}\\ \sigma_{yx} & \sigma_{yy} & \sigma_{yz}\\ \sigma_{zx} & \sigma_{zy} & \sigma_{zz} \end{bmatrix} \]
with normal stresses on the diagonal and shear stresses on the off-diagonal. If we
have a different basis ℬ′ = {𝑒 ′𝑥 , 𝑒 ′𝑦 , 𝑒 ′𝑧 } with a different representation
\[ \Sigma' \coloneqq [\sigma]_{\mathcal{B}'} = \begin{bmatrix} \sigma'_{xx} & \sigma'_{xy} & \sigma'_{xz}\\ \sigma'_{yx} & \sigma'_{yy} & \sigma'_{yz}\\ \sigma'_{zx} & \sigma'_{zy} & \sigma'_{zz} \end{bmatrix}, \]
then by plugging the change-of-basis relations
\[ \left\{ \begin{aligned} e'_x &= c_{xx} e_x + c_{yx} e_y + c_{zx} e_z,\\ e'_y &= c_{xy} e_x + c_{yy} e_y + c_{zy} e_z,\\ e'_z &= c_{xz} e_x + c_{yz} e_y + c_{zz} e_z, \end{aligned} \right. \qquad C \coloneqq \begin{bmatrix} c_{xx} & c_{xy} & c_{xz}\\ c_{yx} & c_{yy} & c_{yz}\\ c_{zx} & c_{zy} & c_{zz} \end{bmatrix} \]
into
\[ \begin{aligned} \sigma &= \sigma'_{xx}\, e'_x \otimes e'_x + \sigma'_{yx}\, e'_y \otimes e'_x + \sigma'_{zx}\, e'_z \otimes e'_x\\ &\quad + \sigma'_{xy}\, e'_x \otimes e'_y + \sigma'_{yy}\, e'_y \otimes e'_y + \sigma'_{zy}\, e'_z \otimes e'_y\\ &\quad + \sigma'_{xz}\, e'_x \otimes e'_z + \sigma'_{yz}\, e'_y \otimes e'_z + \sigma'_{zz}\, e'_z \otimes e'_z, \end{aligned} \]
applying (4.33) and comparing with (4.36), we get
\[ \begin{bmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz}\\ \sigma_{yx} & \sigma_{yy} & \sigma_{yz}\\ \sigma_{zx} & \sigma_{zy} & \sigma_{zz} \end{bmatrix} = \begin{bmatrix} c_{xx} & c_{xy} & c_{xz}\\ c_{yx} & c_{yy} & c_{yz}\\ c_{zx} & c_{zy} & c_{zz} \end{bmatrix} \begin{bmatrix} \sigma'_{xx} & \sigma'_{xy} & \sigma'_{xz}\\ \sigma'_{yx} & \sigma'_{yy} & \sigma'_{yz}\\ \sigma'_{zx} & \sigma'_{zy} & \sigma'_{zz} \end{bmatrix} \begin{bmatrix} c_{xx} & c_{yx} & c_{zx}\\ c_{xy} & c_{yy} & c_{zy}\\ c_{xz} & c_{yz} & c_{zz} \end{bmatrix}. \]
By the table on page 19, the fact that any two coordinate representations of Cauchy
stress 𝜎 satisfy the transformation rule Σ ′ = 𝐶 −1 Σ𝐶 −T says that it is a contravariant
2-tensor. This provides yet another example why we do not want to identify a 2-
tensor with the matrix that represents it: stripped of the basis vectors, the physical
connotations of (4.34) and (4.35) that we discussed earlier are irretrievably lost.
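The change-of-basis computation above is easy to verify numerically: take an arbitrary invertible $C$ whose columns hold the coordinates of $e'_x, e'_y, e'_z$ in the basis $\{e_x, e_y, e_z\}$ and an arbitrary $\Sigma'$; reassembling the dyadic from its primed components reproduces $\Sigma = C \Sigma' C^{\mathsf T}$, i.e. $\Sigma' = C^{-1} \Sigma C^{-\mathsf T}$. A minimal sketch with random data:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
C = rng.standard_normal((3, 3))            # columns: coordinates of e'_x, e'_y, e'_z
Sigma_prime = rng.standard_normal((3, 3))  # coordinates sigma'_{ij} in the primed basis

# sigma = sum_{i,j} sigma'_{ij} e'_i (x) e'_j, expanded in the unprimed basis
Sigma = sum(Sigma_prime[i, j] * np.outer(C[:, i], C[:, j])
            for i in range(3) for j in range(3))

assert np.allclose(Sigma, C @ Sigma_prime @ C.T)
assert np.allclose(Sigma_prime, np.linalg.inv(C) @ Sigma @ np.linalg.inv(C).T)
\end{verbatim}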
What we have described above is the behaviour of stress at a single point in space,
assumed to be R3 , and in Cartesian coordinates. As we mentioned in Example 3.13,
stress is a tensor field and 𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 really form a basis of the tangent space T 𝑣 (R3 )
at the point 𝑣 = (𝑥, 𝑦, 𝑧) ∈ R3 , and we could have used any coordinates. Note that
our discussion above merely used stress in a purely nominal way: our focus is on
the dyadic that represents stress. The same discussion applies mutatis mutandis
to any contravariant 2-tensors describing inertia, polarization, strain, tidal force,
viscosity, etc. (Borg 1990). We refer the readers to Borg (1990, Chapter 4), Chou
and Pagano (1992, Chapter 1) and Irgens (2019, Chapter 2) for the actual physical
details regarding stress, including how to derive (4.34) from first principles.
Stress is a particularly vital notion, as many tensors describing physical prop-
erties in various constitutive equations (Hartmann 1984, Table 1) are defined as
derivatives with respect to it. In fact, this is one way in which tensors of higher order
arise in physics – as second-order derivatives of real-valued functions with respect
to first- and second-order tensors. For instance, in Cartesian coordinates, the piezo-
electric tensor, the piezo-magnetic tensor and the elastic tensor are represented as
hypermatrices 𝐷, 𝑄 ∈ R3×3×3 and 𝑆 ∈ R3×3×3×3 , where
\[ d_{ijk} = -\frac{\partial^2 G}{\partial \sigma_{ij}\, \partial e_k}, \qquad q_{ijk} = -\frac{\partial^2 G}{\partial \sigma_{ij}\, \partial h_k}, \qquad s_{ijkl} = -\frac{\partial^2 G}{\partial \sigma_{ij}\, \partial \sigma_{kl}}, \]
and 𝑖, 𝑗, 𝑘, 𝑙 = 1, 2, 3. Here 𝐺 = 𝐺(Σ, 𝐸, 𝐻, 𝑇 ) is the Gibbs potential, a real-
valued function of the second-order stress tensor Σ, the first-order tensors 𝐸 and 𝐻
representing electric and magnetic field respectively, and the zeroth-order tensor 𝑇
representing temperature (Hartmann 1984, Section 3).
It is straightforward to extend the above construction of a dyadic to arbitrary
dimensions and arbitrary vectors in abstract vector spaces U and V. A dyadic is a
‘linear combination’ of vectors in V,
𝜎 = 𝜎1 𝑣 1 + · · · + 𝜎𝑛 𝑣 𝑛 ,
whose coefficients 𝜎 𝑗 are vectors in U:
$\sigma_j = \sigma_{1j} u_1 + \cdots + \sigma_{mj} u_m$, $j = 1, \ldots, n$.
Thus by (4.31) and (4.33) we have
\[ \sigma = \sum_{i=1}^{m} \sum_{j=1}^{n} \sigma_{ij}\, u_i v_j = \sum_{i=1}^{m} \sum_{j=1}^{n} \sigma_{ij}\, u_i \otimes v_j \]

in old (pre-⊗) and modern notation respectively. Note that the coefficients 𝜎𝑖 𝑗
are now scalars and a dyadic is an honest linear combination of dyads with scalar
coefficients. We denote the set of all such dyadics as U ⊗ V.
We may recursively extend the construction to arbitrary order. With a third
vector space W, a triadic is a ‘linear combination’ of vectors in W,
𝜏 = 𝜏1 𝑤 1 + · · · + 𝜏𝑝 𝑤 𝑝 ,
whose coefficients 𝜏𝑘 are dyadics in U ⊗ V:
\[ \tau_k = \sum_{i=1}^{m} \sum_{j=1}^{n} \tau_{ijk}\, u_i v_j = \sum_{i=1}^{m} \sum_{j=1}^{n} \tau_{ijk}\, u_i \otimes v_j, \qquad k = 1, \ldots, p. \]
Thus we have
\[ \tau = \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{p} \tau_{ijk}\, u_i v_j w_k = \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{p} \tau_{ijk}\, u_i \otimes v_j \otimes w_k \]

in old and modern notation respectively. Henceforth we will ditch the old notation.
Again a triadic is an honest linear combination of triads and the set of all triadics
will be denoted U ⊗ V ⊗ W. Strictly speaking, we have constructed (U ⊗ V) ⊗ W and
an analogous construction considering ‘linear combinations’ of dyadics in V ⊗ W
with coefficients in U would give us U ⊗ (V ⊗ W), but it does not matter for us as
we have implicitly imposed that
(𝑢 ⊗ 𝑣) ⊗ 𝑤 = 𝑢 ⊗ (𝑣 ⊗ 𝑤) ≕ 𝑢 ⊗ 𝑣 ⊗ 𝑤 (4.37)
for all 𝑢 ∈ U, 𝑣 ∈ V, 𝑤 ∈ W.
The observant reader may have noticed that aside from (4.37) we have also
implicitly imposed the third-order analogues of (4.31) and (4.33) in the construction
above. For completeness we will state them formally but in a combined form:
(𝜆𝑢 + 𝜆 ′𝑢 ′) ⊗ 𝑣 ⊗ 𝑤 = 𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ⊗ 𝑤,
𝑢 ⊗ (𝜆𝑣 + 𝜆 ′ 𝑣 ′) ⊗ 𝑤 = 𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ⊗ 𝑣 ′ ⊗ 𝑤, (4.38)
𝑢 ⊗ 𝑣 ⊗ (𝜆𝑤 + 𝜆 ′ 𝑤 ′) = 𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ⊗ 𝑣 ⊗ 𝑤 ′
for all vectors 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, 𝑤, 𝑤 ′ ∈ W and scalars 𝜆, 𝜆 ′.
The construction described above makes the set U ⊗ V ⊗ W into an algebraic
object with scalar product, tensor addition + and tensor product ⊗ interacting in
a consistent manner according to the algebraic rules (4.37) and (4.38). We will
call it a tensor product of U, V, W, or an abstract tensor product if we need to
emphasize that it refers to this particular construction. Observe that this is an
extremely general construction.
(i) It does not depend what we call ‘scalars’ so long as we may add and multiply
them, that is, this construction works for any arbitrary modules U, V, W over
a ring 𝑅.
(ii) It does not require U, V, W to be finite-dimensional (vector spaces) or finitely
generated (modules); the whole construction is about specifying how linear
combinations behave under ⊗, and there is no need for elements to be linear
combinations of some basis or generating set.
(iii) It does not call for separate treatments of covariant and mixed tensors; the
definition is agnostic to having some or all of U, V, W replaced by their dual
spaces U∗ , V∗ , W∗ .
It almost goes without saying that the construction can be readily extended to
arbitrary order 𝑑. We may consider ‘linear combinations’ of dyadics with dyadic
coefficients to get 𝑑 = 4, ‘linear combinations’ of triadics with dyadic coefficients
to get 𝑑 = 5, etc. But looking at the end results of the 𝑑 = 2 and 𝑑 = 3 constructions,
if we have 𝑑 vector spaces U, V, . . . , W, a more direct way is to simply consider
the set of all 𝑑-adics, i.e. finite sums of 𝑑-ads,
\[ \mathrm{U} \otimes \mathrm{V} \otimes \cdots \otimes \mathrm{W} \coloneqq \biggl\{ \sum_{i=1}^{r} u_i \otimes v_i \otimes \cdots \otimes w_i : u_i \in \mathrm{U},\ v_i \in \mathrm{V},\ \ldots,\ w_i \in \mathrm{W},\ r \in \mathbb{N} \biggr\}, \tag{4.39} \]
and decree that ⊗ is associative, + is associative and commutative, and ⊗ is
distributive over + in the sense of
\[ \begin{aligned} (\lambda u + \lambda' u') \otimes v \otimes \cdots \otimes w &= \lambda\, u \otimes v \otimes \cdots \otimes w + \lambda'\, u' \otimes v \otimes \cdots \otimes w,\\ u \otimes (\lambda v + \lambda' v') \otimes \cdots \otimes w &= \lambda\, u \otimes v \otimes \cdots \otimes w + \lambda'\, u \otimes v' \otimes \cdots \otimes w,\\ &\ \,\vdots\\ u \otimes v \otimes \cdots \otimes (\lambda w + \lambda' w') &= \lambda\, u \otimes v \otimes \cdots \otimes w + \lambda'\, u \otimes v \otimes \cdots \otimes w' \end{aligned} \tag{4.40} \]
for all vectors 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, . . . , 𝑤, 𝑤 ′ ∈ W, and scalars 𝜆, 𝜆 ′.
We have used the old polyadic terminology to connect our discussions to the
older engineering and physics literature, and to show that the modern abstract con-
struction24 is firmly rooted in concrete physical considerations. We will henceforth
ditch the obsolete names 𝑑-ads and 𝑑-adics and use their modern names, rank-one
𝑑-tensors and 𝑑-tensors. Note that these arithmetic rules also allow us to define,
for any 𝑇 ∈ U ⊗ V ⊗ · · · ⊗ W and any 𝑇 ′ ∈ U ′ ⊗ V ′ ⊗ · · · ⊗ W ′, an element
𝑇 ⊗ 𝑇 ′ ∈ U ⊗ V ⊗ · · · ⊗ W ⊗ U′ ⊗ V′ ⊗ · · · ⊗ W′. (4.41)
We emphasize that U ⊗ V ⊗ · · · ⊗ W does not mean the set of rank-one tensors
but the set of finite sums of them as in (4.39). This mistake is common enough to
warrant a displayed equation to highlight the pitfall:
U ⊗ V ⊗ · · · ⊗ W ≠ {𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 : 𝑢 ∈ U, 𝑣 ∈ V, . . . , 𝑤 ∈ W}, (4.42)
unless all but one of U, V, . . . , W are one-dimensional.
For easy reference, we state a special case of the above abstract tensor product
24 This construction is equivalent to the one in Lang (2002, Chapter XVI, Section 1), likely adapted
from Bourbaki (1998, Chapter II, Section 3), and widely used in pure mathematics. Take the
free module generated by all rank-one elements 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 and quotient it by the submodule
generated by elements of the forms (𝜆𝑢 + 𝜆 ′𝑢 ′) ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 − 𝜆𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 − 𝜆 ′𝑢 ′ ⊗ 𝑣 ⊗ · · · ⊗ 𝑤,
etc., to ensure that (4.40) holds.
construction, but where we have both vector spaces and dual spaces, as another
definition for tensors.
Definition 4.9 (tensors as elements of tensor spaces). Let 𝑝 ≤ 𝑑 be non-negative
integers and let V1 , . . . , V𝑑 be vector spaces. A tensor of contravariant order 𝑝
and covariant order 𝑑 − 𝑝 is an element
𝑇 ∈ V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 .
The set above is as defined in (4.39) and is called a tensor product of vector spaces
or a tensor space for short. The tensor 𝑇 is said to be of type (𝑝, 𝑑 − 𝑝) and of order
𝑑. A tensor of type (𝑑, 0), i.e. 𝑇 ∈ V1 ⊗ · · · ⊗ V𝑑 , is called a contravariant 𝑑-tensor,
and a tensor of type (0, 𝑑), i.e. 𝑇 ∈ V∗1 ⊗ · · · ⊗ V∗𝑑 is called a covariant 𝑑-tensor.
Vector spaces have bases and a tensor space V1 ⊗ · · · ⊗ V𝑑 has a tensor product
basis given by
ℬ1 ⊗ ℬ2 ⊗ · · · ⊗ ℬ𝑑 ≔ {𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 : 𝑢 ∈ ℬ1 , 𝑣 ∈ ℬ2 , . . . , 𝑤 ∈ ℬ𝑑 },
(4.43)
where ℬ𝑖 is any basis of V𝑖 , 𝑖 = 1, . . . , 𝑑. This is the only way to get a basis of
rank-one tensors on a tensor space. If ℬ1 ⊗ · · · ⊗ ℬ𝑑 is a basis for V1 ⊗ · · · ⊗ V𝑑 ,
then ℬ1 , . . . , ℬ𝑑 must be bases of V1 , . . . , V𝑑 , respectively. More generally, for
mixed tensor spaces, ℬ1 ⊗ · · · ⊗ ℬ𝑝 ⊗ ℬ∗𝑝+1 ⊗ · · · ⊗ ℬ𝑑∗ is a basis for V1 ⊗ · · · ⊗
V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 , where ℬ∗ denotes the dual basis of ℬ. There is no need to
assume finite dimension; these bases may well be uncountable. If the vector spaces
are finite-dimensional, then (4.43) implies that
dim(V1 ⊗ · · · ⊗ V𝑑 ) = dim(V1 ) · · · dim(V𝑑 ). (4.44)
Aside from (4.31), relation (4.44) is another reason the tensor product is the
proper operation for combining quantum states of different systems: if a quantum
system is made up of superpositions (i.e. linear combinations) of 𝑚 distinct (i.e.
linearly independent) states and another is made up of superpositions of 𝑛 distinct
states, then it is physically reasonable for the combined system to be made up
of superpositions of 𝑚𝑛 distinct states (Nielsen and Chuang 2000, p. 94). As a
result, we expect the combination of an 𝑚-dimensional quantum state space and an
𝑛-dimensional quantum state space to be an 𝑚𝑛-dimensional quantum state space,
and by (4.44), the tensor product fits the bill perfectly. This of course is hand-
waving and assumes finite-dimensionality. For a fully rigorous justification, we
refer readers to Aerts and Daubechies (1978, 1979a,b).
Note that there is no contradiction between (4.42) and (4.43): in the former the
objects are vector spaces; in the latter they are bases of vector spaces. The tensor
product symbol ⊗ is used in at least a dozen different ways, but fortunately there
is little cause for confusion as its meaning is almost always unambiguous from the
context.
One downside of Definition 4.4 in Section 4.1 is that it requires us to convert
everything into real-valued functions before we can discuss tensor products. What
we gain from the abstraction in Definition 4.9 is generality: the construction above
allows us to form tensor products of any objects as is, so long as they belong to
some vector space (or module).
Example 4.10 (concrete tensor products). As we mentioned on page 105, when
U, V, W are abstract vector spaces, there is no further ‘meaning’ one may attach to
⊗, just that it satisfies arithmetical properties such as (4.38). But with abstraction
comes generality: it also means that any associative product operation that satisfies
(4.38) may play the role of ⊗. As soon as we pick specific U, V, W – Euclidean
spaces, function spaces, distribution spaces, measure spaces, etc. – we may start
taking tensor products of these spaces with concrete formulas for ⊗. We will go
over the most common examples.
(i) Outer products of vectors 𝑎 = (𝑎1 , . . . , 𝑎 𝑚 ) ∈ R𝑚 , 𝑏 = (𝑏1 , . . . , 𝑏 𝑛 ) ∈ R𝑛 :
\[ a \otimes b \coloneqq ab^{\mathsf T} = \begin{bmatrix} a_1 b_1 & \cdots & a_1 b_n\\ \vdots & \ddots & \vdots\\ a_m b_1 & \cdots & a_m b_n \end{bmatrix} \in \mathbb{R}^{m \times n}. \]

(ii) Kronecker products of matrices 𝐴 ∈ R𝑚×𝑛 , 𝐵 ∈ R 𝑝×𝑞 :
\[ A \otimes B \coloneqq \begin{bmatrix} a_{11} B & \cdots & a_{1n} B\\ \vdots & \ddots & \vdots\\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix} \in \mathbb{R}^{mp \times nq}. \]

(iii) Separable products of functions 𝑓 : 𝑋 → R, 𝑔 : 𝑌 → R:
𝑓 ⊗ 𝑔 : 𝑋 × 𝑌 → R, (𝑥, 𝑦) ↦→ 𝑓 (𝑥)𝑔(𝑦).
(iv) Separable products of kernels 𝐾 : 𝑋 × 𝑋 ′ → R, 𝐻 : 𝑌 × 𝑌 ′ → R:
𝐾 ⊗ 𝐻 : (𝑋 × 𝑋 ′) × (𝑌 × 𝑌 ′) → R,
((𝑥, 𝑥 ′), (𝑦, 𝑦 ′)) ↦→ 𝐾(𝑥, 𝑥 ′)𝐻(𝑦, 𝑦 ′).
Hence these apparently different products of different objects may all be regarded
as tensor products. In particular, associativity (4.37) is satisfied and we may extend
these products to an arbitrary number of factors. Readers who recall Example 4.5
would see that (i) is a special case of (iii) with 𝑋 = [𝑚], 𝑌 = [𝑛], and (ii) is a
special case of (iv) with 𝑋 = [𝑚], 𝑋 ′ = [𝑛], 𝑌 = [ 𝑝], 𝑌 ′ = [𝑞]. Nevertheless, (i)
and (ii) may also be independently derived from Definition 4.9. More generally
(iii) recovers Definition 4.4.
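A short sketch of these identifications on random data: sampling $f \otimes g$ on $[m] \times [n]$ gives the outer product of the sample vectors, and sampling $K \otimes H$, after regrouping the index pairs and flattening them lexicographically as in Example 4.5, gives the Kronecker product of the sample matrices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
a, b = rng.standard_normal(3), rng.standard_normal(4)            # f on [3], g on [4]
A, B = rng.standard_normal((3, 4)), rng.standard_normal((2, 5))  # K on [3]x[4], H on [2]x[5]

# (i) as (iii): (f (x) g)(i, j) = f(i) g(j)
fg = np.array([[a[i] * b[j] for j in range(4)] for i in range(3)])
assert np.allclose(fg, np.outer(a, b))

# (ii) as (iv): (K (x) H)((i, i'), (j, j')) = K(i, i') H(j, j');
# regrouping the indices as ((i, j), (i', j')) and flattening lexicographically
# recovers the matrix Kronecker product
KH = np.einsum('ik,jl->ijkl', A, B).reshape(3 * 2, 4 * 5)
assert np.allclose(KH, np.kron(A, B))
\end{verbatim}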
The outer product in (i) comes from Definition 4.9. Take two vectors represented
in, say, Cartesian and spherical coordinates:
𝑢 = 𝑎1 𝑒 𝑥 + 𝑎2 𝑒 𝑦 + 𝑎3 𝑒 𝑧 , 𝑣 = 𝑏 1 𝑒𝑟 + 𝑏 2 𝑒 𝜃 + 𝑏 3 𝑒 𝜙 (4.45)
respectively. Consider the contravariant 2-tensor 𝑢 ⊗ 𝑣. It follows from (4.38) that
𝑢 ⊗ 𝑣 = 𝑎 1 𝑏 1 𝑒 𝑥 ⊗ 𝑒𝑟 + 𝑎 1 𝑏 2 𝑒 𝑥 ⊗ 𝑒 𝜃 + 𝑎 1 𝑏 3 𝑒 𝑥 ⊗ 𝑒 𝜙
+ 𝑎 2 𝑏 1 𝑒 𝑦 ⊗ 𝑒𝑟 + 𝑎 2 𝑏 2 𝑒 𝑦 ⊗ 𝑒 𝜃 + 𝑎 2 𝑏 3 𝑒 𝑦 ⊗ 𝑒 𝜙
+ 𝑎 3 𝑏 1 𝑒 𝑧 ⊗ 𝑒𝑟 + 𝑎 3 𝑏 2 𝑒 𝑧 ⊗ 𝑒 𝜃 + 𝑎 3 𝑏 3 𝑒 𝑧 ⊗ 𝑒 𝜙 .
The coefficients are captured by the outer product of their coefficients, that is,
\[ a \otimes b = \begin{bmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3\\ a_2 b_1 & a_2 b_2 & a_2 b_3\\ a_3 b_1 & a_3 b_2 & a_3 b_3 \end{bmatrix} = [u \otimes v]_{\mathcal{B}_1 \otimes \mathcal{B}_2}, \]
which is a (hyper)matrix representation of 𝑢 ⊗ 𝑣 in the tensor product basis
ℬ1 ⊗ ℬ2 ≔ {𝑒 𝑥 ⊗ 𝑒𝑟 , 𝑒 𝑥 ⊗ 𝑒 𝜃 , 𝑒 𝑥 ⊗ 𝑒 𝜙 , 𝑒 𝑦 ⊗ 𝑒𝑟 ,
𝑒 𝑦 ⊗ 𝑒 𝜃 , 𝑒 𝑦 ⊗ 𝑒 𝜙 , 𝑒 𝑧 ⊗ 𝑒𝑟 , 𝑒 𝑧 ⊗ 𝑒 𝜃 , 𝑒 𝑧 ⊗ 𝑒 𝜙 } (4.46)
of the two bases ℬ1 = {𝑒 𝑥 , 𝑒 𝑦 , 𝑒 𝑧 } and ℬ2 = {𝑒𝑟 , 𝑒 𝜃 , 𝑒 𝜙 }. If we have another pair
of vectors
𝑢 ′ = 𝑎1′ 𝑒 𝑥 + 𝑎2′ 𝑒 𝑦 + 𝑎3′ 𝑒 𝑧 , 𝑣 ′ = 𝑏1′ 𝑒𝑟 + 𝑏2′ 𝑒 𝜃 + 𝑏3′ 𝑒 𝜙 ,
then 𝑢 ⊗ 𝑣 + 𝑢 ′ ⊗ 𝑣 ′ , or more generally 𝜆𝑢 ⊗ 𝑣 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ′ for any scalars 𝜆, 𝜆 ′, obeys
[𝜆𝑢 ⊗ 𝑣 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ′] ℬ1 ⊗ℬ2 = 𝜆𝑎 ⊗ 𝑏 + 𝜆 ′ 𝑎 ′ ⊗ 𝑏 ′ .
In other words, the arithmetic of 2-tensors in V ⊗ W mirrors the arithmetic of rank-
one matrices in R𝑚×𝑛 , 𝑚 = dim W and 𝑛 = dim V. It is easy to see that this statement
extends to higher order, that is, the arithmetic of 𝑑-tensors in U ⊗ V ⊗ · · · ⊗ W
mirrors the arithmetic of rank-one 𝑑-hypermatrices in R𝑚×𝑛×···× 𝑝 :
\[ \biggl[ \sum_{i=1}^{r} \lambda_i\, u_i \otimes v_i \otimes \cdots \otimes w_i \biggr]_{\mathcal{B}_1 \otimes \mathcal{B}_2 \otimes \cdots \otimes \mathcal{B}_d} = \sum_{i=1}^{r} \lambda_i\, [u_i]_{\mathcal{B}_1} \otimes [v_i]_{\mathcal{B}_2} \otimes \cdots \otimes [w_i]_{\mathcal{B}_d}. \]
Note that the tensor products on the left of the equation are abstract tensor products
and those on the right are outer products.
A few words of caution are in order whenever hypermatrices are involved. Firstly,
hypermatrices are dependent on the bases: if we change 𝑢 in (4.45) to spherical
coordinates with the same numerical values,
𝑢 = 𝑎 1 𝑒𝑟 + 𝑎 2 𝑒 𝜃 + 𝑎 3 𝑒 𝜙 ,
then 𝜆𝑎 ⊗ 𝑏 + 𝜆 ′ 𝑎 ′ ⊗ 𝑏 ′ , perfectly well-defined as a (hyper)matrix, is a meaningless
thing to compute as we are adding coordinate representations in different bases.
Secondly, hypermatrices are oblivious to covariance and contravariance: if we
change 𝑢 in (4.45) to a dual vector with exactly the same coordinates,
𝑢 = 𝑎1 𝑒∗𝑥 + 𝑎2 𝑒∗𝑦 + 𝑎3 𝑒∗𝑧 ,
then again 𝜆𝑎 ⊗ 𝑏 + 𝜆 ′ 𝑎 ′ ⊗ 𝑏 ′ is meaningless as we are adding coordinate repres-
entations of incompatible objects in different vector spaces. Both are of course
Tensors in computations 115

obvious points but they are often obfuscated when higher-order tensors are brought
into the picture.
The Kronecker product above may also be viewed as another manifestation of
Definition 4.9, but a more fruitful way is to deduce it as a tensor product of
linear operators that naturally follows from Definition 4.9. This is important and
sufficiently interesting to warrant separate treatment.
Example 4.11 (Kronecker product). Given linear operators
Φ1 : V1 → W1 and Φ2 : V2 → W2 ,
forming the tensor products V1 ⊗ V2 and W1 ⊗ W2 automatically25 gives us a linear
operator
Φ1 ⊗ Φ2 : V1 ⊗ V2 → W1 ⊗ W2
defined on rank-one elements by
Φ1 ⊗ Φ2 (𝑣 1 ⊗ 𝑣 2 ) ≔ Φ1 (𝑣 1 ) ⊗ Φ2 (𝑣 2 )
and extended linearly to all elements of V1 ⊗ V2 , which are all finite linear com-
binations of rank-one elements. The way it is defined, Φ1 ⊗ Φ2 is clearly unique
and we will call it the Kronecker product of linear operators Φ1 and Φ2 . It extends
easily to an arbitrary number of linear operators via
Φ1 ⊗ · · · ⊗ Φ𝑑 (𝑣 1 ⊗ · · · ⊗ 𝑣 𝑑 ) ≔ Φ1 (𝑣 1 ) ⊗ · · · ⊗ Φ𝑑 (𝑣 𝑑 ), (4.47)
and the result obeys (4.40):
(𝜆Φ1 + 𝜆 ′Φ1′ ) ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 = 𝜆Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 + 𝜆 ′Φ1′ ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 ,
Φ1 ⊗ (𝜆Φ2 + 𝜆 ′Φ2′ ) ⊗ · · · ⊗ Φ𝑑 = 𝜆Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 + 𝜆 ′Φ1 ⊗ Φ2′ ⊗ · · · ⊗ Φ𝑑 ,
.. ..
. .
Φ1 ⊗ Φ2 ⊗ · · · ⊗ (𝜆Φ𝑑 + 𝜆 ′Φ𝑑′ ) = 𝜆Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑 + 𝜆 ′Φ1 ⊗ Φ2 ⊗ · · · ⊗ Φ𝑑′ .

In other words, the Kronecker product defines a tensor product in the sense of
Definition 4.9 on the space of linear operators,
L(V1 ; W1 ) ⊗ · · · ⊗ L(V𝑑 ; W𝑑 ),
that for finite-dimensional vector spaces equals
L(V1 ⊗ · · · ⊗ V𝑑 ; W1 ⊗ · · · ⊗ W𝑑 ).
In addition, as linear operators they may be composed and Moore–Penrose-inverted,
25 This is called functoriality in category theory. The fact that the tensor product of vector spaces
in Definition 4.9 gives a tensor product of linear operators on these spaces says that the tensor
product in Definition 4.9 is functorial.
and have adjoints and ranks, images and null spaces, all of which work in tandem
with the Kronecker product:
(Φ1 ⊗ · · · ⊗ Φ𝑑 )(Ψ1 ⊗ · · · ⊗ Ψ𝑑 ) = Φ1 Ψ1 ⊗ · · · ⊗ Φ𝑑 Ψ𝑑 , (4.48)
(Φ1 ⊗ · · · ⊗ Φ𝑑 )† = Φ1† ⊗ · · · ⊗ Φ𝑑† , (4.49)
(Φ1 ⊗ · · · ⊗ Φ𝑑 )∗ = Φ1∗ ⊗ · · · ⊗ Φ𝑑∗ , (4.50)
rank(Φ1 ⊗ · · · ⊗ Φ𝑑 ) = rank(Φ1 ) · · · rank(Φ𝑑 ), (4.51)
im(Φ1 ⊗ · · · ⊗ Φ𝑑 ) = im(Φ1 ) ⊗ · · · ⊗ im(Φ𝑑 ). (4.52)
Observe that the ⊗ on the left of (4.52) is the Kronecker product of operators
whereas that on the right is the tensor product of vector spaces. For null spaces,
we have
ker(Φ1 ⊗ Φ2 ) = ker(Φ1 ) ⊗ V2 + V1 ⊗ ker(Φ2 ),
when 𝑑 = 2 and more generally
ker(Φ1 ⊗ · · · ⊗ Φ𝑑 ) = ker(Φ1 ) ⊗ V2 ⊗ · · · ⊗ V𝑑 + V1 ⊗ ker(Φ2 ) ⊗ · · · ⊗ V𝑑
+ · · · + V1 ⊗ V2 ⊗ · · · ⊗ ker(Φ𝑑 ).
Therefore injectivity and surjectivity are preserved by taking Kronecker products.
If V𝑖 = W𝑖 , 𝑖 = 1, . . . , 𝑑, then the eigenpairs of Φ1 ⊗ · · · ⊗ Φ𝑑 are exactly those
given by
(𝜆 1 𝜆 2 · · · 𝜆 𝑑 , 𝑣 1 ⊗ 𝑣 2 ⊗ · · · ⊗ 𝑣 𝑑 ), (4.53)
where (𝜆 𝑖 , 𝑣 𝑖 ) is an eigenpair of Φ𝑖 , 𝑖 = 1, . . . , 𝑑. All of these are straightforward
consequences of (4.47) (Berberian 2014, Section 13.2).
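For matrices, the properties (4.48)–(4.53) are easy to check numerically; the sketch below verifies a few of them on random data, with pinv playing the role of the Moore–Penrose inverse and the eigenpair property checked for square factors.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((2, 5))
C, D = rng.standard_normal((4, 3)), rng.standard_normal((5, 2))

# (4.48): (A kron B)(C kron D) = AC kron BD
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# (4.49): the Moore-Penrose inverse factors
assert np.allclose(np.linalg.pinv(np.kron(A, B)),
                   np.kron(np.linalg.pinv(A), np.linalg.pinv(B)))

# (4.51): ranks multiply
assert np.linalg.matrix_rank(np.kron(A, B)) == \
       np.linalg.matrix_rank(A) * np.linalg.matrix_rank(B)

# (4.53): eigenpairs of a Kronecker product of square matrices
X, Y = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
lx, vx = np.linalg.eig(X)
ly, vy = np.linalg.eig(Y)
lam, v = lx[0] * ly[1], np.kron(vx[:, 0], vy[:, 1])
assert np.allclose(np.kron(X, Y) @ v, lam * v)
\end{verbatim}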
The Kronecker product of matrices simply expresses how the matrix representing
Φ1 ⊗ · · · ⊗ Φ𝑑 is related to those representing Φ1 , . . . , Φ𝑑 . If we pick bases ℬ𝑖 for
V𝑖 and 𝒞𝑖 for W𝑖 , then each linear operator Φ𝑖 has a matrix representation
[Φ𝑖 ] ℬ𝑖 ,𝒞𝑖 = 𝑋𝑖 ∈ R𝑚𝑖 ×𝑛𝑖
as in (3.5), 𝑖 = 1, . . . , 𝑑, and
[Φ1 ⊗ · · · ⊗ Φ𝑑 ] ℬ1 ⊗ ···⊗ℬ𝑑 , 𝒞1 ⊗ ···⊗𝒞𝑑 = 𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ∈ R𝑚1 𝑚2 ···𝑚𝑑 ×𝑛1 𝑛2 ···𝑛𝑑 .
Note that the tensor products on the left of the equation are Kronecker products as
defined in (4.47) and those on the right are matrix Kronecker products as defined
in Example 4.10(ii), applied 𝑑 times in any order (order does not matter as ⊗ is
associative). The tensor product of 𝑑 bases is as defined in (4.43).
Clearly the matrix Kronecker product inherits the properties (4.48)–(4.53), and
when 𝑚 𝑖 = 𝑛𝑖 , the eigenvalue property in particular gives
tr(𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ) = tr(𝑋1 ) · · · tr(𝑋𝑑 ), (4.54)
det(𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ) = det(𝑋1 ) 𝑝1 · · · det(𝑋𝑑 ) 𝑝𝑑 ,
where 𝑝 𝑖 = (𝑛1 𝑛2 · · · 𝑛𝑑 )/𝑛𝑖 , 𝑖 = 1, . . . , 𝑑. Incidentally these last two properties
also apply to Kronecker products of operators with coordinate independent defini-
tions of trace (see Example 4.28) and determinant (Berberian 2014, Section 7.2).
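The two identities in (4.54) can be checked the same way for a pair of square matrices (so that $p_1 = n_2$ and $p_2 = n_1$); the data below are arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
n1, n2 = 3, 4
X1, X2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))

K = np.kron(X1, X2)
assert np.isclose(np.trace(K), np.trace(X1) * np.trace(X2))
assert np.isclose(np.linalg.det(K),
                  np.linalg.det(X1)**n2 * np.linalg.det(X2)**n1)
\end{verbatim}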
Next we will see that the multilinear matrix multiplication in (2.6) and the matrix
Kronecker product in Example 4.10(ii) are intimately related.
Example 4.12 (multilinear matrix multiplication revisited). There is one caveat
to obtaining the neat formula for the matrix Kronecker product in Example 4.10(ii)
from the operator Kronecker product in Example 4.11, namely, we must choose the
lexicographic order for the tensor product bases ℬ1 ⊗ · · · ⊗ ℬ𝑑 and 𝒞1 ⊗ · · · ⊗ 𝒞𝑑 ,
recalling that ‘basis’ in this article always means ordered basis. This is the most
common and intuitive way to totally order a Cartesian product of totally ordered
sets. For our purposes, we just need to define it for [𝑛1 ] × [𝑛2 ] × · · · × [𝑛𝑑 ] since
our tensor product bases and hypermatrices are all indexed by this. Let
(𝑖1 , . . . , 𝑖 𝑑 ), ( 𝑗 1 , . . . , 𝑗 𝑑 ) ∈ [𝑛1 ] × · · · × [𝑛𝑑 ].
The lexicographic order ≺ is defined as
(
𝑖 1 = 𝑗 1 , . . . , 𝑖 𝑘−1 = 𝑗 𝑘−1 ,
(𝑖1 , 𝑖 2 , . . . , 𝑖 𝑑 ) ≺ ( 𝑗 1 , 𝑗 2 , . . . , 𝑗 𝑑 ) if and only if
𝑖 𝑘 < 𝑗 𝑘 for some 𝑘 ∈ [𝑑].
Take 𝑑 = 2, for instance. If ℬ1 = {𝑢1 , 𝑢2 }, ℬ2 = {𝑣 1 , 𝑣 2 , 𝑣 3 }, we have to order
ℬ1 ⊗ ℬ2 = {𝑢1 ⊗ 𝑣 1 , 𝑢1 ⊗ 𝑣 2 , 𝑢1 ⊗ 𝑣 3 , 𝑢2 ⊗ 𝑣 1 , 𝑢2 ⊗ 𝑣 2 , 𝑢2 ⊗ 𝑣 3 }
and not, say, {𝑢1 ⊗ 𝑣 1 , 𝑢2 ⊗ 𝑣 1 , 𝑢1 ⊗ 𝑣 2 , 𝑢2 ⊗ 𝑣 2 , 𝑢1 ⊗ 𝑣 3 , 𝑢2 ⊗ 𝑣 3 } ≕ 𝒜. While this is
a trifling point, the latter ordering is unfortunately what is used in the vec operation
common in mathematical software where one concatenates the 𝑛 columns of an
𝑚 × 𝑛 matrix into a vector of dimension 𝑚𝑛. The coordinate representation of the
2-tensor
𝑇 = 𝑎11 𝑢1 ⊗ 𝑣 1 + 𝑎12 𝑢1 ⊗ 𝑣 2 + 𝑎13 𝑢1 ⊗ 𝑣 3
+ 𝑎21 𝑢2 ⊗ 𝑣 1 + 𝑎22 𝑢2 ⊗ 𝑣 2 + 𝑎23 𝑢2 ⊗ 𝑣 3
in ℬ1 ⊗ ℬ2 (lexicographic) and 𝒜 (non-lexicographic) order are
𝑎11  𝑎11 
   
𝑎12  𝑎21 
    
𝑎13  𝑎12 

6 𝑎11 𝑎12 𝑎13
  
[𝑇 ] ℬ1 ⊗ℬ2 =   ∈ R , [𝑇 ] 𝒜 =   = vec  ∈ R6
𝑎
  21 𝑎
  22 𝑎 21 𝑎 22 𝑎 23
𝑎22  𝑎13 
   
𝑎23  𝑎23 
   
respectively, where vec denotes the usual ‘column-concatenated’ operation stacking
the columns of an 𝑚 × 𝑛 matrix to obtain a vector of dimension 𝑚𝑛. To see why
we favour lexicographic order, define the vector space isomorphism
vecℓ : R𝑛1 ×𝑛2 ×···×𝑛𝑑 → R𝑛1 𝑛2 ···𝑛𝑑
118 L.-H. Lim

that takes a 𝑑-hypermatrix 𝐴 = [𝑎𝑖1 𝑖2 ···𝑖𝑑 ] to a vector


vecℓ (𝐴) = (𝑎11···1 , 𝑎11···2 , . . . , 𝑎 𝑛1 𝑛2 ···𝑛𝑑 ) ∈ R𝑛1 𝑛2 ···𝑛𝑑 ,
whose coordinates are those in 𝐴 ordered lexicographically. For any matrices
𝑋1 ∈ R𝑚1 ×𝑛1 , . . . , 𝑋𝑑 ∈ R𝑚𝑑 ×𝑛𝑑 , it is routine to check that
vecℓ ((𝑋1 , . . . , 𝑋𝑑 ) · 𝐴) = (𝑋1 ⊗ · · · ⊗ 𝑋𝑑 ) vecℓ (𝐴). (4.55)
In other words, the lexicographic vecℓ operator takes multilinear matrix multi-
plication to the matrix Kronecker product. They are the same operation represen-
ted on different but isomorphic vector spaces R𝑛1 ×···×𝑛𝑑 and R𝑛1 ···𝑛𝑑 , respectively.
Incidentally, vecℓ is an example of a ‘forgetful map’: it forgets the hypermatrix
structure. This loss of information is manifested in that, given (𝑋1 , . . . , 𝑋𝑑 ), we
can compute 𝑋1 ⊗ · · · ⊗ 𝑋𝑑 but we cannot uniquely recover (𝑋1 , . . . , 𝑋𝑑 ) from their
matrix Kronecker product.
For 𝑑 = 2, vecℓ is just the ‘row-concatenated vec operation’, i.e. vecℓ (𝐴) =
vec(𝐴T ) for any 𝐴 ∈ R𝑚×𝑛 , and we have
vecℓ (𝑋 𝐴𝑌 T ) = (𝑋 ⊗ 𝑌 ) vecℓ (𝐴), vec(𝑋 𝐴𝑌 T ) = (𝑌 T ⊗ 𝑋) vec(𝐴).
The neater form for vecℓ , especially when extended to higher order, is why we
prefer lexicographic ordering.
As we discussed in Example 2.7, there is no formula that extends the usual
matrix–matrix multiplication to 3-hypermatrices or indeed any 𝑑-hypermatrices
for odd 𝑑 ∈ N. Such a formula, however, exists for any even 𝑑, thanks to the
Kronecker product.
Example 4.13 (‘multiplying higher-order tensors’ revisited). Consider the fol-
lowing 2𝑑-hypermatrices:
𝐴 ∈ R𝑚1 ×𝑛1 ×𝑚2 ×𝑛2 ×···×𝑚𝑑 ×𝑛𝑑 , 𝐵 ∈ R𝑛1 × 𝑝1 ×𝑛2 × 𝑝2 ×···×𝑛𝑑 × 𝑝𝑑 .
Their product, denoted 𝐴𝐵 to be consistent with the usual matrix–matrix product,
is the 2𝑑-hypermatrix 𝐶 ∈ R𝑚1 × 𝑝1 ×𝑚2 × 𝑝2 ×···×𝑚𝑑 × 𝑝𝑑 , where
Õ
𝑛1 Õ
𝑛𝑑
𝑐𝑖1 𝑘1 ···𝑖𝑑 𝑘𝑑 ≔ ··· 𝑎𝑖1 𝑗1 ···𝑖𝑑 𝑗𝑑 𝑏 𝑗1 𝑘1 ··· 𝑗𝑑 𝑘𝑑 .
𝑗1 =1 𝑗𝑑 =1

This clearly extends the matrix–matrix product, which is just the case when 𝑑 = 1.
We have in fact already encountered Example 4.11 in a different form: this is
(4.48), the composition of two Kronecker products of 𝑑 linear operators. To view
it in hypermatrix form, we just use the fact that L(V; W) = V∗ ⊗ W. Indeed, we
may also view it as a tensor contraction of
𝑇 ∈ (U∗1 ⊗ V1 ) ⊗ · · · ⊗ (U∗𝑑 ⊗ V𝑑 ), 𝑇 ′ ∈ (V∗1 ⊗ W1 ) ⊗ · · · ⊗ (V∗𝑑 ⊗ W𝑑 ),
from which we obtain
h𝑇 , 𝑇 ′i ∈ (U∗1 ⊗ W1 ) ⊗ · · · ⊗ (U∗𝑑 ⊗ W𝑑 ).
Tensors in computations 119

This is entirely within expectation as the first fundamental theorem of invariant the-
ory, the result that prevents the existence of a 𝑑-hypermatrix–hypermatrix product
for odd 𝑑, also tells us that essentially the only way to multiply tensors without
increasing order is via contractions. This product is well-defined for 2𝑑-tensors
and does not depend on bases: in terms of Kronecker products of matrices,
((𝑋1 𝐴1𝑌1−1 ) ⊗ · · · ⊗ (𝑋𝑑 𝐴𝑑𝑌𝑑−1 ))((𝑌1 𝐵1 𝑍1−1 ) ⊗ · · · ⊗ (𝑌𝑑 𝐵 𝑑 𝑍 𝑑−1 ))
= (𝑋1 𝐴1 𝐵1 𝑍1−1 ) ⊗ · · · ⊗ (𝑋𝑑 𝐴𝑑 𝐵 𝑑 𝑍 𝑑−1 ),
and in terms of hypermatrix–hypermatrix products,
((𝑋1 , 𝑌1−1 , . . . , 𝑋𝑑 , 𝑌𝑑−1 ) · 𝐴)((𝑌1 , 𝑍1−1 , . . . , 𝑌𝑑 , 𝑍 𝑑−1 ) · 𝐵)
= (𝑋1 , 𝑍1−1 , . . . , 𝑋𝑑 , 𝑍 𝑑−1 ) · (𝐴𝐵),
that is, they satisfy the higher-order analogue of (2.2). Frankly, we do not see any
advantage in formulating such a product as 2𝑑-hypermatrices, but an abundance of
disadvantages.
Just as it is possible to deduce the tensor transformation rules in definition ➀ from
definition ➁, we can do likewise with definition ➂ in the form of Definition 4.9.
Example 4.14 (tensor transformation rules revisited). As stated on page 110,
the tensor product construction in this section does not require that U, V, . . . , W
have bases, that is, they could be modules, which do not have bases in general (those
with bases are called free modules). Nevertheless, in the event where they do have
bases, their change-of-basis theorems would lead us directly to the transformation
rules discussed in Section 2.
We first remind the reader of a simple notion, discussed in standard linear
algebra textbooks such as Berberian (2014, Section 3.9) and Friedberg et al. (2003,
Section 2.6) but often overlooked. Any linear operator Φ : V → W induces a
transpose linear operator on the dual spaces defined as
ΦT : W∗ → V∗ , Φ(𝜑) ≔ 𝜑 ◦ Φ
for any linear functional 𝜑 ∈ W∗ . Note that the composition 𝜑 ◦ Φ : V → R is
indeed a linear functional in V∗ . The reason for its name is that
[Φ] ℬ,𝒞 = 𝐴 ∈ R𝑚×𝑛 if and only if [ΦT ] 𝒞∗ ,ℬ∗ = 𝐴T ∈ R𝑛×𝑚 .
One may show that Φ is injective if and only if ΦT is subjective, and Φ is surjective if
and only if ΦT is injective.26 So if Φ : V → W is invertible, then so is ΦT : W∗ → V∗
and its inverse is a linear operator,
Φ−T : V∗ → W∗ .
Another name for ‘invertible linear operator’ is vector space isomorphism, es-
pecially when used in the following context. Any basis ℬ = {𝑣 1 , . . . , 𝑣 𝑛 }
26 Not a typo. Injectivity and surjectivity are in fact dual notions in this sense.
120 L.-H. Lim

of V gives us two vector space isomorphisms, namely (i) Φℬ : V → R𝑛 that


takes 𝑣 = 𝑎1 𝑣 1 + · · · + 𝑎 𝑛 𝑣 𝑛 ∈ V to its coordinate representation in ℬ, and
(ii) Φℬ∗ : V∗ → R𝑛 that takes 𝜑 = 𝑏1 𝑣 ∗1 + · · · + 𝑏 𝑛 𝑣 ∗𝑛 ∈ V∗ to its coordinate
representation in ℬ∗ :
 𝑎1   𝑏1 
   
   
Φℬ (𝑣) =  ...  ∈ R𝑛 , Φℬ∗ (𝜑) =  ...  ∈ R𝑛 , (4.56)
   
𝑎 𝑛  𝑏 𝑛 
   
and as expected
Φℬ∗ = Φ−ℬT .
This is of course just paraphrasing what we have already discussed in (3.2) and
thereabouts. Note that we do not distinguish between R𝑛 and its dual space (R𝑛 )∗ .
Given vector spaces V1 , . . . , V𝑑 and W1 , . . . , W𝑑 and invertible linear oper-
ators Φ𝑖 : V𝑖 → W𝑖 , 𝑖 = 1, . . . , 𝑑, take the Kronecker product of Φ1 , . . . , Φ 𝑝 ,
Φ−𝑝+1
T
, . . . , Φ−𝑑T as defined in (4.47) of the previous example. Then the linear
operator
Φ1 ⊗ · · · ⊗ Φ 𝑝 ⊗ Φ−𝑝+1
T
⊗ · · · ⊗ Φ−𝑑T : V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑
→ W1 ⊗ · · · ⊗ W 𝑝 ⊗ W∗𝑝+1 ⊗ · · · ⊗ W∗𝑑
must also be invertible by (4.49).
For each 𝑖 = 1, . . . , 𝑑, let ℬ𝑖 and ℬ𝑖′ be two different bases of V𝑖 ; let W𝑖 = R𝑛𝑖
with 𝑛𝑖 = dim V𝑖 and
Φ𝑖 ≔ Φℬ𝑖 , Ψ𝑖 ≔ Φℬ𝑖′
as in (4.56). Since we assume R𝑛 = (R𝑛 )∗ , we have
W1 ⊗ · · · ⊗ W 𝑝 ⊗ W∗𝑝+1 ⊗ · · · ⊗ W∗𝑑  R𝑛1 ⊗ · · · ⊗ R𝑛𝑑  R𝑛1 ×···×𝑛𝑑 , (4.57)
where the last  is a consequence of using Example 4.10(i). With these choices,
the vector space isomorphisms
Φ1 ⊗ · · · ⊗ Φ 𝑝 ⊗ Φ−𝑝+1
T
⊗ · · · ⊗ Φ−𝑑T :
V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 → R𝑛1 ×···×𝑛𝑑 ,
Ψ1 ⊗ · · · ⊗ Ψ 𝑝 ⊗ Ψ−𝑝+1
T
⊗ · · · ⊗ Ψ−𝑑T :
V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 → R𝑛1 ×···×𝑛𝑑
give us hypermatrix representations of 𝑑-tensors with respect to ℬ1 , . . . , ℬ𝑑 and
ℬ1′ , . . . , ℬ𝑑′ respectively. Let 𝑇 ∈ V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 and

Φ1 ⊗ · · · ⊗ Φ 𝑝 ⊗ Φ−𝑝+1
T
⊗ · · · ⊗ Φ−𝑑T (𝑇 ) = 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 ,
Ψ1 ⊗ · · · ⊗ Ψ 𝑝 ⊗ Ψ−𝑝+1
T
⊗ · · · ⊗ Ψ−𝑑T (𝑇 ) = 𝐴 ′ ∈ R𝑛1 ×···×𝑛𝑑 .
Tensors in computations 121

Then the relation between the hypermatrices 𝐴 and 𝐴 ′ is given by

𝐴 = Φ1 Ψ−1 −1 −1 −T −1 −T ′
1 ⊗ · · · ⊗ Φ 𝑝 Ψ 𝑝 ⊗ (Φ 𝑝+1 Ψ 𝑝+1 ) ⊗ · · · ⊗ (Φ𝑑 Ψ 𝑑 ) (𝐴 ),

where we have used (4.48) and (4.49). Note that each Φ𝑖 Ψ−1𝑖 : R
𝑛𝑖 → R𝑛𝑖 is an
−1
invertible linear operator and so it must be given by Φ𝑖 Ψ𝑖 (𝑣) = 𝑋𝑖 𝑣 for some
𝑋𝑖 ∈ GL(𝑛𝑖 ). Using (4.55), the above relation between 𝐴 and 𝐴 ′ in terms of
multilinear matrix multiplication is just

𝐴 = (𝑋1 , . . . , 𝑋 𝑝 , 𝑋 −𝑝+1
T
, . . . , 𝑋𝑑−T ) · 𝐴 ′,

which gives us the tensor transformation rule (2.9). To get the isomorphism in
(4.57), we had to identify R𝑛 with (R𝑛 )∗ , and this is where we lost all information
pertaining to covariance and contravariance; a hypermatrix cannot perfectly rep-
resent a tensor, which is one reason why we need to look at the transformation rules
to ascertain the tensor.

The example above essentially discusses change of basis without any mention
of bases. In pure mathematics, sweeping such details under the rug is generally
regarded as a good thing; in applied and computational mathematics, such details
are usually unavoidable and we will work them out below.

Example 4.15 (representing tensors as hypermatrices). Let ℬ1 = {𝑢1 , . . . , 𝑢 𝑚 },


ℬ2 = {𝑣 1 , . . . , 𝑣 𝑛 }, . . . , ℬ𝑑 = {𝑤 1 , . . . , 𝑤 𝑝 } be any bases of finite-dimensional
vector spaces U, V, . . . , W. Then any 𝑑-tensor 𝑇 ∈ U ⊗ V ⊗ · · · ⊗ W may be
expressed as
𝑚 Õ
Õ 𝑛 𝑝
Õ
𝑇= ··· 𝑎𝑖 𝑗 ···𝑘 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘 , (4.58)
𝑖=1 𝑗=1 𝑘=1

and the 𝑑-hypermatrix of coefficients

[𝑇 ] ℬ1 ,ℬ2 ,...,ℬ𝑑 ≔ [𝑎𝑖 𝑗 ···𝑘 ] 𝑖,𝑚,𝑛,... ,𝑝


𝑗,... ,𝑘=1 = 𝐴 ∈ R
𝑚×𝑛×···× 𝑝

is said to represent or to be a coordinate representation of 𝑇 with respect to


ℬ1 , ℬ2 , . . . , ℬ𝑑 . That (4.58) always holds is easily seen as follows. First it holds
for a rank-one tensor 𝑇 = 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 as we may express 𝑢, 𝑣, . . . , 𝑤 as a linear
combination of basis vectors and apply (4.40) to get

(𝑎1 𝑢1 + · · · + 𝑎 𝑚 𝑢 𝑚 ) ⊗ (𝑏1 𝑣 1 + · · · + 𝑏 𝑛 𝑣 𝑛 ) ⊗ · · · ⊗ (𝑐1 𝑤 1 + · · · + 𝑐 𝑝 𝑤 𝑝 )


Õ𝑚 Õ 𝑛 Õ 𝑝
= ··· 𝑎𝑖 𝑏 𝑗 · · · 𝑐 𝑘 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘 , (4.59)
𝑖=1 𝑗=1 𝑘=1

which has the form in (4.58). Since an arbitrary 𝑇 is simply a finite sum of rank-one
122 L.-H. Lim

tensors as in (4.39), the distributivity in (4.40) again ensures that


𝑚 Õ
Õ 𝑛 𝑝
Õ
𝜆 ··· 𝑎𝑖 𝑗 ···𝑘 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘
𝑖=1 𝑗=1 𝑘=1
ÕÕ
𝑚 𝑛 𝑝
Õ

+𝜆 ··· 𝑏𝑖 𝑗 ···𝑘 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘
𝑖=1 𝑗=1 𝑘=1
Õ
𝑚 Õ
𝑛 Õ 𝑝
= ··· (𝜆𝑎𝑖 𝑗 ···𝑘 + 𝜆 ′ 𝑏𝑖 𝑗 ···𝑘 ) 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘 . (4.60)
𝑖=1 𝑗=1 𝑘=1

By (4.59) and (4.60), any 𝑇 must be expressible in the form (4.58). For a different
choice of bases ℬ1′ , ℬ2′ , . . . , ℬ𝑑′ we obtain a different hypermatrix representation
𝐴 ′ ∈ R𝑚×𝑛×···× 𝑝 for 𝑇 that is related to 𝐴 by a multilinear matrix multiplication as
in Example 4.14.
Note that (4.59) and (4.60), respectively, give the formulas for outer product and
linear combination:

(𝑎1 , . . . , 𝑎 𝑚 ) ⊗ (𝑏1 , . . . , 𝑏 𝑛 ) ⊗ · · · ⊗ (𝑐1 , . . . , 𝑐 𝑝 ) = [𝑎𝑖 𝑏 𝑗 · · · 𝑐 𝑘 ],


𝜆[𝑎𝑖 𝑗 ···𝑘 ] + 𝜆 ′ [𝑏𝑖 𝑗 ···𝑘 ] = [𝜆𝑎𝑖 𝑗 ···𝑘 + 𝜆 ′ 𝑏𝑖 𝑗 ···𝑘 ].

These also follow from our definition of hypermatrices as real-valued functions in


Example 4.5.
If 𝑇 ′ ∈ U ′ ⊗ V ′ ⊗ · · · ⊗ W ′ is a 𝑑 ′ -tensor, expressed as in (4.58) as

𝑚′ Õ
Õ 𝑛′ 𝑝
Õ

𝑇 = ··· 𝑏𝑖′ 𝑗′ ···𝑘 ′ 𝑢𝑖′′ ⊗ 𝑣 ′𝑗′ ⊗ · · · ⊗ 𝑤 ′𝑘 ′ ,
𝑖′ =1 𝑗 ′ =1 𝑘 ′ =1

with bases of U ′, V ′, . . . , W ′

ℬ1′ = {𝑢1′ , . . . , 𝑢 𝑚

′ }, ℬ2′ = {𝑣 1′ , . . . , 𝑣 𝑛′ ′ }, . . . , ℬ𝑑′ ′ = {𝑤 1′ , . . . , 𝑤 ′𝑝′ },

then the outer product of 𝑇 and 𝑇 ′ in (4.41) may be expressed as


𝑚,𝑛..., 𝑝,𝑚′ ′ ′
Õ,𝑛 ..., 𝑝

𝑇 ⊗𝑇 = 𝑎𝑖 𝑗 ···𝑘 𝑏𝑖′ 𝑗′ ···𝑘 ′ 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘 ⊗ 𝑢𝑖′′ ⊗ 𝑣 ′𝑗′ ⊗ · · · ⊗ 𝑤 ′𝑘 ′ ,
𝑖, 𝑗,... ,𝑘,𝑖′ , 𝑗 ′ ,...,𝑘 ′ =1

which is a (𝑑 + 𝑑 ′)-tensor. In terms of their hypermatrix representations,


′ ′ ′ ′ ′ ′
𝐴 ⊗ 𝐵 = [𝑎𝑖 𝑗 ···𝑘 𝑏𝑖′ 𝑗′ ···𝑘 ′ ] 𝑖,𝑚,𝑛..., 𝑝,𝑚 ,𝑛 ..., 𝑝
𝑗,... ,𝑘,𝑖′ , 𝑗 ′ ,...,𝑘 ′ =1 ∈ R
𝑚×𝑛×···× 𝑝×𝑚 ×𝑛 ×···× 𝑝
.

It is straightforward to show that with respect to the Kronecker product or multi-


Tensors in computations 123

linear matrix multiplication,


[Φ1 ⊗ · · · ⊗ Φ𝑑 (𝑇 )] ⊗ [Ψ1 ⊗ · · · ⊗ Ψ𝑑′ (𝑇 ′)]
= Φ1 ⊗ · · · ⊗ Φ𝑑 ⊗ Ψ1 ⊗ · · · ⊗ Ψ𝑑′ (𝑇 ⊗ 𝑇 ′),
[(𝑋1 , . . . , 𝑋𝑑 ) · 𝐴] ⊗ [(𝑌1 , . . . , 𝑌𝑑′ ) · 𝐵]
= (𝑋1 , . . . , 𝑋𝑑 , 𝑌1 , . . . , 𝑌𝑑′ ) · 𝐴 ⊗ 𝐵,
with notation as in Examples 4.11, 4.12 and 4.14.
The above discussion extends verbatim to covariant and mixed tensors by repla-
cing some or all of these vector spaces with their duals and bases with dual bases.
It also extends to separable Hilbert spaces with orthonormal bases ℬ1 , . . . , ℬ𝑑
and 𝐴 ∈ 𝑙 2 (N𝑑 ) an infinite-dimensional hypermatrix as discussed in Example 4.5.
Nevertheless we do not recommend working exclusively with such hypermatrices.
The reason is simple: there is a better representation of the tensor 𝑇 sitting in plain
sight, namely (4.58), which contains not just all information in the hypermatrix 𝐴
but also the bases ℬ1 , . . . , ℬ𝑑 . Indeed, the representation of a tensor in (4.58),
coupled with the arithmetic rules for manipulating ⊗ like (4.40), is more or less the
standard method for working with tensors when bases are involved. Take 𝑑 = 2 for
illustration. We can tell at a glance from
Õ
𝑚 Õ
𝑛 Õ
𝑚 Õ
𝑛
𝛽= 𝑎𝑖 𝑗 𝑢∗𝑖 ⊗ 𝑣 ∗𝑗 , Φ= 𝑎𝑖 𝑗 𝑢𝑖 ⊗ 𝑣 ∗𝑗
𝑖=1 𝑗=1 𝑖=1 𝑗=1

that 𝛽 is a bilinear functional and Φ a linear operator (the dyad case is discussed
in Example 4.8), information that is lost if one just looks at the matrix 𝐴 = [𝑎𝑖 𝑗 ].
Furthermore, the 𝑢𝑖 and 𝑣 𝑗 are sometimes more important than the 𝑎𝑖 𝑗 . We might
have

𝛽 = 𝜀 −1/√3 ⊗ 𝜀 −1/√3 + 𝜀 −1/√3 ⊗ 𝜀 1/√3 + 𝜀 1/√3 ⊗ 𝜀 −1/√3 + 𝜀 1/√3 ⊗ 𝜀 1/√3 , (4.61)

where the 𝑎𝑖 𝑗 are all ones, and all the important information is in the basis vectors,
which are the point evaluation functionals we encountered in Example 4.6, that is,
𝜀 𝑥 ( 𝑓 ) = 𝑓 (𝑥) for any 𝑓 : [−1, 1] → R. The decomposition (4.61) is in fact a
four-point Gauss quadrature formula on the domain [−1, 1] × [−1, 1]. We will
revisit this in Example 4.48.

We next discuss tensor rank in light of Definition 4.9.

Example 4.16 (multilinear rank and tensor rank). For any 𝑑 vector spaces U, V,
. . . , W and any subspaces U ′ ⊆ U, V ′ ⊆ V, . . . , W ′ ⊆ W, it follows from (4.39)
that
U ′ ⊗ V ′ ⊗ · · · ⊗ W ′ ⊆ U ⊗ V ⊗ · · · ⊗ W.

Given a non-zero 𝑑-tensor 𝑇 ∈ U ⊗ V ⊗ · · · ⊗ W, one might ask for a smallest tensor


124 L.-H. Lim

subspace that contains 𝑇 , that is,


𝜇rank(𝑇 ) ≔ min{(dim U ′, dim V ′, . . . , dim W ′) ∈ N𝑑 : (4.62)
′ ′ ′ ′ ′ ′
𝑇 ∈ U ⊗ V ⊗ · · · ⊗ W , U ⊆ U, V ⊆ V, . . . , W ⊆ W}.
This gives the multilinear rank of 𝑇 , first studied in Hitchcock (1927, The-
orem II). Let U ′, V ′, . . . , W ′ be subspaces that attain the minimum in (4.62) and
let {𝑢1 , . . . , 𝑢 𝑝 }, {𝑣 1 , . . . , 𝑣 𝑞 }, . . . , {𝑤 1 , . . . , 𝑤 𝑟 } be any respective bases. Then,
by (4.58), we get a multilinear rank decomposition (Hitchcock 1927, Corollary I):
𝑝 Õ
Õ 𝑞 Õ
𝑟
𝑇= ··· 𝑐𝑖 𝑗 ···𝑘 𝑢𝑖 ⊗ 𝑣 𝑗 ⊗ · · · ⊗ 𝑤 𝑘 . (4.63)
𝑖=1 𝑗=1 𝑘=1

For a given 𝑇 , the vectors on the right of (4.63) are not unique but the subspaces
U ′, V ′, . . . , W ′ that they span are unique, and thus 𝜇rank(𝑇 ) is well-defined.27 One
may view these subspaces as generalizations of the row and column spaces of a
matrix and multilinear rank as a generalization of the row and column ranks. Note
that the coefficient hypermatrix 𝐶 = [𝑐𝑖 𝑗 ···𝑘 ] ∈ R 𝑝×𝑞×···×𝑟 is not unique either,
but two such coefficient hypermatrices 𝐶 and 𝐶 ′ must be related by the tensor
transformation rule (2.9), as we saw in Example 4.15.
The decomposition in (4.63) is a sum of rank-one tensors, and if we simply use
this as impetus and define
 Õ𝑟
rank(𝑇 ) ≔ min 𝑟 ∈ N : 𝑇 = 𝑢𝑖 ⊗ 𝑣 𝑖 ⊗ · · · ⊗ 𝑤 𝑖 , 𝑢1 , . . . , 𝑢𝑟 ∈ U,
𝑖=1

𝑣 1 , . . . , 𝑣𝑟 ∈ V, . . . , 𝑤 1 , . . . , 𝑤 𝑟 ∈ W ,

we obtain tensor rank (Hitchcock 1927, equations 2 and 2𝑎 ), which we have


encountered in a more restricted form in (3.37). For completeness, we define
𝜇rank(𝑇 ) ≔ (0, 0, . . . , 0) ⇔ rank(𝑇 ) ≔ 0 ⇔ 𝑇 = 0.
Note that
𝜇rank(𝑇 ) = (1, 1, . . . , 1) ⇔ rank(𝑇 ) = 1 ⇔ 𝑇 = 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤,
that is, a rank-one tensor has rank one irrespective of the rank we use. Also, we
did not exclude infinite-dimensional vector spaces from our discussion. Note that
by construction U ⊗ V ⊗ · · · ⊗ W comprises finite linear combinations of rank-one
tensors, and so by its definition, U ⊗ V ⊗ · · · ⊗ W is the set of all tensors of finite
rank. Furthermore, whether or not a tensor has finite rank is independent of the
rank we use.

27 While N𝑑 is only partially ordered, this shows that the minimum in (4.62) is unique.
Tensors in computations 125

Vector spaces become enormously more interesting when equipped with in-
ner products and norms; it is the same with tensor spaces. The tensor product
construction leading to Definition 4.9 allows us to incorporate them easily.

Example 4.17 (tensor products of inner product and norm spaces). We let
U, V, . . . , W be 𝑑 vector spaces equipped with either norms
k · k 1 : U → [0, ∞), k · k 2 : V → [0, ∞), . . . , k · k 𝑑 : W → [0, ∞)
or inner products
h · , · i1 : U × U → R, h · , · i2 : V × V → R, . . . , h · , · i𝑑 : W × W → R.
We would like to define a norm or an inner product on the tensor space U⊗V⊗· · ·⊗W
as defined in (4.39). Note that we do not assume that these vector spaces are finite-
dimensional.
The inner product is easy. First define it on rank-one tensors,
h𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤, 𝑢 ′ ⊗ 𝑣 ′ ⊗ · · · ⊗ 𝑤 ′i ≔ h𝑢, 𝑢 ′i1 h𝑣, 𝑣 ′i2 · · · h𝑤, 𝑤 ′i𝑑 (4.64)
for all 𝑢, 𝑢 ′ ∈ U, 𝑣, 𝑣 ′ ∈ V, . . . , 𝑤, 𝑤 ′ ∈ W. Then extend it bilinearly, that is, decree
that for rank-one tensors 𝑆, 𝑇 , 𝑆 ′, 𝑇 ′ and scalars 𝜆, 𝜆 ′,
h𝜆𝑆 + 𝜆 ′ 𝑆 ′, 𝑇 i ≔ 𝜆h𝑆, 𝑇 i + 𝜆 ′ h𝑆 ′, 𝑇 i,
h𝑆, 𝜆𝑇 + 𝜆 ′𝑇 ′i ≔ 𝜆h𝑆, 𝑇 i + 𝜆 ′ h𝑆, 𝑇 ′i.
As U ⊗ V ⊗ · · · ⊗ W comprises finite linear combinations of rank-one tensors
by definition, this defines h · , · i on the whole tensor space. It is then routine to
check that the axioms of inner product are satisfied by h · , · i. This construction
applies verbatim to other bilinear functionals such as the Lorentzian scalar product
in Example 2.4.
Norms are slightly trickier because there are multiple ways to define them on
tensor spaces. We will restrict ourselves to the three most common constructions.
If we have an inner product as above, then the norm induced by the inner product,
p
k𝑇 k F ≔ h𝑆, 𝑇 i,
is called the Hilbert–Schmidt norm. Note that by virtue of (4.64), the Hilbert–
Schmidt norm would satisfy
k𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤 k = k𝑢k 1 k𝑣k 2 · · · k𝑤 k 𝑑 , (4.65)
a property that we will require of all tensor norms. As we mentioned in Example 4.1,
such norms are called cross-norms. We have to extend (4.65) so that k · k is defined
on not just rank-one tensors but on all finite sums of these. Naively we might just
define it to be the sum of the norms of its rank-one summands, but that does not
work as we get different values with different sums of rank-one tensors. Fortunately,
126 L.-H. Lim

taking the infimum over all possible finite sums



𝑟 Õ
𝑟 
k𝑇 k 𝜈 ≔ inf k𝑢𝑖 k 1 k𝑣 𝑖 k 2 · · · k𝑤 𝑖 k 𝑑 : 𝑇 = 𝑢𝑖 ⊗ 𝑣 𝑖 ⊗ · · · ⊗ 𝑤 𝑖 (4.66)
𝑖=1 𝑖=1

fixes the issue and gives us a norm. This is the nuclear norm, special cases of
which we have encountered in (3.45) and in Examples 3.11 and 4.1.
There is an alternative. We may also take the supremum over all rank-one tensors
in the dual tensor space:
 
|𝜑 ⊗ 𝜓 ⊗ · · · ⊗ 𝜃(𝑇 )| ∗ ∗ ∗
k𝑇 k 𝜎 ≔ sup : 𝜑 ∈ U , 𝜓 ∈ V , . . . , 𝜃 ∈ W , (4.67)
k𝜑k 1∗ k𝜓 k 2∗ · · · k𝜃 k 𝑑∗
where k · k 𝑖∗ is the dual norm of k · k 𝑖 as defined by (3.44), 𝑖 = 1, . . . , 𝑑. This is the
spectral norm, special cases of which we have encountered in (3.19) and (3.46) and
in Example 3.17. There are many more cross-norms (Diestel et al. 2008, Chapter 4)
but these three are the best known. The nuclear and spectral norms are also special
in that any cross-norm k · k must satisfy
k𝑇 k 𝜎 ≤ k𝑇 k ≤ k𝑇 k 𝜈 for all 𝑇 ∈ U ⊗ V ⊗ · · · ⊗ W; (4.68)
and conversely any norm k · k that satisfies (4.68) must be a cross-norm (Ryan
2002, Proposition 6.1). In this sense, the nuclear and spectral norms, respectively,
are the largest and smallest cross-norms. If V is a Hilbert space, then by the Riesz
representation theorem, a linear functional 𝜑 : V → R takes the form 𝜑 = h𝑣, ·i for
some 𝑣 ∈ V and the Hilbert space norm on V and its dual norm may be identified.
Thus, if U, V, . . . , W are Hilbert spaces, then the spectral norm in (4.67) takes the
form
 
|h𝑇 , 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤i|
k𝑇 k 𝜎 = sup : 𝑢 ∈ U, 𝑣 ∈ V, . . . , 𝑤 ∈ W . (4.69)
k𝑢k 1 k𝑣k 2 · · · k𝑤 k 𝑑
The definition of inner products on tensor spaces fits perfectly with other tensorial
notions such as the tensor product basis and the Kronecker product discussed in
Example 4.11. If ℬ1 , ℬ2 , . . . , ℬ𝑑 are orthonormal bases for U, V, . . . , W, then the
tensor product basis ℬ1 ⊗ ℬ2 ⊗ · · · ⊗ ℬ𝑑 as defined in (4.43) is an orthonormal basis
for U⊗V⊗· · ·⊗W. Here we have no need to assume finite dimension or separability:
these orthonormal bases may be uncountable. If Φ1 : U → U, . . . , Φ𝑑 : W → W
are orthogonal linear operators in the sense that
hΦ1 (𝑢), Φ1 (𝑢 ′)i1 = h𝑢, 𝑢 ′i1 , . . . , hΦ𝑑 (𝑤), Φ𝑑 (𝑤 ′)i𝑑 = h𝑤, 𝑤 ′i𝑑
for all 𝑢, 𝑢 ′ ∈ U, . . . , 𝑤, 𝑤 ′ ∈ W, then the Kronecker product Φ1 ⊗ · · · ⊗ Φ𝑑 : U ⊗
· · · ⊗ W → U ⊗ · · · ⊗ W is an orthogonal linear operator, that is,
hΦ1 ⊗ · · · ⊗ Φ𝑑 (𝑆), Φ1 ⊗ · · · ⊗ Φ𝑑 (𝑇 )i = h𝑆, 𝑇 i
for all 𝑆, 𝑇 ∈ U ⊗ · · · ⊗ W. For Kronecker products of operators defined on norm
spaces, we will defer the discussion to the next example.
Tensors in computations 127

Let U, V, . . . , W be finite-dimensional inner product spaces. Let ℬ1 , ℬ2 , . . . , ℬ𝑑


be orthonormal bases. Then, as we saw in Example 4.15, a 𝑑-tensor in 𝑇 ∈
U ⊗ V ⊗ · · · ⊗ W may be represented by a 𝑑-hypermatrix 𝐴 ∈ R𝑚×𝑛×···× 𝑝 , in which
case the expressions for norms and inner product may be given in terms of 𝐴:
Õ 𝑚 Õ 𝑛 Õ𝑝 1/2 Õ
𝑚 Õ
𝑛 𝑝
Õ
2
k 𝐴k F = ··· |𝑎𝑖 𝑗 ···𝑘 | , h𝐴, 𝐵i = ··· 𝑎𝑖 𝑗 ···𝑘 𝑏𝑖 𝑗 ···𝑘 ,
𝑖=1 𝑗=1 𝑘=1 𝑖=1 𝑗=1 𝑘=1
Õ𝑟 Õ
𝑟 
k 𝐴k 𝜈 = inf |𝜆 𝑖 | : 𝐴 = 𝜆 𝑖 𝑢𝑖 ⊗ 𝑣 𝑖 ⊗ · · · ⊗ 𝑤 𝑖 , k𝑢k = k𝑣k = · · · = k𝑤 k = 1 ,
𝑖=1 𝑖=1
k 𝐴k 𝜎 = sup{|h𝐴, 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤i| : k𝑢k = k𝑣k = · · · = k𝑤 k = 1}.
Here the nuclear and spectral norms are in fact dual norms (not true in general)
k 𝐴k 𝜈∗ = sup |h𝐴, 𝐵i| = k 𝐴k 𝜎 (4.70)
k𝐵 k 𝜈 ≤1

and thus satisfy


|h𝐴, 𝐵i| ≤ k 𝐴k 𝜎 k𝐵k 𝜈 .
For 𝑑 = 2, these inner product and norms become the usual trace inner product
h𝐴, 𝐵i = tr(𝐴T 𝐵) and spectral, Frobenius, nuclear norms of matrices:
Õ
𝑟 Õ𝑟 1/2
2
k 𝐴k 𝜈 = 𝜎𝑖 (𝐴), k 𝐴k F = 𝜎𝑖 (𝐴) , k 𝐴k 𝜎 = max 𝜎𝑖 (𝐴),
𝑖=1,...,𝑟
𝑖=1 𝑖=1

where 𝜎𝑖 (𝐴) denotes the 𝑖th singular value of 𝐴 ∈ R𝑚×𝑛 and 𝑟 = rank(𝐴).
We emphasize that the discussion in the above paragraph requires the bases in
Example 4.15 to be orthonormal. Nevertheless, the values of the inner product
and norms do not depend on which orthonormal bases we choose. In the termin-
ology of Section 2 they are invariant under multilinear matrix multiplication by
(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) ∈ O(𝑚) × O(𝑛) × · · · × O(𝑝), or equivalently, they are defined on
Cartesian tensors. More generally, if h𝑋𝑖 𝑣, 𝑋𝑖 𝑣i𝑖 = h𝑣, 𝑣i𝑖 , 𝑖 = 1, . . . , 𝑑, then
h(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴, (𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐵i = h𝐴, 𝐵i,
and thus k(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴k F = k 𝐴k F; if k 𝑋𝑖 𝑣k 𝑖 = k𝑣k 𝑖 , 𝑖 = 1, . . . , 𝑑, then
k(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴k 𝜈 = k 𝐴k 𝜈 , k(𝑋1 , 𝑋2 , . . . , 𝑋𝑑 ) · 𝐴k 𝜎 = k 𝐴k 𝜎 .
This explains why, when discussing definition ➀ in conjunction with inner products
or norms, we expect the change-of-basis matrices in the transformation rules to
preserve these inner products or norms.
Inner product and norm spaces become enormously more interesting when they
are completed into Hilbert and Banach spaces. The study of cross-norms was in
fact started by Grothendieck (1955) and Schatten (1950) in order to define tensor
products of Banach spaces, and this has grown into a vast subject (Defant and Floret
128 L.-H. Lim

1993, Diestel et al. 2008, Light and Cheney 1985, Ryan 2002, Trèves 2006) that
we are unable to survey at any reasonable level of detail. The following example
is intended to convey an idea of how cross-norms allow one to complete the tensor
spaces in a suitable manner.
Example 4.18 (tensor product of Hilbert and Banach spaces). We revisit our
discussion in Example 4.1, properly defining topological tensor products b
⊗ using
continuous and integrable functions as illustrations.
(i) If 𝑋 and 𝑌 are compact Hausdorff topological spaces, then by the Stone–
Weierstrass theorem, 𝐶(𝑋) ⊗ 𝐶(𝑌 ) is a dense subset of 𝐶(𝑋 × 𝑌 ) with respect
to the uniform norm
k 𝑓 k∞ = sup | 𝑓 (𝑥, 𝑦)|. (4.71)
(𝑥,𝑦)∈𝑋 ×𝑌

(ii) If 𝑋 and 𝑌 are 𝜎-finite measure spaces, then by the Fubini–Tonelli theorem,
𝐿 1 (𝑋) ⊗ 𝐿 1 (𝑌 ) is a dense subset of 𝐿 1 (𝑋 × 𝑌 ) with respect to the 𝐿 1 -norm

k 𝑓 k1 = | 𝑓 (𝑥, 𝑦)| d𝑥 d𝑦. (4.72)
𝑋 ×𝑌
Here a tensor product of vector spaces is as defined in (4.39), and as we just saw in
(4.17), there are several ways to equip it with a norm and with respect to any norm
we may complete it (i.e. by adding the limits of all Cauchy sequences) to obtain
Banach and Hilbert spaces out of norm and inner product spaces. The completed
space depends on the choice of norms; with a judicious choice, we get
𝐶(𝑋) b
⊗ 𝜎 𝐶(𝑌 ) = 𝐶(𝑋 × 𝑌 ), 𝐿 1 (𝑋) b
⊗ 𝜈 𝐿 1 (𝑌 ) = 𝐿 1 (𝑋 × 𝑌 ), (4.73)
as was first discovered in Grothendieck (1955). Here b ⊗ 𝜎 and b
⊗ 𝜈 denote completion
in the spectral and nuclear norm respectively and are called the injective tensor
product and projective tensor product respectively. To be clear, the first equality
in (4.73) says that if we equip 𝐶(𝑋) ⊗ 𝐶(𝑌 ) with the spectral norm (4.67) and
complete it to obtain 𝐶(𝑋) b ⊗ 𝜎 𝐶(𝑌 ), then the resulting space is 𝐶(𝑋 × 𝑌 ) equipped
with the uniform norm (4.71), and likewise for the second equality. In particular,
(4.73) also tells us that the uniform and spectral norms are equal on 𝐶(𝑋 × 𝑌 ), and
likewise for the 𝐿 1 -norm in (4.72) and nuclear norm in (4.66). For 𝑓 ∈ 𝐿 1 (𝑋 × 𝑌 ),

| 𝑓 (𝑥, 𝑦)| d𝑥 d𝑦
𝑋 ×𝑌
Õ 𝑟 ∫ ∫ 
= inf |𝜑𝑖 (𝑥)| d𝑥 |𝜓𝑖 (𝑦)| d𝑦 : 𝜑𝑖 ∈ 𝐿 1 (𝑋), 𝜓𝑖 ∈ 𝐿 1 (𝑌 ), 𝑟 ∈ N ,
𝑖=1 𝑋 𝑌

a fact that can be independently verified by considering simple functions28 and


taking limits.

28 Finite linear combinations of characteristic functions of measurable sets.


Tensors in computations 129

These are examples of topological tensor products that involve completing the
(algebraic) tensor product in (4.39) with respect to a choice of cross-norm to
obtain a complete topological vector space. These also suggest why it is desirable
to have a variety of different cross-norms, and with each a different topological
tensor product, as the ‘right’ cross-norm to choose for a class of functions F(𝑋) is
usually the one that gives us F(𝑋) b ⊗ F(𝑌 ) = F(𝑋 × 𝑌 ) as we discussed after (4.4).
For example, to get the corresponding result for 𝐿 2 -functions, we have to use the
Hilbert–Schmidt norm
𝐿 2 (𝑋) b
⊗F 𝐿 2 (𝑌 ) = 𝐿 2 (𝑋 × 𝑌 ). (4.74)
Essentially the proof relies on the fact that the completion of 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) with
respect to the Hilbert–Schmidt norm is the closure of 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) as a subspace
of 𝐿 2 (𝑋 × 𝑌 ), and as the orthogonal complement of 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) is zero, its
closure is the whole space (Light and Cheney 1985, Theorem 1.39). Nevertheless,
such results are not always possible: there is no cross-norm that will complete
𝐿 ∞ (𝑋) ⊗ 𝐿 ∞ (𝑌 ) into 𝐿 ∞ (𝑋 × 𝑌 ) for all 𝜎-finite 𝑋 and 𝑌 (Light and Cheney 1985,
Theorem 1.53). Also, we should add that a ‘right’ cross-norm that guarantees
(4.73) may be less interesting than a ‘wrong’ cross-norm that gives us a new tensor
space. For instance, had we used b ⊗ 𝜈 to form a tensor product of 𝐶(𝑋) and 𝐶(𝑌 ),
we would have obtained a smaller subset:
𝐶(𝑋) b
⊗ 𝜈 𝐶(𝑌 ) ( 𝐶(𝑋 × 𝑌 ).
This smaller tensor space of continuous functions on 𝑋 × 𝑌 , more generally the
tensor space 𝐶(𝑋1 ) b⊗𝜈 · · · b
⊗ 𝜈 𝐶(𝑋𝑛 ), is called the Varopoulos algebra and it turns
out to be very interesting and useful in harmonic analysis (Varopoulos 1965, 1967).
A point worth highlighting is the difference between 𝐿 2 (𝑋) ⊗ 𝐿 2 (𝑌 ) and 𝐿 2 (𝑋) b
⊗F
2
𝐿 (𝑌 ). While the former contains only finite sums of separable functions
Õ
𝑟
𝑓 𝑖 ⊗ 𝑔𝑖
𝑖=1

with 𝑓𝑖 ∈ 𝐿 2 (𝑋), 𝑔𝑖 ∈ 𝐿 2 (𝑌 ), the latter includes all convergent infinite series of


separable functions
Õ ∞ Õ𝑟
𝑓𝑖 ⊗ 𝑔𝑖 ≔ lim 𝑓 𝑖 ⊗ 𝑔𝑖 ,
𝑟 →∞
𝑖=1 𝑖=1

that is, the right-hand side converges to some limit in 𝐿 2 (𝑋 × 𝑌 ) in the Hilbert–
Schmidt norm k · k F . The equality in (4.74) says that every function in 𝐿 2 (𝑋 × 𝑌 )
is given by such a limit, but if we had taken completion with respect to some other
norms such as k · k 𝜈 or k · k 𝜎 , their topological tensor products 𝐿 2 (𝑋) b
⊗ 𝜈 𝐿 2 (𝑌 ) or
2
𝐿 (𝑋) b 2 2
⊗ 𝜎 𝐿 (𝑌 ) would in general be smaller or larger than 𝐿 (𝑋 × 𝑌 ) respectively.
Whatever the choice of b ⊗ , the (algebraic) tensor product U ⊗ V ⊗ · · · ⊗ W should
130 L.-H. Lim

be regarded as the subset of all finite-rank tensors in the topological tensor product
Ub⊗Vb ⊗ ···b⊗ W.
The tensor product of Hilbert spaces invariably refers to the topological tensor
product with respect to the Hilbert–Schmidt norm because the result is always a
⊗ in (4.13) should be interpreted as b
Hilbert space. In particular, the b ⊗F . However,
as we pointed out above, other topological tensor products with respect to other
cross-norms may also be very interesting.
Example 4.19 (trace-class, Hilbert–Schmidt, compact operators). For a sep-
arable Hilbert space H with inner product h · , · i and induced norm k · k, completing
H ⊗ H∗ with respect to the nuclear, Hilbert–Schmidt and spectral norms, we obtain
different types of bounded linear operators on H as follows:
 ÕÕ 

trace-class Hb ⊗ 𝜈 H = Φ ∈ B(H) : |hΦ(𝑒𝑖 ), 𝑓 𝑗 i| < ∞ ,
𝑖 ∈𝐼 𝑗 ∈𝐼
 Õ 
∗ 2
Hilbert–Schmidt Hb
⊗F H = Φ ∈ B(H) : kΦ(𝑒𝑖 )k < ∞ ,
𝑖 ∈𝐼
 

𝑋 ⊆ H bounded
compact Hb
⊗ 𝜎 H = Φ ∈ B(H) : .
⇒ Φ(𝑋) ⊆ H compact
The series convergence is understood to mean ‘for some orthonormal bases {𝑒𝑖 : 𝑖 ∈
𝐼} and { 𝑓𝑖 : 𝑖 Í
∈ 𝐼} of H’, although for trace-class operators the condition could be
simplified to 𝑖 ∈𝐼 hΦ(𝑒𝑖 ), 𝑒 𝑗 i < ∞ provided that ‘for some’ is replaced by ‘for all’.
See Schaefer and Wolff (1999, p. 278) for the trace-class result and Trèves (2006,
Theorems 48.3) for the compact result.
If H is separable, then such operators are characterized by their having a Schmidt
decomposition:
Õ
∞ Õ

Φ= 𝜎𝑖 𝑢𝑖 ⊗ 𝑣 ∗𝑖 or Φ(𝑥) = 𝜎𝑖 h𝑣 𝑖 , 𝑥i𝑢𝑖 for all 𝑥 ∈ H, (4.75)
𝑖=1 𝑖=1

where {𝑢𝑖 : 𝑖 ∈ N} and {𝑣 𝑖 : 𝑖 ∈ N} are orthonormal sets, 𝜎𝑖 ≥ 0, and


Õ
∞ Õ

𝜎𝑖 < ∞, 𝜎𝑖2 < ∞, lim 𝜎𝑖 = 0
𝑖→∞
𝑖=1 𝑖=1

depending on whether they are trace-class, Hilbert–Schmidt or compact, respect-


ively (Reed and Simon 1980, Chapter VI, Sections 5–6). Note that H and H∗ are
naturally isomorphic by the Riesz representation theorem and we have identified
𝑣 ∗ with h𝑣, · i. As expected, we also have
Õ∞ Õ∞ 1/2
2
kΦk 𝜈 = 𝜎𝑖 , kΦk F = 𝜎𝑖 , kΦk 𝜎 = sup 𝜎𝑖 .
𝑖=1 𝑖=1 𝑖 ∈N
Tensors in computations 131

In other words, this is the infinite-dimensional version of the relation between the
various norms and matrix singular values in Example 4.17, and the Schmidt decom-
position is an infinite-dimensional generalization of singular value decomposition.
Unlike the finite-dimensional case, where one may freely speak of the nuclear,
Frobenius and spectral norms of any matrix, for infinite-dimensional H we have
H ⊗ H∗ ( H b
⊗ 𝜈 H∗ ( H b
⊗ F H∗ ( H b
⊗ 𝜎 H∗ , kΦk 𝜎 ≤ kΦk F ≤ kΦk 𝜈 ,
and the inclusions are strict, for example, a compact operator Φ may have kΦk 𝜈 =
∞. By our discussion at the end of Example 4.18, H ⊗ H∗ is the subset of finite-rank
operators in any of these larger spaces. The inequality relating the three norms is
⊗ 𝜈 H∗ , nuclear and spectral
a special case of (4.68) as k · k F is a cross-norm. On H b
norms are dual norms as in the finite-dimensional case (4.70):
kΦk 𝜈∗ = sup |hΦ, ΨiF | = kΦk 𝜎 ,
kΨ k 𝜈 ≤1

where the inner product will be defined below in (4.78).


The reason for calling Φ ∈ H b ⊗ 𝜈 H∗ trace-class is that its trace is well-defined
and always finite:
Õ∞
tr(Φ) ≔ hΦ(𝑒𝑖 ), 𝑒𝑖 i
𝑖=1

for any orthonormal basis {𝑒𝑖 : 𝑖 ∈ N} of H. As trace is independent of the choice


of orthonormal
√ basis, it is truly a tensorial notion. Moreover tr(|Φ|) = kΦk 𝜈 , where
|Φ| ≔ Φ∗ Φ. To see why the expression above is considered a trace, we may,
as discussed in Example 4.15, represent Φ as an infinite-dimensional matrix with
respect to the chosen orthonormal basis,
∞ Õ
Õ ∞
Φ= 𝑎𝑖 𝑗 𝑒𝑖 ⊗ 𝑒∗𝑗 , (4.76)
𝑖=1 𝑗=1

where 𝐴 = (𝑎𝑖 𝑗 )∞ 2 2
𝑖, 𝑗=1 ∈ 𝑙 (N ), and now observe that

Õ∞ ∞ Õ
Õ ∞  Õ∞
tr(Φ) = hΦ(𝑒 𝑘 ), 𝑒 𝑘 i = 𝑎𝑖𝑘 𝑒𝑖 , 𝑒 𝑘 = 𝑎𝑘 𝑘 . (4.77)
𝑘=1 𝑘=1 𝑖=1 𝑘=1

One may show that an operator is trace-class if and only if it is a product of two
Hilbert–Schmidt operators. A consequence is that
hΦ, ΨiF ≔ tr(Φ∗ Ψ) (4.78)
⊗F H∗ that gives k · k F as its
is always finite and defines an inner product on H b
induced norm:
hΦ, ΦiF = tr(Φ∗ Φ) = kΦk 2F .
⊗ 𝜈 H∗ , k · k 𝜈 ), (H b
While (H b ⊗F H∗ , k · k F ), (H b
⊗ 𝜎 H∗ , k · k 𝜎 ) are all Banach spaces,
132 L.-H. Lim

only H b ⊗F H∗ is a Hilbert space with the inner product in (4.78). So a topological


tensor product of Hilbert spaces may or may not be a Hilbert space: it depends
on the choice of cross-norms. Also, the choice of norm is critical in determining
whether we have a Banach space: for instance, H b ⊗ 𝜈 H∗ is complete with respect
to k · k 𝜈 but not k · k 𝜎 .
Taking topological tensor products of infinite-dimensional spaces has some unex-
pected properties. When H is infinite-dimensional, the identity operator 𝐼H ∈ B(H)
is not in H b⊗ H∗ regardless of whether we use b ⊗𝜈 , b
⊗F or b
⊗ 𝜎 , as 𝐼H is non-compact
(and thus neither Hilbert–Schmidt nor trace-class). Hence H b ⊗ H∗ ≠ B(H), no
matter which notion of topological tensor product b ⊗ we use. Incidentally, the
strong and weak duals of B(H), i.e. the set of linear functionals on B(H) con-
tinuous under the strong and weak operator topologies, are equal to each other
and to H ⊗ H∗ , i.e. the set of finite-rank operators (Dunford and Schwartz 1988,
Chapter IV, Section 1). Taking the (continuous) dual of the compact operators
gives us the trace-class operators
⊗ 𝜎 H∗ )∗  H b
(H b ⊗ 𝜈 H∗ , (4.79)
so taking dual can change the topological tensor product.
We have framed our discussion in terms of H b ⊗ H∗ as this is the most common
scenario, but it may be extended to other situations.
⊗ H∗2 as operators in B(H1 ; H2 ).
Different Hilbert spaces. It applies verbatim to H1 b
Covariant and contravariant. We covered the mixed case H1 b ⊗ H∗2 , but the
⊗ H∗2 and contravariant case H1 b
covariant case H∗1 b ⊗ H2 follow from replacing H1
with H1 or H2 with H∗2 .

Banach spaces. Note that (4.66) and (4.67) are defined without any reference
⊗ 𝜎 H∗2 and H1 b
to inner products, so for H1 b ⊗ 𝜈 H∗2 we do not need H1 and H2 to
be Hilbert spaces; compact and trace-class operators may be defined for any pair
of Banach spaces, although the latter are usually called nuclear operators in this
context.
Higher order. One may define order-𝑑 ≥ 3 analogues of bounded, compact,
Hilbert–Schmidt, trace-class operators in a straightforward manner, but corres-
ponding results are more difficult; one reason is that Schmidt decomposition (4.75)
no longer holds (Bényi and Torres 2013, Cobos, Kühn and Peetre 1992, 1999).
We remind the reader that many of the topological vector spaces considered
in Examples 4.1 and 4.2, such as 𝑆(𝑋), 𝐶 ∞(𝑋), 𝐶𝑐∞(𝑋), 𝐻(𝑋), 𝑆 ′(𝑋), 𝐸 ′(𝑋),
𝐷 ′(𝑋) and 𝐻 ′(𝑋), are not Banach spaces (and thus not Hilbert spaces) but so-
called nuclear spaces. Nevertheless, similar ideas apply to yield topological tensor
products. In fact nuclear spaces have the nice property that topological tensor
products with respect to the nuclear and spectral norms, i.e. b
⊗ 𝜈 and b
⊗ 𝜎 , are always
equal and thus independent of the choice of cross-norms.
Tensors in computations 133

We mention two applications of Example 4.19 in machine learning (Mercer


kernels) and quantum mechanics (density matrices).
Example 4.20 (Mercer kernels). Hilbert–Schmidt operators, in the form of Mer-
cer kernels (Mercer 1909), are of special significance to machine learning. Here
H = 𝐿 2 (𝑋) for a set 𝑋 of machine learning significance (where a training set
is sampled from) that has some appropriate measure (usually Borel) and topo-
logy (usually locally compact). For any 𝐿 2 -integral kernel 𝐾 : 𝑋 × 𝑋 → R,
i.e. 𝐾 ∈ 𝐿 2 (𝑋 × 𝑋) = 𝐿 2 (𝑋) b
⊗F 𝐿 2 (𝑋), the integral operator Φ : 𝐿 2 (𝑋) → 𝐿 2 (𝑋),

Φ( 𝑓 )(𝑥) = 𝐾(𝑥, 𝑦) 𝑓 (𝑦) d𝑦,
𝑋

is Hilbert–Schmidt; in fact every Hilbert–Schmidt operator on 𝐿 2 (𝑋) arises in


this manner and we always have kΦk F = k𝐾 k 𝐿2 (Conway 1990, p. 267). By our
discussion above, Φ has a Schmidt decomposition as in (4.75):
Õ

Φ= 𝜎𝑖 𝜑𝑖 ⊗ 𝜓𝑖∗ , (4.80)
𝑖=1

where 𝜓 ∗ ( 𝑓 ) = h𝜓, 𝑓 i = 𝑋
𝜓(𝑥) 𝑓 (𝑥) d𝑥 by the Riesz representation. Hence
Õ

𝐾= 𝜎𝑖 𝜑𝑖 ⊗ 𝜓𝑖 , (4.81)
𝑖=1
and we see that the difference between (4.80) and (4.81) is just one of covariance
and contravariance: every 𝐾 ∈ 𝐿 2 (𝑋) b ⊗F 𝐿 2 (𝑋) gives us a Φ ∈ 𝐿 2 (𝑋) b
⊗F 𝐿 2 (𝑋)∗
and vice versa. Mercer’s kernel theorem is essentially the statement that if 𝐾 is
continuous and symmetric positive semidefinite, i.e. [𝐾(𝑥𝑖 , 𝑥 𝑗 )] 𝑖,𝑛 𝑗=1 ∈ S+𝑛 for any
𝑥1 , . . . , 𝑥 𝑛 ∈ 𝑋 and any 𝑛 ∈ N, then 𝜑𝑖 and 𝜓𝑖 may be chosen to be equal in (4.80)
and (4.81). In this case we obtain a feature map:

𝐹 : 𝑋 → 𝑙 2 (N), 𝑥 ↦→ ( 𝜎𝑖 𝜑𝑖 (𝑥))∞𝑖=1 ,
and in this context 𝑙 2 (N) is called the feature space. It follows from (4.81) that
𝐾(𝑥, 𝑦) = h𝐹(𝑥), 𝐹(𝑦)i𝑙 2(N) . Assuming that 𝜎𝑖 > 0 for all 𝑖 ∈ N, the subspace of
𝐿 2 (𝑋) defined by
Õ ∞ 
2 2 √ ∞ 2
𝐿 𝐾 (𝑋) ≔ 𝑎𝑖 𝜑𝑖 ∈ 𝐿 (𝑋) : (𝑎𝑖 / 𝜎𝑖 )𝑖=1 ∈ 𝑙 (N) ,
𝑖=1
equipped with the inner product
Õ∞ Õ
∞  Õ

𝑎𝑖 𝑏𝑖
𝑎𝑖 𝜑𝑖 , 𝑏𝑖 𝜑𝑖 ≔ = 𝑎T Σ−1 𝑏,
𝑖=1 𝑖=1 𝐾 𝑖=1
𝜎𝑖
is called the reproducing kernel Hilbert space associated with the kernel 𝐾 (Cucker
and Smale 2002). The last expression is given in terms of infinite-dimensional
134 L.-H. Lim

matrices, as described in Example 4.5, i.e. 𝑎, 𝑏 ∈ 𝑙 2 (N) and Σ ∈ 𝑙 2 (N × N), to show


its connection with the finite-dimensional case.
These notions are used to great effect in machine learning (Hofmann, Schölkopf
and Smola 2008, Scholköpf and Smola 2002, Steinwart and Christmann 2008).
For instance, as depicted in Figure 4.3, the feature map allows one to unfurl a
complicated set of points 𝑥1 , . . . , 𝑥 𝑛 in a lower-dimensional space 𝑋 into a simpler
set of points 𝐹(𝑥1 ), . . . , 𝐹(𝑥 𝑛 ), say, one that may be partitioned with a hyperplane,
in an infinite-dimensional feature space 𝑙 2 (N).

𝑋 𝑙 2 (N)

Figure 4.3. Depiction of the feature map 𝐹 : 𝑋 → 𝑙 2 (N) in the context of support-
vector machines.

The previous example is about symmetric positive semidefinite Hilbert–Schmidt


operators; the next one will be about Hermitian positive semidefinite trace-class
operators.
Example 4.21 (density operators). These are sometimes called density matrices,
and are important alike in quantum physics (Blum 1996), quantum chemistry (Dav-
idson 1976) and quantum computing (Nielsen and Chuang 2000). Let H be a
Hilbert space, assumed separable for notational convenience, and let h · , · i be its
inner product. A trace-class operator 𝜌 ∈ H b ⊗ 𝜈 H∗ is called a density operator if

it is Hermitian 𝜌 = 𝜌, positive semidefinite |𝜌| = 𝜌, and of unit trace tr(𝜌) = 1
(or, equivalently, unit nuclear norm as tr(𝜌) = k 𝜌k 𝜈 in this case). By the same
considerations as in Example 4.20, its Schmidt decomposition may be chosen to
have 𝑢𝑖 = 𝑣 𝑖 , that is,
Õ
∞ Õ


𝜌= 𝜎𝑖 𝑣 𝑖 ⊗ 𝑣 𝑖 , tr(𝜌) = 𝜎𝑖 = 1. (4.82)
𝑖=1 𝑖=1

A key reason density operators are important is that the postulates of quantum
mechanics may be reformulated in terms of them (Nielsen and Chuang 2000,
Tensors in computations 135

Section 2.4.2). Instead of representing a quantum state as a one-dimensional


subspace of H spanned by a non-zero vector 𝑣 ∈ H, we let it be represented by the
projection operator onto that subspace, that is,
𝑣 ⊗ 𝑣 ∗ : H → H, 𝑤 ↦→ 𝑣 ∗ (𝑤)𝑣 = h𝑣, 𝑤i𝑣,
where we have assumed that 𝑣 has unit norm (if not, normalize). This is evidently
a density operator and such rank-one density operators are called pure states. An
advantage is that one may now speak of mixed states: they are infinite convex linear
combination of pure states as in (4.82), i.e. the density operators. Note that this is not
possible with states represented as vectors in H: a linear combination of two vectors
would just be another vector in H. There is a well-known condition for telling pure
states apart from mixed states without involving a Schmidt decomposition, namely,
𝜌 is pure if and only if tr(𝜌 2 ) = 1 and 𝜌 is mixed if and only if tr(𝜌 2 ) < 1.
Consequently, the value of tr(𝜌 2 ) is called the purity.
In this density operator formulation, a quantum system is completely described
by a single density operator, and for 𝑑 quantum systems described by density
operators 𝜌1 , . . . , 𝜌 𝑑 , the composite system is described by their Kronecker product
𝜌1 ⊗ · · · ⊗ 𝜌 𝑑 as defined in Example 4.11. The reader should compare these with (a)
and (b) on page 103. A Kronecker product of density operators remains a density
operator because of (4.50), (4.53) and the fact that (4.54) holds for trace-class
operators:
tr(𝜌1 ⊗ · · · ⊗ 𝜌 𝑑 ) = tr(𝜌1 ) · · · tr(𝜌 𝑑 ).
We end with a word about the role of compact operators H b ⊗ 𝜎 H∗ in this
discussion. For any density operator 𝜌, the linear functional 𝜔 𝜌 : B(H) → C,
Φ ↦→ tr(𝜌Φ) satisfies 𝜔 𝜌 (𝐼H ) = 1 and 𝜔 𝜌 (|Φ|) ≥ 0 for all Φ ∈ B(H). A linear
functional 𝜔 with the last two properties is called a positive linear functional,
abbreviated 𝜔  0. It turns out that every positive linear functional on the space of
compact operators is of the form 𝜔 𝜌 for some density operator 𝜌 (Landsman 2017,
Corollaries 4.14 and 4.15):
⊗ 𝜎 H∗ )∗ : 𝜔  0}  {𝜌 ∈ H b
{𝜔 ∈ (H b ⊗ 𝜈 H∗ : 𝜌 density operator},
a very useful addendum to (4.79). With this observation, we may now use positive
linear functionals to represent quantum states. The advantage is that it permits
ready generalization: H b ⊗ 𝜎 H∗ has the structure of a 𝐶 ∗-algebra and one may
replace it with other 𝐶 ∗ -algebra A and represent quantum states as positive linear
functionals 𝜔 : A → C.

The evolution of the definition of quantum states from vectors in Hilbert spaces to
density operators to positive linear functionals on 𝐶 ∗ -algebras is not unlike the three
increasingly sophisticated definitions of tensors or the three increasingly abstract
definitions of tensor products. As in the case of tensors and tensor products, each
definition of quantum states is useful in its own way, and all three remain in use.
136 L.-H. Lim

Our discussion of tensor product will not be complete without mentioning the
tensor algebra. We will make this the last example of this section.
Example 4.22 (tensor algebra). Let V be a vector space and 𝑣 ∈ V. We intro-
duce the shorthand
𝑑 copies 𝑑 copies
⊗𝑑 ⊗𝑑
V ≔ V ⊗ · · · ⊗ V, 𝑣 ≔ 𝑣 ⊗ ··· ⊗ 𝑣
for any 𝑑 ∈ N. We also define V ⊗0 ≔ R and 𝑣 ⊗0 ≔ 1. There is a risk of construing
erroneous relations from notation like this. We caution that V ⊗𝑑 is not the set of
tensors of the form 𝑣 ⊗𝑑 , as we pointed out in (4.42), but neither is it the set of
linear combinations of such tensors; V ⊗𝑑 contains all finite linear combinations of
any 𝑣 1 ⊗ · · · ⊗ 𝑣 𝑑 , including but not limited to those of the form 𝑣 ⊗𝑑 . As V ⊗𝑑 ,
𝑑 = 0, 1, 2, . . . , are all vector spaces, we may form their direct sum to obtain an
infinite-dimensional vector space:
Ê

T(V) = V ⊗𝑘 = R ⊕ V ⊕ V ⊗2 ⊕ V ⊗3 ⊕ · · · , (4.83)
𝑘=0

called the tensor algebra of V. Those unfamiliar with direct sums may simply
regard T(V) as the set of all finite sums of the tensors of any order:
Õ 𝑑 
⊗𝑘
T(V) = 𝑇𝑘 : 𝑇𝑘 ∈ V , 𝑑 ∈ N . (4.84)
𝑘=0

What does it mean to add a tensor 𝑇 ∈ V ⊗2 to another tensor 𝑇 ′ ∈ V ⊗5 , i.e. of


different orders? This is exactly like the addition of polyads we discussed at the
beginning of this section. Namely there is no formula for actually adding 𝑇 to 𝑇 ′,
a sum like 𝑇 + 𝑇 ′ cannot be further simplified, it is just intended as a place holder
for the two tensors 𝑇 and 𝑇 ′, we could have written instead (0, 0, 𝑇 , 0, 0, 𝑇 ′, 0, . . .),
and indeed another possible way to view T(V) is that it is the set of all sequences
with finitely many non-zero terms:
{(𝑇𝑘 )∞ ⊗𝑘
𝑘=0 : 𝑇𝑘 ∈ V , 𝑇𝑘 = 0 for all but finitely many 𝑘}.

However, we prefer the form in (4.84). The reason T(V) is called an algebra is that
it inherits a product operation given by the product of tensors in (4.41):
′ ′
V ⊗𝑑 × V ⊗𝑑 ∋ (𝑇 , 𝑇 ′) ↦→ 𝑇 ⊗ 𝑇 ′ ∈ V ⊗(𝑑+𝑑 ) ,
which can be extended to T(V) by
Õ 𝑑  Õ 𝑑′  Õ 𝑑′
𝑑 Õ

𝑇𝑗 ⊗ 𝑇𝑘 ≔ 𝑇 𝑗 ⊗ 𝑇𝑘′ .
𝑗=0 𝑘=0 𝑗=0 𝑘=0

Nothing we have discussed up to this point involves a basis of V, so it all applies to


modules as well. If we now take advantage of the fact that V is a vector space and
Tensors in computations 137

further assume that it is finite-dimensional, then V has a basis ℬ = {𝑒1 , . . . , 𝑒 𝑛 }


and we may represent any element in T(V) as
Õ
𝑑 Õ
𝑛
𝑎𝑖1 𝑖2 ···𝑖𝑘 𝑒𝑖1 ⊗ 𝑒𝑖2 ⊗ · · · ⊗ 𝑒𝑖𝑘
𝑘=0 𝑖1 ,𝑖2 ,...,𝑖𝑘 =1

with 𝑑 ∈ N. Another way to represent this is as a non-commutative polynomial


Õ
𝑑 Õ
𝑛
𝑓 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) = 𝑎𝑖1 𝑖2 ···𝑖𝑘 𝑋𝑖1 𝑋𝑖2 · · · 𝑋𝑖𝑘 ,
𝑘=0 𝑖1 ,𝑖2 ,...,𝑖𝑘 =1

in 𝑛 non-commutative variables 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 . Here non-commutative means


𝑋𝑖 𝑋 𝑗 ≠ 𝑋 𝑗 𝑋𝑖 if 𝑖 ≠ 𝑗, i.e. just like (4.30), and we require other arithmetic rules
for manipulating 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 to mirror those in (4.40). The set of all such
polynomials is usually denoted Rh𝑋1 , . . . , 𝑋𝑛 i. Hence, as algebras,
T(V)  Rh𝑋1 , . . . , 𝑋𝑛 i, (4.85)
where the isomorphism is given by the map that takes 𝑒𝑖1 ⊗ 𝑒𝑖2 ⊗ · · · ⊗ 𝑒𝑖𝑘 ↦→
𝑋𝑖1 𝑋𝑖2 · · · 𝑋𝑖𝑘 . Such a representation also applies to any 𝑑-tensor in V ⊗𝑑 as a
special case, giving us a homogeneous non-commutative polynomial
Õ
𝑛
𝑎𝑖1 𝑖2 ···𝑖𝑑 𝑋𝑖1 𝑋𝑖2 · · · 𝑋𝑖𝑑 ,
𝑖1 ,𝑖2 ,...,𝑖𝑑 =1

where all terms have degree 𝑑. This is a superior representation compared to merely
representing the tensor as a hypermatrix (𝑎𝑖1 𝑖2 ···𝑖𝑑 )𝑖𝑛1 ,𝑖2 ,...,𝑖𝑑 =1 ∈ R𝑛×𝑛×···×𝑛 because,
like usual polynomials, we can take derivatives and integrals of non-commutative
polynomials and evaluate them on matrices, that is, we can take 𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ∈
R𝑚×𝑚 and plug them into 𝑓 (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) to get 𝑓 (𝐴1 , 𝐴2 , . . . , 𝐴𝑛 ) ∈ R𝑚×𝑚 .
This last observation alone leads to a rich subject, often called ‘non-commutative
sums-of-squares’, that has many engineering applications (Helton and Putinar
2007).
What if we need to speak of an infinite sum of tensors of different orders? This
is not just of theoretical interest; as we will see in Example 4.45, a multipole
expansion is such an infinite sum. A straightforward way to do this would be to
replace the direct sums in (4.83) with direct products; the difference, if the reader
recalls, is that while the elements of a direct sum are zero for all but a finite number
of summands, those of a direct product may contain an infinite number of non-
zero summands. If there are just a finite number of vector spaces V1 , . . . , V𝑑 , the
direct sum V1 ⊕ · · · ⊕ V𝑑 and the direct product V1 × · · · × V𝑑 are identical, but
if we have infinitely many vector spaces, a direct product is much larger than a
É For instance, a direct sum of countably many two-dimensional
direct sum. Î vector
spaces 𝑘 ∈N V 𝑘 has countable dimension whereas a direct product 𝑘 ∈N 𝑘 has
V
uncountable dimension.
138 L.-H. Lim

Nevertheless, instead of introducing new notation for the direct product of V ⊗𝑘 ,


𝑘 = 0, 1, 2, . . . , we may simply consider the dual vector space of T(V),
Ö∞ Õ ∞ 
∗ ∗⊗𝑘 ∗⊗𝑘
T(V) = V = 𝜑𝑘 : 𝜑𝑘 ∈ V , (4.86)
𝑘=0 𝑘=0
noting that taking duals turns a direct sum into a direct product. We call this the
dual tensor algebra. While it is true that (V ⊗𝑘 )∗ = V∗⊗𝑘 – this follows from the
‘calculus of tensor products’ that we will present in Example 4.31 – we emphasize
that the dual tensor algebra of any non-zero vector space V is not the same as the
tensor algebra of its dual space V∗ , that is,
T(V)∗ ≠ T(V∗ ).
Indeed, the dual vector space of any infinite-dimensional vector space must have
dimension strictly larger than that of the vector space (Jacobson 1975, Chapter IX,
Section 5). In this case, if V is finite-dimensional, then by what we discussed
earlier, T(V∗ ) has countable dimension whereas T(V)∗ has uncountable dimension.
Although what we have defined in (4.86) is the direct product of V∗⊗𝑘 , 𝑘 =
0, 1, 2, . . . , the direct product of V ⊗𝑘 , 𝑘 = 0, 1, 2, . . . , is simply given by the dual
tensor algebra of the dual space:
Ö
∞ Õ∞ 
∗ ∗ ⊗𝑘 ⊗𝑘
T(V ) = V = 𝑇𝑘 : 𝑇𝑘 ∈ V .
𝑘=0 𝑘=0
Just as T(V) may be regarded as the ring of non-commutative polynomials, the
same arguments that led to (4.85) also give
T(V∗ )∗  Rhh𝑋1 , . . . , 𝑋𝑛 ii,
that is, T(V∗ )∗ may be regarded as the ring of non-commutative power series.
One issue with these constructions is that taking direct sums results in a space
T(V) that is sometimes too small whereas taking direct products leads to a space
T(V∗ )∗ that is often too big. However, these constructions are purely algebraic;
if V is equipped with a norm or inner product, we can often use that to obtain a
tensor algebra of the desired size. Suppose V has an inner product h · , · i; then as
in Example 4.17, defining
h𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤, 𝑢 ′ ⊗ 𝑣 ′ ⊗ · · · ⊗ 𝑤 ′i ≔ h𝑢, 𝑢 ′ih𝑣, 𝑣 ′i · · · h𝑤, 𝑤 ′i
for any 𝑢 ⊗ 𝑣 ⊗ · · · ⊗ 𝑤, 𝑢 ′ ⊗ 𝑣 ′ ⊗ · · · ⊗ 𝑤 ′ ∈ V ⊗𝑑 , any 𝑑 ∈ N, and extending
bilinearly to all of T(V) makes it into an inner product space. We will denote the
inner products on V ⊗𝑑 and T(V) by h · , · i and the corresponding induced norms
by k · k; there is no ambiguity since they are all consistently defined. Taking
completion gives us a Hilbert space
Õ ∞ Õ∞ 
b ⊗𝑘 2
T(V) ≔ 𝑇𝑘 : 𝑇𝑘 ∈ V , k𝑇𝑘 k < ∞
𝑘=0 𝑘=0
Tensors in computations 139

with inner product



∞ Õ
∞  Õ

𝑇𝑘 , 𝑇𝑘′ ≔ h𝑇𝑘 , 𝑇𝑘′ i.
𝑘=0 𝑘=0 𝑘=0

Alternatively, one may also describe b T(V) as the Hilbert space direct sum of the
Hilbert spaces V ⊗𝑘 , 𝑘 = 0, 1, 2, . . . . If ℬ is an orthonormal basis of V, then the
tensor
Ð∞ product basis (4.43), denoted ℬ ⊗𝑘 , is an orthonormal basis on V ⊗𝑘 , and
⊗𝑘 is a countable orthonormal basis on b T(V), i.e. it is a separable Hilbert
𝑘=0 ℬ
space. This can be used in a ‘tensor trick’ for linearizing a non-linear function with
convergent Taylor series. For example, for any 𝑥, 𝑦 ∈ R𝑛 ,
Õ∞
1 Õ∞
1 ⊗𝑘 ⊗𝑘
exp(h𝑥, 𝑦i) = h𝑥, 𝑦i 𝑘 = h𝑥 , 𝑦 i
𝑘=0
𝑘! 𝑘=0
𝑘!
Õ 
1 ⊗𝑘 Õ 1 ⊗𝑘
∞ ∞
= √ 𝑥 , √ 𝑦 = h𝑆(𝑥), 𝑆(𝑦)i,
𝑘=0 𝑘! 𝑘=0 𝑘!

where the map


Õ∞
1
𝑆 : R𝑛 → b
T(R𝑛 ), 𝑆(𝑥) = √ 𝑥 ⊗𝑘
𝑘=0 𝑘!

is well-defined since
Õ
∞ 2 Õ∞
1 1
k𝑆(𝑥)k 2 = √ 𝑥 ⊗𝑘 = k𝑥 k 2𝑘 = exp(k𝑥 k 2 ) < ∞.
𝑘=0 𝑘! 𝑘=0
𝑘!
This was used, with sin in place of exp, in an elegant proof (Krivine 1979) of
Grothendieck’s inequality that also yields
𝜋
𝐾G ≤ √ ≈ 1.78221,
2 log(1 + 2)
the upper bound we saw in Example 3.17. This is in fact the best explicit upper
bound for the Grothendieck constant over R, although it is now known that it is not
sharp.

4.3. Tensors via the universal factorization property


We now come to the most abstract and general way to define tensor products
and thus tensors: the universal factorization property. It is both a potent tool
and an illuminating concept. Among other things, it unifies almost every notion
of tensors we have discussed thus far. For instance, an immediate consequence
is that multilinear maps (Definitions 3.1 and 3.3) and sums of separable functions
(Definitions 4.4 and 4.9) are one and the same notion, unifying definitions ➁ and ➂.
The straightforward way to regard the universal factorization property is that it is
a property of multilinear maps. It says that the multilinearity in any multilinear map
140 L.-H. Lim

Φ is the result of a special multilinear map 𝜎⊗ that is an omnipresent component


of all multilinear maps: Φ = 𝐹 ◦ 𝜎⊗ – once 𝜎⊗ is factored out of Φ, what remains,
i.e. the component 𝐹, is just a linear map. Any multilinearity contained in Φ is all
due to 𝜎⊗ .
However, because of its universal nature, we may turn the property on its head
and use it to define the tensor product: a tensor product is whatever satisfies the
universal factorization property. More precisely, whatever vector space one can
put in the box below that makes the diagram
𝜎 /
V1 × · · · ×❑V𝑑 ?
❑❑
❑❑
❑❑ 𝐹
Φ ❑❑❑
❑% 
W
commute for all vector spaces W with Φ multilinear and 𝐹 linear is defined to be
a tensor product of V1 , . . . , V𝑑 . One may show that this defines the tensor product
(and the map 𝜎) uniquely up to vector space isomorphism.
We now supply the details. The special multilinear map in question is 𝜎⊗ : V1 ×
· · · × V 𝑑 → V1 ⊗ · · · ⊗ V 𝑑 ,
𝜎⊗ (𝑣 1 , . . . , 𝑣 𝑑 ) = 𝑣 1 ⊗ · · · ⊗ 𝑣 𝑑 , (4.87)
called the Segre map or Segre embedding.29 Here the tensor space V1 ⊗ · · · ⊗ V𝑑 is
as defined in Section 4.2 and (4.87) defines a multilinear map by virtue of (4.40).
Universal factorization property. Every multilinear map is the composition of a
linear operator with the Segre map.
This is also called the universal mapping property, or the universal property of
the tensor product. Formally, it says that if Φ is a multilinear map from V1 ×· · ·×V𝑑
into a vector space W, then Φ determines a unique linear map 𝐹Φ from V1 ⊗ · · · ⊗V𝑑
into W, that makes the following diagram commute:
𝜎⊗
V1 × · · · × V
❘𝑑
/ V1 ⊗ · · · ⊗ V 𝑑 (4.88)
❘❘❘
linear 𝐹Φ
❘❘❘
❘❘
multilinear Φ ❘❘❘❘❘
❘( 
W
or, equivalently, Φ can be factored as
Φ = 𝐹Φ ◦ 𝜎⊗ , (4.89)
where 𝐹Φ is a ‘linear component’ dependent on Φ and 𝜎⊗ is a ‘multilinear com-
ponent’ independent of Φ. The component 𝜎⊗ is universal, i.e. common to all

29 There is no standard notation. It has been variously denoted as 𝜎 (Harris 1995), as 𝜑 (Bourbaki
1998, Lang 2002), as ⊗ (Conrad 2018) and as Seg (Landsberg 2012).
Tensors in computations 141

multilinear maps. This can be stated in less precise but more intuitive terms as
follows.
(a) All the ‘multilinearness’ in a multilinear map Φ may be factored out of Φ,
leaving behind only the ‘linearness’ that is encapsulated in 𝐹Φ .
(b) The ‘multilinearness’ extracted from any multilinear map is identical, that is,
all multilinear maps are multilinear because they contain a copy of the Segre
map 𝜎⊗ .
Actually, 𝜎⊗ does depend on 𝑑 and so we should have said it is universal for 𝑑-linear
maps. An immediate consequence of the universal factorization property is that
M𝑑 (V1 , . . . , V𝑑 ; W) = L(V1 ⊗ · · · ⊗ V𝑑 ; W).
In principle one could use the universal factorization property to avoid multilinear
maps entirely and discuss only linear maps, but of course there is no reason to do
that: as we saw in Section 3, multilinearity is a very useful notion in its own right.
The universal factorization property should be viewed as a way to move back and
forth between the multilinear realm and the linear one. While we have assumed
contravariant tensors for simplicity, the discussions above apply verbatim to mixed
tensor spaces V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 and with modules in place of vector
spaces.
Pure mathematicians accustomed to commutative diagrams would swear by
(4.88) but it expresses the same thing as (4.89), which is more palatable to applied
and computational mathematicians as it reminds us of matrix factorizations like
Example 2.13, where we factor a matrix 𝐴 ∈ R𝑚×𝑛 into 𝐴 = 𝑄𝑅 with an orthogonal
component 𝑄 ∈ O(𝑚) that captures the ‘orthogonalness’ in 𝐴, leaving behind a
triangular component 𝑅. There is one big difference, though: both 𝑄 and 𝑅 depend
on 𝐴, but in (4.89), whatever our choice of Φ, the multilinear component will always
be 𝜎⊗ . The next three examples will bear witness to this intriguing property.
Example 4.23 (trilinear functionals). The universal factorization property ex-
plains why (3.9) and (4.38) look alike. A trilinear functional 𝜏 : U × V × W → R
can be factored as
𝜏 = 𝐹𝜏 ◦ 𝜎 ⊗ (4.90)
and the multilinearity of 𝜎⊗ in (4.38) accounts for that of 𝜏 in (3.9). For example,
take the first equation in (3.9). We may view it as arising from
𝜏(𝜆𝑢 + 𝜆 ′𝑢 ′, 𝑣, 𝑤) = 𝐹𝜏 (𝜎⊗ (𝜆𝑢 + 𝜆 ′𝑢 ′ , 𝑣, 𝑤)) by (4.90),
= 𝐹𝜏 ((𝜆𝑢 + 𝜆 ′𝑢 ′) ⊗ 𝑣 ⊗ 𝑤) by (4.87),
= 𝐹𝜏 (𝜆𝑢 ⊗ 𝑣 ⊗ 𝑤 + 𝜆 ′𝑢 ′ ⊗ 𝑣 ⊗ 𝑤) by (4.38),
= 𝜆𝐹𝜏 (𝑢 ⊗ 𝑣 ⊗ 𝑤) + 𝜆 ′ 𝐹𝜏 (𝑢 ′ ⊗ 𝑣 ⊗ 𝑤) 𝐹𝜏 is linear,
= 𝜆𝐹𝜏 (𝜎⊗ (𝑢, 𝑣, 𝑤)) + 𝜆 ′ 𝐹𝜏 (𝜎⊗ (𝑢 ′, 𝑣, 𝑤)) by (4.87),
= 𝜆𝜏(𝑢, 𝑣, 𝑤) + 𝜆 ′ 𝜏(𝑢 ′ , 𝑣, 𝑤) by (4.90),
142 L.-H. Lim

and similarly for the second and third equations in (3.9). More generally, if we
reread Section 3.2 with the hindsight of this section, we will realize that it is simply
a discussion of the universal factorization property without invoking the ⊗ symbol.
We next see how the universal factorization property ties together the three
matrix products that have made an appearance in this article.
Example 4.24 (matrix, Hadamard and Kronecker products). Let us consider
the standard matrix product in (2.5) and Hadamard product in (2.4) on 2 × 2
matrices. Let 𝜇, 𝜂 : R2×2 × R2×2 → R2×2 be, respectively,
     
𝑎11 𝑎12 𝑏11 𝑏12 𝑎 𝑏 + 𝑎12 𝑏21 𝑎11 𝑏12 + 𝑎12 𝑏22
𝜇 , = 11 11 ,
𝑎21 𝑎22 𝑏21 𝑏22 𝑎21 𝑏11 + 𝑎22 𝑏21 𝑎21 𝑏12 + 𝑎22 𝑏22
     
𝑎11 𝑎12 𝑏11 𝑏12 𝑎11 𝑏11 𝑎12 𝑏12
𝜂 , = .
𝑎21 𝑎22 𝑏21 𝑏22 𝑎21 𝑏21 𝑎22 𝑏22
They look nothing alike and, as discussed in Section 2, are of entirely different
natures. But the universal factorization property tells us that, by virtue of the fact
that both are bilinear operators from R2×2 ×R2×2 to R2×2 , the ‘bilinearness’ in 𝜇 and 𝜂
is the same, encapsulated in the Segre map 𝜎⊗ : R2×2 ×R2×2 → R2×2 ⊗R2×2  R4×4 ,
       
𝑎11 𝑎12 𝑏11 𝑏12 𝑎11 𝑎12 𝑏11 𝑏12
𝜎⊗ , = ⊗
𝑎21 𝑎22 𝑏21 𝑏22 𝑎21 𝑎22 𝑏21 𝑏22
𝑎11 𝑏11 𝑎11 𝑏12 𝑎12 𝑏11 𝑎12 𝑏12 
 
𝑎 𝑏 𝑎11 𝑏22 𝑎12 𝑏21 𝑎12 𝑏22 
=  11 21 ,
𝑎21 𝑏11 𝑎21 𝑏12 𝑎22 𝑏11 𝑎22 𝑏12 
𝑎21 𝑏21 𝑎21 𝑏22 𝑎22 𝑏21 𝑎22 𝑏22 
 
i.e. the Kronecker product in Example 4.10(ii). The difference between 𝜇 and 𝜂 is
due entirely to their linear components 𝐹𝜇 , 𝐹𝜂 : R4×4 → R2×2 ,
𝑐11 𝑐12 𝑐13 𝑐14 
 
   
 𝑐21 𝑐22 𝑐23 𝑐24   𝑐 11 + 𝑐 23 𝑐 12 + 𝑐 24
 
𝐹𝜇  𝑐31 𝑐32 𝑐33 𝑐34   = 𝑐31 + 𝑐43 𝑐32 + 𝑐44 ,
 
𝑐41 𝑐42 𝑐43 𝑐44 
 
𝑐11 𝑐12 𝑐13 𝑐14 

   
 𝑐21 𝑐22 𝑐23 𝑐24  
𝐹𝜂     = 𝑐11 𝑐14 ,
 𝑐31 𝑐32 𝑐33 𝑐34   𝑐41 𝑐44
 
𝑐41 𝑐42 𝑐43 𝑐44 
 
which are indeed very different. As a sanity check, the reader may like to verify
from these formulas that
𝜇 = 𝐹𝜇 ◦ 𝜎 ⊗ , 𝜂 = 𝐹𝜂 ◦ 𝜎⊗ .
More generally, the same argument shows that any products we define on 2 × 2
matrices must share something in common, namely the Kronecker product that
Tensors in computations 143

takes a pair of 2 × 2 matrices to a 4 × 4 matrix. The Kronecker product accounts


for the ‘bilinearness’ of all conceivable matrix products.
We will include an infinite-dimensional example for variety.
Example 4.25 (convolution I). We recall the convolution bilinear operator
∗ : 𝐿 1 (R𝑛 ) × 𝐿 1 (R𝑛 ) → 𝐿 1 (R𝑛 ), ( 𝑓 , 𝑔) ↦→ 𝑓 ∗ 𝑔
from Example 3.7. As we saw in Example 4.18, 𝐿 1 (R𝑛 ) ⊗ 𝐿 1 (R𝑛 ) = 𝐿 1 (R𝑛 × R𝑛 )
when we interpret ⊗ as b ⊗ 𝜈 , the topological tensor product with respect to the
nuclear norm, and so the Segre map is
𝜎⊗ : 𝐿 1 (R𝑛 ) × 𝐿 1 (R𝑛 ) → 𝐿 1 (R𝑛 × R𝑛 ), ( 𝑓 , 𝑔) ↦→ 𝑓 ⊗ 𝑔,
where, as a reminder, 𝑓 ⊗ 𝑔(𝑥, 𝑦) = 𝑓 (𝑥)𝑔(𝑦), where ⊗ is interpreted as the separable
product. Commutative diagrams like (4.88) are useful when we have three or more
maps between multiple spaces, as they allow us to assemble all maps into a single
picture, which helps to keep track of what gets mapped where. Here we have
𝜎⊗
𝐿 1 (R𝑛 ) × 𝐿 1 (R
◗ )
𝑛 / 𝐿 1 (R𝑛 × R𝑛 )
◗◗◗
◗◗◗
◗ 𝐹∗
∗ ◗◗◗◗◗
( 
𝐿 1 (R𝑛 )
and we would like to find the unique linear map 𝐹∗ that makes the diagram commute,
i.e. 𝑓 ∗ 𝑔 = 𝐹∗ ( 𝑓 ⊗ 𝑔). Writing out this expression, we see that 𝐹∗ = 𝐺 ◦ 𝐻 is a
composition of the two linear maps defined by

1 𝑛 𝑛 1 𝑛
𝐺 : 𝐿 (R × R ) → 𝐿 (R ), 𝐺(𝐾)(𝑥) ≔ 𝐾(𝑥, 𝑦) d𝑦,
R𝑛
1 𝑛 𝑛 1 𝑛 𝑛
𝐻 : 𝐿 (R × R ) → 𝐿 (R × R ), 𝐻(𝐾)(𝑥, 𝑦) ≔ 𝐾(𝑥 − 𝑦, 𝑦)
for all 𝐾 ∈ 𝐿 1 (R𝑛 × R𝑛 ).
Strictly speaking, the space V1 ⊗· · ·⊗V𝑑 has already been fixed by Definition 4.9.
In order to allow some flexibility in (4.88), we will have to allow for V1 ⊗ · · · ⊗ V𝑑
to be replaced by any vector space T and 𝜎⊗ by any multilinear map
𝜎T : V1 × · · · × V𝑑 → T
as long as the universal factorization property holds for this choice of T and 𝜎T ,
that is,
𝜎T
V1 × · · · ×▼V𝑑 /T (4.91)
▼▼▼
linear 𝐹Φ
▼▼▼
multilinear Φ ▼▼▼▼ 
&
W
Definition 4.26 (tensors via universal factorization). Let T be a vector space and
144 L.-H. Lim

let 𝜎T : V1 × · · · × V𝑑 → T be a multilinear map with the universal factorization


property, that is, for any vector space W and any multilinear map Φ : V1 ×· · ·×V𝑑 →
W, there is a unique linear map 𝐹Φ : T → W, called the linearization of Φ, such
that the factorization Φ = 𝐹Φ ◦ 𝜎T holds. Then T is unique up to vector space
isomorphisms and is called a tensor space or more precisely a tensor product of
the vector spaces V1 , . . . , V𝑑 . An element of T is called a 𝑑-tensor.
As usual, one may easily incorporate covariance and contravariance by replacing
some of the vector spaces with their dual spaces.
In Example 4.23, T = U ⊗ V ⊗ W, as defined in Definition 4.9, 𝜎T = 𝜎⊗ , as
defined in (4.87), and we recover (4.88). But for the subsequent two examples we
have in fact already used Definition 4.26 implicitly. In Example 4.24, T = R4×4
and 𝜎T : R2×2 × R2×2 → T takes a pair of 2 × 2 matrices to their Kronecker product.
In Example 4.25, T = 𝐿 1 (R𝑛 × R𝑛 ) and 𝜎T : 𝐿 1 (R𝑛 ) × 𝐿 1 (R𝑛 ) → T takes a pair of
𝐿 1 -functions to their separable product.
Definition 4.26 is intended to capture as many notions of tensor products as
possible. The Segre map 𝜎T may be an abstract tensor product of vectors, an outer
product of vectors in R𝑛 , a Kronecker product of operators or matrices, a separable
product of functions, etc.; the ⊗ in (4.91) could be abstract tensor products of
vector spaces, Kronecker products of operator spaces, topological tensor products
of topological vector spaces, etc. In the last case we require the maps Φ, 𝜎T , 𝐹Φ
to be also continuous; without this requirement the topological tensor product just
becomes the algebraic tensor product. Recall that continuity takes a simple form
for a multilinear map, being equivalent to the statement that its spectral norm as
defined in (3.19) is finite, that is, continuous multilinear maps are precisely bounded
multilinear maps. Let us verify that the maps in Example 4.25 are continuous.
Example 4.27 (convolution II). We follow the same notation in Example 4.25
except that we will write B∗ : 𝐿 1 (R𝑛 ) × 𝐿 1 (R𝑛 ) → 𝐿 1 (R𝑛 ), ( 𝑓 , 𝑔) ↦→ 𝑓 ∗ 𝑔 for
convolution as a bilinear map. We show that kB∗ k 𝜎 = k𝜎⊗ k 𝜎 = k𝐹∗ k 𝜎 = 1 and
thus these maps are continuous. First recall from Example 3.7 that kB∗ k 𝜎 = 1.
Next, since k 𝑓 ⊗ 𝑔k 1 = k 𝑓 k 1 k𝑔k 1 , we have
k 𝑓 ⊗ 𝑔k 1
k𝜎⊗ k 𝜎 = sup = 1.
𝑓 ,𝑔≠0 k 𝑓 k 1 k𝑔k 1

Lastly, for 𝐾 ∈ 𝐿 1 (R𝑛 × R𝑛 ),


∫ ∫
k𝐹∗(𝐾)k 1 = |𝐾(𝑥 − 𝑦, 𝑦)| d𝑦 d𝑥 = k𝐾 k 1
R𝑛 R𝑛
by Fubini and a change of variables. Thus we get
k𝐹∗(𝐾)k 1
k𝐹∗ k 𝜎 = sup = 1.
𝐾 ≠0 k𝐾 k 1
Note that even though we use the spectral norm to ascertain continuity of the
Tensors in computations 145

linear and multilinear maps between spaces, the tensor product space above and
in Example 4.25 is formed with b ⊗ 𝜈 , the topological tensor product with respect to
the nuclear norm. Indeed, b ⊗ 𝜈 is an amazing tensor product in that for any norm
spaces U, V, W, if we have the commutative diagram
𝜎⊗
U × V❏ /Ub ⊗𝜈 V
❏❏
❏❏
❏❏ 𝐹Φ
Φ ❏❏❏ 

%
W

then the bilinear map Φ : U × V → W is continuous if and only if its linearization


𝐹Φ : U b
⊗ 𝜈 V → W is continuous (Defant and Floret 1993, Chapter 3).

We will now examine a non-example to see that Definition 4.26 is not omnipotent.
The tensor product of Hilbert spaces is an important exception that does not satisfy
the universal factorization property. The following example is adapted from Garrett
(2010).

Example 4.28 (no universal factorization for Hilbert spaces). We begin with
a simple observation. For a finite-dimensional vector space V, the map defined
by 𝛽 : V × V∗ → R, (𝑣, 𝜑) ↦→ 𝜑(𝑣) is clearly bilinear. So we may apply universal
factorization to get
𝜎⊗
V × V▲∗ / V ⊗ V∗
▲▲
▲▲
▲▲ tr
𝛽 ▲▲▲▲ 
%
R

The linear map that we obtained, tr : V ⊗ V∗ → R, is in fact a coordinate-free way


to define trace. To see this, note that if we choose V = R𝑛 , then for any 𝑎 ∈ R𝑛 and
𝑏T ∈ R1×𝑛 = R𝑛∗ ,

𝛽(𝑎, 𝑏T ) = 𝑏T 𝑎 = tr(𝑏T 𝑎) = tr(𝑎𝑏T ) = tr(𝑎 ⊗ 𝑏).

This observation is key to our subsequent discussion. We will see that when V is
replaced by an infinite-dimensional Hilbert space and we require all maps to be
continuous, then there is no continuous linear map that can take the place of trace.
Let H be a separable infinite-dimensional Hilbert space. As discussed in Ex-
amples 4.18 and 4.19, the tensor product of H and H∗ is interpreted to be the
topological tensor product H b⊗F H∗ with respect to the Hilbert–Schmidt norm k · k F .

So we must have 𝜎⊗ : H×H → H b ⊗ F H∗ , (𝑣, 𝜑) ↦→ 𝑣 ⊗ 𝜑. Take W = C and consider
the continuous bilinear functional 𝛽 : H × H∗ → C, (𝑣, 𝜑) ↦→ 𝜑(𝑣). We claim that
there is no continuous linear map 𝐹𝛽 : H b ⊗ F H∗ → C such that 𝛽 = 𝐹𝛽 ◦ 𝜎⊗ , that is,
146 L.-H. Lim

we cannot make the diagram


𝜎⊗
H × H❑∗ /Hb⊗ F H∗
❑❑❑
𝐹𝛽 ?
❑❑❑
𝛽 ❑❑❑ 

❑%
C
commute. To this end, suppose 𝐹𝛽 exists. Take any orthonormal basis {𝑒𝑖 : 𝑖 ∈ N}.
Then
𝛽(𝑒𝑖 , 𝑒∗𝑗 ) = 𝛿𝑖 𝑗 , 𝜎⊗ (𝑒𝑖 , 𝑒∗𝑗 ) = 𝑒𝑖 ⊗ 𝑒∗𝑗 ,
and since 𝛽 = 𝐹𝛽 ◦ 𝜎⊗ , we must have
𝐹𝛽 (𝑒𝑖 ⊗ 𝑒∗𝑗 ) = 𝛿𝑖 𝑗 .

Now take any Φ ∈ H b⊗F H∗ and express it as in (4.76). Then since 𝐹𝛽 is continuous
and linear, we have
Õ∞
𝐹𝛽 (Φ) = 𝑎 𝑘 𝑘 = tr(Φ),
𝑘=1

that is, 𝐹𝛽 must be the trace as defined in (4.77). This is a contradiction as trace is
unbounded on H b ⊗F H∗ ; just take
Õ

Φ= 𝑘 −1 𝑒 𝑘 ⊗ 𝑒∗𝑘 ,
𝑘=1
Í −2 < ∞ but tr(Φ) = Í∞ 𝑘 −1 = ∞.
which is Hilbert–Schmidt as kΦk 2F = ∞ 𝑘=1 𝑘 𝑘=1
Hence no such map exists.
A plausible follow-up question is that instead of b ⊗F , why not take topological
b
tensor products with respect to the nuclear norm ⊗ 𝜈 ? As we mentioned at the
end of Example 4.27, this automatically guarantees continuity of all maps. The
problem, as we pointed out in Example 4.19, is that the resulting tensor space, i.e.
⊗ 𝜈 H∗ , is not a Hilbert space. In fact, one may show
the trace-class operators H b
that no such Hilbert space exists. Suppose there is a Hilbert space T and a Segre
map 𝜎T : H × H∗ → T so that the universal factorization property in (4.91) holds.
Then we may choose W = H b ⊗F H∗ and Φ = 𝜎⊗ to get
𝜎T
H × H❑∗❑ /T
❑❑❑
❑❑ 𝐹⊗
𝜎⊗ ❑❑❑% 
⊗ F H∗
Hb
from which one may deduce (see Garrett 2010 for details) that if 𝐹⊗ is a continuous
linear map, then
h · , · iT = 𝑐h · , · iF
Tensors in computations 147

for some 𝑐 > 0, that is, 𝐹⊗ is up to a constant multiple an isometry (Hilbert space
isomorphism). So T and H b ⊗ F H∗ are essentially the same Hilbert space up to
scaling and thus the same trace argument above shows that T does not satisfy the
universal factorization property.
While the goal of this example is to demonstrate that Hilbert spaces generally do
not satisfy the universal factorization property, there is a point worth highlighting
in this construction. As we saw at the beginning, if we do not care about continuity,
then everything goes through without a glitch: the evaluation bilinear functional
𝛽 : V × V∗ → R, (𝑣, 𝜑) ↦→ 𝜑(𝑣) satisfies the universal factorization property
𝜎⊗
V × V▲∗ / V ⊗ V∗
▲▲
▲▲
▲▲ tr
𝛽 ▲▲▲▲ 
%
R
with the linear functional tr : V ⊗ V∗ → R given by
Õ𝑟  Õ𝑟
tr 𝑣 𝑖 ⊗ 𝜑𝑖 = 𝜑𝑖 (𝑣 𝑖 )
𝑖=1 𝑖=1

for any 𝑟 ∈ N. In the language of Definition 4.26, trace is the linearization of


the evaluation bilinear functional, a definition adopted in both algebra (Lang 2002,
Proposition 5.7) and analysis (Ryan 2002, Section 1.3). Note that here V can be
any vector space, not required to be finite-dimensional or complete or normed, and
the maps involved are not required to be continuous. The trace as defined above is
always a finite sum and thus finite-valued.

The next example clarifies an occasional ambiguity in the definition of higher-


order derivatives of vector-valued functions.

Example 4.29 (linearizing higher-order derivatives). Recall from Example 3.2


that the 𝑑th derivative of 𝑓 : Ω → W at a point 𝑣 ∈ Ω ⊆ V is a multilinear map
denoted 𝐷 𝑑 𝑓 (𝑣) as in (3.23). Applying the universal factorization property
𝑑 copies 𝑑 copies
𝜎⊗
V × · · · ×◆◆V / V ⊗ ··· ⊗ V
◆◆◆
◆◆◆
◆◆ 𝜕 𝑑 𝑓 (𝑣)
𝑑
𝐷 𝑓 (𝑣) ◆◆◆◆
◆& 
W
we obtain a linear map on a tensor space,
𝜕 𝑑 𝑓 (𝑣) : V ⊗𝑑 → W,
that may be regarded as an alternative candidate for the 𝑑th derivative of 𝑓 . In the
language of Definition 4.26, 𝜕 𝑑 𝑓 (𝑣) is the linearization of 𝐷 𝑑 𝑓 (𝑣). While we have
148 L.-H. Lim

denoted them differently for easy distinction, it is not uncommon to see 𝐷 𝑑 𝑓 (𝑣)
and 𝜕 𝑑 𝑓 (𝑣) used interchangeably, sometimes in the same sentence.
𝑛 → R, 𝑓 (𝑋) = − log det(𝑋) discussed in Examples 3.6
Take the log barrier 𝑓 : S++
and 3.16. We have

𝐷 2 𝑓 (𝑋) : S𝑛 × S𝑛 → R, (𝐻1 , 𝐻2 ) ↦→ tr(𝑋 −1 𝐻1 𝑋 −1 𝐻2 ),


𝜎⊗ : S𝑛 × S𝑛 → S𝑛 ⊗ S𝑛 , (𝐻1 , 𝐻2 ) ↦→ 𝐻1 ⊗ 𝐻2 ,

where the latter denotes the Kronecker product as discussed in Example 4.11. The
universal factorization property
𝜎⊗
S𝑛 × S▲𝑛 / S𝑛 ⊗ S𝑛
▲▲▲
𝜕 𝑑 𝑓 (𝑋 )
▲▲▲
𝑑
▲▲
𝐷 𝑓 (𝑋 ) ▲▲▲ 
&
R
gives us

𝜕 2 𝑓 (𝑋) : S𝑛 ⊗ S𝑛 → R, (𝐻1 , 𝐻2 ) = vec(𝑋 −1 )T (𝐻1 ⊗ 𝐻2 ) vec(𝑋 −1 ).

Note that here it does not matter whether we use vec or vecℓ , since the matrices
involved are all symmetric.
Taylor’s theorem for a vector-valued function 𝑓 : Ω ⊆ V → W in Example 3.2
may alternatively be expressed in the form
1
𝑓 (𝑣) = 𝑓 (𝑣 0 ) + [𝜕 𝑓 (𝑣 0 )](𝑣 − 𝑣 0 ) + [𝜕 2 𝑓 (𝑣 0 )](𝑣 − 𝑣 0 ) ⊗2 + · · ·
2
1 𝑑
· · · + [𝜕 𝑓 (𝑣 0 )](𝑣 − 𝑣 0 ) ⊗𝑑 + 𝑅(𝑣 − 𝑣 0 ),
𝑑!
where 𝑣 0 ∈ Ω and k𝑅(𝑣 − 𝑣 0 )k/k𝑣 − 𝑣 0 k 𝑑 → 0 as 𝑣 → 𝑣 0 . Here the ‘Taylor
coefficients’ are linear maps 𝜕 𝑑 𝑓 (𝑣) : V ⊗𝑑 → W. For the important special case
W = R, i.e. a real-valued function 𝑓 : V → R, we have that 𝜕 𝑑 𝑓 (𝑣) : V ⊗𝑑 → R is
a linear functional. So the Taylor coefficients are covariant 𝑑-tensors

𝜕 𝑑 𝑓 (𝑣) ∈ V∗⊗𝑑 .

By our discussions in Example 4.22 and assuming that 𝑓 ∈ 𝐶 ∞ (Ω), we obtain an


element of the dual tensor algebra
Õ∞
1 𝑘
𝜕 𝑓 (𝑣) ≔ 𝜕 𝑓 (𝑣) ∈ T(V)∗
𝑘=0
𝑘!

that we will call the tensor Taylor series.


A slight variation is when V is equipped with an inner product h · , · i. We may
Tensors in computations 149

use the Riesz representation theorem to write, for 𝑓 ∈ 𝐶 𝑑+1 (Ω),


1
𝑓 (𝑣) = 𝑓 (𝑣 0 ) + h𝜕 𝑓 (𝑣 0 ), 𝑣 − 𝑣 0 i + h𝜕 2 𝑓 (𝑣 0 ), (𝑣 − 𝑣 0 ) ⊗2 i + · · ·
2
1 𝑑 ⊗𝑑
· · · + h𝜕 𝑓 (𝑣 0 ), (𝑣 − 𝑣 0 ) i + 𝑅(𝑣 − 𝑣 0 ), (4.92)
𝑑!
where the inner products on V ⊗𝑑 are defined as in (4.64). Here the Taylor coeffi-
cients are contravariant 𝑑-tensors 𝜕 𝑑 𝑓 (𝑣) ∈ V ⊗𝑑 . If 𝑓 is furthermore analytic, we
may rewrite this as an inner product between maps into the tensor algebra:
Õ∞ Õ∞ 
1 𝑘 ⊗𝑘
𝑓 (𝑣) = 𝜕 𝑓 (𝑣 0 ), (𝑣 − 𝑣 0 ) = h𝜕 𝑓 (𝑣 0 ), 𝑆(𝑣 − 𝑣 0 )i, (4.93)
𝑘=0
𝑘! 𝑘=0

where the maps 𝜕 𝑓 : Ω → b


T(V) and 𝑆 : Ω → b
T(V) are given by
Õ∞
1 𝑘 Õ

𝜕 𝑓 (𝑣) = 𝜕 𝑓 (𝑣), 𝑆(𝑣) = 𝑣 ⊗𝑘 .
𝑘=0
𝑘! 𝑘=0

The tensor geometric series 𝑆(𝑣) is clearly a well-defined element of bT(V) whenever
k𝑣k < 1 and the tensor Taylor series 𝜕 𝑓 (𝑣) is well-defined as long as k𝜕 𝑘 𝑓 (𝑣)k ≤ 𝐵
for some uniform bound 𝐵 > 0. We will see this in action when we discuss multipole
expansion in Example 4.45, which is essentially (4.93) applied to 𝑓 (𝑣) = 1/k𝑣k.

Another utility of the universal factorization property is in defining various


vector-valued objects, i.e. taking values in a vector space V. The generality here
is deliberate, intended to capture a myriad of possibilities for V, which could
refer to anything from C as a two-dimensional vector space over R to an infinite-
dimensional function space, a space of matrices or hypermatrices, a Hilbert or
Banach or Fréchet space, a space of linear or multilinear operators, a tensor space
itself, etc.

Example 4.30 (vector-valued objects). For any real vector space V and any set
𝑋, we denote the set of all functions taking values in V by

V𝑋 ≔ { 𝑓 : 𝑋 → V}.

This is a real vector space in the obvious way: 𝜆 𝑓 + 𝜆 ′ 𝑓 ′ is defined to be the


function whose value at 𝑥 ∈ 𝑋 is given by the vector 𝜆 𝑓 (𝑥) + 𝜆 ′ 𝑓 ′(𝑥) ∈ V. It is also
consistent with our notation R𝑋 for the vector space of all real-valued functions on
𝑋 in Section 4.1. Consider the bilinear map

Φ : R𝑋 × V → V𝑋 , ( 𝑓 , 𝑣) ↦→ 𝑓 · 𝑣,

where, for any fixed 𝑓 ∈ R𝑋 and 𝑣 ∈ V, the function 𝑓 · 𝑣 is defined to be


150 L.-H. Lim

𝑓 · 𝑣 : 𝑋 → V, 𝑥 ↦→ 𝑓 (𝑥)𝑣. Applying the universal factorization property, we have


𝜎⊗
R𝑋 × V
❑❑❑
/ R𝑋 ⊗ V (4.94)
❑❑❑
❑❑ 𝐹Φ
Φ ❑❑❑ 
%
V𝑋
with the linearization of Φ given by
Õ𝑟  Õ𝑟
𝐹Φ 𝑓𝑖 ⊗ 𝑣 𝑖 = 𝑓𝑖 · 𝑣 𝑖 .
𝑖=1 𝑖=1
It is straightforward (Ryan 2002, Section 1.5) to show that 𝐹Φ gives an isomorphism
R𝑋 ⊗ V  V𝑋 . The standard practice is to identify the two spaces, that is, replace
 with = and take 𝑓 ⊗ 𝑣 to mean 𝑓 · 𝑣 as defined above.
In reality we are often more interested in various subspaces F(𝑋) ⊆ R𝑋 whose
elements satisfy properties such as continuity, differentiability, integrability, sum-
mability, etc., and we would like to have a corresponding space of vector-valued
functions F(𝑋; V) ⊆ V𝑋 that also have these properties. If V is finite-dimensional,
applying the universal factorization property (4.94) with the subspaces F(𝑋) and
F(𝑋; V) in place of R𝑋 and V𝑋 gives us
F(𝑋; V) = F(𝑋) ⊗ V,
and this already leads to some useful constructions.
Extension of scalars. Any real vector space U has a complexification UC with
elements given by
𝑢1 ⊗ 1 + 𝑢2 ⊗ 𝑖, 𝑢1 , 𝑢2 ∈ U,
and scalar multiplication by 𝑎 + 𝑏𝑖 ∈ C given by
(𝑎 + 𝑖𝑏)(𝑢1 ⊗ 1 + 𝑢2 ⊗ 𝑖) = (𝑎𝑢1 − 𝑏𝑢2 ) ⊗ 1 + (𝑏𝑢1 + 𝑎𝑢2 ) ⊗ 𝑖.
This construction is just
W ⊗ C = WC .

Polynomial matrices. In mechanical engineering, one of the simplest and most


fundamental system is the mass–spring–damper model described by
𝑞(𝜆) = 𝜆2 𝑀 + 𝜆𝐶 + 𝐾,
where 𝑀, 𝐶, 𝐾 ∈ R𝑛×𝑛 represent mass, damping and stiffness respectively. This
a polynomial matrix, which may be viewed as either a matrix of polynomials or a
polynomial with matrix coefficients. This construction is just
R[𝜆] ⊗ R𝑛×𝑛 = (R[𝜆])𝑛×𝑛 = (R𝑛×𝑛 )[𝜆].

𝐶 𝑘 - and 𝐿 𝑝 -vector fields. In (4.14) we viewed the solution to a system of partial


Tensors in computations 151

differential equations as an element of 𝐶 2 (R3 ) b


⊗ 𝐶 1 [0, ∞) ⊗ R3 , and in (4.29) we
viewed an 𝐿 -vector field as an element of 𝐿 (R3 ) ⊗ C2 . These are consequences of
2 2

𝐶 𝑘 (Ω) ⊗ R𝑛 = 𝐶 𝑘 (Ω; R𝑛 ) and 𝐿 𝑝 (Ω) ⊗ C𝑛 = 𝐿 𝑝 (Ω; C𝑛 )


where the latter spaces refer to the sets of vector-valued functions
𝑓 = ( 𝑓1 , . . . , 𝑓𝑛 ) : Ω → R𝑛 or C𝑛
whose components 𝑓𝑖 are 𝑘-times continuously differentiable or 𝐿 𝑝 -integrable on
Ω respectively.
On the other hand, if V is an infinite-dimensional topological vector space, then
the results depend on the choice of topological tensor product and the algebraic
construction above will need to be adapted to address issues of convergence. We
present a few examples involving Banach-space-valued functions, mostly drawn
from Ryan (2002, Chapters 2 and 3). In the following, B denotes a Banach space
with norm k · k.
Absolutely summable sequences. This is a special case of the 𝐿 1 -functions below
but worth a separate statement:
 Õ
∞ 
1 b 1 ∞
𝑙 (N) ⊗ 𝜈 B = 𝑙 (N; B) = (𝑥𝑖 )𝑖=1 : 𝑥𝑖 ∈ B, k𝑥𝑖 k < ∞ .
𝑖=1

If we use b
⊗ 𝜎 in place of b
⊗ 𝜈 , we obtain a larger subspace, as we will see next.
Unconditionally summable sequences. Absolute summability implies uncondi-
tional summability but the converse is generally false when B is infinite-dimen-
sional. We have
 Õ
∞ 
𝑙 1 (N) b
⊗ 𝜎 B = (𝑥𝑖 )∞
𝑖=1 : 𝑥 𝑖 ∈ B, 𝜀 𝑥
𝑖 𝑖 < ∞ for any 𝜀 𝑖 = ±1 .
𝑖=1
Í∞
The last condition is also equivalent to 𝑖=1 𝑥 𝜎(𝑖) < ∞ for any bijection 𝜎 : N → N.
Integrable functions. Let Ω be 𝜎-finite. Then
𝐿 1 (Ω) b
⊗ 𝜈 B = 𝐿 1 (Ω; B) = { 𝑓 : Ω → B : k 𝑓 k 1 < ∞}.
This is ∫called the Lebesgue–Bochner space. Here the 𝐿 1 -norm is defined as
k 𝑓 k 1 = Ω k 𝑓 (𝑥)k d𝑥. In fact the proof of the second half of (4.73) is via

𝐿 1 (𝑋) b
⊗ 𝜈 𝐿 1 (𝑌 ) = 𝐿 1 (𝑋; 𝐿 1 (𝑌 )) = 𝐿 1 (𝑋 × 𝑌 ).

Continuous functions. Let Ω be compact Hausdorff. Then


𝐶(Ω) b
⊗ 𝜎 B = 𝐶(Ω; B) = { 𝑓 : Ω → B : 𝑓 continuous}.
Here continuity is with respect to k 𝑓 k ∞ = sup 𝑥 ∈Ω k 𝑓 (𝑥)k. In fact the proof of the
152 L.-H. Lim

first half of (4.73) is via


𝐶(𝑋) b
⊗ 𝜎 𝐶(𝑌 ) = 𝐶(𝑋; 𝐶(𝑌 )) = 𝐶(𝑋 × 𝑌 ).

Infinite-dimensional diagonal matrices. For finite-dimensional square matrices,


the subspace of diagonal matrices diag(R𝑛×𝑛 ) ≔ span{𝑒𝑖 ⊗ 𝑒𝑖 : 𝑖 = 1, . . . , 𝑛}  R𝑛
is independent of any other structure on R𝑛×𝑛 . For infinite-dimensional matrices,
the closed linear span span{𝑒𝑖 ⊗ 𝑒𝑖 : 𝑖 ∈ N} depends on both the choice of norms
and the choice of tensor products:

 𝑙 1 (N) 1 ≤ 𝑝 ≤ 2,



𝑝 𝑝
diag(𝑙 (N) b
⊗ 𝜈 𝑙 (N))  𝑙 𝑝/2 (N) 2 < 𝑝 < ∞,


 𝑐0 (N) 𝑝 = ∞.

We refer readers to Holub (1970) for other cases with diag(𝑙 𝑝 (N) b ⊗ 𝜈 𝑙 𝑞 (N)), or
b
⊗ 𝜎 in place of b ⊗ 𝜈 , and to Arias and Farmer (1996) for the higher-order case
𝑙 𝑝1 (N) ⊗ · · · ⊗ 𝑙 𝑝𝑑 (N) with 𝑑 > 2.
Partial traces. For separable Hilbert spaces H1 and H2 , the partial traces are the
continuous linear maps
tr1 : (H1 b
⊗ F H2 ) b ⊗F H2 )∗ → H2 b
⊗ 𝜈 (H1 b ⊗F H∗2 ,
tr2 : (H1 b
⊗ F H2 ) b ⊗F H2 )∗ → H1 b
⊗ 𝜈 (H1 b ⊗F H∗1 .
Note that the domain of these maps are trace-class operators on the Hilbert space
H1 b
⊗F H2 by Example 4.19. Choose any orthonormal basis on H1 to obtain a Hilbert
space isomorphism H1 b⊗ F H∗1  𝑙 2 (N × N). Then
(H1 b
⊗ F H2 ) b ⊗F H2 )∗  (H1 b
⊗ 𝜈 (H1 b ⊗F H∗1 ) b ⊗F H2 )∗
⊗ 𝜈 (H2 b
 𝑙 2 (N × N) b ⊗F H∗2 )
⊗ 𝜈 (H2 b
and the last space is the set of infinite matrices of the form
Φ11 Φ12 · · · Φ1 𝑗 · · ·
 
Φ21 Φ22 · · · Φ2 𝑗 · · ·
 . .. .. .. 
 . 
 . . . . , (4.95)
 
 Φ𝑖1 Φ𝑖2 · · · Φ𝑖 𝑗 · · ·
 . .. .. . . 
 .. . . .

where each Φ𝑖 𝑗 : H2 → H∗2 is a Hilbert–Schmidt operator for any 𝑖, 𝑗 ∈ N. Given
a trace-class operator Φ : H1 b
⊗ F H2 → H1 b ⊗F H2 , we may express it in the form
(4.95) and then define the first partial trace as
Õ∞
tr1 (Φ) ≔ Φ𝑖𝑖 .
𝑖=1
The other partial trace tr2 may be similarly defined by reversing the roles of H1
Tensors in computations 153

and H2 . Partial traces are an indispensable tool for working with the density
operators discussed in Example 4.21; we refer readers to Cohen-Tannoudji et al.
(2020a, Chapter III, Complement E, Section 5b) and Nielsen and Chuang (2000,
Sections 2.4.3) for more information.
Our discussion about partial traces contains quite a bit of hand-waving; we will
justify some of it with the next example, where we discuss some properties of
tensor products that we have used liberally above.
Example 4.31 (calculus of tensor products). The universal factorization prop-
erty is particularly useful for establishing other properties (Greub 1978, Chapter I)
of the tensor product operation ⊗ such as how it interacts with itself,
U ⊗ V  V ⊗ U, U ⊗ (V ⊗ W)  (U ⊗ V) ⊗ W  U ⊗ V ⊗ W,
with direct sum ⊕,
U ⊗ (V ⊕ W)  (U ⊗ V) ⊕ (U ⊗ W), (U ⊕ V) ⊗ W  (U ⊗ W) ⊕ (V ⊗ W),
and with intersection ∩,
(V ⊗ W) ∩ (V ′ ⊗ W ′) = (V ∩ V ′) ⊗ (W ∩ W ′),
as well as how it interacts with linear and multilinear maps,
L(U ⊗ V; W)  L(U; L(V; W))  M2 (U, V; W),
with duality V∗ = L(V; R) a special case,
(V ⊗ W)∗  V∗ ⊗ W∗ , V∗ ⊗ W  L(V; W),
and with the Kronecker product,
L(V; W) ⊗ L(V ′; W ′ )  L(V ⊗ V ′; W ⊗ W ′).
Collectively, these properties form a system of calculus for manipulating tensor
products of vector spaces. When pure mathematicians speak of multilinear algebra,
this is often what they have in mind, that is, the subject is less about manipulating
individual tensors and more about manipulating whole spaces of tensors. Many,
but not all, of these properties may be extended to other vector-space-like objects
such as modules and vector bundles, or to vector spaces with additional structures
such as metrics, products and topologies. One needs to exercise some caution in
making such extensions. For instance, if V and W are infinite-dimensional Hilbert
spaces, then
V∗ ⊗ W  B(V; W),
no matter what notion of tensor product or topological tensor product one uses
for ⊗, that is, in this context B(V; W) is not the infinite-dimensional analogue of
L(V; W). As we saw in Examples 4.19 and 4.28, the proper interpretation of ⊗ for
Hilbert spaces is more subtle.
154 L.-H. Lim

To establish these properties, the ‘uniqueness up to vector space isomorphisms’


in Definition 4.26 is the tool of choice. This says that if we have two tensor spaces T
and T ′, both satisfying the universal factorization property for a multilinear map Φ,
𝜎T 𝜎T′
V1 × · · · ×▼V𝑑 /T V1 × · · · ×▲V𝑑 / T′
▼▼▼ ▲▲▲
▼▼▼ 𝐹Φ
▲▲▲ 𝐹′
Φ ▼▼▼▼  Φ ▲▲▲▲  Φ
& &
W W
then we must have T  T ′ as vector spaces. For each of the above properties,
say, U ⊗ V  V ⊗ U, we plug in the vector spaces on both sides of the purported
isomorphism  as our T and W and make a judicious choice for Φ, say, U × V →
V ⊗ U, (𝑢, 𝑣) ↦→ 𝑣 ⊗ 𝑢, to obtain 𝐹Φ as the required isomorphism.
While we have said early on in (4.30) that 𝑢 ⊗ 𝑣 ≠ 𝑣 ⊗ 𝑢, note that here we did not
say U ⊗ V = V ⊗ U but U ⊗ V  V ⊗ U. We sometimes identify isomorphic spaces,
i.e. replace  with =, but this depends on the context. In Navier–Stokes (4.14), we
do not want to say 𝐶 2 (R3 ) b
⊗ 𝐶 1 [0, ∞) = 𝐶 1 [0, ∞) b
⊗ 𝐶 2 (R3 ), since the order of the
variables in a solution 𝑣(𝑥, 𝑦, 𝑧, 𝑡) carries a specific meaning. On the other hand, it
does not matter whether we treat the quantum states of a composite system of two
particles as H1 b
⊗ H2 or H2 b⊗ H1 : either the particles are distinguishable with H1 and
H2 distinct and the difference becomes merely a matter of which particle we label
as first and which as second (completely arbitrary); or, if they are indistinguishable,
then H1 = H2 and the question does not arise.
The reader may remember from our discussions in Section 3.1 about a major
shortcoming of definition ➁, that for each type of tensor there are many different
types of multilinear maps. Definition 4.26 neatly classifies them in that we may
use the calculus of ⊗ described above to reduce any type of multilinear map into
a tensor of type (𝑝, 𝑑 − 𝑝) for some 𝑝 and 𝑑. A noteworthy special case is the
multilinear functionals in Definition 3.3 where we have

M(V∗1 , . . . , V∗𝑝 , V 𝑝+1 , . . . , V𝑑 ; R)  V1 ⊗ · · · ⊗ V 𝑝 ⊗ V∗𝑝+1 ⊗ · · · ⊗ V∗𝑑 .

For instance, recall the matrix–matrix product in Example 3.9 and triple product
trace in Example 3.17, respectively,

M𝑚,𝑛, 𝑝 (𝐴, 𝐵) = 𝐴𝐵, 𝜏𝑚,𝑛, 𝑝 (𝐴, 𝐵, 𝐶) = tr(𝐴𝐵𝐶),

for 𝐴 ∈ R𝑚×𝑛 , 𝐵 ∈ R𝑛× 𝑝 , 𝐶 ∈ R 𝑝×𝑚 . Since

M𝑚,𝑛, 𝑝 ∈ M(R𝑚×𝑛 , R𝑛× 𝑝 ; R𝑚× 𝑝 )  (R𝑚×𝑛 )∗ ⊗ (R𝑛× 𝑝 )∗ ⊗ R𝑚× 𝑝 ,


𝜏𝑚,𝑛, 𝑝 ∈ M(R𝑚×𝑛 , R𝑛× 𝑝 , R 𝑝×𝑚 ; R)  (R𝑚×𝑛 )∗ ⊗ (R𝑛× 𝑝 )∗ ⊗ (R 𝑝×𝑚 )∗ ,

and (R 𝑝×𝑚 )∗  R𝑚× 𝑝 , the bilinear operator M𝑚,𝑛, 𝑝 and the trilinear functional
𝜏𝑚,𝑛, 𝑝 may be regarded as elements of the same tensor space. Checking their
values on the standard bases shows that they are in fact the same tensor. More
Tensors in computations 155

generally, the (𝑑 − 1)-linear map and 𝑑-linear functional given by


M𝑛1 ,𝑛2 ,...,𝑛𝑑 (𝐴1 , 𝐴2 , . . . , 𝐴𝑑−1 ) = 𝐴1 𝐴2 · · · 𝐴𝑑−1 ,
𝜏𝑛1 ,𝑛2 ,...,𝑛𝑑 (𝐴1 , 𝐴2 , . . . , 𝐴𝑑 ) = tr(𝐴1 𝐴2 · · · 𝐴𝑑 ),
for matrices
𝐴1 ∈ R𝑛1 ×𝑛2 , 𝐴2 ∈ R𝑛2 ×𝑛3 , . . . , 𝐴𝑑−1 ∈ R𝑛𝑑−1 ×𝑛𝑑 , 𝐴𝑑 ∈ R𝑛𝑑 ×𝑛1 ,
are one and the same tensor. The tensor 𝜏𝑛1 ,𝑛2 ,...,𝑛𝑑 will appear again in Ex-
ample 4.44 in the form of the matrix product states tensor network.
Definition 4.26 was originally due to Bourbaki (1998) and, as is typical of their
style, provides the most general and abstract definition of a tensor, stated in language
equally abstruse. As a result it is sometimes viewed with trepidation by students
attempting to learn tensor products from standard algebra texts, although it has also
become the prevailing definition of a tensor in modern mathematics (Dummit and
Foote 2004, Kostrikin and Manin 1997, Lang 2002, Vinberg 2003). As we saw in
this section, however, it is a simple and practical idea.

4.4. Tensors in computations III: separability


The notion of separability is central to the utility of definition ➂ in its various
forms. We begin with its best-known manifestation: the separation-of-variables
technique.
We will first present an abstract version to show the tensor perspective and then
apply it to PDEs, finite difference schemes and integro-differential equations.
Example 4.32 (separation of variables: abstract). Let Φ : V1 ⊗ · · · ⊗ V𝑑 →
V1 ⊗ · · · ⊗ V𝑑 be a linear operator of the form
Φ = Φ1 ⊗ 𝐼2 ⊗ · · · ⊗ 𝐼 𝑑 + 𝐼1 ⊗ Φ2 ⊗ · · · ⊗ 𝐼 𝑑 + · · · + 𝐼1 ⊗ 𝐼2 ⊗ · · · ⊗ Φ𝑑 , (4.96)
where 𝐼𝑑 is the identity operator on V𝑑 and ⊗ is the Kronecker product in Ex-
ample 4.11. The separation-of-variables technique essentially transforms a homo-
geneous linear system into a collection of eigenproblems:

 Φ1 (𝑣 1 ) = 𝜆 1 𝑣 1 ,




 Φ2 (𝑣 2 ) = 𝜆 2 𝑣 2 ,

Φ(𝑣 1 ⊗ 𝑣 2 ⊗ · · · ⊗ 𝑣 𝑑 ) = 0 −→ .. (4.97)

 .



 Φ (𝑣 ) = −(𝜆 + · · · + 𝜆 )𝑣 ,
 𝑑 𝑑 1 𝑑−1 𝑑

and Φ being linear, any sum, linear combination or, under the right conditions,
even integral of 𝑣 1 ⊗ 𝑣 2 ⊗ · · · ⊗ 𝑣 𝑑 is also a solution. The constants 𝜆 1 , . . . , 𝜆 𝑑−1
are called separation constants. The technique relies only on one easy fact about
tensor products: for any non-zero 𝑣 ∈ V and 𝑤 ∈ W,
𝑣 ⊗ 𝑤 = 𝑣′ ⊗ 𝑤′ ⇒ 𝑣 = 𝜆𝑣 ′, 𝑤 = 𝜆−1 𝑤 ′ (4.98)
156 L.-H. Lim

for some non-zero 𝜆 ∈ R. More generally, for any 𝑑 non-zero vectors 𝑣 1 ∈ V1 , 𝑣 2 ∈


V2 , . . . , 𝑣 𝑑 ∈ V𝑑 , we have
𝑣 1 ⊗ 𝑣 2 ⊗ · · · ⊗ 𝑣 𝑑 = 𝑣 1′ ⊗ 𝑣 2′ ⊗ · · · ⊗ 𝑣 𝑑′
⇒ 𝑣 1 = 𝜆 1 𝑣 1′ , 𝑣 2 = 𝜆 2 𝑣 2′ , . . . , 𝑣 𝑑 = 𝜆 𝑑 𝑑 𝑑′ , 𝜆 1 𝜆 2 · · · 𝜆 𝑑 = 1,
although this more general version is not as useful for separation of variables, which
just requires repeated applications of (4.98).
Take 𝑑 = 3 for illustration. Dropping subscripts to avoid clutter, we have
(Φ ⊗ 𝐼 ⊗ 𝐼 + 𝐼 ⊗ Ψ ⊗ 𝐼 + 𝐼 ⊗ 𝐼 ⊗ Θ)(𝑢 ⊗ 𝑣 ⊗ 𝑤) = 0 (4.99)
or equivalently Φ(𝑢)⊗ 𝑣 ⊗ 𝑤 +𝑢 ⊗Ψ(𝑣)⊗ 𝑤 +𝑢 ⊗ 𝑣 ⊗Θ(𝑤) = 0. Since Φ(𝑢)⊗(𝑣 ⊗ 𝑤) =
𝑢 ⊗ [−Ψ(𝑣) ⊗ 𝑤 − 𝑣 ⊗ Θ(𝑤)], applying (4.98) we get
Φ(𝑢) = 𝜆𝑢, 𝑣 ⊗ 𝑤 = −𝜆−1 [Ψ(𝑣) ⊗ 𝑤 + 𝑣 ⊗ Θ(𝑤)].
Rearranging the second equation, Ψ(𝑣) ⊗ 𝑤 = 𝑣 ⊗ [−Θ(𝑤) − 𝜆𝑤], and applying
(4.98) again, we get
Ψ(𝑣) = 𝜇𝑣, Θ(𝑤) = −(𝜇 + 𝜆)𝑤.
Thus we have transformed (4.99) into three eigenproblems:

 Φ(𝑢) = 𝜆𝑢,



Ψ(𝑣) = 𝜇𝑣,


 Θ(𝑤) = (−𝜇 − 𝜆)𝑤.

The technique applies widely. The operators Φ𝑖 : V𝑖 → V𝑖 involved can be any
operator in any coordinates. While they are most commonly differential operators
such as
1 𝜕2 𝑓 1 𝜕2 𝑓
Φ𝜃 ( 𝑓 ) = , Φ 𝜎 ( 𝑓 ) =
𝑟 2 sin2 𝜙 𝜕𝜃 2 𝜎 2 + 𝜏 2 𝜕𝜎 2
in Example 4.33, they may well be finite difference or integro-differential operators,
∫ 𝑥
𝜕2 𝑓
Φ𝑘 (𝑢 𝑘,𝑛 ) = 𝑢 𝑘−1,𝑛 + (1 − 2𝑟)𝑢 𝑘,𝑛 + 𝑟𝑢 𝑘+1,𝑛 , Φ 𝑥 ( 𝑓 ) = 𝑎 2 + 𝑏 𝑓,
𝜕𝑥 0
as in Examples 4.34 and 4.35 respectively. The important thing is that they must
take the form30 in (4.96): there should not be any ‘cross-terms’ like Φ1 ⊗ 𝐼2 ⊗ Φ3 ,
e.g. 𝜕 2 𝑓 /𝜕𝑥1 𝜕𝑥3 , because if so then the first variable and third variable will not be
separable, as we will see.

30 One sometimes hears of ‘Kronecker sums’ in the 𝑑 = 2 case: Φ1 ⊕ Φ2 ≔ Φ1 ⊗ 𝐼2 + 𝐼1 ⊗ Φ2 .


This has none of the usual characteristics of a sum: it is not associative, not commutative, and
Φ1 ⊕ Φ2 ⊕ Φ3 ≠ Φ1 ⊗ 𝐼2 ⊗ 𝐼3 + 𝐼1 ⊗ Φ2 ⊗ 𝐼3 + 𝐼1 ⊗ 𝐼2 ⊗ Φ3 regardless of the order ⊕ is performed.
As such we avoid using ⊕ in this sense.
Tensors in computations 157

The abstract generality of Example 4.32 makes the separation-of-variables tech-


nique appear trivial, although as we will see in Examples 4.34 and 4.35, it also
makes the technique widely applicable. Once we restrict to specific classes of prob-
lems like PDEs, there are much deeper results about the technique. Nevertheless,
elementary treatments tend to focus on auxiliary issues such as Sturm–Liouville
theory and leave out what we think are the most important questions: Why and when
does the technique work? Fortunately these questions have been thoroughly invest-
igated and addressed in Koornwinder (1980), Kalnins, Kress and Miller (2018),
Miller (1977) and Schöbel (2015). The explanation is intimately related to tensors
but in the sense of tensor fields in Example 3.13: the metric tensor, Ricci tensor,
Nijenhuis tensor, Weyl tensor all play a role. Unfortunately, the gap between these
results and our brief treatment is too wide for us to convey anything beyond a
glimpse of what is involved.
Example 4.33 (separation of variables: PDEs). The technique of separation of
variables for solving PDEs evidently depends on the choice of coordinates. It works
perfectly for the one-dimensional wave equation in the form
𝜕2 𝑓 𝜕2 𝑓
− = 0. (4.100)
𝜕𝑥 2 𝜕𝑡 2
The standard recipe is to use the ansatz 𝑓 (𝑥, 𝑡) = 𝜑(𝑥)𝜓(𝑡), obtain 𝜑 ′′(𝑥)/𝜑(𝑥) =
𝜓 ′′(𝑡)/𝜓(𝑡), and argue that since the left-hand side is a function of 𝑥 and the right-
hand side of 𝑡, they must both be equal to some separation constant −𝜔2 , and we
obtain two ODEs 𝜑 ′′ + 𝜔2 𝜑 = 0, 𝜓 ′′ + 𝜔2 𝜓 = 0. We may deduce the same thing in
one step from (4.97):
( 2
2 2
𝜕𝑥 𝜑 = −𝜔2 𝜑,
[𝜕𝑥 ⊗ 𝐼 + 𝐼 ⊗ (−𝜕𝑡 )](𝜑 ⊗ 𝜓) = 0 −→
−𝜕𝑡2 𝜓 = 𝜔2 𝜓.
The ODEs have solutions 𝜑(𝑥) = 𝑎1 e 𝜔 𝑥 + 𝑎2 e−𝜔 𝑥 , 𝜓(𝑥) = 𝑎3 e 𝜔𝑡 + 𝑎4 e−𝜔𝑡 if
𝜔 ≠ 0, or 𝜑(𝑥) = 𝑎1 + 𝑎2 𝑥, 𝜓(𝑡) = 𝑎3 + 𝑎4 𝑡 if 𝜔 = 0, and any finite linear
combinations of 𝜑 ⊗ 𝜓 give us solutions for (4.100). Nevertheless, a change of
coordinates 𝜉 = 𝑥 − 𝑡, 𝜂 = 𝑥 + 𝑡 transforms (4.100) into
𝜕2 𝑓
= 0. (4.101)
𝜕𝜉𝜕𝜂
Note that the operator here is 𝜕𝑥 ⊗ 𝜕𝑦 and does not have the required form in (4.96).
Indeed, the solution of (4.101) is easily seen to take the form 𝑓 (𝜉, 𝜂) = 𝜑(𝜉) + 𝜓(𝜂),
so the usual multiplicatively separable ansatz 𝑓 (𝜉, 𝜂) = 𝜑(𝜉)𝜓(𝜂) will not work.
More generally, the same argument works in any dimension 𝑛 to separate the
spatial coordinates 𝑥 = (𝑥1 , . . . , 𝑥 𝑛 ) (which do not need to be Cartesian) from the
temporal ones. Applying (4.97) to the 𝑛-dimensional wave equation,
𝜕2 𝑓
Δ𝑓 − = 0, (4.102)
𝜕𝑡 2
158 L.-H. Lim

we get
(
Δ𝜑 = −𝜔2 𝜑,
[Δ ⊗ 𝐼 + 𝐼 ⊗ (−𝜕𝑡2 )](𝜑 ⊗ 𝜓) = 0 −→
−𝜕𝑡2 𝜓 = 𝜔2 𝜓,
with separation constant −𝜔2 . In the remainder of this example, we will focus on
the first equation, called the 𝑛-dimensional Helmholtz equation,
Δ 𝑓 + 𝜔2 𝑓 = 0, (4.103)
which may also be obtained by taking the Fourier transform of (4.102) in time. For
𝑛 = 2, (4.103) in Cartesian and polar coordinates are given by
𝜕2 𝑓 𝜕2 𝑓 𝜕2 𝑓 1 𝜕 𝑓 1 𝜕2 𝑓
+ + 𝜔2 𝑓 = 0, + + + 𝜔2 𝑓 = 0 (4.104)
𝜕𝑥 2 𝜕𝑦 2 𝜕𝑟 2 𝑟 𝜕𝑟 𝑟 2 𝜕𝜃 2
respectively. Separation of variables works in both cases but gives entirely different
solutions. Applying (4.97) to 𝜕𝑥2 ⊗ 𝐼 + 𝐼 ⊗ (𝜕𝑦2 + 𝜔2 𝐼) gives us
d2 𝜑 d2 𝜓
+ 𝑘 2 𝜑 = 0, + (𝜔2 − 𝑘 2 )𝜓 = 0,
d𝑥 2 d𝑦 2
with separation constant 𝑘 2 and therefore the solution
2 −𝑘 2 )1/2 𝑦 ] 2 −𝑘 2 )1/2 𝑦 ]
𝑓 𝑘 (𝑥, 𝑦) ≔ 𝑎1 ei[𝑘 𝑥+(𝜔 + 𝑎2 ei[−𝑘 𝑥+(𝜔
2 −𝑘 2 )1/2 𝑦 ] 2 −𝑘 2 )1/2 𝑦 ]
+ 𝑎3 ei[𝑘 𝑥−(𝜔 + 𝑎4 ei[−𝑘 𝑥−(𝜔 .
Applying (4.97) to [(𝑟 2 𝜕𝑟2 + 𝜔2 𝑟 2 𝐼) ⊗ 𝐼 + 𝐼 ⊗ 𝜕𝜃2 ](𝜑 ⊗ 𝜓) = 0 gives us
d2 𝜑 d𝜑 d2 𝜓
𝑟2 +𝑟 + (𝜔2 𝑟 2 − 𝑘 2 )𝜑 = 0, + 𝑘 2 𝜓 = 0,
d𝑟 2 d𝑟 d𝜃 2
with separation constant 𝑘 2 and therefore the solution
𝑓 𝑘 (𝑟, 𝜃) ≔ 𝑎1 ei𝑘 𝜃 𝐽𝑘 (𝜔𝑟) + 𝑎2 e−i𝑘 𝜃 𝐽𝑘 (𝜔𝑟) + 𝑎3 ei𝑘 𝜃 𝐽−𝑘 (𝜔𝑟) + 𝑎4 e−i𝑘 𝜃 𝐽−𝑘 (𝜔𝑟)
where 𝐽𝑘 is a Bessel function. Any solution of the two-dimensional Helmholtz
equation in Cartesian coordinates is a sum or integral of 𝑓 𝑘 (𝑥, 𝑦) over 𝑘 and
any solution in polar coordinates is one of 𝑓 𝑘 (𝑟, 𝜃) over 𝑘. Analytic solutions
in different coordinate systems provide different insights. There are exactly two
more such coordinate systems where separation of variables works; we call these
separable coordinates. For 𝑛 = 2, (4.103) has exactly four systems of separable
coordinates: Cartesian, polar, parabolic and elliptic. For 𝑛 = 3, there are exactly
eleven (Eisenhart 1934).
The fundamental result that allows us to deduce these numbers is the Stäckel
condition: the 𝑛-dimensional Helmholtz equation in coordinates 𝑥1 , . . . , 𝑥 𝑛 can be
solved using the separation-of-variables technique if and only if (a) the Euclidean
metric tensor 𝑔 is a diagonal matrices in this coordinate system, and (b) if 𝑔 =
Tensors in computations 159

diag(𝑔11 , . . . , 𝑔𝑛𝑛 ), then there exists an invertible matrix of the form


 𝑠11 (𝑥1 ) 𝑠12 (𝑥1 ) · · · 𝑠1𝑛 (𝑥1 ) 

 𝑠21 (𝑥2 ) 𝑠22 (𝑥2 ) · · · 𝑠2𝑛 (𝑥2 ) 

𝑆 =  .. .. .. 
 . . . 

𝑠𝑛1 (𝑥 𝑛 ) 𝑠𝑛2 (𝑥 𝑛 ) · · · 𝑠𝑛𝑛 (𝑥 𝑛 )

with
𝑔−1 −1
𝑗 𝑗 = (𝑆 )1 𝑗 , 𝑗 = 1, . . . , 𝑛. (4.105)
Here 𝑆 is called a Stäckel matrix for 𝑔 in coordinates 𝑥1 , . . . , 𝑥 𝑛 ; note that the 𝑖th
row of 𝑆 depends only on the 𝑖th coordinate. The Euclidean metric tensor 𝑔 is a
covariant 2-tensor field that parametrizes the Euclidean inner product at different
points in a coordinate system. As we saw in Example 3.12, any 2-tensor field over
R𝑛 may be represented as an 𝑛 × 𝑛 matrix of functions. Take 𝑛 = 3; the Euclidean
metric tensor on R3 in Cartesian, cylindrical, spherical and parabolic31 coordinates
are
1 0 0 1 0 0
   

𝑔(𝑥, 𝑦, 𝑧) = 0 1 0 , 𝑔(𝑟, 𝜃, 𝑧) = 0 𝑟 2 0 ,
0 0 1 0 0 1
   
1 0 0    2 2 0 0 
 2
𝜎 + 𝜏

𝑔(𝑟, 𝜃, 𝜙) = 0 𝑟 
0  , 𝑔(𝜎, 𝜏, 𝜙) =  0  2
𝜎 +𝜏 2 0  .
0 0 𝑟 sin 𝜃 
2 2  0 0 𝜎 𝜏 2 
2
  
The Stäckel condition is satisfied for these four coordinate systems using the fol-
lowing matrices and their inverses:
0 −1 −1 1 1 1
 
Cartesian 𝑆= 0 1 0  , 𝑆 −1
= 0 1 0 ,
 
1 0 1  −1 −1 0
  
0 1
 − 𝑟12 −1 1
 𝑟2
1 

   
cylindrical 𝑆= 0 1 0 , 𝑆−1 = 0 1 0 ,
   
1 0 1  −1 − 12 0
  𝑟 
1 − 𝑟12 0  1 12 1 
  𝑟 (𝑟 2 sin2 𝜃) 
   
spherical 𝑆= 0 1 − sin12 𝜃  , 𝑆−1 = 0 1 1
2 ,
   sin 𝜃 
0 0 1  0 0 1 
 
𝜎 2 −1 − 𝜎12   21 2 1 1 
  (𝜎 +𝜏 ) (𝜎2 +𝜏 2 ) (𝜎 2 𝜏 2 ) 
 2   2 2 
parabolic 𝑆= 𝜏 1 − 𝜏12  , 𝑆−1 = − (𝜎2𝜏+𝜏 2 ) (𝜎2𝜎+𝜏 2 ) 1
𝜏 2 −1/𝜎 2 

  
0 0 1   0 0 1 
  

31 Given by 𝑥 = 𝜎𝜏 cos 𝜑, 𝑦 = 𝜎𝜏 sin 𝜑, 𝑧 = (𝜎 2 − 𝜏 2 )/2.


160 L.-H. Lim

The matrices on the left are Stäckel matrices for 𝑔 in the respective coordinate
system, that is, the entries in the first row of 𝑆−1 are exactly the reciprocal of the
entries on the diagonal of 𝑔.
What we have ascertained is that the three-dimensional Helmholtz equation can
be solved by separation of variables in these four coordinate systems, without
writing down a single differential equation. In fact, with more effort, one can show
that there are exactly eleven such separable coordinate systems:

(i) Cartesian,
(ii) cylindrical,
(iii) spherical,
(iv) parabolic,
(v) paraboloidal,
(vi) ellipsoidal,
(vii) conical,
(viii) prolate spheroidal,
(ix) oblate spheroidal,
(x) elliptic cylindrical,
(xi) parabolic cylindrical.

These eleven coordinate systems have been thoroughly studied in engineering


(Moon and Spencer 1988, 1961), where they are used for different tasks: (viii) for
modelling radiation from a slender spheroidal antenna, (x) for heat flow in a bar
of elliptic cross section, etc. More generally, the result can be extended to a
Riemannian manifold 𝑀 of arbitrary dimension: a system of local coordinates is
separable if and only if both the Riemannian metric tensor 𝑔 and the Ricci curvature
tensor 𝑅¯ are diagonal matrices in those coordinates,32 and 𝑔 satisfies the Stäckel
condition (Eisenhart 1934). In particular, we now see why separation of variables
does not apply to PDEs with mixed partials like (4.101): when the metric tensor is
diagonal, 𝑔 = diag(𝑔11 , . . . , 𝑔𝑛𝑛 ), the Laplacian takes the form
p
Õ𝑛
1 𝜕 det(𝑔) 𝜕 𝑓
Δ= p (4.106)
𝑖=1 det(𝑔) 𝜕𝑥𝑖 𝑔𝑖𝑖 𝜕𝑥𝑖
and thus does not contain any mixed derivatives. We may use (4.106) to write down

32 We did not need to worry about the Ricci tensor because R𝑛 is a so-called Einstein manifold
¯
where 𝑔 and 𝑅¯ differ by a scalar multiple. See Example 3.13 for a cursory discussion of 𝑔 and 𝑅.
Tensors in computations 161

the Helmholtz equation on R3 in cylindrical, spherical and parabolic coordinates:


𝜕2 𝑓 1 𝜕 𝑓 1 𝜕2 𝑓 𝜕2 𝑓
+ + + + 𝜔2 𝑓 = 0,
𝜕𝑟 2 𝑟 𝜕𝑟 𝑟 2 𝜕𝜃 2 𝜕𝑧2
𝜕2 𝑓 2 𝜕 𝑓 1 𝜕2 𝑓 cos 𝜙 𝜕 𝑓 1 𝜕2 𝑓
+ + + + + 𝜔2 𝑓 = 0,
𝜕𝑟 2 𝑟 𝜕𝑟 𝑟 2 sin2 𝜙 𝜕𝜃 2 𝑟 2 sin2 𝜙 𝜕𝜙 𝑟 2 𝜕𝜙2
 2 
1 𝜕 𝑓 1 𝜕 𝑓 𝜕2 𝑓 1 𝜕 𝑓 1 𝜕2 𝑓
+ + + + + 𝜔2 𝑓 = 0.
𝜎 2 + 𝜏 2 𝜕𝜎 2 𝜎 𝜕𝜎 𝜕𝜏 2 𝜏 𝜕𝜏 𝜎 2 𝜏 2 𝜕𝜙2
It is not so obvious that these are amenable to separation of variables, speaking
to the power of the Stäckel condition. While we have restricted to the Helmholtz
equation for concreteness, the result has been generalized to higher-order semilinear
PDEs (Koornwinder 1980, Theorem 3.8).
Separation of variables is manifestly about tensors in the form of definition ➂
and particularly Definition 4.4, as it involves, in one way or another, a rank-one
tensor 𝜑 ⊗ 𝜓 ⊗ · · · ⊗ 𝜃 or a sum of these. It may have come as a surprise that tensors
in the sense of definition ➀ also play a major role in Example 4.33, but this should
be expected as the tensor transformation rules originally came from a study of the
relations between different coordinate systems.
By our discussion in Example 4.5, whether a function is of continuous variables
or discrete ones, whether the variables are written as arguments or indices, should
not make a substantive difference. Indeed, as we saw in Example 4.32, separation of
variables ought to apply verbatim in cases where the variables are finite or discrete.
Our next example, adapted from Thomas (1995, Example 3.2.2), is one where the
rank-one tensors are of the form 𝑢 = 𝑎 ⊗ 𝑏 with 𝑎 = (𝑎0 , . . . , 𝑎 𝑚 ) ∈ R𝑚+1 and
𝑏 = (𝑏 𝑛 )∞
𝑛=0 ∈ 𝑐 0 (N ∪ {0}).

Example 4.34 (separation of variables: finite difference). Consider the recur-


rence relation

 𝑢 𝑘,𝑛+1 = 𝑟𝑢 𝑘−1,𝑛 + (1 − 2𝑟)𝑢 𝑘,𝑛 + 𝑟𝑢 𝑘+1,𝑛 𝑘 = 1, . . . , 𝑚 − 1,



 𝑢0,𝑛+1 = 0 = 𝑢 𝑚,𝑛+1

(4.107)

 

 𝑘

 𝑢 𝑘,0 = 𝑓 𝑚 𝑘 = 0, 1, . . . , 𝑚,

with 𝑛 = 0, 1, 2, . . . , and 𝑟 > 0 is some fixed constant. This comes from a
forward-time centred space discretization scheme applied to an initial–boundary
value problem for a one-dimensional heat equation on [0, 1], although readers lose
nothing by simply treating this example as one of solving recurrence relations. To
be consistent with our notation, we have written 𝑢 𝑘,𝑛 = 𝑢(𝑥 𝑘 , 𝑡 𝑛 ) instead of the
more common 𝑢 𝑛𝑘 with the time index in superscript. Applying (4.97) to (4.107)
with
Φ𝑘 (𝑢 𝑘,𝑛 ) ≔ 𝑟𝑢 𝑘−1,𝑛 + (1 − 2𝑟)𝑢 𝑘,𝑛 + 𝑟𝑢 𝑘+1,𝑛 , Ψ𝑛 (𝑢 𝑘,𝑛 ) ≔ 𝑢 𝑘,𝑛+1 ,
162 L.-H. Lim

we get

Φ𝑘 (𝑎 𝑘 ) = 𝜆𝑎 𝑘 ,
[Φ𝑘 ⊗ 𝐼 + 𝐼 ⊗ (−Ψ𝑛 )](𝑎 ⊗ 𝑏) = 0 −→
−Ψ𝑛 (𝑏 𝑛 ) = −𝜆𝑏 𝑛 ,
with separation constant 𝜆. We write these out in full:
𝑟𝑎 𝑘−1 + (1 − 2𝑟)𝑎 𝑘 + 𝑟𝑎 𝑘+1 = 𝜆𝑎 𝑘 , 𝑘 = 1, . . . , 𝑚 − 1,
𝑏 𝑛+1 = 𝜆𝑏 𝑛 , 𝑛 = 0, 1, 2, . . . .
The second equation is trivial to solve: 𝑏 𝑛 = 𝜆 𝑛 𝑏0 . Noting that the boundary
conditions 𝑢0,𝑛+1 = 0 = 𝑢 𝑚,𝑛+1 give 𝑎0 = 0 = 𝑎 𝑚 , we see that the first equation is a
tridiagonal eigenproblem:
1 − 2𝑟 𝑟   𝑎1   𝑎1 
    
 𝑟 1 − 2𝑟 𝑟   𝑎2   𝑎2 
    
 𝑟 1 − 2𝑟 𝑟   𝑎3   
   = 𝜆  𝑎3  .
 . .. . ..   .   . 
   ..   .. 
    
 𝑟 1 − 2𝑟  𝑎 𝑚−1  𝑎 𝑚−1 
    
The eigenvalues and eigenvectors of a tridiagonal Toeplitz matrices have well-
known closed-form expressions:
   
2 𝑗𝜋 𝑗 𝑘𝜋
𝜆 𝑗 = 1 − 4𝑟 sin , 𝑎 𝑗𝑘 = sin , 𝑗, 𝑘 = 1, . . . , 𝑚 − 1,
2𝑚 𝑚
where 𝑎 𝑗𝑘 is the 𝑘th coordinate of the 𝑗th eigenvector. Hence we get
Õ
𝑚−1    𝑛  
2 𝑗𝜋 𝑗 𝑘𝜋
𝑢 𝑘,𝑛 = 𝑐 𝑗 𝑏0 1 − 4𝑟 sin sin .
𝑗=1
2𝑚 𝑚

The initial condition 𝑢 𝑘,0 = 𝑓 (𝑘/𝑚), 𝑘 = 0, 1, . . . , 𝑚, may be used to determine


the coefficients 𝑐1 , . . . , 𝑐 𝑚−1 and 𝑏0 .
The von Neumann stability analysis is a special case of such a discrete separation
of variables with 𝑎 𝑘 = ei 𝑗𝑘 𝜋/𝑚 and 𝑏 𝑛 = 𝜆 𝑛 , that is,
𝑢 𝑘,𝑛 = 𝜆 𝑛 ei 𝑗𝑘 𝜋/𝑚 .
Substituting this into the recursion in (4.107) and simplifying gives us
 
2 𝑗𝜋
𝜆 = 1 − 4𝑟 sin ,
2𝑚
and since we require |𝜆| ≤ 1 for stability, we get that 𝑟 ≤ 1/2, an analysis familiar
to readers who have studied finite difference methods.
The next example is adapted from Kostoglou (2005), just to show the range of
applicability of Example 4.32.
Example 4.35 (separation of variables: integro-differential equations). Let us
Tensors in computations 163

consider the following integro-differential equation arising from the study of het-
erogeneous heat transfer:
∫ 𝑥
𝜕𝑓 𝜕2 𝑓
=𝑎 2 +𝑏 𝑓 (𝑦, 𝑡) d𝑦 − 𝑓 , (4.108)
𝜕𝑡 𝜕𝑥 0
with 𝑎, 𝑏 ≥ 0 and 𝑓 : [0, 1] × R → R. Note that at this juncture, if we simply
differentiate both sides to eliminate the integral, we will introduce mixed derivatives
and thus prevent ourselves from using separation of variables. Nevertheless, our
interpretation of separation of variables in Example 4.32 allows for integrals. If we
let
∫ 𝑥
𝜕2 𝑓 𝜕𝑓
Φ𝑥 ( 𝑓 ) ≔ 2
− 𝑓 +𝑏 𝑓 (𝑦, 𝑡) d𝑦, Ψ𝑡 ( 𝑓 ) ≔ − ,
𝜕𝑥 0 𝜕𝑡
then (4.97) gives us

Φ 𝑥 (𝜑) = 𝜆𝜑,
[Φ 𝑥 ⊗ 𝐼 + 𝐼 ⊗ Ψ𝑡 ](𝜑 ⊗ 𝜓) = 0 −→
Ψ𝑡 (𝜓) = −𝜆𝜓,
with separation constant 𝜆. Writing these out in full, we have
∫ 𝑥
d2 𝜑 d𝜓
𝑎 2 + (𝜆 − 1)𝜑 + 𝑏 𝜑(𝑦) d𝑦 = 0, + 𝜆𝜓 = 0.
d𝑥 0 d𝑡
The second equation is easy: 𝜓(𝑡) = 𝑐 e−𝜆𝑡 for an arbitrary constant 𝑐 that could
be determined with an initial condition. Kostoglou (2005) solved the first equation
in a convoluted manner involving Laplace transforms and partial fractions, but this
is unnecessary; at this point it is harmless to simply differentiate and eliminate
the integral. With this, we obtain a third-order homogeneous ODE with constant
coefficients, 𝑎𝜑 ′′′ + (𝜆 − 1)𝜑 ′ + 𝑏𝜑 = 0, whose solution is standard. Physical
considerations show that its characteristic polynomial
𝜆−1
𝑟3 + 𝑏
𝑎𝑟 + 𝑎 =0
must have one real and two complex roots 𝑟 1 , 𝑟 2 ± 𝑖𝑟 3 , and thus the solution is given
by 𝑐1 e𝑟1 𝑥 + e𝑟2 𝑥 (𝑐2 cos 𝑟 3 𝑥 + 𝑐3 sin 𝑟 3 𝑥) with arbitrary constants 𝑐1 , 𝑐2 , 𝑐3 that could
be determined with appropriate boundary conditions.
The last four examples are about exploiting separability in the structure of the
solutions; the next few are about exploiting separability in the structure of the
problems.
Example 4.36 (separable ODEs). The easiest ODEs to solve are probably the
separable ones,
d𝑦
= 𝑓 (𝑥)𝑔(𝑦), (4.109)
d𝑥
with special cases d𝑦/d𝑥 = 𝑓 (𝑥) and d𝑦/d𝑥 = 𝑔(𝑦) when one of the functions is
164 L.-H. Lim

constant. Solutions are, at least in principle, given by direct integration,


∫ ∫
d𝑦
= 𝑓 (𝑥) d𝑥 + 𝑐,
𝑔(𝑦)
even though closed-form expressions still depend on having closed-form integrals.
The effort is not so much in solving (4.109) but in seeking a transformation into
(4.109), that is, given an inseparable ODE
d𝑦
= 𝐾(𝑥, 𝑦),
d𝑥
we would like to find a differentiable Φ : R2 → R2 , (𝑥, 𝑦) ↦→ (𝑢, 𝑣), so that
d𝑣
= 𝑓 (𝑢)𝑔(𝑣). (4.110)
d𝑢
These transformations may be regarded as a special case of the transformation
rules for tensor fields (two-dimensional vector fields in this case) discussed in
Example 3.12. For instance,
h i h i
𝑥 𝑟 cos 𝜃
d𝑦 𝑦 − 𝑥 𝑦 = 𝑟 sin 𝜃 d𝑟
= −−−−−−−−−−−→ = −𝑟 (4.111)
d𝑥 𝑦+𝑥 d𝜃
is the two-dimensional version of the tensor transformation rule in (3.55). There is
a wide variety of scenarios where this is possible:
d𝑦 d𝑣
= 𝑓 (𝑎𝑥 + 𝑏𝑦 + 𝑐), 𝑣 = 𝑎𝑥 + 𝑏𝑦 + 𝑐, = 𝑎 + 𝑏 𝑓 (𝑣),
d𝑥 d𝑥
d𝑦 d𝑣 𝑓 (𝑣) − 𝑣
 
𝑦 𝑦
= 𝑓 , 𝑣= , = ,
d𝑥 𝑥 𝑥 d𝑥 𝑥
∫ ∫
d𝑦 𝑓 (𝑥) d𝑥 d𝑣 𝑓 (𝑥) d𝑥
= 𝑓 (𝑥)𝑦 + 𝑔(𝑥), 𝑣 = 𝑦 e− , = 𝑔(𝑥) e− ,
d𝑥 d𝑥
d𝑛 𝑦
 𝑛−1 
d 𝑦 d𝑛−1 𝑦 d𝑣
= 𝑓 𝑔(𝑥), 𝑣= , = 𝑓 (𝑣)𝑔(𝑥),
d𝑥 𝑛 d𝑥 𝑛−1 d𝑥 𝑛−1 d𝑥
′ ′
  𝑥 − 𝑏 𝑐 + 𝑏𝑐 
d𝑦 d𝑣
   
𝑎𝑥 + 𝑏𝑦 + 𝑐 𝑢  𝑎 ′ 𝑏 − 𝑎𝑏 ′  , 𝑎 + 𝑏𝑣/𝑢
= 𝑓 , =  = 𝑓 ,
d𝑥 𝑎 ′𝑥 + 𝑏 ′ 𝑦 + 𝑐 ′ 𝑣  𝑎 ′ 𝑐 + 𝑎𝑐 ′  d𝑢 𝑎 ′ + 𝑏 ′ 𝑣/𝑢
𝑦 − ′ 
 𝑎 𝑏 − 𝑎𝑏 ′ 
noting that the last case reduces to the second, and if 𝑢 = 𝑥, we do not introduce
a new variable (Walter 1998, Chapter 1). Nevertheless, unlike the case of PDEs
discussed in Example 4.33, there does not appear to be a systematic study of when
transformation to (4.110) is possible. Something along the lines of the Stäckel
condition, but for ODEs, would be interesting and useful.
There is a close analogue of separable ODEs for integral equations with kernels
that satisfy a finite version of (4.81). As we will see below, they have neat and
Tensors in computations 165

simple solutions, but unfortunately such kernels are not too common in practice; in
fact they are regarded as a degenerate case in the study of integral equations. The
following discussion is adapted from Kanwal (1997, Chapter 2).
Example 4.37 (separable integral equations). Let us consider Fredholm integ-
ral equations of the first and second kind:
∫ 𝑏 ∫ 𝑏
𝑔(𝑥) = 𝐾(𝑥, 𝑦) 𝑓 (𝑦) d𝑦, 𝑓 (𝑥) = 𝑔(𝑥) + 𝜆 𝐾(𝑥, 𝑦) 𝑓 (𝑦) d𝑦, (4.112)
𝑎 𝑎
with given constants 𝑎 < 𝑏 and functions 𝑔 ∈ 𝐶 [𝑎, 𝑏] and 𝐾 ∈ 𝐶([𝑎, 𝑏] × [𝑎, 𝑏]).
The goal is to solve for 𝑓 ∈ 𝐶 [𝑎, 𝑏] and, in the second case, also 𝜆 ∈ R. The kernel
𝐾 is said to be degenerate or separable if
Õ
𝑛
𝐾(𝑥, 𝑦) = 𝜑𝑖 (𝑥)𝜓𝑖 (𝑦) (4.113)
𝑖=1

for some 𝜑1 , . . . , 𝜑 𝑛 , 𝜓1 , . . . , 𝜓 𝑛 ∈ 𝐶 [𝑎, 𝑏], assumed known and linearly independ-


ent. We set
∫ 𝑏 ∫ 𝑏 ∫ 𝑏
𝑣𝑖 ≔ 𝜓𝑖 (𝑦) 𝑓 (𝑦) d𝑦, 𝑏𝑖 ≔ 𝜓𝑖 (𝑥)𝑔(𝑥) d𝑥, 𝑎𝑖 𝑗 ≔ 𝜑 𝑗 (𝑥)𝜓𝑖 (𝑥) d𝑥
𝑎 𝑎 𝑎
for 𝑖, 𝑗 = 1, . . . , 𝑛. Plugging (4.113) into (4.112) and integrating with respect to 𝑦
gives us
Õ𝑛 Õ
𝑛
𝑔(𝑥) = 𝑣 𝑗 𝜑 𝑗 (𝑥), 𝑓 (𝑥) = 𝑔(𝑥) + 𝜆 𝑣 𝑗 𝜑 𝑗 (𝑥). (4.114)
𝑗=1 𝑖=1

Multiplying by 𝜓𝑖 (𝑥) and integrating with respect to 𝑥 gives us


Õ
𝑛 Õ
𝑛
𝑏𝑖 = 𝑎𝑖 𝑗 𝑣 𝑗 , 𝑣 𝑖 = 𝑏𝑖 + 𝜆 𝑎𝑖 𝑗 𝑣 𝑗
𝑗=1 𝑗=1

for 𝑖 = 1, . . . , 𝑛, or, in matrix forms,


𝐴𝑣 = 𝑏, (𝐼 − 𝜆𝐴)𝑣 = 𝑏
respectively. For the latter, any 𝜆−1 not an eigenvalue of 𝐴 gives a unique solution
for 𝑣 and thus for 𝑓 by (4.114); for the former, considerations similar to those
for obtaining minimum-norm solutions in numerical linear algebra allow us to
obtain 𝑓 .
It is straightforward to extend the above to Fredholm integro-differential equa-
tions (Wazwaz 2011, Chapter 16) such as
∫ 𝑏
d𝑘 𝑓
= 𝑔(𝑥) + 𝜆 𝐾(𝑥, 𝑦) 𝑓 (𝑦) d𝑦.
d𝑥 𝑘 𝑎
Note that if 𝑘 = 0, this reduces to the second equation in (4.112), but it is of a
166 L.-H. Lim

different nature from the integro-differential considered in Example 4.35, which is


of Volterra type and involves partial derivatives.
In principle, the discussion extends to the 𝐿 2 -kernels in Examples 4.19 and 4.20
with 𝑛 = ∞, although in this case we would just be reformulating Hilbert–Schmidt
operators in terms of infinite-dimensional matrices in 𝑙 2 (N × N); the resulting
equations are no easier to analyse or solve (Kanwal 1997, Chapter 7).
While separable kernels are uncommon in integral equations, they are quite
common in multidimensional integral transforms.
Example 4.38 (separable integral transforms). Whenever one iteratively ap-
plies a univariate integral transform to each variable of a multivariate function
separately, one obtains an iterated integral
∫ 𝑏1 ∫ 𝑏𝑑
··· 𝑓 (𝑥1 , . . . , 𝑥 𝑑 )𝐾1 (𝑥1 , 𝑦 1 ) · · · 𝐾 𝑑 (𝑥 𝑑 , 𝑦 𝑑 ) d𝑥1 · · · d𝑥 𝑑
𝑎1 𝑎𝑑
with respect to the separable kernel 𝐾1 ⊗ · · · ⊗ 𝐾 𝑑 in Example 4.10(iv). We will
just mention two of the most common. The multidimensional Fourier transform is
given by
∫ ∞ ∫ ∞
F( 𝑓 )(𝜉1 , . . . , 𝜉 𝑑 ) = ··· 𝑓 (𝑡 1 , . . . , 𝑡 𝑑 ) e−i 𝜉1 𝑡1 −···−i 𝜉𝑑 𝑡𝑑 d𝑡1 · · · d𝑡 𝑑
−∞ −∞
and the multidimensional Laplace transform is given by
∫ ∞ ∫ ∞
L( 𝑓 )(𝑠1 , . . . , 𝑠 𝑑 ) = ··· 𝑓 (𝑡 1 , . . . , 𝑡 𝑑 ) e−𝑠1 𝑡1 −···−𝑠𝑑 𝑡𝑑 d𝑡 1 · · · d𝑡 𝑑 .
0 0
We point readers to Osgood (2019, Chapter 9) and Bracewell (1986, Chapter 13)
for properties and applications of the former, and Debnath and Dahiya (1989) and
Jaeger (1940) for the latter.
We briefly discuss another closely related notion of separability before our next
examples, which will require it.
Example 4.39 (additive separability). Our interpretation of separability thus far
is multiplicative separability of the form
𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 ) = 𝜑1 (𝑥1 )𝜑2 (𝑥2 ) · · · 𝜑 𝑑 (𝑥 𝑑 )
and not additive separability of the form
𝑔(𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 ) = 𝜑1 (𝑥1 ) + 𝜑2 (𝑥2 ) + · · · + 𝜑 𝑑 (𝑥 𝑑 ). (4.115)
However, as we saw in the solution of the one-dimensional wave equation in Ex-
ample 4.33, what is multiplicatively separable in one coordinate system is additively
separable in another. The two notions are intimately related although the exact re-
lation depends on the context. In PDEs, the relation is simple (Kalnins et al. 2018,
Section 3.1.1), at least in principle. If
𝐹(𝐷 𝑑 𝑓 (𝑥), 𝐷 𝑑−1 𝑓 (𝑥), . . . , 𝐷 𝑓 (𝑥), 𝑓 (𝑥), 𝑥) = 0
Tensors in computations 167

has a multiplicatively separable solution, then the transformation 𝑔(𝑥) ≔ e 𝑓 (𝑥)


gives us a PDE
𝐺(𝐷 𝑑 𝑔(𝑥), 𝐷 𝑑−1 𝑔(𝑥), . . . , 𝐷𝑔(𝑥), 𝑔(𝑥), 𝑥) = 0
with an additively separable solution of the form (4.115). The Stäckel condition in
Example 4.33 was in fact originally proved for an additively separable solution of
the form (4.115) for the Hamilton–Jacobi equation.
We describe a readily checkable condition due to Scheffé (1970) for deciding
additive separability, which is important in statistics, particularly in generalized ad-
ditive models (Hastie and Tibshirani 1990) and in the analysis of variance (Scheffé
1999, Section 4.1). Given any 𝐶 2 -function 𝑓 : Ω ⊆ R𝑑 → R, there exists a trans-
formation 𝜑0 : R → R so that 𝑔 = 𝜑0 ◦ 𝑓 has a decomposition of the form (4.115)
if and only if its Hessian takes the form
∇2 𝑓 (𝑥) = ℎ( 𝑓 (𝑥))∇ 𝑓 (𝑥)∇ 𝑓 (𝑥)T
for some function ℎ : R → R. In this case
∫ ∫ ∫ ∫
− ℎ(𝑡) d𝑡 ℎ(𝑡) d𝑡 𝜕𝑓
𝜑0 (𝑡) = 𝑐 e d𝑡 + 𝑐0 , 𝜑𝑖 (𝑡) = 𝑐 e− (𝑡) d𝑡 + 𝑐𝑖 ,
𝜕𝑥𝑖
for 𝑖 = 1, . . . , 𝑑, with constants 𝑐 > 0 and 𝑐0 , 𝑐1 , . . . , 𝑐 𝑑 ∈ R arising from the
indefinite integrals. For instance, the condition may be used to verify that
p p
𝑓 (𝑥, 𝑦, 𝑧) = 𝑥 (1 − 𝑦 2 )(1 − 𝑧2 ) + 𝑦 (1 − 𝑥 2 )(1 − 𝑧2 )
p
+ 𝑧 (1 − 𝑥 2 )(1 − 𝑦 2 ) − 𝑥𝑦𝑧,
𝑎 + 𝑏 + 𝑐 + 𝑑 + 𝑎𝑏𝑐 + 𝑎𝑏𝑑 + 𝑎𝑐𝑑 + 𝑏𝑐𝑑
𝑓 (𝑎, 𝑏, 𝑐, 𝑑) = ,
1 + 𝑎𝑏 + 𝑎𝑐 + 𝑎𝑑 + 𝑏𝑐 + 𝑏𝑑 + 𝑐𝑑 + 𝑎𝑏𝑐𝑑
are both additively separable with 𝜑𝑖 (𝑡) = sin−1 (𝑡), 𝑖 = 0, 1, 2, 3, in the first case
and 𝜑𝑖 (𝑡) = log(1 + 𝑡) − log(1 − 𝑡), 𝑖 = 0, 1, 2, 3, 4, in the second.
The next two examples show that it is sometimes more natural to formulate
separability in the additive form.
Example 4.40 (separable convex optimization). Let 𝐴 ∈ R𝑚×𝑛 , 𝑏 ∈ R𝑚 and
𝑐 ∈ R𝑛 . We consider the linear programming problem
minimize 𝑐T 𝑥
LP
subject to 𝐴𝑥 ≤ 𝑏, 𝑥 ≥ 0
and integer linear programming problem
minimize 𝑐T 𝑥
ILP
subject to 𝐴𝑥 ≤ 𝑏, 𝑥 ≥ 0, 𝑥 ∈ Z𝑛
where, as is customary in optimization, inequalities between vectors are interpreted
in a coordinate-wise sense. As we will be discussing complexity and will assume
168 L.-H. Lim

finite bit length inputs with 𝐴, 𝑏, 𝑐 having rationals entries, by clearing denominat-
ors, there is no loss of generality in assuming that they have integer entries, that is,
𝐴 ∈ Z𝑚×𝑛 , 𝑏 ∈ Z𝑚 , 𝑐 ∈ Z𝑛 .
LP is famously solvable in polynomial time to arbitrary 𝜀-relative accuracy
(Khachiyan 1979), with its time complexity improved over the years, notably in
Karmarkar (1984) and Vaidya (1990), to its current bound in Cohen, Lee and Song
(2019) that is essentially in terms of 𝜔, the exponent of matrix multiplication we saw
in Example 3.9. The natural question is: Polynomial in what? The aforementioned
complexity bounds invariably involve the bit length of the input,
𝑚 Õ
Õ 𝑛 Õ
𝑚
𝐿≔ log2 (|𝑎𝑖 𝑗 | + 1) + log2 (|𝑏𝑖 | + 1) + log2 (𝑚𝑛) + 1,
𝑖=1 𝑗=1 𝑖=1

but ideally we want an algorithm that runs in time polynomial in the size of the
structure, rather than the size of the actual numbers involved. This is called a
strongly polynomial-time algorithm. Whether it exists for LP is still famously
unresolved (Smale 1998, Problem 9), although a result of Tardos (1986) shows that
it is plausible: there is a polynomial-time algorithm for LP whose time complexity
is independent of the vectors 𝑏 and 𝑐 and depends only on 𝑚, 𝑛 and the largest
subdeterminant33 of 𝐴,
Δ ≔ max{|det 𝐴 𝜎×𝜏 | : 𝜎 ⊆ [𝑚], 𝜏 ⊆ [𝑛]}.
Note that 𝐿 depends on 𝐴, 𝑏, 𝑐 and Δ only on 𝐴.
We will now assume that we have a box constraint 0 ≤ 𝑥 ≤ 1 in our LP and
ILP. In this case ILP becomes a zero-one ILP with 𝑥 ∈ {0, 1}𝑛 ; such an ILP is
also known to have time complexity that does not depend on 𝑏 and 𝑐 although
it generally still depends on the entries of 𝐴 and not just Δ (Frank and Tardos
1987). We will let the time complexity of the box-constrained LP be LP(𝑚, 𝑛, Δ)
and let that of the zero-one ILP be ILP(𝑚, 𝑛, 𝐴). The former has time complexity
polynomial in 𝑚, 𝑛, Δ by Tardos (1986); the latter is polynomial-time if 𝐴 is totally
unimodular, i.e. when Δ = 1 (Schrijver 1986, Chapter 19).
Observe that the linear objective
𝑐T 𝑥 = 𝑐 1 𝑥 1 + · · · + 𝑐 𝑛 𝑥 𝑛
is additively separable. It turns out that this separability, not so much that it is
linear, is the key to polynomial-time solvability. Consider an objective function
𝑓 : R𝑛 → R of the form
𝑓 (𝑥) = 𝑓1 (𝑥1 ) + · · · + 𝑓𝑛 (𝑥 𝑛 ),
where 𝑓1 , . . . , 𝑓𝑛 : R → R are all convex. Note that if we set 𝑓𝑖 (𝑥𝑖 ) = 𝑐𝑖 𝑥𝑖 , which is
33 Recall from Example 4.5 that a matrix is a function 𝐴 : [𝑚] × [𝑛] → R and here 𝐴 𝜎, 𝜏 is the
function restricted to the subset 𝜎 × 𝜏 ⊆ [𝑚] × [𝑛]. Recall from page 14 that a non-square
determinant is identically zero.
Tensors in computations 169

linear and therefore convex, we recover the linear objective. A surprising result of
Hochbaum and Shanthikumar (1990) shows that the separable convex programming
problem
minimize 𝑓1 (𝑥1 ) + · · · + 𝑓𝑛 (𝑥 𝑛 )
SP
subject to 𝐴𝑥 ≤ 𝑏, 0 ≤ 𝑥 ≤ 1
and its zero-one variant
minimize 𝑓1 (𝑥1 ) + · · · + 𝑓𝑛 (𝑥 𝑛 )
ISP
subject to 𝐴𝑥 ≤ 𝑏, 𝑥 ∈ {0, 1}𝑛
are solvable to 𝜀-accuracy in time complexities
 
𝛽
SP(𝑚, 𝑛, Δ) = log2 LP(𝑚, 8𝑛2 Δ, Δ),
2𝜀
 
𝛽
ISP(𝑚, 𝑛, 𝐴) = log2 LP(𝑚, 8𝑛2 Δ, Δ) + ILP(𝑚, 4𝑛2 Δ, 𝐴 ⊗ 1T4𝑛Δ )
2𝑛Δ
respectively. Here 1𝑛 ∈ R𝑛 denotes the vector of all ones, ⊗ the Kronecker product,
and 𝛽 > 0 is some constant. A consequence is that SP is always polynomial-time
solvable and ISP is polynomial-time solvable for totally unimodular 𝐴. The latter is
a particularly delicate result, and even a slight deviation yields an NP-hard problem
(Baldick 1995). While the objective 𝑓 = 𝑓1 + · · · + 𝑓𝑛 is convex, we remind readers
of Example 3.16: it is a mistake to think that all convex optimization problems
have polynomial-time algorithms.
Note that we could have stated the results above in terms of multiplicatively
separable functions: if 𝑔1 , . . . , 𝑔𝑛 : R → R++ are log convex, and 𝑔 : R𝑛 → R++ is
defined by 𝑔(𝑥1 , . . . , 𝑥 𝑛 ) = 𝑔1 (𝑥1 ) · · · 𝑔𝑛 (𝑥 𝑛 ), then SP and ISP are equivalent to
minimize log 𝑔(𝑥1 , . . . , 𝑥 𝑛 ) minimize log 𝑔(𝑥1 , . . . , 𝑥 𝑛 )
subject to 𝐴𝑥 ≤ 𝑏, 0 ≤ 𝑥 ≤ 1, subject to 𝐴𝑥 ≤ 𝑏, 𝑥 ∈ {0, 1}𝑛 ,
although it would be unnatural to discuss LP and ILP in these forms.
We will discuss another situation where additive separability arises.
Example 4.41 (separable Hamiltonians). The Hamilton equations
d𝑝 𝑖 𝜕𝐻 d𝑥𝑖 𝜕𝐻
=− (𝑥, 𝑝), =− (𝑥, 𝑝), 𝑖 = 1, . . . , 𝑛, (4.116)
d𝑡 𝜕𝑥𝑖 d𝑡 𝜕 𝑝𝑖
are said to be separable if the Hamiltonian 𝐻 : R𝑛 ×R𝑛 → R is additively separable:
𝐻(𝑥, 𝑝) = 𝑉(𝑥) + 𝑇 (𝑝), (4.117)
that is, the kinetic energy 𝑇 : R𝑛 → R depends only on momentum 𝑝 = (𝑝 1 , . . . , 𝑝 𝑛 )
and the potential energy 𝑉 : R𝑛 → R depends only on position 𝑥 = (𝑥1 , . . . , 𝑥 𝑛 ),
a common scenario. In this case the system (4.116) famously admits a finite
difference scheme that is both explicit, that is, iteration depends only on quantities
170 L.-H. Lim

already computed in a previous step, and symplectic, that is, it conforms to the
tensor transformation rules with change-of-coordinates matrices from Sp(2𝑛, R)
on page 22. Taking 𝑛 = 1 for illustration, the equations in (4.116) are
d𝑝 d𝑥
= −𝑉 ′(𝑥), = −𝑇 ′(𝑝),
d𝑡 d𝑡
and with backward Euler on the first, forward Euler on the second, we obtain the
finite difference scheme
𝑝 (𝑘+1) = 𝑝 (𝑘) + 𝑉 ′(𝑥 (𝑘+1) )Δ𝑡, 𝑥 (𝑘+1) = 𝑥 (𝑘) − 𝑇 ′(𝑥 (𝑘) )Δ𝑡,
which is easily seen to be explicit as 𝑥 (𝑘+1) may be computed before 𝑝 (𝑘+1) ; to show
that it is a symplectic integrator requires a bit more work and the reader may consult
Stuart and Humphries (1996, pp. 580–582). Note that we could have written (4.117)
in an entirely equivalent multiplicatively separable form with e𝐻 (𝑥, 𝑝) = e𝑉 (𝑥) e𝑇 ( 𝑝) ,
although our subsequent discussion would be somewhat awkward.
We now present an example with separability structures in both problem and
solution and where both additive and multiplicative separability play a role.
Example 4.42 (one-electron and Hartree–Fock approximations). The time-
dependent Schrödinger equation for a system of 𝑑 particles in R3 is
 2 
𝜕 ℏ
𝑖ℏ 𝑓 (𝑥, 𝑡) = − Δ + 𝑉(𝑥) 𝑓 (𝑥, 𝑡).
𝜕𝑡 2𝑚
Here 𝑥 = (𝑥1 , . . . , 𝑥 𝑑 ) ∈ R3𝑛 represents the positions of the 𝑑 particles, 𝑉 a real-
valued function representing potential, and
Δ = Δ1 + Δ2 + · · · + Δ𝑑
with each Δ𝑖 : 𝐿 2 (R3 ) → 𝐿 2 (R3 ) a copy of the Laplacian on R3 corresponding to
the 𝑖th particle. Note that these do not need to be in Cartesian coordinates. For
instance, we might have
1 𝜕 1 1 𝜕2
   
𝜕 𝜕 𝜕
Δ𝑖 = 2 𝑟 𝑖2 + 2 sin 𝜃 𝑖 + 2 2
𝑟 𝑖 𝜕𝑟 𝑖 𝜕𝑟 𝑖 𝑟 𝑖 sin 𝜃 𝑖 𝜕𝜃 𝑖 𝜕𝜃 𝑖 𝑟 𝑖 sin 𝜃 𝑖 𝜕𝜙2𝑖
with 𝑥𝑖 = (𝑟 𝑖 , 𝜃 𝑖 , 𝜙𝑖 ), and 𝑖 = 1, . . . , 𝑑.
We will drop the constants, which are unimportant to our discussions, and just
keep their signs:
(−Δ + 𝑉) 𝑓 − 𝑖𝜕𝑡 𝑓 = 0. (4.118)
Separation of variables (4.97) applies to give us

(−Δ + 𝑉)𝜑 = 𝐸 𝜑,
(−Δ + 𝑉) ⊗ 𝐼 + 𝐼 ⊗ (−𝑖𝜕𝑡 ) −→
−𝑖𝜕𝑡 𝜓 = −𝐸𝜓,
where we have written our separation constant as −𝐸. The second equation is
Tensors in computations 171

trivial to solve, 𝜓(𝑡) = e−i𝐸𝑡 , and as (4.118) is linear, the solution 𝑓 is given by a
linear combination of 𝜓 ⊗ 𝜑 over all possible values of 𝐸 and our main task is to
determine 𝜑 and 𝐸 from the first equation, called the time-independent Schrödinger
equation for 𝑑 particles. Everything up to this stage is similar to our discussion for
the wave equation in Example 4.33. The difference is that we now have an extra 𝑉
term: (𝐸, 𝜑) are an eigenpair of −Δ + 𝑉.
The motivation behind the one-electron, Hartree–Fock and other approximations
is as follows. If the potential 𝑉 is additively separable,

𝑉(𝑥) = 𝑉1 (𝑥1 ) + 𝑉2 (𝑥2 ) + · · · + 𝑉𝑑 (𝑥 𝑑 ), (4.119)

then the eigenfunction 𝜑 is multiplicatively separable,

𝜑(𝑥) = 𝜑1 (𝑥1 )𝜑2 (𝑥2 ) · · · 𝜑 𝑑 (𝑥 𝑑 ),

and the eigenvalue 𝐸 is additively separable,

𝐸 = 𝐸1 + 𝐸2 + · · · + 𝐸 𝑑 .

The reason is that with (4.119) we have


Õ
𝑑
(−Δ𝑖 + 𝑉𝑖 )𝜑 − 𝐸 𝜑 = 0,
𝑖=1

and we may apply (4.97) to get

[(−Δ1 + 𝑉1 ) ⊗ 𝐼 ⊗ · · · ⊗ 𝐼 + 𝐼 ⊗ (−Δ2 + 𝑉2 ) ⊗ · · · ⊗ 𝐼
+ 𝐼 ⊗ · · · ⊗ 𝐼 ⊗ (−Δ𝑑 + 𝑉𝑑 − 𝐸)](𝜑1 ⊗ 𝜑2 ⊗ · · · ⊗ 𝜑 𝑑 ) = 0

 (−Δ1 + 𝑉1 )𝜑1 = 𝐸 1 𝜑1 ,




 (−Δ2 + 𝑉2 )𝜑2 = 𝐸 2 𝜑2 ,

−→ .. (4.120)

 .



 (−Δ + 𝑉 )𝜑 = (𝐸 − 𝐸 − · · · − 𝐸 )𝜑 .
 𝑑 𝑑 𝑑 1 𝑑−1 𝑑

Note that we may put the −𝐸 𝜑 term with any −Δ𝑖 + 𝑉𝑖 ; we just chose 𝑖 = 𝑑 for
convenience. If we write 𝐸 𝑑 ≔ 𝐸 − 𝐸 1 − · · · − 𝐸 𝑑−1 and 𝜑 = 𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 , we
obtain the required expressions. So (4.120) transforms a 𝑑-particle Schrödinger
equation into 𝑑 one-particle Schrödinger equations.
If one could solve the 𝑑-particle Schrödinger equation, then one could in principle
determine all chemical and physical properties of atoms and molecules, so the above
discussion seems way too simple, and indeed it is: the additivity in (4.119) can only
happen if the particles do not interact. In this context an operator of the form in
(4.96) is called a non-interacting Hamiltonian, conceptually useful but unrealistic.
It is, however, the starting point from which various approximation schemes are
developed. To see the necessity of approximation, consider what appears to be a
172 L.-H. Lim

slight deviation from (4.119). Let 𝑉 include only pairwise interactions:


Õ
𝑑 Õ
𝑉(𝑥) = 𝑉𝑖 (𝑥𝑖 ) + 𝑉𝑖 𝑗 (𝑥𝑖 , 𝑥 𝑗 ),
𝑖=1 𝑖< 𝑗

that is, no higher-order terms of the form 𝑉𝑖 𝑗𝑘 (𝑥𝑖 , 𝑥 𝑗 , 𝑥 𝑘 ), and we may even fix
𝑉𝑖 𝑗 (𝑥𝑖 , 𝑥 𝑗 ) = 1/k𝑥𝑖 − 𝑥 𝑗 k. But the equation (−Δ + 𝑉)𝜑 = 𝐸 𝜑 then becomes
computationally intractable in multiple ways (Whitfield, Love and Aspuru-Guzik
2013).
The one-electron approximation and Hartree–Fock approximation are based on
the belief that

𝜑(𝑥) ≈ 𝜑1 (𝑥1 ) · · · 𝜑 𝑑 (𝑥 𝑑 ),
𝑉(𝑥) ≈ 𝑉1 (𝑥1 ) + · · · + 𝑉𝑑 (𝑥 𝑑 ) ⇒
𝐸 ≈ 𝐸1 + · · · + 𝐸 𝑑 ,
with ‘≈’ interpreted differently and with different tools: the former uses perturb-
ation theory and the latter calculus of variations. While these approximation
methods are only tangentially related to the topic of this section, they do tell us
how additive and multiplicative separability can be approximated and thus we will
briefly describe the key ideas. Our discussion below is based on Fischer (1977),
Faddeev and Yakubovskiı̆ (2009, Chapters 33, 34, 50, 51) and Hannabuss (1997,
Chapter 12).
In most scenarios, 𝑉(𝑥) will have an additively separable component 𝑉1 (𝑥1 ) +
· · · + 𝑉𝑑 (𝑥 𝑑 ) and an additively inseparable component comprising the higher-order
interactions 𝑉𝑖 𝑗 (𝑥𝑖 , 𝑥 𝑗 ), 𝑉𝑖 𝑗𝑘 (𝑥𝑖 , 𝑥 𝑗 , 𝑥 𝑘 ), . . . . We will write
Õ
𝑑
𝐻0 ≔ (−Δ𝑖 + 𝑉𝑖 ), 𝐻1 ≔ −Δ + 𝑉 = 𝐻0 + 𝑊,
𝑖=1

with 𝑊 accounting for the inseparable terms. The one-electron approximation


method uses perturbation theory to recursively add correction terms to the eigen-
pairs of 𝐻0 to obtain those of 𝐻1 . Ignoring rigour, the method is easy to describe.
Set 𝐻 𝜀 ≔ (1 − 𝜀)𝐻0 + 𝜀𝑊 and plug the power series
𝜆 𝜀 = 𝜆(0) + 𝜀𝜆(1) + 𝜀 2 𝜆(2) + · · · , 𝜓 𝜀 = 𝜓 (1) + 𝜀𝜓 (1) + 𝜀 2 𝜓 (2) + · · ·
into 𝐻 𝜀 𝜓 𝜀 = 𝜆 𝜀 𝜓 𝜀 to see that
𝐻0 𝜓 (0) = 𝜆(0) 𝜓 (0) , (4.121)
(1) (0) (0) (1) (1) (0)
𝐻0 𝜓 + 𝑊𝜓 =𝜆 𝜓 +𝜆 𝜓 ,
..
.
𝐻0 𝜓 (𝑘) + 𝑊𝜓 (𝑘−1) = 𝜆(0) 𝜓 (𝑘) + 𝜆(1) 𝜓 (𝑘−1) + · · · + 𝜆(𝑘) 𝜓 (0) . (4.122)
By (4.121), 𝜆(0) and 𝜓 (0) form an eigenpair of 𝐻0 . The subsequent terms 𝜆(𝑘) and
𝜓 (𝑘) are interpreted as the 𝑘th-order corrections to the eigenpair of 𝐻0 , and as 𝜆(𝑘)
Tensors in computations 173

and 𝜓 (𝑘) may be recursively calculated from (4.122), the hope is that
𝜆(0) + 𝜆(1) + 𝜆(2) + · · · , 𝜓 (0) + 𝜓 (1) + 𝜓 (2) + · · ·
would converge to the desired eigenpair of 𝐻1 . For instance, if 𝐻0 has an orthonor-
mal basis of eigenfunctions {𝜓𝑖 : 𝑖 ∈ N} with eigenvalues {𝜆 𝑖 : 𝑖 ∈ N} and 𝜆 𝑗 is a
simple eigenvalue, then (Hannabuss 1997, Theorem 12.4.3)
Õ h𝑊𝜓 𝑗 , 𝜓𝑖 i Õ |h𝑊𝜓 𝑗 , 𝜓𝑖 i| 2
𝜆(1)
𝑗 = h𝑊𝜓 𝑗 , 𝜓 𝑗 i, 𝜓 (1)
𝑗 = 𝜓 𝑖 , 𝜆 (2)
𝑗 = .
𝑖≠ 𝑗
𝜆 𝑗 − 𝜆𝑖 𝑖≠ 𝑗
𝜆 𝑗 − 𝜆𝑖

Whereas one-electron approximation considers perturbation of additive separ-


ability in the problem (the Hamiltonian), Hartree–Fock approximation considers
variation of multiplicative separability in the solution (the wave function). Ob-
serve that the non-linear functional given by the Rayleigh quotient E(𝜑) = h(−Δ +
𝑉)𝜑, 𝜑i/k𝜑k 2 is stationary, i.e. 𝛿E = 0, if and only if (−Δ + 𝑉)𝜑 = 𝐸 𝜑, that is, its
Euler–Lagrange equation gives the Schrödinger equation and its Lagrange multi-
plier gives the eigenvalue 𝐸. The Hartree–Fock approximation seeks stationarity
under the additional condition of multiplicative separability, i.e. 𝜑 = 𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 .
If the non-linear functional
E(𝜑1 , . . . , 𝜑 𝑑 ) = h(−Δ + 𝑉)𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 , 𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 i
with constraints k𝜑1 k 2 = · · · = k𝜑 𝑑 k 2 = 1 is stationary with respect to variations
in 𝜑1 , . . . , 𝜑 𝑑 , i.e. 𝛿L = 0, where
L(𝜑1 , . . . , 𝜑 𝑑 , 𝜆 1 , . . . , 𝜆 𝑑 ) = E(𝜑1 , . . . , 𝜑 𝑑 ) − 𝜆 1 k𝜑1 k 2 − · · · − 𝜆 𝑑 k𝜑 𝑑 k 2
is the Lagrangian with Lagrange multipliers 𝜆 1 , . . . , 𝜆 𝑑 , then the Euler–Lagrange
equations give us
 Õ∫ 
−Δ𝑖 + |𝜑 𝑗 (𝑦)| 2 𝑉(𝑥, 𝑦) d𝑦 𝜑𝑖 = 𝜆 𝑖 𝜑𝑖 , 𝑖 = 1, . . . , 𝑑. (4.123)
𝑗≠𝑖 R3

These equations make physical sense. For a fixed 𝑖, (4.123) is the Schrödinger
equation for particle 𝑖 in a potential field due to the charge of particle 𝑗; this charge
is spread over space with density |𝜑 𝑗 | 2 , and we have to sum over the potential
fields created by all particles 𝑗 = 1, . . . , 𝑖 − 1, 𝑖 + 1, . . . , 𝑑. While the system of
coupled integro-differential equations (4.123) generally would not have an analytic
solution like the one in Example 4.35, it readily lends itself to numerical solution
via a combination of quadrature and finite difference (Fischer 1977, Chapter 6,
Section 7).
We have ignored the spin variables in the last and the next examples, as spin has
already been addressed in Examples 4.7 and would only be an unnecessary distrac-
tion here. So, strictly speaking, these examples are about Hartree approximation
(Hartree 1928), i.e. no spin, and not Hartree–Fock approximation (Fock 1930), i.e.
with spin.
174 L.-H. Lim

Example 4.43 (multiconfiguration Hartree–Fock). We will continue our dis-


cussion in the last example but with some simplifications. We let 𝑑 = 2 and
𝑉(𝑥, 𝑦) = 𝑉(𝑦, 𝑥). We emulate the Hartree–Fock approximation for the time-
independent Schrödinger equation (−Δ + 𝑉) 𝑓 = 𝐸 𝑓 but now with an ansatz of the
form
𝑓 = 𝑎1 𝜑1 ⊗ 𝜑1 + 𝑎2 𝜑2 ⊗ 𝜑2, (4.124)
where 𝜑1 , 𝜑2 ∈ 𝐿 2 (R3 ) are orthonormal and 𝑎 = (𝑎1 , 𝑎2 ) ∈ R2 is a unit vector,
that is, we consider the non-linear functional
E(𝜑1 , 𝜑2 , 𝑎1 , 𝑎2 ) = 𝑎21 h(−Δ + 𝑉)𝜑1 ⊗ 𝜑1 , 𝜑1 ⊗ 𝜑1 i
+ 2𝑎1 𝑎2 h(−Δ + 𝑉)𝜑1 ⊗ 𝜑1 , 𝜑2 ⊗ 𝜑2 i
+ 𝑎22 h(−Δ + 𝑉)𝜑2 ⊗ 𝜑2 , 𝜑2 ⊗ 𝜑2 i
with constraints k𝜑1 k 2 = k𝜑2 k 2 = 1, h𝜑1 , 𝜑2 i = 0, k𝑎k 2 = 1. Introducing Lagrange
multipliers 𝜆 11 , 𝜆 12 , 𝜆 22 and 𝜆 for the constraints, we obtain the Lagrangian
L(𝜑1 , 𝜑2 , 𝑎1 , 𝑎2 , 𝜆 11 , 𝜆 12 , 𝜆 22 , 𝜆)
= E(𝜑1 , 𝜑2 , 𝑎1 , 𝑎2 ) + 𝜆 11 k𝜑1 k 2 + 𝜆 12 h𝜑1 , 𝜑2 i + 𝜆 22 k𝜑2 k 2 − 𝜆k𝑎k 2 .
For 𝑖, 𝑗 ∈ {1, 2} we write

𝑏𝑖 𝑗 = h(−Δ + 𝑉)𝜑𝑖 ⊗ 𝜑𝑖 , 𝜑 𝑗 ⊗ 𝜑 𝑗 i, 𝑐𝑖 𝑗 (𝑥) = 𝜑𝑖 (𝑥)𝜑 𝑗 (𝑦)𝑉(𝑥, 𝑦) d𝑦.
R3
The stationarity conditions ∇𝑎 L = 0 and 𝛿L = 0 give us
  
𝑏11 − 𝜆 𝑏12 𝑎1
=0
𝑏12 𝑏22 − 𝜆 𝑎2
and     
−Δ1 𝜑1 𝑐11 (𝑥) − 𝜆 11 (𝑎2 /𝑎1 )(𝑐12 (𝑥) − 𝜆 12 ) 𝜑1
=
−Δ2 𝜑2 (𝑎1 /𝑎2 )(𝑐12 (𝑥) − 𝜆 12 ) 𝑐22 (𝑥) − 𝜆 22 𝜑2
respectively. What we have described here is a simplified version of the multi-
configuration Hartree–Fock approximation (Fischer 1977, Chapters 3 and 4), and
it may also be viewed as the starting point for a variety of other heuristics, one of
which will be our next example.
We will describe an extension of the ansatz in (4.124) to more summands and to
higher order.
Example 4.44 (tensor networks). Instead of restricting ourselves to 𝐿 2 (R3 ) as
in the last two examples, we will now assume arbitrary separable Hilbert spaces
H1 , . . . , H𝑑 to allow for spin and other properties. We will seek a solution 𝑓 ∈
H1 ⊗ · · · ⊗ H𝑑 to the Schrödinger equation for 𝑑 particles. Readers are reminded
of the discussion at the end of Example 4.16. Note that by definition of ⊗, 𝑓 has
finite rank regardless of whether H1 , . . . , H𝑑 are finite- or infinite-dimensional. Let
Tensors in computations 175

𝜇rank( 𝑓 ) = (𝑟 1 , . . . , 𝑟 𝑑 ). Then there is a decomposition


Õ
𝑟1 Õ
𝑟2 Õ
𝑟𝑑
𝑓 = ··· 𝑐𝑖 𝑗 ···𝑘 𝜑𝑖 ⊗ 𝜓 𝑗 ⊗ · · · ⊗ 𝜃 𝑘 (4.125)
𝑖=1 𝑗=1 𝑘=1

with orthogonal factors (De Lathauwer, De Moor and Vandewalle 2000, equa-
tion 13)
(
0 𝑖 ≠ 𝑗,
h𝜑𝑖 , 𝜑 𝑗 i = h𝜓𝑖 , 𝜓 𝑗 i = · · · = h𝜃 𝑖 , 𝜃 𝑗 i =
1 𝑖 = 𝑗.
Note that when our spaces have inner products, any multilinear rank decomposition
(4.63) may have its factors orthogonalized (De Lathauwer et al. 2000, Theorem 2),
that is, the orthogonality constraints do not limit the range of possibilities in (4.125).
By definition of multilinear rank, there exist subspaces H1′ , . . . , H𝑑′ of dimensions
𝑟 1 , . . . , 𝑟 𝑑 with 𝑓 ∈ H1′ ⊗ · · · ⊗ H𝑑′ that attain the minimum in (4.62). As such,
we may replace H𝑖 with H𝑖′ at the outset, and to simplify our discussion we may as
well assume that H1 , . . . , H𝑑 are of dimensions 𝑟 1 , . . . , 𝑟 𝑑 .
The issue with (4.125) is that there is an exponential number of rank-one terms
as 𝑑 increases. Suppose 𝑟 1 = · · · = 𝑟 𝑑 = 𝑟; then there are 𝑟 𝑑 summands in (4.125).
This is not unexpected because (4.125) is the most general form a finite-rank tensor
can take. Here the ansatz in (4.125) describes the whole space H1 ⊗ · · · ⊗ H𝑑 and
does not quite serve its purpose; an ansatz is supposed to be an educated guess,
typically based on physical insights, that captures a small region of the space where
the solution likely lies. The goal of tensor networks is to provide such an ansatz by
limiting the coefficients [𝑐𝑖 𝑗 ···𝑘 ] ∈ R𝑟1 ×···×𝑟𝑑 to a much smaller set. The first and
best-known example is the matrix product states tensor network (Anderson 1959,
White 1992, White and Huse 1993), which imposes on the coefficients the structure
𝑐𝑖 𝑗 ···𝑘 = tr(𝐴𝑖 𝐵 𝑗 · · · 𝐶 𝑘 ), 𝐴𝑖 ∈ R𝑛1 ×𝑛2 , 𝐵 𝑗 ∈ R𝑛2 ×𝑛3 , . . . , 𝐶 𝑘 ∈ R𝑛𝑑 ×𝑛1
for 𝑖 = 1, . . . , 𝑟 1 , 𝑗 = 1, . . . , 𝑟 2 , . . . , 𝑘 = 1, . . . , 𝑟 𝑑 . An ansatz of the form
Õ
𝑟1 Õ
𝑟2 Õ
𝑟𝑑
𝑓 = ··· tr(𝐴𝑖 𝐵 𝑗 · · · 𝐶 𝑘 ) 𝜑𝑖 ⊗ 𝜓 𝑗 ⊗ · · · ⊗ 𝜃 𝑘 (4.126)
𝑖=1 𝑗=1 𝑘=1

is called a matrix product state or MPS (Affleck, Kennedy, Lieb and Tasaki 1987).
Note that the coefficients are now parametrized by 𝑟 1 + 𝑟 2 + · · · + 𝑟 𝑑 matrices of
various sizes. For easy comparison, if 𝑟 1 = · · · = 𝑟 𝑑 = 𝑟 and 𝑛1 = · · · = 𝑛𝑑 = 𝑛,
then the coefficients in (4.125) have 𝑟 𝑑 degrees of freedom whereas those in (4.126)
only have 𝑟𝑑𝑛2 . When 𝑛1 = 1, the first and last matrices in (4.126) are a row and a
column vector respectively; as the trace of a 1 × 1 matrix is itself, we may drop the
‘tr’ in (4.126). This special case with 𝑛1 = 1 is sometimes called MPS with open
boundary conditions (Anderson 1959) and the more general case is called MPS
with periodic conditions.
The above discussion of matrix product states conceals an important structure.
176 L.-H. Lim

( 𝑗)
Take 𝑑 = 3 and denote the entries of the matrices as 𝐴𝑖 = [𝑎(𝑖)
𝛼𝛽 ], 𝐵 𝑗 = [𝑏 𝛽𝛾 ] and
𝐶 𝑘 = [𝑐(𝑘)
𝛾 𝛼 ]. Then
𝑟1Õ
,𝑟2 ,𝑟3
𝑓 = tr(𝐴𝑖 𝐵 𝑗 𝐶 𝑘 ) 𝜑𝑖 ⊗ 𝜓 𝑗 ⊗ 𝜃 𝑘
𝑖, 𝑗,𝑘=1
𝑟1Õ,𝑟2 ,𝑟3  𝑛1Õ
,𝑛2 ,𝑛3 
(𝑖) ( 𝑗) (𝑘)
= 𝑎 𝛼𝛽 𝑏 𝛽𝛾 𝑐 𝛾 𝛼 𝜑𝑖 ⊗ 𝜓 𝑗 ⊗ 𝜃 𝑘
𝑖, 𝑗,𝑘=1 𝛼,𝛽,𝛾=1
𝑛1Õ ,𝑛2 ,𝑛3 Õ 𝑟1  Õ 𝑟2  Õ 𝑟3 
( 𝑗)
= 𝑎(𝑖)
𝛼𝛽 𝑖 𝜑 ⊗ 𝑏 𝛽𝛾 𝑗 𝜓 ⊗ 𝑐(𝑘)
𝛾𝛼 𝜃𝑘
𝛼,𝛽,𝛾=1 𝑖=1 𝑗=1 𝑘=1
𝑛1Õ ,𝑛2 ,𝑛3
= 𝜑 𝛼𝛽 ⊗ 𝜓 𝛽𝛾 ⊗ 𝜃 𝛾 𝛼 ,
𝛼,𝛽,𝛾=1

where
Õ
𝑟1 Õ
𝑟2 Õ
𝑟3
( 𝑗)
𝜑 𝛼𝛽 ≔ 𝑎(𝑖)
𝛼𝛽 𝜑 𝑖 , 𝜓 𝛽𝛾 ≔ 𝑏 𝛽𝛾 𝜓 𝑗 and 𝜃𝛾 𝛼 ≔ 𝑐(𝑘)
𝛾𝛼 𝜃𝑘 .
𝑖=1 𝑗=1 𝑘=1

In other words, the indices have the incidence structure of an undirected graph,
in this case a triangle. This was first observed in Landsberg, Qi and Ye (2012)
and later generalized in Ye and Lim (2018b), the bottom line being that any tensor
network state is a sum of separable functions indexed by a graph. In the following,
we show some of the most common tensor network states, written in this simplified
form, together with the graphs they correspond to in Figure 4.4.
Periodic matrix product states:
𝑛1Õ
,𝑛2 ,𝑛3
𝑓 (𝑥, 𝑦, 𝑧) = 𝜑𝑖 𝑗 (𝑥)𝜓 𝑗𝑘 (𝑦)𝜃 𝑘𝑖 (𝑧).
𝑖, 𝑗,𝑘=1

Tree tensor network states:


𝑛1Õ
,𝑛2 ,𝑛3
𝑓 (𝑥, 𝑦, 𝑧, 𝑤) = 𝜑𝑖 𝑗𝑘 (𝑥)𝜓𝑖 (𝑦)𝜃 𝑗 (𝑧)𝜋 𝑘 (𝑤).
𝑖, 𝑗,𝑘=1

Open matrix product states:


𝑛1 ,𝑛Õ
2 ,𝑛3 ,𝑛4

𝑓 (𝑥, 𝑦, 𝑧, 𝑢, 𝑣) = 𝜑𝑖 (𝑥)𝜓𝑖 𝑗 (𝑦)𝜃 𝑗𝑘 (𝑧)𝜋 𝑘𝑙 (𝑢)𝜌𝑙 (𝑣).


𝑖, 𝑗,𝑘,𝑙=1

Projected entangled pair states:


𝑛1 ,𝑛2 ,𝑛3Õ
,𝑛4 ,𝑛5 ,𝑛6 ,𝑛7
𝑓 (𝑥, 𝑦, 𝑧, 𝑢, 𝑣, 𝑤) = 𝜑𝑖 𝑗 (𝑥)𝜓 𝑗𝑘𝑙 (𝑦)𝜃 𝑙𝑚 (𝑧)𝜋 𝑚𝑛 (𝑢)𝜌 𝑛𝑘𝑜 (𝑣)𝜎𝑜𝑖 (𝑤).
𝑖, 𝑗,𝑘,𝑙,𝑚,𝑛,𝑜=1
Tensors in computations 177

The second and the last, often abbreviated to TTNS and PEPS, were proposed by
Shi, Duan and Vidal (2006) and Verstraete and Cirac (2004) respectively. More
generally, any periodic MPS corresponds to a cycle graph and any open MPS
corresponds to a line graph. We have deliberately written them without the ⊗
symbol to emphasize that all these ansätze are just sums of separable functions,
differing only in terms of how their factors are indexed.

𝜋
mps (open) ttns
𝑘
𝑖 𝑗 𝑘 𝑙 𝑖 𝑗
𝜑 𝜓 𝜃 𝜋 𝜌 𝜓 𝜑 𝜃
peps
𝜎 𝜌 𝜋
𝜃 𝑘 𝜓 𝑜 𝑛
𝑗 𝑖 𝑘 𝑚
mps (periodic) 𝑖
𝑗 𝑙
𝜑 𝜑 𝜓 𝜃
Figure 4.4. Graphs associated with common tensor networks.

One difference between the Hartree–Fock approximations in Examples 4.42 and


4.43 and the tensor network methods here is that the factors 𝜑𝑖 , 𝜓 𝑗 , . . . , 𝜃 𝑘 in
(4.126) are often fixed in advance as some standard bases of H1 , . . . , H𝑑 , called
a local basis in this context. The computational effort is solely in determining
the coefficients 𝑐𝑖 𝑗 ···𝑘 ; in the case of MPS this can be done via several singular
value decompositions (Orús 2014). Indeed, because of the way it is computed, the
coefficients of the MPS ansatz are sometimes represented as
tr(𝑄 1 Σ1 𝑄 2 Σ2 . . . Σ𝑑 𝑄 𝑑+1 ), 𝑄 𝑖 ∈ O(𝑛𝑖 ), Σ𝑖 ∈ R𝑛𝑖 ×𝑛𝑖+1
for 𝑖 = 1, . . . 𝑑, with Σ𝑖 a diagonal matrix. That any tr(𝐴1 𝐴2 · · · 𝐴𝑑 ) with 𝐴𝑖 ∈
R𝑛𝑖 ×𝑛𝑖+1 may be written in this form follows from the singular value decompositions
𝐴𝑖 = 𝑈𝑖 Σ𝑖 𝑉𝑖T and setting 𝑄 𝑖+1 = 𝑉𝑖T𝑈𝑖+1 for 𝑖 = 1, . . . , 𝑑 − 1, and 𝑄 1 = 𝑈1 ,
𝑄 𝑑+1 = 𝑉𝑑T .
We end with a brief word about our initial assumption that 𝑓 ∈ H1 ⊗ · · · ⊗ H𝑑 .
For a solution to a PDE, we expect that 𝑓 ∈ H1 b ⊗F · · · b⊗F H𝑑 , where b ⊗ F is the
Hilbert–Schmidt tensor product discussed in Examples 4.18 and 4.19. Since any
tensor product basis (4.43) formed from orthonormal bases of H1 , . . . , H𝑑 is an
orthonormal basis of H1 b ⊗ ··· b
⊗ H𝑑 with respect to the inner product in (4.64), it
is always possible to express 𝑓 ∈ H1 b ⊗ ··· b
⊗ H𝑑 in the form (4.125) as long as
we allow 𝑟 1 , . . . , 𝑟 𝑑 to be infinite. The ones with finite 𝑟 1 , . . . , 𝑟 𝑑 are precisely the
finite-rank tensors in H1 ⊗ · · · ⊗ H𝑑 , but any other 𝑓 ∈ H1 b ⊗F · · · b
⊗F H𝑑 can be
178 L.-H. Lim

approximated to arbitrary accuracy by elements of H1 ⊗ · · · ⊗ H𝑑 , that is, 𝑓 is a


limit of finite-rank tensors with respect to the Hilbert–Schmidt norm.
We will now embark on a different line of applications of definition ➂ tied to
our discussions of integral kernels in Examples 3.7, 4.2 and 4.20. As we saw
in Example 3.7, some of the most important kernels take the form 𝐾(𝑣, 𝑣 ′) =
𝑓 (k𝑣 − 𝑣 ′ k) for some real-valued function 𝑓 on a norm space V. In this case we
usually write 𝜅(𝑣 − 𝑣 ′) = 𝑓 (k𝑣 − 𝑣 ′ k) and call 𝜅 a radial basis kernel (Buhmann
2003). These are usually not separable like those in Examples 4.37 or 4.38. The fast
multipole method (Greengard and Rokhlin 1987) may be viewed as an algorithm for
approximating the convolution of certain radial basis kernels with certain functions.
We will discuss how tensors arise in this algorithm.
Example 4.45 (multipole tensors). Let V be an inner product space. As usual we
will denote the inner product and its induced norm by h · , · i and k · k respectively.
As we saw in Examples 4.17 and 4.22, this induces an inner product and norm on
V ⊗𝑑 that we may denote with the same notation. For any 𝑠 ≥ 0, the function
(
log k𝑣k 𝑠 = 0,
𝜅 𝑠 : V \ {0} → R, 𝜅 𝑠 (𝑣) =
1/k𝑣k 𝑠 𝑠 > 0,
is called a Newtonian kernel. Henceforth we will drop the subscript 𝑠 from 𝜅 𝑠 since
this is normally fixed throughout. If V is equipped with some measure, then the
convolution ∫
𝜅 ∗ 𝑓 (𝑣) = 𝜅(𝑣 − 𝑤) 𝑓 (𝑤) d𝑤
𝑣 ∈V
is called the Newtonian potential34 of 𝑓 . We will assume 𝑠 > 0 in the following; the
𝑠 = 0 case is similar. Our interpretation of higher-order derivatives as multilinear
maps in Example 3.2 makes their calculation an effortless undertaking:
𝑠
[𝐷𝜅(𝑣)](ℎ) = − h𝑣, ℎi,
k𝑣k 𝑠+2
𝑠 𝑠(𝑠 + 2)
[𝐷 2 𝜅(𝑣)](ℎ1 , ℎ2 ) = − 𝑠+2
hℎ2 , ℎ1 i + h𝑣, ℎ1 ih𝑣, ℎ2 i,
k𝑣k k𝑣k 𝑠+4
[𝐷 3 𝜅(𝑣)](ℎ1 , ℎ2 , ℎ3 )
𝑠(𝑠 + 2)
= [h𝑣, ℎ1 ihℎ2 , ℎ3 i + h𝑣, ℎ2 ihℎ1 , ℎ3 i + h𝑣, ℎ3 ihℎ1 , ℎ2 i]
k𝑣k 𝑠+4
𝑠(𝑠 + 2)(𝑠 + 4)
− h𝑣, ℎ1 ih𝑣, ℎ2 ih𝑣, ℎ3 i,
k𝑣k 𝑠+6

34 Strictly speaking, the names are used for the case when 𝑠 = dim V − 2, and for more general 𝑠 they
are called the Riesz kernel and Riesz potentials. Because of the singularity in 𝜅, the integral is
interpreted in a principal value sense as in (3.32) but we ignore such details (Stein 1993, Chapter 1,
Section 8.18). We have also dropped some multiplicative constants.
Tensors in computations 179

[𝐷 4 𝜅(𝑣)](ℎ1 , ℎ2 , ℎ3 , ℎ4 )
𝑠(𝑠 + 2)
= [hℎ1 , ℎ2 ihℎ3 , ℎ4 i + hℎ1 , ℎ3 ihℎ2 , ℎ4 i + hℎ1 , ℎ4 ihℎ2 , ℎ3 i]
k𝑣k 𝑠+4
𝑠(𝑠 + 2)(𝑠 + 4)
− [h𝑣, ℎ1 ih𝑣, ℎ2 ihℎ3 , ℎ4 i + h𝑣, ℎ1 ih𝑣, ℎ3 ihℎ2 , ℎ4 i
k𝑣k 𝑠+6
+ h𝑣, ℎ1 ih𝑣, ℎ4 ihℎ2 , ℎ3 i + h𝑣, ℎ2 ih𝑣, ℎ3 ihℎ1 , ℎ4 i
+ h𝑣, ℎ2 ih𝑣, ℎ4 ihℎ1 , ℎ3 i + h𝑣, ℎ3 ih𝑣, ℎ4 ihℎ1 , ℎ2 i]
𝑠(𝑠 + 2)(𝑠 + 4)(𝑠 + 6)
+ h𝑣, ℎ1 ih𝑣, ℎ2 ih𝑣, ℎ3 ih𝑣, ℎ4 i.
k𝑣k 𝑠+8
To obtain these, all we need is the expression for [𝐷𝜅(𝑣)](ℎ), which follows from
binomial expanding to linear terms in ℎ, that is,
  −𝑠/2
1 2h𝑣, ℎi kℎk 2 𝑠
𝜅(𝑣 + ℎ) = 𝑠
1+ 2
+ 2
= 𝜅(𝑣) − h𝑣, ℎi + 𝑂(kℎk 2 ),
k𝑣k k𝑣k k𝑣k k𝑣k 𝑠+2
along with the observation that [𝐷 h𝑣, ℎ ′i](ℎ) = hℎ, ℎ ′ i and the product rule 𝐷 𝑓 ·
𝑔(𝑣) = 𝑓 (𝑣) · 𝐷𝑔(𝑣) + 𝑔(𝑣) · 𝐷 𝑓 (𝑣). Applying these repeatedly gives us 𝐷 𝑘 𝜅(𝑣) in
any coordinate system without having to calculate a single partial derivative.
As we saw in Example 4.29, these derivatives may be linearized via the universal
factorization property. With an inner product, this is particularly simple. We will
take 𝑠 = 1, as this gives us the Coulomb potential, the most common case. We may
rewrite the above expressions as
 
𝑣
[𝐷𝜅(𝑣)](ℎ) = − ,ℎ ,
k𝑣k 3
 
𝐼 3𝑣 ⊗ 𝑣
[𝐷 2 𝜅(𝑣)](ℎ1 , ℎ2 ) = − + , ℎ 1 ⊗ ℎ 2 ,
k𝑣k 3 k𝑣k 5
 
3 𝐼 15 𝑣 ⊗ 𝑣 ⊗ 𝑣
 
3
[𝐷 𝜅(𝑣)](ℎ1 , ℎ2 , ℎ3 ) = 𝑣 ⊗ 𝐼+ ⊗ +𝐼 ⊗ 𝑣 − , ℎ1 ⊗ ℎ2 ⊗ ℎ3 .
k𝑣k 5 𝑣 k𝑣k 7
The symbol where 𝐼 and 𝑣 appear above and below ⊗ is intended Í𝑛 to mean the
following. Take any orthonormal basis 𝑒1 , . . . , 𝑒 𝑛 of V; then 𝐼 = 𝑖=1 𝑒𝑖 ⊗ 𝑒𝑖 , and
we have
Õ𝑛
𝐼 Õ 𝑛 Õ𝑛
𝑣⊗𝐼= 𝑣 ⊗ 𝑒𝑖 ⊗ 𝑒𝑖 , ⊗ = 𝑒𝑖 ⊗ 𝑣 ⊗ 𝑒𝑖 , 𝐼 ⊗ 𝑣 = 𝑒𝑖 ⊗ 𝑒𝑖 ⊗ 𝑣.
𝑣
𝑖=1 𝑖=1 𝑖=1
The tensors appearing in the first argument of the inner products are precisely the
linearized derivatives and they carry important physical meanings:
1
monopole 𝜅(𝑣) = ,
k𝑣k
−1
dipole 𝜕𝜅(𝑣) = 𝑣ˆ ,
k𝑣k 2
180 L.-H. Lim

1
quadrupole 𝜕 2 𝜅(𝑣) = (3 𝑣ˆ ⊗ 𝑣ˆ − 𝐼),
k𝑣k 3

𝐼
 
3 −1
octupole 𝜕 𝜅(𝑣) = 15 𝑣ˆ ⊗ 𝑣ˆ ⊗ 𝑣ˆ − 3 𝑣ˆ ⊗ 𝐼+ ⊗ +𝐼 ⊗ 𝑣ˆ ,
k𝑣k 4 𝑣ˆ
where 𝑣ˆ ≔ 𝑣/k𝑣k. The norm-scaled 𝑘th linearized derivative
𝑀 𝑘 𝜅(𝑣) ≔ k𝑣k 𝑘+1 𝜕 𝑘 𝜅(𝑣) ∈ V ⊗𝑘
is called a multipole tensor or 𝑘-pole tensor. The Taylor expansion (4.92) may then
be written in terms of the multipole tensors,
Õ𝑑  
1 𝑘 (𝑣 − 𝑣 0 ) ⊗𝑘
𝜅(𝑣) = 𝜅(𝑣 0 ) 𝑀 𝜅(𝑣 0 ), + 𝑅(𝑣 − 𝑣 0 ),
𝑘=0
𝑘! k𝑣 0 k 𝑘
and this is called a multipole expansion of 𝜅(𝑣) = 1/k𝑣k. We may rewrite this in
terms of a tensor Taylor series and a tensor geometric series as in (4.93),
Õ∞
1 𝑘

𝑣 − 𝑣0
 Õ ∞
(𝑣 − 𝑣 0 ) ⊗𝑘 b
𝜕𝜅(𝑣 0 ) = b
𝑀 𝜅(𝑣 0 ) ∈ T(V), 𝑆 = ∈ T(V).
𝑘=0
𝑘! k𝑣 0 k 𝑘=0
k𝑣 0 k 𝑘
Note that

1 1 1 𝐼
 
𝜕𝜅(𝑣) = 1 − 𝑣ˆ + (3 𝑣ˆ ⊗ 𝑣ˆ − 𝐼) − 15 𝑣ˆ ⊗ 𝑣ˆ ⊗ 𝑣ˆ − 3 𝑣ˆ ⊗ 𝐼+ ⊗ +𝐼 ⊗ 𝑣ˆ + · · ·
2! 3! 4! 𝑣ˆ
is an expansion in the tensor algebra b
T(V).
Let 𝑓 : V → R be a compactly supported function that usually represents a
charge distribution confined to a region within V. The tensor-valued integral

𝑘
(𝑀 𝜅) ∗ 𝑓 (𝑣) = 𝑀 𝑘 𝜅(𝑣 − 𝑤) 𝑓 (𝑤) d𝑤
𝑣 ∈V
is called a multipole moment or 𝑘-pole moment, and the Taylor expansion
Õ𝑑  
1 𝑘 (𝑣 − 𝑣 0 ) ⊗𝑘
𝜅 ∗ 𝑓 (𝑣) = 𝜅(𝑣 0 ) (𝑀 𝜅) ∗ 𝑓 (𝑣 0 ), + 𝑅(𝑣 − 𝑣 0 ),
𝑘=0
𝑘! k𝑣 0 k 𝑘
is called a multipole expansion of 𝑓 . One of the most common scenarios is when
we have a finite collection of 𝑛 point charges at positions 𝑣 1 , . . . , 𝑣 𝑛 ∈ V with
charges 𝑞 1 , . . . , 𝑞 𝑛 ∈ R. Then 𝑓 is given by
𝑓 (𝑣) = 𝑞 1 𝛿(𝑣 − 𝑣 1 ) + · · · + 𝑞 𝑛 𝛿(𝑣 − 𝑣 𝑛 ). (4.127)
Our goal is to estimate the potential at a point 𝑣 some distance away from 𝑣 1 , . . . , 𝑣 𝑛 .
Now consider a distinct point 𝑣 0 that is much nearer to 𝑣 1 , . . . , 𝑣 𝑛 than to 𝑣 in the
sense that k𝑣 𝑖 −𝑣 0 k ≤ 𝑐k𝑣−𝑣 0 k, 𝑖 = 1, . . . , 𝑛, for some 𝑐 < 1. Suppose there is just a
single point charge with 𝑓 (𝑣) = 𝑞 𝑖 𝛿(𝑣 −𝑣 𝑖 ); then (𝑀 𝑘 𝜅)∗ 𝑓 (𝑣 −𝑣 0 ) = 𝑞 𝑖 𝑀 𝑘 𝜅(𝑣 𝑖 −𝑣 0 )
and the multipole expansion of 𝑓 at the point 𝑣 − 𝑣 𝑖 = (𝑣 − 𝑣 0 ) + (𝑣 0 − 𝑣 𝑖 ) about the
Tensors in computations 181

point 𝑣 − 𝑣 0 is
Õ𝑑  
𝑞𝑖 (𝑣 0 − 𝑣 𝑖 ) ⊗𝑘
𝜅 ∗ 𝑓 (𝑣 − 𝑣 𝑖 ) = 𝜅(𝑣 − 𝑣 0 ) 𝑀 𝑘 𝜅(𝑣 𝑖 − 𝑣 0 ), 𝑘
+ 𝑅(𝑣 0 − 𝑣 𝑖 )
𝑘=0
𝑘! k𝑣 − 𝑣 0 k
Õ
𝑑
= 𝜑 𝑘 (𝑣 𝑖 − 𝑣 0 )𝜓 𝑘 (𝑣 − 𝑣 0 ) + 𝑂(𝑐 𝑑+1 ),
𝑘=0

where we have written


(−1)𝑘 𝑞 𝑖 𝜅(𝑣)
𝜑 𝑘 (𝑣) ≔ h𝑀 𝑘 𝜅(𝑣), 𝑣 ⊗𝑘 i, 𝜓 𝑘 (𝑣) ≔ ,
𝑘! k𝑣k 𝑘
and since 𝑅(𝑣 0 − 𝑣 𝑖 )/k𝑣 0 − 𝑣 𝑖 k 𝑑 → 0, we assume 𝑅(𝑣 0 − 𝑣 𝑖 ) = 𝑂(𝑐 𝑑+1 ) for
simplicity. For the general case in (4.127) with 𝑛 point charges, the potential at 𝑣
is then approximated by summing over 𝑖:
Õ 𝑑 Õ 𝑛 
𝜅 ∗ 𝑓 (𝑣) ≈ 𝜑 𝑘 (𝑣 𝑖 − 𝑣 0 ) 𝜓 𝑘 (𝑣 − 𝑣 0 ). (4.128)
𝑘=0 𝑖=1

This sum can be computed in 𝑂(𝑛𝑑) complexity or 𝑂(𝑛 log(1/𝜀)) for an 𝜀-accurate
algorithm. While the high level idea in (4.128) is still one of approximating a
function by a sum of 𝑑 separable functions, the fast multipole method involves a
host of other clever ideas, not least among which are the techniques for performing
such sums in more general situations (Demaine et al. 2005), for subdividing the
region containing 𝑣 1 , . . . , 𝑣 𝑛 into cubic cells (when V = R3 ) and thereby organizing
these computations into a tree-like multilevel algorithm (Barnes and Hut 1986).
Clearly the approximation is good only when 𝑣 is far from 𝑣 1 , . . . , 𝑣 𝑛 relative to 𝑣 0
but the algorithm allows one to circumvent this requirement. We refer to Greengard
and Rokhlin (1987) for further information.
Multipole moments, multipole tensors and multipole expansions are usually
discussed in terms of coordinates (Jackson 1999, Chapter 4). The coordinate-free
approach, the multipole expansion as an element of the tensor algebra, etc., are
results of our working with definitions ➁ and ➂.
Although non-separable kernels make for more interesting examples, separable
kernels arise as some of the most common multidimensional integral transforms,
as we saw in Example 4.38. Also, they warrant a mention if only to illustrate why
separability in a kernel is computationally desirable.
Example 4.46 (discrete multidimensional transforms). Three of the best-known
discrete transforms are the discrete Fourier transform we encountered in Ex-
ample 3.14, the discrete Z-transform and the discrete cosine transform:
Õ
∞ Õ

𝐹(𝑥1 , . . . , 𝑥 𝑑 ) = ··· 𝑓 (𝑘 1 , . . . , 𝑘 𝑑 ) e−i𝑘1 𝑥1 −···−i𝑘𝑑 𝑥𝑑 ,
𝑘1 =−∞ 𝑘𝑑 =−∞
182 L.-H. Lim
Õ
∞ Õ

−𝑘𝑑
𝐹(𝑧1 , . . . , 𝑧 𝑑 ) = ··· 𝑓 (𝑘 1 , . . . , 𝑘 𝑑 ) 𝑧−𝑘1
1 · · · 𝑧𝑑 ,
𝑘1 =−∞ 𝑘𝑑 =−∞
𝐹( 𝑗 1 , . . . , 𝑗 𝑑 ) =
1 −1
𝑛Õ 𝑑 −1
𝑛Õ
𝜋(2 𝑗 1 + 1) 𝑗 1 𝜋(2 𝑗 𝑑 + 1) 𝑗 𝑑
   
··· 𝑓 (𝑘 1 , . . . , 𝑘 𝑑 ) cos · · · cos ,
𝑘1 =0 𝑘𝑑 =0
2𝑛1 2𝑛𝑑

where the 𝑥𝑖 are real, the 𝑧𝑖 are complex, and 𝑗 𝑖 and 𝑘 𝑖 are integer variables. We
refer the reader to Dudgeon and Mersereau (1983, Sections 2.2 and 4.2) for the first
two and Rao and Yip (1990, Chapter 5) for the last. While we have stated them for
general 𝑑, in practice 𝑑 = 2, 3 are the most useful. The separability of these kernels
is exploited in the row–column decomposition for their evaluation (Dudgeon and
Mersereau 1983, Section 2.3.2). Assuming, for simplicity, that we have 𝑑 = 2, a
kernel 𝐾( 𝑗, 𝑘) = 𝜑( 𝑗)𝜓(𝑘) and all integer variables, then
Õ∞ Õ
∞ Õ∞  Õ
∞ 
𝐹(𝑥, 𝑦) = 𝜑( 𝑗)𝜓(𝑘) 𝑓 (𝑥− 𝑗, 𝑦−𝑘) = 𝜑( 𝑗) 𝜓(𝑘) 𝑓 (𝑥− 𝑗, 𝑦−𝑘) .
𝑗=−∞ 𝑘=−∞ 𝑗=−∞ 𝑘=−∞

We store the sum in the bracket, which we then re-use when evaluating 𝐹 at other
points (𝑥 ′, 𝑦) where only the first argument 𝑥 ′ = 𝑥 + 𝛿 is changed:
Õ
∞  Õ
∞ 
𝐹(𝑥 + 𝛿, 𝑦) = 𝜑( 𝑗) 𝜓(𝑘) 𝑓 (𝑥 − 𝑗 + 𝛿, 𝑦 − 𝑘)
𝑗=−∞ 𝑘=−∞
Õ∞  Õ
∞ 
= 𝜑( 𝑗 + 𝛿) 𝜓(𝑘) 𝑓 (𝑥 − 𝑗, 𝑦 − 𝑘) .
𝑗=−∞ 𝑘=−∞

We have assumed that the indices run over all integers to avoid having to deal with
boundary complications. In reality, when we have a finite sum as in the discrete
cosine transform, evaluating 𝐹 in the direct manner would have taken 𝑛21 𝑛22 · · · 𝑛2𝑑
additions and multiplications, whereas the row–column decomposition would just
require
𝑛1 𝑛2 · · · 𝑛𝑑 (𝑛1 + 𝑛2 + · · · + 𝑛𝑑 )
additions and multiplications. In cases where there are fast algorithms available
for the one-dimensional transform, say, if we employ one-dimensional FFT in an
evaluation of the 𝑑-dimensional DFT via row–column decomposition, the number
of additions and multiplications could be further reduced to
1
𝑛1 𝑛2 · · · 𝑛𝑑 log2 (𝑛1 + 𝑛2 + · · · + 𝑛𝑑 ) and 𝑛1 𝑛2 · · · 𝑛𝑑 log2 (𝑛1 + 𝑛2 + · · · + 𝑛𝑑 )
2
respectively. Supposing 𝑑 = 2 and 𝑛1 = 𝑛2 = 210 , the approximate number of multi-
plications required to evaluate a two-dimensional DFT using the direct method, the
row–column decomposition method and the row–column decomposition with FFT
Tensors in computations 183

method are 1012 , 2 × 109 , and 107 respectively (Dudgeon and Mersereau 1983,
Section 2.3.2).

The fact that we may construct a tensor product basis ℬ1 ⊗ · · · ⊗ ℬ𝑑 out of


given bases ℬ1 , . . . , ℬ𝑑 allows us to build multivariate bases out of univariate
ones. While this is not necessarily the best way to construct multivariate bases,
it is the simplest and often serves as a basic standard that more sophisticated
multivariate bases are compared against. This discussion is most interesting with
Hilbert spaces, and this will be the backdrop for our next example, where we will
see the construction for various notions of bases and beyond.

Example 4.47 (tensor product wavelets and splines). We let H be any separ-
able Banach space with norm k · k. Then ℬ = {𝜑𝑖 ∈ H : 𝑖 ∈ N} is said to be a
Schauder basis of H if, for every 𝑓 ∈ H, there is a unique sequence (𝑎𝑖 )∞
𝑖=1 such
that
Õ

𝑓 = 𝑎𝑖 𝜑𝑖 ,
𝑖=1

where, as usual, this means the series on the right converges to 𝑓 in k · k. A Banach
space may not have a Schauder basis but a Hilbert space always does. If H is a
Hilbert space with inner product h · , · i and k · k its induced norm, then its Schauder
basis has specific names when it satisfies additional conditions; two of the best-
known ones are the orthonormal basis and the Riesz basis. Obviously the elements
of a Schauder basis must be linearly independent, but this can be unnecessarily
restrictive since overcompleteness can be a desirable feature (Mallat 2009), leading
us to the notion of frames. For easy reference, we define them as follows (Heil
2011):
Õ

orthonormal basis k 𝑓 k2 = |h 𝑓 , 𝜑𝑖 i| 2 , 𝑓 ∈ H,
𝑖=1
Õ
∞ Õ∞ 2 Õ

Riesz basis 𝛼 |𝑎𝑖 | 2 ≤ 𝑎𝑖 𝜑𝑖 ≤𝛽 |𝑎𝑖 | 2 , (𝑎𝑖 )∞ 2
𝑖=1 ∈ 𝑙 (N),
𝑖=1 𝑖=1 𝑖=1
Õ

frame 𝛼k 𝑓 k 2 ≤ |h 𝑓 , 𝜑𝑖 i| 2 ≤ 𝛽k 𝑓 k 2 , 𝑓 ∈ H,
𝑖=1

where the constants 0 < 𝛼 < 𝛽 are called frame constants and if 𝛼 = 𝛽, then the
frame is tight. Clearly every orthonormal basis is a Riesz basis and every Riesz
basis is a frame.
Let H1 , . . . , H𝑑 be separable Hilbert spaces and let ℬ1 , . . . , ℬ𝑑 be countable
dense spanning sets. Let H1 b ⊗···b⊗ H𝑑 be their Hilbert–Schmidt tensor product as
discussed in Examples 4.18 and 4.19 and let ℬ1 ⊗ · · · ⊗ ℬ𝑑 be as defined in (4.43).
184 L.-H. Lim

Then

 orthonormal bases, 
 an orthonormal basis,


 


ℬ1 , . . . , ℬ𝑑 are Riesz bases, ⇔ ℬ1 ⊗· · ·⊗ℬ𝑑 is a Riesz basis,

 

 frames,  a frame.
 
The forward implication is straightforward, and if the frame constants ofÎ ℬ𝑖 are
𝑑
𝛼𝑖 and 𝛽𝑖 , 𝑖 = 1, . . . , 𝑑, then the frame constants of ℬ1 ⊗ · · · ⊗ ℬ𝑑 are 𝑖=1 𝛼𝑖
Î𝑑
and 𝑖=1 𝛽𝑖 (Feichtinger and Gröchenig 1994, Lemma 8.18). The converse is,
however, more surprising (Bourouihiya 2008).
If 𝜓 ∈ 𝐿 2 (R) is a wavelet, that is, ℬ𝜓 ≔ {𝜓 𝑚,𝑛 : (𝑚, 𝑛) ∈ Z × Z} with 𝜓 𝑚,𝑛 (𝑥) ≔
2 𝜓(2𝑚 𝑥 − 𝑛) is an orthonormal basis, Riesz basis or frame of 𝐿 2 (R), then
𝑚/2

ℬ𝜓 ⊗ ℬ𝜓 = {𝜓 𝑚,𝑛 ⊗ 𝜓 𝑝,𝑞 : (𝑚, 𝑛, 𝑝, 𝑞) ∈ Z × Z × Z × Z} (4.129)


is an orthonormal basis, Riesz basis or frame of 𝐿 2 (R2 ) by our general discussion
above. More generally, ℬ𝜓⊗𝑑 gives a tensor product wavelet basis/frame for 𝐿 2 (R𝑑 )
(Mallat 2009, Section 7.7.4). While this is certainly the most straightforward way
to obtain a wavelet basis for multivariate functions out of univariate wavelets, there
are often better options.
Some orthonormal wavelets have a multiresolution analysis, that is, a sequence
of nested subspaces whose intersection is {0} and union is dense in 𝐿 2 (R),
{0} ⊆ · · · ⊆ V1 ⊆ V0 ⊆ V−1 ⊆ V−2 ⊆ · · · ⊆ 𝐿 2 (R),
defined by a scaling function 𝜑 ∈ 𝐿 2 (R) so that {𝜑 𝑚,𝑛 : 𝑛 ∈ Z} is an orthonormal
basis of V𝑚 for any 𝑚 ∈ Z. If we write the orthogonal complement of V𝑚 in
V𝑚−1 as W𝑚 , then {𝜓 𝑚,𝑛 : 𝑛 ∈ Z} is an orthonormal basis of W𝑚 (Meyer 1992,
Chapter 2). Multiresolution analysis interacts with tensor products via the rules in
Example 4.31:
V𝑚−1 ⊗ V𝑚−1 = (V𝑚 ⊗ V𝑚 ) ⊕ (V𝑚 ⊗ W𝑚 ) ⊕ (W𝑚 ⊗ V𝑚 ) ⊕ (W𝑚 ⊗ W𝑚 ).
Here ⊕ denotes orthogonal direct sum. In image processing lingo, the subspaces
V𝑚 ⊗ V𝑚 = span{𝜑 𝑚,𝑛 ⊗ 𝜑 𝑝,𝑞 : (𝑛, 𝑞) ∈ Z × Z},
V𝑚 ⊗ W𝑚 = span{𝜑 𝑚,𝑛 ⊗ 𝜓 𝑝,𝑞 : (𝑛, 𝑞) ∈ Z × Z},
(4.130)
W𝑚 ⊗ V𝑚 = span{𝜓 𝑚,𝑛 ⊗ 𝜑 𝑝,𝑞 : (𝑛, 𝑞) ∈ Z × Z},
W𝑚 ⊗ W𝑚 = span{𝜓 𝑚,𝑛 ⊗ 𝜓 𝑝,𝑞 : (𝑛, 𝑞) ∈ Z × Z}
are called the LL-, LH-, HL- and HH-subbands respectively, with L for low-pass and
H for high-pass. For an image 𝑓 ∈ V𝑚−1 ⊗ V𝑚−1 , its component in the LL-subband
V𝑚 ⊗V𝑚 represents a low-resolution approximation to 𝑓 ; the high-resolution details
are contained in its components in the LH-, HL- and HH-subbands, whose direct
sum is sometimes called the detailed space.
The orthonormal basis obtained via (4.130) can be a better option than simply
taking the tensor product (4.129). For instance, in evaluating an 𝑛×𝑛 dense matrix–
Tensors in computations 185

vector product representing an integral transform like the ones in Example 4.45,
the Beylkin et al. (1991) wavelet-variant of the fast multipole algorithm mentioned
in Example 2.14 runs in time 𝑂(𝑛 log 𝑛) using the basis in (4.129) and in time 𝑂(𝑛)
using that in (4.130), both impressive compared to the usual 𝑂(𝑛2 ), but the latter is
clearly superior when 𝑛 is very large.
The main advantage of (4.129) is its generality; with other types of bases or
frames we also have similar constructions. For instance, the simplest types of
multivariate B-splines (Höllig and Hörner 2013) are constructed out of tensor
products of univariate B-splines; expressed in a multilinear rank decomposition
(4.63), we have
𝑝 Õ
Õ 𝑞 Õ
𝑟
𝐵𝑙,𝑚,𝑛 (𝑥, 𝑦, 𝑧) = 𝑎𝑖 𝑗𝑘 𝐵𝑖,𝑙 (𝑥)𝐵 𝑗,𝑚 (𝑦)𝐵 𝑘,𝑛 (𝑧), (4.131)
𝑖=1 𝑗=1 𝑘=1

where 𝐵𝑖,𝑙 , 𝐵 𝑗,𝑚 , 𝐵 𝑘,𝑛 are univariate B-splines of degrees 𝑙, 𝑚, 𝑛. Nevertheless, the
main drawback of straightforward tensor product constructions is that they attach
undue importance to the directions of the coordinate axes (Cohen and Daubechies
1993). There are often better alternatives such as box splines or beamlets, curvelets,
ridgelets, shearlets, wedgelets, etc., that exploit the geometry of R2 or R3 .
We next discuss a covariant counterpart to the contravariant example above: a
univariate quadrature on [−1, 1] is a covariant 1-tensor, and a multivariate quad-
rature on [−1, 1] 𝑑 is a covariant 𝑑-tensor.
Example 4.48 (quadrature). Let 𝑓 : [−1, 1] → R be a univariate polynomial
function of degree not more than 𝑛. Without knowing anything else about 𝑓 , we
know that there exist 𝑛 distinct points 𝑥0 , 𝑥1 , . . . , 𝑥 𝑛 ∈ [−1, 1], called nodes, and
coefficients 𝑤 0 , 𝑤 1 , . . . , 𝑤 𝑛 ∈ R, called weights, so that
∫ 1
𝑓 (𝑥) d𝑥 = 𝑤 0 𝑓 (𝑥0 ) + 𝑤 1 𝑓 (𝑥1 ) + · · · + 𝑤 𝑛 𝑓 (𝑥 𝑛 ).
−1
In fact, since 𝑓 is arbitrary, the formula holds with the same nodes and weights
for all polynomials of degree 𝑛 or less. This is called a quadrature formula and
its existence is simply a consequence of the following observations. A definite
integral is a linear functional,
∫ 1
𝐼 : 𝐶([−1, 1]) → R, 𝐼( 𝑓 ) = 𝑓 (𝑥) d𝑥,
−1
as is point evaluation, introduced in Example 4.6,
𝜀 𝑥 : 𝐶([−1, 1]) → R, 𝜀 𝑥 ( 𝑓 ) = 𝑓 (𝑥).
Since V = { 𝑓 ∈ 𝐶([−1, 1]) : 𝑓 (𝑥) = 𝑐0 + 𝑐1 𝑥 + · · · + 𝑐 𝑛 𝑥 𝑛 } is a vector space of
dimension 𝑛 + 1, its dual space V∗ also has dimension 𝑛 + 1, and the 𝑛 + 1 linear
functionals 𝜀 𝑥0 , 𝜀 𝑥1 , . . . , 𝜀 𝑥𝑛 , obviously linearly independent, form a basis of V∗ .
186 L.-H. Lim

Thus it must be possible to write 𝐼 as a unique linear combination

𝐼 = 𝑤 0 𝜀 𝑥0 + 𝑤 1 𝜀 𝑥1 + · · · + 𝑤 𝑛 𝜀 𝑥 𝑛 . (4.132)

This observation is presented in Lax (2007, Chapter 2, Theorem 7) as an instance in


linear algebra where ‘even trivial-looking material has interesting consequences’.
More generally, a quadrature formula is simply a linear functional in V∗ that
is a linear combination of point evaluation functionals, as on the right-hand side
of (4.132). If V ⊆ 𝐶(𝑋) is any (𝑛 + 1)-dimensional subspace and 𝑋 is compact
Hausdorff with a Borel measure, then the ∫ same argument in the previous paragraph
shows that there is a quadrature for 𝐼 = 𝑋 that gives the exact value of the integral
for all 𝑓 ∈ V in terms of the values of 𝑓 at 𝑛 + 1 nodes 𝑥0 , . . . , 𝑥 𝑛 ∈ 𝑋. However,
this abstract existential argument does not make it any easier to find 𝑥0 , . . . , 𝑥 𝑛 ∈ 𝑋
for a specific type of function space V and a specific domain 𝑋, nor does it tell
us about the error if we apply the formula to some 𝑓 ∉ V. In fact, for specific V
and 𝑋 we can often do better.p For instance, the three-node Gauss quadrature on
𝑋 = [−1, 1] has 𝑥0 = 0, 𝑥1 = 3/5 = −𝑥2 and 𝑤 0 = 8/9, 𝑤 1 = 5/9 = 𝑤 2 , with
∫ 1 r   r 
8 5 3 5 3
𝑓 (𝑥) d𝑥 = 𝑓 (0) + 𝑓 + 𝑓 −
−1 9 9 5 9 5
giving exact answers for all quintic polynomials 𝑓 . So dim V = 6 but we only
need three point evaluations to determine 𝐼. There is no contradiction here: we are
not saying that every linear functional in V∗ can be determined with these three
point evaluations, just this one linear functional 𝐼. This extra efficacy comes from
exploiting the structure of V as a space of polynomials; while the moment function-
∫1
als 𝑓 ↦→ −1 𝑥 𝑛 𝑓 (𝑥) d𝑥 are linearly independent, they satisfy non-linear relations
(Schmüdgen 2017) that can be exploited to obtain various 𝑛-node quadratures with
exact answers for all polynomials up to degree 2𝑛 − 1 (Golub and Welsch 1969).
Generally speaking, the more nodes there are in a quadrature, the more accurate it
is but also the more expensive it is. As such we will take the number of nodes as a
crude measure of accuracy and computational cost.
As univariate quadratures are well studied in theory (Brass and Petras 2011) and
practice (Davis and Rabinowitz 2007), it is natural to build multivariate quadrat-
ures from univariate ones. We assume without too much loss of generality that
𝑋 = [−1, 1] 𝑑 , noting that any [𝑎1 , 𝑏1 ] × · · · × [𝑎 𝑑 , 𝑏 𝑑 ] can be transformed into
[−1, 1] 𝑑 with an affine change of variables. Since a univariate quadrature is a linear
functional 𝜑 : 𝐶([−1, 1]) → R, it seems natural to define a multivariate quadrature
as a 𝑑-linear functional

Φ : 𝐶([−1, 1]) × · · · × 𝐶([−1, 1]) → R, (4.133)

but this does not appear to work as we want multivariate quadratures to be defined
on 𝐶([−1, 1] 𝑑 ). Fortunately the universal factorization property (4.88) gives us a
Tensors in computations 187

unique linear map


𝐹Φ : 𝐶([−1, 1]) ⊗ · · · ⊗ 𝐶([−1, 1]) → R
with Φ = 𝐹Φ ◦ 𝜎⊗ , and as 𝐶([−1, 1] 𝑑 ) = 𝐶([−1, 1]) ⊗𝑑 by (4.73), we get a linear
functional
𝐹Φ : 𝐶([−1, 1] 𝑑 ) → R. (4.134)
A multivariate quadrature is a linear functional 𝐹 : 𝐶([−1, 1] 𝑑 ) → R that is a linear
combination of point evaluation functionals on [−1, 1] 𝑑 . Note that given 𝐹 we may
also work backwards to obtain a multilinear functional Φ𝐹 = 𝐹 ◦ 𝜎⊗ as in (4.133).
So (4.133) and (4.134) are equivalent.
The simplest multivariate quadratures are simply separable products of univariate
quadratures. Let
Õ
𝑛1 Õ
𝑛2 Õ
𝑛𝑑
𝜑1 = 𝑎 𝑖 𝜀 𝑥𝑖 , 𝜑 2 = 𝑏 𝑗 𝜀 𝑦 𝑗 , . . . , 𝜑𝑑 = 𝑐 𝑘 𝜀 𝑧𝑘
𝑖=1 𝑗=1 𝑘=1

be 𝑑 univariate quadratures with nodes 𝑁𝑖 ⊆ [−1, 1] of size 𝑛𝑖 , 𝑖 = 1, . . . , 𝑑. Then


the covariant 𝑑-tensor
Õ
𝑛1 Õ
𝑛2 Õ
𝑛𝑑
𝜑1 ⊗ 𝜑2 ⊗ · · · ⊗ 𝜑 𝑑 = ··· 𝑎 𝑖 𝑏 𝑗 · · · 𝑐 𝑘 𝜀 𝑥𝑖 ⊗ 𝜀 𝑦 𝑗 ⊗ · · · ⊗ 𝜀 𝑧 𝑘
𝑖=1 𝑗=1 𝑘=1

is a multivariate quadrature on [−1, 1] 𝑑 usually called the tensor product quadrat-


ure. The downside of such a quadrature is immediate: for an 𝑓 ∈ 𝐶([−1, 1] 𝑑 ),
evaluating 𝜑1 ⊗ · · · ⊗ 𝜑 𝑑 ( 𝑓 ) requires 𝑓 to be evaluated on 𝑛1 𝑛2 · · · 𝑛𝑑 nodes, but
for 𝑑 = 2, 3 this is still viable. For instance, the nine-node tensor product Gauss
quadrature on [−1, 1] × [−1, 1] is the covariant 2-tensor
 ⊗2
5 √ 8 5 √

𝜀 + 𝜀 0 + 𝜀 3/5 ,
9 − 3/5 9 9
with nodes
r                  
3 −1 0 1 −1 0 1 −1 0 1
, , , , , , , , ,
5 −1 −1 −1 0 0 0 1 1 1
and corresponding weights
 
25 40 25 40 64 40 25 40 25
, , , , , , , , .
81 81 81 81 81 81 81 81 81
One observation is that the weights in the tensor product quadrature are always
positive whenever those in its univariate factors 𝜑1 , . . . , 𝜑 𝑑 are. So tensor product
quadrature is essentially just a form of weighted average. Taking a leaf from
Strassen’s algorithm in Example 3.9, allowing for negative coefficients in a tensor
decomposition can be a starting point for superior algorithms; we will examine this
in the context of quadrature.
188 L.-H. Lim

The Smolyak quadrature (Smolyak 1963) is a more sophisticated multivariate


quadrature that is also based on tensor products. Our description is adapted from
Kaarnioja (2013) and Keshavarzzadeh, Kirby and Narayan (2018). As before, we
are given univariate quadratures 𝜑𝑖 with nodes 𝑁𝑖 ⊆ [−1, 1] of cardinality 𝑛𝑖 ,
𝑖 = 1, . . . , 𝑑. An important difference is that we now order 𝜑1 , 𝜑2 , . . . , 𝜑 𝑑 so that
𝑛1 ≤ 𝑛2 ≤ · · · ≤ 𝑛𝑑 , that is, these univariate quadratures get increasingly accurate
but also increasingly expensive as 𝑖 increases. A common scenario would have
𝑛𝑖 = 2𝑖−1 . The goal is to define a tensor product quadrature that uses the cheaper
quadratures on the left end of the list 𝜑1 , 𝜑2 , . . . liberally and the expensive ones
on the right end . . . , 𝜑 𝑑−1 , 𝜑 𝑑 sparingly. We set 𝜑0 ≔ 0,
Δ𝑖 ≔ 𝜑𝑖 − 𝜑𝑖−1 , 𝑖 = 1, . . . , 𝑑,
and observe that
𝜑𝑖 = Δ1 + Δ2 + · · · + Δ𝑖 , 𝑖 = 1, . . . , 𝑑.
The separable multivariate quadrature we defined earlier is
𝜑1 ⊗ 𝜑2 ⊗ · · · ⊗ 𝜑 𝑑 = Δ1 ⊗ (Δ1 + Δ2 ) ⊗ · · · ⊗ (Δ1 + Δ2 + · · · + Δ𝑑 );
it could be made less expensive and accurate by replacing the 𝜑𝑖 with 𝜑1 ,
𝜑1 ⊗ 𝜑1 ⊗ · · · ⊗ 𝜑1 = Δ1 ⊗ Δ1 ⊗ · · · ⊗ Δ1 ,
or more expensive and accurate by replacing the 𝜑𝑖 with 𝜑 𝑑 ,
𝜑 𝑑 ⊗ 𝜑 𝑑 ⊗ · · · ⊗ 𝜑 𝑑 = (Δ1 + Δ2 + · · · + Δ𝑑 ) ⊗ · · · ⊗ (Δ1 + Δ2 + · · · + Δ𝑑 ).
The Smolyak quadrature allows one to adjust the level of accuracy and compu-
tational cost between these two extremes. For any 𝑟 ∈ N, the level-𝑟 Smolyak
quadrature is the covariant 𝑑-tensor
Õ
𝑟 −1 Õ
𝑆𝑟 ,𝑑 = Δ𝑖1 ⊗ · · · ⊗ Δ𝑖𝑑
𝑘=0 𝑖1 +···+𝑖𝑑 =𝑑+𝑘
Õ
𝑟 −1 
𝑑−1
 Õ
𝑟 −1−𝑘
= (−1) 𝜑𝑖1 ⊗ · · · ⊗ 𝜑𝑖𝑑 .
𝑘=𝑟 −𝑑
𝑟 −1−𝑘 𝑖1 +···+𝑖𝑑 =𝑑+𝑘

Note that when 𝑟 = 1 and (𝑑 − 1)𝑑, we get


𝑆1,𝑑 = Δ1⊗𝑑 = 𝜑1⊗𝑑 , 𝑆(𝑑−1)𝑑,𝑑 = (Δ1 + Δ2 + · · · + Δ𝑑 ) ⊗𝑑 = 𝜑 𝑑⊗𝑑
respectively. Take the smallest non-trivial case with 𝑑 = 𝑟 = 2; we have Δ1 = 𝜑1 ,
Δ2 = 𝜑2 − 𝜑1 and
𝑆2,2 = Δ1 ⊗ Δ1 + Δ1 ⊗ Δ2 + Δ2 ⊗ Δ1
= 𝜑1 ⊗ 𝜑2 + 𝜑2 ⊗ 𝜑1 − 𝜑1 ⊗ 𝜑1 .
Observe that the weights can now be negative. For the one-point and two-point
Tensors in computations 189

Gauss quadratures 𝜑1 = 2𝜀 0 and 𝜑2 = 𝜀 −1/√3 + 𝜀 1/√3 , we have


1 1
       
−1 −1
𝑆2,2 ( 𝑓 ) = 2 𝑓 0, √ + 2 𝑓 0, √ + 2 𝑓 √ , 0 + 2 𝑓 √ , 0 − 4 𝑓 (0, 0)
3 3 3 3
(4.135)
2
for any 𝑓 ∈ 𝐶([−1, 1] ). We will mention just one feature of Smolyak quadrature:
if the univariate quadrature 𝜑𝑖 is exact for univariate polynomials of degree 2𝑖 − 1,
𝑖 = 1, . . . , 𝑑, then 𝑆𝑟 ,𝑑 is exact for 𝑑-variate polynomials of degree 2𝑟 −1 (Heiss and
Winschel 2008). Thus (4.135) gives the exact answers when integrating bivariate
cubic polynomials 𝑓 .
For any 𝑓 ∈ 𝐶([−1, 1] 𝑑 ), 𝑆𝑟 ,𝑑 ( 𝑓 ) has to be evaluated at the nodes in
Ø
𝑁𝑖1 × · · · × 𝑁𝑖𝑑 ,
𝑟 ≤𝑖1 +···+𝑖𝑑 ≤𝑟 +𝑑−1

which is called a sparse grid (Gerstner and Griebel 1998). Smolyak quadrat-
ure has led to further developments under the heading of sparse grids; we refer
interested readers to Garcke (2013) for more information. Nevertheless, many
multivariate quadratures tend to be ‘grid-free’ and involve elements of random-
ness or pseudorandomness, essentially (4.132) with the nodes 𝑥0 , . . . , 𝑥 𝑛 chosen
(pseudo)randomly in [−1, 1] 𝑑 to have low discrepancy, as defined in Niederreiter
(1992). As a result, modern study of multivariate quadratures tends to be quite
different from the classical approach described above.
We revisit a point that we made in Example 4.15: it is generally not a good idea
to identify tensors in V1 ⊗ · · · ⊗ V𝑑 with hypermatrices in R𝑛1 ×···×𝑛𝑑 , as the bases
often carry crucial information, that is, in a decomposition such as
𝑝 Õ
Õ 𝑞 Õ
𝑟
𝑇= ··· 𝑎𝑖 𝑗 ···𝑘 𝑒𝑖 ⊗ 𝑓 𝑗 ⊗ · · · ⊗ 𝑔 𝑘 ,
𝑖=1 𝑗=1 𝑘=1

the basis elements 𝑒𝑖 , 𝑓 𝑗 , . . . , 𝑔 𝑘 can be far more important than the coefficients
𝑎𝑖 𝑗 ···𝑘 . Quadrature provides a fitting example. The basis elements are the point
evaluation functionals at the nodes and these are key to the quadrature: once
they are chosen, the coefficients, i.e. the weights, can be determined almost as an
afterthought. Furthermore, while these basis elements are by definition linearly
independent, in the context of quadrature there are non-linear relations between
them that will be lost if one just looks at the coefficients.
All our examples in this article have focused on computations in the classical
sense; we will end with a quantum computing example, adapted from Nakahara
and Ohmi (2008, Chapter 7), Nielsen and Chuang (2000, Chapter 6) and Wallach
(2008, Section 2.3). While tensors appear at the beginning, ultimately Grover’s
quantum search algorithm (Grover 1996) reduces to the good old power method.
Example 4.49 (Grover’s quantum search). Let C2 be equipped with its usual
Hermitian inner product h𝑥, 𝑦i = 𝑥 ∗ 𝑦 and let 𝑒0 , 𝑒1 be any pair of orthonormal
190 L.-H. Lim

vectors in C2 . Let 𝑑 ∈ N and 𝑛 = 2𝑑 . Suppose we have a function


(
−1 if 𝑖 = 𝑗,
𝑓 : {0, 1, . . . , 𝑛 − 1} → {−1, +1}, 𝑓 (𝑖) = (4.136)
+1 if 𝑖 ≠ 𝑗,

that is, 𝑓 takes the value −1 for exactly one 𝑗 ∈ {0, 1, . . . , 𝑛 − 1} but we do not
know which 𝑗. Also, 𝑓 is only accessible as a black box; we may evaluate 𝑓 (𝑖)
for any given 𝑖 and so to find 𝑗 in this manner would
√ require 𝑂(𝑛) evaluations in
the worst case. Grover’s algorithm finds 𝑗 in 𝑂( 𝑛) complexity with a quantum
computer.
We begin by observing that the 𝑑-tensor 𝑢 below may be expanded as
  ⊗𝑑
1 Õ
𝑛−1
1
𝑢 ≔ √ (𝑒0 + 𝑒1 ) =√ 𝑢𝑖 ∈ (C2 ) ⊗𝑑 , (4.137)
2 𝑛 𝑖=0
with
𝑢𝑖 ≔ 𝑒𝑖1 ⊗ 𝑒𝑖2 ⊗ · · · ⊗ 𝑒𝑖𝑑 ,
where 𝑖 1 , . . . , 𝑖 𝑑 ∈ {0, 1} are given by expressing the integer 𝑖 in binary:
[𝑖] 2 = 𝑖 𝑑 𝑖 𝑑−1 · · · 𝑖 2 𝑖 1 .
Furthermore, with respect to the inner product on (C2 ) ⊗𝑑 given by (4.64) and its
induced norm, 𝑢 is of unit norm and {𝑢0 , . . . , 𝑢 𝑛−1 } is an orthonormal basis. Recall
the notion of a Householder reflector in Example 2.12. We define two Householder
reflectors 𝐻 𝑓 and 𝐻𝑢 : (C2 ) ⊗𝑑 → (C2 ) ⊗𝑑 ,
𝐻 𝑓 (𝑣) = 𝑣 − 2h𝑢 𝑗 , 𝑣i𝑢 𝑗 , 𝐻𝑢 (𝑣) = 𝑣 − 2h𝑢, 𝑣i𝑢,
reflecting 𝑣 about the hyperplane orthogonal to 𝑢 𝑗 and 𝑢 respectively. Like the
function 𝑓 , the Householder reflectors 𝐻 𝑓 and 𝐻𝑢 are only accessible as black
boxes. Given any 𝑣 ∈ (C2 ) ⊗𝑑 , a quantum computer allows us to evaluate 𝐻 𝑓 (𝑣)
and 𝐻𝑢 (𝑣) in logarithmic complexity 𝑂(log 𝑛) = 𝑂(𝑑) (Nielsen and Chuang 2000,
p. 251).
Grover’s algorithm is essentially the power method with 𝐻𝑢 𝐻 𝑓 applied to 𝑢 as
the initial vector. Note that 𝐻𝑢 𝐻 𝑓 is unitary and thus norm-preserving and we may
skip the normalization step. If we write
Õ
(𝐻𝑢 𝐻 𝑓 )𝑘 𝑢 = 𝑎 𝑘 𝑢 𝑗 + 𝑏 𝑘 𝑢𝑖 ,
𝑖≠ 𝑗

then we may show (Nakahara and Ohmi 2008, Propositions 7.2–7.4) that

 2−𝑛 2(1 − 𝑛)

 𝑎 𝑘 = 𝑛 𝑎 𝑘−1 +
 𝑏 𝑘−1 ,
1 𝑛
𝑎0 = 𝑏0 = √ ,
𝑛 
 2 2−𝑛
 𝑏 𝑘 = 𝑎 𝑘−1 + 𝑏 𝑘−1 ,
 𝑛 𝑛
Tensors in computations 191

for 𝑘 ∈ N, with the solution given by


−1 √
𝑎 𝑘 = − sin(2𝑘 + 1)𝜃, 𝑏𝑘 = √ cos(2𝑘 + 1)𝜃, 𝜃 ≔ sin−1 (1/ 𝑛).
𝑛−1
The probability of a quantum measurement giving the correct answer 𝑗 is given by
|𝑎 𝑘 | 2 = sin2 (2𝑘 + 1)𝜃,

and for 𝑛 ≫ 1 this is greater than 1 − 1/𝑛 when 𝑘 ≈ 𝜋 𝑛/4. This algorithm can
be extended to search an 𝑓 in (4.136) with 𝑚 copies of −1s (Nakahara and Ohmi
2008, Section 7.2).
Observe that after setting up the stage with tensors, Grover’s algorithm deals
only with linear operators on tensor spaces. It is the same with Shor’s quantum
algorithm for integer factorization and discrete log (Shor 1994), which also relies
on the tensor 𝑢 in (4.137) but uses DFT instead of Householder reflectors, although
the details (Nakahara and Ohmi 2008, Chapter 8) are too involved to go into here.
In either case there is no computation, classical or quantum, involving higher-order
tensors.

5. Odds and ends


We conclude our article by recording some parting thoughts. Section 5.1 is about
the major topics that had to be omitted. Sections 5.2 and 5.3 may be viewed
respectively as explanations as to why our article is not written like a higher-order
analogue of Golub and Van Loan (2013) or Horn and Johnson (2013).

5.1. Omissions
There are three major topics in our original plans for this article that we did not
manage to cover: (i) symmetric and alternating tensors, (ii) algorithms based on
tensor contractions, and (iii) provably correct algorithms for higher-order tensor
problems.
In (i) we would have included a discussion of the following: polynomials and dif-
ferential forms as coordinate representations of symmetric and alternating tensors;
polynomial kernels as Veronese embedding of symmetric tensors, Slater determ-
inants as Plücker embedding of alternating tensors; moments and sums-of-squares
theory for symmetric tensors; matrix–matrix multiplication as a special case of
the Cauchy–Binet formula for alternating tensors. We also planned to discuss the
symmetric and alternating analogues of the three definitions of a tensor, the three
constructions of tensor products and the tensor algebra. The last in particular gives
us the symmetric and alternating Fock spaces with their connections to bosons
and fermions. We would also have discussed various notions of symmetric and
alternating tensor ranks.
In (ii) we had intended to demonstrate how the Cooley–Tukey FFT, the multi-
dimensional DFT, the Walsh transform, wavelet packet transform, Yates’ method
192 L.-H. Lim

in factorial designs, even FFT on finite non-Abelian groups, etc., are all tensor con-
tractions. We would also show how the Strassen tensor in Example 3.9, the tensors
corresponding to the multiplication operations in Lie, Jordan and Clifford algebras,
and any tensor network in Example 4.44 may each be realized as the self-contraction
of a rank-one tensor. There would also be an explanation of matchgate tensors and
holographic algorithms tailored for a numerical linear algebra readership.
In (iii) we had planned to provide a reasonably complete overview of the handful
of provably correct algorithms for various NP-hard problems involving higher-
order tensors from three different communities: polynomial optimization, symbolic
computing/computer algebra and theoretical computer science. We will have more
to say about these below.

5.2. Numerical multilinear algebra?


This article is mostly about how tensors arise in computations and not so much
about how one might compute tensorial quantities. Could one perhaps develop
algorithms analogous to those in numerical linear algebra (Demmel 1997, Golub
and Van Loan 2013, Higham 2002, Trefethen and Bau 1997) but for higher-order
tensors, or more accurately, for hypermatrices since most models of computations
require that we work in terms of coordinates? To be more precise, given a hyper-
matrix 𝐴 ∈ R𝑛1 ×···×𝑛𝑑 with 𝑑 > 2, can we compute its ranks, norms, determinants,
eigenvalues and eigenvectors, various decompositions and approximations, solu-
tions to multilinear systems of equations, multilinear programming and multilinear
regression with coefficients given by 𝐴, etc.? The answer we think is no. One
reason is that most of these problems are NP-hard (Hillar and Lim 2013). Another
related reason – but more of an observation – is that it seems difficult to construct
provably correct algorithms when 𝑑 > 2.
If we sort a list of numbers in Excel or compute the QR factorization of a
matrix in MATLAB, we count on the program to give us the correct answer: a
sorted list or a QR factorization up to some rounding error. Such is the nature of
algorithms: producing the correct answer or an approximation with a controllable
error to the problem under consideration is an integral part of its definition (Skiena
2020, Chapter 1). An algorithm needs to have a proof of correctness or, better, a
certificate of correctness that the user may employ to check the output. In numerical
computing, an example of the former would be the Eckart–Young theorem, which
guarantees that SVD produces a best rank-𝑟 approximation, or the Lax equivalence
theorem, which guarantees convergence of consistent finite difference schemes; an
example of the latter would be relative backward error for a matrix decomposition
or duality gap for a convex optimization problem having strong duality.
Note that we are not asking for efficient algorithms: exponential running time
or numerical instability are acceptable. The basic requirement is just that for some
non-trivial set of inputs, the output in exact arithmetic can be guaranteed to be the
answer we seek up to some quantifiable error. While such algorithms for tensor
Tensors in computations 193

problems do exist in the literature – notably the line of work in Brachat, Comon,
Mourrain and Tsigaridas (2010) and Nie (2017) that extended Sylvester (1886)
and Reznick (1992) to give a provably correct algorithm for symmetric tensor
decomposition – they tend to be the exception rather than the rule. It is more
common to find ‘algorithms’ for tensor problems that are demonstrably wrong.
The scarcity of provably correct algorithms goes hand in hand with the NP-
hardness of tensor problems. For illustration, we consider the best rank-one ap-
proximation problem for a hypermatrix 𝐴 ∈ R𝑛×𝑛×𝑛 :
min k 𝐴 − 𝑥 ⊗ 𝑦 ⊗ 𝑧k F . (5.1)
𝑥,𝑦,𝑧 ∈R𝑛

This is particularly pertinent as every tensor approximation problem contains (5.1)


as a special case; all notions of rank – tensor rank, multilinear rank, tensor net-
work rank, slice rank, etc. – agree on the non-zero lowest-ranked tensor and (5.1)
represents the lowest order beyond 𝑑 = 2. If we are given a purported solution
(𝑥, 𝑦, 𝑧), there is no straightforward way to ascertain that it is indeed a solution
unless 𝐴 happens to be rank-one and equal to 𝑥 ⊗ 𝑦 ⊗ 𝑧. One way to put this
is that (5.1) is NP-hard but not NP-complete: given a candidate solution to an
NP-complete problem, one can check whether it is indeed a solution in polynomial
time; an NP-hard problem does not have this feature. In fact, any graph 𝐺 can be
encoded as a hypermatrix 𝐴𝐺 ∈ R𝑛×𝑛×𝑛 in such a way that finding a best rank-one
approximation to 𝐴𝐺 gives us the chromatic number of the graph (Hillar and Lim
2013), which is NP-hard even to approximate (Khanna, Linial and Safra 2000).
In case it is not clear, applying general-purpose methods such as gradient descent
or alternating least-squares to an NP-hard optimization problem does not count as an
algorithm. With few exceptions, an algorithm for a problem, say, the Golub–Kahan
bidiagonalization for SVD, must exploit the special structure of the problem. Even
though one would not expect gradient descent to yield SVD, it is not uncommon to
find ‘algorithms’ for tensor problems that do no more than apply off-the-shelf non-
linear optimization methods (e.g. steepest descent, Newton’s method) or coordinate
cycling methods (e.g. non-linear Jacobi or Gauss–Seidel).
Furthermore, it is a misconception to think that it is only NP-hard to find global
optimizers, and that if one just needs a local optimizer then NP-hardness is no
obstacle. In fact, finding local optimizers (Vavasis 1991, Theorem 5.1) or even just
stationary points (Khachiyan 1996) may well be NP-hard problems too. It is also
a fallacy to assume that since NP-hardness is an asymptotic notion, one should be
fine with ‘real-world’ instances of moderate dimensions. As David S. Johnson,
co-author of the standard compendium (Garey and Johnson 1979), puts it (Johnson
2012):
What does it mean for practitioners? I believe that the years have taught us to take the
warnings of NP-completeness seriously. If an optimization problem is NP-hard, it is rare
that we find algorithms that, even when restricted to ‘real-world’ instances, always seem to
find optimal solutions, and do so in empirical polynomial time.
194 L.-H. Lim

A pertinent example for us is the Strassen tensor M3,3,3 in Example 3.9 with
𝑚 = 𝑛 = 𝑝 = 3. With respect to the standard basis on R3×3 , it is a simple 9 × 9 × 9
hypermatrix with mostly zeros and a few ones, but its tensor rank is still unknown
today.
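For readers who want to see this hypermatrix explicitly, the following sketch writes it down (Python with NumPy; the row-major flattening of R^{3×3} and the pairing (𝐴, 𝐵, 𝐶) ↦ tr(𝐴𝐵𝐶) are our assumed conventions, which may differ from those of Example 3.9, although the tensor rank does not depend on such choices).

import numpy as np

n = 3
M = np.zeros((n*n, n*n, n*n))                 # the 9 x 9 x 9 hypermatrix of M_{3,3,3}
for i in range(n):
    for j in range(n):
        for l in range(n):
            # E_ij is the standard basis matrix with a one in position (i, j);
            # tr(E_ij E_jl E_li) = 1, every other triple of basis matrices gives 0
            M[n*i + j, n*j + l, n*l + i] = 1.0
print(int(M.sum()))                           # 27 entries equal to one, all others zero

# sanity check: M realizes the trilinear map (A, B, C) -> tr(ABC)
rng = np.random.default_rng(1)
A, B, C = rng.standard_normal((3, n, n))
val = np.einsum('IJK,I,J,K->', M, A.ravel(), B.ravel(), C.ravel())
print(np.allclose(val, np.trace(A @ B @ C)))  # True
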

5.3. Hypermatrix analysis?


Readers conversant with matrix analysis (Bellman 1997, Bhatia 1997, Horn and
Johnson 2013, 1994) or applied linear algebra (Boyd and Vandenberghe 2018,
Dym 2013, Lax 2007, Strang 1980) would certainly have noticed that abstract
linear algebra – vector spaces, dual spaces, bases, dual bases, change of basis, etc.
– can be largely or, with some effort, even completely avoided. Essentially, one
just needs to know how to work with matrices. In fact, pared down to an absolute
minimum, there is just one prerequisite: how to add and multiply matrices.
There is a highbrow partial explanation for this. The category of matrices Mat and
the category of vector spaces Vec are equivalent: abstract linear algebra and con-
crete linear algebra are by and large the same thing (Riehl 2016, Corollary 1.5.11).
The analogous statement is false for hypermatrices and tensor spaces except when
interpreted in ways that reduce it to variations of the Mat–Vec equivalence as in
Example 4.13. A main reason is that it is impossible to multiply hypermatrices
in the manner one multiplies matrices, as we have discussed in Example 2.7. The
reader may notice that if we remove everything that involves matrix multiplication
from Bellman (1997), Bhatia (1997), Horn and Johnson (2013, 1994) or Boyd and
Vandenberghe (2018), Dym (2013), Lax (2007), Strang (1980), there would hardly
be any content left.
Another big difference in going to order 𝑑 ≥ 3 is the lack of canonical forms
that we mentioned in Section 2. The 2-tensor transformation rules, i.e. equi-
valence, similarity and congruence, allow us to transform matrices to various
canonical forms: rational, Jordan, Weyr, Smith, Turnbull–Aitken, Hodge–Pedoe,
etc., or Schur and singular value decompositions with orthogonal/unitary simil-
arity/congruence and equivalence. These play an outsize role in matrix analysis
and applied linear algebra. For 𝑑 ≥ 3, while there is the Kronecker–Weierstrass
form for 𝑚 × 𝑛 × 2 hypermatrices, this is the only exception; there is no canonical
form for 𝑚 × 𝑛 × 𝑝 hypermatrices, even if we fix 𝑝 = 3, and therefore none for
𝑛_1 × · · · × 𝑛_𝑑 hypermatrices since these include 𝑚 × 𝑛 × 3 as a special case by setting
𝑛_4 = · · · = 𝑛_𝑑 = 1. Note that we did not say ‘no known’: the ‘no’ above is in the
sense of ‘mathematically proven to be non-existent’.
Tensor rank appears to give us a handle – after all, writing a matrix 𝐴 as a
sum of rank(𝐴) matrices, each of rank one, is essentially the Smith normal form
– but this is an illusion. The M_{3,3,3} example mentioned at the end of Section 5.2
is one among the hundreds of things we do not know about tensor rank. In fact,
4 × 4 × 4 hypermatrices appear to be more or less the boundary of our current
knowledge. Two of the biggest breakthroughs in recent times – finding the border
rank of M_{2,2,2} ∈ C^{4×4×4} (Landsberg 2006) and finding the equations that define
the border rank-4 tensors, i.e. the closure of {𝐴 ∈ C^{4×4×4} : rank(𝐴) ≤ 4} (Friedland 2013) – were both
about 4 × 4 × 4 hypermatrices and both required Herculean efforts, notwithstanding
the fact that border rank is a simpler notion than rank.
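To contrast with the matrix ‘handle’ invoked at the start of this paragraph, here is a minimal sketch (Python with NumPy; our own illustration, under the assumptions indicated in the comments) of how routine the 𝑑 = 2 situation is: a decomposition of 𝐴 into rank(𝐴) rank-one terms drops out of the SVD, provably and stably, whereas for 𝑑 ≥ 3 nothing remotely comparable is available.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4)) @ rng.standard_normal((4, 6))   # a 5 x 6 matrix of rank 4

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12 * s[0]))             # numerical rank (threshold is our choice)
terms = [s[k] * np.outer(U[:, k], Vt[k]) for k in range(r)]
print(r, np.allclose(A, sum(terms)))          # 4 True: A is a sum of rank(A) rank-one matrices
# for d >= 3 no provably correct analogue is known in general; even the tensor rank
# of the 9 x 9 x 9 hypermatrix M_{3,3,3} above remains open
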

Acknowledgement
Most of what I learned about tensors in the last ten years I learned from my two
former postdocs Yang Qi and Ke Ye; I gratefully acknowledge the education they
have given me. Aside from them, I would like to thank Keith Conrad, Zhen
Dai, Shmuel Friedland, Edinah Gnang, Shenglong Hu, Risi Kondor, and J. M.
Landsberg, Visu Makam, Peter McCullagh, Emily Riehl and Thomas Schultz for
answering my questions and/or helpful discussions. I would like to express heartfelt
gratitude to Arieh Iserles for his kind encouragement and Glennis Starling for her
excellent copy-editing. Figure 4.1 is adapted from Walmes M. Zeviani’s gallery
of TikZ art and Figure 4.3 is reproduced from Yifan Peng’s blog. This work is
partially supported by DARPA HR00112190040, NSF DMS-11854831, and the
Eckhardt Faculty Fund.
This article was written while taking refuge from Covid-19 at my parents’ home
in Singapore. After two decades in the US, it is good to be reminded of how
fortunate I am to have such wonderful parents. I dedicate this article to them: my
father Lim Pok Beng and my mother Ong Aik Kuan.

References
R. Abraham, J. E. Marsden and T. Ratiu (1988), Manifolds, Tensor Analysis, and Applica-
tions, Vol. 75 of Applied Mathematical Sciences, second edition, Springer.
D. Aerts and I. Daubechies (1978), Physical justification for using the tensor product to
describe two quantum systems as one joint system, Helv. Phys. Acta 51, 661–675.
D. Aerts and I. Daubechies (1979a), A characterization of subsystems in physics, Lett.
Math. Phys. 3, 11–17.
D. Aerts and I. Daubechies (1979b), A mathematical condition for a sublattice of a pro-
positional system to represent a physical subsystem, with a physical interpretation, Lett.
Math. Phys. 3, 19–27.
I. Affleck, T. Kennedy, E. H. Lieb and H. Tasaki (1987), Rigorous results on valence-bond
ground states in antiferromagnets, Phys. Rev. Lett. 59, 799–802.
S. S. Akbarov (2003), Pontryagin duality in the theory of topological vector spaces and in
topological algebra, J. Math. Sci. 113, 179–349.
J. Alman and V. V. Williams (2021), A refined laser method and faster matrix multiplication,
in Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA)
(D. Marx, ed.), Society for Industrial and Applied Mathematics (SIAM), pp. 522–539.
P. W. Anderson (1959), New approach to the theory of superexchange interactions, Phys.
Rev. (2) 115, 2–13.
A. Arias and J. D. Farmer (1996), On the structure of tensor products of l_p-spaces, Pacific
J. Math. 175, 13–37.
R. Baldick (1995), A unified approach to polynomially solvable cases of integer ‘non-separable’ quadratic optimization, Discrete Appl. Math. 61, 195–212.
J. Barnes and P. Hut (1986), A hierarchical O(N log N) force-calculation algorithm, Nature
324, 446–449.
R. Bellman (1997), Introduction to Matrix Analysis, Vol. 19 of Classics in Applied Math-
ematics, Society for Industrial and Applied Mathematics (SIAM).
A. Bényi and R. H. Torres (2013), Compact bilinear operators and commutators, Proc.
Amer. Math. Soc. 141, 3609–3621.
S. K. Berberian (2014), Linear Algebra, Dover Publications.
G. Beylkin (1993), Wavelets and fast numerical algorithms, in Different Perspectives on
Wavelets (San Antonio, TX, 1993), Vol. 47 of Proceedings of Symposia in Applied
Mathematics, American Mathematical Society, pp. 89–117.
G. Beylkin, R. Coifman and V. Rokhlin (1991), Fast wavelet transforms and numerical
algorithms I, Comm. Pure Appl. Math. 44, 141–183.
G. Beylkin, R. Coifman and V. Rokhlin (1992), Wavelets in numerical analysis, in Wavelets
and their Applications (M. B. Ruskai et al., eds), Jones & Bartlett, pp. 181–210.
R. Bhatia (1997), Matrix Analysis, Vol. 169 of Graduate Texts in Mathematics, Springer.
D. Bini (1980), Relations between exact and approximate bilinear algorithms: Applications,
Calcolo 17, 87–97.
D. Bini, M. Capovani, F. Romani and G. Lotti (1979), O(n^2.7799) complexity for n × n
approximate matrix multiplication, Inform. Process. Lett. 8, 234–235.
D. Bini, G. Lotti and F. Romani (1980), Approximate solutions for the bilinear form
computational problem, SIAM J. Comput. 9, 692–697.
C. M. Bishop (2006), Pattern Recognition and Machine Learning, Information Science
and Statistics, Springer.
A. Blass (1984), Existence of bases implies the axiom of choice, in Axiomatic Set Theory
(Boulder, CO, 1983), Vol. 31 of Contemporary Mathematics, American Mathematical
Society, pp. 31–33.
D. Bleecker (1981), Gauge Theory and Variational Principles, Vol. 1 of Global Analysis
Pure and Applied Series A, Addison-Wesley.
K. Blum (1996), Density Matrix Theory and Applications, Physics of Atoms and Molecules,
second edition, Plenum Press.
J. Board and L. Schulten (2000), The fast multipole algorithm, Comput. Sci. Eng. 2, 76–79.
A. Bogatskiy, B. Anderson, J. Offermann, M. Roussi, D. Miller and R. Kondor (2020),
Lorentz group equivariant neural network for particle physics, in Proceedings of the
37th International Conference on Machine Learning (ICML 2020) (H. Daumé III and
A. Singh, eds), Vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 992–
1002.
D. Boneh and A. Silverberg (2003), Applications of multilinear forms to cryptography,
in Topics in Algebraic and Noncommutative Geometry (Luminy/Annapolis, MD, 2001),
Vol. 324 of Contemporary Mathematics, American Mathematical Society, pp. 71–90.
W. M. Boothby (1986), An Introduction to Differentiable Manifolds and Riemannian
Geometry, Vol. 120 of Pure and Applied Mathematics, second edition, Academic Press.
S. F. Borg (1990), Matrix–Tensor Methods in Continuum Mechanics, second edition, World
Scientific.
A. I. Borisenko and I. E. Tarapov (1979), Vector and Tensor Analysis with Applications,
Dover Publications.
N. Bourbaki (1998), Algebra I, Chapters 1–3, Elements of Mathematics (Berlin), Springer.
A. Bourouihiya (2008), The tensor product of frames, Sampl. Theory Signal Image Process.
7, 65–76.
S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press.
S. Boyd and L. Vandenberghe (2018), Introduction to Applied Linear Algebra, Cambridge
University Press.
R. N. Bracewell (1986), The Fourier Transform and its Applications, McGraw-Hill Series
in Electrical Engineering, third edition, McGraw-Hill.
J. Brachat, P. Comon, B. Mourrain and E. Tsigaridas (2010), Symmetric tensor decompos-
ition, Linear Algebra Appl. 433, 1851–1872.
L. Brand (1947), Vector and Tensor Analysis, Wiley.
H. Brass and K. Petras (2011), Quadrature Theory: The Theory of Numerical Integration
on a Compact Interval, Vol. 178 of Mathematical Surveys and Monographs, American
Mathematical Society.
R. P. Brent (1970), Algorithms for matrix multiplication. Report Stan-CS-70-157, Stanford
University.
R. P. Brent and P. Zimmermann (2011), Modern Computer Arithmetic, Vol. 18 of Cam-
bridge Monographs on Applied and Computational Mathematics, Cambridge University
Press.
T. Bröcker and T. tom Dieck (1995), Representations of Compact Lie Groups, Vol. 98 of
Graduate Texts in Mathematics, Springer.
R. A. Brualdi (1992), The symbiotic relationship of combinatorics and matrix theory,
Linear Algebra Appl. 162/164, 65–105.
M. D. Buhmann (2003), Radial Basis Functions: Theory and Implementations, Vol. 12
of Cambridge Monographs on Applied and Computational Mathematics, Cambridge
University Press.
P. Bürgisser and F. Cucker (2013), Condition: The Geometry of Numerical Algorithms,
Vol. 349 of Grundlehren der Mathematischen Wissenschaften, Springer.
P. Bürgisser, M. Clausen and M. A. Shokrollahi (1997), Algebraic Complexity Theory, Vol.
315 of Grundlehren der Mathematischen Wissenschaften, Springer.
E. Callaway (2020), ‘It will change everything’: Deepmind’s AI makes gigantic leap in
solving protein structures, Nature 588 (7837), 203–204.
A. Cayley (1845), On the theory of linear transformations, Cambridge Math. J. 4, 193–209.
S. S. Chern, W. H. Chen and K. S. Lam (1999), Lectures on Differential Geometry, Vol. 1
of Series on University Mathematics, World Scientific.
Y. Choquet-Bruhat, C. DeWitt-Morette and M. Dillard-Bleick (1982), Analysis, Manifolds
and Physics, second edition, North-Holland.
P. C. Chou and N. J. Pagano (1992), Elasticity: Tensor, Dyadic and Engineering Ap-
proaches, Dover Publications.
F. Cobos, T. Kühn and J. Peetre (1992), Schatten–von Neumann classes of multilinear
forms, Duke Math. J. 65, 121–156.
F. Cobos, T. Kühn and J. Peetre (1999), On 𝔊_p-classes of trilinear forms, J. London Math.
Soc. (2) 59, 1003–1022.
A. Cohen and I. Daubechies (1993), Nonseparable bidimensional wavelet bases, Rev. Mat.
Iberoamericana 9, 51–137.
M. B. Cohen, Y. T. Lee and Z. Song (2019), Solving linear programs in the current matrix
multiplication time, in Proceedings of the 51st Annual ACM SIGACT Symposium on
Theory of Computing (STOC 2019), ACM, pp. 938–942.
T. Cohen and M. Welling (2016), Group equivariant convolutional networks, in Proceedings
of the 33rd International Conference on Machine Learning (ICML 2016) (M. F. Balcan
and K. Q. Weinberger, eds), Vol. 48 of Proceedings of Machine Learning Research,
PMLR, pp. 2990–2999.
C. Cohen-Tannoudji, B. Diu and F. Laloë (2020a), Quantum Mechanics 1: Basic Concepts,
Tools, and Applications, second edition, Wiley-VCH.
C. Cohen-Tannoudji, B. Diu and F. Laloë (2020b), Quantum Mechanics 2: Angular
Momentum, Spin, and Approximation Methods, second edition, Wiley-VCH.
K. Conrad (2018), Tensor products. Available at https://kconrad.math.uconn.edu/blurbs/
linmultialg/tensorprod.pdf.
J. B. Conway (1990), A Course in Functional Analysis, Vol. 96 of Graduate Texts in
Mathematics, second edition, Springer.
S. A. Cook and S. O. Aanderaa (1969), On the minimum computation time of functions,
Trans. Amer. Math. Soc. 142, 291–314.
F. Cucker and S. Smale (2002), On the mathematical foundations of learning, Bull. Amer.
Math. Soc. (N.S.) 39, 1–49.
E. R. Davidson (1976), Reduced Density Matrices in Quantum Chemistry, Vol. 6 of
Theoretical Chemistry, Academic Press.
A. M. Davie (1985), Matrix norms related to Grothendieck’s inequality, in Banach Spaces
(Columbia, MO, 1984), Vol. 1166 of Lecture Notes in Mathematics, Springer, pp. 22–26.
P. J. Davis and P. Rabinowitz (2007), Methods of Numerical Integration, Dover Publications.
C. De Concini and C. Procesi (2017), The Invariant Theory of Matrices, Vol. 69 of
University Lecture Series, American Mathematical Society.
R. de la Madrid (2005), The role of the rigged Hilbert space in quantum mechanics,
European J. Phys. 26, 287–312.
L. De Lathauwer, B. De Moor and J. Vandewalle (2000), A multilinear singular value
decomposition, SIAM J. Matrix Anal. Appl. 21, 1253–1278.
V. De Silva and L.-H. Lim (2008), Tensor rank and the ill-posedness of the best low-rank
approximation problem, SIAM J. Matrix Anal. Appl. 30, 1084–1127.
F. De Terán (2016), Canonical forms for congruence of matrices and 𝑇-palindromic matrix
pencils: A tribute to H. W. Turnbull and A. C. Aitken, SeMA J. 73, 7–16.
J. Debnath and R. S. Dahiya (1989), Theorems on multidimensional Laplace transform for
solution of boundary value problems, Comput. Math. Appl. 18, 1033–1056.
A. Defant and K. Floret (1993), Tensor Norms and Operator Ideals, Vol. 176 of North-
Holland Mathematics Studies, North-Holland.
E. D. Demaine, M. L. Demaine, A. Edelman, C. E. Leiserson and P.-O. Persson (2005),
Building blocks and excluded sums, SIAM News 38, 1, 4, 6.
J. W. Demmel (1997), Applied Numerical Linear Algebra, Society for Industrial and
Applied Mathematics (SIAM).
L. Deng (2012), The MNIST database of handwritten digit images for machine learning
research, IEEE Signal Process. Mag. 29, 141–142.
P. J. C. Dickinson and L. Gijben (2014), On the computational complexity of membership
problems for the completely positive cone and its dual, Comput. Optim. Appl. 57, 403–
415.
J. Diestel, J. H. Fourie and J. Swart (2008), The Metric Theory of Tensor Products:
Grothendieck’s Résumé Revisited, American Mathematical Society.
C. T. J. Dodson and T. Poston (1991), Tensor Geometry: The Geometric Viewpoint and its
Uses, Vol. 130 of Graduate Texts in Mathematics, second edition, Springer.
J. Dongarra and F. Sullivan (2000), Guest editors’ introduction to the top 10 algorithms,
Comput. Sci. Eng. 2, 22–23.
D. E. Dudgeon and R. M. Mersereau (1983), Multidimensional Digital Signal Processing,
Prentice Hall.
D. S. Dummit and R. M. Foote (2004), Abstract Algebra, third edition, Wiley.
N. Dunford and J. T. Schwartz (1988), Linear Operators, Part I, Wiley Classics Library,
Wiley.
G. Dunn (1988), Tensor product of operads and iterated loop spaces, J. Pure Appl. Algebra
50, 237–258.
H. Dym (2013), Linear Algebra in Action, Vol. 78 of Graduate Studies in Mathematics,
second edition, American Mathematical Society.
J. Earman and C. Glymour (1978), Lost in the tensors: Einstein’s struggles with covariance
principles 1912–1916, Stud. Hist. Philos. Sci. A 9, 251–278.
A. Einstein (2002), Fundamental ideas and methods of the theory of relativity, presented
in its development, in The Collected Papers of Albert Einstein (M. Janssen et al., eds),
Vol. 7: The Berlin Years, 1918–1921, Princeton University Press, pp. 113–150.
L. P. Eisenhart (1934), Separable systems of Stackel, Ann. of Math. (2) 35, 284–305.
P. Etingof, S. Gelaki, D. Nikshych and V. Ostrik (2015), Tensor Categories, Vol. 205 of
Mathematical Surveys and Monographs, American Mathematical Society.
L. D. Faddeev and O. A. Yakubovskiı̆ (2009), Lectures on Quantum Mechanics for Math-
ematics Students, Vol. 47 of Student Mathematical Library, American Mathematical
Society.
C. L. Fefferman (2006), Existence and smoothness of the Navier–Stokes equation, in The
Millennium Prize Problems (J. Carlson, A. Jaffe and A. Wiles, eds), Clay Mathematics
Institute, pp. 57–67.
H. G. Feichtinger and K. Gröchenig (1994), Theory and practice of irregular sampling, in
Wavelets: Mathematics and Applications, CRC, pp. 305–363.
R. P. Feynman, R. B. Leighton and M. Sands (1963), The Feynman Lectures on Physics,
Vol. 1: Mainly Mechanics, Radiation, and Heat, Addison-Wesley.
C. F. Fischer (1977), The Hartree–Fock Method for Atoms, Wiley.
V. Fock (1930), Näherungsmethode zur Lösung des quantenmechanischen Mehrkörper-
problems, Z. Physik 61, 126–148.
L. Fortnow (2013), The Golden Ticket: P, NP, and the Search for the Impossible, Princeton
University Press.
A. Frank and E. Tardos (1987), An application of simultaneous Diophantine approximation
in combinatorial optimization, Combinatorica 7, 49–65.
M. Frazier and B. Jawerth (1990), A discrete transform and decompositions of distribution
spaces, J. Funct. Anal. 93, 34–170.
S. H. Friedberg, A. J. Insel and L. E. Spence (2003), Linear Algebra, fourth edition,
Prentice Hall.
S. Friedland (2013), On tensors of border rank l in C^{m×n×l}, Linear Algebra Appl. 438,
713–737.
S. Friedland and E. Gross (2012), A proof of the set-theoretic version of the salmon
conjecture, J. Algebra 356, 374–379.
S. Friedland and L.-H. Lim (2018), Nuclear norm of higher-order tensors, Math. Comp.
87, 1255–1281.
S. Friedland, L.-H. Lim and J. Zhang (2019), Grothendieck constant is norm of Strassen
matrix multiplication tensor, Numer. Math. 143, 905–922.
J. Friedman (1991), The spectra of infinite hypertrees, SIAM J. Comput. 20, 951–961.
J. Friedman and A. Wigderson (1995), On the second eigenvalue of hypergraphs, Combin-
atorica 15, 43–65.
F. Fuchs, D. E. Worrall, V. Fischer and M. Welling (2020), SE(3)-transformers: 3D roto-
translation equivariant attention networks, in Advances in Neural Information Processing
Systems 33 (NeurIPS 2020) (H. Larochelle et al., eds), Curran Associates, pp. 1970–
1981.
W. Fulton and J. Harris (1991), Representation Theory: A First Course, Vol. 129 of
Graduate Texts in Mathematics, Springer.
M. Fürer (2009), Faster integer multiplication, SIAM J. Comput. 39, 979–1005.
A. A. García, F. W. Hehl, C. Heinicke and A. Macías (2004), The Cotton tensor in
Riemannian spacetimes, Classical Quantum Gravity 21, 1099–1118.
J. Garcke (2013), Sparse grids in a nutshell, in Sparse Grids and Applications, Vol. 88 of
Lecture Notes in Computational Science and Engineering, Springer, pp. 57–80.
M. R. Garey and D. S. Johnson (1979), Computers and Intractability: A Guide to the
Theory of NP-Completeness, W. H. Freeman.
S. Garg, C. Gentry and S. Halevi (2013), Candidate multilinear maps from ideal lattices, in
Advances in Cryptology (EUROCRYPT 2013), Vol. 7881 of Lecture Notes in Computer
Science, Springer, pp. 1–17.
P. Garrett (2010), Non-existence of tensor products of Hilbert spaces. Available at http://
www-users.math.umn.edu/~garrett/m/v/nonexistence_tensors.pdf.
I. M. Gel′fand, M. M. Kapranov and A. V. Zelevinsky (1992), Hyperdeterminants, Adv.
Math. 96, 226–263.
I. M. Gel′fand, M. M. Kapranov and A. V. Zelevinsky (1994), Discriminants, Resultants,
and Multidimensional Determinants, Mathematics: Theory & Applications, Birkhäuser.
C. Gentry, S. Gorbunov and S. Halevi (2015), Graph-induced multilinear maps from
lattices, in Theory of Cryptography (TCC 2015), part II, Vol. 9015 of Lecture Notes in
Computer Science, Springer, pp. 498–527.
R. Geroch (1985), Mathematical Physics, Chicago Lectures in Physics, University of
Chicago Press.
T. Gerstner and M. Griebel (1998), Numerical integration using sparse grids, Numer.
Algorithms 18, 209–232.
G. H. Golub and C. F. Van Loan (2013), Matrix Computations, Johns Hopkins Studies in
the Mathematical Sciences, fourth edition, Johns Hopkins University Press.
G. H. Golub and J. H. Welsch (1969), Calculation of Gauss quadrature rules, Math. Comp.
23, 221–230, A1–A10.
G. H. Golub and J. H. Wilkinson (1976), Ill-conditioned eigensystems and the computation
of the Jordan canonical form, SIAM Rev. 18, 578–619.
R. Goodman and N. R. Wallach (2009), Symmetry, Representations, and Invariants, Vol.
255 of Graduate Texts in Mathematics, Springer.
L. Grafakos (2014), Classical Fourier Analysis, Vol. 249 of Graduate Texts in Mathematics,
third edition, Springer.
L. Grafakos and R. H. Torres (2002a), Discrete decompositions for bilinear operators and
almost diagonal conditions, Trans. Amer. Math. Soc. 354, 1153–1176.
L. Grafakos and R. H. Torres (2002b), Multilinear Calderón–Zygmund theory, Adv. Math.
165, 124–164.
L. Greengard and V. Rokhlin (1987), A fast algorithm for particle simulations, J. Comput.
Phys. 73, 325–348.
W. Greub (1978), Multilinear Algebra, Universitext, second edition, Springer.
D. Griffiths (2008), Introduction to Elementary Particles, second edition, Wiley-VCH.
A. Grothendieck (1953), Résumé de la théorie métrique des produits tensoriels topo-
logiques, Bol. Soc. Mat. São Paulo 8, 1–79.
A. Grothendieck (1955), Produits Tensoriels Topologiques et Espaces Nucléaires, Vol. 16
of Memoirs of the American Mathematical Society, American Mathematical Society.
L. K. Grover (1996), A fast quantum mechanical algorithm for database search, in Proceed-
ings of the 28th Annual ACM Symposium on the Theory of Computing (STOC 1996),
ACM, pp. 212–219.
K. Hannabuss (1997), An Introduction to Quantum Theory, Vol. 1 of Oxford Graduate
Texts in Mathematics, The Clarendon Press, Oxford University Press.
J. Harris (1995), Algebraic Geometry: A First Course, Vol. 133 of Graduate Texts in
Mathematics, Springer.
E. Hartmann (1984), An Introduction to Crystal Physics, Vol. 18 of Commission on Crystal-
lographic Teaching: Second series pamphlets, International Union of Crystallography,
University College Cardiff Press.
D. R. Hartree (1928), The wave mechanics of an atom with a non-Coulomb central field,
I: Theory and methods, Proc. Cambridge Philos. Soc. 24, 89–132.
R. Hartshorne (1977), Algebraic Geometry, Vol. 52 of Graduate Texts in Mathematics,
Springer.
D. Harvey and J. van der Hoeven (2021), Integer multiplication in time 𝑂(𝑛 log 𝑛), Ann. of
Math. (2) 193, 563–617.
S. Hassani (1999), Mathematical Physics: A Modern Introduction to its Foundations,
Springer.
T. J. Hastie and R. J. Tibshirani (1990), Generalized Additive Models, Vol. 43 of Mono-
graphs on Statistics and Applied Probability, Chapman & Hall.
A. Hatcher (2002), Algebraic Topology, Cambridge University Press.
R. A. Hauser and Y. Lim (2002), Self-scaled barriers for irreducible symmetric cones,
SIAM J. Optim. 12, 715–723.
G. E. Hay (1954), Vector and Tensor Analysis, Dover Publications.
C. Heil (2011), A Basis Theory Primer, Applied and Numerical Harmonic Analysis,
expanded edition, Birkhäuser / Springer.
F. Heiss and V. Winschel (2008), Likelihood approximation by numerical integration on
sparse grids, J. Econometrics 144, 62–80.
S. Helgason (1978), Differential Geometry, Lie Groups, and Symmetric Spaces, Vol. 80 of
Pure and Applied Mathematics, Academic Press.
J. W. Helton and M. Putinar (2007), Positive polynomials in scalar and matrix variables,
the spectral theorem, and optimization, in Operator Theory, Structured Matrices, and
Dilations, Vol. 7 of Theta Ser. Adv. Math., Theta, Bucharest, pp. 229–306.
N. J. Higham (1992), Stability of a method for multiplying complex matrices with three
real matrix multiplications, SIAM J. Matrix Anal. Appl. 13, 681–687.
N. J. Higham (2002), Accuracy and Stability of Numerical Algorithms, second edition, Society for Industrial and Applied Mathematics (SIAM).
C. J. Hillar and L.-H. Lim (2013), Most tensor problems are NP-hard, J. Assoc. Comput.
Mach. 60, 45.
F. L. Hitchcock (1927), The expression of a tensor or a polyadic as a sum of products,
J. Math. Phys. Mass. Inst. Tech. 6, 164–189.
D. S. Hochbaum and J. G. Shanthikumar (1990), Convex separable optimization is not
much harder than linear optimization, J. Assoc. Comput. Mach. 37, 843–862.
T. Hofmann, B. Schölkopf and A. J. Smola (2008), Kernel methods in machine learning,
Ann. Statist. 36, 1171–1220.
K. Höllig and J. Hörner (2013), Approximation and Modeling with B-Splines, Society for
Industrial and Applied Mathematics (SIAM).
J. R. Holub (1970), Tensor product bases and tensor diagonals, Trans. Amer. Math. Soc.
151, 563–579.
R. A. Horn and C. R. Johnson (1994), Topics in Matrix Analysis, Cambridge University
Press.
R. A. Horn and C. R. Johnson (2013), Matrix Analysis, second edition, Cambridge Uni-
versity Press.
F. Irgens (2019), Tensor Analysis, Springer.
J. D. Jackson (1999), Classical Electrodynamics, third edition, Wiley.
N. Jacobson (1975), Lectures in Abstract Algebra, Vol II: Linear Algebra, Vol. 31 of
Graduate Texts in Mathematics, Springer.
J. C. Jaeger (1940), The solution of boundary value problems by a double Laplace trans-
formation, Bull. Amer. Math. Soc. 46, 687–693.
D. S. Johnson (2012), A brief history of NP-completeness, 1954–2012, Doc. Math. pp. 359–
376.
T. F. Jordan (2005), Quantum Mechanics in Simple Matrix Form, Dover Publications.
A. Joux (2004), A one round protocol for tripartite Diffie–Hellman, J. Cryptology 17,
263–276.
V. Kaarnioja (2013), Smolyak quadrature. Master’s thesis, University of Helsinki, Finland.
E. G. Kalnins, J. M. Kress and W. Miller, Jr. (2018), Separation of Variables and Superin-
tegrability: The Symmetry of Solvable Systems, IOP Expanding Physics, IOP Publishing.
R. P. Kanwal (1997), Linear Integral Equations, second edition, Birkhäuser.
A. A. Karatsuba and Y. Ofman (1962), Multiplication of many-digital numbers by automatic
computers, Dokl. Akad. Nauk SSSR 145, 293–294.
N. Karmarkar (1984), A new polynomial-time algorithm for linear programming, Combin-
atorica 4, 373–395.
E. A. Kearsley and J. T. Fong (1975), Linearly independent sets of isotropic Cartesian
tensors of ranks up to eight, J. Res. Nat. Bur. Standards B 79, 49–58.
V. Keshavarzzadeh, R. M. Kirby and A. Narayan (2018), Numerical integration in multiple
dimensions with designed quadrature, SIAM J. Sci. Comput. 40, A2033–A2061.
L. Khachiyan (1979), A polynomial algorithm in linear programming, Soviet Math. Dokl.
20, 191–194.
L. Khachiyan (1996), Diagonal matrix scaling is NP-hard, Linear Algebra Appl. 234,
173–179.
S. Khanna, N. Linial and S. Safra (2000), On the hardness of approximating the chromatic
number, Combinatorica 20, 393–415.
T. Kleinjung and B. Wesolowski (2021), Discrete logarithms in quasi-polynomial time in finite fields of fixed characteristic, J. Amer. Math. Soc. doi:10.1090/jams/985.
V. V. Kljuev and N. I. Kokovkin-Ščerbak (1965), On the minimization of the number of
arithmetic operations for solving linear algebraic systems of equations, Ž. Vyčisl. Mat i
Mat. Fiz. 5, 21–33.
A. A. Klyachko (1998), Stable bundles, representation theory and Hermitian operators,
Selecta Math. (N.S.) 4, 419–445.
D. E. Knuth (1998), The Art of Computer Programming, Vol. 2: Seminumerical Algorithms,
third edition, Addison-Wesley.
A. Knutson and T. Tao (1999), The honeycomb model of GL_n(C) tensor products I: Proof
of the saturation conjecture, J. Amer. Math. Soc. 12, 1055–1090.
R. Kondor and S. Trivedi (2018), On the generalization of equivariance and convolution in
neural networks to the action of compact groups, in Proceedings of the 35th International
Conference on Machine Learning (ICML 2018) (J. Dy and A. Krause, eds), Vol. 80 of
Proceedings of Machine Learning Research, PMLR, pp. 2747–2755.
R. Kondor, Z. Lin and S. Trivedi (2018), Clebsch–Gordan nets: A fully Fourier space
spherical convolutional neural network, in Advances in Neural Information Processing
Systems 31 (NeurIPS 2018) (S. Bengio et al., eds), Curran Associates, pp. 10138–10147.
T. H. Koornwinder (1980), A precise definition of separation of variables, in Geometrical
Approaches to Differential Equations (Proc. Fourth Scheveningen Conference, 1979),
Vol. 810 of Lecture Notes in Mathematics, Springer, pp. 240–263.
M. Kostoglou (2005), On the analytical separation of variables solution for a class of partial
integro-differential equations, Appl. Math. Lett. 18, 707–712.
A. I. Kostrikin and Y. I. Manin (1997), Linear Algebra and Geometry, Vol. 1 of Algebra,
Logic and Applications, Gordon & Breach Science Publishers.
S. G. Krantz (2009), Explorations in Harmonic Analysis: With Applications to Complex
Function Theory and the Heisenberg Group,Applied and Numerical Harmonic Analysis,
Birkhäuser.
S. Krishna and V. Makam (2018), On the tensor rank of 3 × 3 permanent and determinant.
Available at arXiv:1801.00496.
J.-L. Krivine (1979), Constantes de Grothendieck et fonctions de type positif sur les sphères,
Adv. Math. 31, 16–30.
A. Krizhevsky, I. Sutskever and G. E. Hinton (2012), Imagenet classification with deep
convolutional neural networks, in Advances in Neural Information Processing Systems 25
(NIPS 2012) (F. Pereira et al., eds), Curran Associates, pp. 1097–1105.
M. Lacey and C. Thiele (1997), L^p estimates on the bilinear Hilbert transform for 2 < p < ∞, Ann. of Math. (2) 146, 693–724.
M. Lacey and C. Thiele (1999), On Calderón’s conjecture, Ann. of Math. (2) 149, 475–496.
J. M. Landsberg (2006), The border rank of the multiplication of 2 × 2 matrices is seven,
J. Amer. Math. Soc. 19, 447–459.
J. M. Landsberg (2012), Tensors: Geometry and Applications, Vol. 128 of Graduate Studies
in Mathematics, American Mathematical Society.
J. M. Landsberg (2017), Geometry and Complexity Theory, Vol. 169 of Cambridge Studies
in Advanced Mathematics, Cambridge University Press.
J. M. Landsberg (2019), Tensors: Asymptotic Geometry and Developments 2016–2018,
Vol. 132 of CBMS Regional Conference Series in Mathematics, American Mathematical
Society.
J. M. Landsberg, Y. Qi and K. Ye (2012), On the geometry of tensor network states, Quantum Inform. Comput. 12, 346–354.
K. Landsman (2017), Foundations of Quantum Theory: From Classical Concepts to
Operator Algebras, Vol. 188 of Fundamental Theories of Physics, Springer.
S. Lang (1993), Real and Functional Analysis, Vol. 142 of Graduate Texts in Mathematics,
third edition, Springer.
S. Lang (2002), Algebra, Vol. 211 of Graduate Texts in Mathematics, third edition, Springer.
P. D. Lax (2007), Linear Algebra and its Applications, Pure and Applied Mathematics
(Hoboken), second edition, Wiley-Interscience.
J. M. Lee (2013), Introduction to Smooth Manifolds, Vol. 218 of Graduate Texts in Math-
ematics, second edition, Springer.
W. A. Light and E. W. Cheney (1985), Approximation Theory in Tensor Product Spaces,
Vol. 1169 of Lecture Notes in Mathematics, Springer.
D. Lovelock and H. Rund (1975), Tensor, Differential Forms, and Variational Principles,
Wiley-Interscience.
S. Lovett (2019), The analytic rank of tensors and its applications, Discrete Anal. 2019, 7.
G. W. Mackey (1978), Unitary Group Representations in Physics, Probability, and Number
Theory, Vol. 55 of Mathematics Lecture Note Series, Benjamin / Cummings.
S. Mallat (2009), A Wavelet Tour of Signal Processing, third edition, Elsevier / Academic
Press.
D. Martin (1991), Manifold Theory: An Introduction for Mathematical Physicists, Ellis
Horwood Series in Mathematics and its Applications, Ellis Horwood.
A. J. McConnell (1957), Application of Tensor Analysis, Dover Publications.
P. McCullagh (1987), Tensor Methods in Statistics, Monographs on Statistics and Applied
Probability, Chapman & Hall.
W. S. McCulloch and W. Pitts (1943), A logical calculus of the ideas immanent in nervous
activity, Bull. Math. Biophys. 5, 115–133.
J. Mercer (1909), Functions of positive and negative type and their connection with the
theory of integral equations, Philos. Trans. Roy. Soc. London Ser. A 209, 415–446.
Y. Meyer (1992), Wavelets and Operators, Vol. 37 of Cambridge Studies in Advanced
Mathematics, Cambridge University Press.
Y. Meyer and R. Coifman (1997), Wavelets: Calderón–Zygmund and Multilinear Oper-
ators, Vol. 48 of Cambridge Studies in Advanced Mathematics, Cambridge University
Press.
A. D. Michal (1947), Matrix and Tensor Calculus with Applications to Mechanics, Elasti-
city, and Aeronautics, Wiley / Chapman & Hall.
W. Miller, Jr (1977), Symmetry and Separation of Variables, Vol. 4 of Encyclopedia of
Mathematics and its Applications, Addison-Wesley.
J. W. Milnor and J. D. Stasheff (1974), Characteristic Classes, Vol. 76 of Annals of
Mathematics Studies, Princeton University Press / University of Tokyo Press.
C. W. Misner, K. S. Thorne and J. A. Wheeler (1973), Gravitation, W. H. Freeman.
P. Moon and D. E. Spencer (1961), Field Theory for Engineers, The Van Nostrand Series
in Electronics and Communications, Van Nostrand.
P. Moon and D. E. Spencer (1988), Field Theory Handbook: Including Coordinate Systems,
Differential Equations and Their Solutions, second edition, Springer.
P. M. Morse and H. Feshbach (1953), Methods of Theoretical Physics, McGraw-Hill.
K. G. Murty and S. N. Kabadi (1987), Some NP-complete problems in quadratic and nonlinear programming, Math. Program. 39, 117–129.
C. Muscalu and W. Schlag (2013), Classical and Multilinear Harmonic Analysis II, Vol.
138 of Cambridge Studies in Advanced Mathematics, Cambridge University Press.
M. Nakahara and T. Ohmi (2008), Quantum Computing: From Linear Algebra to Physical
Realizations, CRC Press.
H. Narayanan (2006), On the complexity of computing Kostka numbers and Littlewood–
Richardson coefficients, J. Algebraic Combin. 24, 347–354.
Y. Nesterov and A. Nemirovskii (1994), Interior-Point Polynomial Algorithms in Convex
Programming, Vol. 13 of SIAM Studies in Applied Mathematics, Society for Industrial
and Applied Mathematics (SIAM).
J. Nie (2017), Generating polynomials and symmetric tensor decompositions, Found.
Comput. Math. 17, 423–465.
H. Niederreiter (1992), Random Number Generation and Quasi-Monte Carlo Methods,
Vol. 63 of CBMS-NSF Regional Conference Series in Applied Mathematics, Society
for Industrial and Applied Mathematics (SIAM).
M. A. Nielsen and I. L. Chuang (2000), Quantum Computation and Quantum Information,
Cambridge University Press.
P. Niyogi, S. Smale and S. Weinberger (2008), Finding the homology of submanifolds with
high confidence from random samples, Discrete Comput. Geom. 39, 419–441.
K. C. O’Meara, J. Clark and C. I. Vinsonhaler (2011), Advanced Topics in Linear Algebra:
Weaving Matrix Problems Through the Weyr Form, Oxford University Press.
R. Orús (2014), A practical introduction to tensor networks: Matrix product states and
projected entangled pair states, Ann. Phys. 349, 117–158.
B. G. Osgood (2019), Lectures on the Fourier Transform and its Applications, Vol. 33 of
Pure and Applied Undergraduate Texts, American Mathematical Society.
B. N. Parlett (2000), The QR algorithm, Comput. Sci. Eng. 2, 38–42.
D. S. Passman (1991), A Course in Ring Theory, The Wadsworth & Brooks/Cole Mathem-
atics Series, Wadsworth & Brooks/Cole Advanced Books & Software.
W. Pauli (1980), General Principles of Quantum Mechanics, Springer.
G. Pisier (2012), Grothendieck’s theorem, past and present, Bull. Amer. Math. Soc. (N.S.)
49, 237–323.
J. Plebański and A. Krasiński (2006), An Introduction to General Relativity and Cosmology,
Cambridge University Press.
K. R. Rao and P. Yip (1990), Discrete Cosine Transform: Algorithms, Advantages, Applic-
ations, Academic Press.
M. Reed and B. Simon (1980), Functional Analysis, Vol. I of Methods of Modern Math-
ematical Physics, second edition, Academic Press.
J. Renegar (2001), A Mathematical View of Interior-Point Methods in Convex Optimization,
MPS/SIAM Series on Optimization, Society for Industrial and Applied Mathematics
(SIAM) / Mathematical Programming Society (MPS).
B. Reznick (1992), Sums of Even Powers of Real Linear Forms, Vol. 96 of Memoirs of the
American Mathematical Society, American Mathematical Society.
M. M. G. Ricci and T. Levi-Civita (1900), Méthodes de calcul différentiel absolu et leurs
applications, Math. Ann. 54, 125–201. Translation by R. Hermann: Ricci and Levi-
Civita’s Tensor Analysis Paper (1975), Vol. II of Lie Groups: History, Frontiers and
Applications, Math Sci Press.
E. Riehl (2016), Category Theory in Context, Dover Publications.
W. Rindler (2006), Relativity: Special, General, and Cosmological, second edition, Oxford
University Press.
M. D. Roberts (1995), The physical interpretation of the Lanczos tensor, Nuovo Cimento
B (11) 110, 1165–1176.
D. N. Rockmore (2000), The FFT: An algorithm the whole family can use, Comput. Sci.
Eng. 2, 30–34.
F. Rosenblatt (1958), The perceptron: A probabilistic model for information storage and
organization in the brain, Psychol. Rev. 65, 386–408.
R. A. Ryan (2002), Introduction to Tensor Products of Banach Spaces, Springer Mono-
graphs in Mathematics, Springer.
H. H. Schaefer and M. P. Wolff (1999), Topological Vector Spaces, Vol. 3 of Graduate
Texts in Mathematics, second edition, Springer.
R. Schatten (1950), A Theory of Cross-Spaces, Vol. 26 of Annals of Mathematics Studies,
Princeton University Press.
H. Scheffé (1970), A note on separation of variables, Technometrics 12, 388–393.
H. Scheffé (1999), The Analysis of Variance, Wiley Classics Library, Wiley.
K. Schmüdgen (2017), The Moment Problem, Vol. 277 of Graduate Texts in Mathematics,
Springer.
K. Schöbel (2015), An Algebraic Geometric Approach to Separation of Variables, Springer
Spektrum.
B. Scholköpf and A. J. Smola (2002), Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond, MIT Press.
A. Schönhage (1972/73), Unitäre Transformationen grosser Matrizen, Numer. Math. 20,
409–417.
A. Schönhage and V. Strassen (1971), Schnelle Multiplikation grosser Zahlen, Computing
(Arch. Elektron. Rechnen) 7, 281–292.
J. A. Schouten (1951), Tensor Analysis for Physicists, Clarendon Press.
A. Schrijver (1986), Theory of Linear and Integer Programming, Wiley-Interscience Series
in Discrete Mathematics, Wiley-Interscience.
R. F. Service (2020), ‘The game has changed’: AI triumphs at protein folding, Science
370 (6521), 1144–1145.
Y. Y. Shi, L. M. Duan and G. Vidal (2006), Classical simulation of quantum many-body
systems with a tree tensor network, Phys. Rev. A 74, 022320.
P. W. Shor (1994), Algorithms for quantum computation: Discrete logarithms and factoring,
in 35th Annual Symposium on Foundations of Computer Science (1994), IEEE Computer
Society Press, pp. 124–134.
W. Sickel and T. Ullrich (2009), Tensor products of Sobolev–Besov spaces and applications
to approximation from the hyperbolic cross, J. Approx. Theory 161, 748–786.
J. G. Simmonds (1994), A Brief on Tensor Analysis, Undergraduate Texts in Mathematics,
second edition, Springer.
S. S. Skiena (2020), The Algorithm Design Manual, Texts in Computer Science, third
edition, Springer.
S. Smale (1998), Mathematical problems for the next century, Math. Intelligencer 20, 7–15.
S. A. Smolyak (1963), Quadrature and interpolation formulas for tensor products of certain
classes of functions, Dokl. Akad. Nauk SSSR 148, 1042–1045.
N. P. Sokolov (1960), Spatial Matrices and their Applications, Gosudarstv. Izdat. Fiz.-Mat.
Lit., Moscow. In Russian.
N. P. Sokolov (1972), Introduction to the Theory of Multidimensional Matrices, Izdat.
‘Naukova Dumka’, Kiev. In Russian.
B. Spain (1960), Tensor Calculus: A Concise Course, third edition, Oliver and Boyd /
Interscience.
E. M. Stein (1993), Harmonic Analysis: Real-Variable Methods, Orthogonality, and Oscil-
latory Integrals, Vol. 43 of Princeton Mathematical Series, Princeton University Press.
I. Steinwart and A. Christmann (2008), Support Vector Machines, Information Science and
Statistics, Springer.
G. W. Stewart (2000), The decompositional approach to matrix computation, Comput. Sci.
Eng. 2, 50–59.
G. Strang (1980), Linear Algebra and its Applications, second edition, Academic Press.
V. Strassen (1969), Gaussian elimination is not optimal, Numer. Math. 13, 354–356.
V. Strassen (1973), Vermeidung von Divisionen, J. Reine Angew. Math. 264, 184–202.
V. Strassen (1987), Relative bilinear complexity and matrix multiplication, J. Reine Angew.
Math. 375/376, 406–443.
V. Strassen (1990), Algebraic complexity theory, in Handbook of Theoretical Computer
Science, Vol. A: Algorithms and Complexity, Elsevier, pp. 633–672.
A. M. Stuart and A. R. Humphries (1996), Dynamical Systems and Numerical Analysis,
Vol. 2 of Cambridge Monographs on Applied and Computational Mathematics, Cam-
bridge University Press.
J. J. Sylvester (1886), Sur une extension d’un théorème de Clebsch relatif aux courbes du
quatrième degré, C. R. Math. Acad. Sci. Paris 102, 1532–1534.
J. L. Synge and A. Schild (1978), Tensor Calculus, Dover Publications.
C.-T. Tai (1997), Generalized Vector and Dyadic Analysis, second edition, IEEE Press.
M. Takesaki (2002), Theory of Operator Algebras I, Vol. 124 of Encyclopaedia of Math-
ematical Sciences, Springer.
L. A. Takhtajan (2008), Quantum Mechanics for Mathematicians, Vol. 95 of Graduate
Studies in Mathematics, American Mathematical Society.
E. Tardos (1986), A strongly polynomial algorithm to solve combinatorial linear programs,
Oper. Res. 34, 250–256.
G. Temple (1960), Cartesian Tensors: An Introduction, Methuen’s Monographs on Physical
Subjects, Methuen / Wiley.
G. Teschl (2014), Mathematical Methods in Quantum Mechanics: With Applications to
Schrödinger Operators, Vol. 157 of Graduate Studies in Mathematics, second edition,
American Mathematical Society.
J. W. Thomas (1995), Numerical Partial Differential Equations: Finite Difference Methods,
Vol. 22 of Texts in Applied Mathematics, Springer.
K. S. Thorne and R. D. Blandford (2017), Modern Classical Physics: Optics, Fluids,
Plasmas, Elasticity, Relativity, and Statistical Physics, Princeton University Press.
A. L. Toom (1963), The complexity of a scheme of functional elements simulating the
multiplication of integers, Dokl. Akad. Nauk SSSR 150, 496–498.
L. N. Trefethen and D. Bau, III (1997), Numerical Linear Algebra, Society for Industrial
and Applied Mathematics (SIAM).
F. Trèves (2006), Topological Vector Spaces, Distributions and Kernels, Dover Publica-
tions.
P. M. Vaidya (1990), An algorithm for linear programming which requires O(((m + n)n^2 + (m + n)^{1.5} n)L) arithmetic operations, Math. Program. 47, 175–201.
L. G. Valiant (1979), The complexity of computing the permanent, Theoret. Comput. Sci.
8, 189–201.
H. A. van der Vorst (2000), Krylov subspace iteration, Comput. Sci. Eng. 2, 32–37.
N. T. Varopoulos (1965), Sur les ensembles parfaits et les séries trigonométriques, C. R.
Acad. Sci. Paris 260, 4668–4670, 5165–5168, 5997–6000.
N. T. Varopoulos (1967), Tensor algebras and harmonic analysis, Acta Math. 119, 51–112.
S. A. Vavasis (1991), Nonlinear Optimization: Complexity Issues, Vol. 8 of International
Series of Monographs on Computer Science, The Clarendon Press, Oxford University
Press.
F. Verstraete and J. I. Cirac (2004), Renormalization algorithms for quantum-many body
systems in two and higher dimensions. Available at arXiv:cond-mat/0407066.
E. B. Vinberg (2003), A Course in Algebra, Vol. 56 of Graduate Studies in Mathematics,
American Mathematical Society.
W. Voigt (1898), Die fundamentalen physikalischen Eigenschaften der Krystalle in ele-
mentarer Darstellung, Von Veit.
R. M. Wald (1984), General Relativity, University of Chicago Press.
N. R. Wallach (2008), Quantum computing and entanglement for mathematicians, in Rep-
resentation Theory and Complex Analysis, Vol. 1931 of Lecture Notes in Mathematics,
Springer, pp. 345–376.
W. Walter (1998), Ordinary Differential Equations, Vol. 182 of Graduate Texts in Math-
ematics, Springer.
A.-M. Wazwaz (2011), Linear and Nonlinear Integral Equations: Methods and Applica-
tions, Higher Education Press (Beijing) / Springer.
G. Weinreich (1998), Geometrical Vectors, Chicago Lectures in Physics, University of
Chicago Press.
H. Weyl (1997), The Classical Groups: Their Invariants and Representations, Princeton
Landmarks in Mathematics, Princeton University Press.
S. R. White (1992), Density matrix formulation for quantum renormalization groups, Phys.
Rev. Lett. 69, 2863–2866.
S. R. White and D. A. Huse (1993), Numerical renormalization-group study of low-lying
eigenstates of the antiferromagnetic 𝑆 = 1 Heisenberg chain, Phys. Rev. B 48, 3844–
3853.
J. D. Whitfield, P. J. Love and A. Aspuru-Guzik (2013), Computational complexity in
electronic structure, Phys. Chem. Chem. Phys. 15, 397–411.
N. M. J. Woodhouse (2003), Special Relativity, Springer Undergraduate Mathematics
Series, Springer.
R. C. Wrede (1963), Introduction to Vector and Tensor Analysis, Wiley.
S.-T. Yau (2020), Shiing-Shen Chern: A great geometer of 20th century. Available at https://cmsa.fas.harvard.edu/wp-content/uploads/2020/05/2020-04-22-essay-on-Chern-english-v1.pd
K. Ye and L.-H. Lim (2018a), Fast structured matrix computations: Tensor rank and
Cohn–Umans method, Found. Comput. Math. 18, 45–95.
K. Ye and L.-H. Lim (2018b), Tensor network ranks. Available at arXiv:1801.02662.
