Deep Learning
Foundations and Concepts
Christopher M. Bishop, Microsoft Research, Cambridge, UK
Hugh Bishop, Wayve Technologies Ltd, London, UK
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for
general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and
accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect
to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
PREFACE
new book is self-contained, appropriate material has been carried over from Bishop
(2006) and refactored to focus on those foundational ideas that are needed for deep
learning. This means that there are many interesting topics in machine learning dis-
cussed in Bishop (2006) that remain of interest today but which have been omitted
from this new book. For example, Bishop (2006) discusses Bayesian methods in
some depth, whereas this book is almost entirely non-Bayesian.
The book is accompanied by a web site that provides supporting material, in-
cluding a free-to-use digital version of the book as well as solutions to the exercises
and downloadable versions of the figures in PDF and JPEG formats:
https://www.bishopbook.com
The book can be cited using the following BibTeX entry:
@book{Bishop:DeepLearning24,
author = {Christopher M. Bishop and Hugh Bishop},
title = {Deep Learning: Foundations and Concepts},
year = {2024},
publisher = {Springer}
}
If you have any feedback on the book or would like to report any errors, please
send these to [email protected]
References
In the spirit of focusing on core ideas, we make no attempt to provide a com-
prehensive literature review, which in any case would be impossible given the scale
and pace of change of the field. We do, however, provide references to some of the
key research papers as well as review articles and other sources of further reading.
In many cases, these also provide important implementation details that we gloss
over in the text in order not to distract the reader from the central concepts being
discussed.
Many books have been written on the subject of machine learning in general and
on deep learning in particular. Those which are closest in level and style to this book
include Bishop (2006), Goodfellow, Bengio, and Courville (2016), Murphy (2022),
Murphy (2023), and Prince (2023).
Over the last decade, the nature of machine learning scholarship has changed
significantly, with many papers being posted online on archival sites ahead of, or
even instead of, submission to peer-reviewed conferences and journals. The most
popular of these sites is arXiv (pronounced ‘archive’), which is available at
https://arXiv.org
The site allows papers to be updated, often leading to multiple versions associated
with different calendar years, which can result in some ambiguity as to which version
should be cited and for which year. It also provides free access to a PDF of each pa-
per. We have therefore adopted a simple approach of referencing the paper according
to the year of first upload, although we recommend reading the most recent version.
Exercises
Each chapter concludes with a set of exercises designed to reinforce the key
ideas explained in the text or to develop and generalize them in significant ways.
These exercises form an important part of the text and each is graded according to
difficulty ranging from (⋆), which denotes a simple exercise taking a few moments
to complete, through to (⋆ ⋆ ⋆), which denotes a significantly more complex exercise.
The reader is strongly encouraged to attempt the exercises since active participation
with the material greatly increases the effectiveness of learning. Worked solutions to
all of the exercises are available as a downloadable PDF file from the book web site.
Mathematical notation
We follow the same notation as Bishop (2006). For an overview of mathematics
in the context of machine learning, see Deisenroth, Faisal, and Ong (2020).
Vectors are denoted by lower case bold roman letters such as x, whereas matrices
are denoted by uppercase bold roman letters, such as M. All vectors are assumed to
be column vectors unless otherwise stated. A superscript T denotes the transpose of a
matrix or vector, so that xT will be a row vector. The notation (w1 , . . . , wM ) denotes
a row vector with M elements, and the corresponding column vector is written as
w = (w1 , . . . , wM )T . The M × M identity matrix (also known as the unit matrix)
is denoted IM , which will be abbreviated to I if there is no ambiguity about its
dimensionality. It has elements Iij that equal 1 if i = j and 0 if i ≠ j. The elements
of a unit matrix are sometimes denoted by δij . The notation 1 denotes a column
vector in which all elements have the value 1. a ⊕ b denotes the concatenation of
vectors a and b, so that if a = (a1 , . . . , aN ) and b = (b1 , . . . , bM ) then a ⊕ b =
(a1 , . . . , aN , b1 , . . . , bM ). |x| denotes the modulus (the positive part) of a scalar x,
also known as the absolute value. We use det A to denote the determinant of a matrix
A.
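For readers who like to experiment, these conventions translate directly into NumPy. The sketch below is ours and purely illustrative (the variable names and values are arbitrary choices, not part of the book's notation):

```python
import numpy as np

# A column vector w = (w1, ..., wM)^T is represented as an array of shape (M, 1);
# its transpose w^T is the corresponding row vector.
w = np.array([[1.0], [2.0], [3.0]])
assert w.T.shape == (1, 3)

# The M x M identity (unit) matrix I_M has elements I_ij = 1 if i = j, else 0.
I3 = np.eye(3)

# The vector 1: a column vector whose elements are all one.
ones = np.ones((3, 1))

# Concatenation a ⊕ b of two vectors.
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
ab = np.concatenate([a, b])          # (a1, a2, b1, b2, b3)

# det A: the determinant of a square matrix A.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
assert np.isclose(np.linalg.det(A), 6.0)   # product of the diagonal here
```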
The notation x ∼ p(x) signifies that x is sampled from the distribution p(x).
Where there is ambiguity, we will use subscripts as in px (·) to denote which density
is referred to. The expectation of a function f (x, y) with respect to a random variable
x is denoted by Ex [f (x, y)]. In situations where there is no ambiguity as to which
variable is being averaged over, this will be simplified by omitting the suffix, for
instance E[x]. If the distribution of x is conditioned on another variable z, then
the corresponding conditional expectation will be written Ex [f (x)|z]. Similarly, the
variance of f (x) is denoted var[f (x)], and for vector variables, the covariance is
written cov[x, y]. We will also use cov[x] as a shorthand notation for cov[x, x].
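The sampling and expectation notation can likewise be illustrated numerically. In the following sketch (our own, with arbitrary distribution parameters chosen only for illustration), samples x ∼ p(x) are drawn from a Gaussian and used to form Monte Carlo estimates of E[x], var[x], and cov[x, y]:

```python
import numpy as np

rng = np.random.default_rng(0)

# x ~ p(x): draw samples from a Gaussian with mean 1 and variance 4.
x = rng.normal(loc=1.0, scale=2.0, size=100_000)

# Monte Carlo estimates of E[x] and var[x] = E[(x - E[x])^2].
mean_est = x.mean()
var_est = x.var()

# cov[x, y] for a linearly related pair; here cov[x, y] is close to 3 var[x] = 12.
y = 3.0 * x + rng.normal(size=x.shape)
cov_xy = np.cov(x, y)[0, 1]
```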
The symbol ∀ means ‘for all’, so that ∀m ∈ M denotes all values of m within
the set M. We use R to denote the real numbers. On a graph, the set of neighbours of
node i is denoted N (i), which should not be confused with the Gaussian or normal
distribution N (x|µ, σ²). A functional is denoted f [y] where y(x) is some function.
The concept of a functional is discussed in Appendix B. Curly braces { } denote a
set. The notation g(x) = O(f (x)) denotes that |g(x)/f (x)| is bounded as x → ∞.
For instance, if g(x) = 3x² + 2, then g(x) = O(x²). The notation ⌊x⌋ denotes the
‘floor’ of x, i.e., the largest integer that is less than or equal to x.
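Both conventions are easy to check numerically. This short Python sketch (ours, for illustration only) verifies the floor examples and the boundedness of the ratio for g(x) = 3x² + 2:

```python
import math

# The floor of x: the largest integer less than or equal to x.
assert math.floor(3.7) == 3
assert math.floor(-1.2) == -2        # floor rounds towards minus infinity

# g(x) = 3x^2 + 2 satisfies g(x) = O(x^2): the ratio of g(x) to x^2
# stays bounded (it approaches 3) as x grows.
def g(x):
    return 3 * x**2 + 2

ratios = [g(x) / x**2 for x in (10, 100, 1000)]
```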
If we have N independent and identically distributed (i.i.d.) values x1 , . . . , xN
of a D-dimensional vector x = (x1 , . . . , xD )T , we can combine the observations
into a data matrix X of dimension N × D in which the nth row of X corresponds
to the row vector xnT. Thus, the n, i element of X corresponds to the ith element of
the nth observation xn and is written xni . For one-dimensional variables, we denote
such a matrix by x, which is a column vector whose nth element is xn . Note that
x (which has dimensionality N ) uses a different typeface to distinguish it from x
(which has dimensionality D).
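The data-matrix convention corresponds to the usual array layout in NumPy, as the following illustrative sketch (with values of our own choosing) shows:

```python
import numpy as np

# N i.i.d. observations of a D-dimensional vector x, stacked so that the
# nth row of the N x D data matrix X is the transpose of the observation x_n.
N, D = 5, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(N, D))

# The element x_ni is the ith component of the nth observation x_n.
x_n = X[1]                 # the observation with n = 2 (arrays are 0-indexed)
assert X[1, 0] == x_n[0]

# For one-dimensional variables, the N observations form a single column
# vector of length N, here represented as a 1-D array.
x_scalar = rng.normal(size=N)
```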
Acknowledgements
We would like to express our sincere gratitude to the many people who re-
viewed draft chapters and provided valuable feedback. In particular, we wish to
thank Samuel Albanie, Cristian Bodnar, John Bronskill, Wessel Bruinsma, Ignas
Budvytis, Chi Chen, Yaoyi Chen, Long Chen, Fergal Cotter, Sam Devlin, Alek-
sander Durumeric, Sebastian Ehlert, Katarina Elez, Andrew Foong, Hong Ge, Paul
Gladkov, Paula Gori Giorgi, John Gossman, Tengda Han, Juyeon Heo, Katja Hof-
mann, Chin-Wei Huang, Yongchaio Huang, Giulio Isacchini, Matthew Johnson,
Pragya Kale, Atharva Kelkar, Leon Klein, Pushmeet Kohli, Bonnie Kruft, Adrian
Li, Haiguang Liu, Ziheng Lu, Giulia Luise, Stratis Markou, Sergio Valcarcel Macua,
Krzysztof Maziarz, Matěj Mezera, Laurence Midgley, Usman Munir, Félix Musil,
Elise van der Pol, Tao Qin, Isaac Reid, David Rosenberger, Lloyd Russell, Maxi-
milian Schebek, Megan Stanley, Karin Strauss, Clark Templeton, Marlon Tobaben,
Aldo Sayeg Pasos-Trejo, Richard Turner, Max Welling, Furu Wei, Robert Weston,
Chris Williams, Yingce Xia, Shufang Xie, Iryna Zaporozhets, Claudio Zeni, Xieyuan
Zhang, and many other colleagues who contributed through valuable discussions.
We would also like to thank our editor Paul Drougas and many others at Springer, as
well as the copy editor Jonathan Webley, for their support during the production of
the book.
We would like to say a special thank-you to Markus Svensén, who provided
immense help with the figures and typesetting for Bishop (2006) including the LaTeX
style files, which have also been used for this new book. We are also grateful to the
many scientists who allowed us to reproduce diagrams from their published work.
Acknowledgements for specific figures appear in the associated figure captions.
Chris would like to express sincere gratitude to Microsoft for creating a highly
stimulating research environment and for providing the opportunity to write this
book. The views and opinions expressed in this book, however, are those of the
authors and are therefore not necessarily the same as those of Microsoft or its affil-
iates. It has been a huge privilege and pleasure to collaborate with my son Hugh in
preparing this book, which started as a joint project during the first Covid lockdown.
Hugh would like to thank Wayve Technologies Ltd for generously allowing him
to work part time so that he could collaborate in writing this book as well as for
providing an inspiring and supportive environment for him to work and learn in.
The views expressed in this book are not necessarily the same as those of Wayve or
its affiliates. He would like to express his gratitude to his fiancée Jemima for her
constant support as well as her grammatical and stylistic consultations. He would
also like to thank Chris, who has been an excellent colleague and an inspiration to
Hugh throughout his life.
Finally, we would both like to say a huge thank-you to our family members
Jenna and Mark for so many things far too numerous to list here. It seems a very
long time ago that we all gathered on the beach in Antalya to watch a total eclipse
of the sun and to take a family photo for the dedication page of Pattern Recognition
and Machine Learning!
Preface v
Contents xi
2 Probabilities 23
2.1 The Rules of Probability . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 A medical screening example . . . . . . . . . . . . . . . . 25
2.1.2 The sum and product rules . . . . . . . . . . . . . . . . . . 26
2.1.3 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.4 Medical screening revisited . . . . . . . . . . . . . . . . . 30
2.1.5 Prior and posterior probabilities . . . . . . . . . . . . . . . 31
3 Standard Distributions 65
3.1 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.1 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . 66
3.1.2 Binomial distribution . . . . . . . . . . . . . . . . . . . . . 67
3.1.3 Multinomial distribution . . . . . . . . . . . . . . . . . . . 68
3.2 The Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . 70
3.2.1 Geometry of the Gaussian . . . . . . . . . . . . . . . . . . 71
3.2.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2.4 Conditional distribution . . . . . . . . . . . . . . . . . . . 76
3.2.5 Marginal distribution . . . . . . . . . . . . . . . . . . . . . 79
3.2.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 81
3.2.7 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 84
3.2.8 Sequential estimation . . . . . . . . . . . . . . . . . . . . . 85
3.2.9 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . 86
3.3 Periodic Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.3.1 Von Mises distribution . . . . . . . . . . . . . . . . . . . . 89
3.4 The Exponential Family . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.1 Sufficient statistics . . . . . . . . . . . . . . . . . . . . . . 97
3.5 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.2 Kernel densities . . . . . . . . . . . . . . . . . . . . . . . . 100
3.5.3 Nearest-neighbours . . . . . . . . . . . . . . . . . . . . . . 103
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8 Backpropagation 233
8.1 Evaluation of Gradients . . . . . . . . . . . . . . . . . . . . . . . 234
8.1.1 Single-layer networks . . . . . . . . . . . . . . . . . . . . 234
8.1.2 General feed-forward networks . . . . . . . . . . . . . . . 235
8.1.3 A simple example . . . . . . . . . . . . . . . . . . . . . . 238
8.1.4 Numerical differentiation . . . . . . . . . . . . . . . . . . . 239
8.1.5 The Jacobian matrix . . . . . . . . . . . . . . . . . . . . . 240
8.1.6 The Hessian matrix . . . . . . . . . . . . . . . . . . . . . . 242
8.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . 244
8.2.1 Forward-mode automatic differentiation . . . . . . . . . . . 246
8.2.2 Reverse-mode automatic differentiation . . . . . . . . . . . 249
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9 Regularization 253
9.1 Inductive Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
9.1.1 Inverse problems . . . . . . . . . . . . . . . . . . . . . . . 254
9.1.2 No free lunch theorem . . . . . . . . . . . . . . . . . . . . 255
9.1.3 Symmetry and invariance . . . . . . . . . . . . . . . . . . . 256
9.1.4 Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . 259
9.2 Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9.2.1 Consistent regularizers . . . . . . . . . . . . . . . . . . . . 262
9.2.2 Generalized weight decay . . . . . . . . . . . . . . . . . . 264
9.3 Learning Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.1 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.2 Double descent . . . . . . . . . . . . . . . . . . . . . . . . 268
9.4 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 270
9.4.1 Soft weight sharing . . . . . . . . . . . . . . . . . . . . . . 271
9.5 Residual Connections . . . . . . . . . . . . . . . . . . . . . . . . 274
9.6 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
9.6.1 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12 Transformers 357
12.1 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
12.1.1 Transformer processing . . . . . . . . . . . . . . . . . . . . 360
12.1.2 Attention coefficients . . . . . . . . . . . . . . . . . . . . . 361
12.1.3 Self-attention . . . . . . . . . . . . . . . . . . . . . . . . . 362
12.1.4 Network parameters . . . . . . . . . . . . . . . . . . . . . 363
12.1.5 Scaled self-attention . . . . . . . . . . . . . . . . . . . . . 366
12.1.6 Multi-head attention . . . . . . . . . . . . . . . . . . . . . 366
12.1.7 Transformer layers . . . . . . . . . . . . . . . . . . . . . . 368
12.1.8 Computational complexity . . . . . . . . . . . . . . . . . . 370
12.1.9 Positional encoding . . . . . . . . . . . . . . . . . . . . . . 371
12.2 Natural Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
12.2.1 Word embedding . . . . . . . . . . . . . . . . . . . . . . . 375
12.2.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . 377
12.2.3 Bag of words . . . . . . . . . . . . . . . . . . . . . . . . . 378
12.2.4 Autoregressive models . . . . . . . . . . . . . . . . . . . . 379
12.2.5 Recurrent neural networks . . . . . . . . . . . . . . . . . . 380
12.2.6 Backpropagation through time . . . . . . . . . . . . . . . . 381
12.3 Transformer Language Models . . . . . . . . . . . . . . . . . . . . 382
12.3.1 Decoder transformers . . . . . . . . . . . . . . . . . . . . . 383
12.3.2 Sampling strategies . . . . . . . . . . . . . . . . . . . . . . 386
12.3.3 Encoder transformers . . . . . . . . . . . . . . . . . . . . . 388
12.3.4 Sequence-to-sequence transformers . . . . . . . . . . . . . 390
12.3.5 Large language models . . . . . . . . . . . . . . . . . . . . 390
12.4 Multimodal Transformers . . . . . . . . . . . . . . . . . . . . . . 394
12.4.1 Vision transformers . . . . . . . . . . . . . . . . . . . . . . 395
12.4.2 Generative image transformers . . . . . . . . . . . . . . . . 396
12.4.3 Audio data . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.4.4 Text-to-speech . . . . . . . . . . . . . . . . . . . . . . . . 400
12.4.5 Vision and language transformers . . . . . . . . . . . . . . 402
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
14 Sampling 429
14.1 Basic Sampling Algorithms . . . . . . . . . . . . . . . . . . . . . 430
14.1.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . 430
14.1.2 Standard distributions . . . . . . . . . . . . . . . . . . . . 431
14.1.3 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . 433
14.1.4 Adaptive rejection sampling . . . . . . . . . . . . . . . . . 435
14.1.5 Importance sampling . . . . . . . . . . . . . . . . . . . . . 437
14.1.6 Sampling-importance-resampling . . . . . . . . . . . . . . 439
14.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 440
14.2.1 The Metropolis algorithm . . . . . . . . . . . . . . . . . . 441
14.2.2 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . 442
14.2.3 The Metropolis–Hastings algorithm . . . . . . . . . . . . . 445
14.2.4 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 446
14.2.5 Ancestral sampling . . . . . . . . . . . . . . . . . . . . . . 450
14.3 Langevin Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 451
14.3.1 Energy-based models . . . . . . . . . . . . . . . . . . . . . 452
14.3.2 Maximizing the likelihood . . . . . . . . . . . . . . . . . . 453
14.3.3 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . 454
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
19 Autoencoders 563
19.1 Deterministic Autoencoders . . . . . . . . . . . . . . . . . . . . . 564
19.1.1 Linear autoencoders . . . . . . . . . . . . . . . . . . . . . 564
19.1.2 Deep autoencoders . . . . . . . . . . . . . . . . . . . . . . 565
19.1.3 Sparse autoencoders . . . . . . . . . . . . . . . . . . . . . 566
19.1.4 Denoising autoencoders . . . . . . . . . . . . . . . . . . . 567
19.1.5 Masked autoencoders . . . . . . . . . . . . . . . . . . . . . 567
19.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 569
19.2.1 Amortized inference . . . . . . . . . . . . . . . . . . . . . 572
19.2.2 The reparameterization trick . . . . . . . . . . . . . . . . . 574
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
Bibliography 625
Index 641