Deep Learning
Foundations and Concepts
Christopher M. Bishop, Microsoft Research, Cambridge, UK
Hugh Bishop, Wayve Technologies Ltd, London, UK
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for
general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and
accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect
to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
PREFACE
new book is self-contained, appropriate material has been carried over from Bishop
(2006) and refactored to focus on those foundational ideas that are needed for deep
learning. This means that there are many interesting topics in machine learning dis-
cussed in Bishop (2006) that remain of interest today but which have been omitted
from this new book. For example, Bishop (2006) discusses Bayesian methods in
some depth, whereas this book is almost entirely non-Bayesian.
The book is accompanied by a web site that provides supporting material, in-
cluding a free-to-use digital version of the book as well as solutions to the exercises
and downloadable versions of the figures in PDF and JPEG formats:
https://www.bishopbook.com
The book can be cited using the following BibTeX entry:
@book{Bishop:DeepLearning24,
author = {Christopher M. Bishop and Hugh Bishop},
title = {Deep Learning: Foundations and Concepts},
year = {2024},
publisher = {Springer}
}
If you have any feedback on the book or would like to report any errors, please
send these to [email protected]
References
In the spirit of focusing on core ideas, we make no attempt to provide a com-
prehensive literature review, which in any case would be impossible given the scale
and pace of change of the field. We do, however, provide references to some of the
key research papers as well as review articles and other sources of further reading.
In many cases, these also provide important implementation details that we gloss
over in the text in order not to distract the reader from the central concepts being
discussed.
Many books have been written on the subject of machine learning in general and
on deep learning in particular. Those which are closest in level and style to this book
include Bishop (2006), Goodfellow, Bengio, and Courville (2016), Murphy (2022),
Murphy (2023), and Prince (2023).
Over the last decade, the nature of machine learning scholarship has changed
significantly, with many papers being posted online on archival sites ahead of, or
even instead of, submission to peer-reviewed conferences and journals. The most
popular of these sites is arXiv (pronounced ‘archive’), which is available at
https://arXiv.org
The site allows papers to be updated, often leading to multiple versions associated
with different calendar years, which can result in some ambiguity as to which version
should be cited and for which year. It also provides free access to a PDF of each pa-
per. We have therefore adopted a simple approach of referencing the paper according
to the year of first upload, although we recommend reading the most recent version.
Exercises
Each chapter concludes with a set of exercises designed to reinforce the key
ideas explained in the text or to develop and generalize them in significant ways.
These exercises form an important part of the text and each is graded according to
difficulty ranging from (⋆), which denotes a simple exercise taking a few moments
to complete, through to (⋆ ⋆ ⋆), which denotes a significantly more complex exercise.
The reader is strongly encouraged to attempt the exercises since active participation
with the material greatly increases the effectiveness of learning. Worked solutions to
all of the exercises are available as a downloadable PDF file from the book web site.
Mathematical notation
We follow the same notation as Bishop (2006). For an overview of mathematics
in the context of machine learning, see Deisenroth, Faisal, and Ong (2020).
Vectors are denoted by lower case bold roman letters such as x, whereas matrices
are denoted by uppercase bold roman letters, such as M. All vectors are assumed to
be column vectors unless otherwise stated. A superscript T denotes the transpose of a
matrix or vector, so that xT will be a row vector. The notation (w1 , . . . , wM ) denotes
a row vector with M elements, and the corresponding column vector is written as
w = (w1 , . . . , wM )T . The M × M identity matrix (also known as the unit matrix)
is denoted IM , which will be abbreviated to I if there is no ambiguity about its
dimensionality. It has elements Iij that equal 1 if i = j and 0 if i ≠ j. The elements
of a unit matrix are sometimes denoted by δij . The notation 1 denotes a column
vector in which all elements have the value 1. a ⊕ b denotes the concatenation of
vectors a and b, so that if a = (a1 , . . . , aN ) and b = (b1 , . . . , bM ) then a ⊕ b =
(a1 , . . . , aN , b1 , . . . , bM ). |x| denotes the modulus (the positive part) of a scalar x,
also known as the absolute value. We use det A to denote the determinant of a matrix
A.
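For readers who like to experiment, these conventions translate directly into NumPy. The sketch below is ours and purely illustrative (the variable names and values are arbitrary choices, not part of the book's notation):

```python
import numpy as np

# A column vector w = (w1, ..., wM)^T is represented as an array of shape (M, 1);
# its transpose w^T is the corresponding row vector.
w = np.array([[1.0], [2.0], [3.0]])
assert w.T.shape == (1, 3)

# The M x M identity (unit) matrix I_M has elements I_ij = 1 if i = j, else 0.
I3 = np.eye(3)

# The vector 1: a column vector whose elements are all one.
ones = np.ones((3, 1))

# Concatenation a ⊕ b of two vectors.
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
ab = np.concatenate([a, b])          # (a1, a2, b1, b2, b3)

# det A: the determinant of a square matrix A.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
assert np.isclose(np.linalg.det(A), 6.0)   # product of the diagonal here
```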
The notation x ∼ p(x) signifies that x is sampled from the distribution p(x).
Where there is ambiguity, we will use subscripts as in px (·) to denote which density
is referred to. The expectation of a function f (x, y) with respect to a random variable
x is denoted by Ex [f (x, y)]. In situations where there is no ambiguity as to which
variable is being averaged over, this will be simplified by omitting the suffix, for
instance E[x]. If the distribution of x is conditioned on another variable z, then
the corresponding conditional expectation will be written Ex [f (x)|z]. Similarly, the
variance of f (x) is denoted var[f (x)], and for vector variables, the covariance is
written cov[x, y]. We will also use cov[x] as a shorthand notation for cov[x, x].
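The sampling and expectation notation can likewise be illustrated numerically. In the following sketch (our own, with arbitrary distribution parameters chosen only for illustration), samples x ∼ p(x) are drawn from a Gaussian and used to form Monte Carlo estimates of E[x], var[x], and cov[x, y]:

```python
import numpy as np

rng = np.random.default_rng(0)

# x ~ p(x): draw samples from a Gaussian with mean 1 and variance 4.
x = rng.normal(loc=1.0, scale=2.0, size=100_000)

# Monte Carlo estimates of E[x] and var[x] = E[(x - E[x])^2].
mean_est = x.mean()
var_est = x.var()

# cov[x, y] for a linearly related pair; here cov[x, y] is close to 3 var[x] = 12.
y = 3.0 * x + rng.normal(size=x.shape)
cov_xy = np.cov(x, y)[0, 1]
```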
The symbol ∀ means ‘for all’, so that ∀m ∈ M denotes all values of m within
the set M. We use R to denote the real numbers. On a graph, the set of neighbours of
node i is denoted N (i), which should not be confused with the Gaussian or normal
distribution N (x|µ, σ²). A functional is denoted f [y] where y(x) is some function.
The concept of a functional is discussed in Appendix B. Curly braces { } denote a
set. The notation g(x) = O(f (x)) denotes that |g(x)/f (x)| is bounded as x → ∞.
For instance, if g(x) = 3x² + 2, then g(x) = O(x²). The notation ⌊x⌋ denotes the
‘floor’ of x, i.e., the largest integer that is less than or equal to x.
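Both conventions are easy to check numerically. This short Python sketch (ours, for illustration only) verifies the floor examples and the boundedness of the ratio for g(x) = 3x² + 2:

```python
import math

# The floor of x: the largest integer less than or equal to x.
assert math.floor(3.7) == 3
assert math.floor(-1.2) == -2        # floor rounds towards minus infinity

# g(x) = 3x^2 + 2 satisfies g(x) = O(x^2): the ratio of g(x) to x^2
# stays bounded (it approaches 3) as x grows.
def g(x):
    return 3 * x**2 + 2

ratios = [g(x) / x**2 for x in (10, 100, 1000)]
```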
If we have N independent and identically distributed (i.i.d.) values x1 , . . . , xN
of a D-dimensional vector x = (x1 , . . . , xD )T , we can combine the observations
into a data matrix X of dimension N × D in which the nth row of X corresponds
to the row vector xnT. Thus, the n, i element of X corresponds to the ith element of
the nth observation xn and is written xni . For one-dimensional variables, we denote
such a matrix by x, which is a column vector whose nth element is xn . Note that
x (which has dimensionality N ) uses a different typeface to distinguish it from x
(which has dimensionality D).
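The data-matrix convention corresponds to the usual array layout in NumPy, as the following illustrative sketch (with values of our own choosing) shows:

```python
import numpy as np

# N i.i.d. observations of a D-dimensional vector x, stacked so that the
# nth row of the N x D data matrix X is the transpose of the observation x_n.
N, D = 5, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(N, D))

# The element x_ni is the ith component of the nth observation x_n.
x_n = X[1]                 # the observation with n = 2 (arrays are 0-indexed)
assert X[1, 0] == x_n[0]

# For one-dimensional variables, the N observations form a single column
# vector of length N, here represented as a 1-D array.
x_scalar = rng.normal(size=N)
```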
Acknowledgements
We would like to express our sincere gratitude to the many people who re-
viewed draft chapters and provided valuable feedback. In particular, we wish to
thank Samuel Albanie, Cristian Bodnar, John Bronskill, Wessel Bruinsma, Ignas
Budvytis, Chi Chen, Yaoyi Chen, Long Chen, Fergal Cotter, Sam Devlin, Alek-
sander Durumeric, Sebastian Ehlert, Katarina Elez, Andrew Foong, Hong Ge, Paul
Gladkov, Paula Gori Giorgi, John Gossman, Tengda Han, Juyeon Heo, Katja Hof-
mann, Chin-Wei Huang, Yongchaio Huang, Giulio Isacchini, Matthew Johnson,
Pragya Kale, Atharva Kelkar, Leon Klein, Pushmeet Kohli, Bonnie Kruft, Adrian
Li, Haiguang Liu, Ziheng Lu, Giulia Luise, Stratis Markou, Sergio Valcarcel Macua,
Krzysztof Maziarz, Matěj Mezera, Laurence Midgley, Usman Munir, Félix Musil,
Elise van der Pol, Tao Qin, Isaac Reid, David Rosenberger, Lloyd Russell, Maxi-
milian Schebek, Megan Stanley, Karin Strauss, Clark Templeton, Marlon Tobaben,
Aldo Sayeg Pasos-Trejo, Richard Turner, Max Welling, Furu Wei, Robert Weston,
Chris Williams, Yingce Xia, Shufang Xie, Iryna Zaporozhets, Claudio Zeni, Xieyuan
Zhang, and many other colleagues who contributed through valuable discussions.
We would also like to thank our editor Paul Drougas and many others at Springer, as
well as the copy editor Jonathan Webley, for their support during the production of
the book.
We would like to say a special thank-you to Markus Svensén, who provided
immense help with the figures and typesetting for Bishop (2006) including the LaTeX
style files, which have also been used for this new book. We are also grateful to the
many scientists who allowed us to reproduce diagrams from their published work.
Acknowledgements for specific figures appear in the associated figure captions.
Chris would like to express sincere gratitude to Microsoft for creating a highly
stimulating research environment and for providing the opportunity to write this
book. The views and opinions expressed in this book, however, are those of the
authors and are therefore not necessarily the same as those of Microsoft or its affil-
iates. It has been a huge privilege and pleasure to collaborate with my son Hugh in
preparing this book, which started as a joint project during the first Covid lockdown.
Hugh would like to thank Wayve Technologies Ltd for generously allowing him
to work part time so that he could collaborate in writing this book as well as for
providing an inspiring and supportive environment for him to work and learn in.
The views expressed in this book are not necessarily the same as those of Wayve or
its affiliates. He would like to express his gratitude to his fiancée Jemima for her
constant support as well as her grammatical and stylistic consultations. He would
also like to thank Chris, who has been an excellent colleague and an inspiration to
Hugh throughout his life.
Finally, we would both like to say a huge thank-you to our family members
Jenna and Mark for so many things far too numerous to list here. It seems a very
long time ago that we all gathered on the beach in Antalya to watch a total eclipse
of the sun and to take a family photo for the dedication page of Pattern Recognition
and Machine Learning!
Preface v
Contents xi
2 Probabilities 23
2.1 The Rules of Probability . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 A medical screening example . . . . . . . . . . . . . . . . 25
2.1.2 The sum and product rules . . . . . . . . . . . . . . . . . . 26
2.1.3 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.4 Medical screening revisited . . . . . . . . . . . . . . . . . 30
2.1.5 Prior and posterior probabilities . . . . . . . . . . . . . . . 31
3 Standard Distributions 65
3.1 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.1 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . 66
3.1.2 Binomial distribution . . . . . . . . . . . . . . . . . . . . . 67
3.1.3 Multinomial distribution . . . . . . . . . . . . . . . . . . . 68
3.2 The Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . 70
3.2.1 Geometry of the Gaussian . . . . . . . . . . . . . . . . . . 71
3.2.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2.4 Conditional distribution . . . . . . . . . . . . . . . . . . . 76
3.2.5 Marginal distribution . . . . . . . . . . . . . . . . . . . . . 79
3.2.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 81
3.2.7 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 84
3.2.8 Sequential estimation . . . . . . . . . . . . . . . . . . . . . 85
3.2.9 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . 86
3.3 Periodic Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.3.1 Von Mises distribution . . . . . . . . . . . . . . . . . . . . 89
3.4 The Exponential Family . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.1 Sufficient statistics . . . . . . . . . . . . . . . . . . . . . . 97
3.5 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.2 Kernel densities . . . . . . . . . . . . . . . . . . . . . . . . 100
3.5.3 Nearest-neighbours . . . . . . . . . . . . . . . . . . . . . . 103
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8 Backpropagation 233
8.1 Evaluation of Gradients . . . . . . . . . . . . . . . . . . . . . . . 234
8.1.1 Single-layer networks . . . . . . . . . . . . . . . . . . . . 234
8.1.2 General feed-forward networks . . . . . . . . . . . . . . . 235
8.1.3 A simple example . . . . . . . . . . . . . . . . . . . . . . 238
8.1.4 Numerical differentiation . . . . . . . . . . . . . . . . . . . 239
8.1.5 The Jacobian matrix . . . . . . . . . . . . . . . . . . . . . 240
8.1.6 The Hessian matrix . . . . . . . . . . . . . . . . . . . . . . 242
8.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . 244
8.2.1 Forward-mode automatic differentiation . . . . . . . . . . . 246
8.2.2 Reverse-mode automatic differentiation . . . . . . . . . . . 249
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9 Regularization 253
9.1 Inductive Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
9.1.1 Inverse problems . . . . . . . . . . . . . . . . . . . . . . . 254
9.1.2 No free lunch theorem . . . . . . . . . . . . . . . . . . . . 255
9.1.3 Symmetry and invariance . . . . . . . . . . . . . . . . . . . 256
9.1.4 Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . 259
9.2 Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9.2.1 Consistent regularizers . . . . . . . . . . . . . . . . . . . . 262
9.2.2 Generalized weight decay . . . . . . . . . . . . . . . . . . 264
9.3 Learning Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.1 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.2 Double descent . . . . . . . . . . . . . . . . . . . . . . . . 268
9.4 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 270
9.4.1 Soft weight sharing . . . . . . . . . . . . . . . . . . . . . . 271
9.5 Residual Connections . . . . . . . . . . . . . . . . . . . . . . . . 274
9.6 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
9.6.1 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12 Transformers 357
12.1 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
12.1.1 Transformer processing . . . . . . . . . . . . . . . . . . . . 360
12.1.2 Attention coefficients . . . . . . . . . . . . . . . . . . . . . 361
12.1.3 Self-attention . . . . . . . . . . . . . . . . . . . . . . . . . 362
12.1.4 Network parameters . . . . . . . . . . . . . . . . . . . . . 363
12.1.5 Scaled self-attention . . . . . . . . . . . . . . . . . . . . . 366
12.1.6 Multi-head attention . . . . . . . . . . . . . . . . . . . . . 366
12.1.7 Transformer layers . . . . . . . . . . . . . . . . . . . . . . 368
12.1.8 Computational complexity . . . . . . . . . . . . . . . . . . 370
12.1.9 Positional encoding . . . . . . . . . . . . . . . . . . . . . . 371
12.2 Natural Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
12.2.1 Word embedding . . . . . . . . . . . . . . . . . . . . . . . 375
12.2.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . 377
12.2.3 Bag of words . . . . . . . . . . . . . . . . . . . . . . . . . 378
12.2.4 Autoregressive models . . . . . . . . . . . . . . . . . . . . 379
12.2.5 Recurrent neural networks . . . . . . . . . . . . . . . . . . 380
12.2.6 Backpropagation through time . . . . . . . . . . . . . . . . 381
12.3 Transformer Language Models . . . . . . . . . . . . . . . . . . . . 382
12.3.1 Decoder transformers . . . . . . . . . . . . . . . . . . . . . 383
12.3.2 Sampling strategies . . . . . . . . . . . . . . . . . . . . . . 386
12.3.3 Encoder transformers . . . . . . . . . . . . . . . . . . . . . 388
12.3.4 Sequence-to-sequence transformers . . . . . . . . . . . . . 390
12.3.5 Large language models . . . . . . . . . . . . . . . . . . . . 390
12.4 Multimodal Transformers . . . . . . . . . . . . . . . . . . . . . . 394
12.4.1 Vision transformers . . . . . . . . . . . . . . . . . . . . . . 395
12.4.2 Generative image transformers . . . . . . . . . . . . . . . . 396
12.4.3 Audio data . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.4.4 Text-to-speech . . . . . . . . . . . . . . . . . . . . . . . . 400
12.4.5 Vision and language transformers . . . . . . . . . . . . . . 402
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
14 Sampling 429
14.1 Basic Sampling Algorithms . . . . . . . . . . . . . . . . . . . . . 430
14.1.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . 430
14.1.2 Standard distributions . . . . . . . . . . . . . . . . . . . . 431
14.1.3 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . 433
14.1.4 Adaptive rejection sampling . . . . . . . . . . . . . . . . . 435
14.1.5 Importance sampling . . . . . . . . . . . . . . . . . . . . . 437
14.1.6 Sampling-importance-resampling . . . . . . . . . . . . . . 439
14.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 440
14.2.1 The Metropolis algorithm . . . . . . . . . . . . . . . . . . 441
14.2.2 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . 442
14.2.3 The Metropolis–Hastings algorithm . . . . . . . . . . . . . 445
14.2.4 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 446
14.2.5 Ancestral sampling . . . . . . . . . . . . . . . . . . . . . . 450
14.3 Langevin Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 451
14.3.1 Energy-based models . . . . . . . . . . . . . . . . . . . . . 452
14.3.2 Maximizing the likelihood . . . . . . . . . . . . . . . . . . 453
14.3.3 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . 454
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
19 Autoencoders 563
19.1 Deterministic Autoencoders . . . . . . . . . . . . . . . . . . . . . 564
19.1.1 Linear autoencoders . . . . . . . . . . . . . . . . . . . . . 564
19.1.2 Deep autoencoders . . . . . . . . . . . . . . . . . . . . . . 565
19.1.3 Sparse autoencoders . . . . . . . . . . . . . . . . . . . . . 566
19.1.4 Denoising autoencoders . . . . . . . . . . . . . . . . . . . 567
19.1.5 Masked autoencoders . . . . . . . . . . . . . . . . . . . . . 567
19.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 569
19.2.1 Amortized inference . . . . . . . . . . . . . . . . . . . . . 572
19.2.2 The reparameterization trick . . . . . . . . . . . . . . . . . 574
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
Bibliography 625
Index 641