Deep Learning

deep learnig

Uploaded by

Maktek Integrated Technologies

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF or read online on Scribd

0% found this document useful (0 votes)

986 views800 pages

Deep Learning

deep learnig

Uploaded by

Maktek Integrated Technologies

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF or read online on Scribd

You are on page 1/ 800

Deep Learning Tan Guodfellow Yoshua Bengio Aaron CourvilleContents Website Acknowledgments Notation 1 Introduction 1.1 Who Should Read This Book’ 1.2 Historical ‘Itends in Deep Learning... . « Applied Math and Machine Learning Basics Linear Algebra 21° Si rs, Vectors, Matrices and Tensors 2 Multiplying Matrices and Vectors... .. . Identity and Inverse Matrice Linear Dependence and Span Nami ae ree tc sees ee ce cee lee Special Kinds of Matrices and Vectors Eigendecomposition . . Singular Value Decomposition ote 2.9 The Moore-Penrose Pseudoinverse. .... . . 2.10 The Trace Operator 2.11 The Determinant ....... 2.12 Example: Principal Components Analys Probability and Information Theory 3.1 Why Probability? ...........2...4 29 31 31 34 36 37 39 42 44 45 « 47 ieee oe 48 53.CONTENTS 3.2 Random Variables... 0.2.2... eee ee eee ee 56 3.3. Probability Distributions . 56 34 Marginal Probability . . . 58 3.5 Conditional Probability : 59 3.6 The Chain Rule of Conditional Probabilities . 59 3.7 Independence and Conditional Independence . 60 3.8 Expectation, Variance and Covariance : 60 3.9 Common Probability Distributions»... . . 62 3.10 Useful Properties of Common Functions 67 3.11 Bayes’Rule........ 70 3.12 Technical Details of Continuous Var a 3.13 Information Theory 3 3.14. Structured Probabilistie Models... 2... . 5 4 Numerical Computation 80 4.1 Overflow and Underflow 80 4.2 Poor Conditioning... 82 43 Gradient-Based Optimization... 1... - 82 44 — Constrained Optimization 93 4.5 Example: Linear Least Squares 2... 0... 96 5 Machine Learning Basics 98 5.1 Learning Algorithms 99 5.2 Capacity, Overfitting and Underfitting .. . . 110 5.3 Hyperparameters and Validation Sets . 120 54 Estimators, Bias and Variance. ... 6... 122 5.5 Maximum Likelihood Estimation .. 2... . 131 5.6 Bayesian Statistics 135 5.7 Supervised Learning Algorithms . . 140 5.8 Unsupervised Learning Algorithms .... . . 146 5.9 Stochastic Gradient Descent .........- 151 5.10 Building a Machine Learning Algorithm .............. 153 5.11 Challenges Motivating Deep Learning... . . 155 IL Deep Networks: Modern Practices 166 6 Deep Feedforward Networks 168 6.1 Example: Learning XOR . qi 6.2 Gradient-Based Learning . 77CONTENTS 7 We Rom ON tf 191 64 Architecture Design... 0... ee ee +2. 197 6.5 Back-Propagation and Other Differentiation Algorithms... . . 204 GG Ebtoriog! Notes scr ete ewer eee ewer enwimene 224 Regularization for Deep Learning 7.1 Parameter Norm Penalties . . 7.2. Norm Penalties as Constrained Optimization . 7.3 Regularization and Under-Constrained Problems 74 Dataset Augmentation . . 7.5 Noise Robustness... . . 7.6 Semi-Supervised Learning 7.7 Multi-Task Learning 7.8 Early Stopping ..... 7.9 Parameter Tying and Parameter Sharing 7.10 Sparse Representations . . . 711 Bagging and Other Ensemble Methods . . . . 7.12 Dropout 7.13 Adversarial Training 7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 270 Optimization for Training Deep Models 274 8.1 How Learning Differs from Pure Optimization 8.2 Challenges in Neural Network Optimization . 8.3. Basic Algorithms : bees 8.4 Parameter Initialization Strategies Geet 8.5 Algorithms with Adaptive Learning Rates . . 8.6 Approximat> Second-Order Methods... . 8.7 Optimization Strategies and Meta-Algorithms Convolutional Networks 330 9.1 The Convolution Operation 331 9.2 Motivation . 335 9.3 Pooling. 339 9.4 Convolution and Pooling as an Infinitely Strong Prior . = Bd5 9.5 Variants of the Basic Convolution Function 9.6 Structured Outputs... . 9.7 Data Types pa 9.8 Efficient Convolution Algorithms»... . . 9.9 Random or Unsupervised Features»... . .CONTENTS 9.10 The Neuroscientific Basis for Convolutional Networks... .. . . 364 9.11 Convolutional Networks and the History of Deep Learning... . 371 10 Sequence Modeling: Recurrent and Recursive Nets 373 10.1 Unfolding Computational Graphs . . . 10.2 Recurrent Neural Networks 10.3. Bidirectional RNNs ..... . 10.4 Encoder-Decoder Sequence-to-Sequence Architectures... ... . . 396 10.5 Deep Recurrent Networks 398 10.6 Recursive Neural Networks 400 10.7 The Challenge of Long-Term Dependencies . . 401 10.8 Echo State Networks 4 404 10.9 Leaky Units and Other Strategies for Multiple Time Seales. . . . 406 10.10 The Long Short-Term Memory and Other Gated RNNs.... . « 408 10.11 Optimization for Long-Term Dependencies ............. 413 10.12 Explicit Memory»... ....22000-% 416 11 Practical Methodology 421 11.1 Performance Metries . . . 11.2 Default Baseline Models. 11.3 Determining Whether to Gather More Data 114 Selecting Hyperparameters... 6.0... 5. 11.5 Debugging Strategies 11.6 Example: Multi-Digit Number Recognition . . 12 Applications 443 12.1 Large-Scale Deep Hearing seer eee e-em eteaeet eetae 443 12.2. Computer Vision 452 12.3. Speech Recognition .. . . 458 124 Natural Language Processing ......... 461 125 Other Applications 478 III Deep Learning Research 486 13 Linear Factor Models +489 13.1. Probabilistie PCA and Factor Analysis. . . « 490 13.2 Independent Component Analysis (ICA) . . . 491 13.3 Slow Feature Analysis ............ . 493 134 Sparse Coding. ... 0,-2.0... 00000- 496CONTENTS 13.5 Manifold Interpretation of PCA... 2.6... ee ee ee 499 14 Autoencoders 502 14.1 Undercomplete Autoencoders =. 508 14.2. Regularized Autoencoders . . +. 504 143. Representational Power, Layer Size and Depth + + 508 14.4 Stochastic Encoders and Decoders .. 509 14.5 Denoising Autoencoders . . . .. 510 14.6 Learning Manifolds with Autoencoders . . 515 14.7 Contractive Autoencoders ....... . 14.8 Predictive Sparse Decomposition + 523 14.9 Applications of Autoencoders . . . tabs Hoes BM 15 Representation Learning 15.1 Greedy Layer-Wise Unsupervised Pretraining 15.2 Transfer Learning and Domain Adaptation . . 15.3 Semi-Supervised Disentangling of Causal Factors 15.4 Distributed Representation . 15.5 Exponential Gains from Depth te 15.6 Providing Clues to Discover Underlying Causes . . 16 Structured Probabilistic Models for Deep Tearing 558 16.1 The Challenge of Unstructured Modeling . . . - 559 16.2 Using Graphs to Describe Model Structure . . » 563 16.3 Sampling from Graphical Models... . . 580 16.4 Advantages of Structured Modeling . . . . 582 16.5 Learning about Dependencies ee . 582 16.6 Inference and Approximate Inference . . 584 tic Models 585 16.7 The Deep Learning Approach to Structured Probabi 17 Monte Carlo Methods 590 17.1 Sampling and Monte Carlo Methods .. 590 17.2 Importance Sampling -. 592 173. Markov Chain Monte Carlo Methods - 595 174 Gibbs Sampling... ....00.02005 599 175. The Challenge of Mixing between Separated Modes . - 599 18 Confronting the Partition Function 605 18.1 The Log-Likelihood Gradient... . fees. 606 18.2 Stochastic Maximum Likelihood and Contrastive Divergenca . . 607CONTENTS 19 18.3 18.4 18.5 18.6 18.7 Approximate Inference Pseudolikeliaood Score Matching and Ratio Matching .. . . spans . + G17 Denoising Score Matching .. 2.0.20... 0020 e eee ee 619 Noise-Contrastive Estimation ......... 620 Estimating the Partition Function... .. . . 623 19.1 Inference as Optimization 19.2 Expectation Maximization . . 19.3. MAP Infereace and Sparse Coding ..... . 19.4 Variational Inference and Learning ..... 19.5 Learned Approximate Inference»... 2...) ia-e Gb 20 Deep Generative Models 654 20.1 Boltzmann Machines 2.22.22... eee ee eee 654 20.2. Restricted Boltzmann Machines... ... . . 656 20.3 Deep Belief Networks . . . 660 20.4 Deep Boltzmann Machines... . » 663 20.5 Boltzmann Machines for Real-Valued Data . 676 20.6 Convolutional Boltzmann Machines»... . . 683 20.7 Boltzmann Machines for Structured or Sequential Outputs ... . 685 20.8 Other Boltzmann Machines ...... 2... 686 20.9 Back-Propagation through Random Operations 687 20.10 Directed Generative Nets... 22.0... 692 20.11 Drawing Samples from Autoencoders ead 20.12 Generative Stochastic Networks... 2... 74 20.13 Other Generation Schemes .. 2.2.2... 716 20.14 Evaluating Generative Models . 2... 2... 717 20.15 Conclusion . . 720 Bibliography Ta. Index 777Website www.deeplearningbook.org This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors. viiAcknowledgments This book would not have been possible without the contributions of many people. We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Gaglar Giilgehre, David Krueger, Hugo Larochelle, Razvan Pascan and Thomas Rohée. We would like to thank the people who offered feedback on the content of the book itself, Some offered feedback on many chapters: Martin Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Bigici, Matko Bo&njak, John Boersma, Greg Brockman, Alexandre de Brébisson, Pierre Lue Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Geglar Giilgehre, Jurgen Van Gael, Javier Alonso Garcia, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Asifullah Khan, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matfas Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Miirk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Eddie Pierce, Kari Pulli, Roussel Rahman, Tapani Raiko, Anurag Ranjan, Johannes Roith, Mihacla Rosca, Halis Sak, César Salgado, Grigory Sapunov, Yoshinori Sasaki, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Andre Simpelo, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Saez, Graham Taylor, Valentin Tolmer, Massimiliano Tomassoli, An ‘Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, Martin Vita, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Ke Yang, Li Yao, Zygmunt Zajac and Ozan Gaglayan. We would also like to thank those who provided us with useful feedback on individual chapters: © Notation: Zhang Yuanhang. ¢ Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi,CONTENTS Charlie Gorichanaz, Brendan Loudermilk, Erie Morris, Cosmin Parvuleseu and Alfredo Solano. Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banié, Kevin Bennett, Philippe Castonguay, Oscar Chang, Brie Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, Istvan Petras, Dennis Prangle, Thomas Rohée, Gitanjali Gulve Sehgal, Colby Toland, Alessandro Vitale and Bob Welland. Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Alexey Surkov and Volker Tresp. Chapter 4, Numeri Yuhuang. al Computation: Tran Lam Anan Fischer and Hu Chapter 5, Machine Learning Basics: Dzmitry Bahdanan, Justin Domingue, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Peter Shepard, Kee-Bong Song, Zheng Sun and Andy Wu. Chapter 6, Deep Feedforward Networks: Uriel Bercugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Weok Kim, David Krueger and Aditya Kumar Praharaj. Chapter 7, Regularization for Deep Learning: Morten Kolbeek, Kshitij Lauria, Inkyu Lee, Sunil Mohan, Hai Phong Phan and Josh:a Salisbury. Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Peter Armitage, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens, Kashif Rasul, Klaus Strobl and Nicholas Turner. Chapter 9, Convolutional Networks: Martin Arjovsky, Eugene Brevdo, Kon- stantin Divilov, Erie Jensen, Mehdi Mirza, Alex Paino, Marjorie Sayer, Ryan Stout and Wenteo Wu. Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gékgen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang. © Chapter 11, Practical Methodology: Daniel Beckstein. © Chapter 12, Applications: George Dahl, Vladimir Nekrasov and Ribana Roscher. © Chapter 13, Lincar Factor Models: Jayanth Koushik.CONTENTS © Chapter 15, Representation Learning: Kunal Ghosh. © Chapter 16, Structured Probabi and Anton Varfolom. tic Models for Deep Learning: Minh Lé © Chapter 18, Confronting the Partition Function: Sam Bowman. © Chapter 19, Approximate Inference: Yujia Bao. © Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon. © Bibliography: Lukas Michelbacher and Leslie N. Smith. We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text. We would like to thank Lu Wang for writing pdf2htmIEX, which we used to make the web version of the book, and for offering support to improve the quality of the resulting HTML. We would like to thank Ian’s wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading. We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian’s former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.Notation This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, we describe most of these ideas in chapters 2-4. Numbers and Arrays a A sscalar (integer or real) a A vector A Amatrix A A tensor I, Identity matrix with n rows and n columns I Identity matrix with dimensionality implied by context ef) Standard basis vector [0, Lat position i 0,1,0,...,0] with a diag(a) A square, diagonal matrix with diagonal entries given by a a A scalar random variable a A vector-valued random variable A matrix-valued random variable xiCONTENTS Sets and Graphs A A set R The set of real numbers {0,1} ‘The set containing 0 and 1 {0,1,...,n} The sot of all integers between 0 and » [a,b] Tho real interval including a and 6 (a,6) ‘The real interval excluding a but including 6 A\B Set subtraction, i.e., the set containing the elements of A that are not in B G A graph Pag(x:) ‘The parents of x; in G Indexing a; Blement é of vector a, with indexing starting at 1 a; Allelemerts of vector @ except for element i Ai; Element i.j of matrix A ‘Ai, Row j of matrix A A. Column i of matrix A Aig Blement (i,j,k) of a 3D tensor A A,,,; 2-D slice of a 3-D tensor a; Element i of the random vector a Linear Algebra Operations AT Transpose of matrix A A* — Moore-Penrose pseudoinverse of A A®B_ Element-wise (Hadamard) product of A and B det(A) Determinant of ACONTENTS Vxy ar Or Ves (a) or H(f)(x) Calculus Derivative of y with respect to « Partial derivative of y with respect to x Gradient of y with respect to x Matrix derivatives of y with respect to X Tensor containing derivatives of y with respect to x Jacobian matrix J € R"*" of f :R" +R” ‘The Hessian matrix of f at input point & Definite integral over the entire domain of a Definite integral with respect to a over the set S Probability and Information Theory alb alb|c Pla) Pla) anP Ex~plf(x)] or Eftx) Var(f(z)) Cov( f(z), g(x); H(x) Du(P|lQ) N(w;¢,%) The random variables a and b are independent ‘They are conditionally independent given ¢ A probability distribution ove: a discrete variable A probability distribution over a continuous variable, or over a variable whose type has not been specified Random variable a has distribution P_ Expectation of f(x) with respect to P(x) Variance of f(z) under P(x) Covariance of f(r) and g(x) under P(x) Shannon entropy of the random variable x Kullback-Leibler divergence of P and Q Gaussian distribution over @ with mean p: and covariance 5CONTENTS f:A3B fog f(@;) condition Functions The function f with domain A and rang B Composition of the functions f and g A function of @ parametrized by @. (Sometimes we write f(«) and omit the argument 8 to lighten notation) Natural logarithm of ar 1 1+ exp(—a) Softplus, log(1 + exp(x)) L? norm of & Logistic sigmoid, L? porm of x Positive part of 2, .c., max(0, 2) is 1 if the condition is true, 0 otherwise Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(«), f(X), or f(X). This denotes the application of f to the array clement-wise. For example, if C = o(X), then Cie = o(Xig,x) for all valid values of i, j and k. Pasta Panta aol) y ory Datasets and Distributions The data generating distribution ‘The empirical distribution defined by the training set A set of training examples The -th cxample (input) from a dataset ‘The target associated with a) for supervised learning The m x n matrix with input example x in row Xi;Chapter 1 Introduction Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997). When programmable computers were first conceived, people wondered whether such machines might become intelligent, over a hundred years before one was built (Lovelace, 1812). Today, artificial intelligence (AI} is a thriving field with many practical applications and active research topics. We look to intelligent. software to automate routine labor, understand speech or smages, make diagnoses in medicine and support basic scientific research. In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images. This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to leazn complicated concepts by building them out of simpler ones. If we draw a graph showing how these 1CHAPTER 1. INTRODUCTION concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AT deep learning. Many of the early successes of AT took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly cireurscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not cue to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of comple:ely formal rules, easily provided ahead of time by the programmer. Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person’s everyday life requires an immense amount of knowledge about the world. Muck of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers neec to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer. Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Gye is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failec to understand a story about a person named Fred shaving in the morning (Linde, 192). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore asked whether Fred was still a person while he was shaving. ‘The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data, ‘This capability is known as machine learning. The 2CHAPTER 1, INTRODUCTION introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (\or-Yosef ef al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail The performance o these simple machine learning algcrithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression leams how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression was given an MRI scan of the patient, rather than the docto-’s formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery. This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not surprising that the choice of representation has an enormous effect on the performance of machine learning a‘gorithms. For a simple visual example, see figure 1.1. Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child. However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the waeel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.CHAPTER 1. INTRODUCTION Cartesian coordinates Polar coordinates + a wah, vy Wy, ihn Te of rt y 3 vey win? , ee | eee Way OF wires . Some” | ee ae Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, ‘we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. Figure produced in collaboration wit David Warde-Farley. One solution to this problem is to use machine learniag to discover not only the mapping from representation to output but also tke representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AT systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or a complex task in hours to months. Manually designing features for a complex task zequires a great deal of human time and effort; it can take decades for an entire community of researchers. ithm is the au- The quintessential example of a representation learning algo toencoder. An autozncoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder aad then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties. When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word “factors” simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not 4CHAPTER 1, INTRODUCTION quantities that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. ‘They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking, When analyzing fan image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun. A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The irdividual pixels in an image of a red car might be very close to black at night. The shape of the car’s silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about. Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us. Deep learning solves this central problem in representation learning by intro- ducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Figure 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges. ‘The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input. ‘The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer’s memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later 5CHAPTER 1. INTRODUCTION Output (object identity) Sed hidden layer (chject parts) 2nd hidden layer (comers and contours) Ist hidden layer (caues) Figure 1.2: Ilustration ofa deep learning model. It is difficult fora computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values, The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmcuntable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extrects increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the lind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer’s description of the edges, the second hidden layer ean easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer’s description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and comers. Finally, this description of the image in terms of the object parts it contains can, be used to recognize the objects present in the image. Images reproduced with permission, from Zeiler and Fergus (2014).CHAPTER 1. INTRODUCTION Blement Element Set Set Logis Logistic Regres Regression Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. ‘The computation depicted in these graphs is the output of a logistic regression model, o(w?z), where o is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the slements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one. QOH instructions can refer back to the results of earlier instructions. According to this view of deep learning, not all of the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that 1elps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing. ‘There are two main ways of measuring the depth of a model. The first view is based on the number cf sequential instructions that must be executed to evaluate the architecture, We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model's outputs given its inputs. Just as twe equivalent computer programs will have different lengths depending on which lenguage the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as ind:vidual steps in the flowchart. Figure 1.3 illustrates how this choice of language can give two different measurements fo: the same architecture. Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth 7CHAPTER 1, INTRODUCTION of the flowchart of the computations needed to compute the representation of each concept may be much deeper than the graph of tke concepts themselves. This is because the system’s understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times. Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.” However, deep learning can safely be regarded as the study of models that either involve a greater amount of composition of learned functions or learned concepts than traditional machine leerning does. ‘To summarize, deep learning, the subject of this book, is an approach to AL. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Figure 1.4 illustrates the relationship between these different Al disciplines. Figure 1.5 gives a high-level schematic of how each works. 1.1 Who Should Read This Book? ‘This book can be usefil for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in 8CHAPTER 1. INTRODUCTION Deep learning. Example: Shallow Example: santoencoders MPs Example Example Logistic regression Representation learning. Machine learning, Figure 1.4: A Venn diagram showing how deep learning, a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Vem diagram includes an example cf an AI technology.CHAPTER 1. INTRODUCTION A Outpt ouput 4 4 A Tatil Mapping from Mapping from layers of more Outpt : ee A 4 4 A Han ace ||| fumes | | Stel A 4 { A Faeent Cie systems learning Representation caring Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each other within different Al disciplines. Shaded boxes indicate components that are able to learn from data, 10CHAPTER 1, INTRODUCTION many software disciplines including computer vision, speech and audio processing, natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance. This book has beer organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning. Readers should feel free to skip parts that are not relevant given their interests or background. Readers familiar with linear algebra, probability, and fundamental machine learning concepts can skip part I, for example, while readers who just want to implement a working system need not read beyond part II. To help choose which chapters to read, figure 1.6 provides a flowchart showing the high-level organization of the book. We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory. 1.2 Historical Trends in Deep Learning It is easiest to understend deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends: © Deep learning has had a long and rich history, but has gone by many names reflecting differeat philosophical viewpoints, and has waxed and waned in popularity. « Deep learning has become more useful as the amount of available training data has increased. # Deep learning models have grown in size over time as computer infrastructure (both hardware end software) for deep learning has ‘mproved. # Deep learning has solved increasingly complicated applications with increasing accuracy over time. uCHAPTER 1. INTRODUCTION 1. Introduetion ¥ Fart I: Applied Math and Machine Learning Basies 3, Probability and 2 Linear Algebra F—) i scoemation Theory v v 4. Numerical 5, Machine Learning Computation — Basics v Part Il: Deep Networks: Modern Practices (6. Deep Feedforward Networks 7. Regularization || 8. Optimization 9. cNNs | | 10. RNNs 11. Practical Methodology’ 12. Applications v Pact Il: Deep Les ning Research 13. Linear Factor 15, Representation Models Le} 14. Autoencoders Lo} i: v 16, Structured 17. Monte Carlo Probabilistic Models Methods t t 18. Partition Function 20. Deep Generative Models Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter. 12CHAPTER 1, INTRODUCTION 1.2.1. The Many Names and Changing Fortunes of Neural Net- works We expect that many readers of this book have heard of deep learning as an exciting new technology, and are surprised to see a mention of “history” in a book about an emerging field. In fact, deep learning dates back to the 1940s. Deep learning only appears to be new, because it was relatively unpopular for several years preceding its current popularity, and because it has gone through many different names, and kas only recently become called “deep learning.” The field has been rebranded many times, reflecting the influence of different researchers and different perspectives. A comprehensive h'story of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have deen three waves of development of deep learning: deep learning known as cybernetics in the 1940s-1960s, deep learning known as connectionism in the 1980s-1990s, and the current resurgence under the name deep learning beginning in 2006. This is quantitatively illustrated in figure 1.7. Some of the earliest learning algorithms we recognize today were intended to be computational models of biological learning, i.e. models of how learning happens or could happen in the brain. As a result, one of the names that deep Jearning has gone by is artificial neural networks (ANNs). The corresponding perspective on deep learning models is that they are engineered systems inspired by the biological brain (whether the human brain or the brain of another animal). While the kinds of neural networks used for machine learning have sometimes been used to understand brain function ([linton and Shallice, 1991), they are generally not designec to be realistic models of biological function. The neural perspective on deep learning is motivated by two main ideas. One idea is that the brain provides a proof by example that intelligent behavior is possible, and a conceptually straightforward path to building intelligence is to reverse engineer the computational principles behind the brain and duplicate its functionality. Another perspective is that it would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications. ‘The modern term “deep learning” goes beyond the neuroscientific perspective on the current breed of machine learning models. It appeals to a more general principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired. 13CHAPTER 1. INTRODUCTION 0.000250) —" cybernetics 9.000200). (connectionism + neural networks) 0.000150 }- dL. 7 0.000100} 0.000050- Frequency of Word or Phrase 0.000000 1940 1950 1960-1970 —«:1980_—«1990 2000 Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases “cybernetics” and “conneetionisin” or “neural networks” according to Google Books (the third wave is too recent to appear). The first wave started with eybernetics in the 1940s-1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1919) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approaca of the 1980-1995 period, with back-propagation (Runiclhart e/ cf, 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (‘Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred. uCHAPTER 1, INTRODUCTION ‘The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values 2r1,...,<, and associate them with an output y. ‘These models would learn a set of weights wy,...,wy and compute their output f(a,w) = vw) + ---+2nwn. This first wave of neural networks research was known as cybernetics, as illustrated in figure 1.7. ‘The McCulloch-Pitts Neuron (\MeCulloch snd Pitts, 1048) was an early model of brain function. This linear model could recognize twe different categories of inputs by testing whether f («, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Itosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category. ‘The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(z) itself to predict real number (\Vidrow uid Hofl, 1960), and could also learn to predict these numbers from data. ‘These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE ‘was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today. Models based on the f(«. 1) used by the perceptron and ADALINE are called linear models. These models remain some of the most widely used machine Iearning models, though in many cases they are trained in different ways than the original models were trained. Linear models have many limitations. Most famously, they cannot learn the XOR function, where f({0,1],w) = 1 and f((1,0],w) = 1 but f((1,1],w) = 0 and f((0,0],w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (\insky and Papert, 1969). This was the first major dip in the popularity of neural networks. ‘Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field. ‘The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and 16CHAPTER 1, INTRODUCTION well-studied parts of the brain (Olshausen and Field, 2005). Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many differert tasks. Neuroscientists have found shat ferrets can learn to see” with the auditory processing region of their brain if their brains are rewired to send visual signals to that area (Von Melchner ef al., 2000). This suggests that much of the mammalian brain might use a single algorithm to solve most of the different tasks that tke brain solves. Before this hypothesis, machine learning research was more fragmented, with different communities of researchers studying natural language processing, vision, motion planning and speech recognition. Today, these application communities are still separate, but it is common for deep learning research groups to study many or even all of these application areas simultaneously. We are able to draw some rough guidelines from neuroscience. The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain. The Neocognitron (Fukushima, 1980) introduced a powerful model architecture for processing images that was inspired by the structure of the mammalian visual system and later became the basis for the modern convolutional network (LeCun ct al., 1998b), as we will see in section 9.10. Most neural networks today are based on a model neuron called the rectified linear unit. The original Cognitron (Fulcishima, 1975) introduced a more complicated version that was highly inspired by our knowledge of brain function. The simplified modern version was developed incorporating ideas from many viewpoints, with Nair and Hinton (2010) and Glort ef al. (201 1a) citing neuroscience as an influence, and Jarrett et al. (2009) citing more engincering- oriented influences. While neuroscience is an important source of inspiration, it need not be taken as a rigid guide. We know that actual neurons compute very different functions than modern rectified linear units, but greater neural realism has not yet led to an improvement in machine learning performance. Also, while neuroscience has sucessfully inspired several neural network architectures, we do not yet know enough about biological learning for neuroscience to offer much guidance for the learning algorithms we use to train these architectures. Media accounts often emphasize the similarity of dee learning to the brain. While it is true that deep learning researchers are more likely to cite the brain as an influence than researchers working in other machine learning fields such as kernel machines or Bayesian statistics, one should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization. While some deep learning researchers cite neuroscience as an important source of inspiration, others are not concerned with 16CHAPTER 1, INTRODUCTION neuroscience at all. It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as “computational neuroscience” and is a separate field of study from deep learning. It is common for researchers to move back and forth between both fields. The field of deep learning is primarily concerned with how to build computer systems that are able to successfully solve tasks requiring intelligence, while the field of computational neuroscience is primarily concerned with building more accurate models of how the brain actually works. In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism or parallel distributed processing (Rumelhart ef al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive science. Cognitive science is an interdisciplinary approach to understanding the mind, combining multiple different levels of analysis. During the early 1980s, most cognitive scientists studied models of symbolic reasoning. Despite their popularity, symbolic models were difficult to explain in terms of how the brain could actually implement them using neurons. The connectionists began to study models of cognition that could actually be grounded in neural implementations (Towetziy and Minton, 1985), reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s (Ich, 1949) ‘The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked cogether. This insight applies equally to neurons in biological nervous systems and to hidden units in computational models. Several key concepts arose during the connectionism movement of the 1980s that remain central to today’s deep learning. One of these concepts is that of distributed representation (Hinton of ol.. 1986). ‘This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs. For example, suppose we have a vision system that can recognize cars, trucks, and birds and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to 17CHAPTER 1, INTRODUCTION Jearn about redness from images of cars, trucks and birds, not only from images of one specific category of objects. The concept of distributed representation is central to this book, aad will be described in greater detail in chapter 15. Another major accomplishment of the connectionist movement was the suc- cessfull use of back-propagation to train deep neural netwerks with internal representations and the popularization of the back-propagatior. algorithm (/umellart ef al, 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity but as of this writing is currently the dominant approach to training deep models. During the 1990s, researchers made important advances in modeling sequences with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences, described in section 10.7. Hochreiter and Schmidhuber (1997) introduced the long short-term memory or LSTM network to resolve some of these difficulties. Today, the LSTM is widely used for many sequence modeling tasks, including many natural language processing tasks at Google. The second wave of neural networks research lasted urtil the mid-1990s. Ven- tures based on neural networks and other AI technologies began to make unrealisti- cally ambitious claims while seeking investments. When Al research did not fulfill these unreasonable expectations, investors were disappointed. Simultaneously, other fields of machine learning made advances. Kernel machines (Loser ct al., 1992; Cortes and Vapnik, 1995; Schélkopf et al., 1999) and graphical models (Jor dan, 1998) both achieved good results on many important tasks. These two factors led to a decline in the popularity of neural networks that lasted until 2007. During this time, neural networks continued to obtain impressive performance on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian Institute for Advanced Researc (CIFAR) helped to keep neural networks research alive via its Neural Computation and Adaptive Perception (NCAP) research initiative. ‘This program united machine learning research groups led by Geoffrey Hinton at University of Toronto, Yoshua Bengio at University of Montreal, and Yann LeCun at New York University. The CIFAR NCAP research initiative had a multi-disciplinary nature that also included neuroscientists and experts in human and computer vision. At this point in time, deep networks were generally believed to be very difficult to train. We now know that algorithms that have existed since the 1980s work quite well, but this was not apparent circa 2006. The issue is perhaps simply that these algorithms were too computationally costly to allow much experimentation with the hardware available at the time. The third wave of neural networks research began with a breakthrough in 18CHAPTER 1, INTRODUCTION 2006. Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining (Hinton ef al., 2006), which will be described in more detail in section 15.1 ‘The other CIFAR-affiliated research groups quickly showed that the same strategy could be used to train many other kinds of deep networks (Bengio ct al., 20073 Ranzato et al. 2007a) and systematically helped to improve generalization on test examples. This wave of neural networks research popularized the use of the term “deop learning” to emphasize that researchers were now able to train deeper neural networks than had been possible before, and to focus attention on the theoretical importance of depth (Bengio and LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar et al., 2014), At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality. This third wave of popularity of neural networks continues to the time of this writing, though the focus of deep learning research has changed dramatically within the time of this wave. The third wave began with a focus on new unsupervised learning techniques and the ability of deep models to generalize well from small datasets, but today there is more interest in much older supervised learning algorithms and the ability of deep models to leverage large labeled datascts. 1.2.2 Increasing Dataset Sizes One may wonder why deep learning has only recently become recognized as a crucial technology though the first experiments with artificial neural networks were conducted in the 1950s. Deep learning has been successftlly used in commercial applications since the 1990s, but was often regarded as being more of an art than a technology and something that only an expert could use, until recently. It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human perforriance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s, though the models we train with these algorithms have undergone changes that simplify the training of very deep architectures. The most important new development is that today we can provide these algorithms with the resources they need to succeed. Figure 1.8 shows how the size of benchmark datasets has increased semarkably over time. This trend is driven by the increasing digitization of society. As more and more of our activities take place on computers, more and more of what we do is recorded. As our computers are increasingly networked together, it becomes easier to centralize these records and curate them 19CHAPTER 1, INTRODUCTION into a dataset appropriate for machine learning applications. ‘The age of “Big Data” has made machine learning much easier because the key burden of statistical estimation—generalizing well to new data after observing only a small amount of data—has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning. 1.2.3 Increasing Model Sizes Another key reason thet neural networks are wildly successful today after enjoying comparatively little suecess since the 1980s is that we have the computational resources to run much ‘arger models today. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful. Biological neurons are not especially densely connected. As seen in figure 1.10, our machine learning models have had a number of connections per neuron that ‘was within an order of magnitude of even mammalian brains for decades. In terms of the total number of neurons, neural networks have been astonishingly small until quite recently, as shown in figure 1.11. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. This growth is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. Biological neurons way represent more complicated functions than current artificial neurons, so biological neural networks may be even larger than this plot portrays. In retrospect, it is not particularly surprising that neural networks with fewer neurons than a leech were unable to solve sophisticated artificial intelligence prob- Jems. Even today’s networks, which we consider quite larg? from a computational systems point of view, are smaller than the nervous system of even relatively primitive vertebrate animals like frogs. The increase in model size over time, due to the availability of faster CPUs, 20

Understanding Machine Learning
100% (71)
Understanding Machine Learning
416 pages
Deep Learning Book Overview
100% (1)
Deep Learning Book Overview
789 pages
Deep Learning Made Easy With R
100% (4)
Deep Learning Made Easy With R
251 pages
DL (1-10)
No ratings yet
DL (1-10)
10 pages
Table of Content
No ratings yet
Table of Content
9 pages
Table of Contents
No ratings yet
Table of Contents
9 pages
DL Deeplearningbook PDF
No ratings yet
DL Deeplearningbook PDF
10 pages
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
No ratings yet
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
37 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
Deep Learning Book by François Fleuret
No ratings yet
Deep Learning Book by François Fleuret
149 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
163 pages
Deep Learning Model
No ratings yet
Deep Learning Model
144 pages
2023.05.03 The Little Book of Deep Learning
No ratings yet
2023.05.03 The Little Book of Deep Learning
143 pages
The Little Book of Deep Learning - (François Fleuret) - University of Geneva-2023.compressed
No ratings yet
The Little Book of Deep Learning - (François Fleuret) - University of Geneva-2023.compressed
163 pages
Deep Learning Essentials
100% (1)
Deep Learning Essentials
140 pages
LBDL
No ratings yet
LBDL
185 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
140 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
167 pages
Lbdlu
No ratings yet
Lbdlu
168 pages
Deep Learning Essentials
No ratings yet
Deep Learning Essentials
143 pages
LBDL
No ratings yet
LBDL
142 pages
LBDL
No ratings yet
LBDL
143 pages
Table of Content: (Page Numbers in PDF File)
No ratings yet
Table of Content: (Page Numbers in PDF File)
223 pages
ML LittelBook
No ratings yet
ML LittelBook
161 pages
(Studies in Computational Intelligence) Witold Pedrycz, Shyi-Ming Chen - Deep Learning - Algorithms and Applications-Springer (2020)
100% (7)
(Studies in Computational Intelligence) Witold Pedrycz, Shyi-Ming Chen - Deep Learning - Algorithms and Applications-Springer (2020)
368 pages
The Modern Mathematics of Deep Learning
No ratings yet
The Modern Mathematics of Deep Learning
78 pages
LN ML Rug
No ratings yet
LN ML Rug
267 pages
Chap 2
No ratings yet
Chap 2
49 pages
Contemporary ML For Physicists
No ratings yet
Contemporary ML For Physicists
91 pages
Introduction To Deep Learning 1st Edition Eugene Charniak Download
100% (1)
Introduction To Deep Learning 1st Edition Eugene Charniak Download
62 pages
Deep Learning Hand Book 2024
No ratings yet
Deep Learning Hand Book 2024
185 pages
Deep Learning Essentials Guide
No ratings yet
Deep Learning Essentials Guide
156 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
168 pages
LN ML Rug
No ratings yet
LN ML Rug
283 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
155 pages
Deep Learning Guide by François Fleuret
No ratings yet
Deep Learning Guide by François Fleuret
156 pages
U O D L J M L C: Nderstanding Ptimization of EEP Earning Via Acobian Atrix and Ipschitz Onstant
No ratings yet
U O D L J M L C: Nderstanding Ptimization of EEP Earning Via Acobian Atrix and Ipschitz Onstant
48 pages
Future Proof Yourself-An AI Era Survival Guide
No ratings yet
Future Proof Yourself-An AI Era Survival Guide
259 pages
Deep Learning Essentials Guide
100% (1)
Deep Learning Essentials Guide
158 pages
Machine Learning and Deep Learning Techniques Used in Cybersecurity and Digital Forensics: A Review
No ratings yet
Machine Learning and Deep Learning Techniques Used in Cybersecurity and Digital Forensics: A Review
91 pages
The Little Book of Deep Learning François Fleuret Online Version
No ratings yet
The Little Book of Deep Learning François Fleuret Online Version
65 pages
Machine Learning Cheat Sheet HCMUT K
No ratings yet
Machine Learning Cheat Sheet HCMUT K
34 pages
Probabilistic Machine Learning Guide
No ratings yet
Probabilistic Machine Learning Guide
343 pages
Deep Learning Notes
No ratings yet
Deep Learning Notes
155 pages
Introduction To Machine Learning by Ethem Alpaydin 2nded - 2010
No ratings yet
Introduction To Machine Learning by Ethem Alpaydin 2nded - 2010
314 pages
Data Mining: Practical Machine Learning Tools and Techniques
No ratings yet
Data Mining: Practical Machine Learning Tools and Techniques
123 pages
Deep Learning Book Part1
No ratings yet
Deep Learning Book Part1
100 pages
Alice Book Volume 1
No ratings yet
Alice Book Volume 1
281 pages
Artificial Neural Networks
No ratings yet
Artificial Neural Networks
206 pages
DL UNIT 1 To UNIT 3 Lecture Notes
No ratings yet
DL UNIT 1 To UNIT 3 Lecture Notes
135 pages
Haykin S. Neural Networks
100% (2)
Haykin S. Neural Networks
897 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
257 pages
BBBB
No ratings yet
BBBB
8 pages

Deep Learning

Uploaded by

Deep Learning

Uploaded by

You might also like