Deep Learning
more at http://ml.memect.com
Contents
2 Deep learning
2.1 Introduction
2.1.1 Definitions
2.1.2 Fundamental concepts
2.2 History
2.3 Deep learning in artificial neural networks
2.4 Deep learning architectures
2.4.1 Deep neural networks
2.4.2 Issues with deep neural networks
2.4.3 Deep belief networks
2.4.4 Convolutional neural networks
2.4.5 Convolutional Deep Belief Networks
2.4.6 Deep Boltzmann Machines
2.4.7 Stacked (Denoising) Auto-Encoders
2.4.8 Deep Stacking Networks
2.4.9 Tensor Deep Stacking Networks (T-DSN)
2.4.10 Spike-and-Slab RBMs (ssRBMs)
2.4.11 Compound Hierarchical-Deep Models
2.4.12 Deep Coding Networks
2.4.13 Deep Kernel Machines
2.4.14 Deep Q-Networks
2.5 Applications
2.5.1 Automatic speech recognition
2.5.2 Image recognition
2.5.3 Natural language processing
2.5.4 Drug discovery and toxicology
2.5.5 Customer relationship management
2.6 Deep learning in the human brain
2.7 Commercial activity
2.8 Criticism and comment
2.9 Deep learning software libraries
2.10 See also
2.11 References
2.12 External links
3 Feature learning
3.1 Supervised feature learning
3.1.1 Supervised dictionary learning
3.1.2 Neural networks
3.2 Unsupervised feature learning
3.2.1 K-means clustering
3.2.2 Principal component analysis
3.2.3 Local linear embedding
3.2.4 Independent component analysis
4 Unsupervised learning
4.1 Method of moments
4.2 See also
4.3 Notes
4.4 Further reading
5 Generative model
5.1 See also
5.2 References
5.3 Sources
6 Neural coding
6.1 Overview
6.2 Encoding and decoding
6.3 Coding schemes
6.3.1 Rate coding
6.3.2 Temporal coding
6.3.3 Population coding
6.3.4 Sparse coding
6.4 See also
6.5 References
6.6 Further reading
7 Word embedding
7.1 See also
7.2 References
9.3.1 Backpropagation
9.3.2 Different types of layers
9.4 Applications
9.4.1 Image recognition
9.4.2 Video analysis
9.4.3 Natural Language Processing
9.4.4 Playing Go
9.5 Fine-tuning
9.6 Common libraries
9.7 See also
9.8 References
9.9 External links
13 Google Brain
13.1 History
13.2 In Google products
13.3 Team
13.4 Reception
13.5 See also
13.6 References
14 Google DeepMind
14.1 History
14.1.1 2011 to 2014
14.1.2 Acquisition by Google
14.2 Research
14.2.1 Deep reinforcement learning
14.3 References
14.4 External links
16 Theano (software)
16.1 See also
16.2 References
17 Deeplearning4j
17.1 Introduction
17.2 Distributed
18 Gensim
18.1 Gensim's tagline
18.2 References
18.3 External links
19 Geoffrey Hinton
19.1 Career
19.2 Research interests
19.3 Honours and awards
19.4 Personal life
19.5 References
19.6 External links
20 Yann LeCun
20.1 Life
20.2 References
20.3 External links
21 Jürgen Schmidhuber
21.1 Contributions
21.1.1 Recurrent neural networks
21.1.2 Artificial evolution / genetic programming
21.1.3 Neural economy
21.1.4 Artificial curiosity and creativity
21.1.5 Unsupervised learning / factorial codes
21.1.6 Kolmogorov complexity / computer-generated universe
21.1.7 Universal AI
21.1.8 Low-complexity art / theory of beauty
21.1.9 Robot learning
21.2 References
21.3 Sources
21.4 External links
23 Andrew Ng
23.1 Machine learning research
23.2 Online education
23.3 Personal life
23.4 References
23.5 See also
23.6 External links
23.7 Text and image sources, contributors, and licenses
23.7.1 Text
23.7.2 Images
23.7.3 Content license
Chapter 1

Artificial neural network
“Neural network” redirects here. For networks of living neurons, see Biological neural network. For the journal, see Neural Networks (journal). For the evolutionary concept, see Neutral network (evolution).

An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.

In machine learning and cognitive science, artificial neural networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are generally presented as systems of interconnected "neurons" which send messages to each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.

For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are then passed on to other neurons. This process is repeated until finally, an output neuron is activated. This determines which character was read.

Like other machine learning methods - systems that learn from data - neural networks have been used to solve a wide variety of tasks that are hard to solve using ordinary rule-based programming, including computer vision and speech recognition.

1.1 Background

Examinations of the human central nervous system inspired the concept of neural networks. In an artificial neural network, simple artificial nodes, known as "neurons", "neurodes", "processing elements" or "units", are connected together to form a network which mimics a biological neural network.

There is no single formal definition of what an artificial neural network is. However, a class of statistical models may commonly be called "neural" if they possess the following characteristics:

1. consist of sets of adaptive weights, i.e. numerical parameters that are tuned by a learning algorithm, and
2. are capable of approximating non-linear functions of their inputs.

The adaptive weights are conceptually connection strengths between neurons, which are activated during training and prediction.

Neural networks are similar to biological neural networks in performing functions collectively and in parallel by the units, rather than there being a clear delineation of subtasks to which various units are assigned. The term "neural network" usually refers to models employed in statistics, cognitive psychology and artificial intelligence.
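As a purely illustrative reading of the two characteristics above, the sketch below shows a small two-layer network in Python/NumPy: the matrices W1 and W2 play the role of the adaptive weights, and the tanh non-linearity is what lets the model represent non-linear functions of its pixel inputs, as in the handwriting example in the introduction. All sizes, names and values here are my own assumptions, not part of the original article.

```python
import numpy as np

def forward(pixels, W1, b1, W2, b2):
    """One forward pass of a small two-layer network.

    pixels : flattened input image (e.g. 784 grey values)
    W1, b1 : adaptive weights/biases of the hidden layer
    W2, b2 : adaptive weights/biases of the output layer
    """
    hidden = np.tanh(W1 @ pixels + b1)   # non-linear transformation of the inputs
    scores = W2 @ hidden + b2            # one score per output neuron (e.g. per character)
    return scores

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(30, 784)) * 0.01, np.zeros(30)
W2, b2 = rng.normal(size=(10, 30)) * 0.01, np.zeros(10)
image = rng.random(784)                                  # stand-in for an input image
print(forward(image, W1, b1, W2, b2).argmax())           # most activated output neuron
```

In a trained network the weight matrices would be set by a learning algorithm rather than drawn at random; the point here is only where the adaptive parameters sit.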
Neural network models which emulate the central nervous system are part of theoretical neuroscience and computational neuroscience.

In modern software implementations of artificial neural networks, the approach inspired by biology has been largely abandoned for a more practical approach based on statistics and signal processing. In some of these systems, neural networks or parts of neural networks (like artificial neurons) form components in larger systems that combine both adaptive and non-adaptive elements. While the more general approach of such systems is more suitable for real-world problem solving, it has little to do with the traditional artificial intelligence connectionist models. What they do have in common, however, is the principle of non-linear, distributed, parallel and local processing and adaptation. Historically, the use of neural network models marked a paradigm shift in the late eighties from high-level (symbolic) AI, characterized by expert systems with knowledge embodied in if-then rules, to low-level (sub-symbolic) machine learning, characterized by knowledge embodied in the parameters of a dynamical system.

1.2 History

Warren McCulloch and Walter Pitts*[1] (1943) created a computational model for neural networks based on mathematics and algorithms called threshold logic. This model paved the way for neural network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence.

In the late 1940s psychologist Donald Hebb*[2] created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a 'typical' unsupervised learning rule and its later variants were early models for long term potentiation. These ideas started being applied to computational models in 1948 with Turing's B-type machines.

Farley and Wesley A. Clark*[3] (1954) first used computational machines, then called calculators, to simulate a Hebbian network at MIT. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda*[4] (1956).

Frank Rosenblatt*[5] (1958) created the perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit, a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by Paul Werbos*[6] (1975).

Neural network research stagnated after the publication of machine learning research by Marvin Minsky and Seymour Papert*[7] (1969), who discovered two key issues with the computational machines that processed neural networks. The first was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second significant issue was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks. Neural network research slowed until computers achieved greater processing power. A key later advance was the backpropagation algorithm, which effectively solved the exclusive-or problem (Werbos 1975).*[6]

The parallel distributed processing of the mid-1980s became popular under the name connectionism. The text by David E. Rumelhart and James McClelland*[8] (1986) provided a full exposition on the use of connectionism in computers to simulate neural processes.

Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and brain biological architecture is debated, as it is not clear to what degree artificial neural networks mirror brain function.*[9]

Neural networks were gradually overtaken in popularity in machine learning by support vector machines and other, much simpler methods such as linear classifiers. Renewed interest in neural nets was sparked in the late 2000s by the advent of deep learning.

1.2.1 Improvements since 2006

Computational devices have been created in CMOS, for both biophysical simulation and neuromorphic computing. More recent efforts show promise for creating nanodevices*[10] for very large scale principal components analyses and convolution. If successful, these efforts could usher in a new era of neural computing*[11] that is a step beyond digital computing, because it depends on learning rather than programming and because it is fundamentally analog rather than digital even though the first instantiations may in fact be with CMOS digital devices.

Between 2009 and 2012, the recurrent neural networks and deep feedforward neural networks developed in the research group of Jürgen Schmidhuber at the Swiss AI Lab IDSIA have won eight international competitions in pattern recognition and machine learning.*[12]*[13] For example, the bi-directional and multi-dimensional long short term memory (LSTM)*[14]*[15]*[16]*[17] of Alex Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three different languages to be learned.
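Returning to the exclusive-or limitation attributed to Minsky and Papert earlier in this section: adding a hidden layer removes it. The hand-wired threshold network below is my own illustration of that general point (the specific weights are arbitrary choices, not taken from any source cited here); it computes XOR, which no single layer of threshold units can represent.

```python
import numpy as np

def step(z):
    """Threshold (Heaviside-style) activation used by early perceptron units."""
    return np.where(z > 0, 1, 0)

def xor_two_layer(x1, x2):
    # Hidden unit 1 fires for OR, hidden unit 2 fires for AND (hand-chosen weights).
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output fires when OR holds but AND does not, i.e. exclusive-or.
    return step(h_or - h_and - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_two_layer(a, b)))
# prints 0, 1, 1, 0 -- the XOR truth table
```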
Fast GPU-based implementations of this approach by Dan Ciresan and colleagues at IDSIA have won several pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,*[18]*[19] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge,*[20] and others. Their neural networks also were the first artificial pattern recognizers to achieve human-competitive or even superhuman performance*[21] on important benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem of Yann LeCun at NYU.

Deep, highly nonlinear neural architectures similar to the 1980 neocognitron by Kunihiko Fukushima*[22] and the "standard architecture of vision",*[23] inspired by the simple and complex cells identified by David H. Hubel and Torsten Wiesel in the primary visual cortex, can also be pre-trained by unsupervised methods*[24]*[25] of Geoff Hinton's lab at University of Toronto.*[26]*[27] A team from this lab won a 2012 contest sponsored by Merck to design software to help find molecules that might lead to new drugs.*[28]

1.3 Models

Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function f : X → Y or a distribution over X or both X and Y, but sometimes models are also intimately associated with a particular learning algorithm or learning rule. A common use of the phrase ANN model really means the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity).

1.3.1 Network function

See also: Graphical models

The word network in the term 'artificial neural network' refers to the inter-connections between the neurons in the different layers of each system. An example system has three layers. The first layer has input neurons which send data via synapses to the second layer of neurons, and then via more synapses to the third layer of output neurons. More complex systems will have more layers of neurons, with some having increased layers of input neurons and output neurons. The synapses store parameters called "weights" that manipulate the data in the calculations.

An ANN is typically defined by three types of parameters:

1. The interconnection pattern between the different layers of neurons
2. The learning process for updating the weights of the interconnections
3. The activation function that converts a neuron's weighted input to its output activation.

Mathematically, a neuron's network function f(x) is defined as a composition of other functions g_i(x), which can further be defined as a composition of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the nonlinear weighted sum, where f(x) = K(∑_i w_i g_i(x)), where K (commonly referred to as the activation function*[29]) is some predefined function, such as the hyperbolic tangent. It will be convenient for the following to refer to a collection of functions g_i as simply a vector g = (g_1, g_2, ..., g_n).

ANN dependency graph

This figure depicts such a decomposition of f, with dependencies between variables indicated by arrows. These can be interpreted in two ways.

The first view is the functional view: the input x is transformed into a 3-dimensional vector h, which is then transformed into a 2-dimensional vector g, which is finally transformed into f. This view is most commonly encountered in the context of optimization.

The second view is the probabilistic view: the random variable F = f(G) depends upon the random variable G = g(H), which depends upon H = h(X), which depends upon the random variable X. This view is most commonly encountered in the context of graphical models.

The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of g are independent of each other given their input h). This naturally enables a degree of parallelism in the implementation.

Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent.
…pairs (x, y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize

Ĉ = (1/N) ∑_{i=1}^{N} (f(x_i) − y_i)².

Thus, the cost is minimized over a sample of the data rather than the entire data set.

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition).
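A minimal sketch of evaluating the sample cost Ĉ defined above (my own illustration; the toy data and the linear stand-in for f are assumptions, not from the article):

```python
import numpy as np

def empirical_cost(f, xs, ys):
    """C_hat = (1/N) * sum_i (f(x_i) - y_i)^2 over the N available samples."""
    preds = np.array([f(x) for x in xs])
    return np.mean((preds - ys) ** 2)

# Toy data: N noisy samples of y = 2x + 1, and an imperfect candidate model f.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 2 * xs + 1 + rng.normal(scale=0.1, size=100)
f = lambda x: 1.9 * x + 1.1
print(empirical_cost(f, xs, ys))   # small, but not zero
```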
This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning, some data x is given and the cost function to be minimized can be any function of the data x and the network's output, f.

The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).

As a trivial example, consider the model f(x) = a where a is a constant and the cost C = E[(x − f(x))²]. Minimizing this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and f(x), whereas in statistical modeling, it could be related to the posterior probability of the model given the data (note that in both of those examples those quantities would be maximized rather than minimized).

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

…those involved in vehicle routing,*[33] natural resources management*[34]*[35] or medicine,*[36] because of the ability of ANNs to mitigate losses of accuracy even when reducing the discretization grid density for numerically approximating the solution of the original control problems.

Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.

See also: dynamic programming and stochastic control

1.3.4 Learning algorithms

Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.

Most of the algorithms used in training artificial neural networks employ some form of gradient descent, using backpropagation to compute the actual gradients. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction.
• …unseen data requires a significant amount of experimentation.

• Robustness: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can be extremely robust.

With the correct implementation, ANNs can be used naturally in online learning and large data set applications. Their simple implementation and the existence of mostly local dependencies exhibited in the structure allows for fast, parallel implementations in hardware.

The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.

1.5.1 Real-life applications

The tasks artificial neural networks are applied to tend to fall within the following broad categories:

• Function approximation, or regression analysis, including time series prediction, fitness approximation and modeling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
• Data processing, including filtering, clustering, blind source separation and compression.
• Robotics, including directing manipulators and prostheses.
• Control, including computer numerical control.

Application areas include system identification and control (vehicle control, process control, natural resources management), quantum chemistry,*[41] game-playing and decision making (backgammon, chess, poker), pattern recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications (e.g. automated trading systems), data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.

Artificial neural networks have also been used to diagnose several cancers. An ANN-based hybrid lung cancer detection system named HLND improves the accuracy of diagnosis and the speed of lung cancer radiology.*[42] These networks have also been used to diagnose prostate cancer. The diagnoses can be used to make specific models taken from a large group of patients compared to information of one given patient. The models do not depend on assumptions about correlations of different variables. Colorectal cancer has also been predicted using the neural networks. Neural networks could predict the outcome for a patient with colorectal cancer with more accuracy than the current clinical methods. After training, the networks could predict multiple patient outcomes from unrelated institutions.*[43]

Theoretical and computational neuroscience is the field concerned with the theoretical analysis and the computational modeling of biological neural systems. Since neural systems are intimately related to cognitive processes and behavior, the field is closely related to cognitive and behavioral modeling.

The aim of the field is to create models of biological neural systems in order to understand how biological systems work. To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (biological neural network models) and theory (statistical learning theory and information theory).

Types of models

Many models are used in the field, defined at different levels of abstraction and modeling different aspects of neural systems. They range from models of the short-term behavior of individual neurons, through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behavior can arise from abstract neural modules that represent complete subsystems. These include models of the long-term and short-term plasticity of neural systems and their relations to learning and memory, from the individual neuron to the system level.

1.6 Neural network software

Main article: Neural network software

Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and, in some cases, a wider array of adaptive systems.
1.7 Types of artificial neural networks

Main article: Types of artificial neural networks

Artificial neural network types vary from those with only one or two layers of single-direction logic to complicated multi-input, many-directional feedback loops and layers. On the whole, these systems use algorithms in their programming to determine control and organization of their functions. Most systems use "weights" to change the parameters of the throughput and the varying connections to the neurons. Artificial neural networks can be autonomous and learn by input from outside "teachers" or even self-teaching from written-in rules.

1.8 Theoretical properties

1.8.1 Computational power

The multi-layer perceptron (MLP) is a universal function approximator, as proven by the universal approximation theorem. However, the proof is not constructive regarding the number of neurons required or the settings of the weights.

Work by Hava Siegelmann and Eduardo D. Sontag has provided a proof that a specific recurrent architecture with rational-valued weights (as opposed to full-precision real-valued weights) has the full power of a Universal Turing Machine*[44] using a finite number of neurons and standard linear connections. Further, it has been shown that the use of irrational values for weights results in a machine with super-Turing power.*[45]

1.8.2 Capacity

Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.

1.8.3 Convergence

Nothing can be said in general about convergence since it depends on a number of factors. Firstly, there may exist many local minima. This depends on the cost function and the model. Secondly, the optimization method used might not be guaranteed to converge when far away from a local minimum. Thirdly, for a very large amount of data or parameters, some methods become impractical. In general, it has been found that theoretical guarantees regarding convergence are an unreliable guide to practical application.

1.8.4 Generalization and statistics

In applications where the goal is to create a system that generalizes well to unseen examples, the problem of overtraining has emerged. This arises in convoluted or over-specified systems when the capacity of the network significantly exceeds the needed free parameters. There are two schools of thought for avoiding this problem: the first is to use cross-validation and similar techniques to check for the presence of overtraining and to optimally select hyperparameters so as to minimize the generalization error. The second is to use some form of regularization. This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularization can be performed by selecting a larger prior probability over simpler models, but also in statistical learning theory, where the goal is to minimize over two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error in unseen data due to overfitting.

Confidence analysis of a neural network

Supervised neural networks that use a mean squared error (MSE) cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution. A confidence analysis made this way is statistically valid as long as the output probability distribution stays the same and the network is not modified.

By assigning a softmax activation function, a generalization of the logistic function, on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is very useful in classification as it gives a certainty measure on classifications.
The softmax activation function is:
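In the usual notation, with x_i the net input to output unit i and the sum running over all output units (this notation is my assumption rather than taken from the surrounding text):

y_i = exp(x_i) / ∑_j exp(x_j)

Each y_i lies between 0 and 1 and the outputs sum to 1, which is what allows them to be read as posterior probabilities, as described above.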
…(they tend to consume considerable amounts of time and money).

• Connectionist expert system
• Connectomics
• Cultured neuronal networks
• Deep learning
• Digital morphogenesis
• Encog
• Fuzzy logic
• Gene expression programming
• Genetic algorithm
• Group method of data handling

[2] Hebb, Donald (1949). The Organization of Behavior. New York: Wiley.

[3] Farley, B.G.; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory 4 (4): 76–84. doi:10.1109/TIT.1954.1057468.

[4] Rochester, N.; J.H. Holland, L.H. Habit, and W.L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory 2 (3): 80–93. doi:10.1109/TIT.1956.1056810.

[5] Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review 65 (6): 386–408. doi:10.1037/h0042519. PMID 13602029.
[6] Werbos, P.J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.

[7] Minsky, M.; S. Papert (1969). An Introduction to Computational Geometry. MIT Press. ISBN 0-262-63022-2.

[8] Rumelhart, D.E; James McClelland (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge: MIT Press.

[9] Russell, Ingrid. "Neural Networks Module". Retrieved 2012.

[10] Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. Nat. Nanotechnol. 2008, 3, 429–433.

[11] Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. Nature 2008, 453, 80–83.

[12] 2012 Kurzweil AI Interview with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012

[13] http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions 2012 Kurzweil AI Interview with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012

[14] Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), 7–10 December 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.

[15] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.

[16] Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.

[17] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.

[18] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks, 2012.

[19] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks, 2012.

[20] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.

[21] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012.

[22] Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics 36 (4): 93–202. doi:10.1007/BF00344251. PMID 7370364.

[23] M Riesenhuber, T Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 1999.

[24] Deep belief networks at Scholarpedia.

[25] Hinton, G. E.; Osindero, S.; Teh, Y. W. (2006). "A Fast Learning Algorithm for Deep Belief Nets" (PDF). Neural Computation 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.

[26] http://www.scholarpedia.org/article/Deep_belief_networks

[27] Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.

[28] John Markoff (November 23, 2012). "Scientists See Promise in Deep-Learning Programs". New York Times.

[29] "The Machine Learning Dictionary".

[30] Dominic, S., Das, R., Whitley, D., Anderson, C. (July 1991). "Genetic reinforcement learning for neural networks". IJCNN-91-Seattle International Joint Conference on Neural Networks. Seattle, Washington, USA: IEEE. doi:10.1109/IJCNN.1991.155315. ISBN 0-7803-0164-1. Retrieved 29 July 2012.

[31] Hoskins, J.C.; Himmelblau, D.M. (1992). "Process control via artificial neural networks and reinforcement learning". Computers & Chemical Engineering 16 (4): 241–251. doi:10.1016/0098-1354(92)80045-B.

[32] Bertsekas, D.P., Tsitsiklis, J.N. (1996). Neuro-dynamic programming. Athena Scientific. p. 512. ISBN 1-886529-10-8.

[33] Secomandi, Nicola (2000). "Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands". Computers & Operations Research 27 (11–12): 1201–1225. doi:10.1016/S0305-0548(99)00146-X.

[34] de Rigo, D., Rizzoli, A. E., Soncini-Sessa, R., Weber, E., Zenesi, P. (2001). "Neuro-dynamic programming for the efficient management of reservoir networks" (PDF). Proceedings of MODSIM 2001, International Congress on Modelling and Simulation. MODSIM 2001, International
Congress on Modelling and Simulation. Canberra, Australia: Modelling and Simulation Society of Australia and New Zealand. doi:10.5281/zenodo.7481. ISBN 0-867405252. Retrieved 29 July 2012.

[35] Damas, M., Salmeron, M., Diaz, A., Ortega, J., Prieto, A., Olivares, G. (2000). "Genetic algorithms and neuro-dynamic programming: application to water supply networks". Proceedings of 2000 Congress on Evolutionary Computation. La Jolla, California, USA: IEEE. doi:10.1109/CEC.2000.870269. ISBN 0-7803-6375-2. Retrieved 29 July 2012.

[36] Deng, Geng; Ferris, M.C. (2008). "Neuro-dynamic programming for fractionated radiotherapy planning". Springer Optimization and Its Applications 12: 47–70. doi:10.1007/978-0-387-73299-2_3.

[37] de Rigo, D., Castelletti, A., Rizzoli, A.E., Soncini-Sessa, R., Weber, E. (January 2005). "A selective improvement technique for fastening Neuro-Dynamic Programming in Water Resources Network Management". In Pavel Zítek. Proceedings of the 16th IFAC World Congress – IFAC-PapersOnLine. 16th IFAC World Congress 16. Prague, Czech Republic: IFAC. doi:10.3182/20050703-6-CZ-1902.02172. ISBN 978-3-902661-75-3. Retrieved 30 December 2011.

[38] Ferreira, C. (2006). "Designing Neural Networks Using Gene Expression Programming" (PDF). In A. Abraham, B. de Baets, M. Köppen, and B. Nickolay, eds., Applied Soft Computing Technologies: The Challenge of Complexity, pages 517–536, Springer-Verlag.

[39] Da, Y., Xiurun, G. (July 2005). T. Villmann, ed. An improved PSO-based ANN with simulated annealing technique. New Aspects in Neurocomputing: 11th European Symposium on Artificial Neural Networks. Elsevier. doi:10.1016/j.neucom.2004.07.002.

[40] Wu, J., Chen, E. (May 2009). Wang, H., Shen, Y., Huang, T., Zeng, Z., ed. A Novel Nonparametric Regression Ensemble for Rainfall Forecasting Using Particle Swarm Optimization Technique Coupled with Artificial Neural Network. 6th International Symposium on Neural Networks, ISNN 2009. Springer. doi:10.1007/978-3-642-01513-7_6. ISBN 978-3-642-01215-0.

[41] Roman M. Balabin, Ekaterina I. Lomakina (2009). "Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies". J. Chem. Phys. 131 (7): 074104. doi:10.1063/1.3206326. PMID 19708729.

[42] Ganesan, N. "Application of Neural Networks in Diagnosing Cancer Disease Using Demographic Data" (PDF). International Journal of Computer Applications.

[43] Bottaci, Leonardo. "Artificial Neural Networks Applied to Outcome Prediction for Colorectal Cancer Patients in Separate Institutions" (PDF). The Lancet.

[44] Siegelmann, H.T.; Sontag, E.D. (1991). "Turing computability with neural nets" (PDF). Appl. Math. Lett. 4 (6): 77–80. doi:10.1016/0893-9659(91)90080-F.

[45] Balcázar, José (Jul 1997). "Computational Power of Neural Networks: A Kolmogorov Complexity Characterization". Information Theory, IEEE Transactions on 43 (4): 1175–1183. doi:10.1109/18.605580. Retrieved 3 November 2014.

[46] NASA - Dryden Flight Research Center - News Room: News Releases: NASA NEURAL NETWORK PROJECT PASSES MILESTONE. Nasa.gov. Retrieved on 2013-11-20.

[47] Roger Bridgman's defence of neural networks

[48] http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/4

[49] Sun and Bookman (1990)

[50] Tahmasebi; Hezarkhani (2012). "A hybrid neural networks-fuzzy logic-genetic algorithm for grade estimation". Computers & Geosciences 42: 18–27. doi:10.1016/j.cageo.2012.02.004.

1.13 Bibliography

• Bhadeshia H. K. D. H. (1999). "Neural Networks in Materials Science" (PDF). ISIJ International 39 (10): 966–979. doi:10.2355/isijinternational.39.966.
• Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press. ISBN 0-19-853849-9 (hardback) or ISBN 0-19-853864-2 (paperback)
• Cybenko, G.V. (1989). Approximation by Superpositions of a Sigmoidal function, Mathematics of Control, Signals, and Systems, Vol. 2 pp. 303–314. electronic version
• Duda, R.O., Hart, P.E., Stork, D.G. (2001) Pattern classification (2nd edition), Wiley, ISBN 0-471-05669-3
• Egmont-Petersen, M., de Ridder, D., Handels, H. (2002). "Image processing with neural networks – a review". Pattern Recognition 35 (10): 2279–2301. doi:10.1016/S0031-3203(01)00178-9.
• Gurney, K. (1997) An Introduction to Neural Networks London: Routledge. ISBN 1-85728-673-1 (hardback) or ISBN 1-85728-503-4 (paperback)
• Haykin, S. (1999) Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1
• Fahlman, S, Lebiere, C (1991). The Cascade-Correlation Learning Architecture, created for National Science Foundation, Contract Number EET-8716324, and Defense Advanced Research Projects Agency (DOD), ARPA Order No. 4976 under Contract F33615-87-C-1499. electronic version
Chapter 2

Deep learning
…chain of transformations from input to output is a credit assignment path (CAP). CAPs describe potentially causal connections between input and output and may vary in length. For a feedforward neural network, the depth of the CAPs, and thus the depth of the network, is the number of hidden layers plus one (the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP is potentially unlimited in length. There is no universally agreed upon threshold of depth dividing shallow learning from deep learning, but most researchers in the field agree that deep learning has multiple nonlinear layers (CAP > 2) and Schmidhuber considers CAP > 10 to be very deep learning.*[4]*(p7)

2.1.2 Fundamental concepts

Deep learning algorithms are based on distributed representations. The underlying assumption behind distributed representations is that observed data is generated by the interactions of many different factors on different levels. Deep learning adds the assumption that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition. Varying numbers of layers and layer sizes can be used to provide different amounts of abstraction.*[3]

Deep learning algorithms in particular exploit this idea of hierarchical explanatory factors. Different concepts are learned from other concepts, with the more abstract, higher-level concepts being learned from the lower-level ones. These architectures are often constructed with a greedy layer-by-layer method that models this idea. Deep learning helps to disentangle these abstractions and pick out which features are useful for learning.*[3]

For supervised learning tasks where label information is readily available in training, deep learning promotes a principle which is very different from traditional methods of machine learning. That is, rather than focusing on feature engineering, which is often labor-intensive and varies from one task to another, deep learning methods focus on end-to-end learning based on raw features. In other words, deep learning moves away from feature engineering to the maximal extent possible. To accomplish end-to-end optimization starting with raw features and ending in labels, layered structures are often necessary. From this perspective, we can regard the use of layered structures to derive intermediate representations in deep learning as a natural consequence of raw-feature-based end-to-end learning.*[1] Understanding the connection between the above two aspects of deep learning is important to appreciate its use in several application areas, all involving supervised learning tasks (e.g., supervised speech and image recognition), as discussed in a later part of this article.

Many deep learning algorithms are framed as unsupervised learning problems. Because of this, these algorithms can make use of the unlabeled data that supervised algorithms cannot. Unlabeled data is usually more abundant than labeled data, making this an important benefit of these algorithms. The deep belief network is an example of a deep structure that can be trained in an unsupervised manner.*[3]

2.2 History

Deep learning architectures, specifically those built from artificial neural networks (ANN), date back at least to the Neocognitron introduced by Kunihiko Fukushima in 1980.*[10] The ANNs themselves date back even further. In 1989, Yann LeCun et al. were able to apply the standard backpropagation algorithm, which had been around since 1974,*[11] to a deep neural network with the purpose of recognizing handwritten ZIP codes on mail. Despite the success of applying the algorithm, the time to train the network on this dataset was approximately 3 days, making it impractical for general use.*[12] Many factors contribute to the slow speed, one being the so-called vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.*[13]*[14]

While such neural networks by 1991 were used for recognizing isolated 2-D hand-written digits, 3-D object recognition by 1991 used a 3-D model-based approach – matching 2-D images with a handcrafted 3-D object model. Juyang Weng et al. proposed that a human brain does not use a monolithic 3-D object model, and in 1992 they published Cresceptron,*[15]*[16]*[17] a method for performing 3-D object recognition directly from cluttered scenes. Cresceptron is a cascade of many layers similar to Neocognitron. But unlike Neocognitron, which required the human programmer to hand-merge features, Cresceptron fully automatically learned an open number of unsupervised features in each layer of the cascade, where each feature is represented by a convolution kernel. In addition, Cresceptron also segmented each learned object from a cluttered scene through back-analysis through the network. Max-pooling, now often adopted by deep neural networks (e.g., ImageNet tests), was first used in Cresceptron to reduce the position resolution by a factor of (2x2) to 1 through the cascade for better generalization. Because of a great lack of understanding of how the brain autonomously wires its biological networks, and the computational cost of ANNs at the time, simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were the popular choice of the field in the 1990s and 2000s.

In the long history of speech recognition, both shallow and deep forms (e.g., recurrent nets) of artificial neural networks had been explored for many years.*[18]*[19]*[20] But these methods never won over the non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively.*[21]
vised learning problems. Because of this, these algo- based on generative models of speech trained discrim-
2.3. DEEP LEARNING IN ARTIFICIAL NEURAL NETWORKS 15
discriminatively.*[21] A number of key difficulties had been methodologically analyzed, including gradient diminishing and weak temporal correlation structure in the neural predictive models.*[22]*[23] All these difficulties were in addition to the lack of big training data and big computing power in those early days. Most speech recognition researchers who understood such barriers hence subsequently moved away from neural nets to pursue generative modeling approaches, until the recent resurgence of deep learning that has overcome these difficulties. Hinton et al. and Deng et al. reviewed part of this recent history, describing how their collaboration with each other and then with cross-group colleagues ignited the renaissance of neural networks and initiated deep learning research and applications in speech recognition.*[24]*[25]*[26]*[27]

The term "deep learning" gained traction in the mid-2000s after a publication by Geoffrey Hinton and Ruslan Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, and then fine-tuned using supervised backpropagation.*[28] In 1992, Schmidhuber had already implemented a very similar idea for the more general case of unsupervised deep hierarchies of recurrent neural networks, and had also experimentally shown its benefits for speeding up supervised learning.*[29]*[30]

Since the resurgence of deep learning, it has become part of many state-of-the-art systems in different disciplines, particularly computer vision and automatic speech recognition (ASR). Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as on a range of large-vocabulary speech recognition tasks, are constantly being improved with new applications of deep learning.*[24]*[31]*[32] Currently, deep learning architectures in the form of convolutional neural networks have been shown to perform nearly best;*[33]*[34] however, these are more widely used in computer vision than in ASR.

The real impact of deep learning in industry started in large-scale speech recognition around 2010. In late 2009, Geoff Hinton was invited by Li Deng to work with him and colleagues at Microsoft Research in Redmond to apply deep learning to speech recognition. They co-organized the 2009 NIPS Workshop on Deep Learning for Speech Recognition. The workshop was motivated by the limitations of deep generative models of speech, and by the possibility that the big-compute, big-data era warranted a serious try of the deep neural net (DNN) approach. It was then (incorrectly) believed that pre-training of DNNs using generative models of deep belief nets (DBN) would be the cure for the main difficulties of neural nets encountered during the 1990s.*[26] However, soon after the research along this direction started at Microsoft Research, it was discovered that when large amounts of training data are used, and especially when DNNs are designed correspondingly with large, context-dependent output layers, dramatic error reduction occurred over the then-state-of-the-art GMM-HMM and more advanced generative-model-based speech recognition systems, without the need for generative DBN pre-training; this finding was verified subsequently by several other major speech recognition research groups.*[24]*[35] Further, the nature of the recognition errors produced by the two types of systems was found to be characteristically different,*[25]*[36] offering technical insights into how to artfully integrate deep learning into the existing, highly efficient run-time speech decoding systems deployed by all major players in the speech recognition industry. The history of this significant development in deep learning has been described and analyzed in recent books.*[1]*[37]

Advances in hardware have also been an important enabling factor for the renewed interest in deep learning. In particular, powerful graphics processing units (GPUs) are highly suited to the kind of number crunching and matrix/vector math involved in machine learning. GPUs have been shown to speed up training algorithms by orders of magnitude, bringing running times of weeks back to days.*[38]*[39]

2.3 Deep learning in artificial neural networks

Some of the most successful deep learning methods involve artificial neural networks. Artificial neural networks are inspired by the 1959 biological model proposed by Nobel laureates David H. Hubel and Torsten Wiesel, who found two types of cells in the primary visual cortex: simple cells and complex cells. Many artificial neural networks can be viewed as cascading models*[15]*[16]*[17]*[40] of cell types inspired by these biological observations.

Fukushima's Neocognitron introduced convolutional neural networks partially trained by unsupervised learning, with humans directing features in the neural plane. Yann LeCun et al. (1989) applied supervised backpropagation to such architectures.*[41] Weng et al. (1992) published the convolutional neural network Cresceptron*[15]*[16]*[17] for 3-D object recognition from images of cluttered scenes and for segmentation of such objects from images.

An obvious need for recognizing general 3-D objects is at least shift invariance and tolerance to deformation. Max-pooling appears to have been first proposed by Cresceptron*[15]*[16] to enable the network to tolerate small-to-large deformation in a hierarchical way while using convolution. Max-pooling helps, but still does not fully guarantee, shift invariance at the pixel level.*[17]

With the advent of the back-propagation algorithm in the 1970s, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. Sepp Hochreiter's diploma thesis of
1991*[42]*[43] formally identified the reason for this failure as the "vanishing gradient problem," which affects not only many-layered feedforward networks but also recurrent neural networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers.

To overcome this problem, several methods were proposed. One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation.*[29] Here each level learns a compressed representation of the observations that is fed to the next level.

Another method is the long short-term memory (LSTM) network of 1997 by Hochreiter and Schmidhuber.*[44] In 2009, deep multidimensional LSTM networks won three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.*[45]*[46]

Sven Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid*[47] to solve problems like image reconstruction and face localization.

Other methods also use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. The network is then trained further by supervised back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine (Smolensky, 1986*[48]) to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.*[49] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.*[50]

The Google Brain team led by Andrew Ng and Jeff Dean created a neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.*[51]*[52]

Other methods rely on the sheer processing power of modern computers, in particular GPUs. In 2010 it was shown by Dan Ciresan and colleagues*[38] in Jürgen Schmidhuber's group at the Swiss AI Lab IDSIA that, despite the above-mentioned "vanishing gradient problem," the superior processing power of GPUs makes plain back-propagation feasible for deep feedforward neural networks with many layers. The method outperformed all other machine learning techniques on the old, famous MNIST handwritten digits problem of Yann LeCun and colleagues at NYU.

As of 2011, the state of the art in deep learning feedforward networks alternates convolutional layers and max-pooling layers,*[53]*[54] topped by several pure classification layers. Training is usually done without any unsupervised pre-training. Since 2011, GPU-based implementations*[53] of this approach have won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,*[55] the ISBI 2012 Segmentation of Neuronal Structures in EM Stacks challenge,*[56] and others.

Such supervised deep learning methods were also the first artificial pattern recognizers to achieve human-competitive performance on certain tasks.*[57]

To break the barriers of weak AI represented by deep learning, it is necessary to go beyond deep learning architectures, because biological brains use both shallow and deep circuits, as reported by brain anatomy,*[58] in order to deal with the wide variety of invariance that the brain displays. Weng*[59] argued that the brain self-wires largely according to signal statistics and that, therefore, a serial cascade cannot catch all major statistical dependencies. Fully guaranteed shift invariance for ANNs dealing with small and large natural objects in large cluttered scenes was achieved when the invariance was extended beyond shift to all ANN-learned concepts, such as location, type (object class label), scale, and lighting, in the Developmental Networks (DNs),*[60] whose embodiments are the Where-What Networks, WWN-1 (2008)*[61] through WWN-7 (2013).*[62]

2.4 Deep learning architectures

There is a huge number of variants of deep architectures; however, most of them branch from some original parent architectures. It is not always possible to compare the performance of multiple architectures together, since they are not all evaluated on the same data sets. Deep learning is a fast-growing field, and new architectures, variants, and algorithms appear every few weeks.

2.4.1 Deep neural networks

A deep neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers.*[2]*[4] Similar to shallow ANNs, DNNs can model complex non-linear relationships. DNN architectures, e.g., for object detection and parsing, generate compositional models in which the object is expressed as a layered composition of image primitives.*[63] The extra layers enable composition of features from lower layers, giving the potential of modeling
complex data with fewer units than a similarly performing shallow network.*[2]

DNNs are typically designed as feedforward networks, but recent research has successfully applied the deep learning architecture to recurrent neural networks for applications such as language modeling.*[64] Convolutional deep neural networks (CNNs) are used in computer vision, where their success is well documented.*[65] More recently, CNNs have been applied to acoustic modeling for automatic speech recognition (ASR), where they have shown success over previous models.*[34] For simplicity, a look at training DNNs is given here.

A DNN can be discriminatively trained with the standard backpropagation algorithm. The weight updates can be done via stochastic gradient descent using the following equation:

Δwij(t + 1) = Δwij(t) + η ∂C/∂wij

Here, η is the learning rate and C is the cost function.
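The update rule above can be made concrete with a short sketch. The following minimal NumPy example is not taken from the cited literature; the layer sizes, learning rate, squared-error cost, and random data are assumptions for illustration. It trains one hidden layer by backpropagation and applies a mini-batch gradient-descent version of the weight update, written with the conventional minus sign for descending the cost:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 20))              # toy inputs (assumed data)
    Y = rng.normal(size=(256, 3))               # toy targets
    W1 = rng.normal(scale=0.1, size=(20, 32))   # input -> hidden weights
    W2 = rng.normal(scale=0.1, size=(32, 3))    # hidden -> output weights
    eta = 0.01                                  # learning rate (eta in the equation)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(50):
        for i in range(0, len(X), 32):              # mini-batching (see 2.4.2)
            x, y = X[i:i+32], Y[i:i+32]
            h = sigmoid(x @ W1)                     # hidden activations
            out = h @ W2                            # linear output layer
            err = out - y                           # dC/d(out) for a squared-error cost C
            grad_W2 = h.T @ err / len(x)            # partial C / partial W2
            grad_h = err @ W2.T * h * (1 - h)       # error backpropagated through the sigmoid
            grad_W1 = x.T @ grad_h / len(x)         # partial C / partial W1
            W2 -= eta * grad_W2                     # gradient-descent weight updates
            W1 -= eta * grad_W1

Computing the gradient on a batch of 32 examples at a time, rather than one example at a time, is the mini-batching trick mentioned in the next subsection.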
2.4.2 Issues with deep neural networks

Backpropagation and gradient descent have been the preferred methods for training these structures, due to their ease of implementation and their tendency to converge to better local optima in comparison with other training methods. However, these methods can be computationally expensive, especially when used to train DNNs. There are many training parameters to be considered with a DNN, such as the size (number of layers and number of units per layer), the learning rate, and the initial weights. Sweeping through the parameter space for optimal parameters may not be feasible due to the cost in time and computational resources. Various 'tricks' such as mini-batching (computing the gradient on several training examples at once rather than on individual examples)*[69] have been shown to speed up computation. The large processing throughput of GPUs has produced significant speedups in training, because the matrix and vector computations required are well suited to GPUs.*[4]

2.4.3 Deep belief networks

Main article: Deep belief network

A deep belief network (DBN) is a probabilistic, generative model made up of multiple layers of hidden units. It can be used to pre-train a DNN by using the learned DBN weights as the initial DNN weights, with back-propagation or another discriminative algorithm then applied for fine-tuning; this allows for both improved modeling capability and faster convergence of the fine-tuning phase.*[71]

A DBN can be efficiently trained in an unsupervised, layer-by-layer manner, where the layers are typically made of restricted Boltzmann machines (RBM). A description of training a DBN via RBMs is provided below. An RBM is an undirected, generative, energy-based model with an input layer and a single hidden layer. Connections exist only between the visible units of the input layer and the hidden units of the hidden layer; there are no visible-visible or hidden-hidden connections.

The training method for RBMs was initially proposed by Geoffrey Hinton for training "Product of Experts" models and is known as contrastive divergence (CD).*[72] CD provides an approximation to the maximum likelihood method that would ideally be applied for learning the weights of the RBM.*[69]*[73]

In training a single RBM, weight updates are performed with gradient ascent via the following equation:

wij(t + 1) = wij(t) + η ∂log(p(v))/∂wij

Here, p(v) is the probability of a visible vector, given by p(v) = (1/Z) Σh e^(−E(v,h)). Z is the partition function (used for normalizing) and E(v, h) is the energy function assigned to the state of the network. A lower energy indicates that the network is in a more "desirable" configuration. The gradient ∂log(p(v))/∂wij has the simple form ⟨vi hj⟩data − ⟨vi hj⟩model, where ⟨···⟩p represents averages with respect to distribution p. The issue arises in sampling ⟨vi hj⟩model, as this requires running alternating Gibbs sampling for a long time. CD replaces this step by running alternating Gibbs sampling for n steps (values of n = 1 have empirically been shown to perform well). After n steps, the data are sampled and that sample is used in place of ⟨vi hj⟩model. The CD procedure works as follows:*[69]

1. Initialize the visible units to a training vector.

2. Update the hidden units in parallel given the visible units: p(hj = 1 | V) = σ(bj + Σi vi wij). σ denotes the sigmoid function and bj is the bias of hj.

3. Update the visible units in parallel given the hidden units: p(vi = 1 | H) = σ(ai + Σj hj wij). ai is the bias of vi. This is called the "reconstruction" step.

4. Re-update the hidden units in parallel given the reconstructed visible units, using the same equation as in step 2.

5. Perform the weight update: Δwij ∝ ⟨vi hj⟩data − ⟨vi hj⟩reconstruction.

Once an RBM is trained, another RBM can be "stacked" atop it to create a multilayer model. Each time another RBM is stacked, the input visible layer is initialized to a training vector, and values for the units in the already-trained RBM layers are assigned using the current weights and biases. The final layer of the already-trained layers is used as input to the new RBM. The new RBM is then trained with the procedure above, and this whole process can be repeated until some desired stopping criterion is met.*[2]

Despite the approximation of CD to maximum likelihood being very crude (CD has been shown not to follow the gradient of any function), empirical results have shown it to be an effective method for training deep architectures.*[69]
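For illustration, a minimal NumPy sketch of the CD procedure above with n = 1 for a single binary RBM follows. The layer sizes, learning rate, and random training data are assumptions, and the hidden probabilities are used in the weight update in place of sampled states, which is a common simplification rather than part of the description above:

    import numpy as np

    rng = np.random.default_rng(1)
    V = (rng.random((100, 6)) > 0.5).astype(float)   # toy binary training vectors
    W = rng.normal(scale=0.1, size=(6, 4))           # visible-hidden weights w_ij
    a = np.zeros(6)                                  # visible biases a_i
    b = np.zeros(4)                                  # hidden biases b_j
    eta = 0.1

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for epoch in range(20):
        for v0 in V:
            # steps 1-2: initialize visibles to a training vector, update hiddens
            p_h0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(4) < p_h0).astype(float)
            # step 3: "reconstruction" of the visible units
            p_v1 = sigmoid(a + h0 @ W.T)
            v1 = (rng.random(6) < p_v1).astype(float)
            # step 4: re-update the hidden units from the reconstruction
            p_h1 = sigmoid(b + v1 @ W)
            # step 5: weight update  <v_i h_j>_data - <v_i h_j>_reconstruction
            W += eta * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            a += eta * (v0 - v1)
            b += eta * (p_h0 - p_h1)

Stacking, as described above, would amount to running this loop again with the hidden activations of the trained RBM used as the "visible" data for the next RBM.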
2.4.4 Convolutional neural networks

Main article: Convolutional neural network

A CNN is composed of one or more convolutional layers with fully connected layers (matching those in typical artificial neural networks) on top. It also uses tied weights and pooling layers. This architecture allows CNNs to take advantage of the 2D structure of input data. In comparison with other deep architectures, convolutional neural networks are starting to show superior results in both image and speech applications. They can also be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate, making them a highly attractive architecture to use.*[74]
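To make the idea of tied weights and pooling concrete, the following minimal NumPy forward pass applies one convolutional layer followed by 2x2 max-pooling. It is only a sketch under assumed sizes (a 28x28 input and eight 3x3 filters) and omits training and the fully connected classifier described above:

    import numpy as np

    rng = np.random.default_rng(2)
    image = rng.normal(size=(28, 28))          # one toy grey-scale input
    filters = rng.normal(size=(8, 3, 3))       # 8 shared (tied) 3x3 filters

    def conv2d(img, f):
        H, W = img.shape
        fh, fw = f.shape
        out = np.empty((H - fh + 1, W - fw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # the same filter weights are reused at every position (tied weights)
                out[i, j] = np.sum(img[i:i+fh, j:j+fw] * f)
        return out

    def maxpool2x2(fm):
        H, W = fm.shape
        fm = fm[:H - H % 2, :W - W % 2]
        # keep the strongest response in each 2x2 block
        return fm.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    feature_maps = [np.maximum(conv2d(image, f), 0.0) for f in filters]   # convolution + rectification
    pooled = [maxpool2x2(fm) for fm in feature_maps]
    features = np.concatenate([p.ravel() for p in pooled])   # flattened input for fully connected layers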
2.4.5 Convolutional Deep Belief Networks

A recent achievement in deep learning is the use of convolutional deep belief networks (CDBN). A CDBN is very similar to a normal convolutional neural network in terms of its structure. Therefore, like CNNs, CDBNs are able to exploit the 2D structure of images, combined with the advantage gained by pre-training as in deep belief networks. They provide a generic structure that can be used in many image and signal processing tasks, and they can be trained in a way similar to that for deep belief networks. Recently, many benchmark results on standard image datasets like CIFAR*[75] have been obtained using CDBNs.*[76]

2.4.6 Deep Boltzmann Machines

A Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units ν ∈ {0, 1}^D and a series of layers of hidden units h(1) ∈ {0, 1}^F1, h(2) ∈ {0, 1}^F2, ..., h(L) ∈ {0, 1}^FL. There is no connection between the units of the same
layer (as in an RBM). For the DBM, the probability assigned to a vector ν can be written as:

p(ν) = (1/Z) Σh exp( Σij W(1)ij νi h(1)j + Σjl W(2)jl h(1)j h(2)l + Σlm W(3)lm h(2)l h(3)m )

where h = {h(1), h(2), h(3)} are the sets of hidden units and θ = {W(1), W(2), W(3)} are the model parameters, representing visible-hidden and hidden-hidden symmetric interactions (the links are undirected). As is clear by setting W(2) = 0 and W(3) = 0, the network reduces to the well-known restricted Boltzmann machine.*[77]
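As a small illustration of the expression above, the following NumPy sketch evaluates the unnormalised probability (the exponent for one joint configuration of visible and hidden units) of a three-layer DBM. The sizes and random weights are assumptions, and the partition function Z is not computed, since that would require summing over all configurations:

    import numpy as np

    rng = np.random.default_rng(3)
    D, F1, F2, F3 = 6, 5, 4, 3
    W1 = rng.normal(scale=0.1, size=(D, F1))    # visible-hidden weights W(1)
    W2 = rng.normal(scale=0.1, size=(F1, F2))   # hidden-hidden weights W(2)
    W3 = rng.normal(scale=0.1, size=(F2, F3))   # hidden-hidden weights W(3)

    v  = (rng.random(D)  > 0.5).astype(float)   # visible units (nu)
    h1 = (rng.random(F1) > 0.5).astype(float)
    h2 = (rng.random(F2) > 0.5).astype(float)
    h3 = (rng.random(F3) > 0.5).astype(float)

    # exponent of the DBM distribution for this joint configuration
    score = v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3
    unnormalised_p = np.exp(score)
    # setting W2 = W3 = 0 leaves only v @ W1 @ h1, i.e. an ordinary RBM interaction term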
There are several reasons to take advantage of deep Boltzmann machine architectures. Like DBNs, they benefit from the ability to learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using a limited amount of labeled data to fine-tune representations built from a large supply of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they adopt an inference and training procedure in both directions, a bottom-up and a top-down pass, which enables the DBM to better unveil the representations of ambiguous and complex input structures.*[78]*[79]

Since exact maximum likelihood learning is intractable for DBMs, approximate maximum likelihood learning may be performed. Another possibility is to use mean-field inference to estimate data-dependent expectations, in combination with a Markov chain Monte Carlo (MCMC) based stochastic approximation technique to approximate the expected sufficient statistics of the model.*[77]

The difference between DBNs and DBMs can be seen here: in DBNs, the top two layers form a restricted Boltzmann machine, which is an undirected graphical model, but the lower layers form a directed generative model.

Apart from the advantages of DBMs discussed so far, they have a crucial disadvantage that limits their performance and functionality. Approximate inference, based on the mean-field method, is about 25 to 50 times slower than a single bottom-up pass in a DBN. This time-consuming step makes joint optimization quite impractical for large data sets and seriously restricts the use of DBMs for tasks such as feature representation (the mean-field inference has to be performed for each new test input).*[80]

2.4.7 Stacked (Denoising) Auto-Encoders

The auto-encoder idea is motivated by the concept of a good representation. For instance, for a classifier, a good representation can be defined as one that will yield a better-performing classifier.

An encoder is a deterministic mapping fθ that transforms an input vector x into a hidden representation y, where θ = {W, b}, W is the weight matrix and b is an offset vector (bias). A decoder maps the hidden representation y back to the reconstructed input z via gθ. The whole process of auto-encoding is to compare this reconstructed input to the original and try to minimize the error, making the reconstructed value as close as possible to the original.

In stacked denoising auto-encoders, the partially corrupted input is cleaned (denoised). This idea was introduced in*[81] with a specific notion of a good representation: a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. Implicit in this definition are the following ideas:

• The higher-level representations are relatively stable and robust to corruption of the input;

• It is necessary to extract features that are useful for representing the input distribution.

The algorithm consists of multiple steps. It starts with a stochastic mapping of x to x̃ through qD(x̃|x); this is the corrupting step. The corrupted input x̃ then passes through a basic auto-encoder and is mapped to a hidden representation y = fθ(x̃) = s(W x̃ + b). From this hidden representation we can reconstruct z = gθ(y). In the last stage, a minimization algorithm is run in order to make z as close as possible to the uncorrupted input x. The reconstruction error LH(x, z) might be either the cross-entropy loss with an affine-sigmoid decoder, or the squared error loss with an affine decoder.*[81]
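A minimal NumPy sketch of a single denoising auto-encoder step as described above follows: corrupt x, encode, decode, and measure the reconstruction error against the clean input. The corruption level, layer sizes, untied decoder weights, and squared-error loss are illustrative assumptions; training would then minimise this loss, and the learned code y would feed the next stacked layer:

    import numpy as np

    rng = np.random.default_rng(4)
    d_in, d_hid = 20, 8
    W = rng.normal(scale=0.1, size=(d_hid, d_in))      # encoder weights
    b = np.zeros(d_hid)                                # encoder bias
    W_dec = rng.normal(scale=0.1, size=(d_in, d_hid))  # decoder weights (untied here)
    b_dec = np.zeros(d_in)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    x = rng.normal(size=d_in)                    # clean input
    # stochastic corruption q_D(x~|x): randomly zero out 30% of the components
    mask = rng.random(d_in) > 0.3
    x_tilde = x * mask
    y = sigmoid(W @ x_tilde + b)                 # hidden representation y = f_theta(x~)
    z = W_dec @ y + b_dec                        # reconstruction z = g_theta(y), affine decoder
    loss = np.sum((z - x) ** 2)                  # squared-error L(x, z) against the clean input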
To make a deep architecture, auto-encoders are stacked one on top of another. Once the encoding function fθ of the first denoising auto-encoder has been learned and used to denoise the corrupted input, the second level can be trained.*[81]

Once the stacked auto-encoder is trained, its output can be used as the input to a supervised learning algorithm such as a support vector machine classifier or a multiclass logistic regression.*[81]

2.4.8 Deep Stacking Networks

One of the deep architectures recently introduced in*[82], based on building hierarchies with blocks of simplified neural network modules, is called the deep convex network. These networks are called "convex" because of the formulation of the weight-learning problem, which is a convex optimization problem with a closed-form solution. The network is also called the deep stacking network (DSN),*[83] emphasizing the fact that a mechanism similar to stacked generalization is used.*[84]

The DSN blocks, each consisting of a simple, easy-to-learn module, are stacked to form the overall deep network. It can be trained block-wise in a supervised fashion, without the need for back-propagation over the entire stack of blocks.*[85]
As designed in*[82], each block consists of a simplified MLP with a single hidden layer. It comprises a weight matrix U as the connection between the logistic sigmoidal units of the hidden layer h and the linear output layer y, and a weight matrix W that connects each input of the block to its hidden layer. Assume the target vectors t are arranged to form the columns of T (the target matrix), the input data vectors x are arranged to form the columns of X, H = σ(W^T X) denotes the matrix of hidden units (σ performing the element-wise logistic sigmoid operation), and the lower-layer weights W are known (trained layer-by-layer). Learning the upper-layer weight matrix U given the other weights in the network can then be formulated as a convex optimization problem:

min over U^T of f = ||U^T H − T||_F^2,

which has a closed-form solution. The input to the first block X contains only the original data; in the upper blocks, in addition to this original (raw) data, there is a copy of the output y of the lower block(s).
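The convex upper-layer problem above has the standard least-squares solution U = (H H^T)^(-1) H T^T. A NumPy sketch with random stand-ins for X, W, and T follows; the small ridge term is a numerical-stability assumption on my part, not part of the formulation quoted above:

    import numpy as np

    rng = np.random.default_rng(5)
    d, L, c, N = 10, 16, 3, 200              # input dim, hidden units, classes, examples
    X = rng.normal(size=(d, N))              # input vectors as columns of X
    T = rng.normal(size=(c, N))              # target vectors as columns of T
    W = rng.normal(scale=0.1, size=(d, L))   # lower-layer weights, assumed already trained

    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))     # hidden-unit matrix H = sigma(W^T X), shape L x N

    # closed-form solution of  min ||U^T H - T||_F^2 :  U = (H H^T)^(-1) H T^T
    ridge = 1e-6 * np.eye(L)                 # tiny regulariser for numerical stability (assumption)
    U = np.linalg.solve(H @ H.T + ridge, H @ T.T)   # L x c upper-layer weights

    predictions = U.T @ H                    # block output y for all training columns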
is a copy of the lower-block(s) output y. is called µ-ssRBM. This variant provides extra modeling
In each block an estimate of the same final label class y capacity to the architecture using additional terms in the
is produced, then this estimated label concatenated with energy function. One of these terms enable model to form
original input to form the expanded input for the upper a conditional distribution of the spike variables by means
block. In contrast with other deep architectures, such as of marginalizing out the slab variables given an observa-
DBNs, the goal is not to discover the transformed feature tion.
representation. Regarding the structure of the hierarchy
of this kind of architecture, it makes the parallel training
straightforward as the problem is naturally a batch-mode 2.4.11 Compound Hierarchical-Deep
optimization one. In purely discriminative tasks DSN Models
performance is better than the conventional DBN.* [83]
The class architectures called compound HD models,
where HD stands for Hierarchical-Deep are structured
2.4.9 Tensor Deep Stacking Networks (T- as a composition of non-parametric Bayesian models
DSN) with deep networks. The features, learned by deep
architectures such as DBNs,* [93] DBMs,* [78] deep
This architecture is an extension of the DSN. It improves auto encoders,* [94] convolutional variants,* [95]* [96]
the DSN in two important ways, using the higher order ssRBMs,* [92] deep coding network,* [97] DBNs with
information by means of covariance statistics and trans- sparse feature learning,* [98] recursive neural net-
forming the non-convex problem of the lower-layer to a works,* [99] conditional DBNs,* [100] denoising auto
convex sub-problem of the upper-layer.* [86] encoders,* [101] are able to provide better representation
for more rapid and accurate classification tasks with
Unlike the DSN, the covariance statistics of the data is high-dimensional training data sets. However, they are
employed using a bilinear mapping from two distinct sets not quite powerful in learning novel classes with few
of hidden units in the same layer to predictions via a third- examples, themselves. In these architectures, all units
order tensor. through the network are involved in the representation of
The scalability and parallelization are the two important the input (distributed representations), and they have to
factors in the learning algorithms which are not consid- be adjusted together (high degree of freedom). However,
ered seriously in the conventional DNNs.* [87]* [88]* [89] if we limit the degree of freedom, we make it easier for
All the learning process for the DSN (and TDSN as the model to learn new classes out of few training sam-
well) is done on a batch-mode basis so as to make the ples (less parameters to learn). Hierarchical Bayesian
parallelization possible on a cluster of CPU or GPU (HB) models, provide learning from few examples, for
nodes.* [82]* [83] Parallelization gives the opportunity to example * [102]* [103]* [104]* [105]* [106] for computer
vision, statistics, and cognitive science.

Compound HD architectures try to integrate characteristics of both HB models and deep networks. The compound HDP-DBM architecture combines a hierarchical Dirichlet process (HDP) as a hierarchical model with a DBM architecture. It is a full generative model, generalized from abstract concepts flowing through the layers of the model, which is able to synthesize new examples in novel classes that look reasonably natural. Note that all the levels are learned jointly by maximizing a joint log-probability score.*[107]

Consider a DBM with three hidden layers; the probability of a visible input ν is:

p(ν, ψ) = (1/Z) Σh exp( Σij W(1)ij νi h(1)j + Σjl W(2)jl h(1)j h(2)l + Σlm W(3)lm h(2)l h(3)m )

2.4.12 Deep Coding Networks

There are several advantages to having a model that can actively update itself in response to the context of the data. One such approach arises from the idea of having a model that can adjust its prior knowledge dynamically according to the context of the data. The deep predictive coding network (DPCN) is a predictive coding scheme in which top-down information is used to empirically adjust the priors needed for the bottom-up inference procedure, by means of a deep, locally connected generative model. This is based on extracting sparse features from time-varying observations using a linear dynamical model. A pooling strategy is then employed to learn invariant feature representations. Similar to other deep architectures, these blocks are the building elements of a deeper architecture in which greedy layer-wise unsupervised learning is used. Note that the layers constitute a kind of Markov chain, such that the states at any layer depend only on the succeeding and preceding layers.

The deep predictive coding network (DPCN)*[108] predicts the representation of a layer in a top-down fashion, using the information in the upper layer as well as temporal dependencies from previous states. It is also possible to extend the DPCN to form a convolutional network.*[108]

2.4.13 Deep Kernel Machines

The Multilayer Kernel Machine (MKM), as introduced in*[109], is a way of learning highly nonlinear functions through iterative application of weakly nonlinear kernels. It uses kernel principal component analysis (KPCA)*[110] as the method for the unsupervised, greedy, layer-wise pre-training step of the deep learning architecture.

Layer l + 1 learns the representation of the previous layer l, extracting the nl principal components (PC) of the previous layer's output in the feature domain induced by the kernel. There are some drawbacks in using the KPCA method as the building cells of an MKM.

Another, more straightforward method of integrating kernel machines into the deep learning architecture was developed by Microsoft researchers for spoken language understanding applications.*[111] The main idea is to use a kernel machine to approximate a shallow neural net with an infinite number of hidden units, and then to use the stacking technique to splice the output of the kernel machine and the raw input in building the next, higher level of the kernel machine. The number of levels in this kernel version of the deep convex network is a hyper-parameter of the overall system, determined by cross-validation.
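A rough sketch of the layer-wise KPCA idea using scikit-learn follows. The RBF kernel, component counts, toy data, and the logistic-regression classifier on top are assumptions for illustration, and the supervised feature-selection step used in the MKM work is omitted:

    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    X = rng.normal(size=(300, 20))                  # toy inputs
    y = (X[:, 0] + X[:, 1] > 0).astype(int)         # toy labels

    # unsupervised, greedy layer-wise "pre-training" with kernel PCA
    layer1 = KernelPCA(n_components=15, kernel="rbf")
    h1 = layer1.fit_transform(X)                    # representation learned by layer 1
    layer2 = KernelPCA(n_components=10, kernel="rbf")
    h2 = layer2.fit_transform(h1)                   # layer 2 re-represents layer 1's output

    clf = LogisticRegression(max_iter=1000).fit(h2, y)   # supervised classifier on top
    print("training accuracy:", clf.score(h2, y))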
2.4.14 Deep Q-Networks

This is the latest class of deep learning models, targeted at reinforcement learning and published in February 2015 in Nature.*[112]
2.5 Applications

2.5.1 Automatic speech recognition

Results for automatic speech recognition are commonly reported on the popular TIMIT data set. This is a common data set used for initial evaluations of deep learning architectures. The entire set contains 630 speakers from eight major dialects of American English, with each speaker reading 10 sentences.*[113] Its small size allows many different configurations to be tried effectively. More importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence recognition, permits very weak "language models", so the weaknesses in the acoustic modeling aspects of speech recognition can be more easily analyzed. It was such analysis on TIMIT, contrasting the GMM (and other generative models of speech) with DNN models, carried out by Li Deng and collaborators around 2009-2010, that stimulated early industrial investment in deep learning technology for speech recognition from small to large scales,*[25]*[36] eventually leading to pervasive and dominant use of deep learning in the speech recognition industry. That analysis was carried out with comparable performance (less than 1.5% difference in error rate) between discriminative DNNs and generative models. Error rates on this task, including these early results and measured as percent phone error rate (PER), have been summarized over a time span of the past 20 years.

Extension of the success of deep learning from TIMIT to large-vocabulary speech recognition occurred in 2010, when industrial researchers adopted large DNN output layers based on context-dependent HMM states constructed by decision trees.*[116]*[117] See comprehensive reviews of this development, and of the state of the art as of October 2014, in the recent Springer book from Microsoft Research.*[37] See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including deep learning, in a recent overview article.*[118]

One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in the architecture of a deep autoencoder on "raw" spectrogram or linear filter-bank features,*[119] showing its superiority over the Mel-cepstral features, which contain a few stages of fixed transformation from spectrograms. The true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.*[120]

Since the initial successful debut of DNNs for speech recognition around 2009-2011, there has been huge progress. This progress (as well as future directions) has been summarized into the following eight major areas:*[1]*[27]*[37] 1) scaling up/out and speeding up DNN training and decoding; 2) sequence-discriminative training of DNNs; 3) feature processing by deep models with solid understanding of the underlying mechanisms; 4) adaptation of DNNs and of related deep models; 5) multi-task and transfer learning by DNNs and related deep models; 6) convolutional neural networks and how to design them to best exploit domain knowledge of speech; 7) recurrent neural networks and their rich LSTM variants; 8) other types of deep models, including tensor-based models and integrated deep generative/discriminative models.

Large-scale automatic speech recognition is the first and most convincing successful case of deep learning in recent history, embraced by both industry and academia across the board. Between 2010 and 2014, the two major conferences on signal processing and speech recognition, IEEE-ICASSP and Interspeech, saw near-exponential growth in the number of accepted papers on the topic of deep learning for speech recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are nowadays based on deep learning methods.*[1]*[121]*[122] See also the recent media interview with the CTO of Nuance Communications.*[123]

The widespread success in speech recognition achieved by 2011 was followed shortly by large-scale image recognition, described next.

2.5.2 Image recognition

A common evaluation set for image classification is the MNIST data set. MNIST is composed of handwritten digits and includes 60,000 training examples and 10,000 test examples. Similar to TIMIT, its small size allows multiple configurations to be tested. A comprehensive list of results on this set can be found in.*[124] The current best result on MNIST is an error rate of 0.23%, achieved by Ciresan et al. in 2012.*[125]

The real impact of deep learning in image or object recognition, one major branch of computer vision, was felt in the fall of 2012, after the team of Geoff Hinton and his students won the large-scale ImageNet competition by a significant margin over the then-state-of-the-art shallow machine learning methods. The technology is based on 20-year-old deep convolutional nets, but at much larger scale on a much larger task, since it had been learned that deep learning works quite well on large-scale speech recognition. In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced at a rapid pace, following a similar trend in large-scale speech recognition.

As in the ambitious moves from automatic speech recognition toward automatic speech translation and understanding, image classification has recently been extended to the more ambitious and challenging task of automatic image captioning, in which deep learning is the essential underlying technology.*[126]*[127]*[128]*[129]
One example application is a car computer said to be trained with deep learning, which may enable cars to interpret 360° camera views.*[130]

2.5.3 Natural language processing

2.5.5 Customer relationship management

…illustrating the suitability of the method for CRM automation. A neural network was used to approximate the value of possible direct marketing actions over the customer state space, defined in terms of RFM variables. The estimated value function was shown to have a natural interpretation as customer lifetime value (CLV).*[151]

2.6 Deep learning in the human brain

…they may also lead to changes in the extraction of information from the stimulus environment during the early self-organization of the brain. Of course, along with this flexibility comes an extended period of immaturity, during which we are dependent upon our caretakers and our community for both support and training. The theory of deep learning therefore sees the coevolution of culture and cognition as a fundamental condition of human evolution.*[158]

2.7 Commercial activity

Deep learning is often presented as a step towards realising strong AI,*[159] and thus many organizations have become interested in its use for particular applications. Most recently, in December 2013, Facebook announced that it had hired Yann LeCun to head its new artificial intelligence (AI) lab, with operations in California, London, and New York. The AI lab will be used for developing deep learning techniques that will help Facebook do tasks such as automatically tagging uploaded pictures with the names of the people in them.*[160]

In March 2013, Geoffrey Hinton and two of his graduate students, Alex Krizhevsky and Ilya Sutskever, were hired by Google. Their work focuses both on improving existing machine learning products at Google and on helping deal with the growing amount of data Google has. Google also purchased Hinton's company, DNNresearch.

In 2014 Google also acquired DeepMind Technologies, a British start-up that developed a system capable of learning how to play Atari video games using only raw pixels as data input.

Baidu hired Andrew Ng to head its new Silicon Valley-based research lab focusing on deep learning.

2.8 Criticism and comment

Given the far-reaching implications of artificial intelligence, coupled with the realization that deep learning is emerging as one of its most powerful techniques, the subject is understandably attracting both criticism and comment, in some cases from outside the field of computer science itself.

A main criticism of deep learning concerns the lack of theory surrounding many of the methods. Most of the learning in deep architectures is just some form of gradient descent. While gradient descent has been understood for a while now, the theory surrounding other algorithms, such as contrastive divergence, is less clear (i.e., does it converge? If so, how fast? What is it approximating?). Deep learning methods are often looked at as a black box, with most confirmations done empirically rather than theoretically.

Others point out that deep learning should be looked at as a step towards realizing strong AI, not as an all-encompassing solution. Despite the power of deep learning methods, they still lack much of the functionality needed to realize this goal entirely. Research psychologist Gary Marcus has noted: "Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning."*[161]

To the extent that such a viewpoint implies, without intending to, that deep learning will ultimately constitute nothing more than the primitive discriminatory levels of a comprehensive future machine intelligence, a recent pair of speculations regarding art and artificial intelligence*[162] offers an alternative and more expansive outlook. The first such speculation is that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between "old master" and amateur figure drawings; the second is that such a sensitivity might in fact represent the rudiments of a non-trivial machine empathy. It is suggested, moreover, that such an eventuality would be in line both with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity, and with a current school of thought which suspects that the allied phenomenon of consciousness, formerly thought of as a purely high-order phenomenon, may in fact have roots deep within the structure of the universe itself.

Some currently popular and successful deep learning architectures display certain problematic behaviors*[163] (e.g., confidently classifying random data as belonging to a familiar category of nonrandom images,*[164] and misclassifying minuscule perturbations of correctly classified images*[165]). The creator of OpenCog, Ben Goertzel, hypothesized*[163] that these behaviors are tied to limitations in the internal representations learned by these architectures, and that these same limitations would inhibit integration of these architectures into heterogeneous multi-component AGI architectures. It is suggested that these issues can be worked around by developing deep learning architectures that internally form states homologous to image-grammar*[166] decompositions of observed entities and events.*[163] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of production rules of the grammar, and is a basic goal of both human language acquisition and A.I.
2.11 References
[17] J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," International Journal of Computer Vision, vol. 25, no. 2, pp. 105-139, Nov. 1997.

[18] Morgan, Bourlard, Renals, Cohen, Franco (1993) "Hybrid neural network/hidden Markov model systems for continuous speech recognition. ICASSP/IJPRAI"

[19] T. Robinson. (1992) A real-time recurrent error propagation network word recognition system, ICASSP.

[20] Waibel, Hanazawa, Hinton, Shikano, Lang. (1989) "Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing."

[21] J. Baker, Li Deng, Jim Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O'Shaughnessy (2009). "Research Developments and Directions in Speech Recognition and Understanding, Part 1," IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80, 2009.

[22] Y. Bengio (1991). "Artificial Neural Networks and their Application to Speech/Sequence Recognition," Ph.D. thesis, McGill University, Canada.

[23] L. Deng, K. Hassanein, M. Elmasry. (1994) "Analysis of correlation structure for a neural predictive model with applications to speech recognition," Neural Networks, vol. 7, no. 2, pp. 331-339.

[24] Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.; Kingsbury, B. (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition --- The shared views of four research groups". IEEE Signal Processing Magazine 29 (6): 82–97. doi:10.1109/msp.2012.2205597.

[25] Deng, L.; Hinton, G.; Kingsbury, B. (2013). "New types of deep neural network learning for speech recognition and related applications: An overview (ICASSP)".

[26] Keynote talk: Recent Developments in Deep Neural Networks. ICASSP, 2013 (by Geoff Hinton).

[27] Keynote talk: "Achievements and Challenges of Deep Learning - From Speech Analysis and Recognition To Language and Multimodal Processing," Interspeech, September 2014.

[28] G. E. Hinton, "Learning multiple layers of representation," Trends in Cognitive Sciences, 11, pp. 428–434, 2007.

[29] J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression," Neural Computation, 4, pp. 234–242, 1992.

[30] J. Schmidhuber, "My First Deep Learning System of 1991 + Deep Learning Timeline 1962–2013."

[31] http://research.microsoft.com/apps/pubs/default.aspx?id=189004

[32] L. Deng et al. Recent Advances in Deep Learning for Speech Research at Microsoft, ICASSP, 2013.

[33] L. Deng, O. Abdel-Hamid, and D. Yu, A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion, ICASSP, 2013.

[34] T. Sainath et al., "Convolutional neural networks for LVCSR," ICASSP, 2013.

[35] D. Yu, L. Deng, G. Li, and F. Seide (2011). "Discriminative pretraining of deep neural networks," U.S. Patent Filing.

[36] NIPS Workshop: Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, Dec. 2009 (Organizers: Li Deng, Geoff Hinton, D. Yu).

[37] Yu, D.; Deng, L. (2014). "Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer)".

[38] D. C. Ciresan et al., "Deep Big Simple Neural Nets for Handwritten Digit Recognition," Neural Computation, 22, pp. 3207–3220, 2010.

[39] R. Raina, A. Madhavan, A. Ng., "Large-scale Deep Unsupervised Learning using Graphics Processors," Proc. 26th Int. Conf. on Machine Learning, 2009.

[40] Riesenhuber, M; Poggio, T. "Hierarchical models of object recognition in cortex". Nature Neuroscience 1999 (11): 1019–1025.

[41] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, 1989.

[42] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber.

[43] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.

[44] Hochreiter, Sepp; and Schmidhuber, Jürgen; Long Short-Term Memory, Neural Computation, 9(8):1735–1780, 1997.

[45] Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.

[46] Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5): 2009.
[47] Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science 2766. Springer.

[48] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1. pp. 194–281.

[49] Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.

[50] Hinton, G. (2009). "Deep belief networks". Scholarpedia 4 (5): 5947. doi:10.4249/scholarpedia.5947.

[51] John Markoff (25 June 2012). "How Many Computers to Identify a Cat? 16,000.". New York Times.

[52] Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning" (PDF).

[53] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011.

[54] Martines, H.; Bengio, Y.; Yannakakis, G. N. (2013). "Learning Deep Physiological Models of Affect". IEEE Computational Intelligence 8 (2): 20.

[55] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks, 2012.

[56] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.

[57] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012.

[58] D. J. Felleman and D. C. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex," Cerebral Cortex, 1, pp. 1-47, 1991.

[59] J. Weng, "Natural and Artificial Intelligence: Introduction to Computational Brain-Mind," BMI Press, ISBN 978-0985875725, 2012.

[60] J. Weng, "Why Have We Passed 'Neural Networks Do not Abstract Well'?," Natural Intelligence: the INNS Magazine, vol. 1, no. 1, pp. 13-22, 2011.

[61] Z. Ji, J. Weng, and D. Prokhorov, "Where-What Network 1: Where and What Assist Each Other Through Top-down Connections," Proc. 7th International Conference on Development and Learning (ICDL'08), Monterey, CA, Aug. 9-12, pp. 1-6, 2008.

[62] X. Wu, G. Guo, and J. Weng, "Skull-closed Autonomous Development: WWN-7 Dealing with Scales," Proc. International Conference on Brain-Mind, July 27–28, East Lansing, Michigan, pp. 1-9, 2013.

[63] Szegedy, Christian; Toshev, Alexander; Erhan, Dumitru. "Deep neural networks for object detection." Advances in Neural Information Processing Systems. 2013.

[64] T. Mikolov et al., "Recurrent neural network based language model," Interspeech, 2010.

[65] LeCun, Y. et al. "Gradient-based learning applied to document recognition". Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791.

[66] G. E. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups," IEEE Signal Processing Magazine, pp. 82–97, November 2012.

[67] Y. Bengio et al., "Advances in optimizing recurrent networks," ICASSP, 2013.

[68] G. Dahl et al., "Improving DNNs for LVCSR using rectified linear units and dropout," ICASSP, 2013.

[69] G. E. Hinton, "A Practical Guide to Training Restricted Boltzmann Machines," Tech. Rep. UTML TR 2010-003, Dept. CS., Univ. of Toronto, 2010.

[70] Hinton, G. E. "Deep belief networks". Scholarpedia 4 (5): 5947. doi:10.4249/scholarpedia.5947.

[71] H. Larochelle et al., "An empirical evaluation of deep architectures on problems with many factors of variation," in Proc. 24th Int. Conf. Machine Learning, pp. 473–480, 2007.

[72] G. E. Hinton, "Training Product of Experts by Minimizing Contrastive Divergence," Neural Computation, 14, pp. 1771–1800, 2002.

[73] A. Fischer and C. Igel. Training Restricted Boltzmann Machines: An Introduction. Pattern Recognition 47, pp. 25-39, 2014.

[74] http://ufldl.stanford.edu/tutorial/index.php/Convolutional_Neural_Network

[75]

[76]

[77] Hinton, Geoffrey; Salakhutdinov, Ruslan (2012). "A better way to pretrain deep Boltzmann machines" (PDF). Advances in Neural 3: 1–9.

[78] Hinton, Geoffrey; Salakhutdinov, Ruslan (2009). "Efficient Learning of Deep Boltzmann Machines" (PDF) 3. pp. 448–455.

[79] Bengio, Yoshua; LeCun, Yann (2007). "Scaling Learning Algorithms towards AI" (PDF) 1. pp. 1–41.

[80] Larochelle, Hugo; Salakhutdinov, Ruslan (2010). "Efficient Learning of Deep Boltzmann Machines" (PDF). pp. 693–700.
[81] Vincent, Pascal; Larochelle, Hugo; Lajoie, Isabelle; Bengio, Yoshua; Manzagol, Pierre-Antoine (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion". The Journal of Machine Learning Research 11: 3371–3408.

[82] Deng, Li; Yu, Dong (2011). "Deep Convex Net: A Scalable Architecture for Speech Pattern Classification" (PDF). Proceedings of the Interspeech: 2285–2288.

[83] Deng, Li; Yu, Dong; Platt, John (2012). "Scalable stacking and learning for building deep architectures". 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 2133–2136.

[84] Wolpert, David (1992). "Stacked generalization". Neural Networks 5(2): 241–259. doi:10.1016/S0893-6080(05)80023-1.

[85] Bengio, Yoshua (2009). "Learning deep architectures for AI". Foundations and Trends in Machine Learning 2(1): 1–127.

[86] Hutchinson, Brian; Deng, Li; Yu, Dong (2012). "Tensor deep stacking networks". IEEE Transactions on Pattern Analysis and Machine Intelligence: 1–15.

[87] Hinton, Geoffrey; Salakhutdinov, Ruslan (2006). "Reducing the Dimensionality of Data with Neural Networks". Science 313: 504–507. doi:10.1126/science.1127647. PMID 16873662.

[88] Dahl, G.; Yu, D.; Deng, L.; Acero, A. (2012). "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition". Audio, Speech, and ... 20(1): 30–42.

[89] Mohamed, Abdel-rahman; Dahl, George; Hinton, Geoffrey (2012). "Acoustic Modeling Using Deep Belief Networks". IEEE Transactions on Audio, Speech, and Language Processing 20(1): 14–22.

[90] Courville, Aaron; Bergstra, James; Bengio, Yoshua (2011). "A Spike and Slab Restricted Boltzmann Machine" (PDF). International ... 15: 233–241.

[91] Mitchell, T; Beauchamp, J (1988). "Bayesian Variable Selection in Linear Regression". Journal of the American Statistical Association 83 (404): 1023–1032. doi:10.1080/01621459.1988.10478694.

[92] Courville, Aaron; Bergstra, James; Bengio, Yoshua (2011). "Unsupervised Models of Images by Spike-and-Slab RBMs" (PDF). Proceedings of the ... 10: 1–8.

[93] Hinton, Geoffrey; Osindero, Simon; Teh, Yee-Whye (2006). "A Fast Learning Algorithm for Deep Belief Nets". Neural Computation 1554: 1527–1554.

[94] Larochelle, Hugo; Bengio, Yoshua; Louradour, Jerdme; Lamblin, Pascal (2009). "Exploring Strategies for Training Deep Neural Networks". The Journal of Machine Learning Research 10: 1–40.

[95] Coates, Adam; Carpenter, Blake (2011). "Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning". pp. 440–445.

[96] Lee, Honglak; Grosse, Roger (2009). "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations". Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09: 1–8.

[97] Lin, Yuanqing; Zhang, Tong (2010). "Deep Coding Network" (PDF). Advances in Neural ...: 1–9.

[98] Ranzato, Marc Aurelio; Boureau, Y-Lan (2007). "Sparse Feature Learning for Deep Belief Networks" (PDF). Advances in Neural Information ...: 1–8.

[99] Socher, Richard; Lin, Clif (2011). "Parsing Natural Scenes and Natural Language with Recursive Neural Networks" (PDF). Proceedings of the ...

[100] Taylor, Graham; Hinton, Geoffrey (2006). "Modeling Human Motion Using Binary Latent Variables" (PDF). Advances in Neural ...

[101] Vincent, Pascal; Larochelle, Hugo (2008). "Extracting and composing robust features with denoising autoencoders". Proceedings of the 25th International Conference on Machine Learning - ICML '08: 1096–1103.

[102] Kemp, Charles; Perfors, Amy; Tenenbaum, Joshua (2007). "Learning overhypotheses with hierarchical Bayesian models". Developmental Science 10(3): 307–21. doi:10.1111/j.1467-7687.2007.00585.x. PMID 17444972.

[103] Xu, Fei; Tenenbaum, Joshua (2007). "Word learning as Bayesian inference". Psychol Rev. 114(2): 245–72. doi:10.1037/0033-295X.114.2.245. PMID 17500627.

[104] Chen, Bo; Polatkan, Gungor (2011). "The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning" (PDF). Machine Learning ...

[105] Fei-Fei, Li; Fergus, Rob (2006). "One-shot learning of object categories". IEEE Trans Pattern Anal Mach Intell. 28(4): 594–611. doi:10.1109/TPAMI.2006.79. PMID 16566508.

[106] Rodriguez, Abel; Dunson, David (2008). "The Nested Dirichlet Process". Journal of the American Statistical Association 103(483): 1131–1154. doi:10.1198/016214508000000553.

[107] Salakhutdinov, Ruslan; Tenenbaum, Joshua (2012). "Learning with Hierarchical-Deep Models". IEEE Transactions on Pattern Analysis and Machine Intelligence: 1–14. PMID 23267196.

[108] Chalasani, Rakesh; Principe, Jose (2013). "Deep Predictive Coding Networks". arXiv preprint arXiv: 1–13.

[109] Cho, Youngmin (2012). "Kernel Methods for Deep Learning" (PDF). pp. 1–9.

[110] Scholkopf, B; Smola, Alexander (1998). "Nonlinear component analysis as a kernel eigenvalue problem". Neural Computation (44).

[111] L. Deng, G. Tur, X. He, and D. Hakkani-Tur. "Use of Kernel Deep Convex Networks and End-To-End Learning for Spoken Language Understanding," Proc. IEEE Workshop on Spoken Language Technologies, 2012.
2.11. REFERENCES 29
[112] Mnih, Volodymyr et al. (2015). “Human-level control [131] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin.,“A Neu-
through deep reinforcement learning” (PDF) 518. pp. ral Probabilistic Language Model,”Journal of Machine
529–533. Learning Research 3 (2003) 1137–1155', 2003.
[113] TIMIT Acoustic-Phonetic Continuous Speech Corpus Lin- [132] Goldberg, Yoav; Levy, Omar. “word2vec Explained:
guistic Data Consortium, Philadelphia. Deriving Mikolov et al.’s Negative-Sampling Word-
Embedding Method” (PDF). Arxiv. Retrieved 26 Oc-
[114] Abdel-Hamid, O. et al. (2014). “Convolutional Neural tober 2014.
Networks for Speech Recognition”. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing 22 (10): [133] Socher, Richard; Manning, Christopher.“Deep Learning
1533–1545. doi:10.1109/taslp.2014.2339736. for NLP” (PDF). Retrieved 26 October 2014.
[115] Deng, L.; Platt, J. (2014). “Ensemble Deep Learning for [134] Socher, Richard; Bauer, John; Manning, Christopher; Ng,
Speech Recognition”. Proc. Interspeech. Andrew (2013). “Parsing With Compositional Vector
Grammars”(PDF). Proceedings of the ACL 2013 confer-
[116] Yu, D.; Deng, L. (2010). “Roles of Pre-Training
ence.
and Fine-Tuning in Context-Dependent DBN-HMMs for
Real-World Speech Recognition”. NIPS Workshop on [135] Socher, Richard (2013). “Recursive Deep Models for
Deep Learning and Unsupervised Feature Learning. Semantic Compositionality Over a Sentiment Treebank”
[117] Deng L., Li, J., Huang, J., Yao, K., Yu, D., Seide, F. et al. (PDF). EMNLP 2013.
Recent Advances in Deep Learning for Speech Research [136] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014)
at Microsoft. ICASSP, 2013. " A Latent Semantic Model with Convolutional-Pooling
[118] Deng, L.; Li, Xiao (2013).“Machine Learning Paradigms Structure for Information Retrieval,”Proc. CIKM.
for Speech Recognition: An Overview”. IEEE Transac-
[137] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck
tions on Audio, Speech, and Language Processing.
(2013) “Learning Deep Structured Semantic Models for
[119] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and Web Search using Clickthrough Data,”Proc. CIKM.
G. Hinton (2010) Binary Coding of Speech Spectrograms
[138] I. Sutskever, O. Vinyals, Q. Le (2014) “Sequence to Se-
Using a Deep Auto-encoder. Interspeech.
quence Learning with Neural Networks,”Proc. NIPS.
[120] Z. Tuske, P. Golik, R. Schlüter and H. Ney (2014).
Acoustic Modeling with Deep Neural Networks Using [139] J. Gao, X. He, W. Yih, and L. Deng(2014) “Learning
Raw Time Signal for LVCSR. Interspeech. Continuous Phrase Representations for Translation Mod-
eling,”Proc. ACL.
[121] McMillan, R.“How Skype Used AI to Build Its Amazing
New Language Translator”, Wire, Dec. 2014. [140] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng (2014)
“Modeling Interestingness with Deep Neural Networks,”
[122] Hannun et al. (2014) “Deep Speech: Scaling up end-to- Proc. EMNLP.
end speech recognition”, arXiv:1412.5567.
[141] J. Gao, X. He, L. Deng (2014) “Deep Learning for Nat-
[123] Ron Schneiderman (2015) “Accuracy, Apps Advance ural Language Processing: Theory and Practice (Tuto-
Speech Recognition --- Interview with Vlad Sejnoha and rial),”CIKM.
Li Deng”, IEEE Signal Processing Magazine, Jan, 2015.
[142] Arrowsmith, J; Miller, P (2013). “Trial watch: Phase
[124] http://yann.lecun.com/exdb/mnist/. II and phase III attrition rates 2011-2012”. Nature Re-
views Drug Discovery 12 (8): 569. doi:10.1038/nrd4090.
[125] D. Ciresan, U. Meier, J. Schmidhuber., “Multi-column PMID 23903212.
Deep Neural Networks for Image Classification,”Techni-
cal Report No. IDSIA-04-12', 2012. [143] Verbist, B; Klambauer, G; Vervoort, L; Talloen, W;
The Qstar, Consortium; Shkedy, Z; Thas, O; Ben-
[126] Vinyals et al. (2014)."Show and Tell: A Neural Image
der, A; Göhlmann, H. W.; Hochreiter, S (2015).
Caption Generator,”arXiv:1411.4555.
“Using transcriptomics to guide lead optimization
[127] Fang et al. (2014)."From Captions to Visual Concepts and in drug discovery projects: Lessons learned from
Back,”arXiv:1411.4952. the QSTAR project”. Drug Discovery Today.
doi:10.1016/j.drudis.2014.12.014. PMID 25582842.
[128] Kiros et al. (2014)."Unifying Visual-Semantic Embed-
dings with Multimodal Neural Language Models,”arXiv: [144]“Announcement of the winners of the Merck Molec-
1411.2539. ular Activity Challenge”https://www.kaggle.com/c/
MerckActivity/details/winners.
[129] Zhong, S.; Liu, Y.; Liu, Y. “Bilinear Deep Learning for
Image Classification”. Proceedings of the 19th ACM In- [145] Dahl, G. E.; Jaitly, N.; & Salakhutdinov, R. (2014)
ternational Conference on Multimedia 11: 343–352. “Multi-task Neural Networks for QSAR Predictions,”
ArXiv, 2014.
[130] Nvidia Demos a Car Computer Trained with “Deep
Learning” (2015-01-06), David Talbot, MIT Technology [146]“Toxicology in the 21st century Data Challenge”https:
Review //tripod.nih.gov/tox21/challenge/leaderboard.jsp
30 CHAPTER 2. DEEP LEARNING
[147]“NCATS Announces Tox21 Data Challenge Winners” [162] Smith, G. W. (March 27, 2015). “Art and Artificial In-
http://www.ncats.nih.gov/news-and-events/features/ telligence”. ArtEnt. Retrieved March 27, 2015.
tox21-challenge-winners.html
[163] Ben Goertzel. Are there Deep Reasons Underlying the
[148] Unterthiner, T.; Mayr, A.; Klambauer, G.; Steijaert, M.; Pathologies of Today’s Deep Learning Algorithms?
Ceulemans, H.; Wegner, J. K.; & Hochreiter, S. (2014) (2015) Url: http://goertzel.org/DeepLearning_v1.pdf
“Deep Learning as an Opportunity in Virtual Screening”
. Workshop on Deep Learning and Representation Learn- [164] Nguyen, Anh, Jason Yosinski, and Jeff Clune. “Deep
ing (NIPS2014). Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images.”arXiv preprint
[149] Unterthiner, T.; Mayr, A.; Klambauer, G.; & Hochreiter, arXiv:1412.1897 (2014).
S. (2015) “Toxicity Prediction using Deep Learning”.
[165] Szegedy, Christian, et al.“Intriguing properties of neural
ArXiv, 2015.
networks.”arXiv preprint arXiv:1312.6199 (2013).
[150] Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Kon-
[166] Zhu, S.C.; Mumford, D. “A stochastic grammar of im-
erding, D.;& Pande, V. (2015)“Massively Multitask Net-
ages”. Found. Trends. Comput. Graph. Vis. 2 (4):
works for Drug Discovery”. ArXiv, 2015.
259–362. doi:10.1561/0600000018.
[151] Tkachenko, Yegor. Autonomous CRM Control via CLV [167] Miller, G. A., and N. Chomsky. “Pattern conception.”
Approximation with Deep Reinforcement Learning in Paper for Conference on pattern detection, University of
Discrete and Continuous Action Space. (April 8, 2015). Michigan. 1957.
arXiv.org: http://arxiv.org/abs/1504.01840
[168] Jason Eisner, Deep Learning of Recursive Struc-
[152] Utgoff, P. E.; Stracuzzi, D. J. (2002). “Many- ture: Grammar Induction, http://techtalks.tv/talks/
layered learning”. Neural Computation 14: 2497–2529. deep-learning-of-recursive-structure-grammar-induction/
doi:10.1162/08997660260293319. 58089/
[153] J. Elman, et al., “Rethinking Innateness,”1996.
2.12 External links

• TED talk on the applications of deep learning and future consequences by Jeremy Howard

• Deep learning information from the University of Montreal

• Deep learning information from Stanford University

• Deep Learning Resources, NVIDIA Developer Zone

• Geoffrey Hinton's webpage

• Hinton deep learning tutorial

• Yann LeCun's webpage

• The Center for Biological and Computational Learning (CBCL)

• Stanford tutorial on unsupervised feature learning and deep learning

• Google's DistBelief Framework

• NIPS 2013 Conference (talks on deep learning related material)

• Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Graves, Alex; Antonoglou, Ioannis; Wierstra, Daan; Riedmiller, Martin (2013), Playing Atari with Deep Reinforcement Learning (PDF), arXiv:1312.5602
Feature learning
Feature learning or representation learning*[1] is a set of techniques that learn a transformation of raw data input to a representation that can be effectively exploited in machine learning tasks.

Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor measurements are usually complex, redundant, and highly variable. Thus, it is necessary to discover useful features or representations from raw data. Traditional hand-crafted features often require expensive human labor and often rely on expert knowledge, and they normally do not generalize well. This motivates the design of efficient feature learning techniques.

Feature learning can be divided into two categories: supervised and unsupervised feature learning.

• In supervised feature learning, features are learned with labeled input data. Examples include neural networks, the multilayer perceptron, and (supervised) dictionary learning.

• In unsupervised feature learning, features are learned with unlabeled input data. Examples include the approaches discussed in Section 3.2: k-means clustering, principal component analysis, local linear embedding, independent component analysis, and (unsupervised) dictionary learning.

3.1 Supervised feature learning

In dictionary learning, each data point is represented as a weighted sum of dictionary elements (basis functions). The dictionary elements and the weights may be found by minimizing the average representation error (over the input data), together with an L1 regularization on the weights to enable sparsity (i.e., the representation of each data point has only a few nonzero weights).

Supervised dictionary learning exploits both the structure underlying the input data and the labels for optimizing the dictionary elements. For example, a supervised dictionary learning technique was proposed by Mairal et al. in 2009.*[6] The authors apply dictionary learning to classification problems by jointly optimizing the dictionary elements, weights for representing data points, and parameters of the classifier based on the input data. In particular, a minimization problem is formulated, where the objective function consists of the classification error, the representation error, an L1 regularization on the representing weights for each data point (to enable sparse representation of data), and an L2 regularization on the parameters of the classifier.
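As a sketch of the general form just described (the notation is ours, not taken from Mairal et al.), the joint objective over a dictionary D, per-point weights w_i, and classifier parameters θ can be written as

\min_{D,\,\{w_i\},\,\theta}\;\sum_{i=1}^{n}\Big(\ell\big(y_i, f(w_i;\theta)\big) + \lVert x_i - D w_i\rVert_2^2 + \lambda\lVert w_i\rVert_1\Big) + \gamma\lVert\theta\rVert_2^2,

where ℓ is the classification loss, the second term is the representation error, the L1 term encourages sparse weights, and the L2 term regularizes the classifier.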
3.2 Unsupervised feature learning

Unsupervised feature learning is to learn features from unlabeled data. The goal of unsupervised feature learning is often to discover low-dimensional features that capture some structure underlying the high-dimensional input data. When the feature learning is performed in an unsupervised way, it enables a form of semisupervised learning where features are first learned from an unlabeled dataset and then employed to improve performance in a supervised setting with labeled data.*[7]*[8] Several approaches are introduced in the following.

3.2.1 K-means clustering

K-means clustering is an approach for vector quantization. In particular, given a set of n vectors, k-means clustering groups them into k clusters (i.e., subsets) in such a way that each vector belongs to the cluster with the closest mean. The problem is computationally NP-hard, and suboptimal greedy algorithms have been developed for k-means clustering.

In feature learning, k-means clustering can be used to group an unlabeled set of inputs into k clusters, and then use the centroids of these clusters to produce features. These features can be produced in several ways. The simplest way is to add k binary features to each sample, where each feature j has value one iff the jth centroid learned by k-means is the closest to the sample under consideration.*[3] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks*[9]). Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms.*[10]

In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.*[3] K-means has also been shown to improve performance in the domain of NLP, specifically for named-entity recognition;*[11] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings).*[8]
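The feature constructions described above can be sketched in a few lines. The following assumes scikit-learn and NumPy are available; it is illustrative only and not the evaluation protocol used by Coates, Lee and Ng.

# A minimal sketch of k-means feature learning (hard one-hot or RBF-transformed distances).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_features(X_unlabeled, X, k=50, hard=True):
    """Learn k centroids on unlabeled data, then featurize X."""
    km = KMeans(n_clusters=k, n_init=10).fit(X_unlabeled)
    dists = km.transform(X)            # distances from each sample to the k centroids
    if hard:
        # k binary features: 1 for the closest centroid, 0 elsewhere
        feats = np.zeros_like(dists)
        feats[np.arange(len(X)), np.argmin(dists, axis=1)] = 1.0
        return feats
    # soft alternative: pass distances through a radial basis function
    sigma = dists.mean()
    return np.exp(-(dists ** 2) / (2 * sigma ** 2))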
3.2.2 Principal component analysis

Principal component analysis (PCA) is often used for dimension reduction. Given an unlabeled set of n input data vectors, PCA generates p (which is much smaller than the dimension of the input data) right singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample mean from the data vector). Equivalently, these singular vectors are the eigenvectors corresponding to the p largest eigenvalues of the sample covariance matrix of the input vectors. These p singular vectors are the feature vectors learned from the input data, and they represent directions along which the data has the largest variations.

PCA is a linear feature learning approach, since the p singular vectors are linear functions of the data matrix. The singular vectors can be generated via a simple algorithm with p iterations. In the ith iteration, the projection of the data matrix on the (i−1)th eigenvector is subtracted, and the ith singular vector is found as the right singular vector corresponding to the largest singular value of the residual data matrix.

PCA has several limitations. First, it assumes that the directions with large variance are of most interest, which may not be the case in many applications. PCA only relies on orthogonal transformations of the original data, and it only exploits the first- and second-order moments of the data, which may not well characterize the distribution of the data. Furthermore, PCA can effectively reduce dimension only when the input data vectors are correlated (which results in a few dominant eigenvalues).
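A minimal NumPy sketch of the procedure described above, computing the top-p directions by SVD of the mean-centered data (an equivalent, non-iterative alternative to the deflation algorithm in the text):

# PCA by SVD: returns the p learned feature directions.
import numpy as np

def pca_features(X, p):
    """X: (n, d) data matrix. Returns a (p, d) array whose rows are the
    top-p right singular vectors of the mean-centered data."""
    X_centered = X - X.mean(axis=0)            # subtract the sample mean
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[:p]

# Usage: Z = (X - X.mean(axis=0)) @ pca_features(X, p).T gives p-dimensional representations.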
3.2.3 Local linear embedding

Local linear embedding (LLE) is a nonlinear unsupervised learning approach for generating low-dimensional neighbor-preserving representations from (unlabeled) high-dimensional input. The approach was proposed by Sam T. Roweis and Lawrence K. Saul in 2000.*[12]*[13]

The general idea of LLE is to reconstruct the original high-dimensional data using lower-dimensional points while maintaining some geometric properties of the neighborhoods in the original data set. LLE consists of two major steps. The first step is for “neighbor-preserving,” where each input data point Xi is reconstructed as a weighted sum of its K nearest neighboring data points, and the optimal weights are found by minimizing the average squared reconstruction error (i.e., the difference between a point and its reconstruction) under the constraint that the weights associated with each point sum up to one. The second step is for “dimension reduction,” by looking for vectors in a lower-dimensional space that minimize the representation error using the optimized weights from the first step. Note that in the first step, the weights are optimized with the data fixed, which can be solved as a least-squares problem; in the second step, the lower-dimensional points are optimized with the weights fixed, which can be solved via sparse eigenvalue decomposition.
The reconstruction weights obtained in the first step capture the “intrinsic geometric properties” of a neighborhood in the input data.*[13] It is assumed that the original data lie on a smooth lower-dimensional manifold, and the “intrinsic geometric properties” captured by the weights of the original data are expected to hold on the manifold as well. This is why the same weights are used in the second step of LLE. Compared with PCA, LLE is more powerful in exploiting the underlying structure of data.
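The two-step procedure is implemented, for example, in scikit-learn; the sketch below is illustrative only, and the data and parameter values are arbitrary placeholders.

# Minimal LLE sketch using scikit-learn's implementation.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(500, 20)                     # placeholder high-dimensional data
lle = LocallyLinearEmbedding(n_neighbors=10,    # K nearest neighbors per point
                             n_components=2)    # target (lower) dimensionality
Y = lle.fit_transform(X)                        # neighbor-preserving 2-D representations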
3.2.4 Independent component analysis

Independent component analysis (ICA) is a technique for learning a representation of data using a weighted sum of independent non-Gaussian components.*[14] The assumption of non-Gaussianity is imposed since the weights cannot be uniquely determined when all the components follow a Gaussian distribution.

3.2.5 Unsupervised dictionary learning

Different from supervised dictionary learning, unsupervised dictionary learning does not utilize the labels of the data and only exploits the structure underlying the data for optimizing the dictionary elements. An example of unsupervised dictionary learning is sparse coding, which aims to learn basis functions (dictionary elements) for data representation from unlabeled input data. Sparse coding can be applied to learn an overcomplete dictionary, where the number of dictionary elements is larger than the dimension of the input data.*[15] Aharon et al. proposed an algorithm known as K-SVD for learning, from unlabeled input data, a dictionary of elements that enables sparse representation of the data.*[16]

3.3 Multilayer/Deep architectures

The hierarchical architecture of the neural system inspires deep learning architectures for feature learning by stacking multiple layers of simple learning blocks.*[17] These architectures are often designed based on the assumption of distributed representation: observed data is generated by the interactions of many different factors on multiple levels. In a deep learning architecture, the output of each intermediate layer can be viewed as a representation of the original input data. Each level uses the representation produced by the previous level as input, and produces new representations as output, which are then fed to higher levels. The input of the bottom layer is the raw data, and the output of the final layer is the final low-dimensional feature or representation.

3.3.1 Restricted Boltzmann machine

A restricted Boltzmann machine (RBM) consists of a group of hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes. It is a special case of the more general Boltzmann machine with the constraint of no intra-node connections. Each edge in an RBM is associated with a weight. The weights together with the connections define an energy function, based on which a joint distribution of visible and hidden nodes can be devised. Based on the topology of the RBM, the hidden (visible) variables are independent conditioned on the visible (hidden) variables. Such conditional independence facilitates computations on the RBM.

An RBM can be viewed as a single-layer architecture for unsupervised feature learning. In particular, the visible variables correspond to input data, and the hidden variables correspond to feature detectors. The weights can be trained by maximizing the probability of the visible variables using the contrastive divergence (CD) algorithm by Geoffrey Hinton.*[18]

In general, training an RBM by solving this maximization problem tends to result in non-sparse representations. The sparse RBM,*[19] a modification of the RBM, was proposed to enable sparse representations. The idea is to add a regularization term to the objective function of the data likelihood, which penalizes the deviation of the expected hidden variables from a small constant p.
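The contrastive divergence update mentioned above can be sketched as follows for a binary RBM (CD-1, NumPy only). The learning rate, sampling scheme, and batch handling are illustrative assumptions, not details given in the text.

# One CD-1 update for a binary RBM: a minimal sketch, not a tuned implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng(0)):
    """v0: (batch, n_visible) binary data; W: (n_visible, n_hidden) weights."""
    # Positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample hidden states
    # Negative phase: one step of Gibbs sampling (reconstruction)
    pv1 = sigmoid(h0 @ W.T + b_vis)
    ph1 = sigmoid(pv1 @ W + b_hid)
    # Contrastive divergence gradient estimates
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid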
3.3.2 Autoencoder

An autoencoder, consisting of an encoder and a decoder, is a paradigm for deep learning architectures. An example is provided by Hinton and Salakhutdinov,*[18] where the encoder uses raw data (e.g., an image) as input and produces a feature or representation as output, and the decoder uses the extracted feature from the encoder as input and reconstructs the original input raw data as output. The encoder and decoder are constructed by stacking multiple layers of RBMs. The parameters involved in the architecture are trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, they are fed to upper layers as visible variables for training the corresponding RBM. The process is repeated until some stopping criterion is satisfied.

3.4 See also

• Basis function

• Deep learning
3.5 References

[2] Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola (2004). Maximum-Margin Matrix Factorization. NIPS.

[3] Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An analysis of single-layer networks in unsupervised feature learning (PDF). Int'l Conf. on AI and Statistics (AISTATS).

[4] Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric (2004). Visual categorization with bags of keypoints (PDF). ECCV Workshop on Statistical Learning in Computer Vision.

[17] Bengio, Yoshua (2009). “Learning Deep Architectures for AI”. Foundations and Trends® in Machine Learning 2 (1): 1–127. doi:10.1561/2200000006.

[18] Hinton, G. E.; Salakhutdinov, R. R. (2006). “Reducing the Dimensionality of Data with Neural Networks” (PDF). Science 313 (5786): 504–507. doi:10.1126/science.1127647. PMID 16873662.

[19] Lee, Honglak; Ekanadham, Chaitanya; Andrew, Ng (2008). “Sparse deep belief net model for visual area V2”. Advances in neural information processing systems.
Unsupervised learning
Generative model
Chapter 6
Neural coding
Neural coding is a neuroscience-related field concerned with characterizing the relationship between the stimulus and the individual or ensemble neuronal responses, and the relationship among the electrical activity of the neurons in the ensemble.*[1] Based on the theory that sensory and other information is represented in the brain by networks of neurons, it is thought that neurons can encode both digital and analog information.*[2]

… recalled in the hippocampus, a brain region known to be central for memory formation.*[5]*[6]*[7] Neuroscientists have initiated several large-scale brain decoding projects.*[8]*[9]
6.2 Encoding and decoding
6.3 Coding schemes

6.3.1 Rate coding

The rate coding model of neuronal firing communication states that as the intensity of a stimulus increases, the frequency or rate of action potentials, or “spike firing”, increases. Rate coding is sometimes called frequency coding.

Rate coding is a traditional coding scheme, assuming that most, if not all, information about the stimulus is contained in the firing rate of the neuron. Because the sequence of action potentials generated by a given stimulus varies from trial to trial, neuronal responses are typically treated statistically or probabilistically. They may be characterized by firing rates, rather than as specific spike sequences. In most sensory systems, the firing rate increases, generally non-linearly, with increasing stimulus intensity.*[17] Any information possibly encoded in the temporal structure of the spike train is ignored. Consequently, rate coding is inefficient but highly robust with respect to the ISI 'noise'.*[4]

During rate coding, precisely calculating the firing rate is very important. In fact, the term “firing rate” has a few different definitions, which refer to different averaging procedures, such as an average over time or an average over several repetitions of the experiment.

In rate coding, learning is based on activity-dependent synaptic weight modifications.

Rate coding was originally shown by ED Adrian and Y Zotterman in 1926.*[18] In this simple experiment different weights were hung from a muscle. As the weight of the stimulus increased, the number of spikes recorded from sensory nerves innervating the muscle also increased. From these original experiments, Adrian and Zotterman concluded that action potentials were unitary events, and that the frequency of events, and not individual event magnitude, was the basis for most inter-neuronal communication.

In the following decades, measurement of firing rates became a standard tool for describing the properties of all types of sensory or cortical neurons, partly due to the relative ease of measuring rates experimentally. However, this approach neglects all the information possibly contained in the exact timing of the spikes. During recent years, more and more experimental evidence has suggested that a straightforward firing rate concept based on temporal averaging may be too simplistic to describe brain activity.*[4]

Spike-count rate

The spike-count rate, also referred to as the temporal average, is obtained by counting the number of spikes that appear during a trial and dividing by the duration of the trial. The length T of the time window is set by the experimenter and depends on the type of neuron recorded from and the stimulus. In practice, to get sensible averages, several spikes should occur within the time window. Typical values are T = 100 ms or T = 500 ms, but the duration may also be longer or shorter.*[19]

The spike-count rate can be determined from a single trial, but at the expense of losing all temporal resolution about variations in neural response during the course of the trial. Temporal averaging can work well in cases where the stimulus is constant or slowly varying and does not require a fast reaction of the organism, and this is the situation usually encountered in experimental protocols. Real-world input, however, is hardly stationary, but often changing on a fast time scale. For example, even when viewing a static image, humans perform saccades, rapid changes of the direction of gaze. The image projected onto the retinal photoreceptors therefore changes every few hundred milliseconds.*[19]

Despite its shortcomings, the concept of a spike-count rate code is widely used not only in experiments, but also in models of neural networks. It has led to the idea that a neuron transforms information about a single input variable (the stimulus strength) into a single continuous output variable (the firing rate).

Time-dependent firing rate

The time-dependent firing rate is defined as the average number of spikes (averaged over trials) appearing during a short interval between times t and t+Δt, divided by the duration of the interval. It works for stationary as well as for time-dependent stimuli. To experimentally measure the time-dependent firing rate, the experimenter records from a neuron while stimulating with some input sequence. The same stimulation sequence is repeated several times and the neuronal response is reported in a Peri-Stimulus-Time Histogram (PSTH). The time t is measured with respect to the start of the stimulation sequence. The Δt must be large enough (typically in the range of one or a few milliseconds) so that there is a sufficient number of spikes within the interval to obtain a reliable estimate of the average. The number of occurrences of spikes nK(t; t+Δt), summed over all repetitions of the experiment and divided by the number K of repetitions, is a measure of the typical activity of the neuron between time t and t+Δt. A further division by the interval length Δt yields the time-dependent firing rate r(t) of the neuron, which is equivalent to the spike density of the PSTH.
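Written out, the quantity described in words above is (the symbols nK, K, t and Δt are those used in the text):

r(t) = \frac{n_K(t;\, t+\Delta t)}{K\,\Delta t}.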
For sufficiently small Δt, r(t)Δt is the average number of spikes occurring between times t and t+Δt over multiple trials. If Δt is small, there will never be more than one spike within the interval between t and t+Δt on any given trial. This means that r(t)Δt is also the fraction of trials on which a spike occurred between those times. Equivalently, r(t)Δt is the probability that a spike occurs during this time interval.
As an experimental procedure, the time-dependent firing rate measure is a useful method to evaluate neuronal activity, in particular in the case of time-dependent stimuli. The obvious problem with this approach is that it cannot be the coding scheme used by neurons in the brain. Neurons cannot wait for the stimuli to be repeatedly presented in exactly the same manner before generating a response.

Nevertheless, the experimental time-dependent firing rate measure can make sense if there are large populations of independent neurons that receive the same stimulus. Instead of recording from a population of N neurons in a single run, it is experimentally easier to record from a single neuron and average over N repeated runs. Thus, time-dependent firing rate coding relies on the implicit assumption that there are always populations of neurons.

6.3.2 Temporal coding

When precise spike timing or high-frequency firing-rate fluctuations are found to carry information, the neural code is often identified as a temporal code.*[20] A number of studies have found that the temporal resolution of the neural code is on a millisecond time scale, indicating that precise spike timing is a significant element in neural coding.*[2]*[21]

Neurons exhibit high-frequency fluctuations of firing rates which could be noise or could carry information. Rate coding models suggest that these irregularities are noise, while temporal coding models suggest that they encode information. If the nervous system only used rate codes to convey information, a more consistent, regular firing rate would have been evolutionarily advantageous, and neurons would have utilized this code over other less robust options.*[22] Temporal coding supplies an alternate explanation for the “noise,” suggesting that it actually encodes information and affects neural processing. To model this idea, binary symbols can be used to mark the spikes: 1 for a spike, 0 for no spike. Temporal coding allows the sequence 000111000111 to mean something different from 001100110011, even though the mean firing rate is the same for both sequences, at 6 spikes/10 ms.*[23] Until recently, scientists had put the most emphasis on rate encoding as an explanation for post-synaptic potential patterns. However, functions of the brain are more temporally precise than the use of only rate encoding seems to allow. In other words, essential information could be lost due to the inability of the rate code to capture all the available information of the spike train. In addition, responses are different enough between similar (but not identical) stimuli to suggest that the distinct patterns of spikes contain a higher volume of information than is possible to include in a rate code.*[24]

Temporal codes employ those features of the spiking activity that cannot be described by the firing rate. For example, time to first spike after the stimulus onset, characteristics based on the second and higher statistical moments of the ISI probability distribution, spike randomness, or precisely timed groups of spikes (temporal patterns) are candidates for temporal codes.*[25] As there is no absolute time reference in the nervous system, the information is carried either in terms of the relative timing of spikes in a population of neurons or with respect to an ongoing brain oscillation.*[2]*[4]

The temporal structure of a spike train or firing rate evoked by a stimulus is determined both by the dynamics of the stimulus and by the nature of the neural encoding process. Stimuli that change rapidly tend to generate precisely timed spikes and rapidly changing firing rates no matter what neural coding strategy is being used. Temporal coding refers to temporal precision in the response that does not arise solely from the dynamics of the stimulus, but that nevertheless relates to properties of the stimulus. The interplay between stimulus and encoding dynamics makes the identification of a temporal code difficult.

In temporal coding, learning can be explained by activity-dependent synaptic delay modifications.*[26] The modifications can themselves depend not only on spike rates (rate coding) but also on spike timing patterns (temporal coding), i.e., they can be a special case of spike-timing-dependent plasticity.

The issue of temporal coding is distinct and independent from the issue of independent-spike coding. If each spike is independent of all the other spikes in the train, the temporal character of the neural code is determined by the behavior of the time-dependent firing rate r(t). If r(t) varies slowly with time, the code is typically called a rate code, and if it varies rapidly, the code is called temporal.

Temporal coding in sensory systems

For very brief stimuli, a neuron's maximum firing rate may not be fast enough to produce more than a single spike. Due to the density of information about the abbreviated stimulus contained in this single spike, it would seem that the timing of the spike itself would have to convey more information than simply the average frequency of action potentials over a given period of time. This model is especially important for sound localization, which occurs within the brain on the order of milliseconds. The brain must obtain a large quantity of information based on a relatively short neural response. Additionally, if low firing rates on the order of ten spikes per second must be distinguished from arbitrarily close rate coding for different stimuli, then a neuron trying to discriminate these two stimuli may need to wait for a second or more to accumulate enough information. This is not consistent with numerous organisms which are able to discriminate between stimuli in the time frame of milliseconds, suggesting that a rate code is not the only model at work.*[23]
To account for the fast encoding of visual stimuli, it has been suggested that neurons of the retina encode visual information in the latency time between stimulus onset and first action potential, also called latency to first spike.*[27] This type of temporal coding has been shown also in the auditory and somatosensory system. The main drawback of such a coding scheme is its sensitivity to intrinsic neuronal fluctuations.*[28] In the primary visual cortex of macaques, the timing of the first spike relative to the start of the stimulus was found to provide more information than the interval between spikes. However, the interspike interval could be used to encode additional information, which is especially important when the spike rate reaches its limit, as in high-contrast situations. For this reason, temporal coding may play a part in coding defined edges rather than gradual transitions.*[29]

The mammalian gustatory system is useful for studying temporal coding because of its fairly distinct stimuli and the easily discernible responses of the organism.*[30] Temporally encoded information may help an organism discriminate between different tastants of the same category (sweet, bitter, sour, salty, umami) that elicit very similar responses in terms of spike count. The temporal component of the pattern elicited by each tastant may be used to determine its identity (e.g., the difference between two bitter tastants, such as quinine and denatonium). In this way, both rate coding and temporal coding may be used in the gustatory system – rate for basic tastant type, temporal for more specific differentiation.*[31] Research on the mammalian gustatory system has shown that there is an abundance of information present in temporal patterns across populations of neurons, and this information is different from that which is determined by rate coding schemes. Groups of neurons may synchronize in response to a stimulus. In studies dealing with the front cortical portion of the brain in primates, precise patterns with short time scales only a few milliseconds in length were found across small populations of neurons which correlated with certain information processing behaviors. However, little information could be determined from the patterns; one possible theory is that they represented the higher-order processing taking place in the brain.*[24]

As with the visual system, in mitral/tufted cells in the olfactory bulb of mice, first-spike latency relative to the start of a sniffing action seemed to encode much of the information about an odor. This strategy of using spike latency allows for rapid identification of and reaction to an odorant. In addition, some mitral/tufted cells have specific firing patterns for given odorants. This type of extra information could help in recognizing a certain odor, but it is not completely necessary, as average spike count over the course of the animal's sniffing was also a good identifier.*[32] Along the same lines, experiments done with the olfactory system of rabbits showed distinct patterns which correlated with different subsets of odorants, and a similar result was obtained in experiments with the locust olfactory system.*[23]

Temporal coding applications

The specificity of temporal coding requires highly refined technology to measure informative, reliable, experimental data. Advances made in optogenetics allow neurologists to control spikes in individual neurons, offering electrical and spatial single-cell resolution. For example, blue light causes the light-gated ion channel channelrhodopsin to open, depolarizing the cell and producing a spike. When blue light is not sensed by the cell, the channel closes, and the neuron ceases to spike. The pattern of the spikes matches the pattern of the blue light stimuli. By inserting channelrhodopsin gene sequences into mouse DNA, researchers can control spikes and therefore certain behaviors of the mouse (e.g., making the mouse turn left).*[33] Researchers, through optogenetics, have the tools to effect different temporal codes in a neuron while maintaining the same mean firing rate, and can thereby test whether or not temporal coding occurs in specific neural circuits.*[34]

Optogenetic technology also has the potential to enable the correction of spike abnormalities at the root of several neurological and psychological disorders.*[34] If neurons do encode information in individual spike timing patterns, key signals could be missed by attempting to crack the code while looking only at mean firing rates.*[23] Understanding any temporally encoded aspects of the neural code and replicating these sequences in neurons could allow for greater control and treatment of neurological disorders such as depression, schizophrenia, and Parkinson's disease. Regulation of spike intervals in single cells more precisely controls brain activity than the intravenous addition of pharmacological agents.*[33]

Phase-of-firing code

The phase-of-firing code is a neural coding scheme that combines the spike count code with a time reference based on oscillations. This type of code takes into account a time label for each spike according to a time reference based on the phase of local ongoing oscillations at low*[35] or high frequencies.*[36] A feature of this code is that neurons adhere to a preferred order of spiking, resulting in a firing sequence.*[37]

It has been shown that neurons in some cortical sensory areas encode rich naturalistic stimuli in terms of their spike times relative to the phase of ongoing network fluctuations, rather than only in terms of their spike count.*[35]*[38] Oscillations reflect local field potential signals. This is often categorized as a temporal code, although the time label used for spikes is coarse-grained. That is, four discrete values for phase are enough to represent all the information content in this kind of code with respect to the phase of oscillations in low frequencies. The phase-of-firing code is loosely based on the phase precession phenomena observed in place cells of the hippocampus.
Phase code has been shown in visual cortex to also involve high-frequency oscillations.*[37] Within a cycle of gamma oscillation, each neuron has its own preferred relative firing time. As a result, an entire population of neurons generates a firing sequence that has a duration of up to about 15 ms.*[37]

6.3.3 Population coding

Population coding is a method to represent stimuli by using the joint activities of a number of neurons. In population coding, each neuron has a distribution of responses over some set of inputs, and the responses of many neurons may be combined to determine some value about the inputs.

From the theoretical point of view, population coding is one of a few mathematically well-formulated problems in neuroscience. It grasps the essential features of neural coding and yet is simple enough for theoretical analysis.*[39] Experimental studies have revealed that this coding paradigm is widely used in the sensory and motor areas of the brain. For example, in the visual medial temporal area (MT), neurons are tuned to the direction of motion.*[40] In response to an object moving in a particular direction, many neurons in MT fire, with a noise-corrupted and bell-shaped activity pattern across the population. The moving direction of the object is retrieved from the population activity, making it immune to the fluctuation existing in a single neuron's signal. In one classic example in the primary motor cortex, Apostolos Georgopoulos and colleagues trained monkeys to move a joystick towards a lit target.*[41]*[42] They found that a single neuron would fire for multiple target directions. However, it would fire fastest for one direction and more slowly depending on how close the target was to the neuron's 'preferred' direction.

Kenneth Johnson originally derived that if each neuron represents movement in its preferred direction, and the vector sum of all neurons is calculated (each neuron has a firing rate and a preferred direction), the sum points in the direction of motion. In this manner, the population of neurons codes the signal for the motion. This particular population code is referred to as population vector coding. This particular study divided the field of motor physiologists between Evarts' “upper motor neuron” group, which followed the hypothesis that motor cortex neurons contributed to control of single muscles, and the Georgopoulos group studying the representation of movement directions in cortex.
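The population vector idea described above is easy to state computationally. The following toy sketch (NumPy, synthetic cosine-tuned rates) is illustrative only; the tuning model and numbers are assumptions, not data from the studies cited.

# Population vector decoding: rate-weighted sum of preferred-direction vectors.
import numpy as np

def population_vector(preferred_dirs, rates):
    """preferred_dirs: (n, 2) unit vectors; rates: (n,) firing rates.
    Returns the unit vector of the rate-weighted vector sum."""
    v = (rates[:, None] * preferred_dirs).sum(axis=0)
    return v / np.linalg.norm(v)

# Toy example: 8 neurons with evenly spaced preferred directions and cosine tuning plus noise.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
true_dir = np.array([np.cos(0.7), np.sin(0.7)])
rates = np.clip(dirs @ true_dir, 0, None) * 50 + np.random.poisson(2, 8)
print(population_vector(dirs, rates))   # points roughly along true_dir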
Population coding has a number of advantages, including reduction of uncertainty due to neuronal variability and the ability to represent a number of different stimulus attributes simultaneously. Population coding is also much faster than rate coding and can reflect changes in the stimulus conditions nearly instantaneously.*[43] Individual neurons in such a population typically have different but overlapping selectivities, so that many neurons, but not necessarily all, respond to a given stimulus.

Typically an encoding function has a peak value such that activity of the neuron is greatest if the perceptual value is close to the peak value, and becomes reduced accordingly for values less close to the peak value.

It follows that the actual perceived value can be reconstructed from the overall pattern of activity in the set of neurons. The Johnson/Georgopoulos vector coding is an example of simple averaging. A more sophisticated mathematical technique for performing such a reconstruction is the method of maximum likelihood based on a multivariate distribution of the neuronal responses. These models can assume independence, second-order correlations,*[44] or even more detailed dependencies such as higher-order maximum entropy models*[45] or copulas.*[46]

Correlation coding

The correlation coding model of neuronal firing claims that correlations between action potentials, or “spikes”, within a spike train may carry additional information above and beyond the simple timing of the spikes. Early work suggested that correlation between spike trains can only reduce, and never increase, the total mutual information present in the two spike trains about a stimulus feature.*[47] However, this was later demonstrated to be incorrect. Correlation structure can increase information content if noise and signal correlations are of opposite sign.*[48] Correlations can also carry information not present in the average firing rate of two pairs of neurons. A good example of this exists in the pentobarbital-anesthetized marmoset auditory cortex, in which a pure tone causes an increase in the number of correlated spikes, but not an increase in the mean firing rate, of pairs of neurons.*[49]

Independent-spike coding

The independent-spike coding model of neuronal firing claims that each individual action potential, or “spike”, is independent of each other spike within the spike train.*[50]*[51]

Position coding

A typical population code involves neurons with a Gaussian tuning curve whose means vary linearly with the stimulus intensity, meaning that the neuron responds most strongly (in terms of spikes per second) to a stimulus near the mean. The actual intensity could be recovered as the stimulus level corresponding to the mean of the neuron with the greatest response. However, the noise inherent in neural responses means that a maximum likelihood estimation function is more accurate.
6.3.4 Sparse coding

… than the dimensionality of the input set, the coding is overcomplete. Overcomplete codings smoothly interpolate between input vectors and are robust under input noise.*[57] The human primary visual cortex is estimated to be overcomplete by a factor of 500, so that, for example, a 14 x 14 patch of input (a 196-dimensional space) is coded by roughly 100,000 neurons.*[55]

6.4 See also

• Models of neural computation

• Neural correlate

• Cognitive map

• Neural decoding

• Deep learning

• Autoencoder

• Vector quantization

• Binding problem

• Artificial neural network

• Grandmother cell

• Feature integration theory

• Pooling

• Sparse distributed memory

6.5 References

[1] Brown EN, Kass RE, Mitra PP (May 2004). “Multiple neural spike train data analysis: state-of-the-art and future challenges”. Nat. Neurosci. 7 (5): 456–61. doi:10.1038/nn1228. PMID 15114358.

[2] Thorpe, S.J. (1990). “Spike arrival times: A highly efficient coding scheme for neural networks” (PDF). In Eckmiller, R.; Hartmann, G.; Hauske, G. Parallel processing in neural systems and computers. North-Holland. pp. 91–94. ISBN 978-0-444-88390-2.

[3] Gerstner, Wulfram; Kistler, Werner M. (2002). Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press. ISBN 978-0-521-89079-3.

[4] Stein RB, Gossen ER, Jones KE (May 2005). “Neuronal variability: noise or part of the signal?". Nat. Rev. Neurosci. 6 (5): 389–97. doi:10.1038/nrn1668. PMID 15861181.

[5] The Memory Code. http://www.scientificamerican.com/article/the-memory-code/

[6] Chen, G; Wang, LP; Tsien, JZ (2009). “Neural population-level memory traces in the mouse hippocampus”. PLoS One. 4 (12): e8256. doi:10.1371/journal.pone.0008256. PMID 20016843.

[7] Zhang, H; Chen, G; Kuang, H; Tsien, JZ (Nov 2013). “Mapping and deciphering neural codes of NMDA receptor-dependent fear memory engrams in the hippocampus”. PLoS One. 8 (11): e79454. doi:10.1371/journal.pone.0079454. PMID 24302990.

[8] Brain Decoding Project. http://braindecodingproject.org/

[9] The Simons Collaboration on the Global Brain. http://www.simonsfoundation.org/life-sciences/simons-collaboration-on-the-global-brain/

[10] Burcas G.T & Albright T.D. Gauging sensory representations in the brain. http://www.vcl.salk.edu/Publications/PDF/Buracas_Albright_1999_TINS.pdf

[11] Gerstner W, Kreiter AK, Markram H, Herz AV (November 1997). “Neural codes: firing rates and beyond”. Proc. Natl. Acad. Sci. U.S.A. 94 (24): 12740–1. Bibcode:1997PNAS...9412740G. doi:10.1073/pnas.94.24.12740. PMC 34168. PMID 9398065.

[12] Aur D., Jog, MS. (2010). Neuroelectrodynamics: Understanding the brain language. IOS Press. doi:10.3233/978-1-60750-473-3-i

[13] Aur, D.; Connolly, C.I.; Jog, M.S. (2005). “Computing spike directivity with tetrodes”. J. Neurosci 149 (1): 57–63. doi:10.1016/j.jneumeth.2005.05.006.

[14] Aur, D.; Jog, M.S. (2007). “Reading the Neural Code: What do Spikes Mean for Behavior?". Nature Precedings. doi:10.1038/npre.2007.61.1.

[15] Fraser, A.; Frey, A. H. (1968). “Electromagnetic emission at micron wavelengths from active nerves”. Biophysical Journal 8 (6): 731–734. doi:10.1016/s0006-3495(68)86517-8.

[16] Aur, D (2012). “A comparative analysis of integrating visual information in local neuronal ensembles”. Journal of Neuroscience Methods 207 (1): 23–30. doi:10.1016/j.jneumeth.2012.03.008. PMID 22480985.

[17] Kandel, E.; Schwartz, J.; Jessel, T.M. (1991). Principles of Neural Science (3rd ed.). Elsevier. ISBN 0444015620.

[18] Adrian ED & Zotterman Y. (1926). “The impulses produced by sensory nerve endings: Part II: The response of a single end organ”. J Physiol (Lond.) 61: 151–171.

[19] http://icwww.epfl.ch/~gerstner/SPNM/node7.html

[20] Dayan, Peter; Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Massachusetts Institute of Technology Press. ISBN 978-0-262-04199-7.
[21] Butts DA, Weng C, Jin J et al. (September 2007). “Temporal precision in the neural code and the timescales of natural vision”. Nature 449 (7158): 92–5. Bibcode:2007Natur.449...92B. doi:10.1038/nature06105. PMID 17805296.

[22] J. Leo van Hemmen, TJ Sejnowski. 23 Problems in Systems Neuroscience. Oxford Univ. Press, 2006. pp. 143–158.

[23] Theunissen, F; Miller, JP (1995). “Temporal Encoding in Nervous Systems: A Rigorous Definition”. Journal of Computational Neuroscience 2: 149–162. doi:10.1007/bf00961885.

[24] Zador, Anthony; Stevens, Charles (1995). “The enigma of the brain”. Current Biology 5 (12). Retrieved 4/08/12.

[25] Kostal L, Lansky P, Rospars JP (November 2007). “Neuronal coding and spiking randomness”. Eur. J. Neurosci. 26 (10): 2693–701. doi:10.1111/j.1460-9568.2007.05880.x. PMID 18001270.

[26] Geoffrois, E.; Edeline, J.M.; Vibert, J.F. (1994). “Learning by Delay Modifications”. In Eeckman, Frank H. Computation in Neurons and Neural Systems. Springer. pp. 133–8. ISBN 978-0-7923-9465-5.

[27] Gollisch, T.; Meister, M. (22 February 2008). “Rapid Neural Coding in the Retina with Relative Spike Latencies”. Science 319 (5866): 1108–1111. doi:10.1126/science.1149639.

[28] Wainrib, Gilles; Michèle, Thieullen; Khashayar, Pakdaman (7 April 2010). “Intrinsic variability of latency to first-spike”. Biological Cybernetics 103 (1): 43–56. doi:10.1007/s00422-010-0384-8.

[29] Victor, Johnathan D (2005). “Spike train metrics”. Current Opinion in Neurobiology 15 (5): 585–592. doi:10.1016/j.conb.2005.08.002.

[30] Hallock, Robert M.; Di Lorenzo, Patricia M. (2006). “Temporal coding in the gustatory system”. Neuroscience & Biobehavioral Reviews 30 (8): 1145–1160. doi:10.1016/j.neubiorev.2006.07.005.

[31] Carleton, Alan; Accolla, Riccardo; Simon, Sidney A. (2010). “Coding in the mammalian gustatory system”. Trends in Neurosciences 33 (7): 326–334. doi:10.1016/j.tins.2010.04.002.

[32] Wilson, Rachel I (2008). “Neural and behavioral mechanisms of olfactory perception”. Current Opinion in Neurobiology 18 (4): 408–412. doi:10.1016/j.conb.2008.08.015.

[33] Karl Diesseroth, Lecture. “Personal Growth Series: Karl Diesseroth on Cracking the Neural Code.” Google Tech Talks. November 21, 2008. http://www.youtube.com/watch?v=5SLdSbp6VjM

[34] Han X, Qian X, Stern P, Chuong AS, Boyden ES. “Informational lesions: optical perturbations of spike timing and neural synchrony via microbial opsin gene fusions.” Cambridge, MA: MIT Media Lab, 2009.

[35] Montemurro MA, Rasch MJ, Murayama Y, Logothetis NK, Panzeri S (March 2008). “Phase-of-firing coding of natural visual stimuli in primary visual cortex”. Curr. Biol. 18 (5): 375–80. doi:10.1016/j.cub.2008.02.023. PMID 18328702.

[36] Fries P, Nikolić D, Singer W (July 2007). “The gamma cycle”. Trends Neurosci. 30 (7): 309–16. doi:10.1016/j.tins.2007.05.005. PMID 17555828.

[37] Havenith MN, Yu S, Biederlack J, Chen NH, Singer W, Nikolić D (June 2011). “Synchrony makes neurons fire in sequence, and stimulus properties determine who is ahead”. J. Neurosci. 31 (23): 8570–84. doi:10.1523/JNEUROSCI.2817-10.2011. PMID 21653861.

[38] Thorpe, S.J. (1990). Spike arrival times: A highly efficient coding scheme for neural networks. In Parallel processing in neural systems.

[39] Wu S, Amari S, Nakahara H (May 2002). “Population coding and decoding in a neural field: a computational study”. Neural Comput 14 (5): 999–1026. doi:10.1162/089976602753633367. PMID 11972905.

[40] Maunsell JH, Van Essen DC (May 1983). “Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation”. J. Neurophysiol. 49 (5): 1127–47. PMID 6864242.

[41] Intro to Sensory Motor Systems, Ch. 38, page 766.

[42] Science. 1986 Sep 26; 233 (4771): 1416–9.

[43] Hubel DH, Wiesel TN (October 1959). “Receptive fields of single neurones in the cat's striate cortex”. J. Physiol. (Lond.) 148 (3): 574–91. PMC 1363130. PMID 14403679.

[44] Schneidman, E; Berry, MJ; Segev, R; Bialek, W (2006). Weak Pairwise Correlations Imply Strongly Correlated Network States in a Neural Population. Nature 440: 1007–1012. doi:10.1038/nature04701

[45] Amari, SL (2001). Information Geometry on Hierarchy of Probability Distributions. IEEE Transactions on Information Theory 47: 1701–1711. CiteSeerX: 10.1.1.46.5226

[46] Onken, A; Grünewälder, S; Munk, MHJ; Obermayer, K (2009). Analyzing Short-Term Noise Dependencies of Spike-Counts in Macaque Prefrontal Cortex Using Copulas and the Flashlight Transformation. PLoS Comput Biol 5 (11): e1000577. doi:10.1371/journal.pcbi.1000577

[47] Johnson, KO (Jun 1980). J Neurophysiol 43 (6): 1793–815.

[48] Panzeri; Schultz; Treves; Rolls (1999). Proc Biol Sci. 266 (1423): 1001–12.

[49] Nature 381 (6583): 610–3. Jun 1996. doi:10.1038/381610a0.
[50] Dayan P & Abbott LF. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, Massachusetts: The MIT Press; 2001. ISBN 0-262-04199-5

[51] Rieke F, Warland D, de Ruyter van Steveninck R, Bialek W. Spikes: Exploring the Neural Code. Cambridge, Massachusetts: The MIT Press; 1999. ISBN 0-262-68108-0

6.6 Further reading

• Tsien, JZ. et al. (2014). “On initial Brain Activity Mapping of episodic and semantic memory code in the hippocampus”. Neurobiology of Learning and Memory 105: 200–210. doi:10.1016/j.nlm.2013.06.019.
Chapter 7
Word embedding
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words from the vocabulary (and possibly phrases thereof) are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size ("continuous space").

There are several methods for generating this mapping. They include neural networks,*[1] dimensionality reduction on the word co-occurrence matrix,*[2] and explicit representation in terms of the contexts in which words appear.*[3]

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks such as syntactic parsing*[4] and sentiment analysis.*[5]
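As an illustration of the second family of methods above (dimensionality reduction of the word co-occurrence matrix), the following sketch builds a small co-occurrence matrix from a toy corpus and factors it with a truncated SVD; the corpus, window size and embedding dimension are arbitrary choices for the example, not part of any of the cited methods.

import numpy as np

# Toy corpus; in practice this would be a large collection of documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[index[w], index[sent[j]]] += 1.0

# Truncated SVD: keep the top-k left singular vectors, scaled by the
# singular values, as k-dimensional word vectors.
k = 3
U, S, _ = np.linalg.svd(C)
embeddings = U[:, :k] * S[:k]

for w in ("cat", "dog"):
    print(w, np.round(embeddings[index[w]], 3))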
7.2 References

[1] Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Distributed Representations of Words and Phrases and their Compositionality". arXiv:1310.4546 [cs.CL].

[5] "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" (PDF). Conference on Empirical Methods in Natural Language Processing.
Chapter 8
Deep belief network
In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a type of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer.*[1]

The observation, due to Geoffrey Hinton's student Yee-Whye Teh,*[2] that DBNs can be trained greedily, one layer at a time, has been called a breakthrough in deep learning.*[4]*:6 A DBN can be viewed as a composition of simple networks such as restricted Boltzmann machines, in which each sub-network's hidden layer serves as the visible layer for the next. This composition leads to a fast, layer-by-layer unsupervised training procedure, in which contrastive divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer being a training set).

8.1 Training algorithm

The training algorithm for DBNs proceeds as follows.*[2] Let X be a matrix of inputs, regarded as a set of feature vectors.
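A minimal sketch of the greedy, layer-by-layer procedure: each layer is trained as an RBM on the activations produced by the layer below. The train_rbm helper here is only a placeholder returning random weights so that the stacking loop is runnable on its own; a real implementation would fit each RBM with contrastive divergence (see the Restricted Boltzmann machine chapter).

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    # Placeholder for a real contrastive-divergence RBM trainer; it only
    # returns randomly initialised weights and biases so the loop below runs.
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)
    return W, b

def train_dbn(X, layer_sizes):
    # X is the matrix of inputs (one feature vector per row) from the text.
    layers, data = [], X
    for n_hidden in layer_sizes:
        W, b = train_rbm(data, n_hidden)   # train the next sub-network...
        layers.append((W, b))
        data = sigmoid(data @ W + b)       # ...then treat its hidden
                                           # activations as the new visible data
    return layers

X = rng.random((100, 20))
dbn = train_dbn(X, layer_sizes=[64, 32, 16])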
Chapter 9
Convolutional neural network
For other uses, see CNN (disambiguation).

In machine learning, a convolutional neural network (or CNN) is a type of feed-forward artificial neural network in which the individual neurons are tiled in such a way that they respond to overlapping regions in the visual field.*[1] Convolutional networks were inspired by biological processes*[2] and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing.*[3] They are widely used models for image and video recognition.

9.1 Overview

When used for image recognition, convolutional neural networks (CNNs) consist of multiple layers of small neuron collections which look at small portions of the input image, called receptive fields. The results of these collections are then tiled so that they overlap, to obtain a better representation of the original image; this is repeated for every such layer. Because of this, they are able to tolerate translation of the input image.*[4] Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters.*[5]*[6] They also consist of various combinations of convolutional layers and fully connected layers, with pointwise nonlinearity applied at the end of or after each layer.*[7] If all layers were fully connected, the network would have billions of parameters, so a convolution operation over small regions of the input is used instead. One major advantage of convolutional networks is the use of shared weights in convolutional layers: the same filter (weight bank) is used at every position in the layer, which both reduces the required memory and improves performance.*[3]

Some time delay neural networks also use an architecture very similar to convolutional neural networks, especially those for image recognition or classification tasks, since the "tiling" of neuron outputs can easily be carried out in timed stages in a manner useful for the analysis of images.*[8]

Compared to other image classification algorithms, convolutional neural networks use relatively little pre-processing. This means that the network is responsible for learning the filters that in traditional algorithms were hand-engineered. The lack of dependence on prior knowledge and on difficult-to-design hand-engineered features is a major advantage for CNNs.

9.2 History

The design of convolutional neural networks follows the discovery of visual mechanisms in living organisms. The visual cortex contains many cells responsible for detecting light in small, overlapping sub-regions of the visual field, called receptive fields. These cells act as local filters over the input space, and the more complex cells have larger receptive fields. A convolution operator performs the same function at every position of the input, mimicking this arrangement of cells.

Convolutional neural networks were introduced in a 1980 paper by Kunihiko Fukushima.*[7]*[9] In 1988 they were separately developed, with explicit parallel and trainable convolutions for temporal signals, by Toshiteru Homma, Les Atlas, and Robert J. Marks II.*[10] Their design was later improved in 1998 by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner,*[11] generalized in 2003 by Sven Behnke,*[12] and simplified by Patrice Simard, David Steinkraus, and John C. Platt in the same year.*[13] The famous LeNet-5 network classified digits successfully and was applied to reading the numbers on bank checks. For more complex problems, however, the breadth and depth of the network must keep increasing, which quickly becomes limited by computing resources, and the LeNet approach did not perform well on such problems.

With the rise of efficient GPU computing, it has become possible to train larger networks. In 2006 several publications described more efficient ways to train convolutional neural networks with more layers.*[14]*[15]*[16] In 2011, they were refined by Dan Ciresan et al. and implemented on a GPU with impressive performance results.*[5] In 2012, Dan Ciresan et al. significantly improved upon the best performance in the literature for multiple image databases, including MNIST.
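The receptive fields and weight sharing described in the overview can be illustrated with a plain 2-D convolution; the 3x3 kernel, the image size and the ReLU nonlinearity below are arbitrary example choices, not anything prescribed by the references above.

import numpy as np

def conv2d(image, kernel):
    # Slide one shared kernel over every position of the image ("valid"
    # padding): every output value is produced by the same weights, which is
    # the weight sharing described above.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))                 # a single-channel input image
kernel = 0.1 * rng.standard_normal((3, 3))   # one trainable 3x3 filter

feature_map = np.maximum(conv2d(image, kernel), 0.0)   # ReLU nonlinearity
print(feature_map.shape)   # (26, 26): one response per receptive field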
9.3 Details

9.3.1 Backpropagation

When doing backpropagation, the momentum and weight decay values are chosen to reduce oscillation during stochastic gradient descent. See Backpropagation for more.

9.3.2 Different types of layers

Convolutional layer

Unlike a hand-coded convolution kernel (Sobel, Prewitt, Roberts), in a convolutional neural net the parameters of each convolution kernel are trained by the backpropagation algorithm. There are many convolution kernels in each layer, and each kernel is replicated over the entire image with the same parameters. The function of the convolution operators is to extract different features of the input. The capacity of a neural net varies depending on the number of layers: the first convolution layers obtain low-level features such as edges, lines and corners, and the more layers the network has, the higher-level the features it obtains.

ReLU layer

ReLU is the abbreviation of Rectified Linear Units. This is a layer of neurons that use the non-saturating activation function f(x) = max(0, x). It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.

Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = tanh(x) or f(x) = |tanh(x)|, and the sigmoid function f(x) = (1 + e^(-x))^(-1). Compared to tanh units, the advantage of ReLU is that the neural network trains several times faster.*[18]

Pooling layer

In order to reduce variance, pooling layers compute the max or average value of a particular feature over a region of the image. This ensures that the same result is obtained even when image features undergo small translations, which is an important property for object classification and detection.

Dropout method

Since a fully connected layer occupies most of the parameters, it is prone to overfitting, and the dropout method*[19] was introduced to prevent it. That paper defines (the simplest form of) dropout relative to learning algorithms developed for restricted Boltzmann machines, such as contrastive divergence: the only difference is that the unit's activation probability (usually a sigmoid of the incoming weighted sum from other nodes) "is first sampled and only the hidden units that are retained are used for training", and "dropout can be seen as multiplying by a Bernoulli distribution random variable rb that takes the value 1/p with probability p and 0 otherwise". In other words, in its simplest form dropout actually samples whether a unit fires, rather than merely propagating its firing probability.

Dropout also significantly improves the speed of training, which makes model combination practical even for deep neural nets. Dropout is performed randomly: in the input layer the probability of retaining a neuron is between 0.5 and 1, while in the hidden layers a probability of 0.5 is used. The neurons that are dropped out do not contribute to the forward pass or to backpropagation, which is equivalent to temporarily decreasing the number of neurons. This creates neural networks with different architectures, but all of those networks share the same weights.

This is loosely analogous to sparse coding in biological neurons, where at any moment some neurons fire and some do not; the random thinning does not prevent learning even though the simulated layer shapes and dropout probabilities do not literally match the brain.

The biggest contribution of the dropout method is that, although it effectively generates 2^n thinned networks and as such allows for model combination, at test time only a single network needs to be evaluated. This is accomplished by testing with the un-thinned network while multiplying the output weights of each neuron by the probability of that neuron being retained (i.e. not dropped out).

Note, however, that the 2^n thinned networks are not trained independently: they all share one set of weights and are only sampled as often as training examples arrive, so far fewer than 2^n of them are ever visited during training.
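A sketch of the train/test behaviour described above: during training each unit is randomly retained or silenced, while at test time the un-thinned layer is used and activations are scaled by the retention probability. The layer size and retention probability are example values only.

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p_retain, train):
    if train:
        # Randomly "thin" the layer: each unit is kept with probability
        # p_retain and silenced otherwise, so it contributes nothing to the
        # forward pass (or, during learning, to backpropagation).
        mask = (rng.random(x.shape) < p_retain).astype(x.dtype)
        return x * mask
    # Test time: use the un-thinned layer but scale activations by the
    # retention probability, so expected values match the training regime.
    return x * p_retain

hidden = rng.random(100)                            # hidden-layer activations
train_out = dropout_forward(hidden, p_retain=0.5, train=True)
test_out = dropout_forward(hidden, p_retain=0.5, train=False)
print(train_out.mean(), test_out.mean())            # comparable in expectation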
Loss layer

Different loss functions can be used for different tasks. Softmax loss is used for predicting a single class out of K mutually exclusive classes. Sigmoid cross-entropy loss is used for predicting K independent probability values in [0, 1]. Euclidean loss is used for regressing to real-valued labels in (-inf, inf).

9.4 Applications

... images that included faces at various angles and orientations, and a further 20 million images without faces. They used batches of 128 images over 50,000 iterations.*[23]

9.4.2 Video analysis

Video is more complex than images, since it has an additional temporal dimension. A common approach is to fuse the features of two convolutional neural networks, one responsible for the spatial and one for the temporal stream.*[24]*[25]
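The three loss functions named in the Loss layer subsection above can be written out directly. The following sketch computes each for a single example; the scores, targets and labels are made up for illustration.

import numpy as np

def softmax_loss(scores, true_class):
    # Softmax (cross-entropy) loss for one of K mutually exclusive classes.
    shifted = scores - scores.max()                 # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[true_class]

def sigmoid_cross_entropy_loss(logits, targets):
    # K independent probabilities in [0, 1]; targets are 0/1 per output.
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -np.sum(targets * np.log(probs + eps) +
                   (1 - targets) * np.log(1 - probs + eps))

def euclidean_loss(predictions, labels):
    # Regression to real-valued labels.
    return 0.5 * np.sum((predictions - labels) ** 2)

scores = np.array([2.0, -1.0, 0.3])
print(softmax_loss(scores, true_class=0))
print(sigmoid_cross_entropy_loss(scores, targets=np.array([1.0, 0.0, 1.0])))
print(euclidean_loss(scores, labels=np.array([1.5, -0.5, 0.0])))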
9.6 Common libraries

• Caffe: Caffe (a replacement for Decaf) has been the most popular library for convolutional neural networks. It is created by the Berkeley Vision and Learning Center (BVLC). Its advantages are a clean architecture and fast speed, and it supports both CPU and GPU with easy switching between them. It is developed in C++ and has Python and MATLAB wrappers. Caffe uses protobuf so that researchers can tune parameters and add or remove layers easily.
• Torch7 (www.torch.ch)
• OverFeat
• Cuda-convnet
• MatConvnet
• Theano: written in Python, using the scientific Python stack
• Deeplearning4j: deep learning in Java and Scala on GPU-enabled Spark

9.7 See also

• Deep learning
• Time delay neural network

9.8 References

[1] "Convolutional Neural Networks (LeNet) - DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Retrieved 31 August 2013.

[2] Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. Retrieved 17 November 2013.

[3] LeCun, Yann. "LeNet-5, convolutional neural networks". Retrieved 16 November 2013.

[4] Korekado, Keisuke; Morie, Takashi; Nomura, Osamu; Ando, Hiroshi; Nakano, Teppei; Matsugu, Masakazu; Iwata, Atsushi (2003). "A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture". Knowledge-Based Intelligent Information and Engineering Systems: 169–176. CiteSeerX: 10.1.1.125.3812.

[5] Ciresan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two: 1237–1242. Retrieved 17 November 2013.

[6] Krizhevsky, Alex. "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). Retrieved 17 November 2013.

[7] Ciresan, Dan; Meier, Ueli; Schmidhuber, Jürgen (June 2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition (New York, NY: Institute of Electrical and Electronics Engineers (IEEE)): 3642–3649. arXiv:1202.2745v1. doi:10.1109/CVPR.2012.6248110. ISBN 9781467312264. OCLC 812295155. Retrieved 2013-12-09.

[8] Le Callet, Patrick; Christian Viard-Gaudin; Dominique Barba (2006). "A Convolutional Neural Network Approach for Objective Video Quality Assessment" (PDF). IEEE Transactions on Neural Networks 17 (5): 1316–1327. doi:10.1109/TNN.2006.879766. PMID 17001990. Retrieved 17 November 2013.

[10] Homma, Toshiteru; Les Atlas; Robert Marks II (1988). "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification" (PDF). Advances in Neural Information Processing Systems 1: 31–40.

[11] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791. Retrieved 16 November 2013.

[12] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume 2766 of Lecture Notes in Computer Science. Springer, 2003.

[13] Simard, Patrice; David Steinkraus; John C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis." In ICDAR, vol. 3, pp. 958–962, 2003.

[14] Hinton, GE; Osindero, S; Teh, YW (Jul 2006). "A fast learning algorithm for deep belief nets". Neural Computation 18 (7): 1527–54. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.

[15] Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise Training of Deep Networks". Advances in Neural Information Processing Systems: 153–160.

[17] Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

9.9 External links

• UFLDL Tutorial
• Deeplearning4j's Convolutional Nets
• Caffe
Chapter 10
Restricted Boltzmann machine
P(v|h) = \prod_{i=1}^{m} P(v_i | h).

Conversely, the conditional probability of h given v is

P(h|v) = \prod_{j=1}^{n} P(h_j | v).

The individual activation probabilities are given by

P(h_j = 1 | v) = \sigma\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right) and P(v_i = 1 | h) = \sigma\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right),

where \sigma denotes the logistic sigmoid.

10.2 Training algorithm

The algorithm most often used to train RBMs, that is, to optimize the weight vector W, is the contrastive divergence (CD) algorithm due to Hinton, originally developed to train PoE (product of experts) models.*[13]*[14] The algorithm performs Gibbs sampling and is used inside a gradient descent procedure (similar to the way backpropagation is used inside such a procedure when training feedforward neural nets) to compute the weight updates.

The basic, single-step contrastive divergence (CD-1) procedure for a single sample can be summarized as follows:
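A minimal sketch of the single-step (CD-1) update for one binary training sample, assuming logistic-sigmoid units; the layer sizes, the learning rate, and the use of probabilities rather than samples in the update are illustrative simplifications of the procedure described above.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a = np.zeros(n_visible)          # visible biases
b = np.zeros(n_hidden)           # hidden biases

v0 = np.array([1., 0., 1., 1., 0., 0.])    # one binary training sample

# Positive phase: hidden probabilities and a sample given the data.
ph0 = sigmoid(v0 @ W + b)
h0 = sample(ph0)

# One step of Gibbs sampling: reconstruct the visible units, then compute
# the hidden probabilities again (the "negative" phase).
pv1 = sigmoid(h0 @ W.T + a)
v1 = sample(pv1)
ph1 = sigmoid(v1 @ W + b)

# CD-1 update: difference between data-driven and reconstruction-driven
# correlations, used as an approximate gradient step.
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
a += lr * (v0 - v1)
b += lr * (ph0 - ph1)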
Run freely, the network repeatedly makes stochastic ("weighted coin flip") updates until the visible nodes in the lowest layer settle into staying mostly a certain way. Training has the same shape as running the network, except that the pairs of units that are on together are observed: on the first upward pass the learning rate is added to the weights between those pairs, and after going back down and up again the learning rate is subtracted. As Geoffrey Hinton explained it, the first upward pass is to learn the data, and the second is to unlearn whatever the network's earlier reaction to the data was.

[9] Geoffrey Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines. UTML TR 2010–003, University of Toronto.

[10] Sutskever, Ilya; Tieleman, Tijmen (2010). "On the convergence properties of contrastive divergence" (PDF). Proc. 13th Int'l Conf. on AI and Statistics (AISTATS).

[11] Asja Fischer and Christian Igel. Training Restricted Boltzmann Machines: An Introduction. Pattern Recognition 47, pp. 25–39, 2014.
Chapter 11
Recurrent neural network

Not to be confused with Recursive neural network.

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them applicable to tasks such as unsegmented connected handwriting recognition, where they have achieved the best known results.*[1]

11.1 Architectures

11.1.1 Fully recurrent network

This is the basic architecture developed in the 1980s: a network of neuron-like units, each with a directed connection to every other unit. Each unit has a time-varying real-valued activation, and each connection has a modifiable real-valued weight. Some of the nodes are called input nodes, some output nodes, and the rest hidden nodes. Most architectures below are special cases.

For supervised learning in discrete time settings, training sequences of real-valued input vectors become sequences of activations of the input nodes, one input vector at a time. At any given time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections. There may be teacher-given target activations for some of the output units at certain time steps. For example, if the input sequence is a speech signal corresponding to a spoken digit, the final target output at the end of the sequence may be a label classifying the digit. For each sequence, its error is the sum of the deviations of all target signals from the corresponding activations computed by the network. For a training set of numerous sequences, the total error is the sum of the errors of all individual sequences. Algorithms for minimizing this error are mentioned in the section on training algorithms below.

In reinforcement learning settings, there is no teacher providing target signals for the RNN; instead a fitness function or reward function is occasionally used to evaluate the RNN's performance, which influences its input stream through output units connected to actuators affecting the environment. Again, compare the section on training algorithms below.

11.1.2 Hopfield network

The Hopfield network is of historic interest although it is not a general RNN, as it is not designed to process sequences of patterns; instead it requires stationary inputs. It is an RNN in which all connections are symmetric. Invented by John Hopfield in 1982, it guarantees that its dynamics will converge. If the connections are trained using Hebbian learning, the Hopfield network can perform as a robust content-addressable memory, resistant to connection alteration.

A variation on the Hopfield network is the bidirectional associative memory (BAM). The BAM has two layers, either of which can be driven as an input, to recall an association and produce an output on the other layer.*[2]

11.1.3 Elman networks and Jordan networks

The following special case of the basic architecture above was employed by Jeff Elman. A three-layer network is used (arranged vertically as x, y, and z in the illustration), with the addition of a set of "context units" (u in the illustration). There are connections from the middle (hidden) layer to these context units fixed with a weight of one.*[3] At each time step, the input is propagated in a standard feed-forward fashion, and then a learning rule is applied. The fixed back connections result in the context units always maintaining a copy of the previous values of the hidden units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence prediction that are beyond the power of a standard multilayer perceptron.

Jordan networks, due to Michael I. Jordan, are similar to Elman networks; the context units are, however, fed from the output layer instead of the hidden layer.
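A sketch of the Elman arrangement just described: a standard feed-forward step plus context units that keep a copy of the previous hidden values over fixed, weight-one connections. The sizes, the tanh nonlinearity and the random weights are illustrative only, and the learning rule is omitted.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2

W_xh = 0.1 * rng.standard_normal((n_in, n_hidden))       # input   -> hidden
W_ch = 0.1 * rng.standard_normal((n_hidden, n_hidden))    # context -> hidden
W_hy = 0.1 * rng.standard_normal((n_hidden, n_out))       # hidden  -> output

context = np.zeros(n_hidden)       # context units start empty
sequence = rng.random((5, n_in))   # five time steps of input vectors

for x_t in sequence:
    # Standard feed-forward step; the context units contribute the previous
    # hidden state through the trainable weights W_ch.
    hidden = np.tanh(x_t @ W_xh + context @ W_ch)
    output = hidden @ W_hy
    # Fixed, weight-one copy connections: the context units simply store the
    # current hidden values for use at the next time step.
    context = hidden.copy()
    print(np.round(output, 3))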
The echo state network (ESN) is a recurrent neural network with a sparsely connected random hidden layer. The weights of the output neurons are the only part of the network that can change and be trained. ESNs are good at reproducing certain time series.*[4] A variant for spiking neurons is known as the liquid state machine.*[5]

The Long short term memory (LSTM) network, developed by Hochreiter & Schmidhuber in 1997,*[6] is an artificial neural net structure that, unlike traditional RNNs, doesn't have the vanishing gradient problem (compare the section on training algorithms below). It works even when there are long delays, and it can handle signals that have a mix of low and high frequency components. LSTM RNNs outperformed other methods in numerous applications such as language learning*[7] and connected handwriting recognition.*[8]

In a continuous time recurrent neural network (CTRNN), the rate of change of each node's activation is determined by the following quantities:

• ẏi : Rate of change of activation of the postsynaptic node
• wji : Weight of the connection from the pre- to the postsynaptic node
• σ(x) : Sigmoid of x, e.g. σ(x) = 1/(1 + e^(-x))
• Θj : Bias of the presynaptic node
• Ii(t) : Input (if any) to the node

CTRNNs have frequently been applied in the field of evolutionary robotics, where they have been used to address, for example, vision,*[11] co-operation*[12] and minimally cognitive behaviour.*[13]
11.2.3 Global optimization methods

Training the weights in a neural network can be modeled as a non-linear global optimization problem. A target function can be formed to evaluate the fitness or error of a particular weight vector as follows: first, the weights in the network are set according to the weight vector. Next, the network is evaluated against the training sequence. Typically, the sum-squared difference between the predictions and the target values specified in the training sequence is used to represent the error of the current weight vector. Arbitrary global optimization techniques may then be used to minimize this target function.

The most common global optimization method for training RNNs is genetic algorithms, especially in unstructured networks.*[35]*[36]*[37]

Initially, the genetic algorithm is encoded with the neural network weights in a predefined manner, where one gene in the chromosome represents one weight link; the whole network is represented as a single chromosome. The fitness function is evaluated as follows: 1) each weight encoded in the chromosome is assigned to the respective weight link of the network; 2) the training set of examples is then presented to the network, which propagates the input signals forward; 3) the mean squared error is returned to the fitness function; 4) this function then drives the genetic selection process.

Many chromosomes make up the population; therefore, many different neural networks are evolved until a stopping criterion is satisfied. A common stopping scheme is: 1) when the neural network has learnt a certain percentage of the training data, 2) when the minimum value of the mean squared error is satisfied, or 3) when the maximum number of training generations has been reached. The stopping criterion is evaluated by the fitness function, which takes the reciprocal of the mean squared error of each neural network during training. Therefore, the goal of the genetic algorithm is to maximize the fitness function and hence reduce the mean squared error.

Other global (and/or evolutionary) optimization techniques, such as simulated annealing or particle swarm optimization, may also be used to seek a good set of weights.

11.3 Related fields and models

In particular, recurrent neural networks can appear as nonlinear versions of finite impulse response and infinite impulse response filters and also as a nonlinear autoregressive exogenous (NARX) model.*[38]

11.4 Issues with recurrent neural networks

Most RNNs have had scaling issues. In particular, RNNs cannot be easily trained for large numbers of neuron units nor for large numbers of input units. Successful training has been mostly in time series problems with few inputs and in chemical process control.
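The weight-vector error measure described under "Global optimization methods" above can be sketched as follows; the tiny recurrent network, its layer sizes and the random training data are placeholders, and a genetic algorithm or other global optimizer would treat the returned value (or its reciprocal) as the quantity to optimize.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 1
shapes = [(n_in, n_hidden), (n_hidden, n_hidden), (n_hidden, n_out)]
n_weights = sum(r * c for r, c in shapes)

def unpack(weight_vector):
    # Map a flat candidate weight vector onto the network's weight matrices.
    mats, pos = [], 0
    for r, c in shapes:
        mats.append(weight_vector[pos:pos + r * c].reshape(r, c))
        pos += r * c
    return mats

def error(weight_vector, inputs, targets):
    # Evaluate one candidate: set the weights, run the RNN over the training
    # sequence, and accumulate the sum-squared difference between outputs
    # and targets.
    W_xh, W_hh, W_hy = unpack(weight_vector)
    h = np.zeros(n_hidden)
    total = 0.0
    for x_t, y_t in zip(inputs, targets):
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        total += np.sum((h @ W_hy - y_t) ** 2)
    return total

inputs = rng.random((20, n_in))
targets = rng.random((20, n_out))
candidate = 0.1 * rng.standard_normal(n_weights)
print(error(candidate, inputs, targets))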
11.5 References

[1] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.

[2] Rául Rojas (1996). Neural networks: a systematic introduction. Springer. p. 336. ISBN 978-3-540-60505-8.

[3] Cruse, Holk; Neural Networks as Cybernetic Systems, 2nd and revised edition.

[4] H. Jaeger. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78–80, 2004.

[5] W. Maass, T. Natschläger, and H. Markram. A fresh look at real-time computation in generic recurrent neural circuits. Technical report, Institute for Theoretical Computer Science, TU Graz, 2002.

[6] Hochreiter, Sepp; and Schmidhuber, Jürgen; Long Short-Term Memory, Neural Computation, 9(8):1735–1780, 1997.

[7] Gers, Felix A.; and Schmidhuber, Jürgen; LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages, IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.

[10] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602–610, 2005.

[11] Harvey, Inman; Husbands, P. and Cliff, D. (1994). "Seeing the light: Artificial evolution, real vision". Proceedings of the third international conference on Simulation of adaptive behavior: from animals to animats 3: 392–401.

[12] Quinn, Matthew (2001). "Evolving communication without dedicated communication channels". Advances in Artificial Life. Lecture Notes in Computer Science 2159: 357–366. doi:10.1007/3-540-44811-X_38. ISBN 978-3-540-42567-0.

[13] Beer, R.D. (1997). "The dynamics of adaptive behavior: A research program". Robotics and Autonomous Systems 20 (2–4): 257–289. doi:10.1016/S0921-8890(96)00063-2.

[14] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.

[15] R.W. Paine, J. Tani, "How hierarchical control self-organizes in artificial adaptive systems," Adaptive Behavior, 13(3), 211–225, 2005.

[16] "CiteSeerX — Recurrent Multilayer Perceptrons for Identification and Control: The Road to Applications". Citeseerx.ist.psu.edu. Retrieved 2014-01-03.

[17] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, Y.C. Lee, "Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks," Neural Computation, 4(3), p. 393, 1992.

[18] C.W. Omlin, C.L. Giles, "Constructing Deterministic Finite-State Automata in Recurrent Neural Networks," Journal of the ACM, 45(6), 937–972, 1996.

[19] Y. Yamashita, J. Tani, "Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment," PLoS Computational Biology, 4(11), e1000220, 211–225, 2008. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000220

[20] http://arxiv.org/pdf/1410.5401v2.pdf

[21] Kosko, B. (1988). "Bidirectional associative memories". IEEE Transactions on Systems, Man, and Cybernetics 18 (1): 49–60. doi:10.1109/21.87054.

[22] Rakkiyappan, R.; Chandrasekar, A.; Lakshmanan, S.; Park, Ju H. (2 January 2015). "Exponential stability for markovian jumping stochastic BAM neural networks with mode-dependent probabilistic time-varying delays and impulse control". Complexity 20 (3): 39–65. doi:10.1002/cplx.21503.

[23] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.

[24] David E. Rumelhart; Geoffrey E. Hinton; Ronald J. Williams. Learning Internal Representations by Error Propagation.

[25] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[26] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.

[27] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.

[28] Neural and Adaptive Systems: Fundamentals through Simulation. J.C. Principe, N.R. Euliano, W.C. Lefebvre.

[29] J. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248, 1992.

[30] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.

[31] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.

[32] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991.

[33] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.

[34] Martens, James, and Ilya Sutskever. "Training deep and recurrent networks with hessian-free optimization." In Neural Networks: Tricks of the Trade, pp. 479–535. Springer Berlin Heidelberg, 2012.

[35] F. J. Gomez and R. Miikkulainen. Solving non-Markovian control tasks with neuroevolution. Proc. IJCAI 99, Denver, CO, 1999. Morgan Kaufmann.

[36] Applying Genetic Algorithms to Recurrent Neural Networks for Learning Network Parameters and Architecture. O. Syed, Y. Takefuji.

[37] F. Gomez, J. Schmidhuber, R. Miikkulainen. Accelerated Neural Evolution through Cooperatively Coevolved Synapses. Journal of Machine Learning Research (JMLR), 9:937–965, 2008.

[38] Hava T. Siegelmann, Bill G. Horne, C. Lee Giles, "Computational capabilities of recurrent NARX neural networks," IEEE Transactions on Systems, Man, and Cybernetics, Part B 27(2): 208–215 (1997).

• Mandic, D. & Chambers, J. (2001). Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley. ISBN 0-471-49517-4.
Chapter 12
Long short term memory
12.1 Architecture

An LSTM network is an artificial neural network that contains LSTM blocks instead of, or in addition to, regular network units. An LSTM block may be described as a "smart" network unit that can remember a value for an arbitrary length of time. An LSTM block contains gates that determine when the input is significant enough to remember, when it should continue to remember or forget the value, and when it should output the value.

A typical implementation of an LSTM block is shown to the right. The four units shown at the bottom of the figure are sigmoid units (y = s(Σ_i w_i x_i), where s is some squashing function, such as the logistic function). The left-most of these units computes a value which is conditionally fed as an input value to the block's memory. The other three units serve as gates to determine when values are allowed to flow into or out of the block's memory. The second unit from the left (on the bottom row) is the "input gate": when it outputs a value close to zero, it zeros out the value from the left-most unit, effectively blocking that value from entering the next layer. The second unit from the right is the "forget gate": when it outputs a value close to zero, the block effectively forgets whatever value it was remembering. The right-most unit (on the bottom row) is the "output gate", which determines when the unit should output the value in its memory. The units containing the Π symbol compute the product of their inputs (y = Π_i x_i); these units have no weights. The unit with the Σ symbol computes a linear function of its inputs (y = Σ_i w_i x_i). The output of this unit is not squashed, so that it can remember the same value for many time steps without the value decaying. This value is fed back in so that the block can "remember" it (as long as the forget gate allows). Typically, this value is also fed into the three gating units to help them make gating decisions.
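A sketch of one time step of the block described above, using a single scalar memory cell: logistic gate units, product units at the gates, and an unsquashed internal sum that carries the value across time steps. The random weights and the omission of biases are illustrative simplifications; practical LSTM layers are vectorized and often squash the cell value again before the output gate.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_in = 4
# One scalar LSTM block: separate weight vectors for the candidate input and
# for the input, forget and output gates.
w_g, w_i, w_f, w_o = (0.5 * rng.standard_normal(n_in) for _ in range(4))

c = 0.0                          # the block's memory (the Sigma unit)
for t in range(6):
    x = rng.random(n_in)         # external input at this time step
    g = np.tanh(x @ w_g)         # candidate value offered to the memory
    i = sigmoid(x @ w_i)         # input gate: let the candidate in?
    f = sigmoid(x @ w_f)         # forget gate: keep the old memory?
    o = sigmoid(x @ w_o)         # output gate: expose the memory?
    c = f * c + i * g            # product units feeding the sum unit, which
                                 # is not squashed, so the value can persist
    y = o * c                    # gated output of the block
    print(t, round(float(y), 4))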
• Time series prediction*[8]
• Speech recognition*[9]*[10]*[11]
• Rhythm learning*[12]
• Music composition*[13]
• Grammar learning*[14]*[15]*[16]
• Handwriting recognition*[17]*[18]
• Human action recognition*[19]
• Protein Homology Detection*[20]

• Prefrontal Cortex Basal Ganglia Working Memory (PBWM)
• Recurrent neural network
• Time series
• Long-term potentiation

[8] J. Schmidhuber and D. Wierstra and F. J. Gomez. Evolino: Hybrid Neuroevolution / Optimal Linear Search for Sequence Learning. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, pp. 853–858, 2005.

[9] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18:5–6, pp. 602–610, 2005.

[10] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007.

[12] F. Gers, N. Schraudolph, J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3:115–143, 2002.

[13] D. Eck and J. Schmidhuber. Learning The Long-Term Structure of the Blues. In J. Dorronsoro, ed., Proceedings of Int. Conf. on Artificial Neural Networks ICANN'02, Madrid, pages 284–289, Springer, Berlin, 2002.
Chapter 13
Google Brain
Chapter 14
Google DeepMind
In 2011 the start-up was founded by Demis Hassabis, Shane Legg and Mustafa Suleyman.*[3]*[4] Hassabis and Legg first met at UCL's Gatsby Computational Neuroscience Unit.*[5]

Since then major venture capital firms Horizons Ventures and Founders Fund have invested in the company,*[6] as well as entrepreneur Scott Banister.*[7] Jaan Tallinn was an early investor and an advisor to the company.*[8]

In 2014, DeepMind received the "Company of the Year" award from the Cambridge Computer Laboratory.*[9]

The company has created a neural network that learns how to play video games in a similar fashion to humans*[10] and a neural network that may be able to access an external memory like a conventional Turing machine, resulting in a computer that appears to possibly mimic the short-term memory of the human brain.*[11]

[...] Attempting to distil intelligence into an algorithmic construct may prove to be the best path to understanding some of the enduring mysteries of our minds.
—Demis Hassabis, Nature (journal), 23 February 2012*[23]

Currently the company's focus is on publishing research on computer systems that are able to play games, and on developing these systems, ranging from strategy games such as Go*[24] to arcade games. According to Shane Legg, human-level machine intelligence can be achieved "when a machine can learn to play a really wide range of games from perceptual stream input and output, and transfer understanding across games [...]."*[25] Research describing an AI playing seven different Atari video games (Pong, Breakout, Space Invaders, Seaquest, Beamrider, Enduro, and Q*bert) reportedly led to their acquisition by Google.*[10]
... than any human ever could.*[27] For most games, though (Space Invaders, Ms Pacman, Q*Bert, for example), DeepMind plays well below the current world record. The application of DeepMind's AI to video games is currently limited to games made in the 1970s and 1980s, with work being done on more complex 3D games such as Doom, which first appeared in the early 1990s.*[27]

[9] "Hall of Fame Awards: To celebrate the success of companies founded by Computer Laboratory graduates.". Cambridge University. Retrieved 12 October 2014.

[17] Oreskovic, Alexei. "Reuters Report". Reuters. Retrieved 27 January 2014.

[18] "Google Acquires Artificial Intelligence Start-Up DeepMind". The Verge. Retrieved 27 January 2014.

[19] "Google acquires AI pioneer DeepMind Technologies". Ars Technica. Retrieved 27 January 2014.

14.4 External links

• Google DeepMind
Chapter 15
Torch (machine learning)

Torch is an open source machine learning library, a scientific computing framework, and a script language based on the Lua programming language.*[3] It provides a wide range of algorithms for deep machine learning, and uses the extremely fast scripting language LuaJIT with an underlying C implementation.

15.1 torch

The core package of Torch is torch. It provides a flexible N-dimensional array or Tensor, which supports basic routines for indexing, slicing, transposing, type-casting, resizing, sharing storage and cloning. This object is used by most other packages and thus forms the core object of the library. The Tensor also supports mathematical operations like max, min and sum, statistical distributions like uniform, normal and multinomial, and BLAS operations like dot product, matrix-vector multiplication, matrix-matrix multiplication and matrix product.

The following exemplifies using torch via its REPL interpreter:

> a = torch.randn(3,4)
> =a
-0.2381 -0.3401 -1.7844 -0.2615
 0.1411  1.6249  0.1708  0.8299
-1.0434  2.2291  1.0525  0.8465
[torch.DoubleTensor of dimension 3x4]
> a[1][2]
-0.34010116549482
> a:narrow(1,1,2)
-0.2381 -0.3401 -1.7844 -0.2615
 0.1411  1.6249  0.1708  0.8299
[torch.DoubleTensor of dimension 2x4]
> a:index(1, torch.LongTensor{1,2})
-0.2381 -0.3401 -1.7844 -0.2615
 0.1411  1.6249  0.1708  0.8299
[torch.DoubleTensor of dimension 2x4]
> a:min()
-1.7844365427828

The torch package also simplifies object-oriented programming and serialization by providing various convenience functions which are used throughout its packages. The torch.class(classname, parentclass) function can be used to create object factories (classes). When the constructor is called, torch initializes and sets a Lua table with the user-defined metatable, which makes the table an object.

Objects created with the torch factory can also be serialized, as long as they do not contain references to objects that cannot be serialized, such as Lua coroutines and Lua userdata. However, userdata can be serialized if it is wrapped by a table (or metatable) that provides read() and write() methods.

15.2 nn

The nn package is used for building neural networks. It is divided into modular objects that share a common Module interface. Modules have a forward() and a backward() method that allow them to feedforward and backpropagate, respectively. Modules can be joined together using module composites like Sequential, Parallel and Concat to create complex task-tailored graphs. Simpler modules like Linear, Tanh and Max make up the basic component modules. This modular interface provides first-order automatic gradient differentiation. What follows is an example use-case for building a multilayer perceptron using Modules:

> mlp = nn.Sequential()
> mlp:add( nn.Linear(10, 25) ) -- 10 input, 25 hidden units
> mlp:add( nn.Tanh() ) -- some hyperbolic tangent transfer function
> mlp:add( nn.Linear(25, 1) ) -- 1 output
> =mlp:forward(torch.randn(10))
-0.1815
[torch.Tensor of dimension 1]

Loss functions are implemented as sub-classes of Criterion, which has a similar interface to Module. It also has forward() and backward() methods for computing the loss and backpropagating gradients, respectively. Criteria are helpful for training a neural network on classical tasks. Common criteria are the mean squared error criterion implemented in MSECriterion and the cross-entropy criterion implemented in ClassNLLCriterion. What follows is an example of a Lua function that can be iteratively called to train an mlp Module on input Tensor x, target Tensor y with a scalar learningRate:

function gradUpdate(mlp, x, y, learningRate)
  local criterion = nn.ClassNLLCriterion()
  pred = mlp:forward(x)
  local err = criterion:forward(pred, y)
  mlp:zeroGradParameters()
  local t = criterion:backward(pred, y)   -- gradient of the loss w.r.t. the prediction
  mlp:backward(x, t)                      -- backpropagate it through the network
  mlp:updateParameters(learningRate)      -- take one gradient step
end
15.4 Applications
Torch is used by Google DeepMind,* [4] the Facebook AI
Research Group,* [5] IBM,* [6] Yandex* [7] and the Idiap
Research Institute.* [8] Torch has been extended for use
on Android* [9] and iOS.* [10] It has been used to build
hardware implementations for data flows like those found
in neural networks.* [11]
Facebook has released a set of extension modules as open
source software.* [12]
15.5 References
[1] “Torch: a modular machine learning software library”.
30 October 2002. Retrieved 24 April 2014.
Chapter 16
Theano (software)
• Torch
Chapter 17
Deeplearning4j
Deeplearning4j is an open source deep learning library written for Java and the Java Virtual Machine*[1]*[2] and a computing framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, as well as word2vec, doc2vec and GloVe. These algorithms all include distributed parallel versions that integrate with Hadoop and Spark.*[3]

17.3 Scientific Computing for the JVM

Deeplearning4j includes an n-dimensional array class using ND4J that allows scientific computing in Java and Scala, similar to the functionality that NumPy provides to Python. It is effectively based on a library for linear algebra and matrix manipulation in a production environment. It relies on Matplotlib as a plotting package.
17.7 References
[1] Metz, Cade (2014-06-02). “The Mission to Bring
Google's AI to the Rest of the World”. Wired.com. Re-
trieved 2014-06-28.
[7] “deeplearning4j.org”.
• “Github Repositories”.
• “Deeplearning4j vs. Torch vs. Caffe vs. Pylearn”.
• “Canova: A General Vectorization Lib for Machine
Learning”.
• “Apache Flink”.
Chapter 18
Gensim
Gensim is an open-source vector space modeling and topic modeling toolkit, implemented in the Python programming language, using NumPy, SciPy and optionally Cython for performance. It is specifically intended for handling large text collections, using efficient online algorithms.

Gensim includes implementations of tf–idf, random projections, deep learning with Google's word2vec algorithm*[1] (reimplemented and optimized in Cython), hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), including distributed parallel versions.*[2]

Gensim has been used in a number of commercial as well as academic applications.*[3]*[4] The code is hosted on GitHub*[5] and a support forum is maintained on Google Groups.*[6]

Gensim accompanied the PhD dissertation Scalability of Semantic Analysis in Natural Language Processing of Radim Řehůřek (2011).*[7]
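A minimal usage sketch of the word2vec implementation mentioned above. The keyword names shown (vector_size, epochs) follow gensim 4.x (earlier releases used size and iter instead), and the tiny corpus is invented for illustration only.

from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences; real corpora would be streamed.
sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
    ["graph", "trees", "survey"],
]

# Train a small word2vec model.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["computer"][:5])                  # part of the learned vector
print(model.wv.most_similar("graph", topn=2))    # nearest words in the space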
18.2 References

[1] Deep learning with word2vec and gensim

[8] Rehurek, Radim. "Gensim". http://radimrehurek.com/. Retrieved 27 January 2015. Gensim's tagline: "Topic Modelling for Humans"

18.3 External links

• Official website
Chapter 19
Geoffrey Hinton
Geoffrey (Geoff) Everest Hinton FRS (born 6 December 1947) is a British-born cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. He now divides his time between working for Google and the University of Toronto.*[1] He is the co-inventor of the backpropagation and contrastive divergence training algorithms and is an important figure in the deep learning movement.*[2]

He co-invented Boltzmann machines with Terry Sejnowski. His other contributions to neural network research include distributed representations, the time delay neural network, mixtures of experts, Helmholtz machines and Product of Experts. His current main interest is in unsupervised learning procedures for neural networks with rich sensory input.
[3] https://www.coursera.org/course/neuralnets
Chapter 20
Yann LeCun
Yann LeCun (born 1960) is a computer scientist with contributions in machine learning, computer vision, mobile robotics and computational neuroscience. He is well known for his work on optical character recognition and computer vision using convolutional neural networks (CNN), and is a founding father of convolutional nets.*[1]*[2] He is also one of the main creators of the DjVu image compression technology (together with Léon Bottou and Patrick Haffner). He co-developed the Lush programming language with Léon Bottou.

20.1 Life

Yann LeCun was born near Paris, France, in 1960. He received a Diplôme d'Ingénieur from the Ecole Superieure d'Ingénieur en Electrotechnique et Electronique (ESIEE), Paris in 1983, and a PhD in Computer Science from Université Pierre et Marie Curie in 1987, during which he proposed an early form of the back-propagation learning algorithm for neural networks.*[3] He was a postdoctoral research associate in Geoffrey Hinton's lab at the University of Toronto.

In 1988, he joined the Adaptive Systems Research Department at AT&T Bell Laboratories in Holmdel, New Jersey, USA, where he developed a number of new machine learning methods, such as a biologically inspired model of image recognition called convolutional neural networks,*[4] the "Optimal Brain Damage" regularization methods,*[5] and the Graph Transformer Networks method (similar to conditional random fields), which he applied to handwriting recognition and OCR.*[6] The bank check recognition system that he helped develop was widely deployed by NCR and other companies, reading over 10% of all the checks in the US in the late 1990s and early 2000s.

In 1996, he joined AT&T Labs-Research as head of the Image Processing Research Department, which was part of Lawrence Rabiner's Speech and Image Processing Research Lab, and worked primarily on the DjVu image compression technology,*[7] used by many websites, notably the Internet Archive, to distribute scanned documents. His collaborators at AT&T include Léon Bottou and Vladimir Vapnik.

After a brief tenure as a Fellow of the NEC Research Institute (now NEC-Labs America) in Princeton, NJ, he joined New York University (NYU) in 2003, where he is Silver Professor of Computer Science and Neural Science at the Courant Institute of Mathematical Science and the Center for Neural Science. He is also a professor at the Polytechnic Institute of New York University.*[8]*[9] At NYU, he has worked primarily on energy-based models for supervised and unsupervised learning,*[10] feature learning for object recognition in computer vision,*[11] and mobile robotics.*[12]

In 2012, he became the founding director of the NYU Center for Data Science.*[13] On December 9, 2013, LeCun became the first director of Facebook AI Research in New York City,*[14] and stepped down from the NYU-CDS directorship in early 2014.

LeCun is the recipient of the 2014 IEEE Neural Network Pioneer Award.

In 2013, he and Yoshua Bengio co-founded the International Conference on Learning Representations, which adopted a post-publication open review process he previously advocated on his website. He was the chair and organizer of the "Learning Workshop" held every year between 1986 and 2012 in Snowbird, Utah. He is a member of the Science Advisory Board of the Institute for Pure and Applied Mathematics*[15] at UCLA, and has been on the advisory board of a number of companies, including MuseAmi, KXEN Inc., and Vidient Systems.*[16] He is the Co-Director of the Neural Computation & Adaptive Perception research program of CIFAR.*[17]

20.2 References

[1] Convolutional Nets and CIFAR-10: An Interview with Yann LeCun. Kaggle 2014.

[2] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791. Retrieved 16 November 2013.

[3] Y. LeCun: Une procédure d'apprentissage pour réseau a seuil asymmetrique (a Learning Scheme for Asymmetric Threshold Networks), Proceedings of Cognitiva 85, 599–604, Paris, France, 1985.
[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541–551, Winter 1989.

[9] http://yann.lecun.com/

[13] http://cds.nyu.edu

[14] https://www.facebook.com/yann.lecun/posts/10151728212367143

20.3 External links

• Yann LeCun's List of PhD Students
• Yann LeCun's publications
• Convolutional Neural Networks
• DjVuLibre website
Chapter 21
Jürgen Schmidhuber
Jürgen Schmidhuber (born 17 January 1963 in Munich) is a computer scientist and artist known for his work on machine learning, artificial intelligence (AI), artificial neural networks, digital physics, and low-complexity art. His contributions also include generalizations of Kolmogorov complexity and the Speed Prior. From 2004 to 2009 he was professor of Cognitive Robotics at the Technische Universität München. Since 1995 he has been co-director of the Swiss AI Lab IDSIA in Lugano, and since 2009 also professor of Artificial Intelligence at the University of Lugano. Between 2009 and 2012, the recurrent neural networks and deep feedforward neural networks developed in his research group won eight international competitions in pattern recognition and machine learning.*[1] In honor of his achievements he was elected to the European Academy of Sciences and Arts in 2008.

21.1 Contributions

21.1.1 Recurrent neural networks

In the same year he published the first work on Meta-genetic programming. Since then he has co-authored numerous additional papers on artificial evolution. Applications include robot control, soccer learning, drag minimization, and time series prediction. He received several best paper awards at scientific conferences on evolutionary computation.

21.1.3 Neural economy

In 1989 he created the first learning algorithm for neural networks based on principles of the market economy (inspired by John Holland's bucket brigade algorithm for classifier systems): adaptive neurons compete for being active in response to certain input patterns; those that are active when there is external reward get stronger synapses, but active neurons have to pay those that activated them by transferring parts of their synapse strengths, thus rewarding "hidden" neurons that set the stage for later success.*[5]
During the early 1990s Schmidhuber also invented a neural method for nonlinear independent component analysis (ICA) called predictability minimization. It is based on co-evolution of adaptive predictors and initially random, adaptive feature detectors processing input patterns from the environment. For each detector there is a predictor trying to predict its current value from the values of neighboring detectors, while each detector is simultaneously trying to become as unpredictable as possible.*[8] It can be shown that the best the detectors can do is to create a factorial code of the environment, that is, a code that conveys all the information about the inputs such that the code components are statistically independent, which is desirable for many pattern recognition applications.

Schmidhuber's low-complexity artworks (since 1997) can be described by very short computer programs containing very few bits of information, and reflect his formal theory of beauty*[15] based on the concepts of Kolmogorov complexity and minimum description length.

Schmidhuber writes that since age 15 or so his main scientific ambition has been to build an optimal scientist, then retire. First he wants to build a scientist better than himself (he quips that his colleagues claim that should be easy) who will then do the remaining work. He claims he "cannot see any more efficient way of using and multiplying the little creativity he's got".

21.1.9 Robot learning
Jeffrey Adgate “Jeff” Dean (born 1968) is an American computer scientist and software engineer. He is currently a Google Senior Fellow in the Systems and Infrastructure Group.

22.1 Personal life and education

Dean received a Ph.D. in Computer Science from the University of Washington, working with Craig Chambers on whole-program optimization techniques for object-oriented languages. He received a B.S., summa cum laude, from the University of Minnesota in Computer Science & Economics in 1990. He was elected to the National Academy of Engineering in 2009, which recognized his work on “the science and engineering of large-scale distributed computer systems.”

… involvement in the engineering hiring process. Among others, the projects he has worked on include:

• Spanner - a scalable, multi-version, globally distributed, and synchronously replicated database
• Some of the production system design and the statistical machine translation system for Google Translate
• BigTable, a large-scale semi-structured storage system
• MapReduce, a system for large-scale data processing applications (a toy sketch of the programming model follows this list)
• Google Brain, a system for large-scale artificial neural networks
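The MapReduce programming model named in the list above can be sketched with a small in-process toy in Python (an illustrative sketch, not Google's distributed implementation): a map function emits key/value pairs, the framework groups the pairs by key, and a reduce function aggregates each group; word counting is the classic example.

# Toy, single-process illustration of the MapReduce programming model.
from collections import defaultdict

def map_fn(document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Aggregate all counts emitted for one word.
    return word, sum(counts)

def mapreduce(documents, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in documents:                     # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)         # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

docs = ["deep learning with deep networks", "learning to learn"]
print(mapreduce(docs, map_fn, reduce_fn))
# {'deep': 2, 'learning': 2, 'with': 1, 'networks': 1, 'to': 1, 'learn': 1}

In the production system the map and reduce phases run in parallel across many machines, with the framework handling partitioning, scheduling, and fault tolerance.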
Andrew Ng
• Publications
• Academic Genealogy
• Coursera-Leadership