0% found this document useful (0 votes)

75 views100 pages

Deep Learning

Uploaded by

565864220

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

75 views100 pages

Deep Learning

Uploaded by

565864220

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 100

Deep Learning

more at http://ml.memect.com
Contents

1 Artiﬁcial neural network 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Improvements since 2006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Network function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Learning paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.4 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Employing artiﬁcial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.1 Real-life applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.2 Neural networks and neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Neural network software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Types of artiﬁcial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Theoretical properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8.1 Computational power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8.4 Generalization and statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.9 Controversies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.9.1 Training issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.9.2 Hardware issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.9.3 Practical counterexamples to criticisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.9.4 Hybrid approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.10 Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.11 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.12 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.13 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.14 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Deep learning 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

i
ii CONTENTS

2.1.1 Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Fundamental concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Deep learning in artiﬁcial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Deep learning architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Issues with deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.3 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.4 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.5 Convolutional Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.6 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.7 Stacked (Denoising) Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.8 Deep Stacking Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.9 Tensor Deep Stacking Networks (T-DSN) . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.10 Spike-and-Slab RBMs (ssRBMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.11 Compound Hierarchical-Deep Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.12 Deep Coding Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.13 Deep Kernel Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.14 Deep Q-Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Automatic speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Image recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.3 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.4 Drug discovery and toxicology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.5 Customer relationship management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Deep learning in the human brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Commercial activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Criticism and comment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.9 Deep learning software libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.10 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.12 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Feature learning 32
3.1 Supervised feature learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.1 Supervised dictionary learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Unsupervised feature learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Principal component analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Local linear embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4 Independent component analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
CONTENTS iii

3.2.5 Unsupervised dictionary learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 Multilayer/Deep architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Restricted Boltzmann machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4 Unsupervised learning 36
4.1 Method of moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Generative model 38
5.1 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Neural coding 39
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.2 Encoding and decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3 Coding schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3.1 Rate coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.3.2 Temporal coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.3.3 Population coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.4 Sparse coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

7 Word embedding 48
7.1 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

8 Deep belief network 49

8.1 Training algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.2 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.4 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

9 Convolutional neural network 51

9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.3 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
iv CONTENTS

9.3.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.3.2 Diﬀerent types of layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.4.1 Image recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.4.2 Video analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.4.3 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.4.4 Playing Go . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.5 Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.6 Common libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.9 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

10 Restricted Boltzmann machine 56

10.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
10.1.1 Relation to other models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
10.2 Training algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
10.3 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
10.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
10.5 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

11 Recurrent neural network 59

11.1 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.1.1 Fully recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.1.2 Hopﬁeld network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.1.3 Elman networks and Jordan networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.1.4 Echo state network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.1.5 Long short term memory network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.1.6 Bi-directional RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.1.7 Continuous-time RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.1.8 Hierarchical RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.1.9 Recurrent multilayer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.1.10 Second Order Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.1.11 Multiple Timescales Recurrent Neural Network (MTRNN) Model . . . . . . . . . . . . . 61
11.1.12 Pollack’s sequential cascaded networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.1.13 Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.1.14 Bidirectional Associative Memory (BAM) . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.2.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.2.2 Hessian Free Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.2.3 Global optimization methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
11.3 Related ﬁelds and models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
CONTENTS v

11.4 Issues with recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

11.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
11.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

12 Long short term memory 65

12.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
12.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
12.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
12.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
12.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
12.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

13 Google Brain 68
13.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
13.2 In Google products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
13.3 Team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
13.4 Reception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
13.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
13.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

14 Google DeepMind 70
14.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
14.1.1 2011 to 2014 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
14.1.2 Acquisition by Google . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
14.2 Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
14.2.1 Deep reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
14.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
14.4 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

15 Torch (machine learning) 72

15.1 torch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
15.2 nn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
15.3 Other packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
15.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
15.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
15.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

16 Theano (software) 74
16.1 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
16.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

17 Deeplearning4j 75
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
17.2 Distributed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
vi CONTENTS

17.3 Scientiﬁc Computing for the JVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

17.4 Canova Vectorization Lib for Machine-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
17.5 Text & NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
17.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
17.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
17.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

18 Gensim 77
18.1 Gensim's tagline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
18.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
18.3 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

19 Geoﬀrey Hinton 78
19.1 Career . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
19.2 Research interests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
19.3 Honours and awards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
19.4 Personal life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
19.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
19.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

20 Yann LeCun 80
20.1 Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
20.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
20.3 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

21 Jürgen Schmidhuber 82
21.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
21.1.1 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
21.1.2 Artiﬁcial evolution / genetic programming . . . . . . . . . . . . . . . . . . . . . . . . . . 82
21.1.3 Neural economy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
21.1.4 Artiﬁcial curiosity and creativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
21.1.5 Unsupervised learning / factorial codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
21.1.6 Kolmogorov complexity / computer-generated universe . . . . . . . . . . . . . . . . . . . 83
21.1.7 Universal AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
21.1.8 Low-complexity art / theory of beauty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
21.1.9 Robot learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
21.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
21.3 Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
21.4 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

22 Jeﬀ Dean (computer scientist) 85

22.1 Personal life and education . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
22.2 Career in computer science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
CONTENTS vii

22.3 Career at Google . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

22.4 Awards and honors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
22.5 Major publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
22.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
22.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

23 Andrew Ng 87
23.1 Machine learning research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
23.2 Online education . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
23.3 Personal life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
23.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
23.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
23.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
23.7 Text and image sources, contributors, and licenses . . . . . . . . . . . . . . . . . . . . . . . . . . 89
23.7.1 Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
23.7.2 Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
23.7.3 Content license . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Chapter 1

Artiﬁcial neural network

“Neural network”redirects here. For networks of living tion is defined by a set of input neurons which may be
neurons, see Biological neural network. For the journal, activated by the pixels of an input image. After being
see Neural Networks (journal). For the evolutionary con- weighted and transformed by a function (determined by
cept, see Neutral network (evolution). the network's designer), the activations of these neurons
In machine learning and cognitive science, artificial are then passed on to other neurons. This process is re-
peated until finally, an output neuron is activated. This
determines which character was read.
Like other machine learning methods - systems that learn
from data - neural networks have been used to solve a
wide variety of tasks that are hard to solve using ordinary
rule-based programming, including computer vision and
speech recognition.

1.1 Background
Examinations of the human's central nervous system in-
spired the concept of neural networks. In an Artifi-
cial Neural Network, simple artificial nodes, known as
"neurons",“neurodes”“ , processing elements”or“units”
, are connected together to form a network which mimics
a biological neural network.
There is no single formal definition of what an artificial
neural network is. However, a class of statistical models
may commonly be called “Neural”if they possess the
following characteristics:
An artificial neural network is an interconnected group of nodes,
akin to the vast network of neurons in a brain. Here, each circu- 1. consist of sets of adaptive weights, i.e. numerical
lar node represents an artificial neuron and an arrow represents parameters that are tuned by a learning algorithm,
a connection from the output of one neuron to the input of an- and
other.

neural networks (ANNs) are a family of statistical learn- 2. are capable of approximating non-linear functions
ing models inspired by biological neural networks (the of their inputs.
central nervous systems of animals, in particular the
brain) and are used to estimate or approximate functions The adaptive weights are conceptually connection
that can depend on a large number of inputs and are strengths between neurons, which are activated during
generally unknown. Artiﬁcial neural networks are gen- training and prediction.
erally presented as systems of interconnected "neurons"Neural networks are similar to biological neural networks
which send messages to each other. The connections in performing functions collectively and in parallel by
have numeric weights that can be tuned based on expe- the units, rather than there being a clear delineation of
rience, making neural nets adaptive to inputs and capable
subtasks to which various units are assigned. The term
of learning. “neural network”usually refers to models employed in
For example, a neural network for handwriting recogni- statistics, cognitive psychology and artiﬁcial intelligence.

1
2 CHAPTER 1. ARTIFICIAL NEURAL NETWORK

Neural network models which emulate the central ner- Neural network research stagnated after the publication
vous system are part of theoretical neuroscience and of machine learning research by Marvin Minsky and
computational neuroscience. Seymour Papert* [7] (1969), who discovered two key is-
In modern software implementations of artificial neu- sues with the computational machines that processed neu-
ral networks, the approach inspired by biology has been ral networks. The first was that single-layer neural net-
largely abandoned for a more practical approach based works were incapable of processing the exclusive-or cir-
on statistics and signal processing. In some of these sys- cuit. The second significant issue was that computers
tems, neural networks or parts of neural networks (like were not sophisticated enough to effectively handle the
long run time required by large neural networks. Neu-
artificial neurons) form components in larger systems that
combine both adaptive and non-adaptive elements. While ral network research slowed until computers achieved
greater processing power. Also key later advances was
the more general approach of such systems is more suit-
able for real-world problem solving, it has little to do with the backpropagation algorithm which effectively solved
the exclusive-or problem (Werbos 1975).* [6]
the traditional artificial intelligence connectionist models.
What they do have in common, however, is the princi- The parallel distributed processing of the mid-1980s be-
ple of non-linear, distributed, parallel and local process- came popular under the name connectionism. The text by
ing and adaptation. Historically, the use of neural net- David E. Rumelhart and James McClelland* [8] (1986)
works models marked a paradigm shift in the late eight- provided a full exposition on the use of connectionism in
ies from high-level (symbolic) AI, characterized by expert computers to simulate neural processes.
systems with knowledge embodied in if-then rules, to low- Neural networks, as used in artificial intelligence, have
level (sub-symbolic) machine learning, characterized by traditionally been viewed as simplified models of neural
knowledge embodied in the parameters of a dynamical processing in the brain, even though the relation between
system. this model and brain biological architecture is debated,
as it is not clear to what degree artificial neural networks
mirror brain function.* [9]
1.2 History Neural networks were gradually overtaken in popular-
ity in machine learning by support vector machines and
*
Warren McCulloch and Walter Pitts [1] (1943) created other, much simpler methods such as linear classifiers.
a computational model for neural networks based on Renewed interest in neural nets was sparked in the late
mathematics and algorithms called threshold logic. This 2000s by the advent of deep learning.
model paved the way for neural network research to split
into two distinct approaches. One approach focused on
biological processes in the brain and the other focused
on the application of neural networks to artificial intelli- 1.2.1 Improvements since 2006
gence.
Computational devices have been created in CMOS, for
In the late 1940s psychologist Donald Hebb* [2] created a both biophysical simulation and neuromorphic comput-
hypothesis of learning based on the mechanism of neural
ing. More recent efforts show promise for creating
plasticity that is now known as Hebbian learning. Heb- nanodevices* [10] for very large scale principal compo-
bian learning is considered to be a 'typical' unsupervised
nents analyses and convolution. If successful, these ef-
learning rule and its later variants were early models for forts could usher in a new era of neural computing* [11]
long term potentiation. These ideas started being applied
that is a step beyond digital computing, because it de-
to computational models in 1948 with Turing's B-type pends on learning rather than programming and because
machines.
it is fundamentally analog rather than digital even though
Farley and Wesley A. Clark* [3] (1954) first used com- the first instantiations may in fact be with CMOS digital
putational machines, then called calculators, to simulate devices.
a Hebbian network at MIT. Other neural network com- Between 2009 and 2012, the recurrent neural networks
putational machines were created by Rochester, Holland, and deep feedforward neural networks developed in the
Habit, and Duda* [4] (1956). research group of Jürgen Schmidhuber at the Swiss AI
Frank Rosenblatt* [5] (1958) created the perceptron, an Lab IDSIA have won eight international competitions
algorithm for pattern recognition based on a two-layer in pattern recognition and machine learning.* [12]* [13]
learning computer network using simple addition and For example, the bi-directional and multi-dimensional
subtraction. With mathematical notation, Rosenblatt long short term memory (LSTM)* [14]* [15]* [16]* [17] of
also described circuitry not in the basic perceptron, such Alex Graves et al. won three competitions in connected
as the exclusive-or circuit, a circuit whose mathemati- handwriting recognition at the 2009 International Confer-
cal computation could not be processed until after the ence on Document Analysis and Recognition (ICDAR),
backpropagation algorithm was created by Paul Wer- without any prior knowledge about the three different
bos* [6] (1975). languages to be learned.
1.3. MODELS 3

Fast GPU-based implementations of this approach by 2. The learning process for updating the weights of the
Dan Ciresan and colleagues at IDSIA have won sev- interconnections
eral pattern recognition contests, including the IJCNN
2011 Traffic Sign Recognition Competition,* [18]* [19] 3. The activation function that converts a neuron's
the ISBI 2012 Segmentation of Neuronal Structures in weighted input to its output activation.
Electron Microscopy Stacks challenge,* [20] and others.
Their neural networks also were the first artificial pattern Mathematically, a neuron's network function f (x) is de-
recognizers to achieve human-competitive or even super- fined as a composition of other functions gi (x) , which
human performance* [21] on important benchmarks such can further be defined as a composition of other func-
as traffic sign recognition (IJCNN 2012), or the MNIST tions. This can be conveniently represented as a net-
handwritten digits problem of Yann LeCun at NYU. work structure, with arrows depicting the dependen-
cies between variables. A widely used type of com-
Deep, highly nonlinear neural architectures similar to the position
1980 neocognitron by Kunihiko Fukushima* [22] and the ∑ is the nonlinear weighted sum, where f (x) =
K ( i wi gi (x)) , where K (commonly referred to as
“standard architecture of vision”,* [23] inspired by the the activation function* [29]) is some predefined function,
simple and complex cells identified by David H. Hubel such as the hyperbolic tangent. It will be convenient for
and Torsten Wiesel in the primary visual cortex, can the following to refer to a collection of functions gi as
also be pre-trained by unsupervised methods* [24]* [25] simply a vector g = (g1 , g2 , . . . , gn ) .
of Geoff Hinton's lab at University of Toronto.* [26]* [27]
A team from this lab won a 2012 contest sponsored by
Merck to design software to help find molecules that
might lead to new drugs.* [28]

1.3 Models
Neural network models in artificial intelligence are usu-
ally referred to as artificial neural networks (ANNs);
these are essentially simple mathematical models defin-
ing a function f : X → Y or a distribution over X or
both X and Y , but sometimes models are also intimately
associated with a particular learning algorithm or learn-
ing rule. A common use of the phrase ANN model really ANN dependency graph
means the definition of a class of such functions (where
members of the class are obtained by varying parameters, This figure depicts such a decomposition of f , with de-
connection weights, or specifics of the architecture such pendencies between variables indicated by arrows. These
as the number of neurons or their connectivity). can be interpreted in two ways.
The first view is the functional view: the input x is trans-
1.3.1 Network function formed into a 3-dimensional vector h , which is then
transformed into a 2-dimensional vector g , which is fi-
See also: Graphical models nally transformed into f . This view is most commonly
encountered in the context of optimization.

The word network in the term 'artificial neural network' The second view is the probabilistic view: the random
refers to the inter–connections between the neurons in variable F = f (G) depends upon the random variable
the different layers of each system. An example system G = g(H) , which depends upon H = h(X) , which
has three layers. The first layer has input neurons which depends upon the random variable X . This view is most
send data via synapses to the second layer of neurons, and commonly encountered in the context of graphical mod-
then via more synapses to the third layer of output neu- els.
rons. More complex systems will have more layers of The two views are largely equivalent. In either case, for
neurons with some having increased layers of input neu- this particular network architecture, the components of
rons and output neurons. The synapses store parameters individual layers are independent of each other (e.g., the
called “weights”that manipulate the data in the calcu- components of g are independent of each other given their
lations. input h ). This naturally enables a degree of parallelism
An ANN is typically defined by three types of parameters: in the implementation.
Networks such as the previous one are commonly called
1. The interconnection pattern between the different feedforward, because their graph is a directed acyclic
layers of neurons graph. Networks with cycles are commonly called
4 CHAPTER 1. ARTIFICIAL NEURAL NETWORK

When N → ∞ some form of online machine learning

must be used, where the cost is partially minimized as
each new example is seen. While online machine learning
is often used when D is ﬁxed, it is most useful in the case
where the distribution changes slowly over time. In neural
network methods, some form of online machine learning
is frequently used for ﬁnite datasets.
See also: Mathematical optimization, Estimation theory
and Machine learning

Choosing a cost function

While it is possible to deﬁne some arbitrary ad hoc cost

function, frequently a particular cost will be used, either
because it has desirable properties (such as convexity) or
because it arises naturally from a particular formulation
of the problem (e.g., in a probabilistic formulation the
posterior probability of the model can be used as an in-
Two separate depictions of the recurrent ANN dependency graph verse cost). Ultimately, the cost function will depend on
the desired task. An overview of the three main cate-
gories of learning tasks is provided below:
recurrent. Such networks are commonly depicted in the
manner shown at the top of the figure, where f is shown as
being dependent upon itself. However, an implied tem- 1.3.3 Learning paradigms
poral dependence is not shown.
There are three major learning paradigms, each corre-
sponding to a particular abstract learning task. These
1.3.2 Learning are supervised learning, unsupervised learning and
reinforcement learning.
What has attracted the most interest in neural networks is
the possibility of learning. Given a specific task to solve,
and a class of functions F , learning means using a set Supervised learning
of observations to find f ∗ ∈ F which solves the task in
some optimal sense. In supervised learning, we are given a set of example pairs
This entails defining a cost function C : F → R such that, (x, y), x ∈ X, y ∈ Y and the aim is to find a function
for the optimal solution f ∗ , C(f ∗ ) ≤ C(f ) ∀f ∈ F – f : X → Y in the allowed class of functions that matches
i.e., no solution has a cost less than the cost of the optimal the examples. In other words, we wish to infer the map-
solution (see Mathematical optimization). ping implied by the data; the cost function is related to the
mismatch between our mapping and the data and it im-
The cost function C is an important concept in learning,
as it is a measure of how far away a particular solution plicitly contains prior knowledge about the problem do-
main.
is from an optimal solution to the problem to be solved.
Learning algorithms search through the solution space to A commonly used cost is the mean-squared error, which
find a function that has the smallest possible cost. tries to minimize the average squared error between the
For applications where the solution is dependent on some network's output, f (x) , and the target value y over all
data, the cost must necessarily be a function of the obser- the example pairs. When one tries to minimize this cost
vations, otherwise we would not be modelling anything using gradient descent for the class of neural networks
related to the data. It is frequently defined as a statistic called multilayer perceptrons, one obtains the common
to which only approximations can be made. As a sim- and well-known backpropagation algorithm for training
ple example, consider the problem neural networks.
[ of finding ] the model
f , which minimizes C = E (f (x) − y) , for data Tasks that fall within the paradigm of supervised learn-
2

pairs (x, y) drawn from some distribution D . In prac- ing are pattern recognition (also known as classiﬁcation)
tical situations we would only have N samples from D and regression (also known as function approximation).
and thus,∑ for the above example, we would only minimize The supervised learning paradigm is also applicable to
N
Ĉ = N1 i=1 (f (xi )−yi )2 . Thus, the cost is minimized sequential data (e.g., for speech and gesture recognition).
over a sample of the data rather than the entire data set. This can be thought of as learning with a “teacher”, in
1.4. EMPLOYING ARTIFICIAL NEURAL NETWORKS 5

the form of a function that provides continuous feedback those involved in vehicle routing,* [33] natural resources
on the quality of solutions obtained thus far. management* [34]* [35] or medicine* [36] because of the
ability of ANNs to mitigate losses of accuracy even when
reducing the discretization grid density for numerically
Unsupervised learning approximating the solution of the original control prob-
lems.
In unsupervised learning, some data x is given and the
cost function to be minimized, that can be any function Tasks that fall within the paradigm of reinforcement
of the data x and the network's output, f . learning are control problems, games and other sequential
decision making tasks.
The cost function is dependent on the task (what we are
trying to model) and our a priori assumptions (the implicit See also: dynamic programming and stochastic control
properties of our model, its parameters and the observed
variables).
As a trivial example, consider the model f (x) = a where 1.3.4 Learning algorithms
a is a constant and the cost C = E[(x − f (x))2 ] . Mini-
mizing this cost will give us a value of a that is equal to the Training a neural network model essentially means se-
mean of the data. The cost function can be much more lecting one model from the set of allowed models (or,
complicated. Its form depends on the application: for ex- in a Bayesian framework, determining a distribution over
ample, in compression it could be related to the mutual the set of allowed models) that minimizes the cost crite-
information between x and f (x) , whereas in statistical rion. There are numerous algorithms available for train-
modeling, it could be related to the posterior probability ing neural network models; most of them can be viewed
of the model given the data (note that in both of those ex- as a straightforward application of optimization theory
amples those quantities would be maximized rather than and statistical estimation.
minimized).
Most of the algorithms used in training artiﬁcial neural
Tasks that fall within the paradigm of unsupervised learn- networks employ some form of gradient descent, using
ing are in general estimation problems; the applications backpropagation to compute the actual gradients. This is
include clustering, the estimation of statistical distribu- done by simply taking the derivative of the cost function
tions, compression and ﬁltering. with respect to the network parameters and then changing
those parameters in a gradient-related direction.

Reinforcement learning Evolutionary methods,* [37] gene expression program-

ming,* [38] simulated annealing,* [39] expectation-
In reinforcement learning, data x are usually not given, maximization, non-parametric
*
methods and particle
but generated by an agent's interactions with the environ- swarm optimization [40] are some commonly used
ment. At each point in time t , the agent performs an methods for training neural networks.
action yt and the environment generates an observation See also: machine learning
xt and an instantaneous cost ct , according to some (usu-
ally unknown) dynamics. The aim is to discover a policy
for selecting actions that minimizes some measure of a
long-term cost; i.e., the expected cumulative cost. The 1.4 Employing artificial neural net-
environment's dynamics and the long-term cost for each
policy are usually unknown, but can be estimated. works
More formally the environment is modelled as a Markov
decision process (MDP) with states s1 , ..., sn ∈ S and Perhaps the greatest advantage of ANNs is their ability
actions a1 , ..., am ∈ A with the following probability dis- to be used as an arbitrary function approximation mech-
tributions: the instantaneous cost distribution P (ct |st ) , anism that 'learns' from observed data. However, using
the observation distribution P (xt |st ) and the transition them is not so straightforward, and a relatively good un-
P (st+1 |st , at ) , while a policy is defined as conditional derstanding of the underlying theory is essential.
distribution over actions given the observations. Taken
together, the two then define a Markov chain (MC). The • Choice of model: This will depend on the data rep-
aim is to discover the policy that minimizes the cost; i.e., resentation and the application. Overly complex
the MC for which the cost is minimal. models tend to lead to problems with learning.
ANNs are frequently used in reinforcement learning as • Learning algorithm: There are numerous trade-offs
part of the overall algorithm.* [30]* [31] Dynamic pro- between learning algorithms. Almost any algorithm
gramming has been coupled with ANNs (Neuro dynamic will work well with the correct hyperparameters for
programming) by Bertsekas and Tsitsiklis* [32] and ap- training on a particular fixed data set. However, se-
plied to multi-dimensional nonlinear problems such as lecting and tuning an algorithm for training on un-
6 CHAPTER 1. ARTIFICIAL NEURAL NETWORK

seen data requires a significant amount of experi- These networks have also been used to diagnose prostate
mentation. cancer. The diagnoses can be used to make specific mod-
els taken from a large group of patients compared to in-
• Robustness: If the model, cost function and learn- formation of one given patient. The models do not de-
ing algorithm are selected appropriately the result- pend on assumptions about correlations of different vari-
ing ANN can be extremely robust. ables. Colorectal cancer has also been predicted using
the neural networks. Neural networks could predict the
With the correct implementation, ANNs can be used nat- outcome for a patient with colorectal cancer with more
urally in online learning and large data set applications. accuracy than the current clinical methods. After train-
Their simple implementation and the existence of mostly ing, the networks could predict multiple patient outcomes
local dependencies exhibited in the structure allows for from unrelated institutions.* [43]
fast, parallel implementations in hardware.

1.5 Applications 1.5.2 Neural networks and neuroscience

The utility of artificial neural network models lies in the Theoretical and computational neuroscience is the field
fact that they can be used to infer a function from obser- concerned with the theoretical analysis and the computa-
vations. This is particularly useful in applications where tional modeling of biological neural systems. Since neu-
the complexity of the data or task makes the design of ral systems are intimately related to cognitive processes
such a function by hand impractical. and behavior, the field is closely related to cognitive and
behavioral modeling.
The aim of the field is to create models of biological neu-
1.5.1 Real-life applications ral systems in order to understand how biological systems
work. To gain this understanding, neuroscientists strive
The tasks artificial neural networks are applied to tend to to make a link between observed biological processes
fall within the following broad categories: (data), biologically plausible mechanisms for neural pro-
cessing and learning (biological neural network models)
• Function approximation, or regression analysis, in- and theory (statistical learning theory and information
cluding time series prediction, fitness approximation theory).
and modeling.
• Classification, including pattern and sequence
recognition, novelty detection and sequential deci- Types of models
sion making.
• Data processing, including filtering, clustering, blind Many models are used in the field, defined at different lev-
source separation and compression. els of abstraction and modeling different aspects of neural
systems. They range from models of the short-term be-
• Robotics, including directing manipulators, havior of individual neurons, models of how the dynamics
prosthesis. of neural circuitry arise from interactions between indi-
vidual neurons and finally to models of how behavior can
• Control, including Computer numerical control.
arise from abstract neural modules that represent com-
plete subsystems. These include models of the long-term,
Application areas include the system identification and and short-term plasticity, of neural systems and their rela-
control (vehicle control, process control, natural re- tions to learning and memory from the individual neuron
sources management), quantum chemistry,* [41] game- to the system level.
playing and decision making (backgammon, chess,
poker), pattern recognition (radar systems, face identi-
fication, object recognition and more), sequence recog-
nition (gesture, speech, handwritten text recognition),
medical diagnosis, financial applications (e.g. automated
1.6 Neural network software
trading systems), data mining (or knowledge discovery in
databases, “KDD”), visualization and e-mail spam fil- Main article: Neural network software
tering.
Artificial neural networks have also been used to diagnose Neural network software is used to simulate, research,
several cancers. An ANN based hybrid lung cancer de- develop and apply artificial neural networks, biological
tection system named HLND improves the accuracy of neural networks and, in some cases, a wider array of
diagnosis and the speed of lung cancer radiology.* [42] adaptive systems.
1.8. THEORETICAL PROPERTIES 7

1.7 Types of artiﬁcial neural net- regarding convergence are an unreliable guide to practical
application.
works
Main article: Types of artiﬁcial neural networks 1.8.4 Generalization and statistics

Artificial neural network types vary from those with only In applications where the goal is to create a system that
one or two layers of single direction logic, to complicated
generalizes well in unseen examples, the problem of over-
multi–input many directional feedback loops and layers. training has emerged. This arises in convoluted or over-
On the whole, these systems use algorithms in their pro- specified systems when the capacity of the network sig-
gramming to determine control and organization of their nificantly exceeds the needed free parameters. There
functions. Most systems use “weights”to change the are two schools of thought for avoiding this problem:
parameters of the throughput and the varying connec- The first is to use cross-validation and similar techniques
tions to the neurons. Artificial neural networks can be to check for the presence of overtraining and optimally
autonomous and learn by input from outside “teachers” select hyperparameters such as to minimize the gener-
or even self-teaching from written-in rules. alization error. The second is to use some form of
regularization. This is a concept that emerges naturally in
a probabilistic (Bayesian) framework, where the regular-
ization can be performed by selecting a larger prior prob-
1.8 Theoretical properties ability over simpler models; but also in statistical learning
theory, where the goal is to minimize over two quantities:
1.8.1 Computational power the 'empirical risk' and the 'structural risk', which roughly
corresponds to the error over the training set and the pre-
The multi-layer perceptron (MLP) is a universal function dicted error in unseen data due to overfitting.
approximator, as proven by the universal approximation
theorem. However, the proof is not constructive regard-
ing the number of neurons required or the settings of the
weights.
Work by Hava Siegelmann and Eduardo D. Sontag has
provided a proof that a specific recurrent architecture
with rational valued weights (as opposed to full preci-
sion real number-valued weights) has the full power of a
Universal Turing Machine* [44] using a finite number of
neurons and standard linear connections. Further, it has
been shown that the use of irrational values for weights
results in a machine with super-Turing power.* [45]

1.8.2 Capacity
Confidence analysis of a neural network
Artificial neural network models have a property called
'capacity', which roughly corresponds to their ability to Supervised neural networks that use a mean squared error
model any given function. It is related to the amount of (MSE) cost function can use formal statistical methods to
information that can be stored in the network and to the determine the confidence of the trained model. The MSE
notion of complexity. on a validation set can be used as an estimate for variance.
This value can then be used to calculate the confidence
interval of the output of the network, assuming a normal
1.8.3 Convergence distribution. A confidence analysis made this way is sta-
tistically valid as long as the output probability distribu-
Nothing can be said in general about convergence since it tion stays the same and the network is not modified.
depends on a number of factors. Firstly, there may exist By assigning a softmax activation function, a generaliza-
many local minima. This depends on the cost function tion of the logistic function, on the output layer of the
and the model. Secondly, the optimization method used neural network (or a softmax component in a component-
might not be guaranteed to converge when far away from based neural network) for categorical target variables, the
a local minimum. Thirdly, for a very large amount of outputs can be interpreted as posterior probabilities. This
data or parameters, some methods become impractical. is very useful in classification as it gives a certainty mea-
In general, it has been found that theoretical guarantees sure on classifications.
8 CHAPTER 1. ARTIFICIAL NEURAL NETWORK

The softmax activation function is: (they tend to consume considerable amounts of time and
money).

ex i Computing power continues to grow roughly according

yi = ∑c to Moore's Law, which may provide suﬃcient resources
ex j
j=1 to accomplish new tasks. Neuromorphic engineering ad-
dresses the hardware diﬃculty directly, by constructing
non-Von-Neumann chips with circuits designed to imple-
1.9 Controversies ment neural nets from the ground up.

1.9.1 Training issues

1.9.3 Practical counterexamples to criti-
A common criticism of neural networks, particularly in
robotics, is that they require a large diversity of training cisms
for real-world operation . This is not surprising, since
any learning machine needs sufficient representative ex- Arguments against Dewdney's position are that neural
amples in order to capture the underlying structure that networks have been successfully used to solve many com-
allows it to generalize to new cases. Dean Pomerleau, in plex and diverse tasks, ranging from autonomously flying
his research presented in the paper “Knowledge-based aircraft* [46] to detecting credit card fraud .
Training of Artificial Neural Networks for Autonomous Technology writer Roger Bridgman commented on
Robot Driving,”uses a neural network to train a robotic Dewdney's statements about neural nets:
vehicle to drive on multiple types of roads (single lane,
multi-lane, dirt, etc.). A large amount of his research
is devoted to (1) extrapolating multiple training scenar- Neural networks, for instance, are in the
ios from a single training experience, and (2) preserving dock not only because they have been hyped
past training diversity so that the system does not become to high heaven, (what hasn't?) but also be-
overtrained (if, for example, it is presented with a series cause you could create a successful net with-
of right turns – it should not learn to always turn right). out understanding how it worked: the bunch of
These issues are common in neural networks that must de- numbers that captures its behaviour would in
cide from amongst a wide variety of responses, but can be all probability be “an opaque, unreadable ta-
dealt with in several ways, for example by randomly shuf- ble...valueless as a scientific resource”.
fling the training examples, by using a numerical opti- In spite of his emphatic declaration that
mization algorithm that does not take too large steps when science is not technology, Dewdney seems here
changing the network connections following an example, to pillory neural nets as bad science when most
or by grouping examples in so-called mini-batches. of those devising them are just trying to be
A. K. Dewdney, a former Scientific American columnist, good engineers. An unreadable table that a
wrote in 1997, “Although neural nets do solve a few toy useful machine could read would still be well
problems, their powers of computation are so limited that worth having.* [47]
I am surprised anyone takes them seriously as a general
problem-solving tool.”(Dewdney, p. 82) Although it is true that analyzing what has been learned
by an artificial neural network is difficult, it is much eas-
ier to do so than to analyze what has been learned by a
1.9.2 Hardware issues biological neural network. Furthermore, researchers in-
volved in exploring learning algorithms for neural net-
To implement large and effective software neural net-
works are gradually uncovering generic principles which
works, considerable processing and storage resources
allow a learning machine to be successful. For example,
need to be committed . While the brain has hardware
Bengio and LeCun (2007) wrote an article regarding local
tailored to the task of processing signals through a graph
vs non-local learning, as well as shallow vs deep architec-
of neurons, simulating even a most simplified form on
ture.* [48]
Von Neumann technology may compel a neural network
designer to fill many millions of database rows for its con-
nections – which can consume vast amounts of computer
memory and hard disk space. Furthermore, the designer 1.9.4 Hybrid approaches
of neural network systems will often need to simulate
the transmission of signals through many of these con- Some other criticisms come from advocates of hybrid
nections and their associated neurons – which must often models (combining neural networks and symbolic ap-
be matched with incredible amounts of CPU processing proaches), who believe that the intermix of these two ap-
power and time. While neural networks often yield effec- proaches can better capture the mechanisms of the human
tive programs, they too often do so at the cost of efficiency mind.* [49]* [50]
1.12. REFERENCES 9

1.10 Gallery • Habituation

• In Situ Adaptive Tabulation
• A single-layer feedforward artificial neural network.
Arrows originating from are omitted for clarity. • Models of neural computation
There are p inputs to this network and q outputs.
In this system, the value of the qth output, would be • Multilinear subspace learning
calculated as
• Neuroevolution
• A two-layer feedforward artificial neural network.
• Neural coding
•
• Neural gas
•
• Neural network software
• Neuroscience
1.11 See also • Ni1000 chip
• 20Q • Nonlinear system identification
• ADALINE • Optical neural network
• Adaptive resonance theory • Parallel Constraint Satisfaction Processes
• Artificial life • Parallel distributed processing
• Associative memory • Radial basis function network
• Autoencoder • Recurrent neural networks
• Backpropagation • Self-organizing map
• BEAM robotics • Spiking neural network
• Biological cybernetics • Systolic array
• Biologically inspired computing • Tensor product network
• Blue brain • Time delay neural network (TDNN)
• Catastrophic interference
• Cerebellar Model Articulation Controller 1.12 References
• Cognitive architecture
[1] McCulloch, Warren; Walter Pitts (1943). “A Log-
• Cognitive science ical Calculus of Ideas Immanent in Nervous Activity”
. Bulletin of Mathematical Biophysics 5 (4): 115–133.
• Convolutional neural network (CNN) doi:10.1007/BF02478259.

• Connectionist expert system [2] Hebb, Donald (1949). The Organization of Behavior.
New York: Wiley.
• Connectomics
[3] Farley, B.G.; W.A. Clark (1954). “Simulation of
• Cultured neuronal networks Self-Organizing Systems by Digital Computer”. IRE
Transactions on Information Theory 4 (4): 76–84.
• Deep learning doi:10.1109/TIT.1954.1057468.
• Digital morphogenesis [4] Rochester, N.; J.H. Holland, L.H. Habit, and W.L.
Duda (1956). “Tests on a cell assembly theory of the
• Encog action of the brain, using a large digital computer”.
• Fuzzy logic IRE Transactions on Information Theory 2 (3): 80–93.
doi:10.1109/TIT.1956.1056810.
• Gene expression programming
[5] Rosenblatt, F. (1958). “The Perceptron: A Probabilis-
• Genetic algorithm tic Model For Information Storage And Organization In
The Brain”. Psychological Review 65 (6): 386–408.
• Group method of data handling doi:10.1037/h0042519. PMID 13602029.
10 CHAPTER 1. ARTIFICIAL NEURAL NETWORK

[6] Werbos, P.J. (1975). Beyond Regression: New Tools for [20] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber.
Prediction and Analysis in the Behavioral Sciences. Deep Neural Networks Segment Neuronal Membranes in
Electron Microscopy Images. In Advances in Neural In-
[7] Minsky, M.; S. Papert (1969). An Introduction to Compu- formation Processing Systems (NIPS 2012), Lake Tahoe,
tational Geometry. MIT Press. ISBN 0-262-63022-2. 2012.
[8] Rumelhart, D.E; James McClelland (1986). Parallel Dis- [21] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column
tributed Processing: Explorations in the Microstructure of Deep Neural Networks for Image Classiﬁcation. IEEE
Cognition. Cambridge: MIT Press. Conf. on Computer Vision and Pattern Recognition
CVPR 2012.
[9] Russell, Ingrid. “Neural Networks Module”. Retrieved
2012. [22] Fukushima, K. (1980).“Neocognitron: A self-organizing
neural network model for a mechanism of pattern recog-
[10] Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; nition unaﬀected by shift in position”. Biological Cyber-
Stewart, D. R.; Williams, R. S. Nat. Nanotechnol. 2008, netics 36 (4): 93–202. doi:10.1007/BF00344251. PMID
3, 429–433. 7370364.
[11] Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. [23] M Riesenhuber, T Poggio. Hierarchical models of object
S. Nature 2008, 453, 80–83. recognition in cortex. Nature neuroscience, 1999.
[12] 2012 Kurzweil AI Interview with Jürgen Schmidhuber on [24] Deep belief networks at Scholarpedia.
the eight competitions won by his Deep Learning team
2009–2012 [25] Hinton, G. E.; Osindero, S.; Teh, Y. W. (2006).
“A Fast Learning Algorithm for Deep Belief Nets”
[13] http://www.kurzweilai.net/ (PDF). Neural Computation 18 (7): 1527–1554.
how-bio-inspired-deep-learning-keeps-winning-competitions doi:10.1162/neco.2006.18.7.1527. PMID 16764513.
2012 Kurzweil AI Interview with Jürgen Schmidhuber on
the eight competitions won by his Deep Learning team [26] http://www.scholarpedia.org/article/Deep_belief_
2009–2012 networks /

[14] Graves, Alex; and Schmidhuber, Jürgen; Offline Hand- [27] Hinton, G. E.; Osindero, S.; Teh, Y. (2006).
writing Recognition with Multidimensional Recurrent Neu- “A fast learning algorithm for deep belief nets”
ral Networks, in Bengio, Yoshua; Schuurmans, Dale; Laf- (PDF). Neural Computation 18 (7): 1527–1554.
ferty, John; Williams, Chris K. I.; and Culotta, Aron doi:10.1162/neco.2006.18.7.1527. PMID 16764513.
(eds.), Advances in Neural Information Processing Systems
[28] John Markoff (November 23, 2012). “Scientists See
22 (NIPS'22), 7–10 December 2009, Vancouver, BC, Neu-
Promise in Deep-Learning Programs”. New York Times.
ral Information Processing Systems (NIPS) Foundation,
2009, pp. 545–552. [29] “The Machine Learning Dictionary”.
[15] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. [30] Dominic, S., Das, R., Whitley, D., Anderson, C. (July
Bunke, J. Schmidhuber. A Novel Connectionist System 1991). “Genetic reinforcement learning for neural net-
for Improved Unconstrained Handwriting Recognition. works”. IJCNN-91-Seattle International Joint Confer-
IEEE Transactions on Pattern Analysis and Machine In- ence on Neural Networks. IJCNN-91-Seattle International
telligence, vol. 31, no. 5, 2009. Joint Conference on Neural Networks. Seattle, Wash-
ington, USA: IEEE. doi:10.1109/IJCNN.1991.155315.
[16] Graves, Alex; and Schmidhuber, Jürgen; Offline Hand-
ISBN 0-7803-0164-1. Retrieved 29 July 2012.
writing Recognition with Multidimensional Recurrent Neu-
ral Networks, in Bengio, Yoshua; Schuurmans, Dale; Laf- [31] Hoskins, J.C.; Himmelblau, D.M. (1992). “Process con-
ferty, John; Williams, Chris K. I.; and Culotta, Aron trol via artificial neural networks and reinforcement learn-
(eds.), Advances in Neural Information Processing Systems ing”. Computers & Chemical Engineering 16 (4): 241–
22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, 251. doi:10.1016/0098-1354(92)80045-B.
Neural Information Processing Systems (NIPS) Founda-
tion, 2009, pp. 545–552 [32] Bertsekas, D.P., Tsitsiklis, J.N. (1996). Neuro-dynamic
programming. Athena Scientific. p. 512. ISBN 1-
[17] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. 886529-10-8.
Bunke, J. Schmidhuber. A Novel Connectionist System
for Improved Unconstrained Handwriting Recognition. [33] Secomandi, Nicola (2000). “Comparing neuro-dynamic
IEEE Transactions on Pattern Analysis and Machine In- programming algorithms for the vehicle routing prob-
telligence, vol. 31, no. 5, 2009. lem with stochastic demands”. Computers & Operations
Research 27 (11–12): 1201–1225. doi:10.1016/S0305-
[18] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi- 0548(99)00146-X.
Column Deep Neural Network for Traffic Sign Classifica-
tion. Neural Networks, 2012. [34] de Rigo, D., Rizzoli, A. E., Soncini-Sessa, R., Weber, E.,
Zenesi, P. (2001). “Neuro-dynamic programming for
[19] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi- the efficient management of reservoir networks” (PDF).
Column Deep Neural Network for Traffic Sign Classifica- Proceedings of MODSIM 2001, International Congress on
tion. Neural Networks, 2012. Modelling and Simulation. MODSIM 2001, International
1.13. BIBLIOGRAPHY 11

Congress on Modelling and Simulation. Canberra, Aus- [45] Balcázar, José (Jul 1997). “Computational Power of
tralia: Modelling and Simulation Society of Australia Neural Networks: A Kolmogorov Complexity Character-
and New Zealand. doi:10.5281/zenodo.7481. ISBN 0- ization”. Information Theory, IEEE Transactions on 43
867405252. Retrieved 29 July 2012. (4): 1175–1183. doi:10.1109/18.605580. Retrieved 3
November 2014.
[35] Damas, M., Salmeron, M., Diaz, A., Ortega, J., Pri-
eto, A., Olivares, G. (2000). “Genetic algorithms and [46] NASA - Dryden Flight Research Center - News
neuro-dynamic programming: application to water sup- Room: News Releases: NASA NEURAL NETWORK
ply networks”. Proceedings of 2000 Congress on Evo- PROJECT PASSES MILESTONE. Nasa.gov. Retrieved
lutionary Computation. 2000 Congress on Evolution- on 2013-11-20.
ary Computation. La Jolla, California, USA: IEEE.
[47] Roger Bridgman's defence of neural networks
doi:10.1109/CEC.2000.870269. ISBN 0-7803-6375-2.
Retrieved 29 July 2012. [48] http://www.iro.umontreal.ca/~{}lisa/publications2/
index.php/publications/show/4
[36] Deng, Geng; Ferris, M.C. (2008). “Neuro-dynamic
programming for fractionated radiotherapy planning”. [49] Sun and Bookman (1990)
Springer Optimization and Its Applications 12: 47–70.
[50] Tahmasebi; Hezarkhani (2012). “A hybrid neural
doi:10.1007/978-0-387-73299-2_3.
networks-fuzzy logic-genetic algorithm for grade es-
[37] de Rigo, D., Castelletti, A., Rizzoli, A.E., Soncini-Sessa, timation”. Computers & Geosciences 42: 18–27.
R., Weber, E. (January 2005). “A selective improve- doi:10.1016/j.cageo.2012.02.004.
ment technique for fastening Neuro-Dynamic Program-
ming in Water Resources Network Management”. In
Pavel Zítek. Proceedings of the 16th IFAC World Congress 1.13 Bibliography
– IFAC-PapersOnLine. 16th IFAC World Congress 16.
Prague, Czech Republic: IFAC. doi:10.3182/20050703-
• Bhadeshia H. K. D. H. (1999). “Neu-
6-CZ-1902.02172. ISBN 978-3-902661-75-3. Retrieved
ral Networks in Materials Science” (PDF).
30 December 2011.
ISIJ International 39 (10): 966–979.
[38] Ferreira, C. (2006). “Designing Neural Networks Using doi:10.2355/isijinternational.39.966.
Gene Expression Programming”(PDF). In A. Abraham,
B. de Baets, M. Köppen, and B. Nickolay, eds., Applied • Bishop, C.M. (1995) Neural Networks for Pat-
Soft Computing Technologies: The Challenge of Com- tern Recognition, Oxford: Oxford University Press.
plexity, pages 517–536, Springer-Verlag. ISBN 0-19-853849-9 (hardback) or ISBN 0-19-
853864-2 (paperback)
[39] Da, Y., Xiurun, G. (July 2005). T. Villmann, ed. An
improved PSO-based ANN with simulated annealing tech- • Cybenko, G.V. (1989). Approximation by Super-
nique. New Aspects in Neurocomputing: 11th Euro- positions of a Sigmoidal function, Mathematics of
pean Symposium on Artificial Neural Networks. Elsevier. Control, Signals, and Systems, Vol. 2 pp. 303–314.
doi:10.1016/j.neucom.2004.07.002. electronic version
[40] Wu, J., Chen, E. (May 2009). Wang, H., Shen, Y., Huang, • Duda, R.O., Hart, P.E., Stork, D.G. (2001) Pat-
T., Zeng, Z., ed. A Novel Nonparametric Regression En- tern classification (2nd edition), Wiley, ISBN 0-471-
semble for Rainfall Forecasting Using Particle Swarm Op- 05669-3
timization Technique Coupled with Artificial Neural Net-
work. 6th International Symposium on Neural Networks, • Egmont-Petersen, M., de Ridder, D., Handels, H.
ISNN 2009. Springer. doi:10.1007/978-3-642-01513- (2002). “Image processing with neural networks –
7_6. ISBN 978-3-642-01215-0. a review”. Pattern Recognition 35 (10): 2279–2301.
doi:10.1016/S0031-3203(01)00178-9.
[41] Roman M. Balabin, Ekaterina I. Lomakina (2009).“Neu-
ral network approach to quantum-chemistry data: Accu- • Gurney, K. (1997) An Introduction to Neural Net-
rate prediction of density functional theory energies”. J. works London: Routledge. ISBN 1-85728-673-1
Chem. Phys. 131 (7): 074104. doi:10.1063/1.3206326. (hardback) or ISBN 1-85728-503-4 (paperback)
PMID 19708729.
• Haykin, S. (1999) Neural Networks: A Comprehen-
[42] Ganesan, N. “Application of Neural Networks in Diag- sive Foundation, Prentice Hall, ISBN 0-13-273350-
nosing Cancer Disease Using Demographic Data”(PDF). 1
International Journal of Computer Applications.
• Fahlman, S, Lebiere, C (1991). The Cascade-
[43] Bottaci, Leonardo. “Artificial Neural Networks Applied Correlation Learning Architecture, created for
to Outcome Prediction for Colorectal Cancer Patients in
National Science Foundation, Contract Number
Separate Institutions” (PDF). The Lancet.
EET-8716324, and Defense Advanced Research
[44] Siegelmann, H.T.; Sontag, E.D. (1991). “Turing com- Projects Agency (DOD), ARPA Order No. 4976
putability with neural nets” (PDF). Appl. Math. Lett. 4 under Contract F33615-87-C-1499. electronic
(6): 77–80. doi:10.1016/0893-9659(91)90080-F. version
12 CHAPTER 1. ARTIFICIAL NEURAL NETWORK

• Hertz, J., Palmer, R.G., Krogh. A.S. (1990) Intro-

duction to the theory of neural computation, Perseus
Books. ISBN 0-201-51560-1

• Lawrence, Jeanette (1994) Introduction to Neu-

ral Networks, California Scientiﬁc Software Press.
ISBN 1-883157-00-5
• Masters, Timothy (1994) Signal and Image Process-
ing with Neural Networks, John Wiley & Sons, Inc.
ISBN 0-471-04963-8

• Ripley, Brian D. (1996) Pattern Recognition and

Neural Networks, Cambridge

• Siegelmann, H.T. and Sontag, E.D. (1994). Ana-

log computation via neural networks, Theoretical
Computer Science, v. 131, no. 2, pp. 331–360.
electronic version
• Sergios Theodoridis, Konstantinos Koutroumbas
(2009) “Pattern Recognition”, 4th Edition, Aca-
demic Press, ISBN 978-1-59749-272-0.

• Smith, Murray (1993) Neural Networks for Statisti-

cal Modeling, Van Nostrand Reinhold, ISBN 0-442-
01310-8
• Wasserman, Philip (1993) Advanced Methods in
Neural Computing, Van Nostrand Reinhold, ISBN
0-442-00461-3

• Computational Intelligence: A Methodologi-

cal Introduction by Kruse, Borgelt, Klawonn,
Moewes, Steinbrecher, Held, 2013, Springer, ISBN
9781447150121

• Neuro-Fuzzy-Systeme (3rd edition) by Borgelt,

Klawonn, Kruse, Nauck, 2003, Vieweg, ISBN
9783528252656

1.14 External links

• Neural Networks at DMOZ
• A brief introduction to Neural Networks (PDF), il-
lustrated 250p textbook covering the common kinds
of neural networks (CC license).
Chapter 2

Deep learning

For deep versus shallow learning in educational psychol- 2.1 Introduction

ogy, see Student approaches to learning
2.1.1 Definitions
Deep learning (deep machine learning, or deep structured
learning, or hierarchical learning, or sometimes DL) is a There are a number of ways that the field of deep learn-
branch of machine learning based on a set of algorithms ing has been characterized. Deep learning is a class of
that attempt to model high-level abstractions in data by machine learning training algorithms that* [1]* (pp199–
using model architectures, with complex structures or 200)
otherwise, composed of multiple non-linear transforma-
tions.* [1]* (p198)* [2]* [3]* [4] • use a cascade of many layers of nonlinear process-
Deep learning is part of a broader family of machine ing units for feature extraction and transformation.
learning methods based on learning representations of The next layer uses the output from the previous
data. An observation (e.g., an image) can be represented layer as input. The algorithms may be supervised
in many ways such as a vector of intensity values per pixel, or unsupervised and applications include pattern
or in a more abstract way as a set of edges, regions of par- recognition and statistical classification.
ticular shape, etc. Some representations make it easier
• are based on the (unsupervised) learning of multi-
to learn tasks (e.g., face recognition or facial expression
ple levels of features or representations of the data.
recognition* [5]) from examples. One of the promises
Higher level features are derived from lower level
of deep learning is replacing handcrafted features with
features to form a hierarchical representation.
efficient algorithms for unsupervised or semi-supervised
feature learning and hierarchical feature extraction.* [6] • are part of the broader machine learning field of
Research in this area attempts to make better represen- learning representations of data.
tations and create models to learn these representations • learn multiple levels of representations that corre-
from large-scale unlabeled data. Some of the represen- spond to different levels of abstraction; the levels
tations are inspired by advances in neuroscience and are form a hierarchy of concepts.
loosely based on interpretation of information processing
and communication patterns in a nervous system, such
as neural coding which attempts to define a relationship These definitions have in common (1) multiple layers
between the stimulus and the neuronal responses and the of nonlinear processing units and (2) the supervised or
relationship among the electrical activity of the neurons unsupervised learning of feature representations in each
in the brain.* [7] layer, with the layers forming a hierarchy from low-level
to high-level features.* [1]* (p200) The composition of a
Various deep learning architectures such as deep neural layer of nonlinear processing units used in a deep belief
networks, convolutional deep neural networks, and deep algorithm depends on the problem to be solved. Layers
belief networks have been applied to fields like computer that have been used in deep learning include hidden lay-
vision, automatic speech recognition, natural language ers of an artificial neural network, restricted Boltzmann
processing, audio recognition and bioinformatics where machines, and sets of complicated propositional formu-
they have been shown to produce state-of-the-art results las.* [2]
on various tasks.
Deep learning algorithms are contrasted with shallow
Alternatively, deep learning has been characterized as a learning algorithms by the number of parameterized
buzzword, or a rebranding of neural networks.* [8]* [9] transformations a signal encounters as it propagates from
the input layer to the output layer, where a parameter-
ized transformation is a processing unit that has trainable
parameters, such as weights and thresholds.* [4]* (p6) A

13
14 CHAPTER 2. DEEP LEARNING

chain of transformations from input to output is a credit rithms can make use of the unlabeled data that supervised
assignment path (CAP). CAPs describe potentially causal algorithms cannot. Unlabeled data is usually more abun-
connections between input and output and may vary in dant than labeled data, making this an important benefit
length. For a feedforward neural network, the depth of of these algorithms. The deep belief network is an exam-
the CAPs, and thus the depth of the network, is the num- ple of a deep structure that can be trained in an unsuper-
ber of hidden layers plus one (the output layer is also pa- vised manner.* [3]
rameterized). For recurrent neural networks, in which a
signal may propagate through a layer more than once, the
CAP is potentially unlimited in length. There is no uni- 2.2 History
versally agreed upon threshold of depth dividing shallow
learning from deep learning, but most researchers in the
field agree that deep learning has multiple nonlinear lay- Deep learning architectures, specifically those built from
ers (CAP > 2) and Schmidhuber considers CAP > 10 to artificial neural networks (ANN), date back at least to
be very deep learning.* [4]* (p7) the Neocognitron introduced by Kunihiko Fukushima in
1980.* [10] The ANNs themselves date back even further.
In 1989, Yann LeCun et al. were able to apply the stan-
dard backpropagation algorithm, which had been around
2.1.2 Fundamental concepts since 1974,* [11] to a deep neural network with the pur-
pose of recognizing handwritten ZIP codes on mail. De-
Deep learning algorithms are based on distributed rep- spite the success of applying the algorithm, the time to
resentations. The underlying assumption behind dis-
train the network on this dataset was approximately 3
tributed representations is that observed data is generated days, making it impractical for general use.* [12] Many
by the interactions of many different factors on different
factors contribute to the slow speed, one being due to the
levels. Deep learning adds the assumption that these fac- so-called vanishing gradient problem analyzed in 1991 by
tors are organized into multiple levels, corresponding to Sepp Hochreiter.* [13]* [14]
different levels of abstraction or composition. Varying
numbers of layers and layer sizes can be used to provide While such neural networks by 1991 were used for rec-
different amounts of abstraction.* [3] ognizing isolated 2-D hand-written digits, 3-D object
recognition by 1991 used a 3-D model-based approach
Deep learning algorithms in particular exploit this idea – matching 2-D images with a handcrafted 3-D object
of hierarchical explanatory factors. Different concepts model. Juyang Weng et al.. proposed that a human brain
are learned from other concepts, with the more abstract, does not use a monolithic 3-D object model and in 1992
higher level concepts being learned from the lower level they published Cresceptron,* [15]* [16]* [17] a method for
ones. These architectures are often constructed with a performing 3-D object recognition directly from clut-
greedy layer-by-layer method that models this idea. Deep tered scenes. Cresceptron is a cascade of many layers
learning helps to disentangle these abstractions and pick similar to Neocognitron. But unlike Neocognitron which
out which features are useful for learning.* [3] required the human programmer to hand-merge features,
For supervised learning tasks where label information is Cresceptron fully automatically learned an open num-
readily available in training, deep learning promotes a ber of unsupervised features in each layer of the cascade
principle which is very different than traditional meth- where each feature is represented by a convolution kernel.
ods of machine learning. That is, rather than focusing In addition, Cresceptron also segmented each learned ob-
on feature engineering which is often labor-intensive and ject from a cluttered scene through back-analysis through
varies from one task to another, deep learning methods is the network. Max-pooling, now often adopted by deep
focused on end-to-end learning based on raw features. In neural networks (e.g., ImageNet tests), was first used in
other words, deep learning moves away from feature engi- Cresceptron to reduce the position resolution by a factor
neering to a maximal extent possible. To accomplish end- of (2x2) to 1 through the cascade for better generaliza-
to-end optimization starting with raw features and ending tion. Because of a great lack of understanding how the
in labels, layered structures are often necessary. From brain autonomously wire its biological networks and the
this perspective, we can regard the use of layered struc- computational cost by ANNs then, simpler models that
tures to derive intermediate representations in deep learn- use task-specific handcrafted features such as Gabor fil-
ing as a natural consequence of raw-feature-based end-to- ter and support vector machines (SVMs) were of popular
end learning.* [1] Understanding the connection between choice of the field in the 1990s and 2000s.
the above two aspects of deep learning is important to ap- In the long history of speech recognition, both shal-
preciate its use in several application areas, all involving low form and deep form (e.g., recurrent nets) of ar-
supervised learning tasks (e.g., supervised speech and im- tificial neural networks had been explored for many
age recognition), as to be discussed in a later part of this years.* [18]* [19]* [20] But these methods never won over
article. the non-uniform internal-handcrafting Gaussian mixture
Many deep learning algorithms are framed as unsuper- model/Hidden Markov model (GMM-HMM) technology
vised learning problems. Because of this, these algo- based on generative models of speech trained discrim-
2.3. DEEP LEARNING IN ARTIFICIAL NEURAL NETWORKS 15

inatively.* [21] A number of key difficulties had been curred over the then-state-of-the-art GMM-HMM and
methodologically analyzed, including gradient diminish- more advanced generative model-based speech recogni-
ing and weak temporal correlation structure in the neural tion systems without the need for generative DBN pre-
predictive models.* [22]* [23] All these difficulties were in training, the finding verified subsequently by several other
addition to the lack of big training data and big comput- major speech recognition research groups * [24]* [35]
ing power in these early days. Most speech recognition Further, the nature of recognition errors produced by the
researchers who understood such barriers hence subse- two types of systems was found to be characteristically
quently moved away from neural nets to pursue generative different,* [25]* [36] offering technical insights into how
modeling approaches until the recent resurgence of deep to artfully integrate deep learning into the existing highly
learning that has overcome all these difficulties. Hinton efficient, run-time speech decoding system deployed by
et al. and Deng et al. reviewed part of this recent history all major players in speech recognition industry. The his-
about how their collaboration with each other and then tory of this significant development in deep learning has
with cross-group colleagues ignited the renaissance of been described and analyzed in recent books.* [1]* [37]
neural networks and initiated deep learning research and Advances in hardware have also been an important en-
applications in speech recognition.* [24]* [25]* [26]* [27]
abling factor for the renewed interest of deep learning.
The term “deep learning”gained traction in the mid- In particular, powerful graphics processing units (GPUs)
2000s after a publication by Geoffrey Hinton and Ruslan are highly suited for the kind of number crunching, ma-
Salakhutdinov showed how a many-layered feedforward trix/vector math involved in machine learning. GPUs
neural network could be effectively pre-trained one layer have been shown to speed up training algorithms by or-
at a time, treating each layer in turn as an unsupervised ders of magnitude, bringing running times of weeks back
restricted Boltzmann machine, then using supervised to days.* [38]* [39]
backpropagation for fine-tuning.* [28] In 1992, Schmid-
huber had already implemented a very similar idea for
the more general case of unsupervised deep hierarchies of
recurrent neural networks, and also experimentally shown 2.3 Deep learning in artificial neu-
its benefits for speeding up supervised learning * [29]* [30]
ral networks
Since the resurgence of deep learning, it has become
part of many state-of-the-art systems in different disci-
Some of the most successful deep learning methods in-
plines, particularly that of computer vision and automatic
volve artificial neural networks. Artificial neural net-
speech recognition (ASR). Results on commonly used
works are inspired by the 1959 biological model pro-
evaluation sets such as TIMIT (ASR) and MNIST (image
posed by Nobel laureates David H. Hubel & Torsten
classification) as well as a range of large vocabulary
Wiesel, who found two types of cells in the primary vi-
speech recognition tasks are constantly being improved
sual cortex: simple cells and complex cells. Many artifi-
with new applications of deep learning.* [24]* [31]* [32]
cial neural networks can be viewed as cascading models
Currently, it has been shown that deep learning architec- *
[15]* [16]* [17]* [40] of cell types inspired by these bio-
tures in the form of convolutional neural networks have
logical observations.
been nearly best performing;* [33]* [34] however, these
are more widely used in computer vision than in ASR. Fukushima's Neocognitron introduced convolutional
neural networks partially trained by unsupervised learn-
The real impact of deep learning in industry started in
ing while humans directed features in the neural
large-scale speech recognition around 2010. In late 2009,
plane. Yann LeCun et al. (1989) applied supervised
Geoff Hinton was invited by Li Deng to work with him
backpropagation to such architectures.* [41] Weng et al.
and colleagues at Microsoft Research in Redmond to
(1992) published convolutional neural networks Crescep-
apply deep learning to speech recognition. They co-
tron* [15]* [16]* [17] for 3-D object recognition from im-
organized the 2009 NIPS Workshop on Deep Learning
ages of cluttered scenes and segmentation of such objects
for Speech Recognition. The workshop was motivated
from images.
by the limitations of deep generative models of speech,
and the possibility that the big-compute, big-data era An obvious need for recognizing general 3-D objects
warranted a serious try of the deep neural net (DNN) is least shift invariance and tolerance to deformation.
approach. It was then (incorrectly) believed that pre- Max-pooling appeared to be first proposed by Crescep-
training of DNNs using generative models of deep be- tron* [15]* [16] to enable the network to tolerate small-to-
lief net (DBN) would be the cure for the main difficul- large deformation in a hierarchical way while using con-
ties of neural nets encountered during 1990's.* [26] How- volution. Max-pooling helps, but still does not fully guar-
ever, soon after the research along this direction started antee, shift-invariance at the pixel level.* [17]
at Microsoft Research, it was discovered that when large With the advent of the back-propagation algorithm in
amounts of training data are used and especially when the 1970s, many researchers tried to train supervised
DNNs are designed correspondingly with large, context- deep artificial neural networks from scratch, initially
dependent output layers, dramatic error reduction oc- with little success. Sepp Hochreiter's diploma thesis of
16 CHAPTER 2. DEEP LEARNING

1991* [42]* [43] formally identified the reason for this all other machine learning techniques on the old, famous
failure in the “vanishing gradient problem,”which not MNIST handwritten digits problem of Yann LeCun and
only affects many-layered feedforward networks, but also colleagues at NYU.
recurrent neural networks. The latter are trained by un- As of 2011, the state of the art in deep learning feedfor-
folding them into very deep feedforward networks, where ward networks alternates convolutional layers and max-
a new layer is created for each time step of an input se- pooling layers,* [53]* [54] topped by several pure clas-
quence processed by the network. As errors propagate sification layers. Training is usually done without any
from layer to layer, they shrink exponentially with the unsupervised pre-training. Since 2011, GPU-based im-
number of layers.
plementations* [53] of this approach won many pattern
To overcome this problem, several methods were pro- recognition contests, including the IJCNN 2011 Traf-
posed. One is Jürgen Schmidhuber's multi-level hi- fic Sign Recognition Competition,* [55] the ISBI 2012
erarchy of networks (1992) pre-trained one level at a Segmentation of neuronal structures in EM stacks chal-
time through unsupervised learning, fine-tuned through lenge,* [56] and others.
backpropagation.* [29] Here each level learns a com- Such supervised deep learning methods also were the
pressed representation of the observations that is fed to first artificial pattern recognizers to achieve human-
the next level. competitive performance on certain tasks.* [57]
Another method is the long short term memory (LSTM) To break the barriers of weak AI represented by deep
network of 1997 by Hochreiter & Schmidhuber.* [44] learning, it is necessary to go beyond the deep learning
In 2009, deep multidimensional LSTM networks won architectures because biological brains use both shallow
three ICDAR 2009 competitions in connected handwrit- and deep circuits as reported by brain anatomy* [58] in
ing recognition, without any prior knowledge about the order to deal with the wide variety of invariance that the
three different languages to be learned.* [45]* [46]
brain displays. Weng* [59] argued that the brain self-
Sven Behnke relied only on the sign of the gradient wires largely according to signal statistics and, therefore,
(Rprop) when training his Neural Abstraction Pyra- a serial cascade cannot catch all major statistical depen-
mid* [47] to solve problems like image reconstruction and dencies. Fully guaranteed shift invariance for ANNs to
face localization. deal with small and large natural objects in large clut-
Other methods also use unsupervised pre-training to tered scenes became true when the invariance went be-
structure a neural network, making it first learn generally yond shift, to extend to all ANN-learned concepts, such
useful feature detectors. Then the network is trained fur- as location, type (object class label), scale, lighting, in
ther by supervised back-propagation to classify labeled the Developmental Networks (DNs)* [60] whose embod-
data. The deep model of Hinton et al. (2006) involves iments are Where-What Networks, WWN-1 (2008)* [61]
learning the distribution of a high level representation us- through WWN-7 (2013).* [62]
ing successive layers of binary or real-valued latent vari-
ables. It uses a restricted Boltzmann machine (Smolen-
sky, 1986* [48]) to model each new layer of higher level 2.4 Deep learning architectures
features. Each new layer guarantees an increase on the
lower-bound of the log likelihood of the data, thus im- There are huge number of different variants of deep ar-
proving the model, if trained properly. Once sufficiently chitectures; however, most of them are branched from
many layers have been learned the deep architecture may some original parent architectures. It is not always pos-
be used as a generative model by reproducing the data sible to compare the performance of multiple architec-
when sampling down the model (an “ancestral pass”) tures all together, since they are not all implemented on
from the top level feature activations.* [49] Hinton reports the same data set. Deep learning is a fast-growing field
that his models are effective feature extractors over high- so new architectures, variants, or algorithms may appear
dimensional, structured data.* [50] every few weeks.
The Google Brain team led by Andrew Ng and Jeff Dean
created a neural network that learned to recognize higher-
2.4.1 Deep neural networks
level concepts, such as cats, only from watching unlabeled
images taken from YouTube videos.* [51] * [52]
A deep neural network (DNN) is an artificial neural
Other methods rely on the sheer processing power of network with multiple hidden layers of units between
modern computers, in particular, GPUs. In 2010 it was the input and output layers.* [2]* [4] Similar to shallow
shown by Dan Ciresan and colleagues* [38] in Jürgen ANNs, DNNs can model complex non-linear relation-
Schmidhuber's group at the Swiss AI Lab IDSIA that ships. DNN architectures, e.g., for object detection and
despite the above-mentioned “vanishing gradient prob- parsing generate compositional models where the ob-
lem,”the superior processing power of GPUs makes plain ject is expressed as layered composition of image prim-
back-propagation feasible for deep feedforward neural itives.* [63] The extra layers enable composition of fea-
networks with many layers. The method outperformed tures from lower layers, giving the potential of modeling
2.4. DEEP LEARNING ARCHITECTURES 17

complex data with fewer units than a similarly performing to better local optima in comparison with other training
shallow network.* [2] methods. However, these methods can be computation-
DNNs are typically designed as feedforward networks, ally expensive, especially when being used to train DNNs.
but recent research has successfully applied the deep There are many training parameters to be considered with
learning architecture to recurrent neural networks for ap- a DNN, such as the size (number of layers and number
plications such as language modeling.* [64] Convolutional of units per layer), the learning rate and initial weights.
deep neural networks (CNNs) are used in computer vi- Sweeping through the parameter space for optimal pa-
sion where their success is well-documented.* [65] More rameters may not be feasible due to the cost in time and
computational resources. Various 'tricks' such as using
recently, CNNs have been applied to acoustic modeling
for automatic speech recognition (ASR), where they have mini-batching (computing the gradient on several train-
ing examples at once rather than individual examples)
shown success over previous models.* [34] For simplicity, *
a look at training DNNs is given here. [69] have been shown to speed up computation. The
large processing throughput of GPUs has produced sig-
A DNN can be discriminatively trained with the standard niﬁcant speedups in training, due to the matrix and vector
backpropagation algorithm. The weight updates can be computations required being well suited for GPUs.* [4]
done via stochastic gradient descent using the following
equation:
2.4.3 Deep belief networks
∂C Main article: Deep belief network
∆wij (t + 1) = ∆wij (t) + η
∂wij A deep belief network (DBN) is a probabilistic,

Here, η is the learning rate, and C is the cost func-

tion. The choice of the cost function depends on fac-
Hidden units
tors such as the learning type (supervised, unsuper-
vised, reinforcement, etc.) and the activation function. Visible units
For example, when performing supervised learning on h3
a multiclass classification problem, common choices for
the activation function and cost function are the softmax v3
function and cross entropy function, respectively. The
exp(xj )
softmax function is defined as pj = ∑ exp(x k)
where pj h4
k
represents the class probability and xj and xk represent
the total input to units j and v1
∑ k respectively. Cross en-
tropy is defined as C = − j dj log(pj ) where dj rep-
resents the target probability for output unit j and pj is h1
the probability output for j after applying the activation
function.* [66]
v2
h2
2.4.2 Issues with deep neural networks
As with ANNs, many issues can arise with DNNs if they
are naively trained. Two common issues are overfitting A restricted Boltzmann machine (RBM) with fully connected visi-
and computation time. ble and hidden units. Note there are no hidden-hidden or visible-
visible connections.
DNNs are prone to overfitting because of the added lay-
ers of abstraction, which allow them to model rare de- generative model made up of multiple layers of hidden
pendencies in the training data. Regularization meth- units. It can be looked at as a composition of simple learn-
ods such as weight decay ( ℓ2 -regularization) or sparsity ing modules that make up each layer.* [70]
( ℓ1 -regularization) can be applied during training to A DBN can be used for generatively pre-training a DNN
help combat overfitting.* [67] A more recent regulariza- by using the learned weights as the initial weights. Back-
tion method applied to DNNs is dropout regularization. propagation or other discriminative algorithms can then
In dropout, some number of units are randomly omit- be applied for fine-tuning of these weights. This is par-
ted from the hidden layers during training. This helps to
ticularly helpful in situations where limited training data
break the rare dependencies that can occur in the training
is available, as poorly initialized weights can have signifi-
data * [68] cant impact on the performance of the final model. These
Backpropagation and gradient descent have been the pre- pre-trained weights are in a region of the weight space that
ferred method for training these structures due to the is closer to the optimal weights (as compared to just ran-
ease of implementation and their tendency to converge dom initialization). This allows for both improved mod-
18 CHAPTER 2. DEEP LEARNING

eling capability and faster convergence of the fine-tuning a training vector and values for the units in the already-
phase.* [71] trained RBM layers are assigned using the current weights
A DBN can be efficiently trained in an unsupervised, and biases. The final layer of the already-trained layers is
layer-by-layer manner where the layers are typically made used as input to the new RBM. The new RBM is then
of restricted Boltzmann machines (RBM). A description trained with the procedure above, and then this whole
of training a DBN via RBMs is provided below. An RBM process can be *
repeated until some desired stopping cri-
is an undirected, generative energy-based model with an terion is met. [2]
input layer and single hidden layer. Connections only ex-
Despite the approximation of CD to maximum likelihood
ist between the visible units of the input layer and the hid-
being very crude (CD has been shown to not follow the
den units of the hidden layer; there are no visible-visible
gradient of any function), empirical results have shown
or hidden-hidden connections. it to be an effective method for use with training deep
*
The training method for RBMs was initially proposed architectures. [69]
by Geoffrey Hinton for use with training “Product
of Expert”models and is known as contrastive diver-
2.4.4 Convolutional neural networks
gence (CD).* [72] CD provides an approximation to the
maximum likelihood method that would ideally be ap-
Main article: Convolutional neural network
plied for learning the weights of the RBM.* [69]* [73]
In training a single RBM, weight updates are performed
A CNN is composed of one or more convolutional layers
with gradient ascent via the following equation: ∆wij (t+
with fully connected layers (matching those in typical ar-
1) = wij (t) + η ∂ log(p(v))
∂wij . Here, p(v) is the prob- tificial neural networks) on top. It also uses tied weights
ability of
∑ −E(v,h)a visible vector, which is given by p(v) = and pooling layers. This architecture allows CNNs to
1
Z h e . Z is the partition function (used for take advantage of the 2D structure of input data. In
normalizing) and E(v, h) is the energy function assigned comparison with other deep architectures, convolutional
to the state of the network. A lower energy indicates the neural networks are starting to show superior results in
network is in a more“desirable”configuration. The gradi- both image and speech applications. They can also be
ent ∂ log(p(v))
∂wij has the simple form ⟨vi hj ⟩data −⟨vi hj ⟩model trained with standard backpropagation. CNNs are eas-
where ⟨· · · ⟩p represent averages with respect to distribu- ier to train than other regular, deep, feed-forward neural
tion p . The issue arises in sampling ⟨vi hj ⟩model as this networks and have many fewer parameters to estimate,
requires running alternating Gibbs sampling for a long making them a highly attractive architecture to use.* [74]
time. CD replaces this step by running alternating Gibbs
sampling for n steps (values of n = 1 have empirically
been shown to perform well). After n steps, the data is 2.4.5 Convolutional Deep Belief Networks
sampled and that sample is used in place of ⟨vi hj ⟩model .
The CD procedure works as follows:* [69] A recent achievement in deep learning is from the use of
convolutional deep belief networks (CDBN). A CDBN
is very similar to normal Convolutional neural network
1. Initialize the visible units to a training vector.
in terms of its structure. Therefore, like CNNs they are
2. Update the hidden units in parallel given also able to exploit the 2D structure of images combined
∑ the visible with the advantage gained by pre-training in Deep belief
units: p(hj = 1 | V) = σ(bj + i vi wij ) . σ
represents the sigmoid function and bj is the bias of network. They provide a generic structure which can be
hj . used in many image and signal processing tasks and can
be trained in a way similar to that for Deep Belief Net-
3. Update the visible units in parallel∑given the hidden works. Recently, many benchmark results on standard
units: p(vi = 1 | H) = σ(ai + j hj wij ) . ai is image datasets like CIFAR * [75] have been obtained us-
the bias of vi . This is called the “reconstruction” ing CDBNs.* [76]
step.

4. Reupdate the hidden units in parallel given the re- 2.4.6 Deep Boltzmann Machines
constructed visible units using the same equation as
in step 2. A Deep Boltzmann Machine (DBM) is a type of bi-
nary pairwise Markov random ﬁeld (undirected proba-
5. Perform the weight update: ∆wij ∝ ⟨vi hj ⟩data − bilistic graphical models) with multiple layers of hidden
⟨vi hj ⟩reconstruction . random variables. It is a network of symmetrically cou-
pled stochastic binary units. It comprises a set of visible
Once an RBM is trained, another RBM can be“stacked” units ν ∈ {0, 1}D , and a series of layers of hidden units
atop of it to create a multilayer model. Each time another h(1) ∈ {0, 1}F1 , h(2) ∈ {0, 1}F2 , . . . , h(L) ∈ {0, 1}FL
RBM is stacked, the input visible layer is initialized to . There is no connection between the units of the same
2.4. DEEP LEARNING ARCHITECTURES 19

layer (like RBM). For the DBM, we can write the proba- y, where θ = {W , b} , W is the weight matrix and b is
bility which is assigned to vector ν as: an offset vector (bias). On the contrary a decoder maps
∑ ∑ (1) (1) ∑ (2) (1) (2) ∑ (3) back the hidden representation y to the reconstructed in-
(2) (3)
p(ν) = Z1 h e ij Wij νi hj + jl Wjl hj hl + lm Wlm hput l hm ,
z via gθ . The whole process of auto encoding is to
where h = {h(1) , h(2) , h(3) } are the set of hidden units, compare this reconstructed input to the original and try
and θ = {W (1) , W (2) , W (3) } are the model parame- to minimize this error to make the reconstructed value as
ters, representing visible-hidden and hidden-hidden sym- close as possible to the original.
metric interaction, since they are undirected links. As it In stacked denoising auto encoders, the partially corrupted
is clear by setting W (2) = 0 and W (3) = 0 the net- output is cleaned (denoised). This fact has been intro-
work becomes the well-known Restricted Boltzmann ma- duced in * [81] with a specific approach to good represen-
chine.* [77] tation, a good representation is one that can be obtained
There are several reasons which motivate us to take ad- robustly from a corrupted input and that will be useful for
vantage of deep Boltzmann machine architectures. Like recovering the corresponding clean input. Implicit in this
DBNs, they benefit from the ability of learning complex definition are the ideas of
and abstract internal representations of the input in tasks
such as object or speech recognition, with the use of lim- • The higher level representations are relatively stable
ited number of labeled data to fine-tune the representa- and robust to the corruption of the input;
tions built based on a large supply of unlabeled sensory in-
• It is required to extract features that are useful for
put data. However, unlike DBNs and deep convolutional
representation of the input distribution.
neural networks, they adopt the inference and training
procedure in both directions, bottom-up and top-down
pass, which enable the DBMs to better unveil the rep- The algorithm consists of multiple steps; starts by a
resentations of the ambiguous and complex input struc- stochastic mapping of x to x̃ through qD (x̃|x) , this is
tures,* [78] .* [79] the corrupting step. Then the corrupted input x̃ passes
through a basic auto encoder process and is mapped to a
Since the exact maximum likelihood learning is intractable hidden representation y = f (x̃) = s(W x̃ + b) . From
θ
for the DBMs, we may perform the approximate max- this hidden representation we can reconstruct z = g (y)
θ
imum likelihood learning. There is another possibility, . In the last stage a minimization algorithm is done in or-
to use mean-field inference to estimate data-dependent der to have a z as close as possible to uncorrupted input x
expectations, incorporation with a Markov chain Monte . The reconstruction error L (x, z) might be either the
H
Carlo (MCMC) based stochastic approximation technique cross-entropy loss with an affine-sigmoid decoder, or the
to approximate the expected sufficient statistics of the squared error loss with an affine decoder.* [81]
model.* [77]
In order to make a deep architecture, auto encoders stack
We can see the difference between DBNs and DBM. In one on top of another. Once the encoding function f
θ
DBNs, the top two layers form a restricted Boltzmann of the first denoising auto encoder is learned and used to
machine which is an undirected graphical model, but the uncorrupt the input (corrupted input), we can train the
lower layers form a directed generative model. second level.* [81]
Apart from all the advantages of DBMs discussed so far, Once the stacked auto encoder is trained, its output might
they have a crucial disadvantage which limits the per- be used as the input to a supervised learning algorithm
formance and functionality of this kind of architecture. such as support vector machine classifier or a multiclass
The approximate inference, which is based on mean- logistic regression.* [81]
field method, is about 25 to 50 times slower than a sin-
gle bottom-up pass in DBNs. This time consuming task
make the joint optimization, quite impractical for large 2.4.8 Deep Stacking Networks
data sets, and seriously restricts the use of DBMs in tasks
such as feature representations (the mean-field inference One of the deep architectures recently introduced in* [82]
have to be performed for each new test input).* [80] which is based on building hierarchies with blocks of
simplified neural network modules, is called deep con-
vex network. They are called “convex”because of the
2.4.7 Stacked (Denoising) Auto-Encoders formulation of the weights learning problem, which is a
convex optimization problem with a closed-form solu-
The auto encoder idea is motivated by the concept of good tion. The network is also called the deep stacking network
representation. For instance for the case of classifier it is (DSN),* [83] emphasizing on this fact that a similar mech-
possible to define that a good representation is one that will anism as the stacked generalization is used.* [84]
yield a better performing classifier. The DSN blocks, each consisting of a simple, easy-to-
An encoder is referred to a deterministic mapping fθ that learn module, are stacked to form the overall deep net-
transforms an input vector x into hidden representation work. It can be trained block-wise in a supervised fash-
20 CHAPTER 2. DEEP LEARNING

ion without the need for back-propagation for the entire scale up the design to larger (deeper) architectures and
blocks.* [85] data sets.
As designed in * [82] each block consists of a simplified The basic architecture is suitable for diverse tasks such as
MLP with a single hidden layer. It comprises a weight classification and regression.
matrix U as the connection between the logistic sigmoidal
units of the hidden layer h to the linear output layer y,
and a weight matrix W which connects each input of the 2.4.10 Spike-and-Slab RBMs (ssRBMs)
blocks to their respective hidden layers. If we assume
that the target vectors t be arranged to form the columns The need for real-valued inputs which are employed in
of T (the target matrix), let the input data vectors x be Gaussian RBMs (GRBMs), motivates scientists seeking
arranged to form the columns of X, let H = σ(W T X) new methods. One of these methods is the spike and slab
denote the matrix of hidden units, and assume the lower- RBM (ssRBMs), which models continuous-valued inputs
layer weights W are known (training layer-by-layer). The with strictly binary latent variables.* [90]
function performs the element-wise logistic sigmoid op-
Similar to basic RBMs and its variants, the spike and
eration. Then learning the upper-layer weight matrix U
slab RBM is a bipartite graph. Like GRBM, the visi-
given other weights in the network can be formulated as
ble units (input) are real-valued. The difference arises
a convex optimization problem:
in the hidden layer, where each hidden unit come along
with a binary spike variable and real-valued slab variable.
These terms (spike and slab) come from the statistics lit-
min f = ||U T H − T ||2F , erature,* [91] and refer to a prior including a mixture of
UT
two components. One is a discrete probability mass at
which has a closed-form solution. The input to the first zero called spike, and the other is a density over continu-
* *
block X only contains the original data, however in the ous domain. [92] [92]
upper blocks in addition to this original (raw) data there There is also an extension of the ssRBM model, which
is a copy of the lower-block(s) output y. is called µ-ssRBM. This variant provides extra modeling
In each block an estimate of the same final label class y capacity to the architecture using additional terms in the
is produced, then this estimated label concatenated with energy function. One of these terms enable model to form
original input to form the expanded input for the upper a conditional distribution of the spike variables by means
block. In contrast with other deep architectures, such as of marginalizing out the slab variables given an observa-
DBNs, the goal is not to discover the transformed feature tion.
representation. Regarding the structure of the hierarchy
of this kind of architecture, it makes the parallel training
straightforward as the problem is naturally a batch-mode 2.4.11 Compound Hierarchical-Deep
optimization one. In purely discriminative tasks DSN Models
performance is better than the conventional DBN.* [83]
The class architectures called compound HD models,
where HD stands for Hierarchical-Deep are structured
2.4.9 Tensor Deep Stacking Networks (T- as a composition of non-parametric Bayesian models
DSN) with deep networks. The features, learned by deep
architectures such as DBNs,* [93] DBMs,* [78] deep
This architecture is an extension of the DSN. It improves auto encoders,* [94] convolutional variants,* [95]* [96]
the DSN in two important ways, using the higher order ssRBMs,* [92] deep coding network,* [97] DBNs with
information by means of covariance statistics and trans- sparse feature learning,* [98] recursive neural net-
forming the non-convex problem of the lower-layer to a works,* [99] conditional DBNs,* [100] denoising auto
convex sub-problem of the upper-layer.* [86] encoders,* [101] are able to provide better representation
for more rapid and accurate classification tasks with
Unlike the DSN, the covariance statistics of the data is high-dimensional training data sets. However, they are
employed using a bilinear mapping from two distinct sets not quite powerful in learning novel classes with few
of hidden units in the same layer to predictions via a third- examples, themselves. In these architectures, all units
order tensor. through the network are involved in the representation of
The scalability and parallelization are the two important the input (distributed representations), and they have to
factors in the learning algorithms which are not consid- be adjusted together (high degree of freedom). However,
ered seriously in the conventional DNNs.* [87]* [88]* [89] if we limit the degree of freedom, we make it easier for
All the learning process for the DSN (and TDSN as the model to learn new classes out of few training sam-
well) is done on a batch-mode basis so as to make the ples (less parameters to learn). Hierarchical Bayesian
parallelization possible on a cluster of CPU or GPU (HB) models, provide learning from few examples, for
nodes.* [82]* [83] Parallelization gives the opportunity to example * [102]* [103]* [104]* [105]* [106] for computer
2.5. APPLICATIONS 21

vision, statistics, and cognitive science. It is also possible to extend the DPCN to form a
*
Compound HD architectures try to integrate both charac- convolutional network. [108]
teristics of HB and deep networks. The compound HDP-
DBM architecture, a hierarchical Dirichlet process (HDP) 2.4.13 Deep Kernel Machines
as a hierarchical model, incorporated with DBM archi-
tecture. It is a full generative model, generalized from The Multilayer Kernel Machine (MKM) as introduced in
abstract concepts ﬂowing through the layers of the model, * [109] is a way of learning highly nonlinear functions
which is able to synthesize new examples in novel classes with the iterative applications of weakly nonlinear ker-
that look reasonably natural. Note that all the levels nels. They use the kernel principal component analy-
are learned jointly by maximizing a joint log-probability sis (KPCA), in,* [110] as method for unsupervised greedy
score.* [107] layer-wise pre-training step of the deep learning architec-
Consider a DBM with three hidden layers, the probability ture.
of a visible input ν is: Layer l + 1 -th learns the representation of the previous
∑ ∑ W (1) νi h1 +∑ W (2) h1 h2 +∑ W (3) h2layer
lm l hm , l , extracting the nl principal component (PC) of the
3
1
p(ν, ψ) = Z h e ij ij j jl jl j l lm

projection layer l output in the feature domain induced

where h = {h(1) , h(2) , h(3) } are the set of hidden units, by the kernel. For the sake of dimensionality reduction
and ψ = {W (1) , W (2) , W (3) } are the model parame- of the updated representation in each layer, a supervised
ters, representing visible-hidden and hidden-hidden sym- strategy is proposed to select the best informative features
metric interaction terms. among the ones extracted by KPCA. The process is:
After a DBM model has been learned, we have an
undirected model that defines the joint distribution • ranking the nl features according to their mutual in-
P (ν, h1 , h2 , h3 ) . One way to express what has been formation with the class labels;
learned is the conditional model P (ν, h1 , h2 |h3 ) and a
• for different values of K and ml ∈ {1, . . . , nl } ,
prior term P (h3 ) .
compute the classification error rate of a K-nearest
The part P (ν, h1 , h2 |h3 ) , represents a conditional DBM neighbor (K-NN) classifier using only the ml most
model, which can be viewed as a two-layer DBM but with informative features on a validation set;
bias terms given by the states of h3 :
∑ (1) ∑ (2) ∑ • the
(3)
value of ml with which the classifier has reached
Wij νi h1j + Wjl h1j h2l + Wlm h2l h3m
P (ν, h1 , h2 |h3 ) = 1
Z(ψ,h3 ) e
ij jl lm
the lowest . error rate determines the number of fea-
tures to retain.

2.4.12 Deep Coding Networks There are some drawbacks in using the KPCA method as
the building cells of an MKM.
There are several advantages to having a model which can Another, more straightforward method of integrating ker-
actively update itself to the context in data. One of these nel machine into the deep learning architecture was de-
methods arises from the idea to have a model which is veloped by Microsoft researchers for spoken language un-
able to adjust its prior knowledge dynamically according derstanding applications.* [111] The main idea is to use a
to the context of the data. Deep coding network (DPCN) kernel machine to approximate a shallow neural net with
is a predictive coding scheme where top-down informa- an inﬁnite number of hidden units, and then to use the
tion is used to empirically adjust the priors needed for stacking technique to splice the output of the kernel ma-
the bottom-up inference procedure by means of a deep chine and the raw input in building the next, higher level
locally-connected generative model. This is based on ex- of the kernel machine. The number of the levels in this
tracting sparse features out of time-varying observations kernel version of the deep convex network is a hyper-
using a linear dynamical model. Then, a pooling strategy parameter of the overall system determined by cross val-
is employed in order to learn invariant feature represen- idation.
tations. Similar to other deep architectures, these blocks
are the building elements of a deeper architecture where
greedy layer-wise unsupervised learning are used. Note 2.4.14 Deep Q-Networks
that the layers constitute a kind of Markov chain such that
the states at any layer are only dependent on the succeed-
This is the latest class of deep learning models targeted
ing and preceding layers. for reinforcement learning, published in February 2015
*
Deep predictive coding network (DPCN) [108] predicts in Nature [112]
*

the representation of the layer, by means of a top-down

approach using the information in upper layer and also
temporal dependencies from the previous states, it is 2.5 Applications
called
22 CHAPTER 2. DEEP LEARNING

2.5.1 Automatic speech recognition els; 5) Multi-task and transfer learning by DNNs and
related deep models; 6) Convolution neural networks
The results shown in the table below are for automatic and how to design them to best exploit domain knowl-
speech recognition on the popular TIMIT data set. This edge of speech; 7) Recurrent neural network and its
is a common data set used for initial evaluations of deep rich LSTM variants; 8) Other types of deep models in-
learning architectures. The entire set contains 630 speak- cluding tensor-based models and integrated deep genera-
ers from eight major dialects of American English, with tive/discriminative models.
each speaker reading 10 different sentences.* [113] Its Large-scale automatic speech recognition is the first and
small size allows many different configurations to be the most convincing successful case of deep learning in
tried effectively with it. More importantly, the TIMIT the recent history, embraced by both industry and aca-
task concerns phone-sequence recognition, which, unlike demic across the board. Between 2010 and 2014, the
word-sequence recognition, permits very weak“language two major conferences on signal processing and speech
models”and thus the weaknesses in acoustic modeling as- recognition, IEEE-ICASSP and Interspeech, have seen
pects of speech recognition can be more easily analyzed. near exponential growth in the numbers of accepted pa-
It was such analysis on TIMIT contrasting the GMM (and pers in their respective annual conference papers on the
other generative models of speech) vs. DNN models topic of deep learning for speech recognition. More im-
carried out by Li Deng and collaborators around 2009- portantly, all major commercial speech recognition sys-
2010 that stimulated early industrial investment on deep tems (e.g., Microsoft Cortana, Xbox, Skype Translator,
learning technology for speech recognition from small Google Now, Apple Siri, Baidu and iFlyTek voice search,
to large scales,* [25]* [36] eventually leading to pervasive and a range of Nuance speech products, etc.) nowadays
and dominant uses of deep learning in speech recogni- are based on deep learning methods.* [1]* [121]* [122]
tion industry. That analysis was carried out with compa- See also the recent media interview with the CTO of Nu-
rable performance (less than 1.5% in error rate) between ance Communications.* [123]
discriminative DNNs and generative models. The error
The wide-spreading success in speech recognition
rates presented below, including these early results and
achieved by 2011 was followed shortly by large-scale im-
measured as percent phone error rates (PER), have been
age recognition described next.
summarized over a time span of the past 20 years:
Extension of the success of deep learning from TIMIT
to large vocabulary speech recognition occurred in 2010 2.5.2 Image recognition
by industrial researchers, where large output layers of
the DNN based on context dependent HMM states con- A common evaluation set for image classification is the
structed by decision trees were adopted.* [116]* [117] See MNIST database data set. MNIST is composed of hand-
comprehensive reviews of this development and of the written digits and includes 60000 training examples and
state of the art as of October 2014 in the recent Springer 10000 test examples. Similar to TIMIT, its small size al-
book from Microsoft Research.* [37] See also the related lows multiple configurations to be tested. A comprehen-
background of automatic speech recognition and the im- sive list of results on this set can be found in.* [124] The
pact of various machine learning paradigms including no- current best result on MNIST is an error rate of 0.23%,
tably deep learning in a recent overview article.* [118] achieved by Ciresan et al. in 2012.* [125]
One fundamental principle of deep learning is to do away The real impact of deep learning in image or object
with hand-crafted feature engineering and to use raw fea- recognition, one major branch of computer vision, was
felt in the fall of 2012 after the team of Geoff Hinton and
tures. This principle was first explored successfully in the
architecture of deep autoencoder on the “raw”spec- his students won the large-scale ImageNet competition by
trogram or linear filter-bank features,* [119] showing its a significant margin over the then-state-of-the-art shallow
superiority over the Mel-Cepstral features which contain machine learning methods. The technology is based on
a few stages of fixed transformation from spectrograms. 20-year-old deep convolutional nets, but with much larger
The true “raw”features of speech, waveforms, have scale on a much larger task, since it had been learned
more recently been shown to produce excellent larger- that deep learning works quite well on large-scale speech
scale speech recognition results.* [120] recognition. In 2013 and 2014, the error rate on the Im-
Since the initial successful debut of DNNs for speech ageNet task using deep learning was further reduced at a
recognition around 2009-2011, there has been huge rapid pace, following a similar trend in large-scale speech
progress made. This progress (as well as future direc- recognition.
tions) has been summarized into the following eight ma- As in the ambitious moves from automatic speech recog-
jor areas:* [1]* [27]* [37] 1) Scaling up/out and speedup nition toward automatic speech translation and under-
DNN training and decoding; 2) Sequence discriminative standing, image classification has recently been extended
training of DNNs; 3) Feature processing by deep mod- to the more ambitious and challenging task of automatic
els with solid understanding of the underlying mecha- image captioning, in which deep learning is the essential
nisms; 4) Adaptation of DNNs and of related deep mod- underlying technology. * [126] * [127] * [128] * [129]
2.6. DEEP LEARNING IN THE HUMAN BRAIN 23

One example application is a car computer said to be illustrating suitability of the method for CRM automa-
trained with deep learning, which may be able to let cars tion. A neural network was used to approximate the value
interpret 360° camera views.* [130] of possible direct marketing actions over the customer
state space, deﬁned in terms of RFM variables. The esti-
mated value function was shown to have a natural inter-
2.5.3 Natural language processing pretation as CLV (customer lifetime value).* [151]

Neural networks have been used for implementing

language models since the early 2000s.* [131] Key tech-
niques in this field are negative sampling* [132] and word
2.6 Deep learning in the human
embedding. A word embedding, such as word2vec, can brain
be thought of as a representational layer in a deep learn-
ing architecture transforming an atomic word into a po- Computational deep learning is closely related to a class
sitional representation of the word relative to other words of theories of brain development (specifically, neocor-
in the dataset; the position is represented as a point in a tical development) proposed by cognitive neuroscien-
vector space. Using a word embedding as an input layer tists in the early 1990s.* [152] An approachable sum-
to a recursive neural network (RNN) allows for the train- mary of this work is Elman, et al.'s 1996 book “Re-
ing of the network to parse sentences and phrases using an thinking Innateness”* [153] (see also: Shrager and John-
effective compositional vector grammar. A compositional son;* [154] Quartz and Sejnowski * [155]). As these de-
vector grammar can be thought of as probabilistic con- velopmental theories were also instantiated in computa-
text free grammar (PCFG) implemented by a recursive tional models, they are technical predecessors of purely
neural network.* [133] Recursive autoencoders built atop computationally-motivated deep learning models. These
word embeddings have been trained to assess sentence developmental models share the interesting property that
similarity and detect paraphrasing.* [133] Deep neural ar- various proposed learning dynamics in the brain (e.g.,
chitectures have achieved state-of-the-art results in many a wave of nerve growth factor) conspire to support the
tasks in natural language processing, such as constituency self-organization of just the sort of inter-related neural
parsing,* [134] sentiment analysis,* [135] information re- networks utilized in the later, purely computational deep
trieval,* [136] * [137] machine translation, * [138] * [139] learning models; and such computational neural networks
contextual entity linking, * [140] and other areas of NLP. seem analogous to a view of the brain's neocortex as a hi-
*
[141] erarchy of filters in which each layer captures some of
the information in the operating environment, and then
passes the remainder, as well as modified base signal, to
2.5.4 Drug discovery and toxicology other layers further up the hierarchy. This process yields
a self-organizing stack of transducers, well-tuned to their
The pharmaceutical industry faces the problem that a
operating environment. As described in The New York
large percentage of candidate drugs fail to reach the
Times in 1995: "...the infant's brain seems to organize
market. These failures of chemical compounds are
itself under the influence of waves of so-called trophic-
caused by insufficient efficacy on the biomolecular tar-
factors ... different regions of the brain become con-
get (on-target effect), undetected and undesired inter-
nected sequentially, with one layer of tissue maturing be-
actions with other biomolecules (off-target effects), or
fore another and so on until the whole brain is mature.”
unanticipated toxic effects.* [142]* [143] In 2012 a team *
[156]
led by George Dahl won the“Merck Molecular Activity
Challenge”using multi-task deep neural networks to pre- The importance of deep learning with respect to the evo-
dict the biomolecular target of a compound.* [144]* [145] lution and development of human cognition did not es-
In 2014 Sepp Hochreiter's group used Deep Learning cape the attention of these researchers. One aspect of
to detect off-target and toxic effects of environmental human development that distinguishes us from our near-
chemicals in nutrients, household products and drugs and est primate neighbors may be changes in the timing of
won the “Tox21 Data Challenge”of NIH, FDA and development.* [157] Among primates, the human brain
NCATS.* [146]* [147] These impressive successes show remains relatively plastic until late in the post-natal pe-
Deep Learning may be superior to other virtual screen- riod, whereas the brains of our closest relatives are more
ing methods.* [148]* [149] Researchers from Google and completely formed by birth. Thus, humans have greater
Stanford enhanced Deep Learning for drug discovery by access to the complex experiences afforded by being
combining data from a variety of sources.* [150] out in the world during the most formative period of
brain development. This may enable us to “tune in”to
rapidly changing features of the environment that other
2.5.5 Customer relationship management animals, more constrained by evolutionary structuring of
their brains, are unable to take account of. To the ex-
Recently success has been reported with application of tent that these changes are reflected in similar timing
deep reinforcement learning in direct marketing settings, changes in hypothesized wave of cortical development,
24 CHAPTER 2. DEEP LEARNING

they may also lead to changes in the extraction of infor- Others point out that deep learning should be looked
mation from the stimulus environment during the early at as a step towards realizing strong AI, not as an all-
self-organization of the brain. Of course, along with this encompassing solution. Despite the power of deep learn-
flexibility comes an extended period of immaturity, during methods, they still lack much of the functionality
ing which we are dependent upon our caretakers and our needed for realizing this goal entirely. Research psychol-
community for both support and training. The theory of ogist Gary Marcus has noted that:
deep learning therefore sees the coevolution of culture “Realistically, deep learning is only part of the larger
and cognition as a fundamental condition of human evo- challenge of building intelligent machines. Such tech-
lution.* [158]
niques lack ways of representing causal relationships (...)
have no obvious ways of performing logical inferences,
and they are also still a long way from integrating abstract
2.7 Commercial activity knowledge, such as information about what objects are,
what they are for, and how they are typically used. The
most powerful A.I. systems, like Watson (...) use tech-
Deep learning is often presented as a step towards real-
niques like deep learning as just one element in a very
ising strong AI* [159] and thus many organizations have
complicated ensemble of techniques, ranging from the
become interested in its use for particular applications.
statistical technique of Bayesian inference to deductive
Most recently, in December 2013, Facebook announced
reasoning.”* [161]
that it hired Yann LeCun to head its new artificial intel-
ligence (AI) lab that will have operations in California, To the extent that such a viewpoint implies, without in-
London, and New York. The AI lab will be used for tending to, that deep learning will ultimately constitute
developing deep learning techniques that will help Face- nothing more than the primitive discriminatory levels
book do tasks such as automatically tagging uploaded pic- of a comprehensive future machine intelligence, a re-
tures with the names of the people in them.* [160] cent pair of speculations regarding art and artificial in-
telligence* [162] offers an alternative and more expansive
In March 2013, Geoffrey Hinton and two of his graduate
outlook. The first such speculation is that it might be
students, Alex Krizhevsky and Ilya Sutskever, were hired
possible to train a machine vision stack to perform the
by Google. Their work will be focused on both improv-
sophisticated task of discriminating between “old mas-
ing existing machine learning products at Google and also
ter”and amateur figure drawings; and the second is that
help deal with the growing amount of data Google has.
such a sensitivity might in fact represent the rudiments
Google also purchased Hinton's company, DNNresearch.
of a non-trivial machine empathy. It is suggested, more-
In 2014 Google also acquired DeepMind Technologies, a over, that such an eventuality would be in line with both
British start-up that developed a system capable of learn- anthropology, which identifies a concern with aesthetics
ing how to play Atari video games using only raw pixels as a key element of behavioral modernity, and also with
as data input. a current school of thought which suspects that the allied
Baidu hired Andrew Ng to head their new Silicon Valley phenomenon of consciousness, formerly thought of as a
based research lab focusing on deep learning. purely high-order phenomenon, may in fact have roots
deep within the structure of the universe itself.
Some currently popular and successful deep learning ar-
chitectures display certain problematical behaviors* [163]
2.8 Criticism and comment (e.g. confidently classifying random data as belonging to
a familiar category of nonrandom images;* [164] and mis-
Given the far-reaching implications of artificial intelli- classifying miniscule perturbations of correctly classified
gence coupled with the realization that deep learning is images * [165]). The creator of OpenCog, Ben Goertzel
emerging as one of its most powerful techniques, the sub- hypothesized * [163] that these behaviors are tied with
ject is understandably attracting both criticism and com- limitations in the internal representations learned by these
ment, and in some cases from outside the field of com- architectures, and that these same limitations would in-
puter science itself. hibit integration of these architectures into heterogeneous
A main criticism of deep learning concerns the lack multi-component AGI architectures. It is suggested that
of theory surrounding many of the methods. Most of these issues can be worked around by developing deep
the learning in deep architectures is just some form of learning architectures that internally form states homol-
gradient descent. While gradient descent has been under- ogous to image-grammar * [166] decompositions of ob-
stood for a while now, the theory surrounding other algo- served entities and events.* [163] Learning a grammar
rithms, such as contrastive divergence is less clear (i.e., (visual or linguistic) from training data would be equiva-
Does it converge? If so, how fast? What is it approxi- lent to restricting the system to commonsense reasoning
mating?). Deep learning methods are often looked at as which operates on concepts in terms of production rules
a black box, with most confirmations done empirically, of the grammar, and is a basic goal of both human lan-
rather than theoretically. guage acquisition * [167] and A.I. (Also see Grammar in-
2.11. REFERENCES 25

duction * [168]) 2.11 References

[1] L. Deng and D. Yu (2014) “Deep Learning: Methods
and Applications”http://research.microsoft.com/pubs/
2.9 Deep learning software li- 209355/DeepLearning-NowPublishing-Vol7-SIG-039.
braries pdf
[2] Bengio, Yoshua (2009). “Learning Deep Architectures
• Torch for AI” (PDF). Foundations and Trends in Machine
Learning 2 (1).
• Theano
[3] Y. Bengio, A. Courville, and P. Vincent.,“Representation
Learning: A Review and New Perspectives,”IEEE Trans.
• Deeplearning4j, distributed deep learning for the PAMI, special issue Learning Deep Architectures, 2013
JVM. Parallel GPUs.
[4] J. Schmidhuber,“Deep Learning in Neural Networks: An
• ND4J Overview”http://arxiv.org/abs/1404.7828, 2014
[5] Patrick Glauner (2015), Comparison of Training Methods
• NVIDIA cuDNN library of accelerated primitives
for Deep Neural Networks, arXiv:1504.06825
for deep neural networks.
[6] Song, Hyun Ah, and Soo-Young Lee.“Hierarchical Rep-
• DeepLearnToolbox, Matlab/Octave toolbox for resentation Using NMF.”Neural Information Processing.
deep learning Springer Berlin Heidelberg, 2013.
[7] Olshausen, Bruno A. “Emergence of simple-cell recep-
• convnetjs, deep learning library in Javascript. Con-
tive field properties by learning a sparse code for natural
tains online demos. images.”Nature 381.6583 (1996): 607-609.
• Gensim a toolkit for natural language processing; in- [8] Ronan Collobert (May 6, 2011). “Deep Learning for
cludes word2vec Efficient Discriminative Parsing”. videolectures.net. Ca.
7:45.
• Caffe
[9] Gomes, Lee (20 October 2014). “Machine-Learning
Maestro Michael Jordan on the Delusions of Big Data and
Other Huge Engineering Efforts”. IEEE Spectrum.
2.10 See also [10] Fukushima, K. (1980).“Neocognitron: A self-organizing
neural network model for a mechanism of pattern recog-
• Unsupervised learning nition unaffected by shift in position”. Biol. Cybern 36:
193–202. doi:10.1007/bf00344251.
• Graphical model [11] P. Werbos.,“Beyond Regression: New Tools for Predic-
tion and Analysis in the Behavioral Sciences,”PhD thesis,
• Feature learning Harvard University, 1974.

• Sparse coding [12] LeCun et al., “Backpropagation Applied to Handwritten

Zip Code Recognition,”Neural Computation, 1, pp. 541–
• Compressed Sensing 551, 1989.
[13] S. Hochreiter., "Untersuchungen zu dynamischen neu-
• Connectionism
ronalen Netzen,”Diploma thesis. Institut f. Informatik,
Technische Univ. Munich. Advisor: J. Schmidhuber,
• Self-organizing map 1991.
• Principal component analysis [14] S. Hochreiter et al., “Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies,”In S. C.
• Applications of artificial intelligence Kremer and J. F. Kolen, editors, A Field Guide to Dynam-
ical Recurrent Neural Networks. IEEE Press, 2001.
• List of artificial intelligence projects
[15] J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-
organizing neural network which grows adaptively,”Proc.
• Extreme Learning Machines
International Joint Conference on Neural Networks, Balti-
more, Maryland, vol I, pp. 576-581, June, 1992.
• Reservoir computing
[16] J. Weng, N. Ahuja and T. S. Huang, "Learning recogni-
• Liquid state machine tion and segmentation of 3-D objects from 2-D images,”
Proc. 4th International Conf. Computer Vision, Berlin,
• Echo state network Germany, pp. 121-128, May, 1993.
26 CHAPTER 2. DEEP LEARNING

[17] J. Weng, N. Ahuja and T. S. Huang, "Learning recognition [33] L. Deng, O. Abdel-Hamid, and D. Yu, A deep con-
and segmentation using the Cresceptron,”International volutional neural network using heterogeneous pooling
Journal of Computer Vision, vol. 25, no. 2, pp. 105-139, for trading acoustic invariance with phonetic confusion,
Nov. 1997. ICASSP, 2013.

[18] Morgan, Bourlard, Renals, Cohen, Franco (1993) “Hy- [34] T. Sainath et al., “Convolutional neural networks for
brid neural network/hidden Markov model systems for LVCSR,”ICASSP, 2013.
continuous speech recognition. ICASSP/IJPRAI”
[35] D. Yu, L. Deng, G. Li, and F. Seide (2011). “Discrimi-
[19] T. Robinson. (1992) A real-time recurrent error propaga- native pretraining of deep neural networks,”U.S. Patent
tion network word recognition system, ICASSP. Filing.
[20] Waibel, Hanazawa, Hinton, Shikano, Lang. (1989)
[36] NIPS Workshop: Deep Learning for Speech Recognition
“Phoneme recognition using time-delay neural networks.
and Related Applications, Whistler, BC, Canada, Dec.
IEEE Transactions on Acoustics, Speech and Signal Pro-
2009 (Organizers: Li Deng, Geoﬀ Hinton, D. Yu).
cessing.”

[21] J. Baker, Li Deng, Jim Glass, S. Khudanpur, C.-H. Lee, [37] Yu, D.; Deng, L. (2014). “Automatic Speech Recogni-
N. Morgan, and D. O'Shaughnessy (2009). “Research tion: A Deep Learning Approach (Publisher: Springer)".
Developments and Directions in Speech Recognition and
Understanding, Part 1,”IEEE Signal Processing Magazine, [38] D. C. Ciresan et al., “Deep Big Simple Neural Nets for
vol. 26, no. 3, pp. 75-80, 2009. Handwritten Digit Recognition,”Neural Computation, 22,
pp. 3207–3220, 2010.
[22] Y. Bengio (1991). “Artiﬁcial Neural Networks and their
Application to Speech/Sequence Recognition,”Ph.D. the- [39] R. Raina, A. Madhavan, A. Ng.,“Large-scale Deep Unsu-
sis, McGill University, Canada. pervised Learning using Graphics Processors,”Proc. 26th
Int. Conf. on Machine Learning, 2009.
[23] L. Deng, K. Hassanein, M. Elmasry. (1994) “Analysis
of correlation structure for a neural predictive model with [40] Riesenhuber, M; Poggio, T. “Hierarchical models of ob-
applications to speech recognition,”Neural Networks, vol. ject recognition in cortex”. Nature Neuroscience 1999
7, No. 2., pp. 331-339. (11): 1019–1025.

[24] Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, [41] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.
A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, Howard, W. Hubbard, L. D. Jackel. Backpropagation Ap-
P.; Sainath, T.; Kingsbury, B. (2012). “Deep Neu- plied to Handwritten Zip Code Recognition. Neural Com-
ral Networks for Acoustic Modeling in Speech Recog- putation, 1(4):541–551, 1989.
nition --- The shared views of four research groups”
. IEEE Signal Processing Magazine 29 (6): 82–97. [42] S. Hochreiter. Untersuchungen zu dynamischen neu-
doi:10.1109/msp.2012.2205597. ronalen Netzen. Diploma thesis, Institut f. Informatik,
Technische Univ. Munich, 1991. Advisor: J. Schmidhu-
[25] Deng, L.; Hinton, G.; Kingsbury, B. (2013). “New types ber
of deep neural network learning for speech recognition
and related applications: An overview (ICASSP)". [43] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmid-
[26] Keynote talk: Recent Developments in Deep Neural Net- huber. Gradient flow in recurrent nets: the difficulty of
works. ICASSP, 2013 (by Geoff Hinton). learning long-term dependencies. In S. C. Kremer and J.
F. Kolen, editors, A Field Guide to Dynamical Recurrent
[27] Keynote talk: “Achievements and Challenges of Deep Neural Networks. IEEE Press, 2001.
Learning - From Speech Analysis and Recognition
To Language and Multimodal Processing,”Interspeech, [44] Hochreiter, Sepp; and Schmidhuber, Jürgen; Long Short-
September 2014. Term Memory, Neural Computation, 9(8):1735–1780,
1997
[28] G. E. Hinton., “Learning multiple layers of represen-
tation,”Trends in Cognitive Sciences, 11, pp. 428–434, [45] Graves, Alex; and Schmidhuber, Jürgen; Offline Hand-
2007. writing Recognition with Multidimensional Recurrent Neu-
ral Networks, in Bengio, Yoshua; Schuurmans, Dale; Laf-
[29] J. Schmidhuber.,“Learning complex, extended sequences ferty, John; Williams, Chris K. I.; and Culotta, Aron
using the principle of history compression,”Neural Com- (eds.), Advances in Neural Information Processing Systems
putation, 4, pp. 234–242, 1992. 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC,
Neural Information Processing Systems (NIPS) Founda-
[30] J. Schmidhuber., “My First Deep Learning System of
tion, 2009, pp. 545–552
1991 + Deep Learning Timeline 1962–2013.”

[31] http://research.microsoft.com/apps/pubs/default.aspx? [46] Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.;
id=189004 Bunke, H.; Schmidhuber, J.“A Novel Connectionist Sys-
tem for Improved Unconstrained Handwriting Recogni-
[32] L. Deng et al. Recent Advances in Deep Learning for tion”. IEEE Transactions on Pattern Analysis and Ma-
Speech Research at Microsoft, ICASSP, 2013. chine Intelligence 31 (5): 2009.
2.11. REFERENCES 27

[47] Sven Behnke (2003). Hierarchical Neural Networks for [62] X. Wu, G. Guo, and J. Weng, "Skull-closed Autonomous
Image Interpretation. (PDF). Lecture Notes in Computer Development: WWN-7 Dealing with Scales,”Proc. In-
Science 2766. Springer. ternational Conference on Brain-Mind, July 27–28, East
Lansing, Michigan, pp. +1-9, 2013.
[48] Smolensky, P. (1986). Information processing in dynam-
ical systems: Foundations of harmony theory. In D. E. [63] Szegedy, Christian, Alexander Toshev, and Dumitru Er-
Rumelhart, J. L. McClelland, & the PDP Research Group, han. “Deep neural networks for object detection.”Ad-
Parallel Distributed Processing: Explorations in the Mi- vances in Neural Information Processing Systems. 2013.
crostructure of Cognition. 1. pp. 194–281.
[64] T. Mikolov et al., “Recurrent neural network based lan-
[49] Hinton, G. E.; Osindero, S.; Teh, Y. (2006). guage model,”Interspeech, 2010.
“A fast learning algorithm for deep belief nets”
(PDF). Neural Computation 18 (7): 1527–1554. [65] LeCun, Y. et al. “Gradient-based learning applied to
doi:10.1162/neco.2006.18.7.1527. PMID 16764513. document recognition”. Proceedings of the IEEE 86 (11):
2278–2324. doi:10.1109/5.726791.
[50] Hinton, G. (2009).“Deep belief networks”. Scholarpedia
4 (5): 5947. doi:10.4249/scholarpedia.5947. [66] G. E. Hinton et al.., “Deep Neural Networks for Acous-
tic Modeling in Speech Recognition: The shared views of
[51] John Markoﬀ (25 June 2012). “How Many Computers four research groups,”IEEE Signal Processing Magazine,
to Identify a Cat? 16,000.”. New York Times. pp. 82–97, November 2012.

[52] Ng, Andrew; Dean, Jeff (2012). “Building High-level [67] Y. Bengio et al..,“Advances in optimizing recurrent net-
Features Using Large Scale Unsupervised Learning” works,”ICASSP, 2013.
(PDF).
[68] G. Dahl et al..,“Improving DNNs for LVCSR using rec-
[53] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. tified linear units and dropout,”ICASSP, 2013.
Schmidhuber. Flexible, High Performance Convolutional
Neural Networks for Image Classification. International [69] G. E. Hinton.,“A Practical Guide to Training Restricted
Joint Conference on Artificial Intelligence (IJCAI-2011, Boltzmann Machines,”Tech. Rep. UTML TR 2010-003,
Barcelona), 2011. Dept. CS., Univ. of Toronto, 2010.

[54] Martines, H.; Bengio, Y.; Yannakakis, G. N. (2013). [70] Hinton, G.E. “Deep belief networks”. Scholarpedia 4
“Learning Deep Physiological Models of Affect”. IEEE (5): 5947. doi:10.4249/scholarpedia.5947.
Computational Intelligence 8 (2): 20.
[71] H. Larochelle et al.., “An empirical evaluation of deep
[55] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi- architectures on problems with many factors of variation,”
Column Deep Neural Network for Traffic Sign Classifica- in Proc. 24th Int. Conf. Machine Learning, pp. 473–480,
tion. Neural Networks, 2012. 2007.

[56] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. [72] G. E. Hinton.,“Training Product of Experts by Minimiz-
Deep Neural Networks Segment Neuronal Membranes in ing Contrastive Divergence,”Neural Computation, 14, pp.
Electron Microscopy Images. In Advances in Neural In- 1771–1800, 2002.
formation Processing Systems (NIPS 2012), Lake Tahoe,
[73] A. Fischer and C. Igel. Training Restricted Boltzmann
2012.
Machines: An Introduction. Pattern Recognition 47, pp.
[57] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column 25-39, 2014
Deep Neural Networks for Image Classification. IEEE
[74] http://ufldl.stanford.edu/tutorial/index.php/
Conf. on Computer Vision and Pattern Recognition
Convolutional_Neural_Network
CVPR 2012.
[75]
[58] D. J. Felleman and D. C. Van Essen, "Distributed hierar-
chical processing in the primate cerebral cortex,”Cerebral [76]
Cortex, 1, pp. 1-47, 1991.
[77] Hinton, Geoffrey; Salakhutdinov, Ruslan (2012).“A bet-
[59] J. Weng, "Natural and Artificial Intelligence: Introduction ter way to pretrain deep Boltzmann machines” (PDF).
to Computational Brain-Mind,”BMI Press, ISBN 978- Advances in Neural 3: 1–9.
0985875725, 2012.
[78] Hinton, Geoffrey; Salakhutdinov, Ruslan (2009). “Effi-
[60] J. Weng, "Why Have We Passed `Neural Networks Do not cient Learning of Deep Boltzmann Machines” (PDF) 3.
Abstract Well'?,”Natural Intelligence: the INNS Magazine, pp. 448–455.
vol. 1, no.1, pp. 13-22, 2011.
[79] Bengio, Yoshua; LeCun, Yann (2007).“Scaling Learning
[61] Z. Ji, J. Weng, and D. Prokhorov, "Where-What Network Algorithms towards AI” (PDF) 1. pp. 1–41.
1: Where and What Assist Each Other Through Top-down
Connections,”Proc. 7th International Conference on De- [80] Larochelle, Hugo; Salakhutdinov, Ruslan (2010). “Ef-
velopment and Learning (ICDL'08), Monterey, CA, Aug. ficient Learning of Deep Boltzmann Machines” (PDF).
9-12, pp. 1-6, 2008. pp. 693–700.
28 CHAPTER 2. DEEP LEARNING

[81] Vincent, Pascal; Larochelle, Hugo; Lajoie, Isabelle; Ben- [96] Lee, Honglak; Grosse, Roger (2009). “Convolutional
gio, Yoshua; Manzagol, Pierre-Antoine (2010).“Stacked deep belief networks for scalable unsupervised learning
Denoising Autoencoders: Learning Useful Representa- of hierarchical representations”. Proceedings of the 26th
tions in a Deep Network with a Local Denoising Crite- Annual International Conference on Machine Learning -
rion”. The Journal of Machine Learning Research 11: ICML '09: 1–8.
3371–3408.
[97] Lin, Yuanqing; Zhang, Tong (2010).“Deep Coding Net-
[82] Deng, Li; Yu, Dong (2011). “Deep Convex Net: A work” (PDF). Advances in Neural . . .: 1–9.
Scalable Architecture for Speech Pattern Classification”
(PDF). Proceedings of the Interspeech: 2285–2288. [98] Ranzato, Marc Aurelio; Boureau, Y-Lan (2007).“Sparse
Feature Learning for Deep Belief Networks”(PDF). Ad-
[83] Deng, Li; Yu, Dong; Platt, John (2012).“Scalable stack- vances in neural information . . .: 1–8.
ing and learning for building deep architectures”. 2012
[99] Socher, Richard; Lin, Clif (2011). “Parsing Natural
IEEE International Conference on Acoustics, Speech and
Scenes and Natural Language with Recursive Neural Net-
Signal Processing (ICASSP): 2133–2136.
works” (PDF). Proceedings of the . . .
[84] David, Wolpert (1992). “Stacked generalization”.
[100] Taylor, Graham; Hinton, Geoffrey (2006). “Modeling
Neural networks 5(2): 241–259. doi:10.1016/S0893-
Human Motion Using Binary Latent Variables” (PDF).
6080(05)80023-1.
Advances in neural . . .
[85] Bengio, Yoshua (2009).“Learning deep architectures for
[101] Vincent, Pascal; Larochelle, Hugo (2008). “Extract-
AI”. Foundations and trends in Machine Learning 2(1):
ing and composing robust features with denoising autoen-
1–127.
coders”. Proceedings of the 25th international conference
[86] Hutchinson, Brian; Deng, Li; Yu, Dong (2012). “Tensor on Machine learning - ICML '08: 1096–1103.
deep stacking networks”. IEEE transactions on pattern
[102] Kemp, Charles; Perfors, Amy; Tenenbaum, Joshua
analysis and machine intelligence 1–15.
(2007). “Learning overhypotheses with hierarchical
[87] Hinton, Geoffrey; Salakhutdinov, Ruslan (2006). “Re- Bayesian models”. Developmental science. 10(3):
ducing the Dimensionality of Data with Neural Networks” 307–21. doi:10.1111/j.1467-7687.2007.00585.x. PMID
. Science 313: 504–507. doi:10.1126/science.1127647. 17444972.
PMID 16873662. [103] Xu, Fei; Tenenbaum, Joshua (2007). “Word learning
[88] Dahl, G.; Yu, D.; Deng, L.; Acero, A. (2012). “Context- as Bayesian inference”. Psychol Rev. 114(2): 245–72.
Dependent Pre-Trained Deep Neural Networks for Large- doi:10.1037/0033-295X.114.2.245. PMID 17500627.
Vocabulary Speech Recognition”. Audio, Speech, and ... [104] Chen, Bo; Polatkan, Gungor (2011). “The Hierarchical
20(1): 30–42. Beta Process for Convolutional Factor Analysis and Deep
[89] Mohamed, Abdel-rahman; Dahl, George; Hinton, Geof- Learning” (PDF). Machine Learning . . .
frey (2012).“Acoustic Modeling Using Deep Belief Net- [105] Fei-Fei, Li; Fergus, Rob (2006). “One-shot learning of
works”. IEEE Transactions on Audio, Speech, and Lan- object categories”. IEEE Trans Pattern Anal Mach Intell.
guage Processing. 20(1): 14–22. 28(4): 594–611. doi:10.1109/TPAMI.2006.79. PMID
16566508.
[90] Courville, Aaron; Bergstra, James; Bengio, Yoshua
(2011). “A Spike and Slab Restricted Boltzmann Ma- [106] Rodriguez, Abel; Dunson, David (2008). “The
chine” (PDF). International . . . 15: 233–241. Nested Dirichlet Process”. Journal of the Amer-
ican Statistical Association. 103(483): 1131–1154.
[91] Mitchell, T; Beauchamp, J (1988). “Bayesian Vari-
doi:10.1198/016214508000000553.
able Selection in Linear Regression”. Journal of the
American Statistical Association. 83 (404): 1023–1032. [107] Ruslan, Salakhutdinov; Joshua, Tenenbaum (2012).
doi:10.1080/01621459.1988.10478694. “Learning with Hierarchical-Deep Models”. IEEE trans-
actions on pattern analysis and machine intelligence: 1–14.
[92] Courville, Aaron; Bergstra, James; Bengio, Yoshua
PMID 23267196.
(2011). “Unsupervised Models of Images by Spike-and-
Slab RBMs” (PDF). Proceedings of the . . . 10: 1–8. [108] Chalasani, Rakesh; Principe, Jose (2013). “Deep Pre-
dictive Coding Networks”. arXiv preprint arXiv: 1–13.
[93] Hinton, Geoffrey; Osindero, Simon; Teh, Yee-Whye
(2006). “A Fast Learning Algorithm for Deep Belief [109] Cho, Youngmin (2012). “Kernel Methods for Deep
Nets”. Neural Computation 1554: 1527–1554. Learning” (PDF). pp. 1–9.
[94] Larochelle, Hugo; Bengio, Yoshua; Louradour, Jerdme; [110] Scholkopf, B; Smola, Alexander (1998). “Nonlinear
Lamblin, Pascal (2009).“Exploring Strategies for Train- component analysis as a kernel eigenvalue problem”.
ing Deep Neural Networks”. The Journal of Machine Neural computation (44).
Learning Research 10: 1–40.
[111] L. Deng, G. Tur, X. He, and D. Hakkani-Tur. “Use of
[95] Coates, Adam; Carpenter, Blake (2011). “Text Detec- Kernel Deep Convex Networks and End-To-End Learning
tion and Character Recognition in Scene Images with Un- for Spoken Language Understanding,”Proc. IEEE Work-
supervised Feature Learning”. pp. 440–445. shop on Spoken Language Technologies, 2012
2.11. REFERENCES 29

[112] Mnih, Volodymyr et al. (2015). “Human-level control [131] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin.,“A Neu-
through deep reinforcement learning” (PDF) 518. pp. ral Probabilistic Language Model,”Journal of Machine
529–533. Learning Research 3 (2003) 1137–1155', 2003.

[113] TIMIT Acoustic-Phonetic Continuous Speech Corpus Lin- [132] Goldberg, Yoav; Levy, Omar. “word2vec Explained:
guistic Data Consortium, Philadelphia. Deriving Mikolov et al.’s Negative-Sampling Word-
Embedding Method” (PDF). Arxiv. Retrieved 26 Oc-
[114] Abdel-Hamid, O. et al. (2014). “Convolutional Neural tober 2014.
Networks for Speech Recognition”. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing 22 (10): [133] Socher, Richard; Manning, Christopher.“Deep Learning
1533–1545. doi:10.1109/taslp.2014.2339736. for NLP” (PDF). Retrieved 26 October 2014.
[115] Deng, L.; Platt, J. (2014). “Ensemble Deep Learning for [134] Socher, Richard; Bauer, John; Manning, Christopher; Ng,
Speech Recognition”. Proc. Interspeech. Andrew (2013). “Parsing With Compositional Vector
Grammars”(PDF). Proceedings of the ACL 2013 confer-
[116] Yu, D.; Deng, L. (2010). “Roles of Pre-Training
ence.
and Fine-Tuning in Context-Dependent DBN-HMMs for
Real-World Speech Recognition”. NIPS Workshop on [135] Socher, Richard (2013). “Recursive Deep Models for
Deep Learning and Unsupervised Feature Learning. Semantic Compositionality Over a Sentiment Treebank”
[117] Deng L., Li, J., Huang, J., Yao, K., Yu, D., Seide, F. et al. (PDF). EMNLP 2013.
Recent Advances in Deep Learning for Speech Research [136] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014)
at Microsoft. ICASSP, 2013. " A Latent Semantic Model with Convolutional-Pooling
[118] Deng, L.; Li, Xiao (2013).“Machine Learning Paradigms Structure for Information Retrieval,”Proc. CIKM.
for Speech Recognition: An Overview”. IEEE Transac-
[137] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck
tions on Audio, Speech, and Language Processing.
(2013) “Learning Deep Structured Semantic Models for
[119] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and Web Search using Clickthrough Data,”Proc. CIKM.
G. Hinton (2010) Binary Coding of Speech Spectrograms
[138] I. Sutskever, O. Vinyals, Q. Le (2014) “Sequence to Se-
Using a Deep Auto-encoder. Interspeech.
quence Learning with Neural Networks,”Proc. NIPS.
[120] Z. Tuske, P. Golik, R. Schlüter and H. Ney (2014).
Acoustic Modeling with Deep Neural Networks Using [139] J. Gao, X. He, W. Yih, and L. Deng(2014) “Learning
Raw Time Signal for LVCSR. Interspeech. Continuous Phrase Representations for Translation Mod-
eling,”Proc. ACL.
[121] McMillan, R.“How Skype Used AI to Build Its Amazing
New Language Translator”, Wire, Dec. 2014. [140] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng (2014)
“Modeling Interestingness with Deep Neural Networks,”
[122] Hannun et al. (2014) “Deep Speech: Scaling up end-to- Proc. EMNLP.
end speech recognition”, arXiv:1412.5567.
[141] J. Gao, X. He, L. Deng (2014) “Deep Learning for Nat-
[123] Ron Schneiderman (2015) “Accuracy, Apps Advance ural Language Processing: Theory and Practice (Tuto-
Speech Recognition --- Interview with Vlad Sejnoha and rial),”CIKM.
Li Deng”, IEEE Signal Processing Magazine, Jan, 2015.
[142] Arrowsmith, J; Miller, P (2013). “Trial watch: Phase
[124] http://yann.lecun.com/exdb/mnist/. II and phase III attrition rates 2011-2012”. Nature Re-
views Drug Discovery 12 (8): 569. doi:10.1038/nrd4090.
[125] D. Ciresan, U. Meier, J. Schmidhuber., “Multi-column PMID 23903212.
Deep Neural Networks for Image Classiﬁcation,”Techni-
cal Report No. IDSIA-04-12', 2012. [143] Verbist, B; Klambauer, G; Vervoort, L; Talloen, W;
The Qstar, Consortium; Shkedy, Z; Thas, O; Ben-
[126] Vinyals et al. (2014)."Show and Tell: A Neural Image
der, A; Göhlmann, H. W.; Hochreiter, S (2015).
Caption Generator,”arXiv:1411.4555.
“Using transcriptomics to guide lead optimization
[127] Fang et al. (2014)."From Captions to Visual Concepts and in drug discovery projects: Lessons learned from
Back,”arXiv:1411.4952. the QSTAR project”. Drug Discovery Today.
doi:10.1016/j.drudis.2014.12.014. PMID 25582842.
[128] Kiros et al. (2014)."Unifying Visual-Semantic Embed-
dings with Multimodal Neural Language Models,”arXiv: [144]“Announcement of the winners of the Merck Molec-
1411.2539. ular Activity Challenge”https://www.kaggle.com/c/
MerckActivity/details/winners.
[129] Zhong, S.; Liu, Y.; Liu, Y. “Bilinear Deep Learning for
Image Classiﬁcation”. Proceedings of the 19th ACM In- [145] Dahl, G. E.; Jaitly, N.; & Salakhutdinov, R. (2014)
ternational Conference on Multimedia 11: 343–352. “Multi-task Neural Networks for QSAR Predictions,”
ArXiv, 2014.
[130] Nvidia Demos a Car Computer Trained with “Deep
Learning” (2015-01-06), David Talbot, MIT Technology [146]“Toxicology in the 21st century Data Challenge”https:
Review //tripod.nih.gov/tox21/challenge/leaderboard.jsp
30 CHAPTER 2. DEEP LEARNING

[147]“NCATS Announces Tox21 Data Challenge Winners” [162] Smith, G. W. (March 27, 2015). “Art and Artificial In-
http://www.ncats.nih.gov/news-and-events/features/ telligence”. ArtEnt. Retrieved March 27, 2015.
tox21-challenge-winners.html
[163] Ben Goertzel. Are there Deep Reasons Underlying the
[148] Unterthiner, T.; Mayr, A.; Klambauer, G.; Steijaert, M.; Pathologies of Today’s Deep Learning Algorithms?
Ceulemans, H.; Wegner, J. K.; & Hochreiter, S. (2014) (2015) Url: http://goertzel.org/DeepLearning_v1.pdf
“Deep Learning as an Opportunity in Virtual Screening”
. Workshop on Deep Learning and Representation Learn- [164] Nguyen, Anh, Jason Yosinski, and Jeff Clune. “Deep
ing (NIPS2014). Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images.”arXiv preprint
[149] Unterthiner, T.; Mayr, A.; Klambauer, G.; & Hochreiter, arXiv:1412.1897 (2014).
S. (2015) “Toxicity Prediction using Deep Learning”.
[165] Szegedy, Christian, et al.“Intriguing properties of neural
ArXiv, 2015.
networks.”arXiv preprint arXiv:1312.6199 (2013).
[150] Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Kon-
[166] Zhu, S.C.; Mumford, D. “A stochastic grammar of im-
erding, D.;& Pande, V. (2015)“Massively Multitask Net-
ages”. Found. Trends. Comput. Graph. Vis. 2 (4):
works for Drug Discovery”. ArXiv, 2015.
259–362. doi:10.1561/0600000018.
[151] Tkachenko, Yegor. Autonomous CRM Control via CLV [167] Miller, G. A., and N. Chomsky. “Pattern conception.”
Approximation with Deep Reinforcement Learning in Paper for Conference on pattern detection, University of
Discrete and Continuous Action Space. (April 8, 2015). Michigan. 1957.
arXiv.org: http://arxiv.org/abs/1504.01840
[168] Jason Eisner, Deep Learning of Recursive Struc-
[152] Utgoff, P. E.; Stracuzzi, D. J. (2002). “Many- ture: Grammar Induction, http://techtalks.tv/talks/
layered learning”. Neural Computation 14: 2497–2529. deep-learning-of-recursive-structure-grammar-induction/
doi:10.1162/08997660260293319. 58089/
[153] J. Elman, et al., “Rethinking Innateness,”1996.

[154] Shrager, J.; Johnson, MH (1996). “Dynamic plas- 2.12 External links
ticity influences the emergence of function in a simple
cortical array”. Neural Networks 9 (7): 1119–1129.
doi:10.1016/0893-6080(96)00033-0. • TED talk on the applications of deep learning and
future consequences by Jeremy Howard
[155] Quartz, SR; Sejnowski, TJ (1997). “The neural ba-
sis of cognitive development: A constructivist mani- • Deep learning information from the University of
festo”. Behavioral and Brain Sciences 20 (4): 537–556. Montreal
doi:10.1017/s0140525x97001581.
• Deep learning information from Stanford University
[156] S. Blakeslee., “In brain's early growth, timetable may be
critical,”The New York Times, Science Section, pp. B5–
B6, 1995. • Deep Learning Resources, NVIDIA Developer
Zone
[157] {BUFILL} E. Bufill, J. Agusti, R. Blesa., “Human
neoteny revisited: The case of synaptic plasticity,”Amer- • Geoffrey Hinton's webpage
ican Journal of Human Biology, 23 (6), pp. 729–739,
2011. • Hinton deep learning tutorial

[158] J. Shrager and M. H. Johnson., “Timing in the develop- • Yann LeCun's webpage
ment of cortical function: A computational approach,”In
• The Center for Biological and Computational
B. Julesz and I. Kovacs (Eds.), Maturational windows and
adult cortical plasticity, 1995. Learning (CBCL)

[159] D. Hernandez., “The Man Behind the Google • Stanford tutorial on unsupervised feature learning
Brain: Andrew Ng and the Quest for the New and deep learning
AI,”http://www.wired.com/wiredenterprise/2013/
• Google's DistBelief Framework
05/neuro-artificial-intelligence/all/. Wired, 10 May
2013. • NIPS 2013 Conference (talks on deep learning re-
[160] C. Metz., “Facebook's 'Deep Learning' Guru Reveals lated material)
the Future of AI,”http://www.wired.com/wiredenterprise/
• Mnih, Volodymyr; Kavukcuoglu, Koray; Silver,
2013/12/facebook-yann-lecun-qa/. Wired, 12 December
2013.
David; Graves, Alex; Antonoglou, Ioannis; Wier-
stra, Daan; Riedmiller, Martin (2013), Playing
[161] G. Marcus.,“Is“Deep Learning”a Revolution in Artiﬁcial Atari with Deep Reinforcement Learning (PDF),
Intelligence?" The New Yorker, 25 November 2012. arXiv:1312.5602
2.12. EXTERNAL LINKS 31

• 100 Best GitHub: Deep Learning

• Silicon Chips That See Are Going to Make Your

Smartphone Brilliant.“Many of the devices around
us may soon acquire powerful new abilities to un-
derstand images and video, thanks to hardware de-
signed for the machine-learning technique called
deep learning.”Tom Simonite (May 2015), MIT
Technology Review
Chapter 3

Feature learning

Feature learning or representation learning* [1] is a weights may be found by minimizing the average repre-
set of techniques that learn a transformation of raw data sentation error (over the input data), together with a L1
input to a representation that can be effectively exploited regularization on the weights to enable sparsity (i.e., the
in machine learning tasks. representation of each data point has only a few nonzero
weights).
Feature learning is motivated by the fact that machine
learning tasks such as classification often require input Supervised dictionary learning exploits both the structure
that is mathematically and computationally convenient to underlying the input data and the labels for optimizing the
process. However, real-world data such as images, video, dictionary elements. For example, a supervised dictio-
and sensor measurement is usually complex, redundant, nary learning technique was proposed by Mairal et al. in
2009.* [6] The authors apply dictionary learning on classi-
and highly variable. Thus, it is necessary to discover use-
ful features or representations from raw data. Traditional
fication problems by jointly optimizing the dictionary ele-
hand-crafted features often require expensive human la- ments, weights for representing data points, and parame-
bor and often rely on expert knowledge. Also, they nor- ters of the classifier based on the input data. In particular,
mally do not generalize well. This motivates the design a minimization problem is formulated, where the objec-
of efficient feature learning techniques. tive function consists of the classification error, the repre-
Feature learning can be divided into two categories: su- sentation error, an L1 regularization on the representing
pervised and unsupervised feature learning. weights for each data point (to enable sparse representa-
tion of data), and an L2 regularization on the parameters
of the classifier.
• In supervised feature learning, features are learned
with labeled input data. Examples include neural
networks, multilayer perceptron, and (supervised)
dictionary learning.

• In unsupervised feature learning, features are

learned with unlabeled input data. Examples include
3.1.2 Neural networks
dictionary learning, independent component analy-
sis, autoencoders, matrix factorization,* [2] and var- Neural networks are used to illustrate a family of learn-
ious forms of clustering.* [3]* [4]* [5] ing algorithms via a “network”consisting of multiple
layers of inter-connected nodes. It is inspired by the
nervous system, where the nodes are viewed as neurons
3.1 Supervised feature learning and edges are viewed as synapse. Each edge has an as-
sociated weight, and the network defines computational
rules that passes input data from the input layer to the
Supervised feature learning is to learn features from la-
output layer. A network function associated with a neu-
beled data. Several approaches are introduced in the fol-
ral network characterizes the relationship between input
lowing.
and output layers, which is parameterized by the weights.
With appropriately defined network functions, various
3.1.1 Supervised dictionary learning learning tasks can be performed by minimizing a cost
function over the network function (weights).
Dictionary learning is to learn a set (dictionary) of rep- Multilayer neural networks can be used to perform fea-
resentative elements from the input data such that each ture learning, since they learn a representation of their
data point can be represented as a weighted sum of the input at the hidden layer(s) which is subsequently used
representative elements. The dictionary elements and the for classification or regression at the output layer.

32
3.2. UNSUPERVISED FEATURE LEARNING 33

3.2 Unsupervised feature learning (i.e., subtracting the sample mean from the data vector).
Equivalently, these singular vectors are the eigenvectors
Unsupervised feature learning is to learn features from corresponding to the p largest eigenvalues of the sample
unlabeled data. The goal of unsupervised feature learn- covariance matrix of the input vectors. These p singu-
ing is often to discover low-dimensional features that cap- lar vectors are the feature vectors learned from the input
tures some structure underlying the high-dimensional in- data, and they represent directions along which the data
put data. When the feature learning is performed in an has the largest variations.
unsupervised way, it enables a form of semisupervised PCA is a linear feature learning approach since the p sin-
learning where first, features are learned from an gular vectors are linear functions of the data matrix. The
unlabeled dataset, which are then employed to im- singular vectors can be generated via a simple algorithm
prove performance in a supervised setting with labeled with p iterations. In the ith iteration, the projection of the
data.* [7]* [8] Several approaches are introduced in the data matrix on the (i-1)th eigenvector is subtracted, and
following. the ith singular vector is found as the right singular vector
corresponding to the largest singular of the residual data
matrix.
3.2.1 K-means clustering
PCA has several limitations. First, it assumes that the
K-means clustering is an approach for vector quantiza- directions with large variance are of most interest, which
tion. In particular, given a set of n vectors, k-means clus- may not be the case in many applications. PCA only relies
tering groups them into k clusters (i.e., subsets) in such a on orthogonal transformations of the original data, and it
way that each vector belongs to the cluster with the clos- only exploits the first- and second-order moments of the
est mean. The problem is computationally NP-hard, and data, which may not well characterize the distribution of
suboptimal greedy algorithms have been developed for k- the data. Furthermore, PCA can effectively reduce di-
means clustering. mension only when the input data vectors are correlated
(which results in a few dominant eigenvalues).
In feature learning, k-means clustering can be used to
group an unlabeled set of inputs into k clusters, and then
use the centroids of these clusters to produce features. 3.2.3 Local linear embedding
These features can be produced in several ways. The sim-
plest way is to add k binary features to each sample, where Local linear embedding (LLE) is a nonlinear unsuper-
each feature j has value one iff the jth centroid learned vised learning approach for generating low-dimensional
by k-means is the closest to the sample under considera- neighbor-preserving representations from (unlabeled)
tion.* [3] It is also possible to use the distances to the clus- high-dimension input. The approach was proposed by
ters as features, perhaps after transforming them through Sam T. Roweis and Lawrence K. Saul in 2000.* [12]* [13]
a radial basis function (a technique that has used to train
RBF networks* [9]). Coates and Ng note that certain vari- The general idea of LLE is to reconstruct the origi-
ants of k-means behave similarly to sparse coding algo- nal high-dimensional data using lower-dimensional points
rithms.* [10] while maintaining some geometric properties of the
neighborhoods in the original data set. LLE consists
In a comparative evaluation of unsupervised feature of two major steps. The first step is for “neighbor-
learning methods, Coates, Lee and Ng found that k- preserving,”where each input data point Xi is recon-
means clustering with an appropriate transformation out- structed as a weighted sum of K nearest neighboring data
performs the more recently invented auto-encoders and points, and the optimal weights are found by minimizing
RBMs on an image classification task.* [3] K-means has the average squared reconstruction error (i.e., difference
also been shown to improve performance in the domain between a point and its reconstruction) under the con-
of NLP, specifically for named-entity recognition;* [11] straint that the weights associated to each point sum up
there, it competes with Brown clustering, as well as with to one. The second step is for “dimension reduction,”
distributed word representations (also known as neural by looking for vectors in a lower-dimensional space that
word embeddings).* [8] minimizes the representation error using the optimized
weights in the first step. Note that in the first step, the
weights are optimized with data being fixed, which can
3.2.2 Principal component analysis
be solved as a least squares problem; while in the sec-
Principal component analysis (PCA) is often used for di- ond step, lower-dimensional points are optimized with
mension reduction. Given a unlabeled set of n input data the weights being fixed, which can be solved via sparse
vectors, PCA generates p (which is much smaller than the eigenvalue decomposition.
dimension of the input data) right singular vectors corre- The reconstruction weights obtained in the first step cap-
sponding to the p largest singular values of the data ma- tures the“intrinsic geometric properties”of a neighbor-
trix, where the kth row of the data matrix is the kth in- hood in the input data.* [13] It is assumed that original
put data vector shifted by the sample mean of the input data lie on a smooth lower-dimensional manifold, and the
34 CHAPTER 3. FEATURE LEARNING

“intrinsic geometric properties”captured by the weights den variables, a group of visible variables, and edges con-
of the original data are expected also on the manifold.necting the hidden and visible nodes. It is a special case of
This is why the same weights are used in the second step
the more general Boltzmann machines with the constraint
of LLE. Compared with PCA, LLE is more powerful in of no intra-node connections. Each edge in an RBM is
exploiting the underlying structure of data. associated with a weight. The weights together with the
connections define an energy function, based on which
a joint distribution of visible and hidden nodes can be
3.2.4 Independent component analysis devised. Based on the topology of the RBM, the hidden
(visible) variables are independent conditioned on the vis-
Independent component analysis (ICA) is technique for ible (hidden) variables. Such conditional independence
learning a representation of data using a weighted sum facilitates computations on RBM.
of independent non-Gaussian components.* [14] The as-
An RBM can be viewed as a single layer architecture for
sumption of non-Gaussian is imposed since the weights
unsupervised feature learning. In particular, the visible
cannot be uniquely determined when all the components
variables correspond to input data, and the hidden vari-
follow Gaussian distribution.
ables correspond to feature detectors. The weights can
be trained by maximizing the probability of visible vari-
ables using the contrastive divergence (CD) algorithm by
3.2.5 Unsupervised dictionary learning
Geoffrey Hinton.* [18]
Different from supervised dictionary learning, unsuper- In general, the training of RBM by solving the above
vised dictionary learning does not utilize the labels of the maximization problem tends to result in non-sparse rep-
data and only exploits the structure underlying the data for resentations. The sparse RBM, * [19] a modification of
optimizing the dictionary elements. An example of unsu- the RBM, was proposed to enable sparse representations.
pervised dictionary learning is sparse coding, which aims The idea is to add a regularization term in the objective
to learn basis functions (dictionary elements) for data rep- function of data likelihood, which penalizes the deviation
resentation from unlabeled input data. Sparse coding can of the expected hidden variables from a small constant p
be applied to learn overcomplete dictionary, where the .
number of dictionary elements is larger than the dimen-
sion of the input data.* [15] Aharon et al. proposed an
algorithm known as K-SVD for learning from unlabeled 3.3.2 Autoencoder
input data a dictionary of elements that enables sparse
representation of the data.* [16] An autoencoder consisting of encoder and decoder is a
paradigm for deep learning architectures. An example
is provided by Hinton and Salakhutdinov* [18] where the
encoder uses raw data (e.g., image) as input and produces
3.3 Multilayer/Deep architectures feature or representation as output, and the decoder uses
the extracted feature from the encoder as input and recon-
The hierarchical architecture of the neural system in- structs the original input raw data as output. The encoder
spires deep learning architectures for feature learning by and decoder are constructed by stacking multiple layers
stacking multiple layers of simple learning blocks.* [17] of RBMs. The parameters involved in the architecture
These architectures are often designed based on the as- are trained in a greedy layer-by-layer manner: after one
sumption of distributed representation: observed data is layer of feature detectors is learned, they are fed to upper
generated by the interactions of many different factors layers as visible variables for training the corresponding
on multiple levels. In a deep learning architecture, the RBM. The process can be repeated until some stopping
output of each intermediate layer can be viewed as a rep- criteria is satisfied.
resentation of the original input data. Each level uses the
representation produced by previous level as input, and
produces new representations as output, which is then fed 3.4 See also
to higher levels. The input of bottom layer is the raw
data, and the output of the final layer is the final low-
• Basis function
dimensional feature or representation.
• Deep learning

3.3.1 Restricted Boltzmann machine • Feature detection (computer vision)

Restricted Boltzmann machines (RBMs) are often used • Feature extraction

as a building block for multilayer learning architec- • Kernel trick
tures.* [3]* [18] An RBM can be represented by an undi-
rected bipartite graph consisting of a group of binary hid- • Vector quantization
3.5. REFERENCES 35

3.5 References [16] Aharon, Michal; Elad, Michael; Bruckstein, Alfred

(2006). “K-SVD: An Algorithm for Designing
[1] Y. Bengio; A. Courville; P. Vincent (2013). “Represen- Overcomplete Dictionaries for Sparse Representation”
tation Learning: A Review and New Perspectives”. IEEE . IEEE Trans. Signal Process. 54 (11): 4311–4322.
Trans. PAMI, special issue Learning Deep Architectures. doi:10.1109/TSP.2006.881199.

[2] Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola [17] Bengio, Yoshua (2009). “Learning Deep Architectures
(2004). Maximum-Margin Matrix Factorization. NIPS. for AI”. Foundations and Trends® in Machine Learning
2 (1): 1–127. doi:10.1561/2200000006.
[3] Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An
analysis of single-layer networks in unsupervised feature [18] Hinton, G. E.; Salakhutdinov, R. R. (2006).
learning (PDF). Int'l Conf. on AI and Statistics (AIS- “Reducing the Dimensionality of Data with Neural
TATS). Networks” (PDF). Science 313 (5786): 504–507.
doi:10.1126/science.1127647. PMID 16873662.
[4] Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin;
Willamowski, Jutta; Bray, Cédric (2004). Visual catego- [19] Lee, Honglak; Ekanadham, Chaitanya; Andrew, Ng
rization with bags of keypoints (PDF). ECCV Workshop (2008). “Sparse deep belief net model for visual area
on Statistical Learning in Computer Vision. V2”. Advances in neural information processing systems.

[5] Daniel Jurafsky; James H. Martin (2009). Speech and

Language Processing. Pearson Education International.
pp. 145–146.

[6] Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro,

Guillermo; Zisserman, Andrew (2009).“Supervised Dic-
tionary Learning”. Advances in neural information pro-
cessing systems.

[7] Percy Liang (2005). Semi-Supervised Learning for Natu-

ral Language (PDF) (M. Eng.). MIT. pp. 44–52.

[8] Joseph Turian; Lev Ratinov; Yoshua Bengio (2010).

Word representations: a simple and general method for
semi-supervised learning (PDF). Proceedings of the 48th
Annual Meeting of the Association for Computational
Linguistics.

[9] Schwenker, Friedhelm; Kestler, Hans A.; Palm, Gün-

ther (2001). “Three learning phases for radial-basis-
function networks”. Neural Networks 14: 439–
458. doi:10.1016/s0893-6080(01)00027-2. CiteSeerX:
10.1.1.109.312.

[10] Coates, Adam; Ng, Andrew Y. (2012). “Learning fea-

ture representations with k-means”. In G. Montavon, G.
B. Orr and K.-R. Müller. Neural Networks: Tricks of the
Trade. Springer.

[11] Dekang Lin; Xiaoyun Wu (2009). Phrase clustering for

discriminative learning (PDF). Proc. J. Conf. of the ACL
and 4th Int'l J. Conf. on Natural Language Processing of
the AFNLP. pp. 1030–1038.

[12] Roweis, Sam T; Saul, Lawrence K (2000). “Nonlin-

ear Dimensionality Reduction by Locally Linear Embed-
ding”. Science, New Series 290 (5500): 2323–2326.
doi:10.1126/science.290.5500.2323.

[13] Saul, Lawrence K; Roweis, Sam T (2000). “An Intro-

duction to Locally Linear Embedding”.

[14] Hyvärinen, Aapo; Oja, Erkki (2000).“Independent Com-

ponent Analysis: Algorithms and Applications”. Neural
networks (4): 411–430.

[15] Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew

Y (2007). “Eﬃcient sparse coding algorithms”. Ad-
vances in neural information processing systems.
Chapter 4

Unsupervised learning

In machine learning, the problem of unsupervised 4.1 Method of moments

learning is that of trying to find hidden structure in un-
labeled data. Since the examples given to the learner are One of the approaches in unsupervised learning is the
unlabeled, there is no error or reward signal to evaluate a method of moments. In the method of moments, the un-
potential solution. This distinguishes unsupervised learn- known parameters (of interest) in the model are related
ing from supervised learning and reinforcement learning. to the moments of one or more random variables, and
Unsupervised learning is closely related to the problem thus, these unknown parameters can be estimated given
of density estimation in statistics.* [1] However unsuper- the moments. The moments are usually estimated from
vised learning also encompasses many other techniques samples in an empirical way. The basic moments are first
that seek to summarize and explain key features of the and second order moments. For a random vector, the
data. Many methods employed in unsupervised learn- first order moment is the mean vector, and the second
ing are based on data mining methods used to preprocess order moment is the covariance matrix (when the mean
data. is zero). Higher order moments are usually represented
using tensors which are the generalization of matrices to
Approaches to unsupervised learning include:
higher orders as multi-dimensional arrays.
In particular, the method of moments is shown to be
• clustering (e.g., k-means, mixture models,
effective in learning the parameters of latent variable
hierarchical clustering),* [2]
models.* [5] Latent variable models are statistical models
where in addition to the observed variables, a set of la-
• Approaches for learning latent variable models such tent variables also exists which is not observed. A highly
as practical example of latent variable models in machine
learning is the topic modeling which is a statistical model
• Expectation–maximization algorithm (EM)
for generating the words (observed variables) in the docu-
• Method of moments ment based on the topic (latent variable) of the document.
• Blind signal separation techniques, e.g., In the topic modeling, the words in the document are gen-
erated according to different statistical parameters when
• Principal component analysis, the topic of the document is changed. It is shown that
• Independent component analysis, method of moments (tensor decomposition techniques)
• Non-negative matrix factorization, consistently recover the parameters of a large class of la-
• Singular value decomposition. [3]
* tent variable models under some assumptions.* [5]
Expectation–maximization algorithm (EM) is also one of
Among neural network models, the self-organizing map the most practical methods for learning latent variable
(SOM) and adaptive resonance theory (ART) are com- models. But, it can be stuck in local optima, and the
monly used unsupervised learning algorithms. The SOM global convergence of the algorithm to the true unknown
is a topographic organization in which nearby locations parameters of the model is not guaranteed. While, for
in the map represent inputs with similar properties. The the method of moments, the global convergence is guar-
*
ART model allows the number of clusters to vary with anteed under some conditions. [5]
problem size and lets the user control the degree of sim-
ilarity between members of the same clusters by means
of a user-defined constant called the vigilance parameter. 4.2 See also
ART networks are also used for many pattern recognition
tasks, such as automatic target recognition and seismic • Cluster analysis
signal processing. The first version of ART was“ART1”
, developed by Carpenter and Grossberg (1988).* [4] • Expectation–maximization algorithm

36
4.4. FURTHER READING 37

• Generative topographic map

• Multilinear subspace learning

• Multivariate analysis

• Radial basis function network

4.3 Notes
[1] Jordan, Michael I.; Bishop, Christopher M. (2004).“Neu-
ral Networks”. In Allen B. Tucker. Computer Science
Handbook, Second Edition (Section VII: Intelligent Sys-
tems). Boca Raton, FL: Chapman & Hall/CRC Press
LLC. ISBN 1-58488-360-X.

[2] Hastie,Trevor,Robert Tibshirani, Friedman,Jerome

(2009). The Elements of Statistical Learning: Data
mining,Inference,and Prediction. New York: Springer.
pp. 485–586. ISBN 978-0-387-84857-0.

[3] Acharyya, Ranjan (2008); A New Approach for Blind

Source Separation of Convolutive Sources, ISBN 978-3-
639-07797-1 (this book focuses on unsupervised learning
with Blind Source Separation)

[4] Carpenter, G.A. and Grossberg, S. (1988). “The ART

of adaptive pattern recognition by a self-organizing
neural network” (PDF). Computer 21: 77–88.
doi:10.1109/2.33.

[5] Anandkumar, Animashree; Ge, Rong; Hsu, Daniel;

Kakade, Sham; Telgarsky, Matus (2014). “Tensor
Decompositions for Learning Latent Variable Models”
(PDF). Journal of Machine Learning Research (JMLR) 15:
2773−2832.

4.4 Further reading

• Bousquet, O.; von Luxburg, U.; Raetsch, G., eds.
(2004). Advanced Lectures on Machine Learning.
Springer-Verlag. ISBN 978-3540231226.

• Duda, Richard O.; Hart, Peter E.; Stork, David G.

(2001). “Unsupervised Learning and Clustering”
. Pattern classiﬁcation (2nd ed.). Wiley. ISBN 0-
471-05669-3.

• Hastie, Trevor; Tibshirani, Robert (2009). The

Elements of Statistical Learning: Data min-
ing,Inference,and Prediction. New York: Springer.
pp. 485–586. doi:10.1007/978-0-387-84858-
7_14. ISBN 978-0-387-84857-0.

• Hinton, Geoﬀrey; Sejnowski, Terrence J., eds.

(1999). Unsupervised Learning: Foundations of
Neural Computation. MIT Press. ISBN 0-262-
58168-X. (This book focuses on unsupervised learn-
ing in neural networks)
Chapter 5

Generative model

In probability and statistics, a generative model is a • Restricted Boltzmann machine

model for randomly generating observable-data values,
typically given some hidden parameters. It specifies a If the observed data are truly sampled from the generative
joint probability distribution over observation and label model, then fitting the parameters of the generative model
sequences. Generative models are used in machine learn- to maximize the data likelihood is a common method.
ing for either modeling data directly (i.e., modeling ob- However, since most statistical models are only approx-
servations drawn from a probability density function), or imations to the true distribution, if the model's applica-
as an intermediate step to forming a conditional proba- tion is to infer about a subset of variables conditional on
bility density function. A conditional distribution can be known values of others, then it can be argued that the ap-
formed from a generative model through Bayes' rule. proximation makes more assumptions than are necessary
Shannon (1948) gives an example in which a table of fre- to solve the problem at hand. In such cases, it can be more
quencies of English word pairs is used to generate a sen- accurate to model the conditional density functions di-
tence beginning with “representing and speedily is an rectly using a discriminative model (see above), although
good"; which is not proper English but which will increas- application-specific details will ultimately dictate which
ingly approximate it as the table is moved from word pairs approach is most suitable in any particular case.
to word triplets etc.
Generative models contrast with discriminative models,
in that a generative model is a full probabilistic model 5.1 See also
of all variables, whereas a discriminative model provides
a model only for the target variable(s) conditional on the • Discriminative model
observed variables. Thus a generative model can be used,
• Graphical model
for example, to simulate (i.e. generate) values of any
variable in the model, whereas a discriminative model al-
lows only sampling of the target variables conditional on
the observed quantities. Despite the fact that discrimina- 5.2 References
tive models do not need to model the distribution of the
observed variables, they cannot generally express more [1] C. M. Bishop and J. Lasserre, Generative or Discrimina-
complex relationships between the observed and target tive? getting the best of both worlds. In Bayesian Statistics
variables. They don't necessarily perform better than gen- 8, Bernardo, J. M. et al. (Eds), Oxford University Press.
3–23, 2007.
erative models at classification and regression tasks. In
modern applications the two classes are seen as comple-
mentary or as different views of the same procedure.* [1]
Examples of generative models include:
5.3 Sources
• Shannon, C.E. (1948) "A Mathematical Theory of
• Gaussian mixture model and other types of mixture
Communication", Bell System Technical Journal,
model
vol. 27, pp. 379–423, 623–656, July, October,
• Hidden Markov model 1948

• Probabilistic context-free grammar

• Naive Bayes
• Averaged one-dependence estimators
• Latent Dirichlet allocation

38
Chapter 6

Neural coding

Neural coding is a neuroscience-related ﬁeld concerned recalled in the hippocampus, a brain region known to
with characterizing the relationship between the stimulus be central for memory formation.* [5]* [6]* [7] Neurosci-
and the individual or ensemble neuronal responses and entists have initiated several large-scale brain decoding
the relationship among the electrical activity of the neu- projects.* [8]* [9]
rons in the ensemble.* [1] Based on the theory that sen-
sory and other information is represented in the brain by
networks of neurons, it is thought that neurons can encode
both digital and analog information.* [2]
6.2 Encoding and decoding

The link between stimulus and response can be studied

6.1 Overview from two opposite points of view. Neural encoding refers
to the map from stimulus to response. The main focus
Neurons are remarkable among the cells of the body in is to understand how neurons respond to a wide variety
their ability to propagate signals rapidly over large dis- of stimuli, and to construct models that attempt to pre-
tances. They do this by generating characteristic electri- dict responses to other stimuli. Neural decoding refers
cal pulses called action potentials: voltage spikes that can to the reverse map, from response to stimulus, and the
travel down nerve fibers. Sensory neurons change their challenge is to reconstruct a stimulus, or certain aspects
activities by firing sequences of action potentials in vari- of that stimulus, from the spike sequences it evokes.
ous temporal patterns, with the presence of external sen-
sory stimuli, such as light, sound, taste, smell and touch.
It is known that information about the stimulus is encoded
in this pattern of action potentials and transmitted into
and around the brain. 6.3 Coding schemes
Although action potentials can vary somewhat in dura-
tion, amplitude and shape, they are typically treated as A sequence, or 'train', of spikes may contain information
identical stereotyped events in neural coding studies. If based on different coding schemes. In motor neurons,
the brief duration of an action potential (about 1ms) is for example, the strength at which an innervated muscle
ignored, an action potential sequence, or spike train, can is flexed depends solely on the 'firing rate', the average
be characterized simply by a series of all-or-none point number of spikes per unit time (a 'rate code'). At the other
events in time.* [3] The lengths of interspike intervals end, a complex 'temporal code' is based on the precise
(ISIs) between two successive spikes in a spike train often timing of single spikes. They may be locked to an external
vary, apparently randomly.* [4] The study of neural cod- stimulus such as in the visual* [10] and auditory system or
ing involves measuring and characterizing how stimulus be generated intrinsically by the neural circuitry.* [11]
attributes, such as light or sound intensity, or motor ac- Whether neurons use rate coding or temporal coding is
tions, such as the direction of an arm movement, are rep- a topic of intense debate within the neuroscience com-
resented by neuron action potentials or spikes. In order to munity, even though there is no clear definition of what
describe and analyze neuronal firing, statistical methods these terms mean. In one theory, termed “neuroelectro-
and methods of probability theory and stochastic point dynamics”, the following coding schemes are all consid-
processes have been widely applied. ered to be epiphenomena, replaced instead by molecular
With the development of large-scale neural recording and changes reflecting the spatial distribution of electric fields
decoding technologies, researchers have begun to crack within neurons as a result of the broad electromagnetic
the neural code and already provided the first glimpse spectrum of action potentials, and manifested in infor-
into the real-time neural code as memory is formed and mation as spike directivity.* [12]* [13]* [14]* [15]* [16]

39
40 CHAPTER 6. NEURAL CODING

6.3.1 Rate coding the stimulus. In practice, to get sensible averages, several
spikes should occur within the time window. Typical val-
The rate coding model of neuronal firing communication ues are T = 100 ms or T = 500 ms, but the duration may
states that as the intensity of a stimulus increases, the also be longer or shorter.* [19]
frequency or rate of action potentials, or “spike firing” The spike-count rate can be determined from a single
, increases. Rate coding is sometimes called frequency trial, but at the expense of losing all temporal resolu-
coding. tion about variations in neural response during the course
Rate coding is a traditional coding scheme, assuming that of the trial. Temporal averaging can work well in cases
most, if not all, information about the stimulus is con- where the stimulus is constant or slowly varying and does
tained in the firing rate of the neuron. Because the se- not require a fast reaction of the organism —and this is
quence of action potentials generated by a given stimu- the situation usually encountered in experimental proto-
lus varies from trial to trial, neuronal responses are typi- cols. Real-world input, however, is hardly stationary, but
cally treated statistically or probabilistically. They may often changing on a fast time scale. For example, even
be characterized by firing rates, rather than as specific when viewing a static image, humans perform saccades,
spike sequences. In most sensory systems, the firing rate rapid changes of the direction of gaze. The image pro-
increases, generally non-linearly, with increasing stimu- jected onto the retinal photoreceptors changes therefore
lus intensity.* [17] Any information possibly encoded in every few hundred milliseconds.* [19]
the temporal structure of the spike train is ignored. Con- Despite its shortcomings, the concept of a spike-count
sequently, rate coding is inefficient but highly robust with rate code is widely used not only in experiments, but also
respect to the ISI 'noise'.* [4] in models of neural networks. It has led to the idea that a
During rate coding, precisely calculating firing rate is very neuron transforms information about a single input vari-
important. In fact, the term “firing rate”has a few dif- able (the stimulus strength) into a single continuous out-
ferent definitions, which refer to different averaging pro- put variable (the firing rate).
cedures, such as an average over time or an average over
several repetitions of experiment.
Time-dependent firing rate
In rate coding, learning is based on activity-dependent
synaptic weight modifications. The time-dependent firing rate is defined as the average
Rate coding was originally shown by ED Adrian and number of spikes (averaged over trials) appearing dur-
Y Zotterman in 1926.* [18] In this simple experiment ing a short interval between times t and t+Δt, divided
different weights were hung from a muscle. As the by the duration of the interval. It works for stationary
weight of the stimulus increased, the number of spikes as well as for time-dependent stimuli. To experimentally
recorded from sensory nerves innervating the muscle also measure the time-dependent firing rate, the experimenter
increased. From these original experiments, Adrian and records from a neuron while stimulating with some in-
Zotterman concluded that action potentials were unitary put sequence. The same stimulation sequence is repeated
events, and that the frequency of events, and not indi- several times and the neuronal response is reported in
vidual event magnitude, was the basis for most inter- a Peri-Stimulus-Time Histogram (PSTH). The time t is
neuronal communication. measured with respect to the start of the stimulation se-
quence. The Δt must be large enough (typically in the
In the following decades, measurement of firing rates be-
range of one or a few milliseconds) so there are sufficient
came a standard tool for describing the properties of all
number of spikes within the interval to obtain a reliable
types of sensory or cortical neurons, partly due to the
estimate of the average. The number of occurrences of
relative ease of measuring rates experimentally. How-
spikes nK (t;t+Δt) summed over all repetitions of the ex-
ever, this approach neglects all the information possibly
periment divided by the number K of repetitions is a mea-
contained in the exact timing of the spikes. During re-
sure of the typical activity of the neuron between time t
cent years, more and more experimental evidence has
and t+Δt. A further division by the interval length Δt
suggested that a straightforward firing rate concept based
yields time-dependent firing rate r(t) of the neuron, which
on temporal averaging may be too simplistic to describe
is equivalent to the spike density of PSTH.
brain activity.* [4]
For sufficiently small Δt, r(t)Δt is the average number of
spikes occurring between times t and t+Δt over multiple
Spike-count rate trials. If Δt is small, there will never be more than one
spike within the interval between t and t+Δt on any given
The Spike-count rate, also referred to as temporal aver- trial. This means that r(t)Δt is also the fraction of trials
age, is obtained by counting the number of spikes that on which a spike occurred between those times. Equiva-
appear during a trial and dividing by the duration of trial. lently, r(t)Δt is the probability that a spike occurs during
The length T of the time window is set by experimenter this time interval.
and depends on the type of neuron recorded from and As an experimental procedure, the time-dependent firing
6.3. CODING SCHEMES 41

rate measure is a useful method to evaluate neuronal ac- domness, or precisely timed groups of spikes (tempo-
tivity, in particular in the case of time-dependent stimuli. ral patterns) are candidates for temporal codes.* [25] As
The obvious problem with this approach is that it can not there is no absolute time reference in the nervous system,
be the coding scheme used by neurons in the brain. Neu- the information is carried either in terms of the relative
rons can not wait for the stimuli to repeatedly present in timing of spikes in a population of neurons or with respect
an exactly same manner before generating response. to an ongoing brain oscillation.* [2]* [4]
Nevertheless, the experimental time-dependent firing rate The temporal structure of a spike train or firing rate
measure can make sense, if there are large populations evoked by a stimulus is determined both by the dynamics
of independent neurons that receive the same stimulus. of the stimulus and by the nature of the neural encod-
Instead of recording from a population of N neurons in ing process. Stimuli that change rapidly tend to generate
a single run, it is experimentally easier to record from a precisely timed spikes and rapidly changing firing rates no
single neuron and average over N repeated runs. Thus, the matter what neural coding strategy is being used. Tempo-
time-dependent firing rate coding relies on the implicit ral coding refers to temporal precision in the response that
assumption that there are always populations of neurons. does not arise solely from the dynamics of the stimulus,
but that nevertheless relates to properties of the stimulus.
The interplay between stimulus and encoding dynamics
6.3.2 Temporal coding makes the identification of a temporal code difficult.
In temporal coding, learning can be explained by activity-
When precise spike timing or high-frequency firing-rate dependent synaptic delay modifications.* [26] The modi-
fluctuations are found to carry information, the neural fications can themselves depend not only on spike rates
code is often identified as a temporal code.* [20] A num- (rate coding) but also on spike timing patterns (tempo-
ber of studies have found that the temporal resolution of ral coding), i.e., can be a special case of spike-timing-
the neural code is on a millisecond time scale, indicating dependent plasticity.
that precise spike timing is a significant element in neural
The issue of temporal coding is distinct and independent
coding.* [2]* [21]
from the issue of independent-spike coding. If each spike
Neurons exhibit high-frequency fluctuations of firing- is independent of all the other spikes in the train, the tem-
rates which could be noise or could carry information. poral character of the neural code is determined by the
Rate coding models suggest that these irregularities are behavior of time-dependent firing rate r(t). If r(t) varies
noise, while temporal coding models suggest that they en- slowly with time, the code is typically called a rate code,
code information. If the nervous system only used rate and if it varies rapidly, the code is called temporal.
codes to convey information, a more consistent, regular
firing rate would have been evolutionarily advantageous,
and neurons would have utilized this code over other less Temporal coding in sensory systems
robust options.* [22] Temporal coding supplies an alter-
nate explanation for the “noise,”suggesting that it ac- For very brief stimuli, a neuron's maximum firing rate
tually encodes information and affects neural process- may not be fast enough to produce more than a single
ing. To model this idea, binary symbols can be used to spike. Due to the density of information about the ab-
mark the spikes: 1 for a spike, 0 for no spike. Tempo- breviated stimulus contained in this single spike, it would
ral coding allows the sequence 000111000111 to mean seem that the timing of the spike itself would have to
something different from 001100110011, even though convey more information than simply the average fre-
the mean firing rate is the same for both sequences, at quency of action potentials over a given period of time.
6 spikes/10 ms.* [23] Until recently, scientists had put This model is especially important for sound localization,
the most emphasis on rate encoding as an explanation which occurs within the brain on the order of millisec-
for post-synaptic potential patterns. However, functions onds. The brain must obtain a large quantity of infor-
of the brain are more temporally precise than the use of mation based on a relatively short neural response. Ad-
only rate encoding seems to allow. In other words, essen- ditionally, if low firing rates on the order of ten spikes
tial information could be lost due to the inability of the per second must be distinguished from arbitrarily close
rate code to capture all the available information of the rate coding for different stimuli, then a neuron trying to
spike train. In addition, responses are different enough discriminate these two stimuli may need to wait for a sec-
between similar (but not identical) stimuli to suggest thatond or more to accumulate enough information. This is
the distinct patterns of spikes contain a higher volume of not consistent with numerous organisms which are able
information than is possible to include in a rate code.* [24]
to discriminate between stimuli in the time frame of mil-
Temporal codes employ those features of the spiking ac- liseconds, *
suggesting that a rate code is not the only model
tivity that cannot be described by the firing rate. For at work. [23]
example, time to first spike after the stimulus onset, To account for the fast encoding of visual stimuli, it has
characteristics based on the second and higher statistical been suggested that neurons of the retina encode visual in-
moments of the ISI probability distribution, spike ran- formation in the latency time between stimulus onset and
42 CHAPTER 6. NEURAL CODING

first action potential, also called latency to first spike.* [27] Temporal coding applications
This type of temporal coding has been shown also in the
auditory and somato-sensory system. The main drawback The specificity of temporal coding requires highly refined
of such a coding scheme is its sensitivity to intrinsic neu- technology to measure informative, reliable, experimen-
ronal fluctuations.* [28] In the primary visual cortex of tal data. Advances made in optogenetics allow neurolo-
macaques, the timing of the first spike relative to the start gists to control spikes in individual neurons, offering elec-
of the stimulus was found to provide more information trical and spatial single-cell resolution. For example, blue
than the interval between spikes. However, the interspike light causes the light-gated ion channel channelrhodopsin
interval could be used to encode additional information, to open, depolarizing the cell and producing a spike.
which is especially important when the spike rate reaches When blue light is not sensed by the cell, the channel
its limit, as in high-contrast situations. For this reason, closes, and the neuron ceases to spike. The pattern of
temporal coding may play a part in coding defined edges the spikes matches the pattern of the blue light stim-
rather than gradual transitions.* [29] uli. By inserting channelrhodopsin gene sequences into
mouse DNA, researchers can control spikes and therefore
The mammalian gustatory system is useful for studying
certain behaviors of the mouse (e.g., making the mouse
temporal coding because of its fairly distinct stimuli and
turn left).* [33] Researchers, through optogenetics, have
the easily discernible responses of the organism.* [30]
the tools to effect different temporal codes in a neuron
Temporally encoded information may help an organism
while maintaining the same mean firing rate, and thereby
discriminate between different tastants of the same cat-
can test whether or not temporal coding occurs in specific
egory (sweet, bitter, sour, salty, umami) that elicit very
neural circuits.* [34]
similar responses in terms of spike count. The tempo-
ral component of the pattern elicited by each tastant may Optogenetic technology also has the potential to enable
be used to determine its identity (e.g., the difference be- the correction of spike abnormalities at the root of several
tween two bitter tastants, such as quinine and denato- neurological and psychological disorders.* [34] If neurons
nium). In this way, both rate coding and temporal coding do encode information in individual spike timing pat-
may be used in the gustatory system – rate for basic tas- terns, key signals could be missed by attempting to crack
tant type, temporal for more specific differentiation.* [31] the code while looking only at mean firing rates.* [23] Un-
Research on mammalian gustatory system has shown that derstanding any temporally encoded aspects of the neural
there is an abundance of information present in tempo- code and replicating these sequences in neurons could al-
ral patterns across populations of neurons, and this in- low for greater control and treatment of neurological dis-
formation is different from that which is determined by orders such as depression, schizophrenia, and Parkinson's
rate coding schemes. Groups of neurons may synchro- disease. Regulation of spike intervals in single cells more
nize in response to a stimulus. In studies dealing with precisely controls brain activity than the addition of phar-
the front cortical portion of the brain in primates, pre- macological agents intravenously.* [33]
cise patterns with short time scales only a few millisec-
onds in length were found across small populations of
neurons which correlated with certain information pro- Phase-of-firing code
cessing behaviors. However, little information could be
determined from the patterns; one possible theory is they Phase-of-firing code is a neural coding scheme that com-
represented the higher-order processing taking place in bines the spike count code with a time reference based on
the brain.* [24] oscillations. This type of code takes into account a time
label for each spike according to a time reference based
As with the visual system, in mitral/tufted cells in the on phase of local ongoing oscillations at low* [35] or high
olfactory bulb of mice, first-spike latency relative to the frequencies.* [36] A feature of this code is that neurons
start of a sniffing action seemed to encode much of the adhere to a preferred order of spiking, resulting in firing
information about an odor. This strategy of using spike sequence.* [37]
latency allows for rapid identification of and reaction to an
odorant. In addition, some mitral/tufted cells have spe- It has been shown that neurons in some cortical sen-
cific firing patterns for given odorants. This type of extra sory areas encode rich naturalistic stimuli in terms of
information could help in recognizing a certain odor, but their spike times relative to the phase of ongoing net-
is not completely necessary, as average spike count over work fluctuations, rather than only in terms of their spike
the course of the animal's sniffing was also a good iden- count.* [35]* [38] Oscillations reflect local field potential
tifier.* [32] Along the same lines, experiments done with signals. It is often categorized as a temporal code al-
the olfactory system of rabbits showed distinct patterns though the time label used for spikes is coarse grained.
which correlated with different subsets of odorants, and That is, four discrete values for phase are enough to rep-
a similar result was obtained in experiments with the lo- resent all the information content in this kind of code
cust olfactory system.* [23] with respect to the phase of oscillations in low frequen-
cies. Phase-of-firing code is loosely based on the phase
precession phenomena observed in place cells of the
hippocampus.
6.3. CODING SCHEMES 43

Phase code has been shown in visual cortex to involve but overlapping selectivities, so that many neurons, but
also high-frequency oscillations.* [37] Within a cycle of not necessarily all, respond to a given stimulus.
gamma oscillation, each neuron has it own preferred rel- Typically an encoding function has a peak value such that
ative firing time. As a result, an entire population of neu- activity of the neuron is greatest if the perceptual value is
rons generates a firing sequence that has a duration of up close to the peak value, and becomes reduced accordingly
to about 15 ms.* [37] for values less close to the peak value.
It follows that the actual perceived value can be recon-
6.3.3 Population coding structed from the overall pattern of activity in the set
of neurons. The Johnson/Georgopoulos vector coding
is an example of simple averaging. A more sophisti-
Population coding is a method to represent stimuli by us-
cated mathematical technique for performing such a re-
ing the joint activities of a number of neurons. In popu-
construction is the method of maximum likelihood based
lation coding, each neuron has a distribution of responses
on a multivariate distribution of the neuronal responses.
over some set of inputs, and the responses of many neu-
These models can assume independence, second order
rons may be combined to determine some value about the
correlations ,* [44] or even more detailed dependencies
inputs.
such as higher order maximum entropy models* [45] or
From the theoretical point of view, population coding is copulas.* [46]
one of a few mathematically well-formulated problems
in neuroscience. It grasps the essential features of neu-
ral coding and yet, is simple enough for theoretic anal- Correlation coding
ysis.* [39] Experimental studies have revealed that this
coding paradigm is widely used in the sensor and mo- The correlation coding model of neuronal firing claims
tor areas of the brain. For example, in the visual area that correlations between action potentials, or “spikes”
medial temporal (MT), neurons are tuned to the mov- , within a spike train may carry additional information
ing direction.* [40] In response to an object moving in above and beyond the simple timing of the spikes. Early
a particular direction, many neurons in MT fire, with a work suggested that correlation between spike trains can
noise-corrupted and bell-shaped activity pattern across only reduce, and never increase, the total mutual infor-
the population. The moving direction of the object is re- mation present in the two spike trains about a stimu-
trieved from the population activity, to be immune from lus feature.* [47] However, this was later demonstrated
the fluctuation existing in a single neuron’s signal. In one to be incorrect. Correlation structure can increase in-
classic example in the primary motor cortex, Apostolos formation content if noise and signal correlations are
Georgopoulos and colleagues trained monkeys to move a of opposite sign.* [48] Correlations can also carry in-
joystick towards a lit target.* [41]* [42] They found that formation not present in the average firing rate of two
a single neuron would fire for multiple target directions. pairs of neurons. A good example of this exists in the
However it would fire fastest for one direction and more pentobarbital-anesthetized marmoset auditory cortex, in
slowly depending on how close the target was to the neu- which a pure tone causes an increase in the number of
ron's 'preferred' direction. correlated spikes, but not an increase in the mean firing
rate, of pairs of neurons.* [49]
Kenneth Johnson originally derived that if each neuron
represents movement in its preferred direction, and the
vector sum of all neurons is calculated (each neuron has Independent-spike coding
a firing rate and a preferred direction), the sum points in
the direction of motion. In this manner, the population of The independent-spike coding model of neuronal firing
neurons codes the signal for the motion. This particular claims that each individual action potential, or “spike”
population code is referred to as population vector cod- , is independent of each other spike within the spike
ing. This particular study divided the field of motor phys- train.* [50]* [51]
iologists between Evarts' “upper motor neuron”group,
which followed the hypothesis that motor cortex neurons
Position coding
contributed to control of single muscles, and the Geor-
gopoulos group studying the representation of movement A typical population code involves neurons with a Gaus-
directions in cortex. sian tuning curve whose means vary linearly with the
Population coding has a number of advantages, includ- stimulus intensity, meaning that the neuron responds most
ing reduction of uncertainty due to neuronal variability strongly (in terms of spikes per second) to a stimulus near
and the ability to represent a number of different stim- the mean. The actual intensity could be recovered as the
ulus attributes simultaneously. Population coding is also stimulus level corresponding to the mean of the neuron
much faster than rate coding and can reflect changes in the with the greatest response. However, the noise inherent
stimulus conditions nearly instantaneously.* [43] Individ- in neural responses means that a maximum likelihood es-
ual neurons in such a population typically have different timation function is more accurate.
44 CHAPTER 6. NEURAL CODING

distributed across neurons. A major result in neural cod-

ing from Olshausen et al.* [53] is that sparse coding of
natural images produces wavelet-like oriented filters that
resemble the receptive fields of simple cells in the visual
cortex. The capacity of sparse codes may be increased
by simultaneous use of temporal coding, as found in the
locust olfactory system.* [54]
Given a potentially large set of input patterns, sparse cod-
ing algorithms (e.g. Sparse Autoencoder) attempt to au-
tomatically find a small number of representative patterns
which, when combined in the right proportions, repro-
duce the original input patterns. The sparse coding for
Plot of typical position coding the input then consists of those representative patterns.
For example, the very large set of English sentences can
be encoded by a small number of symbols (i.e. letters,
numbers, punctuation, and spaces) combined in a partic-
ular order for a particular sentence, and so a sparse coding
for English would be those symbols.

Linear Generative Model

Most models of sparse coding are based on the linear gen-

erative model.* [55] In this model, the symbols are com-
bined in a linear fashion to approximate the input.

Neural responses are noisy and unreliable.

More formally, given a k-dimensional set of real-
numbered input vectors ξ⃗ ∈ Rk , the goal of sparse
coding is to determine n k-dimensional basis vectors
This type of code is used to encode continuous variables b⃗1 , . . . , b⃗n ∈ Rk along with a sparse n-dimensional vec-
such as joint position, eye position, color, or sound fre- tor of weights or coefficients ⃗s ∈ Rn for each input vec-
quency. Any individual neuron is too noisy to faithfully tor, so that a linear combination of the basis vectors with
encode the variable using rate coding, but an entire popu- proportions given by the coefficients results∑n in a close ap-
lation ensures greater fidelity and precision. For a popula- proximation to the input vector: ξ⃗ ≈ j=1 sj⃗bj .* [56]
tion of unimodal tuning curves, i.e. with a single peak, theThe codings generated by algorithms implementing a
precision typically scales linearly with the number of neu- linear generative model can be classified into codings
rons. Hence, for half the precision, half as many neurons with soft sparseness and those with hard sparseness.* [55]
are required. In contrast, when the tuning curves have These refer to the distribution of basis vector coeffi-
multiple peaks, as in grid cells that represent space, the cients for typical inputs. A coding with soft sparseness
precision of the population can scale exponentially with has a smooth Gaussian-like distribution, but peakier than
the number of neurons. This greatly reduces the number Gaussian, with many zero values, some small absolute
of neurons required for the same precision.* [52] values, fewer larger absolute values, and very few very
large absolute values. Thus, many of the basis vectors
are active. Hard sparseness, on the other hand, indicates
6.3.4 Sparse coding that there are many zero values, no or hardly any small
absolute values, fewer larger absolute values, and very
The sparse code is when each item is encoded by the few very large absolute values, and thus few of the ba-
strong activation of a relatively small set of neurons. For sis vectors are active. This is appealing from a metabolic
each item to be encoded, this is a different subset of all perspective: less energy is used when fewer neurons are
available neurons. firing.* [55]
As a consequence, sparseness may be focused on tempo- Another measure of coding is whether it is critically com-
ral sparseness “( a relatively small number of time periods plete or overcomplete. If the number of basis vectors n
are active”) or on the sparseness in an activated popula- is equal to the dimensionality k of the input set, the cod-
tion of neurons. In this latter case, this may be defined in ing is said to be critically complete. In this case, smooth
one time period as the number of activated neurons rela- changes in the input vector result in abrupt changes in the
tive to the total number of neurons in the population. This coefficients, and the coding is not able to gracefully han-
seems to be a hallmark of neural computations since com- dle small scalings, small translations, or noise in the in-
pared to traditional computers, information is massively puts. If, however, the number of basis vectors is larger
6.5. REFERENCES 45

than the dimensionality of the input set, the coding is [6] Chen, G; Wang, LP; Tsien, JZ (2009). “Neu-
overcomplete. Overcomplete codings smoothly interpo- ral population-level memory traces in the mouse
late between input vectors and are robust under input hippocampus”. PLoS One. 4 (12): e8256.
noise.* [57] The human primary visual cortex is estimated doi:10.1371/journal.pone.0008256. PMID 20016843.
to be overcomplete by a factor of 500, so that, for exam-
[7] Zhang, H; Chen, G; Kuang, H; Tsien, JZ (Nov
ple, a 14 x 14 patch of input (a 196-dimensional space)
2013). “Mapping and deciphering neural codes of
is coded by roughly 100,000 neurons.* [55] NMDA receptor-dependent fear memory engrams in
the hippocampus”. PLoS One. 8 (11): e79454.
doi:10.1371/journal.pone.0079454. PMID 24302990.
6.4 See also
[8] Brain Decoding Project. http://braindecodingproject.
org/
• Models of neural computation
[9] The Simons Collaboration on the Global Brain.
• Neural correlate http://www.simonsfoundation.org/life-sciences/
simons-collaboration-on-the-global-brain/
• Cognitive map
[10] Burcas G.T & Albright T.D. Gauging sensory representa-
• Neural decoding tions in the brain. http://www.vcl.salk.edu/Publications/
PDF/Buracas_Albright_1999_TINS.pdf
• Deep learning
[11] Gerstner W, Kreiter AK, Markram H, Herz AV; Kreiter;
• Autoencoder Markram; Herz (November 1997). “Neural codes: fir-
ing rates and beyond”. Proc. Natl. Acad. Sci. U.S.A.
• Vector quantization 94 (24): 12740–1. Bibcode:1997PNAS...9412740G.
doi:10.1073/pnas.94.24.12740. PMC 34168. PMID
• Binding problem
9398065.
• Artificial neural network
[12] Aur D., Jog, MS., 2010 Neuroelectrodynamics: Un-
derstanding the brain language, IOS Press, 2010,
• Grandmother cell
doi:10.3233/978-1-60750-473-3-i
• Feature integration theory
[13] Aur, D.; Connolly, C.I.; Jog, M.S. (2005). “Computing
• pooling spike directivity with tetrodes”. J. Neurosci 149 (1): 57–
63. doi:10.1016/j.jneumeth.2005.05.006.
• Sparse distributed memory
[14] Aur, D.; Jog, M.S. (2007). “Reading the Neural Code:
What do Spikes Mean for Behavior?". Nature Precedings.
doi:10.1038/npre.2007.61.1.
6.5 References
[15] Fraser, A.; Frey, A. H. (1968). “Electromagnetic emis-
sion at micron wavelengths from active nerves”. Bio-
[1] Brown EN, Kass RE, Mitra PP (May 2004). “Multi-
physical Journal 8 (6): 731–734. doi:10.1016/s0006-
ple neural spike train data analysis: state-of-the-art and
3495(68)86517-8.
future challenges”. Nat. Neurosci. 7 (5): 456–61.
doi:10.1038/nn1228. PMID 15114358.
[16] Aur, D (2012). “A comparative analysis of inte-
[2] Thorpe, S.J. (1990). “Spike arrival times: A highly effi- grating visual information in local neuronal ensembles”
cient coding scheme for neural networks”(PDF). In Eck- . Journal of neuroscience methods 207 (1): 23–30.
miller, R.; Hartmann, G.; Hauske, G. Parallel processing doi:10.1016/j.jneumeth.2012.03.008. PMID 22480985.
in neural systems and computers (PDF). North-Holland.
pp. 91–94. ISBN 978-0-444-88390-2. [17] Kandel, E.; Schwartz, J.; Jessel, T.M. (1991). Principles
of Neural Science (3rd ed.). Elsevier. ISBN 0444015620.
[3] Gerstner, Wulfram; Kistler, Werner M. (2002). Spiking
Neuron Models: Single Neurons, Populations, Plasticity. [18] Adrian ED & Zotterman Y. (1926). “The impulses pro-
Cambridge University Press. ISBN 978-0-521-89079-3. duced by sensory nerve endings: Part II: The response of
a single end organ.”. J Physiol (Lond.) 61: 151–171.
[4] Stein RB, Gossen ER, Jones KE (May 2005). “Neu-
ronal variability: noise or part of the signal?". Nat. Rev. [19] http://icwww.epfl.ch/~{}gerstner/SPNM/node7.html
Neurosci. 6 (5): 389–97. doi:10.1038/nrn1668. PMID
15861181. [20] Dayan, Peter; Abbott, L. F. (2001). Theoretical Neu-
roscience: Computational and Mathematical Modeling of
[5] The Memory Code. http://www.scientificamerican.com/ Neural Systems. Massachusetts Institute of Technology
article/the-memory-code/ Press. ISBN 978-0-262-04199-7.
46 CHAPTER 6. NEURAL CODING

[21] Butts DA, Weng C, Jin J et al. (September [35] Montemurro MA, Rasch MJ, Murayama Y, Logothetis
2007). “Temporal precision in the neural code NK, Panzeri S (March 2008). “Phase-of-ﬁring coding
and the timescales of natural vision”. Nature of natural visual stimuli in primary visual cortex”. Curr.
449 (7158): 92–5. Bibcode:2007Natur.449...92B. Biol. 18 (5): 375–80. doi:10.1016/j.cub.2008.02.023.
doi:10.1038/nature06105. PMID 17805296. PMID 18328702.

[22] J. Leo van Hemmen, TJ Sejnowski. 23 Problems in Sys- [36] Fries P, Nikolić D, Singer W (July 2007). “The
tems Neuroscience. Oxford Univ. Press, 2006. p.143- gamma cycle”. Trends Neurosci. 30 (7): 309–16.
158. doi:10.1016/j.tins.2007.05.005. PMID 17555828.
[23] Theunissen, F; Miller, JP (1995). “Temporal En- [37] Havenith MN, Yu S, Biederlack J, Chen NH, Singer
coding in Nervous Systems: A Rigorous Definition” W, Nikolić D (June 2011). “Synchrony makes neu-
. Journal of Computational Neuroscience 2: 149–162. rons fire in sequence, and stimulus properties deter-
doi:10.1007/bf00961885. mine who is ahead”. J. Neurosci. 31 (23): 8570–
84. doi:10.1523/JNEUROSCI.2817-10.2011. PMID
[24] Zador, Stevens, Charles, Anthony. “The enigma of the
21653861.
brain”. © Current Biology 1995, Vol 5 No 12. Retrieved
4/08/12. Check date values in: |accessdate= (help)
[38] Spike arrival times: A highly efficient coding scheme for
[25] Kostal L, Lansky P, Rospars JP (November 2007). neural networks, SJ Thorpe - Parallel processing in neural
“Neuronal coding and spiking randomness”. Eur. J. systems, 1990
Neurosci. 26 (10): 2693–701. doi:10.1111/j.1460-
9568.2007.05880.x. PMID 18001270. [39] Wu S, Amari S, Nakahara H (May 2002). “Popu-
lation coding and decoding in a neural field: a com-
[26] Geoffrois, E.; Edeline, J.M.; Vibert, J.F. (1994).“Learn- putational study”. Neural Comput 14 (5): 999–1026.
ing by Delay Modifications”. In Eeckman, Frank H. Com- doi:10.1162/089976602753633367. PMID 11972905.
putation in Neurons and Neural Systems. Springer. pp.
133–8. ISBN 978-0-7923-9465-5. [40] Maunsell JH, Van Essen DC (May 1983). “Functional
properties of neurons in middle temporal visual area of
[27] Gollisch, T.; Meister, M. (22 February 2008). “Rapid the macaque monkey. I. Selectivity for stimulus direction,
Neural Coding in the Retina with Relative Spike speed, and orientation”. J. Neurophysiol. 49 (5): 1127–
Latencies”. Science 319 (5866): 1108–1111. 47. PMID 6864242.
doi:10.1126/science.1149639.
[41] Intro to Sensory Motor Systems Ch. 38 page 766
[28] Wainrib, Gilles; Michèle, Thieullen; Khashayar, Pak-
daman (7 April 2010). “Intrinsic variability of latency [42] Science. 1986 Sep 26;233(4771):1416-9
to first-spike”. Biological Cybernetics 103 (1): 43–56.
doi:10.1007/s00422-010-0384-8. [43] Hubel DH, Wiesel TN (October 1959).“Receptive fields
of single neurones in the cat's striate cortex”. J. Phys-
[29] Victor, Johnathan D (2005). “Spike train metrics” iol. (Lond.) 148 (3): 574–91. PMC 1363130. PMID
. Current Opinion in Neurobiology 15 (5): 585–592. 14403679.
doi:10.1016/j.conb.2005.08.002.
[44] Schneidman, E, Berry, MJ, Segev, R, Bialek, W (2006),
[30] Hallock, Robert M.; Di Lorenzo, Patricia M. (2006).
Weak Pairwise Correlations Imply Strongly Correlated Net-
“Temporal coding in the gustatory system”. Neuro-
work States in a Neural Population 440, Nature 440, 1007-
science & Biobehavioral Reviews 30 (8): 1145–1160.
1012, doi:10.1038/nature04701
doi:10.1016/j.neubiorev.2006.07.005.

[31] Carleton, Alan; Accolla, Riccardo; Simon, Sidney A. [45] Amari, SL (2001), Information Geometry on Hierarchy of
(2010). “Coding in the mammalian gustatory sys- Probability Distributions, IEEE Transactions on Informa-
tem”. Trends in Neurosciences 33 (7): 326–334. tion Theory 47, 1701-1711, CiteSeerX: 10.1.1.46.5226
doi:10.1016/j.tins.2010.04.002.
[46] Onken, A, Grünewälder, S, Munk, MHJ, Obermayer,
[32] Wilson, Rachel I (2008). “Neural and behav- K (2009), Analyzing Short-Term Noise Dependencies of
ioral mechanisms of olfactory perception”. Cur- Spike-Counts in Macaque Prefrontal Cortex Using Copu-
rent Opinion in Neurobiology 18 (4): 408–412. las and the Flashlight Transformation, PLoS Comput Biol
doi:10.1016/j.conb.2008.08.015. 5(11): e1000577, doi:10.1371/journal.pcbi.1000577

[33] Karl Diesseroth, Lecture.“Personal Growth Series: Karl [47] Johnson, KO (Jun 1980). J Neurophysiol 43 (6): 1793–
Diesseroth on Cracking the Neural Code.”Google Tech 815. Missing or empty |title= (help)
Talks. November 21, 2008. http://www.youtube.com/
watch?v=5SLdSbp6VjM [48] Panzeri, Schultz, Treves, Rolls, Proc Biol Sci. 1999 May
22;266(1423):1001-12.
[34] Han X, Qian X, Stern P, Chuong AS, Boyden ES. “In-
formational lesions: optical perturbations of spike timing [49] Nature 381 (6583): 610–3. Jun 1996.
and neural synchrony via microbial opsin gene fusions.” doi:10.1038/381610a0. Missing or empty |title=
Cambridge, MA: MIT Media Lad, 2009. (help)
6.6. FURTHER READING 47

[50] Dayan P & Abbott LF. Theoretical Neuroscience: Com- • Tsien, JZ. et al. (2014). “On initial Brain Ac-
putational and Mathematical Modeling of Neural Systems. tivity Mapping of episodic and semantic mem-
Cambridge, Massachusetts: The MIT Press; 2001. ISBN ory code in the hippocampus”. Neurobiol-
0-262-04199-5 ogy of Learning and Memory 105: 200–210.
doi:10.1016/j.nlm.2013.06.019.
[51] Rieke F, Warland D, de Ruyter van Steveninck R, Bialek
W. Spikes: Exploring the Neural Code. Cambridge, Mas-
sachusetts: The MIT Press; 1999. ISBN 0-262-68108-0

[52] Mathis A, Herz AV, Stemmler MB; Herz; Stemm-

ler (July 2012). “Resolution of nested neu-
ronal representations can be exponential in the
number of neurons”. Phys. Rev. Lett. 109
(1): 018103. Bibcode:2012PhRvL.109a8103M.
doi:10.1103/PhysRevLett.109.018103. PMID
23031134.

[53] Olshausen, Bruno A. “Emergence of simple-cell recep-

tive ﬁeld properties by learning a sparse code for natural
images.”Nature 381.6583 (1996): 607-609.

[54] Gupta, N; Stopfer, M (6 October 2014). “A

temporal channel for information in sparse sensory
coding.”. Current Biology 24 (19): 2247–56.
doi:10.1016/j.cub.2014.08.021. PMID 25264257.

[55] Rehn, Martin; Sommer, Friedrich T. (2007). “A net-

work that uses few active neurones to code visual input
predicts the diverse shapes of cortical receptive ﬁelds”
(PDF). Journal of Computational Neuroscience 22: 135–
146. doi:10.1007/s10827-006-0003-9.

[56] Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew

Y. (2006). “Eﬃcient sparse coding algorithms” (PDF).
Advances in Neural Information Processing Systems.

[57] Olshausen, Bruno A.; Field, David J. (1997). “Sparse

Coding with an Overcomplete Basis Set: A Strategy Em-
ployed by V1?" (PDF). Vision Research 37 (23): 3311–
3325. doi:10.1016/s0042-6989(97)00169-7.

6.6 Further reading

• Földiák P, Endres D, Sparse coding, Scholarpedia,
3(1):2984, 2008.

• Dayan P & Abbott LF. Theoretical Neuroscience:

Computational and Mathematical Modeling of Neu-
ral Systems. Cambridge, Massachusetts: The MIT
Press; 2001. ISBN 0-262-04199-5

• Rieke F, Warland D, de Ruyter van Steveninck R,

Bialek W. Spikes: Exploring the Neural Code. Cam-
bridge, Massachusetts: The MIT Press; 1999. ISBN
0-262-68108-0

• Olshausen, B. A.; Field, D. J. “Emergence of

simple-cell receptive ﬁeld properties by learning a
sparse code for natural images”. Nature 381 (6583):
607–9. doi:10.1038/381607a0.
Chapter 7

Word embedding

Word embedding is the collective name for a set of Sentiment Treebank” (PDF). Conference on Empirical
language modeling and feature learning techniques in Methods in Natural Language Processing.
natural language processing where words from the vocab-
ulary (and possibly phrases thereof) are mapped to vec-
tors of real numbers in a low dimensional space, relative
to the vocabulary size (“continuous space”).
There are several methods for generating this mapping.
They include neural networks,* [1] dimensionality reduc-
tion on the word co-occurrence matrix,* [2] and explicit
representation in terms of the context in which words ap-
pear.* [3]
Word and phrase embeddings, when used as the under-
lying input representation, have been shown to boost the
performance in NLP tasks such as syntactic parsing* [4]
and sentiment analysis.* [5]

7.1 See also

• Brown clustering

7.2 References
[1] Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado,
Greg; Dean, Jeﬀrey (2013). “Distributed Representa-
tions of Words and Phrases and their Compositionality”.
arXiv:1310.4546 [cs.CL].

[2] Lebret, Rémi; Collobert, Ronan (2013). “Word

Emdeddings through Hellinger PCA”. arXiv:1312.5542
[cs.CL].

[3] Levy, Omer; Goldberg, Yoav.“Linguistic Regularities in

Sparse and Explicit Word Representations” (PDF). Pro-
ceedings of the Eighteenth Conference on Computational
Natural Language Learning, Baltimore, Maryland, USA,
June. Association for Computational Linguistics. 2014.

[4] Socher, Richard; Bauer, John; Manning, Christopher; Ng,

Andrew. “Parsing with compositional vector grammars”
(PDF). Proceedings of the ACL conference. 2013.

[5] Socher, Richard; Perelygin, Alex; Wu, Jean; Chuang, Ja-

son; Manning, Chris; Ng, Andrew; Potts, Chris. “Recur-
sive Deep Models for Semantic Compositionality Over a

48
Chapter 8

Deep belief network

In machine learning, a deep belief network (DBN) is a next. This also leads to a fast, layer-by-layer unsupervised
generative graphical model, or alternatively a type of deep training procedure, where contrastive divergence is ap-
neural network, composed of multiple layers of latent plied to each sub-network in turn, starting from the“low-
variables (“hidden units”), with connections between est”pair of layers (the lowest visible layer being a training
the layers but not between units within each layer.* [1] set).
The observation, due to Geoﬀrey Hinton's student Yee-
Whye Teh,* [2] that DBNs can be trained greedily, one
layer at a time, has been called a breakthrough in deep
learning.* [4]* :6
Hidden layer 3
8.1 Training algorithm
The training algorithm for DBNs proceeds as follows. Let
X be a matrix of inputs, regarded as a set of feature vec-
Hidden layer 2 tors.* [2]

1. Train a restricted Boltzmann machine on X to obtain

its weight matrix, W. Use this as the weight matrix
for between the lower two layers of the network.
2. Transform X by the RBM to produce new data X',
Hidden layer 1 either by sampling or by computing the mean acti-
vation of the hidden units.
3. Repeat this procedure with X ← X' for the next pair
of layers, until the top two layers of the network are
reached.

Visible layer (observed) 8.2 See also

• Bayesian network
Schematic overview of a deep belief net. Arrows represent di-
rected connections in the graphical model that the net represents. • Deep learning

When trained on a set of examples in an unsupervised

way, a DBN can learn to probabilistically reconstruct its 8.3 References
inputs. The layers then act as feature detectors on in-
puts.* [1] After this learning step, a DBN can be further [1] Hinton, G. (2009).“Deep belief networks”. Scholarpedia
trained in a supervised way to perform classiﬁcation.* [2] 4 (5): 5947. doi:10.4249/scholarpedia.5947.
DBNs can be viewed as a composition of simple, un- [2] Hinton, G. E.; Osindero, S.; Teh, Y. W. (2006).
supervised networks such as restricted Boltzmann ma- “A Fast Learning Algorithm for Deep Belief Nets”
chines (RBMs)* [1] or autoencoders,* [3] where each sub- (PDF). Neural Computation 18 (7): 1527–1554.
network's hidden layer serves as the visible layer for the doi:10.1162/neco.2006.18.7.1527. PMID 16764513.

49
50 CHAPTER 8. DEEP BELIEF NETWORK

[3] Yoshua Bengio; Pascal Lamblin; Dan Popovici; Hugh

Larochelle (2007). Greedy Layer-Wise Training of Deep
Networks (PDF). NIPS.

[4] Bengio, Y. (2009).“Learning Deep Architectures for AI”

(PDF). Foundations and Trends in Machine Learning 2.
doi:10.1561/2200000006.

8.4 External links

• “Deep Belief Networks”. Deep Learning Tutorials.

• “Deep Belief Network Example”. Deeplearning4j

Tutorials.
Chapter 9

Convolutional neural network

For other uses, see CNN (disambiguation). volutional neural networks use relatively little pre-
processing. This means that the network is responsible
In machine learning, a convolutional neural network for learning the filters that in traditional algorithms were
hand-engineered. The lack of a dependence on prior-
(or CNN) is a type of feed-forward artificial neural net-
work where the individual neurons are tiled in such knowledge and the existence of difficult to design hand-
engineered features is a major advantage for CNNs.
a way that they respond to overlapping regions in the
visual field.* [1] Convolutional networks were inspired by
biological processes* [2] and are variations of multilayer
perceptrons which are designed to use minimal amounts 9.2 History
of preprocessing.* [3] They are widely used models for
image and video recognition. The design of convolutional neural networks follows the
discovery of visual mechanisms in living organisms. In
our brain, the visual cortex contains lots of cells. These
cells are responsible for detecting light in small, overlap-
9.1 Overview ping sub-regions of the visual field, called receptive fields.
These cells act as local filters over the input space. The
When used for image recognition, convolutional neural more complex cells have larger receptive fields. A con-
networks (CNNs) consist of multiple layers of small neu- volution operator is created to perform the same function
ron collections which look at small portions of the input by all of these cells.
image, called receptive fields. The results of these collec- Convolutional neural networks were introduced in a 1980
tions are then tiled so that they overlap to obtain a better paper by Kunihiko Fukushima.* [7]* [9] In 1988 they were
representation of the original image; this is repeated for separately developed, with explicit parallel and trainable
every such layer. Because of this, they are able to toler- convolutions for temporal signals, by Toshiteru Homma,
ate translation of the input image.* [4] Convolutional net-
Les Atlas, and Robert J. Marks II.* [10] Their design was
works may include local or global pooling layers, which later improved in 1998 by Yann LeCun, Léon Bottou,
combine the outputs of neuron clusters.* [5]* [6] They also
Yoshua Bengio, and Patrick Haffner,* [11] generalized
consist of various combinations of convolutional layers in 2003 by Sven Behnke,* [12] and simplified by Patrice
and fully connected layers, with pointwise nonlinearity
Simard, David Steinkraus, and John C. Platt in the same
applied at the end of or after each layer.* [7] It is inspired year.* [13] The famous LeNet-5 network can classify dig-
by biological process. To avoid the situation that there ex-
its successfully, which is applied to recognize checking
ist billions of parameters if all layers are fully connected, numbers. However, given more complex problems the
the idea of using a convolution operation on small regions, breadth and depth of the network will continue to in-
has been introduced. One major advantage of convolu- crease which would become limited by computing re-
tional networks is the use of shared weight in convolu- sources. The approach used LeNet did not perform well
tional layers, which means that the same filter (weights with more complex problems.
bank) is used for each pixel in the layer; this both reduces
required memory size and improves performance.* [3] With the rise of efficient GPU computing, it has become
possible to train larger networks. In 2006 several pub-
Some Time delay neural networks also use a very similar lications described more efficient ways to train convolu-
architecture to convolutional neural networks, especially tional neural networks with more layers.* [14]* [15]* [16]
those for image recognition and/or classification tasks, In 2011, they were refined by Dan Ciresan et al. and
since the “tiling”of the neuron outputs can easily be were implemented on a GPU with impressive perfor-
carried out in timed stages in a manner useful for analy- mance results.* [5] In 2012, Dan Ciresan et al. signifi-
sis of images.* [8] cantly improved upon the best performance in the litera-
Compared to other image classification algorithms, con- ture for multiple image databases, including the MNIST

51
52 CHAPTER 9. CONVOLUTIONAL NEURAL NETWORK

database, the NORB database, the HWDB1.0 dataset Dropout “layer”

(Chinese characters), the CIFAR10 dataset (dataset of
60000 32x32 labeled RGB images),* [7] and the Ima-
geNet dataset.* [17]

9.3 Details Since a fully connected layer occupies most of the pa-
rameters, it is prone to overfitting. The dropout method
*
9.3.1 Backpropagation [19] is introduced to prevent overfitting. That paper
defines (the simplest form of) dropout as: “The only
When doing propagation, the momentum and weight difference”, from “Learning algorithms developed for
decay values are chosen to reduce oscillation during Restricted_Boltzmann_machine such as Contrastive Di-
stochastic gradient descent. See Backpropagation for vergence”, “is that r”, the number derived (usually by
more. Sigmoid) from the incoming sum to the neural node from
other nodes, “is first sampled and only the hidden units
that are retained are used for training.”and“dropout can
9.3.2 Different types of layers be seen as multiplying by a Bernoulli_distribution ran-
dom variable rb that takes the value 1/p with probability
Convolutional layer p and 0 otherwise.”In other words, that simplest form
of dropout is to take the chance and see if it happens or
Unlike a hand-coded convolution kernel (Sobel, Prewitt, not, to observe if the neural node spikes/fires (instantly),
Roberts), in a convolutional neural net, the parameters instead of just remembering that chance happened.
of each convolution kernel are trained by the backprop- Dropout also significantly improves the speed of training.
agation algorithm. There are many convolution kernels This makes model combination practical, even for deep
in each layer, and each kernel is replicated over the en- neural nets. Dropout is performed randomly. In the in-
tire image with the same parameters. The function of the put layer, the probability of dropping a neuron is between
convolution operators is to extract different features of 0.5 and 1, while in the hidden layers, a probability of 0.5
the input. The capacity of a neural net varies, depending is used. The neurons that are dropped out, will not con-
on the number of layers. The first convolution layers will tribute to the forward pass and back propagation. This
obtain the low-level features, like edges, lines and cor- is equivalent to decreasing the number of neurons. This
ners. The more layers the network has, the higher-level will create neural networks with different architectures,
features it will get. but all of those networks will share the same weights.
In Neural_coding#Sparse_coding bio neurons generally
some react and some dont, at any moment of experi-
ReLU layer
ence, as we are not perfect machines, not to say brains
have the same chances and layer shapes used in simulated
ReLU is the abbreviation of Rectified Linear Units. This
Dropout, but we learn anyways through it.
is a layer of neurons that use the non-saturating activation
function f(x)=max(0,x). It increases the nonlinear prop- The biggest contribution of the dropout method is that,
erties of the decision function and of the overall network although it effectively generates 2^n neural nets, and as
without affecting the receptive fields of the convolution such, allows for model combination, at test time, only a
layer. single network needs to be tested. This is accomplished
by performing the test with the un-thinned network, while
Other functions are used to increase nonlinearity. For
multiplying the output weights of each neuron with the
example the saturating hyperbolic tangent f(x)=tanh(x),
probability of that neuron being retained (i.e. not dropped
f(x)=|tanh(x)|, and the sigmoid function f(x)=(1+e^(-x)
out).
)^(−1). Compared to tanh units, the advantage of ReLU
*
is that the neural network trains several times faster. [18] The“2^n neural nets”is accuracy but is pigeonholed by
far less than 2^n bits having come in during the life of the
neuralnet. We could the same way say Lambda_calculus
Pooling layer takes exponential time if it werent for using base2 in
memory instead of counting in Unary_numeral_system.
In order to reduce variance, pooling layers compute the So if each of 2^n abstract neural nets pushes at least 1
max or average value of a particular feature over a region bit through those weights, they must have taken exponen-
of the image. This will ensure that the same result will tially many turns sequentially since the bandwidth is not
be obtained, even when image features have small trans- that wide. P_versus_NP_problem millenium prize re-
lations. This is an important operation for object classi- mains unclaimed, even though there are great optimiza-
fication and detection. tions.
9.5. FINE-TUNING 53

Loss layer images that included faces at various angles and orienta-
tions and a further 20 million images without faces. They
It can use different loss functions for different tasks. Soft- used batches of 128 images over 50,000 iterations.* [23]
max loss is used for predicting a single class of K mutu-
ally exclusive classes. Sigmoid cross-entropy loss is used
for predicting K independent probability values in [0,1]. 9.4.2 Video analysis
Euclidean loss is used for regressing to real-valued labels
[-inf,inf] Video is more complex than images since it has an-
other temporal dimension. The common way is to
fuse the features of different convolutional neural net-
works, which are responsible for spatial and temporal
9.4 Applications stream.* [24]* [25]

9.4.1 Image recognition

9.4.3 Natural Language Processing
Convolutional neural networks are often used in image
recognition systems. They have achieved an error rate of Convolutional neural networks have also seen use in the
0.23 percent on the MNIST database, which as of Febru- field of natural language processing or NLP. Like the im-
ary 2012 is the lowest achieved on the database.* [7] An- age classification problem, some NLP tasks can be for-
other paper on using CNN for image classification re- mulated as assigning labels to words in a sentence. The
ported that the learning process was “surprisingly fast"; neural network trained raw material fashion will extract
in the same paper, the best published results at the time the features of the sentences. Using some classifiers, it
were achieved in the MNIST database and the NORB could predict new sentences.* [26]
database.* [5]
When applied to facial recognition, they were able to con- 9.4.4 Playing Go
tribute to a large decrease in error rate.* [20] In another
paper, they were able to achieve a 97.6 percent recogni- Convolutional neural networks have been used in
tion rate on“5,600 still images of more than 10 subjects” computer Go. In December 2014, Christopher Clark and
.* [2] CNNs have been used to assess video quality in an Amos Storkey published a paper showing a convolutional
objective way after being manually trained; the resulting network trained by supervised learning from a database of
system had a very low root mean square error.* [8] human professional games could outperform Gnu Go and
The ImageNet Large Scale Visual Recognition Challenge win some games against Monte Carlo tree search Fuego
is a benchmark in object classification and detection, with 1.1 in a fraction of the time it took Fuego to play.* [27]
millions of images and hundreds of object classes. In Shortly after it was announced that a large 12-layer con-
the ILSVRC 2014, which is large-scale visual recognition volutional neural network had correctly predicted the pro-
challenge, almost every highly ranked team used CNN as fessional move in 55% of positions, equalling the accu-
their basic framework. The winner GoogLeNet* [21] in- racy of a 6 dan human player. When the trained convo-
creased the mean average precision of object detection lutional network was used directly to play games of Go,
to 0.439329, and reduced classification error to 0.06656, without any search, it beat the traditional search program
the best result to date. Its network applied more than 30 GNU Go in 97% of games, and matched the performance
layers. Performance of convolutional neural networks, on of the Monte Carlo tree search program Fuego simulat-
the ImageNet tests, is now close to that of humans.* [22] ing ten thousand playouts (about a million positions) per
The best algorithms still struggle with objects that are move.* [28]
small or thin, such as a small ant on a stem of a flower
or a person holding a quill in their hand. They also have
trouble with images that have been distorted with filters, 9.5 Fine-tuning
an increasingly common phenomenon with modern dig-
ital cameras. By contrast, those kinds of images rarely For many applications, only a small amount of training
trouble humans. Humans, however, tend to have trou- data is available. Convolutional neural networks usually
ble with other issues. For example, they are not good at require a large amount of training data in order to avoid
classifying objects into fine-grained categories such as the over-fitting. A common technique is to train the network
particular breed of dog or species of bird, whereas con- on a larger data set from a related domain. Once the net-
volutional neural networks handle this with ease. work parameters have converged an additional training
In 2015 a many-layered CNN demonstrated the ability to step is performed using the in-domain data to fine-tune
spot faces from a wide range of angles, including upside the network weights. This allows convolutional networks
down, even when partially occluded with competitive per- to be successfully applied to problems with small training
formance. The network trained on a database of 200,000 sets.
54 CHAPTER 9. CONVOLUTIONAL NEURAL NETWORK

9.6 Common libraries [5] Ciresan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gam-
bardella; Jurgen Schmidhuber (2011). “Flexible, High
Performance Convolutional Neural Networks for Image
• Caffe: Caffe (replacement of Decaf) has been the
Classification”(PDF). Proceedings of the Twenty-Second
most popular library for convolutional neural net- international joint conference on Artificial Intelligence-
works. It is created by the Berkeley Vision and Volume Volume Two 2: 1237–1242. Retrieved 17
Learning Center (BVLC). The advantages are that it November 2013.
has cleaner architecture and faster speed. It supports
both CPU and GPU, easily switching between them. [6] Krizhevsky, Alex. “ImageNet Classification with Deep
It is developed in C++, and has Python and MAT- Convolutional Neural Networks” (PDF). Retrieved 17
LAB wrappers. In the developing of Caffe, proto- November 2013.
buf is used to make researchers tune the parameters [7] Ciresan, Dan; Meier, Ueli; Schmidhuber, Jürgen
easily as well as adding or removing layers. (June 2012). “Multi-column deep neural net-
works for image classification”. 2012 IEEE
• Torch7 (www.torch.ch) Conference on Computer Vision and Pattern Recog-
nition (New York, NY: Institute of Electrical
• OverFeat and Electronics Engineers (IEEE)): 3642–3649.
arXiv:1202.2745v1. doi:10.1109/CVPR.2012.6248110.
• Cuda-convnet ISBN 9781467312264. OCLC 812295155. Retrieved
2013-12-09.
• MatConvnet
[8] Le Callet, Patrick; Christian Viard-Gaudin; Dominique
• Theano: written in Python, using scientific python Barba (2006). “A Convolutional Neural Network
Approach for Objective Video Quality Assessment”
• Deeplearning4j: Deep learning in Java and Scala on (PDF). IEEE Transactions on Neural Networks 17 (5):
GPU-enabled Spark 1316–1327. doi:10.1109/TNN.2006.879766. PMID
17001990. Retrieved 17 November 2013.

[9] Fukushima, Kunihiko (1980). “Neocognitron: A Self-

9.7 See also organizing Neural Network Model for a Mechanism
of Pattern Recognition Unaﬀected by Shift in Posi-
• Neocognitron tion” (PDF). Biological Cybernetics 36 (4): 193–202.
doi:10.1007/BF00344251. PMID 7370364. Retrieved
• Convolution 16 November 2013.

• Deep learning [10] Homma, Toshiteru; Les Atlas; Robert Marks II (1988).
“An Artiﬁcial Neural Network for Spatio-Temporal Bipo-
• Time delay neural network lar Patters: Application to Phoneme Classiﬁcation”
(PDF). Advances in Neural Information Processing Sys-
tems 1: 31–40.

9.8 References [11] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick
Haﬀner (1998).“Gradient-based learning applied to doc-
ument recognition” (PDF). Proceedings of the IEEE 86
[1] “Convolutional Neural Networks (LeNet) - DeepLearn-
(11): 2278–2324. doi:10.1109/5.726791. Retrieved 16
ing 0.1 documentation”. DeepLearning 0.1. LISA Lab.
November 2013.
Retrieved 31 August 2013.
[12] S. Behnke. Hierarchical Neural Networks for Image In-
[2] Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; terpretation, volume 2766 of Lecture Notes in Computer
Yuji Kaneda (2003). “Subject independent facial ex- Science. Springer, 2003.
pression recognition with robust face detection using a
convolutional neural network” (PDF). Neural Networks [13] Simard, Patrice, David Steinkraus, and John C. Platt.
16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. “Best Practices for Convolutional Neural Networks Ap-
Retrieved 17 November 2013. plied to Visual Document Analysis.”In ICDAR, vol. 3,
pp. 958-962. 2003.
[3] LeCun, Yann.“LeNet-5, convolutional neural networks”
. Retrieved 16 November 2013. [14] Hinton, GE; Osindero, S; Teh, YW (Jul 2006). “A fast
learning algorithm for deep belief nets.”. Neural computa-
[4] Korekado, Keisuke; Morie, Takashi; Nomura, Osamu; tion 18 (7): 1527–54. doi:10.1162/neco.2006.18.7.1527.
Ando, Hiroshi; Nakano, Teppei; Matsugu, Masakazu; PMID 16764513.
Iwata, Atsushi (2003). “A Convolutional Neural Net-
work VLSI for Image Recognition Using Merged/Mixed [15] Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan;
Analog-Digital Architecture”. Knowledge-Based Intel- Larochelle, Hugo (2007). “Greedy Layer-Wise Train-
ligent Information and Engineering Systems: 169–176. ing of Deep Networks”. Advances in Neural Information
CiteSeerX: 10.1.1.125.3812. Processing Systems: 153–160.
9.9. EXTERNAL LINKS 55

[16] Ranzato, MarcAurelio; Poultney, Christopher; Chopra, • Matlab toolbox

Sumit; LeCun, Yann (2007). “Eﬃcient Learning of
Sparse Representations with an Energy-Based Model” • MatConvnet
(PDF). Advances in Neural Information Processing Sys-
tems. • Theano

[17] 10. Deng, Jia, et al. “Imagenet: A large-scale hierarchi- • UFLDL Tutorial
cal image database."Computer Vision and Pattern Recog-
• Deeplearning4j's Convolutional Nets
nition, 2009. CVPR 2009. IEEE Conference on. IEEE,
2009.

[18] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012).“Im-

agenet classiﬁcation with deep convolutional neural net-
works”. Advances in Neural Information Processing Sys-
tems 1: 1097–1105.

[19] Srivastava, Nitish; C. Geoﬀrey Hinton; Alex Krizhevsky;

Ilya Sutskever; Ruslan Salakhutdinov (2014). “Dropout:
A Simple Way to Prevent Neural Networks from overﬁt-
ting” (PDF). Journal of Machine Learning Research 15
(1): 1929–1958.

[20] Lawrence, Steve; C. Lee Giles; Ah Chung Tsoi; Andrew

D. Back (1997). “Face Recognition: A Convolutional
Neural Network Approach”. Neural Networks, IEEE
Transactions on 8 (1): 98–113. doi:10.1109/72.554195.
CiteSeerX: 10.1.1.92.5813.

[21] Szegedy, Christian, et al. "Going deeper with convolu-

tions.”arXiv preprint arXiv:1409.4842 (2014).

[22] O. Russakovsky et al., "ImageNet Large Scale Visual

Recognition Challenge", 2014.

[23] “The Face Detection Algorithm Set To Revolutionize Im-

age Search”. Technology Review. February 16, 2015.
Retrieved February 2015.

[24] Karpathy, Andrej, et al. “Large-scale video classiﬁcation

with convolutional neural networks.”IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). 2014.

[25] Simonyan, Karen, and Andrew Zisserman. “Two-stream

convolutional networks for action recognition in videos.”
arXiv preprint arXiv:1406.2199 (2014).

[26] Collobert, Ronan, and Jason Weston. “A uniﬁed archi-

tecture for natural language processing: Deep neural net-
works with multitask learning."Proceedings of the 25th
international conference on Machine learning. ACM,
2008.

[27] Clark, C., Storkey A.J. (2014), "Teaching Deep Convolu-

tional Networks to play Go".

[28] Maddison C.J., Huang A., Sutskever I., Silver D. (2014),

"Move evaluation in Go using deep convolutional neural
networks".

9.9 External links

• A demonstration of a convolutional network created
for character recognition

• Caﬀe
Chapter 10

Restricted Boltzmann machine

class of Boltzmann machines, in particular the gradient-

Hidden units based contrastive divergence algorithm.* [7]
Restricted Boltzmann machines can also be used in deep
Visible units learning networks. In particular, deep belief networks
h3 can be formed by“stacking”RBMs and optionally fine-
tuning the resulting deep network with gradient descent
v3 and backpropagation.* [8]
h4
v1 10.1 Structure
h1
The standard type of RBM has binary-valued
v2 (Boolean/Bernoulli) hidden and visible units, and
consists of a matrix of weights W = (wi,j ) (size m×n)
h2 associated with the connection between hidden unit hj
and visible unit vi , as well as bias weights (offsets) ai for
the visible units and bj for the hidden units. Given these,
the energy of a configuration (pair of boolean vectors)
Diagram of a restricted Boltzmann machine with three visible (v,h) is defined as
units and four hidden units (no bias units).
∑ ∑ ∑∑
E(v, h) = − a i vi − bj hj − vi wi,j hj
A restricted Boltzmann machine (RBM) is a
i j i j
generative stochastic artificial neural network that can
learn a probability distribution over its set of inputs. or, in matrix notation,
RBMs were initially invented under the name Harmo-
nium by Paul Smolensky in 1986,* [1] but only rose to
prominence after Geoffrey Hinton and collaborators
invented fast learning algorithms for them in the mid- E(v, h) = −aT v − bT h − v T W h
2000s. RBMs have found applications in dimensionality
reduction,* [2] classification,* [3] collaborative filter- This energy function is analogous to that of a Hopfield
ing,* [4] feature learning* [5] and topic modelling.* [6] network. As in general Boltzmann machines, probability
They can be trained in either supervised or unsupervised distributions over hidden and/or visible vectors are de-
ways, depending on the task. fined in terms of the energy function:* [9]

As their name implies, RBMs are a variant of Boltzmann

machines, with the restriction that their neurons must 1 −E(v,h)
form a bipartite graph: a pair of nodes from each of the P (v, h) = e
Z
two groups of units, commonly referred to as the“visible”
and “hidden”units respectively, may have a symmet- where Z is a partition function defined as the sum of
ric connection between them, and there are no connec- e−E(v,h) over all possible configurations (in other words,
tions between nodes within a group. By contrast, “unre- just a normalizing constant to ensure the probability dis-
stricted”Boltzmann machines may have connections be- tribution sums to 1). Similarly, the (marginal) probability
tween hidden units. This restriction allows for more effi- of a visible (input) vector of booleans is the sum over all
cient training algorithms than are available for the general possible hidden layer configurations:* [9]

56
10.2. TRAINING ALGORITHM 57

V (a matrix, each row of which is treated as a visible vec-

tor v ),
1 ∑ −E(v,h)
P (v) = e
Z
h
∏
arg max P (v)
Since the RBM has the shape of a bipartite graph, with W
v∈V
no intra-layer connections, the hidden unit activations are
mutually independent given the visible unit activations or equivalently, to maximize the expected log probability
and conversely, the visible unit activations are mutually of V :* [10]* [11]
independent given the hidden unit activations.* [7] That
is, for m visible units and n hidden units, the conditional
[ ]
probability of a conﬁguration of the visible units v, given ∑
a conﬁguration of the hidden units h, is arg max E log P (v)
W
v∈V

∏
m The algorithm most often used to train RBMs, that is, to
P (v|h) = P (vi |h) optimize the weight vector W , is the contrastive diver-
i=1 gence (CD) algorithm due to Hinton, originally developed
to train PoE (product of experts) models.* [13] * [14] The
Conversely, the conditional probability of h given v is algorithm performs Gibbs sampling and is used inside
a gradient descent procedure (similar to the way back-
propagation is used inside such a procedure when training
∏
n
feedforward neural nets) to compute weight update.
P (h|v) = P (hj |v)
j=1 The basic, single-step contrastive divergence (CD-1) pro-
cedure for a single sample can be summarized as follows:
The individual activation probabilities are given by

∑m 1. Take a training sample v, compute the probabilities

P (hj = 1|v) = (σ (bj + i=1 wi,j v)i ) and of the hidden units and sample a hidden activation
∑n
P (vi = 1|h) = σ ai + j=1 wi,j hj vector h from this probability distribution.
2. Compute the outer product of v and h and call this
where σ denotes the logistic sigmoid. the positive gradient.
The visible units of RBM can be multinomial, although 3. From h, sample a reconstruction v' of the visible
the hidden units are Bernoulli. In this case, the logis- units, then resample the hidden activations h' from
tic function for visible units is replaced by the Softmax this. (Gibbs sampling step)
function
4. Compute the outer product of v' and h' and call this
the negative gradient.
exp(aki + Σj Wijk hj )
P (vik = 1|h) = 5. Let the weight update to wi,j be the positive gradi-
ΣK
k=1 exp(aki + Σj Wijk hj ) ent minus the negative gradient, times some learning
rate: ∆wi,j = ϵ(vhT − v ′ h′T ) .
where K is the number of discrete values that the visible
values have. They are applied in Topic Modeling,* [6] and
RecSys.* [4] The update rule for the biases a, b is defined analogously.
A Practical Guide to Training RBMs written by Hinton
can be found in his homepage.* [9]
10.1.1 Relation to other models
A restricted/layered Boltzmann machine (RBM) has ei-
Restricted Boltzmann machines are a special case ther bit or scalar node values, an array for each layer, and
of Boltzmann machines and Markov random between those are scalar values potentially for every pair
fields.* [10]* [11] Their graphical model corresponds of nodes one from each layer and an adjacent layer. It is
to that of factor analysis.* [12] run and trained using “weighted coin flips”of a chance
calculated at each individual node. Those chances are the
logistic sigmoid of the sum of scalar weights of whichever
pairs of nodes are on at the time, divided by tempera-
10.2 Training algorithm ture which decreases in each round of Simulated anneal-
ing as potentially all the data is trained in again. If either
Restricted Boltzmann machines are trained to maximize node in a pair is off, that weight is not counted. To run
the product of probabilities assigned to some training set it, you go up and down the layers, updating the chances
58 CHAPTER 10. RESTRICTED BOLTZMANN MACHINE

and weighted coin flips, until it converges to the coins in [9] Geoffrey Hinton (2010). A Practical Guide to Training Re-
lowest layer (visible nodes) staying mostly a certain way. stricted Boltzmann Machines. UTML TR 2010–003, Uni-
To train it, it is the same shape as running it except you versity of Toronto.
observe the weights of the pairs that are on, the first time [10] Sutskever, Ilya; Tieleman, Tijmen (2010). “On the con-
up you add the learning rate between those pairs, then go vergence properties of contrastive divergence” (PDF).
back down and up again and that time subtract the learn- Proc. 13th Int'l Conf. on AI and Statistics (AISTATS).
ing rate. As Geoffrey Hinton explained it, the first time up
is to learn the data, and the second time up is to unlearn [11] Asja Fischer and Christian Igel. Training Restricted
whatever its earlier reaction was to the data. Boltzmann Machines: An Introduction. Pattern Recog-
nition 47, pp. 25-39, 2014

[12] María Angélica Cueto; Jason Morton; Bernd Sturm-

10.3 See also fels (2010). “Geometry of the restricted Boltz-
mann machine” (PDF). Algebraic Methods in Statistics
and Probability (American Mathematical Society) 516.
• Autoencoder arXiv:0908.4425.
• Deep learning [13] Geoﬀrey Hinton (1999). Products of Experts. ICANN
1999.
• Helmholtz machine
[14] Hinton, G. E. (2002). “Training Products of Ex-
• Hopﬁeld network perts by Minimizing Contrastive Divergence”
(PDF). Neural Computation 14 (8): 1771–1800.
doi:10.1162/089976602760128018. PMID 12180402.
10.4 References
[1] Smolensky, Paul (1986). “Chapter 6: Information Pro- 10.5 External links
cessing in Dynamical Systems: Foundations of Harmony
Theory” (PDF). In Rumelhart, David E.; McLelland, • Introduction to Restricted Boltzmann Machines.
James L. Parallel Distributed Processing: Explorations in Edwin Chen's blog, July 18, 2011.
the Microstructure of Cognition, Volume 1: Foundations.
MIT Press. pp. 194–281. ISBN 0-262-68053-X. • Understanding RBMs. Deeplearning4j Documen-
tation, December 29, 2014.
[2] Hinton, G. E.; Salakhutdinov, R. R. (2006).
“Reducing the Dimensionality of Data with Neural
Networks” (PDF). Science 313 (5786): 504–507.
doi:10.1126/science.1127647. PMID 16873662.

[3] Larochelle, H.; Bengio, Y. (2008). Classiﬁcation

using discriminative restricted Boltzmann machines
(PDF). Proceedings of the 25th international con-
ference on Machine learning - ICML '08. p. 536.
doi:10.1145/1390156.1390224. ISBN 9781605582054.

[4] Salakhutdinov, R.; Mnih, A.; Hinton, G. (2007).

Restricted Boltzmann machines for collaborative ﬁlter-
ing. Proceedings of the 24th international confer-
ence on Machine learning - ICML '07. p. 791.
doi:10.1145/1273496.1273596. ISBN 9781595937933.

[5] Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An

analysis of single-layer networks in unsupervised feature
learning (PDF). International Conference on Artiﬁcial In-
telligence and Statistics (AISTATS).

[6] Ruslan Salakhutdinov and Geoﬀrey Hinton (2010).

Replicated softmax: an undirected topic model. Neural
Information Processing Systems 23.

[7] Miguel Á. Carreira-Perpiñán and Geoﬀrey Hinton (2005).

On contrastive divergence learning. Artiﬁcial Intelligence
and Statistics.

[8] Hinton, G. (2009).“Deep belief networks”. Scholarpedia

4 (5): 5947. doi:10.4249/scholarpedia.5947.
Chapter 11

Recurrent neural network

Not to be confused with Recursive neural network. viding target signals for the RNN, instead a fitness func-
tion or reward function is occasionally used to evaluate
the RNN's performance, which is influencing its input
A recurrent neural network (RNN) is a class of
artificial neural network where connections between units stream through output units connected to actuators af-
form a directed cycle. This creates an internal state of the fecting the environment. Again, compare the section on
network which allows it to exhibit dynamic temporal be- training algorithms below.
havior. Unlike feedforward neural networks, RNNs can
use their internal memory to process arbitrary sequences
11.1.2 Hopfield network
of inputs. This makes them applicable to tasks such as
unsegmented connected handwriting recognition, where The Hopfield network is of historic interest although it is
they have achieved the best known results.* [1] not a general RNN, as it is not designed to process se-
quences of patterns. Instead it requires stationary inputs.
It is a RNN in which all connections are symmetric. In-
11.1 Architectures vented by John Hopfield in 1982, it guarantees that its dy-
namics will converge. If the connections are trained using
Hebbian learning then the Hopfield network can perform
11.1.1 Fully recurrent network as robust content-addressable memory, resistant to con-
nection alteration.
This is the basic architecture developed in the 1980s: a
A variation on the Hopfield network is the bidirectional
network of neuron-like units, each with a directed con-
associative memory (BAM). The BAM has two layers,
nection to every other unit. Each unit has a time-varying
either of which can be driven as an input, to recall an
real-valued activation. Each connection has a modifiable
association and produce an output on the other layer.* [2]
real-valued weight. Some of the nodes are called input
nodes, some output nodes, the rest hidden nodes. Most
architectures below are special cases. 11.1.3 Elman networks and Jordan net-
For supervised learning in discrete time settings, training works
sequences of real-valued input vectors become sequences
of activations of the input nodes, one input vector at a The following special case of the basic architecture above
time. At any given time step, each non-input unit com- was employed by Jeff Elman. A three-layer network is
putes its current activation as a nonlinear function of the used (arranged vertically as x, y, and z in the illustration),
weighted sum of the activations of all units from which it with the addition of a set of “context units”(u in the il-
receives connections. There may be teacher-given target lustration). There are connections from the middle (hid-
activations for some of the output units at certain time den) layer to these context units fixed with a weight of
steps. For example, if the input sequence is a speech sig- one.* [3] At each time step, the input is propagated in a
nal corresponding to a spoken digit, the final target out- standard feed-forward fashion, and then a learning rule is
put at the end of the sequence may be a label classify- applied. The fixed back connections result in the context
ing the digit. For each sequence, its error is the sum of units always maintaining a copy of the previous values
the deviations of all target signals from the corresponding of the hidden units (since they propagate over the con-
activations computed by the network. For a training set nections before the learning rule is applied). Thus the
of numerous sequences, the total error is the sum of the network can maintain a sort of state, allowing it to per-
errors of all individual sequences. Algorithms for mini- form such tasks as sequence-prediction that are beyond
mizing this error are mentioned in the section on training the power of a standard multilayer perceptron.
algorithms below. Jordan networks, due to Michael I. Jordan, are similar to
In reinforcement learning settings, there is no teacher pro- Elman networks. The context units are however fed from

59
60 CHAPTER 11. RECURRENT NEURAL NETWORK

dict or label each element of the sequence based on both

the past and the future context of the element. This is
done by adding the outputs of two RNN, one processing
the sequence from left to right, the other one from right
to left. The combined outputs are the predictions of the
teacher-given target signals. This technique proved to be
especially useful when combined with LSTM RNN.* [10]

11.1.7 Continuous-time RNN

A continuous time recurrent neural network (CTRNN) is

a dynamical systems model of biological neural networks.
A CTRNN uses a system of ordinary differential equa-
tions to model the effects on a neuron of the incoming
spike train. CTRNNs are more computationally efficient
than directly simulating every spike in a network as they
do not model neural activations at this level of detail .
For a neuron i in the network with action potential yi the
rate of change of activation is given by:
The Elman SRN

the output layer instead of the hidden layer. The context ∑n

τ ẏ
units in a Jordan network are also referred to as the state i i = −y i + σ( wji yj − Θj ) + Ii (t)
j=1
layer, and have a recurrent connection to themselves with
no other nodes on this connection.* [3] Elman and Jordan
Where:
networks are also known as“simple recurrent networks”
(SRN).
• τi : Time constant of postsynaptic node

11.1.4 Echo state network • yi : Activation of postsynaptic node

The echo state network (ESN) is a recurrent neural net- • ẏi : Rate of change of activation of postsynaptic
work with a sparsely connected random hidden layer. The node
weights of output neurons are the only part of the net-
work that can change and be trained. ESN are good at • wji : Weight of connection from pre to postsynaptic
reproducing certain time series.* [4] A variant for spiking node
neurons is known as Liquid state machines.* [5]
• σ(x) : Sigmoid of x e.g. σ(x) = 1/(1 + e−x ) .

11.1.5 Long short term memory network • yj : Activation of presynaptic node

The Long short term memory (LSTM) network, devel- • Θj : Bias of presynaptic node
oped by Hochreiter & Schmidhuber in 1997,* [6] is an
artiﬁcial neural net structure that unlike traditional RNNs • Ii (t) : Input (if any) to node
doesn't have the vanishing gradient problem (compare
the section on training algorithms below). It works even
when there are long delays, and it can handle signals CTRNNs have frequently been applied in the ﬁeld of
that have a mix of low and high frequency components. evolutionary robotics, where they have been used to ad-
LSTM RNN outperformed other methods in numerous dress, for example, vision,* [11] co-operation* [12] and
applications such as language learning* [7] and connected minimally cognitive behaviour.* [13]
handwriting recognition.* [8]

11.1.8 Hierarchical RNN

11.1.6 Bi-directional RNN
There are many instances of hierarchical RNN whose el-
Invented by Schuster & Paliwal in 1997,* [9] bi- ements are connected in various ways to decompose hi-
directional RNN or BRNN use a ﬁnite sequence to pre- erarchical behavior into useful subprograms.* [14]* [15]
11.2. TRAINING 61

11.1.9 Recurrent multilayer perceptron 11.1.14 Bidirectional Associative Memory

(BAM)
Generally, a Recurrent Multi-Layer Perceptron (RMLP)
consists of a series of cascaded subnetworks, each of First introduced by Kosko,* [21] BAM neural networks
which consists of multiple layers of nodes. Each of these store associative data as a vector. The bi-directionality
subnetworks is entirely feed-forward except for the last comes from passing information through a matrix and
layer, which can have feedback connections among itself. its transpose. Typically, bipolar encoding is preferred
Each of these subnets is connected only by feed forward to binary encoding of the associative pairs. Recently,
connections.* [16] stochastic BAM models using Markov stepping were op-
timized for increased network stability and relevance to
real-world applications.* [22]

11.1.10 Second Order Recurrent Neural

Network
11.2 Training
Second order RNNs use higher order weights wijk in-
stead of the standard wij weights, and inputs and states
can be a product. This allows a direct mapping to a 11.2.1 Gradient descent
finite state machine both in training and in representa-
tion* [17]* [18] Long short term memory is an example of To minimize total error, gradient descent can be used to
this. change each weight in proportion to the derivative of the
error with respect to that weight, provided the non-linear
activation functions are differentiable. Various methods
for doing so were developed in the 1980s and early 1990s
by Paul Werbos, Ronald J. Williams, Tony Robinson,
11.1.11 Multiple Timescales Recurrent Jürgen Schmidhuber, Sepp Hochreiter, Barak Pearlmut-
Neural Network (MTRNN) Model ter, and others.
The standard method is called "backpropagation through
MTRNN considered as a possible neural-based compu-
time" or BPTT, and is a generalization of back-
tational model that imitates to some extent, the brain ac-
propagation for feed-forward networks,* [23]* [24] and
tivities.* [19] It has ability to simulate the functional hi-
like that method, is an instance of Automatic differen-
erarchy of the brain through self-organization that is not
tiation in the reverse accumulation mode or Pontryagin's
only depended on the spatial connection among neurons,
minimum principle. A more computationally expensive
but also on distinct types of neuron activities, each with
online variant is called "Real-Time Recurrent Learning"
distinct time properties. With such varied neuron activi-
or RTRL,* [25]* [26] which is an instance of Automatic
ties, continuous sequences of any set of behavior are seg-
differentiation in the forward accumulation mode with
mented into reusable primitives, which in turn are flexibly
stacked tangent vectors. Unlike BPTT this algorithm is
integrated into diverse sequential behaviors. The biolog-
local in time but not local in space.* [27]* [28]
ical approval of such a type of hierarchy has been dis-
There also is an online hybrid between BPTT and RTRL
cussed on the memory-prediction theory of brain func-
tion by Jeff Hawkins in his book On Intelligence.with intermediate complexity,* [29]* [30] and there are
variants for continuous time.* [31] A major problem with
gradient descent for standard RNN architectures is that
error gradients vanish exponentially quickly with the size
of the time lag between important events.* [32] * [33] The
11.1.12 Pollack’s sequential cascaded net- Long short term memory architecture together with a
works BPTT/RTRL hybrid learning method was introduced in
an attempt to overcome these problems.* [6]
11.1.13 Neural Turing Machines

NTMs are method of extending the capabilities of re-

current neural networks by coupling them to external 11.2.2 Hessian Free Optimisation
memory resources, which they can interact with by
attentional processes. The combined system is analogous Successful training on complex tasks has been achieved
to a Turing Machine or Von Neumann architecture but by employing Hessian Free Optimisation. The speedup
is diﬀerentiable end-to-end, allowing it to be eﬃciently compared with previous training methods now makes
trained with gradient descent.* [20] RNN applications feasible.* [34]
62 CHAPTER 11. RECURRENT NEURAL NETWORK

11.2.3 Global optimization methods tation for the current time step.
In particular, recurrent neural networks can appear as
Training the weights in a neural network can be modeled
nonlinear versions of finite impulse response and infinite
as a non-linear global optimization problem. A target
impulse response filters and also as a nonlinear autore-
function can be formed to evaluate the fitness or error of
gressive exogenous model (NARX)* [38]
a particular weight vector as follows: First, the weights
in the network are set according to the weight vector.
Next, the network is evaluated against the training se-
quence. Typically, the sum-squared-difference between 11.4 Issues with recurrent neural
the predictions and the target values specified in the train- networks
ing sequence is used to represent the error of the current
weight vector. Arbitrary global optimization techniques
may then be used to minimize this target function. Most RNNs have had scaling issues. In particular, RNNs
cannot be easily trained for large numbers of neuron units
The most common global optimization method for train- nor for large numbers of inputs units . Successful training
ing RNNs is genetic algorithms, especially in unstruc- has been mostly in time series problems with few inputs
tured networks.* [35]* [36]* [37] and in chemical process control.
Initially, the genetic algorithm is encoded with the neural
network weights in a predefined manner where one gene
in the chromosome represents one weight link, hence- 11.5 References
forth; the whole network is represented as a single chro-
mosome. The fitness function is evaluated as follows: 1)
[1] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H.
each weight encoded in the chromosome is assigned to the
Bunke, J. Schmidhuber. A Novel Connectionist System
respective weight link of the network ; 2) the training set for Improved Unconstrained Handwriting Recognition.
of examples is then presented to the network which prop- IEEE Transactions on Pattern Analysis and Machine In-
agates the input signals forward ; 3) the mean-squared- telligence, vol. 31, no. 5, 2009.
error is returned to the fitness function ; 4) this function
will then drive the genetic selection process. [2] Rául Rojas (1996). Neural networks: a systematic intro-
duction. Springer. p. 336. ISBN 978-3-540-60505-8.
There are many chromosomes that make up the pop-
ulation; therefore, many different neural networks are [3] Cruse, Holk; Neural Networks as Cybernetic Systems, 2nd
evolved until a stopping criterion is satisfied. A com- and revised edition
mon stopping scheme is: 1) when the neural network has
learnt a certain percentage of the training data or 2) when [4] H. Jaeger. Harnessing nonlinearity: Predicting chaotic
the minimum value of the mean-squared-error is satis- systems and saving energy in wireless communication.
fied or 3) when the maximum number of training gener- Science, 304:78–80, 2004.
ations has been reached. The stopping criterion is eval-
[5] W. Maass, T. Natschläger, and H. Markram. A fresh look
uated by the fitness function as it gets the reciprocal of
at real-time computation in generic recurrent neural cir-
the mean-squared-error from each neural network during cuits. Technical report, Institute for Theoretical Com-
training. Therefore, the goal of the genetic algorithm is puter Science, TU Graz, 2002.
to maximize the fitness function, hence, reduce the mean-
squared-error. [6] Hochreiter, Sepp; and Schmidhuber, Jürgen; Long Short-
Term Memory, Neural Computation, 9(8):1735–1780,
Other global (and/or evolutionary) optimization tech-
1997
niques may be used to seek a good set of weights such
as Simulated annealing or Particle swarm optimization. [7] Gers, Felix A.; and Schmidhuber, Jürgen; LSTM Recur-
rent Networks Learn Simple Context Free and Context Sen-
sitive Languages, IEEE Transactions on Neural Networks,
11.3 Related fields and models 12(6):1333–1340, 2001

[8] Graves, Alex; and Schmidhuber, Jürgen; Oﬄine Hand-

RNNs may behave chaotically. In such cases, dynamical writing Recognition with Multidimensional Recurrent Neu-
systems theory may be used for analysis. ral Networks, in Bengio, Yoshua; Schuurmans, Dale; Laf-
Recurrent neural networks are in fact recursive neural net- ferty, John; Williams, Chris K. I.; and Culotta, Aron
(eds.), Advances in Neural Information Processing Systems
works with a particular structure: that of a linear chain.
22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC,
Whereas recursive neural networks operate on any hier- Neural Information Processing Systems (NIPS) Founda-
archical structure, combining child representations into tion, 2009, pp. 545–552
parent representations, recurrent neural networks operate
on the linear progression of time, combining the previous [9] Bidirectional recurrent neural networks. IEEE Transac-
time step and a hidden representation into the represen- tions on Signal Processing, 45:2673–81, November 1997.
11.5. REFERENCES 63

[10] A. Graves and J. Schmidhuber. Framewise phoneme clas- [25] A. J. Robinson and F. Fallside. The utility driven dynamic
sification with bidirectional LSTM and other neural net- error propagation network. Technical Report CUED/F-
work architectures. Neural Networks, 18:602–610, 2005. INFENG/TR.1, Cambridge University Engineering De-
partment, 1987.
[11] Harvey, Inman; Husbands, P. and Cliff, D. (1994).“See-
ing the light: Artificial evolution, real vision”. Proceed- [26] R. J. Williams and D. Zipser. Gradient-based learning
ings of the third international conference on Simulation of algorithms for recurrent networks and their computational
adaptive behavior: from animals to animats 3: 392–401. complexity. In Back-propagation: Theory, Architectures
and Applications. Hillsdale, NJ: Erlbaum, 1994.
[12] Quinn, Matthew (2001).“Evolving communication with-
out dedicated communication channels”. Advances in [27] J. Schmidhuber. A local learning algorithm for dynamic
Artificial Life. Lecture Notes in Computer Science 2159: feedforward and recurrent networks. Connection Science,
357–366. doi:10.1007/3-540-44811-X_38. ISBN 978- 1(4):403–412, 1989.
3-540-42567-0.
[28] Neural and Adaptive Systems: Fundamentals through
[13] Beer, R.D. (1997). “The dynamics of adaptive behavior: Simulation. J.C. Principe, N.R. Euliano, W.C. Lefebvre
A research program”. Robotics and Autonomous Systems
[29] J. Schmidhuber. A fixed size storage O(n3) time com-
20 (2–4): 257–289. doi:10.1016/S0921-8890(96)00063-
plexity learning algorithm for fully recurrent continually
2.
running networks. Neural Computation, 4(2):243–248,
[14] J. Schmidhuber. Learning complex, extended sequences 1992.
using the principle of history compression. Neural Com- [30] R. J. Williams. Complexity of exact gradient computation
putation, 4(2):234-242, 1992 algorithms for recurrent neural networks. Technical Re-
[15] R.W. Paine, J. Tani, “How hierarchical control self- port Technical Report NU-CCS-89-27, Boston: North-
organizes in artificial adaptive systems,”Adaptive Behav- eastern University, College of Computer Science, 1989.
ior, 13(3), 211-225, 2005. [31] B. A. Pearlmutter. Learning state space trajectories in re-
[16] “CiteSeerX —Recurrent Multilayer Perceptrons for Iden- current neural networks. Neural Computation, 1(2):263–
tification and Control: The Road to Applications”. Cite- 269, 1989.
seerx.ist.psu.edu. Retrieved 2014-01-03. [32] S. Hochreiter. Untersuchungen zu dynamischen neu-
ronalen Netzen. Diploma thesis, Institut f. Informatik,
[17] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun,
Technische Univ. Munich, 1991.
Y.C. Lee, “Learning and Extracting Finite State Au-
tomata with Second-Order Recurrent Neural Networks,” [33] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmid-
Neural Computation, 4(3), p. 393, 1992. huber. Gradient flow in recurrent nets: the difficulty of
learning long-term dependencies. In S. C. Kremer and J.
[18] C.W. Omlin, C.L. Giles, “Constructing Deterministic
F. Kolen, editors, A Field Guide to Dynamical Recurrent
Finite-State Automata in Recurrent Neural Networks,”
Neural Networks. IEEE Press, 2001.
Journal of the ACM, 45(6), 937-972, 1996.
[34] Martens, James, and Ilya Sutskever. "Training deep and
[19] Y. Yamashita, J. Tani, “Emergence of functional
recurrent networks with hessian-free optimization.”In
hierarchy in a multiple timescale neural network
Neural Networks: Tricks of the Trade, pp. 479-535.
model: a humanoid robot experiment,”PLoS Com-
Springer Berlin Heidelberg, 2012.
putational Biology, 4(11), e1000220, 211-225, 2008.
http://journals.plos.org/ploscompbiol/article?id=10. [35] F. J. Gomez and R. Miikkulainen. Solving non-
1371/journal.pcbi.1000220 Markovian control tasks with neuroevolution. Proc. IJ-
CAI 99, Denver, CO, 1999. Morgan Kaufmann.
[20] http://arxiv.org/pdf/1410.5401v2.pdf
[36] Applying Genetic Algorithms to Recurrent Neural Net-
[21] Kosko, B. (1988). “Bidirectional associative memories” works for Learning Network Parameters and Architec-
. IEEE Transactions on Systems, Man, and Cybernetics 18 ture. O. Syed, Y. Takefuji
(1): 49–60. doi:10.1109/21.87054.
[37] F. Gomez, J. Schmidhuber, R. Miikkulainen. Ac-
[22] Rakkiyappan, R.; Chandrasekar, A.; Lakshmanan, S.; celerated Neural Evolution through Cooperatively Coe-
Park, Ju H. (2 January 2015). “Exponential stabil- volved Synapses. Journal of Machine Learning Research
ity for markovian jumping stochastic BAM neural net- (JMLR), 9:937-965, 2008.
works with mode-dependent probabilistic time-varying
delays and impulse control”. Complexity 20 (3): 39–65. [38] Hava T. Siegelmann, Bill G. Horne, C. Lee Giles,“Com-
doi:10.1002/cplx.21503. putational capabilities of recurrent NARX neural net-
works,”IEEE Transactions on Systems, Man, and Cyber-
[23] P. J. Werbos. Generalization of backpropagation with ap- netics, Part B 27(2): 208-215 (1997).
plication to a recurrent gas market model. Neural Net-
works, 1, 1988.
• Mandic, D. & Chambers, J. (2001). Recurrent Neu-
[24] David E. Rumelhart; Geoffrey E. Hinton; Ronald J. ral Networks for Prediction: Learning Algorithms,
Williams. Learning Internal Representations by Error Architectures and Stability. Wiley. ISBN 0-471-
Propagation. 49517-4.
64 CHAPTER 11. RECURRENT NEURAL NETWORK

• Elman, J.L. (1990). “Finding Structure in

Time”. Cognitive Science 14 (2): 179–211.
doi:10.1016/0364-0213(90)90002-E.

11.6 External links

• RNNSharp CRFs based on recurrent neural net-
works (C#, .NET)

• Recurrent Neural Networks with over 60 RNN pa-

pers by Jürgen Schmidhuber's group at IDSIA
• Elman Neural Network implementation for WEKA
Chapter 12

Long short term memory

Long short-term memory (LSTM) is a recurrent neural

network (RNN) architecture (an artiﬁcial neural network)
published* [1] in 1997 by Sepp Hochreiter and Jürgen
Schmidhuber. Like most RNNs, an LSTM network is
universal in the sense that given enough network units it
Π
can compute anything a conventional computer can com-
pute, provided it has the proper weight matrix, which may
be viewed as its program. Unlike traditional RNNs, an
LSTM network is well-suited to learn from experience to Σ
classify, process and predict time series when there are
very long time lags of unknown size between important
events. This is one of the main reasons why LSTM out-
performs alternative RNNs and Hidden Markov Models
and other sequence learning methods in numerous appli-
Π Π
cations. For example, LSTM achieved the best known
results in unsegmented connected handwriting recogni-
tion,* [2] and in 2009 won the ICDAR handwriting com-
petition. LSTM networks have also been used for au-
tomatic speech recognition, and were a major compo-
nent of a network that recently achieved a record 17.7%
phoneme error rate on the classic TIMIT natural speech
dataset.* [3] A typical implementation of an LSTM block.

12.1 Architecture
An LSTM network is an artificial neural network that
contains LSTM blocks instead of, or in addition to, regu- out the value from the left-most unit, effectively blocking
lar network units. An LSTM block may be described as that value from entering into the next layer. The second
a“smart”network unit that can remember a value for an unit from the right is the “forget gate”. When it out-
arbitrary length of time. An LSTM block contains gates puts a value close to zero, the block will effectively forget
that determine when the input is significant enough to re- whatever value it was remembering. The right-most unit
member, when it should continue to remember or forget (on the bottom row) is the“output gate”. It determines
the value, and when it should output the value. when the unit should output the value in its memory. The
A typical implementation of an LSTM block is shown to units containing the Π symbol compute the product of
the right. The four units shown ∑at the bottom of the fig- their inputs ( y = Πxi ). These units have no weights.
ure are sigmoid units ( y = s( wi xi ) , where s is some The unit with the Σ∑symbol computes a linear function
squashing function, such as the logistic function). The of its inputs ( y = wi xi .) The output of this unit is
left-most of these units computes a value which is condi- not squashed so that it can remember the same value for
tionally fed as an input value to the block's memory. The many time-steps without the value decaying. This value
other three units serve as gates to determine when values is fed back in so that the block can “remember”it (as
are allowed to flow into or out of the block's memory. The long as the forget gate allows). Typically, this value is
second unit from the left (on the bottom row) is the “in- also fed into the 3 gating units to help them make gating
put gate”. When it outputs a value close to zero, it zeros decisions.

65
66 CHAPTER 12. LONG SHORT TERM MEMORY

12.2 Training 12.5 References

To minimize LSTM's total error on a set of train- [1] S. Hochreiter and J. Schmidhuber. Long short-term mem-
ing sequences, iterative gradient descent such as ory. Neural Computation, 9(8):1735–1780, 1997.
backpropagation through time can be used to change each [2] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H.
weight in proportion to its derivative with respect to the Bunke, J. Schmidhuber. A Novel Connectionist System
error. A major problem with gradient descent for stan- for Improved Unconstrained Handwriting Recognition.
dard RNNs is that error gradients vanish exponentially IEEE Transactions on Pattern Analysis and Machine In-
quickly with the size of the time lag between impor- telligence, vol. 31, no. 5, 2009.
tant events, as first realized in 1991.* [4]* [5] With LSTM
blocks, however, when error values are back-propagated [3] Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geof-
from the output, the error becomes trapped in the mem- frey (2013). “Speech Recognition with Deep Recurrent
Neural Networks”. Acoustics, Speech and Signal Pro-
ory portion of the block. This is referred to as an “error
cessing (ICASSP), 2013 IEEE International Conference on:
carousel”, which continuously feeds error back to each 6645–6649.
of the gates until they become trained to cut off the value.
Thus, regular backpropagation is effective at training an [4] S. Hochreiter. Untersuchungen zu dynamischen neu-
LSTM block to remember values for very long durations. ronalen Netzen. Diploma thesis, Institut f. Informatik,
Technische Univ. Munich, 1991.
LSTM can also be trained by a combination of artificial
evolution for weights to the hidden units, and pseudo- [5] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmid-
inverse or support vector machines for weights to the huber. Gradient flow in recurrent nets: the difficulty of
output units.* [6] In reinforcement learning applications learning long-term dependencies. In S. C. Kremer and J.
LSTM can be trained by policy gradient methods or F. Kolen, editors, A Field Guide to Dynamical Recurrent
evolution strategies or genetic algorithms. Neural Networks. IEEE Press, 2001.

[6] J. Schmidhuber, D. Wierstra, M. Gagliolo, F. Gomez.

Training Recurrent Networks by Evolino. Neural Com-
12.3 Applications putation, 19(3): 757–779, 2007.

[7] H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and

Applications of LSTM include: J. Schmidhuber. A System for Robotic Heart Surgery that
Learns to Tie Knots Using Recurrent Neural Networks.
• Robot control* [7] Advanced Robotics, 22/13–14, pp. 1521–1537, 2008.

• Time series prediction* [8] [8] J. Schmidhuber and D. Wierstra and F. J. Gomez.
Evolino: Hybrid Neuroevolution / Optimal Linear Search
• Speech recognition* [9]* [10]* [11] for Sequence Learning. Proceedings of the 19th Interna-
tional Joint Conference on Artiﬁcial Intelligence (IJCAI),
• Rhythm learning* [12] Edinburgh, pp. 853–858, 2005.

• Music composition* [13] [9] A. Graves and J. Schmidhuber. Framewise phoneme clas-
siﬁcation with bidirectional LSTM and other neural net-
• Grammar learning* [14]* [15]* [16] work architectures. Neural Networks 18:5–6, pp. 602–
610, 2005.
• Handwriting recognition* [17]* [18]
[10] S. Fernandez, A. Graves, J. Schmidhuber. An applica-
• Human action recognition* [19] tion of recurrent neural networks to discriminative key-
word spotting. Intl. Conf. on Artiﬁcial Neural Networks
• Protein Homology Detection* [20] ICANN'07, 2007.

[11] Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geof-

frey (2013). “Speech Recognition with Deep Recurrent
12.4 See also Neural Networks”. Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2013 IEEE International Conference on:
• Artiﬁcial neural network 6645–6649.

• Prefrontal Cortex Basal Ganglia Working Memory [12] F. Gers, N. Schraudolph, J. Schmidhuber. Learning pre-
(PBWM) cise timing with LSTM recurrent networks. Journal of
Machine Learning Research 3:115–143, 2002.
• Recurrent neural network
[13] D. Eck and J. Schmidhuber. Learning The Long-Term
• Time series Structure of the Blues. In J. Dorronsoro, ed., Proceedings
of Int. Conf. on Artiﬁcial Neural Networks ICANN'02,
• Long-term potentiation Madrid, pages 284–289, Springer, Berlin, 2002.
12.6. EXTERNAL LINKS 67

[14] J. Schmidhuber, F. Gers, D. Eck. J. Schmidhuber, F.

Gers, D. Eck. Learning nonregular languages: A com-
parison of simple recurrent networks and LSTM. Neural
Computation 14(9):2039–2041, 2002.

[15] F. A. Gers and J. Schmidhuber. LSTM Recurrent Net-

works Learn Simple Context Free and Context Sensi-
tive Languages. IEEE Transactions on Neural Networks
12(6):1333–1340, 2001.

[16] J. A. Perez-Ortiz, F. A. Gers, D. Eck, J. Schmidhuber.

Kalman ﬁlters improve LSTM network performance in
problems unsolvable by traditional recurrent nets. Neu-
ral Networks 16(2):241–250, 2003.

[17] A. Graves, J. Schmidhuber. Oﬄine Handwriting Recog-

nition with Multidimensional Recurrent Neural Networks.
Advances in Neural Information Processing Systems 22,
NIPS'22, pp 545–552, Vancouver, MIT Press, 2009.

[18] A. Graves, S. Fernandez,M. Liwicki, H. Bunke, J.

Schmidhuber. Unconstrained online handwriting recog-
nition with recurrent neural networks. Advances in Neu-
ral Information Processing Systems 21, NIPS'21, pp 577–
584, 2008, MIT Press, Cambridge, MA, 2008.

[19] M. Baccouche, F. Mamalet, C Wolf, C. Garcia, A.

Baskurt. Sequential Deep Learning for Human Action
Recognition. 2nd International Workshop on Human Be-
havior Understanding (HBU), A.A. Salah, B. Lepri ed.
Amsterdam, Netherlands. pp. 29–39. Lecture Notes in
Computer Science 7065. Springer. 2011

[20] Hochreiter, S.; Heusel, M.; Obermayer, K. (2007).

“Fast model-based protein homology detection with-
out alignment”. Bioinformatics 23 (14): 1728–1736.
doi:10.1093/bioinformatics/btm247. PMID 17488755.

12.6 External links

• Recurrent Neural Networks with over 30 LSTM pa-
pers by Jürgen Schmidhuber's group at IDSIA

• Gers PhD thesis on LSTM networks.

• Fraud detection paper with two chapters devoted
to explaining recurrent neural networks, especially
LSTM.

• Paper on a high-performing extension of LSTM that

has been simpliﬁed to a single node type and can
train arbitrary architectures.
• Tutorial: How to implement LSTM in python with
theano
Chapter 13

Google Brain

Google Brain is a deep learning research project at 13.2 In Google products

Google.
The project's technology is currently used in the Android
Operating System's speech recognition system* [18] and
13.1 History photosearch for Google+.* [19]

Stanford University professor Andrew Ng who, since

around 2006, became interested in using deep learning 13.3 Team
techniques to crack the problem of artificial intelligence,
started a deep learning project in 2011 as one of the
Google Brain was initially established by Google Fel-
Google X projects. He then joined with Google Fellow
low Jeff Dean and visiting Stanford professor Andrew
Jeff Dean to develop “the Google brain”as it was first
* Ng* [20] (Ng since moved to lead the artificial intelligence
referred to. [1]
group at Baidu* [21]). As of 2014, team members in-
In June 2012, the New York Times reported that a clus- cluded Jeff Dean, Geoffrey Hinton, Greg Corrado, Quoc
ter of 16,000 computers dedicated to mimicking some Le,* [22] Ilya Sutskever, Alex Krizhevsky, Samy Bengio,
aspects of human brain activity had successfully trained and Vincent Vanhoucke.
itself to recognize a cat based on 10 million digital images
taken from YouTube videos.* [1] The story was also cov-
ered by National Public Radio* [2] and SmartPlanet.* [3]
13.4 Reception
In March 2013, Google hired Geoffrey Hinton, a lead-
ing researcher in the deep learning field, and acquired the
Google Brain has received in-depth coverage in Wired
company DNNResearch Inc. headed by Hinton. Hinton
Magazine,* [6]* [16]* [23] the New York Times,* [23]
said that he would be dividing his future time between his
Technology Review,* [5]* [17] National Public Radio,* [2]
university research and his work at Google.* [4]
and Big Think.* [24]
On 26 January 2014, multiple news outlets stated
that Google had purchased DeepMind Technologies
for an undisclosed amount. Analysts later announced
that the company was purchased for £400 Million 13.5 See also
($650M USD / €486M), although later reports esti-
mated the acquisition was valued at over £500 Mil- • Google X
lion.* [5]* [6]* [7]* [8]* [9]* [10]* [11] The acquisition re-
portedly took place after Facebook ended negotiations • Google Research
with DeepMind Technologies in 2013, which resulted in
no agreement or purchase of the company.* [12] Google • Quantum Artificial Intelligence Lab run by Google
has yet to comment or make an official announcement on in collaboration with NASA and Universities Space
this acquisition. Research Association.
In December 2012, futurist and inventor Ray Kurzweil,
author of The Singularity is Near, joined Google in a
full-time engineering director role, but focusing on the 13.6 References
deep learning project.* [13] It was reported that Kurzweil
would have“unlimited resources”to pursue his vision at [1] Markoff, John (June 25, 2012). “How Many Computers
Google.* [14]* [15]* [16]* [17] However, he is leading his to Identify a Cat? 16,000”. New York Times. Retrieved
own team, which is independent of Google Brain. February 11, 2014.

68
13.6. REFERENCES 69

[2] “A Massive Google Network Learns To Identify —Cats” [19] “Improving Photo Search: A Step Across the Semantic
. National Public Radio. June 26, 2012. Retrieved Febru- Gap”. Google Research Blog. Google. June 12, 2013.
ary 11, 2014.
[20] Jeff Dean and Andrew Ng (26 June 2012). “Using large-
[3] Shin, Laura (June 26, 2012). “Google brain simulator scale brain simulations for machine learning and A.I.”.
teaches itself to recognize cats”. SmartPlanet. Retrieved Official Google Blog. Retrieved 26 January 2015.
February 11, 2014.
[21] “Ex-Google Brain head Andrew Ng to lead Baidu's arti-
[4] “U of T neural networks start-up acquired by Google” ficial intelligence drive”. South China Morning Post.
(Press release). Toronto, ON. 12 March 2013. Retrieved
13 March 2013. [22] “Quoc Le - Behind the Scenes”. Retrieved 20 April
2015.
[5] Regalado, Antonio (January 29, 2014). “Is Google Cor-
nering the Market on Deep Learning? A cutting-edge cor- [23] Hernandez, Daniela (May 7, 2013). “The Man Behind
ner of science is being wooed by Silicon Valley, to the the Google Brain: Andrew Ng and the Quest for the New
dismay of some academics.”. Technology Review. Re- AI”. Wired Magazine. Retrieved February 11, 2014.
trieved February 11, 2014.
[24] “Ray Kurzweil and the Brains Behind the Google Brain”
[6] Wohlsen, Marcus (January 27, 2014).“Google’s Grand . Big Think. December 8, 2013. Retrieved February 11,
Plan to Make Your Brain Irrelevant”. Wired Magazine. 2014.
Retrieved February 11, 2014.

[7] “Google Acquires UK AI startup Deepmind”. The

Guardian. Retrieved 27 January 2014.

[8] “Report of Acquisition, TechCrunch”. TechCrunch. Re-

trieved 27 January 2014.

[9] Oreskovic, Alexei.“Reuters Report”. Reuters. Retrieved

27 January 2014.

[10] “Google Acquires Artiﬁcial Intelligence Start-Up Deep-

Mind”. The Verge. Retrieved 27 January 2014.

[11] “Google acquires AI pioneer DeepMind Technologies”.

Ars Technica. Retrieved 27 January 2014.

[12] “Google beats Facebook for Acquisition of DeepMind

Technologies”. Retrieved 27 January 2014.

[13] Taylor, Colleen (December 14, 2012). “Ray Kurzweil

Joins Google In Full-Time Engineering Director Role;
Will Focus On Machine Learning, Language Processing”
. TechCrunch. Retrieved February 11, 2014.

[14] Empson, Rip (January 3, 2013). “Imagining The Fu-

ture: Ray Kurzweil Has “Unlimited Resources”For AI,
Language Research At Google”. TechCrunch. Retrieved
February 11, 2014.

[15] Ferenstein, Gregory (January 6, 2013). “Google’s New

Director Of Engineering, Ray Kurzweil, Is Building Your
‘Cybernetic Friend’". TechCrunch. Retrieved February
11, 2014.

[16] Levy, Steven (April 25, 2013). “How Ray Kurzweil Will
Help Google Make the Ultimate AI Brain”. Wired Mag-
azine. Retrieved February 11, 2014.

[17] Hof, Robert (April 23, 2013). “Deep Learning:

With massive amounts of computational power, machines
can now recognize objects and translate speech in real
time. Artiﬁcial intelligence is ﬁnally getting smart.”.
Technology Review. Retrieved February 11, 2014.

[18] “Speech Recognition and Deep Learning”. Google Re-

search Blog. Google. August 6, 2012. Retrieved February
11, 2014.
Chapter 14

Google DeepMind

Google DeepMind is a British artiﬁcial intelligence com- 14.2 Research

pany. Founded in 2011 as DeepMind Technologies, it
was acquired by Google in 2014. DeepMind Technologies's goal is to "solve intelli-
gence",* [21] which they are trying to achieve by com-
bining "the best techniques from machine learning and
systems neuroscience to build powerful general-purpose
14.1 History learning algorithms". * [21] They are trying to formalize
intelligence* [22] in order to not only implement it into
machines, but also understand the human brain, as Demis
14.1.1 2011 to 2014 Hassabis explains:

In 2011 the start-up was founded by Demis Hassabis, [...] Attempting to distil intelligence into
Shane Legg and Mustafa Suleyman.* [3]* [4] Hassabis and an algorithmic construct may prove to be
Legg first met at UCL's Gatsby Computational Neuro- the best path to understanding some of the
science Unit.* [5] enduring mysteries of our minds.
Since then major venture capitalist firms Horizons Ven- —Demis Hassabis, Nature (journal), 23
tures and Founders Fund have invested in the com- February 2012* [23]
pany,* [6] as well as entrepreneur Scott Banister.* [7] Jaan
Tallinn was an early investor and an advisor to the com-
pany.* [8] Currently the company's focus is on publishing research
on computer systems that are able to play games, and
In 2014, DeepMind received the“Company of the Year”
* developing these systems, ranging from strategy games
award by Cambridge Computer Laboratory. [9]
such as Go* [24] to arcade games. According to Shane
The company has created a neural network that learns Legg human-level machine intelligence can be achieved
how to play video games in a similar fashion to hu- "when a machine can learn to play a really wide range of
mans* [10] and a neural network that may be able to ac- games from perceptual stream input and output, and trans-
cess an external memory like a conventional Turing ma- fer understanding across games[...]."* [25] Research de-
chine, resulting in a computer that appears to possibly scribing an AI playing seven different Atari video games
mimic the short-term memory of the human brain.* [11] (Pong, Breakout, Space Invaders, Seaquest, Beamrider,
Enduro, and Q*bert) reportedly led to their acquisition
by Google.* [10]

14.1.2 Acquisition by Google

14.2.1 Deep reinforcement learning
*
On 26 January 2014, Google announced [12] that it had
agreed to acquire DeepMind Technologies. The acqui- As opposed to other AI's, such as IBM's Deep Blue or
sition reportedly took place after Facebook ended nego- Watson, which were developed for a pre-deﬁned purpose
tiations with DeepMind Technologies in 2013.* [13] Fol- and only function within its merit, DeepMind claims that
lowing the acquisition, the company was renamed Google their system is not pre-programmed: it learns from ex-
DeepMind.* [1] perience, using only raw pixels as data input.* [1]* [26]
They test the system on video games, notably early arcade
Estimates of the cost of acquisition vary, from $400 mil- games, such as Space Invaders or Breakout.* [26]* [27]
lion* [14] to over £500 million.* [15]* [16]* [17]* [18]* [19] Without altering the code, the AI begins to understand
One of DeepMind's conditions for Google was that they how to play the game, and after some time plays, for a few
establish an AI Ethics committee.* [20] games (most notably Breakout), a more eﬃcient game

70
14.4. EXTERNAL LINKS 71

than any human ever could.* [27] For most games though [17] Oreskovic, Alexei.“Reuters Report”. Reuters. Retrieved
(Space Invaders, Ms Pacman, Q*Bert for example), 27 January 2014.
DeepMind plays well below the current World Record.
[18] “Google Acquires Artiﬁcial Intelligence Start-Up Deep-
The application of DeepMind's AI to video games is cur- Mind”. The Verge. Retrieved 27 January 2014.
rently for games made in the 1970s and 1980s, with work
being done on more complex 3D games such as Doom, [19] “Google acquires AI pioneer DeepMind Technologies”.
which ﬁrst appeared in the early 1990s.* [27] Ars Technica. Retrieved 27 January 2014.

[20] “Inside Google's Mysterious Ethics Board”. Forbes. 3

February 2014. Retrieved 12 October 2014.
14.3 References
[21] “DeepMind Technologies Website”. DeepMind Tech-
nologies. Retrieved 11 October 2014.
[1] Mnih1, Volodymyr; Kavukcuoglu1, Koray; Silver,
David (26 February 2015). “Human-level con- [22] Shane Legg; Joel Veness (29 September 2011). “An
trol through deep reinforcement learning”. Nature. Approximation of the Universal Intelligence Measure”
doi:10.1038/nature14236. Retrieved 25 February 2015. (PDF). Retrieved 12 October 2014.
[2] “Forbes Report - Acquisition”. Forbes. Retrieved 27 [23] Demis Hassabis (23 February 2012).“Model the brain’s
January 2014. algorithms”(PDF). Nature. Retrieved 12 October 2014.
[3] “Google Buys U.K. Artificial Intelligence Company [24] Shih-Chieh Huang; Martin Müller (12 July 2014).“Inves-
DeepMind”. Bloomberg. 27 January 2014. Retrieved tigating the Limits of Monte-Carlo Tree Search Methods
13 November 2014. in Computer Go”. Springer.
[4] “Google makes £400m move in quest for artificial intel- [25] “Q&A with Shane Legg on risks from AI”. 17 June
ligence”. Financial Times. 27 January 2014. Retrieved 2011. Retrieved 12 October 2014.
13 November 2014.
[26] Volodymyr Mnih; Koray Kavukcuoglu; David Silver;
[5] “Demis Hassabis: 15 facts about the DeepMind Tech- Alex Graves; Ioannis Antonoglou; Daan Wierstra; Mar-
nologies founder”. The Guardian. Retrieved 12 October tin Riedmiller (12 December 2013). “Playing Atari with
2014. Deep Reinforcement Learning”(PDF). Retrieved 12 Oc-
tober 2014.
[6] “DeepMind buy heralds rise of the machines”. Financial
Times. Retrieved 14 October 2014. [27] Deepmind artificial intelligence @ FDOT14. 19 April
2014.
[7] “DeepMind Technologies Investors”. Retrieved 12 Oc-
tober 2014.

[8] “Recode.net - DeepMind Technologies Acquisition”. 14.4 External links

Retrieved 27 January 2014.

[9] “Hall of Fame Awards: To celebrate the success of com- • Google DeepMind
panies founded by Computer Laboratory graduates.”.
Cambridge University. Retrieved 12 October 2014.

[10] “The Last AI Breakthrough DeepMind Made Before

Google Bought It”. The Physics arXiv Blog. Retrieved
12 October 2014.

[11] Best of 2014: Google's Secretive DeepMind Startup Un-

veils a“Neural Turing Machine”, MIT Technology Review

[12] “Google to buy artiﬁcial intelligence company Deep-

Mind”. Reuters. 26 January 2014. Retrieved 12 October
2014.

[13] “Google beats Facebook for Acquisition of DeepMind

Technologies”. Retrieved 27 January 2014.

[14] “Computers, gaming”. The Economist. 28 February

2015.

[15] “Google Acquires UK AI startup Deepmind”. The

Guardian. Retrieved 27 January 2014.

[16] “Report of Acquisition, TechCrunch”. TechCrunch. Re-

trieved 27 January 2014.
Chapter 15

Torch (machine learning)

Torch is an open source machine learning library, a alized, as long as they do not contain references to ob-
scientiﬁc computing framework, and a script language jects that cannot be serialized, such as Lua coroutines,
based on the Lua programming language.* [3] It provides and Lua userdata. However, userdata can be serialized if
a wide range of algorithms for deep machine learning, it is wrapped by a table (or metatable) that provides read()
and uses an extremely fast scripting language LuaJIT, and and write() methods.
an underlying C implementation.

15.2 nn
15.1 torch
The nn package is used for building neural networks.
The core package of Torch is torch. It provides a flexi- It is divided into modular objects that share a com-
ble N-dimensional array or Tensor, which supports basic
mon Module interface. Modules have a forward() and
routines for indexing, slicing, transposing, type-casting, backward() method that allow them to feedforward and
resizing, sharing storage and cloning. This object is used
backpropagate, respectively. Modules can be joined to-
by most other packages and thus forms the core object of gether using module composites, like Sequential, Parallel
the library. The Tensor also supports mathematical op-
and Concat to create complex task-tailored graphs. Sim-
erations like max, min, sum, statistical distributions like pler modules like Linear, Tanh and Max make up the ba-
uniform, normal and multinomial, and BLAS operations sic component modules. This modular interface provides
like dot product, matrix-vector multiplication, matrix- first-order automatic gradient differentiation. What fol-
matrix multiplication, matrix-vector product and matrix lows is an example use-case for building a multilayer per-
product. ceptron using Modules:
The following exemplifies using torch via its REPL inter- > mlp = nn.Sequential() > mlp:add( nn.Linear(10,
preter: 25) ) -- 10 input, 25 hidden units > mlp:add(
> a = torch.randn(3,4) > =a −0.2381 −0.3401 −1.7844 nn.Tanh() ) -- some hyperbolic tangent transfer
−0.2615 0.1411 1.6249 0.1708 0.8299 −1.0434 2.2291 function > mlp:add( nn.Linear(25, 1) ) -- 1 output >
1.0525 0.8465 [torch.DoubleTensor of dimension =mlp:forward(torch.randn(10)) −0.1815 [torch.Tensor
3x4] > a[1][2] −0.34010116549482 > a:narrow(1,1,2) of dimension 1]
−0.2381 −0.3401 −1.7844 −0.2615 0.1411 1.6249
0.1708 0.8299 [torch.DoubleTensor of dimension 2x4]
Loss functions are implemented as sub-classes of Crite-
> a:index(1, torch.LongTensor{1,2}) −0.2381 −0.3401 rion, which has a similar interface to Module. It also has
−1.7844 −0.2615 0.1411 1.6249 0.1708 0.8299 forward() and backward methods for computing the loss
[torch.DoubleTensor of dimension 2x4] > a:min() and backpropagating gradients, respectively. Criteria are
−1.7844365427828 helpful to train neural network on classical tasks. Com-
mon criteria are the Mean Squared Error criterion imple-
The torch package also simplifies object oriented pro- mented in MSECriterion and the cross-entropy criterion
gramming and serialization by providing various con- implemented in ClassNLLCriterion. What follows is an
venience functions which are used throughout its pack- example of a Lua function that can be iteratively called
ages. The torch.class(classname, parentclass) function to train an mlp Module on input Tensor x, target Tensor
can be used to create object factories (classes). When y with a scalar learningRate:
the constructor is called, torch initializes and sets a Lua
function gradUpdate(mlp,x,y,learningRate) lo-
table with the user-defined metatable, which makes the cal criterion = nn.ClassNLLCriterion() pred =
table an object. mlp:forward(x) local err = criterion:forward(pred,
Objects created with the torch factory can also be seri- y); mlp:zeroGradParameters(); local t =

72
15.6. EXTERNAL LINKS 73

criterion:backward(pred, y); mlp:backward(x, t); 15.6 External links

mlp:updateParameters(learningRate); end
• Oﬃcial website
It also has StochasticGradient class for training a neural
• https://github.com/torch/torch7
network using Stochastic gradient descent, although the
Optim package provides much more options in this re-
spect, like momentum and weight decay regularization.

15.3 Other packages

Many packages other than the above oﬃcial packages
are used with Torch. These are listed in the torch cheat-
sheet. These extra packages provide a wide range of util-
ities such as parallelism, asynchronous input/output, im-
age processing, and so on.

15.4 Applications
Torch is used by Google DeepMind,* [4] the Facebook AI
Research Group,* [5] IBM,* [6] Yandex* [7] and the Idiap
Research Institute.* [8] Torch has been extended for use
on Android* [9] and iOS.* [10] It has been used to build
hardware implementations for data ﬂows like those found
in neural networks.* [11]
Facebook has released a set of extension modules as open
source software.* [12]

15.5 References
[1] “Torch: a modular machine learning software library”.
30 October 2002. Retrieved 24 April 2014.

[2] Ronan Collobert. “Torch7”. GitHub.

[3] Ronan Collobert; Koray Kavukcuoglu; Clement Farabet

(2011). “Torch7: A Matlab-like Environment for Ma-
chine Learning” (PDF). Neural Information Processing
Systems.

[4] What is going on with DeepMind and Google?

[5] KDnuggets Interview with Yann LeCun, Deep Learning

Expert, Director of Facebook AI Lab

[6] Hacker News

[7] Yann Lecun's Facebook Page

[8] IDIAP Research Institute : Torch

[9] Torch-android GitHub repository

[10] Torch-ios GitHub repository

[11] NeuFlow: A Runtime Reconﬁgurable Dataﬂow Processor

for Vision

[12] “Facebook Open-Sources a Trove of AI Tools”. Wired.

16 January 2015.
Chapter 16

Theano (software)

Theano is a numerical computation library for

Python.* [1] In Theano, computations are expressed
using a NumPy-like syntax and compiled to run
eﬃciently on either CPU or GPU architectures.
Theano is an open source project* [2] primarily developed
by a machine learning group at the Université de Mon-
tréal.* [3]

16.1 See also

• SciPy

• Torch

16.2 References
[1] Bergstra, J.; O. Breuleux, F. Bastien, P. Lamblin, R. Pas-
canu, G. Desjardins, J. Turian, D. Warde-Farley and Y.
Bengio (30 June 2010).“Theano: A CPU and GPU Math
Expression Compiler” (PDF). Proceedings of the Python
for Scientiﬁc Computing Conference (SciPy) 2010.

[2] “Github Repository”.

[3] “deeplearning.net”.

74
Chapter 17

Deeplearning4j

Deeplearning4j is an open source deep learning library 17.3 Scientific Computing for the
written for Java and the Java Virtual Machine* [1]* [2]
and a computing framework with wide support for deep
JVM
learning algorithms. Deeplearning4j includes implemen-
tations of the restricted Boltzmann machine, deep be- Deeplearning4j includes an n-dimensional array class us-
lief net, deep autoencoder, stacked denoising autoen- ing ND4J that allows for scientific computing in Java and
coder and recursive neural tensor network, as well as Scala, similar to the functionality that Numpy provides
word2vec, doc2vec and GloVe. These algorithms all to Python. It's effectively based on a library for linear al-
include distributed parallel versions that integrate with gebra and matrix manipulation in a production environ-
Hadoop and Spark.* [3] ment. It relies on Matplotlib as a plotting package.

17.1 Introduction 17.4 Canova Vectorization Lib for

Machine-Learning
Deeplearning4j relies on the widely used programming
language, Java - though it is compatible with Clojure and
includes a Scala API. It is powered by its own open-source Canova vectorizes various ﬁle formats and data types us-
numerical computing library, ND4J, and works with both ing an input/output format system similar to Hadoop's use
CPUs and GPUs.* [4] * [5] of MapReduce. A work in progress, Canova is designed
to vectorize CSVs, images, sound, text and video. Since
Deeplearning4j is an open source project* [6] primarily vectorization is a necessary step in preparing data to be
developed by a machine learning group in San Francisco ingested by neural nets, Canova solves one of the most
led by Adam Gibson.* [7]* [8] Deeplearning4j is the only important problems in machine learning. Canova can be
open-source project listed on Google's Word2vec page used from the command line.
for its Java implementation.* [9]
Deeplearning4j has been used in a number of com-
mercial and academic applications. The code is hosted 17.5 Text & NLP
on GitHub* [10] and a support forum is maintained on
Google Groups.* [11]
Deeplearning4j includes a vector space modeling and
The framework is composable, meaning shallow neural topic modeling toolkit, implemented in Java and integrat-
nets such as restricted Boltzmann machines, convolu- ing with parallel GPUs for performance. It is speciﬁcally
tional nets, autoencoders and recurrent nets can be added intended for handling large text collections.
to one another to create deep nets of varying types.
Deeplearning4j includes implementations of tf–idf, deep
learning, and Mikolov's word2vec algorithm, doc2vec
and GloVe -- reimplemented and optimized in Java. It
17.2 Distributed relies on TSNE for word-cloud visualizations.

Training with Deeplearning4j takes place in the cluster,

which means it can process massive amounts of data.
Neural nets are trained in parallel via iterative reduce, 17.6 See also
which works on Hadoop/YARN and on Spark.* [7]* [12]
Deeplearning4j also integrates with Cuda kernels to con- • Torch
duct pure GPU operations, and works with distributed
GPUs. • Theano

75
76 CHAPTER 17. DEEPLEARNING4J

17.7 References
[1] Metz, Cade (2014-06-02). “The Mission to Bring
Google's AI to the Rest of the World”. Wired.com. Re-
trieved 2014-06-28.

[2] Vance, Ashlee (2014-06-03).“Deep Learning for (Some

of) the People”. Bloomberg Businessweek. Retrieved
2014-06-28.

[3] TV, Functional (2015-02-12). “Adam Gibson,

DeepLearning4j on Spark and Data Science on JVM
with nd4j, SF Spark @Galvanize 20150212”. SF Spark
Meetup. Retrieved 2015-03-01.

[4] Harris, Derrick (2014-06-02). “A startup called Sky-

mind launches, pushing open source deep learning”.
GigaOM.com. Retrieved 2014-06-29.

[5] Novat, Jordan (2014-06-02). “Skymind launches with

open-source, plug-and-play deep learning features for
your app”. Retrieved 2014-06-29.

[6] “Github Repository”.

[7] “deeplearning4j.org”.

[8] “Crunchbase Proﬁle”.

[9] “Google Code”.

[10] Deeplearning4j source code

[11] Deeplearning4j Google Group

[12] “Iterative reduce”.

17.8 External links

• Oﬃcial website

• “Github Repositories”.
• “Deeplearning4j vs. Torch vs. Caﬀe vs. Pylearn”
.
• “Canova: A General Vectorization Lib for Machine
Learning”.
• “Apache Flink”.
Chapter 18

Gensim

Gensim is an open-source vector space modeling and [8] Rehurek, Radim.“Gensim”. http://radimrehurek.com/''.
topic modeling toolkit, implemented in the Python pro- Retrieved 27 January 2015. Gensim's tagline: "Topic
gramming language, using NumPy, SciPy and optionally Modelling for Humans"
Cython for performance. It is specifically intended for
handling large text collections, using efficient online al-
gorithms. 18.3 External links
Gensim includes implementations of tf–idf, random pro-
jections, deep learning with Google's word2vec algo- • Official website
rithm * [1] (reimplemented and optimized in Cython),
hierarchical Dirichlet processes (HDP), latent semantic
analysis (LSA) and latent Dirichlet allocation (LDA), in-
cluding distributed parallel versions.* [2]
Gensim has been used in a number of commercial as well
as academic applications.* [3]* [4] The code is hosted on
GitHub* [5] and a support forum is maintained on Google
Groups.* [6]
Gensim accompanied the PhD dissertation Scalability
of Semantic Analysis in Natural Language Processing of
Radim Řehůřek (2011).* [7]

18.1 Gensim's tagline

• Topic Modelling for Humans * [8]

18.2 References
[1] Deep learning with word2vec and gensim

[2] Radim Řehůřek and Petr Sojka (2010). Software frame-

work for topic modelling with large corpora. Proc. LREC
Workshop on New Challenges for NLP Frameworks.

[3] Interview with Radim Řehůřek, creator of gensim

[4] gensim academic citations

[5] gensim source code

[6] gensim mailing list

[7] Rehurek, Radim (2011). “Scalability of Semantic Anal-

ysis in Natural Language Processing” (PDF). http://
radimrehurek.com/''. Retrieved 27 January 2015. my
open-source gensim software package that accompanies
this thesis

77
Chapter 19

Geoﬀrey Hinton

Geoffrey (Geoff) Everest Hinton FRS (born 6 Decem- invented Boltzmann machines with Terry Sejnowski. His
ber 1947) is a British-born cognitive psychologist and other contributions to neural network research include
computer scientist, most noted for his work on artificial distributed representations, time delay neural network,
neural networks. He now divides his time working for mixtures of experts, Helmholtz machines and Product
Google and University of Toronto.* [1] He is the co- of Experts. His current main interest is in unsupervised
inventor of the backpropagation and contrastive diver- learning procedures for neural networks with rich sensory
gence training algorithms and is an important figure in input.
the deep learning movement.* [2]

19.3 Honours and awards

19.1 Career
Hinton was the first winner of the David E. Rumelhart
Hinton graduated from Cambridge in 1970, with a Prize. He was elected a Fellow of the Royal Society in
Bachelor of Arts in experimental psychology, and from 1998.* [5]
Edinburgh in 1978, with a PhD in artificial intelli-
In 2001, Hinton was awarded an Honorary Doctorate
gence. He has worked at Sussex, University of Califor-
from the University of Edinburgh.
nia San Diego, Cambridge, Carnegie Mellon University
and University College London. He was the founding di- Hinton was the 2005 recipient of the IJCAI Award for
rector of the Gatsby Computational Neuroscience Unit at Research Excellence lifetime-achievement award.
University College London, and is currently a professor He has also been awarded the 2011 Herzberg Canada
in the computer science department at the University of Gold Medal for Science and Engineering.* [6]
Toronto. He holds a Canada Research Chair in Machine
Learning. He is the director of the program on “Neural In 2013, Hinton was awarded an Honorary Doctorate
Computation and Adaptive Perception”which is funded from the Université de Sherbrooke.
by the Canadian Institute for Advanced Research. Hin-
ton taught a free online course on Neural Networks on the
education platform Coursera in 2012.* [3] Hinton joined 19.4 Personal life
Google in March 2013 when his company, DNNresearch
Inc, was acquired. He is planning to “divide his time
Hinton is the great-great-grandson both of logician
between his university research and his work at Google”
George Boole whose work eventually became one of the
.* [4]
foundations of modern computer science, and of surgeon
and author James Hinton.* [7] His father is Howard Hin-
ton.
19.2 Research interests
An accessible introduction to Geoffrey Hinton's research 19.5 References
can be found in his articles in Scientific American in
September 1992 and October 1993. He investigates
[1] Daniela Hernandez (7 May 2013). “The Man Behind the
ways of using neural networks for learning, memory,
Google Brain: Andrew Ng and the Quest for the New AI”
perception and symbol processing and has authored over . Wired. Retrieved 10 May 2013.
200 publications in these areas. He was one of the
researchers who introduced the back-propagation algo- [2] “How a Toronto professor’s research revolutionized ar-
rithm for training multi-layer neural networks that has tificial intelligence”. Toronto Star, Kate Allen, Apr 17
been widely used for practical applications. He co- 2015

78
19.6. EXTERNAL LINKS 79

[3] https://www.coursera.org/course/neuralnets

[4] “U of T neural networks start-up acquired by Google”

(Press release). Toronto, ON. 12 March 2013. Retrieved
13 March 2013.

[5] “Fellows of the Royal Society”. The Royal Society. Re-

trieved 14 March 2013.

[6] “Artiﬁcial intelligence scientist gets M prize”. CBC News.

14 February 2011.

[7] The Isaac Newton of logic

19.6 External links

• Geoﬀrey E. Hinton's Academic Genealogy

• Geoﬀrey E. Hinton's Publications in Reverse

Chronological Order

• Homepage (at UofT)

• “The Next Generation of Neural Networks” on
YouTube
• Gatsby Computational Neuroscience Unit (founding
director)
• Encyclopedia article on Boltzmann Machines writ-
ten by Geoﬀrey Hinton for Scholarpedia
Chapter 20

Yann LeCun

Yann LeCun (born 1960) is a computer scientist with After a brief tenure as a Fellow of the NEC Research
contributions in machine learning, computer vision, Institute (now NEC-Labs America) in Princeton, NJ, he
mobile robotics and computational neuroscience. He is joined New York University (NYU) in 2003, where he is
well known for his work on optical character recogni- Silver Professor of Computer Science Neural Science at
tion and computer vision using convolutional neural net- the Courant Institute of Mathematical Science and the
works (CNN), and is a founding father of convolutional Center for Neural Science. He is also a professor at
nets.* [1]* [2] He is also one of the main creators of the Polytechnic Institute of New York University.* [8]* [9] At
DjVu image compression technology (together with Léon NYU, he has worked primarily on Energy-Based Mod-
Bottou and Patrick Haffner). He co-developed the Lush els for supervised and unsupervised learning,* [10] feature
programming language with Léon Bottou. learning for object recognition in Computer Vision,* [11]
and mobile robotics.* [12]
In 2012, he became the founding director of the NYU
20.1 Life Center for Data Science.* [13] On December 9, 2013, Le-
Cun became the first director of Facebook AI Research in
Yann LeCun was born near Paris, France, in 1960. New York City.,* [14] and stepped down from the NYU-
He received a Diplôme d'Ingénieur from the Ecole Su- CDS directorship in early 2014.
perieure d'Ingénieur en Electrotechnique et Electronique LeCun is the recipient of the 2014 IEEE Neural Network
(ESIEE), Paris in 1983, and a PhD in Computer Science Pioneer Award.
from Université Pierre et Marie Curie in 1987 during
In 2013, he and Yoshua Bengio co-founded the Inter-
which he proposed an early form of the back-propagation
national Conference on Learning Representations, which
learning algorithm for neural networks.* [3] He was a
adopted a post-publication open review process he previ-
postdoctoral research associate in Geoffrey Hinton's lab
ously advocated on his website. He was the chair and or-
at the University of Toronto.
ganizer of the“Learning Workshop”held every year be-
In 1988, he joined the Adaptive Systems Research De- tween 1986 and 2012 in Snowbird, Utah. He is a member
partment at AT&T Bell Laboratories in Holmdel, New of the Science Advisory Board of the Institute for Pure
Jersey, USA, where he developed a number of new ma- and Applied Mathematics* [15] at UCLA, and has been
chine learning methods, such as a biologically inspired on the advisory board of a number of companies, includ-
model of image recognition called Convolutional Neural ing MuseAmi, KXEN Inc., and Vidient Systems.* [16]
Networks,* [4] the“Optimal Brain Damage”regulariza- He is the Co-Director of the Neural Computation &
tion methods,* [5] and the Graph Transformer Networks Adaptive Perception research program of CIFAR* [17]
method (similar to conditional random field), which he
applied to handwriting recognition and OCR.* [6] The
bank check recognition system that he helped develop 20.2 References
was widely deployed by NCR and other companies, read-
ing over 10% of all the checks in the US in the late 1990s
[1] Convolutional Nets and CIFAR-10: An Interview with
and early 2000s.
Yann LeCun. Kaggle 2014
In 1996, he joined AT&T Labs-Research as head of the
Image Processing Research Department, which was part [2] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick
of Lawrence Rabiner's Speech and Image Processing Re- Haffner (1998).“Gradient-based learning applied to doc-
ument recognition” (PDF). Proceedings of the IEEE 86
search Lab, and worked primarily on the DjVu image
(11): 2278–2324. doi:10.1109/5.726791. Retrieved 16
compression technology,* [7] used by many websites, no- November 2013.
tably the Internet Archive, to distribute scanned docu-
ments. His collaborators at AT&T include Léon Bottou [3] Y. LeCun: Une procédure d'apprentissage pour réseau a
and Vladimir Vapnik. seuil asymmetrique (a Learning Scheme for Asymmetric

80
20.3. EXTERNAL LINKS 81

Threshold Networks), Proceedings of Cognitiva 85, 599– • Yann LeCun's List of PhD Students
604, Paris, France, 1985.
• Yann LeCun's publications
[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.
Howard, W. Hubbard and L. D. Jackel: Backpropagation • Convolutional Neural Networks
Applied to Handwritten Zip Code Recognition, Neural
Computation, 1(4):541-551, Winter 1989. • DjVuLibre website

[5] Yann LeCun, J. S. Denker, S. Solla, R. E. Howard and L. • Lush website

D. Jackel: Optimal Brain Damage, in Touretzky, David
(Eds), Advances in Neural Information Processing Sys-
tems 2 (NIPS*89), Morgan Kaufmann, Denver, CO,
1990.

[6] Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick

Haﬀner: Gradient Based Learning Applied to Document
Recognition, Proceedings of IEEE, 86(11):2278–2324,
1998.

[7] Léon Bottou, Patrick Haﬀner, Paul G. Howard, Patrice

Simard, Yoshua Bengio and Yann LeCun: High Qual-
ity Document Image Compression with DjVu, Journal of
Electronic Imaging, 7(3):410–425, 1998.

[8] “People - Electrical and Computer Engineering”. Poly-

technic Institute of New York University. Retrieved 13
March 2013.

[9] http://yann.lecun.com/

[10] Yann LeCun, Sumit Chopra, Raia Hadsell, Ranzato

Marc'Aurelio and Fu-Jie Huang: A Tutorial on Energy-
Based Learning, in Bakir, G. and Hofman, T. and
Schölkopf, B. and Smola, A. and Taskar, B. (Eds), Pre-
dicting Structured Data, MIT Press, 2006.

[11] Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato

and Yann LeCun: What is the Best Multi-Stage Architec-
ture for Object Recognition?, Proc. International Confer-
ence on Computer Vision (ICCV'09), IEEE, 2009

[12] Raia Hadsell, Pierre Sermanet, Marco Scoﬃer, Ayse

Erkan, Koray Kavackuoglu, Urs Muller and Yann Le-
Cun: Learning Long-Range Vision for Autonomous Oﬀ-
Road Driving, Journal of Field Robotics, 26(2):120–144,
February 2009.

[13] http://cds.nyu.edu

[14] https://www.facebook.com/yann.lecun/posts/
10151728212367143

[15] http://www.ipam.ucla.edu/programs/gss2012/ Institute

for Pure and Applied Mathematics

[16] Vidient Systems.

[17] “Neural Computation & Adaptive Perception Advisory

Committee Yann LeCun”. CIFAR. Retrieved 16 Decem-
ber 2013.

20.3 External links

• Yann LeCun's personal website
• Yann LeCun's lab website at NYU
Chapter 21

Jürgen Schmidhuber

Jürgen Schmidhuber (born 17 January 1963 in gramming. In the same year he published the first
Munich) is a computer scientist and artist known for work on Meta-genetic programming. Since then he
his work on machine learning, Artificial Intelligence has co-authored numerous additional papers on artifi-
(AI), artificial neural networks, digital physics, and low- cial evolution. Applications include robot control, soccer
complexity art. His contributions also include gen- learning, drag minimization, and time series prediction.
eralizations of Kolmogorov complexity and the Speed He received several best paper awards at scientific con-
Prior. From 2004 to 2009 he was professor of Cog- ferences on evolutionary computation.
nitive Robotics at the Technische Universität München.
Since 1995 he has been co-director of the Swiss AI
Lab IDSIA in Lugano, since 2009 also professor of Ar-
tificial Intelligence at the University of Lugano. Be- 21.1.3 Neural economy
tween 2009 and 2012, the recurrent neural networks and
deep feedforward neural networks developed in his re- In 1989 he created the first learning algorithm for neural
search group have won eight international competitions networks based on principles of the market economy (in-
in pattern recognition and machine learning.* [1] In honor spired by John Holland's bucket brigade algorithm for
of his achievements he was elected to the European classifier systems): adaptive neurons compete for being
Academy of Sciences and Arts in 2008. active in response to certain input patterns; those that
are active when there is external reward get stronger
synapses, but active neurons have to pay those that
21.1 Contributions activated them, by transferring parts of their synapse
strengths, thus rewarding “hidden”neurons setting the
stage for later success.* [5]
21.1.1 Recurrent neural networks

The dynamic recurrent neural networks developed in his

lab are simplified mathematical models of the biological 21.1.4 Artificial curiosity and creativity
neural networks found in human brains. A particularly
successful model of this type is called Long short term In 1990 he published the first in a long series of papers on
memory.* [2] From training sequences it learns to solve artificial curiosity and creativity for an autonomous agent.
numerous tasks unsolvable by previous such models. Ap- The agent is equipped with an adaptive predictor trying to
plications range from automatic music composition to predict future events from the history of previous events
speech recognition, reinforcement learning and robotics and actions. A reward-maximizing, reinforcement learn-
in partially observable environments. As of 2010, his ing, adaptive controller is steering the agent and gets cu-
group has the best results on benchmarks in automatic riosity reward for executing action sequences that im-
handwriting recognition, obtained with deep neural net- prove the predictor. This discourages it from execut-
works* [3] and recurrent neural networks.* [4] ing actions leading to boring outcomes that are either
predictable or totally unpredictable.* [6] Instead the con-
troller is motivated to learn actions that help the predictor
21.1.2 Artificial evolution / genetic pro- to learn new, previously unknown regularities in its envi-
gramming ronment, thus improving its model of the world, which
in turn can greatly help to solve externally given tasks.
As an undergrad at TUM Schmidhuber evolved computer This has become an important concept of developmental
programs through genetic algorithms. The method was robotics. Schmidhuber argues that his corresponding
published in 1987 as one of the first papers in the formal theory of creativity explains essential aspects of
emerging field that later became known as genetic pro- art, science, music, and humor.* [7]

82
21.2. REFERENCES 83

21.1.5 Unsupervised learning / factorial 21.1.8 Low-complexity art / theory of

codes beauty

During the early 1990s Schmidhuber also invented a Schmidhuber's low-complexity artworks (since 1997) can
neural method for nonlinear independent component be described by very short computer programs containing
analysis (ICA) called predictability minimization. It is very few bits of information, and reflect his formal the-
based on co-evolution of adaptive predictors and initially ory of beauty* [15] based on the concepts of Kolmogorov
random, adaptive feature detectors processing input pat- complexity and minimum description length.
terns from the environment. For each detector there is Schmidhuber writes that since age 15 or so his main sci-
a predictor trying to predict its current value from the entific ambition has been to build an optimal scientist,
values of neighboring detectors, while each detector is then retire. First he wants to build a scientist better than
simultaneously trying to become as unpredictable as pos- himself (he quips that his colleagues claim that should be
sible.* [8] It can be shown that the best the detectors can easy) who will then do the remaining work. He claims he
do is to create a factorial code of the environment, that “cannot see any more efficient way of using and multi-
is, a code that conveys all the information about the in- plying the little creativity he's got”.
puts such that the code components are statistically inde-
pendent, which is desirable for many pattern recognition
applications. 21.1.9 Robot learning

In recent years a robotics group with focus on intelligent

21.1.6 Kolmogorov complexity / computer- and learning robots, especially in the fields of swarm and
generated universe humanoid robotics was established at his lab.* [16] The
lab is equipped with a variety of mobile and flying robots
In 1997 Schmidhuber published a paper based on Konrad and is one of the around 20 labs in the world owning an
Zuse's assumption (1967) that the history of the universe iCub humanoid robot. The group has applied a variety
is computable. He pointed out that the simplest explana- of machine learning algorithms, such as reinforcement
tion of the universe would be a very simple Turing ma- learning and genetic programming, to improve adaptive-
chine programmed to systematically execute all possible ness and autonomy of robotic systems.
programs computing all possible histories for all types
Recently his work on evolutionary robotics, with a focus
of computable physical laws.* [9]* [10] He also pointed
on using genetic programming to evolve robotic skills, es-
out that there is an optimally efficient way of computing
pecially in robot vision have allowed for quick and robust
all computable universes based on Leonid Levin's uni-
object detection in humanoid robots.* [17]* [18]* [19] ID-
versal search algorithm (1973). In 2000 he expanded
SIA's work with the iCub humanoid won the 2013 AAAI
this work by combining Ray Solomonoff's theory of in-
Student Video competition.* [20]* [21]
ductive inference with the assumption that quickly com-
putable universes are more likely than others.* [11] This
work on digital physics also led to limit-computable gen-
eralizations of algorithmic information or Kolmogorov 21.2 References
complexity and the concept of Super Omegas, which are
limit-computable numbers that are even more random (in [1] 2012 Kurzweil AI Interview with Jürgen Schmidhuber on
a certain sense) than Gregory Chaitin's number of wisdom the eight competitions won by his Deep Learning team
Omega.* [12] 2009-2012

[2] S. Hochreiter and J. Schmidhuber. Long Short-Term

Memory. Neural Computation, 9(8):1735–1780, 1997.
21.1.7 Universal AI
[3] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmid-
Important research topics of his group include universal huber. Deep Big Simple Neural Nets For Handwritten
learning algorithms and universal AI* [13]* [14] (see Digit Recognition. Neural Computation 22(12): 3207-
Gödel machine). Contributions include the first theo- 3220.
retically optimal decision makers living in environments
obeying arbitrary unknown but computable probabilistic [4] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H.
laws, and mathematically sound general problem solvers Bunke, J. Schmidhuber. A Novel Connectionist System
for Improved Unconstrained Handwriting Recognition.
such as the remarkable asymptotically fastest algorithm
IEEE Transactions on Pattern Analysis and Machine In-
for all well-defined problems, by his former postdoc
telligence, vol. 31, no. 5, 2009.
Marcus Hutter. Based on the theoretical results obtained
in the early 2000s, Schmidhuber is actively promoting the [5] J. Schmidhuber. A local learning algorithm for dynamic
view that in the new millennium the field of general AI has feedforward and recurrent networks. Connection Science,
matured and become a real formal science. 1(4):403–412, 1989
84 CHAPTER 21. JÜRGEN SCHMIDHUBER

[6] J. Schmidhuber. Curious model-building control systems. 21.3 Sources

In Proc. International Joint Conference on Neural Net-
works, Singapore, volume 2, pages 1458–1463. IEEE, • Google Scholar: Numerous scientiﬁc articles refer-
1991
encing Schmidhuber's work
[7] J. Schmidhuber. Formal Theory of Creativity, Fun, and
Intrinsic Motivation (1990–2010). IEEE Transactions on
• Scholarpedia article on Universal Search, discussing
Autonomous Mental Development, 2(3):230–247, 2010. Schmidhuber's Speed Prior, Optimal Ordered Prob-
lem Solver, Gödel machine
[8] J. Schmidhuber. Learning factorial codes by predictability
minimization. Neural Computation, 4(6):863–879, 1992 • German article on Schmidhuber in CIO magazine:
“Der ideale Wissenschaftler”(the ideal scientist)
[9] J. Schmidhuber. A computer scientist's view of life, the
universe, and everything. Foundations of Computer Sci- • Build An Optimal Scientist, Then Retire: Interview
ence: Potential – Theory – Cognition, Lecture Notes in with J. Schmidhuber in H+ magazine, 2010
Computer Science, pages 201–208, Springer, 1997
• Video of Schmidhuber's talk on artiﬁcial curiosity
[10] Brian Greene, Chapter 10 of: The Hidden Reality: Paral- and creativity at the Singularity Summit 2009, NYC
lel Universes and the Deep Laws of the Cosmos, Knopf,
2011 • TV clip: Schmidhuber on computable universes on
Through the Wormhole with Morgan Freeman.
[11] J. Schmidhuber. The Speed Prior: A New Simplicity
Measure Yielding Near-Optimal Computable Predictions. • On-going research on the iCub humanoid at the ID-
Proceedings of the 15th Annual Conference on Computa-
SIA Robotics Lab
tional Learning Theory (COLT 2002), Sydney, Australia,
LNAI, 216–228, Springer, 2002

[12] J. Schmidhuber. Hierarchies of generalized Kolmogorov 21.4 External links

complexities and nonenumerable universal measures com-
putable in the limit. International Journal of Foundations
of Computer Science 13(4):587–612, 2002 • Home page

[13] J. Schmidhuber. Ultimate Cognition à la Gödel. Cogni- • Publications

tive Computation 1(2):177–193, 2009
• Videos of Juergen Schmidhuber & the Swiss AI Lab
[14] J. Schmidhuber. Optimal Ordered Problem Solver. Ma- IDSIA
chine Learning, 54, 211–254, 2004

[15] J. Schmidhuber. Low-Complexity Art. Leonardo, Jour-

nal of the International Society for the Arts, Sciences, and
Technology, 30(2):97–103, MIT Press, 1997

[16] http://robotics.idsia.ch/ The IDSIA Robotics Lab

[17] J. Leitner, S. Harding, P. Chandrashekhariah, M. Frank,

A. Förster, J. Triesch and J. Schmidhuber. Learning Vi-
sual Object Detection and Localisation Using icVision.
Biologically Inspired Cognitive Architectures, Vol. 5,
2013.

[18] J. Leitner, S. Harding, M. Frank, A. Förster and J.

Schmidhuber. Humanoid Learns to Detect Its Own
Hands. IEEE Congress on Evolutionary Computing
(CEC), 2013.

[19] S. Harding, J. Leitner and J. Schmidhuber. Cartesian

Genetic Programming for Image Processing (CGP-IP).
Genetic Programming Theory and Practice X (Springer
Tract on Genetic and Evolutionary Computation). pp 31-
44. ISBN 978-1-4614-6845-5. Springer, Ann Arbor,
2013.

[20] http://www.aaaivideos.org/2013/ AAAI Video Competi-

tion 2013.

[21] M. Stollenga, L. Pape, M. Frank, J. Leitner, A. Förster and

J. Schmidhuber. Task-Relevant Roadmaps: A Frame-
work for Humanoid Motion Planning. IROS, 2013.
Chapter 22

Jeﬀ Dean (computer scientist)

Jeffrey Adgate “Jeff”Dean (born 1968) is an American involvement in the engineering hiring process.
computer scientist and software engineer. He is currently
Among others, the projects he's worked on include:
a Google Senior Fellow in the Systems and Infrastructure
Group.
• Spanner - a scalable, multi-version, globally dis-
tributed, and synchronously replicated database
22.1 Personal life and education • Some of the production system design and statistical
machine translation system for Google Translate.
Dean received a Ph.D. in Computer Science from the • BigTable, a large-scale semi-structured storage sys-
University of Washington, working with Craig Chambers tem.
on whole-program optimization techniques for object-
oriented languages. He received a B.S., summa cum laude • MapReduce a system for large-scale data processing
from the University of Minnesota in Computer Science applications.
& Economics in 1990. He was elected to the National
Academy of Engineering in 2009, which recognized his • Google Brain a system for large-scale artificial neu-
work on“the science and engineering of large-scale dis- ral networks
tributed computer systems.”

22.4 Awards and honors

22.2 Career in computer science
• Elected to the National Academy of Engineering
(2009)
Prior to joining Google, he was at DEC/Compaq's West-
ern Research Laboratory, where he worked on profiling • Fellow of the Association for Computing Machinery
tools, microprocessor architecture, and information re- (2009)
trieval.
• ACM-Infosys Foundation Award (2012)
Prior to graduate school, he worked at the World Health
Organization's Global Programme on AIDS, developing
software for statistical modeling and forecasting of the
HIV/AIDS pandemic. 22.5 Major publications
• Jeffrey Dean and Sanjay Ghemawat. 2004.
MapReduce: Simplified Data Processing on Large
22.3 Career at Google Clusters. OSDI'04: Sixth Symposium on Operat-
ing System Design and Implementation (December
Dean joined Google in mid-1999, and is currently a 2004)
Google Senior Fellow in the Systems Infrastructure
Group. While at Google, he has designed and im-
plemented large portions of the company's advertising,
crawling, indexing and query serving systems, along with
22.6 See also
various pieces of the distributed computing infrastruc-
ture that sits underneath most of Google's products. At • Big Table
various times, he has also worked on improving search • MapReduce
quality, statistical machine translation, and various inter-
nal software development tools and has had significant • Spanner

85
86 CHAPTER 22. JEFF DEAN (COMPUTER SCIENTIST)

22.7 External links

• Jeﬀ Dean's Google home page.

• Meet Google's Baddest Engineer, Jeﬀ Dean

• The Optimizer
Chapter 23

Andrew Ng

Andrew Yan-Tak Ng (Chinese: 吳恩達; born 1976) is 23.2 Online education

Chief Scientist at Baidu Research in Silicon Valley. In
addition, he is an associate professor in the Department Ng started the Stanford Engineering Everywhere (SEE)
of Computer Science and the Department of Electrical program, which in 2008 placed a number of Stanford
Engineering by courtesy at Stanford University. He is courses online, for free. Ng taught one of these courses,
chairman of the board of Coursera, an online education Machine Learning, which consisted of video lectures by
platform that he co-founded with Daphne Koller. him, along with the student materials used in the Stanford
He researches primarily in machine learning and deep CS229 class.
learning. His early work includes the Stanford Au- The “applied”version of the Stanford class (CS229a)
tonomous Helicopter project, which developed one was hosted on ml-class.org and started in October 2011,
of the most capable autonomous helicopters in the with over 100,000 students registered for its first iteration;
world,* [2]* [3] and the STAIR (STanford Artificial In- the course featured quizzes and graded programming as-
telligence Robot) project,* [4] which resulted in ROS, a signments and became one of the first successful MOOCs
widely used open-source robotics software platform. made by Stanford professors.* [14] His work subsequently
Ng is also the author or co-author of over 100 published led to the founding of Coursera in 2012.
papers in machine learning, robotics and related fields,
and some of his work in computer vision has been fea-
tured in a series of press releases and reviews.* [5] In 23.3 Personal life
2008, he was named to the MIT Technology Review TR35
as one of the top 35 innovators in the world under the age
of 35.* [6]* [7] In 2007, Ng was awarded a Sloan Fellow- Ng was born in the UK in 1976. His parents were
ship. For his work in Artificial Intelligence, he is also a both Hongkongers. He spent time in Hong Kong and
recipient of the Computers and Thought Award. Singapore* [1] and later graduated from Raffles Institu-
tion in Singapore as the class of 1992 and received his
On May 16, 2014, Ng announced from his Coursera blog undergraduate degree in computer science from Carnegie
that he would be stepping away from his day-to-day re- Mellon University in Pittsburgh, Pennsylvania as the class
sponsibilities at Coursera, and join Baidu as Chief Scien- of 1997. Then, he attained his master's degree from
tist, working on deep learning.* [8] Massachusetts Institute of Technology in Cambridge,
Massachusetts as the class of 1998 and received his PhD
from University of California, Berkeley in 2002. He
started working at Stanford University during that year;
23.1 Machine learning research he currently lives in Palo Alto, California. He married
Carol E. Reiley in 2014.

In 2011, Ng founded the Google Brain project at Google,

which developed very large scale artiﬁcial neural net-
works using Google's distributed computer infrastruc-
23.4 References
ture.* [9] Among its notable results was a neural network
trained using deep learning algorithms on 16,000 CPU [1] Seligman, Katherine (3 December 2006). “If Andrew
cores, that learned to recognize higher-level concepts, Ng could just get his robot to assemble an Ikea bookshelf,
we'd all buy one”. SFGate. Retrieved 12 February 2013.
such as cats, after watching only YouTube videos, and
without ever having been told what a“cat”is.* [10]* [11] [2] “From Self-Flying Helicopters to Classrooms of the Fu-
The project's technology is currently also used in the ture”. Chronicle of Higher Education. 2012.
Android Operating System's speech recognition sys-
tem.* [12] [3] “Stanford Autonomous Helicopter Project”.

87
88 CHAPTER 23. ANDREW NG

[4] John Markoﬀ (18 July 2006).“Brainy Robots Start Step-

ping Into Daily Life”. New York Times.

[5] New algorithm improves robot vision

[6]“2008 Young Innovators Under 35”. Technology Review.

2008. Retrieved August 15, 2011.

[7] Technology Review: TR35

[8] “A personal message from Co-founder Andrew Ng”.

Coursera blog. 2014. Retrieved May 16, 2014.

[9] Claire Miller and Nick Bilton (3 November 2011).

“Google’s Lab of Wildest Dreams”. New York Times.

[10] John Markoﬀ (25 June 2012). “How Many Computers

to Identify a Cat? 16,000.”. New York Times.

[11] Ng, Andrew; Dean, Jeﬀ (2012). “Building High-level

Features Using Large Scale Unsupervised Learning”
(PDF).

[12] “Speech Recognition and Deep Learning”. Google Re-

search Blog. Google. 6 August 2012. Retrieved 29 Jan-
uary 2013.

[13] “Interview with Coursera Co-Founder Andrew Ng”. De-

gree of Freedom. Retrieved May 19, 2013.

[14] Theresa Johnson.“Stanford for All”. Stanford Magazine.

23.5 See also

• Robot Operating System

23.6 External links

• Homepage
• STAIR Homepage

• Publications
• Academic Genealogy

• Machine Learning (CS 229) Video Lecture

• Lecture videos

• From Self-Flying Helicopters to Classrooms of the

Future

• Coursera-Leadership
23.7. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES 89

23.7 Text and image sources, contributors, and licenses

23.7.1 Text
• Artificial neural network Source: http://en.wikipedia.org/wiki/Artificial%20neural%20network?oldid=662892335 Contributors: Mag-
nus Manske, Ed Poor, Iwnbap, PierreAbbat, Youandme, Susano, Hfastedge, Mrwojo, Michael Hardy, Erik Zachte, Oliver Pereira,
Bobby D. Bryant, Zeno Gantner, Parmentier~enwiki, Delirium, Pieter Suurmond, (, Alfio, 168..., Ellywa, Ronz, Snoyes, Den fjättrade
ankan~enwiki, Cgs, Glenn, Cyan, Hike395, Hashar, Novum, Charles Matthews, Guaka, Timwi, Reddi, Andrewman327, Munford, Fur-
rykef, Bevo, Fvw, Raul654, Nyxos, Unknown, Pakcw, Robbot, Chopchopwhitey, Bkell, Hadal, Wikibot, Diberri, Xanzzibar, Wile E.
Heresiarch, Connelly, Giftlite, Rs2, Markus Krötzsch, Spazzm, Seabhcan, BenFrantzDale, Zigger, Everyking, Rpyle731, Wikiwikifast,
Foobar, Edrex, Jabowery, Wildt~enwiki, Wmahan, Neilc, Quadell, Beland, Lylum, Gene s, Sbledsoe, Mozzerati, Karl-Henner, Jmep-
pley, Asbestos, Fintor, AAAAA, Splatty, Rich Farmbrough, Pak21, NeuronExMachina, Michal Jurosz, Pjacobi, Mecanismo, Zarutian,
Dbachmann, Bender235, ZeroOne, Violetriga, Mavhc, One-dimensional Tangent, Gyll, Stephane.magnenat, Mysteronald, .:Ajvol:., Foti-
nakis, Nk, Tritium6, JesseHogan, Mdd, Passw0rd, Zachlipton, Alansohn, Jhertel, Anthony Appleyard, Denoir, Arthena, Fritz Saalfeld,
Sp00n17, Rickyp, Hu, Tyrell turing, Cburnett, Notjim, Drbreznjev, Forderud, Oleg Alexandrov, Mogigoma, Madmardigan53, Justinle-
bar, Olethros, Ylem, Dr.U, Gengiskanhg, Male1979, Bar0n, Waldir, Eslip17, Yoghurt, Ashmoo, Graham87, Qwertyus, Imersion, Gram-
marbot, Rjwilmsi, Jeema, Venullian, SpNeo, Intgr, Predictor, Kri, BradBeattie, Plarroy, Windharp, Mehran.asadi, Commander Nemet,
Wavelength, Borgx, IanManka, Rsrikanth05, Philopedia, Ritchy, David R. Ingham, Grafen, Nrets, Exir Kamalabadi, Deodar~enwiki,
Mosquitopsu, Jpbowen, Dennis!, JulesH, Moe Epsilon, Supten, DeadEyeArrow, Eclipsed, SamuelRiv, Tribaal, Chase me ladies, I'm
the Cavalry, CWenger, Donhalcon, Banus, Shepard, John Broughton, A13ean, SmackBot, PinstripeMonkey, McGeddon, CommodiCast,
Jfmiller28, Stimpy, Commander Keane bot, Feshmania, ToddDeLuca, Diegotorquemada, Patrickdepinguin, KYN, Gilliam, Bluebot, Oli
Filth, Gardoma, Complexica, Nossac, Hongooi, Pdtl, Izhikevich, Trifon Triantafillidis, SeanAhern, Neshatian, Vernedj, Dankonikolic,
Rory096, Sina2, SS2005, Kuru, Plison, Lakinekaki, Bjankuloski06en~enwiki, IronGargoyle, WMod-NS, Dicklyon, Citicat, StanfordPro-
grammer, Ojan, Chi3x10, Aeternus, CapitalR, Atreys, George100, Gveret Tered, Devourer09, SkyWalker, CmdrObot, Leonoel, CBM,
Mcstrother, MarsRover, CX, Arauzo, Peterdjones, Josephorourke, Kozuch, ClydeC, NotQuiteEXPComplete, Irigi, Mbell, Oldiowl, Tolstoy
the Cat, Headbomb, Mitchell.E.Timin, Davidhorman, Sbandrews, KrakatoaKatie, QuiteUnusual, Prolog, AnAj, LinaMishima, Whenning,
Hamaryns, Daytona2, JAnDbot, MER-C, Dcooper, Extropian314, Magioladitis, VoABot II, Amitant, Jimjamjak, SSZ, Robotman1974,
David Eppstein, User A1, Martynas Patasius, Pmbhagat, JaGa, Tuhinsubhrakonar, SoyYo, Nikoladie~enwiki, R'n'B, Maproom, K.menin,
Gill110951, Tarotcards, Plasticup, Margareta, Paskari, Jamesontai, Kiran uvpce, Jamiejoseph, Error9312, Jlaramee, Jeff G., A4bot, Sin-
gleheart, Ebbedc, Lordvolton, Ask123, CanOfWorms, Mundhenk, Wikiisawesome, M karamanov, Enkya, Blumenkraft, Twikir, Mike-
moral, Oldag07, Smsarmad, Flyer22, Janopus, Bwieliczko, Dhatfield, F.j.gaze, Mark Lewis Epstein, S2000magician, PuercoPop, Martar-
ius, ClueBot, Ignacio Javier Igjav, Ahyeek, The Thing That Should Not Be, Fadesga, Zybler, Midiangr, Epsilon60198, Thomas Tvileren,
Wduch, Excirial, Three-quarter-ten, Skbkekas, Chaosdruid, Aprock, Qwfp, Jean-claude perez, Achler, XLinkBot, AgnosticPreachersKid,
BodhisattvaBot, Stickee, Cmr08, Porphyro, Fippy Darkpaw, Addbot, DOI bot, AndrewHZ, Thomblake, Techjerry, Looie496, MrOl-
lie, Transmobilator, Jarble, Yobot, Blm19732008, Nguyengiap84~enwiki, SparkOfCreation, AnomieBOT, DemocraticLuntz, Tryptofish,
Trevithj, Jim1138, Durran65, MockDuck, JonathanWilliford, Materialscientist, Citation bot, Eumolpo, Twri, NFD9001, Isheden, J04n,
Omnipaedista, Mark Schierbecker, RibotBOT, RoodyBeep, Gunjan verma81, FrescoBot, X7q, Ömer Cengiz Çelebi, Outback the koala,
Citation bot 1, Tylor.Sampson, Calmer Waters, Skyerise, Trappist the monk, Krassotkin, Cjlim, Fox Wilson, The Strategist, LilyKitty,
Eparo, ‫בן גרשון‬, Jfmantis, Mehdiabbasi, VernoWhitney, Wiknn, BertSeghers, DASHBot, EmausBot, Nacopt, Dzkd, Racerx11, Japs 88,
GoingBatty, RaoInWiki, Roposeidon, Epsiloner, Stheodor, Benlansdell, Radshashi, K6ka, D'oh!, Thisisentchris87, Aavindraa, Chire,
Glosser.ca, IGeMiNix, Donner60, Yoshua.Bengio, Shinosin, Venkatarun95, ChuckNorrisPwnedYou, Petrb, ClueBot NG, Raghith, Ro-
biminer, Snotbot, Tideflat, Frietjes, Gms3591, Ryansandersuk, Widr, MerlIwBot, Helpful Pixie Bot, Trepier, BG19bot, Thwien, Adams7,
Rahil2000, Michaelmalak, Compfreak7, Kirananils, Altaïr, J.Davis314, Attleboro, Pratyya Ghosh, JoshuSasori, Ferrarisailor, Eugeneche-
ung, Mtschida, ChrisGualtieri, Dave2k6inthemix, Whebzy, APerson, JurgenNL, Oritnk, Stevebillings, Djfrost711, Sa publishers, 㓟, Mark
viking, Markus.harz, Deeper Learning, Vinchaud20, Soueumxm, Toritris, Evolution and evolvability, Sboddhu, Sharva029, Paheld, Putting
things straight, Rosario Berganza, Monkbot, Buggiehuggie, Santoshwriter, Likerhayter, Joma.huguet, Bclark401, Rahulpratapsingh06, Don-
keychee, Michaelwine, Xsantostill, Jorge Guerra Pires, Wfwhitney, Loïc Bourgois, KasparBot and Anonymous: 488
• Deep learning Source: http://en.wikipedia.org/wiki/Deep%20learning?oldid=662755651 Contributors: The Anome, Ed Poor, Michael
Hardy, Meekohi, Glenn, Bearcat, Nandhp, Giraffedata, Jonsafari, Oleg Alexandrov, Justin Ormont, BD2412, Qwertyus, Rjwilmsi,
Kri, Bgwhite, Tomdooner, Bhny, Malcolma, Arthur Rubin, Mebden, SeanAhern, Dicklyon, JHP, ChrisCork, Lfstevens, A3nm,
R'n'B, Like.liberation, Popoki, Jshrager, Strife911, Bfx0, Daniel Hershcovich, Pinkpedaller, Dthomsen8, Addbot, Mamikonyana,
Yobot, AnomieBOT, Jonesey95, Zabbarob, Wyverald, RjwilmsiBot, Larry.europe, Helwr, GoingBatty, Sergey WereWolf, SlowByte,
Yoshua.Bengio, JuyangWeng, Renklauf, Bittenus, Widr, BG19bot, Lukas.tencer, Kareltje63, Synchronist, Gameboy97q, IjonTichyIjon-
Tichy, Mogism, AlwaysCoding, Mark viking, Cagarie, Deeper Learning, Prisx, Wikiyant, Underflow42, GreyShields, Opokopo, Prof.
Oundest, Gigavanti, Sevensharpnine, Yes deeper, Monkbot, Chieftains337, Samueldg89, Nikunj157, Engheta, Aspurdy, Velvel2, Deng629,
Zhuikov, Stergioc, Jerodlycett, DragonbornXXL and Anonymous: 94
• Feature learning Source: http://en.wikipedia.org/wiki/Feature%20learning?oldid=661746836 Contributors: Phil Boswell, Tobias Berge-
mann, Qwertyus, Rjwilmsi, Mcld, Kotabatubara, Dsimic, Yobot, AnomieBOT, BG19bot, Mavroudisv, TonyWang0316, Ixjlyons and
Anonymous: 7
• Unsupervised learning Source: http://en.wikipedia.org/wiki/Unsupervised%20learning?oldid=660135356 Contributors: Michael Hardy,
Kku, Alfio, Ahoerstemeier, Hike395, Ojigiri~enwiki, Gene s, Urhixidur, Alex Kosorukoff, Aaronbrick, Bobo192, 3mta3, Tablizer, De-
noir, Nkour, Qwertyus, Rjwilmsi, Chobot, Roboto de Ajvol, YurikBot, Darker Dreams, Daniel Mietchen, SmackBot, CommodiCast,
Trebor, DHN-bot~enwiki, Lambiam, CRGreathouse, Carstensen, Thijs!bot, Jaxelrod, AnAj, Peteymills, David Eppstein, Agenteseg-
reto, Maheshbest, Timohonkela, Ng.j, EverGreg, Algorithms, Kotsiantis, Auntof6, PixelBot, Edg2103, Addbot, EjsBot, Yobot, Les boys,
AnomieBOT, Salvamoreno, D'ohBot, Skyerise, Ranjan.acharyya, BertSeghers, EmausBot, Fly by Night, Rotcaeroib, Stheodor, Daryakav,
Ida Shaw, Chire, Candace Gillhoolley, WikiMSL, Helpful Pixie Bot, Majidjanz and Anonymous: 40
• Generative model Source: http://en.wikipedia.org/wiki/Generative%20model?oldid=660109222 Contributors: Awaterl, Michael Hardy,
Hike395, Andrewman327, Benwing, Cagri, Rama, Serapio, Jonsafari, Male1979, Qwertyus, Zanetu, Dicklyon, Repied, Barticus88,
RichardSocher~enwiki, Camrn86, YinZhang, Melcombe, Lfriedl, Shanttashjean, Omnipaedista, Geoffrey I Webb, Naxingyu, DanielWa-
terworth, Mekarpeles, ClueBot NG, Gilgoldm, C2oo56, Hestendelin, L T T H U and Anonymous: 14
90 CHAPTER 23. ANDREW NG

• Neural coding Source: http://en.wikipedia.org/wiki/Neural%20coding?oldid=662182422 Contributors: Ed Poor, Apraetor, An-

drewKeenanRichardson, LindsayH, CanisRufus, Kghose, Jheald, Woohookitty, BD2412, Rjwilmsi, RDBrown, Colonies Chris, Henrikhen-
rik, OrphanBot, Radagast83, NickPenguin, Vina-iwbot~enwiki, Clicketyclack, RomanSpa, Sohale, Rji, Andorin, Goodwillein, Anthony-
hcole, Nick Number, Davidm617617, Lova Falk, Addbot, Looie496, Luckas-bot, Yobot, Amirobot, Tryptofish, Citation bot, Xhuo, Sas-
soBot, FrescoBot, Xbcj0843hck3, Albertzeyer, TjeerdB, Jonkerz, Helwr, Bethnim, ZéroBot, Ego White Tray, Bibcode Bot, BG19bot,
FlinZ, ChrisGualtieri, Nicolenewell, Iamozy, Kernsters, Shirindora, Phleg1, Monkbot, PghJesse, Giedroid, AngevineMiller and Anony-
mous: 33
• Word embedding Source: http://en.wikipedia.org/wiki/Word%20embedding?oldid=623976145 Contributors: Qwertyus, Daniel Hersh-
covich, Yobot and Citation bot
• Deep belief network Source: http://en.wikipedia.org/wiki/Deep%20belief%20network?oldid=650454380 Contributors: Glenn, Qwer-
tyus, Kri, Like.liberation, Smuckola, Isthmuses, BG19bot, VivamusAmemus, Schurgast and Anonymous: 1
• Convolutional neural network Source: http://en.wikipedia.org/wiki/Convolutional%20neural%20network?oldid=662811606 Contribu-
tors: Glenn, Phil Boswell, Bearcat, GregorB, BD2412, Rjwilmsi, Mario23, Serg3d2, Mcld, Chris the speller, Frap, Dr.K., Lfstevens, The-
Seven, Like.liberation, Att159, XLinkBot, Yobot, Anypodetos, AnomieBOT, Cugino di mio cugino, Citation bot, Dithridge, Jhbdel, Alvin
Seville, Hobsonlane, RandomDSdevel, Arinelle, Peaceray, Osnetwork, BG19bot, BattyBot, Gameboy97q, APerson, MartinLjungqvist,
Zieglerk, Monkbot, VivamusAmemus, Velvel2, Fight123456, Stri8ted, Gnagyusa, Xksteven and Anonymous: 30
• Restricted Boltzmann machine Source: http://en.wikipedia.org/wiki/Restricted%20Boltzmann%20machine?oldid=650005600 Contrib-
utors: Shd~enwiki, Glenn, Dratman, Qwertyus, Kri, Gareth Jones, Deepdraft, Mcld, Hongooi, Dicklyon, Abhineetnazi, Tomtheebomb,
Dsimic, Arsi Warrior, Yobot, LilHelpa, UbaiSandouk, Gilo1969, Racerx11, Leopd, Enerjiparki, Sevensharpnine, Velvel2, Broido,
Dharmablues and Anonymous: 22
• Recurrent neural network Source: http://en.wikipedia.org/wiki/Recurrent%20neural%20network?oldid=662220310 Contributors:
Glenn, Barak~enwiki, Charles Matthews, Urhixidur, Aaronbrick, Tyrell turing, RJFJR, Alai, Male1979, DavidFarmbrough, Seliopou,
Nehalem, Bhny, Rwalker, Banus, That Guy, From That Show!, SmackBot, Moxon, Yume149~enwiki, Charivari, Kissaki0, MichaelGasser,
Terry Bollinger, Dicklyon, Fyedernoggersnodden, Thijs!bot, Daytona2, Curdeius, JaGa, Jamalex~enwiki, KylieTastic, Itb3d, Gdupont,
Jwray, Headlessplatter, Daniel Hershcovich, DumZiBoT, SlaterDeterminant, Achler, Addbot, DOI bot, LatitudeBot, Roux, Fadyone,
Yobot, AnomieBOT, Flewis, Citation bot, Omnipaedista, FrescoBot, Adrian Lange, Citation bot 1, Dmitry St, Epsiloner, Mister Mormon,
DASHBotAV, ClueBot NG, RichardTowers, MerlIwBot, Helpful Pixie Bot, BG19bot, Frze, Justinyap88, Ivan Ukhov, ‫طاها‬, Djfrost711,
Randykitty, Jmander, Ukpg, Minky76, OhGodItsSoAmazing, Mknjbhvg, Slashdottir, Monkbot, DoubleDr, Jungkanji and Anonymous: 46
• Long short term memory Source: http://en.wikipedia.org/wiki/Long%20short%20term%20memory?oldid=659668087 Contributors:
Michael Hardy, Glenn, Rich Farmbrough, Denoir, Woohookitty, SmackBot, Derek farn, Ninjakannon, Magioladitis, Barkeep, Pwoolf,
Headlessplatter, M4gnum0n, Muhandes, Jncraton, Yobot, Dithridge, Omnipaedista, Albertzeyer, Silenceisgod, Epsiloner, Ego White Tray,
Mister Mormon, Hmainsbot1, Mogism, Velvel2 and Anonymous: 13
• Google Brain Source: http://en.wikipedia.org/wiki/Google%20Brain?oldid=657438840 Contributors: Bearcat, Dfrankow, Vipul, Daranz,
JorisvS, Dicklyon, Aeternus, Diaa abdelmoneim, AnomieBOT, Cnwilliams, Chire, BG19bot, Q6637p, Stephen Balaban, BattyBot, Blharp
and Anonymous: 8
• Google DeepMind Source: http://en.wikipedia.org/wiki/Google%20DeepMind?oldid=654694254 Contributors: Ciphergoth, Bearcat,
Dratman, Bgwhite, Robofish, Aeternus, Edwardx, Yellowdesk, Magioladitis, Touch Of Light, Steel1943, Ciphershort, Wikiisawesome,
Sprachpfleger, Green Cardamom, Larry.europe, BG19bot, IjonTichyIjonTichy, Megab, William 2239, Bipper1024, Jon Jonathan, Har-
risonpop and Anonymous: 13
• Torch (machine learning) Source: http://en.wikipedia.org/wiki/Torch%20(machine%20learning)?oldid=661012919 Contributors: Qw-
ertyus, Strife911, Niceguyedc, Larry.europe, BG19bot, Lor and Anonymous: 6
• Theano (software) Source: http://en.wikipedia.org/wiki/Theano%20(software)?oldid=641284109 Contributors: Bearcat, Qwertyus, An-
dre.holzner, Strife911, Mrocklin, MartinThoma, Turn685 and Monkbot
• Deeplearning4j Source: http://en.wikipedia.org/wiki/Deeplearning4j?oldid=662808219 Contributors: Rwalker, Cydebot, Like.liberation,
NinjaRobotPirate, Daniel Hershcovich, Dawynn, Yobot, Wcherowi and Anonymous: 3
• Gensim Source: http://en.wikipedia.org/wiki/Gensim?oldid=651026231 Contributors: Thorwald, Qwertyus, Dawynn, Matěj Grabovský,
Smk65536, Armbrust, Lightlowemon, Delusion23, Velvel2 and Anonymous: 10
• Geoffrey Hinton Source: http://en.wikipedia.org/wiki/Geoffrey%20Hinton?oldid=660779435 Contributors: Edward, Zeno Gantner,
Rainer Wasserfuhr~enwiki, Flockmeal, MOiRe, Diberri, Risk one, Just Another Dan, Lawrennd, Rich Farmbrough, Davidswelt, Marudub-
shinki, Qwertyus, Rjwilmsi, Winterstein, Misterwindupbird, RussBot, Welsh, Gareth Jones, DaveWF, BorgQueen, InverseHypercube,
CRKingston, Jsusskin, Onkelschark, OrphanBot, Jmlk17, Dl2000, Shoeofdeath, Aeternus, Adam Newbold, Cydebot, Michael Fourman,
Roweis, Waacstats, Destynova, MetsBot, David Eppstein, STBot, FMAFan1990, AlexGreat, Sidcool1234, XLinkBot, Galyet, G7valera,
Addbot, ‫דוד שי‬, Omnipaedista, Plucas58, Morton Shumway, RjwilmsiBot, Larry.europe, Gumbys, Wpeaceout, Onionesque, BG19bot,
ChrisGualtieri, Makecat-bot, Anne Delong, Silas Ropac, Putting things straight, Annaflagg, Justanother109, Jonarnold1985, Velvel2, Math-
ewk1300 and Anonymous: 24
• Yann LeCun Source: http://en.wikipedia.org/wiki/Yann%20LeCun?oldid=662361291 Contributors: Zeno Gantner, Klemen Kocjancic,
Rich Farmbrough, Runner1928, Rschen7754, Bgwhite, RussBot, Gareth Jones, Jpbowen, Angoodkind, Cydebot, Studerby, Bongwarrior,
Waacstats, 72Dino, Aboutmovies, Falcon8765, Profshadoko, Cajunbill, M4gnum0n, Berean Hunter, MystBot, Addbot, Yobot, Bunny-
hop11, FrescoBot, Lbottou, Jvsarun1993, Uprightmanpower, JJRambo, Zzym, Hgsa5, AdjSilly, Polochicken, Lljjp, Crickavery, Justan-
other109, Voltdye, Velvel2, Visionscholar, Mathewk1300, KasparBot, Algorith and Anonymous: 3
• Jürgen Schmidhuber Source: http://en.wikipedia.org/wiki/J%C3%BCrgen%20Schmidhuber?oldid=660776145 Contributors: Michael
Hardy, Kosebamse, Hike395, Charles Matthews, Wik, Randomness~enwiki, Goedelman, Psb777, Newbie~enwiki, Juxi, Robin klein,
Klemen Kocjancic, D6, On you again, Kelly Ramsey, Rpresser, Derumi, Woohookitty, GregorB, Male1979, FlaBot, Bgwhite, RussBot,
SmackBot, Hongooi, Ben Moore, Phoxhat, JimStyle61093475, IDSIAupdate, Mpotse, Blaisorblade, Marek69, A1bb22, Waacstats, JaGa,
Gwern, Kornfan71, R'n'B, DadaNeem, BOTijo, Noveltyghost, Tbsdy lives, Addbot, Lightbot, Yobot, Fleabox, Omnipaedista, Underly-
ing lk, RjwilmsiBot, Angrytoast, Epsiloner, Helpful Pixie Bot, VIAFbot, It's here!, Midnightplunge, Laanaae, Hildensia, KasparBot and
Anonymous: 16
23.7. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES 91

• Jeﬀ Dean (computer scientist) Source: http://en.wikipedia.org/wiki/Jeff%20Dean%20(computer%20scientist)?oldid=657563156 Con-

tributors: Klemen Kocjancic, Lockley, AntonioDsouza, Wavelength, Gareth Jones, Elkman, CommonsDelinker, Aboutmovies, Mr.Z-bot,
Omnipaedista, Patchy1, Iamjamesbond007, BG19bot, Stephen Balaban, Stephenbalaban, Timothy Gu, ChrisGualtieri, Mogism and Anony-
mous: 9
• Andrew Ng Source: http://en.wikipedia.org/wiki/Andrew%20Ng?oldid=659094106 Contributors: Michael Hardy, Msm, Simon Lacoste-
Julien, Prenju, Klemen Kocjancic, Rich Farmbrough, Vipul, Rd232, Mr Tan, Mandarax, Alex Bakharev, Gareth Jones, Pyronite, Johnd-
burger, InverseHypercube, Smallbones, Raysonho, CmdrDan, Cydebot, Batra, Utopiah, Magioladitis, Rootxploit, CommonsDelinker,
Coolg49964, Bcnof, Hoising, Maghnus, XKL, Arbor to SJ, Martarius, Chaosdruid, XLinkBot, Addbot, Mortense, Yobot, Azylber,
AnomieBOT, Chrisvanlang, Omnipaedista, Velblod, FrescoBot, Abductive, Amhey, RjwilmsiBot, Mmm333k, ZéroBot, Bemanna, Clue-
Bot NG, Kashthealien, Geistcj, Rrrlf, ArmbrustBot, Happyhappy001, E8xE8, Gnomy7, Linuxjava, AcrO O, Calisacole, Velvel2, Vision-
scholar, Demdim0, Csisawesome and Anonymous: 41

23.7.2 Images
• File:Ambox_important.svg Source: http://upload.wikimedia.org/wikipedia/commons/b/b4/Ambox_important.svg License: Public do-
main Contributors: Own work, based off of Image:Ambox scales.svg Original artist: Dsmurat (talk · contribs)
• File:Ann_dependency_(graph).svg Source: http://upload.wikimedia.org/wikipedia/commons/d/dd/Ann_dependency_%28graph%29.
svg License: CC BY-SA 3.0 Contributors: Vector version of File:Ann dependency graph.png Original artist: Glosser.ca
• File:Colored_neural_network.svg Source: http://upload.wikimedia.org/wikipedia/commons/4/46/Colored_neural_network.svg Li-
cense: CC BY-SA 3.0 Contributors: Own work, Derivative of File:Artificial neural network.svg Original artist: Glosser.ca
• File:Commons-logo.svg Source: http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg License: ? Contributors: ? Original
artist: ?
• File:Deep_belief_net.svg Source: http://upload.wikimedia.org/wikipedia/commons/f/fa/Deep_belief_net.svg License: CC0 Contributors:
Own work Original artist: Qwertyus
• File:Edit-clear.svg Source: http://upload.wikimedia.org/wikipedia/en/f/f2/Edit-clear.svg License: Public domain Contributors: The
Tango! Desktop Project. Original artist:
The people from the Tango! project. And according to the meta-data in the file, specifically:“Andreas Nilsson, and Jakub Steiner (although
minimally).”
• File:Elman_srnn.png Source: http://upload.wikimedia.org/wikipedia/commons/8/8f/Elman_srnn.png License: CC BY 3.0 Contributors:
Own work Original artist: Fyedernoggersnodden
• File:Emoji_u1f4bb.svg Source: http://upload.wikimedia.org/wikipedia/commons/d/d7/Emoji_u1f4bb.svg License: Apache License 2.0
Contributors: https://code.google.com/p/noto/ Original artist: Google
• File:Fisher_iris_versicolor_sepalwidth.svg Source: http://upload.wikimedia.org/wikipedia/commons/4/40/Fisher_iris_versicolor_
sepalwidth.svg License: CC BY-SA 3.0 Contributors: en:Image:Fisher iris versicolor sepalwidth.png Original artist: en:User:Qwfp (origi-
nal); Pbroks13 (talk) (redraw)
• File:Folder_Hexagonal_Icon.svg Source: http://upload.wikimedia.org/wikipedia/en/4/48/Folder_Hexagonal_Icon.svg License: Cc-by-
sa-3.0 Contributors: ? Original artist: ?
• File:Free_Software_Portal_Logo.svg Source: http://upload.wikimedia.org/wikipedia/commons/3/31/Free_and_open-source_
software_logo_%282009%29.svg License: Public domain Contributors: FOSS Logo.svg Original artist: Free Software Portal Logo.svg
(FOSS Logo.svg): ViperSnake151
• File:Gensim_logo.png Source: http://upload.wikimedia.org/wikipedia/en/b/b1/Gensim_logo.png License: Fair use Contributors:
http://radimrehurek.com/gensim/_static/images/logo-gensim.png Original artist: ?
• File:Internet_map_1024.jpg Source: http://upload.wikimedia.org/wikipedia/commons/d/d2/Internet_map_1024.jpg License: CC BY
2.5 Contributors: Originally from the English Wikipedia; description page is/was here. Original artist: The Opte Project
• File:Lstm_block.svg Source: http://upload.wikimedia.org/wikipedia/en/8/8d/Lstm_block.svg License: PD Contributors:
Headlessplatter (talk) (Uploads) Original artist:
Headlessplatter (talk) (Uploads)
• File:NoisyNeuralResponse.png Source: http://upload.wikimedia.org/wikipedia/en/6/66/NoisyNeuralResponse.png License: PD Contrib-
utors: ? Original artist: ?
• File:Nuvola_apps_arts.svg Source: http://upload.wikimedia.org/wikipedia/commons/e/e2/Nuvola_apps_arts.svg License: GFDL Con-
tributors: Image:Nuvola apps arts.png Original artist: Manco Capac
• File:P_vip.svg Source: http://upload.wikimedia.org/wikipedia/en/6/69/P_vip.svg License: PD Contributors: ? Original artist: ?
• File:People_icon.svg Source: http://upload.wikimedia.org/wikipedia/commons/3/37/People_icon.svg License: CC0 Contributors: Open-
Clipart Original artist: OpenClipart
• File:PopulationCode.svg Source: http://upload.wikimedia.org/wikipedia/commons/a/a1/PopulationCode.svg License: Public domain
Contributors: Image:PopulationCode.png at English Wikipedia Original artist:
• Original: AndrewKeenanRichardson
• File:Portal-puzzle.svg Source: http://upload.wikimedia.org/wikipedia/en/f/fd/Portal-puzzle.svg License: Public domain Contributors: ?
Original artist: ?
• File:Question_book-new.svg Source: http://upload.wikimedia.org/wikipedia/en/9/99/Question_book-new.svg License: Cc-by-sa-3.0
Contributors:
Created from scratch in Adobe Illustrator. Based on Image:Question book.png created by User:Equazcion Original artist:
Tkgd2007
92 CHAPTER 23. ANDREW NG

• File:Recurrent_ann_dependency_graph.png Source: http://upload.wikimedia.org/wikipedia/commons/7/79/Recurrent_ann_

dependency_graph.png License: CC-BY-SA-3.0 Contributors: ? Original artist: ?
• File:Restricted_Boltzmann_machine.svg Source: http://upload.wikimedia.org/wikipedia/commons/e/e8/Restricted_Boltzmann_
machine.svg License: CC BY-SA 3.0 Contributors: Own work Original artist: Qwertyus
• File:Science-symbol-2.svg Source: http://upload.wikimedia.org/wikipedia/commons/7/75/Science-symbol-2.svg License: CC BY 3.0
Contributors: en:Image:Science-symbol2.png Original artist: en:User:AllyUnion, User:Stannered
• File:Synapse_deployment.jpg Source: http://upload.wikimedia.org/wikipedia/en/2/22/Synapse_deployment.jpg License: CC-BY-SA-
2.5 Contributors: ? Original artist: ?
• File:Text_document_with_red_question_mark.svg Source: http://upload.wikimedia.org/wikipedia/commons/a/a4/Text_document_
with_red_question_mark.svg License: Public domain Contributors: Created by bdesham with Inkscape; based upon Text-x-generic.svg
from the Tango project. Original artist: Benjamin D. Esham (bdesham)
• File:Torch_2014_logo.png Source: http://upload.wikimedia.org/wikipedia/en/f/f5/Torch_2014_logo.png License: Fair use Contributors:
https://github.com/torch Original artist: ?
• File:Wave.svg Source: http://upload.wikimedia.org/wikipedia/commons/4/40/Wave.svg License: BSD Contributors: http://duke.kenai.
com/wave/index.html (new), https://duke.dev.java.net/images/wave/index.html (old) Original artist: sbmehta converted to SVG from Sun
Microsystems AI version.
• File:Wikibooks-logo-en-noslogan.svg Source: http://upload.wikimedia.org/wikipedia/commons/d/df/Wikibooks-logo-en-noslogan.
svg License: CC BY-SA 3.0 Contributors: Own work Original artist: User:Bastique, User:Ramac et al.
• File:Wikiversity-logo.svg Source: http://upload.wikimedia.org/wikipedia/commons/9/91/Wikiversity-logo.svg License: CC BY-SA 3.0
Contributors: Snorky (optimized and cleaned up by verdy_p) Original artist: Snorky (optimized and cleaned up by verdy_p)

23.7.3 Content license

• Creative Commons Attribution-Share Alike 3.0

Machine Learning Systems
No ratings yet
Machine Learning Systems
2,048 pages
Machine Learning Systems
No ratings yet
Machine Learning Systems
1,748 pages
Machine Learning Systems: Vĳay Janapa Reddi
No ratings yet
Machine Learning Systems: Vĳay Janapa Reddi
1,474 pages
CS230
No ratings yet
CS230
101 pages
Machine Learning Systems
No ratings yet
Machine Learning Systems
300 pages
Deep Learning With PyTorch Step-by-Step
No ratings yet
Deep Learning With PyTorch Step-by-Step
136 pages
Deep Learning Cours
No ratings yet
Deep Learning Cours
165 pages
A Seminar Report On NEURAL NETWORK PDF
No ratings yet
A Seminar Report On NEURAL NETWORK PDF
26 pages
Deep Learning A Z PDF
100% (8)
Deep Learning A Z PDF
799 pages
Machine Learning With Python
100% (15)
Machine Learning With Python
692 pages
Daniel Voigt Godoy - Deep Learning With PyTorch Step-By-Step A Beginner's Guide-Leanpub - Com (2022)
100% (1)
Daniel Voigt Godoy - Deep Learning With PyTorch Step-By-Step A Beginner's Guide-Leanpub - Com (2022)
1,045 pages
Deep Learning - Roy Keyes
100% (1)
Deep Learning - Roy Keyes
163 pages
Notions de Deep Learning
No ratings yet
Notions de Deep Learning
116 pages
Artificial Intelligence With Python (Machine Learning Foundations, Methodologies, and Applications) (Teik Toe Teoh, Zheng Rong)
94% (16)
Artificial Intelligence With Python (Machine Learning Foundations, Methodologies, and Applications) (Teik Toe Teoh, Zheng Rong)
334 pages
A Brief Guide To Artificial Intelligence Tutorial Introductions James V Stone Full Digital Chapters
100% (1)
A Brief Guide To Artificial Intelligence Tutorial Introductions James V Stone Full Digital Chapters
77 pages
Deep Learning With Python
100% (8)
Deep Learning With Python
396 pages
Ali Thesis
No ratings yet
Ali Thesis
125 pages
Generating Arabic Letters Using Generative
No ratings yet
Generating Arabic Letters Using Generative
63 pages
Future Proof Yourself-An AI Era Survival Guide
No ratings yet
Future Proof Yourself-An AI Era Survival Guide
259 pages
A Guide To Convolutional Neural Networks
100% (2)
A Guide To Convolutional Neural Networks
209 pages
Sergios Karagiannakos - Deep Learning in Production (2022) - Libgen - Li
No ratings yet
Sergios Karagiannakos - Deep Learning in Production (2022) - Libgen - Li
223 pages
Full Course of Machine Learning
100% (16)
Full Course of Machine Learning
660 pages
The Python Bible
97% (31)
The Python Bible
506 pages
d2l en PDF
No ratings yet
d2l en PDF
635 pages
ArtificialNeuralNetworks Mku PDF
No ratings yet
ArtificialNeuralNetworks Mku PDF
206 pages
Introduction To Deep Learning - With Complexe Python and TensorFlow Examples - Jürgen Brauer PDF
No ratings yet
Introduction To Deep Learning - With Complexe Python and TensorFlow Examples - Jürgen Brauer PDF
245 pages
d2l en PDF
100% (1)
d2l en PDF
670 pages
Hackers Guide To Machine Learning With Python PDF
100% (15)
Hackers Guide To Machine Learning With Python PDF
272 pages
Improving Aircraft Performance Using Machine Learning
No ratings yet
Improving Aircraft Performance Using Machine Learning
28 pages
Machine Learning and Deep Learning Techniques Used in Cybersecurity and Digital Forensics: A Review
No ratings yet
Machine Learning and Deep Learning Techniques Used in Cybersecurity and Digital Forensics: A Review
91 pages
Neural Networks Lecture Notes
No ratings yet
Neural Networks Lecture Notes
231 pages
Dive Into Deep Learning - D2l-En
100% (1)
Dive Into Deep Learning - D2l-En
660 pages
Deep RL Overview for AI Researchers
No ratings yet
Deep RL Overview for AI Researchers
150 pages
Deep Learning-Based Structural Health Monitoring
No ratings yet
Deep Learning-Based Structural Health Monitoring
38 pages
Free and Open Machine Learning
No ratings yet
Free and Open Machine Learning
143 pages
Python Data Science
100% (1)
Python Data Science
173 pages
Deep Learning Guide: Concepts & Implementation
100% (1)
Deep Learning Guide: Concepts & Implementation
658 pages
Safari
No ratings yet
Safari
97 pages
Introduction To Machine Learning - Wikipedia
No ratings yet
Introduction To Machine Learning - Wikipedia
456 pages
Deep Learning Notes Andrew NG
100% (1)
Deep Learning Notes Andrew NG
54 pages
Lecture Notes - Machine Learning For The Sciences
No ratings yet
Lecture Notes - Machine Learning For The Sciences
84 pages
Deep Learning For Medical Image Analysis 1st Edition S. Kevin Zhou
No ratings yet
Deep Learning For Medical Image Analysis 1st Edition S. Kevin Zhou
62 pages
An To Neural Networks Ben Krose Patrick Van Der Smagt
No ratings yet
An To Neural Networks Ben Krose Patrick Van Der Smagt
9 pages
Understanding Machine Learning
100% (71)
Understanding Machine Learning
416 pages
d2l en
No ratings yet
d2l en
505 pages
Deep Learning Notes (1) 2
No ratings yet
Deep Learning Notes (1) 2
54 pages
Transformers For Machine Learning A Deep Dive (Uday Kamath, Kenneth L. Graham, Wael Emara)
100% (12)
Transformers For Machine Learning A Deep Dive (Uday Kamath, Kenneth L. Graham, Wael Emara)
284 pages
Chap 2
No ratings yet
Chap 2
49 pages
Data Structure and Algorithms With Python
100% (15)
Data Structure and Algorithms With Python
369 pages
Table of Content: (Page Numbers in PDF File)
No ratings yet
Table of Content: (Page Numbers in PDF File)
223 pages
Rapport
No ratings yet
Rapport
106 pages
Deep Learning Guide with Gluon
100% (1)
Deep Learning Guide with Gluon
633 pages
Deep Learning Book Part1
No ratings yet
Deep Learning Book Part1
100 pages
DL Deeplearningbook PDF
No ratings yet
DL Deeplearningbook PDF
10 pages
Wiki Book 4th Industrial Revolution
No ratings yet
Wiki Book 4th Industrial Revolution
184 pages
Seminar Report cnn1
No ratings yet
Seminar Report cnn1
23 pages
Machine Learning Cheat Sheet HCMUT K
No ratings yet
Machine Learning Cheat Sheet HCMUT K
34 pages
Python Machine Learning For Beginners Ebook Final
100% (11)
Python Machine Learning For Beginners Ebook Final
305 pages
Deep Learning Library Guide
No ratings yet
Deep Learning Library Guide
110 pages
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
100% (10)
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
168 pages
Lecture Notes: Introduction To Machine Learning For The Sciences
No ratings yet
Lecture Notes: Introduction To Machine Learning For The Sciences
80 pages
Learning The Pandas Library Python Tools For Data Munging Analysis and Visual PDF
100% (18)
Learning The Pandas Library Python Tools For Data Munging Analysis and Visual PDF
208 pages
Learning Statistics
100% (29)
Learning Statistics
408 pages
Machine Learning General Concepts
100% (4)
Machine Learning General Concepts
80 pages
Natural Language Processing With PyTorch - Build Intelligent Language Applications Using Deep Learning PDF
100% (15)
Natural Language Processing With PyTorch - Build Intelligent Language Applications Using Deep Learning PDF
210 pages
The Little Book of Deep Learning François Fleuret Online Version
No ratings yet
The Little Book of Deep Learning François Fleuret Online Version
65 pages
(Probability and Statistics For Programmers) Allen Downey - Think Stats. Probability and Statistics For programmers-O'Reilly Media (2012) PDF
100% (12)
(Probability and Statistics For Programmers) Allen Downey - Think Stats. Probability and Statistics For programmers-O'Reilly Media (2012) PDF
142 pages
BBBB
No ratings yet
BBBB
8 pages
Deep Learning in Computer Vision - Principles and Applications
100% (3)
Deep Learning in Computer Vision - Principles and Applications
339 pages
Table of Contents
No ratings yet
Table of Contents
9 pages
Neural Networks and Deep Learning
No ratings yet
Neural Networks and Deep Learning
22 pages
Numpy Python Cheat Sheet
No ratings yet
Numpy Python Cheat Sheet
1 page
Hands On Machine Learning With Python Concepts and Applications For Beginners - John Anderson 2018
91% (11)
Hands On Machine Learning With Python Concepts and Applications For Beginners - John Anderson 2018
166 pages
Advance Statistical Methods in Data Science Chen
100% (6)
Advance Statistical Methods in Data Science Chen
229 pages
Machine Learning - An Applied Mathematics Introduction PDF
100% (13)
Machine Learning - An Applied Mathematics Introduction PDF
246 pages
Machine Learning Projects in Python
100% (16)
Machine Learning Projects in Python
135 pages
TensorFlow For Machine Intelligence
100% (27)
TensorFlow For Machine Intelligence
305 pages
Foundations of Computer Vision
88% (8)
Foundations of Computer Vision
443 pages
(Studies in Computational Intelligence) Witold Pedrycz, Shyi-Ming Chen - Deep Learning - Algorithms and Applications-Springer (2020)
100% (7)
(Studies in Computational Intelligence) Witold Pedrycz, Shyi-Ming Chen - Deep Learning - Algorithms and Applications-Springer (2020)
368 pages
Machine Learning
100% (11)
Machine Learning
135 pages
Deep Learning With PyTorch Guide For Beginners and Intermediate
100% (7)
Deep Learning With PyTorch Guide For Beginners and Intermediate
120 pages
Tensorflow 2 Tutorial PDF
100% (4)
Tensorflow 2 Tutorial PDF
66 pages
Machine Learning Masterclass
100% (11)
Machine Learning Masterclass
108 pages
Tensorflow Tutorial PDF
100% (6)
Tensorflow Tutorial PDF
90 pages