Deep Learning with R
Abhijit Ghatak
Kolkata, India
ISBN 978-981-13-5849-4 ISBN 978-981-13-5850-0 (eBook)
https://doi.org/10.1007/978-981-13-5850-0
Library of Congress Control Number: 2019933713
© Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
I dedicate this book to the deep learning
fraternity at large, who are trying their best
to get systems to reason over long time
horizons.
Preface
Artificial Intelligence
The term ‘Artificial Intelligence’ (AI) was coined by John McCarthy in 1956, but
the journey to understand whether machines can truly think began well before that.
Vannevar Bush [1], in his seminal work As We May Think,1 proposed a system
that amplifies people’s own knowledge and understanding.
Alan Turing was a pioneer in bringing AI from the realm of philosophical
speculation to reality. He wrote a paper on the notion of machines being able to
simulate human beings and do intelligent things. He also realized, in the 1950s,
that we would need a much greater understanding of human intelligence before
we could hope to build machines that would “think” like humans. His 1950 paper,
“Computing Machinery and Intelligence” (published in a philosophical journal
called Mind), opened the doors to the field that would be called AI, long before
the term was actually adopted. The paper defined what would become known as
the Turing test,2 a model for measuring “intelligence.”
Significant AI breakthroughs have been promised “in the next 10 years,” for the
past 60 years. One of the proponents of AI, Marvin Minsky, claimed in
1967—“Within a generation …, the problem of creating “artificial intelligence” will
substantially be solved,” and in 1970, he quantified his earlier prediction by
stating—“In from three to eight years we will have a machine with the general
intelligence of a human being.”
In the 1960s and early 1970s, several other experts believed AI to be right around
the corner. When it did not materialize, funding dried up and research activity
declined, resulting in what we now call the first AI winter.
During the 1980s, interest in an approach to AI known as expert systems started
gathering momentum, and a significant amount of money was spent on research
and development. By the beginning of the 1990s, owing to the limited scope of
expert systems, interest waned, resulting in the second AI winter. Somehow, it
appeared that expectations in AI always outpaced the results.

1 https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/.
2 https://www.turing.org.uk/scrapbook/test.html.
Evolution of Expert Systems to Machine Learning
An expert system (ES) is a program designed to solve problems in a specific
domain in place of a human expert. By mimicking the thinking of human experts,
the expert system was envisaged to analyze information and make decisions.
The knowledge base of an ES contains both factual knowledge and heuristic
knowledge. The ES inference engine was supposed to provide a methodology for
reasoning over the information present in the knowledge base. Its goal was to come
up with a recommendation; to do so, it combined the facts of a specific case (input
data) with the knowledge contained in the knowledge base (rules), resulting in a
particular recommendation (answers).
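To make the idea concrete, here is a minimal, hypothetical sketch in R (the facts and rules are invented for illustration and are not taken from any actual expert system): hand-written rules are applied to the facts of a case to produce a recommendation.

# Facts of a specific case (input data); the values are hypothetical.
facts <- list(temperature = 39.2, cough = TRUE)

# Hand-crafted rules (the knowledge base); each rule returns a
# recommendation when its condition matches, and NULL otherwise.
rules <- list(
  function(f) if (f$temperature > 38 && f$cough) "suspect influenza",
  function(f) if (f$temperature <= 38) "no fever detected"
)

# A toy inference engine: apply every rule to the facts and keep
# only the conclusions that fired.
recommendations <- Filter(Negate(is.null), lapply(rules, function(r) r(facts)))
print(recommendations)

The point of the sketch is that every rule must be written by a human expert; nothing is learned from data, which is exactly the limitation discussed next.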
Though an ES was suitable for solving some well-defined logical problems, it
proved inadequate for other types of complex problems like image classification
and natural language processing (NLP). As a result, expert systems did not live up
to expectations, prompting a shift from the rule-based approach to a data-driven
approach. This paved the way to a new era in AI: machine learning.
Research over the past 60 years has resulted in significant advances in search
algorithms and machine learning algorithms, and in integrating statistical analysis
to understand the world at large.
In machine learning, the system is trained rather than explicitly programmed
(unlike an ES). By exposing large quantities of known facts (input data and
answers) to a learning mechanism and performing tuning sessions, we get a system
that can make predictions or classifications on unseen input data. It does this by
discovering the statistical structure of the input data (and the answers) and coming
up with rules for automating the task.
Starting in the 1990s, machine learning quickly became the most popular
subfield of AI. This trend has also been driven by the availability of faster
computing hardware and of large, diverse data sets.
A machine learning algorithm transforms its input data into meaningful outputs
by learning representations of the data. Representations are transformations of the
input data that bring it closer to the expected output. “Learning,” in the context of
machine learning, is an automatic search process for better representations of data.
Machine learning algorithms find these representations by searching through a
predefined set of operations.
To summarize, machine learning is a search for useful representations of the
input data within a predefined space, using the loss function (the difference between
the actual output and the estimated output) as feedback to modify the parameters
of the model.
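As an illustration only (the data and the one-parameter model are made up, and the book develops these ideas properly in Chaps. 1 and 2), here is a minimal R sketch of that feedback loop: a squared-error loss is computed on known answers, and its gradient is used to nudge the parameter of a linear model.

# Known facts: inputs x and answers y generated by y = 3x (plus noise).
set.seed(1)
x <- runif(50)
y <- 3 * x + rnorm(50, sd = 0.1)

w  <- 0     # the model parameter we are searching for
lr <- 0.1   # learning rate

for (i in 1:200) {
  y_hat <- w * x                       # estimated output
  loss  <- mean((y - y_hat)^2)         # squared-error loss (the feedback)
  grad  <- mean(-2 * x * (y - y_hat))  # gradient of the loss w.r.t. w
  w     <- w - lr * grad               # modify the parameter using the feedback
}
w   # ends up close to 3

The loop never encodes the rule y = 3x explicitly; it recovers it from the data by repeatedly using the loss as feedback.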
Machine Learning and Deep Learning
It turns out that machine learning focuses on learning only one or two layers of
representations of the input data. This proved intractable for human perception
problems like image classification, speech recognition, handwriting transcription,
etc. It therefore gave way to a new take on learning representations, which puts
an emphasis on learning multiple successive layers of representations, resulting in
deep learning. The word deep in deep learning refers only to the number of
successive layers used in a deep learning model.
In deep learning, we deal with layers. A layer is a data transformation function:
it transforms the data that passes through it. These transformations are
parametrized by a set of weights and biases, which determine the transformation
behavior at that layer.
Deep learning is a specific subfield of machine learning, which makes use of
tens/hundreds of successive layers of representations. The specification of what a
layer does to its input is stored in the layer’s parameters. Learning in deep learning
can also be defined as finding a set of values for the parameters of each layer of a
deep learning model, which will result in the appropriate mapping of the inputs to
the associated answers (outputs).
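To make this concrete, here is a minimal sketch in base R (the dimensions, the random weights, and the sigmoid activation are chosen purely for illustration; Chap. 2 builds real networks): each layer transforms its input using its own weights and biases, and learning means finding good values for these parameters.

sigmoid <- function(z) 1 / (1 + exp(-z))

x <- matrix(c(0.2, 0.7, 0.1), ncol = 1)    # one input with three features

# Parameters of layer 1 (weights and biases); in practice these are learned.
W1 <- matrix(rnorm(4 * 3), nrow = 4); b1 <- rep(0, 4)
# Parameters of layer 2.
W2 <- matrix(rnorm(1 * 4), nrow = 1); b2 <- 0

a1 <- sigmoid(W1 %*% x + b1)    # first layer of representation
a2 <- sigmoid(W2 %*% a1 + b2)   # second layer maps it to the output
a2

Each additional layer simply repeats the same pattern with its own weight matrix and bias vector.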
Deep learning has proven to be better than conventional machine learning
algorithms for these “perceptual” tasks, but it has not yet been proven to be better
across all other domains.
Applications and Research in Deep Learning
Deep learning has been gaining traction in many fields, some of which are listed
below. Although most of the work to date is proof-of-concept (PoC), some of the
results have actually provided new physical insight.
• Engineering—Signal processing techniques using traditional machine learning
exploit shallow architectures often containing a single layer of nonlinear feature
transformation. Examples of shallow architecture models are conventional
hidden Markov models (HMMs), linear or nonlinear dynamical systems,
conditional random fields (CRFs), maximum entropy (MaxEnt) models, support
vector machines (SVMs), kernel regression, multilayer perceptron (MLP) with a
single hidden layer, etc. Signal processing using machine learning also depends
a lot on handcrafted features. Deep learning can help in obtaining task-specific
feature representations, in learning how to deal with noise in the signal, and in
working with long-term sequential behavior. Vision and speech signals require
deep architectures for extracting complex structures, and deep learning can
provide the necessary architecture. Specific signal processing areas where deep
learning is being applied are speech/audio, image/video, language processing,
and information retrieval. All this can be improved with better feature extraction
at every layer, more powerful discriminative optimization techniques, and more
advanced architectures for modeling sequential data.
• Neuroscience—Cutting-edge research in human neuroscience using deep
learning is already happening. The cortical activity of “imagination” is being
studied to unveil the computational and systems-level mechanisms that underpin
the phenomenon of human imagination. Deep learning is being used to understand
certain neurophysiological phenomena, such as the firing properties of dopamine
neurons in the mammalian basal ganglia (a group of subcortical nuclei, of varied
origin, in the brains of vertebrates including humans, that are associated with a
variety of functions like eye movements, emotion, cognition, and the control of
voluntary motor movements). A growing community of researchers is working to
distill intelligence into algorithms so that they incrementally mimic the human
brain.
• Oncology—Cancer is the second leading health-related cause of death in the
world. Early detection of cancer increases the probability of survival by nearly
10 times, and deep learning has demonstrated capabilities in achieving higher
diagnostic accuracy with respect to many domain experts. Cancer detection from
gene expression data is challenging due to its high dimensionality and complexity.
Researchers have developed DeepGene,3 which is an advanced cancer
classifier based on deep learning. It addresses the obstacles in existing somatic
point mutation-based cancer classification (SMCC) studies, and the results
outperform three widely adopted existing classifiers. Google’s CNN system4 has
demonstrated the ability to identify deadly skin cancers at an accuracy rate on
a par with practitioners. Shanghai University has developed a deep learning
system that can accurately differentiate between benign and malignant breast
tumors on ultrasound shear wave elastography (SWE), yielding more than 93%
accuracy on the elastogram images of more than 200 patients.5
• Physics—Conseil Européen pour la Recherche Nucléaire (CERN) at Geneva
handles multiple petabytes of data per day during a single run of the Large
Hadron Collider (LHC). Protons or ions are collided in the LHC, and each
collision is recorded. Every collision creates particles, such as a Higgs boson, a
pair of top quarks, or some mini black holes, which leave a trailing signature.
Deep learning is being used to classify and interpret these signatures.
• Astrophysics—Deep learning is being extensively used to classify galaxy
morphologies.6
3 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1334-9.
4 https://www.nature.com/articles/nature21056.
5 https://www.umbjournal.org/article/S0301-5629(17)30002-9/abstract.
6 https://arxiv.org/abs/0908.2033.
• Natural Language Processing—The number of deep learning research papers
in NLP has been rising since 2012 (see Fig. 1), as reflected in the paper titled
Recent Trends in Deep Learning Based Natural Language Processing by Young
et al.
• Near human-level proficiency has been achieved in (a) speech recognition,
(b) image recognition, (c) handwriting transcription, and (d) autonomous driving.
Moreover, super-human-level performance has been achieved by AlphaGo (built
by Google) when it defeated the world's best player, Lee Sedol, at Go.
Fig. 1 Percentage of deep learning papers submitted for various conferences—Association for
Computational Linguistics (ACL), Conference on Empirical Methods in Natural Language
Processing (EMNLP), European Chapter of the Association for Computational Linguistics
(EACL), North American Chapter of the Association for Computational Linguistics (NAACL),
over the 6 years preceding 2018. [2]
Intended Audience
This book has been written to address a wide spectrum of learners:
• For the beginner, this book will be useful to understand the basic concepts of
machine/deep learning and the neural network architecture (Chaps. 1 and 2)
before moving on to the advanced concepts of deep learning.
• For the graduate student, this book will help the reader understand the behavior
of different types of neural networks by understanding the concepts, while
building them up from scratch. It will also introduce the reader to relevant
research papers for further exploration.
• For the data scientist who is familiar with the underlying principles of machine
learning, this book will provide a practical understanding of deep learning.
• For the deep learning enthusiast, this book will explain the deep learning
architecture and what goes on inside a neural network model.
An intermediate level of R programming knowledge is expected of the reader,
and no previous experience of deep learning is assumed.
Kolkata, India Abhijit Ghatak
Acknowledgements
Acknowledgment is an unsatisfactory word for my deepest debts.
My father bequeathed to me a love for adventure and an interest in history,
literature, and mathematics.
My professors at the Faculty of Mechanical Engineering, Jadavpur University,
instilled an appetite for analysis and quantitative techniques in engineering; my
mentor and advisor at University of Pune, Prof. SY Bhave, motivated me to
interpret the algorithm and write a program using the C language on predicting
torsional vibration failures of a marine propulsion shaft using state vectors; and my
advisors at Stevens Institute of Technology helped me transition from a career as a
submarine engineer in the Indian Navy to that of a data scientist.
My wife Sushmita lived through the slow gestation of this book. She listened
and engaged with me all the way. She saw potential in this work long before I did
and encouraged me to keep going.
I owe my thanks to Sunanda for painstakingly proofreading the manuscript.
I also have two old debts: Robert Louis Stevenson and Arthur Conan
Doyle. In Treasure Island, Captain Smollett is most eager to discover the treasure
and says, “We must go on,” and in A Case of Identity, Sherlock Holmes states, “It
has long been an axiom of mine that the little things are infinitely the most
important.” Both are profound statements in the realm of a new science, and the
litterateurs had inked their thoughts claiming no distinction, when there is no
distinction between the nature of the pursuits.
I owe all of them my deepest debts.
Abhijit Ghatak
About This Book
• Deep learning is a growing area of interest to academia and industry alike. The
applications of deep learning range from medical diagnostics and robotics to
security and surveillance, computer vision, natural language processing, and
autonomous driving. This has been made possible largely by a confluence of
research activity around the subject and the emergence of APIs like Keras.
• This book is a sequel to Machine Learning with R, written by the same author,
and explains deep learning from first principles—how to construct different
neural network architectures and understand the hyperparameters of the neural
network and the need for various optimization algorithms. The theory and the
math are explained in detail before discussing the code in R. The different
functions are finally merged to create a customized deep learning application. It
also introduces the reader to the Keras and TensorFlow libraries in R and explains
the advantage of using these libraries to get a basic model up and running.
• This book builds on the understanding of deep learning to create R-based
applications in computer vision, natural language processing, and transfer learning.
This book has been written to address a wide spectrum of learners:
• For the beginner, this book will be useful to understand the basic concepts of
machine/deep learning and the neural network architecture (Chaps. 1 and 2)
before moving on to the advanced concepts of deep learning.
• For the graduate student, this book will help the reader to understand the
behavior of different types of neural networks by understanding the concepts,
while building them up from scratch. It will also introduce the reader to relevant
research papers for further exploration.
• For the data scientist who is familiar with the underlying principles of machine
learning, this book will provide a practical understanding of deep learning.
• For the deep learning enthusiast, this book will explain the deep learning
architecture and what goes on inside a neural network model.
This book requires an intermediate level of skill in R and no previous experience
of deep learning.
Contents
1 Introduction to Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Difference Between Machine Learning and Statistics . . . . . 2
1.1.2 Difference Between Machine Learning and Deep
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Bias and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Bias–Variance Trade-off in Machine Learning . . . . . . . . . . . . . . 4
1.4 Addressing Bias and Variance in the Model . . . . . . . . . . . . . . . . 5
1.5 Underfitting and Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.9 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.9.1 Searching for Hyperparameters . . . . . . . . . . . . . . . . . . . 11
1.10 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . 12
1.11 Quantifying Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.11.1 The Cross-Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . 14
1.11.2 Negative Log-Likelihood . . . . . . . . . . . . . . . . . . . . . . . 15
1.11.3 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.11.4 Cross-Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.11.5 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . 19
1.11.6 Summarizing the Measurement of Loss . . . . . . . . . . . . . 20
1.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Introduction to Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Types of Neural Network Architectures . . . . . . . . . . . . . . . . . . . 25
2.2.1 Feedforward Neural Networks (FFNNs) . . . . . . . . . . . . 25
2.2.2 Convolutional Neural Networks (ConvNets) . . . . . . . . . 25
2.2.3 Recurrent Neural Networks (RNNs) . . . . . . . . . . . . . . . 25
2.3 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Input Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.3 Bias Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.4 Weight Matrix of Layer-1 . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.5 Activation Function at Layer-1 . . . . . . . . . . . . . . . . . . . 30
2.3.6 Weights Matrix of Layer-2 . . . . . . . . . . . . . . . . . . . . . . 30
2.3.7 Activation Function at Layer-2 . . . . . . . . . . . . . . . . . . . 32
2.3.8 Output Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.9 Summary of Forward Propagation . . . . . . . . . . . . . . . . . 34
2.4 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.2 Hyperbolic Tangent . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.3 Rectified Linear Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.4 Leaky Rectified Linear Unit . . . . . . . . . . . . . . . . . . . . . 38
2.4.5 Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Derivatives of Activation Functions . . . . . . . . . . . . . . . . . . . . . . 42
2.5.1 Derivative of Sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2 Derivative of tanh . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 Derivative of Rectified Linear Unit . . . . . . . . . . . . . . . . 44
2.5.4 Derivative of Leaky Rectified Linear Unit . . . . . . . . . . . 44
2.5.5 Derivative of Softmax . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 Cross-Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7 Derivative of the Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7.1 Derivative of Cross-Entropy Loss with Sigmoid . . . . . . . 49
2.7.2 Derivative of Cross-Entropy Loss with Softmax . . . . . . . 49
2.8 Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.8.1 Summary of Backward Propagation . . . . . . . . . . . . . . . . 53
2.9 Writing a Simple Neural Network Application . . . . . . . . . . . . . . 54
2.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3 Deep Neural Networks-I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.1 Writing a Deep Neural Network (DNN) Algorithm . . . . . . . . . . . 65
3.2 Overview of Packages for Deep Learning in R . . . . . . . . . . . . . . 80
3.3 Introduction to keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.1 Installing keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.2 Pipe Operator in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.3 Defining a keras Model . . . . . . . . . . . . . . . . . . . . . . . 81
3.3.4 Configuring the keras Model . . . . . . . . . . . . . . . . . . . 81
3.3.5 Compile and Fit the Model . . . . . . . . . . . . . . . . . . . . . . 82
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4 Initialization of Network Parameters . . . . . . . . . . . . . . . . . . . . . . . . 87
4.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.1.1 Breaking Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.2 Zero Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.3 Random Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.4 Xavier Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.5 He Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Dealing with NaNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2.1 Hyperparameters and Weight Initialization . . . . . . . . . . . 100
4.2.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2.3 Using Different Activation Functions . . . . . . . . . . . . . . . 101
4.2.4 Use of NanGuardMode, DebugMode,
or MonitorMode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2.5 Numerical Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2.6 Algorithm Related . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2.7 NaN Introduced by AllocEmpty . . . . . . . . . . . . . . . . . . 101
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.1 Gradient Descent or Batch Gradient Descent . . . . . . . . . 104
5.2.2 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 105
5.2.3 Mini-Batch Gradient Descent . . . . . . . . . . . . . . . . . . . . 105
5.3 Parameter Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.1 Simple Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.2 Momentum Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.3 Nesterov Momentum Update . . . . . . . . . . . . . . . . . . . . 109
5.3.4 Annealing the Learning Rate . . . . . . . . . . . . . . . . . . . . . 110
5.3.5 Second-Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.6 Per-Parameter Adaptive Learning Rate Methods . . . . . . . 112
5.4 Vanishing Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5.1 Dropout Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5.2 ℓ2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5.3 Combining Dropout and ℓ2 Regularization? . . . . . . . . . . 144
5.6 Gradient Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6 Deep Neural Networks-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.1 Revisiting DNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 Modeling Using keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.1 Adjust Epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.2.2 Add Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . 159
6.2.3 Add Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.2.4 Add Weight Regularization . . . . . . . . . . . . . . . . . . . . . . 161
6.2.5 Adjust Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.2.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3 Introduction to TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.3.1 What is Tensor ‘Flow’? . . . . . . . . . . . . . . . . . . . . . . . 165
6.3.2 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.3.3 Installing and Running TensorFlow . . . . . . . . . . . . . . 166
6.4 Modeling Using TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.4.1 Importing MNIST Data Set from TensorFlow . . . . . . 167
6.4.2 Define Placeholders . . . . . . . . . . . . . . . . . . . . . . . 168
6.4.3 Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.4.4 Instantiating a Session and Running the Model . . . . . 169
6.4.5 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7 Convolutional Neural Networks (ConvNets) . . . . . . . . . . . . . . . . . . . 171
7.1 Building Blocks of a Convolution Operation . . . . . . . . . . . . . . . 171
7.1.1 What is a Convolution Operation? . . . . . . . . . . . . . . . . 171
7.1.2 Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.1.3 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.1.4 Strided Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.1.5 Convolutions over Volume . . . . . . . . . . . . . . . . . . . . . . 177
7.1.6 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.2 Single-Layer Convolutional Network . . . . . . . . . . . . . . . . . . . . . 180
7.2.1 Writing a ConvNet Application . . . . . . . . . . . . . . . . . . . 181
7.3 Training a ConvNet on a Small DataSet Using keras . . . . . . . . 186
7.3.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.4 Specialized Neural Network Architectures . . . . . . . . . . . . . . . . . 193
7.4.1 LeNet-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.4.2 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.4.3 VGG-16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.4.4 GoogleNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.4.5 Transfer Learning or Using Pretrained Models . . . . . . . . 196
7.4.6 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.5 What is the ConvNet Learning? A Visualization of Different
Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.6 Introduction to Neural Style Transfer . . . . . . . . . . . . . . . . . . . . . 203
7.6.1 Content Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.6.2 Style Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.6.3 Generating Art Using Neural Style Transfer . . . . . . . . . . 204
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
8 Recurrent Neural Networks (RNN) or Sequence Models . . . . . . . . . 207
8.1 Sequence Models or RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2 Applications of Sequence Models . . . . . . . . . . . . . . . . . . . . . . . 209
8.3 Sequence Model Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.4 Writing the Basic Sequence Model Architecture . . . . . . . . . . . . . 210
8.4.1 Backpropagation in Basic RNN . . . . . . . . . . . . . . . . . . 212
8.5 Long Short-Term Memory (LSTM) Models . . . . . . . . . . . . . . . . 215
8.5.1 The Problem with Sequence Models . . . . . . . . . . . . . . . 215
8.5.2 Walking Through LSTM . . . . . . . . . . . . . . . . . . . . . . . 216
8.6 Writing the LSTM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 217
8.7 Text Generation with LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.7.1 Working with Text Data . . . . . . . . . . . . . . . . . . . . . . . . 225
8.7.2 Generating Sequence Data . . . . . . . . . . . . . . . . . . . . . . 226
8.7.3 Sampling Strategy and the Importance of Softmax
Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.7.4 Implementing LSTM Text Generation
(Character-Level Neural Language Model) . . . . . . . . . . . 227
8.8 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8.8.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8.8.2 Transfer Learning and Word Embedding . . . . . . . . . . . . 231
8.8.3 Analyzing Word Similarity Using Word Vectors . . . . . . 232
8.8.4 Analyzing Word Analogies Using Word Vectors . . . . . . 233
8.8.5 Debiasing Word Vectors . . . . . . . . . . . . . . . . . . . . . . . . 234
8.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.1 Gathering Experience and Knowledge . . . . . . . . . . . . . . . . . . . . 239
9.1.1 Research Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.2 Towards Lifelong Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.2.1 Final Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
About the Author
Abhijit Ghatak is a Data Engineer and holds graduate degrees in Engineering and
Data Science from India and the USA. He started his career as a submarine engineer
officer in the Indian Navy, where he worked on multiple data-intensive projects
involving submarine operations and submarine construction. He has since worked
in academia, in IT consulting, and as a research scientist in the area of Internet of
Things (IoT) and pattern recognition for the European Union. He has authored
many publications in the areas of engineering, IoT and machine learning. He
presently advises start-up companies on deep learning, pattern recognition and data
analytics. His areas of research include IoT, stream analytics and design of deep
learning systems. He can be reached at [email protected].