Unit - II Deep Learning Lecture Notes

UNIT – II: DEEP NETWORKS


History of Deep Learning - A Probabilistic Theory of Deep Learning - Backpropagation and Regularization, Batch Normalization - VC Dimension and Neural Nets - Deep vs. Shallow Networks - Convolutional Networks - Generative Adversarial Networks (GAN) - Semi-supervised Learning

2.1 History of Deep Learning:


Deep learning is a more evolved branch of machine learning that uses layers of algorithms to process data, imitate the thinking process, and develop abstractions. It is often used to visually recognize objects and understand human speech. Information is passed through each layer, with the output of the previous layer providing the input for the next layer. The first layer in a network is called the input layer, while the last is called the output layer. All the layers between input and output are referred to as hidden layers. Each layer is typically a simple, uniform algorithm containing one kind of activation function.

Feature extraction is another aspect of deep learning. It is used for pattern recognition and image
processing. Feature extraction uses an algorithm to automatically construct meaningful
“features” of the data for purposes of training, learning, and understanding. Normally a data
scientist, or a programmer, is responsible for feature extraction.

The history of deep learning can be traced back to 1943, when Walter Pitts and Warren
McCulloch created a computer model based on the neural networks of the human brain.

They used a combination of algorithms and mathematics they called “threshold logic” to mimic
the thought process. Since that time, Deep Learning has evolved steadily, with only two
significant breaks in its development. Both were tied to the infamous Artificial Intelligence
winters.

The 1960s

Henry J. Kelley is given credit for developing the basics of a continuous back propagation model in 1960. In 1962, a simpler version based only on the chain rule was developed by Stuart
Dreyfus. While the concept of back propagation (the backward propagation of errors for
purposes of training) did exist in the early 1960s, it was clumsy and inefficient, and would not
become useful until 1985.

The earliest efforts in developing deep learning algorithms came from Alexey Grigoryevich
Ivakhnenko (developed the Group Method of Data Handling) and Valentin Grigorʹevich Lapa
(author of Cybernetics and Forecasting Techniques) in 1965. They used models with polynomial
(complicated equations) activation functions that were then analyzed statistically. From each
layer, the best statistically chosen features were then forwarded on to the next layer (a slow,
manual process).

The 1970s

During the 1970s, the first AI winter kicked in, the result of promises that couldn’t be kept. The
impact of this lack of funding limited both DL and AI research. Fortunately, there were
individuals who carried on the research without funding.

The first “convolutional neural networks” were used by Kunihiko Fukushima. Fukushima
designed neural networks with multiple pooling and convolutional layers. In 1979, he developed
an artificial neural network, called Neocognitron, which used a hierarchical, multilayered design.
This design allowed the computer to “learn” to recognize visual patterns. The networks
resembled modern versions, but were trained with a reinforcement strategy of recurring
activation in multiple layers, which gained strength over time. Additionally, Fukushima’s design
allowed important features to be adjusted manually by increasing the “weight” of certain
connections.

Many of the concepts of Neocognitron continue to be used.

The use of top-down connections and new learning methods have allowed for a variety of neural
networks to be realized. When more than one pattern is presented at the same time, the Selective
Attention Model can separate and recognize individual patterns by shifting its attention from one
to the other. (The same process many of us use when multitasking). A modern Neocognitron can
not only identify patterns with missing information (for example, an incomplete number 5), but
can also complete the image by adding the missing information. This could be described as
“inference.”

Back propagation, the use of errors in training deep learning models, evolved significantly in
1970. This was when Seppo Linnainmaa wrote his master’s thesis, including a FORTRAN code
for back propagation.

Unfortunately, the concept was not applied to neural networks until 1985. This was when
Rumelhart, Williams, and Hinton demonstrated that back propagation in a neural network could provide “interesting” distributed representations. Philosophically, this discovery brought to light
the question within cognitive psychology of whether human understanding relies on symbolic
logic (computationalism) or distributed representations (connectionism).

The 1980s and 90s

In 1989, Yann LeCun provided the first practical demonstration of backpropagation at Bell Labs.
He combined convolutional neural networks with back propagation to read “handwritten”
digits. This system was eventually used to read the numbers of handwritten checks.

This time is also when the second AI winter (1985-90s) kicked in, which also affected research for neural networks and deep learning. Various overly-optimistic individuals had exaggerated the
“immediate” potential of Artificial Intelligence, breaking expectations and angering investors.
The anger was so intense, the phrase Artificial Intelligence reached pseudoscience status.
Fortunately, some people continued to work on AI and DL, and some significant advances were
made. In 1995, Dana Cortes and Vladimir Vapnik developed the support vector machine (a
system for mapping and recognizing similar data). LSTM (long short-term memory) for
recurrent neural networks was developed in 1997, by Sepp Hochreiter and Juergen Schmidhuber.

The next significant evolutionary step for deep learning took place in 1999, when computers
started becoming faster at processing data and GPUs (graphics processing units) were developed.
Faster processing, with GPUs processing pictures, increased computational speeds by 1000 times
over a 10 year span. During this time, neural networks began to compete with support vector
machines. While a neural network could be slow compared to a support vector machine, neural
networks offered better results using the same data. Neural networks also have the advantage of
continuing to improve as more training data is added.

2000-2010

Around the year 2000, The Vanishing Gradient Problem appeared. It was discovered “features”
(lessons) formed in lower layers were not being learned by the upper layers, because no learning
signal reached these layers. This was not a fundamental problem for all neural networks, just the
ones with gradient-based learning methods. The source of the problem turned out to be certain
activation functions. A number of activation functions condensed their input, in turn reducing the
output range in a somewhat chaotic fashion. This produced large areas of input mapped over an
extremely small range. In these areas of input, a large change will be reduced to a small change
in the output, resulting in a vanishing gradient. Two solutions used to solve this problem were
layer-by-layer pre-training and the development of long short-term memory.

In 2001, a research report by META Group (now called Gartner) described the challenges and opportunities of data growth as three-dimensional: an increasing volume of data, an increasing speed of data, and an increasing range of data sources and types. This was
a call to prepare for the onslaught of Big Data, which was just starting.

In 2009, Fei-Fei Li, an AI professor at Stanford, launched ImageNet, a free database assembled from more than 14 million labeled images. The Internet is, and was, full of unlabeled images.
Labeled images were needed to “train” neural nets. Professor Li said, “Our vision was that big
data would change the way machine learning works. Data drives learning.”

2011-2020

By 2011, the speed of GPUs had increased significantly, making it possible to train
convolutional neural networks “without” the layer-by-layer pre-training. With the increased
computing speed, it became obvious deep learning had significant advantages in terms of
efficiency and speed. One example is AlexNet, a convolutional neural network whose architecture won several international competitions during 2011 and 2012. AlexNet used rectified linear units to enhance speed, along with dropout to reduce overfitting.

Also in 2012, Google Brain released the results of an unusual project known as The Cat
Experiment. The free-spirited project explored the difficulties of “unsupervised learning.” Deep
learning uses “supervised learning,” meaning the convolutional neural net is trained using
labeled data (think images from ImageNet). Using unsupervised learning, a convolutional neural
net is given unlabeled data, and is then asked to seek out recurring patterns.

The Cat Experiment used a neural net spread over 1,000 computers. Ten million “unlabeled”
images were taken randomly from YouTube, shown to the system, and then the training software
was allowed to run. At the end of the training, one neuron in the highest layer was found to
respond strongly to the images of cats. Andrew Ng, the project’s founder said, “We also found a
neuron that responded very strongly to human faces.” Unsupervised learning remains a
significant goal in the field of deep learning.

The Generative Adversarial Neural Network (GAN) was introduced in 2014. GAN was created
by Ian Goodfellow. With GAN, two neural networks play against each other in a game. The goal
of the game is for one network to imitate a photo, and trick its opponent into believing it is real.
The opponent is, of course, looking for flaws. The game is played until the near perfect photo
tricks the opponent. GAN provides a way to perfect a product (and has also begun being used by
scammers).

The Future of Deep learning and Business

Deep learning has provided image-based product searches – eBay, Etsy – and efficient ways to
inspect products on the assembly line. The first supports consumer convenience, while the
second is an example of business productivity.

Currently, the evolution of artificial intelligence is dependent on deep learning. Deep learning is
still evolving and in need of creative ideas.

Semantic technology is being used with deep learning to take artificial intelligence to the next
level, providing more natural sounding, human-like conversations.

Banks and financial services are using deep learning to automate trading, reduce risk, detect
fraud, and provide AI/chatbot advice to investors. A report from the EIU (Economist Intelligence
Unit) suggests 86% of financial services are planning to increase their artificial intelligence
investments by 2025.

Deep learning and artificial intelligence are influencing the creation of new business models.
These businesses are creating new corporate cultures that embrace deep learning, artificial
intelligence, and modern technology.

2.2 A Probabilistic Theory of Deep Learning

Deep learning (DL) is one of the hottest topics in data science and artificial intelligence today.
DL has only been feasible since 2012 with the widespread usage of GPUs, but you’re probably
already dealing with DL technologies in various areas of your daily life. When you vocally
communicate with a digital assistant, when you translate text from one language into another
using the free DeepL translator service (DeepL is a company producing translation engines based
on DL), or when you use a search engine such as Google, DL is doing its magic behind the
scenes. Many state-of-the-art DL applications such as text-to-speech translations boost their
performance using probabilistic DL models. Further, safety critical applications like self-driving
cars use Bayesian variants of probabilistic DL.

In this chapter, you will get a first high-level introduction to DL and its probabilistic variants.
We use simple examples to discuss the differences between non-probabilistic and probabilistic
models and then highlight some advantages of probabilistic DL models. We also give you a first
impression of what you gain when working with Bayesian variants of probabilistic DL models.
In the remaining chapters of the book, you will learn how to implement DL models and how to
tweak them to get their more powerful probabilistic variants. You will also learn about the
underlying principles that enable you to build your own models and to understand advanced
modern models so that you can adapt them for your own purposes.

2.2.1 A first look at probabilistic models

Let’s first get an idea of what a probabilistic model can look like and how you can use it. We use
an example from daily life to discuss the difference between a non-probabilistic model and a
probabilistic model. We then use the same example to highlight some advantages of a
probabilistic model.

In our cars, most of us use a satellite navigational system (satnav, a.k.a. GPS) that tells us how
to get from A to B. For each suggested route, the satnav also predicts the needed travel time.
Such a predicted travel time can be understood as a best guess. You know you’ll sometimes need
more time and sometimes less time when taking the same route from A to B. But a standard
satnav is non-probabilistic: it predicts only a single value for the travel time and does not tell you
a possible range of values. For an example, look at the left panel in figure 1.1, where you see two
routes going from Croxton, New York, to the Museum of Modern Art (MoMA), also in New
York, with a predicted travel time that is the satnav’s best guess based on previous data and the
current road conditions.

Let’s imagine a fancier satnav that uses a probabilistic model. It not only gives you a best guess
for the travel time, but also captures the uncertainty of that travel time. The probabilistic
prediction of the travel time for a given route is provided as a distribution. For example, look at
the right panel of figure 1.1. You see two Gaussian bell curves describing the predicted travel-
time distributions for the two routes.

How can you benefit from knowing these distributions of the predicted travel time? Imagine you
are a New York cab driver. At Croxton, an art dealer boards your taxi. She wants to participate in
a great art auction that starts in 25 minutes and offers you a generous tip ($500) if she arrives
there on time. That’s quite an incentive!

Your satnav tool proposes two routes (see the left panel of figure 1.1). As a first impulse, you
would probably choose the upper route because, for this route, it estimates a travel time of 19
minutes, which is shorter than the 22 minutes for the other route. But, fortunately, you always
have the newest gadgets, and your satnav uses a probabilistic model that not only outputs the
mean travel time but also a whole distribution of travel times. Even better, you know how to
make use of the outputted distribution for the travel times.

You realize that in your current situation, the mean travel time is not very interesting. What
really matters to you is the following question: With which route do you have the better chance
of getting the $500 tip? To answer this question, you can look at the distributions on the right
side of figure 1.1. After a quick eyeball analysis, you conclude that you have a better chance of
getting the tip when taking the lower route, even though it has a larger mean travel time. The
reason is that the narrow distribution of the lower route has a larger fraction of the distribution
corresponding to travel times shorter than 25 minutes. To support your assessment with hard
numbers, you can use the satnav tool with the probabilistic model to compute for both
distributions the probability of arriving at MoMA in less than 25 minutes. This probability
corresponds to the proportion of the area under the curve left of the dashed line in figure 1.1,
which indicates a critical value of 25 minutes. Letting the tool compute the probabilities from the
distribution, you know that your chance of getting the tip is 93% when taking the lower route
and only 69% when taking the upper road.
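
As a rough numerical sketch of this calculation, assume (as the figure suggests) that both predicted travel-time distributions are Gaussian. The means below (19 and 22 minutes) come from the example; the standard deviations are purely illustrative guesses, chosen so that the resulting probabilities land close to the 69% and 93% quoted above.

```python
# Probability of arriving within 25 minutes under two Gaussian travel-time
# distributions. Means are from the example; standard deviations are assumed.
from scipy.stats import norm

deadline = 25                 # minutes until the auction starts
upper_route = (19, 12)        # (mean, std. dev.): shorter mean, wide spread
lower_route = (22, 2)         # longer mean, narrow spread

p_upper = norm.cdf(deadline, *upper_route)   # P(travel time < 25 min)
p_lower = norm.cdf(deadline, *lower_route)

print(f"Upper route: {p_upper:.0%} chance of arriving in time")   # ~69%
print(f"Lower route: {p_lower:.0%} chance of arriving in time")   # ~93%
```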

As discussed in this cab driver example, the main advantages of probabilistic models are that
these can capture the uncertainties in most real-world applications and provide essential
information for decision making. Other examples of the use of probabilistic models include self-driving cars and digital medicine. You can also use probabilistic DL to
generate new data that is similar to your observed data. A famous fun application is to create
realistic looking faces of non-existing people. We talk about this in chapter 6. Let’s first look at
DL from a bird’s-eye view before peeking into the curve-fitting part.

2.2.2 A first brief look at deep learning (DL)

What is DL anyway? When asked for a short elevator pitch, we would say that it’s a machine learning (ML) technique based on artificial neural networks (NNs) and that it’s loosely inspired
by the way the human brain works. Before giving our personal definition of DL, we first want to
give you an idea of what an artificial NN looks like (see figure 1.2).

In figure 1.2, you can see a typical traditional artificial NN with three hidden layers and several
neurons in each layer. Each neuron within a layer is connected with each neuron in the next
layer.

An artificial NN is inspired by the brain, which consists of billions of neurons that process, for example, all sensory perceptions such as vision or hearing. Neurons within the brain aren’t
connected to every other neuron, and a signal is processed through a hierarchical network of
neurons. You can see a similar hierarchical network structure in the artificial NN shown in figure
1.2. While a biological neuron is quite complex in how it processes information, a neuron in an
artificial NN is a simplification and abstraction of its biological counterpart.

To get a first idea of an artificial NN, it helps to imagine a neuron simply as a container for a number. The neurons in the input layer correspondingly hold the numbers of the input
data. Such input data could, for example, be the age (in years), income (in dollars), and height (in
inches) of a customer. All neurons in the following layers get the weighted sum of the values
from the connected neurons in the previous layer as their input. In general, the different
connections aren’t equally important but have weights, which determine the influence of the
incoming neuron’s value on the neuron’s value in the next layer. (Here we omit that this input is
further transformed within the neuron.) DL models are NNs, but they also have a large number
of hidden layers (not just three as in the example from figure 1.2).
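
The following minimal sketch illustrates this “weighted sum” idea for a single customer; the feature values and weights are made up for illustration, and the further transformation within each neuron mentioned above is omitted.

```python
# One step of the forward pass: every neuron in the next layer receives a
# weighted sum of the numbers held by the neurons in the previous layer.
import numpy as np

# Input layer: age (years), income (dollars), height (inches) of one customer
x = np.array([35.0, 52_000.0, 68.0])

# Connection weights to a hidden layer with 4 neurons (one column per neuron)
W = np.random.randn(3, 4) * 0.01

hidden = x @ W        # weighted sums arriving at the 4 hidden neurons
print(hidden.shape)   # (4,)
```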

The weights (strength of connections between neurons) in an artificial NN need to be learned for
the task at hand. For that learning step, you use training data and tune the weights to optimally fit
the data. This step is called fitting. Only after the fitting step can you use the model to do
predictions on new data.

Setting up a DL system is always a two-stage process. In the first step, you choose an
architecture. In figure 1.2, we chose a network with three layers in which each neuron from a
given layer is connected to each neuron in the next layer. Other types of networks have different
connections, but the principle stays the same. In the next step, you tune the weights of the model
so that the training data is best described. This fitting step is usually done using a procedure
called gradient descent. You’ll learn more about gradient descent in chapter 3.
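
As a sketch of this two-stage process, the snippet below uses the Keras API (one common framework; the layer sizes, data, and optimizer settings are arbitrary stand-ins, not taken from the text).

```python
import numpy as np
import tensorflow as tf

# Step 1: choose an architecture -- a fully connected NN with three hidden
# layers, where each neuron is connected to every neuron in the next layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Step 2: fit the weights to training data via (stochastic) gradient descent.
x_train = np.random.rand(100, 3)
y_train = np.random.rand(100, 1)
model.compile(optimizer="sgd", loss="mse")
model.fit(x_train, y_train, epochs=5, verbose=0)

# Only after fitting can the model predict on new data.
y_new = model.predict(np.random.rand(5, 3), verbose=0)
```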

Note that this two-step procedure is nothing special to DL but is also present in standard
statistical modeling and ML. The underlying principles of fitting are the same for DL, ML, and
statistics. We’re convinced that you can profit a lot by using the knowledge that was gained in
the field of statistics during the last centuries. This book acknowledges the heritage of traditional
statistics and builds on it. Because of this, you can understand much of DL by looking at
something as simple as linear regression, which we introduce in this chapter and use throughout
the book as an easy example. You’ll see in chapter 4 that linear regression already is a
probabilistic model providing more information than just one predicted output value for each
sample. In that chapter, you’ll learn how to pick an appropriate distribution to model the
variability of the outcome values. In chapter 5, we’ll show you how to use the TensorFlow
Probability framework to fit such a probabilistic DL model. You can then transfer this approach
to new situations allowing you to design and fit appropriate probabilistic DL models that not
only provide high performance predictions but also capture the noise of the data.

2.2.2.1 A success story

DL has revolutionized areas that so far have been especially hard to master with traditional ML
approaches but that are easy to solve by humans, such as the ability to recognize objects in
images (computer vision) and to process written text (natural language processing) or, more
generally, any kind of perception tasks. Image classification is far from being only an academic
problem and is used for a variety of applications:

• Face recognition
• Diagnostics of brain tumors in MRI data
• Recognition of road signs for self-driving cars

Although DL reveals its potential in different application areas, probably the easiest to grasp is in
the field of computer vision. We therefore use computer vision to motivate DL by one of its
biggest success stories.

In 2012, DL made a splash when Alex Krizhevsky from Geoffrey Hinton’s lab crushed all
competitors in the internationally renowned ImageNet competition with a DL-based model. In
this competition, teams from leading computer vision labs trained their models on a big data set
of ~1 million images with the goal of teaching these to distinguish 1,000 different classes of
image content. Examples for such classes are ships, mushrooms, and leopards. In the
competition, all trained models had to list the five most probable classes for a set of new test
images. If the right class wasn’t among the proposed classes, the test image counted as an error
(see figure 1.3, which shows how DL-based approaches took image classification by storm).

Before DL entered the competition, the best programs had an error rate of ~25%. In 2012,
Krizhevsky was the first to use DL and achieved a huge drop in the error rate (by about 10 percentage points, to only
~15%). Only a year later, in 2013, almost all competitors used DL, and in 2015, different DL-
based models reached the level of human performance, which is about 5%. You might wonder
why humans misclassify 1 image in 20 (5%). A fun fact: there are 170 different dog breeds in
that data set, which makes it a bit harder for humans to correctly classify the images.

2.2.3 Classification

Let’s look at the differences between non-probabilistic, probabilistic, and Bayesian probabilistic
classification. DL is known to outperform traditional methods, especially in image classification
tasks. Before going into details, we want to use a face recognition problem to give you a feeling
for the differences and the commonalities between a DL approach and a more traditional
approach to face recognition. As a side note, face recognition is actually the application that
initially brought us into contact with DL.

As statisticians, we had a collaboration project with some computer science colleagues for doing
face recognition on a Raspberry Pi minicomputer. The computer scientists challenged us by joking about the age of the statistical methods we used. We took the challenge and brought them to
a surprised silence by proposing DL to tackle our face recognition problem. The success in this
first project triggered many other joint DL projects, and our interests grew, looking deeper into
the underlying principles of these models.

Let’s look at a specific task. Sara and Chantal were together on holidays and took many pictures,
each showing at least one of them. The task is to create a program that can look at a photo and
determine which of the two women is in the photo. To get a training data set, we labeled 900
pictures, 450 for each woman, with the name of the pictured woman. You can imagine that
images can be very different at first sight because the women might be pictured from different
angles, laughing or tired, dressed up or casual, or having a bad hair day. Still, for you, the task is
quite easy. But for a computer, an image is only an array of pixel values, and programming it to
tell the difference between two women is far from trivial.

2.2.3.1 Traditional approach to image classification

A traditional approach to image classification doesn’t directly start with the pixel values of the
images but tackles the classification task in a two-step process. As a first step, experts in the field
define features that are useful to classify the images. A simple example of such a feature would
be the mean intensity value of all pixels, which can be useful to distinguish night shots from
pictures taken during the day. Usually these features are more complex and tailored to a specific
task. In the face recognition problem, you can think about easily understandable features like the
length of the nose, width of the mouth, or the distance between the eyes (figure 1.4).

Figure 1.4 Chantal (left) has a large distance between the eyes and a rather small mouth. Sara
(right) has a small distance between the eyes and a rather large mouth.

But these kinds of high-level features often are difficult to determine because many aspects need
to be taken into account, such as mimics, scale, receptive angle, or light conditions. Therefore,
non-DL approaches often use less interpretable, low-level features like SIFT features (Scale-Invariant Feature Transform), which capture local image properties in a way that is invariant to transformations such as scaling or rotation. You can, for example, think about an edge detector: an edge
won’t disappear if the image is rotated or scaled.

Already this simple example makes clear that feature engineering, meaning defining and
extracting those properties from the image that are important for the classification, is a
complicated and time-consuming task. It usually requires a high level of expertise. The (slow)
progress in many applications of computer vision like face recognition was mainly driven by the
construction of new and better features.

After the feature-extraction step, the values of these features represent each image. In order to identify Sara or Chantal from this feature representation of the image, you need to choose and fit a classification model.

What is the task of such a classification model? It should discriminate between the different class
labels. To visualize this idea, let’s imagine that an image is described by only two features: say,
distance of the eyes and width of the mouth. (We are aware that in most real cases, a good
characterization of an image requires many more features.)

Because the women aren’t always pictured head on but from different viewpoints, the apparent distance between the eyes isn’t always the same for the same woman. The apparent width of the mouth can vary even more, depending on whether the woman laughs or makes an air-kiss. When
representing each image of the pictured woman by these two features, the feature space can be
visualized by a 2D plot. One axis indicates the eye distance and the other axis shows the mouth
width (see figure 1.5). Each image is represented by a point; images of Sara are labeled with an S
and images of Chantal with a C.

Figure 1.5 A 2D space spanned by the features mouth width and eye distance. Each point
represents an image described by these two features (S for Sara and C for Chantal). The dashed
line is a decision boundary separating the two classes.

One way you can think about a non-probabilistic classification model is that the model defines
decision boundaries (see the dashed line in figure 1.5) that split the feature space into different
regions. Each resulting region corresponds to one class label. In our example, we’ve determined
a Sara region and a Chantal region. You can now use this decision boundary to classify new
images from which you only know the values for the two features: if the corresponding point in
the 2D feature space ends up in the Sara region, you classify it as Sara; otherwise, as Chantal.
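
A minimal sketch of this idea follows; the feature values and labels are invented, and logistic regression stands in for any of the classifiers listed below that can define such a decision boundary.

```python
# Fit a decision boundary in the 2D feature space (eye distance, mouth width)
# and classify a new image from its two feature values only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: (eye distance, mouth width); labels: "S" = Sara, "C" = Chantal
X = np.array([[2.1, 5.0], [2.0, 4.8], [2.2, 5.2],    # Sara-like images
              [3.0, 3.9], [3.1, 4.0], [2.9, 3.8]])   # Chantal-like images
y = np.array(["S", "S", "S", "C", "C", "C"])

clf = LogisticRegression().fit(X, y)     # fitting fixes the decision boundary

new_image = np.array([[2.3, 4.9]])       # features of a new, unlabeled image
print(clf.predict(new_image))            # lands in the Sara region -> ['S']
```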

You might know from your data analysis experiences some methods like the following, which
you can use for classification. (Don’t worry if you aren’t familiar with these methods.)

• Logistic or multinomial regression
• Random forest
• Support vector machines
• Linear discriminant analysis

Most classification models, including the listed methods and also DL, are parametric models,
meaning the model has some parameters that determine the course of the boundaries. The model
is only ready to actually perform a classification or class probability prediction after replacing
the parameters by certain numbers. Fitting is about how to find these numbers and how to
quantify the certainty of these numbers.

Fitting the model to a set of training data with known class labels determines the values of the
parameters and fixes the decision boundaries in the feature space. Depending on the classification
method and the number of parameters, these decision boundaries could be simple straight lines
or a complex boundary with wiggles. You can summarize the traditional workflow to set up a
classification method in three steps:

1. Defining and extracting features from the raw data
2. Choosing a parametric model
3. Fitting the classification model to the data by tuning its parameters

To evaluate the performance of the models, you use a validation data set that is not used during
the training. A validation data set in the face recognition example would consist of new images
of Chantal and Sara that were not part of the training data set. You then can use the trained
model to predict the class label and use the percentage of correct classifications as a (non-
probabilistic) performance measure.
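
A tiny sketch of this performance measure, with invented validation labels and predictions:

```python
# Percentage of correct classifications on a held-out validation set.
import numpy as np

y_true = np.array(["Sara", "Chantal", "Sara", "Sara", "Chantal"])
y_pred = np.array(["Sara", "Chantal", "Chantal", "Sara", "Chantal"])

accuracy = np.mean(y_true == y_pred)
print(f"Validation accuracy: {accuracy:.0%}")   # 80%
```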

Depending on the situation, one or another classification method will achieve better results on
the validation data set. However, in classical image classification, the most important ingredient
for success isn’t the choice of the classification algorithm but the quality of the extracted image
features. If the extracted features take different values for images from different classes, you’ll
see a clear separation of the respective points in the feature space. In such a situation, many
classification models show a high classification performance.

With the example of discriminating Sara from Chantal, you went through the traditional image
classification workflow. For getting good features, you first had to recognize that these two
women differ in their mouth width and their eye distance. With these specific features, you saw
that it is easy to build a good classifier. However, for discriminating between two other women,
these features might not do the trick, and you would need to start over with the feature
developing process again. This is a common drawback when working with customized features.

2.2.3.2 Deep learning approach to image classification

In contrast to the traditional approach to image classification, the DL approach starts directly
from the raw image data and uses only the pixel values as the input features to the model. In this
feature representation of an image, the number of pixels defines the dimension of the feature
space. For a low-resolution picture with 100 × 100 pixels, this already amounts to 10,000.
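
For illustration, here is a stand-in 100 × 100 grayscale image flattened into its pixel-value feature vector:

```python
import numpy as np

image = np.random.rand(100, 100)   # stand-in for a 100 x 100 pixel image
features = image.flatten()         # the pixel values are the input features
print(features.shape)              # (10000,)
```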

Besides such a high dimension, the main challenge is that pixel similarity of two pictures doesn’t
imply that the two images correspond to the same class label. Figure 1.6 illustrates this: the images in the same column obviously correspond to the same class but are different on the pixel
level. Simultaneously, images in the same row of figure 1.6 show high pixel similarity but don’t
correspond to the same class.

Figure 1.6 The left column shows two images of the class dog. The right column shows two
images of the class table. When comparing the pictures on the pixel level, the two images in the
same column are less similar than the two images in the same row, even if one image in a row
shows a dog and the other image displays a table.

The core idea of DL is to replace the challenging and time-consuming task of feature engineering
by incorporating the construction of appropriate features into the fitting process. Also, DL can’t
do any magic so, similar to traditional image analysis, the features have to be constructed from
the pixel values at hand. This is done via the hidden layers of the DL model.

Each neuron combines its inputs to yield a new value, and in this manner, each layer yields a
new feature representation of the input. Using many hidden layers allows the NN to decompose a
complicated transformation from the raw data to the outcome in a hierarchy of simple
transformations. When going from layer to layer, you get a more and more abstract
representation of the image that becomes better suited for discriminating between the classes.
You’ll learn more about this in chapter 2, where you’ll see that during the fitting process of a DL
model, a hierarchy of successively more complex features is learned. This then allows you to
discriminate between the different classes without the need of manually specifying the
appropriate features.

After defining the architecture, the network can be understood as a parametric model that often
contains millions of parameters. The model takes an input x and produces an output y. This is
true for every DL model (including reinforcement learning). The DL modeling workflow can be
summarized in two steps:

1. Defining the DL model architecture
2. Fitting the DL model to the raw data

The next sections discuss what is meant by non-probabilistic and probabilistic classification
models and what benefits you can get from a Bayesian variant of a probabilistic
classification model.

2.2.3.3 Non-probabilistic classification

Let’s first look at non-probabilistic classification. To make it easy and illustrative, we use the
image classification example again. The goal in image classification is to predict for a given
image which class it corresponds to. In the ImageNet competition discussed earlier, there were
1,000 different classes. In the face recognition example, there were only two classes: Chantal and
Sara.

In non-probabilistic image classification, you only get the predicted class label for each image.
More precisely, a non-probabilistic image classifier takes an image as input and then predicts
only the best guess for the class as output. In the face recognition example, it would either output
Chantal or Sara. You can also think about a non-probabilistic model as a deterministic model
without any uncertainty. When looking with probabilistic glasses at a non-probabilistic model, it
seems that a non-probabilistic model is always certain. The non-probabilistic model predicts with
a probability of one that the image belongs to one specific class.

Imagine a situation where the image shows Chantal, but she has dyed her hair the same color as Sara’s and the hair covers her face. For a human being, it’s quite hard to tell if the
image shows Chantal or Sara. But the non-probabilistic classifier still provides a predicted class
label (for example, Sara) without indicating any uncertainty. Or imagine an even more extreme
situation where you provide an image that shows neither Chantal nor Sara (see figure 1.7).
Which prediction will you get from the classifier? You would like the classifier to tell you that it
is not able to make a reliable prediction. But a non-probabilistic classifier still yields either
Chantal or Sara as a prediction without giving a hint of any uncertainty. To tackle such
challenges of handling difficult or novel situations, we turn to probabilistic models and their
Bayesian variants. These can express their uncertainty and indicate potentially
unreliable predictions.

Figure 1.7 A non-probabilistic image classifier for face recognition takes as input an image and
yields as outcome a class label. Here the predicted class label is Chantal, but only the upper
image really shows Chantal. The lower image shows a woman who is neither Chantal nor Sara.

2.2.3.4 Probabilistic classification

The special thing in probabilistic classification is that you not only get the best guess for the
class label but also a measure for the uncertainty of the classification. The uncertainty is
expressed by a probability distribution. In the face recognition example, a probabilistic classifier
would take a face image and then output a certain probability for Chantal and for Sara. Both
probabilities add up to 1 (see figure 1.8).

Figure 1.8 A probabilistic image classifier for face recognition takes as input an image and yields
as outcome a probability for each class label. In the upper panel, the image shows Chantal, and
the classifier predicts a probability of 0.85 for the class Chantal and a probability of 0.15 for the
class Sara. In the lower panel, the image shows neither Chantal nor Sara, and the classifier
predicts a probability of 0.8 for the class Chantal and a probability of 0.2 for the class Sara.

To give a best single guess, you would pick the class with the highest probability. It is common to interpret the probability of the predicted class as a measure of the certainty of the prediction. This is fine as long as the images are sufficiently similar to the training data.
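
A sketch of how such class probabilities are typically obtained and used: raw network outputs (“logits”) are passed through a softmax so that they add up to 1, and the best single guess is the class with the highest probability. The logit values here are made up so that the probabilities roughly match figure 1.8.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # subtract the max for numerical stability
    return np.exp(z) / np.sum(np.exp(z))

classes = ["Chantal", "Sara"]
logits = np.array([1.7, 0.0])          # hypothetical raw outputs for one image

probs = softmax(logits)
for c, p in zip(classes, probs):
    print(f"P({c}) = {p:.2f}")          # roughly 0.85 and 0.15
print("Best guess:", classes[int(np.argmax(probs))])
```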

But in reality, this is not always the case. Imagine that you provide the classifier with an image
that shows neither Chantal nor Sara. The classifier has no other choice than to assign
probabilities to the classes Chantal or Sara. But you would hope that the classifier shows its
uncertainty by assigning more or less equal probabilities to the two possible but wrong classes.
Unfortunately, this is often not the case when working with probabilistic NN models. Instead,
often quite high probabilities are still assigned to one of the possible but wrong classes (see
figure 1.8). To tackle this problem, in part 3 of our book, we extend the probabilistic models by
taking a Bayesian approach, which can add an additional uncertainty that you can use to detect
novel classes.

2.2.3.5 Bayesian probabilistic classification

The nice thing about Bayesian models is that these can express uncertainty about their
predictions. In our face recognition example, the non-Bayesian probabilistic model predicts an
outcome distribution that consists of the probability for Chantal and the probability for Sara,
which add up to 1. But how certain is the model about the assigned probabilities? Bayesian
models can give an answer to this question. In part 3 of this book, you will learn how this is done
in detail. At this point, let’s just note that you can ask a Bayesian model several times and get
different answers when you ask it. This reflects the uncertainty inherent in the model (see figure
1.9). Don’t worry if you do not see how you get these different model outputs for the same input.
You will learn about that in the third part of the book.

The main advantage of Bayesian models is that these can indicate a non-reliable prediction by a
large spread of the different sets of predictions (see lower panel of figure 1.9). In this way, you
have a better chance to identify novel classes like the young lady in the lower panel of figure 1.9
who is neither Chantal nor Sara.
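
One common way to ask a network several times and get different answers is Monte Carlo dropout, sketched below; the book’s own Bayesian construction may differ, and the model, image, and numbers here are illustrative stand-ins (the network is not actually trained).

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10_000,)),                  # flattened face image
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(2, activation="softmax"),   # P(Chantal), P(Sara)
])

image = np.random.rand(1, 10_000)                     # stand-in input image

# Keeping dropout active at prediction time (training=True) yields a
# different set of probabilities on every call.
samples = np.stack([model(image, training=True).numpy()[0] for _ in range(20)])

print("Mean probabilities:", samples.mean(axis=0))
print("Spread (std. dev.):", samples.std(axis=0))     # large spread = unsure
```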

Figure 1.9 A Bayesian probabilistic image classifier for face recognition takes as input an image
and yields as outcome a distribution of probability sets for the two class labels. In the upper
panel, the image is showing Chantal, and the predicted sets of probabilities all predict a large
probability for Chantal and an accordingly low probability for Sara. In the lower panel, the
image shows a lady who is neither Chantal nor Sara, so the classifier predicts different sets of
probabilities indicating a high uncertainty.

2.2.4 Curve fitting

We want to finish this introductory chapter talking about the differences in probabilistic and non-
probabilistic DL methods on regression tasks. Regression is sometimes also referred to as curve
fitting. This reminds one of the following:

All the impressive achievements of deep learning amount to just curve fitting.

--Judea Pearl, 2018

When we heard that Judea Pearl, the winner of the prestigious Turing Award in 2011 (the
computer science equivalent of the Nobel prize), claimed DL to be just curve fitting (the same
curve fitting done in simple analysis like linear regression for centuries), at first we were
surprised and even felt a bit offended. How could he be so disrespectful about our research
subject, which, moreover, showed such impressive results in practice? Our relative calmness is
probably due to the fact that we aren’t computer scientists but have a background in physics and
statistical data analysis. Curve fitting isn’t just curve fitting for us. However, giving his statement
a second thought, we can see his point: the underlying principles of DL and curve fitting are
identical in many respects.

2.2.4.1 Non-probabilistic curve fitting

Let’s first take a closer look at the non-probabilistic aspects of traditional curve-fitting methods.
Loosely speaking, non-probabilistic curve fitting is the science of putting lines through data
points. With linear regression in its most simple form, you put a straight line through the data
points (see figure 1.10). In that figure, we assume that we have only one feature, x, to predict a
continuous variable, y. In this simple case, the linear regression model has only two parameters,
a and b :

y = a ⋅ x + b

Figure 1.10 Scatter plot and regression model for the systolic blood pressure (SBP) example. The dots are the measured data points; the straight line is the linear model. For three age values (22, 47, 71), the positions of the horizontal lines indicate the predicted best guesses for the SBP (111,
139, 166).

After the definition of the model, the parameters a and b need to be determined so that the model
can be actually used to predict a single best guess for the value of y when given x. In the context
of ML and DL, this step of finding good parameter values is called training. But how are
networks trained? The training of the simple linear regression and DL models is done by fitting
the model’s parameters to the training data--a.k.a. curve fitting.

Note that the number of parameters can be vastly different, ranging from 2 in the 1D linear
regression case to 500 million for advanced DL models. The whole procedure is the same as in
linear regression. You’ll learn in chapter 3 how to fit the parameter of a non-probabilistic linear
regression model.

So, what do we mean when we say a non-probabilistic model is fit to data? Let’s look at the
model y = a ⋅ x + b for a concrete example of predicting the blood pressure y based on the age x.
Figure 1.10 is a plot of the systolic blood pressure (SBP) against the age for 33 American
women. Figure 1.10 shows concrete realizations with a = 1.70 and b = 87.7 (the solid line). In a
non-probabilistic model, for each age value you get only one best guess for the SBP for women
of this age. In figure 1.10, this is demonstrated for three age values (22, 47, and 71), where the
predicted best guesses for the SBP (111, 139, and 166) are indicated by the positions of the
dashed horizontal lines.
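
A sketch of this fitting step follows: the data below are simulated stand-ins, not the measurements of the 33 women from the figure, so the fitted a and b (and the resulting predictions) are only illustrative.

```python
# Fit the non-probabilistic model y = a*x + b by least squares and predict
# a single best guess for three age values.
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(16, 90, size=33)
sbp = 1.1 * age + 87 + rng.normal(0, 10, size=33)   # simulated measurements

a, b = np.polyfit(age, sbp, deg=1)                  # slope and intercept

for x in (22, 47, 71):
    print(f"Age {x}: predicted best guess for SBP = {a * x + b:.0f}")
```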

2.2.4.2 Probabilistic curve fitting

What do you get when you fit a probabilistic model to the same data? Instead of only a single
best guess for the blood pressure, you get a whole probability distribution. This tells you that
women with the same age might well have different SBPs (see figure 1.11). In the non-
probabilistic linear regression, an SBP of 111 is predicted for 22-year-old women (see figure
1.10). Now, when looking at the predicted distribution for 22-year-old women, SBP values close
to 111 (the peak of the distribution) are expected with higher probability than values further
away from 111.

Figure 1.11 Scatter plot and regression model for the systolic blood pressure (SBP) example. The
dots are the measured data points. At each age value (22, 47, 71), a Gaussian distribution is fitted
that describes the probability distribution of possible SBP values of women in these age groups.
For the three age values, the predicted probability distributions are shown. The solid line
indicates the positions of the mean values of all distributions corresponding to the ages between
16 and 90 years. The upper and lower dashed lines indicate an interval in which 95% of all
values are expected by the model.

The solid line in figure 1.11 indicates the positions of the mean values of all distributions
corresponding to the age values between 16 and 90 years. The solid line in figure 1.11 exactly
matches the regression line in figure 1.10, which is predicted from a non-probabilistic model.
The dashed lines that are parallel to the mean indicate an interval in which 95% of all individual
SBP values are expected by the model.
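
As a sketch of what the probabilistic model returns, assume the predicted distribution at each age is Gaussian with mean a ⋅ x + b and a constant standard deviation sigma; all parameter values here are illustrative, not read off the figure.

```python
import numpy as np
from scipy.stats import norm

a, b, sigma = 1.1, 87.0, 10.0       # assumed parameters of the fitted model

for age in (22, 47, 71):
    mean = a * age + b
    lower, upper = norm.interval(0.95, loc=mean, scale=sigma)
    p_120 = norm.cdf(120, loc=mean, scale=sigma)    # e.g. P(SBP < 120)
    print(f"Age {age}: mean {mean:.0f}, "
          f"95% interval [{lower:.0f}, {upper:.0f}], P(SBP < 120) = {p_120:.2f}")
```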

How do you find the optimal values for the parameters in a non-probabilistic and a probabilistic
model? Technically, you use a loss function that describes how poorly the model fits the
(training) data and then minimize it by tuning the weights of the model. You’ll learn about loss
functions and how to use these for fitting non-probabilistic or probabilistic models in chapters 3,
4, and 5. You’ll then see the difference between the loss function of a non-probabilistic and a
probabilistic model.
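
To hint at that difference already, here is a sketch contrasting a typical non-probabilistic loss (mean squared error) with a typical probabilistic loss (negative log-likelihood of the predicted Gaussians); the numbers are made up.

```python
import numpy as np
from scipy.stats import norm

y_true = np.array([118.0, 135.0, 160.0])   # observed SBP values
y_mean = np.array([111.0, 139.0, 166.0])   # predicted means of the model
sigma = 10.0                               # predicted spread (probabilistic)

mse_loss = np.mean((y_true - y_mean) ** 2)                         # non-probabilistic
nll_loss = -np.mean(norm.logpdf(y_true, loc=y_mean, scale=sigma))  # probabilistic

print(f"MSE loss: {mse_loss:.1f}")
print(f"NLL loss: {nll_loss:.2f}")
```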

The discussed linear regression model is, of course, simple. We use it mainly to explain the
underlying principles that stay the same when turning to complex DL models. In real world
applications, you would often not assume a linear dependency, and you would also not always
want to assume that the variation of the data stays constant. You’ll see in chapter 2 that it’s easy
to set up a NN that can model non-linear relationships. In chapters 4 and 5, you’ll see that it is
also not hard to build a probabilistic model for regression tasks that can model data with non-
linear behavior and changing variations (see figure 1.12). To evaluate the performance of a
trained regression model, you should always use a validation data set that is not used during
training. In figure 1.12, you can see the predictions of a probabilistic DL model on a new
validation set that shows that the model is able to capture the non-linear behavior of the data and
also the changing data variation.

Figure 1.12 Scatter plot and validation data predictions from a (non-Bayesian) probabilistic
regression model. The model is fitted on some simulated data with a non-linear dependency
between x and y and with non-constant data variation. The solid line indicates the positions of the
mean values of all predicted distributions. The upper and lower dashed lines indicate an interval
in which 95% of all values are expected by the model.

What happens if we use the model to predict the outcome of x values outside the range of the
training data? You can get a first glimpse when looking at figure 1.12, where we only have data
between -5 and 25 but show the predictions in a wider range between -10 and 30. It seems that
the model is especially certain about its predictions in the ranges where it has never seen the
data. That is strange and not a desirable property of a model! The reason for the model’s
shortcoming is that it only captures the data variation--it does not capture the uncertainty about
the fitted parameters. In statistics, there are different approaches known to capture this
uncertainty; the Bayesian approach is among these. When working with DL models, the
Bayesian approach is the most feasible and appropriate. You will learn about that in the last two
chapters of this book.

2.2.4.3 Bayesian probabilistic curve fitting

The main selling point of a Bayesian DL model is its potential to sound the alarm in case of
novel situations for which the model was not trained. For a regression model, this corresponds to
extrapolation, meaning you use your model in a data range that is outside the range of the
training data. In figure 1.13, you can see the result of a Bayesian variant of the NN that produces
the fit shown in figure 1.12. It is striking that only the Bayesian variant of the NN raises the
uncertainty when leaving the range of the training data. This is a nice property because it can
indicate that your model might yield unreliable predictions.

Figure 1.13 Scatter plot and validation data predictions from a Bayesian probabilistic regression
model. The model was fitted on some simulated data with non-linear dependency
between x and y and non-constant data variation. The solid line indicates the positions of the
mean values of all predicted distributions. The upper and lower dashed lines indicate an interval
in which 95% of all values are expected by the model.

2.2.5 When to use and when not to use DL?

Recently, DL has had several extraordinary success stories. You therefore might ask yourself
whether you should forget about traditional ML approaches and use DL instead. The answer
depends on the situation and the task at hand. In this section, we cover when not to use DL as
well as what problems DL is useful for.

2.2.5.1 When not to use DL

A DL model typically has millions of parameters and, therefore, usually needs a lot of data to be trained.
If you only have access to a limited number of features that describe each instance, then DL isn’t
the way to go. This includes the following applications:

• Predict the scores of a student in their first university year based on only their scores in high school
• Predict the risk for a heart attack within the next year based on the sex, age, BMI (body mass index), blood pressure, and blood cholesterol concentration of a person
• Classify the sex of a turtle based on its weight, its height, and the length of its feet

Also, in situations where you have only a small amount of training data and you know exactly which features
determine the outcome of interest (and it’s easy for you to extract these features from your raw
data), then you should go for these features and use those as a basis for a traditional ML model.
Imagine, for example, you get images from a soccer player collection of different individual
French and Dutch soccer players. You know that the jerseys of the French team are always blue,
and those of the Dutch team are always orange. If your task is to develop a classifier that
discriminates between players of these two teams, it’s probably best to decide if the number of
blue pixels (the French team) in the image is larger than the number of orange pixels (the Dutch
team). All other features (such as hair color, for example) that seem to discriminate between the
two teams would add noise rather than help with the classification of new images. It’s therefore
probably not a good idea to extract and use additional features for your classifier.

2.2.5.2 When to use DL

DL is the method of choice in situations where each instance is described by complex raw data
(like images, text, or sound) and where it isn’t easy to formulate the critical features that
characterize the different classes. DL models are then able to extract features from the raw data
that often outperform models that rely on handcrafted features. Figure 1.14 displays various tasks
in which DL recently changed the game.

Figure 1.14 The various tasks recently solved by DL that were out-of-reach for traditional ML
for a long time

2.2.5.3 When to use and when not to use probabilistic models?

You will see in this book that for most DL models, it is possible to set up a probabilistic version
of the model. You get the probabilistic version basically for free. In these cases, you can only
gain when using the probabilistic variant because it provides not only the information that you
get from the non-probabilistic version of the model, but also additional information that can be
essential for decision making. If you use a Bayesian variant of a probabilistic model, you have
the additional advantage of getting a measure that includes the model’s parameter uncertainty.
Having an uncertainty measure is especially important to identify situations in which your model
might yield unreliable predictions.

2.3 Regularization

Regularization is a technique used in machine learning and deep learning to prevent overfitting
and improve the generalization performance of a model. It involves adding a penalty term to the
loss function during training.

This penalty discourages the model from becoming too complex or having large parameter
values, which helps in controlling the model’s ability to fit noise in the training data.
Regularization methods include L1 and L2 regularization, dropout, early stopping, and more. By
applying regularization, models become more robust and better at making accurate predictions
on unseen data.

Have you seen this image before? As we move towards the right in this image, our model tries to
learn the details and the noise in the training data too well, which ultimately results in poor
performance on unseen data.

In other words, as we move towards the right, the complexity of the model increases such that
the training error reduces but the testing error doesn’t. This is shown in the image below.

If you’ve built a neural network before, you know how complex they are. This makes them more
prone to overfitting.

Regularization is a technique which makes slight modifications to the learning algorithm such
that the model generalizes better. This in turn improves the model’s performance on the unseen
data as well.

How does Regularization help reduce Overfitting?

Let’s consider a neural network which is overfitting on the training data, as shown in the image below.

If you have studied the concept of regularization in machine learning, you will have a fair idea
that regularization penalizes the coefficients. In deep learning, it actually penalizes the
weight matrices of the nodes.

Assume that our regularization coefficient is so high that some of the weight matrices are nearly
equal to zero. This will result in a much simpler linear network and slight underfitting of the
training data.

Such a large value of the regularization coefficient is not that useful. We need to optimize the
value of the regularization coefficient in order to obtain a well-fitted model, as shown in the
image below.

Different Regularization Techniques in Deep Learning:

1. Parameter Norm Penalties

The idea here is to limit the capacity of the model (the set of functions it can represent) by
adding a parameter norm penalty, Ω(θ), to the objective function, J:
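\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the penalty term: α = 0 means no regularization, and larger values of α correspond to stronger regularization. The specific norm chosen for Ω determines the type of regularization.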

Here, θ represents only the weights and not the biases; the reason is that the biases require
much less data to fit accurately, and leaving them unregularized does not introduce much variance.

1.1 L2 & L1 regularization

L1 and L2 are the most common types of regularization. These update the general cost function
by adding another term known as the regularization term.

Cost function = Loss (say, binary cross entropy) + Regularization term

Due to the addition of this regularization term, the values of the weight matrices decrease,
because the penalty favours smaller weights, and a neural network with smaller weight matrices is
assumed to correspond to a simpler model. Therefore, it will also reduce overfitting to quite an extent.

However, this regularization term differs in L1 and L2.

In L2, we have:
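\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert_{2}^{2}

(Here m is the number of training examples and w ranges over the weight matrices; the exact scaling convention, e.g. whether the factor 1/2m appears, varies between texts.)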

Here, lambda is the regularization parameter. It is the hyperparameter whose value is optimized
for better results. L2 regularization is also known as weight decay as it forces the weights to
decay towards zero (but not exactly zero).

In L1, we have:
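\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert_{1}

(Same notation, and the same caveat about scaling conventions, as for L2 above.)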

In this, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to
zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we
usually prefer L2 over it.

In Keras, we can directly apply regularization to any layer using the regularizers module.

Below is the sample code to apply L2 regularization to a Dense layer.

from keras import regularizers

model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l2(0.01)))

Similarly, we can also apply L1 regularization, as shown in the sketch below.
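For instance, the same hypothetical Dense layer could be given an L1 penalty instead; regularizers.l1 is the Keras counterpart of the regularizers.l2 call used above (the strength 0.01 is only an example value):

from keras import regularizers

model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l1(0.01)))  # L1 penalty with strength 0.01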

2. Norm Penalties as Constrained Optimization

Sometimes we may wish to use explicit constraints (keeping the norm of the parameters below some
constant k) rather than penalties; see the Keras sketch after this list.

 This can be useful if we have an idea of what value of k is appropriate and do not want to
spend time searching for the value of α that corresponds to this k.
 Penalties can cause non-convex optimization procedures to get stuck in poor local minima
corresponding to very small weights, whereas explicit constraints do not push the weights toward the origin.
 Explicit constraints, implemented by re-projection after each update, impose some stability on the optimization procedure.
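In Keras, one common way to apply such an explicit constraint is a max-norm constraint on a layer's weights (a minimal sketch; the layer size and the bound 3 are arbitrary example values):

from keras.constraints import max_norm
from keras.layers import Dense

# Re-projects the weight vectors after each update so that their norm never exceeds 3
model.add(Dense(64, activation='relu', kernel_constraint=max_norm(3)))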

3. Regularization and Under-Constrained Problems

Many linear models, such as linear regression when X⊤X is singular or logistic regression applied to linearly separable classes, are under-constrained and have no unique solution; most forms of regularization (for example, weight decay) make such under-determined problems well posed.

4. Dataset Augmentation

The simplest way to reduce overfitting is to increase the size of the training data. In classical
machine learning this is often not feasible, because labeled data is too costly to obtain.

But now let’s consider that we are dealing with images. In this case, there are a few ways of
increasing the size of the training data – rotating the image, flipping, scaling, shifting, etc. In the
image below, some transformations have been done on the handwritten digits dataset.

This technique is known as data augmentation. It usually provides a big boost in model accuracy
and can be considered an almost mandatory trick for improving our predictions. A minimal Keras
sketch is shown below.
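The sketch below uses Keras's ImageDataGenerator to generate randomly transformed copies of the training images on the fly. The arrays x_train and y_train and the model object are placeholders assumed to exist already, and the chosen transformation ranges are illustrative, not prescriptive (older standalone Keras versions use model.fit_generator instead of model.fit for generators).

from keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift, zoom, and flip the training images
datagen = ImageDataGenerator(
    rotation_range=15,       # rotate by up to 15 degrees
    width_shift_range=0.1,   # shift horizontally by up to 10% of the width
    height_shift_range=0.1,  # shift vertically by up to 10% of the height
    zoom_range=0.1,          # zoom in/out by up to 10%
    horizontal_flip=True)    # randomly mirror images left-right

# Train on batches of augmented images instead of the raw training set
model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)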

5 Noise Robustness

Noise injection can be much more powerful than simply shrinking the parameters, especially
when the noise is added to the hidden units.

Another way: adding noise to the weights.

 Adding noise to the weights encourages the parameters to move to regions of parameter space
where small perturbations of the weights have a relatively small influence on the output.

5.1 Injecting Noise at the Output Targets

Label smoothing is an example of this strategy: for a k-class classification problem, the hard
targets 0 and 1 are replaced by ε/(k−1) and 1−ε, respectively, for some small smoothing parameter ε.

 This prevents the softmax classifier from pursuing ever-larger weights in the attempt to output
hard 0s and 1s, which it can never actually reach (see the sketch below).
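In practice, label smoothing is often applied directly through the loss function. A minimal TensorFlow/Keras sketch (the smoothing value 0.1 is only an example, and the exact softening formula used by the library may differ slightly from the textbook convention above):

import tensorflow as tf

# Cross-entropy loss that trains against softened versions of the hard 0/1 targets
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])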

6 Semi-Supervised Learning

In the regularization context, semi-supervised learning refers to using both unlabeled examples
from P(x) and labeled examples from P(x, y) to estimate P(y | x), typically by learning a
representation in which examples from the same class end up with similar representations.

One approach is to build a model in which a generative model of P(x) (or of P(x, y)) shares
parameters with the discriminative model P(y | x). The generative criterion then expresses a
particular form of prior belief about the solution (Lasserre et al., 2006).

7 Multitask Learning

Multitask learning improves generalization by pooling the examples arising from several related
tasks. The model parameters are typically divided into two kinds:

 Task-specific parameters, which only benefit from the examples of their own task
 Generic parameters, shared across all tasks, which benefit from the pooled data of all the tasks

8 Early Stopping

Early stopping is a kind of cross-validation strategy where we keep one part of the training set as
the validation set. When we see that the performance on the validation set is getting worse, we
immediately stop training the model. This is known as early stopping.

In the above image, we will stop training at the dotted line, since after that our model will start
overfitting on the training data.

In Keras, we can apply early stopping using the EarlyStopping callback. Below is the sample code for it.

from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5)

Here, monitor denotes the quantity that needs to be monitored; 'val_loss' is the loss measured on
the validation data. (The callback object is passed to model.fit through its callbacks argument,
as sketched below.)

Patience denotes the number of epochs with no further improvement after which the training will be
stopped. For better understanding, let’s take a look at the above image again. After the dotted
line, each epoch results in a higher value of validation error. Therefore, 5 epochs after the dotted
line (since our patience is equal to 5), our model will stop because no further improvement is seen.
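A minimal usage sketch (x_train, y_train, and model are placeholders for your own data and network, and the epoch count is arbitrary):

# Hold out 20% of the training data as a validation set and stop early when validation loss stagnates
model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=100,
          callbacks=[early_stopping])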

9 Parameter Tying and Parameter Sharing

Parameter norm penalties are one way to express prior knowledge about suitable parameter values;
another is to exploit known dependencies between the model parameters. For example, if two models
solve similar tasks, we can penalize the distance between their parameter vectors so that they stay
close to each other (parameter tying). A stronger assumption is parameter sharing, where sets of
parameters are forced to be exactly equal.

9.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) are by far the most prominent example of parameter sharing:
the same kernel weights are applied at every spatial position of the input. This dramatically
reduces the number of unique parameters and builds translation equivariance into the model, as the
small sketch below illustrates.
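A Conv2D layer's parameter count depends only on the kernel size and number of filters, not on the image size, because the same weights are shared across all positions (the layer sizes here are arbitrary example values):

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# 32 filters of size 3x3 shared across all positions of a 28x28 grayscale image:
# parameters = 32 * (3*3*1) weights + 32 biases = 320, regardless of the image resolution
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.summary()  # prints the layer's parameter count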

10 Sparse Representations

Another strategy is to place a penalty on the activations of the units, encouraging many of the
activations in the representation h to be zero (a sparse representation).

This is called representational regularization: the penalty Ω(h) acts on the learned representation
rather than on the model parameters, but it is added to the loss in the same way as a parameter
norm penalty, as sketched below.
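In Keras this can be sketched with an activity regularizer, which penalizes a layer's outputs (activations) rather than its weights (the penalty strength 1e-5 is only an example value):

from keras import regularizers
from keras.layers import Dense

# An L1 penalty on the layer's activations encourages a sparse representation
model.add(Dense(64, activation='relu',
                activity_regularizer=regularizers.l1(1e-5)))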

11 Bagging and Other Ensemble Methods

Bagging (bootstrap aggregating) is an instance of a general strategy called model averaging:
several models are trained on different bootstrap samples of the training set, and their
predictions are combined (averaged or voted on) at test time.

The reason that model averaging works is that different models usually do not make exactly the
same errors on the test set; to the extent that their errors are independent, averaging reduces the
expected error of the ensemble. (A minimal sketch of such prediction averaging is given below.)

Boosting, in contrast, builds the ensemble incrementally, with each new model concentrating on the
examples the current ensemble handles poorly; unlike bagging, it constructs an ensemble with
higher capacity than its individual members.
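A minimal sketch of test-time model averaging for classifiers that output class probabilities (the list models and the array x_test are placeholders; each model is assumed to have been trained on its own bootstrap sample):

import numpy as np

# Average the predicted class probabilities of all ensemble members
ensemble_probs = np.mean([m.predict(x_test) for m in models], axis=0)

# Final prediction: the class with the highest averaged probability
ensemble_pred = np.argmax(ensemble_probs, axis=1)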

12 Dropout

This is one of the most interesting regularization techniques. It also produces very good results
and is consequently the most frequently used regularization technique in the field of deep learning.

To understand dropout, let’s say our neural network structure is akin to the one shown below:

So what does dropout do? At every iteration, it randomly selects some nodes and removes them
along with all of their incoming and outgoing connections, as shown below.

So each iteration has a different set of nodes, and this results in a different set of outputs.
Dropout can therefore also be thought of as an ensemble technique in machine learning: ensemble
models usually perform better than a single model as they capture more randomness, and similarly,
a network trained with dropout usually performs better than the same network trained without it.

The probability with which nodes are dropped is the hyperparameter of the dropout function. As
seen in the image above, dropout can be applied to both the hidden layers and the input layer.

For these reasons, dropout is usually preferred when we have a large neural network structure, in
order to introduce more randomness. A minimal Keras sketch follows.
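The sketch below adds a Dropout layer after a hypothetical Dense layer; the rate 0.25 (25% of the layer's outputs set to zero during each training step) is only an example value:

from keras.layers import Dense, Dropout

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))  # randomly zero out 25% of the activations during training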

13 Adversarial Training

Adversarial training means training the network on adversarially perturbed examples from the
training set, i.e., inputs modified by a small perturbation chosen to change the model's output as
much as possible. This encourages the model to be locally smooth in the neighbourhood of the
training data. (A sketch of generating such perturbations is given below.)

 In the semi-supervised setting, the same idea can be applied to unlabeled points via virtual
adversarial examples: the model is trained to assign the same prediction to a point x and to an
adversarial perturbation of x.
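The sketch below uses the fast gradient sign method (FGSM), one common way to construct adversarial perturbations; model, x_batch, and y_batch are placeholders, the model is assumed to output class probabilities (softmax), and eps controls the perturbation size:

import tensorflow as tf

def fgsm_perturb(model, x_batch, y_batch, eps=0.01):
    # Return adversarially perturbed inputs using the fast gradient sign method
    x = tf.convert_to_tensor(x_batch)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.categorical_crossentropy(y_batch, model(x))
    grad = tape.gradient(loss, x)     # gradient of the loss w.r.t. the input
    return x + eps * tf.sign(grad)    # step in the direction that increases the loss

# Adversarial training then mixes these perturbed inputs into the training batches.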

14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier

All three of these methods attempt to overcome the curse of dimensionality by

 assuming that data lies near a low-dimensional manifold.

tangent distance algorithm

 The tangent distance algorithm is a non-parametric nearest-neighbour method in which the
distance between two points is measured as the distance between the (tangent-plane approximations
of the) manifolds on which they lie.

tangent prop

 In the tangent prop algorithm, local invariance is achieved by adding a penalty that requires
∇xf(x) to be orthogonal to the known manifold tangent vectors v(i) at x, so that f(x) is invariant
under the corresponding local transformations.

Two major drawbacks of tangent prop:

 It only regularizes the model to resist infinitesimal perturbations; explicit dataset
augmentation confers resistance to larger perturbations.
 The infinitesimal approach poses difficulties for models based on rectified linear units
(ReLUs), whereas dataset augmentation works well with ReLUs.

Tangent propagation is also related to

 double backprop
 adversarial training

Dataset augmentation is the non-infinitesimal version of tangent propagation, and adversarial
training is the non-infinitesimal version of double backprop.

The manifold tangent classifier eliminates the need to know the tangent vectors a priori.

 Autoencoders can estimate the manifold tangent vectors.

The algorithm:

 Use an autoencoder to learn the manifold structure by unsupervised learning.
 Use the estimated tangent vectors to regularize a neural net classifier, as in tangent prop.
