Deep Learning
Dr. Pratik Narang
Department of CSIS, BITS Pilani, Pilani Campus
Lecture 11
Autoencoders
Dimensionality reduction
• In machine learning, dimensionality reduction is the process of
reducing the number of features that describe some data.
• This reduction is done either by selection (only some existing
features are conserved) or by extraction (a reduced number of new
features are created based on the old features)
• Useful in many situations that require low dimensional data (data
visualisation, data storage, heavy computation…).
• Commonly used approaches: PCA, ICA
Dimensionality reduction
Let us call the encoder the process that produces the “new features”
representation from the “old features” representation (by selection or
by extraction), and the decoder the reverse process.
Dimensionality reduction can then be interpreted as data compression,
where the encoder compresses the data (from the initial space to the
encoded space, also called the latent space) whereas the decoder
decompresses it.
Source: https://towardsdatascience.com/
Principal components analysis (PCA)
The idea of PCA is to build n_e new independent features
that are linear combinations of the n_d old features, such
that the projections of the data onto the subspace
defined by these new features are as close as possible to
the initial data (in terms of Euclidean distance).
In other words, PCA is looking for the best linear subspace
of the initial space (described by an orthogonal basis of
new features) such that the error of approximating the
data by their projections on this subspace is as small as
possible.
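As a rough illustration, here is a minimal NumPy sketch of this idea, computing the best n_e-dimensional linear subspace via the SVD; the toy data and the choice n_e = 2 are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Toy data: 200 samples with n_d = 5 original features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

n_e = 2  # number of new (latent) features to keep

# Center the data, then take the top n_e right singular vectors
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
W = Vt[:n_e].T                     # orthogonal basis of the best linear subspace

# "Encoder": project onto the subspace; "decoder": map back to the input space
codes = X_centered @ W             # shape (200, n_e)
X_reconstructed = codes @ W.T + X.mean(axis=0)

# Average squared Euclidean reconstruction error that PCA minimizes
print(np.mean(np.sum((X - X_reconstructed) ** 2, axis=1)))
```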
Generative modeling
Supervised vs unsupervised learning
Why generative modelling?
Debiasing
Outlier detection
Latent variable models – Autoencoders and GANs
What is a latent variable?
Plato, Republic
Autoencoders
Typical DNNs characteristics
So far, the deep learning models we have seen have several things in common:
• Input layer: a (possibly vectorized) quantitative
representation of the data
• Hidden layer(s): apply transformations with nonlinearities
• Output layer: the result for classification, regression,
translation, segmentation, etc.
• The models are used for supervised learning
Example
Source: https://cse.iitkgp.ac.in/~sudeshna/courses/DL18/
Changing the objective!
Now we will talk about unsupervised learning with Deep Neural Networks
Source: https://cse.iitkgp.ac.in/~sudeshna/courses/DL18/
Autoencoders: definition
Autoencoders are neural networks that are trained to copy their inputs to
their outputs.
• Usually constrained in particular ways to make this task more difficult.
• They compress the input into a lower-dimensional code and then
reconstruct the output from this representation. The code is a compact
“summary” or “compression” of the input, also called the latent-space
representation.
• Structure is almost always organized into an encoder network, f, and a
decoder network, g: model = g(f(x))
• Trained by gradient descent with a reconstruction loss that measures
the difference between input and output, e.g. the MSE:
L(x, g(f(x))) = || x - g(f(x)) ||^2
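A minimal PyTorch sketch of this encoder/decoder structure and the MSE reconstruction loss is given below; the 784-dimensional input (e.g. a flattened 28x28 image) and the layer sizes are illustrative assumptions, not values given in the slides:

```python
import torch
import torch.nn as nn

# Encoder f and decoder g; the model computes g(f(x))
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 32))   # f
decoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 784))   # g
model = nn.Sequential(encoder, decoder)

loss_fn = nn.MSELoss()                      # reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784)                     # a dummy batch of flattened inputs
x_hat = model(x)                            # g(f(x))
loss = loss_fn(x_hat, x)                    # || x - g(f(x)) ||^2 averaged over the batch
loss.backward()
optimizer.step()
```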
Autoencoders
Reconstruction quality
Autoencoders for representation learning
Autoencoders
Autoencoders are mainly a dimensionality reduction (or compression)
algorithm with a couple of important properties:
Data-specific: They are only able to meaningfully compress data similar
to what they have been trained on. Since they learn features specific
to the given training data, they are different from a standard data
compression algorithm like gzip.
Lossy: The output of the autoencoder will not be exactly the same as
the input; it will be a close but degraded representation.
Unsupervised: Autoencoders are considered an unsupervised learning
technique since they don’t need explicit labels to train on.
Undercomplete Autoencoders
Undercomplete autoencoders are defined to have a hidden layer h with a
smaller dimension than the input layer.
• The network must model x in a lower-dimensional space and map the latent
space accurately back to the input space.
• Encoder network: a function that returns a useful, compressed representation
of the input.
• If the network has only linear transformations, the encoder learns the same
subspace as PCA. With typical nonlinearities, the network learns a generalized,
more powerful version of PCA.
Source: https://cse.iitkgp.ac.in/~sudeshna/courses/DL18/
Architecture
Source: https://towardsdatascience.com/
Training
Four hyperparameters need to be set before training an autoencoder:
Code size: number of nodes in the middle layer. Smaller size results in
more compression.
Number of layers: the autoencoder can be as deep as we like
Number of nodes per layer: a stacked autoencoder is one where the
layers are stacked one after another. Stacked autoencoders usually look
like a “sandwich”: the number of nodes per layer decreases with each
subsequent layer of the encoder and increases back in the decoder. The
decoder is also symmetric to the encoder in terms of layer structure.
Loss function: we use either mean squared error (MSE) or binary
cross-entropy. If the input values are in the range [0, 1] we typically
use cross-entropy; otherwise we use MSE.
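As an illustration, a stacked “sandwich” autoencoder with these four hyperparameters made explicit might look like the following PyTorch sketch (all sizes and the choice of binary cross-entropy are illustrative assumptions):

```python
import torch.nn as nn

code_size = 32          # hyperparameter 1: code size
# hyperparameters 2 and 3: number of layers and nodes per layer
# (encoder sizes decrease, decoder mirrors them back up)
encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, code_size),
)
decoder = nn.Sequential(
    nn.Linear(code_size, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),   # outputs constrained to [0, 1]
)
autoencoder = nn.Sequential(encoder, decoder)

# hyperparameter 4: loss function; inputs in [0, 1], so binary cross-entropy
loss_fn = nn.BCELoss()
```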
Training
• We can make the autoencoder very powerful by increasing the number
of layers, nodes per layer and most importantly the code size.
• Increasing these hyperparameters will let the autoencoder learn
more complex codings.
• But we should be careful not to make it too powerful. Otherwise the
autoencoder will simply learn to copy its inputs to the output, without
learning any meaningful representation: it will just mimic the identity
function.
• This is why we prefer a “sandwich” architecture and deliberately keep
the code size small.
• Since the coding layer has a lower dimensionality than the input data,
the autoencoder is said to be undercomplete. It won’t be able to directly
copy its inputs to the output, and will be forced to learn intelligent
features.
Denoising autoencoders
Another way to force the autoencoder to learn useful
features is by adding random noise to its inputs and
making it recover the original noise-free data.
• This way the autoencoder can’t simply copy the input to
its output because the input also contains random noise.
• We are asking it to subtract the noise and produce the
underlying meaningful data.
• This is called a denoising autoencoder.
Example
Source: https://towardsdatascience.com/
Denoising autoencoders
• We introduce a corruption process C(x̃ | x), which represents a
conditional distribution over corrupted samples x̃, given a data sample
x. The autoencoder then learns a reconstruction distribution
p_reconstruct(x | x̃) estimated from training pairs (x, x̃) as follows:
• Sample a training example x from the training data.
• Sample a corrupted version x̃ from C(x̃ | x = x).
• Use (x, x̃) as a training example for estimating the autoencoder
reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the
output of the encoder f(x̃) and p_decoder typically defined by a decoder g(h)
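A sketch of a single denoising training step in PyTorch, assuming additive Gaussian noise as the corruption process C(x̃ | x); the architecture, noise level, and MSE loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A small sandwich autoencoder (sizes are illustrative)
autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32),                 # encoder f
    nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid(),   # decoder g
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(16, 784)                        # clean training batch
x_tilde = x + 0.2 * torch.randn_like(x)        # corrupted version x~ sampled from C(x~ | x)

x_hat = autoencoder(x_tilde)                   # reconstruct from the corrupted input
loss = loss_fn(x_hat, x)                       # compare with the clean x, not with x~
loss.backward()
optimizer.step()
```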
Sparse autoencoders
A third method to force the autoencoder to learn useful
features: using regularization
We can regularize the autoencoder by using a sparsity
constraint such that only a fraction of the nodes would have
nonzero values, called active nodes.
Add a penalty term to the loss function such that only a
fraction of the nodes become active. This forces the
autoencoder to represent each input as a combination of a
small number of nodes, and pushes it to discover
interesting structure in the data.
This method works even if the code size is large, since only a
small subset of the nodes will be active at any time.
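One common way to realise this constraint, sketched below, is to add an L1 penalty on the code activations to the reconstruction loss; the architecture and the penalty weight lambda_sparse are illustrative assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU())
decoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784))
lambda_sparse = 1e-3                      # strength of the sparsity penalty (illustrative)

x = torch.rand(16, 784)
h = encoder(x)                            # code: only a few units should end up active
x_hat = decoder(h)

reconstruction = nn.functional.mse_loss(x_hat, x)
sparsity = h.abs().mean()                 # L1 penalty pushes most activations toward zero
loss = reconstruction + lambda_sparse * sparsity
loss.backward()
```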
Applications
Data denoising
Dimensionality reduction
Information retrieval
Content generation (Generative models) - VAEs
Content generation
At first sight, we could be tempted to think that, if the latent
space is regular enough (well “organized” by the encoder
during the training process), we could take a point
randomly from that latent space and decode it to get
new content.
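In code, this naive generative attempt would amount to something like the sketch below (the decoder architecture and latent size are illustrative assumptions, and the decoder is assumed to be already trained):

```python
import torch
import torch.nn as nn

latent_dim = 32
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                        nn.Linear(128, 784), nn.Sigmoid())   # stands in for a trained decoder g

z = torch.randn(1, latent_dim)     # pick a random point in the latent space
x_new = decoder(z)                 # decode it, hoping to obtain plausible new content
```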
Content generation
However, the regularity of the latent
space for autoencoders is a difficult
point that depends on:
• the distribution of the data in the initial space,
• the dimension of the latent space,
• the architecture of the encoder.
So, it is pretty difficult (if not impossible) to ensure, a priori, that the
encoder will organize the latent space in a smart way compatible with
the generative process.
The autoencoder is solely trained to encode and decode with as little loss
as possible, no matter how the latent space is organised.
Thus, it is natural that, during the training, the network takes advantage of
any overfitting possibilities to achieve its task as well as it can.