Deep Learning

Dr. Pratik Narang
Department of CSIS, BITS Pilani, Pilani Campus

Lecture 11
Autoencoders

Dimensionality reduction

• In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data.
• This reduction is done either by selection (only some of the existing features are kept) or by extraction (a smaller number of new features are created based on the old features).
• Useful in many situations that require low-dimensional data (data visualisation, data storage, heavy computation, ...).
• Commonly used approaches: PCA, ICA



Dimensionality reduction

Let's call encoder the process that produces the "new features" representation from the "old features" representation (by selection or by extraction), and decoder the reverse process.
Dimensionality reduction can then be interpreted as data compression, where the encoder compresses the data (from the initial space to the encoded space, also called the latent space) and the decoder decompresses it.

Source: https://towardsdatascience.com/


Principal component analysis (PCA)

The idea of PCA is to build n_e new independent features that are linear combinations of the n_d old features, such that the projections of the data onto the subspace defined by these new features are as close as possible to the initial data (in terms of Euclidean distance).
In other words, PCA looks for the best linear subspace of the initial space (described by an orthogonal basis of new features) such that the error of approximating the data by their projections onto this subspace is as small as possible.
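As a small illustration of this encoder/decoder view of PCA, here is a sketch in plain NumPy (the helper name pca_project and the toy data are assumptions; n_e and n_d are as in the text): project onto the top n_e principal directions, then reconstruct.

```python
import numpy as np

def pca_project(X, n_e):
    """Encode by projecting onto the n_e principal directions, then decode back."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_e].T                              # (n_d, n_e) orthogonal basis
    codes = Xc @ W                              # "encoder": n_d -> n_e
    X_hat = codes @ W.T + mean                  # "decoder": n_e -> n_d
    return codes, X_hat

X = np.random.randn(100, 5)                     # toy data: 100 points with n_d = 5
codes, X_hat = pca_project(X, n_e=2)
print(((X - X_hat) ** 2).mean())                # approximation error (Euclidean)
```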



Generative modeling

Supervised vs unsupervised learning

Why generative modelling?

Debiasing

Outlier detection


Latent variable models – Autoencoders and GANs

What is a latent variable?

Plato, Republic


Autoencoders

Typical DNN characteristics

So far, the deep learning models we have seen have several things in common:

• Input layer: a quantitative (possibly vectorized) representation of the data
• Hidden layer(s): apply transformations with nonlinearities
• Output layer: the result for classification, regression, translation, segmentation, etc.
• The models are used for supervised learning


Example

Source: https://cse.iitkgp.ac.in/~sudeshna/courses/DL18/


Changing the objective!

Now we will talk about unsupervised learning with Deep Neural Networks

Source: https://cse.iitkgp.ac.in/~sudeshna/courses/DL18/


Autoencoders: definition

Autoencoders are neural networks that are trained to copy their inputs to their outputs.

• Usually constrained in particular ways to make this task more difficult.
• They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact "summary" or "compression" of the input, also called the latent-space representation.
• The structure is almost always organized into an encoder network, f, and a decoder network, g: model = g(f(x))
• Trained by gradient descent with a reconstruction loss that measures the difference between input and output, e.g. MSE: L(x, g(f(x))) = ||x − g(f(x))||²
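A minimal sketch of this structure (assuming PyTorch, a 784-dimensional input such as a flattened 28x28 image, and an illustrative code size of 32; the layer sizes are not from the slides):

```python
import torch
import torch.nn as nn

input_dim, code_size = 784, 32          # assumed sizes, e.g. flattened 28x28 images

encoder = nn.Sequential(                # f: input space -> latent space
    nn.Linear(input_dim, 128), nn.ReLU(),
    nn.Linear(128, code_size),
)
decoder = nn.Sequential(                # g: latent space -> input space
    nn.Linear(code_size, 128), nn.ReLU(),
    nn.Linear(128, input_dim), nn.Sigmoid(),
)
model = nn.Sequential(encoder, decoder)             # model = g(f(x))

loss_fn = nn.MSELoss()                              # reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, input_dim)                       # a dummy batch of inputs
x_hat = model(x)                                    # reconstruction
loss = loss_fn(x_hat, x)                            # compare output with input
loss.backward()
optimizer.step()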
Autoencoders

Reconstruction quality



Autoencoders for representation learning



Autoencoders

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of important properties:

Data-specific: They are only able to meaningfully compress data similar to what they have been trained on. Since they learn features specific to the given training data, they are different from a standard data compression algorithm like gzip.

Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation.

Unsupervised: Autoencoders are considered an unsupervised learning technique since they don't need explicit labels to train on.


Undercomplete autoencoders

Undercomplete autoencoders are defined to have a hidden layer h with a smaller dimension than the input layer.
• The network must model x in a lower-dimensional space and map the latent space accurately back to the input space.
• Encoder network: a function that returns a useful, compressed representation of the input.
• If the network has only linear transformations, the encoder learns PCA. With typical nonlinearities, the network learns a generalized, more powerful version of PCA.

Source: https://cse.iitkgp.ac.in/~sudeshna/courses/DL18/


Architecture

Source: https://towardsdatascience.com/


Training

Four hyperparameters need to be set before training an autoencoder:

Code size: the number of nodes in the middle layer. A smaller size results in more compression.
Number of layers: the autoencoder can be as deep as we like.
Number of nodes per layer: a stacked autoencoder is one where the layers are stacked one after another. Stacked autoencoders usually look like a "sandwich": the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder. The decoder is also symmetric to the encoder in terms of layer structure.
Loss function: we use either mean squared error (MSE) or binary cross-entropy. If the input values are in the range [0, 1] we typically use cross-entropy, otherwise we use mean squared error.
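A minimal sketch of such a "sandwich" stacked autoencoder (again assuming PyTorch; the 784/128/64/32 layer sizes are illustrative, not from the slides), with the loss picked according to the input range:

```python
import torch.nn as nn

input_dim, code_size = 784, 32                   # assumed sizes for illustration

# Encoder: the number of nodes decreases towards the code layer
encoder = nn.Sequential(
    nn.Linear(input_dim, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, code_size),
)
# Decoder: symmetric to the encoder, node counts increase back to the input size
decoder = nn.Sequential(
    nn.Linear(code_size, 64), nn.ReLU(),
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, input_dim), nn.Sigmoid(),     # outputs constrained to [0, 1]
)
stacked_autoencoder = nn.Sequential(encoder, decoder)

# Inputs in [0, 1] -> binary cross-entropy; otherwise -> mean squared error
loss_fn = nn.BCELoss()                           # or nn.MSELoss()
```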



Training

• We can make the autoencoder very powerful by increasing the number of layers, the nodes per layer, and, most importantly, the code size.
• Increasing these hyperparameters lets the autoencoder learn more complex codings.
• But we should be careful not to make it too powerful. Otherwise the autoencoder will simply learn to copy its inputs to the output without learning any meaningful representation; it will just mimic the identity function.
• This is why we prefer a "sandwich" architecture and deliberately keep the code size small.
• Since the coding layer has a lower dimensionality than the input data, the autoencoder is said to be undercomplete. It won't be able to directly copy its inputs to the output, and will be forced to learn intelligent features.


Denoising autoencoders

Another way to force the autoencoder to learn useful features is by adding random noise to its inputs and making it recover the original noise-free data.

• This way the autoencoder can't simply copy the input to its output, because the input also contains random noise.
• We are asking it to subtract the noise and produce the underlying meaningful data.
• This is called a denoising autoencoder.


Example

Source: https://towardsdatascience.com/


Denoising autoencoders

• We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃ given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃), estimated from training pairs (x, x̃), as follows:
• Sample a training example x from the training data.
• Sample a corrupted version x̃ from C(x̃ | x = x).
• Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of the encoder f(x̃) and p_decoder typically defined by a decoder g(h).
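A minimal sketch of one such training step (assuming PyTorch, Gaussian noise as the corruption process C, and illustrative sizes; the clean input x is the reconstruction target):

```python
import torch
import torch.nn as nn

# A small autoencoder reusing the structure of the earlier sketches (assumed sizes)
model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),       # encoder f
    nn.Linear(32, 784), nn.Sigmoid(),    # decoder g
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def corrupt(x, noise_std=0.3):
    """Corruption process C(x_tilde | x): add Gaussian noise (an assumed choice)."""
    return x + noise_std * torch.randn_like(x)

x = torch.rand(64, 784)        # a clean training batch (dummy data)
x_tilde = corrupt(x)           # sample a corrupted version x_tilde from C
x_hat = model(x_tilde)         # reconstruct from the corrupted input
loss = loss_fn(x_hat, x)       # the target is the *clean* x, not x_tilde
loss.backward()
optimizer.step()
```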



Sparse autoencoders

A third method to force the autoencoder to learn useful features: using regularization.
We can regularize the autoencoder with a sparsity constraint such that only a fraction of the nodes have nonzero values, called active nodes.
Add a penalty term to the loss function such that only a fraction of the nodes become active. This forces the autoencoder to represent each input as a combination of a small number of nodes, and demands that it discover interesting structure in the data.
This method works even if the code size is large, since only a small subset of the nodes will be active at any time.
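A minimal sketch of this idea (assuming PyTorch and an L1 penalty on the code activations as the sparsity constraint; sparsity_weight is a hypothetical hyperparameter, and the sizes are illustrative):

```python
import torch
import torch.nn as nn

# Encoder/decoder with a deliberately large code size (assumed sizes)
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
decoder = nn.Sequential(nn.Linear(256, 784), nn.Sigmoid())
sparsity_weight = 1e-3                       # hypothetical regularization strength

x = torch.rand(64, 784)                      # dummy batch
h = encoder(x)                               # code activations
x_hat = decoder(h)

reconstruction_loss = nn.functional.mse_loss(x_hat, x)
sparsity_penalty = h.abs().mean()            # L1 penalty: drives most activations to zero
loss = reconstruction_loss + sparsity_weight * sparsity_penalty
loss.backward()
```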



Applications

Data denoising

Dimensionality reduction

Information retrieval

Content generation (Generative models) - VAEs



Content generation

At first sight, we could be tempted to think that, if the latent space is regular enough (well "organized" by the encoder during the training process), we could take a random point from that latent space and decode it to get new content.
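A minimal sketch of this naive sampling idea (assuming PyTorch and a decoder like the ones sketched earlier; as the next slide argues, a plain autoencoder gives no guarantee that such samples are meaningful):

```python
import torch
import torch.nn as nn

code_size = 32
decoder = nn.Sequential(                      # a decoder g like the earlier sketches
    nn.Linear(code_size, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)

z = torch.randn(1, code_size)    # pick a random point in the latent space
x_new = decoder(z)               # decode it, hoping it yields plausible new content
```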



Content generation

However, the regularity of the latent space for autoencoders is a difficult point that depends on:
• the distribution of the data in the initial space,
• the dimension of the latent space,
• the architecture of the encoder.

So it is pretty difficult (if not impossible) to ensure, a priori, that the encoder will organize the latent space in a smart way compatible with the generative process.
The autoencoder is trained solely to encode and decode with as little loss as possible, no matter how the latent space is organised.
Thus, it is natural that, during training, the network takes advantage of any overfitting possibilities to achieve its task as well as it can.
