5th Unit DL Final Class Notes
Course Outcomes:
Ability to understand the concepts of Neural Networks
Ability to select the Learning Networks in modeling real world systems
Ability to use an efficient algorithm for Deep Models
Ability to apply optimization strategies for large scale applications
UNIT-I
Artificial Neural Networks Introduction, Basic models of ANN, important terminologies, Supervised
Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network.
Associative Memory Networks. Training Algorithms for pattern association, BAM and Hopfield
Networks.
UNIT-II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming
Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation
Networks, Adaptive Resonance Theory Networks. Special Networks-Introduction to various networks.
UNIT - III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed - forward networks,
Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation and Other
Differentiation Algorithms
UNIT - IV
Regularization for Deep Learning: Parameter norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised learning, Multi-task learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop, and Manifold Tangent Classifier
UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms,
Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-
Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language
Processing
TEXT BOOKS:
1. Deep Learning: An MIT Press Book, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
2. Neural Networks and Learning Machines, Simon Haykin, 3rd Edition, Pearson Prentice Hall.
This Neural Network & Deep Learning 5 Units Material is Prepared by Dr K Madan Mohan, Associate Professor, Department of CSE (AIML), Sreyas Institute of Engineering and Technology, Bandlaguda, Nagole, Hyderabad.
1. Neural network training is the most challenging optimization problem in deep learning
because of its complexity and computational cost.
2. Gradient-based optimization techniques are commonly used to solve the neural network
training problem, but the difficulty of the problem has motivated specialized methods.
3. The goal of neural network training is to find the parameters θ that significantly reduce a
cost function J(θ).
4. Optimization used as a training algorithm for machine learning differs from pure
optimization because it acts indirectly: we minimize a cost defined on the training set in the
hope of improving a separate quantity, the generalization error.
5. Challenges that make optimization of neural networks difficult include non-convexity, high
dimensionality, and vanishing/exploding gradients.
6. Practical algorithms for neural network training include basic optimization algorithms like
gradient descent and more advanced techniques that use adaptive learning rates or second-
derivative information.
7. Initialization strategies for neural network parameters are also important to ensure
convergence to a good local minimum.
8. Higher-level optimization strategies combine simple algorithms into more complex
procedures for improved performance, such as minibatching and momentum techniques.
Optimization for Training Deep Models
Topic-1: Introduction to Optimization for Training Deep Models
Deep learning algorithms require optimization in various scenarios. For instance, performing
inference in models such as PCA involves solving an optimization problem.
We often use analytical optimization to create proofs or design algorithms.
Among the different optimization problems in deep learning, neural network training is considered
the most challenging.
It typically requires significant time and computational resources, with days to months of
investment on multiple machines to solve a single instance.
Due to the importance and cost involved, specialized optimization techniques have been developed
specifically for neural network training.
This chapter introduces these techniques for optimizing neural network training.
It assumes a basic understanding of gradient-based optimization principles, which are briefly
covered in Chapter 4.
The goal is to find the parameters (θ) of a neural network that minimize a cost function (J(θ)).
This cost function typically includes a performance measure evaluated on the entire training set and
regularization terms.
Training a neural network involves specific challenges compared to traditional optimization.
The chapter introduces practical algorithms for optimization, including initialization strategies and
algorithms that adapt learning rates or utilize second derivatives of the cost function.
It also discusses higher-level procedures formed by combining simple optimization algorithms.
Typically, the cost function can be written as an average over the training set, such as
J(θ) = E(x,y)~p̂data L(f(x; θ), y) --------------------- (8.1)
In the supervised learning case, L is the per-example loss function that measures the
discrepancy between the predicted output f(x; θ) and the target output y for a given input x.
The objective function (equation 8.1) is defined in terms of this loss function, the empirical
distribution p̂data, and the training set.
However, it is possible to extend this framework to include additional arguments such as θ or x, or
exclude y, to develop regularization or unsupervised learning approaches.
Ideally, we aim to minimize the objective function by taking the expectation across the entire
data-generating distribution pdata rather than just the finite training set.
The objective of a machine learning algorithm is to reduce the expected generalization error,
also known as the risk:
J∗(θ) = E(x,y)~pdata L(f(x; θ), y) --------------------- (8.2)
This expectation is taken over the true underlying data-generating distribution pdata. If we knew
this distribution, risk minimization would be an ordinary optimization task. However, in machine
learning, we typically do not have access to pdata but only to a training set.
To convert the machine learning problem back into an optimization problem, we minimize the
expected loss on the training set.
This involves replacing the true distribution p(x, y) with the empirical distribution p̂(x, y)
defined by the training set.
In simpler terms, we minimize the empirical risk
E(x,y)~p̂data [L(f(x; θ), y)] = 1/m Σi=1..m L(f(x(i); θ), y(i)),
where m is the number of training examples.
In deep learning, training often halts while the surrogate loss function still has large derivatives,
for example when an early stopping criterion is triggered. This is unlike pure optimization, where
convergence is declared only when the gradient becomes very small.
8.1.3 Batch and Minibatch Algorithms
In machine learning algorithms, the objective function is often a sum over the training
examples.
This is different from general optimization algorithms. Machine learning optimization computes
parameter updates using only a subset of the terms from the full cost function.
For instance, maximum likelihood estimation problems decompose into a sum over each example
when viewed in log space.
Computing the expectation exactly in machine learning algorithms can be expensive as it involves
evaluating the model on every example in the dataset.
Instead, we can compute these expectations by randomly sampling a small number of examples and
averaging over those samples.
The standard error of the mean, estimated from a number of samples, decreases less than linearly
as the number of examples increases.
For instance, using 10,000 examples decreases the standard error by only a factor of 10 compared
to using 100 examples, even though it requires 100 times more computation.
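To make the "less than linearly" claim concrete, the standard error of a mean estimated from m independent samples is σ/√m (this is standard statistics; the short derivation below is added here, not taken from the notes):

\[
\mathrm{SE}(\hat{\mu}_m) = \frac{\sigma}{\sqrt{m}}, \qquad
\frac{\mathrm{SE}(\hat{\mu}_{100})}{\mathrm{SE}(\hat{\mu}_{10000})} = \sqrt{\frac{10000}{100}} = 10 .
\]

So 100 times more examples cost 100 times more computation but reduce the standard error of the gradient estimate by only a factor of 10, which is the key argument for small minibatches.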
Optimization algorithms converge faster if they can quickly approximate the gradient instead of
computing it slowly and exactly.
When learning from a small number of samples, it's beneficial to estimate the
gradient statistically instead of using all samples.
This is because some training sets may have redundant samples, which can be
computationally expensive to process using the naive approach.
In the worst case, all samples could be identical copies, but this is unlikely.
However, large numbers of similar examples can still occur.
Batch gradient methods process all training examples simultaneously, but this
can be slow for large datasets.
Minibatch stochastic gradient descent uses smaller groups of examples, which
is faster and more efficient.
The term "batch" is sometimes used to describe both the full training set and
a group of examples, so it's important to clarify which meaning is being used.
Stochastic algorithms in deep learning use a small group of examples at a time
instead of all of them.
These minibatch (or minibatch stochastic) methods are distinct from online methods,
which use just one example at a time, typically drawn from a stream, and from batch
methods, which use the entire fixed-size training set.
Methods that use more than one but fewer than all the examples were traditionally
called minibatch methods; they fall somewhere in between the two, and it is now
common to call them simply stochastic methods.
These methods are commonly used in deep learning because they provide a
balance between the computational efficiency of online methods and the
accuracy of using all examples at once.
When using stochastic methods like stochastic gradient descent, the size of the
batches used can affect performance.
Larger batches provide more accurate gradient estimates, but smaller batches
can be faster on multicore architectures and may offer a regularizing effect.
The amount of memory required to process the batch can also limit its size, and
power-of-2 batch sizes may offer better runtime on certain hardware.
However, very small batch sizes, such as 1, can require a smaller learning rate to
maintain stability because of the high variance in the gradient estimate, and the
resulting need for more steps can increase the total runtime.
Some algorithms use information from the minibatch differently and can
handle smaller batch sizes, while others require larger batch sizes due to
sensitivity to sampling error.
Second-order methods that use the Hessian matrix can be more sensitive and
require larger batch sizes to minimize fluctuations in estimates.
Estimation errors in the gradient can amplify with multiplication by the
Hessian or its inverse, especially if the Hessian has a poor condition number.
This can cause large changes in the update, even with a perfectly estimated
Hessian. Overall, larger batch sizes are needed to minimize errors in second-
order methods.
When training a neural network, it's important to select minibatches randomly
to get an unbiased estimate of the gradient.
This requires the samples to be independent, so two subsequent gradient
estimates should also be independent. Some datasets may have correlated
examples, like a list of medical test results for multiple patients.
To avoid bias, it's necessary to shuffle the examples before selecting
minibatches, especially for large datasets where sampling uniformly at
random may be impractical.
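A minimal NumPy sketch of shuffling once and then iterating over minibatches; the array names and batch size are illustrative assumptions, not from the notes:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield minibatches after one random shuffle, so consecutive
    examples (e.g., records from the same patient) are decorrelated."""
    perm = rng.permutation(len(X))     # one random ordering of indices
    X, y = X[perm], y[perm]            # apply the same order to inputs and targets
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # toy dataset: 1000 examples, 20 features
y = rng.integers(0, 2, size=1000)
for xb, yb in minibatches(X, y, batch_size=64, rng=rng):
    pass                               # compute the gradient estimate on (xb, yb) here
```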
When training a machine learning model, it's often enough to shuffle the data
once and keep it in that order.
This fixes the order of consecutive examples used in each batch, but doesn't
significantly harm the model's performance.
Not shuffling the data at all can reduce effectiveness.
In some cases, we can compute updates for multiple batches simultaneously
because the optimization problem breaks down into separate updates for each
example.
This allows for parallel and distributed training, as discussed further in section
12.1.3.
Minibatch stochastic gradient descent follows the true generalization error
gradient when no examples are repeated.
This is because each minibatch provides an unbiased estimate of the error on
the entire dataset during the first pass.
However, on subsequent passes, the estimate becomes biased due to the re-
sampling of previously used examples.
In online learning, where examples are drawn from a stream, every experience
is a fair sample from the data-generating distribution, and there are no repeated
examples.
This makes it easier to see that stochastic gradient descent minimizes
generalization error in this scenario.
The equivalence is easiest to derive when both x and y are discrete.
In this case, the generalization error (equation 8.2) can be written as a sum
J∗(θ) = Σx Σy pdata(x, y) L(f(x; θ), y),
with the exact gradient g = Σx Σy pdata(x, y) ∇θ L(f(x; θ), y).
The fact that the gradient of the log-likelihood can be estimated by sampling
from the data distribution and computing the gradient for a minibatch was
previously shown.
This holds for other functions L as well, as long as certain mild assumptions
are met for continuous variables.
Therefore, we can obtain an unbiased estimator of the exact gradient of the
generalization error by sampling a minibatch from the data distribution and
computing the gradient of the loss for that batch.
6. Local vs. global minima: Due to the non-convex nature of neural network optimization, it
is challenging to distinguish between local and global minima. Sometimes, the optimization
process might get stuck in a poor local minimum instead of reaching the global minimum,
affecting the network's performance.
7. Vanishing and exploding gradients: Optimization can be hindered by the vanishing or
exploding gradient problem, where gradients either become too small or too large during
backpropagation. This issue can make it difficult for the network to converge to an optimal
solution.
8. Hyperparameter selection: Neural networks have various hyperparameters that need to be
tuned, such as learning rate, batch size, and regularization parameters. Identifying the optimal
values for these hyperparameters requires careful experimentation and can be a challenging
optimization problem itself.
Figure 8.1: Gradient descent often does not arrive at a critical point of
any kind. In this example, the gradient norm increases throughout the
training of a convolutional network used for object detection.
(Left) A scatterplot showing how the norms of individual gradient evaluations are
distributed over time. To improve legibility, only one gradient norm is plotted per epoch.
1. Local minima and maxima in high-dimensional non-convex functions are rare compared to
saddle points, which are points with zero gradient.
2. Saddle points have a mix of positive and negative eigenvalues in the Hessian matrix, and
points along these eigenvalues can have higher or lower costs than the saddle point.
3. In low-dimensional spaces, local minima are common, but in higher-dimensional spaces,
saddle points are more common.
4. The expected ratio of saddle points to local minima grows exponentially with the
dimensionality of the function.
5. The Hessian matrix at a local minimum only has positive eigenvalues, while a saddle point
has a mixture of positive and negative eigenvalues.
6. Local minima are more likely to have low costs, while critical points with high costs are
more likely to be saddle points or local maxima.
7. Shallow autoencoders without nonlinearities have global minima and saddle points, but no
local minima with higher cost than the global minimum.
8. Real neural networks also have loss functions that contain many high-cost saddle points.
9. The implications of saddle points for training algorithms, especially those using gradient
information, are unclear.
10. Gradient descent can sometimes escape saddle points, even though the gradient can become
small near them.
11. Continuous-time gradient descent may be repelled from nearby saddle points, but the
situation may differ for more realistic uses.
12. Newton's method is affected by saddle points and faces challenges when encountering
them.
13. Gradient descent: It is a method that moves in the direction of decreasing values to find
the minimum point of a function. It doesn't explicitly aim to find a critical point.
14. Newton's method: It is a method that tries to find a point where the gradient of a function
is zero. However, it can sometimes jump to a saddle point if not modified properly.
15. Saddle-free Newton method: Introduced by Dauphin et al. in 2014, it is a modified version
of Newton's method that addresses the issue of jumping to saddle points. It has shown
improvements over the traditional version.
16. Difficulty in scaling second-order methods: Second-order methods, like Newton's
method, remain challenging to apply to large neural networks, partly because of their
computational cost and partly because of the proliferation of saddle points in
high-dimensional spaces.
17. Types of points with zero gradient: Besides minima and saddle points, there are also
maxima. Maxima are similar to saddle points in terms of optimization, as many algorithms
are not attracted to them, except unmodified Newton's method.
18. Rarity of maxima and minima in high-dimensional space: Maxima and minima of
random functions become exponentially rare as the dimensionality increases.
19. Wide, flat regions of constant value: In these regions, both the gradient and the Hessian
(second derivative) of the function are zero. Such locations pose difficulties for numerical
optimization algorithms.
20. Wide, flat regions in convex and general optimization problems: In a convex problem,
a wide, flat region consists entirely of global minima. However, in a general optimization
problem, such a region could correspond to a high value of the objective function.
2. Surprisingly, these visualizations do not typically show many obvious obstacles or complex
structures in the cost function.
3. Before 2012, it was believed that neural net cost functions had more non-convex structures
than what these projections reveal.
4. The main obstacle shown in this visualization is a saddle point with high cost near the initial
parameters.
5. However, stochastic gradient descent (SGD) training easily escapes this saddle point.
6. Most of the training time is spent in the relatively flat valley of the cost function.
7. The flatness of the valley could be due to noisy gradients, poorly conditioned Hessian
matrix, or the need to take an indirect path to avoid a tall "mountain" in the figure.
8.2.4 Cliffs and Exploding Gradients
1. Neural networks with many layers can have steep regions resembling cliffs, caused by the
multiplication of large weights.
2. These cliffs can be problematic because the gradient update step can move the parameters
too far, potentially jumping off the cliff structure altogether.
3. The objective function for highly nonlinear deep neural networks or recurrent neural
networks often contains sharp nonlinearities in parameter space, resulting from the
multiplication of several parameters.
4. These nonlinearities lead to high derivatives in certain places, making the optimization
process challenging.
5. Approaching the cliff structure from above or below can be dangerous, as it can result in
losing most of the optimization progress.
6. However, the gradient clipping heuristic, described in section 10.11.1, can help mitigate the
consequences of cliffs.
7. The gradient clipping heuristic intervenes when the traditional gradient descent algorithm
proposes a very large step, reducing the step size so the update stays within the region
where the gradient indicates the direction of approximately steepest descent (see the
sketch after this list).
8. Cliff structures are particularly common in cost functions for recurrent neural networks due
to the multiplication of many factors, especially in long temporal sequences.
8.2.5 Long-Term Dependencies
7. Recurrent networks apply the same weight matrix at each time step, whereas
feedforward networks use different weights in each layer; this lets feedforward
networks largely avoid the vanishing and exploding gradient problem that repeated
multiplication causes in recurrent networks.
8. Further discussion on the challenges of training recurrent networks
will be deferred until section 10.7, after recurrent networks have been
described in more detail.
8.2.6 Inexact Gradients
1. Most optimization algorithms assume that we have access to the exact gradient or Hessian
matrix, but in practice, we often only have noisy or biased estimates of these quantities.
2. Deep learning algorithms typically use a minibatch of training examples to compute an
approximate gradient.
3. In some cases, the objective function we want to minimize is intractable, which means its
gradient is also intractable. In these situations, we can only approximate the gradient.
4. Advanced models, especially in part III, can encounter these issues more frequently.
5. Contrastive divergence is a technique used to approximate the gradient of the intractable log-
likelihood of a Boltzmann machine.
6. Neural network optimization algorithms are designed to handle imperfections in the gradient
estimate.
7. Another approach is to choose a surrogate loss function that is easier to approximate than
the true loss function.
8.2.7 Poor Correspondence between Local and Global Structure
1. Problems with the loss function: The loss function can be poorly conditioned, contain cliffs
or saddle points, making it difficult to make progress in optimization.
2. Poor correspondence between local and global structure: Even if the problems at a single
point are overcome, if the direction of improvement locally does not point towards regions of
lower cost globally, the performance can still be poor.
Figure 8.4: Optimization based on local downhill moves can fail if the local
surface does not point toward the global solution. Here we provide an example
of how this can occur, even if there are no saddle points and no local minima.
Key points from the above figure:
1. Optimization based on local downhill moves can fail if the local surface does not point
towards the global solution.
2. Even without saddle points or local minima, a cost function with only asymptotes towards
low values can cause difficulties.
3. The main problem arises when starting on the wrong side of a "mountain" and being unable
to cross it.
4. In higher-dimensional space, learning algorithms can sometimes go around such mountains,
but this may lead to long trajectories and excessive training time, as shown in figure 8.2.
3. Length of the learning trajectory: Training time is often determined by the length of the
trajectory needed to reach the solution. Figure 8.2 illustrates that the trajectory often follows a
wide arc around a mountain-shaped structure.
4. Neural networks don't reach critical points: In practice, neural networks do not arrive at
a critical point, such as a global or local minimum or a saddle point.
5. Lack of small gradient regions: Figure 8.1 shows that neural networks often do not reach
regions of small gradient. Critical points may not exist, and the loss function may
asymptotically approach a certain value as the model becomes more confident.
6. Failure of local optimization: Local optimization may fail to find a good cost function
value, even without local minima or saddle points. Figure 8.4 provides an example of this.
7. Research focus: Current research aims to find good initial points for problems with difficult
global structure, rather than developing algorithms that use non-local moves.
8. Learning algorithms based on local moves: Gradient descent and most effective learning
algorithms for neural networks rely on making small, local moves. However, computing the
correct direction of these moves can be challenging.
9. Approximations and limitations: In some cases, we can only approximate properties of
the objective function, like its gradient, with bias or variance. This can lead to issues like poor
conditioning or discontinuous gradients, making the region where the gradient is reliable very
small.
10. High computational cost: In cases where the objective function has issues and the local
descent direction can only be computed with small steps, following the path of local descent
incurs a high computational cost due to the large number of steps involved.
3rd Topic: Basic Algorithms
I. Stochastic Gradient Descent (SGD):
Algorithm 8.1 Stochastic gradient descent (SGD)
Step-1: Require: Learning rate ε.
Step-2: Require: Initial parameter θ.
Step-3: while stopping criterion not met do
Step-4: Sample a minibatch of m examples from the training set {x(1), . . . ,x(m)} with
corresponding targets y(i).
Step-5: Compute gradient estimate: g ← 1/m ∇θ Σi L(f(x(i); θ), y(i))
Step-6: Apply update: θ ← θ − εg
Step-7: end while
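A minimal NumPy sketch of this loop on a toy linear least-squares problem; the model, data, and hyperparameters are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

theta = np.zeros(5)                            # Step-2: initial parameter
eps, m = 0.05, 64                              # Step-1: learning rate; minibatch size

for step in range(500):                        # Step-3: stop after a fixed number of steps
    idx = rng.choice(len(X), size=m, replace=False)  # Step-4: sample a minibatch
    xb, yb = X[idx], y[idx]
    g = (2.0 / m) * xb.T @ (xb @ theta - yb)   # Step-5: gradient of mean squared error
    theta -= eps * g                           # Step-6: apply update

print(np.round(theta, 2))                      # close to true_w
```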
II. Momentum:
1. Stochastic gradient descent is a common optimization strategy, but it can be slow in
certain cases. Momentum is a technique designed to accelerate learning, especially for high
curvature, small gradients, or noisy data.
2. Momentum works by keeping track of past gradients and moving in their direction with
an exponentially decaying moving average. This helps to overcome the issue of slow
convergence in these cases.
3. In the momentum algorithm, a velocity variable v is introduced to represent the direction
and speed at which parameters move through parameter space. The velocity is set to an
exponentially decaying average of the negative gradient, which can be thought of as a force
moving a particle through parameter space according to Newton's laws of motion. The
velocity vector v represents the momentum of the particle in this physical analogy.
4. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous
gradients exponentially decay. The update rule is given by:
v ← αv − ε ∇θ (1/m Σi L(f(x(i); θ), y(i)))
θ ← θ + v
The size of the step in the momentum algorithm depends on the direction of multiple gradients.
When many gradients point in the same direction, the step size is largest.
If the algorithm always sees a gradient g, it will keep moving in the opposite direction until
reaching a maximum speed of
ε||g||/(1 - α) (8.17)
The hyperparameter α determines the factor by which the maximum speed exceeds that of
plain gradient descent; for example, α = 0.9 multiplies the maximum speed by 10. Common
values for α are .5, .9, and .99. Like the learning rate, α can be adapted over time, but this is
less important than decreasing the learning rate over time.
Algorithm 8.2 Stochastic Gradient Descent (SGD) with momentum
Step-1: Require Learning rate ε, momentum parameter α.
Step-2: Require: Initial parameter θ, initial velocity v.
Step-3: while stopping criterion not met do
Step-4: Sample a minibatch of m examples from the training set {x(1), . . . ,x(m)} with
corresponding targets y(i).
Step-5: Compute gradient estimate: g ← 1/m ∇θ Σi L(f(x(i); θ), y(i))
Step-6: Compute velocity update: v ← αv − εg
Step-7: Apply update: θ ← θ + v
Step-8: end while
Explanation of Steps
The SGD with momentum algorithm tries to find the best parameter values for a machine
learning model by adjusting them based on the gradient of the loss function. Here's a simple
explanation of each step:
1. Learning rate (ε) and momentum parameter (α) are required. These determine how much the
parameters will be adjusted and how much influence previous adjustments have on the current
one.
2. Initial parameters (θ) and initial velocity (v) are needed to start the optimization process.
3. A batch of examples is selected from the training data, and their gradients are calculated to
estimate the gradient of the loss function.
4. The velocity is updated based on the momentum parameter and the new gradient estimate.
The velocity is a running average of past gradients, and it helps to smooth out noisy gradients
and improve convergence speed.
5. The parameters are updated by adding the velocity. This moves the parameters roughly in
the direction of steepest descent, but with some influence from previous steps carried in the
velocity term.
6. This process continues until a stopping criterion is met, such as reaching a certain level of
accuracy or stopping after a certain number of iterations.
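A minimal NumPy sketch of the velocity and parameter updates from Algorithm 8.2; the toy objective, α, and ε are illustrative assumptions:

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, eps=0.01, alpha=0.9):
    """One step of SGD with momentum:
    v <- alpha*v - eps*grad, then theta <- theta + v."""
    v = alpha * v - eps * grad
    theta = theta + v
    return theta, v

theta = np.zeros(3)
v = np.zeros_like(theta)
for _ in range(300):
    grad = 2 * theta - 4          # gradient of sum((theta - 2)**2)
    theta, v = sgd_momentum_step(theta, v, grad)
print(np.round(theta, 3))          # converges to [2, 2, 2]
```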
III. Nesterov Momentum:
3. The main difference is that Nesterov momentum evaluates the gradient after applying the current
velocity, while standard momentum evaluates it beforehand.
4. This can be seen as adding a correction factor to the standard momentum method.
Algorithm 8.3 SGD with Nesterov momentum
Step-1: Require: Learning rate ε, momentum parameter α.
Step-2: Require: Initial parameter θ, initial velocity v.
Step-3: while stopping criterion not met do
Step-4: Sample a minibatch of m examples from the training set {x(1), . . . ,x(m)} with
corresponding targets y(i).
Step-5: Apply interim update: θ˜ ← θ + αv
Step-6: Compute gradient (at interim point): g ← 1/m ∇θ˜ Σi L(f(x(i); θ˜), y(i))
Step-7: Compute velocity update: v ← αv − εg
Step-8: Apply update: θ ← θ + v
Step-9: end while
This algorithm is used to find the minimum value of a function in machine learning by
iteratively adjusting the parameter values based on the gradient and velocity calculated
from a small group of examples.
The momentum parameter helps to improve the convergence rate by adding some weight
to previous updates, while the learning rate controls how much each update affects the
parameter values.
The stopping criterion can be based on a certain number of iterations, a predefined accuracy,
or other criteria depending on the specific problem being solved.
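A minimal NumPy sketch of one step of Algorithm 8.3; grad_fn and all constants are illustrative assumptions:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, eps=0.01, alpha=0.9):
    """One Nesterov momentum step: evaluate the gradient at the
    interim point theta + alpha*v, then update velocity and parameters."""
    interim = theta + alpha * v          # Step-5: look ahead
    g = grad_fn(interim)                 # Step-6: gradient at interim point
    v = alpha * v - eps * g              # Step-7: velocity update
    return theta + v, v                  # Step-8: apply update

theta, v = np.zeros(3), np.zeros(3)
grad_fn = lambda t: 2 * t - 4            # gradient of sum((t - 2)**2)
for _ in range(300):
    theta, v = nesterov_step(theta, v, grad_fn)
print(np.round(theta, 3))                # converges to [2, 2, 2]
```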
4th Topic: Parameter Initialization Strategies
10. The ideal initial weight scale balances the benefits of stronger symmetry breaking and
larger outputs with the risks of exploding gradients and saturated activations. Gradient clipping
can help mitigate the exploding gradient problem.
1. Optimization Perspective:
Improve the efficiency and reduce cost
Suggests weights should be large for successful information propagation.
Favors the use of optimization algorithms like stochastic gradient descent (SGD), which
prefers small incremental changes and tends to converge to areas close to initial
parameters.
2. Regularization Concerns:
Prevent Overfitting and Improving Accuracy
Advocate for smaller weights to address overfitting concerns.
Implies a preference for avoiding overly large weight values.
3. Connection to Early Stopping:
Early stopping monitors the performance of the model during training.
Optimization with early stopping is akin to weight decay in some cases.
Early stopping in gradient descent expresses a preference for final parameters to be close
to initial parameters.
4. Gaussian Prior Analogy:
Gaussian prior has bell-shaped curve centered at Zero.
Initializing parameters (θ) is similar to imposing a Gaussian prior with mean θ0.
Choosing θ0 close to 0 implies a prior where units are more likely not to interact unless the
objective function strongly prefers interaction.
5. Impact of Initialization:
Initializing weights with the correct variance and correctly weighting residual modules.
Initializing θ0 to large values implies a prior specifying unit interaction and their nature.
θ0 near 0 suggests a prior where units are more likely not to interact unless compelled by
the objective function.
The interplay between optimization and regularization suggests a careful consideration of
weight initialization, balancing the need for information propagation with the desire to
prevent overfitting.
The choice of initial parameters is likened to expressing a prior belief about unit
interactions, influencing how units should or should not interact based on the problem at
hand.
6. Weight Initialization Heuristics:
An important design consideration in neural networks and deep learning.
Use a common heuristic by initializing weights in a fully connected layer with m inputs
and n outputs by sampling from U(−1/√m, 1/√m).
Glorot and Bengio (2010) propose the normalized initialization
W ~ U(−√(6/(m + n)), √(6/(m + n))) as a compromise between the goals of equal
activation variance and equal gradient variance (see the sketch after this list).
• Saxe et al. suggest initializing to random orthogonal matrices with a scaling factor,
ensuring convergence independence of depth.
• Proper gain factor tuning can allow training very deep networks without vanishing or
exploding gradients.
7. Optimal Criteria Challenges:
• Theoretical predictions for optimal initial weights may not lead to optimal performance.
• Possible reasons: wrong criteria, properties not persisting during learning, or
unintentional increase in generalization error.
8. Weight Scaling Challenges:
Scaling rules setting all initial weights to the same standard deviation, like 1/√m, can
result in extremely small individual weights for large layers.
Sparse initialization, introduced by Martens (2010), initializes each unit with k non-
zero weights to maintain diversity but can cause issues.
9. Hyperparameter Considerations:
Treat initial weight scale as a hyperparameter; choose using a search algorithm (e.g.,
random search) based on activation or gradient range.
Dense or sparse initialization can be a hyperparameter choice.
10. Practical Approach:
Manually or through automated methods, identify layers with small activations and
increase their weights iteratively for better initial activations.
Consider gradient range and standard deviation if learning is still slow.
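A minimal NumPy sketch of both initialization heuristics for a fully connected layer with m inputs and n outputs; the function names are illustrative assumptions:

```python
import numpy as np

def uniform_heuristic_init(m, n, rng):
    """Common heuristic: W ~ U(-1/sqrt(m), 1/sqrt(m))."""
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))

def glorot_init(m, n, rng):
    """Glorot/Bengio normalized initialization:
    W ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

rng = np.random.default_rng(0)
W = glorot_init(256, 128, rng)     # weights for a 256 -> 128 layer
print(W.std())                      # roughly sqrt(2/(m+n)) ≈ 0.072
```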
Identify layers with small activations, increase weights iteratively for better initial
activations.
If learning is slow, examine gradient range and standard deviation.
20. Formalization by Mishkin and Matas (2015):
Mishkin and Matas (2015) formalize and study the protocol for weight initialization.
The focus so far has been on weight initialization, but other parameters are generally easier
to initialize.
21. Another Common Type of Parameter:
a) A variance or precision parameter is a common type of parameter, used for example in
linear regression.
b) Instead of estimating the conditional variance directly, we can use a precision parameter β
in the model:
p(y | x) = N(y; wᵀx + b, 1/β) --------------------- (8.24)
c) Precision parameters can usually be initialized to 1 safely.
d) Another approach is to assume the initial weights are close to zero, set the biases based on
the correct marginal mean, and set the variance parameters to the marginal variance of the
output in the training set.
5th Topic: Algorithms with Adaptive Learning Rates
1. Learning rate is a difficult hyperparameter to set in neural network training because it
has a big impact on the model's performance.
2. The cost function can be very sensitive to some directions in parameter space and less
sensitive to others, making learning difficult.
3. The momentum algorithm helps with this, but introduces another hyperparameter.
4. Adapting individual learning rates for each parameter during training can make sense if
the directions of sensitivity are axis-aligned.
5. The delta-bar-delta algorithm is an early heuristic approach to adapting individual
learning rates, based on the idea that if the partial derivative of the loss with respect to a
parameter stays the same sign, the learning rate should increase, and if it changes sign,
the learning rate should decrease.
6. More recent incremental methods adapt learning rates for model parameters during
mini-batch training.
7. These methods are necessary because full batch optimization is not practical for large
datasets.
AdaGrad
1. AdaGrad is an algorithm that adapts the learning rate of each model parameter
individually, based on the accumulated history of that parameter's squared gradients.
It makes the learning rates smaller for parameters whose gradients have been large
and keeps them larger for parameters whose gradients have been small.
2. This means that parameters with larger partial derivatives (which indicate a stronger
influence on the loss function) will have a faster decrease in learning rate, while
parameters with smaller partial derivatives will have a slower decrease.
3. This results in faster progress in the less steep directions of parameter space, which can
be helpful for finding the optimal solution more efficiently.
4. In theory, AdaGrad has some nice properties for convex optimization problems, but in
practice, it can lead to a premature and excessive decrease in the effective learning rate
when training deep neural network models due to the accumulation of squared gradients
over time.
5. Overall, AdaGrad works well for some deep learning models but may not be the best
choice for all of them.
RMSProp: (Root Mean Square Propagation)
1) The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-
convex setting by changing the gradient accumulation into an exponentially weighted
moving average.
2) AdaGrad is designed to converge rapidly when applied to a convex function.
3) When applied to a non-convex function to train a neural network, the learning
trajectory may pass through many different structures and eventually arrive at a region
that is a locally convex bowl.
4) AdaGrad shrinks the learning rate according to the entire history of the squared
gradient and may have made the learning rate too small before arriving at such a
convex structure.
5) RMSProp uses an exponentially decaying average to discard history from the extreme
past, so that it can converge rapidly after finding a convex bowl.
6) It then behaves as if it were an instance of the AdaGrad algorithm initialized within
that bowl.
7) RMSProp is shown in its standard form in algorithm 8.5 and combined with Nesterov
momentum in algorithm 8.6.
8) Compared to AdaGrad, the use of the moving average introduces a new
hyperparameter, ρ, that controls the length scale of the moving average.
9) Empirically, RMSProp has been shown to be an effective and practical optimization
algorithm for deep neural networks.
10) It is currently one of the go-to optimization methods being employed routinely by deep
learning practitioners.
11) RMSProp (root mean square propagation) is an optimization algorithm for training
artificial neural networks (ANNs) with adaptive per-parameter learning rates, derived
from the concepts of gradient descent and RProp.
12) RMSProp was designed to address some of the issues encountered with the SGD
method in training deep neural networks.
13) RMSProp is designed to accelerate the optimization process.
14) For example, it can decrease the number of function evaluations required to reach the
optimum, or improve the optimization algorithm's ability to deliver a better result.
15) Root Mean Square Propagation is an adaptive learning-rate algorithm that tries to
improve on AdaGrad.
RMSProp is a popular optimization algorithm used in deep learning that has several
advantages:
a) It handles gradients efficiently.
b) RMSProp is well suited for deep learning problems.
c) Each weight receives its own effective learning rate, so different parameters are
updated at different scales in each iteration.
d) RMSProp also reduces the need to hand-tune the learning rate, adapting it automatically.
Algorithm 8.4 The AdaGrad algorithm
Step-1: Require: Global learning rate ε
Step-2: Require: Initial parameter θ
Step-3: Require: Small constant δ, perhaps 10−7, for numerical stability; initialize gradient
accumulation variable r = 0
Step-4: while stopping criterion not met do
Step-5: Sample a minibatch of m examples from the training set {x(1) , . . . ,x(m)} with
corresponding targets y(i).
Step-6: Compute gradient: g ← 1/m∇θ Σi L(f(x(i) ; θ), y(i))
Step-7: Accumulate squared gradient: r ← r + g ⊙ g
Step-8: Compute update: ∆θ ← −(ε/(δ + √r)) ⊙ g (division and square root applied
element-wise)
Step-9: Apply update: θ ← θ + ∆θ
Step-10: end while
Explanation of the above algorithm steps:
Step 1: We need a global learning rate (ε) and an initial parameter value (θ).
Step 2: We have an initial parameter value (θ) that we want to adjust based on the training
data.
Step 3: We initialize a variable called r to 0, which we'll use to accumulate squared
gradients. We also set a small constant (δ) for numerical stability, which is typically set to
10^-7.
Step 4: We enter a loop that will continue until some stopping criterion is met. This might
be when we've trained for a certain number of iterations, or when the loss function has
converged to a certain value.
Step 5: We sample a batch of m examples from the training set, along with their
corresponding targets.
Step 6: We compute the gradient of the loss function with respect to our parameter (θ)
using these examples. This gives us a vector g that represents how much each parameter
should be adjusted to reduce the loss function.
Step 7: We accumulate the squared gradient in our r variable. This helps us to adjust the
learning rate based on the history of gradients we've seen so far.
Step 8: We compute the update for our parameter using the formula -ε / (δ + sqrt(r)) * g.
This formula adjusts the learning rate based on both the gradient and the history of
squared gradients we've seen so far. The division and square root are applied element-
wise to each component of g and r.
Step 9: We apply the update to our parameter value (θ).
Step 10: The loop continues until some stopping criterion is met.
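A minimal NumPy sketch of the AdaGrad update from Algorithm 8.4 on an illustrative quadratic objective; the names and constants are assumptions:

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.1, delta=1e-7):
    """AdaGrad: accumulate squared gradients in r, then scale each
    parameter's step by 1/(delta + sqrt(r)) element-wise (Steps 7-9)."""
    r = r + grad * grad                                   # Step-7: accumulate
    theta = theta - (eps / (delta + np.sqrt(r))) * grad   # Steps 8-9: update
    return theta, r

theta = np.array([5.0, -3.0])
r = np.zeros_like(theta)
for _ in range(200):
    grad = 2 * theta            # gradient of theta_0**2 + theta_1**2
    theta, r = adagrad_step(theta, r, grad)
print(np.round(theta, 2))       # moves toward 0, but ever more slowly,
                                # since r only grows and keeps shrinking the step
```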
Adam:
1. Adam is an adaptive learning-rate optimization algorithm for machine learning models.
Its name derives from "adaptive moments": it combines adaptive per-parameter learning
rates, based on estimates of moments of the gradient, with momentum, which helps the
model converge faster.
2. Adam is similar to other algorithms like RMSProp, but it has a few key differences. For
example, in Adam, momentum is calculated using an estimate of the first-order moment
(the gradient) with exponential weighting, whereas in RMSProp, momentum is applied to
the rescaled gradients.
3. Another difference is that Adam includes bias corrections to account for the fact that
the estimates of both the first-order and second-order moments are initialized at zero.
This helps to reduce the high bias that can occur early in training with RMSProp.
4. Overall, Adam is considered to be a robust algorithm that doesn't require a lot of tuning
of hyperparameters, although the learning rate may need to be adjusted from its default
value.
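A minimal NumPy sketch of the Adam update with the commonly used default constants; the defaults and names are assumptions, since the notes do not spell them out:

```python
import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.01,
              rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step: exponentially weighted first (s) and second (r)
    moment estimates, bias-corrected because s and r start at zero."""
    s = rho1 * s + (1 - rho1) * grad           # first-moment estimate
    r = rho2 * r + (1 - rho2) * grad * grad    # second-moment estimate
    s_hat = s / (1 - rho1 ** t)                # bias correction
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta = np.array([5.0, -3.0])
s, r = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):                       # t starts at 1 for bias correction
    grad = 2 * theta                           # gradient of theta_0**2 + theta_1**2
    theta, s, r = adam_step(theta, s, r, t, grad)
print(np.round(theta, 2))                      # close to [0, 0]
```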
Algorithm 8.5 The RMSProp algorithm
Step-1: Require Global learning rate ε, decay rate ρ.
Step-2: Require: Initial parameter θ
Step-3: Require Small constant δ, usually 10−6, used to stabilize division by small numbers.
Step-4: Initialize accumulation variables r = 0
Step-5: while stopping criterion not met do
Step-6: Sample a minibatch of m examples from the training set {x(1) , . . . ,x(m)} with
corresponding targets y(i).
Step-7: Compute gradient: g ← 1/m∇θΣi L(f(x(i) ; θ), y(i))
Step-8: Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
Step-9: Compute parameter update: ∆θ = −(ε/√(δ + r)) ⊙ g
(1/√(δ + r) applied element-wise)
Step-10: Apply update: θ ← θ + ∆θ
Step-11: end while
Explanation of Algorithm Steps:
Algorithm 8.5, called RMSProp, is a technique used to update the parameters of a neural
network during training. Here's a step-by-step explanation:
Step-1: We need a global learning rate (ε) and a decay rate (ρ) to control the size of
parameter updates.
Step-2: We start with an initial set of parameters (θ).
Step-3: We use a small constant (δ) to stabilize division by small numbers, which helps
prevent numerical issues during computation.
Step-4: We initialize an accumulation variable (r) to zero.
Step-5: We enter a loop that continues until a stopping criterion is met.
Step-6: We sample a batch of m examples from the training data, along with their
corresponding targets (y(i)).
Step-7: We compute the gradient of the loss function (L) with respect to the parameters
(θ), using the batch of examples.
Step-8: We accumulate the squared gradient (g) over time, using the decay rate (ρ) to
decay previous values. This helps smooth out the effects of noisy gradients.
Step-9: We compute the parameter update by dividing the gradient (g) by the square root
of the accumulated squared gradient (r), plus a small constant (δ). This helps prevent
exploding gradients and improves convergence.
Step-10: We apply the parameter update to our current set of parameters (θ).
Step-11: The loop continues until a stopping criterion is met, such as reaching a certain
number of iterations or achieving a desired level of accuracy.
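A minimal NumPy sketch of the RMSProp update in Algorithm 8.5; the decay rate, learning rate, and toy objective are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=0.01, rho=0.9, delta=1e-6):
    """RMSProp: exponentially weighted average of squared gradients
    in r, then an element-wise scaled gradient step."""
    r = rho * r + (1 - rho) * grad * grad            # Step-8: accumulate
    theta = theta - (eps / np.sqrt(delta + r)) * grad  # Steps 9-10: update
    return theta, r

theta = np.array([5.0, -3.0])
r = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * theta                 # gradient of theta_0**2 + theta_1**2
    theta, r = rmsprop_step(theta, r, grad)
print(np.round(theta, 2))            # close to [0, 0]
```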
Choosing the Right Optimization Algorithm
1. In this section, we discussed some algorithms that help in optimizing deep learning
models by adjusting the learning rate for each parameter.
2. There is no clear answer to which algorithm is the best to choose because researchers
have not yet agreed on this.
3. A study by Schaul et al. (2014) compared many optimization algorithms and found that
those with adaptive learning rates, such as RMSProp and AdaDelta, performed
consistently well across different learning tasks.
4. Currently, popular optimization algorithms used in practice include SGD, SGD with
momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam.
5. The choice of which algorithm to use seems to depend mostly on the user's familiarity
with it for easier hyperparameter tuning.
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Step-1: Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
Step-2: Require: Initial parameter θ, initial velocity v.
Step-3: Initialize accumulation variable r = 0
Step-4: while stopping criterion not met do
Step-5: Sample a minibatch of m examples from the training set {x(1) , . . . ,x(m)} with
corresponding targets y(i).
Step-6: Compute interim update: θ˜ ← θ + αv
Step-7: Compute gradient: g ← 1/m ∇θ˜ Σi L(f(x(i); θ˜), y(i))
Step-8: Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
Step-9: Compute velocity update: v ← αv − (ε/√r) ⊙ g (1/√r applied element-wise)
Step-10: Apply update: θ ← θ + v
Step-11: end while
Step 10: We apply the final weight update using our new velocity and the current weights
(θ).
Step 11: We repeat steps 5-10 until we meet our stopping criterion.
6th Topic: Approximate Second-Order Methods
3. The use of second-order methods for optimization involves taking into account the
curvature of the objective function, which can help to converge faster and more accurately
to the minimum.
4. These methods require the computation of second-order derivatives, which can be
computationally expensive for large-scale problems like training deep networks.
5. LeCun et al. (1998a) provide an early treatment of this subject. In this section, we will
follow their approach for simplicity of exposition.
6. However, the methods discussed here can be extended to more general objective
functions that include regularization terms.
7. The basic idea behind second-order methods is to approximate the objective function
around the current parameter values using a quadratic model.
8. This quadratic model is then minimized to obtain a new set of parameter values, which
are used as the starting point for the next iteration. This process is repeated until
convergence is achieved.
9. The main advantage of second-order methods is that they can converge faster than first-
order methods, such as gradient descent, because they take into account the curvature of
the objective function.
10. This allows them to more accurately estimate the direction and step size for each iteration,
resulting in fewer iterations and faster convergence.
11. However, second-order methods also have some disadvantages. They require the
computation of second-order derivatives, which can be computationally expensive for
large-scale problems like training deep networks.
12. Additionally, they may become less effective as the objective function becomes more
nonlinear or complex, as the quadratic approximation may not accurately represent the
true objective function in these cases.
13. Despite these challenges, second-order methods have shown promising results in practice
for training deep networks.
14. They have been used successfully in applications such as image classification and speech
recognition, where large-scale datasets and complex models are commonplace.
15. As computing resources continue to improve and algorithms become more sophisticated,
it is likely that second-order methods will become even more widely adopted in deep
learning applications.
Newton’s Method:
1. Newton's method is a second-order optimization technique that uses the Hessian matrix to
improve convergence. It approximates the objective function with a quadratic function and
updates the parameters directly to the minimum.
2. If the objective function is locally quadratic, Newton's method can jump directly to the
minimum by rescaling the gradient with the inverse of the Hessian.
3. In deep learning, Newton's method can be applied iteratively as long as the Hessian is
positive definite. However, it can cause updates to move in the wrong direction near saddle
points or when the eigenvalues of the Hessian are not all positive.
4. Regularization strategies, such as adding a constant along the diagonal of the Hessian, can
help mitigate these issues.
5. However, the computational burden of computing and inverting the Hessian matrix for large
neural networks with millions of parameters makes it impractical to use Newton's method for
training them.
6. Alternative methods that aim to gain some of the advantages of Newton's method while
avoiding its computational challenges are being explored in deep learning research.
7. In deep learning, the Hessian matrix is a second-order derivative matrix that measures the
curvature of the loss function with respect to the model's parameters. It is calculated by taking
the second derivative of the loss function with respect to the weights and biases of the neural
network.
8. The Hessian matrix is used in optimization algorithms such as Newton's method and
conjugate gradient descent to find the minimum of the loss function more efficiently than
gradient descent alone.
9. These methods use the Hessian matrix to approximate the curvature of the loss function
and adjust the learning rate accordingly. This can help the algorithm converge faster and more
accurately to the global minimum.
10. However, calculating the Hessian matrix can be computationally expensive, especially for
large neural networks with many parameters.
12. Alternative methods, such as quasi-Newton approaches, use only a subset of the Hessian
matrix or approximate it using lower-order derivatives, making them more computationally
efficient.
Algorithm 8.8 Newton's method with objective J(θ) = 1/m Σi=1..m L(f(x(i); θ), y(i))
Step-1: Require: Initial parameter θ0
Step-2: Require: Training set of m examples
Step-3: while stopping criterion not met do
Step-4: Compute gradient: g ← 1/m ∇θ Σi L(f(x(i); θ), y(i))
Step-5: Compute Hessian: H ← 1/m ∇θ² Σi L(f(x(i); θ), y(i))
Step-6: Compute Hessian inverse: H⁻¹
Step-7: Compute update: ∆θ = −H⁻¹g
Step-8: Apply update: θ = θ + ∆θ
Step-9: end while
Step 7: We compute the update step as the negative of the Hessian inverse times the
gradient. This moves us in a descent direction while accounting for how much each
parameter should be changed relative to the others, based on the local curvature.
Step 8: We update our parameter values based on the computed update step.
Step 9: We continue iterating until our stopping criterion is met.
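A minimal NumPy sketch of Newton's method on a small quadratic objective, where the Hessian is constant and the method reaches the minimum in a single step; the matrix A and vector b are illustrative assumptions:

```python
import numpy as np

# Objective: J(theta) = 0.5 * theta^T A theta - b^T theta  (A positive definite)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

theta = np.zeros(2)
g = A @ theta - b                        # Step-4: gradient
H = A                                    # Step-5: Hessian (constant for a quadratic)
theta = theta - np.linalg.solve(H, g)    # Steps 6-8: apply -H^{-1} g
print(theta, np.allclose(A @ theta, b))  # minimum satisfies A theta = b
```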
8.6.2 Conjugate Gradients
1. The conjugate gradients method is a way to efficiently find the minimum of a function
without calculating its inverse Hessian matrix.
2. It works by iteratively moving along search directions that are conjugate to one another
with respect to the Hessian, a stronger condition than merely being non-parallel or
non-perpendicular.
3. This is different from the method of steepest descent, which moves in the direction of
the gradient but can sometimes move back and forth in a zig-zag pattern because the
directions it uses are orthogonal to each other.
4. The conjugate gradients method avoids this problem by choosing directions that are
not orthogonal to each other, making it more efficient.
5. The inspiration for this approach comes from studying the weaknesses of the method
of steepest descent in quadratic functions.
6. During a line search, we move in a certain direction (dt-1) until we reach a minimum
point. At this point, the gradient (∇θJ) in that direction is zero.
7. Since the gradient now points in a new direction (dt), it's orthogonal to the previous
direction (dt-1). This means that the new direction is perpendicular to the old one.
8. This relationship between dt-1 and dt is shown in figure 8.6, which illustrates multiple
iterations of steepest descent.
9. The problem with this approach is that it can lead to a zig-zag pattern of progress, where
we have to re-minimize the objective function in the previous gradient direction after
descending to the minimum in the current gradient direction.
10. The method of conjugate gradients aims to address this issue by choosing directions
of descent that are not orthogonal to each other, which helps us avoid retracing our steps
and makes the optimization process more efficient.
11. In the method of conjugate gradients, we want to find a new search direction that
won't undo the progress made in the previous direction.
12. At each training iteration t, the new direction dt is a combination of the gradient and
the previous direction dt-1, with a coefficient βt that controls how much of dt-1 to add
back.
13. Two directions dt and dt-1 are called conjugate if dtᵀ H dt-1 = 0, where H is the
Hessian; intuitively, moving along one direction does not change the gradient component
along the other.
14. Calculating conjugacy using the eigenvectors of the Hessian would be computationally
expensive for large problems, but fortunately there's a simpler way that doesn't require
these calculations.
15. The conjugate gradient method for finding the minimum of a quadratic surface ensures
that we don't move backwards in the direction we just came from.
16. This means we stay on the path to the minimum as we search for it. In a space with k
dimensions, we only need to do k line searches to find the minimum using this method.
17. In simple terms, the conjugate gradient algorithm helps us find the lowest point on a
curved surface by moving in a smart way that avoids going back where we've been.
18. Conjugate gradients is a method to efficiently avoid calculating the inverse Hessian by
iteratively descending in conjugate directions.
19. The weakness of the method of steepest descent is that it progresses in a zig-zag
pattern because each line search direction is orthogonal to the previous one, causing
progress to be undone in those directions.
20. In conjugate gradients, the next search direction is a combination of the current
gradient and the previous direction, with a coefficient βt controlling how much of the
previous direction to add back.
21. Two directions are defined as conjugate if they do not undo progress made in each
other's direction.
22. Two popular methods for computing the coefficient βt are Fletcher-Reeves and
Polak-Ribière.
23. On a quadratic objective, the conjugate gradient algorithm requires at most k line
searches to reach the minimum in a k-dimensional parameter space.
24. Conjugate gradients is more computationally viable than Newton's method for large
problems because it never forms or inverts the Hessian matrix.
Two popular methods for computing βt are:
Fletcher-Reeves: βt = (gtᵀ gt) / (gt−1ᵀ gt−1)
Polak-Ribière: βt = ((gt − gt−1)ᵀ gt) / (gt−1ᵀ gt−1)
Algorithm: Nonlinear conjugate gradients with objective J(θ) = (1/m) Σi=1..m L(f(x(i); θ), y(i)).
Step-1: Require: Initial parameters θ0
Step-2: Require: Training set of m examples
Step-3: Initialize ρ0 = 0
Step-4: Initialize g0 = 0
Step-5: Initialize t = 1
Step-6: While stopping criterion not met do
Step-7: Initialize the gradient gt = 0
Step-8: Compute gradient: gt ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Step-9: Compute βt = ((gt − gt−1)ᵀ gt) / (gt−1ᵀ gt−1) (Polak-Ribière)
(Nonlinear conjugate gradient: optionally reset βt to zero, for example if t is a multiple of
some constant k, such as k = 5)
Step-10: Compute search direction: ρt = −gt + βt ρt−1
Step-11: Perform line search to find: ε∗ = argmin over ε of (1/m) Σi=1..m L(f(x(i); θt + ε ρt), y(i))
(On a truly quadratic cost function, analytically solve for ε∗ rather than explicitly searching
for it)
Step-12: Apply update: θt+1 = θt + ε∗ ρt
Step-13: t ← t + 1
Step-14: end while
Step 1: We need to start with some initial values for the parameters we're trying to optimize
(θ0).
Step 2: We have a set of examples (m) that we'll use to train our model.
Step 3: We're keeping track of a variable called ρ0, which we'll use later in the algorithm. For
now, just think of it as a starting point.
Step 4: We're also keeping track of a variable called g0, which we'll use to calculate gradients
later on. Again, just think of it as a starting point for now.
Step 6: We're entering a loop that will continue until some stopping criterion is met (we'll
discuss this more in a bit). Inside this loop, we'll be calculating gradients, updating our
parameters, and searching for the best step size to take.
Step 7: We're initializing a variable called gt to zero. This will be used to store the gradient at
each iteration.
Step 8: We're calculating the gradient for our current set of parameters (θt) using the training
data, i.e., (1/m) ∇θ Σi L(f(x(i); θt), y(i)). This gradient tells us which direction we should move
in order to decrease the cost function (which is what we want to do).
Step 9: We're computing a coefficient called βt using the gradient from the current iteration (gt)
and the gradient from the previous iteration (gt-1). This coefficient controls how much of the
previous search direction is added back into the new search direction computed in Step 10.
Step 10: We're calculating the search direction ρt = −gt + βt ρt−1, i.e., the negative of the
current gradient plus βt times the previous search direction. This gives us a direction to move
in that will hopefully decrease the cost function more quickly than moving along the current
gradient alone.
Step 11: We're performing a line search to find the best step size (ε*) to take in this search
direction. This involves calculating the cost function for different values of ε and finding the
one that results in the smallest value of the cost function. This is important because we want
to make sure we're actually decreasing the cost function as we move through parameter
space.
Step 12: We're applying our new parameters (θ t+1) by adding our step size (ε*) times our
search direction (ρt). This moves us closer to our minimum in parameter space.
Step 13: We increment our iteration counter (t) and continue with Step 6 until our stopping
criterion is met. The stopping criterion could be things like reaching a certain number of
iterations, or finding that our changes in parameters are no longer resulting in significant
decreases in the cost function.
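To make the loop concrete, the following is a minimal NumPy sketch of nonlinear conjugate gradients with the Polak-Ribière coefficient on a toy quadratic cost; the matrices, the analytic line search, and the reset schedule (k = 5) are illustrative assumptions, not a production implementation.

import numpy as np

# Toy quadratic cost J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b

theta = np.zeros(2)                          # Step-1: initial parameters
rho = np.zeros(2)                            # Step-3: rho_0 = 0
g_prev = np.zeros(2)                         # Step-4: g_0 = 0

for t in range(1, 100):                      # Step-6: main loop
    g = grad(theta)                          # Step-8: current gradient
    if np.linalg.norm(g) < 1e-10:            # stopping criterion
        break
    if t == 1 or t % 5 == 0:                 # optional reset, as in the notes (k = 5)
        beta = 0.0
    else:                                    # Step-9: Polak-Ribiere coefficient
        beta = (g - g_prev) @ g / (g_prev @ g_prev)
    rho = -g + beta * rho                    # Step-10: new search direction
    eps = -(g @ rho) / (rho @ A @ rho)       # Step-11: exact line search (quadratic case)
    theta = theta + eps * rho                # Step-12: apply update
    g_prev = g                               # Step-13: bookkeeping; t advances in the loop

print(theta)  # converges to A^{-1} b in about two line searches (k = 2 dimensions)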
1. Conjugate gradients method is not just for quadratic functions, but can also be used for
nonlinear objectives in neural network training.
2. Without the guarantee of a quadratic objective, the line searches no longer keep the
objective at its minimum along previous directions, so occasional resets are necessary.
3. The nonlinear conjugate gradients algorithm involves restarting with line search along the
unaltered gradient during resets.
4. Initializing optimization with a few iterations of stochastic gradient descent before nonlinear
conjugate gradients can be beneficial.
5. Minibatch versions of nonlinear conjugate gradients have been successfully used for neural
network training (Le et al., 2011).
6. Practitioners report reasonable results using nonlinear conjugate gradients for training
neural networks, but it is often beneficial to start with a few iterations of stochastic gradient
descent.
BFGS
BFGS is a quasi-Newton method: rather than computing and inverting the exact Hessian as
Newton's method does, it iteratively builds an approximation M of the inverse Hessian from
successive gradient evaluations.
However, the BFGS algorithm must store that inverse Hessian approximation M, which
requires O(n²) memory, making BFGS impractical for most modern deep learning models
that typically have millions of parameters.
Limited Memory BFGS (L-BFGS):
L-BFGS is a variant of BFGS that significantly decreases memory cost by avoiding storage
of the complete inverse Hessian approximation M, which is what makes BFGS expensive at
O(n²) memory.
L-BFGS computes the approximation M using the same update rule as BFGS, but begins
each step with the assumption that M(t−1) is the identity matrix, so the full matrix never
has to be stored explicitly.
Like the method of conjugate gradients, this yields descent directions that are mutually
conjugate; unlike conjugate gradients, however, L-BFGS remains well behaved when the
minimum of the line search is reached only approximately.
The no-storage strategy can be generalized to include more information about the Hessian
by storing some of the vectors used to update M at each time step, which still costs only
O(n) per step.
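In practice L-BFGS is rarely implemented by hand; as a hedged illustration, SciPy exposes a limited-memory BFGS optimizer directly (the Rosenbrock function below is just a stand-in objective chosen for the example).

import numpy as np
from scipy.optimize import minimize

# Stand-in objective: the Rosenbrock function and its gradient
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

x0 = np.zeros(2)
# method="L-BFGS-B" is SciPy's limited-memory BFGS (with optional bound constraints);
# it stores only a few recent update vectors instead of the full inverse Hessian.
result = minimize(f, x0, jac=grad_f, method="L-BFGS-B")
print(result.x)  # approximately [1.0, 1.0], the minimum of the Rosenbrock function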
Conjugate Gradients (CG) vs Nonlinear Conjugate Gradients (NCG):
1. Conjugate Gradients (CG) is a linear algebra algorithm used to solve systems of linear
equations efficiently. It's an iterative method that involves calculating search directions
and step sizes to minimize the residual norm.
2. Nonlinear Conjugate Gradients (NCG) is a generalization of CG to nonlinear systems of
equations. Instead of finding the minimum of a quadratic function as in CG, NCG aims to
find the minimum of a nonlinear function by approximating it with a quadratic model at
each iteration.
3. The main difference between CG and NCG is that CG is used for solving linear systems,
while NCG is used for solving nonlinear systems. In other words, CG works for systems
where the coefficients are constants, while NCG works for systems where the coefficients
can vary.
4. Another difference is convergence behavior: on a quadratic problem, CG is guaranteed to
reach the minimum in at most n iterations, where n is the number of unknowns, and it
converges faster still when the problem is well conditioned. NCG's convergence rate depends
on the specific nonlinear problem being solved, and it generally requires periodic restarts.
5. In summary, CG is a linear algebra algorithm used for solving linear systems, while NCG is
a generalization of CG used for solving nonlinear problems. CG enjoys stronger convergence
guarantees because quadratic objectives are simpler than general nonlinear ones.
7th Topic: Optimization Strategies and Meta-Algorithms
1. Batch Normalization
2. Coordinate descent
3. Polyak Averaging
4. Supervised Pretraining
5. Designing Models to Aid Optimization
6. Continuation Methods and Curriculum Learning
1. Batch Normalization (BN): A technique that normalizes the input of each layer in a neural
network by scaling and shifting it to have zero mean and unit variance. This helps to speed up
training, reduce overfitting, and improve model performance.
2. Coordinate Descent (CD): An optimization algorithm that breaks the optimization
problem into smaller subproblems, each solved independently by optimizing one variable (or
block of variables) at a time. This method is particularly useful for large-scale optimization
problems with many variables.
3. Polyak Averaging (PA): A technique used with stochastic gradient descent (SGD) that
maintains a running average of the parameter values visited over the course of training. This
reduces the variance of the final parameter estimate and improves convergence.
4. Supervised Pretraining (SPT): A method used to pretrain a neural network on a large
dataset before fine-tuning it on a smaller target dataset. This helps to improve the model's
ability to learn complex features and reduce overfitting.
5. Designing Models to Aid Optimization: This involves designing neural network
architectures that are easier to optimize, such as using residual connections, skip connections,
or inverted residual blocks. These techniques help to improve the model's ability to converge
during training and reduce the number of parameters needed for optimal performance.
6. Continuation Methods and Curriculum Learning: These are techniques used to gradually
increase the difficulty of the training data over time, allowing the model to learn more
complex features as it progresses through the training process. Continuation methods involve
starting with a simpler model and gradually increasing its complexity, while curriculum
learning involves presenting easier examples first and gradually increasing their difficulty over
time. Both techniques help to improve the model's ability to generalize to new, unseen data.
Text Book Matter as Follows
1. Batch Normalization:
a) Batch normalization is a technique in deep learning that helps in optimizing neural networks
by making the gradient calculations more stable during training. It's not actually an
optimization algorithm, but rather a way to reparameterize inputs to make the gradient
calculations more reliable.
b) The problem with training very deep neural networks is that as we update the weights of each
layer, the output of the previous layers also changes, which can lead to unexpected results.
Batch normalization addresses this issue by normalizing the inputs to each layer and scaling
them with learnable parameters. This helps to stabilize the gradient calculations and makes it
easier to train deep neural networks.
c) In simple terms, batch normalization works by computing the mean and variance of the inputs
to each layer during training, and then normalizing them to have zero mean and unit variance.
The normalized inputs are then scaled by a learnable parameter called gamma, and shifted by
another learnable parameter called beta. These parameters are learned during training and
help to adjust the distribution of the inputs to each layer.
d) The benefits of batch normalization include faster convergence, reduced sensitivity to
initialization, and improved accuracy on deep neural networks. It's a simple yet effective
technique that has become a standard component of many modern deep-learning models.
However, the actual update will include second-order and third-order effects, on up to effects
of order l. The new value of ŷ is given by
ŷ = x(w1 − εg1)(w2 − εg2)···(wl − εgl) ------------------------------ (8.34)
Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix, with
the activations for each example appearing in a row of the matrix. To normalize H, we replace
it with
H′ = (H − µ) / σ ------------------------------ (8.35)
where µ is a vector containing the mean of each unit and σ is a vector containing the standard
deviation of each unit.
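A minimal NumPy sketch of the forward normalization in equation 8.35, with the learnable scale γ (gamma) and shift β (beta) applied afterwards; the variable names are mine, and the small eps in the denominator is a standard numerical-stability addition rather than part of the equation above.

import numpy as np

def batch_norm_forward(H, gamma, beta, eps=1e-5):
    # H: minibatch design matrix, one example per row, one unit per column
    mu = H.mean(axis=0)                  # per-unit mean over the minibatch
    sigma = H.std(axis=0)                # per-unit standard deviation
    H_norm = (H - mu) / (sigma + eps)    # equation 8.35: H' = (H - mu) / sigma
    return gamma * H_norm + beta         # learnable re-scaling and shifting

# Hypothetical usage: minibatch of 4 examples, layer with 3 units
H = np.random.randn(4, 3) * 10 + 5
gamma, beta = np.ones(3), np.zeros(3)
print(batch_norm_forward(H, gamma, beta).mean(axis=0))  # means are approximately zero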
2. Coordinate Descent:
a. Coordinate descent: When trying to find the minimum of a function with multiple variables,
we can break the problem down by optimizing one variable at a time while holding the others
fixed. This is called coordinate descent, and under suitable conditions it converges to a local
minimum (a small sketch appears after this list).
b. Block coordinate descent: Instead of optimizing one variable at a time, we can also optimize
a group of variables simultaneously. This is called block coordinate descent.
c. Coordinate descent makes sense when the variables can be separated into groups that don't
interact much with each other, or when optimizing one group is much easier than optimizing
all variables at once.
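A tiny sketch of coordinate descent on a two-variable convex quadratic, cycling through the coordinates and minimizing each exactly while holding the other fixed; the objective is a made-up example.

# Made-up objective: f(x, y) = x^2 + y^2 + x*y - 3x (convex quadratic)
# With one variable fixed, the other has a closed-form minimizer, which is
# exactly the setting where coordinate descent is cheap and effective.
x, y = 5.0, 5.0
for _ in range(20):
    x = (3.0 - y) / 2.0    # argmin over x: solve df/dx = 2x + y - 3 = 0
    y = -x / 2.0           # argmin over y: solve df/dy = 2y + x = 0

print(x, y)  # converges to the joint minimum (2, -1)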
For Example:
The sparse coding cost function describes a learning problem where the goal is to find a
weight matrix W that can linearly decode a matrix of activation values H to reconstruct the
training set X (the objective is reconstructed below).
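For reference, a reconstruction of that cost function as it appears in the textbook's sparse coding example (treat the exact form as a reading of the source rather than a verbatim quote):

J(H, W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} \left( X - W^{\top} H \right)_{i,j}^{2}

The first term encourages sparse activations and the second penalizes reconstruction error. The objective is not jointly convex in H and W, but minimizing over H with W fixed, and over W with H fixed, are each convex subproblems, which is why block coordinate descent applies here.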
3. Polyak Averaging:
1. Polyak averaging is a technique that averages multiple points in the optimization process to
improve convergence.
2. It works by taking the average of the parameters visited during gradient descent: θ̂(t) = (1/t) Σi θ(i).
3. In convex problems, this approach has strong convergence guarantees.
4. In neural networks, it's a heuristic method that performs well in practice.
5. The idea is that optimization algorithms may repeatedly visit a valley without reaching its
bottom. The average of all visited points should be close to the bottom.
6. In non-convex problems, including distant past points with large barriers in the cost function
may not be useful, as the optimization trajectory can be complex and visit different regions.
As a result, when applying Polyak averaging to non-convex problems, it is typical to use an
exponentially decaying running average:
θ̂(t) = α θ̂(t−1) + (1 − α) θ(t)
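A minimal sketch of this exponentially decaying running average wrapped around SGD iterates; the toy objective, learning rate, and α = 0.99 are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([5.0])        # current SGD iterate
theta_avg = theta.copy()       # Polyak/EMA estimate
alpha, lr = 0.99, 0.1

for t in range(1, 1001):
    # Noisy gradient of the toy objective 0.5 * theta^2 (noise mimics minibatching)
    g = theta + rng.normal(scale=1.0, size=1)
    theta = theta - lr * g                               # ordinary SGD step
    theta_avg = alpha * theta_avg + (1 - alpha) * theta  # decaying running average

# The averaged iterate sits much closer to the true minimum (0) than the raw iterate
print(theta, theta_avg)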
4. Supervised Pretraining
1. Pretraining is a strategy in deep learning where a simpler model is trained to solve a simpler task
before training the desired model for the final task. This can be helpful when directly training a
complex model is too difficult or the task is too challenging.
2. Pretraining can involve breaking down a problem into smaller components and solving them
separately, similar to how greedy algorithms work. This can be faster and cheaper than solving the
entire problem at once, but may not always result in an optimal solution.
3. Pretraining can be followed by fine-tuning, where a joint optimization algorithm is used to find
the best solution for the entire problem, using the pretrained model as a starting point.
4. Greedy supervised pretraining involves training each stage of a neural network on a simpler
supervised learning problem, using only a subset of the layers in the final network. This can help
provide better guidance to intermediate levels of the network and improve optimization and
generalization.
5. Another approach to pretraining is transfer learning, where a pretrained network is used to
initialize weights for a new network that will be trained on a different set of tasks with fewer
training examples.
6. FitNets is an approach that involves training a wide and shallow network as a teacher to provide
hints for training a deeper and thinner student network. The student network has two objectives:
to accomplish its own task and predict the middle layer of the teacher network. This can simplify
optimization and help train networks that would otherwise be difficult to train.
5. Designing Models to Aid Optimization
1. Improving optimization in deep learning models isn't always about improving the optimization
algorithm. Sometimes, it's better to make the models easier to optimize by choosing a simpler
model family.
2. Using non-monotonic or saturating activation functions can make optimization difficult, so it
is better to choose smooth functions whose gradients remain informative over wide input ranges.
3. Most advances in neural network learning have come from changing the model family, not the
optimization algorithm. SGD with momentum is still widely used today.
4. Modern neural networks use linear transformations between layers and differentiable activation
functions with large slope areas. This makes optimization easier because the gradient flows
through many layers and the direction of improvement is clear even if the model's output is far
from correct.
5. Linear paths or skip connections between layers can help mitigate the vanishing gradient
problem by reducing the shortest path from lower-layer parameters to the output (see the
sketch after this list).
6. Adding auxiliary heads to intermediate hidden layers can provide error signals to lower layers
during training, making them easier to optimize without the need for pretraining strategies. This
allows all layers to be trained jointly in a single phase.
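As a small illustration of point 5, a skip (residual) connection simply adds a layer's input back to its output, giving gradients a short identity path around the transformation; the layer shapes and initialization here are hypothetical.

import numpy as np

def relu_layer(x, W, bias):
    return np.maximum(0.0, W @ x + bias)   # ordinary fully connected ReLU layer

def residual_block(x, W, bias):
    # Skip connection: output = f(x) + x. Even if f's gradient is tiny,
    # the identity path keeps a direct route from lower layers to the output.
    return relu_layer(x, W, bias) + x

x = np.random.randn(8)
W, bias = 0.01 * np.random.randn(8, 8), np.zeros(8)
print(residual_block(x, W, bias))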
6. Continuation Methods and Curriculum Learning
1. Many optimization problems are difficult because the cost function has a complex global
structure.
2. Continuation methods try to overcome this by finding initial points that lead to easier-to-solve
problems, which can then be refined to solve more difficult ones.
3. These methods construct a series of cost functions that become increasingly difficult, with the
first being easy to minimize.
4. The idea is to choose initial points that are more likely to land in a region where local
optimization can succeed because it's larger and well-behaved.
5. Traditional continuation methods smooth the objective function, while simulated annealing
adds noise to the parameters.
6. Continuation methods can break down in three ways: requiring too many incremental cost
functions, not becoming convex when blurred, or tracking to a local minimum instead of a global
one.
7. Continuation methods can still help with neural network optimization because they make local
updates easier or improve correspondence between update directions and global solutions.
8. Curriculum learning, which begins with simple concepts and progresses to more complex ones,
can be seen as a continuation method in machine learning and has been successful on natural
language and computer vision tasks.
9. A stochastic curriculum, where easy and difficult examples are presented with a gradually
increasing proportion of the latter, is more effective than a deterministic one for training recurrent
neural networks to capture long-term dependencies (a toy sketch follows this list).
10. Optimization methods discussed in this chapter are generally applicable to specialized
architectures with little or no modification as they scale to very large sizes and process structured
input data.
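As a toy illustration of point 9, a stochastic curriculum can be implemented by drawing each training example from a "hard" pool with a probability that grows over training; the pools, schedule, and cap of 0.9 are invented for the example.

import random

easy_pool = ["easy_%d" % i for i in range(100)]   # short, simple examples
hard_pool = ["hard_%d" % i for i in range(100)]   # long-dependency examples
total_steps = 1000

for step in range(total_steps):
    # The proportion of hard examples grows linearly, but easy examples are
    # never fully excluded -- that mixing is what makes the curriculum stochastic.
    p_hard = min(0.9, step / total_steps)
    pool = hard_pool if random.random() < p_hard else easy_pool
    example = random.choice(pool)
    # ... train on `example` here ...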
5th UNIT-PART-II
Applications With Case Studies:
Case Study (Large-Scale Deep Learning): Large-scale deep learning is applied to process
vast amounts of social media content for sentiment analysis, trending topics, and user
behavior analysis.
II. COMPUTER VISION
1. Object Detection:
Application: Deep learning is widely used for object detection and recognition in
images and videos.
Case Study: YOLO (You Only Look Once) is a deep learning model that excels in
real-time object detection, making it suitable for applications like video surveillance
and autonomous vehicles.
Advantages:
Achieves high accuracy in object detection.
Allows real-time processing for time-sensitive applications.
Disadvantages:
May struggle with small or heavily occluded objects.
Training can be computationally intensive.
2. Facial Recognition:
Case Study: DeepFace by Facebook uses deep learning to achieve high accuracy in
facial recognition tasks.
Advantages:
Enables reliable identification of individuals.
Useful in various applications, from security to user experience.
Disadvantages:
Privacy concerns regarding the use of facial data.
Potential biases in recognition systems.
3. Image Classification:
Case Study: AlexNet, a deep convolutional neural network, won the ImageNet Large
Scale Visual Recognition Challenge in 2012, demonstrating the effectiveness of deep
learning in image classification.
Advantages:
High accuracy in classifying diverse images.
Transfer learning allows leveraging pre-trained models.
Disadvantages:
Requires large labeled datasets for training.
Vulnerable to adversarial attacks.
4. Semantic Segmentation:
Application: Deep learning is used for semantic segmentation, where each pixel in an
image is classified, providing a detailed understanding of object boundaries.
Case Study: U-Net is a deep learning architecture commonly used for medical image
segmentation tasks.
Advantages:
Provides detailed information about object boundaries.
Useful in medical imaging and autonomous systems.
Disadvantages:
Requires extensive labeled data for training.
Computationally demanding for high-resolution images.
5. Pose Estimation:
Application: Deep learning is applied for estimating the spatial positions of human
joints, enabling applications like gesture recognition and motion analysis.
Case Study: OpenPose is a deep learning framework for real-time multi-person pose
estimation.
Advantages:
Useful in human-computer interaction applications.
Enables understanding of complex human movements.
Disadvantages:
Challenges in handling diverse body shapes and poses.
Performance may degrade in crowded scenes.
6. Gesture Recognition:
Case Study: Microsoft's Kinect uses deep learning for gesture recognition, allowing
users to control devices through body movements.
Advantages:
Enhances user experience in interactive systems.
Enables touchless control in various applications.
Disadvantages:
Accuracy may vary based on lighting and environmental conditions.
Limited standardization in gesture definitions.
7. Visual Captioning:
Application: Deep learning is used for generating textual descriptions of visual content,
facilitating better understanding by machines.
Case Study: Google's Show and Tell uses deep learning to generate captions for images.
Advantages:
Bridges the gap between visual and textual information.
Useful in accessibility and content understanding.
Disadvantages:
Challenges in generating accurate and contextually relevant captions.
Highly dependent on the quality of training data.
8. Augmented and Virtual Reality (AR/VR):
Case Study: DeepAR is a deep learning model used for augmenting reality in various
applications.
Advantages:
Improves the realism and interactivity of AR and VR experiences.
Enables seamless integration of virtual and real-world elements.
Disadvantages:
Requires powerful hardware for real-time processing.
Challenges in maintaining accuracy in dynamic environments.
These case studies highlight the versatility of deep learning in various computer vision
applications, offering solutions to complex problems. While deep learning brings
significant advantages, including high accuracy and adaptability, it also poses
challenges such as the need for substantial computing resources and potential ethical
considerations.
III. SPEECH RECOGNITION
Introduction to Speech Recognition in Deep Learning:
Speech recognition in deep learning involves training machines to understand and
interpret spoken language, enabling applications like voice assistants, transcription
services, and more.
Advantages:
Deep learning models can learn complex patterns and representations in audio data.
Improved accuracy and performance compared to traditional methods.
Adaptability to diverse accents and languages.
Disadvantages:
Requires substantial labeled data for effective training.
Computationally intensive, especially for large-scale models.
Vulnerable to background noise and environmental variations.
1. Automatic Speech Recognition (ASR):
ASR is a common application where deep learning is used to convert spoken language
into text.
Case Study:
Google's DeepSpeech utilizes deep learning for accurate and real-time transcription
services.
2. Voice Assistants:
Deep learning is employed in voice assistants for natural language understanding and
response generation.
Case Study:
Amazon Alexa and Google Assistant leverage deep learning to comprehend user
commands and provide relevant responses.
3. Speaker Identification and Verification:
Deep learning models can be used to identify and verify individuals based on their voice
patterns.
Case Study:
Microsoft's Speaker Recognition API uses deep learning for speaker verification in
security applications.
4. Emotion Recognition in Speech:
Deep learning is applied to recognize emotions conveyed through speech, enabling
sentiment analysis.
Case Study:
Beyond Verbal's Emotion AI uses deep learning to analyze vocal intonations and
extract emotional insights.
5. Multilingual Speech Recognition:
Deep learning enables the development of models that can understand and transcribe
multiple languages.
Case Study:
IBM's Watson Speech to Text supports a variety of languages, showcasing the
adaptability of deep learning models.
6. Speech Analytics in Customer Service:
Deep learning is utilized in speech analytics to extract meaningful insights from
customer service interactions.
Case Study:
CallMiner Eureka uses deep learning to analyze call center conversations for sentiment
analysis and customer experience improvement.
7. Voice Biometrics:
Deep learning is applied to create voiceprints for secure authentication and access
control.
Case Study:
Nuance Communications uses deep learning for voice biometrics, enhancing security
in applications like banking and telecommunications.
8. Advancements in Deep Learning Architectures:
Ongoing research introduces advanced architectures like Transformer-based models for
improved speech recognition.
Case Study:
Facebook's “wav2vec 2.0” utilizes a Transformer-based architecture for better
contextual understanding in speech recognition.
IV. NATURAL LANGUAGE PROCESSING
Introduction to Natural Language Processing (NLP) in Deep Learning:
NLP in deep learning involves teaching machines to understand, interpret, and generate
human language, enabling applications such as text analysis, machine translation,
sentiment analysis, and more.
Advantages of Deep Learning in NLP:
Captures Contextual Information: Deep learning models, especially transformers,
capture contextual nuances in language.
End-to-End Learning: End-to-end models simplify the NLP pipeline, allowing
direct learning from raw text data.
Transfer Learning: Pre-trained models facilitate transfer learning, improving
performance on specific tasks.
Disadvantages of Deep Learning in NLP:
Data Dependency: Requires large amounts of labeled data for effective training.
Computational Intensity: Training deep models can be computationally intensive.
Interpretability: Deep models are often considered black boxes, making it
challenging to interpret their decision-making processes.
1. Machine Translation:
Deep learning is applied in machine translation to automatically translate text from one
language to another.
Case Study:
Google's Transformer model, introduced in the "Attention is All You Need" paper,
significantly improved machine translation quality.
2. Sentiment Analysis:
Deep learning is used for sentiment analysis to determine the sentiment expressed in a
piece of text.
Case Study:
The use of deep learning in sentiment analysis has seen success in social media
monitoring tools and customer feedback analysis.
3. Text Summarization:
Deep learning models can be applied to automatically generate concise summaries of
longer texts.
Case Study:
BERT-based models have demonstrated effectiveness in extractive summarization
tasks.
4. Named Entity Recognition (NER):
Deep learning is employed in NER to identify and classify entities (e.g., names,
locations) in text.
Case Study:
The Stanford NER system, utilizing deep learning components, has been widely used
for entity recognition tasks.
5. Question Answering Systems:
Deep learning is applied to build systems that can understand and answer questions
posed in natural language.
Case Study:
OpenAI's GPT-3 has shown capabilities in question answering tasks, demonstrating its
understanding of context.
6. Chatbots and Virtual Assistants:
Deep learning is crucial for developing conversational agents capable of understanding
and generating human-like responses.
Case Study:
Google's BERT has been applied to enhance the natural language understanding
capabilities of chatbots.
7. Text Generation:
Deep learning models can generate human-like text based on input prompts or
context.
Case Study:
OpenAI's GPT-3 is known for its impressive text generation capabilities, producing
coherent and contextually relevant text.
8. Document Classification:
Deep learning is used for document classification tasks, such as categorizing
documents into predefined classes.
Case Study:
The use of deep neural networks has improved accuracy in document categorization,
benefiting information retrieval systems.
5th Unit (Deep Learning)
Applications:
a) Large-Scale Deep Learning,
b) Computer Vision,
c) Speech Recognition,
d) Natural Language Processing
I. Applications of Large-Scale Deep Learning with Case Studies
Abstract:
Deep learning has revolutionized the field of artificial intelligence by enabling machines to
learn and make decisions based on vast amounts of data. The recent advancements in
hardware and software have made it possible to train deep neural networks on large-scale
datasets with billions of parameters. In this application, we will discuss some of the most
promising applications of large-scale deep learning, highlighting the latest research and case
studies.
Introduction:
Deep learning is a subfield of machine learning that uses artificial neural networks with
multiple layers to learn and make predictions from large datasets. The success of deep
learning is attributed to the availability of big data, powerful computing resources, and the
development of new algorithms that can handle the complexity of deep neural networks. In
this application, we will discuss some of the most promising applications of large-scale deep
learning, highlighting the latest research and case studies.
1. Image and Speech Recognition:
One of the most popular applications of deep learning is image and speech recognition. Large-
scale datasets like ImageNet and CIFAR have been instrumental in driving progress in this area.
For instance, in 2012, AlexNet, a deep neural network with eight layers, achieved a top-5 error
rate of 15.3% on ImageNet, which was a significant improvement over the previous state-of-
the-art method. Since then, several other architectures like VGGNet, ResNet, and Inception
have been proposed, which have further improved the accuracy and efficiency of image
recognition systems.
Case Study: Google Cloud Vision API
Google Cloud Vision API is a cloud-based service that enables developers to integrate image
recognition capabilities into their applications. The service uses a pre-trained model called
InceptionV3, which has over 22 million parameters and can classify images into thousands of
categories with high accuracy. The API also provides additional functionalities like object
detection, labeling, and text recognition. The service is available for free up to a certain limit
and charges based on usage.
2. Natural Language Processing (NLP):
Natural Language Processing (NLP) is another exciting application of deep learning. Large-
scale datasets like Wikipedia and BookCorpus have been used to train deep neural networks
for NLP tasks like language modeling, machine translation, and question answering. For
instance, in 2018, Google's BERT (Bidirectional Encoder Representations from Transformers)
model achieved state-of-the-art results on several NLP benchmarks like GLUE and SQuAD.
BERT is a transformer-based model with over 110 million parameters that can understand the
contextual meaning of words in a sentence by considering their relationships with other words
in both directions.
Case Study: Google Cloud Natural Language API
Google Cloud Natural Language API is a cloud-based service that enables developers to
integrate NLP capabilities into their applications. The service uses a pre-trained model called
BERTbase, which has over 110 million parameters and can perform several NLP tasks like
sentiment analysis, entity recognition, syntax analysis, and question answering with high
accuracy. The API also provides additional functionalities like language detection and
translation between multiple languages. The service is available for free up to a certain limit
and charges based on usage.
3. Autonomous Driving:
Autonomous driving is another promising application of deep learning that requires large
amounts of data to train models for perception, prediction, planning, and control tasks. Large-
scale datasets like KITTI and Cityscapes have been used to train deep neural networks for tasks
like object detection, segmentation, and depth estimation. For instance, by 2018, Waymo's
self-driving cars had driven over 8 million miles on public roads using deep learning algorithms
for perception and decision making. Waymo's system uses multiple sensors like lidar, cameras,
and radar to generate a 3D map of the environment in real time and make safe driving
decisions based on it.
Application 1: Customer Service Chatbot
XYZ Bank, a leading financial institution, implemented a customer service chatbot powered by
Natural Language Processing (NLP) technology. The chatbot, named "Banki," is available 24/7
to answer customer queries, provide account balances, and perform simple transactions.
The NLP algorithm enables Banki to understand the intent of the customer's message and
respond accurately. For instance, if a customer asks, "What is my account balance?", Banki
will retrieve the information from the customer's account and respond with the balance.
Similarly, if a customer asks, "Can you transfer $500 from my savings to my checking
account?", Banki will initiate the transfer after verifying the customer's identity.
The implementation of Banki has resulted in significant improvements in customer satisfaction
and operational efficiency. Customers can now resolve their queries quickly and easily without
having to wait on hold or visit a branch. This has led to a reduction in call volume and wait
times for customers. Additionally, Banki has freed up bank representatives to focus on more
complex queries that require human intervention.
Application 2: Content Curation Platform