
Optimization for Training Deep Models (UNIT-5 Class Notes)

R18 B.Tech. CSE (AIML) III & IV Year JNTU Hyderabad

NEURAL NETWORKS AND DEEP LEARNING

B.Tech. IV Year I Sem. (L T P C: 3 0 0 3)
Course Objectives:
 To introduce the foundations of Artificial Neural Networks
 To acquire the knowledge on Deep Learning Concepts
 To learn various types of Artificial Neural Networks
 To gain knowledge to apply optimization strategies

Course Outcomes:
 Ability to understand the concepts of Neural Networks
 Ability to select the Learning Networks in modeling real world systems
 Ability to use an efficient algorithm for Deep Models
 Ability to apply optimization strategies for large scale applications

UNIT-I
Artificial Neural Networks Introduction, Basic models of ANN, important terminologies, Supervised
Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network.
Associative Memory Networks. Training Algorithms for pattern association, BAM and Hopfield
Networks.

UNIT-II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming
Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation
Networks, Adaptive Resonance Theory Networks. Special Networks-Introduction to various networks.

UNIT - III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed - forward networks,
Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation and Other
Differentiation Algorithms

UNIT - IV
Regularization for Deep Learning: Parameter norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised learning, Multi-task learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier

UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms,
Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-
Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language
Processing

TEXT BOOKS:
1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press.
2. Neural Networks and Learning Machines, Simon Haykin, 3rd Edition, Pearson Prentice Hall.
This Neural Network & Deep Learning 5 Units Material is Prepared by

Dr K Madan Mohan, Ph.D, MISTE,MIEEE


Associate Professor, R & D Coordinator,
Department of CSE (AI&ML),
Sreyas Institute of Engineering and Technology,
Nagole, Bandlaguda, Hyderabad, 500068.
Milestones in my Career
1. Total Number of Subjects Taught: 23
2. Total Number of Research Papers: 36
3. Teaching Experience: 14+ Years
4. Total Number of Mini Projects Completed: 20+
5. Total Number of Major Projects Completed: 20+
6. Ratified Faculty by JNTUH (Number of Times): 6
7. Total Number of Best Paper Awards: 2
8. Total Number of Cash Rewards:1 (Dedicated to Pulwama Incident Soldiers)
9. Number of Coordinator ships: 16+
10. Total Number of Patents: 02
Highest positions held at organizations:
1. Assistant Controller of Examinations (ACE-II)
2. Dean, ICT (Website)
5th UNIT
Optimization for Training Deep Models
Introduction
1. Deep learning algorithms involve optimization in various contexts, including PCA inference
and neural network training.

2. Neural network training is the most challenging optimization problem in deep learning
due to its complexity and computational cost.
3. Gradient-based optimization techniques are commonly used to solve the neural network
training problem, but they require specialized methods due to its difficulty.
4. The goal of neural network training is to find the parameters θ that significantly reduce a
cost function J(θ) by minimizing it using optimization algorithms.
5. Optimization used as a training algorithm differs from pure optimization because we minimize
the cost function J(θ) only indirectly, as a proxy for improving a performance measure P
defined on the test set.
6. Challenges that make optimization of neural networks difficult include non-convexity, high
dimensionality, and vanishing/exploding gradients.
7. Practical algorithms for neural network training include basic optimization algorithms like
gradient descent and more advanced techniques like adaptive learning rates and second
derivative information.
8. Initialization strategies for neural network parameters are also important to ensure
convergence to a good local minimum.
9. Higher-level optimization strategies combine simple algorithms into more complex
procedures for improved performance, such as mini-batching and momentum techniques.
Optimization for Training Deep Models
Topic-1: Introduction to Optimization for Training Deep Models
 Deep learning algorithms require optimization in various scenarios. For instance, optimizing
models like PCA involves solving an optimization problem.
 We often use analytical optimization to create proofs or design algorithms.
 Among the different optimization problems in deep learning, neural network training is considered
the most challenging.
 It typically requires significant time and computational resources, with days to months of
investment on multiple machines to solve a single instance.
 Due to the importance and cost involved, specialized optimization techniques have been developed
specifically for neural network training.
 This chapter introduces these techniques for optimizing neural network training.
 It assumes a basic understanding of gradient-based optimization principles, which are briefly
covered in Chapter 4.
 The goal is to find the parameters (θ) of a neural network that minimize a cost function (J(θ)).
 This cost function typically includes a performance measure evaluated on the entire training set and
regularization terms.
 Training a neural network involves specific challenges compared to traditional optimization.
 The chapter introduces practical algorithms for optimization, including initialization strategies and
algorithms that adapt learning rates or utilize second derivatives of the cost function.
 It also discusses higher-level procedures formed by combining simple optimization algorithms.

8.1 How Learning Differs from Pure Optimization


 Optimization algorithms used for training deep models differ from traditional ones in a few ways.
In machine learning, we often care about a performance measure (P) defined with respect to the test
set, which can be complex to calculate.
 Therefore, we indirectly optimize P by minimizing a different cost function (J(θ)).
 The goal is to improve P by reducing J.
 This is different from pure optimization, where minimizing J is the main objective.
 Additionally, optimization algorithms for training deep models are usually specialized to the
specific structure of machine learning objective functions. 

Typically, the cost function can be written as an average over the training set, such as

J(θ) = E(x,y)∼p̂data L(f(x; θ), y)   (8.1)
 In the supervised learning case, we have a per-example loss function L that measures the
discrepancy between the predicted output f(x; θ) and the target output y for a given input x.
 The objective function (equation 8.1) is defined based on this loss function, the empirical
distribution p̂data, and the training set.
 However, it is possible to extend this framework to include additional arguments such as θ or x, or
exclude y, to develop regularization or unsupervised learning approaches.
 Ideally, we aim to minimize the objective function by taking the expectation across the entire
data-generating distribution pdata rather than just the finite training set.

8.1.1 Empirical Risk Minimization
 The objective of a machine learning algorithm is to reduce the expected generalization error, also
known as the risk:

J*(θ) = E(x,y)∼pdata L(f(x; θ), y)   (8.2)

 This expectation is taken over the true underlying distribution pdata of the data. If we knew this
distribution, risk minimization would be a solvable optimization task. However, in machine
learning, we typically do not have access to pdata but only have a training set.
 To convert the machine learning problem into an optimization problem, we minimize the expected
loss on the training set.
 This involves replacing the true distribution (p(x, y)) with the empirical distribution (pˆ(x, y))
defined by the training set.
 In simpler terms, we aim to minimize the empirical risk:

E(x,y)∼p̂data[L(f(x; θ), y)] = (1/m) Σi=1..m L(f(x(i); θ), y(i))   (8.3)

where m is the number of training examples.


 Empirical risk minimization is the training process where we aim to minimize the average training
error.
 Instead of directly optimizing the true risk, we optimize the empirical risk and hope that the true
risk also decreases significantly.
 However, this approach is prone to overfitting, as high-capacity models can simply memorize the
training set.
 Additionally, many useful loss functions, such as 0-1 loss, have no useful derivatives (the
derivative is either zero or undefined almost everywhere), making traditional gradient-based
optimization challenging.
 Therefore, in the context of deep learning, we often use a different approach where the quantity we
optimize further differs from the true objective we want to optimize.
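
The following is a minimal NumPy sketch (not from the textbook) of computing the empirical risk of equation 8.3 for a hypothetical linear model with squared loss; all names and data here are illustrative assumptions.

import numpy as np

def empirical_risk(theta, X, y):
    # Average per-example loss over the training set (equation 8.3),
    # for a toy linear model f(x; theta) = x . theta with squared loss.
    predictions = X @ theta                # f(x(i); theta) for every example
    losses = (predictions - y) ** 2        # per-example loss L
    return losses.mean()                   # (1/m) * sum_i L(f(x(i); theta), y(i))

# Hypothetical toy data: m = 100 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

print(empirical_risk(np.zeros(3), X, y))   # risk of an untrained model (large)
print(empirical_risk(theta_true, X, y))    # near the noise floor (~0.01)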

8.1.2 Surrogate Loss Functions and Early Stopping


 Sometimes, the loss function we want to optimize, like classification error, is not efficiently
solvable.
 In such cases, we use surrogate loss functions as proxies that have advantages.
 For example, the negative log-likelihood is often used as a surrogate for 0-1 loss. This allows the
model to estimate conditional probabilities of classes and select those with the lowest expected
classification error. Surrogate loss functions can even enable further learning.
 For instance, the test set 0-1 loss may continue to decrease after the training set 0-1 loss reaches
zero when using the log-likelihood surrogate.
 This is because pushing the classes apart improves the classifier's robustness, extracting more
information from the training data beyond simply minimizing the average training set 0-1 loss.
 In training algorithms for machine learning, there is a significant difference compared to general
optimization.
 Instead of halting at a local minimum, machine learning algorithms minimize a surrogate loss
function but stop based on an early stopping criterion.
 This criterion, often based on the true underlying loss function (e.g., 0-1 loss measured on a
validation set), is designed to prevent overfitting.

 Training stops even if the surrogate loss function still has large derivatives, which is unlike pure
optimization where convergence occurs when the gradient becomes very small.
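
To make the surrogate-loss idea concrete, here is a small illustrative NumPy sketch (a toy under assumed data, not from the notes): once a logistic classifier gets every example right, the 0-1 loss is stuck at zero and offers no gradient, but the negative log-likelihood surrogate still decreases as the classifier becomes more confident.

import numpy as np

def zero_one_loss(scores, y):
    # 0-1 loss: fraction misclassified. It is piecewise constant, so its
    # derivative is zero almost everywhere and useless for gradient descent.
    return np.mean((scores > 0).astype(int) != y)

def nll_surrogate(scores, y):
    # Negative log-likelihood of a logistic model: a smooth surrogate for 0-1 loss.
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 1, 0, 0])
scores_low  = np.array([0.1, 0.2, -0.1, -0.3])   # correct but low confidence
scores_high = np.array([3.0, 4.0, -3.0, -5.0])   # correct and confident

print(zero_one_loss(scores_low, y), zero_one_loss(scores_high, y))   # 0.0 0.0
print(nll_surrogate(scores_low, y), nll_surrogate(scores_high, y))   # surrogate still improves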

8.1.3 Batch and Minibatch Algorithms
 In machine learning algorithms, the objective function is often a sum over the training
examples.
 This is different from general optimization algorithms. Machine learning optimization computes
parameter updates using only a subset of the terms from the full cost function.
 For instance, maximum likelihood estimation problems decompose into a sum over each example
when viewed in log space:

θML = argmaxθ Σi=1..m log pmodel(x(i), y(i); θ)   (8.4)

 Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution
defined by the training set:

J(θ) = E(x,y)∼p̂data log pmodel(x, y; θ)   (8.5)

 Most of the properties of the objective function J used by most of our optimization algorithms
are also expectations over the training set. For example, the most commonly used property is the
gradient:

∇θJ(θ) = E(x,y)∼p̂data ∇θ log pmodel(x, y; θ)   (8.6)
 Computing the expectation exactly in machine learning algorithms can be expensive as it involves
evaluating the model on every example in the dataset.
 Instead, we can compute these expectations by randomly sampling a small number of examples and
averaging over those samples.
 The standard error of the mean estimated from n samples is σ/√n (where σ is the true standard
deviation), so it decreases less than linearly as the number of examples increases.
 For instance, using 10,000 examples decreases the standard error by only a factor of 10 compared
to using 100 examples, even though it requires 100 times more computation.
 Optimization algorithms converge faster if they can quickly approximate the gradient instead of
computing it slowly and exactly.
 When learning from a small number of samples, it's beneficial to estimate the
gradient statistically instead of using all samples.
 This is because some training sets may have redundant samples, which can be
computationally expensive to process using the naive approach.
 In the worst case, all samples could be identical copies, but this is unlikely.
However, large numbers of similar examples can still occur.
 Batch gradient methods process all training examples simultaneously, but this
can be slow for large datasets.
 Minibatch stochastic gradient descent uses smaller groups of examples, which
is faster and more efficient.

 The term "batch" is sometimes used to describe both the full training set and
a group of examples, so it's important to clarify which meaning is being used.
 Stochastic algorithms in deep learning use a small group of examples at a time
instead of all of them.
 These are called minibatch methods and are different from online methods that
use just one example at a time, either from a stream or a fixed-size training
set.
 Methods that use more than one but fewer than all of the training examples were traditionally
called minibatch or minibatch stochastic methods; they fall somewhere in between batch and
online methods.
 These methods are commonly used in deep learning because they provide a
balance between the computational efficiency of online methods and the
accuracy of using all examples at once.
 When using stochastic methods like stochastic gradient descent, the size of the
batches used can affect performance.
 Larger batches provide more accurate gradient estimates, but smaller batches
can be faster on multicore architectures and may offer a regularizing effect.
 The amount of memory required for processing the batch can also limit the
size, and power of 2 batch sizes may offer better runtime on certain hardware.
 However, very small batch sizes like 1 can require a smaller learning rate to
maintain stability due to high variance in the gradient estimate, which can
result in longer total runtime due to the need to make more steps.
 Some algorithms use information from the minibatch differently and can
handle smaller batch sizes, while others require larger batch sizes due to
sensitivity to sampling error.
 Second-order methods that use the Hessian matrix can be more sensitive and
require larger batch sizes to minimize fluctuations in estimates.
 Estimation errors in the gradient can amplify with multiplication by the
Hessian or its inverse, especially if the Hessian has a poor condition number.
 This can cause large changes in the update, even with a perfectly estimated
Hessian. Overall, larger batch sizes are needed to minimize errors in second-
order methods.
 When training a neural network, it's important to select minibatches randomly
to get an unbiased estimate of the gradient.
 This requires the samples to be independent, so two subsequent gradient
estimates should also be independent. Some datasets may have correlated
examples, like a list of medical test results for multiple patients.
 To avoid bias, it's necessary to shuffle the examples before selecting
minibatches, especially for large datasets where sampling uniformly at
random may be impractical (see the minibatch sketch at the end of this section).

 When training a machine learning model, it's often enough to shuffle the data
once and keep it in that order.
 This fixes the order of consecutive examples used in each batch, but doesn't
significantly harm the model's performance.
 Not shuffling the data at all can reduce effectiveness.
 In some cases, we can compute updates for multiple batches simultaneously
because the optimization problem breaks down into separate updates for each
example.
 This allows for parallel and distributed training, as discussed further in section
12.1.3.
 Minibatch stochastic gradient descent follows the true generalization error
gradient when no examples are repeated.
 This is because each minibatch provides an unbiased estimate of the error on
the entire dataset during the first pass.
 However, on subsequent passes, the estimate becomes biased due to the re-
sampling of previously used examples.
 In online learning, where examples are drawn from a stream, every experience
is a fair sample from the data-generating distribution, and there are no repeated
examples.
 This makes it easier to see that stochastic gradient descent minimizes
generalization error in this scenario.
 The equivalence is easiest to derive when both x and y are discrete.
 In this case, the generalization error (equation 8.2) can be written as a sum

J*(θ) = Σx Σy pdata(x, y) L(f(x; θ), y)   (8.7)

with the exact gradient

g = ∇θJ*(θ) = Σx Σy pdata(x, y) ∇θL(f(x; θ), y)   (8.8)
 The fact that the gradient of the log-likelihood can be estimated by sampling
from the data distribution and computing the gradient for a minibatch was
previously shown.
 This holds for other functions L as well, as long as certain mild assumptions
are met for continuous variables.

 Therefore, we can obtain an unbiased estimator of the exact gradient of the
generalization error by sampling a minibatch from the data distribution and
computing the gradient of the loss for that batch.

 When training a machine learning model, updating the parameters in the direction of the
gradient of the generalization error (the error on unseen data) helps reduce the gap between
the training and generalization errors.
 This is called stochastic gradient descent (SGD). However, to get the best results, it's often
necessary to make multiple passes through the training data, unless the training set is very
large.
 In this case, the first pass follows the unbiased gradient of the generalization
error, but subsequent passes can increase the gap between the training and
generalization errors.
 This is because the training error decreases, but the generalization error may
not continue to decrease.
 As datasets grow faster than computing power, it's becoming more common
to use each training example only once or to make an incomplete pass through
the data.
 In this case, the main concerns are underfitting and computational efficiency,
as overfitting is not a major issue with very large datasets. (Source: Bottou and
Bousquet, 2008)
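
The sketch below pulls the main points of this section together in NumPy: shuffle once up front, sample minibatches, and observe that the standard error of the minibatch gradient estimate shrinks only like 1/sqrt(n). The linear-regression setup and all names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression data set of m examples.
m, d = 10_000, 5
X = rng.normal(size=(m, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=m)
theta = np.zeros(d)

def grad_per_example(theta, X, y):
    # Per-example gradients of squared loss for a linear model (one row each).
    residual = X @ theta - y
    return 2.0 * X * residual[:, None]

# Shuffle once up front so consecutive minibatches behave like independent samples.
perm = rng.permutation(m)
X, y = X[perm], y[perm]

# The standard error of a minibatch gradient estimate scales as 1/sqrt(n):
# 100x more examples per batch buys only about a 10x reduction in error.
for n in (100, 10_000):
    samples = []
    for _ in range(200):
        idx = rng.choice(m, size=n)                    # sample a minibatch
        samples.append(grad_per_example(theta, X[idx], y[idx]).mean(axis=0))
    print(n, np.std(samples, axis=0).mean())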


2nd Topic: Challenges in Neural Network Optimization

Training neural networks involves optimizing an objective function, which can be challenging for
the following reasons:
1. Non-convexity: Unlike traditional optimization problems that are convex, neural networks
often have non-convex objective functions. This means that finding the global minimum, which
represents the best solution, is difficult as the function may have many local minima.
2. Designing the objective function: Creating an objective function that accurately represents
the desired outcome can be complex. It requires careful consideration of the problem at hand
and defining the right metrics to measure the network's performance.
3. Constraints: Neural network optimization may involve constraints, such as limits on the
model's complexity or the amount of available data. Incorporating these constraints into the
optimization process can add further complexity.
4. Computational cost: Training deep neural networks can be computationally expensive and
time-consuming. The optimization process often requires evaluating the objective function
numerous times, which can be a bottleneck if not efficiently managed.
5. Overfitting: Neural networks have the tendency to overfit, meaning they may memorize the
training data too well and struggle to generalize to unseen data. Optimization techniques need
to address this issue to ensure the network learns meaningful patterns rather than simply
memorizing the training set.

6. Local vs. global minima: Due to the non-convex nature of neural network optimization, it
is challenging to distinguish between local and global minima. Sometimes, the optimization
process might get stuck in a poor local minimum instead of reaching the global minimum,
affecting the network's performance.
7. Vanishing and exploding gradients: Optimization can be hindered by the vanishing or
exploding gradient problem, where gradients either become too small or too large during
backpropagation. This issue can make it difficult for the network to converge to an optimal
solution.
8. Hyperparameter selection: Neural networks have various hyperparameters that need to be
tuned, such as learning rate, batch size, and regularization parameters. Identifying the optimal
values for these hyperparameters requires careful experimentation and can be a challenging
optimization problem itself.


Optimizing neural networks involves dealing with non-convexity, designing appropriate
objective functions, managing constraints, addressing computational costs, preventing
overfitting, handling local and global minima, managing gradient issues, and tuning
hyperparameters. These challenges make neural network optimization a complex task.
8.2.1. Ill Conditioning

1. Optimizing even convex functions can be challenging due to a problem called
ill-conditioning of the Hessian matrix.
2. Ill-conditioning is a common issue in numerical optimization and
can affect both convex and non-convex problems.
3. Ill-conditioning can cause problems for optimization algorithms like
stochastic gradient descent (SGD) because even small steps can
increase the cost function.
4. Equation 4.9 shows that a second-order Taylor series expansion of
the cost function predicts that a gradient descent step of −εg will add

(1/2)ε²gTHg − εgTg   (1)

to the cost.
5. Ill-conditioning becomes problematic when the curvature term (1/2)ε²gTHg
grows larger than the gradient term εgTg, because then the step increases the cost.
6. To determine if ill-conditioning affects neural network training, we
can monitor the squared gradient norm gTg and the gTHg term.
7. In many cases, the gradient norm (gTg) doesn't decrease significantly
during learning, but the gTHg term (related to curvature) increases
significantly.
8. As a result, learning becomes slow because the learning rate needs
to be reduced to compensate for the stronger curvature.


9. Figure 8.1 illustrates an example where the gradient norm increases
significantly during successful training of a neural network.
10. Ill-conditioning is not only limited to neural network training but
occurs in other settings as well.
11. However, techniques used to address ill-conditioning in other
contexts might not be as effective for neural networks.
12. For example, Newton's method, which is useful for minimizing
convex functions with poorly conditioned Hessians, requires
modifications when applied to neural networks.
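
As a concrete illustration of monitoring gTg and gTHg without ever forming the Hessian, here is a small NumPy sketch on a deliberately ill-conditioned quadratic; the finite-difference Hessian-vector product and the toy cost function are assumptions for illustration only.

import numpy as np

# Toy quadratic J(theta) = 0.5 * theta^T A theta with an ill-conditioned
# Hessian A (condition number 100).
A = np.diag([1.0, 100.0])

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
g = grad(theta)

# Finite-difference Hessian-vector product: Hg ~ (grad(theta + r*g) - grad(theta)) / r
r = 1e-5
Hg = (grad(theta + r * g) - grad(theta)) / r

gTg = g @ g
gTHg = g @ Hg
# A gradient step of -eps*g decreases the cost only while eps*gTg outweighs
# (1/2)*eps^2*gTHg, i.e. roughly for eps < 2*gTg / gTHg.
print(gTg, gTHg, 2 * gTg / gTHg)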

Figure 8.1: Gradient descent often does not arrive at a critical point of
any kind. In this example, the gradient norm increases throughout the
training of a convolutional network used for object detection.
 (Left) A scatterplot showing how the norms of individual gradient
evaluations are distributed over time. To improve legibility, only one
gradient norm is plotted per epoch.


 The running average of all gradient norms is plotted as a solid curve.
The gradient norm clearly increases over time, rather than decreasing as
we would expect if the training process converged to a critical point.
 (Right) Despite the increasing gradient, the training process is
reasonably successful. The validation set classification error decreases
to a low level.
1. Gradient descent is a commonly used optimization algorithm in
machine learning.
2. However, it doesn't always reach a critical point, which is a point
where the gradient is zero and could indicate a minimum or
maximum of the function being optimized.
3. In the example described, a convolutional network is being trained
for object detection.
4. The scatterplot on the left shows the distribution of gradient norms
over time. The gradient norm measures the size or magnitude of
the gradient.
5. In order to make the plot clearer, only one gradient norm is shown
per epoch (a complete pass through the training data).
6. The solid curve represents the running average of all gradient
norms.
7. Interestingly, the gradient norm increases over time instead of
decreasing, as we would expect if the training process was
converging to a critical point.


8. Despite the increasing gradient norm, the training process is still
successful in terms of the performance on the validation set.
9. The validation set classification error decreases to a low level,
indicating that the trained model is performing well in detecting
objects.
8.2.2 Local Minima
1. Convex optimization problems have the property that finding a local minimum solves the
problem, and any local minimum is also a global minimum. Some convex functions may have
a flat region at the bottom, and any point within that region is an acceptable solution.
2. Non-convex functions, like neural networks, can have many local minima. This is due to the
model identifiability problem, where equivalent models can be obtained by exchanging latent
variables.
3. Weight space symmetry is a type of non-identifiability in neural networks. It occurs when
we can rearrange the hidden units in multiple ways, resulting in different local minima that
have the same cost function value.
4. Other causes of non-identifiability in neural networks include scaling the incoming weights
of a unit by a factor while scaling its outgoing weights by the inverse factor, which leaves the
network's function unchanged and yields a continuum of equivalent parameter settings (e.g., an
(m × n)-dimensional hyperbola of equivalent local minima).
5. While non-identifiability can result in a large or infinite number of local minima, these
minima are equivalent in terms of cost function value.
6. Local minima can be problematic if they have a higher cost than the global minimum.
However, it is now believed that for sufficiently large neural networks, most local minima have
low-cost values, and it is not necessary to find the true global minimum.
7. It is an active area of research to determine whether there are many local minima with high
cost for practical neural networks and whether optimization algorithms encounter them.
8. Testing the norm of the gradient over time can help rule out local minima as the problem. If
the norm of the gradient does not significantly shrink, the issue is not related to local minima.
9. It can be challenging to positively establish that local minima are the problem in high-
dimensional spaces, as there can be other structures with small gradients.
10. It is important for practitioners to carefully test for specific optimization problems rather
than attributing all difficulties to local minima.
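
A quick NumPy demonstration of the weight-space symmetry mentioned in point 3: permuting the hidden units of a one-hidden-layer network (rows of the first weight matrix together with the matching columns of the second) yields a different parameter vector that computes exactly the same function. The tiny network here is a hypothetical example.

import numpy as np

rng = np.random.default_rng(0)

# One hidden layer with tanh: f(x) = W2 @ tanh(W1 @ x + b1) + b2
W1 = rng.normal(size=(4, 3)); b1 = rng.normal(size=4)
W2 = rng.normal(size=(1, 4)); b2 = rng.normal(size=1)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=3)
perm = np.array([2, 0, 3, 1])   # any permutation of the 4 hidden units

out_original = net(x, W1, b1, W2, b2)
out_permuted = net(x, W1[perm], b1[perm], W2[:, perm], b2)
print(np.allclose(out_original, out_permuted))   # True: equivalent minima by symmetry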
8.2.3 Plateaus, Saddle Points and Other Flat Regions


1. Local minima and maxima in high-dimensional non-convex functions are rare compared to
saddle points, which are points with zero gradient.
2. Saddle points have a mix of positive and negative eigenvalues in the Hessian matrix, and
points along these eigenvalues can have higher or lower costs than the saddle point.
3. In low-dimensional spaces, local minima are common, but in higher-dimensional spaces,
saddle points are more common.
4. The expected ratio of saddle points to local minima grows exponentially with the
dimensionality of the function.
5. The Hessian matrix at a local minimum only has positive eigenvalues, while a saddle point
has a mixture of positive and negative eigenvalues.
6. Local minima are more likely to have low costs, while critical points with high costs are
more likely to be saddle points or local maxima.
7. Shallow autoencoders without nonlinearities have global minima and saddle points, but no
local minima with higher cost than the global minimum.
8. Real neural networks also have loss functions that contain many high-cost saddle points.
9. The implications of saddle points for training algorithms, especially those using gradient
information, are unclear.
10. Gradient descent can sometimes escape saddle points, even though the gradient can become
small near them.
11. Continuous-time gradient descent may be repelled from nearby saddle points, but the
situation may differ for more realistic uses.
12. Newton's method is affected by saddle points and faces challenges when encountering
them.
13. Gradient descent: It is a method that moves in the direction of decreasing values to find
the minimum point of a function. It doesn't explicitly aim to find a critical point.
14. Newton's method: It is a method that tries to find a point where the gradient of a function
is zero. However, it can sometimes jump to a saddle point if not modified properly.
15. Saddle-free Newton method: Introduced by Dauphin et al. in 2014, it is a modified version
of Newton's method that addresses the issue of jumping to saddle points. It has shown
improvements over the traditional version.
16. Difficulty in scaling second-order methods: Second-order methods, like Newton's
method, are challenging to apply to large neural networks. This is because of the presence
of many saddle points in high-dimensional spaces.

6
Prepared By: Dr K Madan Mohan Associate Professor, Department of CSE (AIML),
Sreyas Institute of Engineering and Technology, Bandlaguda, Nagole, Hyd-68.

17. Types of points with zero gradient: Besides minima and saddle points, there are also
maxima. Maxima are similar to saddle points in terms of optimization, as many algorithms
are not attracted to them, except unmodified Newton's method.
18. Rarity of maxima and minima in high-dimensional space: Maxima and minima of
random functions become exponentially rare as the dimensionality increases.
19. Wide, flat regions of constant value: In these regions, both the gradient and the Hessian
(second derivative) of the function are zero. Such locations pose difficulties for numerical
optimization algorithms.
20. Wide, flat regions in convex and general optimization problems: In a convex problem,
a wide, flat region consists entirely of global minima. However, in a general optimization
problem, such a region could correspond to a high value of the objective function.
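
The following NumPy sketch (an illustrative toy, not from the notes) shows the defining property of a saddle point on f(w) = w1² − w2²: the gradient vanishes at the origin, the Hessian has mixed-sign eigenvalues, and gradient descent started slightly off the unstable axis is eventually repelled from the saddle.

import numpy as np

# Hessian of f(w) = w1^2 - w2^2 at the critical point w = 0.
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])
print(np.linalg.eigvalsh(H))   # [-2.  2.]: mixed signs -> a saddle, not a minimum

# Gradient descent from a point slightly off the w2 = 0 axis.
w = np.array([1.0, 1e-6])
for _ in range(100):
    g = np.array([2 * w[0], -2 * w[1]])   # gradient of f
    w -= 0.1 * g
print(w)   # w1 -> 0, but |w2| has grown: the iterate escapes along -w2^2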

Figure 8.2: A visualization of the cost function of a neural network.
Image adapted with permission from Goodfellow et al. (2015). These
visualizations appear similar for feedforward neural networks,
convolutional networks, and recurrent networks applied to real object
recognition and natural language processing tasks.
Key observations from the above figure:
1. This visualization is applicable to different types of neural networks used for tasks like
object recognition and natural language processing.


2. Surprisingly, these visualizations do not typically show many obvious obstacles or complex
structures in the cost function.
3. Before 2012, it was believed that neural net cost functions had more non-convex structures
than what these projections reveal.
4. The main obstacle shown in this visualization is a saddle point with high cost near the initial
parameters.
5. However, stochastic gradient descent (SGD) training easily escapes this saddle point.
6. Most of the training time is spent in the relatively flat valley of the cost function.
7. The flatness of the valley could be due to noisy gradients, poorly conditioned Hessian
matrix, or the need to take an indirect path to avoid a tall "mountain" in the figure.
8.2.4 Cliffs and Exploding Gradients
1. Neural networks with many layers can have steep regions resembling cliffs, caused by the
multiplication of large weights.
2. These cliffs can be problematic because the gradient update step can move the parameters
too far, potentially jumping off the cliff structure altogether.
3. The objective function for highly nonlinear deep neural networks or recurrent neural
networks often contains sharp nonlinearities in parameter space, resulting from the
multiplication of several parameters.
4. These nonlinearities lead to high derivatives in certain places, making the optimization
process challenging.
5. Approaching the cliff structure from above or below can be dangerous, as it can result in
losing most of the optimization progress.
6. However, the gradient clipping heuristic, described in section 10.11.1, can help mitigate the
consequences of cliffs.
7. The gradient clipping heuristic intervenes when the traditional gradient descent algorithm
suggests a large step, reducing the step size to stay within the region where the gradient
indicates the direction of steepest descent.
8. Cliff structures are particularly common in cost functions for recurrent neural networks due
to the multiplication of many factors, especially in long temporal sequences.
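
Here is a minimal sketch of the norm-based gradient clipping heuristic mentioned above, written in NumPy under the usual convention (rescale the gradient when its norm exceeds a threshold); the threshold value is an arbitrary illustrative choice.

import numpy as np

def clip_gradient(g, threshold):
    # If ||g|| exceeds the threshold, rescale g to have norm == threshold.
    # The direction of steepest descent is kept; only the step size is bounded,
    # so a single cliff-like gradient cannot throw the parameters far away.
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([300.0, -400.0])            # a cliff-like gradient with ||g|| = 500
print(clip_gradient(g, threshold=5.0))   # [ 3. -4.]: same direction, norm capped at 5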

8.2.5 Long Term Dependencies


1. Neural network optimization algorithms face difficulties with deep
computational graphs in both feedforward and recurrent networks.


2. Repeated application of the same parameters in recurrent networks
creates pronounced difficulties, known as the vanishing and exploding
gradient problem.
Example: Suppose that a computational graph contains a path that consists
of repeatedly multiplying by a matrix W. After t steps, this is equivalent
to multiplying by W^t. Suppose that W has an eigendecomposition
W = V diag(λ)V^−1. In this simple case, it is straightforward to see that

W^t = (V diag(λ)V^−1)^t = V diag(λ)^t V^−1   (8.11)

Any eigenvalues λi that are not near an absolute value of 1 will either
explode if they are greater than 1 in magnitude or vanish if they are less
than 1 in magnitude. The vanishing and exploding gradient problem refers
to the fact that gradients through such a graph are also scaled according
to diag(λ)^t.
3. When a computational graph involves repeated matrix
multiplication, the resulting matrix after t steps (Wt) depends on the
eigenvalues of the matrix W.
4. Eigenvalues that are far from 1 in magnitude can cause gradients to
either explode (if greater than 1) or vanish (if less than 1), impacting
parameter updates.
5. Vanishing gradients make it challenging to determine the direction
for parameter improvement, while exploding gradients can lead to
unstable learning.
6. The repeated multiplication in recurrent networks is similar to the
power method algorithm used to find the largest eigenvalue of a matrix.


7. Recurrent networks use the same matrix at each time step, while
feedforward networks do not, allowing feedforward networks to largely
avoid the vanishing and exploding gradient problem.
8. Further discussion on the challenges of training recurrent networks
will be deferred until section 10.7, after recurrent networks have been
described in more detail.
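
Equation 8.11 is easy to verify numerically. The NumPy sketch below (a toy assumption: a diagonal W, so the eigenvalues are explicit) shows one component exploding like 1.1^t while the other vanishes like 0.9^t.

import numpy as np

# W has eigenvalues 1.1 and 0.9; repeated multiplication scales signals
# by diag(lambda)^t, as in equation 8.11.
W = np.diag([1.1, 0.9])
v = np.array([1.0, 1.0])

for t in (1, 10, 50, 100):
    print(t, np.linalg.matrix_power(W, t) @ v)
# The first component grows like 1.1^t (explodes);
# the second shrinks like 0.9^t (vanishes).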
8.2.6 Inexact Gradients
1. Most optimization algorithms assume that we have access to the exact gradient or Hessian
matrix, but in practice, we often only have noisy or biased estimates of these quantities.
2. Deep learning algorithms typically use a minibatch of training examples to compute an
approximate gradient.
3. In some cases, the objective function we want to minimize is intractable, which means its
gradient is also intractable. In these situations, we can only approximate the gradient.
4. Advanced models, especially in part III, can encounter these issues more frequently.
5. Contrastive divergence is a technique used to approximate the gradient of the intractable log-
likelihood of a Boltzmann machine.
6. Neural network optimization algorithms are designed to handle imperfections in the gradient
estimate.
7. Another approach is to choose a surrogate loss function that is easier to approximate than
the true loss function.
8.2.7 Poor Correspondence between local and global structure
1. Problems with the loss function: The loss function can be poorly conditioned, contain cliffs
or saddle points, making it difficult to make progress in optimization.
2. Poor correspondence between local and global structure: Even if the problems at a single
point are overcome, if the direction of improvement locally does not point towards regions of
lower cost globally, the performance can still be poor.


Figure 8.4: Optimization based on local downhill moves can fail if the local
surface does not point toward the global solution. Here we provide an example
of how this can occur, even if there are no saddle points and no local minima.
Key points about the above figure:
1. Optimization based on local downhill moves can fail if the local surface does not point
towards the global solution.
2. Even without saddle points or local minima, a cost function with only asymptotes towards
low values can cause difficulties.
3. The main problem arises when starting on the wrong side of a "mountain" and being unable
to cross it.
4. In higher-dimensional space, learning algorithms can sometimes go around such mountains,
but this may lead to long trajectories and excessive training time, as shown in figure 8.2.
3. Length of the learning trajectory: Training time is often determined by the length of the
trajectory needed to reach the solution. Figure 8.2 illustrates that the trajectory often follows a
wide arc around a mountain-shaped structure.

4. Neural networks don't reach critical points: In practice, neural networks do not arrive at
a critical point, such as a global or local minimum or a saddle point.
5. Lack of small gradient regions: Figure 8.1 shows that neural networks often do not reach
regions of small gradient. Critical points may not exist, and the loss function may
asymptotically approach a certain value as the model becomes more confident.
6. Failure of local optimization: Local optimization may fail to find a good cost function
value, even without local minima or saddle points. Figure 8.4 provides an example of this.


7. Research focus: Current research aims to find good initial points for problems with difficult
global structure, rather than developing algorithms that use non-local moves.
8. Learning algorithms based on local moves: Gradient descent and most effective learning
algorithms for neural networks rely on making small, local moves. However, computing the
correct direction of these moves can be challenging.
9. Approximations and limitations: In some cases, we can only approximate properties of
the objective function, like its gradient, with bias or variance. This can lead to issues like poor
conditioning or discontinuous gradients, making the region where the gradient is reliable very
small.
10. High computational cost: In cases where the objective function has issues and the local
descent direction can only be computed with small steps, following the path of local descent
incurs a high computational cost due to the large number of steps involved.

8.2.8 Theoretical Limits of Optimization


1. Theoretical results show that there are limits on the performance of optimization algorithms
for neural networks.
2. However, these results often don't affect the practical use of neural networks.
3. Some results only apply to networks that output discrete values, while most neural networks
output smoothly increasing values that can be optimized effectively.
4. Certain problem classes may be intractable, but determining if a specific problem falls into
that class is challenging.
5. In some cases, finding a solution for a given network size may be intractable, but using a
larger network can make finding an acceptable solution easier.
6. In neural network training, we are usually more concerned with reducing the function's
value enough to achieve good generalization error, rather than finding the exact minimum.
7. Theoretical analysis of whether an optimization algorithm can achieve this goal is very
difficult.
8. It is important for machine learning research to develop more realistic bounds on the
performance of optimization algorithms.


3rd Topic: Basic Algorithms


1. Gradient Descent: This is an algorithm that helps us find the lowest point (minimum)
on a curve by following the steepest downhill path, which is the gradient of the curve. In
machine learning, we use this to minimize the error in our models by adjusting the weights
of our neurons.
2. Stochastic Gradient Descent (SGD): Instead of following the gradient of the entire
training set, SGD follows the gradient of randomly selected small sets of data called
minibatches. This makes the algorithm faster and more efficient because it only needs to
calculate the gradient for a subset of the data instead of the entire set.
3. SGD is a popular optimization algorithm used in machine learning, especially in deep
learning, because it is easy to implement and works well with large datasets. It also helps
prevent getting stuck in local minima, which can happen with other optimization methods.
4. SGD works by selecting a minibatch of m examples from the data and calculating their
gradients. These gradients are then averaged to get an unbiased estimate of the true gradient
for the entire dataset. This estimate is used to update the weights of our neurons in the
direction opposite to the gradient, which moves us closer to the minimum error point.
5. SGD can be further improved by using variants such as mini-batch gradient descent with
momentum (MBGD-M) or adaptive learning rate methods like Adam or RMSprop, which
adjust the learning rate based on the history of gradients and weights updates. These
techniques help accelerate convergence and improve stability during training.
Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k

Step-1: Require: Learning rate εk.

Step-2: Require: Initial parameter θ.

Step-3: while stopping criterion not met do

Step-4: Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with corresponding targets y(i).

Step-5: Compute gradient estimate: ĝ ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))

Step-6: Apply update: θ ← θ − εk ĝ

Step-7: end while



Explanation of Steps of the above algorithm:


1. SGD is an optimization algorithm used to find the lowest point (minimum) on a curve
by following the steepest downhill path, which is the gradient of the curve. In machine
learning, we use this to minimize the error in our models by adjusting the weights of our
neurons.
2. SGD works by selecting a small set of m examples from the training set called a
minibatch, calculating their gradients, and then averaging them to get an unbiased estimate
of the true gradient for the entire dataset.
3. At each training iteration k, we start with an initial parameter θ and a learning rate εk.
We then repeatedly sample a minibatch of m examples, compute their gradients, and update
our parameter θ in the direction opposite to the gradient estimate gˆ using the learning rate
εk.
4. We keep repeating this process until we meet a stopping criterion, such as when the error
has reached a certain threshold or when we have trained for a certain number of iterations.
5. SGD is an efficient algorithm because it only needs to calculate the gradient for a subset
of the data instead of the entire set, making it faster and more practical for large datasets.
Additionally, SGD helps prevent getting stuck in local minima, which can happen with
other optimization methods.
 In SGD (Stochastic Gradient Descent), a key factor is the learning rate, which was
previously fixed at ε.
 In reality, the learning rate should decrease gradually over time, so we now call it εk for
iteration k.
 This is necessary because the noisy gradient estimator used in SGD doesn't disappear
even when we reach a minimum, unlike the true gradient of the total cost function in
batch gradient descent, which becomes small and then zero as we approach and reach a
minimum.
 Therefore, batch gradient descent can use a fixed learning rate because its gradient
becomes small and then 0 at a minimum, but SGD needs a decreasing learning rate due
to the added noise from random sampling of m training examples.
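
Putting the pieces together, here is a minimal NumPy sketch of Algorithm 8.1 with a learning rate that decays over the first tau iterations (a commonly used linear schedule). The least-squares problem, the schedule constants, and all function names are illustrative assumptions, not part of the notes.

import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_fn, theta, X, y, batch_size=32, eps0=0.1, eps_tau=0.001,
        tau=1000, n_iters=2000):
    # Minimal SGD (Algorithm 8.1). The learning rate eps_k decays linearly
    # from eps0 to eps_tau over the first tau iterations, then stays constant.
    m = len(X)
    for k in range(n_iters):
        frac = min(k / tau, 1.0)
        eps_k = (1 - frac) * eps0 + frac * eps_tau
        idx = rng.choice(m, size=batch_size)       # sample a minibatch
        g_hat = grad_fn(theta, X[idx], y[idx])     # unbiased gradient estimate
        theta = theta - eps_k * g_hat              # step against the gradient
    return theta

# Hypothetical least-squares problem to exercise the loop.
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.01 * rng.normal(size=1000)

def grad_fn(theta, Xb, yb):
    # Minibatch gradient of mean squared loss for a linear model.
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(Xb)

theta_hat = sgd(grad_fn, np.zeros(5), X, y)
print(np.linalg.norm(theta_hat - theta_true))   # small after training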

II. Momentum:
1. Stochastic gradient descent is a common optimization strategy, but it can be slow in
certain cases. Momentum is a technique designed to accelerate learning, especially in the
face of high curvature, small but consistent gradients, or noisy gradients.
2. Momentum works by keeping track of past gradients and moving in their direction with
an exponentially decaying moving average. This helps to overcome the issue of slow
convergence in these cases.
3. In the momentum algorithm, a velocity variable v is introduced to represent the direction
and speed at which parameters move through parameter space. The velocity is set to an
exponentially decaying average of the negative gradient, which can be thought of as a force
moving a particle through parameter space according to Newton's laws of motion. The
velocity vector v represents the momentum of the particle in this physical analogy.
4. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous
gradients exponentially decay. The update rule is given by:

v ← αv − ε ∇θ ((1/m) Σi L(f(x(i); θ), y(i)))
θ ← θ + v
The size of the step in the momentum algorithm depends on how large and how aligned a
sequence of gradients is: the step size is largest when many successive gradients point in
exactly the same direction. If the algorithm always sees the same gradient g, it will keep
accelerating in the direction of −g until reaching a terminal step size of

ε||g|| / (1 − α)   (8.17)

The hyperparameter α therefore determines how much the maximum speed is multiplied compared
to plain gradient descent; for example, α = .9 corresponds to multiplying the maximum speed
by 10. Common values for α are .5, .9, and .99. α can be adapted over time, but this is less
important than decreasing the learning rate over time.
Algorithm 8.2 Stochastic Gradient Descent (SGD) with momentum
Step-1: Require: Learning rate ε, momentum parameter α.
Step-2: Require: Initial parameter θ, initial velocity v.
Step-3: while stopping criterion not met do
 Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with
corresponding targets y(i).
 Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
 Compute velocity update: v ← αv − εg
 Apply update: θ ← θ + v
Step-4: end while

Explanation of Steps
The SGD with momentum algorithm tries to find the best parameter values for a machine
learning model by adjusting them based on the gradient of the loss function. Here's a simple
explanation of each step:
1. Learning rate (ε) and momentum parameter (α) are required. These determine how much the
parameters will be adjusted and how much influence previous adjustments have on the current
one.
2. Initial parameters (θ) and initial velocity (v) are needed to start the optimization process.
3. A batch of examples is selected from the training data, and their gradients are calculated to
estimate the gradient of the loss function.
4. The velocity is updated based on the momentum parameter and the new gradient estimate.
The velocity is a running average of past gradients, and it helps to smooth out noisy gradients
and improve convergence speed.
5. The parameters are updated based on the velocity and the new gradient estimate. This step
moves the parameters in the direction of steepest descent, but with some influence from
previous steps due to the velocity term.
6. This process continues until a stopping criterion is met, such as reaching a certain level of
accuracy or stopping after a certain number of iterations.
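
Below is a minimal NumPy sketch of Algorithm 8.2; it reuses the same hypothetical grad_fn and least-squares setup as the SGD sketch above, and the hyperparameter values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def sgd_momentum(grad_fn, theta, X, y, eps=0.01, alpha=0.9,
                 batch_size=32, n_iters=2000):
    # SGD with momentum (Algorithm 8.2): v accumulates an exponentially
    # decaying moving average of past negative gradients.
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=batch_size)  # sample a minibatch
        g = grad_fn(theta, X[idx], y[idx])         # gradient estimate
        v = alpha * v - eps * g                    # velocity update
        theta = theta + v                          # parameter update
    return theta

# Same hypothetical least-squares setup as in the SGD sketch.
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.01 * rng.normal(size=1000)
grad_fn = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(Xb)

print(np.linalg.norm(sgd_momentum(grad_fn, np.zeros(5), X, y) - theta_true))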

III. Nesterov Momentum


 Nesterov momentum is a technique that accelerates learning by adding a momentum term to the
update rule, with the gradient evaluated at a look-ahead point.
 Sutskever et al. (2013) introduced this variant of the momentum algorithm, inspired by
Nesterov's accelerated gradient method (Nesterov, 1983, 2004).
The update rules in this case are given by:

v ← αv − ε ∇θ ((1/m) Σi L(f(x(i); θ + αv), y(i)))
θ ← θ + v
1. Nesterov momentum is a variation of the standard momentum method in optimization.

2. The parameters α and ε have similar roles as in standard momentum.

3. The main difference is that Nesterov momentum evaluates the gradient after applying the current
velocity, while standard momentum evaluates it beforehand.

4. This can be seen as adding a correction factor to the standard momentum method.

5. Here's a summary of the Nesterov momentum update (see Algorithm 8.3 below):

1. Initialize the parameter θ and the velocity v.

2. For each iteration:

3. Compute the interim (look-ahead) point: θ̃ ← θ + αv.
4. Compute the minibatch gradient g at the interim point θ̃.
5. Update the velocity: v ← αv − εg.
6. Update the parameter: θ ← θ + v.
7. Repeat until convergence or the maximum number of iterations is reached.
Some Important Points about Nesterov Momentum Algorithm:
1. Nesterov momentum improves the convergence rate of convex batch gradient descent from O(1/k)
to O(1/k^2), but it doesn't work for stochastic gradient descent in the same way.
2. In stochastic gradient descent, the excess error (difference between current cost and minimum
possible cost) is O(1/sqrt(k)) for convex problems and O(1/k) for strongly convex problems after k
iterations.
3. Batch gradient descent has better convergence rates than stochastic gradient descent in theory, but
the Cramér-Rao bound limits the generalization error to decrease no faster than O(1/k).
4. Faster convergence in stochastic gradient descent may lead to overfitting, but its initial progress with
fewer examples makes it more advantageous for large datasets.
5. Gradually increasing the minibatch size during learning can combine the benefits of both batch and
stochastic gradient descent.
6. For machine learning tasks, optimization algorithms whose main advantage is a faster
asymptotic convergence rate may not be worth pursuing, since the generalization error cannot
decrease faster than O(1/k); in practice, however, momentum-style methods still offer
significant benefits.
7. The asymptotic analysis of O(1/k) can obscure the advantages of stochastic gradient descent,
particularly its rapid initial progress with fewer examples.
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum

Step-1: Require: Learning rate ε, momentum parameter α.

Step-2: Require: Initial parameter θ, initial velocity v.

Step-3: while stopping criterion not met do

Step-4: Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with corresponding labels y(i).

Step-5: Apply interim update: θ̃ ← θ + αv

Step-6: Compute gradient (at interim point): g ← (1/m) ∇θ̃ Σi L(f(x(i); θ̃), y(i))

Step-7: Compute velocity update: v ← αv − εg

Step-8: Apply update: θ ← θ + v

Step-9: end while

Explanation of the steps:


Algorithm for Stochastic Gradient Descent with Nesterov Momentum:
1. Require: Learning rate ε, momentum parameter α.
2. Require: Initial parameter θ, initial velocity v.
3. While stopping criterion not met do:
a. Take a small group (minibatch) of m examples from the training set with their
corresponding labels.
b. Make an intermediate update to the parameter: θ˜ ← θ + αv
c. Calculate the gradient of the loss at the intermediate point: g ← (1/m) ∇θ̃ Σi
L(f(x(i); θ̃), y(i))
d. Update the velocity: v ← αv - εg
e. Apply the final update to the parameter: θ ← θ + v
4. End while

 This algorithm is used to find the minimum value of a function in machine learning by
iteratively adjusting the parameter values based on the gradient and velocity calculated
from a small group of examples.
 The momentum parameter helps to improve the convergence rate by adding some weight
to previous updates, while the learning rate controls how much each update affects the
parameter values.
 The stopping criterion can be based on a certain number of iterations, a predefined accuracy,
or other criteria depending on the specific problem being solved.
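
For comparison with the plain momentum sketch earlier, here is a minimal NumPy sketch of Algorithm 8.3; the only change is that the gradient is evaluated at the interim point θ + αv. The setup and hyperparameters are the same illustrative assumptions as before.

import numpy as np

rng = np.random.default_rng(0)

def sgd_nesterov(grad_fn, theta, X, y, eps=0.01, alpha=0.9,
                 batch_size=32, n_iters=2000):
    # SGD with Nesterov momentum (Algorithm 8.3).
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=batch_size)
        theta_interim = theta + alpha * v             # interim (look-ahead) update
        g = grad_fn(theta_interim, X[idx], y[idx])    # gradient at the interim point
        v = alpha * v - eps * g                       # velocity update
        theta = theta + v                             # final parameter update
    return theta

# Same hypothetical least-squares setup as in the earlier sketches.
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.01 * rng.normal(size=1000)
grad_fn = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(Xb)

print(np.linalg.norm(sgd_nesterov(grad_fn, np.zeros(5), X, y) - theta_true))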

4th Topic: Parameter Initialization Strategies


Parameter Initialization

Initializing the parameters of a deep neural network is an important
step in the training process, as it can have a significant impact on the
convergence and performance of the model. Here are some common
parameter initialization techniques used in deep learning:

1. Zero Initialization: Initialize all the weights and biases to zero.
This is not generally used in deep learning, as it leads to symmetry
in the gradients, resulting in all the neurons learning the same
feature.
2. Random Initialization: Initialize the weights and biases
randomly from a uniform or normal distribution. This is the most
common technique used in deep learning.
3. Xavier Initialization: Initialize the weights from a normal
distribution with mean 0 and variance 1/n (standard deviation
sqrt(1/n)), where n is the number of neurons in the previous layer.
This is typically used with the sigmoid or tanh activation functions.
4. He Initialization: Initialize the weights from a normal
distribution with mean 0 and variance 2/n (standard deviation
sqrt(2/n)), where n is the number of neurons in the previous layer.
This is typically used with the ReLU activation function.
5. Orthogonal Initialization: Initialize the weights with an
orthogonal matrix, which preserves the gradient norm during
backpropagation.
6. Uniform Initialization: Initialize the weights with a uniform
distribution. This is less commonly used than random
initialization.


7. Constant Initialization: Initialize the weights and biases with
a constant value. This is rarely used for weights in deep learning,
although biases are commonly initialized to a constant such as zero.
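
The sketch below shows, in NumPy, how the scale rules in the list above translate into code; the layer sizes are hypothetical and the helper functions are illustrative, not a standard library API.

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Xavier/Glorot: mean 0, variance 1/n_in; suits sigmoid/tanh units.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def he_init(n_in, n_out):
    # He: mean 0, variance 2/n_in; compensates for ReLU zeroing half the inputs.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def orthogonal_init(n_in, n_out):
    # Orthonormal rows/columns approximately preserve gradient norms.
    a = rng.normal(size=(max(n_in, n_out), min(n_in, n_out)))
    q, _ = np.linalg.qr(a)                 # q has orthonormal columns
    return q.T if n_out < n_in else q      # shape (n_out, n_in)

W = he_init(256, 128)    # weights for a hypothetical 256 -> 128 ReLU layer
b = np.zeros(128)        # biases are typically set to a fixed value such as 0
print(W.std())           # close to sqrt(2/256) ~ 0.088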

It’s important to note that there is no one-size-fits-all initialization technique, as the optimal initialization may vary depending on the specific architecture and problem being tackled. Therefore, it’s often a good idea to experiment with different initialization techniques to see which works best for a given task.
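As a concrete illustration of several schemes above, here is a small NumPy sketch for a fully connected layer with n_in inputs and n_out outputs; the function names and the shape convention (one row per output unit) are assumptions made for the example, not standard library APIs.

import numpy as np

def xavier_init(n_in, n_out):
    # Normal, mean 0, variance 1/n_in; suited to sigmoid/tanh units.
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    # Normal, mean 0, variance 2/n_in; suited to ReLU units.
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

def uniform_heuristic_init(n_in, n_out):
    # The common heuristic U(-1/sqrt(m), 1/sqrt(m)) with m = n_in.
    limit = 1.0 / np.sqrt(n_in)
    return np.random.uniform(-limit, limit, size=(n_out, n_in))

def orthogonal_init(n_in, n_out, gain=1.0):
    # Orthogonal matrices help preserve gradient norms during backpropagation.
    rows, cols = max(n_out, n_in), min(n_out, n_in)
    q, _ = np.linalg.qr(np.random.randn(rows, cols))  # orthonormal columns
    w = q if n_out >= n_in else q.T
    return gain * w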
1. Some algorithms find the best solution in one step, but most deep learning training
algorithms use an iterative approach to gradually improve the solution.
2. The starting point for these iterative algorithms is important in deep learning training because
it can affect the algorithm's performance and stability.
3. Choosing a good initial point can help the algorithm converge faster and find a better
solution, while a bad initial point can cause problems and make the algorithm less effective.
4. Different initial points can lead to different solutions with similar costs, but varying
generalization errors.
5. Improving initialization strategies is difficult because we don't fully understand neural
network optimization, but some strategies aim to break symmetry between units to prevent
them from computing the same function.
6. Random initialization is a simple and computationally cheaper alternative, as it's unlikely to
assign similar functions to different units in a high-dimensional space with a high-entropy
distribution.
7. The goal of breaking symmetry is to ensure each unit computes a unique function, which
can help prevent input or gradient patterns from being lost in null spaces during learning.
8. Biases and extra parameters are usually set to fixed values, while weights are randomly
initialized in neural networks.
9. The choice of Gaussian or uniform distribution for weight initialization doesn't matter much,
but the scale does. Larger initial weights help prevent redundant units and signal loss during
propagation, but too large weights can cause exploding values or chaos in recurrent networks.


10. The ideal initial weight scale balances the benefits of stronger symmetry breaking and
larger outputs with the risks of exploding gradients and saturated activations. Gradient clipping
can help mitigate the exploding gradient problem.

1. Optimization Perspective:
 Improve the efficiency and reduce cost
 Suggests weights should be large for successful information propagation.
 Favors the use of optimization algorithms like stochastic gradient descent (SGD), which
prefers small incremental changes and tends to converge to areas close to initial
parameters.
2. Regularization Concerns:
 Prevent Overfitting and Improving Accuracy
 Advocate for smaller weights to address overfitting concerns.
 Implies a preference for avoiding overly large weight values.
3. Connection to Early Stopping:
 Monitors the performance of the model for every approach
 Optimization with early stopping is akin to weight decay in some cases.
 Early stopping in gradient descent expresses a preference for final parameters to be close
to initial parameters.
4. Gaussian Prior Analogy:
 Gaussian prior has bell-shaped curve centered at Zero.
 Initializing parameters (θ) is similar to imposing a Gaussian prior with mean θ0.
 Choosing θ0 close to 0 implies a prior where units are more likely not to interact unless the
objective function strongly prefers interaction.
5. Impact of Initialization:
 Initializing weights with the correct variance and correctly weighting residual modules.
 Initializing θ0 to large values implies a prior specifying unit interaction and their nature.
 θ0 near 0 suggests a prior where units are more likely not to interact unless compelled by
the objective function.
 The interplay between optimization and regularization suggests a careful consideration of
weight initialization, balancing the need for information propagation with the desire to
prevent overfitting.
 The choice of initial parameters is likened to expressing a prior belief about unit
interactions, influencing how units should or should not interact based on the problem at
hand.
6. Weight Initialization Heuristics:
 Important design and consideration in NN & DL.


 Use a common heuristic by initializing weights in a fully connected layer with m inputs
and n outputs by sampling from U(−1/√m, 1/√m).
 Glorot and Bengio (2010) propose a normalized initialization as a compromise between
activation-variance and gradient-variance goals, sampling weights from
U(−√(6/(m + n)), √(6/(m + n))) for a layer with m inputs and n outputs.

 Saxe et al. (2013) Recommendation:

• Saxe et al. suggest initializing to random orthogonal matrices with a scaling (gain) factor,
ensuring that convergence is independent of depth.
• Proper gain factor tuning can allow training very deep networks without vanishing or
exploding gradients.
7. Optimal Criteria Challenges:
• Theoretical predictions for optimal initial weights may not lead to optimal performance.
• Possible reasons: wrong criteria, properties not persisting during learning, or
unintentional increase in generalization error.
8. Weight Scaling Challenges:
 Scaling rules setting all initial weights to the same standard deviation, like 1/√m, can
result in extremely small individual weights for large layers.
 Sparse initialization, introduced by Martens (2010), initializes each unit with k non-
zero weights to maintain diversity but can cause issues.
9. Hyperparameter Considerations:
 Treat initial weight scale as a hyperparameter; choose using a search algorithm (e.g.,
random search) based on activation or gradient range.
 Dense or sparse initialization can be a hyperparameter choice.
10. Practical Approach:
 Manually or through automated methods, identify layers with small activations and
increase their weights iteratively for better initial activations.
 Consider gradient range and standard deviation if learning is still slow.

11. Recent Formalization:
 The above protocol has been formalized by Mishkin and Matas (2015), providing a more
detailed and studied approach.


12. Other Parameter Initialization:
 Initialization of parameters other than weights is generally easier and less complex.
13. Weight Initialization Heuristics:
 Use heuristics to set initial weights in neural networks.
 One method is to sample weights from U(−1/√m, 1/√m) for a fully connected layer with m
inputs and n outputs.
 Glorot and Bengio (2010) suggest a normalized initialization formula to balance activation
and gradient variances.
14. Saxe et al. Recommendation:
 Saxe et al. (2013) propose initializing with random orthogonal matrices, including a scaling
factor (gain factor) for nonlinearity.
 Proper scaling ensures deep network training without vanishing/exploding gradients.
 Sussillo (2014) shows correct gain factor setting can train networks up to 1,000 layers.
15. Challenges with Optimal Criteria:
 Theoretical predictions for optimal initial weights may not lead to peak performance.
 Possible issues: wrong criteria, properties not persisting during learning, or unintentional
increase in generalization error.
 Weight scale often treated as a hyperparameter.
16. Drawback of Standard Deviation Scaling:
 Uniformly setting weights with the same standard deviation (e.g., 1/√m) can make
individual weights extremely small in large layers.
 Martens (2010) introduces sparse initialization to maintain diversity, initializing each unit
with k non-zero weights.

17. Sparse Initialization Challenges:
 Sparse initialization helps diversity but imposes a strong prior on weights with large
Gaussian values.
 Gradient descent takes time to shrink large values, causing issues for units like maxout
units with multiple coordinated filters.
18. Hyperparameter Considerations:
 Treat weight scale and initialization method (dense or sparse) as hyperparameters.
 Hyperparameter search algorithms like random search or manual search can be used.
 A rule of thumb: examine range or standard deviation of activations or gradients on a
single minibatch for initial scale choice.
19. Addressing Small Activations:
 If weights are too small, activations across a minibatch shrink.


 Identify layers with small activations, increase weights iteratively for better initial
activations.
 If learning is slow, examine gradient range and standard deviation.
20. Formalization by Mishkin and Matas (2015):
 Mishkin and Matas (2015) formalize and study the protocol for weight initialization.
 The focus so far has been on weight initialization, but other parameters are generally easier
to initialize.
21. Another Common Type of Parameter:
a) A variance or precision parameter is a common type of parameter used in linear regression.
b) Instead of estimating the conditional variance directly, we can use a precision parameter β
in the model:
p(y | x) = N(y | wᵀx + b, 1/β)          (8.24)
c) Precision parameters can often safely be initialized to 1.
d) Another approach is to assume initial weights are close to zero, set biases based on the
correct marginal mean, and set variance parameters to the marginal variance of the output in
the training set.

5th Topic: Algorithms with Adaptive Learning Rates
1. Learning rate is a difficult hyperparameter to set in neural network training because it
has a big impact on the model's performance.
2. The cost function can be very sensitive to some directions in parameter space and less
sensitive to others, making learning difficult.
3. The momentum algorithm helps with this, but introduces another hyperparameter.
4. Adapting individual learning rates for each parameter during training can make sense if
the directions of sensitivity are axis-aligned.
5. The delta-bar-delta algorithm is an early heuristic approach to adapting individual
learning rates, based on the idea that if the partial derivative of the loss with respect to a
parameter stays the same sign, the learning rate should increase, and if it changes sign,
the learning rate should decrease (see the sketch after this list).
6. More recent incremental methods adapt learning rates for model parameters during
mini-batch training.
7. These methods are necessary because full batch optimization is not practical for large
datasets.
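As a rough illustration of the delta-bar-delta idea in point 5 above, here is a simplified full-batch sketch; the real algorithm compares the gradient against a running average of past gradients, whereas this version just compares signs, and kappa, phi, and the other names are assumptions made for the example.

import numpy as np

def delta_bar_delta(theta, grad_fn, X, y, lr_init=0.01,
                    kappa=1e-4, phi=0.5, n_steps=100):
    lrs = np.full_like(theta, lr_init)    # one learning rate per parameter
    g_prev = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta, X, y)          # full-batch gradient
        same_sign = g * g_prev > 0
        # Grow the rate additively while the sign persists; shrink it
        # multiplicatively when the sign flips.
        lrs = np.where(same_sign, lrs + kappa, lrs * phi)
        theta = theta - lrs * g
        g_prev = g
    return theta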
AdaGrad
1. AdaGrad is an algorithm that adjusts the learning rates of all model parameters
differently based on their past values. It makes the learning rates smaller for parameters
that have already improved a lot and larger for parameters that haven't improved much
yet.
2. This means that parameters with larger partial derivatives (which indicate a stronger
influence on the loss function) will have a faster decrease in learning rate, while
parameters with smaller partial derivatives will have a slower decrease.
3. This results in faster progress in the less steep directions of parameter space, which can
be helpful for finding the optimal solution more efficiently.
4. In theory, AdaGrad has some nice properties for convex optimization problems, but in
practice, it can lead to a premature and excessive decrease in the effective learning rate
when training deep neural network models due to the accumulation of squared gradients
over time.
5. Overall, AdaGrad works well for some deep learning models but may not be the best
choice for all of them.

RMSProp: (Root Mean Square Propagation)
1) The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-
convex setting by changing the gradient accumulation into an exponentially weighted
moving average.
2) AdaGrad is designed to converge rapidly when applied to a convex function.
3) When applied to a non-convex function to train a neural network, the learning
trajectory may pass through many different structures and eventually arrive at a region
that is a locally convex bowl.
4) AdaGrad shrinks the learning rate according to the entire history of the squared
gradient and may have made the learning rate too small before arriving at such a
convex structure.
5) RMSProp uses an exponentially decaying average to discard history from the extreme
past so that it can converge rapidly after finding a convex bowl.
6) It then behaves as if it were an instance of the AdaGrad algorithm initialized within that bowl.
7) RMSProp is shown in its standard form in algorithm 8.5 and combined with Nesterov
momentum in algorithm 8.6.
8) Compared to AdaGrad, the use of the moving average introduces a new
hyperparameter, ρ, that controls the length scale of the moving average.
9) Empirically, RMSProp has been shown to be an effective and practical optimization
algorithm for deep neural networks.
10) It is currently one of the go-to optimization methods being employed routinely by deep
learning practitioners.
11) RMSProp (root mean square propagation) is an optimization algorithm that trains
artificial neural networks (ANNs) with per-parameter adaptive learning rates, derived
from the concepts of gradient descent and Rprop.
12) RMSProp was designed to address some of the issues encountered with the SGD
method when training deep neural networks.

13) RMSProp is designed to accelerate the optimization process.
14) Ex: Decrease the number of function evaluations required to reach the optima or
improve the capability of the optimization algorithm for better result.
15) Root Mean Square Propagation (RMSProp) is an adaptive learning-rate algorithm that tries to
improve on AdaGrad.
RMSProp is a popular optimization algorithm used in deep learning that has several
advantages:
a) It handles gradients efficiently.
b) RMSProp is well suited to deep learning problems.
c) Only a few of the weights in the neural network are updated in each iteration.
d) RMSProp also takes away the need to adjust the learning rate by hand; it adapts it automatically.
Algorithm 8.4 The AdaGrad algorithm
Step-1: Require: Global learning rate ε
Step-2: Require: Initial parameter θ
Step-3: Require: Small constant δ, perhaps 10⁻⁷, for numerical stability. Initialize gradient
accumulation variable r = 0
Step-4: while stopping criterion not met do
Step-5: Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with
corresponding targets y(i).
Step-6: Compute gradient: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Step-7: Accumulate squared gradient: r ← r + g ⊙ g
Step-8: Compute update: ∆θ ← −(ε / (δ + √r)) ⊙ g (division and square root applied
element-wise)
Step-9: Apply update: θ ← θ + ∆θ
Step-10: end while

Explanation of the above algorithm steps:
Step 1: We need a global learning rate (ε) and an initial parameter value (θ).

Step 2: We have an initial parameter value (θ) that we want to adjust based on the training
data.
Step 3: We initialize a variable called r to 0, which we'll use to accumulate squared
gradients. We also set a small constant (δ) for numerical stability, which is typically set to
10^-7.
Step 4: We enter a loop that will continue until some stopping criterion is met. This might
be when we've trained for a certain number of iterations, or when the loss function has
converged to a certain value.

Step 5: We sample a batch of m examples from the training set, along with their
corresponding targets.
Step 6: We compute the gradient of the loss function with respect to our parameter (θ)
using these examples. This gives us a vector g that represents how much each parameter
should be adjusted to reduce the loss function.
Step 7: We accumulate the squared gradient in our r variable. This helps us to adjust the
learning rate based on the history of gradients we've seen so far.
Step 8: We compute the update for our parameter using the formula -ε / (δ + sqrt(r)) * g.
This formula adjusts the learning rate based on both the gradient and the history of
squared gradients we've seen so far. The division and square root are applied element-
wise to each component of g and r.
Step 9: We apply the update to our parameter value (θ).
Step 10: The loop continues until some stopping criterion is met.
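The following minimal NumPy sketch mirrors these steps; the helper grad_fn, the data arrays X and y, and the default values are assumptions made for the example.

import numpy as np

def adagrad(theta, grad_fn, X, y, lr=0.01, delta=1e-7,
            batch_size=32, n_steps=1000):
    r = np.zeros_like(theta)                      # gradient accumulation variable
    n = len(X)
    for _ in range(n_steps):                      # stopping criterion: fixed step count
        batch = np.random.choice(n, batch_size, replace=False)
        g = grad_fn(theta, X[batch], y[batch])    # minibatch gradient
        r = r + g * g                             # accumulate squared gradient
        theta = theta - (lr / (delta + np.sqrt(r))) * g  # element-wise scaled step
    return theta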

Adam:
1. Adam is a new algorithm for optimizing the learning rate in machine learning models.
It's called "Adam" because it combines two techniques, adaptive moments (which adjust
the learning rate based on the history of the gradient), and momentum (which helps the
model converge faster).

2. Adam is similar to other algorithms like RMSProp, but it has a few key differences. For
example, in Adam, momentum is calculated using an estimate of the first-order moment
(the gradient) with exponential weighting, whereas in RMSProp, momentum is applied to
the rescaled gradients.

3. Another difference is that Adam includes bias corrections to account for the fact that
the estimates of both the first-order and second-order moments are initialized at zero.
This helps to reduce the high bias that can occur early in training with RMSProp.
4. Overall, Adam is considered to be a robust algorithm that doesn't require a lot of tuning
of hyperparameters, although the learning rate may need to be adjusted from its default
value.
Algorithm 8.5 The RMSProp algorithm
Step-1: Require Global learning rate ε, decay rate ρ.
Step-2: Require: Initial parameter θ
Step-3: Require Small constant δ, usually 10−6, used to stabilize division by small numbers.
Step-4: Initialize accumulation variables r = 0
Step-5: while stopping criterion not met do
Step-6: Sample a minibatch of m examples from the training set {x(1) , . . . ,x(m)} with
corresponding targets y(i).
Step-7: Compute gradient: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Step-8: Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
Step-9: Compute parameter update: ∆θ = −(ε / √(δ + r)) ⊙ g
(1/√(δ + r) applied element-wise)
Step-10: Apply update: θ ← θ + ∆θ
Step-11: end while
Explanation of Algorithm Steps:
Algorithm 8.5, called RMSProp, is a technique used to update the parameters of a neural
network during training. Here's a step-by-step explanation:
Step-1: We need a global learning rate (ε) and a decay rate (ρ) to control the size of
parameter updates.
Step-2: We start with an initial set of parameters (θ).
Step-3: We use a small constant (δ) to stabilize division by small numbers, which helps
prevent numerical issues during computation.
Step-4: We initialize an accumulation variable (r) to zero.

Step-5: We enter a loop that continues until a stopping criterion is met.
Step-6: We sample a batch of m examples from the training data, along with their
corresponding targets (y(i)).
Step-7: We compute the gradient of the loss function (L) with respect to the parameters
(θ), using the batch of examples.
Step-8: We accumulate the squared gradient (g) over time, using the decay rate (ρ) to
decay previous values. This helps smooth out the effects of noisy gradients.
Step-9: We compute the parameter update by dividing the gradient (g) by the square root
of the accumulated squared gradient (r), plus a small constant (δ). This helps prevent
exploding gradients and improves convergence.
Step-10: We apply the parameter update to our current set of parameters (θ).
Step-11: The loop continues until a stopping criterion is met, such as reaching a certain
number of iterations or achieving a desired level of accuracy.
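A minimal NumPy sketch of these steps follows; as before, grad_fn, X, y, and the default values are assumptions made for the example. The only change relative to the AdaGrad sketch is the exponentially weighted accumulation of r, which lets the algorithm forget gradients from the extreme past.

import numpy as np

def rmsprop(theta, grad_fn, X, y, lr=0.001, rho=0.9, delta=1e-6,
            batch_size=32, n_steps=1000):
    r = np.zeros_like(theta)                       # moving average of squared gradients
    n = len(X)
    for _ in range(n_steps):
        batch = np.random.choice(n, batch_size, replace=False)
        g = grad_fn(theta, X[batch], y[batch])
        r = rho * r + (1.0 - rho) * g * g          # exponentially weighted accumulation
        theta = theta - (lr / np.sqrt(delta + r)) * g  # element-wise update
    return theta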
Choosing the Right Optimization Algorithm
1. In this section, we discussed some algorithms that help in optimizing deep learning
models by adjusting the learning rate for each parameter.
2. There is no clear answer to which algorithm is the best to choose because researchers
have not yet agreed on this.
3. A study by Schaul et al. (2014) compared many optimization algorithms and found that
those with adaptive learning rates, such as RMSProp and AdaDelta, performed
consistently well across different learning tasks.
4. Currently, popular optimization algorithms used in practice include SGD, SGD with
momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam.
5. The choice of which algorithm to use seems to depend mostly on the user's familiarity
with it for easier hyperparameter tuning.
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Step-1: Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
Step-2: Require: Initial parameter θ, initial velocity v.
Step-3: Initialize accumulation variable r = 0
Step-4: while stopping criterion not met do

Step-5: Sample a minibatch of m examples from the training set {x(1) , . . . ,x(m)} with
corresponding targets y(i).
Step-6: Compute interim update: θ˜ ← θ + αv
Step-7: Compute gradient: g ← (1/m) ∇θ˜ Σi L(f(x(i); θ˜), y(i))
Step-8: Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
Step-9: Compute velocity update: v ← αv − (ε/√r) ⊙ g (1/√r applied element-wise)
Step-10: Apply update: θ ← θ + v
Step-11: end while

Explanation of the above Algorithm:


The RMSProp algorithm with Nesterov momentum is a technique used to update the
weights of a neural network during training.
Step 1: We need three parameters - the global learning rate (ε), decay rate (ρ), and
momentum coefficient (α).
Step 2: We start with an initial set of weights (θ) and an initial velocity (v).
Step 3: We initialize an accumulation variable (r) to zero.
Step 4: We continue iterating until we meet a stopping criterion, such as reaching a certain
number of epochs or achieving a certain level of accuracy.
Step 5: We sample a batch of m examples from the training set along with their
corresponding targets (y(i)).
Step 6: We compute an intermediate update for our weights (θ˜) using the momentum
coefficient (α) and the current velocity (v).
Step 7: We calculate the gradient (g) of the loss function with respect to our intermediate
weights (θ˜).
Step 8: We accumulate the squared gradient into the accumulation variable (r)
with a decay factor (ρ). This helps to smooth out the noise in the gradient and prevent
oscillations in the weight updates.
Step 9: We update the velocity using the momentum coefficient (α), the current gradient
(g), and the inverse square root of the accumulated gradient (r). This helps to speed up
convergence and prevent getting stuck in local minima.

Step 10: We apply the final weight update using our new velocity and the current weights
(θ).
Step 11: We repeat steps 5-10 until we meet our stopping criterion.
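Combining the two previous sketches gives a minimal version of Algorithm 8.6; the small constant added inside the square root is an assumption for numerical stability (the algorithm as stated divides by √r directly), and the helper names are illustrative.

import numpy as np

def rmsprop_nesterov(theta, grad_fn, X, y, lr=0.001, rho=0.9,
                     momentum=0.9, batch_size=32, n_steps=1000):
    r = np.zeros_like(theta)                       # accumulated squared gradient
    v = np.zeros_like(theta)                       # velocity
    n = len(X)
    for _ in range(n_steps):
        batch = np.random.choice(n, batch_size, replace=False)
        theta_interim = theta + momentum * v       # interim (lookahead) update
        g = grad_fn(theta_interim, X[batch], y[batch])
        r = rho * r + (1.0 - rho) * g * g          # accumulate squared gradient
        v = momentum * v - (lr / np.sqrt(r + 1e-8)) * g  # velocity update
        theta = theta + v                          # apply update
    return theta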


6th Topic: Approximate Second-Order Methods

1. This topic covers second-order methods for training deep networks, specifically for minimizing
the empirical risk.
2. The empirical risk is defined as the average loss over a set of training data, and is given by
equation (8.25):

J(θ) = (1/m) Σi=1..m L(f(x(i); θ), y(i))          (8.25)
3. The use of second-order methods for optimization involves taking into account the
curvature of the objective function, which can help to converge faster and more accurately
to the minimum.
4. These methods require the computation of second-order derivatives, which can be
computationally expensive for large-scale problems like training deep networks.
5. LeCun et al. (1998a) provide an early treatment of this subject. In this section, we will
follow their approach for simplicity of exposition.
6. However, the methods discussed here can be extended to more general objective
functions that include regularization terms.
7. The basic idea behind second-order methods is to approximate the objective function
around the current parameter values using a quadratic model.
8. This quadratic model is then minimized to obtain a new set of parameter values, which
are used as the starting point for the next iteration. This process is repeated until
convergence is achieved.
9. The main advantage of second-order methods is that they can converge faster than first-
order methods, such as gradient descent, because they take into account the curvature of
the objective function.
10. This allows them to more accurately estimate the direction and step size for each iteration,
resulting in fewer iterations and faster convergence.
11. However, second-order methods also have some disadvantages. They require the
computation of second-order derivatives, which can be computationally expensive for
large-scale problems like training deep networks.

12. Additionally, they may become less effective as the objective function becomes more
nonlinear or complex, as the quadratic approximation may not accurately represent the
true objective function in these cases.
13. Despite these challenges, second-order methods have shown promising results in practice
for training deep networks.
14. They have been used successfully in applications such as image classification and speech
recognition, where large-scale datasets and complex models are commonplace.
15. As computing resources continue to improve and algorithms become more sophisticated,
it is likely that second-order methods will become even more widely adopted in deep
learning applications.

Newton’s Method:
1. Newton's method is a second-order optimization technique that uses the Hessian matrix to
improve convergence. It approximates the objective function with a quadratic function and
updates the parameters directly to the minimum.

2. If the objective function is locally quadratic, Newton's method can jump directly to the
minimum by rescaling the gradient with the inverse of the Hessian.

3. In deep learning, Newton's method can be applied iteratively as long as the Hessian is
positive definite. However, it can cause updates to move in the wrong direction near saddle
points or when the eigenvalues of the Hessian are not all positive.

4. Regularization strategies, such as adding a constant along the diagonal of the Hessian, can
help mitigate these issues.

5. However, the computational burden of computing and inverting the Hessian matrix for large
neural networks with millions of parameters makes it impractical to use Newton's method for
training them.

6. Alternative methods that aim to gain some of the advantages of Newton's method while
avoiding its computational challenges are being explored in deep learning research.

7. In deep learning, the Hessian matrix is a second-order derivative matrix that measures the
curvature of the loss function with respect to the model's parameters. It is calculated by taking
the second derivative of the loss function with respect to the weights and biases of the neural
network.

8. The Hessian matrix is used in optimization algorithms such as Newton's method and
conjugate gradient descent to find the minimum of the loss function more efficiently than
gradient descent alone.

9. These methods use the Hessian matrix to approximate the curvature of the loss function
and adjust the learning rate accordingly. This can help the algorithm converge faster and more
accurately to the global minimum.

10. However, calculating the Hessian matrix can be computationally expensive, especially for
large neural networks with many parameters.

11. Therefore, it is often approximated using techniques such as limited-memory BFGS
(Broyden-Fletcher-Goldfarb-Shanno) or Hessian-free optimization methods.

12. These methods use only a subset of the Hessian matrix or approximate it using lower-
order derivatives, making them more computationally efficient.

Algorithm 8.7: The Adam algorithm

Step-1: Require: Step size ε (suggested default: 0.001)
Step-2: Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1)
(suggested defaults: 0.9 and 0.999 respectively), for the first and second moments of the
gradient.
Step-3: Require: Small constant δ used for numerical stabilization (suggested default: 10⁻⁸)
Step-4: Require: Initial parameters θ
Step-5: Initialize 1st and 2nd moment variables s = 0, r = 0
Step-6: Initialize time step t = 0
Step-7: While stopping criterion not met do
Step-8: Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with
corresponding targets y(i).
Step-9: Compute gradient: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i)), where L is the loss function
Step-10: t ← t + 1 (t is the number of iterations)
Step-11: Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
Step-12: Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
Step-13: Correct bias in first moment: sˆ ← s / (1 − ρ1^t)
Step-14: Correct bias in second moment: rˆ ← r / (1 − ρ2^t)
Step-15: Compute update: ∆θ = −ε sˆ / (√rˆ + δ) (operations applied element-wise)
Step-16: Apply update: θ ← θ + ∆θ

Step-17: end while


Explanation of Steps:
The algorithm is for training a neural network using the Adam optimization
method.
Step 1: Choose a small step size (ε) for updating the weights of the neural network.
Step 2: Choose two decay rates (ρ1 and ρ2) between 0 and 1 to calculate the moments of
the gradients. These rates help to reduce the impact of large gradients and prevent
oscillations.
Step 3: Use a small constant (δ) for numerical stability during calculations.
Step 4: Start with initial weights (θ) for the neural network.
Step 5: Initialize variables for storing the first and second moments of the gradients (s and
r).
Step 6: Initialize a time step counter (t).
Step 7: Keep running the algorithm until a stopping criterion is met, such as reaching a
certain number of iterations or achieving a desired level of accuracy.
Step 8: Sample a batch of examples from the training data with their corresponding targets.
Step 9: Calculate the gradient (g) for this batch using the loss function.
Step 10: Increment the time step counter (t).
Step 11: Update the biased first moment estimate (s) using the current gradient and decay
rate (ρ1).
Step 12: Update the biased second moment estimate (r) using the current gradient and decay
rate (ρ2).
Step 13: Correct the bias in the first moment estimate by dividing it by one minus the first
decay rate (ρ1) raised to the power of the number of iterations so far (t). This compensates
for the estimate being initialized at zero.
Step 14: Correct the bias in the second moment estimate by dividing it by one minus the
second decay rate (ρ2) raised to the power of the number of iterations so far (t). This
likewise compensates for the zero initialization.
Step 15: Calculate an update vector (∆θ) using the corrected first and second moment
estimates, as well as a small constant for numerical stability (δ). This update vector is
applied element-wise to adjust the weights of the neural network.
Step 16: Apply the update vector to adjust the weights of the neural network (θ).
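Here is a minimal NumPy sketch of Algorithm 8.7; grad_fn and the data arrays are assumptions made for the example, while the defaults follow the suggested values above.

import numpy as np

def adam(theta, grad_fn, X, y, lr=0.001, rho1=0.9, rho2=0.999,
         delta=1e-8, batch_size=32, n_steps=1000):
    s = np.zeros_like(theta)                  # first-moment (mean) estimate
    r = np.zeros_like(theta)                  # second-moment estimate
    n = len(X)
    for t in range(1, n_steps + 1):           # t is the time step
        batch = np.random.choice(n, batch_size, replace=False)
        g = grad_fn(theta, X[batch], y[batch])
        s = rho1 * s + (1.0 - rho1) * g       # biased first moment
        r = rho2 * r + (1.0 - rho2) * g * g   # biased second moment
        s_hat = s / (1.0 - rho1 ** t)         # bias correction (first moment)
        r_hat = r / (1.0 - rho2 ** t)         # bias correction (second moment)
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta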

Algorithm 8.8 Newton’s method with objective J(θ) = (1/m) Σi=1..m L(f(x(i); θ), y(i)).
Step-1: Require: Initial parameter θ0
Step-2: Require: Training set of m examples
Step-3: While stopping criterion not met do
Step-4: Compute gradient: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Step-5: Compute Hessian: H ← (1/m) ∇²θ Σi L(f(x(i); θ), y(i))
Step-6: Compute Hessian inverse: H⁻¹
Step-7: Compute update: ∆θ = −H⁻¹ g
Step-8: Apply update: θ ← θ + ∆θ
Step-9: end while
Explanation of the above Algorithm:
This algorithm is called Newton's method, and it's used to find the minimum of a
function called the objective function, which in this case is the average of a loss
function applied to a set of training examples.
Step 1: We start with an initial guess for the parameter values, which we call θ0.
Step 2: We have a set of m examples, each with an input x(i) and an output y(i).
Step 3: We keep iterating until some stopping criterion is met, such as reaching a
certain number of iterations or achieving a small enough value for the objective
function.
Step 4: We calculate the gradient of the objective function with respect to the
parameter vector θ. This tells us which direction we need to move in order to
decrease the objective function.
Step 5: We calculate the Hessian matrix, which is a matrix of second derivatives
of the objective function with respect to each pair of parameters. This tells us how
much the gradient changes as we move in different directions.
Step 6: We calculate the inverse of the Hessian matrix, which is called the Hessian
inverse. This allows us to solve for the update step that will move us in the
direction of steepest descent.

Step 7: We compute the update step by subtracting the product of the Hessian
inverse and the gradient from our current parameter values. This moves us in the
direction of steepest descent while taking into account how much each parameter
should be changed relative to others.
Step 8: We update our parameter values based on the computed update step.
Step 9: We continue iterating until our stopping criterion is met.
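A minimal sketch of these steps for a small parameter vector follows; grad_fn and hessian_fn (hypothetical helpers returning the averaged gradient and Hessian) are assumptions made for the example. Solving the linear system H ∆θ = −g is used in place of explicitly forming H⁻¹, since it is cheaper and numerically more stable.

import numpy as np

def newtons_method(theta, grad_fn, hessian_fn, X, y, n_steps=20):
    for _ in range(n_steps):                  # stopping criterion: fixed step count
        g = grad_fn(theta, X, y)              # average gradient over the training set
        H = hessian_fn(theta, X, y)           # average Hessian over the training set
        delta = -np.linalg.solve(H, g)        # equivalent to -H^{-1} g
        theta = theta + delta                 # apply the Newton update
    return theta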
8.6.2 Conjugate Gradients
1. The conjugate gradients method is a way to efficiently find the minimum of a function
without calculating its inverse Hessian matrix.
2. It works by iteratively moving in directions that are conjugate to each other, which
means they are not parallel or perpendicular.
3. This is different from the method of steepest descent, which moves in the direction of
the gradient but can sometimes move back and forth in a zig-zag pattern because the
directions it uses are orthogonal to each other.
4. The conjugate gradients method avoids this problem by choosing directions that are
not orthogonal to each other, making it more efficient.
5. The inspiration for this approach comes from studying the weaknesses of the method
of steepest descent in quadratic functions.

6. During a line search, we move in a certain direction (dt-1) until we reach a minimum
point. At this point, the gradient (∇θJ) in that direction is zero.
7. Since the gradient now points in a new direction (dt), it's orthogonal to the previous
direction (dt-1). This means that the new direction is perpendicular to the old one.
8. This relationship between dt-1 and dt is shown in figure 8.6, which illustrates multiple
iterations of steepest descent.
9. The problem with this approach is that it can lead to a zig-zag pattern of progress, where
we have to re-minimize the objective function in the previous gradient direction after
descending to the minimum in the current gradient direction.
10. The method of conjugate gradients aims to address this issue by choosing directions
of descent that are not orthogonal to each other, which helps us avoid retracing our steps
and makes the optimization process more efficient.
11. In the method of conjugate gradients, we want to find a new search direction that
won't undo the progress made in the previous direction.
12. At each training iteration t, the new direction dt is a combination of the gradient and
the previous direction dt-1, with a coefficient βt that controls how much of dt-1 to add
back.

13. Two directions are called conjugate if they don't contribute to each other's curvature,
which is measured by a matrix called the Hessian.
14. Calculating conjugacy using the eigenvectors of the Hessian would be computationally
expensive for large problems, but fortunately there's a simpler way that doesn't require
these calculations.
15. The conjugate gradient method for finding the minimum of a quadratic surface ensures
that we don't move backwards in the direction we just came from.
16. This means we stay on the path to the minimum as we search for it. In a space with k
dimensions, we only need to do k line searches to find the minimum using this method.
17. In simple terms, the conjugate gradient algorithm helps us find the lowest point on a
curved surface by moving in a smart way that avoids going back where we've been.
18. Conjugate gradients is a method to efficiently avoid calculating the inverse Hessian by
iteratively descending in conjugate directions.
19. The weakness of the method of steepest descent is that it progresses in a zig-zag
pattern because each line search direction is orthogonal to the previous one, causing
progress to be undone in those directions.
20. In conjugate gradients, the next search direction is a combination of the current
gradient and a coefficient that controls how much of the previous direction to add back.
21. Two directions are defined as conjugate if they do not undo progress made in each
other's direction.
22. Two popular methods for computing the coefficient are Fletcher-Reeves and Polak-Ribière.
23. The conjugate gradient algorithm requires at most k line searches to achieve the
minimum in a k-dimensional parameter space.
24. Conjugate gradients is more computationally viable than Newton's method for large
problems because it avoids calculating the eigenvectors of the Hessian matrix.
Two popular methods for computing βt are:

1. Fletcher-Reeves: βt = (∇θJ(θt)ᵀ ∇θJ(θt)) / (∇θJ(θt−1)ᵀ ∇θJ(θt−1))          (8.30)

2. Polak-Ribière: βt = ((∇θJ(θt) − ∇θJ(θt−1))ᵀ ∇θJ(θt)) / (∇θJ(θt−1)ᵀ ∇θJ(θt−1))          (8.31)

Algorithm 8.9 The conjugate gradient method


Step-1: Require: Initial parameters θ0
Step-2: Require: Training set of m examples
Step-3: Initialize ρ0 = 0

Step-4: Initialize g0 = 0
Step-5: Initialize t = 1
Step-6: While stopping criterion not met do
Step-7: Initialize the gradient gt = 0
Step-8: Compute gradient: gt ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
Step-9: Compute βt = ((gt − gt−1)ᵀ gt) / (gt−1ᵀ gt−1) (Polak-Ribière)
(Nonlinear conjugate gradient: optionally reset βt to zero, for example if t is a multiple of
some constant k, such as k = 5)
Step-10: Compute search direction: ρt = −gt + βt ρt−1
Step-11: Perform line search to find: ε* = argminε (1/m) Σi=1..m L(f(x(i); θt + ε ρt), y(i))
(On a truly quadratic cost function, analytically solve for ε* rather than explicitly searching
for it)
Step-12: Apply update: θt+1 = θt + ε* ρt
Step-13: t ← t + 1
Step-14: end while

Explanation of above algorithms Steps

Step 1: We need to start with some initial values for the parameters we're trying to optimize
(θ0).

Step 2: We have a set of examples (m) that we'll use to train our model.

Step 3: We're keeping track of a variable called ρ0, which we'll use later in the algorithm. For
now, just think of it as a starting point.

Step 4: We're also keeping track of a variable called g0, which we'll use to calculate gradients
later on. Again, just think of it as a starting point for now.

Step 5: We're starting our iterative process with t = 1.

Step 6: We're entering a loop that will continue until some stopping criterion is met (we'll
discuss this more in a bit). Inside this loop, we'll be calculating gradients, updating our
parameters, and searching for the best step size to take.

Step 7: We're initializing a variable called gt to zero. This will be used to store the gradient at
each iteration.

Step 8: We're calculating the gradient for our current set of parameters (θt) using the training
data (L(f(x(i); θt), y(i))). This gradient tells us which direction we should move in order to
decrease the cost function (which is what we want to do).

Step 9: We're computing a variable called βt using the gradient from the current iteration (gt)
and the gradient from the previous iteration (gt-1). This variable controls how much of the
previous search direction is added back when forming the new direction in Step 10.

Step 10: We're calculating the search direction (ρt) as the negative of the current gradient (gt)
plus the previous search direction (ρt−1) scaled by βt. This gives us a direction to
move in that will hopefully help us decrease the cost function more quickly than just moving
in the direction of the current gradient alone.

Step 11: We're performing a line search to find the best step size (ε*) to take in this search
direction. This involves calculating the cost function for different values of ε and finding the
one that results in the smallest value of the cost function. This is important because we want
to make sure we're actually decreasing the cost function as we move through parameter
space.

Step 12: We're applying our new parameters (θ t+1) by adding our step size (ε*) times our
search direction (ρt). This moves us closer to our minimum in parameter space.

Step 13: We increment our iteration counter (t) and continue with Step 6 until our stopping
criterion is met. The stopping criterion could be things like reaching a certain number of
iterations, or finding that our changes in parameters are no longer resulting in significant
decreases in the cost function.
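The following NumPy sketch condenses these steps, using the Polak-Ribière coefficient (clipped at zero, a common variant) and a crude backtracking line search as a stand-in for a proper one; loss_fn, grad_fn, and the constants are assumptions made for the example.

import numpy as np

def nonlinear_cg(theta, loss_fn, grad_fn, n_steps=100, restart_every=5):
    g_prev, rho = None, None
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        if g_prev is None or t % restart_every == 0:
            beta = 0.0                        # periodic reset to steepest descent
        else:
            beta = max(0.0, ((g - g_prev) @ g) / (g_prev @ g_prev))  # Polak-Ribiere
        rho = -g if beta == 0.0 else -g + beta * rho   # search direction
        eps, f0 = 1.0, loss_fn(theta)
        while loss_fn(theta + eps * rho) > f0 and eps > 1e-10:
            eps *= 0.5                        # backtrack until the loss decreases
        theta = theta + eps * rho
        g_prev = g
    return theta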

Nonlinear Conjugate Gradients:

1. Conjugate gradients method is not just for quadratic functions, but can also be used for
nonlinear objectives in neural network training.

2. Without the guarantee of a quadratic function, the conjugate directions may not remain at
the minimum for previous directions, so occasional resets are necessary.

3. The nonlinear conjugate gradients algorithm involves restarting with line search along the
unaltered gradient during resets.

4. Initializing optimization with a few iterations of stochastic gradient descent before nonlinear
conjugate gradients can be beneficial.

5. Minibatch versions of nonlinear conjugate gradients have been successfully used for neural
network training (Le et al., 2011).

6. The scaled conjugate gradients algorithm is an adaptation of conjugate gradients specifically for
neural networks.

7. Practitioners report reasonable results using nonlinear conjugate gradients for training
neural networks, but it's beneficial to start with a few iterations of stochastic gradient descent.
BFGS

1. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some of the
advantages of Newton’s method without the computational burden. In that respect, BFGS
is similar to the conjugate gradient method. However, BFGS takes a more direct approach
to the approximation of Newton’s update.
2. Recall that Newton’s update is given by θ* = θ0 − H⁻¹ ∇θ J(θ0)          (8.32)
where H is the Hessian of J with respect to θ evaluated at θ0.
3. The primary computational difficulty in applying Newton’s update is the calculation of the
inverse Hessian H−1.
4. The approach adopted by quasi-Newton methods (of which the BFGS algorithm is the
most prominent) is to approximate the inverse with a matrix Mt that is iteratively refined
by low-rank updates to become a better approximation of H⁻¹.
5. The specification and derivation of the BFGS approximation is given in many textbooks on
optimization, including Luenberger (1984).
6. Once the inverse Hessian approximation Mt is updated, the direction of descent ρt is
determined by ρt = Mt gt.
7. A line search is performed in this direction to determine the size of the step, ε*, taken in
this direction.
8. The final update to the parameters is given by: θt+1 = θt + ε* ρt          (8.33)
9. Like the method of conjugate gradients, the BFGS algorithm iterates a series of line
searches with the direction incorporating second-order information.
10. However, unlike conjugate gradients, the success of the approach is not heavily dependent
on the line search finding a point very close to the true minimum along the line.
11. Thus, relative to conjugate gradients, BFGS has the advantage that it can spend less time
refining each line search.

12. On the other hand, the BFGS algorithm must store the inverse Hessian matrix, M, which
requires O(n²) memory, making BFGS impractical for most modern deep learning models
that typically have millions of parameters.
Limited Memory BFGS (L-BFGS):

 L-BFGS is a variant of BFGS that significantly decreases memory costs by avoiding storing
the complete inverse Hessian approximation M.

 L-BFGS computes the approximation M using the same method as BFGS, but begins with
the assumption that M(t-1) is the identity matrix.
 L-BFGS remains well behaved when the minimum of the line search is reached only
approximately, and it can be generalized to include more information about the Hessian
by storing some vectors used to update M at each time step, which costs only O(n) per
step.

 The BFGS algorithm can have high memory costs because it requires storing the entire
inverse Hessian matrix.
 To reduce this, a modified algorithm called L-BFGS avoids storing the inverse Hessian and
instead updates a matrix called M using the same method as BFGS, starting with the
identity matrix.
 This approach allows for computing directions that are mutually conjugate, similar to the
method of conjugate gradients, but without requiring exact line searches.
 The L-BFGS strategy with no storage can be improved by storing some vectors used to
update M, which costs only O(n) per step. This generalization allows for including more
information about the Hessian in the algorithm.
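As an illustration of how L-BFGS computes M g without ever storing M, here is a sketch of the standard two-loop recursion over the stored (s, y) update pairs; it is a generic textbook construction, not code from this chapter, and the names are illustrative.

import numpy as np

def lbfgs_direction(g, s_list, y_list):
    # s_list[i] = theta_{i+1} - theta_i, y_list[i] = g_{i+1} - g_i.
    q = g.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest pair first
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q = q * ((s @ y) / (y @ y))          # initial scaling, M0 = gamma * I
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest first
        b = (y @ q) / (y @ s)
        q = q + (a - b) * s
    return -q                                # descent direction, approx. -H^{-1} g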
Conjugate Gradients (CG) vs Nonlinear Conjugate Gradients (NCG):
1. Conjugate Gradients (CG) is a linear algebra algorithm used to solve systems of linear
equations efficiently. It's an iterative method that involves calculating search directions
and step sizes to minimize the residual norm.
2. Nonlinear Conjugate Gradients (NCG) is a generalization of CG to nonlinear systems of
equations. Instead of finding the minimum of a quadratic function as in CG, NCG aims to
find the minimum of a nonlinear function by approximating it with a quadratic model at
each iteration.

3. The main difference between CG and NCG is that CG is used for solving linear systems,
while NCG is used for solving nonlinear systems. In other words, CG works for systems
where the coefficients are constants, while NCG works for systems where the coefficients
can vary.
4. Another difference is that CG converges much faster than NCG due to the simplicity of
quadratic functions compared to nonlinear functions. On an n-dimensional quadratic
problem, CG reaches the exact minimum in at most n iterations, while NCG's convergence
rate depends on the specific problem being solved.
5. In summary, CG is a linear algebra algorithm used for solving linear systems, while NCG is
a generalization of CG used for solving nonlinear systems. CG converges faster than NCG
due to the simplicity of quadratic functions compared to nonlinear functions.
7th Topic: Optimization Strategies and Meta Algorithms
1. Batch Normalization
2. Coordinate descent
3. Polyak Averaging
4. Supervised Pretraining
5. Designing Models to Aid Optimization
6. Continuation Methods and Curriculum Learning
1. Batch Normalization (BN): A technique that normalizes the input of each layer in a neural
network by scaling and shifting it to have zero mean and unit variance. This helps to speed up
training, reduce overfitting, and improve model performance.
2. Coordinate Descent (CD): An optimization algorithm that breaks down the optimization
problem into smaller subproblems, each of which is solved independently. This method is
particularly useful for large-scale optimization problems with many variables.
3. Polyak Averaging (PA): A technique used in stochastic gradient descent (SGD) that averages
the weights of the model after each iteration. This helps to reduce the variance of the gradient
estimates and improve convergence.
4. Supervised Pretraining (SPT): A method used to pretrain a neural network on a large
dataset before fine-tuning it on a smaller target dataset. This helps to improve the model's
ability to learn complex features and reduce overfitting.
5. Designing Models to Aid Optimization: This involves designing neural network
architectures that are easier to optimize, such as using residual connections, skip connections,
or inverted residual blocks. These techniques help to improve the model's ability to converge
during training and reduce the number of parameters needed for optimal performance.
6. Continuation Methods and Curriculum Learning: These are techniques used to gradually
increase the difficulty of the training data over time, allowing the model to learn more
complex features as it progresses through the training process. Continuation methods involve
starting with a simpler model and gradually increasing its complexity, while curriculum
learning involves presenting easier examples first and gradually increasing their difficulty over
time. Both techniques help to improve the model's ability to generalize to new, unseen data.
Text Book Matter as Follows
1. Batch Normalization:
a) Batch normalization is a technique in deep learning that helps in optimizing neural networks
by making the gradient calculations more stable during training. It's not actually an
optimization algorithm, but rather a way to reparameterize inputs to make the gradient
calculations more reliable.
b) The problem with training very deep neural networks is that as we update the weights of each
layer, the output of the previous layers also changes, which can lead to unexpected results.
Batch normalization addresses this issue by normalizing the inputs to each layer and scaling
them with learnable parameters. This helps to stabilize the gradient calculations and makes it
easier to train deep neural networks.
c) In simple terms, batch normalization works by computing the mean and variance of the inputs
to each layer during training, and then normalizing them to have zero mean and unit variance.
The normalized inputs are then scaled by a learnable parameter called gamma, and shifted by
another learnable parameter called beta. These parameters are learned during training and
help to adjust the distribution of the inputs to each layer.
d) The benefits of batch normalization include faster convergence, reduced sensitivity to
initialization, and improved accuracy on deep neural networks. It's a simple yet effective
technique that has become a standard component of many modern deep-learning models.

However, the actual update will include second-order and third-order effects, on up to effects
of order l. The new value of yˆ is given by
yˆ = x(w1 − εg1)(w2 − εg2) · · · (wl − εgl)          (8.34)
Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix, with
the activations for each example appearing in a row of the matrix. To normalize H, we replace
it with
H' = (H − µ)/σ          (8.35)
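A minimal NumPy sketch of this normalization (per eq. 8.35, extended with the learnable γ and β described above) follows; the small eps constant is an assumption added to avoid division by zero.

import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    mu = H.mean(axis=0)                  # per-feature minibatch mean
    sigma = H.std(axis=0)                # per-feature minibatch standard deviation
    H_norm = (H - mu) / (sigma + eps)    # zero mean, unit variance per feature
    return gamma * H_norm + beta         # learnable rescaling and shifting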
2. Coordinate Descent:
a. Coordinate descent: When trying to find the minimum of a function with multiple variables,
we can break it down into smaller parts by optimizing one variable at a time. This is called
coordinate descent, and it guarantees finding a local minimum.
b. Block coordinate descent: Instead of optimizing one variable at a time, we can also optimize
a group of variables simultaneously. This is called block coordinate descent.
c. Coordinate descent makes sense when the variables can be separated into groups that don't
interact much with each other, or when optimizing one group is much easier than optimizing
all variables at once.
For Example:
J(H, W) = Σi,j |Hi,j| + Σi,j (X − WᵀH)²i,j
This function describes a learning problem called sparse coding, where the goal is to find a
weight matrix W that can linearly decode a matrix of activation values H to reconstruct the
training set X.
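A small sketch of block coordinate descent for this sparse coding objective follows: it alternately minimizes over H with W fixed and over W with H fixed. The shapes follow the X ≈ WᵀH convention, and the inner solvers (a proximal gradient step for H, a plain gradient step for W) are illustrative placeholders rather than the only possible choice.

import numpy as np

def sparse_coding_cd(X, k, lam=0.1, lr=0.01, n_outer=100):
    n, m = X.shape                       # n features per example, m examples
    W = 0.01 * np.random.randn(k, n)     # decoder weights
    H = np.zeros((k, m))                 # activation codes
    for _ in range(n_outer):
        # Block 1: minimize over H -- gradient step on the squared error,
        # then soft-thresholding for the |H| penalty.
        resid = X - W.T @ H
        H = H + lr * (W @ resid)
        H = np.sign(H) * np.maximum(np.abs(H) - lr * lam, 0.0)
        # Block 2: minimize over W -- gradient step on the squared error.
        resid = X - W.T @ H
        W = W + lr * (H @ resid.T)
    return W, H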
3. Polyak Averaging:
1. Polyak averaging is a technique that averages multiple points in the optimization process to
improve convergence.
2. It works by taking the average of parameters visited during gradient descent, θˆ(t) = (1/t) Σi θ(i).
3. In convex problems, this approach has strong convergence guarantees.
4. In neural networks, it's a heuristic method that performs well in practice.
5. The idea is that optimization algorithms may repeatedly visit a valley without reaching its
bottom. The average of all visited points should be close to the bottom.
6. In non-convex problems, including distant past points with large barriers in the cost function
may not be useful, as the optimization trajectory can be complex and visit different regions.
As a result, when applying Polyak averaging to non-convex problems, it is typical to use an
exponentially decaying running average: θˆ(t) = α θˆ(t−1) + (1 − α) θ(t).
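A minimal sketch of SGD with such an exponentially decaying parameter average follows; grad_fn and the default values are assumptions made for the example.

import numpy as np

def averaged_sgd(theta, grad_fn, X, y, lr=0.01, alpha=0.999,
                 batch_size=32, n_steps=1000):
    theta_avg = theta.copy()
    n = len(X)
    for _ in range(n_steps):
        batch = np.random.choice(n, batch_size, replace=False)
        theta = theta - lr * grad_fn(theta, X[batch], y[batch])  # plain SGD step
        theta_avg = alpha * theta_avg + (1.0 - alpha) * theta    # running average
    return theta_avg    # the averaged iterate is the one used at test time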
4. Supervised Pretraining
1. Pretraining is a strategy in deep learning where a simpler model is trained to solve a simpler task
before training the desired model for the final task. This can be helpful when directly training a
complex model is too difficult or the task is too challenging.
2. Pretraining can involve breaking down a problem into smaller components and solving them
separately, similar to how greedy algorithms work. This can be faster and cheaper than solving the
entire problem at once, but may not always result in an optimal solution.
3. Pretraining can be followed by fine-tuning, where a joint optimization algorithm is used to find
the best solution for the entire problem, using the pretrained model as a starting point.
4. Greedy supervised pretraining involves training each stage of a neural network on a simpler
supervised learning problem, using only a subset of the layers in the final network. This can help
provide better guidance to intermediate levels of the network and improve optimization and
generalization.
5. Another approach to pretraining is transfer learning, where a pretrained network is used to initialize weights for a new network that will be trained on a different set of tasks with fewer training examples (see the fine-tuning sketch after this list).
6. FitNets is an approach that involves training a wide and shallow network as a teacher to provide
hints for training a deeper and thinner student network. The student network has two objectives:
to accomplish its own task and predict the middle layer of the teacher network. This can simplify
optimization and help train networks that would otherwise be difficult to train.
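A minimal PyTorch sketch of the transfer-learning idea from point 5, assuming torch and torchvision are installed; the 10-class target task is an assumption made only for illustration:

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load a network pretrained on ImageNet and reuse its weights as initialization.
    model = models.resnet18(weights="IMAGENET1K_V1")

    # Optionally freeze the pretrained feature extractor for the first training phase.
    for p in model.parameters():
        p.requires_grad = False

    # Replace the final classifier head for the new task (assumed: 10 classes).
    model.fc = nn.Linear(model.fc.in_features, 10)

    # Train only the new head at first; unfreezing everything later gives full fine-tuning.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)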
5. Designing Models to Aid Optimization
1. Improving optimization in deep learning models isn't always about improving the optimization
algorithm. Sometimes, it's better to make the models easier to optimize by choosing a simpler
model family.
2. Activation functions that are non-monotonic (rising and falling repeatedly) can make optimization difficult, so it's better to choose smooth, monotonically increasing functions.
3. Most advances in neural network learning have come from changing the model family, not the
optimization algorithm. SGD with momentum is still widely used today.
4. Modern neural networks use linear transformations between layers and differentiable activation
functions with large slope areas. This makes optimization easier because the gradient flows
through many layers and the direction of improvement is clear even if the model's output is far
from correct.
5. Linear paths or skip connections between layers can help mitigate the vanishing gradient problem by shortening the path from lower-layer parameters to the output (see the residual-block sketch after this list).
6. Adding auxiliary heads to intermediate hidden layers can provide error signals to lower layers
during training, making them easier to optimize without the need for pretraining strategies. This
allows all layers to be trained jointly in a single phase.
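A minimal PyTorch sketch of the skip-connection idea from point 5: the block outputs x + F(x), so the shortest gradient path from the output back to earlier layers has length one (class and variable names are illustrative):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(dim, dim),
                nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, x):
            # Skip connection: gradients flow through the identity path
            # even when self.body(x) contributes little early in training.
            return x + self.body(x)

    x = torch.randn(32, 64)
    print(ResidualBlock(64)(x).shape)   # torch.Size([32, 64])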
6. Continuation Methods and Curriculum Learning
1. Many optimization problems are difficult because the cost function has a complex global
structure.
2. Continuation methods try to overcome this by finding initial points that lead to easier-to-solve
problems, which can then be refined to solve more difficult ones.
3. These methods construct a series of cost functions that become increasingly difficult, with the
first being easy to minimize.
4. The idea is to choose initial points that are more likely to land in a region where local
optimization can succeed because it's larger and well-behaved.
5. Traditional continuation methods smooth the objective function, while simulated annealing
adds noise to the parameters.
6. Continuation methods can break down in three ways: requiring too many incremental cost
functions, not becoming convex when blurred, or tracking to a local minimum instead of a global
one.
7. Continuation methods can still help with neural network optimization because they make local
updates easier or improve correspondence between update directions and global solutions.
8. Curriculum learning, which begins with simple concepts and progresses to more complex ones,
can be seen as a continuation method in machine learning and has been successful on natural
language and computer vision tasks.
9. A stochastic curriculum, in which easy and difficult examples are both presented throughout training with a gradually increasing proportion of the latter, is more effective than a deterministic one for training recurrent neural networks to capture long-term dependencies (a minimal sampler sketch follows this list).
10. Optimization methods discussed in this chapter are generally applicable to specialized
architectures with little or no modification as they scale to very large sizes and process structured
input data.
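A minimal sketch of the stochastic curriculum from point 9: each example in a batch is drawn from the difficult pool with a probability that ramps up over training (the pools, batch size, and schedule are illustrative assumptions):

    import random

    def sample_batch(easy, hard, hard_fraction, batch_size=32):
        # Draw each example from the hard pool with probability hard_fraction.
        batch = []
        for _ in range(batch_size):
            pool = hard if random.random() < hard_fraction else easy
            batch.append(random.choice(pool))
        return batch

    easy = list(range(100))          # stand-ins for easy training examples
    hard = list(range(100, 200))     # stand-ins for difficult examples
    for step in range(1000):
        hard_fraction = min(1.0, step / 1000)   # gradually increase difficulty
        batch = sample_batch(easy, hard, hard_fraction)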
5th UNIT-PART-II
Applications With Case Studies:

I. LARGE-SCALE DEEP LEARNING

Introduction to Large-Scale Deep Learning:


Large-scale deep learning refers to the application of deep learning techniques to vast
datasets, often involving distributed computing resources for training large models.

Advantages of Large-Scale Deep Learning:


 Improved Model Performance: Large-scale datasets enable the training of more
accurate and robust models.
 Complex Feature Learning: Deep learning models can learn intricate features and
representations from diverse data.
 Transfer Learning: Pre-trained models on large datasets can be fine-tuned for
specific tasks.

Disadvantages of Large-Scale Deep Learning:


 Computational Resources: Training large models on massive datasets requires
substantial computing power.
 Data Quality: Managing and ensuring the quality of large datasets can be
challenging.
 Overfitting: Large-scale models may be prone to overfitting, especially with
limited labeled data.
1. ImageNet Classification with Large Datasets:
Objective: To train deep convolutional neural networks for image classification on a
large-scale dataset (e.g., ImageNet).
Case Study:
AlexNet, VGGNet, and subsequent models demonstrated breakthroughs in image
classification accuracy when trained on the ImageNet dataset, leading to advancements
in computer vision.
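A minimal inference sketch with an ImageNet-pretrained classifier (a hedged example: torch, torchvision, and PIL are assumed installed, ResNet-50 stands in for AlexNet/VGG-style models, and example.jpg is a placeholder path):

    import torch
    from torchvision import models, transforms
    from PIL import Image

    model = models.resnet50(weights="IMAGENET1K_V2")   # ImageNet-pretrained weights
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                             std=[0.229, 0.224, 0.225]),
    ])

    img = Image.open("example.jpg").convert("RGB")   # placeholder image path
    batch = preprocess(img).unsqueeze(0)             # shape (1, 3, 224, 224)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)
    print(int(probs.argmax(dim=1)))                  # predicted ImageNet class index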

2. Language Models on Large Corpora:


Objective: To develop powerful language models capable of understanding and
generating human-like text.
Case Study:
OpenAI's GPT-3 (Generative Pre-trained Transformer 3) is an example of a language
model trained on a massive corpus, showcasing its ability to perform diverse language
tasks.

3. Recommender Systems with Big Data:


Objective: To enhance recommender systems for personalized content
recommendations.
Case Study:
Netflix's recommendation engine uses large-scale deep learning to analyze user
preferences and behavior, providing tailored content suggestions.

4. Speech Recognition on Extensive Audio Datasets:


Objective: To improve the accuracy of automatic speech recognition (ASR) systems.
Case Study:
Baidu's Deep Speech system, trained on a large corpus of multilingual data, achieved
state-of-the-art performance in speech recognition tasks.

5. Healthcare Imaging with Large Medical Datasets:


Objective: To develop deep learning models for medical image analysis and diagnosis.
Case Study:
DeepMind's application of large-scale deep learning in healthcare involves training
models on extensive medical imaging datasets for tasks such as disease detection.

6. Advancements in Large-Scale Transformers:


Objective: To advance transformer-based architectures for various natural language
processing tasks.
Case Study:
The development of models like BERT (Bidirectional Encoder Representations from
Transformers) and GPT-3 demonstrates the impact of large-scale training on
transformer architectures.

7. Object Detection with Large Datasets:


Objective: To improve object detection accuracy in computer vision applications.
Case Study:
The success of YOLO (You Only Look Once) and Faster R-CNN in large-scale object
detection tasks has been attributed to training on extensive datasets.

8. Deep Reinforcement Learning in Gaming:


Objective: To train agents for playing complex games using deep reinforcement learning.
Case Study:
DeepMind's AlphaGo and AlphaStar used large-scale deep reinforcement learning to
master the games of Go and StarCraft, respectively.

9. Financial Fraud Detection on Extensive Transaction Data:


Objective: To detect fraudulent activities in financial transactions.
Case Study:
Large-scale deep learning is applied in financial institutions to analyze extensive
transaction data for identifying patterns indicative of fraud.

10. Weather Prediction with Extensive Climate Data:


Objective: To improve the accuracy of weather and climate predictions.
Case Study:
The use of large-scale deep learning in climate modeling involves training models on
extensive climate datasets, contributing to more accurate weather forecasts.

11. Social Media Analytics with Big Data:


Objective: To analyze and understand patterns in social media data.

Case Study:
Large-scale deep learning is applied to process vast amounts of social media content
for sentiment analysis, trending topics, and user behavior analysis.

II. COMPUTER VISION


Definition:
Computer vision in deep learning is like giving computers the ability to see and
understand pictures and videos. It involves using advanced techniques called deep
learning to teach computers to recognize objects, understand scenes, and even interpret
gestures or facial expressions in images or videos. The idea is to make machines
understand and interpret visual information, somewhat like how humans do with their
eyes and brain.

1. Object Detection and Recognition:

Application: Deep learning is widely used for object detection and recognition in
images and videos.

Case Study: YOLO (You Only Look Once) is a deep learning model that excels in
real-time object detection, making it suitable for applications like video surveillance
and autonomous vehicles.

Advantages:
 Achieves high accuracy in object detection.
 Allows real-time processing for time-sensitive applications.

Disadvantages:
 May struggle with small or heavily occluded objects.
 Training can be computationally intensive.

2. Facial Recognition:

Application: Deep learning is applied for facial recognition in security systems, authentication, and social media tagging.

Case Study: DeepFace by Facebook uses deep learning to achieve high accuracy in
facial recognition tasks.

Advantages:
 Enables reliable identification of individuals.
 Useful in various applications, from security to user experience.

Disadvantages:
 Privacy concerns regarding the use of facial data.
 Potential biases in recognition systems.

3. Image Classification:

Application: Deep learning is employed for image classification tasks, categorizing images into predefined classes.

Case Study: AlexNet, a deep convolutional neural network, won the ImageNet Large
Scale Visual Recognition Challenge in 2012, demonstrating the effectiveness of deep
learning in image classification.
Advantages:
 High accuracy in classifying diverse images.
 Transfer learning allows leveraging pre-trained models.

Disadvantages:
 Requires large labeled datasets for training.
 Vulnerable to adversarial attacks.

4. Semantic Segmentation:

Application: Deep learning is used for semantic segmentation, where each pixel in an
image is classified, providing a detailed understanding of object boundaries.

Case Study: U-Net is a deep learning architecture commonly used for medical image
segmentation tasks.

Advantages:
 Provides detailed information about object boundaries.
 Useful in medical imaging and autonomous systems.

Disadvantages:
 Requires extensive labeled data for training.
 Computationally demanding for high-resolution images.

5. Human Pose Estimation:

Application: Deep learning is applied for estimating the spatial positions of human
joints, enabling applications like gesture recognition and motion analysis.

Case Study: OpenPose is a deep learning framework for real-time multi-person pose
estimation.

Advantages:
 Useful in human-computer interaction applications.
 Enables understanding of complex human movements.

Disadvantages:
 Challenges in handling diverse body shapes and poses.
 Performance may degrade in crowded scenes.

6. Gesture Recognition:

Application: Deep learning is employed for recognizing and interpreting gestures, enabling natural and intuitive human-computer interaction.

Case Study: Microsoft's Kinect uses deep learning for gesture recognition, allowing
users to control devices through body movements.
Advantages:
 Enhances user experience in interactive systems.
 Enables touchless control in various applications.

Disadvantages:
 Accuracy may vary based on lighting and environmental conditions.
 Limited standardization in gesture definitions.

7. Visual Captioning:

Application: Deep learning is used for generating textual descriptions of visual content,
facilitating better understanding by machines.

Case Study: Google's Show and Tell uses deep learning to generate captions for images.

Advantages:
 Bridges the gap between visual and textual information.
 Useful in accessibility and content understanding.

Disadvantages:
 Challenges in generating accurate and contextually relevant captions.
 Highly dependent on the quality of training data.

8. Augmented Reality (AR) and Virtual Reality (VR):

Application: Deep learning contributes to AR and VR applications by enhancing object recognition, tracking, and scene understanding.

Case Study: DeepAR is a deep learning model used for augmenting reality in various
applications.

Advantages:
 Improves the realism and interactivity of AR and VR experiences.
 Enables seamless integration of virtual and real-world elements.

Disadvantages:
 Requires powerful hardware for real-time processing.
 Challenges in maintaining accuracy in dynamic environments.

These case studies highlight the versatility of deep learning in various computer vision
applications, offering solutions to complex problems. While deep learning brings
significant advantages, including high accuracy and adaptability, it also poses
challenges such as the need for substantial computing resources and potential ethical
considerations.
III. SPEECH RECOGNITION
Introduction to Speech Recognition in Deep Learning:
Speech recognition in deep learning involves training machines to understand and
interpret spoken language, enabling applications like voice assistants, transcription
services, and more.
Advantages:
 Deep learning models can learn complex patterns and representations in audio data.
 Improved accuracy and performance compared to traditional methods.
 Adaptability to diverse accents and languages.
Disadvantages:
 Requires substantial labeled data for effective training.
 Computationally intensive, especially for large-scale models.
 Vulnerable to background noise and environmental variations.

1. Automatic Speech Recognition (ASR):
ASR is a common application where deep learning is used to convert spoken language
into text.
Case Study:
Baidu's Deep Speech utilizes deep learning for accurate and real-time transcription
services.

2. Voice Assistants:
Deep learning is employed in voice assistants for natural language understanding and
response generation.
Case Study:
Amazon Alexa and Google Assistant leverage deep learning to comprehend user
commands and provide relevant responses.
3. Speaker Identification and Verification:
Deep learning models can be used to identify and verify individuals based on their voice
patterns.
Case Study:
Microsoft's Speaker Recognition API uses deep learning for speaker verification in
security applications.
4. Emotion Recognition in Speech:
Deep learning is applied to recognize emotions conveyed through speech, enabling
sentiment analysis.
Case Study:
Beyond Verbal's Emotion AI uses deep learning to analyze vocal intonations and
extract emotional insights.
5. Multilingual Speech Recognition:
Deep learning enables the development of models that can understand and transcribe
multiple languages.
Case Study:
IBM's Watson Speech to Text supports a variety of languages, showcasing the
adaptability of deep learning models.
6. Speech Analytics in Customer Service:
Deep learning is utilized in speech analytics to extract meaningful insights from
customer service interactions.
Case Study:
CallMiner Eureka uses deep learning to analyze call center conversations for sentiment
analysis and customer experience improvement.
7. Voice Biometrics:
Deep learning is applied to create voiceprints for secure authentication and access
control.
Case Study:
Nuance Communications uses deep learning for voice biometrics, enhancing security
in applications like banking and telecommunications.
8. Advancements in Deep Learning architectures:
Ongoing research introduces advanced architectures like Transformer-based models for
improved speech recognition.
Case Study:
Facebook's “wav2vec 2.0” utilizes a Transformer-based architecture for better
contextual understanding in speech recognition.
IV. NATURAL LANGUAGE PROCESSING
Introduction to Natural Language Processing (NLP) in Deep Learning:
NLP in deep learning involves teaching machines to understand, interpret, and generate
human language, enabling applications such as text analysis, machine translation,
sentiment analysis, and more.
Advantages of Deep Learning in NLP:
 Captures Contextual Information: Deep learning models, especially transformers,
capture contextual nuances in language.
 End-to-End Learning: End-to-end models simplify the NLP pipeline, allowing
direct learning from raw text data.
 Transfer Learning: Pre-trained models facilitate transfer learning, improving
performance on specific tasks.
Disadvantages of Deep Learning in NLP:
 Data Dependency: Requires large amounts of labeled data for effective training.
 Computational Intensity: Training deep models can be computationally intensive.
 Interpretability: Deep models are often considered black boxes, making it
challenging to interpret their decision-making processes.
1. Machine Translation:
Deep learning is applied in machine translation to automatically translate text from one
language to another.
Case Study:
Google's Transformer model, introduced in the "Attention is All You Need" paper,
significantly improved machine translation quality.

2. Sentiment Analysis:
Deep learning is used for sentiment analysis to determine the sentiment expressed in a
piece of text.
Case Study:
The use of deep learning in sentiment analysis has seen success in social media
monitoring tools and customer feedback analysis.
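As a small illustration, a hedged sketch using the Hugging Face transformers library (assumed installed); pipeline() downloads a default pretrained sentiment model on first use:

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")   # default pretrained sentiment model
    print(classifier("The support team resolved my issue quickly. Great service!"))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99}]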

3. Text Summarization:
Deep learning models can be applied to automatically generate concise summaries of
longer texts.
Case Study:
BERT-based models have demonstrated effectiveness in extractive summarization
tasks.
4. Named Entity Recognition (NER):
Deep learning is employed in NER to identify and classify entities (e.g., names,
locations) in text.
Case Study:
The Stanford NER system, utilizing deep learning components, has been widely used
for entity recognition tasks.
5. Question Answering Systems:
Deep learning is applied to build systems that can understand and answer questions
posed in natural language.
Case Study:
OpenAI's GPT-3 has shown capabilities in question answering tasks, demonstrating its
understanding of context.
6. Chatbots and Virtual Assistants:
Deep learning is crucial for developing conversational agents capable of understanding
and generating human-like responses.
Case Study:
Google's BERT has been applied to enhance the natural language understanding
capabilities of chatbots.

7. Text Generation:
Deep learning models can generate human-like text based on input prompts or
context.
Case Study:
OpenAI's GPT-3 is known for its impressive text generation capabilities, producing
coherent and contextually relevant text.
8. Document Classification:
Deep learning is used for document classification tasks, such as categorizing
documents into predefined classes.
Case Study:
The use of deep neural networks has improved accuracy in document categorization,
benefiting information retrieval systems.
5th Unit (Deep Learning)
Applications:
a) Large-Scale Deep Learning,
b) Computer Vision,
c) Speech Recognition,
d) Natural Language Processing
I. Applications of Large-Scale Deep Learning with Case Studies
Abstract:
Deep learning has revolutionized the field of artificial intelligence by enabling machines to
learn and make decisions based on vast amounts of data. The recent advancements in
hardware and software have made it possible to train deep neural networks on large-scale
datasets with billions of parameters. In this section, we will discuss some of the most
promising applications of large-scale deep learning, highlighting the latest research and case
studies.

Introduction:
Deep learning is a subfield of machine learning that uses artificial neural networks with
multiple layers to learn and make predictions from large datasets. The success of deep
learning is attributed to the availability of big data, powerful computing resources, and the
development of new algorithms that can handle the complexity of deep neural networks. The applications below illustrate this progress through recent research and case studies.
1. Image and Speech Recognition:
One of the most popular applications of deep learning is image and speech recognition. Large-
scale datasets like ImageNet and CIFAR have been instrumental in driving progress in this area.
For instance, in 2012, AlexNet, a deep neural network with eight layers, achieved a top-5 error
rate of 15.3% on ImageNet, which was a significant improvement over the previous state-of-
the-art method. Since then, several other architectures like VGGNet, ResNet, and Inception
have been proposed, which have further improved the accuracy and efficiency of image
recognition systems.
Case Study: Google Cloud Vision API

Google Cloud Vision API is a cloud-based service that enables developers to integrate image
recognition capabilities into their applications. The service uses a pre-trained model called
InceptionV3, which has over 22 million parameters and can classify images into thousands of
categories with high accuracy. The API also provides additional functionalities like object
detection, labeling, and text recognition. The service is available for free up to a certain limit
and charges based on usage.
2. Natural Language Processing (NLP):

Natural Language Processing (NLP) is another exciting application of deep learning. Large-
scale datasets like Wikipedia and BookCorpus have been used to train deep neural networks
for NLP tasks like language modeling, machine translation, and question answering. For
instance, in 2018, Google's BERT (Bidirectional Encoder Representations from Transformers)
model achieved state-of-the-art results on several NLP benchmarks like GLUE and SQuAD.
BERT is a transformer-based model with over 110 million parameters that can understand the
contextual meaning of words in a sentence by considering their relationships with other words
in both directions.
Case Study: Google Cloud Natural Language API
Google Cloud Natural Language API is a cloud-based service that enables developers to
integrate NLP capabilities into their applications. The service uses a pre-trained model called
BERT-base, which has over 110 million parameters and can perform several NLP tasks like
sentiment analysis, entity recognition, syntax analysis, and question answering with high
accuracy. The API also provides additional functionalities like language detection and
translation between multiple languages. The service is available for free up to a certain limit
and charges based on usage.
3. Autonomous Driving:
Autonomous driving is another promising application of deep learning that requires large
amounts of data to train models for perception, prediction, planning, and control tasks. Large-
scale datasets like KITTI and Cityscapes have been used to train deep neural networks for tasks
like object detection, segmentation, and depth estimation. For instance, in 2018, Waymo's
self-driving car achieved over 8 million miles on public roads using deep learning algorithms
for perception and decision making. Waymo's system uses multiple sensors like lidar, cameras,
and radar to generate a 3D map of the environment in real time and make safe driving
decisions based on it.

Case Study: Tesla Autopilot: Full Self-Driving Capability Beta Program


Tesla's Autopilot system is a semi-autonomous driving feature that enables vehicles to steer,
accelerate/decelerate automatically based on traffic conditions using cameras, ultrasonic
sensors, GPS, and maps. Tesla's Full Self-Driving Capability Beta Program allows selected users
to test an advanced version of Autopilot that uses deep learning algorithms for perception
and decision-making using cameras only (no lidar or radar).

II. Applications of Computer Vision


Application 1: Autonomous Driving
Case Study: Waymo
Waymo, a subsidiary of Alphabet Inc., is a leading company in the development of
autonomous driving technology. The company's self-driving vehicles use computer vision to
navigate the roads and make decisions in real-time.
The computer vision system in Waymo's vehicles consists of multiple cameras, sensors, and
software algorithms that enable the vehicle to perceive its surroundings, understand traffic
signals and signs, and detect pedestrians, cyclists, and other vehicles. The system uses deep
learning techniques to process large amounts of visual data and make accurate predictions
about the environment.
One of the key challenges in autonomous driving is handling complex and unpredictable
scenarios, such as construction zones, roadwork, and unexpected obstacles. Waymo's
computer vision system is designed to handle these situations by using a combination of 3D
mapping, object detection, and path planning algorithms. The system can also adapt to new
environments by continuously learning from real-world data and updating its models.
Another challenge in autonomous driving is ensuring safety and reliability. Waymo's computer
vision system is rigorously tested and validated through simulation and real-world trials. The
company has also developed a comprehensive safety framework that includes redundancy,
backup systems, and human oversight.
Applications: Autonomous driving technology has the potential to revolutionize
transportation by reducing congestion, improving safety, and increasing accessibility for
people with disabilities. Waymo's computer vision system is a critical component of this
technology, enabling vehicles to navigate complex environments with confidence and
precision. As the technology continues to evolve, we can expect to see more advanced
applications in areas such as delivery services, ride-sharing, and urban mobility.

Application 2: Medical Imaging


Case Study: Google DeepMind Health
Google DeepMind Health is a division of Alphabet Inc. that focuses on developing AI solutions
for healthcare applications. One of their most promising applications is in medical imaging,
where computer vision can be used to diagnose diseases at an earlier stage and with greater
accuracy than traditional methods.
DeepMind's computer vision system uses a combination of convolutional neural networks
(CNNs) and transfer learning techniques to analyze medical images such as X-rays, CT scans,
and MRI scans. The system can detect abnormalities that are difficult for human radiologists
to identify, such as small lung nodules or early signs of cancer. It can also provide more
accurate predictions about the severity of diseases and the likelihood of recurrence.
One of the key challenges in medical imaging is ensuring accuracy and reliability while
minimizing false positives and false negatives. DeepMind's computer vision system addresses
this challenge by using a multi-modal approach that combines multiple types of data (e.g.,
images, clinical data) to make more informed predictions. The system also provides
explanations for its decisions, which helps radiologists understand how it arrived at its
conclusions and how it compares to human performance.
Applications: Medical imaging is a critical component of healthcare delivery, as it enables
early diagnosis and treatment of diseases before they become serious or life-threatening.
DeepMind's computer vision system has the potential to significantly improve the accuracy
and efficiency of medical imaging while reducing costs and improving accessibility for patients
in underserved areas. As the technology continues to evolve, we can expect to see more
advanced applications in areas such as telemedicine, remote diagnosis, and personalized
medicine.

III. Applications of Speech Recognition


Speech recognition technology has revolutionized the way we interact with devices and
systems, making it possible for us to control them using our voice. This technology has a
wide range of applications in various industries, and here are some examples with case
studies:
1. Healthcare: Speech recognition technology is being used in healthcare to improve patient
care and reduce costs. One such application is in the field of radiology, where radiologists can
dictate their reports using voice commands, which are then transcribed into text. This
eliminates the need for transcriptionists, reducing costs and turnaround times. A study by
Nuance Communications found that the use of speech recognition technology in radiology
resulted in a 60% reduction in turnaround times and a 50% reduction in costs.
2. Education: Speech recognition technology is being used in education to help students with
learning disabilities such as dyslexia. One such application is called "Read&Write," which is a
software program that helps students read, write, and study more effectively. The program
uses speech recognition technology to convert text into speech, making it easier for students
to understand and retain information. A study by Texthelp found that the use of Read&Write
resulted in a 20% improvement in reading comprehension for students with dyslexia.

3. Customer Service: Speech recognition technology is being used in customer service to improve efficiency and reduce costs. One such application is called "Virtual Assistants," which
are AI-powered chatbots that can answer customer queries using voice commands. These
chatbots can handle simple tasks such as order tracking, account balances, and FAQs, freeing
up human agents to handle more complex tasks. A study by Juniper Research found that the
use of Virtual Assistants in customer service will result in cost savings of $8 billion by 2022.
4. Automotive: Speech recognition technology is being used in automotive to improve safety
and convenience for drivers. One such application is called "Voice-Activated Navigation,"
which allows drivers to navigate using voice commands instead of touching the screen or
buttons while driving. This reduces distractions and improves safety on the road. A study by
Frost & Sullivan found that the use of Voice-Activated Navigation resulted in a 30% reduction
in driver distractions and a 25% improvement in safety.
These are just a few examples of how speech recognition technology is being used across
various industries, improving efficiency, reducing costs, and enhancing user experiences. As
this technology continues to evolve, we can expect to see even more innovative applications
emerge in the future.

IV. Applications of Natural Language Processing


Application 1: Customer Service Chatbot

Case Study: XYZ Bank

XYZ Bank, a leading financial institution, implemented a customer service chatbot powered by
Natural Language Processing (NLP) technology. The chatbot, named "Banki," is available 24/7
to answer customer queries, provide account balances, and perform simple transactions.
The NLP algorithm enables Banki to understand the intent of the customer's message and
respond accurately. For instance, if a customer asks, "What is my account balance?", Banki
will retrieve the information from the customer's account and respond with the balance.
Similarly, if a customer asks, "Can you transfer $500 from my savings to my checking
account?", Banki will initiate the transfer after verifying the customer's identity.
The implementation of Banki has resulted in significant improvements in customer satisfaction
and operational efficiency. Customers can now resolve their queries quickly and easily without
having to wait on hold or visit a branch. This has led to a reduction in call volume and wait
times for customers. Additionally, Banki has freed up bank representatives to focus on more
complex queries that require human intervention.
Application 2: Content Curation Platform

Case Study: BuzzFeed


BuzzFeed, a leading digital media company, implemented an NLP-powered content curation
platform to recommend articles to its users based on their reading history and preferences.
The platform uses machine learning algorithms to analyze user behavior and identify patterns
in their reading habits. Based on this analysis, the platform recommends articles that are likely
to interest the user.
The NLP algorithm enables the platform to understand the context of the articles and
recommend related content. For instance, if a user reads an article about "healthy eating,"
the platform will recommend articles about "healthy recipes" or "nutrition tips." This
personalized content recommendation has resulted in increased engagement and time spent
on the platform by users. Additionally, it has led to an increase in page views and ad revenue
for BuzzFeed.
Application 3: Legal Document Review System
Case Study: Latham & Watkins LLP
Latham & Watkins LLP, a leading global law firm, implemented an NLP-powered legal
document review system to streamline its document review process for litigation cases. The
system uses machine learning algorithms to analyze legal documents and identify relevant
information based on keywords and phrases. This enables lawyers to quickly review large
volumes of documents and focus on the most important ones.
The NLP algorithm enables the system to understand the context of legal documents and
identify relevant information even if it is not explicitly stated using keywords. For instance, if
a document mentions "settlement" but does not use the keyword "agreement," the system
will still identify it as relevant based on its context. This has resulted in significant
improvements in efficiency and accuracy in document review for Latham & Watkins LLP's
litigation cases.
