Stochastic Gradient Descent
CS 584: Big Data Analytics
Gradient Descent Recap
• Simplest and extremely popular
• Main Idea: take a step proportional to the negative of the
gradient
• Easy to implement
• Each iteration is relatively cheap
• Can be slow to converge
CS 584 [Spring 2016] - Ho
Example: Linear Regression
• Optimization problem:
min_w ||Xw - y||_2^2
• Closed form solution:
w* = (X^T X)^{-1} X^T y
• Gradient update:
w^+ = w - γ (1/m) Σ_i (x_i^T w - y_i) x_i
Requires an entire pass through the data!
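As a sanity check, the closed-form solution and the batch gradient update above can be compared on a small synthetic problem (a sketch; the data, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

# Hypothetical toy problem: m = 100 samples, d = 3 features, noiseless targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

# Closed-form solution: w* = (X^T X)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: every update requires a full pass over the data.
w = np.zeros(3)
gamma = 0.1          # learning rate, assumed small enough to converge
m = X.shape[0]
for _ in range(500):
    grad = (X.T @ (X @ w - y)) / m   # (1/m) sum_i (x_i^T w - y_i) x_i
    w = w - gamma * grad

print(np.allclose(w, w_closed, atol=1e-4))  # True: both recover w_true
```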
Tackling Compute Problems: Scaling to Large n
• Streaming implementation
• Parallelize your batch algorithm
• Aggressively subsample the data
• Change algorithm or training method
• Optimization is a surrogate for learning
• Trade-off weaker optimization with more data
Tradeoffs of Large Scale Learning
• True (generalization) error is a function of approximation
error, estimation error, and optimization error subject to
number of training examples and computational time
• Solution will depend on which budget constraint is active
Bottou and Bousquet (2011). The Tradeoffs of Large-Scale Learning.
In Optimization for Machine Learning (pp. 351–368).
Minimizing Generalization Error
If n → ∞, then ε_est → 0
For fixed generalization error, as number of samples increases,
we can increase optimization tolerance
Talk by Aditya Menon, UCSD
Expected Risk vs Empirical Risk Minimization
• Expected risk: assumes the ground truth distribution P(x, y) is known; the expected risk of a classification function f_w is
E(f_w) = ∫ L(f_w(x), y) dP(x, y) = E[L(f_w(x), y)]
• Empirical risk: in the real world the ground truth distribution is not known; only the empirical risk can be calculated:
E_n(f_w) = (1/n) Σ_i L(f_w(x_i), y_i)
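The empirical risk is directly computable from data. A minimal sketch with squared loss (the function names and data are hypothetical, for illustration only):

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    """E_n(f_w) = (1/n) * sum_i L(f_w(x_i), y_i) for a linear model f_w(x) = x^T w."""
    preds = X @ w
    return np.mean([loss(p, t) for p, t in zip(preds, y)])

squared = lambda p, t: (p - t) ** 2   # example loss L

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
w = np.array([2.0, -1.0])             # fits both points exactly

print(empirical_risk(w, X, y, squared))  # 0.0
```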
Gradient Descent Reformulated
w^+ = w - γ_t (1/n) Σ_i ∇_w L(f_w(x_i), y_i) = w - γ_t ∇E_n(f_w)
where γ_t is the learning rate or gain
• True gradient descent is a batch algorithm, slow but sure
• Under sufficient regularity assumptions, if the initial estimate is
close to the optimum and the gain is sufficiently small, convergence
is linear
Stochastic Optimization Motivation
• Information is redundant amongst samples
• Sufficient samples means we can afford more frequent,
noisy updates
• Never-ending stream means we should not wait for all
data
• Tracking non-stationary data means that the target is
moving
Stochastic Optimization
• Idea: Estimate the function and gradient from a small, current
subsample of your data; with enough iterations and data, you will
converge to the true minimum
• Pro: Better for large datasets and often faster convergence
• Con: Hard to reach high accuracy
• Con: Best classical methods can’t handle stochastic
approximation
• Con: Theoretical notions of convergence are not as well-defined
Stochastic Gradient Descent (SGD)
• Randomized gradient estimate to minimize the function
using a single randomly picked example
Instead of ∇f, use a stochastic estimate ∇̃f, where E[∇̃f] = ∇f
• The resulting update is of the form:
w^+ = w - γ_t ∇_w L(f_w(x_i), y_i)
• Although random noise is introduced, it behaves like
gradient descent in its expectation
SGD Algorithm
Randomly initialize parameter w and learning rate γ
while not converged do
    Randomly shuffle examples in training set
    for i = 1, ..., N do
        w ← w - γ_t ∇_w L(f_w(x_i), y_i)
    end
end
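The algorithm above can be sketched in Python; squared-error loss and a fixed learning rate are assumed here for concreteness (illustrative choices, not from the slides):

```python
import numpy as np

def sgd(X, y, gamma=0.01, epochs=50, seed=0):
    """SGD for least squares: shuffle each epoch, update on one example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                           # initialize parameter w
    for _ in range(epochs):
        idx = rng.permutation(n)              # randomly shuffle the training set
        for i in idx:
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of (1/2)(x_i^T w - y_i)^2
            w = w - gamma * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([3.0, -1.0])
w_hat = sgd(X, X @ w_true)                    # noiseless targets
print(np.allclose(w_hat, w_true, atol=1e-2))  # True
```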
The Benefits of SGD
• Gradient is easy to calculate (“instantaneous”)
• Less prone to local minima
• Small memory footprint
• Get to a reasonable solution quickly
• Works for non-stationary environments as well as online
settings
• Can be used for more complex models and error surfaces
Importance of Learning Rate
• Learning rate has a large impact on convergence
• Too small → too slow
• Too large → oscillatory and may even diverge
• Should learning rate be fixed or adaptive?
• Is convergence necessary?
• Non-stationary: convergence may not be required or desired
• Stationary: learning rate should decrease with time
• A Robbins-Monro sequence, e.g. γ_t = 1/t, is adequate
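A small illustration of the Robbins-Monro rate γ_t = 1/t: for the loss L(w; x) = (1/2)(w - x)^2, SGD with this schedule on a data stream reproduces the running sample mean exactly (the stream parameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=5.0, scale=2.0, size=1000)   # noisy stream of samples

w = 0.0
for t, x in enumerate(stream, start=1):
    # SGD step with gamma_t = 1/t; the per-sample gradient is (w - x)
    w = w - (1.0 / t) * (w - x)

print(np.isclose(w, stream.mean()))  # True: w equals the sample mean
```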
Mini-batch Stochastic Gradient Descent
• Rather than using a single point, use a random subset
where the size is less than the original data size
w^+ = w - γ_t (1/|S_k|) Σ_{i ∈ S_k} ∇_w L(f_w(x_i), y_i), where S_k ⊆ [n]
• Like the single random sample, the full gradient is
approximated via an unbiased noisy estimate
• Random subset reduces the variance by a factor of
1/|Sk|, but is also |Sk| times more expensive
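The 1/|S_k| variance claim can be checked empirically. A sketch (synthetic data; the Monte Carlo setup is an illustration, not from the slides):

```python
import numpy as np

# Compare the variance of single-sample vs mini-batch gradient estimates.
rng = np.random.default_rng(0)
n, d = 10_000, 1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)

def minibatch_grad(batch_size):
    idx = rng.choice(n, size=batch_size, replace=False)   # S_k, a subset of [n]
    return np.mean((X[idx] @ w - y[idx])[:, None] * X[idx], axis=0)

var1 = np.var([minibatch_grad(1)[0] for _ in range(5000)])
var100 = np.var([minibatch_grad(100)[0] for _ in range(5000)])
print(var1 / var100)  # roughly 100: variance scales as 1/|S_k|
```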
Example: Regularized Logistic Regression
• Optimization problem:
min_β (1/n) Σ_i ( -y_i x_i^T β + log(1 + e^{x_i^T β}) ) + (λ/2) ||β||_2^2
• Gradient computation:
∇f(β) = (1/n) Σ_i (p_i(β) - y_i) x_i + λβ
• Update costs:
• Batch: O(nd) (doable if n is moderate, but not when n is huge)
• Stochastic: O(d)
• Mini-batch: O(|S_k|d)
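A sketch of the O(d) stochastic update for the regularized logistic regression objective above (labels y_i ∈ {0, 1} assumed; the data and hyperparameters are illustrative):

```python
import numpy as np

def sgd_logreg_step(beta, x_i, y_i, gamma, lam):
    """One O(d) SGD update using the single-sample gradient
    (p_i(beta) - y_i) x_i + lam * beta, with p_i(beta) = sigmoid(x_i^T beta)."""
    p_i = 1.0 / (1.0 + np.exp(-x_i @ beta))
    return beta - gamma * ((p_i - y_i) * x_i + lam * beta)

rng = np.random.default_rng(0)
beta_true = np.array([2.0, -2.0])
X = rng.normal(size=(2000, 2))
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)
for i in range(2000):                 # a single pass over the data
    beta = sgd_logreg_step(beta, X[i], y[i], gamma=0.1, lam=0.01)
print(beta)                           # should move toward beta_true = [2, -2]
```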
Example: n=10,000, d=20
Iterations make better progress as the mini-batch size grows, but each
iteration also takes more computation time
http://stat.cmu.edu/~ryantibs/convexopt/lectures/25-fast-stochastic.pdf
SGD Updates for Various Systems
Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade.
Asymptotic Analysis of GD and SGD
Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade.
SGD Recommendations
• Randomly shuffle training examples
• Although theory says you should randomly pick examples, it
is easier to make a pass through your training set sequentially
• Shuffling before each iteration eliminates the effect of order
• Monitor both training cost and validation error
• Set aside samples for a decent validation set
• Compute the objective on the training set and validation set
(expensive but better than overfitting or wasting computation)
Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade.
SGD Recommendations (2)
• Check gradient using finite differences
• If the gradient computation is slightly incorrect, the algorithm can
become erratic and slow
• Verify your code by slightly perturbing the parameter and
inspecting differences between the two gradients
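The finite-difference check above can be sketched as follows (the loss used is a hypothetical example for illustration):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences: perturb each coordinate slightly and
    compare the change in f against the analytic gradient."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Example loss: f(w) = ||Xw - y||^2 with analytic gradient 2 X^T (Xw - y).
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
f = lambda w: np.sum((X @ w - y) ** 2)
analytic = 2 * X.T @ (X @ w - y)

print(np.allclose(numerical_grad(f, w), analytic, atol=1e-5))  # True
```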
• Experiment with the learning rates using small sample of training set
• SGD convergence rates are independent of the sample size
• Use traditional optimization algorithms as a reference point
SGD Recommendations (3)
• Leverage sparsity of the training examples
• For very high-dimensional vectors with few nonzero
coefficients, you only need to update the weight
coefficients corresponding to the nonzero pattern of x
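A sketch of exploiting sparsity: only the coordinates where x_i is nonzero are touched (the regularizer, which would update all weights, is ignored here; names and data are hypothetical):

```python
import numpy as np

def sparse_sgd_step(w, nz_idx, nz_val, y_i, gamma):
    """SGD step for least squares on a sparse example.

    Only weights matching the nonzero pattern of x_i are read or written,
    so the cost is O(nnz(x_i)) rather than O(d)."""
    pred = w[nz_idx] @ nz_val                  # x_i^T w using nonzeros only
    w[nz_idx] -= gamma * (pred - y_i) * nz_val
    return w

d = 1_000_000                                  # very high-dimensional w
w = np.zeros(d)
nz_idx = np.array([3, 42, 999_999])            # sparse example: 3 nonzeros
nz_val = np.array([1.0, 2.0, -1.0])
w = sparse_sgd_step(w, nz_idx, nz_val, y_i=1.0, gamma=0.5)

print(np.count_nonzero(w))  # 3: only the touched coordinates changed
```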
• Use learning rates of the form γ_t = γ_0 / (1 + γ_0 λ t)
• Allows you to start from reasonable learning rates
determined by testing on a small sample
• Works well in most situations if the initial learning rate is slightly
smaller than the best value observed on the training sample
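This schedule starts at γ_0 and decays asymptotically like 1/(λt); a minimal sketch (the parameter values are made up):

```python
def bottou_rate(t, gamma0, lam):
    """gamma_t = gamma_0 / (1 + gamma_0 * lambda * t)."""
    return gamma0 / (1 + gamma0 * lam * t)

print(bottou_rate(0, 0.5, 0.01))                  # 0.5: starts at gamma_0
print(round(bottou_rate(10_000, 0.5, 0.01), 4))   # 0.0098: near 1/(lam * t) = 0.01
```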
Some Resources for SGD
• Francis Bach’s talk in 2012: http://www.ann.jussieu.fr/~plc/bach2012.pdf
• Stochastic Gradient Methods Workshop: http://yaroslavvb.blogspot.com/2014/03/stochastic-gradient-methods-2014.html
• Python implementation in scikit-learn: http://scikit-learn.org/stable/modules/sgd.html
• iPython notebook for implementing GD and SGD in Python: https://github.com/dtnewman/gradient_descent/blob/master/stochastic_gradient_descent.ipynb
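A quick usage sketch of the scikit-learn SGD implementation listed above (requires scikit-learn; the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Linear SVM trained by SGD (hinge loss); alpha is the regularization strength.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, typically well above chance
```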