Stochastic Gradient Descent
CS 584: Big Data Analytics
Gradient Descent Recap
• Simplest and extremely popular
• Main Idea: take a step proportional to the negative of the
gradient
• Easy to implement
• Each iteration is relatively cheap
• Can be slow to converge
CS 584 [Spring 2016] - Ho
Example: Linear Regression
• Optimization problem:
min_w ||Xw - y||_2^2
• Closed form solution:
w* = (X^T X)^{-1} X^T y
• Gradient update:
w^+ = w - γ (1/m) Σ_i (x_i^T w - y_i) x_i
Requires an entire pass through the data!
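As a sanity check, the closed-form solution and the batch gradient update above can be compared on a small synthetic problem (a sketch; the data, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

# Hypothetical toy problem: m = 100 samples, d = 3 features, noiseless targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

# Closed-form solution: w* = (X^T X)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: every update requires a full pass over the data.
w = np.zeros(3)
gamma = 0.1          # learning rate, assumed small enough to converge
m = X.shape[0]
for _ in range(500):
    grad = (X.T @ (X @ w - y)) / m   # (1/m) sum_i (x_i^T w - y_i) x_i
    w = w - gamma * grad

print(np.allclose(w, w_closed, atol=1e-4))  # True: both recover w_true
```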
Tackling Compute Problems: Scaling to Large n
• Streaming implementation
• Parallelize your batch algorithm
• Aggressively subsample the data
• Change algorithm or training method
• Optimization is a surrogate for learning
• Trade-off weaker optimization with more data
Tradeoffs of Large Scale Learning
• True (generalization) error is a function of approximation
error, estimation error, and optimization error subject to
number of training examples and computational time
• Solution will depend on which budget constraint is active
Bottou and Bousquet (2011). The Tradeoffs of Large-Scale Learning.
In Optimization for Machine Learning (pp. 351–368).
Minimizing Generalization Error
If n → ∞, then ε_est → 0
For fixed generalization error, as number of samples increases,
we can increase optimization tolerance
Talk by Aditya Menon, UCSD
Expected Risk vs Empirical Risk Minimization
• Expected risk: assumes the ground truth distribution P(x, y) is known; the expected risk of a classification function f_w is
E(f_w) = ∫ L(f_w(x), y) dP(x, y) = E[L(f_w(x), y)]
• Empirical risk: in the real world the ground truth distribution is not known; only the empirical risk can be calculated:
E_n(f_w) = (1/n) Σ_i L(f_w(x_i), y_i)
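The empirical risk is directly computable from data. A minimal sketch with squared loss (the function names and data are hypothetical, for illustration only):

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    """E_n(f_w) = (1/n) * sum_i L(f_w(x_i), y_i) for a linear model f_w(x) = x^T w."""
    preds = X @ w
    return np.mean([loss(p, t) for p, t in zip(preds, y)])

squared = lambda p, t: (p - t) ** 2   # example loss L

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
w = np.array([2.0, -1.0])             # fits both points exactly

print(empirical_risk(w, X, y, squared))  # 0.0
```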
Gradient Descent Reformulated
w^+ = w - γ_t (1/n) Σ_i ∇_w L(f_w(x_i), y_i) = w - γ_t ∇E_n(f_w)
where γ_t is the learning rate or gain
• True gradient descent is a batch algorithm, slow but sure
• Under sufficient regularity assumptions, if the initial estimate is
close to the optimum and the gain is sufficiently small, convergence
is linear
Stochastic Optimization Motivation
• Information is redundant amongst samples
• Sufficient samples means we can afford more frequent,
noisy updates
• Never-ending stream means we should not wait for all
data
• Tracking non-stationary data means that the target is
moving
Stochastic Optimization
• Idea: Estimate the function and gradient from a small, current
subsample of your data; with enough iterations and data, you will
converge to the true minimum
• Pro: Better for large datasets and often faster convergence
• Con: Hard to reach high accuracy
• Con: Best classical methods can’t handle stochastic
approximation
• Con: Theoretical notions of convergence are not as well-defined
Stochastic Gradient Descent (SGD)
• Randomized gradient estimate to minimize the function
using a single randomly picked example
Instead of ∇f, use a stochastic estimate ∇̃f, where E[∇̃f] = ∇f
• The resulting update is of the form:
w^+ = w - γ_t ∇_w L(f_w(x_i), y_i)
• Although random noise is introduced, it behaves like
gradient descent in its expectation
SGD Algorithm
Randomly initialize parameter w and learning rate γ
while not converged do
    Randomly shuffle examples in training set
    for i = 1, ..., N do
        w ← w - γ_t ∇_w L(f_w(x_i), y_i)
    end
end
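The algorithm above can be sketched in Python; squared-error loss and a fixed learning rate are assumed here for concreteness (illustrative choices, not from the slides):

```python
import numpy as np

def sgd(X, y, gamma=0.01, epochs=50, seed=0):
    """SGD for least squares: shuffle each epoch, update on one example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                           # initialize parameter w
    for _ in range(epochs):
        idx = rng.permutation(n)              # randomly shuffle the training set
        for i in idx:
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of (1/2)(x_i^T w - y_i)^2
            w = w - gamma * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([3.0, -1.0])
w_hat = sgd(X, X @ w_true)                    # noiseless targets
print(np.allclose(w_hat, w_true, atol=1e-2))  # True
```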
The Benefits of SGD
• Gradient is easy to calculate (“instantaneous”)
• Less prone to local minima
• Small memory footprint
• Get to a reasonable solution quickly
• Works for non-stationary environments as well as online
settings
• Can be used for more complex models and error surfaces
Importance of Learning Rate
• Learning rate has a large impact on convergence
• Too small → too slow
• Too large → oscillatory and may even diverge
• Should learning rate be fixed or adaptive?
• Is convergence necessary?
• Non-stationary: convergence may not be required or desired
• Stationary: learning rate should decrease with time
• A Robbins-Monro sequence, e.g. γ_t = 1/t, is adequate
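A small illustration of the Robbins-Monro rate γ_t = 1/t: for the loss L(w; x) = (1/2)(w - x)^2, SGD with this schedule on a data stream reproduces the running sample mean exactly (the stream parameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=5.0, scale=2.0, size=1000)   # noisy stream of samples

w = 0.0
for t, x in enumerate(stream, start=1):
    # SGD step with gamma_t = 1/t; the per-sample gradient is (w - x)
    w = w - (1.0 / t) * (w - x)

print(np.isclose(w, stream.mean()))  # True: w equals the sample mean
```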
Mini-batch Stochastic Gradient Descent
• Rather than using a single point, use a random subset
where the size is less than the original data size
w^+ = w - γ_t (1/|S_k|) Σ_{i ∈ S_k} ∇_w L(f_w(x_i), y_i), where S_k ⊆ [n]
• Like the single random sample, the full gradient is
approximated via an unbiased noisy estimate
• Random subset reduces the variance by a factor of
1/|Sk|, but is also |Sk| times more expensive
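The 1/|S_k| variance claim can be checked empirically. A sketch (synthetic data; the Monte Carlo setup is an illustration, not from the slides):

```python
import numpy as np

# Compare the variance of single-sample vs mini-batch gradient estimates.
rng = np.random.default_rng(0)
n, d = 10_000, 1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)

def minibatch_grad(batch_size):
    idx = rng.choice(n, size=batch_size, replace=False)   # S_k, a subset of [n]
    return np.mean((X[idx] @ w - y[idx])[:, None] * X[idx], axis=0)

var1 = np.var([minibatch_grad(1)[0] for _ in range(5000)])
var100 = np.var([minibatch_grad(100)[0] for _ in range(5000)])
print(var1 / var100)  # roughly 100: variance scales as 1/|S_k|
```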
Example: Regularized Logistic Regression
• Optimization problem:
min_β (1/n) Σ_i ( -y_i x_i^T β + log(1 + e^{x_i^T β}) ) + (λ/2) ||β||_2^2
• Gradient computation:
∇f(β) = (1/n) Σ_i (p_i(β) - y_i) x_i + λβ
• Update costs:
• Batch: O(nd) (doable if n is moderate, but not when n is huge)
• Stochastic: O(d)
• Mini-batch: O(|S_k|d)
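A sketch of the O(d) stochastic update for the regularized logistic regression objective above (labels y_i ∈ {0, 1} assumed; the data and hyperparameters are illustrative):

```python
import numpy as np

def sgd_logreg_step(beta, x_i, y_i, gamma, lam):
    """One O(d) SGD update using the single-sample gradient
    (p_i(beta) - y_i) x_i + lam * beta, with p_i(beta) = sigmoid(x_i^T beta)."""
    p_i = 1.0 / (1.0 + np.exp(-x_i @ beta))
    return beta - gamma * ((p_i - y_i) * x_i + lam * beta)

rng = np.random.default_rng(0)
beta_true = np.array([2.0, -2.0])
X = rng.normal(size=(2000, 2))
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)
for i in range(2000):                 # a single pass over the data
    beta = sgd_logreg_step(beta, X[i], y[i], gamma=0.1, lam=0.01)
print(beta)                           # should move toward beta_true = [2, -2]
```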
Example: n=10,000, d=20
Iterations make better progress as the mini-batch size grows, but each
iteration also takes more computation time
http://stat.cmu.edu/~ryantibs/convexopt/lectures/25-fast-stochastic.pdf
SGD Updates for Various Systems
Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade.
Asymptotic Analysis of GD and SGD
Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade.
SGD Recommendations
• Randomly shuffle training examples
• Although theory says you should randomly pick examples, it
is easier to make a pass through your training set sequentially
• Shuffling before each iteration eliminates the effect of order
• Monitor both training cost and validation error
• Set aside samples for a decent validation set
• Compute the objective on the training set and validation set
(expensive but better than overfitting or wasting computation)
Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade.
SGD Recommendations (2)
• Check gradient using finite differences
• If the gradient computation is slightly incorrect, the algorithm can
become erratic and slow
• Verify your code by slightly perturbing the parameter and
inspecting differences between the two gradients
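The finite-difference check above can be sketched as follows (the loss used is a hypothetical example for illustration):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences: perturb each coordinate slightly and
    compare the change in f against the analytic gradient."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Example loss: f(w) = ||Xw - y||^2 with analytic gradient 2 X^T (Xw - y).
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
f = lambda w: np.sum((X @ w - y) ** 2)
analytic = 2 * X.T @ (X @ w - y)

print(np.allclose(numerical_grad(f, w), analytic, atol=1e-5))  # True
```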
• Experiment with the learning rates using small sample of training set
• SGD convergence rates are independent of the sample size
• Use traditional optimization algorithms as a reference point
SGD Recommendations (3)
• Leverage sparsity of the training examples
• For very high-dimensional vectors with few nonzero
coefficients, you only need to update the weight
coefficients corresponding to the nonzero pattern of x
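A sketch of exploiting sparsity: only the coordinates where x_i is nonzero are touched (the regularizer, which would update all weights, is ignored here; names and data are hypothetical):

```python
import numpy as np

def sparse_sgd_step(w, nz_idx, nz_val, y_i, gamma):
    """SGD step for least squares on a sparse example.

    Only weights matching the nonzero pattern of x_i are read or written,
    so the cost is O(nnz(x_i)) rather than O(d)."""
    pred = w[nz_idx] @ nz_val                  # x_i^T w using nonzeros only
    w[nz_idx] -= gamma * (pred - y_i) * nz_val
    return w

d = 1_000_000                                  # very high-dimensional w
w = np.zeros(d)
nz_idx = np.array([3, 42, 999_999])            # sparse example: 3 nonzeros
nz_val = np.array([1.0, 2.0, -1.0])
w = sparse_sgd_step(w, nz_idx, nz_val, y_i=1.0, gamma=0.5)

print(np.count_nonzero(w))  # 3: only the touched coordinates changed
```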
• Use learning rates of the form γ_t = γ_0 / (1 + γ_0 λ t)
• Allows you to start from reasonable learning rates
determined by testing on a small sample
• Works well in most situations if the initial learning rate is slightly
smaller than the best value observed on the training sample
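This schedule starts at γ_0 and decays asymptotically like 1/(λt); a minimal sketch (the parameter values are made up):

```python
def bottou_rate(t, gamma0, lam):
    """gamma_t = gamma_0 / (1 + gamma_0 * lambda * t)."""
    return gamma0 / (1 + gamma0 * lam * t)

print(bottou_rate(0, 0.5, 0.01))                  # 0.5: starts at gamma_0
print(round(bottou_rate(10_000, 0.5, 0.01), 4))   # 0.0098: near 1/(lam * t) = 0.01
```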
Some Resources for SGD
• Francis Bach’s talk in 2012: http://www.ann.jussieu.fr/~plc/bach2012.pdf
• Stochastic Gradient Methods Workshop: http://yaroslavvb.blogspot.com/2014/03/stochastic-gradient-methods-2014.html
• Python implementation in scikit-learn: http://scikit-learn.org/stable/modules/sgd.html
• iPython notebook for implementing GD and SGD in Python: https://github.com/dtnewman/gradient_descent/blob/master/stochastic_gradient_descent.ipynb
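A quick usage sketch of the scikit-learn SGD implementation listed above (requires scikit-learn; the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Linear SVM trained by SGD (hinge loss); alpha is the regularization strength.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, typically well above chance
```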