thomelane · 2018-06-15T16:28:39Z

Description

Added a tutorial that demonstrates a method of finding a good initial learning rate.

The method comes from "Cyclical Learning Rates for Training Neural Networks" by Leslie N. Smith (2015) and also appears in many blog posts (e.g. https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

szha · 2018-06-17T00:25:50Z

@aaronmarkham would you provide some input on the writing? @astonzhang could you help review the code when you have time?

thomelane · 2018-06-19T18:41:44Z

@Ishitori @ThomasDelteil could you review when you get a chance, thanks!

thomelane · 2018-06-22T17:36:12Z

@indhub got any time to review? thanks!

Ishitori

I have reviewed this code before, so just a few minor notes.

Ishitori · 2018-06-22T18:04:11Z

+        learner: able to take single iteration with given learning rate and return loss
+           and save and load parameters of the network (Learner)
+        """
+        self.learner = learner


Private fields are better prefixed with an underscore "_"

Will leave for this tutorial since stylistic, but will do in future!

Ishitori · 2018-06-22T18:06:37Z

+        smoothing: applied to running mean which is used for thresholding (float)
+        min_iter: minimum number of iterations before early stopping can occur (int)
+        """
+        self.smoothing = smoothing


Underscore as a prefix

Same as before.
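
For readers following the `smoothing` and `min_iter` docstrings in this hunk, here is a minimal sketch of how such a stopping criterion could work. The class name `StoppingCriteriaSketch`, the default values, and the rule "stop once the smoothed loss exceeds `threshold` times the first loss" are assumptions made for illustration; the tutorial's own `LRFinderStoppingCriteria` may differ.

```python
class StoppingCriteriaSketch:
    """Signal a stop once the smoothed loss has clearly diverged."""

    def __init__(self, smoothing=0.3, min_iter=20, threshold=2.0):
        self.smoothing = smoothing    # weight kept on the previous running mean
        self.min_iter = min_iter      # never stop before this many iterations
        self.threshold = threshold    # assumed rule: stop at threshold * first loss
        self.first_loss = None
        self.running_mean = None
        self.counter = 0

    def __call__(self, loss):
        self.counter += 1
        if self.first_loss is None:
            # The first observation seeds both the reference and the running mean.
            self.first_loss = loss
            self.running_mean = loss
        else:
            # Exponentially weighted running mean of the loss.
            self.running_mean = (self.smoothing * self.running_mean
                                 + (1 - self.smoothing) * loss)
        diverged = self.running_mean > self.threshold * self.first_loss
        return diverged and self.counter >= self.min_iter
```

Smoothing keeps a single noisy batch from ending the search early, and `min_iter` guards against stopping before the loss has had a chance to fall.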

Ishitori · 2018-06-22T18:12:43Z

+        lr = lr_start
+        self.results = [] # List of (lr, loss) tuples
+        stopping_criteria = LRFinderStoppingCriteria(smoothing)
+        while True:


Try rewriting this as `while not stopping_criteria(loss)`; it looks cleaner.

Would need duplication of two lines of code, so prefer as is.
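
To make the trade-off in this exchange concrete, the two loop shapes can be sketched side by side. This is an illustration only: `find_with_break`, `find_with_condition`, `iteration` and `should_stop` are hypothetical stand-ins, not the tutorial's code.

```python
def find_with_break(iteration, should_stop, lr_start, lr_multiplier):
    """`while True` form: the loop body appears once."""
    lr, results = lr_start, []
    while True:
        loss = iteration(lr)
        results.append((lr, loss))
        if should_stop(loss):
            break
        lr = lr * lr_multiplier
    return results


def find_with_condition(iteration, should_stop, lr_start, lr_multiplier):
    """`while not should_stop(loss)` form: needs a loss before the first test."""
    lr, results = lr_start, []
    loss = iteration(lr)              # duplicated line 1
    results.append((lr, loss))        # duplicated line 2
    while not should_stop(loss):
        lr = lr * lr_multiplier
        loss = iteration(lr)
        results.append((lr, loss))
    return results
```

Both functions record the same `(lr, loss)` pairs when `should_stop` is stateless; the second simply pays for the cleaner loop header with the two duplicated lines mentioned in the reply.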

Ishitori · 2018-06-22T18:13:25Z

+
+    def close(self):
+        # Close open iterator and associated workers
+        self.data_loader_iter.shutdown()


Is a simple self.data_loader_iter.close() not enough?

After experimenting, determined that shutdown was the best option for reliable closing.

safrooze · 2018-06-22T19:08:53Z

+        # So we don't need to be in `for batch in data_loader` scope
+        # and can call for next batch in `iteration`
+        self.data_loader_iter = iter(self.data_loader)
+        self.net.collect_params().initialize(mx.init.Xavier(), ctx=self.ctx)


net.initialize()

safrooze · 2018-06-22T19:15:52Z

+        if not self.learner.trainer._kv_initialized:
+            self.learner.trainer._init_kvstore()
+        # Store params and optimizer state for restore after lr_finder procedure
+        self.learner.net.save_params("lr_finder.params")


I find that the code for saving and restoring parameters adds unnecessary complication. Given that LR finding is a completely separate step before actual learning, and its output plot must be analyzed manually, why not just create a completely separate network/trainer for the LR Finder and, once that process is over, create another network/trainer to confirm that the correct LR has been selected?

It's been suggested that this method can be used during training too, which would require save and load to get back to the original state. Will add a comment to explain this. It also helps with reproducibility of the learning rate finder's results (for different settings), since the same initialization values are used each time.
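
The save-and-restore idea discussed here can be sketched in memory. This is a simplified stand-in under stated assumptions: `run_without_side_effects` is a hypothetical helper and `state` is a plain dict, whereas the tutorial writes parameters and optimizer state to files (`lr_finder.params` and `lr_finder.state`) and loads them back.

```python
import copy


def run_without_side_effects(state, procedure):
    """Run procedure(state), then put state back exactly as it was."""
    snapshot = copy.deepcopy(state)   # taken before the procedure mutates anything
    try:
        return procedure(state)
    finally:
        # Restore even if the procedure raised, so training can resume afterwards.
        state.clear()
        state.update(snapshot)
```

Restoring in a `finally` block is a design choice of this sketch: it means a search that diverges or raises still leaves the model where it started, which is what allows the finder to run part-way through training.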

indhub · 2018-06-22T17:50:03Z

+
+Setting the learning rate for stochastic gradient descent (SGD) is crucially important when training neural network because it controls both the speed of convergence and the ultimate performance of the network. Set the learning too low and you could be twiddling your thumbs for quite some time as the parameters update very slowly. Set it too high and the updates will skip over optimal solutions, or worse the optimizer might not converge at all!
+
+Leslie Smith from the U.S. Naval Research Laboratory presented a method for finding a good learning rate in a paper called ["Cyclical Learning Rates for Training Neural Networks"](https://arxiv.org/abs/1506.01186). We take a look at the central idea of the paper, cyclical learning rate schedules, in the tutorial found here, but in this tutorial we implement a 'Learning Rate Finder' in MXNet with the Gluon API that you can use while training your own networks.


"found here" - Did you mean to provide a link here?

"but in this tutorial" - "but in this tutorial,"

Good catch, unpublished tutorial so replaced with name of tutorial.

Comma overload if add.

indhub · 2018-06-22T17:55:23Z

+
+![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/lr_finder/finder_plot.png) <!--notebook-skip-line-->
+
+As expected, for very small learning rates we don't see much change in the loss as the paramater updates are negligible. At a learning rate of 0.001 we start to see the loss fall. Setting the initial learning rate here is reasonable, but we still have the potential to learn faster. We observe a drop in the loss up until 0.1 where the loss appears to diverge. We want to set the initial learning rate as high as possible before the loss becomes unstable, so we choose a learning rate of 0.05.


"paramater" -> "parameter"

"At a learning rate of 0.001" -> "At a learning rate of 0.001,"

indhub · 2018-06-22T17:56:32Z

+2. start with a very small learning rate (e.g. 0.000001) and slowly increase it every iteration
+3. record the training loss and continue until we see the training loss diverge
+
+We then analyse the results by plotting a graph of the learning rate against the training loss as seen below (taking note of the log scales).


"analyse" -> "analyze"?

It's the Queen's English Indu! 🇬🇧
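
As a worked example of step 2, the length of the search follows from the multiplicative schedule: after n increases the rate is `lr_start * lr_multiplier ** n`. The sketch below solves for n. The function name and the values used afterwards (a multiplier of 1.1 and an upper bound of 1.0) are assumptions for illustration, not settings taken from the tutorial.

```python
import math


def iterations_needed(lr_start, lr_end, lr_multiplier):
    """Smallest n with lr_start * lr_multiplier ** n >= lr_end."""
    return math.ceil(math.log(lr_end / lr_start) / math.log(lr_multiplier))
```

With these assumed values, `iterations_needed(1e-6, 1.0, 1.1)` comes to 145, so sweeping six orders of magnitude costs far less than a typical epoch.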

indhub · 2018-06-22T17:57:36Z

+
+## Epoch to Iteration
+
+Usually our unit of work is an epoch (a full pass through the dataset) and the learning rate would typically be held constant throughout the epoch. With the Learning Rate Finder (and cyclical learning rate schedules) we are required to vary the learning rate every iteration. As such we structure our training code so that a single iteration can be run with a given learning rate. You can implement Learner as you wish. Just initialize the network, define the loss and trainer in `__init__` and keep your training logic for a single batch in `iteration`.


"Usually" -> "Usually,"

indhub · 2018-06-22T17:58:28Z

+
+## Implementation
+
+With preparation complete, we're ready to write our Learning Rate Finder that wraps the `Learner` we defined above. We implement a `find` method for the procedure, and `plot` for the visualization. Starting with a very low learning rate as defined by `lr_start` we train one iteration at a time and keep multiplying the learning rate by `lr_multiplier`. We analyse the loss and continue until it diverges according to `LRFinderStoppingCriteria` (which is defined later on). You may also notice that we save the parameters and state of the optimizer before the process and restore afterwards. This is so the Learning Rate Finder process doesn't impact the state of the model, and can be used at any point during training.


"analyse" -> "analyze"?

Same as above.

indhub · 2018-06-22T19:26:25Z

+        if not self.learner.trainer._kv_initialized:
+            self.learner.trainer._init_kvstore()
+        # Store params and optimizer state for restore after lr_finder procedure
+        self.learner.net.save_params("lr_finder.params")


"save_params" is deprecated. Should we use save_parameters instead?

indhub · 2018-06-22T19:26:53Z

+
+
+```python
+learner.net.save_params("net.params")


save_params is deprecated. Should we use save_parameters instead?

indhub · 2018-06-22T19:27:15Z

+```python
+net = mx.gluon.model_zoo.vision.resnet18_v2(classes=10)
+learner = Learner(net=net, data_loader=data_loader, ctx=ctx)
+learner.net.load_params("net.params", ctx=ctx)


load_params is deprecated. Should we use load_parameters instead?

indhub · 2018-06-22T19:27:38Z

+```python
+net = mx.gluon.model_zoo.vision.resnet18_v2(classes=10)
+learner = Learner(net=net, data_loader=data_loader, ctx=ctx)
+learner.net.load_params("net.params", ctx=ctx)


load_params is deprecated. Should we use load_parameters instead?

indhub · 2018-06-22T19:28:02Z

+                break
+            lr = lr * lr_multiplier
+        # Restore params (as finder changed them)
+        self.learner.net.load_params("lr_finder.params", ctx=self.learner.ctx)


load_params is deprecated. Should we use load_parameters instead?

thomelane · 2018-06-22T21:50:20Z

thanks for the reviews @indhub @safrooze and @Ishitori!

thomelane · 2018-06-22T22:30:14Z

@indhub reverting to save_params, since save_parameters only works with v1.3, which doesn't yet have a pip version.
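
One way to bridge the two method names until v1.3 is installable is to pick whichever the object provides. This is a hedged sketch: `save_net_parameters` is a hypothetical helper, and it relies only on the method names discussed in this thread, not on any other MXNet behaviour.

```python
def save_net_parameters(net, filename):
    """Prefer save_parameters when available, else fall back to save_params."""
    saver = getattr(net, "save_parameters", None) or getattr(net, "save_params")
    saver(filename)
    return saver.__name__   # report which method was actually used
```

Whether files written by one method load cleanly with the other's counterpart is not covered here and is worth verifying; the safe assumption is to load with the matching method.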

thomelane · 2018-06-26T16:51:46Z

JIRA: https://issues.apache.org/jira/browse/MXNET-594

aaronmarkham · 2018-06-28T22:31:30Z

+
+Setting the learning rate for stochastic gradient descent (SGD) is crucially important when training neural network because it controls both the speed of convergence and the ultimate performance of the network. Set the learning too low and you could be twiddling your thumbs for quite some time as the parameters update very slowly. Set it too high and the updates will skip over optimal solutions, or worse the optimizer might not converge at all!
+
+Leslie Smith from the U.S. Naval Research Laboratory presented a method for finding a good learning rate in a paper called ["Cyclical Learning Rates for Training Neural Networks"](https://arxiv.org/abs/1506.01186). We take a look at the central idea of the paper, cyclical learning rate schedules, in the tutorial called 'Advanced Learning Rate Schedules', but in this tutorial we implement a 'Learning Rate Finder' in MXNet with the Gluon API that you can use while training your own networks.


Can you link to the tutorial mentioned? 'Advanced Learning Rate Schedules'

I'd break the sentence up at ",but"
Flip it around for clarity....
(assuming there's some order? is there a preferred order?)
In the Advanced Learning Rate Schedules tutorial you learned about cyclical learning rate schedules which (do x and y). In this tutorial you will learn how to implement a "Learning Rate Finder" which (does z or x differently). You will use MXNet with the Gluon API to train your network.

Added link and changed sentence.

aaronmarkham · 2018-06-28T22:37:12Z

+
+Given an initialized network, a defined loss and a training dataset we take the following steps:
+
+1. train one batch at a time (a.k.a. an iteration)


nit:
Train
Start
Record

aaronmarkham · 2018-06-28T22:40:23Z

+
+![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/lr_finder/finder_plot.png) <!--notebook-skip-line-->
+
+As expected, for very small learning rates we don't see much change in the loss as the parameter updates are negligible. At a learning rate of 0.001, we start to see the loss fall. Setting the initial learning rate here is reasonable, but we still have the potential to learn faster. We observe a drop in the loss up until 0.1 where the loss appears to diverge. We want to set the initial learning rate as high as possible before the loss becomes unstable, so we choose a learning rate of 0.05.


Could you add event pointers on the chart for increased clarity?
--> at 0.001 loss falls
--> at 0.1 divergence (loss increases)
--> at 0.05 loss seems lowest (right?)

Good suggestion, created annotated chart.

aaronmarkham · 2018-06-28T22:47:19Z

+
+## Wrap Up
+
+Give Learning Rate Finder a try on your current projects, and experiment with the different learning rate schedules found in this tutorial too.


I went back to look for the "different learning rate schedules" and they are not explicitly defined. Or maybe I missed that point, but I wasn't really able to find it.

ThomasDelteil · 2018-06-29T17:22:01Z

+class Learner():
+    def __init__(self, net, data_loader, ctx):
+        """
+        net: network (mx.gluon.Block)


please use standard pydoc format, see the Gluon code for examples
Could the loss be passed as an argument?

Changed to use reStructuredText Docstring Format (PEP 287)
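
For illustration, the `Learner` interface with reStructuredText-style docstrings might look like the pure-Python stand-in below. `ToyLearner` is hypothetical: it fits a single weight with plain SGD instead of wrapping a Gluon network, loss and trainer, and only the `iteration` method name and the `take_step` flag mirror the hunks quoted in this review.

```python
class ToyLearner:
    """Pure-Python stand-in that fits y = w * x, one batch per call."""

    def __init__(self, batches, w=0.0):
        """
        :param batches: training batches, each a list of (x, y) pairs
        :type batches: list
        :param w: initial value of the single trainable weight
        :type w: float
        """
        self.batches = batches
        self.w = w
        self._position = 0   # index of the next batch to serve

    def iteration(self, lr, take_step=True):
        """
        Run one batch with the given learning rate.

        :param lr: learning rate for this iteration
        :type lr: float
        :param take_step: update the weight if True, otherwise only compute the loss
        :type take_step: bool
        :return: mean squared error on the batch, computed before the update
        :rtype: float
        """
        batch = self.batches[self._position % len(self.batches)]
        self._position += 1
        # Mean squared error and its gradient with respect to w.
        loss = sum((self.w * x - y) ** 2 for x, y in batch) / len(batch)
        grad = sum(2 * (self.w * x - y) * x for x, y in batch) / len(batch)
        if take_step:
            self.w -= lr * grad
        return loss
```

Because the learning rate is an argument of `iteration` rather than fixed at construction, a finder or a cyclical schedule can change it on every call.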

ThomasDelteil · 2018-06-29T17:23:10Z

+        # Update parameters
+        if take_step: self.trainer.step(data.shape[0])  
+        # Set and return loss.
+        # Although notice this is still an MXNet NDArray to avoid blocking


I think blocking would be good here since we are not trying to optimize performance and we shouldn't want to risk cuda malloc

Was anticipating a comment, so made non blocking :)
Changed to be non-blocking.

ThomasDelteil · 2018-06-29T17:24:35Z

+        # Restore params (as finder changed them)
+        self.learner.net.load_params("lr_finder.params", ctx=self.learner.ctx)
+        self.learner.trainer.load_states("lr_finder.state")
+        self.plot()


Can you return results here rather than calling plot? Users might not be in a graphical environment.

Removed plot, returned results. Called plot separately.

ThomasDelteil · 2018-06-29T17:27:50Z

+# Set seed for reproducibility
+mx.random.seed(42)
+
+class Learner():


I am not sure about this naming but at the same time can't really find a better one. IterationRunner? I don't know

Me neither, but going to leave to avoid confusion with the videos that have been recorded.

thomelane

thanks for the reviews @aaronmarkham and @ThomasDelteil! made necessary changes.

thomelane · 2018-07-04T16:51:37Z

@indhub made changes as per the feedback, so if all looks good would you be able to merge?

* Added Learning Rate Finder tutorial.
* Updated based on feedback.
* Reverting save_parameters changes.
* Adjusted based on feedback.
* Corrected outdated code comment.

Added Learning Rate Finder tutorial.

664c30e

thomelane requested a review from szha as a code owner June 15, 2018 16:28

Merge branch 'master' into tutorial_lr_find

ad43bae

Ishitori reviewed Jun 22, 2018

safrooze reviewed Jun 22, 2018

indhub reviewed Jun 22, 2018

Updated based on feedback.

f15c0ec

Reverting save_parameters changes.

c77ff52

thomelane changed the title ~~Added Learning Rate Finder tutorial~~ [MXNET-594] Added Learning Rate Finder tutorial Jun 26, 2018

Merge branch 'master' into tutorial_lr_find

4b5e1b8

aaronmarkham reviewed Jun 28, 2018

ThomasDelteil suggested changes Jun 29, 2018

Adjusted based on feedback.

280eaaa

thomelane commented Jul 3, 2018

Corrected outdated code comment.

29b922a

indhub merged commit e870890 into apache:master Jul 4, 2018

thomelane deleted the tutorial_lr_find branch January 11, 2019 19:44

