Add SGDR(Stochastic Gradient Descent with Warm Restarts) scheduler #17226
Kirayue wants to merge 18 commits
Hi, @ezyang, thank you for helping me on this PR. Could you let me know where I went wrong with the merge? Thank you, I really appreciate your help.
        for base_lr in self.base_lrs]

    def step(self, epoch=None):
        """Step could be called after every update, i.e. if one epoch has 10 iterations (num_train / batch_size),
Thanks for checking.
mrshenli left a comment
Please fix the lint error as well.
There is quite a lot of duplicated code in the four test cases. Can you consolidate them by, say, creating helper methods or using a loop?
I combined test_sgdr_lr1 and test_sgdr_lr2 into test_sgdr_lr1, and did the same for test_sgdr_lr3 and test_sgdr_lr4. The former tests integer epochs and the latter tests float epochs, so I kept two test functions. If you have any suggestions, please let me know.
Combining 4 functions into 2 sounds good to me. Thanks for addressing this!
you mean "could called" -> "should call"?
Should it be T_i to match the description in the docs above?
Why is T_cur not reset to 0?
The self.last_epoch is only used in the if branch. Is there any reason to put it here? Is it because users can call step() with and without the epoch arg in an interleaved way, so that you want to remember last_epoch when possible?
Yes, I had the interleaved usage in mind, although it is somewhat impractical. In your opinion, would it be better to remove it?
I am OK with this API, but we need to explain it explicitly in the docs. Could you please describe this behavior in detail in the docstrings? Thanks.
Is T_i the number of epochs in every run (i.e., # of epochs between two warm restarts)? Please add docs to explain it.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Do you need to check the range of these args? For example, what if T_0 or T_mult is negative, or eta_min is larger than the initial lr (eta_max)?
The variable i is not used; maybe replace it with _? (Same for the other for loops in this file.)
@pytorchbot retest this please
@pytorchbot retest this please
Let's test a floating-point T_mult :)
Good, but it made me think about T_mult >= 1 (e.g. 2.5) versus T_mult < 1 (e.g. 0.5). I think the latter is not practical, so I restricted the range of T_mult: if it is < 1, a ValueError is raised.
What do you think?
By the way, because T_mult can now be a floating-point number, the test cases were changed slightly to cover cases like T_i = 62.5.
I agree with you on T_mult >= 1, as the original paper says:
"we suggest an option to start with an initially small T_i and increase it by a factor of T_mult at every restart"
Let's be consistent and use T_i
is this resuming or assuming?
I copied line 23 from the base class _LRScheduler(object). In my opinion, it is "resuming". But I think this code is redundant, since the optimizer is already checked by calling super(SGDR, self).__init__(optimizer, last_epoch).
Shall we add a test for this?
@pytorchbot rebase this please
Some tests failed:
mrshenli left a comment
Please fix failures introduced by the new tests.
Hi, @mrshenli
This could be caused by this line. If epoch % self.T_0 != 0 and both are ints, it will drop the remainder.
Oh, maybe not. If that were the case, it would fail everywhere. Please ignore me, I got it wrong.
I traced this part of the code again, and I think n = 2, T_i = 22.5 and T_cur = 0. I have no idea why we got a different result.
mrshenli left a comment
The error message "expected 0.05, got 1e-10" suggests the failure occurs on a restart boundary. See comments below.
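To make the boundary behaviour concrete, here is a minimal sketch of the cosine annealing rule from the SGDR paper, evaluated at the two candidate positions on a restart boundary. The values 0.05 (base learning rate) and 1e-10 (eta_min) are assumptions read off the error message; `annealed_lr` and `T_I = 10` are hypothetical names and values chosen for illustration, not code from this PR.

```python
import math


def annealed_lr(eta_min, eta_max, t_cur, t_i):
    """Cosine annealing within one cycle of length t_i (hypothetical helper)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# Assumed values: base lr and eta_min inferred from the error message,
# cycle length picked arbitrarily for the illustration.
ETA_MAX, ETA_MIN, T_I = 0.05, 1e-10, 10

# If T_cur is reset to 0 at the boundary, the lr jumps back to the base lr.
lr_after_reset = annealed_lr(ETA_MIN, ETA_MAX, 0, T_I)

# If T_cur is left equal to T_i, the lr stays at eta_min.
lr_without_reset = annealed_lr(ETA_MIN, ETA_MAX, T_I, T_I)

print(lr_after_reset, lr_without_reset)
```

Under these assumptions, a result of 1e-10 where 0.05 was expected is what one would see if T_cur still equals T_i at the moment the test expects a restart.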
you are right, it is redundant.
Sorry, I double-checked the results. It is necessary because of the interleaved usage: e.g., after calling step() 100 times and then calling step(10), we need to reset T_i. But for epoch < T_0, it is redundant.
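The interleaved case can be sketched as follows. This is an illustration only, assuming integer T_0 >= 1 and integer T_mult >= 1; `locate_epoch` and `step_repeatedly` are hypothetical helpers (not functions from this PR), and the loop is used here instead of a closed-form expression to keep the arithmetic exact.

```python
def locate_epoch(epoch, t_0, t_mult=1):
    """Map an absolute epoch to (t_cur, t_i): position in the cycle and cycle length."""
    t_cur, t_i = epoch, t_0
    while t_cur >= t_i:
        t_cur -= t_i      # skip over one completed cycle
        t_i *= t_mult     # the next cycle is t_mult times longer
    return t_cur, t_i


def step_repeatedly(n_steps, t_0, t_mult=1):
    """Simulate n_steps calls to step() without an explicit epoch argument."""
    t_cur, t_i = 0, t_0
    for _ in range(n_steps):
        t_cur += 1
        if t_cur >= t_i:  # warm restart
            t_cur -= t_i
            t_i *= t_mult
    return t_cur, t_i


# Illustrative values: T_0 = 10, T_mult = 2.
state_after_100_steps = step_repeatedly(100, 10, 2)
state_at_epoch_10 = locate_epoch(10, 10, 2)
print(state_after_100_steps, state_at_epoch_10)
```

With these illustrative values, 100 plain steps leave T_i at 80, while jumping back to epoch 10 requires T_i = 20, so T_i has to be recomputed whenever an explicit epoch is passed.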
It could be that my suggestion introduced numerical instability. It makes me rethink whether it makes sense to have a non-integer T_i. Any thoughts, @Kirayue?
In my opinion, and according to the original paper, it is not restricted to integers. But if we can confirm that the failure is caused by numerical instability, I can modify the code.
By the way, how can I reproduce the failure? (I tested it in Python 3 and it works.)
@Kirayue sorry for the delay on this - would you like to send us your email address so that we can share the credentials for read-only access to the Docker images in ECR? Thanks!
Do not worry about it 😄
Sure, just use this email address [email protected]
This scheduler is not tied to SGD but can also be effectively used for other optimizers such as Adam, so although it's in the name of the paper, I find the name SGDR misleading. Why not call it CosineAnnealingWR or CosineAnnealingWarmRestarts?
You are right, that's fine with me.
What do you think, @mrshenli?
Sounds good to me too.
Curious: there is a CosineAnnealingLR, which implements SGDR without warm restarts. Do you know what LR means here? (Learning rate? Less restart?) I just want to make sure that, if we go with CosineAnnealingWR, the naming does not confuse people.
CosineAnnealingWarmRestarts sounds better to me.
Hi, @mrshenli
It's a kind of learning rate policy according to "Cyclical Learning Rates for Training Neural Networks". So, in my opinion, it means learning rate.
"This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries and the learning rate cyclically varies between these bounds. Experiments with numerous functional forms, such as a triangular window (linear), a Welch window (parabolic) and a Hann window (sinusoidal) all produced equivalent results"
That sounds good to me. Please edit the doc and arg check accordingly. I will approve and merge. Thanks!
Hi, @mrshenli
There are two ways to convert T_i to an integer. Because T_i equals T_{i - 1} * T_mult, we can use either int(T_mult) or int(T_{i - 1} * T_mult) to force T_i to be an integer. However, the latter would cause an inconsistency. For example, let T_0 = 10 and T_mult = 1.5: if we call scheduler.step() 100 times, versus calling scheduler.step(100) directly, the resulting T_i would be different.
In this example, stepping one epoch at a time gives T_i values of [10, 15, 22, 33, 49], whereas the closed-form formula gives 50.625 (50 after int()). So for simplicity, I chose the former way.
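The numbers in this example can be reproduced with a short sketch. The two helper names below are hypothetical and exist only to contrast the two ways of computing T_i described above; they are not code from this PR.

```python
def truncated_cycle_lengths(t_0, t_mult, n_cycles):
    """Cycle lengths when int() is applied after every multiplication."""
    lengths = [t_0]
    for _ in range(n_cycles - 1):
        lengths.append(int(lengths[-1] * t_mult))
    return lengths


def closed_form_cycle_length(t_0, t_mult, i):
    """Length of cycle i computed directly as T_0 * T_mult**i, truncated once at the end."""
    return t_0 * t_mult ** i


stepwise = truncated_cycle_lengths(10, 1.5, 5)
direct = closed_form_cycle_length(10, 1.5, 4)
print(stepwise, direct, int(direct))
```

The fifth cycle has length 49 when truncation happens at every restart but 50 when the closed form is truncated once, which is the inconsistency between repeated step() calls and a single step(100).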
:math:\T_{i} -> :math:T_{i}?
Shall we do an explicit type check, and raise an error if it is not an int? This also applies to T_0. We might want to avoid silently surprising people even if they did pay attention to the doc.
Ok, thank you for the suggestions.
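A minimal sketch of the kind of explicit checks discussed in this review is shown below. It is not the code that was merged: `check_restart_args` is a hypothetical helper, the optional `base_lr` comparison follows the earlier question about eta_min exceeding the initial lr, and rejecting bool values is an extra choice made here for illustration.

```python
def check_restart_args(t_0, t_mult=1, eta_min=0.0, base_lr=None):
    """Raise ValueError for argument values the review thread flags as invalid."""
    for name, value in (("T_0", t_0), ("T_mult", t_mult)):
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("Expected integer {}, but got {!r}".format(name, value))
    if t_0 <= 0:
        raise ValueError("Expected positive T_0, but got {}".format(t_0))
    if t_mult < 1:
        raise ValueError("Expected T_mult >= 1, but got {}".format(t_mult))
    # Optional check, only applied when the caller supplies the initial lr.
    if base_lr is not None and eta_min > base_lr:
        raise ValueError("eta_min {} is larger than the initial lr {}".format(eta_min, base_lr))


check_restart_args(10, 2, eta_min=1e-10, base_lr=0.05)  # valid, returns None
```

Raising early in this way means a float such as T_mult = 1.5 is rejected at construction time instead of being silently truncated.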
The errors look unrelated to your changes. Could you please try rebasing and testing again?
@mrshenli
@Kirayue yes, please rebase onto the current master.
mrshenli left a comment
Thanks for contributing!!
@mrshenli
Sorry, should I close this PR?
@Kirayue I am landing this PR. It will be closed automatically in a moment. :)
…ytorch#17226) Summary: Because of merge error with master in pytorch#15042, open a new PR for ezyang. Pull Request resolved: pytorch#17226 Differential Revision: D14418145 Pulled By: mrshenli fbshipit-source-id: 099ba225b28e6aba71760b81b2153ad1c40fbaae
Because of a merge error with master in #15042, I opened a new PR for @ezyang.