[MXNET-535] Fix bugs in LR Schedulers and add warmup#11234
Conversation
|
|
||
| """ | ||
|
|
||
| def __init__(self, max_update, base_lr=0.01, pwr=2): |
There was a problem hiding this comment.
don't remove base_lr, it will break API.
Pass it to super init instead
| if warmup_steps <= 0: | ||
| raise ValueError("Warmup steps has to be positive") | ||
| self.warmup_steps = warmup_steps | ||
| self.lrs_updates = {} |
There was a problem hiding this comment.
what's the point of this cache? Looks like it will always miss
There was a problem hiding this comment.
We would have for each batch, number of calls to call equal to the number of learnable parameter arrays.
| self.lrs_updates[num_update] = self.lr_begin + increase | ||
| else: | ||
| if isinstance(self.scheduler, PolyScheduler): | ||
| self.lrs_updates[num_update] = self.scheduler(num_update - self.warmup_steps) |
There was a problem hiding this comment.
Why special case for PolyScheduler?
Is num_update - self.warmup_steps standard? Does Tf or Pytorch do it this way?
Why not num_update directly?
There was a problem hiding this comment.
PolyScheduler, and CosineScheduler(not implemented here) reduce lr from a "starting lr" to an "ending lr" smoothly, for example from 0.1 to 0.
With warmup, we first increase the lr from a small value (e.g. 0) to the starting lr, then apply the main scheduler. Assuming we have warmup for the first 5 epochs, and the total training epochs is 90, then the effective number of epochs for the poly scheduler is 85.
As for piecewise-constant/factor scheduler, it decays the lr only at certain points, on which the warmup stage has no effect.
There was a problem hiding this comment.
Yeah for the above reason, but I'm updating the code to remove this special case, and pass a wamup_steps param to such schedulers so we can handle it cleanly
|
Improved how warmup is handled, please review |
|
@piiswrong Could you please review? This has a fix for an important bug where the MultiFactorScheduler didn't take a base_lr previously. This meant that the example/image_classification/ scripts didn't use the given LR correctly. It would drop from x to 0.001 and 0.0001 regardless of the LR given. |
|
@piiswrong @szha @eric-haibin-lin Please review |
|
@szha @eric-haibin-lin please review |
| The initial learning rate. | ||
| """ | ||
| def __init__(self, base_lr=0.01): | ||
| def __init__(self, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'): |
There was a problem hiding this comment.
Do you mind adding documentation for warmup_begin_lr?
There was a problem hiding this comment.
There was some for the inherited classes, but not for this base abstract class. Anyway, now added for all. Please check.
|
|
||
| def __init__(self, max_update, base_lr=0.01, pwr=2): | ||
| super(PolyScheduler, self).__init__(base_lr) | ||
| def __init__(self, max_update, base_lr=0.01, final_lr=0, |
There was a problem hiding this comment.
why did you remove pwr? This is API breakage
There was a problem hiding this comment.
I've not removed it. Git is getting confused :/ It thinks I've changed PolyScheduler to CosineScheduler when in fact I've modified PolyScheduler and added a new CosineScheduler.
|
Hopefully this will give committers confidence to merge InterfacesCosineSchedulerPolySchedulerThis PREarlierMultiFactorSchedulerThis PREarlierFactorSchedulerThis PREarlierSchedulerThis PREarlierPlots of LR decay from unit tests |
* Add warmup and fix inconsistencies with learning rate schedulers * add comments * remove assert


Description
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.