mcarilli · 2019-03-31T02:51:03Z

Handle arbitrary combinations of optimizers/models/losses. Do not maintain a loss scale per optimizer. Instead, maintain either a single global loss scale (the default) or one loss scale per loss (if the user supplies the num_losses argument to amp.initialize and the loss_id argument to each invocation of amp.scale_loss).
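
To make the difference between a single global scale and one scale per loss concrete, here is a minimal pure-Python sketch. It is not Apex's implementation: `DynamicLossScaler`, `make_scalers`, the initial scale, and the growth interval are hypothetical names and values chosen for illustration. Only `num_losses` and the idea of indexing by a loss id come from this PR.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval  # assumed value, for illustration only
        self._clean_steps = 0

    def update(self, overflow):
        if overflow:
            # An overflow invalidates the step, so back off and restart the count.
            self.scale /= 2.0
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.growth_interval:
                self.scale *= 2.0
                self._clean_steps = 0


def make_scalers(num_losses=1):
    # One scaler per loss; num_losses=1 corresponds to a single global scale.
    return [DynamicLossScaler() for _ in range(num_losses)]


scalers = make_scalers(num_losses=3)
# An overflow in loss 2 (say, a gradient penalty) lowers only that loss's scale.
scalers[2].update(overflow=True)
print([s.scale for s in scalers])
```

With one scaler per loss, a numerically unstable term can settle at a small scale without dragging the scale of the better-behaved losses down with it.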

Current status: L0 tests have been added and are passing (maybe need more). L1 single-GPU tests pass. Verified that it fixes Szymon's issue with master params getting out of sync across processes for JoC GNMT. L1 multi-GPU tests underway.


hukkelas · 2019-03-31T18:53:31Z

I tried this branch earlier today to include different loss scales for each loss in my progressive growing GAN, but I had issues getting it working correctly. What I ended up with was something similar to this:

self.d_optimizer.zero_grad()
with amp.scale_loss(wasserstein_distance, self.d_optimizer, loss_id=0) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(epsilon_penalty, self.d_optimizer, loss_id=1) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(gradient_penalty, self.d_optimizer, loss_id=2) as scaled_loss:
    scaled_loss.backward()
self.d_optimizer.step()

This seems to work on a single GPU, but when I use apex DistributedDataParallel (2 GPUs) it is very unstable, and I have no idea why :/

I also took a look at the advanced AMP usage for multiple backward passes for a single optimizer, where it says that you should use delay_unscale=True. I tried this (by having delay_unscale=True for the first two backward passes), but it seems like it was using the last loss's loss_scaler to unscale all the grads. So the model was diverging immediately.
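
The divergence described above is what the arithmetic predicts when gradients scaled by different factors are accumulated and then unscaled once with a single factor. The following is a plain-number sketch of that symptom, not a trace of Amp's internals; the gradient values and scales are made up for illustration.

```python
# Two losses contribute gradients g0 and g1 to the same parameter.
g0, g1 = 0.5, 0.25
s0, s1 = 1024.0, 8.0  # hypothetical per-loss scales

# Each backward pass accumulates its scaled gradient into the same buffer.
accumulated = s0 * g0 + s1 * g1

# Unscaling once at the end with only the last loss's scale is wrong
# whenever the scales differ: the first term keeps a factor of s0 / s1.
wrong = accumulated / s1

# Unscaling each contribution with its own scale recovers the true sum.
right = (s0 * g0) / s0 + (s1 * g1) / s1

print(wrong, right)
```

Here the first gradient ends up inflated by `s0 / s1`, which would be enough to blow up an optimizer step.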

mcarilli · 2019-03-31T19:56:53Z

Bold of you to test it already... That usage looks correct. I'm glad to hear it's working on single-GPU and that the new syntax enables your use case. I'm still trying it with multi-GPU.

I haven't updated documentation for the new branch, but guidance will be that using delay_unscale is no longer necessary and should only be used as a minor performance optimization if you're really sure what you're doing.

mcarilli · 2019-04-01T16:17:27Z

@hukkelas I notice you're using gradient penalty. I don't think Apex DDP supports gradient penalty properly, so your instability may not be the fault of the new branch. I still haven't finished testing the new branch with distributed, but it seems to work ok so far. Can you try torch.nn.parallel.DistributedDataParallel instead? Replace

model, optimizer = amp.initialize(model, optimizer, opt_level=XX, num_losses=YY)
model = apex.parallel.DistributedDataParallel(model)

with

model, optimizer = amp.initialize(model, optimizer, opt_level=XX, num_losses=YY)
model = torch.nn.parallel.DistributedDataParallel(model,
    device_ids=[args.local_rank],
    output_device=args.local_rank)

hukkelas · 2019-04-01T18:07:17Z

Hmm, didn't know that, thanks!

I was using Apex DDP, switched to torch.nn.parallel.DistributedDataParallel now. From some quick tests it looks better.

I only recently started using multi-GPU training for my GAN. I think most of my previous instability on a single GPU came from the gradient penalty calculation. I previously made a hacky solution to introduce independent loss scaling for it, but this branch would remove a lot of ugly code :)

mcarilli · 2019-04-01T20:44:10Z

Glad it appears to be working now. My local distributed tests/tests on local models have also succeeded so far so I'll probably merge this branch today.

When you say you are scaling the gradient penalty calculation, is this the line you are referring to:

with amp.scale_loss(gradient_penalty, self.d_optimizer, loss_id=2) as scaled_loss:
    scaled_loss.backward()

? Scaling the loss for this call^ is fine.

Or are you referring to an earlier call to torch.autograd.grad(...., create_graph=True, only_inputs=True) to create out-of-place gradients used to construct the gradient_penalty scalar? Scaling the loss for that particular call would not be supported by the new branch either (Amp has no visibility of the newly created out-of-place gradients, so it has no way to check them for infs).

hukkelas · 2019-04-02T07:56:42Z

Before this branch, I independently scaled my gradient penalty like this:

logits = discriminator(x_hat, condition, landmarks)
logits = logits.sum() * loss_scaler.loss_scale()
grad = torch.autograd.grad(
    outputs=logits,
    inputs=x_hat,
    grad_outputs=torch.ones(logits.shape).to(fake_data.dtype).to(fake_data.device),
    create_graph=True
)[0] 
grad = grad.view(x_hat.shape[0], -1)
if check_overflow(grad):
    print("Overflow in gradient penalty calculation.")
    loss_scaler._loss_scale /= 2
    print("Scaling down loss to:", loss_scaler._loss_scale)
    return None
grad = grad / loss_scaler.loss_scale()

grad_penalty = ((grad.norm(p=2, dim=1) - 1)**2)

check_overflow is doing the same checks as scale_check_overflow_python in scaler.py. If there is an overflow, I just skip the current batch for all GPUs.

From my understanding, since AMP has no visibility of the torch.autograd.grad call, I should keep this independent scaling of the gradient penalty, and also include the independent loss scaling as I showed earlier?
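
For readers without the helper at hand, the skip-on-overflow pattern above can be sketched in pure Python. This is an illustration under stated assumptions, not the code from scaler.py: `check_overflow` here is a hypothetical stand-in that inspects a flat list of floats, and `unscale_or_skip` is a made-up name. A real version would operate on the gradient tensor.

```python
import math


def check_overflow(values):
    """Return True if any element is inf or NaN (hypothetical helper)."""
    return any(not math.isfinite(v) for v in values)


def unscale_or_skip(scaled_grad, loss_scale):
    """Return (unscaled_grad, new_scale); unscaled_grad is None on overflow."""
    if check_overflow(scaled_grad):
        # Skip this batch and retry later with a smaller scale.
        return None, loss_scale / 2.0
    return [g / loss_scale for g in scaled_grad], loss_scale


print(unscale_or_skip([2.0, 4.0], 2.0))
print(unscale_or_skip([float("inf"), 1.0], 2.0))
```

Returning the new scale rather than mutating a private attribute keeps the back-off visible to the caller, which matters when every process has to skip the same batch.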

hukkelas · 2019-04-02T14:09:15Z

+        else:
+            optimizers[i] = _process_optimizer(optimizer, properties)
+
+    _amp_state.loss_scalers = []


Should there be an assert / check that _amp_state.loss_scalers is None? If you call amp.initialize a second time, even without providing any optimizer, it will overwrite _amp_state.loss_scalers, and you get an indexing error in the backward pass when providing a loss_id.

I found a workaround for it in my code, but it's still confusing to track down this error.

mcarilli · 2019-04-04

My intention was that amp.initialize should be called only once. Why do you need to call it multiple times? Is it because you are progressively adding new layers to your GAN? It might be better to start with one big module and progressively unfreeze layers, rather than adding new modules and calling amp.initialize many times. I'm very interested to hear why calling amp.initialize many times is essential for your use case.

Also, the scaling of the out-of-place backward to create the grads for the gradient penalty is always going to be tricky...can you start an issue where we can track that and figure out best practices?


definitelynotmcarilli added 3 commits March 28, 2019 13:22

Refactor to allow more flexible treatment of multiple optimizers/models/losses

7ce4cb1

Adding _process_optimizers.py

47338c9

Created L0 tests (now passing).

deff0fa

mcarilli mentioned this pull request Mar 31, 2019

how to prevent overflow #179

Open

fix: minor print typo (#234)

c654351

make L1 results easier to read

2e7e6f1

hukkelas reviewed Apr 2, 2019


definitelynotmcarilli added 5 commits April 3, 2019 19:51

L0 multiple model/optimizer/loss test fleshed out

2ba3940

Adding test that master params remain synced across distributed processes

094bb23

Docstring updates

0563f60

Merge branch 'no_wrap_optimizers' of https://github.com/NVIDIA/apex into no_wrap_optimizers

e8a7185

Docstring updates

74a1c92

mcarilli merged commit 3f87614 into master Apr 4, 2019

mcarilli pushed a commit that referenced this pull request Apr 4, 2019

Generated gh-pages for commit 3f87614

b53bf49

Author: mcarilli <[email protected]> WIP: Handle arbitrary combinations of optimizers/models/losses (#232)

This was referenced Apr 4, 2019

amp + checkpoint loading = problems #180

Open

Debugging sudden gradient overflows? #192

Open


WIP: Handle arbitrary combinations of optimizers/models/losses #232

mcarilli merged 10 commits into master from no_wrap_optimizers


4 participants
