WIP: Handle arbitrary combinations of optimizers/models/losses#232

Merged: mcarilli merged 10 commits into master from no_wrap_optimizers on Apr 4, 2019

@mcarilli

@mcarilli mcarilli commented Mar 31, 2019

Handle arbitrary combinations of optimizers/models/losses. Do not maintain a loss scale per optimizer. Instead, maintain either a single global loss scale (the default) or one loss scale per loss (if the user supplies the num_losses argument to amp.initialize and the loss_id argument to each invocation of amp.scale_loss).
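As a rough illustration of the per-loss bookkeeping described above, here is a minimal pure-Python sketch. It is not Apex's implementation: `ToyLossScaler`, `make_scalers`, the initial scale, the growth interval, and the halve-on-overflow / double-after-clean-steps policy are all assumptions chosen for the example. It only shows that an overflow attributed to one `loss_id` changes that loss's scale and leaves the others alone.

```python
class ToyLossScaler:
    """Hypothetical stand-in for a dynamic loss scaler (not Apex's class)."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, overflow):
        """Halve the scale on overflow; double it after a run of clean steps."""
        if overflow:
            self.scale /= 2.0
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.growth_interval:
                self.scale *= 2.0
                self._clean_steps = 0


def make_scalers(num_losses=1):
    """One scaler per loss; num_losses=1 models the single global scale."""
    return [ToyLossScaler() for _ in range(num_losses)]


scalers = make_scalers(num_losses=3)

# Suppose only the loss with loss_id=2 produced inf/NaN gradients this iteration.
scalers[0].update(overflow=False)
scalers[1].update(overflow=False)
scalers[2].update(overflow=True)

scales = [s.scale for s in scalers]
print(scales)
```

With a single global scale (`num_losses=1`), the same overflow would instead lower the scale applied to every loss.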

Current status: L0 tests have been added and are passing (maybe need more). L1 single-GPU tests pass. Verified that it fixes Szymon's issue with master params getting out of sync across processes for JoC GNMT. L1 multi-GPU tests underway.

@hukkelas

I tried this branch earlier today to include different loss scales for each loss in my progressive growing GAN, but I had issues getting it working correctly. What I ended up with was something similar to this:

self.d_optimizer.zero_grad()
with amp.scale_loss(wasserstein_distance, self.d_optimizer, loss_id=0) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(epsilon_penalty, self.d_optimizer, loss_id=1) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(gradient_penalty, self.d_optimizer, loss_id=2) as scaled_loss:
    scaled_loss.backward()
self.d_optimizer.step()

This seems to work with a single GPU, but when I use apex DistributedDataParallel (2 GPUs) it is very unstable, and I have no idea why :/

I also took a look at the advanced AMP usage for multiple backward passes for a single optimizer, where it says that you should use delay_unscale=True. I tried this (by having delay_unscale=True for the first two backward passes), but it seems like it was using the last loss's loss_scaler to unscale all the grads. So the model was diverging immediately.
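To make the failure mode described above concrete, here is a small arithmetic sketch with made-up numbers (the gradient values and scales are purely illustrative, and this does not reproduce Amp's internals). If gradients from several losses are accumulated while each is multiplied by a different scale, dividing the sum once by the last loss's scale does not recover the true gradient; unscaling each contribution by its own scale does.

```python
# Illustrative values only: the true (unscaled) gradient contribution of each loss.
true_grads = [0.5, 0.25, 0.125]
# Hypothetical per-loss scales, e.g. after the individual scalers have diverged.
scales = [1024.0, 256.0, 64.0]

# Delayed unscale: accumulate the scaled gradients, then divide once
# by the scale belonging to the last loss.
accumulated = sum(g * s for g, s in zip(true_grads, scales))
delayed = accumulated / scales[-1]

# Per-loss unscale: divide each contribution by its own scale before summing.
per_loss = sum((g * s) / s for g, s in zip(true_grads, scales))

expected = sum(true_grads)
print(delayed, per_loss, expected)
```

In this toy case the delayed result is roughly ten times too large, which would be consistent with a model diverging right away. Delayed unscaling is only safe when every accumulated contribution shares the same scale.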

@mcarilli
Author

Bold of you to test it already... That usage looks correct. I'm glad to hear it's working on single-GPU and that the new syntax enables your use case. I'm still trying it with multi-GPU.

I haven't updated documentation for the new branch, but guidance will be that using delay_unscale is no longer necessary and should only be used as a minor performance optimization if you're really sure what you're doing.

@mcarilli
Author

mcarilli commented Apr 1, 2019

@hukkelas I notice you're using gradient penalty. I don't think Apex DDP supports gradient penalty properly, so your instability may not be the fault of the new branch. I still haven't finished testing the new branch with distributed, but it seems to work ok so far. Can you try torch.nn.parallel.DistributedDataParallel instead? Replace

model, optimizer = amp.initialize(model, optimizer, opt_level=XX, num_losses=YY)
model = apex.parallel.DistributedDataParallel(model)

with

model, optimizer = amp.initialize(model, optimizer, opt_level=XX, num_losses=YY)
model = torch.nn.parallel.DistributedDataParallel(model,
    device_ids=[args.local_rank],
    output_device=args.local_rank)

@hukkelas

hukkelas commented Apr 1, 2019

Hmm, didn't know that, thanks!

I was using Apex DDP, switched to torch.nn.parallel.DistributedDataParallel now. From some quick tests it looks better.

I just recently started using multi-gpu training for my GAN. I think most of my previous instability, with a single GPU, has come from the gradient penalty calculation. I made a hacky solution to introduce independent loss scaling for it previously, but this branch would remove a lot of ugly code :)

@mcarilli
Author

mcarilli commented Apr 1, 2019

Glad it appears to be working now. My local distributed tests/tests on local models have also succeeded so far so I'll probably merge this branch today.

When you say you are scaling the gradient penalty calculation, is this the line you are referring to:

with amp.scale_loss(gradient_penalty, self.d_optimizer, loss_id=2) as scaled_loss:
    scaled_loss.backward()

? Scaling the loss for this call^ is fine.

Or are you referring to an earlier call to torch.autograd.grad(...., create_graph=True, only_inputs=True) to create out-of-place gradients used to construct the gradient_penalty scalar? Scaling the loss for that particular call would not be supported by the new branch either (Amp has no visibility of the newly created out-of-place gradients, so it has no way to check them for infs).

@hukkelas

hukkelas commented Apr 2, 2019

Before this branch, I independently scaled my gradient penalty like this:

logits = discriminator(x_hat, condition, landmarks)
logits = logits.sum() * loss_scaler.loss_scale()
grad = torch.autograd.grad(
    outputs=logits,
    inputs=x_hat,
    grad_outputs=torch.ones(logits.shape).to(fake_data.dtype).to(fake_data.device),
    create_graph=True
)[0] 
grad = grad.view(x_hat.shape[0], -1)
if check_overflow(grad):
    print("Overflow in gradient penalty calculation.")
    loss_scaler._loss_scale /= 2
    print("Scaling down loss to:", loss_scaler._loss_scale)
    return None
grad = grad / loss_scaler.loss_scale()

grad_penalty = ((grad.norm(p=2, dim=1) - 1)**2)

check_overflow does the same checks as scale_check_overflow_python in scaler.py. If there is an overflow, I just skip the current batch for all GPUs.

From my understanding, since AMP has no visibility of the torch.autograd.grad call, I should keep this independent scaling of the gradient penalty, and also include the independent loss scaling as I showed earlier?
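For context on why the out-of-place gradient is scaled and checked by hand at all, here is a stdlib-only sketch of half-precision behaviour. It uses Python's `struct` `"e"` format to round values to fp16. `to_fp16` and `has_overflow` are hypothetical helpers written for this example (not Apex functions), and the scale and gradient values are made up. The point is that a very small gradient is flushed to zero unless it is scaled first, while too large a scale pushes values past the fp16 range, which is what the overflow check has to catch.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; model that as +/-inf.
        return math.copysign(math.inf, x)


def has_overflow(values):
    """True if any value is inf or NaN (hypothetical helper for this sketch)."""
    return any(not math.isfinite(v) for v in values)


scale = 1024.0
tiny_grad = 1e-8  # smaller than the smallest fp16 subnormal (about 6e-8)

unscaled = to_fp16(tiny_grad)                    # flushed to zero in fp16
recovered = to_fp16(tiny_grad * scale) / scale   # scaled in fp16, unscaled in float

big_grad = 100.0
overflowed = to_fp16(big_grad * scale)           # 102400 exceeds the fp16 maximum (65504)

print(unscaled, recovered, overflowed)
print(has_overflow([recovered]), has_overflow([recovered, overflowed]))
```

When the check fires, lowering the scale and skipping the step, as in the snippet above, is one reasonable response; the threshold and halving policy are choices, not requirements.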

Comment thread apex/amp/_initialize.py
else:
    optimizers[i] = _process_optimizer(optimizer, properties)

_amp_state.loss_scalers = []

@hukkelas hukkelas Apr 2, 2019

Should there be an assert / check that _amp_state.loss_scalers is None? If you call amp.initialize a second time, even without providing any optimizer, it will overwrite _amp_state.loss_scalers, and you get an indexing error in the backward pass when providing a loss_id.

I found a workaround for it in my code, but it's still confusing to track down this error.
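A guard of the kind suggested here could look roughly like the following sketch. All names (`ToyAmpState`, `toy_initialize`) are hypothetical and only model the idea; this is not the code in apex/amp/_initialize.py. The second call fails loudly at initialization time instead of silently replacing the per-loss list and surfacing later as an index error during backward.

```python
class ToyAmpState:
    """Hypothetical stand-in for a module-level state holder."""

    def __init__(self):
        self.loss_scalers = None


def toy_initialize(state, num_losses=1):
    """Create the per-loss scales once; refuse to overwrite them on a second call."""
    if state.loss_scalers is not None:
        raise RuntimeError(
            "initialize was already called; loss scalers would be overwritten"
        )
    state.loss_scalers = [2.0 ** 16 for _ in range(num_losses)]


state = ToyAmpState()
toy_initialize(state, num_losses=3)

try:
    # A second call with the default num_losses=1 would otherwise shrink the
    # list, so that loss_id=2 no longer has a scaler to index.
    toy_initialize(state)
    second_call_error = None
except RuntimeError as exc:
    second_call_error = str(exc)

print(len(state.loss_scalers), second_call_error)
```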

Author

@mcarilli mcarilli Apr 4, 2019

My intention was that amp.initialize should be called only once. Why do you need to call it multiple times? Is it because you are progressively adding new layers to your GAN? It might be better to start with one big module and progressively unfreeze layers, rather than adding new modules and calling amp.initialize many times. I'm very interested to hear why calling amp.initialize many times is essential for your use case.

Also, the scaling of the out-of-place backward to create the grads for the gradient penalty is always going to be tricky...can you start an issue where we can track that and figure out best practices?

@mcarilli mcarilli merged commit 3f87614 into master Apr 4, 2019
mcarilli pushed a commit that referenced this pull request Apr 4, 2019
Author: mcarilli <[email protected]>

    WIP:  Handle arbitrary combinations of optimizers/models/losses (#232)

4 participants