WIP: Handle arbitrary combinations of optimizers/models/losses #232
|
I tried this branch earlier today to include different loss scales for each loss in my progressive growing GAN, but I had issues getting it working correctly. What I ended up with was something similar to this:
d_optimizer.zero_grad()
with amp.scale_loss(wasserstein_distance, self.d_optimizer, loss_id=0) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(epsilon_penalty, self.d_optimizer, loss_id=1) as scaled_loss:
    scaled_loss.backward(retain_graph=True)
with amp.scale_loss(gradient_penalty, self.d_optimizer, loss_id=2) as scaled_loss:
    scaled_loss.backward()
d_optimizer.step()
This seems to work with a single GPU, but when I use apex DistributedDataParallel (2 GPUs) it is very unstable, and I have no idea why :/ I also took a look at the advanced AMP usage for multiple backward passes for a single optimizer, where it says that you should use |
|
Bold of you to test it already... That usage looks correct. I'm glad to hear it's working on single-GPU and that the new syntax enables your use case. I'm still trying it with multi-GPU. I haven't updated documentation for the new branch, but guidance will be that using |
|
@hukkelas I notice you're using gradient penalty. I don't think Apex DDP supports gradient penalty properly, so your instability may not be the fault of the new branch. I still haven't finished testing the new branch with distributed, but it seems to work ok so far. Can you try with |
|
Hmm, didn't know that, thanks! I was using Apex DDP, switched to I just recently started using multi-gpu training for my GAN. I think most of my previous instability, with a single GPU, has come from the gradient penalty calculation. I made a hacky solution to introduce independent loss scaling for it previously, but this branch would remove a lot of ugly code :) |
|
Glad it appears to be working now. My local distributed tests/tests on local models have also succeeded so far so I'll probably merge this branch today. When you say you are scaling the gradient penalty calculation, is this the line you are referring to: ? Scaling the loss for this call^ is fine. Or are you referring to an earlier call to |
|
Before this branch, I independently scaled my gradient penalty like this:
logits = discriminator(x_hat, condition, landmarks)
logits = logits.sum() * loss_scaler.loss_scale()
grad = torch.autograd.grad(
    outputs=logits,
    inputs=x_hat,
    grad_outputs=torch.ones(logits.shape).to(fake_data.dtype).to(fake_data.device),
    create_graph=True
)[0]
grad = grad.view(x_hat.shape[0], -1)
if check_overflow(grad):
    print("Overflow in gradient penalty calculation.")
    loss_scaler._loss_scale /= 2
    print("Scaling down loss to:", loss_scaler._loss_scale)
    return None
grad = grad / loss_scaler.loss_scale()
grad_penalty = ((grad.norm(p=2, dim=1) - 1)**2)
From my understanding, since AMP has no visibility of the |
    else:
        optimizers[i] = _process_optimizer(optimizer, properties)

_amp_state.loss_scalers = []
Should there be an assert / check that _amp_state.loss_scalers is None? If you call amp.initialize a second time, even without providing any optimizer, it will overwrite _amp_state.loss_scalers, and you get an indexing error in the backward pass when providing a loss_id.
I found a workaround for it in my code, but it's still confusing to track down this error.
|
My intention was that amp.initialize should be called only once. Why do you need to call it multiple times? Is it because you are progressively adding new layers to your GAN? It might be better to start with one big module and progressively unfreeze layers, rather than adding new modules and calling amp.initialize many times. I'm very interested to hear why calling amp.initialize many times is essential for your use case.
Also, the scaling of the out-of-place backward to create the grads for the gradient penalty is always going to be tricky... Can you start an issue where we can track that and figure out best practices?
Author: mcarilli
WIP: Handle arbitrary combinations of optimizers/models/losses (#232)
Handle arbitrary combinations of optimizers/models/losses. Do not maintain a loss scale per optimizer. Instead, maintain either a single global loss scale (by default), or a loss scale per-loss (if the user supplies the num_loss argument to amp.initialize and the loss_id argument to each invocation of amp.scale_loss).
Current status: L0 tests have been added and are passing (maybe need more). L1 single-GPU tests pass. Verified that it fixes Szymon's issue with master params getting out of sync across processes for JoC GNMT. L1 multi-GPU tests underway.
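The per-loss idea described above can be sketched in plain Python. This is not apex's implementation: `DynamicLossScaler`, `make_scalers`, and the numeric defaults (initial scale 2**16, factor 2, window of 2000 clean steps) are assumptions chosen for illustration. The sketch keeps one scaler per loss so that an overflow in one loss does not shrink the scale used for the others.

```python
class DynamicLossScaler:
    """Toy dynamic loss scale: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._good_steps = 0

    def update(self, found_overflow):
        """Adjust the scale; return True if the optimizer step should be skipped."""
        if found_overflow:
            self.scale /= self.scale_factor
            self._good_steps = 0
            return True
        self._good_steps += 1
        if self._good_steps == self.scale_window:
            self.scale *= self.scale_factor
            self._good_steps = 0
        return False


def make_scalers(num_loss=1):
    """One independent scaler per loss, indexed by loss_id."""
    return [DynamicLossScaler() for _ in range(num_loss)]


scalers = make_scalers(num_loss=3)
# Simulate an overflow in the loss with loss_id=2: only that scaler backs off.
skip_step = scalers[2].update(found_overflow=True)
```

Indexing the list by `loss_id` is what lets a numerically fragile term, such as a gradient penalty, settle at a lower scale than the other losses.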