Fix dense Embedding to work with double backward #9078
|
shouldn't the solution be adding a custom double backward, rather than slowing down backward with autograd ops? |
|
Sure, I'll try that and update once done. |
|
@kshitij12345 this needs a rebase now |
|
@weiyangfb sure will do that. |
|
Wouldn't it still be better to have a fused index_add_ + mul for the backward than implementing a specific double backward? I'd think that it is probably a bit less code and more similar to what we do for other ops. |
|
@ssnl I have tried, and here is my opinion from what I understand. In derivatives.yaml, we'll need to pass in … Also … So I believe that, as @t-vi suggested, we should opt for fused index_add_ + mul. |
|
@t-vi @kshitij12345 I'm fine if you want to implement fused index_add_ + mul and also write a backward for that. But that would be considerably more work than just writing a custom double backward for this. |
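For readers following the trade-off, here is a minimal pure-Python sketch of why a custom double backward can be small. It assumes the dense embedding backward is a scatter-add of `grad_output` rows into `grad_weight`, with padding positions skipped; since that map is linear, its own backward is just a gather. The helper names `scatter_add_rows` and `gather_rows` are hypothetical and are not the ATen functions.

```python
def scatter_add_rows(grad_output, indices, num_weights, padding_idx=-1):
    """Backward sketch: add each grad_output row into the grad_weight row it was looked up from."""
    dim = len(grad_output[0])
    grad_weight = [[0.0] * dim for _ in range(num_weights)]
    for row, idx in zip(grad_output, indices):
        if idx == padding_idx:
            continue  # padding positions contribute no gradient
        for j, g in enumerate(row):
            grad_weight[idx][j] += g
    return grad_weight


def gather_rows(gg_weight, indices, padding_idx=-1):
    """Double-backward sketch: the adjoint of the scatter-add is a gather with padding rows zeroed."""
    dim = len(gg_weight[0])
    return [
        [0.0] * dim if idx == padding_idx else list(gg_weight[idx])
        for idx in indices
    ]


def dot(a, b):
    """Sum of elementwise products of two equally shaped nested lists."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


indices = [1, 1, 2, 0]
grad_output = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
v = [[1.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # arbitrary "grad of grad_weight"

grad_weight = scatter_add_rows(grad_output, indices, num_weights=3, padding_idx=0)
gg_output = gather_rows(v, indices, padding_idx=0)

# Adjoint identity: <scatter(g), v> == <g, gather(v)>
lhs = dot(grad_weight, v)
rhs = dot(grad_output, gg_output)
print(lhs, rhs)
```

The two inner products agree, which is the property a double-backward formula for a linear backward has to satisfy.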
|
Oh, will take a look and update you on it. |
In that case can you please share a minimal code snippet that produces the given error, so even I can check. |
I am trying to write a minimal code to reproduce the given error. |
Here is the minimal code to replicate the error. Sorry, but you need OpenNMT to import Embeddings. If you run this code with your version of PyTorch (I mean commit 6dcaa47) you'll see the error while the original version (ae1a972) doesn't generate error. |
|
@pooryapzm Indeed, the problem was in my part of the code. Sorry. I have fixed it. Please let me know if it works for you as well. I have checked the following code:
import torch
from onmt.modules.embeddings import Embeddings

torch.manual_seed(6)

class Test(torch.nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.dense = torch.nn.Linear(100, 1)
        self.oembd = Embeddings(word_vec_size=100,
                                position_encoding=False,
                                dropout=0.3, feat_merge=None,
                                word_padding_idx=1,
                                word_vocab_size=1000)

    def forward(self, inp):
        inp = self.oembd(inp)
        return self.dense(inp[0])

test = Test()
test.cuda()
inp = torch.tensor([1, 1, 2, 1, 1, 2], device='cuda')
inp = inp.unsqueeze(0).unsqueeze(-1)
out = test(inp)
raw_loss = out.mean(dim=1)
# First-order gradients, kept in the graph so they can be differentiated again
loss_grad = torch.autograd.grad(outputs=raw_loss,
                                inputs=list(test.parameters()),
                                retain_graph=True, create_graph=True, only_inputs=True)
# Gradient-norm penalty; its backward pass exercises the double backward
norm = sum([param.norm()**2 for param in loss_grad])
loss = raw_loss + norm
loss.backward()
print("Successful") |
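For anyone without OpenNMT or a GPU, a rough equivalent of the check above can be sketched with plain `torch.nn.Embedding` on CPU. This is a substitution, not the snippet from the thread: the `Tiny` module, the scalar `mean()` loss, and the squared-sum penalty are assumptions made to keep the example self-contained.

```python
import torch

torch.manual_seed(6)


class Tiny(torch.nn.Module):
    """Hypothetical stand-in for the OpenNMT-based module: dense embedding followed by a linear layer."""

    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(1000, 100, padding_idx=1)
        self.dense = torch.nn.Linear(100, 1)

    def forward(self, inp):
        return self.dense(self.embed(inp))


model = Tiny()
inp = torch.tensor([1, 1, 2, 1, 1, 2])
raw_loss = model(inp).mean()  # scalar loss

# First-order gradients, kept differentiable via create_graph=True
params = list(model.parameters())
grads = torch.autograd.grad(raw_loss, params, create_graph=True)

# Gradient penalty: backpropagating through it requires the
# double backward of the dense embedding backward
penalty = sum(g.pow(2).sum() for g in grads)
loss = raw_loss + penalty
loss.backward()
print("double backward ran, loss =", loss.item())
```

If the dense embedding backward were not differentiable, the final `loss.backward()` call is where this sketch would be expected to fail.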
|
@kshitij12345 Awesome. It works with my code as well. |
|
I'm afraid I have to own up to the fact that the indexAdd approach isn't good after all. |
|
@t-vi Oh, no worries, I'll try the double backward approach. If I have any doubts or get stuck, I will ask for your help. Thank you again for guiding me. However, I am confused about what you meant by non-deterministic. If you mean non-determinism in time, the benchmark shows that the running time for embedding with index_add_ has a lower standard deviation, both on CUDA and on CPU. Just curious to understand and know more. |
|
index_add_ does not produce deterministic results, as the order of addition is unspecified (see #12217 for a tiny bit more context). |
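The underlying reason is that floating-point addition is not associative, so summing the same values in a different order can change the last bits of the result. A minimal pure-Python illustration (it does not call index_add_ itself, it only shows the arithmetic effect):

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two different accumulation orders
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)              # exact comparison
print(math.isclose(left_to_right, right_to_left))  # tolerance-based comparison
```

When several gradient rows are accumulated into the same embedding row in an unspecified order, the same effect can make results differ slightly from run to run.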
|
@colesbury , please have a look. |
|
That looks simple enough for me to be a bit embarrassed to have suggested the weighted
|
I am embarrassed too, since my first attempt at contributing here is taking so long with all these mistakes (and also because, in my excitement, I started working on the master branch and sent the PR from master). As for the second point, in the little experience I have, embeddings have always been vectors in all the courses I have taken and the literature I have seen. (It would be interesting to see if that is not always the case.) |
|
@colesbury @ssnl please review. |
can we add a few more tests with other parameters being tested as well, like padding_idx, max_norm, etc?
There already are tests for those parameters. Will add a test for double_backward in padding_idx.
I meant a test here for double backwards
From what I see here, the normalization is independently applied before embedding. So I believe the test for max_norm should be independent like it is. Do let me know if I am missing something.
As for the padding_idx, I have extended the already present test to check for double backwards as well. Please review.
Thank You.
|
@colesbury , please review |
|
@pytorchbot rebase this please |
|
There's nothing to do! This branch is already up to date with master (1240327). (To learn more about this bot, see Bot commands.) |
soumith
left a comment
thanks a lot for your contribution Kshitij, it looks like this is finally good to go :)
facebook-github-bot
left a comment
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes: #6469
ATen/native/native_functions.yml had dispatch variants for embedding_dense_backward; however, embedding_backward explicitly made a call to it, thus leading to an error. For CUDA tensors, the function used to crash on dereferencing the indices' data pointer.
Both have been solved and checked against (on CUDA and CPU)