
Conversation

@lyuwen
Contributor

lyuwen commented Dec 2, 2024

Description

In line 17 of e3nn/math/_normalize_activation.py, I changed

    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64).to(dtype=dtype, device=device)

to

    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64, device="cpu").to(dtype=dtype, device=device)

Motivation and Context

Background: I'm using E3NN in a training script written with pytorch-lightning. During training, I encountered the following error:

  File "/opt/conda/lib/python3.10/site-packages/e3nn/nn/_fc.py", line 73, in __init__
    act = normalize2mom(act)
  File "/opt/conda/lib/python3.10/site-packages/e3nn/math/_normalize_activation.py", line 43, in __init__
    cst = moment(f, 2, dtype=torch.float64, device="cpu").pow(-0.5).item()
  File "/opt/conda/lib/python3.10/site-packages/e3nn/math/_normalize_activation.py", line 17, in moment
    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64).to(dtype=dtype, device=device)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lightning/fabric/utilities/init.py", line 53, in __torch_function__
    return func(*args, **kwargs)
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

From the code, I assume the intent is to create a random-number array on the CPU and then copy it to the GPU. Under pytorch-lightning, however, torch tries to create the array on the GPU and fails, since the generator (like everything else here) has been placed explicitly on the CPU.
Therefore I propose adding device="cpu" to the torch.randn call to avoid this error.
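A minimal standalone sketch of the failure mode and the fix (this is an illustrative snippet, not the e3nn code itself; the CUDA part is shown only in comments since it needs a GPU):

```python
import torch

# e3nn's generator lives on the CPU:
gen = torch.Generator(device="cpu").manual_seed(0)

# Under pytorch-lightning, a default-device override makes a plain
# torch.randn allocate on the accelerator, mismatching the CPU generator:
#
#     with torch.device("cuda"):
#         torch.randn(10, generator=gen)
#     # RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
#
# Pinning creation to the CPU (and moving afterwards) avoids the mismatch,
# even when a device override is active:
with torch.device("cpu"):  # stands in for the lightning override
    z = torch.randn(10, generator=gen, dtype=torch.float64, device="cpu")
print(z.shape)  # torch.Size([10])
```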

How Has This Been Tested?

I have tested the fix by training my own model; it resolves the error.
Since the edit is minimal, it should not introduce new bugs.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • I have updated the documentation (if relevant).
  • I have added tests that cover my changes (if relevant).
  • The modified code is cuda compatible (github tests don't test cuda) (if relevant).
  • I have updated the CHANGELOG.

…as not specified, and it might complain with the error "RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'"
@mitkotak
Member

mitkotak commented Dec 2, 2024

Hey, thanks for the PR. Looking at the function, I see a device attribute in there. Did you try setting that to cpu?

@mitkotak
Member

mitkotak commented Dec 2, 2024

Looking at your log more, it seems the issue stems from FullyConnected not taking in a device, which then propagates to normalize2mom and moment.

@lyuwen
Contributor Author

lyuwen commented Dec 2, 2024

True, though the whole normalize2mom class seems hard-coded to put everything on the CPU. Is that intended?

@mitkotak
Member

mitkotak commented Dec 2, 2024

Yup, that's weird. Can you please modify it to take the device from an attribute, set the default to CPU, and then add a pytest that checks this, so we don't break backward compatibility? Thanks!
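A rough sketch of what such a change and its test might look like (hypothetical signature and test name; the actual e3nn implementation may differ):

```python
import torch

def moment(f, n, dtype=torch.float64, device="cpu"):
    """Estimate the n-th moment <f(z)^n> for z ~ N(0, 1).

    Sketch: `device` defaults to "cpu" for backward compatibility; the
    sample is always drawn on the CPU (where the generator lives) and only
    then moved to the requested dtype and device.
    """
    gen = torch.Generator(device="cpu").manual_seed(0)
    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64, device="cpu")
    z = z.to(dtype=dtype, device=device)
    return f(z).pow(n).mean()

# pytest-style check that the CPU default preserves the old behavior:
def test_moment_device_default():
    cst = moment(torch.tanh, 2)
    assert cst.device.type == "cpu"
    assert 0.0 < cst.item() < 1.0  # second moment of tanh(z) is in (0, 1)
```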

@lyuwen
Contributor Author

lyuwen commented Dec 2, 2024

Sure. I'll work on that.

…om "cpu" to the inferred device from _get_device(f) or the torch's current default device.
@lyuwen
Contributor Author

lyuwen commented Dec 3, 2024

So I changed "cpu" to device. I noticed that commit c965503 added some tests for this part of the code (my edits pass all of them), and that in the same commit the dtype and device were deliberately changed from the input dtype and device to torch.float64 and "cpu". Was there a reason behind this?

@mitkotak
Member

mitkotak commented Dec 3, 2024

@mariogeiger

@mitkotak mitkotak merged commit 42aab0e into e3nn:main Dec 5, 2024
1 check passed
@mitkotak
Member

mitkotak commented Dec 5, 2024

Thanks !
