
Conversation

@lyuwen
Contributor

lyuwen commented Dec 2, 2024

Description

In line 17 of e3nn/math/_normalize_activation.py, I changed

    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64).to(dtype=dtype, device=device)

to

    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64, device="cpu").to(dtype=dtype, device=device)

Motivation and Context

Background: I'm using E3NN in a training script written with pytorch-lightning. During training, I encountered the following error:

  File "/opt/conda/lib/python3.10/site-packages/e3nn/nn/_fc.py", line 73, in __init__
    act = normalize2mom(act)
  File "/opt/conda/lib/python3.10/site-packages/e3nn/math/_normalize_activation.py", line 43, in __init__
    cst = moment(f, 2, dtype=torch.float64, device="cpu").pow(-0.5).item()
  File "/opt/conda/lib/python3.10/site-packages/e3nn/math/_normalize_activation.py", line 17, in moment
    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64).to(dtype=dtype, device=device)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lightning/fabric/utilities/init.py", line 53, in __torch_function__
    return func(*args, **kwargs)
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

From the code, I assume the intent is to create a random-number array on the CPU and then copy it to the GPU. Under pytorch-lightning, however, torch tries to create the array on the GPU and fails, since the generator (like everything else here) has been placed explicitly on the CPU.
Therefore I propose adding device="cpu" to the torch.randn call to avoid this error.
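A minimal standalone sketch of the failure mode and the fix (this is an illustrative snippet, not the e3nn code itself; the CUDA part is shown only in comments since it needs a GPU):

```python
import torch

# e3nn's generator lives on the CPU:
gen = torch.Generator(device="cpu").manual_seed(0)

# Under pytorch-lightning, a default-device override makes a plain
# torch.randn allocate on the accelerator, mismatching the CPU generator:
#
#     with torch.device("cuda"):
#         torch.randn(10, generator=gen)
#     # RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
#
# Pinning creation to the CPU (and moving afterwards) avoids the mismatch,
# even when a device override is active:
with torch.device("cpu"):  # stands in for the lightning override
    z = torch.randn(10, generator=gen, dtype=torch.float64, device="cpu")
print(z.shape)  # torch.Size([10])
```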

How Has This Been Tested?

I have tested the fix by training my own model; it resolves the error.
Since the edit is minimal, it should not introduce new bugs.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • I have updated the documentation (if relevant).
  • I have added tests that cover my changes (if relevant).
  • The modified code is cuda compatible (github tests don't test cuda) (if relevant).
  • I have updated the CHANGELOG.

…as not specified, and it might complain with the error "RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'"
@mitkotak
Member

mitkotak commented Dec 2, 2024

Hey, thanks for the PR. Looking at the function, I see a device attribute in there. Did you try setting that to cpu?

@mitkotak
Member

mitkotak commented Dec 2, 2024

Looking at your log more, it seems the issue stems from FullyConnected not taking in a device, which then propagates to normalize2mom and moment.

@lyuwen
Contributor Author

lyuwen commented Dec 2, 2024

True, though the whole normalize2mom class seems hard-coded to put everything on the CPU. Is that intended?

@mitkotak
Member

mitkotak commented Dec 2, 2024

Yup, that's weird. Can you please modify it to take the device from an attribute, set the default to CPU, and then add a pytest that checks this, so we don't break backward compatibility? Thanks!
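A rough sketch of what such a change and its test might look like (hypothetical signature and test name; the actual e3nn implementation may differ):

```python
import torch

def moment(f, n, dtype=torch.float64, device="cpu"):
    """Estimate the n-th moment <f(z)^n> for z ~ N(0, 1).

    Sketch: `device` defaults to "cpu" for backward compatibility; the
    sample is always drawn on the CPU (where the generator lives) and only
    then moved to the requested dtype and device.
    """
    gen = torch.Generator(device="cpu").manual_seed(0)
    z = torch.randn(1_000_000, generator=gen, dtype=torch.float64, device="cpu")
    z = z.to(dtype=dtype, device=device)
    return f(z).pow(n).mean()

# pytest-style check that the CPU default preserves the old behavior:
def test_moment_device_default():
    cst = moment(torch.tanh, 2)
    assert cst.device.type == "cpu"
    assert 0.0 < cst.item() < 1.0  # second moment of tanh(z) is in (0, 1)
```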

@lyuwen
Contributor Author

lyuwen commented Dec 2, 2024

Sure. I'll work on that.

…om "cpu" to the inferred device from _get_device(f) or the torch's current default device.
@lyuwen
Contributor Author

lyuwen commented Dec 3, 2024

So I changed "cpu" to device. I noticed that commit c965503 added some tests for this part of the code (my edits pass all of them), and that in the same commit the dtype and device were deliberately changed from the input dtype and device to torch.float64 and "cpu". Was there a reason behind this?

@mitkotak
Member

mitkotak commented Dec 3, 2024

@mariogeiger

@mitkotak mitkotak merged commit 42aab0e into e3nn:main Dec 5, 2024
1 check passed
@mitkotak
Member

mitkotak commented Dec 5, 2024

Thanks !
