Unexpected behavior when using dist.all_reduce(x, op=dist.ReduceOp.SUM) #152300

Open
fhk357869050 opened this issue Apr 28, 2025 · 1 comment
Labels
module: c10d · oncall: distributed · triaged

Comments


fhk357869050 commented Apr 28, 2025

πŸ› Describe the bug

import torch
import torch.distributed as dist
from torch.multiprocessing import Process
import numpy as np


def exec_op(rank):
    dist.init_process_group(backend='gloo', rank=rank, world_size=2, init_method='tcp://127.0.0.1:40001')
    np.random.seed(1024 + rank)
    x = np.random.uniform(-65504, 65504, [m, k]).astype(np.float16)
    x = torch.from_numpy(x)
    print(f"rank:{rank} before all_reduce x[7205]:{x[7205]}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank:{rank} after all_reduce x[7205]:{x[7205]}")


if __name__ == '__main__':
    m, k = [24063328, 1]
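    # m and k are module-level globals used inside exec_op; the child processes
    # inherit them under the default fork start method on Linux.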
    p_list = []
    for g_rank in range(2):
        p = Process(target=exec_op, args=(g_rank,))
        p_list.append(p)
    for p in p_list:
        p.start()
    for p in p_list:
        p.join()

[Screenshot: script output]

About 0.007% of the points didn't match.

[Screenshot]
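
The issue doesn't say how the 0.007% figure was computed. Below is a minimal sketch of one possible check, assuming the baseline is an fp16 sum of both ranks' inputs built on a single host; mismatch_ratio, M, and K are hypothetical names and not part of the original repro.

import numpy as np
import torch

M, K = 24063328, 1  # same shape as in the repro script


def mismatch_ratio(result, world_size=2):
    # `result` is assumed to be the tensor a rank holds after
    # dist.all_reduce(result, op=dist.ReduceOp.SUM).
    ref = torch.zeros(M, K, dtype=torch.float16)
    for r in range(world_size):
        np.random.seed(1024 + r)
        t = np.random.uniform(-65504, 65504, [M, K]).astype(np.float16)
        ref += torch.from_numpy(t)  # fp16 accumulation, same dtype as the collective
    # Elementwise comparison; NaN entries (e.g. from inf + -inf) always count
    # as mismatches because NaN != NaN.
    return (result != ref).float().mean().item()

Calling print(mismatch_ratio(x)) right after the dist.all_reduce line would report the mismatched fraction; whether it reproduces the reported 0.007% depends on which baseline the screenshots were generated against.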

Versions

Python 3.8.5
torch 2.4.0

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@soulitzer added the oncall: distributed label Apr 28, 2025

fduwjj (Contributor) commented Apr 29, 2025

I can repro this one, let me see what went wrong.

@tianyu-l added the triaged and module: c10d labels May 5, 2025