_get_op has a wrong if condition in cupyx.distributed._nccl_comm.py #7981

@jemiryguo

Description

In class NCCLBackend(_Backend) there is a function _get_op:

def _get_op(self, op, dtype):
    if op not in _nccl_ops:
        raise RuntimeError(f'Unknown op {op} for NCCL')
    if dtype in 'FD' and op != nccl.NCCL_SUM:
        raise ValueError(
            'Only nccl.SUM is supported for complex arrays')
    return _nccl_ops[op]

op is expected to be one of 'sum', 'prod', 'max', or 'min', according to the definition of _nccl_ops:

_nccl_ops = {'sum': nccl.NCCL_SUM,
             'prod': nccl.NCCL_PROD,
             'max': nccl.NCCL_MAX,
             'min': nccl.NCCL_MIN}

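However, the check compares op, which is a string key, against the integer constant nccl.NCCL_SUM. The comparison op != nccl.NCCL_SUM is therefore always true, so the ValueError is raised for every complex dtype, even when op is 'sum'. The condition should be corrected to op != 'sum'.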
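
The difference between the two conditions can be sketched without a GPU. This is only an illustration of the comparison bug, not the CuPy source: the NCCL_* integers below are stand-ins for the nccl.NCCL_* constants, and get_op_current / get_op_fixed are hypothetical free functions that mirror the method body quoted above.

```python
# Stand-in integer constants; assumed placeholders for nccl.NCCL_SUM etc.
NCCL_SUM, NCCL_PROD, NCCL_MAX, NCCL_MIN = range(4)

_nccl_ops = {'sum': NCCL_SUM,
             'prod': NCCL_PROD,
             'max': NCCL_MAX,
             'min': NCCL_MIN}


def get_op_current(op, dtype):
    """Mirrors the reported logic: op (a str) is compared to an int."""
    if op not in _nccl_ops:
        raise RuntimeError(f'Unknown op {op} for NCCL')
    # A string never equals an int, so this is true for every valid op.
    if dtype in 'FD' and op != NCCL_SUM:
        raise ValueError(
            'Only nccl.SUM is supported for complex arrays')
    return _nccl_ops[op]


def get_op_fixed(op, dtype):
    """Proposed logic: compare op against the string key 'sum'."""
    if op not in _nccl_ops:
        raise RuntimeError(f'Unknown op {op} for NCCL')
    if dtype in 'FD' and op != 'sum':
        raise ValueError(
            'Only nccl.SUM is supported for complex arrays')
    return _nccl_ops[op]


def raises_value_error(func, op, dtype):
    """Return True if func(op, dtype) raises ValueError."""
    try:
        func(op, dtype)
    except ValueError:
        return True
    return False
```

With these definitions, get_op_current('sum', 'D') raises ValueError even though summing complex arrays is meant to be allowed, while get_op_fixed('sum', 'D') returns the sum constant and still rejects 'max' for complex dtypes.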

To Reproduce

import cupy, os
from cupyx.distributed import NCCLBackend

os.unsetenv("NCCL_DEBUG")
NCCLBackend._get_op(None, 'sum', 'D')

Installation

Conda-Forge (conda install ...)

Environment

OS                           : Linux-5.4.143.bsk.7-amd64-x86_64-with-glibc2.31
Python Version               : 3.10.9
CuPy Version                 : 11.5.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.24.2
SciPy Version                : 1.10.1
Cython Build Version         : 0.29.33
Cython Runtime Version       : None
CUDA Root                    : /usr/local/cuda
nvcc PATH                    : /usr/local/cuda/bin/nvcc
CUDA Build Version           : 11020
CUDA Driver Version          : 11080
CUDA Runtime Version         : 11080
cuBLAS Version               : (available)
cuFFT Version                : 10900
cuRAND Version               : 10300
cuSOLVER Version             : (11, 4, 1)
cuSPARSE Version             : (available)
NVRTC Version                : (11, 8)
Thrust Version               : 101000
CUB Build Version            : 101000
Jitify Build Version         : b8d229d
cuDNN Build Version          : 8401
cuDNN Version                : 8600
NCCL Build Version           : 21403
NCCL Runtime Version         : 21501
cuTENSOR Version             : 10602
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA A100-SXM4-80GB
Device 0 Compute Capability  : 80
Device 0 PCI Bus ID          : 0000:16:00.0

Additional Information

I can push a PR to fix this.
