@priyammaz

This is a fix for this issue here:

#9430

But also, if our only goal is to make sure each value fits within 8 bits, a better fix may be:

shifted_nccl_id = bytes([b & 0xFF for b in nccl_id])
nccl_id = tuple([int(b) for b in shifted_nccl_id])

I can update the PR if that is more correct (it should work on all systems, regardless of whether their char type is signed or unsigned).
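To make the difference between the two conversions concrete, here is a minimal pure-Python sketch. It assumes that `nccl_id` arrives as a tuple of integers in the range -128..127 on platforms where `char` is signed, and in the range 0..255 on platforms where `char` is unsigned (as is commonly the case on ARM). The helper names `to_bytes_offset` and `to_bytes_masked` are hypothetical, and the five-element tuples only stand in for a real, longer ID.

```python
def to_bytes_offset(nccl_id):
    # Original approach: shift every value by 128.
    # Only valid when all values are in -128..127.
    return bytes([b + 128 for b in nccl_id])


def to_bytes_masked(nccl_id):
    # Proposed approach: keep only the low 8 bits of every value.
    # Valid for both -128..127 and 0..255 inputs.
    return bytes([b & 0xFF for b in nccl_id])


# The same bit patterns, seen through a signed and an unsigned char.
signed_id = (-128, -1, 0, 1, 127)
unsigned_id = (128, 255, 0, 1, 127)

# Masking yields identical bytes for both views.
masked_signed = to_bytes_masked(signed_id)
masked_unsigned = to_bytes_masked(unsigned_id)

# The +128 shift overflows the 0..255 range for unsigned input.
try:
    to_bytes_offset(unsigned_id)
    offset_failed = False
except ValueError:
    offset_failed = True
```

Note that the two approaches produce different byte values for the same signed input, so whichever one is used, the reverse conversion has to match it.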

# make them positive and send them as bytes to the proxy store
shifted_nccl_id = bytes([b + 128 for b in nccl_id])
#shifted_nccl_id = bytes([b + 128 for b in nccl_id])
shifted_nccl_id = bytes([b for b in nccl_id])
Member

Thanks a lot!

Yeah, this shouldn't work for non-ARM, as you said. For the way back (below), I think we can just use <char>b (no int() needed), since that is what get_unique_id uses as well (and I think Cython must allow conversion from a byte string to char even if char is signed).

For the one above, I think we need to do <unsigned char><char>b and explain in the comment: Ensure positive values for conversion to bytes via Python.
(This would be less awkward if nccl.get_unique_id() was a cdef function or just returned the bytes...)
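The round trip described above can be sketched in plain Python by emulating the C casts with `ctypes`, whose integer types wrap around instead of raising on overflow. This is only an illustration of the cast semantics, not the Cython code itself: the helper names are hypothetical, `as_char` assumes a platform where `char` is signed, and the short tuple stands in for a real ID.

```python
import ctypes


def as_unsigned_char(value):
    # Emulates the C cast <unsigned char><char>value:
    # reinterpret the low 8 bits as a value in 0..255.
    return ctypes.c_ubyte(ctypes.c_byte(value).value).value


def as_char(value):
    # Emulates <char>value where char is signed:
    # reinterpret the low 8 bits as a value in -128..127.
    return ctypes.c_byte(value).value


def id_to_bytes(nccl_id):
    # Ensure positive values for conversion to bytes via Python.
    return bytes(as_unsigned_char(b) for b in nccl_id)


def bytes_to_id(nccl_id_bytes):
    # The way back: iterating over bytes yields ints in 0..255.
    return tuple(as_char(b) for b in nccl_id_bytes)


original = (-128, -1, 0, 1, 127)
nccl_id_bytes = id_to_bytes(original)
restored = bytes_to_id(nccl_id_bytes)
```

Because both directions are plain reinterpretations of the same 8 bits, the round trip preserves the ID without any offset arithmetic.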

What would be amazing is if you could add a very small test for this in tests/cupy_tests/cuda_tests/test_nccl.py. I assume we are missing that, and it would also prove that everything is still good on non-ARM systems.

(I would also rename the variable to just nccl_id_bytes or so, I think the comment is enough, and "shift" reads weird when it's a cast.)

Member

Thanks. Since then, there was another issue about this (or about something similar at a lower level)...

I am thinking we should fix this at the lower level, the next CuPy version is v14, so a change that might break some niche users seems OK to me.

That is: make nccl.get_unique_id() return a bytes string, which would remove this dance and simplify the code generally.

If you are interested in pursuing this, that is great. But we should fix this pretty soon for CuPy v14, so otherwise I would pick it up myself very soon.