Looks awesome!! Thanks for adding me as a reviewer, I will look at it today!
nastya236 left a comment:
This looks awesome! This PR very nicely:
- Separates the algorithm implementation from the group classes by introducing `MeshImpl` and `RingImpl` objects
- With the above, we can make an adaptive decision on which `all_reduce` to pick: if the number of nodes is more than 2 and the size of the message is big enough, we pick the bidirectional ring `all_reduce` (which is better for larger messages since each node only sends/receives 1/N of the data per step)
- So now for the mesh type of communication we allocate buffers for both a mesh and a ring and init both `MeshImpl` and `RingImpl`
I left a couple of comments, mostly for my personal understanding.
for (int j = 0; j < size_; j++) {
  buffers_.emplace_back(FRAME_SIZE * (1 << k));
}
// Ring buffers (1 for each direction)
Nit, and just for me to understand better: in case size_ = 2, mesh and ring should be identical and we would never fall into the ring all_reduce. I am wondering if it makes sense to allocate ring buffers and init RingImpl only if size_ > 2? But probably the extra complexity does not justify the small memory saving..
Yeah, that isn't a bad idea. The memory is 4MB, so not important, but perhaps there are other reasons not to register unnecessary buffers.
We might still want to use a ring even with 2 nodes because the process is different: the mesh sends all the data at once and reduces all of it, while the ring sends half, reduces half, and gathers half. If the reduction is very expensive, then the ring will be faster.
}
encoder.dispatch([in_ptr, out_ptr, size, this, reduce_op]() {
  if (size_ > 2 &&
      ((std::is_same_v<T, bfloat16_t> && size > 65536) ||
Interesting! Just for my understanding: why is the cutover for bfloat16 at 65,536 elements = 128KB, but for other 2-byte types (like float16) at 8MB / 2 bytes = 4M elements = 8MB?
Bfloat summation is slow on the M3s (we should put in a TODO to make it faster), so the ring (which sums less) becomes more efficient sooner. 8MB is where the ring surpasses the mesh with 4 nodes.
      }
    }
  }
  mesh_.all_gather(in_ptr, out_ptr, n_bytes);
Also just for me to understand better: I am wondering whether it makes sense to have a similar conditional all_gather here, as for all_reduce, if the message is large enough? What do you think?
Good point. It doesn't, because there is no bandwidth benefit for all_gather as there is for all_reduce. All_reduce benefits because the summations are shared among nodes, so we never send all the data.
So, tl;dr: all_gather will be quite a bit slower via the ring than via the mesh.
This PR refactors the communication implementations outside the groups so we can use the more efficient ring reduce in the mesh group for large sizes.
Comparison of all reduce performance on 4 M3 Ultras. [benchmark figure]