Description
Hello, I'm trying out JAX's multi-process launch, using essentially the same code as the example here.
import jax
import jax.numpy as jnp
from jax.sharding import NamedSharding, PartitionSpec as P
import numpy as np
jax.distributed.initialize()
# this example assumes 4 devices total
assert jax.device_count() == 4
# make a 2D mesh that refers to devices from all processes
mesh = jax.make_mesh((2, 2), ('i', 'j'))
# create some toy data
global_data = np.arange(32).reshape((4, 8))
# make a process- and device-spanning array from our toy data
sharding = NamedSharding(mesh, P('i', 'j'))
global_array = jax.device_put(global_data, sharding)
assert global_array.shape == global_data.shape
# each process has different shards of the global array
for shard in global_array.addressable_shards:
print(f"device {shard.device} has local data {shard.data}")
# apply a simple computation, automatically partitioned
global_result = jnp.sum(jnp.sin(global_array))
print(f'process={jax.process_index()} got result: {global_result}')
I launch processes with OpenMPI by running the following command:
mpirun -c 4 uv run main.py
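For context, initialize() is called with no arguments, so JAX should be auto-detecting the cluster from the OpenMPI environment. The explicit equivalent would be roughly the sketch below (the coordinator address/port is a placeholder, not my actual setup):
import os
import jax

# Hypothetical explicit form of the no-argument jax.distributed.initialize()
# call above. OpenMPI exposes the rank and world size as environment
# variables; the coordinator address below is only a placeholder.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # placeholder: rank-0 host and a free port
    num_processes=world_size,
    process_id=rank,
)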
The script runs successfully, but at the end it always prints the following gRPC warnings:
W0929 01:41:53.661844 72590 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugproto job_name: "jax_worker": UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused"}
W0929 01:41:53.661874 72587 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugproto job_name: "jax_worker" task_id: 1: UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused"}
W0929 01:41:53.661847 72588 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugproto job_name: "jax_worker" task_id: 2: UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused", grpc_status:14}
W0929 01:41:53.661861 72589 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugstr job_name: "jax_worker" task_id: 3: UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused"}
Are these warnings the expected behavior of a multi-process launch? If not, is there any way to get rid of them?
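One workaround I'm considering (purely a guess on my part, not confirmed as the intended fix) is to tear down the distributed runtime explicitly at the end of the script, so the coordination service isn't stopped while other workers are still polling it:
# ... end of the script above ...
# Speculative: explicitly shut down the distributed runtime before the process
# exits, instead of relying on interpreter teardown ordering.
jax.distributed.shutdown()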
System info (python version, jaxlib version, accelerator, etc.)
jax: 0.7.2
jaxlib: 0.7.2
numpy: 2.3.3
python: 3.13.4 (main, Jun 4 2025, 17:37:06) [Clang 20.1.4 ]
device info: NVIDIA H100 80GB HBM3-4, 4 local devices
process_count: 1
platform: uname_result(system='Linux', node='xyge--2a356b966abe-o57wysikki', release='5.15.0-119-generic', version='#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024', machine='x86_64')
JAX_COMPILATION_CACHE_DIR=/inspire/hdd/global_user/hezhengfu-240208120186/.cache/jax
$ nvidia-smi
Mon Sep 29 01:45:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:18:00.0 Off | 0 |
| N/A 26C P0 116W / 700W | 556MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:2A:00.0 Off | 0 |
| N/A 28C P0 117W / 700W | 536MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:3A:00.0 Off | 0 |
| N/A 28C P0 117W / 700W | 536MiB / 81559MiB | 1% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 27C P0 121W / 700W | 536MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+