
Warnings on multi-process launch #32165

@dest1n1s

Description

Hello, I'm trying the multi-process launch of JAX, using essentially the same code as shown here.

import jax
import jax.numpy as jnp
from jax.sharding import NamedSharding, PartitionSpec as P
import numpy as np

jax.distributed.initialize()
# this example assumes 4 devices total
assert jax.device_count() == 4

# make a 2D mesh that refers to devices from all processes
mesh = jax.make_mesh((2, 2), ('i', 'j'))

# create some toy data
global_data = np.arange(32).reshape((4, 8))

# make a process- and device-spanning array from our toy data
sharding = NamedSharding(mesh, P('i', 'j'))
global_array = jax.device_put(global_data, sharding)
assert global_array.shape == global_data.shape

# each process has different shards of the global array
for shard in global_array.addressable_shards:
  print(f"device {shard.device} has local data {shard.data}")

# apply a simple computation, automatically partitioned
global_result = jnp.sum(jnp.sin(global_array))
print(f'process={jax.process_index()} got result: {global_result}')

I launch processes with OpenMPI by running the following command:

mpirun -c 4 uv run main.py
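For reference, jax.distributed.initialize() with no arguments auto-detects the coordinator address and process layout from the environment. If auto-detection fails under a particular launcher, the same information can be passed explicitly; below is a minimal sketch, in which the coordinator host:port is a placeholder and the OMPI_COMM_WORLD_* variables assume an OpenMPI launch:

import os
import jax

# Explicit initialization, equivalent to the zero-argument auto-detection.
# The coordinator address is a placeholder; OMPI_COMM_WORLD_SIZE and
# OMPI_COMM_WORLD_RANK are set by OpenMPI for each launched process.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # placeholder host:port of process 0
    num_processes=int(os.environ["OMPI_COMM_WORLD_SIZE"]),
    process_id=int(os.environ["OMPI_COMM_WORLD_RANK"]),
)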

The script runs successfully, but at the end it always emits the following gRPC warnings:

W0929 01:41:53.661844   72590 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugproto  job_name: "jax_worker": UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer  {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused"}
W0929 01:41:53.661874   72587 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugproto  job_name: "jax_worker" task_id: 1: UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer  {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused"}
W0929 01:41:53.661847   72588 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugproto    job_name: "jax_worker" task_id: 2: UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused", grpc_status:14}
W0929 01:41:53.661861   72589 pjrt_client.cc:1469] WatchJobStateAsync failed for task goo.gle/debugstr    job_name: "jax_worker" task_id: 3: UNAVAILABLE: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/WatchJobState:
:UNKNOWN:Error received from peer  {grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.246.73.22:64432: Failed to connect to remote host: Connection refused"}

Are these warnings the expected behavior of a multi-process launch? If not, is there any way to get rid of them?
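One mitigation worth trying (not verified to remove the warnings): shut the distributed runtime down explicitly before the interpreter exits, so no worker is still polling the coordinator after its process has quit. A minimal sketch:

import jax

jax.distributed.initialize()

# ... the computation from the script above ...

# Explicit teardown of the coordination service before exit. The
# WatchJobState warnings may come from workers polling a coordinator
# process that has already exited.
jax.distributed.shutdown()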

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.7.2
jaxlib: 0.7.2
numpy:  2.3.3
python: 3.13.4 (main, Jun  4 2025, 17:37:06) [Clang 20.1.4 ]
device info: NVIDIA H100 80GB HBM3-4, 4 local devices
process_count: 1
platform: uname_result(system='Linux', node='xyge--2a356b966abe-o57wysikki', release='5.15.0-119-generic', version='#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024', machine='x86_64')
JAX_COMPILATION_CACHE_DIR=/inspire/hdd/global_user/hezhengfu-240208120186/.cache/jax

$ nvidia-smi
Mon Sep 29 01:45:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
| N/A   26C    P0            116W /  700W |     556MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:2A:00.0 Off |                    0 |
| N/A   28C    P0            117W /  700W |     536MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:3A:00.0 Off |                    0 |
| N/A   28C    P0            117W /  700W |     536MiB /  81559MiB |      1%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   27C    P0            121W /  700W |     536MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Labels

bug (Something isn't working)