Description
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 7.6
TensorFlow installed from (source or binary): source
TensorFlow version: 1.15.0
Python version: 2.7
Installed using virtualenv? pip? conda?: No
CUDA/cuDNN version: No
GPU model and memory: No
Worker job number: 50
PS job number: 10
Chief job number: 1
Describe the problem
I ran a distributed training experiment using TensorFlow with Verbs for communication between nodes, i.e. with the protocol "grpc+verbs". The cluster consists of 50 worker nodes, 10 ps nodes, and 1 chief node. At the end of training, all 50 workers stopped normally, but one of the 10 ps nodes crashed with a core dump. The other ps nodes and the chief node stopped normally.
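For reference, the cluster layout described above can be sketched roughly as follows. This is only an illustration: the host names, the port, and the `build_cluster` helper are hypothetical placeholders, not the real machines or the real training script.

```python
def build_cluster(num_workers=50, num_ps=10, num_chief=1, port=2222):
    """Return a cluster dict in the shape accepted by tf.train.ClusterSpec.

    Host names and the port are hypothetical placeholders.
    """
    return {
        "chief": ["chief-%d.example.com:%d" % (i, port) for i in range(num_chief)],
        "ps": ["ps-%d.example.com:%d" % (i, port) for i in range(num_ps)],
        "worker": ["worker-%d.example.com:%d" % (i, port) for i in range(num_workers)],
    }


cluster = build_cluster()

# In a TensorFlow 1.x build with verbs support, each process would then
# start its server along these lines (commented out so the sketch runs
# without TensorFlow installed):
#
#   import tensorflow as tf
#   server = tf.train.Server(
#       tf.train.ClusterSpec(cluster),
#       job_name="ps",          # or "worker" / "chief"
#       task_index=0,           # index of this task within its job
#       protocol="grpc+verbs",  # RDMA verbs transport instead of plain grpc
#   )
#   server.join()               # ps tasks typically block here
```

The only setting that differs from a plain gRPC setup in this sketch is the `protocol` argument.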
When I used gdb to print the backtrace (bt) from the core dump file of that ps node, the output was as follows.