System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): archlinux
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): b'v1.9.0-rc2-5276-ge57874169f' 1.12.0-dev20181004
- Python version: 3.6
- Bazel version (if compiling from source): n/a
- GCC/Compiler version (if compiling from source): n/a
- CUDA/cuDNN version: 9.0
- GPU model and memory: 1080Ti
- Exact command to reproduce: below
This code:
import tensorflow as tf
import numpy as np


def f(boxes, scores):
    def f(X):
        prob, box = X
        output_shape = tf.shape(prob)
        ids = tf.reshape(tf.where(prob > 0.05), [-1])
        prob = tf.gather(prob, ids)
        box = tf.gather(box, ids)
        # prob = tf.Print(prob, [box, prob], summarize=100, message='boxandprob')
        selection = tf.image.non_max_suppression(box, prob, 100, 0.5)
        selection = tf.to_int32(tf.gather(ids, selection))
        selection = tf.Print(selection, [ids, selection], summarize=100, message='ids_selection_2')
        sorted_selection = -tf.nn.top_k(-selection, k=tf.size(selection))[0]
        mask = tf.sparse_to_dense(
            sparse_indices=sorted_selection,
            output_shape=output_shape,
            sparse_values=True,
            default_value=False)
        return mask

    masks = tf.map_fn(f, (scores, boxes), dtype=tf.bool, parallel_iterations=10)  # #cat x N
    return masks


with tf.device('/gpu:0'):
    boxes = tf.placeholder(tf.float32, (80, None, 4), name='boxes')
    scores = tf.placeholder(tf.float32, (80, None), name='scores')
    outs = f(boxes, scores)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
data = dict(np.load('debug.npz'))
for k in range(1000):
    sess.run(outs, feed_dict={boxes: data['boxes'].transpose(1, 0, 2)[1:, :, :], scores: data['scores'][:, 1:].T})
    print(k)
causes a segmentation fault on tf-nightly-gpu as well as tensorflow-gpu==1.11.0. It works on 1.10.
It needs the data file debug.npz, available here: debug.zip
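For readers without the attachment, here is a rough sketch of the input layout the repro appears to expect. The shapes are inferred only from the placeholders and the slicing in the feed_dict: `boxes` as (N, 81, 4) and `scores` as (N, 81), with the first class column dropped before feeding. The helper names, the value of N, and the random values are hypothetical, and synthetic data like this may well not trigger the crash.

```python
import numpy as np


def make_synthetic_inputs(num_boxes=50, num_classes=81, seed=0):
    """Build random arrays shaped like the (assumed) contents of debug.npz."""
    rng = np.random.RandomState(seed)
    # Boxes as [y1, x1, y2, x2] with y2 > y1 and x2 > x1.
    top_left = rng.uniform(0, 100, size=(num_boxes, num_classes, 2))
    size = rng.uniform(1, 50, size=(num_boxes, num_classes, 2))
    boxes = np.concatenate([top_left, top_left + size], axis=-1).astype(np.float32)
    scores = rng.uniform(0, 1, size=(num_boxes, num_classes)).astype(np.float32)
    return {"boxes": boxes, "scores": scores}


def to_feed_arrays(data):
    """Apply the same slicing as the repro's feed_dict."""
    boxes = data["boxes"].transpose(1, 0, 2)[1:, :, :]  # (80, N, 4)
    scores = data["scores"][:, 1:].T                     # (80, N)
    return boxes, scores


data = make_synthetic_inputs()
feed_boxes, feed_scores = to_feed_arrays(data)
print(feed_boxes.shape, feed_scores.shape)
```

The two returned arrays match the `(80, None, 4)` and `(80, None)` placeholder shapes in the repro.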
Note:
- I tested on two machines; an error occurs in more than 90% of runs.
- The code was distilled from the bug report about MaskRCNN evaluation here. The code in the original report does not always segfault; it occasionally crashes with other, seemingly unrelated TF internal errors, such as:
InvalidArgumentError (see above for traceback): scores has incompatible shape
[[node map/while/non_max_suppression/NonMaxSuppressionV3 (defined at bug.py:15) = NonMaxSuppressionV3[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/GatherV2_1/_29, map/while/GatherV2/_31, map/while/non_max_suppression/NonMaxSuppressionV3/max_output_size/_33, map/while/non_max_suppression/iou_threshold/_35, map/while/non_max_suppression/score_threshold/_37)]]
2018-10-04 14:59:14.736180: F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)
2018-10-04 14:59:49.523436: F tensorflow/core/common_runtime/bfc_allocator.cc:380] Check failed: h != kInvalidChunkHandle
2018-10-05 00:12:03.720295: F ./tensorflow/core/framework/tensor.h:643] Check failed: new_num_elements == NumElements() (39 vs. 0)
InvalidArgumentError (see above for traceback): indices[1] = [0] is repeated
[[{{node map/while/SparseToDense}} = SparseToDense[T=DT_BOOL, Tindices=DT_INT32, _class=["loc:@map/while/TensorArrayWrite/TensorArrayWriteV3"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/Neg_1/_51, map/while/Shape/_53, map/while/SparseToDense/sparse_values/_55, map/while/SparseToDense/default_value/_57)]]
[[{{node map/while/SparseToDense/sparse_values/_54}} = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_111_map/while/SparseToDense/sparse_values", _device="/job:localhost/replica:0/task:0/device:GPU:0"](map/while/SparseToDense/sparse_values)]]
After being distilled to this small repro, it mostly segfaults, but the error messages above might still help. This looks like memory corruption.
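To cross-check results on a version that does not crash, the per-class computation can be approximated in plain NumPy. This is only a sketch of what the inner function is meant to compute (threshold at 0.05, greedy NMS at IoU 0.5 with at most 100 outputs, then a boolean mask). It assumes well-formed `[y1, x1, y2, x2]` boxes and is not TensorFlow's actual NMS kernel, so tie-breaking and edge cases may differ. The function names are hypothetical. Since greedy NMS keeps each index at most once, the selected indices should be distinct, which is why the "is repeated" error above is unexpected.

```python
import numpy as np


def iou_one_to_many(box, others):
    """IoU between one box and an array of boxes, all as [y1, x1, y2, x2]."""
    y1 = np.maximum(box[0], others[:, 0])
    x1 = np.maximum(box[1], others[:, 1])
    y2 = np.minimum(box[2], others[:, 2])
    x2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    union = area + areas - inter
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)


def reference_mask(prob, box, score_thresh=0.05, iou_thresh=0.5, max_out=100):
    """Boolean mask over one class: True where a box survives threshold + NMS."""
    ids = np.flatnonzero(prob > score_thresh)
    # Visit candidates from highest to lowest score.
    order = ids[np.argsort(-prob[ids], kind="stable")]
    keep = []
    for i in order:
        if len(keep) >= max_out:
            break
        # Keep a box only if it does not overlap too much with any kept box.
        if not keep or np.all(iou_one_to_many(box[i], box[keep]) <= iou_thresh):
            keep.append(int(i))
    mask = np.zeros(prob.shape, dtype=bool)
    mask[np.array(keep, dtype=np.intp)] = True
    return mask


def reference_masks(scores, boxes):
    """Apply reference_mask per class; scores is (C, N), boxes is (C, N, 4)."""
    return np.stack([reference_mask(p, b) for p, b in zip(scores, boxes)])


prob = np.array([0.9, 0.8, 0.01, 0.7], dtype=np.float32)
box = np.array([[0, 0, 10, 10],
                [0, 0, 10, 9],
                [0, 0, 5, 5],
                [20, 20, 30, 30]], dtype=np.float32)
print(reference_mask(prob, box))  # [ True False False  True]
```

In the small example, the second box overlaps the first with IoU 0.9 and is suppressed, and the third falls below the score threshold.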