Segmentation fault with small repro #22750

@ppwwyyxx

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): b'v1.9.0-rc2-5276-ge57874169f' 1.12.0-dev20181004
  • Python version: 3.6
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: 1080Ti
  • Exact command to reproduce: below

This code:

import tensorflow as tf
import numpy as np

def f(boxes, scores):
    def f(X):
        prob, box = X
        output_shape = tf.shape(prob)
        ids = tf.reshape(tf.where(prob > 0.05), [-1])
        prob = tf.gather(prob, ids)
        box = tf.gather(box, ids)
        # prob = tf.Print(prob, [box, prob], summarize=100, message='boxandprob')
        selection = tf.image.non_max_suppression(box, prob, 100, 0.5)
        selection = tf.to_int32(tf.gather(ids, selection))
        selection = tf.Print(selection, [ids, selection], summarize=100, message='ids_selection_2')
        sorted_selection = -tf.nn.top_k(-selection, k=tf.size(selection))[0]
        mask = tf.sparse_to_dense(
            sparse_indices=sorted_selection,
            output_shape=output_shape,
            sparse_values=True,
            default_value=False)
        return mask

    masks = tf.map_fn(f, (scores, boxes), dtype=tf.bool, parallel_iterations=10)  # shape: #categories x N
    return masks

with tf.device('/gpu:0'):
    boxes = tf.placeholder(tf.float32, (80, None, 4), name='boxes')
    scores = tf.placeholder(tf.float32, (80, None), name='scores')
    outs = f(boxes, scores)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
data = dict(np.load('debug.npz'))
for k in range(1000):
    sess.run(outs, feed_dict={boxes: data['boxes'].transpose(1, 0, 2)[1:, :, :], scores: data['scores'][:, 1:].T})
    print(k)

causes a segmentation fault on tf-nightly-gpu as well as on tensorflow-gpu==1.11.0. It works on 1.10.
It needs the data file debug.npz, which is attached here:
debug.zip
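For readers who want to check the input layout without the attachment, here is a minimal sketch of the shapes the script expects. It is inferred only from the placeholders and the slicing in `feed_dict`: `boxes` appears to be `(N, 81, 4)` and `scores` `(N, 81)`, with the first class column dropped before feeding. The helper names `make_synthetic_inputs` and `to_feed_arrays` are hypothetical, the values are random, and random data is not expected to reproduce the crash; only the real debug.npz is known to do that.

```python
import numpy as np


def make_synthetic_inputs(num_boxes=200, num_classes=81, seed=0):
    """Build arrays with the layout the repro seems to expect from debug.npz.

    Assumed layout (inferred from the slicing in the repro, not from the file):
      boxes:  (num_boxes, num_classes, 4)
      scores: (num_boxes, num_classes)
    """
    rng = np.random.RandomState(seed)
    # First corner in [0, 100), side lengths in [1, 50): every box has positive area.
    corner = rng.uniform(0, 100, size=(num_boxes, num_classes, 2))
    size = rng.uniform(1, 50, size=(num_boxes, num_classes, 2))
    boxes = np.concatenate([corner, corner + size], axis=-1).astype(np.float32)
    scores = rng.uniform(0, 1, size=(num_boxes, num_classes)).astype(np.float32)
    return {'boxes': boxes, 'scores': scores}


def to_feed_arrays(data):
    """Apply the same slicing as the repro's feed_dict (drops the first class column)."""
    boxes = data['boxes'].transpose(1, 0, 2)[1:, :, :]  # (num_classes - 1, N, 4)
    scores = data['scores'][:, 1:].T                     # (num_classes - 1, N)
    return boxes, scores
```

With the defaults, `to_feed_arrays(make_synthetic_inputs())` yields arrays of shape `(80, 200, 4)` and `(80, 200)`, which match the `boxes` and `scores` placeholders.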

Note:

  1. I tested on two machines; an error happens in >90% of runs.
  2. The code was distilled from the bug report about MaskRCNN evaluation here. The original bug report does not always segfault, but it occasionally crashes with other unreasonable TF internal errors, such as:
InvalidArgumentError (see above for traceback): scores has incompatible shape
         [[node map/while/non_max_suppression/NonMaxSuppressionV3 (defined at bug.py:15)  = NonMaxSuppressionV3[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/GatherV2_1/_29, map/while/GatherV2/_31, map/while/non_max_suppression/NonMaxSuppressionV3/max_output_size/_33, map/while/non_max_suppression/iou_threshold/_35, map/while/non_max_suppression/score_threshold/_37)]]
2018-10-04 14:59:14.736180: F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)
2018-10-04 14:59:49.523436: F tensorflow/core/common_runtime/bfc_allocator.cc:380] Check failed: h != kInvalidChunkHandle
2018-10-05 00:12:03.720295: F ./tensorflow/core/framework/tensor.h:643] Check failed: new_num_elements == NumElements() (39 vs. 0)

InvalidArgumentError (see above for traceback): indices[1] = [0] is repeated
         [[{{node map/while/SparseToDense}} = SparseToDense[T=DT_BOOL, Tindices=DT_INT32, _class=["loc:@map/while/TensorArrayWrite/TensorArrayWriteV3"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/Neg_1/_51, map/while/Shape/_53, map/while/SparseToDense/sparse_values/_55, map/while/SparseToDense/default_value/_57)]]
         [[{{node map/while/SparseToDense/sparse_values/_54}} = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_111_map/while/SparseToDense/sparse_values", _device="/job:localhost/replica:0/task:0/device:GPU:0"](map/while/SparseToDense/sparse_values)]]

After being distilled to this small repro, it mostly segfaults, but the error messages above might still help. This looks like memory corruption.
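One way to see why errors such as "indices[1] = [0] is repeated" are unreasonable is to write the per-class computation in plain NumPy. The sketch below is an approximation, not TensorFlow's implementation: `greedy_nms` is a hypothetical stand-in for `tf.image.non_max_suppression` (greedy, boxes as `[y1, x1, y2, x2]`, a box is suppressed when its IoU with an already kept box exceeds the threshold), and its tie-breaking may differ from TF's. Since NMS keeps each box at most once and `ids` contains no duplicates, the indices passed to the final scatter should never repeat.

```python
import numpy as np


def iou(box, others):
    """IoU of one box against an array of boxes, all as [y1, x1, y2, x2]."""
    y1 = np.maximum(box[0], others[:, 0])
    x1 = np.maximum(box[1], others[:, 1])
    y2 = np.minimum(box[2], others[:, 2])
    x2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    union = area + areas - inter
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)


def greedy_nms(boxes, scores, max_output_size=100, iou_threshold=0.5):
    """Hypothetical stand-in for tf.image.non_max_suppression (greedy NMS)."""
    order = np.argsort(-scores, kind='stable')  # highest score first
    keep = []
    for i in order:
        if len(keep) >= max_output_size:
            break
        # Keep box i only if it does not overlap too much with any kept box.
        if not keep or np.all(iou(boxes[i], boxes[keep]) <= iou_threshold):
            keep.append(i)
    return np.array(keep, dtype=np.int64)


def reference_mask(prob, box, score_thresh=0.05):
    """NumPy version of the inner function in the repro, for one class."""
    ids = np.flatnonzero(prob > score_thresh)
    selection = ids[greedy_nms(box[ids], prob[ids])]
    # Each surviving index appears exactly once, so a scatter cannot see repeats.
    assert len(np.unique(selection)) == len(selection)
    mask = np.zeros(prob.shape, dtype=bool)
    mask[selection] = True
    return mask
```

Running `reference_mask` per class on the arrays fed to the graph gives a baseline to compare against the output of a TensorFlow version that does not crash (1.10, per the report above), keeping in mind that score ties may be resolved differently.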
