Memory Corruption in matmul with rank-4 input tensor and ONEDNN enabled #117700

@jasminetrail

Description

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

tf 2.21.0

Custom code

Yes

OS platform and distribution

Linux, 22.04.1-Ubuntu

Mobile device

No response

Python version

3.10.12

Bazel version

v1.27.0

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

tf.linalg.matmul crashes in eager mode when the first operand is a rank-4 tensor (shape [13, 1, 3, 4]) and the second is a rank-2 tensor (shape [4, 16]). The failure is a native crash, not a Python exception. Observed errors include:

  • munmap_chunk(): invalid pointer
  • corrupted double-linked list
  • free(): invalid pointer
  • free(): invalid next size (fast)
  • Segmentation fault (core dumped)

The crash disappears when oneDNN optimizations are disabled (TF_ENABLE_ONEDNN_OPTS=0), which suggests the bug is in the oneDNN CPU backend.
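Because the failure kills the interpreter rather than raising an exception, one way to compare the two oneDNN settings is to run the repro in a child process and inspect its return code. The following is a minimal sketch, not part of the original report: the helper names `run_child` and `describe` are hypothetical, and it assumes a POSIX system where a negative return code means the child was killed by a signal. Passing `-X faulthandler` asks CPython to dump a Python traceback on fatal signals such as SIGSEGV or SIGABRT.

```python
import os
import signal
import subprocess
import sys

# Same computation as the repro script in this report; the oneDNN flag is
# supplied through the child's environment instead of os.environ.
REPRO = """
import tensorflow as tf
tf.random.set_seed(280958)
p = tf.random.normal([13, 1, 3, 4])
w = tf.random.normal([4, 16])
out = tf.linalg.matmul(p, w)
print('output shape:', out.shape)
"""


def run_child(code, onednn_flag):
    """Run `code` in a fresh interpreter with TF_ENABLE_ONEDNN_OPTS set."""
    env = dict(os.environ, TF_ENABLE_ONEDNN_OPTS=onednn_flag)
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative value is the number of the terminating signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


if __name__ == "__main__":
    for flag in ("1", "0"):
        rc, err = run_child(REPRO, flag)
        print(f"TF_ENABLE_ONEDNN_OPTS={flag}: {describe(rc)}")
        if rc != 0:
            # Show only the tail of stderr, where the crash message appears.
            print("\n".join(err.splitlines()[-5:]))
```

Since heap corruption does not always crash at the same point, a single clean run with the flag set to 1 would not by itself show that the bug is absent; repeating the comparison several times is a reasonable precaution.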

Standalone code to reproduce the issue

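import os
# The flag must be set before TensorFlow is imported to take effect.
# '1' enables oneDNN optimizations (crashes); '0' disables them (no crash).
os.environ['TF_ENABLE_ONEDNN_OPTS']='1'
import tensorflow as tf
tf.random.set_seed(280958)

# Rank-4 left operand and rank-2 right operand; the inner dimensions (4) match.
p = tf.random.normal([13, 1, 3, 4])
w = tf.random.normal([4, 16])

# The native crash occurs around this call, before or instead of the print below.
out = tf.linalg.matmul(p, w)

print('output:', out)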
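For reference, the shapes in the script above are compatible: tf.linalg.matmul broadcasts the batch dimensions, so a successful run should produce an output of shape [13, 1, 3, 16]. The sketch below works this out in plain Python without TensorFlow. The function name `matmul_output_shape` is hypothetical and only illustrates the broadcasting rule for operands of rank 2 or higher; it is not a TensorFlow API.

```python
from itertools import zip_longest


def matmul_output_shape(a_shape, b_shape):
    """Output shape of a batched matmul with broadcast batch dimensions."""
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("both operands must have rank >= 2")

    # Split each shape into batch dimensions and the trailing matrix dims.
    *a_batch, m, k_a = a_shape
    *b_batch, k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} vs {k_b}")

    # Broadcast batch dimensions right-to-left, padding the shorter with 1.
    batch = []
    for da, db in zip_longest(reversed(a_batch), reversed(b_batch), fillvalue=1):
        if da != db and 1 not in (da, db):
            raise ValueError(f"batch dimensions not broadcastable: {da} vs {db}")
        batch.append(db if da == 1 else da)

    return list(reversed(batch)) + [m, n]


print(matmul_output_shape([13, 1, 3, 4], [4, 16]))  # [13, 1, 3, 16]
```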

Relevant log output

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1778010199.773678  758771 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I0000 00:00:1778010199.774000  758771 cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
I0000 00:00:1778010199.812209  758771 cpu_feature_guard.cc:227] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1778010200.986329  758771 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I0000 00:00:1778010200.987024  758771 cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
E0000 00:00:1778010201.097696  758771 cuda_platform.cc:52] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
corrupted double-linked list
Aborted (core dumped)
