Open
Labels: type:bug
Description
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
tf 2.20
Custom code
Yes
OS platform and distribution
Ubuntu 22.04
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA 12.5.1, cuDNN 9.2.1
GPU model and memory
No response
Current behavior?
compute-sanitizer reports an out-of-bounds global read in MatrixDiagPartKernel when the code below runs on a GPU.
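For context, MatrixDiagPart with default arguments returns the main diagonal of each innermost matrix. A small NumPy stand-in for the repro below (a tiny n instead of 46341, so no multi-GB allocation is needed) shows the expected result shape:

```python
import numpy as np

n = 4  # small stand-in for the 46341 x 46341 input in the repro
a = np.ones((n, n), dtype=np.float32)

# Equivalent of tf.raw_ops.MatrixDiagPart(input=a) with defaults:
# the main diagonal, shape [n]
diag = np.diagonal(a)
print(diag.shape)  # (4,)
```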
Standalone code to reproduce the issue
# MatrixDiagPartOp
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
assert gpus, 'No GPU found'
tf.config.experimental.set_memory_growth(gpus[0], True)

with tf.device("/GPU:0"):
    # values truncated in crash log; using uniform fill to match declared shape
    input = tf.ones([46341, 46341], dtype=tf.float32)
    tf.raw_ops.MatrixDiagPart(input=input)

Relevant log output
========= Invalid __global__ read of size 4 bytes
========= at void tensorflow::functor::MatrixDiagPartKernel<float>(int, int, int, int, int, int, int, T1, bool, bool, const T1 *, T1 *)+0x690
========= by thread (260,0,0) in block (45,0,0)
========= Address 0x7f3f80004860 is out of bounds
========= and is 8589916064 bytes before the nearest allocation at 0x7f4180000000 of size 17179869184 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
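A plausible root cause (my assumption, not confirmed by the log): the kernel indexes the input with signed 32-bit ints, and a 46341 x 46341 matrix has 2,147,488,281 elements, just over INT32_MAX, so a flat element index can wrap negative. A wrapped index of about -2.15e9 floats corresponds to a read roughly 8.59 GB below the allocation, which is consistent with the "8589916064 bytes before the nearest allocation" figure above. A quick arithmetic check:

```python
# Check whether the input size overflows a signed 32-bit flat index
n = 46341
total = n * n                 # number of float32 elements in the input
INT32_MAX = 2**31 - 1

print(total)                  # 2147488281
print(total > INT32_MAX)      # True: exceeds INT32_MAX by 4634

# Value a C int32 would wrap to if a flat index reached `total`
wrapped = (total + 2**31) % 2**32 - 2**31
print(wrapped)                # -2147479015
```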