0x29 Error on Llama_8b_fp16

### What happened?

I have a reproducer with `Llama3.1_8b_fp16` that fails with the following error on a fairly long prompt (18,432 tokens) on `mi300`:

```bash
EXEC @prefill_bs4
:0:rocdevice.cpp            :2991: 218582848177 us:  Callback: Queue 0x7f0810400000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)
```

I tried to capture a trace with `tracy_capture`, but the capture itself failed with the following error:

```bash
Connecting to 127.0.0.1:8090...*** buffer overflow detected ***: terminated
Aborted (core dumped)
```

I tried with IREE at HEAD and with the [3.7.0rc20250908](https://github.com/iree-org/iree/releases/tag/iree-3.7.0rc20250908) release currently pinned in `shark-ai`. I also pulled in the fix [Don't inline immutable globals with non-util dialect attrs](https://github.com/iree-org/iree/pull/21986) to check whether this was a duplicate of #21946. All three configurations threw the same error.

I had `watch -n0 rocm-smi` running while invoking `iree-run-module` and did not see anything suspicious in VRAM usage.

### Steps to reproduce your issue

1. Download the MLIR:
```bash
az storage blob download \
    --account-name sharkblobs \
    --container-name stephen \
    --name iree_0x29_error/mlir/llama_8b_fp16.mlir \
    --file llama_8b_fp16.mlir \
    --account-key <account_key>
```
2. Download the weights (if needed):
```bash
az storage blob download \
    --account-name sharkblobs \
    --container-name stephen \
    --name iree_0x29_error/weights/llama.irpa \
    --file llama.irpa \
    --account-key <account_key>
```
3. Download the inputs:
```bash
az storage blob download \
    --account-name sharkblobs \
    --container-name stephen \
    --name iree_0x29_error/inputs/tokens.npy \
    --file tokens.npy \
    --account-key <account_key>
az storage blob download \
    --account-name sharkblobs \
    --container-name stephen \
    --name iree_0x29_error/inputs/seq_lens.npy \
    --file seq_lens.npy \
    --account-key <account_key>
az storage blob download \
    --account-name sharkblobs \
    --container-name stephen \
    --name iree_0x29_error/inputs/seq_block_ids.npy \
    --file seq_block_ids.npy \
    --account-key <account_key>
az storage blob download \
    --account-name sharkblobs \
    --container-name stephen \
    --name iree_0x29_error/inputs/page_table.npy \
    --file page_table.npy \
    --account-key <account_key>
```
4. Compile model:
```bash
iree-compile llama_8b_fp16.mlir \
    -o llama_8b_fp16.vmfb \
    --iree-hal-target-device=hip \
    --iree-hip-target=gfx942 \
    --iree-opt-level=O3 \
    --iree-hal-indirect-command-buffers=true \
    --iree-stream-resource-memory-model=discrete \
    --iree-hal-memoization=true
```
5. Invoke `iree-run-module`:
```bash
iree-run-module \
    --module=llama_8b_fp16.vmfb \
    --parameters=model=llama.irpa \
    --device=hip://0 \
    --function=prefill_bs4 \
    --input=@./tokens.npy \
    --input=@./seq_lens.npy \
    --input=@./seq_block_ids.npy \
    --input=@./page_table.npy
```
6. See error:
```bash
EXEC @prefill_bs4
:0:rocdevice.cpp            :2991: 218977493659 us:  Callback: Queue 0x7feb10d00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)
```
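
For convenience, the four input downloads in step 3 could be collapsed into a loop. The following is only a sketch: it assumes the Azure CLI's `az storage blob download` subcommand, and the helper names (`input_blob_name`, `download_input`, `download_all_inputs`) and the `ACCOUNT_KEY` and `DRY_RUN` environment variables are hypothetical, introduced here for illustration.

```bash
# Sketch: fetch the four input tensors with one loop.
# ACCOUNT_KEY and DRY_RUN are hypothetical environment variables (assumptions).

# Map an input name to its blob path, e.g. tokens -> iree_0x29_error/inputs/tokens.npy
input_blob_name() {
    printf 'iree_0x29_error/inputs/%s.npy\n' "$1"
}

# Download one input tensor. With DRY_RUN=1, print the command instead of running it.
download_input() {
    name="$1"
    set -- az storage blob download \
        --account-name sharkblobs \
        --container-name stephen \
        --name "$(input_blob_name "$name")" \
        --file "${name}.npy" \
        --account-key "${ACCOUNT_KEY:-<account_key>}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download every input used by prefill_bs4, stopping at the first failure.
download_all_inputs() {
    for input in tokens seq_lens seq_block_ids page_table; do
        download_input "$input" || return 1
    done
}
```

After sourcing the snippet, `DRY_RUN=1 download_all_inputs` prints the four commands without contacting the storage account, and `ACCOUNT_KEY=<account_key> download_all_inputs` runs them.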

### What component(s) does this issue relate to?

_No response_

### Version information

d588da331a6ad7ffad3df9f4bf8042a2e0dd4ffa

### Additional context

This seems to be the root cause of the following identified issue in shark-ai:

[[Regression] Regression in Llama Prefill causing 0x29 error](https://github.com/nod-ai/shark-ai/issues/2279)

As noted above, I tried to collect a trace, but the capture failed with the following error:

```bash
Connecting to 127.0.0.1:8090...*** buffer overflow detected ***: terminated
Aborted (core dumped)
```
