[Block Issue][hip graph]: HIP error: operation not permitted when stream is capturing #3876

@zejunchen-zejun

Problem Description

Hi, developers

We are AMD developers from the AI Group. We use vLLM to run LLM models, and we hit a HIP graph issue during HIP graph capture. This is currently a blocking issue for us. Part of the log is shown below:

[rank0]:[E1030 15:32:32.970893361 ProcessGroupNCCL.cpp:2055] [PG ID 5 PG GUID 51 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
Search for `hipErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__HIPRT__TYPES.html for more information.
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:45 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f789de5f1bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x374e1 (0x7f78cec984e1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1f1 (0x7f78cec98371 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f78d1bbddde in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f78d1bcdc90 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7f78d1bd1a3e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x117 (0x7f78d1bd3d27 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xdc253 (0x7f789c2ef253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f78e5d60ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f78e5df1a74 in /lib/x86_64-linux-gnu/libc.so.6)

The key line in the log is "HIP error: operation not permitted when stream is capturing", so we would like to know which op cannot be captured by the HIP runtime. How can we debug this kind of error?
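
As a starting point for reading the trace above, here is a minimal sketch of a stdlib-only Python helper that skips the error-reporting frames and reports the first named caller. The function names `parse_frames` and `first_caller_above_error_check` are hypothetical and not part of PyTorch or ROCm, and the sample text is copied from the first frames of the log in this issue. On those frames the first named caller is `WorkNCCL::finishedGPUExecutionInternal`, reached from the NCCL watchdog thread, which suggests the failing call may be the watchdog's completion check rather than a kernel launched inside the captured region. This is only a reading of the trace, not a confirmed root cause, since the log itself warns that the stack trace might be incorrect.

```python
import re

# Matches c10 backtrace lines of the form:
#   frame #3: <symbol> + 0x5e (0x7f... in /path/to/lib.so)
FRAME_RE = re.compile(
    r"^frame #(?P<idx>\d+): (?P<symbol>.+?) \+ 0x[0-9a-f]+ "
    r"\(0x[0-9a-f]+ in (?P<lib>\S+)\)$"
)

# First five frames copied from the log in this issue.
SAMPLE_LOG = """
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f789de5f1bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x374e1 (0x7f78cec984e1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1f1 (0x7f78cec98371 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f78d1bbddde in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f78d1bcdc90 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
"""


def parse_frames(log_text):
    """Return (index, symbol, library basename) for each backtrace frame."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            lib = m.group("lib").rsplit("/", 1)[-1]
            frames.append((int(m.group("idx")), m.group("symbol"), lib))
    return frames


def first_caller_above_error_check(frames):
    """Return the first named frame that is not part of c10's error reporting.

    Assumption: frames for c10::Error and c10_hip_check_implementation only
    report the failure, so the next named frame is the code that made the
    failing HIP call.
    """
    for idx, symbol, lib in frames:
        if symbol == "<unknown function>":
            continue
        if symbol.startswith("c10::Error") or "c10_hip_check_implementation" in symbol:
            continue
        return idx, symbol, lib
    return None


if __name__ == "__main__":
    print(first_caller_above_error_check(parse_frames(SAMPLE_LOG)))
```

The same helper can be pointed at a full log file by passing its contents to `parse_frames`; lines that are not backtrace frames are ignored.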

Thank you.

Operating System

Ubuntu 22.04

CPU

AMD EPYC 9575F 64-Core Processor

GPU

AMD MI355 * 8

ROCm Version

ROCm 7.0.1

ROCm Component

HIP

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
