Problem Description
Hi developers,
We are AMD developers from the AI Group. We use vLLM to run LLM models, and we hit an error during HIP graph capture. This is currently a blocking issue for us. The relevant part of the log is shown below:
[rank0]:[E1030 15:32:32.970893361 ProcessGroupNCCL.cpp:2055] [PG ID 5 PG GUID 51 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
Search for `hipErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__HIPRT__TYPES.html for more information.
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
Exception raised from c10_hip_check_implementation at /app/pytorch/c10/hip/HIPException.cpp:45 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f789de5f1bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x374e1 (0x7f78cec984e1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1f1 (0x7f78cec98371 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f78d1bbddde in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f78d1bcdc90 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7f78d1bd1a3e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x117 (0x7f78d1bd3d27 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xdc253 (0x7f789c2ef253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f78e5d60ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f78e5df1a74 in /lib/x86_64-linux-gnu/libc.so.6)
The key message in the log is "HIP error: operation not permitted when stream is capturing", so we would like to know which operation cannot be captured by the HIP runtime. How can we debug this kind of error?
Thank you.
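As a first triage step, the stack trace above can be reduced to the error message plus the frames that have resolved symbols. The following is only a minimal sketch for reading a pasted log like this one; the helper names `summarize_trace` and `named_frames` are hypothetical and not part of PyTorch, vLLM, or ROCm, and the regular expressions assume the exact `frame #N: symbol + 0xOFF (0xADDR in library)` layout seen in this log. In the trace above, the resolved frames all belong to the ProcessGroupNCCL watchdog thread (`Watchdog::runLoop` -> `WorkNCCL::isCompleted` -> `WorkNCCL::finishedGPUExecutionInternal`), which suggests the error was reported by a completion check on that thread rather than by a model op on the capturing thread; that reading is an assumption based only on the frame names.

```python
import re

# Assumed layout of a frame line in this log:
#   frame #3: <symbol> + 0x5e (0x7f78d1bbddde in /path/to/libtorch_hip.so)
FRAME_RE = re.compile(
    r"^frame #(?P<index>\d+): (?P<symbol>.+?) \+ 0x[0-9a-f]+ "
    r"\(0x[0-9a-f]+ in (?P<library>[^)]+)\)$"
)
# Assumed layout of the error line: "... HIP error: <message>"
ERROR_RE = re.compile(r"HIP error: (?P<message>.+)$")


def summarize_trace(log_text):
    """Return (first HIP error message or None, [(index, symbol, library), ...])."""
    error = None
    frames = []
    for line in log_text.splitlines():
        line = line.strip()
        frame = FRAME_RE.match(line)
        if frame:
            frames.append((int(frame["index"]), frame["symbol"], frame["library"]))
            continue
        if error is None:
            found = ERROR_RE.search(line)
            if found:
                error = found["message"]
    return error, frames


def named_frames(frames):
    """Keep only frames whose symbol was resolved by the unwinder."""
    return [f for f in frames if f[1] != "<unknown function>"]


# Three lines copied from the log in this issue.
SAMPLE_LOG = """\
[rank0]:[E1030 15:32:32.970893361 ProcessGroupNCCL.cpp:2055] [PG ID 5 PG GUID 51 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted when stream is capturing
frame #1: <unknown function> + 0x374e1 (0x7f78cec984e1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f78d1bbddde in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so)
"""

error, frames = summarize_trace(SAMPLE_LOG)
named = named_frames(frames)

print("error:", error)
for index, symbol, library in named:
    print(f"#{index} {symbol} [{library.rsplit('/', 1)[-1]}]")
```

Running the same helper over the full log keeps only frames #0 and #2 through #6 plus `clone`, which makes it easier to see which thread and which call reported the error.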
Operating System
Ubuntu 22.04
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD MI355 * 8
ROCm Version
ROCm 7.0.1
ROCm Component
HIP
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response