
lfmeadow · 2026-05-31T18:14:59Z

Status

Supersedes draft PR #2714
(same logical patch content, rebased onto current amd-staging; the
intermediate multi-device fix in commit b1b20686afe becomes redundant
under this drain and its tests under compiler-rt/test/profile/GPU/
continue to work via the retained ABI forwarder).

This revision also addresses the static-review feedback gathered against
#2714 — see "Review-driven changes" below.

Build, install & usage

Instructions for building the amdgcn device profile runtime, installing it
into the clang resource directory, and compiling/running/reporting HIP
device coverage live in the new "HIP / AMDGPU device code coverage"
section of clang/docs/SourceBasedCodeCoverage.rst
(added by this PR).

Summary

Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain. At process exit the
drain walks every loaded HSA code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, D2H-copies its
counters/data/names back to the host, and writes an arch-prefixed
.profraw via __llvm_write_custom_profile.

What this changes vs. the existing approach

- Host and device drains are fully decoupled. The device drain runs
  from an atexit handler installed by a library constructor in
  libclang_rt.profile. Device counters are collected whether or not
  the host TUs were instrumented and without any host-side per-TU
  shadow, CUID matching, or module-load interception.
- Cases the old 1-1 host↔device model could not handle now work:
  separate device-only modules loaded at runtime
  (hipModuleLoad / hsa_executable_*), an uninstrumented host, and
  multi-GPU. Multi-GPU no longer relies on hipGetSymbolAddress,
  removing the comgr-at-atexit null deref that the predecessor
  band-aid worked around by restricting collection to a single device.
- Clang stops emitting any PGO-specific machinery for HIP.
  CGCUDANV.cpp loses the offload-profiling shadow/registration code
  entirely. For HIP+PGO host links the driver force-links the host drain
  object via -u__llvm_profile_hip_collect_device_data (the drain's
  atexit handler is otherwise unreferenced now that the host emits no
  shadow).
- Device profile runtime is linked on the offload device link.
  LinkerWrapper::ConstructJob now forwards the static device profile
  runtime (libclang_rt.profile-<arch>.a, defining
  __llvm_profile_instrument_gpu and the __llvm_profile_sections bounds
  table) to each GPU device linker when instrumentation is enabled. Without
  this a plain clang -x hip -fprofile-instr-generate -fcoverage-mapping
  full link fails with undefined symbol: __llvm_profile_instrument_gpu,
  since the offload device link otherwise only pulls in libc/builtins (and
  only for OpenMP). The device runtime must be installed at the per-target
  resource path lib/<device-triple>/libclang_rt.profile.a for the driver
  to find it.
- ABI compatibility: legacy
  __llvm_profile_offload_register_{shadow,section_shadow,dynamic_module}
  and __llvm_profile_hip_collect_device_data are retained as
  no-ops / forwarders so binaries compiled against the previous runtime
  still link and produce output.
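
For readers unfamiliar with the constructor-plus-atexit pattern mentioned in the first item, here is a minimal sketch of the idea. It is not the code from this PR: `drainDevicesAtExit`, `installDrainHandler`, and `drainHandlerInstalled` are hypothetical names, and the real drain does far more than print a message. It assumes a GCC/Clang-style `__attribute__((constructor))`.

```cpp
#include <cstdio>
#include <cstdlib>

namespace {
// Constant-initialized, so it is already false before any constructor runs.
bool HandlerInstalled = false;

// Hypothetical stand-in for the device drain; the real one walks HSA code
// objects and writes .profraw files.
void drainDevicesAtExit() {
  std::fputs("draining device profile data (sketch)\n", stderr);
}

// Runs when the containing library or executable is loaded, before main().
// Registering here means the drain needs no reference from host code.
__attribute__((constructor)) void installDrainHandler() {
  HandlerInstalled = (std::atexit(drainDevicesAtExit) == 0);
}
} // namespace

// Lets callers (or tests) observe whether registration succeeded.
bool drainHandlerInstalled() { return HandlerInstalled; }
```

Because nothing in host code references the handler, a static link would normally discard the object that contains it, which is why the description relies on the `-u` force-link for HIP+PGO host links.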

Review-driven changes (vs the #2714 snapshot)

- drainDevices() idempotency: split into DrainInProgress /
  DrainCompleted; only latches "done" after a successful walk that
  actually drained data. Transient no-ops ("HSA/HIP not yet
  resolvable", "no GPU agents", "no loaded segments", "no instrumented
  sections") stay retryable so a late atexit call still picks up code
  objects that loaded after an early host-write trigger. Covered by
  the new compiler-rt/test/profile/AMDGPU/device-early-collect.hip.
- HIP runtime resolution: tries RTLD_DEFAULT first (catches the
  common case of HIP already being in the process namespace, including
  runtime-only ROCm installs without an unversioned dev symlink), then
  falls back to dlopen of libamdhip64.so.{7,6,5,4} and finally the
  unversioned .so.
- Section bounds validation: processDeviceSections now does
  uintptr_t-based size math, rejects End < Begin and per-section
  spans above 256 MiB, requires
  DataSize % sizeof(__llvm_profile_data) == 0, and per-record
  validates that each CounterPtr resolves inside the copied counters
  region (out-of-range entries are zeroed and warned about instead of
  producing a .profraw that points at unrelated memory).
- HSA symbol-iter error path: a non-SUCCESS/non-INFO_BREAK
  return from hsa_executable_iterate_agent_symbols is now warned about
  and reflected in the drain's exit status.
- lit gate tightened: compiler-rt/test/profile/lit.cfg.py now
  requires /dev/kfd plus a usable HIP install (probed via
  $ROCM_PATH or /opt/rocm) plus the amdgcn device profile runtime
  in the resource directory before enabling the hip / amdgpu
  features. Exports %hip_lib_path and %amdgpu_arch so the new
  AMDGPU/*.hip tests stay portable (and consistent with the existing
  GPU/instrprof-hip-* tests).
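
The section bounds checks listed above can be illustrated with a small sketch. This is not the body of processDeviceSections: `ProfDataRecord` is a hypothetical, simplified stand-in for `__llvm_profile_data`, the helper names are invented for illustration, and `CounterPtr` is assumed to hold an absolute device address (the real encoding may differ).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical, simplified stand-in for __llvm_profile_data.
struct ProfDataRecord {
  uint64_t NameRef;
  uint64_t CounterPtr;  // ASSUMPTION: absolute device address of the counters
  uint32_t NumCounters;
  uint32_t Padding;
};

constexpr uintptr_t MaxSectionSpan = 256u * 1024 * 1024; // 256 MiB cap

// Computes End - Begin in uintptr_t; rejects inverted or oversized spans.
bool sectionSpan(uintptr_t Begin, uintptr_t End, size_t &Size) {
  if (End < Begin)
    return false;
  uintptr_t Span = End - Begin;
  if (Span > MaxSectionSpan)
    return false;
  Size = Span;
  return true;
}

// The data section must hold a whole number of records.
bool dataSizeIsWholeRecords(size_t DataSize) {
  return DataSize % sizeof(ProfDataRecord) == 0;
}

// Zeroes CounterPtr on every record whose counters do not lie fully inside
// [CntBegin, CntEnd); returns how many records were zeroed.
size_t sanitizeCounterPtrs(std::vector<ProfDataRecord> &Records,
                           uintptr_t CntBegin, uintptr_t CntEnd) {
  size_t Bad = 0;
  for (ProfDataRecord &R : Records) {
    uint64_t Bytes = uint64_t(R.NumCounters) * sizeof(uint64_t);
    bool InRange = R.CounterPtr >= CntBegin && R.CounterPtr <= CntEnd &&
                   Bytes <= CntEnd - R.CounterPtr;
    if (!InRange) {
      std::fprintf(stderr, "warning: counter pointer out of range, zeroed\n");
      R.CounterPtr = 0;
      ++Bad;
    }
  }
  return Bad;
}
```

Comparing `Bytes` against `CntEnd - R.CounterPtr` rather than computing `R.CounterPtr + Bytes` avoids overflow when a corrupt record carries a pointer near the top of the address space.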

Validation

Built clean against current amd-staging on Linux x86_64 + AMDGPU:
host clang/lld/compiler-rt, and libclang_rt.profile-amdgcn.a
(device profile runtime) for the amdgcn target.

End-to-end exercised with RCCL under its --enable-device-coverage
fast build path (debug + -gline-tables-only -O1, default device
linker, gfx90a, ~10 min incremental relink against the rebased runtime):

- librccl.so carries the expected
  __llvm_prf_{names,cnts,data,vnds} + __llvm_covfun +
  __llvm_covmap sections and the
  __llvm_profile_hip_collect_device_data forwarder.
- all_reduce_perf -g 2 produces 1 host .profraw + 26 device
  gfx90a.*.profraw per rank (one per loaded HSA executable), all
  LLVM raw profile version 10.
- Verbose drain log:
  HIP resolved via existing process namespace (RTLD_DEFAULT) and
  walk complete: agents=4 pairs=28 found=26 drained=26 iter-failures=0,
  with zero out-of-range counter pointer warnings.
- llvm-profdata merge + llvm-cov report produce realistic per-file
  coverage for both the host librccl.so and the per-arch device ELF.

ronlieb · 2026-05-31T19:21:58Z

Does this need to go upstream at some point?

yxsamliu · 2026-06-01T19:48:45Z

I like this approach. The main concern for me is whether it works on Windows. My PR used hipRegister* mainly due to needing to support offload pgo on Windows.

lfmeadow · 2026-06-02T11:24:28Z

> Does this need to go upstream at some point?

I don't really understand our upstream strategy. This is largely dependent on HIP. I don't know if it would ever be ported to NVIDIA, they have their own tools. I guess someone else needs to answer that question.

ronlieb · 2026-06-02T15:01:08Z

if this PR is needed to get device code coverage working with a trunk build, then yes, upstream

yxsamliu · 2026-06-02T15:16:28Z

On Windows, HIP runs on top of PAL (Platform Abstraction Library) rather than ROCm/HSA, so hsa-runtime64.dll is not available — the HSA runtime API is not exposed as a separate DLL on Windows at all. The only HSA-named files in the Windows system directory are hsa-thunk64.dll / hsa-thunk.dll, which are the kernel-mode KMT thunk and only export hsaKmt* symbols, not the HSA runtime API. So hsa_init, hsa_iterate_agents, hsa_executable_iterate_agent_symbols, hsa_system_get_major_extension_table, and query_segment_descriptors are all absent on Windows. Only hipMemcpy is available.

So the HSA introspection approach can't work on Windows. My original hipRegister*-based approach worked there because it only needed HIP APIs, which are available on both platforms.

Would it be possible to keep the old shadow-registration path for Windows (#ifdef _WIN32) and use the new HSA introspection drain on Linux? That way Windows users don't lose device PGO support, and Linux gets the cleaner decoupled approach.

Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device profile drain with an HSA-introspection drain that walks every loaded device code object at process exit, finds the canonical __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and writes an arch-prefixed .profraw via __llvm_write_custom_profile.

Host and device drains are now fully independent: the drain runs from an atexit handler registered in a library constructor, so device counters are collected whether or not the host TUs were instrumented and without any host-side per-TU shadow, CUID matching, or module-load interception. This fixes the cases the old 1-1 host<->device model could not handle (separate device-only modules, uninstrumented host, multi-GPU).

* compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration, hsa_ven_amd_loader segment descriptors, executable symbol walk, bounds dedup, idempotent drainDevices + atexit, collision-free target names). Legacy __llvm_profile_offload_register_* symbols kept as no-ops for ABI compatibility. Guarded host-only for GPU builds.
* clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling machinery (OffloadProfShadow, emitOffloadProfilingSections, both shadow-registration sites). Clang emits nothing PGO-specific for HIP.
* clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with PGO, force-link the drain object via -u__llvm_profile_hip_collect_device_data (it is otherwise unreferenced now that the host emits no shadow).
* compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn / nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime (the sole source of __llvm_profile_sections after this change) is built for GPU targets.

Tests:

* clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the per-CUID struct / shadow / registration are NOT emitted.
* clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is added for HIP+PGO and only then.
* compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy no-op symbols still link and run.
* compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu feature gate.

Co-authored-by: Cursor <[email protected]>

Add a "HIP / AMDGPU device code coverage" section to SourceBasedCodeCoverage.rst covering how to build the amdgcn device profile runtime (COMPILER_RT_BUILD_PROFILE_ROCM), install it into the clang resource directory, and compile/run/report device coverage, including the manual force-link required for non-HIP hosts. Co-authored-by: Cursor <[email protected]>

The HSA-introspection drain relies on the device profile runtime (libclang_rt.profile-<arch>.a, defining __llvm_profile_instrument_gpu and the __llvm_profile_sections bounds table) being linked into the device image. On the --offload-new-driver full-link path used by HIP, nothing pulled it in: clang only forwarded the device C lib / builtins / flang-rt to clang-linker-wrapper, and only for OpenMP host offloading. A plain `clang -x hip -fprofile-instr-generate -fcoverage-mapping` link therefore failed with `undefined symbol: __llvm_profile_instrument_gpu`.

Forward the static device profile runtime to each GPU (AMDGPU/NVPTX) device linker when instrumentation is enabled, in the per-device-toolchain loop in LinkerWrapper::ConstructJob (next to the existing -lompdevice handling). The full archive path is used because the device runtime is arch-suffixed and would not resolve via a -l name; the VFS exists() guard keeps this a no-op when the device profile runtime is not installed.

Also update the retained GPU/instrprof-hip-* tests for the new drain:

- multi-gpu: assert the HSA agent walk drained device data (the drain now walks all agents by design) instead of the old single-device messages.
- multiple-kernels: correct the stale mangled names (_Z5scalePii, _Z6negatePii).
- coverage: anchor the REPORT-NOT so "0.00%" doesn't match inside "80.00%".

Document the device profile runtime build/install (per-target resource path) in SourceBasedCodeCoverage.rst.

With this, compiler-rt's check-profile passes the AMDGPU/*.hip and GPU/instrprof-hip-*.hip suites on gfx90a (145 passed, 0 failed).

Co-authored-by: Cursor <[email protected]>

lfmeadow · 2026-06-03T19:44:04Z

Folks, I just realized that this project isn't properly set up for TheRock build and CI. I'm working on revamping that. The good news is that it will make private builds easier, but you'll end up having to use more TheRock infra. Stay tuned, I'm new to this myself.

Enable COMPILER_RT_BUILD_PROFILE in compiler-rt/cmake/caches/GPU.cmake, the cache file TheRock's amdgcn-amd-amdhsa runtimes target uses (via compiler/pre_hook_amd-llvm.cmake). With that target's LLVM_ENABLE_PER_TARGET_RUNTIME_DIR=ON, libclang_rt.profile.a is now built and installed to lib/clang/<v>/lib/amdgcn-amd-amdhsa/ as a normal part of the build and packaged via the amd-llvm artifact (force_include lib/llvm/lib/clang/**) -- so a packaged ROCm toolchain links the device runtime on a HIP + -fprofile-instr-generate/-fcoverage-mapping device link with no manual step. Also force COMPILER_RT_BUILD_PROFILE_ROCM OFF for this target: the host-side HSA drain (InstrProfilingPlatformROCm.cpp) dlopen's HSA/HIP and is host-only, so it must never be compiled for amdgcn. The host (default) target keeps it on by default and carries the drain in libclang_rt.profile-x86_64.a. Document the integrated layout in SourceBasedCodeCoverage.rst. Co-authored-by: Cursor <[email protected]>

The amdgcn-amd-amdhsa runtimes build is freestanding (-nostdlibinc, no host libc headers), so building the full profile runtime there fails with 'unistd.h'/'fcntl.h'/'sys/file.h' not found (InstrProfilingPort.h includes <unistd.h> unless COMPILER_RT_PROFILE_BAREMETAL, and the filesystem sources pull in fcntl.h/sys/file.h). Set COMPILER_RT_PROFILE_BAREMETAL in compiler-rt/cmake/caches/GPU.cmake so the device profile build uses its baremetal subset: it drops the filesystem / value-profiling sources and skips <unistd.h>, while keeping InstrProfilingPlatformGPU.c (__llvm_profile_instrument_gpu and the __llvm_profile_sections bounds table) -- exactly what the device link needs. The host (default) runtimes target is unaffected and keeps the full profile runtime + HSA drain. Verified: the amdgcn profile archive builds cleanly under the GPU.cmake baremetal config and check-profile (incl. the AMDGPU/GPU device tests) passes on gfx90a with the trimmed archive installed. Co-authored-by: Cursor <[email protected]>

Fix issues raised in review of the HSA introspection drain:

- processDeviceSections() now returns 1/0/-1 so an empty device section is no longer miscounted as a successful drain (and never latches DrainCompleted); only a positive result advances the drain count.
- Zero the CounterPtr field with sizeof(CounterPtr) instead of sizeof(uint64_t) so a 32-bit host can't clobber adjacent record fields.
- RuntimeState now uses acquire/release atomics; observing RuntimeState==1 happens-after the resolved function-pointer writes.
- DrainInProgress/DrainCompleted now use an atomic CAS claim plus release stores, so concurrent/reentrant drains can't both run the walk and corrupt the global SeenBounds table.
- lit.cfg.py: correct the feature-gating comment and accept the arch-less libclang_rt.profile.a (only under an amdgcn dir) in addition to the arch-suffixed name.

Co-authored-by: Cursor <[email protected]>
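
A rough sketch of the CAS-claim and latch-on-success behavior described in this commit, for illustration only. The flag names follow the commit message, but `drainDevicesOnce` and `WalkFn` are hypothetical, and the walk is reduced to a callback that returns a positive count when data was drained, 0 for a transient no-op, and a negative value on error.

```cpp
#include <atomic>

namespace {
std::atomic<bool> DrainInProgress{false};
std::atomic<bool> DrainCompleted{false};
} // namespace

// Hypothetical walk callback: >0 = sections drained, 0 = nothing to drain
// yet (retryable), <0 = error.
using WalkFn = int (*)();

// Returns 1 if this call drained data, 0 if there was nothing to do (already
// completed, another caller owns the walk, or a transient no-op), -1 on error.
int drainDevicesOnce(WalkFn Walk) {
  if (DrainCompleted.load(std::memory_order_acquire))
    return 0;

  // Claim the walk; only one caller can flip false -> true.
  bool Expected = false;
  if (!DrainInProgress.compare_exchange_strong(Expected, true,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire))
    return 0;

  // Re-check after claiming: a previous owner may have finished in between.
  if (DrainCompleted.load(std::memory_order_acquire)) {
    DrainInProgress.store(false, std::memory_order_release);
    return 0;
  }

  int Result = Walk();
  if (Result > 0)
    DrainCompleted.store(true, std::memory_order_release); // latch on success only
  DrainInProgress.store(false, std::memory_order_release);
  return Result > 0 ? 1 : (Result < 0 ? -1 : 0);
}
```

Latching only on a positive result is what keeps an early, empty trigger from suppressing the later atexit drain.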

Windows has no HSA runtime (HIP runs on PAL; hsa-runtime64.dll is absent), so the HSA introspection drain cannot run there. Restore the legacy hipRegister*/host-shadow mechanism for Windows, kept in its own file so the two paths stay free of interleaved #ifdef blocks:

- InstrProfilingPlatformROCmWindows.cpp: the legacy HIP host-shadow drain, guarded by _WIN32.
- InstrProfilingPlatformROCm.cpp (HSA drain): now also gated on !_WIN32.
- CMakeLists.txt: compile exactly one of the two per host platform.
- CGCUDANV: re-emit the per-TU __llvm_profile_sections_<CUID> device bounds table and host shadow + registration, behind a single "host is Windows" guard (aux triple on device compiles); inert on other hosts.
- Gnu.cpp: clarify that the host -u force-link is the Linux/ELF path only.

Decoupled (host-uninstrumented) collection remains Linux-only; Windows still requires instrumented host TUs, as in the original design.

offload-pgo-sections.hip now covers both paths: Windows emits the shadow + __hipRegisterVar + registration and the device bounds struct; Linux emits no host-shadow machinery.

Co-authored-by: Cursor <[email protected]>

lfmeadow · 2026-06-03T22:56:43Z

@yxsamliu Done — implemented exactly the dual path you suggested (commit ad69004).

Windows keeps the legacy hipRegister* host-shadow drain; Linux uses the HSA introspection drain.
To avoid #ifdef clutter, the Windows code lives in its own file (InstrProfilingPlatformROCmWindows.cpp) and the CMake compiles exactly one of the two by host platform — no interleaved #if _WIN32 … #else … #endif.
The CodeGen shadow emission (__llvm_profile_sections_<CUID> device bounds table + host shadow + __hipRegisterVar/registration) is restored behind a single "host is Windows" guard, checked via the aux triple on device compiles; it's inert on other hosts.

One intentional semantic difference: the decoupled, host-uninstrumented collection is Linux-only. Windows still requires the host TUs to be instrumented (same as your original PR), since the shadow registration lives in host code.

Caveat: I haven't been able to validate the Windows path end-to-end (no Windows+GPU CI leg here). It's effectively your previously-working mechanism moved into a separate file, but a Windows CI run would be good before we rely on it. offload-pgo-sections.hip now covers both paths at the IR level.

The Windows drain only needs to resolve HIP entry points by name and intercepts nothing, so depending on the sanitizer interception framework was both unnecessary and harmful: it forced PROFILE_HAS_HIP_INTERCEPTOR (the RTInterception/sanitizer_common object libs), and in a profile-only Windows build those targets are absent, which silently dropped the entire device-PGO drain.

- InstrProfilingPlatformROCmWindows.cpp: resolve HIP with direct Win32 LoadLibrary/GetProcAddress instead of __interception::*; drop the interception header; remove the dead Linux-only hipModuleLoad interceptor block and the now-dead non-Windows #else branches (this file is _WIN32 only).
- CMakeLists.txt: build the ROCm drain on Windows whenever COMPILER_RT_BUILD_PROFILE_ROCM is set (no interception dependency); keep the interception object-lib merge + gating for the Linux HSA drain only. Define COMPILER_RT_BUILD_PROFILE_ROCM=1 accordingly, and emit a STATUS message when the drain is dropped for lack of interception libs. Pick the MSVC CRT (/MT vs /MD) based on whether those object libs are actually merged.

With this, a green Windows compiler-runtime build genuinely exercises the Windows drain TU rather than potentially skipping it.

Co-authored-by: Cursor <[email protected]>

ronlieb self-requested a review May 31, 2026 18:58

lfmeadow mentioned this pull request Jun 1, 2026

[RCCL] In-tree device coverage via HSA introspection drain ROCm/rocm-systems#6630

Draft

lfmeadow and others added 3 commits June 3, 2026 13:57

lfmeadow force-pushed the device-pgo-introspection-drain-v2 branch from 8b3346f to b703833 Compare June 3, 2026 19:00

lfmeadow marked this pull request as ready for review June 3, 2026 19:00

lfmeadow requested a review from Copilot June 3, 2026 19:33

Copilot started reviewing on behalf of lfmeadow June 3, 2026 19:33

lfmeadow and others added 4 commits June 3, 2026 14:48

[PGO][HIP] Decouple device profile drain via HSA introspection #2743

lfmeadow wants to merge 8 commits into amd-staging from device-pgo-introspection-drain-v2


4 participants
