[PGO][HIP] Decouple device profile drain via HSA introspection #2743

Open
lfmeadow wants to merge 8 commits into amd-staging from device-pgo-introspection-drain-v2

@lfmeadow lfmeadow commented May 31, 2026

Status

Supersedes draft PR #2714
(same logical patch content, rebased onto current amd-staging; the
intermediate multi-device fix in commit b1b20686afe becomes redundant
under this drain and its tests under compiler-rt/test/profile/GPU/
continue to work via the retained ABI forwarder).

This revision also addresses the static-review feedback gathered against
#2714 — see "Review-driven changes" below.

Build, install & usage

Instructions for building the amdgcn device profile runtime, installing it
into the clang resource directory, and compiling/running/reporting HIP
device coverage live in the new "HIP / AMDGPU device code coverage"
section of clang/docs/SourceBasedCodeCoverage.rst
(added by this PR).

Summary

Replaces the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain. At process exit the
drain walks every loaded HSA code object on every GPU agent, finds the
canonical __llvm_profile_sections bounds table emitted by
compiler-rt/lib/profile/InstrProfilingPlatformGPU.c, D2H-copies its
counters/data/names back to the host, and writes an arch-prefixed
.profraw via __llvm_write_custom_profile.
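
As a rough illustration of the walk described above, here is a minimal host-only model. It assumes hypothetical `Agent` and `CodeObject` types in place of real HSA handles, an output naming scheme of `<arch>.<index>.profraw`, and a dedup key of (agent, begin, end); none of these are taken from the actual runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a loaded HSA code object.
struct CodeObject {
  bool HasBoundsTable;   // does it define __llvm_profile_sections?
  uintptr_t CntsBegin;   // counters section bounds (device addresses)
  uintptr_t CntsEnd;
};

// Hypothetical stand-in for a GPU agent and what is loaded on it.
struct Agent {
  std::string Arch;
  std::vector<CodeObject> Loaded;
};

struct WalkStats {
  int Agents = 0;
  int Found = 0;    // code objects that carry a bounds table
  int Drained = 0;  // distinct bounds tables actually drained
};

// Visit every code object on every agent, skip uninstrumented ones,
// drain each distinct bounds table once, and record one output name
// per drain. The real drain would D2H-copy and write a .profraw here.
WalkStats walkAgents(const std::vector<Agent> &Agents,
                     std::vector<std::string> &Outputs) {
  std::set<std::tuple<std::size_t, uintptr_t, uintptr_t>> SeenBounds;
  WalkStats Stats;
  for (std::size_t A = 0; A < Agents.size(); ++A) {
    ++Stats.Agents;
    for (const CodeObject &CO : Agents[A].Loaded) {
      if (!CO.HasBoundsTable)
        continue;
      ++Stats.Found;
      // Assumed dedup key: the same bounds seen twice on one agent.
      if (!SeenBounds.insert(std::make_tuple(A, CO.CntsBegin, CO.CntsEnd))
               .second)
        continue;
      Outputs.push_back(Agents[A].Arch + "." +
                        std::to_string(Stats.Drained) + ".profraw");
      ++Stats.Drained;
    }
  }
  return Stats;
}
```

The point of the model is only that the walk is driven by what is loaded on the device, not by anything the host TUs registered.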

What this changes vs. the existing approach

  • Host and device drains are fully decoupled. The device drain runs
    from an atexit handler installed by a library constructor in
    libclang_rt.profile. Device counters are collected whether or not
    the host TUs were instrumented and without any host-side per-TU
    shadow, CUID matching, or module-load interception.

  • Cases the old 1-1 host↔device model could not handle now work:
    separate device-only modules loaded at runtime
    (hipModuleLoad / hsa_executable_*), an uninstrumented host, and
    multi-GPU. Multi-GPU no longer relies on hipGetSymbolAddress,
    removing the comgr-at-atexit null deref that the predecessor
    band-aid worked around by restricting collection to a single device.

  • Clang stops emitting any PGO-specific machinery for HIP.
    CGCUDANV.cpp loses the offload-profiling shadow/registration code
    entirely. For HIP+PGO host links the driver force-links the host drain
    object via -u__llvm_profile_hip_collect_device_data (the drain's
    atexit handler is otherwise unreferenced now that the host emits no
    shadow).

  • Device profile runtime is linked on the offload device link.
    LinkerWrapper::ConstructJob now forwards the static device profile
    runtime (libclang_rt.profile-<arch>.a, defining
    __llvm_profile_instrument_gpu and the __llvm_profile_sections bounds
    table) to each GPU device linker when instrumentation is enabled. Without
    this a plain clang -x hip -fprofile-instr-generate -fcoverage-mapping
    full link fails with undefined symbol: __llvm_profile_instrument_gpu,
    since the offload device link otherwise only pulls in libc/builtins (and
    only for OpenMP). The device runtime must be installed at the per-target
    resource path lib/<device-triple>/libclang_rt.profile.a for the driver
    to find it.

  • ABI compatibility: legacy
    __llvm_profile_offload_register_{shadow,section_shadow,dynamic_module}
    and __llvm_profile_hip_collect_device_data are retained as
    no-ops / forwarders so binaries compiled against the previous runtime
    still link and produce output.

Review-driven changes (vs the #2714 snapshot)

  • drainDevices() idempotency — split into DrainInProgress /
    DrainCompleted; only latches "done" after a successful walk that
    actually drained data. Transient no-ops ("HSA/HIP not yet
    resolvable", "no GPU agents", "no loaded segments", "no instrumented
    sections") stay retryable so a late atexit call still picks up code
    objects that loaded after an early host-write trigger. Covered by
    the new compiler-rt/test/profile/AMDGPU/device-early-collect.hip.

  • HIP runtime resolution — tries RTLD_DEFAULT first (catches the
    common case of HIP already being in the process namespace, including
    runtime-only ROCm installs without an unversioned dev symlink), then
    falls back to dlopen of libamdhip64.so.{7,6,5,4} and finally the
    unversioned .so.

  • Section bounds validation: processDeviceSections now does
    uintptr_t-based size math, rejects End < Begin and per-section
    spans above 256 MiB, requires
    DataSize % sizeof(__llvm_profile_data) == 0, and per-record
    validates that each CounterPtr resolves inside the copied counters
    region (out-of-range entries are zeroed and warned about instead of
    producing a .profraw that points at unrelated memory).

  • HSA symbol-iter error path — non-SUCCESS/non-INFO_BREAK
    return from hsa_executable_iterate_agent_symbols is now warned and
    reflected in the drain's exit status.

  • lit gate tightened: compiler-rt/test/profile/lit.cfg.py now
    requires /dev/kfd plus a usable HIP install (probed via
    $ROCM_PATH / /opt/rocm) plus the amdgcn device profile runtime
    in the resource directory before enabling the hip / amdgpu
    features. Exports %hip_lib_path and %amdgpu_arch so the new
    AMDGPU/*.hip tests stay portable (and consistent with the existing
    GPU/instrprof-hip-* tests).
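
The section bounds checks listed above can be sketched roughly as follows. This is not the code in processDeviceSections: `ProfData` is a hypothetical, shortened stand-in for __llvm_profile_data, `Span` and `validateSections` are illustrative names, and CounterPtr is assumed to hold an absolute device address.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical, shortened stand-in for __llvm_profile_data.
struct ProfData {
  uint64_t NameRef;
  uint64_t FuncHash;
  uintptr_t CounterPtr;  // assumed: absolute device address of the counters
  uint32_t NumCounters;
};

// Device address range [Begin, End) as reported by the bounds table.
struct Span {
  uintptr_t Begin;
  uintptr_t End;
};

constexpr uintptr_t kMaxSpan = 256u << 20;  // 256 MiB per-section cap

// Computes End - Begin in uintptr_t; rejects inverted or oversized spans.
bool spanSize(Span S, uintptr_t &Size) {
  if (S.End < S.Begin)
    return false;
  Size = S.End - S.Begin;
  return Size <= kMaxSpan;
}

// Returns -1 if the bounds are rejected, otherwise the number of records
// whose CounterPtr fell outside the counters region and was zeroed.
int validateSections(Span Cnts, Span Data, std::vector<ProfData> &Records) {
  uintptr_t CntsSize = 0, DataSize = 0;
  if (!spanSize(Cnts, CntsSize) || !spanSize(Data, DataSize))
    return -1;
  if (DataSize % sizeof(ProfData) != 0)
    return -1;
  int Zeroed = 0;
  for (ProfData &R : Records) {
    if (R.CounterPtr < Cnts.Begin || R.CounterPtr >= Cnts.End) {
      std::fprintf(stderr, "warning: counter pointer out of range\n");
      // Zero exactly the field's own width, not a fixed 8 bytes.
      std::memset(&R.CounterPtr, 0, sizeof(R.CounterPtr));
      ++Zeroed;
    }
  }
  return Zeroed;
}
```

Zeroing an out-of-range entry keeps the remaining records usable, so the output never refers to memory that was not copied.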

Validation

Built clean against current amd-staging on Linux x86_64 + AMDGPU:
host clang/lld/compiler-rt, and libclang_rt.profile-amdgcn.a
(device profile runtime) for the amdgcn target.

End-to-end exercised with RCCL under its --enable-device-coverage
fast build path (debug + -gline-tables-only -O1, default device
linker, gfx90a, ~10 min incremental relink against the rebased runtime):

  • librccl.so carries the expected
    __llvm_prf_{names,cnts,data,vnds} + __llvm_covfun +
    __llvm_covmap sections and the
    __llvm_profile_hip_collect_device_data forwarder.
  • all_reduce_perf -g 2 produces 1 host .profraw + 26 device
    gfx90a.*.profraw per rank (one per loaded HSA executable), all
    LLVM raw profile version 10.
  • Verbose drain log:
    HIP resolved via existing process namespace (RTLD_DEFAULT) and
    walk complete: agents=4 pairs=28 found=26 drained=26 iter-failures=0
    — zero out-of-range counter pointer warnings.
  • llvm-profdata merge + llvm-cov report produce realistic per-file
    coverage for both the host librccl.so and the per-arch device ELF.
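
The HIP resolution order mentioned in the drain log and under "Review-driven changes" (process namespace first, then versioned sonames, then the unversioned name) might look roughly like this on Linux. `resolveSymbol` and `resolveHipMemcpy` are illustrative names, not the runtime's own helpers; only standard dlfcn calls are used.

```cpp
#include <cassert>
#include <dlfcn.h>
#include <initializer_list>

// Look a symbol up in the already-loaded process image first, then in
// each candidate library in order. Returns nullptr if nothing matches.
void *resolveSymbol(const char *Symbol,
                    std::initializer_list<const char *> Candidates) {
  if (void *Addr = dlsym(RTLD_DEFAULT, Symbol))
    return Addr;  // common case: the library is already loaded
  for (const char *Lib : Candidates) {
    void *Handle = dlopen(Lib, RTLD_LAZY | RTLD_LOCAL);
    if (!Handle)
      continue;
    if (void *Addr = dlsym(Handle, Symbol))
      return Addr;  // keep the handle open so the address stays valid
    dlclose(Handle);
  }
  return nullptr;
}

// Candidate order taken from the PR description; hipMemcpy is used here
// only as an example entry point.
void *resolveHipMemcpy() {
  return resolveSymbol("hipMemcpy",
                       {"libamdhip64.so.7", "libamdhip64.so.6",
                        "libamdhip64.so.5", "libamdhip64.so.4",
                        "libamdhip64.so"});
}
```

Trying RTLD_DEFAULT first means a runtime-only install without the unversioned symlink still resolves as long as the application itself already loaded HIP.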

@ronlieb ronlieb self-requested a review May 31, 2026 18:58
Collaborator

ronlieb commented May 31, 2026

Does this need to go upstream at some point?

yxsamliu commented Jun 1, 2026

I like this approach. The main concern for me is whether it works on Windows. My PR used hipRegister* mainly due to needing to support offload pgo on Windows.

Author

lfmeadow commented Jun 2, 2026

Does this need to go upstream at some point?

I don't really understand our upstream strategy. This is largely dependent on HIP. I don't know if it would ever be ported to NVIDIA; they have their own tools. I guess someone else needs to answer that question.

Collaborator

ronlieb commented Jun 2, 2026

if this PR is needed to get device code coverage working with a trunk build, then yes, upstream

yxsamliu commented Jun 2, 2026

On Windows, HIP runs on top of PAL (Platform Abstraction Library) rather than ROCm/HSA, so hsa-runtime64.dll is not available — the HSA runtime API is not exposed as a separate DLL on Windows at all. The only HSA-named files in the Windows system directory are hsa-thunk64.dll / hsa-thunk.dll, which are the kernel-mode KMT thunk and only export hsaKmt* symbols, not the HSA runtime API. So hsa_init, hsa_iterate_agents, hsa_executable_iterate_agent_symbols, hsa_system_get_major_extension_table, and query_segment_descriptors are all absent on Windows. Only hipMemcpy is available.

So the HSA introspection approach can't work on Windows. My original hipRegister*-based approach worked there because it only needed HIP APIs, which are available on both platforms.

Would it be possible to keep the old shadow-registration path for Windows (#ifdef _WIN32) and use the new HSA introspection drain on Linux? That way Windows users don't lose device PGO support, and Linux gets the cleaner decoupled approach.

lfmeadow and others added 3 commits June 3, 2026 13:57
Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain that walks every loaded
device code object at process exit, finds the canonical
__llvm_profile_sections bounds table emitted by
InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and
writes an arch-prefixed .profraw via __llvm_write_custom_profile.

Host and device drains are now fully independent: the drain runs from an
atexit handler registered in a library constructor, so device counters
are collected whether or not the host TUs were instrumented and without
any host-side per-TU shadow, CUID matching, or module-load interception.
This fixes the cases the old 1-1 host<->device model could not handle
(separate device-only modules, uninstrumented host, multi-GPU).

  * compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite
    to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration,
    hsa_ven_amd_loader segment descriptors, executable symbol walk,
    bounds dedup, idempotent drainDevices + atexit, collision-free
    target names). Legacy __llvm_profile_offload_register_* symbols kept
    as no-ops for ABI compatibility. Guarded host-only for GPU builds.

  * clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
    machinery (OffloadProfShadow, emitOffloadProfilingSections, both
    shadow-registration sites). Clang emits nothing PGO-specific for HIP.

  * clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with
    PGO, force-link the drain object via
    -u__llvm_profile_hip_collect_device_data (it is otherwise
    unreferenced now that the host emits no shadow).

  * compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn /
    nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime
    (the sole source of __llvm_profile_sections after this change) is
    built for GPU targets.

Tests:
  * clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the
    per-CUID struct / shadow / registration are NOT emitted.
  * clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is
    added for HIP+PGO and only then.
  * compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy
    no-op symbols still link and run.
  * compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,
    device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu
    feature gate.
Co-authored-by: Cursor <[email protected]>

Add a "HIP / AMDGPU device code coverage" section to
SourceBasedCodeCoverage.rst covering how to build the amdgcn device
profile runtime (COMPILER_RT_BUILD_PROFILE_ROCM), install it into the
clang resource directory, and compile/run/report device coverage,
including the manual force-link required for non-HIP hosts.

Co-authored-by: Cursor <[email protected]>

The HSA-introspection drain relies on the device profile runtime
(libclang_rt.profile-<arch>.a, defining __llvm_profile_instrument_gpu and
the __llvm_profile_sections bounds table) being linked into the device
image. On the --offload-new-driver full-link path used by HIP, nothing
pulled it in: clang only forwarded the device C lib / builtins / flang-rt
to clang-linker-wrapper, and only for OpenMP host offloading. A plain
`clang -x hip -fprofile-instr-generate -fcoverage-mapping` link therefore
failed with `undefined symbol: __llvm_profile_instrument_gpu`.

Forward the static device profile runtime to each GPU (AMDGPU/NVPTX)
device linker when instrumentation is enabled, in the per-device-toolchain
loop in LinkerWrapper::ConstructJob (next to the existing -lompdevice
handling). The full archive path is used because the device runtime is
arch-suffixed and would not resolve via a -l name; the VFS exists() guard
keeps this a no-op when the device profile runtime is not installed.

Also update the retained GPU/instrprof-hip-* tests for the new drain:
- multi-gpu: assert the HSA agent walk drained device data (the drain now
  walks all agents by design) instead of the old single-device messages.
- multiple-kernels: correct the stale mangled names (_Z5scalePii,
  _Z6negatePii).
- coverage: anchor the REPORT-NOT so "0.00%" doesn't match inside "80.00%".

Document the device profile runtime build/install (per-target resource
path) in SourceBasedCodeCoverage.rst.

With this, compiler-rt's check-profile passes the AMDGPU/*.hip and
GPU/instrprof-hip-*.hip suites on gfx90a (145 passed, 0 failed).

Co-authored-by: Cursor <[email protected]>
@lfmeadow lfmeadow force-pushed the device-pgo-introspection-drain-v2 branch from 8b3346f to b703833 Compare June 3, 2026 19:00
@lfmeadow lfmeadow marked this pull request as ready for review June 3, 2026 19:00
@lfmeadow lfmeadow requested a review from Copilot June 3, 2026 19:33
Author

lfmeadow commented Jun 3, 2026

Folks, I just realized that this project isn't properly set up for TheRock build and CI. I'm working on revamping that. The good news is that it will make private builds easier, but you'll end up having to use more TheRock infra. Stay tuned; I'm new to this myself.

lfmeadow and others added 4 commits June 3, 2026 14:48
Enable COMPILER_RT_BUILD_PROFILE in compiler-rt/cmake/caches/GPU.cmake, the
cache file TheRock's amdgcn-amd-amdhsa runtimes target uses (via
compiler/pre_hook_amd-llvm.cmake). With that target's
LLVM_ENABLE_PER_TARGET_RUNTIME_DIR=ON, libclang_rt.profile.a is now built and
installed to lib/clang/<v>/lib/amdgcn-amd-amdhsa/ as a normal part of the build
and packaged via the amd-llvm artifact (force_include lib/llvm/lib/clang/**) --
so a packaged ROCm toolchain links the device runtime on a HIP +
-fprofile-instr-generate/-fcoverage-mapping device link with no manual step.

Also force COMPILER_RT_BUILD_PROFILE_ROCM OFF for this target: the host-side
HSA drain (InstrProfilingPlatformROCm.cpp) dlopen's HSA/HIP and is host-only,
so it must never be compiled for amdgcn. The host (default) target keeps it on
by default and carries the drain in libclang_rt.profile-x86_64.a.

Document the integrated layout in SourceBasedCodeCoverage.rst.

Co-authored-by: Cursor <[email protected]>

The amdgcn-amd-amdhsa runtimes build is freestanding (-nostdlibinc, no host
libc headers), so building the full profile runtime there fails with
'unistd.h'/'fcntl.h'/'sys/file.h' not found (InstrProfilingPort.h includes
<unistd.h> unless COMPILER_RT_PROFILE_BAREMETAL, and the filesystem sources
pull in fcntl.h/sys/file.h).

Set COMPILER_RT_PROFILE_BAREMETAL in compiler-rt/cmake/caches/GPU.cmake so the
device profile build uses its baremetal subset: it drops the filesystem /
value-profiling sources and skips <unistd.h>, while keeping
InstrProfilingPlatformGPU.c (__llvm_profile_instrument_gpu and the
__llvm_profile_sections bounds table) -- exactly what the device link needs.
The host (default) runtimes target is unaffected and keeps the full profile
runtime + HSA drain.

Verified: the amdgcn profile archive builds cleanly under the GPU.cmake
baremetal config and check-profile (incl. the AMDGPU/GPU device tests) passes
on gfx90a with the trimmed archive installed.

Co-authored-by: Cursor <[email protected]>

Fix issues raised in review of the HSA introspection drain:
  - processDeviceSections() now returns 1/0/-1 so an empty device section
    is no longer miscounted as a successful drain (and never latches
    DrainCompleted); only a positive result advances the drain count.
  - Zero the CounterPtr field with sizeof(CounterPtr) instead of
    sizeof(uint64_t) so a 32-bit host can't clobber adjacent record fields.
  - RuntimeState now uses acquire/release atomics; observing RuntimeState==1
    happens-after the resolved function-pointer writes.
  - DrainInProgress/DrainCompleted now use an atomic CAS claim plus release
    stores, so concurrent/reentrant drains can't both run the walk and
    corrupt the global SeenBounds table.
  - lit.cfg.py: correct the feature-gating comment and accept the arch-less
    libclang_rt.profile.a (only under an amdgcn dir) in addition to the
    arch-suffixed name.

Co-authored-by: Cursor <[email protected]>
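
The claim-and-latch behaviour described in the commit above can be sketched as follows, assuming a hypothetical `drainOnce` wrapper and a walk callable that returns a positive count when data was drained, 0 when there was nothing to drain, and a negative value on error. The flag names mirror the commit message, but this is not the runtime's code.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<bool> DrainInProgress{false};
static std::atomic<bool> DrainCompleted{false};

// Runs Walk at most once at a time. Only a positive result latches
// DrainCompleted; 0 (nothing to drain yet) and negative (error) results
// leave the drain retryable for a later call such as the atexit handler.
template <typename WalkFn> int drainOnce(WalkFn Walk) {
  if (DrainCompleted.load(std::memory_order_acquire))
    return 0;  // already drained successfully
  bool Expected = false;
  if (!DrainInProgress.compare_exchange_strong(Expected, true,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire))
    return 0;  // another thread (or a reentrant call) owns the walk
  int Result = Walk();
  if (Result > 0)
    DrainCompleted.store(true, std::memory_order_release);
  DrainInProgress.store(false, std::memory_order_release);
  return Result;
}
```

Because the claim is a single compare-and-swap, two callers can never both enter the walk, which is what protects any shared state the walk mutates.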
Windows has no HSA runtime (HIP runs on PAL; hsa-runtime64.dll is absent),
so the HSA introspection drain cannot run there. Restore the legacy
hipRegister*/host-shadow mechanism for Windows, kept in its own file so the
two paths stay free of interleaved #ifdef blocks:

  - InstrProfilingPlatformROCmWindows.cpp: the legacy HIP host-shadow drain,
    guarded by _WIN32.
  - InstrProfilingPlatformROCm.cpp (HSA drain): now also gated on !_WIN32.
  - CMakeLists.txt: compile exactly one of the two per host platform.
  - CGCUDANV: re-emit the per-TU __llvm_profile_sections_<CUID> device bounds
    table and host shadow + registration, behind a single "host is Windows"
    guard (aux triple on device compiles); inert on other hosts.
  - Gnu.cpp: clarify that the host -u force-link is the Linux/ELF path only.

Decoupled (host-uninstrumented) collection remains Linux-only; Windows still
requires instrumented host TUs, as in the original design.

offload-pgo-sections.hip now covers both paths: Windows emits the shadow +
__hipRegisterVar + registration and the device bounds struct; Linux emits no
host-shadow machinery.

Co-authored-by: Cursor <[email protected]>
Author

lfmeadow commented Jun 3, 2026

@yxsamliu Done — implemented exactly the dual path you suggested (commit ad69004).

  • Windows keeps the legacy hipRegister* host-shadow drain; Linux uses the HSA introspection drain.
  • To avoid #ifdef clutter, the Windows code lives in its own file (InstrProfilingPlatformROCmWindows.cpp) and the CMake compiles exactly one of the two by host platform — no interleaved #if _WIN32 … #else … #endif.
  • The CodeGen shadow emission (__llvm_profile_sections_<CUID> device bounds table + host shadow + __hipRegisterVar/registration) is restored behind a single "host is Windows" guard, checked via the aux triple on device compiles; it's inert on other hosts.

One intentional semantic difference: the decoupled, host-uninstrumented collection is Linux-only. Windows still requires the host TUs to be instrumented (same as your original PR), since the shadow registration lives in host code.

Caveat: I haven't been able to validate the Windows path end-to-end (no Windows+GPU CI leg here). It's effectively your previously-working mechanism moved into a separate file, but a Windows CI run would be good before we rely on it. offload-pgo-sections.hip now covers both paths at the IR level.

The Windows drain only needs to resolve HIP entry points by name and
intercepts nothing, so depending on the sanitizer interception framework was
both unnecessary and harmful: it forced PROFILE_HAS_HIP_INTERCEPTOR (the
RTInterception/sanitizer_common object libs), and in a profile-only Windows
build those targets are absent, which silently dropped the entire device-PGO
drain.

  - InstrProfilingPlatformROCmWindows.cpp: resolve HIP with direct Win32
    LoadLibrary/GetProcAddress instead of __interception::*; drop the
    interception header; remove the dead Linux-only hipModuleLoad interceptor
    block and the now-dead non-Windows #else branches (this file is _WIN32
    only).
  - CMakeLists.txt: build the ROCm drain on Windows whenever
    COMPILER_RT_BUILD_PROFILE_ROCM is set (no interception dependency); keep
    the interception object-lib merge + gating for the Linux HSA drain only.
    Define COMPILER_RT_BUILD_PROFILE_ROCM=1 accordingly, and emit a STATUS
    message when the drain is dropped for lack of interception libs. Pick the
    MSVC CRT (/MT vs /MD) based on whether those object libs are actually
    merged.

With this, a green Windows compiler-runtime build genuinely exercises the
Windows drain TU rather than potentially skipping it.

Co-authored-by: Cursor <[email protected]>