[PGO][HIP] Decouple device profile drain via HSA introspection#2743
lfmeadow wants to merge 8 commits into
|
Does this need to go upstream at some point? |
|
I like this approach. The main concern for me is whether it works on Windows. My PR used hipRegister* mainly due to needing to support offload pgo on Windows. |
I don't really understand our upstream strategy. This is largely dependent on HIP. I don't know if it would ever be ported to NVIDIA, they have their own tools. I guess someone else needs to answer that question. |
|
if this PR is needed to get device code coverage working with a trunk build, then yes, upstream |
|
On Windows, HIP runs on top of PAL (Platform Abstraction Library) rather than ROCm/HSA, so So the HSA introspection approach can't work on Windows. My original Would it be possible to keep the old shadow-registration path for Windows ( |
Replace the host-shadow / per-CUID / hipModuleLoad-interceptor device
profile drain with an HSA-introspection drain that walks every loaded
device code object at process exit, finds the canonical
__llvm_profile_sections bounds table emitted by
InstrProfilingPlatformGPU.c, D2H-copies its counters/data/names, and
writes an arch-prefixed .profraw via __llvm_write_custom_profile.
Host and device drains are now fully independent: the drain runs from an
atexit handler registered in a library constructor, so device counters
are collected whether or not the host TUs were instrumented and without
any host-side per-TU shadow, CUID matching, or module-load interception.
This fixes the cases the old 1-1 host<->device model could not handle
(separate device-only modules, uninstrumented host, multi-GPU).
* compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp: full rewrite
to the HSA-introspection drain (dlopen'd HSA/HIP, agent enumeration,
hsa_ven_amd_loader segment descriptors, executable symbol walk,
bounds dedup, idempotent drainDevices + atexit, collision-free
target names). Legacy __llvm_profile_offload_register_* symbols kept
as no-ops for ABI compatibility. Guarded host-only for GPU builds.
* clang/lib/CodeGen/CGCUDANV.cpp: delete the entire offload-profiling
machinery (OffloadProfShadow, emitOffloadProfilingSections, both
shadow-registration sites). Clang emits nothing PGO-specific for HIP.
* clang/lib/Driver/ToolChains/Gnu.cpp: for HIP host links built with
PGO, force-link the drain object via
-u__llvm_profile_hip_collect_device_data (it is otherwise
unreferenced now that the host emits no shadow).
* compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake: add amdgcn /
nvptx64 to ALL_PROFILE_SUPPORTED_ARCH so the device profile runtime
(the sole source of __llvm_profile_sections after this change) is
built for GPU targets.
Tests:
* clang/test/CodeGenHIP/offload-pgo-sections.hip: now asserts the
per-CUID struct / shadow / registration are NOT emitted.
* clang/test/Driver/hip-profile-device-drain.hip: the -u force-link is
added for HIP+PGO and only then.
* compiler-rt/test/profile/instrprof-offload-abi-compat.c: legacy
no-op symbols still link and run.
* compiler-rt/test/profile/AMDGPU/{device-basic,device-no-kernel,
device-symbols}.hip (REQUIRES: amdgpu) + lit .hip suffix and amdgpu
feature gate.
Co-authored-by: Cursor <[email protected]>
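The decoupling in the commit above rests on a library constructor that registers an exit-time drain. The following is a minimal C++ sketch of that pattern, not the code in InstrProfilingPlatformROCm.cpp: `drainDevicesSketch`, `installDrainOnce`, `DrainRuns`, and `HandlerInstalled` are hypothetical names, and `__attribute__((constructor))` assumes a GCC- or Clang-compatible compiler.

```cpp
#include <atomic>
#include <cstdlib>

// Counts how often the hypothetical drain body ran. A real drain would walk
// the loaded device code objects here instead of bumping a counter.
std::atomic<int> DrainRuns{0};

// Set once the atexit handler has been registered.
std::atomic<bool> HandlerInstalled{false};

// Stand-in for the device drain; it must be safe to call at process exit.
void drainDevicesSketch() { DrainRuns.fetch_add(1, std::memory_order_relaxed); }

// Registers the drain with atexit at most once. Returns true only for the
// call that actually performed the registration.
bool installDrainOnce() {
  bool Expected = false;
  if (!HandlerInstalled.compare_exchange_strong(Expected, true))
    return false; // someone else already registered the handler
  return std::atexit(drainDevicesSketch) == 0;
}

// Runs when the library (or executable) is loaded, before main(), so the
// drain is armed even if no instrumented host code ever calls into it.
__attribute__((constructor)) static void registerDeviceDrain() {
  (void)installDrainOnce();
}
```

Because registration happens in the constructor, nothing in the host translation units has to reference the drain, which is why the driver change in the same commit force-links the object with `-u`.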
Add a "HIP / AMDGPU device code coverage" section to SourceBasedCodeCoverage.rst covering how to build the amdgcn device profile runtime (COMPILER_RT_BUILD_PROFILE_ROCM), install it into the clang resource directory, and compile/run/report device coverage, including the manual force-link required for non-HIP hosts.
Co-authored-by: Cursor <[email protected]>
The HSA-introspection drain relies on the device profile runtime (libclang_rt.profile-<arch>.a, defining __llvm_profile_instrument_gpu and the __llvm_profile_sections bounds table) being linked into the device image. On the --offload-new-driver full-link path used by HIP, nothing pulled it in: clang only forwarded the device C lib / builtins / flang-rt to clang-linker-wrapper, and only for OpenMP host offloading. A plain `clang -x hip -fprofile-instr-generate -fcoverage-mapping` link therefore failed with `undefined symbol: __llvm_profile_instrument_gpu`.
Forward the static device profile runtime to each GPU (AMDGPU/NVPTX) device linker when instrumentation is enabled, in the per-device-toolchain loop in LinkerWrapper::ConstructJob (next to the existing -lompdevice handling). The full archive path is used because the device runtime is arch-suffixed and would not resolve via a -l name; the VFS exists() guard keeps this a no-op when the device profile runtime is not installed.
Also update the retained GPU/instrprof-hip-* tests for the new drain:
- multi-gpu: assert the HSA agent walk drained device data (the drain now walks all agents by design) instead of the old single-device messages.
- multiple-kernels: correct the stale mangled names (_Z5scalePii, _Z6negatePii).
- coverage: anchor the REPORT-NOT so "0.00%" doesn't match inside "80.00%".
Document the device profile runtime build/install (per-target resource path) in SourceBasedCodeCoverage.rst.
With this, compiler-rt's check-profile passes the AMDGPU/*.hip and GPU/instrprof-hip-*.hip suites on gfx90a (145 passed, 0 failed).
Co-authored-by: Cursor <[email protected]>
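The coverage-test fix in the commit above guards against a substring pitfall: "0.00%" also occurs inside "80.00%". The sketch below shows the same pitfall in plain C++. It is an illustration only, not FileCheck syntax, and `containsStandalone` is a hypothetical helper.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if Token occurs in Line at a position that is not immediately
// preceded by a digit, i.e. the match is not the tail of a longer number.
bool containsStandalone(const std::string &Line, const std::string &Token) {
  if (Token.empty())
    return false;
  for (std::size_t Pos = Line.find(Token); Pos != std::string::npos;
       Pos = Line.find(Token, Pos + 1)) {
    if (Pos == 0 ||
        !std::isdigit(static_cast<unsigned char>(Line[Pos - 1])))
      return true;
  }
  return false;
}
```

A plain `std::string::find` reports a hit for "0.00%" in "80.00%", whereas `containsStandalone` rejects it and still accepts a genuine "0.00%" column.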
8b3346f to
b703833
|
Folks, I just realized that this project isn't properly set up for TheRock build and CI. I'm working on revamping that. The good news is that it will make private builds easier, but you'll end up having to use more TheRock infra. Stay tuned, I'm new to this myself. |
Enable COMPILER_RT_BUILD_PROFILE in compiler-rt/cmake/caches/GPU.cmake, the cache file TheRock's amdgcn-amd-amdhsa runtimes target uses (via compiler/pre_hook_amd-llvm.cmake). With that target's LLVM_ENABLE_PER_TARGET_RUNTIME_DIR=ON, libclang_rt.profile.a is now built and installed to lib/clang/<v>/lib/amdgcn-amd-amdhsa/ as a normal part of the build and packaged via the amd-llvm artifact (force_include lib/llvm/lib/clang/**) -- so a packaged ROCm toolchain links the device runtime on a HIP + -fprofile-instr-generate/-fcoverage-mapping device link with no manual step.
Also force COMPILER_RT_BUILD_PROFILE_ROCM OFF for this target: the host-side HSA drain (InstrProfilingPlatformROCm.cpp) dlopen's HSA/HIP and is host-only, so it must never be compiled for amdgcn. The host (default) target keeps it on by default and carries the drain in libclang_rt.profile-x86_64.a.
Document the integrated layout in SourceBasedCodeCoverage.rst.
Co-authored-by: Cursor <[email protected]>
The amdgcn-amd-amdhsa runtimes build is freestanding (-nostdlibinc, no host libc headers), so building the full profile runtime there fails with 'unistd.h'/'fcntl.h'/'sys/file.h' not found (InstrProfilingPort.h includes <unistd.h> unless COMPILER_RT_PROFILE_BAREMETAL, and the filesystem sources pull in fcntl.h/sys/file.h).
Set COMPILER_RT_PROFILE_BAREMETAL in compiler-rt/cmake/caches/GPU.cmake so the device profile build uses its baremetal subset: it drops the filesystem / value-profiling sources and skips <unistd.h>, while keeping InstrProfilingPlatformGPU.c (__llvm_profile_instrument_gpu and the __llvm_profile_sections bounds table) -- exactly what the device link needs. The host (default) runtimes target is unaffected and keeps the full profile runtime + HSA drain.
Verified: the amdgcn profile archive builds cleanly under the GPU.cmake baremetal config and check-profile (incl. the AMDGPU/GPU device tests) passes on gfx90a with the trimmed archive installed.
Co-authored-by: Cursor <[email protected]>
Fix issues raised in review of the HSA introspection drain:
- processDeviceSections() now returns 1/0/-1 so an empty device section
is no longer miscounted as a successful drain (and never latches
DrainCompleted); only a positive result advances the drain count.
- Zero the CounterPtr field with sizeof(CounterPtr) instead of
sizeof(uint64_t) so a 32-bit host can't clobber adjacent record fields.
- RuntimeState now uses acquire/release atomics; observing RuntimeState==1
happens-after the resolved function-pointer writes.
- DrainInProgress/DrainCompleted now use an atomic CAS claim plus release
stores, so concurrent/reentrant drains can't both run the walk and
corrupt the global SeenBounds table.
- lit.cfg.py: correct the feature-gating comment and accept the arch-less
libclang_rt.profile.a (only under an amdgcn dir) in addition to the
arch-suffixed name.
Co-authored-by: Cursor <[email protected]>
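The claim-and-latch behaviour described in the commit above can be sketched as follows. This is a simplified illustration under stated assumptions, not the runtime's code: `DrainInProgress` and `DrainCompleted` reuse the names from the commit message, while `drainOnce` and the `Walk` callback (returning 1 for drained data, 0 for an empty section, -1 for an error) are hypothetical.

```cpp
#include <atomic>

std::atomic<bool> DrainInProgress{false};
std::atomic<bool> DrainCompleted{false};

// Walk stands in for the device walk. It returns 1 if data was drained,
// 0 if there was nothing to drain, and -1 on error.
// drainOnce returns the number of drains credited by this call (0 or 1).
int drainOnce(int (*Walk)()) {
  // A completed drain is never repeated.
  if (DrainCompleted.load(std::memory_order_acquire))
    return 0;

  // Atomically claim the walk; a concurrent or reentrant caller loses the
  // CAS and backs off instead of touching shared state.
  bool Expected = false;
  if (!DrainInProgress.compare_exchange_strong(Expected, true,
                                               std::memory_order_acq_rel))
    return 0;

  int Result = Walk();

  // Only a positive result latches "done"; empty or failed walks stay
  // retryable for a later call.
  if (Result > 0)
    DrainCompleted.store(true, std::memory_order_release);
  DrainInProgress.store(false, std::memory_order_release);
  return Result > 0 ? 1 : 0;
}
```

Keeping the empty and error results out of the latch is what lets a later exit-time call retry after an early call found nothing to drain.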
Windows has no HSA runtime (HIP runs on PAL; hsa-runtime64.dll is absent),
so the HSA introspection drain cannot run there. Restore the legacy
hipRegister*/host-shadow mechanism for Windows, kept in its own file so the
two paths stay free of interleaved #ifdef blocks:
- InstrProfilingPlatformROCmWindows.cpp: the legacy HIP host-shadow drain,
guarded by _WIN32.
- InstrProfilingPlatformROCm.cpp (HSA drain): now also gated on !_WIN32.
- CMakeLists.txt: compile exactly one of the two per host platform.
- CGCUDANV: re-emit the per-TU __llvm_profile_sections_<CUID> device bounds
table and host shadow + registration, behind a single "host is Windows"
guard (aux triple on device compiles); inert on other hosts.
- Gnu.cpp: clarify that the host -u force-link is the Linux/ELF path only.
Decoupled (host-uninstrumented) collection remains Linux-only; Windows still
requires instrumented host TUs, as in the original design.
offload-pgo-sections.hip now covers both paths: Windows emits the shadow +
__hipRegisterVar + registration and the device bounds struct; Linux emits no
host-shadow machinery.
Co-authored-by: Cursor <[email protected]>
|
@yxsamliu Done — implemented exactly the dual path you suggested (commit ad69004).
One intentional semantic difference: the decoupled, host-uninstrumented collection is Linux-only. Windows still requires the host TUs to be instrumented (same as your original PR), since the shadow registration lives in host code. Caveat: I haven't been able to validate the Windows path end-to-end (no Windows+GPU CI leg here). It's effectively your previously-working mechanism moved into a separate file, but a Windows CI run would be good before we rely on it. |
The Windows drain only needs to resolve HIP entry points by name and
intercepts nothing, so depending on the sanitizer interception framework was
both unnecessary and harmful: it forced PROFILE_HAS_HIP_INTERCEPTOR (the
RTInterception/sanitizer_common object libs), and in a profile-only Windows
build those targets are absent, which silently dropped the entire device-PGO
drain.
- InstrProfilingPlatformROCmWindows.cpp: resolve HIP with direct Win32
LoadLibrary/GetProcAddress instead of __interception::*; drop the
interception header; remove the dead Linux-only hipModuleLoad interceptor
block and the now-dead non-Windows #else branches (this file is _WIN32
only).
- CMakeLists.txt: build the ROCm drain on Windows whenever
COMPILER_RT_BUILD_PROFILE_ROCM is set (no interception dependency); keep
the interception object-lib merge + gating for the Linux HSA drain only.
Define COMPILER_RT_BUILD_PROFILE_ROCM=1 accordingly, and emit a STATUS
message when the drain is dropped for lack of interception libs. Pick the
MSVC CRT (/MT vs /MD) based on whether those object libs are actually
merged.
With this, a green Windows compiler-runtime build genuinely exercises the
Windows drain TU rather than potentially skipping it.
Co-authored-by: Cursor <[email protected]>
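Resolving entry points by name, as the commit above does on Windows with LoadLibrary/GetProcAddress, can be sketched portably as below. `resolveByName` is a hypothetical helper, not the runtime's function. The POSIX branch assumes `RTLD_DEFAULT` is available (glibc exposes it when `_GNU_SOURCE` is defined, which g++ and clang++ normally do) and, on older glibc, that the program links with `-ldl`.

```cpp
#include <cstddef>
#if defined(_WIN32)
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// Looks Name up by string and returns its address, or nullptr if it cannot
// be found. Libs lists candidate library names, tried in order. Library
// handles are intentionally kept open for the life of the process.
void *resolveByName(const char *Name, const char *const *Libs,
                    std::size_t NumLibs) {
#if defined(_WIN32)
  for (std::size_t I = 0; I < NumLibs; ++I) {
    HMODULE Module = LoadLibraryA(Libs[I]);
    if (!Module)
      continue;
    if (FARPROC Proc = GetProcAddress(Module, Name))
      return reinterpret_cast<void *>(Proc);
  }
  return nullptr;
#else
  // First check what is already loaded into the process.
  if (void *Sym = dlsym(RTLD_DEFAULT, Name))
    return Sym;
  // Then fall back to opening each candidate library explicitly.
  for (std::size_t I = 0; I < NumLibs; ++I) {
    void *Handle = dlopen(Libs[I], RTLD_NOW | RTLD_LOCAL);
    if (!Handle)
      continue;
    if (void *Sym = dlsym(Handle, Name))
      return Sym;
  }
  return nullptr;
#endif
}
```

Nothing here depends on an interception framework: the lookup only needs the platform loader, which is the point of the change.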
Status
Supersedes draft PR #2714
(same logical patch content, rebased onto current
`amd-staging`; the intermediate multi-device fix in commit
`b1b20686afe` becomes redundant under this drain and its tests under
`compiler-rt/test/profile/GPU/` continue to work via the retained ABI forwarder).
This revision also addresses the static-review feedback gathered against
#2714; see "Review-driven changes" below.
Build, install & usage
Instructions for building the amdgcn device profile runtime, installing it
into the clang resource directory, and compiling/running/reporting HIP
device coverage live in the new "HIP / AMDGPU device code coverage"
section of `clang/docs/SourceBasedCodeCoverage.rst` (added by this PR).
Summary
Replaces the host-shadow / per-CUID / `hipModuleLoad`-interceptor device
profile drain with an HSA-introspection drain. At process exit the
drain walks every loaded HSA code object on every GPU agent, finds the
canonical `__llvm_profile_sections` bounds table emitted by
`compiler-rt/lib/profile/InstrProfilingPlatformGPU.c`, D2H-copies its
counters/data/names back to the host, and writes an arch-prefixed
`.profraw` via `__llvm_write_custom_profile`.
What this changes vs. the existing approach
- Host and device drains are fully decoupled. The device drain runs
  from an `atexit` handler installed by a library constructor in
  `libclang_rt.profile`. Device counters are collected whether or not
  the host TUs were instrumented and without any host-side per-TU
  shadow, CUID matching, or module-load interception.
- Cases the old 1-1 host↔device model could not handle now work:
  separate device-only modules loaded at runtime
  (`hipModuleLoad` / `hsa_executable_*`), an uninstrumented host, and
  multi-GPU. Multi-GPU no longer relies on `hipGetSymbolAddress`,
  removing the comgr-at-atexit null deref that the predecessor
  band-aid worked around by restricting collection to a single device.
- Clang stops emitting any PGO-specific machinery for HIP.
  `CGCUDANV.cpp` loses the offload-profiling shadow/registration code
  entirely. For HIP+PGO host links the driver force-links the host drain
  object via `-u__llvm_profile_hip_collect_device_data` (the drain's
  `atexit` handler is otherwise unreferenced now that the host emits no
  shadow).
- Device profile runtime is linked on the offload device link.
  `LinkerWrapper::ConstructJob` now forwards the static device profile
  runtime (`libclang_rt.profile-<arch>.a`, defining
  `__llvm_profile_instrument_gpu` and the `__llvm_profile_sections` bounds
  table) to each GPU device linker when instrumentation is enabled. Without
  this a plain `clang -x hip -fprofile-instr-generate -fcoverage-mapping`
  full link fails with `undefined symbol: __llvm_profile_instrument_gpu`,
  since the offload device link otherwise only pulls in libc/builtins (and
  only for OpenMP). The device runtime must be installed at the per-target
  resource path `lib/<device-triple>/libclang_rt.profile.a` for the driver
  to find it.
- ABI compatibility: legacy
  `__llvm_profile_offload_register_{shadow,section_shadow,dynamic_module}`
  and `__llvm_profile_hip_collect_device_data` are retained as
  no-ops / forwarders so binaries compiled against the previous runtime
  still link and produce output.
Review-driven changes (vs the #2714 snapshot)
- `drainDevices()` idempotency: split into `DrainInProgress` /
  `DrainCompleted`; only latches "done" after a successful walk that
  actually drained data. Transient no-ops ("HSA/HIP not yet
  resolvable", "no GPU agents", "no loaded segments", "no instrumented
  sections") stay retryable so a late `atexit` call still picks up code
  objects that loaded after an early host-write trigger. Covered by
  the new `compiler-rt/test/profile/AMDGPU/device-early-collect.hip`.
- HIP runtime resolution: tries `RTLD_DEFAULT` first (catches the
  common case of HIP already being in the process namespace, including
  runtime-only ROCm installs without an unversioned dev symlink), then
  falls back to dlopen of `libamdhip64.so.{7,6,5,4}` and finally the
  unversioned `.so`.
- Section bounds validation: `processDeviceSections` now does
  `uintptr_t`-based size math, rejects `End < Begin` and per-section
  spans above 256 MiB, requires
  `DataSize % sizeof(__llvm_profile_data) == 0`, and per-record
  validates that each `CounterPtr` resolves inside the copied counters
  region (out-of-range entries are zeroed and warned about instead of
  producing a `.profraw` that points at unrelated memory).
- HSA symbol-iter error path: a non-`SUCCESS` / non-`INFO_BREAK`
  return from `hsa_executable_iterate_agent_symbols` is now warned and
  reflected in the drain's exit status.
- lit gate tightened: `compiler-rt/test/profile/lit.cfg.py` now
  requires `/dev/kfd` plus a usable HIP install (probed via
  `$ROCM_PATH` / `/opt/rocm`) plus the amdgcn device profile runtime
  in the resource directory before enabling the `hip` / `amdgpu`
  features. Exports `%hip_lib_path` and `%amdgpu_arch` so the new
  `AMDGPU/*.hip` tests stay portable (and consistent with the existing
  `GPU/instrprof-hip-*` tests).
Validation
Built clean against current `amd-staging` on Linux x86_64 + AMDGPU:
host `clang` / `lld` / `compiler-rt`, and `libclang_rt.profile-amdgcn.a`
(device profile runtime) for the amdgcn target.
End-to-end exercised with RCCL under its `--enable-device-coverage`
fast build path (debug + `-gline-tables-only -O1`, default device
linker, gfx90a, ~10 min incremental relink against the rebased runtime):
- `librccl.so` carries the expected `__llvm_prf_{names,cnts,data,vnds}` +
  `__llvm_covfun` + `__llvm_covmap` sections and the
  `__llvm_profile_hip_collect_device_data` forwarder.
- `all_reduce_perf -g 2` produces 1 host `.profraw` + 26 device
  `gfx90a.*.profraw` per rank (one per loaded HSA executable), all
  LLVM raw profile version 10.
- `HIP resolved via existing process namespace (RTLD_DEFAULT)` and
  `walk complete: agents=4 pairs=28 found=26 drained=26 iter-failures=0`;
  zero `out-of-range counter pointer` warnings.
- `llvm-profdata merge` + `llvm-cov report` produce realistic per-file
  coverage for both the host `librccl.so` and the per-arch device ELF.