Run gdb.rocm tests in parallel across GPUs #149
Add a generic shared/exclusive file-lock layer in lib/gdb-utils.exp
and wire it into the rocm GPU lock so most gdb.rocm tests can run in
parallel on a multi-GPU node, while tests that need exclusive access
to a device (or to the whole machine) keep their historical
serialisation.
In lib/gdb-utils.exp, add lock_file_try_acquire and the
shared/exclusive helpers (lock_file_acquire_shared,
lock_file_acquire_exclusive, lock_file_release_shared) built on the
existing open-EXCL primitive used by lock_file_acquire. Shared
acquirers grab one of N slot files; exclusive acquirers grab all N,
with a barrier file giving writer priority so an exclusive waiter
cannot be starved. Add with_shared_lock and with_exclusive_lock
convenience wrappers, and the multi-pool variants
(with_shared_lock_multi, with_exclusive_lock_multi,
with_machine_lock_multi) used by the rocm lock to spread tests
across GPUs. The shared multi-pool acquirer fills slot 0 of every
pool before touching slot 1 of any pool, so up to N tests on an
N-GPU box each get a dedicated GPU before any GPU sees two tenants;
a randomised per-row start offset avoids concentrating acquirers on
pool 0. An ASCII diagram in the file header shows how the four
tiers relate. The single-pool wrappers (with_shared_lock and
with_exclusive_lock) are registered in tclint-plugin.py; the
multi-pool variants are intentionally not registered because the
in-tree caller (with_rocm_gpu_lock) builds the body in a variable
that tclint cannot analyse as a script.
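The slot/barrier scheme described above can be modelled in a few lines. The in-tree helpers are Tcl; the Python sketch below is only an illustration and is non-blocking, where the real helpers presumably poll. The file names, the class SlotPool, and the "barrier first, then slots" ordering are assumptions for illustration, not the contents of lib/gdb-utils.exp.

```python
import os
import tempfile


def try_create(path):
    """Atomically create PATH; True on success, False if it already exists."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    os.close(fd)
    return True


class SlotPool:
    """N slot files plus one barrier file (file names are hypothetical)."""

    def __init__(self, directory, nslots):
        self.slots = [os.path.join(directory, f"slot{i}") for i in range(nslots)]
        self.barrier = os.path.join(directory, "barrier")

    def try_acquire_shared(self):
        """Return a slot path, or None if the pool is full or a writer waits."""
        if os.path.exists(self.barrier):
            return None  # writer priority: new shared acquirers back off
        for slot in self.slots:
            if try_create(slot):
                return slot
        return None

    def release_shared(self, slot):
        os.remove(slot)

    def begin_exclusive(self):
        """Raise the barrier; False if another exclusive acquirer holds it."""
        return try_create(self.barrier)

    def try_finish_exclusive(self, held):
        """Grab every slot not yet in HELD; True once all N are held."""
        for slot in self.slots:
            if slot not in held and try_create(slot):
                held.add(slot)
        return len(held) == len(self.slots)

    def release_exclusive(self, held):
        for slot in held:
            os.remove(slot)
        held.clear()
        os.remove(self.barrier)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pool = SlotPool(tmp, nslots=2)
        reader = pool.try_acquire_shared()
        held = set()
        pool.begin_exclusive()
        print(pool.try_finish_exclusive(held))  # reader still holds a slot
        print(pool.try_acquire_shared())        # barrier turns new readers away
        pool.release_shared(reader)
        print(pool.try_finish_exclusive(held))  # pool drained, writer owns it
        pool.release_exclusive(held)
```

The barrier check in try_acquire_shared is advisory: a shared acquirer that passes the check just before the barrier appears can still take a slot, but the exclusive acquirer cannot complete until that slot is released, so mutual exclusion still holds in this model.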
In lib/rocm.exp, replace the single-file gpu-parallel.lock with one
pool per visible GPU. Per-GPU width is configurable via
ROCGDB_ROCM_PARALLEL_SLOTS (default 1). The active pool count is
capped at min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL); the cap
is unset by default and exists so users can lower the active set if
they hit spurious failures from kfd/dbgapi contention.
with_rocm_gpu_lock now dispatches on a per-test marker
"set rocm_gpu_concurrency {parallel|serial|machine}": parallel
takes one slot on any GPU, serial (the default) takes every slot of
one GPU while leaving other GPUs free, and machine takes every slot
of every GPU to restore the historical "no other rocm test runs
anywhere" semantics.
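As a rough illustration of the three modes, the Python sketch below computes how much of the lock grid one test holds under each marker value. The function names are hypothetical and the real dispatch lives in Tcl inside with_rocm_gpu_lock; only the mode names and their meanings come from the description above.

```python
def lock_footprint(mode, npools, nslots):
    """Return (pools_held, slots_held_per_pool) for one test under MODE."""
    if mode == "parallel":
        return (1, 1)            # one slot on any one GPU
    if mode == "serial":
        return (1, nslots)       # every slot of one GPU, other GPUs stay free
    if mode == "machine":
        return (npools, nslots)  # every slot of every GPU
    raise ValueError(f"unknown rocm_gpu_concurrency: {mode!r}")


def marker_or_default(marker=None):
    """A test that sets no marker behaves as serial."""
    return "serial" if marker is None else marker


if __name__ == "__main__":
    for mode in ("parallel", "serial", "machine"):
        print(mode, lock_footprint(mode, npools=4, nslots=2))
```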
At startup, build a rocm_pool_to_physical list mapping each dense
pool index to a physical device index. When the user pre-sets
ROCR_VISIBLE_DEVICES the map honours that mask, so pool i pins the
physical device the user actually picked rather than silently
renumbering to 0..N-1. The map is also narrowed by ROCM_TEST_ARCH
filtering, the heterogeneous-fleet fallback, and the
ROCGDB_ROCM_GPU_MAX_PARALLEL cap. For the duration of the body,
ROCR_VISIBLE_DEVICES is pinned to that physical index so the
inferior sees exactly the GPU it was scheduled on.
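A minimal Python model of the pool-to-physical map follows. It assumes ROCR_VISIBLE_DEVICES holds comma-separated integer indices, that a mask whose length differs from the enumerated device count falls back to the identity map, and that the cap keeps the first pools; the function names are hypothetical and ROCM_TEST_ARCH filtering is left out.

```python
def build_pool_map(enumerated, visible_devices=None, max_parallel=None):
    """Map dense pool index -> physical device index.

    ENUMERATED is the number of devices the runtime reports;
    VISIBLE_DEVICES is the pre-set ROCR_VISIBLE_DEVICES value, if any.
    """
    if visible_devices:
        mask = [int(tok) for tok in visible_devices.split(",")]
    else:
        mask = []
    if len(mask) == enumerated:
        physical = mask                      # honour the user's mask
    else:
        physical = list(range(enumerated))   # identity map
    if max_parallel is not None:
        physical = physical[:max_parallel]   # assumed: keep the first pools
    return physical


def body_env(pool_map, pool):
    """Environment override applied while the locked body runs."""
    return {"ROCR_VISIBLE_DEVICES": str(pool_map[pool])}


if __name__ == "__main__":
    pool_map = build_pool_map(2, visible_devices="2,5")
    print(pool_map, body_env(pool_map, 1))
```

With a mask of "2,5", pool 1 pins physical device 5 rather than being renumbered to device 1.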
On heterogeneous (mixed-arch) systems the suite falls back to a
single GPU unless ROCM_TEST_ARCH selects one arch's GPUs explicitly.
When GDB_PARALLEL is unset the suite pins to a single pool,
preserving the pre-parallel single-tenant behaviour. As a failsafe,
if any local device lacks multi-process debug support (per
hip_devices_support_debug_multi_process), parallel mode is downgraded
to serial regardless of the per-test marker.
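The fallbacks in this paragraph can be sketched as two small Python functions. The names are hypothetical, the arch strings are only sample values, and picking GPU 0 for the heterogeneous fallback is an assumption, since the description does not say which GPU is chosen.

```python
def active_gpus(archs, test_arch=None, gdb_parallel=True):
    """Return the GPU indices the suite may schedule on.

    ARCHS lists one architecture name per GPU index.
    """
    if test_arch is not None:
        eligible = [i for i, arch in enumerate(archs) if arch == test_arch]
    elif len(set(archs)) > 1:
        eligible = [0]  # mixed-arch fallback; choice of GPU 0 is assumed
    else:
        eligible = list(range(len(archs)))
    if not gdb_parallel:
        eligible = eligible[:1]  # single pool when GDB_PARALLEL is unset
    return eligible


def effective_mode(marker, all_devices_multi_process):
    """Downgrade parallel to serial if any device lacks multi-process debug."""
    if marker == "parallel" and not all_devices_multi_process:
        return "serial"
    return marker


if __name__ == "__main__":
    mixed = ["gfx90a", "gfx1100", "gfx90a"]
    print(active_gpus(mixed))                      # falls back to one GPU
    print(active_gpus(mixed, test_arch="gfx90a"))  # both gfx90a devices
    print(effective_mode("parallel", False))
```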
Extend the check-parallel lockfile cleanup glob in Makefile.in to
sweep the new per-pool slot and barrier files.
Update gdb.rocm/addr-bp-gpu-no-deb-info.exp accordingly:
with_rocm_gpu_lock now pins ROCR_VISIBLE_DEVICES in the dejagnu
process environment and GDB inherits it at startup, so clean_restart
and the HIP_ENABLE_DEFERRED_LOADING set have to run inside the lock;
drop the now-redundant hard-coded
"set environment ROCR_VISIBLE_DEVICES=0" workaround.
Add gdb.testsuite/slot-pool-lock.exp covering the lock primitives in
lib/gdb-utils.exp that with_rocm_gpu_lock is built on. The test
spawns tclsh children so it exercises the file locks across real
processes; each child appends timestamped acquire/hold/release
events to a shared log that the driver inspects.
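To make the log-inspection idea concrete, here is a hedged Python sketch of how a driver could derive the peak number of concurrent holders from such a log. The "timestamp child event" line format is an assumption made for this example; the actual test is written in Tcl and its log format is not shown here.

```python
def peak_holders(log_text):
    """Return the largest number of children holding a lock at once."""
    events = []
    for line in log_text.splitlines():
        stamp, child, event = line.split()
        events.append((float(stamp), child, event))
    events.sort(key=lambda entry: entry[0])  # order by timestamp only
    holding, peak = set(), 0
    for _, child, event in events:
        if event == "acquire":
            holding.add(child)
        elif event == "release":
            holding.discard(child)
        # "hold" events carry no state change in this model
        peak = max(peak, len(holding))
    return peak


if __name__ == "__main__":
    overlapping = "1.0 a acquire\n1.1 b acquire\n1.5 a release\n1.6 b release"
    print(peak_holders(overlapping))
```

A shared-capacity check would then compare the peak against the pool's slot count, and a mutex check would require a peak of 1 while an exclusive holder is active.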
Properties checked:
* shared-capacity   - N shared acquirers hold concurrently within
                      a single pool's slot count.
* mutex             - exclusive waits for the shared holder to
                      release before acquiring.
* multi-pool-spread - row-major fill places the first NPOOLS
                      shared acquirers on distinct pools.
* writer-priority   - a queued exclusive blocks new shared
                      acquirers, so a late shared arrival cannot
                      slip in when the pool drains; it must wait
                      for the exclusive to complete.
* error-releases    - an error raised inside with_shared_lock_multi
                      propagates and releases the slot, so the
                      next acquire is immediate.
This commit is validation-only and adds no callers of the lock
layer beyond the new test.
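The multi-pool-spread property follows from the row-major probe order. The Python sketch below models that order with a random start offset per row; the function names and the in-memory "taken" set are illustrative stand-ins for the slot files, not the Tcl implementation.

```python
import random


def probe_order(npools, nslots, rng=random):
    """List (pool, slot) pairs row by row, each row starting at a random pool."""
    order = []
    for slot in range(nslots):
        start = rng.randrange(npools)
        for step in range(npools):
            order.append(((start + step) % npools, slot))
    return order


def place(taken, npools, nslots, rng=random):
    """Claim the first free (pool, slot) in probe order, or None if full."""
    for cell in probe_order(npools, nslots, rng):
        if cell not in taken:
            taken.add(cell)
            return cell
    return None


if __name__ == "__main__":
    taken = set()
    first_four = [place(taken, npools=4, nslots=2) for _ in range(4)]
    print(sorted(first_four))  # one tenant per pool, all on slot 0
```

Whatever offsets the random source picks, every slot-0 cell is probed before any slot-1 cell, so the first NPOOLS acquirers always land on distinct pools.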
Mark 60 gdb.rocm tests as parallel-safe and wrap two more in
with_rocm_gpu_lock so they participate in the shared lock at all.
For each parallel-safe test, add
    set rocm_gpu_concurrency parallel
just below "load_lib rocm.exp", which lets with_rocm_gpu_lock take a
shared slot instead of locking the whole GPU. The marked tests all
launch a single small kernel and only inspect per-inferior state, so
they can safely share a GPU with other rocm tests up to the
configured slot count.
simple-outside-debugger.exp and mi-attach.exp did not previously
call with_rocm_gpu_lock at all, so they ran without any coordination
against the rest of the suite even though both do exercise the GPU.
Wrap the GPU-using portion of each in with_rocm_gpu_lock and opt in
to shared mode as well.
The remaining wrapped tests (precise-memory*, device-attach,
device-interrupt, fork-exec-*, multi-inferior-*,
step-schedlock-spurious-waves, load-core-remote-system) are
intentionally left unmarked and continue to take the exclusive
(per-GPU) lock, since they exercise device-wide state, run hog
kernels, or require multiple concurrent GPU inferiors.
maint-print-registers.exp is left untouched because it is a pure
host-side test with no GPU launch.
Add a "ROCm GPU Parallel Testing" section to gdb/testsuite/README
describing the knobs that govern how gdb.rocm tests share GPUs under
GDB_PARALLEL.
The per-test ::rocm_gpu_concurrency marker selects one of three
modes: parallel (one slot on any GPU), serial (the default; every
slot of one GPU, leaving other GPUs free), and machine (every slot
of every active GPU). The section also documents the
hip_devices_support_debug_multi_process failsafe that downgrades
parallel to serial on devices that cannot host two debugged
processes at once.
The environment variables documented are:
* ROCGDB_ROCM_PARALLEL_SLOTS - per-GPU concurrent-test cap
(default 1), including the =1 setting that restores the
historical "one rocm test per GPU" behaviour and the effect of
higher values on serial-marked tests.
* ROCGDB_ROCM_GPU_MAX_PARALLEL - cap on the number of active GPU
pools (unset by default, meaning use every eligible GPU); lower
it if kfd/dbgapi contention from many concurrent debug sessions
produces spurious mid-run failures.
* ROCR_VISIBLE_DEVICES - pre-set in the environment to restrict
the lock layer to a chosen subset of the node's physical GPUs.
Documents the pool-index -> physical-index mapping (pool 0 pins
the first listed device, etc.), the interaction with
ROCM_TEST_ARCH filtering, and the length-mismatch fallback to
an identity map.
Round it out with the combined ceiling
    min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL)
      * ROCGDB_ROCM_PARALLEL_SLOTS
on the number of concurrent rocm tests when every test is
parallel-marked.
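The ceiling is simple arithmetic; a small Python sketch with a hypothetical function name makes the effect of each knob visible. An unset cap is modelled as None.

```python
def max_concurrent_rocm_tests(eligible_gpus, slots_per_gpu, gpu_max_parallel=None):
    """Upper bound on concurrent rocm tests when every test is parallel-marked."""
    if gpu_max_parallel is None:
        pools = eligible_gpus                      # cap unset: use every GPU
    else:
        pools = min(eligible_gpus, gpu_max_parallel)
    return pools * slots_per_gpu


if __name__ == "__main__":
    print(max_concurrent_rocm_tests(8, 1))     # 8 GPUs, one test each
    print(max_concurrent_rocm_tests(8, 2, 4))  # capped to 4 GPUs, 2 slots each
```

For example, 8 eligible GPUs with a cap of 4 and 2 slots per GPU give min(8, 4) * 2 = 8 concurrent tests.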
runtime-core.exp ran six independent do_test invocations serially in a single .exp file. Each one regenerates a HIP coredump and restarts GDB, so the file dominated gdb.rocm wall-clock time.
Move the shared body to runtime-core.exp.tcl and add one thin shard per (fault, core_type, output_type) combination. Each shard sets the three parameters and sources the shared tcl, mirroring the pattern used by gdb.base/all-architectures-*.exp and gdb.arch/aarch64-sme-core-*.exp.
The shards still spawn HIP applications that crash and dump cores via the runtime, which is sensitive to other rocm tests sharing the device. Each shard therefore acquires a whole GPU (the default "serial" mode); other GPUs remain free, so shards still run in parallel across devices.
13407bc to 1155a4a
Did you find the reason why "make check-parallel" currently fails? I'd prefer seeing the fix for that in a separate change first, before all the sharding stuff.
I don't think it fails. Not deterministically, anyway. It does fail from time to time due to non-determinism in how the GPUs are being used and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization. We do have a couple of tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel. But I don't think they would've caused issues.
Oh, that's not my experience. We have internal tickets about this. My suspicion (I think recorded in the Jiras) is that we were missing some with_rocm_gpu_lock calls in some testcases.
On the contrary -- that's exactly the sort of thing that causes problems, for which with_rocm_gpu_lock was invented in the first place. Maybe you're running tests on a GPU that allows debugging multiple processes at the same time. But on a GPU that does not, a missing with_rocm_gpu_lock call in one testcase breaks any other that runs at the same time. So I think that bit should be hoisted out into its own commit, and preferably, its own PR.
Done now in #153.
Blocked on:
Summary
Add a per-GPU lock-pool layer so gdb.rocm tests can run in parallel across GPUs (and, optionally, share a GPU) under make check-parallel, instead of serialising the entire rocm suite behind a single machine-wide lock.

Today every gdb.rocm test contends for the same gpu-parallel.lock, so on an N-GPU node the suite still runs one rocm test at a time. This series introduces a generic shared/exclusive file-lock primitive in lib/gdb-utils.exp, replaces the single lockfile with one pool per visible GPU, lets each test opt into one of three concurrency modes, and ships a self-test plus documentation for the new knobs.
What's in each commit
* gdb/testsuite/rocm: allow parallel rocm tests via per-GPU lock pools
  New shared/exclusive lock primitives in lib/gdb-utils.exp (single-pool [...] multi-GPU node spreads before stacking). lib/rocm.exp replaces the single gpu-parallel.lock with one pool per visible GPU, picks per-test concurrency from ::rocm_gpu_concurrency (parallel/serial/machine), and pins ROCR_VISIBLE_DEVICES to the scheduled GPU for the body. Honours a pre-set ROCR_VISIBLE_DEVICES (pool 0 → first listed device), ROCM_TEST_ARCH filtering, and a ROCGDB_ROCM_GPU_MAX_PARALLEL cap. Falls back to single-GPU on heterogeneous fleets; downgrades parallel→serial on devices without multi-process debug support. Extends the check-parallel cleanup glob to sweep the new per-pool files.
* gdb/testsuite: self-test for the slot-pool shared/exclusive lock layer
  gdb.testsuite/slot-pool-lock.exp exercises the new primitives across real tclsh child processes: shared capacity, shared/exclusive mutex, multi-pool spread, writer priority, exclusive-behind-exclusive, machine lock, randomised start offset, and error-releases-lock.
* gdb/testsuite/rocm: opt rocm tests into shared (parallel) GPU lock
  Marks 60 gdb.rocm tests as parallel-safe (small single-kernel tests that only inspect per-inferior state). Wraps simple-outside-debugger.exp and mi-attach.exp in with_rocm_gpu_lock (they previously ran with no coordination). Tests that touch device-wide state, spawn hog kernels, or use multiple concurrent GPU inferiors stay on the default serial mode.
* gdb/testsuite: document gdb.rocm parallel testing controls
  New "ROCm GPU Parallel Testing" section in gdb/testsuite/README covering ::rocm_gpu_concurrency, the failsafe, and the ROCGDB_ROCM_PARALLEL_SLOTS / ROCGDB_ROCM_GPU_MAX_PARALLEL / ROCR_VISIBLE_DEVICES knobs, with the combined concurrency ceiling.
* gdb/testsuite/rocm: split runtime-core.exp into parallel shards
  Splits runtime-core.exp's six serial do_test invocations into one shard per (fault, core_type, output_type) combination sourcing a shared runtime-core.exp.tcl, mirroring all-architectures-*.exp / aarch64-sme-core-*.exp. Each shard takes a whole GPU (default serial mode); other GPUs stay free.
Defaults
* ROCGDB_ROCM_PARALLEL_SLOTS=1: one rocm test per GPU; raise it on hardware/stacks known to tolerate higher kfd/dbgapi concurrency.
* ROCGDB_ROCM_GPU_MAX_PARALLEL unset: use every eligible GPU.
* Per-test concurrency mode: serial (no behavioural change for unmarked tests, except they no longer block other GPUs).
Test plan
* make check-parallel TESTS="gdb.testsuite/slot-pool-lock.exp" passes (self-test of the lock layer).
* make check-parallel TESTS="gdb.rocm/*.exp" passes on a multi-GPU node and finishes meaningfully faster than the pre-series baseline.
* [...] regressions).
* ROCR_VISIBLE_DEVICES=<subset> pre-set: rocm tests only touch the listed devices, regardless of pool indexing.
* ROCGDB_ROCM_GPU_MAX_PARALLEL=1: rocm suite shares a single GPU end-to-end (equivalent to the pre-series behaviour).
* [...] ROCM_TEST_ARCH: suite pins to a single GPU and logs the fallback reason.
* [...] make check run: unchanged behaviour.