Run gdb.rocm tests in parallel across GPUs #149

Open
lumachad wants to merge 5 commits into amd-staging from users/lumachad/amd-staging/parallel_rocm

@lumachad
Collaborator

@lumachad lumachad commented Jun 1, 2026

Blocked on:

Summary

Add a per-GPU lock-pool layer so gdb.rocm tests can run in parallel across
GPUs (and, optionally, share a GPU) under make check-parallel, instead of
serialising the entire rocm suite behind a single machine-wide lock.

Today every gdb.rocm test contends for the same gpu-parallel.lock, so on
an N-GPU node the suite still runs one rocm test at a time. This series
introduces a generic shared/exclusive file-lock primitive in
lib/gdb-utils.exp, replaces the single lockfile with one pool per visible
GPU, lets each test opt into one of three concurrency modes, and ships a
self-test plus documentation for the new knobs.

What's in each commit

  1. gdb/testsuite/rocm: allow parallel rocm tests via per-GPU lock pools
    New shared/exclusive lock primitives in lib/gdb-utils.exp (single-pool
    + multi-pool variants, writer-priority barriers, row-major fill so a
    multi-GPU node spreads before stacking). lib/rocm.exp replaces the
    single gpu-parallel.lock with one pool per visible GPU, picks
    per-test concurrency from ::rocm_gpu_concurrency
    (parallel/serial/machine), and pins ROCR_VISIBLE_DEVICES to the
    scheduled GPU for the body. Honours a pre-set ROCR_VISIBLE_DEVICES
    (pool 0 → first listed device), ROCM_TEST_ARCH filtering, and a
    ROCGDB_ROCM_GPU_MAX_PARALLEL cap. Falls back to single-GPU on
    heterogeneous fleets; downgrades parallel→serial on devices without
    multi-process debug support. Extends the check-parallel cleanup glob
    to sweep the new per-pool files.
  2. gdb/testsuite: self-test for the slot-pool shared/exclusive lock layer
    gdb.testsuite/slot-pool-lock.exp exercises the new primitives across
    real tclsh child processes: shared capacity, shared/exclusive mutex,
    multi-pool spread, writer priority, exclusive-behind-exclusive,
    machine lock, randomised start offset, and error-releases-lock.

  3. gdb/testsuite/rocm: opt rocm tests into shared (parallel) GPU lock
    Marks 60 gdb.rocm tests as parallel-safe (small single-kernel
    tests that only inspect per-inferior state). Wraps
    simple-outside-debugger.exp and mi-attach.exp in
    with_rocm_gpu_lock (they previously ran with no coordination).
    Tests that touch device-wide state, spawn hog kernels, or use
    multiple concurrent GPU inferiors stay on the default serial mode.

  4. gdb/testsuite: document gdb.rocm parallel testing controls
    New "ROCm GPU Parallel Testing" section in gdb/testsuite/README
    covering ::rocm_gpu_concurrency, the failsafe, and the
    ROCGDB_ROCM_PARALLEL_SLOTS / ROCGDB_ROCM_GPU_MAX_PARALLEL /
    ROCR_VISIBLE_DEVICES knobs, with the combined concurrency
    ceiling.

  5. gdb/testsuite/rocm: split runtime-core.exp into parallel shards
    Splits runtime-core.exp's six serial do_test invocations into
    one shard per (fault, core_type, output_type) combination
    sourcing a shared runtime-core.exp.tcl, mirroring
    all-architectures-*.exp / aarch64-sme-core-*.exp. Each shard
    takes a whole GPU (default serial mode); other GPUs stay free.

Defaults

  • ROCGDB_ROCM_PARALLEL_SLOTS=1 — one rocm test per GPU; raise it on
    hardware/stacks known to tolerate higher kfd/dbgapi concurrency.
  • ROCGDB_ROCM_GPU_MAX_PARALLEL unset — use every eligible GPU.
  • Per-test default mode is serial (no behavioural change for unmarked
    tests, except they no longer block other GPUs).
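As a rough illustration of how these defaults combine with the ceiling described for commit 4, the sketch below computes the maximum number of concurrent rocm tests when every test is parallel-marked. It is a Python model, not the Tcl in lib/rocm.exp; the function and parameter names are hypothetical.

```python
from typing import Optional


def max_concurrent_rocm_tests(eligible_gpus: int,
                              slots_per_gpu: int = 1,
                              gpu_max_parallel: Optional[int] = None) -> int:
    """Model of min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL) * ROCGDB_ROCM_PARALLEL_SLOTS.

    slots_per_gpu stands in for ROCGDB_ROCM_PARALLEL_SLOTS (default 1);
    gpu_max_parallel stands in for ROCGDB_ROCM_GPU_MAX_PARALLEL (None = unset).
    """
    if gpu_max_parallel is None:
        active_pools = eligible_gpus
    else:
        active_pools = min(eligible_gpus, gpu_max_parallel)
    return active_pools * slots_per_gpu


print(max_concurrent_rocm_tests(8))                        # defaults on an 8-GPU node
print(max_concurrent_rocm_tests(8, slots_per_gpu=2, gpu_max_parallel=4))
```

With the defaults the ceiling equals the number of eligible GPUs; serial-marked tests lower the real figure because each one occupies every slot of its GPU.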

Test plan

  • make check-parallel TESTS="gdb.testsuite/slot-pool-lock.exp" passes
    (self-test of the lock layer).
  • make check-parallel TESTS="gdb.rocm/*.exp" passes on a multi-GPU
    node and finishes meaningfully faster than the pre-series baseline.
  • Same on a single-GPU node (parallel mode collapses to serial; no
    regressions).
  • With ROCR_VISIBLE_DEVICES=<subset> pre-set: rocm tests only touch
    the listed devices, regardless of pool indexing.
  • With ROCGDB_ROCM_GPU_MAX_PARALLEL=1: rocm suite shares a single
    GPU end-to-end (equivalent to the pre-series behaviour).
  • Heterogeneous-arch fleet without ROCM_TEST_ARCH: suite pins to
    a single GPU and logs the fallback reason.
  • Non-parallel make check run: unchanged behaviour.

@lumachad lumachad self-assigned this Jun 1, 2026
@lumachad lumachad requested a review from a team as a code owner June 1, 2026 14:26
lumachad added 5 commits June 1, 2026 09:44
Add a generic shared/exclusive file-lock layer in lib/gdb-utils.exp
and wire it into the rocm GPU lock so most gdb.rocm tests can run in
parallel on a multi-GPU node, while tests that need exclusive access
to a device (or to the whole machine) keep their historical
serialisation.

In lib/gdb-utils.exp, add lock_file_try_acquire and the
shared/exclusive helpers (lock_file_acquire_shared,
lock_file_acquire_exclusive, lock_file_release_shared) built on the
existing open-EXCL primitive used by lock_file_acquire.  Shared
acquirers grab one of N slot files; exclusive acquirers grab all N,
with a barrier file giving writer priority so an exclusive waiter
cannot be starved.  Add with_shared_lock and with_exclusive_lock
convenience wrappers, and the multi-pool variants
(with_shared_lock_multi, with_exclusive_lock_multi,
with_machine_lock_multi) used by the rocm lock to spread tests
across GPUs.  The shared multi-pool acquirer fills slot 0 of every
pool before touching slot 1 of any pool, so up to N tests on an
N-GPU box each get a dedicated GPU before any GPU sees two tenants;
a randomised per-row start offset avoids concentrating acquirers on
pool 0.  An ASCII diagram in the file header shows how the four
tiers relate.  The single-pool wrappers (with_shared_lock and
with_exclusive_lock) are registered in tclint-plugin.py; the
multi-pool variants are intentionally not registered because the
in-tree caller (with_rocm_gpu_lock) builds the body in a variable
that tclint cannot analyse as a script.
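
The row-major fill with a randomised per-row start offset can be modelled as an ordering over (pool, slot) candidates. The following is a minimal Python sketch of that ordering only, under the assumption that a shared acquirer simply tries candidates in this order; the function name is hypothetical and the real Tcl code may differ in detail.

```python
import random
from typing import Iterator, Optional, Tuple


def shared_candidates(n_pools: int, n_slots: int,
                      rng: Optional[random.Random] = None) -> Iterator[Tuple[int, int]]:
    """Yield (pool, slot) pairs in the order a shared acquirer would try them."""
    rng = rng or random.Random()
    for slot in range(n_slots):            # one row = the same slot index in every pool
        start = rng.randrange(n_pools)     # fresh random start offset for this row
        for i in range(n_pools):
            yield ((start + i) % n_pools, slot)


order = list(shared_candidates(n_pools=3, n_slots=2, rng=random.Random(0)))
print(order)
```

Whatever the offsets turn out to be, the first n_pools candidates all have slot index 0 and name distinct pools, which is the "dedicated GPU before any GPU sees two tenants" property.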

In lib/rocm.exp, replace the single-file gpu-parallel.lock with one
pool per visible GPU.  Per-GPU width is configurable via
ROCGDB_ROCM_PARALLEL_SLOTS (default 1).  The active pool count is
capped at min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL); the cap
is unset by default and exists so users can lower the active set if
they hit spurious failures from kfd/dbgapi contention.
with_rocm_gpu_lock now dispatches on a per-test marker
"set rocm_gpu_concurrency {parallel|serial|machine}": parallel
takes one slot on any GPU, serial (the default) takes every slot of
one GPU while leaving other GPUs free, and machine takes every slot
of every GPU to restore the historical "no other rocm test runs
anywhere" semantics.
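
One way to picture the three modes is as the set of slots a test has to hold while its body runs. The sketch below is a Python model under that assumption; the function name and the (pool, slot) representation are illustrative, not taken from lib/rocm.exp.

```python
from typing import List, Tuple


def slots_to_hold(mode: str, pool: int, slot: int,
                  n_pools: int, n_slots: int) -> List[Tuple[int, int]]:
    """Return the (pool, slot) pairs a test holds under the given concurrency mode."""
    if mode == "parallel":      # one slot on one GPU
        return [(pool, slot)]
    if mode == "serial":        # every slot of one GPU, other GPUs stay free
        return [(pool, s) for s in range(n_slots)]
    if mode == "machine":       # every slot of every GPU
        return [(p, s) for p in range(n_pools) for s in range(n_slots)]
    raise ValueError(f"unknown rocm_gpu_concurrency mode: {mode!r}")


for mode in ("parallel", "serial", "machine"):
    print(mode, slots_to_hold(mode, pool=1, slot=0, n_pools=2, n_slots=2))
```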

At startup, build a rocm_pool_to_physical list mapping each dense
pool index to a physical device index.  When the user pre-sets
ROCR_VISIBLE_DEVICES the map honours that mask, so pool i pins the
physical device the user actually picked rather than silently
renumbering to 0..N-1.  The map is also narrowed by ROCM_TEST_ARCH
filtering, the heterogeneous-fleet fallback, and the
ROCGDB_ROCM_GPU_MAX_PARALLEL cap.  For the duration of the body,
ROCR_VISIBLE_DEVICES is pinned to that physical index so the
inferior sees exactly the GPU it was scheduled on.
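
A simplified Python model of the pool-to-physical map and the per-body pinning might look like the following. It assumes ROCR_VISIBLE_DEVICES holds a comma-separated list of integer indices and that the cap keeps the first entries of the map; ROCM_TEST_ARCH filtering and the heterogeneous-fleet fallback are left out. All function names are hypothetical.

```python
from typing import Dict, List, Optional


def build_pool_map(n_devices: int,
                   visible_devices: Optional[str] = None,
                   max_parallel: Optional[int] = None) -> List[int]:
    """Map dense pool indices 0..K-1 to physical device indices."""
    if visible_devices:
        # Honour the user's mask: pool 0 is the first listed device, and so on.
        physical = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    else:
        physical = list(range(n_devices))
    if max_parallel is not None:
        physical = physical[:max_parallel]   # assumed: the cap keeps the first entries
    return physical


def env_for_pool(pool_map: List[int], pool: int, base_env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with ROCR_VISIBLE_DEVICES pinned to the pool's device."""
    env = dict(base_env)
    env["ROCR_VISIBLE_DEVICES"] = str(pool_map[pool])
    return env


pool_map = build_pool_map(n_devices=4, visible_devices="2,3")
print(pool_map)
print(env_for_pool(pool_map, 0, {})["ROCR_VISIBLE_DEVICES"])
```

Here a user mask of "2,3" yields two pools, and the test scheduled on pool 0 sees physical device 2 rather than device 0.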

On heterogeneous (mixed-arch) systems the suite falls back to a
single GPU unless ROCM_TEST_ARCH selects one arch's GPUs explicitly.
When GDB_PARALLEL is unset the suite pins to a single pool,
preserving the pre-parallel single-tenant behaviour.  As a failsafe,
if any local device lacks multi-process debug support (per
hip_devices_support_debug_multi_process), parallel mode is downgraded
to serial regardless of the per-test marker.
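
The fallbacks in this paragraph can be summarised as two small decisions: how many pools are active, and which mode a test actually runs in. The Python sketch below models them under the assumptions stated in the comments; the function and parameter names are hypothetical.

```python
def active_pool_count(eligible_gpus: int, gdb_parallel_set: bool,
                      homogeneous: bool, arch_selected: bool) -> int:
    """Number of GPU pools in use, before any ROCGDB_ROCM_GPU_MAX_PARALLEL cap."""
    if not gdb_parallel_set:
        return 1                      # non-parallel run: single pool, single tenant
    if not homogeneous and not arch_selected:
        return 1                      # mixed-arch node without ROCM_TEST_ARCH
    return eligible_gpus              # assumed to count only GPUs of the selected arch


def effective_mode(requested: str = "serial", multi_process_debug: bool = True) -> str:
    """Apply the failsafe: parallel becomes serial without multi-process debug support."""
    if requested == "parallel" and not multi_process_debug:
        return "serial"
    return requested


print(active_pool_count(4, gdb_parallel_set=True, homogeneous=False, arch_selected=False))
print(effective_mode("parallel", multi_process_debug=False))
```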

Extend the check-parallel lockfile cleanup glob in Makefile.in to
sweep the new per-pool slot and barrier files.

Update gdb.rocm/addr-bp-gpu-no-deb-info.exp accordingly:
with_rocm_gpu_lock now pins ROCR_VISIBLE_DEVICES in the dejagnu
process environment and GDB inherits it at startup, so clean_restart
and the HIP_ENABLE_DEFERRED_LOADING set have to run inside the lock;
drop the now-redundant hard-coded
"set environment ROCR_VISIBLE_DEVICES=0" workaround.

Add gdb.testsuite/slot-pool-lock.exp covering the lock primitives in
lib/gdb-utils.exp that with_rocm_gpu_lock is built on.  The test
spawns tclsh children so it exercises the file locks across real
processes; each child appends timestamped acquire/hold/release
events to a shared log that the driver inspects.

Properties checked:

  * shared-capacity   - N shared acquirers hold concurrently within
                        a single pool's slot count.
  * mutex             - exclusive waits for the shared holder to
                        release before acquiring.
  * multi-pool-spread - row-major fill places the first NPOOLS
                        shared acquirers on distinct pools.
  * writer-priority   - a queued exclusive blocks new shared
                        acquirers, so a late shared arrival cannot
                        slip in when the pool drains; it must wait
                        for the exclusive to complete.
  * error-releases    - an error raised inside with_shared_lock_multi
                        propagates and releases the slot, so the
                        next acquire is immediate.
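
To make these properties concrete, here is a toy single-process Python model of a slot pool built on exclusive file creation. It is not the Tcl code in lib/gdb-utils.exp: the class and method names are hypothetical, acquisition is non-blocking, the race between the barrier check and slot creation is ignored, and rolling back a partial exclusive grab is a simplification of this sketch.

```python
import os
import tempfile
from contextlib import contextmanager


def try_create(path):
    """Atomically create PATH; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


class SlotPool:
    """Toy single-pool lock: N slot files plus one barrier file."""

    def __init__(self, directory, n_slots):
        self.slots = [os.path.join(directory, f"slot-{i}") for i in range(n_slots)]
        self.barrier = os.path.join(directory, "barrier")

    def try_shared(self):
        """Take one free slot; refuse while an exclusive waiter holds the barrier."""
        if os.path.exists(self.barrier):
            return None
        for path in self.slots:
            if try_create(path):
                return path
        return None

    def queue_exclusive(self):
        """Announce an exclusive waiter so that new shared acquirers back off."""
        return try_create(self.barrier)

    def try_finish_exclusive(self):
        """Take every slot or none."""
        taken = []
        for path in self.slots:
            if not try_create(path):
                for held in taken:
                    os.remove(held)
                return False
            taken.append(path)
        return True

    def release_exclusive(self):
        for path in self.slots:
            os.remove(path)
        os.remove(self.barrier)

    @contextmanager
    def shared(self):
        """Hold one shared slot for the body; release it even if the body raises."""
        path = self.try_shared()
        if path is None:
            raise BlockingIOError("no shared slot available")
        try:
            yield path
        finally:
            os.remove(path)


def demo():
    results = {}
    with tempfile.TemporaryDirectory() as d:
        pool = SlotPool(d, n_slots=2)
        a, b, c = pool.try_shared(), pool.try_shared(), pool.try_shared()
        results["shared-capacity"] = a is not None and b is not None and c is None
        pool.queue_exclusive()
        results["mutex"] = not pool.try_finish_exclusive()
        os.remove(a)                                   # one shared holder leaves
        results["writer-priority"] = pool.try_shared() is None
        os.remove(b)                                   # pool has drained
        results["exclusive-after-drain"] = pool.try_finish_exclusive()
        pool.release_exclusive()
        try:
            with pool.shared():
                raise RuntimeError("body failed")
        except RuntimeError:
            pass
        results["error-releases"] = all(not os.path.exists(p) for p in pool.slots)
    return results


print(demo())
```

Every entry in the printed dictionary should be True. The actual self-test checks the same ideas across real tclsh child processes with blocking acquires, which this model does not attempt.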

This commit is validation-only and adds no callers of the lock
layer beyond the new test.

Mark 60 gdb.rocm tests as parallel-safe and wrap two more in
with_rocm_gpu_lock so they participate in the shared lock at all.

For each parallel-safe test, add

    set rocm_gpu_concurrency parallel

just below "load_lib rocm.exp", which lets with_rocm_gpu_lock take a
shared slot instead of locking the whole GPU.  The marked tests all
launch a single small kernel and only inspect per-inferior state, so
they can safely share a GPU with other rocm tests up to the
configured slot count.

simple-outside-debugger.exp and mi-attach.exp did not previously
call with_rocm_gpu_lock at all, so they ran without any coordination
against the rest of the suite even though both do exercise the GPU.
Wrap the GPU-using portion of each in with_rocm_gpu_lock and opt in
to shared mode as well.

The remaining wrapped tests (precise-memory*, device-attach,
device-interrupt, fork-exec-*, multi-inferior-*,
step-schedlock-spurious-waves, load-core-remote-system) are
intentionally left unmarked and continue to take the exclusive
(per-GPU) lock, since they exercise device-wide state, run hog
kernels, or require multiple concurrent GPU inferiors.
maint-print-registers.exp is left untouched because it is a pure
host-side test with no GPU launch.

Add a "ROCm GPU Parallel Testing" section to gdb/testsuite/README
describing the knobs that govern how gdb.rocm tests share GPUs under
GDB_PARALLEL.

The per-test ::rocm_gpu_concurrency marker selects one of three
modes: parallel (one slot on any GPU), serial (the default; every
slot of one GPU, leaving other GPUs free), and machine (every slot
of every active GPU).  The section also documents the
hip_devices_support_debug_multi_process failsafe that downgrades
parallel to serial on devices that cannot host two debugged
processes at once.

The environment variables documented are:

  * ROCGDB_ROCM_PARALLEL_SLOTS - per-GPU concurrent-test cap
    (default 1, which keeps the historical "one rocm test per GPU"
    behaviour), and the effect of higher values on serial-marked
    tests.

  * ROCGDB_ROCM_GPU_MAX_PARALLEL - cap on the number of active GPU
    pools (unset by default, meaning use every eligible GPU); lower
    it if kfd/dbgapi contention from many concurrent debug sessions
    produces spurious mid-run failures.

  * ROCR_VISIBLE_DEVICES - pre-set in the environment to restrict
    the lock layer to a chosen subset of the node's physical GPUs.
    Documents the pool-index -> physical-index mapping (pool 0 pins
    the first listed device, etc.), the interaction with
    ROCM_TEST_ARCH filtering, and the length-mismatch fallback to
    an identity map.

Round it out with the combined ceiling

    min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL)
      * ROCGDB_ROCM_PARALLEL_SLOTS

on the number of concurrent rocm tests when every test is
parallel-marked.

runtime-core.exp ran six independent do_test invocations serially in
a single .exp file.  Each one regenerates a HIP coredump and restarts
GDB, so the file dominated gdb.rocm wall-clock time.

Move the shared body to runtime-core.exp.tcl and add one thin shard
per (fault, core_type, output_type) combination.  Each shard sets
the three parameters and sources the shared tcl, mirroring the
pattern used by gdb.base/all-architectures-*.exp and
gdb.arch/aarch64-sme-core-*.exp.

The shards still spawn HIP applications that crash and dump cores via
the runtime, which is sensitive to other rocm tests sharing the
device.  Each shard therefore acquires a whole GPU (the default
"serial" mode); other GPUs remain free, so shards still run in
parallel across devices.

@lumachad lumachad force-pushed the users/lumachad/amd-staging/parallel_rocm branch from 13407bc to 1155a4a June 1, 2026 14:44
@palves
Collaborator

palves commented Jun 2, 2026

Did you find the reason why currently "make check-parallel" fails? I'd prefer seeing the fix for that in a separate change first, before all the sharding stuff.

@lumachad
Collaborator Author

lumachad commented Jun 2, 2026

> Did you find the reason why currently "make check-parallel" fails? I'd prefer seeing the fix for that in a separate change first, before all the sharding stuff.

I don't think it fails. Not deterministically anyway. It does fail from time to time due to non-determinism in what GPU usage looks like and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization.

We do have a couple of tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel. But I don't think they would've caused issues.

@palves
Copy link
Copy Markdown
Collaborator

palves commented Jun 2, 2026

> I don't think it fails. Not deterministically anyway. It does fail from time to time due to non-determinism in what GPU usage looks like and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization.
>
> We do have a couple of tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel.

Oh, that's not my experience. We have internal tickets about this. My suspicion (I think recorded in the Jira tickets) is that we were missing some with_rocm_gpu_lock calls in some testcases.

> But I don't think they would've caused issues.

On the contrary -- that's exactly the sort of thing that causes problems, and it is what with_rocm_gpu_lock was invented for in the first place. Maybe you're running tests on a GPU that allows debugging multiple processes at the same time. But on a GPU that does not, a missing with_rocm_gpu_lock call in one testcase breaks any other that runs at the same time.

So I think that bit should be hoisted out into its own commit, and preferably, its own PR.

@lumachad
Copy link
Copy Markdown
Collaborator Author

lumachad commented Jun 2, 2026

> > I don't think it fails. Not deterministically anyway. It does fail from time to time due to non-determinism in what GPU usage looks like and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization.
> > We do have a couple of tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel.
>
> Oh, that's not my experience. We have internal tickets about this. My suspicion (I think recorded in the Jira tickets) is that we were missing some with_rocm_gpu_lock calls in some testcases.
>
> > But I don't think they would've caused issues.
>
> On the contrary -- that's exactly the sort of thing that causes problems, and it is what with_rocm_gpu_lock was invented for in the first place. Maybe you're running tests on a GPU that allows debugging multiple processes at the same time. But on a GPU that does not, a missing with_rocm_gpu_lock call in one testcase breaks any other that runs at the same time.
>
> So I think that bit should be hoisted out into its own commit, and preferably, its own PR.

Done now in #153.
