
lumachad · 2026-06-01T14:26:13Z

Blocked on:

Wrap mi-attach in with_rocm_gpu_lock #153

Summary

Add a per-GPU lock-pool layer so gdb.rocm tests can run in parallel across
GPUs (and, optionally, share a GPU) under make check-parallel, instead of
serialising the entire rocm suite behind a single machine-wide lock.

Today every gdb.rocm test contends for the same gpu-parallel.lock, so on
an N-GPU node the suite still runs one rocm test at a time. This series
introduces a generic shared/exclusive file-lock primitive in
lib/gdb-utils.exp, replaces the single lockfile with one pool per visible
GPU, lets each test opt into one of three concurrency modes, and ships a
self-test plus documentation for the new knobs.
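
As a rough illustration of the shared/exclusive idea (shared acquirers take one of N slot files, an exclusive acquirer takes all N, and a barrier file stops new shared acquirers once an exclusive waiter is queued), here is a minimal Python sketch. The series itself implements this in Tcl in lib/gdb-utils.exp; the class name, method names, and the `slot-N` / `barrier` file names below are hypothetical and chosen only for the example.

```python
import os
import tempfile


def try_create(path):
    """Atomically create path; True on success, False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


class SlotPool:
    """One lock pool: nslots slot files plus one barrier file (hypothetical names)."""

    def __init__(self, directory, nslots):
        self.directory = directory
        self.nslots = nslots

    def _slot(self, index):
        return os.path.join(self.directory, f"slot-{index}")

    def _barrier(self):
        return os.path.join(self.directory, "barrier")

    def try_shared(self):
        """Take one free slot and return its index, or None.

        A queued exclusive waiter (barrier present) turns new shared
        acquirers away, which is what gives writers priority.
        """
        if os.path.exists(self._barrier()):
            return None
        for index in range(self.nslots):
            if try_create(self._slot(index)):
                return index
        return None

    def release_shared(self, index):
        os.unlink(self._slot(index))

    def begin_exclusive(self):
        """Queue as the exclusive waiter; False if another one is already queued."""
        return try_create(self._barrier())

    def try_finish_exclusive(self, held):
        """Grab every slot not yet in `held`; True once all nslots are owned."""
        for index in range(self.nslots):
            if index not in held and try_create(self._slot(index)):
                held.add(index)
        return len(held) == self.nslots

    def release_exclusive(self, held):
        for index in sorted(held):
            os.unlink(self._slot(index))
        held.clear()
        os.unlink(self._barrier())


if __name__ == "__main__":
    pool = SlotPool(tempfile.mkdtemp(), nslots=2)
    reader = pool.try_shared()      # a reader takes slot 0
    pool.begin_exclusive()          # a writer queues behind it
    print(pool.try_shared())        # None: the barrier blocks new readers
```

The sketch is non-blocking on purpose: a real caller would poll `try_shared` or `try_finish_exclusive` in a retry loop until it succeeds.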

What's in each commit

* gdb/testsuite/rocm: allow parallel rocm tests via per-GPU lock pools
  New shared/exclusive lock primitives in lib/gdb-utils.exp (single-pool
  and multi-pool variants, writer-priority barriers, row-major fill so a
  multi-GPU node spreads before stacking). lib/rocm.exp replaces the
  single gpu-parallel.lock with one pool per visible GPU, picks
  per-test concurrency from ::rocm_gpu_concurrency
  (parallel/serial/machine), and pins ROCR_VISIBLE_DEVICES to the
  scheduled GPU for the body. Honours a pre-set ROCR_VISIBLE_DEVICES
  (pool 0 → first listed device), ROCM_TEST_ARCH filtering, and a
  ROCGDB_ROCM_GPU_MAX_PARALLEL cap. Falls back to single-GPU on
  heterogeneous fleets; downgrades parallel→serial on devices without
  multi-process debug support. Extends the check-parallel cleanup glob
  to sweep the new per-pool files.

* gdb/testsuite: self-test for the slot-pool shared/exclusive lock layer
  gdb.testsuite/slot-pool-lock.exp exercises the new primitives across
  real tclsh child processes: shared capacity, shared/exclusive mutex,
  multi-pool spread, writer priority, exclusive-behind-exclusive,
  machine lock, randomised start offset, and error-releases-lock.

* gdb/testsuite/rocm: opt rocm tests into shared (parallel) GPU lock
  Marks 60 gdb.rocm tests as parallel-safe (small single-kernel
  tests that only inspect per-inferior state). Wraps
  simple-outside-debugger.exp and mi-attach.exp in
  with_rocm_gpu_lock (they previously ran with no coordination).
  Tests that touch device-wide state, spawn hog kernels, or use
  multiple concurrent GPU inferiors stay on the default serial mode.

* gdb/testsuite: document gdb.rocm parallel testing controls
  New "ROCm GPU Parallel Testing" section in gdb/testsuite/README
  covering ::rocm_gpu_concurrency, the failsafe, and the
  ROCGDB_ROCM_PARALLEL_SLOTS / ROCGDB_ROCM_GPU_MAX_PARALLEL /
  ROCR_VISIBLE_DEVICES knobs, with the combined concurrency
  ceiling.

* gdb/testsuite/rocm: split runtime-core.exp into parallel shards
  Splits runtime-core.exp's six serial do_test invocations into
  one shard per (fault, core_type, output_type) combination
  sourcing a shared runtime-core.exp.tcl, mirroring
  all-architectures-*.exp / aarch64-sme-core-*.exp. Each shard
  takes a whole GPU (default serial mode); other GPUs stay free.
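
The "row-major fill" mentioned in the first commit can be pictured as an ordering over (pool, slot) pairs: slot 0 of every pool is tried before slot 1 of any pool, and each row starts at a random pool so acquirers do not pile onto pool 0. The Python generator below is only a sketch of that ordering; the function name and signature are hypothetical, and the actual Tcl code may pick its offset differently.

```python
import random


def shared_candidates(npools, nslots, rng=random):
    """Yield (pool, slot) pairs in row-major order.

    Every pool's slot 0 comes before any pool's slot 1, and so on.
    Each row begins at a randomly chosen pool and wraps around.
    """
    for slot in range(nslots):
        start = rng.randrange(npools)
        for step in range(npools):
            yield ((start + step) % npools, slot)


if __name__ == "__main__":
    # With 4 GPUs and 2 slots each, the first 4 candidates are all slot 0,
    # one per GPU, so 4 tests each get a dedicated GPU before any sharing.
    for pool, slot in shared_candidates(4, 2):
        print(f"pool {pool} slot {slot}")
```

A shared acquirer would walk this sequence and take the first slot file it manages to create.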

Defaults

* ROCGDB_ROCM_PARALLEL_SLOTS=1: one rocm test per GPU; raise it on
  hardware/stacks known to tolerate higher kfd/dbgapi concurrency.
* ROCGDB_ROCM_GPU_MAX_PARALLEL unset: use every eligible GPU.
* Per-test default mode is serial (no behavioural change for unmarked
  tests, except they no longer block other GPUs).
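
To make the three per-test modes and the failsafe concrete, the Python sketch below resolves the mode a test ends up with and how much of the node it claims. It follows the description in this PR (parallel: one slot on one GPU; serial: every slot of one GPU; machine: every slot of every GPU; parallel downgraded to serial without multi-process debug support). The function names are hypothetical and the real dispatch lives in Tcl inside with_rocm_gpu_lock.

```python
MODES = ("parallel", "serial", "machine")


def effective_mode(marker=None, multi_process_debug=True):
    """Resolve the mode for one test.

    marker: value of the per-test rocm_gpu_concurrency marker, or None
            when the test does not set it (serial is the default).
    multi_process_debug: False when some local device cannot host two
            debugged processes at once, which triggers the failsafe.
    """
    mode = marker if marker is not None else "serial"
    if mode not in MODES:
        raise ValueError(f"unknown concurrency mode: {mode!r}")
    if mode == "parallel" and not multi_process_debug:
        return "serial"
    return mode


def slots_claimed(mode, npools, nslots):
    """Return (gpus_locked, slots_per_locked_gpu) for a resolved mode."""
    claims = {
        "parallel": (1, 1),          # one slot on any one GPU
        "serial": (1, nslots),       # a whole GPU, other GPUs stay free
        "machine": (npools, nslots)  # every slot of every GPU
    }
    return claims[mode]


if __name__ == "__main__":
    mode = effective_mode("parallel", multi_process_debug=False)
    print(mode, slots_claimed(mode, npools=4, nslots=2))
```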

Test plan

* make check-parallel TESTS="gdb.testsuite/slot-pool-lock.exp" passes
  (self-test of the lock layer).
* make check-parallel TESTS="gdb.rocm/*.exp" passes on a multi-GPU
  node and finishes meaningfully faster than the pre-series baseline.
* Same on a single-GPU node (parallel mode collapses to serial; no
  regressions).
* With ROCR_VISIBLE_DEVICES=<subset> pre-set: rocm tests only touch
  the listed devices, regardless of pool indexing.
* With ROCGDB_ROCM_GPU_MAX_PARALLEL=1: rocm suite shares a single
  GPU end-to-end (equivalent to the pre-series behaviour).
* Heterogeneous-arch fleet without ROCM_TEST_ARCH: suite pins to
  a single GPU and logs the fallback reason.
* Non-parallel make check run: unchanged behaviour.

Add a generic shared/exclusive file-lock layer in lib/gdb-utils.exp and wire it into the rocm GPU lock so most gdb.rocm tests can run in parallel on a multi-GPU node, while tests that need exclusive access to a device (or to the whole machine) keep their historical serialisation.

In lib/gdb-utils.exp, add lock_file_try_acquire and the shared/exclusive helpers (lock_file_acquire_shared, lock_file_acquire_exclusive, lock_file_release_shared) built on the existing open-EXCL primitive used by lock_file_acquire. Shared acquirers grab one of N slot files; exclusive acquirers grab all N, with a barrier file giving writer priority so an exclusive waiter cannot be starved.

Add with_shared_lock and with_exclusive_lock convenience wrappers, and the multi-pool variants (with_shared_lock_multi, with_exclusive_lock_multi, with_machine_lock_multi) used by the rocm lock to spread tests across GPUs. The shared multi-pool acquirer fills slot 0 of every pool before touching slot 1 of any pool, so up to N tests on an N-GPU box each get a dedicated GPU before any GPU sees two tenants; a randomised per-row start offset avoids concentrating acquirers on pool 0. An ASCII diagram in the file header shows how the four tiers relate.

The single-pool wrappers (with_shared_lock and with_exclusive_lock) are registered in tclint-plugin.py; the multi-pool variants are intentionally not registered because the in-tree caller (with_rocm_gpu_lock) builds the body in a variable that tclint cannot analyse as a script.

In lib/rocm.exp, replace the single-file gpu-parallel.lock with one pool per visible GPU. Per-GPU width is configurable via ROCGDB_ROCM_PARALLEL_SLOTS (default 1). The active pool count is capped at min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL); the cap is unset by default and exists so users can lower the active set if they hit spurious failures from kfd/dbgapi contention.

with_rocm_gpu_lock now dispatches on a per-test marker "set rocm_gpu_concurrency {parallel|serial|machine}": parallel takes one slot on any GPU, serial (the default) takes every slot of one GPU while leaving other GPUs free, and machine takes every slot of every GPU to restore the historical "no other rocm test runs anywhere" semantics.

At startup, build a rocm_pool_to_physical list mapping each dense pool index to a physical device index. When the user pre-sets ROCR_VISIBLE_DEVICES the map honours that mask, so pool i pins the physical device the user actually picked rather than silently renumbering to 0..N-1. The map is also narrowed by ROCM_TEST_ARCH filtering, the heterogeneous-fleet fallback, and the ROCGDB_ROCM_GPU_MAX_PARALLEL cap. For the duration of the body, ROCR_VISIBLE_DEVICES is pinned to that physical index so the inferior sees exactly the GPU it was scheduled on.

On heterogeneous (mixed-arch) systems the suite falls back to a single GPU unless ROCM_TEST_ARCH selects one arch's GPUs explicitly. When GDB_PARALLEL is unset the suite pins to a single pool, preserving the pre-parallel single-tenant behaviour.

As a failsafe, if any local device lacks multi-process debug support (per hip_devices_support_debug_multi_process), parallel mode is downgraded to serial regardless of the per-test marker.

Extend the check-parallel lockfile cleanup glob in Makefile.in to sweep the new per-pool slot and barrier files.

Update gdb.rocm/addr-bp-gpu-no-deb-info.exp accordingly: with_rocm_gpu_lock now pins ROCR_VISIBLE_DEVICES in the dejagnu process environment and GDB inherits it at startup, so clean_restart and the HIP_ENABLE_DEFERRED_LOADING set have to run inside the lock; drop the now-redundant hard-coded "set environment ROCR_VISIBLE_DEVICES=0" workaround.
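
The pool-to-physical mapping and the per-body pinning described in this commit message could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the code in lib/rocm.exp: the function names are hypothetical, ROCM_TEST_ARCH filtering and the heterogeneous-fleet fallback are left out, and the identity-map fallback is modelled as a plain length comparison against the number of enumerated devices, which may not be how the real check works.

```python
import contextlib
import os


def pool_to_physical(n_enumerated, visible_devices=None, max_parallel=None):
    """Map dense pool indices 0..k-1 to the device tokens they pin.

    n_enumerated: number of GPUs the runtime reports (assumed input).
    visible_devices: pre-set ROCR_VISIBLE_DEVICES value, or None.
    max_parallel: ROCGDB_ROCM_GPU_MAX_PARALLEL as an int, or None.
    """
    mapping = [str(i) for i in range(n_enumerated)]
    if visible_devices:
        listed = [tok.strip() for tok in visible_devices.split(",") if tok.strip()]
        # Assumption: a mask whose length disagrees with the enumerated
        # count is ignored and the identity map is kept.
        if len(listed) == n_enumerated:
            mapping = listed
    if max_parallel is not None:
        mapping = mapping[:max_parallel]
    return mapping


@contextlib.contextmanager
def pinned_device(device):
    """Pin ROCR_VISIBLE_DEVICES for the duration of the body, then restore it."""
    previous = os.environ.get("ROCR_VISIBLE_DEVICES")
    os.environ["ROCR_VISIBLE_DEVICES"] = device
    try:
        yield
    finally:
        if previous is None:
            del os.environ["ROCR_VISIBLE_DEVICES"]
        else:
            os.environ["ROCR_VISIBLE_DEVICES"] = previous


if __name__ == "__main__":
    mapping = pool_to_physical(2, visible_devices="3,5")
    with pinned_device(mapping[0]):   # pool 0 pins the first listed device
        print(os.environ["ROCR_VISIBLE_DEVICES"])
```

Restoring the previous value in a `finally` clause mirrors the requirement that an error inside the body must not leave the environment pinned.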

Add gdb.testsuite/slot-pool-lock.exp covering the lock primitives in lib/gdb-utils.exp that with_rocm_gpu_lock is built on. The test spawns tclsh children so it exercises the file locks across real processes; each child appends timestamped acquire/hold/release events to a shared log that the driver inspects.

Properties checked:

* shared-capacity - N shared acquirers hold concurrently within a single pool's slot count.
* mutex - exclusive waits for the shared holder to release before acquiring.
* multi-pool-spread - row-major fill places the first NPOOLS shared acquirers on distinct pools.
* writer-priority - a queued exclusive blocks new shared acquirers, so a late shared arrival cannot slip in when the pool drains; it must wait for the exclusive to complete.
* error-releases - an error raised inside with_shared_lock_multi propagates and releases the slot, so the next acquire is immediate.

This commit is validation-only and adds no callers of the lock layer beyond the new test.
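
For a sense of how a driver might check the shared-capacity property from such a log, the Python sketch below replays acquire/release events in timestamp order and reports the peak number of simultaneous holders. The event tuple layout is hypothetical; the actual log format written by the tclsh children is not described here.

```python
def peak_holders(events):
    """Return the largest number of children holding a slot at once.

    events: iterable of (timestamp, child_id, kind) tuples, where kind
    is "acquire" or "release" (a hypothetical layout for illustration).
    """
    holding = set()
    peak = 0
    # Stable sort on the timestamp only, so same-time events keep log order.
    for _, child, kind in sorted(events, key=lambda event: event[0]):
        if kind == "acquire":
            holding.add(child)
            peak = max(peak, len(holding))
        elif kind == "release":
            holding.discard(child)
    return peak


if __name__ == "__main__":
    log = [
        (1.0, "a", "acquire"),
        (1.5, "b", "acquire"),
        (2.0, "a", "release"),
        (2.5, "c", "acquire"),
        (3.0, "b", "release"),
        (3.5, "c", "release"),
    ]
    # With a pool of 2 slots, the peak must never exceed 2.
    print(peak_holders(log) <= 2)
```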

Mark 60 gdb.rocm tests as parallel-safe and wrap two more in with_rocm_gpu_lock so they participate in the shared lock at all.

For each parallel-safe test, add set rocm_gpu_concurrency parallel just below "load_lib rocm.exp", which lets with_rocm_gpu_lock take a shared slot instead of locking the whole GPU. The marked tests all launch a single small kernel and only inspect per-inferior state, so they can safely share a GPU with other rocm tests up to the configured slot count.

simple-outside-debugger.exp and mi-attach.exp did not previously call with_rocm_gpu_lock at all, so they ran without any coordination against the rest of the suite even though both do exercise the GPU. Wrap the GPU-using portion of each in with_rocm_gpu_lock and opt in to shared mode as well.

The remaining wrapped tests (precise-memory*, device-attach, device-interrupt, fork-exec-*, multi-inferior-*, step-schedlock-spurious-waves, load-core-remote-system) are intentionally left unmarked and continue to take the exclusive (per-GPU) lock, since they exercise device-wide state, run hog kernels, or require multiple concurrent GPU inferiors. maint-print-registers.exp is left untouched because it is a pure host-side test with no GPU launch.

Add a "ROCm GPU Parallel Testing" section to gdb/testsuite/README describing the knobs that govern how gdb.rocm tests share GPUs under GDB_PARALLEL.

The per-test ::rocm_gpu_concurrency marker selects one of three modes: parallel (one slot on any GPU), serial (the default; every slot of one GPU, leaving other GPUs free), and machine (every slot of every active GPU). The section also documents the hip_devices_support_debug_multi_process failsafe that downgrades parallel to serial on devices that cannot host two debugged processes at once.

The environment variables documented are:

* ROCGDB_ROCM_PARALLEL_SLOTS - per-GPU concurrent-test cap (default 2), including the =1 setting that restores the historical "one rocm test per GPU" behaviour and the effect of higher values on serial-marked tests.
* ROCGDB_ROCM_GPU_MAX_PARALLEL - cap on the number of active GPU pools (unset by default, meaning use every eligible GPU); lower it if kfd/dbgapi contention from many concurrent debug sessions produces spurious mid-run failures.
* ROCR_VISIBLE_DEVICES - pre-set in the environment to restrict the lock layer to a chosen subset of the node's physical GPUs. Documents the pool-index -> physical-index mapping (pool 0 pins the first listed device, etc.), the interaction with ROCM_TEST_ARCH filtering, and the length-mismatch fallback to an identity map.

Round it out with the combined ceiling min(eligible_gpus, ROCGDB_ROCM_GPU_MAX_PARALLEL) * ROCGDB_ROCM_PARALLEL_SLOTS on the number of concurrent rocm tests when every test is parallel-marked.
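
The combined ceiling is simple arithmetic, and a small Python helper makes the effect of each knob easy to check. The function name and parameters are hypothetical; the slot count is deliberately a required argument rather than defaulted.

```python
def concurrency_ceiling(eligible_gpus, slots_per_gpu, gpu_max_parallel=None):
    """Upper bound on concurrent rocm tests when every test is parallel-marked.

    eligible_gpus: GPUs left after visibility and arch filtering.
    slots_per_gpu: value of ROCGDB_ROCM_PARALLEL_SLOTS.
    gpu_max_parallel: value of ROCGDB_ROCM_GPU_MAX_PARALLEL, or None if unset.
    """
    if gpu_max_parallel is None:
        active_pools = eligible_gpus
    else:
        active_pools = min(eligible_gpus, gpu_max_parallel)
    return active_pools * slots_per_gpu


if __name__ == "__main__":
    print(concurrency_ceiling(8, 2, gpu_max_parallel=4))  # 8: 4 pools x 2 slots
    print(concurrency_ceiling(4, 1))                      # 4: one test per GPU
```

Serial-marked and machine-marked tests lower the number actually reached, since each of them occupies a whole GPU or the whole node.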

runtime-core.exp ran six independent do_test invocations serially in a single .exp file. Each one regenerates a HIP coredump and restarts GDB, so the file dominated gdb.rocm wall-clock time.

Move the shared body to runtime-core.exp.tcl and add one thin shard per (fault, core_type, output_type) combination. Each shard sets the three parameters and sources the shared tcl, mirroring the pattern used by gdb.base/all-architectures-*.exp and gdb.arch/aarch64-sme-core-*.exp.

The shards still spawn HIP applications that crash and dump cores via the runtime, which is sensitive to other rocm tests sharing the device. Each shard therefore acquires a whole GPU (the default "serial" mode); other GPUs remain free, so shards still run in parallel across devices.

palves · 2026-06-02T12:45:46Z

Did you find the reason why currently "make check-parallel" fails? I'd prefer seeing the fix for that in a separate change first, before all the sharding stuff.

lumachad · 2026-06-02T13:59:05Z

> Did you find the reason why currently "make check-parallel" fails? I'd prefer seeing the fix for that in a separate change first, before all the sharding stuff.

I don't think it fails. Not deterministically, anyway. It does fail from time to time because of non-determinism in what GPU usage looks like and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization.

We do have a couple tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel. But I don't think they would've caused issues.

palves · 2026-06-02T14:51:30Z

> I don't think it fails. Not deterministically, anyway. It does fail from time to time because of non-determinism in what GPU usage looks like and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization.
>
> We do have a couple tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel.

Oh, that's not my experience. We have internal tickets about this. My suspicion (I think recorded in the Jiras) is that we were missing some with_rocm_gpu_lock calls in some testcases.

> But I don't think they would've caused issues.

On the contrary -- that's exactly the sort of thing that causes problems, for which with_rocm_gpu_lock was invented in the first place. Maybe you're running tests on a GPU that allows debugging multiple processes at the same time. But on a GPU that does not, a missing with_rocm_gpu_lock in one testcase breaks any other that runs at the same time.

So I think that bit should be hoisted out into its own commit, and preferably, its own PR.

lumachad · 2026-06-02T15:18:52Z

> > I don't think it fails. Not deterministically, anyway. It does fail from time to time because of non-determinism in what GPU usage looks like and which GPU ROCR picks for the job. The ultimate problem with the current code is wall-clock time and resource utilization.
> > We do have a couple tests (mentioned in the description) that lacked the rocm lock, so those were running in parallel.
>
> Oh, that's not my experience. We have internal tickets about this. My suspicion (I think recorded in the Jiras) is that we were missing some with_rocm_gpu_lock calls in some testcases.
>
> > But I don't think they would've caused issues.
>
> On the contrary -- that's exactly the sort of thing that causes problems, for which with_rocm_gpu_lock was invented in the first place. Maybe you're running tests on a GPU that allows debugging multiple processes at the same time. But on a GPU that does not, a missing with_rocm_gpu_lock in one testcase breaks any other that runs at the same time.
>
> So I think that bit should be hoisted out into its own commit, and preferably, its own PR.

Done now in #153.

lumachad self-assigned this Jun 1, 2026

lumachad requested a review from a team as a code owner June 1, 2026 14:26

lumachad added 5 commits June 1, 2026 09:44

lumachad force-pushed the users/lumachad/amd-staging/parallel_rocm branch from 13407bc to 1155a4a Compare June 1, 2026 14:44


Run gdb.rocm tests in parallel across GPUs #149

lumachad wants to merge 5 commits into amd-staging from users/lumachad/amd-staging/parallel_rocm
