[libc] Honour LIBC_GPU_TEST_JOBS in lit test runs #193797
Under CTest, LIBC_GPU_TEST_JOBS controlled a ninja job pool that limited concurrent GPU test processes. The AMD GPU buildbot sets this to 4 to avoid overloading the GPU driver.

When running tests via lit, this constraint was lost because lit uses its own -j flag (defaulting to nproc, or set to 64 on the AMD bot via LLVM_LIT_ARGS). All GPU loader processes launched simultaneously, leading to hangs from GPU resource exhaustion.

Propagated LIBC_GPU_TEST_JOBS into the lit site config as a parallelism group so lit throttles GPU test concurrency independently of the global -j setting.
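To illustrate the throttling idea described above, here is a minimal, self-contained sketch. It is not lit's implementation: it assumes a thread pool stands in for lit's `-j` workers and a bounded semaphore stands in for the parallelism group. The function name `run_with_group_limit` and its parameters are hypothetical and exist only for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_group_limit(num_tests, global_jobs, group_limit, work_seconds=0.01):
    """Run fake tests on `global_jobs` workers and return the peak number
    that were inside the limited group at the same time."""
    # Stand-in for a parallelism group: at most `group_limit` holders at once.
    group_slots = threading.BoundedSemaphore(group_limit)
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_gpu_test(_index):
        nonlocal active, peak
        with group_slots:  # blocks while the group is full
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(work_seconds)  # pretend to run a GPU loader process
            with lock:
                active -= 1

    # Stand-in for the global -j setting: many workers are available,
    # but the semaphore still caps how many run the "GPU" work.
    with ThreadPoolExecutor(max_workers=global_jobs) as pool:
        list(pool.map(fake_gpu_test, range(num_tests)))
    return peak


if __name__ == "__main__":
    # Mirrors the AMD bot numbers from the description: -j 64, group limit 4.
    peak = run_with_group_limit(num_tests=64, global_jobs=64, group_limit=4)
    print("peak concurrent GPU tests:", peak)  # never exceeds 4
```

The point of the sketch is that the group limit is enforced separately from the worker count, so raising the global job count does not raise the number of simultaneous GPU tests.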
@llvm/pr-subscribers-libc
Author: Jeff Bailey (kaladron)
Full diff: https://github.com/llvm/llvm-project/pull/193797.diff
2 Files Affected:
diff --git a/libc/cmake/modules/prepare_libc_gpu_build.cmake b/libc/cmake/modules/prepare_libc_gpu_build.cmake
index c87a1df926c85..554c6c49b0435 100644
--- a/libc/cmake/modules/prepare_libc_gpu_build.cmake
+++ b/libc/cmake/modules/prepare_libc_gpu_build.cmake
@@ -29,6 +29,7 @@ if(LIBC_GPU_TEST_JOBS)
   set_property(GLOBAL PROPERTY JOB_POOLS LIBC_GPU_TEST_POOL=${LIBC_GPU_TEST_JOBS})
   set(LIBC_HERMETIC_TEST_JOB_POOL JOB_POOL LIBC_GPU_TEST_POOL)
 else()
+  set(LIBC_GPU_TEST_JOBS 1)
   set_property(GLOBAL PROPERTY JOB_POOLS LIBC_GPU_TEST_POOL=1)
   set(LIBC_HERMETIC_TEST_JOB_POOL JOB_POOL LIBC_GPU_TEST_POOL)
 endif()
diff --git a/libc/test/lit.site.cfg.py.in b/libc/test/lit.site.cfg.py.in
index 3668a491cd05c..bc8d0e3e31713 100644
--- a/libc/test/lit.site.cfg.py.in
+++ b/libc/test/lit.site.cfg.py.in
@@ -40,3 +40,8 @@ if hasattr(config, "llvm_tools_dir") and config.llvm_tools_dir:
[config.llvm_tools_dir, config.environment.get("PATH", "")]
)
+# Limit concurrent GPU tests to avoid overloading the GPU driver.
+libc_gpu_test_jobs = "@LIBC_GPU_TEST_JOBS@"
+if libc_gpu_test_jobs:
+    lit_config.parallelism_groups["libc-gpu"] = int(libc_gpu_test_jobs)
+    config.parallelism_group = "libc-gpu"
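
As a rough illustration of what the added site-config lines do, the sketch below mimics them outside of lit. It assumes a plain dict stands in for `lit_config.parallelism_groups` and that `raw_value` is the string CMake substitutes for `@LIBC_GPU_TEST_JOBS@`. The helper name `apply_gpu_job_limit` is hypothetical and is not part of the patch.

```python
def apply_gpu_job_limit(raw_value, parallelism_groups):
    """Mimic the added lit.site.cfg.py.in lines using a plain dict.

    Returns the group name a test config would be assigned, or None when
    the substituted value is empty and no limit is registered.
    """
    if not raw_value:
        # Empty substitution (for example, the variable was never set):
        # leave the groups untouched, so only the global -j applies.
        return None
    parallelism_groups["libc-gpu"] = int(raw_value)
    return "libc-gpu"


if __name__ == "__main__":
    groups = {}
    print(apply_gpu_job_limit("4", groups), groups)  # libc-gpu {'libc-gpu': 4}
    print(apply_gpu_job_limit("", {}))               # None
```

Note that `int()` raises `ValueError` on a non-numeric string, so a malformed LIBC_GPU_TEST_JOBS value would fail when the config is loaded rather than silently running unthrottled.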
michaelrj-google
left a comment
Approving to avoid blocking lit switchover
jhuber6
left a comment
I'll need to reevaluate this. It's a bit better than it was in the past, but it's still possible to exhaust scratch.
My best guess is that what was taking out the AMD fmul test was running 64 GPU tests simultaneously. I think it did remarkably well. =) The NV tests on my machine happily handled 32.