
Conversation

@Flamefire (Contributor) commented Dec 18, 2025

(created using eb --new-pr)

Includes:

It makes sense to merge #24365 first, as any changes there need to be reflected here, but this allows testing both in parallel.

…tests-0.15.0-GCCcore-14.3.0.eb, PyTorch-2.9.1-foss-2025b-CUDA-12.9.1.eb, unittest-xml-reporting-3.2.0-GCCcore-14.3.0.eb and patches: PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.7.0_disable-dev-shm-test.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.6.0_show-test-duration.patch, PyTorch-2.6.0_skip-test_segfault.patch, PyTorch-2.7.0_avoid_caffe2_test_cpp_jit.patch, PyTorch-2.7.1_avoid-caffe2-sandcastle-test-lib.patch, PyTorch-2.7.1_skip-test_data_parallel_rnn.patch, PyTorch-2.7.1_skip-test_gds_fails_in_ci.patch, PyTorch-2.7.1_skip-test_mixed_mm_exhaustive_dtypes.patch, PyTorch-2.7.1_skip-tests-requiring-SM90.patch, PyTorch-2.7.1_suport-64bit-BARs.patch, PyTorch-2.7.1_tolerance-test_partial_flat_weights.patch, PyTorch-2.9.0_disable-test_nan_assert.patch, PyTorch-2.9.0_enable-symbolizer-in-test_workspace_allocation_error.patch, PyTorch-2.9.0_fix-attention-squeeze.patch, PyTorch-2.9.0_fix-FP16-CPU-tests-in-test_torchinductor_opinfo.patch, PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_fix-test_exclude_padding.patch, PyTorch-2.9.0_fix-test_version_error.patch, PyTorch-2.9.0_honor-XDG_CACHE_HOME.patch, PyTorch-2.9.0_increase-tolerance-in-test_transformers.patch, PyTorch-2.9.0_remove-faulty-close.patch, PyTorch-2.9.0_revert-pybind11-3-change.patch, PyTorch-2.9.0_skip-test_benchmark_on_non_zero_device.patch, PyTorch-2.9.0_skip-test_convolution1-on-H100.patch, PyTorch-2.9.0_skip-test_inductor_all_gather_into_tensor_coalesced.patch, PyTorch-2.9.0_skip-test_original_aten_preserved_pad_mm.patch, PyTorch-2.9.0_skip-test_override-without-CUDA.patch, PyTorch-2.9.0_skip-test_unbacked_reduction.patch, PyTorch-2.9.0_skip-tests-requiring-CUDA-12.8.patch, PyTorch-2.9.0_skip-unexpected-success-in-test_fake_export.patch, PyTorch-2.9.1_skip-RingFlexAttentionTest.patch
The github-actions bot added the 2025b label (issues & PRs related to 2025b common toolchains update) on Dec 18, 2025
@github-actions commented:

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre (comment marked as outdated)

@Thyre (comment marked as resolved)

@Flamefire (Contributor, Author) commented:

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 55 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/576f0dbeceb975733d860d97f16ca3fc for a full test report.

== 2025-12-19 10:19:27,773 build_log.py:233 ERROR EasyBuild encountered an error: Nothing found to replace 'if IS_CI:\n\s+# Add the option to generate XML test report.*' in test/run_test.py (at easybuild/tools/filetools.py:1861 in apply_regex_substitutions)

Are you using the latest easyblock? It seems to be missing the commit from easybuilders/easybuild-easyblocks#3803.
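For context: the error comes from the easyblock rewriting test/run_test.py with a regex substitution; when the expected pattern is not found, apply_regex_substitutions aborts with the "Nothing found to replace" error shown above. Below is a minimal illustrative sketch of that failure mode, not the actual easyblock or filetools code (the helper name is made up):

```python
# Illustrative sketch only (not the actual easyblock/filetools code): the PyTorch
# easyblock rewrites test/run_test.py via a regex substitution, and if the expected
# pattern is missing (e.g. because the easyblock predates easyblocks PR #3803),
# nothing matches and the step fails with "Nothing found to replace ...".
import re

def substitute_or_fail(path, pattern, replacement):
    """Apply a regex substitution to the file at `path`, failing if nothing matched."""
    with open(path) as handle:
        contents = handle.read()
    new_contents, num_subs = re.subn(pattern, replacement, contents)
    if num_subs == 0:
        # Mirrors the error reported above by apply_regex_substitutions
        raise RuntimeError("Nothing found to replace %r in %s" % (pattern, path))
    with open(path, 'w') as handle:
        handle.write(new_contents)

# Pattern from the log that no longer matched the file being rewritten:
# r"if IS_CI:\n\s+# Add the option to generate XML test report.*"
```

With an easyblock that matches the PyTorch version being built, the pattern is found and the step succeeds.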

@Thyre (comment marked as outdated)

@Flamefire (Contributor, Author) commented:

2025b uses GCC 14, which introduces new warnings. See pytorch/pytorch#166873.

Patch added. It seems to only affect ARM.

@Thyre (comment marked as outdated)

@Flamefire (Contributor, Author) commented:

Oh, it is a C file. Updated the patch to also add the flag to the C flags.
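For illustration only, a hypothetical easyconfig-level way to pass such a flag to both the C and C++ compilations; the actual fix in this PR lives in the patch, the flag below is just a placeholder for whichever GCC 14 warning is involved, and it assumes the CMake-based build honours CFLAGS/CXXFLAGS from the environment:

```python
# Hypothetical alternative to the patch (placeholder flag, not the actual change in this PR):
# export the suppression for both C and C++ sources so that C files are covered too.
local_extra_flags = '-Wno-error=maybe-uninitialized'  # placeholder for the GCC 14 warning involved
prebuildopts = 'export CFLAGS="$CFLAGS %s" CXXFLAGS="$CXXFLAGS %s" && ' % (local_extra_flags, local_extra_flags)
```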

@Thyre (comment marked as outdated)

@Flamefire (Contributor, Author) commented:

Looks like I need to set those values earlier. Can you try again?

@Thyre (Collaborator) commented Dec 19, 2025:

The actual failure was an internal GCC compiler error:

In file included from /dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/build/aten/src/ATen/native/cpu/Unfold2d.cpp.SVE256.cpp:1:
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp: In function ‘void at::native::{anonymous}::unfolded2d_acc_kernel(c10::ScalarType, void*, void*, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool)’:
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp:225:1: error: unrecognizable insn:
  225 | }
      | ^
(insn 1375 1374 1376 99 (set (reg:VNx16BI 3253)
        (unspec:VNx16BI [
                (reg:VNx16BI 3250)
                (reg:VNx8BI 3252)
                (const_vector:VNx4BI [
                        (const_int 0 [0]) repeated x8
                    ])
            ] UNSPEC_TRN1_CONV)) "/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/torch/headeronly/util/bit_cast.h":40:14 -1
     (nil))
during RTL pass: vregs
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp:225:1: internal compiler error: in extract_insn, at recog.cc:2812
0x7d30df _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../gcc/rtl-error.cc:108
0x7d3113 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../gcc/rtl-error.cc:116
0xec1d17 extract_insn(rtx_insn*)
	../../gcc/recog.cc:2812
0xc2a28b instantiate_virtual_regs_in_insn
	../../gcc/function.cc:1612
0xc2a28b instantiate_virtual_regs
	../../gcc/function.cc:1995
0xc2a28b execute
	../../gcc/function.cc:2042
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 17 mins 7 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/bdc1ee06d4f8b430f52f9c220b66e11f for a full test report.


@Flamefire (Contributor, Author) commented Dec 19, 2025:

The failure may be caused by this GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027

There was a PR that should have worked around this, but seemingly the fix doesn't work? See also:

* https://github.com/pytorch/pytorch/blob/f026b098e4319413db7d3fc1dbcb39dda69fcf0c/aten/src/ATen/native/cpu/Unfold2d.cpp#L172

* [Build error: unrecognizable insn with using gcc-14 on aarch64 pytorch/pytorch#157842](https://github.com/pytorch/pytorch/issues/157842)

That fix is not included in this (or any) release yet. I'll add it to the patch list.
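As a rough sketch of what adding such an upstream workaround to the easyconfig could look like (the patch file name and checksum below are placeholders, not the actual entries):

```python
# Hypothetical easyconfig fragment; file name and checksum are placeholders.
patches = [
    # ... existing PyTorch 2.9.1 patches ...
    'PyTorch-2.9.1_workaround-gcc14-sve-ice.patch',  # backport of the upstream Unfold2d.cpp workaround
]
checksums = [
    # ... existing checksums ...
    {'PyTorch-2.9.1_workaround-gcc14-sve-ice.patch': '<sha256 of the patch>'},
]
```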

Maybe we need to patch GCCcore/14.3.0 with this change? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027#c9

That would be an option, but I'm not sure it is worth it: this easyconfig has been included since EasyBuild 5.1.0, although we have patched already-released easyconfigs in the past.

@Flamefire changed the title from "{tools}[GCCcore/14.3.0] parameterized v0.9.0, pytest-subtests v0.15.0, PyTorch v2.9.1, ... w/ CUDA 12.9.1" to "{tools}[GCCcore/14.3.0] PyTorch v2.9.1, parameterized v0.9.0, pytest-subtests v0.15.0, ... w/ CUDA 12.9.1" on Dec 19, 2025
@Flamefire (Contributor, Author) commented:
Test report by @Flamefire
FAILED
Build succeeded for 3 out of 4 (total: 9 mins 46 secs) (4 easyconfigs in total)
i8025 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/693aa68464c9935e44bbc6f730a51469 for a full test report.

@boegel added this to the next release (5.2.1?) milestone on Dec 31, 2025