
Conversation

@Flamefire (Contributor) commented Dec 18, 2025

(created using eb --new-pr)

Includes:

It makes sense to merge #24365 first, as any changes there need to be reflected here, but this allows testing both in parallel.

…tests-0.15.0-GCCcore-14.3.0.eb, PyTorch-2.9.1-foss-2025b-CUDA-12.9.1.eb, unittest-xml-reporting-3.2.0-GCCcore-14.3.0.eb and patches: PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.7.0_disable-dev-shm-test.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.6.0_show-test-duration.patch, PyTorch-2.6.0_skip-test_segfault.patch, PyTorch-2.7.0_avoid_caffe2_test_cpp_jit.patch, PyTorch-2.7.1_avoid-caffe2-sandcastle-test-lib.patch, PyTorch-2.7.1_skip-test_data_parallel_rnn.patch, PyTorch-2.7.1_skip-test_gds_fails_in_ci.patch, PyTorch-2.7.1_skip-test_mixed_mm_exhaustive_dtypes.patch, PyTorch-2.7.1_skip-tests-requiring-SM90.patch, PyTorch-2.7.1_suport-64bit-BARs.patch, PyTorch-2.7.1_tolerance-test_partial_flat_weights.patch, PyTorch-2.9.0_disable-test_nan_assert.patch, PyTorch-2.9.0_enable-symbolizer-in-test_workspace_allocation_error.patch, PyTorch-2.9.0_fix-attention-squeeze.patch, PyTorch-2.9.0_fix-FP16-CPU-tests-in-test_torchinductor_opinfo.patch, PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_fix-test_exclude_padding.patch, PyTorch-2.9.0_fix-test_version_error.patch, PyTorch-2.9.0_honor-XDG_CACHE_HOME.patch, PyTorch-2.9.0_increase-tolerance-in-test_transformers.patch, PyTorch-2.9.0_remove-faulty-close.patch, PyTorch-2.9.0_revert-pybind11-3-change.patch, PyTorch-2.9.0_skip-test_benchmark_on_non_zero_device.patch, PyTorch-2.9.0_skip-test_convolution1-on-H100.patch, PyTorch-2.9.0_skip-test_inductor_all_gather_into_tensor_coalesced.patch, PyTorch-2.9.0_skip-test_original_aten_preserved_pad_mm.patch, PyTorch-2.9.0_skip-test_override-without-CUDA.patch, PyTorch-2.9.0_skip-test_unbacked_reduction.patch, PyTorch-2.9.0_skip-tests-requiring-CUDA-12.8.patch, PyTorch-2.9.0_skip-unexpected-success-in-test_fake_export.patch, PyTorch-2.9.1_skip-RingFlexAttentionTest.patch
The github-actions bot added the 2025b label (issues & PRs related to 2025b common toolchains update) on Dec 18, 2025
@github-actions commented:

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre (comment marked as outdated)

@Thyre (comment marked as resolved)

@Flamefire (Contributor, Author) commented:

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 55 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/576f0dbeceb975733d860d97f16ca3fc for a full test report.

== 2025-12-19 10:19:27,773 build_log.py:233 ERROR EasyBuild encountered an error: Nothing found to replace 'if IS_CI:\n\s+# Add the option to generate XML test report.*' in test/run_test.py (at easybuild/tools/filetools.py:1861 in apply_regex_substitutions)

Are you using the latest easyblock? It seems to be missing the commit from easybuilders/easybuild-easyblocks#3803.
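For context: the error comes from the easyblock rewriting test/run_test.py with a regex substitution; when the expected pattern is not found, apply_regex_substitutions aborts with the "Nothing found to replace" error shown above. Below is a minimal illustrative sketch of that failure mode, not the actual easyblock or filetools code (the helper name is made up):

```python
# Illustrative sketch only (not the actual easyblock/filetools code): the PyTorch
# easyblock rewrites test/run_test.py via a regex substitution, and if the expected
# pattern is missing (e.g. because the easyblock predates easyblocks PR #3803),
# nothing matches and the step fails with "Nothing found to replace ...".
import re

def substitute_or_fail(path, pattern, replacement):
    """Apply a regex substitution to the file at `path`, failing if nothing matched."""
    with open(path) as handle:
        contents = handle.read()
    new_contents, num_subs = re.subn(pattern, replacement, contents)
    if num_subs == 0:
        # Mirrors the error reported above by apply_regex_substitutions
        raise RuntimeError("Nothing found to replace %r in %s" % (pattern, path))
    with open(path, 'w') as handle:
        handle.write(new_contents)

# Pattern from the log that no longer matched the file being rewritten:
# r"if IS_CI:\n\s+# Add the option to generate XML test report.*"
```

With an easyblock that matches the PyTorch version being built, the pattern is found and the step succeeds.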

@Thyre (comment marked as outdated)

@Flamefire (Contributor, Author) commented:

2025b uses GCC 14, which introduces new warnings. See pytorch/pytorch#166873.

Patch added. It seems to only affect ARM.

@Thyre (comment marked as outdated)

@Flamefire (Contributor, Author) commented:

Oh, it is a C file. Updated the patch to also add the flag to the C flags.
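For illustration only, a hypothetical easyconfig-level way to pass such a flag to both the C and C++ compilations; the actual fix in this PR lives in the patch, the flag below is just a placeholder for whichever GCC 14 warning is involved, and it assumes the CMake-based build honours CFLAGS/CXXFLAGS from the environment:

```python
# Hypothetical alternative to the patch (placeholder flag, not the actual change in this PR):
# export the suppression for both C and C++ sources so that C files are covered too.
local_extra_flags = '-Wno-error=maybe-uninitialized'  # placeholder for the GCC 14 warning involved
prebuildopts = 'export CFLAGS="$CFLAGS %s" CXXFLAGS="$CXXFLAGS %s" && ' % (local_extra_flags, local_extra_flags)
```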

@Thyre (comment marked as outdated)

@Flamefire (Contributor, Author) commented:

Looks like I need to set those values earlier. Can you try again?

@Thyre (Collaborator) commented Dec 19, 2025:

The actual failure was an internal GCC compiler error:

In file included from /dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/build/aten/src/ATen/native/cpu/Unfold2d.cpp.SVE256.cpp:1:
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp: In function ‘void at::native::{anonymous}::unfolded2d_acc_kernel(c10::ScalarType, void*, void*, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool)’:
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp:225:1: error: unrecognizable insn:
  225 | }
      | ^
(insn 1375 1374 1376 99 (set (reg:VNx16BI 3253)
        (unspec:VNx16BI [
                (reg:VNx16BI 3250)
                (reg:VNx8BI 3252)
                (const_vector:VNx4BI [
                        (const_int 0 [0]) repeated x8
                    ])
            ] UNSPEC_TRN1_CONV)) "/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/torch/headeronly/util/bit_cast.h":40:14 -1
     (nil))
during RTL pass: vregs
/dev/shm/reuter1/easybuild/build/PyTorch/2.9.1/foss-2025b-CUDA-12.9.1/pytorch-v2.9.1/aten/src/ATen/native/cpu/Unfold2d.cpp:225:1: internal compiler error: in extract_insn, at recog.cc:2812
0x7d30df _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../gcc/rtl-error.cc:108
0x7d3113 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../gcc/rtl-error.cc:116
0xec1d17 extract_insn(rtx_insn*)
	../../gcc/recog.cc:2812
0xc2a28b instantiate_virtual_regs_in_insn
	../../gcc/function.cc:1612
0xc2a28b instantiate_virtual_regs
	../../gcc/function.cc:1995
0xc2a28b execute
	../../gcc/function.cc:2042
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Test report by @Thyre
FAILED
Build succeeded for 3 out of 4 (total: 17 mins 7 secs) (4 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/bdc1ee06d4f8b430f52f9c220b66e11f for a full test report.


@Flamefire (Contributor, Author) commented Dec 19, 2025:

The failure may be caused by this GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027

There was a PR that should have worked around this, but seemingly the fix doesn't work? See also:

* https://github.com/pytorch/pytorch/blob/f026b098e4319413db7d3fc1dbcb39dda69fcf0c/aten/src/ATen/native/cpu/Unfold2d.cpp#L172

* [Build error: unrecognizable insn with using gcc-14 on aarch64 pytorch/pytorch#157842](https://github.com/pytorch/pytorch/issues/157842)

That fix is not included in this (or any) release yet. I'll add it to the patch list.
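As a rough sketch of what adding such an upstream workaround to the easyconfig could look like (the patch file name and checksum below are placeholders, not the actual entries):

```python
# Hypothetical easyconfig fragment; file name and checksum are placeholders.
patches = [
    # ... existing PyTorch 2.9.1 patches ...
    'PyTorch-2.9.1_workaround-gcc14-sve-ice.patch',  # backport of the upstream Unfold2d.cpp workaround
]
checksums = [
    # ... existing checksums ...
    {'PyTorch-2.9.1_workaround-gcc14-sve-ice.patch': '<sha256 of the patch>'},
]
```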

Maybe we need to patch GCCcore/14.3.0 with this change? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027#c9

That would be an option, but I'm not sure it is worth it: this easyconfig has been included since EasyBuild 5.1.0, although we have patched already-released easyconfigs in the past.

@Flamefire changed the title from "{tools}[GCCcore/14.3.0] parameterized v0.9.0, pytest-subtests v0.15.0, PyTorch v2.9.1, ... w/ CUDA 12.9.1" to "{tools}[GCCcore/14.3.0] PyTorch v2.9.1, parameterized v0.9.0, pytest-subtests v0.15.0, ... w/ CUDA 12.9.1" on Dec 19, 2025
@Flamefire (Contributor, Author) commented:
Test report by @Flamefire
FAILED
Build succeeded for 3 out of 4 (total: 9 mins 46 secs) (4 easyconfigs in total)
i8025 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/693aa68464c9935e44bbc6f730a51469 for a full test report.

@boegel added this to the next release (5.2.1?) milestone on Dec 31, 2025