BUG: Address interaction between SME and FPSR (#29223) #29235
Merged
+217
−8
Backport of #29223.
This is intended to resolve #28687.
The root cause is an interaction between Arm Scalable Matrix Extension (SME) and the floating point status register (FPSR).
As noted in Arm docs for FPSR, "On entry to or exit from Streaming SVE
mode, FPSR.{IOC, DZC, OFC, UFC, IXC, IDC, QC} are set to 1 and the
remaining bits are set to 0". This means that floating point status
flags are all raised when SME is used, regardless of values or
operations performed.
These spurious flags are surfacing now because Apple Silicon M4 supports SME and macOS 15.4 enables SME code paths for Accelerate BLAS / LAPACK. However, the SME / FPSR behavior is not specific to Apple Silicon M4 and will also occur on non-Apple chips that use SME.
The changes add compile-time and runtime checks to determine whether BLAS / LAPACK might use SME (macOS / Accelerate only at the moment). If so, special handling of floating-point errors (FPE) is added, which includes:
All tests pass
Performance is similar
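The combination of a compile-time guard and a runtime probe described above could look roughly like the following. This is a sketch, not the code in this PR: the function names are hypothetical, the `hw.optional.arm.FEAT_SME` sysctl key is assumed to be how macOS reports SME support, and the macOS version check (15.4 or later) is omitted for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

#if defined(__APPLE__) && defined(__aarch64__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Runtime probe: true only if the CPU reports SME support.
 * On any platform other than Apple arm64 this compiles to "false". */
bool cpu_reports_sme(void)
{
#if defined(__APPLE__) && defined(__aarch64__)
    int has_sme = 0;
    size_t size = sizeof(has_sme);
    /* Assumed sysctl key; a failed lookup is treated as "no SME". */
    if (sysctlbyname("hw.optional.arm.FEAT_SME",
                     &has_sme, &size, NULL, 0) == 0) {
        return has_sme != 0;
    }
#endif
    return false;
}

/* Caches the probe so hot paths pay for the sysctl only once.
 * Not thread-safe as written; real code would initialize this at
 * module import time. */
bool blas_might_use_sme(void)
{
    static int cached = -1;
    if (cached < 0) {
        cached = cpu_reports_sme() ? 1 : 0;
    }
    return cached == 1;
}
```

Callers would then branch on `blas_might_use_sme()` to decide whether the special FPE handling is needed at all, so platforms without SME keep the existing fast path.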
Another approach would have been to wrap every BLAS / LAPACK call with a save / restore of the FPE state. However, that added substantial overhead to the inner loops that use BLAS / LAPACK; some benchmarks were 8x slower.
Address the linker & linter failures