Thanks to visit codestin.com
Credit goes to github.com

Skip to content

[InstCombine] Do not combine shuffle+bitcast if the bitcast is eliminable. #135769

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

rj-jesus
Copy link
Contributor

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long standing regression affecting SIMDe on
AArch64 introduced indirectly by the AlwaysInliner (1a2e77c). Some
reproducers:

…able.

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long standing regression affecting SIMDe on
AArch64 introduced indirectly by the alwaysinliner (1a2e77c). Examples
of reproducers:
* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
@llvmbot
Copy link
Member

llvmbot commented Apr 15, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Ricardo Jesus (rj-jesus)

Changes

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long standing regression affecting SIMDe on
AArch64 introduced indirectly by the AlwaysInliner (1a2e77c). Some
reproducers:


Full diff: https://github.com/llvm/llvm-project/pull/135769.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp (+12-4)
  • (modified) llvm/test/Transforms/InstCombine/shufflevec-bitcast.ll (+15)
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp b/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
index f897cc7855d2d..f6423cb40492e 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
@@ -3029,10 +3029,18 @@ Instruction *InstCombinerImpl::visitShuffleVectorInst(ShuffleVectorInst &SVI) {
     SmallVector<BitCastInst *, 8> BCs;
     DenseMap<Type *, Value *> NewBCs;
     for (User *U : SVI.users())
-      if (BitCastInst *BC = dyn_cast<BitCastInst>(U))
-        if (!BC->use_empty())
-          // Only visit bitcasts that weren't previously handled.
-          BCs.push_back(BC);
+      if (BitCastInst *BC = dyn_cast<BitCastInst>(U)) {
+        // Only visit bitcasts that weren't previously handled.
+        if (BC->use_empty())
+          continue;
+        // Prefer to combine bitcasts of bitcasts before attempting this fold.
+        if (BC->hasOneUse()) {
+          auto *BC2 = dyn_cast<BitCastInst>(BC->user_back());
+          if (BC2 && isEliminableCastPair(BC, BC2))
+            continue;
+        }
+        BCs.push_back(BC);
+      }
     for (BitCastInst *BC : BCs) {
       unsigned BegIdx = Mask.front();
       Type *TgtTy = BC->getDestTy();
diff --git a/llvm/test/Transforms/InstCombine/shufflevec-bitcast.ll b/llvm/test/Transforms/InstCombine/shufflevec-bitcast.ll
index f20077243273c..c6152368f06fd 100644
--- a/llvm/test/Transforms/InstCombine/shufflevec-bitcast.ll
+++ b/llvm/test/Transforms/InstCombine/shufflevec-bitcast.ll
@@ -235,3 +235,18 @@ define <3 x i4> @shuf_bitcast_wrong_size(<2 x i8> %v, i8 %x) {
   %r = shufflevector <4 x i4> %b, <4 x i4> undef, <3 x i32> <i32 0, i32 1, i32 2>
   ret <3 x i4> %r
 }
+
+; Negative test - chain of bitcasts.
+
+define <16 x i8> @shuf_bitcast_chain(<8 x i32> %v) {
+; CHECK-LABEL: @shuf_bitcast_chain(
+; CHECK-NEXT:    [[S:%.*]] = shufflevector <8 x i32> [[V:%.*]], <8 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[C:%.*]] = bitcast <4 x i32> [[S]] to <16 x i8>
+; CHECK-NEXT:    ret <16 x i8> [[C]]
+;
+  %s = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %a = bitcast <4 x i32> %s to <2 x i64>
+  %b = bitcast <2 x i64> %a to i128
+  %c = bitcast i128 %b to <16 x i8>
+  ret <16 x i8> %c
+}

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative here would be to fold the

  %s.bc = bitcast <8 x i32> %v to <2 x i128>
  %s.extract = extractelement <2 x i128> %s.bc, i64 0
  %c = bitcast i128 %s.extract to <16 x i8>

pattern to shufflevector + bitcast instead. That would seem like the more robust solution. Do you think that would be viable, or does that run up against some other problem (I guess introducing shufflevectors without cost model may be problematic?)

@rj-jesus
Copy link
Contributor Author

The alternative here would be to fold the

  %s.bc = bitcast <8 x i32> %v to <2 x i128>
  %s.extract = extractelement <2 x i128> %s.bc, i64 0
  %c = bitcast i128 %s.extract to <16 x i8>

pattern to shufflevector + bitcast instead. That would seem like the more robust solution. Do you think that would be viable, or does that run up against some other problem (I guess introducing shufflevectors without cost model may be problematic?)

Going that route seemed more complex, and I wasn't sure if it could have unintended consequences elsewhere (as you say, because of the shufflevectors). Would you like me to give it a try, though?

@rj-jesus
Copy link
Contributor Author

rj-jesus commented Apr 22, 2025

Hi @nikic, shall I open a PR with the alternative fold (bitcast+extractelement+bitcast to shufflevector+bitcast) so that we can compare both approaches, or are you happy for us to block the initial fold (shufflevector+bitcast to bitcast+extractelement) from happening in the first place?

@nikic
Copy link
Contributor

nikic commented Apr 22, 2025

@rj-jesus I think it would be good to at least try the alternative and see whether it runs into any problems.

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rj-jesus rj-jesus merged commit c91c3f9 into llvm:main Apr 30, 2025
11 checks passed
@rj-jesus rj-jesus deleted the rjj/instcombine-skip-shuffle-bitcast-bitcast branch April 30, 2025 07:22
@llvm-ci
Copy link
Collaborator

llvm-ci commented Apr 30, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-dev-x86-64 running on ml-opt-dev-x86-64-b2 while building llvm at step 4 "cmake-configure".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/17726

Here is the relevant piece of the build log for the reference
Step 4 (cmake-configure) failure: cmake (failure)
...
-- Targeting XCore
-- Registering ExampleIRTransforms as a pass plugin (static build: OFF)
-- Registering Bye as a pass plugin (static build: OFF)
-- Failed to find LLVM FileCheck
-- Google Benchmark version: v0.0.0, normalized to 0.0.0
-- Looking for shm_open in rt
-- Looking for shm_open in rt - found
-- Performing Test HAVE_CXX_FLAG_WALL
-- Performing Test HAVE_CXX_FLAG_WALL - Success
-- Performing Test HAVE_CXX_FLAG_WEXTRA
-- Performing Test HAVE_CXX_FLAG_WEXTRA - Success
-- Performing Test HAVE_CXX_FLAG_WSHADOW
-- Performing Test HAVE_CXX_FLAG_WSHADOW - Success
-- Performing Test HAVE_CXX_FLAG_WFLOAT_EQUAL
-- Performing Test HAVE_CXX_FLAG_WFLOAT_EQUAL - Success
-- Performing Test HAVE_CXX_FLAG_WOLD_STYLE_CAST
-- Performing Test HAVE_CXX_FLAG_WOLD_STYLE_CAST - Success
-- Performing Test HAVE_CXX_FLAG_WSUGGEST_OVERRIDE
-- Performing Test HAVE_CXX_FLAG_WSUGGEST_OVERRIDE - Success
-- Performing Test HAVE_CXX_FLAG_PEDANTIC
-- Performing Test HAVE_CXX_FLAG_PEDANTIC - Success
-- Performing Test HAVE_CXX_FLAG_PEDANTIC_ERRORS
-- Performing Test HAVE_CXX_FLAG_PEDANTIC_ERRORS - Success
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32 - Failed
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAVE_CXX_FLAG_FNO_EXCEPTIONS
-- Performing Test HAVE_CXX_FLAG_FNO_EXCEPTIONS - Success
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WD654
-- Performing Test HAVE_CXX_FLAG_WD654 - Failed
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY - Failed
-- Performing Test HAVE_CXX_FLAG_COVERAGE
-- Performing Test HAVE_CXX_FLAG_COVERAGE - Success
-- Compiling and running to test HAVE_GNU_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Compiling and running to test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX -- success
-- Compiling and running to test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK -- success
-- Compiling and running to test HAVE_PTHREAD_AFFINITY
-- Performing Test HAVE_PTHREAD_AFFINITY -- failed to compile
-- Configuring incomplete, errors occurred!
See also "/b/ml-opt-dev-x86-64-b1/build/CMakeFiles/CMakeOutput.log".
See also "/b/ml-opt-dev-x86-64-b1/build/CMakeFiles/CMakeError.log".

@rj-jesus
Copy link
Contributor Author

rj-jesus commented May 1, 2025

Hi @nikic - Would it be reasonable to backport this to LLVM 20?
This fixes a ~30% regression in the workloads where the regression was detected.

@nikic
Copy link
Contributor

nikic commented May 1, 2025

@rj-jesus Yes, this should be fine to backport.

@rj-jesus
Copy link
Contributor Author

rj-jesus commented May 1, 2025

Perfect, thanks very much!

@rj-jesus
Copy link
Contributor Author

rj-jesus commented May 1, 2025

/cherry-pick c91c3f9

@llvmbot
Copy link
Member

llvmbot commented May 1, 2025

/pull-request #138142

@llvmbot llvmbot moved this from Needs Triage to Done in LLVM Release Status May 1, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…able. (llvm#135769)

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long-standing regression affecting SIMDe on
AArch64, introduced indirectly by the AlwaysInliner (1a2e77c). Some
reproducers:
* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…able. (llvm#135769)

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long-standing regression affecting SIMDe on
AArch64, introduced indirectly by the AlwaysInliner (1a2e77c). Some
reproducers:
* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…able. (llvm#135769)

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long-standing regression affecting SIMDe on
AArch64, introduced indirectly by the AlwaysInliner (1a2e77c). Some
reproducers:
* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
…able. (llvm#135769)

If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.

The motivation for this is a long-standing regression affecting SIMDe on
AArch64, introduced indirectly by the AlwaysInliner (1a2e77c). Some
reproducers:
* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

4 participants