[LoopIdiom] Select llvm.experimental.memset.pattern intrinsic rather than memset_pattern16 libcall #126736
Conversation
…than memset_pattern16 libcall

In order to keep the change as incremental as possible, this only introduces the memset.pattern intrinsic in cases where memset_pattern16 would have been used. Future patches can enable it on targets that don't have the intrinsic. As the memset.pattern intrinsic takes the number of times to store the pattern as an argument, unlike memset_pattern16 which takes the number of bytes to write, we no longer try to form an i128 pattern.

Special care is taken for cases where multiple stores in the same loop iteration were combined to form a single pattern. For such cases, we inherit the limitation that loops such as the following are supported:

```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 2;
}
```

But the following doesn't result in a memset.pattern (even though it could, by forming an appropriate pattern):

```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 3;
}
```

Addressing this existing deficiency is left for a follow-up due to a desire not to change too much at once (i.e. to target equivalence to the current codegen). A command line option is introduced to force the selection of the intrinsic even in cases where the libcall wouldn't have been selected. This is intended as a transitional option for testing and experimentation, to be removed at a later point.
@llvm/pr-subscribers-llvm-transforms

Author: Alex Bradbury (asb)

Changes

Note: This patch is fully ready for technical review, but is not ready for merging until ongoing testing confirms that the altered codegen doesn't (as far as can be seen) regress code generation for the existing libcall; that testing may result in other patches that should land before this one. I'm posting it for feedback now because any alternate implementation approach would impact this testing.

In order to keep the change as incremental as possible, this only introduces the memset.pattern intrinsic in cases where memset_pattern16 would have been used. Future patches can enable it on targets that don't have the intrinsic, and select it in cases where the libcall isn't directly usable. As the memset.pattern intrinsic takes the number of times to store the pattern as an argument, unlike memset_pattern16 which takes the number of bytes to write, we no longer try to form an i128 pattern.

Special care is taken for cases where multiple stores in the same loop iteration were combined to form a single pattern. For such cases, we inherit the limitation that loops such as the following are supported:

```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 2;
}
```
But the following doesn't result in a memset.pattern (even though it could, by forming an appropriate pattern):

```
for (unsigned i = 0; i < 2 * n; i += 2) {
  f[i] = 2;
  f[i+1] = 3;
}
```
Addressing this existing deficiency is left for a follow-up due to a desire not to change too much at once (i.e. to target equivalence to the current codegen). A command line option is introduced to force the selection of the intrinsic even in cases where the libcall wouldn't have been selected. This is intended as a transitional option for testing and experimentation, to be removed at a later point.

Patch is 23.99 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126736.diff

7 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
index 2462ec33e0c20..58c12298ca926 100644
--- a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
@@ -132,6 +132,11 @@ static cl::opt<bool> UseLIRCodeSizeHeurs(
"with -Os/-Oz"),
cl::init(true), cl::Hidden);
+static cl::opt<bool> ForceMemsetPatternIntrinsic(
+ "loop-idiom-force-memset-pattern-intrinsic",
+ cl::desc("Enable use of the memset.pattern intrinsic"), cl::init(false),
+ cl::Hidden);
+
namespace {
class LoopIdiomRecognize {
@@ -303,10 +308,15 @@ bool LoopIdiomRecognize::runOnLoop(Loop *L) {
L->getHeader()->getParent()->hasOptSize() && UseLIRCodeSizeHeurs;
HasMemset = TLI->has(LibFunc_memset);
+ // TODO: Unconditionally enable use of the memset pattern intrinsic (or at
+ // least, opt-in via target hook) once we are confident it will never result
+ // in worse codegen than without. For now, use it only when we would have
+ // previously emitted a libcall to memset_pattern16 (or unless this is
+ // overridden by command line option).
HasMemsetPattern = TLI->has(LibFunc_memset_pattern16);
HasMemcpy = TLI->has(LibFunc_memcpy);
- if (HasMemset || HasMemsetPattern || HasMemcpy)
+ if (HasMemset || HasMemsetPattern || ForceMemsetPatternIntrinsic || HasMemcpy)
if (SE->hasLoopInvariantBackedgeTakenCount(L))
return runOnCountableLoop();
@@ -392,14 +402,7 @@ static Constant *getMemSetPatternValue(Value *V, const DataLayout *DL) {
if (Size > 16)
return nullptr;
- // If the constant is exactly 16 bytes, just use it.
- if (Size == 16)
- return C;
-
- // Otherwise, we'll use an array of the constants.
- unsigned ArraySize = 16 / Size;
- ArrayType *AT = ArrayType::get(V->getType(), ArraySize);
- return ConstantArray::get(AT, std::vector<Constant *>(ArraySize, C));
+ return C;
}
LoopIdiomRecognize::LegalStoreKind
@@ -463,8 +466,9 @@ LoopIdiomRecognize::isLegalStore(StoreInst *SI) {
// It looks like we can use SplatValue.
return LegalStoreKind::Memset;
}
- if (!UnorderedAtomic && HasMemsetPattern && !DisableLIRP::Memset &&
- // Don't create memset_pattern16s with address spaces.
+ if (!UnorderedAtomic && (HasMemsetPattern || ForceMemsetPatternIntrinsic) &&
+ !DisableLIRP::Memset &&
+ // Don't create memset.pattern intrinsic calls with address spaces.
StorePtr->getType()->getPointerAddressSpace() == 0 &&
getMemSetPatternValue(StoredVal, DL)) {
// It looks like we can use PatternValue!
@@ -1064,53 +1068,101 @@ bool LoopIdiomRecognize::processLoopStridedStore(
return Changed;
// Okay, everything looks good, insert the memset.
+ // MemsetArg is the number of bytes for the memset libcall, and the number
+ // of pattern repetitions if the memset.pattern intrinsic is being used.
+ Value *MemsetArg;
+ std::optional<int64_t> BytesWritten = std::nullopt;
+
+ if (PatternValue && (HasMemsetPattern || ForceMemsetPatternIntrinsic)) {
+ const SCEV *TripCountS =
+ SE->getTripCountFromExitCount(BECount, IntIdxTy, CurLoop);
+ if (!Expander.isSafeToExpand(TripCountS))
+ return Changed;
+ const SCEVConstant *ConstStoreSize = dyn_cast<SCEVConstant>(StoreSizeSCEV);
+ if (!ConstStoreSize)
+ return Changed;
+ Value *TripCount = Expander.expandCodeFor(TripCountS, IntIdxTy,
+ Preheader->getTerminator());
+ uint64_t PatternRepsPerTrip =
+ (ConstStoreSize->getValue()->getZExtValue() * 8) /
+ DL->getTypeSizeInBits(PatternValue->getType());
+ // If ConstStoreSize is not equal to the width of PatternValue, then
+ // MemsetArg is TripCount * (ConstStoreSize/PatternValueWidth). Else
+ // MemSetArg is just TripCount.
+ MemsetArg =
+ PatternRepsPerTrip == 1
+ ? TripCount
+ : Builder.CreateMul(TripCount,
+ Builder.getIntN(IntIdxTy->getIntegerBitWidth(),
+ PatternRepsPerTrip));
+ if (auto CI = dyn_cast<ConstantInt>(TripCount))
+ BytesWritten =
+ CI->getZExtValue() * ConstStoreSize->getValue()->getZExtValue();
+ } else {
+ const SCEV *NumBytesS =
+ getNumBytes(BECount, IntIdxTy, StoreSizeSCEV, CurLoop, DL, SE);
- const SCEV *NumBytesS =
- getNumBytes(BECount, IntIdxTy, StoreSizeSCEV, CurLoop, DL, SE);
-
- // TODO: ideally we should still be able to generate memset if SCEV expander
- // is taught to generate the dependencies at the latest point.
- if (!Expander.isSafeToExpand(NumBytesS))
- return Changed;
-
- Value *NumBytes =
- Expander.expandCodeFor(NumBytesS, IntIdxTy, Preheader->getTerminator());
+ // TODO: ideally we should still be able to generate memset if SCEV expander
+ // is taught to generate the dependencies at the latest point.
+ if (!Expander.isSafeToExpand(NumBytesS))
+ return Changed;
+ MemsetArg =
+ Expander.expandCodeFor(NumBytesS, IntIdxTy, Preheader->getTerminator());
+ if (auto CI = dyn_cast<ConstantInt>(MemsetArg))
+ BytesWritten = CI->getZExtValue();
+ }
+ assert(MemsetArg && "MemsetArg should have been set");
- if (!SplatValue && !isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16))
+ if (!SplatValue && !(ForceMemsetPatternIntrinsic ||
+ isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16)))
return Changed;
AAMDNodes AATags = TheStore->getAAMetadata();
for (Instruction *Store : Stores)
AATags = AATags.merge(Store->getAAMetadata());
- if (auto CI = dyn_cast<ConstantInt>(NumBytes))
- AATags = AATags.extendTo(CI->getZExtValue());
+ if (BytesWritten)
+ AATags = AATags.extendTo(BytesWritten.value());
else
AATags = AATags.extendTo(-1);
CallInst *NewCall;
if (SplatValue) {
NewCall = Builder.CreateMemSet(
- BasePtr, SplatValue, NumBytes, MaybeAlign(StoreAlignment),
+ BasePtr, SplatValue, MemsetArg, MaybeAlign(StoreAlignment),
/*isVolatile=*/false, AATags.TBAA, AATags.Scope, AATags.NoAlias);
} else {
- assert (isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16));
+ assert(ForceMemsetPatternIntrinsic ||
+ isLibFuncEmittable(M, TLI, LibFunc_memset_pattern16));
// Everything is emitted in default address space
- Type *Int8PtrTy = DestInt8PtrTy;
-
- StringRef FuncName = "memset_pattern16";
- FunctionCallee MSP = getOrInsertLibFunc(M, *TLI, LibFunc_memset_pattern16,
- Builder.getVoidTy(), Int8PtrTy, Int8PtrTy, IntIdxTy);
- inferNonMandatoryLibFuncAttrs(M, FuncName, *TLI);
-
- // Otherwise we should form a memset_pattern16. PatternValue is known to be
- // an constant array of 16-bytes. Plop the value into a mergable global.
- GlobalVariable *GV = new GlobalVariable(*M, PatternValue->getType(), true,
- GlobalValue::PrivateLinkage,
- PatternValue, ".memset_pattern");
- GV->setUnnamedAddr(GlobalValue::UnnamedAddr::Global); // Ok to merge these.
- GV->setAlignment(Align(16));
- Value *PatternPtr = GV;
- NewCall = Builder.CreateCall(MSP, {BasePtr, PatternPtr, NumBytes});
+
+ assert(isa<SCEVConstant>(StoreSizeSCEV) && "Expected constant store size");
+
+ Value *PatternArg;
+ IntegerType *PatternArgTy =
+ Builder.getIntNTy(DL->getTypeSizeInBits(PatternValue->getType()));
+
+ // If the pattern value can be casted directly to an integer argument, use
+ // that. Otherwise (e.g. if the value is a global pointer), create a
+ // GlobalVariable and load from it.
+ if (isa<ConstantInt>(PatternValue)) {
+ PatternArg = PatternValue;
+ } else if (isa<ConstantFP>(PatternValue)) {
+ PatternArg = Builder.CreateBitCast(PatternValue, PatternArgTy);
+ } else {
+ GlobalVariable *GV = new GlobalVariable(*M, PatternValue->getType(), true,
+ GlobalValue::PrivateLinkage,
+ PatternValue, ".memset_pattern");
+ GV->setUnnamedAddr(
+ GlobalValue::UnnamedAddr::Global); // Ok to merge these.
+ GV->setAlignment(Align(PatternArgTy->getPrimitiveSizeInBits()));
+ PatternArg = Builder.CreateLoad(PatternArgTy, GV);
+ }
+ assert(PatternArg);
+
+ NewCall = Builder.CreateIntrinsic(Intrinsic::experimental_memset_pattern,
+ {DestInt8PtrTy, PatternArgTy, IntIdxTy},
+ {BasePtr, PatternArg, MemsetArg,
+ ConstantInt::getFalse(M->getContext())});
// Set the TBAA info if present.
if (AATags.TBAA)
diff --git a/llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll b/llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll
new file mode 100644
index 0000000000000..b3cee756076af
--- /dev/null
+++ b/llvm/test/Transforms/LoopIdiom/RISCV/memset-pattern.ll
@@ -0,0 +1,49 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals
+; RUN: opt -passes=loop-idiom -mtriple=riscv64 < %s -S | FileCheck %s
+; RUN: opt -passes=loop-idiom -mtriple=riscv64 -loop-idiom-force-memset-pattern-intrinsic < %s -S \
+; RUN: | FileCheck -check-prefix=CHECK-INTRIN %s
+
+define dso_local void @double_memset(ptr nocapture %p) {
+; CHECK-LABEL: @double_memset(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.cond.cleanup:
+; CHECK-NEXT: ret void
+; CHECK: for.body:
+; CHECK-NEXT: [[I_07:%.*]] = phi i64 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[PTR1:%.*]] = getelementptr inbounds double, ptr [[P:%.*]], i64 [[I_07]]
+; CHECK-NEXT: store double 3.141590e+00, ptr [[PTR1]], align 1
+; CHECK-NEXT: [[INC]] = add nuw nsw i64 [[I_07]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 16
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+;
+; CHECK-INTRIN-LABEL: @double_memset(
+; CHECK-INTRIN-NEXT: entry:
+; CHECK-INTRIN-NEXT: call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 16, i1 false)
+; CHECK-INTRIN-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK-INTRIN: for.cond.cleanup:
+; CHECK-INTRIN-NEXT: ret void
+; CHECK-INTRIN: for.body:
+; CHECK-INTRIN-NEXT: [[I_07:%.*]] = phi i64 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-INTRIN-NEXT: [[PTR1:%.*]] = getelementptr inbounds double, ptr [[P]], i64 [[I_07]]
+; CHECK-INTRIN-NEXT: [[INC]] = add nuw nsw i64 [[I_07]], 1
+; CHECK-INTRIN-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], 16
+; CHECK-INTRIN-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+;
+entry:
+ br label %for.body
+
+for.cond.cleanup:
+ ret void
+
+for.body:
+ %i.07 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
+ %ptr1 = getelementptr inbounds double, ptr %p, i64 %i.07
+ store double 3.14159e+00, ptr %ptr1, align 1
+ %inc = add nuw nsw i64 %i.07, 1
+ %exitcond.not = icmp eq i64 %inc, 16
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+;.
+; CHECK-INTRIN: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: write) }
+;.
diff --git a/llvm/test/Transforms/LoopIdiom/basic.ll b/llvm/test/Transforms/LoopIdiom/basic.ll
index e6fc11625317b..0fe8cd747408f 100644
--- a/llvm/test/Transforms/LoopIdiom/basic.ll
+++ b/llvm/test/Transforms/LoopIdiom/basic.ll
@@ -7,8 +7,7 @@ target triple = "x86_64-apple-darwin10.0.0"
;.
; CHECK: @G = global i32 5
; CHECK: @g_50 = global [7 x i32] [i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 0], align 16
-; CHECK: @.memset_pattern = private unnamed_addr constant [4 x i32] [i32 1, i32 1, i32 1, i32 1], align 16
-; CHECK: @.memset_pattern.1 = private unnamed_addr constant [2 x ptr] [ptr @G, ptr @G], align 16
+; CHECK: @.memset_pattern = private unnamed_addr constant ptr @G, align 64
;.
define void @test1(ptr %Base, i64 %Size) nounwind ssp {
; CHECK-LABEL: @test1(
@@ -533,7 +532,7 @@ for.end13: ; preds = %for.inc10
define void @test11_pattern(ptr nocapture %P) nounwind ssp {
; CHECK-LABEL: @test11_pattern(
; CHECK-NEXT: entry:
-; CHECK-NEXT: call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern, i64 40000)
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[P:%.*]], i32 1, i64 10000, i1 false)
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
; CHECK-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
@@ -596,7 +595,8 @@ for.end: ; preds = %for.body
define void @test13_pattern(ptr nocapture %P) nounwind ssp {
; CHECK-LABEL: @test13_pattern(
; CHECK-NEXT: entry:
-; CHECK-NEXT: call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.1, i64 80000)
+; CHECK-NEXT: [[TMP0:%.*]] = load i64, ptr @.memset_pattern, align 8
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 [[TMP0]], i64 10000, i1 false)
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
; CHECK-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
@@ -1625,6 +1625,5 @@ define noalias ptr @_ZN8CMSPULog9beginImplEja(ptr nocapture writeonly %0) local_
; CHECK: attributes #[[ATTR1:[0-9]+]] = { nounwind }
; CHECK: attributes #[[ATTR2:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: readwrite) }
; CHECK: attributes #[[ATTR3:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: write) }
-; CHECK: attributes #[[ATTR4:[0-9]+]] = { nofree nounwind willreturn memory(argmem: readwrite) }
-; CHECK: attributes #[[ATTR5:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+; CHECK: attributes #[[ATTR4:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
;.
diff --git a/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll b/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll
index 57a91a3bf6e2c..98521ef82fbe7 100644
--- a/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll
+++ b/llvm/test/Transforms/LoopIdiom/memset-pattern-tbaa.ll
@@ -6,15 +6,10 @@ target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f3
target triple = "x86_64-apple-darwin10.0.0"
-;.
-; CHECK: @.memset_pattern = private unnamed_addr constant [2 x double] [double 3.141590e+00, double 3.141590e+00], align 16
-; CHECK: @.memset_pattern.1 = private unnamed_addr constant [2 x double] [double 3.141590e+00, double 3.141590e+00], align 16
-; CHECK: @.memset_pattern.2 = private unnamed_addr constant [2 x double] [double 3.141590e+00, double 3.141590e+00], align 16
-;.
define dso_local void @double_memset(ptr nocapture %p) {
; CHECK-LABEL: @double_memset(
; CHECK-NEXT: entry:
-; CHECK-NEXT: call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern, i64 128), !tbaa [[TBAA0:![0-9]+]]
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 16, i1 false), !tbaa [[TBAA0:![0-9]+]]
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
@@ -44,7 +39,7 @@ for.body:
define dso_local void @struct_memset(ptr nocapture %p) {
; CHECK-LABEL: @struct_memset(
; CHECK-NEXT: entry:
-; CHECK-NEXT: call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.1, i64 128), !tbaa [[TBAA4:![0-9]+]]
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 16, i1 false), !tbaa [[TBAA4:![0-9]+]]
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
@@ -73,8 +68,7 @@ for.body:
define dso_local void @var_memset(ptr nocapture %p, i64 %len) {
; CHECK-LABEL: @var_memset(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = shl nuw i64 [[LEN:%.*]], 3
-; CHECK-NEXT: call void @memset_pattern16(ptr [[P:%.*]], ptr @.memset_pattern.2, i64 [[TMP0]])
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i64.i64(ptr [[P:%.*]], i64 4614256650576692846, i64 [[TMP0:%.*]], i1 false)
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
@@ -82,7 +76,7 @@ define dso_local void @var_memset(ptr nocapture %p, i64 %len) {
; CHECK-NEXT: [[I_07:%.*]] = phi i64 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[PTR1:%.*]] = getelementptr inbounds double, ptr [[P]], i64 [[I_07]]
; CHECK-NEXT: [[INC]] = add nuw nsw i64 [[I_07]], 1
-; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], [[LEN]]
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INC]], [[TMP0]]
; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
;
entry:
@@ -116,7 +110,7 @@ for.body:
!21 = !{!22, !20, i64 0}
!22 = !{!"B", !20, i64 0}
;.
-; CHECK: attributes #[[ATTR0:[0-9]+]] = { nofree nounwind willreturn memory(argmem: readwrite) }
+; CHECK: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nounwind willreturn memory(argmem: write) }
;.
; CHECK: [[TBAA0]] = !{[[META1:![0-9]+]], [[META1]], i64 0}
; CHECK: [[META1]] = !{!"double", [[META2:![0-9]+]], i64 0}
diff --git a/llvm/test/Transforms/LoopIdiom/struct_pattern.ll b/llvm/test/Transforms/LoopIdiom/struct_pattern.ll
index b65e95353ab3e..f5be8e71cf7bd 100644
--- a/llvm/test/Transforms/LoopIdiom/struct_pattern.ll
+++ b/llvm/test/Transforms/LoopIdiom/struct_pattern.ll
@@ -16,11 +16,6 @@ target triple = "x86_64-apple-darwin10.0.0"
;}
-;.
-; CHECK: @.memset_pattern = private unnamed_addr constant [4 x i32] [i32 2, i32 2, i32 2, i32 2], align 16
-; CHECK: @.memset_pattern.1 = private unnamed_addr constant [4 x i32] [i32 2, i32 2, i32 2, i32 2], align 16
-; CHECK: @.memset_pattern.2 = private unnamed_addr constant [4 x i32] [i32 2, i32 2, i32 2, i32 2], align 16
-;.
define void @bar1(ptr %f, i32 %n) nounwind ssp {
; CHECK-LABEL: @bar1(
; CHECK-NEXT: entry:
@@ -28,8 +23,8 @@ define void @bar1(ptr %f, i32 %n) nounwind ssp {
; CHECK-NEXT: br i1 [[CMP1]], label [[FOR_END:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
; CHECK: for.body.preheader:
; CHECK-NEXT: [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 3
-; CHECK-NEXT: call void @memset_pattern16(ptr [[F:%.*]], ptr @.memset_pattern, i64 [[TMP1]])
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[F:%.*]], i32 2, i64 [[TMP1]], i1 false)
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
@@ -82,8 +77,8 @@ define void @bar2(ptr %f, i32 %n) nounwind ssp {
; CHECK-NEXT: br i1 [[CMP1]], label [[FOR_END:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
; CHECK: for.body.preheader:
; CHECK-NEXT: [[TMP0:%.*]] = zext i32 [[N]] to i64
-; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 3
-; CHECK-NEXT: call void @memset_pattern16(ptr [[F:%.*]], ptr @.memset_pattern.1, i64 [[TMP1]])
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[F:%.*]], i32 2, i64 [[TMP1]], i1 false)
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
@@ -142,7 +137,8 @@ define void @bar3(ptr nocapture %f, i32 %n) nounwind ssp {
; CHECK-NEXT: [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 3
; CHECK-NEXT: [[TMP5:%.*]] = sub i64 [[TMP1]], [[TMP4]]
; CHECK-NEXT: [[UGLYGEP:%.*]] = getelementptr i8, ptr [[F:%.*]], i64 [[TMP5]]
-; CHECK-NEXT: call void @memset_pattern16(ptr [[UGLYGEP]], ptr @.memset_pattern.2, i64 [[TMP1]])
+; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT: call void @llvm.experimental.memset.pattern.p0.i32.i64(ptr [[UGLYGEP]], i32 2, i64 [[TMP7]], i1 false)
; CHECK-NEXT: br label [[FOR_BODY:%.*...
[truncated]
…ading from constant global

This is motivated by llvm#126736, and catches a case that would have resulted in memset_pattern16 being produced by LoopIdiomRecognize previously but is missed after moving to the intrinsic in llvm#126736 and relying on PreISelIntrinsicLowering to produce the libcall when available. The logic for handling load instructions that access constant globals could be made more extensive, but it's not clear it would be worthwhile. For now we prioritise the patterns that could be produced by LoopIdiomRecognize.
…done for global pointers) We can just do a ptrtoint.
@@ -1154,7 +1154,7 @@ bool LoopIdiomRecognize::processLoopStridedStore(
                                              PatternValue, ".memset_pattern");
       GV->setUnnamedAddr(
           GlobalValue::UnnamedAddr::Global); // Ok to merge these.
-      GV->setAlignment(Align(PatternArgTy->getPrimitiveSizeInBits()));
+      GV->setAlignment(Align(PatternArgTy->getPrimitiveSizeInBits() / 8));
Just above you already queried getTypeSizeInBits for PatternArgTy; could you avoid repeating it? Every getPrimitiveSizeInBits makes me nervous.
The logic to create a GlobalVariable isn't needed any more, so I've now dropped this whole bit.
Reverse ping. Can you rebase this?
Just catching up post EuroLLVM. I've resolved the merge conflicts now, fixed an outdated comment, and accepted your suggestion (thanks!), and am now putting some more focused effort into getting this closed off. I have a flow for looking for unexpected codegen differences (which should only show up on Darwin, where the libcall is supported) by applying a hacky patch that unconditionally marks memset_pattern16 as available on all platforms, then building the test-suite with
It's possible some of the above results in changes to this patch, but it's equally possible they result in separate PRs that end up being effectively prerequisites for this one.
Interesting case. This makes me wonder whether it would make sense to do the expansion for this intrinsic earlier, in the late middle-end pipeline (we'd still have to keep it in the backend for O0 fallback). If we do it before ConstantMerge, it would merge the constants back together. More interestingly, if we do it before addVectorPasses, the loop expansion case would still get the usual vectorization and unrolling heuristics, which would avoid the need to consider those in the backend expansion. I don't think the duplicate globals are a big problem though, so that's just a thought on how to improve things in the future...
I wrote a bunch of test cases to see if I could identify your problem. I believe this is a consequence of the second parameter (the value param) not being known to be readnone. There might be other cases, but I can definitely see this case failing to fold in DSE and GVN. I added those tests in 15c2f79. Unfortunately, the fact that the value param is only sometimes a pointer fits badly into our intrinsic attribute system. I have a local patch which hacks in a fix, but it sure seems like I'm missing something in terms of code structure, as I'm having to change multiple places. Edited to expand comment slightly.
First attempt at a partial fix for the pointer value operand issue above: #138559