LV: Expand llvm.histogram intrinsic to support umax, umin, and uadd.sat operations #127399


Open

RonDahan101 wants to merge 6 commits into main

Conversation

@RonDahan101 RonDahan101 commented Feb 16, 2025

This patch extends the llvm.histogram intrinsic to support additional update operations beyond the existing add. Specifically, the new supported operations are:

  • umax: unsigned maximum

  • umin: unsigned minimum

  • uadd.sat: unsigned saturated addition

Changes in this patch include:

  • Updates to the Loop Vectorizer (LV) to recognize and handle the new operations.

  • Scalar fallback support added via ScalarizeMaskedMemIntrin for targets that do not natively support these operations.
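
As a hedged illustration of the scalar pattern the vectorizer now matches (LLVM IR; the value names below are invented for this sketch, not taken from the patch), a umax histogram update in the loop body looks like:

    ; bucket = umax(bucket, %val), one iteration of the scalar loop
    %idx = load i32, ptr %gep.indices, align 4
    %idxprom = zext i32 %idx to i64
    %gep.bucket = getelementptr inbounds i32, ptr %buckets, i64 %idxprom
    %old = load i32, ptr %gep.bucket, align 4
    %new = call i32 @llvm.umax.i32(i32 %old, i32 %val)
    store i32 %new, ptr %gep.bucket, align 4

The same shape, with @llvm.umin.i32 or @llvm.uadd.sat.i32 in place of @llvm.umax.i32, is matched for the other two operations.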


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Feb 16, 2025

@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: None (RonDahan101)

Changes

Expanding the Histogram intrinsic to support more update options, uadd.sat, umax, umin.


Patch is 25.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127399.diff

7 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+3)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+18)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+19-12)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+4-2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+8-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+75-9)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve2-histcnt.ll (+199)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index deb87365ae8d7..59496ebb93cd7 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -20295,6 +20295,9 @@ More update operation types may be added in the future.
 
     declare void @llvm.experimental.vector.histogram.add.v8p0.i32(<8 x ptr> %ptrs, i32 %inc, <8 x i1> %mask)
     declare void @llvm.experimental.vector.histogram.add.nxv2p0.i64(<vscale x 2 x ptr> %ptrs, i64 %inc, <vscale x 2 x i1> %mask)
+    declare void @llvm.experimental.vector.histogram.uadd.sat.v8p0.i32(<8 x ptr> %ptrs, i32 %inc, <8 x i1> %mask)
+    declare void @llvm.experimental.vector.histogram.umax.v8p0.i32(<8 x ptr> %ptrs, i32 %val, <8 x i1> %mask)
+    declare void @llvm.experimental.vector.histogram.umin.v8p0.i32(<8 x ptr> %ptrs, i32 %val, <8 x i1> %mask)
 
 Arguments:
 """"""""""
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 14ecae41ff08f..31a0ba2e6500d 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1947,6 +1947,24 @@ def int_experimental_vector_histogram_add : DefaultAttrsIntrinsic<[],
                                LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>], // Mask
                              [ IntrArgMemOnly ]>;
 
+def int_experimental_vector_histogram_uadd_sat : DefaultAttrsIntrinsic<[],
+                             [ llvm_anyvector_ty, // Vector of pointers
+                               llvm_anyint_ty,    // Increment
+                               LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>], // Mask
+                             [ IntrArgMemOnly ]>;
+
+def int_experimental_vector_histogram_umin : DefaultAttrsIntrinsic<[],
+                             [ llvm_anyvector_ty, // Vector of pointers
+                               llvm_anyint_ty,    // Update value
+                               LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>], // Mask
+                             [ IntrArgMemOnly ]>;
+
+def int_experimental_vector_histogram_umax : DefaultAttrsIntrinsic<[],
+                             [ llvm_anyvector_ty, // Vector of pointers
+                               llvm_anyint_ty,    // Update value
+                               LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>], // Mask
+                             [ IntrArgMemOnly ]>;
+
 // Experimental match
 def int_experimental_vector_match : DefaultAttrsIntrinsic<
                              [ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty> ],
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index e3599315e224f..bfa1cdfd9c584 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -1079,25 +1079,26 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
 
 /// Find histogram operations that match high-level code in loops:
 /// \code
-/// buckets[indices[i]]+=step;
+/// buckets[indices[i]] = UpdateOperator(buckets[indices[i]], Val);
 /// \endcode
+/// Where UpdateOperator can be add, sub, uadd.sat, umax, or umin.
 ///
 /// It matches a pattern starting from \p HSt, which Stores to the 'buckets'
-/// array the computed histogram. It uses a BinOp to sum all counts, storing
-/// them using a loop-variant index Load from the 'indices' input array.
+/// array the computed histogram. It uses an update instruction to update all
+/// counts, storing them using a loop-variant index Load from the 'indices'
+/// input array.
 ///
 /// On successful matches it updates the STATISTIC 'HistogramsDetected',
 /// regardless of hardware support. When there is support, it additionally
-/// stores the BinOp/Load pairs in \p HistogramCounts, as well the pointers
+/// stores the UpdateOp/Load pairs in \p HistogramCounts, as well as the pointers
 /// used to update histogram in \p HistogramPtrs.
 static bool findHistogram(LoadInst *LI, StoreInst *HSt, Loop *TheLoop,
                           const PredicatedScalarEvolution &PSE,
                           SmallVectorImpl<HistogramInfo> &Histograms) {
 
-  // Store value must come from a Binary Operation.
   Instruction *HPtrInstr = nullptr;
-  BinaryOperator *HBinOp = nullptr;
-  if (!match(HSt, m_Store(m_BinOp(HBinOp), m_Instruction(HPtrInstr))))
+  Instruction *HInstr = nullptr;
+  if (!match(HSt, m_Store(m_Instruction(HInstr), m_Instruction(HPtrInstr))))
     return false;
 
   // BinOp must be an Add or a Sub modifying the bucket value by a
@@ -1105,8 +1106,14 @@ static bool findHistogram(LoadInst *LI, StoreInst *HSt, Loop *TheLoop,
   // FIXME: We assume the loop invariant term is on the RHS.
   //        Fine for an immediate/constant, but maybe not a generic value?
   Value *HIncVal = nullptr;
-  if (!match(HBinOp, m_Add(m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))) &&
-      !match(HBinOp, m_Sub(m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))))
+  if (!match(HInstr, m_Add(m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))) &&
+         !match(HInstr, m_Sub(m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))) &&
+         !match(HInstr, m_Intrinsic<Intrinsic::uadd_sat>(
+                           m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))) &&
+         !match(HInstr, m_Intrinsic<Intrinsic::umax>(
+                           m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))) &&
+         !match(HInstr, m_Intrinsic<Intrinsic::umin>(
+                           m_Load(m_Specific(HPtrInstr)), m_Value(HIncVal))))
     return false;
 
   // Make sure the increment value is loop invariant.
@@ -1148,15 +1155,15 @@ static bool findHistogram(LoadInst *LI, StoreInst *HSt, Loop *TheLoop,
 
   // Ensure we'll have the same mask by checking that all parts of the histogram
   // (gather load, update, scatter store) are in the same block.
-  LoadInst *IndexedLoad = cast<LoadInst>(HBinOp->getOperand(0));
+  LoadInst *IndexedLoad = cast<LoadInst>(HInstr->getOperand(0));
   BasicBlock *LdBB = IndexedLoad->getParent();
-  if (LdBB != HBinOp->getParent() || LdBB != HSt->getParent())
+  if (LdBB != HInstr->getParent() || LdBB != HSt->getParent())
     return false;
 
   LLVM_DEBUG(dbgs() << "LV: Found histogram for: " << *HSt << "\n");
 
   // Store the operations that make up the histogram.
-  Histograms.emplace_back(IndexedLoad, HBinOp, HSt);
+  Histograms.emplace_back(IndexedLoad, HInstr, HSt);
   return true;
 }
 
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 2cdb87fdd3f8d..2422b68a353d3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8634,14 +8634,16 @@ VPRecipeBuilder::tryToWidenHistogram(const HistogramInfo *HI,
                                      ArrayRef<VPValue *> Operands) {
   // FIXME: Support other operations.
   unsigned Opcode = HI->Update->getOpcode();
-  assert((Opcode == Instruction::Add || Opcode == Instruction::Sub) &&
-         "Histogram update operation must be an Add or Sub");
+  assert(VPHistogramRecipe::isLegalUpdateInstruction(HI->Update) &&
+         "Found Ilegal update instruction for histogram");
 
   SmallVector<VPValue *, 3> HGramOps;
   // Bucket address.
   HGramOps.push_back(Operands[1]);
   // Increment value.
   HGramOps.push_back(getVPValueOrAddLiveIn(HI->Update->getOperand(1)));
+  // Update Instruction.
+  HGramOps.push_back(getVPValueOrAddLiveIn(HI->Update));
 
   // In case of predicated execution (due to tail-folding, or conditional
   // execution, or both), pass the relevant mask.
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index fbbc466f2f7f6..b5588349df9e6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1466,9 +1466,16 @@ class VPHistogramRecipe : public VPRecipeBase {
   /// Return the mask operand if one was provided, or a null pointer if all
   /// lanes should be executed unconditionally.
   VPValue *getMask() const {
-    return getNumOperands() == 3 ? getOperand(2) : nullptr;
+    return getNumOperands() == 4 ? getOperand(3) : nullptr;
   }
 
+  /// Returns true if \p I is a legal update instruction of histogram operation.
+  static bool isLegalUpdateInstruction(Instruction *I);
+
+  /// Given update instruction \p I, returns the ID of the corresponding
+  /// histogram intrinsic.
+  static unsigned getHistogramOpcode(Instruction *I);
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe
   void print(raw_ostream &O, const Twine &Indent,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index f5d5e12b1c85d..95eb337cbbdd0 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -1223,6 +1223,7 @@ void VPHistogramRecipe::execute(VPTransformState &State) {
 
   Value *Address = State.get(getOperand(0));
   Value *IncAmt = State.get(getOperand(1), /*IsScalar=*/true);
+  Instruction *UpdateInst = cast<Instruction>(State.get(getOperand(2)));
   VectorType *VTy = cast<VectorType>(Address->getType());
 
   // The histogram intrinsic requires a mask even if the recipe doesn't;
@@ -1239,10 +1240,10 @@ void VPHistogramRecipe::execute(VPTransformState &State) {
   // add a separate intrinsic in future, but for now we'll try this.
   if (Opcode == Instruction::Sub)
     IncAmt = Builder.CreateNeg(IncAmt);
-  else
-    assert(Opcode == Instruction::Add && "only add or sub supported for now");
+  assert(isLegalUpdateInstruction(UpdateInst) &&
+         "Found Ilegal update instruction for histogram");
 
-  State.Builder.CreateIntrinsic(Intrinsic::experimental_vector_histogram_add,
+  State.Builder.CreateIntrinsic(getHistogramOpcode(UpdateInst),
                                 {VTy, IncAmt->getType()},
                                 {Address, IncAmt, Mask});
 }
@@ -1277,24 +1278,51 @@ InstructionCost VPHistogramRecipe::computeCost(ElementCount VF,
   IntrinsicCostAttributes ICA(Intrinsic::experimental_vector_histogram_add,
                               Type::getVoidTy(Ctx.LLVMCtx),
                               {PtrTy, IncTy, MaskTy});
+  auto *UpdateInst = getOperand(2)->getUnderlyingValue();
+  InstructionCost UpdateCost;
+  if (auto *II = dyn_cast<IntrinsicInst>(UpdateInst)) {
+    IntrinsicCostAttributes UpdateICA(II->getIntrinsicID(), IncTy,
+                                      {IncTy, IncTy});
+    UpdateCost = Ctx.TTI.getIntrinsicInstrCost(UpdateICA, Ctx.CostKind);
+  } else
+    UpdateCost = Ctx.TTI.getArithmeticInstrCost(Opcode, VTy, Ctx.CostKind);
 
   // Add the costs together with the add/sub operation.
   return Ctx.TTI.getIntrinsicInstrCost(ICA, Ctx.CostKind) + MulCost +
-         Ctx.TTI.getArithmeticInstrCost(Opcode, VTy, Ctx.CostKind);
+         UpdateCost;
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
 void VPHistogramRecipe::print(raw_ostream &O, const Twine &Indent,
                               VPSlotTracker &SlotTracker) const {
+  auto *UpdateInst = cast<Instruction>(getOperand(2)->getUnderlyingValue());
+  assert(isLegalUpdateInstruction(UpdateInst) &&
+         "Found Ilegal update instruction for histogram");
   O << Indent << "WIDEN-HISTOGRAM buckets: ";
   getOperand(0)->printAsOperand(O, SlotTracker);
 
-  if (Opcode == Instruction::Sub)
-    O << ", dec: ";
-  else {
-    assert(Opcode == Instruction::Add);
-    O << ", inc: ";
+  std::string UpdateMsg;
+  if (isa<BinaryOperator>(UpdateInst)) {
+    if (Opcode == Instruction::Sub)
+      UpdateMsg = ", dec: ";
+    else {
+      UpdateMsg = ", inc: ";
+    }
+  } else {
+    switch (cast<IntrinsicInst>(UpdateInst)->getIntrinsicID()) {
+    case Intrinsic::uadd_sat:
+      UpdateMsg = ", saturated inc: ";
+      break;
+    case Intrinsic::umax:
+      UpdateMsg = ", max: ";
+      break;
+    case Intrinsic::umin:
+      UpdateMsg = ", min: ";
+      break;
+    default:
+      llvm_unreachable("Found Ilegal update instruction for histogram");
+    }
   }
+  O << UpdateMsg;
   getOperand(1)->printAsOperand(O, SlotTracker);
 
   if (VPValue *Mask = getMask()) {
@@ -1303,6 +1331,44 @@ void VPHistogramRecipe::print(raw_ostream &O, const Twine &Indent,
   }
 }
 
+bool VPHistogramRecipe::isLegalUpdateInstruction(Instruction *I) {
+  // We only support add and sub instructions and the following list of
+  // intrinsics: uadd.sat, umax, umin.
+  if (isa<BinaryOperator>(I))
+    return I->getOpcode() == Instruction::Add ||
+           I->getOpcode() == Instruction::Sub;
+  if (auto *II = dyn_cast<IntrinsicInst>(I)) {
+    switch (II->getIntrinsicID()) {
+    case Intrinsic::uadd_sat:
+    case Intrinsic::umax:
+    case Intrinsic::umin:
+      return true;
+    default:
+      return false;
+    }
+  }
+  return false;
+}
+
+unsigned VPHistogramRecipe::getHistogramOpcode(Instruction *I) {
+  // We only support add and sub instructions and the following list of
+  // intrinsics: uadd.sat, umax, umin.
+  assert(isLegalUpdateInstruction(I) &&
+         "Found Ilegal update instruction for histogram");
+  if (isa<BinaryOperator>(I))
+    return Intrinsic::experimental_vector_histogram_add;
+  auto *II = cast<IntrinsicInst>(I);
+  switch (II->getIntrinsicID()) {
+  case Intrinsic::uadd_sat:
+    return Intrinsic::experimental_vector_histogram_uadd_sat;
+  case Intrinsic::umax:
+    return Intrinsic::experimental_vector_histogram_umax;
+  case Intrinsic::umin:
+    return Intrinsic::experimental_vector_histogram_umin;
+  default:
+    llvm_unreachable("Found illegal update instruction for histogram");
+  }
+}
+
 void VPWidenSelectRecipe::print(raw_ostream &O, const Twine &Indent,
                                 VPSlotTracker &SlotTracker) const {
   O << Indent << "WIDEN-SELECT ";
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve2-histcnt.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve2-histcnt.ll
index 3b00312959d8a..eeffdad582ce2 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve2-histcnt.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve2-histcnt.ll
@@ -927,6 +927,205 @@ for.exit:
   ret void
 }
 
+define void @simple_histogram_uadd_sat(ptr noalias %buckets, ptr readonly %indices, i64 %N) #0 {
+; CHECK-LABEL: define void @simple_histogram_uadd_sat(
+; CHECK-SAME: ptr noalias [[BUCKETS:%.*]], ptr readonly [[INDICES:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[DOTNEG:%.*]] = mul nsw i64 [[TMP2]], -4
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i64 [[N]], [[DOTNEG]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[INDICES]], i64 [[INDEX]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x i32>, ptr [[TMP5]], align 4
+; CHECK-NEXT:    [[TMP6:%.*]] = zext <vscale x 4 x i32> [[WIDE_LOAD]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[BUCKETS]], <vscale x 4 x i64> [[TMP6]]
+; CHECK-NEXT:    call void @llvm.experimental.vector.histogram.uadd.sat.nxv4p0.i32(<vscale x 4 x ptr> [[TMP7]], i32 1, <vscale x 4 x i1> splat (i1 true))
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[GEP_INDICES:%.*]] = getelementptr inbounds i32, ptr [[INDICES]], i64 [[IV]]
+; CHECK-NEXT:    [[L_IDX:%.*]] = load i32, ptr [[GEP_INDICES]], align 4
+; CHECK-NEXT:    [[IDXPROM1:%.*]] = zext i32 [[L_IDX]] to i64
+; CHECK-NEXT:    [[GEP_BUCKET:%.*]] = getelementptr inbounds nuw i32, ptr [[BUCKETS]], i64 [[IDXPROM1]]
+; CHECK-NEXT:    [[L_BUCKET:%.*]] = load i32, ptr [[GEP_BUCKET]], align 4
+; CHECK-NEXT:    [[INC:%.*]] = call i32 @llvm.uadd.sat.i32(i32 [[L_BUCKET]], i32 1)
+; CHECK-NEXT:    store i32 [[INC]], ptr [[GEP_BUCKET]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_EXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP27:![0-9]+]]
+; CHECK:       for.exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+  %gep.indices = getelementptr inbounds i32, ptr %indices, i64 %iv
+  %l.idx = load i32, ptr %gep.indices, align 4
+  %idxprom1 = zext i32 %l.idx to i64
+  %gep.bucket = getelementptr inbounds i32, ptr %buckets, i64 %idxprom1
+  %l.bucket = load i32, ptr %gep.bucket, align 4
+  %inc = call i32 @llvm.uadd.sat.i32(i32 %l.bucket, i32 1)
+  store i32 %inc, ptr %gep.bucket, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, %N
+  br i1 %exitcond, label %for.exit, label %for.body, !llvm.loop !4
+
+for.exit:
+  ret void
+}
+
+define void @simple_histogram_umax(ptr noalias %buckets, ptr readonly %indices, i64 %N) #0 {
+; CHECK-LABEL: define void @simple_histogram_umax(
+; CHECK-SAME: ptr noalias [[BUCKETS:%.*]], ptr readonly [[INDICES:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[DOTNEG:%.*]] = mul nsw i64 [[TMP2]], -4
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i64 [[N]], [[DOTNEG]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[INDICES]], i64 [[INDEX]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x i32>, ptr [[TMP5]], align 4
+; CHECK-NEXT:    [[TMP6:%.*]] = zext <vscale x 4 x i32> [[WIDE_LOAD]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[BUCKETS]], <vscale x 4 x i64> [[TMP6]]
+; CHECK-NEXT:    call void @llvm.experimental.vector.histogram.umax.nxv4p0.i32(<vscale x 4 x ptr> [[TMP7]], i32 120, <vscale x 4 x i1> splat (i1 true))
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP28:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[GEP_INDICES:%.*]] = getelementptr inbounds i32, ptr [[INDICES]], i64 [[IV]]
+; CHECK-NEXT:    [[L_IDX:%.*]] = load i32, ptr [[GEP_INDICES]], align 4
+; CHEC...
[truncated]

@llvmbot
Member

llvmbot commented Feb 16, 2025

@llvm/pr-subscribers-llvm-ir

github-actions bot commented Feb 16, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Member

@paschalis-mpeis paschalis-mpeis left a comment


Hi Ron,

Great work, thanks for extending this area!
Looks good. I've left a couple of nits.

Expanding the Histogram intrinsic to support more update options,
uadd.sat, umax, umin.
@RonDahan101
Author

Hey Paschalis,

It was my pleasure :)

If you don't have any more comments, could you approve?

Contributor

@fhahn fhahn left a comment


What will the strategy be to lower those intrinsics? Does the default lowering need updating?

It would be good to add the intrinsics together with initial lowering as a separate patch, including codegen tests.

Then extend LV to use them, which will likely also need updates to the cost model.

@RonDahan101
Author

What will the strategy be to lower those intrinsics? Does the default lowering need updating?

It would be good to add the intrinsics together with initial lowering as a separate patch, including codegen tests.

Then extend LV to use them, which will likely also need updates to the cost model.

Hey Florian,

I'm working on an out-of-tree target, so there will not be upstream targets that can lower these intrinsics.
Regarding the cost model, I didn't change anything related to the vectorization of histograms, so it uses the same one as before.
I'd prefer to upstream these changes in case anyone else can benefit from these histogram variations.

@huntergr-arm
Collaborator

What will the strategy be to lower those intrinsics? Does the default lowering need updating?
It would be good to add the intrinsics together with initial lowering as a separate patch, including codegen tests.
Then extend LV to use them, which will likely also need updates to the cost model.

Hey Florian,

I'm working on an out-of-tree target, so there will not be upstream targets that can lower these intrinsics. Regarding the cost model, I didn't change anything related to the vectorization of histograms, so it uses the same one as before. I'd prefer to upstream these changes in case anyone else can benefit from these histogram variations.

Hi,

I think the default lowering (scalarizeMaskedVectorHistogram, in ScalarizeMaskedMemIntrin.cpp) should be extended to support these new cases. It just performs scalarization of the vector code, so it would need updating to check for the new intrinsics and emit the appropriate scalar update operation. No need to have dedicated target support.

Right now that won't be used outside of unit tests, but I will implement the target-independent cost model to enable it at some point.
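
To sketch the idea (an illustration only, assuming the per-lane expansion style scalarizeMaskedVectorHistogram already uses; the value names are invented), each active lane of the umax variant would scalarize to something like:

    ; Expansion of one lane %i of
    ; @llvm.experimental.vector.histogram.umax.v8p0.i32(%ptrs, %val, %mask),
    ; with the mask check branching around the update (elided here):
    %ptr.i = extractelement <8 x ptr> %ptrs, i64 %i
    %old.i = load i32, ptr %ptr.i, align 4
    %new.i = call i32 @llvm.umax.i32(i32 %old.i, i32 %val)
    store i32 %new.i, ptr %ptr.i, align 4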

@RonDahan101
Author


Thanks for the clarification.

Added support + lit tests

@RonDahan101
Author

@huntergr-arm, I don't see a merge button, so maybe I don't have the necessary permissions.
Can you merge this PR?

Contributor

@fhahn fhahn left a comment


It would be good to add the intrinsics together with initial lowering as a separate patch, including codegen tests.

It would still be good to separate the changes for adding the new intrinsics from using them in LV. Also the description/title needs updating.

@RonDahan101 RonDahan101 changed the title Expanding the Histogram Intrinsic LV: Expand llvm.histogram intrinsic to support umax, umin, and uadd.sat operations Apr 23, 2025
@RonDahan101
Author

It would be good to add the intrinsics together with initial lowering as a separate patch, including codegen tests.

It would still be good to separate the changes for adding the new intrinsics from using them in LV. Also the description/title needs updating.

I updated the title and description.
I'm not sure I understand what you're proposing. Adding the intrinsics without using them in LV seems redundant. What's the point of adding unused intrinsics?
If you meant something else, please elaborate on what needs to be done.

Thanks in advance for taking your time and reviewing.

@huntergr-arm
Collaborator

It would be good to add the intrinsics together with initial lowering as a separate patch, including codegen tests.

It would still be good to separate the changes for adding the new intrinsics from using them in LV. Also the description/title needs updating.

I updated the title and description. I'm not sure I understand what you're proposing. Adding the intrinsics without using them in LV seems redundant. What's the point of adding unused intrinsics? If you meant something else, please elaborate on what needs to be done.

Thanks in advance for taking your time and reviewing.

Splitting it into separate patches is often done to make it easier to review or limit the amount of code that must be examined if a bug is found later in a given commit. I did create a separate patch for the initial histogram intrinsic in fbb37e9, along with AArch64 codegen. The use in LoopVectorize came later in 6f1a8c2.

@RonDahan101
Author


Thank you both for your patience and explanation. I have uploaded a new PR that adds the intrinsics without using them in the LV. Once the previous patch is merged, I will update the current PR to include only the LV changes and notify you.

New PR: #138447

Thank you for your time.
