[LV] Introduce the EVLIVSimplify Pass for EVL-vectorized loops #91796

mshockwave · 2024-05-10T19:02:17Z

When we enable EVL-based loop vectorization w/ predicated tail-folding, each vectorized loop has effectively two induction variables: one calculates the step using (VF x vscale) and the other one increases the IV by values returned from experiment.get.vector.length. The former, also known as canonical IV, is more favorable for analyses as it's "countable" in the sense of SCEV; the latter (EVL-based IV), however, is more favorable to codegen, at least for those that support scalable vectors like AArch64 SVE and RISC-V.

The idea is that we use canonical IV all the way until the beginning of codegen pipeline, where we replace it with EVL-based IV using EVLIVSimplify introduced here. Such that we can have the best from both worlds.

This Pass is enabled by default in RISC-V. However, since we haven't really vectorize loops with predicate tail-folding, this Pass is no-op at this moment.

That said, I have validate the correctness of this Pass by enable EVL-based LV + predicated tail-folding
(i.e. -force-tail-folding-style=data-with-evl
-prefer-predicate-over-epilogue=predicate-dont-vectorize) and run on SPEC2006INT and SPEC2017 intrate w/ test workload.
With this setup and -mcpu=sifive-x280, I'd observed a notable 1.8% improvement on dynamic instruction counts on 525.x264_r. Aside from that, rest of benchmarks only have negligible (dynamic IC) differences which are less than 1%.

llvmbot · 2024-05-10T19:02:47Z

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-transforms

Author: Min-Yih Hsu (mshockwave)

Changes

When we enable EVL-based loop vectorization w/ predicated tail-folding, each vectorized loop has effectively two induction variables: one calculates the step using (VF x vscale) and the other one increases the IV by values returned from experiment.get.vector.length. The former, also known as canonical IV, is more favorable for analyses as it's "countable" in the sense of SCEV; the latter (EVL-based IV), however, is more favorable to codegen, at least for those that support scalable vectors like AArch64 SVE and RISC-V.

The idea is that we use canonical IV all the way until the beginning of codegen pipeline, where we replace it with EVL-based IV using EVLIVSimplify introduced here. Such that we can have the best from both worlds.

This Pass is enabled by default in RISC-V. However, since we haven't really vectorize loops with predicate tail-folding, this Pass is no-op at this moment.

That said, I have validate the correctness of this Pass by enable EVL-based LV + predicated tail-folding
(i.e. -force-tail-folding-style=data-with-evl
-prefer-predicate-over-epilogue=predicate-dont-vectorize) and run on SPEC2006INT and SPEC2017 intrate w/ test workload.
With this setup and -mcpu=sifive-x280, I'd observed a notable 1.8% improvement on dynamic instruction counts on 525.x264_r. Aside from that, rest of benchmarks only have negligible (dynamic IC) differences which are less than 1%.

Patch is 33.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91796.diff

10 Files Affected:

(modified) llvm/include/llvm/InitializePasses.h (+1)
(added) llvm/include/llvm/Transforms/Vectorize/EVLIndVarSimplify.h (+34)
(modified) llvm/lib/Passes/PassBuilder.cpp (+1)
(modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+1)
(modified) llvm/lib/Passes/PassRegistry.def (+1)
(modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+4)
(modified) llvm/lib/Transforms/Vectorize/CMakeLists.txt (+1)
(added) llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp (+253)
(modified) llvm/test/CodeGen/RISCV/O3-pipeline.ll (+3)
(added) llvm/test/Transforms/LoopVectorize/RISCV/evl-iv-simplify.ll (+282)

diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h
index 9ba75d491c1c9..5e3eff127046d 100644
--- a/llvm/include/llvm/InitializePasses.h
+++ b/llvm/include/llvm/InitializePasses.h
@@ -130,6 +130,7 @@ void initializeIfConverterPass(PassRegistry&);
 void initializeImmutableModuleSummaryIndexWrapperPassPass(PassRegistry&);
 void initializeImplicitNullChecksPass(PassRegistry&);
 void initializeIndirectBrExpandLegacyPassPass(PassRegistry &);
+void initializeEVLIndVarSimplifyPass(PassRegistry &);
 void initializeInferAddressSpacesPass(PassRegistry&);
 void initializeInstSimplifyLegacyPassPass(PassRegistry &);
 void initializeInstructionCombiningPassPass(PassRegistry&);
diff --git a/llvm/include/llvm/Transforms/Vectorize/EVLIndVarSimplify.h b/llvm/include/llvm/Transforms/Vectorize/EVLIndVarSimplify.h
new file mode 100644
index 0000000000000..9b1c207439f8a
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Vectorize/EVLIndVarSimplify.h
@@ -0,0 +1,34 @@
+//===-------- EVLIndVarSimplify.h - Optimize vectorized loops w/ EVL IV----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass optimizes a vectorized loop with canonical IV to using EVL-based
+// IV if it was tail-folded by predicated EVL.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_VECTORIZE_EVLINDVARSIMPLIFY_H
+#define LLVM_TRANSFORMS_VECTORIZE_EVLINDVARSIMPLIFY_H
+
+#include "llvm/Analysis/LoopAnalysisManager.h"
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+class Loop;
+class LPMUpdater;
+class Pass;
+
+/// Turn vectorized loops with canonical induction variables into loops that
+/// only use a single EVL-based induction variable.
+struct EVLIndVarSimplifyPass : public PassInfoMixin<EVLIndVarSimplifyPass> {
+  PreservedAnalyses run(Loop &L, LoopAnalysisManager &LAM,
+                        LoopStandardAnalysisResults &AR, LPMUpdater &U);
+};
+
+Pass *createEVLIndVarSimplifyPass();
+} // namespace llvm
+#endif
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index e4131706aba01..8b32bc623ecad 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -294,6 +294,7 @@
 #include "llvm/Transforms/Utils/SymbolRewriter.h"
 #include "llvm/Transforms/Utils/UnifyFunctionExitNodes.h"
 #include "llvm/Transforms/Utils/UnifyLoopExits.h"
+#include "llvm/Transforms/Vectorize/EVLIndVarSimplify.h"
 #include "llvm/Transforms/Vectorize/LoadStoreVectorizer.h"
 #include "llvm/Transforms/Vectorize/LoopVectorize.h"
 #include "llvm/Transforms/Vectorize/SLPVectorizer.h"
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 1d7f0510450c9..ce11d023c85b4 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -135,6 +135,7 @@
 #include "llvm/Transforms/Utils/NameAnonGlobals.h"
 #include "llvm/Transforms/Utils/RelLookupTableConverter.h"
 #include "llvm/Transforms/Utils/SimplifyCFGOptions.h"
+#include "llvm/Transforms/Vectorize/EVLIndVarSimplify.h"
 #include "llvm/Transforms/Vectorize/LoopVectorize.h"
 #include "llvm/Transforms/Vectorize/SLPVectorizer.h"
 #include "llvm/Transforms/Vectorize/VectorCombine.h"
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index e5ce6cb7da649..683b7d2f64a42 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -602,6 +602,7 @@ LOOP_ANALYSIS("should-run-extra-simple-loop-unswitch",
 #endif
 LOOP_PASS("canon-freeze", CanonicalizeFreezeInLoopsPass())
 LOOP_PASS("dot-ddg", DDGDotPrinterPass())
+LOOP_PASS("evl-iv-simplify", EVLIndVarSimplifyPass())
 LOOP_PASS("guard-widening", GuardWideningPass())
 LOOP_PASS("indvars", IndVarSimplifyPass())
 LOOP_PASS("invalidate<all>", InvalidateAllAnalysesPass())
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 7b2dcadc41917..88dca1fdaab8d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -37,6 +37,7 @@
 #include "llvm/Target/TargetOptions.h"
 #include "llvm/Transforms/IPO.h"
 #include "llvm/Transforms/Scalar.h"
+#include "llvm/Transforms/Vectorize/EVLIndVarSimplify.h"
 #include <optional>
 using namespace llvm;
 
@@ -419,6 +420,9 @@ void RISCVPassConfig::addIRPasses() {
   }
 
   TargetPassConfig::addIRPasses();
+
+  if (getOptLevel() != CodeGenOptLevel::None)
+    addPass(createEVLIndVarSimplifyPass());
 }
 
 bool RISCVPassConfig::addPreISel() {
diff --git a/llvm/lib/Transforms/Vectorize/CMakeLists.txt b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
index 9674094024b9e..0bcc198086693 100644
--- a/llvm/lib/Transforms/Vectorize/CMakeLists.txt
+++ b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMVectorize
+  EVLIndVarSimplify.cpp
   LoadStoreVectorizer.cpp
   LoopVectorizationLegality.cpp
   LoopVectorize.cpp
diff --git a/llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp b/llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp
new file mode 100644
index 0000000000000..21c453925cd76
--- /dev/null
+++ b/llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp
@@ -0,0 +1,253 @@
+//===------ EVLIndVarSimplify.cpp - Optimize vectorized loops w/ EVL IV----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass optimizes a vectorized loop with canonical IV to using EVL-based
+// IV if it was tail-folded by predicated EVL.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Transforms/Vectorize/EVLIndVarSimplify.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/Analysis/IVDescriptors.h"
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/Analysis/LoopPass.h"
+#include "llvm/Analysis/ScalarEvolution.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Scalar/LoopPassManager.h"
+
+#define DEBUG_TYPE "evl-iv-simplify"
+
+using namespace llvm;
+
+STATISTIC(NumEliminatedCanonicalIV, "Number of canonical IVs we eliminated");
+
+namespace {
+struct EVLIndVarSimplifyImpl {
+  ScalarEvolution &SE;
+
+  explicit EVLIndVarSimplifyImpl(LoopStandardAnalysisResults &LAR)
+      : SE(LAR.SE) {}
+
+  explicit EVLIndVarSimplifyImpl(ScalarEvolution &SE) : SE(SE) {}
+
+  // Returns true if modify the loop.
+  bool run(Loop &L);
+};
+
+struct EVLIndVarSimplify : public LoopPass {
+  static char ID;
+
+  EVLIndVarSimplify() : LoopPass(ID) {
+    initializeEVLIndVarSimplifyPass(*PassRegistry::getPassRegistry());
+  }
+
+  bool runOnLoop(Loop *L, LPPassManager &LPM) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<ScalarEvolutionWrapperPass>();
+    AU.setPreservesCFG();
+  }
+};
+} // anonymous namespace
+
+static std::optional<uint32_t> getVFFromIndVar(const SCEV *Step,
+                                               const Function &F) {
+  if (!Step)
+    return std::nullopt;
+
+  // Looking for loops with IV step value in the form of `(<constant VF> x
+  // vscale)`.
+  if (auto *Mul = dyn_cast<SCEVMulExpr>(Step)) {
+    if (Mul->getNumOperands() == 2) {
+      const SCEV *LHS = Mul->getOperand(0);
+      const SCEV *RHS = Mul->getOperand(1);
+      if (auto *Const = dyn_cast<SCEVConstant>(LHS)) {
+        uint64_t V = Const->getAPInt().getLimitedValue();
+        if (isa<SCEVVScale>(RHS) && llvm::isUInt<32>(V))
+          return static_cast<uint32_t>(V);
+      }
+    }
+  }
+
+  // If not, see if the vscale_range of the parent function is a fixed value,
+  // which makes the step value to be replaced by a constant.
+  if (isa<SCEVConstant>(Step) && F.hasFnAttribute(Attribute::VScaleRange)) {
+    APInt V = cast<SCEVConstant>(Step)->getAPInt().abs();
+    ConstantRange CR = llvm::getVScaleRange(&F, 64);
+    if (const APInt *Fixed = CR.getSingleElement()) {
+      V = V.zextOrTrunc(Fixed->getBitWidth());
+      uint64_t VF = V.udiv(*Fixed).getLimitedValue();
+      if (VF && llvm::isUInt<32>(VF))
+        return static_cast<uint32_t>(VF);
+    }
+  }
+
+  return std::nullopt;
+}
+
+// Remove the original induction variable if it's not used anywhere.
+static void cleanupOriginalIndVar(PHINode *OrigIndVar, BasicBlock *InitBlock,
+                                  BasicBlock *BackEdgeBlock) {
+  Value *InitValue = OrigIndVar->getIncomingValueForBlock(InitBlock);
+  Value *RecValue = OrigIndVar->getIncomingValueForBlock(BackEdgeBlock);
+
+  // If the only user of OrigIndVar is the one produces RecValue, then we can
+  // safely remove it.
+  if (!OrigIndVar->hasOneUse() || OrigIndVar->user_back() != RecValue)
+    return;
+
+  LLVM_DEBUG(dbgs() << "Removed the original IndVar " << *OrigIndVar << "\n");
+  // Remove OrigIndVar by replacing all its uses by the initial value of this
+  // loop. Then DCE will take care of the rest.
+  OrigIndVar->replaceAllUsesWith(InitValue);
+  OrigIndVar->eraseFromParent();
+}
+
+bool EVLIndVarSimplifyImpl::run(Loop &L) {
+  InductionDescriptor IVD;
+  PHINode *IndVar = L.getInductionVariable(SE);
+  if (!IndVar || !L.getInductionDescriptor(SE, IVD)) {
+    LLVM_DEBUG(dbgs() << "Cannot retrieve IV from loop " << L.getName()
+                      << "\n");
+    return false;
+  }
+
+  BasicBlock *InitBlock, *BackEdgeBlock;
+  if (!L.getIncomingAndBackEdge(InitBlock, BackEdgeBlock)) {
+    LLVM_DEBUG(dbgs() << "Expect unique incoming and backedge in "
+                      << L.getName() << "\n");
+    return false;
+  }
+
+  // Retrieve the loop bounds.
+  std::optional<Loop::LoopBounds> Bounds = L.getBounds(SE);
+  if (!Bounds) {
+    LLVM_DEBUG(dbgs() << "Could not obtain the bounds for loop " << L.getName()
+                      << "\n");
+    return false;
+  }
+  Value *CanonicalIVInit = &Bounds->getInitialIVValue();
+  Value *CanonicalIVFinal = &Bounds->getFinalIVValue();
+
+  const SCEV *StepV = IVD.getStep();
+  auto VF = getVFFromIndVar(StepV, *L.getHeader()->getParent());
+  if (!VF) {
+    LLVM_DEBUG(dbgs() << "Could not infer VF from IndVar step '" << *StepV
+                      << "'\n");
+    return false;
+  }
+  LLVM_DEBUG(dbgs() << "Using VF=" << *VF << " for loop " << L.getName()
+                    << "\n");
+
+  // Try to find the EVL-based induction variable.
+  using namespace PatternMatch;
+  BasicBlock *BB = IndVar->getParent();
+
+  Value *EVLIndex = nullptr;
+  Value *RemVL = nullptr, *AVL = nullptr;
+  auto IntrinsicMatch = m_Intrinsic<Intrinsic::experimental_get_vector_length>(
+      m_Value(RemVL), m_SpecificInt(*VF),
+      /*Scalable=*/m_SpecificInt(1));
+  for (auto &PN : BB->phis()) {
+    if (&PN == IndVar)
+      continue;
+
+    // Check 1: it has to contain both incoming (init) & backedge blocks
+    // from IndVar.
+    if (PN.getBasicBlockIndex(InitBlock) < 0 ||
+        PN.getBasicBlockIndex(BackEdgeBlock) < 0)
+      continue;
+    // Check 2: EVL index is always increasing, thus its inital value has to be
+    // equal to either the initial IV value (when the canonical IV is also
+    // increasing) or the last IV value (when canonical IV is decreasing).
+    Value *Init = PN.getIncomingValueForBlock(InitBlock);
+    using Direction = Loop::LoopBounds::Direction;
+    switch (Bounds->getDirection()) {
+    case Direction::Increasing:
+      if (Init != CanonicalIVInit)
+        continue;
+      break;
+    case Direction::Decreasing:
+      if (Init != CanonicalIVFinal)
+        continue;
+      break;
+    case Direction::Unknown:
+      // To be more permissive and see if either the initial or final IV value
+      // matches PN's init value.
+      if (Init != CanonicalIVInit && Init != CanonicalIVFinal)
+        continue;
+      break;
+    }
+    Value *RecValue = PN.getIncomingValueForBlock(BackEdgeBlock);
+    assert(RecValue);
+
+    LLVM_DEBUG(dbgs() << "Found candidate PN of EVL-based IndVar: " << PN
+                      << "\n");
+
+    // Check 3: Pattern match to find the EVL-based index and total trip count
+    // (AVL).
+    if (match(RecValue,
+              m_c_Add(m_ZExtOrSelf(IntrinsicMatch), m_Specific(&PN))) &&
+        match(RemVL, m_Sub(m_Value(AVL), m_Specific(&PN)))) {
+      EVLIndex = RecValue;
+      break;
+    }
+  }
+
+  if (!EVLIndex || !AVL)
+    return false;
+
+  LLVM_DEBUG(dbgs() << "Using " << *EVLIndex << " for EVL-based IndVar\n");
+
+  // Create an EVL-based comparison and replace the branch to use it as
+  // predicate.
+  ICmpInst *OrigLatchCmp = L.getLatchCmpInst();
+  ICmpInst::Predicate Pred = OrigLatchCmp->getPredicate();
+  if (!ICmpInst::isEquality(Pred))
+    return false;
+
+  IRBuilder<> Builder(OrigLatchCmp);
+  auto *NewPred = Builder.CreateICmp(Pred, EVLIndex, AVL);
+  OrigLatchCmp->replaceAllUsesWith(NewPred);
+
+  cleanupOriginalIndVar(IndVar, InitBlock, BackEdgeBlock);
+
+  ++NumEliminatedCanonicalIV;
+
+  return true;
+}
+
+PreservedAnalyses EVLIndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &LAM,
+                                             LoopStandardAnalysisResults &AR,
+                                             LPMUpdater &U) {
+  if (EVLIndVarSimplifyImpl(AR).run(L))
+    return PreservedAnalyses::allInSet<CFGAnalyses>();
+  return PreservedAnalyses::all();
+}
+
+char EVLIndVarSimplify::ID = 0;
+
+INITIALIZE_PASS_BEGIN(EVLIndVarSimplify, DEBUG_TYPE,
+                      "EVL-based Induction Variables Simplify", false, false)
+INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
+INITIALIZE_PASS_END(EVLIndVarSimplify, DEBUG_TYPE,
+                    "EVL-based Induction Variables Simplify", false, false)
+
+bool EVLIndVarSimplify::runOnLoop(Loop *L, LPPassManager &LPM) {
+  auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
+  return EVLIndVarSimplifyImpl(SE).run(*L);
+}
+
+Pass *llvm::createEVLIndVarSimplifyPass() { return new EVLIndVarSimplify(); }
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 4a71d3276d263..4f4ba0efd8fe3 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -68,6 +68,9 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Scalar Evolution Analysis
+; CHECK-NEXT:       Loop Pass Manager
+; CHECK-NEXT:         EVL-based Induction Variables Simplify
 ; CHECK-NEXT:       Type Promotion
 ; CHECK-NEXT:       CodeGen Prepare
 ; CHECK-NEXT:       Dominator Tree Construction
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/evl-iv-simplify.ll b/llvm/test/Transforms/LoopVectorize/RISCV/evl-iv-simplify.ll
new file mode 100644
index 0000000000000..5db92fa7255f2
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/evl-iv-simplify.ll
@@ -0,0 +1,282 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -S -mtriple=riscv64 -mattr='+v' --passes='loop(evl-iv-simplify)' < %s | FileCheck %s
+; RUN: opt -S -mtriple=riscv64 -mattr='+v' --passes='loop(evl-iv-simplify),function(simplifycfg,dce)' < %s | FileCheck %s --check-prefix=LOOP-DEL
+
+define void @simple(ptr noalias %a, ptr noalias %b, ptr noalias %c, i64 %N) {
+; CHECK-LABEL: define void @simple(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]], ptr noalias [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT:    [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP2:%.*]] = mul i64 [[TMP1]], 4
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
+; CHECK-NEXT:    br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 4
+; CHECK-NEXT:    [[TMP8:%.*]] = sub i64 [[TMP7]], 1
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i64 [[TMP9]], 4
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[EVL_BASED_IV:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP11:%.*]] = sub i64 [[N]], [[EVL_BASED_IV]]
+; CHECK-NEXT:    [[TMP12:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[TMP11]], i32 4, i1 true)
+; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[EVL_BASED_IV]], 0
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[TMP13]]
+; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i32 0
+; CHECK-NEXT:    [[VP_OP_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 [[TMP15]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP12]])
+; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[C]], i64 [[TMP13]]
+; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds i32, ptr [[TMP16]], i32 0
+; CHECK-NEXT:    [[VP_OP_LOAD1:%.*]] = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 [[TMP17]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP12]])
+; CHECK-NEXT:    [[TMP18:%.*]] = add nsw <vscale x 4 x i32> [[VP_OP_LOAD1]], [[VP_OP_LOAD]]
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP13]]
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr inbounds i32, ptr [[TMP19]], i32 0
+; CHECK-NEXT:    call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> [[TMP18]], ptr align 4 [[TMP20]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP12]])
+; CHECK-NEXT:    [[TMP21:%.*]] = zext i32 [[TMP12]] to i64
+; CHECK-NEXT:    [[INDEX_EVL_NEXT]] = add i64 [[TMP21]], [[EVL_BASED_IV]]
+; CHECK-NEXT:    [[INDEX_NEXT:%.*]] = add i64 0, [[TMP10]]
+; CHECK-NEXT:    [[TMP23:%.*]] = icmp eq i64 [[INDEX_EVL_NEXT]], [[N]]
+; CHECK-NEXT:    [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP24:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, ptr [[C]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP25:%.*]] = load i32, ptr [[ARRAYIDX2]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[TMP25]], [[TMP24]]
+; CHECK-NEXT:    [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[ARRAYIDX4]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-...
[truncated]

alexey-bataev · 2024-05-13T13:34:20Z

llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp

+  if (isa<SCEVConstant>(Step) && F.hasFnAttribute(Attribute::VScaleRange)) {
+    APInt V = cast<SCEVConstant>(Step)->getAPInt().abs();


Use dyn_cast instead of isa/cast

alexey-bataev · 2024-05-13T13:36:55Z

llvm/test/Transforms/LoopVectorize/RISCV/evl-iv-simplify.ll

+; RUN: opt -S -mtriple=riscv64 -mattr='+v' --passes='loop(evl-iv-simplify)' < %s | FileCheck %s
+; RUN: opt -S -mtriple=riscv64 -mattr='+v' --passes='loop(evl-iv-simplify),function(simplifycfg,dce)' < %s | FileCheck %s --check-prefix=LOOP-DEL
+
+define void @simple(ptr noalias %a, ptr noalias %b, ptr noalias %c, i64 %N) {


Would be good to simplify the test as much as possible

I've tried to simplify it by only using only a single vp.load rather than two.

llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp

fhahn

as this is a backend pass, wouldn't it be better suited in llvm/lib/Codegen like other IR codegen passes?

fhahn · 2024-05-15T19:33:26Z

llvm/test/Transforms/LoopVectorize/RISCV/evl-iv-simplify.ll

@@ -0,0 +1,264 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -S -mtriple=riscv64 -mattr='+v' --passes='loop(evl-iv-simplify)' < %s | FileCheck %s
+; RUN: opt -S -mtriple=riscv64 -mattr='+v' --passes='loop(evl-iv-simplify),function(simplifycfg,dce)' < %s | FileCheck %s --check-prefix=LOOP-DEL


evl-iv-simplify is a separate pass and should have separate tests, independent of LV tests.

Moved to CodeGen test folder now.

llvm/include/llvm/Transforms/Vectorize/EVLIndVarSimplify.h

topperc · 2024-05-15T22:39:41Z

llvm/test/CodeGen/RISCV/O3-pipeline.ll

@@ -68,6 +68,9 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Scalar Evolution Analysis
+; CHECK-NEXT:       Loop Pass Manager
+; CHECK-NEXT:         EVL-based Induction Variables Simplify


Is possible to get this into the same loop pass manager as LSR?

IIUC in order to do that we need to put EVLIndVarSimplify side-by-side with LSR, but I feel like we can make good uses of canonical form loops even between LSR and CodeGenPrepare. Do you think that will be the case?

I'm not sure. LSR itself already does some things that make loops harder to analyze. Though it doesn't make them not countable.

mshockwave · 2024-05-16T18:27:32Z

as this is a backend pass, wouldn't it be better suited in llvm/lib/Codegen like other IR codegen passes?

I agree, it's moved under CodeGen now.

nikolaypanchenko · 2024-05-20T17:28:15Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+#define DEBUG_TYPE "evl-iv-simplify"
+
+using namespace llvm;
+


Can you add an option to control that pass ?

nikolaypanchenko · 2024-05-20T17:34:03Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+};
+} // anonymous namespace
+
+static std::optional<uint32_t> getVFFromIndVar(const SCEV *Step,


nit: unsigned as a return and 0 to mimic nullopt ?

Good point, VF will never be zero. It's fixed now.

nikolaypanchenko · 2024-05-20T17:40:09Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+
+// Remove the original induction variable if it's not used anywhere.
+static void cleanupOriginalIndVar(PHINode *OrigIndVar, BasicBlock *InitBlock,
+                                  BasicBlock *BackEdgeBlock) {


Why can't you use OrigIndVar to get incoming BBs and values ? What if passed BBs are invalid for the phi ?

It's fixed now. I use InductionDescriptor to obtain the initial value.

nikolaypanchenko · 2024-05-20T17:40:39Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+
+  // If the only user of OrigIndVar is the one produces RecValue, then we can
+  // safely remove it.
+  if (!OrigIndVar->hasOneUse() || OrigIndVar->user_back() != RecValue)


nit: presence of the checks makes me think that function should be named tryToCleanupOriginalIndVar

also, do you need to check number of incoming values ?

I've renamed the function as well as check the number of incoming values.

nikolaypanchenko · 2024-05-20T17:50:22Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+  BasicBlock *BB = IndVar->getParent();
+
+  Value *EVLIndex = nullptr;
+  Value *RemVL = nullptr, *AVL = nullptr;


to be in sync with vectorizer, better to RemVL -> RemTC, AVL -> TC

nikolaypanchenko · 2024-05-20T17:53:52Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+  using namespace PatternMatch;
+  BasicBlock *BB = IndVar->getParent();
+
+  Value *EVLIndex = nullptr;


just EVL or EVLIndVar

nikolaypanchenko

lgtm

topperc · 2024-05-22T02:18:59Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+                    "EVL-based Induction Variables Simplify", false, false)
+
+bool EVLIndVarSimplify::runOnLoop(Loop *L, LPPassManager &LPM) {
+  auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();


Do we need to check skipLoop here for opt-bisect?

topperc · 2024-05-22T02:26:35Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+      ConstantRange CR = llvm::getVScaleRange(&F, 64);
+      if (const APInt *Fixed = CR.getSingleElement()) {
+        V = V.zextOrTrunc(Fixed->getBitWidth());
+        uint64_t VF = V.udiv(*Fixed).getLimitedValue();


Is it possible the constant isn't evenly divisible by vscale?

I added a check here.

topperc · 2024-05-22T02:37:24Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+
+  if (!EVLIndVar || !TC)
+    return false;
+


Do we need to check that TC is related to the original trip count calculation? Couldn't we construct an invalid loop that has a different trip count for the vsetvli than the countable trip count?

Ideally we can do this by comparing the TC's SCEV with canonical trip count's SCEV. But it turns out canonical trip count is not always SCEV-computable. So I only do such comparison when both TC and vscale are constants. Otherwise, I simply check if TC is used in the SCEV expressions of either the upper or lower bound of canonical IV.

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

topperc · 2024-05-22T02:39:12Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+
+  LLVM_DEBUG(dbgs() << "Removed the original IndVar " << *OrigIndVar << "\n");
+  // Remove OrigIndVar by replacing all its uses by the initial value of this
+  // loop. Then DCE will take care of the rest.


There's no explicit DCE after this pass is there?

No there isn't, though the dead code will be cleanup anyway later in the codegen pipeline. So I change the wording here to not mentioning DCE.

npanchen · 2024-05-22T18:33:00Z

add @penzn

preames · 2024-05-24T15:29:51Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+}
+
+// Remove the original induction variable if it's not used anywhere.
+static void tryCleanupOriginalIndVar(PHINode *OrigIndVar,


See RecursivelyDeleteDeadPHINode in Local.h. Slightly more powerful than what you have implemented, so confirm that deleting expressions outside the loop is valid.

Good point, it's fixed now.

preames · 2024-05-24T15:33:43Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+  }
+
+  // Retrieve the loop bounds.
+  std::optional<Loop::LoopBounds> Bounds = L.getBounds(SE);


Using the getBounds API is a slightly weird choice here. getBounds doesn't use SCEV, and has a bunch of restrictions that SCEV does not. You're immediately turning around and converting the results to SCEV, so maybe just use that to begin with?

The SCEV of initial and final values are probably going to be gone, since they're only used in the ad-hoc check I wrote, which is subject to be replaced. Also, I don't quite understand what you mean by saying getBounds doesn't use SCEV, because judging from its code it's using ScalarEvolution to calculate the initial and final values. Could you elaborate a little more?

IVDescriptors has limited use of SCEV to identify addrecs. LoopBounds::getBounds works solely on IR, matching existing instructions. It doesn't produce the limit or start in terms of SCEV values.

preames · 2024-05-24T15:42:52Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+        MatchTC = (CanonicalTripCount * Step - TripCount).ult(Step);
+      }
+  }
+  // Otherwise, we simply check if the upper or lower bound expression of the


"contains TCV"? I don't see how this proves anything useful. It could contain it in an arbitrary expression which computes a completely unrelated value?

mshockwave · 2024-05-28T21:08:47Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+
+  LLVM_DEBUG(dbgs() << "Using " << *EVLIndVar << " for EVL-based IndVar\n");
+
+  // Create an EVL-based comparison and replace the branch to use it as


Thank you for the suggestion, I think IV->evaluateAtIteration(BTC) + EVLStep.evaluateAtIteration(BTC) is pretty neat.

I have one question here though:

First, terminology. EVLStep = min(VF, sub_nsw(EVL_TC, IV))

Is it supposed to be EVLStep = min(IV.Step, sub_nsw(EVL_TC, IV))?

No, it wasn't. The active_vector_length has the VF parameter. That may be the same as the step, but in principle doesn't have to be. Note that my (1) above (which your code already does) is proving VF=IV.Step, but that is a step you need to prove. It's not something you can assume.

I think there is another problem here: the backedge taken count cannot be computed when vscale is not constant. Namely, IV.Step is in the most commonly seen form of (VF x vscale). Which also causes a problem in another comment thread where you suggested avoid using getBounds, because I guess you were suggesting to get the final IV SCEV via something like IV.evaluateAtIteration(CanonicalTripCount), but CanonicalTripCount, just like backedge taken count, cannot be computed in this (i.e. non-constant vscale) case.

However, alternatively I think we can get exit value via this:

IV_BTC = IV_Final - IV.Step Tail = TC - IV_BTC Exit value = IV_BTC + Tail

IV_BTC as the name suggested, is the equivalence of IV->evaluateAtIteration(BTC) you suggested, but this time we calculate it from IV_Final, which is, well, final IV value returned from getBounds. Since it only uses IR pattern matching we don't have to worry about the aforementioned uncomputable problem.

What do you think?

think there is another problem here: the backedge taken count cannot be computed when vscale is not constant. Namely, IV.Step is in the most commonly seen form of (VF x vscale).... just like backedge taken count, cannot be computed in this (i.e. non-constant vscale) case.

Ok, this seems weird. I don't know of any fundamental reason that SCEV shouldn't be able to figure out a trip count for a loop invariant step - even if non constant. Would you mind sharing a piece of IR which demonstrates this? I want to see if this is something easy to fix, or I'm forgetting some subtle interaction.

Would you mind sharing a piece of IR which demonstrates this? I want to see if this is something easy to fix, or I'm forgetting some subtle interaction.

I was using the first function (i.e. @simple) in the test for this PR (evl-iv-simplify.ll). I actually traced down to why it could not compute: the smoking gun is ScalarEvolution::howFarToZero, specifically, this line that actually bails out with SCEVCouldNotCompute. As the TODO comment right above it suggested, it couldn't handle non-constant step in AddRec for now.

Thanks!

That limitation actually looks pretty easy to at least partially remove. I'm currently a bit buried, but expect a patch in the next few days.

Thank you!

See #94411

Though this is with two changes to your test:

adding vscale_range to the function

making the IV nuw

I think both should apply to your original (i.e. your test is slightly over-reduced), but please confirm.

See #94411

Though this is with two changes to your test:

adding vscale_range to the function

making the IV nuw

I think both should apply to your original (i.e. your test is slightly over-reduced), but please confirm.

Thank you for the quick fix.
I gave it a try and found 1 minor and 1 major concern: the minor one is that vscale_range has to be set otherwise BTC cannot be computed. This is because when we're computing BTC, one of the preconditions is that the range of (VF x vscale) has to be non-zero, which depends on the range of vscale. Currently we use [1,0) ,a.k.a wrap around starting from 1, to represent the range of vscale when vscale_range is not set -- but this is an unsigned wrap around. However, when computing the multiplication range of (VF x vscale) this [1,0) is treated as signed wrap around, leading to (VF x vscale)'s range being full_set. Therefore when being asked the minimum unsigned value of (VF x vscale)'s range, it returns zero. I call this "minor" because I believe in practice vscale_range will mostly be set. But it's still a little annoyed to be that we cannot represent explicit unsigned range like [1, UINT_MAX] with ConstantRange.

The major concern being that the instructions for computing exit value expanded from the SCEV expressions we crafted is pretty verbose, long, a more importantly, contain division instructions. Let me start with the formula we have:

IV = the original canonical IV EVL_TC = the trip count in the scalar loop IV.Step = (VF x vscale) EVLStep = umin(IV.Step, sub_nsw(EVL_TC, IV)) BTC = backedge taken count in the original canonical IV loop ExitValue = IV.evaluateAtIteration(BTC) + EVLStep.evaluateAtIteration(BTC)

In your original formula EVLStep is min(VF, sub_nsw(EVL_TC, IV)), which I had a question on whether the first operand of min is VF or IV.Step. Because to my best understandings, VF is the same VF we put in the scalable vector type we use, (vscale x VF x ...), and also the VF used in computing step value in IV: (VF x vscale). So I think we should use IV.Step here. For the very least, if I use VF (as in my definition) here, the result will be pretty wrong.

Now, since only SCEVAddRec has evaluateAtIteration, we can't really do EVLStep.evaluateAtIteration. Instead, I rewrote EVLStep.evaluateAtIteration into:

umin(IV.Step, sub_nsw(EVL_TC, IV.evaluateAtIteration(BTC))

Because IV.Step and EVL_TC are both loop invariant.
Using the @simple function in my test case + the same modifications you made in your comment above, as an example, this final IV.evaluateAtIteration(BTC) + umin(IV.Step, sub_nsw(EVL_TC, IV.evaluateAtIteration(BTC)) looks like:

((4 * vscale * ((-1 * vscale * (4 + (-4 * ((-1 + (4 * vscale)<nuw><nsw> + %N) /u (4 * vscale)<nuw><nsw>))<nsw>)<nsw>) /u (4 * vscale)<nuw><nsw>)) + (((-4 * vscale * ((-1 * vscale * (4 + (-4 * ((-1 + (4 * vscale)<nuw><nsw> + %N) /u (4 * vscale)<nuw><nsw>))<nsw>)<nsw>) /u (4 * vscale)<nuw><nsw>)) + %N) umin (4 * vscale)<nuw><nsw>))

And those divisions seen in the expressions will be expanded into division instructions. I'm concern whether we should spend so many instructions to compute something that will almost certain be EVL_TC.

I also run the same algorithm on the @fixed_iv_step_tc function in my test case, which has fixed IV.Step and fixed trip count. The generated exit value is correct, which means my methodology is probably not terribly off.

I've also tried to use sub_nsw(EVL_TC, IV.evaluateAtIteration(BTC), which is the tail in the last iteration, in replacement of umin(IV.Step, sub_nsw(EVL_TC, IV.evaluateAtIteration(BTC)). But the resulting Exit value, IV.evaluateAtIteration(BTC) + sub_nsw(EVL_TC, IV.evaluateAtIteration(BTC) is not really meaningful (it will certainly be EVL_TC though).

Another thing is that I'm still a little confused about how to do Step (1) you described: it's true that I already did such a check (line 212 ~ 223) -- but only on cases where both trip count and vscale are constant. For every other cases we only have SCEV expressions, which we cannot know it value during compile time. Even we expand the check into runtime check, what should we do if the check fails during runtime? Do we fall back the original canonical IV?

See #94411
Though this is with two changes to your test:

adding vscale_range to the function

making the IV nuw

I think both should apply to your original (i.e. your test is slightly over-reduced), but please confirm.

Thank you for the quick fix. I gave it a try and found 1 minor and 1 major concern: the minor one is that vscale_range has to be set otherwise BTC cannot be computed. This is because when we're computing BTC, one of the preconditions is that the range of (VF x vscale) has to be non-zero, which depends on the range of vscale. Currently we use [1,0) ,a.k.a wrap around starting from 1, to represent the range of vscale when vscale_range is not set -- but this is an unsigned wrap around. However, when computing the multiplication range of (VF x vscale) this [1,0) is treated as signed wrap around, leading to (VF x vscale)'s range being full_set. Therefore when being asked the minimum unsigned value of (VF x vscale)'s range, it returns zero. I call this "minor" because I believe in practice vscale_range will mostly be set. But it's still a little annoyed to be that we cannot represent explicit unsigned range like [1, UINT_MAX] with ConstantRange.

Clang emits the vscale_range by default. vscale is bounded by the values of possible VLEN, and VLEN has a specific upper bound in the RISCV vector specification. So while annoying that SCEV can't just ask the backend for this, this is a non-problem for any real c/c++ input.

For the rest of your response, one quick comment - the divisions are by 4 * vscale. We know that vscale is a power of two by definition, so these are right shifts. You do need a popcount to know which power of two, but we could reasonable emit one popcount and a bunch of shifts. (Not saying we do so today - I'm saying it's possible.)

For the rest, I feel we are talking past each other - and frankly, I'm having trouble following the detail in github PR interface - can I suggest we setup a brief phone call to chat this through instead?

For the rest, I feel we are talking past each other - and frankly, I'm having trouble following the detail in github PR interface - can I suggest we setup a brief phone call to chat this through instead?

That will be really helpful! I'll send you an email to your "public" email address to coordinate this.

mshockwave · 2024-09-10T23:56:06Z

I'd just updated this PR to use a new exit condition. Previously, the exit condition of the new EVL-based loop was implemented by comparing IV against a trip count retrieved from IR pattern matching. This turned out to be risky since the said condition was not derived from the original canonical loop, so there is a chance that the number of iterations of the new EVL-based loop might not match that of the original loop.

The new exit condition is EVLIV > VF * BTC, where EVLIV is the new induction variable, VF is the vectorization factor and BTC is backedge taken count (i.e. trip count minus one in most cases). I think I have a proof written somewhere showing that this new condition is sound. I'll share the proof here once I found it. It is worth noting that this new exit condition is based upon the new semantics of llvm.experimental.get.vector.length landed in e806370.

I've verified the correctness of the new exit condition with SPEC2006INT and SPEC2017INTRATE. In these two suites, the dynamic instruction counts keep mostly the same, with minor (<1%) improvements in some benchmarks.

One thing which I haven't had a good solution yet is the fact that SCEVExpander (or just ScalarEvolution in general) couldn't simplify the SCEV expression for VF * BTC in some cases. For instance, the @simple function in test/CodeGen/RISCV/evl-iv-simplify.ll uses a bunch of division and multiplication to build the said expression. Though in cases where vscale_range are singular ScalarEvolution successfully simplifies the expression. Plus, my experiments also showed that the dynamic instruction counts barely increase. @preames what do you think on this particular issue?

wangpc-pp · 2024-09-11T05:22:24Z

The new exit condition is EVLIV > VF * BTC, where EVLIV is the new induction variable, VF is the vectorization factor and BTC is backedge taken count (i.e. trip count minus one in most cases).

I don't know if I understand correctly, maybe #102575 is related to this change?

mshockwave · 2024-10-07T21:04:44Z

The new exit condition is EVLIV > VF * BTC, where EVLIV is the new induction variable, VF is the vectorization factor and BTC is backedge taken count (i.e. trip count minus one in most cases).

I don't know if I understand correctly, maybe #102575 is related to this change?

thanks, I tried your PR but unfortunately it didn't help a lot...

mshockwave · 2024-10-14T21:56:27Z

ping

When we enable EVL-based loop vectorization w/ predicated tail-folding, each vectorized loop has effectively two induction variables: one calculates the step using (VF x vscale) and the other one increases the IV by values returned from experiment.get.vector.length. The former, also known as canonical IV, is more favorable for analyses as it's "countable" in the sense of SCEV; the latter (EVL-based IV), however, is more favorable to codegen, at least for those that support scalable vectors like AArch64 SVE and RISC-V. The idea is that we use canonical IV all the way until the beginning of codegen pipeline, where we replace it with EVL-based IV using EVLIVSimplify introduced here. Such that we can have the best from both worlds. This Pass is enabled by default in RISC-V. However, since we haven't really vectorize loops with predicate tail-folding, this Pass is no-op at this moment. That said, I have validate the correctness of this Pass by enable EVL-based LV + predicated tail-folding (i.e. -force-tail-folding-style=data-with-evl -prefer-predicate-over-epilogue=predicate-dont-vectorize) and run on SPEC2006INT and SPEC2017 intrate w/ test workload.

And simplify the test cases.

And check if the trip count matches the canonical IV.

The new exit condition will not be `EVLIV > VF * BTC` where VF is the vectorization factor and BTC being the backedge taken count.

github-actions · 2024-10-21T18:19:59Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff a60050cf1966eb8d46253c8da13df52ac8b9ec33 e70145290e28b965889adb7c30a547d0b1ebd679 --extensions cpp,h -- llvm/include/llvm/CodeGen/EVLIndVarSimplify.h llvm/lib/CodeGen/EVLIndVarSimplify.cpp llvm/lib/Passes/PassBuilder.cpp llvm/lib/Passes/PassBuilderPipelines.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 6b0d5732ce..4fc99ff4a5 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -84,9 +84,9 @@
 #include "llvm/CodeGen/ComplexDeinterleavingPass.h"
 #include "llvm/CodeGen/DeadMachineInstructionElim.h"
 #include "llvm/CodeGen/DwarfEHPrepare.h"
+#include "llvm/CodeGen/EVLIndVarSimplify.h"
 #include "llvm/CodeGen/EarlyIfConversion.h"
 #include "llvm/CodeGen/EdgeBundles.h"
-#include "llvm/CodeGen/EVLIndVarSimplify.h"
 #include "llvm/CodeGen/ExpandLargeDivRem.h"
 #include "llvm/CodeGen/ExpandLargeFpConvert.h"
 #include "llvm/CodeGen/ExpandMemCmp.h"

In order to simplify the exit condition.

arcbbb · 2024-11-25T07:28:00Z

Related pull request: #117506, addressing the removal of the original canonical IV.

preames · 2024-11-25T16:34:20Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+  const SCEV *VFV = SE.getConstant(BTC->getType(), VF);
+  VFV = SE.getMulExpr(VFV, SE.getVScale(VFV->getType()));
+  const SCEV *ExitValV = SE.getMulExpr(BTC, VFV);
+  LLVM_DEBUG(dbgs() << "ExitVal: " << *ExitValV << "\n");


Isn't ExitValV just CanonicalIVFinal - CanonicalIVInit - CanonicalIVStep by construction? Or alternatively, CanonicalIV->evaluateAtIteration(BTC)?

Note that the way you're computing it should be correct when combined with the matching code above, so this is purely a style/codegen point.

preames · 2024-11-25T16:49:44Z

llvm/lib/CodeGen/EVLIndVarSimplify.cpp

+    // Check 3: Pattern match to find the EVL-based index.
+    if (match(RecValue,
+              m_c_Add(m_ZExtOrSelf(IntrinsicMatch), m_Specific(&PN))) &&
+        match(RemTC, m_Sub(m_Value(), m_Specific(&PN)))) {


There's a nasty correctness problem lurking here.

Consider an EVL IV where the sub is M, and the canonical IV where BTC is in terms of N. N != M. In this case, the original loop would run a different number of iterations than the modified loop unless you can prove N and M are the same. (Well, actually that the BTC derived from both is the same, which is slightly different and maybe easier.)

Let's consider the case where M < N. This may or may not be correct. For a purely EVL vectorized loop, running the extra iterations had no effect. However, if there's any bit of scalar code in the loop with side effects (i.e. this isn't the output of the vectorizer), then we've dropped semantically visible side effects.

Two alternate approaches here (warning, these are quite different - we should probably brainstorm offline before you bother to implement either):

Proof of noop iterations - prove that once active.vector.length is zero, that the loop has no side effects and is not infinite. Thus trip count can be reduced. Unfortunately, this only proves an upper bound, you'd have to then somehow prove (via scev) that M >= N.

Directly teach SCEV how to compute the BTC of an EVL based IV, and then directly prove that the two BTCs are the same. This may be hard. (It can also subsume the option above.)

Honestly, at this point I'm leaning strongly towards just having the vectorizer do this transform during codegen. It knows by construction the two BTCs are equal and doesn't have to do this difficult proof.

having the vectorizer do this transform during codegen. It knows by construction the two BTCs are equal

I actually have been testing an alternative solution that uses IR pattern matching to find the original trip count (similar to the very first version of this PR) but putting the Pass right after the LoopVectorizer. Based on the assumption that loop vectorizer will produce the desired "shape" of loops that we can easily extract the trip count / BTC without fear. But I guess even with this, we somehow need to make sure the LV produces the same BTCs for the two IVs just like you mentioned.

Consider an EVL IV where the sub is M, and the canonical IV where BTC is in terms of N. N != M. In this case, the original loop would run a different number of iterations than the modified loop unless you can prove N and M are the same

To my best understandings, due to how LoopVectorizer derives N and M from the original trip count, N and M will be the same in the code produced by LV. So with my alternative solution which run this Pass right after LV, I think we can assure that N and M as seen by this Pass will be the same.

Since we're not adding to the codegen pipeline anymore

mshockwave · 2024-12-23T23:26:45Z

I've pushed the latest version of this Pass that relies on IR pattern matching against the specific shape of loops generated by LoopVectorizer. Which means that this Pass now runs right after the LV.

@mshockwave

Added an extension point after vectorizer passes in the PassBuilder. Additionally, added extension points before and after vectorizer passes in `buildLTODefaultPipeline`. Credit goes to @mshockwave for guiding me through my first LLVM contribution (and my first open source contribution in general!) :) - Implemented `registerVectorizerEndEPCallback` - Implemented `invokeVectorizerEndEPCallbacks` - Added `VectorizerEndEPCallbacks` SmallVector - Added a command line option `passes-ep-vectorizer-end` to `NewPMDriver.cpp` - `buildModuleOptimizationPipeline` now calls `invokeVectorizerEndEPCallbacks` - `buildO0DefaultPipeline` now calls `invokeVectorizerEndEPCallbacks` - `buildLTODefaultPipeline` now calls BOTH `invokeVectorizerStartEPCallbacks` and `invokeVectorizerEndEPCallbacks` - Added LIT tests to `new-pm-defaults.ll`, `new-pm-lto-defaults.ll`, `new-pm-O0-ep-callbacks.ll`, and `pass-pipeline-parsing.ll` - Renamed `CHECK-EP-Peephole` to `CHECK-EP-PEEPHOLE` in `new-pm-lto-defaults.ll` for consistency. This code is intended for developers that wish to implement and run custom passes after the vectorizer passes in the PassBuilder pipeline. For example, in #91796, a pass was created that changed the induction variables of vectorized code. This is right after the vectorization passes.

mshockwave · 2025-03-12T18:25:32Z

Closes in favor of #131005

mshockwave requested review from preames, arcbbb and alexey-bataev May 10, 2024 19:02

llvmbot added backend:RISC-V vectorizers llvm:transforms labels May 10, 2024

mshockwave requested a review from npanchen May 10, 2024 19:07

alexey-bataev requested a review from topperc May 13, 2024 13:31

alexey-bataev reviewed May 13, 2024

View reviewed changes

arcbbb reviewed May 15, 2024

View reviewed changes

llvm/lib/Transforms/Vectorize/EVLIndVarSimplify.cpp Outdated Show resolved Hide resolved

fhahn reviewed May 15, 2024

View reviewed changes

topperc reviewed May 15, 2024

View reviewed changes

nikolaypanchenko reviewed May 20, 2024

View reviewed changes

nikolaypanchenko reviewed May 22, 2024

View reviewed changes

topperc reviewed May 22, 2024

View reviewed changes

preames approved these changes May 24, 2024

View reviewed changes

mshockwave force-pushed the patch/evl-vectorized-iv-simplified branch from 4eecaf8 to 3ea4b69 Compare September 10, 2024 23:36

mshockwave added 6 commits October 21, 2024 10:00

Address review comments

855cca9

And simplify the test cases.

Move EVLIndVarSimplify to CodeGen

6bc5c94

Address review comments

14f4a7e

Address review comments

3f6b660

And check if the trip count matches the canonical IV.

fixup! Address review comments

9f8bbd7

mshockwave added 2 commits October 21, 2024 10:01

Use RecursivelyDeleteDeadPHINode instead of cleaning IndVar manually

c833e9e

Use a better exit condition that is derived from the original loop

8259007

The new exit condition will not be `EVLIV > VF * BTC` where VF is the vectorization factor and BTC being the backedge taken count.

mshockwave force-pushed the patch/evl-vectorized-iv-simplified branch from 3ea4b69 to 8259007 Compare October 21, 2024 18:16

mshockwave added 5 commits November 4, 2024 11:31

Merge branch 'main' into patch/evl-vectorized-iv-simplified

97c3c8a

Merge branch 'main' into patch/evl-vectorized-iv-simplified

9da476f

Merge branch 'main' into patch/evl-vectorized-iv-simplified

73c651f

(Stash) Put EVLIVSimplify Pass at the end of the vectorizer Pass

8f04feb

In order to simplify the exit condition.

Merge branch 'main' into patch/evl-vectorized-iv-simplified

34dd59e

preames reviewed Nov 25, 2024

View reviewed changes

mshockwave added 3 commits December 23, 2024 14:47

Merge branch 'main' into patch/evl-vectorized-iv-simplified

14a0c0d

Update tests

07d2c2c

Removed legacy Pass

284f733

Since we're not adding to the codegen pipeline anymore

Update tests

e701452

mshockwave requested review from Mel-Chen, arcbbb and alexey-bataev January 14, 2025 00:54

lukel97 mentioned this pull request Jan 15, 2025

RISC-V EVL tail folding #123069

Open

16 tasks

mshockwave mentioned this pull request Jan 24, 2025

[PassBuilder] VectorizerEnd Extension Points #123494

Merged

mshockwave mentioned this pull request Mar 12, 2025

[LV][EVL] Introduce the EVLIndVarSimplify Pass for EVL-vectorized loops #131005

Open

mshockwave closed this Mar 12, 2025

		if (isa<SCEVConstant>(Step) && F.hasFnAttribute(Attribute::VScaleRange)) {
		APInt V = cast<SCEVConstant>(Step)->getAPInt().abs();


		LLVM_DEBUG(dbgs() << "Using " << *EVLIndVar << " for EVL-based IndVar\n");

		// Create an EVL-based comparison and replace the branch to use it as

[LV] Introduce the EVLIVSimplify Pass for EVL-vectorized loops #91796

[LV] Introduce the EVLIVSimplify Pass for EVL-vectorized loops #91796

Conversation

mshockwave commented May 10, 2024

llvmbot commented May 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhahn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mshockwave commented May 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nikolaypanchenko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

npanchen commented May 22, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment was marked as resolved.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mshockwave May 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mshockwave Jun 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mshockwave commented Sep 10, 2024

wangpc-pp commented Sep 11, 2024

mshockwave commented Oct 7, 2024

mshockwave commented Oct 14, 2024

github-actions bot commented Oct 21, 2024 • edited Loading

arcbbb commented Nov 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mshockwave Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mshockwave commented Dec 23, 2024

mshockwave commented Mar 12, 2025

llvmbot commented May 10, 2024 •

edited

Loading

mshockwave May 29, 2024 •

edited

Loading

mshockwave Jun 5, 2024 •

edited

Loading

github-actions bot commented Oct 21, 2024 •

edited

Loading

mshockwave Nov 27, 2024 •

edited

Loading