[LLVM][NVPTX] Add NVPTX codegen support for clusterlaunchcontrol instruction #134568

schwarzschild-radius · 2025-04-07T02:36:41Z

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, please refer to the PTX ISA.

llvmbot · 2025-04-07T02:37:14Z

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Pradeep Kumar (schwarzschild-radius)

Changes

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, please refer to the PTX ISA.

Patch is 34.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134568.diff

11 Files Affected:

(modified) llvm/docs/NVPTXUsage.rst (+98)
(modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+46)
(modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+40)
(modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+1)
(modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+121-1)
(modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.h (+3)
(modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+1)
(modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+63)
(modified) llvm/lib/Target/NVPTX/NVPTXSubtarget.h (+14)
(added) llvm/test/CodeGen/NVPTX/clusterlaunchcontrol-multicast.ll (+50)
(added) llvm/test/CodeGen/NVPTX/clusterlaunchcontrol.ll (+140)

diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 621879fc5648b..7bbd18c71fbeb 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1478,6 +1478,104 @@ similar but the latter uses generic addressing (see `Generic Addressing <https:/
 
 For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk>`__.
 
+
+clusterlaunchcontrol Intrinsics
+-------------------------------
+
+'``llvm.nvvm.clusterlaunchcontrol.try_cancel*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async(ptr %addr, ptr %mbar)
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast(ptr %addr, ptr %mbar)
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
+
+Overview:
+"""""""""
+
+The ``clusterlaunchcontrol.try_cancel`` intrinsics requests atomically cancelling
+the launch of a cluster that has not started running yet. It asynchronously writes
+a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
+operation succeeded or failed. ``addr`` and ``mbar`` must be in ``shared::cta``
+otherwise the result is undefined. The completion of the asynchronous operation
+is tracked using the mbarrier completion mechanism at ``.cluster`` scope referenced
+by the shared memory pointer, ``mbar``. On success, the opaque response contains
+the CTA id of the first CTA of the canceled cluster; no other successful response
+from other ``clusterlaunchcontrol.try_cancel`` operations from the same grid will
+contain that id.
+
+The ``multicast`` variant specifies that the response is asynchronously written to
+the corresponding shared memory location of each CTA in the requesting cluster.
+The completion of the write of each local response is tracked by independent
+mbarriers at the corresponding shared memory location of each CTA in the
+cluster.
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel>`__.
+
+'``llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i1 @llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled(i128 %try_cancel_response)
+
+Overview:
+"""""""""
+
+The ``llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled`` intrinsic can be
+used to decode the opaque response written by the
+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+The intrinsic returns false if the request failed. If the request succeeded,
+it returns true. A true result indicates that:
+
+- the thread block cluster whose first CTA id matches that of the response
+  handle will not run
+- no other successful response of another ``try_cancel`` request will contain
+  the first CTA id of that cluster
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__.
+
+
+'``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare <4 x i32> @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid(i128 %try_cancel_response)
+  declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.x(i128 %try_cancel_response)
+  declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.y(i128 %try_cancel_response)
+  declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.z(i128 %try_cancel_response)
+
+Overview:
+"""""""""
+
+The ``clusterlaunchcontrol.query_cancel.get_first_ctaid`` intrinsic can be
+used to decode the opaque response written by the
+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+If the request succeeded,
+``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.{x,y,z}`` returns
+the coordinate of the first CTA in the canceled cluster, either x, y, or z.
+``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid`` returns a vector
+where the first three components are the x, y, z coordinates of the first CTA.
+The contents of the fourth element are unspecified
+
+If the request failed, the behavior of these intrinsics is undefined.
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__.
+
 Other Intrinsics
 ----------------
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 3e9588a515c9e..12d98229eba02 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5381,4 +5381,50 @@ def int_nvvm_st_bulk_shared_cta : DefaultAttrsIntrinsic<[],
   [IntrArgMemOnly, IntrWriteMem,
     WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;
 
+//
+// Cluster launch control
+//
+
+// clusterlaunchcontrol.try_cancel
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async
+    : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_shared
+    : Intrinsic<[], [llvm_shared_ptr_ty, llvm_shared_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast
+    : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast_shared
+    : Intrinsic<[], [llvm_shared_ptr_ty, llvm_shared_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared">;
+
+// clusterlaunchcontrol.query_cancel.is_canceled
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_is_canceled
+    : Intrinsic<[llvm_i1_ty], [llvm_i128_ty], [],
+                "llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled">;
+
+// clusterlaunchcontrol.query_cancel.get_first_ctaid
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid
+    : Intrinsic<[llvm_v4i32_ty], [llvm_i128_ty], [],
+                "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid">;
+
+foreach dim = ["x", "y", "z"] in {
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_ # dim
+    : Intrinsic<[llvm_i32_ty], [llvm_i128_ty], [],
+                "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid." # dim>;
+}
+
+
 } // let TargetPrefix = "nvvm"
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index ec1f969494cd1..36dff04f01413 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -312,6 +312,37 @@ void NVPTXDAGToDAGISel::SelectTcgen05Ld(SDNode *N, bool hasOffset) {
   }
 }
 
+void NVPTXDAGToDAGISel::SelectClusterLaunchControl(SDNode *N) {
+  SDLoc DL(N);
+  unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
+  unsigned Opcode;
+  switch (IID) {
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_IS_CANCELED;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_x;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_y;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_z;
+    break;
+  }
+
+  SDValue Operands[] = {
+      N->getOperand(2), // TryCancelResponse_0
+      N->getOperand(3), // TryCancelResponse_1
+      N->getOperand(0), // Chain
+  };
+
+  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Operands));
+}
+
 bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
   unsigned IID = N->getConstantOperandVal(1);
   switch (IID) {
@@ -366,6 +397,15 @@ bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
     SelectTcgen05Ld(N, /* hasOffset */ true);
     return true;
   }
+
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+    SelectClusterLaunchControl(N);
+    return true;
+  }
   }
 }
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 23cbd458571a0..f634276c498b7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -102,6 +102,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
                                            bool IsIm2Col = false);
   void SelectTcgen05Ld(SDNode *N, bool hasOffset = false);
   void SelectTcgen05St(SDNode *N, bool hasOffset = false);
+  void SelectClusterLaunchControl(SDNode *N);
 
   inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
     return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index b566cdd4b6bfc..cae193e5d8595 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -1017,6 +1017,9 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
                      {MVT::v2i32, MVT::v4i32, MVT::v8i32, MVT::v16i32,
                       MVT::v32i32, MVT::v64i32, MVT::v128i32},
                      Custom);
+
+  // Enable custom lowering for i128 bit type supported in PTX
+  setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i128, Custom);
 }
 
 const char *NVPTXTargetLowering::getTargetNodeName(unsigned Opcode) const {
@@ -1165,6 +1168,54 @@ NVPTXTargetLowering::LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getNode(NVPTXISD::Wrapper, dl, PtrVT, Op);
 }
 
+SDValue NVPTXTargetLowering::LowerIntrinsicWChain(SDValue Op,
+                                                  SelectionDAG &DAG) const {
+  SDNode *N = Op.getNode();
+  SDValue Intrin = N->getOperand(1);
+  SDLoc DL(N);
+
+  // Get the intrinsic ID
+  unsigned IntrinNo = cast<ConstantSDNode>(Intrin.getNode())->getZExtValue();
+  switch (IntrinNo) {
+  default:
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+
+    if (N->getOperand(2).getValueType() != MVT::i128) {
+      // return, if the operand is already lowered
+      return SDValue();
+    }
+
+    SDLoc DL(N);
+    SmallVector<SDValue, 8> Ops;
+
+    Ops.push_back(N->getOperand(0)); // Chain
+    Ops.push_back(N->getOperand(1)); // Intrinsic
+
+    SDValue TryCancelResponse = N->getOperand(2);
+    SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, TryCancelResponse);
+    SDValue TryCancelResponse0 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(0, DL));
+    SDValue TryCancelResponse1 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(1, DL));
+
+    Ops.push_back(TryCancelResponse0);
+    Ops.push_back(TryCancelResponse1);
+
+    MemIntrinsicSDNode *MemSD = cast<MemIntrinsicSDNode>(N);
+    return DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, N->getVTList(),
+                                   Ops, MemSD->getMemoryVT(),
+                                   MemSD->getMemOperand());
+  }
+  }
+  return Op;
+}
+
 static bool IsTypePassedAsArray(const Type *Ty) {
   return Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128) ||
          Ty->isHalfTy() || Ty->isBFloatTy();
@@ -2862,7 +2913,7 @@ NVPTXTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::GlobalAddress:
     return LowerGlobalAddress(Op, DAG);
   case ISD::INTRINSIC_W_CHAIN:
-    return Op;
+    return LowerIntrinsicWChain(Op, DAG);
   case ISD::INTRINSIC_VOID:
     return LowerIntrinsicVoid(Op, DAG);
   case ISD::BUILD_VECTOR:
@@ -4724,6 +4775,21 @@ bool NVPTXTargetLowering::getTgtMemIntrinsic(
     Info.align.reset();
     return true;
   }
+
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+    auto &DL = I.getModule()->getDataLayout();
+    Info.opc = ISD::INTRINSIC_W_CHAIN;
+    Info.memVT = getValueType(DL, I.getType());
+    Info.ptrVal = nullptr;
+    Info.offset = 0;
+    Info.flags = MachineMemOperand::MOLoad;
+    Info.align.reset();
+    return true;
+  }
   }
   return false;
 }
@@ -6036,6 +6102,60 @@ static void ReplaceINTRINSIC_W_CHAIN(SDNode *N, SelectionDAG &DAG,
   case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x64:
   case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x128:
     return ReplaceTcgen05Ld(N, DAG, Results, /* Offset */ true);
+
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid: {
+    // The intrinsic returns the CTAID of x, y and z dimension as a v4i32 value
+    EVT ResVT = N->getValueType(0);
+    if (!ResVT.isVector())
+      return; // already legalized.
+
+    const unsigned NumElts = ResVT.getVectorNumElements(); // v4i32
+
+    // Create the return type of the instructions
+    SmallVector<EVT, 5> ListVTs;
+    for (unsigned i = 0; i < NumElts; ++i)
+      ListVTs.push_back(MVT::i32);
+    ListVTs.push_back(MVT::Other);
+
+    SDVTList ResVTs = DAG.getVTList(ListVTs);
+
+    SmallVector<SDValue, 8> Ops;
+    // Add Chain and Intrinsic ID
+    Ops.push_back(N->getOperand(0)); // Chain
+    Ops.push_back(N->getOperand(1)); // Intrinsic ID
+
+    SDValue TryCancelResponse = N->getOperand(2); // i128 operand
+    // Cast i128 to v2i64 and split into two i64
+    SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, TryCancelResponse);
+    SDValue TryCancelResponse_0 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(0, DL));
+    SDValue TryCancelResponse_1 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(1, DL));
+
+    Ops.push_back(TryCancelResponse_0);
+    Ops.push_back(TryCancelResponse_1);
+
+    MemIntrinsicSDNode *MemSD = cast<MemIntrinsicSDNode>(N);
+    // Create a new Intrinsic Node with 2 x i64 operands
+    SDValue NewNode =
+        DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, ResVTs, Ops,
+                                MemSD->getMemoryVT(), MemSD->getMemOperand());
+
+    // Scalarize the vector results
+    SmallVector<SDValue, 4> ScalarRes;
+    for (unsigned i = 0; i < NumElts; ++i) {
+      SDValue Res = NewNode.getValue(i);
+      ScalarRes.push_back(Res);
+    }
+
+    SDValue Chain = NewNode.getValue(NumElts); // v4i32 value
+    SDValue BuildVector = DAG.getNode(ISD::BUILD_VECTOR, DL, ResVT, ScalarRes);
+    Results.push_back(BuildVector); // Build Vector
+    Results.push_back(Chain);       // Chain
+    return;
+  }
   }
 }
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 39470be254efa..723e6defa8327 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -329,6 +329,9 @@ class NVPTXTargetLowering : public TargetLowering {
   SDValue LowerVASTART(SDValue Op, SelectionDAG &DAG) const;
 
   SDValue LowerCopyToReg_128(SDValue Op, SelectionDAG &DAG) const;
+
+  SDValue LowerIntrinsicWChain(SDValue Op, SelectionDAG &DAG) const;
+
   unsigned getNumRegisters(LLVMContext &Context, EVT VT,
                            std::optional<MVT> RegisterVT) const override;
   bool
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 7d0c47fa464c5..29d19c40a6f59 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -160,6 +160,7 @@ def hasHWROT32 : Predicate<"Subtarget->hasHWROT32()">;
 def noHWROT32 : Predicate<"!Subtarget->hasHWROT32()">;
 def hasDotInstructions : Predicate<"Subtarget->hasDotInstructions()">;
 def hasTcgen05Instructions : Predicate<"Subtarget->hasTcgen05Instructions()">;
+def hasBlackwellArch : Predicate<"Subtarget->hasBlackwellArch()">;
 
 def True : Predicate<"true">;
 def False : Predicate<"false">;
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 34cb63e44ca71..3f46c8c1a64ba 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -7301,3 +7301,66 @@ def INT_NVVM_ST_BULK_SHARED_CTA:
             "st.bulk.shared::cta [$dest_addr], $size, 0;",
             [(int_nvvm_st_bulk_shared_cta addr:$dest_addr, i64:$size, (i64 0))]>,
             Requires<[hasSM<100>, hasPTX<86>]>;
+
+// clusterlaunchcontrol Instructions
+
+multiclass CLUSTERLAUNCHCONTROL_TRY_CANCEL<Intrinsic Intrin, bit Multicast, list<Predicate> Pred> {
+  defvar Intrinsic = !cast<Intrinsic>(Intrin # !if(!eq(Multicast, 1), "_multicast", ""));
+  defvar IntrinsicShared = !cast<Intrinsic>(Intrin # !if(!eq(Multicast, 1), "_multicast", "") # "_shared");
+
+  def _gen: NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$mbar),
+                      "clusterlaunchcontrol.try_cancel.async.mbarrier::complete_tx::bytes" #
+                      !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+                      ".b128 [$addr], [$mbar];",
+                      [(Intrinsic Int64Regs:$addr, Int64Regs:$mbar)]>, Requires<Pred>;
+
+  def _shared: NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$mbar),
+                         "clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes" #
+                         !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+                         ".b128 [$addr], [$mbar];",
+                         [(IntrinsicShared Int64Regs:$addr, Int64Regs:$mbar)]>, Requires<Pred>;
+
+  def _shared32: NVPTXInst<(outs), (ins Int32Regs:$addr, Int32Regs:$mbar),
+                           "clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes" #
+                           !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+                           ".b128 [$addr], [$mbar];",
+                           [(IntrinsicShared Int32Regs:$addr, Int32Regs:$mbar)]>, Requires<Pred>;
+}
+
+defm CLUSTERLAUNCHCONTRL_TRY_CANCEL :
+      CLUSTERLAUNCHCONTROL_TRY_CANCEL<
+        int_nvvm_clusterlaunchcontrol_try_cancel_async, 0, [hasSM<100>, hasPTX<86>]>;
+
+defm CLUSTERLAUNCHCONTRL_TRY_CANCEL_MULTICAST :
+      CLUSTERLAUNCHCONTROL_TRY_CANCEL<
+        int_nvvm_clusterlaunchcontrol_try_cance...
[truncated]

gonzalobg · 2025-04-07T12:04:43Z

llvm/docs/NVPTXUsage.rst

+
+The ``clusterlaunchcontrol.try_cancel`` intrinsics requests atomically cancelling
+the launch of a cluster that has not started running yet. It asynchronously writes
+a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the


Suggested change

-a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
+a 16-byte opaque response to shared memory, pointed to by 16-byte-aligned ``addr`` indicating whether the

gonzalobg · 2025-04-07T12:08:36Z

llvm/docs/NVPTXUsage.rst

+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+The intrinsic returns false if the request failed. If the request succeeded,
+it returns true. A true result indicates that:


Suggested change

-it returns true. A true result indicates that:
+it returns ``1`` (true). A true result indicates that:

gonzalobg · 2025-04-07T12:09:42Z

llvm/docs/NVPTXUsage.rst

+"""""""""
+
+The ``clusterlaunchcontrol.query_cancel.get_first_ctaid`` intrinsic can be
+used to decode the opaque response written by the


Suggested change

-used to decode the opaque response written by the
+used to decode the successful opaque response written by the

gonzalobg · 2025-04-07T12:20:27Z

llvm/lib/Target/NVPTX/NVPTXSubtarget.h

+      return false;
+    };
+
+    return HasRequiredSM() && PTXVersion >= 86;


The non-.multicast instructions being added are not arch conditional, so sm >= 100 && ptx >= 86 suffice.

For the .multicast flavors, they only apply to this one instruction, so calling it "blackwellArchAccelerated" instead of hasMulticastClusterLaunchControlTryCancel is not very accurate. It is possible for future non-blackwell GPUs to provide .multicast flavors of this instruction as well, and for those GPUs to not provide .multicast flavors of other Blackwell instructions.

I'm not sure we need this custom predicate at the moment.
We already use [hasSM<100>, hasArchAccelFeatures, hasPTX<86>] in a lot of places.
We could shorten that in tablegen, if necessary, if it becomes too much of a burden, but it does not seem to be the case for now.

If/when some instructions eventually turn out to be blackwell-only, we'll sort it out then, when we have specific info.
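
The gating discussed in this thread can be pictured with a minimal, self-contained C++ sketch. The struct and member names below are hypothetical stand-ins, not the real `NVPTXSubtarget` API, and the assumption that the `.multicast` forms additionally require arch-accelerated features is taken from the predicate list quoted above, not verified against the final patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the subtarget state; not the real NVPTXSubtarget.
struct SubtargetSketch {
  unsigned SmVersion;  // e.g. 100 for sm_100
  unsigned PTXVersion; // e.g. 86 for PTX 8.6
  bool ArchAccel;      // assumed true for arch-accelerated targets

  // Non-multicast forms: plain version checks are enough.
  bool hasTryCancel() const { return SmVersion >= 100 && PTXVersion >= 86; }

  // Multicast forms: same version checks plus arch-accelerated features
  // (assumption based on the predicate list in the review comment).
  bool hasTryCancelMulticast() const { return hasTryCancel() && ArchAccel; }
};
```

Composing the existing checks this way avoids introducing an architecture-named predicate that might not fit future GPUs.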

AlexMaclean · 2025-04-07T16:17:42Z

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

+
+    SDVTList ResVTs = DAG.getVTList(ListVTs);
+
+    SmallVector<SDValue, 8> Ops;


A SmallVector seems overkill given that there will always be exactly 4 operands. Can we instead use an initializer list directly in the getMemIntrinsicNode call?

Artem-B · 2025-04-07T21:52:31Z

llvm/lib/Target/NVPTX/NVPTXSubtarget.h

+      return false;
+    };
+
+    return HasRequiredSM() && PTXVersion >= 86;


I'm not sure we need this custom predicate at the moment.
We already use [hasSM<100>, hasArchAccelFeatures, hasPTX<86>] in a lot of places.
We could shorten that in tablegen, if necessary, if it becomes too much of a burden, but it does not seem to be the case for now.

If/when some instructions eventually turn out to be blackwell-only, we'll sort it out then, when we have specific info.

schwarzschild-radius · 2025-05-09T00:33:56Z

Summary of changes in the latest revision:

- Addressed documentation changes requested by @gonzalobg
- Removed generic variants of the intrinsics
- Removed the custom selection code for clusterlaunchcontrol_query_cancel* variants in favour of custom SelectionDAG nodes, as requested by @AlexMaclean
- Replaced the vector variant of clusterlaunchcontrol_query_cancel_get_first_ctaid with struct variants taking three arguments
- Replaced the new sm arch predicate with the existing one, as requested by @Artem-B
- Removed NoCapture properties from clusterlaunchcontrol.try_cancel variants, since they do capture the address and mbarrier arguments

github-actions · 2025-05-13T17:27:13Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Artem-B · 2025-05-14T17:23:16Z

@schwarzschild-radius would it be possible to go over still open comments and mark the ones you've addressed as resolved? Right now it's really hard to tell whether you're still working on all/some of them, or if they are done.

schwarzschild-radius · 2025-05-15T14:07:37Z

@Artem-B I have addressed all of the comments and resolved them as well. Please let me know if the review is good to go.

AlexMaclean

LGTM. Please wait for other reviewers before landing.

As a side note, all the NVPTXISelLowering code and custom nodes seem like they could be eliminated if only we fully supported i128 types as legal. This has been on my list for a while and perhaps this is a sign I need to come back to it.
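
The i128 handling referred to in this side note amounts to splitting the 128-bit response into two 64-bit words before instruction selection. A rough C++ illustration follows, assuming a little-endian layout in which element 0 of the `v2i64` bitcast is the low half. `ResponseHalves`, `splitResponse`, and `joinResponse` are hypothetical names used only for this sketch, and `unsigned __int128` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the two i64 operands that stand in for one i128.
struct ResponseHalves {
  std::uint64_t Lo; // bits 0..63
  std::uint64_t Hi; // bits 64..127
};

// Split a 128-bit value into its low and high 64-bit words.
ResponseHalves splitResponse(unsigned __int128 V) {
  return {static_cast<std::uint64_t>(V), static_cast<std::uint64_t>(V >> 64)};
}

// Reassemble the original 128-bit value from the two words.
unsigned __int128 joinResponse(ResponseHalves H) {
  return (static_cast<unsigned __int128>(H.Hi) << 64) | H.Lo;
}
```

If i128 were a fully legal type in the backend, this manual split and the custom nodes built around it would presumably no longer be needed.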

…ruction

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, please refer to the [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel).

durga4github

The latest changes LGTM

schwarzschild-radius · 2025-05-16T13:54:28Z

@Artem-B I have addressed all of the comments. Please let me know if I can go ahead with the merge.

schwarzschild-radius requested a review from Artem-B April 7, 2025 02:36

llvmbot added backend:NVPTX llvm:ir labels Apr 7, 2025

schwarzschild-radius requested a review from durga4github April 7, 2025 02:36

durga4github reviewed Apr 7, 2025

schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch from 87e3d74 to 3310c99 Compare April 7, 2025 11:34