Thanks to visit codestin.com
Credit goes to github.com

Skip to content

[LLVM][NVPTX] Add NVPTX codegen support for clusterlaunchcontrol instruction #134568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

schwarzschild-radius
Copy link
Contributor

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, Please refer PTX ISA

@llvmbot
Copy link
Member

llvmbot commented Apr 7, 2025

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Pradeep Kumar (schwarzschild-radius)

Changes

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, Please refer PTX ISA


Patch is 34.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134568.diff

11 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+98)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+46)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+40)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+121-1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.h (+3)
  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+63)
  • (modified) llvm/lib/Target/NVPTX/NVPTXSubtarget.h (+14)
  • (added) llvm/test/CodeGen/NVPTX/clusterlaunchcontrol-multicast.ll (+50)
  • (added) llvm/test/CodeGen/NVPTX/clusterlaunchcontrol.ll (+140)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 621879fc5648b..7bbd18c71fbeb 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1478,6 +1478,104 @@ similar but the latter uses generic addressing (see `Generic Addressing <https:/
 
 For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk>`__.
 
+
+clusterlaunchcontrol Intrinsics
+-------------------------------
+
+'``llvm.nvvm.clusterlaunchcontrol.try_cancel*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async(ptr %addr, ptr %mbar)
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast(ptr %addr, ptr %mbar)
+  declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
+
+Overview:
+"""""""""
+
+The ``clusterlaunchcontrol.try_cancel`` intrinsics requests atomically cancelling
+the launch of a cluster that has not started running yet. It asynchronously writes
+a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
+operation succeeded or failed. ``addr`` and ``mbar`` must be in ``shared::cta``
+otherwise the result is undefined. The completion of the asynchronous operation
+is tracked using the mbarrier completion mechanism at ``.cluster`` scope referenced
+by the shared memory pointer, ``mbar``. On success, the opaque response contains
+the CTA id of the first CTA of the canceled cluster; no other successful response
+from other ``clusterlaunchcontrol.try_cancel`` operations from the same grid will
+contain that id.
+
+The ``multicast`` variant specifies that the response is asynchronously written to
+the corresponding shared memory location of each CTA in the requesting cluster.
+The completion of the write of each local response is tracked by independent
+mbarriers at the corresponding shared memory location of each CTA in the
+cluster.
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel>`__.
+
+'``llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i1 @llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled(i128 %try_cancel_response)
+
+Overview:
+"""""""""
+
+The ``llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled`` intrinsic can be
+used to decode the opaque response written by the
+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+The intrinsic returns false if the request failed. If the request succeeded,
+it returns true. A true result indicates that:
+
+- the thread block cluster whose first CTA id matches that of the response
+  handle will not run
+- no other successful response of another ``try_cancel`` request will contain
+  the first CTA id of that cluster
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__.
+
+
+'``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare <4 x i32> @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid(i128 %try_cancel_response)
+  declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.x(i128 %try_cancel_response)
+  declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.y(i128 %try_cancel_response)
+  declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.z(i128 %try_cancel_response)
+
+Overview:
+"""""""""
+
+The ``clusterlaunchcontrol.query_cancel.get_first_ctaid`` intrinsic can be
+used to decode the opaque response written by the
+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+If the request succeeded,
+``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.{x,y,z}`` returns
+the coordinate of the first CTA in the canceled cluster, either x, y, or z.
+``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid`` returns a vector
+where the first three components are the x, y, z coordinates of the first CTA.
+The contents of the fourth element are unspecified
+
+If the request failed, the behavior of these intrinsics is undefined.
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__.
+
 Other Intrinsics
 ----------------
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 3e9588a515c9e..12d98229eba02 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5381,4 +5381,50 @@ def int_nvvm_st_bulk_shared_cta : DefaultAttrsIntrinsic<[],
   [IntrArgMemOnly, IntrWriteMem,
     WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;
 
+//
+// Cluster launch control
+//
+
+// clusterlaunchcontrol.try_cancel
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async
+    : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_shared
+    : Intrinsic<[], [llvm_shared_ptr_ty, llvm_shared_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast
+    : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast_shared
+    : Intrinsic<[], [llvm_shared_ptr_ty, llvm_shared_ptr_ty],
+                [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+                "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared">;
+
+// clusterlaunchcontrol.query_cancel.is_canceled
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_is_canceled
+    : Intrinsic<[llvm_i1_ty], [llvm_i128_ty], [],
+                "llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled">;
+
+// clusterlaunchcontrol.query_cancel.get_first_ctaid
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid
+    : Intrinsic<[llvm_v4i32_ty], [llvm_i128_ty], [],
+                "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid">;
+
+foreach dim = ["x", "y", "z"] in {
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_ # dim
+    : Intrinsic<[llvm_i32_ty], [llvm_i128_ty], [],
+                "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid." # dim>;
+}
+
+
 } // let TargetPrefix = "nvvm"
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index ec1f969494cd1..36dff04f01413 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -312,6 +312,37 @@ void NVPTXDAGToDAGISel::SelectTcgen05Ld(SDNode *N, bool hasOffset) {
   }
 }
 
+void NVPTXDAGToDAGISel::SelectClusterLaunchControl(SDNode *N) {
+  SDLoc DL(N);
+  unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
+  unsigned Opcode;
+  switch (IID) {
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_IS_CANCELED;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_x;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_y;
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z:
+    Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_z;
+    break;
+  }
+
+  SDValue Operands[] = {
+      N->getOperand(2), // TryCancelResponse_0
+      N->getOperand(3), // TryCancelResponse_1
+      N->getOperand(0), // Chain
+  };
+
+  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Operands));
+}
+
 bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
   unsigned IID = N->getConstantOperandVal(1);
   switch (IID) {
@@ -366,6 +397,15 @@ bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
     SelectTcgen05Ld(N, /* hasOffset */ true);
     return true;
   }
+
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+    SelectClusterLaunchControl(N);
+    return true;
+  }
   }
 }
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 23cbd458571a0..f634276c498b7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -102,6 +102,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
                                            bool IsIm2Col = false);
   void SelectTcgen05Ld(SDNode *N, bool hasOffset = false);
   void SelectTcgen05St(SDNode *N, bool hasOffset = false);
+  void SelectClusterLaunchControl(SDNode *N);
 
   inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
     return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index b566cdd4b6bfc..cae193e5d8595 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -1017,6 +1017,9 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
                      {MVT::v2i32, MVT::v4i32, MVT::v8i32, MVT::v16i32,
                       MVT::v32i32, MVT::v64i32, MVT::v128i32},
                      Custom);
+
+  // Enable custom lowering for i128 bit type supported in PTX
+  setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i128, Custom);
 }
 
 const char *NVPTXTargetLowering::getTargetNodeName(unsigned Opcode) const {
@@ -1165,6 +1168,54 @@ NVPTXTargetLowering::LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getNode(NVPTXISD::Wrapper, dl, PtrVT, Op);
 }
 
+SDValue NVPTXTargetLowering::LowerIntrinsicWChain(SDValue Op,
+                                                  SelectionDAG &DAG) const {
+  SDNode *N = Op.getNode();
+  SDValue Intrin = N->getOperand(1);
+  SDLoc DL(N);
+
+  // Get the intrinsic ID
+  unsigned IntrinNo = cast<ConstantSDNode>(Intrin.getNode())->getZExtValue();
+  switch (IntrinNo) {
+  default:
+    break;
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+
+    if (N->getOperand(2).getValueType() != MVT::i128) {
+      // return, if the operand is already lowered
+      return SDValue();
+    }
+
+    SDLoc DL(N);
+    SmallVector<SDValue, 8> Ops;
+
+    Ops.push_back(N->getOperand(0)); // Chain
+    Ops.push_back(N->getOperand(1)); // Intrinsic
+
+    SDValue TryCancelResponse = N->getOperand(2);
+    SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, TryCancelResponse);
+    SDValue TryCancelResponse0 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(0, DL));
+    SDValue TryCancelResponse1 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(1, DL));
+
+    Ops.push_back(TryCancelResponse0);
+    Ops.push_back(TryCancelResponse1);
+
+    MemIntrinsicSDNode *MemSD = cast<MemIntrinsicSDNode>(N);
+    return DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, N->getVTList(),
+                                   Ops, MemSD->getMemoryVT(),
+                                   MemSD->getMemOperand());
+  }
+  }
+  return Op;
+}
+
 static bool IsTypePassedAsArray(const Type *Ty) {
   return Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128) ||
          Ty->isHalfTy() || Ty->isBFloatTy();
@@ -2862,7 +2913,7 @@ NVPTXTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::GlobalAddress:
     return LowerGlobalAddress(Op, DAG);
   case ISD::INTRINSIC_W_CHAIN:
-    return Op;
+    return LowerIntrinsicWChain(Op, DAG);
   case ISD::INTRINSIC_VOID:
     return LowerIntrinsicVoid(Op, DAG);
   case ISD::BUILD_VECTOR:
@@ -4724,6 +4775,21 @@ bool NVPTXTargetLowering::getTgtMemIntrinsic(
     Info.align.reset();
     return true;
   }
+
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+    auto &DL = I.getModule()->getDataLayout();
+    Info.opc = ISD::INTRINSIC_W_CHAIN;
+    Info.memVT = getValueType(DL, I.getType());
+    Info.ptrVal = nullptr;
+    Info.offset = 0;
+    Info.flags = MachineMemOperand::MOLoad;
+    Info.align.reset();
+    return true;
+  }
   }
   return false;
 }
@@ -6036,6 +6102,60 @@ static void ReplaceINTRINSIC_W_CHAIN(SDNode *N, SelectionDAG &DAG,
   case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x64:
   case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x128:
     return ReplaceTcgen05Ld(N, DAG, Results, /* Offset */ true);
+
+  case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid: {
+    // The intrinsic returns the CTAID of x, y and z dimension as a v4i32 value
+    EVT ResVT = N->getValueType(0);
+    if (!ResVT.isVector())
+      return; // already legalized.
+
+    const unsigned NumElts = ResVT.getVectorNumElements(); // v4i32
+
+    // Create the return type of the instructions
+    SmallVector<EVT, 5> ListVTs;
+    for (unsigned i = 0; i < NumElts; ++i)
+      ListVTs.push_back(MVT::i32);
+    ListVTs.push_back(MVT::Other);
+
+    SDVTList ResVTs = DAG.getVTList(ListVTs);
+
+    SmallVector<SDValue, 8> Ops;
+    // Add Chain and Intrinsic ID
+    Ops.push_back(N->getOperand(0)); // Chain
+    Ops.push_back(N->getOperand(1)); // Intrinsic ID
+
+    SDValue TryCancelResponse = N->getOperand(2); // i128 operand
+    // Cast i128 to v2i64 and split into two i64
+    SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, TryCancelResponse);
+    SDValue TryCancelResponse_0 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(0, DL));
+    SDValue TryCancelResponse_1 =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+                    DAG.getIntPtrConstant(1, DL));
+
+    Ops.push_back(TryCancelResponse_0);
+    Ops.push_back(TryCancelResponse_1);
+
+    MemIntrinsicSDNode *MemSD = cast<MemIntrinsicSDNode>(N);
+    // Create a new Intrinsic Node with 2 x i64 operands
+    SDValue NewNode =
+        DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, ResVTs, Ops,
+                                MemSD->getMemoryVT(), MemSD->getMemOperand());
+
+    // Scalarize the vector results
+    SmallVector<SDValue, 4> ScalarRes;
+    for (unsigned i = 0; i < NumElts; ++i) {
+      SDValue Res = NewNode.getValue(i);
+      ScalarRes.push_back(Res);
+    }
+
+    SDValue Chain = NewNode.getValue(NumElts); // v4i32 value
+    SDValue BuildVector = DAG.getNode(ISD::BUILD_VECTOR, DL, ResVT, ScalarRes);
+    Results.push_back(BuildVector); // Build Vector
+    Results.push_back(Chain);       // Chain
+    return;
+  }
   }
 }
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 39470be254efa..723e6defa8327 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -329,6 +329,9 @@ class NVPTXTargetLowering : public TargetLowering {
   SDValue LowerVASTART(SDValue Op, SelectionDAG &DAG) const;
 
   SDValue LowerCopyToReg_128(SDValue Op, SelectionDAG &DAG) const;
+
+  SDValue LowerIntrinsicWChain(SDValue Op, SelectionDAG &DAG) const;
+
   unsigned getNumRegisters(LLVMContext &Context, EVT VT,
                            std::optional<MVT> RegisterVT) const override;
   bool
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 7d0c47fa464c5..29d19c40a6f59 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -160,6 +160,7 @@ def hasHWROT32 : Predicate<"Subtarget->hasHWROT32()">;
 def noHWROT32 : Predicate<"!Subtarget->hasHWROT32()">;
 def hasDotInstructions : Predicate<"Subtarget->hasDotInstructions()">;
 def hasTcgen05Instructions : Predicate<"Subtarget->hasTcgen05Instructions()">;
+def hasBlackwellArch : Predicate<"Subtarget->hasBlackwellArch()">;
 
 def True : Predicate<"true">;
 def False : Predicate<"false">;
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 34cb63e44ca71..3f46c8c1a64ba 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -7301,3 +7301,66 @@ def INT_NVVM_ST_BULK_SHARED_CTA:
             "st.bulk.shared::cta [$dest_addr], $size, 0;",
             [(int_nvvm_st_bulk_shared_cta addr:$dest_addr, i64:$size, (i64 0))]>,
             Requires<[hasSM<100>, hasPTX<86>]>;
+
+// clusterlaunchcontrol Instructions
+
+multiclass CLUSTERLAUNCHCONTROL_TRY_CANCEL<Intrinsic Intrin, bit Multicast, list<Predicate> Pred> {
+  defvar Intrinsic = !cast<Intrinsic>(Intrin # !if(!eq(Multicast, 1), "_multicast", ""));
+  defvar IntrinsicShared = !cast<Intrinsic>(Intrin # !if(!eq(Multicast, 1), "_multicast", "") # "_shared");
+
+  def _gen: NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$mbar),
+                      "clusterlaunchcontrol.try_cancel.async.mbarrier::complete_tx::bytes" #
+                      !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+                      ".b128 [$addr], [$mbar];",
+                      [(Intrinsic Int64Regs:$addr, Int64Regs:$mbar)]>, Requires<Pred>;
+
+  def _shared: NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$mbar),
+                         "clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes" #
+                         !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+                         ".b128 [$addr], [$mbar];",
+                         [(IntrinsicShared Int64Regs:$addr, Int64Regs:$mbar)]>, Requires<Pred>;
+
+  def _shared32: NVPTXInst<(outs), (ins Int32Regs:$addr, Int32Regs:$mbar),
+                           "clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes" #
+                           !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+                           ".b128 [$addr], [$mbar];",
+                           [(IntrinsicShared Int32Regs:$addr, Int32Regs:$mbar)]>, Requires<Pred>;
+}
+
+defm CLUSTERLAUNCHCONTRL_TRY_CANCEL :
+      CLUSTERLAUNCHCONTROL_TRY_CANCEL<
+        int_nvvm_clusterlaunchcontrol_try_cancel_async, 0, [hasSM<100>, hasPTX<86>]>;
+
+defm CLUSTERLAUNCHCONTRL_TRY_CANCEL_MULTICAST :
+      CLUSTERLAUNCHCONTROL_TRY_CANCEL<
+        int_nvvm_clusterlaunchcontrol_try_cance...
[truncated]

@schwarzschild-radius schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch from 87e3d74 to 3310c99 Compare April 7, 2025 11:34

The ``clusterlaunchcontrol.try_cancel`` intrinsics requests atomically cancelling
the launch of a cluster that has not started running yet. It asynchronously writes
a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
a 16-byte opaque response to shared memory, pointed to by 16-byte-aligned ``addr`` indicating whether the

``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.

The intrinsic returns false if the request failed. If the request succeeded,
it returns true. A true result indicates that:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it returns true. A true result indicates that:
it returns ``1`` (true). A true result indicates that:

"""""""""

The ``clusterlaunchcontrol.query_cancel.get_first_ctaid`` intrinsic can be
used to decode the opaque response written by the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
used to decode the opaque response written by the
used to decode the successful opaque response written by the

return false;
};

return HasRequiredSM() && PTXVersion >= 86;
Copy link
Contributor

@gonzalobg gonzalobg Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-.multicast instructions being added are not arch conditional, so sm >= 100 && ptx >= 86 suffice.

For the .multicast flavors, they only apply to this one instruction, so calling it "blackwellArchAccelerated" instead of hasMulticastClusterLaunchControlTryCancel is not very accurate. It is possible for future non-blackwell GPUs to provide .multicast flavors of this instruction as well, and for those GPUs to not provide .multicast flavors of other Blackwell instructions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this custom predicate at the moment.
We already use [hasSM<100>, hasArchAccelFeatures, hasPTX<86>] in a lot of places.
We could shorten that in tablegen, if necessary, if it becomes too much of a burden, but it does not seem to be the case for now.

If/when some instructions eventually turn out to be blackwell-only, we'll sort it out then, when we have specific info.


SDVTList ResVTs = DAG.getVTList(ListVTs);

SmallVector<SDValue, 8> Ops;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A SmallVector seems overkill given that there will always be exactly 4 operands. Can we instead use an initializer list directly in the getMemIntrinsicNode call?

return false;
};

return HasRequiredSM() && PTXVersion >= 86;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this custom predicate at the moment.
We already use [hasSM<100>, hasArchAccelFeatures, hasPTX<86>] in a lot of places.
We could shorten that in tablegen, if necessary, if it becomes too much of a burden, but it does not seem to be the case for now.

If/when some instructions eventually turn out to be blackwell-only, we'll sort it out then, when we have specific info.

@schwarzschild-radius schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch 2 times, most recently from b84c1be to adc0836 Compare May 9, 2025 00:32
@schwarzschild-radius
Copy link
Contributor Author

Summary of changes in the latest revision:

  • Addressed documentation changes by @gonzalobg
  • Removed generic variants of the intrinsics
  • Removed the custom selection code for clusterlaunchcontrol_query_cancel* variants in the favour of custom SelectionDAG Nodes as requested by @AlexMaclean
  • Replaced the vector variant of clusterlaunchcontrol_query_cancel_get_first_ctaid with a struct variants taking three arguments
  • Replaced the new sm arch predicate with the existing one as requested by @Artem-B
  • Removed NoCapture properties from clusterlaunchcontrol.try_cancel variants they do capture the address and mbarrier arguments

@schwarzschild-radius schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch from adc0836 to 209f4c1 Compare May 13, 2025 17:23
Copy link

github-actions bot commented May 13, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@schwarzschild-radius schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch 6 times, most recently from cfe1fb9 to c02763b Compare May 14, 2025 16:30
@Artem-B
Copy link
Member

Artem-B commented May 14, 2025

@schwarzschild-radius would it be possible to go over still open comments and mark the ones you've addressed as resolved? Right now it's really hard to tell whether you're still working on all/some of them, or if they are done.

@schwarzschild-radius schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch from c02763b to c61abce Compare May 15, 2025 13:42
@schwarzschild-radius
Copy link
Contributor Author

@Artem-B I have addressed all of the comments and resolved them as well. Please let me know if the review is good to go

Copy link
Member

@AlexMaclean AlexMaclean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please wait for other reviewers before landing.

As a side note, all the NVPTXISelLowering code and custom nodes seem like they could be eliminated if only we fully supported i128 types as legal. This has been on my list for a while and perhaps this is a sign I need to come back to it.

…ruction

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, Please refer [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel)
@schwarzschild-radius schwarzschild-radius force-pushed the clusterlaunchcontrol_nvptx_support branch from c61abce to e063a41 Compare May 16, 2025 03:28
Copy link
Contributor

@durga4github durga4github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest changes LGTM

@schwarzschild-radius
Copy link
Contributor Author

@Artem-B I have addressed all of the comments, Please let me know if I am can go ahead with merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants