[LLVM][NVPTX] Add NVPTX codegen support for clusterlaunchcontrol instruction #134568
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-nvptx

Author: Pradeep Kumar (schwarzschild-radius)

Changes

This commit adds NVPTX codegen support for clusterlaunchcontrol instructions, with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, please refer to the PTX ISA.

Patch is 34.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134568.diff

11 Files Affected:
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 621879fc5648b..7bbd18c71fbeb 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1478,6 +1478,104 @@ similar but the latter uses generic addressing (see `Generic Addressing <https:/
For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk>`__.
+
+clusterlaunchcontrol Intrinsics
+-------------------------------
+
+'``llvm.nvvm.clusterlaunchcontrol.try_cancel*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async(ptr %addr, ptr %mbar)
+ declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
+ declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast(ptr %addr, ptr %mbar)
+ declare void @llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared(ptr addrspace(3) %addr, ptr addrspace(3) %mbar)
+
+Overview:
+"""""""""
+
+The ``clusterlaunchcontrol.try_cancel`` intrinsics requests atomically cancelling
+the launch of a cluster that has not started running yet. It asynchronously writes
+a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
+operation succeeded or failed. ``addr`` and ``mbar`` must be in ``shared::cta``
+otherwise the result is undefined. The completion of the asynchronous operation
+is tracked using the mbarrier completion mechanism at ``.cluster`` scope referenced
+by the shared memory pointer, ``mbar``. On success, the opaque response contains
+the CTA id of the first CTA of the canceled cluster; no other successful response
+from other ``clusterlaunchcontrol.try_cancel`` operations from the same grid will
+contain that id.
+
+The ``multicast`` variant specifies that the response is asynchronously written to
+the corresponding shared memory location of each CTA in the requesting cluster.
+The completion of the write of each local response is tracked by independent
+mbarriers at the corresponding shared memory location of each CTA in the
+cluster.
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel>`__.
+
+'``llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i1 @llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled(i128 %try_cancel_response)
+
+Overview:
+"""""""""
+
+The ``llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled`` intrinsic can be
+used to decode the opaque response written by the
+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+The intrinsic returns false if the request failed. If the request succeeded,
+it returns true. A true result indicates that:
+
+- the thread block cluster whose first CTA id matches that of the response
+ handle will not run
+- no other successful response of another ``try_cancel`` request will contain
+ the first CTA id of that cluster
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__.
+
+
+'``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare <4 x i32> @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid(i128 %try_cancel_response)
+ declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.x(i128 %try_cancel_response)
+ declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.y(i128 %try_cancel_response)
+ declare i32 @llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.z(i128 %try_cancel_response)
+
+Overview:
+"""""""""
+
+The ``clusterlaunchcontrol.query_cancel.get_first_ctaid`` intrinsic can be
+used to decode the opaque response written by the
+``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.
+
+If the request succeeded,
+``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.{x,y,z}`` returns
+the coordinate of the first CTA in the canceled cluster, either x, y, or z.
+``llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid`` returns a vector
+where the first three components are the x, y, z coordinates of the first CTA.
+The contents of the fourth element are unspecified
+
+If the request failed, the behavior of these intrinsics is undefined.
+
+For more information, refer `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__.
+
Other Intrinsics
----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 3e9588a515c9e..12d98229eba02 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5381,4 +5381,50 @@ def int_nvvm_st_bulk_shared_cta : DefaultAttrsIntrinsic<[],
[IntrArgMemOnly, IntrWriteMem,
WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;
+//
+// Cluster launch control
+//
+
+// clusterlaunchcontrol.try_cancel
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async
+ : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
+ [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+ "llvm.nvvm.clusterlaunchcontrol.try_cancel.async">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_shared
+ : Intrinsic<[], [llvm_shared_ptr_ty, llvm_shared_ptr_ty],
+ [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+ "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.shared">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast
+ : Intrinsic<[], [llvm_ptr_ty, llvm_ptr_ty],
+ [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+ "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast">;
+
+def int_nvvm_clusterlaunchcontrol_try_cancel_async_multicast_shared
+ : Intrinsic<[], [llvm_shared_ptr_ty, llvm_shared_ptr_ty],
+ [IntrArgMemOnly, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>],
+ "llvm.nvvm.clusterlaunchcontrol.try_cancel.async.multicast.shared">;
+
+// clusterlaunchcontrol.query_cancel.is_canceled
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_is_canceled
+ : Intrinsic<[llvm_i1_ty], [llvm_i128_ty], [],
+ "llvm.nvvm.clusterlaunchcontrol.query_cancel.is_canceled">;
+
+// clusterlaunchcontrol.query_cancel.get_first_ctaid
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid
+ : Intrinsic<[llvm_v4i32_ty], [llvm_i128_ty], [],
+ "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid">;
+
+foreach dim = ["x", "y", "z"] in {
+
+def int_nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_ # dim
+ : Intrinsic<[llvm_i32_ty], [llvm_i128_ty], [],
+ "llvm.nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid." # dim>;
+}
+
+
} // let TargetPrefix = "nvvm"
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index ec1f969494cd1..36dff04f01413 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -312,6 +312,37 @@ void NVPTXDAGToDAGISel::SelectTcgen05Ld(SDNode *N, bool hasOffset) {
}
}
+void NVPTXDAGToDAGISel::SelectClusterLaunchControl(SDNode *N) {
+ SDLoc DL(N);
+ unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
+ unsigned Opcode;
+ switch (IID) {
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+ Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_IS_CANCELED;
+ break;
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+ Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID;
+ break;
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+ Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_x;
+ break;
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+ Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_y;
+ break;
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z:
+ Opcode = NVPTX::CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_z;
+ break;
+ }
+
+ SDValue Operands[] = {
+ N->getOperand(2), // TryCancelResponse_0
+ N->getOperand(3), // TryCancelResponse_1
+ N->getOperand(0), // Chain
+ };
+
+ ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Operands));
+}
+
bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
unsigned IID = N->getConstantOperandVal(1);
switch (IID) {
@@ -366,6 +397,15 @@ bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
SelectTcgen05Ld(N, /* hasOffset */ true);
return true;
}
+
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+ SelectClusterLaunchControl(N);
+ return true;
+ }
}
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 23cbd458571a0..f634276c498b7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -102,6 +102,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
bool IsIm2Col = false);
void SelectTcgen05Ld(SDNode *N, bool hasOffset = false);
void SelectTcgen05St(SDNode *N, bool hasOffset = false);
+ void SelectClusterLaunchControl(SDNode *N);
inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index b566cdd4b6bfc..cae193e5d8595 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -1017,6 +1017,9 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
{MVT::v2i32, MVT::v4i32, MVT::v8i32, MVT::v16i32,
MVT::v32i32, MVT::v64i32, MVT::v128i32},
Custom);
+
+ // Enable custom lowering for i128 bit type supported in PTX
+ setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i128, Custom);
}
const char *NVPTXTargetLowering::getTargetNodeName(unsigned Opcode) const {
@@ -1165,6 +1168,54 @@ NVPTXTargetLowering::LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) const {
return DAG.getNode(NVPTXISD::Wrapper, dl, PtrVT, Op);
}
+SDValue NVPTXTargetLowering::LowerIntrinsicWChain(SDValue Op,
+ SelectionDAG &DAG) const {
+ SDNode *N = Op.getNode();
+ SDValue Intrin = N->getOperand(1);
+ SDLoc DL(N);
+
+ // Get the intrinsic ID
+ unsigned IntrinNo = cast<ConstantSDNode>(Intrin.getNode())->getZExtValue();
+ switch (IntrinNo) {
+ default:
+ break;
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+
+ if (N->getOperand(2).getValueType() != MVT::i128) {
+ // return, if the operand is already lowered
+ return SDValue();
+ }
+
+ SDLoc DL(N);
+ SmallVector<SDValue, 8> Ops;
+
+ Ops.push_back(N->getOperand(0)); // Chain
+ Ops.push_back(N->getOperand(1)); // Intrinsic
+
+ SDValue TryCancelResponse = N->getOperand(2);
+ SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, TryCancelResponse);
+ SDValue TryCancelResponse0 =
+ DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+ DAG.getIntPtrConstant(0, DL));
+ SDValue TryCancelResponse1 =
+ DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+ DAG.getIntPtrConstant(1, DL));
+
+ Ops.push_back(TryCancelResponse0);
+ Ops.push_back(TryCancelResponse1);
+
+ MemIntrinsicSDNode *MemSD = cast<MemIntrinsicSDNode>(N);
+ return DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, N->getVTList(),
+ Ops, MemSD->getMemoryVT(),
+ MemSD->getMemOperand());
+ }
+ }
+ return Op;
+}
+
static bool IsTypePassedAsArray(const Type *Ty) {
return Ty->isAggregateType() || Ty->isVectorTy() || Ty->isIntegerTy(128) ||
Ty->isHalfTy() || Ty->isBFloatTy();
@@ -2862,7 +2913,7 @@ NVPTXTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
case ISD::GlobalAddress:
return LowerGlobalAddress(Op, DAG);
case ISD::INTRINSIC_W_CHAIN:
- return Op;
+ return LowerIntrinsicWChain(Op, DAG);
case ISD::INTRINSIC_VOID:
return LowerIntrinsicVoid(Op, DAG);
case ISD::BUILD_VECTOR:
@@ -4724,6 +4775,21 @@ bool NVPTXTargetLowering::getTgtMemIntrinsic(
Info.align.reset();
return true;
}
+
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_is_canceled:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_x:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_y:
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid_z: {
+ auto &DL = I.getModule()->getDataLayout();
+ Info.opc = ISD::INTRINSIC_W_CHAIN;
+ Info.memVT = getValueType(DL, I.getType());
+ Info.ptrVal = nullptr;
+ Info.offset = 0;
+ Info.flags = MachineMemOperand::MOLoad;
+ Info.align.reset();
+ return true;
+ }
}
return false;
}
@@ -6036,6 +6102,60 @@ static void ReplaceINTRINSIC_W_CHAIN(SDNode *N, SelectionDAG &DAG,
case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x64:
case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x128:
return ReplaceTcgen05Ld(N, DAG, Results, /* Offset */ true);
+
+ case Intrinsic::nvvm_clusterlaunchcontrol_query_cancel_get_first_ctaid: {
+ // The intrinsic returns the CTAID of x, y and z dimension as a v4i32 value
+ EVT ResVT = N->getValueType(0);
+ if (!ResVT.isVector())
+ return; // already legalized.
+
+ const unsigned NumElts = ResVT.getVectorNumElements(); // v4i32
+
+ // Create the return type of the instructions
+ SmallVector<EVT, 5> ListVTs;
+ for (unsigned i = 0; i < NumElts; ++i)
+ ListVTs.push_back(MVT::i32);
+ ListVTs.push_back(MVT::Other);
+
+ SDVTList ResVTs = DAG.getVTList(ListVTs);
+
+ SmallVector<SDValue, 8> Ops;
+ // Add Chain and Intrinsic ID
+ Ops.push_back(N->getOperand(0)); // Chain
+ Ops.push_back(N->getOperand(1)); // Intrinsic ID
+
+ SDValue TryCancelResponse = N->getOperand(2); // i128 operand
+ // Cast i128 to v2i64 and split into two i64
+ SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, TryCancelResponse);
+ SDValue TryCancelResponse_0 =
+ DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+ DAG.getIntPtrConstant(0, DL));
+ SDValue TryCancelResponse_1 =
+ DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Cast,
+ DAG.getIntPtrConstant(1, DL));
+
+ Ops.push_back(TryCancelResponse_0);
+ Ops.push_back(TryCancelResponse_1);
+
+ MemIntrinsicSDNode *MemSD = cast<MemIntrinsicSDNode>(N);
+ // Create a new Intrinsic Node with 2 x i64 operands
+ SDValue NewNode =
+ DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, ResVTs, Ops,
+ MemSD->getMemoryVT(), MemSD->getMemOperand());
+
+ // Scalarize the vector results
+ SmallVector<SDValue, 4> ScalarRes;
+ for (unsigned i = 0; i < NumElts; ++i) {
+ SDValue Res = NewNode.getValue(i);
+ ScalarRes.push_back(Res);
+ }
+
+ SDValue Chain = NewNode.getValue(NumElts); // v4i32 value
+ SDValue BuildVector = DAG.getNode(ISD::BUILD_VECTOR, DL, ResVT, ScalarRes);
+ Results.push_back(BuildVector); // Build Vector
+ Results.push_back(Chain); // Chain
+ return;
+ }
}
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 39470be254efa..723e6defa8327 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -329,6 +329,9 @@ class NVPTXTargetLowering : public TargetLowering {
SDValue LowerVASTART(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerCopyToReg_128(SDValue Op, SelectionDAG &DAG) const;
+
+ SDValue LowerIntrinsicWChain(SDValue Op, SelectionDAG &DAG) const;
+
unsigned getNumRegisters(LLVMContext &Context, EVT VT,
std::optional<MVT> RegisterVT) const override;
bool
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 7d0c47fa464c5..29d19c40a6f59 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -160,6 +160,7 @@ def hasHWROT32 : Predicate<"Subtarget->hasHWROT32()">;
def noHWROT32 : Predicate<"!Subtarget->hasHWROT32()">;
def hasDotInstructions : Predicate<"Subtarget->hasDotInstructions()">;
def hasTcgen05Instructions : Predicate<"Subtarget->hasTcgen05Instructions()">;
+def hasBlackwellArch : Predicate<"Subtarget->hasBlackwellArch()">;
def True : Predicate<"true">;
def False : Predicate<"false">;
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 34cb63e44ca71..3f46c8c1a64ba 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -7301,3 +7301,66 @@ def INT_NVVM_ST_BULK_SHARED_CTA:
"st.bulk.shared::cta [$dest_addr], $size, 0;",
[(int_nvvm_st_bulk_shared_cta addr:$dest_addr, i64:$size, (i64 0))]>,
Requires<[hasSM<100>, hasPTX<86>]>;
+
+// clusterlaunchcontrol Instructions
+
+multiclass CLUSTERLAUNCHCONTROL_TRY_CANCEL<Intrinsic Intrin, bit Multicast, list<Predicate> Pred> {
+ defvar Intrinsic = !cast<Intrinsic>(Intrin # !if(!eq(Multicast, 1), "_multicast", ""));
+ defvar IntrinsicShared = !cast<Intrinsic>(Intrin # !if(!eq(Multicast, 1), "_multicast", "") # "_shared");
+
+ def _gen: NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$mbar),
+ "clusterlaunchcontrol.try_cancel.async.mbarrier::complete_tx::bytes" #
+ !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+ ".b128 [$addr], [$mbar];",
+ [(Intrinsic Int64Regs:$addr, Int64Regs:$mbar)]>, Requires<Pred>;
+
+ def _shared: NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$mbar),
+ "clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes" #
+ !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+ ".b128 [$addr], [$mbar];",
+ [(IntrinsicShared Int64Regs:$addr, Int64Regs:$mbar)]>, Requires<Pred>;
+
+ def _shared32: NVPTXInst<(outs), (ins Int32Regs:$addr, Int32Regs:$mbar),
+ "clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes" #
+ !if(!eq(Multicast, 1), ".multicast::cluster::all", "") #
+ ".b128 [$addr], [$mbar];",
+ [(IntrinsicShared Int32Regs:$addr, Int32Regs:$mbar)]>, Requires<Pred>;
+}
+
+defm CLUSTERLAUNCHCONTRL_TRY_CANCEL :
+ CLUSTERLAUNCHCONTROL_TRY_CANCEL<
+ int_nvvm_clusterlaunchcontrol_try_cancel_async, 0, [hasSM<100>, hasPTX<86>]>;
+
+defm CLUSTERLAUNCHCONTRL_TRY_CANCEL_MULTICAST :
+ CLUSTERLAUNCHCONTROL_TRY_CANCEL<
+ int_nvvm_clusterlaunchcontrol_try_cance...
[truncated]
87e3d74 to 3310c99
llvm/docs/NVPTXUsage.rst (outdated)

The ``clusterlaunchcontrol.try_cancel`` intrinsics requests atomically cancelling
the launch of a cluster that has not started running yet. It asynchronously writes
a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the

Suggested change:
-a 16-byte opaque response to shared memory, pointed to by ``addr`` indicating whether the
+a 16-byte opaque response to shared memory, pointed to by 16-byte-aligned ``addr`` indicating whether the
llvm/docs/NVPTXUsage.rst (outdated)

``llvm.nvvm.clusterlaunchcontrol.try_cancel`` operation.

The intrinsic returns false if the request failed. If the request succeeded,
it returns true. A true result indicates that:

Suggested change:
-it returns true. A true result indicates that:
+it returns ``1`` (true). A true result indicates that:
llvm/docs/NVPTXUsage.rst (outdated)

"""""""""

The ``clusterlaunchcontrol.query_cancel.get_first_ctaid`` intrinsic can be
used to decode the opaque response written by the

Suggested change:
-used to decode the opaque response written by the
+used to decode the successful opaque response written by the
return false;
};

return HasRequiredSM() && PTXVersion >= 86;

The non-`.multicast` instructions being added are not arch conditional, so `sm >= 100 && ptx >= 86` suffice.
For the `.multicast` flavors, they only apply to this one instruction, so calling it "blackwellArchAccelerated" instead of `hasMulticastClusterLaunchControlTryCancel` is not very accurate. It is possible for future non-blackwell GPUs to provide `.multicast` flavors of this instruction as well, and for those GPUs to not provide `.multicast` flavors of other Blackwell instructions.
I'm not sure we need this custom predicate at the moment.
We already use `[hasSM<100>, hasArchAccelFeatures, hasPTX<86>]` in a lot of places.
We could shorten that in TableGen, if necessary, if it becomes too much of a burden, but it does not seem to be the case for now.
If/when some instructions eventually turn out to be blackwell-only, we'll sort it out then, when we have specific info.
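
To make the gating discussed in this thread concrete, here is a minimal standalone C++ sketch. It is not the NVPTX backend code: `SubtargetSketch` and both function names are hypothetical, and it assumes that `hasSM<100>` and `hasPTX<86>` mean "at least" that version and that the `.multicast` flavors end up gated on the existing arch-accelerated feature check rather than on a new Blackwell-specific predicate.

```cpp
// Hypothetical stand-in for the subtarget queries; the field names are
// illustrative only and do not mirror the real NVPTXSubtarget interface.
struct SubtargetSketch {
  unsigned SmVersion;        // e.g. 100 for sm_100
  unsigned PTXVersion;       // e.g. 86 for PTX ISA 8.6
  bool HasArchAccelFeatures; // assumed true only for arch-accelerated targets
};

// Non-multicast try_cancel: per the review, sm >= 100 and ptx >= 86 suffice.
constexpr bool canUseTryCancel(const SubtargetSketch &ST) {
  return ST.SmVersion >= 100 && ST.PTXVersion >= 86;
}

// Multicast try_cancel (assumption): the same version checks plus the
// arch-accelerated feature bit, mirroring
// [hasSM<100>, hasArchAccelFeatures, hasPTX<86>].
constexpr bool canUseTryCancelMulticast(const SubtargetSketch &ST) {
  return canUseTryCancel(ST) && ST.HasArchAccelFeatures;
}

// Compile-time spot checks of the two predicates.
static_assert(canUseTryCancel(SubtargetSketch{100, 86, false}),
              "plain try_cancel only needs the version checks");
static_assert(!canUseTryCancelMulticast(SubtargetSketch{100, 86, false}),
              "multicast additionally needs the arch-accelerated bit");
```

Composing the multicast check from the plain one keeps the version requirements in a single place, which is the same motivation given above for reusing the existing predicate list instead of introducing a custom one.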
SDVTList ResVTs = DAG.getVTList(ListVTs);

SmallVector<SDValue, 8> Ops;

A `SmallVector` seems overkill given that there will always be exactly 4 operands. Can we instead use an initializer list directly in the `getMemIntrinsicNode` call?
b84c1be to adc0836
Summary of changes in the latest revision:
adc0836 to 209f4c1
✅ With the latest revision this PR passed the C/C++ code formatter.
cfe1fb9 to c02763b
@schwarzschild-radius would it be possible to go over still open comments and mark the ones you've addressed as resolved? Right now it's really hard to tell whether you're still working on all/some of them, or if they are done.
c02763b to c61abce
@Artem-B I have addressed all of the comments and resolved them as well. Please let me know if the review is good to go.
LGTM. Please wait for other reviewers before landing.
As a side note, all the NVPTXISelLowering code and custom nodes seem like they could be eliminated if only we fully supported `i128` types as legal. This has been on my list for a while and perhaps this is a sign I need to come back to it.
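
For readers following the side note: the custom lowering in this patch exists because the `i128` response operand is bitcast to `v2i64` and then split into two `i64` values with `EXTRACT_VECTOR_ELT`. The standalone C++ sketch below only illustrates that split on a 16-byte buffer; `ResponseHalves` and `splitResponse` are hypothetical names, and the sketch says nothing about what the opaque response bytes mean.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the two 64-bit elements of a 16-byte response.
struct ResponseHalves {
  std::uint64_t Elt0; // analogous to element 0 of the v2i64 view
  std::uint64_t Elt1; // analogous to element 1 of the v2i64 view
};

// Reinterpret 16 raw bytes as two 64-bit elements. Element 0 is taken from
// the first 8 bytes in memory and element 1 from the last 8 bytes, which is
// the memory-level effect of a bitcast followed by extracting elements 0 and 1.
inline ResponseHalves splitResponse(const std::array<std::uint8_t, 16> &Bytes) {
  ResponseHalves H;
  std::memcpy(&H.Elt0, Bytes.data(), sizeof(H.Elt0));
  std::memcpy(&H.Elt1, Bytes.data() + 8, sizeof(H.Elt1));
  return H;
}
```

If `i128` were a fully legal type, as the comment suggests, this split would presumably not need to be spelled out by hand in the lowering code.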
…ruction This commit adds NVPTX codegen support for clusterlaunchcontrol instructions with tests under clusterlaunchcontrol.ll and clusterlaunchcontrol-multicast.ll. For more information, Please refer [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/?a#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel)
c61abce to e063a41
The latest changes LGTM
@Artem-B I have addressed all of the comments. Please let me know if I can go ahead with the merge.