[NVPTX] Add TMA Bulk Copy Intrinsics #138679
Conversation
@llvm/pr-subscribers-llvm-ir

Author: Durgadoss R (durga4github)

Changes

This patch adds a new variant of TMA Bulk Copy intrinsics introduced in sm100+. This variant has an additional byte_mask to select the bytes for the copy operation.

PTX Spec link: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-bulk-copy

Full diff: https://github.com/llvm/llvm-project/pull/138679.diff

6 Files Affected:
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index c1426823d87af..2a091e27358eb 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -616,6 +616,7 @@ Syntax:
.. code-block:: llvm
declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(..., i32 %size, i16 %mask, i64 %ch, i1 %flag_ch)
Overview:
"""""""""
@@ -624,7 +625,10 @@ The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.global``' intrinsic
corresponds to the ``cp.async.bulk.global.shared::cta.*`` set of PTX
instructions. These instructions initiate an asynchronous copy from
shared::cta to global memory. The 32-bit operand ``%size`` specifies
-the amount of memory to be copied and it must be a multiple of 16.
+the amount of memory to be copied (in bytes) and it must be a multiple
+of 16. For the ``.bytemask`` variant, bit i of the 16-bit wide ``%mask``
+operand specifies whether the i-th byte of each 16-byte wide chunk of
+source data is copied to the destination.
* The last argument to these intrinsics is a boolean flag
indicating support for cache_hint. This flag argument must
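
To make the mask semantics concrete, here is a minimal usage sketch (not part of the patch; the function name is made up for illustration). With a mask of 0x00FF, only bytes 0-7 of each 16-byte chunk of source data are written to the destination:

; Hypothetical example, not from the patch: copy the lower 8 bytes of
; each 16-byte chunk of a 256-byte region (size must be a multiple of 16).
declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(ptr addrspace(1), ptr addrspace(3), i32, i16, i64, i1)

define void @copy_lower_halves(ptr addrspace(1) %dst, ptr addrspace(3) %src) {
  call void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(
      ptr addrspace(1) %dst, ptr addrspace(3) %src,
      i32 256,  ; copy size in bytes
      i16 255,  ; mask 0x00FF: bits 0-7 set, so bytes 0-7 of each chunk are copied
      i64 0,    ; cache-hint value (ignored because the flag below is 0)
      i1 0)     ; flag_ch = 0: no cache hint
  ret void
}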
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 8b87822d3fdda..abe1395bb9f47 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5323,6 +5323,20 @@ def int_nvvm_cp_async_bulk_shared_cta_to_global
NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
ImmArg<ArgIndex<4>>]>;
+// From Shared CTA to Global memory with bytemask
+def int_nvvm_cp_async_bulk_shared_cta_to_global_bytemask
+ : DefaultAttrsIntrinsic<[],
+ [llvm_global_ptr_ty, // dst_gmem_ptr
+ llvm_shared_ptr_ty, // src_smem_ptr
+ llvm_i32_ty, // copy_size
+ llvm_i16_ty, // byte_mask
+ llvm_i64_ty, // cache_hint
+ llvm_i1_ty], // Flag for cache_hint
+ [IntrConvergent, IntrArgMemOnly,
+ WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
+ NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
+ ImmArg<ArgIndex<5>>]>;
+
// Intrinsics for Bulk Copy Prefetch L2
def int_nvvm_cp_async_bulk_prefetch_L2
: DefaultAttrsIntrinsic<[],
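
For reference, the new bytemask def above corresponds to the following IR declaration (a sketch with the types taken directly from the def; note that the ImmArg&lt;ArgIndex&lt;5&gt;&gt; attribute requires the i1 flag to be a compile-time immediate):

declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(
    ptr addrspace(1) %dst_gmem_ptr, ptr addrspace(3) %src_smem_ptr,
    i32 %copy_size, i16 %byte_mask, i64 %cache_hint, i1 %flag_ch)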
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index 6f6084b99dda2..fd352598fdaf4 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -2720,28 +2720,46 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}
-void NVPTXDAGToDAGISel::SelectCpAsyncBulkS2G(SDNode *N) {
+void NVPTXDAGToDAGISel::SelectCpAsyncBulkS2GCommon(SDNode *N, bool HasMask) {
// We have {Chain, Intrinsic-ID} followed by the actual intrinsic args:
- // dst, src, size, cache_hint, cache_hint_flag
+ // dst, src, size, mask, cache_hint, cache_hint_flag
// NumOperands = {Chain, IID} + {Actual intrinsic args}
- // = {2} + {5}
+ // = {2} + {6}
size_t NumOps = N->getNumOperands();
bool IsCacheHint = N->getConstantOperandVal(NumOps - 1) == 1;
- size_t NumArgs = IsCacheHint ? 4 : 3; // src, dst, size, cache_hint
+ size_t CacheHintIdx = NumOps - 2;
SDLoc DL(N);
- SmallVector<SDValue, 8> Ops(N->ops().slice(2, NumArgs));
- Ops.push_back(N->getOperand(0)); // Chain operand
+ SDValue Offset, Base;
+ SelectADDR(N->getOperand(3), Base, Offset); // src
- bool IsShared32 =
- CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
- unsigned Opcode;
+ SmallVector<SDValue, 8> Ops;
+ // BaseArgs: {dst, src, size}
+ Ops.push_back(N->getOperand(2)); // dst
+ Ops.push_back(Base); // src base
+ Ops.push_back(Offset); // src offset
+ Ops.push_back(N->getOperand(4)); // size
+
+ // Push Mask operand, if available
+ if (HasMask)
+ Ops.push_back(N->getOperand(CacheHintIdx - 1));
+
+ // Push CacheHint operand, if available
if (IsCacheHint)
- Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_S2G_SHARED32_CH
- : NVPTX::CP_ASYNC_BULK_S2G_CH;
- else
- Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_S2G_SHARED32
- : NVPTX::CP_ASYNC_BULK_S2G;
+ Ops.push_back(N->getOperand(CacheHintIdx));
+
+ // Finally, the chain operand
+ Ops.push_back(N->getOperand(0));
+
+ unsigned Opcode = [&]() {
+ if (HasMask && IsCacheHint)
+ return NVPTX::CP_ASYNC_BULK_S2G_BM_CH;
+ if (HasMask)
+ return NVPTX::CP_ASYNC_BULK_S2G_BM;
+ if (IsCacheHint)
+ return NVPTX::CP_ASYNC_BULK_S2G_CH;
+ return NVPTX::CP_ASYNC_BULK_S2G;
+ }();
ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}
@@ -2928,7 +2946,10 @@ bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
SelectCpAsyncBulkG2S(N);
return true;
case Intrinsic::nvvm_cp_async_bulk_shared_cta_to_global:
- SelectCpAsyncBulkS2G(N);
+ SelectCpAsyncBulkS2GCommon(N);
+ return true;
+ case Intrinsic::nvvm_cp_async_bulk_shared_cta_to_global_bytemask:
+ SelectCpAsyncBulkS2GCommon(N, /*HasMask=*/true);
return true;
case Intrinsic::nvvm_cp_async_bulk_prefetch_L2:
SelectCpAsyncBulkPrefetchL2(N);
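
To summarize the opcode mapping above, the two bytemask call forms select as follows (a sketch inferred from the lambda and confirmed by the lit test below; %dst, %src, %size, %mask, and %ch are function arguments as in the test):

; flag_ch = 1: the i64 cache-hint operand is kept -> CP_ASYNC_BULK_S2G_BM_CH
call void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i16 %mask, i64 %ch, i1 1)
; flag_ch = 0: the cache-hint operand is dropped -> CP_ASYNC_BULK_S2G_BM
call void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i16 %mask, i64 0, i1 0)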
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index 23cbd458571a0..53f6c060405e7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -93,7 +93,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
void SelectV2I64toI128(SDNode *N);
void SelectI128toV2I64(SDNode *N);
void SelectCpAsyncBulkG2S(SDNode *N);
- void SelectCpAsyncBulkS2G(SDNode *N);
+ void SelectCpAsyncBulkS2GCommon(SDNode *N, bool HasMask = false);
void SelectCpAsyncBulkPrefetchL2(SDNode *N);
void SelectCpAsyncBulkTensorG2SCommon(SDNode *N, bool IsIm2Col = false);
void SelectCpAsyncBulkTensorS2GCommon(SDNode *N, bool IsIm2Col = false);
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 3eedb43e4c81a..65f1b524aec20 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -516,6 +516,9 @@ class CpAsyncBulkStr<bit mc, bit ch> {
string S2G = "cp.async.bulk.global.shared::cta.bulk_group"
# !if(ch, ".L2::cache_hint", "");
+ // Shared to Global memory with bytemask
+ string S2G_BM = S2G # ".cp_mask";
+
// Global to Shared cluster memory
string G2S = "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
# !if(mc, ".multicast::cluster", "")
@@ -525,18 +528,26 @@ class CpAsyncBulkStr<bit mc, bit ch> {
string C2C = "cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes";
}
-multiclass CP_ASYNC_BULK_S2G<NVPTXRegClass rc> {
- def NAME: NVPTXInst<(outs),
- (ins Int64Regs:$dst, rc:$src, Int32Regs:$size),
+def CP_ASYNC_BULK_S2G : NVPTXInst<(outs),
+ (ins Int64Regs:$dst, ADDR:$src, Int32Regs:$size),
!strconcat(CpAsyncBulkStr<0, 0>.S2G, " [$dst], [$src], $size;"), []>,
Requires<[hasPTX<80>, hasSM<90>]>;
- def NAME # _CH: NVPTXInst<(outs),
- (ins Int64Regs:$dst, rc:$src, Int32Regs:$size, Int64Regs:$ch),
- !strconcat(CpAsyncBulkStr<0, 1>.S2G, " [$dst], [$src], $size, $ch;"), []>,
- Requires<[hasPTX<80>, hasSM<90>]>;
-}
-defm CP_ASYNC_BULK_S2G : CP_ASYNC_BULK_S2G<Int64Regs>;
-defm CP_ASYNC_BULK_S2G_SHARED32 : CP_ASYNC_BULK_S2G<Int32Regs>;
+
+def CP_ASYNC_BULK_S2G_CH : NVPTXInst<(outs),
+ (ins Int64Regs:$dst, ADDR:$src, Int32Regs:$size, Int64Regs:$ch),
+ !strconcat(CpAsyncBulkStr<0, 1>.S2G, " [$dst], [$src], $size, $ch;"), []>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+
+// Variants with bytemask
+def CP_ASYNC_BULK_S2G_BM : NVPTXInst<(outs),
+ (ins Int64Regs:$dst, ADDR:$src, Int32Regs:$size, Int16Regs:$mask),
+ !strconcat(CpAsyncBulkStr<0, 0>.S2G_BM, " [$dst], [$src], $size, $mask;"), []>,
+ Requires<[hasPTX<86>, hasSM<100>]>;
+
+def CP_ASYNC_BULK_S2G_BM_CH : NVPTXInst<(outs),
+ (ins Int64Regs:$dst, ADDR:$src, Int32Regs:$size, Int16Regs:$mask, Int64Regs:$ch),
+ !strconcat(CpAsyncBulkStr<0, 1>.S2G_BM, " [$dst], [$src], $size, $ch, $mask;"), []>,
+ Requires<[hasPTX<86>, hasSM<100>]>;
multiclass CP_ASYNC_BULK_G2S<NVPTXRegClass rc> {
def NAME: NVPTXInst<(outs),
diff --git a/llvm/test/CodeGen/NVPTX/cp-async-bulk-s2g-sm100.ll b/llvm/test/CodeGen/NVPTX/cp-async-bulk-s2g-sm100.ll
new file mode 100644
index 0000000000000..c2e2b26ed6882
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/cp-async-bulk-s2g-sm100.ll
@@ -0,0 +1,46 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-12.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86| %ptxas-verify -arch=sm_100 %}
+; RUN: %if ptxas-12.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_100 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(ptr addrspace(1), ptr addrspace(3), i32, i16, i64, i1)
+
+define void @cp_async_bulk_s2g_bytemask(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i16 %mask, i64 %ch) {
+; CHECK-PTX64-LABEL: cp_async_bulk_s2g_bytemask(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b16 %rs<2>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [cp_async_bulk_s2g_bytemask_param_0];
+; CHECK-PTX64-NEXT: ld.param.u64 %rd2, [cp_async_bulk_s2g_bytemask_param_1];
+; CHECK-PTX64-NEXT: ld.param.u32 %r1, [cp_async_bulk_s2g_bytemask_param_2];
+; CHECK-PTX64-NEXT: ld.param.u16 %rs1, [cp_async_bulk_s2g_bytemask_param_3];
+; CHECK-PTX64-NEXT: ld.param.u64 %rd3, [cp_async_bulk_s2g_bytemask_param_4];
+; CHECK-PTX64-NEXT: cp.async.bulk.global.shared::cta.bulk_group.L2::cache_hint.cp_mask [%rd1], [%rd2], %r1, %rd3, %rs1;
+; CHECK-PTX64-NEXT: cp.async.bulk.global.shared::cta.bulk_group.cp_mask [%rd1], [%rd2], %r1, %rs1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: cp_async_bulk_s2g_bytemask(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b16 %rs<2>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.u64 %rd1, [cp_async_bulk_s2g_bytemask_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.u32 %r1, [cp_async_bulk_s2g_bytemask_param_1];
+; CHECK-PTX-SHARED32-NEXT: ld.param.u32 %r2, [cp_async_bulk_s2g_bytemask_param_2];
+; CHECK-PTX-SHARED32-NEXT: ld.param.u16 %rs1, [cp_async_bulk_s2g_bytemask_param_3];
+; CHECK-PTX-SHARED32-NEXT: ld.param.u64 %rd2, [cp_async_bulk_s2g_bytemask_param_4];
+; CHECK-PTX-SHARED32-NEXT: cp.async.bulk.global.shared::cta.bulk_group.L2::cache_hint.cp_mask [%rd1], [%r1], %r2, %rd2, %rs1;
+; CHECK-PTX-SHARED32-NEXT: cp.async.bulk.global.shared::cta.bulk_group.cp_mask [%rd1], [%r1], %r2, %rs1;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i16 %mask, i64 %ch, i1 1)
+ tail call void @llvm.nvvm.cp.async.bulk.shared.cta.to.global.bytemask(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i16 %mask, i64 0, i1 0)
+ ret void
+}
@@ -2720,28 +2720,46 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}

-void NVPTXDAGToDAGISel::SelectCpAsyncBulkS2G(SDNode *N) {
+void NVPTXDAGToDAGISel::SelectCpAsyncBulkS2GCommon(SDNode *N, bool HasMask) {
This function looks simple enough that it could be a table-gen pattern. Is there something I'm missing that prevents this?
This has a flag for the cache-hint. The flag itself gets dropped here, and we use the actual value when it is true (and ignore the value when it is false).
Hmm, that's annoying that table-gen doesn't seem to allow you to drop/ignore an operand. It looks like you could do something like (int_nvvm_cp_async_bulk_shared_cta_to_global i64:$dst, addr:$src, i32:$size, i64:$ch, -1) for the case where we want the cache hint, but I can't seem to figure out if there is a way to just ignore the cache-hint operand when the final operand is 0. @Artem-B do you know?
In general, the cache-hint pattern is common across a few families of TMA intrinsics. So, if there is an option to do this in table-gen, it can simplify a good chunk of the C++ implementations. Let us wait to hear from Artem.
OK, it seems we can get it to work with 0 and -1. I will clean this up and post a revision for us to review.
This patch adds a new variant of TMA Bulk Copy intrinsics introduced in sm100+. This variant has an additional byte_mask to select the bytes for the copy operation.

Signed-off-by: Durgadoss R <[email protected]>
Force-pushed from 359413a to 5c97fa7.
This patch adds a new variant of TMA Bulk Copy intrinsics introduced in sm100+. This variant has an additional byte_mask to select the bytes for the copy operation. The relevant documentation and lit tests are included as well.
PTX Spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-bulk-copy