Add auto-bitcasts between x86amx and i32x256 for AMX intrinsics #140763

Open
wants to merge 1 commit into
base: master
from

Conversation

sayantn
Contributor

@sayantn sayantn commented May 7, 2025

I tried adding support for AMX tile types to Rust. The approach is simple: the LLVM intrinsics operate on x86amx types, so all we need in order to call those intrinsics is to insert bitcasts between i32x256 and x86amx before and after the function call (as in this file in LLVM).

I tested the codegen for this fragment (test.rs):

#![feature(
    link_llvm_intrinsics,
    abi_unadjusted,
    x86_amx_intrinsics,
    repr_simd,
    simd_ffi
)]
#![allow(internal_features)]
#![no_std]

#[repr(simd)]
pub struct Tile([u32; 256]);

#[allow(improper_ctypes)]
unsafe extern "unadjusted" {
    #[link_name = "llvm.x86.tdpbuud.internal"]
    fn tdpbuud(m: u16, n: u16, k: u16, a: Tile, b: Tile, c: Tile) -> Tile;
}

#[unsafe(no_mangle)]
#[target_feature(enable = "amx-int8")]
pub fn foo(m: u16, n: u16, k: u16, a: Tile, b: Tile, c: Tile) -> Tile {
    unsafe { tdpbuud(m, n, k, a, b, c) }
}

The LLVM IR generated is (output of rustc +stage1 --emit=llvm-ir --crate-type=rlib -O test.rs && cat test.ll)

; ModuleID = 'test.acdeec3141bb4e39-cgu.0'
source_filename = "test.acdeec3141bb4e39-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: nonlazybind uwtable
define void @foo(ptr sret([1024 x i8]) align 1024 %_0, i16 %m, i16 %n, i16 %k, ptr align 1024 %a, ptr align 1024 %b, ptr align 1024 %c) unnamed_addr #0 {
start:
  %0 = load <256 x i32>, ptr %a, align 1024
  %1 = load <256 x i32>, ptr %b, align 1024
  %2 = load <256 x i32>, ptr %c, align 1024
  %3 = bitcast <256 x i32> %0 to x86_amx
  %4 = bitcast <256 x i32> %1 to x86_amx
  %5 = bitcast <256 x i32> %2 to x86_amx
  %6 = call x86_amx @llvm.x86.tdpbuud.internal(i16 %m, i16 %n, i16 %k, x86_amx %3, x86_amx %4, x86_amx %5) #1
  %7 = bitcast x86_amx %6 to <256 x i32>
  store <256 x i32> %7, ptr %_0, align 1024
  ret void
}

; Function Attrs: nounwind
declare x86_amx @llvm.x86.tdpbuud.internal(i16, i16, i16, x86_amx, x86_amx, x86_amx) unnamed_addr #1

attributes #0 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" "target-features"="+amx-int8,+amx-tile" }
attributes #1 = { nounwind }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!"rustc version 1.88.0-dev"}

and the ASM generated is (output of rustc +stage1 --emit=asm --crate-type=rlib -O test.rs && cat test.s)

	.file	"test.acdeec3141bb4e39-cgu.0"
	.section	.text.foo,"ax",@progbits
	.globl	foo
	.p2align	4
	.type	foo,@function
foo:
	.cfi_startproc
	movq	%rdi, %rax
	xorps	%xmm0, %xmm0
	movups	%xmm0, -64(%rsp)
	movups	%xmm0, -48(%rsp)
	movups	%xmm0, -32(%rsp)
	movups	%xmm0, -16(%rsp)
	movb	$1, -64(%rsp)
	movw	%dx, -44(%rsp)
	movb	%sil, -15(%rsp)
	movw	%dx, -48(%rsp)
	movb	%sil, -16(%rsp)
	movzwl	%cx, %ecx
	movw	%cx, -46(%rsp)
	movq	8(%rsp), %rdi
	movl	%ecx, %r10d
	movb	%r10b, -14(%rsp)
	shrl	$2, %r10d
	movb	%r10b, -14(%rsp)
	movl	$64, %r11d
	ldtilecfg	-64(%rsp)
	tileloadd	(%r8,%r11), %tmm0
	tileloadd	(%r9,%r11), %tmm1
	tileloadd	(%rdi,%r11), %tmm2
	tdpbuud	%tmm2, %tmm1, %tmm0
	tilestored	%tmm0, (%rax,%r11)
	tilerelease
	retq
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo
	.cfi_endproc

	.ident	"rustc version 1.88.0-dev"
	.section	".note.GNU-stack","",@progbits

(note: the tests were done on x86_64-unknown-linux-gnu)

This is pretty similar to the Clang codegen (https://godbolt.org/z/G19rjo3Ke).

Reviews are welcome, as I am not too confident in the code (I am still not sure whether the checks for AMX are strict enough; I will try to strengthen them).

Unresolved Questions

  • Are bitcasts good enough? Clang uses llvm.x86.cast.vector.to.tile.v256i32 and llvm.x86.cast.tile.to.vector.v256i32; is there any functional difference from bitcasts? Resolved: it turns out that bitcasts can cause miscompilation (https://reviews.llvm.org/D99152), so we have to use the AMX-specific casts.
  • Should we allow only i32x256, or all vector types of size 8192 bits? The LLVM file I referenced only does this for i32x256, but there is really no reason to be restrictive.
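
For the second question, the size condition can be sketched as follows. This is only an illustration and not code from the PR: the helper name `fits_amx_tile` and the constant `AMX_TILE_BITS` are hypothetical, and the sketch assumes the only requirement is that the vector is exactly as wide as one tile (16 rows of 64 bytes, i.e. 8192 bits).

```rust
/// Bit width of one AMX tile register: 16 rows x 64 bytes x 8 bits = 8192.
const AMX_TILE_BITS: u32 = 16 * 64 * 8;

/// Hypothetical helper (not part of the PR): returns true when a SIMD vector
/// with `lanes` elements of `elem_bits` bits each is exactly one tile wide.
/// `checked_mul` avoids overflow on nonsensical inputs.
fn fits_amx_tile(elem_bits: u32, lanes: u32) -> bool {
    elem_bits.checked_mul(lanes) == Some(AMX_TILE_BITS)
}
```

Under this assumption i32x256 qualifies, and so would i8x1024 or a 512-lane vector of 16-bit elements, while i32x128 would be rejected.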

@rustbot label O-x86_64 T-compiler
r? codegen

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. O-x86_64 Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64) labels May 7, 2025
@rustbot
Collaborator

rustbot commented May 8, 2025

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

@sayantn
Contributor Author

sayantn commented May 8, 2025

I have changed the detection logic to be name-based. This approach can't produce false negatives, but it behaves in a nontrivial way for functions that export themselves under an LLVM intrinsic name with the unadjusted ABI (the ABI is important; otherwise rustc passes the arguments by reference). The following code doesn't codegen:

// things from earlier definition

#[export_name = "llvm.x86.tdpbsud.internal"]
#[target_feature(enable = "amx-int8")]
pub extern "unadjusted" fn bar(m: u16, n: u16, k: u16, a: Tile, b: Tile, c: Tile) -> Tile {
    unsafe { tdpbuud(m, n, k, a, b, c) }
}

I was honestly surprised that this code even compiles! Exporting a function masquerading as an LLVM intrinsic is vile! The LLVM IR produced is invalid (the signature returns x86_amx, but the body returns <256 x i32>):

; ModuleID = 'test.acdeec3141bb4e39-cgu.0'
source_filename = "test.acdeec3141bb4e39-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: nounwind
define x86_amx @llvm.x86.tdpbsud.internal(i16 %m, i16 %n, i16 %k, x86_amx %a, x86_amx %b, x86_amx %c) unnamed_addr #0 {
start:
  %0 = tail call x86_amx @llvm.x86.tdpbuud.internal(i16 noundef %m, i16 noundef %n, i16 noundef %k, x86_amx %a, x86_amx %b, x86_amx %c) #0
  %_0 = bitcast x86_amx %0 to <256 x i32>
  ret <256 x i32> %_0
}

; Function Attrs: nounwind
declare x86_amx @llvm.x86.tdpbuud.internal(i16, i16, i16, x86_amx, x86_amx, x86_amx) unnamed_addr #0

attributes #0 = { nounwind }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!"rustc version 1.88.0-dev"}

I couldn't find any more edge cases, but I will try to make the check stricter.
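
One way a stricter name-based check could look is sketched below. This is a hypothetical illustration, not the logic in this PR: the `FnInfo` struct, its fields, and both function names are invented for the example, and the rule itself (only cast for body-less declarations whose symbol starts with `llvm.x86.`) is an assumption about what "stricter" might mean.

```rust
/// Hypothetical, simplified view of a function as the backend sees it.
struct FnInfo<'a> {
    /// Symbol name after `link_name` / `export_name` has been applied.
    symbol_name: &'a str,
    /// True for items in an `extern` block (no body), false for definitions.
    is_declaration: bool,
}

/// Assumption: anything under the `llvm.x86.` prefix is treated as a
/// candidate x86 intrinsic. A real check would likely be narrower.
fn has_x86_intrinsic_name(name: &str) -> bool {
    name.starts_with("llvm.x86.")
}

/// Only insert casts for declarations. A definition that merely exports
/// itself under an intrinsic-like name is left alone.
fn should_insert_amx_casts(f: &FnInfo) -> bool {
    f.is_declaration && has_x86_intrinsic_name(f.symbol_name)
}
```

With this rule, the `tdpbuud` declaration from the `extern "unadjusted"` block would still get its casts, while `bar`, which is a definition exported as `llvm.x86.tdpbsud.internal`, would not.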

@rustbot
Collaborator

rustbot commented May 8, 2025

Some changes occurred in compiler/rustc_codegen_gcc

cc @antoyo, @GuillaumeGomez

@sayantn sayantn changed the title Add auto-bitcasts from/to x86amx and i32x256 for AMX intrinsics Add auto-bitcasts from/to x86amx for i32x256 for AMX intrinsics May 8, 2025
@sayantn sayantn changed the title Add auto-bitcasts from/to x86amx for i32x256 for AMX intrinsics Add auto-bitcasts between x86amx and i32x256 for AMX intrinsics May 8, 2025
@sayantn
Contributor Author

sayantn commented May 8, 2025

I managed to resolve the false positive (which resulted in #140822).

This can now be extended to more use cases, e.g. using bf16 vectors from Rust.
