[AArch64] Use pattern to select bf16 fpextend #137212

Merged
john-brawn-arm merged 4 commits into llvm:main from the bf16_aarch64 branch on May 2, 2025

Conversation

john-brawn-arm
Collaborator

Currently bf16 fpextend is lowered to a vector shift. Instead, leave it as fpextend and have an instruction selection pattern that selects it to a shift later. Doing this means that DAGCombiner patterns for fpextend will be applied, leading to better codegen. It also means that in some situations we use a mov instruction where we previously had a dup instruction, but I don't think this makes any difference.
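
For context, bf16 reuses the top 16 bits of an IEEE-754 single, so the widening really is just a 16-bit left shift; a minimal standalone C++ sketch for illustration (not part of the patch):

#include <cstdint>
#include <cstring>

// bf16 -> f32: place the 16 bf16 bits in the upper half of a 32-bit word and
// reinterpret the result as a float (the same effect as the shll-by-16).
float bf16_to_f32(uint16_t bf16_bits) {
  uint32_t f32_bits = static_cast<uint32_t>(bf16_bits) << 16;
  float f;
  std::memcpy(&f, &f32_bits, sizeof f);
  return f;
}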

@llvmbot
Member

llvmbot commented Apr 24, 2025

@llvm/pr-subscribers-backend-aarch64

Author: John Brawn (john-brawn-arm)

Changes

Currently bf16 fpextend is lowered to a vector shift. Instead, leave it as fpextend and have an instruction selection pattern that selects it to a shift later. Doing this means that DAGCombiner patterns for fpextend will be applied, leading to better codegen. It also means that in some situations we use a mov instruction where we previously had a dup instruction, but I don't think this makes any difference.


Patch is 49.31 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137212.diff

8 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+6-32)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+18)
  • (modified) llvm/test/CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll (+2-6)
  • (modified) llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/bf16-instructions.ll (+6-12)
  • (modified) llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll (+314-314)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fcopysign.ll (+7-21)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index ee4cc51f8d4ff..4b94805c8b0a8 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -766,13 +766,14 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationAction(Op, MVT::v8bf16, Expand);
   }
 
-  // For bf16, fpextend is custom lowered to be optionally expanded into shifts.
-  setOperationAction(ISD::FP_EXTEND, MVT::f32, Custom);
+  // fpextend from f16 or bf16 to f32 is legal
+  setOperationAction(ISD::FP_EXTEND, MVT::f32, Legal);
+  setOperationAction(ISD::FP_EXTEND, MVT::v4f32, Legal);
+  setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f32, Legal);
+  setOperationAction(ISD::STRICT_FP_EXTEND, MVT::v4f32, Legal);
+  // fpextend from bf16 to f64 needs to be split into two fpextends
   setOperationAction(ISD::FP_EXTEND, MVT::f64, Custom);
-  setOperationAction(ISD::FP_EXTEND, MVT::v4f32, Custom);
-  setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f32, Custom);
   setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f64, Custom);
-  setOperationAction(ISD::STRICT_FP_EXTEND, MVT::v4f32, Custom);
 
   auto LegalizeNarrowFP = [this](MVT ScalarVT) {
     for (auto Op : {
@@ -4558,33 +4559,6 @@ SDValue AArch64TargetLowering::LowerFP_EXTEND(SDValue Op,
     return SDValue();
   }
 
-  if (VT.getScalarType() == MVT::f32) {
-    // FP16->FP32 extends are legal for v32 and v4f32.
-    if (Op0VT.getScalarType() == MVT::f16)
-      return Op;
-    if (Op0VT.getScalarType() == MVT::bf16) {
-      SDLoc DL(Op);
-      EVT IVT = VT.changeTypeToInteger();
-      if (!Op0VT.isVector()) {
-        Op0 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v4bf16, Op0);
-        IVT = MVT::v4i32;
-      }
-
-      EVT Op0IVT = Op0.getValueType().changeTypeToInteger();
-      SDValue Ext =
-          DAG.getNode(ISD::ANY_EXTEND, DL, IVT, DAG.getBitcast(Op0IVT, Op0));
-      SDValue Shift =
-          DAG.getNode(ISD::SHL, DL, IVT, Ext, DAG.getConstant(16, DL, IVT));
-      if (!Op0VT.isVector())
-        Shift = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, Shift,
-                            DAG.getConstant(0, DL, MVT::i64));
-      Shift = DAG.getBitcast(VT, Shift);
-      return IsStrict ? DAG.getMergeValues({Shift, Op.getOperand(0)}, DL)
-                      : Shift;
-    }
-    return SDValue();
-  }
-
   assert(Op.getValueType() == MVT::f128 && "Unexpected lowering");
   return SDValue();
 }
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 4657a77e80ecc..c1a0a22c1d598 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -8469,6 +8469,24 @@ def : InstAlias<"uxtl2 $dst.2d, $src1.4s",
                 (USHLLv4i32_shift V128:$dst, V128:$src1, 0)>;
 }
 
+// fpextend from bf16 to f32 is just a shift left by 16
+let Predicates = [HasNEON] in {
+def : Pat<(f32 (any_fpextend (bf16 FPR16:$Rn))),
+          (f32 (EXTRACT_SUBREG
+            (v4i32 (SHLLv4i16 (v4i16 (SUBREG_TO_REG (i64 0), (bf16 FPR16:$Rn), hsub)))),
+            ssub))>;
+def : Pat<(v4f32 (any_fpextend (v4bf16 V64:$Rn))),
+          (SHLLv4i16 V64:$Rn)>;
+def : Pat<(v4f32 (any_fpextend (extract_high_v8bf16 (v8bf16 V128:$Rn)))),
+          (SHLLv8i16 V128:$Rn)>;
+}
+// Fallback pattern for when we don't have NEON
+def : Pat<(f32 (any_fpextend (bf16 FPR16:$Rn))),
+          (f32 (COPY_TO_REGCLASS
+            (i32 (UBFMWri (i32 (SUBREG_TO_REG (i32 0), (bf16 FPR16:$Rn), hsub)),
+                          (i64 16), (i64 15))),
+            FPR32))>;
+
 def abs_f16 :
   OutPatFrag<(ops node:$Rn),
              (EXTRACT_SUBREG (f32 (COPY_TO_REGCLASS
diff --git a/llvm/test/CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll b/llvm/test/CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll
index 9a1203f18243d..1d33545cb171a 100644
--- a/llvm/test/CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll
@@ -155,9 +155,7 @@ entry:
 define i32 @fptosi_bf(bfloat %a) nounwind ssp {
 ; CHECK-LABEL: fptosi_bf:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    fmov s1, s0
-; CHECK-NEXT:    // implicit-def: $d0
-; CHECK-NEXT:    fmov s0, s1
+; CHECK-NEXT:    // kill: def $d0 killed $h0
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
 ; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $q0
 ; CHECK-NEXT:    fcvtzs w0, s0
@@ -171,9 +169,7 @@ entry:
 define i32 @fptoui_sbf(bfloat %a) nounwind ssp {
 ; CHECK-LABEL: fptoui_sbf:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    fmov s1, s0
-; CHECK-NEXT:    // implicit-def: $d0
-; CHECK-NEXT:    fmov s0, s1
+; CHECK-NEXT:    // kill: def $d0 killed $h0
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
 ; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $q0
 ; CHECK-NEXT:    fcvtzu w0, s0
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll b/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll
index 9b5e48d2b4217..e3e18a1f91c6d 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll
@@ -641,7 +641,7 @@ define <2 x bfloat> @test_atomicrmw_fmax_v2bf16_seq_cst_align4(ptr %ptr, <2 x bf
 ; NOLSE-LABEL: test_atomicrmw_fmax_v2bf16_seq_cst_align4:
 ; NOLSE:       // %bb.0:
 ; NOLSE-NEXT:    // kill: def $d0 killed $d0 def $q0
-; NOLSE-NEXT:    dup v1.4h, v0.h[1]
+; NOLSE-NEXT:    mov h1, v0.h[1]
 ; NOLSE-NEXT:    mov w8, #32767 // =0x7fff
 ; NOLSE-NEXT:    shll v0.4s, v0.4h, #16
 ; NOLSE-NEXT:    shll v1.4s, v1.4h, #16
@@ -649,7 +649,7 @@ define <2 x bfloat> @test_atomicrmw_fmax_v2bf16_seq_cst_align4(ptr %ptr, <2 x bf
 ; NOLSE-NEXT:    // =>This Inner Loop Header: Depth=1
 ; NOLSE-NEXT:    ldaxr w9, [x0]
 ; NOLSE-NEXT:    fmov s2, w9
-; NOLSE-NEXT:    dup v3.4h, v2.h[1]
+; NOLSE-NEXT:    mov h3, v2.h[1]
 ; NOLSE-NEXT:    shll v2.4s, v2.4h, #16
 ; NOLSE-NEXT:    fmaxnm s2, s2, s0
 ; NOLSE-NEXT:    shll v3.4s, v3.4h, #16
@@ -677,14 +677,14 @@ define <2 x bfloat> @test_atomicrmw_fmax_v2bf16_seq_cst_align4(ptr %ptr, <2 x bf
 ; LSE-LABEL: test_atomicrmw_fmax_v2bf16_seq_cst_align4:
 ; LSE:       // %bb.0:
 ; LSE-NEXT:    // kill: def $d0 killed $d0 def $q0
-; LSE-NEXT:    dup v1.4h, v0.h[1]
+; LSE-NEXT:    mov h1, v0.h[1]
 ; LSE-NEXT:    shll v2.4s, v0.4h, #16
 ; LSE-NEXT:    mov w8, #32767 // =0x7fff
 ; LSE-NEXT:    ldr s0, [x0]
 ; LSE-NEXT:    shll v1.4s, v1.4h, #16
 ; LSE-NEXT:  .LBB7_1: // %atomicrmw.start
 ; LSE-NEXT:    // =>This Inner Loop Header: Depth=1
-; LSE-NEXT:    dup v3.4h, v0.h[1]
+; LSE-NEXT:    mov h3, v0.h[1]
 ; LSE-NEXT:    shll v4.4s, v0.4h, #16
 ; LSE-NEXT:    fmaxnm s4, s4, s2
 ; LSE-NEXT:    shll v3.4s, v3.4h, #16
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll b/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll
index f6c542fe7d407..10de6777bd285 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll
@@ -641,7 +641,7 @@ define <2 x bfloat> @test_atomicrmw_fmin_v2bf16_seq_cst_align4(ptr %ptr, <2 x bf
 ; NOLSE-LABEL: test_atomicrmw_fmin_v2bf16_seq_cst_align4:
 ; NOLSE:       // %bb.0:
 ; NOLSE-NEXT:    // kill: def $d0 killed $d0 def $q0
-; NOLSE-NEXT:    dup v1.4h, v0.h[1]
+; NOLSE-NEXT:    mov h1, v0.h[1]
 ; NOLSE-NEXT:    mov w8, #32767 // =0x7fff
 ; NOLSE-NEXT:    shll v0.4s, v0.4h, #16
 ; NOLSE-NEXT:    shll v1.4s, v1.4h, #16
@@ -649,7 +649,7 @@ define <2 x bfloat> @test_atomicrmw_fmin_v2bf16_seq_cst_align4(ptr %ptr, <2 x bf
 ; NOLSE-NEXT:    // =>This Inner Loop Header: Depth=1
 ; NOLSE-NEXT:    ldaxr w9, [x0]
 ; NOLSE-NEXT:    fmov s2, w9
-; NOLSE-NEXT:    dup v3.4h, v2.h[1]
+; NOLSE-NEXT:    mov h3, v2.h[1]
 ; NOLSE-NEXT:    shll v2.4s, v2.4h, #16
 ; NOLSE-NEXT:    fminnm s2, s2, s0
 ; NOLSE-NEXT:    shll v3.4s, v3.4h, #16
@@ -677,14 +677,14 @@ define <2 x bfloat> @test_atomicrmw_fmin_v2bf16_seq_cst_align4(ptr %ptr, <2 x bf
 ; LSE-LABEL: test_atomicrmw_fmin_v2bf16_seq_cst_align4:
 ; LSE:       // %bb.0:
 ; LSE-NEXT:    // kill: def $d0 killed $d0 def $q0
-; LSE-NEXT:    dup v1.4h, v0.h[1]
+; LSE-NEXT:    mov h1, v0.h[1]
 ; LSE-NEXT:    shll v2.4s, v0.4h, #16
 ; LSE-NEXT:    mov w8, #32767 // =0x7fff
 ; LSE-NEXT:    ldr s0, [x0]
 ; LSE-NEXT:    shll v1.4s, v1.4h, #16
 ; LSE-NEXT:  .LBB7_1: // %atomicrmw.start
 ; LSE-NEXT:    // =>This Inner Loop Header: Depth=1
-; LSE-NEXT:    dup v3.4h, v0.h[1]
+; LSE-NEXT:    mov h3, v0.h[1]
 ; LSE-NEXT:    shll v4.4s, v0.4h, #16
 ; LSE-NEXT:    fminnm s4, s4, s2
 ; LSE-NEXT:    shll v3.4s, v3.4h, #16
diff --git a/llvm/test/CodeGen/AArch64/bf16-instructions.ll b/llvm/test/CodeGen/AArch64/bf16-instructions.ll
index 2fc9c53112ab6..1dd883580715e 100644
--- a/llvm/test/CodeGen/AArch64/bf16-instructions.ll
+++ b/llvm/test/CodeGen/AArch64/bf16-instructions.ll
@@ -1996,13 +1996,11 @@ define bfloat @test_copysign_f64(bfloat %a, double %b) #0 {
 define float @test_copysign_extended(bfloat %a, bfloat %b) #0 {
 ; CHECK-CVT-LABEL: test_copysign_extended:
 ; CHECK-CVT:       // %bb.0:
-; CHECK-CVT-NEXT:    // kill: def $h0 killed $h0 def $d0
-; CHECK-CVT-NEXT:    movi v2.4s, #16
 ; CHECK-CVT-NEXT:    // kill: def $h1 killed $h1 def $d1
-; CHECK-CVT-NEXT:    ushll v0.4s, v0.4h, #0
-; CHECK-CVT-NEXT:    shll v1.4s, v1.4h, #16
-; CHECK-CVT-NEXT:    ushl v0.4s, v0.4s, v2.4s
+; CHECK-CVT-NEXT:    // kill: def $h0 killed $h0 def $d0
 ; CHECK-CVT-NEXT:    mvni v2.4s, #128, lsl #24
+; CHECK-CVT-NEXT:    shll v1.4s, v1.4h, #16
+; CHECK-CVT-NEXT:    shll v0.4s, v0.4h, #16
 ; CHECK-CVT-NEXT:    bif v0.16b, v1.16b, v2.16b
 ; CHECK-CVT-NEXT:    fmov w8, s0
 ; CHECK-CVT-NEXT:    lsr w8, w8, #16
@@ -2013,16 +2011,12 @@ define float @test_copysign_extended(bfloat %a, bfloat %b) #0 {
 ;
 ; CHECK-SD-LABEL: test_copysign_extended:
 ; CHECK-SD:       // %bb.0:
-; CHECK-SD-NEXT:    // kill: def $h0 killed $h0 def $d0
-; CHECK-SD-NEXT:    movi v2.4s, #16
 ; CHECK-SD-NEXT:    // kill: def $h1 killed $h1 def $d1
-; CHECK-SD-NEXT:    ushll v0.4s, v0.4h, #0
-; CHECK-SD-NEXT:    shll v1.4s, v1.4h, #16
-; CHECK-SD-NEXT:    ushl v0.4s, v0.4s, v2.4s
+; CHECK-SD-NEXT:    // kill: def $h0 killed $h0 def $d0
 ; CHECK-SD-NEXT:    mvni v2.4s, #128, lsl #24
-; CHECK-SD-NEXT:    bif v0.16b, v1.16b, v2.16b
-; CHECK-SD-NEXT:    bfcvt h0, s0
+; CHECK-SD-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-SD-NEXT:    shll v0.4s, v0.4h, #16
+; CHECK-SD-NEXT:    bif v0.16b, v1.16b, v2.16b
 ; CHECK-SD-NEXT:    // kill: def $s0 killed $s0 killed $q0
 ; CHECK-SD-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll b/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
index 3a55b68f2d1a3..f4ab8ff581e23 100644
--- a/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
+++ b/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
@@ -882,11 +882,11 @@ define <8 x i16> @fptoui_i16(<8 x bfloat> %a) #0 {
 define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_une:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
+; CHECK-NEXT:    mov h2, v1.h[1]
+; CHECK-NEXT:    mov h3, v0.h[1]
+; CHECK-NEXT:    mov h4, v1.h[2]
+; CHECK-NEXT:    mov h5, v0.h[2]
+; CHECK-NEXT:    mov h6, v0.h[3]
 ; CHECK-NEXT:    shll v2.4s, v2.4h, #16
 ; CHECK-NEXT:    shll v3.4s, v3.4h, #16
 ; CHECK-NEXT:    fcmp s3, s2
@@ -896,34 +896,34 @@ define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-NEXT:    fcmp s3, s2
 ; CHECK-NEXT:    shll v3.4s, v4.4h, #16
 ; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
+; CHECK-NEXT:    mov h5, v1.h[3]
 ; CHECK-NEXT:    csetm w9, ne
 ; CHECK-NEXT:    fmov s2, w9
 ; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
+; CHECK-NEXT:    mov h3, v1.h[4]
+; CHECK-NEXT:    shll v4.4s, v5.4h, #16
+; CHECK-NEXT:    shll v5.4s, v6.4h, #16
+; CHECK-NEXT:    mov h6, v0.h[4]
 ; CHECK-NEXT:    mov v2.h[1], w8
 ; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
+; CHECK-NEXT:    fcmp s5, s4
+; CHECK-NEXT:    shll v3.4s, v3.4h, #16
+; CHECK-NEXT:    mov h5, v1.h[5]
 ; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
+; CHECK-NEXT:    mov h6, v0.h[5]
 ; CHECK-NEXT:    mov v2.h[2], w8
 ; CHECK-NEXT:    csetm w8, ne
 ; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
+; CHECK-NEXT:    mov h3, v1.h[6]
+; CHECK-NEXT:    shll v4.4s, v5.4h, #16
+; CHECK-NEXT:    shll v5.4s, v6.4h, #16
+; CHECK-NEXT:    mov h6, v0.h[6]
+; CHECK-NEXT:    mov h1, v1.h[7]
+; CHECK-NEXT:    mov h0, v0.h[7]
 ; CHECK-NEXT:    mov v2.h[3], w8
 ; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
+; CHECK-NEXT:    fcmp s5, s4
+; CHECK-NEXT:    shll v3.4s, v3.4h, #16
 ; CHECK-NEXT:    shll v4.4s, v6.4h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
@@ -945,54 +945,54 @@ define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_ueq(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_ueq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
+; CHECK-NEXT:    mov h2, v1.h[1]
+; CHECK-NEXT:    mov h3, v0.h[1]
+; CHECK-NEXT:    mov h4, v1.h[2]
+; CHECK-NEXT:    mov h5, v0.h[2]
+; CHECK-NEXT:    mov h6, v0.h[3]
 ; CHECK-NEXT:    shll v2.4s, v2.4h, #16
 ; CHECK-NEXT:    shll v3.4s, v3.4h, #16
+; CHECK-NEXT:    shll v6.4s, v6.4h, #16
 ; CHECK-NEXT:    fcmp s3, s2
 ; CHECK-NEXT:    shll v2.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v3.4s, v0.4h, #16
 ; CHECK-NEXT:    csetm w8, eq
 ; CHECK-NEXT:    csinv w8, w8, wzr, vc
 ; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
+; CHECK-NEXT:    shll v2.4s, v4.4h, #16
 ; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
+; CHECK-NEXT:    mov h3, v1.h[3]
+; CHECK-NEXT:    mov h5, v1.h[4]
 ; CHECK-NEXT:    csetm w9, eq
 ; CHECK-NEXT:    csinv w9, w9, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
+; CHECK-NEXT:    fcmp s4, s2
+; CHECK-NEXT:    mov h4, v0.h[4]
 ; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
+; CHECK-NEXT:    shll v3.4s, v3.4h, #16
 ; CHECK-NEXT:    mov v2.h[1], w8
 ; CHECK-NEXT:    csetm w8, eq
+; CHECK-NEXT:    shll v4.4s, v4.4h, #16
 ; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
+; CHECK-NEXT:    fcmp s6, s3
 ; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
+; CHECK-NEXT:    mov h5, v1.h[5]
+; CHECK-NEXT:    mov h6, v0.h[5]
 ; CHECK-NEXT:    mov v2.h[2], w8
 ; CHECK-NEXT:    csetm w8, eq
 ; CHECK-NEXT:    csinv w8, w8, wzr, vc
 ; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
+; CHECK-NEXT:    mov h3, v1.h[6]
+; CHECK-NEXT:    mov h4, v0.h[6]
+; CHECK-NEXT:    shll v5.4s, v5.4h, #16
+; CHECK-NEXT:    shll v6.4s, v6.4h, #16
+; CHECK-NEXT:    mov h1, v1.h[7]
+; CHECK-NEXT:    mov h0, v0.h[7]
 ; CHECK-NEXT:    mov v2.h[3], w8
 ; CHECK-NEXT:    csetm w8, eq
 ; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
+; CHECK-NEXT:    fcmp s6, s5
+; CHECK-NEXT:    shll v3.4s, v3.4h, #16
+; CHECK-NEXT:    shll v4.4s, v4.4h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
 ; CHECK-NEXT:    mov v2.h[4], w8
@@ -1016,11 +1016,11 @@ define <8 x i1> @test_fcmp_ueq(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_ugt(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_ugt:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
+; CHECK-NEXT:    mov h2, v1.h[1]
+; CHECK-NEXT:    mov h3, v0.h[1]
+; CHECK-NEXT:    mov h4, v1.h[2]
+; CHECK-NEXT:    mov h5, v0.h[2]
+; CHECK-NEXT:    mov h6, v0.h[3]
 ; CHECK-NEXT:    shll v2.4s, v2.4h, #16
 ; CHECK-NEXT:    shll v3.4s, v3.4h, #16
 ; CHECK-NEXT:    fcmp s3, s2
@@ -1030,34 +1030,34 @@ define <8 x i1> @test_fcmp_ugt(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-NEXT:    fcmp s3, s2
 ; CHECK-NEXT:    shll v3.4s, v4.4h, #16
 ; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
+; CHECK-NEXT:    mov h5, v1.h[3]
 ; CHECK-NEXT:    csetm w9, hi
 ; CHECK-NEXT:    fmov s2, w9
 ; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
+; CHECK-NEXT:    mov h3, v1.h[4]
+; CHECK-NEXT:    shll v4.4s, v5.4h, #16
+; CHECK-NEXT:    shll v5.4s, v6.4h, #16
+; CHECK-NEXT:    mov h6, v0.h[4]
 ; CHECK-NEXT:    mov v2.h[1], w8
 ; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
+; CHECK-NEXT:    fcmp s5, s4
+; CHECK-NEXT:    shll v3.4s, v3.4h, #16
+; CHECK-NEXT:    mov h5, v1.h[5]
 ; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
+; CHECK-NEXT:    mov h6, v0.h[5]
 ; CHECK-NEXT:    mov v2.h[2], w8
 ; CHECK-NEXT:    csetm w8, hi
 ; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
+; CHECK-NEXT:    mov h3, v1.h[6]
+; CHECK-NEXT:    shll v4.4s, v5.4h, #16
+; CHECK-NEXT:    shll v5.4s, v6.4h, #16
+; CHECK-NEXT:    mov h6, v0.h[6]
+; CHECK-NEXT:    mov h1, v1.h[7]
+; CHECK-NEXT:    mov h0, v0.h[7]
 ; CHECK-NEXT:    mov v2.h[3], w8
 ; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
+; CHECK-NEXT:    fcmp s5, s4
+; CHECK-NEXT:    shll v3.4s, v3.4h, #16
 ; CHECK-NEXT:    shll v4.4s, v6.4h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
@@ -1079,11 +1079,11 @@ define <8 x i1> @test_fcmp_ugt(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_uge(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_uge:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
+; CHECK-NEXT:    mov h2, v1.h[1]
+; CHECK-NEXT:    mov h3, v0.h[1]
+; CHECK-NEXT:    mov h4, v1.h[2]
+; CHECK-NEXT:    mov h5, v0.h[2]
+; CHECK-NEXT:    mov h6, v0.h[3]
 ; CHECK-NEXT:    shll v2.4s, v2.4h, #16
 ; CHECK-NEXT:    shll v3.4s, v3.4h, #16
 ; CHECK-NEXT:    fcmp s3, s2
@@ -1093,34 +1093,34 @@ define <8 x i1> @test_fcmp_uge(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-NEXT:    fcmp s3, s2
 ; CHECK-NEXT:    shll v3.4s, v4.4h, #16
 ; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
+; CHECK-NEXT:    mov h5, v1.h[3]
 ; CHECK-NEXT:    csetm w9, pl
 ; CHECK-NEXT:    fmov s2, w9
 ; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
+; CHECK-NEXT:    mov h3, v1.h[4]
+; CHECK-NEXT:    shll v4.4s, v5.4h, #16
+; CHECK-NEXT:    shll v5.4s, v6.4h, #16
+; CHECK-NEXT:    mov h6, v0.h[4]
 ; CHECK-NEXT:    mov v2.h[1], w8
 ; CHECK-NEXT:    csetm w8, pl
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll ...
[truncated]

@john-brawn-arm
Collaborator Author

This is something I noticed while working on #131345, as the current lowering prevents the DAGCombiner transform that patch adds from being applied.

@davemgreen
Collaborator

As per #118966 I was hoping the opposite would be better. I can see the advantage if there are fast-math fpexts/fptruncs that can be removed, but I was hoping that we could lower the extends/rounds so that the shifts and whatnot can be optimized nicely, as they should be. Otherwise you miss fairly basic codegen opportunities.

I am mostly against patterns that produce multiple output instructions, especially anything that is a cross-register-bank copy. SUBREG_TO_REG and EXTRACT_SUBREG are fine as they don't produce instructions. But I'm not sure if (i32 (SUBREG_TO_REG (i32 0), (bf16 FPR16:$Rn), hsub)) is really valid.

@john-brawn-arm
Collaborator Author

As per #118966 I was hoping the opposite would be better. I can see the advantage if there are fast-math fpexts/fptruncs that can be removed, but I was hoping that we could lower the extends/rounds so that the shifts and whatnot can be optimized nicely, as they should be. Otherwise you miss fairly basic codegen opportunities.

All of the changes in test output as a result of this patch are better than or equivalent to what we currently have, as far as I can tell, so if there's a situation where converting to a shift earlier rather than later is better, we don't have a test for it.

I am mostly against patterns that produce multiple output instructions, especially anything that is a cross-register-bank copy. SUBREG_TO_REG and EXTRACT_SUBREG are fine as they don't produce instructions. But I'm not sure if (i32 (SUBREG_TO_REG (i32 0), (bf16 FPR16:$Rn), hsub)) is really valid.

This results in a COPY being implicitly added later, but having an explicit COPY_TO_REGCLASS here is probably better. I'll do that.

@davemgreen
Collaborator

All of the changes in test output as a result of this patch are better than or equivalent to what we currently have, as far as I can tell, so if there's a situation where converting to a shift earlier rather than later is better, we don't have a test for it.

Yeah, some of the old codegen certainly looks like it is not doing as well as it should be. We obviously wouldn't have tests for everything possible. I was thinking about cases like fpext(load) with noneon, which should turn into a scalar gpr load + gpr shift, not an fpr load + fpr->gpr move + gpr shift. But that doesn't even work before this patch! (it crashes or generates wrong instructions). There are other cases, like how copysign gets expanded, that should be optimizing better than they are at the moment.

define float @test(ptr %a) {
  %l = load bfloat, ptr %a
  %e = fpext bfloat %l to float
  ret float %e
}

When there is just one instruction being generated it looks OK; it's only the noneon patterns that worry me, and those are of less importance overall. (i.e. this sounds OK, but...) As far as I can see this won't currently work without +bf16, as it relies on seeing the fpext(fpround()) after legalization, and the fpround will equally be expanded. It is a lot more code, so emitting it with a pattern sounds unreasonable, but did you give it any thought?

(Would it be possible to have the fpext(fpround) optimization happen as part of getNode(), so that it happens almost immediately and doesn't have the requirement that the fpround and fpext are legal operations?)
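
For reference, a source-level shape that produces the fpext(fpround()) pair under discussion (illustrative only, not code from this patch; assumes a toolchain where __bf16 supports arithmetic conversions):

// A float -> bf16 -> float round trip: the narrowing is fpround and the
// widening is fpextend, which is the pair the combine mentioned above looks for.
float roundtrip(float x) {
  __bf16 narrowed = static_cast<__bf16>(x); // fpround  f32 -> bf16
  return static_cast<float>(narrowed);      // fpextend bf16 -> f32
}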

@john-brawn-arm
Collaborator Author

When there is just one instruction being generated it looks OK; it's only the noneon patterns that worry me, and those are of less importance overall. (i.e. this sounds OK, but...) As far as I can see this won't currently work without +bf16, as it relies on seeing the fpext(fpround()) after legalization, and the fpround will equally be expanded. It is a lot more code, so emitting it with a pattern sounds unreasonable, but did you give it any thought?

I'll have a look at this and see if there's anything we can do.

(Would it be possible to have the fpext(fpround) optimization happen as part of getNode(), so that it happens almost immediately and doesn't have the requirement that the fpround and fpext are legal operations?)

It looks like this doesn't work, as the fpext/fpround are introduced during legalization which happens from the bottom up, so at the time the fpext is created the fpround doesn't exist yet.

@davemgreen
Collaborator

It looks like this doesn't work, as the fpext/fpround are introduced during legalization which happens from the bottom up, so at the time the fpext is created the fpround doesn't exist yet.

Ah, OK, thanks for checking. Considering this fixes the noneon codegen and most of the important cases are a single instruction, it looks OK to me. LGTM

@john-brawn-arm john-brawn-arm merged commit 5060f08 into llvm:main May 2, 2025
11 checks passed
@john-brawn-arm john-brawn-arm deleted the bf16_aarch64 branch May 2, 2025 11:55
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025