More efficient remainder lowering #31811

llvmbot · 2017-03-29T21:40:08Z


Bugzilla Link	32464
Version	trunk
OS	Windows NT
Reporter	LLVM Bugzilla Contributor
CC	@efriedma-quic,@rotateright,@TNorthover

Extended Description

Given the following IR:

cat remainder.ll
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

define i32 @tinky(i32 %j) local_unnamed_addr #0 {
entry:
  %rem = srem i32 %j, 255
  ret i32 %rem
}

attributes #0 = { norecurse nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a53" "target-features"="+crc,-crypto,-fp-armv8,-neon" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 5.0.0 (trunk 298784) (llvm/trunk 298825)"}

Fast-isel seems to generate slightly faster/shorter code than SelectionDAG:

C:\Users\fluttershy\work\llvm-msvc\Debug\bin\llc.exe -O3 -mcpu=cortex-a53 remainder.ll -o - -fast-isel
.text
.file "remainder.ll"
.globl tinky
.p2align 2
.type tinky,@function
tinky: // @tinky
// BB#0: // %entry
orr w8, wzr, #0xff
sdiv w9, w0, w8
msub w0, w9, w8, w0
ret
.Lfunc_end0:
.size tinky, .Lfunc_end0-tinky

    .ident  "clang version 5.0.0 (trunk 298784) (llvm/trunk 298825)"
    .section        ".note.GNU-stack","",@progbits

C:\Users\fluttershy\work\llvm-msvc\Debug\bin\llc.exe -O3 -mcpu=cortex-a53 remainder.ll -o -
.text
.file "remainder.ll"
.globl tinky
.p2align 2
.type tinky,@function
tinky: // @tinky
// BB#0: // %entry
orr w8, wzr, #0xff
sdiv w8, w0, w8
lsl w9, w8, #8
sub w8, w9, w8
sub w0, w0, w8
ret
.Lfunc_end0:
.size tinky, .Lfunc_end0-tinky

    .ident  "clang version 5.0.0 (trunk 298784) (llvm/trunk 298825)"
    .section        ".note.GNU-stack","",@progbits


llvmbot · 2017-03-29T21:43:36Z

We end up in this situation iff we don't use a fast remainder algorithm when selecting srem (n % (2^k - 1)), i.e. if isCheapDiv is true. Right now, isCheapDiv is true iff we're optimizing for size and we're not operating on vectors.

The fast-isel version is slightly shorter (so more size efficient) and in my tests, slightly faster.
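To make the comparison concrete, here is a minimal Python sketch that models the two instruction sequences above on 32-bit values. The helper names (`to_signed32`, `sdiv32`, `rem_fast_isel`, `rem_sdag`) are made up for illustration; the sketch only checks that both sequences compute the same remainder and says nothing about timing.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Wrap an integer to a signed 32-bit value, like a W register."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sdiv32(a, b):
    """Model of sdiv: truncates toward zero, division by zero yields 0."""
    if b == 0:
        return 0
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return to_signed32(q)

def rem_fast_isel(w0):
    w8 = 0xFF                          # orr  w8, wzr, #0xff
    w9 = sdiv32(w0, w8)                # sdiv w9, w0, w8
    return to_signed32(w0 - w9 * w8)   # msub w0, w9, w8, w0

def rem_sdag(w0):
    w8 = 0xFF                          # orr  w8, wzr, #0xff
    w8 = sdiv32(w0, w8)                # sdiv w8, w0, w8
    w9 = to_signed32(w8 << 8)          # lsl  w9, w8, #8
    w8 = to_signed32(w9 - w8)          # sub  w8, w9, w8   (quotient * 255)
    return to_signed32(w0 - w8)        # sub  w0, w0, w8
```

The only difference is how quotient * 255 is formed and subtracted: one msub versus lsl + sub + sub.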

efriedma-quic · 2017-03-29T22:34:39Z

Compare to:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"
define i32 @tinky(i32 %i, i32 %j) local_unnamed_addr #0 {
entry:
  %rem = mul i32 %j, 255
  %s = sub i32 %i, %rem
  ret i32 %s
}

We generate the same lsl+sub+sub here.

There's an obvious transformation we're missing to transform an lsl+sub pair into a "sub...lsl". Beyond that, not sure what's worth doing here; maybe the right thing depends on whether we have an immediate 255 available?

rotateright · 2017-03-30T22:19:36Z

define i32 @tinky(i32 %i, i32 %j) local_unnamed_addr #0 {
entry:
  %rem = mul i32 %j, 255
  %s = sub i32 %i, %rem
  ret i32 %s
}

So right now, we have:
lsl w8, w1, #8
sub w8, w8, w1
sub w0, w0, w8

But that should be:
sub w1, w1, w1, lsl #8
add w0, w1, w0

Can that be done with tablegen matching? In x86, we have all kinds of target-specific nodes to create custom instructions in the DAG, but I don't see that in aarch64.
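The rewrite relies on the identity i - ((j << 8) - j) == (j - (j << 8)) + i, which holds under 32-bit wraparound. A small Python sketch (function names are illustrative only) that models both sequences and checks them against i - 255 * j:

```python
MASK32 = 0xFFFFFFFF

def current_sequence(w0, w1):
    w8 = (w1 << 8) & MASK32            # lsl w8, w1, #8
    w8 = (w8 - w1) & MASK32            # sub w8, w8, w1
    return (w0 - w8) & MASK32          # sub w0, w0, w8

def proposed_sequence(w0, w1):
    w1 = (w1 - (w1 << 8)) & MASK32     # sub w1, w1, w1, lsl #8   (= -255 * j)
    return (w1 + w0) & MASK32          # add w0, w1, w0

def reference(i, j):
    return (i - 255 * j) & MASK32      # what the IR computes, modulo 2**32
```

Results are compared as unsigned 32-bit bit patterns, which is enough because add and sub behave identically for signed and unsigned operands.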

efriedma-quic · 2017-03-30T22:31:20Z

It should be possible to write a pattern matching something like "(sub $x, (sub (lsl $y, 8), $y))"... granted, it might be a bit tricky to generalize.

Alternatively, you could match this using a target combine; DAGCombine doesn't aggressively reassociate operations or anything like that, so any rearrangement will stick.

(See performMulCombine in AArch64ISelLowering.cpp for the multiply->shift+sub transform.)
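As a rough illustration of the multiply->shift+sub idea, the Python sketch below decomposes a constant multiplier of the form 2^n - 1 or 2^n + 1 into a shift plus one sub or add. This is a simplified model, not the code in performMulCombine; the names `decompose_mul` and `apply_decomposition` are hypothetical, and the real combine may cover more cases.

```python
MASK32 = 0xFFFFFFFF

def decompose_mul(c):
    """Return ('sub', n) if c == 2**n - 1, ('add', n) if c == 2**n + 1, else None."""
    if c > 1 and ((c + 1) & c) == 0:          # c + 1 is a power of two
        return ("sub", (c + 1).bit_length() - 1)
    if c > 2 and ((c - 1) & (c - 2)) == 0:    # c - 1 is a power of two
        return ("add", (c - 1).bit_length() - 1)
    return None

def apply_decomposition(kind, n, x):
    """Evaluate (x << n) - x or (x << n) + x with 32-bit wraparound."""
    shifted = (x << n) & MASK32
    if kind == "sub":
        return (shifted - x) & MASK32
    return (shifted + x) & MASK32
```

With this model, 255 decomposes to a shift by 8 and a sub, while 254 has no such decomposition and stays a multiply.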

llvmbot · 2017-03-30T22:49:45Z

> It should be possible to write a pattern matching something like "(sub $x,
> (sub (lsl $y, 8), $y))"... granted, it might be a bit tricky to generalize.

I was going that route. I think it's "fine", but then I realized you need to introduce your own predicates (à la imm0_65535, but something like imm0_ispower_of_two), so I looked into an alternative solution.

> Alternatively, you could match this using a target combine; DAGCombine
> doesn't aggressively reassociate operations or anything like that, so any
> rearrangement will stick.
>
> (See performMulCombine in AArch64ISelLowering.cpp for the
> multiply->shift+sub transform.)

I noticed that we actually already produce the 'best' sequence for
mul i32 %j, constant when constant is not 2^n - 1

e.g.

define i32 @tinky(i32 %i, i32 %j) {
  %rem = mul i32 %j, 254
  %s = sub i32 %i, %rem
  ret i32 %s
}

=>

// BB#0:
orr w8, wzr, #0xfe
msub w0, w1, w8, w0

So I guess an easy (or at least easier) way to handle this would be to not lower mul -> shl + sub in the first place when the mul feeds a sub, i.e. when we have mul + sub to begin with? I may be missing something, so forgive me if I'm wrong.
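A toy Python model of that suggestion, under the assumption that the decision can be made by looking at whether the multiply is the right-hand operand of a sub. The classes, the `lower` function, and the pseudo-instruction strings (including `mov` for materializing the constant and the `tmp`/`dst` names) are all hypothetical and do not correspond to LLVM data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mul:
    operand: str
    constant: int

@dataclass(frozen=True)
class Sub:
    lhs: str
    rhs: object  # a register name or a Mul node

def is_pow2_minus_1(c):
    return c > 1 and (c & (c + 1)) == 0

def lower(expr):
    """Return pseudo-instructions for a Mul or for Sub(lhs, Mul)."""
    if isinstance(expr, Sub) and isinstance(expr.rhs, Mul):
        m = expr.rhs
        # Suggested behavior: the mul feeds a sub, so keep it and fold into msub.
        return [f"mov tmp, #{m.constant}",
                f"msub dst, {m.operand}, tmp, {expr.lhs}"]
    if isinstance(expr, Mul):
        if is_pow2_minus_1(expr.constant):
            n = (expr.constant + 1).bit_length() - 1
            # Standalone multiply by 2**n - 1: shift and subtract.
            return [f"lsl tmp, {expr.operand}, #{n}",
                    f"sub dst, tmp, {expr.operand}"]
        return [f"mov tmp, #{expr.constant}",
                f"mul dst, {expr.operand}, tmp"]
    raise TypeError("unsupported expression")
```

In this model, sub(i, mul(j, 255)) lowers to two instructions, the same count as the 254 case shown above, instead of lsl + sub + sub.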

efriedma-quic · 2017-03-30T22:56:46Z

The comment in performMulCombine mentions the issue; we might need to re-evaluate.

c-rhodes · 2025-05-02T10:37:22Z

> define i32 @tinky(i32 %i, i32 %j) local_unnamed_addr #0 {
> entry:
>   %rem = mul i32 %j, 255
>   %s = sub i32 %i, %rem
>   ret i32 %s
> }
> So right now, we have:
> lsl w8, w1, #8
> sub w8, w8, w1
> sub w0, w0, w8
> But that should be:
> sub w1, w1, w1, lsl #8
> add w0, w1, w0
> Can that be done with tablegen matching? In x86, we have all kinds of target-specific nodes to create custom instructions in the DAG, but I don't see that in aarch64.

this is now fixed at least: https://godbolt.org/z/sj6ojvrTs

as for the original example: https://godbolt.org/z/s8MT9dvGs

Codegen for both SDAG and fast-isel was generating sdiv, but now only fast-isel does. That said, I tried earlier versions (5.0, 6.0, etc.) and none of them produced an sdiv, so perhaps it was only generated temporarily, sometime between LLVM 5.0 and 6.0.

Anyhow, at a glance SDAG now looks much worse than fast-isel, which uses significantly fewer instructions. And mca agrees that for the a53 the fast-isel one is better: https://godbolt.org/z/nv3rK7b4Y

but the latency of integer divides on a53 (according to the scheduling model in LLVM) is 4, whereas for later cores divides can be much more expensive:

a57, div latency 4-20: https://godbolt.org/z/4oxP53PG8
neoverse-v2, div latency 5-12: https://godbolt.org/z/Mhh6Y5v75

so for newer cores the SDAG version could be faster, with the caveat that mca is using the worst case here.
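For context on why a longer sequence without sdiv can win when divides are slow: the current SDAG output is only linked, not quoted, in this thread, so the following Python sketch is not a transcription of it. It shows the textbook division-free technique (multiply by a precomputed reciprocal, take the high half, shift, fix up the sign). The constant 0x80808081 with shift 7 is derived here for the divisor 255 as ceil(2^39 / 255) and is an assumption, not something taken from the compiler output.

```python
MAGIC = 0x80808081   # ceil(2**39 / 255), assumed constant for divisor 255
SHIFT = 7

def to_signed32(x):
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def srem255_no_div(n):
    """Signed 32-bit n % 255 (truncating) without a divide instruction."""
    m = to_signed32(MAGIC)          # the constant is negative as a signed 32-bit value
    hi = (m * n) >> 32              # high 32 bits of the signed 64-bit product
    q = (hi + n) >> SHIFT           # add n back to compensate, then arithmetic shift
    q += 1 if n < 0 else 0          # round the quotient toward zero
    return n - q * 255              # remainder = n - quotient * 255

def srem_trunc(n, d):
    """Reference: remainder with the sign of the dividend, like srem."""
    r = abs(n) % d
    return -r if n < 0 else r
```

Each step is a cheap multiply, add, or shift, so the sequence has more instructions than sdiv + msub but avoids the variable divide latency quoted above.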

I think this one can be closed?

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
