More efficient remainder lowering #31811


Open
llvmbot opened this issue Mar 29, 2017 · 7 comments
Labels
backend:AArch64 bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Member

llvmbot commented Mar 29, 2017

Bugzilla Link 32464
Version trunk
OS Windows NT
Reporter LLVM Bugzilla Contributor
CC @efriedma-quic,@rotateright,@TNorthover

Extended Description

Given the following IR:

cat remainder.ll
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

define i32 @tinky(i32 %j) local_unnamed_addr #0 {
entry:
  %rem = srem i32 %j, 255
  ret i32 %rem
}

attributes #0 = { norecurse nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a53" "target-features"="+crc,-crypto,-fp-armv8,-neon" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

Fast-isel seems to generate slightly faster/shorter code than SelectionDAG:

C:\Users\fluttershy\work\llvm-msvc\Debug\bin\llc.exe -O3 -mcpu=cortex-a53 remainder.ll -o - -fast-isel

        .text
        .file   "remainder.ll"
        .globl  tinky
        .p2align        2
        .type   tinky,@function
tinky:                                  // @tinky
// BB#0:                                // %entry
        orr     w8, wzr, #0xff
        sdiv    w9, w0, w8
        msub    w0, w9, w8, w0
        ret
.Lfunc_end0:
        .size   tinky, .Lfunc_end0-tinky

        .ident  "clang version 5.0.0 (trunk 298784) (llvm/trunk 298825)"
        .section        ".note.GNU-stack","",@progbits

C:\Users\fluttershy\work\llvm-msvc\Debug\bin\llc.exe -O3 -mcpu=cortex-a53 remainder.ll -o -

        .text
        .file   "remainder.ll"
        .globl  tinky
        .p2align        2
        .type   tinky,@function
tinky:                                  // @tinky
// BB#0:                                // %entry
        orr     w8, wzr, #0xff
        sdiv    w8, w0, w8
        lsl     w9, w8, #8
        sub     w8, w9, w8
        sub     w0, w0, w8
        ret
.Lfunc_end0:
        .size   tinky, .Lfunc_end0-tinky

        .ident  "clang version 5.0.0 (trunk 298784) (llvm/trunk 298825)"
        .section        ".note.GNU-stack","",@progbits
@llvmbot
Member Author

llvmbot commented Mar 29, 2017

We end up in this situation iff we don't use a fast remainder algorithm when we select srem (n % (2^k - 1)), i.e. iff isCheapDiv is true. isCheapDiv is currently true iff we're optimizing for size and we're not operating on vectors.

The fast-isel version is slightly shorter (so more size-efficient) and, in my tests, slightly faster.

@efriedma-quic
Collaborator

Compare to:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

define i32 @tinky(i32 %i, i32 %j) local_unnamed_addr #0 {
entry:
  %rem = mul i32 %j, 255
  %s = sub i32 %i, %rem
  ret i32 %s
}

We generate the same lsl+sub+sub here.

There's an obvious transformation we're missing: folding an lsl+sub pair into a sub with a shifted operand ("sub ..., lsl #8"). Beyond that, I'm not sure what's worth doing here; maybe the right thing depends on whether we have the immediate 255 available?

@rotateright
Contributor

define i32 @tinky(i32 %i, i32 %j) local_unnamed_addr #0 {
entry:
  %rem = mul i32 %j, 255
  %s = sub i32 %i, %rem
  ret i32 %s
}

So right now, we have:

  lsl w8, w1, #8
  sub w8, w8, w1
  sub w0, w0, w8

But that should be:

  sub w1, w1, w1, lsl #8
  add w0, w1, w0
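As a sanity check (my sketch, not from the thread): both sequences compute i - 255*j modulo 2^32; the rewrite just reassociates so the shift feeds a shifted-operand sub.

```python
# Compare the current lsl+sub+sub sequence against the proposed
# sub-with-shifted-operand + add sequence, masking to model 32-bit
# wraparound.
MASK = 0xFFFFFFFF

def current(i, j):
    t = (j << 8) & MASK                  # lsl w8, w1, #8
    t = (t - j) & MASK                   # sub w8, w8, w1
    return (i - t) & MASK                # sub w0, w0, w8

def proposed(i, j):
    t = (j - ((j << 8) & MASK)) & MASK   # sub w1, w1, w1, lsl #8
    return (t + i) & MASK                # add w0, w1, w0

for i in (0, 1, 77, 0xFFFFFFFF):
    for j in (0, 1, 255, 0x01000000, 0xFFFFFFFF):
        assert current(i, j) == proposed(i, j) == (i - 255 * j) & MASK
```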

Can that be done with tablegen matching? In x86, we have all kinds of target-specific nodes to create custom instructions in the DAG, but I don't see that in aarch64.

@efriedma-quic
Collaborator

It should be possible to write a pattern matching something like "(sub $x, (sub (lsl $y, 8), $y))"... granted, it might be a bit tricky to generalize.

Alternatively, you could match this using a target combine; DAGCombine doesn't aggressively reassociate operations or anything like that, so any rearrangement will stick.

(See performMulCombine in AArch64ISelLowering.cpp for the multiply->shift+sub transform.)

@llvmbot
Member Author

llvmbot commented Mar 30, 2017

> It should be possible to write a pattern matching something like "(sub $x, (sub (lsl $y, 8), $y))"... granted, it might be a bit tricky to generalize.

I was going that route. I think it's "fine", but then I realized you actually need to introduce your own predicates (à la imm0_65535, but an imm0_ispower_of_two), so I investigated an alternative solution.

> Alternatively, you could match this using a target combine; DAGCombine doesn't aggressively reassociate operations or anything like that, so any rearrangement will stick.
>
> (See performMulCombine in AArch64ISelLowering.cpp for the multiply->shift+sub transform.)

I noticed that we actually already produce the 'best' sequence for mul i32 %j, C when C is not 2^n - 1,

e.g.

define i32 @tinky(i32 %i, i32 %j) {
  %rem = mul i32 %j, 254
  %s = sub i32 %i, %rem
  ret i32 %s
}

=>

// BB#0:
        orr  w8, wzr, #0xfe
        msub w0, w1, w8, w0

So I guess an easier way to handle this would be to not lower mul -> shl + sub in the first place when the mul feeds a sub to begin with? I could be missing something, so forgive me if I'm wrong.
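For illustration (my sketch, not from the thread): per the comments above, the constants that trigger the shl+sub expansion are those of the form 2^n - 1, which is a cheap bit test.

```python
# Constants of the form 2^n - 1 (like 255) currently get the mul
# expanded to shl+sub, defeating the mul+sub -> msub fusion; other
# constants (like 254) keep the mul, which folds into msub.
def is_pow2_minus_1(c):
    # c + 1 is a power of two exactly when (c + 1) & c == 0
    return c > 0 and (c + 1) & c == 0

assert is_pow2_minus_1(255)        # 2^8 - 1: expanded to (x << 8) - x
assert not is_pow2_minus_1(254)    # stays a mul; fuses into msub
assert [c for c in range(1, 20) if is_pow2_minus_1(c)] == [1, 3, 7, 15]
```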

@efriedma-quic
Collaborator

The comment in performMulCombine mentions the issue; we might need to re-evaluate.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@c-rhodes
Collaborator

c-rhodes commented May 2, 2025

> define i32 @tinky(i32 %i, i32 %j) local_unnamed_addr #0 {
> entry:
>   %rem = mul i32 %j, 255
>   %s = sub i32 %i, %rem
>   ret i32 %s
> }
>
> So right now, we have:
>
>   lsl w8, w1, #8
>   sub w8, w8, w1
>   sub w0, w0, w8
>
> But that should be:
>
>   sub w1, w1, w1, lsl #8
>   add w0, w1, w0
>
> Can that be done with tablegen matching? In x86, we have all kinds of target-specific nodes to create custom instructions in the DAG, but I don't see that in aarch64.

This is now fixed at least: https://godbolt.org/z/sj6ojvrTs

As for the original example: https://godbolt.org/z/s8MT9dvGs

Codegen for both SDAG and fast-isel was generating sdiv, but now only fast-isel is. Although I did try earlier versions (5.0, 6.0, etc.) and none ever produced an sdiv? Perhaps it was temporarily generated sometime between LLVM 5.0 and 6.0.

Anyhow, at a glance the SDAG output now looks much worse than fast-isel, which needs significantly fewer instructions. And llvm-mca agrees that on the A53 the fast-isel version is better: https://godbolt.org/z/nv3rK7b4Y

But the latency of integer divides on the A53 (according to the scheduling model in LLVM) is 4, whereas on later cores divides can be much more expensive, so for newer cores the SDAG version could be faster, with the caveat that mca is assuming the worst case here.

I think this one can be closed?
