More efficient remainder lowering #31811
We end up in such a situation iff we don't use a fast remainder algorithm when we select
The fast-isel version is slightly shorter (so more size-efficient) and, in my tests, slightly faster.
Compare to:
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
We generate the same lsl+sub+sub here. There's an obvious transformation we're missing: turning an lsl+sub pair into a "sub...lsl". Beyond that, I'm not sure what's worth doing here; maybe the right thing depends on whether we have an immediate 255 available?
So right now, we have: But that should be: Can that be done with TableGen matching? On x86, we have all kinds of target-specific nodes to create custom instructions in the DAG, but I don't see that in AArch64.
It should be possible to write a pattern matching something like "(sub $x, (sub (lsl $y, 8), $y))"... granted, it might be a bit tricky to generalize. Alternatively, you could match this using a target combine; DAGCombine doesn't aggressively reassociate operations or anything like that, so any rearrangement will stick. (See performMulCombine in AArch64ISelLowering.cpp for the multiply->shift+sub transform.)
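The rearrangement discussed here rests on a simple identity in wrapping 32-bit arithmetic: x - ((y << 8) - y) equals (x + y) - (y << 8), and both equal x - 255*y. The following is a minimal sketch that models the three shapes. The helper names are hypothetical, and the instruction names in the comments are an assumption about what a backend could select, not actual LLVM output.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit wrapping registers


def lsl_sub_sub(x, y):
    """Three operations: lsl, sub, sub (the shape discussed in the thread)."""
    t = (y << 8) & MASK32      # lsl
    t = (t - y) & MASK32       # sub  -> y * 255
    return (x - t) & MASK32    # sub


def add_sub_shifted(x, y):
    """Two operations, assuming a sub that accepts a shifted register operand."""
    t = (x + y) & MASK32                        # add
    return (t - ((y << 8) & MASK32)) & MASK32   # sub with the operand shifted left by 8


def mul_sub(x, y):
    """One multiply-subtract, assuming 255 is already available in a register."""
    return (x - y * 255) & MASK32
```

All three functions return the same value for any pair of 32-bit inputs; which one is cheapest depends on the target and on whether the constant has to be materialized first.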
I was going that route. I think it's "fine", but then I realized you need to introduce your own predicates (à la imm0_65535, but imm0_ispower_of_two), so I looked for an alternative solution.
I noticed that we actually already produce the 'best' sequence for e.g. define i32 @tinky(i32 %i, i32 %j) { => // BB#0: so, I guess an easy (or easier) way to handle this would be to not lower mul -> shl + sub in the first place if we have mul + sub to begin with? I could be missing something, so forgive me if I'm wrong.
The comment in performMulCombine mentions the issue; we might need to re-evaluate.
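For context, the multiply-to-shift rewrite mentioned in this thread can be illustrated as follows. This is a simplified sketch of the general idea (a constant of the form 2^n - 1 or 2^n + 1 becomes one shift plus one sub or add), not a transcription of performMulCombine. The function names and the returned tuple format are made up for illustration.

```python
def decompose_mul(c):
    """Return ('sub', n) if c == 2**n - 1, ('add', n) if c == 2**n + 1, else None."""
    if c > 1 and ((c + 1) & c) == 0:        # c + 1 is a power of two
        return ("sub", (c + 1).bit_length() - 1)
    if c > 2 and ((c - 1) & (c - 2)) == 0:  # c - 1 is a power of two
        return ("add", (c - 1).bit_length() - 1)
    return None


def mul_via_shift(x, c, bits=32):
    """Multiply x by c with wrapping, using shift+add/sub when c allows it."""
    mask = (1 << bits) - 1
    plan = decompose_mul(c)
    if plan is None:
        return (x * c) & mask               # fall back to a plain multiply
    op, n = plan
    shifted = (x << n) & mask
    if op == "sub":
        return (shifted - x) & mask         # (x << n) - x
    return (shifted + x) & mask             # (x << n) + x
```

For 255 this yields `("sub", 8)`, i.e. (x << 8) - x. When the product then feeds a subtraction, as in the remainder case, keeping the multiply might allow a single multiply-subtract instead, which is the trade-off raised above.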
This is now fixed at least: https://godbolt.org/z/sj6ojvrTs
As for the original example: https://godbolt.org/z/s8MT9dvGs
Codegen for both SDAG and fast-isel was generating sdiv, but now only fast-isel does. I did try earlier versions (5.0, 6.0, etc.), though, and none ever produced an sdiv. Perhaps it was temporarily generated sometime between LLVM 5.0 and 6.0.
Anyhow, at a glance SDAG now looks much worse than fast-isel, which uses significantly fewer instructions. mca agrees that on the A53 the fast-isel version is better: https://godbolt.org/z/nv3rK7b4Y
However, the latency of integer divides on the A53 (according to the scheduling model in LLVM) is 4, whereas on later cores divides can be much more expensive:
So for newer cores the SDAG version could be faster, with the caveat that mca is using the worst case here. I think this one can be closed?
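For readers comparing the two lowerings, here is a rough Python model of both strategies for a signed 32-bit remainder by 255: a divide followed by a multiply-subtract, and a multiply by a precomputed reciprocal. The constant 0x80808081 with a shift of 7 is the usual signed-division constant for 255 (it is ceil(2^39 / 255)). The function names and the step-to-instruction comments are illustrative assumptions, not actual compiler output.

```python
def to_i32(v):
    """Wrap a Python int to a signed 32-bit value."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v


def srem_via_sdiv(x, d=255):
    """Divide, then multiply-subtract (the sdiv + msub style)."""
    q = abs(x) // d              # truncating division, as sdiv/srem require
    if x < 0:
        q = -q
    return to_i32(x - q * d)


MAGIC = to_i32(0x80808081)       # ceil(2**39 / 255), reinterpreted as signed
SHIFT = 7


def srem_via_magic(x):
    """Reciprocal multiply instead of a divide; valid for divisor 255 only."""
    hi = (x * MAGIC) >> 32                  # high half of the signed 64-bit product
    t = to_i32(hi + x)                      # correction because MAGIC is negative as i32
    q = (t >> SHIFT) + ((t >> 31) & 1)      # arithmetic shift, then add the sign bit
    return to_i32(x - q * 255)              # remainder = x - q * 255
```

The second version avoids the divide at the cost of several cheap operations, so which one wins depends on how expensive integer division is on the core in question.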
Extended Description
Given the following IR:
define i32 @tinky(i32 %j) local_unnamed_addr #0 {
entry:
%rem = srem i32 %j, 255
ret i32 %rem
}
attributes #0 = { norecurse nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a53" "target-features"="+crc,-crypto,-fp-armv8,-neon" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.ident = !{#0}
Fast-isel seems to generate slightly faster/shorter code than SelectionDAG: