[JIT] Enable conditional chaining for Intel APX #111072
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
1. Intel SDE Testing
Test run with SDE: […]
Test run with SDE with […]
2. SuperPMI results
Diffs are based on 2,635,272 contexts (1,050,818 MinOpts, 1,584,454 FullOpts).
MISSED contexts: 2,984 (0.11%)
Base JIT options: JitBypassApxCheck=1
Diff JIT options: JitBypassApxCheck=1;JitEnableApxIfConv=1
Overall (-169,140 bytes)
FullOpts (-169,140 bytes)
src/coreclr/jit/lowerxarch.cpp
// On X86, a FP compare is implemented as a fallthrough, which requires two flag checks; hence,
// we cannot simply get a single output condition to feed into a ccmp. Might be possible to chain
// this, but skipping those cases for now
GenCondition cond1;
if (op2->OperIsCmpCompare() && varTypeIsIntegralOrI(op2->gtGetOp1()) && IsInvariantInRange(op2, tree) &&
    ProducesPotentialConsumableFlagsForCCMP(op1) && TryLowerConditionToFlagsNode(tree, op1, &cond1))
I think it would be preferable to get rid of `ProducesPotentialConsumableFlagsForCCMP` and add an argument to `TryLowerConditionToFlagsNode` about whether it is allowed to lower to a condition that requires multiple flags checks. Otherwise we end up having to keep `ProducesPotentialConsumableFlagsForCCMP` and `TryLowerConditionToFlagsNode` in sync.
You can use `GenConditionDesc::Get(cond).jumpKind2 == EJ_NONE` to check this condition in the appropriate places in `TryLowerConditionToFlagsNode`.
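To make the suggestion concrete, here is a minimal, self-contained sketch. The `JumpKind`, `Cond`, `ConditionDesc`, `GetDesc` and `CanLowerCondition` names and the table entries are hypothetical stand-ins for `emitJumpKind`, `GenCondition`, `GenConditionDesc` and the check inside `TryLowerConditionToFlagsNode`; they only model the idea that a condition needing a second jump (`jumpKind2 != EJ_NONE`) cannot feed a single ccmp.

```cpp
#include <cassert>

// Hypothetical subset of jump kinds; EJ_NONE means "no second jump is needed".
enum JumpKind
{
    EJ_NONE,
    EJ_je,
    EJ_jne,
    EJ_jnp,
    EJ_ja
};

// Hypothetical subset of conditions (two integer, two floating-point).
enum Cond
{
    COND_EQ,
    COND_NE,
    COND_FEQ,
    COND_FGT
};

// Simplified stand-in for GenConditionDesc: a condition may need up to two jumps.
struct ConditionDesc
{
    JumpKind jumpKind1;
    JumpKind jumpKind2;
};

const ConditionDesc& GetDesc(Cond cond)
{
    // Illustrative table. Ordered float equality needs ZF set and PF clear, so it is
    // modeled with two jumps; ordered "greater than" is a single `ja`.
    static const ConditionDesc descs[] = {
        {EJ_je, EJ_NONE},  // COND_EQ
        {EJ_jne, EJ_NONE}, // COND_NE
        {EJ_jnp, EJ_je},   // COND_FEQ
        {EJ_ja, EJ_NONE},  // COND_FGT
    };
    return descs[cond];
}

// The suggested rule: a caller that can only consume one condition (such as ccmp
// formation) passes allowMultipleFlagsChecks = false, and conditions that need a
// second flags check are rejected where the lowering decision itself is made.
bool CanLowerCondition(Cond cond, bool allowMultipleFlagsChecks)
{
    if (!allowMultipleFlagsChecks && (GetDesc(cond).jumpKind2 != EJ_NONE))
    {
        return false;
    }
    return true;
}
```

Keeping the decision inside the lowering routine means there is no separate ccmp-specific predicate that has to be kept in sync with it.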
Latest SuperPMI results after rebasing and making some changes:
Diffs are based on 2,669,439 contexts (1,084,276 MinOpts, 1,585,163 FullOpts).
MISSED contexts: 1,455 (0.05%)
Base JIT options: JitBypassApxCheck=1
Diff JIT options: JitBypassApxCheck=1;EnableApxConditionalChaining=1
Overall (-148,632 bytes)
FullOpts (-148,632 bytes)
Note that right now we have found a small bug that appears due to an implicit dependency on […]. Once we have this ironed out, I can open this as a PR ready for review.
The following are the latest SuperPMI results, rebased on main. Note that I am comparing against a patched version of main that forces APX (JitBypassApxCheck), so the diffs come from the conditional chaining optimization alone.
Diffs are based on 2,673,047 contexts (1,085,228 MinOpts, 1,587,819 FullOpts).
MISSED contexts: 123 (0.00%)
Base JIT options: JitBypassApxCheck=1
Diff JIT options: JitBypassApxCheck=1;EnableApxConditionalChaining=1
Overall (-140,494 bytes)
FullOpts (-140,494 bytes)
Attaching the full summary for reference:
@jakobbotsch @BruceForstall this is ready for review. It would be helpful if we could discuss any major changes or concerns regarding this optimization (it will be off by default, and it can be tuned later).
LGTM. Just a few nits: we don't want names in comments, so please fix the "TODO".
@@ -8908,6 +8954,58 @@ void CodeGen::genEmitHelperCall(unsigned helper, int argSize, emitAttr retSize,
    regSet.verifyRegistersUsed(killMask);
}

insOpts CodeGen::OptsFromCFlags(insCflags flags)
Please add standard function header comments to this function, and the next one.
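For reference, a JIT function header is a banner comment followed by a one-line summary and `Arguments:` / `Return Value:` sections (the same shape as the `optSwitchDetectAndConvert` header quoted later in this review). The sketch below shows that shape on a function modeled loosely on `CodeGen::OptsFromCFlags`; the `Cflags`/`Opts` enums, their bit values and the "default flags value" interpretation are hypothetical, not the JIT's actual `insCflags`/`insOpts` definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits (stand-in for insCflags).
enum Cflags : uint8_t
{
    CFLAGS_NONE = 0,
    CFLAGS_CF   = 1 << 0,
    CFLAGS_ZF   = 1 << 1,
    CFLAGS_SF   = 1 << 2,
    CFLAGS_OF   = 1 << 3,
};

// Hypothetical instruction options encoding a ccmp default flags value (stand-in for insOpts).
enum Opts : uint32_t
{
    OPTS_NONE   = 0,
    OPTS_DFV_CF = 1 << 4,
    OPTS_DFV_ZF = 1 << 5,
    OPTS_DFV_SF = 1 << 6,
    OPTS_DFV_OF = 1 << 7,
};

//------------------------------------------------------------------------
// OptsFromCFlagsSketch: Convert a set of condition flags into the instruction
//    options that encode them as a ccmp default flags value.
//
// Arguments:
//    flags - the flags the ccmp should set when its condition is false
//
// Return Value:
//    The corresponding combination of OPTS_DFV_* options.
//
Opts OptsFromCFlagsSketch(Cflags flags)
{
    uint32_t opts = OPTS_NONE;
    if (flags & CFLAGS_CF)
    {
        opts |= OPTS_DFV_CF;
    }
    if (flags & CFLAGS_ZF)
    {
        opts |= OPTS_DFV_ZF;
    }
    if (flags & CFLAGS_SF)
    {
        opts |= OPTS_DFV_SF;
    }
    if (flags & CFLAGS_OF)
    {
        opts |= OPTS_DFV_OF;
    }
    return static_cast<Opts>(opts);
}
```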
@@ -443,6 +443,7 @@ RELEASE_CONFIG_INTEGER(EnableArm64Sve, "EnableArm64Sve",
RELEASE_CONFIG_INTEGER(EnableEmbeddedBroadcast, "EnableEmbeddedBroadcast", 1) // Allows embedded broadcasts to be disabled
RELEASE_CONFIG_INTEGER(EnableEmbeddedMasking, "EnableEmbeddedMasking", 1) // Allows embedded masking to be disabled
RELEASE_CONFIG_INTEGER(EnableApxNDD, "EnableApxNDD", 0) // Allows APX NDD feature to be disabled
RELEASE_CONFIG_INTEGER(EnableApxConditionalChaining, "EnableApxConditionalChaining", 0) // Allows APX conditional compare chaining
Do we really need a release knob for this?
I think for now it's good to have it as a knob until we are able to tune it on APX hardware (if needed).
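As a rough illustration of how an off-by-default knob gates the optimization, the sketch below reads a `DOTNET_`-prefixed environment variable. `ParseKnob`, `ReadKnob` and `IsConditionalChainingEnabled` are hypothetical helpers; the JIT reads knobs through its own configuration machinery, whose parsing rules differ from this code.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Parse a raw knob value, falling back to the default when unset or malformed.
int ParseKnob(const char* raw, int defaultValue)
{
    if ((raw == nullptr) || (*raw == '\0'))
    {
        return defaultValue;
    }
    char* end   = nullptr;
    long  value = std::strtol(raw, &end, 10);
    if (*end != '\0')
    {
        return defaultValue; // no digits or trailing garbage: treat as malformed
    }
    return static_cast<int>(value);
}

// Read a knob from the environment using the DOTNET_ prefix.
int ReadKnob(const char* name, int defaultValue)
{
    std::string var = std::string("DOTNET_") + name;
    return ParseKnob(std::getenv(var.c_str()), defaultValue);
}

// Off by default, matching the default of 0 in the RELEASE_CONFIG_INTEGER declaration.
bool IsConditionalChainingEnabled()
{
    return ReadKnob("EnableApxConditionalChaining", 0) != 0;
}
```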
src/coreclr/jit/lower.cpp
@@ -4422,6 +4425,11 @@ bool Lowering::TryLowerConditionToFlagsNode(GenTree* parent, GenTree* condition,
    GenTree* relopOp2 = relop->gtGetOp2();

#ifdef TARGET_XARCH
    if (!allowMultipleFlagsChecks && cond->IsFloat())
Did you run into problems with my suggestion? Not all float compares require multiple flag checks, so this is unnecessarily conservative.
Also, the `SETCC` case below needs to be updated, and it has an ARM64 ifdef that should be enabled for xarch too.
I went through and kept it conservative, but I can try your suggestion. I am away part of this week and all of next week; I will give it a try then.
Are there any other major thoughts or comments?
I'm looking at the code now and I understand a bit better. Will make the changes and try.
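To see why treating every float compare as needing two checks is conservative, here is a simplified model of the flags `ucomisd` produces (unordered sets ZF, PF and CF; less-than sets CF; equal sets ZF; greater-than clears all three). The function names are hypothetical; the point is that ordered greater-than and greater-or-equal can each be tested with a single condition (`ja`/`jae`), while ordered equality needs ZF set and PF clear.

```cpp
#include <cassert>
#include <cmath>

// The three flags ucomisd sets that matter here (simplified model).
struct FpFlags
{
    bool zf;
    bool pf;
    bool cf;
};

// Models `ucomisd a, b`.
FpFlags Ucomisd(double a, double b)
{
    if (std::isnan(a) || std::isnan(b))
    {
        return {true, true, true}; // unordered
    }
    if (a < b)
    {
        return {false, false, true};
    }
    if (a == b)
    {
        return {true, false, false};
    }
    return {false, false, false};
}

// Ordered a > b: one `ja` check (CF=0 and ZF=0); NaN sets CF, so it fails correctly.
bool OrderedGreater(FpFlags f)
{
    return !f.cf && !f.zf;
}

// Ordered a >= b: one `jae` check (CF=0).
bool OrderedGreaterOrEqual(FpFlags f)
{
    return !f.cf;
}

// Ordered a == b: needs ZF=1 *and* PF=0, i.e. two flag checks.
bool OrderedEqual(FpFlags f)
{
    return f.zf && !f.pf;
}

// Unordered-or-not-equal: ZF=0 *or* PF=1, again two flag checks.
bool UnorderedNotEqual(FpFlags f)
{
    return !f.zf || f.pf;
}
```

Less-than and less-or-equal can reuse the single-check forms by swapping the operands, so in this model only ordered equality and unordered not-equal genuinely need two checks.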
Are there any other major thoughts or comments?
No, I think this is starting to look good. I just had a comment about unifying `TryLowerAndOrToCCMP` between the platforms.
#if defined(TARGET_AMD64)
        case GT_CCMP:
            genCodeForCCMP(treeNode->AsCCMP());
            break;
#endif
What prevents this from being supported on x86?
IIUC, APX is only available in Intel 64-bit mode.
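As a rough model of what a `GT_CCMP` node computes, the sketch below simulates APX `ccmp` on the four flags its default flags value (dfv) controls: if the ccmp's condition holds on the incoming flags it performs the compare, otherwise it sets those flags from the dfv. The helper names (`Cmp`, `Ccmp`, `AndChain`) are hypothetical and the flag model is simplified (PF and AF are ignored); it only illustrates how `(x == c1) && (y < c2)` can be evaluated with one compare, one ccmp and a single final condition.

```cpp
#include <cassert>
#include <cstdint>

// Only the four flags a ccmp default flags value (dfv) can set are modeled.
struct Flags
{
    bool of, sf, zf, cf;
};

enum Cc
{
    CC_E,
    CC_NE,
    CC_L,
    CC_GE
};

// Flags produced by `cmp a, b` on 64-bit operands (i.e. by computing a - b).
Flags Cmp(int64_t a, int64_t b)
{
    uint64_t ua = static_cast<uint64_t>(a);
    uint64_t ub = static_cast<uint64_t>(b);
    uint64_t r  = ua - ub;
    Flags    f;
    f.zf = (r == 0);
    f.sf = ((r >> 63) & 1) != 0;
    f.cf = ua < ub;
    // Signed overflow: operand signs differ and the result's sign differs from a's.
    f.of = (((ua ^ ub) & (ua ^ r)) >> 63) != 0;
    return f;
}

bool Eval(Cc cc, Flags f)
{
    switch (cc)
    {
        case CC_E:  return f.zf;
        case CC_NE: return !f.zf;
        case CC_L:  return f.sf != f.of;
        case CC_GE: return f.sf == f.of;
    }
    return false;
}

// ccmp with condition `cc`: if `cc` holds on the incoming flags, compare a and b;
// otherwise the flags are simply loaded from the default flags value.
Flags Ccmp(Cc cc, Flags in, int64_t a, int64_t b, Flags dfv)
{
    return Eval(cc, in) ? Cmp(a, b) : dfv;
}

// (x == c1) && (y < c2) without a second branch:
//   cmp x, c1 ; ccmp (condition E, dfv) y, c2 ; then test L
// If x != c1, the dfv must make L false, i.e. SF == OF; all flags clear works.
bool AndChain(int64_t x, int64_t c1, int64_t y, int64_t c2)
{
    const Flags dfvFalse = {false, false, false, false};
    Flags       f        = Cmp(x, c1);
    f                    = Ccmp(CC_E, f, y, c2, dfvFalse);
    return Eval(CC_L, f);
}
```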
@jakobbotsch I think the failing tests are unrelated. I will grab one final set of SPMI results for documentation. Are there any other changes required?
src/coreclr/jit/lower.cpp
#ifdef TARGET_XARCH
    if (!allowMultipleFlagsChecks)
    {
        const GenConditionDesc& desc = GenConditionDesc::Get(*cond);

        if (desc.oper != GT_NONE)
        {
            return false;
        }
    }
#endif
Suggested change:
-#ifdef TARGET_XARCH
 if (!allowMultipleFlagsChecks)
 {
     const GenConditionDesc& desc = GenConditionDesc::Get(*cond);
     if (desc.oper != GT_NONE)
     {
         return false;
     }
 }
-#endif
src/coreclr/jit/lowerxarch.cpp
// (NOTE: This just has to make the condition be true, i.e., if the condition calls for (SF ^ OF), then
// returning one will suffice)
//
// TODO-XArch-APX: Revisit this
Why this TODO?
This isn't needed; it is left over from before I understood how the code worked. Internal note.
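The quoted comment's point is that a default flags value (dfv) only needs to make the consuming condition true: for a signed less-than, which tests SF ^ OF, setting SF alone suffices. The brute-force helper below is a hypothetical illustration of that reasoning over a simplified four-flag model (OF, SF, ZF, CF), not how the JIT actually chooses the dfv.

```cpp
#include <cassert>

struct Flags
{
    bool of, sf, zf, cf;
};

// Hypothetical subset of x86 condition codes.
enum Cc
{
    CC_E, CC_NE, CC_L, CC_GE, CC_LE, CC_G, CC_B, CC_AE, CC_BE, CC_A
};

bool Eval(Cc cc, Flags f)
{
    switch (cc)
    {
        case CC_E:  return f.zf;
        case CC_NE: return !f.zf;
        case CC_L:  return f.sf != f.of;
        case CC_GE: return f.sf == f.of;
        case CC_LE: return f.zf || (f.sf != f.of);
        case CC_G:  return !f.zf && (f.sf == f.of);
        case CC_B:  return f.cf;
        case CC_AE: return !f.cf;
        case CC_BE: return f.cf || f.zf;
        case CC_A:  return !f.cf && !f.zf;
    }
    return false;
}

// Return some dfv under which `cc` evaluates to `wanted`. Any satisfying
// assignment will do, so searching the 16 OF/SF/ZF/CF combinations is enough here.
Flags DfvFor(Cc cc, bool wanted)
{
    for (int bits = 0; bits < 16; bits++)
    {
        Flags f = {(bits & 8) != 0, (bits & 4) != 0, (bits & 2) != 0, (bits & 1) != 0};
        if (Eval(cc, f) == wanted)
        {
            return f;
        }
    }
    assert(false && "each modeled condition and its negation are satisfiable");
    return Flags{false, false, false, false};
}
```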
//
// Return Value:
//    True if the conversion was successful, false otherwise
//
-bool Compiler::optSwitchDetectAndConvert(BasicBlock* firstBlock)
+bool Compiler::optSwitchDetectAndConvert(BasicBlock* firstBlock, bool testingForConversion)
What is this check for, and why does it affect x64 and not arm64?
I added this check when I saw that on x64, with conditional chaining, an optimized switch statement (jump table) was being converted into a long series of `ccmp` instructions.
With respect to ARM, I have not dug into that yet.
Can you post the x64 diffs comparing the PR without this and the PR with this?
OK on arm64; we can try that as a follow-up to see how it is affected.
Yes, I have updated: #111072 (comment)
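The concern in this thread was that a chain of equality tests that would otherwise become a jump table was instead being folded into a long `ccmp` sequence. A minimal sketch of a guard for that case, assuming a simplified representation: the `EqTest` struct, the `kMinSwitchCases` threshold and both functions are hypothetical, and only mimic the idea of asking the switch-detection logic "would this convert?" before forming a ccmp chain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One equality test in a chain of conditional branches: "if (local == value) goto ...".
struct EqTest
{
    int  localNum;
    long value;
};

// Hypothetical threshold: chains at least this long are left for switch conversion.
constexpr std::size_t kMinSwitchCases = 3;

// True if every test compares the same local against a constant and the chain
// is long enough to be worth turning into a switch.
bool LooksLikeSwitch(const std::vector<EqTest>& chain)
{
    if (chain.size() < kMinSwitchCases)
    {
        return false;
    }
    for (const EqTest& test : chain)
    {
        if (test.localNum != chain[0].localNum)
        {
            return false;
        }
    }
    return true;
}

// Back off from ccmp chaining when switch conversion looks like the better choice.
bool ShouldFormCcmpChain(const std::vector<EqTest>& chain)
{
    return !LooksLikeSwitch(chain);
}
```

A query-only mode like this lets lowering reuse detection logic without changing the flow graph; the `testingForConversion` parameter added in this PR appears to serve a similar purpose.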
FYI, latest SuperPMI results with APX and no conditional chaining as the base:
Diffs are based on 2,603,000 contexts (1,081,295 MinOpts, 1,521,705 FullOpts).
MISSED contexts: 3,511 (0.13%)
Base JIT options: JitBypassApxCheck=1
Diff JIT options: JitBypassApxCheck=1;EnableApxConditionalChaining=1
Overall (-118,918 bytes)
FullOpts (-118,918 bytes)
Diffs with […]:
Diffs are based on 2,602,983 contexts (1,081,172 MinOpts, 1,521,811 FullOpts).
MISSED contexts: 3,717 (0.14%)
Base JIT options: JitBypassApxCheck=1
Diff JIT options: JitBypassApxCheck=1;EnableApxConditionalChaining=1
Overall (-98,217 bytes)
FullOpts (-98,217 bytes)
@anthonycanino Please fix the formatting issues.
src/coreclr/jit/lower.cpp
@@ -4460,6 +4463,18 @@ bool Lowering::TryLowerConditionToFlagsNode(GenTree* parent, GenTree* condition,
    }
#endif

#if !defined(TARGET_LOONGARCH64) && !defined(TARGET_RISCV64)
    if (!allowMultipleFlagsChecks && cond->IsFloat())
Suggested change:
-if (!allowMultipleFlagsChecks && cond->IsFloat())
+if (!allowMultipleFlagsChecks)
Nit (the check below also doesn't have this).
LGTM with a nit.
/ba-g azure linux timeouts
Thanks for all the review efforts @jakobbotsch and @BruceForstall ! |
* Enable conditional compare chaining for AMD64.
* Reduce duplication from `optSwitchDetectLikely`.
* Update src/coreclr/jit/lsrabuild.cpp
  Co-authored-by: Bruce Forstall <[email protected]>
* Update src/coreclr/jit/lowerxarch.cpp
  Co-authored-by: Bruce Forstall <[email protected]>
* Update src/coreclr/jit/lowerxarch.cpp
  Co-authored-by: Bruce Forstall <[email protected]>
* Widen the potential candidates for ccmp folding.
  Also lifts GenConditionDesc into CodeGenInterface to better check which flag lowerings will produce multiple instructions.
* Refactor some common code into lower.cpp.
  Some code will conflict with latest changes. I've squashed so we can discuss how to merge in properly.
* Refactored common code out.
* Review edits.
* Fix build errors.
* Formatting.
---------
Co-authored-by: Bruce Forstall <[email protected]>
Design
This PR mostly enables the existing conditional chaining logic for x64, now that the APX `ccmp` instruction is available. Currently, the optimization must be explicitly enabled via `DOTNET_EnableApxConditionalChaining=1`.
Testing
Note: the testing plan for APX work was discussed in #106557; please refer to that PR for details. Only results and comments will be posted in this PR.