-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
gh-134584: Decref elimination for float ops in the JIT #134588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-134584: Decref elimination for float ops in the JIT #134588
Conversation
@tomasr8 @brandtbucher 7.5% speedup in nbody on the JIT in this PR 👀. I removed some of the optimizations in this PR to make things easier to review, and also to split up work later! |
Whether we can skip the refcount operation is a property of the reference, not the object. You'll need to remove the |
The optimizer in this PR uses Consider the following operation:
Then it's trivial to see that x and y have strong references elsewhere. Thus they can just become
Basically, I'm leveraging on the work that Matt did to do cheap and safe reference analysis. I think it should be possible to just do this for all |
Yeah this is not safe. If we assign it back to a locals, it needs to be an owned reference, not borrowed. |
I'm internally deliberating on this one. It's a lot of churn; a few thousand lines of diff. I don't think we need to. We used the tagged pointer representation over PyObject because PyObject has proper lifetimes, reference management, and also to save space. Meanwhile JitOptSymbol has no lifetime management at all (it's effectively immortal for the context of the optimizer), needs no reference management, and if we want to save space, we can just reuse the memory in the thread state. We basically get none of the associated benefits of the _PyStackRef API (sound lifetime semantics), for all the associated negatives like the code churn. If this needs better naming, I'm all for it. Eventually, I want to expand this so it's more powerful than the stackref API -- skipping reference counts when we know it's truly safe to skip, not just like stackrefs where we skip only because it's on the stack. For example, take the following code:
A smart optimizer would be able to skip all refcounting operations on a,b,c,d as long as x is kept live. As we know x holds an owned reference to a b c d. This is why skipping refcounting is a property of the symbol, not a property of the stackref, and it should be kept as part of the symbol. |
I've renamed it to clarify that the main property is observing the strong references are held by someone else, not whether we can skip refcounts or not. |
On second thought, this name is even worse. It's confusing and error-prone. |
This reverts commit 5c429b6.
I'll repeat what I said before "Whether we can skip the refcount operation is a property of the reference, not the object."
Will keep leaking as |
To be pedantic, that won't leak as we don't trace from function entry at the moment. So the optimizer will never see that as a constant symbol. But yes I get your point, let me just go through the churn then. |
@markshannon I have done the refactor like you requested. |
@markshannon if you don't mind, I plan to merge this soon, as it will be a massive conflict with everyone else's JIT work. |
I know it is frustrating, having to wait for reviews, but we shouldn't be merging anything non-trivial without reviews. I'll try to review ASAP, or maybe @brandtbucher can take a look? |
Sorry yeah that's core dev 101. Will be more patient in the future. |
Are all the name changes from |
Yes it's necessary. The problem is that we atill have some functions that used internally in optimizer symbols.c that operate on symbols (so they start with sym), while some functions operate on refs. So we need to distinguish them. |
Would it be better to rename the few internal function calls and leave the many external functions unchanged? |
I think the |
We already were operating on references to symbols. Pointers are references |
Also, I think we drop "steal" and "borrow" from the API converting tagged values to pointers. I think it is confusing.
|
@markshannon I don't agree with leaving them sym, but it's not something I'm willing to duel over. So I'll just leave it as is. As for not using the stackref API, I agree with that. I realised the symbols have no lifetimes, so naming them with lifetimes makes no sense. |
Python/optimizer_bytecodes.c
Outdated
@@ -316,6 +320,10 @@ dummy_func(void) { | |||
else { | |||
res = sym_new_type(ctx, &PyFloat_Type); | |||
} | |||
// TODO (gh-134584): Move this to the optimizer generator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to do this "automatically". What we do want to do is to move almost all the Py_DECREF
s into a few ops like POP_TOP
and "manually" optimize those few ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put a comment and said to refactor this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This looks good now.
I think the next step is to start refactoring uops that have DECREFs to put the DECREFs into a uop at the end.
…H-134588) This PR adds a PyJitRef API to the JIT's optimizer that mimics the _PyStackRef API. This allows it to track references and their stack lifetimes properly. Thus opening up the doorway to refcount elimination in the JIT.
The key idea is to remove decref of inputs of common operations in the JIT, if they are deemed to be borrowed/using borrowed refcounts. We rely on the base interpreter's static analysis to determine if things are borrowed.
For the float operations, these can't be generated using the cases generator, because they use the
FromDoubleConsumeFloat
things which are notDECREF_INPUT
.For the other list/tuple stuff, or anything that use
DECREF_INPUT
s, we can automatically generate the instruction variants by not emitting the decref at all. However, I left that out of this PR as this is the initial proof of conceptRoughly a 7.5% speedup on nbody just from these optimizations alone!
It's still slower than the normal interpreter on my system, but it's a good start.
I removed the list ops at ad03b1e to make the PR easier to review, we can add them back in later for the speed boost!