gh-144569: Avoid creating temporary objects in BINARY_SLICE for list, tuple, and unicode #144590
markshannon merged 19 commits into python:main from
Conversation
You forgot to add the new recorder to the BINARY_SLICE macro op.
Fidget-Spinner
left a comment
LGTM; just one comment, and please add a news entry.
Actually, I've changed my mind. Can you open a PR to convert BINARY_SLICE to the POP_TOP form? Then, after we merge that PR, we can update this one.
Fidget-Spinner
left a comment
Let's leave the DECREF_INPUTS alone for now and try to remove it in a future PR.
The code has repeated instances that should be factored out into a micro-op.
Misc/NEWS.d/next/Core_and_Builtins/2026-02-08-13-14-00.gh-issue-144569.pjlJVe.rst
This is pretty close. I'm going to push some changes, then wait a day to see if Mark has any objections.
Thanks for your guidance!
1. We cannot exit after unpacking the indices, as the stack then contains tagged ints, which may lead to a crash. We must insert a guard for the type before unpacking the indices.
2. It is possible for an index not to fit in a tagged int, in which case we must deopt.
3. Recorded values do not guarantee the actual type we will see at runtime, so we must guard on the type as well.
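Point 2 above is visible from pure Python: slice bounds are clamped to the sequence length, so an index far outside any tagged-int range is still legal and must be handled, not crash. A small sketch (not code from the PR):

```python
# Slice bounds in Python are clamped to the sequence length, so an index
# far too large for a tagged int (or any machine word) is still legal.
# A fast path that cannot represent such an index must deopt to the
# generic slow path rather than crash.
lst = [1, 2, 3]
huge = 2 ** 70                 # far outside any tagged-int range

print(lst[0:huge])             # upper bound clamped to len(lst)
print(lst[-huge:2])            # lower bound clamped to 0
```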
@cocolato please check out the latest commit and the extended commit message for the bugs fixed. Thanks!
Results using hyperfine show we made it >10% faster. Nice!
LGTM! Thanks again for the fix, I learned a lot from it.
This is my first time learning about this tool, and it looks great. In the past, when I needed to run benchmarks locally, …
markshannon
left a comment
Thanks for doing this.
This looks sound, but there are some inefficiencies.
Beware the size of uops. They will be converted to machine code, so keeping code size down is important for performance.
Thanks for making the requested changes! @markshannon: please review the changes made to this pull request.
Sorry for the delay in getting back to this. I don't think we want …. I think we should split this into two PRs: the first with just the interpreter changes (bytecodes.c), leaving the JIT optimizations for another PR.
To make it worth splitting the operation into uops, we need to be able to optimize the uops effectively. That would mean handling the index adjustments in the optimizer as much as possible and leaving only minimal checking to the runtime. For example, take …, which looks bulky, but each of the uops is only a few machine instructions. (In the above, ….)
So, let's keep it simple for the first PR, and just add special cases for lists, tuples and strings in ….
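For context (an illustration, not code from the PR): in recent CPython (3.12+), only two-argument slices compile to the BINARY_SLICE instruction, while a three-argument slice builds a slice object and takes the generic subscript path, so only the former can hit these special cases. The `dis` inspection below assumes a 3.12+ interpreter:

```python
# Sketch: inspect which slice forms are eligible for the BINARY_SLICE
# fast path in recent CPython (3.12+). A step expression forces
# BUILD_SLICE plus a generic subscript instead.
import dis

def two_arg(s, i, j):
    return s[i:j]          # two-argument slice: eligible for BINARY_SLICE

def three_arg(s, i, j):
    return s[i:j:2]        # step present: BUILD_SLICE + generic subscript

for f in (two_arg, three_arg):
    ops = {ins.opname for ins in dis.get_instructions(f)}
    print(f.__name__, "uses BINARY_SLICE:", "BINARY_SLICE" in ops)
```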
@markshannon Thanks for the review! I have reimplemented the fast path for ….
Windows tests failed; I'm trying to find the cause.
It might be worth merging in the latest main, to see if that fixes it.
On my test machine, this optimization gives about a 16-17% speed-up. test.py: … bench result: …
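The actual test.py and its numbers are not shown above; a hypothetical microbenchmark in the same spirit (all names here are made up) might look like:

```python
# Hypothetical microbenchmark: time repeated two-argument slices of a
# list, a tuple, and a str -- the three types the PR special-cases.
import timeit

lst = list(range(1000))
tup = tuple(lst)
txt = "x" * 1000

def bench(obj, number=100_000):
    # obj[10:900] is exactly the shape BINARY_SLICE handles.
    return timeit.timeit(lambda: obj[10:900], number=number)

for name, obj in (("list", lst), ("tuple", tup), ("str", txt)):
    print(f"{name}: {bench(obj):.3f}s")
```

Running this on builds with and without the PR would reproduce a speed-up figure of this kind.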
markshannon
left a comment
Thanks
I've a couple of suggestions on how to structure the code, but other than that LGTM.
The performance numbers are really good.
Co-authored-by: Mark Shannon <[email protected]>
Thanks for the review again! Updated.
Optimize BINARY_SLICE for list, tuple, and unicode objects with int/None indices. This is the first step of the BINARY_SLICE optimization; we will implement more JIT optimizations after this PR.
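Concretely, the "int/None indices" forms the fast path covers are two-argument slices whose bounds are ints or omitted; anything else falls back to the generic path. A sketch for illustration (`Idx` is a made-up class, not from the PR):

```python
lst = [0, 1, 2, 3, 4]

print(lst[1:3])    # int / int
print(lst[:2])     # None / int (omitted lower bound)
print(lst[2:])     # int / None (omitted upper bound)
print(lst[:])      # None / None (still copies the list)

# A non-int bound goes through __index__ and takes the generic path:
class Idx:
    def __index__(self):
        return 1

print(lst[Idx():3])
```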