YJIT: Interleave inline and outlined code blocks #6460
Thank you for tackling this challenging problem Kokubun. I think generally speaking your approach looks good though I am not sure how set_write_pos / get_write_pos will work in this new scheme, particularly with the inline/outlined region no longer being contiguous. I wrote some comments/questions. |
I pushed 90c7b01 to make |
I figured out how to fix a few immediate problems, but didn't get to fix everything I encountered next. This PR still needs some more work. It has been challenging to make codegen work when the assembler attempts to write code but there's not enough space in the current page. I've got a few ideas:
1 and 2 would have essentially the same code layout, but there's some difference between 1 and 3. While 3 would minimize the code page size, it's more likely to split the code of a single ISEQ across two pages. 1 doesn't have the split-page problem; however, it would leave gaps at the end of every code page. I'm leaning towards implementing whichever is simpler first (still thinking about which one actually is), but it's worth evaluating the performance difference after that. |
Solution #3 might be better simply because we're likely to end up needing a jump anyway if a block has to be compiled in the next page. Might as well go for the solution that makes things the most transparent and minimizes memory usage? I think it's ok to keep things simple and assume that all instructions need 64 bytes available, otherwise we need to change page. The main caveat that I can see is that we could end up in a situation where we insert a jump in the middle of a patchable/rewritable branch, which could be annoying to deal with? Though as long as there are enough bytes left in the page so that we could generate the whole rewritable branch in one block, it shouldn't be a problem. |
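A minimal sketch of the "assume every instruction needs 64 bytes" rule from the comment above. The names `MAX_INSN_BYTES`, `needs_page_change`, and `branch_fits` are hypothetical and are not taken from the YJIT source; the sketch only shows the arithmetic of the check.

```rust
/// Assumed worst-case size of a single instruction (the 64 bytes suggested
/// in the comment above). Hypothetical constant, not a YJIT name.
const MAX_INSN_BYTES: usize = 64;

/// Returns true when fewer than MAX_INSN_BYTES remain before `region_end`,
/// meaning the next instruction should be emitted on a new page.
fn needs_page_change(write_pos: usize, region_end: usize) -> bool {
    region_end.saturating_sub(write_pos) < MAX_INSN_BYTES
}

/// Returns true when a whole rewritable branch of `branch_bytes` fits in the
/// space that is left, so no page-switch jump would land in the middle of it.
fn branch_fits(write_pos: usize, region_end: usize, branch_bytes: usize) -> bool {
    region_end.saturating_sub(write_pos) >= branch_bytes
}
```

With a check like this, the page switch happens before an instruction is started rather than in the middle of it, at the cost of leaving up to 63 unused bytes per page.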
Force-pushed from 945008a to 8491112.
Another idea similar to 3 is that we could attempt a write and retry it after
Debugging the current failure on Arm might be helpful if anybody wants to make progress on this project this Friday. |
That might work but we also need to be able to guarantee that we are able to write the largest possible size of a jump to the next page. There are also potentially edge cases if we switch page in the middle of the branch instruction at the end of a block... Although I think that should work as well... Provided that we always succeed in allocating a next page.
Welcome back to America btw! I'm going to be away at a burst most of next week so still difficult to pair, though at least we will be in closer time zones so we should be able to chat on Slack in real time. Looking forward to pairing the week after :) I should mention, @XrXr should be available to pair next week. He knows this codebase and CRuby very well so should definitely be able to help. |
Current status: Alan and I paired on the immediate issue of this PR. We see a couple of assertion failures on "branch stubs should never enlarge branches" and PC calculation. Because of the number of problems I keep running into, it felt like another option could be simpler. I was thinking about choosing one of the following two options, summarizing the current blockers of each option:
Because of the cons and the blockers, I had almost given up on Solution #1, but after revisiting Solution #2, I noticed the thing that I wrote as the blocker. The next step is to explore the idea of possibly using a label to address the blocker of Solution #2, and then go back to debugging Solution #1 if it doesn't work. |
I want to note that even with solution 2, we still need to write out jumps sometimes. In |
Thanks for helping with this @XrXr. Trying to think ahead here. In the future we might want to try generating code for multiple blocks at once, so we can have longer instruction sequences to optimize. If we were doing that, reusing the same Assembler for multiple blocks, then giving up if we can't fit the entire sequence in the current page could be problematic (could waste a lot of space). IMO that suggests that an Insn-level retry is probably more flexible.
I'm wondering if there's not a stupid-but-good-enough way to solve this. Something like, we define a
Unless that assertion failure just comes from the fact that the blocks are no longer always next to each other when queued? |
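To make the Insn-level retry mentioned above concrete, here is a rough sketch under simplifying assumptions: pages are plain byte vectors with a fixed capacity, each instruction is a pre-encoded byte slice, and the jump that would link one page to the next is left out. `Page`, `try_write`, and `emit_with_insn_retry` are hypothetical names, not YJIT APIs.

```rust
/// Toy page with a fixed capacity (hypothetical, for illustration only).
struct Page {
    bytes: Vec<u8>,
    cap: usize,
}

impl Page {
    /// Writes the whole instruction or nothing; returns false if it does not fit.
    fn try_write(&mut self, insn: &[u8]) -> bool {
        if self.bytes.len() + insn.len() > self.cap {
            return false;
        }
        self.bytes.extend_from_slice(insn);
        true
    }
}

/// Emits instructions one by one. When a write fails, only that instruction
/// is retried on a fresh page; instructions already written stay in place.
fn emit_with_insn_retry(insns: &[&[u8]], cap: usize) -> Result<Vec<Vec<u8>>, String> {
    let mut pages = vec![Page { bytes: Vec::new(), cap }];
    for (idx, insn) in insns.iter().enumerate() {
        if !pages.last_mut().unwrap().try_write(insn) {
            pages.push(Page { bytes: Vec::new(), cap });
            if !pages.last_mut().unwrap().try_write(insn) {
                return Err(format!("instruction {} does not fit in an empty page", idx));
            }
        }
    }
    Ok(pages.into_iter().map(|p| p.bytes).collect())
}
```

Because the retry is per instruction, a long sequence spanning several blocks only loses the tail of one page, rather than being abandoned and re-emitted as a whole.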
While we still didn't get it working today, we tried a few ideas that we thought would work more easily. I ran out of time before posting a summary, so I'll summarize our current status tomorrow, discussing your points as well. |
At 24a4a42, Solution 1 was passing btest except for test_yjit_30k_ifelse with
What we tried on it yesterday were:
Agreed. We should probably stick to the design unless we find a blocker for it.
We did work on that kind of idea. I think we'll explore more of that on today's pairing |
Alan and I made good progress today. We kept improving and debugging this implementation of Solution 1, and with
The current idea is:
Force-pushed from 39a4ebc to 38c617d.
Happy to see that you and Alan are pairing and that you are making so much progress :) |
Force-pushed from 29f17dd to 62dfd51.
Today I refactored this branch and added an initial x86_64 implementation. arm64 is stable enough to pass btest and test-all once locally, but test-all seems to be crashing on CI. test-spec is mostly passing but stops on
x86_64 next_page seems to cause problems; the next step is to investigate SEGV on
I'll pair with @XrXr on those issues tomorrow. |
Force-pushed from 959812b to d0801b5.
Wonderful. Great work so far :) |
Force-pushed from 725ebd4 to 5f7d990.
Force-pushed from 5f7d990 to a8fba68.
Co-authored-by: Alan Wu <[email protected]> Co-authored-by: Maxime Chevalier-Boisvert <[email protected]>
Force-pushed from a8fba68 to 75cfe66.
included the benchmark numbers for the latest revision in the PR description (not changed much from #6460 (comment)) |
Looks pretty good. I'm very happy you were able to get this in a working state and with good performance.
Force-pushed from 097f619 to a722743.
Well done! Happy with the progress :)
This PR changes the code layout so that inline and outlined code for the same ISEQ are placed closer together. We previously had only two giant blocks, one for inline code and one for outlined code, but this PR alternates many more inline and outlined blocks instead. We'll have fewer distinct ISEQs in each memory page with this layout, so hopefully it'll trigger code GC more often if we check code usages per ISEQ.
This also optimizes Arm JIT code because of smaller relative jump offsets, while Intel's performance is not really impacted.
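The interleaved layout can be pictured with a toy model. This is only a sketch built on the Design notes in this description (shared memory behind `Rc<RefCell<...>>`, each block limited to its half of a page, a `page_fault` flag, and a page advance that moves both blocks). The struct fields, the free function `next_page`, and `demo` are assumptions made for illustration and do not mirror the real YJIT definitions.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Stand-in for YJIT's VirtualMem: just a flat byte buffer (assumption).
struct VirtualMem {
    bytes: Vec<u8>,
}

/// Toy code block that may only write inside its own half of each page.
struct CodeBlock {
    mem: Rc<RefCell<VirtualMem>>, // shared by the inline and outlined blocks
    page_size: usize,
    outlined: bool,   // false: first half (cb), true: second half (ocb)
    page: usize,      // index of the page currently being written
    write_pos: usize, // absolute offset into the shared buffer
    page_fault: bool, // set when a write falls outside this block's half
}

impl CodeBlock {
    fn new(mem: Rc<RefCell<VirtualMem>>, page_size: usize, outlined: bool) -> Self {
        let mut block = CodeBlock { mem, page_size, outlined, page: 0, write_pos: 0, page_fault: false };
        block.write_pos = block.region_start();
        block
    }

    fn region_start(&self) -> usize {
        let half = self.page_size / 2;
        self.page * self.page_size + if self.outlined { half } else { 0 }
    }

    fn region_end(&self) -> usize {
        self.region_start() + self.page_size / 2
    }

    fn write_byte(&mut self, byte: u8) {
        let end = self.region_end();
        let mut mem = self.mem.borrow_mut();
        if self.write_pos < end && self.write_pos < mem.bytes.len() {
            mem.bytes[self.write_pos] = byte;
            self.write_pos += 1;
        } else {
            self.page_fault = true; // the caller decides when to switch pages
        }
    }

    fn advance_page(&mut self) {
        self.page += 1;
        self.write_pos = self.region_start();
        self.page_fault = false;
    }
}

/// Moves both blocks one page ahead, keeping inline and outlined code paired.
fn next_page(cb: &mut CodeBlock, ocb: &mut CodeBlock) {
    cb.advance_page();
    ocb.advance_page();
}

/// Two 8-byte pages: cb fills its 4-byte half, faults, and both blocks advance.
fn demo() -> (Vec<u8>, bool) {
    let mem = Rc::new(RefCell::new(VirtualMem { bytes: vec![0; 16] }));
    let mut cb = CodeBlock::new(mem.clone(), 8, false);
    let mut ocb = CodeBlock::new(mem.clone(), 8, true);
    for b in 1..=5 {
        cb.write_byte(b); // the fifth write does not fit in the half page
    }
    ocb.write_byte(0xA0);
    let faulted = cb.page_fault;
    if faulted {
        next_page(&mut cb, &mut ocb);
        cb.write_byte(5); // retry the byte that faulted
        ocb.write_byte(0xB0);
    }
    let bytes = mem.borrow().bytes.clone();
    (bytes, faulted)
}
```

In `demo`, the inline bytes and the outlined bytes for the same page end up within one page of each other, which is the property that keeps relative jump offsets small.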
Design
- VirtualMem needs to be shared by both inline and outlined code blocks. So it's held as Rc<RefCell<VirtualMem>>.
- Given page_size, CodeBlock can only write to one half of each page_size region. The first half is for cb and the second half is for ocb. When write_byte is attempted outside that half, it sets the page_fault flag.
- When cb or ocb reaches the end of a page, CodeBlock#next_page moves both cb and ocb one page ahead.

Benchmark
It seems like Arm got some speedup and Intel had no change as expected.
x86_64
arm64