YJIT: Interleave inline and outlined code blocks #6460

Merged
merged 4 commits into from
Oct 17, 2022

k0kubun
Member

@k0kubun k0kubun commented Sep 28, 2022

This PR changes the code layout so that inlined and outlined code for the same ISEQ get closer. We previously had only two giant blocks for inline and outlined code, but this PR alternates many more inline and outlined blocks instead. We'll have fewer distinct ISEQs in each memory page with this layout, so hopefully it'll trigger code GC more often if we check code usage per ISEQ.

This also optimizes Arm JIT code because of smaller relative jump offsets, while Intel's performance is not really impacted.

Design

  • To keep consecutive virtual memory addresses, a single large VirtualMem needs to be shared by both inline and outlined code blocks. So it's held as Rc<RefCell<VirtualMem>>.
  • Given a page_size, CodeBlock can only write to half of each page. The first half is for cb and the second half is for ocb. When write_byte is attempted outside that half, it sets the page_fault flag.
  • When cb or ocb reaches the end of a page, CodeBlock#next_page moves both cb and ocb one page ahead.
  • For a code page size, I'm using fixed 16KiB instead of a native page size to let Linux and macOS behave similarly.
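
A minimal sketch of the layout described in the list above, assuming a plain `Vec<u8>` as a stand-in for VirtualMem. The `Block` type, its fields, and the free function `next_page` are hypothetical names for illustration and are not the actual YJIT CodeBlock API; only the 16KiB page size, the half-page split, and the page_fault flag come from the description.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Fixed code page size from the PR description (16 KiB).
const PAGE_SIZE: usize = 16 * 1024;
/// Each block may only write to one half of every page.
const HALF: usize = PAGE_SIZE / 2;

/// Hypothetical stand-in for a CodeBlock sharing one memory region.
struct Block {
    mem: Rc<RefCell<Vec<u8>>>, // shared by the inline and outlined blocks
    half_base: usize,          // 0 for cb (first half), HALF for ocb (second half)
    page: usize,               // index of the page currently being written
    offset: usize,             // bytes already written inside this block's half
    page_fault: bool,          // set when a write falls outside the half
}

impl Block {
    fn new(mem: Rc<RefCell<Vec<u8>>>, half_base: usize) -> Self {
        Block { mem, half_base, page: 0, offset: 0, page_fault: false }
    }

    /// Absolute write position inside the shared memory.
    fn write_pos(&self) -> usize {
        self.page * PAGE_SIZE + self.half_base + self.offset
    }

    /// Write one byte, or set `page_fault` if this block's half is full.
    fn write_byte(&mut self, byte: u8) {
        let pos = self.write_pos();
        let in_bounds = pos < self.mem.borrow().len();
        if self.offset < HALF && in_bounds {
            self.mem.borrow_mut()[pos] = byte;
            self.offset += 1;
        } else {
            self.page_fault = true;
        }
    }
}

/// Move both blocks to the start of their halves in the following page.
fn next_page(cb: &mut Block, ocb: &mut Block) {
    let page = cb.page.max(ocb.page) + 1;
    cb.page = page;
    cb.offset = 0;
    cb.page_fault = false;
    ocb.page = page;
    ocb.offset = 0;
    ocb.page_fault = false;
}
```

Because both blocks hold the same `Rc<RefCell<...>>`, their halves stay within one consecutive address range, which is the point of sharing a single VirtualMem.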

Benchmark

It seems like Arm got some speedup and Intel had no change as expected.

x86_64

before: ruby 3.2.0dev (2022-10-14T16:16:21Z master 7e24ebc649) +YJIT [x86_64-linux]
after: ruby 3.2.0dev (2022-10-14T16:36:14Z yjit-code-layout 75cfe661c5) +YJIT [x86_64-linux]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
railsbench  1111.0       1.6         1109.5      2.0         1.00          0.99
----------  -----------  ----------  ----------  ----------  ------------  -------------
$ ./run_benchmarks.rb railsbench -e "before::$(rbenv ruby yjit-release-before) --yjit" -e "after::$(rbenv ruby yjit-release-after) --yjit"
Running benchmark "railsbench" (1/1)
setarch x86_64 -R taskset -c 7 /home/k0kubun/.rbenv/versions/yjit-release-before/bin/ruby --yjit -I ./harness benchmarks/railsbench/benchmark.rb
ruby 3.2.0dev (2022-10-14T16:16:21Z master 7e24ebc649) +YJIT [x86_64-linux]
Command: bundle install --quiet
Command: bin/rails db:migrate db:seed
Calling `DidYouMean::SPELL_CHECKERS.merge!(error_name => spell_checker)' has been deprecated. Please call `DidYouMean.correct_error(error_name, spell_checker)' instead.
Deleted all 100 posts
Creating 100 posts....................................................................................................
itr #1: 1236ms
itr #2: 1108ms
itr #3: 1073ms
itr #4: 1138ms
itr #5: 1107ms
itr #6: 1075ms
itr #7: 1108ms
itr #8: 1106ms
itr #9: 1142ms
itr #10: 1108ms
itr #11: 1109ms
itr #12: 1106ms
itr #13: 1108ms
itr #14: 1106ms
itr #15: 1140ms
itr #16: 1109ms
itr #17: 1106ms
itr #18: 1108ms
itr #19: 1106ms
itr #20: 1109ms
itr #21: 1107ms
itr #22: 1108ms
itr #23: 1140ms
itr #24: 1140ms
itr #25: 1074ms
Average of last 10, non-warmup iters: 1111ms
Running benchmark "railsbench" (1/1)
setarch x86_64 -R taskset -c 7 /home/k0kubun/.rbenv/versions/yjit-release-after/bin/ruby --yjit -I ./harness benchmarks/railsbench/benchmark.rb
ruby 3.2.0dev (2022-10-14T16:36:14Z yjit-code-layout 75cfe661c5) +YJIT [x86_64-linux]
Command: bundle install --quiet
Command: bin/rails db:migrate db:seed
Calling `DidYouMean::SPELL_CHECKERS.merge!(error_name => spell_checker)' has been deprecated. Please call `DidYouMean.correct_error(error_name, spell_checker)' instead.
Deleted all 100 posts
Creating 100 posts....................................................................................................
itr #1: 1246ms
itr #2: 1073ms
itr #3: 1141ms
itr #4: 1105ms
itr #5: 1072ms
itr #6: 1140ms
itr #7: 1108ms
itr #8: 1107ms
itr #9: 1141ms
itr #10: 1107ms
itr #11: 1106ms
itr #12: 1107ms
itr #13: 1105ms
itr #14: 1109ms
itr #15: 1122ms
itr #16: 1110ms
itr #17: 1115ms
itr #18: 1111ms
itr #19: 1106ms
itr #20: 1141ms
itr #21: 1108ms
itr #22: 1108ms
itr #23: 1074ms
itr #24: 1145ms
itr #25: 1072ms
Average of last 10, non-warmup iters: 1109ms
Total time spent benchmarking: 60s

end_time: 2022-10-14 09:41:11 PDT (-0700)
before: ruby 3.2.0dev (2022-10-14T16:16:21Z master 7e24ebc649) +YJIT [x86_64-linux]
after: ruby 3.2.0dev (2022-10-14T16:36:14Z yjit-code-layout 75cfe661c5) +YJIT [x86_64-linux]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
railsbench  1111.0       1.6         1109.5      2.0         1.00          0.99
----------  -----------  ----------  ----------  ----------  ------------  -------------
Legend:
- before/after: ratio of before/after time. Higher is better for after. Above 1 represents a speedup.
- after 1st itr: ratio of before/after time for the first benchmarking iteration.

arm64

before: ruby 3.2.0dev (2022-10-14T16:16:21Z master 7e24ebc649) +YJIT [arm64-darwin21]
after: ruby 3.2.0dev (2022-10-14T16:36:14Z yjit-code-layout 75cfe661c5) +YJIT [arm64-darwin21]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
railsbench  747.2        1.7         728.3       1.1         1.03          0.97
----------  -----------  ----------  ----------  ----------  ------------  -------------
$ ./run_benchmarks.rb railsbench -e "before::$(rbenv ruby yjit-release-before-$arch) --yjit" -e "after::$(rbenv ruby yjit-release-after-$arch) --yjit"
Running benchmark "railsbench" (1/1)
/Users/k0kubun/.rbenv/versions/yjit-release-before-arm64/bin/ruby --yjit -I ./harness benchmarks/railsbench/benchmark.rb
ruby 3.2.0dev (2022-10-14T16:16:21Z master 7e24ebc649) +YJIT [arm64-darwin21]
PID: 47578
Command: bundle check 2> /dev/null || bundle install
The Gemfile's dependencies are satisfied
Command: bin/rails db:migrate db:seed
Calling `DidYouMean::SPELL_CHECKERS.merge!(error_name => spell_checker)' has been deprecated. Please call `DidYouMean.correct_error(error_name, spell_checker)' instead.
Deleted all 100 posts
Creating 100 posts....................................................................................................
itr #1: 905ms
itr #2: 741ms
itr #3: 745ms
itr #4: 740ms
itr #5: 760ms
itr #6: 747ms
itr #7: 719ms
itr #8: 765ms
itr #9: 746ms
itr #10: 756ms
itr #11: 742ms
itr #12: 726ms
itr #13: 742ms
itr #14: 761ms
itr #15: 742ms
itr #16: 742ms
itr #17: 742ms
itr #18: 725ms
itr #19: 771ms
itr #20: 743ms
itr #21: 742ms
itr #22: 764ms
itr #23: 736ms
itr #24: 748ms
itr #25: 754ms
Average of last 10, non-warmup iters: 747ms
Running benchmark "railsbench" (1/1)
/Users/k0kubun/.rbenv/versions/yjit-release-after-arm64/bin/ruby --yjit -I ./harness benchmarks/railsbench/benchmark.rb
ruby 3.2.0dev (2022-10-14T16:36:14Z yjit-code-layout 75cfe661c5) +YJIT [arm64-darwin21]
PID: 47697
Command: bundle check 2> /dev/null || bundle install
The Gemfile's dependencies are satisfied
Command: bin/rails db:migrate db:seed
Calling `DidYouMean::SPELL_CHECKERS.merge!(error_name => spell_checker)' has been deprecated. Please call `DidYouMean.correct_error(error_name, spell_checker)' instead.
Deleted all 100 posts
Creating 100 posts....................................................................................................
itr #1: 932ms
itr #2: 789ms
itr #3: 796ms
itr #4: 796ms
itr #5: 820ms
itr #6: 794ms
itr #7: 796ms
itr #8: 815ms
itr #9: 798ms
itr #10: 820ms
itr #11: 789ms
itr #12: 793ms
itr #13: 787ms
itr #14: 721ms
itr #15: 745ms
itr #16: 725ms
itr #17: 726ms
itr #18: 742ms
itr #19: 739ms
itr #20: 719ms
itr #21: 727ms
itr #22: 722ms
itr #23: 723ms
itr #24: 718ms
itr #25: 738ms
Average of last 10, non-warmup iters: 728ms
Total time spent benchmarking: 46s

end_time: 2022-10-14 09:48:28 PDT (-0700)
before: ruby 3.2.0dev (2022-10-14T16:16:21Z master 7e24ebc649) +YJIT [arm64-darwin21]
after: ruby 3.2.0dev (2022-10-14T16:36:14Z yjit-code-layout 75cfe661c5) +YJIT [arm64-darwin21]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
railsbench  747.2        1.7         728.3       1.1         1.03          0.97
----------  -----------  ----------  ----------  ----------  ------------  -------------
Legend:
- before/after: ratio of before/after time. Higher is better for after. Above 1 represents a speedup.
- after 1st itr: ratio of before/after time for the first benchmarking iteration.
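
As a quick check on the two result tables, the ratio columns can be recomputed from the legend's definition (before time divided by after time). This is only a sketch: `ratio` is a hypothetical helper and not part of the benchmark harness, and the inputs are the averages and first-iteration times copied from the logs above.

```rust
/// Ratio of "before" time to "after" time, formatted to two decimals.
/// A value above 1.00 means the "after" build was faster.
fn ratio(before_ms: f64, after_ms: f64) -> String {
    format!("{:.2}", before_ms / after_ms)
}

/// Recompute the four ratio cells of the x86_64 and arm64 tables.
fn table_ratios() -> [String; 4] {
    [
        ratio(1111.0, 1109.5), // x86_64, average of last 10 iterations
        ratio(1236.0, 1246.0), // x86_64, first iteration
        ratio(747.2, 728.3),   // arm64, average of last 10 iterations
        ratio(905.0, 932.0),   // arm64, first iteration
    ]
}
```

The first-iteration ratio is below 1 on both platforms, meaning the first iteration was slightly slower after the change, while the steady-state arm64 ratio is above 1.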

@maximecb
Contributor

Thank you for tackling this challenging problem Kokubun. I think generally speaking your approach looks good though I am not sure how set_write_pos / get_write_pos will work in this new scheme, particularly with the inline/outlined region no longer being contiguous. I wrote some comments/questions.

@maximecb
Contributor

I pushed 90c7b01 to make --yjit-code-page-size a command-line option so we can experiment with different values later 🔬 :)

@k0kubun
Member Author

k0kubun commented Sep 29, 2022

I figured out how to fix a few immediate problems, but didn't get to fix everything I encountered next. This PR still needs some more work.

It has been challenging to make codegen work when the assembler attempts to write code but there's not enough space in the current page. I've got a few ideas:

  1. Current PR: On asm.compile, attempt to write the current page. If it fails, write from the beginning in the next page.
    • asm.compile should also return CodePtr in case the caller needs to recognize the next page as the starting address.
  2. On asm.compile, predict the size of assembled code, and move to the next page as needed.
    • This also requires you to return CodePtr from asm.compile and use it.
    • You need to implement the code size estimation of some Ops.
  3. When emitting an insn, predict the size of its assembled code, and generate a jmp instruction to go to the next page as needed.
    • You don't need to return CodePtr from asm.compile. The current callers should work as is.

1 and 2 would have essentially the same code layout, but there's some difference between 1 and 3. While 3 would minimize the code page size, it's more likely that code of a single ISEQ is split into two pages. 1 doesn't have the split page problem, however, it would leave gaps at the end of every code page.

I'm leaning towards implementing whichever is simpler first (still thinking about which one actually is), but it's worth evaluating the performance difference after that.

@maximecb
Contributor

1 and 2 would have essentially the same code layout, but there's some difference between 1 and 3. While 3 would minimize the code page size, it's more likely that code of a single ISEQ is split into two pages. 1 doesn't have the split page problem, however, it would leave gaps at the end of every code page.

I'm leaning towards implementing whichever is simpler first (still thinking about which one actually is), but it's worth evaluating the performance difference after that.

Solution #3 might be better just because, we're likely to end up needing a jump anyway if a block has to be compiled in the next page. Might as well go for the solution that makes things the most transparent and minimizes memory usage?

I think it's ok to keep things simple and assume that all instructions need 64 bytes available, otherwise we need to change page.

The main caveat that I can see is that we could end up in a situation where we insert a jump in the middle of a patchable/rewritable branch, which could be annoying to deal with? Though as long as there are enough bytes left in the page so that we could generate the whole rewritable branch in one block, it shouldn't be a problem.
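
The "assume every instruction needs 64 bytes" rule can be sketched roughly as follows. This is an illustrative simulation and not YJIT code: `place_insns`, `JMP_SIZE`, and the region size used in the usage note are assumptions; only the 64-byte bound comes from the comment above.

```rust
/// Assumed worst-case encoded size of a single instruction (from the comment above).
const MAX_INSN_SIZE: usize = 64;
/// Hypothetical size of the jump that links one page to the next.
const JMP_SIZE: usize = 5;

/// Simulate where instructions of the given sizes would land.
/// Returns (page index, offset inside that page's writable region) per instruction.
fn place_insns(insn_sizes: &[usize], region_size: usize) -> Vec<(usize, usize)> {
    // A fresh page must always be able to hold one instruction plus the jump.
    assert!(region_size >= MAX_INSN_SIZE + JMP_SIZE);
    let mut page = 0;
    let mut offset = 0;
    let mut placements = Vec::with_capacity(insn_sizes.len());
    for &size in insn_sizes {
        assert!(size <= MAX_INSN_SIZE);
        // Be conservative: if a worst-case instruction plus the linking jump
        // might not fit, switch pages now instead of measuring the real size.
        if region_size - offset < MAX_INSN_SIZE + JMP_SIZE {
            // The real emitter would write the jump to the next page here.
            page += 1;
            offset = 0;
        }
        placements.push((page, offset));
        offset += size;
    }
    placements
}
```

With a 200-byte region and four 60-byte instructions, the first three land in page 0 and the fourth moves to page 1, even though 20 bytes were still free. That leftover space is the cost of not predicting exact sizes.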

@k0kubun
Member Author

k0kubun commented Sep 30, 2022

Solution #3 might be better just because, we're likely to end up needing a jump anyway if a block has to be compiled in the next page. Might as well go for the solution that makes things the most transparent and minimizes memory usage?

👍

Another idea similar to 3 is that we could attempt a write and retry it after next_page per insn instead of predicting the write size. I just pushed its PoC for Arm at 8491112, but I wasn't able to finish it up because I need to go to the airport soon. It seems less buggy than the previous patch, but still crashing.

Debugging the current failure on Arm might be helpful if anybody wants to make progress on this project this Friday.

@maximecb
Contributor

maximecb commented Sep 30, 2022

we could attempt a write and retry it after next_page per insn instead of predicting the write size.

That might work but we also need to be able to guarantee that we are able to write the largest possible size of a jump to the next page.

There are also potentially edge cases if we switch page in the middle of the branch instruction at the end of a block... Although I think that should work as well... Provided that we always succeed in allocating a next page 🤔

Welcome back to America btw! I'm going to be away at a burst most of next week so still difficult to pair, though at least we will be in closer time zones so we should be able to chat on Slack in real time. Looking forward to pairing the week after :)

I should mention, @XrXr should be available to pair next week. He knows this codebase and CRuby very well so should definitely be able to help.

@k0kubun
Member Author

k0kubun commented Oct 4, 2022

Current status: Alan and I paired on the immediate issue of this PR. We see a couple of assertion failures on "branch stubs should never enlarge branches" and PC calculation VM_ASSERT(n <= ISEQ_BODY(iseq)->iseq_size);.

Because of the number of problems I keep running into, it felt like another option could be simpler. I was thinking about choosing one of the following two options, summarizing the current blockers of each option:

  1. Current PR: In arm64_emit, try emitting an Insn, jump to the next page if it fails, and retry emitting the same Insn.
    • pros: Minimizes memory usage
    • cons: Deciding whether to generate a jmp or not is tricky. More edge cases than I initially expected. One of the issues is that you sometimes need to guarantee there's enough space for a jmp before the current Assembler is instantiated (the previous Assembler should leave space), while you also need to guarantee assumptions like "branch stubs should never enlarge branches".
    • blockers:
      • Assertion failure on "branch stubs should never enlarge branches" (probably not difficult to fix)
      • Assertion failure in PC calculation: VM_ASSERT(n <= ISEQ_BODY(iseq)->iseq_size); (not immediately obvious)
  2. In compile_with_regs, try arm64_emit, set the position to the next page if it fails, and retry entire arm64_emit.
    • pros: An Assembler-level retry seems kind of simpler than an Insn-level retry.
    • cons: More memory usage. When generated code is too large, you have to give up.
    • blocker: Assertion failure on "invalidation needs the entry_exit field" because the entry address is decided after compile_with_regs, however, it's needed beforehand for invalidation.
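
The difference between the two retry granularities above can be sketched with a toy emitter. Everything here is hypothetical (`Emitter`, `emit_insn`, `emit_sequence`, byte-vector pages) and ignores the jmp that real code would have to write before switching pages. It only illustrates why an Assembler-level retry tends to leave larger gaps.

```rust
/// Toy emitter: each page is a byte vector limited to `capacity` bytes.
struct Emitter {
    pages: Vec<Vec<u8>>,
    capacity: usize,
}

impl Emitter {
    fn new(capacity: usize) -> Self {
        Emitter { pages: vec![Vec::new()], capacity }
    }

    /// Append `insn` to the current page, or write nothing if it does not fit.
    fn try_write(&mut self, insn: &[u8]) -> bool {
        let capacity = self.capacity;
        let page = self.pages.last_mut().expect("at least one page");
        if page.len() + insn.len() > capacity {
            return false;
        }
        page.extend_from_slice(insn);
        true
    }

    /// Option 1: retry only the failing instruction on the next page.
    fn emit_insn(&mut self, insn: &[u8]) -> bool {
        if self.try_write(insn) {
            return true;
        }
        self.pages.push(Vec::new()); // real code would emit a jmp first
        self.try_write(insn)
    }

    /// Option 2: roll back and retry the whole sequence on the next page.
    /// Returns false if the sequence does not fit in an empty page (give up).
    fn emit_sequence(&mut self, insns: &[&[u8]]) -> bool {
        let start = self.pages.last().expect("at least one page").len();
        if insns.iter().all(|insn| self.try_write(insn)) {
            return true;
        }
        // Discard the partially written sequence, then start over on a new page.
        self.pages.last_mut().expect("at least one page").truncate(start);
        self.pages.push(Vec::new());
        insns.iter().all(|insn| self.try_write(insn))
    }

    /// Bytes used in each page, for comparing the resulting layouts.
    fn page_lens(&self) -> Vec<usize> {
        self.pages.iter().map(|page| page.len()).collect()
    }
}
```

With 10-byte pages, writing 8 bytes followed by three 2-byte instructions fills the first page completely under option 1, while option 2 leaves 2 bytes unused there and moves the whole sequence to the second page.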

Because of the cons and the blockers, I had almost given up on Solution #1, but after revisiting Solution #2, I noticed the issue that I listed as its blocker. The next step is to explore using a label to address the blocker of Solution #2, and then go back to debugging Solution #1 if that doesn't work.

@XrXr
Member

XrXr commented Oct 4, 2022

I want to note that even with solution 2, we still need to write out jumps sometimes. In gen_direct_jump() we assume that we can always place blocks next to each other and that's not true anymore once we change the layout.

@maximecb
Contributor

maximecb commented Oct 5, 2022

Thanks for helping with this @XrXr 🙏

Trying to think ahead here. In the future we might want to try generating code for multiple blocks at once, so we can have longer instruction sequences to optimize. If we were doing that, reusing the same Assembler for multiple blocks, then giving up if we can't fit the entire sequence in the current page could be problematic (could waste a lot of space). IMO that suggests that an Insn-level retry is probably more flexible.

cons: Deciding whether to generate a jmp or not is tricky. More edge cases than I initially expected. One of the issues is that you sometimes need to guarantee there's enough space for a jmp before the current Assembler is instantiated (the previous Assembler should leave space), while you also need to guarantee assumptions like "branch stubs should never enlarge branches".

I'm wondering if there's not a stupid-but-good-enough way to solve this. Something like, we define a MAX_PATCHABLE_BRANCH_SIZE, and if there isn't that much space left for a whole patchable branch, then we just move to the next page. That way, we never end up in a situation where a patchable branch can be split across two pages?

Unless that assertion failure just comes from the fact that the blocks are no longer always next to each other when queued 🤔 ?

@k0kubun
Member Author

k0kubun commented Oct 5, 2022

While we still didn't reach a point to make it work today, we tried a few ideas that we thought would work more easily. I ran out of time before posting a summary, so I'll summarize our current status tomorrow, discussing your points as well.

@k0kubun
Member Author

k0kubun commented Oct 5, 2022

At 24a4a42, Solution 1 was passing btest except for test_yjit_30k_ifelse with Assertion Failed: ../vm_backtrace.c:57:calc_pos:n <= ISEQ_BODY(iseq)->iseq_size and test_yjit_30k_methods with panicked at 'branch stubs should never enlarge branches'.

What we tried on it yesterday were:

  • First, we removed a workaround that allows the last insn to run out of space in the current page because it wasn't enough to avoid "branch stubs should never enlarge branches". This resulted in making 30k_ifelse fail with "branch stubs should never enlarge branches" as well.
  • Tried starting inline/outlined code of a next page from a 20-byte offset so that the first uninitialized code range triggers SIGILL when inline code steps into outlined code and vice versa. As of this revision, SIGILL wasn't triggered, and both tests still failed with "branch stubs should never enlarge branches".
  • Added can_change_page flag to Assembler to explicitly disallow going to the next page when necessary. This caused those two tests to fail with Assertion Failed: ../vm_backtrace.c:57:calc_pos:n <= ISEQ_BODY(iseq)->iseq_size or Assertion Failed: ../vm_backtrace.c:58:calc_pos:n >= 0.
  • Removed can_change_page, and tweaked the assertion condition of "branch stubs should never enlarge branches" instead. This caused 'verify_ctx: stack value was mapped to self, but values did not match!' and Assertion Failed: ../yjit.c:392:rb_iseq_pc_at_idx:insn_idx < iseq->body->iseq_size.
  • Tried another design that ends each page with a jump to the next page, letting Assembler generate nops at the page end. This enters an infinite loop or a failure of an assertion !ocb.dropped_bytes that we added.

In the future we might want to try generating code for multiple blocks at once, so we can have longer instruction sequences to optimize.
IMO that suggests that an Insn-level retry is probably more flexible.

Agreed. We should probably stick to the design unless we find a blocker for it.

Something like, we define a MAX_PATCHABLE_BRANCH_SIZE, and if there isn't that much space left for a whole patchable branch, then we just move to the next page. That way, we never end up in a situation where a patchable branch can be split across two pages?

We did work on that kind of idea. I think we'll explore more of that on today's pairing 🤔

@k0kubun
Member Author

k0kubun commented Oct 6, 2022

Alan and I made good progress today. We kept improving and debugging this implementation of Solution 1, and with --yjit-call-threshold=1, make btest is finally passing and make test-all is passing except for three tests of TestYJIT with !cb.has_dropped_bytes.

The current idea is:

  • Use page_end_offset: 20 to reserve space for a jmp at the end of each page. Temporarily set it to 0 when it needs to jump to the next page.
  • When next_page() is called, do not move the cb that did not request it backwards. For example, when cb runs out of space in an old page during regenerate_branch or branch_stub_hit, ocb should not go to the next page because it could repeatedly point to the same old address that is possibly already written.
    • The ideal solution to keep inline and outlined code closer is to keep track of write_pos in every past page to go back to an unwritten position correctly, but the tradeoff is that it requires some extra memory. That is probably not too significant and is worth considering, though.
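
A rough sketch of the two rules above, under stated assumptions: `Region`, `write`, `jump_to_next_page`, and `move_to_page` are hypothetical names, and writes are modeled as byte counts rather than real machine code. Only the 20-byte page_end_offset and the "never move backwards" rule come from the comment.

```rust
/// Bytes held back at the end of each region for the page-linking jmp
/// (the page_end_offset value mentioned above).
const JMP_RESERVE: usize = 20;

/// Hypothetical model of one block's writable half-page.
struct Region {
    size: usize,            // bytes in this block's half of a page
    page_end_offset: usize, // bytes currently reserved at the end
    page: usize,            // page being written
    offset: usize,          // bytes used in this page
}

impl Region {
    /// Ordinary writes must stay out of the reserved tail.
    fn write(&mut self, len: usize) -> bool {
        if self.offset + len > self.size - self.page_end_offset {
            return false;
        }
        self.offset += len;
        true
    }

    /// Emit the jump into the reserved tail, then advance one page.
    fn jump_to_next_page(&mut self, jmp_len: usize) {
        let saved = self.page_end_offset;
        self.page_end_offset = 0; // temporarily unlock the reserved tail
        let fits = self.write(jmp_len);
        self.page_end_offset = saved;
        assert!(fits, "reserved tail too small for the jump");
        let next = self.page + 1;
        self.move_to_page(next);
    }

    /// Forward only: a request for the same or an earlier page is ignored,
    /// so a block is never pointed back at memory it may already have written.
    fn move_to_page(&mut self, page: usize) {
        if page > self.page {
            self.page = page;
            self.offset = 0;
        }
    }
}
```

In this model, a block that is already several pages ahead stays where it is when the other block asks to move to an earlier page, which matches the forward-only behavior described in the second bullet.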

@k0kubun k0kubun force-pushed the yjit-code-layout branch 4 times, most recently from 39a4ebc to 38c617d Compare October 6, 2022 21:42
@maximecb
Contributor

maximecb commented Oct 6, 2022

Happy to see that you and Alan are pairing and that you are making so much progress 🙏 :)

@k0kubun k0kubun force-pushed the yjit-code-layout branch 3 times, most recently from 29f17dd to 62dfd51 Compare October 7, 2022 00:52
@k0kubun
Member Author

k0kubun commented Oct 7, 2022

Today I refactored this branch and added initial x86_64 implementation.

arm64 is stable enough to pass btest and test-all once locally, but test-all seems to be crashing on CI. test-spec is mostly passing but stops on spec/ruby/optional/capi/kernel_spec.rb with Killed: 9.

x86_64 next_page seems to cause problems; the next step is to investigate SEGV on test_yjit_30k_ifelse.rb and test_yjit_30k_methods.rb.

I'll pair with @XrXr on those issues tomorrow.

@maximecb
Contributor

In the second pairing session with Alan, we were working on the #6406 side (changes are not finished/pushed yet). I think this PR is almost ready behavior-wise, but I need some more time tomorrow to refactor this before marking it ready for reviews.

Wonderful. Great work so far 👌 :)

@k0kubun k0kubun force-pushed the yjit-code-layout branch 3 times, most recently from 725ebd4 to 5f7d990 Compare October 14, 2022 04:09
@k0kubun k0kubun marked this pull request as ready for review October 14, 2022 04:09
@matzbot matzbot requested a review from a team October 14, 2022 04:09
Co-authored-by: Alan Wu <[email protected]>
Co-authored-by: Maxime Chevalier-Boisvert <[email protected]>
@k0kubun
Member Author

k0kubun commented Oct 14, 2022

included the benchmark numbers for the latest revision in the PR description (not changed much from #6460 (comment))

Contributor

@maximecb maximecb left a comment

Looks pretty good. I'm very happy you were able to get this in a working state and with good performance 👍

@k0kubun k0kubun requested a review from a team October 17, 2022 16:24
Contributor

@maximecb maximecb left a comment

Well done! Happy with the progress :)
