Deduplicate Regexp literals #2859

Open: wants to merge 8 commits into base: master

@casperisfine
Contributor

casperisfine commented Jan 23, 2020

Ruby ticket: https://bugs.ruby-lang.org/issues/16557

Context

Real-world applications contain many duplicated Regexp literals.

From a rails/console in Redmine:

```
>> ObjectSpace.each_object(Regexp).count
=> 6828
>> ObjectSpace.each_object(Regexp).uniq.count
=> 4162
>> ObjectSpace.each_object(Regexp).to_a.map { |r| ObjectSpace.memsize_of(r) }.sum
=> 4611957 # 4.4 MB total
>> ObjectSpace.each_object(Regexp).to_a.map { |r| ObjectSpace.memsize_of(r) }.sum - ObjectSpace.each_object(Regexp).to_a.uniq.map { |r| ObjectSpace.memsize_of(r) }.sum
=> 1490601 # 1.42 MB could be saved
```

Here are the top 10 duplicated regexps in Redmine:

```
147: /"/
107: /\s+/
103: //
89: /\n/
83: /'/
76: /\s+/m
37: /\d+/
35: /\[/
33: /./
33: /\\./
```
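
The description does not show how the per-regexp counts were gathered. One possible way to build a similar list is sketched below. The helper name `top_duplicated_regexps` is made up for illustration, and the grouping relies on `Regexp#hash` and `Regexp#eql?`, the same comparison `uniq` uses in the console session above.

```ruby
# Hypothetical helper (not part of the PR): counts live Regexp objects that
# compare equal and returns the most duplicated ones as [count, regexp] pairs.
def top_duplicated_regexps(limit = 10)
  ObjectSpace.each_object(Regexp)
             .group_by { |regexp| regexp } # buckets by Regexp#hash / #eql?
             .map { |regexp, copies| [copies.size, regexp] }
             .select { |count, _| count > 1 }
             .sort_by { |count, _| -count }
             .first(limit)
end

top_duplicated_regexps.each do |count, regexp|
  puts "#{count}: #{regexp.inspect}"
end
```

The numbers depend on the Ruby version and on what has been loaded, so they will not match the Redmine figures exactly.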

Any empty Rails application will have a similar number of regexps.

The feature

Since https://bugs.ruby-lang.org/issues/16377 made literal regexps frozen, it is possible to deduplicate literal regexps without changing any semantics.

This patch is heavily inspired by the frozen_strings table, but applied to literal regexps.
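
To make the idea concrete, here is a Ruby-level model of such a table. It is only a sketch: the actual patch works in C inside the VM, and `RegexpInterner`, `#intern`, and the choice of `[source, options]` as the key are assumptions made for illustration.

```ruby
# Ruby-level model of a literal-regexp table; the real patch does this in C.
# RegexpInterner and #intern are hypothetical names used for illustration.
class RegexpInterner
  def initialize
    @table = {}
  end

  # Returns the one shared frozen Regexp for a given (source, options) pair.
  def intern(source, options = 0)
    @table[[source, options]] ||= Regexp.new(source, options).freeze
  end
end

interner = RegexpInterner.new
a = interner.intern('\s+')
b = interner.intern('\s+')
c = interner.intern('\s+', Regexp::MULTILINE)

puts a.equal?(b) # same object, so only one copy is kept
puts a.equal?(c) # different options, so a distinct object
```

Sharing is only safe because the objects are frozen: no caller can mutate a regexp that another call site also holds.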

Bugs

Unfortunately this patch segfaults during GC on Linux, and I haven't managed to figure out why. Interestingly, it doesn't segfault on OSX.

@shyouhei
Member

It seems the CI failures are real. You could have missed something.

@casperisfine
Contributor Author

@shyouhei Yes, I know that PR isn't ready to merge. But I can't seem to figure out these last issues. So before I spend a long time figuring the problem out, I'd rather check if the feature itself is acceptable.

@casperisfine force-pushed the dedup-literal-regexp branch 2 times, most recently from 6fad863 to 69802c3 on January 24, 2020 11:16

@byroot
Member

byroot commented Jan 24, 2020

> But I can't seem to figure out these last issues.

Actually scratch that.

Not sure how but seems like the segfault I was tracking was solved by my rebase.

I fixed the 3 legit spec & test failures.

There is still a crash on Travis, but I failed to understand what it is about and whether it's legit or not.

@casperisfine
Contributor Author

Actually the Travis failure is the segfault I couldn't figure out:

```
ruby(ruby_sip_hash13+0x68) [0xaaaad08886b0] ../siphash.c:421
ruby(reg_lit_hash+0x70) [0xaaaad089a0d0] ../re.c:2995
ruby(rb_st_delete+0x3c) [0xaaaad08cfaf4] ../st.c:329
ruby(rb_reg_free+0x84) [0xaaaad089ed24] ../re.c:3053
ruby(obj_free+0x60) [0xaaaad07c2fc4] ../gc.c:2714
```

My theory is that the ->src string is moved by GC.compact at some point, but I haven't managed to confirm it or to find a fix for it.

@ko1
Contributor

ko1 commented Feb 25, 2020

  • MUST: lazy sweep defers free function calls, so the table can contain freed regexps. fstring has the same issue, so we need to check liveness as fstring does.
  • Comment: regexp creation can also be avoided.
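
The liveness check from the first point can be modelled in plain Ruby, as a loose analogy only. The real fix would live in C next to the fstring table. Here `WeakRegexpTable` and `#fetch` are hypothetical names, and `WeakRef` stands in for "an entry whose object may already be dead".

```ruby
require "weakref"

# Hypothetical Ruby model: the table does not keep its regexps alive, so a
# lookup must check that the stored object is still live before returning it.
class WeakRegexpTable
  def initialize
    @table = {}
  end

  def fetch(source, options = 0)
    key = [source, options]
    ref = @table[key]
    if ref
      begin
        # Liveness check: only reuse the entry if the object still exists.
        return ref.__getobj__ if ref.weakref_alive?
      rescue WeakRef::RefError
        # The object died between the check and the access; fall through.
      end
    end
    regexp = Regexp.new(source, options).freeze
    @table[key] = WeakRef.new(regexp) # replaces any stale entry
    regexp
  end
end

table = WeakRegexpTable.new
first = table.fetch('\d+')
puts table.fetch('\d+').equal?(first) # still referenced, so it is reused
```

A stale entry is never handed out; it is simply overwritten by the next lookup for the same key.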

@casperisfine
Copy link
Contributor Author

Thanks for the hints. I'll try to fix these issues.

@casperisfine
Contributor Author

I'm afraid I'm still stuck on the same issue. I'm not certain, but from my understanding, when the Regexp is freed, sometimes the src fstring is already freed.

I initially tried to prevent this with rb_mark_tbl_no_pin(vm->regexp_literals) but clearly that doesn't work (probably because of the lazy sweep?).

Since the src property is used to generate the regexp hash, I don't see any way to clean up the regexp_literals table in this scenario.

Maybe a solution would be to record the hash value in struct RRegexp, but I doubt that would be acceptable?
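
For illustration, the "record the hash value" idea looks roughly like this when modelled in Ruby. This is a sketch under assumptions, not the C change itself: `CachedHashTable`, `Entry`, and the `[source, options]` hash input are all hypothetical.

```ruby
# Hypothetical model: each table entry remembers the hash it was inserted
# under, so it can be removed without reading the regexp's source again.
class CachedHashTable
  Entry = Struct.new(:cached_hash, :regexp)

  def initialize
    @buckets = Hash.new { |buckets, hash| buckets[hash] = [] }
  end

  def insert(regexp)
    entry = Entry.new([regexp.source, regexp.options].hash, regexp)
    @buckets[entry.cached_hash] << entry
    entry
  end

  # Uses only the stored hash and object identity, never regexp.source.
  def delete(entry)
    bucket = @buckets.fetch(entry.cached_hash, [])
    bucket.delete_if { |candidate| candidate.equal?(entry) }
    @buckets.delete(entry.cached_hash) if bucket.empty?
  end

  def size
    @buckets.sum { |_, bucket| bucket.size }
  end
end

demo_table = CachedHashTable.new
demo_entry = demo_table.insert(/\s+/)
demo_table.delete(demo_entry)
puts demo_table.size
```

The trade-off is one extra word per regexp in exchange for a removal path that no longer touches src.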

Either way I don't think I can fix this issue myself without some help.

@ko1
Contributor

ko1 commented Feb 27, 2020

We discussed how to implement it at the dev-meeting with some other committers, and we concluded it is better to have a cache system in the Onigmo layer with reference counting. Do you want to try it? Unfortunately, I'm very busy now so I can't help.
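
A reference-counted cache of compiled patterns could behave roughly as follows. This is a Ruby model for discussion only: the proposal is for the C regexp engine layer, and `RefCountedPatternCache`, `#acquire`, and `#release` are hypothetical names, not an existing API.

```ruby
# Hypothetical model of a reference-counted cache of compiled patterns.
class RefCountedPatternCache
  Entry = Struct.new(:compiled, :refcount)

  def initialize
    @entries = {}
  end

  # Returns the shared compiled pattern and bumps its reference count.
  def acquire(source, options = 0)
    key = [source, options]
    entry = (@entries[key] ||= Entry.new(Regexp.new(source, options), 0))
    entry.refcount += 1
    entry.compiled
  end

  # Drops one reference; the entry is evicted when nobody uses it anymore.
  def release(source, options = 0)
    key = [source, options]
    entry = @entries.fetch(key)
    entry.refcount -= 1
    @entries.delete(key) if entry.refcount.zero?
  end

  def size
    @entries.size
  end
end

cache = RefCountedPatternCache.new
x = cache.acquire('\n')
y = cache.acquire('\n')
puts x.equal?(y)  # both users share one compiled pattern
cache.release('\n')
cache.release('\n')
puts cache.size   # the last release evicted the entry
```

With this design each Regexp object would still exist on the Ruby side; only the compiled pattern underneath would be shared.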

@casperisfine
Contributor Author

> we concluded it is better to have a cache system in the Onigmo layer with reference counting.

That indeed sounds easier to implement, but I'm not convinced it would yield a better result from a Ruby user perspective.

> Unfortunately, I'm very busy now so I can't help.

Totally understandable.

I'll make a few more attempts at implementing either solution, but if anyone reading this feels they can implement it, feel free.

@casperisfine
Contributor Author

Kind of a personal braindump:

I managed to reduce the crash a bit and have a nice feedback loop locally.

After adding some debug output, I can see that the crash is caused by a Regexp instance that has:

  • A surprisingly small SRC_PTR, e.g. 0x3f3 (it varies from one run to the next but is always < 0xfff, whereas other SRC_PTR values look more like 0x7f9d1e8bf8f0).
  • A totally corrupted SRC_LEN: a huge number, often negative.
  • So clearly the ->src points to some garbage.

What's weird is that this Regexp has the REG_LITERAL flag, but my debug statements don't show it ever going through rb_reg_compile.

So I initially suspected it might be created via #dup and retained the flag, but I wasn't able to confirm this.

So I'm still digging here. I don't think there is a fundamental problem with the current implementation, but there's clearly some edge case I'm overlooking.

@junaruga
Member

junaruga commented May 25, 2021

> Actually the travis failure is the segfault I couldn't figure out:

Travis has now been revived and only covers the Arm (arm64, arm32) and IBM (ppc64le, s390x) CPU architectures, as of commit 9d4266f. You can rebase this PR onto the latest master to check whether Travis still fails.
https://www.travis-ci.com/github/ruby/ruby/branches
