-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Deduplicate Regexp literals #2859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
It seems the CI failures are real. You could have missed something. |
@shyouhei Yes, I know that PR isn't ready to merge. But I can't seem to figure out these last issues. So before I spend a long time figuring the problem out, I'd rather check if the feature itself is acceptable. |
6fad863
to
69802c3
Compare
Actually scratch that. Not sure how but seems like the segfault I was tracking was solved by my rebase. I fixed the 3 legit spec & test failures. There is still some crash on Travis, but I failed to understand what it is about and wether it's legit or not. |
Actually the travis failure is the segfault I couldn't figure out:
My theory is that the |
|
Thanks for the hints. I'll try to fix these issues. |
69802c3
to
d8717e0
Compare
I'm afraid I'm still stuck on the same issue. I'm not certain, but from my understanding, when the Regexp is freed, sometimes the I initially tried to prevent this with Since the Maybe a solution would be to record the hash value in Either way I don't think I can fix this issue myself without some help. |
We discussed how to implement it at dev-meeting with some other committers and we conclude it is better to have a cache system in onigumo layer with reference counting. Do you want to try it? Unfortunately, I'm very busy now so I can't help. |
That indeed sounds easier to implement, but I'm not convinced it would yield a better result from a Ruby user perspective.
Totally understandable. I'll give a few more tries at implementing either solution, but if anyone is reading this and feel they can implement it, feel free. |
95e525d
to
faff237
Compare
Real world application contain many duplicated Regexp literals. From a rails/console in Redmine: ``` >> ObjectSpace.each_object(Regexp).count => 6828 >> ObjectSpace.each_object(Regexp).uniq.count => 4162 >> ObjectSpace.each_object(Regexp).to_a.map { |r| ObjectSpace.memsize_of(r) }.sum => 4611957 # 4.4 MB total >> ObjectSpace.each_object(Regexp).to_a.map { |r| ObjectSpace.memsize_of(r) }.sum - ObjectSpace.each_object(Regexp).to_a.uniq.map { |r| ObjectSpace.memsize_of(r) }.sum => 1490601 # 1.42 MB could be saved ``` Here's the to 10 duplicated regexps in Redmine: ``` 147: /"/ 107: /\s+/ 103: // 89: /\n/ 83: /'/ 76: /\s+/m 37: /\d+/ 35: /\[/ 33: /./ 33: /\\./ ```
faff237
to
5473c1a
Compare
Kind of a personal braindump: I managed to reduce the crash a bit and have a nice feedback loop locally. After adding some debug output, the crash is caused by a Regexp instance that has:
What's weird is that this Regexp has the So I initially suspected it might be created via So I'm still digging here, I don't think there is a fundamental problem to the current implementation, however there's clearly some edge case I'm overlooking. |
1a747da
to
7203ed4
Compare
Now the Travis was revived and only manages Arm (arm64, arm62), IBM (ppc64le, s390x) CPU architectures by the commit 9d4266f . You can rebase this PR based on the latest master to check if the Travis fails. |
Ruby ticket: https://bugs.ruby-lang.org/issues/16557
Context
Real world application contain many duplicated Regexp literals.
From a rails/console in Redmine:
Here's the to 10 duplicated regexps in Redmine:
Any empty Rails application will have a similar amount of regexps.
The feature
Since https://bugs.ruby-lang.org/issues/16377 made literal regexps frozen, it is possible to deduplicate literal regexps without changing any semantic.
This patch is heavily inspired by the
frozen_strings
table, but applied to literal regexps.Bugs
Unfortunately this patch segfault during GC on Linux, and I haven't managed to figure out why. Interestingly it doesn't segfault on OSX.