Optimized Unicode.simpleFold() … #101
Very sorry for the delay in this review. Since UnicodeTables.java is generated, the changes to that file should be implemented by modifying the make_unicode_tables.awk script.
I think it might be worth writing the Unicode tables generator in Java; I'm going to have a stab at that right now.
This change does give a nice speedup in compilation and matching, and is probably worth the 35 KB sparse array that it creates.
alandonovan left a comment
There's no need to rewrite the generator script. The necessary change can easily be made to the awk script.
    {0x212A, 0x004B},
    {0x212B, 0x00C5},
  };
  final char[] result = new char[tmp[tmp.length - 1][0] + 1];
A comment is necessary here.
// Precompute the case folding mapping to avoid binary search at run time.
// The 'result' array maps each cased char to the next char in its orbit.
// The orbit is a cycle such as k -> K -> K [Kelvin] -> k.
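To make the suggested comment concrete, here is a minimal sketch of how such a precomputed array could be built and walked. The class name `OrbitSketch`, the helper names, and the six-entry table are assumptions for illustration only; the real table is generated and much larger, and the real `simpleFold()` may handle code points outside the table differently.

```java
// Hypothetical sketch, not the code from this pull request.
final class OrbitSketch {
  // Illustrative subset of (code point, next code point in its orbit) pairs,
  // sorted by the first element. The last two rows match the diff above.
  static final int[][] PAIRS = {
    {0x004B, 0x006B}, // K -> k
    {0x006B, 0x212A}, // k -> KELVIN SIGN
    {0x00C5, 0x00E5}, // A WITH RING ABOVE -> a with ring above
    {0x00E5, 0x212B}, // a with ring above -> ANGSTROM SIGN
    {0x212A, 0x004B}, // KELVIN SIGN -> K
    {0x212B, 0x00C5}, // ANGSTROM SIGN -> A WITH RING ABOVE
  };

  static final char[] ORBIT = buildOrbit(PAIRS);

  // Precompute a sparse array indexed by code point. The array is sized by the
  // largest key, so most slots stay 0, meaning "no orbit entry".
  static char[] buildOrbit(int[][] pairs) {
    char[] result = new char[pairs[pairs.length - 1][0] + 1];
    for (int[] pair : pairs) {
      result[pair[0]] = (char) pair[1];
    }
    return result;
  }

  // Return the next code point in r's orbit, or r itself if the table has no
  // entry (a simplification made for this sketch).
  static int nextInOrbit(int r) {
    if (r >= 0 && r < ORBIT.length && ORBIT[r] != 0) {
      return ORBIT[r];
    }
    return r;
  }
}
```

Starting from `'k'`, three calls to `nextInOrbit` visit the Kelvin sign, then `'K'`, and then return to `'k'`, each with a single array read.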
No need to rewrite it to support this change, but it could definitely use rewriting. I have a prototype that uses ICU4J to emit the same information.
This implements the approach taken by @mykeul in google#101. Instead of a dense array of (codepoint, case-folded codepoint) mappings, CASE_ORBIT becomes a sparse array whose indices represent the key in this map. The previous pull request was written before UnicodeTablesGenerator existed.
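The difference between the two layouts can be sketched as follows. This is an illustration under assumptions, not the project's code: the class name `DenseVsSparse`, both method names, and the small table are hypothetical, and returning the input when no entry exists is a simplification.

```java
// Hypothetical sketch contrasting the two lookup strategies.
final class DenseVsSparse {
  // Dense layout: sorted (code point, folded code point) pairs.
  static final int[][] PAIRS = {
    {0x004B, 0x006B},
    {0x006B, 0x212A},
    {0x00C5, 0x00E5},
    {0x00E5, 0x212B},
    {0x212A, 0x004B},
    {0x212B, 0x00C5},
  };

  // Sparse layout: the array index is the key, the value is the mapping.
  static final char[] SPARSE = new char[PAIRS[PAIRS.length - 1][0] + 1];

  static {
    for (int[] pair : PAIRS) {
      SPARSE[pair[0]] = (char) pair[1];
    }
  }

  // Dense lookup: binary search for the first pair whose key is >= r.
  static int lookupDense(int r) {
    int lo = 0;
    int hi = PAIRS.length;
    while (lo < hi) {
      int mid = lo + (hi - lo) / 2;
      if (PAIRS[mid][0] < r) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    if (lo < PAIRS.length && PAIRS[lo][0] == r) {
      return PAIRS[lo][1];
    }
    return r; // no entry: simplification for this sketch
  }

  // Sparse lookup: one bounds check and one array read.
  static int lookupSparse(int r) {
    if (r >= 0 && r < SPARSE.length && SPARSE[r] != 0) {
      return SPARSE[r];
    }
    return r; // no entry: simplification for this sketch
  }
}
```

The trade-off is memory for time: the sparse array needs one slot for every code point up to the largest key, while the dense table stores only the pairs but pays for a binary search on each lookup.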
Thank you. I implemented the same approach with the new UnicodeTablesGenerator in #114.
… which was the hot spot (35.8%) in my use case. VisualVM no longer shows it with this change.