Addresses Regex Perf Issue 32764 #32899
…xOptions.IgnoreCase or RegexOptions.CultureInvariant. This saves over 40% in these cases.
……xOptions.IgnoreCase or RegexOptions.CultureInvariant. This saves over 40% in these cases.
…ith RegexOptions.IgnoreCase or RegexOptions.CultureInvariant." This reverts commit e151e93.
private void CallToLower()
{
    Ldloc(_cultureV);
What is the lifetime of this? e.g. does this cache end up spanning multiple regex calls and user code such that user code could change the current culture and end up with different behavior than before?
_cultureV is a local variable. The lifetime of the cache is therefore local to each invocation of the FindFirstChar and Go methods of the compiled regular expression.
Ah, I see. Thanks.
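To illustrate the point about lifetime, here is a minimal sketch in plain C# rather than emitted IL. The class and method names are hypothetical and not part of the PR; the sketch only assumes that the emitted code ends up calling `char.ToLower(char, CultureInfo)` with a culture held in a local, which is what `Ldloc(_cultureV)` followed by `Call(s_chartolowerM)` suggests.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration only; the real change emits IL in the regex compiler.
public static class CultureCachingSketch
{
    // Looks up CultureInfo.CurrentCulture once per character.
    public static int CountPerCallLookup(string text, char lowerTarget)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.ToLower(c, CultureInfo.CurrentCulture) == lowerTarget)
            {
                count++;
            }
        }
        return count;
    }

    // Reads the current culture once into a local. The cached value lives
    // only for this call, so a culture change made by user code between
    // calls is picked up on the next invocation.
    public static int CountCachedCulture(string text, char lowerTarget)
    {
        CultureInfo culture = CultureInfo.CurrentCulture;
        int count = 0;
        foreach (char c in text)
        {
            if (char.ToLower(c, culture) == lowerTarget)
            {
                count++;
            }
        }
        return count;
    }
}
```

Both methods return the same result for a given input as long as the culture does not change during the call. The second one simply avoids the repeated property access inside the loop.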
private void CallToLower()
{
    Ldloc(_cultureV);
    Call(s_chartolowerM);
@ViktorHofer, am I misremembering, or did you do some ToLower-related optimizations/changes when you were cleaning up the regex code recently? If yes, should we instead just port the appropriate changes to the compiled version? If no, is there a similar optimization to be done on the interpreted side of things?
Right, but only minor allocation optimizations. I wouldn't do that as part of this PR but revisit the compiled code paths later to make sure that we don't diverge and bring optimizations over.
Do we have regex tests (interpreted and compiled) that rely on the current culture being something specific? If not, it'd be good to add as part of this.
Only very few for Unicode characters. We should definitely add a bunch.
Can you please post performance numbers with BenchmarkDotNet baselined on master including allocation?
@ViktorHofer
And here is the baseline version, where I reverted the changes from my commit to measure the impact of my change alone:
In case you did not see it, here is the ETW chart: Actually, the BenchmarkDotNet numbers are even better. From the numbers it looks like the slow code gen path is only triggered when RegexOptions.IgnoreCase is used. Still, this is certainly a widely used option.
@stephentoub: I could think of some Turkish I tests which behave differently in different locales. That would be a good test to verify that the right culture was used.
That sounds good and should suffice. As mentioned before, we currently don't have a comprehensive set of inputs with different cultures. We should definitely fix that.
@ViktorHofer: Added some locale tests which really show different behavior under the Turkish locale.
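For readers unfamiliar with the Turkish I problem, a small sketch of the casing difference such tests rely on. The class and method names are hypothetical and not taken from the PR. The Turkish result also assumes that culture data for `tr-TR` is available at runtime; under invariant globalization mode it may behave differently or fail to load.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class TurkishISketch
{
    // Culture-invariant lowering: 'I' (U+0049) maps to 'i' (U+0069).
    public static char LowerInvariant(char c)
    {
        return char.ToLower(c, CultureInfo.InvariantCulture);
    }

    // Turkish lowering: with culture data present, 'I' (U+0049) is expected
    // to map to the dotless 'ı' (U+0131) rather than to 'i'.
    public static char LowerTurkish(char c)
    {
        return char.ToLower(c, new CultureInfo("tr-TR"));
    }
}
```

Because the two cultures disagree on how `I` is lowered, a case-insensitive match over an input containing `I`, `ı`, `İ`, and `i` can tell whether the intended culture was actually used.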
@dotnet-bot test this please
string input = "Iıİi";

var cultInvariantRegex = Create(input, CultureInfo.InvariantCulture, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
var turkishRegex = Create(input, turkish, RegexOptions.IgnoreCase);
Nit: please change var to be the actual type here... we only use var when the type is obvious from the right-hand side, namely a ctor or an explicit cast.
@dotnet-bot test this please
Thanks!
Nice win here @Alois-xx many thanks! Any interest in more work on regex?
@danmosemsft: If I find new issues or ideas I will definitely file an issue. But currently my free time is quite limited.
* Fix for Regex performance issue when compiled Regex is used with RegexOptions.IgnoreCase or RegexOptions.CultureInvariant. This saves over 40% in these cases.
* Fix for Regex performance issue when compiled Regex is used with Rege…xOptions.IgnoreCase or RegexOptions.CultureInvariant. This saves over 40% in these cases.
* Revert "Fix for Regex performance issue when compiled Regex is used with RegexOptions.IgnoreCase or RegexOptions.CultureInvariant." This reverts commit dotnet/corefx@e151e93.
* Added TurkishI tests which check compiled and interpreted regular expressions.
* Removed var of test

Commit migrated from dotnet/corefx@58e2b4c
This fix should save roughly 40% in the common case where a compiled Regex emits many Char.ToLower calls, each of which accesses CultureInfo.CurrentCulture.