-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Improve literal-after-loop regex optimization #93190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve literal-after-loop regex optimization #93190
Conversation
Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support: - It currently gives up if the set loop is wrapped in an atomic and/or a capture. - It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop. - If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it. - If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration. This fixes all of those issues, such that the optimization extends to more patterns.
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsRegex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:
This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331. Based on benchmark from https://github.com/BurntSushi/rebar#ruff-noqa: using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("start", "end", "Error", "StdDev", "Job")]
public partial class Tests
{
private static readonly Regex s_regex = new Regex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)", RegexOptions.Compiled);
private static string s_haystack = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;
[Benchmark]
public int LineByLine()
{
int total = 0;
foreach (ReadOnlySpan<char> line in s_haystack.AsSpan().EnumerateLines())
{
total += s_regex.Count(line);
}
return total;
}
[Benchmark]
public int All() => s_regex.Count(s_haystack);
}
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good at my level
Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:
This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.
Based on benchmark from https://github.com/BurntSushi/rebar#ruff-noqa: