Improve literal-after-loop regex optimization #93190

stephentoub · 2023-10-08T19:24:34Z

Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:

- It currently gives up if the set loop is wrapped in an atomic and/or a capture.
- It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.
- If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.
- If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.
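
The search strategy described above (find the literal, then walk backwards through the starting set) can be sketched roughly as follows. This is a minimal illustration and not the code in RegexPrefixAnalyzer.cs or the regex engine: the class and method names are hypothetical, the set is assumed to be `\s` (approximated here with `char.IsWhiteSpace`), and the rest of the pattern would still have to be matched from the literal onwards.

```csharp
using System;

// Hypothetical sketch of a literal-after-loop search for a pattern shaped like \s*LITERAL.
static class LiteralAfterLoopSketch
{
    // Finds the next candidate match at or after startAt.
    // matchStart is where the leading \s* would begin; literalStart is where the literal begins.
    public static bool TryFindNext(
        ReadOnlySpan<char> input, int startAt, ReadOnlySpan<char> literal,
        StringComparison comparison, out int matchStart, out int literalStart)
    {
        matchStart = -1;
        literalStart = -1;

        // Step 1: search for the literal that follows the set loop.
        int i = input.Slice(startAt).IndexOf(literal, comparison);
        if (i < 0)
        {
            return false; // no literal, so no match is possible
        }
        literalStart = startAt + i;

        // Step 2: walk backwards over characters in the set (assumed here to be whitespace),
        // without going before the position the search started from.
        int pos = literalStart;
        while (pos > startAt && char.IsWhiteSpace(input[pos - 1]))
        {
            pos--;
        }
        matchStart = pos;
        return true;
    }
}
```

Passing `StringComparison.OrdinalIgnoreCase` loosely mirrors the ignore-case bullet: the whole string is searched for, rather than only the set of characters it can start with.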

This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.
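
For orientation, the pattern shapes named in the list might look like the ones below. The patterns and inputs are made up for illustration and are not taken from the PR's tests. The sketch only exercises the public `Regex.Match` API, so it shows what each shape matches; whether a given pattern actually takes the optimized search path is an internal engine detail that this code does not observe.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical examples of the pattern shapes discussed above.
static class PatternShapes
{
    public static readonly (string Pattern, string Input)[] Cases =
    {
        (@"(\s*)abc",    "x  abc"), // set loop wrapped in a capture
        (@"(?>\s*)abc",  "x  abc"), // set loop wrapped in an atomic group
        (@"\s*(?i:abc)", "x  ABC"), // set loop followed by an ignore-case string
        (@"\s*[0-9]+",   "x  42"),  // the "literal" is a set loop with at least one iteration
    };

    // Returns the index and text of the first match, or (-1, "") if there is none.
    public static (int Index, string Value) FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? (m.Index, m.Value) : (-1, string.Empty);
    }
}
```

In each case the first match starts at index 1, on the whitespace that precedes the literal part, which is the position a backwards walk from the literal would arrive at.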

Based on the benchmark from https://github.com/BurntSushi/rebar#ruff-noqa:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("start", "end", "Error", "StdDev", "Job")]
public partial class Tests
{
    private static readonly Regex s_regex = new Regex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)", RegexOptions.Compiled);
    private static string s_haystack = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;

    [Benchmark]
    public int LineByLine()
    {
        int total = 0;
        foreach (ReadOnlySpan<char> line in s_haystack.AsSpan().EnumerateLines())
        {
            total += s_regex.Count(line);
        }
        return total;
    }

    [Benchmark]
    public int All() => s_regex.Count(s_haystack);
}

| Method | Toolchain | Mean | Ratio |
|---|---|---|---|
| LineByLine | \main\corerun.exe | 308.76 ms | 1.00 |
| LineByLine | \pr\corerun.exe | 66.82 ms | 0.22 |
| All | \main\corerun.exe | 278.16 ms | 1.00 |
| All | \pr\corerun.exe | 23.11 ms | 0.08 |

ghost · 2023-10-08T19:24:48Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	9.0.0

buyaa-n

Looks good at my level

stephentoub added the area-System.Text.RegularExpressions and tenet-performance labels Oct 8, 2023

stephentoub added this to the 9.0.0 milestone Oct 8, 2023

stephentoub requested review from danmoseley, joperezr and MihaZupan October 8, 2023 19:24

ghost assigned stephentoub Oct 8, 2023

Add a few more tests and comments

6aad456

build-analysis bot mentioned this pull request Oct 9, 2023

Tracking issue for CI build timeouts #76454

Closed

buyaa-n approved these changes Oct 18, 2023

src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexFindOptimizationsTests.cs (outdated, resolved)

danmoseley reviewed Oct 18, 2023

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs (resolved)

danmoseley reviewed Oct 18, 2023

src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Groups.Tests.cs (resolved)

danmoseley reviewed Oct 18, 2023

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs (resolved)

danmoseley reviewed Oct 18, 2023

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs (resolved)

stephentoub added 2 commits October 18, 2023 17:28

Merge branch 'main' into improveliteralafterloopopt

23f6a2e

Address PR feedback

b5a2b31

stephentoub merged commit 0cd1774 into dotnet:main Oct 19, 2023

stephentoub deleted the improveliteralafterloopopt branch October 19, 2023 01:25

cincuranet mentioned this pull request Oct 24, 2023

Regressions in System.Text.RegularExpressions #93927

Closed

ghost locked as resolved and limited conversation to collaborators Nov 18, 2023
