Improve alternation switch optimization in regex source generator #98723

stephentoub · 2024-02-20T22:35:06Z

The regex source generator has an optimization that tries to emit a switch statement to handle an alternation. If it can prove that an alternation is atomic, either because of a surrounding construct (like an atomic group) or because nothing in the alternation itself might backtrack (as a loop in one of the branches might), and if it can prove that none of the branches overlap on the first character they must match (because every branch always begins with a different character from the others), then it can emit a switch over the first character required by each branch.
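
To make the shape of that switch concrete, here is a minimal hand-written sketch for an atomic alternation such as `(?>cat|dog|bird)`. It is an illustration only, not actual generator output; the class and method names are hypothetical.

```csharp
using System;

static class AlternationSwitchSketch
{
    // Hand-written approximation of matching (?>cat|dog|bird) at the start of
    // `input`. Hypothetical helper, not code emitted by the source generator.
    public static bool TryMatch(ReadOnlySpan<char> input, out int matchedLength)
    {
        matchedLength = 0;
        if (input.IsEmpty)
        {
            return false;
        }

        // Every branch begins with a different character, so the first character
        // alone selects the only branch that could match. Because the alternation
        // is atomic, no other branch ever needs to be tried afterwards.
        switch (input[0])
        {
            case 'c':
                if (!input.Slice(1).StartsWith("at".AsSpan())) return false;
                matchedLength = 3;
                return true;

            case 'd':
                if (!input.Slice(1).StartsWith("og".AsSpan())) return false;
                matchedLength = 3;
                return true;

            case 'b':
                if (!input.Slice(1).StartsWith("ird".AsSpan())) return false;
                matchedLength = 4;
                return true;

            default:
                return false;
        }
    }
}
```

Calling `AlternationSwitchSketch.TryMatch("dogma".AsSpan(), out int n)` returns `true` with `n == 3`, while an input such as `"cow"` selects the `'c'` case and fails there without trying any other branch.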

Today, the analysis that leads to this optimization being used only considers branches that start with a specific character (RegexNodeKind.One), a set (RegexNodeKind.Set), a string (RegexNodeKind.Multi), or a concatenation that begins with one of those. Anything else gets knocked off the optimized switch path. This PR relaxes that: the One/Set/Multi construct only needs to be the first non-zero-width construct matched in the branch, not necessarily the first node. For example, the branch could be a capture around one of these nodes, or a loop of one of these with a minimum iteration count of at least 1. This PR also adds support for not just individual chars or sets, but loops of them (normal, lazy, or atomic), again as long as they have a minimum iteration count of at least 1. This in particular helps with duplicate characters in a row, as earlier optimizations will likely have condensed them into repeaters represented as loops with equal min and max counts.
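
The relaxed analysis can be sketched over a toy node model. Everything below (`ToyKind`, `ToyNode`, `FirstCharSketch`) is hypothetical and far simpler than the real RegexNode tree; it only illustrates looking through captures and loops with a minimum of at least 1 to find the characters a branch must start with, and then checking that no two branches share one. Atomicity, the other precondition for the switch, is not modeled here.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical toy model for illustration; not the real RegexNode/RegexNodeKind.
enum ToyKind { One, Set, Multi, Capture, Loop, Concatenate, Other }

sealed class ToyNode
{
    public ToyKind Kind;
    public string Text = "";                   // One: the char; Set: its member chars; Multi: the string
    public int Min;                            // Loop: minimum iteration count
    public List<ToyNode> Children = new();     // Capture/Loop: one child; Concatenate: many
}

static class FirstCharSketch
{
    // Returns the chars the branch must begin with, or null if this toy
    // analysis cannot determine them.
    public static string? FirstChars(ToyNode node) => node.Kind switch
    {
        ToyKind.One or ToyKind.Set => node.Text,
        ToyKind.Multi => node.Text.Substring(0, 1),
        // A capture is transparent: look at what it wraps.
        ToyKind.Capture => FirstChars(node.Children[0]),
        // A loop only guarantees its first char if it must iterate at least once.
        ToyKind.Loop when node.Min >= 1 => FirstChars(node.Children[0]),
        // Simplification: only the first child of a concatenation is inspected.
        ToyKind.Concatenate => FirstChars(node.Children[0]),
        _ => null,
    };

    // True if every branch has a known first-char set and no char appears twice.
    public static bool BranchesAreDisjoint(IEnumerable<ToyNode> branches)
    {
        var seen = new HashSet<char>();
        foreach (ToyNode branch in branches)
        {
            string? chars = FirstChars(branch);
            if (chars is null)
            {
                return false;
            }

            foreach (char c in chars)
            {
                if (!seen.Add(c))
                {
                    return false;
                }
            }
        }

        return true;
    }
}
```

In this model, the branches of `a{3}|(b)|cd` (a loop of `a` with minimum 3, a capture around `b`, and the string `cd`) are reported as disjoint, whereas replacing the first branch with `a*` makes the analysis give up because that loop may match nothing.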

This PR also makes one more tweak: the sets supported may now be larger. Previously the code allowed a set to expand to at most 5 characters, an arbitrary limit chosen primarily to support ignore-case (which would typically result in sets of 2 or 3 characters). But this ignores the fact that previous optimizations may combine sets for a variety of reasons, e.g. an alternation where one branch contains 's' and the next contains 't' would be combined into a single branch for [st]. This limit has now been increased significantly, with little downside; the main limitation is stack consumption, and the new limit is well within typical stackallocs we use ourselves.
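
The stack-consumption trade-off can be illustrated with a small sketch that expands a set into a stack-allocated buffer and gives up when the set does not fit. The names are hypothetical, sets are modeled as simple inclusive ranges, and the cap of 64 is an arbitrary value chosen for this example; it is not the limit the PR actually uses.

```csharp
#nullable enable
using System;

static class SetExpansionSketch
{
    // Arbitrary illustrative cap; NOT the actual limit chosen in the PR.
    private const int HypotheticalMaxSetChars = 64;

    // Expands inclusive (Lo, Hi) ranges into `destination`.
    // Returns the number of chars written, or -1 if they do not all fit.
    public static int Expand(ReadOnlySpan<(char Lo, char Hi)> ranges, Span<char> destination)
    {
        int count = 0;
        foreach (var range in ranges)
        {
            // Iterate with an int so a range ending at char.MaxValue cannot overflow.
            for (int c = range.Lo; c <= range.Hi; c++)
            {
                if (count == destination.Length)
                {
                    return -1;
                }

                destination[count++] = (char)c;
            }
        }

        return count;
    }

    // Returns the chars that would each become a case label, or null if the set
    // is too large for the stack buffer (the switch path would then be skipped).
    public static string? CaseLabelsFor(ReadOnlySpan<(char Lo, char Hi)> ranges)
    {
        // 64 chars is 128 bytes of stack, a small cost per call.
        Span<char> buffer = stackalloc char[HypotheticalMaxSetChars];
        int count = Expand(ranges, buffer);
        return count < 0 ? null : buffer.Slice(0, count).ToString();
    }
}
```

With this sketch, the merged set `[st]` expands to the two labels `s` and `t`, while a set covering thousands of characters returns `null`, so a caller would fall back to the non-switch path.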

Fixes #98683

ghost · 2024-02-20T22:35:12Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	9.0.0

danmoseley · 2024-02-21T00:22:19Z

Is this not relevant to the compiler mode or just not worth doing in it?

Would it be worth adding a perf test, or do you expect it to be covered by an existing scenario?

stephentoub · 2024-02-21T00:47:08Z

> Is this not relevant to the compiler mode or just not worth doing in it?

This path doesn't exist for the compiler because it relies on C# optimizations around switches. I want to add something at some point.

> Would it be worth adding a perf test, or do you expect it to be covered by an existing scenario?

We currently don't have any source generator perf tests.

stephentoub · 2024-03-01T19:17:00Z

@joperezr, mind taking a peek?

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

joperezr

LGTM.

I assume that this whole section is not present in the compiler code, right? I expect this piece is handled there by regular if/else blocks, but I'm wondering whether there is something we could optimize there too, like the setCharSize change.

Did we ever write tests for expected output given an expression? This type of fix might benefit from such tests, to ensure we are getting the switch statements as expected.

stephentoub · 2024-03-04T15:05:47Z

> I assume that this whole section is not present in the compiler code, right?

Correct. We rely on the C# compiler's optimizations around lowering of the switch block; it might just use cascading if/elses, it might use a jump table, it might do various things. We could choose to implement those same optimizations in RegexCompiler, at which point we could port over this logic, but thus far I haven't wanted us to take on that maintenance burden. We could, though.

> Did we ever write tests for expected output given an expression?

Yes, we have tests for three different patterns here:
https://github.com/dotnet/runtime/blob/main/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorOutputTests.cs
None of those result in a switch for an alternation, though.

stephentoub added the area-System.Text.RegularExpressions label Feb 20, 2024

stephentoub added this to the 9.0.0 milestone Feb 20, 2024

stephentoub requested review from danmoseley, tarekgh and joperezr February 20, 2024 22:35

ghost assigned stephentoub Feb 20, 2024

build-analysis bot mentioned this pull request Feb 21, 2024

[browser][MT] Assert failed: Cannot find Promise for JSHandle -2 #98406

Closed

Merge branch 'main' into regexsgswitch

b48d8be

joperezr reviewed Mar 1, 2024

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

joperezr approved these changes Mar 1, 2024

stephentoub merged commit 061d4df into dotnet:main Mar 4, 2024

stephentoub deleted the regexsgswitch branch March 4, 2024 15:05

github-actions bot locked and limited conversation to collaborators Apr 4, 2024
