stephentoub
Member

The regex source generator has an optimization that tries to emit a switch statement to handle an alternation. Two things must be provable. First, the alternation must be atomic, either because of a surrounding construct (like an atomic group) or because nothing in the alternation itself might backtrack (as a loop in one of the branches could). Second, the branches must not overlap on the first character they must match, i.e. every branch always begins with a different character from the others. When both hold, the generator can emit a switch over the first character required by each branch.
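As a rough illustration of the shape of code this enables, here is a hand-written sketch for the atomic alternation `(?>abc|def|ghi)`. It is not actual source generator output; the class and method names (`AlternationSwitchSketch`, `TryMatch`) are made up for this example.

```csharp
using System;

// Hypothetical, hand-written illustration; NOT actual source generator output.
static class AlternationSwitchSketch
{
    // Tries to match the atomic alternation (?>abc|def|ghi) at the start of input.
    // Every branch begins with a distinct character, so one switch on input[0]
    // picks the only branch that could possibly match.
    public static bool TryMatch(ReadOnlySpan<char> input, out int matchedLength)
    {
        matchedLength = 0;
        if (input.IsEmpty)
        {
            return false;
        }

        ReadOnlySpan<char> rest = input.Slice(1);
        switch (input[0])
        {
            case 'a':
                if (!rest.StartsWith("bc".AsSpan(), StringComparison.Ordinal)) return false;
                break;

            case 'd':
                if (!rest.StartsWith("ef".AsSpan(), StringComparison.Ordinal)) return false;
                break;

            case 'g':
                if (!rest.StartsWith("hi".AsSpan(), StringComparison.Ordinal)) return false;
                break;

            default:
                // No branch starts with this character, so the alternation fails.
                return false;
        }

        // The alternation is atomic, so there is no need to remember which
        // branch matched in order to backtrack into the others later.
        matchedLength = 3;
        return true;
    }
}
```

Because the branches are disjoint on their first character and the alternation is atomic, a failed branch never needs to fall through to the next one, which is what lets the C# compiler lower the switch however it sees fit.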

Today, the analysis that enables this optimization only considers branches that start with a specific character (RegexNodeKind.One), a set (RegexNodeKind.Set), a string (RegexNodeKind.Multi), or a concatenation that begins with one of those. Anything else knocks the alternation off the optimized switch path. With this PR, those One/Set/Multi constructs only need to be the first non-zero-width construct matched in the branch, not necessarily the first node, e.g. the branch could be a capture around one of these nodes. This PR also adds support for not just individual chars or sets but loops of them (normal, lazy, or atomic), as long as they have a minimum iteration count of at least 1. This in particular helps with duplicate characters in a row, as earlier optimizations will likely have condensed them into repeaters represented as loops with equal min and max counts.
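The idea can be sketched with a toy model. Everything below is hypothetical: `Kind`, `Node`, and `FirstCharSketch` are stand-ins invented for this example and are far simpler than the real RegexNode tree. For brevity the sketch only inspects the first child of a concatenation rather than skipping leading zero-width nodes.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical mini node model for illustration only.
enum Kind { One, Set, Multi, Capture, Loop, Concatenate }

sealed class Node
{
    public Kind Kind;
    public string Text = "";   // One: the char; Set: every char in the set; Multi: the string
    public int Min;            // Loop only: minimum iteration count
    public List<Node> Children = new List<Node>();
}

static class FirstCharSketch
{
    // Returns the characters that can begin a match of 'node',
    // or null when this simplified analysis cannot tell.
    public static string? StartingChars(Node node)
    {
        switch (node.Kind)
        {
            case Kind.One:
            case Kind.Set:
                return node.Text.Length > 0 ? node.Text : null;

            case Kind.Multi:
                return node.Text.Length > 0 ? node.Text.Substring(0, 1) : null;

            case Kind.Capture:
            case Kind.Concatenate:
                // Look through the wrapper at what it matches first.
                return node.Children.Count > 0 ? StartingChars(node.Children[0]) : null;

            case Kind.Loop:
                // A loop only guarantees a first character if it must iterate at least once.
                return node.Min >= 1 && node.Children.Count > 0 ? StartingChars(node.Children[0]) : null;

            default:
                return null;
        }
    }

    // True when every branch has known starting characters and no two branches share one.
    public static bool CanUseSwitch(IEnumerable<Node> branches)
    {
        var seen = new HashSet<char>();
        foreach (Node branch in branches)
        {
            string? chars = StartingChars(branch);
            if (chars is null)
            {
                return false;
            }

            foreach (char c in chars)
            {
                if (!seen.Add(c))
                {
                    return false; // two branches could start with the same character
                }
            }
        }

        return true;
    }
}
```

In this model, a capture around `a`, a loop of `b` with a minimum of 2, and the string `cd` have disjoint starting characters, so `CanUseSwitch` returns true; a loop with a minimum of 0 makes it return false.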

This PR also makes one more tweak: the sets supported may now be larger. Previously the code allowed a set to expand to at most 5 characters, an arbitrary limit chosen primarily to support ignore-case (which typically results in sets of 2 or 3 characters). But this ignores the fact that earlier optimizations may combine sets for a variety of reasons, e.g. an alternation where one branch contains 's' and the next contains 't' would be combined into a single branch for [st]. The limit has now been increased significantly, with little downside; the main constraint is stack consumption, and the new limit is well within the typical stackallocs we use ourselves.
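To show why stack consumption is the limiting factor, here is a minimal sketch of expanding a set into a stack-allocated buffer. The names (`SetExpansionSketch`, `ExpandRanges`, `TryGetSetChars`) and the cap of 64 are assumptions made for this example; 64 is not the limit chosen in this PR.

```csharp
#nullable enable
using System;

static class SetExpansionSketch
{
    // Hypothetical cap picked only for this illustration; NOT the PR's actual limit.
    private const int MaxSetChars = 64;

    // Writes every char covered by the inclusive ranges into 'destination'.
    // Returns the number of chars written, or -1 if they do not all fit.
    public static int ExpandRanges((char Low, char High)[] ranges, Span<char> destination)
    {
        int count = 0;
        foreach ((char low, char high) in ranges)
        {
            // Iterate as int so a range ending at char.MaxValue cannot overflow.
            for (int c = low; c <= high; c++)
            {
                if (count == destination.Length)
                {
                    return -1;
                }

                destination[count++] = (char)c;
            }
        }

        return count;
    }

    // Returns the set's chars as a string, or null when the set exceeds the cap.
    public static string? TryGetSetChars((char Low, char High)[] ranges)
    {
        // 64 chars * 2 bytes = 128 bytes of stack, which is small for a stackalloc.
        Span<char> buffer = stackalloc char[MaxSetChars];
        int count = ExpandRanges(ranges, buffer);
        return count < 0 ? null : buffer.Slice(0, count).ToString();
    }
}
```

With this sketch, a combined set like [st] expands to "st", while a set covering more characters than the cap is rejected and would fall back to the non-switch path.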

Fixes #98683

@ghost

ghost commented Feb 20, 2024

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 9.0.0

@ghost ghost assigned stephentoub Feb 20, 2024
@danmoseley
Member

Is this not relevant to the compiler mode or just not worth doing in it?

Would it be worth adding a perf test, or do you expect it to be covered by an existing scenario?

@stephentoub
Member Author

> Is this not relevant to the compiler mode or just not worth doing in it?

This path doesn't exist for the compiler because it relies on C# optimizations around switches. I want to add something at some point.

> Would it be worth adding a perf test, or do you expect it to be covered by an existing scenario?

We currently don't have any source generator perf tests.

@stephentoub
Member Author

@joperezr, mind taking a peek?

Member

@joperezr joperezr left a comment

LGTM.

I assume that this whole section is not present in the compiler code, right? I expect this piece is handled there by regular if/else blocks, but I'm wondering whether there is something we could optimize there too, like the setCharSize change.

Did we ever write tests for expected output given an expression? These kinds of fixes might benefit from such tests to ensure we are getting the switch statements as expected.

@stephentoub
Member Author

> I assume that this whole section is not present in the compiler code, right?

Correct. We rely on the C# compiler's optimizations around lowering of the switch block; it might just use cascading if/elses, it might use a jump table, it might do various things. We could choose to implement those same optimizations in RegexCompiler, at which point we could port over this logic, but thus far I haven't wanted us to take on that maintenance burden. We could, though.

> Did we ever write tests for expected output given an expression?

Yes, we have tests for three different patterns here:
https://github.com/dotnet/runtime/blob/main/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorOutputTests.cs
None of those result in a switch for an alternation, though.

@stephentoub stephentoub merged commit 061d4df into dotnet:main Mar 4, 2024
@stephentoub stephentoub deleted the regexsgswitch branch March 4, 2024 15:05
@github-actions github-actions bot locked and limited conversation to collaborators Apr 4, 2024
Successfully merging this pull request may close these issues.

Improve regex source generator optimization collision between alternations and repetitions
3 participants