Improve alternation switch optimization in regex source generator #98723
The regex source generator has an optimization that tries to emit a switch statement to handle an alternation. If it can prove that an alternation is atomic, either because of a surrounding construct (like an atomic group) or because the alternation contains nothing that might backtrack (such as a loop in one of the branches), and if it can prove that none of the branches overlap on the first character they must match (because every branch begins with a different character from the others), then it can emit a switch over the first character required by each branch.

Today, the analysis that leads to this optimization being used only considers branches that start with a specific character (RegexNodeKind.One), a set (RegexNodeKind.Set), a string (RegexNodeKind.Multi), or a concatenation that begins with one of those. Anything else gets knocked off the optimized switch path. With this PR, the analysis instead allows those One/Set/Multi constructs to be the first non-zero-width construct matched in the branch, but not necessarily the first node, e.g. the branch could be a capture around one of these nodes, or a loop of one of these with a minimum iteration count of at least 1. This PR also adds support for not just individual chars or sets, but loops of them (normal, lazy, or atomic), again as long as they have a minimum iteration count of at least 1. This in particular helps with duplicate characters in a row, as earlier optimizations will likely have condensed them into repeaters represented as loops with equal min and max counts.

This PR also makes one more tweak: the sets supported may now be larger. Previously the code allowed a set to expand to at most 5 characters, an arbitrary limit chosen primarily to support ignore-case (which would typically result in sets of 2 or 3 characters). But this ignores the fact that previous optimizations may combine sets for a variety of reasons, e.g. an alternation where one branch contains 's' and the next contains 't' would be combined into a single branch for [st]. This limit has now been increased significantly, with little downside; the main limitation is stack consumption, and the new limit is well within typical stackallocs we use ourselves.
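As a rough illustration of the shape of code this optimization aims for, here is a hand-written sketch for the atomic alternation `(?>abc|[st]op|d{2}e)`. It is not the generator's actual output; the class and method names (`AlternationSwitchSketch`, `MatchAtStart`) are hypothetical, and the sketch only matches at the start of the input. The three branches begin with a Multi, a Set, and a loop with equal min and max counts, and their first characters do not overlap, so one switch on the first character selects the only branch that could match.

```csharp
using System;

static class AlternationSwitchSketch
{
    // Hypothetical helper: returns the length matched at the start of
    // input by (?>abc|[st]op|d{2}e), or -1 when no branch matches.
    public static int MatchAtStart(ReadOnlySpan<char> input)
    {
        if (input.IsEmpty)
        {
            return -1;
        }

        // The first characters of the branches are disjoint, so at most
        // one case can apply and no other branch needs to be tried.
        switch (input[0])
        {
            case 'a': // branch "abc": a string (Multi)
                return input.StartsWith("abc", StringComparison.Ordinal) ? 3 : -1;

            case 's' or 't': // branch "[st]op": a Set, then a string
                return input.Slice(1).StartsWith("op", StringComparison.Ordinal) ? 3 : -1;

            case 'd': // branch "d{2}e": a loop with min == max == 2, then a One
                return input.Slice(1).StartsWith("de", StringComparison.Ordinal) ? 3 : -1;

            default:
                return -1;
        }
    }
}
```

For example, `AlternationSwitchSketch.MatchAtStart("top")` takes the `'s' or 't'` case and never examines the other two branches. Because the alternation is atomic, a failure after the switch does not need to come back and try a different branch.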
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Fixes #98683
Is this not relevant to the compiler mode or just not worth doing in it? Would it be worth adding a perf test, or do you expect it to be covered by an existing scenario?
This path doesn't exist for the compiler because it relies on C# optimizations around switches. I want to add something at some point.
We currently don't have any source generator perf tests.
@joperezr, mind taking a peek?
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
LGTM.
I assume this whole section is not present in the compiler code, right? I expect this piece is handled there by regular if/else blocks, but I'm wondering whether there is something we could optimize there too, like the setCharSize change.
Did we ever write tests for the expected output given an expression? This type of fix might benefit from such tests to ensure we are getting the switch statements as expected.
Correct. We rely on the C# compiler's optimizations around lowering of the switch block; it might just use cascading if/elses, it might use a jump table, it might do various things. We could choose to implement those same optimizations in RegexCompiler, at which point we could port over this logic, but thus far I haven't wanted us to take on that maintenance burden. We could, though.
Yes, we have tests for three different patterns here: