Unify raw string lexing and parsing#80817
Conversation
- Create RawStringIndentationHelper with shared logic for both lexer and parser - Move CheckForSpaceDifference, CharToString, and StartsWith to helper class - Update Lexer_RawStringLiteral.cs to use helper methods - Update LanguageParser_InterpolatedString.cs to use helper methods - Tests pass: 165 RawStringLiteralLexingTests + 37 RawStringLiteralCompilingTests Co-authored-by: CyrusNajmabadi <[email protected]>
Add UTF-8 BOM to match encoding requirements of other C# files in the project Co-authored-by: CyrusNajmabadi <[email protected]>
| @@ -11720,10 +11720,6 @@ ExpressionSyntax parsePrimaryExpressionWithoutPostfix(Precedence precedence) | |||
| case SyntaxKind.NumericLiteralToken: | |||
There was a problem hiding this comment.
Disable whitespace diff to understand best what changed (some methods were very stripped down, and without that, git thinks they're entirely removed, and then readded).
| /// <summary> | ||
| /// Converts a whitespace character to its string representation for error messages. | ||
| /// </summary> | ||
| private static string CharToString(char ch) |
There was a problem hiding this comment.
this is jsut a move. the lexer doesn't need it anymore
|
@dotnet/roslyn-compiler this is ready for review. |
|
@jjonescz ptal. |
|
@dotnet/roslyn-compiler this is ready for review. |
There was a problem hiding this comment.
nit
| Debug.Assert(originalText[0] is '"'); | |
| Debug.Assert(originalText[1] is '"'); | |
| Debug.Assert(originalText[2] is '"'); | |
| Debug.Assert(originalText is ['"', '"', '"', ..]); |
src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Why is this code commented out?
There was a problem hiding this comment.
Great question. See the asserts and explanations in e112a7e for more detail.
Basically, this line was never necessary (given an assert a couple lines above that this token cannot have diagnostcs on it). I've also beefed up all teh code to be clearer and assert these invariants in a few places.
src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs
Outdated
Show resolved
Hide resolved
…edString.cs Co-authored-by: Jan Jones <[email protected]>
…edString.cs Co-authored-by: Jan Jones <[email protected]>
Co-authored-by: Jan Jones <[email protected]>
…edString.cs Co-authored-by: Jan Jones <[email protected]>
src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| internal void ScanInterpolatedStringLiteralTop( | ||
| internal void ScanStringLiteralTop( |
There was a problem hiding this comment.
I didn't follow why 'Interpolated' was deleted from this name. Wouldn't it make sense to use 'InterpolatedOrRaw' like many of the other methods are doing? Or did I miss something?
There was a problem hiding this comment.
Sure! I can def rename.
There was a problem hiding this comment.
ah. this is already in the type InterpolatedOrRawStringScanner. So i think it's fine for this to be ScanStringLiteralTop since this is already implied.
| ScanRawInterpolatedStringLiteralEnd(kind, startingQuoteCount); | ||
|
|
||
| if (!_isInterpolatedString) | ||
| _lexer.ScanUtf8Suffix(); |
There was a problem hiding this comment.
This is needed because non-interpolated raw strings can now go through this path, right?
Fixes #80731
Fixes #59044
The issue this is addressing is that both the lexer and the parser have a copy of all the code that reports errors on raw strings. And they both have a copy that determines the content meaning of the raw-string (including figuring out what the meaning is after dedentation).
This is a lot of duplication. And, as it turns out, they were slightly different in a couple of places (like where diagnostics are placed).
This PR unifies all that logic in the following manner:
A good mental model of this is that a raw-normal-string is just a raw-interpolated-string with no interpolations (no amount of
{s starts an interpolation).By doing that, errors are for normal vs interpolated raw strings are unified. As is string value/dedentation computation.
Note: we have exactly 500 tests around lexing/parsing. Only 22 had a tiny change in diagnostic location placement due to the unification. I looked into preserving the exact same semantics. but it was often very strange code that had to be written. Especially due to the nature of all the different string lexing/parsing routines we have. So i opted to ensure the vast majority were the same, while also giving totally fine, but slightly different errors, for a few cases.
Relates to feature #55306