-
Notifications
You must be signed in to change notification settings - Fork 14
Handle method with only empty lines or comments inside #179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When `tmp/alias` already exists, I'm now getting phantom folders in the directory pointing at older aliases which is distracting/confusing. By checking and removing that alias before symlinking we can prevent this strange behavior (possibly caused by newer Mac OS?).
When removing comments I previously replaced them with a newline. This loses some context and may affect the order of the indent search which in turn affects the final result. By preserving whitespace in front of the comment, we preserve the "natural" indentation order of the line while also allowing the parser/lexer to see and join naturally consecutive (method chain) lines. close #177
While #177 is reported as being caused by a comment, the underlying behavior is a problem due to the newline that we generated (from a comment). The prior commit fixed that problem by preserving whitespace before the comment. That guarantees that a block will form there from the frontier before it will be expanded there via a "neighbors" method. Since empty lines are valid ruby code, it will be hidden and be safe. ## Problem setup This failure mode is not fixed by the prior commit, because the indentation is 0. To provide good results, we must make the algorithm less greedy. One heuristic/signal to follow is developer added newlines. If a developer puts a newline between code, it's more likely they're unrelated. For example: ``` port = rand(1000...9999) stub_request(:any, "localhost:#{port}") query = Cutlass::FunctionQuery.new( port: port ).call expect(WebMock).to have_requested(:post, "localhost:#{port}"). with(body: "{}") ``` This code is split into three chunks by the developer. Each are likely (but not guaranteed) to be intended to stand on their own (in terms of syntax). This behavior is good for scanning neighbors (same indent or higher) within a method, but bad for parsing neighbors across methods. ## Problem Code is expanded to capture all neighbors, and then it decreases indent level which allows it to capture surrounding scope (think moving from within the method to also capturing the `def/end` definition. Once the indentation level has been increased, we go back to scanning neighbors, but now neighbors also contain keywords. For example: ``` 1 def bark 2 3 end 4 5 def sit 6 end ``` In this case if lines 4, 5, and 6 are in a block when it tries to expand neighbors it will expand up. If it stops after line 2 or 3 it may cause problems since there's a valid kw/end pair, but the block will be checked without it. TLDR; It's good to stop scanning code after hitting a newline when you're in a method...it causes a problem scanning code between methods when everything inside of one of the methods is an empty line. In this case it grabs the end on line 3 and since the problem was an extra end, the program now compiles correctly. It incorrectly assumes that the block it captured was causing the problem. ## Extra bit of context One other technical detail is that after we've decided to stop scanning code for a new neighbor block expansion, we look around the block and grab any empty newlines. Basically adding empty newlines before of after a code block do not affect the parsing of that block. ## The fix Since we know that this problem only happens when there's a newline inside of a method and we know this particular failure mode is due to having an invalid block (capturing an extra end, but not it's keyword) we have all the metadata we need to detect this scenario and correct it. We know that the next line above our block must be code or empty (since we grabbed extra newlines). Same for code below it. We can count all the keywords and ends in the block. If they are balanced, it's likely (but not guaranteed) we formed the block correctly. If they're imbalanced, look above or below (depending on the nature of the imbalance), check to see if adding that line would balance the count. This concept of balance and "leaning" comes from work in #152 and has proven useful, but not been formally introduced into the main branch. ## Outcome Adding this extra check introduced no regressions and fixed the test case. It might be possible there's a mirror or similar problem that we're not handling. That will come out in time. It might also be possible that this causes a worse case in some code not under test. That too would come out in time. One other possible concern to adding logic in this area (which is a hot codepath), is performance. This extra count check will be performed for every block. In general the two most helpful performance strategies I've found are reducing total number of blocks (therefore reducing overall N internal iterations) and making better matches (the parser to determine if a close block is valid or not is a major bottleneck. If we can split valid code into valid blocks, then it's only evaluated by the parser once, where as invalid code must be continuously re-checked by the parser until it becomes valid, or is determined to be the cause of the core problem. This extra logic should very rarely result in a change, but when it does it should tend to produce slightly larger blocks (by one line) and more accurate blocks. Informally it seems to have no impact on performance: `` This branch: DEBUG_DISPLAY=1 bundle exec rspec spec/ --format=failures 3.01s user 1.62s system 113% cpu 4.076 total ``` ``` On main: DEBUG_DISPLAY=1 bundle exec rspec spec/ --format=failures 3.02s user 1.64s system 113% cpu 4.098 total ```
315728c
to
ad3db77
Compare
Originally I fixed #177 by making the process of comment removal indentation aware. The next commit is the more general fix and means we don't need to carry that additional logic/overhead. Also: Update syntax via linter
ad3db77
to
eacf9e6
Compare
I believe this is sufficient. I'll keep it open for a day or two on the off chance someone else wants to review and comment. It's a long PR, but mostly because I threw in a refactoring and a bunch of docs while I was at it. |
schneems
added a commit
that referenced
this pull request
May 3, 2023
- Add tests to ScanHistory - Add docs to ScanHistory - Maybe refactor some logic from AroundBlock Scan to somewhere else Reported in #187 this code: ``` class Foo def foo if cond? foo else # comment end end # ... def bar if @recv end_is_missing_here end end ``` Triggers an incorrect suggestion: ``` Unmatched keyword, missing `end' ? 1 class Foo 2 def foo > 3 if cond? > 5 else 8 end 16 end ``` Part of the issue is that while scanning we're using newlines to determine when to stop and pause. This is useful for determining logically smaller chunks to evaluate but in this case it causes us to pause before grabbing the "end" that is right below the newline. This problem is similar to #179. However in the case of expanding same indentation "neighbors" I want to always grab all empty values at the end of the scan. I don't want to do that when changing indentation levels as it affects scan results. There may be some way to normalize this behavior between the two, but the tests really don't like that change. To fix this issue for expanding against different indentation I needed a way to first try and grab any additional newlines with the ability to rollback that guess. In #192 I experimented with decoupling scanning from the AroundBlockScan logic. It also added the ability to take a snapshot of the current scanner and rollback to prior changes. With this ability in place now we: - Grab extra empties before looking at the above/below line for the matching keyword/end statement - If there's a match, grab it - If there's no match, discard the newlines we picked up in the evaluation That solves the issue.
schneems
added a commit
that referenced
this pull request
May 3, 2023
Reported in #187 this code: ``` class Foo def foo if cond? foo else # comment end end # ... def bar if @recv end_is_missing_here end end ``` Triggers an incorrect suggestion: ``` Unmatched keyword, missing `end' ? 1 class Foo 2 def foo > 3 if cond? > 5 else 8 end 16 end ``` Part of the issue is that while scanning we're using newlines to determine when to stop and pause. This is useful for determining logically smaller chunks to evaluate but in this case it causes us to pause before grabbing the "end" that is right below the newline. This problem is similar to #179. However in the case of expanding same indentation "neighbors" I want to always grab all empty values at the end of the scan. I don't want to do that when changing indentation levels as it affects scan results. There may be some way to normalize this behavior between the two, but the tests really don't like that change. To fix this issue for expanding against different indentation I needed a way to first try and grab any additional newlines with the ability to rollback that guess. In #192 I experimented with decoupling scanning from the AroundBlockScan logic. It also added the ability to take a snapshot of the current scanner and rollback to prior changes. With this ability in place now we: - Grab extra empties before looking at the above/below line for the matching keyword/end statement - If there's a match, grab it - If there's no match, discard the newlines we picked up in the evaluation That solves the issue.
schneems
added a commit
that referenced
this pull request
May 3, 2023
Reported in #187 this code: ``` class Foo def foo if cond? foo else # comment end end # ... def bar if @recv end_is_missing_here end end ``` Triggers an incorrect suggestion: ``` Unmatched keyword, missing `end' ? 1 class Foo 2 def foo > 3 if cond? > 5 else 8 end 16 end ``` Part of the issue is that while scanning we're using newlines to determine when to stop and pause. This is useful for determining logically smaller chunks to evaluate but in this case it causes us to pause before grabbing the "end" that is right below the newline. This problem is similar to #179. However in the case of expanding same indentation "neighbors" I want to always grab all empty values at the end of the scan. I don't want to do that when changing indentation levels as it affects scan results. There may be some way to normalize this behavior between the two, but the tests really don't like that change. To fix this issue for expanding against different indentation I needed a way to first try and grab any additional newlines with the ability to rollback that guess. In #192 I experimented with decoupling scanning from the AroundBlockScan logic. It also added the ability to take a snapshot of the current scanner and rollback to prior changes. With this ability in place now we: - Grab extra empties before looking at the above/below line for the matching keyword/end statement - If there's a match, grab it - If there's no match, discard the newlines we picked up in the evaluation That solves the issue.
schneems
added a commit
that referenced
this pull request
May 3, 2023
Reported in #187 this code: ``` class Foo def foo if cond? foo else # comment end end # ... def bar if @recv end_is_missing_here end end ``` Triggers an incorrect suggestion: ``` Unmatched keyword, missing `end' ? 1 class Foo 2 def foo > 3 if cond? > 5 else 8 end 16 end ``` Part of the issue is that while scanning we're using newlines to determine when to stop and pause. This is useful for determining logically smaller chunks to evaluate but in this case it causes us to pause before grabbing the "end" that is right below the newline. This problem is similar to #179. However in the case of expanding same indentation "neighbors" I want to always grab all empty values at the end of the scan. I don't want to do that when changing indentation levels as it affects scan results. There may be some way to normalize this behavior between the two, but the tests really don't like that change. To fix this issue for expanding against different indentation I needed a way to first try and grab any additional newlines with the ability to rollback that guess. In #192 I experimented with decoupling scanning from the AroundBlockScan logic. It also added the ability to take a snapshot of the current scanner and rollback to prior changes. With this ability in place now we: - Grab extra empties before looking at the above/below line for the matching keyword/end statement - If there's a match, grab it - If there's no match, discard the newlines we picked up in the evaluation That solves the issue.
schneems
added a commit
that referenced
this pull request
May 3, 2023
Reported in #187 this code: ``` class Foo def foo if cond? foo else # comment end end # ... def bar if @recv end_is_missing_here end end ``` Triggers an incorrect suggestion: ``` Unmatched keyword, missing `end' ? 1 class Foo 2 def foo > 3 if cond? > 5 else 8 end 16 end ``` Part of the issue is that while scanning we're using newlines to determine when to stop and pause. This is useful for determining logically smaller chunks to evaluate but in this case it causes us to pause before grabbing the "end" that is right below the newline. This problem is similar to #179. However in the case of expanding same indentation "neighbors" I want to always grab all empty values at the end of the scan. I don't want to do that when changing indentation levels as it affects scan results. There may be some way to normalize this behavior between the two, but the tests really don't like that change. To fix this issue for expanding against different indentation I needed a way to first try and grab any additional newlines with the ability to rollback that guess. In #192 I experimented with decoupling scanning from the AroundBlockScan logic. It also added the ability to take a snapshot of the current scanner and rollback to prior changes. With this ability in place now we: - Grab extra empties before looking at the above/below line for the matching keyword/end statement - If there's a match, grab it - If there's no match, discard the newlines we picked up in the evaluation That solves the issue.
matzbot
pushed a commit
to ruby/ruby
that referenced
this pull request
May 23, 2023
ruby/syntax_suggest#187 Handle if/else with empty/comment line Reported in #187 this code: ``` class Foo def foo if cond? foo else # comment end end # ... def bar if @recv end_is_missing_here end end ``` Triggers an incorrect suggestion: ``` Unmatched keyword, missing `end' ? 1 class Foo 2 def foo > 3 if cond? > 5 else 8 end 16 end ``` Part of the issue is that while scanning we're using newlines to determine when to stop and pause. This is useful for determining logically smaller chunks to evaluate but in this case it causes us to pause before grabbing the "end" that is right below the newline. This problem is similar to ruby/syntax_suggest#179. However in the case of expanding same indentation "neighbors" I want to always grab all empty values at the end of the scan. I don't want to do that when changing indentation levels as it affects scan results. There may be some way to normalize this behavior between the two, but the tests really don't like that change. To fix this issue for expanding against different indentation I needed a way to first try and grab any additional newlines with the ability to rollback that guess. In #192 I experimented with decoupling scanning from the AroundBlockScan logic. It also added the ability to take a snapshot of the current scanner and rollback to prior changes. With this ability in place now we: - Grab extra empties before looking at the above/below line for the matching keyword/end statement - If there's a match, grab it - If there's no match, discard the newlines we picked up in the evaluation That solves the issue. ruby/syntax_suggest@513646b912
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
While #177 is reported as being caused by a comment, the underlying behavior is a problem due to the newline that we generated (from a comment). The prior commit fixed that problem by preserving whitespace before the comment. That guarantees that a block will form there from the frontier before it will be expanded there via a "neighbors" method. Since empty lines are valid ruby code, it will be hidden and be safe.
Problem setup
This failure mode is not fixed by the prior commit, because the indentation is 0. To provide good results, we must make the algorithm less greedy. One heuristic/signal to follow is developer added newlines. If a developer puts a newline between code, it's more likely they're unrelated. For example:
This code is split into three chunks by the developer. Each are likely (but not guaranteed) to be intended to stand on their own (in terms of syntax). This behavior is good for scanning neighbors (same indent or higher) within a method, but bad for parsing neighbors across methods.
Problem
Code is expanded to capture all neighbors, and then it decreases indent level which allows it to capture surrounding scope (think moving from within the method to also capturing the
def/end
definition. Once the indentation level has been increased, we go back to scanning neighbors, but now neighbors also contain keywords.For example:
In this case if lines 4, 5, and 6 are in a block when it tries to expand neighbors it will expand up. If it stops after line 2 or 3 it may cause problems since there's a va
lid kw/end pair, but the block will be checked without it.
TLDR; It's good to stop scanning code after hitting a newline when you're in a method...it causes a problem scanning code between methods when everything inside of one of t
he methods is an empty line.
In this case it grabs the end on line 3 and since the problem was an extra end, the program now compiles correctly. It incorrectly assumes that the block it captured was ca
using the problem.
Extra bit of context
One other technical detail is that after we've decided to stop scanning code for a new neighbor block expansion, we look around the block and grab any empty newlines. Basic
ally adding empty newlines before of after a code block do not affect the parsing of that block.
The fix
Since we know that this problem only happens when there's a newline inside of a method and we know this particular failure mode is due to having an invalid block (capturing
an extra end, but not it's keyword) we have all the metadata we need to detect this scenario and correct it.
We know that the next line above our block must be code or empty (since we grabbed extra newlines). Same for code below it. We can count all the keywords and ends in the bl
ock. If they are balanced, it's likely (but not guaranteed) we formed the block correctly. If they're imbalanced, look above or below (depending on the nature of the imbalance)
, check to see if adding that line would balance the count.
This concept of balance and "leaning" comes from work in #152 and has proven useful, but not been formally introduced into the ma
in branch.
Outcome
Adding this extra check introduced no regressions and fixed the test case. It might be possible there's a mirror or similar problem that we're not handling. That will come
out in time. It might also be possible that this causes a worse case in some code not under test. That too would come out in time.
One other possible concern to adding logic in this area (which is a hot codepath), is performance. This extra count check will be performed for every block. In general the
two most helpful performance strategies I've found are reducing total number of blocks (therefore reducing overall N internal iterations) and making better matches (the parser
to determine if a close block is valid or not is a major bottleneck. If we can split valid code into valid blocks, then it's only evaluated by the parser once, where as invalid
code must be continuously re-checked by the parser until it becomes valid, or is determined to be the cause of the core problem.
This extra logic should very rarely result in a change, but when it does it should tend to produce slightly larger blocks (by one line) and more accurate blocks.
Informally it seems to have no impact on performance:
``
This branch:
DEBUG_DISPLAY=1 bundle exec rspec spec/ --format=failures 3.01s user 1.62s system 113% cpu 4.076 total
On main:
DEBUG_DISPLAY=1 bundle exec rspec spec/ --format=failures 3.02s user 1.64s system 113% cpu 4.098 total