Fix wrong token offsets for embedded code blocks (Markdown/RST/TiddlyWiki) #3158
Open
sarathfrancis90 wants to merge 1 commit into
The Markdown, reStructuredText and TiddlyWiki5 lexers delegate the content of an embedded code block to a sub-lexer via get_tokens_unprocessed(code). That returns indices relative to the code snippet (starting at 0), but the indices yielded by the outer lexer must be absolute positions within the whole input text. The result was that every token inside a code block reported a wrong offset (the MarkdownLexer even carried a 'FIXME: aren't the offsets wrong?' comment about this). Re-base the delegated token indices by the start position of the code group. The plain get_tokens output and the rendered HTML are unchanged; only the previously incorrect positions returned by get_tokens_unprocessed are corrected.
Fixes #3133.
Bug
The Markdown, reStructuredText and TiddlyWiki5 lexers all delegate the body of an embedded code block to a sub-lexer with `lexer.get_tokens_unprocessed(code)`. That call returns indices relative to the `code` snippet (starting at `0`), but the indices the outer lexer yields are supposed to be absolute positions within the whole input text. So every token inside a fenced/indented code block came back with a wrong offset. The `MarkdownLexer` even had a `# FIXME: aren't the offsets wrong?` comment about exactly this.

Minimal check:
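One way to express such a check, as a sketch rather than the snippet from this PR: a token's index is consistent when slicing the input at that index gives back the token's value. The function name `misplaced_tokens`, the sample text, and the string token types below are all illustrative assumptions; the token streams are hand-written to mimic sub-lexer output.

```python
def misplaced_tokens(tokens, text):
    """Return the (index, tokentype, value) triples whose index does not
    point at their own value inside `text`."""
    return [
        (index, tokentype, value)
        for index, tokentype, value in tokens
        if text[index:index + len(value)] != value
    ]


# Hand-written stand-in for a Markdown document with a fenced code block.
text = "intro\n```python\nx = 1\n```\n"
code_start = text.index("x = 1")

# What a sub-lexer would report for the snippet "x = 1\n": indices start at 0.
# Token types are plain strings here to keep the sketch dependency-free.
relative = [(0, "Name", "x"), (1, "Text", " "), (2, "Operator", "="),
            (3, "Text", " "), (4, "Number", "1"), (5, "Text", "\n")]

# The same tokens with indices relative to the whole document.
absolute = [(i + code_start, t, v) for i, t, v in relative]

assert misplaced_tokens(absolute, text) == []
assert misplaced_tokens(relative, text)  # non-empty: these offsets are wrong
```

With Pygments installed, the same function can be given `MarkdownLexer().get_tokens_unprocessed(text)` directly; an empty result means every reported offset is consistent with the input.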
Cause
The delegated token stream (and the `do_insertions(...)` output wrapping it) is never re-based by the start position of the code group, so the sub-lexer's 0-based indices leak out unchanged.
Fix
Added a small `_shift_indices(tokens, offset)` helper and re-based the delegated tokens by `match.start(<code group>)` at the four affected sites (`RstLexer._handle_sourcecode`, `MarkdownLexer._handle_codeblock`, `TiddlyWiki5Lexer._handle_codeblock`, `TiddlyWiki5Lexer._handle_cssblock`). The plain `get_tokens` output and the rendered HTML are byte-for-byte unchanged; only the previously incorrect positions from `get_tokens_unprocessed` are corrected.
Testing
Added `tests/test_markup.py` asserting that every token's index points at its own value in the input, for all three lexers (fails before, passes after). Full suite: `5262 passed, 16 skipped`; `ruff check --ignore UP031 .` is clean.