Fix wrong token offsets for embedded code blocks (Markdown/RST/TiddlyWiki) #3158

Open
sarathfrancis90 wants to merge 1 commit into pygments:master from sarathfrancis90:fix-codeblock-token-offsets

Fixes #3133.

Bug

The Markdown, reStructuredText and TiddlyWiki5 lexers all delegate the body of an embedded code block to a sub-lexer with lexer.get_tokens_unprocessed(code). That call returns indices relative to the code snippet (starting at 0), but the indices the outer lexer yields are supposed to be absolute positions within the whole input text. So every token inside a fenced/indented code block came back with a wrong offset. The MarkdownLexer even had a # FIXME: aren't the offsets wrong? comment about exactly this.

Minimal check:

from pygments.lexers.markup import MarkdownLexer
text = "intro\n```python\nx = 1\n```\n"
for index, _tok, value in MarkdownLexer().get_tokens_unprocessed(text):
    assert text[index:index+len(value)] == value  # fails on the embedded code tokens

Cause

The delegated token stream (and the do_insertions(...) output wrapping it) is never re-based by the start position of the code group, so the sub-lexer's 0-based indices leak out unchanged.

Fix

Added a small _shift_indices(tokens, offset) helper and re-based the delegated tokens by match.start(<code group>) at the four affected sites (RstLexer._handle_sourcecode, MarkdownLexer._handle_codeblock, TiddlyWiki5Lexer._handle_codeblock, TiddlyWiki5Lexer._handle_cssblock). The plain get_tokens output and the rendered HTML are byte-for-byte unchanged — only the previously incorrect positions from get_tokens_unprocessed are corrected.
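For illustration, the re-basing step could look roughly like the sketch below. The helper name and signature come from the description above, but the body is an assumption, not the code in this PR. The sub-lexer output is a hand-written stand-in with plain strings as token types, so the snippet runs without Pygments.

```python
def _shift_indices(tokens, offset):
    """Re-base (index, tokentype, value) triples by a fixed offset.

    Assumed body: the PR names this helper but its code is not shown here.
    """
    for index, tokentype, value in tokens:
        yield index + offset, tokentype, value


text = "intro\n```python\nx = 1\n```\n"
code = "x = 1\n"

# In the lexers this would be match.start(<code group>); here it is looked up directly.
code_start = text.index(code)

# Stand-in for sub-lexer output: indices are relative to `code`, starting at 0.
sub_tokens = [
    (0, "Name", "x"),
    (1, "Text", " "),
    (2, "Operator", "="),
    (3, "Text", " "),
    (4, "Number", "1"),
    (5, "Text", "\n"),
]

shifted = list(_shift_indices(sub_tokens, code_start))

# After shifting, every index points at its own value in the full input.
for index, _tokentype, value in shifted:
    assert text[index:index + len(value)] == value
```

Shifting on the way out leaves the sub-lexer untouched and only changes the first element of each triple, which fits the claim that token types and values are unchanged.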

Testing

Added tests/test_markup.py asserting that every token's index points at its own value in the input, for all three lexers (fails before, passes after). Full suite: 5262 passed, 16 skipped. Running ruff check --ignore UP031 . reports no issues.
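The invariant the tests check can be sketched as a small lexer-agnostic helper. The name misplaced_tokens and the hand-written token lists are hypothetical and are not taken from tests/test_markup.py.

```python
def misplaced_tokens(text, tokens):
    """Return the (index, tokentype, value) triples whose index does not point at value."""
    return [
        (index, tokentype, value)
        for index, tokentype, value in tokens
        if text[index:index + len(value)] != value
    ]


text = "intro\n```python\nx = 1\n```\n"

# Hand-written stand-ins for lexer output; token types are plain strings here.
rebased = [(16, "Name", "x"), (18, "Operator", "="), (20, "Number", "1")]
unrebased = [(0, "Name", "x"), (2, "Operator", "="), (4, "Number", "1")]

assert misplaced_tokens(text, rebased) == []
assert len(misplaced_tokens(text, unrebased)) == 3
```

With Pygments installed, the same helper can be given MarkdownLexer().get_tokens_unprocessed(text) as the tokens argument. An empty result means every reported offset is absolute.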

The Markdown, reStructuredText and TiddlyWiki5 lexers delegate the
content of an embedded code block to a sub-lexer via
get_tokens_unprocessed(code). That returns indices relative to the
code snippet (starting at 0), but the indices yielded by the outer
lexer must be absolute positions within the whole input text. The
result was that every token inside a code block reported a wrong
offset (the MarkdownLexer even carried a 'FIXME: aren't the offsets
wrong?' comment about this).

Re-base the delegated token indices by the start position of the code
group. The plain get_tokens output and the rendered HTML are
unchanged; only the previously incorrect positions returned by
get_tokens_unprocessed are corrected.

Development

Successfully merging this pull request may close these issues.

Update from 2.19.1 to 2.20.0 destroys markdown rendering of code blocks