sarathfrancis90 · 2026-06-16T08:08:43Z

Bug

The Markdown, reStructuredText and TiddlyWiki5 lexers all delegate the body of an embedded code block to a sub-lexer with lexer.get_tokens_unprocessed(code). That call returns indices relative to the code snippet (starting at 0), but the indices the outer lexer yields are supposed to be absolute positions within the whole input text. So every token inside a fenced/indented code block came back with a wrong offset. The MarkdownLexer even had a # FIXME: aren't the offsets wrong? comment about exactly this.

Minimal check:

from pygments.lexers.markup import MarkdownLexer
text = "intro\n```python\nx = 1\n```\n"
for index, _tok, value in MarkdownLexer().get_tokens_unprocessed(text):
    assert text[index:index+len(value)] == value  # fails on the embedded code tokens

Cause

The delegated token stream (and the do_insertions(...) output wrapping it) is never re-based by the start position of the code group, so the sub-lexer's 0-based indices leak out unchanged.
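The leak can be reproduced without Pygments. The sketch below uses a toy stand-in for the delegated lexer (the `sub_lexer` function and the regex that locates the code group are illustrative assumptions, not the actual lexer code) to show that tokens produced from the extracted snippet carry 0-based indices that do not line up with the full input.

```python
import re


def sub_lexer(code):
    """Toy stand-in for a delegated lexer (hypothetical, not Pygments code).

    Yields (index, kind, value) tuples whose indices are relative to `code`.
    """
    for m in re.finditer(r"\S+|\s+", code):
        yield m.start(), "Text", m.group()


text = "intro\n```python\nx = 1\n```\n"

# Locate the body of the fenced block; group 1 is the "code group".
match = re.search(r"```python\n(.*?)```", text, re.S)
code = match.group(1)

# The outer lexer forwards these tokens unchanged, so the indices are
# still relative to `code` rather than to `text`.
leaked = list(sub_lexer(code))

first_index, _kind, first_value = leaked[0]
print(first_index, repr(first_value))            # 0 'x'
print(repr(text[first_index]))                   # 'i', not 'x'
print(repr(text[match.start(1) + first_index]))  # 'x' once re-based
```

The gap between the reported and the true position is exactly `match.start(1)`, the start of the code group, which is why adding that value is enough to correct the indices.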

Fix

Added a small _shift_indices(tokens, offset) helper and re-based the delegated tokens by match.start(<code group>) at the four affected sites (RstLexer._handle_sourcecode, MarkdownLexer._handle_codeblock, TiddlyWiki5Lexer._handle_codeblock, TiddlyWiki5Lexer._handle_cssblock). The plain get_tokens output and the rendered HTML are byte-for-byte unchanged — only the previously incorrect positions from get_tokens_unprocessed are corrected.

Testing

Added tests/test_markup.py, which asserts for all three lexers that every token's index points at its own value in the input (fails before the change, passes after). Full suite: 5262 passed, 16 skipped; ruff check --ignore UP031 . reports no issues.

The Markdown, reStructuredText and TiddlyWiki5 lexers delegate the content of an embedded code block to a sub-lexer via get_tokens_unprocessed(code). That returns indices relative to the code snippet (starting at 0), but the indices yielded by the outer lexer must be absolute positions within the whole input text. The result was that every token inside a code block reported a wrong offset (the MarkdownLexer even carried a 'FIXME: aren't the offsets wrong?' comment about this). Re-base the delegated token indices by the start position of the code group. The plain get_tokens output and the rendered HTML are unchanged; only the previously incorrect positions returned by get_tokens_unprocessed are corrected.


Fix wrong token offsets for embedded code blocks (Markdown/RST/TiddlyWiki) #3158

sarathfrancis90 wants to merge 1 commit into pygments:master from sarathfrancis90:fix-codeblock-token-offsets
