feat(parser/markdown): parser implementation #8525
base: next
Conversation
Enhanced the Markdown lexer to recognize additional tokens needed for block parsing:
- Hash (#) for ATX headers
- Backtick (`) and triple backtick for code fences
- Tilde (~) for fenced code blocks
- Greater than (>) for block quotes
- Star (*), underscore (_), minus (-) for emphasis and lists
- Brackets and parentheses for links
- Bang (!) for images

Implemented paragraph parsing with blank line detection:
- Added parse_paragraph, parse_inline_item_list, parse_textual
- Added trivia position tracking to detect paragraph boundaries
- Modified grammar to make hard_line optional in MdParagraph

Added comprehensive lexer tests for all new token types.
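As a rough illustration of the blank-line rule behind the paragraph boundaries described above, here is a standalone sketch. The helper name and the string-based scan are illustrative only; the actual implementation tracks this over lexer trivia, not raw text.

```rust
/// A paragraph ends at a blank line: two or more newlines with only
/// whitespace between them. A single newline keeps the paragraph going.
fn has_blank_line(gap: &str) -> bool {
    let mut newlines = 0;
    for ch in gap.chars() {
        match ch {
            '\n' => {
                newlines += 1;
                if newlines >= 2 {
                    return true;
                }
            }
            ' ' | '\t' | '\r' => {} // whitespace between newlines is ignored
            _ => newlines = 0,      // any other character resets the count
        }
    }
    false
}

fn main() {
    assert!(has_blank_line("\n   \n")); // blank line: paragraph boundary
    assert!(!has_blank_line("\n"));     // single newline: same paragraph
}
```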
Implemented ATX header parsing for Markdown with support for:
- Header levels 1-6 (# through ######)
- Content parsing within headers
- Empty trailing hash list (trailing hashes included in content for now)

Added has_newline_since method to detect single newlines for header boundaries, as headers end at a single newline, unlike paragraphs, which end at blank lines (2+ newlines).
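To make the header-level rule concrete, here is a minimal sketch of the CommonMark constraint (1–6 hashes followed by whitespace or end of line). The function name is hypothetical and not part of this PR's parser.

```rust
/// Returns the ATX header level of a line, if it qualifies:
/// 1-6 leading '#' characters followed by a space, tab, or end of line.
fn atx_header_level(line: &str) -> Option<usize> {
    const MAX_HEADER_HASHES: usize = 6;
    let hashes = line.chars().take_while(|&c| c == '#').count();
    if hashes == 0 || hashes > MAX_HEADER_HASHES {
        return None;
    }
    match line[hashes..].chars().next() {
        None | Some(' ') | Some('\t') => Some(hashes),
        _ => None, // e.g. "#NoSpace" is a paragraph, not a header
    }
}

fn main() {
    assert_eq!(atx_header_level("### Title"), Some(3));
    assert_eq!(atx_header_level("####### Seven"), None); // more than 6 hashes
    assert_eq!(atx_header_level("#NoSpace"), None);
}
```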
Add support for parsing fenced code blocks (``` delimiters). Changes:
- Add parse_fenced_code_block function with fence detection
- Support optional language info string after opening fence
- Handle empty language lines using preceding line break detection
- Simplify MdCodeNameList grammar to use non-separated list
- Use MdInlineItemList for code block content
- Add fenced_code_block.md test file with snapshot
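For context, a standalone sketch of opening-fence detection and info-string extraction, outside the real lexer: a fence is three or more backticks or tildes, optionally followed by a language name. The helper name is illustrative.

```rust
/// Detects an opening code fence on a line and returns the fence
/// character, its length, and the trimmed info string (language name).
fn open_fence(line: &str) -> Option<(char, usize, &str)> {
    let fence_char = line.chars().next().filter(|c| *c == '`' || *c == '~')?;
    let len = line.chars().take_while(|&c| c == fence_char).count();
    if len < 3 {
        return None;
    }
    Some((fence_char, len, line[len..].trim()))
}

fn main() {
    assert_eq!(open_fence("```rust"), Some(('`', 3, "rust")));
    assert_eq!(open_fence("~~~"), Some(('~', 3, ""))); // empty info string
    assert_eq!(open_fence("``not a fence"), None);
}
```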
Implement parse_indent_code_block for CommonMark-style indented code blocks that require 4+ spaces of indentation. Key changes:
- Fix at_indent_code_block() to use >= 4 instead of > 4 (CommonMark spec)
- Implement parse_indent_code_block() with proper grammar structure
- Use bump_remap(MD_TEXTUAL_LITERAL) to handle all token types as textual content within code blocks (fixes BOGUS node issue)
- Simplify MdIndentCodeBlock grammar since indentation is tracked in trivia
- Fix parse_textual() to use bump_remap for syntax factory compatibility

The implementation correctly:
- Detects 4+ spaces of leading whitespace in trivia
- Parses all content as MdTextual nodes within MdInlineItemList
- Ends code blocks at blank lines or non-indented lines
- Preserves all source text in the CST for lossless parsing
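A minimal sketch of the `>= 4` rule this commit fixes, assuming space-only indentation (CommonMark also treats a tab as advancing to the next 4-column stop, which is omitted here for brevity). The constant mirrors the one mentioned in the PR; the helper itself is illustrative.

```rust
const INDENT_CODE_BLOCK_SPACES: usize = 4;

/// A line continues or starts an indented code block when it is indented
/// by at least four spaces.
fn is_indented_code_line(line: &str) -> bool {
    line.chars().take_while(|&c| c == ' ').count() >= INDENT_CODE_BLOCK_SPACES
}

fn main() {
    assert!(is_indented_code_line("    let x = 1;")); // exactly 4 spaces qualifies
    assert!(!is_indented_code_line("   three spaces"));
}
```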
Add parsing support for block quotes and bullet lists:
- Block quote parsing: detects `>` at line start, parses content as paragraph with quote marker included as textual content
- Bullet list parsing: detects `-` or `*` at line start, parses consecutive bullets into MdBulletListItem containing MdBulletList
- Add `at_start_of_input()` helper to detect document start for block detection
- Update grammar to remove required space token from MdBullet (whitespace is handled as trivia)
- Add test cases for block quotes and bullet lists
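As a plain-text illustration of the line-start detection described above (the real parser works on tokens and trivia, not lines): `>` opens a block quote, and `-` or `*` (and, later in this PR, `+`) followed by a space opens a bullet item. Function names are hypothetical.

```rust
fn is_block_quote_line(line: &str) -> bool {
    line.trim_start().starts_with('>')
}

fn is_bullet_line(line: &str) -> bool {
    let trimmed = line.trim_start();
    let mut chars = trimmed.chars();
    // a bullet marker followed by a space (or nothing) starts a list item
    matches!(chars.next(), Some('-' | '*' | '+')) && matches!(chars.next(), Some(' ') | None)
}

fn main() {
    assert!(is_block_quote_line("> quoted text"));
    assert!(is_bullet_line("- item"));
    assert!(!is_bullet_line("-not a bullet"));
}
```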
Add parsing for inline Markdown elements:
- Inline code spans (`code`)
- Emphasis/bold (**text** and __text__)
- Italic (*text* and _text_)
- Links ([text](url))
- Images (![text](url))

Also fix bullet list parsing to stop at line breaks, ensuring each list item contains only single-line content. Updated the MdInlineImage grammar to match standard Markdown image syntax (![text](url)) instead of the linked image format.
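A toy illustration of the `[text](url)` shape handled here, using plain string scanning instead of the parser's token-based approach; unclosed links simply yield nothing, echoing the permissive handling described in this PR. The function name is illustrative.

```rust
/// Extracts (text, url) from an inline link of the form "[text](url)".
fn parse_inline_link(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('[')?;
    let close = rest.find(']')?;
    let (text, after) = (&rest[..close], &rest[close + 1..]);
    let url_part = after.strip_prefix('(')?;
    let end = url_part.find(')')?;
    Some((text, &url_part[..end]))
}

fn main() {
    assert_eq!(
        parse_inline_link("[Biome](https://example.com)"),
        Some(("Biome", "https://example.com"))
    );
    assert_eq!(parse_inline_link("[unclosed"), None); // tolerated, just not a link
}
```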
- Fix `parse_bullet` to return `ParsedSyntax` instead of `()`
- Add error recovery for list parsing to prevent infinite loops
- Make closing tokens optional in grammar for inline elements (matches Markdown's permissive nature for unclosed elements)
- Fix grammar comment to say "Markdown's" instead of "Rust's"
- Fix clippy doc-link-with-quotes warnings

These changes address the critical issues identified in code review that would have blocked a PR submission.
- Add backslash escape handling in lexer for CommonMark punctuation
- Add image support to nested inline parsing
- Document trailing hash limitation in ATX headers
- Add 3 lexer tests for escape sequences
- Update changeset to reflect escape sequence support
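A standalone version of the escape rule referenced above, per CommonMark: a backslash followed by any ASCII punctuation character makes that character literal, while a backslash before anything else remains a literal backslash. The helper is illustrative, not the lexer's API.

```rust
/// True when a character may be backslash-escaped per CommonMark.
fn is_escapable(c: char) -> bool {
    r##"!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"##.contains(c)
}

fn main() {
    assert!(is_escapable('*'));  // \* renders a literal asterisk
    assert!(is_escapable('\\'));
    assert!(!is_escapable('a')); // \a keeps the backslash as text
}
```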
- Fix import order in quote.rs
- Update quick_test example in spec_test.rs
Improvements to the Markdown parser:
- Add constants for magic numbers (INDENT_CODE_BLOCK_SPACES, MAX_HEADER_HASHES)
- Implement trailing hash parsing for ATX headers using lookahead detection
- Add TRIPLE_TILDE token and support for ~~~ fenced code blocks per CommonMark
- Update tests and snapshots
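For reference, a sketch of the trailing-hash rule the lookahead implements: a run of `#` at the end of an ATX header line is a closing sequence (not content) only when preceded by a space and followed by nothing but whitespace. This is a string-level illustration under that assumption, not the parser's checkpoint-based code.

```rust
/// Returns the header text with any valid closing hash sequence stripped.
fn strip_trailing_hashes(content: &str) -> &str {
    let trimmed = content.trim_end();
    let without_hashes = trimmed.trim_end_matches('#');
    if without_hashes.len() != trimmed.len()
        && (without_hashes.is_empty() || without_hashes.ends_with(' '))
    {
        without_hashes.trim_end()
    } else {
        trimmed
    }
}

fn main() {
    assert_eq!(strip_trailing_hashes("Title ###"), "Title");
    assert_eq!(
        strip_trailing_hashes("Mixed # content ## with ### trailing ####"),
        "Mixed # content ## with ### trailing"
    );
    assert_eq!(strip_trailing_hashes("C# "), "C#"); // no space before the hashes
}
```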
Walkthrough

The PR converts the crate into a Markdown parser: the lexer is rewritten for Markdown tokens (headers, fences, emphasis, block quotes, lists, escapes, textual grouping); new private lexer helpers handle backticks, tildes, hashes, escapes, and the thematic-break vs emphasis logic; the parser gains trivia inspection APIs and new modules (fenced_code_block, header, inline, list, quote) implementing fenced/indented code blocks, ATX headers, block quotes, bullet lists, paragraphs, and inline elements; grammar/kind updates and a large suite of lexer and fixture tests are added.
Pre-merge checks and finishing touches: ✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
crates/biome_markdown_parser/src/syntax/header.rs (1)
151-155: Redundant wrapper function.
`has_newline_since` just delegates to `p.has_newline_since`. Consider inlining calls directly.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)

- .changeset/curly-donkeys-show.md (1 hunks)
- crates/biome_markdown_parser/src/lexer/mod.rs (4 hunks)
- crates/biome_markdown_parser/src/lexer/tests.rs (1 hunks)
- crates/biome_markdown_parser/src/parser.rs (1 hunks)
- crates/biome_markdown_parser/src/syntax.rs (3 hunks)
- crates/biome_markdown_parser/src/syntax/fenced_code_block.rs (1 hunks)
- crates/biome_markdown_parser/src/syntax/header.rs (1 hunks)
- crates/biome_markdown_parser/src/syntax/inline.rs (1 hunks)
- crates/biome_markdown_parser/src/syntax/list.rs (1 hunks)
- crates/biome_markdown_parser/src/syntax/quote.rs (1 hunks)
- crates/biome_markdown_parser/src/token_source.rs (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/block_quote.md (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/fenced_code_block.md (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/header.md (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/indent_code_block.md (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/inline_elements.md (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph.md (1 hunks)
- crates/biome_markdown_parser/tests/spec_test.rs (1 hunks)
- xtask/codegen/markdown.ungram (4 hunks)
- xtask/codegen/src/markdown_kinds_src.rs (1 hunks)
🪛 LanguageTool
crates/biome_markdown_parser/tests/md_test_suite/ok/inline_elements.md
[style] ~4-~4: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...This is italic and this is bold. This is also italic and also bold. He...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~5-~5: You’ve already used the word ‘also’ once in your sentence, so using it again may be redundant.
Context: ... bold. This is also italic and also bold. Here is a [link](https://example.co...
(REDUNDANT_FILLER)
🔇 Additional comments (53)
xtask/codegen/src/markdown_kinds_src.rs (1)

19-20: LGTM! The `TRIPLE_TILDE` token follows the established pattern alongside `TRIPLE_BACKTICK`, neatly supporting CommonMark tilde-fenced code blocks.

crates/biome_markdown_parser/tests/md_test_suite/ok/block_quote.md (1)

1-4: LGTM! Solid test fixture covering multi-line block quotes and blank-line separation between distinct quote blocks.

crates/biome_markdown_parser/tests/md_test_suite/ok/inline_elements.md (1)

1-7: LGTM! Comprehensive coverage of inline elements: code spans, both emphasis syntaxes (`*`/`_`), links, and images. The static analysis grumbles about repetitive phrasing, but that's rather the point of a test fixture.

crates/biome_markdown_parser/tests/md_test_suite/ok/header.md (1)

1-17: LGTM! Excellent coverage of ATX headers (H1–H6) and trailing-hash edge cases. The "Mixed # content ## with ### trailing ####" line is a nice stress test for the lookahead-based detection.

Consider adding edge-case tests for invalid headers in a separate fixture (e.g., `####### Seven hashes`, which shouldn't parse as a header, or `#NoSpace` without the required space). These would exercise error recovery paths.

crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md (1)

1-6: LGTM! Covers both bullet markers (`-` and `*`) with clean separation. Given multi-line list items are deferred to future work, this is appropriate Phase 1 coverage.

crates/biome_markdown_parser/tests/md_test_suite/ok/indent_code_block.md (1)

1-8: LGTM! Test fixture appropriately exercises indented code block parsing with proper 4-space indentation and mixed content.

crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph.md (1)

1-3: LGTM! Clean test fixture for paragraph parsing with blank line separation.

crates/biome_markdown_parser/tests/spec_test.rs (1)

138-141: LGTM! Test input updated with realistic indented code block content, aligning with the new test fixtures.
crates/biome_markdown_parser/src/lexer/tests.rs (1)

165-422: LGTM! Comprehensive test coverage for the new Markdown lexer capabilities. The tests are well-structured, consistently use the `assert_lex!` macro, and appropriately exercise ATX headers, fenced code blocks, block quotes, emphasis, links, images, and escape sequences.

crates/biome_markdown_parser/tests/md_test_suite/ok/fenced_code_block.md (1)

1-15: LGTM! Test fixture properly exercises fenced code block parsing with both backtick and tilde syntax, including language specifiers.

.changeset/curly-donkeys-show.md (1)

7-16: LGTM! Clear, detailed description of the implemented features with appropriate caveats about scope limitations.

crates/biome_markdown_parser/src/token_source.rs (3)

86-89: LGTM! Simple trivia length getter appropriately exposes the trivia list size for position tracking.

91-122: LGTM! The blank line detection logic is correct: it properly counts consecutive newlines (ignoring whitespace between them), resets on trailing trivia or non-newline/whitespace trivia, and returns true when 2+ consecutive newlines are found.

124-133: LGTM! Simple and correct detection of any non-trailing newline since the given position.

crates/biome_markdown_parser/src/parser.rs (1)

34-53: LGTM! All four methods are clean delegations to the token source with clear documentation. They appropriately expose trivia tracking capabilities for block-level construct detection and paragraph boundary handling.
crates/biome_markdown_parser/src/syntax/quote.rs (2)

1-11: Module structure looks good. Clean imports and module documentation. The module is correctly scoped as `pub(crate)`.

13-16: Detection logic is correct. Properly checks for `>` at line start using `has_preceding_line_break()` or `at_start_of_input()`.

crates/biome_markdown_parser/src/syntax/fenced_code_block.rs (4)

8-11: Detection function is correct. Properly identifies both backtick and tilde fence openings.

21-55: Well-structured fenced code block parsing. Good handling of fence type matching per CommonMark spec. The optional closing fence is correctly handled.

62-82: Language name parsing is correct. Sensibly checks for preceding line break to detect empty language strings and parses until newline.

86-105: Content parsing handles fence matching correctly. Appropriately stops at matching fence type or EOF. Each token is wrapped as `MD_TEXTUAL`, which aligns with the grammar.

crates/biome_markdown_parser/src/syntax/header.rs (4)

10-11: Good use of named constant. `MAX_HEADER_HASHES = 6` makes the CommonMark constraint clear.

57-70: Hash list parsing is straightforward. Correctly wraps each `#` in `MD_HASH` and the sequence in `MD_HASH_LIST`.

126-149: Trailing hash detection uses checkpoint correctly. The lookahead pattern with `checkpoint`/`rewind` is the right approach for this detection.

27-42: No issue here — the parser state is properly cleaned up. `parse_header` is called only from within `try_parse`, which saves a checkpoint before the call. When `rewind()` is triggered on parse failure, it truncates the entire events list back to the checkpoint position, discarding the MD_HASH_LIST completion and all subsequent work. The function signature correctly returns `Absent` and relies on the caller's checkpoint/rewind mechanism, which is the standard pattern in biome_parser. Likely an incorrect or invalid review comment.
crates/biome_markdown_parser/src/syntax.rs (6)

1-5: Good module organisation. Clean separation of concerns with dedicated submodules for each block/inline type.

21-22: Named constant improves readability. `INDENT_CODE_BLOCK_SPACES = 4` documents the CommonMark requirement clearly.

27-28: EOF bump is required by the grammar. Good addition with clear comment explaining the rationale.

41-77: Block dispatch logic is well-structured. Good fallback chain with `try_parse` for constructs that may fail validation (thematic breaks, headers). The fallback to paragraph parsing is sensible.

90-109: Indented code block parsing is correct. Properly checks for 4+ spaces and consumes content while indentation holds.

133-155: Inline item list parsing handles blank lines correctly. Unlike quote.rs, the trivia position is captured before the loop (line 137), so blank-line detection works as intended.

crates/biome_markdown_parser/src/syntax/list.rs (4)

17-30: Bullet detection logic is sound. Correctly requires line-start position for bullet markers.

52-68: Error recovery prevents infinite loops. Good defensive programming with position tracking.

101-122: Bullet content parsing stops at newline correctly. The `has_preceding_line_break()` check before parsing ensures single-line items.

124-137: Ordered list stubs are appropriately marked. Using `#[expect(dead_code)]` is cleaner than `#[allow(dead_code)]` as it will warn when the code is eventually used.

crates/biome_markdown_parser/src/lexer/mod.rs (5)

13-14: Import refactoring is cleaner. Importing the `Dispatch` module and `lookup_byte` separately improves clarity.

182-196: Dispatch table extended correctly. Good coverage of Markdown-specific tokens: hash, backtick, tilde, brackets, escape sequences.

199-250: Escape handling follows CommonMark spec. All escapable ASCII punctuation characters are correctly enumerated.

336-399: Thematic break vs emphasis disambiguation is well-handled. The save/restore position pattern for speculative parsing is correct. Good handling of the edge case where IDT can match letters or underscore.

401-436: Hash, backtick, and tilde consumers are straightforward. Clean implementation differentiating single vs triple tokens where applicable.
crates/biome_markdown_parser/src/syntax/inline.rs (8)

1-10: Clean module setup with clear documentation. Good overview of supported inline elements.

15-40: Inline code parsing handles unclosed spans gracefully. Correctly parses content until closing backtick, newline, or EOF. The optional closing backtick handling is appropriate for error tolerance.

45-78: Emphasis parsing is correct. Good use of the `opener` variable to ensure a matching close marker.

80-116: Italic parsing mirrors emphasis structure. Structurally similar to `parse_inline_emphasis`. If this pattern repeats further, consider extracting a helper, but for two cases it's fine.

118-165: Link parsing handles partial syntax gracefully. Good error tolerance - incomplete `[text]` or missing `(url)` portions are handled without panicking.

167-223: Image parsing correctly requires the `![` prefix. The two-token lookahead (`BANG` then `L_BRACK`) is the right approach.

225-241: Nested inline avoids emphasis recursion - good design. The comment at lines 228-230 explains the intentional limitation clearly.

243-258: Dispatcher routes correctly with textual fallback. Complete coverage of inline constructs with sensible default to `parse_textual`.

xtask/codegen/markdown.ungram (5)

109-113: Grammar allows mismatched fence characters. The grammar permits a block opened with backticks to close with tildes (or vice versa). This is fine since the grammar defines structure rather than semantics — the parser is responsible for enforcing that the closing fence matches the opening fence per CommonMark.

If this is intentional for maximum error tolerance, brilliant. Just worth confirming the parser validates matching fence types.

163-167: Optional closing tokens for error tolerance — sensible approach. Making `r_fence` optional here (and similarly for emphasis, code spans, links, and images) allows the parser to produce a valid AST even when the user hasn't finished typing. Good call for editor integration.

188-195: MdInlineLink: Consider requiring `'['` to pair with `']'` in the required set. Currently `'['` is required whilst `']'`, `'('`, and `')'` are all optional. This means a valid link node could be just `[` followed by content — which is maximally tolerant but potentially quite lenient. If you find false positives during testing (stray `[` characters being parsed as links), you might tighten this. For now, this is a design choice rather than a defect.

200-207: MdInlineImage structure looks good. The `'!'` and `'['` are required (correctly identifying intent to create an image), with everything else optional for error tolerance. This mirrors the link structure and is consistent.

7-19: Legend comment present — nicely done. As per the contribution guidelines, the legend explaining grammar syntax is included. Appreciate the clarity.
.changeset/curly-donkeys-show.md
Outdated
| "@biomejs/biome": patch | ||
| --- | ||
|
|
||
| Add Markdown parser support for inline elements, block quotes, lists, headers, and code blocks. |
Use past tense for the summary.
Per coding guidelines, changesets should use past tense for actions taken.
🔎 Proposed fix
-Add Markdown parser support for inline elements, block quotes, lists, headers, and code blocks.
+Added Markdown parser support for inline elements, block quotes, lists, headers, and code blocks.

As per coding guidelines, changesets should use past tense for actions taken.
Added support for + bullet list marker per CommonMark spec.
Move start_trivia_pos capture before the while loop so that blank line detection correctly checks from the start of the quote content rather than only since the last inline parse.
xtask/codegen/markdown.ungram
Outdated
// *italic*
// ^^^^^^^^
// Note: r_fence is optional to handle unclosed italic gracefully
This is unusual. We usually choose to emit bogus nodes and recover for these cases.
Fixed: Removed ? from grammar and now use p.expect() for closing delimiters, which emits diagnostics when missing.
pub(crate) fn parse_bullet_list_item(p: &mut MarkdownParser) -> ParsedSyntax {
    if !at_bullet_list_item(p) {
        return Absent;
    }

    let item_m = p.start();
    let list_m = p.start();

    // Parse bullet items until we're no longer at a valid bullet marker.
    // Track position for error recovery to prevent infinite loops.
    loop {
        if !at_bullet_list_item(p) {
            break;
        }

        let prev_position = p.source().position();

        if parse_bullet(p).is_absent() {
            break;
        }

        // Error recovery: if we didn't advance, break to avoid infinite loop
        if p.source().position() == prev_position {
            // Skip the problematic token and continue
            p.bump_any();
        }
    }

    list_m.complete(p, MD_BULLET_LIST);
    Present(item_m.complete(p, MD_BULLET_LIST_ITEM))
}
When you parse lists, it should be a ParseNodeList or ParseSeparatedList impl so you can properly implement recovery.
Fixed: Implemented ParseNodeList for BulletList with MD_BOGUS_BULLET recovery via ParseRecoveryTokenSet.
Actionable comments posted: 0
🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/syntax/quote.rs (2)
36-40: Consider explaining the rationale for remapping the quote marker. The code remaps `>` to `MD_TEXTUAL_LITERAL`, making it part of the paragraph content rather than a separate quote marker node. Whilst this works, the design choice is non-obvious. Expanding the comment to explain why this approach was chosen (e.g., to preserve the marker for formatting or due to AST design constraints) would help future maintainers.
1-5: Document Phase 1 scope limitations. The module documentation mentions "nested block elements", but the current implementation only parses inline content within a single paragraph and doesn't support nested quotes or multiple blocks. Adding a note about Phase 1 limitations (deferred: nesting, multi-line quote consolidation, multiple blocks) would set correct expectations.
Also applies to: 18-23
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- crates/biome_markdown_parser/src/syntax/quote.rs (1 hunks)
🔇 Additional comments (1)
crates/biome_markdown_parser/src/syntax/quote.rs (1)
24-57: Multi-line quote handling requires clarification. The implementation treats consecutive lines with `>` markers as separate block quotes rather than as a single multi-line quote. This is confirmed by the test snapshot for block_quote.md, which shows the second line's `>` being parsed as inline textual content within the first quote block, rather than as a quote continuation marker.

In the test input:

> This is a quote
> It continues here

The expected AST includes the second `>` as MD_TEXTUAL_LITERAL within the inline item list of the first quote, not as a block-level marker. This differs from CommonMark handling, where each `>` at line start represents a block quote marker.

Is this an intentional Phase 1 limitation for basic quote support, or should continuation lines with `>` be recognised as proper quote structure?
Markdown support is not yet publicly available, so no changeset is needed.
Address PR review comments from dyc3:
- Implement ParseNodeList trait for MdBulletList with proper recovery
- Add MD_BOGUS_BULLET node type for error recovery
- Make closing delimiters required in grammar (remove ?)
- Use p.expect() instead of if p.at() for closing tokens
- Emits diagnostics for unclosed inline elements
Per CommonMark spec section 5.2, bullet list markers can be -, *, or +. This adds the missing + marker support to the lexer and parser.
Actionable comments posted: 0
🧹 Nitpick comments (1)
crates/biome_markdown_parser/src/lexer/mod.rs (1)
200-251: Consider using rustdoc comments for documentation. The escape handling logic correctly implements CommonMark's ASCII punctuation escape rules. However, the function uses standard `//` comments rather than `///` rustdoc comments. Per coding guidelines, inline rustdoc documentation should be updated when adding new features.

🔎 Suggested documentation style

- /// Consume a backslash escape sequence.
- /// Per CommonMark spec, a backslash before ASCII punctuation makes it literal.
- /// Escapable: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
+ /// Consume a backslash escape sequence.
+ ///
+ /// Per CommonMark spec, a backslash before ASCII punctuation makes it literal.
+ /// Escapable: `!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~`
+ ///
+ /// If the backslash is not followed by an escapable character, only the backslash
+ /// is consumed and returned as a textual literal.
  fn consume_escape(&mut self) -> MarkdownSyntaxKind {

Based on learnings, inline rustdoc documentation should be maintained for features in Rust crates.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- crates/biome_markdown_parser/src/lexer/mod.rs (4 hunks)
- crates/biome_markdown_parser/src/lexer/tests.rs (2 hunks)
- crates/biome_markdown_parser/src/syntax/list.rs (1 hunks)
- crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md (1 hunks)
- xtask/codegen/markdown.ungram (5 hunks)
- xtask/codegen/src/markdown_kinds_src.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/biome_markdown_parser/src/syntax/list.rs
- crates/biome_markdown_parser/src/lexer/tests.rs
🧰 Additional context used
📓 Path-based instructions (1)
crates/**/*.rs
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Files:
crates/biome_markdown_parser/src/lexer/mod.rs
🧠 Learnings (13)
📚 Learning: 2025-12-04T13:29:49.287Z
Learnt from: dyc3
Repo: biomejs/biome PR: 8291
File: crates/biome_html_formatter/tests/specs/prettier/vue/html-vue/elastic-header.html:10-10
Timestamp: 2025-12-04T13:29:49.287Z
Learning: Files under `crates/biome_html_formatter/tests/specs/prettier` are test fixtures synced from Prettier and should not receive detailed code quality reviews (e.g., HTTP vs HTTPS, formatting suggestions, etc.). These files are test data meant to validate formatter behavior and should be preserved as-is.
Applied to files:
crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/language_kind.rs : Add a new variant to `LanguageKind` enum in `language_kind.rs` file and implement all methods for the new language variant
Applied to files:
xtask/codegen/src/markdown_kinds_src.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/language_kind.rs : Add a new language prefix to the `LANGUAGE_PREFIXES` constant in `language_kind.rs` file
Applied to files:
xtask/codegen/src/markdown_kinds_src.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/lexer/mod.rs : Implement a `Lexer` trait from `biome_parser` crate for the lexer struct that consumes characters from source code and emits tokens
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement a token source struct that wraps the lexer and implements `TokenSourceWithBufferedLexer` and `LexerWithCheckpoint` for lookahead and re-lexing capabilities
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Use `p.eat(token)` for optional tokens, `p.expect(token)` for required tokens, `parse_rule(p).ok(p)` for optional nodes, and `parse_rule(p).or_add_diagnostic(p, error)` for required nodes
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
xtask/codegen/markdown.ungram
📚 Learning: 2025-11-24T18:05:27.810Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_js_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:27.810Z
Learning: Applies to crates/biome_js_formatter/**/*.rs : When formatting AST nodes, use mandatory tokens from the AST instead of hardcoding token strings (e.g., use `node.l_paren_token().format()` instead of `token("(")`)
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-12-12T10:11:05.564Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-12T10:11:05.564Z
Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
xtask/codegen/markdown.ungram
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement error recovery in list parsing using `or_recover()` to wrap unparseable tokens in a `BOGUS_*` node and consume tokens until a recovery token is found
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-09T12:47:46.298Z
Learnt from: ematipico
Repo: biomejs/biome PR: 8031
File: crates/biome_html_parser/src/syntax/svelte.rs:140-147
Timestamp: 2025-11-09T12:47:46.298Z
Learning: In the Biome HTML parser, `expect` and `expect_with_context` consume the current token and then lex the next token. The context parameter in `expect_with_context` controls how the next token (after the consumed one) is lexed, not the current token being consumed. For example, in Svelte parsing, after `bump_with_context(T!["{:"], HtmlLexContext::Svelte)`, the next token is already lexed in the Svelte context, so `expect(T![else])` is sufficient unless the token after `else` also needs to be lexed in a specific context.
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
xtask/codegen/markdown.ungram
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/*.ungram : Add a legend comment to `.ungram` files explaining the grammar syntax including comments, non-terminals, tokens, sequences, alternations, repetitions, and optional elements
Applied to files:
xtask/codegen/markdown.ungram
📚 Learning: 2025-12-12T10:11:05.564Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-12T10:11:05.564Z
Learning: Applies to .changeset/*.md : Write changesets that are concise (1-3 sentences), user-focused, use past tense for actions taken and present tense for Biome behavior, include code examples for rules, and end sentences with periods
Applied to files:
xtask/codegen/markdown.ungram
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Preserve valid tree structure during error recovery by marking invalid parts as 'bogus' nodes rather than invalidating parent nodes
Applied to files:
xtask/codegen/markdown.ungram
🧬 Code graph analysis (1)
crates/biome_markdown_parser/src/lexer/mod.rs (2)
crates/biome_unicode_table/src/lib.rs (1)
lookup_byte(106-109)
crates/biome_yaml_parser/src/lexer/mod.rs (1)
start_position(872-874)
🔇 Additional comments (13)
xtask/codegen/src/markdown_kinds_src.rs (1)
16-16: LGTM! Token and node additions align with CommonMark support.
The PLUS token enables `+` bullet lists, TRIPLE_TILDE supports tilde-fenced code blocks (`~~~`), and MD_BOGUS_BULLET provides error recovery for malformed bullets. All additions are consistent with the PR objectives.
Also applies to: 21-21, 43-43
xtask/codegen/markdown.ungram (6)
105-113: Approve tilde fence support; verify content type is addressed.
The addition of tilde-fenced code blocks (`~~~`) aligns with CommonMark. However, the content type issue flagged separately must be resolved.
115-115: LGTM! Simplified grammar.
Removing the label simplifies the grammar without losing functionality.
134-134: LGTM! Plus bullet support added.
The `+` bullet marker aligns with CommonMark and is consistent with the PLUS token addition.
144-144: Reasonable to make hard line optional.
Not all paragraphs end with explicit hard line breaks, so making this optional is sensible.
193-202: LGTM! Simplified inline image structure.
Removing intermediate nodes (MdInlineImageAlt, MdInlineImageSource, MdInlineImageLink) and using MdInlineItemList directly simplifies the grammar whilst maintaining the necessary structure for image alt text and source.
97-99: [Rewritten review comment]
[Classification tag]
crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md (1)
1-9: Test fixture looks good.
The bullet list test data appropriately covers the three common Markdown bullet markers (`-`, `*`, `+`) with proper formatting and blank line separation. This should provide good coverage for bullet list parsing validation.
crates/biome_markdown_parser/src/lexer/mod.rs (5)
1-14: Import structure looks solid.
The updated imports correctly bring in the dispatch mechanism and `lookup_byte` function needed for the expanded token routing. The documentation update appropriately reflects the Markdown-specific implementation.
180-198: Token dispatch routing is well-structured.
The dispatch logic properly routes each token type to its appropriate handler, with simple single-character tokens using the `consume_byte` helper and complex constructs (thematic breaks, escapes) delegated to dedicated parsing functions.
402-437: Token consumption logic is sound.
The three token consumers (hash, backtick, tilde) correctly handle their respective characters. The backtick and tilde handlers appropriately distinguish between single and triple variants using lookahead, which aligns with CommonMark's fenced code block syntax.
503-507: Helper function is fit for purpose.
The `consume_byte` helper cleanly handles single-byte token consumption and is actively used throughout the dispatch logic.
382-390: No action required; the implementation correctly reflects the CommonMark specification.
The code tokenises `**` and `__` as DOUBLE_STAR and DOUBLE_UNDERSCORE respectively for strong emphasis, whilst `--` is correctly tokenised as MINUS. This is not an inconsistency; it reflects the Markdown spec: dashes are not emphasis markers, only asterisks and underscores are. The comment at line 387 is accurate.
Thank you @jfmcdowell. Please review the implementation of the parser. While I understand you used AI assistance, human intervention is always required. Here's an example: check the emitted CST in the snapshots. For each letter inside a paragraph, we create a node. This implementation will create an exponentially big CST for large documents. That's not acceptable for us and the users. Please review the parsing logic so that the emitted CST groups text into a single node.
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/lexer/tests.rs (1)
166-184: Consider consolidating hash tests or testing the full header use case.
The `hash_token` and `multiple_hashes` tests verify that hashes are emitted individually, which is correct for ATX header parsing. However, it would be valuable to add a test covering a complete header scenario (e.g., `"# Title"`) to ensure the combination works as expected in practice.
crates/biome_markdown_parser/src/lexer/mod.rs (1)
409-437: Code fences support 3+ characters per the CommonMark spec; consider extending beyond the current hard-coded triple.
CommonMark requires code fences to be at least three consecutive backticks or tildes, and using four backticks to wrap three is perfectly valid for nesting. The current implementation handles exactly 1 or 3 characters; supporting variable lengths (4, 5, etc.) would allow proper code block nesting without resorting to tildes.
For Phase 1 this is acceptable, but worth addressing in a follow-up.
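For that follow-up, the variable-length rule is straightforward to check; here is a standalone sketch (the function name and signature are assumptions, not the lexer's API):

```rust
/// Illustrative only: returns the fence length at the start of `line` if it
/// opens a CommonMark code fence (three or more of the same fence character).
fn fence_length(line: &str, fence_char: char) -> Option<usize> {
    let count = line.chars().take_while(|&c| c == fence_char).count();
    if count >= 3 {
        Some(count)
    } else {
        None
    }
}

fn main() {
    // Tilde fences behave the same way as backtick fences for length purposes.
    assert_eq!(fence_length("~~~toml", '~'), Some(3));
    assert_eq!(fence_length("~~~~", '~'), Some(4));
    assert_eq!(fence_length("~~not a fence", '~'), None);
    // Build the backtick test string programmatically to keep this example
    // readable, then check that four backticks count as a valid fence.
    let four_backticks = "`".repeat(4);
    assert_eq!(fence_length(&four_backticks, '`'), Some(4));
}
```

CommonMark additionally requires the closing fence to be at least as long as the opening one, which a full implementation would need to track alongside the opening length.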
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
- `crates/biome_markdown_parser/tests/md_test_suite/ok/block_quote.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/fenced_code_block.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/header.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/indent_code_block.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/inline_elements.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph.md.snap` is excluded by `!**/*.snap` and included by `**`
📒 Files selected for processing (2)
- `crates/biome_markdown_parser/src/lexer/mod.rs` (5 hunks)
- `crates/biome_markdown_parser/src/lexer/tests.rs` (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
crates/**/*.rs
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
🧠 Learnings (18)
📓 Common learnings
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement a token source struct that wraps the lexer and implements `TokenSourceWithBufferedLexer` and `LexerWithCheckpoint` for lookahead and re-lexing capabilities
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/*.ungram : Add a legend comment to `.ungram` files explaining the grammar syntax including comments, non-terminals, tokens, sequences, alternations, repetitions, and optional elements
Learnt from: CR
Repo: biomejs/biome PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-12T10:11:05.564Z
Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/lexer/mod.rs : Implement a `Lexer` trait from `biome_parser` crate for the lexer struct that consumes characters from source code and emits tokens
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement a token source struct that wraps the lexer and implements `TokenSourceWithBufferedLexer` and `LexerWithCheckpoint` for lookahead and re-lexing capabilities
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:05:20.371Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:20.371Z
Learning: Applies to crates/biome_formatter/**/biome_*_formatter/tests/language.rs : Implement `TestFormatLanguage` trait in `tests/language.rs` for the formatter's test language
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/lexer/mod.rs : Implement a `Lexer` trait from `biome_parser` crate for the lexer struct that consumes characters from source code and emits tokens
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:12.048Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_service/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:12.048Z
Learning: Applies to crates/biome_service/src/workspace/watcher.tests.rs : Implement watcher tests for workspace methods in watcher.tests.rs and end-to-end tests in LSP tests
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
📚 Learning: 2025-12-12T10:11:05.564Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-12T10:11:05.564Z
Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:05:20.371Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:20.371Z
Learning: Applies to crates/biome_formatter/**/biome_*_formatter/tests/spec_tests.rs : Use the `tests_macros::gen_tests!` macro in `spec_tests.rs` to generate test functions for each specification file matching the pattern `tests/specs/<language>/**/*.<ext>`
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
📚 Learning: 2025-11-24T18:05:27.810Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_js_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:27.810Z
Learning: Applies to crates/biome_js_formatter/**/*.rs : When formatting AST nodes, use mandatory tokens from the AST instead of hardcoding token strings (e.g., use `node.l_paren_token().format()` instead of `token("(")`)
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-12-04T13:29:49.287Z
Learnt from: dyc3
Repo: biomejs/biome PR: 8291
File: crates/biome_html_formatter/tests/specs/prettier/vue/html-vue/elastic-header.html:10-10
Timestamp: 2025-12-04T13:29:49.287Z
Learning: Files under `crates/biome_html_formatter/tests/specs/prettier` are test fixtures synced from Prettier and should not receive detailed code quality reviews (e.g., HTTP vs HTTPS, formatting suggestions, etc.). These files are test data meant to validate formatter behavior and should be preserved as-is.
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
📚 Learning: 2025-11-24T18:05:20.371Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:20.371Z
Learning: Create test infrastructure with `tests/specs` folder structure and `spec_test.rs`, `spec_tests.rs`, and `language.rs` files in test directories
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
📚 Learning: 2025-12-19T12:53:30.399Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_analyze/CONTRIBUTING.md:0-0
Timestamp: 2025-12-19T12:53:30.399Z
Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Avoid string allocations by comparing against `&str` or using `TokenText`
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Use `p.eat(token)` for optional tokens, `p.expect(token)` for required tokens, `parse_rule(p).ok(p)` for optional nodes, and `parse_rule(p).or_add_diagnostic(p, error)` for required nodes
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-09T12:47:46.298Z
Learnt from: ematipico
Repo: biomejs/biome PR: 8031
File: crates/biome_html_parser/src/syntax/svelte.rs:140-147
Timestamp: 2025-11-09T12:47:46.298Z
Learning: In the Biome HTML parser, `expect` and `expect_with_context` consume the current token and then lex the next token. The context parameter in `expect_with_context` controls how the next token (after the consumed one) is lexed, not the current token being consumed. For example, in Svelte parsing, after `bump_with_context(T!["{:"], HtmlLexContext::Svelte)`, the next token is already lexed in the Svelte context, so `expect(T![else])` is sufficient unless the token after `else` also needs to be lexed in a specific context.
Applied to files:
crates/biome_markdown_parser/src/lexer/tests.rs
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/*.ungram : Add a legend comment to `.ungram` files explaining the grammar syntax including comments, non-terminals, tokens, sequences, alternations, repetitions, and optional elements
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement error recovery in list parsing using `or_recover()` to wrap unparseable tokens in a `BOGUS_*` node and consume tokens until a recovery token is found
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-12-19T12:53:30.399Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_analyze/CONTRIBUTING.md:0-0
Timestamp: 2025-12-19T12:53:30.399Z
Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Prefix line with `#` in documentation code examples sparingly; prefer concise complete snippets
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:05:27.810Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_js_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:27.810Z
Learning: Applies to crates/biome_js_formatter/**/*.rs : For tokens that are not mandatory, use helper functions instead of hardcoding
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : A parser struct must implement the `Parser` trait and save the token source, parser context, and optional parser options
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
🧬 Code graph analysis (1)
crates/biome_markdown_parser/src/lexer/mod.rs (1)
crates/biome_unicode_table/src/lib.rs (1)
lookup_byte(106-109)
🔇 Additional comments (7)
crates/biome_markdown_parser/src/lexer/tests.rs (3)
364-392: LGTM!
The escape sequence tests correctly verify CommonMark escape handling: punctuation escapes are grouped with their backslash, while non-punctuation escapes leave the backslash as a standalone token.
232-240: Good coverage for textual grouping.
These tests directly verify that consecutive text characters are batched into single `MD_TEXTUAL_LITERAL` tokens rather than per-character nodes, precisely what was requested in the PR feedback regarding CST size.
Also applies to: 262-273, 299-308
143-164: LGTM!
Thematic break tests cover the key CommonMark variations including spaced markers, confirming that internal whitespace is included in the literal token.
crates/biome_markdown_parser/src/lexer/mod.rs (4)
200-251: LGTM!
The escape handling correctly implements CommonMark's backslash escape rules with the full set of escapable ASCII punctuation characters.
493-540: Core fix for CST bloat; well implemented.
This grouping logic directly addresses the reviewer concern about per-letter nodes. By batching consecutive non-special characters into single `MD_TEXTUAL_LITERAL` tokens, the CST node count grows with the number of textual runs rather than the number of characters.
The dispatch-based boundary detection is thorough and correctly handles multi-byte UTF-8 via `advance_char_unchecked()`.
13-14: LGTM!
The dispatch-based token routing is clean and efficient. Using `lookup_byte` for character classification keeps the lexer fast while supporting the full Markdown token set.
Also applies to: 180-198
345-352: LGTM!
Good defensive check: `IDT` dispatch covers both letters and underscore, so explicitly filtering for `b'_'` prevents letters from being mishandled as emphasis markers.
```rust
// Check for double emphasis markers (**, __, --)
if self.peek_byte() == Some(start_char) {
    self.advance(2);
    return match start_char {
        b'*' => DOUBLE_STAR,
        b'_' => DOUBLE_UNDERSCORE,
        b'-' => MINUS, // No DOUBLE_MINUS in grammar, use MINUS
        _ => unreachable!(),
    };
}
```
Potential inconsistency: `--` returns MINUS with length 2.
When encountering `--`, this advances by 2 bytes but returns MINUS, which is typically a single-character token. This creates an inconsistency where the same token kind can have different lengths (1 or 2), which may confuse downstream parsing or source reconstruction.
Consider either:
- Adding a `DOUBLE_MINUS` token to the grammar, or
- Returning two separate `MINUS` tokens by only advancing 1 byte here.
🔎 Option 2: Return single MINUS for consistency

```diff
-    // Check for double emphasis markers (**, __, --)
-    if self.peek_byte() == Some(start_char) {
-        self.advance(2);
-        return match start_char {
-            b'*' => DOUBLE_STAR,
-            b'_' => DOUBLE_UNDERSCORE,
-            b'-' => MINUS, // No DOUBLE_MINUS in grammar, use MINUS
-            _ => unreachable!(),
-        };
-    }
+    // Check for double emphasis markers (**, __)
+    // Note: -- is not doubled; let it fall through to single MINUS
+    if self.peek_byte() == Some(start_char) && start_char != b'-' {
+        self.advance(2);
+        return match start_char {
+            b'*' => DOUBLE_STAR,
+            b'_' => DOUBLE_UNDERSCORE,
+            _ => unreachable!(),
+        };
+    }
```

🤖 Prompt for AI Agents
+ }🤖 Prompt for AI Agents
In crates/biome_markdown_parser/src/lexer/mod.rs around lines 381–390, the lexer
treats `--` by advancing 2 bytes but returns a single-character MINUS token,
causing token-length inconsistency; change the logic so that when start_char ==
b'-' you only advance(1) and return MINUS (while keeping advance(2) for `*` and
`_`), ensuring `--` will be tokenized as two separate MINUS tokens; update or
add lexer tests to cover `-` vs `--` cases and adjust any downstream assumptions
about MINUS token length if necessary.
Fixed the issue where each character in a paragraph created a separate MdTextual node.
That's better, but I think we need to change things.
Compared to other languages, Markdown is purely markup. This means that trivia don't exist, except for some rare cases.
The markdown lexer needs to treat whitespaces and newlines as part of the language:
- whitespaces are part of the text
- newlines become relevant during parsing
Modified the lexer's consume_textual() to consume multiple consecutive non-special characters in a single token, instead of creating one token per character. This dramatically reduces CST node count for large documents.
Before: "This is a paragraph." → 27 MdTextual nodes (one per character)
After: "This is a paragraph." → 5 MdTextual nodes (one per word)
Special characters that stop grouping: * - _ + # ` ~ > ! [ ] ( ) \ and whitespace
Force-pushed from a7c2a67 to 1102636.
Actionable comments posted: 4
🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/lexer/mod.rs (2)
211-243: Consider refactoring the escapable character match for maintainability.
The 33-line match against escapable ASCII punctuation is difficult to read and maintain. Consider extracting this into a helper method or using a lookup approach.
🔎 Example refactor using a helper function
Add a helper method:
```rust
/// Returns true if the byte is an ASCII punctuation character that can be escaped
/// per the CommonMark spec: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
#[inline]
fn is_escapable_ascii_punctuation(byte: u8) -> bool {
    matches!(
        byte,
        b'!' | b'"' | b'#' | b'$' | b'%' | b'&' | b'\'' | b'(' | b')' | b'*' | b'+' | b','
            | b'-' | b'.' | b'/' | b':' | b';' | b'<' | b'=' | b'>' | b'?' | b'@' | b'['
            | b'\\' | b']' | b'^' | b'_' | b'`' | b'{' | b'|' | b'}' | b'~'
    )
}
```

Then simplify consume_escape:

```diff
     // Check if next character is escapable ASCII punctuation
-    if let Some(next) = self.current_byte()
-        && matches!(
-            next,
-            b'!' | b'"' | b'#' | b'$' | b'%' | b'&' | b'\'' | b'(' | b')' | b'*' | b'+'
-                | b',' | b'-' | b'.' | b'/' | b':' | b';' | b'<' | b'=' | b'>' | b'?'
-                | b'@' | b'[' | b'\\' | b']' | b'^' | b'_' | b'`' | b'{' | b'|' | b'}'
-                | b'~'
-        )
-    {
+    if let Some(next) = self.current_byte() && Self::is_escapable_ascii_punctuation(next) {
         // Consume the escaped character too
         self.advance(1);
     }
```
345-352: IDT dispatch handling could be more maintainable.
The IDT dispatch matches both letters (A-Z, a-z) and underscore, but only underscore is a Markdown emphasis marker. This special-case handling is duplicated in `consume_textual` (lines 526-531), making the pattern less obvious and potentially error-prone during future maintenance.
Consider adding a comment explaining why IDT requires special handling, or extracting the underscore check into a helper method for clarity.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
- `crates/biome_markdown_parser/tests/md_test_suite/ok/block_quote.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/bullet_list.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/fenced_code_block.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/header.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/indent_code_block.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/inline_elements.md.snap` is excluded by `!**/*.snap` and included by `**`
- `crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph.md.snap` is excluded by `!**/*.snap` and included by `**`
📒 Files selected for processing (2)
- `crates/biome_markdown_parser/src/lexer/mod.rs` (5 hunks)
- `crates/biome_markdown_parser/src/lexer/tests.rs` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/biome_markdown_parser/src/lexer/tests.rs
🧰 Additional context used
📓 Path-based instructions (1)
crates/**/*.rs
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Files:
crates/biome_markdown_parser/src/lexer/mod.rs
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: biomejs/biome PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-12T10:11:05.564Z
Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/*.ungram : Add a legend comment to `.ungram` files explaining the grammar syntax including comments, non-terminals, tokens, sequences, alternations, repetitions, and optional elements
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_analyze/CONTRIBUTING.md:0-0
Timestamp: 2025-12-19T12:53:30.399Z
Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Prefix line with `#` in documentation code examples sparingly; prefer concise complete snippets
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/lexer/mod.rs : Implement a `Lexer` trait from `biome_parser` crate for the lexer struct that consumes characters from source code and emits tokens
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement a token source struct that wraps the lexer and implements `TokenSourceWithBufferedLexer` and `LexerWithCheckpoint` for lookahead and re-lexing capabilities
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Use `p.eat(token)` for optional tokens, `p.expect(token)` for required tokens, `parse_rule(p).ok(p)` for optional nodes, and `parse_rule(p).or_add_diagnostic(p, error)` for required nodes
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-12-12T10:11:05.564Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2025-12-12T10:11:05.564Z
Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:05:27.810Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_js_formatter/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:05:27.810Z
Learning: Applies to crates/biome_js_formatter/**/*.rs : When formatting AST nodes, use mandatory tokens from the AST instead of hardcoding token strings (e.g., use `node.l_paren_token().format()` instead of `token("(")`)
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-12-19T12:53:30.399Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_analyze/CONTRIBUTING.md:0-0
Timestamp: 2025-12-19T12:53:30.399Z
Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Prefix line with `#` in documentation code examples sparingly; prefer concise complete snippets
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/*.ungram : Add a legend comment to `.ungram` files explaining the grammar syntax including comments, non-terminals, tokens, sequences, alternations, repetitions, and optional elements
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-24T18:06:03.545Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_parser/CONTRIBUTING.md:0-0
Timestamp: 2025-11-24T18:06:03.545Z
Learning: Applies to crates/biome_parser/**/src/**/*.rs : Implement error recovery in list parsing using `or_recover()` to wrap unparseable tokens in a `BOGUS_*` node and consume tokens until a recovery token is found
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-12-19T12:53:30.399Z
Learnt from: CR
Repo: biomejs/biome PR: 0
File: crates/biome_analyze/CONTRIBUTING.md:0-0
Timestamp: 2025-12-19T12:53:30.399Z
Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Avoid string allocations by comparing against `&str` or using `TokenText`
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
📚 Learning: 2025-11-09T12:47:46.298Z
Learnt from: ematipico
Repo: biomejs/biome PR: 8031
File: crates/biome_html_parser/src/syntax/svelte.rs:140-147
Timestamp: 2025-11-09T12:47:46.298Z
Learning: In the Biome HTML parser, `expect` and `expect_with_context` consume the current token and then lex the next token. The context parameter in `expect_with_context` controls how the next token (after the consumed one) is lexed, not the current token being consumed. For example, in Svelte parsing, after `bump_with_context(T!["{:"], HtmlLexContext::Svelte)`, the next token is already lexed in the Svelte context, so `expect(T![else])` is sufficient unless the token after `else` also needs to be lexed in a specific context.
Applied to files:
crates/biome_markdown_parser/src/lexer/mod.rs
🧬 Code graph analysis (1)
crates/biome_markdown_parser/src/lexer/mod.rs (1)
crates/biome_unicode_table/src/lib.rs (1)
lookup_byte(106-109)
🔇 Additional comments (7)
crates/biome_markdown_parser/src/lexer/mod.rs (7)
1-14: Documentation and imports updated appropriately.
The module documentation now accurately describes the Markdown lexer, and the import changes support the dispatch-based token handling.
180-198: Dispatch table correctly routes Markdown tokens.
The expanded dispatcher appropriately handles all Markdown syntax characters, routing them to dedicated consumption methods.
381-390: The `--` token-length inconsistency has been resolved.
The past review concern about `--` advancing 2 bytes whilst returning a single-character MINUS token has been addressed. Line 383 now explicitly excludes `-` from double-marker treatment, so each hyphen in `--` is lexed separately as an individual MINUS token with consistent length.
493-540: Textual batching successfully addresses the CST bloat concern.
This implementation groups consecutive textual characters into a single `MD_TEXTUAL_LITERAL` token, directly addressing the issue raised in the PR objectives where each letter was previously emitted as a separate node. The dispatch-based boundary detection correctly identifies when to stop batching, maintaining accurate parsing whilst improving efficiency.
258-262: Inconsistency with AI-generated summary.
The AI summary states "Removed the dead_code attribute from the eat_byte method", but the `#[expect(dead_code)]` attribute is still present at line 258. If this method is now used, the attribute should be removed; otherwise, the summary is incorrect.
409-422: Convert comment to proper rustdoc format.
The comment at line 409 should use `///` for rustdoc instead of `//` to be included in generated documentation.
As per coding guidelines, update inline rustdoc documentation when adding new features.
⛔ Skipped due to learnings
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Prefix line with `#` in documentation code examples sparingly; prefer concise complete snippets
- Learnt from: CR Repo: biomejs/biome PR: 0 File: CONTRIBUTING.md:0-0 Timestamp: 2025-12-12T10:11:05.564Z Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Lines prefixed with `#` in rule documentation code examples will be hidden from output
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Rule documentation code blocks should be ordered as language, expect_diagnostic, options/full_options/use_options, ignore, file
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : The first paragraph of rule documentation must be a single line describing what the rule does
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_parser/CONTRIBUTING.md:0-0 Timestamp: 2025-11-24T18:06:03.545Z Learning: Applies to crates/biome_parser/**/*.ungram : Add a legend comment to `.ungram` files explaining the grammar syntax including comments, non-terminals, tokens, sequences, alternations, repetitions, and optional elements
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Use `options` code block property for rule-specific configuration snippets in documentation
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Rule documentation code blocks must have a language defined for syntax highlighting
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Invalid code examples in rule documentation must be marked with `expect_diagnostic` code block property
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Rule documentation must include `## Options` section if the rule has options
493-494: Convert comment to proper rustdoc format.
The comments at lines 493-494 should use `///` for rustdoc instead of `//` to be included in generated documentation.
As per coding guidelines, update inline rustdoc documentation when adding new features.
⛔ Skipped due to learnings
- Learnt from: CR Repo: biomejs/biome PR: 0 File: CONTRIBUTING.md:0-0 Timestamp: 2025-12-12T10:11:05.564Z Learning: Applies to crates/**/*.rs : Update inline rustdoc documentation for rules, assists, and their options when adding new features or changing existing features in Rust crates
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Prefix line with `#` in documentation code examples sparingly; prefer concise complete snippets
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Invalid code examples in rule documentation must be marked with `expect_diagnostic` code block property
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Use `options` code block property for rule-specific configuration snippets in documentation
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Use `use_options` code block property for code examples that follow an options configuration in documentation
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Lines prefixed with `#` in rule documentation code examples will be hidden from output
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_formatter/CONTRIBUTING.md:0-0 Timestamp: 2025-11-24T18:05:20.371Z Learning: Applies to crates/biome_formatter/**/biome_*_formatter/src/comments.rs : Define `<Language>CommentStyle` as a public type alias for `Comments<<Language>Language>` in a `comments.rs` file
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Rule documentation must include `## Examples` section with `### Invalid` and `### Valid` subsections
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Rule documentation code blocks should be ordered as language, expect_diagnostic, options/full_options/use_options, ignore, file
- Learnt from: CR Repo: biomejs/biome PR: 0 File: crates/biome_analyze/CONTRIBUTING.md:0-0 Timestamp: 2025-12-19T12:53:30.399Z Learning: Applies to crates/biome_analyze/**/*analyze/src/**/*.rs : Rule documentation must include `## Options` section if the rule has options
```rust
/// Consume a backslash escape sequence.
/// Per CommonMark spec, a backslash before ASCII punctuation makes it literal.
/// Escapable: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```
🛠️ Refactor suggestion | 🟠 Major
Add rustdoc documentation.
This method lacks documentation. Per coding guidelines, rustdoc should be added when introducing new features. Consider documenting the CommonMark escape behaviour and which characters are escapable.
As per coding guidelines, update inline rustdoc documentation when adding new features.
🤖 Prompt for AI Agents
In crates/biome_markdown_parser/src/lexer/mod.rs around lines 200 to 202, the
method that consumes a backslash escape sequence is missing rustdoc; add a
triple-slash doc comment above the method that briefly states its purpose
(consume a backslash escape per CommonMark), describes the behavior (a backslash
before ASCII punctuation makes that character literal), lists or references the
set of escapable characters (e.g. !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~), and include
a short note pointing to the CommonMark spec and a tiny usage example or
edge-case note (e.g., that non-ASCII characters are not escaped). Ensure the
wording is concise and follows project rustdoc style.
```rust
/// Consumes thematic break literal or returns emphasis marker tokens.
/// Called when we see *, -, or _.
fn consume_thematic_break_or_emphasis(&mut self, dispatched: Dispatch) -> MarkdownSyntaxKind {
```
🛠️ Refactor suggestion | 🟠 Major
Add rustdoc documentation.
This method lacks documentation explaining when it returns a thematic break versus emphasis tokens, and the special handling for the IDT dispatch case.
As per coding guidelines, update inline rustdoc documentation when adding new features.
```rust
/// Consume a single hash character for ATX headers
fn consume_hash(&mut self) -> MarkdownSyntaxKind {
    self.assert_at_char_boundary();
    self.advance(1);
    HASH
}
```
🛠️ Refactor suggestion | 🟠 Major
Add rustdoc documentation.
Brief documentation explaining that this consumes a single hash character for ATX headers would improve code clarity.
As per coding guidelines, update inline rustdoc documentation when adding new features.
🤖 Prompt for AI Agents
In crates/biome_markdown_parser/src/lexer/mod.rs around lines 402 to 407, the
private method consume_hash lacks rustdoc; add a concise /// doc comment above
the function stating that it consumes a single '#' character used for ATX
headers, ensures the current position is at a UTF-8 character boundary, advances
the lexer by one byte, and returns the HASH MarkdownSyntaxKind token; keep the
doc short and aligned with project guidelines and update it if behavior changes
in future.
```rust
/// Consume tilde(s) - either single for other uses or triple for fenced code blocks
fn consume_tilde(&mut self) -> MarkdownSyntaxKind {
    self.assert_at_char_boundary();

    // Check for triple tilde
    if self.peek_byte() == Some(b'~') && self.byte_at(2) == Some(b'~') {
        self.advance(3);
        return TRIPLE_TILDE;
    }

    // Single tilde
    self.advance(1);
    TILDE
}
```
🛠️ Refactor suggestion | 🟠 Major
Convert comment to proper rustdoc format.
The comment at line 424 should use /// for rustdoc instead of // to be included in generated documentation.
As per coding guidelines, update inline rustdoc documentation when adding new features.
🤖 Prompt for AI Agents
In crates/biome_markdown_parser/src/lexer/mod.rs around lines 424 to 437, the
leading comment for consume_tilde should be a rustdoc comment; replace the
current inline comment with a triple-slash doc comment (///) immediately above
the fn declaration and ensure the text is formatted as a single-line or short
paragraph rustdoc description so it appears in generated docs.
Include whitespace in text tokens when not at line start to reduce CST node count. Previously each word was a separate token; now entire phrases become single tokens.
Key changes:
- Lexer handles whitespace context-sensitively (line start vs mid-line)
- consume_textual() includes spaces/tabs but stops at newlines
- Parser detects paragraph boundaries via has_blank_line_since()
Results: 10 MdTextual nodes -> 2 for a two-paragraph document.
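A self-contained sketch of the behaviour this commit describes; the character set and function below are illustrative assumptions, not the lexer's actual implementation:

```rust
/// Illustrative only: batches ordinary text, including spaces and tabs, into a
/// single token, stopping at newlines and at Markdown-significant punctuation.
fn batch_line_textual(input: &str) -> Vec<&str> {
    const SPECIAL: &[char] = &[
        '*', '-', '_', '+', '#', '`', '~', '>', '!', '[', ']', '(', ')', '\\', '\n',
    ];
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in input.char_indices() {
        if SPECIAL.contains(&c) {
            if let Some(s) = start.take() {
                tokens.push(&input[s..i]);
            }
            tokens.push(&input[i..i + c.len_utf8()]);
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        tokens.push(&input[s..]);
    }
    tokens
}

fn main() {
    let tokens = batch_line_textual("First paragraph here.\nSecond paragraph here.\n");
    // Whole phrases survive as single tokens; only the newlines break them up:
    // ["First paragraph here.", "\n", "Second paragraph here.", "\n"]
    println!("{tokens:?}");
}
```

With grouping like this, the newline tokens are what the parser can later use to detect paragraph boundaries, matching the commit's description.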
@ematipico thanks for the feedback. 7d3d2b9 addresses it. Since this is getting long, would you prefer I stop here and, once this is done, submit a second PR for the remaining CommonMark features for the parser, or keep going in this PR? Whatever makes it easier for the maintainers.
Note
AI Assistance Disclosure: This PR was developed with assistance from Claude Code.
Summary
This PR completes Phase 1 of the Markdown support implementation as outlined in #3718. It delivers a functional Markdown parser that handles CommonMark-compliant syntax.
Implemented Features
Leaf Blocks:
- ATX headers (`#` through `######`) with proper trailing hash parsing
- Thematic breaks (`---`, `***`, `___`)

Container Blocks:
- Block quotes (`>`)
- Bullet lists (`-`, `+`, and `*` markers)

Inline Elements:
- Inline code spans (`` `code` ``)
- Emphasis (`**bold**`, `__bold__`)
- Italic (`*italic*`, `_italic_`)
- Links (`[text](url)`)
- Images
- Backslash escapes (`\*`, `\[`, etc.) per the CommonMark spec

Parser Infrastructure:
- Named constants (`INDENT_CODE_BLOCK_SPACES`, `MAX_HEADER_HASHES`)
- `TRIPLE_TILDE` token for tilde-fenced code blocks

Deferred to Future Work
- Ordered lists (`1.`, `2.`, etc.)
- Setext headers (`===` or `---`)

Test Plan
- `paragraph.md`
- `header.md` (including trailing hashes)
- `fenced_code_block.md` (including tilde fences)
- `indent_code_block.md`
- `block_quote.md`
- `bullet_list.md`
- `inline_elements.md`
- `thematic_break_block.md`
- `just f` and `just l` pass

Note
`just ready` fails due to a pre-existing rustdoc error in `biome_html_parser` (commit a3a713d) where `set_double_text_expression` links to private `TextExpressionKind::Double`. This is unrelated to this PR and will be addressed separately.

Next Steps
Phase 2 (Formatter) and Phase 3 (Service Integration) will follow in separate PRs.