This is a custom Recursive Descent Parser I wrote in C++ to handle a specific subset of Markdown for GATEQuest.
Basically, I needed something that could handle mixed content like Markdown, LaTeX math equations (like
It’s not perfect, but it works, and it handles the 2,608 complex questions in my dataset in under 20ms.
To be honest, the JS version was working fine and maybe it is objectively better. But I wanted to learn how actual parsers work under the hood.
I built this to handle only the specific parts that GATEQuest requires:
- Math: Inline `$...$` and Block `$$...$$` (passed through to KaTeX).
- Tables: Standard Markdown tables with pipe `|` delimiters.
- Code: Inline backticks and fenced code blocks.
- Standard MD: Bold, Italics, Images, Links.
It uses a standard compiler architecture:
- Lexer (Tokenizer): Scans the raw string and breaks it into tokens (TEXT, BOLD, PIPE, MATH_BLOCK, etc.).
- Parser: A Recursive Descent Parser that constructs an Abstract Syntax Tree (AST). It handles the nesting logic (e.g., "we are inside a table row, so the next pipe means a new cell").
- Renderer: Walks the AST and generates the final HTML string.
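
To give a feel for how the three stages fit together, here is a stripped-down sketch. It is not the actual code: it only knows `**bold**` and `*italic*`, it does no HTML escaping, and every name in it (`Tok`, `Node`, `Parser`, `to_html`) is made up for illustration. The nesting idea is the same as for table rows, though: the parser remembers which construct it is inside, so it knows what the next marker means.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds; the real lexer has more (PIPE, MATH_BLOCK, ...).
enum class Tok { Text, Bold, Italic, End };
struct Token { Tok kind; std::string text; };

// Lexer: break the raw string into TEXT runs and "**" / "*" markers.
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> out;
    std::string buf;
    auto flush = [&] {
        if (!buf.empty()) { out.push_back({Tok::Text, buf}); buf.clear(); }
    };
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src.compare(i, 2, "**") == 0) {
            flush();
            out.push_back({Tok::Bold, "**"});
            ++i;  // skip the second '*'
        } else if (src[i] == '*') {
            flush();
            out.push_back({Tok::Italic, "*"});
        } else {
            buf += src[i];
        }
    }
    flush();
    out.push_back({Tok::End, ""});
    return out;
}

// AST node: Root and the inline containers hold children, Text holds a string.
struct Node {
    enum Kind { Root, Text, Bold, Italic } kind = Root;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Parser: recursive descent over  inline := (TEXT | MARKER inline MARKER)*
class Parser {
public:
    explicit Parser(std::vector<Token> toks) : toks_(std::move(toks)) {}

    std::unique_ptr<Node> parse() {
        auto root = make(Node::Root);
        parse_inline(*root, Tok::End);  // top level: nothing to close
        return root;
    }

private:
    static std::unique_ptr<Node> make(Node::Kind k, std::string text = "") {
        auto n = std::make_unique<Node>();
        n->kind = k;
        n->text = std::move(text);
        return n;
    }

    // `closer` is the marker that ends the construct we are currently inside.
    void parse_inline(Node& parent, Tok closer) {
        while (toks_[pos_].kind != Tok::End) {
            const Token& t = toks_[pos_];
            if (t.kind == Tok::Text) {
                parent.children.push_back(make(Node::Text, t.text));
                ++pos_;
            } else if (t.kind == closer) {
                return;  // closing marker: the caller consumes it
            } else {
                const Tok opener = t.kind;
                ++pos_;  // consume the opening marker
                auto child = make(opener == Tok::Bold ? Node::Bold : Node::Italic);
                parse_inline(*child, opener);
                if (toks_[pos_].kind == opener) ++pos_;  // consume the closer, if any
                parent.children.push_back(std::move(child));
            }
        }
    }

    std::vector<Token> toks_;
    std::size_t pos_ = 0;
};

// Renderer: walk the AST and build the HTML string.
std::string render(const Node& n) {
    std::string inner = n.text;
    for (const auto& c : n.children) inner += render(*c);
    switch (n.kind) {
        case Node::Bold:   return "<strong>" + inner + "</strong>";
        case Node::Italic: return "<em>" + inner + "</em>";
        default:           return inner;  // Root and Text add no wrapper
    }
}

std::string to_html(const std::string& md) {
    return render(*Parser(lex(md)).parse());
}
```

With this sketch, `to_html("a **b *c* d** e")` gives `a <strong>b <em>c</em> d</strong> e`. An unclosed marker is simply closed at the end of the input.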
I ran a head-to-head benchmark against my Node.js Regex parser on a dataset of 2,608 questions.
- JavaScript (Regex): ~0.001ms per item.
- C++ (Sht): ~0.007ms per item.
Yeah, the JS engine is technically faster for simple cases because V8 is a beast, but Sht is good too, I guess.
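
For context, at ~0.007ms per item, 2,608 items come to roughly 18ms in total, which lines up with the "under 20ms" figure. If you want to measure a per-item number like this yourself, a minimal harness could look like the sketch below. It is not the benchmark script I used; `average_ms_per_item` is a hypothetical helper, and `fn` stands for whatever render call you are timing.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Average wall-clock milliseconds per item for any callable fn(const std::string&).
// Hypothetical helper for illustration only.
template <typename Fn>
double average_ms_per_item(const std::vector<std::string>& items, Fn&& fn) {
    if (items.empty()) return 0.0;  // avoid dividing by zero

    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (const auto& item : items) fn(item);
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;

    return elapsed.count() / static_cast<double>(items.size());
}
```

Make sure `fn` does something observable with its result (for example, add the output length to a running total), otherwise an optimizing compiler may drop the work you are trying to time.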
You need a C++17 compiler.
```bash
# Compile the project
make
# Run it on a JSON file
bin/renderer input.json output.json
```

- Rendering has an issue in "`$`", basically $ within single backticks.
- `<div><p>` tags are in options too (in my dataset), which shouldn't happen as it creates unnecessary space.
- Will try to compile it in WASM (WebAssembly) for GATEQuest.
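
For the `$`-inside-backticks issue, one possible direction (not implemented here, just a sketch) is to let the scanner consume a whole code span before it ever looks for math delimiters, so a `$` inside backticks is never seen as the start of math. The names `Seg`, `Segment`, and `split_inline` are hypothetical, and block `$$...$$` math is deliberately left out to keep it short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Seg { Text, Code, Math };  // hypothetical segment kinds
struct Segment { Seg kind; std::string body; };

// Split a line into plain text, `code` spans, and $math$ spans.
// Whichever delimiter opens first wins, and its body is taken verbatim.
std::vector<Segment> split_inline(const std::string& src) {
    std::vector<Segment> out;
    std::string buf;
    auto flush = [&] {
        if (!buf.empty()) { out.push_back({Seg::Text, buf}); buf.clear(); }
    };

    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (c == '`' || c == '$') {
            const std::size_t close = src.find(c, i + 1);
            if (close != std::string::npos) {
                flush();
                out.push_back({c == '`' ? Seg::Code : Seg::Math,
                               src.substr(i + 1, close - i - 1)});
                i = close + 1;  // jump past the closing delimiter
                continue;
            }
        }
        buf += c;  // ordinary character, or a delimiter with no match
        ++i;
    }
    flush();
    return out;
}
```

For the input ``cost `$5` and $x$`` this yields four segments: text `cost `, code `$5`, text ` and `, and math `x`.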
It doesn't support 100% of the CommonMark spec, just the parts I need for my dataset.
Use it if you want, but you're probably better off using a battle-tested library unless you're trying to learn how parsers work like I was.