DocMark: Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Official repository for the CVPR 2025 paper "Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding".
[📖 Paper] [🤗 Models & Datasets]
Visual Document Understanding has become essential as text-rich visual content proliferates. The field poses significant challenges because it requires effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts.
We propose DocMark, an innovative pipeline that adaptively generates markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. Our approach:
- Converts documents to structured markup languages that preserve rich semantic and layout information
- Generates contextually relevant markup code as intermediate reasoning steps (sketched below)
- Adaptively selects the most appropriate markup language for different document types
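To make the intermediate-markup idea concrete, here is a minimal sketch of how such a two-stage query could look with the `model.chat` API from the usage section below. The prompts and the explicit two-call decomposition are illustrative assumptions, not the repository's exact inference path:

```python
def markup_grounded_answer(model, tokenizer, pixel_values, question, generation_config):
    """Illustrative two-stage flow (assumed decomposition, not the exact repo API):
    1) the model emits markup for the document as an intermediate representation,
    2) the final answer is grounded in that markup."""
    markup = model.chat(
        tokenizer, pixel_values,
        '<image>Convert this document to structured markup format',
        generation_config)
    return model.chat(
        tokenizer, pixel_values,
        f'<image>Answer the question using the markup below.\n{markup}\n{question}',
        generation_config)
```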
Key contributions:
- A novel pipeline that adaptively uses various markup languages to bridge the gap between visual inputs and linguistic understanding, significantly enhancing document comprehension capabilities.
- Two fine-grained structured datasets:
  - DocMark-Pile: 3.8M pretraining data pairs for document parsing
  - DocMark-Instruct: 624k fine-tuning annotations for grounded instruction following
- State-of-the-art performance across various document understanding benchmarks, significantly outperforming existing MLLMs.
| Model | LLM Size | TextVQA | DocVQA | InfoVQA | ChartQA | AI2D | OCRBench | WebQA | MathVision |
|---|---|---|---|---|---|---|---|---|---|
| DocMark-2B | 2B | 74.8 | 87.8 | 61.2 | 79.8 | 82.5 | 813 | 70.1 | 18.8 |
| DocMark-8B | 8B | 78.0 | 89.8 | 68.3 | 84.2 | 86.2 | 823 | 78.9 | 21.1 |

OCRBench is scored on a 0-1000 scale; all other columns are percentages.
Pre-trained models are available on Hugging Face: DocMark-Pretrain-2B
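If you prefer to fetch the weights explicitly before loading them, here is a minimal sketch using `huggingface_hub` (the usage example below loads the model directly by repo id instead):

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint to a local directory; the repo id matches
# the one used in the usage example below.
local_dir = snapshot_download('HanXiao1999/DocMark-Pretrain-2B')
print('Weights downloaded to:', local_dir)
```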
DocMark-Pile is a comprehensive pretraining dataset for document parsing with various markup languages:
- Plain Text: Natural photos and regional text images
- Markdown: Dense text documents and tables
- LaTeX: Mathematics textbooks and handwritten formulas
- HTML: Webpages and webpage summarization
- JSON: Key information extraction from charts, receipts, and forms
- TikZ: Scientific and geometry diagrams
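For illustration, a DocMark-Pile sample can be thought of as an image paired with its target markup string. The field names below are hypothetical, not the released schema:

```python
# Hypothetical DocMark-Pile pair (field names assumed for illustration):
# each image is paired with a target string in the chosen markup language.
sample = {
    'image': 'receipts/000123.jpg',
    'markup': 'json',
    'target': '{"store": "ACME Mart", "date": "2024-05-01", "total": "12.80"}',
}
```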
DocMark-Instruct is the fine-tuning dataset, featuring chain-of-thought-like reasoning annotations for contextually-grounded instruction following.
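Conceptually, each record pairs a question with intermediate markup that grounds the final answer. The schema below is a hypothetical illustration, not the released format:

```python
# Hypothetical DocMark-Instruct record (schema assumed for illustration):
# the intermediate markup serves as a chain-of-thought-style grounding step.
record = {
    'image': 'charts/bar_chart_07.png',
    'question': 'Which category grew the most from 2020 to 2021?',
    'intermediate_markup': '{"2020": {"A": 10, "B": 14}, "2021": {"A": 18, "B": 15}}',
    'answer': 'Category A',
}
```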
Install the training framework from the bundled `ms-swift` directory:

```bash
cd ms-swift
pip install -e .
```
```bash
# Pretraining on DocMark-Pile
bash exps/docmark_pretrain_2b.sh

# Fine-tuning on DocMark-Instruct
bash exps/docmark_finetune_2b.sh
```

DocMark supports various document understanding tasks, including text extraction, OCR with grounding, and document-to-markup conversion. Below are example usage scenarios:
```python
import torch
from swift.utils import seed_everything
from modelscope import AutoModel, AutoTokenizer
from utils import load_image

# Initialize model
model_path = 'HanXiao1999/DocMark-Pretrain-2B'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
    legacy=False,
    add_prefix_space=False
)
generation_config = dict(max_new_tokens=2048, do_sample=False)
seed_everything(42)

# Load image
image_path = 'examples/text_img.jpg'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

# Extract text content
question = '<image>Could you extract the text from the image for me?'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Extracted Text:', answer)

# OCR with grounding
question = '<image>OCR with grounding:'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('OCR Results with Grounding:', answer)

# Process document image
image_path = 'examples/example_doc.png'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

# Convert to structured markup
question = '<image>Convert this document to structured markup format'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Generated Markup:', answer)
```

For more comprehensive examples, including:
- Mathematical Content to LaTeX
- Webpage Content Analysis
- Scientific Diagram Reconstruction
- Structured Data Extraction
Please see the demo notebook for complete usage examples.
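The snippets above import a repo-local `load_image` helper. As a rough guide, here is a minimal sketch of what an InternVL-style helper does; it is simplified (real versions tile large pages into up to `max_num` patches plus a thumbnail), and the 448-pixel input size and ImageNet statistics are assumptions based on the InternVL family:

```python
import torchvision.transforms as T
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(image_path, input_size=448, max_num=12):
    # Simplified sketch: a full implementation splits large pages into up to
    # `max_num` tiles plus a thumbnail; here we return one normalized tile.
    image = Image.open(image_path).convert('RGB')
    transform = T.Compose([
        T.Resize((input_size, input_size)),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    return transform(image).unsqueeze(0)  # shape: (1, 3, input_size, input_size)
```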
DocMark significantly outperforms existing state-of-the-art MLLMs on document understanding tasks, excelling particularly at complex document formats and reasoning-heavy benchmarks.
We would like to thank the following repos for their great work:
- InternVL for the base architecture
- SWIFT for the training framework
- VLMEvalKit for the benchmark evaluation
If you find DocMark useful for your research and applications, please kindly cite using this BibTeX:
```bibtex
@inproceedings{xiao2025adaptive,
  title={Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding},
  author={Xiao, Han and Xie, Yina and Tan, Guanxin and Chen, Yinghao and Hu, Rui and Wang, Ke and Zhou, Aojun and Li, Hao and Shao, Hao and Lu, Xudong and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={29558--29568},
  year={2025}
}
```

For any questions or inquiries, please contact us at [email protected].