DocMark: Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Official repository for the CVPR 2025 paper "Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding".
[📖 Paper] [🤗 Models & Datasets]
Visual Document Understanding has become essential as text-rich visual content proliferates. The field poses significant challenges because it requires effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts.
We propose DocMark, an innovative pipeline that adaptively generates markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. Our approach:
- Converts documents to structured markup languages that preserve rich semantic and layout information
- Generates contextually relevant markup code as intermediate reasoning steps (sketched below)
- Adaptively selects the most appropriate markup language for different document types
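To make the intermediate-markup idea concrete, here is a minimal sketch of how such a two-stage query could look with the `model.chat` API from the usage section below. The prompts and the explicit two-call decomposition are illustrative assumptions, not the repository's exact inference path:

```python
def markup_grounded_answer(model, tokenizer, pixel_values, question, generation_config):
    """Illustrative two-stage flow (assumed decomposition, not the exact repo API):
    1) the model emits markup for the document as an intermediate representation,
    2) the final answer is grounded in that markup."""
    markup = model.chat(
        tokenizer, pixel_values,
        '<image>Convert this document to structured markup format',
        generation_config)
    return model.chat(
        tokenizer, pixel_values,
        f'<image>Answer the question using the markup below.\n{markup}\n{question}',
        generation_config)
```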
Key contributions:
- A novel pipeline that adaptively uses various markup languages to bridge the gap between visual inputs and linguistic understanding, significantly enhancing document comprehension capabilities.
- Two fine-grained structured datasets:
  - DocMark-Pile: 3.8M pretraining data pairs for document parsing
  - DocMark-Instruct: 624k fine-tuning annotations for grounded instruction following
- State-of-the-art performance across various document understanding benchmarks, significantly outperforming existing MLLMs.
| Model | LLM Size | TextVQA | DocVQA | InfoVQA | ChartQA | AI2D | OCRBench | WebQA | MathVision |
|---|---|---|---|---|---|---|---|---|---|
| DocMark-2B | 2B | 74.8 | 87.8 | 61.2 | 79.8 | 82.5 | 813 | 70.1 | 18.8 |
| DocMark-8B | 8B | 78.0 | 89.8 | 68.3 | 84.2 | 86.2 | 823 | 78.9 | 21.1 |

OCRBench is scored on a 0-1000 scale; all other columns are percentages.
Pre-trained models are available on Hugging Face: DocMark-Pretrain-2B
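If you prefer to fetch the weights explicitly before loading them, here is a minimal sketch using `huggingface_hub` (the usage example below loads the model directly by repo id instead):

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint to a local directory; the repo id matches
# the one used in the usage example below.
local_dir = snapshot_download('HanXiao1999/DocMark-Pretrain-2B')
print('Weights downloaded to:', local_dir)
```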
DocMark-Pile is a comprehensive pretraining dataset for document parsing with various markup languages:
- Plain Text: Natural photos and regional text images
- Markdown: Dense text documents and tables
- LaTeX: Mathematics textbooks and handwritten formulas
- HTML: Webpages and webpage summarization
- JSON: Key information extraction from charts, receipts, and forms
- TikZ: Scientific and geometry diagrams
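For illustration, a DocMark-Pile sample can be thought of as an image paired with its target markup string. The field names below are hypothetical, not the released schema:

```python
# Hypothetical DocMark-Pile pair (field names assumed for illustration):
# each image is paired with a target string in the chosen markup language.
sample = {
    'image': 'receipts/000123.jpg',
    'markup': 'json',
    'target': '{"store": "ACME Mart", "date": "2024-05-01", "total": "12.80"}',
}
```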
DocMark-Instruct is the fine-tuning dataset, featuring chain-of-thought-like reasoning annotations for contextually-grounded instruction following.
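Conceptually, each record pairs a question with intermediate markup that grounds the final answer. The schema below is a hypothetical illustration, not the released format:

```python
# Hypothetical DocMark-Instruct record (schema assumed for illustration):
# the intermediate markup serves as a chain-of-thought-style grounding step.
record = {
    'image': 'charts/bar_chart_07.png',
    'question': 'Which category grew the most from 2020 to 2021?',
    'intermediate_markup': '{"2020": {"A": 10, "B": 14}, "2021": {"A": 18, "B": 15}}',
    'answer': 'Category A',
}
```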
Install the training framework from the bundled `ms-swift` directory:

```bash
cd ms-swift
pip install -e .
```
```bash
# Pretraining on DocMark-Pile
bash exps/docmark_pretrain_2b.sh

# Fine-tuning on DocMark-Instruct
bash exps/docmark_finetune_2b.sh
```

DocMark supports various document understanding tasks, including text extraction, OCR with grounding, and document-to-markup conversion. Below are example usage scenarios:
```python
import torch
from swift.utils import seed_everything
from modelscope import AutoModel, AutoTokenizer
from utils import load_image

# Initialize model
model_path = 'HanXiao1999/DocMark-Pretrain-2B'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
    legacy=False,
    add_prefix_space=False
)
generation_config = dict(max_new_tokens=2048, do_sample=False)
seed_everything(42)

# Load image
image_path = 'examples/text_img.jpg'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

# Extract text content
question = '<image>Could you extract the text from the image for me?'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Extracted Text:', answer)

# OCR with grounding
question = '<image>OCR with grounding:'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('OCR Results with Grounding:', answer)

# Process document image
image_path = 'examples/example_doc.png'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

# Convert to structured markup
question = '<image>Convert this document to structured markup format'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Generated Markup:', answer)
```

For more comprehensive examples, including:
- Mathematical Content to LaTeX
- Webpage Content Analysis
- Scientific Diagram Reconstruction
- Structured Data Extraction
Please see the demo notebook for complete usage examples.
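The snippets above import a repo-local `load_image` helper. As a rough guide, here is a minimal sketch of what an InternVL-style helper does; it is simplified (real versions tile large pages into up to `max_num` patches plus a thumbnail), and the 448-pixel input size and ImageNet statistics are assumptions based on the InternVL family:

```python
import torchvision.transforms as T
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(image_path, input_size=448, max_num=12):
    # Simplified sketch: a full implementation splits large pages into up to
    # `max_num` tiles plus a thumbnail; here we return one normalized tile.
    image = Image.open(image_path).convert('RGB')
    transform = T.Compose([
        T.Resize((input_size, input_size)),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    return transform(image).unsqueeze(0)  # shape: (1, 3, input_size, input_size)
```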
DocMark significantly outperforms existing state-of-the-art MLLMs on document understanding tasks, excelling particularly at complex document formats and reasoning-heavy benchmarks.
We would like to thank the following repos for their great work:
- InternVL for the base architecture
- SWIFT for the training framework
- VLMEvalKit for the benchmark evaluation
If you find DocMark useful for your research and applications, please kindly cite using this BibTeX:
```bibtex
@inproceedings{xiao2025adaptive,
  title={Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding},
  author={Xiao, Han and Xie, Yina and Tan, Guanxin and Chen, Yinghao and Hu, Rui and Wang, Ke and Zhou, Aojun and Li, Hao and Shao, Hao and Lu, Xudong and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={29558--29568},
  year={2025}
}
```

For any questions or inquiries, please contact us at [email protected].