DocMark: Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Official repository for the CVPR 2025 paper "Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding".

[📖 Paper] [🤗 Models & Datasets]

About DocMark

Visual Document Understanding has become essential with the increasing prevalence of text-rich visual content. The field poses significant challenges because visual perception and textual comprehension must be integrated effectively, particularly across diverse document types with complex layouts.

We propose DocMark, an innovative pipeline that adaptively generates markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. Our approach (see the conceptual sketch after this list):

  1. Converts documents to structured markup languages that preserve rich semantic and layout information
  2. Generates contextually relevant markup code as intermediate reasoning steps
  3. Adaptively selects the most appropriate markup language for different document types
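
To make the idea concrete, here is a minimal conceptual sketch that reuses the model, tokenizer, pixel_values, and generation_config introduced in the Usage section below. It is an illustration only: the prompts, the sample question, and the use of two explicit calls (rather than a single generation that interleaves markup and answer) are assumptions.

# Conceptual sketch of the DocMark idea (illustration only; see Usage below for the actual setup)
# Stage 1: parse the document image into a structured markup representation.
markup = model.chat(
    tokenizer, pixel_values,
    '<image>Convert this document to structured markup format',
    generation_config
)

# Stage 2: answer a question, grounding the response in the generated markup.
# Folding the markup back into the prompt is an assumption made for illustration.
question = f'<image>Document markup:\n{markup}\nQuestion: What is the invoice total?'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print(answer)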

Key Contributions

  1. A novel pipeline that adaptively uses various markup languages to bridge the gap between visual inputs and linguistic understanding, significantly enhancing document comprehension capabilities.

  2. Two fine-grained structured datasets:

    • DocMark-Pile: 3.8M pretraining data pairs for document parsing
    • DocMark-Instruct: 624k fine-tuning annotations for grounded instruction following
  3. State-of-the-art performance across various document understanding benchmarks, significantly outperforming existing MLLMs.

Model Zoo

| Model | LLM Size | TextVQA | DocVQA | InfoVQA | ChartQA | AI2D | OCRBench | WebQA | MathVision |
|-------|----------|---------|--------|---------|---------|------|----------|-------|------------|
| DocMark-2B | 2B | 74.8 | 87.8 | 61.2 | 79.8 | 82.5 | 813 | 70.1 | 18.8 |
| DocMark-8B | 8B | 78.0 | 89.8 | 68.3 | 84.2 | 86.2 | 823 | 78.9 | 21.1 |

Pre-trained models are available on Hugging Face: DocMark-Pretrain-2B
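
To fetch the checkpoint ahead of time, rather than letting from_pretrained download it on first use, a minimal sketch with the Hugging Face Hub client (the repository ID matches the one used in the Usage section below):

from huggingface_hub import snapshot_download

# Download the DocMark-Pretrain-2B checkpoint into the local Hugging Face cache
local_path = snapshot_download('HanXiao1999/DocMark-Pretrain-2B')
print('Checkpoint downloaded to:', local_path)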

Datasets

DocMark-Pile (3.8M samples)

Download on Hugging Face

A comprehensive pretraining dataset for document parsing with various markup languages:

  • Plain Text: Natural photos and regional text images
  • Markdown: Dense text documents and tables
  • LaTeX: Mathematics textbooks and handwritten formulas
  • HTML: Webpages and webpage summarization
  • JSON: Key information extraction from charts, receipts, and forms
  • TikZ: Scientific and geometry diagrams

DocMark-Instruct (624k samples)

Download on Hugging Face

Fine-tuning dataset featuring chain-of-thought-like reasoning annotations for contextually-grounded instruction following.
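
Both datasets can be loaded with the datasets library once downloaded from Hugging Face. This is a sketch under assumptions: the repository IDs below are placeholders (use the actual links above), and the per-sample fields are not documented here.

from datasets import load_dataset

# Placeholder dataset IDs; substitute the actual Hugging Face paths linked above
pile = load_dataset('ORG/DocMark-Pile', split='train')         # 3.8M parsing pairs
instruct = load_dataset('ORG/DocMark-Instruct', split='train')  # 624k reasoning annotations

print(pile)      # inspect sample count and column names
print(instruct)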

Usage

Installation

# Install the ms-swift training framework (editable install)
cd ms-swift
pip install -e .

Training

cd ms-swift
# Pretraining on DocMark-Pile
bash exps/docmark_pretrain_2b.sh

# Fine-tuning on DocMark-Instruct
bash exps/docmark_finetune_2b.sh

Inference

DocMark supports various document understanding tasks including text extraction, OCR with grounding, and document-to-markup conversion. Below are example usage scenarios:

Basic Setup

import torch
from swift.utils import seed_everything
from modelscope import AutoModel, AutoTokenizer
from utils import load_image

# Initialize model
model_path = 'HanXiao1999/DocMark-Pretrain-2B'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
    legacy=False,
    add_prefix_space=False
)

generation_config = dict(max_new_tokens=2048, do_sample=False)
seed_everything(42)

Text Extraction from Images

# Load image
image_path = 'examples/text_img.jpg'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

# Extract text content
question = '<image>Could you extract the text from the image for me?'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Extracted Text:', answer)

OCR with Grounding

# Reuses the pixel_values loaded in the previous example
question = '<image>OCR with grounding:'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('OCR Results with Grounding:', answer)

Document to Markup Conversion

# Process document image
image_path = 'examples/example_doc.png'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

# Convert to structured markup
question = '<image>Convert this document to structured markup format'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Generated Markup:', answer)

Additional Examples

For more comprehensive examples, including:

  • Mathematical Content to LaTeX
  • Webpage Content Analysis
  • Scientific Diagram Reconstruction
  • Structured Data Extraction

Please see the demo notebook for complete usage examples.
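
As one illustration, mathematical content extraction follows the same chat pattern as the examples above. A minimal sketch; the prompt wording and the example file name are assumptions, not necessarily the phrasing used in training or in the demo notebook:

# Convert a formula image to LaTeX (file name and prompt are illustrative)
image_path = 'examples/formula.png'
pixel_values = load_image(image_path, max_num=12).to(torch.bfloat16).cuda()

question = '<image>Convert the mathematical content in this image to LaTeX'
answer = model.chat(tokenizer, pixel_values, question, generation_config)
print('Generated LaTeX:', answer)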

Results on Downstream Document Understanding Tasks

DocMark significantly outperforms existing state-of-the-art MLLMs on document understanding tasks, particularly excelling in handling complex document formats and reasoning tasks.

Acknowledgements

We would like to thank the following repositories for their great work:

Citation

If you find DocMark useful for your research and applications, please kindly cite using this BibTeX:

@inproceedings{xiao2025adaptive,
  title={Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding},
  author={Xiao, Han and Xie, Yina and Tan, Guanxin and Chen, Yinghao and Hu, Rui and Wang, Ke and Zhou, Aojun and Li, Hao and Shao, Hao and Lu, Xudong and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={29558--29568},
  year={2025}
}

Contact

For any questions or inquiries, please contact us at [email protected].
