The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model (0.3B) following an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.

📑 Overview

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

  1. 🔍 Stage 1: Comprehensive page-level layout analysis that generates an element sequence in natural reading order
  2. 🧩 Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts
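The two stages above can be sketched in a few lines of Python. This is only an illustration of the analyze-then-parse control flow, not the repository's implementation: `Element`, `PROMPTS`, `parse_page`, and the two stand-in callables are hypothetical names, an anchor is modeled simply as a bounding box, and the prompt strings are placeholders rather than the prompts Dolphin actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Element:
    """One layout element found in Stage 1 (hypothetical structure)."""
    label: str   # e.g. "text", "table", "formula"
    bbox: BBox   # region of the page used as the anchor
    order: int   # position in the reading order


# Placeholder task-specific prompts; the real prompt strings are not shown here.
PROMPTS: Dict[str, str] = {
    "text": "<text prompt>",
    "table": "<table prompt>",
    "formula": "<formula prompt>",
}


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_page(
    page: str,
    analyze_layout: Callable[[str], List[Element]],
    parse_batch: Callable[[str, List[BBox], List[str]], List[str]],
    max_batch_size: int = 8,
) -> List[dict]:
    # Stage 1: layout analysis, kept in reading order.
    elements = sorted(analyze_layout(page), key=lambda e: e.order)

    # Stage 2: parse elements batch by batch, one prompt per element type.
    results: List[dict] = []
    for batch in chunked(elements, max_batch_size):
        prompts = [PROMPTS.get(e.label, PROMPTS["text"]) for e in batch]
        anchors = [e.bbox for e in batch]
        outputs = parse_batch(page, anchors, prompts)
        for element, content in zip(batch, outputs):
            results.append({
                "label": element.label,
                "bbox": list(element.bbox),
                "reading_order": element.order,
                "content": content,
            })
    return results


# Stand-ins for the model calls, so the sketch runs without any weights.
def fake_layout(page: str) -> List[Element]:
    return [
        Element("table", (0, 50, 100, 90), 1),
        Element("text", (0, 0, 100, 40), 0),
        Element("formula", (0, 100, 100, 120), 2),
    ]


def fake_parse_batch(page: str, anchors: List[BBox], prompts: List[str]) -> List[str]:
    return [f"{prompt} @ {bbox}" for bbox, prompt in zip(anchors, prompts)]


parsed = parse_page("page_1.png", fake_layout, fake_parse_batch, max_batch_size=2)
for item in parsed:
    print(item["reading_order"], item["label"], item["content"])
```

Batching the Stage 2 calls is what makes the parsing parallel: every element in a batch is decoded together, and the reading order from Stage 1 is preserved when the results are reassembled.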

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.

📅 Changelog

  • 🔥 2025.10.16 Released the Dolphin-1.5 model. While keeping the lightweight 0.3B architecture, this version achieves significant parsing improvements. (Dolphin 1.0 has moved to the v1.0 branch.)
  • 🔥 2025.07.10 Released the Fox-Page Benchmark, a manually refined subset of the original Fox dataset. Download via: Baidu Yun | Google Drive.
  • 🔥 2025.06.30 Added TensorRT-LLM support for accelerated inference!
  • 🔥 2025.06.27 Added vLLM support for accelerated inference!
  • 🔥 2025.06.13 Added multi-page PDF document parsing capability.
  • 🔥 2025.05.21 Our demo is released at link. Check it out!
  • 🔥 2025.05.20 The pretrained model and inference code of Dolphin are released.
  • 🔥 2025.05.16 Our paper has been accepted by ACL 2025. Paper link: arXiv.

📈 Performance

Comprehensive evaluation of document parsing on Fox-Page and Dolphin-Page

Model         Fox-Page-en (Edit)   Fox-Page-zh (Edit)   Dolphin-Page (Edit)   Avg (Edit)
Dolphin       0.0114               0.0131               0.1028                0.0424
Dolphin-1.5   0.0074               0.0077               0.0743                0.0298

Comprehensive evaluation of document parsing on OmniDocBench (v1.5)

Model         Overall↑   Text (Edit)   Formula (CDM)   Table (TEDS)   Table (TEDS-S)   Read Order (Edit)
Dolphin       74.67      0.125         67.85           68.70          77.77            0.124
Dolphin-1.5   83.21      0.092         80.78           78.06          84.10            0.080
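
As a quick sanity check on the first table, the last column appears to be the unweighted mean of the three edit-distance columns. That reading is an assumption, not something the table states, but the numbers agree to four decimals. The dictionary below just restates the reported values.

```python
import math

# Reported edit distances: (Fox-Page-en, Fox-Page-zh, Dolphin-Page, last column)
reported = {
    "Dolphin": (0.0114, 0.0131, 0.1028, 0.0424),
    "Dolphin-1.5": (0.0074, 0.0077, 0.0743, 0.0298),
}


def mean_edit(scores):
    """Unweighted mean of the per-benchmark edit distances, to 4 decimals."""
    return round(sum(scores) / len(scores), 4)


checks = {}
for model, (*per_benchmark, last_column) in reported.items():
    computed = mean_edit(per_benchmark)
    checks[model] = math.isclose(computed, last_column, abs_tol=1e-9)
    print(f"{model}: computed {computed:.4f}, reported {last_column:.4f}")
```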

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/ByteDance/Dolphin.git
    cd Dolphin
  2. Install the dependencies:

    pip install -r requirements.txt
  3. Download the pre-trained models of Dolphin-1.5:

    Visit our Hugging Face model card, or download the model with:

    # Download the model from Hugging Face Hub
    git lfs install
    git clone https://huggingface.co/ByteDance/Dolphin-1.5 ./hf_model
    # Or use the Hugging Face CLI
    pip install huggingface_hub
    huggingface-cli download ByteDance/Dolphin-1.5 --local-dir ./hf_model
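
The same download can also be scripted from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package installed above; the wrapper name `download_model` is hypothetical and not part of this repository, and the defaults simply mirror the commands shown here.

```python
from pathlib import Path

REPO_ID = "ByteDance/Dolphin-1.5"


def download_model(local_dir: str = "./hf_model", repo_id: str = REPO_ID) -> Path:
    """Fetch all files of the model repository into `local_dir`."""
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=local_dir))
```

Calling `download_model()` needs network access and should leave the weights in `./hf_model`, the path the inference commands expect.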

⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:

  • Page-level Parsing: Parse the entire document page into structured JSON and Markdown output
  • Element-level Parsing: Parse individual document elements (text, table, formula, or code)

📄 Page-level Parsing

# Process a single document image
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png 

# Process a single document pdf
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf 

# Process all documents in a directory
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs 

# Process with custom batch size for parallel element decoding
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs \
    --max_batch_size 8

🧩 Element-level Parsing

# Process element images (specify element_type: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
    --input_path  \
    --element_type [table|formula|text|code]

🎨 Layout Parsing

# Process a single document image
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single PDF document
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs 
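
When the demo scripts are driven from another Python program, it can help to assemble the command lines in one place. The helper below is a hypothetical convenience wrapper, not part of the repository. It uses only the scripts and flags shown above, and it rejects an `--element_type` outside the four listed values before anything is launched.

```python
import shlex
import sys
from typing import List, Optional

ELEMENT_TYPES = ("table", "formula", "text", "code")


def build_command(
    script: str,
    input_path: str,
    model_path: str = "./hf_model",
    save_dir: str = "./results",
    max_batch_size: Optional[int] = None,
    element_type: Optional[str] = None,
) -> List[str]:
    """Return an argv list for demo_page.py, demo_element.py, or demo_layout.py."""
    cmd = [
        sys.executable, script,
        "--model_path", model_path,
        "--save_dir", save_dir,
        "--input_path", input_path,
    ]
    if max_batch_size is not None:
        # The examples above only show this flag with demo_page.py.
        cmd += ["--max_batch_size", str(max_batch_size)]
    if element_type is not None:
        if element_type not in ELEMENT_TYPES:
            raise ValueError(f"element_type must be one of {ELEMENT_TYPES}")
        cmd += ["--element_type", element_type]
    return cmd


page_cmd = build_command("demo_page.py", "./demo/page_imgs", max_batch_size=8)
print(shlex.join(page_cmd))
# To execute it: subprocess.run(page_cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.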

🌟 Key Features

  • 🔄 Two-stage analyze-then-parse approach based on a single VLM
  • 📊 Promising performance on document parsing tasks
  • 🔍 Natural reading order element sequence generation
  • 🧩 Heterogeneous anchor prompting for different document elements
  • ⏱️ Efficient parallel parsing mechanism
  • 🤗 Support for Hugging Face Transformers for easier integration

📮 Notice

Call for Bad Cases: If you encounter cases where the model performs poorly, we would greatly appreciate it if you could share them in an issue. We are continuously working to optimize and improve the model.

💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:

📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

@article{feng2025dolphin,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
  journal={arXiv preprint arXiv:2505.14059},
  year={2025}
}
