
forked from allenai/olmocr

Toolkit for linearizing PDFs for LLM datasets/training

targc/olmocr

pdelfin

Toolkit for truly understanding PDF documents in the wild.

Things supported:

  • A prompting strategy for high-quality natural-text parsing with ChatGPT 4o (silver_data)
  • An eval toolkit for comparing different pipeline versions
  • Basic filtering by language and SEO spam removal
  • Finetuning code for Qwen2-VL (and soon other VLMs)

Note: Font installation

You will probably need to install some fonts on your machine so that any PDFs you render come out with the intended typefaces rather than fallback substitutes.

sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
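
To confirm the fonts are visible after installing, something like the following may help. It is a minimal sketch, not part of this repository: it assumes fontconfig's fc-list is on your PATH, and the function names and the list of expected families (a small illustrative subset of what the packages above provide) are hypothetical.

```python
import shutil
import subprocess

# Assumption: an illustrative subset of the families the packages above provide.
EXPECTED_FAMILIES = ["Arial", "Times New Roman", "Caladea", "Carlito"]


def missing_families(fc_list_output, expected=EXPECTED_FAMILIES):
    """Return the expected families that do not appear in `fc-list : family` output."""
    installed = set()
    for line in fc_list_output.splitlines():
        # One line may carry several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                installed.add(name)
    return [family for family in expected if family.lower() not in installed]


def check_fonts():
    """Run fc-list and return the missing families, or None if fc-list is unavailable."""
    if shutil.which("fc-list") is None:
        return None
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=False
    )
    return missing_families(result.stdout)
```

Calling check_fonts() returns an empty list when every expected family is found, a list of the missing names otherwise, and None when fc-list itself is not installed.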
