Stars
Benchmarking Open Source LLMs for Text-to-Table Generation
🏠 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Document Layout Analysis resources and repos for development with PdfPig.
Label Studio is a multi-type data labeling and annotation tool with a standardized output format
A curated list of resources for the Document Understanding (DU) topic
The data for the CRASS-benchmark
Table structure recognition dataset from the paper "Complicated Table Structure Recognition"
Master repository which includes most other OCR-D repositories as submodules
Collection of OCR-related Python tools and wrappers from @OCR-D
Extracts raw text from web archives (WARCs).
A Unified Toolkit for Deep Learning Based Document Image Analysis