A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDFs into Markdown and JSON formats.
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
A system for agentic LLM-powered data processing and ETL
Read and extract text and other content from PDFs in C# (port of PDFBox)
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
A curated list of resources for Document Understanding (DU) topic
Open-source platform for extracting structured data from documents using AI.
This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
AssemblyLine 4: File triage and malware analysis
A package for parsing PDFs and analyzing their content using LLMs.
RObust document image BINarization
YOLO models trained on DocLayNet - power your document intelligence with layout analysis
Local adaptive image binarization
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
Document Visual Question Answering
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Effortlessly extract information from unstructured data with this library, which uses advanced AI techniques. Compose AI components into customizable pipelines over diverse sources for your projects.
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)