document-analysis

Here are 329 public repositories matching this topic...

opendatalab / MinerU

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Mar 2, 2026
Python

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

python pdf parser ocr pdf-converter document-analysis pdf-parser layout-analysis vlm-ocr

Updated Dec 17, 2025
Python

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Feb 2, 2026
Python

UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

pdf csharp pdfbox netstandard pdf-files pdf-document pdf-generation hocr document-analysis pdf-extractor alto-xml page-xml layout-analysis pdf-document-processor

Updated Feb 28, 2026
C#

NanoNets / docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Updated Aug 25, 2025
Python

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Updated Apr 9, 2025
C++

tstanislawek / awesome-document-understanding

A curated list of resources for Document Understanding (DU) topic

Updated Jun 2, 2023

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

open-source pdf parser ocr ai pdf-converter developer-tools extract-data document-analysis pdf-extractor document-extraction llms pdf-extractor-llm

Updated May 15, 2025
JavaScript

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

ocr document-analysis document-processing scene-text-recognition scene-text-detection ocr-pytorch chineseocr document-parsing

Updated Mar 2, 2026
Python

Deodat-Lawson / PDR_AI_v2

AI-powered StartUp Accelerator Engine built with Next.js, LangChain, PostgreSQL + pgvector. Upload, organize, and chat with documents. Includes predictive missing-document detection, role-based workflows, and page-level insight extraction.

typescript ocr nextjs postgresql full-stack openai document-analysis rag vector-search ai-chatbot document-ai langchain pgvector drizzle-orm rag-chatbot llm-app

Updated Mar 3, 2026
JavaScript

Yuliang-Liu / Curve-Text-Detector

This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning object-detection document-analysis scene-text

Updated Jul 20, 2020
Jupyter Notebook

ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

html pdf ocr table-of-contents excel html-parser docx documents doc scanned-documents txt document-analysis odt pdf-parser table-recognition docx-parser document-content-extraction logical-structure-extraction

Updated Mar 2, 2026
Python

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

document-analysis graph-convolutional-network graph-learning graph-neural-networks document-understanding key-information-extraction

Updated Jul 25, 2024
Python

CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis

framework incident-response malware python3 cybersecurity cert infosec malware-analyzer malware-analysis malware-research automation-framework cyber-security file-analysis document-analysis security-automation security-tools malware-detection assemblyline security-automation-framework

Updated Feb 26, 2026
Python

jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

nlp information-extraction document-analysis document-understanding multilingual-models document-ai multimodal-pre-trained-model