Docling converts messy documents into structured data and simplifies downstream document and AI processing by detecting tables, formulas, and reading order, applying OCR, and much more.

Start

Install Docling as a Python library with your favorite package manager:

pip install docling

Run the CLI directly from your terminal:

docling https://arxiv.org/pdf/2206.01062

Code a document conversion as part of a Python application:

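from docling.document_converter import DocumentConverter

# A URL or a local file path to the document to convert
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
# Run the conversion and take the resulting Docling Document
doc = converter.convert(source).document
# Print the document as Markdown
print(doc.export_to_markdown())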
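
To keep the Markdown instead of printing it, the same conversion can write to a file. The sketch below is one possible way to do that. The helper names markdown_path_for and convert_to_markdown_file, the "out" directory, and the file-naming rule are assumptions made for illustration and are not part of Docling.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def markdown_path_for(source: str, out_dir: str = "out") -> Path:
    """Derive an output path from the last segment of a URL or POSIX-style path.

    Hypothetical naming rule: append ".md" to the last path segment.
    """
    name = PurePosixPath(urlparse(source).path).name or "document"
    return Path(out_dir) / f"{name}.md"


def convert_to_markdown_file(source: str, out_dir: str = "out") -> Path:
    """Convert one document with Docling and save its Markdown export."""
    # Imported lazily so the helper above stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    target = markdown_path_for(source, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(doc.export_to_markdown(), encoding="utf-8")
    return target
```

Calling convert_to_markdown_file("https://arxiv.org/pdf/2408.09869") would download and convert the paper, then write out/2408.09869.md.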

Deploy it as Docling Serve
Enable an agent via Docling MCP

Features

Import many document formats into a unified and structured Docling Document, including scanned pages via an OCR engine of your choice.
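
When a pipeline receives mixed input files, it can help to check up front which format family a file falls into. The sketch below encodes the import formats this page lists. The file-extension spellings (for example ".adoc" for AsciiDoc or ".vtt" for WebVTT) and the helper name import_family are assumptions; Docling performs its own format detection, which this does not replace.

```python
from pathlib import PurePosixPath
from typing import Optional

# Import formats as listed on this page, grouped by family.
# The extension spellings are assumptions made for this sketch.
IMPORT_FORMATS = {
    "rich": {".pdf", ".docx", ".pptx"},
    "markup": {".md", ".html", ".adoc", ".vtt"},
    "tabular": {".xlsx", ".csv"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"},
    "audio": {".mp3", ".wav"},
}


def import_family(filename: str) -> Optional[str]:
    """Return the format family for a file name, or None if it is not listed."""
    suffix = PurePosixPath(filename).suffix.lower()
    for family, extensions in IMPORT_FORMATS.items():
        if suffix in extensions:
            return family
    return None
```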

Export a parsed document to formats that simplify processing and ingestion into AI, RAG, and agentic systems.
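
As an illustration, several exports can be produced from one converted document. The sketch below assumes doc is the Docling Document returned by a conversion and relies on its export_to_markdown, export_to_text, export_to_html, and export_to_dict methods. The helper name export_all and the dictionary keys are hypothetical, and DocTags export is left out of this sketch.

```python
import json
from typing import Any, Dict


def export_all(doc: Any) -> Dict[str, str]:
    """Collect several string exports of one Docling Document, keyed by file extension."""
    return {
        "md": doc.export_to_markdown(),
        "txt": doc.export_to_text(),
        "html": doc.export_to_html(),
        # export_to_dict() returns plain Python data, serialized here as JSON
        "json": json.dumps(doc.export_to_dict(), ensure_ascii=False),
    }
```

Each value can then be written to disk or passed on to an indexing or RAG pipeline.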

Extract document components and their properties from the Docling Document.

Import
  Rich: PDF, DOCX, PPTX
  Markup: Markdown, HTML, AsciiDoc, WebVTT
  Tabular: XLSX, CSV
  Image: PNG, JPEG, TIFF, BMP, WEBP
  Audio: MP3, WAV

Export
  Rich: JSON, Doctags
  Markup: Text, Markdown, HTML

Extract
  Page: Image, Number, Header, Footer
  Component
    Text: Header, Paragraph, List item, Code, Formula
    Table: Structure, Cell
    Picture: Image, Class, Description
    Caption
  Reading order
  Chunks
  Bounding boxes

Docling partitions a document into bite-sized chunks of contiguous text, ready for ingestion by AI systems.
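
A minimal sketch of chunking a converted document, assuming the HierarchicalChunker class exposed by docling.chunking and its chunk(dl_doc=...) method. The helper name chunk_texts and its optional chunker parameter are hypothetical; the parameter exists only so that a different chunker can be swapped in.

```python
from typing import Any, List, Optional


def chunk_texts(doc: Any, chunker: Optional[Any] = None) -> List[str]:
    """Return the text of every chunk produced for a Docling Document."""
    if chunker is None:
        # Imported lazily; assumes Docling is installed.
        from docling.chunking import HierarchicalChunker

        chunker = HierarchicalChunker()
    # Each chunk object carries its contiguous text in the `text` attribute
    return [chunk.text for chunk in chunker.chunk(dl_doc=doc)]
```

The resulting strings can be embedded or indexed one by one.
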
Docling stores and traverses components according to reading order.
Docling detects one or more bounding boxes per component, because a component can be fragmented and span different pages.
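
Reading order and bounding boxes can be inspected together. The sketch below assumes doc.iterate_items() yields (item, level) pairs in reading order and that each item carries a prov list whose entries expose page_no and a bbox with l, t, r, b coordinates. The helper name layout_rows is hypothetical.

```python
from typing import Any, List, Tuple

Row = Tuple[str, int, Tuple[float, float, float, float]]


def layout_rows(doc: Any) -> List[Row]:
    """List (label, page number, box) for every bounding box, in reading order."""
    rows: List[Row] = []
    for item, _level in doc.iterate_items():
        # Labels may be enum members or plain strings; normalize to a string
        label = getattr(item.label, "value", item.label)
        # One component can have several boxes, possibly on different pages
        for prov in getattr(item, "prov", []):
            box = prov.bbox
            rows.append((label, prov.page_no, (box.l, box.t, box.r, box.b)))
    return rows
```

A component that continues on the next page shows up as two consecutive rows with the same label and different page numbers.
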
Docling detects and optionally excludes page headers and footers from exports.
Docling captures table structure, such as rows, columns, and (multi-level) headers. Docling is able to interpret complex table cell content, such as lists. Docling groups captions with their respective pictures and tables.
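
As a sketch of reading table structure back out, the helpers below assume that doc.tables lists the detected tables and that each table's data.grid is a row-by-row grid of cells with a text attribute. The names table_to_rows and all_table_rows are hypothetical.

```python
from typing import Any, List


def table_to_rows(table: Any) -> List[List[str]]:
    """Flatten one table into rows of cell text."""
    return [[cell.text for cell in row] for row in table.data.grid]


def all_table_rows(doc: Any) -> List[List[List[str]]]:
    """Apply table_to_rows to every table of a Docling Document."""
    return [table_to_rows(table) for table in doc.tables]
```

Header rows come out as ordinary rows here; telling headers apart from body cells would need the per-cell header flags, which this sketch ignores.
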
Docling extracts pictures as image data and stores it in the Docling Document or as external image files. Docling classifies pictures by their contents, assigning labels such as chart and diagram types. Docling enriches pictures with additional captions that describe their contents.
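
A hedged sketch of retrieving picture image data from a PDF. It assumes the generate_picture_images option on PdfPipelineOptions, the doc.pictures list, and each picture's get_image(doc) method, which may return None when no image data was kept. The helper names convert_with_picture_images and picture_images are hypothetical.

```python
from typing import Any, List, Tuple


def convert_with_picture_images(source: str) -> Any:
    """Convert a PDF while asking Docling to keep image data for pictures."""
    # Imported lazily; assumes Docling is installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.generate_picture_images = True
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    return converter.convert(source).document


def picture_images(doc: Any) -> List[Tuple[int, Any]]:
    """Return (index, image) pairs for pictures that have image data."""
    images = []
    for index, picture in enumerate(doc.pictures):
        image = picture.get_image(doc)
        if image is not None:
            images.append((index, image))
    return images
```

The returned images can then be saved as external files, for example with image.save(f"picture-{index}.png").
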
Docling detects mathematical formulas and converts them to LaTeX syntax.
Docling detects blocks of code and classifies their programming languages. Docling detects list items and groups them together.
Docling distinguishes section headers from subsequent paragraphs. Docling concatenates fragmented paragraphs, across one or multiple pages, into one text.
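
Text components such as section headers, formulas, code blocks, and list items can be pulled out of a converted document by label. The sketch below assumes doc.iterate_items() yields (item, level) pairs and that item labels carry the string values "section_header", "formula", "code", and "list_item". The helper name texts_by_label is hypothetical, and whether a formula's text already holds LaTeX may depend on how the conversion was configured.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

LABELS_OF_INTEREST = ("section_header", "formula", "code", "list_item")


def texts_by_label(
    doc: Any, labels: Iterable[str] = LABELS_OF_INTEREST
) -> Dict[str, List[str]]:
    """Group item text by label, keeping reading order within each group."""
    wanted = set(labels)
    grouped: Dict[str, List[str]] = defaultdict(list)
    for item, _level in doc.iterate_items():
        # Labels may be enum members or plain strings; normalize to a string
        label = getattr(item.label, "value", item.label)
        if label in wanted:
            grouped[label].append(getattr(item, "text", ""))
    return dict(grouped)
```

For instance, texts_by_label(doc).get("formula", []) would give the formula strings in the order they appear in the document.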