
Docling converts messy documents into structured data and simplifies downstream document and AI processing by detecting tables, formulas, and reading order, running OCR, and much more.

Start

Install Docling as a Python library with your favorite package manager:

pip install docling

Run the CLI directly from your terminal:

docling https://arxiv.org/pdf/2206.01062

Code a document conversion as part of a Python application:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
doc = converter.convert(source).document
print(doc.export_to_markdown())

Deploy it as Docling Serve

Enable an agent via Docling MCP

Features

Import many document formats into a unified and structured Docling Document, including scanned pages via an OCR engine of your choice.

Export a parsed document to formats that simplify processing and ingestion into AI, RAG, and agentic systems.
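As a rough illustration of the export step, the sketch below writes one converted document to Markdown, HTML, and JSON files. It assumes `doc` is the Docling Document returned by `converter.convert(source).document` and that it offers `export_to_markdown()`, `export_to_html()`, and `export_to_dict()`; the helper name `export_all` and the output file names are hypothetical, and the available export methods may differ between Docling versions.

```python
import json
from pathlib import Path


def export_all(doc, out_dir, stem="document"):
    """Write Markdown, HTML, and JSON exports of `doc` into `out_dir`.

    `doc` is assumed to behave like a Docling Document, i.e. to provide
    export_to_markdown(), export_to_html(), and export_to_dict().
    Returns the sorted names of the files that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Map each target file to the serialized content it should contain.
    targets = {
        out / f"{stem}.md": doc.export_to_markdown(),
        out / f"{stem}.html": doc.export_to_html(),
        out / f"{stem}.json": json.dumps(
            doc.export_to_dict(), ensure_ascii=False, indent=2
        ),
    }
    for path, content in targets.items():
        path.write_text(content, encoding="utf-8")
    return sorted(path.name for path in targets)
```

With a converted document at hand, `export_all(doc, "out")` would leave the three files in the `out` directory, ready for ingestion by a downstream system.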

Extract document components and their properties from the Docling Document.

Import
	Rich: PDF, DOCX, PPTX
	Markup: Markdown, HTML, AsciiDoc, WebVTT
	Tabular: XLSX, CSV
	Image: PNG, JPEG, TIFF, BMP, WEBP
	Audio: MP3, WAV
Export
	Rich: JSON, Doctags
	Markup: Text, Markdown, HTML
Extract
	Page: Image, Number, Header, Footer
	Text: Header, Paragraph, List item, Code, Formula
	Table: Structure, Cell
	Picture: Image, Class, Description
	Table and Picture: Caption
	All components: Reading order, Chunks
	Pages and components: Bounding boxes

Docling partitions a document into bite-sized chunks of contiguous text, ready for ingestion by AI systems.
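To make the idea of chunking contiguous text concrete, here is a minimal conceptual sketch that greedily packs consecutive paragraphs into chunks under a character budget. It is not Docling's chunker: the function name `chunk_paragraphs`, the `max_chars` parameter, and the character-based budget are assumptions chosen for illustration (a real pipeline would more likely budget by tokens).

```python
def chunk_paragraphs(paragraphs, max_chars=200):
    """Greedily pack consecutive paragraphs into chunks of at most max_chars.

    Paragraphs are never reordered or split, so every chunk is a run of
    contiguous text. A single paragraph longer than max_chars becomes a
    chunk of its own and may exceed the budget.
    """
    chunks = []
    current = []  # paragraphs collected for the chunk being built
    size = 0      # length of "\n".join(current)

    for para in paragraphs:
        # Joining adds one newline between paragraphs in the same chunk.
        extra = len(para) + (1 if current else 0)
        if current and size + extra > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(para)
        current.append(para)
        size += extra

    if current:
        chunks.append("\n".join(current))
    return chunks
```

Keeping paragraphs whole and in order is the design choice that matters here: each chunk stays readable on its own, which is what an embedding or retrieval step typically needs.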

Docling stores and traverses components according to reading order.
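A short sketch of walking a document in reading order follows. It assumes that the Docling Document exposes `iterate_items()`, yielding `(item, level)` pairs in reading order, and that items carry a `label` and, for textual items, a `text` attribute; the helper name `outline` is hypothetical, and the attribute names should be checked against the installed version.

```python
def outline(doc, max_chars=60):
    """Return one indented line per document item, in reading order.

    `doc` is assumed to provide iterate_items(), which yields
    (item, level) pairs; `level` is the nesting depth of the item.
    """
    lines = []
    for item, level in doc.iterate_items():
        # Fall back gracefully for items without a label or text
        # (for example, tables or pictures may have no `text`).
        label = getattr(item, "label", type(item).__name__)
        text = getattr(item, "text", "") or ""
        lines.append(f"{'  ' * level}[{label}] {text[:max_chars]}")
    return lines
```

Printing `"\n".join(outline(doc))` for a converted document gives a quick view of its component sequence and nesting.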

Docling detects one or more bounding boxes per component, because a component can be fragmented and span several pages.
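One way to picture a component with several bounding boxes is as a list of page-anchored fragments. The data structure below is an illustrative sketch only: the class and field names (`BBox`, `Fragment`, `Component`, `page_no`) are hypothetical and do not reproduce the actual Docling Document schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in page coordinates (units are an assumption)."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Fragment:
    """One visual piece of a component, located on a single page."""
    page_no: int
    bbox: BBox


@dataclass
class Component:
    """A logical component (e.g. a paragraph) made of one or more fragments."""
    text: str
    fragments: list[Fragment] = field(default_factory=list)

    def pages(self):
        """Return the sorted page numbers this component appears on."""
        return sorted({frag.page_no for frag in self.fragments})


# A paragraph that starts at the bottom of page 3 and continues on page 4.
paragraph = Component(
    text="A paragraph that continues across a page break.",
    fragments=[
        Fragment(page_no=3, bbox=BBox(72, 700, 540, 760)),
        Fragment(page_no=4, bbox=BBox(72, 72, 540, 110)),
    ],
)
```

Here `paragraph.pages()` returns `[3, 4]`: the text is stored once, while its location is described by one box per page.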

Docling detects and optionally excludes page headers and footers from exports.

Docling captures table structure, such as rows, columns, and (multi-level) headers. Docling is able to interpret complex table cell content, such as lists. Docling groups captions with their respective pictures and tables.
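The following conceptual sketch shows how table structure with multi-level headers can be represented as cells with row and column spans and then expanded into a plain grid. It is not Docling's table model: the `Cell` fields and the `to_grid` helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A table cell anchored at (row, col) that may span several rows/columns."""
    text: str
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    is_header: bool = False


def to_grid(cells, n_rows, n_cols):
    """Expand spanning cells into an n_rows x n_cols grid of strings.

    Every grid position covered by a cell receives that cell's text, so a
    header spanning two columns appears above both of them.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                grid[r][c] = cell.text
    return grid


# Two header rows: "Sales" spans the "Q1" and "Q2" columns,
# and "Region" spans both header rows.
cells = [
    Cell("Region", 0, 0, row_span=2, is_header=True),
    Cell("Sales", 0, 1, col_span=2, is_header=True),
    Cell("Q1", 1, 1, is_header=True),
    Cell("Q2", 1, 2, is_header=True),
    Cell("EU", 2, 0),
    Cell("10", 2, 1),
    Cell("12", 2, 2),
]
grid = to_grid(cells, n_rows=3, n_cols=3)
```

The resulting `grid` is `[["Region", "Sales", "Sales"], ["Region", "Q1", "Q2"], ["EU", "10", "12"]]`, which keeps the header hierarchy recoverable even after flattening.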

Docling extracts pictures as image data and stores them in the Docling Document or as external image files. Docling classifies pictures by their contents, assigning labels such as chart and diagram types. Docling enriches pictures with additional captions that describe their contents.

Docling detects mathematical formulas and converts them to LaTeX syntax.

Docling detects blocks of code and classifies their programming languages. Docling detects list items and groups them together.

Docling distinguishes section headers from subsequent paragraphs. Docling concatenates fragmented paragraphs, across one or multiple pages, into one text.