Kreuzberg

Extract text and metadata from a wide range of file formats (75+), generate embeddings and post-process at native speeds without needing a GPU.

Key Features

Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, and document extractors
Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, and Elixir
75+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
OCR support – Tesseract (all bindings), PaddleOCR (all native bindings), EasyOCR (Python), extensible via plugin API
High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
Ruby – RubyGems package, idiomatic Ruby API, native bindings
PHP – Composer package, modern PHP 8.4+ support, type-safe API
Elixir – Hex package, OTP integration, concurrent processing

JavaScript/TypeScript:

@kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
@kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers

Compiled Languages:

Go – Go module with FFI bindings, context-aware async
Java – Maven Central, Foreign Function & Memory API
C# – NuGet package, .NET 6.0+, full async/await support

Native:

Rust – Core library, flexible feature flags, zero-copy APIs

Containers:

Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for both x86_64 and aarch64 architectures on Linux and macOS.

Platform Support

Complete architecture coverage across all language bindings:

Language	Linux x86_64	Linux aarch64	macOS ARM64	Windows x64
Python	✅	✅	✅	✅
Node.js	✅	✅	✅	✅
Ruby	✅	✅	✅	-
Elixir	✅	✅	✅	✅
Go	✅	✅	✅	✅
Java	✅	✅	✅	✅
C#	✅	✅	✅	✅
PHP	✅	✅	✅	✅
Rust	✅	✅	✅	✅
CLI	✅	✅	✅	✅
Docker	✅	✅	✅	-

Note: ✅ = Precompiled binaries available with instant installation. All platforms are tested in CI. macOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

Install ONNX Runtime 1.24+:
- Linux: Download from ONNX Runtime releases (Debian packages may have older versions)
- macOS: brew install onnxruntime
- Windows: Download from ONNX Runtime releases
Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.

Supported Formats

75+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category	Formats	Capabilities
Word Processing	`.docx`, `.odt`	Full text, tables, lists, images, metadata, styles
Spreadsheets	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods`	Sheet data, formulas, cell metadata, charts
Presentations	`.pptx`, `.pptm`, `.ppsx`	Slides, speaker notes, images, metadata
PDF	`.pdf`	Text, tables, images, metadata, OCR support
eBooks	`.epub`, `.fb2`	Chapters, metadata, embedded resources

Images (OCR-Enabled)

Category	Formats	Features
Raster	`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`	OCR, table detection, EXIF metadata, dimensions, color space
Advanced	`.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`	Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector	`.svg`	DOM parsing, embedded text, graphics metadata

Web & Data

Category	Formats	Features
Markup	`.html`, `.htm`, `.xhtml`, `.xml`, `.svg`	DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data	`.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`	Schema detection, nested structures, validation
Text & Markdown	`.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf`	CommonMark, GFM, Djot, reStructuredText, Org Mode, Rich Text

Email & Archives

Category	Formats	Features
Email	`.eml`, `.msg`	Headers, body (HTML/plain), attachments, UTF-16 support
Archives	`.zip`, `.tar`, `.tgz`, `.gz`, `.7z`	Recursive extraction, nested archives, metadata

Academic & Scientific

Category	Formats	Features
Citations	`.bib`, `.ris`, `.nbib`, `.enw`, `.csl`	BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific	`.tex`, `.latex`, `.typ`, `.typst`, `.jats`, `.ipynb`	LaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing	`.fb2`, `.docbook`, `.dbk`, `.opml`	FictionBook, DocBook XML, OPML outlines
Documentation	`.pod`, `.mdoc`, `.troff`	Perl POD, man pages, troff

Complete Format Reference →

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Installation Guide – Setup and dependencies
User Guide – Comprehensive usage guide
API Reference – Complete API documentation
Format Support – Supported file formats
OCR Backends – OCR engine setup
CLI Guide – Command-line usage
Migration Guide – Upgrading from v3

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details. You can use Kreuzberg freely in both commercial and closed-source products with no obligations, no viral effects, and no licensing restrictions.

Name		Name	Last commit message	Last commit date
Latest commit History 3,441 Commits
.ai-rulez		.ai-rulez
.cargo		.cargo
.github		.github
.mvn		.mvn
.task		.task
crates		crates
docker		docker
docs		docs
e2e		e2e
examples		examples
fixtures		fixtures
packages		packages
scripts		scripts
skills/kreuzberg		skills/kreuzberg
test_documents		test_documents
tests		tests
tools		tools
vendor		vendor
.commitlintrc		.commitlintrc
.deepsource.toml		.deepsource.toml
.dockerignore		.dockerignore
.gitignore		.gitignore
.golangci.yml		.golangci.yml
.markdownlint.yaml		.markdownlint.yaml
.npmrc		.npmrc
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
.sdkmanrc		.sdkmanrc
.shellcheckrc		.shellcheckrc
.taplo.toml		.taplo.toml
ATTRIBUTIONS.md		ATTRIBUTIONS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
Taskfile.yml		Taskfile.yml
biome.json		biome.json
composer.json		composer.json
composer.lock		composer.lock
deny.toml		deny.toml
go.work		go.work
go.work.sum		go.work.sum
kreuzberg.iml		kreuzberg.iml
mkdocs.yaml		mkdocs.yaml
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml
pyproject.toml		pyproject.toml
rust-toolchain.toml		rust-toolchain.toml
rustfmt.toml		rustfmt.toml
tsconfig.json		tsconfig.json
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kreuzberg

Key Features

Installation

Platform Support

Embeddings Support (Optional)

Supported Formats

Office Documents

Images (OCR-Enabled)

Web & Data

Email & Archives

Academic & Scientific

Key Features

AI Coding Assistants

Documentation

Contributing

License

About

Uh oh!

Releases 134

Packages

Uh oh!

Uh oh!

Contributors 26

Languages

License

kreuzberg-dev/kreuzberg

Folders and files

Latest commit

History

Repository files navigation

Kreuzberg

Key Features

Installation

Platform Support

Embeddings Support (Optional)

Supported Formats

Office Documents

Images (OCR-Enabled)

Web & Data

Email & Archives

Academic & Scientific

Key Features

AI Coding Assistants

Documentation

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 134

Packages 0

Uh oh!

Uh oh!

Contributors 26

Languages

Packages