PDF Oxide - The Fastest PDF Library for Python and Rust

The fastest Python PDF library for text extraction, image extraction, and markdown conversion. Built on a Rust core: 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.

Quick Start

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

pip install pdf_oxide
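
To convert a whole document rather than a single page, the per-page call can be looped. This is a minimal sketch: it relies only on `page_count` and `to_markdown(page, detect_headings=True)` as they are used in this README. The helper name `document_to_markdown` and the blank-line page separator are illustrative assumptions, not part of the library.

```python
def document_to_markdown(doc, separator="\n\n"):
    """Join the Markdown of every page of `doc` into one string.

    `doc` is any object exposing `page_count` and
    `to_markdown(page_index, detect_headings=...)`, which is how
    PdfDocument is used in this README. The helper itself is
    hypothetical and not part of pdf_oxide.
    """
    pages = []
    for index in range(doc.page_count):
        pages.append(doc.to_markdown(index, detect_headings=True))
    return separator.join(pages)


# Usage (requires `pip install pdf_oxide` and a real file):
#   from pdf_oxide import PdfDocument
#   markdown = document_to_markdown(PdfDocument("paper.pdf"))
```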

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

[dependencies]
pdf_oxide = "0.3"

Why pdf_oxide?

Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
Complete — Text extraction, image extraction, PDF creation, and editing in one library
Dual-language — First-class Rust API and Python bindings via PyO3
Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). 18 libraries tested, single-thread, 60s timeout, no warm-up.
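
For readers who want to compute a mean and p99 on their own files, the statistics can be derived along these lines. This is a hedged sketch, not the project's benchmark harness: `time_calls`, `summarize`, and the nearest-rank definition of p99 are assumptions made for illustration, and the 60s timeout is not modelled.

```python
import time


def time_calls(func, inputs):
    """Call func(item) once per item, single-threaded and without warm-up.

    Returns the wall-clock duration of each call in milliseconds.
    """
    timings_ms = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms):
    """Return (mean, p99) of the timings, with p99 by the nearest-rank method."""
    if not timings_ms:
        raise ValueError("no timings to summarize")
    ordered = sorted(timings_ms)
    mean = sum(ordered) / len(ordered)
    # 1-based nearest rank: ceil(0.99 * n), computed in integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return mean, ordered[rank - 1]
```

For example, `summarize(time_calls(lambda path: PdfDocument(path).extract_text(0), paths))` would time first-page extraction over a list of file paths named `paths`.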

Python Libraries

Library	Mean	p99	Pass Rate	License
PDF Oxide	0.8ms	9ms	100%	MIT
unstructured	478.4ms	1,477ms	99.6%	Apache-2.0
PyMuPDF	4.6ms	28ms	99.3%	AGPL-3.0
pypdfium2	4.1ms	42ms	99.2%	Apache-2.0
kreuzberg	7.2ms	49ms	99.1%	MIT
pymupdf4llm	55.5ms	280ms	99.1%	AGPL-3.0
pdftext	7.3ms	82ms	99.0%	GPL-3.0
extractous	112.0ms	165ms	98.9%	Apache-2.0
pdfminer	16.8ms	124ms	98.8%	MIT
pdfplumber	23.2ms	189ms	98.8%	MIT
markitdown	108.8ms	378ms	98.6%	MIT
pypdf	12.1ms	97ms	98.4%	BSD-3
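
The headline speedups quoted earlier follow from the Mean column. As a worked check, the snippet below divides each library's mean by PDF Oxide's 0.8ms. The values are copied from the table and expressed in whole microseconds so the division is exact; the variable names are illustrative.

```python
# Mean time per document in microseconds, copied from the table above.
mean_us = {
    "PDF Oxide": 800,
    "PyMuPDF": 4_600,
    "pypdf": 12_100,
    "pdfplumber": 23_200,
}

baseline = mean_us["PDF Oxide"]
# Whole-number speedup factor (rounded down) for each competitor.
speedup = {
    name: value // baseline
    for name, value in mean_us.items()
    if name != "PDF Oxide"
}
print(speedup)  # {'PyMuPDF': 5, 'pypdf': 15, 'pdfplumber': 29}
```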

Rust Libraries

Library	Mean	p99	Pass Rate	Text Extraction
PDF Oxide	0.8ms	9ms	100%	Built-in
oxidize_pdf	13.5ms	11ms	99.1%	Basic
unpdf	2.8ms	10ms	95.1%	Basic
pdf_extract	4.08ms	37ms	91.5%	Basic
lopdf	0.3ms	2ms	80.2%	No built-in extraction

Text Quality

99.5% text parity with PyMuPDF, pypdfium2, and kreuzberg across the full corpus. Against each of these competitors, the "hard" files where PDF Oxide extracts text and the competitor does not outnumber the files it misses by 7–10×.

Corpus

Suite	PDFs	Pass Rate
veraPDF (PDF/A compliance)	2,907	100%
Mozilla pdf.js	897	99.2%
SafeDocs (targeted edge cases)	26	100%
Total	3,830	100%

The 100% pass rate covers all valid PDFs. The 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
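
The relationship between the 7 non-passing files and the rates in the table can be checked with a little arithmetic. The sketch below assumes all 7 fall in the Mozilla pdf.js suite, which is the only suite listed below 100%; that attribution is an inference from the table, not something stated explicitly.

```python
total_pdfs = 3_830
pdfjs_pdfs = 897
non_passing = 7  # intentionally broken fixtures, per the note above

# Pass rate of the pdf.js suite if all 7 failures belong to it (assumption).
pdfjs_rate = 100 * (pdfjs_pdfs - non_passing) / pdfjs_pdfs
# Pass rate over the raw corpus, broken fixtures included.
raw_rate = 100 * (total_pdfs - non_passing) / total_pdfs

print(f"pdf.js: {pdfjs_rate:.1f}%")    # pdf.js: 99.2%
print(f"raw corpus: {raw_rate:.1f}%")  # raw corpus: 99.8%
```

Setting the 7 broken fixtures aside leaves 3,823 valid PDFs, all of which pass.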

Features

Extract	Create	Edit
Text & Layout	Documents	Annotations
Images	Tables	Form Fields
Forms	Graphics	Bookmarks
Annotations	Templates	Links
Bookmarks	Images	Content

Python API

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count}")
print(f"Version: {doc.version}")

# Extract text from each page
for i in range(doc.page_count):
    text = doc.extract_text(i)
    print(f"Page {i}: {len(text)} chars")

# Character-level extraction with positions
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f})")

# Password-protected PDFs
doc = PdfDocument("encrypted.pdf")
doc.authenticate("password")
text = doc.extract_text(0)
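
Character positions make it possible to rebuild lines of text by hand. The sketch below groups characters whose `y` values are close together and orders each group by `x`. It uses only the `char`, `x`, and `y` attributes shown above. The `group_into_lines` helper, the 2.0-unit tolerance, and the assumption that a larger `y` means higher on the page (the usual PDF convention) are illustrative choices, not documented behavior of pdf_oxide.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Stand-in with the same attributes as the items from extract_chars()."""
    char: str
    x: float
    y: float


def group_into_lines(chars, y_tolerance=2.0):
    """Group characters into text lines by vertical position.

    A character joins the current line when its y is within `y_tolerance`
    of the first character placed on that line. Lines are returned top to
    bottom, assuming y grows upward.
    """
    lines = []  # list of (anchor_y, [chars on that line])
    for ch in sorted(chars, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - ch.y) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append((ch.y, [ch]))
    return [
        "".join(c.char for c in sorted(group, key=lambda c: c.x))
        for _, group in lines
    ]


sample = [Char("i", 15.0, 700.2), Char("H", 10.0, 700.0),
          Char("k", 15.0, 688.0), Char("o", 10.0, 688.1)]
print(group_into_lines(sample))  # ['Hi', 'ok']
```

With a real document, `group_into_lines(doc.extract_chars(0))` would be the equivalent call.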

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Installation

Python

pip install pdf_oxide

Wheels are available for Linux, macOS, and Windows, for Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

Documentation

Getting Started (Rust) - Complete Rust guide
Getting Started (Python) - Complete Python guide
API Docs - Full Rust API reference
Full Documentation - Complete documentation site
Performance Benchmarks - Full benchmark methodology and results

Use Cases

RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
Data extraction — Pull structured data from forms, tables, and layouts
Academic research — Parse papers, extract citations, and process large corpora
PDF generation — Create invoices, reports, certificates, and templated documents programmatically
PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
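
For the RAG use case, a common next step after Markdown conversion is to split the text into heading-delimited chunks before embedding. A minimal sketch follows, assuming the Markdown uses `#`-style (ATX) headings; whether `to_markdown` emits that style is not stated here, and `split_by_headings` is a hypothetical helper rather than part of pdf_oxide.

```python
def split_by_headings(markdown):
    """Split Markdown into chunks, starting a new chunk at each ATX heading.

    Returns a list of (heading, body) tuples. Text before the first heading
    is returned with an empty heading. Lines starting with '#' inside code
    fences are not special-cased in this sketch.
    """
    chunks = []
    heading, body = "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Flush the chunk collected so far, unless it is empty.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


text = "# Intro\nPDFs are everywhere.\n\n## Method\nWe parse them."
print(split_by_headings(text))
# [('Intro', 'PDFs are everywhere.'), ('Method', 'We parse them.')]
```

Each `(heading, body)` pair can then be passed to whichever embedding or indexing step the pipeline uses.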

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust and Python},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}
