The fastest Python PDF library for text extraction, image extraction, and Markdown conversion. Built on a Rust core: 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on the valid PDFs in a 3,830-file real-world corpus. Dual-licensed under MIT or Apache-2.0.
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

pip install pdf_oxide

use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

[dependencies]
pdf_oxide = "0.3"

- Fast: 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
- Reliable: 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
- Complete: Text extraction, image extraction, PDF creation, and editing in one library
- Dual-language: First-class Rust API and Python bindings via PyO3
- Permissive license: MIT / Apache-2.0, free to use in commercial and open-source projects
Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). 18 libraries tested, single-thread, 60s timeout, no warm-up.
Python libraries:

| Library | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT |
| unstructured | 478.4ms | 1,477ms | 99.6% | Apache-2.0 |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| kreuzberg | 7.2ms | 49ms | 99.1% | MIT |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| extractous | 112.0ms | 165ms | 98.9% | Apache-2.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |
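
The headline multipliers quoted earlier (5×, 15×, 29×) appear to be ratios of the mean times in the table above, truncated to whole numbers. The short check below recomputes them from the Mean column; the dictionary and variable names are illustrative and not part of pdf_oxide.

```python
# Mean extraction time per document in milliseconds, copied from the table above.
mean_ms = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}

baseline = mean_ms["PDF Oxide"]

# Ratio of each library's mean time to the PDF Oxide mean time.
speedup = {
    name: ms / baseline
    for name, ms in mean_ms.items()
    if name != "PDF Oxide"
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x the PDF Oxide mean time")
```

The ratios land just above 5, 15, and 29 respectively, which matches the rounded-down figures used in the summary.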
Rust libraries:

| Library | Mean | p99 | Pass Rate | Text Extraction |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |
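
For readers reproducing numbers like those in the tables, the sketch below shows one way to derive mean, p99, and pass rate from per-document timings. It is an assumption about the calculation, not the project's actual benchmark harness: `summarize` is a hypothetical helper, p99 uses the nearest-rank definition, and failed or timed-out documents are assumed to be excluded from the timing statistics.

```python
import statistics


def summarize(timings_ms, failures=0):
    """Summarize per-document timings (hypothetical helper, not pdf_oxide API).

    timings_ms: durations in milliseconds for documents that passed.
    failures:   number of documents that errored or hit the timeout.
    """
    if not timings_ms:
        raise ValueError("need at least one successful timing")
    ordered = sorted(timings_ms)
    n = len(ordered)
    # Nearest-rank percentile: the smallest value with at least 99% of the
    # samples at or below it. Integer arithmetic avoids float rounding.
    rank = (99 * n + 99) // 100
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[rank - 1],
        "pass_rate": n / (n + failures),
    }


# 99 fast documents and one slow outlier, plus one failed document.
stats = summarize([1.0] * 99 + [101.0], failures=1)
print(stats)
```

Under this definition a single outlier beyond the 99th percentile can lift the mean above p99 (here the mean is 2.0 ms and p99 is 1.0 ms), which is one possible reason a table row can show a mean larger than its p99.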
Text parity is 99.5% against PyMuPDF, pypdfium2, and kreuzberg across the full corpus. On the "hard" files where results differ, PDF Oxide recovers text that a competitor misses 7–10× as often as the reverse, for every competitor tested.
| Suite | PDFs | Pass Rate |
|---|---|---|
| veraPDF (PDF/A compliance) | 2,907 | 100% |
| Mozilla pdf.js | 897 | 99.2% |
| SafeDocs (targeted edge cases) | 26 | 100% |
| Total | 3,830 | 100% |
The 100% total counts valid PDFs only: the 7 non-passing files in the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
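
The suite totals and the two ways of stating the pass rate can be checked with a few lines of arithmetic. The sketch assumes that all 7 non-passing files belong to the Mozilla pdf.js suite, which is inferred from the other two suites being listed at 100%; the variable names are illustrative.

```python
# PDFs per suite, copied from the table above.
suite_sizes = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}

# Assumption: every non-passing file is in the pdf.js suite.
non_passing = {"veraPDF": 0, "Mozilla pdf.js": 7, "SafeDocs": 0}

total = sum(suite_sizes.values())
failed = sum(non_passing.values())


def pct(passed, count):
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / count, 1)


pdfjs_size = suite_sizes["Mozilla pdf.js"]
pdfjs_rate = pct(pdfjs_size - non_passing["Mozilla pdf.js"], pdfjs_size)
raw_rate = pct(total - failed, total)             # broken fixtures count as failures
valid_rate = pct(total - failed, total - failed)  # broken fixtures excluded

print(total, pdfjs_rate, raw_rate, valid_rate)
```

Counting the broken fixtures as failures gives a corpus-wide rate of 99.8%; excluding them gives the 100% shown in the Total row.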
| Extract | Create | Edit |
|---|---|---|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count}")
print(f"Version: {doc.version}")

# Extract text from each page
for i in range(doc.page_count):
    text = doc.extract_text(i)
    print(f"Page {i}: {len(text)} chars")

# Character-level extraction with positions
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f})")

# Password-protected PDFs
doc = PdfDocument("encrypted.pdf")
doc.authenticate("password")
text = doc.extract_text(0)

use pdf_oxide::PdfDocument;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;
    // Extract text
    let text = doc.extract_text(0)?;
    // Character-level extraction
    let chars = doc.extract_chars(0)?;
    // Extract images
    let images = doc.extract_images(0)?;
    // Vector graphics
    let paths = doc.extract_paths(0)?;
    Ok(())
}

pip install pdf_oxide

Wheels are available for Linux, macOS, and Windows, for Python 3.8–3.14.

[dependencies]
pdf_oxide = "0.3"

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release
# Run tests
cargo test
# Build Python bindings
maturin develop

- Getting Started (Rust) - Complete Rust guide
- Getting Started (Python) - Complete Python guide
- API Docs - Full Rust API reference
- Full Documentation - Complete documentation site
- Performance Benchmarks - Full benchmark methodology and results
- RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
- Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
- Data extraction — Pull structured data from forms, tables, and layouts
- Academic research — Parse papers, extract citations, and process large corpora
- PDF generation — Create invoices, reports, certificates, and templated documents programmatically
- PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
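
For the RAG use case above, a pipeline typically converts each page to Markdown and then splits the result into chunks for indexing. The sketch below uses only the calls shown earlier in this README (`PdfDocument`, `page_count`, and `to_markdown` with `detect_headings=True`) and assumes `to_markdown` returns a string. `pdf_to_markdown` and `split_by_heading` are hypothetical helpers, and splitting on heading lines is just one simple chunking strategy, not something pdf_oxide provides.

```python
def pdf_to_markdown(path):
    """Convert every page of a PDF to Markdown and join the pages."""
    # Imported here so the chunking helper below works without pdf_oxide installed.
    from pdf_oxide import PdfDocument

    doc = PdfDocument(path)
    pages = [
        doc.to_markdown(i, detect_headings=True)
        for i in range(doc.page_count)
    ]
    return "\n\n".join(pages)


def split_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A line starting with '#' opens a new chunk once one is in progress.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk]


sample = "# Intro\nFirst section.\n\n# Method\nSecond section."
for chunk in split_by_heading(sample):
    print(repr(chunk))
```

Each chunk could then be embedded and indexed with LangChain, LlamaIndex, or another framework, with `pdf_to_markdown("paper.pdf")` supplying the input in place of `sample`.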
Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.
We welcome contributions! See CONTRIBUTING.md for guidelines.
cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

@software{pdf_oxide,
title = {PDF Oxide: Fast PDF Toolkit for Rust and Python},
author = {Yury Fedoseev},
year = {2025},
url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than PyMuPDF | v0.3.8