PDF to Markdown Converter - PDFstract

A beautiful web application built with FastAPI and HTML that converts PDF files to Markdown format using various conversion libraries.

Features

🚀 Multiple Conversion Libraries: Support for pymupdf4llm, markitdown, marker, and docling
📱 Modern UI: Beautiful, responsive design with drag-and-drop file upload
⚡ Fast Processing: Efficient backend processing with real-time feedback
🔄 Library Status: Dynamic checking of available libraries
📄 Preview Results: View converted Markdown content directly in the browser

Libraries Supported

pymupdf4llm (>=0.0.26) - Fast PDF text extraction with PyMuPDF
markitdown (>=0.1.2) - Microsoft's document conversion tool
marker (>=1.8.1, installed as the marker-pdf package) - Advanced PDF to Markdown conversion with ML
docling (>=2.41.0) - IBM's document intelligence platform

Installation

Prerequisites

Python 3.8 or higher
uv (a fast Python package and project manager)

Setup

Clone or download the project files
Install dependencies:
```
uv sync
```
Or if you don't have a virtual environment:
```
uv pip install -r requirements.txt
```
Note: Some libraries may require additional system dependencies:
- For marker: May require additional ML dependencies
- For docling: May require specific Python versions and dependencies
- For pymupdf4llm: Should work out of the box
- For markitdown: May require additional dependencies for certain file types

Verify installation (optional):

```
uv run python -c "import fastapi; print('FastAPI installed successfully')"
```
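The FastAPI check above can be extended to the four conversion libraries. The following is a minimal sketch, not part of the project: the check_libraries helper is hypothetical, and it assumes the import names pymupdf4llm, markitdown, marker (from the marker-pdf package), and docling. It only tests whether each module can be located, so a library that is found may still fail at import time if one of its own dependencies is missing.

```python
import importlib.util

# Import names of the supported libraries (assumed; marker-pdf is imported as "marker").
MODULES = ["pymupdf4llm", "markitdown", "marker", "docling"]


def check_libraries(modules=MODULES):
    """Map each module name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_libraries().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Saved under a name of your choice (for example check_libs.py, a hypothetical filename), it can be run with uv run python check_libs.py.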

Running the Application

Start the server:
```
uv run python main.py
```
Or alternatively:
```
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Open your browser and navigate to:
```
http://localhost:8000
```
Use the application:
- Select a conversion library from the dropdown
- Upload a PDF file (drag-and-drop or click to select)
- Click "Convert to Markdown"
- View the converted result

Running with Docker

This project includes a Dockerfile and docker-compose.yml for easy containerization.

Build and run the container:
```
docker-compose up --build
```
Open your browser and navigate to:
```
http://localhost:8000
```

The docker-compose command above starts the application inside a Docker container, accessible on port 8000.
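A container can take a while to start accepting requests, so a script that talks to the API may want to wait for the health check first. The sketch below polls GET /health using only the standard library. The helper names is_healthy and wait_until_healthy are hypothetical, and treating an HTTP 200 response as "healthy" is an assumption about the endpoint.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8000") -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(probe=is_healthy, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_healthy() else "server did not become healthy")
```

The probe is passed in as a parameter so the retry loop can be exercised without a running server.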

API Endpoints

GET / - Main web interface
GET /health - Health check endpoint
GET /libraries - Get available conversion libraries
POST /convert - Convert PDF to Markdown

API Usage Example

```
# Check available libraries
curl http://localhost:8000/libraries

# Convert a PDF file
# (document.pdf is a placeholder path; "file" is the assumed upload field name)
curl -X POST \
  -F "file=@document.pdf" \
  -F "library=pymupdf4llm" \
  http://localhost:8000/convert
```
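The same two requests can be made from Python. This is a rough sketch using only the standard library, under two assumptions that are not confirmed here: the upload form field is named file, and /libraries returns JSON. The response from /convert is returned as raw text because its exact shape is not documented in this README. All function names are hypothetical.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # default address used throughout this README


def list_libraries(base_url: str = BASE_URL):
    """GET /libraries and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(f"{base_url}/libraries") as resp:
        return json.load(resp)


def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(header.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(path: str, library: str = "pymupdf4llm",
                base_url: str = BASE_URL, file_field: str = "file") -> str:
    """POST a PDF to /convert; file_field="file" is an assumption."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"library": library}, file_field, os.path.basename(path), fh.read()
        )
    req = urllib.request.Request(
        f"{base_url}/convert", data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the server running, list_libraries() followed by convert_pdf("document.pdf") mirrors the curl commands; if the server rejects the upload, the field name is the first thing to check against main.py.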

Troubleshooting

Library Installation Issues

If you encounter issues with specific libraries:

pymupdf4llm: Usually installs without issues
```
uv add pymupdf4llm
```
markitdown: May need Microsoft Build Tools on Windows
```
uv add markitdown
```
marker: Requires additional ML dependencies
```
uv add marker-pdf
```
docling: May have specific version requirements
```
uv add docling
```

Common Issues

"Library not available": The library failed to import. Check the installation.
"Conversion failed": The selected library couldn't process your PDF. Try a different library.
Large file timeout: Some libraries may take longer for large files.
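The advice to try a different library can be automated. The sketch below is illustrative only and does not reflect how main.py works: convert_with_fallback is a hypothetical helper that takes a dictionary of converter callables (each accepting a PDF path and returning Markdown text) and tries them in the given order, collecting the failure reason for each.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_with_fallback(
    pdf_path: str,
    converters: Dict[str, Callable[[str], str]],
    order: Iterable[str],
) -> Tuple[str, str]:
    """Try each library in `order`; return (library_name, markdown) for the first success."""
    errors = {}
    for name in order:
        func = converters.get(name)
        if func is None:
            # Mirrors the "Library not available" case: nothing registered under this name.
            errors[name] = "Library not available"
            continue
        try:
            return name, func(pdf_path)
        except Exception as exc:  # any converter error counts as "Conversion failed"
            errors[name] = f"Conversion failed: {exc}"
    raise RuntimeError(f"All libraries failed: {errors}")
```

A reasonable order is fastest first, for example pymupdf4llm, then markitdown, then the slower ML-based options.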

Performance Tips

pymupdf4llm: Fastest for simple text extraction
markitdown: Good balance of speed and quality
marker: Best quality but slower, especially on first run
docling: Advanced features but may be slower

Development

Project Structure

```
pdftomd-ui/
├── main.py              # FastAPI application
├── requirements.txt     # Python dependencies
├── templates/
│   └── index.html       # Web interface
├── uploads/             # Temporary upload directory (auto-created)
└── README.md            # This file
```

Adding New Libraries

To add support for additional conversion libraries:

Add the library to requirements.txt
Import it in main.py with try/except
Add it to the get_available_libraries() function
Create a conversion function following the existing pattern
Add it to the conversion logic in /convert endpoint
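The steps above can be sketched as follows. This is a simplified illustration, not the actual contents of main.py: the CONVERTERS registry, the convert helper, and the return shape of get_available_libraries() are assumptions. The call pymupdf4llm.to_markdown(path) is that library's documented entry point; a new library would get its own guarded import, converter function, and registry entry in the same way.

```python
from typing import Callable, Dict, List, Tuple

# Step 2: guarded import, so a missing library does not stop the app from starting.
try:
    import pymupdf4llm
    PYMUPDF4LLM_AVAILABLE = True
except ImportError:
    PYMUPDF4LLM_AVAILABLE = False


# Step 4: one conversion function per library, all with the same signature.
def convert_with_pymupdf4llm(pdf_path: str) -> str:
    """Convert one PDF file to Markdown text."""
    return pymupdf4llm.to_markdown(pdf_path)


# Hypothetical registry: library name -> (import succeeded, converter function).
CONVERTERS: Dict[str, Tuple[bool, Callable[[str], str]]] = {
    "pymupdf4llm": (PYMUPDF4LLM_AVAILABLE, convert_with_pymupdf4llm),
}


# Step 3: report which libraries imported successfully (return shape is assumed).
def get_available_libraries() -> List[dict]:
    return [{"name": name, "available": ok} for name, (ok, _) in CONVERTERS.items()]


# Step 5: dispatch used by the conversion logic.
def convert(pdf_path: str, library: str) -> str:
    if library not in CONVERTERS:
        raise ValueError(f"Unknown library: {library}")
    available, func = CONVERTERS[library]
    if not available:
        raise RuntimeError(f"Library not available: {library}")
    return func(pdf_path)
```

Keeping every converter behind the same path-in, Markdown-out signature means the endpoint only needs the library name to choose one.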

License

This project is provided as-is for educational and development purposes.

Contributing

Feel free to submit issues and enhancement requests!


AKSarav/pdfstract
