
# 📄 Local LLM PDF OCR


Transform scanned and handwritten documents into fully searchable, selectable PDFs using the power of local LLM vision.

Local LLM PDF OCR is a next-generation OCR tool that moves beyond traditional Tesseract-based scanning. By leveraging Vision Language Models (VLMs) like olmOCR running locally on your machine, it "reads" documents with human-like understanding while keeping 100% of your data private.


## ✨ Features

- 🧠 AI-Powered Vision: Uses advanced VLMs to transcribe text with high accuracy, even on complex layouts or noisy scans.
- 🤝 Hybrid Alignment Strategy: Combines Surya OCR detection for precise bounding boxes with a local LLM for accurate text content via position-based alignment.
- ⚡ 10-21x Faster Detection: Uses detection-only mode (skipping slow recognition) and batch processing for maximum speed.
- 🔒 100% Local & Private: No cloud APIs, no subscription fees. Runs entirely offline using LM Studio.
- 🔍 Searchable Outputs: Embeds an invisible text layer directly into your PDF, making it searchable (Ctrl+F) and selectable in standard PDF readers.
- 🖥️ Dual Interfaces:
  - Web UI: A modern interface with drag & drop, dark mode, and real-time progress tracking.
  - CLI: A robust command-line tool for power users and batch automation, featuring a lively terminal UI.
- ⚡ Real-time Feedback: Watch your document process page-by-page over live WebSockets or animated terminal bars.

πŸ—οΈ Architecture

```mermaid
graph TD
    A[Input PDF] --> B[PDF to Image Conversion]
    B --> C[Batch Processing]

    subgraph "Phase 1: Layout Detection (Surya)"
        C --> D[Surya DetectionPredictor]
        D --> E[Bounding Boxes]
        E --> F[Sorted by Reading Order]
    end

    subgraph "Phase 2: Text Extraction (Local LLM)"
        C --> G[OlmOCR Vision Model]
        G --> H[Pure Text Content]
    end

    F --> I[Position-Based Aligner]
    H --> I

    I -->|Distribute by Box Width| J[Aligned Text Blocks]
    J --> K[Sandwich PDF Generator]
    K --> L[Searchable PDF Output]
```

### How It Works

1. Batch Layout Detection: Surya's DetectionPredictor processes all pages at once, extracting bounding boxes without slow text recognition (~1s total vs ~20s per page with recognition).
2. LLM Text Extraction: A local vision model (olmOCR) reads each page with human-like understanding, handling handwriting and complex layouts.
3. Position-Based Alignment: The aligner distributes LLM text across detected boxes proportionally by box width in reading order; no fuzzy matching needed.
4. Sandwich PDF: The original page is rendered as an image with invisible, searchable text overlaid using PyMuPDF.
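The position-based alignment in step 3 can be sketched in a few lines of Python. This is a simplified illustration of the distribute-by-width idea, not the project's actual aligner (which lives in `src/pdf_ocr/core/aligner.py`):

```python
def align_text_to_boxes(words: list[str], box_widths: list[float]) -> list[str]:
    """Distribute words across boxes proportionally to each box's width.

    Boxes are assumed to be pre-sorted in reading order; `words` is the
    LLM transcription split on whitespace. (Illustrative sketch only.)
    """
    total_width = sum(box_widths)
    blocks, cursor = [], 0
    for i, width in enumerate(box_widths):
        if i == len(box_widths) - 1:
            take = len(words) - cursor  # last box absorbs the remainder
        else:
            take = round(len(words) * width / total_width)
        blocks.append(" ".join(words[cursor:cursor + take]))
        cursor += take
    return blocks

# Two boxes, the first twice as wide: it receives ~2/3 of the words.
print(align_text_to_boxes("a b c d e f".split(), [200.0, 100.0]))
# → ['a b c d', 'e f']
```

Because the split is purely proportional, no fuzzy string matching between the detection pass and the LLM output is required.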


## 🚀 Getting Started

### Prerequisites

1. Python 3.10+
2. LM Studio: Download and install LM Studio.
   - Load a vision model (highly recommended: allenai/olmocr-2-7b).
   - Start the local server on the default port (1234).

### Configuration

Create a .env file in the root directory to configure your Local LLM:

```env
LLM_API_BASE=http://localhost:1234/v1
LLM_MODEL=allenai/olmocr-2-7b
```
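Roughly, these two settings tell an OpenAI-compatible client where to connect and which model to request. The helper below is a hypothetical sketch of that pattern (not the project's actual code), falling back to the documented defaults:

```python
import os

# Hypothetical helper (not from the project): read the LLM settings,
# falling back to the documented defaults (LM Studio on port 1234).
def load_llm_config() -> dict:
    return {
        "api_base": os.environ.get("LLM_API_BASE", "http://localhost:1234/v1"),
        "model": os.environ.get("LLM_MODEL", "allenai/olmocr-2-7b"),
    }

# Any OpenAI-compatible client can then be constructed from these values,
# e.g. base_url=config["api_base"] with model=config["model"].
config = load_llm_config()
```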

### Installation

This project is managed with uv for lightning-fast dependency management.

1. Install uv (if not installed):

   ```sh
   pip install uv
   ```

2. Clone the repository:

   ```sh
   git clone https://github.com/ahnafnafee/pdf-ocr-llm.git
   cd pdf-ocr-llm
   ```

3. Sync dependencies:

   ```sh
   uv sync
   ```

### Usage

#### 1. 🌐 Web Interface (Recommended)

The easiest way to use the tool. Features a modern dashboard with Dark Mode and Text Preview.

1. Start the server:

   ```sh
   uv run uvicorn server:app --reload --port 8000
   ```

2. Open your browser to http://localhost:8000.
3. Drag & drop your PDF.
4. Watch the magic happen! ✨
   - Real-time progress: Track per-page OCR status.
   - Preview: Click "View Text" to inspect the raw AI extraction.
   - Dark mode: Toggle the moon icon for a sleek dark theme.
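Under the hood, per-page progress like this typically travels as small JSON messages over the WebSocket. The message shape below is invented purely for illustration; the real schema in `server.py` may differ:

```python
import json

# Hypothetical progress payload; the field names are illustrative only,
# not the actual schema used by server.py.
def progress_message(page: int, total: int, status: str) -> str:
    """Serialize a per-page OCR status update for a WebSocket stream."""
    return json.dumps({
        "page": page,
        "total": total,
        "status": status,  # e.g. "detecting", "ocr", "done"
        "percent": round(100 * page / total),
    })

print(progress_message(3, 10, "ocr"))
# → {"page": 3, "total": 10, "status": "ocr", "percent": 30}
```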

#### 2. 💻 Command Line Interface (CLI)

Perfect for developers or integrating into scripts.

Run the OCR tool on any PDF:

```sh
uv run main.py input.pdf output_ocr.pdf
```

Options:

| Option | Description |
| --- | --- |
| `input_pdf` | Path to input PDF (required) |
| `output_pdf` | Path to output PDF (optional; defaults to `<input>_ocr.pdf`) |
| `-v, --verbose` | Enable debug logging (alignment details, box counts) |
| `-q, --quiet` | Suppress all output except errors |
| `--dpi <int>` | DPI for image rendering (default: 200) |
| `--pages <range>` | Page range to process, e.g. `1-3,5` (default: all) |
| `--api-base <url>` | Override LLM API base URL |
| `--model <name>` | Override LLM model name |
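For reference, the `--pages` syntax (`1-3,5`) is a comma-separated list of single pages and inclusive ranges. A minimal parser for that format, written here as a hypothetical helper rather than the project's actual implementation:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page-range spec like "1-3,5" into [1, 2, 3, 5].

    Hypothetical sketch: ranges are inclusive, parts are comma-separated.
    """
    pages: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(parse_pages("1-3,5"))  # → [1, 2, 3, 5]
```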

Examples:

```sh
# Basic usage (auto-generates input_ocr.pdf)
uv run main.py scan.pdf

# Process specific pages with higher quality
uv run main.py document.pdf output.pdf --pages 1-5 --dpi 300

# Use a different model with verbose output
uv run main.py report.pdf --model "custom-model" --verbose
```

You'll see beautiful animated progress bars showing batch detection and per-page LLM processing.


## 📁 Project Structure

```
local-llm-pdf-ocr/
├── src/pdf_ocr/           # Core package
│   ├── core/              # OCR processing modules
│   │   ├── aligner.py     # Hybrid text alignment
│   │   ├── ocr.py         # LLM OCR processor
│   │   └── pdf.py         # PDF handling utilities
│   └── utils/             # Utility modules
│       └── tqdm_patch.py  # Progress bar silencer
├── scripts/               # Debug and visualization tools
├── static/                # Web UI assets
├── examples/              # Sample PDFs
├── main.py                # CLI entry point
└── server.py              # Web server
```

## 🛠️ Tech Stack

- Backend: FastAPI (async web framework)
- Frontend: Vanilla JS + CSS variables
- PDF Processing: PyMuPDF (fitz)
- Layout Detection: Surya OCR (detection-only mode)
- AI Integration: OpenAI client (compatible with local LLM servers)
- CLI UI: Rich (terminal formatting)

## ⚡ Performance

| Document Type | Detection Time | Speedup vs Recognition |
| --- | --- | --- |
| Digital PDF | ~1s | 21x faster |
| Handwritten | ~1s | 10x faster |
| Hybrid Form | ~1s | 11x faster |

Detection uses batch processing: all pages in one call.


## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License: MIT
