Thanks to visit codestin.com
Credit goes to github.com

Skip to content

yicone/universal-doc-extractor

Repository files navigation

Universal Document Extractor

This is a generic document information extraction tool based on AI Vision LLMs (Default: Qwen3-VL). Through a simple YAML configuration file, it can extract structured data from any type of image or PDF document (such as invoices, receipts, resumes, contracts) and generate Excel reports.

✨ Core Features

  • Generic: Extract data from any document by modifying the configuration.
  • Configuration-Driven: Zero-code modification; define extraction fields and prompts by simply editing the yaml file.
  • Smart Extraction: Uses Qwen3-VL's powerful visual understanding capabilities to automatically identify and extract information.
  • Automatic Summary: Supports automatic calculation of summation for amount columns.
  • Incremental Update: Automatically skips processed files to prevent duplicates.
  • Source File Links: Excel tables contain clickable links pointing to the original files.

🚀 Quick Start

1. Preparation

Ensure Python 3 is installed.

For the backend, you need an OpenAI-compatible API with Vision/OCR capabilities.

  • Recommended: LM Studio running qwen/qwen3-vl-8b (Local, Free).
  • Alternative: OpenAI GPT-4o, Claude 3.5 Sonnet, or other Vision models. (Note: Extraction accuracy depends on the model's capabilities and how well the prompt_template matches it.)
pip install -r requirements.txt
pip install PyYAML

2. Configuration (Config)

The project includes a default config/example.yaml for extracting Chinese VAT invoices. Key configuration items:

# config/example.yaml
extraction:
  fields:
    - name: "invoice_number"
      description: "发票号码 (Invoice Number)"
      column_header: "No."
      primary_key: true  # Used for deduplication

    - name: "amount"
      description: "金额总计 (Total Amount)"
      type: "currency"   # Automatically converts to numeric format
      sum_in_footer: true

  output:
    extra_columns:
      - header: "Source File"
        type: "file_link" # Creates a clickable local link
        width: 30
      - header: "Remarks" # Creates an empty column for manual notes
        width: 20

3. Run Extraction

Run the tool using doc_extractor.py and specify your configuration file:

# Extract files using example.yaml
python3 doc_extractor.py --config config/example.yaml

# Force reprocess all files
python3 doc_extractor.py --config config/example.yaml --force

4. Common Commands Reference

Goal Command
Normal Run python3 doc_extractor.py --config config/example.yaml
Force Rerun python3 doc_extractor.py --config config/example.yaml --force
Test First 5 python3 doc_extractor.py --config config/example.yaml --limit 5
Specific Config python3 doc_extractor.py --config my_config.yaml

🛠️ Helper Tools

Batch Downloader (scripts/batch_download.py)

A generic utility script to batch download files from a list of URLs (one URL per line). If run without arguments, it reads file_urls.txt and downloads to output/ by default.

# Syntax
python3 scripts/batch_download.py [--file <url_list_file>] [--output <download_dir>]

# Example
python3 scripts/batch_download.py --file file_urls.txt --output output/files/

📂 Project Structure

  • doc_extractor.py: Main Program, generic CLI entry point
  • doc_extractor_core.py: Core extraction engine
  • config_manager.py: Configuration loader
  • config/example.yaml: File extraction configuration file (Default)
  • output/: Distribution Bundle (Contains results and source files)
    • examples/: Directory for PDF/Image files (Place your files here)
    • example_output.xlsx: Generated result table (Portable)

📝 Configuration Details

You can copy config/example.yaml to create a new configuration, such as config/resume.yaml to extract resume information:

extraction:
  prompt_template: "Please extract the following information from the resume..."
  fields:
    - name: "name"
      description: "Candidate Name"
      column_header: "Name"
    - name: "email"
      description: "Email Address"
      column_header: "Contact"
    - name: "education"
      description: "Highest Education"
      column_header: "Education"

Run: python3 doc_extractor.py --config config/resume.yaml


Note: The old config .env.example, scripts invoice_ocr.py and generate_invoice_table.py have been replaced by the new generic architecture and are kept for reference only.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages