Universal Document Extractor

This is a generic document information extraction tool based on AI Vision LLMs (Default: Qwen3-VL). Through a simple YAML configuration file, it can extract structured data from any type of image or PDF document (such as invoices, receipts, resumes, contracts) and generate Excel reports.

✨ Core Features

Generic: Extract data from any document by modifying the configuration.
Configuration-Driven: Zero-code modification; define extraction fields and prompts by simply editing the yaml file.
Smart Extraction: Uses Qwen3-VL's powerful visual understanding capabilities to automatically identify and extract information.
Automatic Summary: Supports automatic calculation of summation for amount columns.
Incremental Update: Automatically skips processed files to prevent duplicates.
Source File Links: Excel tables contain clickable links pointing to the original files.

🚀 Quick Start

1. Preparation

Ensure Python 3 is installed.

For the backend, you need an OpenAI-compatible API with Vision/OCR capabilities.

Recommended: LM Studio running qwen/qwen3-vl-8b (Local, Free).
Alternative: OpenAI GPT-4o, Claude 3.5 Sonnet, or other Vision models. (Note: Extraction accuracy depends on the model's capabilities and how well the prompt_template matches it.)

pip install -r requirements.txt
pip install PyYAML

2. Configuration (Config)

The project includes a default config/example.yaml for extracting Chinese VAT invoices. Key configuration items:

# config/example.yaml
extraction:
  fields:
    - name: "invoice_number"
      description: "发票号码 (Invoice Number)"
      column_header: "No."
      primary_key: true  # Used for deduplication

    - name: "amount"
      description: "金额总计 (Total Amount)"
      type: "currency"   # Automatically converts to numeric format
      sum_in_footer: true

  output:
    extra_columns:
      - header: "Source File"
        type: "file_link" # Creates a clickable local link
        width: 30
      - header: "Remarks" # Creates an empty column for manual notes
        width: 20

3. Run Extraction

Run the tool using doc_extractor.py and specify your configuration file:

# Extract files using example.yaml
python3 doc_extractor.py --config config/example.yaml

# Force reprocess all files
python3 doc_extractor.py --config config/example.yaml --force

4. Common Commands Reference

Goal	Command
Normal Run	`python3 doc_extractor.py --config config/example.yaml`
Force Rerun	`python3 doc_extractor.py --config config/example.yaml --force`
Test First 5	`python3 doc_extractor.py --config config/example.yaml --limit 5`
Specific Config	`python3 doc_extractor.py --config my_config.yaml`

🛠️ Helper Tools

Batch Downloader (`scripts/batch_download.py`)

A generic utility script to batch download files from a list of URLs (one URL per line). If run without arguments, it reads file_urls.txt and downloads to output/ by default.

# Syntax
python3 scripts/batch_download.py [--file <url_list_file>] [--output <download_dir>]

# Example
python3 scripts/batch_download.py --file file_urls.txt --output output/files/

📂 Project Structure

doc_extractor.py: Main Program, generic CLI entry point
doc_extractor_core.py: Core extraction engine
config_manager.py: Configuration loader
config/example.yaml: File extraction configuration file (Default)
output/: Distribution Bundle (Contains results and source files)
- examples/: Directory for PDF/Image files (Place your files here)
- example_output.xlsx: Generated result table (Portable)

📝 Configuration Details

You can copy config/example.yaml to create a new configuration, such as config/resume.yaml to extract resume information:

extraction:
  prompt_template: "Please extract the following information from the resume..."
  fields:
    - name: "name"
      description: "Candidate Name"
      column_header: "Name"
    - name: "email"
      description: "Email Address"
      column_header: "Contact"
    - name: "education"
      description: "Highest Education"
      column_header: "Education"

Run: python3 doc_extractor.py --config config/resume.yaml

Note: The old config .env.example, scripts invoice_ocr.py and generate_invoice_table.py have been replaced by the new generic architecture and are kept for reference only.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Universal Document Extractor

✨ Core Features

🚀 Quick Start

1. Preparation

2. Configuration (Config)

3. Run Extraction

4. Common Commands Reference

🛠️ Helper Tools

Batch Downloader (`scripts/batch_download.py`)

📂 Project Structure

📝 Configuration Details

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
config		config
output/examples		output/examples
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
config_manager.py		config_manager.py
doc_extractor.py		doc_extractor.py
doc_extractor_core.py		doc_extractor_core.py
file_urls.txt		file_urls.txt
generate_invoice_table.py		generate_invoice_table.py
invoice_ocr.py		invoice_ocr.py
requirements.txt		requirements.txt

yicone/universal-doc-extractor

Folders and files

Latest commit

History

Repository files navigation

Universal Document Extractor

✨ Core Features

🚀 Quick Start

1. Preparation

2. Configuration (Config)

3. Run Extraction

4. Common Commands Reference

🛠️ Helper Tools

Batch Downloader (scripts/batch_download.py)

📂 Project Structure

📝 Configuration Details

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Batch Downloader (`scripts/batch_download.py`)

Packages