This is a generic document information extraction tool based on AI Vision LLMs (Default: Qwen3-VL). Through a simple YAML configuration file, it can extract structured data from any type of image or PDF document (such as invoices, receipts, resumes, contracts) and generate Excel reports.
- Generic: Extract data from any document by modifying the configuration.
- Configuration-Driven: Zero-code modification; define extraction fields and prompts by simply editing the
yamlfile. - Smart Extraction: Uses Qwen3-VL's powerful visual understanding capabilities to automatically identify and extract information.
- Automatic Summary: Supports automatic calculation of summation for amount columns.
- Incremental Update: Automatically skips processed files to prevent duplicates.
- Source File Links: Excel tables contain clickable links pointing to the original files.
Ensure Python 3 is installed.
For the backend, you need an OpenAI-compatible API with Vision/OCR capabilities.
- Recommended: LM Studio running
qwen/qwen3-vl-8b(Local, Free). - Alternative: OpenAI GPT-4o, Claude 3.5 Sonnet, or other Vision models.
(Note: Extraction accuracy depends on the model's capabilities and how well the
prompt_templatematches it.)
pip install -r requirements.txt
pip install PyYAMLThe project includes a default config/example.yaml for extracting Chinese VAT invoices. Key configuration items:
# config/example.yaml
extraction:
fields:
- name: "invoice_number"
description: "发票号码 (Invoice Number)"
column_header: "No."
primary_key: true # Used for deduplication
- name: "amount"
description: "金额总计 (Total Amount)"
type: "currency" # Automatically converts to numeric format
sum_in_footer: true
output:
extra_columns:
- header: "Source File"
type: "file_link" # Creates a clickable local link
width: 30
- header: "Remarks" # Creates an empty column for manual notes
width: 20Run the tool using doc_extractor.py and specify your configuration file:
# Extract files using example.yaml
python3 doc_extractor.py --config config/example.yaml
# Force reprocess all files
python3 doc_extractor.py --config config/example.yaml --force| Goal | Command |
|---|---|
| Normal Run | python3 doc_extractor.py --config config/example.yaml |
| Force Rerun | python3 doc_extractor.py --config config/example.yaml --force |
| Test First 5 | python3 doc_extractor.py --config config/example.yaml --limit 5 |
| Specific Config | python3 doc_extractor.py --config my_config.yaml |
A generic utility script to batch download files from a list of URLs (one URL per line).
If run without arguments, it reads file_urls.txt and downloads to output/ by default.
# Syntax
python3 scripts/batch_download.py [--file <url_list_file>] [--output <download_dir>]
# Example
python3 scripts/batch_download.py --file file_urls.txt --output output/files/doc_extractor.py: Main Program, generic CLI entry pointdoc_extractor_core.py: Core extraction engineconfig_manager.py: Configuration loaderconfig/example.yaml: File extraction configuration file (Default)output/: Distribution Bundle (Contains results and source files)examples/: Directory for PDF/Image files (Place your files here)example_output.xlsx: Generated result table (Portable)
You can copy config/example.yaml to create a new configuration, such as config/resume.yaml to extract resume information:
extraction:
prompt_template: "Please extract the following information from the resume..."
fields:
- name: "name"
description: "Candidate Name"
column_header: "Name"
- name: "email"
description: "Email Address"
column_header: "Contact"
- name: "education"
description: "Highest Education"
column_header: "Education"Run: python3 doc_extractor.py --config config/resume.yaml
Note: The old config .env.example, scripts invoice_ocr.py and generate_invoice_table.py have been replaced by the new generic architecture and are kept for reference only.