A web application that uses AI to extract structured data from documents and learns from user feedback. Built on Google's LangExtract and Stanford's DSPy frameworks with an intuitive web interface.
- PDF Upload: Convert PDFs to text using OCR
- Text Input: Process plain text content directly
- Multiple Formats: Handle various document types (contracts, invoices, forms)
- Extract Mode: Find specific text spans from documents (names, dates, amounts)
- Generate Mode: Create interpreted content (summaries, classifications)
- Multiple Providers: Azure OpenAI, OpenAI, Google Gemini, Ollama
- Template System: Save and reuse extraction configurations
- Feedback Learning: Improve results through user feedback and DSPy optimization
- Template Management: Create and edit extraction schemas
- Real-time Processing: Live progress updates for document processing
- Result Export: Download results in HTML, JSON, or JSONL formats
- Python 3.8+
- API access to one of: Azure OpenAI, OpenAI, Google Gemini, or Ollama
- Clone and install

```bash
git clone https://github.com/LM-150A/docflash.git
cd docflash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Configure environment

```bash
cp .env.example .env
# Edit .env with your API credentials
```

- Run application

```bash
python start_fastapi.py
```

Access the application at http://localhost:5000
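Once the server is up, you can confirm it is reachable with a quick request to the root endpoint (which serves the main interface). A minimal smoke test, assuming the default port above:

```python
# Minimal smoke test: confirm the local docflash server answers on its default port.
import requests

resp = requests.get("http://localhost:5000/")  # GET / serves the main interface
print(resp.status_code)  # expect 200 once the app has started
```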
```bash
# Choose one provider
LLM_PROVIDER=azure_openai  # Options: azure_openai, openai, gemini, ollama

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4
AZURE_OPENAI_API_KEY=your-api-key

# OpenAI
OPENAI_API_KEY=your-openai-key

# Google Gemini
GOOGLE_API_KEY=your-google-api-key

# Ollama (local)
OLLAMA_MODEL_ID=gemma2:2b
OLLAMA_BASE_URL=http://localhost:11434

# Optional: Enable DSPy optimization
DSPY_ENABLED=true
```

Create extraction attributes specifying what information to extract:
| Attribute | Description | Mode |
|---|---|---|
| client_name | Name of the client | Extract |
| contract_value | Total contract amount | Extract |
| summary | Brief contract summary | Generate |
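A schema like the one above is essentially a list of attribute definitions. The exact format docflash stores is not shown here, so the field names in this sketch are illustrative only:

```python
# Illustrative schema definition; field names are assumptions, not docflash's exact format.
contract_schema = [
    {"name": "client_name",    "description": "Name of the client",     "mode": "extract"},
    {"name": "contract_value", "description": "Total contract amount",  "mode": "extract"},
    {"name": "summary",        "description": "Brief contract summary", "mode": "generate"},
]
```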
- Upload PDF files for OCR processing
- Copy/paste text content directly
- Provide multiple examples for better training
The system creates training examples based on your schema and sample documents.
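Conceptually, these map onto LangExtract-style few-shot examples. A minimal sketch using the public langextract data model (this illustrates the framework, not docflash's internal code):

```python
# Sketch of a LangExtract few-shot example built from the schema above
# (illustrates the langextract data model, not docflash internals).
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="This agreement is made with Acme Corp for a total value of $50,000.",
        extractions=[
            lx.data.Extraction(extraction_class="client_name", extraction_text="Acme Corp"),
            lx.data.Extraction(extraction_class="contract_value", extraction_text="$50,000"),
        ],
    )
]
```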
Upload new documents and run extraction with configurable settings (a request sketch follows this list):
- Number of extraction passes (1-3)
- Parallel processing workers (5-20)
- Temperature settings based on extraction modes
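For example, the extraction endpoint can be driven programmatically. Only the `/run_extraction` path comes from the API reference below; the JSON field names here are assumptions for illustration:

```python
# Hypothetical programmatic call to the extraction endpoint; the JSON field
# names are illustrative -- only the /run_extraction path comes from the API table.
import requests

payload = {
    "text": "This agreement is made with Acme Corp for a total value of $50,000.",
    "extraction_passes": 2,   # 1-3 passes
    "max_workers": 10,        # 5-20 parallel workers
}
resp = requests.post("http://localhost:5000/run_extraction", json=payload)
print(resp.json())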
- Rate generated examples to improve future results
- Use detailed feedback to guide AI improvements
- DSPy automatically optimizes prompts based on feedback
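Feedback can likewise be submitted through the API. Only the `/feedback/examples` path is taken from the API reference below; the body fields in this sketch are illustrative assumptions:

```python
# Hypothetical feedback submission; only the /feedback/examples path is taken
# from the API reference -- the body fields are illustrative.
import requests

feedback = {
    "example_id": "example-123",  # illustrative identifier
    "rating": "positive",         # illustrative rating value
    "comment": "client_name was correct; contract_value missed the currency symbol",
}
requests.post("http://localhost:5000/feedback/examples", json=feedback)
```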
| Endpoint | Method | Purpose |
|---|---|---|
| `/` | GET | Main interface |
| `/upload_pdf` | POST | PDF upload and OCR |
| `/generate_examples` | POST | Create training examples |
| `/run_extraction` | POST | Process documents |
| `/register_template` | POST | Save templates |
| `/feedback/examples` | POST | Submit feedback |
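As an example of the upload flow, a PDF can be posted to `/upload_pdf`. The multipart field name `"file"` is an assumption, and the response shape may differ:

```python
# Hypothetical PDF upload; the multipart field name "file" is an assumption.
import requests

with open("contract.pdf", "rb") as f:
    resp = requests.post("http://localhost:5000/upload_pdf", files={"file": f})
print(resp.json())  # OCR'd text returned by the server (response shape may vary)
```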
```
Frontend (HTML/JS)  ←→  Backend (FastAPI)  ←→  AI Providers
        ↓                       ↓                    ↓
• Template UI           • LangExtract        • Azure OpenAI
• Document Upload       • OCR Pipeline       • OpenAI
• Feedback System       • DSPy Integration   • Google Gemini
• Progress Tracking     • Template Storage   • Ollama
```
- Use Extract mode for factual data that appears verbatim
- Use Generate mode for analysis or interpreted content
- Write clear, specific attribute descriptions
- Provide 2-4 diverse sample documents
- Include variations and edge cases
- Ensure samples cover all schema attributes
- Rate examples regularly to improve performance
- Use detailed feedback for specific issues
- Feedback is isolated by document type
API Configuration
- Verify API credentials in the `.env` file
- Check endpoint URLs and model names
- Ensure sufficient API quota/credits
PDF Processing
- Use clear, text-based PDFs (not scanned images)
- Check file size limits (typically 16MB max)
- Try alternative OCR if text extraction fails
Poor Extraction Results
- Review and improve schema descriptions
- Add more diverse training examples
- Increase extraction passes for better recall
- Provide feedback on generated examples
DSPy Optimization
- Set `DSPY_ENABLED=true` in the environment
- Provide sufficient feedback (default: 10+ examples)
- Check logs for optimization triggers
- Fork the repository
- Create a feature branch
- Make changes and add tests
- Submit a pull request
Apache License 2.0 - see LICENSE file for details.
- Google LangExtract - Core extraction framework
- Stanford DSPy - Prompt optimization framework