A web application for extracting structured data from documents with AI-powered intelligence and feedback learning. Built on Google's LangExtract and Stanford's DSPy frameworks with an intuitive web interface.
- PDF Upload: Convert PDFs to text using OCR
 - Text Input: Process plain text content directly
 - Multiple Formats: Handle various document types (contracts, invoices, forms)
 
- Extract Mode: Find specific text spans from documents (names, dates, amounts)
 - Generate Mode: Create interpreted content (summaries, classifications)
 
- Multiple Providers: Azure OpenAI, OpenAI, Google Gemini, Ollama
 - Template System: Save and reuse extraction configurations
 - Feedback Learning: Improve results through user feedback and DSPy optimization
 
- Template Management: Create and edit extraction schemas
 - Real-time Processing: Live progress updates for document processing
 - Result Export: Download results in HTML, JSON, or JSONL formats
 
- Python 3.8+
 - API access to one of: Azure OpenAI, OpenAI, Google Gemini, or Ollama
 
- Clone and install

```bash
git clone https://github.com/LM-150A/docflash.git
cd docflash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Configure environment

```bash
cp .env.example .env
# Edit .env with your API credentials
```

- Run application

```bash
python start_fastapi.py
```

Access the application at http://localhost:5000
```env
# Choose one provider
LLM_PROVIDER=azure_openai  # Options: azure_openai, openai, gemini, ollama

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4
AZURE_OPENAI_API_KEY=your-api-key

# OpenAI
OPENAI_API_KEY=your-openai-key

# Google Gemini
GOOGLE_API_KEY=your-google-api-key

# Ollama (local)
OLLAMA_MODEL_ID=gemma2:2b
OLLAMA_BASE_URL=http://localhost:11434

# Optional: Enable DSPy optimization
DSPY_ENABLED=true
```

Create extraction attributes specifying what information to extract:
| Attribute | Description | Mode | 
|---|---|---|
| client_name | Name of the client | Extract | 
| contract_value | Total contract amount | Extract | 
| summary | Brief contract summary | Generate | 
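
As a purely illustrative sketch of the difference between the two modes (the actual result format returned by the application may differ), an extraction run over a contract using the schema above could produce something like:

```python
# Illustrative only: field names and structure are not the application's
# documented output format.
result = {
    # Extract mode returns spans copied verbatim from the document
    "client_name": "Acme Corporation",
    "contract_value": "$120,000",
    # Generate mode returns content interpreted by the model
    "summary": "One-year services agreement between Acme Corporation and the vendor.",
}
```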
- Upload PDF files for OCR processing
 - Copy/paste text content directly
 - Provide multiple examples for better training
 
The system creates training examples based on your schema and sample documents.
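Internally this corresponds to few-shot examples in the style of LangExtract. A minimal sketch, assuming the public `langextract` API (exact class and field names may differ across versions):

```python
import langextract as lx

# One hand-written example pairing a sample document with the spans we
# expect for each schema attribute; the app generates these automatically.
example = lx.data.ExampleData(
    text="This agreement is made with Acme Corporation for a total value of $120,000.",
    extractions=[
        lx.data.Extraction(extraction_class="client_name", extraction_text="Acme Corporation"),
        lx.data.Extraction(extraction_class="contract_value", extraction_text="$120,000"),
    ],
)
```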
Upload new documents and run extraction with configurable settings:
- Number of extraction passes (1-3)
 - Parallel processing workers (5-20)
 - Temperature settings based on extraction modes
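
These settings can also be supplied when driving the extraction endpoint directly. A minimal sketch; the payload field names (`text`, `extraction_passes`, `max_workers`) are assumptions and may not match the actual request schema:

```python
import requests

# Hypothetical request: field names are illustrative placeholders.
resp = requests.post(
    "http://localhost:5000/run_extraction",
    json={
        "text": "This agreement is made with Acme Corporation ...",
        "extraction_passes": 2,   # 1-3; more passes improve recall
        "max_workers": 10,        # 5-20 parallel workers
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```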
 
- Rate generated examples to improve future results
 - Use detailed feedback to guide AI improvements
 - DSPy automatically optimizes prompts based on feedback
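
Feedback can likewise be submitted through the API. A hedged sketch; the payload fields below are placeholders rather than the documented schema of `/feedback/examples`:

```python
import requests

# Hypothetical payload: example_id, rating, and comment are placeholder
# field names, not the documented request format.
requests.post(
    "http://localhost:5000/feedback/examples",
    json={
        "example_id": "example-123",
        "rating": "negative",
        "comment": "contract_value should include the currency symbol",
    },
    timeout=30,
).raise_for_status()
```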
 
| Endpoint | Method | Purpose |
|---|---|---|
| / | GET | Main interface |
| /upload_pdf | POST | PDF upload and OCR |
| /generate_examples | POST | Create training examples |
| /run_extraction | POST | Process documents |
| /register_template | POST | Save templates |
| /feedback/examples | POST | Submit feedback |
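
For example, a PDF can be sent to the OCR endpoint with a standard multipart upload. A minimal sketch; the multipart field name ("file") is an assumption:

```python
import requests

# Upload a PDF for OCR; the form field name "file" is assumed, not documented.
with open("contract.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/upload_pdf",
        files={"file": ("contract.pdf", f, "application/pdf")},
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())
```
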
```
Frontend (HTML/JS) ←→ Backend (FastAPI) ←→ AI Providers
    ↓                       ↓                    ↓
• Template UI          • LangExtract        • Azure OpenAI
• Document Upload      • OCR Pipeline       • OpenAI
• Feedback System      • DSPy Integration   • Google Gemini
• Progress Tracking    • Template Storage   • Ollama
```
- Use Extract mode for factual data that appears verbatim
 - Use Generate mode for analysis or interpreted content
 - Write clear, specific attribute descriptions
 
- Provide 2-4 diverse sample documents
 - Include variations and edge cases
 - Ensure samples cover all schema attributes
 
- Rate examples regularly to improve performance
 - Use detailed feedback for specific issues
 - Feedback is isolated by document type
 
API Configuration
- Verify API credentials in the `.env` file
 - Check endpoint URLs and model names
 - Ensure sufficient API quota/credits
 
PDF Processing
- Use clear, text-based PDFs (not scanned images)
 - Check file size limits (typically 16MB max)
 - Try alternative OCR if text extraction fails
 
Poor Extraction Results
- Review and improve schema descriptions
 - Add more diverse training examples
 - Increase extraction passes for better recall
 - Provide feedback on generated examples
 
DSPy Optimization
- Set `DSPY_ENABLED=true` in the environment
 - Provide sufficient feedback (default: 10+ examples)
 - Check logs for optimization triggers
 
- Fork the repository
 - Create a feature branch
 - Make changes and add tests
 - Submit a pull request
 
Apache License 2.0 - see LICENSE file for details.
- Google LangExtract - Core extraction framework
 - Stanford DSPy - Prompt optimization framework