A Python project for building RAG (Retrieval-Augmented Generation) applications without vector embeddings, focusing on legal document analysis using the CUAD (Contract Understanding Atticus Dataset).
vectorless/
├── src/ # Core source code
├── scripts/ # Processing scripts
│ ├── process_contract.py # Main contract processing pipeline
│ └── run_all_41_questions.py # Sample evaluation script
├── docs/ # Documentation
│ ├── README.md # Detailed documentation
│ └── GENERALIZED_WORKFLOW.md # Workflow documentation
├── data/ # Input datasets
├── sample_dataset/ # Sample data for development
├── output/ # Generated outputs
│ ├── results/ # Processing results
│ └── segmentation_results/ # Cached segmentation data
├── main.py # Entry point
├── pyproject.toml # Project configuration
└── CLAUDE.md # AI assistant instructions
# Install dependencies
uv sync
# Run main application
uv run python main.py
# Process a specific contract
uv run python scripts/process_contract.py --contract-index 0
# Run evaluation on sample data
uv run python scripts/run_all_41_questions.py

- Document segmentation without vector embeddings
- Parallel question processing
- Caching of segmentation results so repeated runs skip re-segmentation
- Comprehensive evaluation metrics
- Generalizable workflow for different document types
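
As a rough illustration of how the features above can fit together, here is a minimal sketch and not the project's actual implementation. It assumes contracts use numbered section headings, scores sections by plain word overlap in place of embeddings, caches segmentation in memory with `functools.lru_cache` (the repository appears to cache on disk under `output/segmentation_results/`), and answers questions in parallel with a thread pool. The names `CONTRACT`, `segment`, `answer`, and `answer_all` are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical contract text for illustration; not taken from CUAD.
CONTRACT = """1. Term
This Agreement begins on the Effective Date and continues for two years.
2. Termination
Either party may terminate this Agreement with thirty days written notice.
3. Governing Law
This Agreement is governed by the laws of the State of Delaware.
"""

# Assumption: sections start with a line such as "2. Termination".
HEADING = re.compile(r"^\d+\.[ \t]+(.+)$", re.MULTILINE)


@lru_cache(maxsize=None)
def segment(text: str) -> tuple[tuple[str, str], ...]:
    """Split text into (heading, body) pairs; the result is cached per text."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(1).strip(), text[match.end():end].strip()))
    return tuple(sections)


def tokens(s: str) -> set[str]:
    """Lowercased alphabetic words, used for lexical matching."""
    return set(re.findall(r"[a-z]+", s.lower()))


def answer(question: str, text: str = CONTRACT) -> str:
    """Return the heading of the section sharing the most words with the question."""
    q = tokens(question)
    best = max(segment(text), key=lambda sec: len(q & tokens(sec[0] + " " + sec[1])))
    return best[0]


def answer_all(questions: list[str]) -> list[str]:
    """Answer questions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(answer, questions))


if __name__ == "__main__":
    for q in ["Which law governs the agreement?",
              "How can a party terminate with notice?"]:
        print(q, "->", answer(q))
```

Word overlap is only a stand-in for whatever retrieval logic the pipeline uses; the point is that section selection needs no vector index, and that segmentation is computed once and reused across questions.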
See docs/ for detailed documentation and workflow guides.