Production-ready retrieval-augmented generation system for legal document Q&A with Russian LLM support.
RAG system designed for automated search and analysis of legal documents. Built as MVP with focus on Russian language processing using native LLM infrastructure (GigaChat). Implements two-stage retrieval with hierarchical context enrichment for accurate citation.
Status: MVP (Minimum Viable Product) - core functionality operational, additional features in roadmap.
Strengths:
- Dual retrieval modes: Standard RAG with generative answers and citation mode with neighbor chunk enrichment
- Hierarchical parsing: Preserves document structure (articles, parts, points, subpoints) with full metadata
- Hybrid search: Combines vector similarity with keyword matching and metadata filtering (see the score-fusion sketch after this list)
- Resumable indexing: Checkpoint system for safe processing of large documents (30-90 min for 1000+ pages)
- Native Russian support: GigaChat LLM and embeddings optimized for Russian legal terminology
- Smart deduplication: Sentence-level and token-level overlap detection with configurable thresholds
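A minimal sketch of the score fusion behind hybrid search; the tokenizer and the 0.7/0.3 weighting here are illustrative assumptions, not the project's actual implementation:

import re

def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the text (simple tokenization)."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query: str, vector_hits: list[tuple[float, str]],
                alpha: float = 0.7, top_k: int = 5) -> list[tuple[float, str]]:
    """Blend cosine score with keyword overlap; alpha is an illustrative weight."""
    blended = [(alpha * score + (1 - alpha) * keyword_overlap(query, text), text)
               for score, text in vector_hits]
    return sorted(blended, key=lambda pair: pair[0], reverse=True)[:top_k]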
MVP Limitations (by design):
- Single document support: Currently processes one legal document; multi-document architecture planned for v2.0
- CLI-only interface: Web UI and Telegram bot in development (bot scaffolding present)
- Local deployment: Requires self-hosted Qdrant; cloud deployment configuration coming
- Basic keyword search: Simple tokenization; can be enhanced with NLP preprocessing if needed
- Manual configuration: Most settings via .env; admin panel planned for future releases
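For illustration, a typical .env might look like this (EMBEDDING_BATCH_SIZE is referenced under Known Issues below; the remaining variable names are assumptions):

GIGACHAT_CREDENTIALS=your-api-key   # assumed name; see GigaChat API docs
QDRANT_URL=http://localhost:6333    # assumed name
CHUNK_SIZE=1000                     # assumed name; matches splitter settings
CHUNK_OVERLAP=200                   # assumed name
EMBEDDING_BATCH_SIZE=30             # lower this if you hit rate limits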
Standard RAG Mode: Generates comprehensive answers with automatic source attribution. Uses LangChain's retrieval chain with custom legal-focused system prompt. Suitable for exploratory questions requiring interpretation.
Citation Mode: Returns precise document excerpts with hierarchical grouping. Implements two-stage retrieval: (1) primary vector search, (2) neighbor chunk enrichment based on document structure. Prevents duplication through sentence-level overlap detection. Ideal for compliance verification and exact reference lookup.
Design rationale: Dual-mode approach addresses different use cases - generative for understanding, extractive for verification. Citation mode complexity justified by legal requirement for exact source attribution.
Document (DOCX) → Parser → Splitter → Embeddings → Qdrant
↓
User Query → Retriever (Hybrid/Enriched) → RAG Chain → Response
Pipeline stages:
- Document processing: DOCX parser extracts hierarchical structure via regex-based classification (see the sketch after this list)
- Text splitting: Configurable chunk size (1000) with overlap (200) to preserve context
- Embedding creation: Batch processing (30-50) with disk cache and checkpoint system
- Vector storage: Qdrant with cosine similarity, metadata filtering support
- Retrieval: Hybrid (vector + keyword) or enriched (with neighbor chunks)
- Generation: LangChain chain with GigaChat-PRO
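A minimal sketch of the regex-based classification step; the patterns below are illustrative, and the project's actual rules live in indexing/parser.py:

import re

# Illustrative markers for the Russian legal hierarchy; real patterns may differ.
PATTERNS = [
    ("article",  re.compile(r"^Статья\s+(\d+)")),   # "Статья 54. ..."
    ("part",     re.compile(r"^(\d+)\.\s")),        # "3. ..."
    ("point",    re.compile(r"^(\d+)\)\s")),        # "1) ..."
    ("subpoint", re.compile(r"^([а-я])\)\s")),      # "а) ..."
]

def classify(paragraph: str) -> tuple[str, str | None]:
    """Return (hierarchy level, number) for a paragraph, or ('text', None)."""
    for level, pattern in PATTERNS:
        match = pattern.match(paragraph.strip())
        if match:
            return level, match.group(1)
    return "text", None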
Technology choices:
- GigaChat: Russian-native LLM with a production-ready API and quality embeddings
- Qdrant: strong choice for self-hosted deployment, with solid metadata filtering
- LangChain: Industry standard for RAG pipelines, though adds abstraction overhead
- Python-DOCX: Reliable parsing; more complex document structures may need custom processors
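To make the LangChain stage concrete, here is a sketch of wiring GigaChat-Pro into a retrieval chain. The stand-in retriever and the prompt text are assumptions; credentials are read from the environment by langchain-gigachat:

from langchain_gigachat import GigaChat
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

llm = GigaChat(model="GigaChat-Pro")   # expects GIGACHAT_CREDENTIALS in the env
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a legal expert. Answer STRICTLY from the context:\n{context}"),
    ("human", "{input}"),
])
# Stand-in retriever; the real one would come from the Qdrant vector store.
retriever = RunnableLambda(lambda inputs: [Document(page_content="Статья 54 ...")])
chain = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))
result = chain.invoke({"input": "When must the contract be signed?"})
print(result["answer"])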
- Python 3.10+
- Docker (for Qdrant)
- GigaChat API credentials
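For the Docker prerequisite above, a local Qdrant can be started with the standard image (default port; add a volume to persist data):

docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant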
python main.py setup # Validate configuration
python main.py index # Index documents
python main.py index --force # Force reindex
python main.py status # System status
python main.py interactive # Standard RAG mode
python main.py interactive --citations # Citation mode
python main.py clear-cache # Clear embeddings cache

Standard RAG mode generates an interpretative answer with context:
ANSWER: Contract signing occurs within 10 days from protocol publication...
SOURCES: Article 54 Part 3, Article 70 Part 1
RELATED: Article 34 (information system), Article 42 (procedures)
Citation mode returns exact excerpts grouped by hierarchy:
FOUND ELEMENTS (3 results, by relevance):
1. Article 54, Part 3
Relevance: 0.892 | Chunks: 2
═══════════════════════════════════════
The contract is signed within ten days from the protocol
publication date in the unified information system...
═══════════════════════════════════════
Run the test suite:
pytest tests/

To index your own documents:
- Place the DOCX file in data/
- Adjust the parser in indexing/parser.py if the document structure differs
- Run python main.py index
Parser limitations: Current implementation handles structured legal documents with articles/parts/points. Unstructured documents or complex nested structures may require custom parsing logic.
Edit system_prompt in retrieval/rag_chain.py:
system_prompt = """
You are a legal expert. Provide answers STRICTLY based on context...
"""Prompt engineering: Current prompt optimized for factual recall with strict citation requirements. Generative mode intentionally verbose to prevent hallucination of non-existent legal references.
v1.0 (Current MVP):
- Core RAG pipeline operational
- Dual retrieval modes
- CLI interface
- Single document support
v2.0 (Planned):
- Multi-document support with domain routing
- Telegram bot deployment
- Web interface for queries
- Enhanced NLP preprocessing (lemmatization, entity recognition)
- API endpoints for integration
- Metrics dashboard
v3.0 (Future):
- Cross-document reasoning
- Temporal tracking (law amendments)
- Admin panel for document management
- User authentication and access control
By design (MVP scope):
- Single document architecture: Multi-document routing adds complexity not justified at MVP stage
- Basic keyword search: Full NLP pipeline increases dependencies and processing time
- Manual configuration: Admin UI development deferred to gather user requirements
- CLI-only: Web/bot interfaces require production infrastructure decisions
Technical:
- Large documents (1000+ pages) require 30-90 minutes for indexing
- GigaChat API rate limits may affect batch processing
- Checkpoint recovery requires manual cleanup if corrupted
Workarounds:
- Use the --force flag to restart failed indexing
- Adjust EMBEDDING_BATCH_SIZE if hitting rate limits
- Monitor the checkpoints/ and cache/ directories for disk space
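A minimal sketch of the checkpoint pattern described above; the file location and field names are assumptions:

import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/indexing.json")  # assumed location

def load_checkpoint() -> int:
    """Index of the last successfully embedded batch, or -1 if starting fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_batch"]
    return -1

def save_checkpoint(batch_idx: int) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps({"last_batch": batch_idx}))

def embed_and_store(batch: list[str]) -> None:
    """Stand-in for the embedding call plus Qdrant upsert."""

def index_batches(batches: list[list[str]]) -> None:
    start = load_checkpoint() + 1          # resume after the last good batch
    for i, batch in enumerate(batches[start:], start=start):
        embed_and_store(batch)
        save_checkpoint(i)                 # persist progress after every batch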
Deduplication algorithm:
- Sentence-level: Exact match after whitespace normalization
- Token-level: Sliding window comparison (configurable overlap threshold)
- Preserves first occurrence, removes subsequent duplicates
- Aggressive mode available for high-duplication documents
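A minimal sketch of the sentence-level pass (first occurrence wins; the token-level sliding window is omitted for brevity):

import re

def dedup_sentences(chunks: list[str]) -> list[str]:
    """Drop sentences already seen in earlier chunks, after normalization."""
    seen: set[str] = set()
    result = []
    for chunk in chunks:
        kept = []
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            key = " ".join(sentence.split()).lower()   # whitespace normalization
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
        result.append(" ".join(kept))
    return result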
Enrichment strategy:
- Primary search identifies top-K relevant chunks
- System retrieves all chunks from same hierarchical element (article/part/point)
- Neighbor chunks weighted lower (0.7x) than primary matches
- Final grouping by hierarchy level prevents fragmentation
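A minimal sketch of the two-stage enrichment; the chunk shape and grouping keys are assumptions, while the 0.7 factor mirrors the description above:

from collections import defaultdict

NEIGHBOR_WEIGHT = 0.7   # neighbor chunks score below primary matches

def enrich(primary_hits: list[tuple[float, dict]],
           all_chunks: list[dict]) -> dict[str, list[tuple[float, dict]]]:
    """Stage 2: pull all sibling chunks of each hit's hierarchical element.

    Chunks are dicts with 'id', 'element' (e.g. 'Article 54, Part 3'), 'text'.
    Results come back grouped by element so the output is never fragmented.
    """
    by_element = defaultdict(list)
    for chunk in all_chunks:
        by_element[chunk["element"]].append(chunk)

    grouped: dict[str, list[tuple[float, dict]]] = defaultdict(list)
    seen = set()
    for score, hit in primary_hits:
        for sibling in by_element[hit["element"]]:
            if sibling["id"] in seen:
                continue
            seen.add(sibling["id"])
            weight = 1.0 if sibling["id"] == hit["id"] else NEIGHBOR_WEIGHT
            grouped[hit["element"]].append((weight * score, sibling))
    return dict(grouped)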
Why Python-DOCX: Chosen for reliability over heavier alternatives (Apache POI, docx4j). Handles the majority of legal documents. Known limitation: embedded objects and complex tables may need manual preprocessing.
Why local Qdrant: Self-hosted deployment provides data sovereignty (important for legal documents) and eliminates API costs. Cloud Qdrant support planned for organizations without infrastructure.
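For reference, a local Qdrant setup with cosine similarity and metadata filtering looks roughly like this; the collection name, vector size, and payload keys are assumptions, while the calls are standard qdrant-client API:

from qdrant_client import QdrantClient
from qdrant_client.models import (Distance, VectorParams, Filter,
                                  FieldCondition, MatchValue)

client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="legal_chunks",                       # assumed name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

query_embedding = [0.0] * 1024   # placeholder; real vectors come from GigaChat
hits = client.search(
    collection_name="legal_chunks",
    query_vector=query_embedding,
    query_filter=Filter(must=[FieldCondition(key="article",
                                             match=MatchValue(value="54"))]),
    limit=5,
)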
Contributions welcome. Focus areas:
- Enhanced document parsers for complex structures
- Additional embedding models (multilingual support)
- Performance optimizations for large-scale deployment
- Test coverage improvements
- Issues: https://github.com/hermitage-cyber/rag/issues
- Discussions: https://github.com/hermitage-cyber/rag/discussions
Version: 1.0-MVP
Status: Production-ready for single-document deployments
Last updated: November 2025