Production-ready retrieval-augmented generation system for legal document Q&A with Russian LLM support.
RAG system designed for automated search and analysis of legal documents. Built as MVP with focus on Russian language processing using native LLM infrastructure (GigaChat). Implements two-stage retrieval with hierarchical context enrichment for accurate citation.
Status: MVP (Minimum Viable Product) - core functionality operational, additional features in roadmap.
Strengths:
- Dual retrieval modes: Standard RAG with generative answers and citation mode with neighbor chunk enrichment
- Hierarchical parsing: Preserves document structure (articles, parts, points, subpoints) with full metadata
- Hybrid search: Combines vector similarity with keyword matching and metadata filtering (see the score-fusion sketch after this list)
- Resumable indexing: Checkpoint system for safe processing of large documents (30-90 min for 1000+ pages)
- Native Russian support: GigaChat LLM and embeddings optimized for Russian legal terminology
- Smart deduplication: Sentence-level and token-level overlap detection with configurable thresholds
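A minimal sketch of the score fusion behind hybrid search; the tokenizer and the 0.7/0.3 weighting here are illustrative assumptions, not the project's actual implementation:

import re

def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the text (simple tokenization)."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query: str, vector_hits: list[tuple[float, str]],
                alpha: float = 0.7, top_k: int = 5) -> list[tuple[float, str]]:
    """Blend cosine score with keyword overlap; alpha is an illustrative weight."""
    blended = [(alpha * score + (1 - alpha) * keyword_overlap(query, text), text)
               for score, text in vector_hits]
    return sorted(blended, key=lambda pair: pair[0], reverse=True)[:top_k]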
MVP Limitations (by design):
- Single document support: Currently processes one legal document; multi-document architecture planned for v2.0
- CLI-only interface: Web UI and Telegram bot in development (bot scaffolding present)
- Local deployment: Requires self-hosted Qdrant; cloud deployment configuration coming
- Basic keyword search: Simple tokenization; can be enhanced with NLP preprocessing if needed
- Manual configuration: Most settings via .env; admin panel planned for future releases
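For illustration, a typical .env might look like this (EMBEDDING_BATCH_SIZE is referenced under Known Issues below; the remaining variable names are assumptions):

GIGACHAT_CREDENTIALS=your-api-key   # assumed name; see GigaChat API docs
QDRANT_URL=http://localhost:6333    # assumed name
CHUNK_SIZE=1000                     # assumed name; matches splitter settings
CHUNK_OVERLAP=200                   # assumed name
EMBEDDING_BATCH_SIZE=30             # lower this if you hit rate limits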
Standard RAG Mode: Generates comprehensive answers with automatic source attribution. Uses LangChain's retrieval chain with custom legal-focused system prompt. Suitable for exploratory questions requiring interpretation.
Citation Mode: Returns precise document excerpts with hierarchical grouping. Implements two-stage retrieval: (1) primary vector search, (2) neighbor chunk enrichment based on document structure. Prevents duplication through sentence-level overlap detection. Ideal for compliance verification and exact reference lookup.
Design rationale: Dual-mode approach addresses different use cases - generative for understanding, extractive for verification. Citation mode complexity justified by legal requirement for exact source attribution.
Document (DOCX) → Parser → Splitter → Embeddings → Qdrant
↓
User Query → Retriever (Hybrid/Enriched) → RAG Chain → Response
Pipeline stages:
- Document processing: DOCX parser extracts hierarchical structure via regex-based classification (see the sketch after this list)
- Text splitting: Configurable chunk size (1000) with overlap (200) to preserve context
- Embedding creation: Batch processing (30-50) with disk cache and checkpoint system
- Vector storage: Qdrant with cosine similarity, metadata filtering support
- Retrieval: Hybrid (vector + keyword) or enriched (with neighbor chunks)
- Generation: LangChain chain with GigaChat-PRO
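A minimal sketch of the regex-based classification step; the patterns below are illustrative, and the project's actual rules live in indexing/parser.py:

import re

# Illustrative markers for the Russian legal hierarchy; real patterns may differ.
PATTERNS = [
    ("article",  re.compile(r"^Статья\s+(\d+)")),   # "Статья 54. ..."
    ("part",     re.compile(r"^(\d+)\.\s")),        # "3. ..."
    ("point",    re.compile(r"^(\d+)\)\s")),        # "1) ..."
    ("subpoint", re.compile(r"^([а-я])\)\s")),      # "а) ..."
]

def classify(paragraph: str) -> tuple[str, str | None]:
    """Return (hierarchy level, number) for a paragraph, or ('text', None)."""
    for level, pattern in PATTERNS:
        match = pattern.match(paragraph.strip())
        if match:
            return level, match.group(1)
    return "text", None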
Technology choices:
- GigaChat: Russian-native LLM with a production-ready API and quality embeddings
- Qdrant: strong choice for self-hosted deployment, with solid metadata filtering
- LangChain: Industry standard for RAG pipelines, though adds abstraction overhead
- Python-DOCX: Reliable parsing; more complex document structures may need custom processors
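To make the LangChain stage concrete, here is a sketch of wiring GigaChat-Pro into a retrieval chain. The stand-in retriever and the prompt text are assumptions; credentials are read from the environment by langchain-gigachat:

from langchain_gigachat import GigaChat
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

llm = GigaChat(model="GigaChat-Pro")   # expects GIGACHAT_CREDENTIALS in the env
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a legal expert. Answer STRICTLY from the context:\n{context}"),
    ("human", "{input}"),
])
# Stand-in retriever; the real one would come from the Qdrant vector store.
retriever = RunnableLambda(lambda inputs: [Document(page_content="Статья 54 ...")])
chain = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))
result = chain.invoke({"input": "When must the contract be signed?"})
print(result["answer"])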
- Python 3.10+
- Docker (for Qdrant)
- GigaChat API credentials
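For the Docker prerequisite above, a local Qdrant can be started with the standard image (default port; add a volume to persist data):

docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant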
python main.py setup # Validate configuration
python main.py index # Index documents
python main.py index --force # Force reindex
python main.py status # System status
python main.py interactive # Standard RAG mode
python main.py interactive --citations # Citation mode
python main.py clear-cache # Clear embeddings cache

Standard RAG mode generates an interpretative answer with context:
ANSWER: Contract signing occurs within 10 days from protocol publication...
SOURCES: Article 54 Part 3, Article 70 Part 1
RELATED: Article 34 (information system), Article 42 (procedures)
Citation mode returns exact excerpts grouped by hierarchy:
FOUND ELEMENTS (3 results, by relevance):
1. Article 54, Part 3
Relevance: 0.892 | Chunks: 2
═══════════════════════════════════════
The contract is signed within ten days from the protocol
publication date in the unified information system...
═══════════════════════════════════════
Run the test suite:
pytest tests/

To index your own documents:
- Place the DOCX file in data/
- Adjust the parser in indexing/parser.py if the document structure differs
- Run python main.py index
Parser limitations: Current implementation handles structured legal documents with articles/parts/points. Unstructured documents or complex nested structures may require custom parsing logic.
Edit system_prompt in retrieval/rag_chain.py:
system_prompt = """
You are a legal expert. Provide answers STRICTLY based on context...
"""Prompt engineering: Current prompt optimized for factual recall with strict citation requirements. Generative mode intentionally verbose to prevent hallucination of non-existent legal references.
v1.0 (Current MVP):
- Core RAG pipeline operational
- Dual retrieval modes
- CLI interface
- Single document support
v2.0 (Planned):
- Multi-document support with domain routing
- Telegram bot deployment
- Web interface for queries
- Enhanced NLP preprocessing (lemmatization, entity recognition)
- API endpoints for integration
- Metrics dashboard
v3.0 (Future):
- Cross-document reasoning
- Temporal tracking (law amendments)
- Admin panel for document management
- User authentication and access control
By design (MVP scope):
- Single document architecture: Multi-document routing adds complexity not justified at MVP stage
- Basic keyword search: Full NLP pipeline increases dependencies and processing time
- Manual configuration: Admin UI development deferred to gather user requirements
- CLI-only: Web/bot interfaces require production infrastructure decisions
Technical:
- Large documents (1000+ pages) require 30-90 minutes for indexing
- GigaChat API rate limits may affect batch processing
- Checkpoint recovery requires manual cleanup if corrupted
Workarounds:
- Use the --force flag to restart failed indexing
- Adjust EMBEDDING_BATCH_SIZE if hitting rate limits
- Monitor the checkpoints/ and cache/ directories for disk space
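A minimal sketch of the checkpoint pattern described above; the file location and field names are assumptions:

import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/indexing.json")  # assumed location

def load_checkpoint() -> int:
    """Index of the last successfully embedded batch, or -1 if starting fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_batch"]
    return -1

def save_checkpoint(batch_idx: int) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps({"last_batch": batch_idx}))

def embed_and_store(batch: list[str]) -> None:
    """Stand-in for the embedding call plus Qdrant upsert."""

def index_batches(batches: list[list[str]]) -> None:
    start = load_checkpoint() + 1          # resume after the last good batch
    for i, batch in enumerate(batches[start:], start=start):
        embed_and_store(batch)
        save_checkpoint(i)                 # persist progress after every batch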
Deduplication algorithm:
- Sentence-level: Exact match after whitespace normalization
- Token-level: Sliding window comparison (configurable overlap threshold)
- Preserves first occurrence, removes subsequent duplicates
- Aggressive mode available for high-duplication documents
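A minimal sketch of the sentence-level pass (first occurrence wins; the token-level sliding window is omitted for brevity):

import re

def dedup_sentences(chunks: list[str]) -> list[str]:
    """Drop sentences already seen in earlier chunks, after normalization."""
    seen: set[str] = set()
    result = []
    for chunk in chunks:
        kept = []
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            key = " ".join(sentence.split()).lower()   # whitespace normalization
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
        result.append(" ".join(kept))
    return result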
Enrichment strategy:
- Primary search identifies top-K relevant chunks
- System retrieves all chunks from same hierarchical element (article/part/point)
- Neighbor chunks weighted lower (0.7x) than primary matches
- Final grouping by hierarchy level prevents fragmentation
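A minimal sketch of the two-stage enrichment; the chunk shape and grouping keys are assumptions, while the 0.7 factor mirrors the description above:

from collections import defaultdict

NEIGHBOR_WEIGHT = 0.7   # neighbor chunks score below primary matches

def enrich(primary_hits: list[tuple[float, dict]],
           all_chunks: list[dict]) -> dict[str, list[tuple[float, dict]]]:
    """Stage 2: pull all sibling chunks of each hit's hierarchical element.

    Chunks are dicts with 'id', 'element' (e.g. 'Article 54, Part 3'), 'text'.
    Results come back grouped by element so the output is never fragmented.
    """
    by_element = defaultdict(list)
    for chunk in all_chunks:
        by_element[chunk["element"]].append(chunk)

    grouped: dict[str, list[tuple[float, dict]]] = defaultdict(list)
    seen = set()
    for score, hit in primary_hits:
        for sibling in by_element[hit["element"]]:
            if sibling["id"] in seen:
                continue
            seen.add(sibling["id"])
            weight = 1.0 if sibling["id"] == hit["id"] else NEIGHBOR_WEIGHT
            grouped[hit["element"]].append((weight * score, sibling))
    return dict(grouped)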
Why Python-DOCX: Chosen for reliability over heavier alternatives (Apache POI, docx4j). Handles the majority of legal documents. Known limitation: embedded objects and complex tables may need manual preprocessing.
Why local Qdrant: Self-hosted deployment provides data sovereignty (important for legal documents) and eliminates API costs. Cloud Qdrant support planned for organizations without infrastructure.
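For reference, a local Qdrant setup with cosine similarity and metadata filtering looks roughly like this; the collection name, vector size, and payload keys are assumptions, while the calls are standard qdrant-client API:

from qdrant_client import QdrantClient
from qdrant_client.models import (Distance, VectorParams, Filter,
                                  FieldCondition, MatchValue)

client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="legal_chunks",                       # assumed name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

query_embedding = [0.0] * 1024   # placeholder; real vectors come from GigaChat
hits = client.search(
    collection_name="legal_chunks",
    query_vector=query_embedding,
    query_filter=Filter(must=[FieldCondition(key="article",
                                             match=MatchValue(value="54"))]),
    limit=5,
)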
Contributions welcome. Focus areas:
- Enhanced document parsers for complex structures
- Additional embedding models (multilingual support)
- Performance optimizations for large-scale deployment
- Test coverage improvements
- Issues: https://github.com/hermitage-cyber/rag/issues
- Discussions: https://github.com/hermitage-cyber/rag/discussions
Version: 1.0-MVP
Status: Production-ready for single-document deployments
Last updated: November 2025