
RAG System for Legal Document Analysis

Production-ready retrieval-augmented generation system for legal document Q&A with Russian LLM support.


Overview

A RAG system designed for automated search and analysis of legal documents. Built as an MVP with a focus on Russian-language processing using native LLM infrastructure (GigaChat). It implements two-stage retrieval with hierarchical context enrichment for accurate citations.

Status: MVP (Minimum Viable Product). Core functionality is operational; additional features are on the roadmap.

Key Features

Strengths:

  • Dual retrieval modes: Standard RAG with generative answers and citation mode with neighbor chunk enrichment
  • Hierarchical parsing: Preserves document structure (articles, parts, points, subpoints) with full metadata
  • Hybrid search: Combines vector similarity with keyword matching and metadata filtering
  • Resumable indexing: Checkpoint system for safe processing of large documents (30-90 min for 1000+ pages)
  • Native Russian support: GigaChat LLM and embeddings optimized for Russian legal terminology
  • Smart deduplication: Sentence-level and token-level overlap detection with configurable thresholds

MVP Limitations (by design):

  • Single document support: Currently processes one legal document; multi-document architecture planned for v2.0
  • CLI-only interface: Web UI and Telegram bot in development (bot scaffolding present)
  • Local deployment: Requires self-hosted Qdrant; cloud deployment configuration coming
  • Basic keyword search: Simple tokenization; can be enhanced with NLP preprocessing if needed
  • Manual configuration: Most settings via .env; admin panel planned for future releases

Architecture

Retrieval Modes

Standard RAG Mode: Generates comprehensive answers with automatic source attribution. Uses LangChain's retrieval chain with a custom legal-focused system prompt. Suitable for exploratory questions that require interpretation.

Citation Mode: Returns precise document excerpts with hierarchical grouping. Implements two-stage retrieval: (1) primary vector search, (2) neighbor chunk enrichment based on document structure. Prevents duplication through sentence-level overlap detection. Ideal for compliance verification and exact reference lookup.

Design rationale: The dual-mode approach addresses different use cases: generative for understanding, extractive for verification. The added complexity of citation mode is justified by the legal requirement for exact source attribution.
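
A minimal sketch of how the standard mode can be wired together with LangChain and GigaChat. The collection name, prompt text, and search parameters below are illustrative assumptions; the actual chain lives in retrieval/rag_chain.py and may be structured differently.

# Illustrative sketch of the standard RAG mode; not copied from the repository.
from langchain_community.chat_models import GigaChat
from langchain_community.embeddings import GigaChatEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from qdrant_client import QdrantClient

llm = GigaChat(model="GigaChat-Pro", verify_ssl_certs=False)   # credentials via env or kwargs
embeddings = GigaChatEmbeddings(verify_ssl_certs=False)

store = Qdrant(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="legal_docs",                              # hypothetical collection name
    embeddings=embeddings,
)
retriever = store.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a legal expert. Answer STRICTLY from the context:\n\n{context}"),
    ("human", "{input}"),
])

chain = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))
result = chain.invoke({"input": "Within how many days must the contract be signed?"})
print(result["answer"])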

System Architecture

Document (DOCX) → Parser → Splitter → Embeddings → Qdrant
                                ↓
User Query → Retriever (Hybrid/Enriched) → RAG Chain → Response

Pipeline stages:

  1. Document processing: DOCX parser extracts hierarchical structure with regex-based classification
  2. Text splitting: Configurable chunk size (1000) with overlap (200) to preserve context (see the sketch after this list)
  3. Embedding creation: Batch processing (30-50) with disk cache and checkpoint system
  4. Vector storage: Qdrant with cosine similarity, metadata filtering support
  5. Retrieval: Hybrid (vector + keyword) or enriched (with neighbor chunks)
  6. Generation: LangChain chain with GigaChat-PRO
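
A minimal sketch of the splitting and vector-storage stages under the defaults quoted above (chunk size 1000, overlap 200). The batch cache and checkpoint logic are omitted, and the collection name and metadata keys are illustrative, not taken from the repository:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings import GigaChatEmbeddings
from langchain_community.vectorstores import Qdrant

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents([
    Document(page_content="Статья 54. ...",               # text produced by the parser
             metadata={"article": 54, "part": 3}),        # hierarchy metadata (illustrative keys)
])

# Embeds the chunks and uploads them to a local Qdrant collection (cosine distance by default).
Qdrant.from_documents(
    chunks,
    GigaChatEmbeddings(verify_ssl_certs=False),
    url="http://localhost:6333",
    collection_name="legal_docs",
)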

Technology choices:

  • GigaChat: Only Russian LLM with production-ready API and quality embeddings
  • Qdrant: Best vector DB for self-hosted deployment, strong metadata support
  • LangChain: Industry standard for RAG pipelines, though adds abstraction overhead
  • python-docx: Reliable parsing; more complex document structures may need custom processors

Installation

Prerequisites

  • Python 3.10+
  • Docker (for Qdrant; see the example below)
  • GigaChat API credentials
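
If a Qdrant instance is not already running, it can be started locally with the official Docker image, for example:

docker run -p 6333:6333 qdrant/qdrant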

Quick start

python main.py setup              # Validate configuration
python main.py index              # Index documents
python main.py index --force      # Force reindex
python main.py status             # System status
python main.py interactive        # Standard RAG mode
python main.py interactive --citations  # Citation mode
python main.py clear-cache        # Clear embeddings cache

Interactive mode

Generates an interpretative answer with context:

ANSWER: Contract signing occurs within 10 days from protocol publication...
SOURCES: Article 54 Part 3, Article 70 Part 1
RELATED: Article 34 (information system), Article 42 (procedures)

Citation mode

Returns exact excerpts grouped by hierarchy:

FOUND ELEMENTS (3 results, by relevance):

1. Article 54, Part 3
   Relevance: 0.892 | Chunks: 2
   ═══════════════════════════════════════
   Contract is signed within ten days from protocol
   publication date in unified information system...
   ═══════════════════════════════════════

Development

Testing

pytest tests/

Adding custom documents

  1. Place DOCX file in data/
  2. Adjust the parser in indexing/parser.py if the structure differs
  3. Run python main.py index

Parser limitations: The current implementation handles structured legal documents with articles/parts/points. Unstructured documents or complex nested structures may require custom parsing logic.
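
For orientation, here is a minimal sketch of the kind of regex-based classification the parser performs on DOCX paragraphs. The patterns and metadata format are illustrative and are not copied from indexing/parser.py; adapt them to your document's structure.

import re
from docx import Document

ARTICLE = re.compile(r"^Статья\s+(\d+)")   # "Article N" heading (Russian legal convention)
PART = re.compile(r"^(\d+)\.\s")           # "N. ..." at the start of a paragraph

def classify_paragraphs(path: str):
    """Yield (text, metadata) pairs tagged with the current article/part."""
    current = {"article": None, "part": None}
    for para in Document(path).paragraphs:
        text = para.text.strip()
        if not text:
            continue
        if m := ARTICLE.match(text):
            current = {"article": int(m.group(1)), "part": None}
        elif m := PART.match(text):
            current["part"] = int(m.group(1))
        yield text, dict(current)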

Customizing prompts

Edit system_prompt in retrieval/rag_chain.py:

system_prompt = """
You are a legal expert. Provide answers STRICTLY based on context...
"""

Prompt engineering: The current prompt is optimized for factual recall with strict citation requirements. The generative mode is intentionally verbose to prevent hallucination of non-existent legal references.
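
When editing the prompt, keep any placeholders the chain relies on. The sketch below assumes the common LangChain pattern of stuffing retrieved chunks into a {context} placeholder; check retrieval/rag_chain.py for the actual wiring before changing it.

from langchain_core.prompts import ChatPromptTemplate

# Assumption: the chain injects retrieved chunks via {context}; removing the
# placeholder would break context injection in that case.
system_prompt = """
You are a legal expert. Provide answers STRICTLY based on context...

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])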

Roadmap

v1.0 (Current MVP):

  • Core RAG pipeline operational
  • Dual retrieval modes
  • CLI interface
  • Single document support

v2.0 (Planned):

  • Multi-document support with domain routing
  • Telegram bot deployment
  • Web interface for queries
  • Enhanced NLP preprocessing (lemmatization, entity recognition)
  • API endpoints for integration
  • Metrics dashboard

v3.0 (Future):

  • Cross-document reasoning
  • Temporal tracking (law amendments)
  • Admin panel for document management
  • User authentication and access control

Known Limitations

By design (MVP scope):

  • Single document architecture: Multi-document routing adds complexity not justified at MVP stage
  • Basic keyword search: Full NLP pipeline increases dependencies and processing time
  • Manual configuration: Admin UI development deferred to gather user requirements
  • CLI-only: Web/bot interfaces require production infrastructure decisions

Technical:

  • Large documents (1000+ pages) require 30-90 minutes for indexing
  • GigaChat API rate limits may affect batch processing
  • Checkpoint recovery requires manual cleanup if corrupted

Workarounds:

  • Use --force flag to restart failed indexing
  • Adjust EMBEDDING_BATCH_SIZE if hitting rate limits (see the .env sketch below)
  • Monitor checkpoints/ and cache/ directories for disk space
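
A hedged example of the relevant .env settings. Apart from EMBEDDING_BATCH_SIZE, which is referenced above, the variable names are illustrative and should be checked against the project's configuration module:

# .env (illustrative names and values)
GIGACHAT_CREDENTIALS=your-api-token
QDRANT_URL=http://localhost:6333
EMBEDDING_BATCH_SIZE=30        # lower this if GigaChat rate limits are hit
CHUNK_SIZE=1000
CHUNK_OVERLAP=200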

Technical Details

Deduplication algorithm (sketched in code after this list):

  • Sentence-level: Exact match after whitespace normalization
  • Token-level: Sliding window comparison (configurable overlap threshold)
  • Preserves first occurrence, removes subsequent duplicates
  • Aggressive mode available for high-duplication documents
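
A simplified sketch of the sentence-level pass; the token-level sliding window and the configurable thresholds mentioned above are omitted:

import re

def dedupe_sentences(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each sentence; later duplicates are dropped.
    Sentences are compared after whitespace normalization."""
    seen: set[str] = set()
    result = []
    for text in texts:
        kept = []
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            key = " ".join(sentence.split())   # normalize whitespace before comparing
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
        result.append(" ".join(kept))
    return result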

Enrichment strategy (sketched in code after this list):

  • Primary search identifies top-K relevant chunks
  • System retrieves all chunks from same hierarchical element (article/part/point)
  • Neighbor chunks weighted lower (0.7x) than primary matches
  • Final grouping by hierarchy level prevents fragmentation
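
A sketch of the enrichment scoring under the 0.7x weighting described above. It assumes each chunk's Qdrant payload carries a hierarchical element identifier; the field name element_id is hypothetical:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
NEIGHBOR_WEIGHT = 0.7   # neighbor chunks count less than direct vector hits

def enrich(primary_hits, collection: str = "legal_docs") -> dict:
    """Return {point_id: score}, adding down-weighted chunks from the same
    article/part/point as each primary hit."""
    scored = {hit.id: hit.score for hit in primary_hits}
    for hit in primary_hits:
        neighbors, _ = client.scroll(
            collection_name=collection,
            scroll_filter=models.Filter(must=[
                models.FieldCondition(
                    key="element_id",                        # hypothetical payload field
                    match=models.MatchValue(value=hit.payload["element_id"]),
                )
            ]),
            limit=100,
        )
        for point in neighbors:
            scored.setdefault(point.id, hit.score * NEIGHBOR_WEIGHT)
    return scored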

Why python-docx: Chosen for reliability over more complex alternatives (Apache POI, docx4j). Handles the majority of legal documents. Known limitation: embedded objects and complex tables may need manual preprocessing.

Why local Qdrant: Self-hosted deployment provides data sovereignty (important for legal documents) and eliminates API costs. Cloud Qdrant support planned for organizations without infrastructure.

Contributing

Contributions welcome. Focus areas:

  • Enhanced document parsers for complex structures
  • Additional embedding models (multilingual support)
  • Performance optimizations for large-scale deployment
  • Test coverage improvements

Version: 1.0-MVP
Status: Production-ready for single-document deployments
Last updated: November 2025
