xxcs13/simpleRAG

RAG System

A Retrieval-Augmented Generation (RAG) system built on LangChain and LangGraph, supporting multi-format document parsing, advanced chunking strategies, hybrid retrieval, and structured answer generation with comprehensive logging and performance monitoring.

System Overview

This RAG system provides a complete document processing and question-answering pipeline for PDF, PPTX, and Excel files. It uses ChromaDB for vector storage, OpenAI GPT-4.1-mini for answer generation, and text-embedding-3-small for embeddings, and implements text chunking with cross-page awareness. The workflow is orchestrated with LangGraph state machines for reliable, scalable processing.

Workflow Architecture

graph TD
    A[Document Input] --> B[Document Parsing]
    B --> C[Text Chunking]
    C --> D[Vector Embedding]
    D --> E[Vector Database Storage]
    
    F[User Query] --> G[Vector Retrieval]
    G --> H[Parent Page Aggregation]
    H --> I[LLM Reranking]
    I --> J[Answer Generation]
    J --> K[Structured Response]
    K --> L[Logging & Monitoring]
    
    E --> G
    
    subgraph "Document Processing Pipeline"
        B1[PDF Parser] --> B
        B2[PPTX Parser] --> B
        B3[Excel Parser] --> B
    end
    
    subgraph "Retrieval Pipeline"
        G1[Vector Search] --> G
        H1[Cross-page Aggregation] --> H
    end
    
    subgraph "Generation Pipeline"
        I1[Context Assembly] --> I
        I2[Prompt Engineering] --> I
        I3[Response Validation] --> I
    end

Processing Workflow

  1. Document Ingestion: Multi-format document parsing with layout detection and content extraction
  2. Text Chunking: Advanced chunking with cross-page awareness and parent-child relationships
  3. Vector Embedding: Batch processing with token limit management using text-embedding-3-small
  4. Storage: Persistent vector database with metadata preservation using ChromaDB
  5. Retrieval: Vector-based semantic search with similarity scoring
  6. Parent Aggregation: Cross-page chunk aggregation to parent page level
  7. Reranking: LLM-based relevance scoring for optimal context selection
  8. Generation: Structured answer generation with confidence scoring and source attribution
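As a rough illustration (not the project's actual code), the query-side steps above can be sketched as composed functions; keyword overlap stands in for vector search and LLM reranking, and all helper names are assumptions:

```python
# Minimal sketch of the query pipeline: retrieve -> aggregate -> rerank -> generate.
# Keyword overlap stands in for embedding similarity and LLM scoring.

def retrieve(query, index, top_k=3):
    """Vector-search stand-in: score chunks by naive term overlap."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c["text"].lower().split())), c) for c in index]
    scored.sort(key=lambda s: -s[0])
    return [c for score, c in scored[:top_k] if score > 0]

def aggregate_to_parents(chunks):
    """Group retrieved chunks back up to their parent page."""
    pages = {}
    for c in chunks:
        pages.setdefault(c["page"], []).append(c["text"])
    return [{"page": p, "text": " ".join(t)} for p, t in sorted(pages.items())]

def rerank(query, pages):
    """LLM-reranker stand-in: reuse term overlap as a relevance score."""
    terms = set(query.lower().split())
    return sorted(pages, key=lambda p: -len(terms & set(p["text"].lower().split())))

def generate(query, pages):
    """Answer-generation stand-in: return the top context with attribution."""
    if not pages:
        return {"answer": "No relevant context found.", "sources": []}
    return {"answer": pages[0]["text"], "sources": [p["page"] for p in pages]}

index = [
    {"page": 1, "text": "Revenue grew 12 percent in Q3"},
    {"page": 1, "text": "Gross margin held at 48 percent"},
    {"page": 2, "text": "Headcount was flat year over year"},
]
result = generate("revenue growth",
                  rerank("revenue growth",
                         aggregate_to_parents(retrieve("revenue growth", index))))
print(result["sources"])
```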

Installation and Setup

Environment Setup

# Create conda environment
conda create -n rag python=3.10
conda activate rag

# Install dependencies
pip install -r requirements.txt

Environment Variables

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key_here
ENABLE_TELEMETRY=false  # Optional: disable telemetry for privacy
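A minimal sketch of how such variables are typically read (illustrative only; the project's actual handling lives in config.py):

```python
import os

# Illustrative .env-style config handling; config.py may differ.

def env_flag(name: str, default: str = "false") -> bool:
    """Interpret an environment variable as a boolean flag."""
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")  # required by the OpenAI client
ENABLE_TELEMETRY = env_flag("ENABLE_TELEMETRY")   # defaults to disabled

print(type(ENABLE_TELEMETRY).__name__)
```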

Dependencies

Key libraries include:

  • langchain and langchain-openai for LLM integration
  • langgraph for workflow orchestration
  • chromadb for vector database
  • openai for API access to GPT-4.1-mini (generation) and text-embedding-3-small (embeddings)
  • pdfplumber and pypdf for PDF processing
  • python-pptx for PowerPoint processing
  • pandas for Excel processing
  • tiktoken for token counting

Usage

Command Line Mode

# Single question processing
python main.py "What are the key financial metrics mentioned in the documents?"

Interactive Mode

# Start interactive session
python main.py

# Follow prompts to add documents and ask questions
# Type 'quit' to exit

Document Processing

The system supports multiple document formats:

  • PDF: Advanced layout detection with multi-column support
  • PPTX: Complete content extraction including tables, charts, and images
  • Excel: Multi-sheet processing with data preservation
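Format routing typically keys off the file extension. A hypothetical sketch of what a UnifiedDocumentParser-style router does (parser names match the modules described below, but the routing code is an assumption):

```python
from pathlib import Path

# Hypothetical extension-based routing to the format-specific parsers.
PARSERS = {
    ".pdf": "PDFParser",
    ".pptx": "PPTXParser",
    ".xlsx": "ExcelParser",
    ".xls": "ExcelParser",
}

def route(path: str) -> str:
    """Return the parser name for a document path, or raise on unknown formats."""
    ext = Path(path).suffix.lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}")

print(route("reports/q3.pptx"))
```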

Core Modules

Configuration Management (config.py)

  • Environment variable handling
  • Model configuration and constants
  • System-wide settings management

Document Parsing (parsing.py)

  • PDFParser: Advanced PDF text extraction with layout analysis
  • PPTXParser: Comprehensive PowerPoint content extraction
  • ExcelParser: Multi-sheet Excel data processing
  • UnifiedDocumentParser: Format detection and routing

Text Processing (chunking.py)

  • CrossPageTextSplitter: Context-aware chunking across page boundaries
  • ParentPageAggregator: Hierarchical chunk organization
  • Token-aware splitting with configurable overlap
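To illustrate token-aware splitting with configurable overlap, here is a self-contained sketch; a whitespace tokenizer stands in for tiktoken, and the parameter values are illustrative:

```python
# Sliding-window splitting sketch: fixed chunk_size with overlap tokens
# shared between consecutive chunks (whitespace tokenizer stands in
# for tiktoken).

def split_tokens(text: str, chunk_size: int = 8, overlap: int = 2):
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(20))
chunks = split_tokens(text)
print(len(chunks))
```

Each chunk shares its last `overlap` tokens with the start of the next chunk, which is what preserves context across chunk (and page) boundaries.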

Vector Database (vectorstore.py)

  • VectorStoreManager: Persistent storage with metadata recovery
  • Batch processing for large document sets
  • Automatic retry logic for API rate limits
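The batching idea can be sketched as greedily packing documents into embedding requests under a token budget (the real code would use tiktoken counts and the actual OpenAI limits; the budget here is illustrative):

```python
# Greedy batching sketch: pack documents into batches so each embedding
# request stays under a token budget.

def batch_by_tokens(docs, count_tokens, budget=100):
    batches, current, used = [], [], 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > budget:
            batches.append(current)   # flush the full batch
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches

docs = ["a " * 30, "b " * 40, "c " * 50, "d " * 10]
batches = batch_by_tokens(docs, lambda d: len(d.split()))
print([len(b) for b in batches])
```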

Retrieval System (retrieval.py)

  • VectorRetriever: Semantic similarity search using embeddings
  • ParentPageAggregator: Cross-page chunk aggregation to parent pages
  • LLMReranker: GPT-4.1-mini based relevance scoring and reranking
  • HybridRetriever: Complete retrieval pipeline orchestration
  • Configurable retrieval parameters and scoring weights
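One common way to combine the vector-similarity score with the LLM relevance score under configurable weights, in the spirit of the HybridRetriever description (weights and field names are assumptions, not the project's actual values):

```python
# Weighted combination of vector similarity and LLM relevance scores.

def combined_score(vector_sim: float, llm_relevance: float,
                   w_vector: float = 0.3, w_llm: float = 0.7) -> float:
    return w_vector * vector_sim + w_llm * llm_relevance

candidates = [
    {"page": 4, "vector_sim": 0.91, "llm_relevance": 0.2},
    {"page": 7, "vector_sim": 0.75, "llm_relevance": 0.9},
]
ranked = sorted(candidates,
                key=lambda c: combined_score(c["vector_sim"], c["llm_relevance"]),
                reverse=True)
print(ranked[0]["page"])
```

Note how a page with the highest raw vector similarity can still lose to one the reranker judges more relevant, which is the point of the second-stage LLM pass.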

Answer Generation (generation.py)

  • AnswerGenerator: Structured response generation using GPT-4.1-mini
  • Confidence scoring and uncertainty handling
  • Source attribution and reasoning chains
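An illustrative response schema showing how confidence, sources, and reasoning can travel together (field names are assumptions; the real generator's schema may differ):

```python
from dataclasses import dataclass, field

# Illustrative structured-answer schema with confidence and attribution.

@dataclass
class StructuredAnswer:
    answer: str
    confidence: float                              # 0.0-1.0 generator score
    sources: list = field(default_factory=list)    # page-level attribution
    reasoning: list = field(default_factory=list)  # step-by-step chain

    def is_uncertain(self, threshold: float = 0.5) -> bool:
        """Flag low-confidence answers so callers can surface uncertainty."""
        return self.confidence < threshold

resp = StructuredAnswer(
    answer="Q3 revenue grew 12 percent.",
    confidence=0.82,
    sources=["report.pdf p.4"],
    reasoning=["Located revenue figure on page 4", "Matched figure to Q3 period"],
)
print(resp.is_uncertain())
```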

Workflow Orchestration (workflow.py)

  • LangGraph state machine implementation
  • Separate pipelines for document processing and querying
  • Error handling and state management
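The state-machine idea can be shown without LangGraph: each node reads and updates a shared state dict, and the pipeline runs the nodes in order. (The real implementation uses LangGraph's graph API; the node names here are assumptions.)

```python
# Pure-Python sketch of a linear state machine like the query pipeline:
# each node takes the state dict, mutates it, and passes it on.

def retrieve_node(state):
    state["chunks"] = [f"chunk for: {state['query']}"]
    return state

def rerank_node(state):
    state["context"] = state["chunks"][:1]
    return state

def generate_node(state):
    state["answer"] = f"Answer based on {len(state['context'])} passage(s)."
    return state

PIPELINE = [retrieve_node, rerank_node, generate_node]

def run(query):
    state = {"query": query}
    for node in PIPELINE:
        state = node(state)
    return state

result = run("key financial metrics")
print(result["answer"])
```

Keeping all intermediate values in one state object is what makes it easy to log, retry, or resume individual steps.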

Features

Advanced Document Processing

  • Multi-column PDF layout detection
  • Table and chart extraction from presentations
  • Cross-page text chunking with context preservation
  • Metadata-rich document representation

Intelligent Retrieval

  • Vector-based semantic search with similarity scoring
  • Parent page aggregation for cross-page chunk handling
  • LLM-powered relevance reranking using GPT-4.1-mini
  • Configurable retrieval parameters and scoring weights
  • Source document tracking with page-level attribution

Structured Generation

  • Confidence-scored responses using GPT-4.1-mini
  • Step-by-step reasoning chains
  • Source attribution with page references
  • Uncertainty acknowledgment

Performance Monitoring

  • Comprehensive query logging
  • Processing time tracking
  • Token usage monitoring
  • Error rate analysis

Scalability Features

  • Batch processing for large document sets
  • Persistent vector database with incremental updates
  • Automatic retry logic for API failures
  • Memory-efficient chunking strategies

System Architecture

The system follows a modular architecture with clear separation of concerns:

  • Data Layer: Document parsing and storage management
  • Processing Layer: Text chunking and vector embedding
  • Retrieval Layer: Vector search, parent aggregation, and LLM reranking
  • Generation Layer: LLM-based answer synthesis using GPT-4.1-mini
  • Orchestration Layer: Workflow management and error handling

Each module is designed for independent testing and maintenance, with well-defined interfaces and comprehensive error handling.

Performance Considerations

  • Batch processing prevents OpenAI API token limit violations
  • Persistent vector storage eliminates reprocessing overhead
  • Vector-based retrieval provides fast semantic search
  • Parent page aggregation reduces redundant content
  • Token counting prevents context window overflow
  • Incremental document addition for large knowledge bases
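The context-window safeguard amounts to trimming retrieved passages to a token budget before prompting the model. A sketch (whitespace counts stand in for tiktoken; the budget is illustrative):

```python
# Trim ranked passages so the assembled context fits the model's window.

def trim_to_window(passages, max_tokens=20):
    kept, used = [], 0
    for p in passages:
        n = len(p.split())
        if used + n > max_tokens:
            break  # adding this passage would overflow the window
        kept.append(p)
        used += n
    return kept

passages = [("alpha " * 8).strip(), ("beta " * 8).strip(), ("gamma " * 8).strip()]
kept = trim_to_window(passages)
print(len(kept))
```

Because passages arrive already ranked by relevance, truncating from the tail drops the least useful context first.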

Monitoring and Logging

The system provides comprehensive logging including:

  • Query processing times and token usage
  • Retrieval effectiveness metrics
  • Generation quality indicators
  • Error tracking and debugging information
  • Performance analytics for optimization

Extensibility

The modular design supports easy extension:

  • Additional document format parsers
  • Custom chunking strategies
  • Alternative embedding models
  • Enhanced retrieval algorithms
  • Specialized generation pipelines
