A comprehensive Retrieval-Augmented Generation system built with LangChain and LangGraph frameworks, supporting multi-format document parsing, advanced chunking strategies, hybrid retrieval mechanisms, and structured answer generation with comprehensive logging and performance monitoring.
This RAG system provides a complete document processing and question-answering pipeline that can handle PDF, PPTX, and Excel files. The system utilizes ChromaDB for vector storage, OpenAI GPT-4.1-mini for language generation, text-embedding-3-small for embeddings, and implements sophisticated text chunking with cross-page awareness. The workflow is orchestrated using LangGraph state machines for reliable and scalable processing.
```mermaid
graph TD
    A[Document Input] --> B[Document Parsing]
    B --> C[Text Chunking]
    C --> D[Vector Embedding]
    D --> E[Vector Database Storage]
    F[User Query] --> G[Vector Retrieval]
    G --> H[Parent Page Aggregation]
    H --> I[LLM Reranking]
    I --> J[Answer Generation]
    J --> K[Structured Response]
    K --> L[Logging & Monitoring]
    E --> G

    subgraph "Document Processing Pipeline"
        B1[PDF Parser] --> B
        B2[PPTX Parser] --> B
        B3[Excel Parser] --> B
    end

    subgraph "Retrieval Pipeline"
        G1[Vector Search] --> G
        H1[Cross-page Aggregation] --> H
    end

    subgraph "Generation Pipeline"
        I1[Context Assembly] --> I
        I2[Prompt Engineering] --> I
        I3[Response Validation] --> I
    end
```
- Document Ingestion: Multi-format document parsing with layout detection and content extraction
- Text Chunking: Advanced chunking with cross-page awareness and parent-child relationships
- Vector Embedding: Batch processing with token limit management using `text-embedding-3-small`
- Storage: Persistent vector database with metadata preservation using ChromaDB
- Retrieval: Vector-based semantic search with similarity scoring
- Parent Aggregation: Cross-page chunk aggregation to parent page level
- Reranking: LLM-based relevance scoring for optimal context selection
- Generation: Structured answer generation with confidence scoring and source attribution (the query-side flow is sketched below)
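The query-side steps compose into a single call chain. A minimal sketch of that flow (the component names mirror the classes described later in this README, but the method signatures here are assumptions, not the actual module API):

```python
def answer_question(query: str, retriever, aggregator, reranker, generator):
    """Illustrative composition of the query pipeline (not the actual module API)."""
    chunks = retriever.search(query, top_k=20)        # vector retrieval
    pages = aggregator.to_parent_pages(chunks)        # parent page aggregation
    context = reranker.rerank(query, pages, top_n=5)  # LLM reranking
    return generator.generate(query, context)         # structured answer generation
```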
```bash
# Create conda environment
conda create -n rag python=3.10
conda activate rag

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```
OPENAI_API_KEY=your_openai_api_key_here
ENABLE_TELEMETRY=false  # Optional: disable telemetry for privacy
```
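A typical way to load these settings at startup, assuming the project uses `python-dotenv` (a sketch, not necessarily the repository's exact configuration code):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the project root

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # required
# Telemetry stays off unless explicitly enabled.
ENABLE_TELEMETRY = os.getenv("ENABLE_TELEMETRY", "false").lower() == "true"
```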
Key libraries include:

- `langchain` and `langchain-openai` for LLM integration
- `langgraph` for workflow orchestration
- `chromadb` for the vector database
- `openai` with GPT-4.1-mini for language generation
- `text-embedding-3-small` for document embeddings
- `pdfplumber` and `pypdf` for PDF processing
- `python-pptx` for PowerPoint processing
- `pandas` for Excel processing
- `tiktoken` for token counting
```bash
# Single question processing
python main.py "What are the key financial metrics mentioned in the documents?"

# Start interactive session
python main.py
# Follow prompts to add documents and ask questions
# Type 'quit' to exit
```

The system supports multiple document formats:
- PDF: Advanced layout detection with multi-column support
- PPTX: Complete content extraction including tables, charts, and images
- Excel: Multi-sheet processing with data preservation
- Environment variable handling
- Model configuration and constants
- System-wide settings management
- `PDFParser`: Advanced PDF text extraction with layout analysis
- `PPTXParser`: Comprehensive PowerPoint content extraction
- `ExcelParser`: Multi-sheet Excel data processing
- `UnifiedDocumentParser`: Format detection and routing (see the sketch below)
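Format detection and routing usually reduces to dispatch by file extension. A minimal sketch of the pattern (the real `UnifiedDocumentParser` may differ):

```python
from pathlib import Path

class UnifiedDocumentParser:
    """Detects the file format and routes to the matching parser (sketch)."""

    def __init__(self, pdf_parser, pptx_parser, excel_parser):
        self._parsers = {
            ".pdf": pdf_parser,
            ".pptx": pptx_parser,
            ".xlsx": excel_parser,
            ".xls": excel_parser,
        }

    def parse(self, path: str):
        suffix = Path(path).suffix.lower()
        if suffix not in self._parsers:
            raise ValueError(f"Unsupported document format: {suffix}")
        return self._parsers[suffix].parse(path)
```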
- `CrossPageTextSplitter`: Context-aware chunking across page boundaries
- `ParentPageAggregator`: Hierarchical chunk organization
- Token-aware splitting with configurable overlap (sketched below)
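Token-aware splitting with overlap can be sketched with `tiktoken`; the sliding-window sizes below are illustrative defaults, not the project's configured values:

```python
import tiktoken

def split_by_tokens(text: str, chunk_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping token windows (illustrative sketch)."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-3-small
    tokens = enc.encode(text)
    chunks, step = [], chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```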
- `VectorStoreManager`: Persistent storage with metadata recovery
- Batch processing for large document sets
- Automatic retry logic for API rate limits (see the sketch below)
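Batching with retry-on-rate-limit can be sketched directly against the OpenAI client; the batch size and backoff values are assumptions:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_in_batches(texts: list[str], batch_size: int = 100, max_retries: int = 5):
    """Embed texts in batches, backing off exponentially on rate limits (sketch)."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                resp = client.embeddings.create(
                    model="text-embedding-3-small", input=batch
                )
                vectors.extend(item.embedding for item in resp.data)
                break
            except RateLimitError:
                time.sleep(2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError(f"Batch starting at {i} failed after retries")
    return vectors
```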
- `VectorRetriever`: Semantic similarity search using embeddings
- `ParentPageAggregator`: Cross-page chunk aggregation to parent pages
- `LLMReranker`: GPT-4.1-mini based relevance scoring and reranking (sketched below)
- `HybridRetriever`: Complete retrieval pipeline orchestration
- Configurable retrieval parameters and scoring weights
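LLM reranking typically asks the model to score each candidate's relevance to the query and keeps the top results. A minimal sketch of that pattern (the prompt wording and 0-10 scale are illustrative, not the project's actual reranker):

```python
from openai import OpenAI

client = OpenAI()

def llm_rerank(query: str, pages: list[dict], top_n: int = 5) -> list[dict]:
    """Score each candidate page with the LLM and keep the best (sketch)."""
    scored = []
    for page in pages:
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Rate from 0 to 10 how relevant this passage is to the question. "
                    "Reply with a single integer.\n"
                    f"Question: {query}\nPassage: {page['text'][:2000]}"
                ),
            }],
        )
        score = int(resp.choices[0].message.content.strip())
        scored.append((score, page))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:top_n]]
```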
- `AnswerGenerator`: Structured response generation using GPT-4.1-mini (see the sketch below)
- Confidence scoring and uncertainty handling
- Source attribution and reasoning chains
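Structured output with confidence and sources can be sketched with a Pydantic schema and `langchain-openai`'s `with_structured_output`; the field names here are assumptions about the response shape:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class Answer(BaseModel):
    answer: str = Field(description="Direct answer to the question")
    reasoning: str = Field(description="Step-by-step reasoning chain")
    sources: list[str] = Field(description="Source documents with page references")
    confidence: float = Field(description="Confidence score between 0 and 1")

llm = ChatOpenAI(model="gpt-4.1-mini")
structured_llm = llm.with_structured_output(Answer)

def generate_answer(query: str, context: str) -> Answer:
    return structured_llm.invoke(
        "Answer using only the context below. "
        "Acknowledge uncertainty when the context is insufficient.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```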
- LangGraph state machine implementation (sketched below)
- Separate pipelines for document processing and querying
- Error handling and state management
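A minimal LangGraph state machine for the query pipeline might look like this (the node bodies are placeholders standing in for the retrieval and generation components above):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class QueryState(TypedDict):
    question: str
    context: str
    answer: str

def retrieve(state: QueryState) -> dict:
    # vector search, parent aggregation, and reranking would run here
    return {"context": "...retrieved context..."}

def generate(state: QueryState) -> dict:
    # structured answer generation would run here
    return {"answer": "...generated answer..."}

graph = StateGraph(QueryState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
app = graph.compile()

result = app.invoke({"question": "What were Q4 revenues?", "context": "", "answer": ""})
```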
- Multi-column PDF layout detection
- Table and chart extraction from presentations
- Cross-page text chunking with context preservation
- Metadata-rich document representation
- Vector-based semantic search with similarity scoring
- Parent page aggregation for cross-page chunk handling
- LLM-powered relevance reranking using GPT-4.1-mini
- Configurable retrieval parameters and scoring weights
- Source document tracking with page-level attribution
- Confidence-scored responses using GPT-4.1-mini
- Step-by-step reasoning chains
- Source attribution with page references
- Uncertainty acknowledgment
- Comprehensive query logging
- Processing time tracking
- Token usage monitoring
- Error rate analysis (a minimal logging sketch follows this list)
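Much of this can be captured with a timing wrapper around each pipeline stage; a minimal sketch (the project's actual logger likely records more fields, such as token counts taken from API responses):

```python
import functools
import logging
import time

logger = logging.getLogger("rag")

def timed(stage: str):
    """Decorator that logs duration and failures for a pipeline stage (sketch)."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                logger.exception("%s failed", stage)  # feeds error-rate analysis
                raise
            finally:
                logger.info("%s took %.2fs", stage, time.perf_counter() - start)
        return inner
    return wrap
```

Applying, say, `@timed("retrieval")` to the retriever entry point then yields per-query timing lines without touching the retrieval logic itself.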
- Batch processing for large document sets
- Persistent vector database with incremental updates
- Automatic retry logic for API failures
- Memory-efficient chunking strategies
The system follows a modular architecture with clear separation of concerns:
- Data Layer: Document parsing and storage management
- Processing Layer: Text chunking and vector embedding
- Retrieval Layer: Vector search, parent aggregation, and LLM reranking
- Generation Layer: LLM-based answer synthesis using GPT-4.1-mini
- Orchestration Layer: Workflow management and error handling
Each module is designed for independent testing and maintenance, with well-defined interfaces and comprehensive error handling.
- Batch processing prevents OpenAI API token limit violations
- Persistent vector storage eliminates reprocessing overhead
- Vector-based retrieval provides fast semantic search
- Parent page aggregation reduces redundant content
- Token counting prevents context window overflow (sketched after this list)
- Incremental document addition for large knowledge bases
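Token counting before generation is what keeps the assembled context inside the model's window; a minimal sketch with `tiktoken` (the budget value and encoding choice are assumptions to adjust per model):

```python
import tiktoken

def fit_context(pages: list[str], budget_tokens: int = 8000) -> str:
    """Greedily keep ranked pages until the token budget is spent (sketch)."""
    enc = tiktoken.get_encoding("o200k_base")  # encoding family for recent OpenAI chat models
    kept, used = [], 0
    for page in pages:  # assumed already sorted by relevance
        cost = len(enc.encode(page))
        if used + cost > budget_tokens:
            break
        kept.append(page)
        used += cost
    return "\n\n".join(kept)
```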
The system provides comprehensive logging including:
- Query processing times and token usage
- Retrieval effectiveness metrics
- Generation quality indicators
- Error tracking and debugging information
- Performance analytics for optimization
The modular design supports easy extension:
- Additional document format parsers
- Custom chunking strategies
- Alternative embedding models
- Enhanced retrieval algorithms
- Specialized generation pipelines