The Problem: Building production-ready RAG systems requires solving multiple complex challenges:
- Strategy Comparison: Testing different retrieval approaches (vector vs keyword vs hybrid)
- Interface Flexibility: Supporting both traditional APIs and modern AI agent protocols
- Performance Optimization: Balancing comprehensive processing with high-speed data access
- Evaluation Infrastructure: Measuring and comparing retrieval effectiveness
The Solution: A Dual Interface Architecture that eliminates code duplication while providing:
- FastAPI → MCP automatic conversion (zero duplication patterns)
- ⚡ Command vs Query optimization (full processing vs. direct data access)
- 6 retrieval strategies with built-in benchmarking
- Comprehensive observability via Phoenix telemetry integration
Traditional systems require maintaining separate codebases for different interfaces. Our Dual Interface Architecture solves this with:
```mermaid
graph TB
    A[FastAPI Endpoints] --> B[Automatic Conversion]
    B --> C[MCP Tools Server]
    A --> D[RAG Pipeline]
    D --> E[MCP Resources Server]

    subgraph Command ["Command Pattern - Full Processing"]
        C --> F[Complete RAG Pipeline]
        F --> G[LLM Synthesis]
        G --> H[Formatted Response]
    end

    subgraph Query ["Query Pattern - Direct Access"]
        E --> I[Vector Search Only]
        I --> J[Raw Results]
        J --> K[3-5x Faster Response]
    end

    style A fill:#e8f5e8
    style C fill:#fff3e0
    style E fill:#f3e5f5
    style H fill:#e3f2fd
    style K fill:#ffebee
```
Command Pattern (MCP Tools):
- Use Case: When you need complete RAG processing with LLM synthesis
- What Happens: Full pipeline → retrieval → synthesis → formatted answer
- Example: "What makes John Wick popular?" → Full analysis with context
Query Pattern (MCP Resources):
- Use Case: When you need fast, direct data access for further processing
- What Happens: Vector search only → raw results → no synthesis
- Example: `retriever://semantic/{query}` → Raw documents for agent consumption
Key Benefit: Same underlying system, optimized interfaces for different needs.
Deep Dive: See ARCHITECTURE.md for complete technical details and CQRS_IMPLEMENTATION_SUMMARY.md for implementation specifics.
- Compare retrieval strategies side-by-side (naive, BM25, ensemble, semantic, etc.)
- Production-ready patterns with error handling, caching, and monitoring
- Zero-setup evaluation with John Wick movie data for immediate testing
- Reference implementation of FastAPI → MCP conversion using FastMCP
- 6 working MCP tools ready for Claude Desktop integration
- Schema validation and compliance tooling
- HTTP API endpoints for integration into existing applications
- Hybrid search capabilities combining vector and keyword approaches
- LangChain LCEL patterns for chain composition
```mermaid
graph LR
    Q[Query: John Wick action scenes] --> S1[Naive Vector Similarity]
    Q --> S2[BM25 Keyword Search]
    Q --> S3[Contextual AI Reranking]
    Q --> S4[Multi-Query Expansion]
    Q --> S5[Ensemble Hybrid Mix]
    Q --> S6[Semantic Advanced Chunks]

    S1 --> R1[Fast Direct Embedding Match]
    S2 --> R2[Traditional IR Term Frequency]
    S3 --> R3[LLM-Powered Relevance Scoring]
    S4 --> R4[Multiple Queries Synthesized]
    S5 --> R5[Best of All Weighted Combination]
    S6 --> R6[Context-Aware Semantic Chunks]

    style S1 fill:#e8f5e8
    style S2 fill:#fff3e0
    style S3 fill:#f3e5f5
    style S4 fill:#e3f2fd
    style S5 fill:#ffebee
    style S6 fill:#f0f8f0
```
Strategy Details:
- Naive Retriever - Pure vector similarity, fastest baseline approach
- BM25 Retriever - Traditional keyword matching, excellent for exact term queries
- Contextual Compression - AI reranks results for relevance, highest quality
- Multi-Query - Generates query variations, best coverage
- Ensemble - Combines multiple methods, balanced performance
- Semantic - Advanced chunking strategy, context-optimized
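Each strategy above is exposed through its own `/invoke/{strategy}` FastAPI endpoint (the same pattern used in the Quick Start and debugging examples below). As a minimal sketch — using the `requests` library and assuming the server is running locally — a side-by-side comparison loop could look like this:

```python
import requests

# Hypothetical comparison loop: send the same question to every strategy
# endpoint and print each formatted answer for manual comparison.
STRATEGIES = [
    "naive_retriever",
    "bm25_retriever",
    "contextual_compression_retriever",
    "multi_query_retriever",
    "ensemble_retriever",
    "semantic_retriever",
]

payload = {"question": "What makes John Wick movies popular?"}

for strategy in STRATEGIES:
    resp = requests.post(
        f"http://localhost:8000/invoke/{strategy}",
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    print(f"--- {strategy} ---")
    print(resp.json())  # formatted answer plus retrieved context
```

For quantified metrics rather than eyeballing responses, use the evaluation scripts described later in this README.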
```mermaid
graph TB
    A[Single RAG Codebase] --> B[FastAPI Endpoints]
    A --> C[Shared Pipeline]
    B --> D[FastMCP Automatic Conversion]
    C --> E[Direct Resource Access]
    D --> F[MCP Tools Server Command Pattern]
    E --> G[MCP Resources Server Query Pattern]
    F --> H[Complete Processing 20-30 seconds]
    G --> I[Direct Data Access 3-5 seconds]

    style A fill:#e8f5e8
    style F fill:#fff3e0
    style G fill:#f3e5f5
    style H fill:#e3f2fd
    style I fill:#ffebee
```
Interface Benefits:
- FastAPI REST API - Traditional HTTP endpoints for web integration
- MCP Tools - AI agent workflows with full processing pipeline
- MCP Resources - High-speed data access for agent consumption
- Zero Code Duplication - Single codebase, multiple optimized interfaces
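The zero-duplication claim comes from FastMCP's ability to derive an MCP server directly from an existing FastAPI application. A minimal sketch of that conversion — the import path for the app object is assumed, not the project's actual module layout — might look like:

```python
from fastmcp import FastMCP

# Assumed import path; adjust to wherever the FastAPI app is defined.
from src.api.app import app

# Each FastAPI route becomes an MCP tool, so the REST API and the MCP
# Tools server share one codebase instead of two.
mcp = FastMCP.from_fastapi(app=app, name="advanced-rag")

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; MCP clients launch this process
```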
- Redis caching for performance
- Phoenix telemetry for monitoring
- Docker containerization for deployment
- Comprehensive test suite for reliability
- Docker & Docker Compose - Infrastructure services
- Python 3.13+ with uv package manager
- OpenAI API key - Required for LLM and embeddings
- Cohere API key - Required only for the reranking strategy (optional for basic functionality)
```bash
# 1. Environment & Dependencies
uv venv && source .venv/bin/activate && uv sync --dev

# 2. Infrastructure & Configuration
docker-compose up -d && cp .env.example .env
# Edit .env with your API keys

# 3. Data & Server
python scripts/ingestion/csv_ingestion_pipeline.py && python run.py

# 4. Test (in another terminal)
curl -X POST "http://localhost:8000/invoke/semantic_retriever" \
  -H "Content-Type: application/json" \
  -d '{"question": "What makes John Wick movies popular?"}'
```
For complete setup instructions, troubleshooting, and MCP integration: See docs/SETUP.md
```mermaid
flowchart TD
    A[AI Agent Task] --> B{Need complete processed answer?}
    B -->|Yes| C[MCP Tools: Command Pattern]
    B -->|No| D[MCP Resources: Query Pattern]

    C --> E[Use Case: Research Assistant]
    D --> F[Use Case: Data Gatherer]
    E --> G[Tool: semantic_retriever]
    F --> H[Resource: retriever://semantic/query]

    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e3f2fd
    style F fill:#ffebee
```
When to Use: Your AI agent needs a complete, ready-to-use answer
- ✅ Research and analysis workflows
- ✅ Question-answering systems
- ✅ Content generation with citations
- ✅ User-facing responses
Real Example:
```python
# Agent Task: "Analyze John Wick's popularity"
tool_result = mcp_client.call_tool("semantic_retriever", {
    "question": "What makes John Wick movies so popular?"
})
# Returns: Complete analysis with context and citations
```
Available Tools:
- `semantic_retriever` - Advanced semantic analysis with full context
- `ensemble_retriever` - Hybrid approach combining multiple strategies
- `contextual_compression_retriever` - AI-ranked results with filtering
- `multi_query_retriever` - Query expansion for comprehensive coverage
- `naive_retriever` - Fast baseline vector search
- `bm25_retriever` - Traditional keyword-based retrieval
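Any MCP client can enumerate and invoke the tools listed above, not just Claude Desktop. A rough sketch using the official `mcp` Python SDK over stdio — the server path and tool names come from this README, the rest is assumed:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the Tools server (Command Pattern) as a stdio subprocess.
    params = StdioServerParameters(command="python", args=["src/mcp/server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the available tools (semantic_retriever, bm25_retriever, ...).
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Invoke one tool end to end.
            result = await session.call_tool(
                "semantic_retriever",
                {"question": "What makes John Wick movies so popular?"},
            )
            print(result.content)


asyncio.run(main())
```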
When to Use: Your AI agent needs raw data for further processing (3-5x faster)
- ⚡ Multi-step workflows where the agent processes data further
- ⚡ Bulk data collection and analysis
- ⚡ Performance-critical applications
- ⚡ Custom synthesis pipelines
Real Example:
```python
# Agent Task: "Collect movie data for trend analysis"
raw_docs = mcp_client.read_resource("retriever://semantic/action movies")
# Returns: Raw documents for agent's custom analysis pipeline
```
Available Resources:
- `retriever://semantic_retriever/{query}` - Context-aware document retrieval
- `retriever://ensemble_retriever/{query}` - Multi-strategy hybrid results
- `retriever://naive_retriever/{query}` - Direct vector similarity search
- `system://health` - System status and configuration
| Interface | Processing Time | Use Case | Output Format |
|---|---|---|---|
| MCP Tools | ~20-30 seconds | Complete analysis | Formatted answer + context |
| MCP Resources | ~3-5 seconds ⚡ | Raw data collection | Document list + metadata |
| FastAPI | ~15-25 seconds | HTTP integration | JSON response |
```bash
# 1. Start both MCP servers
python src/mcp/server.py      # Tools (Command Pattern)
python src/mcp/resources.py   # Resources (Query Pattern)

# 2. Test the interfaces
python tests/integration/verify_mcp.py            # Verify tools work
python tests/integration/test_cqrs_resources.py   # Verify resources work

# 3. Use with Claude Desktop or other MCP clients
# Tools: For complete AI assistant responses
# Resources: For high-speed data collection workflows
```
The system integrates with external MCP servers for enhanced capabilities:
Data Storage & Memory:
- `qdrant-code-snippets` (Port 8002) - Code pattern storage and retrieval
- `qdrant-semantic-memory` (Port 8003) - Contextual insights and project decisions
- `memory` - Official MCP knowledge graph for structured relationships

Observability & Analysis:
- `phoenix` (localhost:6006) - Critical for AI agent observability and experiment tracking
- Access Phoenix UI data and experiments via MCP for agent behavior analysis

Development Tools:
- `ai-docs-server` - Comprehensive documentation access (Cursor, PydanticAI, MCP Protocol, etc.)
- `sequential-thinking` - Enhanced reasoning capabilities for complex problem-solving
Native Schema Discovery (Recommended):
```bash
# Start server with streamable HTTP
python src/mcp/server.py

# Native MCP discovery
curl -X POST http://127.0.0.1:8000/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"rpc.discover","params":{}}'
```
Legacy Schema Export (Development):
```bash
# Generate MCP-compliant schemas
python scripts/mcp/export_mcp_schema.py

# Validate against MCP 2025-03-26 specification
python scripts/mcp/validate_mcp_schema.py
```
```bash
# Compare all 6 retrieval strategies with quantified metrics
python scripts/evaluation/retrieval_method_comparison.py

# Run semantic architecture benchmark
python scripts/evaluation/semantic_architecture_benchmark.py

# View detailed results in Phoenix dashboard
open http://localhost:6006
```
This system implements Samuel Colvin's MCP telemetry patterns for comprehensive AI agent observability:
Key Features:
- Automatic Tracing: All retrieval operations and agent decision points
- Experiment Tracking: `johnwick_golden_testset` for performance analysis
- Real-time Monitoring: Agent behavior analysis and performance optimization
- Cross-session Memory: Three-tier memory architecture with external MCP services
Telemetry Use Cases:
- Agent Performance Analysis: Query Phoenix via MCP to understand retrieval strategy effectiveness
- Debugging Agent Decisions: Trace through agent reasoning with full context
- Performance Optimization: Identify bottlenecks in agent workflows using live telemetry data
- Experiment Comparison: Compare different RAG strategies with quantified metrics
Access Patterns:
```bash
# Direct Phoenix UI access
curl http://localhost:6006

# MCP-based Phoenix integration (via Claude Code CLI)
# Access Phoenix experiment data through MCP interface
# Query performance metrics across retrieval strategies
# Analyze agent decision patterns and effectiveness
```
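For programmatic access outside of MCP, the `arize-phoenix` Python client can pull the same trace data. A small sketch, assuming Phoenix is running on its default port as configured here (client API details may vary by Phoenix version):

```python
import phoenix as px

# Connect to the local Phoenix instance started by docker-compose.
client = px.Client(endpoint="http://localhost:6006")

# Pull recorded spans (retrieval calls, LLM calls, agent steps) as a
# DataFrame for offline analysis of latency and strategy effectiveness.
spans = client.get_spans_dataframe()
print(spans.head())                 # one row per traced operation
print(len(spans), "spans recorded")
```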
1. Knowledge Graph Memory (`memory` MCP):
- Structured entities, relationships, and observations
- User preferences and project team modeling
- Cross-session persistence of structured knowledge

2. Semantic Memory (`qdrant-semantic-memory`):
- Unstructured learning insights and decisions
- Pattern recognition across development sessions
- Contextual project knowledge

3. Telemetry Data (`phoenix`):
- Real-time agent behavior analysis
- `johnwick_golden_testset` performance benchmarking
- Quantified retrieval strategy effectiveness
```mermaid
graph TB
    subgraph Clients ["Client Interfaces"]
        A[HTTP Clients: curl, apps]
        B[MCP Clients: Claude Desktop, AI Agents]
    end

    subgraph Interface ["Dual Interface Layer"]
        C[FastAPI Server REST Endpoints]
        D[MCP Tools Server Command Pattern]
        E[MCP Resources Server Query Pattern]
    end

    subgraph Pipeline ["RAG Pipeline Core"]
        F[6 Retrieval Strategies]
        G[LangChain LCEL Chains]
        H[OpenAI LLM Integration]
        I[Embedding Models]
    end

    subgraph Infrastructure ["Data & Infrastructure"]
        J[Qdrant Vector DB johnwick collections]
        K[Redis Cache Performance Layer]
        L[Phoenix Telemetry Observability]
    end

    A --> C
    B --> D
    B --> E
    C -.->|FastMCP Conversion| D
    D --> F
    E --> J
    F --> G
    G --> H
    G --> I
    F --> J
    G --> K
    C --> L
    D --> L
    E --> L

    style A fill:#e8f5e8
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#ffebee
    style F fill:#f0f8f0
```
Interface Layer (Full Details):
- FastAPI Server: Traditional REST API with 6 retrieval endpoints
- MCP Tools: Automatic FastAPI → MCP conversion using FastMCP
- MCP Resources: Native CQRS implementation for direct data access
RAG Processing Core (Implementation Guide):
- Strategy Factory: 6 different retrieval approaches (naive → ensemble)
- LangChain LCEL: Composable chain patterns for all strategies
- Model Integration: OpenAI GPT-4.1-mini + text-embedding-3-small
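To make the LCEL point concrete, here is a minimal chain in the composable shape the strategies share. The `retriever` argument and prompt wording are placeholders (not the project's actual implementation); the model name comes from the list above:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)


def build_chain(retriever):
    # retriever stands in for any of the six strategies (naive, BM25,
    # ensemble, ...); each one plugs into the same LCEL composition:
    # dict of runnables -> prompt -> model -> string output.
    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


# answer = build_chain(retriever).invoke("What makes John Wick movies popular?")
```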
Data & Observability (Setup Guide):
- Qdrant Collections: Vector storage with semantic chunking
- Redis Caching: Performance optimization with TTL management
- Phoenix Telemetry: Complete request tracing and experiment tracking
- `src/api/` - FastAPI endpoints and request handling
- `src/rag/` - RAG pipeline components (retrievers, chains, embeddings)
- `src/mcp/` - MCP server implementation and resources
- `src/core/` - Shared configuration and utilities
- `tests/` - Comprehensive test suite
- `scripts/` - Data ingestion and evaluation utilities
Scenario: You're building a document analysis system and need to find the optimal retrieval approach.
Workflow:
```bash
# 1. Test all strategies with your domain data
python scripts/evaluation/retrieval_method_comparison.py

# 2. Compare performance in Phoenix dashboard
open http://localhost:6006

# 3. Choose the best strategy for your use case
# Naive: Fast baseline | BM25: Exact keywords | Ensemble: Best overall
```
Value: Compare 6 different approaches with quantified metrics instead of guessing.
Scenario: Your AI agent needs intelligent document retrieval for research tasks.
Command Pattern (Complete Analysis):
```python
# Agent: "Analyze this topic comprehensively"
response = await mcp_client.call_tool("semantic_retriever", {
    "question": "What are the key themes in John Wick movies?"
})
# Returns: Complete analysis with citations ready for user
```
Query Pattern (Data Collection):
```python
# Agent: "Gather data for multi-step analysis"
docs = await mcp_client.read_resource("retriever://ensemble/action movies")
# Returns: Raw documents for agent's custom synthesis pipeline
```
Value: Choose optimal interface based on agent workflow needs.
Scenario: You're building a customer support system that needs contextual responses.
HTTP API Integration:
```bash
# Real-time customer query processing
curl -X POST "http://localhost:8000/invoke/contextual_compression_retriever" \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I troubleshoot connection issues?"}'
```
Benefits:
- Redis Caching: Sub-second responses for repeated queries
- Phoenix Telemetry: Monitor query patterns and performance
- Multiple Strategies: A/B test different retrieval approaches
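As a rough illustration of the Redis caching layer mentioned above — the key prefix, TTL value, and `answer_question` callable are placeholders, not the project's actual implementation:

```python
import hashlib

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 3600  # assumed TTL; tune per deployment


def cached_answer(question: str, answer_question) -> str:
    """Return a cached answer when available, otherwise compute and cache it."""
    key = "rag:" + hashlib.sha256(question.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit  # sub-second path for repeated queries
    answer = answer_question(question)  # full RAG pipeline call
    cache.setex(key, TTL_SECONDS, answer)
    return answer
```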
Scenario: You're building an AI system that processes hundreds of queries per hour.
Interface Selection Strategy:
- MCP Resources (3-5 sec): Bulk data collection, preprocessing pipelines
- MCP Tools (20-30 sec): User-facing analysis, final report generation
- FastAPI (15-25 sec): Traditional web application integration
Scaling Pattern:
```mermaid
graph LR
    A[High Volume Queries] --> B[MCP Resources Fast Data Collection]
    B --> C[Agent Processing Pipeline]
    C --> D[MCP Tools Final Synthesis]
    D --> E[User Response]

    style B fill:#ffebee
    style D fill:#fff3e0
```
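In code, that scaling pattern amounts to gathering raw documents through the fast Resources interface and reserving the Tools interface for the final synthesis step. A hedged sketch, reusing the `mcp_client` style from the agent examples above:

```python
async def trend_report(topics: list[str], mcp_client) -> str:
    # Stage 1 (Query pattern, ~3-5s each): bulk-collect raw documents.
    collected = []
    for topic in topics:
        docs = await mcp_client.read_resource(
            f"retriever://semantic_retriever/{topic}"
        )
        collected.append((topic, docs))

    # Stage 2: the agent's own processing over the raw documents
    # (filtering, deduplication, clustering) would happen here.
    shortlist = [topic for topic, docs in collected if docs]

    # Stage 3 (Command pattern, ~20-30s, once): a single synthesized answer.
    summary = await mcp_client.call_tool(
        "semantic_retriever",
        {"question": f"Summarize common themes across: {', '.join(shortlist)}"},
    )
    return summary
```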
Scenario: You're researching RAG effectiveness for your domain.
Research Workflow:
```bash
# 1. Ingest your domain data
python scripts/ingestion/csv_ingestion_pipeline.py

# 2. Run comprehensive benchmarks
python scripts/evaluation/semantic_architecture_benchmark.py

# 3. Export Phoenix experiment data
# Use Phoenix MCP integration to query results programmatically
```
Published Capabilities:
- Reproducible Experiments: Deterministic model pinning
- Quantified Comparisons: All 6 strategies with performance metrics
- Open Architecture: Extend with your own retrieval methods
For complete setup instructions, see SETUP.md
```bash
# Copy environment template
cp .env.example .env

# Edit with your API keys
OPENAI_API_KEY=your_key_here
COHERE_API_KEY=your_key_here
```
```bash
# Start supporting services
docker-compose up -d

# Verify services
curl http://localhost:6333   # Qdrant
curl http://localhost:6006   # Phoenix
curl http://localhost:6379   # Redis
```
```bash
# Run complete data pipeline
python scripts/ingestion/csv_ingestion_pipeline.py

# Verify collections created
curl http://localhost:6333/collections
```
```bash
# Run full test suite
pytest tests/ -v

# Test MCP integration
python tests/integration/verify_mcp.py

# Test API endpoints
bash tests/integration/test_api_endpoints.sh
```
```bash
# Start development server
python run.py

# Start MCP Tools server
python src/mcp/server.py

# Start MCP Resources server
python src/mcp/resources.py

# Run benchmarks
python scripts/evaluation/retrieval_method_comparison.py

# View telemetry
open http://localhost:6006
```
```bash
# Test specific retrieval strategy
curl -X POST "http://localhost:8000/invoke/ensemble_retriever" \
  -H "Content-Type: application/json" \
  -d '{"question": "Your test question"}'

# Test MCP tool directly
python -c "
from src.mcp.server import mcp
# Test tool invocation
"
```
- docs/SETUP.md - Complete setup guide
- docs/QUICK_REFERENCE.md - Daily commands and validation
- docs/TROUBLESHOOTING.md - Problem-solving guide
- docs/ARCHITECTURE.md - Deep technical details
- Follow the tiered architecture patterns in the codebase
- Add tests for new functionality
- Update documentation for API changes
- Validate MCP schema compliance
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use - Use in commercial projects
- ✅ Modification - Modify and distribute modified versions
- ✅ Distribution - Distribute original or modified versions
- ✅ Private use - Use privately without restrictions
- ⚠️ Attribution required - Include copyright notice and license
- ❌ No warranty - Software provided "as is"
```mermaid
flowchart TD
    A[I want to...] --> B[Try RAG strategies with sample data]
    A --> C[Integrate with my AI agent]
    A --> D[Build a production application]
    A --> E[Research RAG effectiveness]

    B --> F[Quick Start 4-step setup above]
    C --> G[MCP Integration Claude Desktop guide]
    D --> H[Architecture Docs ARCHITECTURE.md]
    E --> I[Evaluation Scripts Phoenix telemetry]

    style F fill:#e8f5e8
    style G fill:#fff3e0
    style H fill:#f3e5f5
    style I fill:#ffebee
```
- Start Here: 4-Step Quick Start - Get running in 5 minutes
- Understand the Architecture: Why Dual Interface? - Core concepts explained
- Production Setup: docs/SETUP.md - Complete installation guide
- Daily Commands: docs/QUICK_REFERENCE.md - Commands and validation
- Deep Technical Details: docs/ARCHITECTURE.md - Complete system design
- Troubleshooting: docs/TROUBLESHOOTING.md - Common issues and solutions
Current Status: ✅ FULLY OPERATIONAL (Validated 2025-06-23)
Quick Validation Check:
```bash
# Run complete validation suite (recommended)
bash scripts/validation/run_system_health_check.sh

# Quick system status only
python scripts/status.py --verbose
```
What's Validated:
- ✅ All 5 Tiers: Environment → Infrastructure → Application → MCP → Data
- ✅ All 6 Retrieval Strategies: Working with proper context (3-10 docs per query)
- ✅ Dual MCP Interfaces: 8 Tools + 5 Resources functional
- ✅ Performance: Sub-30 second response times verified
- ✅ Phoenix Telemetry: Real-time tracing and experiment tracking
System validated 2025-06-23 - All 5 tiers operational, 6 retrieval strategies functional
- Try the System - Follow the 4-step quick start above
- Validate Your Setup - Run `bash scripts/validation/run_system_health_check.sh`
- Explore Strategies - Run `python scripts/evaluation/retrieval_method_comparison.py`
- Integrate with Agents - Connect to Claude Desktop or build custom MCP clients
- Scale to Production - Use Docker deployment and Redis caching
- Contribute - Submit issues, improvements, or new retrieval strategies
⭐ Star this repo if it's useful! | Contribute | Full Documentation