Welcome, brave soul, to the wild world of Retrieval-Augmented Generation evaluation! You're about to embark on a journey that's part software engineering, part data science, and part digital archaeology. Don't worry, we've got your back.
For a deep-dive into the technical concepts and a complete walkthrough of the code, see the full technical guide.
- How to implement and compare 6 different retrieval strategies
- The difference between semantic search, keyword search (BM25), and hybrid approaches
- How to use Phoenix for LLM observability and debugging
- Setting up a vector database with PostgreSQL and pgvector
- Complete RAGAS implementation for golden test sets and automated evaluation
Want deeper technical insights? Check out the DeepWiki documentation for architecture diagrams and detailed explanations.
- Python 3.13+ (required by the project)
- Basic understanding of LLMs and embeddings
- Familiarity with async Python (we use asyncio)
- ~$5 in API credits (OpenAI + Cohere)
- Docker installed and running
You're getting a complete 3-stage RAG evaluation pipeline! This isn't just a foundation - it's a full toolkit that takes you from infrastructure setup through automated evaluation with RAGAS golden test sets.
You're building:
- Stage 1: A comparison engine for 6 different retrieval strategies
- Stage 2: Automated test generation with RAGAS golden datasets
- Stage 3: Systematic evaluation with metrics and experiment tracking
Think of it as a complete RAG evaluation laboratory, with research PDFs (AI/HCI literature) as your test dataset. By the end, you'll have objective metrics telling you which retrieval strategy performs best.
This toolkit implements a production-ready evaluation pipeline that progresses through three stages:
Script: langchain_eval_foundations_e2e.py
- Sets up PostgreSQL with pgvector for hybrid search
- Implements 6 different retrieval strategies
- Provides side-by-side comparison with Phoenix tracing
- You learn: How different retrieval methods work and when to use each
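Before moving on, here's a minimal, self-contained sketch of the core idea behind several of those strategies: combine a semantic (embedding) retriever with keyword-based BM25 through LangChain's EnsembleRetriever. The in-memory vector store and the two document texts are stand-ins for illustration only; the actual pipeline stores vectors in PostgreSQL/pgvector.

```python
# Sketch only: hybrid retrieval = semantic (embeddings) + keyword (BM25),
# blended with LangChain's EnsembleRetriever (weighted reciprocal rank fusion).
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(page_content="Perceived reliability and transparency shape user trust in AI."),
    Document(page_content="Novice users often struggle to phrase effective prompts."),
]

# Semantic search: rank by embedding similarity (in-memory stand-in for pgvector).
semantic_retriever = InMemoryVectorStore.from_documents(
    docs, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 2})

# Keyword search: BM25 ranks by term overlap, no embeddings required.
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

# Hybrid: merge both rankings with equal weights.
hybrid_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever], weights=[0.5, 0.5]
)
print(hybrid_retriever.invoke("What factors influence user trust in AI systems?"))
```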
Script: langchain_eval_golden_testset.py
- Uses RAGAS to automatically generate diverse test questions
- Creates ground-truth answers and reference contexts
- Uploads datasets to Phoenix for experiment tracking
- You get: A reusable test set for consistent evaluation
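If you're curious what that generation step looks like in code, here is a hedged sketch in the style of recent RAGAS releases. Class and argument names have shifted between RAGAS versions, and the loader choice here is an assumption, so treat langchain_eval_golden_testset.py as the source of truth.

```python
# Hedged sketch of RAGAS golden test set generation; exact API details
# vary by RAGAS version.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

docs = PyPDFDirectoryLoader("data/").load()  # the research PDFs

generator = TestsetGenerator(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embedding_model=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
)

# Produces questions, reference answers, and reference contexts.
golden_testset = generator.generate_with_langchain_docs(docs, testset_size=10)
print(golden_testset.to_pandas().head())
```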
Script: langchain_eval_experiments.py
- Runs all strategies against the golden test set
- Calculates QA correctness and relevance metrics
- Provides quantitative rankings and performance data
- You discover: Which strategy objectively performs best
All three stages work together to give you a complete evaluation workflow from setup to metrics!
# 1. Clone and setup
git clone <repo-url>
cd rag-eval-foundations
cp .env.example .env # Edit with your API keys
# 2. Install dependencies
uv venv --python 3.13 && source .venv/bin/activate
uv sync
# 3. Run the complete pipeline
python claude_code_scripts/run_rag_evaluation_pipeline.py

The orchestration script will:

- Validate your environment and API keys
- Start Docker services (PostgreSQL + Phoenix)
- Execute all 3 pipeline steps in correct order
- Generate comprehensive evaluation results
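Conceptually, the orchestration boils down to running the three stage scripts in dependency order and stopping at the first failure. The sketch below illustrates that shape only; the real script layers environment validation, Docker management, and logging on top.

```python
# Conceptual sketch of the orchestration, not the actual script.
import subprocess
import sys

STAGES = [
    "src/langchain_eval_foundations_e2e.py",
    "src/langchain_eval_golden_testset.py",
    "src/langchain_eval_experiments.py",
]

for stage in STAGES:
    print(f"Running {stage} ...")
    result = subprocess.run([sys.executable, stage])
    if result.returncode != 0:
        sys.exit(f"{stage} failed with exit code {result.returncode}")

print("All pipeline stages completed.")
```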
# 1. Clone and setup
git clone <repo-url>
cd rag-eval-foundations
cp .env.example .env # Edit with your API keys
# 2. Start services
docker-compose up -d # Or use individual docker run commands below
# 3. Install and run manually
uv venv --python 3.13 && source .venv/bin/activate
uv sync
python src/langchain_eval_foundations_e2e.py
python src/langchain_eval_golden_testset.py
python src/langchain_eval_experiments.py

Now that you've got the pipeline running, here's where to go next:
- DeepWiki Documentation: Interactive architecture diagrams, detailed retrieval strategy analysis, and performance benchmarks
- Technical Blog Post: Complete walkthrough of the implementation with code examples
- Validation Scripts: Interactive tools to explore your data and compare strategies
  - postgres_data_analysis.py: Visualize embeddings and chunking strategies
  - retrieval_strategy_comparison.py: Benchmark and compare all 6 strategies
  - validate_telemetry.py: Understand Phoenix tracing in depth
- Learn about RAGAS golden test sets for automated evaluation
- Explore Phoenix documentation for advanced observability
- Check out LangChain's retriever docs for custom implementations
The claude_code_scripts/run_rag_evaluation_pipeline.py script provides a comprehensive, repeatable process for executing all 3 pipeline steps with proper error handling and logging.
- Environment Validation: Checks .env file, API keys, and dependencies
- Service Management: Automatically starts Docker services if needed
- Step-by-Step Execution: Runs all 3 scripts in correct dependency order
- Comprehensive Logging: Detailed logs with timestamps and progress tracking
- Error Handling: Graceful failure recovery and clear error messages
# Standard execution (recommended)
python claude_code_scripts/run_rag_evaluation_pipeline.py
# Skip Docker service management (if already running)
python claude_code_scripts/run_rag_evaluation_pipeline.py --skip-services
# Enable verbose debug logging
python claude_code_scripts/run_rag_evaluation_pipeline.py --verbose
# Get help
python claude_code_scripts/run_rag_evaluation_pipeline.py --help

The pipeline runs three scripts in dependency order:

- Main E2E Pipeline (langchain_eval_foundations_e2e.py)
  - Loads documents from configured sources (PDFs by default)
  - Creates PostgreSQL vector stores
  - Tests 6 retrieval strategies
  - Generates Phoenix traces
- Golden Test Set Generation (langchain_eval_golden_testset.py)
  - Uses RAGAS to generate evaluation questions
  - Uploads test set to Phoenix for experiments
- Automated Experiments (langchain_eval_experiments.py)
  - Runs systematic evaluation on all strategies
  - Calculates QA correctness and relevance scores
  - Creates detailed experiment reports in Phoenix
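To make "QA correctness" concrete, here is a simplified, standalone sketch of the LLM-as-judge idea: compare each strategy's answer against the golden reference and return a binary score. The actual experiments script computes and records these scores through Phoenix experiments rather than this exact helper.

```python
# Simplified LLM-as-judge scoring for "QA correctness"; illustration only.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
grading_prompt = ChatPromptTemplate.from_template(
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n\n"
    "Does the candidate answer convey the same information as the reference? "
    "Reply with exactly one word: CORRECT or INCORRECT."
)
grader = grading_prompt | judge

def qa_correctness(question: str, reference: str, candidate: str) -> float:
    """Return 1.0 if the judge accepts the candidate answer, else 0.0."""
    verdict = grader.invoke(
        {"question": question, "reference": reference, "candidate": candidate}
    ).content.strip().upper()
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```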
The pipeline supports multiple document formats and can be configured via the Config class in langchain_eval_foundations_e2e.py:
Default Configuration (current):
load_pdfs: bool = True # Research PDFs from data/ directory (enabled)
load_csvs: bool = False # CSV datasets (disabled by default)
load_markdowns: bool = True # Markdown documents (enabled)

Data Sources by Type:
- PDFs: Research papers, technical documents (AI/HCI literature included)
- CSVs: Structured datasets with metadata columns
- Markdown: Documentation files split on header boundaries
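Conceptually, those flags just gate which loaders run. The sketch below uses standard LangChain community loaders to illustrate the branching; the project's actual data_loader.py may use different loader classes or chunking settings.

```python
# Illustrative loader gated by the config flags above (an assumption, not
# the project's exact code).
from pathlib import Path

from langchain_community.document_loaders import CSVLoader, PyPDFDirectoryLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter

def load_documents(load_pdfs=True, load_csvs=False, load_markdowns=True, data_dir="data"):
    docs = []
    if load_pdfs:
        docs += PyPDFDirectoryLoader(data_dir).load()
    if load_csvs:
        for csv_path in Path(data_dir).glob("*.csv"):
            docs += CSVLoader(str(csv_path)).load()
    if load_markdowns:
        # Split markdown files on header boundaries, as described above.
        splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2")]
        )
        for md_path in Path(data_dir).glob("*.md"):
            docs += splitter.split_text(md_path.read_text())
    return docs
```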
Example Data Included:
- Current dataset: 269 PDF documents on human-LLM interaction and AI usage research
- Topics: Prompt engineering, trust calibration, cognitive collaboration, interface design
To use different data:
- Place your PDFs in the data/ directory
- Update config flags in langchain_eval_foundations_e2e.py
- Run the pipeline - the system automatically adapts to your documents
Example queries for current research data:
- "What factors influence user trust in AI systems?"
- "How do novice users struggle with prompt engineering?"
- "What design strategies mitigate automation bias?"
The script creates detailed logs in the logs/ directory with timestamps. All output includes:
- Success indicators for each step
- Execution time tracking
- Direct links to Phoenix UI for viewing results
- Summary statistics and experiment IDs
This project uses uv, an extremely fast Python package and project manager.
- Install uv

  If you don't have uv installed, open your terminal and run the official installer:

  # Install uv (macOS & Linux)
  curl -LsSf https://astral.sh/uv/install.sh | sh

  For Windows and other installation methods, please refer to the official uv installation guide.

- Create Environment & Install Dependencies

  With uv installed, you can create a virtual environment and install all the necessary packages from pyproject.toml in two commands:

  # Create a virtual environment with Python 3.13+
  uv venv --python 3.13

  # Activate the virtual environment
  # On macOS/Linux:
  source .venv/bin/activate
  # On Windows (CMD):
  # .venv\Scripts\activate.bat

  # Install dependencies into the virtual environment
  uv sync
If you're new to uv, think of uv venv as a replacement for python -m venv and uv sync as a much faster version of pip install -r requirements.txt.
Create a .env file (because hardcoding API keys is how we end up on r/ProgrammerHumor):
OPENAI_API_KEY=sk-your-actual-key-not-this-placeholder
COHERE_API_KEY=your-cohere-key-goes-here
PHOENIX_COLLECTOR_ENDPOINT="http://localhost:6006"
# Optional:
HUGGINGFACE_TOKEN=hf_your_token_here
PHOENIX_CLIENT_HEADERS='...' # For cloud Phoenix instances

Note: See .env.example for a complete template with all supported variables.
Pro tip: Yes, you need both keys. Yes, they cost money. Yes, it's worth it. Think of it as buying premium gas for your AI Ferrari.
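For reference, the scripts typically pick these values up with python-dotenv along these lines (a sketch, not the project's exact loading code):

```python
# Load .env and read the keys the pipeline needs.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.environ["OPENAI_API_KEY"]    # required
cohere_key = os.environ["COHERE_API_KEY"]    # required for reranking
phoenix_endpoint = os.getenv(
    "PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006"
)
```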
Quick heads-up: We're using interactive mode (-it --rm) for easy cleanup - when you kill these containers, all data vanishes. Perfect for demos, terrible if you want to keep anything. For persistent setups, use docker-compose instead.
docker run -it --rm --name pgvector-container \
-e POSTGRES_USER=langchain \
-e POSTGRES_PASSWORD=langchain \
-e POSTGRES_DB=langchain \
-p 6024:5432 \
pgvector/pgvector:pg16

This is your vector database. It's like a regular database, but it can do math with meanings. Fancy stuff.
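Once the container is running, LangChain can talk to it through the langchain_postgres package. The connection string below mirrors the credentials and port from the docker run command; the collection name is purely illustrative, since the pipeline creates its own tables.

```python
# Sketch: connecting LangChain's PGVector store to the container above.
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"

vector_store = PGVector(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="my_demo_collection",  # hypothetical name for this sketch
    connection=connection,
    use_jsonb=True,
)
vector_store.add_texts(["pgvector stores embeddings alongside regular rows."])
print(vector_store.similarity_search("How are embeddings stored?", k=1))
```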
export TS=$(date +"%Y%m%d_%H%M%S")
docker run -it --rm --name phoenix-container \
-e PHOENIX_PROJECT_NAME="retrieval-comparison-${TS}" \
-p 6006:6006 \
-p 4317:4317 \
arizephoenix/phoenix:latest

Phoenix watches everything your AI does and tells you where it went wrong. It's like having a really helpful, non-judgmental therapist for your code.
- Port 6006: Phoenix UI (view traces here)
- Port 4317: OpenTelemetry collector (receives trace data)
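With the container up, the scripts point LangChain traces at it roughly like this. This is a sketch assuming the arize-phoenix-otel and openinference-instrumentation-langchain packages; the project's exact setup may differ.

```python
# Sketch: register a tracer against the local Phoenix collector and
# auto-instrument LangChain so every chain/retriever call becomes a trace.
from openinference.instrumentation.langchain import LangChainInstrumentor
from phoenix.otel import register

tracer_provider = register(
    project_name="retrieval-comparison",
    endpoint="http://localhost:6006/v1/traces",  # Phoenix HTTP collector
)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
# From here on, LangChain calls show up as traces at http://localhost:6006.
```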
docker-compose up -d

This starts both PostgreSQL and Phoenix with the correct settings.
python src/langchain_eval_foundations_e2e.py

What should happen:
- Loads research PDF documents (AI/HCI literature from the data/ directory)
- Creates fancy vector tables in PostgreSQL
- Tests 6 different ways to search for information
- Shows you which method gives the best answers
- Sends traces to Phoenix UI at http://localhost:6006
Expected runtime: 2-5 minutes (depending on number of PDFs loaded)
"ModuleNotFoundError: No module named 'whatever'"
- Translation: You forgot to install something
- Solution: Make sure you ran uv sync after activating your venv
- Encouragement: Even senior developers forget to activate their venv
"Connection refused" or "Port already in use"
- Translation: Docker containers aren't happy
- Solution:

  docker ps                        # Check what's running
  docker logs pgvector-container   # Check PostgreSQL logs
  docker logs phoenix-container    # Check Phoenix logs

  # If ports are taken:
  lsof -i :6024   # Check what's using PostgreSQL port
  lsof -i :6006   # Check what's using Phoenix port
- Encouragement: Docker is like a moody teenager; sometimes you just need to restart everything
"Invalid API key" or "Rate limit exceeded"
- Translation: OpenAI/Cohere is giving you the cold shoulder
- Solution: Check your .env file and verify your API keys have credits
- Encouragement: At least the error is clear! Better than "something went wrong"
"Async this, await that, event loop already running"
- Translation: Python's async system is having an existential crisis
- Solution: Restart your Python session, try again
- Encouragement: Async programming is hard. If it was easy, we'd all be doing it
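For context, asyncio.run() refuses to start when an event loop is already running (the classic Jupyter/IPython situation). A minimal pattern that avoids the error in plain scripts, plus the usual notebook workarounds:

```python
import asyncio

async def main() -> None:
    ...  # the pipeline's async work goes here

if __name__ == "__main__":
    # Fine in a plain script; raises "event loop is already running" in Jupyter.
    asyncio.run(main())

# In a notebook, either `await main()` directly, or run
#   import nest_asyncio; nest_asyncio.apply()
# once before calling asyncio.run(main()).
```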
Copy your error message and ask:
"I'm running this RAG evaluation foundations setup with LangChain, PostgreSQL, and Phoenix. I'm getting this error: [paste error]. The code is supposed to compare different retrieval strategies for research PDFs. What's going wrong?"
Why this works: AI assistants are surprisingly good at debugging, especially when you give them context. They won't judge you for that typo in line 47.
# Kill all Docker containers
docker kill $(docker ps -q)
# Clear Python cache (sometimes helps with import issues)
find . -type d -name __pycache__ -exec rm -rf {} +
# Start over with containers
# (Re-run the docker commands from above)

- The script runs without errors
- You see a DataFrame with 6 different responses
- Phoenix UI at http://localhost:6006 shows your traces (click on a trace to see the full execution flow)
- PostgreSQL has tables full of research document embeddings
- You feel like a wizard who just summoned 6 different search spirits
- Complete 3-stage evaluation pipeline from infrastructure to automated metrics
- 6 retrieval strategies implemented and ready for comparison
- RAGAS golden test sets for consistent, repeatable evaluation
- Automated scoring with QA correctness and relevance metrics
- Phoenix observability tracking every operation and experiment
- Production-ready foundation that you can extend for your use case
Deep dive into retrieval strategies: See DeepWiki's technical comparison for detailed performance analysis and architecture insights.
- Stage 1: Infrastructure setup and manual strategy comparison
- Stage 2: RAGAS-powered golden test set generation
- Stage 3: Automated experiments with objective metrics
- Validation Scripts: Interactive tools for deeper analysis
- Phoenix Integration: Full observability and experiment tracking
Remember: every AI engineer has been exactly where you are right now. The difference between a beginner and an expert isn't that experts don't encounter errors; it's that they've learned to Google them more effectively.
You're building a complete evaluation pipeline while learning the vocabulary: how BM25 differs from semantic search, why ensemble methods matter, and what Phoenix traces tell you about retriever performance. This hands-on experience is what separates engineers who can copy-paste code from those who can architect real solutions. (Learn more about retrieval strategies →)
You're not just running code; you're learning to think in retrievers with a complete toolkit at your disposal.
What's next for you: Now that you have the complete pipeline, you can:
- Customize evaluation metrics for your domain
- Add new retrieval strategies to compare
- Scale to larger datasets
- Integrate into CI/CD for continuous evaluation
Now go forth and retrieve! The vectors are waiting.
Navigate our comprehensive documentation based on your needs:
| Document | Purpose | Best For |
|---|---|---|
| This README | Quick start, setup, troubleshooting | Getting started, practical usage |
| Technical Journey | Detailed 3-stage progression guide | Understanding the complete pipeline |
| Blog Post | Deep dive into Stage 1 implementation | Learning retrieval strategies in depth |
| DeepWiki | Interactive Q&A and exploration | Quick answers, architecture insights |
- Main Scripts:
  - langchain_eval_foundations_e2e.py - Stage 1: Foundation & infrastructure
  - langchain_eval_golden_testset.py - Stage 2: RAGAS golden test set generation
  - langchain_eval_experiments.py - Stage 3: Automated evaluation & metrics
  - data_loader.py - Utilities for loading data
- Location: diagrams/ folder contains Excalidraw source files
- Viewing: Use VS Code Excalidraw extension or excalidraw.com
- Exports: PNG/SVG versions in diagrams/exports/ (when available)
- Current Status: Work in progress - see diagrams/README.md for details
- Interactive Diagrams: View system architecture and data flow diagrams on DeepWiki
The validation/ directory contains interactive scripts for exploring and validating the RAG system components.
# 1. Ensure services are running
docker-compose up -d
# 2. Run the main pipeline first to populate data
python claude_code_scripts/run_rag_evaluation_pipeline.py

Tip: These scripts provide hands-on exploration of concepts covered in the DeepWiki technical documentation.
python validation/postgres_data_analysis.py

Purpose: Comprehensive analysis of the vector database
- Analyzes document distribution and metadata
- Compares baseline vs semantic chunking strategies
- Generates PCA visualization of embeddings
- Outputs: Creates 3 PNG charts in outputs/charts/postgres_analysis/
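The PCA chart boils down to projecting the stored embedding vectors down to two dimensions and scattering them. A stripped-down sketch with random stand-in vectors (the real script reads actual embeddings from PostgreSQL and styles its charts differently):

```python
# Project embeddings to 2D with PCA and save a scatter plot.
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(200, 1536)  # stand-in for vectors read from pgvector
coords = PCA(n_components=2).fit_transform(embeddings)

out_dir = Path("outputs/charts/postgres_analysis")
out_dir.mkdir(parents=True, exist_ok=True)

plt.scatter(coords[:, 0], coords[:, 1], s=8, alpha=0.6)
plt.title("Document embeddings projected to 2D with PCA")
plt.savefig(out_dir / "pca_demo.png")
```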
python validation/validate_telemetry.py

Purpose: Demonstrates Phoenix OpenTelemetry tracing integration
- Tests various LLM chain patterns with tracing
- Shows streaming responses with real-time trace updates
- Validates token usage and latency tracking
- View traces: http://localhost:6006
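Streaming with tracing is just a normal streamed LLM call once instrumentation is registered (see the Phoenix setup above); latency and, when available, token usage land on the resulting trace. A small sketch:

```python
# Streamed LLM call: chunks print as they arrive; the span appears in Phoenix.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
for chunk in llm.stream("Summarize what BM25 does in one sentence."):
    print(chunk.content, end="", flush=True)
print()
```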
python validation/retrieval_strategy_comparison.py

Purpose: Interactive comparison of all 6 retrieval strategies
- Compares naive, semantic, BM25, compression, multiquery, and ensemble strategies
- Runs performance benchmarks across strategies
- Demonstrates query-specific strategy strengths
- Outputs: Performance visualization in outputs/charts/retrieval_analysis/
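A benchmark of this kind can be as simple as timing each retriever over the same queries. The helper below assumes you already have a dict of constructed retrievers (for example, the ones built in Stage 1); it is an illustration, not the validation script's actual code.

```python
# Illustrative latency benchmark over named, already-constructed retrievers.
import time

def benchmark(retrievers: dict, queries: list[str]) -> dict[str, float]:
    """Average seconds per query for each named retrieval strategy."""
    timings = {}
    for name, retriever in retrievers.items():
        start = time.perf_counter()
        for query in queries:
            retriever.invoke(query)
        timings[name] = (time.perf_counter() - start) / len(queries)
    return timings

# Example usage (names are placeholders):
# benchmark({"bm25": bm25_retriever, "semantic": semantic_retriever},
#           ["What factors influence user trust in AI systems?"])
```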
- Phoenix Integration: All scripts include OpenTelemetry tracing
- Visualization: Generates charts and performance metrics
- Interactive: Real-time comparison and analysis capabilities
- Documentation: Each script includes detailed output explanations

Detailed Instructions: See validation/README.md for comprehensive usage guide and troubleshooting.
- OpenAI: ~$0.50-$2.00 per full run (depending on data size)
- Cohere: ~$0.10-$0.50 for reranking
- Total: Budget $5 for experimentation
- Data loading: 30-60 seconds
- Embedding generation: 2-4 minutes for ~269 PDF documents
- Retrieval comparison: 30-60 seconds
- Total runtime: 3-6 minutes (varies with PDF count)
For detailed performance analysis: Check out DeepWiki's performance insights including:
- Strategy-by-strategy latency comparisons
- Token usage optimization techniques
- Scaling recommendations for larger datasets
- Cost-performance trade-offs for each retrieval method
- RAG: Retrieval-Augmented Generation - enhancing LLM responses with retrieved context
- Embeddings: Vector representations of text for semantic search
- BM25: Best Matching 25 - a keyword-based ranking algorithm
- Semantic Search: Finding similar content based on meaning, not just keywords
- Phoenix: Open-source LLM observability platform by Arize
- pgvector: PostgreSQL extension for vector similarity search
- RAGAS: Framework for evaluating RAG pipelines
Want to understand these concepts in depth? Visit DeepWiki for interactive explanations and examples.
- Docker Issues: docker logs container-name
- Python Issues: Your friendly neighborhood AI assistant
- Existential Crisis: Remember, even PostgreSQL had bugs once
- Success Stories: Share them! The community loves a good victory lap
P.S. If this guide helped you succeed, pay it forward by helping the next intrepid adventurer who's staring at the same error messages you just conquered.
- uv Documentation: Learn more about the fast Python package and project manager used in this guide.
DeepWiki provides an AI-powered interface to explore this project in depth. Ask questions, get instant answers, and discover:
- System Architecture: Interactive diagrams showing how all components connect
- Performance Analysis: Detailed benchmarks comparing all 6 retrieval strategies
- Configuration Deep Dives: Advanced settings and optimization techniques
- Implementation Insights: Code explanations and design decisions
- Scaling Strategies: How to adapt this foundation for production use
Perfect for when you need quick answers or want to explore specific technical aspects without diving through all the code.