A simple Q&A bot for technical documentation designed to test and compare different LLM evaluation frameworks including DeepEval, LangChain Evaluation, RAGAS, and OpenAI Evals.
This project serves as a testbed for comparing how different evaluation frameworks assess the same RAG (Retrieval-Augmented Generation) system.
This project evaluates a RAG (Retrieval-Augmented Generation) system. Here are a few key concepts to help you understand the components:
- RAG (Retrieval-Augmented Generation): This is a technique where a large language model's knowledge is supplemented with information retrieved from other sources (in this case, our local documents). The process has two main steps:
  - Retrieval: A search algorithm (like TF-IDF) finds relevant documents based on the user's query.
  - Generation: A language model takes the retrieved documents and the original query to generate a comprehensive answer.
- Ground Truth: In the context of evaluation, "ground truth" refers to the ideal or perfect answer to a given question. We use the ground truth dataset (`data/ground_truth.json`) as a benchmark to measure how accurate and relevant the Q&A bot's answers are.
- TF-IDF (Term Frequency-Inverse Document Frequency): This is the retrieval algorithm used by the Q&A bot to find relevant documents. It works by assigning a score to each word in a document based on two factors:
  - Term Frequency (TF): How often a word appears in a specific document.
  - Inverse Document Frequency (IDF): How rare or common the word is across all documents.

  This allows the system to prioritize words that are important to a specific document over common words that appear everywhere (like "the" or "and"). A minimal retrieval sketch follows this list.
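To make the retrieval step concrete, here is a minimal TF-IDF retrieval sketch built on scikit-learn. It is an illustration only: the repository's `QABot` has its own retrieval code, and the mini-corpus and scoring details below are assumptions, not the project's implementation.

```python
# Minimal TF-IDF retrieval sketch (illustration only; the repository's QABot
# implements its own retrieval). Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for the documents under data/documents/.
documents = [
    "Install the requests library with pip install requests.",
    "Run the unit tests for the project with pytest.",
    "TF-IDF weighs words by frequency in a document and rarity across documents.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)  # one TF-IDF row per document

query = "How do I install the requests library?"
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity between the query and each document.
scores = cosine_similarity(query_vector, doc_matrix).ravel()
best = scores.argmax()
print(f"Best match (score {scores[best]:.2f}): {documents[best]}")
```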
- Clone the repository

  ```bash
  git clone https://github.com/LiteObject/eval-framework-sandbox.git
  cd eval-framework-sandbox
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables

  ```bash
  cp .env.example .env
  # Edit .env with your API keys (optional unless running remote evals)
  ```

- Ask a question

  ```bash
  python -m src.main "How do you install the Python requests library?"
  ```

  The bot will print a synthesized answer and list matching documents.

- Run the unit tests

  ```bash
  pytest
  ```

- (Optional) Try an evaluation framework
  - Update `.env` with the relevant API keys or enable the Ollama flag for a local model (details below).
  - Install extras: `pip install -r requirements.txt` already includes the optional libs, or use `pip install .[eval]` after an editable install.
  - Use the runner scripts in `evaluations/` as starting points; each script writes results into `results/`.
The core QA bot already runs fully offline using TF-IDF retrieval. If you also want LangChain's evaluators to call a local Ollama model instead of OpenAI:
- Install Ollama and pull a model, e.g. `ollama pull llama3`.
- Set the following environment variables (via `.env` or your shell):
  - `LANGCHAIN_USE_OLLAMA=true`
  - `OLLAMA_MODEL=llama3` (or any other pulled model)
  - Optionally `OLLAMA_BASE_URL=http://localhost:11434` if you're running Ollama on a non-default host/port.
- Leave `OPENAI_API_KEY` blank; the LangChain evaluator will detect the Ollama flag and use `ChatOllama`.
If `LANGCHAIN_USE_OLLAMA` is false, the evaluator falls back to `ChatOpenAI` and expects a valid `OPENAI_API_KEY` plus `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
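For orientation, the selection logic can be sketched as follows. This is a simplified illustration of the behaviour described above, not the repository's actual evaluator code, and it assumes the `langchain-openai` and `langchain-ollama` packages are installed.

```python
# Illustrative sketch of the backend selection described above; the real
# LangChainEvalRunner may be structured differently.
import os

def build_chat_model():
    """Return a local ChatOllama or a remote ChatOpenAI based on env vars."""
    if os.getenv("LANGCHAIN_USE_OLLAMA", "false").lower() == "true":
        from langchain_ollama import ChatOllama
        return ChatOllama(
            model=os.getenv("OLLAMA_MODEL", "llama3"),
            base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        )
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(model=os.getenv("LANGCHAIN_OPENAI_MODEL", "gpt-3.5-turbo"))
```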
The evaluation workflow follows four distinct steps:
```python
from src.qa_bot import QABot

bot = QABot(documents_path="data/documents/sample_docs")
```

Load your bot with the technical documentation it will search through.
```python
import json
from pathlib import Path

questions = json.loads(Path("data/test_questions.json").read_text())

predictions = {}
for item in questions:
    answer = bot.answer(item["question"])
    predictions[item["id"]] = answer.response
```

Ask your bot each test question and collect its answers as predictions.
```python
from evaluations.utils import load_dataset_from_files

dataset = load_dataset_from_files(
    questions_path=Path("data/test_questions.json"),
    ground_truth_path=Path("data/ground_truth.json"),
    predictions=predictions,  # Your bot's answers
)
```

This pairs each question with:

- Prediction: Your bot's answer
- Ground Truth: The expected correct answer
```python
from evaluations.langchain_eval_runner import LangChainEvalRunner

runner = LangChainEvalRunner()
result = runner.evaluate(dataset)
print(f"Score: {result.score}")
```

The framework compares your predictions against ground truth and returns a score (0-1 scale).
```
Sample Docs → QA Bot → Predictions
                            ↓
Test Questions + Ground Truth + Predictions → Evaluator → Score & Details
```
These integrations are opt-in. Install the additional dependencies with:
```bash
pip install .[eval]
```

- Set `DEEPEVAL_API_KEY` in `.env` if you plan to submit results to the hosted DeepEval service (local scoring works without it).
- Run the runner programmatically:

```python
from evaluations.deepeval_runner import DeepEvalRunner

runner = DeepEvalRunner()
result = runner.evaluate(dataset)
print(result.score, result.details)
```

The report is also written to `results/deepeval_result.json`.
- Choose your backend:
  - Remote OpenAI models: set `OPENAI_API_KEY` and optionally `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
  - Local Ollama: set `LANGCHAIN_USE_OLLAMA=true`, `OLLAMA_MODEL`, and optionally `OLLAMA_BASE_URL`; no OpenAI key required.
- Invoke the runner:

```python
from evaluations.langchain_eval_runner import LangChainEvalRunner

runner = LangChainEvalRunner()
result = runner.evaluate(dataset)
print(result.score, result.details)
```

LangChain will call the configured chat model to grade responses and store the output at `results/langchain_result.json`.
- Install the `ragas` extras (already included in `.[eval]`). Some metrics call an LLM; set `OPENAI_API_KEY` or configure RAGAS to use a local model before running.
- Evaluate the dataset:

```python
from evaluations.ragas_runner import RagasRunner

runner = RagasRunner()
result = runner.evaluate(dataset)
print(result.score, result.details)
```

The raw metric results are saved to `results/ragas_result.json`.
This repository only prepares the dataset and relies on OpenAI's CLI for the actual evaluation. Ensure `evals` is installed and `OPENAI_API_KEY` is set, then use `evaluations/openai_eval_runner.py` to export a dataset and follow the OpenAI Evals documentation to launch the experiments with `oaieval`.
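To illustrate the export step, basic match-style OpenAI Evals consume JSONL samples with an `input` chat-message list and an `ideal` reference answer. The sketch below writes the test set in that shape; it assumes `data/ground_truth.json` maps question IDs to reference answers, which may not match the repository's actual schema, and the real exporter in `evaluations/openai_eval_runner.py` may format things differently.

```python
# Rough sketch of exporting samples in the JSONL shape used by basic OpenAI
# Evals templates: an "input" message list plus an "ideal" answer.
# Assumes data/ground_truth.json maps question IDs to reference answers.
import json
from pathlib import Path

questions = json.loads(Path("data/test_questions.json").read_text())
ground_truth = json.loads(Path("data/ground_truth.json").read_text())

with Path("results/openai_eval_samples.jsonl").open("w") as handle:
    for item in questions:
        sample = {
            "input": [{"role": "user", "content": item["question"]}],
            "ideal": ground_truth[item["id"]],
        }
        handle.write(json.dumps(sample) + "\n")
```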
- `data/`: Test questions, ground truth, and source documents
- `src/`: Core Q&A bot implementation
- `evaluations/`: Framework-specific evaluation scripts
- `results/`: Evaluation results and comparisons (gitignored except for `.gitkeep`)
- Answer Correctness
- Context Relevance
- Faithfulness
- Answer Similarity
- Response Time
- Hallucination Rate