This repository contains code and resources for evaluating the performance of language models in a Retrieval-Augmented Generation (RAG) setting, specifically focusing on question answering with context and hallucination detection. The project encompasses prompt augmentation, model inference, and comprehensive statistical evaluation.
The repository is structured as follows:
- `augmented_prompt.py`: Generates augmented prompts by enriching questions with relevant context retrieved from a ChromaDB instance. It leverages sentence embeddings for semantic search (a sketch of this flow follows this list).
- `results_4o-mini.py`: Executes inference using the OpenAI API with the `gpt-4o-mini-2024-07-18` model. It processes the augmented prompts and saves the model's responses. Includes rate limiting to avoid API errors.
- `results_slim_raft.py`: Performs inference with a locally hosted `TeenyTinyLlama-160m-CEP-ft` model, using the Hugging Face `transformers` library.
- `evaluate.ipynb`: A Jupyter Notebook for in-depth evaluation of model outputs. It calculates key metrics such as quality, agreement, accuracy, and hallucination rate; statistical significance is assessed with Kruskal-Wallis and Dunn's post-hoc tests.
- `evaluate_simple.ipynb`: A streamlined Jupyter Notebook for extracting evaluation scores and justifications directly from model-generated JSON outputs.
- `Amostra100AvalCEP.txt`: Contains the original set of prompts and baseline answers used for prompt augmentation.
- `augmented_prompt.csv`: Stores the augmented prompts generated by `augmented_prompt.py`, combining the original prompts with retrieved context.
- `final_evaluation.csv`: Stores the final evaluation results, including all calculated metrics.
- `results/`: A directory for the output CSV files from the model inference scripts.
- `chroma_db/`: (Potentially) a directory containing the persistent ChromaDB database files, depending on your ChromaDB configuration.
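For orientation, the retrieval-and-augmentation flow in `augmented_prompt.py` roughly follows the pattern below. This is a minimal sketch, assuming the `cep` collection and the `paraphrase-multilingual-MiniLM-L12-v2` embedding model described in the setup steps further down; the function name, prompt layout, and `chroma_db` path are illustrative, not the script's exact code.

```python
# Minimal sketch of retrieval + prompt augmentation (illustrative names).
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_collection("cep")

def augment(question: str, n_results: int = 3) -> str:
    # Embed the question and retrieve the most similar chunks from ChromaDB.
    query_embedding = embedder.encode(question).tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    context = "\n".join(hits["documents"][0])
    # Prepend the retrieved context to the original question.
    return f"Contexto:\n{context}\n\nPergunta: {question}"
```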
To set up the project:

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd webist_cep
  ```
- Create and activate a virtual environment (recommended):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate   # Linux/macOS
  .venv\Scripts\activate      # Windows
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables:

  Create a `.env` file in the project root with the following content:

  ```
  OPENAI_API_KEY=YOUR_OPENAI_API_KEY
  ```

  Replace `YOUR_OPENAI_API_KEY` with your actual OpenAI API key and ensure the scripts correctly reference this `.env` file.
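  The scripts are expected to read this key from the environment; a minimal sketch of how it can be loaded with `python-dotenv` and the `openai` client (both assumed to be listed in `requirements.txt`):

  ```python
  # Minimal sketch: load OPENAI_API_KEY from .env and create an OpenAI client.
  import os
  from dotenv import load_dotenv
  from openai import OpenAI

  load_dotenv()  # reads .env from the project root / current working directory
  client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
  ```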
- Download the `TeenyTinyLlama-160m-CEP-ft` model (if using):

  If you plan to use `results_slim_raft.py`, download the model from Hugging Face and update the `model_path` variable in the script:

  ```bash
  huggingface-cli download vinidiol/TeenyTinyLlama-160m-CEP-ft --local-dir /path/to/your/model/directory --local-dir-use-symlinks False
  ```
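  Once downloaded, the model can be loaded from the local directory with `transformers`. The sketch below is only illustrative; the example prompt and generation parameters are not the exact ones used in `results_slim_raft.py`:

  ```python
  # Minimal sketch: load the locally downloaded model and generate a completion.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_path = "/path/to/your/model/directory"  # same directory used above
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModelForCausalLM.from_pretrained(model_path)

  inputs = tokenizer("Exemplo de pergunta", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=128)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```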
- ChromaDB Setup:

  - Ensure ChromaDB is installed and running. The simplest way is to run it in-process.
  - The `augmented_prompt.py` script connects to a ChromaDB collection named `cep`. Make sure this collection exists and is populated with relevant data: text chunks suitable for retrieval given the prompts in `Amostra100AvalCEP.txt`. The embedding model used to index the data in ChromaDB must be the same as the one used in `augmented_prompt.py` (currently `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`). A sketch of how the collection might be populated is shown after this list.
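As a starting point, the `cep` collection can be populated in-process before running `augmented_prompt.py`. This is a minimal sketch under the assumptions above; the chunk contents, IDs, and the `chroma_db` path are illustrative:

```python
# Minimal sketch: embed text chunks and add them to the "cep" collection.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("cep")

chunks = ["...text chunk 1...", "...text chunk 2..."]  # your pre-chunked source documents
embeddings = embedder.encode(chunks).tolist()
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
```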
To run the pipeline:

- Generate Augmented Prompts:

  Run `augmented_prompt.py` to create the augmented prompts and save them to `augmented_prompt.csv`:

  ```bash
  python augmented_prompt.py
  ```
- Run Model Inference:

  - gpt-4o-mini: Execute `results_4o-mini.py` to generate responses using the OpenAI API. This script includes basic rate limiting; a sketch of the pattern appears at the end of this step.

    ```bash
    python results_4o-mini.py
    ```
  - TeenyTinyLlama-160m-CEP-ft: Run `results_slim_raft.py` for local inference with the `TeenyTinyLlama` model:

    ```bash
    python results_slim_raft.py
    ```
  The generated results will be saved as CSV files in the `results/` directory.
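  The rate limiting in `results_4o-mini.py` amounts to pausing between API requests. A minimal sketch of the overall loop is shown below; the column selection, delay, and output file name are assumptions, not the script's exact behaviour:

  ```python
  # Minimal sketch: query gpt-4o-mini for each augmented prompt, pausing
  # between requests as basic rate limiting (delay value is illustrative).
  import time
  import pandas as pd
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  prompts = pd.read_csv("augmented_prompt.csv")

  responses = []
  for prompt in prompts.iloc[:, 0]:  # assumes prompts are in the first column
      completion = client.chat.completions.create(
          model="gpt-4o-mini-2024-07-18",
          messages=[{"role": "user", "content": prompt}],
      )
      responses.append(completion.choices[0].message.content)
      time.sleep(1)  # simple fixed pause between requests

  pd.DataFrame({"response": responses}).to_csv("results/results_4o-mini.csv", index=False)
  ```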
- Evaluate Model Performance:

  Open and run either `evaluate.ipynb` (for detailed analysis) or `evaluate_simple.ipynb` (for a simplified view) to evaluate the model outputs (the statistical tests in `evaluate.ipynb` are sketched after these steps):

  ```bash
  jupyter notebook evaluate.ipynb
  ```

  or

  ```bash
  jupyter notebook evaluate_simple.ipynb
  ```
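The Kruskal-Wallis and Dunn's post-hoc comparison in `evaluate.ipynb` follows the standard SciPy pattern, here paired with `scikit-posthocs` for Dunn's test. The sketch below uses illustrative column names (`model`, `score`) rather than the notebook's actual variables:

```python
# Minimal sketch: Kruskal-Wallis across model groups, then Dunn's post-hoc test.
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp

df = pd.read_csv("final_evaluation.csv")  # assumes "model" and "score" columns
groups = [g["score"].values for _, g in df.groupby("model")]

h_stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_value:.4f}")

if p_value < 0.05:
    dunn = sp.posthoc_dunn(df, val_col="score", group_col="model", p_adjust="bonferroni")
    print(dunn)
```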
In summary, the key files are:

- `augmented_prompt.py`: Generates augmented prompts by combining questions with context from ChromaDB.
- `results_4o-mini.py`: Performs inference with OpenAI's `gpt-4o-mini-2024-07-18` model.
- `results_slim_raft.py`: Runs inference locally with the `TeenyTinyLlama-160m-CEP-ft` model.
- `evaluate.ipynb`: Provides a comprehensive evaluation of model performance, including statistical analysis.
- `evaluate_simple.ipynb`: Offers a simplified evaluation approach focused on JSON parsing.
- `Amostra100AvalCEP.txt`: Contains the original prompts and baselines.
- `augmented_prompt.csv`: Stores the generated augmented prompts.
- `final_evaluation.csv`: Stores the final evaluation metrics.
- `requirements.txt`: Lists the Python package dependencies.
Contributions are welcome! Please submit pull requests with detailed descriptions of your changes. Consider adding unit tests for new functionality.
This project is licensed under the MIT License.