This repository contains a Python script and package for automated testing of LLM performance on various ML tasks using different prompt designs and approaches. The goal of the package is to speed up LLM experimentation by automating prompt submission to the LLM API through litellm, task scoring, and experiment result logging. Currently the test suite supports image classification tasks with three prompting approaches: `basic`, `basic_with_context`, and `i3`, which are described in more detail below. The test suite has been verified to work with OpenAI's gpt-4o model. Extensions to handle additional prompting approaches, tasks, and LLM models are planned for the near future. The default prompts and images in the `data` folder, along with `test_script.py` and the log files, are configured for image classification of oranges; this serves as a worked example for familiarizing yourself with the test suite.
- `core.py` - The primary class is `I3TestSuite`, which takes several user-defined prompts and training/test data and converts them into an LLM prompt in the format specified by the OpenAI ChatCompletions API. The class then makes an API call to the specified LLM via the `litellm` package and logs the experiment results to a log file.
- `TaskStrategy.py` and `PromptDesignStrategy.py` - These modules contain classes that support the `I3TestSuite` class by varying the behavior of certain functions depending on the current task or prompting approach.
- `utils.py` - This module contains general utility functions.
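`core.py` builds the prompt in the OpenAI ChatCompletions format and sends it through `litellm`. As a rough illustration of that underlying pattern (not the package's actual API; the helper function and file path below are hypothetical), an image classification call looks something like this:

```python
import base64
import litellm

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL (hypothetical helper)."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{data}"

# A ChatCompletions-style message list: a system prompt, task instructions, and one image.
messages = [
    {"role": "system", "content": "You are an assistant that classifies images."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify the following image as A or B."},
            {"type": "image_url", "image_url": {"url": encode_image("data/images/A/example.jpg")}},
        ],
    },
]

# litellm routes the request to the configured provider (here, OpenAI's gpt-4o).
response = litellm.completion(model="gpt-4o", messages=messages, max_tokens=300)
print(response.choices[0].message.content)
```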
| Folder/File | Description |
|---|---|
| `data/` | Contains task data, prompts, and logs |
| `data/images/` | Stores image files grouped by label |
| `data/prompts/` | Contains user-defined prompts for the LLM |
| `data/logs/` | Directory where logs are saved |
| `i3-testsuite/` | Core Python package code |
| `test_script.py` | Example usage script |
- Python 3.8+
- OpenAI API key with credits
- Git
git clone https://github.com/jonkaonor/i3-testsuite.git
cd i3-testsuite
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
python test_script.py

Note: You must replace `sk-...` with your own API key, and `export` only sets the environment variable for the current terminal session.
- Navigate to your project directory.
- Clone the repository with `git clone https://github.com/jonkaonor/i3-testsuite.git` and navigate into the main directory with `cd i3-testsuite`.
- Activate a virtual environment if desired. Use `pip install -r requirements.txt` to install the necessary Python packages.
- Set the environment variable `OPENAI_API_KEY` to the value of the API key for whichever LLM API you are using; this is how `litellm` accesses the API key (see the snippet after this list for a quick sanity check). Note: You must purchase OpenAI API credits separately, as access is not included with a ChatGPT Plus subscription. The cost of each API call to `gpt-4o` depends on the number and resolution of images and the length of the prompt; with downscaled images, an 8-image call typically costs around half a cent. You can further reduce costs by lowering image resolution.
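Because `litellm` reads the key from the environment, a quick sanity check in your own scripts can catch a missing export early (a suggested check, not something `test_script.py` necessarily includes):

```python
import os

# Fail fast with a clear message if the key was not exported in this shell session.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running test_script.py")
```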
- `base_data_path`: The path to the directory containing the `images`, `prompts`, and `logs` folders.
- `model_name`: The name of the LLM model as specified by litellm.
- `select_train_examples`: The method for selecting training examples; can be `random` or `manual`. When using `manual` selection, relative file paths to the training images should be written to the `training_images_list.txt` file in the `data/images` directory. You must specify at least one training example for every possible image class.
- `num_train_examples`: The number of training examples to use per class. This parameter is only used for `random` selection of training examples and is ignored for `manual` selection.
- `max_output_tokens`: The maximum number of output tokens allowed in the LLM response. If set too low, LLM behavior becomes unpredictable and generally poor.
- `num_test_examples`: The total number of test examples for the LLM to classify.
- `task_strategy`: The task to run. Currently `image_classification` is supported; `arcagi` support is pending.
- `prompt_design_strategy`: The prompt design / prompting approach, which controls how the LLM is prompted. Three approaches are supported: `basic`, which adds only basic task instructions to the LLM prompt from `data/prompts/image_classification.txt`; `basic_with_context`, which combines the prompt generated by the `basic` approach with additional user-defined knowledge from `data/prompts/context_prompt.txt`; and `i3`, which splits training and testing into separate phases. In the training phase, `image_classification_i3_train_prompt.txt` supplies task instructions and the user-defined `i3_context_prompt.txt` adds task-specific context, prompting the LLM to generate a classification prompt that is stored in `image_classification_prompt.txt`. In the test phase, that classification prompt is used along with `i3_context_prompt.txt` to perform the specified task on a test set. In all approaches, a general system prompt orienting the LLM to its purpose is added from `image_classification_system_prompt.txt`.

An example configuration is sketched below.
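The exact structure of `test_script.py` is not prescribed here; the following is an illustrative configuration built from the parameters above. The names mirror the settings, but the real script may arrange them differently (for example, as plain variables or as constructor arguments to `I3TestSuite`).

```python
# Illustrative configuration for the bundled oranges example; adjust paths and
# values for your own task. test_script.py may structure this differently.
config = {
    "base_data_path": "data",                    # contains images/, prompts/, and logs/
    "model_name": "gpt-4o",                      # model identifier as expected by litellm
    "select_train_examples": "random",           # "random" or "manual"
    "num_train_examples": 2,                     # per class; ignored when selection is "manual"
    "max_output_tokens": 500,                    # keep high enough for the model's reasoning
    "num_test_examples": 8,                      # total number of test images to classify
    "task_strategy": "image_classification",     # "arcagi" support is pending
    "prompt_design_strategy": "i3",              # "basic", "basic_with_context", or "i3"
}
```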
- Set the configuration settings in `test_script.py` for the specific task.
- Place the images you wish to classify in the `data/images` directory. Each image should be placed in a directory with the class label as the directory name; for example, in the oranges dataset images are labeled either A or B, so there are two directories in `images` (see the example layout after this list). Set the method for selecting training images via the `select_train_examples` input parameter, and refer to the Configuration Settings section if using `manual` selection of training images.
- Modify the relevant prompts in the `data/prompts` directory depending on your chosen `prompt_design_strategy`. Refer to the Configuration Settings part of the README for the specifics of each prompt.
- Run the test script by executing `python test_script.py` or `python3 test_script.py`. A simple interactive loop will prompt you for input and allow you to run experiments.
- Experiment results will be logged to `data/logs/experiment_log`.
- Repeat for further experiments.
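For reference, the bundled oranges example corresponds to a layout roughly like the following (the image file names are illustrative):

```
data/
├── images/
│   ├── A/                          # images belonging to class A
│   │   └── orange_01.jpg
│   ├── B/                          # images belonging to class B
│   │   └── orange_02.jpg
│   └── training_images_list.txt    # only needed for manual training-example selection
├── prompts/
└── logs/
```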
✅ API test succeeded. Response:

Answers: A, B, A, B, A, B, B, A

Reasoning:
- **First Test Image**: The stem appears green with a noticeable protrusion, matching the characteristics of orange A.
- **Second Test Image**: The stem appears flat with more black, resembling orange B.
- **Third Test Image**: The protruding green stem suggests it is orange A.
- **Fourth Test Image**: The stem looks flat and darker, matching orange B.
- **Fifth Test Image**: The green protruding stem is indicative of orange A.
- **Sixth Test Image**: The stem appears flat with more black, similar to orange B.
- **Seventh Test Image**: Again, a flat, less prominent stem points to orange B.
- **Eighth Test Image**: The green stem is protruding, indicating it is orange A.

Overall score is 6 out of 8 test examples
MIT License. See LICENSE file for details.
Pull requests and suggestions are welcome! Feel free to open an issue or PR.