This project systematically evaluates the performance of various Large Language Models (LLMs) on a benchmark dataset of multiple-choice questions related to clinical travel health.
The primary objective is to assess and compare the accuracy and reliability of different LLMs when faced with specialized questions from the clinical travel medicine domain. This involves testing models under different prompting strategies (modalities) to understand how instructions influence their performance.
The analytical pipeline is orchestrated by the `_targets.R` script, which includes the following key steps:
- Questions: Loads the benchmark questions from `inst/benchmark_travel_medicine.csv`. This dataset contains multiple-choice questions, each with an identifier (`item`), source information, question text (`item_text`), four options (`option_A` to `option_D`), and the correct option (`option_correct`).
- Models: Loads LLM configurations from `inst/models.csv`. This file defines the models to be tested, including their `model_id`, `provider` (e.g., "openai", "anthropic", "ollama"), `model_name`, specific parameters like `base_url` (for local models), `temperature`, and `max_tokens`, and a `model_type` ("reasoning" or "classic") indicating their general capabilities. An `active` flag allows selectively including models in the run.
- Modalities: Defines three distinct prompting strategies:
  - `cold`: The LLM is instructed to output only the letter (A, B, C, or D) corresponding to the correct option.
  - `free`: The LLM is allowed to explain its reasoning but must conclude its response with the chosen option letter.
  - `reasoning`: The LLM is prompted to engage in step-by-step thinking, question its assumptions, and then provide the final answer letter. (Note: this modality is automatically skipped for models designated as `reasoning` type in `models.csv` to avoid redundancy.)
- Combinations: Generates a comprehensive list of all experimental units by creating combinations of:
  - Each question item.
  - Each active model ID.
  - Each modality.
  - A specified number of replications (`n_replications`, defined in `_targets.R`) for statistical robustness.
- Filtering: Checks the `results/processed/` directory and filters out combinations that have already been successfully run, allowing the pipeline to resume or incrementally add results. It can also be configured to reprocess combinations that previously resulted in errors (`reprocess_status` in `_targets.R`). A combined sketch of these preparation steps appears below.
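The preparation steps above can be pictured roughly as follows. This is an illustrative sketch based on the description above, not the project's actual code: object names, the prompt wording, and the handling of the `active` flag are assumptions.

```r
library(tidyr)

# Load the benchmark questions and the model configurations.
questions <- read.csv("inst/benchmark_travel_medicine.csv", stringsAsFactors = FALSE)
models    <- read.csv("inst/models.csv", stringsAsFactors = FALSE)
active_models <- subset(models, active)  # assumes `active` is stored as TRUE/FALSE

# Modality-specific system prompts (wording is illustrative, not the real prompts).
build_system_prompt <- function(modality) {
  switch(
    modality,
    cold      = "Answer with only the letter (A, B, C, or D) of the correct option.",
    free      = "You may explain your reasoning, but end with the letter of the chosen option.",
    reasoning = "Think step by step, question your assumptions, then give the final answer letter.",
    stop("Unknown modality: ", modality)
  )
}

# Full factorial grid of experimental units (the real pipeline also skips the
# `reasoning` modality for models flagged as reasoning-type in models.csv).
n_replications <- 3  # example value; the real number is set in _targets.R
combinations <- expand_grid(
  item        = questions$item,
  model_id    = active_models$model_id,
  modality    = c("cold", "free", "reasoning"),
  replication = seq_len(n_replications)
)

# Skip combinations that already have a result file in results/processed/
# (file names look like "<status>.<item>_<model_id>_<modality>_<replication>.csv").
done_files <- list.files("results/processed", pattern = "\\.csv$")
done_keys  <- sub("^[A-Z]\\.", "", tools::file_path_sans_ext(done_files))
todo <- subset(
  combinations,
  !paste(item, model_id, modality, replication, sep = "_") %in% done_keys
)
```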
For each unique combination, in parallel, the pipeline:
- Formats the question text and options.
- Constructs the appropriate system prompt based on the selected modality.
- Sends the query to the specified LLM via the `ellmer` package, configured with parameters from `models.csv` and global settings (e.g., `temperature`, `max_tokens`).
- Captures the LLM's raw response and metadata (like token usage and generation time).
- Handles potential errors during the API call gracefully.
- Answer Extraction: Parses the raw LLM response to extract the final answer choice (A, B, C, or D), either via regex or, when the answer is not trivially identifiable, via an LLM.
- Status Determination: Compares the extracted answer to the ground truth (`option_correct`) and assigns a status: `C` (Correct), `F` (False), `N` (Not Found, if no valid option was extracted), or `E` (Error, if an API or processing error occurred).
- Incremental Saving: Saves the results for each individual combination as a separate CSV file in the `results/processed/` directory. The filename encodes the status, question item, model ID, modality, and replication number (e.g., `C.1_gpt-4_cold_1.csv`). This ensures results are saved immediately and prevents data loss if the pipeline is interrupted. Files for combinations currently being processed are temporarily stored in `results/processing/`. A sketch of this per-combination processing appears below.
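Putting these per-combination steps together, a minimal sketch might look like the following. The `ellmer` call shown covers only the OpenAI provider and omits the per-model parameters (`base_url`, `temperature`, `max_tokens`) that the real pipeline reads from `models.csv`; all function and column names are illustrative.

```r
library(ellmer)

run_one <- function(item, item_text, options_text, option_correct,
                    model_id, model_name, modality, replication) {
  sys_prompt <- build_system_prompt(modality)  # see the preparation sketch above

  response <- tryCatch({
    chat <- chat_openai(model = model_name, system_prompt = sys_prompt)
    chat$chat(paste(item_text, options_text, sep = "\n"))
  }, error = function(e) e)

  if (inherits(response, "error")) {
    status <- "E"
    answer <- NA_character_
  } else {
    # Regex extraction: take the last standalone A-D letter in the response;
    # the real pipeline can fall back to an LLM-based parser instead.
    hits   <- regmatches(response, gregexpr("\\b[ABCD]\\b", response))[[1]]
    answer <- if (length(hits)) tail(hits, 1) else NA_character_
    status <- if (is.na(answer)) "N" else if (answer == option_correct) "C" else "F"
  }

  result <- data.frame(item, model_id, modality, replication, answer, status)

  # File name encodes status, item, model ID, modality and replication,
  # e.g. "C.1_gpt-4_cold_1.csv".
  write.csv(
    result,
    file.path("results", "processed",
              sprintf("%s.%s_%s_%s_%s.csv",
                      status, item, model_id, modality, replication)),
    row.names = FALSE
  )
  result
}
```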
- After all combinations are processed, the pipeline gathers all individual result files from `results/processed/`.
- It compiles them into a single comprehensive dataset.
- This final dataset is saved as `results/all_results.csv`.
- The final results dataset can be loaded into R using `targets::tar_read(results)`.
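The compilation step amounts to something like the following sketch (the actual target code may differ):

```r
# Gather all per-combination result files and bind them into one dataset.
files <- list.files("results/processed", pattern = "\\.csv$", full.names = TRUE)
all_results <- do.call(rbind, lapply(files, read.csv))
write.csv(all_results, "results/all_results.csv", row.names = FALSE)

# Within an R session the compiled dataset is also available as a target:
results <- targets::tar_read(results)
```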
After compiling the raw results, the pipeline advances to a Bayesian multilevel modelling stage implemented with the brms package.
- Accuracy model (sketched below): A binomial logistic model that estimates the probability of answering correctly as a function of `model_id`, `modality`, and their interaction, with varying (random-effect) intercepts for each question item.
- Parsing-quality model: A multinomial ordinal model that captures the probabilities of responses being clean, rescued, or failed.
- Consistency model: A multinomial model that measures how consistent a model's answers are across replications, using categorical probabilities for options B, C, and D.
All models use mildly regularising Student-t priors and an LKJ prior for correlation structures. Sampling is delegated to CmdStanR (configured via `mcmc_config`), and the number of cores/threads is automatically chosen based on the host machine.
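As an illustration of the accuracy model's structure, the sketch below fits a Bernoulli (observation-level binomial) model with Student-t priors via CmdStanR, using the `results` object loaded above. Column names, the decision to count unparsable answers as incorrect, and the exact prior scales are assumptions; the project's real specification lives in the pipeline code.

```r
library(brms)
library(dplyr)

# Assumed columns in `results`: status ("C"/"F"/"N"/"E"), model_id, modality, item.
model_data <- results |>
  filter(status != "E") |>                     # drop API/processing errors
  mutate(correct = as.integer(status == "C"))  # "N" counted as incorrect (assumption)

accuracy_fit <- brm(
  correct ~ model_id * modality + (1 | item),  # varying intercept per question item
  data    = model_data,
  family  = bernoulli(),
  prior   = c(
    prior(student_t(3, 0, 2.5), class = "Intercept"),
    prior(student_t(3, 0, 2.5), class = "b")
  ),
  backend = "cmdstanr",
  chains  = 4,
  cores   = 4
)
```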
Posterior draws from each model are extracted with `extract_posterior_draws()` and transformed into human-readable summaries with tidybayes. The pipeline computes:
- Marginalised summaries by model, modality, and their interaction.
- Three complementary consistency scores (scaled KL divergence, Simpson index, and modal probability) via `compute_consistency_kl()`, `compute_consistency_simpson()`, and `compute_consistency_modal()`.
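As a rough illustration of what these three scores capture, one plausible formulation over a vector of answer probabilities is shown below. This is an interpretation of the description above, not the package's exact definitions.

```r
# p: probabilities over the answer options for one question, e.g. the share of
# replications (or the posterior categorical probabilities) for each of A-D.
consistency_scores <- function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)
  k  <- length(p)
  nz <- p > 0
  c(
    # KL divergence from the uniform distribution, scaled to [0, 1] by log(k).
    kl_scaled = sum(p[nz] * log(p[nz] * k)) / log(k),
    simpson   = sum(p^2),  # probability that two independent answers agree
    modal     = max(p)     # probability of the most frequent answer
  )
}

consistency_scores(c(A = 0.7, B = 0.1, C = 0.1, D = 0.1))
```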
Several helper targets turn the posterior summaries into publication-ready artefacts:
- `plot_summaries()`: faceted forest plots for each metric (accuracy, parsing, and the three consistency variants).
- `plot_pareto_frontier()`: a two-objective plot that highlights trade-offs between accuracy and consistency at the model level.
- Flextable helpers such as `create_summary_table()`, `create_model_performance_table()`, and `create_interaction_performance_table()` for polished supplementary tables used directly in the manuscript.
Finally, `compute_model_correlation()` explores relationships between metrics, for example the link between accuracy and parsing quality, or accuracy and KL-based consistency.
All of the above steps are fully reproducible and automatically re-executed by targets whenever the underlying inputs change.
To set up and run the pipeline locally:

- Clone the Repository:

  ```bash
  git clone https://github.com/bakaburg1/cthllm.git
  cd cthllm
  ```

- Install Dependencies with `renv`:

  This project uses `renv` to manage R package dependencies, ensuring reproducibility.

  - Open R within the cloned project directory. `renv` should activate automatically, installing itself if necessary.
  - Run the following command to install all required packages:

    ```r
    renv::restore()
    ```
- Set Up API Keys:

  - Create a `.env` file in the project's root directory. You can copy the template from `inst/templates/.env` if it exists, or create a new file.
  - Add your API keys for the LLM providers you intend to use (e.g., OpenAI, Anthropic) to this `.env` file.
  - In this file you can also set `PARSER_MODEL_ID` and `TEST_MODEL_ID` to the IDs of the models you want to use to parse the LLM responses and to test the pipeline, respectively.
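  A minimal `.env` might look like the following. The API key variable names are illustrative and depend on which providers you enable; the model IDs are placeholders and should match entries in `inst/models.csv`:

  ```
  OPENAI_API_KEY=<your OpenAI key>
  ANTHROPIC_API_KEY=<your Anthropic key>
  PARSER_MODEL_ID=<model_id used to parse LLM responses>
  TEST_MODEL_ID=<model_id used to test the pipeline>
  ```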
- Configure Local Models (Optional):

  - If you plan to test local models (e.g., using Ollama or LMStudio), ensure the corresponding service is installed, running, and accessible. Update the `models.csv` file with the correct `base_url` if necessary.
  - For larger local models, activate and test them one at a time by toggling the `active` flag in `inst/models.csv`, since loading multiple large models simultaneously may exceed available system memory.
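  For illustration, a row for a local Ollama model in `inst/models.csv` could look like this (column order and values are assumptions based on the column descriptions above; `http://localhost:11434` is Ollama's default endpoint):

  ```csv
  model_id,provider,model_name,base_url,temperature,max_tokens,model_type,active
  llama3-local,ollama,llama3,http://localhost:11434,0,1024,classic,TRUE
  ```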
- Execute the `targets` Pipeline:

  - Once the setup is complete, run the entire analysis pipeline from the R console:

    ```r
    targets::tar_make()
    ```

  - This command will execute all the steps defined in the `_targets.R` file, including data loading, querying LLMs, processing responses, and compiling results.
- Debugging:

  - If you encounter errors, especially during the LLM querying phase, which runs in parallel by default, it can be helpful to run the pipeline sequentially in the main R process for easier debugging:

    ```r
    targets::tar_make(callr_function = NULL, use_crew = FALSE)
    ```

  - This allows you to check the status of each LLM query and use standard R debugging tools (like `browser()`) directly within the target steps.
- The `results/all_results.csv` file contains the raw data (questions, models, modalities, responses, status, metadata) for further analysis, such as calculating accuracy, confusion matrices, or performing statistical comparisons between models and modalities.
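For example, a quick accuracy breakdown by model and modality can be computed from the compiled results (column names follow the descriptions above and may differ slightly in the actual file):

```r
library(dplyr)

results <- read.csv("results/all_results.csv")

# Share of correct ("C") responses per model and modality.
results |>
  group_by(model_id, modality) |>
  summarise(accuracy = mean(status == "C"), n = n(), .groups = "drop") |>
  arrange(desc(accuracy))
```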