Note: This is the code for the updated MedScore paper (October 2025 version). Check out the arxiv-reproducibility branch for the exact code used in the May 2025 arXiv version of the paper.
Supporting code and data for MedScore, a factuality evaluation system for medical chatbot responses that adapts easily to other domains. See the MedScore paper for details. Following the structure of the paper (update the MedScore taxonomy with your domain's requirements for what counts as a valid claim, then change the MedScore instructions and the domain-specific in-context learning examples), researchers can adapt this tool to their own text domain with minimal effort.
If you use this tool, please cite:
```bibtex
@misc{huang2025medscoregeneralizablefactualityevaluation,
      title={MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification},
      author={Heyuan Huang and Alexandra DeLucia and Vijay Murari Tiyyala and Mark Dredze},
      year={2025},
      eprint={2505.18452},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18452},
}
```
There are two options for installation. For running the code as-is, we recommend the standard install. For editing the code, we recommend the development install.

**Standard install**

Pip install directly from the repository:

```bash
pip install git+https://github.com/Heyuan9/MedScore.git
```

**Development install**

This option allows you to edit the code and have changes reflected without re-installing.
- Clone the repository:

  ```bash
  git clone [email protected]:Heyuan9/MedScore.git
  ```

- Create a new environment:

  ```bash
  conda env create --file=environment.yml
  conda activate medscore
  ```

- Install the MedScore package for easy command-line usage:

  ```bash
  cd /path/to/MedScore
  pip install -e .
  ```

- Add any API keys to your `~/.bashrc` or to a `.env` file in the root directory:

  ```bash
  export OPENAI_API_KEY=""
  export TOGETHER_API_KEY=""
  ```

- [Optional] Set the `MEDRAG_CORPUS` environment variable or add it to a `.env` file in the root directory. Setting this variable ensures that the MedRAG corpus is only downloaded once.

  ```bash
  export MEDRAG_CORPUS=""
  ```
MedScore v0.1.1 is now run from the command line using a single configuration file, which makes managing experiments much easier.
```bash
python -m medscore.medscore --config /path/to/your/config.yaml
```

All options:

- `--config`: Path to the YAML configuration file. The config file is explained below.
- `--input_file`: JSONLines-formatted input file. Overrides the input data file specified in the config.
- `--output_dir`: Path to save the intermediate and result files. Overrides the output directory specified in the config.
- `--decompose_only`: Only run the decomposition step. Saves to `output_dir/decompositions.jsonl`.
- `--verify_only`: Only run the verification step (requires an existing decomposition file in the `output_dir`). Saves to `output_dir/verifications.jsonl`.
The final output is saved to `output_dir/output.jsonl`.
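For example, you can split a run into its two stages and inspect the intermediate decompositions before paying for verification. This assumes a config file saved at `demo/config.yaml` (any config path works):

```bash
# Stage 1: decompose responses into claims only
python -m medscore.medscore --config demo/config.yaml --decompose_only

# Stage 2: verify the existing decompositions in the configured output_dir
python -m medscore.medscore --config demo/config.yaml --verify_only
```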
All settings are defined within the YAML configuration file. You can create different config files for different experiments.
Below are explanations of all the options in a MedScore config file. There are example configs in `demo/`, and a few are shown below.
There are three main sections of a MedScore config file.
1. Main input/output
- `input_file`: Path to the input data file. It should be in `jsonl` format. Each entry contains:
  - `id`: A unique identifier for the instance.
  - `response`: The text response from the medical chatbot. This key can be changed with `--response_key`.
  - Any other metadata.
- `output_dir`: Path to the output directory. The output files are `decompositions.jsonl`, `verifications.jsonl`, and `medscore_output.jsonl`.
  - Default: current directory
- `response_key`: JSON key corresponding to the medical chatbot response. The default is `response`.
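For reference, a minimal input record could look like this (illustrative values; extra keys are simply carried along as metadata):

```json
{"id": "ex-001", "response": "Ibuprofen can reduce inflammation and fever.", "question": "What does ibuprofen do?"}
```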
2. Decomposition-related arguments
- `type`: Method for decomposing the sentences into claims.
  - Options:
    - `factscore`: FActScore prompt from "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" (Min et al., EMNLP 2023).
    - `medscore`: Our work.
    - `dndscore`: Prompt from "DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation" (Wanner et al., arXiv 2024).
    - `custom`: A custom user-written prompt with instructions and in-domain examples suited to your dataset. `prompt_path` must also be provided. We recommend following the format of `MedScore_prompt.txt` to make your first customization easier.
  - Default: `medscore`
- `prompt_path`: Path to a `txt` file containing a system prompt for decomposition. See the prompts in `medscore/prompts.py` for examples. This should only be set if you are using a custom decomposer.
- `model_name`: The name of the model for decomposing the response into claims. It should be the model identifier for a hosted HuggingFace model, OpenAI model, TogetherAI model, or locally-hosted vLLM model.
  - Default: `gpt-4o-mini`
- `server_path`: The server path for the decomposition model.
  - Default: `https://api.openai.com/v1`
- `api_key`: API key for the specified `server_path`. You can use environment variables by prefacing them with `!env`. Example: `!env TOGETHER_API_KEY`
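As an illustration, a custom decomposer served through TogetherAI might be configured as below. The prompt path is a hypothetical placeholder (not a file shipped with the repo), and the `!env` value is resolved from your environment as described above:

```yaml
decomposer:
  type: "custom"
  prompt_path: "prompts/my_domain_prompt.txt"  # hypothetical custom prompt file
  model_name: "mistralai/Mistral-Small-24B-Instruct-2501"
  server_path: "https://api.together.xyz/v1"
  api_key: !env TOGETHER_API_KEY  # read from the environment at load time
```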
3. Verification-related arguments
- `type`: The method for verification.
  - Options:
    - `medrag`: Verify the `response` against MedCorp from "Benchmarking Retrieval-Augmented Generation for Medicine" (Xiong et al., Findings 2024). The default settings retrieve the top 5 passages from PubMed, StatPearls, and academic textbooks with the `MedCPT` encoder.
    - `internal`: Verify against the internal knowledge of an LLM.
    - `provided`: Verify against pre-collected, user-provided evidence. Requires `provided_evidence_path` to be set.
  - Default: `internal`
- `response_key`: JSON key corresponding to the medical chatbot response. The default is `response`.
- `prompt_path`: Path to a `txt` file containing a system prompt for verification. See the prompts in `medscore/prompts.py` for examples. This should only be set if you are using a custom verifier.
- `model_name`: The name of the model for verifying the response. It should be a model identifier for a hosted HuggingFace model, OpenAI model, TogetherAI model, or a locally-hosted vLLM model.
  - Default: `gpt-4o`. The paper uses `mistralai/Mistral-Small-24B-Instruct-2501`.
- `server_path`: The server path for the verification model. Refer to the vLLM Hugging Face tutorial for the server path of a locally-hosted open-source LLM: `http://<your-server>:8000/v1`
  - Default: `https://api.openai.com/v1`
- `api_key`: API key for the specified `server_path`. You can use environment variables by prefacing them with `!env`. Example: `!env TOGETHER_API_KEY`
- `provided_evidence_path`: Path to a `json` file in `{"{id}": "{evidence}"}` format, where the `id` is the same as the entry id in `input_file`.
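For the `provided` setting, the evidence file maps each input `id` to its evidence string. A small illustrative example (the ids and text are made up):

```json
{
  "ex-001": "Ibuprofen is an NSAID that reduces inflammation by inhibiting COX enzymes.",
  "ex-002": "Acetaminophen relieves pain and fever but has little anti-inflammatory effect."
}
```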
All of the decomposition and verification arguments are built from the classes in `medscore.decomposer` and `medscore.verifier`, respectively.
1. MedScore Decomposer with Internal Verification
```yaml
#################
# MedScore Configuration File
#################

# --- Main Input/Output Files ---
# These paths are relative to where you run the script.
input_file: "data/AskDocs.demo.jsonl"
output_dir: "results"
# The 'response' is used as the answer context for decomposition. If 'response' alone
# does not contain enough information and your dataset has a 'question' key, you can
# also pass the question by changing the format_input() function of the MedScore class
# to incorporate both keys in the prompt as "Question Context: {}\nAnswer Context: {}\n".
response_key: "response"

# --- Decomposition Configuration ---
decomposer:
  type: "medscore"
  model_name: "gpt-4o-mini"
  server_path: "https://api.openai.com/v1"
  # api_key: "YOUR_API_KEY" # Optional: can be set here or via environment variable.

# --- Verification Configuration ---
verifier:
  type: "internal"
  model_name: "gpt-4o"
  server_path: "https://api.openai.com/v1"
```

2. MedScore Decomposer with MedRAG Verification and a locally-hosted model
```yaml
#################
# MedScore Configuration File
#################

# --- Main Input/Output Files ---
# These paths are relative to where you run the script.
input_file: "data/AskDocs.demo.jsonl"
output_dir: "results"
response_key: "response"

# --- Decomposition Configuration ---
decomposer:
  type: "medscore"
  model_name: "gpt-4o-mini"
  server_path: "https://api.openai.com/v1"
  # api_key: "YOUR_API_KEY" # Optional: can be set here or via environment variable.

# --- Verification Configuration ---
verifier:
  type: "medrag"
  model_name: "mistralai/Mistral-Small-24B-Instruct-2501"
  server_path: "http://localhost:8000/v1"
  corpus_name: "Textbooks" # Options: "PubMed", "Textbooks", "StatPearls", "Wikipedia", "MedCorp", "MEDIC". Our paper uses MEDIC.
  n_returned_docs: 5
  cache: false # Set to true for large datasets to improve performance.
  db_dir: "."
```

For flexibility, the medscore.py script saves the intermediate output of the decompositions, the verifications, and the final combined file.
Decompositions
The output from the decomposition step is `decompositions.jsonl` and has the following format:

```json
{
  "id": {},
  "sentence": {},
  "sentence_id": {},
  "claim": {},
  "claim_id": {}
}
```

There is one entry for every claim of every sentence. The `claim` key can be `None` if a sentence has no claims.
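For instance, a single decomposition record might look like this (illustrative values):

```json
{"id": "ex-001", "sentence": "Ibuprofen can reduce inflammation and fever.", "sentence_id": 0, "claim": "Ibuprofen can reduce inflammation.", "claim_id": 0}
```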
Verifications
The output from the verification step is `verifications.jsonl` and has the following format:

```json
{
  "id": {},
  "sentence": {},
  "sentence_id": {},
  "claim": {},
  "claim_id": {},
  "evidence": [{
    "id": {},
    "title": {},
    "text": {},
    "score": {}
  }],
  "raw": {},
  "score": {}
}
```

There is one entry for every claim. Claims that were `None` after the decomposition step are filtered out before being passed to the verifier.
The `evidence` key changes based on the verification setting:

- `medrag`:

  ```json
  "evidence": [{
    "id": {},
    "title": {},
    "text": {},
    "score": {}
  }]
  ```

  where `id`, `title`, and `text` correspond to the retrieved entries in MedRAG, and `score` is the similarity score from the retriever.

- `internal`: `"evidence": None`

  In the internal setting, the model is not prompted with evidence.

- `provided`: `"evidence": "{evidence from provided_evidence_path}"`
Combined output
The final output file combines the decompositions and verifications by `id`:

```json
{
  "id": {},
  "score": {},
  "claims": [{
    "claim": {},
    "sentence": {},
    "evidence": {},
    "raw": {},
    "score": {}
  }]
}
```

where the top-level `score` is the average claim score for the `id`.
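For a quick summary of a finished run, a few lines of Python suffice. This sketch assumes only the combined output format shown above and the default output filename; adjust the path to match your own `output_dir`:

```python
import json

# Collect the per-response MedScore from the combined output file.
scores = []
with open("results/output.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        if entry.get("score") is not None:  # skip entries without a score
            scores.append(entry["score"])

if scores:
    print(f"responses scored: {len(scores)}")
    print(f"mean MedScore: {sum(scores) / len(scores):.3f}")
```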
The MedRAG verifier is memory-intensive due to the large size of the dataset. The data subset can be customized by overriding or editing the `verifier.MedRAGVerifier` class. The dataset downloads to `MEDRAG_CORPUS` (if set) or `./corpus`.
Dataset size estimates:

- PubMed: 238GB
- StatPearls: 6.2GB
- Textbooks: 1.2GB
- Wikipedia: 310GB
Running verification with the full MedCorp dataset (with `MedRAGVerifier.cache=True`) requires roughly 300GB of RAM. The entire dataset takes up 646GB of disk.

For speed, we highly recommend setting `MedRAGVerifier.cache=True` for input files with a large number of claims (5K+).
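If you are prototyping on limited hardware, a reasonable starting point is to retrieve from the smallest subset only before scaling up. This verifier snippet reuses only config keys documented above:

```yaml
verifier:
  type: "medrag"
  model_name: "gpt-4o"
  corpus_name: "Textbooks"  # smallest subset (~1.2GB); switch to MEDIC/MedCorp for full runs
  n_returned_docs: 5
  cache: false  # keep RAM usage low for small runs
  db_dir: "."
```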
The AskDocs dataset is in the `./data` folder. It has 300 samples and 4 keys:

- `id`: question id
- `question`: user question
- `doctor_response`: a response to the question from a verified doctor
- `response`: the Llama3.1 chatbot response to the question, augmented from the `doctor_response` by explaining medical terminology in detail and adding more empathetic sentences, without adding other diagnosis/treatment information

The AskDocs.demo dataset has 20 random samples from the AskDocs dataset. It is more cost-efficient to experiment on this small-scale dataset.
If your input data already contains sentence-level annotations (for example, produced by an external senticizer), MedScore can use those directly instead of running its internal sentence splitter. To enable this, add the top-level flag `presenticized: true` to your YAML config.

Behavior when `presenticized` is true:

- MedScore will look for a `sentences` list in each input record and use those entries as the source of sentences.
- Each item in `sentences` is a dict with a `text` field and an optional `sentence_id` field. If `sentence_id` is not provided, it will be auto-generated based on the list index.
- If the original `response` (or configured `response_key`) is present, it will be used as the context passed to the decomposer; otherwise the context will be reconstructed by joining the provided sentence texts.
- Records missing a valid `sentences` list will be skipped with a warning.
Example YAML config enabling presenticized inputs:
input_file: "data/AskDocs.demo.jsonl"
output_dir: "results"
response_key: "response"
presenticized: true
decomposer:
type: "medscore"
model_name: "gpt-4o-mini"
verifier:
type: "internal"Example input JSONL record (per-line):
{"id":"123","response":"optional original text","sentences":[{"text":"Sentence 1.","sentence_id": 0, "span_start":0,"span_end":10},{"text":"Sentence 2.", "sentence_id": 1, "span_start":11,"span_end":25}]}MedScore will then skip internal sentence splitting and use the provided sentences for decomposition and verification.