The Compressed Medical LLM Benchmark (CMedBench) is a comprehensive benchmark designed to evaluate the performance of compressed large language models (LLMs) in medical applications. It provides an in-depth analysis across multiple tracks to assess model efficiency, accuracy, and trustworthiness in medical contexts.
Clone the repository and set up the environment:
```bash
git clone https://github.com/Tabrisrei/CMedBench.git
cd CMedBench
conda create -n cmedbench python=3.10
conda activate cmedbench
cd TrustLLM/trustllm_pkg
pip install -e .
cd ../../opencompass
pip install -e .
pip install vllm pynvml
```
- Add your API token to `PycrawlersDownload.py`.
- Run the download script:
```bash
python PycrawlersDownload.py
```
Tips:
- None of the MMLU dataset copies on Hugging Face parse correctly, so we use the OpenCompass dataset reader. Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar (see the example below for one way to fetch and unpack it).
- Alternatively, download the dataset zip file from our GitHub repository and unzip it in the project folder to access the Track 1, 2, and 4 datasets.
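For example, a minimal way to fetch and unpack the MMLU archive; the `data/` target directory here is an assumption, so extract it to wherever your dataset paths point:

```bash
# Fetch the raw MMLU archive and unpack it.
# The data/ destination is an assumption; use the directory that your
# OpenCompass dataset configuration expects.
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
mkdir -p data
tar -xf data.tar -C data
```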
Unzip the trustworthiness dataset:
```bash
unzip TrustLLM/dataset/dataset.zip
```

This repository includes scripts to evaluate LLMs across five tracks. Ensure the LLM to be tested is prepared before running evaluations.
- Update the dataset and model paths in the configuration file: `opencompass/configs/xperiments`
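If it helps, a quick way to locate the hard-coded locations to edit; this assumes the relevant config entries contain the substring "path":

```bash
# List hard-coded dataset/model locations in the experiment configs.
# Assumes the relevant entries contain the substring "path".
grep -rn "path" opencompass/configs/
```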
- Modify the log and result paths in: `opencompass/scripts/launcher.sh`
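A hypothetical sketch of the kind of assignments to look for; the actual variable names in `launcher.sh` may differ:

```bash
# Hypothetical path variables inside scripts/launcher.sh; the real
# script may name or structure them differently.
LOG_DIR=/your/path/to/logs
RESULT_DIR=/your/path/to/results
```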
- Run the evaluation:
```bash
cd opencompass
bash scripts/launcher.sh
```
- Update the paths in the generation and evaluation scripts: `TrustLLM/run_generation.py` and `TrustLLM/run_evaluation.py`
- Generate LLM results:
```bash
cd TrustLLM
python run_generation.py
```
- After generation completes, calculate metrics:
```bash
python run_evaluation.py
```
Note: The generation process may take significant time. Consider using `nohup` or `tmux` to run it in the background.
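For instance, with `nohup` (the log filename here is an arbitrary choice):

```bash
# Run generation in the background and keep its output in a log file;
# "generation.log" is an arbitrary filename.
nohup python run_generation.py > generation.log 2>&1 &
```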
- Update the path in the efficiency evaluation script: `track5_efficiency.py`
- Run the script to evaluate model efficiency.
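A minimal invocation, assuming the script takes no required command-line arguments (check `track5_efficiency.py` for any configurable options):

```bash
# Track 5: evaluate model efficiency. Assumes no required CLI arguments;
# inspect track5_efficiency.py for configurable options.
python track5_efficiency.py
```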