Xiang Feng1 *, Wentao Jiang1 *, Zengmao Wang1, Yong Luo1 †, Pingbo Xu2,3, Baosheng Yu4,
Hua Jin5,6, Bo Du1 †, Jing Zhang1 †
1 School of Computer Science, Wuhan University, China,
2 Department of Anesthesiology, Zhejiang Cancer Hospital, China,
3 Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang, China
4 Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
5 Department of Anesthesiology, First People’s Hospital of Yunnan Province, China
6 Kunming University of Science and Technology, China
- 🔥 Update
- 🌞 Intro
- 🔍 Overview
- 📖 Datasets
- 🐎 Leaderboard
- 🔨 Evaluation
- 🛠️ Training with LLaMA-Factory
- ⭐ Citation
## 🔥 Update

**2025.09.26**
- We updated the repository with the latest progress.

**2025.05.14**
- We released the evaluation code along with usage instructions.

**2025.04.04**
- We uploaded our paper to arXiv.

**2025.03.31**
- We released the AnesSuite project page.
## 🌞 Intro

AnesSuite is a benchmark and dataset suite for advancing LLM reasoning in anesthesiology. It provides a bilingual benchmark and curated training resources (AnesCorpus, AnesQA, AnesR1) to support continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR).

Built on this foundation, Morpheus is the first baseline model collection (7B and 14B) for anesthesiology reasoning. Together, AnesSuite and Morpheus offer a practical infrastructure for the research and development of advanced anesthesiology LLMs.
## 📖 Datasets

### AnesBench

AnesBench is designed to assess the anesthesiology-related reasoning capabilities of large language models (LLMs). It contains 7,972 anesthesiology multiple-choice questions (MCQs; ≈4.4K English / ≈3.5K Chinese). Each question is labeled with a three-level categorization of cognitive demand, enabling evaluation of LLMs' knowledge, application, and clinical reasoning abilities across diverse linguistic contexts.
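For a quick first look at the data, the benchmark can be loaded directly from the Hugging Face Hub (the repo ID `MiliLab/AnesBench` appears in the download step below). A minimal sketch, assuming the standard `datasets` library; depending on how the parquet files are organized you may need to pass a config or split name:

```python
from datasets import load_dataset

# Repo ID taken from the download step in this README; split/config names
# are not documented here, so inspect what load_dataset returns.
bench = load_dataset("MiliLab/AnesBench")
print(bench)  # available splits and row counts

first_split = next(iter(bench))
print(bench[first_split][0])  # one record, to see the actual field names
```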
### AnesCorpus

AnesCorpus is a large-scale, domain-specific corpus constructed for CPT in the field of anesthesiology.
| Language | Rows |
|---|---|
| English | ~1.8M |
| Chinese | ~0.6M |
This curated dataset provides a rich foundation for pretraining language models to understand anesthesiology-related concepts, terminology, and clinical context.
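If the corpus is hosted on the Hugging Face Hub like the benchmark, streaming keeps memory flat while iterating the roughly 2.4M rows during CPT preprocessing. A sketch under that assumption; the repo ID and split name below are guesses (only `MiliLab/AnesBench` is named in this README), so check the project page for the actual location:

```python
from datasets import load_dataset

# "MiliLab/AnesCorpus" and split="train" are assumptions -- verify against
# the project page. Streaming avoids materializing the full corpus at once.
corpus = load_dataset("MiliLab/AnesCorpus", split="train", streaming=True)

for i, row in enumerate(corpus):
    print(row)  # inspect the raw fields (e.g. the pretraining text)
    if i == 2:
        break
```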
### AnesQA

AnesQA is a QA dataset designed for SFT. The QA pairs are generated and filtered using advanced large language models.
| Language | QA Pairs |
|---|---|
| English | ~20K |
AnesQA enables the development of instruction-tuned models with robust reasoning and answering capabilities in the anesthesiology domain.
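Since `anesqa_en` is registered later in this README without an explicit column mapping, LLaMA-Factory will expect its default alpaca-style fields. Purely as an illustration (the keys and content below are assumptions about the converted file, not AnesQA's published schema), one SFT record would look like:

```python
# Illustrative shape of a single SFT record in alpaca style -- the default
# LLaMA-Factory assumes when a dataset is registered without a column
# mapping. The instruction string matches the conversion command shown
# later; the question and answer are invented examples.
qa_record = {
    "instruction": "Please answer the following question based on the anesthesiology context.",
    "input": "Why is succinylcholine avoided in patients with major burns after the first 24-48 hours?",
    "output": "Proliferation of extrajunctional acetylcholine receptors can cause exaggerated potassium release, risking severe hyperkalemia and cardiac arrest.",
}
print(qa_record["instruction"])
```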
### AnesR1

AnesR1 contains over 10K instances, each featuring a verifiable MCQ and a detailed reasoning chain, making it well-suited for both SFT and RLVR.
| Language | Instances |
|---|---|
| English | ~3.2K |
| Chinese | ~7K |
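Because every AnesR1 instance pairs an MCQ with a verifiable answer, the RLVR reward can be computed by string matching rather than a learned reward model. Below is a minimal sketch of such a verifier; the answer-statement pattern is an assumption about the prompt template, not the project's actual implementation:

```python
import re

def mcq_reward(model_output: str, gold_choice: str) -> float:
    """Binary RLVR reward: 1.0 if the model's final choice matches the gold
    answer, else 0.0. Assumes the completion states its choice as e.g.
    "Answer: B"; adapt the pattern to your own prompt template."""
    choices = re.findall(r"[Aa]nswer\s*[:：]?\s*\(?([A-E])\b", model_output)
    if not choices:
        return 0.0
    # Use the last stated choice, i.e. the one after the reasoning chain.
    return 1.0 if choices[-1].upper() == gold_choice.strip().upper() else 0.0

# Example usage:
print(mcq_reward("Propofol lowers SVR ... so the Answer: C", "C"))  # 1.0
print(mcq_reward("Answer: A", "C"))                                 # 0.0
```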
These resources are intended to be used as follows:

- **AnesBench**: use as the primary evaluation benchmark to measure LLM performance across factual recall, hybrid reasoning, and complex decision-making in anesthesiology.
- **AnesCorpus**: apply for CPT to enhance domain knowledge before fine-tuning.
- **AnesQA**: use for SFT.
- **AnesR1**: use for SFT or RLVR to strengthen reasoning capability.
## 🔨 Evaluation

Clone the repository:

```bash
git clone https://github.com/MiliLab/AnesBench
cd AnesBench
```

Download the benchmark:

```bash
cd benchmark
huggingface-cli download --repo-type dataset MiliLab/AnesBench --local-dir ./
```

Before starting, ensure that CUDA and its compiler nvcc are properly installed and accessible.

```bash
nvcc --version
```

We recommend separating the SGLang service environment from the inference environment.
```bash
conda create -n sglang_server python==3.10
conda activate sglang_server
```

Then install the required sglang and flashinfer packages.

```bash
pip install "sglang[all]"
pip install sglang-router
```

Download the wheel file for your environment from https://github.com/flashinfer-ai/flashinfer/releases and install it:

```bash
pip install /path/to/flashinfer-wheel
```

Create a new environment and install the packages from the requirements file.
```bash
conda create -n inference python==3.10
conda activate inference
cd eval
pip install -r requirements.txt
```

Prepare environment variables in the .env file:

```bash
export RESULT_SAVE_PATH=/path/to/result_save_dir
export MODEL_PATH=/path/to/model
export BENCHMARK_PATH=/path/to/benchmark
```

and run:

```bash
source .env
bash sglang_server.sh
python ./evaluate.py --config ./config.yaml
```
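Here, evaluate.py drives the model through the SGLang server launched by sglang_server.sh. To sanity-check the server before a full evaluation run, you can query SGLang's OpenAI-compatible endpoint directly. A minimal sketch, assuming the server listens on SGLang's default port 30000 (match this to sglang_server.sh):

```python
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API; port 30000 is its default.
# Adjust host/port to whatever sglang_server.sh actually binds.
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # SGLang routes this to the single loaded model
    messages=[{
        "role": "user",
        "content": "Which induction agent is preferred for a hemodynamically "
                   "unstable patient? Answer with a single option letter.",
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```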
## 🛠️ Training with LLaMA-Factory

To train with AnesCorpus (for CPT) and AnesQA (for SFT) using LLaMA-Factory, follow the steps below.

Follow the LLaMA-Factory official installation guide, or use the following commands:

```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```

We provide scripts to convert the raw Parquet files into the required JSON format.
📌 The `--split` argument can be set to:
- `en`: English data only
- `cn`: Chinese data only
- `all`: merge both English and Chinese
For AnesCorpus:

```bash
python tools/anescorpus2json.py \
  --local-dir /path/to/anescorpus/parquet_files \
  --save-dir ./data \
  --split en
```

This will generate:

```
./data/AnesCorpus_en.json
```
For AnesQA:

```bash
python tools/anesqa2json.py \
  --local-dir /path/to/anesqa/parquet_files \
  --save-dir ./data \
  --split en \
  --instruction "Please answer the following question based on the anesthesiology context."
```

This will generate:

```
./data/AnesQA_en.json
```
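Before registering the files, it is worth sanity-checking the conversion output. A small sketch that only assumes each file is a top-level JSON list of records; the keys it prints can be cross-checked against the dataset_info.json entries in the next step (e.g. "text" for the corpus):

```python
import json

# Quick sanity check of the converted datasets before registering them
# with LLaMA-Factory. Assumes each file is a top-level JSON list.
for path in ("./data/AnesCorpus_en.json", "./data/AnesQA_en.json"):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    print(f"{path}: {len(records)} records, keys: {sorted(records[0])}")
```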
Move your dataset files into LLaMA-Factory/data, and register the dataset entries in LLaMA-Factory/data/dataset_info.json.
```json
{
  "anescorpus_en": {
    "file_name": "AnesCorpus_en.json",
    "columns": {
      "prompt": "text"
    }
  },
  "anesqa_en": {
    "file_name": "AnesQA_en.json"
  }
}
```

For more details on dataset registration and formatting, refer to the official data preparation guide in the LLaMA-Factory manual and GitHub repository.
You can use or modify the example config files we provide in configs/. Edit them to set paths, for example:

```yaml
# Example snippet
dataset_dir: LLaMA-Factory/data  # directory containing dataset_info.json
dataset: anesqa_en
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
output_dir: ./output/llama3.1-anesqa-sft
...
```

More details can be found in the official LLaMA-Factory guide.
For CPT:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train configs/qwen2.5-7b-pt-anesthesia.yaml
```

For SFT:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train configs/qwen2.5-7b-sft-anesthesia.yaml
```

## ⭐ Citation

If you find AnesBench helpful, please consider giving this repo a ⭐ and citing:
```bibtex
@article{AnesBench,
  title={AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology},
  author={Xiang Feng and Wentao Jiang and Zengmao Wang and Yong Luo and Pingbo Xu and Baosheng Yu and Hua Jin and Bo Du and Jing Zhang},
  journal={arXiv preprint arXiv:2504.02404},
  year={2025}
}
```