Official repo for "AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs"


AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs

Xiang Feng1 *, Wentao Jiang1 *, Zengmao Wang1, Yong Luo1 †, Pingbo Xu2,3, Baosheng Yu4,
Hua Jin5,6, Bo Du1 †, Jing Zhang1 †


1 School of Computer Science, Wuhan University, China,
2 Department of Anesthesiology, Zhejiang Cancer Hospital, China,
3 Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang, China
4 Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
5 Department of Anesthesiology, First People’s Hospital of Yunnan Province, China
6 Kunming University of Science and Technology, China
* Equal contribution, † Corresponding author


🔥 Update

2025.09.26

  • We updated the repository with our latest progress.

2025.05.14

  • We released the evaluation code along with usage instructions.

2025.04.04

  • We uploaded our work on arXiv.

2025.03.31

🌞 Intro

AnesSuite is a benchmark and dataset suite for advancing LLM reasoning in anesthesiology. It provides a bilingual benchmark (AnesBench) and curated training resources (AnesCorpus, AnesQA, AnesR1) to support continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR).

Built on this foundation, Morpheus is the first baseline model collection (7B & 14B) for anesthesiology reasoning. Together, AnesSuite and Morpheus offer practical infrastructure for the research and development of advanced anesthesiology LLMs.

🔍 Overview

Figure 1: Overview of AnesSuite.

📖 Datasets

AnesBench

AnesBench is designed to assess anesthesiology-related reasoning capabilities of Large Language Models (LLMs). It contains 7,972 anesthesiology MCQs (≈4.4k English / 3.5k Chinese). Each question is labeled with a three-level categorization of cognitive demands, enabling evaluation of LLMs’ knowledge, application, and clinical reasoning abilities across diverse linguistic contexts.
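To make the three-level categorization concrete, here is a sketch of what a benchmark item might look like. The field names and level encoding below are illustrative assumptions for this README, not the dataset's actual schema; inspect the downloaded files for the real layout.

```python
# Hypothetical AnesBench-style MCQ item; field names and level codes are assumptions.
item = {
    "question": "Which intravenous induction agent is most associated with adrenal suppression?",
    "options": {"A": "Propofol", "B": "Etomidate", "C": "Ketamine", "D": "Midazolam"},
    "answer": "B",
    "level": 1,       # 1 = knowledge recall, 2 = application, 3 = clinical reasoning
    "language": "en",
}

# A verifiable MCQ: the answer key must point at one of the options.
assert item["answer"] in item["options"]
print(item["options"][item["answer"]])
```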

AnesCorpus

AnesCorpus is a large-scale, domain-specific corpus constructed for CPT in the field of anesthesiology.

Language Rows
English ~1.8M
Chinese ~0.6M

This curated dataset provides a rich foundation for pretraining language models to understand anesthesiology-related concepts, terminology, and clinical context.

AnesQA

AnesQA is a QA dataset designed for SFT. The QA pairs are generated and filtered using advanced large language models.

Language QA Pairs
English ~20K

AnesQA enables the development of instruction-tuned models with robust reasoning and answering capabilities in the anesthesiology domain.

AnesR1

AnesR1 contains over 10k instances, each featuring a verifiable MCQ and a detailed reasoning chain, making it well-suited for both SFT and RLVR.

Language QA Pairs
English ~3.2K
Chinese ~7K

Recommended Usage

  • AnesBench: Use as the primary evaluation benchmark to measure LLM performance across factual recall, hybrid reasoning, and complex decision-making in anesthesiology.

  • AnesCorpus: Apply for CPT to enhance domain knowledge before fine-tuning.

  • AnesQA: Use for SFT.

  • AnesR1: Use for SFT or RLVR to strengthen reasoning capability.

🔨 Evaluation


📁 0. Clone the Repository & Download Benchmark

Clone Repository:

git clone https://github.com/MiliLab/AnesBench
cd AnesBench

Download Benchmark:

cd benchmark
huggingface-cli download --repo-type dataset MiliLab/AnesBench --local-dir ./

🧱 1. Prepare the Runtime Environment

Before starting, ensure that CUDA and its compiler nvcc are properly installed and accessible.

Check:

nvcc --version

We recommend separating the SGLang service environment from the inference environment.

SGLang service environment

conda create -n sglang_server python==3.10
conda activate sglang_server

Then, install the required sglang and flashinfer packages.

pip install "sglang[all]"
pip install sglang-router 

Download the wheel file for your environment from https://github.com/flashinfer-ai/flashinfer/releases.

pip install /path/to/flashinfer-wheel

Inference environment

Create a new environment and install the packages based on the requirements file.

conda create -n inference python==3.10
conda activate inference
cd eval
pip install -r requirements.txt

Environment Variables

Prepare environment variables in the .env file.

export RESULT_SAVE_PATH=/path/to/result_save_dir
export MODEL_PATH=/path/to/model
export BENCHMARK_PATH=/path/to/benchmark

and run:

source .env
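A quick Python sanity check (a sketch, not part of this repo) can confirm the variables are visible to the inference process after running `source .env`:

```python
# Sketch: verify the three evaluation environment variables are set.
# Run inside the same shell session where you sourced .env.
import os

required = ("RESULT_SAVE_PATH", "MODEL_PATH", "BENCHMARK_PATH")
missing = [name for name in required if not os.environ.get(name)]
print("missing variables:", missing if missing else "none")
```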

▶️ 2. Run Evaluation

For SGLang service

bash sglang_server.sh 

For Inference

python ./evaluate.py --config ./config.yaml 

🛠️ Training with LLaMA-Factory

To train with AnesCorpus (for CPT) and AnesQA (for SFT) using LLaMA-Factory, follow the steps below:

1. Install LLaMA-Factory

Follow the LLaMA-Factory official installation guide, or use the following commands:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

2. Convert Data to LLaMA-Factory Format

We provide scripts to convert the raw Parquet files into the required JSON format.

📌 The --split argument can be set to:

  • en: English data only
  • cn: Chinese data only
  • all: merge both English and Chinese

For AnesCorpus (CPT):

python tools/anescorpus2json.py \
    --local-dir /path/to/anescorpus/parquet_files \
    --save-dir ./data \
    --split en

This will generate:
./data/AnesCorpus_en.json
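For reference, the generated file should be a JSON array in which each object's `text` field holds one pre-training document; this matches the `"columns": {"prompt": "text"}` mapping used when registering the dataset. A minimal sketch with invented sample text:

```python
# Minimal sketch of the CPT JSON layout; the sample sentences are invented.
import json

records = [
    {"text": "Propofol is a short-acting intravenous anesthetic used for induction ..."},
    {"text": "Train-of-four stimulation is used to monitor neuromuscular blockade ..."},
]
with open("AnesCorpus_en.sample.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
print("wrote", len(records), "records")
```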

For AnesQA (SFT):

python tools/anescorpus2json.py \
    --local-dir /path/to/anesqa/parquet_files \
    --save-dir ./data \
    --split en \
    --instruction "Please answer the following question based on the anesthesiology context."

This will generate:
./data/AnesQA_en.json
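LLaMA-Factory's default SFT format is alpaca-style instruction/input/output records. A sketch of one AnesQA-like entry, with the question and answer invented for illustration:

```python
# Alpaca-style SFT record as LLaMA-Factory expects by default; content is illustrative.
record = {
    "instruction": "Please answer the following question based on the anesthesiology context.",
    "input": "Why is sevoflurane commonly chosen for inhalational induction?",
    "output": "Its low pungency and relatively rapid onset make it well tolerated by mask.",
}

# The converter should emit exactly these three keys per record.
assert set(record) == {"instruction", "input", "output"}
print("valid SFT record")
```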

3. Register the Dataset

Move your dataset files into LLaMA-Factory/data, and register the dataset entries in LLaMA-Factory/data/dataset_info.json.

{
  "anescorpus_en": {
    "file_name": "AnesCorpus_en.json",
    "columns": {
      "prompt": "text"
    }
  },
  "anesqa_en": {
    "file_name": "AnesQA_en.json"
  }
}

For more details on dataset registration and formatting, refer to the official LLaMA-Factory data preparation guide (available in its manual and on GitHub).
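A small helper (hypothetical, not part of LLaMA-Factory) can catch the most common registration mistake: a `file_name` entry that does not exist under the data directory.

```python
# Hypothetical sanity check: report whether each registered dataset file exists.
import json
import os

def check_registered(data_dir: str) -> dict:
    """Map each dataset name in dataset_info.json to whether its file exists."""
    info_path = os.path.join(data_dir, "dataset_info.json")
    if not os.path.exists(info_path):
        return {}
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    return {
        name: os.path.exists(os.path.join(data_dir, entry.get("file_name", "")))
        for name, entry in info.items()
    }

print(check_registered("LLaMA-Factory/data"))
```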

4. Set Config File

You can use or modify the example config files we provide in configs/.

Edit them to set paths like:

# Example snippet
dataset_dir: LLaMA-Factory/data    # directory containing "dataset_info.json"
dataset: anesqa_en
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
output_dir: ./output/llama3.1-anesqa-sft
...

More details can be found in the official LLaMA-Factory guide.

5. Launch Training from CLI

Continuous Pre-training (CPT)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train configs/qwen2.5-7b-pt-anesthesia.yaml

Supervised Fine-Tuning (SFT)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train configs/qwen2.5-7b-sft-anesthesia.yaml

⭐ Citation

If you find AnesSuite helpful, please consider giving this repo a ⭐ and citing:

@article{AnesBench,
  title={AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology},
  author={Xiang Feng and Wentao Jiang and Zengmao Wang and Yong Luo and Pingbo Xu and Baosheng Yu and Hua Jin and Bo Du and Jing Zhang},
  journal={arXiv preprint arXiv:2504.02404},
  year={2025}
}
