Official repo for "AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs"


AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs

Xiang Feng1 *, Wentao Jiang1 *, Zengmao Wang1, Yong Luo1 †, Pingbo Xu2,3, Baosheng Yu4,
Hua Jin5,6, Bo Du1 †, Jing Zhang1 †


1 School of Computer Science, Wuhan University, China,
2 Department of Anesthesiology, Zhejiang Cancer Hospital, China,
3 Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang, China
4 Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
5 Department of Anesthesiology, First People’s Hospital of Yunnan Province, China
6 Kunming University of Science and Technology, China
* Equal contribution, † Corresponding author


🔥 Update

2025.09.26

  • We updated the repository with our latest progress.

2025.05.14

  • We released the evaluation code along with usage instructions.

2025.04.04

  • We uploaded our work on arXiv.

2025.03.31

🌞 Intro

AnesSuite is a benchmark and dataset suite for advancing LLM reasoning in anesthesiology. It provides a bilingual benchmark (AnesBench) and curated training resources (AnesCorpus, AnesQA, AnesR1) to support continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR).

Built on this foundation, Morpheus is the first baseline model collection (7B & 14B) for anesthesiology reasoning. Together, AnesSuite and Morpheus offer practical infrastructure for the research and development of advanced anesthesiology LLMs.

🔍 Overview

Figure 1: Overview of AnesSuite.

📖 Datasets

AnesBench

AnesBench is designed to assess anesthesiology-related reasoning capabilities of Large Language Models (LLMs). It contains 7,972 anesthesiology MCQs (≈4.4k English / 3.5k Chinese). Each question is labeled with a three-level categorization of cognitive demands, enabling evaluation of LLMs’ knowledge, application, and clinical reasoning abilities across diverse linguistic contexts.
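To make the three-level categorization concrete, here is a sketch of what a benchmark item might look like. The field names and level encoding below are illustrative assumptions for this README, not the dataset's actual schema; inspect the downloaded files for the real layout.

```python
# Hypothetical AnesBench-style MCQ item; field names and level codes are assumptions.
item = {
    "question": "Which intravenous induction agent is most associated with adrenal suppression?",
    "options": {"A": "Propofol", "B": "Etomidate", "C": "Ketamine", "D": "Midazolam"},
    "answer": "B",
    "level": 1,       # 1 = knowledge recall, 2 = application, 3 = clinical reasoning
    "language": "en",
}

# A verifiable MCQ: the answer key must point at one of the options.
assert item["answer"] in item["options"]
print(item["options"][item["answer"]])
```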

AnesCorpus

AnesCorpus is a large-scale, domain-specific corpus constructed for CPT in the field of anesthesiology.

Language Rows
English ~1.8M
Chinese ~0.6M

This curated dataset provides a rich foundation for pretraining language models to understand anesthesiology-related concepts, terminology, and clinical context.

AnesQA

AnesQA is a QA dataset designed for SFT. The QA pairs are generated and filtered using advanced large language models.

Language QA Pairs
English ~20K

AnesQA enables the development of instruction-tuned models with robust reasoning and answering capabilities in the anesthesiology domain.

AnesR1

AnesR1 contains over 10k instances, each featuring a verifiable MCQ and a detailed reasoning chain, making it well-suited for both SFT and RLVR.

Language QA Pairs
English ~3.2K
Chinese ~7K

Recommended Usage

  • AnesBench: Use as the primary evaluation benchmark to measure LLM performance across factual recall, hybrid reasoning, and complex decision-making in anesthesiology.

  • AnesCorpus: Apply for CPT to enhance domain knowledge before fine-tuning.

  • AnesQA: Use for SFT.

  • AnesR1: Use for SFT or RLVR to strengthen reasoning capability.

🔨 Evaluation


📁 0. Clone the Repository & Download Benchmark

Clone Repository:

git clone https://github.com/MiliLab/AnesBench
cd AnesBench

Download Benchmark:

cd benchmark
huggingface-cli download --repo-type dataset MiliLab/AnesBench --local-dir ./

🧱 1. Prepare the Runtime Environment

Before starting, ensure that CUDA and its compiler nvcc are properly installed and accessible.

Check:

nvcc --version

We recommend separating the SGLang service environment from the inference environment.

SGLang service environment

conda create -n sglang_server python==3.10
conda activate sglang_server

Then, install the required sglang and flashinfer packages.

pip install "sglang[all]"
pip install sglang-router 

Download the wheel file for your environment from https://github.com/flashinfer-ai/flashinfer/releases.

pip install /path/to/flashinfer-wheel

Inference environment

Create a new environment and install the packages based on the requirements file.

conda create -n inference python==3.10
conda activate inference
cd eval
pip install -r requirements.txt

Environment Variables

Prepare environment variables in the .env file.

export RESULT_SAVE_PATH=/path/to/result_save_dir
export MODEL_PATH=/path/to/model
export BENCHMARK_PATH=/path/to/benchmark

and run:

source .env
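A quick Python sanity check (a sketch, not part of this repo) can confirm the variables are visible to the inference process after running `source .env`:

```python
# Sketch: verify the three evaluation environment variables are set.
# Run inside the same shell session where you sourced .env.
import os

required = ("RESULT_SAVE_PATH", "MODEL_PATH", "BENCHMARK_PATH")
missing = [name for name in required if not os.environ.get(name)]
print("missing variables:", missing if missing else "none")
```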

▶️ 2. Run Evaluation

For SGLang service

bash sglang_server.sh 

For Inference

python ./evaluate.py --config ./config.yaml 

🛠️ Training with LLaMA-Factory

To train with AnesCorpus (for CPT) and AnesQA (for SFT) using LLaMA-Factory, follow the steps below:

1. Install LLaMA-Factory

Follow the LLaMA-Factory official installation guide, or use the following commands:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

2. Convert Data to LLaMA-Factory Format

We provide scripts to convert the raw Parquet files into the required JSON format.

📌 The --split argument can be set to:

  • en: English data only
  • cn: Chinese data only
  • all: merge both English and Chinese

For AnesCorpus (CPT):

python tools/anescorpus2json.py \
    --local-dir /path/to/anescorpus/parquet_files \
    --save-dir ./data \
    --split en

This will generate:
./data/AnesCorpus_en.json
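For reference, the generated file should be a JSON array in which each object's `text` field holds one pre-training document; this matches the `"columns": {"prompt": "text"}` mapping used when registering the dataset. A minimal sketch with invented sample text:

```python
# Minimal sketch of the CPT JSON layout; the sample sentences are invented.
import json

records = [
    {"text": "Propofol is a short-acting intravenous anesthetic used for induction ..."},
    {"text": "Train-of-four stimulation is used to monitor neuromuscular blockade ..."},
]
with open("AnesCorpus_en.sample.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
print("wrote", len(records), "records")
```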

For AnesQA (SFT):

python tools/anescorpus2json.py \
    --local-dir /path/to/anesqa/parquet_files \
    --save-dir ./data \
    --split en \
    --instruction "Please answer the following question based on the anesthesiology context."

This will generate:
./data/AnesQA_en.json
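LLaMA-Factory's default SFT format is alpaca-style instruction/input/output records. A sketch of one AnesQA-like entry, with the question and answer invented for illustration:

```python
# Alpaca-style SFT record as LLaMA-Factory expects by default; content is illustrative.
record = {
    "instruction": "Please answer the following question based on the anesthesiology context.",
    "input": "Why is sevoflurane commonly chosen for inhalational induction?",
    "output": "Its low pungency and relatively rapid onset make it well tolerated by mask.",
}

# The converter should emit exactly these three keys per record.
assert set(record) == {"instruction", "input", "output"}
print("valid SFT record")
```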

3. Register the Dataset

Move your dataset files into LLaMA-Factory/data, and register the dataset entries in LLaMA-Factory/data/dataset_info.json.

{
  "anescorpus_en": {
    "file_name": "AnesCorpus_en.json",
    "columns": {
      "prompt": "text"
    }
  },
  "anesqa_en": {
    "file_name": "AnesQA_en.json"
  }
}

For more details on dataset registration and formatting, refer to the official LLaMA-Factory data preparation guide (available in its manual and on GitHub).
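A small helper (hypothetical, not part of LLaMA-Factory) can catch the most common registration mistake: a `file_name` entry that does not exist under the data directory.

```python
# Hypothetical sanity check: report whether each registered dataset file exists.
import json
import os

def check_registered(data_dir: str) -> dict:
    """Map each dataset name in dataset_info.json to whether its file exists."""
    info_path = os.path.join(data_dir, "dataset_info.json")
    if not os.path.exists(info_path):
        return {}
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    return {
        name: os.path.exists(os.path.join(data_dir, entry.get("file_name", "")))
        for name, entry in info.items()
    }

print(check_registered("LLaMA-Factory/data"))
```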

4. Set Config File

You can use or modify the example config files we provide in configs/.

Edit them to set paths like:

# Example snippet
dataset_dir: LLaMA-Factory/data    # directory containing "dataset_info.json"
dataset: anesqa_en
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
output_dir: ./output/llama3.1-anesqa-sft
...

More details can be found in the official LLaMA-Factory guide.

5. Launch Training from CLI

Continuous Pre-training (CPT)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train configs/qwen2.5-7b-pt-anesthesia.yaml

Supervised Fine-Tuning (SFT)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train configs/qwen2.5-7b-sft-anesthesia.yaml

⭐ Citation

If you find AnesSuite helpful, please consider giving this repo a ⭐ and citing:

@article{AnesBench,
  title={AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology},
  author={Xiang Feng and Wentao Jiang and Zengmao Wang and Yong Luo and Pingbo Xu and Baosheng Yu and Hua Jin and Bo Du and Jing Zhang},
  journal={arXiv preprint arXiv:2504.02404},
  year={2025}
}
