LLM Evaluation

[toc]

Overview

We have successfully reproduced various open-source model results on the AIME 2024 & AIME 2025 benchmarks.

For benchmarks like AIME24, which contains only 30 problems, scores can vary substantially across repeated runs, so it is crucial to sample multiple responses per prompt and average over them. The number of responses sampled per prompt likely accounts for the slight differences between our evaluation results and those reported by DeepSeek.
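
To make the point concrete, here is a small sketch (with synthetic, hypothetical correctness labels, not real judge output) of why averaging over 64 samples per prompt stabilizes the score on a 30-problem benchmark:

import numpy as np

# Hypothetical judge results: 30 AIME problems x 64 sampled responses,
# where entry [i, j] is True if sample j answered problem i correctly.
rng = np.random.default_rng(0)
correct = rng.random((30, 64)) < 0.7  # stand-in for real correctness labels

# The reported score averages accuracy over every sampled response (avg@64).
print(f"avg@64: {correct.mean() * 100:.2f}")

# Treating each column as a single pass@1 run shows how much one run over
# only 30 problems can swing, which is why repeated sampling matters.
per_run = correct.mean(axis=0) * 100
print(f"single-run min/max: {per_run.min():.1f} / {per_run.max():.1f}")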

DeepSeek-R1-Distill-Qwen-32B

Datasets | 🤗 LLMEval | DeepSeek-R1-Distill-Qwen-32B (Reported)
-------- | ---------- | ---------------------------------------
AIME24   | 70.625     | 72.6
AIME25   | 55.052     | 59.0
MATH-500 | 93.2       | 94.3

QwQ-32B

Datasets | 🤗 LLMEval | QwQ-32B (Reported)
-------- | ---------- | ------------------
AIME24   | 78.65      | 79.5
AIME25   | 67.22      | 69.5

Skywork-OR1-32B

Datasets | 🤗 LLMEval | Skywork-OR1-32B (Reported)
-------- | ---------- | --------------------------
AIME24   | 81.25      | 82.2
AIME25   | 72.66      | 73.3

OpenThinker3-7B

Datasets | 🤗 LLMEval | OpenThinker3-7B (Reported)
-------- | ---------- | --------------------------
AIME24   | 70.41      | 69.0
AIME25   | 59.16      | 53.3

Installation

Basic Environment Setup

Software  | Version
--------- | -------
Python    | 3.10
CANN      | 8.1.RC1
torch     | 2.5.1
torch_npu | 2.5.1

For basic environment setup, please refer to this documentation.
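
A quick, optional sanity check after setup (a minimal sketch, assuming torch and torch_npu are installed at the versions listed above):

import torch
import torch_npu  # noqa: F401  -- importing registers the torch.npu backend

print(torch.__version__)         # expect 2.5.1
print(torch.npu.is_available())  # True once CANN and the NPU driver are configured
print(torch.npu.device_count())  # number of visible NPUs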

vllm & vllm-ascend

To use vllm to accelerate inference, you need to compile and install vllm and vllm-ascend with the following commands. Note that the installation method varies depending on your machine type.

# vllm
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt

# for Atlas 200T A2 Box16
VLLM_TARGET_DEVICE=empty pip install -e . --extra-index https://download.pytorch.org/whl/cpu/

# for Atlas 900 A2 PODc
VLLM_TARGET_DEVICE=empty pip install -e .

# vllm-ascend
git clone -b v0.7.3.post1 --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export COMPILE_CUSTOM_KERNELS=1
python setup.py install
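
After the build finishes, a quick import check confirms both packages are usable (a minimal sketch; vllm_ascend is assumed to be the module name installed by the vllm-ascend build):

# Quick import check after building vllm and vllm-ascend
import vllm
import vllm_ascend  # noqa: F401  -- provided by the vllm-ascend build above

print(vllm.__version__)  # expect 0.7.3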

llmeval

Install the llmeval package by cloning the repository and then using pip to install it in editable mode. This will also install all the necessary dependencies.

# For github source
git clone https://github.com/jianzhnie/LLMEval.git
# For gitee source
# git clone https://gitee.com/jianzhnie/LLMEval.git
cd LLMEval
pip install -e .

Evaluation

The vLLM library provides two inference modes: online server mode and offline mode. Below are the instructions for both methods.

Online Server Mode with vLLM

This method involves starting a vLLM server and then sending requests to it for inference. This approach is more flexible and can handle multiple requests concurrently.

Step 1: Start vLLM Server

First, start the vLLM server with the following command:

model_path="Qwen/QwQ-32B"  # or the local path where the model is stored
model_name="Qwen/QwQ-32B"

num_gpus=8
max_model_len=32768  # support a 32k context window
gpu_memory_utilization=0.9  # raise GPU memory utilization

python -m vllm.entrypoints.openai.api_server \
    --model $model_path \
    --trust-remote-code \
    --served-model-name $model_name \
    --tensor-parallel-size $num_gpus \
    --gpu-memory-utilization $gpu_memory_utilization \
    --max-model-len $max_model_len  \
    --enforce-eager \
    --port 8090

Adjust the tensor_parallel_size parameter based on your available devices.

Please refer to the script for more details.
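
Before launching the full evaluation, you can smoke-test the server with a single request through the OpenAI-compatible endpoint. This is a minimal sketch, not the repo's inference script; the api_key value is arbitrary because the server above is launched without authentication:

from openai import OpenAI

# Point the client at the vLLM server started in Step 1.
client = OpenAI(base_url="http://127.0.0.1:8090/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/QwQ-32B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Compute 17 * 24 and put the result in \\boxed{}."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 40},  # top_k is a vLLM extension to the OpenAI schema
)
print(resp.choices[0].message.content)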

Optional: Start SGLang server/router

Since the evaluation can take days, we also suggest using SGLang with data parallelism to accelerate it. Refer to the SGLang documentation for more details.

# Use router to support better data parallelism
python -m sglang_router.launch_server --model-path Qwen/QwQ-32B --dp-size 4 --host=0.0.0.0 --port=30000

Adjust the dp_size parameter based on your available devices, and adjust the port in the following commands accordingly.

Step 2: Run Inference

After starting the vLLM service, run the inference script to generate responses.

output_dir="./output/Qwen/QwQ-32B"
model_name="Qwen/QwQ-32B"

base_url="http://127.0.0.1:8090/v1"
n_samples=64  # Default sample size for aime24 and aime25

# Create output directory if it doesn't exist
mkdir -p "${output_dir}"

# --- Run Inference Tasks ---
# aime24 (repeated sample 64 times)
python ./llmeval/vllm/online_server.py \
    --input_file "./data/aime24.jsonl" \
    --output_file "${output_dir}/aime24_bz${n_samples}.jsonl" \
    --base_url "${base_url}" \
    --model_name "${model_name}" \
    --n_samples "${n_samples}" \
    --system_prompt_type empty \
    --max_workers 8

# aime25 (repeated sample 64 times)
python ./llmeval/vllm/online_server.py \
    --input_file "./data/aime25.jsonl" \
    --output_file "${output_dir}/aime25_bz${n_samples}.jsonl" \
    --base_url "${base_url}" \
    --model_name "${model_name}" \
    --n_samples "${n_samples}" \
    --system_prompt_type empty \
    --max_workers 8

Please refer to the script for more details.

Note: We apply repeated sampling to reduce evaluation variance, but it can take a long time to complete (more than 8 hours, depending on your device).

Parameter Description

  • --base_url: Base URL of the vLLM service
  • --model_name: Must match the model name used in Step 1
  • --n_samples: Number of samples per prompt
    • AIME24 / AIME 25: Recommended 64 samples
  • --input_file: Input data file path
  • --output_file: Output result file path; model responses will be stored in the gen field
  • --max_workers: Maximum number of concurrent threads, controlling inference speed and resource usage

Sampling Parameters

We use top_p=0.95, temperature=0.6, top_k=40, max_tokens=32768 for sampling.

Resuming Interrupted Inference

If the inference process is interrupted, simply rerun the same command to resume. The script will automatically read the previous output file and process any prompts that haven't completed the required number of samples.
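
The exact bookkeeping lives in the inference script; conceptually, the resume step boils down to something like the sketch below. This is an illustration under the assumption that each output line stores a prompt and its generations in a gen list, not the repo's actual implementation:

import json
from collections import defaultdict

def remaining_samples(prompts, output_file, n_samples):
    """Return how many more generations each prompt still needs."""
    done = defaultdict(int)
    try:
        with open(output_file, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[record["prompt"]] += len(record.get("gen", []))
    except FileNotFoundError:
        pass  # no previous output; every prompt still needs all n_samples
    return {p: max(n_samples - done[p], 0) for p in prompts}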

Offline Inference with vLLM

This method involves loading the model into memory and then running inference locally. This approach is faster and more efficient, but it requires more memory and may not be suitable for large models.

# --- Configuration ---
output_dir="./output/Qwen/QwQ-32B"
model_name_or_path="Qwen/QwQ-32B"
n_samples=64  # Default sample size for aime24 and aime25

# Create output directory if it doesn't exist
mkdir -p "${output_dir}"

# --- Run Inference Tasks ---
# aime24 (repeated sample 64 times)
python llmeval/vllm/offline_infer.py \
    --input_file "./data/aime24.jsonl" \
    --output_file "${output_dir}/aime24_bz${n_samples}.jsonl" \
    --batch_size 32 \
    --model_name_or_path "${model_name_or_path}" \
    --trust_remote_code \
    --max_model_len 32768 \
    --gpu_memory_utilization 0.9 \
    --tensor_parallel_size 8 \
    --enforce_eager \
    --n_samples "${n_samples}"

# aime25 (repeated sample 64 times)
python llmeval/vllm/offline_infer.py \
    --input_file "./data/aime25.jsonl" \
    --output_file "${output_dir}/aime25_bz${n_samples}.jsonl" \
    --batch_size 32 \
    --model_name_or_path "${model_name_or_path}" \
    --trust_remote_code \
    --max_model_len 32768 \
    --gpu_memory_utilization 0.9 \
    --tensor_parallel_size 8 \
    --enforce_eager \
    --n_samples "${n_samples}"

Please refer to the script for more details.
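
For reference, offline inference rests on vLLM's Python API. The sketch below mirrors the flags used in the commands above but is not the repo's offline_infer.py:

from vllm import LLM, SamplingParams

# Mirror the configuration used in the commands above.
llm = LLM(
    model="Qwen/QwQ-32B",
    trust_remote_code=True,
    max_model_len=32768,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=8,
    enforce_eager=True,
)

# n=64 completions per prompt reproduces the repeated-sampling setup.
sampling = SamplingParams(n=64, temperature=0.6, top_p=0.95, top_k=40, max_tokens=32768)

outputs = llm.generate(["Solve: how many positive divisors does 2024 have?"], sampling)
for out in outputs:
    print(len(out.outputs))  # 64 generations for this prompt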

The result format is consistent with the online server mode, and the model responses will be stored in the gen field.

Step 3: Scoring

After completing the inference, use the following commands for scoring:

output_dir="./output/Qwen/QwQ-32B"
n_samples=64  # Default sample size for aime24 and aime25

# Evaluation output directory
reval_dir="${output_dir}/eval_score"
# Create evaluation directory if it doesn't exist
mkdir -p "${reval_dir}"

# --- Evaluate Each Task ---
# Evaluate aime24
python ./llmeval/tasks/math_eval/eval.py \
    --input_path "${output_dir}/aime24_bz${n_samples}.jsonl" \
    --cache_path "${reval_dir}/aime24_bz${n_samples}.jsonl" \
    --task_name "math_opensource/aime24" \
    --max_workers 16 \
    > "${reval_dir}/aime24_bz${n_samples}_res_result.txt"

# Evaluate aime25
python ./llmeval/tasks/math_eval/eval.py \
    --input_path "${output_dir}/aime25_bz${n_samples}.jsonl" \
    --cache_path "${reval_dir}/aime25_bz${n_samples}.jsonl" \
    --task_name "math_opensource/aime25" \
    --max_workers 16 \
    > "${reval_dir}/aime25_bz${n_samples}_res_result.txt"

Please refer to the script for more details.

Parameter Description

  • --input_path: Input file path; you can use the output file from multi-threaded inference directly, or any other file with a consistent format (see the example record after this list). Requirements:
    • JSONL format
    • Contains the prompt and its corresponding fields
    • Model responses stored in the gen field
  • --cache_path: Cache directory for storing temporary files during evaluation
  • --task_name: Evaluation task name, must be one of the following options:
    • math_opensource/aime24
    • math_opensource/aime25
  • --max_workers: Maximum number of concurrent threads, controlling evaluation speed and resource usage.
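
For reference, a minimal input record satisfying these requirements might look like the sketch below. The prompt and gen fields follow the conventions described above; the "answer" field name and its value are assumptions for illustration only:

import json

# Hypothetical JSONL record for eval.py: the prompt, a ground-truth field
# (the name "answer" is an assumption), and the sampled responses in gen.
record = {
    "prompt": "Find the remainder when N is divided by 1000.",
    "answer": "204",
    "gen": ["... so the remainder is \\boxed{204}."],
}
print(json.dumps(record))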

About

LLMEval is your all-in-one toolkit for evaluating LLMs, supporting vLLM and SGLang as backends.
