Before running evaluations, please install the required packages following the instructions below.
First, create a `virtual_envs` directory under the repository root to hold the virtual environments:

```bash
mkdir virtual_envs
```

We use different virtual environments for different test sets because package versions are critical when evaluating math and code benchmarks. We build the virtual environments from the official repositories of Eurus, Qwen2.5-Math, LiveCodeBench, UGPhysics, and SciCode, respectively.
For AIME 2024, AMC, MATH-500, and LeetCode, use the following virtual environment:
```bash
python -m venv ./virtual_envs/prime
source ./virtual_envs/prime/bin/activate
pip install -r ./eval/requirements_prime.txt
```

For Minerva Math and OlympiadBench, use the following virtual environment:
```bash
python -m venv ./virtual_envs/qwen_math
source ./virtual_envs/qwen_math/bin/activate
pip install -r ./eval/requirements_qwen_math.txt
```

For LiveCodeBench, use the following virtual environment:
```bash
python -m venv ./virtual_envs/lcb
source ./virtual_envs/lcb/bin/activate
pip install -r ./eval/requirements_lcb.txt
```

For UGPhysics and SciCode, use the following virtual environment:
```bash
python -m venv ./virtual_envs/em_inf
source ./virtual_envs/em_inf/bin/activate
pip install -r ./eval/requirements_em_inf.txt
```

To run the SciCode evaluation, please download the numeric test results and save them as `eval/SciCode/eval/data/test_data.h5`.
The `EM-INF` directory contains shell scripts for running evaluations with different inference modes. Each mode requires different arguments; please follow the guide below to set up the evaluation properly.

`EM-INF/normal_eval.sh` contains the settings and commands for running evaluation with normal inference mode. Please set the following arguments in `EM-INF/normal_eval.sh` accordingly:
- `MODEL_NAME_OR_PATH`: HuggingFace model name (e.g., `Qwen/Qwen2.5-7B-Instruct`) or the path to your local model checkpoint.
- `TEMP`: Set the sampling temperature.
- `TASK`: Set the tasks you want to evaluate. Available tasks for normal inference are `math500`, `amc`, `aime`, `qwen`, `leetcode`, `livecodebench`, `ugphysics`, and `scicode`.
  - `all`: Run all the available tasks for normal mode.
  - `math`: Run all the math benchmarks, including MATH-500, AMC, AIME, and Qwen Math (Minerva and OlympiadBench).
  - `math,leetcode,ugphysics`: Run selected tasks, separated by commas.
After the arguments are set, run the following bash command. The results will be saved in `results/sanitized_model_name/normal`:
```bash
bash EM-INF/normal_eval.sh
```

`EM-INF/em_inf_eval.sh` contains the settings and commands for running evaluation with EM-INF inference mode. Please set the following arguments in `EM-INF/em_inf_eval.sh` accordingly:
- `MODEL_NAME_OR_PATH`: HuggingFace model name (e.g., `Qwen/Qwen2.5-7B-Instruct`) or the path to your local model checkpoint.
- `TEMP`: Set the sampling temperature.
- `TASK`: Set the tasks you want to evaluate. Available tasks for EM-INF are `math500`, `amc`, `aime`, `qwen`, `leetcode`, `ugphysics`, and `scicode`.
  - `all`: Run all the available tasks for EM-INF mode.
  - `math`: Run all the math benchmarks, including MATH-500, AMC, AIME, and Qwen Math (Minerva and OlympiadBench).
  - `math,leetcode,ugphysics`: Run selected tasks, separated by commas.
- `HYPERPARAMETERS`: There are 4 hyperparameters required for EM-INF:
  - `threshold`: Float in [0, 1]. If the entropy of the logits falls below `threshold`, entropy minimization stops and token sampling begins.
  - `kl_weight`: Float in [0, 1]. Controls the weight of the KL constraint when minimizing the entropy of the logits.
  - `learning_rate`: Float in [0, 1]. The step size used for the entropy-minimization updates.
  - `n_grad_steps`: Integer in [0, ∞). Controls the number of optimization steps when minimizing the entropy of the logits.
- `NUM_PROCESSES`: The number of model copies to run inference in parallel. Set this based on your model size and available GPU memory; setting it too high will raise a CUDA out-of-memory error.
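To make the hyperparameters concrete, here is a minimal NumPy sketch of per-token entropy minimization with a KL penalty, in the spirit of the description above. It is an illustration only, not the repository's implementation: the function name `em_inf_step_logits` and the closed-form gradients are our own.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def em_inf_step_logits(logits, threshold, kl_weight, learning_rate, n_grad_steps):
    """Run up to n_grad_steps gradient steps that minimize the entropy of
    softmax(logits), with a KL penalty keeping the distribution close to
    the original one. Stops early once entropy drops below `threshold`."""
    z0 = np.asarray(logits, dtype=np.float64)
    p0 = softmax(z0)                      # reference distribution for the KL term
    z = z0.copy()
    for _ in range(n_grad_steps):
        p = softmax(z)
        h = entropy(p)
        if h < threshold:                 # confident enough: stop and sample
            break
        kl = float((p * (np.log(p + 1e-12) - np.log(p0 + 1e-12))).sum())
        # Closed-form gradients w.r.t. z of H(softmax(z)) and KL(softmax(z) || p0)
        grad_h = -p * (np.log(p + 1e-12) + h)
        grad_kl = p * (np.log(p + 1e-12) - np.log(p0 + 1e-12) - kl)
        z = z - learning_rate * (grad_h + kl_weight * grad_kl)
    return z
```

A larger `kl_weight` keeps the sharpened distribution closer to the model's original one, while `learning_rate` and `n_grad_steps` trade off how aggressively the entropy is reduced per token.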
After the arguments are set, run the following bash command. The results will be saved in `results/sanitized_model_name/em_inf`:
```bash
bash EM-INF/em_inf_eval.sh
```

`EM-INF/adaptive_temp_eval.sh` contains the settings and commands for running evaluation with Adaptive Temperature inference mode. Please set the following arguments in `EM-INF/adaptive_temp_eval.sh` accordingly:
- `MODEL_NAME_OR_PATH`: HuggingFace model name (e.g., `Qwen/Qwen2.5-7B-Instruct`) or the path to your local model checkpoint.
- `TASK`: Set the tasks you want to evaluate. Available tasks for Adaptive Temperature inference are `math500`, `amc`, `aime`, `qwen`, `leetcode`, `ugphysics`, and `scicode`.
  - `all`: Run all the available tasks for Adaptive Temperature mode.
  - `math`: Run all the math benchmarks, including MATH-500, AMC, AIME, and Qwen Math (Minerva and OlympiadBench).
  - `math,leetcode,ugphysics`: Run selected tasks, separated by commas.
- `HYPERPARAMETERS`: There are 6 hyperparameters required for Adaptive Temperature:
  - `tmax`: Float in [0, 1]. The temperature upper bound for adaptive temperature optimization, which is also the starting temperature.
  - `tmin`: Float in [0, 1]. The temperature lower bound for adaptive temperature optimization.
  - `max_iter`: Integer in [0, ∞). The maximum number of optimization iterations.
  - `tol`: Float in [0, ∞). The temperature tolerance that decides when to stop the adaptive temperature optimization.
  - `target_ratio`: Float in [0, 1]. The fraction of the initial logit entropy to reduce to (e.g., `target_ratio=0.85` means adaptive temperature minimizes the logit entropy to 0.85 of its initial value).
  - `threshold`: Float in [0, 1]. If the entropy of the logits falls below `threshold`, entropy minimization stops and token sampling begins.
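To illustrate how these hyperparameters interact, here is a minimal NumPy sketch of adaptive temperature as a bisection search over the temperature, exploiting the fact that softmax entropy increases monotonically with temperature. This is an assumption-laden sketch, not the repository's implementation; the function name `adaptive_temperature` is illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def adaptive_temperature(logits, tmax=1.0, tmin=0.1, max_iter=20,
                         tol=1e-3, target_ratio=0.85, threshold=0.1):
    """Bisection search for a temperature t in [tmin, tmax] such that the
    entropy of softmax(logits / t) is roughly target_ratio times the
    entropy at the starting temperature tmax."""
    z = np.asarray(logits, dtype=np.float64)
    h0 = entropy(softmax(z / tmax))
    if h0 < threshold:            # already confident enough: sample as-is
        return tmax
    target = target_ratio * h0
    lo, hi = tmin, tmax
    for _ in range(max_iter):
        if hi - lo < tol:         # temperature interval within tolerance
            break
        mid = 0.5 * (lo + hi)
        if entropy(softmax(z / mid)) > target:
            hi = mid              # still too uncertain: lower the temperature
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Tighter `tol` and larger `max_iter` give a temperature whose entropy more precisely matches `target_ratio * h0`, at the cost of more softmax evaluations per token.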
After the arguments are set, run the following bash command. The results will be saved in `results/sanitized_model_name/adaptive_temp`:
```bash
bash EM-INF/adaptive_temp_eval.sh
```

The `EM-RL` folder contains the training code for our entropy-minimization training method.
Our training code is based on veRL. We use vLLM for inference and develop our evaluation scripts based on PRIME, Qwen2.5-Math, LiveCodeBench, UGPhysics, and SciCode. Our data is sourced from RLFlow and PRIME.
You can cite our work using the following BibTeX entry:

```bibtex
@article{agarwal2025unreasonable,
  title={The unreasonable effectiveness of entropy minimization in llm reasoning},
  author={Agarwal, Shivam and Zhang, Zimin and Yuan, Lifan and Han, Jiawei and Peng, Hao},
  journal={arXiv preprint arXiv:2505.15134},
  year={2025}
}
```