• 📖 Introduction • ⚙ Installation • 🤗 Models • 🚀 A Quick Start • 🔗 Citation
Recent breakthroughs in large language models (LLMs) on complex reasoning tasks have been largely driven by Test-Time Scaling (TTS) — a paradigm that enhances reasoning by intensifying inference-time computation. TTS methods can be classified into:
- Training-based TTS, such as reinforcement learning and supervised fine-tuning, which train models to produce longer reasoning chains but often incur high training and inference costs.
- Training-free TTS, which explores the solution space during inference via strategies like parallel and sequential scaling. These approaches are cost-efficient and require no additional training, but generally underperform on complex tasks.
✨ We propose Step-level Verifier-guided Hybrid Test-Time Scaling (Hybrid TTS), a novel training-free TTS paradigm that further enhances the reasoning capabilities of LLMs.
We first propose Conditional Step-level Self-refinement to validate the effectiveness of fine-grained sequential scaling. On this basis, we further introduce Step-level Verifier-guided Hybrid Test-Time Scaling, which combines parallel (Best-of-N) and sequential (self-refinement) scaling within a step-level tree search.
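For intuition, here is a minimal sketch of how parallel and sequential scaling interleave along a single search path. All helper callables (`sample_next_step`, `refine_step`, `prm_score`, `is_finished`) are hypothetical placeholders supplied by the caller, not this repository's actual API; the full method additionally runs `$num_sequence_values` such paths in a step-level tree search.

```python
# Minimal sketch of one Hybrid TTS search path. The four helper callables
# are hypothetical placeholders for illustration, not the repository's API.

def solve_one_path(question, sample_next_step, refine_step, prm_score,
                   is_finished, num_sample=4, num_refine=5, max_steps=40):
    solution = ""
    for _ in range(max_steps):
        # Parallel scaling (Best-of-N): sample candidate next steps and keep
        # the one the step-level verifier (PRM) scores highest.
        candidates = [sample_next_step(question, solution)
                      for _ in range(num_sample)]
        best = max(candidates, key=lambda s: prm_score(question, solution + s))

        # Sequential scaling: conditionally self-refine the selected step,
        # accepting a rewrite only when it improves the PRM score.
        for _ in range(num_refine):
            refined = refine_step(question, solution, best)
            if prm_score(question, solution + refined) > prm_score(question, solution + best):
                best = refined

        solution += best
        if is_finished(solution):
            break
    return solution
```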
Our experiments across five instruction-tuned LLMs (3B–14B) on the MATH500, AIME24, and GPQA Diamond datasets show consistent and significant improvements, with performance gains of up to 28.6%. Notably, our lightweight 3B model outperforms an RL-enhanced 7B baseline on GPQA Diamond by 2.4%.
For a detailed explanation of the method and full experimental results, please refer to our paper.
```bash
conda create -n hybrid_tts python=3.10
conda activate hybrid_tts
pip install -r requirements.txt
cd envs/MATH/latex2sympy
pip install -e .
cd -
```
Our main experiments employ the following models, all publicly available on Hugging Face:
- LLMs
  - `Qwen/Qwen2.5-3B-Instruct`
  - `Qwen/Qwen2.5-7B-Instruct`
  - `Qwen/Qwen2.5-14B-Instruct`
  - `meta-llama/Llama-3.1-8B-Instruct`
  - `google/gemma-3-4b-it`
- PRMs
  - `Qwen/Qwen2.5-Math-PRM-7B`
  - `peiyi9979/math-shepherd-mistral-7b-prm`
Launch the PRM service:

```bash
CUDA_VISIBLE_DEVICES=0 vllm serve /path/.../Qwen2.5-Math-PRM-7B --task reward --max-model-len 32768 --host 127.0.0.1 --port 8011
```

Launch the LLM service:

```bash
CUDA_VISIBLE_DEVICES=1 vllm serve /path/.../Qwen2.5-3B-Instruct --host 127.0.0.1 --port 8012
```

To run inference on the GPQA dataset, switch `COT_TASK_DESC` and `REWRITE_TASK_DESC` from the "Common Prompts" to the "GPQA prompts" as needed in the `Hybrid_TTS/envs/critic_MATH/*_prompt.py` files.
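Once both services are running, a quick sanity check can confirm they respond. This is a hedged sketch: it assumes vLLM's standard OpenAI-compatible HTTP endpoints (`/v1/models`, `/v1/chat/completions`) and the ports used above.

```python
# Sanity-check the two vLLM services started above.
# Assumes vLLM's OpenAI-compatible HTTP API on ports 8011 and 8012.
import requests

BASE = "http://127.0.0.1"

# Both servers should list their served model under /v1/models.
for name, port in [("PRM", 8011), ("LLM", 8012)]:
    data = requests.get(f"{BASE}:{port}/v1/models", timeout=10).json()
    print(f"{name} serving:", [m["id"] for m in data["data"]])

# The LLM service answers standard chat completions.
llm_model = requests.get(f"{BASE}:8012/v1/models", timeout=10).json()["data"][0]["id"]
resp = requests.post(
    f"{BASE}:8012/v1/chat/completions",
    json={"model": llm_model,
          "messages": [{"role": "user", "content": "What is 12 * 7?"}],
          "max_tokens": 64},
    timeout=60,
).json()
print(resp["choices"][0]["message"]["content"])
```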
The scale of the Hybrid TTS process can be controlled by adjusting the following parameters in `Hybrid_TTS/reason/evaluation/eval.sh`:

- `$num_sample_values`: sets the number of samples for Best-of-N.
- `$num_sequence_values`: sets the number of search paths for MCTS.
- `$num_refine_values`: sets the number of iteration rounds for self-refinement.
- Stopping conditions for self-refinement (see the sketch after this list):
  - The process stops if the PRM score exceeds the threshold `$prm_threshold_values`.
  - The process stops if the PRM score improvement over `$refine_cut_num_values` consecutive rounds is less than `$prm_gap_values`.
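Put together, one plausible reading of these two stopping rules is the helper below; the authoritative logic lives in the repository, so treat this as illustrative only.

```python
# Sketch of the self-refinement stopping rules described above.
# `scores` holds the PRM score observed after each refinement round so far.

def should_stop(scores, prm_threshold=0.9, refine_cut_num=2, prm_gap=0.2):
    # Rule 1: the current step is already good enough.
    if scores[-1] >= prm_threshold:
        return True
    # Rule 2: the last `refine_cut_num` rounds improved the score by less
    # than `prm_gap`, so further refinement is unlikely to help.
    if len(scores) > refine_cut_num:
        if scores[-1] - scores[-1 - refine_cut_num] < prm_gap:
            return True
    return False

# Example: the threshold is not reached, but the last two rounds gained
# only 0.05 < 0.2, so refinement stops early.
print(should_stop([0.55, 0.58, 0.60]))  # True
```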
The experimental setup in our paper is as follows:
| Experimental Setup | num_sample_values | num_sequence_values | num_refine_values | prm_threshold_values | refine_cut_num_values | prm_gap_values |
|---|---|---|---|---|---|---|
| Hybrid Test-Time Scaling | 4/8/16 | 4/8/16 | 5 | 0.9 | 2 | 0.2 |
| BoN+Self-Refinement | 4/8/16 | 1 | 5 | 0.9 | 2 | 0.2 |
| MCTS+BoN | 4/8/16 | 4/8/16 | 0 | 0.0 | 0 | 0.0 |
For the detailed implementation based on OpenR, please refer to the `solution` branch.
The RM@k metric reported in the paper corresponds to `prm_min_max` in the experiment logs.
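Judging by the name, `prm_min_max` plausibly denotes min-aggregation over step scores followed by max-selection across candidates; the sketch below illustrates that reading (the `select_by_prm_min_max` helper and the sample data are hypothetical).

```python
# Hedged sketch of the "min-max" selection implied by the name prm_min_max:
# score each candidate by the minimum of its step-level PRM scores, then
# return the candidate with the maximum such score.

def select_by_prm_min_max(candidates):
    """candidates: list of (answer, [step_score_1, step_score_2, ...])."""
    return max(candidates, key=lambda c: min(c[1]))

cands = [("A", [0.9, 0.4, 0.8]),   # weakest step: 0.4
         ("B", [0.7, 0.7, 0.6])]   # weakest step: 0.6
print(select_by_prm_min_max(cands)[0])  # -> "B"
```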
⚠ Note: As discussed in the Limitations section and Appendix E of our paper, fluctuations in experimental results are expected. We recommend multiple inference runs to obtain a more stable evaluation.
```bash
export PYTHONPATH=$(pwd)
bash reason/evaluation/eval.sh
```

If you find our paper useful for your research, please cite it:
```bibtex
@article{chang2025step,
  title={Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models},
  author={Chang, Kaiyan and Shi, Yonghao and Wang, Chenglong and Zhou, Hang and Hu, Chi and Liu, Xiaoqian and Luo, Yingfeng and Ge, Yuan and Xiao, Tong and Zhu, Jingbo},
  journal={arXiv preprint arXiv:2507.15512},
  year={2025}
}
```
For questions or suggestions, please contact: [email protected]