• 📖 Introduction • ⚙ Installation • 🤗 Models • 🚀 A Quick Start • 🔗 Citation
Recent breakthroughs in large language models (LLMs) on complex reasoning tasks have been largely driven by Test-Time Scaling (TTS) — a paradigm that enhances reasoning by intensifying inference-time computation. TTS methods can be classified into:
- Training-based TTS, such as reinforcement learning and supervised fine-tuning, which train models to produce longer reasoning chains but often incur high training and inference costs.
- Training-free TTS, which explores the solution space during inference via strategies like parallel and sequential scaling. These approaches are cost-efficient and require no additional training, but generally underperform on complex tasks.
✨ We propose Step-level Verifier-guided Hybrid Test-Time Scaling (Hybrid TTS), a novel training-free TTS paradigm that further enhances the reasoning capabilities of LLMs.
We first propose Conditional Step-level Self-refinement to validate the effectiveness of fine-grained sequential scaling. On this basis, we further introduce Step-level Verifier-guided Hybrid Test-Time Scaling, which combines parallel (Best-of-N) and sequential (self-refinement) scaling within a step-level tree search.
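For intuition, here is a minimal sketch of how parallel and sequential scaling interleave along a single search path. All helper callables (`sample_next_step`, `refine_step`, `prm_score`, `is_finished`) are hypothetical placeholders supplied by the caller, not this repository's actual API; the full method additionally runs `$num_sequence_values` such paths in a step-level tree search.

```python
# Minimal sketch of one Hybrid TTS search path. The four helper callables
# are hypothetical placeholders for illustration, not the repository's API.

def solve_one_path(question, sample_next_step, refine_step, prm_score,
                   is_finished, num_sample=4, num_refine=5, max_steps=40):
    solution = ""
    for _ in range(max_steps):
        # Parallel scaling (Best-of-N): sample candidate next steps and keep
        # the one the step-level verifier (PRM) scores highest.
        candidates = [sample_next_step(question, solution)
                      for _ in range(num_sample)]
        best = max(candidates, key=lambda s: prm_score(question, solution + s))

        # Sequential scaling: conditionally self-refine the selected step,
        # accepting a rewrite only when it improves the PRM score.
        for _ in range(num_refine):
            refined = refine_step(question, solution, best)
            if prm_score(question, solution + refined) > prm_score(question, solution + best):
                best = refined

        solution += best
        if is_finished(solution):
            break
    return solution
```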
Our experiments across five instruction-tuned LLMs (3B–14B) on the MATH500, AIME24, and GPQA Diamond datasets show consistent and significant improvements, with performance gains of up to 28.6%. Notably, our lightweight 3B model outperforms an RL-enhanced 7B baseline on GPQA Diamond by 2.4%.
For a detailed explanation of the method and full experimental results, please refer to our paper.
```bash
conda create -n hybrid_tts python=3.10
conda activate hybrid_tts
pip install -r requirements.txt
cd envs/MATH/latex2sympy
pip install -e .
cd -
```
Our main experiments employ the following models, all publicly available on Hugging Face:
- LLMs
  - `Qwen/Qwen2.5-3B-Instruct`
  - `Qwen/Qwen2.5-7B-Instruct`
  - `Qwen/Qwen2.5-14B-Instruct`
  - `meta-llama/Llama-3.1-8B-Instruct`
  - `google/gemma-3-4b-it`
- PRMs
  - `Qwen/Qwen2.5-Math-PRM-7B`
  - `peiyi9979/math-shepherd-mistral-7b-prm`
Launch the PRM service:

```bash
CUDA_VISIBLE_DEVICES=0 vllm serve /path/.../Qwen2.5-Math-PRM-7B --task reward --max-model-len 32768 --host 127.0.0.1 --port 8011
```

Launch the LLM service:

```bash
CUDA_VISIBLE_DEVICES=1 vllm serve /path/.../Qwen2.5-3B-Instruct --host 127.0.0.1 --port 8012
```

To run inference on the GPQA dataset, switch `COT_TASK_DESC` and `REWRITE_TASK_DESC` from the "Common Prompts" to the "GPQA prompts" as needed in the `Hybrid_TTS/envs/critic_MATH/*_prompt.py` files.
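Once both services are running, a quick sanity check can confirm they respond. This is a hedged sketch: it assumes vLLM's standard OpenAI-compatible HTTP endpoints (`/v1/models`, `/v1/chat/completions`) and the ports used above.

```python
# Sanity-check the two vLLM services started above.
# Assumes vLLM's OpenAI-compatible HTTP API on ports 8011 and 8012.
import requests

BASE = "http://127.0.0.1"

# Both servers should list their served model under /v1/models.
for name, port in [("PRM", 8011), ("LLM", 8012)]:
    data = requests.get(f"{BASE}:{port}/v1/models", timeout=10).json()
    print(f"{name} serving:", [m["id"] for m in data["data"]])

# The LLM service answers standard chat completions.
llm_model = requests.get(f"{BASE}:8012/v1/models", timeout=10).json()["data"][0]["id"]
resp = requests.post(
    f"{BASE}:8012/v1/chat/completions",
    json={"model": llm_model,
          "messages": [{"role": "user", "content": "What is 12 * 7?"}],
          "max_tokens": 64},
    timeout=60,
).json()
print(resp["choices"][0]["message"]["content"])
```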
The scale of the Hybrid TTS process can be controlled by adjusting the following parameters in `Hybrid_TTS/reason/evaluation/eval.sh`:

- `$num_sample_values`: sets the number of samples for Best-of-N.
- `$num_sequence_values`: sets the number of search paths for MCTS.
- `$num_refine_values`: sets the number of iteration rounds for self-refinement.
- Stopping conditions for self-refinement (see the sketch after this list):
  - The process stops if the PRM score exceeds the threshold `$prm_threshold_values`.
  - The process stops if the PRM score improvement over `$refine_cut_num_values` consecutive rounds is less than `$prm_gap_values`.
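Put together, one plausible reading of these two stopping rules is the helper below; the authoritative logic lives in the repository, so treat this as illustrative only.

```python
# Sketch of the self-refinement stopping rules described above.
# `scores` holds the PRM score observed after each refinement round so far.

def should_stop(scores, prm_threshold=0.9, refine_cut_num=2, prm_gap=0.2):
    # Rule 1: the current step is already good enough.
    if scores[-1] >= prm_threshold:
        return True
    # Rule 2: the last `refine_cut_num` rounds improved the score by less
    # than `prm_gap`, so further refinement is unlikely to help.
    if len(scores) > refine_cut_num:
        if scores[-1] - scores[-1 - refine_cut_num] < prm_gap:
            return True
    return False

# Example: the threshold is not reached, but the last two rounds gained
# only 0.05 < 0.2, so refinement stops early.
print(should_stop([0.55, 0.58, 0.60]))  # True
```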
The experimental setup in our paper is as follows:
| Experimental Setup | num_sample_values | num_sequence_values | num_refine_values | prm_threshold_values | refine_cut_num_values | prm_gap_values |
|---|---|---|---|---|---|---|
| Hybrid Test-Time Scaling | 4/8/16 | 4/8/16 | 5 | 0.9 | 2 | 0.2 |
| BoN+Self-Refinement | 4/8/16 | 1 | 5 | 0.9 | 2 | 0.2 |
| MCTS+BoN | 4/8/16 | 4/8/16 | 0 | 0.0 | 0 | 0.0 |
For the detailed implementation based on OpenR, please refer to the `solution` branch.
The RM@k metric reported in the paper corresponds to `prm_min_max` in the experiment logs.
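Judging by the name, `prm_min_max` plausibly denotes min-aggregation over step scores followed by max-selection across candidates; the sketch below illustrates that reading (the `select_by_prm_min_max` helper and the sample data are hypothetical).

```python
# Hedged sketch of the "min-max" selection implied by the name prm_min_max:
# score each candidate by the minimum of its step-level PRM scores, then
# return the candidate with the maximum such score.

def select_by_prm_min_max(candidates):
    """candidates: list of (answer, [step_score_1, step_score_2, ...])."""
    return max(candidates, key=lambda c: min(c[1]))

cands = [("A", [0.9, 0.4, 0.8]),   # weakest step: 0.4
         ("B", [0.7, 0.7, 0.6])]   # weakest step: 0.6
print(select_by_prm_min_max(cands)[0])  # -> "B"
```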
⚠ Note: As discussed in the Limitations section and Appendix E of our paper, fluctuations in experimental results are expected. We recommend multiple inference runs to obtain a more stable evaluation.
```bash
export PYTHONPATH=$(pwd)
bash reason/evaluation/eval.sh
```

If you find our paper useful for your research, please cite it:
```bibtex
@article{chang2025step,
  title={Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models},
  author={Chang, Kaiyan and Shi, Yonghao and Wang, Chenglong and Zhou, Hang and Hu, Chi and Liu, Xiaoqian and Luo, Yingfeng and Ge, Yuan and Xiao, Tong and Zhu, Jingbo},
  journal={arXiv preprint arXiv:2507.15512},
  year={2025}
}
```
For questions or suggestions, please contact: [email protected]