
Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

• 📖 Introduction • ⚙ Installation • 🤗 Models • 🚀 A Quick Start • 🔗 Citation

📖 Introduction

Recent breakthroughs in large language models (LLMs) on complex reasoning tasks have been largely driven by Test-Time Scaling (TTS) — a paradigm that enhances reasoning by intensifying inference-time computation. TTS methods can be classified into:

  • Training-based TTS, such as reinforcement learning and supervised fine-tuning, which train models to produce longer reasoning chains but often incur high training and inference costs.
  • Training-free TTS, which explores the solution space during inference via strategies like parallel and sequential scaling. These approaches are cost-efficient and require no additional training, but generally underperform on complex tasks.

✨ We propose Step-level Verifier-guided Hybrid Test-Time Scaling (Hybrid TTS), a novel training-free test-time scaling paradigm that further enhances the reasoning capabilities of LLMs.

We first propose Conditional Step-level Self-refinement to validate the effectiveness of fine-grained sequential scaling. On this basis, we further introduce Step-level Verifier-guided Hybrid Test-Time Scaling, which combines parallel (Best-of-N) and sequential (self-refinement) scaling within a step-level tree search.
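For intuition, below is a minimal conceptual sketch of one step of such a search, not the repository's implementation: generate_candidate_steps, prm_score, and refine_step are hypothetical placeholders for the LLM sampler, the process reward model (PRM) verifier, and the self-refinement call.

# Conceptual sketch only. `generate_candidate_steps`, `prm_score`, and
# `refine_step` are hypothetical placeholders, not functions from this repository.
def hybrid_tts_step(prefix_steps, num_samples, num_refine, prm_threshold,
                    generate_candidate_steps, prm_score, refine_step):
    # Parallel scaling (Best-of-N): sample several candidate next steps
    # and keep the one the step-level verifier scores highest.
    candidates = generate_candidate_steps(prefix_steps, n=num_samples)
    best_step = max(candidates, key=lambda step: prm_score(prefix_steps, step))

    # Sequential scaling (self-refinement): iteratively revise the chosen step,
    # keeping a revision only if the verifier prefers it.
    for _ in range(num_refine):
        if prm_score(prefix_steps, best_step) >= prm_threshold:
            break  # the verifier is already confident in this step
        revised = refine_step(prefix_steps, best_step)
        if prm_score(prefix_steps, revised) > prm_score(prefix_steps, best_step):
            best_step = revised

    # The accepted step extends the current path of the step-level tree search.
    return prefix_steps + [best_step]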

[Figure: Hybrid_TTS]

Our experiments across five instruction-tuned LLMs (3B–14B) on MATH500, AIME24, and GPQA Diamond datasets show consistent and significant improvements, achieving up to 28.6% performance gain. Notably, our lightweight 3B model outperforms an RL-enhanced 7B baseline on GPQA Diamond by 2.4%.

For a detailed explanation of the method and full experimental results, please refer to our paper.

⚙ Installation

conda create -n hybrid_tts python=3.10
conda activate hybrid_tts
pip install -r requirements.txt

cd envs/MATH/latex2sympy
pip install -e .
cd -

🤗 Models

Our main experiments employ the following models, which are publicly available on Hugging Face:

  • LLMs
    • Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct
    • meta-llama/Llama-3.1-8B-Instruct
    • google/gemma-3-4b-it
  • PRMs
    • Qwen/Qwen2.5-Math-PRM-7B
    • peiyi9979/math-shepherd-mistral-7b-prm

🚀 A Quick Start

Start vLLM Services

Launch the PRM service:

CUDA_VISIBLE_DEVICES=0 vllm serve /path/.../Qwen2.5-Math-PRM-7B --task reward --max-model-len 32768 --host 127.0.0.1 --port 8011

Launch the LLM service:

CUDA_VISIBLE_DEVICES=1 vllm serve /path/.../Qwen2.5-3B-Instruct --host 127.0.0.1 --port 8012
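Once both services are running, you can optionally sanity-check them through vLLM's OpenAI-compatible /v1/models endpoint. This check is a convenience suggestion, not part of the repository's scripts:

import requests

# Query each vLLM server's OpenAI-compatible model list to confirm it is up.
for name, port in [("PRM", 8011), ("LLM", 8012)]:
    resp = requests.get(f"http://127.0.0.1:{port}/v1/models", timeout=10)
    resp.raise_for_status()
    print(f"{name} service on port {port} serves: {resp.json()['data'][0]['id']}")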

📌 Switching Prompts

To run inference on the GPQA dataset, switch COT_TASK_DESC and REWRITE_TASK_DESC from the "Common Prompts" to the "GPQA prompts" in the Hybrid_TTS/envs/critic_MATH/*_prompt.py files.
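As an illustration of the kind of change involved (the COMMON_* and GPQA_* names below are hypothetical; use whatever the files actually define):

# In each *_prompt.py file, point the task descriptions at the GPQA variants.
# COMMON_* and GPQA_* are hypothetical names for the two prompt sets.
# COT_TASK_DESC = COMMON_COT_TASK_DESC        # "Common Prompts" (default)
COT_TASK_DESC = GPQA_COT_TASK_DESC            # "GPQA prompts"
# REWRITE_TASK_DESC = COMMON_REWRITE_TASK_DESC
REWRITE_TASK_DESC = GPQA_REWRITE_TASK_DESC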

Scripts

The scale of the Hybrid TTS process can be controlled by adjusting the following parameters in the Hybrid_TTS/reason/evaluation/eval.sh file:

  • $num_sample_values: Sets the number of samples for Best-of-N.
  • $num_sequence_values: Sets the number of search paths for MCTS.
  • $num_refine_values: Sets the number of iteration rounds for self-refinement.
  • Stopping conditions for self-refinement (one possible reading is sketched in Python after this list):
    • The process stops if the PRM score exceeds a threshold: $prm_threshold_values.
    • The process stops if the PRM score improvement over $refine_cut_num_values consecutive rounds is less than $prm_gap_values.
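As a reading aid, here is one possible interpretation of that stopping rule, a hedged sketch rather than the repository's code; it treats the gap condition as the total score improvement over the last refine_cut_num rounds:

# Hedged sketch of the self-refinement stopping rule; parameter names mirror
# the eval.sh variables, but the function itself is illustrative only.
def should_stop_refining(score_history, prm_threshold, refine_cut_num, prm_gap):
    # Stop once the verifier score for the current step is high enough.
    if score_history[-1] >= prm_threshold:
        return True
    # Stop if the score has improved by less than prm_gap over the last
    # refine_cut_num consecutive refinement rounds.
    if len(score_history) > refine_cut_num:
        return score_history[-1] - score_history[-1 - refine_cut_num] < prm_gap
    return False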

The experimental setup in our paper is as follows:

| Experimental Setup | num_sample_values | num_sequence_values | num_refine_values | prm_threshold_values | refine_cut_num_values | prm_gap_values |
| --- | --- | --- | --- | --- | --- | --- |
| Hybrid Test-Time Scaling | 4/8/16 | 4/8/16 | 5 | 0.9 | 2 | 0.2 |
| BoN+Self-Refinement | 4/8/16 | 1 | 5 | 0.9 | 2 | 0.2 |
| MCTS+BoN | 4/8/16 | 4/8/16 | 0 | 0.0 | 0 | 0.0 |

For the detailed implementation of OpenR, please refer to the solution branch. The RM@k metric reported in the paper corresponds to prm_min_max in the experiment logs.

Note: As discussed in the Limitations section and Appendix E of our paper, fluctuations in experimental results are expected. We recommend multiple inference runs to obtain a more stable evaluation.

Inference

export PYTHONPATH=$(pwd)
bash reason/evaluation/eval.sh

🔗 Citation

If you find our work useful for your research, please cite our paper:

@article{chang2025step,
  title={Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models},
  author={Chang, Kaiyan and Shi, Yonghao and Wang, Chenglong and Zhou, Hang and Hu, Chi and Liu, Xiaoqian and Luo, Yingfeng and Ge, Yuan and Xiao, Tong and Zhu, Jingbo},
  journal={arXiv preprint arXiv:2507.15512},
  year={2025}
}

For questions or suggestions, please contact: [email protected]
