
📖 arXiv | 🤗 HF Papers | 🎉 Website | 🧠 alphaArxiv | 📦 NOVEReason Datasets | 🚀 NOVER1 Models

NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Overview

  • NOVER (NO-VERifier) is a novel reinforcement learning approach for training language models without requiring explicit verifiers.
  • The method can perform DeepSeek R1-Zero-like training on ANY SFT DATA, extending reasoning abilities beyond math and coding.
  • We have released the NOVEReason dataset collection, including NOVEReason_2k, NOVEReason_5k, and NOVEReason_full.

Updates

  • Initialize training code
  • Simplify SFT data import
  • Add custom tag support
  • Upgrade to trl==0.20.0
  • Clean NOVER sampled data
  • Integrate Hydra config management
  • Simplify logging in CustomGRPOTrainer
  • Streamline NCCL logging
  • Simplify start script
  • Clean the full NOVEReason data (the data has been cleaned, though not all of it was used in the paper)
  • Release the NOVER1-Qwen2.5-7B and NOVER1-Qwen3-4B models along with their config files
  • Add inverse incentive training support
  • Add incentive steering support

QuickStart

  1. Clone this repository, install dependencies, and set your Weights & Biases credentials.
git clone https://github.com/thinkwee/NOVER.git
cd NOVER
pip install -r requirements.txt
export WANDB_API_KEY=your_api_key
export WANDB_ENTITY=your_entity
  2. NOVER uses Hydra for configuration management. To run an experiment, create a YAML file that specifies only the parameters you want to override; see config/my_exp.yaml for an example and config/config.yaml for the full list of configurable options. A minimal override file is sketched after the parameter list below.
  • Some key parameters to customize include:
    • project.suffix: A unique identifier for your training run
    • project.wandb_project: Your Weights & Biases project name
    • project.save_base_path: Directory to save model checkpoints
    • dataset.hf_home: Your Hugging Face root path for loading models and datasets
    • dataset.name: Your dataset name under HF_HOME
    • model.name_vllm: Model to serve with the vLLM server (e.g., "Qwen/Qwen2.5-7B")
    • model.name: Model to use for training (usually the same as the vLLM model)
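
For instance, a minimal config/my_exp.yaml might look like the following sketch (all values here are illustrative placeholders, not defaults from the repository):

# config/my_exp.yaml -- only the keys you want to override need to appear;
# everything else falls back to config/config.yaml.
project:
  suffix: my_exp                    # unique identifier for this run
  wandb_project: nover-experiments  # your Weights & Biases project name
  save_base_path: ./checkpoints     # directory for model checkpoints

dataset:
  hf_home: /path/to/hf_home         # Hugging Face root path
  name: my_formatted_dataset        # dataset name under HF_HOME

model:
  name_vllm: Qwen/Qwen2.5-7B        # model served by the vLLM server for rollouts
  name: Qwen/Qwen2.5-7B             # model being trained (usually the same)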
  3. Prepare your data; see the Data section below for more details.

  4. Start the training! This will begin the training process using the configuration parameters defined in config/my_exp.yaml.

# First, start the vLLM server.
# This launches a vLLM server used for model rollouts during reinforcement learning.
./scripts/run_vllm_server.sh my_exp

# Then, start the training.
./scripts/run_training.sh my_exp

Data

  • Format your data as a standard Hugging Face Arrow dataset with at least two columns, prompt and reference, representing the input and output of a standard SFT dataset. No conversational format or system prompts are required. (A programmatic sketch appears at the end of this section.)
  • Structure your input data in the prompt column as follows:
Question: {input}

Answer the question and return in the following format:

<think>
...
</think>

<answer>
...
</answer>
  • For convenience, you can use the included dataset formatter to automatically format your dataset.
# Format Hugging Face dataset
./scripts/format_dataset.sh squad --prompt-column question --reference-column answers.text

# Format a custom CSV file
./scripts/format_dataset.sh data.csv --prompt-column question --reference-column answer

# Format a custom JSONL file
./scripts/format_dataset.sh data.jsonl --prompt-column input --reference-column output
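
If you prefer to build the dataset programmatically instead of using the formatter script, a minimal sketch with the datasets library might look like this (the file path, the raw examples, and the NOVER_PROMPT template wrapper are illustrative assumptions, not part of the repository):

from datasets import Dataset

# Illustrative prompt template matching the format shown above.
NOVER_PROMPT = (
    "Question: {input}\n\n"
    "Answer the question and return in the following format:\n\n"
    "<think>\n...\n</think>\n\n"
    "<answer>\n...\n</answer>"
)

# Hypothetical raw SFT pairs; replace with your own data.
raw = [
    {"input": "What is the capital of France?", "output": "Paris"},
]

dataset = Dataset.from_dict({
    "prompt": [NOVER_PROMPT.format(input=ex["input"]) for ex in raw],
    "reference": [ex["output"] for ex in raw],
})

# Save in the standard Arrow format expected by the training scripts.
dataset.save_to_disk("/path/to/hf_home/my_formatted_dataset")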

Generation

The model is trained with Hugging Face PEFT and saved in the standard LoRA adapter format. To use the trained model, first merge the adapter into the base model and save the result:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and apply the trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge the adapter weights into the base model.
merged_model = model.merge_and_unload()

# Save the merged model as safetensors.
merged_model.save_pretrained(
    merge_output_dir,
    safe_serialization=True)

# Save the tokenizer alongside the merged weights.
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
tokenizer.save_pretrained(merge_output_dir)
Finally, serve the merged model in merge_output_dir with vLLM.
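
A minimal serving command could look like this (assuming a vLLM version that ships the vllm serve CLI; adjust the port and parallelism flags to your setup):

# Serve the merged model behind an OpenAI-compatible API.
vllm serve /path/to/merge_output_dir --port 8000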

Citation

If you find this work useful, please cite our paper:

@article{liu2025noverincentivetraininglanguage,
      title={NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning}, 
      author={Wei Liu and Siya Qi and Xinyu Wang and Chen Qian and Yali Du and Yulan He},
      year={2025},
      eprint={2505.16022},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16022}, 
}