【📖 arXiv | 🤗 HF Papers | 🎉 Website | 🧠 alphaArxiv | 📦 NOVEReason Datasets | 🚀 NOVER1 Models】
- NOVER has been accepted to EMNLP 2025! 🎉 Check out more about NOVER on the official website.
- This is the official implementation of the paper "NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning". The repo is built on top of Hugging Face trl.
- NOVER (NO-VERifier) is a novel reinforcement learning approach for training language models without requiring explicit verifiers.
- The method can perform DeepSeek R1-Zero-like training on ANY SFT DATA, extending reasoning abilities beyond math and coding.
- We released the NOVEReason dataset collections, including NOVEReason_2k, NOVEReason_5k and NOVEReason_full.
- Initialize training code
- Simplify SFT data import
- Add custom tag support
- Upgrade to `trl==0.20.0`
- Clean NOVER sampled data
- Integrate Hydra config management
- Simplify logging in `CustomGRPOTrainer`
- Streamline NCCL logging
- Simplify start script
- Clean full NOVEReason data (data has been cleaned, but not all was used in the paper)
- Release NOVER1-Qwen2.5-7B, NOVER1-Qwen3-4B, and their config files
- Add inverse incentive training support
- Add incentive steering support
- Clone this repository, install the dependencies, and set your Weights & Biases credentials:

```bash
git clone https://github.com/thinkwee/NOVER.git
cd NOVER
pip install -r requirements.txt
export WANDB_API_KEY=your_api_key
export WANDB_ENTITY=your_entity
```

- NOVER uses Hydra for configuration management. To run an experiment, simply create a YAML file specifying only the parameters you want to override; see `config/my_exp.yaml` for an example. For the full list of configurable options, refer to `config/config.yaml`.
- Some key parameters to customize include (a minimal override sketch follows this list):
  - `project.suffix`: A unique identifier for your training run
  - `project.wandb_project`: Your Weights & Biases project name
  - `project.save_base_path`: Directory to save model checkpoints
  - `dataset.hf_home`: Your Hugging Face root path for loading models and datasets
  - `dataset.name`: Your dataset name under `HF_HOME`
  - `model.name_vllm`: Model to use for the vLLM server (e.g., "Qwen/Qwen2.5-7B")
  - `model.name`: Model to use for training (usually the same as the vLLM model)
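  As a rough sketch, an override file might look like the following. The key names come from the options listed above, but the exact nesting should be checked against `config/config.yaml`; all values here are placeholders.

```yaml
# Hypothetical override file (config/my_exp.yaml). Keys follow the options
# listed above, values are placeholders; config/config.yaml is authoritative.
project:
  suffix: my_exp
  wandb_project: nover-experiments
  save_base_path: /path/to/checkpoints
dataset:
  hf_home: /path/to/hf_home
  name: my_nover_dataset
model:
  name_vllm: Qwen/Qwen2.5-7B
  name: Qwen/Qwen2.5-7B
```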
- Prepare your data; see the data section below for more details.
- Start the training! This will begin the training process using the configuration parameters defined in `config/my_exp.yaml`:
```bash
# First, start the vLLM server.
# This launches a vLLM server for model rollouts during reinforcement learning.
./scripts/run_vllm_server.sh my_exp

# Then, start the training.
./scripts/run_training.sh my_exp
```

- Format your data as a standard Hugging Face Arrow dataset with at least two columns, `prompt` and `reference`, representing the input and output of any standard SFT dataset. No conversational format or system prompts are required. A minimal example of building such a dataset is shown after the prompt template below.
- Structure your input data in the `prompt` column as follows:
```
Question: {input}
Answer the question and return in the following format:
<think>
...
</think>
<answer>
...
</answer>
```
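As a minimal sketch (not part of the repo), the Hugging Face `datasets` library can be used to build and save a compatible Arrow dataset. The record and output path below are made-up placeholders.

```python
# Minimal sketch: build a NOVER-compatible dataset with the Hugging Face
# `datasets` library. The record and the output path are placeholders.
from datasets import Dataset

PROMPT_TEMPLATE = (
    "Question: {input}\n"
    "Answer the question and return in the following format:\n"
    "<think>\n...\n</think>\n<answer>\n...\n</answer>"
)

records = [
    {
        "input": "Why does the sky appear blue?",
        "output": "Shorter (blue) wavelengths of sunlight are scattered more strongly by the atmosphere.",
    },
]

dataset = Dataset.from_dict({
    "prompt": [PROMPT_TEMPLATE.format(input=r["input"]) for r in records],
    "reference": [r["output"] for r in records],
})
dataset.save_to_disk("my_nover_dataset")  # hypothetical output path
```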
- For convenience, you can use the included dataset formatter to automatically format your dataset:

```bash
# Format a Hugging Face dataset
./scripts/format_dataset.sh squad --prompt-column question --reference-column answers.text

# Format a custom CSV file
./scripts/format_dataset.sh data.csv --prompt-column question --reference-column answer

# Format a custom JSONL file
./scripts/format_dataset.sh data.jsonl --prompt-column input --reference-column output
```

The model is trained with Hugging Face PEFT and saved in the standard LoRA adapter format. To use the trained model, merge the LoRA adapter into the base model with the following code:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# base_model_path: the original base model used for training
# adapter_path: the trained LoRA adapter checkpoint
# merge_output_dir: directory where the merged model will be written
base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = model.merge_and_unload()
merged_model.save_pretrained(merge_output_dir, safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
tokenizer.save_pretrained(merge_output_dir)
```

- Use vLLM to serve the merged model in `merge_output_dir`.
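For example, here is a minimal offline-inference sketch with vLLM, assuming `merge_output_dir` from the merging step above and the prompt format described earlier:

```python
# Minimal sketch: run the merged model with vLLM offline inference.
# merge_output_dir is the directory produced by the merging step above.
from vllm import LLM, SamplingParams

llm = LLM(model=merge_output_dir)
sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

prompt = (
    "Question: What causes tides on Earth?\n"
    "Answer the question and return in the following format:\n"
    "<think>\n...\n</think>\n<answer>\n...\n</answer>"
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```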
If you find this work useful, please cite our paper:

```bibtex
@article{liu2025noverincentivetraininglanguage,
  title={NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning},
  author={Wei Liu and Siya Qi and Xinyu Wang and Chen Qian and Yali Du and Yulan He},
  year={2025},
  eprint={2505.16022},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.16022},
}
```