We propose SATURN, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. SATURN enables scalable task construction, rule-based verification, and precise difficulty control. Its curriculum learning pipeline continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training from easy to hard. To keep training stable, we design a principled mechanism that controls when to transition between difficulty levels.
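For intuition, here is a minimal sketch of such an easy-to-hard loop with a gated difficulty transition. The 0.75 threshold and the `evaluate_pass_rate` / `train_one_round` helpers are hypothetical placeholders, not SATURN's actual implementation:

```python
# Hypothetical curriculum loop: train on one difficulty level until the
# model's pass rate clears a threshold, then move to the next level.
PASS_THRESHOLD = 0.75  # assumed promotion criterion, not from the paper

def run_curriculum(model, levels, evaluate_pass_rate, train_one_round):
    """levels: (n, k, l) triples ordered from easy to hard."""
    for level in levels:
        while evaluate_pass_rate(model, level) < PASS_THRESHOLD:
            train_one_round(model, level)  # one RL update round on this level
    return model
```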
We introduce SATURN-2.6K, a dataset of 2,660 SAT problems of varying difficulty, which supports evaluating how LLM reasoning changes as problem difficulty increases. We apply SATURN to the DeepSeek-R1-Distill-Qwen models and obtain SATURN-1.5B and SATURN-7B.
Building upon the SAT_Construction tool and our difficulty estimation, we release SATURN-2.6K, a curated benchmark designed to evaluate LLMs' reasoning capability across varying complexity levels.
SATURN-2.6K consists of:
- 1,500 training instances and 160 test instances sharing the same estimated difficulty level.
- 1,000 additional test instances from 10 unseen harder difficulty levels, with 100 instances per level.
These difficulty levels are selected based on our estimation function D(n, k, l), enabling a systematic analysis of how LLM performance changes as problem difficulty increases.
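As a rough illustration of how a single instance at a given difficulty can be sampled, here is a random k-SAT generator. It assumes n counts variables, k literals per clause, and l clauses; consult the paper and the SAT_Construction tool for the exact parameterization and for guarantees (e.g., satisfiability) that this naive sketch does not provide:

```python
import random

def sample_sat_instance(n: int, k: int, l: int, seed: int | None = None) -> list[list[int]]:
    """Sample a random k-SAT instance: l clauses over variables 1..n,
    each clause using k distinct variables (requires k <= n), each
    negated with probability 1/2. DIMACS-style encoding: literal v
    means x_v, -v means NOT x_v."""
    rng = random.Random(seed)
    return [
        [v if rng.random() < 0.5 else -v for v in rng.sample(range(1, n + 1), k)]
        for _ in range(l)
    ]

print(sample_sat_instance(5, 3, 20, seed=0))
```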
The dataset path is `./data`. Additionally, custom datasets with target difficulty levels can be generated using our open-sourced SAT_Construction tool (see Step 1 below).
- Download: https://zenodo.org/records/15487151
- Hugging Face: Saturn-7B, Saturn-1.5B
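The released checkpoints load like any causal LM in `transformers`. The repo id below is a placeholder; substitute the full Hugging Face id from the links above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Saturn-7B"  # placeholder: use the full Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Find a satisfying assignment for (x1 OR NOT x2) AND (x2 OR x3)."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```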
To install the required dependencies, run:
```bash
conda create -n saturn python=3.10.12
conda activate saturn
pip install -r requirements.txt
cd src/OpenRLHF
pip install -e .
```
To build SAT datasets, run the following script:
```bash
sh ./src/Build_SAT_Datasets/build_sat_dataset.sh
```
Edit the following variables in the script to configure the difficulty and the number of samples:
```bash
PARAMETERS=(
    "3 5 20"
)
N_SAMPLE=520
```
This controls the SAT problems' (n, k, l) parameters and the sample count.
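Because SAT answers can be checked mechanically, rewards need no learned judge. Below is a minimal checker in the spirit of the rule-based verification mentioned above; it is an illustrative sketch, independent of the repo's actual reward code:

```python
def is_satisfying(clauses: list[list[int]], assignment: dict[int, bool]) -> bool:
    """True iff every clause has at least one satisfied literal.
    Clauses use DIMACS-style integers: v means x_v, -v means NOT x_v."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 OR NOT x2) AND (x2 OR x3)
clauses = [[1, -2], [2, 3]]
print(is_satisfying(clauses, {1: True, 2: False, 3: True}))  # True
```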
Training scripts are located in `scripts/train`.
We provide separate scripts for both the 1.5B and 7B models. Each stage of training is isolated for better observability and debugging. For example:
```bash
sh ./scripts/train/grpo_1.5B_355.sh
```
Before running the script, please modify the following parameters:
```bash
--pretrain /xxx/Qwen \
--save_path xxx \
--use_wandb xxx \
--wandb_run_name xxx \
--ckpt_path xxx/checkpoints \
```
For more detailed argument configurations, please refer to the OpenRLHF documentation.
Run:
```bash
sh ./scripts/test/test_SAT.sh
```
Edit the first two lines of the script before running:
```bash
model_path= # TODO: your local model path
model_name= # TODO: name you want to assign
```
We use Docker + vLLM to deploy models. Modify Docker parameters such as `-v` to match your server setup. You may also adjust the vLLM-related arguments in the script; see the vLLM documentation for reference.
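Once the container is up, vLLM's OpenAI-compatible endpoint can be smoke-tested as below. The host, port, and model name are assumptions; adjust them to your deployment:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed host/port of the vLLM server
    json={
        "model": "saturn-7b",  # must match the model_name the server was launched with
        "prompt": "Solve the SAT instance (x1 OR NOT x2) AND (x2 OR x3).",
        "max_tokens": 512,
        "temperature": 0.6,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["text"])
```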
Run:
```bash
sh ./scripts/test/test_model_math_programming.sh
```
Modify the third line:
```bash
MODEL= # TODO: model path
```
Other arguments follow lighteval conventions.
To generate a word cloud, uncomment line 39 in `./scripts/test/test_model.sh`:
```bash
#python ./scripts/test/frequency_cloud.py \
#    --work_dir $OUTPUT_DIR \
#    --model $MODEL
```
To reproduce Figure 3, run:
```bash
python ./experiments/draw_pic/draw_difficulty.py
```
Figures will be saved in `./experiments/draw_pic/`.
This project reuses code from the following repositories:
- OpenRLHF
- vLLM
- lighteval
If you use SATURN in your work, please cite:
```bibtex
@article{saturn2025,
  author  = {Huanyu Liu and Jia Li and Hao Zhu and Kechi Zhang and Yihong Dong and Ge Li},
  title   = {SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning},
  journal = {CoRR},
  volume  = {abs/2505.16368},
  year    = {2025},
}
```
This repository includes components licensed under the Apache License 2.0.