Investigating length-hacking in Chain-of-Thought post-training.
Does DPO/PPO training on reasoning tasks cause models to "filibuster"—generating verbose, circular reasoning because longer responses correlate with higher rewards?
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
Test the entire pipeline locally without a GPU:
# Prepare data (50 samples)
uv run python scripts/data_prep.py --debug
# Create DPO pairs
uv run python scripts/create_pairs.py --debug
# Train (10 steps, tiny model)
uv run python scripts/train_dpo.py --debug
# Evaluate
uv run python scripts/evaluate.py models/debug --debug
Creates three variants of GSM8K: clean, mild padding, heavy padding.
uv run python scripts/data_prep.py
uv run python scripts/create_pairs.py
uv run python scripts/train_sft.py --config-path configs/sft_baseline.yaml
uv run python scripts/train_dpo.py --config-path configs/dpo_clean.yaml --pairs-path data/pairs/dpo_pairs_clean.json
uv run python scripts/train_dpo.py --config-path configs/dpo_padded_mild.yaml --pairs-path data/pairs/dpo_pairs_mild.json
uv run python scripts/train_dpo.py --config-path configs/dpo_padded_heavy.yaml --pairs-path data/pairs/dpo_pairs_heavy.json
uv run python scripts/train_dpo.py --config-path configs/dpo_mixed.yaml --pairs-path data/pairs/dpo_pairs_mixed.json
uv run python scripts/evaluate.py models/sft-baseline
uv run python scripts/evaluate.py models/dpo-clean
uv run python scripts/evaluate.py models/dpo-padded-mild
uv run python scripts/evaluate.py models/dpo-padded-heavy
uv run python scripts/evaluate.py models/dpo-mixed
uv run python scripts/analyze.py
Generates plots in results/figures/.
No local GPU? Use these platforms:
| Platform | Setup |
|---|---|
| RunPod | Upload code, install deps, run scripts |
| Modal | modal run scripts/train_dpo.py (needs Modal wrapper) |
| Lambda Labs | SSH, clone repo, run |
| Vast.ai | Cheapest option, variable quality |
# On RunPod instance
git clone <your-repo>
cd reasoning-mirage
pip install uv
uv sync
uv run python scripts/train_dpo.py --config-path configs/dpo_clean.yaml --pairs-path data/pairs/dpo_pairs_clean.json
.
├── configs/ # Experiment configs (YAML)
├── data/
│ ├── processed/ # Padded datasets
│ └── pairs/ # DPO preference pairs
├── models/ # Trained checkpoints
├── results/ # Evaluation outputs + figures
├── scripts/ # Runnable scripts
├── src/ # Core library code
├── plan.md # Detailed research plan
└── pyproject.toml # Dependencies
- Accuracy: Final answer correctness on GSM8K test
- Mean Length: Average response token count
- Filler Ratio: Proportion of filler phrases detected
- N-gram Uniqueness: Measures repetitiveness
- Length-Accuracy Correlation: Does longer = better?
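The text-level metrics above can be approximated in a few lines. The following is a minimal sketch, not the code in `src/`: it assumes whitespace tokenization instead of the model tokenizer, reuses the example filler sentences from this README as a hypothetical phrase list, and uses a plain Pearson coefficient for the length-accuracy correlation. All function names are illustrative.

```python
import math

# Assumption: phrases copied from the filler examples in this README;
# the project's actual detector may use a different list.
FILLER_PHRASES = (
    "let me reconsider this approach",
    "i want to be absolutely certain here",
    "double-checking my work here",
    "yes, this approach makes sense",
    "moving on to the next step",
)


def filler_ratio(text, phrases=FILLER_PHRASES):
    """Fraction of whitespace tokens that belong to a known filler phrase."""
    total = len(text.split())
    if total == 0:
        return 0.0
    lowered = text.lower()
    filler_tokens = sum(lowered.count(p) * len(p.split()) for p in phrases)
    return min(1.0, filler_tokens / total)


def ngram_uniqueness(text, n=3):
    """Unique n-grams divided by total n-grams; lower means more repetitive."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to repeat anything
    return len(set(ngrams)) / len(ngrams)


def pearson(xs, ys):
    """Pearson correlation, e.g. response lengths vs. 0/1 correctness flags."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        return 0.0
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; report no relationship
    return cov / math.sqrt(var_x * var_y)
```

A correlation near zero between length and correctness, combined with a rising mean length, would be consistent with the filibustering hypothesis.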
The injector adds these types of filler:
- Restatement: "Let me reconsider this approach."
- Hedging: "I want to be absolutely certain here."
- Verification: "Double-checking my work here."
- Affirmation: "Yes, this approach makes sense."
- Transition: "Moving on to the next step,"
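To make the padding idea concrete, here is a minimal sketch of how such an injector could work. It is an illustration under stated assumptions, not the project's implementation: the `inject_filler` function, the `rate` parameter, and the choice to insert filler before a reasoning step are all hypothetical. The mild and heavy variants would correspond to a lower and a higher `rate`, whose actual values are not given here.

```python
import random

# Filler sentences taken from the list above, keyed by type.
FILLERS = {
    "restatement": "Let me reconsider this approach.",
    "hedging": "I want to be absolutely certain here.",
    "verification": "Double-checking my work here.",
    "affirmation": "Yes, this approach makes sense.",
    "transition": "Moving on to the next step,",
}


def inject_filler(steps, rate, seed=0):
    """Return a copy of `steps` with a random filler line before some steps.

    `rate` is the probability of padding each step: 0.0 leaves the solution
    clean, 1.0 pads every step. The reasoning steps themselves are unchanged.
    """
    rng = random.Random(seed)  # seeded so padded datasets are reproducible
    choices = list(FILLERS.values())
    padded = []
    for step in steps:
        if rng.random() < rate:
            padded.append(rng.choice(choices))
        padded.append(step)
    return padded
```

Because the filler never alters the arithmetic, a padded solution is exactly as correct as its clean counterpart, which is what lets length be varied independently of answer quality.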
If the hypothesis holds:
- dpo-padded-heavy produces longer responses than dpo-clean on clean test data
- Length increase is NOT accompanied by accuracy increase
- Filler ratio is higher in padded-trained models
- The model "learns" to filibuster even when not trained on filler
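The first three expectations can be turned into a simple check over two evaluation summaries. This is a sketch only: the `EvalSummary` structure and the `acc_tolerance` threshold are assumptions made for illustration, and the output format of `scripts/evaluate.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Hypothetical per-model summary measured on the clean test set."""
    accuracy: float      # fraction of correct final answers
    mean_length: float   # average response length in tokens
    filler_ratio: float  # proportion of filler detected


def shows_length_hacking(clean, padded, acc_tolerance=0.01):
    """True if the padded-trained model is longer and more filler-heavy
    than the clean-trained one without a matching accuracy gain."""
    longer = padded.mean_length > clean.mean_length
    no_accuracy_gain = padded.accuracy <= clean.accuracy + acc_tolerance
    more_filler = padded.filler_ratio > clean.filler_ratio
    return longer and no_accuracy_gain and more_filler
```

The tolerance exists so that small accuracy differences from evaluation noise do not count as a real gain; a proper analysis would replace it with a significance test.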
If you use this code:
@misc{reasoning-mirage,
title={The Reasoning Mirage: Length-Hacking in Chain-of-Thought Post-Training},
year={2026}
}