Joykirat Singh | Justin Chih-Yao Chen | Archiki Prasad | Elias Stengel-Eskin | Akshay Nambi | Mohit Bansal
- Overview
- Installation
- Download Models
- Training Data & Evaluation
- Run Evaluations
- Train Models
- Citation
This repository contains the implementation of TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model’s self-attention over a long reasoning trajectory to identify important steps and prune redundant ones.
Given a problem, the model first generates a full reasoning trajectory terminated by `</think>`. During the attention-based compression step, we remove steps with lower attention scores. The degree of removal is determined by the estimated difficulty: easier problems undergo more aggressive compression. Finally, we compute correctness and length rewards on the compressed reasoning trajectory, and these rewards are used to update the policy.
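As a rough illustration, the difficulty-adaptive pruning described above can be sketched as follows. This is a simplified mock-up, not the repository's implementation; the keep-ratio schedule and the way scores are pooled are invented here for illustration only:

```python
def compress_trajectory(steps, attn_scores, difficulty):
    """Keep the highest-scoring reasoning steps, pruning more aggressively
    when the problem is estimated to be easy.

    steps: list of reasoning-step strings.
    attn_scores: one importance score per step (e.g. pooled self-attention).
    difficulty: estimated difficulty in [0, 1]; lower -> prune more.
    """
    assert len(steps) == len(attn_scores)
    # Easier problems (low difficulty) keep fewer steps. The 0.3 floor and
    # the linear schedule are illustrative choices, not the paper's.
    keep_ratio = 0.3 + 0.7 * difficulty
    k = max(1, round(keep_ratio * len(steps)))
    # Indices of the k highest-scoring steps, restored to original order.
    top = sorted(sorted(range(len(steps)),
                        key=lambda i: attn_scores[i], reverse=True)[:k])
    return [steps[i] for i in top]
```

With `difficulty=0` (easiest), only the single most-attended step survives; with `difficulty=1`, the trajectory is kept intact.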
Please make sure that you have torch compiled with CUDA enabled. The repository was developed with Python 3.12.3; please ensure `python` invokes Python 3.12.3. The codebase is built on top of verl.
Create a virtual environment and install verl and the other packages from requirements.txt:

    python -m venv traac_venv
    source traac_venv/bin/activate
    pip install -e .[vllm]
    pip install -r requirements.txt

Download our trained adaptive reasoning models directly from Hugging Face:
| Model | Download Link |
|---|---|
| (TRAAC) DeepSeek-R1-Distill-Qwen-7B | |
| (TRAAC) Qwen3-4B | |
All training and evaluation scripts are located in the scripts/data folder.
Make sure you `cd` into this folder before running the commands:

    cd scripts/data

| Purpose | Script |
|---|---|
| Training Data Generation | dapo-17k.py |
| Evaluation – AIME, AMC, GPQA | full_test_dataset.py |
| Evaluation – Overthinking Benchmark | overthink_test_dataset.py |
| Evaluation – Underthinking Benchmark | underthink_test_dataset.py |
| Evaluation – BBEH Benchmark | bbeh_test_dataset.py |
These will create appropriate parquet files in the scripts/data folder.
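For orientation, a row in these parquet files typically follows verl's dataset conventions. The field names below are an assumption based on verl's common RLHFDataset schema, not confirmed against this repository; check the generation scripts for the exact layout:

```python
# Illustrative example of a row in a verl-style training/evaluation parquet.
# Field names follow verl's common convention and are an assumption here.
row = {
    "data_source": "dapo-17k",  # dataset identifier (illustrative)
    "prompt": [  # chat-format prompt consumed by the rollout engine
        {"role": "user", "content": "Solve: 2 + 2 = ?"}
    ],
    "reward_model": {  # used by rule-based correctness scoring
        "style": "rule",
        "ground_truth": "4",
    },
    "extra_info": {"split": "train", "index": 0},
}
```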
To run your own evaluation on any of the above datasets, use the bash scripts provided in the scripts/eval folder.
Change the data.val_files field inside the verl configuration to point to the required dataset:
    ./eval_deepseek-qwen-with-summary-linear-reward-attention.sh
    ./eval_qwen3-4b-with-summary-linear-reward-attention.sh

Training was conducted on 3 GPUs:
- 1 GPU was dedicated to hosting the policy model for calculating attention scores (attention-based compression).
- 2 GPUs were used to train the main model.
The host model script is located at scripts/train:
    CUDA_VISIBLE_DEVICES=0 python model_host_qwen.py

This will host the Qwen3-4B model at http://localhost:8008.
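During training, the rollout side queries this host for attention scores. A minimal client sketch is shown below; the endpoint path and payload keys are hypothetical illustrations, not the repository's actual API (see vllm_rollout_spmd.py for the real protocol):

```python
import json
import urllib.request

def build_score_request(reasoning_steps,
                        host="http://localhost:8008",
                        endpoint="/score"):
    """Build a request asking the hosted policy model for per-step
    attention scores. The endpoint name and payload keys here are
    hypothetical, used only to illustrate the host/trainer split."""
    payload = {"steps": reasoning_steps}
    return urllib.request.Request(
        host + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_score_request(["Step 1: expand the product.",
                               "Step 2: simplify."])
    print(req.full_url)  # http://localhost:8008/score
```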
    source run_qwen3-4b-with-summary-linear-reward-attention.sh

The file vllm_rollout_spmd.py contains the implementation of adaptive, attentive summarization used during training.
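The script names above reference a linear length reward. As a hedged illustration of what combining correctness with a linearly decaying length reward can look like (the exact form and weights used by TRAAC are not reproduced here; this is an assumed sketch):

```python
def linear_length_reward(correct, num_tokens, max_tokens, length_weight=0.5):
    """Illustrative reward shaping: correctness plus a reward that decays
    linearly with trajectory length. The weight and functional form are
    assumptions for illustration, not the paper's exact formulation."""
    r_correct = 1.0 if correct else 0.0
    # 1.0 for an empty trajectory, 0.0 at (or beyond) the length budget.
    r_length = 1.0 - min(num_tokens, max_tokens) / max_tokens
    return r_correct + length_weight * r_length
```

Under this sketch, a correct short answer scores highest, while a correct answer that exhausts the length budget earns only the correctness term.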
If you find this work useful, please consider citing us:
    @misc{singh2025thinkrightlearningmitigate,
          title={Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression},
          author={Joykirat Singh and Justin Chih-Yao Chen and Archiki Prasad and Elias Stengel-Eskin and Akshay Nambi and Mohit Bansal},
          year={2025},
          eprint={2510.01581},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2510.01581},
    }