This paper proposes ARES, a novel open-source framework for adaptive multimodal reasoning that dynamically allocates the model's reasoning effort based on the difficulty of the input problem. The authors observe a key imbalance in existing multimodal reasoning models: on easy tasks they tend to overthink (producing redundantly long inference traces), whereas on hard tasks they under-explore (missing solutions due to insufficient search). To correct this, ARES introduces a mechanism based on High Window-Entropy (HWE) tokens (i.e., token-level entropies averaged over a sliding window) to detect moments of sustained reasoning uncertainty and flexibly adapt exploration intensity.
ARES is trained with a two-stage pipeline:
- Adaptive Cold-Start Stage: construct multimodal and textual reasoning examples with trace lengths scaled to task difficulty, so the model learns a notion of difficulty awareness.
- Adaptive Entropy Policy Optimization (AEPO): use HWE tokens as triggers to decide when to explore further, combined with a hierarchical entropy reward and dynamic KL control to decide how much to explore.
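As a rough illustration of the second stage, the sketch below shows one way a difficulty-conditioned entropy bonus and a token-wise KL budget could be combined. The bucket names, coefficients, and function signatures are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only (assumed form, not the exact AEPO objective):
# the entropy bonus is scaled by a difficulty bucket, and the per-token KL
# penalty is relaxed inside validated HWE windows.

def entropy_bonus_scale(bucket: str) -> float:
    """Difficulty-aware shaping: discourage extra exploration on easy items,
    encourage it on hard items, stay near a batch-level target on medium."""
    return {"easy": -0.5, "medium": 0.0, "hard": 1.0}[bucket]  # assumed values

def kl_coefficient(base_kl: float, in_hwe_window: bool, relax: float = 0.5) -> float:
    """Dynamic KL: shrink the KL budget where uncertainty is sustained."""
    return base_kl * relax if in_hwe_window else base_kl

def shaped_token_reward(task_reward: float, token_entropy: float, bucket: str,
                        in_hwe_window: bool, kl_to_ref: float,
                        base_kl: float = 0.01) -> float:
    """Task reward + difficulty-scaled entropy bonus - (dynamic) KL penalty."""
    bonus = entropy_bonus_scale(bucket) * token_entropy
    penalty = kl_coefficient(base_kl, in_hwe_window) * kl_to_ref
    return task_reward + bonus - penalty
```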
Empirical results show that ARES achieves a better trade-off between reasoning efficiency and accuracy, outperforming baselines across multimodal, mathematical, and logical benchmarks, while incurring lower inference costs and narrowing the gap to commercial systems.
This work highlights that adaptively modulating the exploration behavior at token-level (rather than a fixed strategy) is essential for balancing reasoning depth and computational cost under varying task difficulties.
Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems at significantly lower inference costs.
Figure: Overall training pipeline of our method ARES. Stage 1 (Adaptive Coldstart Fine-Tuning): difficulty-aware selective data curation and adaptive KL-guided fine-tuning establish a strong initialization across text and multimodal inputs. Stage 2 (Adaptive Entropy Policy Optimization, AEPO): online difficulty bucketing and entropy-aware rollout allocate reasoning depth dynamically, with high-entropy windows serving as branching points for exploration. Together, the two stages enable uncertainty-aware, difficulty-adaptive reasoning for large language models.
| Model | Huggingface | Base Model |
|---|---|---|
| ARES-Coldstart | https://huggingface.co/datasets/ares0728/ARES-Adaptive-Coldstart | Qwen2.5-VL-7B-Instruct |
| ARES-RL | https://huggingface.co/ares0728/ARES-RL-7B | Qwen2.5-VL-7B-Instruct |
The dataset construction of ARES revolves around a core concept: difficulty awareness.
Rather than directly using a common hybrid multimodal corpus, ARES constructs a difficulty-aware reasoning corpus that teaches the model to distinguish "easy questions" from "difficult questions" and to apply different reasoning lengths and exploration intensities during the cold-start stage.
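For intuition, a difficulty-aware selection rule for the cold-start traces might look like the sketch below; the pass-rate thresholds and token budgets are illustrative assumptions rather than the values used to build the released corpus.

```python
# Illustrative sketch (assumed thresholds): map an empirical pass rate to a
# difficulty bucket and a target chain-of-thought length budget, so easy
# questions get short traces and hard questions get long ones.

def difficulty_bucket(pass_rate: float) -> str:
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.3:
        return "medium"
    return "hard"

# Hypothetical token budgets for the cold-start reasoning traces.
TRACE_BUDGET = {"easy": 512, "medium": 2048, "hard": 8192}

def target_trace_length(pass_rate: float) -> int:
    return TRACE_BUDGET[difficulty_bucket(pass_rate)]
```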
| Datasets | Huggingface | Size |
|---|---|---|
| ARES-hard-validation | https://huggingface.co/datasets/ares0728/ARES-hard-validation | 2.46K |
| ARES-Adaptive-SFT | https://huggingface.co/datasets/ares0728/ARES-Adaptive-Coldstart | 223K |
The training corpus of ARES-Adaptive-223K comprises two components:
- Textual reasoning data — drawn from high-quality, reasoning-intensive datasets used to develop symbolic reasoning and reflection capabilities.
- Multimodal reasoning data — collected from visual mathematics, logical reasoning, and chart-understanding datasets to enhance cross-modal reasoning consistency.
To ensure coherence across sources, all reasoning traces undergo chain-of-thought (CoT) normalization, standardizing them into a unified “think → derive → conclude” format.
We further use Gemini 2.5-Pro with a pass@3 evaluation, keeping only the samples that the model fails in all three attempts across various visual benchmarks; this yields a curated hard-validation set of 2.46K challenging examples.
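The filtering step amounts to a pass@3 check: keep only the questions the evaluator answers incorrectly in all three attempts. A minimal sketch is shown below; the evaluator and answer-checking helpers are hypothetical stand-ins for the actual Gemini 2.5-Pro pipeline.

```python
# Sketch of the pass@3 hard-example filter. The helpers below are
# hypothetical placeholders, not the actual evaluation code.

def evaluate_with_gemini(image, question) -> str:
    """Hypothetical placeholder for a Gemini 2.5-Pro evaluation call."""
    raise NotImplementedError("replace with the actual evaluator call")

def is_correct(answer: str, ground_truth: str) -> bool:
    """Hypothetical answer checker (e.g., normalized string match)."""
    return answer.strip() == ground_truth.strip()

def fails_all_attempts(example: dict, attempts: int = 3) -> bool:
    """Keep an example only if the evaluator gets it wrong in every attempt."""
    for _ in range(attempts):
        answer = evaluate_with_gemini(example["image"], example["question"])
        if is_correct(answer, example["ground_truth"]):
            return False
    return True

# hard_validation = [ex for ex in candidate_pool if fails_all_attempts(ex)]
```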
conda create -n aepo python=3.11 -y
conda activate aepo
pip install -r requirements.txt
# example script to prepare rewards / launch AEPO
bash ./experiments/AEPO/train.sh
Key ideas.
- HWE trigger: branch only in sustained-uncertainty regions.
- Difficulty-aware shaping: suppress over-exploration on easy tasks, encourage deeper exploration on hard tasks, and stabilize around a batch-level target on medium tasks.
- Dynamic KL: token-wise KL budget that relaxes inside validated HWE windows.
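A minimal sketch of the HWE trigger is given below, assuming per-token entropies are already available from the policy's output distribution; the window size and threshold are illustrative, not the released hyperparameters.

```python
# Minimal sketch of High Window-Entropy (HWE) token detection.
# Assumptions (not from the released code): per-token entropies are
# precomputed, and `window` / `threshold` are illustrative values.
import numpy as np

def detect_hwe_tokens(token_entropies, window=8, threshold=1.5):
    """Return indices whose window-averaged entropy exceeds `threshold`."""
    ents = np.asarray(token_entropies, dtype=np.float32)
    if len(ents) < window:
        return []
    # Moving average of entropy over the `window` tokens ending at each position.
    kernel = np.ones(window, dtype=np.float32) / window
    window_avg = np.convolve(ents, kernel, mode="valid")  # length T - window + 1
    return [i + window - 1 for i, v in enumerate(window_avg) if v > threshold]

# Example: a burst of sustained uncertainty in the middle of a trace.
entropies = [0.2, 0.3, 0.2] * 4 + [2.1, 2.3, 2.0, 2.4] * 3 + [0.3] * 6
print(detect_hwe_tokens(entropies))  # indices within and just after the burst
```

In AEPO, such sustained high-entropy windows act as branching points: additional rollouts are launched from these positions rather than uniformly across the trace.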
python scripts/model_merger.py \
--local_dir ./checkpoints/${ProjectName}/exp_name/global_step_1/actor
Run the command below.
MODEL_PATH="ARES"
IMAGE_PATH="xxx"
MAX_TOKENS=16384
DO_SAMPLE=True
TEMPERATURE=1.0
TOP_P=0.95
TOP_K=50
NUM_RETURN_SEQUENCES=1
prompt="You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}."
question="xxx"
python infer.py \
    --model_path "${MODEL_PATH}" \
    --image_path "${IMAGE_PATH}" \
    --question "${question}" \
    --prompt "${prompt}" \
    --max_tokens ${MAX_TOKENS} \
    --do_sample ${DO_SAMPLE} \
    --temperature ${TEMPERATURE} \
    --top_p ${TOP_P} \
    --top_k ${TOP_K} \
    --num_return_sequences ${NUM_RETURN_SEQUENCES}
You can also modify the arguments in inference/inference.sh and run:
bash inference/inference.sh
- ARES-3B: +8.4 points on average over prior open 3B models across core multimodal benchmarks.
- ARES-7B: +9.7 points on average over strong open 7B baselines, with large gains on MathVision and DynaMath-W.
- Efficiency: Shorter responses on easy/medium tasks; deeper but targeted exploration on hard tasks.
We thank the open-source community for tools, datasets, and prior work on reasoning-oriented pretraining and RL that inspired this project.
We are preparing to complete these tasks over the next few weeks; please stay tuned!
- 🚧 We are training the 3B ARES models (Coldstart & RL) and will release them in a few days.
- 🚧 We are also developing and open-sourcing a multimodal model with performance comparable to leading commercial systems. Stay tuned!
For questions, feedback, or collaboration opportunities, feel free to reach out: [email protected]
If you find our work useful for your research, please consider citing:
@article{chen2025ares,
title={ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping},
author={Chen, Shuang and Guo, Yue and Ye, Yimeng and Huang, Shijue and Hu, Wenbo and Li, Haoxi and Zhang, Manyuan and Chen, Jiayu and Guo, Song and Peng, Nanyun},
journal={arXiv preprint arXiv:2510.08457},
year={2025}
}