Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
```bash
conda create -n vif python=3.10 -y
conda activate vif
pip install -r requirements.txt
```

- Place the base VLM (e.g., LLaVA-NeXT) under `./examples/base_model/`, or pass a HuggingFace model id.
- Prepare multimodal data (a hypothetical record format is sketched after this list):
  - Stage 1: general pretraining/tuning data (image, instruction, short answer) – paths in `configs/stage1.yaml`
  - Stage 2: instruction-tuning datasets – paths in `configs/stage2.yaml`
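The exact record schema is defined by the configs and training scripts; the snippet below is only a rough sketch with hypothetical field names, showing what a Stage 2 JSONL record might look like and how to sanity-check the paths a stage config points to.

```python
import json
import yaml  # pip install pyyaml

# Hypothetical Stage 2 record; the actual field names are defined by the repo's configs/scripts.
example_record = {
    "image": "examples/data/images/000001.jpg",
    "instruction": "Describe the objects on the table.",
    "answer": "A laptop, a mug, and a notebook.",
}
print(json.dumps(example_record))

# Print the dataset paths a stage config references, to verify them before training.
with open("configs/stage2.yaml") as f:
    print(yaml.safe_load(f))
```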
- Multi-Agent Construction
  - Files under `vif/multiagent/` implement the worker agents with dynamic allocation (a minimal interface sketch follows below).
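As a minimal sketch of the general idea only (class and function names here are illustrative, not the repository's actual API), dynamic allocation can be thought of as selecting a subset of worker agents per query:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WorkerAgent:
    """Illustrative worker agent: a name plus a callable that handles one sub-task."""
    name: str
    run: Callable[[str], str]

def allocate_workers(question: str, pool: List[WorkerAgent], budget: int) -> List[WorkerAgent]:
    """Toy allocation policy: keep at most `budget` workers whose specialty is mentioned
    in the question, falling back to the full pool. The real policy in vif/multiagent/ may differ."""
    selected = [w for w in pool if w.name in question.lower()]
    return (selected or pool)[:budget]

if __name__ == "__main__":
    pool = [
        WorkerAgent("caption", lambda q: "caption-agent output"),
        WorkerAgent("ocr", lambda q: "ocr-agent output"),
        WorkerAgent("grounding", lambda q: "grounding-agent output"),
    ]
    query = "Run ocr on the sign in the image"
    for worker in allocate_workers(query, pool, budget=2):
        print(worker.name, "->", worker.run(query))
```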
```bash
# Stage 1 Pre-Training
python scripts/train_stage1.py --config configs/stage1.yaml

# Stage 2 Instruction Tuning
python scripts/train_stage2.py --config configs/stage2.yaml
```

```bash
# Evaluation demo
python scripts/eval_demo.py --config configs/eval.yaml --images_dir examples/data/images --questions_file examples/data/train_stage2.jsonl
```

- Visual evidence in a MAS is typically relayed via textual flow, which can lead to multi-agent hallucination snowballing.
- As the number of agent turns increases, the average attention allocated to vision tokens decreases and the attention peak in the middle layers diminishes, while attention to instruction tokens increases accordingly; system and output tokens receive relatively stable attention (a measurement sketch follows this list).
- In the middle layers, vision tokens with unimodal attention allocation relay visual information; all vision tokens are significant in the shallow layers and less so in the deep layers.
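A minimal sketch of how this kind of attention analysis can be reproduced, using synthetic tensors rather than the paper's evaluation code; the shapes match the `attentions` output of a Hugging Face model called with `output_attentions=True`, and the vision-token mask is assumed to be known:

```python
import torch

def attention_to_vision(attentions, vision_mask):
    """Average attention mass directed at vision tokens, per layer.

    attentions: list of (batch, heads, seq, seq) tensors, one per layer.
    vision_mask: bool tensor of shape (seq,) marking vision-token positions.
    """
    per_layer = []
    for layer_attn in attentions:
        # Attention each query token gives to vision tokens, summed over key positions.
        mass = layer_attn[..., vision_mask].sum(dim=-1)   # (batch, heads, seq)
        per_layer.append(mass.mean().item())              # average over batch, heads, queries
    return per_layer

if __name__ == "__main__":
    # Synthetic example: 24 layers, 8 heads, 100 tokens, the first 32 of which are vision tokens.
    torch.manual_seed(0)
    attn = [torch.softmax(torch.randn(1, 8, 100, 100), dim=-1) for _ in range(24)]
    mask = torch.zeros(100, dtype=torch.bool)
    mask[:32] = True
    print(attention_to_vision(attn, mask))
```

Plotting the returned per-layer values across agent turns shows whether the middle-layer attention peak on vision tokens shrinks as turns accumulate.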
```bibtex
@article{yu2025visual,
  title={Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow},
  author={Yu, Xinlei and Xu, Chengming and Zhang, Guibin and He, Yongbo and Chen, Zhangquan and Xue, Zhucun and Zhang, Jiangning and Liao, Yue and Hu, Xiaobin and Jiang, Yu-Gang and others},
  journal={arXiv preprint arXiv:2509.21789},
  year={2025}
}
```