ViF

Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

📑 Quick Start

1) Install

conda create -n vif python=3.10 -y
conda activate vif
pip install -r requirements.txt

2) Prepare

  • Place the base VLM (e.g., LLaVA-NeXT) under ./examples/base_model/ or pass a Hugging Face model ID.
  • Prepare multimodal data (a hypothetical record sketch follows this list):
    • Stage 1: general pretraining/tuning data (image, instruction, short answer); paths are set in configs/stage1.yaml.
    • Stage 2: instruction-tuning datasets; paths are set in configs/stage2.yaml.
  • Multi-agent construction: the files under vif/multiagent/ implement the worker agents with dynamic allocation.
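The exact JSONL schema isn't documented on this page, so the loader below is only a sketch: the field names ("image", "instruction", "answer") are assumptions inferred from the "(image, instruction, short answer)" description above, not the repository's confirmed format.

# Hypothetical loader for the instruction-tuning JSONL referenced above.
# Field names are assumptions, not the repository's documented schema.
import json
from pathlib import Path

def load_records(jsonl_path, images_dir):
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)  # assumed: {"image": ..., "instruction": ..., "answer": ...}
            rec["image_path"] = str(Path(images_dir) / rec["image"])
            records.append(rec)
    return records

# Usage with the example paths from the Evaluate step:
# records = load_records("examples/data/train_stage2.jsonl", "examples/data/images")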

3) Train

# Stage 1 Pre-Training
python scripts/train_stage1.py --config configs/stage1.yaml

# Stage 2 Instruction Tuning
python scripts/train_stage2.py --config configs/stage2.yaml

4) Evaluate

python scripts/eval_demo.py --config configs/eval.yaml --images_dir examples/data/images --questions_file examples/data/train_stage2.jsonl

📜 Requisite Analyses

Layer-Wise Attention Allocation in Different Agent Turns

[Figure 2: layer-wise attention allocation across agent turns]

Dropping Subsets of Vision Tokens in Different Layers

[Table 1 and Figure 3: effect of dropping vision-token subsets in different layers]

Investigation of Unimodal Tokens

[Figures 4 and 5: investigation of unimodal tokens]

Insights

  • Visual evidence in a MAS is typically relayed via textual flow, which can set off multi-agent hallucination snowballing.
  • As agent turns increase, the average attention allocated to vision tokens decreases and the attention peak in the middle layers diminishes, while attention to instruction tokens increases accordingly; system and output tokens receive relatively stable attention (a measurement sketch follows these bullets).
  • In the middle layers, vision tokens with unimodal attention allocation relay the visual information; all vision tokens matter in shallow layers and matter less in deep layers.
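As a rough illustration of the layer-wise measurement behind these insights, the sketch below averages the attention mass each decoder layer places on vision-token positions. It is not the repo's analysis code: the model ID, the eager-attention setting, and the use of model.config.image_token_index to locate vision tokens are assumptions for illustration.

# Minimal sketch (not from this repo): per-layer attention to vision tokens
# for a Hugging Face LLaVA-NeXT checkpoint. The model id and the assumption
# that the processor expands <image> into one placeholder id per visual patch
# are illustrative, not confirmed by the repository.
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed base VLM
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # so output_attentions actually returns maps
)

def vision_attention_per_layer(image, prompt):
    """Attention mass the final query token puts on vision tokens, per layer."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    vision_mask = inputs["input_ids"][0] == model.config.image_token_index
    per_layer = []
    for attn in out.attentions:      # one (batch, heads, query, key) tensor per layer
        last_q = attn[0, :, -1, :]   # attention from the final query position
        per_layer.append(last_q[:, vision_mask].sum(-1).mean().item())
    return per_layer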

🌟🌟🌟 ViF

[Figure 6: ViF overview]
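Figure 6 is not reproduced in this page dump, so the sketch below only illustrates the general idea the insights point to: relay visual evidence between agent turns as selected vision-token hidden states (a visual flow) rather than as text. The selection rule (top-k vision tokens by text-to-vision attention in middle layers) and every name below are illustrative assumptions, not the paper's exact mechanism.

# Illustrative sketch of a visual-flow hand-off between agent turns.
# The top-k text-to-vision attention criterion is an assumption; the
# paper's actual token-selection rule may differ.
import torch

def select_visual_flow(attentions, hidden_states, vision_mask, mid_layers, k=64):
    """Score vision tokens by attention received from text queries in the
    chosen middle layers; return the hidden states of the top-k tokens so
    the next agent turn can attend to them directly.

    attentions[l]: (heads, seq, seq); hidden_states[l]: (seq, dim)."""
    scores = torch.zeros(int(vision_mask.sum()))
    for l in mid_layers:
        attn = attentions[l].mean(0)                         # average over heads
        text_to_vision = attn[~vision_mask][:, vision_mask]  # text queries -> vision keys
        scores += text_to_vision.sum(0)                      # mass received per vision token
    top = scores.topk(min(k, scores.numel())).indices
    relay_layer = mid_layers[len(mid_layers) // 2]
    return hidden_states[relay_layer][vision_mask][top]      # the relayed visual flow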

🔥🔥🔥 Results

Results on Six Base Models and Four MAS Structures

[Table 2: results on six base models and four MAS structures]

Results on Larger Base Models

[Table 3: results on larger base models]

Results on Multi-Agent Hallucination Snowballing Mitigation

[Table 4: results on multi-agent hallucination snowballing mitigation]

Comparison Results

[Table 5: comparison results]

🔗 Citation

@article{yu2025visual,
  title={Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow},
  author={Yu, Xinlei and Xu, Chengming and Zhang, Guibin and He, Yongbo and Chen, Zhangquan and Xue, Zhucun and Zhang, Jiangning and Liao, Yue and Hu, Xiaobin and Jiang, Yu-Gang and others},
  journal={arXiv preprint arXiv:2509.21789},
  year={2025}
}
