
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration


News

  • 2025/05/15: 🔥 MMBoundary is accepted to ACL 2025 Main Conference!

Introduction

In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) inference. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions to further align model knowledge and calibrate confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average reduction of 7.5% in multimodal confidence calibration errors and up to an 8.3% improvement in task performance.


MMBoundary Overview


The overview of MMBoundary, which consists of two stages. The initial stage trains MLLMs via supervised learning to generate a natural-language confidence statement for each sentence, similar to human expression. The second stage employs reinforcement learning with three intuitively designed reward functions to further calibrate the expressed confidence estimates and enhance knowledge alignment. 🟣 represents the internal states (i.e., token log probabilities) of the model and the estimated internal confidence.

Experimental Results


The evaluation results of models and various ablations of our framework. CulturalVQA is the out-of-distribution dataset. w/o U_LNLP, w/o U_MTE, w/o U_TSAR, and w/o U_CLIP-S denote MMBoundary without each of the three text-based uncertainty estimation methods and without the visual-information uncertainty estimation, respectively; w/ U_Max indicates confidence determined by max pooling over the four uncertainty estimation scores; w/o S-S Mapping denotes MMBoundary without confidence score-statement mapping; w/o R_KA, w/o R_EC, and w/o R_CS denote MMBoundary without the knowledge accuracy reward, expected calibration reward, and confidence self-calibration reward, respectively; w/o RL denotes MMBoundary without reinforcement learning.
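For intuition, the w/ U_Max ablation replaces the fused per-step confidence with simple max pooling over the four uncertainty scores. The snippet below is a toy illustration of that aggregation (the values are made up), not code from this repository:

# Toy illustration of aggregating the four per-step uncertainty scores
# (LNLP, MTE, TokenSAR, CLIP-Score); the values below are made up.
scores = {"LNLP": 0.42, "MTE": 0.35, "TokenSAR": 0.51, "CLIP-Score": 0.28}

u_max = max(scores.values())                  # "w/ U_Max": max pooling over the estimators
u_mean = sum(scores.values()) / len(scores)   # simple averaging, shown only for comparison
print(u_max, u_mean)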


Environment setup

  • Git clone:
git clone https://github.com/Zhitao-He/MMBoundary.git
cd MMBoundary
  • Create a conda environment:
conda create -n MMBoundary python=3.10
conda activate MMBoundary
pip install -r requirements.txt

Downloading the dataset

  • A-OKVQA:
export AOKVQA_DIR=./datasets/aokvqa/
mkdir -p ${AOKVQA_DIR}

curl -fsSL https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0.tar.gz | tar xvz -C ${AOKVQA_DIR}
export COCO_DIR=./datasets/coco/
mkdir -p ${COCO_DIR}

for split in train val test; do
    wget "http://images.cocodataset.org/zips/${split}2017.zip"
    unzip "${split}2017.zip" -d ${COCO_DIR}; rm "${split}2017.zip"
done

wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip -d ${COCO_DIR}; rm annotations_trainval2017.zip
  • ScienceVQA:
git clone https://github.com/lupantech/ScienceQA.git

cd ScienceQA
. tools/download.sh
  • CulturalVQA:

Download the CulturalVQA dataset from the Hugging Face repository link. To load and use the CulturalVQA benchmark, use the following commands:

from datasets import load_dataset

culturalvqa_dataset = load_dataset('./CulturalVQA')

Our annotated data

We first prompt GPT-4o to generate an analysis (reasoning chain) structured at the perception and reasoning levels. Then, we have GPT-4o filter and correct the initially annotated chains. Finally, manual data quality control is conducted to ensure accuracy and reliability.

The Annotation Pipeline.

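As a rough illustration of the first annotation step, here is a minimal sketch of prompting GPT-4o with an image and a question. The prompt wording and the helper function are illustrative only; the actual annotation scripts are in code/annotate_data/.

import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY to be set

def annotate(image_path, question):
    # Encode the image and ask GPT-4o for a step-by-step chain labeled by level.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Answer the question with a step-by-step reasoning chain, "
                          "labeling each step as perception or reasoning.\n"
                          f"Question: {question}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content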

We provide the annotated data and the annotation scripts. The annotated_data folder has the following structure:

annotated_data/
├── aokvqa
│   ├── train_mmb
│   ├── val_mmb
├── sciencevqa
│   ├── train_mmb
│   ├── val_mmb
├── culturalvqa
│   ├── cvqa_mmb

You can also download the data from our Hugging Face repository link.

Data annotation scripts

  • AOKVQA
python code/annotate_data/construct_dataset_aokvqa.py
  • ScienceVQA
python code/annotate_data/construct_dataset_sciencevqa.py
  • CulturalVQA
python code/annotate_data/construct_dataset_culturalvqa.py

Uncertainty estimation

The uncertainty folder has the following structure:

uncertainty/
├── normalized_logprob.py
├── token_SAR.py
├── mean_token_entropy.py
├── clip_score.py

Run the following scripts to obtain the model's normalized confidence scores:

python code/uncertainty/vlm_uncertainty.py
python code/uncertainty/uncertainty_normalize.py
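For intuition, the snippet below sketches two of the text-based measures (length-normalized log probability and mean token entropy) computed from per-token log probabilities. It is an illustration only, not the repository's implementation.

import numpy as np

def length_normalized_logprob(token_logprobs):
    # Length-normalized sequence log probability; higher means more confident.
    return float(np.mean(token_logprobs))

def mean_token_entropy(token_distributions):
    # Mean entropy of the full next-token distributions (one log-prob vector
    # per generated token); higher means less confident.
    entropies = []
    for logprobs in token_distributions:
        probs = np.exp(logprobs)
        entropies.append(float(-np.sum(probs * logprobs)))
    return float(np.mean(entropies))

# Example with made-up numbers for a three-token sentence.
print(length_normalized_logprob([-0.1, -0.7, -0.3]))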

Construct data and training

Run the following script to construct the fine-tuning data:

python code/training/training_data_construct.py

The preset confidence statements are in code/training/confidence_statement.json.
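For illustration, the score-statement mapping turns a normalized confidence score into one of the preset natural-language statements. The thresholds and statements below are made up and do not reflect the actual contents of confidence_statement.json.

def confidence_to_statement(score):
    # Map a normalized score in [0, 1] to a hedged confidence statement (illustrative buckets).
    if score >= 0.9:
        return "I am highly confident in this step."
    if score >= 0.7:
        return "I am fairly confident in this step."
    if score >= 0.5:
        return "I am somewhat uncertain about this step."
    return "I am not confident in this step."

print(confidence_to_statement(0.82))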

We use the LLaMA-Factory framework for Supervised Fine-Tuning (SFT). Pass the provided config to LLaMA-Factory, e.g.:

llamafactory-cli train code/training/training_sft_model.yaml

Run the following script for Reinforcement Learning (RL):

python code/training/training_rl_model.py
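The RL stage uses the knowledge accuracy, expected calibration, and confidence self-calibration rewards described above. As a toy illustration only (the formula below is invented for this sketch and is not the repository's reward implementation), a calibration-style reward can penalize the gap between the expressed confidence and the actual correctness of a step:

def calibration_reward(expressed_confidence, step_is_correct):
    # Highest reward when the stated confidence matches the step's correctness.
    target = 1.0 if step_is_correct else 0.0
    return 1.0 - abs(expressed_confidence - target)

print(calibration_reward(0.9, True))   # well calibrated: high reward
print(calibration_reward(0.9, False))  # overconfident: low reward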

Evaluation

Download a Cross-Encoder model from the sentence-transformers library. Pre-trained models can be used like this:

from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name')
scores = model.predict([('Sentence 1', 'Sentence 2'), ('Sentence 3', 'Sentence 4')])

Run the following script to evaluate:

python code/evaluation/eval.py
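For reference, calibration quality is typically summarized with an expected calibration error (ECE)-style metric. The snippet below is a generic ECE computation, given as a sketch rather than the exact metric implemented in eval.py.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Average |accuracy - mean confidence| over equal-width confidence bins,
    # weighted by the fraction of samples falling in each bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.6, 0.8, 0.3], [1, 1, 0, 0]))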

Citation

@misc{he2025mmboundaryadvancingmllmknowledge,
      title={MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration}, 
      author={Zhitao He and Sandeep Polisetty and Zhiyuan Fan and Yuchen Huang and Shujin Wu and Yi R. Fung},
      year={2025},
      eprint={2505.23224},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.23224}, 
}

License

Code: Licensed under the Apache 2.0 License. Dataset: Licensed under the CC BY-NC 4.0 License.

Acknowledgments

We would like to thank the following open-source projects for their contributions: LLaMA-Factory, lm-polygraph.
