
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration


News

  • 2025/05/15: 🔥 MMBoundary is accepted to ACL 2025 Main Conference!

Introduction

In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) inference. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions to further align model knowledge and calibrate confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average reduction of 7.5% in multimodal confidence calibration errors and up to an 8.3% improvement in task performance.


MMBoundary Overview


The overview of MMBoundary, which consists of two stages. The initial stage trains MLLMs via supervised learning to generate a natural-language confidence statement for each sentence, similar to human expression. The second stage employs reinforcement learning with three intuitively designed reward functions to further calibrate the expressed confidence estimates and enhance knowledge alignment. 🟣 represents the internal states (i.e., token log probabilities) of the model and the estimated internal confidence.

Experimental Results


The evaluation results of models and various ablations of our framework. CulturalVQA is the out-of-distribution dataset. w/o U_LNLP, w/o U_MTE, w/o U_TSAR, and w/o U_CLIP-S denote MMBoundary without each of the three text-based uncertainty estimation methods and without the visual-information uncertainty estimation, respectively; w/ U_Max indicates confidence determined by max pooling over the four uncertainty estimation scores; w/o S-S Mapping denotes MMBoundary without confidence score-statement mapping; w/o R_KA, w/o R_EC, and w/o R_CS denote MMBoundary without the knowledge accuracy reward, expected calibration reward, and confidence self-calibration reward, respectively; w/o RL denotes MMBoundary without reinforcement learning.
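For intuition, the w/ U_Max ablation replaces the fused per-step confidence with simple max pooling over the four uncertainty scores. The snippet below is a toy illustration of that aggregation (the values are made up), not code from this repository:

# Toy illustration of aggregating the four per-step uncertainty scores
# (LNLP, MTE, TokenSAR, CLIP-Score); the values below are made up.
scores = {"LNLP": 0.42, "MTE": 0.35, "TokenSAR": 0.51, "CLIP-Score": 0.28}

u_max = max(scores.values())                  # "w/ U_Max": max pooling over the estimators
u_mean = sum(scores.values()) / len(scores)   # simple averaging, shown only for comparison
print(u_max, u_mean)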


Environment setup

  • Git clone:
git clone https://github.com/Zhitao-He/MMBoundary.git
cd MMBoundary
  • Create a conda environment:
conda create -n MMBoundary python=3.10
conda activate MMBoundary
pip install -r requirements.txt

Downloading the dataset

  • A-OKVQA:
export AOKVQA_DIR=./datasets/aokvqa/
mkdir -p ${AOKVQA_DIR}

curl -fsSL https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0.tar.gz | tar xvz -C ${AOKVQA_DIR}
export COCO_DIR=./datasets/coco/
mkdir -p ${COCO_DIR}

for split in train val test; do
    wget "http://images.cocodataset.org/zips/${split}2017.zip"
    unzip "${split}2017.zip" -d ${COCO_DIR}; rm "${split}2017.zip"
done

wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip -d ${COCO_DIR}; rm annotations_trainval2017.zip
  • ScienceVQA:
git clone https://github.com/lupantech/ScienceQA.git

cd ScienceQA
. tools/download.sh
  • CulturalVQA:

Download the CulturalVQA dataset from the Hugging Face repository link. To load and use the CulturalVQA benchmark, use the following commands:

from datasets import load_dataset

culturalvqa_dataset = load_dataset('./CulturalVQA')

Our annotated data

We first prompt GPT-4o to generate an analysis (reasoning chain) structured at the perception and reasoning levels. Then, we have GPT-4o filter and correct the initially annotated chains. Finally, manual data quality control is conducted to ensure accuracy and reliability.

The Annotation Pipeline.

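As a rough illustration of the first annotation step, here is a minimal sketch of prompting GPT-4o with an image and a question. The prompt wording and the helper function are illustrative only; the actual annotation scripts are in code/annotate_data/.

import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY to be set

def annotate(image_path, question):
    # Encode the image and ask GPT-4o for a step-by-step chain labeled by level.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Answer the question with a step-by-step reasoning chain, "
                          "labeling each step as perception or reasoning.\n"
                          f"Question: {question}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content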

We provide the annotated data and the annotation scripts. The annotated_data folder has the following structure:

annotated_data/
├── aokvqa
│   ├── train_mmb
│   ├── val_mmb
├── sciencevqa
│   ├── train_mmb
│   ├── val_mmb
├── culturalvqa
│   ├── cvqa_mmb

You can also download the data from our Hugging Face repository link.

Data annotation scripts

  • AOKVQA
python code/annotate_data/construct_dataset_aokvqa.py
  • ScienceVQA
python code/annotate_data/construct_dataset_sciencevqa.py
  • CulturalVQA
python code/annotate_data/construct_dataset_culturalvqa.py

Uncertainty estimation

The uncertainty folder has the following structure:

uncertainty/
├── normalized_logprob.py
├── token_SAR.py
├── mean_token_entropy.py
├── clip_score.py

Run the following scripts to obtain the model's normalized confidence scores:

python code/uncertainty/vlm_uncertainty.py
python code/uncertainty/uncertainty_normalize.py
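For intuition, the snippet below sketches two of the text-based measures (length-normalized log probability and mean token entropy) computed from per-token log probabilities. It is an illustration only, not the repository's implementation.

import numpy as np

def length_normalized_logprob(token_logprobs):
    # Length-normalized sequence log probability; higher means more confident.
    return float(np.mean(token_logprobs))

def mean_token_entropy(token_distributions):
    # Mean entropy of the full next-token distributions (one log-prob vector
    # per generated token); higher means less confident.
    entropies = []
    for logprobs in token_distributions:
        probs = np.exp(logprobs)
        entropies.append(float(-np.sum(probs * logprobs)))
    return float(np.mean(entropies))

# Example with made-up numbers for a three-token sentence.
print(length_normalized_logprob([-0.1, -0.7, -0.3]))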

Construct data and training

Run the following script to construct the fine-tuning data:

python code/training/training_data_construct.py

The preset confidence statements are in code/training/confidence_statement.json.
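For illustration, the score-statement mapping turns a normalized confidence score into one of the preset natural-language statements. The thresholds and statements below are made up and do not reflect the actual contents of confidence_statement.json.

def confidence_to_statement(score):
    # Map a normalized score in [0, 1] to a hedged confidence statement (illustrative buckets).
    if score >= 0.9:
        return "I am highly confident in this step."
    if score >= 0.7:
        return "I am fairly confident in this step."
    if score >= 0.5:
        return "I am somewhat uncertain about this step."
    return "I am not confident in this step."

print(confidence_to_statement(0.82))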

We use the LLaMA-Factory framework for Supervised Fine-Tuning (SFT). Pass the provided config to LLaMA-Factory, e.g.:

llamafactory-cli train code/training/training_sft_model.yaml

Run the following script for Reinforcement Learning (RL):

python code/training/training_rl_model.py
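The RL stage uses the knowledge accuracy, expected calibration, and confidence self-calibration rewards described above. As a toy illustration only (the formula below is invented for this sketch and is not the repository's reward implementation), a calibration-style reward can penalize the gap between the expressed confidence and the actual correctness of a step:

def calibration_reward(expressed_confidence, step_is_correct):
    # Highest reward when the stated confidence matches the step's correctness.
    target = 1.0 if step_is_correct else 0.0
    return 1.0 - abs(expressed_confidence - target)

print(calibration_reward(0.9, True))   # well calibrated: high reward
print(calibration_reward(0.9, False))  # overconfident: low reward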

Evaluation

Download a Cross-Encoder model from the sentence-transformers library. Pre-trained models can be used like this:

from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name')
scores = model.predict([('Sentence 1', 'Sentence 2'), ('Sentence 3', 'Sentence 4')])

Run the following script to evaluate:

python code/evaluation/eval.py
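For reference, calibration quality is typically summarized with an expected calibration error (ECE)-style metric. The snippet below is a generic ECE computation, given as a sketch rather than the exact metric implemented in eval.py.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Average |accuracy - mean confidence| over equal-width confidence bins,
    # weighted by the fraction of samples falling in each bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.6, 0.8, 0.3], [1, 1, 0, 0]))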

Citation

@misc{he2025mmboundaryadvancingmllmknowledge,
      title={MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration}, 
      author={Zhitao He and Sandeep Polisetty and Zhiyuan Fan and Yuchen Huang and Shujin Wu and Yi R. Fung},
      year={2025},
      eprint={2505.23224},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.23224}, 
}

License

Code: Licensed under the Apache 2.0 License. Dataset: Licensed under the CC BY-NC 4.0 License.

Acknowledgments

We would like to thank the following open-source projects for their contributions: LLaMA-Factory, lm-polygraph.
