MHALO: Evaluating MLLMs as Fine-grained Hallucination Detectors

This repository contains the core scripts for reproducing the main results of the paper "MHALO: Evaluating MLLMs as Fine-grained Hallucination Detectors".

Contributors

Yishuo Cai, Renjie Gu

If you have any questions or run into problems with the code, please open an issue in this repository.

Introduction

Quick Start

Clone the repository and set up the environment:

git clone https://github.com/walkeralan123/MHALO.git
cd MHALO
conda create -n mhalo python=3.10
conda activate mhalo
pip install -r requirements.txt

File Structure

The project is divided into two main parts:

build folder: scripts for generating the hallucinated data

evaluate folder: scripts for evaluating how well different models detect hallucinations

Evaluating Hallucination Detection Performance

cd evaluate

Download the images for evaluation

The images are too large to host in this repository, so they are stored on Google Drive. Download them from here, then place the downloaded directory inside the evaluate directory.

Configure Environment Variables:

Create a .env file in the project root directory and add the following content:

OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=your_api_base_url
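
How the scripts read these variables is not documented here. As a minimal sketch, assuming the .env file uses plain KEY=value lines, the following standard-library check reports which of the two keys are missing before you launch an evaluation. The helper names read_env_file and missing_keys are hypothetical and are not part of the repository.

```python
from pathlib import Path

# The two keys named in this README.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_API_BASE")


def read_env_file(path):
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file found in the current directory")
    else:
        print("Missing keys:", missing_keys(read_env_file(env_path)))
```

Run it from the directory that holds the .env file; an empty list means both keys are set.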

Evaluating the Performance of Hallucination Detection

The main evaluation script is located in evaluate/src/evaluate.py. You can run the evaluation with the following command:

cd MHALO/evaluate
python src/evaluate.py

Explanation of the evaluation results

The evaluation results are stored in the evaluate/results directory, in subdirectories named evaluate/results/YYYY_MM_DD_HH_MM_SS_model-name_prompt-method_sample-limit.

Each result directory contains:

Five dataset-specific folders with detailed evaluation results
An evaluation_summary.csv file with overall metrics

Example of evaluation summary:

Metric	RLHF-V	M-HalDetect	Geo170K	MathV360K	MC	Average
Total Samples	10	10	10	10	10	50
Successful Samples	10	10	9	9	9	47
IF	1.0	1.0	0.9	0.9	0.9	0.94
F1M	0.250	0.0	0.301	0.287	0.549	0.278
F1IOU	0.156	0.0	0.091	0.203	0.483	0.187
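
In the example above, each IF value equals Successful Samples divided by Total Samples, and the Average column for IF equals 47/50. The short sketch below recomputes those ratios from the table values. Whether the evaluation script derives IF in exactly this way is an assumption based only on these numbers.

```python
# Values copied from the example summary above.
DATASETS = ["RLHF-V", "M-HalDetect", "Geo170K", "MathV360K", "MC"]
TOTAL = [10, 10, 10, 10, 10]
SUCCESSFUL = [10, 10, 9, 9, 9]
REPORTED_IF = [1.0, 1.0, 0.9, 0.9, 0.9]
REPORTED_IF_AVERAGE = 0.94


def success_ratio(successful, total):
    """Fraction of samples that produced a usable result."""
    return successful / total if total else 0.0


# Per-dataset ratios and the ratio over all samples pooled together.
per_dataset = [success_ratio(s, t) for s, t in zip(SUCCESSFUL, TOTAL)]
overall = success_ratio(sum(SUCCESSFUL), sum(TOTAL))

if __name__ == "__main__":
    for name, ratio, reported in zip(DATASETS, per_dataset, REPORTED_IF):
        print(f"{name}: computed={ratio:.2f} reported={reported:.2f}")
    print(f"Average: computed={overall:.2f} reported={REPORTED_IF_AVERAGE:.2f}")
```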

Available Parameters

--dataset: The dataset to evaluate
- Available options: ['all'] + all dataset names in DATASET_CONFIGS
- Default: 'all'
--sample_limit: The number of samples to evaluate
- For example: --sample_limit 50
- Default: 500
--model: Select the model to use
- Available options: ['gpt-4o-2024-11-20', 'claude-3-5-sonnet-20241022', 'gemini-1.5-pro-002', 'qwen-vl-max-0809', 'abab7-chat-preview', 'meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo', 'glm-4v', 'local/MiniCPM-V-2_6', 'local/InternVL2-Llama3-76B']
- Default: 'gpt-4o-2024-11-20'
--prompt_method: Select the prompt method to use
- Available options: ['vanilla', 'Criteria', '2-shot', 'Analyze-then-judge']
- Default: 'vanilla'
--max_workers: Maximum number of worker threads
- Default: 20
--max_annotation_retries: Maximum number of annotation retries
- Default: 3
--api_retry_limit: Maximum number of API call retries
- Default: 3
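
The parameters listed above can be pictured as an argparse definition. The following is only a sketch that mirrors the documented flags and defaults; it is not the parser in evaluate/src/evaluate.py. The DATASET_CONFIGS dictionary here is a hypothetical stand-in that lists just the two dataset names used in this README's commands.

```python
import argparse

# Hypothetical stand-in: the real DATASET_CONFIGS lives in the repository
# and may contain different or additional keys.
DATASET_CONFIGS = {"mathv_360k": {}, "rlhfv": {}}

# Model and prompt-method options as documented in this README.
MODELS = [
    "gpt-4o-2024-11-20",
    "claude-3-5-sonnet-20241022",
    "gemini-1.5-pro-002",
    "qwen-vl-max-0809",
    "abab7-chat-preview",
    "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
    "glm-4v",
    "local/MiniCPM-V-2_6",
    "local/InternVL2-Llama3-76B",
]
PROMPT_METHODS = ["vanilla", "Criteria", "2-shot", "Analyze-then-judge"]


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation flags")
    parser.add_argument("--dataset", default="all", choices=["all", *DATASET_CONFIGS])
    parser.add_argument("--sample_limit", type=int, default=500)
    parser.add_argument("--model", default="gpt-4o-2024-11-20", choices=MODELS)
    parser.add_argument("--prompt_method", default="vanilla", choices=PROMPT_METHODS)
    parser.add_argument("--max_workers", type=int, default=20)
    parser.add_argument("--max_annotation_retries", type=int, default=3)
    parser.add_argument("--api_retry_limit", type=int, default=3)
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list so the sketch runs without command-line input.
    example = ["--model", "qwen-vl-max-0809", "--prompt_method", "Analyze-then-judge"]
    print(vars(build_parser().parse_args(example)))
```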

Example Commands

Evaluate all datasets:

python src/evaluate.py

Evaluate a specific dataset (e.g., mathv_360k) and limit the number of samples:

python src/evaluate.py --dataset mathv_360k --sample_limit 100
python src/evaluate.py --dataset rlhfv --sample_limit 10

Use different models and prompt methods:

python src/evaluate.py --model qwen-vl-max-0809 --dataset mathv_360k --prompt_method Analyze-then-judge

Building Hallucinated Data

1. Download the original datasets and put them in the build/data directory, then download the corresponding images and put them in the build/images directory.

1.1 For geo170k, download the qa_tuning.json file from the Geo170K dataset and put it in the build/data/Geo170K directory. Download the image folder, unzip it into the build/images directory, and rename it to 170k.

1.2 For mathv360k, download the train_samples_all_tuning.json file from the MathV360K dataset and put it in the build/data/mathv directory. Download the image folder, unzip it into the build/images directory, and rename it to mathv.

1.3 For mhal, download the train_raw.json file from the mhal-detect dataset and put it in the build/data/mhal-detect directory. Download the image folder, unzip it into the build/images directory, and rename it to mhal.

1.4 For RLHF-V-Dataset, download the images only; the dataset itself is loaded directly by the code.
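
Before running the build scripts, it can help to confirm that the files and folders from steps 1.1 to 1.3 are where the scripts expect them. The sketch below checks the paths named in those steps. The helper missing_paths is hypothetical, and the RLHF-V image location is left out because this README does not state it.

```python
from pathlib import Path

# Paths taken from steps 1.1 to 1.3, relative to the repository root.
EXPECTED_PATHS = [
    "build/data/Geo170K/qa_tuning.json",
    "build/images/170k",
    "build/data/mathv/train_samples_all_tuning.json",
    "build/images/mathv",
    "build/data/mhal-detect/train_raw.json",
    "build/images/mhal",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [path for path in EXPECTED_PATHS if not (root / path).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected data files and image folders are present")
```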

2. Run any of the following scripts to generate a single dataset:

python build/src/process_mathv_360k.py

python build/src/process_geo_170k.py

python build/src/process_mhal_dataset.py

python build/src/process_rlhfv_dataset.py

To set the number of samples to generate, append the --num_samples parameter to the command, for example:

python build/src/process_geo_170k.py --num_samples 1000
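
To generate all four datasets in one go, the commands above can be wrapped in a small driver. This is a sketch only: it assumes every build script accepts --num_samples in the same way as process_geo_170k.py, and the helpers build_command and run_all are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# The four build scripts listed in this README.
BUILD_SCRIPTS = [
    "build/src/process_mathv_360k.py",
    "build/src/process_geo_170k.py",
    "build/src/process_mhal_dataset.py",
    "build/src/process_rlhfv_dataset.py",
]


def build_command(script, num_samples=None):
    """Return the argument list for running one build script."""
    command = [sys.executable, script]
    if num_samples is not None:
        command += ["--num_samples", str(num_samples)]
    return command


def run_all(num_samples=None, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for script in BUILD_SCRIPTS:
        command = build_command(script, num_samples)
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first script that fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(num_samples=1000, dry_run=True)
```

Call run_all with dry_run=False from the repository root to actually execute the scripts.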

Fine-tuning Dataset Description

target_data.jsonl

This dataset is merged from the following three source datasets:

RLHF-Vision dataset (4,733 samples)
- Source file: ft_rlhfv_dataset_20250111_201136_new.json
MHAL dataset (7,387 samples)
- Source file: ft_mhal_dataset_20250124_145127.json
Math-Vision dataset (5,000 samples)
- Source file: ft_mathv_dataset_20250115_063847_n5000_p127_new.json

To fine-tune your model, download the images from Google Drive using the link and put them in the ft_data/images directory.

Data format:

Each record contains an image path (image_path), a prompt (prompt), a dialog history (history), and a reference answer (reference)
The prompt contains a system message designed specifically for hallucination detection and a user prompt template
Images are stored in the images/rlhfv/, images/mhal/, and images/mathv/ directories according to the dataset type
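
The record layout described above can be checked with a few lines of Python. The sketch assumes target_data.jsonl holds one JSON object per line and that the four field names appear as top-level keys; the value types are not specified in this README, so they are not validated. The helper load_records is hypothetical.

```python
import json

# Field names as described in this README.
REQUIRED_FIELDS = ("image_path", "prompt", "history", "reference")


def load_records(path):
    """Read a JSON Lines file and verify that each record has the expected keys."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            missing = [field for field in REQUIRED_FIELDS if field not in record]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            records.append(record)
    return records


if __name__ == "__main__":
    try:
        print(len(load_records("target_data.jsonl")), "records loaded")
    except FileNotFoundError:
        print("target_data.jsonl not found in the current directory")
```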
