This repo contains the core scripts for reproducing the main results of the paper "MHALO: Evaluating MLLMs as Fine-grained Hallucination Detectors".
Yishuo Cai, Renjie Gu
If you have any questions or run into issues with the code, please open an issue.
To set up the environment quickly:
git clone https://github.com/walkeralan123/MHALO.git
cd MHALO
conda create -n mhalo python=3.10
conda activate mhalo
pip install -r requirements.txt
The project is divided into two main parts:
build folder - corresponds to the process of generating hallucinated data
evaluate folder - corresponds to evaluating the hallucination detection of different models
cd evaluate
Because the images are too large to keep in the repo, we store them on Google Drive. You can download them from here.
Then put the downloaded directory inside the evaluate directory.
Create a .env file in the project root directory and add the following content:
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=your_api_base_url
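The two entries above are plain KEY=value pairs. As a minimal sketch of how such a file can be read with the standard library alone (the helper names parse_env and load_env are hypothetical; the repo's own loading code may differ, for example by using a dotenv package):

```python
import os


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env"):
    """Copy keys from the file into os.environ; already-set variables win."""
    with open(path, encoding="utf-8") as fh:
        for key, value in parse_env(fh.read()).items():
            os.environ.setdefault(key, value)
```

Calling load_env() from the project root before any API client is created would make OPENAI_API_KEY and OPENAI_API_BASE available through os.environ.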
The main evaluation script is located in evaluate/src/evaluate.py. You can run the evaluation with the following command:
cd mhalo/evaluate
python src/evaluate.py
The evaluation results are stored in the evaluate/results directory, named in the format evaluate/results/YYYY_MM_DD_HH_MM_SS_model-name_prompt-method_sample-limit.
Each result directory contains:
- Five dataset-specific folders with detailed evaluation results
- An evaluation_summary.csv file with overall metrics
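To make the naming format concrete, here is a small sketch that assembles a result directory name from its four parts. The function result_dir_name is hypothetical and only mirrors the format string given above; how the script treats model names containing a slash is not stated, so that case is left alone.

```python
from datetime import datetime


def result_dir_name(model, prompt_method, sample_limit, now=None):
    """Build evaluate/results/YYYY_MM_DD_HH_MM_SS_model_prompt_limit."""
    stamp = (now or datetime.now()).strftime("%Y_%m_%d_%H_%M_%S")
    return f"evaluate/results/{stamp}_{model}_{prompt_method}_{sample_limit}"
```

For example, a run with gpt-4o-2024-11-20, the vanilla prompt, and a sample limit of 10 would produce a name ending in _gpt-4o-2024-11-20_vanilla_10.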
Example of evaluation summary:
| Metric | RLHF-V | M-HalDetect | Geo170K | MathV360K | MC | Average |
|---|---|---|---|---|---|---|
| Total Samples | 10 | 10 | 10 | 10 | 10 | 50 |
| Successful Samples | 10 | 10 | 9 | 9 | 9 | 47 |
| IF | 1.0 | 1.0 | 0.9 | 0.9 | 0.9 | 0.94 |
| F1M | 0.250 | 0.0 | 0.301 | 0.287 | 0.549 | 0.278 |
| F1IOU | 0.156 | 0.0 | 0.091 | 0.203 | 0.483 | 0.187 |
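In the example table, the IF row equals Successful Samples divided by Total Samples in every column, and its Average entry equals the pooled ratio 47/50. The sketch below reproduces those numbers from the two count rows; this relationship is read off the example table, and the script's actual metric code is not shown here.

```python
# Counts copied from the example summary table above.
totals = {"RLHF-V": 10, "M-HalDetect": 10, "Geo170K": 10, "MathV360K": 10, "MC": 10}
successes = {"RLHF-V": 10, "M-HalDetect": 10, "Geo170K": 9, "MathV360K": 9, "MC": 9}


def success_ratio(successful, total):
    """Fraction of samples that were processed successfully."""
    return successful / total if total else 0.0


per_dataset_if = {name: success_ratio(successes[name], totals[name]) for name in totals}
overall_if = success_ratio(sum(successes.values()), sum(totals.values()))
```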
- --dataset: The dataset to evaluate
  - Available options: ['all'] + all dataset names in DATASET_CONFIGS
  - Default: 'all'
- --sample_limit: The number of samples to evaluate
  - For example: --sample_limit 50
  - Default: 500
- --model: Select the model to use
  - Available options: ['gpt-4o-2024-11-20', 'claude-3-5-sonnet-20241022', 'gemini-1.5-pro-002', 'qwen-vl-max-0809', 'abab7-chat-preview', 'meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo', 'glm-4v', 'local/MiniCPM-V-2_6', 'local/InternVL2-Llama3-76B']
  - Default: 'gpt-4o-2024-11-20'
- --prompt_method: Select the prompt method to use
  - Available options: ['vanilla', 'Criteria', '2-shot', 'Analyze-then-judge']
  - Default: 'vanilla'
- --max_workers: Maximum number of worker threads
  - Default: 20
- --max_annotation_retries: Maximum number of annotation retries
  - Default: 3
- --api_retry_limit: Maximum number of API call retries
  - Default: 3
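The options and defaults listed above could be declared with argparse roughly as follows. This is a sketch built only from that list, not the parser in evaluate/src/evaluate.py; in particular, --dataset is left unrestricted here because the names in DATASET_CONFIGS are not spelled out.

```python
import argparse

MODELS = [
    "gpt-4o-2024-11-20", "claude-3-5-sonnet-20241022", "gemini-1.5-pro-002",
    "qwen-vl-max-0809", "abab7-chat-preview",
    "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo", "glm-4v",
    "local/MiniCPM-V-2_6", "local/InternVL2-Llama3-76B",
]
PROMPT_METHODS = ["vanilla", "Criteria", "2-shot", "Analyze-then-judge"]


def build_parser():
    """Return a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Hallucination detection evaluation")
    parser.add_argument("--dataset", default="all", help="'all' or a dataset name")
    parser.add_argument("--sample_limit", type=int, default=500)
    parser.add_argument("--model", choices=MODELS, default="gpt-4o-2024-11-20")
    parser.add_argument("--prompt_method", choices=PROMPT_METHODS, default="vanilla")
    parser.add_argument("--max_workers", type=int, default=20)
    parser.add_argument("--max_annotation_retries", type=int, default=3)
    parser.add_argument("--api_retry_limit", type=int, default=3)
    return parser
```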
- Evaluate all datasets:
python src/evaluate.py
- Evaluate a specific dataset (e.g., mathv_360k) and limit the number of samples:
python src/evaluate.py --dataset mathv_360k --sample_limit 100
python src/evaluate.py --dataset rlhfv --sample_limit 10
- Use different models and prompt methods:
python src/evaluate.py --model qwen-vl-max-0809 --dataset MathV360K --prompt_method Analyze-then-judge
1. Download the original datasets and put them in the build/data directory, then download the corresponding images and put them in the build/images directory.
1.1 For geo170k, you need to download the qa_tuning.json file from the Geo170K dataset and put it in the build/data/Geo170K directory, download the image folder and unzip it to the build/images directory and rename it to 170k.
1.2 For mathv360k, you need to download the train_samples_all_tuning.json file from the MathV360K dataset and put it in the build/data/mathv directory, download the image folder and unzip it to the build/images directory and rename it to mathv.
1.3 For mhal, you need to download the train_raw.json file from the mhal-detect dataset and put it in the build/data/mhal-detect directory, download the image folder and unzip it to the build/images directory and rename it to mhal.
1.4 For RLHF-V-Dataset, download the images; the dataset itself is loaded directly in the code.
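Before running the build scripts, it can help to confirm that the files and folders from steps 1.1 to 1.3 are where the text says they should be. The check below is a hypothetical helper, not part of the repo; it lists only the paths named above and skips RLHF-V because its image folder name is not given.

```python
from pathlib import Path

# Paths taken from steps 1.1 to 1.3, relative to the repo root.
EXPECTED_PATHS = [
    "build/data/Geo170K/qa_tuning.json",
    "build/images/170k",
    "build/data/mathv/train_samples_all_tuning.json",
    "build/images/mathv",
    "build/data/mhal-detect/train_raw.json",
    "build/images/mhal",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

An empty list from missing_paths() means every listed file and folder was found.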
2. Run the following scripts to generate each dataset individually:
python build/src/process_mathv_360k.py
python build/src/process_geo_170k.py
python build/src/process_mhal_dataset.py
python build/src/process_rlhfv_dataset.py
To set the number of samples to generate, append --num_samples to the command, for example:
python build/src/process_geo_170k.py --num_samples 1000
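To generate all four datasets in one go, the commands above can be assembled programmatically. This sketch assumes that every build script accepts --num_samples in the same way as process_geo_170k.py, which the text implies but only demonstrates for that one script; build_commands is a hypothetical helper.

```python
import sys

BUILD_SCRIPTS = [
    "build/src/process_mathv_360k.py",
    "build/src/process_geo_170k.py",
    "build/src/process_mhal_dataset.py",
    "build/src/process_rlhfv_dataset.py",
]


def build_commands(num_samples=None):
    """Return one argv list per build script, optionally with --num_samples."""
    commands = []
    for script in BUILD_SCRIPTS:
        cmd = [sys.executable, script]
        if num_samples is not None:
            cmd += ["--num_samples", str(num_samples)]
        commands.append(cmd)
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the repo root.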
This dataset is merged from the following three source datasets:
- RLHF-Vision dataset (4,733 samples)
  - Source file: ft_rlhfv_dataset_20250111_201136_new.json
- MHAL dataset (7,387 samples)
  - Source file: ft_mhal_dataset_20250124_145127.json
- Math-Vision dataset (5,000 samples)
  - Source file: ft_mathv_dataset_20250115_063847_n5000_p127_new.json
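The three sample counts add up to 17,120. As a rough sketch of what such a merge could look like, assuming each source file holds a top-level JSON list of samples (the file layout is an assumption, and merge_json_lists is a hypothetical helper rather than the script used to produce the dataset):

```python
import json

# Sum of the three sample counts listed above.
EXPECTED_TOTAL = 4733 + 7387 + 5000


def merge_json_lists(paths):
    """Concatenate the top-level lists stored in several JSON files."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            records = json.load(fh)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON list")
        merged.extend(records)
    return merged
```

Comparing len(merge_json_lists(...)) with EXPECTED_TOTAL is a quick sanity check after merging.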
To fine-tune your model, download the images from Google Drive using the link and put them in the ft_data/images directory.
Data processing instructions:
- Each sample contains an image path (image_path), a prompt (prompt), the dialog history (history), and a reference answer (reference)
- The prompt contains a system message designed specifically for hallucination detection and a user prompt template
- Images are stored in the images/rlhfv/, images/mhal/, and images/mathv/ directories according to the dataset type
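The field names and image folders described above can be checked with a few lines of code. The sketch below assumes each sample is a dict keyed by those four field names and that image_path contains the images/<dataset>/ segment; value types and the exact path prefix are assumptions, and both helpers are hypothetical.

```python
REQUIRED_FIELDS = ("image_path", "prompt", "history", "reference")
IMAGE_FOLDERS = ("rlhfv", "mhal", "mathv")


def missing_fields(record):
    """Return the required field names that are absent from a sample."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def dataset_of(image_path):
    """Return the folder name that follows 'images/' in the path, or None."""
    parts = image_path.replace("\\", "/").split("/")
    for i, part in enumerate(parts[:-1]):
        if part == "images" and parts[i + 1] in IMAGE_FOLDERS:
            return parts[i + 1]
    return None
```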