
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

This repository provides the official implementation and data for
ORIC: Benchmarking Object Recognition in Incongruous Context for Large Vision-Language Models

Zhaoyang Li* (UC San Diego), Zhang Ling* (UC San Diego), Yuchen Zhou (UC San Diego), Litian Gong (UC Riverside), Erdem Bıyık (University of Southern California), Hao Su (UC San Diego, Hillbot)

* Equal contribution.


ORIC Overview

We introduce ORIC, a framework designed to systematically evaluate object recognition under incongruous contexts, where objects appear in unexpected settings or expected objects fail to appear. ORIC constructs challenging object–context pairs using two complementary strategies:

(1) LLM-guided sampling to identify hard-to-recognize existing objects, and
(2) CLIP-guided sampling to mine plausible but absent ones.

Applied to MSCOCO, ORIC generates both ORIC-Bench (for evaluation) and ORIC-style training data. We also release code for evaluating models on ORIC-Bench and for fine-tuning Qwen3-VL-8B-Instruct via Visual Reinforcement Fine-Tuning (Visual-RFT). Overall, ORIC underscores contextual incongruity as a critical source of uncertainty for LVLMs and provides practical tools for building more reliable vision–language systems.
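
As an illustration of the CLIP-guided strategy (2), the sketch below ranks COCO category names by CLIP image–text similarity and keeps high-scoring categories that are not annotated in the image. This is a minimal sketch, not the actual sampler in main.py; the helper name mine_absent_objects and its inputs are illustrative.

# Minimal sketch of CLIP-guided sampling for plausible-but-absent objects.
# The helper name and its inputs are illustrative; the real pipeline lives in main.py.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mine_absent_objects(image_path, candidate_objects, annotated_objects, top_k=2):
    """Rank candidate categories by CLIP image-text similarity and keep the
    highest-scoring ones that are NOT annotated in the image."""
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a photo of a {obj}" for obj in candidate_objects]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image.squeeze(0)  # shape: (num_candidates,)
    ranked = [candidate_objects[int(i)] for i in sims.argsort(descending=True)]
    absent = [obj for obj in ranked if obj not in annotated_objects]
    return absent[:top_k]  # contextually plausible yet absent -> label "no"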


🛠️ 1. Setup:

git clone https://github.com/ZhaoyangLi-1/ORIC.git
cd ORIC
conda create -n oric python=3.10
conda activate oric
bash setup.sh

2. Set your OpenAI API Key:

export OPENAI_API_KEY="your_openai_api_key"

3. Generate ORIC QA Samples:

python main.py \
  --data_folder /path/to/coco \
  --output_folder /path/to/output \
  --num_objects 2 \
  --num_images 1000 \
  --seed 42 \
  --llm_model gpt-5-2025-08-07 \
  --reject_prompt ./prompts/reject_sample.txt \
  --split val

Arguments:

--data_folder: Path to your COCO dataset folder.

data_folder/
├── train2014/                 # Training images
│     ├── COCO_train2014_000000000009.jpg
│     └── ...
├── val2014/                   # Validation images
│     ├── COCO_val2014_000000000042.jpg
│     └── ...
└── annotations/               # COCO annotation JSON files
      ├── instances_train2014.json
      └── instances_val2014.json

--output_folder: Directory to save generated Q&A samples.

--num_objects: Number of objects to sample per image.

--num_images: Number of images per label (yes and no). For example, if num_images = 500, the sampler uses:

  • 500 images with label yes

  • 500 images with label no

  Given num_objects, the total number of Q&A pairs is:

    Total Q&A = 2 * num_images * num_objects

  For instance, if num_images = 500 and num_objects = 2, then:

    Total Q&A = 2 * 500 * 2 = 2000 Q&A pairs.

--llm_model: OpenAI model name (e.g., gpt-5-2025-08-07).

--reject_prompt: Prompt template used to formulate questions.

--split: Dataset split to use: train or val.

  • train: Produces ORIC-style training data.
  • val: Produces ORIC-Bench evaluation data.

This step produces ORIC-style Q&A pairs ready for inference. We also provide pre-generated questions in the outputs folder for direct use.

4. Run Inference with Your VLM:

Run your Vision-Language Model (VLM) on the generated ORIC Q&A pairs. The output should be saved in a JSON file with the following structure:

[
  {
    "question_id": "1",
    "predicted_answer": "yes",
    "solution": "yes"
  },
  {
    "question_id": "2",
    "predicted_answer": "no",
    "solution": "no"
  }
]
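
For reference, a minimal inference loop that writes predictions in this format might look as follows. The question-file path and field names (question_id, image, question, solution) and the answer_with_vlm helper are assumptions; adapt them to the files produced in Step 3 and to your own model.

# Minimal sketch: run a VLM over the generated questions and save predictions
# in the expected format. answer_with_vlm is a placeholder for your model call;
# the question-file path and field names are assumptions.
import json

def answer_with_vlm(image_path, question):
    # Replace with your VLM inference; it should return "yes" or "no".
    raise NotImplementedError

with open("outputs/oric_val_questions.json") as f:  # illustrative path
    questions = json.load(f)

predictions = []
for q in questions:
    pred = answer_with_vlm(q["image"], q["question"])
    predictions.append({
        "question_id": q["question_id"],
        "predicted_answer": pred.strip().lower(),
        "solution": q["solution"],
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)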

5. Evaluate Model Performance:

python evaluate.py \
  --result_path /path/to/predictions.json \
  --output_folder /path/to/eval_results
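
evaluate.py reports metrics computed from the predictions file. The sketch below shows the kind of accuracy/precision/recall/F1 computation involved; the exact metrics and their names in evaluate.py may differ.

# Minimal sketch of yes/no evaluation over predictions.json.
# evaluate.py may report additional or differently named metrics.
import json

with open("predictions.json") as f:
    preds = json.load(f)

tp = sum(p["predicted_answer"] == "yes" and p["solution"] == "yes" for p in preds)
fp = sum(p["predicted_answer"] == "yes" and p["solution"] == "no" for p in preds)
fn = sum(p["predicted_answer"] == "no" and p["solution"] == "yes" for p in preds)
correct = sum(p["predicted_answer"] == p["solution"] for p in preds)

accuracy = correct / len(preds)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  Recall={recall:.3f}  F1={f1:.3f}")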

6. Visual-RFT Finetuning

Visual-RFT is our reinforcement-learning finetuning pipeline built upon Group Relative Policy Optimization (GRPO), designed to reduce uncertainty-driven hallucination and improve robustness under contextual incongruity.

Requirements

  • 4 × NVIDIA H100 / A100 GPUs
  • PyTorch ≥ 2.1
  • Flash-Attention v2
  • DeepSpeed ZeRO-3 (config included in repo)

6.1 Training Data Preprocessing

Before running Visual-RFT finetuning, the ORIC-style training data must be converted into the Hugging Face datasets format. Use the following preprocessing script to convert an ORIC JSON file into an HF DatasetDict:

python virft/dataset/build_dataset.py \
  --json_path /path/to/oric_train.json \
  --image_dir /path/to/coco/images \
  --save_path /path/to/hf_dataset
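
After preprocessing, you can sanity-check the saved dataset with datasets.load_from_disk. The split and column names below are assumptions; inspect your own output to confirm them.

# Quick sanity check of the saved HF dataset (split and column names are
# assumptions; inspect your own output to confirm them).
from datasets import load_from_disk

ds = load_from_disk("/path/to/hf_dataset")
print(ds)                 # DatasetDict with its splits and columns
example = ds["train"][0]  # assumes a "train" split exists
print(example.keys())     # e.g. image / question / solution fields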

6.2 Training Command

Run the following command to launch GRPO fine-tuning on 4 GPUs:

export DEBUG_MODE="true"
export LOG_PATH="./debug_log_8b_GRPO_oric.txt"

export DATA_PATH=./dataset  ### path to the ORIC-style training data saved in Section 6.1
export CKPT_PATH=./share_models/Qwen3-VL-8B-Instruct ### path to the Qwen3-VL-8B-Instruct checkpoint
export SAVE_PATH=./share_models/Qwen3-VL-8B-Instruct_GRPO_oric ### output directory for fine-tuned checkpoints

torchrun --nproc_per_node="4" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    virft/src/open_r1/grpo_classification.py \
    --output_dir ${SAVE_PATH} \
    --model_name_or_path ${CKPT_PATH} \
    --dataset_name ${DATA_PATH} \
    --deepspeed virft/zero3.json \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --logging_steps 1 \
    --bf16 true \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 15 \
    --run_name Qwen3-VL-8B_GRPO_oric \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8 \
    --learning_rate 2e-6 \
    --lr_scheduler_type cosine

✅ The training script automatically detects whether the checkpoint corresponds to Qwen2, Qwen2.5, or Qwen3-VL (including MoE variants) and selects the correct model class and image processor settings.
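
As a simplified illustration, this kind of dispatch can be done by reading model_type from the checkpoint's config. The model_type strings below are assumptions that depend on the installed transformers version; the actual logic lives in the Visual-RFT trainer.

# Simplified illustration of checkpoint-family dispatch via the config's
# model_type field. The model_type strings are assumptions and may vary with
# the installed transformers version.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./share_models/Qwen3-VL-8B-Instruct")
model_type = config.model_type  # e.g. "qwen2_vl", "qwen2_5_vl", "qwen3_vl", "qwen3_vl_moe"

if model_type.startswith("qwen3_vl"):
    print("Qwen3-VL family (dense or MoE) detected")
elif model_type.startswith("qwen2_5_vl"):
    print("Qwen2.5-VL family detected")
elif model_type.startswith("qwen2_vl"):
    print("Qwen2-VL family detected")
else:
    raise ValueError(f"Unsupported checkpoint: {model_type}")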
