This is the official implementation of the paper RCA: Region Conditioned Adaptation for Visual Abductive Reasoning. We achieved the top rank on the official Sherlock Abductive Reasoning Leaderboard and the DHPR retrieval performance.
- Release RCA-V1 version (the version used in paper) to public.
| Model | Backbone | Tuned (M↓) | im→txt (↓) | txt→im (↓) | P@1→I (↑) | GT/Auto-Box (↑) | Human Acc (↑) | Model Link |
|---|---|---|---|---|---|---|---|---|
| LXMERT [1] from [4] | F-RCNN | NA | 51.10 | 48.80 | 14.90 | 69.50 / 30.30 | 21.10 | NA |
| UNITER [2] from [4] | F-RCNN | NA | 40.40 | 40.00 | 19.80 | 73.00 / 33.30 | 22.90 | NA |
| CPT [3] from [4] | RN50×64 | NA | 16.35 | 17.72 | 33.44 | 87.22 / 40.60 | 27.12 | NA |
| CPT [3] from [4] | ViT-B-16 | 149.62 | 19.85 | 21.64 | 30.56 | 85.33 / 36.60 | 21.31 | pth |
| RCA + Dual-Contrast Loss | ViT-B-16 | 42.26 | 13.92 | 16.58 | 35.42 | 88.08 / 42.32 | 27.51 | pth |
| CPT [3] (our impl) | ViT-L-14 | 428.53 | 13.08 | 14.91 | 37.21 | 87.85 / 41.99 | 29.58 | pth |
| RCA + Dual-Contrast Loss | ViT-L-14 | 89.63 | 10.14 | 12.65 | 40.36 | 89.72 / 44.73 | 31.74 | pth |
1. LXMERT 2. UNITER 3. CPT 4. SHERLOCK
cd train_code_v2.20.0_RCA_CLIP
pip install -r requirements.txt
Create a folder named Sherlock and put the following files in it (annotations can be directly download from annotations.zip, images please download from Sherlock):
Sherlock
|_sherlock_val_with_split_idxs_v1_1.json
|_sherlock_train_v1_1.json
|
|_test_localization_public
|_test_retrieval_public
|_test_comparison_public
|_val_localization
|_val_retrieval
|_val_comparison
|
|_images
|_vcr1images
| |_vcr1images_0.jpg
| |_...
|
|_VG_100K
| |_vcr1images_1.jpg
| |_...
|
|_VG_100K_2
|_vcr1images_2.jpg
|_...
RCA is coded and maintained by Dr. Hao Zhang.
If you find the paper helpful for your work, please consider citing the following:
@inproceedings{hesselhwang2022abduction,
title={{RCA: Region Conditioned Adaptation for Visual Abductive Reasoning}},
author={Hao Zhang, Yeo Keat Ee, Basura Fernando},
booktitle={ACM Multimedia},
year={2024}
}
@inproceedings{hesselhwang2022abduction,
title={{The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning}},
author={*Hessel, Jack and *Hwang, Jena D and Park, Jae Sung and Zellers, Rowan and Bhagavatula, Chandra and Rohrbach, Anna and Saenko, Kate and Choi, Yejin},
booktitle={ECCV},
year={2022}
}
@article{10568360,
author={Charoenpitaks, Korawat and Nguyen, Van-Quang and Suganuma, Masanori and Takahashi, Masahiro and Niihara, Ryoma and Okatani, Takayuki},
journal={IEEE Transactions on Intelligent Vehicles},
title={Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction},
year={2024},
volume={},
number={},
pages={1-11},
keywords={Hazards;Cognition;Videos;Automobiles;Accidents;Task analysis;Natural languages;Vision;Language;Reasoning;Traffic Accident Anticipation},
doi={10.1109/TIV.2024.3417353}
}
Thanks for the following Github repositories:
