MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
1 The University of Melbourne 2 The University of Western Australia
Accepted to Findings of the Association for Computational Linguistics: ACL 2025
Implementation of MAGIC-VQA.
- [08/24/2025]:🎉 We have released the code for MAGIC-VQA.
Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. The GNN adds depth to structured inference, enabling relational reasoning beyond what LVLMs achieve alone. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.
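To illustrate the implicit knowledge augmentation stage, here is a minimal sketch of encoding a graph built from retrieved commonsense triplets with a GCN. It uses PyTorch Geometric; the class and variable names (`TripletGCN`, `node_feats`, `edge_index`) are illustrative placeholders and are not taken from the released code.

```python
# Minimal sketch of GCN-based implicit knowledge augmentation.
# Names such as TripletGCN are illustrative, not from the released code.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class TripletGCN(torch.nn.Module):
    """Encodes a graph whose nodes come from retrieved commonsense triplets."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # Two rounds of message passing over the triplet graph.
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)


# Toy graph: 4 entity nodes with 16-dim features and a simple edge list.
node_feats = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]], dtype=torch.long)
model = TripletGCN(in_dim=16, hidden_dim=32, out_dim=16)
node_embeddings = model(node_feats, edge_index)  # shape: [4, 16]
```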
Following the original paper, please first use CLIP_Retrieval to retrieve and filter the knowledge triplets, then use GCN training to train the graph network and run inference.
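For reference, below is a minimal sketch of the CLIP-based retrieval idea: scoring candidate knowledge triplets against an image and keeping the top-ranked ones. The checkpoint, triplet strings, image path, and top-k value are placeholders; the actual filtering logic lives in CLIP_Retrieval.

```python
# Sketch of scoring candidate knowledge triplets against an image with CLIP.
# Checkpoint, triplets, and k are placeholders, not the repository defaults.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
triplets = [
    "umbrella UsedFor staying dry",
    "rain Causes wet ground",
    "dog CapableOf barking",
]

inputs = processor(text=triplets, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher image-text similarity -> more relevant triplet; keep top-k for post-processing.
scores = outputs.logits_per_image.squeeze(0)
top_k = scores.topk(k=2).indices
filtered = [triplets[i] for i in top_k.tolist()]
print(filtered)
```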
The results for ScienceQA, TextVQA, and MMMU, together with their corresponding knowledge triplets, can be found in the Atomic folder.
If you find our method useful, please kindly cite our paper.
@inproceedings{yang-etal-2025-magic,
title = "{MAGIC}-{VQA}: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering",
author = "Yang, Shuo and
Han, Caren and
Luo, Siwen and
Hovy, Eduard",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.872/",
doi = "10.18653/v1/2025.findings-acl.872",
pages = "16967--16986",
ISBN = "979-8-89176-256-5"
}

We welcome contributions from the research community to improve the efficiency of MAGIC-VQA. If you have any ideas or would like to report a bug, please open an issue or submit a pull request.
The code is released under the MIT License.