Official repository for "Question-Aware Gaussian Experts for Audio-Visual Question Answering" (CVPR 2025).
Authors: Hongyeob Kim1*, Inyoung Jung1*, Dayoon Suh2, Youjia Zhang1, Sangmin Lee1, Sungeun Hong1†
1Sungkyunkwan University, 2Purdue University
Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
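To illustrate the core idea, the sketch below shows how a mixture of Gaussian "experts" over the frame timeline, mixed by question-conditioned gating weights, produces a continuous temporal attention curve. This is a minimal toy example, not the actual QA-TIGER implementation; all names and values are hypothetical.

```python
import numpy as np

def gaussian_temporal_weights(centers, widths, gate, num_frames):
    """Mix per-expert Gaussians over the normalized frame timeline
    into one continuous temporal attention curve (toy sketch)."""
    t = np.linspace(0.0, 1.0, num_frames)  # normalized frame positions
    # one Gaussian per expert, evaluated at every frame position
    experts = np.exp(-0.5 * ((t[None, :] - centers[:, None]) / widths[:, None]) ** 2)
    experts /= experts.sum(axis=1, keepdims=True)  # each expert sums to 1 over time
    # gate: question-conditioned mixture weights over the experts
    return gate @ experts

# toy example: 3 experts on a 10-frame clip; the gate favors the early expert
centers = np.array([0.2, 0.5, 0.8])   # hypothetical Gaussian means (normalized time)
widths = np.array([0.1, 0.1, 0.1])    # hypothetical Gaussian std devs
gate = np.array([0.7, 0.2, 0.1])      # softmax-style weights, sum to 1
w = gaussian_temporal_weights(centers, widths, gate, 10)
```

Because each Gaussian is continuous over time, the resulting weights can emphasize both consecutive and non-consecutive frames, unlike discrete Top-K selection.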
## Requirements

- Python 3.10+
- PyTorch 2.4.0
## Clone this repo

```shell
git clone https://github.com/AIM-SKKU/QA-TIGER.git
```
## Setting the environment

- with conda

```shell
conda create -n qa-tiger python=3.10
conda activate qa-tiger
pip install -e .
```

- with pip

```shell
pip install -e .
```

- with uv

```shell
uv sync
source .venv/bin/activate
```
## Prepare data

- You can find the annotations in `./data/annots/`.
  - Notice: for MUSIC-AVQA-v2.0, we asked the authors about the original split and pre-divided the dataset accordingly.
- Additionally, the following links provide access to the original annotations and data:
  - MUSIC-AVQA: https://gewu-lab.github.io/MUSIC-AVQA/
  - MUSIC-AVQA-R: https://github.com/reml-group/MUSIC-AVQA-R
  - MUSIC-AVQA-v2.0: https://github.com/DragonLiu1995/MUSIC-AVQA-v2.0
## Feature extraction

- We follow the same protocol as TSPM for feature extraction. Please refer to TSPM.
- Put the extracted features under `./data/feats/`:

```
data
 ┣ annots
 ┃ ┣ music_avqa
 ┃ ┣ music_avqa_r
 ┃ ┗ music_avqa_v2
 ┣ feats
 ┃ ┣ frame_ViT-L14@336px
 ┃ ┣ visual_tome14
 ┃ ┣ ...
 ┗ ┗ vggish
```
## Training

```shell
bash scripts/train.sh <CONFIG> <GPU_IDX>
```
## Testing

```shell
bash scripts/test.sh <CONFIG> <GPU_IDX> <WEIGHT> <OUTPUT_LOG_PATH>
```
If you find this work useful, please consider citing it.
```bibtex
@inproceedings{kim2025qatiger,
  title={Question-Aware Gaussian Experts for Audio-Visual Question Answering},
  author={Hongyeob Kim and Inyoung Jung and Dayoon Suh and Youjia Zhang and Sangmin Lee and Sungeun Hong},
  booktitle={CVPR},
  year={2025}
}
```
We acknowledge the following code, which served as a reference for our implementation.