BEVBlip is an efficient, lightweight Vision-Language Model (VLM) based on the BLIP architecture, trained for the comprehensive Visual Question Answering (VQA) task introduced by DriveLM on the nuScenes dataset. The core idea is to use spatio-temporal Bird’s Eye View (BEV) maps produced by BEVFormer as the visual features and to integrate them with language features for improved traffic-environment understanding. To align the BEV features with language, a pre-training stage on GPT-generated data is performed first.
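To make the fusion idea concrete, here is a minimal PyTorch sketch of cross-attending question tokens over flattened BEV features, loosely in the spirit of BLIP-style vision-language fusion. All names and dimensions (`BEVLanguageFusion`, `bev_dim`, the 200x200 grid, etc.) are illustrative assumptions, not the repository's actual API:

```python
import torch
import torch.nn as nn

class BEVLanguageFusion(nn.Module):
    """Illustrative sketch: fuse flattened BEV features with text embeddings
    via cross-attention (module and dimension names are assumptions)."""
    def __init__(self, bev_dim=256, text_dim=768, num_heads=8):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, text_dim)  # project BEV features to the text width
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb, bev_feats):
        # text_emb:  (B, L, text_dim)  token embeddings of the question
        # bev_feats: (B, H*W, bev_dim) flattened spatio-temporal BEV map
        kv = self.bev_proj(bev_feats)
        attended, _ = self.cross_attn(query=text_emb, key=kv, value=kv)
        return self.norm(text_emb + attended)  # residual fusion

# Toy usage with random tensors standing in for real encoder outputs
fusion = BEVLanguageFusion()
text = torch.randn(2, 16, 768)        # batch of 2 questions, 16 tokens each
bev = torch.randn(2, 200 * 200, 256)  # 200x200 BEV grid from a BEVFormer-like encoder
out = fusion(text, bev)
print(out.shape)  # torch.Size([2, 16, 768])
```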
For an in-depth explanation, please see: Thesis - Knowledge Distilled Traffic Environment Understanding
High-level outline of the proposed approach:
The architecture of the pre-training model:
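As a rough illustration of what such an alignment stage could look like, the sketch below computes a symmetric contrastive (BLIP-style ITC) loss between pooled BEV embeddings and caption embeddings. It assumes the GPT-generated captions have already been encoded, and every name here is hypothetical rather than taken from the repository:

```python
import torch
import torch.nn.functional as F

def bev_text_contrastive_loss(bev_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning pooled BEV embeddings with caption
    embeddings (BLIP-style ITC; function and argument names are assumptions).

    bev_emb:  (B, D) pooled BEV representation per sample
    text_emb: (B, D) pooled embedding of the GPT-generated caption
    """
    bev_emb = F.normalize(bev_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = bev_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # matching BEV/caption pairs lie on the diagonal
    loss_b2t = F.cross_entropy(logits, targets)
    loss_t2b = F.cross_entropy(logits.t(), targets)
    return (loss_b2t + loss_t2b) / 2

# Toy usage with random embeddings
loss = bev_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```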
The architecture of the VQA model used for fine-tuning on the DriveLM task:
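For intuition, one training step of such a fine-tuning stage could look like the following: a decoder generates the answer conditioned on the fused BEV-and-question features under a standard language-modeling loss. This is a sketch under stated assumptions (the decoder size, vocabulary, and tensor names are all hypothetical), not the model's actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical fine-tuning step: train a small decoder to generate the answer
# conditioned on fused BEV+question features (all names are illustrative).
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(768, 30522)  # BERT-style tokenizer vocab size (assumption)

fused = torch.randn(2, 16, 768)         # output of a fusion module as sketched above
answer_emb = torch.randn(2, 10, 768)    # embedded ground-truth answer tokens (teacher forcing)
answer_ids = torch.randint(0, 30522, (2, 10))

# Causal mask so each answer position only attends to earlier positions
tgt_mask = nn.Transformer.generate_square_subsequent_mask(answer_emb.size(1))
hidden = decoder(tgt=answer_emb, memory=fused, tgt_mask=tgt_mask)
logits = lm_head(hidden)                # (B, T, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), answer_ids.reshape(-1)
)
loss.backward()                         # standard next-token prediction objective
print(float(loss))
```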
Sources and references: