BEVBlip is an efficient, lightweight Vision-Language Model (VLM) based on the BLIP architecture, trained for the comprehensive Visual Question Answering (VQA) task introduced by DriveLM on the nuScenes dataset. The core idea is to use spatio-temporal Bird’s Eye View (BEV) maps produced by BEVFormer as the visual features and to integrate them with language features for improved traffic-environment understanding. To align the BEV features with language, a pre-training stage on GPT-generated data is performed first.
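To make the fusion idea concrete, here is a minimal PyTorch sketch of cross-attending question tokens over flattened BEV features, loosely in the spirit of BLIP-style vision-language fusion. All names and dimensions (`BEVLanguageFusion`, `bev_dim`, the 200x200 grid, etc.) are illustrative assumptions, not the repository's actual API:

```python
import torch
import torch.nn as nn

class BEVLanguageFusion(nn.Module):
    """Illustrative sketch: fuse flattened BEV features with text embeddings
    via cross-attention (module and dimension names are assumptions)."""
    def __init__(self, bev_dim=256, text_dim=768, num_heads=8):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, text_dim)  # project BEV features to the text width
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb, bev_feats):
        # text_emb:  (B, L, text_dim)  token embeddings of the question
        # bev_feats: (B, H*W, bev_dim) flattened spatio-temporal BEV map
        kv = self.bev_proj(bev_feats)
        attended, _ = self.cross_attn(query=text_emb, key=kv, value=kv)
        return self.norm(text_emb + attended)  # residual fusion

# Toy usage with random tensors standing in for real encoder outputs
fusion = BEVLanguageFusion()
text = torch.randn(2, 16, 768)        # batch of 2 questions, 16 tokens each
bev = torch.randn(2, 200 * 200, 256)  # 200x200 BEV grid from a BEVFormer-like encoder
out = fusion(text, bev)
print(out.shape)  # torch.Size([2, 16, 768])
```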
For an in-depth explanation, please see: Thesis - Knowledge Distilled Traffic Environment Understanding
High-level outline of the proposed approach:
The architecture of the pre-training model:
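As a rough illustration of what such an alignment stage could look like, the sketch below computes a symmetric contrastive (BLIP-style ITC) loss between pooled BEV embeddings and caption embeddings. It assumes the GPT-generated captions have already been encoded, and every name here is hypothetical rather than taken from the repository:

```python
import torch
import torch.nn.functional as F

def bev_text_contrastive_loss(bev_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning pooled BEV embeddings with caption
    embeddings (BLIP-style ITC; function and argument names are assumptions).

    bev_emb:  (B, D) pooled BEV representation per sample
    text_emb: (B, D) pooled embedding of the GPT-generated caption
    """
    bev_emb = F.normalize(bev_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = bev_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # matching BEV/caption pairs lie on the diagonal
    loss_b2t = F.cross_entropy(logits, targets)
    loss_t2b = F.cross_entropy(logits.t(), targets)
    return (loss_b2t + loss_t2b) / 2

# Toy usage with random embeddings
loss = bev_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```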
The architecture of the VQA model used for fine-tuning on the DriveLM task:
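For intuition, one training step of such a fine-tuning stage could look like the following: a decoder generates the answer conditioned on the fused BEV-and-question features under a standard language-modeling loss. This is a sketch under stated assumptions (the decoder size, vocabulary, and tensor names are all hypothetical), not the model's actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical fine-tuning step: train a small decoder to generate the answer
# conditioned on fused BEV+question features (all names are illustrative).
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(768, 30522)  # BERT-style tokenizer vocab size (assumption)

fused = torch.randn(2, 16, 768)         # output of a fusion module as sketched above
answer_emb = torch.randn(2, 10, 768)    # embedded ground-truth answer tokens (teacher forcing)
answer_ids = torch.randint(0, 30522, (2, 10))

# Causal mask so each answer position only attends to earlier positions
tgt_mask = nn.Transformer.generate_square_subsequent_mask(answer_emb.size(1))
hidden = decoder(tgt=answer_emb, memory=fused, tgt_mask=tgt_mask)
logits = lm_head(hidden)                # (B, T, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), answer_ids.reshape(-1)
)
loss.backward()                         # standard next-token prediction objective
print(float(loss))
```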
Sources and references: