Approximately 200 million individuals worldwide suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to provide walking assistance.
With the recent progress of vision-language models (VLMs), applying them to walking guidance has become increasingly popular. However:
- Existing methods rely mainly on self-curated QA datasets that are not publicly accessible, so the field lacks a standardized benchmark.
- Walking assistance requires real-time video analysis and concise, informative reminders, yet current VLMs struggle here because of lengthy responses and low inference efficiency.
✨ Our contributions:
- We introduce the first large-scale walking assistance dataset, comprising 12,000 video–annotation pairs, which serves as a unified benchmark for training and evaluation.
- We propose WalkVLM, which:
  - Employs chain-of-thought (CoT) hierarchical planning to generate concise yet informative reminders.
  - Utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders (see the sketch after this list).
- We establish a solid benchmark for the blind walking task and verify the advantages of WalkVLM in streaming video processing compared to other VLMs.
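To make the temporal-aware adaptive prediction idea concrete, below is a minimal, heuristic sketch of the trigger loop: the VLM is queried only when the recent frames have changed enough. The window size, threshold, and `vlm_call` hook are illustrative assumptions, not the repo's implementation; the actual learned trigger lives in `WalkVLM-LR/EAD.py`.

```python
import cv2
import numpy as np

# Heuristic stand-in for temporal-aware adaptive prediction: decide whether
# the VLM should be triggered based on how much the scene has changed over a
# short frame history. WalkVLM uses a learned trigger (EAD.py); this sketch
# only illustrates the control flow.

HISTORY = 8          # number of recent frames kept (assumed value)
TRIGGER_THRESH = 12  # mean gray-level frame difference that fires the VLM (assumed value)

def should_trigger(frames: list[np.ndarray]) -> bool:
    """Return True if inter-frame change suggests a new reminder is needed."""
    if len(frames) < 2:
        return False
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[-HISTORY:]]
    diffs = [np.abs(a.astype(np.int16) - b.astype(np.int16)).mean()
             for a, b in zip(grays[:-1], grays[1:])]
    return float(np.mean(diffs)) > TRIGGER_THRESH

def stream_reminders(video_path: str, vlm_call):
    """Read a video stream and query the VLM only when the trigger fires."""
    cap = cv2.VideoCapture(video_path)
    history: list[np.ndarray] = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        history = (history + [frame])[-HISTORY:]
        if should_trigger(history):
            # vlm_call: user-supplied function that returns a short reminder string
            print(vlm_call(history))
    cap.release()
```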
Fig. 1 — Visualization results of the WAD dataset sorted by region. The WAD dataset has a wide range of sources, and the categories shown are randomly sampled from the dataset. The pie chart in the lower-left corner shows the proportion of video length from different regions.
Fig. 2 — An overview of the proposed WalkVLM framework. WalkVLM employs CoT-based hierarchical planning to summarize the static attributes and understanding of scenes, thereby facilitating the subsequent reminder and QA tasks. Furthermore, temporal-aware adaptive prediction is proposed to determine the trigger state of the VLM, reducing the temporal redundancy of outputs.
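As an illustration of the CoT-based hierarchical planning described above, the snippet below sketches a two-stage prompting flow: the first query summarizes static scene attributes, and the second conditions a short reminder on that summary. The prompt wording and the `query_vlm` helper are placeholders, not the prompts or APIs used in WalkVLM-LR.

```python
# Two-stage (hierarchical) prompting sketch: summarize the scene first, then
# ask for a concise reminder grounded in that summary. Both prompts are
# illustrative assumptions.

STAGE1_PROMPT = (
    "List the static attributes of this street scene relevant to a blind "
    "pedestrian: surface type, obstacles, crossings, moving traffic."
)
STAGE2_PROMPT = (
    "Scene summary: {summary}\n"
    "Give one concise walking reminder (under 15 words) for the pedestrian."
)

def hierarchical_reminder(frames, query_vlm):
    """query_vlm(frames, prompt) -> str is assumed to wrap any multimodal model."""
    summary = query_vlm(frames, STAGE1_PROMPT)                        # stage 1: scene-level summary
    return query_vlm(frames, STAGE2_PROMPT.format(summary=summary))   # stage 2: short reminder
```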
.
├── wad_dataset           # Dataset used in WalkVLM
└── WalkVLM-LR            # WalkVLM reasoning code
    ├── checkpoint        # Pretrained weights
    ├── vlm_grpo_template # GRPO training template
    ├── EAD.py            # EAD model code
    ├── GPTScore.py       # GPTScore calculation code
    ├── inference.py      # Inference script
    ├── test.py           # Testing script
    └── train_EAD.py      # EAD model training script
If you find this code or dataset helpful, please cite:
Z. Yuan et al., "WalkVLM: Aid Visually Impaired People Walking by Vision Language Model," arXiv preprint arXiv:2412.20903, Dec. 30, 2024.