Approximately 200 million individuals worldwide suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to provide walking assistance.
With the recent progress of vision-language models (VLMs), applying them to walking guidance has become increasingly popular. However:
- Existing methods rely mainly on self-curated QA datasets that are not publicly accessible, so the field lacks a standardized benchmark.
- Walking assistance requires real-time video analysis and concise, informative reminders, yet current VLMs struggle here because of lengthy responses and low inference efficiency.
✨ Our contributions:
- We introduce the first large-scale walking assistance dataset, comprising 12,000 video–annotation pairs, which serves as a unified benchmark for training and evaluation.
- We propose WalkVLM, which:
  - Employs chain-of-thought (CoT) hierarchical planning to generate concise yet informative reminders.
  - Utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders (see the sketch after this list).
- We establish a solid benchmark for the blind walking task and verify the advantages of WalkVLM in streaming video processing compared to other VLMs.
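To make the temporal-aware adaptive prediction idea concrete, below is a minimal, heuristic sketch of the trigger loop: the VLM is queried only when the recent frames have changed enough. The window size, threshold, and `vlm_call` hook are illustrative assumptions, not the repo's implementation; the actual learned trigger lives in `WalkVLM-LR/EAD.py`.

```python
import cv2
import numpy as np

# Heuristic stand-in for temporal-aware adaptive prediction: decide whether
# the VLM should be triggered based on how much the scene has changed over a
# short frame history. WalkVLM uses a learned trigger (EAD.py); this sketch
# only illustrates the control flow.

HISTORY = 8          # number of recent frames kept (assumed value)
TRIGGER_THRESH = 12  # mean gray-level frame difference that fires the VLM (assumed value)

def should_trigger(frames: list[np.ndarray]) -> bool:
    """Return True if inter-frame change suggests a new reminder is needed."""
    if len(frames) < 2:
        return False
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[-HISTORY:]]
    diffs = [np.abs(a.astype(np.int16) - b.astype(np.int16)).mean()
             for a, b in zip(grays[:-1], grays[1:])]
    return float(np.mean(diffs)) > TRIGGER_THRESH

def stream_reminders(video_path: str, vlm_call):
    """Read a video stream and query the VLM only when the trigger fires."""
    cap = cv2.VideoCapture(video_path)
    history: list[np.ndarray] = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        history = (history + [frame])[-HISTORY:]
        if should_trigger(history):
            # vlm_call: user-supplied function that returns a short reminder string
            print(vlm_call(history))
    cap.release()
```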
Fig. 1 — Visualization results of the WAD dataset sorted by region. The WAD dataset has a wide range of sources, and the categories shown are randomly sampled from the dataset. The pie chart in the lower-left corner shows the proportion of video length from different regions.
Fig. 2 — An overview of the proposed WalkVLM framework. WalkVLM employs CoT-based hierarchical planning to summarize the static attributes and understanding of scenes, thereby facilitating the subsequent reminder and QA tasks. Furthermore, temporal-aware adaptive prediction is proposed to determine the trigger state of the VLM, reducing the temporal redundancy of outputs.
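As an illustration of the CoT-based hierarchical planning described above, the snippet below sketches a two-stage prompting flow: the first query summarizes static scene attributes, and the second conditions a short reminder on that summary. The prompt wording and the `query_vlm` helper are placeholders, not the prompts or APIs used in WalkVLM-LR.

```python
# Two-stage (hierarchical) prompting sketch: summarize the scene first, then
# ask for a concise reminder grounded in that summary. Both prompts are
# illustrative assumptions.

STAGE1_PROMPT = (
    "List the static attributes of this street scene relevant to a blind "
    "pedestrian: surface type, obstacles, crossings, moving traffic."
)
STAGE2_PROMPT = (
    "Scene summary: {summary}\n"
    "Give one concise walking reminder (under 15 words) for the pedestrian."
)

def hierarchical_reminder(frames, query_vlm):
    """query_vlm(frames, prompt) -> str is assumed to wrap any multimodal model."""
    summary = query_vlm(frames, STAGE1_PROMPT)                        # stage 1: scene-level summary
    return query_vlm(frames, STAGE2_PROMPT.format(summary=summary))   # stage 2: short reminder
```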
.
├── wad_dataset           # Dataset used in WalkVLM
└── WalkVLM-LR            # WalkVLM reasoning code
    ├── checkpoint        # Pretrained weights
    ├── vlm_grpo_template # GRPO training template
    ├── EAD.py            # EAD model code
    ├── GPTScore.py       # GPTScore calculation code
    ├── inference.py      # Inference script
    ├── test.py           # Testing script
    └── train_EAD.py      # EAD model training script
If you find this code or dataset helpful, please cite:
Z. Yuan et al., "WalkVLM: Aid Visually Impaired People Walking by Vision Language Model," arXiv preprint arXiv:2412.20903, Dec. 30, 2024.