
WalkVLM: Aid Visually Impaired People Walking by Vision-Language Model

📄 View the Paper

🚀 Introduction

Approximately 200 million individuals worldwide suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to provide walking assistance.

With the recent progress of vision-language models (VLMs), applying them to walking guidance has become increasingly popular. However:

  • Existing methods rely mainly on self-curated QA datasets that are not publicly accessible, so the task lacks a standardized benchmark.
  • Walking assistance often requires real-time video analysis and concise, informative reminders, but current VLMs struggle with this due to their long responses and low inference efficiency.

Our contributions:

  1. We introduce the first large-scale walking assistance dataset, comprising 12,000 video–annotation pairs, serving as a unified benchmark for training and evaluation.
  2. We propose WalkVLM, which:
    • Employs chain-of-thought hierarchical planning to generate concise but informative reminders.
    • Utilizes temporal-aware adaptive prediction to reduce redundancy in reminders (see the sketch after this list).
  3. We establish a solid benchmark for the blind walking task and verify the advantages of WalkVLM in streaming video processing compared to other VLMs.
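To make the temporal-aware adaptive prediction idea concrete, below is a minimal Python sketch of a trigger-state gate over a video stream. It is an illustration only, not the code in this repository: the class name AdaptiveTrigger, the per-frame danger_score input, and the threshold/cooldown values are all assumptions, and WalkVLM predicts its trigger state from learned temporal features rather than a fixed rule.

from collections import deque

class AdaptiveTrigger:
    """Illustrative gate deciding when the VLM should emit a reminder.

    Not the WalkVLM implementation: it replaces the learned trigger state
    with a moving average of per-frame danger scores plus a cooldown, just
    to show how redundant reminders can be suppressed on a stream.
    """

    def __init__(self, window=8, fire_thresh=0.6, cooldown=15):
        self.scores = deque(maxlen=window)  # recent per-frame danger scores in [0, 1]
        self.fire_thresh = fire_thresh      # mean score required to call the VLM
        self.cooldown = cooldown            # minimum number of frames between reminders
        self.since_last = cooldown          # frames elapsed since the last reminder

    def update(self, danger_score: float) -> bool:
        """Return True if the VLM should be invoked for the current frame."""
        self.scores.append(danger_score)
        self.since_last += 1
        mean_score = sum(self.scores) / len(self.scores)
        if mean_score >= self.fire_thresh and self.since_last >= self.cooldown:
            self.since_last = 0
            return True
        return False

if __name__ == "__main__":
    gate = AdaptiveTrigger(window=3, fire_thresh=0.6, cooldown=4)
    stream = [0.1, 0.2, 0.7, 0.8, 0.9, 0.9, 0.3, 0.2, 0.1]  # synthetic danger scores
    for t, score in enumerate(stream):
        if gate.update(score):
            print(f"frame {t}: trigger VLM for a concise reminder")

Run as-is, the gate fires once when the synthetic risk is sustained and stays silent otherwise, which is the effect the adaptive prediction module aims for on real streams.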

🛰️ Method & Dataset

Fig. 1 — Visualization of the WAD dataset by region. The WAD dataset draws on a wide range of sources; the categories shown are randomly sampled from the dataset, and the pie chart in the lower-left corner shows the proportion of video length contributed by each region.

Fig. 2 — Overview of the proposed WalkVLM framework. WalkVLM employs CoT-based hierarchical planning to summarize static scene attributes and scene understanding, which facilitates the subsequent reminder and QA tasks. A temporal-aware adaptive prediction module then estimates the trigger state of the VLM, reducing the temporal redundancy of its outputs.
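The hierarchical planning stage in Fig. 2 can be pictured as a two-step prompting pipeline: first summarize the static scene, then condition the reminder on that summary. The sketch below is a hypothetical illustration of that flow; vlm_generate is a placeholder for whatever VLM backend is used, and the prompt wording is invented rather than taken from this repository.

def vlm_generate(prompt: str, frames) -> str:
    """Placeholder for a call into the underlying vision-language model."""
    raise NotImplementedError("wire this to your VLM backend")

def hierarchical_reminder(frames) -> str:
    # Step 1: ask the VLM for static scene attributes (obstacles, signals, terrain).
    scene_summary = vlm_generate(
        "Summarize the static attributes of this walking scene: obstacles, "
        "traffic signals, terrain, and nearby pedestrians.",
        frames,
    )
    # Step 2: condition on that summary to produce one short, actionable reminder.
    return vlm_generate(
        "Scene summary:\n"
        + scene_summary
        + "\nGive the visually impaired walker one concise reminder (one short sentence).",
        frames,
    )

Splitting the task this way is what lets the second call stay short: the verbose scene description is produced once in step 1, and the reminder prompt only has to compress it.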

📂 Code structure

.
├── wad_dataset               # Dataset used in WalkVLM
└── WalkVLM-LR                # WalkVLM reasoning code
    ├── checkpoint            # Pretrained weights
    ├── vlm_grpo_template     # GRPO training template
    ├── EAD.py                # EAD model code
    ├── GPTScore.py           # GPTScore calculation code
    ├── inference.py          # Inference script
    ├── test.py               # Testing script
    └── train_EAD.py          # EAD model training script

📖 Citation

If you find this code or dataset helpful, please cite:

Z. Yuan et al., "WalkVLM: Aid Visually Impaired People Walking by Vision Language Model," arXiv preprint arXiv:2412.20903, Dec. 30, 2024.
