This repository is the official implementation of the ICLR 2025 paper "General Scene Adaptation for Vision-and-Language Navigation".
Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in persistent environments with relatively consistent physical layouts, visual observations, and language styles from instructors. Such a gap in the task setting presents an opportunity to improve VLN agents by incorporating continuous adaptation to specific environments.

To better reflect these real-world conditions, we introduce GSA-VLN (General Scene Adaptation for VLN), a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. To evaluate the proposed task, one has to address two challenges in existing VLN datasets: the lack of out-of-distribution (OOD) data, and the limited number and style diversity of instructions for each scene. Therefore, we propose a new dataset, GSA-R2R, which significantly expands the diversity and quantity of environments and instructions for the Room-to-Room (R2R) dataset to evaluate agent adaptability in both ID and OOD contexts. Furthermore, we design a three-stage instruction orchestration pipeline that leverages large language models (LLMs) to refine speaker-generated instructions and apply role-playing techniques to rephrase instructions into different speaking styles. This is motivated by the observation that each individual user often has consistent signatures or preferences in their instructions, taking the use case of home robotic assistants as an example.

We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods, revealing key factors enabling agents to adapt to specific environments. Based on our findings, we propose a novel method, Graph-Retained DUET (GR-DUET), which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits.
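As a rough illustration of the memory-based navigation graph idea, the sketch below keeps one topological graph per scene and lets it persist across episodes, so observations gathered while executing earlier instructions remain available to later ones. This is an illustrative sketch only, not the actual GR-DUET implementation; all class and method names are hypothetical.

```python
# Illustrative sketch only -- not the GR-DUET implementation.
# Idea: a scene-level graph memory that persists across episodes, so the agent
# keeps benefiting from earlier exploration of the same environment.

from collections import defaultdict


class SceneGraphMemory:
    """Hypothetical persistent topological memory for a single scene."""

    def __init__(self):
        self.node_features = {}          # viewpoint id -> latest visual feature
        self.edges = defaultdict(set)    # viewpoint id -> neighbouring viewpoint ids

    def update(self, viewpoint_id, feature, neighbours):
        """Record the current panorama and its connectivity."""
        self.node_features[viewpoint_id] = feature
        for nb in neighbours:
            self.edges[viewpoint_id].add(nb)
            self.edges[nb].add(viewpoint_id)

    def candidate_nodes(self):
        """All nodes the agent knows about, e.g. for global action prediction."""
        return list(self.node_features.keys())


# Usage: keep one memory per scene and reuse it across every episode in that scene.
memories = defaultdict(SceneGraphMemory)
# memories[scan_id].update(viewpoint_id, feature, neighbours)
```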
1. Follow DUET to install the Matterport3D Simulator and download the required data.
2. Download the preprocessed features and connectivity graphs from ScaleVLN.
3. Download the instructions of the GSA-R2R dataset from here (a loading sketch follows this list).
4. [optional] Download the pretrained and fine-tuned checkpoints for GSA-R2R.
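As a quick sanity check after downloading, the following sketch loads one of the GSA-R2R annotation files. It assumes the annotations follow the R2R-style JSON format with scan, path, and instructions fields; the file path and field names are assumptions, so adjust them to the files you actually downloaded.

```python
# Sanity check of a downloaded GSA-R2R annotation file.
# Assumption: an R2R-style JSON list with "scan", "path" and "instructions" fields;
# the path below is hypothetical -- replace it with your actual file.

import json

with open("datasets/GSA-R2R/annotations/val_unseen.json") as f:
    data = json.load(f)

print(f"{len(data)} episodes loaded")
sample = data[0]
print("scan:", sample.get("scan"))
print("path length:", len(sample.get("path", [])))
print("first instruction:", sample.get("instructions", ["<missing>"])[0])
```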
Run the pretraining code with the ground-truth navigation graph
cd GR-DUET
cd pretrain_src
bash run_r2r.sh
Note: The pretraining requires a pano_inputs.h5 file in which we cache all the inputs to speed up training. You can either run get_pano_inputs.py to generate it or download it from here.
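To verify the generated or downloaded file, a minimal h5py inspection such as the one below can help. The internal key layout is not assumed; the snippet simply lists the datasets it finds.

```python
# Minimal sanity check for pano_inputs.h5 (no assumption about its key layout).
import h5py

with h5py.File("pano_inputs.h5", "r") as f:
    def show(name, obj):
        # Print every dataset name with its shape and dtype as stored in the file.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```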
Fine-tune and evaluate the model
cd map_nav_src
bash scripts/run_gsa_r2r.sh
If you find this work helpful, please consider citing:
@inproceedings{hong2025general,
  title={General Scene Adaptation for Vision-and-Language Navigation},
  author={Haodong Hong and Yanyuan Qiao and Sen Wang and Jiajun Liu and Qi Wu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}