
ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation



ActiveVLN is a Vision-and-Language Navigation (VLN) framework designed to enable active exploration through multi-turn reinforcement learning. Unlike traditional VLN methods, which rely on imitation learning and expert trajectories, ActiveVLN empowers the navigation agent to engage in open-ended exploration, improving its ability to discover diverse and plausible navigation paths.

ActiveVLN follows a two-stage process:

  1. Initial Stage: The agent is bootstrapped using a small set of expert trajectories via imitation learning.
  2. Active Exploration: In the second stage, the agent iteratively predicts and executes actions while collecting diverse navigation trajectories. These trajectories are then optimized with the GRPO objective, using a dynamic early-stopping strategy to prune unpromising paths (see the sketch after this list).
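
Below is a minimal Python sketch of the stage-2 rollout loop. The environment client, policy wrapper, and pruning criterion here are illustrative assumptions, not the actual ActiveVLN API:

PRUNE_THRESHOLD = 15.0  # illustrative distance threshold (meters), not the paper's value

def collect_group(env, policy, instruction, group_size=8, max_turns=20):
    """Roll out a group of trajectories for one instruction, GRPO-style."""
    trajectories = []
    for _ in range(group_size):
        obs = env.reset(instruction)  # new episode, same instruction
        traj = []
        for _ in range(max_turns):
            action = policy.predict(instruction, obs)  # multi-turn prediction
            obs, done, info = env.step(action)
            traj.append((obs, action))
            if done:
                break
            # Dynamic early stopping: prune rollouts that drift too far from
            # the goal. The exact criterion used by ActiveVLN is not shown here.
            if info.get("distance_to_goal", 0.0) > PRUNE_THRESHOLD:
                break
        trajectories.append(traj)
    return trajectories  # these groups are then scored and optimized with GRPO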

Overview

πŸ“ TODO List

  • βœ… Submit the arXiv paper (Sep. 12, 2025)
  • βœ… Publish model checkpoints
  • βœ… Release evaluation scripts
  • βœ… Release training code and data
  • βœ… Release environment server code

Quick Start

ActiveVLN separates the training code from the VLN-CE environment code to improve efficiency and resource usage. The training machine and the VLN-CE environment server communicate via HTTP, so you need to set up two separate environments: one for the training code and one for the environment server. If resources (CPU and GPU memory) permit, both can run on the same machine.
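
Conceptually, the client-server split looks like the sketch below. The endpoint names and payloads are hypothetical, purely for illustration; the real interface is provided by the vlnce_server client package:

import requests

SERVER = "http://localhost:5001"  # environment server (see the launch step below)

# These routes are illustrative stand-ins, not the actual vlnce_server API.
obs = requests.post(f"{SERVER}/reset", json={"episode_id": 0}).json()
result = requests.post(f"{SERVER}/step", json={"action": "MOVE_FORWARD"}).json()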

Setup the VLN-CE Environment

# Create an environment for VLN-CE server
conda create -n server python=3.10 -y
conda activate server

mkdir nav_ws
cd nav_ws

# Install habitat-sim v0.1.7 (building from source)
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
pip install -r requirements.txt
python setup.py install --headless

# The following versions are specified to prevent a known segmentation fault
pip install numpy==1.23.0 numba==0.60.0

# Test habitat-sim installation by running an example
python examples/example.py

# Install habitat-lab v0.1.7 (building from source)
cd ..
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -r requirements.txt
pip install -r habitat_baselines/rl/requirements.txt
pip install -r habitat_baselines/rl/ddppo/requirements.txt
pip install dtw fastdtw gym==0.17.3
python setup.py develop --all

# The following versions are specified to prevent a known segmentation fault
pip install numpy==1.23.0 numba==0.60.0

# Install vlnce_server package
cd ..
git clone https://github.com/arvillion/ActiveVLN
cd ActiveVLN
pip install -e ./vlnce_server

Setup the Training Environment

# Create an environment for RL training
conda create -n train python=3.10 -y
conda activate train

git clone https://github.com/arvillion/ActiveVLN
cd ActiveVLN

# Follow the VeRL official installation procedure
pip install --no-deps -e .

# Install additional dependencies
bash scripts/install_activevln.sh

# Install the vlnce_server package for the client interface
pip install -e ./vlnce_server
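
As a quick sanity check that the editable install succeeded, the client package should be importable from the activated environment (it is the same package later run as python3 -m vlnce_server.server):

# Run inside the activated environment to confirm the install.
import vlnce_server
print(vlnce_server.__file__)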

Data Preparation

On the environment server side, you must prepare two types of data to use the simulator:

  1. Scene Dataset: Download the Matterport3D dataset and extract it to data/scene_datasets/mp3d/.

  2. VLN-CE Episodes: These contain instructions and trajectory annotations. Download and extract them into the data/datasets/ directory.

Your final folder structure should look like this:

data/
β”œβ”€β”€ datasets/
β”‚   β”œβ”€β”€ R2R_VLNCE_v1-3_preprocessed/
β”‚   β”‚   β”œβ”€β”€ train/
β”‚   β”‚   β”œβ”€β”€ val_seen/
β”‚   β”‚   β”‚   └── val_seen.json.gz
β”‚   β”‚   └── val_unseen/
β”‚   β”‚       └── val_unseen.json.gz
β”‚   β”œβ”€β”€ RxR_VLNCE_v0/
β”‚   β”‚   β”œβ”€β”€ train/
β”‚   β”‚   β”œβ”€β”€ val_seen/
β”‚   β”‚   β”‚   β”œβ”€β”€ val_seen_guide.json.gz
β”‚   β”‚   β”‚   └── ...
β”‚   β”‚   └── val_unseen/
β”‚   β”‚       β”œβ”€β”€ val_unseen_guide.json.gz
β”‚   β”‚       └── ...
└── scene_datasets/
    └── mp3d/
        β”œβ”€β”€ 17DRP5sb8fy/
        β”œβ”€β”€ 1LXtFkjw3qL/
        └── ...

On the training side, the required data (data/r2r_4000_train.parquet and data/rxr_4000_train.parquet) is already included in this repository, containing 4000 trajectories each from R2R and RxR datasets for RL training.
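
To confirm the training data is readable, you can inspect the parquet files with pandas; the exact column layout follows VeRL's dataset format and is not documented here:

import pandas as pd

# Peek at the bundled RL training data (4000 trajectories per dataset).
df = pd.read_parquet("data/r2r_4000_train.parquet")
print(df.shape)             # expected: 4000 rows
print(df.columns.tolist())  # columns follow the VeRL parquet format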

Launch the Environment Server

conda activate server

python3 -m vlnce_server.server \
    server.host=0.0.0.0 \
    server.port=5001 \
    vlnce.gpus=[0,1] \
    vlnce.r2r_gpu_plan=[16,16] \
    vlnce.rxr_gpu_plan=[0,0]

  • vlnce.gpus specifies the GPUs on which to launch the simulators.
  • vlnce.r2r_gpu_plan and vlnce.rxr_gpu_plan define the simulator allocation on each GPU.

The example configuration above launches 32 R2R simulators in parallel across GPU 0 and GPU 1, with 16 simulators allocated to each.
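
Before launching training, it is worth confirming that the server is reachable from the training machine. A plain TCP check avoids assuming any particular HTTP route:

import socket

# Host and port must match the server.host / server.port launch arguments.
with socket.create_connection(("127.0.0.1", 5001), timeout=5):
    print("environment server is reachable on port 5001")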

Start RL Training

This repository provides the code for RL training, and we offer a fine-tuned SFT checkpoint to serve as a starting point for RL. If you wish to perform SFT yourself, please refer to the For SFT Reference section.

The training code and configuration are designed to run on 4x NVIDIA L40S GPUs (48 GB each) with 512 GB of CPU RAM. The environment server runs 32 VLN-CE simulators in parallel.

Step 1: Prepare the post-SFT model. RL training builds upon an SFT checkpoint: you can either download our pre-trained checkpoints or fine-tune one yourself (see the For SFT Reference section). We provide two checkpoints for RL training, on R2R and RxR respectively.

Step 2: Start training. Run one of the following scripts to begin training on your chosen dataset:

# Training on the R2R dataset
bash examples/vlnce/train_r2r.sh

# Training on the RxR dataset
bash examples/vlnce/train_rxr.sh

Checkpoints

We provide several checkpoints.

For SFT Reference

We do not provide a codebase for SFT. However, if you wish to create your own base model for RL, keep the following in mind:

Ensure you use the appropriate system prompt. Any standard framework, such as transformers or LLaMA-Factory, can be used to run SFT.

Evaluation

We provide an evaluation script based on vLLM for fast inference. To begin, start the model-serving endpoint as follows:

vllm serve $MODEL_PATH --task generate \
  --trust-remote-code \
  --limit-mm-per-prompt image=200,video=0 \
  --mm_processor_kwargs '{"max_pixels": 80000}' \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --disable-log-requests \
  --port 8003

Replace $MODEL_PATH with the path to the model you wish to evaluate.
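
Before running the full evaluation, you can smoke-test the endpoint with the OpenAI-compatible client; vLLM registers the served model under the value of $MODEL_PATH:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; any non-empty key is accepted.
client = OpenAI(base_url="http://127.0.0.1:8003/v1", api_key="EMPTY")

model = client.models.list().data[0].id  # the served model, i.e. $MODEL_PATH
resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=16,
)
print(resp.choices[0].message.content)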

Once the inference service is running, you can initiate the evaluation using examples/vlnce/eval_r2r.sh or examples/vlnce/eval_rxr.sh. Before running, you must modify the essential variables in the script. For example, examples/vlnce/eval_r2r.sh looks like this:

SAVE_PATH=eval_results/r2r/example
CHUNKS=32
CONFIG_PATH="vlnce_server/VLN_CE/vlnce_baselines/config/r2r_baselines/activevln_r2r_test.yaml"

export OPENAI_API_KEY="EMPTY"
export OPENAI_API_BASE="http://127.0.0.1:8003/v1"
export PYTHONPATH=$PYTHONPATH:"$(pwd)/vlnce_server"

python3 eval/vlnce/eval_vlnce.py \
    --exp-config $CONFIG_PATH \
    --split-num $CHUNKS \
    --result-path $SAVE_PATH

# Score calculation
python3 eval/vlnce/analyze_results.py \
    --path $SAVE_PATH

Make sure to edit the following:

  • OPENAI_API_BASE and OPENAI_API_KEY should match your model serving configuration.
  • Set SAVE_PATH to the desired location where you want to save the evaluation results.
  • (Optional) CHUNKS defines the number of simulators that will run in parallel.

Acknowledgements

We would like to thank the authors of verl and DeepEyes, upon whose work our code builds. Special thanks to VAGEN, which inspired the logic for our environment server.

Citation

If you find our work useful, please consider starring this repo 🌟 and citing our paper:

@article{zhang2025activevln,
  title={ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation},
  author={Zhang, Zekai and Zhu, Weiye and Pan, Hewei and Wang, Xiangchen and Xu, Rongtao and Sun, Xing and Zheng, Feng},
  journal={arXiv preprint arXiv:2509.12618},
  year={2025}
}
