We introduce InternVLA-N1, the first open dual-system vision-language navigation foundation model. Unlike previous navigation foundation models, which can only take short-term actions from a limited discrete action space, InternVLA-N1 decouples the task into pixel-goal planning with System 2 and agile execution with System 1. A two-stage curriculum training paradigm is devised for this framework: first, the two systems are pretrained with explicit pixel goals as supervision or condition; subsequently, we freeze System 2 and finetune the newly added latent plans together with System 1 in an asynchronous, end-to-end manner. This paradigm, which relies on latent plans as the intermediate representation, removes the ambiguity of pixel-goal planning and opens new potential for extending pretraining with video prediction. To enable scalable training, we develop an efficient navigation data generation pipeline and introduce InternData-N1, the largest navigation dataset to date. InternData-N1 comprises over 50 million egocentric images collected from more than 3,000 scenes, amounting to 4,839 kilometers of robot navigation experience. We evaluate InternVLA-N1 across 6 challenging navigation benchmarks, where it consistently achieves state-of-the-art performance, with improvements ranging from 3% to 28%. In particular, it demonstrates the synergistic integration of long-horizon planning (>150 m) and real-time decision-making (>30 Hz) capabilities and generalizes zero-shot across diverse embodiments (wheeled, quadruped, humanoid) and in-the-wild environments. All code, models, and datasets are publicly available.
InternVLA-N1 adopts a compositional architecture featuring a dual-system design that synergistically combines high-level instruction interpretation with low-level action execution. Specifically, it integrates a System 2 planner, which interprets language instructions and visual observations to produce mid-term pixel goals or latent plans, with a System 1 policy that executes agile, high-frequency actions toward those goals, as sketched below.
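Conceptually, the asynchronous interplay between the two systems can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class names (`DualSystemAgent`) and the `plan`/`act` methods of the planner and policy are hypothetical placeholders. System 2 refreshes a mid-term plan at a low rate, while System 1 consumes the latest plan at control rate (>30 Hz).

```python
# Minimal sketch of a dual-system agent loop. All names here are
# illustrative placeholders, not the InternVLA-N1 API.
import threading
import time


class DualSystemAgent:
    def __init__(self, planner, policy, plan_hz=1.0, control_hz=30.0):
        self.planner = planner            # System 2: VLM-based pixel-goal / latent planner
        self.policy = policy              # System 1: high-frequency action policy
        self.plan_period = 1.0 / plan_hz
        self.control_period = 1.0 / control_hz
        self.latest_plan = None           # shared mid-term goal (pixel goal or latent plan)
        self._lock = threading.Lock()

    def _planning_loop(self, instruction, get_observation):
        # System 2 runs slowly and asynchronously, refreshing the mid-term goal.
        while True:
            obs = get_observation()
            plan = self.planner.plan(instruction, obs)
            with self._lock:
                self.latest_plan = plan
            time.sleep(self.plan_period)

    def run(self, instruction, get_observation, send_action):
        threading.Thread(
            target=self._planning_loop,
            args=(instruction, get_observation),
            daemon=True,
        ).start()
        # System 1 runs at control rate, conditioned on the latest available plan.
        while True:
            obs = get_observation()
            with self._lock:
                plan = self.latest_plan
            if plan is not None:
                send_action(self.policy.act(obs, plan))
            time.sleep(self.control_period)
```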
To fully unlock open-world generalization and asynchronous inference in the dual-system architecture, we design a curriculum training scheme. Initially, each system is trained separately to acquire basic navigation skills, using explicit pixel goals in a synchronized setting. A joint fine-tuning phase is then introduced, in which we incorporate learnable tokens into System 2 as implicit mid-term goals to reduce the ambiguity of pixel-based targets.
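A minimal PyTorch sketch of this joint fine-tuning stage is given below, assuming that the hidden states of a few learnable query tokens appended to the frozen System 2 serve as the latent plan that conditions System 1. The module names, the `encode_with_queries` method, and all sizes are illustrative assumptions, not the actual InternVLA-N1 configuration.

```python
# Illustrative sketch: frozen System 2 + learnable plan tokens + trainable System 1.
# Names and dimensions are placeholders, not the released model definition.
import torch
import torch.nn as nn


class LatentPlanBridge(nn.Module):
    def __init__(self, system2, system1, num_plan_tokens=4, hidden_dim=1024):
        super().__init__()
        self.system2 = system2                      # frozen VLM planner
        for p in self.system2.parameters():
            p.requires_grad_(False)
        self.plan_queries = nn.Parameter(           # newly added learnable tokens
            torch.randn(num_plan_tokens, hidden_dim) * 0.02
        )
        self.system1 = system1                      # trainable action policy

    def forward(self, rgb, instruction_tokens, proprio):
        # System 2 attends to the observation/instruction plus the plan queries;
        # the hidden states at the query positions become the implicit mid-term goal.
        latent_plan = self.system2.encode_with_queries(
            rgb, instruction_tokens, self.plan_queries
        )
        # System 1 predicts short-horizon actions conditioned on the latent plan.
        return self.system1(proprio, rgb, latent_plan)
```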
For the navigation task, most real-world datasets are constrained in scene diversity and scale, while Internet video datasets suffer from imprecise localization and mapping information, which limits their reliability for trajectory prediction. In contrast, we propose efficient pipelines for generating navigation data in simulation, aiming to facilitate scalable training.
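The sketch below illustrates the general shape of one episode-generation step in such a simulation pipeline: sample a sufficiently distant start/goal pair, plan a path, and render egocentric frames along it. The `sim` interface (`sample_navigable_point`, `shortest_path`, `set_agent_pose`, `render`) is assumed for illustration and does not correspond to our released pipeline or to any specific simulator API.

```python
# Illustrative episode-generation step for simulation-based navigation data.
# The `sim` object and its methods are assumptions made for this sketch.
def generate_episode(sim, scene_id, min_geodesic=5.0, step_size=0.25):
    # Sample start and goal until the path is long enough to be informative.
    while True:
        start = sim.sample_navigable_point()
        goal = sim.sample_navigable_point()
        path = sim.shortest_path(start, goal)
        if path is not None and path.length >= min_geodesic:
            break

    frames, poses = [], []
    for pose in path.interpolate(step_size):      # waypoints along the planned path
        sim.set_agent_pose(pose)
        frames.append(sim.render("rgb"))          # egocentric RGB observation
        poses.append(pose)

    return {
        "scene_id": scene_id,
        "start": start,
        "goal": goal,
        "poses": poses,
        "num_frames": len(frames),
    }
```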
Instruction: Walk through the tables and chairs, turn left into the café, and stop at the coffee counter.
Instruction: Walk through the bamboo forest and turn left. Reach a man sitting on the sofa.
Instruction: Walk straight ahead, turn right after seeing the billiard table, and enter the room with shelves.
Instruction: Enter the office area. Immediately turn left and continue straight. Then make a right turn. Stop at the workstation near the whiteboard.
Instruction: Walk out of this house and find a trash bin.
Instruction: Walk towards the orange coffee sculpture and go upstairs. Then keep walking straight and turn right at the end. After that, turn left and walk to the man with a black umbrella. Stop in front of the doors with red handles.
Qualitative comparison: NaVILA vs. StreamVLN vs. InternVLA-N1.
Our RGB-only variant outperforms all previous RGB-based methods on the R2R Val-Unseen benchmark, achieving a success rate (SR) of 55.4% and an SPL of 52.1% and surpassing previous state-of-the-art models such as NaVILA and MapNav.
Our System 1 model showcases several distinctive capabilities, substantially outperforming baseline methods: 2.7x better performance on no-goal exploration tasks, a 10.9% higher success rate in complex indoor path planning, and 27.1% better performance in image-driven exploration by adaptively balancing exploration and exploitation.
Visualization results of our dual-system model on the VLN-CE benchmark (Habitat simulator) are shown below. Green curves denote ground-truth trajectories.
To narrow the sim-to-real gap, we further evaluate our dual system on VLN-PE with the physics simulation of InternUTopia on a Unitree H1 robot.
Evaluations of our System 1 model in more diverse and cluttered environments are shown below. Our System 1 supports different input patterns.
@article{wang2025internvla,
title = {InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans},
author = {Intern Robotics},
journal = {arXiv preprint},
year = {2025},
}