We introduce InternVLA-N1, the first open dual-system vision-language navigation foundation model. Unlike previous navigation foundation models, which can only take short-term actions from a limited discrete action space, InternVLA-N1 decouples the task into pixel-goal planning with System 2 and agile execution with System 1. A two-stage curriculum training paradigm is devised for this framework: first, the two systems are pretrained with explicit pixel goals as supervision or condition; subsequently, we freeze System 2 and finetune the newly added latent plans together with System 1 in an asynchronous, end-to-end manner. This paradigm, which relies on latent plans as the intermediate representation, removes the ambiguity of pixel-goal planning and opens new potential for extending pretraining with video prediction. To enable scalable training, we develop an efficient navigation data generation pipeline and introduce InternData-N1, the largest navigation dataset to date. InternData-N1 comprises over 50 million egocentric images collected from more than 3,000 scenes, amounting to 4,839 kilometers of robot navigation experience. We evaluate InternVLA-N1 across 6 challenging navigation benchmarks, where it consistently achieves state-of-the-art performance, with improvements ranging from 3% to 28%. In particular, it demonstrates the synergistic integration of long-horizon planning (>150 m) and real-time decision-making (>30 Hz) capabilities and generalizes zero-shot across diverse embodiments (wheeled, quadruped, humanoid) and in-the-wild environments. All code, models, and datasets are publicly available.
InternVLA-N1 adopts a compositional architecture featuring a dual-system design that synergistically combines high-level instruction interpretation with low-level action execution. Specifically, it integrates a System 2 planner, which interprets language instructions and visual observations to produce mid-term pixel goals or latent plans, with a System 1 policy that executes agile, high-frequency actions toward those goals, as sketched below.
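Conceptually, the asynchronous interplay between the two systems can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class names (`DualSystemAgent`) and the `plan`/`act` methods of the planner and policy are hypothetical placeholders. System 2 refreshes a mid-term plan at a low rate, while System 1 consumes the latest plan at control rate (>30 Hz).

```python
# Minimal sketch of a dual-system agent loop. All names here are
# illustrative placeholders, not the InternVLA-N1 API.
import threading
import time


class DualSystemAgent:
    def __init__(self, planner, policy, plan_hz=1.0, control_hz=30.0):
        self.planner = planner            # System 2: VLM-based pixel-goal / latent planner
        self.policy = policy              # System 1: high-frequency action policy
        self.plan_period = 1.0 / plan_hz
        self.control_period = 1.0 / control_hz
        self.latest_plan = None           # shared mid-term goal (pixel goal or latent plan)
        self._lock = threading.Lock()

    def _planning_loop(self, instruction, get_observation):
        # System 2 runs slowly and asynchronously, refreshing the mid-term goal.
        while True:
            obs = get_observation()
            plan = self.planner.plan(instruction, obs)
            with self._lock:
                self.latest_plan = plan
            time.sleep(self.plan_period)

    def run(self, instruction, get_observation, send_action):
        threading.Thread(
            target=self._planning_loop,
            args=(instruction, get_observation),
            daemon=True,
        ).start()
        # System 1 runs at control rate, conditioned on the latest available plan.
        while True:
            obs = get_observation()
            with self._lock:
                plan = self.latest_plan
            if plan is not None:
                send_action(self.policy.act(obs, plan))
            time.sleep(self.control_period)
```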
To fully unlock open-world generalization and asynchronous inference in the dual-system architecture, we design a curriculum training scheme. Initially, each system is trained separately to acquire basic navigation skills, using explicit pixel goals in a synchronized setting. A joint fine-tuning phase is then introduced, in which we incorporate learnable tokens into System 2 as implicit mid-term goals to reduce the ambiguity of pixel-based targets.
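A minimal PyTorch sketch of this joint fine-tuning stage is given below, assuming that the hidden states of a few learnable query tokens appended to the frozen System 2 serve as the latent plan that conditions System 1. The module names, the `encode_with_queries` method, and all sizes are illustrative assumptions, not the actual InternVLA-N1 configuration.

```python
# Illustrative sketch: frozen System 2 + learnable plan tokens + trainable System 1.
# Names and dimensions are placeholders, not the released model definition.
import torch
import torch.nn as nn


class LatentPlanBridge(nn.Module):
    def __init__(self, system2, system1, num_plan_tokens=4, hidden_dim=1024):
        super().__init__()
        self.system2 = system2                      # frozen VLM planner
        for p in self.system2.parameters():
            p.requires_grad_(False)
        self.plan_queries = nn.Parameter(           # newly added learnable tokens
            torch.randn(num_plan_tokens, hidden_dim) * 0.02
        )
        self.system1 = system1                      # trainable action policy

    def forward(self, rgb, instruction_tokens, proprio):
        # System 2 attends to the observation/instruction plus the plan queries;
        # the hidden states at the query positions become the implicit mid-term goal.
        latent_plan = self.system2.encode_with_queries(
            rgb, instruction_tokens, self.plan_queries
        )
        # System 1 predicts short-horizon actions conditioned on the latent plan.
        return self.system1(proprio, rgb, latent_plan)
```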
For the navigation task, most real-world datasets are constrained in scene diversity and scale, while Internet video datasets suffer from imprecise localization and mapping information, which limits their reliability for trajectory prediction. In contrast, we propose efficient pipelines for generating navigation data in simulation, aiming to facilitate scalable training.
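The sketch below illustrates the general shape of one episode-generation step in such a simulation pipeline: sample a sufficiently distant start/goal pair, plan a path, and render egocentric frames along it. The `sim` interface (`sample_navigable_point`, `shortest_path`, `set_agent_pose`, `render`) is assumed for illustration and does not correspond to our released pipeline or to any specific simulator API.

```python
# Illustrative episode-generation step for simulation-based navigation data.
# The `sim` object and its methods are assumptions made for this sketch.
def generate_episode(sim, scene_id, min_geodesic=5.0, step_size=0.25):
    # Sample start and goal until the path is long enough to be informative.
    while True:
        start = sim.sample_navigable_point()
        goal = sim.sample_navigable_point()
        path = sim.shortest_path(start, goal)
        if path is not None and path.length >= min_geodesic:
            break

    frames, poses = [], []
    for pose in path.interpolate(step_size):      # waypoints along the planned path
        sim.set_agent_pose(pose)
        frames.append(sim.render("rgb"))          # egocentric RGB observation
        poses.append(pose)

    return {
        "scene_id": scene_id,
        "start": start,
        "goal": goal,
        "poses": poses,
        "num_frames": len(frames),
    }
```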
Instruction: Walk through the tables and chairs, turn left into the café, and stop at the coffee counter.
Instruction: Walk through the bamboo forest and turn left. Reach a man sitting on the sofa.
Instruction: Walk straight ahead, turn right after seeing the billiard table, and enter the room with shelves.
Instruction: Enter the office area. Immediately turn left and continue straight. Then make a right turn. Stop at the workstation near the whiteboard.
Instruction: Walk out of this house and find a trash bin.
Instruction: Walk towards the orange coffee sculpture and go upstairs. Then keep walking straight and turn right at the end. After that, turn left and walk to the man with a black umbrella. Stop in front of the doors with red handles.
Qualitative comparison: NaVILA vs. StreamVLN vs. InternVLA-N1.
Our RGB-only variant outperforms all previous RGB-based methods on the R2R Val-Unseen benchmark, achieving a success rate (SR) of 55.4% and an SPL of 52.1% and surpassing previous state-of-the-art models such as NaVILA and MapNav.
Our System 1 model showcases several distinctive capabilities, substantially outperforming baseline methods: 2.7x better performance on no-goal exploration tasks, a 10.9% higher success rate in complex indoor path planning, and 27.1% better performance in image-driven exploration by adaptively balancing exploration and exploitation.
Visualization results of our dual-system model on the VLN-CE benchmark (Habitat simulator) are shown below. Green curves denote ground-truth trajectories.
To narrow the sim-to-real gap, we further evaluate our dual system on VLN-PE with the physics simulation of InternUTopia on a Unitree H1 robot.
Evaluations of our System 1 model in more diverse and cluttered environments are shown below. Our System 1 supports different input patterns.
@article{wang2025internvla,
title = {InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans},
author = {Intern Robotics},
journal = {arXiv preprint},
year = {2025},
}