The X-VLA (Cross-Embodiment Vision-Language-Action) model introduces a unified soft-prompted Transformer architecture that achieves scalable and generalizable control across heterogeneous robotic embodiments.
By decoupling the core policy model from embodiment-specific details, X-VLA enables robust, high-performance deployment in both simulation and real-world robotic systems.
| 📄 Paper | 🌐 Project Page | 🤗 Hugging Face |
|---|---|---|
| Read the Full Research | Explore the Demos | Access Models & Datasets |
Successful generalist Vision–Language–Action (VLA) models depend on scalable, cross-platform training across diverse robotic embodiments.
To leverage the heterogeneity of large-scale robot datasets, X-VLA introduces a soft prompt mechanism — embodiment-specific learnable embeddings that guide a unified Transformer backbone toward effective multi-domain policy learning.
The resulting architecture — X-VLA-0.9B — achieves state-of-the-art generalization across six simulation platforms and three real-world robots, surpassing prior VLA approaches in dexterity, adaptability, and efficiency.
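For intuition, the sketch below shows one way an embodiment-specific soft prompt can be prepended to a shared Transformer: each domain owns a small table of learnable tokens that is selected by `domain_id` and concatenated with the fused observation tokens. This is only an illustrative approximation under assumed names and sizes (`SoftPromptedBackbone`, 16 prompt tokens, a 6-layer encoder), not the released X-VLA-0.9B architecture.

```python
import torch
import torch.nn as nn

class SoftPromptedBackbone(nn.Module):
    """Illustrative sketch of soft-prompted policy learning.

    NOT the released X-VLA implementation: prompt length, depth, and the
    action readout are placeholder assumptions for exposition only.
    """

    def __init__(self, num_domains: int, prompt_len: int = 16, d_model: int = 512):
        super().__init__()
        # One learnable prompt per embodiment/domain, selected at runtime by domain_id.
        self.soft_prompts = nn.Parameter(torch.empty(num_domains, prompt_len, d_model))
        nn.init.normal_(self.soft_prompts, std=0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.action_head = nn.Linear(d_model, 20)  # 20-D EE6D action (see control-space table below)

    def forward(self, tokens: torch.Tensor, domain_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model) fused vision-language-proprio tokens
        prompts = self.soft_prompts[domain_id]   # (B, prompt_len, d_model)
        x = torch.cat([prompts, tokens], dim=1)  # prepend the embodiment prompt
        x = self.backbone(x)
        return self.action_head(x[:, 0])         # read the action from the first prompt token
```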
```bash
# Clone the repository
git clone https://github.com/2toinf/X-VLA.git
cd X-VLA

# Create and activate the Conda environment
conda create -n XVLA python=3.10 -y
conda activate XVLA

# Install dependencies
pip install -r requirements.txt
```

X-VLA adopts a Server–Client architecture to separate the model environment from simulation- or robot-specific dependencies. This design avoids package conflicts and supports distributed inference across GPUs, SLURM clusters, or edge devices.
| Model ID | Embodiment | Description | Performance | Evaluation Guidance |
|---|---|---|---|---|
| `2toINF/X-VLA-Pt` | Foundation | Pretrained on large-scale heterogeneous robot-vision-language datasets for general transfer. | — | — |
| `2toINF/X-VLA-AgiWorld-Challenge` | Agibot-G1 | Fine-tuned for the AgiWorld Challenge. | Champion 🥇 | — |
| `2toINF/X-VLA-Calvin-ABC_D` | Franka | Fine-tuned on the CALVIN benchmark (ABC_D subset). | 4.41 | Calvin Eval |
| `2toINF/X-VLA-Google-Robot` | Google Robot | Fine-tuned on the large-scale Google Robot dataset. | 80.4% (VM), 75.7% (VA) | Simpler Eval |
| `2toINF/X-VLA-Libero` | Franka | Fine-tuned on the LIBERO benchmark. | 98.1% | To be updated |
| `2toINF/X-VLA-RoboTwin2` | Agilex | Trained on the RoboTwin2 dataset for dual-arm coordinated manipulation (50 demos per task). | 70% | RoboTwin2.0 Eval |
| `2toINF/X-VLA-Simpler-WidowX` | WidowX | Fine-tuned on BridgeDataV2 (Simpler benchmark). | 95.8% | Simpler Eval |
| `2toINF/X-VLA-SoftFold` | Agilex | Fine-tuned on the Soft-Fold dataset; specialized in deformable object manipulation (e.g., folding and cloth control). | Cloth folding with a 100% success rate in 2 hours. | SoftFold-Agilex |
- All models share a consistent architecture: `configuration_xvla.py`, `modeling_xvla.py`, and a unified tokenizer (`tokenizer.json`).
- The `X-VLA-Pt` model is the foundation checkpoint, trained across multiple robot domains.
- Each embodiment is fine-tuned for its respective environment while retaining cross-embodiment alignment.
- Evaluation scripts (in `evaluation/`) follow a standardized format for reproducible benchmarking.
📊 Performance metrics follow standard evaluation protocols detailed in the paper.
```python
from transformers import AutoModel, AutoProcessor
import json_numpy

# Load model and processor
model = AutoModel.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)

# Start the inference server
print("🚀 Starting X-VLA inference server...")
model.run(processor, host="0.0.0.0", port=8000)
```

Once launched, the API endpoint is available at:

```
POST http://<server_ip>:8000/infer
```
The client communicates via HTTP POST, sending multimodal data (vision + language + proprioception) as a JSON payload.
| Key | Type | Description |
|---|---|---|
| `proprio` | `json_numpy.dumps(array)` | Current proprioceptive state (e.g., joint positions). |
| `language_instruction` | `str` | Task instruction (e.g., "Pick up the red block"). |
| `image0` | `json_numpy.dumps(array)` | Primary camera image (RGB). |
| `image1`, `image2` | optional | Additional camera views, if applicable. |
| `domain_id` | `int` | Identifier for the current robotic embodiment/domain. |
| `steps` | `int` | Number of action steps to predict (e.g., 10). |
```python
import requests
import numpy as np
import json_numpy

server_url = "http://localhost:8000/infer"
timeout = 5

# Prepare inputs
proprio = np.zeros(7, dtype=np.float32)
image = np.zeros((256, 256, 3), dtype=np.uint8)
instruction = "Move the gripper to the target position"

payload = {
    "proprio": json_numpy.dumps(proprio),
    "language_instruction": instruction,
    "image0": json_numpy.dumps(image),
    "domain_id": 0,
    "steps": 10
}

try:
    response = requests.post(server_url, json=payload, timeout=timeout)
    response.raise_for_status()
    result = response.json()
    actions = np.array(result["action"], dtype=np.float32)
    print(f"✅ Received {actions.shape[0]} predicted actions.")
except Exception as e:
    print(f"⚠️ Request failed: {e}")
    actions = np.zeros((10, 20), dtype=np.float32)
```

Example output:

```
[Server] Model loaded successfully on cuda:0
[Server] Listening on 0.0.0.0:8000
[Client] Sending observation to server...
✅ Received 10 predicted actions.
```
To ensure consistency across embodiments, X-VLA adopts a unified EE6D (End-Effector 6D) control space.
| Component | Specification | Notes |
|---|---|---|
| Proprio Input | Current EE6D pose (position + orientation) | Must align with training-space normalization. |
| Action Output | Predicted target delta/absolute pose (EE6D) | Executed by downstream controller. |
| Dimensionality | 20-D vector = 3 (EE Pos) + 6 (Rotation in 6D) + 1 (Gripper) + 10 (Padding) | |
| Single-arm Case | If only one arm exists, pad with zeros to maintain the 20-D vector. | |
⚙️ Reference Post-processing:

```python
from datasets.utils import rotate6d_to_xyz

action_final = np.concatenate([
    action_pred[:3],
    rotate6d_to_xyz(action_pred[3:9]),
    np.array([1.0 if action_pred[9] > 0.5 else 0])
])
```

When feeding proprioception to the model, apply the inverse transformation accordingly.
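As a rough sketch of that inverse direction, the snippet below packs a single-arm observation into the 20-D layout from the table above, using the common 6D rotation encoding (the first two columns of the rotation matrix). The helper names are hypothetical, and the exact rotation convention and normalization must be taken from the training pipeline or the reference client for your embodiment.

```python
import numpy as np

def rotation_matrix_to_6d(rot: np.ndarray) -> np.ndarray:
    # Common 6D rotation encoding: concatenate the first two columns of the 3x3 matrix.
    # Verify that this matches the convention expected by rotate6d_to_xyz before use.
    return rot[:, :2].reshape(-1, order="F")  # shape (6,)

def pack_proprio(ee_pos: np.ndarray, ee_rot: np.ndarray, gripper: float) -> np.ndarray:
    # Hypothetical single-arm packing into the assumed 20-D layout:
    # 3 (EE position) + 6 (rotation 6D) + 1 (gripper) + 10 (zero padding).
    proprio = np.zeros(20, dtype=np.float32)
    proprio[0:3] = ee_pos
    proprio[3:9] = rotation_matrix_to_6d(ee_rot)
    proprio[9] = gripper
    return proprio
```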
Each released model includes a corresponding reference client under `evaluation/<domain>/<robot>/client.py` for reproducing exact deployment behaviors. We strongly recommend adapting these clients when connecting to physical or simulated robots.
For large-scale or distributed training/deployment (e.g., HPC clusters, AgiBot nodes):
```bash
python -m deploy --model_path /path/to/your/model
```

This script automatically detects SLURM environment variables, launches distributed servers, and writes connection metadata to `info.json`.
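If you need to discover the launched servers programmatically, something along these lines may work; note that the structure of `info.json` and the `host`/`port` field names used here are assumptions, so inspect the file produced by your deployment first.

```python
import json

# Hypothetical reader for the connection metadata written by the deploy script.
# The list-of-dicts layout and the "host"/"port" keys are assumptions.
with open("info.json") as f:
    servers = json.load(f)

endpoints = [f"http://{s['host']}:{s['port']}/infer" for s in servers]
print(f"Discovered {len(endpoints)} inference endpoint(s):", endpoints)
```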
X-VLA supports fine-tuning on new demonstrations via a modular and extensible dataset interface.
- **Prepare Meta JSONs** — each domain has a `meta.json` listing trajectory file paths.
- **Implement a Custom Handler** — write a domain loader class with an `iter_episode(traj_idx)` generator (a sketch follows the handler table below).
- **Register the Domain** — update:
  - `datasets/domain_handler/registry.py`
  - `datasets/domain_config.py`
| Handler | Dataset | Description |
|---|---|---|
| `"lerobot"` | Agibot-Beta | Optimized for the LeRobot format |
| `"h5py"` | RoboMind / Simulation | Efficient loading from `.h5` trajectories |
| `"scattered"` | AGIWorld | Handles scattered trajectory storage |
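For data already stored as `.h5` trajectories, the built-in `"h5py"` handler may be enough; the sketch below only illustrates the expected shape of a custom handler exposing an `iter_episode(traj_idx)` generator. The base class, the `meta.json` layout (assumed here to be a plain list of file paths), and the per-step dictionary keys are assumptions; mirror an existing handler in `datasets/domain_handler/` for the exact interface.

```python
import json
import h5py
import numpy as np

class MyRobotHandler:
    """Illustrative custom domain handler (interface details are assumptions)."""

    def __init__(self, meta_path: str):
        # meta.json is assumed to be a JSON list of trajectory file paths.
        with open(meta_path) as f:
            self.trajectories = json.load(f)

    def __len__(self) -> int:
        return len(self.trajectories)

    def iter_episode(self, traj_idx: int):
        # Yield one dictionary per timestep; the key names here mirror the
        # inference payload but may differ from the actual training schema.
        with h5py.File(self.trajectories[traj_idx], "r") as traj:
            for t in range(len(traj["action"])):
                yield {
                    "image0": np.asarray(traj["image0"][t]),    # HxWx3 uint8 frame
                    "proprio": np.asarray(traj["proprio"][t]),  # robot state
                    "action": np.asarray(traj["action"][t]),    # target action
                    "language_instruction": traj.attrs.get("instruction", ""),
                }
```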
```bash
accelerate launch \
    --mixed_precision bf16 \
    train.py \
    --models '2toINF/X-VLA-Pt' \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --learning_coef 0.1 \
    --iters 50000 \
    --freeze_steps 1000 \
    --warmup_steps 2000
```

| Argument | Description |
|---|---|
| `--models` | Base model (e.g., `'2toINF/X-VLA-Pt'`) |
| `--train_metas_path` | Path to meta JSON file(s) |
| `--batch_size` | Batch size |
| `--learning_rate` | Base learning rate |
| `--learning_coef` | Learning-rate multiplier for soft prompts |
| `--iters` | Total training iterations |
| `--freeze_steps` | Steps to freeze the backbone |
| `--warmup_steps` | Warmup iterations |
If you use X-VLA in your research, please cite:
```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```

This repository is licensed under the Apache License 2.0. You may freely use, modify, and distribute the code under the terms of the license.
Copyright 2025 2toINF (https://github.com/2toinf)
Licensed under the Apache License, Version 2.0.
Maintained by 2toINF. 💬 Feedback, issues, and contributions are welcome via GitHub Discussions or Pull Requests.