The X-VLA (Cross-Embodiment Vision-Language-Action) model introduces a unified soft-prompted Transformer architecture that achieves scalable and generalizable control across heterogeneous robotic embodiments.
By decoupling the core policy model from embodiment-specific details, X-VLA enables robust, high-performance deployment in both simulation and real-world robotic systems.
| 📄 Paper | 🌐 Project Page | 🤗 Hugging Face |
|---|---|---|
| Read the Full Research | Explore the Demos | Access Models & Datasets |
Successful generalist Vision–Language–Action (VLA) models depend on scalable, cross-platform training across diverse robotic embodiments.
To leverage the heterogeneity of large-scale robot datasets, X-VLA introduces a soft prompt mechanism — embodiment-specific learnable embeddings that guide a unified Transformer backbone toward effective multi-domain policy learning.
The resulting architecture — X-VLA-0.9B — achieves state-of-the-art generalization across six simulation platforms and three real-world robots, surpassing prior VLA approaches in dexterity, adaptability, and efficiency.
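For intuition, the sketch below shows one way an embodiment-specific soft prompt can be prepended to a shared Transformer: each domain owns a small table of learnable tokens that is selected by `domain_id` and concatenated with the fused observation tokens. This is only an illustrative approximation under assumed names and sizes (`SoftPromptedBackbone`, 16 prompt tokens, a 6-layer encoder), not the released X-VLA-0.9B architecture.

```python
import torch
import torch.nn as nn

class SoftPromptedBackbone(nn.Module):
    """Illustrative sketch of soft-prompted policy learning.

    NOT the released X-VLA implementation: prompt length, depth, and the
    action readout are placeholder assumptions for exposition only.
    """

    def __init__(self, num_domains: int, prompt_len: int = 16, d_model: int = 512):
        super().__init__()
        # One learnable prompt per embodiment/domain, selected at runtime by domain_id.
        self.soft_prompts = nn.Parameter(torch.empty(num_domains, prompt_len, d_model))
        nn.init.normal_(self.soft_prompts, std=0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.action_head = nn.Linear(d_model, 20)  # 20-D EE6D action (see control-space table below)

    def forward(self, tokens: torch.Tensor, domain_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d_model) fused vision-language-proprio tokens
        prompts = self.soft_prompts[domain_id]   # (B, prompt_len, d_model)
        x = torch.cat([prompts, tokens], dim=1)  # prepend the embodiment prompt
        x = self.backbone(x)
        return self.action_head(x[:, 0])         # read the action from the first prompt token
```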
```bash
# Clone the repository
git clone https://github.com/2toinf/X-VLA.git
cd X-VLA

# Create and activate the Conda environment
conda create -n XVLA python=3.10 -y
conda activate XVLA

# Install dependencies
pip install -r requirements.txt
```

X-VLA adopts a Server–Client architecture to separate the model environment from simulation- or robot-specific dependencies. This design avoids package conflicts and supports distributed inference across GPUs, SLURM clusters, or edge devices.
| Model ID | Embodiment | Description | Performance | Evaluation Guidance |
|---|---|---|---|---|
| `2toINF/X-VLA-Pt` | Foundation | Pretrained on large-scale heterogeneous robot-vision-language datasets for general transfer. | — | — |
| `2toINF/X-VLA-AgiWorld-Challenge` | Agibot-G1 | Fine-tuned for the AgiWorld Challenge. | Champion 🥇 | — |
| `2toINF/X-VLA-Calvin-ABC_D` | Franka | Fine-tuned on the CALVIN benchmark (ABC_D subset). | 4.41 | Calvin Eval |
| `2toINF/X-VLA-Google-Robot` | Google Robot | Fine-tuned on the large-scale Google Robot dataset. | 80.4% (VM), 75.7% (VA) | Simpler Eval |
| `2toINF/X-VLA-Libero` | Franka | Fine-tuned on the LIBERO benchmark. | 98.1% | To be updated |
| `2toINF/X-VLA-RoboTwin2` | Agilex | Trained on the RoboTwin2 dataset for dual-arm coordinated manipulation (50 demos per task). | 70% | RoboTwin2.0 Eval |
| `2toINF/X-VLA-Simpler-WidowX` | WidowX | Fine-tuned on BridgeDataV2 (Simpler benchmark). | 95.8% | Simpler Eval |
| `2toINF/X-VLA-SoftFold` | Agilex | Fine-tuned on the Soft-Fold dataset; specialized in deformable object manipulation (e.g., folding and cloth control). | Cloth folding with a 100% success rate in 2 hours. | SoftFold-Agilex |
- All models share a consistent architecture: `configuration_xvla.py`, `modeling_xvla.py`, and a unified tokenizer (`tokenizer.json`).
- The `X-VLA-Pt` model is the foundation checkpoint, trained across multiple robot domains.
- Each embodiment is fine-tuned for its respective environment while retaining cross-embodiment alignment.
- Evaluation scripts (in `evaluation/`) follow a standardized format for reproducible benchmarking.
📊 Performance metrics follow standard evaluation protocols detailed in the paper.
```python
from transformers import AutoModel, AutoProcessor
import json_numpy

# Load model and processor
model = AutoModel.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)

# Start the inference server
print("🚀 Starting X-VLA inference server...")
model.run(processor, host="0.0.0.0", port=8000)
```

Once launched, the API endpoint is available at:

```
POST http://<server_ip>:8000/infer
```
The client communicates via HTTP POST, sending multimodal data (vision + language + proprioception) as a JSON payload.
| Key | Type | Description |
|---|---|---|
| `proprio` | `json_numpy.dumps(array)` | Current proprioceptive state (e.g., joint positions). |
| `language_instruction` | `str` | Task instruction (e.g., "Pick up the red block"). |
| `image0` | `json_numpy.dumps(array)` | Primary camera image (RGB). |
| `image1`, `image2` | optional | Additional camera views, if applicable. |
| `domain_id` | `int` | Identifier for the current robotic embodiment/domain. |
| `steps` | `int` | Number of action steps to predict (e.g., 10). |
```python
import requests
import numpy as np
import json_numpy

server_url = "http://localhost:8000/infer"
timeout = 5

# Prepare inputs
proprio = np.zeros(7, dtype=np.float32)
image = np.zeros((256, 256, 3), dtype=np.uint8)
instruction = "Move the gripper to the target position"

payload = {
    "proprio": json_numpy.dumps(proprio),
    "language_instruction": instruction,
    "image0": json_numpy.dumps(image),
    "domain_id": 0,
    "steps": 10
}

try:
    response = requests.post(server_url, json=payload, timeout=timeout)
    response.raise_for_status()
    result = response.json()
    actions = np.array(result["action"], dtype=np.float32)
    print(f"✅ Received {actions.shape[0]} predicted actions.")
except Exception as e:
    print(f"⚠️ Request failed: {e}")
    actions = np.zeros((10, 20), dtype=np.float32)
```

Example output:

```
[Server] Model loaded successfully on cuda:0
[Server] Listening on 0.0.0.0:8000
[Client] Sending observation to server...
✅ Received 10 predicted actions.
```
To ensure consistency across embodiments, X-VLA adopts a unified EE6D (End-Effector 6D) control space.
| Component | Specification | Notes |
|---|---|---|
| Proprio Input | Current EE6D pose (position + orientation) | Must align with training-space normalization. |
| Action Output | Predicted target delta/absolute pose (EE6D) | Executed by downstream controller. |
| Dimensionality | 20-D vector = 3 (EE Pos) + 6 (Rotation in 6D) + 1 (Gripper) + 10 (Padding) | |
| Single-arm Case | If only one arm exists, pad with zeros to maintain the 20-D vector. | |
⚙️ Reference Post-processing:

```python
from datasets.utils import rotate6d_to_xyz

action_final = np.concatenate([
    action_pred[:3],
    rotate6d_to_xyz(action_pred[3:9]),
    np.array([1.0 if action_pred[9] > 0.5 else 0])
])
```

When feeding proprioception to the model, apply the inverse transformation accordingly.
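As a rough sketch of that inverse direction, the snippet below packs a single-arm observation into the 20-D layout from the table above, using the common 6D rotation encoding (the first two columns of the rotation matrix). The helper names are hypothetical, and the exact rotation convention and normalization must be taken from the training pipeline or the reference client for your embodiment.

```python
import numpy as np

def rotation_matrix_to_6d(rot: np.ndarray) -> np.ndarray:
    # Common 6D rotation encoding: concatenate the first two columns of the 3x3 matrix.
    # Verify that this matches the convention expected by rotate6d_to_xyz before use.
    return rot[:, :2].reshape(-1, order="F")  # shape (6,)

def pack_proprio(ee_pos: np.ndarray, ee_rot: np.ndarray, gripper: float) -> np.ndarray:
    # Hypothetical single-arm packing into the assumed 20-D layout:
    # 3 (EE position) + 6 (rotation 6D) + 1 (gripper) + 10 (zero padding).
    proprio = np.zeros(20, dtype=np.float32)
    proprio[0:3] = ee_pos
    proprio[3:9] = rotation_matrix_to_6d(ee_rot)
    proprio[9] = gripper
    return proprio
```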
Each released model includes a corresponding reference client under `evaluation/<domain>/<robot>/client.py` for reproducing exact deployment behaviors. We strongly recommend adapting these clients when connecting to physical or simulated robots.
For large-scale or distributed training/deployment (e.g., HPC clusters, AgiBot nodes):
```bash
python -m deploy --model_path /path/to/your/model
```

This script automatically detects SLURM environment variables, launches distributed servers, and writes connection metadata to `info.json`.
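If you need to discover the launched servers programmatically, something along these lines may work; note that the structure of `info.json` and the `host`/`port` field names used here are assumptions, so inspect the file produced by your deployment first.

```python
import json

# Hypothetical reader for the connection metadata written by the deploy script.
# The list-of-dicts layout and the "host"/"port" keys are assumptions.
with open("info.json") as f:
    servers = json.load(f)

endpoints = [f"http://{s['host']}:{s['port']}/infer" for s in servers]
print(f"Discovered {len(endpoints)} inference endpoint(s):", endpoints)
```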
X-VLA supports fine-tuning on new demonstrations via a modular and extensible dataset interface.
- **Prepare Meta JSONs** — each domain has a `meta.json` listing trajectory file paths.
- **Implement a Custom Handler** — write a domain loader class with an `iter_episode(traj_idx)` generator (a sketch follows the handler table below).
- **Register the Domain** — update:
  - `datasets/domain_handler/registry.py`
  - `datasets/domain_config.py`
| Handler | Dataset | Description |
|---|---|---|
| `"lerobot"` | Agibot-Beta | Optimized for the LeRobot format |
| `"h5py"` | RoboMind / Simulation | Efficient loading from `.h5` trajectories |
| `"scattered"` | AGIWorld | Handles scattered trajectory storage |
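For data already stored as `.h5` trajectories, the built-in `"h5py"` handler may be enough; the sketch below only illustrates the expected shape of a custom handler exposing an `iter_episode(traj_idx)` generator. The base class, the `meta.json` layout (assumed here to be a plain list of file paths), and the per-step dictionary keys are assumptions; mirror an existing handler in `datasets/domain_handler/` for the exact interface.

```python
import json
import h5py
import numpy as np

class MyRobotHandler:
    """Illustrative custom domain handler (interface details are assumptions)."""

    def __init__(self, meta_path: str):
        # meta.json is assumed to be a JSON list of trajectory file paths.
        with open(meta_path) as f:
            self.trajectories = json.load(f)

    def __len__(self) -> int:
        return len(self.trajectories)

    def iter_episode(self, traj_idx: int):
        # Yield one dictionary per timestep; the key names here mirror the
        # inference payload but may differ from the actual training schema.
        with h5py.File(self.trajectories[traj_idx], "r") as traj:
            for t in range(len(traj["action"])):
                yield {
                    "image0": np.asarray(traj["image0"][t]),    # HxWx3 uint8 frame
                    "proprio": np.asarray(traj["proprio"][t]),  # robot state
                    "action": np.asarray(traj["action"][t]),    # target action
                    "language_instruction": traj.attrs.get("instruction", ""),
                }
```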
```bash
accelerate launch \
    --mixed_precision bf16 \
    train.py \
    --models '2toINF/X-VLA-Pt' \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --learning_coef 0.1 \
    --iters 50000 \
    --freeze_steps 1000 \
    --warmup_steps 2000
```

| Argument | Description |
|---|---|
| `--models` | Base model (e.g., `'2toINF/X-VLA-Pt'`) |
| `--train_metas_path` | Path to meta JSON file(s) |
| `--batch_size` | Batch size |
| `--learning_rate` | Base learning rate |
| `--learning_coef` | Learning-rate multiplier for soft prompts |
| `--iters` | Total training iterations |
| `--freeze_steps` | Steps to freeze the backbone |
| `--warmup_steps` | Warmup iterations |
If you use X-VLA in your research, please cite:
```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```

This repository is licensed under the Apache License 2.0. You may freely use, modify, and distribute the code under the terms of the license.
Copyright 2025 2toINF (https://github.com/2toinf)
Licensed under the Apache License, Version 2.0.
Maintained by 2toINF. 💬 Feedback, issues, and contributions are welcome via GitHub Discussions or Pull Requests.