🤖 X-VLA: Soft-Prompted Transformer as a Scalable Cross-Embodiment Vision-Language-Action Model

🏅 Champion @ AgiBot World Challenge @ IROS 2025

The X-VLA (Cross-Embodiment Vision-Language-Action) model introduces a unified soft-prompted Transformer architecture that achieves scalable, generalizable control across heterogeneous robotic embodiments.
By decoupling the core policy model from embodiment-specific details, X-VLA enables robust, high-performance deployment in both simulation and real-world robotic systems.

| 📄 Paper | 🌐 Project Page | 🤗 Hugging Face |
| --- | --- | --- |
| Read the Full Research | Explore the Demos | Access Models & Datasets |

🧩 Overview

Successful generalist Vision–Language–Action (VLA) models depend on scalable, cross-platform training across diverse robotic embodiments.
To leverage the heterogeneity of large-scale robot datasets, X-VLA introduces a soft prompt mechanism — embodiment-specific learnable embeddings that guide a unified Transformer backbone toward effective multi-domain policy learning.

The resulting architecture — X-VLA-0.9B — achieves state-of-the-art generalization across six simulation platforms and three real-world robots, surpassing prior VLA approaches in dexterity, adaptability, and efficiency.

📹 Demo video: video.mp4

🚀 Quick Start: Installation & Deployment

1️⃣ Installation

# Clone the repository
git clone https://github.com/2toinf/X-VLA.git
cd X-VLA

# Create and activate Conda environment
conda create -n XVLA python=3.10 -y
conda activate XVLA

# Install dependencies
pip install -r requirements.txt

2️⃣ Deploying X-VLA for Inference

X-VLA adopts a Server–Client architecture to separate the model environment from simulation or robot-specific dependencies. This design avoids package conflicts and supports distributed inference across GPUs, SLURM clusters, or edge devices.

🧠 Available Pre-trained Models

| Model ID | Embodiment | Description | Performance | Evaluation Guidance |
| --- | --- | --- | --- | --- |
| 2toINF/X-VLA-Pt | Foundation | Pretrained on large-scale heterogeneous robot–vision–language datasets for general transfer. | - | - |
| 2toINF/X-VLA-AgiWorld-Challenge | Agibot-G1 | Fine-tuned for the AgiWorld Challenge. | Champion 🥇 | - |
| 2toINF/X-VLA-Calvin-ABC_D | Franka | Fine-tuned on the CALVIN benchmark (ABC_D subset). | 4.41 | Calvin Eval |
| 2toINF/X-VLA-Google-Robot | Google Robot | Fine-tuned on the large-scale Google Robot dataset. | 80.4% (VM), 75.7% (VA) | Simpler Eval |
| 2toINF/X-VLA-Libero | Franka | Fine-tuned on the LIBERO benchmark. | 98.1% | To be updated |
| 2toINF/X-VLA-RoboTwin2 | Agilex | Trained on the RoboTwin2 dataset for dual-arm coordinated manipulation (50 demos per task). | 70% | RoboTwin2.0 Eval |
| 2toINF/X-VLA-Simpler-WidowX | WidowX | Fine-tuned on BridgeDataV2 (Simpler benchmark). | 95.8% | Simpler Eval |
| 2toINF/X-VLA-SoftFold | Agilex | Fine-tuned on the Soft-Fold dataset; specialized in deformable-object manipulation (e.g., folding and cloth control). | Cloth folding with a 100% success rate in 2 hours | SoftFold-Agilex |

🧩 Notes

  • All models share a consistent architecture: configuration_xvla.py, modeling_xvla.py, and unified tokenizer (tokenizer.json).
  • The X-VLA-Pt model is the foundation checkpoint, trained across multiple robot domains.
  • Each embodiment is fine-tuned for its respective environment while retaining cross-embodiment alignment.
  • Evaluation scripts (in evaluation/) follow a standardized format for reproducible benchmarking.

📊 Performance metrics follow standard evaluation protocols detailed in the paper.


3️⃣ Launching the Inference Server

from transformers import AutoModel, AutoProcessor
import json_numpy

# Load the model and processor (any model ID from the table above works here,
# since all released checkpoints share the same architecture)
model = AutoModel.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)

# Start the inference server
print("🚀 Starting X-VLA inference server...")
model.run(processor, host="0.0.0.0", port=8000)

Once launched, the API endpoint is available at:

POST http://<server_ip>:8000/infer

4️⃣ Client Interaction & Action Prediction

The client communicates via HTTP POST, sending multimodal data (vision + language + proprioception) as a JSON payload.

Payload Structure

| Key | Type | Description |
| --- | --- | --- |
| proprio | json_numpy.dumps(array) | Current proprioceptive state (e.g., joint positions). |
| language_instruction | str | Task instruction (e.g., "Pick up the red block"). |
| image0 | json_numpy.dumps(array) | Primary camera image (RGB). |
| image1, image2 | optional | Additional camera views, if applicable. |
| domain_id | int | Identifier for the current robotic embodiment/domain. |
| steps | int | Number of action steps to predict (e.g., 10). |

Example Client Code

import requests
import numpy as np
import json_numpy

server_url = "http://localhost:8000/infer"
timeout = 5

# Prepare inputs
proprio = np.zeros(7, dtype=np.float32)
image = np.zeros((256, 256, 3), dtype=np.uint8)
instruction = "Move the gripper to the target position"

payload = {
    "proprio": json_numpy.dumps(proprio),
    "language_instruction": instruction,
    "image0": json_numpy.dumps(image),
    "domain_id": 0,
    "steps": 10
}

try:
    response = requests.post(server_url, json=payload, timeout=timeout)
    response.raise_for_status()
    result = response.json()
    actions = np.array(result["action"], dtype=np.float32)
    print(f"✅ Received {actions.shape[0]} predicted actions.")
except Exception as e:
    print(f"⚠️ Request failed: {e}")
    actions = np.zeros((10, 20), dtype=np.float32)  # fall back to zero actions (steps x 20-D)

Expected Output

[Server] Model loaded successfully on cuda:0
[Server] Listening on 0.0.0.0:8000
[Client] Sending observation to server...
✅ Received 10 predicted actions.

5️⃣ Standardized Control Interface: EE6D

To ensure consistency across embodiments, X-VLA adopts a unified EE6D (End-Effector 6D) control space.

| Component | Specification | Notes |
| --- | --- | --- |
| Proprio Input | Current EE6D pose (position + orientation) | Must align with training-space normalization. |
| Action Output | Predicted target delta/absolute pose (EE6D) | Executed by the downstream controller. |
| Dimensionality | 20-D vector = 3 (EE position) + 6 (rotation in 6D representation) + 1 (gripper) + 10 (padding) | |
| Single-arm Case | If only one arm exists, pad with zeros to maintain the 20-D vector. | |

⚙️ Reference Post-processing:

import numpy as np
from datasets.utils import rotate6d_to_xyz

# action_pred: one 20-D row of the `actions` array returned by the server.
action_final = np.concatenate([
    action_pred[:3],                                   # EE position (3)
    rotate6d_to_xyz(action_pred[3:9]),                 # 6D rotation -> xyz Euler angles (3)
    np.array([1.0 if action_pred[9] > 0.5 else 0.0]),  # binarized gripper command (1)
])

When feeding proprioception to the model, apply the inverse transformation so the state follows the same 20-D layout (position + 6D rotation + gripper + zero padding).
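
For reference, the sketch below builds a 20-D single-arm proprio vector. It assumes the 6D rotation is the first two columns of the rotation matrix (the common continuity parameterization) and that orientation is expressed as xyz Euler angles; the helper xyz_to_rotate6d is illustrative and not part of the repository, so check datasets/utils.py for the exact convention used during training.

import numpy as np
from scipy.spatial.transform import Rotation as R

def xyz_to_rotate6d(euler_xyz):
    """Illustrative inverse of rotate6d_to_xyz: xyz Euler angles (rad) -> 6D rotation."""
    rot = R.from_euler("xyz", euler_xyz).as_matrix()  # 3x3 rotation matrix
    return rot[:, :2].T.reshape(-1)                   # first two columns -> 6 values (assumed convention)

# Assemble the 20-D EE6D proprio vector (single-arm case, zero-padded).
ee_pos   = np.zeros(3, dtype=np.float32)      # EE position
ee_euler = np.zeros(3, dtype=np.float32)      # EE orientation as xyz Euler angles
gripper  = np.array([0.0], dtype=np.float32)  # gripper state
proprio = np.concatenate([
    ee_pos,
    xyz_to_rotate6d(ee_euler).astype(np.float32),  # 6D rotation
    gripper,
    np.zeros(10, dtype=np.float32),                # padding to 20-D
])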


6️⃣ Reference Client Implementations

Each released model includes a corresponding reference client under evaluation/<domain>/<robot>/client.py for reproducing the exact deployment behavior. We strongly recommend starting from these clients when connecting to physical or simulated robots.


7️⃣ SLURM & Cluster Deployment

For large-scale or distributed training/deployment (e.g., HPC clusters, AgiBot nodes):

python -m deploy --model_path /path/to/your/model

This script automatically detects SLURM environment variables, launches distributed servers, and writes connection metadata to info.json.
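
The schema of info.json is defined by the deploy script, but a client-side consumer typically only needs each server's host and port. The snippet below is a hypothetical sketch under that assumption; adjust the keys to whatever deploy.py actually writes.

import json

# Hypothetical: assume info.json holds one record (or a list of records)
# with "host" and "port" fields for each launched inference server.
with open("info.json") as f:
    metadata = json.load(f)

records = metadata if isinstance(metadata, list) else [metadata]
endpoints = [f"http://{r['host']}:{r['port']}/infer" for r in records]
print("Discovered X-VLA endpoints:", endpoints)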


⚙️ Training / Fine-tuning on Custom Data

X-VLA supports fine-tuning on new demonstrations via a modular and extensible dataset interface.

Data Preparation Workflow

  1. Prepare Meta JSONs — each domain has a meta.json listing trajectory file paths.

  2. Implement a Custom Handler — write a domain loader class exposing an iter_episode(traj_idx) generator (a sketch follows this list).

  3. Register Domain — update:

    • datasets/domain_handler/registry.py
    • datasets/domain_config.py
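
A minimal handler sketch is shown below, assuming a handler only needs to expose an iter_episode(traj_idx) generator over per-step dictionaries as described above. The class structure, field names, and meta.json layout are illustrative; mirror an existing handler in datasets/domain_handler/ for the exact interface before registering your own.

import json
import h5py

class MyRobotHandler:
    """Illustrative domain handler for .h5 trajectories (field names are assumptions)."""

    def __init__(self, meta_path):
        # meta.json is assumed to list the trajectory file paths for this domain.
        with open(meta_path) as f:
            self.trajectories = json.load(f)

    def __len__(self):
        return len(self.trajectories)

    def iter_episode(self, traj_idx):
        """Yield one step at a time from the selected trajectory."""
        with h5py.File(self.trajectories[traj_idx], "r") as traj:
            for t in range(len(traj["action"])):
                yield {
                    "image0": traj["rgb"][t],                                   # primary camera frame
                    "proprio": traj["proprio"][t],                              # EE6D state (20-D)
                    "action": traj["action"][t],                                # EE6D target (20-D)
                    "language_instruction": traj.attrs.get("instruction", ""),  # task description
                }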

Example Handlers

| Handler | Dataset | Description |
| --- | --- | --- |
| "lerobot" | Agibot-Beta | Optimized for the LEROBOT format |
| "h5py" | RoboMind / Simulation | Efficient loading from .h5 trajectories |
| "scattered" | AGIWorld | Handles scattered trajectory storage |

Launch Training with Accelerate

accelerate launch \
    --mixed_precision bf16 \
    train.py \
    --models '2toINF/X-VLA-Pt' \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --learning_coef 0.1 \
    --iters 50000 \
    --freeze_steps 1000 \
    --warmup_steps 2000

| Argument | Description |
| --- | --- |
| --models | Base model (e.g., '2toINF/X-VLA-Pt') |
| --train_metas_path | Path to meta JSON file(s) |
| --batch_size | Batch size |
| --learning_rate | Base learning rate |
| --learning_coef | LR multiplier for soft prompts |
| --iters | Total training iterations |
| --freeze_steps | Steps to freeze the backbone |
| --warmup_steps | Warmup iterations |

📚 Citation

If you use X-VLA in your research, please cite:

@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}

🪪 License

This repository is licensed under the Apache License 2.0. You may freely use, modify, and distribute the code under the terms of the license.

Copyright 2025 2toINF (https://github.com/2toinf)
Licensed under the Apache License, Version 2.0.

Maintained by 2toINF. 💬 Feedback, issues, and contributions are welcome via GitHub Discussions or Pull Requests.
