We fixed several script links, so everything should run more smoothly now. Thanks to the community for the feedback.
StarVLA is a modular and flexible codebase for developing Vision-Language-Action (VLA) models from Vision-Language Models (VLMs). In StarVLA (also a pun on “start VLA”), each functional component (model, data, trainer, config, evaluation, etc.) follows a top-down, intuitive separation with high cohesion and low coupling, enabling plug-and-play design, rapid prototyping, and independent debugging.
*Modules with solid borders are supported; borderless ones are coming soon.
Various VLA Frameworks
- Qwen-FAST: Utilizes Qwen2.5-VL-3B with the FAST tokenizer to autoregressively generate discrete action tokens conditioned on visual and linguistic inputs (in line with π₀-FAST).
- Qwen-OFT: Combines Qwen2.5-VL-3B with an MLP action head to perform parallel decoding of continuous actions, regressed from the hidden states of predefined special action tokens (in line with OpenVLA-OFT/EO).
- Qwen-PI: Integrates the Flow-Matching (FM) action expert with Qwen2.5-VL-3B, adopting a diffusion-based approach for continuous action prediction (in line with π₀).
- Qwen-GR00T: Implements a dual-system VLA architecture, where Qwen2.5-VL-3B serves as System2 for high-level vision-language reasoning, while the Flow-Matching module acts as System1 for rapid action prediction (in line with GR00T).
For dynamic updates, see our 🍀 Overleaf, which continuously presents our real-time experimental results.
We release a series of pretrained models and checkpoints to facilitate reproduction and downstream use.
| Model | Description | WidowX | Link |
|---|---|---|---|
| Qwen2.5-VL-3B-Action | Add action tokens to Qwen2.5-VL | - | 🤗 Hugging Face |
| Qwen3-VL-4B-Action | Add action tokens to Qwen3-VL | - | 🤗 Hugging Face |
| Qwen2.5-FAST-Bridge-RT-1 | QwenVL + FAST tokenizer | 58.6 | 🤗 Hugging Face |
| Qwen2.5-OFT-Bridge-RT-1 | QwenVL + OFT action regression | 41.8 | 🤗 Hugging Face |
| Qwen2.5-PI-Bridge-RT-1 | QwenVL + flow-matching expert | 62.5 | 🤗 Hugging Face |
| Qwen2.5-GR00T-Bridge-RT-1 | QwenVL + GR00T N1.5 action head | 63.6 | 🤗 Hugging Face |
| Qwen-GR00T-Bridge | QwenVL + GR00T N1.5 action head | 71.4 | 🤗 Hugging Face |
| Qwen3VL-OFT-Bridge-RT-1 | Qwen3VL + OFT action regression | 42.7 | 🤗 Hugging Face |
| Qwen3VL-GR00T-Bridge-RT-1 | Qwen3VL + GR00T N1.5 action head | 65.3 | 🤗 Hugging Face |
Various Simulation Benchmarks
- SimplerEnv
- LIBERO
- RoboCasa
- RLBench
- RoboTwin
- BEHAVIOR
Various Training Strategies
- Single Imitation Learning
- Multimodal Multi-task Co-training
- Reinforcement Learning Adaptation
👇 StarVLA achieves “Lego-like” development via the following designs:
1. Smoke test any submodule
StarVLA emphasizes a modular model design. Each major framework file can be run standalone for rapid debugging and smoke-testing your code. For example:
# model
python starVLA/model/framework/QwenOFT.py --config_yaml starvla_cotrain_oxe.yaml
# dataloader
python starVLA/dataloader/lerobot_datasets.py --config_yaml starvla_cotrain_oxe.yaml
Note: starVLA/model/framework/yourframework.py is the single external API surface of the model; it should mirror (be structurally isomorphic to) the framework diagram in your paper.
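For orientation, here is a minimal sketch of what such a framework file might look like; the class name and internals are illustrative, and only forward / predict_action mirror the interfaces described in this README:

```python
# Hypothetical skeleton of a framework file; class name and internals are
# illustrative, not the actual StarVLA API.
import torch
import torch.nn as nn


class MyVLAFramework(nn.Module):
    """Single external API surface of the model."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        # Compose submodules here (VLM backbone, action head, ...),
        # mirroring the framework diagram in your paper.

    def forward(self, batch: dict) -> dict:
        # Consumes the raw, model-agnostic dict from the dataloader,
        # performs model-specific preprocessing internally, returns losses.
        raise NotImplementedError

    @torch.no_grad()
    def predict_action(self, batch: dict) -> torch.Tensor:
        # Same raw inputs at inference time; returns (unnormalized) actions.
        raise NotImplementedError


if __name__ == "__main__":
    # Standalone smoke test: build the model from a --config_yaml and
    # print(model), as in the commands above.
    pass
```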
2. Explicit model boundaries
StarVLA follows top‑down decomposition and the principle of high cohesion & low coupling.
For example:
- Dataloader
- Returns a raw, model‑agnostic dict only; no model‑specific preprocessing (e.g., tokenizer, image encoding).
- A single sample should include (add/remove as needed):
- image: list[PIL.Image] | np.ndarray
- lang: str
- action: np.ndarray[T, action_dim]
- state: Optional[np.ndarray[..., state_dim]]
Both framework.forward() and framework.predict_action() operate directly on raw inputs, keeping train/test boundaries explicit and easy to hack.
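For illustration, a raw sample following the field list above might look like this (the keys and shapes are examples, not a required schema):

```python
# Illustrative raw sample; exact keys and shapes depend on your dataset.
import numpy as np
from PIL import Image

sample = {
    "image": [Image.new("RGB", (224, 224))],       # list of camera frames
    "lang": "pick up the red block",                # language instruction
    "action": np.zeros((8, 7), dtype=np.float32),   # [T, action_dim] chunk
    "state": np.zeros((1, 7), dtype=np.float32),    # optional proprioception
}
```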
3. Flexible configuration system
StarVLA uses a single global configuration object. Parameters are passed primarily via extensible dicts, allowing overrides and controlled redundancy.
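A minimal sketch of this pattern, assuming the OmegaConf-based entry point described in the FAQ below (the override handling shown here is illustrative; StarVLA's actual CLI parsing may differ):

```python
# Sketch of a single-config-entry setup with OmegaConf; the override handling
# is illustrative and may differ from StarVLA's actual CLI parsing.
from omegaconf import OmegaConf

# The YAML passed via --config_yaml is the single source of truth.
cfg = OmegaConf.load("starVLA/config/training/starvla_cotrain_oxe.yaml")

# Dotted key=value overrides can be merged on top, e.g.
#   python train.py framework.qwenvl.base_vlm=Qwen/Qwen2.5-VL-7B-Instruct
cfg = OmegaConf.merge(cfg, OmegaConf.from_cli())

print(OmegaConf.to_yaml(cfg))
```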
🧪 To self‑test and iterate on StarVLA’s usability, we re‑implemented several representative VLA frameworks. We have also run a beta test: an internal developer can stand up a new VLA framework in under half a day (less than 3 hours), and a new user can build their first custom VLA framework within a single day. More design insights for each item can be found in assets/intro_v1.md.
🛠 Environment Setup
# Clone the repo
git clone https://github.com/starVLA/starVLA
# Create conda environment
conda create -n starVLA python=3.10 -y
conda activate starVLA
# Install requirements
pip install -r requirements.txt
# Install FlashAttention2
pip install flash-attn --no-build-isolation
# Install starVLA
pip install -e .
The --no-build-isolation flag resolves most issues, but on newer systems you may need to manually choose a compatible flash-attn version. Ensure your CUDA driver/toolkit and torch versions are aligned. Check your environment:
nvcc -V
pip list | grep -E 'torch|transformers|flash-attn'
If issues persist, pick a flash-attn release that matches your CUDA and torch versions, or ask ChatGPT (with search enabled) for help using the outputs above.
👀 Quick Check StarVLA
# check framework with fake examples
python starVLA/model/framework/QwenGR00T.py
You should first download Qwen3-VL-4B-Instruct to ./playground/Pretrained_models/Qwen3-VL-4B-Instruct. The framework should build successfully and print(model). You can also call model.forward(fake_data) and obtain unnormalized actions via model.predict_action(fake_data).
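If you want to poke at the interfaces yourself, a quick check could look roughly like this; the fake_data keys follow the raw-sample format above and are assumptions, not the exact schema the framework expects:

```python
# Hypothetical quick interface check; `framework` is whatever object the
# standalone script builds, and the keys below mirror the raw-sample format
# described earlier (not guaranteed to match the exact expected schema).
import numpy as np
from PIL import Image


def quick_check(framework) -> None:
    fake_data = {
        "image": [Image.new("RGB", (224, 224), color=(127, 127, 127))],
        "lang": "move the spoon to the towel",
        "state": np.zeros((1, 7), dtype=np.float32),
    }
    actions = framework.predict_action(fake_data)  # unnormalized action chunk
    print(type(actions), getattr(actions, "shape", None))
```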
🧪 Eval Existing Model
We also provide a parallel evaluation script:
check_pt=StarVLA/Qwen3VL-GR00T-Bridge-RT-1/checkpoints/steps_20000_pytorch_model.pt
bash examples/SimplerEnv/star_bridge_parall_eval.sh ${check_pt}
Before running, download Qwen3VL-GR00T-Bridge-RT-1 and follow the SimplerEnv instructions to prepare a Python environment. Edit these variables directly at the top of star_bridge_parall_eval.sh.
If you don't want parallel testing, please run:
# Terminal 1
bash ./examples/SimplerEnv/start_server.sh
# Terminal 2
bash ./examples/SimplerEnv/start_simpler_env.sh
If you hit libvulkan.so.1: cannot open shared object file: No such file or directory, refer to this link to fix it: Installation Guide – Vulkan Section
If the policy server raises NotImplementedError: Framework QwenGR00T is not implemented, run python QwenGR00T.py standalone to check your environment.
🚀 Train Your Own Model
Our training pipeline follows InternVLA-M1.
Steps:
- Prepare a LeRobot-format OXE dataset, including modality.json. Refer to GR00T N1.5.
- Add your dataset path to config.yaml:
  datasets:
    vla_data:
      dataset_py: lerobot_datasets
      data_root_dir: playground/Datasets/OXE_LEROBOT_DATASET  # path to your dataset
      data_mix: bridge_rt_1
- Run with Accelerate:
base_vlm=Qwen/Qwen2.5-VL-3B-Instruct
Framework_name=QwenGR00T
run_root_dir=./results
run_id=${Framework_name}

accelerate launch \
  --config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
  --num_processes 8 \
  starVLA/training/train_starvla.py \
  --config_yaml ./starVLA/config/training/starvla_cotrain_oxe.yaml \
  --framework.framework_py ${Framework_name} \
  --framework.qwenvl.base_vlm ${base_vlm} \
  --run_root_dir ${run_root_dir} \
  --run_id ${run_id} \
  --wandb_project your_project \
  --wandb_entity your_name
Note: run_root_dir stores the unified config snapshot and data‑processing metadata for reproducibility and quick restarts.
Q: Why not put preprocessing in the dataloader?
A: We profiled it: data preprocessing takes <1% of training time. Keeping it inside the Framework is acceptable and allows model-specific, flexible handling.
Q: Can I use a backbone other than Qwen2.5-VL?
A: Yes. Implement new vision + language modules and compose them inside a Framework; any existing model can be swapped in. Because the Framework consumes raw data (including actions) directly, swapping a backbone in is straightforward.
Q: Why isn't there an abstract interface for the vision tower?
A: We believe the VLM will become the base model and will inherently possess its own native vision tower.
Q: Can I override or add parameters via the terminal?
A: Yes. We use OmegaConf.load(args.config_yaml) as the single configuration entry; standalone debugging also uses args.config_yaml. Parameters may be intentionally redundant; you can freely add or override them via the CLI.
Examples:
accelerate launch \
--config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
--num_processes 8 \
starVLA/training/train_internvla.py \
--config_yaml ./starVLA/config/training/starvla_cotrain_oxe.yaml \
--framework.qwenvl.base_vlm Qwen/Qwen2.5-VL-7B-Instruct \ # override the base VLM
--framework.action_model.new_module ${module_name} # plug in a new module to the action model
Note: framework.action_model.new_module only adds the key to the global config; how it behaves is up to your framework.
Q: Can I freeze the VLM via parameters?
A: Yes. StarVLA uses a regex / name list to control freezing. Example:
--trainer.freeze_modules "qwen_vl_interface.model.model.visual,dino_encoder" \
Tips: You can print(your_model) first to check the relative paths of your modules and list them as comma-separated values.
(implementation in TrainerUtils.freeze_backbones.)
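For intuition, name-based freezing typically boils down to something like the sketch below; this is an illustration, not the actual TrainerUtils.freeze_backbones code:

```python
# Illustrative name-based freezing; not the actual TrainerUtils.freeze_backbones.
import torch.nn as nn


def freeze_by_name(model: nn.Module, freeze_modules: str) -> None:
    """Freeze parameters whose qualified name starts with any listed prefix."""
    prefixes = [p.strip() for p in freeze_modules.split(",") if p.strip()]
    for name, param in model.named_parameters():
        if any(name.startswith(prefix) for prefix in prefixes):
            param.requires_grad = False


# Usage, mirroring the CLI example above:
# freeze_by_name(model, "qwen_vl_interface.model.model.visual,dino_encoder")
```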
Q: Can I set different learning rates for different modules?
A: Yes. starVLA uses a name: value dict to define learning-rate groups. Config example:
trainer:
learning_rate:
base: 1e-05 # other modules
qwen_vl_interface: 1.0e-05
action_model: 1.0e-04
(Also referenced in trainer_tools.build_param_lr_groups.)
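Roughly, grouping parameters by name prefix works like this sketch (not the actual trainer_tools.build_param_lr_groups implementation):

```python
# Sketch of building per-module learning-rate groups by name prefix;
# not the actual trainer_tools.build_param_lr_groups implementation.
import torch
import torch.nn as nn


def build_param_groups(model: nn.Module, lr_cfg: dict) -> list[dict]:
    base_lr = lr_cfg.get("base", 1e-5)
    groups: dict[str, list] = {key: [] for key in lr_cfg if key != "base"}
    groups["base"] = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        key = next((k for k in groups if k != "base" and name.startswith(k)), "base")
        groups[key].append(param)
    return [
        {"params": params, "lr": lr_cfg.get(key, base_lr)}
        for key, params in groups.items() if params
    ]


# optimizer = torch.optim.AdamW(build_param_groups(model, cfg.trainer.learning_rate))
```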
Q: Can I resume training from a checkpoint?
A: Yes, partially. Specify the latest checkpoint path in config.yaml, e.g.:
trainer:
pretrained_checkpoint: path_to_steps_10000.pt
reload_modules: "action_model"
An empty reload_modules loads the full model. Note that starVLA does not save optimizer state: it requires substantial memory/disk and brings limited benefit.
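Conceptually, the partial reload is a name-filtered state_dict load, roughly as sketched below; this assumes the checkpoint stores a plain state_dict and is not the actual starVLA loading code:

```python
# Sketch of name-filtered checkpoint reloading; the real starVLA loading
# logic may differ (e.g. checkpoint layout, strictness, key remapping).
import torch
import torch.nn as nn


def load_modules(model: nn.Module, ckpt_path: str, reload_modules: str = "") -> None:
    state_dict = torch.load(ckpt_path, map_location="cpu")
    prefixes = [p.strip() for p in reload_modules.split(",") if p.strip()]
    if prefixes:  # empty reload_modules -> load the full model
        state_dict = {
            k: v for k, v in state_dict.items()
            if any(k.startswith(p) for p in prefixes)
        }
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```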
StarVLA is released under the MIT License, which permits commercial use, modification, distribution, and private use. Rebases are allowed for forks and feature branches; when rebasing from upstream StarVLA, use descriptive commit messages (e.g., "chore: rebase from StarVLA") and keep at least the two latest upstream commits separate. See License for details.
@misc{starvla2025,
title = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
author = {starVLA Community},
url = {https://github.com/starVLA/starVLA},
year = {2025}
}
- If you find an issue, please open an Issue first. If it persists or needs clarification, start a Discussion and we’ll follow up.
- If you have ideas to improve StarVLA, feel free to open a PR. To make sure we can accept your contribution, please align scope and design first via an Issue or by booking a short sync with this Cooperation Form.
- If you’re blocked or want to brainstorm, please fill out the Cooperation Form. We host office hours every Friday afternoon for live discussion.
Tip: Before submitting a PR, run make check locally to pass formatting and lint.
This project draws inspiration and references from several notable open-source initiatives, including:
The codebase was originally forked from InternVLA-M1.