RoboVerse

1 UC Berkeley  2 PKU  3 USC  4 UMich  5 UIUC  6 Stanford  7 CMU  8 UCLA  9 BIGAI
* equal contribution  † equal advising  Correspondence to: Haoran Geng <[email protected]>
Fig. 1: ROBOVERSE comprises a scalable simulation platform, a large-scale synthetic dataset, and unified benchmarks. The simulation platform supports seamless integration of new tasks and demonstrations through unified protocols, ensuring flexibility and extensibility. The dataset includes over 1,000 diverse tasks and more than 10 million transitions, constructed through large-scale data migration, cross-embodiment transfer, and robust augmentation and randomization.
Abstract—Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing reliable evaluation protocols. Collecting real-world robotic data is resource-intensive and inefficient, while benchmarking in real-world scenarios remains highly complex. Synthetic data and simulation offer promising alternatives, yet existing efforts often fall short in data quality, diversity, and benchmark standardization. To address these challenges, we introduce ROBOVERSE, a comprehensive framework comprising a simulation platform, a synthetic dataset, and unified benchmarks. Our simulation platform supports multiple simulators and robotic embodiments, enabling seamless transitions between different environments. The synthetic dataset, featuring high-fidelity physics and photorealistic rendering, is constructed through multiple approaches including migration from public datasets, policy rollout, and motion planning, enhanced by data augmentation. Additionally, we propose unified benchmarks for imitation learning and reinforcement learning, enabling consistent evaluation across different levels of generalization. At the core of the simulation platform is METASIM, an infrastructure that abstracts diverse simulation environments into a universal interface. It restructures existing simulation environments into a simulator-agnostic configuration system, as well as an API aligning different simulator functionalities, such as launching simulation environments, loading assets with initial states, and stepping the physics engine. This abstraction ensures interoperability and extensibility. Comprehensive experiments demonstrate that ROBOVERSE enhances the performance of imitation learning, reinforcement learning, and world model learning, improving sim-to-real transfer. These results validate the reliability of our dataset and benchmarks, establishing ROBOVERSE as a robust solution for advancing simulation-assisted robot learning.
I. INTRODUCTION

Large-scale datasets, combined with well-established benchmarks, have fueled rapid advancements in natural language processing (NLP) [93, 5] and computer vision (CV) [23, 59, 57, 95, 67, 43]. Specifically, large-scale data provides ample training examples that bolster learning, while uniform benchmarks enable standardized evaluation and fair comparison across different methods. However, replicating these successes in robotics remains challenging due to the difficulty of collecting high-quality, diverse data and the lack of widely recognized evaluation protocols.

Real-world approaches [15, 54] to constructing datasets and benchmarks, though authentically reflecting the complexities of operational environments, face significant practical constraints. First, collecting demonstrations is time-consuming and resource-intensive, and the resulting data is often hardware-dependent or modality-specific, limiting its adaptability to new scenarios. Additionally, establishing standardized and widely applicable benchmarks is inherently challenging since reproducing identical conditions for fair comparisons is nearly impossible. For instance, object placements can vary across rollouts, ambient lighting fluctuates under natural sunlight, and background environments may change. Consequently, scaling real-world datasets, evaluating policies, and iterating development in real-world scenarios remain cost-prohibitive and difficult to standardize.

Simulators, on the other hand, present a promising alternative for large-scale dataset and benchmark construction. By providing efficient computation, synthetic assets, and omniscient information in reproducible settings, they enable cost-effective dataset construction and consistent performance evaluation. Recent works, exemplified by [135, 50, 10, 33, 98, 124, 70], have demonstrated the potential of simulation-based methods in various robotic tasks. Despite these advantages, several challenges impede the broader adoption of synthetic datasets and benchmarks. First, utilizing simulators often demands considerable expertise due to both the complexity of simulator design and the relative immaturity of many platforms, which complicates the data construction process. Second, simulators vary widely in their internal architectures and external interfaces, making it laborious to transfer data and models or adapt workflows from one to another. Consequently, reusing existing synthetic datasets and benchmarks is difficult, resulting in a fragmented ecosystem that further hinders convenient construction and effective use of large-scale data in simulation environments.

To fully harness the potential of simulation in robotics, we introduce ROBOVERSE, a scalable simulation platform that unifies existing simulators under a standardized format and a single infrastructure, a large-scale synthetic dataset, and unified benchmarks. To achieve this, we first propose METASIM, the core infrastructure of ROBOVERSE. Through careful design, METASIM establishes a universal configuration system for agents, objects, sensors, tasks, and physics parameters while exposing a simulator-agnostic interface for simulation setup and control. This architecture enables seamless integration of tasks, assets, and robot trajectories from diverse simulation environments with minimal adaptation effort. METASIM provides three key capabilities: (1) Cross-Simulator Integration: Enables seamless switching between different simulators, fostering unified benchmarking and facilitating the transfer of environments and demonstrations across platforms. (2) Hybrid Simulation: Combines the strengths of multiple simulators—such as pairing advanced physics engines with superior renderers—to generate scalable and high-quality synthetic data. (3) Cross-Embodiment Transfer: Allows the retargeting of trajectories across various robot arms with parallel grippers, maximizing dataset reuse from heterogeneous sources.

METASIM enables ROBOVERSE to systematically enhance the workflow for building and scaling simulation environments and datasets. Our method features:

• Scalable and Diverse Data Generation: By aligning multiple benchmarks and task trajectories and leveraging a robust multi-source integration and data filtering pipeline, we generate large-scale, high-quality datasets. Additionally, our data randomization and augmentation pipeline enhances data diversity and volume, further enriching the dataset for comprehensive model training;
• Realistic Simulation and Rendering: With METASIM's hybrid simulation capability, we enable the fusion of advanced physics engines and rendering systems across multiple simulators and renderers. Combined with carefully curated scenes, materials, and lighting assets, ROBOVERSE enhances realism in physical interactions and sensory observations;
• Unified Benchmarking and Evaluation: We unify widely used benchmarks into a cohesive system, streamlining algorithm development and performance comparison within a structured evaluation framework. Additionally, we introduce a standardized benchmarking protocol to assess varying levels of generalization and sim-to-real transferability;
• High Extensibility and Scalability: The aligned APIs and infrastructure streamline development and enable efficient algorithm integration, testing, and deployment across diverse simulation environments. Additionally, we develop real-to-sim frameworks, multiple teleoperation methods, and AI-generative systems for scalable task and data creation.

Leveraging these workflows in ROBOVERSE, we construct the largest and most diverse high-quality synthetic dataset and benchmark to date, all in a unified format. This dataset includes ∼500k unique, high-fidelity trajectories covering 276 task categories and ∼5.5k assets. Additionally, we generate over 50 million high-quality state transitions to support policy learning.
Beyond dataset and benchmark construction, we explore the potential of ROBOVERSE through extensive experiments on imitation learning (Sec. VI-B), reinforcement learning (Sec. VI-C), and world model learning (Sec. VI-E). Our results demonstrate that ROBOVERSE enables reliable policy learning and evaluation, supports strong sim-to-sim (Sec. VI-G) and sim-to-real transfer (Sec. VI-F) via high-fidelity physics and rendering, and facilitates efficient data expansion through teleoperation (Sec. ??), trajectory augmentation (Sec. IV-D1), domain randomization (Sec. IV-D2), and generative models (Sec. IV-C). These findings highlight the framework's robustness, scalability, and real-world applicability.

Fig. 2: ROBOVERSE consists of a simulation platform, a large-scale, high-quality dataset, and unified benchmarks. At the core of the simulation platform is METASIM, the infrastructure of ROBOVERSE. Powered by METASIM, the simulation platform facilitates dataset creation and benchmark construction.

II. RELATED WORK

A. Robotics Simulators

Advancements in computer graphics have contributed to the development of high-fidelity simulators, which are widely used in robotics research and development.
CoppeliaSim [97], Bullet [16], and MuJoCo [111] provide accurate physics simulations and are extensively utilized in applications such as reinforcement learning and robotic benchmarking [3, 126, 87, 14]. More simulators have been developed to fully exploit parallelism for better efficiency. IsaacGym [72], IsaacSim [85], SAPIEN [37, 109], MuJoCo MJX [111, 132], and Genesis [2] utilize GPU power for enhanced performance, enabling large-scale reinforcement learning and efficient data collection, significantly improving training speed and scalability. Some simulators focus on bridging the simulation-reality gap (sim-to-real gap), incorporating technologies including ray tracing and customized renderers for photorealistic rendering [85, 109]. Furthermore, IsaacSim [85] and Genesis [2] offer high-fidelity soft-body and liquid simulation, expanding the scope of realistic robotic interactions. ROBOVERSE proposes a unified platform that supports multiple simulators, facilitating seamless transitions between them and enabling hybrid integration to utilize the strengths of each simulator.

B. Large-Scale Robotics Dataset

The scarcity of large-scale, high-quality, and diverse datasets in the robotics community has long been recognized. Several works have shown the possibility of collecting demonstration data directly on real robots. RoboNet [20] is a large-scale manipulation dataset containing roughly 162k trajectories from multiple robot platforms. DROID [54] has collected over 76k contact-rich robotic manipulation demonstrations across 86 tasks. RH20T [28] proposed a dataset with over 100k demonstrations and 147 tasks. At the same time, RT-1 [4] set the record further to 130k demonstrations on over 700 tasks. Recently, Open X-Embodiment [15] has demonstrated a promising approach to unite the community's efforts, collecting over 1M trajectories on 160,266 tasks with 22 different embodiments. At this stage, real-world datasets became difficult to scale up due to the proportional effort and cost required to collect more demonstration trajectories.

Simulation-based data collection provides a promising solution to the high cost and inefficiencies of real-world datasets. Hussing et al. [46] proposed a dataset containing 256M transitions on 256 tasks for offline compositional reinforcement learning. RoboCasa [82] introduced a dataset of 100 tasks and over 100k trajectories for generalist robots. DexGraspNet-2.0 [134] has collected over 400M demonstrations for dexterous grasping. Despite these efforts, synthetic datasets often exist in disparate simulators, leading to a fragmented ecosystem with limited diversity and quality. Moreover, simulation-based data often fails to capture complex physics and diverse task variations found in the real world [63, 26], potentially causing overfitting to specific simulators and hampering generalization to real-world scenarios. ROBOVERSE provides a unified solution for large-scale, high-quality, and diverse synthetic data. It enables agents to train on a large set of environments and simulators to reduce overfitting, thereby improving the robustness of the learned policies.

C. Benchmarking in Robotics

Benchmarking remains a critical yet highly challenging problem in the robotics community. Compared to supervised learning tasks, it is relatively difficult to evaluate the performance of a robotics model. MetaWorld [131] is an early attempt at multi-task benchmarking. It is followed by RLBench [48], Behavior-1k [62], Habitat [108], and ManiSkill [81, 37, 109, 103], covering a large variety of robotic tasks. Grutopia [116] and InfiniteWorld [96] make a leap toward general-purpose robot benchmarking.

Despite significant efforts dedicated to these benchmarks, it is not guaranteed that results are reproducible across different benchmarks. The uncertainty comes from multiple aspects, including simulation accuracy, rendering style, and asset properties [63, 26]. To address these challenges, ROBOVERSE enables researchers to evaluate their policies across multiple benchmarks and simulators seamlessly, without requiring them to familiarize themselves with each one individually.
[Fig. 3 schematic: METASIM's three-layer architecture, comprising a MetaConfig configuration layer (agents, objects, tasks, sensors, physics), a Handler layer over aligned simulator backends (Isaac Lab, Isaac Gym, MuJoCo, SAPIEN, Genesis, Bullet, CoppeliaSim, ...), and a gym.Env wrapper (step, reset, render, close). Tasks, assets, and trajectories are collected via migration, self-designed tasks, real-to-sim, generative AI, teleoperation, RL rollout, and motion planning, then expanded by data augmentation into unified benchmarks and a high-quality dataset, enabling cross-simulator integration, hybrid simulation, and cross-embodiment transfer.]
Fig. 3: METASIM provides a universal configuration system, aligned simulator backends, and a Gym [112] environment wrapper. This three-layer architecture abstracts simulation environments into simulator-agnostic specifications and aligns simulator backends, enabling three key capabilities: cross-simulator integration, hybrid simulation, and cross-embodiment transfer. Based on METASIM, we build a pipeline to collect tasks, assets, and trajectories from diverse public sources in a unified format, employ data augmentation methods, and ultimately generate a large-scale, high-quality dataset along with unified benchmarks. This data pipeline forms the foundation of ROBOVERSE, facilitating the generation of large-scale datasets and the construction of unified benchmarks.
III. INFRASTRUCTURE: METASIM

A. METASIM Overview

We present METASIM, a high-level interface above specific simulation environment implementations. It is also the core infrastructure of ROBOVERSE.

[Figure schematic: the MetaConfig specification, grouping agents, objects, and sensors with a PhysicsConfig (gravity, collision) and a TaskConfig (task instructions, success_metrics).]
[Fig. 8 left-panel examples: CALVIN, RLBench, ManiSkill.]
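To make the configuration abstraction concrete, the following is a minimal, hypothetical sketch of what a simulator-agnostic MetaConfig-style specification could look like in Python. The class and field names are illustrative assumptions based on the schematic above, not the actual ROBOVERSE API.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PhysicsConfig:
    gravity: tuple = (0.0, 0.0, -9.81)   # world gravity vector
    enable_collision: bool = True        # global collision toggle
    dt: float = 1.0 / 60.0               # physics step size

@dataclass
class ObjectConfig:
    name: str
    asset_path: str                      # URDF/MJCF/USD file, resolved per backend
    position: tuple = (0.0, 0.0, 0.0)
    rotation: tuple = (1.0, 0.0, 0.0, 0.0)   # quaternion (w, x, y, z)

@dataclass
class TaskConfig:
    instruction: str                     # natural-language task description
    success_metric: str                  # name of a registered success checker

@dataclass
class MetaConfig:
    agents: list = field(default_factory=list)    # robot descriptions
    objects: list = field(default_factory=list)   # ObjectConfig entries
    sensors: list = field(default_factory=list)   # e.g., camera specifications
    task: Optional[TaskConfig] = None
    physics: PhysicsConfig = field(default_factory=PhysicsConfig)

# Example: a pick-and-place scene description that any aligned backend could load.
cfg = MetaConfig(
    objects=[ObjectConfig("cube", "assets/cube.usd", position=(0.4, 0.0, 0.05))],
    task=TaskConfig("pick up the cube", success_metric="object_lifted"),
)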
Fig. 8: Dataset Comparison and Gallery. Left: other representative synthetic robotics datasets. Right: the ROBOVERSE dataset.
Source Benchmark | Source Simulator | # Task Categories | # Trajectories | # Assets
ManiSkill [81, 37, 109] | SAPIEN | 6 | 19k | 1.7k
RLBench [48] | CoppeliaSim | 80 | 150k | 100
CALVIN [79] | PyBullet | 7 | 20k | 7
MetaWorld [131] | MuJoCo | 5 | 5k | 6
RoboSuite [142] & MimicGen [76] | MuJoCo | 6 | 6k | 12
GAPartNet [34] | IsaacGym | 4 | 4k | 151
Open6DOR [24] | IsaacGym | 69 | 10k | 207
ARNOLD [36] | IsaacSim | 6 | 3k | 30
LIBERO [65] | MuJoCo | 10 | 15k | 15
Simpler [63] | SAPIEN | 6 | 30k | 52
RLAfford [35] | IsaacGym | 4 | 40k | 40
GraspNet [27] | - | 58 | 200k | 42
GarmentLab [69] | IsaacSim | 6 | 6k | 3k
UniDoorManip [64] | IsaacGym | 7 | 1k | 140
GAPartManip [18] | IsaacSim | 2 | 1.5k | 42
Total | - | 276 | 510.5k | 5.5k

TABLE I: Migration progress statistics for manipulation tasks in ROBOVERSE.

... focus on VLN in continuous environments (VLN-CE) [58], as it more closely resembles real-world scenarios [11, 136, 137]. Specifically, we construct our dataset based on ROBOVERSE by integrating MatterPort 3D scenes [9] (90 scenes) and off-the-shelf instructions from R2R [58] (10k episodes) and RxR [60] (20k episodes). We provide two types of mobile embodiments, including the Unitree Dog (a legged robot) and the JetBot (a wheeled robot), which support different control policies. A detailed elaboration on the navigation dataset is provided in the supplementary materials.

c) Humanoid Dataset: We migrate HumanoidBench [102] tasks for reinforcement learning benchmarks and integrate tasks, policies, and data samples from Humanoid-X [77] and SkillBlender [61]. Additionally, we re-implement the UH-1 inference pipeline within our framework. The pretrained policy successfully enables humanoid robots to follow demonstrated poses while maintaining stable locomotion across multiple simulators based on ROBOVERSE.

V. ROBOVERSE BENCHMARK

A. Benchmark Overview

With the collected tasks, assets, and trajectories, RoboVerse establishes standardized benchmarks for robot learning, including both imitation learning and reinforcement learning. We define a unified training and evaluation protocol within the RoboVerse platform and implement standardized baselines and learning frameworks for benchmarking. Specifically, for imitation learning, we introduce different levels of generalization benchmarks to assess the generalization capability of models.
Fig. 9: Benchmark Protocol. Panels: (a) Level 0, (b) Level 1, (c) Level 2, (d) Level 3. We define a four-level generalization benchmarking protocol, allocating 90% of the data for training and 10% for generalization evaluation. From left to right, Levels 0 to 3 correspond to task space generalization, environment randomization, camera randomization, and lighting and reflection randomization, respectively.
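As an illustration of the fixed 90/10 protocol, the sketch below shows one way a task's initial states could be split deterministically. This is an illustrative sketch under the assumption that each task's initial states are enumerable; it is not the benchmark's actual implementation.

import random

def split_task_space(initial_states, train_ratio=0.9, seed=0):
    """Deterministically split a task's initial states into train/eval sets."""
    states = list(initial_states)
    rng = random.Random(seed)          # fixed seed -> reproducible split
    rng.shuffle(states)
    cut = int(len(states) * train_ratio)
    return states[:cut], states[cut:]  # 90% training, 10% held-out evaluation

train_states, eval_states = split_task_space(range(1000))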
B. Imitation Learning Benchmark

For each imitation learning benchmark, we establish a standardized evaluation framework with a fixed set of demonstrations and a controlled evaluation environment. Policies must be trained exclusively on the provided training data and assessed within this environment to ensure fair comparison. To rigorously test generalization capability, we curate training data from specific domains and evaluate policies on unseen samples, challenging their adaptability to novel scenarios. We systematically categorize visual generalization factors into multiple levels, including task space generalization, environment setup generalization, camera setting generalization, and lighting and reflection generalization. Each level introduces controlled variations to assess a policy's adaptability and robustness in increasingly diverse and challenging conditions.

a) Level 0: Task Space Generalization: We establish a controlled evaluation by standardizing the environment with consistent camera, materials, lighting, and other parameters. The task space, including object initialization and instructions, is split into 90% training and 10% validation to assess generalization within a fixed setting, as shown in Fig. 9 (a).

b) Level 1: Environment Randomization: Building on the standardized setup, we introduce scene randomization while keeping the camera, materials, and lighting fixed [78]. By varying house, table, and ground configurations, we create diverse visual inputs to test robustness against environmental changes [51]. A fixed set of predefined randomized scenes ensures structured evaluation, as shown in Fig. 9 (b).

c) Level 2: Camera Randomization: To assess generalization across camera variations, we introduce different viewing heights and angles using carefully annotated, realistic camera poses. Following the 90/10 training/testing split, we ensure consistent and rigorous evaluation, as illustrated in Fig. 9 (c).

d) Level 3: Lighting and Reflection Randomization: Real-world environments involve diverse materials and lighting conditions [113]. To simulate these challenges, we randomize lighting and reflections, curating realistic object materials and illumination setups [19]. This enhances robustness testing under varying conditions, as shown in Fig. 9 (d).
C. Reinforcement Learning Benchmark

In addition to imitation learning, RoboVerse offers a comprehensive reinforcement learning (RL) benchmark designed to accommodate a diverse range of tasks, robot embodiments, and simulation backends. Specifically, we integrate the PPO [101] algorithm from both Stable-Baselines3 [94] and rsl_rl [98] into our METASIM interface, enabling straightforward task definition, seamless environment switching, and standardized performance logging.

Building upon this infrastructure, we have successfully ported multiple humanoid control tasks from the HumanoidBench [102] benchmark into RoboVerse. Through our adapted interface for rsl_rl, we have efficiently extended framework compatibility to support the TD-MPC2 [41, 42] algorithm from the original benchmark while preserving implementation fidelity.
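Because the gym.Env wrapper exposes standard Gymnasium-style APIs, an off-the-shelf PPO implementation attaches with very little glue code. The sketch below uses a generic Gymnasium environment as a stand-in; only the Stable-Baselines3 calls are real library API, and the task wiring is an assumption.

import gymnasium as gym
from stable_baselines3 import PPO

# Stand-in environment; in ROBOVERSE this would be a METASIM task wrapped in
# the gym.Env interface described in Fig. 3 (e.g., a PickCube environment).
env = gym.make("Pendulum-v1")

model = PPO("MlpPolicy", env, verbose=1)   # PPO from Stable-Baselines3 [94]
model.learn(total_timesteps=100_000)       # standard SB3 training loop
model.save("ppo_metasim_task")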
VI. EXPERIMENTAL RESULTS

A. Overview

We conduct extensive experiments to validate the effectiveness and practicality of ROBOVERSE. First, we evaluate baselines on representative tasks from various benchmark sources to ensure the reliability of the collected datasets and established benchmarks. This includes assessments of both imitation learning baselines (Sec. VI-B) and reinforcement learning baselines (Sec. VI-C).

We then further demonstrate the strength of the high-quality synthetic dataset. We find that synthetic data can significantly boost world model learning.

B. Results on the Imitation Learning Benchmark

1) Baseline and Task Selection: To genuinely reflect the data quality of the RoboVerse dataset and provide a standard

1 Due to resource and time constraints, we uniformly sample 20 testing scenarios for the OpenVLA baseline.
Representative Task | PickCube | StackCube | CloseBox | MoveSliderLeft | PickChocolatePudding | NutAssembly | Average
Benchmark Source | ManiSkill | ManiSkill | RLBench | CALVIN | LIBERO | RoboSuite | -
Diffusion Policy [13] (78M) | 52.7 | 53.8 | 51.5 | 76.5 | 50.0 | 7.1 | 48.6
ACT [138] (84M) | 31.7 | 36.7 | 68.3 | 85.0 | 78.3 | 0.0 | 50.0

TABLE II: Baseline Results on the ROBOVERSE Imitation Learning Benchmark. We report baseline results on representative tasks from various benchmark sources to validate the effectiveness and reliability of the ROBOVERSE benchmark.
TABLE III: Generalization Performance on Imitation Learning Benchmark. This table presents the experimental results for
each generalization level in our benchmark across different tasks and methodologies. The tasks are divided into distinct levels
(Level 0, Level 1, Level 2, and Level 3) to evaluate performance under progressively challenging scenarios.
TABLE VI: Comparison of Physics Simulators [104]. The column GPU denotes whether the simulator can use GPU-accelerated
computation. The column Open denotes whether the simulator is open-source.
simulation states to the ones included in the dict. The step() method will prompt the simulation to proceed one timestep.

D. Gym API Wrappers

To support building learning environments, we define an Env class built on top of Handler. It offers Gymnasium-like APIs (step, reset, render, and close), implementing these methods by leveraging the underlying Handler methods.

It is worth noting that most simulation environments provide the underlying APIs (corresponding to our Handler) and upper-level environments (corresponding to our Env) separately, such as SAPIEN [122] with ManiSkill [109], IsaacSim [85] with IsaacLab [80], CoppeliaSim [97]/PyRep [47] with RLBench [48], and MuJoCo [111] with MuJoCo Playground [132]. This supports the reasonableness of our two-level Handler and Env abstraction.
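A minimal sketch of what such an Env wrapper could look like is shown below, assuming a Handler with set_states/get_states/step-style methods as described above. It is illustrative, not the exact ROBOVERSE implementation, and the task-specific reward and termination logic is omitted.

import gymnasium as gym

class MetaSimEnv(gym.Env):
    """Gymnasium-style wrapper that delegates to an underlying Handler."""

    def __init__(self, handler):
        self.handler = handler

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.handler.reset()                      # restore initial scene states
        return self.handler.get_states(), {}

    def step(self, action):
        self.handler.set_states(action=action)    # apply the action
        self.handler.step()                       # advance the physics one timestep
        obs = self.handler.get_states()
        reward, terminated, truncated = 0.0, False, False  # task-specific logic omitted
        return obs, reward, terminated, truncated, {}

    def render(self):
        return self.handler.render()

    def close(self):
        self.handler.close()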
E. Backend Support

1) Isaac Lab: Isaac Lab [85] is an advanced robotics simulation platform developed by NVIDIA. By leveraging high-fidelity physics, GPU acceleration, and photorealistic rendering, it enables rapid prototyping, testing, and deployment of AI-driven robotics solutions in virtual environments. Through seamless integration with NVIDIA's Omniverse framework, Isaac Lab offers robust features such as domain randomization, sensor simulation, and support for large-scale reinforcement learning, making it a powerful tool for both research and industrial applications.

A key advantage of Isaac Lab is its compatibility with the Isaac ROS infrastructure, which includes valuable models such as FoundationPose [121, 120] and cuRobo [107], among others.

2) Isaac Gym: Isaac Gym [72] is a physics simulation environment designed for reinforcement learning research. Although it remains available for download, official support has ended. Nevertheless, multiple works published before 2024—such as hora [91], humanoid-gym [38], and IPC-graspsim [55]—were developed using Isaac Gym.

Key features of Isaac Gym include support for importing URDF and MJCF files with automatic convex decomposition, a GPU-accelerated tensor API for managing environment states and actions, and a range of sensors (e.g., position, velocity, force, torque). Additional capabilities include runtime domain randomization of physics parameters, Jacobian and inverse kinematics support, and customizable friction settings.

3) MuJoCo: MuJoCo [111] is a physics engine and simulation framework designed to accurately model the dynamics and control of complex robotic systems in real time. Its name, MuJoCo, stands for Multi-Joint dynamics with Contact, highlighting its primary emphasis on efficient computation of contact forces and multi-joint dynamics. The engine supports advanced features such as frictional contact models, user-defined actuators, and customizable sensor modalities, allowing researchers and developers to prototype, test, and refine control algorithms across a wide range of robot morphologies and tasks.

A key strength of MuJoCo is its computational precision, which enables high simulation throughput and real-time interactive control. It supports rigid-body dynamics, articulated mechanisms, and a variety of constraints, making it suitable for tasks involving locomotion, manipulation, and reinforcement learning. Furthermore, MuJoCo's flexible XML-based model description streamlines creating and modifying simulated environments, providing a straightforward way to experiment with novel designs. The compatibility between MuJoCo and Brax offers a high-speed, differentiable pipeline crucial for reinforcement learning. This powerful blend of accuracy, speed, and flexibility has solidified MuJoCo's status as a leading choice in robotics research and machine learning, particularly for advanced control, motion planning, and reinforcement learning applications [29].

4) Genesis: Genesis [2] is a comprehensive physics platform developed for robotics and physics simulation research, unifying multiple core capabilities in a single environment. At its foundation is a universal physics engine, rebuilt from the ground up to simulate diverse materials and physical phenomena while seamlessly integrating various solvers. Alongside this engine, Genesis provides a swift, Python-friendly robotics simulation toolkit, an efficient photo-realistic rendering system, and a data-generation module that converts natural language prompts into multi-modal datasets. We leverage the Genesis backend to support loading, simulation, and rendering in the ROBOVERSE workflow.
5) SAPIEN: SAPIEN [122] is a robot simulation framework that allows highly efficient simulation and rendering of robotic tasks. It uses PhysX [83] as the underlying physics engine. We support the released version SAPIEN 2.2 for the METASIM framework.

We use the multiprocessing library to support parallel environments in the Handler class for SAPIEN. When instantiating the environment from configurations, the desired number of processes is forked to run the simulation of the different environments. To support the get_states and set_states APIs, data for the different environments are distributed to their respective processes, and the return values are then gathered.

6) PyBullet: PyBullet [17] is a fast and easy-to-use robotics simulator. It uses its own physics solvers for accurate and efficient simulations. We support the released version PyBullet 3.2 for the METASIM framework.

We use the same techniques as for SAPIEN to achieve parallel-environment simulation.
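The following is a simplified sketch of this scatter/gather pattern using Python's multiprocessing module. The per-process simulator wrapper (LocalSim) and the overall structure are illustrative assumptions, not the actual Handler implementation.

import multiprocessing as mp

class LocalSim:
    """Stand-in for a per-process simulator backend (e.g., SAPIEN or PyBullet)."""
    def __init__(self, cfg): self.state = dict(cfg)
    def set_states(self, s): self.state.update(s)
    def get_states(self): return dict(self.state)
    def step(self): pass

def _worker(conn, env_cfg):
    sim = LocalSim(env_cfg)                 # one simulator instance per process
    while True:
        cmd, payload = conn.recv()
        if cmd == "set_states":
            sim.set_states(payload)
        elif cmd == "get_states":
            conn.send(sim.get_states())
        elif cmd == "step":
            sim.step()
        elif cmd == "close":
            conn.close()
            break

class ParallelHandler:
    """Forks one process per environment and scatters/gathers states."""

    def __init__(self, env_cfgs):
        self.conns, self.procs = [], []
        for cfg in env_cfgs:
            parent, child = mp.Pipe()
            p = mp.Process(target=_worker, args=(child, cfg), daemon=True)
            p.start()
            self.conns.append(parent)
            self.procs.append(p)

    def set_states(self, states):             # scatter: one state dict per environment
        for conn, s in zip(self.conns, states):
            conn.send(("set_states", s))

    def get_states(self):                     # gather: collect states from every process
        for conn in self.conns:
            conn.send(("get_states", None))
        return [conn.recv() for conn in self.conns]

    def step(self):
        for conn in self.conns:
            conn.send(("step", None))

    def close(self):
        for conn in self.conns:
            conn.send(("close", None))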
F. Hybrid Simulation Implementation

METASIM allows launching two simulators in one single process with one command. Taking our demo collection command as an example: python collect_demo.py --sim=mujoco --renderer=isaaclab --task=$task. The implementation is illustrated in Code 2.

class HybridEnv:
    def __init__(self, env_physic: Env, env_render: Env):
        # One environment provides physics, the other provides rendering.
        self.env_physic = env_physic
        self.env_render = env_render

    def step(self, action):
        # Step the physics backend with the commanded action.
        self.env_physic.handler.set_states(action=action)
        phys_states = self.env_physic.handler.get_states()
        # Mirror the physics states into the rendering backend and refresh its renderer.
        self.env_render.handler.set_states(states=phys_states)
        self.env_render.handler.refresh_render()
        # Return observations produced by the rendering backend.
        states = self.env_render.handler.get_states()
        return states

Code 2: Pseudocode for implementing hybrid simulation using two different simulator environments simultaneously. The core of this implementation is using states as a unified representation across both simulation environments.
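The key design choice here is that the Handler state dictionary acts as the common currency between backends: as long as both simulators can set and report the same named object and joint states, any physics engine can, in principle, be paired with any renderer without either side needing to know about the other. The main assumption is that the two backends agree on asset naming and units; the per-step overhead is one additional state copy between the two handlers.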
X. ASSET CONVERSION

A. Asset types

The diverse landscape of robotic assets, stemming from prior research initiatives [142, 48, 81] and a multitude of software platforms [111, 72, 122], necessitates a robust strategy for managing a wide array of file formats. To facilitate dependable cross-simulator training and uphold data integrity throughout the development lifecycle, the establishment of an efficient and reliable asset conversion pipeline is of paramount importance [26]. Such a pipeline is crucial for ensuring seamless interoperability, minimizing potential data loss or inaccuracies, and promoting the uniform application of metadata and configurations across disparate simulation environments. A selection of frequently encountered asset formats includes, but is not limited to, MuJoCo XML control files [111], URDF files [8], and USD files [85].

Three file formats predominate in robotics simulation: MJCF, URDF, and USD. Each serves distinct purposes and offers unique capabilities. MJCF (MuJoCo Configuration Format) stands out for its exceptional expressiveness in physics simulation, featuring sophisticated capabilities to model complex dynamical systems including tendons, actuators, and advanced joint configurations, along with an integrated compiler for handling complex compile-time computations [111]. URDF (Unified Robot Description Format), while more constrained in its feature set, has emerged as the de facto standard in robotics due to its remarkable cross-platform compatibility and universal adaptability across various simulation environments including Isaac Sim [85], Isaac Gym [72], MuJoCo [111], Gazebo, and PyBullet [16], making it ideal for robot model exchange despite its limitations in representing parallel mechanisms or complex sensor configurations [8]. USD (Universal Scene Description), originally developed by Pixar Animation Studios, excels in high-fidelity rendering and scene composition through its sophisticated layering system and variant sets [22], making it particularly valuable for applications requiring advanced visual properties and collaborative workflows [84], although its physics simulation capabilities are more limited compared to dedicated robotics formats like MJCF [26].

Features | MJCF | URDF | USD
Basic Geometries | ✓ | ✓ | ✓
Mesh Support | ✓ | ✓ | ✓
Texture Support | ✓ | Limited | ✓
Material Properties | ✓ | Basic | ✓
Physics Properties | ✓ | ✓ | Limited
Joint Types | Many | Basic | Basic
Collision Properties | Advanced | Basic | Advanced
Deformable Objects | ✓ | ✗ | ✓
Animation Support | Limited | ✗ | ✓
Scene Composition | Basic | ✗ | Advanced
File Format | XML | XML | ASCII/Binary

TABLE VII: Comparison of Robot Description Formats
B. Conversion Pipeline

Given that our simulation pipeline primarily utilizes Isaac Sim for rendering while many of our assets are originally stored in MJCF format, a two-stage conversion pipeline (MJCF → URDF → USD) becomes necessary and advantageous. This approach leverages URDF as an intermediate format for several reasons. First, while direct conversion from MJCF to USD is theoretically possible, such conversion would be complex and error-prone due to MJCF's rich feature set for physics properties (like tendons and actuators) that lack direct equivalents in USD [115]. Instead, converting to URDF first allows us to standardize the robot's basic kinematic and dynamic properties in a format that has well-established conversion tools and widespread support. The subsequent URDF to USD conversion benefits from Isaac Sim's robust URDF importing capabilities, which have been extensively tested and optimized for robotics applications. This two-stage pipeline thus ensures more reliable asset conversion while maintaining essential physical properties and compatibility across different simulation environments.

1) MJCF to URDF conversion: We implemented our own MJCF to URDF converter by first parsing everything with MuJoCo's MJCF importer, then exporting all texture, collision mesh, and joint information to the corresponding URDF format. The inspiration is taken from Genesis [2], which builds its own class for each asset object that encodes all joint, texture, and mesh information. We then recursively generate the body information for the URDF and align everything with textures.

a) Parsing Link, Joint, and Body Information from the MJCF file: To parse link, joint, and body information from the MJCF file, we leverage MuJoCo's parsing capabilities to load the MJCF XML into a MuJoCo model structure. From this parsed model, we employ a recursive approach, starting from the root body and descending into each child body to systematically process the hierarchical structure. For each body, we extract detailed link properties such as name, position, orientation, inertial characteristics, and associated geometry. Simultaneously, we parse joint information connected to each body, including joint type, limits, and axis of motion. All of this extracted link and joint data is systematically organized and stored in dictionary structures. These dictionaries serve as intermediate representations, holding all the necessary information from the MJCF model in a structured format that is readily accessible for subsequent stages of the URDF conversion process.

b) Aligning Meshes and Textures: The management of collision meshes across existing asset libraries presents a notable challenge, as these assets are typically stored in various formats including .msh, .obj, and .stl files. While URDF natively supports .obj and .stl formats, the conversion of .msh files into URDF-compatible formats requires careful consideration. Although MuJoCo's repository provides a conversion utility for transforming .msh files to .obj format—accomplished by parsing the .msh files through the MuJoCo interface and subsequently exporting vertex and face information—this approach introduces potential complications with texture mapping alignment.

The complexity arises from the specific requirements of texture files, which are predominantly stored as albedo PNG files. These textures depend on precise UV mapping coordinates within the .obj file to ensure proper alignment. The current .msh to .obj conversion utility provided in the MuJoCo repository does not adequately address texture support, leading to potential misalignment issues in the converted models. This limitation is particularly evident in comprehensive robotics frameworks such as LIBERO [65], where both static and articulated objects frequently exhibit texture alignment discrepancies following the .msh to .obj conversion process.

Fortunately, we discovered that many asset collections maintain redundant mesh representations, often including a properly UV-mapped .obj file alongside the .msh file, typically sharing the same filename or designated as "textured.obj". Leveraging this observation, we implemented a robust mesh alignment pipeline that follows a hierarchical decision process:
• First, the system searches for an existing .obj file within the same directory as the .msh file.
• If found, this pre-existing .obj file is utilized, ensuring proper texture alignment.
• In the absence of a pre-existing .obj file, the system proceeds with the .msh to .obj conversion.
• In the latter case, users receive a warning notification regarding potential texture misalignment issues.

Following the mesh format resolution, the pipeline systematically maps these processed mesh files back to their corresponding links within the URDF structure, maintaining the integrity of the robot's geometric representation while preserving texture information where possible.

c) Building URDF: The assembly procedure after all the conversions is straightforward: we first process the robot's links and joints, incorporating their properties and relationships into the URDF format. This automated approach ensures a robust and flexible method for generating URDF files, accommodating a wide range of robot configurations and properties derived from the preceding conversion steps.

Despite the general efficacy of the described pipeline across a broad spectrum of MJCF assets, it is important to acknowledge that certain MJCF files, particularly those within specific packages or directories, necessitate bespoke conversion strategies. These exceptions arise due to the inherent complexity and variability in MJCF file structures across different projects and asset libraries. To address these unique cases, we have adopted a tailored approach, implementing case-specific modifications to our conversion pipeline as required. The subsequent table details instances where such specialized treatment has been applied, along with the corresponding conversion success rates achieved for each package.

2) URDF to USD conversion: Isaac Sim has implemented a robust solution for converting URDF files to USD format. The conversion process comprehensively preserves the robot's structural and kinematic information, including joint hierarchies, geometric properties, and physical attributes. The implementation demonstrates exceptional fidelity in translating complex robotic descriptions, ensuring that all essential components—such as joint configurations, collision geometries, and visual representations—are accurately encoded in the resulting USD files.

Given the proprietary nature of Isaac Sim's conversion implementation, we utilize their framework as an external tool in our pipeline. This approach leverages the proven reliability and performance of Isaac Sim's converter while maintaining compatibility with our broader system architecture. The conversion process serves as a critical bridge between standard robotics formats and the high-performance USD representation required for our simulation environment.
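As a concrete illustration of the hierarchical decision process described in (b) above, the following is a small sketch of the mesh-resolution step. The file-naming conventions follow the text, while the convert_msh_to_obj helper is a hypothetical placeholder; the real pipeline is more involved.

import warnings
from pathlib import Path

def convert_msh_to_obj(msh_path: Path) -> Path:
    """Placeholder for the MuJoCo-based .msh -> .obj converter described above."""
    raise NotImplementedError

def resolve_mesh(msh_path: Path) -> Path:
    """Pick a URDF-compatible mesh for a given .msh file, preferring UV-mapped .obj files."""
    same_name = msh_path.with_suffix(".obj")       # e.g., bowl.msh -> bowl.obj
    textured = msh_path.parent / "textured.obj"     # common redundant representation

    for candidate in (same_name, textured):
        if candidate.exists():
            return candidate                        # pre-existing .obj keeps texture alignment

    warnings.warn(f"No UV-mapped .obj found next to {msh_path}; "
                  "converting .msh -> .obj, textures may be misaligned.")
    return convert_msh_to_obj(msh_path)             # fallback conversion with a warning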
XI. TASK AND DATA MIGRATION

A. ManiSkill

ManiSkill [81, 37, 109] provides a series of robotic manipulation tasks under single-arm or dual-arm settings.

Tasks and assets: We migrate basic single-arm tasks and demonstrations to RoboVerse, including pick-and-place tasks like PickCube and PickSingleYCB, as well as insertion tasks like PegInsertionSide and PlugCharger. The corresponding assets are manually crafted with primitives or processed from the mesh files, with the proper physics APIs set up.

Demonstrations: For each task, a great number of demonstration trajectories are available in the released data. Notably, the data does not come with the initial scene states, which are obtained by replaying the demonstrations within the SAPIEN simulator. With the specified seed set, the states are recovered by the random samplers. The success checkers are implemented according to the task designs.

B. RLBench

RLBench [48] is a large-scale benchmark and learning environment for robotic manipulation, featuring 100 diverse, hand-designed tasks ranging in complexity from simple actions like reaching to multi-stage tasks like opening an oven and placing a tray inside. Each task includes an infinite supply of demonstrations generated via waypoint-based motion planning.

Tasks and assets: We roll out ∼2K trajectories in RLBench [48] for each task and migrate them to ROBOVERSE.

C. CALVIN

CALVIN [79] provides 6 hours of teleoperation trajectories on 4 environments, each involving an articulated table with three blocks in blue, pink, or red.

Tasks and assets: We migrate the demonstrations in all 4 environments and transform the original assets (URDF for the table, and primitives for the cubes) into USD files with proper physics APIs.

Demonstrations: We segment the trajectories according to the text annotations, which specify the task category (e.g., PlaceInSlider), the text annotation (e.g., place the red block in the slider), and the timestamps of the demonstration segment. The states of the first frame are adopted as the initial scene states.

Success checkers: We carefully implement the success checkers according to the original implementation to make sure failed executions can be filtered out. This is necessary because the coarsely annotated timestamps in the dataset may cause failed executions in part of the demonstrations.

D. MetaWorld

MetaWorld [131] is a widely used benchmark for multi-task and meta-reinforcement learning, comprising 50 distinct tabletop robotic manipulation tasks involving a Sawyer robot.

Tasks and Assets: We integrate five representative tasks into RoboVerse: drawer open, drawer close, door close, window open, and window close. The corresponding assets are manually converted from MJCF to USD files with appropriate physics APIs.

Demonstrations: As the benchmark does not provide demonstrations, we generate trajectories for each task by rolling out reinforcement learning policies from [123].

E. Open6DOR

Open6DOR is a benchmark for open-instruction 6-DoF object rearrangement tasks, which requires embodied agents to move target objects according to open instructions that specify their 6-DoF poses.

Tasks and Assets: The synthetic object dataset comprises 200+ items spanning 70+ distinct categories. Originally derived from YCB [7] and Objaverse-XL [22], the objects are carefully filtered and scaled using a standardized mesh representation. Overall, the Open6DOR benchmark consists of 5k+ tasks, divided into the position-track, rotation-track, and 6-DoF-track, each providing manually configured tasks along with comprehensive and quantitative 3D annotations.

Success checkers: We determine success by comparing the target object's final pose with the annotated ground-truth pose range.

F. ARNOLD

ARNOLD [36] is a benchmark for language-conditioned manipulation. The benchmark uses motion planning and keypoints for robot manipulation tasks, focusing on fine-grained language understanding.

Tasks and Assets: We integrate six out of eight tasks from ARNOLD into RoboVerse: picking up objects, reorienting objects, opening/closing drawers, and opening/closing cabinets.

Demonstrations: As the benchmark does not use trajectory-level demonstrations, we use motion planning to generate trajectories by interpolating between keypoints.

G. RoboSuite & MimicGen

RoboSuite [142] provides a suite of task environments for robotic manipulation, built on the MuJoCo physics engine. Each task is implemented as a separate class, with most configuration details embedded in the source code. Based on these environments, MimicGen [76] offers thousands of demonstrations, serving as a widely used benchmark for imitation learning.

Tasks and Assets: For tasks with separate object description files (MJCF), we directly migrate the corresponding assets through our asset conversion pipeline. However, some tasks contain hard-coded assets within the source code, such as a hammer composed of multiple cubes, cylinders, and other primitives with carefully designed relative poses. To integrate these tasks, we will manually reconstruct the assets within our framework. We also argue that hard-coded asset and task definitions, as opposed to modular task descriptions, are not scalable for future robotic task benchmarking.

Demonstrations: We convert MimicGen demonstrations into our format. Specifically, we transform the robot actions from 6-DoF Cartesian-space representations to joint space. Additionally, the state of the first frame is adopted as the initial scene state.

Success Checkers: We meticulously implement success checkers based on the original definitions to ensure failed executions are effectively filtered out.
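Many of these migrated tasks reduce success checking to comparing an object's final pose against an annotated target range (as in Open6DOR above). A generic checker of that kind might look like the sketch below; the tolerance values and state format are illustrative assumptions, not the benchmark's actual thresholds.

import numpy as np

def pose_success(final_pos, final_quat, target_pos, target_quat,
                 pos_tol=0.05, rot_tol_deg=15.0):
    """Return True if the final pose lies within tolerance of the annotated target."""
    pos_err = np.linalg.norm(np.asarray(final_pos) - np.asarray(target_pos))

    # Rotation error from the quaternion inner product: theta = 2 * arccos(|<q1, q2>|).
    dot = abs(float(np.dot(np.asarray(final_quat), np.asarray(target_quat))))
    rot_err_deg = np.degrees(2.0 * np.arccos(np.clip(dot, -1.0, 1.0)))

    return pos_err <= pos_tol and rot_err_deg <= rot_tol_deg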
H. SimplerEnv

SimplerEnv is a set of tasks and methods designed for trustworthy benchmarking of manipulation policies in simulation, such that results reflect real-world success rates.

There are in total 25 different tasks in SimplerEnv. We ignore all tasks that are just a subset of another task and migrate in total 6 tasks and 52 object assets to ROBOVERSE. The tasks all use the Google Robot.

SimplerEnv provides controller models trained on the RT-1 [4] and RT-X [15] datasets. We did not use the trajectories from the dataset directly because some environmental settings differ from the SimplerEnv environments; instead, we used the trained models to collect trajectories. Hooks are inserted into the original SimplerEnv codebase to extract and maintain the recordings at different stages of simulation. We then roll out the model trained on the RT-1 dataset on each task to collect the trajectories.

I. GAPartNet

For tasks in GAPartNet [34], we generate both motion planning [34] and reinforcement learning [32] trajectories. GAPartNet is implemented in IsaacGym [72] with various articulated objects. To integrate it into RoboVerse, we first align all articulated object initial states to the MetaSim format and convert the asset format to USD for compatibility across different simulators.

For trajectory generation:
(1) Motion Planning: GAPartNet [34] introduces a part-centric manipulation approach. We roll out heuristics to generate manipulation trajectories, providing three demonstrations per part with different object and part initial states.
(2) Reinforcement Learning Rollout: The follow-up work, PartManip [32], proposes several reinforcement learning methods. We re-train all policies based on our robot setup and roll out trajectories for dataset collection. With aligned task configurations, trajectories, and assets, we successfully adapt GAPartNet into RoboVerse.

J. GAPartManip

Instead of providing direct demonstrations, GAPartManip [18] offers a large-scale, part-oriented, scene-level dataset with annotations for actionable interaction poses. We utilize the mesh-level grasping pose annotations in this dataset to generate diverse demonstrations for articulated object manipulation.

Tasks and Assets: We currently implement two tasks: OpenBox and OpenToilet. For the OpenBox task, we collect 12 object assets from the Box category in the original dataset. For the OpenToilet task, we gather 30 objects from the Toilet category. We convert these assets into USD files with appropriate physics APIs to ensure compatibility with our simulation environment.

Demonstrations: We generate demonstrations for our tasks in simulation using motion planning with CuRobo [106]. First, we filter potential grasping poses for the target object link by assessing their feasibility through motion planning. Specifically, we discard poses that the end-effector cannot reach or that would cause a collision between the robot and the object. Next, we generate an end-effector pose trajectory to complete the task using heuristics; based on the object's kinematic tree, we can define an ideal trajectory. We then apply motion planning to perform inverse kinematics, computing the corresponding joint poses of the robot along this trajectory. Finally, we execute the planned trajectory in simulation to verify task completion, saving successful trajectories as demonstrations. The entire demonstration generation process is conducted in IsaacSim [85].

Success Checkers: To determine task success, we require the manipulated object to be opened by at least 60 degrees for all tasks.

K. GraspNet-1B

GraspNet-1B [27] is a general object grasping dataset for predicting 6-DoF grasping poses given partial point cloud input. It contains 256 real-world tabletop scenes consisting of 88 different objects in total. We carefully filter out 58 objects as our target grasping objects based on the availability of purchasing the real items, because we need to evaluate our policies by grasping them in real-world experiments. To generate grasping demonstrations, we use CuRobo [107] as the motion planner to generate robot end-effector trajectories starting from a fixed initial pose and ending at a target object grasping pose. The grasping pose is obtained from the grasping annotations used to train GraspNet [27]. We also randomized the object positions to generate more diverse layouts. Finally, we validate the trajectories in our framework and filter out invalid ones by controlling robots to follow the generated grasping trajectories. In the end, we successfully generated about 100k valid grasping trajectories.

L. GarmentLab

GarmentLab [69] is the first robotic manipulation benchmark for deformable object and garment manipulation. It integrates 10 categories of versatile garment assets, and the total number of USD assets reaches 6k. To generate manipulation demonstrations, we directly roll out the trajectories provided by the official codebase in IsaacSim and collect the corresponding state information in a parallel process. Although the trajectories provided by the official codebase are limited and hard-coded, we further extend the number of demonstrations by applying different garments and textures, and all the demonstrations are validated by the original success checker. Finally, we have successfully collected 6k trajectories.
M. UniDoorManip

UniDoorManip [64] provides an articulated manipulation environment reflecting different realistic door manipulation mechanisms, and a large-scale door dataset containing 6 door categories with hundreds of door bodies and handles stored in URDF format. We convert those door assets into USD format with physics APIs from IsaacSim and further verify the correctness of the joint-link relationships manually. Demonstrations are collected by directly rolling out the hard-coded trajectories in IsaacGym. We eventually collect about 1k successful, valid demonstrations.

N. RLAfford

RLAfford [35] investigates the generalization ability of deep reinforcement learning models on articulated object manipulation tasks in the presence of a computer vision model that is co-trained with them in an end-to-end manner. This work provides a dataset of articulated objects and 8 tasks for benchmarking.

In RoboVerse, we have adapted 4 tasks (open cabinet, open drawer, close cabinet, close drawer) and in total 40k trajectories from RLAfford.

In the task adaptation, we included 40 articulated objects from the RLAfford dataset and use the same robot description file as RLAfford. We then record 1,000 trajectories for each object in its corresponding task.

The trajectory recording is achieved with several hooks we inserted into the original RLAfford codebase. The hooks are used to extract and maintain the recordings at different stages of simulation. We evaluated the released RLAfford model with hook-inserted scripts. In the initialization stage, objects and robots are initialized with randomization, and their poses and DoF information are recorded. For each simulation step, the DoF position information of objects and robots is recorded in the trajectories. In the end, for each object, a separate trajectory file of 1,000 different trajectories is saved in the RoboVerse-supported format.

O. LIBERO

LIBERO [65] manages data loading and task execution through a combination of INIT (initialization) files, BDDL (Behavior Domain Definition Language) files, and HDF5 datasets. Specifically, the initialization files define scene layouts, object properties, and basic task goals; the BDDL format captures semantic details and object affordances; and the HDF5 files store structured data such as object positions and robot actions for dynamic retrieval at runtime.

To migrate a LIBERO task into MetaSim, we parse the relevant BDDL file to identify which objects are involved and what type of manipulation context is required. Then we get the robot and object initial states from the INIT files, followed by the corresponding robot actions from the HDF5 dataset. These elements are combined into our PKL file format while also recording the participating objects in our MetaCfg. This process ensures that all necessary components of a LIBERO task, including initial states and action data, are fully translated and ready for execution in MetaSim.

We further augment the data by randomly sampling initial positions around each LIBERO demonstration, thus increasing the effective number of demos well beyond the original 50 per task. The spatial sampling range is dynamically chosen based on the task context and object dimensions, ensuring that the augmented configurations remain physically plausible.

XII. TASK GENERATION

A. Robot & Object Generation Protocol

Our task generation pipeline (Fig. 16) begins with a user prompt describing the desired theme or constraints of a robotic task (e.g., "place the butter in the drawer and close it"). From here, the system proceeds in two main phases, mediated by large generative model calls:

1) call_gpt_to_generate_task(): Conceptual Task Generation. This initial function queries the model for a high-level task overview. It requests:
• A unique task name (e.g., "ButterDrawerTask").
• A short, human-readable instruction (e.g., "Place the butter in the drawer, then close the drawer.").
• A candidate list of robots and objects to appear in the scenario, referencing an internal asset library (see below).
The large generative model draws on its generative abilities to propose creative or contextually relevant tasks, while remaining loosely guided by the user prompt [119, 118, 39, 140]. As shown in Fig. 16, the model might retrieve a "drawer" asset from a different benchmark and a "butter" asset from a separate dataset, combining them into a single scene idea.

2) call_gpt_to_get_init_state(): Physical Layout Refinement. After receiving the conceptual description, we call the model again to specify x, y coordinates for each listed item. During this second phase, the user can provide prompts that include minimal bounding constraints (e.g., permissible table edges, object height) to help the model generate various initial states by few-shot learning.

Asset Library. To ground the large generative model's outputs in realistic data, we maintain an asset library (via JSON files) that describes each robot or object's core attributes (e.g., asset filepath, default rotation, size). The two core functions above selectively pull from this library.

Input and Output Format.
• Input: A user prompt (e.g., "create a tabletop scene with a random container and a snack food"). The pipeline loads relevant asset definitions and passes them to the large generative model calls.
• Output: A merged init_state or "initial state" dictionary capturing the initial state config needed for simulation: the chosen robot/object list, each item's final x, y, z coordinates, and the textual instructions, as shown in the right half of Fig. 16.
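A condensed sketch of how these two phases could be chained is shown below. The function names mirror those mentioned above, but their signatures, the prompt contents, and the gpt_call helper are assumptions for illustration, not the actual pipeline code.

import json

def gpt_call(prompt: str) -> dict:
    """Placeholder for a large-generative-model API call that returns parsed JSON."""
    raise NotImplementedError

def call_gpt_to_generate_task(user_prompt: str, asset_library: dict) -> dict:
    """Phase 1: ask the model for a task name, instruction, and candidate asset list."""
    prompt = (f"Design a robotic manipulation task for: {user_prompt}\n"
              f"Available assets: {list(asset_library)}\n"
              "Return JSON with keys: task_name, instruction, robots, objects.")
    return gpt_call(prompt)

def call_gpt_to_get_init_state(task: dict, constraints: str) -> dict:
    """Phase 2: ask the model to place each listed item within the given bounds."""
    prompt = (f"Task: {json.dumps(task)}\nConstraints: {constraints}\n"
              "Return JSON mapping each robot/object name to an (x, y) position.")
    layout = gpt_call(prompt)
    return {"task": task, "init_state": layout}

# Usage (once gpt_call is wired to a real model endpoint):
#   with open("asset_library.json") as f:
#       assets = json.load(f)
#   task = call_gpt_to_generate_task("place the butter in the drawer and close it", assets)
#   scene = call_gpt_to_get_init_state(task, "objects must stay within the table surface")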
Method | PickCube (Simple) | MoveSliderLeft (Simple) | Object Set 1 (Grasping) | Object Set 2 (Grasping) | Object Set 3 (Grasping)
OpenVLA | 40.0 | 45.0 | 46.0 | 33.3 | 14.4
Octo | 50.0 | 30.0 | 42.0 | 14.4 | 2.2

TABLE VIII: Vision-Language-Action (VLA) Model Results on the ROBOVERSE Imitation Learning Benchmark. Constrained by time and resources, we report VLA models' results on two simple tasks from ROBOVERSE and language-conditioned grasping tasks with diverse and challenging instructions. We split the 58 objects in GraspNet into three sets, each containing progressively more challenging objects based on their geometry.

XIII. TELEOPERATION

Ensuring flexible and intuitive remote operation is critical in a robotic teleoperation system, particularly when collecting large volumes of high-quality data. In this work, we designed a suite of input methods to facilitate robot teleoperation within the METASIM infrastructure. By supporting keyboard, DualSense joystick, smartphone, and VR-based controls, our system accommodates varying user preferences and experimental needs. This section details our design rationale, implementation steps, and practical considerations for each control interface.
A. Keyboard

Keyboard input is an accessible method for controlling robots in simulation. Our implementation supports multi-key combinations for diagonal movement and enables full six-degree-of-freedom manipulation of the end effector. Translational movement follows the world coordinate frame (UP: +X, DOWN: -X, LEFT: +Y, RIGHT: -Y, 'e': +Z, 'd': -Z), while rotations in the local EE frame are controlled via 'q'/'w' (roll), 'a'/'s' (pitch), and 'z'/'x' (yaw). The spacebar toggles the gripper. To assist users and avoid hotkey conflicts with the simulation viewer, we provide an operation window displaying instructions using pygame. While efficient and hardware-independent, this method lacks 3D spatial representation, reducing user intuition. Additionally, Euler angle-based rotation control risks gimbal lock, potentially leading to loss of rotational degrees of freedom and failure in certain configurations.
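A minimal pygame-based sketch of this key mapping is shown below; the step sizes and the delta-command interface are illustrative assumptions rather than the actual METASIM teleoperation code.

```python
import numpy as np
import pygame

# Assumed per-tick increments; the real teleoperation node may use different values.
TRANS_STEP, ROT_STEP = 0.01, 0.05  # meters, radians

KEY_TO_DELTA_POS = {          # world-frame translation
    pygame.K_UP: (+TRANS_STEP, 0, 0), pygame.K_DOWN: (-TRANS_STEP, 0, 0),
    pygame.K_LEFT: (0, +TRANS_STEP, 0), pygame.K_RIGHT: (0, -TRANS_STEP, 0),
    pygame.K_e: (0, 0, +TRANS_STEP), pygame.K_d: (0, 0, -TRANS_STEP),
}
KEY_TO_DELTA_RPY = {          # local EE-frame rotation (roll, pitch, yaw)
    pygame.K_q: (+ROT_STEP, 0, 0), pygame.K_w: (-ROT_STEP, 0, 0),
    pygame.K_a: (0, +ROT_STEP, 0), pygame.K_s: (0, -ROT_STEP, 0),
    pygame.K_z: (0, 0, +ROT_STEP), pygame.K_x: (0, 0, -ROT_STEP),
}

def poll_keyboard(pressed):
    """Accumulate deltas over all held keys so multi-key combinations compose."""
    dpos, drpy = np.zeros(3), np.zeros(3)
    for key, delta in KEY_TO_DELTA_POS.items():
        if pressed[key]:
            dpos += delta
    for key, delta in KEY_TO_DELTA_RPY.items():
        if pressed[key]:
            drpy += delta
    return dpos, drpy  # pass e.g. pygame.key.get_pressed() as `pressed`
```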
B. Smartphone

Modern smartphones, equipped with advanced sensors and wireless communication, offer an ideal low-cost solution for intuitive teleoperation from any location. However, existing smartphone-based 6-DoF methods, such as those relying on accelerometers or vision-based Visual-Inertial Odometry (VIO) systems (e.g., ARKit), suffer from instability due to sensor noise, low update rates, or weak visual features [40, 73, 74, 75]. Additionally, no open-source Android app exists for such implementations. To overcome these limitations, we adopt a hybrid approach: using smartphone orientation for motion control and on-screen buttons for precise translation. Unlike the keyboard interface, where roll, pitch, and yaw are controlled incrementally via discrete keypresses (i.e., delta orientation adjustments), the smartphone directly provides absolute orientation data in the form of quaternions. Quaternions, due to their compactness and immunity to gimbal lock, allow for a more stable and accurate representation of the smartphone's orientation in the world frame. As illustrated in Fig. 18, real-time data from the smartphone's inclination, rotation, and magnetic field sensors is fused to compute spatial orientation with ±5° accuracy at a frequency of 50 Hz. This data is transmitted via WebSocket, ensuring low-latency communication. The app interface features six buttons for translation control in the local coordinate system and two switches for toggling orientation updates and gripper control. Multi-touch input is supported to enable users to send combined control signals, such as simultaneous movement along multiple axes, improving control flexibility and efficiency. As shown in Fig. 19 and Fig. 17, tilting the smartphone controls the gripper's orientation, while combining multi-touch signals from on-screen buttons enables precise and complex manipulation in 3D space. However, to mitigate magnetic interference, users should maintain a minimum distance of 10 cm from strong magnetic sources such as laptops and other electronic devices. This design optimizes resource utilization, providing a high-precision 6-DoF remote operation experience at minimal cost, rivaling professional-grade teleoperation systems.
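A compact sketch of the receiving side is shown below, using the `websockets` package; the JSON message fields, button layout, and target-pose update rule are assumptions about the app's protocol, not the actual implementation.

```python
import asyncio
import json
import websockets  # pip install websockets (handler signature assumes a recent version)

# Hypothetical message sent by the phone at ~50 Hz:
# {"quat": [w, x, y, z], "buttons": {"+x": 0/1, ...}, "gripper": 0/1}
async def handle_phone(ws, target_pose):
    async for message in ws:
        msg = json.loads(message)
        # Absolute orientation: overwrite the EE target quaternion directly.
        target_pose["quat"] = msg["quat"]
        # On-screen buttons produce incremental translation (assumed step size).
        step = 0.005  # meters per message
        axes = {"+x": (0, +1), "-x": (0, -1), "+y": (1, +1),
                "-y": (1, -1), "+z": (2, +1), "-z": (2, -1)}
        for axis, (idx, sign) in axes.items():
            if msg["buttons"].get(axis):
                target_pose["pos"][idx] += sign * step
        target_pose["gripper_closed"] = bool(msg["gripper"])

async def main():
    target_pose = {"pos": [0.4, 0.0, 0.3], "quat": [1, 0, 0, 0], "gripper_closed": False}
    async with websockets.serve(lambda ws: handle_phone(ws, target_pose), "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever; the controller reads target_pose
```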
C. Others

Beyond keyboard and smartphone controls, our system incorporates support for DualSense joysticks and VR controllers. The DualSense joystick provides ergonomic advantages and high-fidelity analog inputs for nuanced velocity control, mapping triggers and joysticks seamlessly to robot motion. The VR interface enhances spatial awareness and precision by enabling natural gestures and directional cues for control. Future work could extend the VR capabilities by integrating haptic feedback to improve user immersion and task accuracy. Additionally, the modular design of our system facilitates the integration of emerging input devices with minimal development effort.

XIV. REAL2SIM TOOLSET FOR ASSET AND TASK GENERATION

A. Overview

The Real2Sim toolset, specifically Video2URDF, provides a systematic pipeline to reconstruct environment geometry and robotic assets from monocular video input. By leveraging advanced reconstruction techniques, this pipeline produces meshes and unified robot descriptions that can be used in simulation-based experiments. In doing so, it helps bridge the gap between real-world data and simulated environments, enabling more accurate and comprehensive benchmarking [68].
[Fig. 16 panels: generated task instruction "Place butter in the drawer, then close the drawer"; table asset from CALVIN, butter asset from LIBERO, and the assets and composed scene in ROBOVERSE.]
Fig. 16: Illustration of the two-phase generation protocol. A user prompt guides the LLM to propose an overall task and item list. The system then refines object positions and merges them into a final initial state.
B. Components

1) Gaussian Splatting Reconstruction: The first step in the pipeline involves Gaussian splatting [53], which converts monocular video frames into a set of Gaussian kernels for rendering [130]. This representation captures key scene features such as depth, color, and collision boundaries in a compact and efficient way. As a result, it provides a visually faithful preview of the scene and serves as an intermediate step before detailed mesh reconstruction.

2) Mesh Reconstruction: Once the high-level scene structure is represented by Gaussian splatting, we perform mesh reconstruction to obtain a more precise geometric model using TSDF extraction [133, 128, 129, 45]. This step recovers the meshes of:
• The environment, including rigid, immovable structures (e.g., a table).
• The manipulable object, which is central to the task at hand.
• The robotic arm and end effector, assumed to have a deterministic configuration during real-to-sim and sim-to-real transitions.
We use a vision-language model (VLM) and available CAD design information to generate a unified URDF (or MJCF) description for these components. This division of the workspace follows the notion of worldconfig in curobo [107], ensuring that each element of the scene (robot, object, environment) is cleanly separated and can be easily adapted or replaced as needed.

3) Loading the URDF into the Simulation Environment: After the URDF (or MJCF) files are generated, the final step is to import them into a simulator, such as MuJoCo [111] in RoboVerse. This allows researchers to configure tasks that accurately reflect real-world scenarios, forming a benchmark for training and evaluating robotic manipulation algorithms. The resulting simulated environment benefits from high-fidelity geometry and a consistent representation of the physical workspace.
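For instance, a generated MJCF scene can be loaded and stepped with the MuJoCo Python bindings as sketched below; the file name is a placeholder for the pipeline's output.

```python
import mujoco

# "real2sim_scene.xml" stands in for the MJCF produced by the Video2URDF pipeline.
model = mujoco.MjModel.from_xml_path("real2sim_scene.xml")
data = mujoco.MjData(model)

# Step the physics for one second of simulated time to sanity-check the asset.
for _ in range(int(1.0 / model.opt.timestep)):
    mujoco.mj_step(model, data)
print("qpos after 1 s:", data.qpos)
```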
4) Real-to-Sim Boosts Sim-to-Real Performance: We train a model on our Real2Sim module and compare it with DexGraspNet [134], demonstrating an 80% success rate compared to the 50% baseline from DexGraspNet. We use our Real2Sim assets in physics-based simulations that closely replicate real-world grasping conditions, enabling robust grasp execution. See Fig. 20 for a visualization.

C. Limitations and Challenges

While the Real2Sim pipeline effectively reconstructs most of the relevant geometry, it struggles with completely unseen meshes and complex material properties [139]. Furthermore, parameters such as friction and mass are inherently difficult to estimate purely from visual data, introducing uncertainties that may affect simulation fidelity. Despite these challenges, Real2Sim offers a powerful approach to rapidly generating simulation-ready assets for benchmarking in robotic manipulation tasks.
Fig. 17: Sequential demonstration of smartphone-based control for stack cube and close box tasks.
XVI. NAVIGATION AND LOCOMOTION TASKS

A. Navigation Tasks

To integrate vision-and-language navigation into IsaacSim, we first correct the error-containing instructions by refining incorrect punctuation and grammar using ChatGPT. Next, we validate the ground-truth trajectory by sweeping the robot's 3D model (based on the ground-truth trajectory) through the scene; the trajectory is deemed invalid if collisions occur between the robot and the scene. Additionally, we adopt the same evaluation metrics as VLN-CE [58]. For controlling the robot, we provide two different types of mobile embodiments, including a Unitree Go2 robot dog and a JetBot wheeled robot, making our task suitable for a variety of policies (with different navigation capabilities).
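A schematic version of this collision sweep is given below; the collision-checking call and scene interface are hypothetical placeholders, not the actual IsaacSim API.

```python
def trajectory_is_valid(scene, robot_mesh, waypoints, clearance=0.0):
    """Sweep the robot's 3D model along the ground-truth trajectory.

    `scene.in_collision(mesh, pose, margin)` is an assumed helper (e.g., backed by a
    collision library such as trimesh/FCL); the real check runs inside IsaacSim.
    """
    for pose in waypoints:                      # each pose places the robot model
        if scene.in_collision(robot_mesh, pose, margin=clearance):
            return False                        # any contact invalidates the path
    return True
```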
B. Humanoid Tasks

We migrated the data samples from the Humanoid-X dataset [77] and re-implemented the inference pipeline of UH-1 [77] in our framework. We use the Unitree H1-2 humanoid robot as the simulated embodiment and set up the locomotion and humanoid pose control tasks in our framework. The humanoid pose control task is to control the humanoid robot to follow given human poses while maintaining its stability on the ground. The demonstrated poses in our framework include arms crossing, boxing, dancing, left and right punches, playing violin, playing guitar, praying, waving to a friend, etc. Our pretrained policy can successfully follow the demonstrated poses to control a humanoid robot while maintaining stable locomotion in IsaacGym, and it also obtains decent performance in IsaacLab. The humanoid environment and task configurations are highly flexible and scalable, and we are able to support more humanoid pose control tasks from Humanoid-X without modifying the infrastructure.

C. HumanoidBench

HumanoidBench [102] is a high-dimensional simulated benchmark designed to accelerate research in humanoid robot learning, focusing on whole-body locomotion and manipulation tasks. The benchmark features a humanoid robot equipped with dexterous hands, enabling a wide range of complex interactions in human-like environments.
Tasks and Assets: We migrate three fundamental locomotion tasks: run, walk, and stand. These tasks are designed to test the robot's ability to maintain balance, achieve forward motion, and stabilize in a standing position. The primary robot model used is the Unitree H1, augmented with two dexterous Shadow Hands, though the environment supports other humanoid models such as the Unitree G1 and Agility Robotics Digit.
Demonstrations: While HumanoidBench does not provide pre-collected demonstrations, it supports the use of reinforcement learning algorithms to generate task-specific policies. The benchmark is designed to facilitate learning from scratch, with dense and sparse reward structures to guide the learning process.
Success Checkers: Each task in HumanoidBench is equipped with a success checker that evaluates task completion based on predefined criteria. For example, in the walk task, success is determined by the robot's ability to maintain a forward velocity of 1 m/s without falling, while in the stand task, success is measured by the robot's ability to maintain a stable upright posture for a specified duration.
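A minimal success checker in the spirit of these criteria is sketched below; the state fields and fall thresholds mirror the description above but are illustrative rather than HumanoidBench's exact implementation.

```python
def walk_success(state, min_forward_vel=1.0, min_torso_height=0.8):
    """Walk task: keep moving forward at >= 1 m/s without falling."""
    moving = state["base_lin_vel_x"] >= min_forward_vel
    upright = state["torso_height"] >= min_torso_height  # assumed proxy for "not fallen"
    return moving and upright

def stand_success(history, min_torso_height=0.8, hold_steps=100):
    """Stand task: maintain a stable upright posture for a specified duration."""
    recent = history[-hold_steps:]
    return len(recent) == hold_steps and all(
        s["torso_height"] >= min_torso_height for s in recent
    )
```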
Experiment and Results: We trained the walk, stand, and run tasks in both the RoboVerse MuJoCo and IsaacLab simulators using the PPO and TD-MPC2 [41, 42] algorithms, and compared the results with the HumanoidBench baseline based on the original MuJoCo environment. As shown in Figure 22 and Figure 23, the training curves from the RoboVerse MuJoCo simulator eventually converged and approached the performance of HumanoidBench, validating the feasibility of the RoboVerse reinforcement learning infrastructure. Additionally, we trained the same tasks in the RoboVerse IsaacLab simulator with identical configurations. While training efficiency in IsaacLab was comparatively lower under non-parallelized settings (to maintain configuration consistency), it still demonstrated a clear upward trend in reward accumulation. This confirms the rapid migration capability of the MetaSim framework and highlights its potential to enable sim-to-sim learning while leveraging the strengths of different simulators, such as IsaacLab's support for GPU-accelerated large-scale parallel training.

XVII. ROBOVERSE BENCHMARK SETUP DETAILS

A. Generalization Levels
Fig. 21: Navigation gallery. We deploy the Unitree Go2 robot within Matterport 3D environments, primarily integrated with the ROBOVERSE Isaac Lab branch. The robot is tasked with navigating the environment based on provided instructions.
[Training-curve panels: Return vs. Environment Steps (up to 1e6) for MuJoCo PPO, MuJoCo TD-MPC2, and IsaacLab TD-MPC2, alongside the MuJoCo PPO and MuJoCo TD-MPC2 baselines.]
To systematically evaluate the generalization capability of a robot policy, we establish a benchmark based on a carefully curated asset set designed for domain randomization. This asset set encompasses a diverse range of environmental factors, including materials, textures, lighting conditions, scene configurations, and camera perspectives. By leveraging this set, we assess how well different policies generalize to unseen conditions. Specifically, we split the available assets into a 9:1 ratio for training and testing, ensuring that the testing environment contains novel variations not encountered during training. Below, we detail the key components of this domain randomization setup:
• Table, Ground, and Wall. In tasks where a predefined scene is absent, we incorporate walls (and ceilings) to introduce structural complexity. Additionally, customizable tables are included for tasks requiring tabletop interactions. The visual materials applied to these elements are randomly sampled from a carefully curated subset of ARNOLD [36] and vMaterials [84], ensuring a diverse range of appearances. The table features approximately 300 distinct material options, while both the wall and ground have around 150 material choices each. This variation enhances the robustness of the learned policy by exposing the model to a wide spectrum of surface appearances and textures.
• Lighting Conditions. We introduce two distinct lighting scenarios, distant lighting and cylinder light arrays, each designed to test the adaptability of the learned policy to different illumination conditions.
  – Distant Light: The polar angle of the light source is randomized within a predefined range, influencing the way shadows and reflections appear in the scene.
  – Cylinder Light Arrays: A randomized n×m matrix of cylinder lights, varying in size and intensity, is placed at a fixed height above the agent.
  In both configurations, light intensity and color temperature are randomly varied within reasonable limits to ensure that the model encounters a broad range of lighting effects.
• Camera Poses. To further evaluate the robustness of visual perception, we carefully select 59 candidate camera poses, strategically positioned to provide diverse viewpoints. The majority of these cameras are oriented directly towards the robot, ensuring consistent frontal perspectives, while a subset is placed at side-facing angles to introduce additional viewpoint variability.
• Reflection Properties. To simulate the wide range of reflective surfaces encountered in real-world environments, we randomize key material reflection properties, including roughness, specular intensity, and metallic characteristics.
Fig. 23: Demonstration of TD-MPC2 policies trained in the RoboVerse MuJoCo simulator on the Walk and Stand tasks migrated from the HumanoidBench benchmark.
These properties are adjusted within reasonable physical ranges to ensure that the robot policy learns to handle various levels of surface reflectivity.
By integrating these domain randomization techniques into our benchmark, we create a controlled yet diverse testing environment that challenges the generalization ability of different robot policies. This setup ensures that trained policies are not merely overfitting to a limited set of conditions but are instead capable of adapting to a broader range of real-world variations.
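A compressed sketch of such a randomization configuration is shown below; the dictionary keys, value ranges, and sampling helpers are illustrative assumptions rather than the benchmark's actual configuration schema.

```python
import random

# Hypothetical pools; in practice these index the curated ARNOLD / vMaterials
# subsets and the 59 candidate camera poses described above.
TABLE_MATERIALS = [f"table_mat_{i}" for i in range(300)]
WALL_MATERIALS = [f"wall_mat_{i}" for i in range(150)]
CAMERA_POSES = [f"cam_{i}" for i in range(59)]

def sample_randomization():
    return {
        "table_material": random.choice(TABLE_MATERIALS),
        "wall_material": random.choice(WALL_MATERIALS),
        "ground_material": random.choice(WALL_MATERIALS),
        "lighting": random.choice([
            {"type": "distant", "polar_angle_deg": random.uniform(0, 60)},
            {"type": "cylinder_array", "rows": random.randint(2, 4),
             "cols": random.randint(2, 4), "intensity": random.uniform(500, 5000)},
        ]),
        "color_temperature_K": random.uniform(3000, 7500),
        "camera_pose": random.choice(CAMERA_POSES),
        "reflection": {
            "roughness": random.uniform(0.1, 0.9),
            "specular": random.uniform(0.0, 1.0),
            "metallic": random.uniform(0.0, 1.0),
        },
    }
```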
B. RoboVerse Benchmark Protocol

We rigorously design a training and evaluation protocol to ensure a structured and reliable assessment of the policy's performance. Given the training data, the policy learns to imitate the demonstrated behavior. For evaluation, we provide a standardized API that enables systematic assessment. As mentioned earlier, the training and evaluation follow a 9:1 ratio, ensuring that the policy is tested on novel scenarios not encountered during training.

XVIII. POLICY TRAINING DETAILS

A. Implementation Details

For specialist models, we train from scratch with actions in the 9-dimensional robot joint state space. Diffusion Policy [13] is implemented based on its original framework. We search several key hyperparameters, including observation and prediction length, to optimize performance for our tasks. ACT [138] is implemented with the original architecture and hyperparameters, except that the batch size is increased to 512, with the learning rate correspondingly enlarged to 1e-4 to accelerate convergence. We train ACT on one A100 GPU for 2,000 epochs and evaluate the best checkpoints on the validation set.
For generalist models, the action is pre-processed from absolute end-effector position space into delta end-effector position space, and the gripper action is binarized to {0, +1}. Owing to the lack of time and resources, we are only able to fine-tune the generalist models in the single-task setting. For each task, OpenVLA [56] is LoRA [44] fine-tuned (rank = 32) on 8 A100 GPUs under the official settings until convergence, reaching over 95% action token accuracy as proposed by Kim et al. [56] during the training stage. During evaluations, we employ Curobo [106] as the inverse-kinematics solver to transform the actions into the robot joint state space.
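The sketch below shows one way to perform this pre-processing on a trajectory of absolute end-effector positions; the array layout and gripper-closing threshold are assumptions for illustration.

```python
import numpy as np

def to_delta_ee_actions(abs_ee_pos, gripper_width, close_threshold=0.04):
    """Convert absolute EE positions (T, 3) to per-step deltas and binarize the gripper.

    The (T, 3) layout and the 0.04 m closing threshold are illustrative assumptions.
    """
    abs_ee_pos = np.asarray(abs_ee_pos)
    delta = np.diff(abs_ee_pos, axis=0, prepend=abs_ee_pos[:1])  # first delta is zero
    gripper = (np.asarray(gripper_width) < close_threshold).astype(np.float32)  # {0, +1}
    return np.concatenate([delta, gripper[:, None]], axis=1)
```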
B. Diffusion Policy

We implemented the training and validation code for Diffusion Policy based on the requirements of our tasks and relevant research papers.
Modeling Diffusion Policy as a Denoising Diffusion Probabilistic Model (DDPM), we train a noise-prediction network

$\hat{\epsilon}_k = \epsilon_\theta(a_k, s, k)$    (1)

that takes in noisy actions $a_k$, the current observation $s$, and the denoising iteration $k$, and predicts the noise $\hat{\epsilon}_k$.
As for the observation $s$, we use a ResNet-18 to extract features $f_{img}$ of the scene images and a 3-layer MLP to extract features $f_{robot}$ of the robot joint states. The concatenation of $f_{img}$ and $f_{robot}$ serves as the conditioning input for Diffusion Policy.
During training, we randomly choose a denoising step $k$ and sample noise $\epsilon_k$ that is added to the unmodified sample $a_0$. The training loss is the difference between $\epsilon_k$ and the predicted noise:

$L_{DP} = \mathrm{MSELoss}(\epsilon_k, \hat{\epsilon}_k)$    (2)

During inference, the policy starts from random actions $a_K$ and denoises for $K$ steps to obtain the final action predictions. At each step, the action is updated following

$a_{k-1} = \alpha \left( a_k - \gamma\, \epsilon_\theta(a_k, s, k) + \mathcal{N}(0, \sigma^2 I) \right)$,    (3)

where $\alpha$, $\gamma$, and $\sigma$ are hyperparameters of the noise schedule.
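A condensed PyTorch sketch of the training and sampling loops corresponding to Eqs. (1)-(3) is given below; the network sizes, noise schedule, and conditioning dimensions are illustrative choices, and the actual implementation follows the original Diffusion Policy codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 100  # number of denoising steps (illustrative)

class NoisePredictor(nn.Module):
    """epsilon_theta(a_k, s, k): predicts the noise added to the action."""
    def __init__(self, action_dim=9, cond_dim=512 + 64):
        super().__init__()
        self.k_embed = nn.Embedding(K, 64)
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 64, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, a_k, cond, k):
        # cond is the concatenation of image and joint-state features (f_img, f_robot).
        return self.net(torch.cat([a_k, cond, self.k_embed(k)], dim=-1))

# Simple DDPM-style linear noise schedule (assumed values).
betas = torch.linspace(1e-4, 2e-2, K)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, a0, cond):
    """Eqs. (1)-(2): noise the action at a random step k and regress the noise."""
    k = torch.randint(0, K, (a0.shape[0],))
    eps = torch.randn_like(a0)
    ac = alphas_cumprod[k].unsqueeze(-1)
    a_k = ac.sqrt() * a0 + (1 - ac).sqrt() * eps       # noised action
    return F.mse_loss(model(a_k, cond, k), eps)

@torch.no_grad()
def sample_actions(model, cond, action_dim=9):
    """Eq. (3): start from Gaussian noise and denoise for K steps."""
    a = torch.randn(cond.shape[0], action_dim)
    for k in reversed(range(K)):
        kk = torch.full((cond.shape[0],), k, dtype=torch.long)
        eps_hat = model(a, cond, kk)
        alpha = 1.0 - betas[k]
        ac = alphas_cumprod[k]
        a = (a - (betas[k] / (1 - ac).sqrt()) * eps_hat) / alpha.sqrt()
        if k > 0:
            a = a + betas[k].sqrt() * torch.randn_like(a)
    return a
```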
XIX. WORLD MODEL DETAILS

A. Methodology

We adopt a video generation framework based on Latte [71], a transformer-driven latent diffusion model equipped with an efficient spatial-temporal attention mechanism. For action conditioning, we use frame-level Adaptive Layer Normalization [89] (AdaLN), following insights from IRASim [141] showing that frame-level conditioning yields more precise control of the gripper than video-level conditioning.
In the forward pass, raw video frames are encoded using a frozen autoencoder from Stable Diffusion [90]. The first frame serves as the initial condition, while noise is introduced into the latent representations of subsequent frames during training. Both the noise schedule and the action conditions (gripper states with either Cartesian position plus orientation or joint positions) are encoded by separate MLPs into latent space and then added together.
These noisy latent frames are then fed into a transformer composed of alternating spatial and temporal attention blocks, where the action conditions are applied at each frame via AdaLN. For inference, we employ DDIM [105] as the denoising scheduler, using 200 sampling steps.
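A minimal sketch of frame-level AdaLN conditioning is shown below; tensor shapes and layer sizes are assumptions chosen for clarity, not the actual Latte/IRASim configuration.

```python
import torch
import torch.nn as nn

class FrameAdaLN(nn.Module):
    """Adaptive LayerNorm whose scale/shift are regressed from a per-frame action embedding."""
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(action_dim, 2 * hidden_dim)

    def forward(self, x, action_emb):
        # x:          (B, T, N, D)  latent tokens, N tokens per frame
        # action_emb: (B, T, A)     one action embedding per frame (frame-level conditioning)
        scale, shift = self.to_scale_shift(action_emb).chunk(2, dim=-1)  # (B, T, D) each
        return self.norm(x) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

# Usage: each frame's action embedding modulates only the tokens of that frame.
ada = FrameAdaLN(hidden_dim=512, action_dim=64)
tokens = torch.randn(2, 16, 300, 512)   # a batch of 16-frame latent videos
actions = torch.randn(2, 16, 64)        # per-frame encoded action conditions
out = ada(tokens, actions)
```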
B. Data Preparation

The DROID [54] dataset's episodes typically last from 120 to 360 frames. To amplify motion, we keep every 6th frame, effectively reducing the frame rate to 4 fps and yielding sequence lengths from 20 to 60. In the RoboVerse simulation, we adjust the control frequency so that most episodes span 20 to 60 frames, mirroring the per-episode frame counts of DROID. We filter out any sequence shorter than 20 or longer than 60 frames, resulting in about 50,000 unique episodes from DROID.
We generate only 50,000 unique RoboVerse episodes due to time and resource constraints; the full-scale RoboVerse dataset is planned for training more capable world models in future work. We exclude the gripper camera view because the model struggles with drastic camera pose changes, which leads to poor frame generation quality. Since we consider the left and right camera views as separate samples, each dataset effectively doubles to 100,000 samples.
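The subsampling and filtering described above reduce to a few lines; the episode representation here (a list of frames) is an assumption for illustration.

```python
def prepare_episode(frames, stride=6, min_len=20, max_len=60):
    """Keep every `stride`-th frame (120-360 frames -> 20-60), then length-filter."""
    clip = frames[::stride]
    if min_len <= len(clip) <= max_len:
        return clip
    return None  # episode is discarded

# Example: a 240-frame DROID episode becomes a 40-frame training sequence.
episodes = [list(range(240)), list(range(90))]
kept = [c for c in (prepare_episode(e) for e in episodes) if c is not None]
print([len(c) for c in kept])  # [40]
```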
C. Experiments

Our experiments involve training on three datasets, DROID-50K, RoboVerse-50K, and DROID-RoboVerse-100K, using 8 NVIDIA H100 GPUs. We use a spatial resolution of 240×320 and sequences of 16 frames per episode. Starting with a model of 100M parameters and a batch size of 16, training converges at around 100K steps on RoboVerse and 200K steps on DROID.
We first compare Cartesian position plus orientation against joint positions as action conditions and find that using joint positions yields more precise gripper movement control in frame generation, as shown in Fig. 25. We believe this is because joint positions are a less ambiguous robot state representation than Cartesian position plus orientation.
However, generation quality remains suboptimal when training on the DROID-50K or DROID-RoboVerse-100K datasets and validating on DROID samples, due to the complexity of DROID scenes. Scaling the model to 500M parameters and reducing the batch size to 8 leads to better preservation of object geometry and better prediction of robot arm movement. As discussed in the main paper, although the larger model trained on DROID-RoboVerse-100K shows an improved understanding of object shapes in DROID samples compared to the model trained on DROID-50K, it still struggles with intricate real-world physics. In contrast, training with RoboVerse-50K or DROID-RoboVerse-100K and validating on RoboVerse scenes produces more physically and geometrically consistent predictions.
We believe this is because RoboVerse offers cleaner backgrounds, more comprehensive views of the robotic arm, and the implementation of domain randomization and augmentation. By comparison, many DROID frames contain cluttered backgrounds or incomplete arm visibility, creating challenges for learning robust temporal dynamics from raw pixels.
Fig. 24: Visualization of Sim-to-Sim-to-Real Experiments.