
ROBOVERSE: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning

Haoran Geng1*, Feishi Wang1,2,3*, Songlin Wei2*, Yuyang Li2*, Bangjun Wang3*, Boshi An2*,
Charlie Tianyue Cheng1*, Haozhe Lou3, Peihao Li1,4, Yen-Jen Wang1, Yutong Liang2, Dylan Goetting1,
Chaoyi Xu2, Haozhe Chen5, Yuxi Qian6, Yiran Geng2, Jiageng Mao3, Weikang Wan2, Mingtong Zhang3,
Jiangran Lyu2, Siheng Zhao3, Jiazhao Zhang2, Jialiang Zhang1,2, Chengyang Zhao7, Haoran Lu2,
Yufei Ding1,2, Ran Gong8, Yuran Wang2, Yuxuan Kuang2,3, Ruihai Wu2, Baoxiong Jia9, Carlo Sferrazza1,
Hao Dong2, Siyuan Huang9, Koushil Sreenath1, Yue Wang3†, Jitendra Malik1†, Pieter Abbeel1†

1UC Berkeley  2PKU  3USC  4UMich  5UIUC  6Stanford  7CMU  8UCLA  9BIGAI
* equal contribution   † equal advising   Correspondence to: Haoran Geng <[email protected]>

[Fig. 1 panels: Diverse Tasks and Demonstrations; Multi-Simulator Support; Cross-Embodiment.]

Fig. 1: ROBOVERSE comprises a scalable simulation platform, a large-scale synthetic dataset, and unified benchmarks. The simulation platform supports seamless integration of new tasks and demonstrations through unified protocols, ensuring flexibility and extensibility. The dataset includes over 1,000 diverse tasks and more than 10 million transitions, constructed through large-scale data migration, cross-embodiment transfer, and robust augmentation and randomization.

Abstract—Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing reliable evaluation protocols. Collecting real-world robotic data is resource-intensive and inefficient, while benchmarking in real-world scenarios remains highly complex. Synthetic data and simulation offer promising alternatives, yet existing efforts often fall short in data quality, diversity, and benchmark standardization. To address these challenges, we introduce ROBOVERSE, a comprehensive framework comprising a simulation platform, a synthetic dataset, and unified benchmarks. Our simulation platform supports multiple simulators and robotic embodiments, enabling seamless transitions between different environments. The synthetic dataset, featuring high-fidelity physics and photorealistic rendering, is constructed through multiple approaches, including migration from public datasets, policy rollout, and motion planning, enhanced by data augmentation. Additionally, we propose unified benchmarks for imitation learning and reinforcement learning, enabling consistent evaluation across different levels of generalization. At the core of the simulation platform is METASIM, an infrastructure that abstracts diverse simulation environments into a universal interface. It restructures existing simulation environments into a simulator-agnostic configuration system, as well as an API aligning different simulator functionalities, such as launching simulation environments, loading assets with initial states, and stepping the physics engine. This abstraction ensures interoperability and extensibility. Comprehensive experiments demonstrate that ROBOVERSE enhances the performance of imitation learning, reinforcement learning, and world model learning, improving sim-to-real transfer. These results validate the reliability of our dataset and benchmarks, establishing RoboVerse as a robust solution for advancing simulation-assisted robot learning.

I. INTRODUCTION

Large-scale datasets, combined with well-established benchmarks, have fueled rapid advancements in natural language processing (NLP) [93, 5] and computer vision (CV) [23, 59, 57, 95, 67, 43]. Specifically, large-scale data provides ample training examples that bolster learning, while uniform benchmarks enable standardized evaluation and fair comparison across different methods. However, replicating these successes in robotics remains challenging due to the difficulty of collecting high-quality, diverse data and the lack of widely recognized evaluation protocols.

Real-world approaches [15, 54] to constructing datasets and benchmarks, though authentically reflecting the complexities of operational environments, face significant practical constraints. First, collecting demonstrations is time-consuming and resource-intensive, and the resulting data is often hardware-dependent or modality-specific, limiting its adaptability to new scenarios. Additionally, establishing standardized and widely applicable benchmarks is inherently challenging since reproducing identical conditions for fair comparisons is nearly impossible. For instance, object placements can vary across rollouts, ambient lighting fluctuates under natural sunlight, and background environments may change. Consequently, scaling real-world datasets, evaluating policies, and iterating development in real-world scenarios remain cost-prohibitive and difficult to standardize.

Simulators, on the other hand, present a promising alternative for large-scale dataset and benchmark construction. By providing efficient computation, synthetic assets, and omniscient information in reproducible settings, they enable cost-effective dataset construction and consistent performance evaluation. Recent works, exemplified by [135, 50, 10, 33, 98, 124, 70], have demonstrated the potential of simulation-based methods in various robotic tasks. Despite these advantages, several challenges impede the broader adoption of synthetic datasets and benchmarks. First, utilizing simulators often demands considerable expertise due to both the complexity of simulator design and the relative immaturity of many platforms, which complicates the data construction process. Second, simulators vary widely in their internal architectures and external interfaces, making it laborious to transfer data and models or adapt workflows from one to another. Consequently, reusing existing synthetic datasets and benchmarks is difficult, resulting in a fragmented ecosystem that further hinders convenient construction and effective use of large-scale data in simulation environments.

To fully harness the potential of simulation in robotics, we introduce ROBOVERSE, a scalable simulation platform that unifies existing simulators under a standardized format and a single infrastructure, a large-scale synthetic dataset, and unified benchmarks. To achieve this, we first propose METASIM, the core infrastructure of ROBOVERSE. Through careful design, METASIM establishes a universal configuration system for agents, objects, sensors, tasks, and physics parameters while exposing a simulator-agnostic interface for simulation setup and control. This architecture enables seamless integration of tasks, assets, and robot trajectories from diverse simulation environments with minimal adaptation effort. METASIM provides three key capabilities: (1) Cross-Simulator Integration: enables seamless switching between different simulators, fostering unified benchmarking and facilitating the transfer of environments and demonstrations across platforms. (2) Hybrid Simulation: combines the strengths of multiple simulators, such as pairing advanced physics engines with superior renderers, to generate scalable and high-quality synthetic data. (3) Cross-Embodiment Transfer: allows the retargeting of trajectories across various robot arms with parallel grippers, maximizing dataset reuse from heterogeneous sources.

METASIM enables ROBOVERSE to systematically enhance the workflow for building and scaling simulation environments and datasets. Our method features:
• Scalable and Diverse Data Generation: By aligning multiple benchmarks and task trajectories and leveraging a robust multi-source integration and data filtering pipeline, we generate large-scale, high-quality datasets. Additionally, our data randomization and augmentation pipeline enhances data diversity and volume, further enriching the dataset for comprehensive model training;
• Realistic Simulation and Rendering: With METASIM's hybrid simulation capability, we enable the fusion of advanced physics engines and rendering systems across multiple simulators and renderers. Combined with carefully curated scenes, materials, and lighting assets, ROBOVERSE enhances realism in physical interactions and sensory observations;
• Unified Benchmarking and Evaluation: We unify widely used benchmarks into a cohesive system, streamlining algorithm development and performance comparison within a structured evaluation framework. Additionally, we introduce a standardized benchmarking protocol to assess varying levels of generalization and sim-to-real transferability;
• High Extensibility and Scalability: The aligned APIs and infrastructure streamline development and enable efficient algorithm integration, testing, and deployment across diverse simulation environments. Additionally, we develop real-to-sim frameworks, multiple teleoperation methods, and AI-generative systems for scalable task and data creation.

Leveraging these workflows in ROBOVERSE, we construct the largest and most diverse high-quality synthetic dataset and benchmark to date, all in a unified format. This dataset includes ∼500k unique, high-fidelity trajectories covering 276 task categories and ∼5.5k assets. Additionally, we generate over 50 million high-quality state transitions to support policy learning.
Beyond dataset and benchmark construction, we explore the potential of ROBOVERSE through extensive experiments on imitation learning (Sec. VI-B), reinforcement learning (Sec. VI-C), and world model learning (Sec. VI-E). Our results demonstrate that ROBOVERSE enables reliable policy learning and evaluation, supports strong sim-to-sim (Sec. VI-G) and sim-to-real transfer (Sec. VI-F) via high-fidelity physics and rendering, and facilitates efficient data expansion through teleoperation (Sec. ??), trajectory augmentation (Sec. IV-D1), domain randomization (Sec. IV-D2) and generative models (Sec. IV-C). These findings highlight the framework's robustness, scalability, and real-world applicability.

[Fig. 2 diagram: ROBOVERSE = Simulation Platform (with METASIM at its core) + High-Quality Dataset + Unified Benchmarks.]

Fig. 2: ROBOVERSE consists of a simulation platform, a large-scale, high-quality dataset, and unified benchmarks. At the core of the simulation platform is METASIM, the infrastructure of ROBOVERSE. Powered by METASIM, the simulation platform facilitates dataset creation and benchmark construction.

II. RELATED WORK

A. Robotics Simulators

Advancements in computer graphics have contributed to the development of high-fidelity simulators, which are widely used in robotics research and development. CoppeliaSim [97], Bullet [16], and MuJoCo [111] provide accurate physics simulations and are extensively utilized in applications such as reinforcement learning and robotic benchmarking [3, 126, 87, 14]. More simulators have been developed to fully exploit parallelism for better efficiency. IsaacGym [72], IsaacSim [85], SAPIEN [37, 109], MuJoCo MJX [111, 132], and Genesis [2] utilize GPU power for enhanced performance, enabling large-scale reinforcement learning and efficient data collection, significantly improving training speed and scalability. Some simulators focus on bridging the simulation-reality gap (Sim-to-Real Gap), incorporating technologies including ray-tracing and customized renderers for photo-realistic rendering [85, 109]. Furthermore, IsaacSim [85] and Genesis [2] offer high-fidelity soft-body and liquid simulation, expanding the scope of realistic robotic interactions. ROBOVERSE proposes a unified platform that supports multiple simulators, facilitating seamless transitions between them and enabling hybrid integration to utilize the strengths of each simulator.

B. Large-Scale Robotics Dataset

The scarcity of large-scale, high-quality, and diverse datasets in the robotics community has long been recognized. Several works have shown the possibility of collecting demonstration data directly on real robots. RoboNet [20] is a large-scale manipulation dataset containing roughly 162k trajectories from multiple robot platforms. DROID [54] has collected over 76k contact-rich robotic manipulation demonstrations across 86 tasks. RH20T [28] proposed a dataset with over 100k demonstrations and 147 tasks. At the same time, RT-1 [4] set the record further to 130k demonstrations on over 700 tasks. Recently, Open X-Embodiment [15] has demonstrated a promising approach to unite the community's efforts, collecting over 1M trajectories on 160,266 tasks with 22 different embodiments. At this stage, real-world datasets became difficult to scale up due to the proportional effort and cost required to collect more demonstrative trajectories.

Simulation-based data collection provides a promising solution to the high cost and inefficiencies of real-world datasets. Hussing et al. [46] proposed a dataset containing 256M transitions on 256 tasks for offline compositional reinforcement learning. RoboCasa [82] introduced a dataset of 100 tasks and over 100k trajectories for generalist robots. DexGraspNet-2.0 [134] has collected over 400M demonstrations for dexterous grasping. Despite these efforts, synthetic datasets often exist in disparate simulators, leading to a fragmented ecosystem with limited diversity and quality. Moreover, simulation-based data often fails to capture complex physics and diverse task variations found in the real world [63, 26], potentially causing overfitting to specific simulators and hampering generalization to real-world scenarios.

ROBOVERSE provides a unified solution for large-scale, high-quality, and diverse synthetic data. It enables agents to train on a large set of environments and simulators to reduce overfitting, thereby improving the robustness of the learned policies.

C. Benchmarking in Robotics

Benchmarking remains a critical yet highly challenging problem in the robotics community. Compared to supervised learning tasks, it is relatively difficult to evaluate the performance of a robotics model. MetaWorld [131] is an early attempt in multi-task benchmarking. This is followed by RLBench [48], Behavior-1k [62], Habitat [108], and ManiSkill [81, 37, 109, 103], covering a large variety of robotic tasks. Grutopia [116] and InfiniteWorld [96] make a leap toward general-purpose robot benchmarking.

Despite significant efforts dedicated to these benchmarks, it is not guaranteed that the results are reproducible across different benchmarks. The uncertainty comes from multiple aspects including simulation accuracy, rendering style and asset properties [63, 26]. To address these challenges, ROBOVERSE enables researchers to evaluate their policies across multiple benchmarks and simulators seamlessly, without familiarizing themselves with each one individually.
[Fig. 3 diagram: the METASIM three-layer architecture: MetaConfig (agents, objects, tasks, sensors, physics); a Handler over aligned simulator backends (Isaac Lab, Isaac Gym, MuJoCo, SAPIEN, Genesis, Bullet, CoppeliaSim); and a gym.Env wrapper (step, reset, render, close). Task, asset, and trajectory sources (migration, self-designed tasks, generative AI, real-to-sim, teleoperation, RL rollout, motion planning) together with trajectory augmentation and domain randomization feed the high-quality dataset and unified benchmarks, providing cross-simulator integration, hybrid simulation, and cross-embodiment transfer.]

Fig. 3: METASIM provides a universal configuration system, aligned simulator backends, and a Gym [112] environment wrapper. This three-layer architecture abstracts simulation environments into simulator-agnostic specifications and aligns simulator backends, enabling three key capabilities: cross-simulator integration, hybrid simulation and cross-embodiment transfer. Based on METASIM, we build a pipeline to collect tasks, assets and trajectories from diverse public sources in a unified format, employ data augmentation methods, and ultimately generate a large-scale high-quality dataset along with unified benchmarks. This data pipeline forms the foundation of ROBOVERSE, facilitating the generation of large-scale datasets and construction of unified benchmarks.
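The middle layer in Fig. 3 aligns heterogeneous simulator backends behind a common handler. As a preview of the interface described in Sec. III-B2 and used by Code 1, the following is a minimal Python sketch of what such an aligned backend interface could look like; the method names (launch, get_states, set_states, step, render, close, get_extra) follow the ones named in the paper, while the signatures, docstrings, and the concrete subclasses mentioned in the trailing comment are illustrative assumptions rather than the actual ROBOVERSE API.

    from abc import ABC, abstractmethod

    class Handler(ABC):
        """Simulator-agnostic backend interface (illustrative sketch)."""

        def __init__(self, config):
            self.config = config  # a MetaConfig-style scenario description

        @abstractmethod
        def launch(self):
            """Start the simulator and load agents, objects, and sensors."""

        @abstractmethod
        def get_states(self):
            """Return the full simulation state (poses, joints, sensor data)."""

        @abstractmethod
        def set_states(self, states=None, action=None):
            """Write states and/or apply an action before stepping."""

        @abstractmethod
        def step(self):
            """Advance the physics engine by one control step."""

        @abstractmethod
        def render(self):
            """Return rendered sensor observations (e.g., RGB images)."""

        @abstractmethod
        def close(self):
            """Shut down the simulator and release resources."""

        def get_extra(self):
            """Optional simulator-specific diagnostics."""
            return {}

    # Each supported backend would provide its own concrete handler, e.g.
    # class IsaacLabHandler(Handler): ...
    # class MujocoHandler(Handler): ...

Because every backend exposes the same handler surface, the environment wrapper and the data pipeline above it never need to know which simulator is actually running.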

III. INFRASTRUCTURE: METASIM

A. METASIM Overview

We present METASIM, a high-level interface above specific simulation environment implementations. It is also the core infrastructure of ROBOVERSE. As illustrated in Fig. 2, METASIM empowers the ROBOVERSE simulation platform, allowing for the generation of a large-scale, high-quality dataset, as well as the construction of a unified benchmark.

[Fig. 4 diagram: MetaConfig (agents, objects, sensors, physics) nesting PhysicsConfig (gravity, collision, friction) and TaskConfig (task instructions, success_metrics, reward_funcs).]

Fig. 4: The MetaConfig is a nested dataclass that abstracts the core components in any simulation environment in a simulator-agnostic way.

B. METASIM Implementation

As illustrated in Fig. 3, METASIM employs a three-layer architecture including a universal configuration system, a simulator-agnostic interface, and a user-friendly environment wrapper. The universal configuration system unifies specifications for a simulation scenario and ensures a consistent format across simulators. The simulator-agnostic interface interprets these specifications, translates them into simulator-specific commands, and therefore aligns different simulator backends. In addition, the environment wrappers encapsulate the simulator-agnostic interface into a standardized learning environment, such as a Gym [112] environment. We describe each layer in more detail in the following sections.

1) Universal Configuration System: A typical simulation environment comprises agents, objects, tasks, sensors, and physics parameters. They collectively define who performs the actions (agents), what the environment looks like (objects), what the agents should do (tasks, including instructions, success metrics, and rewards), how the environment is perceived and measured (sensors), and the governing physical laws (physics parameters). Ideally, these components should be simulator-agnostic, requiring a unified standard of simulation scenarios. Such a standard would enable researchers to work across different simulators seamlessly and integrate existing efforts from the community through cross-simulation.

Based on such a principle, we design a configuration system, MetaConfig, to abstract simulation scenarios in a simulator-agnostic way. As illustrated in Fig. 4, MetaConfig is a nested class that contains the above-mentioned core components. It can be interpreted by different simulator backends to build the corresponding simulation. Additionally, MetaConfig supports optional simulator-specific hyperparameters (e.g., solver type), allowing users to fully leverage the unique features of different simulators through customization.
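To make the nested structure of Fig. 4 concrete, the following is a minimal sketch of how such a configuration could be expressed with Python dataclasses. The field names mirror those shown in Fig. 4; the concrete types and defaults are illustrative assumptions rather than the actual ROBOVERSE definitions.

    from dataclasses import dataclass, field

    @dataclass
    class PhysicsConfig:
        # Governing physical laws, independent of any particular engine.
        gravity: tuple = (0.0, 0.0, -9.81)
        collision: bool = True
        friction: float = 1.0

    @dataclass
    class TaskConfig:
        # What the agents should do and how success is judged.
        instructions: str = ""
        success_metrics: list = field(default_factory=list)
        reward_funcs: list = field(default_factory=list)

    @dataclass
    class MetaConfig:
        # Simulator-agnostic description of a full scenario.
        agents: list = field(default_factory=list)      # robots / embodiments
        objects: list = field(default_factory=list)     # scene assets
        sensors: list = field(default_factory=list)     # cameras, etc.
        physics: PhysicsConfig = field(default_factory=PhysicsConfig)
        task: TaskConfig = field(default_factory=TaskConfig)
        sim_params: dict = field(default_factory=dict)  # optional simulator-specific knobs, e.g. solver type

A backend-specific handler can then read the same MetaConfig instance and translate it into native simulator calls, with sim_params carrying any engine-specific overrides.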
    class Env:
        def __init__(self, handler):
            self.handler = handler
            self.handler.launch()

        def reset(self):
            self.handler.set_states()
            states = self.handler.get_states()
            return get_observation(states), \
                   self.handler.get_extra()

        def step(self, action):
            self.handler.set_states(action=action)
            self.handler.step()
            states = self.handler.get_states()
            return get_observation(states), \
                   get_reward(states), \
                   get_success(states), \
                   get_termination(states), \
                   get_time_out(states), \
                   self.handler.get_extra()

        def render(self):
            return self.handler.render()

        def close(self):
            self.handler.close()

Code 1: Pseudocode for the gym.Env implementation. Each method of gym.Env is implemented by calling the corresponding methods of the Handler class.

2) Aligned Simulator Backends: Different simulators have their own implementations and specializations. However, routine operations – such as initializing a scene, loading objects, stepping the physics engine, retrieving observations, time management, and determining success states – tend to follow similar patterns. To standardize these shared operations, we create a unified interface through a Handler class. Each simulator has its own handler instance implementing this interface. The handler class implements the common methods, including launch(), get_states(), and set_states(), spanning the whole lifecycle of simulating a task. The usage of the APIs is illustrated in Code 1. More information is provided in the supplementary materials.

3) User-Friendly Environment Wrapper: The Gym API [112] is a widely adopted paradigm in reinforcement learning and robotics, in which the gym.Env class is fundamental to building learning environments. We define a wrapper to easily transform a Handler into an environment equipped with Gym APIs (step(), reset(), render(), and close()). As shown in Code 1, these methods are implemented by leveraging the underlying Handler methods.

C. METASIM Capabilities

METASIM offers the following three key capabilities.

1) Cross-Simulator Integration: Seamlessly switching between different simulators, allowing tasks and trajectories from one simulator to be utilized in other simulators. This capability enables efficient task and trajectory integration, unified benchmark construction, and sim-to-sim transfer for reinforcement learning training. For example, tasks from MetaWorld [131] can be used by IsaacGym [72] for fast parallel training, after which the generated trajectories can be deployed in IsaacSim [85] for rendering.

2) Hybrid Simulation: METASIM supports combining the physics engine of one simulator and the renderer of another simulator at the same time, allowing users to benefit from the advantages of different simulators. Specifically, using a single command, one could pair a simulator with a powerful renderer (e.g., IsaacSim [85]) with a simulator that has an accurate physics engine (e.g., MuJoCo [111]) to form an even more powerful simulation, enabling high-quality data generation (see the illustrative sketch at the end of Sec. IV-B).

3) Cross-Embodiment Transfer: Reusing trajectories across different gripper-based robot morphologies by retargeting the end-effector pose, which allows the integration of data collected from diverse robots into a unified format.

IV. ROBOVERSE DATASET

A. Dataset Overview

On top of METASIM, we generate a large-scale, high-quality dataset by incorporating multiple data collection methods. Overall, there are three key data types to collect: tasks, assets, and robot trajectories. The main source of these data is migration from existing simulation environments. Beyond migration, we explore various methods to collect these data, such as using large language models to generate new tasks, leveraging the real-to-sim toolset [68] to reconstruct assets from the real world, and using teleoperation to collect new trajectories. Additionally, we leverage data augmentation methods for both trajectories and visual observations. Finally, we report the statistics for the current progress of data migration in ROBOVERSE.

B. Tasks, Assets and Trajectories Collection: Migration

Leveraging the RoboVerse format and infrastructure, we seamlessly integrate a wide range of benchmarks and datasets into our system with a unified format and clean codebase. We apply the following approaches to collect tasks and demonstrations.

• Direct Migration from Other Simulation Environments. Some benchmarks provide the essential components for integration into ROBOVERSE. We define environment configurations for task initialization and evaluation, then convert trajectory data and asset formats for seamless compatibility. Notably, ROBOVERSE streamlines this migration process by first aligning formats in the original simulator and automatically ensuring compatibility across all simulators.

• Motion Planning and RL Rollout. When benchmarks provide only partial manipulation data, such as keypoint trajectories or grasping poses, we use motion planning to generate complete trajectories. If no explicit manipulation data is available but pre-existing policies or reinforcement learning frameworks exist, we either utilize these policies or train new ones to collect demonstration data through rollouts. To ensure high data quality and consistency with our system standards, we carefully adapt the success checker and rigorously filter both planned and collected trajectories.
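As an illustration of the cross-simulator and hybrid-simulation capabilities described in Sec. III-C, the sketch below shows how a single scenario configuration might drive different backends, or one backend for physics and another for rendering. It reuses the illustrative MetaConfig and Handler sketches above and the Env wrapper from Code 1; all class and variable names here are assumptions for exposition, not the actual ROBOVERSE API.

    # Illustrative only: assumed names, not the actual ROBOVERSE API.
    cfg = MetaConfig(task=TaskConfig(instructions="pick up the cube"))

    # Cross-simulator integration: the same config, two different backends.
    mujoco_env = Env(MujocoHandler(cfg))    # accurate physics, fast stepping
    isaac_env = Env(IsaacLabHandler(cfg))   # photorealistic rendering

    # Hybrid simulation: step physics in one simulator, render in another.
    class HybridHandler(Handler):
        def __init__(self, config, physics_handler, render_handler):
            super().__init__(config)
            self.physics = physics_handler
            self.renderer = render_handler

        def launch(self):
            self.physics.launch()
            self.renderer.launch()

        def step(self):
            self.physics.step()
            # Mirror the physics state into the rendering backend.
            self.renderer.set_states(states=self.physics.get_states())

        def get_states(self):
            return self.physics.get_states()

        def set_states(self, states=None, action=None):
            self.physics.set_states(states=states, action=action)

        def render(self):
            return self.renderer.render()

        def close(self):
            self.physics.close()
            self.renderer.close()

    hybrid_env = Env(HybridHandler(cfg, MujocoHandler(cfg), IsaacLabHandler(cfg)))

The key design point is that the hybrid combination lives entirely inside one handler, so policies and data-collection scripts interact with the same gym-style Env regardless of which engines run underneath.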

With the techniques mentioned above, we migrated multiple existing manipulation datasets into ROBOVERSE. Currently, we support ManiSkill [81, 37, 109], RLBench [48], CALVIN [79], MetaWorld [131], RoboSuite [142], MimicGen [76], GAPartNet [34], Open6DOR [24], ARNOLD [36], LIBERO [65], Simpler [63], GraspNet [27], GarmentLab [69], and UniDoorManip [64].

We also integrated datasets from a wider range of embodiments, including dexterous hands, quadrupeds, and humanoids, covering tasks such as dexterous manipulation, locomotion, navigation, and whole-body control. Currently, we have migrated VLN-CE R2R [58] and RxR [60] for navigation, as well as HumanoidBench [102] and Humanoid-X [77] for locomotion and whole-body control.

RoboVerse simplifies and standardizes the migration process, and we will continue to maintain and expand it.

Fig. 5: Teleoperation System. ROBOVERSE supports various user-friendly teleoperation approaches. Currently, it enables teleoperation via a phone app (second row), motion capture (middle), VR devices (bottom left), as well as keyboard and joystick (bottom right). These methods allow control of robotic arms, dexterous hands, and bimanual systems across different simulators.

C. Tasks, Assets and Trajectories Collection: Teleoperation and Generation

• Teleoperation System for Trajectory Collection. As shown in Fig. 5, ROBOVERSE integrates teleoperation systems within the METASIM infrastructure, offering a flexible and efficient solution for high-quality data collection. It supports various robotic systems, including arms, dexterous hands [88], and bimanual setups, enabling seamless teleoperation across different simulators. To mitigate the high cost and complexity of professional equipment, we introduce an interactive motion control system utilizing accessible devices such as keyboards, joysticks, mobile apps (we developed a new app for Android and iOS to control robotic arms; see the supplementary materials for more details), motion capture (Mocap) [114], and VR systems [12, 92]. These devices' integrated sensors capture motion data, allowing natural, gesture-based control along with real-time, high-frequency communication for precise, low-cost remote operation. Further details are provided in the supplementary materials.

• AI-Assisted Task Generation. Leveraging the generalization capability of large generative models, AI-assisted task generation provides a mechanism to diversify task varieties and scenario distribution. By learning from example placements, it acquires a sense of spatial and semantic constraints [1] (e.g., by demonstrating specific constraints, it can learn to spread out objects to avoid potential overlap). It can arrange objects originally from different benchmarks into physically plausible scenes based on METASIM, as shown in Fig. 6. By incorporating randomization in robot and object selection [52] along with their initial poses, large generative models can generate various initial states. The system can automatically output all the required configuration files in a unified format for instant visualization and user-friendly editing. After task generation, we apply a two-step filtering process to avoid errors and hallucinations (see the illustrative sketch after this list): (1) Format Validation: tasks that fail to meet ROBOVERSE format standards are discarded. (2) Feasibility Check: since trajectory data is collected via human teleoperation, tasks deemed unreasonable by the teleoperator are removed. By unleashing the extrapolative and few-shot learning abilities of large generative models, we integrate assets under a uniform schema automatically, driving task generation that spans multiple simulators and benchmarks.

• Real-to-Sim for Asset Construction. Video-based reconstruction proves to be a valuable source for data and asset creation by leveraging real-to-sim techniques. Our approach integrates multiple reconstruction pipelines to extract high-fidelity assets from video data. First, we initialize the structure using Colmap [99, 100] and employ Gaussian Splatting [53] for high-quality rendering. Next, we infer physical properties by feeding both semantic and original images into a Vision-Language Model (VLM) [140]. For geometry reconstruction, we estimate surface normals from video [129], apply surfel splatting [45], and utilize TSDF-based methods with dynamic filtering to reconstruct detailed meshes [128]. By leveraging semantic masks [95], we selectively extract components from both Gaussian and mesh representations. To further enhance realism, we infer and learn object kinematics directly from video [66], ensuring accurate motion representations. Finally, we formulate URDF models by refining key attributes such as coordinate frames, orientation, axis alignment, scale, relative 6-DoF poses, and PD control parameters [68]. This pipeline effectively bridges the gap between real-world video data and simulation-ready assets, enhancing robotic learning and simulation fidelity. We also present comparative experiments in the supplementary materials, demonstrating that our methods significantly enhance real-world policy performance.
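The two-step filtering used in AI-assisted task generation can be pictured as a small validation pass over the generated configuration files before any teleoperation effort is spent on them. The sketch below is an assumption-labeled illustration: the field names and the two helper functions are hypothetical and not part of the ROBOVERSE codebase; only the two-step structure (format validation, then a human feasibility check) comes from the description above.

    def validate_generated_task(task_cfg: dict) -> bool:
        """Step 1 - Format Validation (illustrative): reject configs that do not
        match the unified schema before they reach a human teleoperator."""
        required_keys = {"agents", "objects", "task", "physics"}  # hypothetical schema
        if not required_keys.issubset(task_cfg):
            return False
        if not task_cfg["task"].get("instructions"):
            return False
        # Objects must reference known assets and carry an initial pose.
        for obj in task_cfg["objects"]:
            if "asset_id" not in obj or "init_pose" not in obj:
                return False
        return True

    def filter_generated_tasks(candidates, feasibility_labels):
        """Step 2 - Feasibility Check: keep only tasks a human reviewer marked as
        reasonable. feasibility_labels maps task name -> bool from that review."""
        kept = []
        for cfg in candidates:
            if validate_generated_task(cfg) and feasibility_labels.get(cfg.get("name"), False):
                kept.append(cfg)
        return kept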
Fig. 6: AI-Assisted Task Generation. RoboVerse supports an AI-assisted task generation framework that leverages large generative models' extrapolation capabilities to generate non-trivial and semantically rich tasks. Combined with our teleoperation system, it enables the generation of diverse and high-quality data. (Example generated tasks: "Place butter in the drawer, then close the drawer"; "Put basket into the box, then put milk into the basket"; "Stack tomato sauce on top of cup, then stack chocolate pudding on top of the sauce"; "Place butter, cream cheese, and chocolate pudding in a line, then knock them over like dominoes".)

Fig. 7: Real-to-Sim Tools. We use a mobile device to capture multi-view images, reconstruct a high-quality mesh, build a URDF using a VLM, and then perform actions in both RoboVerse and the real world. (Panels: multi-view images; 3D GS and TSDF reconstruction; fully reconstructed mesh and URDF; execution in the real world; control in MetaSim (Genesis).)

D. Data Augmentation

1) Trajectory Augmentation: With the unified simulation interface and data format, ROBOVERSE enables significantly more efficient data augmentation and supports advanced augmentation techniques. Beyond the visual randomization detailed in the Benchmark Protocol [8], we also provide robust trajectory-space augmentation. We offer an API to generate large-scale robot trajectory datasets from a limited number of source demonstrations. Following the MimicGen [76] framework, for most tasks, we can decompose them into a sequence of object-centric subtasks (S_1(o_{S_1}), S_2(o_{S_2}), ..., S_M(o_{S_M})), where the robot's trajectory within each subtask S_i(o_{S_i}) is relative to a single object's coordinate frame (o_{S_i} ∈ O, where O is the set of objects in the task M). Additionally, we assume that the sequence of subtasks in each task is predefined. By leveraging this minimal human annotation regarding the order of subtasks, we can efficiently divide each source demo into contiguous object-centric manipulation segments {τ_i}_{i=1}^{M} (each of which corresponds to a subtask S_i(o_{S_i})) using a simulator, and then generate extensive trajectory datasets for various task variants (in our case: variations in the initial and goal state distributions of objects (D) and robots (R)) using MimicGen [76]. This approach has been shown to significantly benefit generalization in imitation learning [76, 50, 117, 31, 82], particularly in scenarios where the number of source demonstrations is limited. For further details, please refer to the supplementary materials.

2) Domain Randomization: We implement domain randomization in the IsaacLab [85] handler of MetaSim. This involves four types of randomization:
• Table, Ground, and Wall. Walls (and ceilings) can be added for tasks that lack a predefined scene. Customizable tables can also be included for tasks that are performed on tabletops. The visual materials for these elements are randomly selected from a curated subset of ARNOLD [36] and vMaterials [84]. The table has ∼300 material options, while the wall and ground each have around ∼150 material options.
• Lighting Condition. Two types of lighting scenarios can be specified: distant light and cylinder light arrays. For distant light, the light's polar angles are randomized. For cylinder light, a random n × m matrix of cylinder lights with random size is added at a fixed height above the agents. In both scenarios, the intensity and color temperature of the lights are randomized within a reasonable range.
• Camera Poses. We carefully select 59 candidate camera poses, with the majority positioned to face the robot directly and a smaller subset placed at side-facing angles.
• Reflection Properties. The roughness, specular, and metallic properties of each surface are randomized within reasonable ranges.
These randomization options can be freely combined. For example, a scene can include a customized table, walls with a ceiling, and a set of cylinder lights to simulate an indoor environment. For details, please refer to the supplementary materials.

E. ROBOVERSE Dataset

1) Dataset Statistics:

a) Manipulation Dataset: We migrate diverse manipulation datasets from existing source benchmarks [81, 37, 109, 48, 79, 131, 142, 76, 34, 24, 36, 65, 63, 35, 27, 69, 64, 18] into ROBOVERSE. The number of task categories, trajectories, and assets contributed by each source benchmark is summarized in Tab. I. In total, this migration results in 276 task categories, 510.5k trajectories, and 5.5k assets. Representative tasks with rich domain randomization are shown in Fig. 8.

b) Navigation Dataset: We migrate vision-and-language navigation (VLN) tasks into ROBOVERSE. Note that there exist various VLN tasks with different settings; here, we particularly
focus on VLN in continuous environments (VLN-CE) [58], as it more closely resembles real-world scenarios [11, 136, 137]. Specifically, we construct our dataset based on ROBOVERSE by integrating MatterPort3D scenes [9] (90 scenes) and off-the-shelf instructions from R2R [58] (10k episodes) and RxR [60] (20k episodes). We provide two types of mobile embodiments, including the Unitree Dog (a legged robot) and the JetBot (a wheeled robot), which support different control policies. A detailed elaboration on the navigation dataset is provided in the supplementary materials.

c) Humanoid Dataset: We migrate HumanoidBench [102] tasks for reinforcement learning benchmarks and integrate tasks, policies, and data samples from Humanoid-X [77] and SkillBlender [61]. Additionally, we re-implement the UH-1 inference pipeline within our framework. The pretrained policy successfully enables humanoid robots to follow demonstrated poses while maintaining stable locomotion across multiple simulators based on ROBOVERSE.

Fig. 8: Dataset Comparison and Gallery. Left: other representative synthetic robotics datasets (RoboSuite, CALVIN, RLBench, ManiSkill). Right: the ROBOVERSE dataset.

TABLE I: Migration progress statistics for manipulation tasks in ROBOVERSE.

Source Benchmark | Source Simulator | # Task Categories | # Trajectories | # Assets
ManiSkill [81, 37, 109] | SAPIEN | 6 | 19k | 1.7k
RLBench [48] | CoppeliaSim | 80 | 150k | 100
CALVIN [79] | PyBullet | 7 | 20k | 7
MetaWorld [131] | MuJoCo | 5 | 5k | 6
RoboSuite [142] & MimicGen [76] | MuJoCo | 6 | 6k | 12
GAPartNet [34] | IsaacGym | 4 | 4k | 151
Open6DOR [24] | IsaacGym | 69 | 10k | 207
ARNOLD [36] | IsaacSim | 6 | 3k | 30
LIBERO [65] | MuJoCo | 10 | 15k | 15
Simpler [63] | SAPIEN | 6 | 30k | 52
RLAfford [35] | IsaacGym | 4 | 40k | 40
GraspNet [27] | - | 58 | 200k | 42
GarmentLab [69] | IsaacSim | 6 | 6k | 3k
UniDoorManip [64] | IsaacGym | 7 | 1k | 140
GAPartManip [18] | IsaacSim | 2 | 1.5k | 42
Total | - | 276 | 510.5k | 5.5k

V. ROBOVERSE BENCHMARK

A. Benchmark Overview

With the collected tasks, assets, and trajectories, RoboVerse establishes standardized benchmarks for robot learning, including both imitation learning and reinforcement learning. We define a unified training and evaluation protocol within the RoboVerse platform and implement standardized baselines and learning frameworks for benchmarking. Specifically, for imitation learning, we introduce different levels of generalization benchmarks to assess the generalization capability of models.

Fig. 9: Benchmark Protocol. We define a four-level generalization benchmarking protocol, allocating 90% of the data for training and 10% for generalization evaluation. From left to right, Levels 0 to 3 correspond to task space generalization, environment randomization, camera randomization, and lighting and reflection randomization, respectively.
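The randomization dimensions behind Levels 1-3 (scene materials, lighting, camera poses, reflection properties, Sec. IV-D2) can be thought of as a single options object that is sampled once per episode. Below is a small, assumption-labeled sketch of such an options structure; the class, field names, value ranges, and sampling helper are illustrative and not the actual ROBOVERSE implementation, but the randomized quantities mirror the ones listed in Sec. IV-D2.

    import random
    from dataclasses import dataclass

    @dataclass
    class RandomizationOptions:
        # Illustrative sketch; not the actual ROBOVERSE API.
        table_material: str = "default"      # ~300 curated options per the paper
        wall_material: str = "default"       # ~150 curated options
        ground_material: str = "default"     # ~150 curated options
        light_type: str = "distant"          # "distant" or "cylinder_array"
        light_intensity: float = 1000.0
        light_color_temperature: float = 6500.0
        camera_pose_id: int = 0              # index into 59 annotated candidate poses
        roughness: float = 0.5
        specular: float = 0.5
        metallic: float = 0.0

    def sample_randomization(material_library, num_camera_poses=59, rng=random):
        """Draw one randomized configuration for a single episode (illustrative)."""
        return RandomizationOptions(
            table_material=rng.choice(material_library["table"]),
            wall_material=rng.choice(material_library["wall"]),
            ground_material=rng.choice(material_library["ground"]),
            light_type=rng.choice(["distant", "cylinder_array"]),
            light_intensity=rng.uniform(500.0, 3000.0),
            light_color_temperature=rng.uniform(3000.0, 8000.0),
            camera_pose_id=rng.randrange(num_camera_poses),
            roughness=rng.uniform(0.1, 0.9),
            specular=rng.uniform(0.1, 0.9),
            metallic=rng.uniform(0.0, 0.5),
        )

In the benchmark protocol, Level 0 keeps all of these fields fixed, while Levels 1-3 progressively unlock the scene, camera, and lighting/reflection fields.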

B. Imitation Learning Benchmark

For each imitation learning benchmark, we establish a standardized evaluation framework with a fixed set of demonstrations and a controlled evaluation environment. Policies must be trained exclusively on the provided training data and assessed within this environment to ensure fair comparison. To rigorously test generalization capability, we curate training data from specific domains and evaluate policies on unseen samples, challenging their adaptability to novel scenarios. We systematically categorize visual generalization factors into multiple levels, including task space generalization, environment setup generalization, camera setting generalization, and lighting and reflection generalization. Each level introduces controlled variations to assess a policy's adaptability and robustness in increasingly diverse and challenging conditions.

a) Level 0: Task Space Generalization: We establish a controlled evaluation by standardizing the environment with consistent camera, materials, lighting, and other parameters. The task space, including object initialization and instructions, is split into 90% training and 10% validation to assess generalization within a fixed setting, as shown in Fig. 9 (a).

b) Level 1: Environment Randomization: Building on the standardized setup, we introduce scene randomization while keeping the camera, materials, and lighting fixed [78]. By varying house, table, and ground configurations, we create diverse visual inputs to test robustness against environmental changes [51]. A fixed set of predefined randomized scenes ensures structured evaluation, as shown in Fig. 9 (b).

c) Level 2: Camera Randomization: To assess generalization across camera variations, we introduce different viewing heights and angles using carefully annotated, realistic camera poses. Following the 90/10 training/testing split, we ensure consistent and rigorous evaluation, as illustrated in Fig. 9 (c).

d) Level 3: Lighting and Reflection Randomization: Real-world environments involve diverse materials and lighting conditions [113]. To simulate these challenges, we randomize lighting and reflections, curating realistic object materials and illumination setups [19]. This enhances robustness testing under varying conditions, as shown in Fig. 9 (d).

C. Reinforcement Learning Benchmark

In addition to imitation learning, RoboVerse offers a comprehensive reinforcement learning (RL) benchmark designed to accommodate a diverse range of tasks, robot embodiments, and simulation backends. Specifically, we integrate the PPO [101] algorithm from both Stable-Baselines3 [94] and RSL_RL [98] into our METASIM interface, enabling straightforward task definition, seamless environment switching, and standardized performance logging.

Building upon this infrastructure, we have successfully ported multiple humanoid control tasks from the HumanoidBench [102] benchmark into RoboVerse. Through our adapted interface for RSL_RL, we have efficiently extended framework compatibility to support the TD-MPC2 [41, 42] algorithm from the original benchmark while preserving implementation fidelity.

VI. EXPERIMENTAL RESULTS

A. Overview

We conduct extensive experiments to validate the effectiveness and practicality of ROBOVERSE. First, we evaluate baselines on representative tasks from various benchmark sources to ensure the reliability of the collected datasets and established benchmarks. This includes assessments of both imitation learning baselines (Sec. VI-B) and reinforcement learning baselines (Sec. VI-C).

We then further demonstrate the strength of the high-quality synthetic dataset. We find that synthetic data can significantly boost world model learning.

(Footnote 1: Due to resource and time constraints, we uniformly sample 20 testing scenarios for the OpenVLA baseline.)

B. Results on the Imitation Learning Benchmark

1) Baseline and Task Selection: To genuinely reflect the data quality of the RoboVerse dataset and provide a standard
benchmark for all kinds of imitation learning policy models, we select both prevailing specialist and generalist models as baselines of our RoboVerse benchmark. Specifically, for specialist models, we integrate ACT [138] and Diffusion Policy [13]. For generalist models, we benchmark our approach on OpenVLA [56] and Octo [86], both of which we fine-tuned using our synthetic dataset. ACT is one of the most widely used methods in bi-manual manipulation. Diffusion Policy [13] is the first work that applies the conditional denoising diffusion process as a robot visuomotor policy and achieves great generalization capabilities. OpenVLA is the largest open-source vision-language-action model with 7B parameters.

Leveraging the RoboVerse format and infrastructure design, we are able to evaluate models on different tasks within a unified platform. To fully test policy models' performance under versatile settings, we select one representative task from each of the source benchmarks integrated by the RoboVerse dataset, as shown in Tab. II. The experiment subset includes PickCube and StackCube from ManiSkill [81], CloseBox from RLBench [48], MoveSliderLeft from CALVIN [79], PickChocolatePudding from LIBERO [65], and NutAssembly from RoboSuite [142]. These tasks not only demand precise pick-and-place skills but also require contact-rich physical interactions with articulated objects. Through these tasks, the benchmark results can provide a comprehensive reflection of each model's performance under different scenarios.

2) Implementation Details: Due to time and resource constraints, we implement specialist and generalist models using different strategies, and all the results are obtained under the single-task setting. The training and evaluation settings follow the 90/10 ROBOVERSE benchmark protocol specified in Sec. V-B. During evaluations, we randomly select ten task settings from the training sets and another ten from the validation sets. The reported success rates are computed as averages over three random seeds.

For each step, the inputs are 256 × 256 × 3 RGB images and a short language description depending on the task settings. For specialist models, we train from scratch with actions in the 9-dim robot joint state space. For generalist models, the action is pre-processed from absolute end-effector position space into delta end-effector position space, and the gripper action is discretized into binary values {0, +1}. Owing to the lack of time and resources, we are only able to fine-tune the generalist models in the single-task setting. During evaluations, we employ Curobo [106] as the inverse-kinematics solver to transform the action into the robot joint state space. Specific model implementation details and hyperparameters are provided in the supplementary materials.

3) Experiment Results: We present the imitation learning benchmark results in Tab. II and the generalization evaluation in Tab. III. We further fine-tune large vision-language-action models on both simple and complex language-conditioned tasks, as shown in Tab. VIII.

TABLE II: Baseline Results on ROBOVERSE Imitation Learning Benchmark. We report baseline results on representative tasks from various benchmark sources to validate the effectiveness and reliability of the ROBOVERSE benchmark.

Method | # Params | PickCube (ManiSkill) | StackCube (ManiSkill) | CloseBox (RLBench) | MoveSliderLeft (CALVIN) | PickChocolatePudding (LIBERO) | NutAssembly (RoboSuite) | Average
Diffusion Policy [13] | 78M | 52.7 | 53.8 | 51.5 | 76.5 | 50.0 | 7.1 | 48.6
ACT [138] | 84M | 31.7 | 36.7 | 68.3 | 85.0 | 78.3 | 0.0 | 50.0

TABLE III: Generalization Performance on Imitation Learning Benchmark. This table presents the experimental results for each generalization level in our benchmark across different tasks and methodologies. The tasks are divided into distinct levels (Level 0, Level 1, Level 2, and Level 3) to evaluate performance under progressively challenging scenarios.

Method | MoveSliderLeft L0 | L1 | L2 | L3 | CloseBox L0 | L1 | L2 | L3 | PickCube L0 | L1 | L2 | L3
Diffusion Policy [13] | 76.5 | 81.3 | 72.0 | 60.0 | 51.5 | 42.8 | 20.0 | 10.4 | 52.7 | 11.1 | 0.0 | 0.0
ACT [138] | 85.0 | 83.3 | 43.3 | 16.6 | 68.3 | 73.3 | 0.0 | 20.0 | 31.7 | 30.0 | 6.7 | 3.3
OpenVLA1 [56] | 45.0 | 40.0 | 35.0 | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40.0 | 15.0 | 0.0 | 0.0

TABLE IV: Vision-Language-Action (VLA) Model Results on ROBOVERSE Imitation Learning Benchmark. Constrained by time and resources, we report VLA models' results on two simple tasks from ROBOVERSE and grasping tasks with diverse and challenging language instructions. We split 58 objects in GraspNet into three sets, each containing progressively more challenging objects based on their geometry.

Method | Simple: PickCube | Simple: MoveSliderLeft | Language-conditioned Grasping: Object Set 1 | Object Set 2 | Object Set 3
OpenVLA | 40.0 | 45.0 | 46.0 | 33.3 | 14.4
Octo | 50.0 | 30.0 | 42.0 | 14.4 | 2.2

C. Results on the Reinforcement Learning Benchmark

Using Stable-Baselines3 and RSL_RL implementations of PPO, we train policies on tasks from IsaacLab [80] under consistent hyperparameters.

For additional tasks (humanoid, dexterous hand), the same PPO-based workflow applies. We successfully migrate HumanoidBench [102] from MuJoCo to RoboVerse, enabling training across multiple simulators (IsaacLab and MuJoCo) with consistent interfaces. Experiment results demonstrate stable
policy convergence across simulators, achieving comparable performance to native MuJoCo baselines. Leveraging the generalizability of RSL_RL, we further extend the benchmark to support the TD-MPC2 [41, 42] algorithm, which exhibits robust training dynamics in all environments. For implementation details, reward curves, and extended experimental results, please refer to the supplementary materials.

D. Augmentation Experiments

To verify the effectiveness of our trajectory augmentation API, on four representative tasks we compare the success rates of Diffusion Policy trained on 50 source demonstrations versus 200, 1,000, and 3,000 generated augmentation demonstrations under the imitation learning setting. The results presented in Fig. 10 demonstrate a consistent improvement in model performance as the number of generated demonstrations increases, highlighting both the effectiveness and scalability of the trajectory augmentation API.

Fig. 10: Effectiveness of Trajectory Augmentation. Success rates of policies trained with the augmented dataset and the source dataset.

E. World Model Learning

Recent advances in general-purpose video generation and interactive world models [110, 6] have shown promising progress. Yet, the scarcity of gigantic-scale robotic datasets still impedes the development of robust world models for a wide range of robotic applications. In this section, we demonstrate how synthetic data from the RoboVerse simulation can augment real-world datasets to train more capable robotics world models.

When a model is trained exclusively on 50,000 episodes from the DROID dataset [54], it generally respects action conditions but struggles to accurately capture physical interactions between the gripper and target objects. Notably, the objects appear "warped" during contact with the gripper, as shown in Fig. 11. By incorporating an additional 50,000 synthetic episodes from RoboVerse to create a combined dataset of 100,000 episodes, the model predictions improve with regard to preserving object geometry. However, merely "watching videos" remains insufficient for learning the intricate physical interactions in DROID.

In contrast, training solely on the RoboVerse-50K or on the DROID-RoboVerse-100K dataset and then validating on RoboVerse samples, we observe that the generated frames are physically more realistic in most scenes, with details in the supplementary materials. This improvement can be attributed to the extensive randomization and augmentation available in RoboVerse. Conversely, a model trained solely on DROID data fails to transfer effectively to the RoboVerse scene. We hypothesize that this shortcoming stems from limited per-scene sample coverage in DROID and incomplete gripper visibility in the camera view.

Fig. 11: Ablation Study of Action-conditioned World Model Learning. We compare the qualitative results of an action-conditioned world model trained on the pure DROID and the DROID-RoboVerse datasets, with evaluations sampled from the DROID dataset. (Columns: Ground Truth; Trained on DROID; Trained on DROID-RoboVerse.)

F. Imitating the RoboVerse Dataset Enables Direct Sim-to-Real Transfer

The RoboVerse system seamlessly integrates a powerful physics engine with a high-quality renderer, ensuring the generation of realistic, high-fidelity data. To demonstrate its potential, we conduct experiments validating its effectiveness in direct sim-to-real transfer. As shown in Fig. 13, we fine-tune OpenVLA [56] on the RoboVerse dataset and transfer the learned policy to real-world scenarios without additional fine-tuning. The model successfully manipulates unseen objects in previously unseen real-world environments, showcasing the robustness and generalization capabilities of our system. The quantitative results on more challenging language-guided tasks, as shown in Tab. V, further demonstrate the high success rate of models trained on the RoboVerse dataset. Additional details are provided in the supplementary materials.

G. Reinforcement Learning in RoboVerse Enables Sim-to-Sim-to-Real Transfer

Large-scale parallel environments offer significant potential for large-scale exploration and are highly effective for reinforcement learning (RL) tasks. However, while they provide excellent efficiency, their accuracy may be limited in certain
scenarios [25]. To address this problem, sim-to-sim evaluation and fine-tuning present promising solutions [63]. The RoboVerse platform seamlessly supports such functionalities, enabling robust sim-to-sim and sim-to-real transitions. We further demonstrate the effectiveness of sim-to-sim-to-real generalization through comprehensive experiments, highlighting the platform's ability to bridge simulation and real-world performance.

Fig. 12: Sim-to-Real and Sim-to-Sim-to-Real Experiment Results. We demonstrate that learning within the RoboVerse framework enables seamless direct Sim-to-Real transfer for manipulating unseen objects in new environments (imitation learning) and Sim-to-Sim-to-Real transfer for whole-body humanoid control (reinforcement learning).

TABLE V: Direct Sim-to-Real. We fine-tune two baseline models using demonstrations adapted from GraspNet [27] to validate the effectiveness of the RoboVerse dataset. The final performance score for each task is reported, where a baseline receives 1 point for successfully grasping the target. Additionally, we adopt the partial reward scheme from OpenVLA, awarding 0.5 points when the gripper makes contact with the target.

Method | Pick up Wash Soap | Lift Mouth Rinse | Grasp Green Dish (GraspNet objects)
Octo [86] | 5.0/10.0 | 3.0/10.0 | 6.0/10.0
OpenVLA [56] | 7.0/10.0 | 8.0/10.0 | 5.0/10.0

Fig. 13: Generalization of Sim-to-Sim-to-Real. This figure shows the in-the-wild generalization ability of our lower-body RL policy with upper-body PD control by the sim-to-sim-to-real approach.

VII. LIMITATIONS

While ROBOVERSE provides a comprehensive and scalable platform, several limitations remain. First, the integration of a unified format for non-rigid objects is not yet fully supported, which we leave for future work to develop. Additionally, while our large-scale dataset presents significant potential for pretraining a foundation model, this exploration falls beyond the scope of this paper due to resource constraints. Furthermore, despite our extensive efforts to fully reimplement and optimize all baseline methods within the ROBOVERSE baselines, some implementations may still be suboptimal. Our primary goal is not to directly compare policy performance but to demonstrate that the system is comprehensive, supports diverse policies, and ensures strong alignment between simulation and real-world performance. While we have made every effort to build a robust platform, it is inevitable that some oversights or errors may remain. We encourage the broader research community to contribute to maintaining and refining the baselines, fostering collaboration to further enhance the platform's capabilities.

ACKNOWLEDGEMENT

We thank Hanyang Zhou and Sicheng He for providing valuable suggestions for setting up robotics hardware. We thank Yufeng Chi and Sophia Shao for providing humanoid robots for testing. We thank Jie Yang and Muzhi Han for valuable discussion.

Pieter Abbeel holds concurrent appointments as a professor at UC Berkeley and as an Amazon Scholar. This paper describes work performed at UC Berkeley and is not associated with Amazon.

REFERENCES

[1] Unai Antero, Francisco Blanco, Jon Oñativia, Damien Sallé, and Basilio Sierra. Harnessing the power of large language models for automated code generation and verification. Robotics, 13(9):137, 2024.
[2] Genesis Authors. Genesis: A universal and generative physics engine for robotics and beyond, December 2024. URL https://github.com/Genesis-Embodied-AI/Genesis.
[3] Tamir Blum, Gabin Paillet, Mickael Laine, and Kazuya Yoshida. RL STaR platform: Reinforcement learning for simulation based training of robots. arXiv preprint arXiv:2009.09595, 2020.
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
[6] Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel. Genie: Generative interactive environments, 2024. URL https://arxiv.org/abs/2402.15391.
[7] Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robotics & Automation Magazine, 22(3):36-52, 2015. doi: 10.1109/MRA.2015.2448951.
[8] Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. Benchmarking in manipulation research: The YCB object and model set and benchmarking protocols. arXiv preprint arXiv:1502.03143, 2015.
[9] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667-676. IEEE, 2017.
[10] Yuanpei Chen, Chen Wang, Yaodong Yang, and Karen Liu. Object-centric dexterous manipulation from human motion data. In 8th Annual Conference on Robot Learning, 2024.
[11] An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453, 2024.
[12] Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-TeleVision: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024.
[13] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[14] Alberto Silvio Chiappa, Alessandro Marin Vargas, Ann Huang, and Alexander Mathis. Latent exploration for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
[15] Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.
[16] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016-2021.
[17] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016-2023.
[18] Wenbo Cui, Chengyang Zhao, Songlin Wei, Jiazhao Zhang, Haoran Geng, Yaran Chen, and He Wang. GAPartManip: A large-scale part-centric dataset for material-agnostic articulated object manipulation. arXiv preprint arXiv:2411.18276, 2024.
[19] Qiyu Dai, Jiyao Zhang, Qiwei Li, Tianhao Wu, Hao Dong, Ziyuan Liu, Ping Tan, and He Wang. Domain randomization-enhanced depth simulation and restoration for perceiving and grasping specular and transparent objects. In European Conference on Computer Vision, pages 374-391. Springer, 2022.
[20] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair,
[23] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[24] Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6DOR: Benchmarking open-instruction 6-DoF object rearrangement and a VLM-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7359-7366. IEEE, 2024.
[25] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning, 2019. URL https://arxiv.org/abs/1904.12901.
[26] Tom Erez, Yuval Tassa, and Emanuel Todorov. Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and PhysX. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 4397-4404. IEEE, 2015.
[27] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11444-11453, 2020.
[28] Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 653-660. IEEE, 2024.
[29] C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - a differentiable physics engine for large scale rigid body simulation, 2021. URL http://github.com/google/brax.
[30] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3D-FUTURE: 3D furniture shape with texture. International
Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Journal of Computer Vision (IJCV), 2021.
Sergey Levine, and Chelsea Finn. Robonet: Large-scale [31] Caelan Garrett, Ajay Mandlekar, Bowen Wen, and
multi-robot learning. arXiv preprint arXiv:1910.11215, Dieter Fox. Skillmimicgen: Automated demonstration
2019. generation for efficient skill learning and deployment.
[21] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca arXiv preprint arXiv:2410.18907, 2024.
Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric [32] Haoran Geng, Ziming Li, Yiran Geng, Jiayi Chen, Hao
Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Dong, and He Wang. Partmanip: Learning cross-category
Procthor: Large-scale embodied ai using procedural generalizable part manipulation policy from point cloud
generation. Advances in Neural Information Processing observations. In CVPR, pages 2978–2988, 2023.
Systems (NeurIPS), 2022. [33] Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen,
[22] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong He Wang, and Leonidas Guibas. Sage: Bridging semantic
Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Chris- and actionable parts for generalizable articulated-object
tian Laforte, Vikram S. Voleti, Samir Yitzhak Gadre, manipulation under language instructions, 2023.
Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, [34] Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu,
Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-
Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d category domain-generalizable object perception and
objects. ArXiv, abs/2307.05663, 2023. URL https: manipulation via generalizable and actionable parts. In
//api.semanticscholar.org/CorpusID:259836993. Proceedings of the IEEE/CVF Conference on Computer
[23] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Vision and Pattern Recognition, pages 7081–7091, 2023.
[35] Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Cassandra Kent, and Eric Eaton. Robotic manipulation
Yaodong Yang, and Hao Dong. Rlafford: End-to- datasets for offline compositional reinforcement learning.
end affordance learning for robotic manipulation. In arXiv preprint arXiv:2307.07091, 2023.
2023 IEEE International Conference on Robotics and [47] Stephen James, Marc Freese, and Andrew J. Davison.
Automation (ICRA), pages 5880–5886. IEEE, 2023. Pyrep: Bringing v-rep to deep robot learning. arXiv
[36] Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran preprint arXiv:1906.11176, 2019.
Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng [48] Stephen James, Zicong Ma, David Rovick Arrojo, and
Zhou, Demetri Terzopoulos, Song-Chun Zhu, et al. Andrew J. Davison. Rlbench: The robot learning
Arnold: A benchmark for language-grounded task learn- benchmark & learning environment. IEEE Robotics
ing with continuous states in realistic 3d scenes. In and Automation Letters, 2020.
Proceedings of the IEEE/CVF International Conference [49] Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang,
on Computer Vision (ICCV), 2023. Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang.
[37] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Sceneverse: Scaling 3d vision-language learning for
Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, grounded scene understanding. In European Conference
Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified on Computer Vision (ECCV), 2024.
benchmark for generalizable manipulation skills. arXiv [50] Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang
preprint arXiv:2302.04659, 2023. Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu.
[38] Xinyang Gu, Yen-Jen Wang, and Jianyu Chen. Dexmimicgen: Automated data generation for bimanual
Humanoid-gym: Reinforcement learning for humanoid dexterous manipulation via imitation learning. arXiv
robot with zero-shot sim2real transfer, 2024. URL preprint arXiv:2410.24185, 2024.
https://arxiv.org/abs/2404.05695. [51] Abhishek Kadian, Joanne Truong, Aaron Gokaslan,
[39] Huy Ha, Pete Florence, and Shuran Song. Scaling up and Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis
distilling down: Language-guided robot skill acquisition. Savva, Sonia Chernova, and Dhruv Batra. Sim2real
In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, predictivity: Does evaluation in simulation predict real-
Proceedings of The 7th Conference on Robot Learning, world performance? IEEE Robotics and Automation
volume 229 of Proceedings of Machine Learning Re- Letters, 5(4):6670–6677, 2020.
search, pages 3766–3777. PMLR, 06–09 Nov 2023. URL [52] Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki.
https://proceedings.mlr.press/v229/ha23a.html. Gen2sim: Scaling up robot learning in simulation with
[40] Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran generative models. In 2024 IEEE International Confer-
Song. Umi on legs: Making manipulation policies mobile ence on Robotics and Automation (ICRA), pages 6672–
with manipulation-centric whole-body controllers. arXiv 6679. IEEE, 2024.
preprint arXiv:2407.10353, 2024. [53] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler,
[41] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal and George Drettakis. 3d gaussian splatting for real-
difference learning for model predictive control. In time radiance field rendering. ACM Transactions on
International Conference on Machine Learning (ICML), Graphics, 42(4), July 2023. URL https://repo-sam.inria.
2022. fr/fungraph/3d-gaussian-splatting/.
[42] Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: [54] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash-
Scalable, robust world models for continuous control. In win Balakrishna, Sudeep Dasari, Siddharth Karam-
International Conference on Learning Representations cheti, Soroush Nasiriany, Mohan Kumar Srirama,
(ICLR), 2024. Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid:
[43] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr A large-scale in-the-wild robot manipulation dataset.
Dollár, and Ross Girshick. Masked autoencoders are CoRR, 2024.
scalable vision learners. In Proceedings of the IEEE/CVF [55] Chung Min Kim, Michael Danielczuk, Isabella Huang,
conference on computer vision and pattern recognition, and Ken Goldberg. Ipc-graspsim: Reducing the sim2real
pages 16000–16009, 2022. gap for parallel-jaw grasping with the incremental
[44] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan potential contact model, 2022. URL https://arxiv.org/
Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and abs/2111.01391.
Weizhu Chen. Lora: Low-rank adaptation of large [56] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted
language models, 2021. URL https://arxiv.org/abs/2106. Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov,
09685. Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong,
[45] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa
and Shenghua Gao. 2d gaussian splatting for geo- Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn.
metrically accurate radiance fields. In SIGGRAPH Openvla: An open-source vision-language-action model.
2024 Conference Papers. Association for Computing arXiv preprint arXiv:2406.09246, 2024.
Machinery, 2024. doi: 10.1145/3641519.3657428. [57] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi
[46] Marcel Hussing, Jorge A Mendez, Anisha Singrodia, Mao, Chloe Rolland, Laura Gustafson, Tete Xiao,
Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, [68] Haozhe Lou, Yurong Liu, Yike Pan, Yiran Geng, Jianteng
et al. Segment anything. In Proceedings of the IEEE/CVF Chen, Wenlong Ma, Chenglong Li, Lin Wang, Hengzhen
International Conference on Computer Vision, pages Feng, Lu Shi, Liyi Luo, and Yongliang Shi. Robo-gs:
4015–4026, 2023. A physics consistent spatial-temporal model for robotic
[58] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv arm with hybrid representation, 2024. URL https://arxiv.
Batra, and Stefan Lee. Beyond the nav-graph: Vision- org/abs/2408.14873.
and-language navigation in continuous environments. In [69] Haoran Lu, Ruihai Wu, Yitong Li, Sijie Li, Ziyu Zhu,
European Conference on Computer Vision, 2020. URL Chuanruo Ning, Yan Shen, Longzan Luo, Yuanpei Chen,
https://api.semanticscholar.org/CorpusID:214802389. and Hao Dong. Garmentlab: A unified simulation and
[59] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. benchmark for garment manipulation. In Advances in
Imagenet classification with deep convolutional neural Neural Information Processing Systems, 2024.
networks. Advances in neural information processing [70] Jiangran Lyu, Yuxing Chen, Tao Du, Feng Zhu, Hui-
systems, 25, 2012. quan Liu, Yizhou Wang, and He Wang. Scissorbot:
[60] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Learning generalizable scissor skill for paper cutting
and Jason Baldridge. Room-across-room: Multilingual via simulation, imitation, and sim2real. arXiv preprint
vision-and-language navigation with dense spatiotempo- arXiv:2409.13966, 2024.
ral grounding. In Proceedings of the 2020 Conference [71] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen,
on Empirical Methods in Natural Language Processing Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao.
(EMNLP), pages 4392–4412, 2020. Latte: Latent diffusion transformer for video generation,
[61] Yuxuan Kuang, Amine Elhafsi, Haoran Geng, Marco 2024. URL https://arxiv.org/abs/2401.03048.
Pavone, and Yue Wang. Skillblender: Towards versatile [72] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo,
humanoid whole-body control via skill blending. In Michelle Lu, Kier Storey, Miles Macklin, David Hoeller,
CoRL 2024 Workshop on Whole-body Control and Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel
Bimanual Manipulation: Applications in Humanoids and State. Isaac gym: High performance gpu-based physics
Beyond. simulation for robot learning, 2021.
[62] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gok- [73] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan
men, Sanjana Srivastava, Roberto Martín-Martín, Chen Booher, Max Spero, Albert Tung, Julian Gao, John
Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A
et al. Behavior-1k: A benchmark for embodied ai with crowdsourcing platform for robotic skill learning through
1,000 everyday activities and realistic simulation. In imitation. In Conference on Robot Learning, pages 879–
Conference on Robot Learning, pages 80–93. PMLR, 893. PMLR, 2018.
2023. [74] Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín,
[63] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human-in-
Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, the-loop imitation learning using remote teleoperation.
Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, arXiv preprint arXiv:2012.06733, 2020.
Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. [75] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush
Evaluating real-world robot manipulation policies in Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei,
simulation. arXiv preprint arXiv:2405.05941, 2024. Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín.
[64] Yu Li, Xiaojie Zhang, Ruihai Wu, Zilong Zhang, Yiran What matters in learning from offline human demon-
Geng, Hao Dong, and Zhaofeng He. Unidoormanip: strations for robot manipulation. arXiv preprint
Learning universal door manipulation policy over large- arXiv:2108.03298, 2021.
scale and diverse door manipulation environments. arXiv [76] Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Ireti-
preprint arXiv:2403.02604, 2024. ayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and
[65] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Dieter Fox. Mimicgen: A data generation system for
Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking scalable robot learning using human demonstrations. In
knowledge transfer for lifelong robot learning. arXiv 7th Annual Conference on Robot Learning, 2023.
preprint arXiv:2306.03310, 2023. [77] Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi,
[66] Ruoshi Liu, Alper Canberk, Shuran Song, and Carl Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra
Vondrick. Differentiable robot rendering, 2024. URL Malik, Vitor Guizilini, and Yue Wang. Learning from
https://arxiv.org/abs/2410.13851. massive human videos for universal humanoid pose
[67] Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang control. arXiv preprint arXiv:2412.14172, 2024.
Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, [78] Pablo Martinez-Gonzalez, Sergiu Oprea, Alberto Garcia-
and Li Yi. Hoi4d: A 4d egocentric dataset for category- Garcia, Alvaro Jover-Alvarez, Sergio Orts-Escolano,
level human-object interaction. In Proceedings of the and Jose Garcia-Rodriguez. Unrealrox: an extremely
IEEE/CVF Conference on Computer Vision and Pattern photorealistic virtual reality environment for robotics
Recognition, pages 21013–21022, 2022. simulations and synthetic data generation. Virtual Reality,
24:271–288, 2020. models for high-resolution image synthesis, 2023. URL
[79] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and https://arxiv.org/abs/2307.01952.
Wolfram Burgard. Calvin: A benchmark for language- [91] Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma,
conditioned policy learning for long-horizon robot ma- and Jitendra Malik. In-hand object rotation via rapid
nipulation tasks. IEEE Robotics and Automation Letters motor adaptation. In Conference on Robot Learning,
(RA-L), 7(3):7327–7334, 2022. pages 1722–1732. PMLR, 2023.
[80] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, [92] Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk,
Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dietor
Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Fox. Anyteleop: A general vision-based dexterous
Babich, Gavriel State, Marco Hutter, and Animesh robot arm-hand teleoperation system. arXiv preprint
Garg. Orbit: A unified simulation framework for arXiv:2307.04577, 2023.
interactive robot learning environments. IEEE Robotics [93] Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
and Automation Letters, 8(6):3740–3747, 2023. doi: Dario Amodei, Ilya Sutskever, et al. Language models
10.1109/LRA.2023.3270034. are unsupervised multitask learners. OpenAI blog, 1(8):
[81] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, 9, 2019.
Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and [94] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi
Hao Su. Maniskill: Generalizable manipulation skill Kanervisto, Maximilian Ernestus, and Noah Dormann.
benchmark with large-scale demonstrations. arXiv Stable-baselines3: Reliable reinforcement learning im-
preprint arXiv:2107.14483, 2021. plementations. Journal of Machine Learning Research,
[82] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, 22(268):1–8, 2021.
Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Man- [95] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
dlekar, and Yuke Zhu. Robocasa: Large-scale simulation Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr,
of everyday tasks for generalist robots. In Robotics: Roman Rädle, Chloe Rolland, Laura Gustafson, Eric
Science and Systems, 2024. Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas
[83] NVidia. Physx, 2024. URL https://nvidia-omniverse. Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and
github.io/PhysX/physx/5.5.0/. Christoph Feichtenhofer. Sam 2: Segment anything in
[84] NVidia. vmaterials, 2024. URL https://developer.nvidia. images and videos, 2024. URL https://arxiv.org/abs/
com/vmaterials. 2408.00714.
[85] NVIDIA. Isaacsim simulator, 2025. URL https: [96] Pengzhen Ren, Min Li, Zhen Luo, Xinshuai Song, Ziwei
//developer.nvidia.com/isaac/sim. Chen, Weijia Liufu, Yixuan Yang, Hao Zheng, Rongtao
[86] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Xu, Zitong Huang, et al. Infiniteworld: A unified scalable
Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey simulation framework for general visual-language robot
Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You interaction. arXiv preprint arXiv:2412.05789, 2024.
Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, [97] E. Rohmer, S. P. N. Singh, and M. Freese. Cop-
Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: peliasim (formerly v-rep): a versatile and scalable robot
An open-source generalist robot policy. In Proceedings simulation framework. In Proc. of The International
of Robotics: Science and Systems, Delft, Netherlands, Conference on Intelligent Robots and Systems (IROS),
2024. 2013. www.coppeliarobotics.com.
[87] Jacopo Panerati, Hehui Zheng, SiQi Zhou, James Xu, [98] Nikita Rudin, David Hoeller, Philipp Reist, and Marco
Amanda Prorok, and Angela P Schoellig. Learning to Hutter. Learning to walk in minutes using massively
fly—a gym environment with pybullet physics for rein- parallel deep reinforcement learning, 2022. URL https:
forcement learning of multi-agent quadcopter control. In //arxiv.org/abs/2109.11978.
2021 IEEE/RSJ International Conference on Intelligent [99] Johannes Lutz Schönberger and Jan-Michael Frahm.
Robots and Systems (IROS), pages 7512–7519. IEEE, Structure-from-motion revisited. In Conference on
2021. Computer Vision and Pattern Recognition (CVPR), 2016.
[88] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, [100] Johannes Lutz Schönberger, Enliang Zheng, Marc Polle-
Angjoo Kanazawa, David Fouhey, and Jitendra Malik. feys, and Jan-Michael Frahm. Pixelwise view selection
Reconstructing hands in 3D with transformers. In CVPR, for unstructured multi-view stereo. In European Confer-
2024. ence on Computer Vision (ECCV), 2016.
[89] Ethan Perez, Florian Strub, Harm de Vries, Vincent [101] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec
Dumoulin, and Aaron Courville. Film: Visual reasoning Radford, and Oleg Klimov. Proximal policy optimization
with a general conditioning layer, 2017. URL https: algorithms. arXiv preprint arXiv:1707.06347, 2017.
//arxiv.org/abs/1709.07871. [102] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin,
[90] Dustin Podell, Zion English, Kyle Lacey, Andreas Youngwoon Lee, and Pieter Abbeel. Humanoidbench:
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Simulated humanoid benchmark for whole-body loco-
and Robin Rombach. Sdxl: Improving latent diffusion motion and manipulation, 2024.
[103] Arth Shukla, Stone Tao, and Hao Su. Maniskill-hab: A surroundings. ACM transactions on graphics (TOG), 26
benchmark for low-level manipulation in home rearrange- (3):35–es, 2007.
ment tasks, 2024. URL https://arxiv.org/abs/2412.13211. [115] Fang Wan, Haokun Wang, Xiaobo Liu, Linhan Yang,
[104] Simulately Wiki. Simulately wiki. https://simulately.wiki, and Chaoyang Song. Deepclaw: A robotic hardware
2025. Accessed: 31 Jan 2025. benchmarking platform for learning object manipula-
[105] Jiaming Song, Chenlin Meng, and Stefano Ermon. tion. In 2020 IEEE/ASME International Conference on
Denoising diffusion implicit models, 2022. URL https: Advanced Intelligent Mechatronics (AIM), pages 2011–
//arxiv.org/abs/2010.02502. 2018. IEEE, 2020.
[106] Balakumar Sundaralingam, Siva Kumar Sastry Hari, [116] Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben,
Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun
Blukis, Alexander Millane, Helen Oleynikova, Ankur Chen, Sizhe Yang, et al. Grutopia: Dream general robots
Handa, Fabio Ramos, et al. Curobo: Parallelized in a city at scale. arXiv preprint arXiv:2407.10943, 2024.
collision-free robot motion generation. In ICRA, pages [117] Jun Wang, Yuzhe Qin, Kaiming Kuang, Yigit Korkmaz,
8112–8119. IEEE, 2023. Akhilan Gurumoorthy, Hao Su, and Xiaolong Wang. Cy-
[107] Balakumar Sundaralingam, Siva Kumar Sastry Hari, berdemo: Augmenting simulated human demonstration
Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts for real-world dexterous manipulation. In Proceedings
Blukis, Alexander Millane, Helen Oleynikova, Ankur of the IEEE/CVF Conference on Computer Vision and
Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. Pattern Recognition, pages 17952–17963, 2024.
curobo: Parallelized collision-free minimum-jerk robot [118] Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit
motion generation, 2023. Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe
[108] Andrew Szot, Alexander Clegg, Eric Undersander, Xu, and Xiaolong Wang. Gensim: Generating robotic
Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, simulation tasks via large language models. arXiv
Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr preprint arXiv:2310.01361, 2023.
Maksymets, et al. Habitat 2.0: Training home assistants [119] Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang,
to rearrange their habitat. Advances in neural information Yian Wang, Katerina Fragkiadaki, Zackory Erickson,
processing systems, 34:251–266, 2021. David Held, and Chuang Gan. Robogen: Towards
[109] Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, unleashing infinite data for automated robot learning via
Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong generative simulation. arXiv preprint arXiv:2311.01455,
Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, 2023.
Tongzhou Mu, Nan Xiao, Arnav Gurha, Zhiao Huang, [120] Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen
Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan
Maniskill3: Gpu parallelized robotics simulation and Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF
rendering for generalizable embodied ai. arXiv preprint tracking and 3D reconstruction of unknown objects. In
arXiv:2410.00425, 2024. CVPR, 2023.
[110] Movie Gen team. Movie gen: A cast of media foundation [121] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield.
models, 2024. URL https://arxiv.org/abs/2410.13720. FoundationPose: Unified 6d pose estimation and tracking
[111] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: of novel objects. In CVPR, 2024.
A physics engine for model-based control. In 2012 [122] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao
IEEE/RSJ International Conference on Intelligent Robots Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu
and Systems, pages 5026–5033. IEEE, 2012. doi: 10. Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J.
1109/IROS.2012.6386109. Guibas, and Hao Su. SAPIEN: A simulated part-based
[112] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U interactive environment. In The IEEE Conference on
Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Computer Vision and Pattern Recognition (CVPR), June
Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. 2020.
Gymnasium: A standard interface for reinforcement [123] Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu,
learning environments. arXiv preprint arXiv:2407.17032, Qian Luo, Victor Zhong, Yanchao Yang, and Tao Yu.
2024. Text2reward: Reward shaping with language models for
[113] Christine M Vaccaro, Catrina C Crisp, Angela N Fellner, reinforcement learning. In The Twelfth International
Christopher Jackson, Steven D Kleeman, and James Conference on Learning Representations, 2024. URL
Pavelka. Robotic virtual reality simulation plus standard https://openreview.net/forum?id=tUM39YTRxH.
robotic orientation versus standard robotic orientation [124] Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu,
alone: a randomized controlled trial. Urogynecology, 19 Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng,
(5):266–270, 2013. Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal
[114] Daniel Vlasic, Rolf Adelsberger, Giovanni Vannucci, robotic dexterous grasping via learning diverse proposal
John Barnwell, Markus Gross, Wojciech Matusik, and generation and goal-conditioned policy. In Proceedings
Jovan Popović. Practical motion capture in everyday of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 4737–4746, 2023. [136] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li,
[125] Omry Yadan. Hydra - a framework for elegantly Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng
configuring complex applications. Github, 2019. URL Zhang, and He Wang. Uni-navid: A video-based vision-
https://github.com/facebookresearch/hydra. language-action model for unifying embodied navigation
[126] Xintong Yang, Ze Ji, Jing Wu, and Yu-Kun Lai. An open- tasks. arXiv preprint arXiv:2412.06224, 2024.
source multi-goal reinforcement learning environment [137] Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze
for robotic manipulation with pybullet. In Annual Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng
Conference Towards Autonomous Robotic Systems, pages Zhang, and He Wang. Navid: Video-based vlm plans the
14–24. Springer, 2021. next step for vision-and-language navigation. Robotics:
[127] Yandan Yang, Baoxiong Jia, Peiyuan Zhi, and Siyuan Science and Systems, 2024.
Huang. Physcene: Physically interactable 3d scene [138] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea
synthesis for embodied ai. In Proceedings of the Finn. Learning fine-grained bimanual manipulation with
IEEE/CVF Conference on Computer Vision and Pattern low-cost hardware. arXiv preprint arXiv:2304.13705,
Recognition (CVPR), 2024. 2023.
[128] Chongjie Ye, Yinyu Nie, Jiahao Chang, Yuantao Chen, [139] Chengwei Zheng, Lixin Xue, Juan Zarate, and Jie Song.
Yihao Zhi, and Xiaoguang Han. Gaustudio: A modular Gstar: Gaussian surface tracking and reconstruction,
framework for 3d gaussian splatting and beyond. arXiv 2025. URL https://arxiv.org/abs/2501.10283.
preprint arXiv:2403.19632, 2024. [140] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and
[129] Chongjie Ye, Lingteng Qiu, Xiaodong Gu, Qi Zuo, Ziwei Liu. Learning to prompt for vision-language
Yushuang Wu, Zilong Dong, Liefeng Bo, Yuliang Xiu, models. International Journal of Computer Vision, 130
and Xiaoguang Han. Stablenormal: Reducing diffusion (9):2337–2348, 2022.
variance for stable and sharp normal. ACM Transactions [141] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu,
on Graphics (TOG), 2024. Chilam Cheang, and Tao Kong. Irasim: Learning
[130] Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, interactive real-robot action simulators, 2024. URL
Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey https://arxiv.org/abs/2406.14540.
Hu, Matthew Tancik, and Angjoo Kanazawa. gsplat: [142] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto
An open-source library for Gaussian splatting. arXiv Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and
preprint arXiv:2409.06765, 2024. URL https://arxiv.org/ Yifeng Zhu. robosuite: A modular simulation framework
abs/2409.06765. and benchmark for robot learning. In arXiv preprint
[131] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, arXiv:2009.12293, 2020.
Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-
world: A benchmark and evaluation for multi-task and
meta reinforcement learning. In Conference on Robot
Learning (CoRL), 2019. URL https://arxiv.org/abs/1910.
10897.
[132] Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa
Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire,
Erik Frey, Koushil Sreenath, Lueder A. Kahrs, Carlo
Sferrazza, Yuval Tassa, and Pieter Abbeel. Mujoco play-
ground: An open-source framework for gpu-accelerated
robot learning and sim-to-real transfer., 2025. URL https:
//github.com/google-deepmind/mujoco_playground.
[133] Andy Zeng, Shuran Song, Matthias Nießner, Matthew
Fisher, Jianxiong Xiao, and Thomas Funkhouser.
3dmatch: Learning local geometric descriptors from rgb-
d reconstructions. In CVPR, 2017.
[134] Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu,
Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang.
Dexgraspnet 2.0: Learning generative dexterous grasping
in large-scale synthetic cluttered scenes. In 8th Annual
Conference on Robot Learning, 2024.
[135] Jialiang Zhang, Haoran Liu, Danshi Li, Xinqiang Yu,
Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang.
Dexgraspnet 2.0: Learning generative dexterous grasping
in large-scale synthetic cluttered scenes, 2024. URL
https://arxiv.org/abs/2410.23004.
CONTENTS

I Introduction 2
II Related Work 3
   II-A Robotics Simulators 3
   II-B Large-Scale Robotics Dataset 3
   II-C Benchmarking in Robotics 3
III Infrastructure: METASIM 4
   III-A METASIM Overview 4
   III-B METASIM Implementation 4
      III-B1 Universal Configuration System 4
      III-B2 Aligned Simulator Backends 5
      III-B3 User-Friendly Environment Wrapper 5
   III-C METASIM Capabilities 5
      III-C1 Cross-Simulator Integration 5
      III-C2 Hybrid Simulation 5
      III-C3 Cross-Embodiment Transfer 5
IV ROBOVERSE Dataset 5
   IV-A Dataset Overview 5
   IV-B Tasks, Assets and Trajectories Collection: Migration 5
   IV-C Tasks, Assets and Trajectories Collection: Teleoperation and Generation 6
   IV-D Data Augmentation 7
      IV-D1 Trajectory Augmentation 7
      IV-D2 Domain Randomization 7
   IV-E ROBOVERSE Dataset 7
      IV-E1 Dataset Statistics 7
V ROBOVERSE Benchmark 8
   V-A Benchmark Overview 8
   V-B Imitation Learning Benchmark 9
   V-C Reinforcement Learning Benchmark 9
VI Experimental Results 9
   VI-A Overview 9
   VI-B Results on the Imitation Learning Benchmark 9
      VI-B1 Baseline and Task Selection 9
      VI-B2 Implementation Details 10
      VI-B3 Experiment Results 10
   VI-C Results on the Reinforcement Learning Benchmark 10
   VI-D Augmentation Experiments 11
   VI-E World Model Learning 11
   VI-F Imitating the RoboVerse Dataset Enables Direct Sim-to-Real Transfer 11
   VI-G Reinforcement Learning in RoboVerse Enables Sim-to-Sim-to-Real Transfer 11
VII Limitations 12
VIII Simulators Overview 21
IX The METASIM Framework 21
   IX-A Architecture Overview 21
   IX-B MetaConfig Configuration System 22
   IX-C Aligned Simulation APIs 22
   IX-D Gym API Wrappers 23
   IX-E Backend Support 23
      IX-E1 Isaac Lab 23
      IX-E2 Isaac Gym 23
      IX-E3 Mujoco 23
      IX-E4 Genesis 23
      IX-E5 Sapien 23
      IX-E6 Pybullet 24
   IX-F Hybrid Simulation Implementation 24
X Asset Conversion 24
   X-A Asset types 24
   X-B Conversion Pipeline 24
      X-B1 MJCF to URDF conversion 25
      X-B2 URDF to USD conversion 25
XI Task and Data Migration 26
   XI-A ManiSkill 26
   XI-B RLBench 26
   XI-C CALVIN 26
   XI-D MetaWorld 26
   XI-E Open6DOR 26
   XI-F ARNOLD 26
   XI-G RoboSuite & MimicGen 26
   XI-H SimplerEnv 27
   XI-I GAPartNet 27
   XI-J GAPartManip 27
   XI-K GraspNet-1B 27
   XI-L GarmentLab 27
   XI-M UniDoorManip 28
   XI-N RLAfford 28
   XI-O LIBERO 28
XII Task Generation 28
   XII-A Robot & Object Generation Protocol 28
XIII Teleoperation 29
   XIII-A Keyboard 29
   XIII-B Smartphone 29
   XIII-C Others 29
XIV Real2Sim Toolset for Asset and Task Generation 29
   XIV-A Overview 29
   XIV-B Components 29
      XIV-B1 Gaussian Splatting Reconstruction 29
      XIV-B2 Mesh Reconstruction 30
      XIV-B3 Loading the URDF into the Simulation Environment 30
      XIV-B4 Real-to-Sim Boosts Sim-to-Real Performance 30
   XIV-C Limitations and Challenges 30
XV Domain Randomization 31
   XV-A Scene Randomization 31
   XV-B Visual Material Randomization 31
   XV-C Light Randomization 31
   XV-D Camera Randomization 31
XVI Navigation and Locomotion Tasks 32
   XVI-A Navigation Tasks 32
   XVI-B Humanoid Tasks 32
   XVI-C HumanoidBench 32
XVII RoboVerse Benchmark Setup Details 32
   XVII-A Generalization Levels 32
   XVII-B RoboVerse Benchmark Protocol 34
XVIII Policy Training Details 34
   XVIII-A Implementation Details 34
   XVIII-B Diffusion Policy 34
XIX World Model Details 35
   XIX-A Methodology 35
   XIX-B Data Preparation 35
   XIX-C Experiments 35

VIII. SIMULATORS OVERVIEW

Simulators play an important role in robotics. A simulator is, in a sense, the womb of a robot: it is where a robot's behaviors are trained and tested before the robot is "born" into the real world, so its functionality is crucial for a successful robotic application. Users require different capabilities from a simulator depending on their scenario, whether it is a photorealistic task that demands accurate rendering of a close-to-real virtual world, or a massively parallel scene designed for efficient reinforcement learning. All of these requirements influence the choice of simulator. To reduce the effort users must spend familiarizing themselves with each new simulator, we incorporate multiple simulators into the RoboVerse platform and list the specifications of the simulators currently supported by RoboVerse in Table VI.

Due to the complexity of physics simulation and rendering, current simulators still cannot model the real world faithfully. Our experiments reveal that today's simulators share common issues even for basic physical laws. These results on fundamental conservation laws are a cautionary sign for the hope of directly transferring more complicated robotic behaviors from simulation to the real world.

We conducted experiments on three basic conservation laws of physics in three simulators. In the experiments on conservation of momentum, two rigid bodies are placed in a gravity-free environment, with initial states set up to produce an elastic collision. In the experiments on conservation of angular momentum, one or two rigid bodies are placed in the gravity-free environment with initial rotations, and we record the total angular momentum as the system evolves. In the experiments on conservation of kinetic energy, two rigid bodies are placed in the gravity-free environment with initial states set up to produce a rotation-free elastic collision; this setup lets us observe the conservation of kinetic energy directly, independent of the angular-momentum results.

From the results shown in Fig. 14, we can see that the basic conservation laws are not preserved in the three simulators. However, different simulators behave differently under different experimental setups, which indicates that, depending on the needs of a task, a different simulator may be required for more accurate results. This highlights the necessity of a tool that helps users easily transfer tasks among simulators.
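To make the bookkeeping behind these experiments concrete, the following is a minimal sketch of how conservation drift can be measured from recorded body states. It is illustrative only: the array layout (masses and per-step linear velocities) and the function name are assumptions for exposition, not the code used to produce Fig. 14, and the angular-momentum check would follow the same pattern with inertia tensors and angular velocities.

```python
# Minimal sketch of the conservation checks described above (illustrative only;
# the actual experiment code and state layout may differ).
import numpy as np

def conservation_drift(masses, velocities):
    """Relative drift of total linear momentum and kinetic energy.

    masses:     (N,) array of body masses.
    velocities: (T, N, 3) array of linear velocities over T recorded steps.
    Returns (momentum_drift, energy_drift) as relative errors w.r.t. step 0.
    """
    m = masses[None, :, None]                       # broadcast to (1, N, 1)
    p_total = (m * velocities).sum(axis=1)          # (T, 3) total linear momentum
    ke_total = 0.5 * (masses[None, :] * (velocities ** 2).sum(-1)).sum(axis=1)  # (T,)

    p_err = np.linalg.norm(p_total - p_total[0], axis=-1) / (np.linalg.norm(p_total[0]) + 1e-9)
    e_err = np.abs(ke_total - ke_total[0]) / (np.abs(ke_total[0]) + 1e-9)
    return p_err.max(), e_err.max()

# Synthetic example: two 1 kg bodies in a head-on, perfectly elastic collision.
# Both drift values should be (numerically) zero; a simulator that violates the
# conservation laws would report nonzero drift here.
masses = np.array([1.0, 1.0])
before = np.array([[ 1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
after  = np.array([[-1.0, 0.0, 0.0], [ 1.0, 0.0, 0.0]])
print(conservation_drift(masses, np.stack([before, after])))
```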

IX. THE METASIM FRAMEWORK

A. Architecture Overview

The METASIM framework is a unified simulation framework, as shown in Fig. 15. On the front-end side, it provides user-friendly Gym APIs as well as easy-to-use parallel-environment support. On the back-end side, it supports multiple simulators, allowing seamless transfer of tasks across simulators. Users only need to write a simulator-agnostic MetaConfig configuration class; the environment is then automatically instantiated with the designated back-end simulator.

Simulator | Physics Engine | Rendering | Sensor Support | Dynamics | GPU | Open
SAPIEN [122] | PhysX-5, Warp | Rasterization; RayTracing | RGBD; Force; Contact | Rigid; Soft; Fluid | ✓ | ✓
PyBullet [16] | Bullet | Rasterization | RGBD; Force; IMU; Tactile | Rigid; Soft; Cloth | | ✓
MuJoCo [111] | MuJoCo | Rasterization | RGBD; Force; IMU; Tactile | Rigid; Soft; Cloth | ✓ | ✓
CoppeliaSim [97] | MuJoCo; Bullet; ODE; Newton; Vortex | Rasterization | RGBD; Force; Contact | Rigid; Soft; Cloth | | ✓
IsaacLab [80] | PhysX-5 | RayTracing | RGBD; Lidar; Force; Effort; IMU; Contact; Proximity | Rigid; Soft; Cloth; Fluid | ✓ |
IsaacGym [72] | PhysX-5, Flex | Rasterization | RGBD; Force; Contact | Rigid; Soft; Cloth | ✓ |
Genesis [2] | Genesis | Rasterization; RayTracing | RGBD; Force; Tactile | Rigid; Soft | ✓ | ✓

TABLE VI: Comparison of Physics Simulators [104]. The column GPU denotes whether the simulator can use GPU-accelerated computation. The column Open denotes whether the simulator is open-source.

Fig. 14: Three series of experiments on conservation laws in simulators: (a) Momentum, (b) Angular Momentum, (c) Kinetic Energy. Blue, orange, and green lines are data collected from SAPIEN, IsaacGym, and PyBullet, respectively.

B. MetaConfig Configuration System

The METASIM framework uses MetaConfig, a unified configuration class that describes a scenario in a simulation environment. The configuration system sets up the simulator, defines the tasks, and sets up domain randomization. To run the same environment setting across different simulators, the configuration system is defined to be as simulator-agnostic as possible; simulator-specific settings (e.g., rendering mode or physics-engine solver type) are kept in a separate, simulator-specific part.

To make changing settings and debugging easier, we design the configuration system in a Hydra [125]-like way, so that each item in the configuration can be overridden from the command line, just as in Hydra [125]. The configuration system is implemented with Python dataclasses and can therefore use Python type annotations to assist users.

To run tasks seamlessly across all simulators, the tasks themselves must also be defined in a simulator-agnostic way. We configure a task by defining its object list, the robot in use, the success checker, and the reward. The success checker, which determines when the task has been successfully executed, is the most difficult part of the task definition. To standardize it, we offer structured success-checker templates that cover most cases, and we leave the option for users to define a callback function for the cases that the structured checkers cannot cover.

C. Aligned Simulation APIs

METASIM supports different simulator backends, including IsaacSim [85], IsaacGym [72], MuJoCo [111], PyBullet [16], SAPIEN [122], and CoppeliaSim [97, 47]. The framework is implemented in Python, as these simulators either natively support Python or provide Python APIs.

Common simulator operations are unified in a Handler class. Each handler supports only three basic APIs: get_state(), set_state(), and step(). The get_state() method takes a descriptive Python dict (e.g., {object_name: {'pos': ..., 'rot': ..., ...}}) as input and returns the current simulation states, following that query, in another Python dict structured in the same manner. The set_state() method also takes a descriptive Python dict as input and modifies the current simulation states to the ones included in the dict. The step() method advances the simulation by one timestep.
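To make the configuration workflow above concrete, the following is a minimal sketch of what a simulator-agnostic configuration could look like when built from Python dataclasses. The class and field names (ObjectCfg, TaskCfg, MetaCfg, the pick-cube example, and the displacement-based success-checker template) are illustrative assumptions for exposition, not the exact METASIM API.

```python
# Minimal sketch of a simulator-agnostic configuration in the spirit of
# MetaConfig (illustrative; names and fields are assumptions, not the exact API).
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class ObjectCfg:
    name: str
    urdf_path: Optional[str] = None   # asset in URDF form (e.g., for PyBullet/MuJoCo)
    usd_path: Optional[str] = None    # the same asset in USD form (e.g., for Isaac Sim)

@dataclass
class SuccessChecker:
    """Structured template: succeed when `obj` moves more than `distance` meters."""
    obj: str
    distance: float = 0.1

@dataclass
class TaskCfg:
    robot: str = "franka"
    objects: List[ObjectCfg] = field(default_factory=list)
    checker: Optional[SuccessChecker] = None

@dataclass
class SimSpecificCfg:
    # Simulator-specific knobs live apart from the simulator-agnostic task definition.
    solver: str = "default"
    rendering_mode: str = "rasterization"

@dataclass
class MetaCfg:
    sim: str = "isaaclab"             # back-end selector: "isaaclab", "mujoco", ...
    task: TaskCfg = field(default_factory=TaskCfg)
    sim_specific: SimSpecificCfg = field(default_factory=SimSpecificCfg)

# Switching back-ends only changes the `sim` field; the task definition is untouched.
cfg = MetaCfg(
    sim="mujoco",
    task=TaskCfg(
        robot="franka",
        objects=[ObjectCfg("cube", urdf_path="assets/cube.urdf")],
        checker=SuccessChecker(obj="cube", distance=0.05),
    ),
)
```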
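The three Handler APIs can then be composed into a rollout loop. The sketch below assumes a handler object exposing get_state(), set_state(), and step() as described above; the entity names ("cube", "franka"), the "dof_target" command key, and the policy callable are hypothetical and only serve to illustrate how dict-structured states flow through the interface.

```python
# Sketch of a rollout loop built on the unified Handler interface described above.
# Only get_state(), set_state(), and step() correspond to the interface in the text;
# the key names and the policy callable are illustrative assumptions.
def rollout(handler, init_state: dict, policy, horizon: int = 100) -> list:
    # Reset the scene by writing a full dict-structured state, then query back
    # only the entries we care about (Ellipsis marks the queried fields).
    handler.set_state(init_state)
    query = {"cube": {"pos": ..., "rot": ...}, "franka": {"dof_pos": ...}}
    states = [handler.get_state(query)]

    for _ in range(horizon):
        action = policy(states[-1])
        handler.set_state({"franka": {"dof_target": action}})  # write the command
        handler.step()                                          # advance one timestep
        states.append(handler.get_state(query))
    return states
```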
Fig. 15: Comparison between METASIM and other simulation environments. Left: other simulators and benchmarks, which use self-defined data formats, simulator-associated assets, simulator-dependent task definitions, and scripts. Right: METASIM, which decouples all components so that they are agnostic to any specific simulator or benchmark environment.

D. Gym API Wrappers

To support building learning environments, we define an Env class on top of Handler. It offers Gymnasium-like APIs (step, reset, render, and close), implementing these methods by leveraging the underlying Handler methods.

It is worth noting that most simulation stacks provide the low-level APIs (corresponding to our Handler) and the upper-level environments (corresponding to our Env) separately, for example SAPIEN [122] with ManiSkill [109], IsaacSim [85] with IsaacLab [80], CoppeliaSim [97]/PyRep [47] with RLBench [48], and MuJoCo [111] with MuJoCo Playground [132]. This supports the soundness of our two-level Handler/Env abstraction.

E. Backend Support

1) Isaac Lab: Isaac Lab [85] is an advanced robotics simulation platform developed by NVIDIA. By leveraging high-fidelity physics, GPU acceleration, and photorealistic rendering, it enables rapid prototyping, testing, and deployment of AI-driven robotics solutions in virtual environments. Through seamless integration with NVIDIA's Omniverse framework, Isaac Lab offers robust features such as domain randomization, sensor simulation, and support for large-scale reinforcement learning, making it a powerful tool for both research and industrial applications.

A key advantage of Isaac Lab is its compatibility with the Isaac ROS infrastructure, which includes valuable models such as FoundationPose [121, 120] and cuRobo [107], among others.

2) Isaac Gym: Isaac Gym [72] is a physics simulation environment designed for reinforcement learning research. Although it remains available for download, official support has ended. Nevertheless, multiple works published before 2024, such as Hora [91], Humanoid-Gym [38], and IPC-GraspSim [55], were developed using Isaac Gym.

Key features of Isaac Gym include support for importing URDF and MJCF files with automatic convex decomposition, a GPU-accelerated tensor API for managing environment states and actions, and a range of sensors (e.g., position, velocity, force, torque). Additional capabilities include runtime domain randomization of physics parameters, Jacobian and inverse kinematics support, and customizable friction settings.

3) Mujoco: MuJoCo [111] is a physics engine and simulation framework designed to accurately model the dynamics and control of complex robotic systems in real time. Its name stands for Multi-Joint dynamics with Contact, highlighting its primary emphasis on efficient computation of contact forces and multi-joint dynamics. The engine supports advanced features such as frictional contact models, user-defined actuators, and customizable sensor modalities, allowing researchers and developers to prototype, test, and refine control algorithms across a wide range of robot morphologies and tasks.

A key strength of MuJoCo is its computational precision, which enables high simulation throughput and real-time interactive control. It supports rigid-body dynamics, articulated mechanisms, and a variety of constraints, making it suitable for tasks involving locomotion, manipulation, and reinforcement learning. Furthermore, MuJoCo's flexible XML-based model description streamlines creating and modifying simulated environments, providing a straightforward way to experiment with novel designs. The compatibility between MuJoCo and Brax offers a high-speed, differentiable pipeline that is crucial for reinforcement learning. This blend of accuracy, speed, and flexibility has solidified MuJoCo's status as a leading choice in robotics research and machine learning, particularly for advanced control, motion planning, and reinforcement learning applications [29].

4) Genesis: Genesis [2] is a comprehensive physics platform developed for robotics and physics simulation research, unifying multiple core capabilities in a single environment. At its foundation is a universal physics engine, rebuilt from the ground up to simulate diverse materials and physical phenomena while seamlessly integrating various solvers. Alongside this engine, Genesis provides a swift, Python-friendly robotics simulation toolkit, an efficient photorealistic rendering system, and a data-generation module that converts natural-language prompts into multi-modal datasets. We leverage the Genesis backend to support loading, simulation, and rendering in the ROBOVERSE workflow.

5) Sapien: SAPIEN [122] is a robot simulation framework that allows highly efficient simulation and rendering of robotic
tasks. It uses PhysX [83] as the underlying physics engine. We support the released version SAPIEN 2.2 in the METASIM framework.

We use the multiprocessing library to support parallel environments in the Handler class for SAPIEN. When instantiating the environment from configurations, the desired number of processes are forked to run the simulation of the different environments. To support the get_states and set_states APIs, data for the different environments are distributed to the corresponding processes, and the return values are then gathered.

6) Pybullet: PyBullet [17] is a fast and easy-to-use robotics simulator. It uses its own physics solvers for accurate and efficient simulations. We support the released version PyBullet 3.2 in the METASIM framework. We use the same techniques as for SAPIEN to achieve parallel-environment simulation.

F. Hybrid Simulation Implementation

METASIM allows launching two simulators in a single process with one command. Taking our demo-collection command as an example: python collect_demo.py --sim=mujoco --renderer=isaaclab --task=$task. The implementation is illustrated in Code 2.

class HybridEnv:
    def __init__(self, env_physic: Env, env_render: Env):
        # One backend is stepped for physics; the other is used only for rendering.
        self.env_physic = env_physic
        self.env_render = env_render

    def step(self, action):
        # Advance physics in the physics backend.
        self.env_physic.handler.set_states(action=action)
        phys_states = self.env_physic.handler.get_states()
        # Mirror the physical states into the rendering backend and re-render.
        self.env_render.handler.set_states(states=phys_states)
        self.env_render.handler.refresh_render()
        states = self.env_render.handler.get_states()
        return ...

Code 2: Pseudocode for implementing hybrid simulation using two different simulator environments simultaneously. The core of this implementation is using states as a unified representation across both simulation environments.

X. ASSET CONVERSION

A. Asset types

The diverse landscape of robotic assets, stemming from prior research initiatives [142, 48, 81] and a multitude of software platforms [111, 72, 122], necessitates a robust strategy for managing a wide array of file formats. To facilitate dependable cross-simulator training and uphold data integrity throughout the development lifecycle, an efficient and reliable asset-conversion pipeline is of paramount importance [26]. Such a pipeline is crucial for ensuring seamless interoperability, minimizing potential data loss or inaccuracies, and promoting the uniform application of metadata and configurations across disparate simulation environments. Frequently encountered asset formats include, but are not limited to, MuJoCo XML control files [111], URDF files [8], and USD files [85].

These are the three predominant file formats in robotics simulation, and each serves distinct purposes and offers unique capabilities. MJCF (MuJoCo Configuration Format) stands out for its exceptional expressiveness in physics simulation, featuring sophisticated capabilities to model complex dynamical systems, including tendons, actuators, and advanced joint configurations, along with an integrated compiler for handling complex compile-time computations [111]. URDF (Unified Robot Description Format), while more constrained in its feature set, has emerged as the de facto standard in robotics due to its remarkable cross-platform compatibility and universal adaptability across simulation environments including Isaac Sim [85], Isaac Gym [72], MuJoCo [111], Gazebo, and PyBullet [16], making it ideal for robot-model exchange despite its limitations in representing parallel mechanisms or complex sensor configurations [8]. USD (Universal Scene Description), originally developed by Pixar Animation Studios, excels in high-fidelity rendering and scene composition through its sophisticated layering system and variant sets [22], making it particularly valuable for applications requiring advanced visual properties and collaborative workflows [84], although its physics simulation capabilities are more limited compared to dedicated robotics formats like MJCF [26].

Features | MJCF | URDF | USD
Basic Geometries | ✓ | ✓ | ✓
Mesh Support | ✓ | ✓ | ✓
Texture Support | ✓ | Limited | ✓
Material Properties | ✓ | Basic | ✓
Physics Properties | ✓ | ✓ | Limited
Joint Types | Many | Basic | Basic
Collision Properties | Advanced | Basic | Advanced
Deformable Objects | ✓ | ✗ | ✓
Animation Support | Limited | ✗ | ✓
Scene Composition | Basic | ✗ | Advanced
File Format | XML | XML | ASCII/Binary

TABLE VII: Comparison of Robot Description Formats

B. Conversion Pipeline

Given that our simulation pipeline primarily utilizes Isaac Sim for rendering while many of our assets are originally stored in MJCF format, a two-stage conversion pipeline (MJCF → URDF → USD) becomes necessary and advantageous. This approach leverages URDF as an intermediate format for several reasons. First, while direct conversion from MJCF to USD is theoretically possible, such conversion would be complex and error-prone due to MJCF's rich feature set for physics properties
(like tendons and actuators) that lack direct equivalents in frequently exhibit texture alignment discrepancies following
USD [115]. Instead, converting to URDF first allows us to the .msh to .obj conversion process.
standardize the robot’s basic kinematic and dynamic properties Fortunately, we discovered that many asset collections
in a format that has well-established conversion tools and maintain redundant mesh representations, often including a
widespread support. The subsequent URDF to USD conversion benefits from Isaac Sim's robust URDF importing capabilities, which have been extensively tested and optimized for robotics applications. This two-stage pipeline thus ensures more reliable asset conversion while maintaining essential physical properties and compatibility across different simulation environments.

1) MJCF to URDF conversion: We implemented our own MJCF to URDF converter by first parsing everything with MuJoCo's MJCF importer, then exporting all texture, collision mesh, and joint information to the correct URDF format. The inspiration is taken from Genesis [2], which builds its own class for each asset object that encodes all joint, texture, and mesh information. We then recursively convert the body information to URDF and align it with the corresponding textures.

a) Parsing Link, Joint, and Body Information from the MJCF file: To parse link, joint, and body information from the MJCF file, we leverage MuJoCo's parsing capabilities to load the MJCF XML into a MuJoCo model structure. From this parsed model, we employ a recursive approach, starting from the root body and descending into each child body to systematically process the hierarchical structure. For each body, we extract detailed link properties such as name, position, orientation, inertial characteristics, and associated geometry. Simultaneously, we parse joint information connected to each body, including joint type, limits, and axis of motion. All of this extracted link and joint data is systematically organized and stored in dictionary structures. These dictionaries serve as intermediate representations, holding all the necessary information from the MJCF model in a structured format that is readily accessible for subsequent stages of the URDF conversion process.

b) Aligning Meshes and Textures: The management of collision meshes across existing asset libraries presents a notable challenge, as these assets are typically stored in various formats including .msh, .obj, and .stl files. While URDF natively supports .obj and .stl formats, the conversion of .msh files into URDF-compatible formats requires careful consideration. Although MuJoCo's repository provides a conversion utility for transforming .msh files to .obj format—accomplished by parsing the .msh files through the MuJoCo interface and subsequently exporting vertex and face information—this approach introduces potential complications with texture mapping alignment.

The complexity arises from the specific requirements of texture files, which are predominantly stored as albedo PNG files. These textures depend on precise UV mapping coordinates within the .obj file to ensure proper alignment. The current .msh to .obj conversion utility provided in the MuJoCo repository does not adequately address texture support, leading to potential misalignment issues in the converted models. This limitation is particularly evident in comprehensive robotics frameworks such as Libero [65], where both static and articulated objects typically include a properly UV-mapped .obj file alongside the .msh file, usually sharing the same filename or designated as "textured.obj". Leveraging this observation, we implemented a robust mesh alignment pipeline that follows a hierarchical decision process:
• First, the system searches for an existing .obj file within the same directory as the .msh file.
• If found, this pre-existing .obj file is utilized, ensuring proper texture alignment.
• In the absence of a pre-existing .obj file, the system proceeds with the .msh to .obj conversion.
• In the latter case, users receive a warning notification regarding potential texture misalignment issues.

Following the mesh format resolution, the pipeline systematically maps these processed mesh files back to their corresponding links within the URDF structure, maintaining the integrity of the robot's geometric representation while preserving texture information where possible.

c) Building URDF: After all the conversions, the assembly procedure is straightforward: we process robot links and joints, incorporating their properties and relationships into the URDF format. This automated approach ensures a robust and flexible method for generating URDF files, accommodating a wide range of robot configurations and properties derived from the preceding conversion steps.

Although this pipeline works for most MJCF assets, certain MJCF files, particularly those within specific packages or directories, necessitate bespoke conversion strategies. These exceptions arise due to the inherent complexity and variability in MJCF file structures across different projects and asset libraries. To address these unique cases, we have adopted a tailored approach, implementing case-specific modifications to our conversion pipeline as required. The subsequent table details instances where such specialized treatment has been applied, along with the corresponding conversion success rates achieved for each package.

2) URDF to USD conversion: IsaacSim has implemented a robust solution for converting URDF files to USD format. The conversion process comprehensively preserves the robot's structural and kinematic information, including joint hierarchies, geometric properties, and physical attributes. The implementation demonstrates exceptional fidelity in translating complex robotic descriptions, ensuring that all essential components—such as joint configurations, collision geometries, and visual representations—are accurately encoded in the resulting USD files.
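To make the hierarchical decision process of b) concrete, the following is a minimal Python sketch of the mesh-resolution step. It is illustrative rather than the actual ROBOVERSE implementation: the convert_msh_to_obj helper is only a placeholder for the MuJoCo-provided .msh-to-.obj utility mentioned above.

```python
from pathlib import Path
import warnings


def convert_msh_to_obj(msh_path: Path) -> Path:
    """Placeholder for the MuJoCo-provided .msh -> .obj utility.

    The real converter parses the .msh file and exports vertex/face
    information, but it does not write UV coordinates, so textures may
    end up misaligned in the converted mesh.
    """
    raise NotImplementedError("delegate to the MuJoCo .msh-to-.obj conversion utility")


def resolve_mesh_for_urdf(msh_path: Path) -> Path:
    """Return a URDF-compatible mesh path for a MuJoCo .msh asset.

    Decision process: prefer a pre-existing, UV-mapped .obj that ships
    alongside the .msh file (same stem, or the conventional 'textured.obj');
    only fall back to .msh -> .obj conversion when no such file exists,
    emitting a warning about possible texture misalignment.
    """
    candidates = [
        msh_path.with_suffix(".obj"),       # same filename, .obj extension
        msh_path.parent / "textured.obj",   # common naming convention in asset packs
    ]
    for obj_path in candidates:
        if obj_path.exists():
            return obj_path                 # pre-existing mesh with proper UVs

    warnings.warn(
        f"No companion .obj found for {msh_path}; converting the .msh file. "
        "Texture UVs may be misaligned in the converted mesh."
    )
    return convert_msh_to_obj(msh_path)
```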
Given the proprietary nature of IsaacSim's conversion implementation, we utilize their framework as an external tool in our pipeline. This approach leverages the proven reliability and performance of IsaacSim's converter while maintaining compatibility with our broader system architecture. The conversion process serves as a critical bridge between standard robotics formats and the high-performance USD representation required for our simulation environment.

XI. TASK AND DATA MIGRATION

A. ManiSkill

ManiSkill [81, 37, 109] provides a series of robotic manipulation tasks under single-arm or dual-arm settings.

Tasks and assets: We migrate basic single-arm tasks and demonstrations to RoboVerse, including pick-and-place tasks like PickCube and PickSingleYCB, as well as insertion tasks like PegInsertionSide and PlugCharger. The corresponding assets are manually crafted with primitives or processed from the mesh files, with proper physics APIs set up.

Demonstrations: For each task, a large number of demonstration trajectories are available in the released data. Notably, the data does not come with the initial scene states, which we obtain by replaying the demonstrations within the SAPIEN simulator. With the specified seeds set, the states are recovered by the random samplers. The success checkers are implemented according to the task designs.

B. RLBench

RLBench [48] is a large-scale benchmark and learning environment for robotic manipulation, featuring 100 diverse, hand-designed tasks ranging in complexity from simple actions like reaching to multi-stage tasks like opening an oven and placing a tray inside. Each task includes an infinite supply of demonstrations generated via waypoint-based motion planning.

Tasks and assets: We roll out ∼2K trajectories in RLBench [48] for each task and migrate them to ROBOVERSE.

C. CALVIN

CALVIN [79] provides 6 hours of teleoperation trajectories across 4 environments, each involving an articulated table with three blocks in blue, pink, or red.

Tasks and assets: We migrate the demonstrations in all 4 environments and transform the original assets (URDF for the table, and primitives for the cubes) into USD files with proper physics APIs.

Demonstrations: We segment the trajectories according to the text annotations, which specify the task category (e.g., PlaceInSlider), the language instruction (e.g., place the red block in the slider), and the timestamps of the demonstration segment. The state of the first frame is adopted as the initial scene state.

Success checkers: We carefully implement the success checkers according to the original implementation to make sure failed executions can be filtered out. This is necessary because the coarsely annotated timestamps in the dataset may lead to failed executions in part of the demonstrations.

D. MetaWorld

MetaWorld [131] is a widely used benchmark for multi-task and meta-reinforcement learning, comprising 50 distinct tabletop robotic manipulation tasks involving a Sawyer robot.

Tasks and Assets: We integrate five representative tasks into RoboVerse: drawer open, drawer close, door close, window open, and window close. The corresponding assets are manually converted from MJCF to USD files with appropriate physics APIs.

Demonstrations: As the benchmark does not provide demonstrations, we generate trajectories for each task by rolling out reinforcement learning policies from [123].

E. Open6DOR

Open6DOR is a benchmark for open-instruction 6-DoF object rearrangement tasks, which require embodied agents to move target objects according to open instructions that specify their 6-DoF poses.

Tasks and Assets: The synthetic object dataset comprises 200+ items spanning 70+ distinct categories. Originally derived from YCB [7] and Objaverse-XL [22], the objects are carefully filtered and scaled using a standardized mesh representation. Overall, the Open6DOR benchmark consists of 5k+ tasks, divided into the position-track, rotation-track, and 6-DoF-track, each providing manually configured tasks along with comprehensive and quantitative 3D annotations.

Success checkers: We determine success by comparing the target object's final pose with the annotated ground-truth pose range.

F. ARNOLD

Arnold [36] is a benchmark for language-conditioned manipulation. The benchmark uses motion planning and keypoints for robot manipulation tasks, focusing on fine-grained language understanding.

Tasks and Assets: We integrate six out of eight tasks from Arnold into RoboVerse: picking up objects, reorienting objects, opening/closing drawers, and opening/closing cabinets.

Demonstrations: As the benchmark does not use trajectory-level demonstrations, we use motion planning to generate trajectories by interpolating between keypoints.

G. RoboSuite & MimicGen

RoboSuite [142] provides a suite of task environments for robotic manipulation, built on the MuJoCo physics engine. Each task is implemented as a separate class, with most configuration details embedded in the source code. Based on these environments, MimicGen [76] offers thousands of demonstrations, serving as a widely used benchmark for imitation learning.

Tasks and Assets: For tasks with separate object description files (MJCF), we directly migrate the corresponding assets through our Asset Conversion pipeline. However, some tasks contain hard-coded assets within the source code, such as a hammer composed of multiple cubes, cylinders, and other primitives with carefully designed relative poses. To integrate
these tasks, we will manually reconstruct the assets within our framework. We also argue that hard-coded asset and task definitions, as opposed to modular task descriptions, are not scalable for future robotic task benchmarking.

Demonstrations: We convert MimicGen demonstrations into our format. Specifically, we transform the robot actions from 6-DoF Cartesian space representations to joint space. Additionally, the state of the first frame is adopted as the initial scene state.

Success Checkers: We meticulously implement success checkers based on the original definitions to ensure failed executions are effectively filtered out.

H. SimplerEnv

SimplerEnv is a set of tasks and methods designed for trustworthy benchmarking of manipulation policies in simulation, such that simulated results reflect real-world success rates.

There are 25 different tasks in SimplerEnv in total. We ignore tasks that are merely subsets of other tasks and migrate 6 tasks and 52 object assets to ROBOVERSE. All tasks use the Google Robot.

SimplerEnv provides controller models trained on the RT-1 [4] and RT-X [15] datasets. We do not use the trajectories from these datasets directly because some environment settings differ from the SimplerEnv environments. Instead, we use the trained models to collect trajectories: hooks are inserted into the original SimplerEnv codebase to extract and maintain recordings at different stages of simulation, and we then roll out the model trained on the RT-1 dataset on each task to collect the trajectories.

I. GAPartNet

For tasks in GAPartNet [34], we generate both motion planning [34] and reinforcement learning [32] trajectories. GAPartNet is implemented in IsaacGym [72] with various articulated objects. To integrate it into RoboVerse, we first align all articulated object initial states to the MetaSim format and convert the asset format to USD for compatibility across different simulators.

For trajectory generation:
(1) Motion Planning: GAPartNet [34] introduces a part-centric manipulation approach. We roll out heuristics to generate manipulation trajectories, providing three demonstrations per part with different object and part initial states.
(2) Reinforcement Learning Rollout: The follow-up work, PartManip [32], proposes several reinforcement learning methods. We re-train all policies based on our robot setup and roll out trajectories for dataset collection. With aligned task configurations, trajectories, and assets, we successfully adapt GAPartNet into RoboVerse.

J. GAPartManip

Instead of providing direct demonstrations, GAPartManip [18] offers a large-scale, part-oriented, scene-level dataset with annotations for actionable interaction poses. We utilize the mesh-level grasping pose annotations in this dataset to generate diverse demonstrations for articulated object manipulation.

Tasks and Assets: We currently implement two tasks: OpenBox and OpenToilet. For the OpenBox task, we collect 12 object assets from the Box category in the original dataset. For the OpenToilet task, we gather 30 objects from the Toilet category. We convert these assets into USD files with appropriate physics APIs to ensure compatibility with our simulation environment.

Demonstrations: We generate demonstrations for our tasks in simulation using motion planning with CuRobo [106]. First, we filter potential grasping poses for the target object link by assessing their feasibility through motion planning. Specifically, we discard poses that the end-effector cannot reach or that would cause a collision between the robot and the object. Next, we generate an end-effector pose trajectory to complete the task using heuristics. Based on the object's kinematic tree, we define an ideal trajectory. We then apply motion planning to perform inverse kinematics, computing the corresponding joint poses of the robot along this trajectory. Finally, we execute the planned trajectory in simulation to verify task completion, saving successful trajectories as demonstrations. The entire demonstration generation process is conducted in IsaacSim [85].

Success Checkers: To determine task success, we require the manipulated object to be opened by at least 60 degrees for all tasks.

K. GraspNet-1B

GraspNet-1B [27] is a general object grasping dataset for predicting 6-DoF grasping poses given partial point cloud input. It contains 256 real-world tabletop scenes with a total of 88 different objects. We carefully select 58 objects as our target grasping objects based on whether the real items can be purchased, since we need to evaluate our policies by grasping them in real-world experiments. To generate grasping demonstrations, we use CuRobo [107] as the motion planner to generate robot end-effector trajectories starting from a fixed initial pose and ending at a target object grasping pose. The grasping poses are obtained from the grasping annotations used to train GraspNet [27]. We also randomize the object positions to generate more diverse layouts. Finally, we validate the trajectories in our framework and filter out invalid ones by controlling the robot to follow the generated grasping trajectories. In the end, we successfully generate about 100k valid grasping trajectories.

L. GarmentLab

GarmentLab [69] is the first robotic manipulation benchmark for deformable object and garment manipulation. It integrates 10 categories of versatile garment assets, and the total number of USD assets reaches 6k. To generate manipulation demonstrations, we directly roll out the trajectories provided by the official codebase in IsaacSim and collect the corresponding state information in a parallel process. Although the trajectories provided by the official codebase are limited and hard-coded, we further extend the number of demonstrations by applying different garments and textures, and all the demonstrations are
validated by the original success checker. Finally, we have successfully collected 6k trajectories.

M. UniDoorManip

UniDoorManip [64] provides an articulated manipulation environment reflecting different realistic door manipulation mechanisms, and a large-scale door dataset containing 6 door categories with hundreds of door bodies and handles stored in URDF format. We convert those door assets into USD format with physics APIs from IsaacSim and manually verify the correctness of the joint-link relationships. Demonstrations are collected by directly rolling out the hard-coded trajectories in IsaacGym. We eventually collect about 1k successful, valid demonstrations.

N. RLAfford

RLAfford [35] investigates the generalization ability of deep reinforcement learning models on articulated object manipulation tasks in the presence of a computer vision model that is co-trained with them in an end-to-end manner. This work provides a dataset of articulated objects and 8 tasks for benchmarking.

In RoboVerse, we have adapted 4 tasks (open cabinet, open drawer, close cabinet, close drawer) and in total 40k trajectories from RLAfford.

In the task adaptation, we include 40 articulated objects from the RLAfford dataset and use the same robot description file as RLAfford. We then record 1000 trajectories for each object in its corresponding task.

The trajectory recording is achieved with several hooks we inserted into the original RLAfford codebase. The hooks are used to extract and maintain the recordings at different stages of simulation. We evaluated the released RLAfford model with the hook-inserted scripts. In the initialization stage, objects and robots are initialized with randomization, and their poses and DoF information are recorded. For each simulation step, the DoF position information of objects and robots is recorded in the trajectories. In the end, for each object, a separate trajectory file of 1000 different trajectories is saved in the RoboVerse-supported format.

O. LIBERO

LIBERO [65] manages data loading and task execution through a combination of INIT (initialization files), BDDL (Behavior Description Definition Language), and HDF5 datasets. Specifically, the initialization files define scene layouts, object properties, and basic task goals; the BDDL format captures semantic details and object affordances; and the HDF5 files store structured data such as object positions and robot actions for dynamic retrieval at runtime.

To migrate a LIBERO task into MetaSim, we parse the relevant BDDL file to identify which objects are involved and what type of manipulation context is required. Then we get the robot and object initial states from the INIT files, followed by the corresponding robot actions from the HDF5 dataset. These elements are combined into our PKL file format while also recording the participating objects in our MetaCfg. This process ensures that all necessary components of a LIBERO task (initial states and action data) are fully translated and ready for execution in MetaSim.

We further augment the data by randomly sampling initial positions around each LIBERO demonstration, thus increasing the effective number of demos well beyond the original 50 per task. The spatial sampling range is dynamically chosen based on the task context and object dimensions, ensuring that the augmented configurations remain physically plausible.

XII. TASK GENERATION

A. Robot & Object Generation Protocol

Our task generation pipeline (Fig. 16) begins with a user prompt describing the desired theme or constraints of a robotic task (e.g., "place the butter in the drawer and close it"). From here, the system proceeds in two main phases, mediated by large generative model calls:

1) call_gpt_to_generate_task(): Conceptual Task Generation. This initial function queries the model for a high-level task overview. It requests:
• A unique task name (e.g., "ButterDrawerTask").
• A short, human-readable instruction (e.g., "Place the butter in the drawer, then close the drawer.").
• A candidate list of robots and objects to appear in the scenario, referencing an internal asset library (see below).
The large generative model draws on its generative abilities to propose creative or contextually relevant tasks, while remaining loosely guided by the user prompt [119, 118, 39, 140]. As shown in Fig. 16, the model might retrieve a "drawer" asset from a different benchmark and a "butter" asset from a separate dataset, combining them into a single scene idea.

2) call_gpt_to_get_init_state(): Physical Layout Refinement. After receiving the conceptual description, we call the model again to specify x,y coordinates for each listed item. During this second phase, the user can provide prompts that include minimal bounding constraints (e.g., permissible table edges, object height) to help the model generate varied initial states via few-shot learning.

Asset Library. To ground the large generative model's outputs in realistic data, we maintain an asset library (via JSON files) that describes each robot or object's core attributes (e.g., asset filepath, default rotation, size). The two core functions above selectively pull from this library.

Input and Output Format.
• Input: A user prompt (e.g., "create a tabletop scene with a random container and a snack food"). The pipeline loads relevant asset definitions and passes them to the large generative model calls.
• Output: A merged init_state or "initial state" dictionary capturing the initial state config needed for simulation: the chosen robot/object list, each item's final
x,y,z coordinate, and the textual instructions, as shown in the right half of Fig. 16.

TABLE VIII: Vision-Language-Action (VLA) Model Results on ROBOVERSE Imitation Learning Benchmark. Constrained by time and resources, we report VLA models' results on two simple tasks from ROBOVERSE and language-conditioned grasping tasks with diverse and challenging language instructions. We split 58 objects in GraspNet into three sets, each containing progressively more challenging objects based on their geometry.

Method  | Simple: PickCube | Simple: MoveSliderLeft | Grasping: Object Set 1 | Grasping: Object Set 2 | Grasping: Object Set 3
OpenVLA | 40.0             | 45.0                   | 46.0                   | 33.3                   | 14.4
Octo    | 50.0             | 30.0                   | 42.0                   | 14.4                   | 2.2

XIII. TELEOPERATION

Ensuring flexible and intuitive remote operation is critical in a robotic teleoperation system, particularly when collecting large volumes of high-quality data. In this work, we designed a suite of input methods to facilitate robot teleoperation within the METASIM infrastructure. By supporting keyboard, DualSense joystick, smartphone, and VR-based controls, our system accommodates varying user preferences and experimental needs. This section details our design rationale, implementation steps, and practical considerations for each control interface.

A. Keyboard

Keyboard input is an accessible method for controlling robots in simulation. Our implementation supports multi-key combinations for diagonal movement and enables full six-degree-of-freedom manipulation of the end effector. Translational movement follows the world coordinate frame (UP: +X, DOWN: -X, LEFT: +Y, RIGHT: -Y, 'e': +Z, 'd': -Z), while rotations in the local EE frame are controlled via 'q'/'w' (roll), 'a'/'s' (pitch), and 'z'/'x' (yaw). The spacebar toggles the gripper. To assist users and avoid hotkey conflicts with the simulation viewer, we provide an operation window displaying instructions using pygame. While efficient and hardware-independent, this method lacks 3D spatial representation, reducing user intuition. Additionally, Euler angle-based rotation control risks gimbal lock, potentially leading to loss of rotational degrees of freedom and failure in certain configurations.

B. Smartphone

Modern smartphones, equipped with advanced sensors and wireless communication, offer an ideal low-cost solution for intuitive teleoperation from any location. However, existing smartphone-based 6-DoF methods, such as those relying on accelerometers or vision-based Visual Inertial Odometry (VIO) systems (e.g., ARKit), suffer from instability due to sensor noise, low update rates, or weak visual features [40, 73, 74, 75]. Additionally, no open-source Android app exists for such implementations. To overcome these limitations, we adopt a hybrid approach: using smartphone orientation for orientation control and on-screen buttons for precise translation. Unlike the keyboard interface, where roll, pitch, and yaw are controlled incrementally via discrete keypresses (i.e., delta orientation adjustments), the smartphone directly provides absolute orientation data in the form of quaternions. Quaternions, due to their compactness and immunity to gimbal lock, allow for a more stable and accurate representation of the smartphone's orientation in the world frame. As illustrated in Fig. 18, real-time data from the smartphone's inclination, rotation, and magnetic field sensors is fused to compute spatial orientation with ±5° accuracy at a frequency of 50 Hz. This data is transmitted via WebSocket, ensuring low-latency communication. The app interface features six buttons for translation control in the local coordinate system and two switches for toggling orientation updates and gripper control. Multi-touch input is supported to enable users to send combined control signals, such as simultaneous movement along multiple axes, improving control flexibility and efficiency. As shown in Fig. 19 and Fig. 17, tilting the smartphone controls the gripper's orientation, while combining multi-touch signals from on-screen buttons enables precise and complex manipulation in 3D space. However, to mitigate magnetic interference, users should maintain a minimum distance of 10 cm from strong magnetic sources such as laptops and other electronic devices. This design optimizes resource utilization, providing a high-precision 6-DoF remote operation experience at minimal cost, rivaling professional-grade teleoperation systems.

C. Others

Beyond keyboard and smartphone controls, our system incorporates support for DualSense joysticks and VR controllers. The DualSense joystick provides ergonomic advantages and high-fidelity analog inputs for nuanced velocity control, mapping triggers and joysticks seamlessly to robot motion. The VR interface enhances spatial awareness and precision by enabling natural gestures and directional cues for control.

Future work could extend VR capabilities by integrating haptic feedback to improve user immersion and task accuracy. Additionally, the modular design of our system facilitates the integration of emerging input devices with minimal development effort.

XIV. REAL2SIM TOOLSET FOR ASSET AND TASK GENERATION

A. Overview

The Real2Sim toolset, specifically Video2URDF, provides a systematic pipeline to reconstruct environment geometry and robotic assets from monocular video input. By leveraging advanced reconstruction techniques, this pipeline produces meshes and unified robot descriptions that can be used in simulation-based experiments. In doing so, it helps bridge the gap between real-world data and simulated environments, enabling more accurate and comprehensive benchmarking [68].

B. Components

1) Gaussian Splatting Reconstruction: The first step in the pipeline involves Gaussian splatting [53], which converts
Fig. 16: Illustration of the two-phase generation protocol. A user prompt guides the LLM to propose an overall task and item list. The system then refines object positions and merges them into a final initial state. (Figure elements include: the user prompt; asset retrieval, e.g., a table from CALVIN, butter from LIBERO, and assets in ROBOVERSE; the meta config, data, and task config; the resulting scene in ROBOVERSE; the generated task instruction "Place butter in the drawer, then close the drawer"; and teleoperation.)

monocular video frames into a set of Gaussian kernels for rendering [130]. This representation captures key scene features such as depth, color, and collision boundaries in a compact and efficient way. As a result, it provides a visually faithful preview of the scene and serves as an intermediate step before detailed mesh reconstruction.

2) Mesh Reconstruction: Once the high-level scene structure is represented by Gaussian splatting, we perform mesh reconstruction to obtain a more precise geometric model using TSDF extraction [133, 128, 129, 45]. This step recovers the meshes of:
• The environment, including rigid, immovable structures (e.g., a table).
• The manipulatable object, which is central to the task at hand.
• The robotic arm and end effector, assumed to have a deterministic configuration during real-to-sim and sim-to-real transitions.

We use a visual-language model (VLM) and available CAD design information to generate a unified URDF (or MJCF) description for these components. This division of the workspace follows the notion of worldconfig in curobo [107], ensuring that each element of the scene (robot, object, environment) is cleanly separated and can be easily adapted or replaced as needed.

3) Loading the URDF into the Simulation Environment: After the URDF (or MJCF) files are generated, the final step is to import them into a simulator, such as MuJoCo [111] in RoboVerse. This allows researchers to configure tasks that accurately reflect real-world scenarios, forming a benchmark for training and evaluating robotic manipulation algorithms. The resulting simulated environment benefits from high-fidelity geometry and a consistent representation of the physical workspace.

4) Real-to-Sim Boosts Sim-to-Real Performance: We train a grasping model on assets from our Real2Sim module and compare it with DexGraspNet [134], achieving an 80% success rate compared to the 50% baseline from DexGraspNet. We use our Real2Sim assets in physics-based simulations that closely replicate real-world grasping conditions, enabling robust grasp execution. See Fig. 20 for visualization.

C. Limitations and Challenges

While the Real2Sim pipeline effectively reconstructs most of the relevant geometry, it struggles with completely unseen meshes and complex material properties [139]. Furthermore, parameters such as friction and mass are inherently difficult to estimate purely from visual data, introducing uncertainties that may affect simulation fidelity. Despite these challenges, Real2Sim offers a powerful approach to rapidly generating simulation-ready assets for benchmarking in robotic manipulation tasks.
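As a concrete illustration of steps 2) and 3), the snippet below sketches how reconstructed meshes could be assembled into a single URDF before being imported into a simulator. It is a minimal, hypothetical example using only the Python standard library; file names, link names, and inertial values are placeholders rather than the actual Video2URDF output.

```python
import xml.etree.ElementTree as ET


def make_link(name: str, mesh_file: str, mass: float = 1.0) -> ET.Element:
    """Create a URDF <link> whose visual and collision geometry use the same mesh."""
    link = ET.Element("link", name=name)
    inertial = ET.SubElement(link, "inertial")
    ET.SubElement(inertial, "mass", value=str(mass))
    ET.SubElement(inertial, "inertia", ixx="1e-3", iyy="1e-3", izz="1e-3",
                  ixy="0", ixz="0", iyz="0")
    for tag in ("visual", "collision"):
        elem = ET.SubElement(link, tag)
        geom = ET.SubElement(elem, "geometry")
        ET.SubElement(geom, "mesh", filename=mesh_file)
    return link


def build_workspace_urdf(env_mesh: str, obj_mesh: str, out_path: str) -> None:
    """Assemble environment and object meshes into one workspace URDF.

    The environment is attached to the world by a fixed joint (so it stays
    static regardless of its inertial values); the object uses a floating
    joint so the simulator can move it freely.
    """
    robot = ET.Element("robot", name="real2sim_workspace")
    robot.append(ET.Element("link", name="world"))
    robot.append(make_link("environment", env_mesh))
    robot.append(make_link("target_object", obj_mesh, mass=0.2))

    fixed = ET.SubElement(robot, "joint", name="world_to_env", type="fixed")
    ET.SubElement(fixed, "parent", link="world")
    ET.SubElement(fixed, "child", link="environment")

    floating = ET.SubElement(robot, "joint", name="world_to_obj", type="floating")
    ET.SubElement(floating, "parent", link="world")
    ET.SubElement(floating, "child", link="target_object")

    ET.ElementTree(robot).write(out_path, xml_declaration=True, encoding="utf-8")


if __name__ == "__main__":
    build_workspace_urdf("meshes/table.obj", "meshes/object.obj", "workspace.urdf")
```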
Fig. 17: Sequential demonstration of smartphone-based control for stack cube and close box tasks.

Fig. 18: Visualization of the smartphone's local coordinate system, world-frame orientation, and app functionality: six buttons control translation, and two switches toggle orientation control and gripper state.

Fig. 19: The smartphone app enables 6-DoF control using orientation sensing and multi-touch buttons for translation commands, while the simulated robot's movements are visualized in real-time on the workstation.
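The smartphone interface of Sec. XIII-B boils down to combining an absolute orientation (a quaternion streamed from the phone) with incremental translations from on-screen buttons. The sketch below, which assumes a quaternion in (x, y, z, w) order and illustrative button names, shows one way such messages could be turned into an end-effector pose target; it is not the actual ROBOVERSE app protocol.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Illustrative mapping from on-screen buttons to local-frame translation axes.
BUTTON_AXES = {
    "forward": np.array([+1.0, 0.0, 0.0]), "back":  np.array([-1.0, 0.0, 0.0]),
    "left":    np.array([0.0, +1.0, 0.0]), "right": np.array([0.0, -1.0, 0.0]),
    "up":      np.array([0.0, 0.0, +1.0]), "down":  np.array([0.0, 0.0, -1.0]),
}
STEP = 0.005  # meters of translation per control tick (assumed)


def update_ee_target(ee_pos, phone_quat_xyzw, pressed_buttons, use_phone_orientation=True):
    """Compute the next end-effector pose target from one teleoperation message.

    ee_pos: current Cartesian target (3,). phone_quat_xyzw: absolute phone
    orientation in the world frame. pressed_buttons: set of button names
    (multi-touch allows several at once). Returns (new_pos, target_quat_xyzw).
    """
    rot = R.from_quat(phone_quat_xyzw)  # absolute orientation, no gimbal lock

    # Sum the axes of all simultaneously pressed buttons (multi-touch),
    # expressed in the local frame, then rotate the step into the world frame.
    local_delta = sum((BUTTON_AXES[b] for b in pressed_buttons), np.zeros(3))
    world_delta = rot.apply(STEP * local_delta)

    target_quat = rot.as_quat() if use_phone_orientation else None
    return ee_pos + world_delta, target_quat


if __name__ == "__main__":
    pos, quat = update_ee_target(
        ee_pos=np.zeros(3),
        phone_quat_xyzw=[0.0, 0.0, 0.0, 1.0],    # identity orientation
        pressed_buttons={"forward", "up"},        # combined multi-touch command
    )
    print(pos, quat)
```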

XV. DOMAIN RANDOMIZATION

A. Scene Randomization

For scene randomization, we curate 3D simulatable scene assets from existing 3D scene datasets [30, 36, 21, 49]. Specifically, we convert all assets to the USD format for integration. Additionally, we employ the articulated scene generation method PhyScene [127] to create realistic scenes with articulated objects and mix the generated room-level scenes with house-level 3D scenes like ProcTHOR for greater diversity. We replay demonstrations in these scenes by selecting surfaces (e.g., floors, tables) that provide sufficient workspace, guided by heuristic-based spatial constraints, following [36].

B. Visual Material Randomization

It is optional to attach random visual materials to object surfaces. Visual materials are randomly selected from a curated subset of ARNOLD [36] and vMaterials [84], providing more than 300 high-quality visual material candidates. Additionally, users can also randomize the reflection properties of a given visual material by setting roughness, specular, and metallic values to random numbers between 0 and 1.

C. Light Randomization

Two lighting configurations are supported: distant light and cylinder light arrays. For distant lighting, the polar angle of the light source is randomized. For cylinder lighting, a randomly generated n × m matrix of cylinder lights, each with a randomized size, is added at a fixed height above the agents. In both configurations, the intensity and color temperature of the lights are randomized within physically plausible ranges.

D. Camera Randomization

A total of 59 candidate camera poses are carefully selected, with the majority oriented to face the robot directly and a
Fig. 20: Visualization of our real2sim pipeline for robotic grasping.

smaller subset positioned at side-facing angles.

XVI. NAVIGATION AND LOCOMOTION TASKS

A. Navigation Tasks

To integrate vision-and-language navigation into IsaacSim, we first correct the error-containing instructions by refining incorrect punctuation and grammar using ChatGPT. Next, we validate the ground-truth trajectory by sweeping the robot's 3D model (based on the ground-truth trajectory) through the scene. The trajectory is deemed invalid if collisions occur between the robot and the scene. Additionally, we adopt the same evaluation metrics as VLN-CE [58]. For controlling the robot, we provide two different types of mobile embodiments, including a Unitree Go2 robot dog and a JetBot wheeled robot, making our task suitable for a variety of policies (with different navigation capabilities).

B. Humanoid Tasks

We migrated the data samples from the Humanoid-X dataset [77] and re-implemented the inference pipeline of UH-1 [77] in our framework. We use the Unitree-H1-2 humanoid robot as the simulated embodiment and set up the locomotion and humanoid pose control tasks in our framework. The humanoid pose control task is to control the humanoid robot to follow given human poses while maintaining its stability on the ground. The demonstrated poses in our framework include arms crossing, boxing, dancing, left and right punches, playing violin, playing guitar, praying, waving to a friend, etc. Our pretrained policy can successfully follow the demonstrated pose to control a humanoid robot while maintaining stable locomotion in IsaacGym, and also obtains decent performance in IsaacLab. The humanoid environment and task configurations are highly flexible and scalable, and we are able to support more humanoid pose control tasks from Humanoid-X without modifying the infrastructure.

C. HumanoidBench

HumanoidBench [102] is a high-dimensional simulated benchmark designed to accelerate research in humanoid robot learning, focusing on whole-body locomotion and manipulation tasks. The benchmark features a humanoid robot equipped with dexterous hands, enabling a wide range of complex interactions in human-like environments.

Tasks and Assets: We migrate three fundamental locomotion tasks: run, walk, and stand. These tasks are designed to test the robot's ability to maintain balance, achieve forward motion, and stabilize in a standing position. The primary robot model used is the Unitree H1, augmented with two dexterous Shadow Hands, though the environment supports other humanoid models such as Unitree G1 and Agility Robotics Digit.

Demonstrations: While HumanoidBench does not provide pre-collected demonstrations, it supports the use of reinforcement learning algorithms to generate task-specific policies. The benchmark is designed to facilitate learning from scratch, with dense and sparse reward structures to guide the learning process.

Success Checkers: Each task in HumanoidBench is equipped with a success checker that evaluates task completion based on predefined criteria. For example, in the walk task, success is determined by the robot's ability to maintain a forward velocity of 1 m/s without falling, while in the stand task, success is measured by the robot's ability to maintain a stable upright posture for a specified duration.

Experiment and Results: We trained the walk, stand, and run tasks in both the RoboVerse MuJoCo and IsaacLab simulators using the PPO and TD-MPC2 [41, 42] algorithms, and compared the results with the HumanoidBench baseline based on the original MuJoCo environment. As shown in Fig. 22 and Fig. 23, the training curves from the RoboVerse MuJoCo simulator eventually converged and approached the performance of HumanoidBench, validating the feasibility of the RoboVerse reinforcement learning infrastructure. Additionally, we trained the same tasks in the RoboVerse IsaacLab simulator with identical configurations. While training efficiency in IsaacLab was comparatively lower under non-parallelized settings (to maintain configuration consistency), it still demonstrated a clear upward trend in reward accumulation. This confirms the rapid migration capability of the MetaSim framework and highlights its potential to enable sim-to-sim learning while leveraging the strengths of different simulators, such as IsaacLab's support for GPU-accelerated large-scale parallel training.

XVII. ROBOVERSE BENCHMARK SETUP DETAILS

A. Generalization Levels

To systematically evaluate the generalization capability of a robot policy, we establish a benchmark based on a carefully curated asset set designed for domain randomization. This asset set encompasses a diverse range of environmental factors, including materials, textures, lighting conditions, scene configurations, and camera perspectives. By leveraging this set, we assess how well different policies generalize to unseen
Fig. 21: Navigation gallery. We deploy the Unitree Go2 robot within Matterport 3D environments, primarily integrated with the ROBOVERSE Isaac Lab branch. The robot is tasked with navigating the environment based on provided instructions.

Fig. 22: Learning curves of RL algorithms on HumanoidBench task migration. Panels: (a) Stand, (b) Walk, (c) Run; y-axis: Return, x-axis: Environment Steps (×1e6); curves: MuJoCo PPO, MuJoCo TD-MPC2, IsaacLab TD-MPC2, MuJoCo PPO (Baseline), MuJoCo TD-MPC2 (Baseline). We also run PPO in the IsaacLab simulator in RoboVerse, but it is not visible in the plot since it only achieves very low returns.

conditions. Specifically, we split the available assets into a 9:1 ratio for training and testing, ensuring that the testing environment contains novel variations not encountered during training. Below, we detail the key components of this domain randomization setup:

• Table, Ground, and Wall. In tasks where a predefined scene is absent, we incorporate walls (and ceilings) to introduce structural complexity. Additionally, customizable tables are included for tasks requiring tabletop interactions. The visual materials applied to these elements are randomly sampled from a carefully curated subset of ARNOLD [36] and vMaterials [84], ensuring a diverse range of appearances. The table features approximately 300 distinct material options, while both the wall and ground have around 150 material choices each. This variation enhances the robustness of the learned policy by exposing the model to a wide spectrum of surface appearances and textures.

• Lighting Conditions. We introduce two distinct lighting scenarios: distant lighting and cylinder light arrays, each designed to test the adaptability of the learned policy to different illumination conditions.
  – Distant Light: The polar angle of the light source is randomized within a predefined range, influencing the way shadows and reflections appear in the scene.
  – Cylinder Light Arrays: A randomized n×m matrix of cylinder lights, varying in size and intensity, is placed at a fixed height above the agent.
In both configurations, light intensity and color temperature are randomly varied within reasonable limits to ensure that the model encounters a broad range of lighting effects.

• Camera Poses. To further evaluate the robustness of visual perception, we carefully select 59 candidate camera poses, strategically positioned to provide diverse viewpoints. The majority of these cameras are oriented directly towards the robot, ensuring consistent frontal perspectives, while a subset is placed at side-facing angles to introduce additional viewpoint variability.

• Reflection Properties. To simulate the wide range of reflective surfaces encountered in real-world environments, we randomize key material reflection properties, including roughness, specular intensity, and metallic characteristics.
Fig. 23: Demonstration of TD-MPC2 policies trained in the RoboVerse MuJoCo simulator on the Walk and Stand tasks migrated from the HumanoidBench benchmark (rows: Walk, Stand; horizontal axis: timestep).

These properties are adjusted within reasonable physical ranges to ensure that the robot policy learns to handle various levels of surface reflectivity.

By integrating these domain randomization techniques into our benchmark, we create a controlled yet diverse testing environment that challenges the generalization ability of different robot policies. This setup ensures that trained policies are not merely overfitting to a limited set of conditions but are instead capable of adapting to a broader range of real-world variations.

B. RoboVerse Benchmark Protocol

We rigorously design a training and evaluation protocol to ensure a structured and reliable assessment of the policy's performance. Given the training data, the policy learns to imitate the demonstrated behavior. For evaluation, we provide a standardized API that enables systematic assessment. As mentioned earlier, the training and evaluation follow a 9:1 ratio, ensuring that the policy is tested on novel scenarios not encountered during training.

XVIII. POLICY TRAINING DETAILS

A. Implementation Details

For specialist models, we train from scratch with actions in the 9-dim robot joint state space. Diffusion Policy [13] is implemented based on its original framework. We search several key hyperparameters, including observation and prediction length, to optimize performance for our tasks. ACT [138] is implemented with the original architecture and hyperparameters, except that the batch size has been increased to 512, with the learning rate correspondingly enlarged to 1e-4 to accelerate convergence. We train ACT on one A100 GPU for 2000 epochs and evaluate with the best checkpoints on the validation set.

For generalist models, the action is pre-processed from absolute end-effector position space into delta end-effector position space, and the gripper action is binarized to {0, +1}. Owing to the lack of time and resources, we are only able to fine-tune the generalist models in the single-task setting. For each task, OpenVLA [56] is LoRA [44] fine-tuned (rank = 32) on 8 A100 GPUs under the official settings until convergence, reaching over 95% action token accuracy as proposed by Kim et al. [56] during the training stage. During evaluations, we employ CuRobo [106] as the inverse-kinematics solver to transform the action into the robot joint state space.

B. Diffusion Policy

We implemented the training and validation code for Diffusion Policy based on the requirements of our tasks and relevant research papers.

Modeling the Diffusion Policy as a Denoising Diffusion Probabilistic Model (DDPM), we train a noise prediction network

\hat{\epsilon}_k = \epsilon_\theta(a_k, s, k)    (1)

that takes in noisy actions a_k, the current observation s, and the denoising iteration k, and predicts the noise \hat{\epsilon}_k.

For the observation s, we use a ResNet18 to extract features f_img of scene images and a 3-layer MLP to extract features f_robot of robot joint states. The concatenation of f_img and f_robot is the conditioning input for the Diffusion Policy.

During training, we randomly choose a denoising step k and sample noise \epsilon_k that is added to the unmodified sample a_0. Our training loss is the difference between \epsilon_k and the predicted noise:

L_DP = MSELoss(\epsilon_k, \hat{\epsilon}_k)    (2)

At inference time, our policy starts from random actions a_K and denoises for K steps to obtain the final action predictions. At each step, the action is updated following

a_{k-1} = \alpha (a_k - \gamma \epsilon_\theta(a_k, s, k)) + \mathcal{N}(0, \sigma^2 I)    (3)
where \alpha, \gamma, and \sigma are hyperparameters.
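The following PyTorch sketch illustrates the training objective of Eqs. (1)-(2): image and joint-state features form the conditioning input, and the network is trained to predict the noise added to ground-truth actions at a random denoising step. Network sizes, the noise schedule, and tensor shapes are illustrative assumptions rather than the exact ROBOVERSE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

K = 100                                   # number of denoising steps (assumed)
betas = torch.linspace(1e-4, 0.02, K)     # simple linear noise schedule (assumed)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)


class NoisePredictor(nn.Module):
    """epsilon_theta(a_k, s, k): predicts the noise added to the action."""

    def __init__(self, action_dim=9, joint_dim=9, cond_dim=256):
        super().__init__()
        self.img_enc = resnet18(weights=None)
        self.img_enc.fc = nn.Linear(self.img_enc.fc.in_features, cond_dim)
        self.joint_enc = nn.Sequential(                 # 3-layer MLP on joint states
            nn.Linear(joint_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, cond_dim),
        )
        self.head = nn.Sequential(                      # predicts noise from [a_k, cond, k]
            nn.Linear(action_dim + 2 * cond_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, noisy_action, image, joints, k):
        cond = torch.cat([self.img_enc(image), self.joint_enc(joints)], dim=-1)
        k_feat = k.float().unsqueeze(-1) / K            # normalized step index
        return self.head(torch.cat([noisy_action, cond, k_feat], dim=-1))


def training_step(model, a0, image, joints):
    """One DDPM training step: L_DP = MSE(eps_k, eps_hat_k)."""
    batch = a0.shape[0]
    k = torch.randint(0, K, (batch,))
    eps = torch.randn_like(a0)
    ab = alpha_bars[k].unsqueeze(-1)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps        # forward noising of the action
    eps_hat = model(a_k, image, joints, k)
    return F.mse_loss(eps_hat, eps)


if __name__ == "__main__":
    model = NoisePredictor()
    loss = training_step(model,
                         a0=torch.randn(4, 9),          # ground-truth joint-space actions
                         image=torch.randn(4, 3, 224, 224),
                         joints=torch.randn(4, 9))
    loss.backward()
    print(float(loss))
```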
XIX. WORLD MODEL DETAILS

A. Methodology

We adopt a video generation framework based on Latte [71], a transformer-driven latent diffusion model equipped with an efficient spatial-temporal attention mechanism. For action conditioning, we use frame-level Adaptive Layer Normalization [89] (AdaLN), following insights from IRASim [141] that show more precise control of the gripper with frame-level conditioning compared to video-level conditioning.

In the forward pass, raw video frames are encoded using a frozen autoencoder from Stable Diffusion [90]. The first frame serves as the initial condition, while noise is introduced into the latent representation of subsequent frames during training. Both the noise schedule and action conditions (gripper states with either Cartesian position plus orientation or joint position) are encoded by separate MLPs into latent space and then added together.

These noisy latent frames are then fed into a transformer composed of alternating spatial and temporal attention blocks, where action conditions are applied at each frame via AdaLN. For inference, we employ DDIM [105] as the denoising scheduler, using 200 sampling steps.

B. Data Preparation

The DROID [54] dataset's episodes typically last from 120 to 360 frames. To amplify motion, we keep every sixth frame, effectively reducing the frame rate to 4 fps with sequence lengths from 20 to 60. In the RoboVerse simulation, we adjust the control frequency so that most episodes span 20 to 60 frames, mirroring the number of frames of a DROID episode. We filter out any sequence shorter than 20 or longer than 60 frames, resulting in about 50,000 unique episodes from DROID.

We only generate 50,000 unique RoboVerse episodes due to time and resource constraints. The full-scale RoboVerse dataset is planned for training more capable world models in future work.

We exclude the gripper camera view because the model struggles with drastic camera pose changes, which leads to poor frame generation quality. Since we consider left and right camera views as separate samples, each dataset effectively doubles to 100,000 samples.

C. Experiments

Our experiments involve training on three datasets, DROID-50K, RoboVerse-50K, and DROID-RoboVerse-100K, on 8 NVIDIA H100 GPUs. We use a spatial resolution of 240×320 and sequences of 16 frames per episode. Starting with a model of 100M parameters and a batch size of 16, training converges at around 100K steps on RoboVerse and 200K steps on DROID.

We first compare Cartesian position plus orientation to joint positions as action conditions and find that using joint positions as action conditions yields more precise gripper movement control in frame generation, as shown in Fig. 25. We believe this is because joint positions are less ambiguous than Cartesian position plus orientation as the robot state representation.

However, generation quality remains suboptimal when training on the DROID-50K or DROID-RoboVerse-100K datasets and validating on DROID samples due to the complexity of DROID scenes. Scaling the model to 500M parameters and reducing the batch size to 8 leads to better preservation of object geometry, as does the prediction of robot arm movement. As discussed in the main paper, although the larger model trained on DROID-RoboVerse-100K shows an improved understanding of object shapes in DROID samples compared to the model trained on DROID-50K, it still struggles with intricate real-world physics. In contrast, training with RoboVerse-50K or DROID-RoboVerse-100K and validating on RoboVerse scenes produces more physically and geometrically consistent predictions.

We believe this is because RoboVerse offers cleaner backgrounds, more comprehensive views of the robotic arm, and the implementation of domain randomization and augmentation. By comparison, many DROID frames contain cluttered backgrounds or incomplete arm visibility, creating challenges for learning robust temporal dynamics from raw pixels.
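To make the frame-level action conditioning of Sec. XIX-A more concrete, the module below sketches an AdaLN block that modulates each frame's latent tokens with a per-frame action embedding. Tensor shapes and layer sizes are assumptions for illustration, not the exact Latte/IRASim configuration used here.

```python
import torch
import torch.nn as nn


class FrameAdaLN(nn.Module):
    """Adaptive LayerNorm conditioned on a per-frame action embedding.

    x:    (B, T, N, D) latent tokens, T frames with N tokens of dimension D each.
    cond: (B, T, C)    per-frame action embedding (e.g., encoded joint positions
                       or gripper state), applied independently to every frame.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Predict a scale and shift for every frame from its action condition.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(cond_dim, 2 * dim),
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)   # (B, T, D) each
        scale = scale.unsqueeze(2)                                   # broadcast over tokens
        shift = shift.unsqueeze(2)
        return self.norm(x) * (1.0 + scale) + shift


if __name__ == "__main__":
    block = FrameAdaLN(dim=64, cond_dim=16)
    latents = torch.randn(2, 16, 300, 64)    # batch of 2, 16 frames, 300 tokens per frame
    actions = torch.randn(2, 16, 16)         # one action embedding per frame
    print(block(latents, actions).shape)     # torch.Size([2, 16, 300, 64])
```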
Fig. 24: Visualization of Sim-to-Sim-to-Real Experiments.
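For the data preparation in Sec. XIX-B, the helper below shows the kind of subsampling and length filtering described there: keep every sixth frame of an episode and discard resulting sequences outside the 20-60 frame range. The array layout is an assumption; the real pipeline operates on the dataset's native episode format.

```python
import numpy as np

FRAME_SKIP = 6          # keep every 6th frame (~24 fps -> ~4 fps)
MIN_LEN, MAX_LEN = 20, 60


def subsample_episode(frames: np.ndarray) -> np.ndarray | None:
    """Subsample an episode of shape (T, H, W, C); return None if it should be dropped."""
    clip = frames[::FRAME_SKIP]
    if not (MIN_LEN <= len(clip) <= MAX_LEN):
        return None     # too short or too long after subsampling
    return clip


def prepare_dataset(episodes: list[np.ndarray]) -> list[np.ndarray]:
    """Apply subsampling and length filtering to a list of raw episodes."""
    kept = [subsample_episode(ep) for ep in episodes]
    return [ep for ep in kept if ep is not None]


if __name__ == "__main__":
    raw = [np.zeros((t, 240, 320, 3), dtype=np.uint8) for t in (120, 240, 360, 90)]
    kept = prepare_dataset(raw)
    print([len(ep) for ep in kept])   # [20, 40, 60] -- the 90-frame episode is dropped
```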

Fig. 25: Visualization of ground truth and predicted frames by models conditioned on Cartesian position (plus orientation) and joint position. Rows: Ground Truth, Conditioned on Cartesian Position, Conditioned on Joint Position.