Code for my paper "You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations" [arXiv] / [Project] / [Dataset]
- ✅ [2025-07-05] The extended journal version YOTO++: Learning Long-Horizon Closed-Loop Bimanual Manipulation from One-Shot Human Video Demonstrations, with substantial new content, has been released.
- ✅ [2025-06-12] I have built a summary page of the latest related works on Robot Manipulation - Vision Language Action. You are welcome to read it and get in touch!
- ✅ [2025-04-11] Our paper has been accepted by RSS 2025 (Robotics: Science and Systems).
- ✅ [2025-03-10] We have uploaded our preprocessed datasets and pretrained models to huggingface/YOTO. Please refer to AugDemos and BiDP in this repo for their usage (a minimal download sketch is shown below).
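For reference, here is a minimal sketch (our assumption, not an official script) of how the released assets might be fetched with the `huggingface_hub` package; the repo ID and `repo_type` below are placeholders that should be replaced with the actual YOTO repository linked above.

```python
# Minimal sketch (assumption, not an official script): fetch the released YOTO assets
# from the Hugging Face Hub. Replace the placeholder repo ID with the real one linked
# above, and set repo_type to "dataset" or "model" depending on how the assets are hosted.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<user>/YOTO",   # placeholder; use the real YOTO repo ID
    repo_type="dataset",     # or "model", as appropriate
    local_dir="./YOTO_assets",
)
print("Downloaded to:", local_path)
```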
Bimanual robotic manipulation is a long-standing challenge in embodied intelligence due to its dual-arm spatial-temporal coordination and high-dimensional action spaces. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, which often makes them lack simplicity, versatility, and scalability. In contrast, we believe that the most effective and efficient way to teach bimanual manipulation is learning from human-demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states, and dexterous transitions are available almost for free. In this work, we propose YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, and teach dual robot arms various complex tasks. Furthermore, based on keyframe-based motion trajectories, we devise a subtle solution for rapidly generating training demonstrations with diverse variations of manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, possesses strong generalization under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency.
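To make the demonstration-augmentation idea above more concrete, here is an illustrative sketch (our assumption, not the authors' actual implementation) that applies a random planar SE(3) perturbation jointly to an object point cloud and its keyframe end-effector poses, which is one simple way to synthesize location variations from a single recorded demonstration.

```python
# Illustrative sketch of keyframe-based demo augmentation (not the official YOTO code):
# jointly transform the object point cloud and the 6-DoF end-effector keyposes with a
# random planar SE(3) perturbation to synthesize a "new" demonstration.
import numpy as np
from scipy.spatial.transform import Rotation as R

def augment_demo(points, keyposes, max_yaw_deg=30.0, max_shift=0.05, rng=None):
    """points: (N, 3) object point cloud; keyposes: (K, 4, 4) end-effector poses."""
    rng = np.random.default_rng() if rng is None else rng
    yaw = np.deg2rad(rng.uniform(-max_yaw_deg, max_yaw_deg))
    shift = np.append(rng.uniform(-max_shift, max_shift, size=2), 0.0)  # x, y only

    T = np.eye(4)
    T[:3, :3] = R.from_euler("z", yaw).as_matrix()
    T[:3, 3] = shift

    new_points = points @ T[:3, :3].T + T[:3, 3]          # transform the point cloud
    new_keyposes = np.einsum("ij,kjl->kil", T, keyposes)   # left-multiply each keypose
    return new_points, new_keyposes

# Example: generate 10 augmented variants from one placeholder demonstration.
pts = np.random.rand(2048, 3)            # placeholder object point cloud
poses = np.tile(np.eye(4), (6, 1, 1))    # placeholder 6-DoF keyposes
variants = [augment_demo(pts, poses) for _ in range(10)]
```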
- Below, we present some prediction results of models trained with / without augmentation for comparison.
| Task | BiDP trained without augmentation | BiDP trained with augmentation |
|---|---|---|
| Drawer | ||
| Pouring | ||
| Unscrew | ||
| Uncover | ||
| Openbox |
- It can be clearly seen that the augmented data makes the model's prediction error significantly smaller. More videos and illustrations can be found on our project homepage.
- Specifically, we use $$\color{green}green$$ point clouds to represent the platform (which does not appear in the observation input) for easy visualization, and $$\color{magenta}magenta$$ point clouds to represent the manipulated objects. The $$\color{blue}blue$$ and $$\color{red}red$$ 6-DoF keyposes represent the end-effector actions of the left and right arms, respectively. The far-left and far-right keyposes are the initial robot states. The ground-truth 6-DoF keyposes are drawn larger than the predicted actions (a rough visualization sketch follows below).
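As a rough aid for reproducing such visualizations, here is a minimal Open3D sketch (an assumption on our part, not the script used for the figures) that renders an object point cloud in magenta together with axis triads at the 6-DoF keyposes; note that Open3D coordinate frames use RGB axes rather than the blue/red arm coloring described above.

```python
# Illustrative sketch (assumed, not the official plotting code): render an object point
# cloud in magenta together with axis triads at the 6-DoF end-effector keyposes.
import numpy as np
import open3d as o3d

def show_scene(object_points, left_keyposes, right_keyposes):
    """object_points: (N, 3) array; *_keyposes: iterables of 4x4 pose matrices."""
    obj = o3d.geometry.PointCloud()
    obj.points = o3d.utility.Vector3dVector(object_points)
    obj.paint_uniform_color([1.0, 0.0, 1.0])  # magenta: manipulated objects

    frames = []
    for T in list(left_keyposes) + list(right_keyposes):
        # Open3D frames use RGB axes, not the paper's blue/red arm coloring.
        frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=0.05)
        frame.transform(np.asarray(T))
        frames.append(frame)

    o3d.visualization.draw_geometries([obj] + frames)

# Example call with placeholder data.
show_scene(np.random.rand(2048, 3), [np.eye(4)], [np.eye(4)])
```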
Our hand motion extraction and injection process relies on a variety of vision algorithms, including hand detection and 3D mesh reconstruction (WiLoR), the large vision-language model Florence2, the Segment Anything Model 2 (SAM2), and binocular stereo matching (IGEV). Meanwhile, the codebase of our imitation learning algorithm BiDP is partly based on ACT, Diffusion Policy, 3D Diffusion Policy, and EquiBot. We thank them for their open-source efforts and contributions.
If you use our code or models in your research, please cite:
@article{zhou2025you,
  title={You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations},
  author={Zhou, Huayi and Wang, Ruixiang and Tai, Yunxin and Deng, Yueci and Liu, Guiliang and Jia, Kui},
  journal={arXiv preprint arXiv:2501.14208},
  year={2025}
}