YOTO

Code for our RSS 2025 paper "You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations" [arXiv] / [Project] / [Dataset]


● Abstract

Bimanual robotic manipulation is a long-standing challenge for embodied intelligence, owing to the need for dual-arm spatial-temporal coordination and the high-dimensional action space. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, which often limits their simplicity, versatility, and scalability. In contrast, we believe the most effective and efficient way to teach bimanual manipulation is to learn from human-demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states, and dexterous transitions are available almost for free. In this work, we propose YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, and teach dual robot arms various complex tasks. Furthermore, based on keyframe-based motion trajectories, we devise a simple yet effective solution for rapidly generating training demonstrations with diverse variations of the manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, generalizes strongly under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency.
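
The keyframe-based demonstration augmentation mentioned above can be illustrated with a minimal sketch. The snippet below is not the released YOTO code; it only assumes (hypothetically) that one recorded demo is stored as an object point cloud plus per-arm lists of 4x4 end-effector keyposes, and shows how applying one shared random SE(3) transform to both yields a new, geometrically consistent demo.

```python
# Illustrative sketch only (not the repository's implementation).
# Assumed data layout: object_points is an (N, 3) numpy array; left_keyposes and
# right_keyposes are lists of 4x4 homogeneous end-effector keypose matrices.
import numpy as np

def random_planar_se3(max_xy_shift=0.05, max_yaw_rad=np.pi / 12, rng=None):
    """Sample a random 4x4 transform: a small translation in the table plane plus a yaw."""
    rng = np.random.default_rng() if rng is None else rng
    yaw = rng.uniform(-max_yaw_rad, max_yaw_rad)
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:2, 3] = rng.uniform(-max_xy_shift, max_xy_shift, size=2)
    return T

def augment_demo(object_points, left_keyposes, right_keyposes, rng=None):
    """Apply one shared random transform to the object cloud and both arms' keyposes,
    so the relative hand-object geometry of the original demo is preserved."""
    T = random_planar_se3(rng=rng)
    pts_h = np.concatenate([object_points, np.ones((len(object_points), 1))], axis=1)
    new_points = (pts_h @ T.T)[:, :3]
    new_left = [T @ pose for pose in left_keyposes]
    new_right = [T @ pose for pose in right_keyposes]
    return new_points, new_left, new_right
```

Calling `augment_demo` repeatedly with different random seeds would produce many synthetic demos from a single teaching video, which is the general spirit of the augmentation; the exact transforms and variations used in the paper may differ.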

● Inference Visualization

  • Below, we compare prediction results of BiDP models trained with and without data augmentation.
(Comparison videos for five tasks: Drawer, Pouring, Unscrew, Uncover, and Openbox; each task is shown for BiDP trained without augmentation and BiDP trained with augmentation.)
  • The comparison clearly shows that data augmentation makes the model's prediction errors significantly smaller. More videos and illustrations can be found on our project homepage.
  • Specifically, we use $$\color{green}green$$ point clouds to represent the platform (which is not part of the observation input and is shown only for easier visualization), and $$\color{magenta}magenta$$ point clouds to represent the manipulated objects. The $$\color{blue}blue$$ and $$\color{red}red$$ 6-DoF keyposes represent the end-effector actions of the left and right arms, respectively. The far-left and far-right keyposes are the initial robot states. The ground-truth 6-DoF keyposes are drawn larger than the predicted actions.
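
For readers who want to reproduce a similar view, here is a minimal sketch of this color scheme using Open3D. It is an illustrative reimplementation, not the repository's visualization script; the helper names and the assumption that keyposes are stored as 4x4 matrices are ours.

```python
# Illustrative Open3D sketch of the color scheme described above (not the repo's script).
import numpy as np
import open3d as o3d

def make_cloud(points_xyz, rgb):
    """Wrap an (N, 3) array as an Open3D point cloud with a uniform color."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)
    pcd.paint_uniform_color(rgb)
    return pcd

def make_keypose_frames(keyposes, rgb, size):
    """Draw each 6-DoF keypose (4x4 matrix) as a small colored coordinate frame;
    ground-truth keyposes can simply be passed with a larger `size`."""
    frames = []
    for T in keyposes:
        frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=size)
        frame.transform(T)
        frame.paint_uniform_color(rgb)
        frames.append(frame)
    return frames

def show_scene(platform_pts, object_pts, left_keyposes, right_keyposes):
    geoms = [
        make_cloud(platform_pts, [0.0, 0.8, 0.0]),   # green: platform (not a model input)
        make_cloud(object_pts, [1.0, 0.0, 1.0]),     # magenta: manipulated objects
    ]
    geoms += make_keypose_frames(left_keyposes, [0.0, 0.0, 1.0], size=0.05)   # blue: left arm
    geoms += make_keypose_frames(right_keyposes, [1.0, 0.0, 0.0], size=0.05)  # red: right arm
    o3d.visualization.draw_geometries(geoms)
```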

● Acknowledgement

Our hand motion extraction and injection process relies on a variety of vision algorithms, including hand detection and 3D mesh reconstruction (WiLoR), a large vision-language model (Florence2), the Segment Anything Model 2 (SAM2), and binocular stereo matching (IGEV). The codebase of our imitation learning algorithm BiDP is partly based on ACT, Diffusion Policy, 3D Diffusion Policy, and EquiBot. We thank all of them for their open-source efforts and contributions.

● Citation

If you use our code or models in your research, please cite:

@article{zhou2025you,
  title={You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations},
  author={Zhou, Huayi and Wang, Ruixiang and Tai, Yunxin and Deng, Yueci and Liu, Guiliang and Jia, Kui},
  journal={arXiv preprint arXiv:2501.14208},
  year={2025}
}
