👋 Hi! I am Siteng Huang (黄思腾 in Chinese). I work at DAMO Academy, Alibaba Group, as an Algorithm Expert through the AliStar program. I received my Ph.D. degree from
Zhejiang University in June 2024 through a joint program with
Westlake University, where I was a member of the Machine Intelligence Laboratory (MiLAB) advised by Prof. Donglin Wang. During my Ph.D. studies, I also spent a wonderful internship at
TongYi Lab, Alibaba Group. Before that, I received my B.Eng. degree from the School of Computer Science,
Wuhan University, in June 2019.
🔬 My research centers on the perception, understanding, reasoning, and generation of multimodal data (including images, videos, language, dynamics, etc.) from both the internet and the physical world. I also focus on efficient AI (in terms of data, time, parameters, memory, etc.) for multimodal applications. I have published 30+ papers on these topics at top-tier international AI conferences and journals. Recently, I have devoted myself to developing multimodal generative, embodied, and unified foundation models.
My full publication list is available on my personal homepage.
- 2026/02/18 [RA-L] RoboSimGS, a novel Real2Sim2Real framework that converts multi-view real-world images into scalable, high-fidelity, and physically interactive simulation environments for robotic manipulation, got accepted for RA-L! See the Project page for the overview video!
- 2026/02/10 [RynnBrain] We presented RynnBrain, an embodied foundation model grounded in physical reality, with dense (2B, 8B) and MoE (30B) variants, alongside three specialized models: RynnBrain‑Plan (manipulation planning), RynnBrain‑Nav (navigation), and RynnBrain‑CoP (spatial reasoning). See GitHub and the Chinese report from 机器之心.
- 2026/01/31 [ICRA'26] RynnVLA-001, our VLA foundation model, got accepted for ICRA 2026!
- 2026/01/22 [Talk] I gave a talk titled Physical AI Ecosystem: Tackling the Key Barriers to Embodied Intelligence at the AAAI-26 Interactive Industry Sessions.
- 2025/12/11 [Preprint] We released HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a "think-while-acting" paradigm for long-horizon manipulation! Project page and Code are available!
- 2025/11/24 [Preprint] We released RynnVLA-002, an upgraded version of WorldVLA and a more powerful unified VLA and world model! Videos and code are available on GitHub!
- 2025/11/08 [AAAI'26] 4 papers got accepted for AAAI 2026! They include the training-free MLLM inference acceleration methods FiCoCo and GlobalCom2, the dexterous grasping policy AffordDex, and the tiny-scale VLA model VLA-Adapter.
- 2025/09/19 [NeurIPS'25] SSR got accepted for NeurIPS 2025! The work transforms raw depth data into structured, interpretable textual CoT, enhancing the spatial reasoning capabilities of MLLMs. See the Project page and GitHub!
- 2025/09/12 [Preprint] We released VLA-Adapter, which reduces reliance on large-scale VLMs and extensive pre-training by using a lightweight Policy module with Bridge Attention, achieving SOTA performance and fast inference with minimal computational resources! The checkpoint is now available! See the Project page for more details. It was ranked #1 Paper of the Day on Hugging Face Papers! 2025/11/08 VLA-Adapter got accepted for AAAI 2026 as an Oral!
- 2025/08/13 [Preprint] We released AffordDex, a universal grasping policy for dexterous hands with an inherent understanding of both motion priors and object affordances! Grasping videos can be found on the Project page! 2025/11/08 AffordDex got accepted for AAAI 2026!
- 2025/08/08 [DAMO RynnBot] We open-sourced RynnEC, a video MLLM for embodied cognition tasks; RynnVLA-001, a VLA model built on a pretrained video generation model; and RynnRCP, a complete set of robot service protocols and frameworks! 2025/08/11 We released the technical blog for RynnVLA-001! 2025/09/19 We released the technical report for RynnVLA-001!
- 2025/08/02 [CoRL'25] Long-VLA, a novel framework designed to enhance VLA models for challenging long-horizon robotic manipulation tasks, got accepted for CoRL 2025!
- 2025/07/24 [DAMO RynnBot] We released RynnBot PlayGround Beta, a platform that provides data management, SOTA VLA models, model training and validation, cloud-edge collaborative deployment, and more! Stay tuned for our continued progress!
- 2025/06/27 [Preprint] We released WorldVLA, an autoregressive action world model that unifies action and image understanding and generation! Code is now available!
- 2025/06/26 [ICCV'25] CARP, Coarse-to-fine AutoRegressive Prediction for visuomotor policy learning, got accepted for ICCV 2025! The approach produces highly accurate and smooth robot actions, achieving up to a 10% improvement in success rates and 10x faster inference compared to state-of-the-art policies. The paper, code, and cool videos can be found on the Project page!
- 2025/05/22 [Preprint] We released VARD, a novel RL fine-tuning method for diffusion-based generative models covering both protein structure and text-to-image synthesis; it enhances sample quality with improved efficiency, effectively mitigates reward hacking, and offers broad applicability.
- 2025/05/07 [Preprint] We released OpenHelix, a low-cost open-source dual-system VLA with systematic empirical evaluations of its core design elements. Code and a list of papers are now available!
- 2025/03/31 [Preprint] We released Unicorn to explore the question: can high-quality multimodal training data be synthesized purely from text?
- 2025/03/28 [Survey Preprint] We released Exploring the Evolution of Physics Cognition in Video Generation: A Survey, which dives deep into the development of physics cognition in video generation, from basic perception to active cognition! A list of papers is now available!
- 2025/03/11 [TCSVT'25] M2IST, a novel Multi-Modal Interactive Side-Tuning method that effectively addresses the challenges of insufficient multi-modal interaction and high GPU memory consumption, got accepted for IEEE Transactions on Circuits and Systems for Video Technology! Code is now available!
- 2025/02/24 [Preprint] We released Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control!
- 2025/01/28 [ICRA'25] QUART-Online, a novel latency-free quadruped MLLM that achieves real-time inference while boosting the success rate across various tasks by 65%, got accepted for ICRA 2025! See the Project page.
- 2025/01/23 [ICLR'25] ToCa, a token-wise feature caching method that achieves a 2x acceleration for PixArt-α, OpenSora, and DiT while maintaining nearly lossless generation quality, got accepted for ICLR 2025! Code is now available!
- 2025/01/10 [Preprint] We released GlobalCom2, a "global-to-local" approach for training-free acceleration of high-resolution MLLMs with the AnyRes strategy. Code is now available! 2025/11/08 GlobalCom2 got accepted for AAAI 2026!