The Molmo 2 model family
Pick a variant to explore weights, code, and reports. Every card includes direct links to artifacts.
Read the technical report
4B
A compact workhorse for multimodal research. Molmo 2 (4B) delivers strong performance on image understanding, captioning, pointing, video understanding, and tracking tasks while staying light enough for workstations and rapid iteration.
8B
Our strongest overall performer for video understanding. Molmo 2 (8B) achieves state-of-the-art results among open-weight models on key short- and long-video, captioning, pointing, counting, and tracking benchmarks—ideal for demanding applications that still require efficient inference.
Explore what Molmo can do
Open data
Molmo 2 is trained on an extensive mix of video-centric multimodal datasets drawn from publicly available sources, plus nine new open collections released by Ai2 for dense captioning, long-form QA, pointing, and tracking across images, multi-image sets, and video. The data and recipes are transparent, so the research community can build on them.
Dense, long-form video descriptions averaging hundreds of words per clip, capturing actions, relationships, rare events, and fine-grained visual details—designed to give Molmo 2 rich, temporally aware video understanding.
Human-crafted and synthetic question–answer pairs spanning short and long videos, including free-form 'ask the model anything' queries and subtitle-aware QA that combines what's seen with what's said.
Open-vocabulary spatio-temporal pointing data where annotators mark precise pixels and timestamps for objects, actions, and visual artifacts. This supervision underpins Molmo 2's ability to answer by showing you exactly where and when something happens.
Point-based tracking data that follows multiple objects across frames, through occlusions and re-entries. From this, Molmo 2 learns to assign persistent IDs and trace objects over time—enabling grounded tracking and counting by pointing.
Curated sets of semantically related images with QA and pointing supervision. These collections help Molmo 2 resolve references and ground answers when reasoning across multi-image sets.
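The pointing and tracking supervision described above can be pictured as lightweight records: a pointed location is a pixel coordinate plus a timestamp, and a track is a labeled sequence of such points carrying one persistent ID through occlusions and re-entries. Here is a minimal sketch in Python; the `Point`, `Track`, and `count_by_pointing` names are illustrative assumptions, not Molmo 2's actual annotation schema:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record types for spatio-temporal pointing and tracking
# annotations; field names are illustrative, not Molmo 2's real schema.

@dataclass
class Point:
    x: float        # pixel column in the frame
    y: float        # pixel row in the frame
    t: float        # timestamp in seconds within the clip

@dataclass
class Track:
    object_id: int  # persistent ID kept across occlusions and re-entries
    label: str      # open-vocabulary object or action name
    points: List[Point]

def count_by_pointing(tracks: List[Track], label: str) -> int:
    """Count distinct instances of a label via their persistent IDs."""
    return len({tr.object_id for tr in tracks if tr.label == label})

tracks = [
    Track(0, "dog", [Point(120.5, 88.0, 0.0), Point(130.2, 90.1, 0.5)]),
    Track(1, "dog", [Point(300.0, 150.0, 0.5), Point(295.4, 148.3, 1.0)]),
    Track(0, "dog", [Point(140.0, 95.0, 2.0)]),  # same dog re-entering later
]
print(count_by_pointing(tracks, "dog"))  # 2 distinct dogs
```

Because the ID persists when an object leaves and re-enters the frame, counting unique IDs (rather than detections) is what makes grounded counting by pointing possible.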
Resources
Go deeper with Molmo 2 through models, datasets, code, blogs, and technical reports.
- MolmoAct: An Action Reasoning Model that reasons in 3D space, built on Molmo.
- Documentation: Guides for training, fine-tuning, and deploying Molmo 2 in your own environment.
- Blog: Learn how Molmo 2 was built and how it performs against other open and proprietary models.
- Research: Dive into the technical report.