

Molmo

A family of open state-of-the-art multimodal AI models

The Molmo 2 model family

Pick a variant to explore weights, code, and reports. Every card includes instant links to artifacts.
Read the technical report

4B

A compact workhorse for multimodal research. Molmo 2 (4B) delivers strong performance on image understanding, captioning, pointing, video understanding, and tracking tasks while staying light enough for workstations and rapid iteration.

8B

Our strongest overall performer for video understanding. Molmo 2 (8B) achieves state-of-the-art results among open-weight models on key benchmarks for short- and long-form video understanding, captioning, pointing, counting, and tracking, making it ideal for demanding applications that still require efficient inference.

O-7B

A fully open, end-to-end stack for research. Molmo 2-O (7B) pairs Molmo 2's vision and video grounding with Olmo, our fully open LLM, so every component – from language backbone to vision encoder to training checkpoints – can be inspected, modified, and adapted.
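
For readers who want to try a variant locally, the sketch below shows one plausible way to load the weights with Hugging Face transformers, assuming the checkpoints are published on the Hub and expose the standard AutoProcessor / AutoModelForCausalLM interface. The repository id is a placeholder and the exact processing and generation calls may differ, so the snippet on each model card is authoritative.

```python
# Minimal sketch (not the official snippet): loading a Molmo 2 variant with
# Hugging Face transformers. The repo id below is a placeholder, and the exact
# processor/generation calls may differ from what the model card documents.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

REPO_ID = "allenai/Molmo-2-8B"  # hypothetical repository id

processor = AutoProcessor.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID,
    trust_remote_code=True,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # place weights on the available GPU(s)
)

# Ask a question about a single image.
image = Image.open("example.jpg")
inputs = processor(images=image, text="Point to the red car.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```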

Explore what Molmo can do

Video Tracking
Complex Video QA
Counting
Dense Captioning
Reasoning across documents and images

Open data

Molmo 2 is trained on an extensive mix of video-centric multimodal datasets drawn from publicly available sources, plus nine new open collections from Ai2 covering dense captioning, long-form QA, pointing, and tracking across images, multi-image sets, and video. The data and recipes are transparent, so the research community can build on them.

Molmo 2 video captions

Dense, long-form video descriptions averaging hundreds of words per clip, capturing actions, relationships, rare events, and fine-grained visual details—designed to give Molmo 2 rich, temporally aware video understanding.

Video QA

Human-crafted and synthetic question–answer pairs spanning short and long videos, including free-form 'ask the model anything' queries and subtitle-aware QA that combines what's seen with what's said.

Video point

Open-vocabulary spatio-temporal pointing data where annotators mark precise pixels and timestamps for objects, actions, and visual artifacts. This supervision underpins Molmo 2's ability to answer by showing you exactly where and when something happens.
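
To make this concrete, here is a hypothetical example of what a single spatio-temporal pointing record might look like. The field names, units, and structure are illustrative assumptions, not the released schema.

```python
# Hypothetical pointing record (illustrative only; not the released schema).
# A free-form, open-vocabulary query is grounded to specific pixels at
# specific timestamps in the video.
pointing_example = {
    "video_id": "clip_0001",                 # assumed clip identifier
    "query": "the dog that jumps the fence",
    "points": [
        {"t": 3.2, "x": 412, "y": 288},      # time in seconds, pixel coordinates
        {"t": 3.6, "x": 455, "y": 251},
    ],
}
```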

Video track

Point-based tracking data that follows multiple objects across frames, through occlusions and re-entries. From this, Molmo 2 learns to assign persistent IDs and trace objects over time—enabling grounded tracking and counting by pointing.
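
As a rough illustration of the kind of supervision this implies, a tracking record might attach a persistent object id to a sequence of points, with gaps where the object is occluded or leaves the frame. Again, the schema below is an assumption made for illustration.

```python
# Hypothetical tracking record (illustrative only; not the released schema).
# Each object keeps a persistent id across frames; missing frames indicate
# occlusion or the object leaving and later re-entering the scene.
tracking_example = {
    "video_id": "clip_0002",
    "objects": [
        {
            "object_id": 1,
            "label": "red car",
            "trajectory": [
                {"frame": 10, "x": 120, "y": 340},
                {"frame": 11, "x": 128, "y": 338},
                # frames 12-18 absent: the car is occluded by a truck
                {"frame": 19, "x": 210, "y": 330},
            ],
        }
    ],
}
```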

Multi-image data

Curated sets of semantically related images with QA and pointing supervision. These collections help Molmo 2 resolve references and ground answers when reasoning across multi-image sets.

What people are saying

Molmo reshaped expectations for open-source vision-language models, influencing both research and real-world applications. Its continued development builds on that impact and pushes forward the field of multimodal AI!

Roger Wang
Multimodal Lead, vLLM

Molmo is fully open source, with the model, weights, training procedure, and dataset completely accessible, enabling researchers to study how each step influences model behavior and performance transparently.

Giuseppe Riccardi
Professor of Computer Science, University of Trento

[T]he most capable open source AI model with visual abilities yet.

Will Knight
Wired

Molmo shows the power of an open approach to AI, proves the value of high-quality training data, and unlocks new capabilities.

Todd Bishop
GeekWire

The OSS ecosystem is privileged to have the Molmo team leading the development of open multimodal AI, putting frontier VLM performance in the hands of everyone on Earth.

Anastasios Angelopoulos
CEO, LMArena

Molmo has transformed the multimodal model landscape by introducing not only innovative capabilities like pointing, but by providing fully open weights and training data so it can be easily inspected, reproduced, and customized.

Clem Delangue
Co-Founder & CEO, Hugging Face
