
Ai2 Releases Molmo 2: State-of-the-Art Open Multimodal Family for Video and Multi-Image Understanding

New open models unlock deep video comprehension with novel capabilities like video tracking and multi-image reasoning, advancing the science of AI toward a new generation of multimodal intelligence.

SEATTLE--(BUSINESS WIRE)--Ai2 (The Allen Institute for AI) today announced Molmo 2, a state-of-the-art open multimodal model suite capable of precise spatial and temporal understanding of video, image, and multi-image sets. Building on the global impact of Molmo, which pioneered image pointing for multimodal AI systems, Molmo 2 introduces breakthrough capabilities in video pointing, multi-frame reasoning, and object tracking.

"With a fraction of the data, Molmo 2 surpasses many frontier models on key video understanding tasks. We are excited to see the immense impact this model will have on the AI landscape, adding another piece to our fully open model ecosystem.” -Ali Farhadi

Share

Molmo 2 improves on its predecessor, with the 8B-parameter model surpassing last year's 72B-parameter Molmo in accuracy, temporal understanding, and pixel-level grounding, and besting proprietary models like Gemini 3 on key emerging skills such as video tracking.

On image and multi-image reasoning, Molmo 2 excels despite its compact size: the 4B variant outperforms open models like Qwen3-VL-8B while using fewer parameters. Skills like these are essential for real-world video AI applications; they allow the model, and any application or system built on top of it, to understand what is happening, where it is happening, and what it means. Molmo 2 is also trained on far less data than comparable models: 9.19 million videos, versus the 72.5 million used for Meta’s PerceptionLM.

Performant, efficient, and open: Ai2 is releasing many of the high-quality datasets Molmo 2 was trained on, as well as all weights, evaluation tools, and data recipes, for complete transparency.

Video has become the dominant form of information on the internet and the primary sensor stream for robotics, vehicles, industrial systems, scientific research, public infrastructure, and many other real-world applications. Yet deep video understanding has remained elusive: existing models either lack it outright or are locked behind proprietary systems that offer no transparency into their training data. Molmo 2 changes this by giving researchers access to advanced video grounding, tracking, and multi-frame reasoning, all with open weights and data.

“With Olmo, we set the standard for truly open AI, then last year Molmo ushered the industry toward pointing; Molmo 2 pushes it even further by bringing these capabilities to videos and temporal domains,” said Ali Farhadi, CEO of Ai2. “With a fraction of the data, Molmo 2 surpasses many frontier models on key video understanding tasks—tracking, grounding, and multi-frame reasoning. We are excited to see the immense impact this model will have on the AI landscape, adding another piece to our fully open model ecosystem.”

A New Class of Video Intelligence, With Open Data and Recipes

Molmo 2 introduces capabilities that no prior open model has delivered. It can identify exactly where and when events occur, track multiple objects through complex scenes, and connect actions to frame-level timelines. These capabilities support safer automation, more accurate real-world systems, and open research the global community can inspect, reproduce, and build upon.

Key capabilities include:

  • Frame-level spatial and temporal grounding: Molmo 2 goes beyond description. It returns precise pixel coordinates, object positions, and timestamps for events across a video.
  • Robust multi-object tracking and counting: The model maintains consistent object identities across occlusions, scene changes, and long clips, enabling applications in robotics, inspection, transportation, and industry.
  • Dense long-form video captioning and anomaly detection: Molmo 2 produces highly detailed, searchable descriptions and flags unusual events in long sequences.

These capabilities create a foundation for applications such as assistive technology, industrial automation, scientific research, and next-generation robotics, where precision and interpretability are essential.
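
To make the frame-level grounding concrete, the sketch below shows how a downstream application might consume pointing output. It is illustrative only: the XML-style point markup mirrors the convention the original Molmo used for image pointing, and the timestamp attribute, sample string, and helper functions here are assumptions, not Molmo 2's documented output format.

```python
import re
from dataclasses import dataclass

# Hypothetical output markup, modeled on the original Molmo's image-pointing
# convention; Molmo 2's actual video-pointing format may differ.
SAMPLE_OUTPUT = (
    '<point t="3.2" x="41.5" y="62.0" label="red car"/> '
    '<point t="4.8" x="55.0" y="60.5" label="red car"/>'
)

@dataclass
class VideoPoint:
    t: float    # timestamp in seconds (assumed attribute)
    x: float    # horizontal position as a percentage of frame width
    y: float    # vertical position as a percentage of frame height
    label: str  # object the point refers to

def parse_points(text: str) -> list[VideoPoint]:
    """Extract point tags from model output into structured records."""
    pattern = re.compile(
        r'<point\s+t="([\d.]+)"\s+x="([\d.]+)"\s+y="([\d.]+)"\s+label="([^"]*)"\s*/>'
    )
    return [VideoPoint(float(t), float(x), float(y), label)
            for t, x, y, label in pattern.findall(text)]

def to_pixels(p: VideoPoint, width: int, height: int) -> tuple[int, int]:
    """Convert percent-of-frame coordinates to pixel coordinates."""
    return round(p.x / 100 * width), round(p.y / 100 * height)

for point in parse_points(SAMPLE_OUTPUT):
    px, py = to_pixels(point, width=1280, height=720)
    print(f"{point.label!r} at t={point.t:.1f}s -> pixel ({px}, {py})")
```

Structured output like this is what lets a robotics or inspection pipeline act on "where and when" rather than on free-form description.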

Breakthrough Performance Across Open and Proprietary Benchmarks

Molmo 2 establishes a new standard for open multimodal models. It delivers state-of-the-art results on major open-weight benchmarks and is on par with leading proprietary systems on real-world video tasks.

Highlights include:

  • Leading open-weight performance on short-video understanding benchmarks such as MVBench, MotionQA, and NextQA.
  • Major improvements in video grounding accuracy, often doubling or tripling the scores of previous open models and surpassing proprietary APIs on several pointing and counting tasks.
  • Best-in-class tracking results across multi-domain benchmarks, outperforming strong open baselines and several commercial closed models.
  • Strong image and multi-image reasoning that rivals or exceeds larger open-weight systems despite using fewer parameters.
  • Human preference evaluations showing Molmo 2 is on par with or better than multiple proprietary systems on real-world video QA and captioning tasks.

Built on One of the Most Comprehensive Video Datasets Ever Released

For transparency and reproducibility, all the training sources for Molmo 2 are provided in the technical report. Additionally, Ai2 is releasing a collection of nine new open datasets used to train Molmo 2, totaling more than nine million multimodal examples across dense video captions, long-form QA, grounding, tracking, and multi-image reasoning.

The captioning corpus alone spans more than one hundred thousand videos with detailed descriptions that average more than nine hundred words each. The data mix covers video pointing, multi-object tracking, synthetic grounding, and long-video reasoning. Together, they form one of the most complete open video data collections available today.
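
For researchers who want to work with the data directly, loading one of the open datasets could look like the following minimal sketch using the Hugging Face datasets library. The repository and field names are placeholder assumptions, since the release does not name the individual dataset repos; check Ai2's Hugging Face organization for the published names.

```python
from datasets import load_dataset

# Placeholder repo id; see huggingface.co/allenai for the actual dataset names.
ds = load_dataset("allenai/molmo2-dense-captions", split="train")

# Inspect one dense-captioning example; per the release, captions in this
# corpus average more than nine hundred words.
example = ds[0]
print(sorted(example.keys()))
caption = example.get("caption", "")  # field name is an assumption
print(f"caption length: {len(caption.split())} words")
```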

Molmo 2 comes in three main variants: Molmo 2 (4B), Molmo 2 (8B), and Molmo 2-O (7B), which uses Ai2’s fully open Olmo backbone for the complete end-to-end model flow. Versions tuned specifically for pointing and tracking are also available.

Availability

All models, datasets, and evaluation tools are now publicly available on GitHub, Hugging Face, and the Ai2 Playground for interactive testing. Training code will be released soon.
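
As a starting point, the snippet below sketches how loading a Molmo 2 checkpoint from Hugging Face might look, assuming the release follows the same transformers remote-code pattern as the original Molmo. The repository id is a placeholder, not a confirmed name.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder repo id; see Ai2's Hugging Face organization for published names.
MODEL_ID = "allenai/Molmo-2-8B"

# trust_remote_code mirrors how the original Molmo checkpoints were loaded.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)
```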

About Ai2

Ai2 is a Seattle-based non-profit AI research institute with the mission of building breakthrough AI to solve the world’s biggest problems. Founded by the late Paul G. Allen, Ai2 accelerates scientific discovery and deploys open, transparent AI systems through initiatives such as Olmo, Molmo, Dolma, Tulu, Asta, and more. For additional information, visit allenai.org.

Contacts

Media contact:
ai2pr@archetype.co
