Ai2 Releases Molmo 2: State-of-the-Art Open Multimodal Family for Video and Multi-Image Understanding
New open models unlock deep video comprehension with novel features like video tracking and multi-image reasoning, accelerating the science of AI into a new generation of multimodal intelligence.
SEATTLE--(BUSINESS WIRE)--Ai2 (The Allen Institute for AI) today announced Molmo 2, a state-of-the-art open multimodal model suite capable of precise spatial and temporal understanding of video, image, and multi-image sets. Building on the global impact of Molmo, which pioneered image pointing for multimodal AI systems, Molmo 2 introduces breakthrough capabilities in video pointing, multi-frame reasoning, and object tracking.
"With a fraction of the data, Molmo 2 surpasses many frontier models on key video understanding tasks. We are excited to see the immense impact this model will have on the AI landscape, adding another piece to our fully open model ecosystem.” -Ali Farhadi
Share
Molmo 2 improves on earlier releases, with the 8B-parameter model surpassing last year's 72B-parameter Molmo in accuracy, temporal understanding, and pixel-level grounding, and besting proprietary models such as Gemini 3 on key emerging skills like video tracking.
On image and multi-image reasoning, Molmo 2 excels despite its compact size. The 4B variant outperforms open models like Qwen3-VL-8B while using fewer parameters. Skills like these are essential for real-world video AI applications: they allow the model, and any application or system built on top of it, to understand what is happening, where it is happening, and what it means. Molmo 2 is also trained on far less data than comparable models: 9.19M videos, versus 72.5M for Meta’s PerceptionLM.
Performant, efficient, and open: we release many of the high-quality datasets Molmo 2 was trained on, as well as all weights, evaluation tools, and data recipes, for complete transparency.
Video has become the dominant form of information on the internet and the primary sensor stream for robotics, vehicles, industrial systems, scientific research, public infrastructure, and many other real-world applications. Yet deep video understanding has remained elusive: existing models either lack video understanding capabilities or are locked behind proprietary systems with no transparency into the training data. Molmo 2 changes this by giving researchers access to advanced video grounding, tracking, and multi-frame reasoning, all with open weights and data.
“With Olmo, we set the standard for truly open AI, then last year Molmo ushered the industry toward pointing; Molmo 2 pushes it even further by bringing these capabilities to videos and temporal domains,” said Ali Farhadi, CEO of Ai2. “With a fraction of the data, Molmo 2 surpasses many frontier models on key video understanding tasks—tracking, grounding, and multi-frame reasoning. We are excited to see the immense impact this model will have on the AI landscape, adding another piece to our fully open model ecosystem.”
A New Class of Video Intelligence, With Open Data and Recipes
Molmo 2 introduces capabilities that no prior open model has delivered. It can identify exactly where and when events occur, track multiple objects through complex scenes, and connect actions to frame-level timelines. These capabilities support safer automation, more accurate real-world systems, and open research the global community can inspect, reproduce, and build upon.
Key capabilities include:
- Frame-level spatial and temporal grounding: Molmo 2 goes beyond description. It returns precise pixel coordinates, object positions, and timestamps for events across a video.
- Robust multi-object tracking and counting: The model maintains consistent object identities across occlusions, scene changes, and long clips, enabling applications in robotics, inspection, transportation, and industry.
- Dense long-form video captioning and anomaly detection: Molmo 2 produces highly detailed, searchable descriptions and flags unusual events in long sequences.
These capabilities create a foundation for applications such as assistive technology, industrial automation, scientific research, and next-generation robotics, where precision and interpretability are essential.
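To make the frame-level grounding described above more concrete, here is a minimal sketch of how a downstream application might consume point-and-timestamp output. The tag format, field names, and units are illustrative assumptions for this sketch, not Molmo 2's documented output schema.

```python
# Minimal sketch: consuming hypothetical frame-level grounding output.
# The <point t="..." x="..." y="...">label</point> format below is an
# illustrative assumption, not Molmo 2's documented output schema.
import re
from dataclasses import dataclass

@dataclass
class GroundedPoint:
    label: str   # object or event name
    t: float     # timestamp in seconds (assumed unit)
    x: float     # horizontal position, assumed percent of frame width
    y: float     # vertical position, assumed percent of frame height

POINT_RE = re.compile(
    r'<point t="(?P<t>[\d.]+)" x="(?P<x>[\d.]+)" y="(?P<y>[\d.]+)">(?P<label>[^<]+)</point>'
)

def parse_points(model_output: str) -> list[GroundedPoint]:
    """Extract grounded points from a model response string."""
    return [
        GroundedPoint(m["label"], float(m["t"]), float(m["x"]), float(m["y"]))
        for m in POINT_RE.finditer(model_output)
    ]

if __name__ == "__main__":
    demo = '<point t="3.2" x="41.5" y="58.0">forklift</point>'
    print(parse_points(demo))
```

In practice, points parsed this way could drive frame overlays, alerts, or a downstream tracking pipeline.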
Breakthrough Performance Across Open and Proprietary Benchmarks
Molmo 2 establishes a new standard for open multimodal models. It delivers state-of-the-art results on major open-weight benchmarks and is on par with leading proprietary systems on real-world video tasks.
Highlights include:
- Leading open-weight performance on short-video understanding benchmarks such as MVBench, MotionQA, and NextQA.
- Major improvements in video grounding accuracy, often doubling or tripling the scores of previous open models and surpassing proprietary APIs on several pointing and counting tasks.
- Best-in-class tracking results across multi-domain benchmarks, outperforming strong open baselines and several commercial closed models.
- Strong image and multi-image reasoning that rivals or exceeds larger open-weight systems despite using fewer parameters.
- Human preference evaluations showing Molmo 2 is on par with or better than multiple proprietary systems on real-world video QA and captioning tasks.
Built on One of the Most Comprehensive Video Datasets Ever Released
For transparency and reproducibility, all the training sources for Molmo 2 are provided in the technical report. Additionally, Ai2 is releasing a collection of nine new open datasets used to train Molmo 2, totaling more than nine million multimodal examples across dense video captions, long-form QA, grounding, tracking, and multi-image reasoning.
The captioning corpus alone spans more than one hundred thousand videos with detailed descriptions that average more than nine hundred words each. The data mix covers video pointing, multi-object tracking, synthetic grounding, and long-video reasoning. Together, they form one of the most complete open video data collections available today.
Molmo 2 comes in three main variants: Molmo 2 (4B), Molmo 2 (8B), and Molmo 2-O (7B), which uses Ai2’s fully open Olmo backbone so the entire end-to-end model stack is open. Versions tuned specifically for pointing and tracking are also available.
Availability
All models, datasets, and evaluation tools are now publicly available on GitHub, Hugging Face, and the Ai2 Playground for interactive testing. Training code will be released soon.
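For developers who want to try the models from Hugging Face, the following is a minimal loading sketch. It assumes Molmo 2 mirrors the original Molmo's remote-code interface in transformers; the repository id is a placeholder, and the exact processor and generation methods may differ in the actual release.

```python
# Minimal sketch, assuming Molmo 2 follows the original Molmo's Hugging Face
# interface (remote-code model + processor). "allenai/Molmo-2-8B" is a
# placeholder repo id, not a confirmed name; check the Ai2 org for real ids.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

MODEL_ID = "allenai/Molmo-2-8B"  # placeholder repo id

processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Ask the model to point at objects in a single image, using the process()
# and generate_from_batch() methods exposed by the original Molmo's remote code.
image = Image.open(requests.get("https://picsum.photos/640/480", stream=True).raw)
inputs = processor.process(images=[image], text="Point to every person in the image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```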
About Ai2
Ai2 is a Seattle-based non-profit AI research institute with the mission of building breakthrough AI to solve the world’s biggest problems. Founded by the late Paul G. Allen, Ai2 accelerates scientific discovery and deploys open, transparent AI systems through initiatives such as Olmo, Molmo, Dolma, Tulu, Asta, and more. For additional information, visit allenai.org.
Contacts
Media contact:
ai2pr@archetype.co
