
Molmo challenges the dominance of proprietary AI systems by offering an open-source alternative that excels in multimodal interaction, bridging the gap between digital and physical worlds.
Molmo is a new family of open multimodal AI models from the Allen Institute for AI (AI2). According to AI2, the models match, and in some cases exceed, the performance of proprietary systems on a range of benchmarks and in human evaluations. What sets Molmo apart is that it goes beyond language-only interaction: it can act on what it sees, in both physical and virtual environments.
Open-Source Innovation: Unlike many state-of-the-art multimodal models that remain proprietary, Molmo is fully open-sourced. This includes the model weights, code, data, and evaluations. The transparency allows researchers to build upon and improve these models without the black-box limitations of closed systems.
Novel Datasets (PixMo): Molmo is trained on a new dataset collection called PixMo, which is built from human annotations and includes 2D pointing data.
Architectural Details: The models combine a pre-trained vision encoder with a pre-trained LLM. They are trained without synthetic data generated by proprietary VLMs, which is a common shortcut in other open models.
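The general pattern of joining a vision encoder to an LLM can be sketched in a few lines. This is a toy illustration, not Molmo's actual code: the connector is assumed to be a single randomly initialised linear layer, and the names `make_connector` and `build_llm_input`, along with the dimensions, are hypothetical choices made for the example.

```python
import random

VISION_DIM = 4  # size of each patch feature from the vision encoder (assumed)
LLM_DIM = 6     # size of the LLM's token embeddings (assumed)


def linear(vec, weights, bias):
    """Apply y = W x + b using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def make_connector(in_dim, out_dim, seed=0):
    """Create a hypothetical linear connector with small random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def build_llm_input(patch_features, text_embeddings, connector):
    """Project image patches into the LLM's embedding space and
    prepend them to the text token embeddings."""
    weights, bias = connector
    image_tokens = [linear(patch, weights, bias) for patch in patch_features]
    return image_tokens + text_embeddings


# Stand-ins for real encoder outputs: 3 image patches and 2 text tokens.
patch_features = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(3)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

connector = make_connector(VISION_DIM, LLM_DIM)
sequence = build_llm_input(patch_features, text_embeddings, connector)
print(len(sequence), len(sequence[0]))  # prints "5 6"
```

The point of the sketch is that the LLM sees image patches as extra tokens in its input sequence. In a real system the vision encoder and LLM come with pre-trained weights, and the connector is learned during training.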
Molmo excels in perception tasks, demonstrating advanced capabilities:

Molmo's action capabilities are equally impressive:
Molmo addresses a significant gap in the AI research community. Recent open models have relied heavily on synthetic data from proprietary VLMs to reach good performance, which has left the community with little foundational knowledge about how to build performant VLMs from scratch. Molmo's results rest instead on human-annotated datasets and careful architectural choices.
The introduction of 2D pointing data is the most consequential addition. A model that can point to a location in an image can ground its answers in specific pixels, which opens up applications where agents must interact with virtual or physical environments, such as augmented reality (AR), robotics, and interactive AI systems.
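To make pointing concrete, here is a minimal sketch of how an application might turn a pointing answer into pixel coordinates. The output format is an assumption for illustration only: the article does not specify it, so the `<point x=... y=...>` tag, the percent-based coordinates, and the `parse_points` helper are all hypothetical.

```python
import re

# Assumed format: <point x="25.0" y="50.0" alt="mug">mug</point>
# with x and y given as percentages of image width and height.
POINT_PATTERN = re.compile(
    r'<point\s+x="([0-9.]+)"\s+y="([0-9.]+)"[^>]*>(.*?)</point>'
)


def parse_points(model_output, image_width, image_height):
    """Extract (label, x_px, y_px) tuples from the model's text output."""
    points = []
    for x_pct, y_pct, label in POINT_PATTERN.findall(model_output):
        x_px = round(float(x_pct) / 100.0 * image_width)
        y_px = round(float(y_pct) / 100.0 * image_height)
        points.append((label, x_px, y_px))
    return points


example = 'The mug is here: <point x="25.0" y="50.0" alt="mug">mug</point>'
print(parse_points(example, 640, 480))  # prints [('mug', 160, 240)]
```

A downstream agent, for instance a UI automation script or a robot controller, could then click or move to the returned pixel location.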
Original Sources
https://allenai.org/blog/molmo
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
26 September 2024