Molmo: A New Family of Open State-of-the-Art Multimodal AI Models

The Engineer

26 Sept 2024

Molmo challenges the dominance of proprietary AI systems by offering an open-source alternative that excels in multimodal interaction, bridging the gap between digital and physical worlds.

Molmo, a new family of open multimodal AI models from the Allen Institute for AI (AI2), is drawing attention in the research community. On academic benchmarks and in human evaluations, the strongest Molmo models are reported to match, and in some cases exceed, proprietary systems. What sets Molmo apart is that it goes beyond language-only interaction: it can point at what it sees, which enables actionable interactions with both physical and virtual environments.

Key Technical Changes and Why They Matter

Open-Source Innovation: Unlike many state-of-the-art multimodal models that remain proprietary, Molmo is fully open-sourced. This includes the model weights, code, data, and evaluations. The transparency allows researchers to build upon and improve these models without the black-box limitations of closed systems.
Novel Datasets (PixMo): Molmo leverages a new dataset collection called PixMo, which includes:
- A highly detailed image caption dataset, collected by having human annotators describe images aloud.
- A diverse mixture of fine-tuning datasets that introduce capabilities like 2D pointing.
Architectural Details: Each model combines a pre-trained vision encoder with a language-only LLM. The training data avoids synthetic data generated by proprietary VLMs, which is a common practice in other open models.
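
To make the "vision encoder plus LLM" combination concrete, here is a minimal sketch of one common way to wire the two together: a learned projection (often called a connector) maps image patch features into the LLM's embedding space, and the resulting vision tokens are placed in front of the text tokens. The article does not describe Molmo's connector, so the function names, the toy dimensions, and the single linear layer are all illustrative assumptions rather than Molmo's actual design.

```python
import random

def linear(vec, weight):
    """Project one feature vector; weight has shape [out_dim][in_dim]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def build_llm_input(patch_features, text_embeddings, connector_weight):
    """Map vision patch features into the LLM embedding space and
    prepend them to the text token embeddings (hypothetical helper)."""
    vision_tokens = [linear(p, connector_weight) for p in patch_features]
    return vision_tokens + text_embeddings

# Toy sizes, chosen only for illustration (assumptions, not Molmo's values).
random.seed(0)
VISION_DIM, LLM_DIM = 6, 8
patch_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[random.random() for _ in range(LLM_DIM)] for _ in range(3)]
connector_weight = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)]
                    for _ in range(LLM_DIM)]

sequence = build_llm_input(patch_features, text_embeddings, connector_weight)
print(len(sequence), len(sequence[0]))  # 4 vision tokens + 3 text tokens, each 8-dim
```

In a real system the vision encoder and LLM would be pre-trained networks and the connector would be trained. The sketch only shows how the two modalities end up in one token sequence.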

Perception Capabilities

Molmo excels in perception tasks, demonstrating advanced capabilities:

Open-Ended Question Answering: Molmo can answer complex, open-ended questions about images.
Pointing with Molmo: The model can point to specific elements in an image, so its answers can be tied to exact locations rather than described only in words.
Counting with Pointing: The model can count objects in an image and use pointing to indicate each counted item.
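
As a rough illustration of how counting by pointing could be consumed downstream, the sketch below parses point tags from a model response, converts them to pixel coordinates, and counts them. The article does not specify an output format, so the XML-style `<point x=".." y=".." alt="..">` tag, the percentage-based coordinates, and the names `Point` and `parse_points` are assumptions made for this example.

```python
import re
from dataclasses import dataclass

@dataclass
class Point:
    x: float    # pixel column
    y: float    # pixel row
    label: str  # text attached to the point

# Assumed tag format: <point x="25.0" y="50.0" alt="mug">mug</point>
# with x and y given as percentages of image width and height.
POINT_TAG = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"\s+alt="([^"]*)"\s*>')

def parse_points(text: str, width: int, height: int) -> list[Point]:
    """Extract every point tag and scale it to pixel coordinates."""
    return [
        Point(float(x_pct) / 100 * width, float(y_pct) / 100 * height, label)
        for x_pct, y_pct, label in POINT_TAG.findall(text)
    ]

# Hypothetical response to "Count the mugs" for a 640x480 image.
response = (
    '<point x="25.0" y="50.0" alt="mug">mug</point> '
    '<point x="75.0" y="50.0" alt="mug">mug</point>'
)
points = parse_points(response, width=640, height=480)
count = len(points)  # the count is simply the number of points returned
print(count, [(p.x, p.y) for p in points])
```

Because each counted object comes with a location, a user can check the count by overlaying the points on the image instead of trusting a bare number.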

Action Capabilities

Molmo's action capabilities are equally impressive:

Robotics Images: Molmo can interpret images from robotic systems and provide actionable insights, making it a valuable tool for robotics research.
Augmenting How We See with AI: The model can enhance visual data by providing additional context or highlighting important elements.
Molmo Robotics Demo: Demonstrations show how Molmo can be integrated into robotic systems to perform tasks more effectively.

Performance and Benchmarks

Benchmarking: Molmo's most powerful models close the gap between open and proprietary systems across a wide range of academic benchmarks. The smaller Molmo models outperform models 10 times their size, which makes them an efficient option when compute is limited.
Human Evaluations: The models also perform well in human evaluations, which suggests they are usable in practice and not only strong on benchmark scores.

Open, Cutting-Edge, and Actionable

The development of Molmo addresses a significant gap in the AI research community. Recent open models have relied heavily on synthetic data from proprietary VLMs to achieve good performance, so the community still lacks foundational knowledge about how to build performant VLMs from scratch. Molmo's results rest instead on human-annotated datasets and careful architectural choices.

Future Directions

The introduction of 2D pointing data gives multimodal models a way to refer to exact locations in an image instead of only describing them. This capability opens up applications where agents need to interact with both virtual and physical environments, such as augmented reality (AR), robotics, and interactive AI systems.