
Mantis tackles the challenge of training large multimodal models to excel in both single-image and multi-image tasks by introducing a unique dataset that enhances their reasoning and context comprehension skills.
Recent advances in large multimodal models (LMMs) have significantly improved their performance on single-image vision-language tasks. However, these models often struggle with multi-image vision-language tasks, which require more complex reasoning and context understanding. Mantis, a new LMM developed by researchers from the University of Waterloo, Tsinghua University, and Sea AI Lab, aims to bridge this gap.
The Mantis-Instruct dataset is the first fully text-image interleaved multimodal instruction tuning dataset. It contains 721K examples from 14 subsets, 10 drawn from existing datasets and 4 newly created, covering a range of multi-image skills.

Mantis is built on the LLaMA-3 architecture and is designed to handle interleaved text and image inputs.
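To make "interleaved text and image inputs" concrete, here is a minimal sketch of how a prompt with image placeholders could be split into an ordered sequence of text and image segments. It is an illustration only, not Mantis's actual preprocessing: the `<image>` placeholder string, the `TextSegment` and `ImageSegment` types, and the `interleave` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical placeholder marking where an image sits in the prompt.
# The article does not specify the token Mantis uses.
IMAGE_TOKEN = "<image>"


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    index: int  # position of the image in the caller's image list
    path: str


Segment = Union[TextSegment, ImageSegment]


def interleave(prompt: str, image_paths: List[str]) -> List[Segment]:
    """Split a prompt into text and image segments, preserving their order."""
    parts = prompt.split(IMAGE_TOKEN)
    n_slots = len(parts) - 1  # number of placeholders found in the prompt
    if n_slots != len(image_paths):
        raise ValueError(
            f"prompt has {n_slots} image placeholders "
            f"but {len(image_paths)} images were given"
        )

    segments: List[Segment] = []
    for i, part in enumerate(parts):
        if part.strip():
            segments.append(TextSegment(part.strip()))
        if i < n_slots:
            # Each placeholder is replaced by the image at the same position.
            segments.append(ImageSegment(i, image_paths[i]))
    return segments


if __name__ == "__main__":
    prompt = "Compare <image> with <image>. Which is brighter?"
    for segment in interleave(prompt, ["a.jpg", "b.jpg"]):
        print(segment)
```

The point of the sketch is that a multi-image model needs to know where each image falls relative to the surrounding text, so that a question such as "which is brighter?" can refer back to the first or second image. A single-image pipeline that always puts one image ahead of the text does not have to track this.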
Mantis achieves state-of-the-art performance on five multi-image benchmarks.
Additionally, Mantis maintains strong single-image performance, comparable to models like CogVLM and Emu2. This balance is crucial for real-world applications where both types of tasks are common.
The researchers have made the Mantis-Instruct dataset, training and evaluation code, and model checkpoints publicly available.
Mantis represents a significant step forward in the development of multimodal models capable of handling both single-image and multi-image tasks. By leveraging a carefully curated dataset and efficient training methods, Mantis achieves state-of-the-art performance while remaining accessible to researchers with limited resources. This work opens up new possibilities for applications requiring advanced visual language understanding.
Original Sources
↗ https://tiger-ai-lab.github.io/Mantis/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
6 May 2024