Mantis: Enhancing Multimodal Models with Interleaved Multi-Image Instruction Tuning

The Engineer

6 May 2024

Mantis tackles the challenge of training large multimodal models to excel in both single-image and multi-image tasks by introducing a unique dataset that enhances their reasoning and context comprehension skills.

Balancing Multi-Image and Single-Image Abilities in Large Multimodal Models

Recent advancements in large multimodal models (LMMs) have significantly improved their performance on single-image vision-language tasks. However, these models often struggle with multi-image vision-language tasks, which require reasoning over, and tracking context across, several images at once. Mantis, a new LMM developed by researchers from the University of Waterloo, Tsinghua University, and Sea AI Lab, aims to bridge this gap.

Key Contributions

Mantis-Instruct Data: A novel dataset for multimodal instruction tuning with 721K examples.
Mantis Model: A LLaMA-3-based model trained on Mantis-Instruct using academic-level resources.
State-of-the-Art Performance: Superior performance on five multi-image benchmarks while maintaining strong single-image capabilities.

Mantis-Instruct Data

The Mantis-Instruct dataset is the first fully text-image interleaved multimodal instruction tuning dataset. It contains 721K examples from 14 subsets, covering a range of multi-image skills:

Co-reference: Understanding and linking entities across multiple images.
Reasoning: Making logical inferences based on visual information.
Comparing: Identifying similarities and differences between images.
Temporal Understanding: Recognizing sequences and changes over time.
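
To make "text-image interleaved" concrete, here is a minimal sketch of what one such training example could look like. The field names, the `<image>` placeholder string, and the file names are all assumptions made for illustration; the article does not describe the actual Mantis-Instruct schema.

```python
from dataclasses import dataclass, field

IMAGE_TOKEN = "<image>"  # assumed placeholder string, not confirmed by the article


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str  # may contain IMAGE_TOKEN wherever an image appears


@dataclass
class InterleavedExample:
    """Hypothetical record: a list of images plus turns that reference them."""
    image_paths: list[str]
    turns: list[Turn] = field(default_factory=list)

    def placeholder_count(self) -> int:
        """Count image placeholders across all turns."""
        return sum(turn.text.count(IMAGE_TOKEN) for turn in self.turns)

    def is_consistent(self) -> bool:
        """The number of placeholders should match the number of images."""
        return self.placeholder_count() == len(self.image_paths)


example = InterleavedExample(
    image_paths=["left.jpg", "right.jpg"],  # hypothetical file names
    turns=[
        Turn(
            "user",
            f"Image 1: {IMAGE_TOKEN} Image 2: {IMAGE_TOKEN} "
            "Which image shows more birds?",
        ),
        Turn("assistant", "Image 2 shows more birds."),
    ],
)

print(example.placeholder_count(), example.is_consistent())
```

The point of the consistency check is that, in an interleaved format, the position of each placeholder carries meaning: the text can say "Image 1" or "Image 2" and the model has to link those mentions to the right picture.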

Dataset Composition

10 Existing Subsets (including):
- NLVR2, IconQA for reasoning
- DreamSim, Birds-to-Words for comparison
- NExT-QA, STAR for temporal understanding
4 New Datasets:
- LLaVA-665k-multi, LRV-multi for co-reference
- Contrast-Caption, Multi-VQA for broader reasoning (Multi-VQA generated by prompting GPT-4)
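
The composition above can be written down as a small lookup table. The sketch below only encodes the subsets this article names (six of the ten existing subsets plus the four new ones); the skill labels are shortened for illustration, and "broader reasoning" is folded into "reasoning" as a simplifying assumption.

```python
from collections import defaultdict

# (subset name, skill, origin) for the subsets named in the article.
# The remaining existing subsets are not listed there, so they are omitted.
SUBSETS = [
    ("NLVR2", "reasoning", "existing"),
    ("IconQA", "reasoning", "existing"),
    ("DreamSim", "comparison", "existing"),
    ("Birds-to-Words", "comparison", "existing"),
    ("NExT-QA", "temporal", "existing"),
    ("STAR", "temporal", "existing"),
    ("LLaVA-665k-multi", "co-reference", "new"),
    ("LRV-multi", "co-reference", "new"),
    ("Contrast-Caption", "reasoning", "new"),  # "broader reasoning" in the article
    ("Multi-VQA", "reasoning", "new"),         # "broader reasoning" in the article
]


def group_by_skill(subsets):
    """Map each skill to the subset names that target it, in listing order."""
    grouped = defaultdict(list)
    for name, skill, _origin in subsets:
        grouped[skill].append(name)
    return dict(grouped)


def count_by_origin(subsets):
    """Count how many named subsets are existing versus newly built."""
    counts = defaultdict(int)
    for _name, _skill, origin in subsets:
        counts[origin] += 1
    return dict(counts)


for skill, names in group_by_skill(SUBSETS).items():
    print(f"{skill}: {', '.join(names)}")
print(count_by_origin(SUBSETS))
```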

Mantis Model Architecture and Training

Mantis is built on the LLaMA-3 architecture and is designed to handle interleaved text and image inputs. The key aspects of its development include:

Interleaved Input: Mantis processes both text and images in a single input sequence, allowing it to better understand the context across multiple modalities.
Training Resources: Trained on Mantis-Instruct using 16 A100-40G GPUs for 36 hours (16 × 36 = 576 GPU-hours), which keeps training within reach of academic-level resources.
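
The idea of a single interleaved input sequence can be illustrated with a toy example. This is a sketch under stated assumptions, not the Mantis implementation: the `<image>` placeholder, the `interleave` helper, the whitespace "tokenizer", and the string stand-ins for image features are all hypothetical. It only shows how image content can keep its position relative to the surrounding text.

```python
IMAGE_TOKEN = "<image>"  # assumed placeholder string


def interleave(prompt: str, image_features: list[list[str]]) -> list[str]:
    """Replace each placeholder with that image's feature tokens, in order."""
    chunks = prompt.split(IMAGE_TOKEN)
    if len(chunks) - 1 != len(image_features):
        raise ValueError("number of placeholders must match number of images")

    sequence: list[str] = []
    for i, chunk in enumerate(chunks):
        sequence.extend(chunk.split())  # toy whitespace "tokenizer"
        if i < len(image_features):
            sequence.extend(image_features[i])  # image tokens stay in position
    return sequence


# String stand-ins for per-image features; a real model would use embeddings.
features = [["<img1_a>", "<img1_b>"], ["<img2_a>", "<img2_b>"]]
sequence = interleave("Compare <image> with <image> .", features)
print(sequence)
```

Because each image's tokens land exactly where its placeholder was, a phrase such as "the second image" can be tied to a specific span of the sequence, which is what multi-image co-reference relies on.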

Performance Highlights

Mantis achieves state-of-the-art performance on five multi-image benchmarks:

NLVR2
Q-Bench
BLINK
MVBench
Mantis-Eval

Additionally, Mantis maintains strong single-image performance, comparable to models like CogVLM and Emu2. This balance is crucial for real-world applications where both types of tasks are common.

Open-Source Contributions

The researchers have made the Mantis-Instruct dataset, the training and evaluation code, and the model checkpoints publicly available:

Dataset: Mantis-Instruct
Code: GitHub Repository
Models: Hugging Face Collections
Evaluation: Mantis-Eval

Conclusion

Mantis represents a significant step forward in the development of multimodal models capable of handling both single-image and multi-image tasks. By combining a carefully curated interleaved dataset with a modest training budget, Mantis achieves state-of-the-art performance on multi-image benchmarks while remaining accessible to researchers with limited resources. This work opens up new possibilities for applications requiring advanced vision-language understanding.