Salesforce Unveils XGen-MM: The Next Evolution of Multimodal Models


The Engineer

13 May 2024

Salesforce's XGen-MM continues the company's BLIP line of multimodal models, with Salesforce reporting state-of-the-art results among vision-language models under 5 billion parameters.

Salesforce AI Research has announced that its BLIP series will continue under a new name, XGen-MM (X-Generation Multimodal), aligning it with the company's broader initiative for large foundation models. Salesforce reports state-of-the-art results for the new series on image captioning and visual question answering benchmarks.

What Changed Technically

The XGen-MM series builds on the successful designs of the BLIP series, incorporating several key enhancements:

Pretrained Foundation Model: The xgen-mm-phi3-mini-base-r-v1 model achieves state-of-the-art performance among models with fewer than 5 billion parameters. It also demonstrates strong in-context learning: its scores improve when a few solved examples are included in the prompt, so it can be adapted to new tasks without further training.
Instruct Fine-Tuned Models: The xgen-mm-phi3-mini-instruct-r-v1 variant further improves performance through instruction tuning, achieving top scores among both open-source and closed-source Vision-Language Models (VLMs) under 5 billion parameters.
Flexible High-Resolution Image Encoding: The models support high-resolution images through efficient visual token sampling.
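
To make the in-context learning point concrete, here is a minimal sketch of how a k-shot visual question answering prompt could be assembled: k solved examples are placed before the query so the model can infer the task from them. The `<image>` placeholder, the `Question:`/`Answer:` layout, and the `Example` and `build_prompt` names are assumptions made for illustration; they are not the prompt template documented for XGen-MM.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image is interleaved in the text.
IMAGE_TOKEN = "<image>"


@dataclass
class Example:
    """One solved demonstration: an image plus a question and its answer."""
    image_path: str
    question: str
    answer: str


def build_prompt(shots: List[Example], query_image: str,
                 query_question: str) -> Tuple[str, List[str]]:
    """Return (prompt_text, image_paths) for a k-shot query, k = len(shots).

    With an empty `shots` list this degenerates to a zero-shot prompt.
    """
    blocks = [
        f"{IMAGE_TOKEN}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        for ex in shots
    ]
    # The query comes last, with the answer left open for the model to complete.
    blocks.append(f"{IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    images = [ex.image_path for ex in shots] + [query_image]
    return "\n\n".join(blocks), images


shots = [
    Example("cat.jpg", "What animal is shown?", "a cat"),
    Example("bus.jpg", "What colour is the bus?", "red"),
]
prompt, images = build_prompt(shots, "query.jpg", "How many people are visible?")
print(prompt)
```

The "Shot" column in the pretraining results corresponds to the number of such demonstrations (0, 4, or 8) placed before the query.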

Model Variants

The latest release, XGen-MM-v1.5, includes several variants:

xgen-mm-phi3-mini-instruct-interleave-r-v1.5
xgen-mm-phi3-mini-base-r-v1.5
xgen-mm-phi3-mini-instruct-singleimg-r-v1.5
xgen-mm-phi3-mini-instruct-dpo-r-v1.5
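
The variant names above appear to follow a regular pattern, which the sketch below splits into fields. The layout `xgen-mm-<backbone>-<stage>[-<flavor>]-r-<version>` is inferred only from the four names listed here, and the `Variant` and `parse_variant` names are hypothetical; the meaning of the `r` token is not stated in the source, so it is treated as an opaque marker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    family: str    # e.g. "xgen-mm"
    backbone: str  # e.g. "phi3-mini"
    stage: str     # "base" or "instruct"
    flavor: str    # e.g. "interleave", "singleimg", "dpo"; "" if absent
    version: str   # e.g. "v1.5"


def parse_variant(name: str) -> Variant:
    """Split a model name using the assumed layout described above."""
    parts = name.split("-")
    # Expect at least: xgen, mm, phi3, mini, <stage>, r, <version>
    if len(parts) < 7 or parts[-2] != "r":
        raise ValueError(f"unexpected variant name: {name!r}")
    return Variant(
        family="-".join(parts[0:2]),
        backbone="-".join(parts[2:4]),
        stage=parts[4],
        flavor="-".join(parts[5:-2]),  # empty for names with no flavor token
        version=parts[-1],
    )


NAMES = [
    "xgen-mm-phi3-mini-instruct-interleave-r-v1.5",
    "xgen-mm-phi3-mini-base-r-v1.5",
    "xgen-mm-phi3-mini-instruct-singleimg-r-v1.5",
    "xgen-mm-phi3-mini-instruct-dpo-r-v1.5",
]

for name in NAMES:
    print(parse_variant(name))
```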

Key Results

Pretrain (Base Model without Instruction Tuning)

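The base model, xgen-mm-phi3-mini-base-r-v1, is competitive with MM1-3B on few-shot captioning and visual question answering benchmarks, and scores higher on six of the seven benchmarks in the zero-shot setting: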

| Model | Shot | COCO (val) | NoCaps (val) | TextCaps (val) | OKVQA (val) | TextVQA (val) | VizWiz (testdev) | VQAv2 (testdev) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Flamingo-3B | 4 | 85.0 | - | - | 43.3 | 32.7 | 34 | 53.2 |
| | 8 | 90.6 | - | - | 44.6 | 32.4 | 38.4 | 55.4 |
| MM1-3B | 0 | 73.5 | 55.6 | 63.3 | 26.1 | 29.4 | 15.6 | 46.2 |
| | 4 | 112.3 | 99.7 | 84.1 | 48.6 | 45.3 | 38.0 | 57.9 |
| | 8 | 114.6 | 104.7 | 88.8 | 48.4 | 44.6 | 46.4 | 63.6 |
| xgen-mm-phi3-mini-base-r-v1 (Ours) | 0 | 81.7 | 80.2 | 60.7 | 26.5 | 36.0 | 21.2 | 48.1 |
| | 4 | 110.5 | 101.7 | 84.6 | 49.2 | 46.1 | 38.4 | 63.9 |
| | 8 | 112.1 | 104.4 | 87.7 | 49.1 | 46.4 | 44.3 | 63.8 |
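
The comparison with MM1-3B in the table above can be tallied mechanically. The sketch below copies the MM1-3B and xgen-mm-phi3-mini-base-r-v1 rows and counts, per shot setting, the benchmarks on which the XGen-MM base model scores higher. The `wins` helper is a hypothetical name, and the tally is a simple head-to-head count rather than an official metric.

```python
# Benchmark order matches the table columns.
BENCHMARKS = ["COCO", "NoCaps", "TextCaps", "OKVQA", "TextVQA", "VizWiz", "VQAv2"]

# Scores keyed by model, then by number of in-context shots.
SCORES = {
    "MM1-3B": {
        0: [73.5, 55.6, 63.3, 26.1, 29.4, 15.6, 46.2],
        4: [112.3, 99.7, 84.1, 48.6, 45.3, 38.0, 57.9],
        8: [114.6, 104.7, 88.8, 48.4, 44.6, 46.4, 63.6],
    },
    "xgen-mm-phi3-mini-base-r-v1": {
        0: [81.7, 80.2, 60.7, 26.5, 36.0, 21.2, 48.1],
        4: [110.5, 101.7, 84.6, 49.2, 46.1, 38.4, 63.9],
        8: [112.1, 104.4, 87.7, 49.1, 46.4, 44.3, 63.8],
    },
}


def wins(model: str, baseline: str, shot: int) -> list:
    """Benchmarks where `model` strictly outscores `baseline` at `shot` shots."""
    pairs = zip(BENCHMARKS, SCORES[model][shot], SCORES[baseline][shot])
    return [name for name, ours, theirs in pairs if ours > theirs]


for shot in (0, 4, 8):
    won = wins("xgen-mm-phi3-mini-base-r-v1", "MM1-3B", shot)
    print(f"{shot}-shot: ahead on {len(won)}/{len(BENCHMARKS)} ({', '.join(won)})")
```

On the figures in the table, the XGen-MM base model is ahead on six of seven benchmarks at 0 and 4 shots and on three of seven at 8 shots.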

Instruct (After Instruction Tuning)

The instruct-tuned model, `xgen-mm-phi3-mini-instruct-r