Imp-v1-3B: A Compact Multimodal Language Model That Punches Above Its Weight

Models & Research

The Engineer

29 Jan 2024 · 3 min read

Despite its modest size, Imp-v1-3B outperforms larger models on multimodal tasks by pairing the Phi-2 language model with the SigLIP visual encoder, making it a standout in efficiency and performance.

In the rapidly evolving landscape of multimodal language models (MLMs), size often correlates with performance. However, the recent release of Imp-v1-3B by MILVLG challenges this notion. This model, packing a mere 3 billion parameters, demonstrates competitive and even superior performance compared to much larger counterparts on various benchmarks. Let's dive into what makes Imp-v1-3B stand out.

Technical Overview

Imp-v1-3B is built upon the following components:

Phi-2 (2.7B): A small yet powerful language model that forms the backbone of Imp.
SigLIP (0.4B): A robust visual encoder that enhances the model's multimodal capabilities.

These components are combined and fine-tuned on the LLaVA-v1.5 training set, which is known for its high-quality multimodal data. The result is a compact model that delivers impressive results.
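As a quick sanity check on the "3B" label, the component sizes listed above can be added up. The sketch below uses only the rounded figures quoted in this article (2.7B and 0.4B); the names `components` and `parameter_shares` are made up for illustration and are not part of any Imp or transformers API.

```python
# Rounded parameter counts (in billions) as quoted in the article.
# These are approximations, not exact counts read from the checkpoint.
components = {
    "Phi-2 (language model)": 2.7,
    "SigLIP (visual encoder)": 0.4,
}

def parameter_shares(parts):
    """Return each component's fraction of the total parameter count."""
    total = sum(parts.values())
    return {name: count / total for name, count in parts.items()}

total_b = sum(components.values())

for name, share in parameter_shares(components).items():
    print(f"{name}: {components[name]:.1f}B ({share:.0%} of total)")
print(f"Total: about {total_b:.1f}B parameters")
```

Under these rounded numbers the language model accounts for most of the parameter budget, with the visual encoder adding a comparatively small share on top.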

Key Features

Compact Size: At 3 billion parameters, Imp-v1-3B is significantly smaller than many state-of-the-art models, making it more accessible and efficient to deploy.
Strong Performance: Despite its size, Imp-v1-3B outperforms similar-sized models and even surpasses the larger LLaVA-7B on several benchmarks.
Ease of Use: The model is compatible with popular libraries like Hugging Face transformers, making it easy for developers to integrate into their workflows.
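To make the deployment claim more concrete, here is a rough back-of-the-envelope estimate of how much memory the weights alone would occupy. It assumes the rounded 3 billion parameter figure from the list above and standard byte widths per data type; it ignores activations, the generation cache, and framework overhead, so real usage will be higher. The function name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gib(num_params, dtype):
    """Estimate the memory taken by model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Assumption: the rounded "3 billion parameters" figure from the article.
NUM_PARAMS = 3e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: roughly {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB for weights")
```

Halving the bytes per parameter halves the weight footprint, which is one reason half-precision loading is a common choice for models of this size.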

Evaluation

Imp-v1-3B was evaluated on nine commonly used benchmarks, including five academic VQA (Visual Question Answering) benchmarks and four other popular datasets. The results are striking:

VQA Benchmarks: Imp-v1-3B consistently outperforms models of similar size and even matches or exceeds the performance of larger models like LLaVA-7B.
Other Datasets: The model shows strong performance across a variety of tasks, demonstrating its versatility.

How to Use

To get started with Imp-v1-3B, you'll need to install the necessary dependencies. Here’s a step-by-step guide:

Install Dependencies

pip install transformers # Latest version is fine, but v4.39.2 is recommended
pip install -q pillow accelerate einops
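After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library. The package list mirrors the pip commands above, plus `torch`, which is an assumption based on the inference example importing it. The helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages from the pip commands above; "torch" is assumed because the
# inference example imports it.
REQUIRED = ["torch", "transformers", "pillow", "accelerate", "einops"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed with pip before moving on to inference.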

Run Model Inference

Here's an example of how to use Imp-v1-3B for inference. This code snippet demonstrates how to generate a response to a visual question:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# Set default device to GPU
torch.set_default_device("cuda")

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

# Set up the input
text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat are the colors of the bus in the image? ASSISTANT:"
image = Image.open("images/bus.jpg")

# Tokenize the input text
input_ids = tokenizer(text, return_tensors='pt').input_ids

# Preprocess the image
image_tensor = model.image_preprocess(image)

# Generate the answer
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]

# Decode and print the output
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
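The prompt in the example above follows a fixed pattern: a system preamble, then `USER:` with an `<image>` placeholder and the question, then `ASSISTANT:`. To ask different questions without retyping the whole string, the pattern can be wrapped in a small helper. This is a sketch that simply reproduces the string shown above; `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, not part of the model's API.

```python
# System preamble copied from the inference example above.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(question: str) -> str:
    """Wrap a question in the chat template used by the example above."""
    return f"{SYSTEM_PROMPT} USER: <image>\n{question.strip()} ASSISTANT:"

text = build_prompt("What are the colors of the bus in the image?")
print(text)
```

The resulting `text` can be passed to the tokenizer exactly as in the example, with only the question changing between calls.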

Future Developments

The team at MILVLG is committed to continuously improving Imp-v1-3B. They plan to release more versions with enhanced performance and additional features. The detailed technical report and training/evaluation code will be available soon on their GitHub repository.

Conclusion

Imp-v1-3B is a compelling addition to the