Bunny-Llama-3-8B-V: A Lightweight Multimodal Model with SigLIP and Llama-3 Integration


The Engineer

26 Apr 2024 · 3 min read

Bunny-Llama-3-8B-V combines the efficiency of SigLIP’s vision encoder with Llama-3’s language backbone, offering a compact solution for complex multimodal tasks like image captioning and visual question answering.

Introduction to Bunny-Llama-3-8B-V

The latest addition to the Bunny family of multimodal models, Bunny-Llama-3-8B-V, is a lightweight yet powerful model designed for tasks that require both visual and textual understanding. This model integrates the SigLIP vision encoder with the Llama-3-8B language backbone, making it a versatile tool for applications like image captioning, visual question answering, and more.

Technical Highlights

Architecture and Components

Vision Encoder: Uses SigLIP, which is known for its high performance in image-text alignment tasks.
Language Backbone: Leverages Llama-3-8B-Instruct, a pre-trained language model fine-tuned for instruction-following tasks.
Data Curation: To offset its smaller size, the model is trained on data curated from a broader and more diverse set of sources.

Key Features

High-Resolution Support: A v1.1 version of the model supports images up to 1152x1152 resolution.
Plug-and-Play Flexibility: The Bunny framework is modular, so the same design can be paired with other vision encoders such as EVA-CLIP and with other language backbones such as Phi-1.5, StableLM-2, Qwen1.5, MiniCPM, and Phi-2.

Implementation Details

Quickstart Guide

To get started with Bunny-Llama-3-8B-V, you'll need to install the following dependencies:

pip install torch transformers accelerate pillow
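
Before loading the model, it can help to confirm that the packages are actually importable. The sketch below is not part of the Bunny project: `missing_packages` and `REQUIRED` are hypothetical names, and the check relies only on the standard library. Note that Pillow is installed as `pillow` but imported as `PIL`.

```python
import importlib.util

# pip package name -> import name (Pillow is imported as PIL)
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "pillow": "PIL",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If the list is non-empty, rerunning the pip command above for the reported names should resolve it.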

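To choose which GPU the model runs on, set the CUDA_VISIBLE_DEVICES environment variable before starting Python. This controls device visibility only; it does not by itself improve performance: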

export CUDA_VISIBLE_DEVICES=0
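
The same selection can be made from inside Python, as long as the variable is set before CUDA is initialized (in practice, before importing torch). This is a minimal sketch; `select_gpus` is a hypothetical helper, not something provided by torch or transformers.

```python
import os

def select_gpus(indices):
    """Expose only the given GPU indices to CUDA-based libraries.

    Call this before importing torch, because the variable is read
    when the CUDA runtime is first initialized.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

select_gpus([0])  # equivalent to: export CUDA_VISIBLE_DEVICES=0
```

Inside the process, the visible devices are renumbered from zero, so the first GPU listed is addressed as `cuda:0`.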

Here's a code snippet to help you use the model with Hugging Face Transformers:

import torch
import transformers  # needed for the transformers.logging calls below
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# Disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set device (GPU or CPU)
device = 'cuda'  # or 'cpu'
torch.set_default_device(device)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'BAAI/Bunny-Llama-3-8B-V',
    torch_dtype=torch.float16,  # Use float32 for CPU
    device_map='auto',
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    'BAAI/Bunny-Llama-3-8B-V',
    trust_remote_code=True
)

# Prepare the text prompt
prompt = 'Why is the image funny?'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)

# Load and process the image
image = Image.open('example_2.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# Generate the response
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True  # Reuse the key/value cache between decoding steps for faster generation
)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generated_text)
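
The prompt-preparation step in the snippet above splits the text at `<image>`, tokenizes each half, and joins them with the placeholder id -200, which appears to mark where the model's remote code inserts the image features. The `[1:]` slice drops the first token of the second half, presumably the beginning-of-sequence token that the tokenizer prepends to every chunk. The sketch below reproduces that splicing logic with a toy tokenizer so it can be run without downloading the model; `toy_tokenize`, `build_input_ids`, `BOS_ID`, and the vocabulary are made up for illustration.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the snippet above
BOS_ID = 1                # hypothetical beginning-of-sequence id

def toy_tokenize(text, vocab):
    """Toy whitespace tokenizer that prepends BOS, as many real tokenizers do."""
    return [BOS_ID] + [vocab.setdefault(w, len(vocab) + 2) for w in text.split()]

def build_input_ids(text, tokenize):
    """Tokenize around a single <image> tag and splice in the placeholder id."""
    before, after = text.split("<image>")
    head = tokenize(before)
    tail = tokenize(after)[1:]  # drop the duplicate BOS from the second chunk
    return head + [IMAGE_TOKEN_INDEX] + tail

vocab = {}
ids = build_input_ids(
    "USER: <image>\nWhy is it funny? ASSISTANT:",
    lambda t: toy_tokenize(t, vocab),
)
print(ids)  # [1, 2, -200, 3, 4, 5, 6, 7]
```

Without the `[1:]` slice, the sequence would contain a second BOS token immediately after the image placeholder.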

Performance and Benchmarks

Bunny-Llama-3-8B-V has been benchmarked against other state-of-the-art multimodal models, showing competitive performance in various tasks. The integration of SigLIP and Llama-3-8B-Instruct allows the model to achieve high accuracy while maintaining efficiency.

Key Benchmarks

Image Captioning: Performs on par with larger models while using fewer resources.
Visual Question Answering (VQA): Shows strong performance in understanding complex visual scenes and answering questions about them.