xGen-MM-Vid (BLIP-3-Video): Efficient Video-Language Model with 32 Tokens

The Engineer

24 Oct 2024 · 3 min read

This video-language model from Salesforce AI Research represents an entire video with as few as 32 visual tokens, which sharply reduces the compute needed for video understanding.

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
Salesforce AI Research

[arXiv] [🤗 Model] [Release Notes]

Introduction

xGen-MM-Vid (BLIP-3-Video) is a compact and efficient vision-language model (VLM) designed to understand videos. The key innovation lies in the integration of a temporal encoder within the original (image-based) BLIP-3 architecture. This temporal encoder maps sequences of tokens over multiple frames into a compact set of visual tokens, significantly reducing the computational load while maintaining high accuracy.

Technical Overview

Key Components

Temporal Encoder: The core of xGen-MM-Vid. It takes the visual tokens from all sampled frames and reduces them to a much smaller set (for example, 32 tokens), which is what makes the video representation efficient.
Visual Tokenizer: Converts each raw video frame into a sequence of visual tokens that the model can process.
Transformer Architecture: A Transformer-based language model consumes the compact visual tokens together with the text tokens.

Temporal Encoder Variants

Conventional Pooling: Simple averaging or max-pooling over frames.
Transformer-Based Encoders: More sophisticated models that capture temporal dependencies using self-attention mechanisms.
Learnable Spatio-Temporal Pooling: Advanced pooling techniques that learn to aggregate spatial and temporal information effectively.
Token Turing Machines (TTM): A sequential model that processes tokens frame by frame while maintaining a fixed-size token memory, allowing for more dynamic and context-aware representations.
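
To make the learnable and sequential variants above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: `attention_pool` and `sequential_pool` are hypothetical helper names, the query vectors are random stand-ins for learned parameters, and the sizes (8 frames, 128 tokens per frame, 16 dimensions) are assumptions chosen only for illustration. The sequential version is a heavily simplified loop in the spirit of TTM, not a full Token Turing Machine.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, queries):
    """Summarize len(tokens) vectors into len(queries) vectors.

    Each query scores every token; the output for that query is the
    softmax-weighted average of the tokens. In a trained model the
    queries would be learned; here they are passed in by the caller.
    """
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def sequential_pool(frames, queries):
    """Fold frames into a fixed-size memory, one frame at a time.

    Simplified, TTM-inspired loop (an assumption, not the paper's code):
    the memory and the next frame's tokens are concatenated and then
    summarized back down to the memory size.
    """
    memory = attention_pool(frames[0], queries)
    for frame in frames[1:]:
        memory = attention_pool(memory + frame, queries)
    return memory


random.seed(0)
DIM, FRAMES, TOKENS_PER_FRAME, NUM_OUT = 16, 8, 128, 32  # illustrative sizes
video = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS_PER_FRAME)]
    for _ in range(FRAMES)
]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_OUT)]

all_tokens = [tok for frame in video for tok in frame]  # 8 * 128 = 1024 tokens
joint = attention_pool(all_tokens, queries)             # all frames at once
streamed = sequential_pool(video, queries)              # frame by frame
print(len(all_tokens), len(joint), len(streamed))       # 1024 32 32
```

Both variants end with the same fixed budget of 32 tokens. The joint version attends over every frame at once, while the sequential version only ever holds the memory plus one frame, which is the trade-off a sequential design targets for longer videos.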

Performance Highlights

xGen-MM-Vid matches much larger models on complex video tasks while using far fewer parameters and visual tokens:

Video QA: Comparable accuracy to 7B and 34B models using only 4B parameters.
Video Captioning: High-quality captions generated from a fraction of the visual tokens (e.g., 32 tokens per video versus 4608 in competing models).
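
The token counts quoted above translate into a simple back-of-the-envelope calculation. The sketch below is only arithmetic on the 32 and 4608 figures; the function names are hypothetical, and the quadratic estimate considers self-attention over visual tokens alone, so it overstates the end-to-end saving once text tokens and the vision encoder are included.

```python
def compression_ratio(baseline_tokens, compact_tokens):
    """How many times fewer visual tokens the compact model feeds to the LLM."""
    return baseline_tokens / compact_tokens


def pairwise_attention_ratio(baseline_tokens, compact_tokens):
    """Ratio of token-pair counts if self-attention ran over visual tokens only."""
    return (baseline_tokens ** 2) / (compact_tokens ** 2)


print(compression_ratio(4608, 32))         # 144.0
print(compression_ratio(4608, 128))        # 36.0
print(pairwise_attention_ratio(4608, 32))  # 20736.0
```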

Architecture Details

Input Processing

Frame Sampling: The model samples a fixed number of frames from the video to create a sequence of visual inputs.
Tokenization: Each frame is tokenized into a set of visual tokens using a pre-trained tokenizer.
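
As a rough illustration of the frame sampling step, the sketch below picks a fixed number of evenly spaced frame indices. Uniform sampling from segment midpoints is an assumption made for this example; the text above only says that a fixed number of frames is sampled, and `sample_frame_indices` is a hypothetical helper, not part of the released code.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return num_samples frame indices spread evenly across the video.

    The video is split into num_samples equal segments and the frame at
    the middle of each segment is chosen.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    return [
        min(total_frames - 1, int((i + 0.5) * segment))
        for i in range(num_samples)
    ]


print(sample_frame_indices(80, 8))  # [5, 15, 25, 35, 45, 55, 65, 75]
```

When the video has fewer frames than requested, this version repeats indices rather than failing, which keeps the number of visual inputs fixed.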

Temporal Encoding

Pooling Layer: Aggregates information across frames to reduce the number of tokens.
Transformer Layer: Processes the pooled tokens to capture temporal dependencies and context.
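
The simplest form of the pooling step can be sketched as a plain average over time. This is a minimal illustration of conventional temporal pooling under assumed shapes (frames, tokens per frame, feature dimension), not the model's actual layer; `temporal_mean_pool` is a hypothetical name.

```python
def temporal_mean_pool(frames):
    """Average tokens at the same position across frames.

    Input shape:  (num_frames, tokens_per_frame, dim) as nested lists.
    Output shape: (tokens_per_frame, dim).
    """
    num_frames = len(frames)
    tokens_per_frame = len(frames[0])
    dim = len(frames[0][0])
    return [
        [sum(frame[n][d] for frame in frames) / num_frames for d in range(dim)]
        for n in range(tokens_per_frame)
    ]


# Two frames, two tokens per frame, two features per token.
frames = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 6.0], [5.0, 8.0]],
]
print(temporal_mean_pool(frames))  # [[2.0, 4.0], [4.0, 6.0]]
```

Averaging removes the time axis but has no learned parameters and discards frame order, which is the usual motivation for placing a learned layer after or in place of it.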

Output Generation

Language Decoder: Generates text based on the encoded video representation, using a Transformer-based language model.

Benchmarks

xGen-MM-Vid's 4B model achieves:

Video QA Accuracy: On par with 7B and 34B models.
Video Captioning Quality: High-quality captions with fewer tokens, reducing computational overhead.

Implementation Notes

Efficiency: The use of a compact token set (e.g., 32 tokens) makes the model highly efficient in terms of both memory and computation.
Scalability: The modular design allows for easy integration into existing VLM frameworks.
Flexibility: Supports various temporal encoder variants, enabling researchers to experiment with different approaches.

Releases

10/22/2024

arXiv Paper Release
- 📄 arXiv

01/16/2025

Code/Model Releases
- 🤗 BLIP-3-Video 128 token model
- 🤗 BLIP-3-Video 32 token model

Examples

The xGen-MM