
This video-language model from Salesforce AI Research represents an entire video with just 32 visual tokens, which sharply reduces the cost of video understanding.
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
Salesforce AI Research
[arXiv] [🤗 Model] [Release Notes]
xGen-MM-Vid (BLIP-3-Video) is a compact and efficient vision-language model (VLM) designed to understand videos. The key innovation lies in the integration of a temporal encoder within the original (image-based) BLIP-3 architecture. This temporal encoder maps sequences of tokens over multiple frames into a compact set of visual tokens, significantly reducing the computational load while maintaining high accuracy.
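To make the token-reduction idea concrete, here is a minimal sketch of one way many per-frame tokens could be pooled into a fixed set of 32. It uses plain attention pooling with a set of query vectors. This is an illustrative stand-in, not the model's actual temporal encoder: the function name `attention_pool`, the frame count, the tokens per frame, the embedding size, and the random "learned" queries are all assumptions chosen to keep the example small.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, queries):
    """Pool N input tokens into len(queries) output tokens.

    tokens:  list of N vectors (lists of floats, length d)
    queries: list of K vectors (length d); learned in a real model,
             random here purely for illustration
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all input tokens.
        pooled.append(
            [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        )
    return pooled


rng = random.Random(0)
NUM_FRAMES = 8           # assumption, not stated in the article
TOKENS_PER_FRAME = 16    # assumption, kept small for readability
DIM = 4                  # assumption, toy embedding size
NUM_OUTPUT_TOKENS = 32   # the token budget the article reports

# Toy per-frame tokens standing in for the image encoder's output.
frame_tokens = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS_PER_FRAME)]
    for _ in range(NUM_FRAMES)
]
# Flatten space and time into one sequence before pooling.
flat_tokens = [tok for frame in frame_tokens for tok in frame]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_OUTPUT_TOKENS)]

video_tokens = attention_pool(flat_tokens, queries)
print(len(flat_tokens), "->", len(video_tokens))
```

The point of the sketch is that the language model only ever sees the pooled tokens, so its input length stays fixed no matter how many tokens the frames produce.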
According to the release, the 4B xGen-MM-Vid model achieves strong results on complex video tasks while using significantly fewer resources than larger models.
Original Sources
↗ https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
24 October 2024