
Researchers unveil Compound Text-Guided Prompt Tuning, a method that slashes GPU memory usage by 93% while enhancing performance in Vision-Language Models, overcoming challenges posed by large datasets.
Vision-Language Models (VLMs) like CLIP have shown impressive generalization capabilities for various downstream tasks. However, existing prompt tuning frameworks often face significant challenges, especially when dealing with a large number of categories in target datasets. These frameworks typically parallelize learnable textual inputs for all categories, leading to massive GPU memory consumption and subpar performance with ambiguous category names.
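To see why per-category textual inputs get expensive, a rough back-of-the-envelope sketch helps. The function below counts only the hidden-state values a text encoder would need to keep for backpropagation, one tensor per layer. The defaults (context length 77, width 512, 12 layers) match the text encoder of a CLIP ViT-B model; the function name and the "two sequences" comparison are illustrative assumptions, not figures from the paper, and the estimate ignores attention maps, the image encoder, and optimizer state.

```python
def text_activation_floats(num_sequences: int,
                           context_length: int = 77,
                           width: int = 512,
                           layers: int = 12) -> int:
    """Rough lower bound on hidden-state values kept for backprop.

    Hypothetical helper for illustration: one (sequence, token, channel)
    tensor is assumed to be retained per transformer layer.
    """
    return num_sequences * context_length * width * layers


# One learnable text sequence per category (1,000 categories, as in ImageNet).
per_category = text_activation_floats(num_sequences=1000)

# Assumed alternative: a small, fixed number of sequences that does not
# grow with the number of categories (2 is an arbitrary illustrative value).
fixed = text_activation_floats(num_sequences=2)

print(per_category)           # 473088000
print(per_category // fixed)  # 500
```

The point is the scaling, not the exact numbers: the per-category cost grows linearly with the number of classes, while a fixed-size input does not.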
To address these issues, researchers Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, and Xiangyu Zhang have introduced Compound Text-Guided Prompt Tuning (TGP-T). This novel approach not only reduces resource demand but also achieves superior performance in few-shot recognition and domain generalization tasks.
TGP-T leverages compound text supervision, which combines two forms of text supervision. The researchers describe this dual supervision approach as highly effective, and it contributes to the overall performance gains of TGP-T.

To further enhance the alignment between prompts and visual features, TGP-T introduces a module called Bonder. Bonder conditions the prompt generation on visual features extracted from images, ensuring that the generated prompts are more relevant and contextually accurate.
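One common way to condition a set of learnable vectors on visual features is cross-attention, where each learnable query takes a weighted average of the image features. The sketch below shows that general idea in plain Python. It is an assumption for illustration, not Bonder's confirmed design: the function names are hypothetical, the usual learned projection matrices are omitted (treated as identity), and a simple residual connection is used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def condition_prompts_on_image(queries, visual_feats):
    """Scaled dot-product cross-attention with a residual connection.

    queries:      list of learnable query vectors, each of dimension d
    visual_feats: list of image feature vectors, each of dimension d
    Returns one image-conditioned prompt vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    prompts = []
    for q in queries:
        # Attention weights: how much this query attends to each image feature.
        weights = softmax([dot(q, v) * scale for v in visual_feats])
        # Weighted average of the image features.
        attended = [sum(w * v[i] for w, v in zip(weights, visual_feats))
                    for i in range(dim)]
        # Residual: keep the learnable query and add the visual context.
        prompts.append([qi + ai for qi, ai in zip(q, attended)])
    return prompts


queries = [[1.0, 0.0], [0.0, 1.0]]                   # 2 learnable queries
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 image feature vectors
prompts = condition_prompts_on_image(queries, visual_feats)
print(len(prompts), len(prompts[0]))  # 2 2
```

Because the number of queries is fixed, the number of generated prompt vectors stays the same however many image features or categories there are.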
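The researchers conducted extensive experiments to evaluate TGP-T, reporting the 93% reduction in GPU memory usage alongside stronger results in few-shot recognition and domain generalization.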
Compound Text-Guided Prompt Tuning (TGP-T) represents a significant advancement in the field of Vision-Language Models. By reducing GPU memory usage and improving performance, TGP-T addresses key challenges faced by existing prompt tuning frameworks. This approach not only makes VLMs more efficient but also enhances their practical applicability in real-world scenarios.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
13 December 2023