Compound Text-Guided Prompt Tuning Reduces GPU Memory Usage by 93% While Boosting Performance


The Engineer

13 Dec 2023

Researchers unveil Compound Text-Guided Prompt Tuning, a method that cuts GPU memory usage by 93% while improving the performance of Vision-Language Models on datasets with many categories.

Vision-Language Models (VLMs) like CLIP have shown impressive generalization capabilities for various downstream tasks. However, existing prompt tuning frameworks face two significant challenges, especially when the target dataset has a large number of categories. They typically process learnable textual inputs for all categories in parallel, which consumes massive amounts of GPU memory, and they perform poorly when category names are ambiguous.

To address these issues, researchers Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, and Xiangyu Zhang have introduced Compound Text-Guided Prompt Tuning (TGP-T). This novel approach not only reduces resource demand but also achieves superior performance in few-shot recognition and domain generalization tasks.

Key Technical Changes and Benefits

1. Reduced GPU Memory Consumption

Problem: Existing methods parallelize learnable textual inputs for all categories, leading to high GPU memory usage.
Solution: TGP-T introduces text supervision to the optimization of prompts, which reduces the number of inputs to the text encoder. This cuts GPU memory consumption by 93% compared to traditional methods.
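
To see why the number of text-encoder inputs matters, here is a rough back-of-the-envelope sketch in Python. It only counts the embedding elements fed to the text encoder in one forward pass. The context length of 77 tokens and width of 512 match common CLIP text encoders, but the 1000 categories and the figure of 2 prompts for the text-guided case are illustrative assumptions, not numbers taken from this article. The function name is hypothetical.

```python
def text_encoder_input_elements(num_prompts, context_length=77, width=512):
    """Count embedding elements fed to the text encoder in one forward pass.

    context_length and width are assumed values typical of CLIP text encoders.
    """
    return num_prompts * context_length * width


# Per-category prompting: one learnable prompt per category (assumed 1000 categories).
per_class = text_encoder_input_elements(num_prompts=1000)

# Text-guided prompting: a small number of prompts (2 is an assumption for illustration).
text_guided = text_encoder_input_elements(num_prompts=2)

# Under these assumptions the input tensor is 500 times smaller.
ratio = per_class / text_guided
print(per_class, text_guided, ratio)
```

This counts a single input tensor only, so the ratio is not an end-to-end memory saving. The 93% figure reported by the researchers is a measured result and is not derived from this count.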

2. Flexible Prompt Generation

Problem: Traditional frameworks rely on pre-defined category names during inference, which can be limiting and lead to performance issues with ambiguous categories.
Solution: TGP-T releases the model from this reliance, enabling more flexible prompt generation. This flexibility is crucial for handling a diverse range of categories and improving performance in real-world scenarios.

Compound Text Supervision

TGP-T leverages compound text supervision, which includes:

Category-wise supervision: Helps in achieving inter-class separability by clearly distinguishing between different categories.
Content-wise supervision: Captures intra-class variations, ensuring that the model can recognize subtle differences within the same category.

This dual supervision approach is highly effective and contributes to the overall performance gains of TGP-T.
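
One plausible way to combine the two signals is as a sum of two contrastive terms, sketched below in plain Python. This is an assumption for illustration, not the authors' loss: the article does not give the exact formulation, and the function names, the temperature of 0.07, and the weighting factor are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_term(query, candidates, target_index, temperature=0.07):
    """Softmax cross-entropy over cosine similarities (temperature is assumed)."""
    logits = [cosine(query, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def compound_loss(category_prompt, content_prompt, class_text_feats, label,
                  caption_feats, image_index, weight=1.0):
    """Hypothetical compound loss: category-wise term plus weighted content-wise term."""
    # Category-wise: pull the prompt toward the text feature of the true class.
    category_term = contrastive_term(category_prompt, class_text_feats, label)
    # Content-wise: pull the prompt toward the text describing this specific image.
    content_term = contrastive_term(content_prompt, caption_feats, image_index)
    return category_term + weight * content_term


# Toy 2-D features: two classes and two per-image descriptions.
class_text_feats = [[1.0, 0.0], [0.0, 1.0]]
caption_feats = [[1.0, 0.2], [0.9, -0.3]]

aligned = compound_loss([1.0, 0.0], [1.0, 0.2], class_text_feats, 0, caption_feats, 0)
misaligned = compound_loss([1.0, 0.0], [1.0, 0.2], class_text_feats, 1, caption_feats, 0)
print(aligned < misaligned)
```

In this toy setup the category-wise term separates classes, while the content-wise term distinguishes images that may share a class, which mirrors the inter-class and intra-class roles described above.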

Visual Feature Conditioning

To further enhance the alignment between prompts and visual features, TGP-T introduces a module called Bonder. Bonder conditions the prompt generation on visual features extracted from images, ensuring that the generated prompts are more relevant and contextually accurate.
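
The article does not describe Bonder's internals. As a minimal sketch of what "conditioning prompts on visual features" can look like, the Python below uses a single-head cross-attention in which prompt queries attend over an image's visual features. The cross-attention design, the residual connection, and all names here are assumptions for illustration, and the learned projection matrices a real module would have are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condition_prompts(queries, visual_feats):
    """Hypothetical conditioning step: each query attends over the visual features.

    queries: list of prompt query vectors (learnable in a real model).
    visual_feats: list of feature vectors extracted from one image.
    Learned query/key/value projections are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    prompts = []
    for q in queries:
        # Scaled dot-product scores between this query and every visual feature.
        scores = [scale * sum(a * b for a, b in zip(q, v)) for v in visual_feats]
        weights = softmax(scores)
        # Weighted average of the visual features.
        attended = [sum(w * v[d] for w, v in zip(weights, visual_feats))
                    for d in range(dim)]
        # Residual connection: the prompt is the query plus the visual context.
        prompts.append([a + b for a, b in zip(q, attended)])
    return prompts


queries = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.1]]
image_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_b = [[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]]

prompts_a = condition_prompts(queries, image_a)
prompts_b = condition_prompts(queries, image_b)
print(prompts_a != prompts_b)
```

The same queries yield different prompts for different images, which is the point of conditioning: the prompt passed on to the text encoder reflects the content of the specific image.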

Experimental Results

The researchers conducted extensive experiments to evaluate the performance of TGP-T:

Few-shot Recognition: On 16-shot ImageNet, TGP-T achieved a 2.5% performance gain compared to baseline methods.
Domain Generalization: TGP-T demonstrated consistent improvements across various domain generalization tasks, showcasing its robustness and adaptability.
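
For readers unfamiliar with the term, "16-shot" means that only 16 labeled training images per category are used. The Python below is a generic sketch of how such a subset can be drawn. The function name, the file names, and the seed are hypothetical, and this is not the authors' data pipeline.

```python
import random
from collections import defaultdict


def sample_k_shot(samples, k, seed=0):
    """Draw k labeled examples per class from a list of (path, label) pairs."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))

    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) < k:
            raise ValueError(f"class {label} has only {len(items)} samples, need {k}")
        subset.extend(rng.sample(items, k))  # sample without replacement
    return subset


# Toy dataset: 3 classes with 20 images each (file names are placeholders).
samples = [(f"img_{c}_{i}.jpg", c) for c in range(3) for i in range(20)]
subset = sample_k_shot(samples, k=16)
print(len(subset))
```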

Implementation Details

Dataset: Experiments were conducted on popular datasets like ImageNet for few-shot recognition and domain generalization tasks.
Hardware: The model was trained using standard GPU setups, with significant memory savings due to the reduced number of inputs to the text encoder.
Code Availability: The authors have released the code for TGP-T on GitHub.

Conclusion

Compound Text-Guided Prompt Tuning (TGP-T) represents a significant advancement in the field of Vision-Language Models. By reducing GPU memory usage and improving performance, TGP-T addresses key challenges faced by existing prompt tuning frameworks. This approach not only makes VLMs more efficient but also enhances their practical applicability in real-world scenarios.