Visual-RFT: Extending Reinforcement Fine-Tuning to Visual Tasks with LVLMs

The Engineer

11 Mar 2025

Visual-RFT applies reinforcement fine-tuning to LVLMs for complex visual tasks, offering a data-efficient way to improve model performance in specific domains without extensive training datasets.

Researchers have introduced Visual-RFT (Visual Reinforcement Fine-Tuning), an approach that extends reinforcement fine-tuning to Large Vision-Language Models (LVLMs) on visual tasks. The paper, titled "Visual-RFT: Visual Reinforcement Fine-Tuning," is available on arXiv and describes a data-efficient, reward-driven method for improving reasoning and adaptability in domain-specific visual tasks.

What Changed Technically

Reinforcement Fine-Tuning (RFT) has proven effective for large language models like OpenAI's o1, where the model learns from feedback on its answers. This is particularly useful when fine-tuning data is limited. However, the application of RFT in multi-modal domains, such as vision and language, has been under-explored. Visual-RFT addresses this gap: the LVLM generates multiple responses for each input, each response is scored by a verifiable reward function designed for the visual perception task, and those scores drive the policy update.
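The scoring step can be illustrated with the group-relative normalization used by GRPO, the optimizer named under Key Components. This is a minimal sketch, not the authors' code: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of responses sampled for the same input.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumed
    constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one image-and-question pair.
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average get a positive advantage and are reinforced; those below the average get a negative one. Because the baseline is the group mean, no separate value model is needed to estimate it.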

Key Components

LVLMs: The foundation of Visual-RFT is the use of pre-trained LVLMs, which can generate both reasoning tokens and final answers. These models are powerful because they can understand and reason about complex visual and textual inputs.
Policy Optimization: The model's parameters are updated with Group Relative Policy Optimization (GRPO), based on the rewards its responses receive. GRPO scores each response relative to the other responses sampled for the same input, so it does not need a separately trained value (critic) model.
Verifiable Reward Functions: Different tasks require different reward functions. For example:
- Intersection over Union (IoU) for object detection: Measures how well the model's predicted bounding boxes align with ground truth.
- Accuracy for image classification: Directly measures the correctness of the model's predictions.
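The two example rewards above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and the rewards used in the paper may combine further terms beyond these two metrics.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    """
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classification_reward(predicted, target):
    """Return 1.0 for a correct label, 0.0 otherwise.

    Comparison ignores case and surrounding whitespace (an assumption).
    """
    return 1.0 if predicted.strip().lower() == target.strip().lower() else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
detection_reward = iou((0, 0, 2, 2), (1, 1, 3, 3))
label_reward = classification_reward(" Cat", "cat")
```

Both functions are "verifiable" in the sense that they are computed by a fixed rule from the ground truth, with no learned reward model involved.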

Implementation Details

Data Efficiency: Visual-RFT is designed to work with limited data, making it particularly useful in scenarios where labeled data is scarce. This is achieved by leveraging the rich pre-training of LVLMs and fine-tuning them using reinforcement learning.
Generalization: The model demonstrates strong generalization ability across various benchmarks, including fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection.

Experimental Results

The researchers evaluated Visual-RFT on several challenging tasks:

Fine-Grained Image Classification: In one-shot settings with around 100 samples, Visual-RFT improved accuracy by 24.3% over the baseline.
Few-Shot Object Detection: On COCO's two-shot setting, Visual-RFT exceeded the baseline by 21.9 points, and on LVIS by 15.4 points.

These results highlight the model's ability to adapt quickly and effectively to new tasks with minimal data.

Why It Matters

The authors present Visual-RFT as a paradigm shift in fine-tuning LVLMs for visual tasks. By combining reinforcement learning with multi-modal models, it offers a more flexible and data-efficient approach than traditional supervised fine-tuning (SFT). This is particularly valuable in domains where labeled data is expensive or difficult to obtain.

Conclusion

Visual-RFT opens up new possibilities for applying LVLMs to visual tasks. Its data-efficient, reward-driven approach improves reasoning and adaptability and shows that reinforcement fine-tuning carries over to multi-modal models. For practitioners working with limited data or complex visual tasks, Visual-RFT is a promising tool to explore.