PaliGemma 2: A Versatile Family of Vision-Language Models for Enhanced Transfer Learning

The Engineer

6 Dec 2024 · 3 min read

PaliGemma 2 expands on its predecessor with a family of vision-language models that pairs the SigLIP-So400m vision encoder with Gemma 2 language models from 2B to 27B parameters, trained at several image resolutions for transfer learning.

PaliGemma 2 is the latest iteration of the open Vision-Language Model (VLM) family PaliGemma. It combines the SigLIP-So400m vision encoder with the Gemma 2 language models, which range from 2 billion to 27 billion parameters. The team behind PaliGemma 2, led by Andreas Steiner and including researchers such as André Susano Pinto and Michael Tschannen, trained these models at three resolutions (224px, 448px, and 896px) so that they transfer well to new tasks via fine-tuning.

Technical Updates and Why They Matter

The key technical changes in PaliGemma 2 are:

Combination of SigLIP-So400m and Gemma 2: The SigLIP-So400m vision encoder, already used in the original PaliGemma, is now paired with the entire range of Gemma 2 language models. The result is a family of models at several sizes, so users can trade quality against compute.
Multi-Resolution Training: Models are trained at three resolutions: 224px, 448px, and 896px. Higher resolutions help on tasks that depend on small visual details, such as text in images, at a higher computational cost.
Staged Training: Training is divided into multiple stages that progressively equip the models with broader knowledge, with the aim of better performance on diverse transfer tasks.
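One way to picture staged training is as a chain of runs, each starting from the checkpoint the previous run produced. The sketch below is illustrative only: the Stage class, the stage names, and the assumption that stages proceed from the lowest to the highest resolution are hypothetical and are not taken from the article or from the PaliGemma 2 training code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    name: str                 # made-up label for the stage
    resolution: int           # input image side length in pixels
    init_from: Optional[str]  # stage whose checkpoint this one starts from


def build_schedule(resolutions: List[int]) -> List[Stage]:
    """Chain one stage per resolution, lowest resolution first (an assumption)."""
    stages: List[Stage] = []
    previous: Optional[str] = None
    for index, res in enumerate(sorted(resolutions), start=1):
        name = f"stage{index}_{res}px"
        stages.append(Stage(name=name, resolution=res, init_from=previous))
        previous = name
    return stages


for stage in build_schedule([224, 448, 896]):
    print(stage)
```

Each stage after the first records which earlier stage it resumes from, which is the property that lets later stages build on knowledge acquired earlier.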

Architecture Details

Vision Encoder: The SigLIP-So400m vision encoder turns an image into a sequence of embeddings. These are projected into the language model's input space and combined with the embeddings of the text prompt.
Language Models: The Gemma 2 family includes models at 2B, 9B, and 27B parameters. PaliGemma 2 uses them as the language model that consumes the combined image and text sequence and generates the output text.
Resolution Handling: Each training resolution offers a different trade-off:
- 224px: Suitable for smaller images and faster inference.
- 448px: Balances detail and computational efficiency.
- 896px: Captures fine details but requires more computational resources.
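The cost of each resolution can be made concrete by counting image tokens. The sketch below assumes a patch size of 14 pixels (the "/14" variant of SigLIP-So400m, which the article does not state) and assumes that every Gemma 2 size is paired with every resolution; both are assumptions for illustration.

```python
from itertools import product

PATCH_SIZE = 14                  # assumed patch side in pixels (not stated in the article)
LM_SIZES = ("2B", "9B", "27B")   # Gemma 2 language model sizes
RESOLUTIONS = (224, 448, 896)    # training resolutions named in the article


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size:
        raise ValueError(f"{resolution}px is not divisible by patch size {patch_size}")
    side = resolution // patch_size
    return side * side


def variant_grid():
    """Every (language model size, resolution, image token count) combination."""
    return [(lm, res, num_image_tokens(res)) for lm, res in product(LM_SIZES, RESOLUTIONS)]


for lm, res, tokens in variant_grid():
    print(f"Gemma 2 {lm} @ {res}px -> {tokens} image tokens")
```

Under these assumptions, 224px, 448px, and 896px inputs yield 256, 1024, and 4096 image tokens respectively, so moving from the lowest to the highest resolution makes the image part of the sequence 16 times longer.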

Transfer Learning Performance

PaliGemma 2 is designed to excel in transfer learning, which means it can be fine-tuned for a variety of tasks with minimal additional training. The team has expanded the number and breadth of transfer tasks beyond those covered by the original PaliGemma. Notable improvements include:

OCR-Related Tasks: Enhanced performance on tasks like table structure recognition, molecular structure recognition, and music score recognition.
Long Fine-Grained Captioning: Improved accuracy in generating detailed captions for complex images.
Radiography Report Generation: Better results in generating medical reports from radiographic images.

Benchmarks and Results

PaliGemma 2 has achieved state-of-the-art results on several benchmarks, particularly in the areas of OCR-related tasks and fine-grained captioning. The team's experiments have shown that:

Model Size and Resolution Interplay: Larger models generally perform better but require more computational resources. Higher resolutions provide more detail but also increase training and inference cost.
Task-Specific Optimizations: Tuning the learning rate and other hyperparameters for each task can significantly affect performance on that task.
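Per-task tuning of this kind is often done as a simple sweep: fine-tune once per candidate learning rate and keep the one with the best validation score. The sketch below shows only the selection logic. The function names, the candidate values, and the toy scoring function standing in for a real fine-tuning run are all hypothetical and do not come from the article.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def sweep_learning_rates(
    train_and_eval: Callable[[float], float],
    learning_rates: Iterable[float],
) -> Tuple[float, Dict[float, float]]:
    """Run one trial per learning rate; return the best rate and all scores.

    `train_and_eval` must return a higher-is-better validation score.
    """
    scores = {lr: train_and_eval(lr) for lr in learning_rates}
    best_lr = max(scores, key=scores.get)
    return best_lr, scores


def toy_train_and_eval(lr: float) -> float:
    """Stand-in for a real fine-tuning run: the score peaks at lr = 1e-5."""
    return -abs(math.log10(lr) - math.log10(1e-5))


best_lr, scores = sweep_learning_rates(toy_train_and_eval, [1e-6, 3e-6, 1e-5, 3e-5, 1e-4])
print(f"best learning rate: {best_lr}")
```

In practice `toy_train_and_eval` would be replaced by a function that fine-tunes the chosen checkpoint on the task and returns its validation metric. The sweep would typically be repeated for each model size and resolution, since the best setting need not carry over.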

Conclusion

PaliGemma 2 offers a versatile family of vision-language models that can be fine-tuned for a wide range of applications. The combination of the SigLIP-So400m vision encoder with the Gemma 2 language models, along with multi-resolution and staged training, is intended to prepare these models for diverse transfer tasks. Whether you are working on OCR-related tasks, detailed captioning, or radiography report generation, PaliGemma 2 is a strong starting point for fine-tuning.