
PaliGemma 2 expands on its predecessor with a family of vision-language models that pairs the SigLIP-So400m vision encoder with Gemma 2 language models of 2 billion to 27 billion parameters, trained at three resolutions to support transfer learning.
PaliGemma 2 is the latest iteration of the open PaliGemma Vision-Language Model (VLM) family. This version combines the SigLIP-So400m vision encoder with the Gemma 2 language models, which range from 2 billion to 27 billion parameters. The team behind PaliGemma 2, led by Andreas Steiner and including André Susano Pinto and Michael Tschannen, trained these models at three resolutions (224px, 448px, and 896px) so that they can be fine-tuned for a broad set of transfer tasks.
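To make the three resolutions concrete, the sketch below estimates how many image tokens the vision encoder hands to the language model at each one. It assumes the encoder splits a square image into 14x14-pixel patches and emits one token per patch; the patch size and the helper name `image_token_count` are assumptions for illustration and are not stated in this article.

```python
# Rough estimate of image tokens per input resolution.
# Assumption (not from the article): the vision encoder uses
# 14x14-pixel patches and produces one token per patch.
PATCH_SIZE = 14

# The three training resolutions named in the article (square images).
RESOLUTIONS = (224, 448, 896)


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Return the number of patch tokens for a square image of the given side."""
    if resolution <= 0 or patch_size <= 0:
        raise ValueError("resolution and patch_size must be positive")
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


# Token counts keyed by resolution, under the patch-size assumption above.
TOKENS = {res: image_token_count(res) for res in RESOLUTIONS}

if __name__ == "__main__":
    for res, count in TOKENS.items():
        print(f"{res}px: {count} image tokens")
```

Under that assumption, each doubling of the resolution quadruples the number of image tokens, which is the practical trade-off behind choosing a resolution: higher-resolution checkpoints can see finer detail but cost more compute and memory per image.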
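The key technical changes in PaliGemma 2 are the pairing of the SigLIP-So400m vision encoder with the Gemma 2 language models, training at three resolutions (224px, 448px, and 896px), and a staged training schedule.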

PaliGemma 2 is designed for transfer learning, which means the pretrained models are meant to be fine-tuned for a specific task with comparatively little additional training. The team has expanded the number and breadth of transfer tasks beyond those covered by the original PaliGemma, including OCR-related tasks, fine-grained captioning, and medical report generation.
The team reports state-of-the-art results for PaliGemma 2 on several benchmarks, particularly in OCR-related tasks and fine-grained captioning.
PaliGemma 2 offers a versatile family of vision-language models that can be fine-tuned for a wide range of applications. The combination of the SigLIP-So400m vision encoder with the Gemma 2 language models, along with multi-resolution and staged training, is what equips these models for diverse transfer tasks. For OCR-related tasks, detailed captioning, or medical report generation, PaliGemma 2 is a strong starting point for fine-tuning.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
6 December 2024