
Gemma 3 adds multimodal input, longer context, and broader multilingual support to Google DeepMind's lightweight open models, which are designed to run on everyday consumer hardware.
Gemma 3, the latest addition to Google DeepMind's family of lightweight open models, brings significant enhancements to multimodality, context length, and multilingual support. Ranging from 1 billion to 27 billion parameters, these models are designed to run efficiently on consumer-grade hardware, including phones, laptops, and high-end GPUs. This article walks through the main technical changes.
Gemma 3 adds vision understanding by integrating a tailored version of the SigLIP vision encoder (Zhai et al., 2023). The language model treats each image as a sequence of soft tokens produced by SigLIP. To reduce inference cost, the vision embeddings are condensed to a fixed 256 vectors per image. Because the encoder works at a fixed input resolution, Gemma 3 also uses a Pan and Scan (P&S) method, inspired by LLaVA (Liu et al., 2024), which splits non-square or high-resolution images into crops that are resized and encoded separately.
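The condensing step can be pictured as pooling over the encoder's grid of patch embeddings. The sketch below assumes plain average pooling over a 64 x 64 grid, sizes chosen so that 4 x 4 windows yield 256 outputs. The function name `average_pool_patches` and the toy 4-dimensional embeddings are illustrative only and are not taken from the Gemma 3 code.

```python
def average_pool_patches(grid, out_side):
    """Average-pool a square grid of patch embeddings to out_side x out_side.

    grid: list of rows; each row is a list of embedding vectors (lists of floats).
    Returns a flat list of out_side * out_side pooled vectors ("soft tokens").
    """
    in_side = len(grid)
    if in_side % out_side != 0:
        raise ValueError("input side must be a multiple of output side")
    window = in_side // out_side
    dim = len(grid[0][0])
    pooled = []
    for out_row in range(out_side):
        for out_col in range(out_side):
            acc = [0.0] * dim
            # Sum every embedding inside this window x window block.
            for r in range(out_row * window, (out_row + 1) * window):
                for c in range(out_col * window, (out_col + 1) * window):
                    for k, value in enumerate(grid[r][c]):
                        acc[k] += value
            pooled.append([v / (window * window) for v in acc])
    return pooled


# Assumed sizes for illustration: a 64 x 64 patch grid, 4-dimensional embeddings.
patch_grid = [[[float(r), float(c), 1.0, 0.0] for c in range(64)] for r in range(64)]
soft_tokens = average_pool_patches(patch_grid, out_side=16)
print(len(soft_tokens))  # 256
```

Whatever the exact pooling operator, the point is that the language model sees the same number of image tokens regardless of how many patches the encoder produced, so the per-image cost in the context window is fixed.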
One of the most notable improvements in Gemma 3 is support for longer contexts, up to 128K tokens (the report lists 32K for the 1B model). Longer context matters for tasks such as summarizing long documents and long-form generation. The cost is memory: during inference the KV cache grows with every token kept in context, so a naive extension to 128K quickly becomes impractical on consumer hardware.
Gemma 3 addresses this by interleaving local (sliding-window) attention layers with global ones, raising the ratio of local to global layers to 5:1 and keeping the local window short at 1,024 tokens. Only the global layers need to cache the full context, which keeps KV-cache memory manageable without sacrificing performance, according to the report.
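A rough back-of-the-envelope estimate shows why the layer mix matters. The sketch below compares an all-global stack with a 5:1 local-to-global pattern and a 1,024-token local window, the configuration described in the Gemma 3 report. The layer count, KV-head count, head dimension, and 16-bit cache are hypothetical values picked for illustration, not the dimensions of any specific Gemma 3 model.

```python
def kv_cache_bytes(context_len, num_layers, local_per_global, window,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the KV-cache size in bytes for a single sequence.

    Layers repeat as `local_per_global` local layers followed by one global
    layer. Local layers cache at most `window` tokens; global layers cache
    the full context. Use local_per_global=0 for an all-global stack.
    """
    pattern = local_per_global + 1
    num_global = num_layers // pattern
    num_local = num_layers - num_global
    # Factor 2 covers keys and values.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value
    global_bytes = num_global * context_len * per_token
    local_bytes = num_local * min(context_len, window) * per_token
    return global_bytes + local_bytes


# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
config = dict(context_len=128 * 1024, num_layers=48, window=1024,
              num_kv_heads=8, head_dim=128)

all_global = kv_cache_bytes(local_per_global=0, **config)
mixed = kv_cache_bytes(local_per_global=5, **config)

print(f"all global: {all_global / 2**30:.2f} GiB")  # all global: 24.00 GiB
print(f"5:1 mix:    {mixed / 2**30:.2f} GiB")       # 5:1 mix:    4.16 GiB
```

Under these assumptions almost all of the remaining cache belongs to the few global layers; the local layers contribute a small constant that does not grow with context length.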

The pre-training recipe for Gemma 3 is largely similar to that of Gemma 2, with some key architectural modifications. The models use the same tokenizer as Gemini 2.0, and the data mixture was revised to strengthen multilingual capabilities and to add image understanding. Knowledge distillation (Hinton et al., 2015), in which a smaller student model is trained to match the output distribution of a larger teacher, is used during training to improve efficiency and performance.
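For readers unfamiliar with distillation, the sketch below shows the generic soft-target loss in the style of Hinton et al. (2015): the student is penalized by the cross-entropy between the teacher's output distribution and its own. It is a textbook illustration with made-up logits, not the specific distillation procedure used to train Gemma 3.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


# Made-up logits over a 4-token vocabulary for a single position.
teacher = [2.0, 1.0, 0.1, -1.0]
close_student = [1.9, 1.1, 0.0, -0.9]
far_student = [-1.0, 0.1, 1.0, 2.0]

print(distillation_loss(close_student, teacher) < distillation_loss(far_student, teacher))  # True
```

The loss is smallest when the student reproduces the teacher's distribution exactly, at which point it equals the teacher's entropy.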
Post-training focuses on improving mathematics, chat, instruction-following, and multilingual abilities. According to the report, a novel post-training recipe significantly boosts these areas, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks.
The Gemma 3 family comes in four sizes: 1B, 4B, 12B, and 27B parameters, with the 1B model new to the lineup. All of them are released openly to the community, promoting transparency and collaboration in AI research.
Gemma 3 represents a significant step forward in lightweight, multimodal AI models. By addressing key challenges like memory management and extending context length, it offers improved performance and broader capabilities. The combination of vision understanding, multilingual support, and efficient architecture makes Gemma 3 a valuable addition to the open-source AI ecosystem.
Original Sources
↗ https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
13 March 2025