
This article demystifies Google's Gemma 2B transformer LLM with PyTorch, offering clear code and explanations for both programmers and beginners interested in how these models make single-step predictions.
Transformer-based large language models (LLMs) can seem daunting, but they don't have to be. In this article, we'll break down the inner workings of Google's Gemma 2B, a modern transformer LLM, using bare-bones PyTorch code and some intuitive explanations. This guide is tailored for programmers and casual ML enthusiasts who want to understand how these models operate under the hood.
At its core, our task is single-step prediction: given an input string like "I want to move," we use a pre-trained language model (LM) to predict what could come next. This capability is fundamental for applications like chatbots and coding assistants, as it allows us to chain these predictions together to generate longer sequences of text.
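Chaining single-step predictions can be sketched as a short loop. This is a minimal illustration rather than Gemma's actual generation code: it assumes text is represented as a list of integer token IDs, and `generate`, `predict_next`, and `toy_predict_next` are hypothetical names introduced here. The toy predictor stands in for a real language model.

```python
from typing import Callable, List

def generate(
    input_ids: List[int],
    predict_next: Callable[[List[int]], int],
    max_new_tokens: int,
) -> List[int]:
    """Greedy generation: repeatedly append the predicted next token."""
    ids = list(input_ids)  # copy so the caller's list is not modified
    for _ in range(max_new_tokens):
        # One single-step prediction, conditioned on everything so far
        ids.append(predict_next(ids))
    return ids

# Stand-in predictor (not a language model): returns the last ID plus one.
def toy_predict_next(ids: List[int]) -> int:
    return ids[-1] + 1

result = generate([10, 11], toy_predict_next, max_new_tokens=3)
print(result)  # [10, 11, 12, 13, 14]
```

Each pass through the loop is one single-step prediction; swapping the toy predictor for a model-backed function is what turns this into text generation.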
We'll explore this process using Gemma 2B and provide an accompanying notebook that you can follow along with.
The first step in using Gemma 2B is to tokenize the input string. This involves splitting the string into subword tokens and mapping these tokens to numeric IDs.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
input_ids = tokenizer("I want to move").input_ids
# input_ids = [2, 235285, 1938, 577, 3124]
```
Tokenization is defined by a vocabulary, a large set of subword tokens. Each token is represented as an integer in the range [0, 256000). For example, "▁want" is mapped to 1938. The tokenizer first splits the input string into these subwords and then maps them to their corresponding numeric IDs.
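The split-then-map idea can be sketched with a toy tokenizer. Several things here are assumptions: the five-entry vocabulary is a hypothetical subset, the token strings for every ID other than 1938 are inferred from the order of the example IDs above, ID 2 is assumed to be the beginning-of-sequence token, and greedy longest-match is a simple stand-in rather than the algorithm the real tokenizer uses.

```python
from typing import List

# Toy vocabulary: a hypothetical subset using the IDs from the example above.
TOY_VOCAB = {"<bos>": 2, "I": 235285, "▁want": 1938, "▁to": 577, "▁move": 3124}

def toy_tokenize(text: str) -> List[int]:
    """Greedy longest-match tokenization (a stand-in, not the real algorithm)."""
    # Spaces are represented by the "▁" marker, as in "▁want".
    remaining = text.replace(" ", "▁")
    ids = [TOY_VOCAB["<bos>"]]  # assumed beginning-of-sequence token
    while remaining:
        # Try the longest prefix first, shrinking until one is in the vocabulary.
        for end in range(len(remaining), 0, -1):
            piece = remaining[:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                remaining = remaining[end:]
                break
        else:
            raise ValueError(f"no token matches {remaining!r}")
    return ids

toy_ids = toy_tokenize("I want to move")
print(toy_ids)  # [2, 235285, 1938, 577, 3124]
```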
Once we have the token IDs, the next step is to convert them into embeddings. Embeddings are dense vector representations of tokens that capture semantic meaning.
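```python
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("google/gemma-2b")
# Look up one embedding vector per token ID; the lookup expects a tensor, not a list
embeddings = model.get_input_embeddings()(torch.tensor(input_ids))
# embeddings.shape = (5, 2048)
```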
#### Understanding Embeddings
Embeddings are crucial because they transform discrete token IDs into continuous vector spaces where similar tokens have similar representations. This allows the model to capture nuanced relationships between words.
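Mechanically, an embedding layer is a lookup table with one row per token ID. The sketch below uses toy sizes (a 10-token vocabulary and 4-dimensional vectors, both assumptions chosen for readability) and randomly initialized weights, so it shows the lookup itself rather than any learned similarity between tokens.

```python
import torch

torch.manual_seed(0)
# Toy sizes (assumptions): 10-token vocabulary, 4-dimensional vectors.
table = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

ids = torch.tensor([2, 7, 7, 3])
vectors = table(ids)  # one row per input token

# The lookup is plain row indexing into the weight matrix.
same_as_indexing = torch.equal(vectors, table.weight[ids])
# The same ID always maps to the same vector.
repeat_matches = torch.equal(vectors[1], vectors[2])

print(vectors.shape)  # torch.Size([4, 4])
print(same_as_indexing, repeat_matches)  # True True
```

In a trained model the rows are learned, which is what lets related tokens end up with similar vectors.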
### Attention Mechanism
The heart of a transformer is its attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions.
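```python
import math
import torch

# Simplified example of self-attention
def self_attention(Q, K, V):
    # Similarity of every query with every key, scaled by sqrt(key dimension)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(K.size(-1))
    # Normalize each row of scores into weights that sum to 1
    attention_weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    output = torch.matmul(attention_weights, V)
    return output

# Example with embeddings
Q = K = V = embeddings
output = self_attention(Q, K, V)
```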
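The dot product between the query (Q) and key (K) vectors measures the similarity between tokens, and dividing by the square root of the key dimension keeps the scores from growing with vector size. A softmax then normalizes each token's scores into attention weights that sum to 1. Finally, these weights are used to compute a weighted sum of the value (V) vectors, producing the output. The example uses the raw embeddings for Q, K, and V to keep things simple; a real transformer derives each of them from learned projections.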
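One detail the simplified example leaves out is causal masking: a language model that predicts the next token hides future positions from each token. The sketch below adds such a mask to the same three steps. The function name `causal_self_attention` and the toy tensor sizes are assumptions for illustration, and this is not Gemma's actual attention code.

```python
import math
import torch

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where each position sees only itself and earlier ones."""
    seq_len = Q.size(-2)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(K.size(-1))
    # True above the diagonal marks "future" positions to hide.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, V), weights

torch.manual_seed(0)
x = torch.randn(4, 8)  # 4 tokens, 8-dimensional toy vectors
out, weights = causal_self_attention(x, x, x)

print(weights.sum(dim=-1))  # every row sums to 1
print(weights[0])           # the first token can only attend to itself
```

Setting a score to negative infinity before the softmax gives that position a weight of exactly zero, so the remaining weights in the row still sum to 1.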
After the attention mechanism, the model passes the output through a feed-forward network (FFN), which consists of linear layers and activation functions.
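```python
# Simplified example of FFN
# Layers are created once so their (here untrained) weights persist across calls
up = torch.nn.Linear(2048, 2048)
down = torch.nn.Linear(2048, 2048)

def feed_forward(x):
    hidden = up(x)
    activated = torch.nn.functional.relu(hidden)
    output = down(activated)
    return output

# Example: apply the FFN to the attention output from the previous step
ffn_output = feed_forward(output)
```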
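The ReLU version above is a simplification. Gemma's feed-forward block is usually described as a gated variant (GeGLU), where one projection passes through GELU and multiplies a second projection element-wise. The sketch below shows that shape of computation under stated assumptions: the class name `GatedFeedForward`, the layer names, and the toy sizes are hypothetical, and the weights are random rather than Gemma's.

```python
import torch

class GatedFeedForward(torch.nn.Module):
    """GeGLU-style FFN: down(gelu(gate(x)) * up(x)). Sizes are toy values."""

    def __init__(self, d_model: int = 8, d_hidden: int = 32):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.up = torch.nn.Linear(d_model, d_hidden, bias=False)
        self.down = torch.nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The GELU-activated gate scales the "up" projection element-wise
        gated = torch.nn.functional.gelu(self.gate(x)) * self.up(x)
        # Project back down to the model dimension
        return self.down(gated)

torch.manual_seed(0)
ffn = GatedFeedForward()
x = torch.randn(5, 8)  # 5 tokens, 8-dimensional toy vectors
y = ffn(x)
print(y.shape)  # torch.Size([5, 8])
```

The output has the same shape as the input, which is what lets the block be stacked layer after layer.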
Original Sources
↗ https://graphcore-research.github.io/posts/gemma/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
2 May 2024