A Deep Dive into Gemma 2B: Simplifying Transformer LLMs with PyTorch


The Engineer

2 May 2024

This article demystifies Google's Gemma 2B transformer LLM with PyTorch, offering clear code and explanations for both programmers and beginners interested in how these models make single-step predictions.

Transformer-based large language models (LLMs) can seem daunting, but they don't have to be. In this article, we'll break down the inner workings of Google's Gemma 2B, a modern transformer LLM, using bare-bones PyTorch code and some intuitive explanations. This guide is tailored for programmers and casual ML enthusiasts who want to understand how these models operate under the hood.

The Problem: Single-Step Prediction

At its core, our task is single-step prediction: given an input string like "I want to move," we use a pre-trained language model (LM) to predict what could come next. This capability is fundamental for applications like chatbots and coding assistants, as it allows us to chain these predictions together to generate longer sequences of text.
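The chaining idea can be sketched without a real model. In the minimal example below, `TOY_NEXT_TOKEN`, `predict_next`, and `generate` are hypothetical names, and the lookup table stands in for the language model; it only illustrates how each single-step prediction is fed back in as input.

```python
# Hypothetical stand-in for a language model: maps the last token to a next token.
TOY_NEXT_TOKEN = {
    "I": "want",
    "want": "to",
    "to": "move",
    "move": "out",
}

def predict_next(tokens):
    """Single-step prediction: return one next token for the given sequence."""
    return TOY_NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate(tokens, max_new_tokens):
    """Chain single-step predictions, feeding each prediction back as input."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":
            break  # the toy model has nothing more to say
        tokens.append(next_token)
    return tokens

print(generate(["I", "want", "to", "move"], max_new_tokens=3))
# ['I', 'want', 'to', 'move', 'out']
```

A real model would return a probability distribution over the whole vocabulary at each step rather than a single fixed token, but the outer loop would look much the same.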

We'll explore this process using Gemma 2B and provide an accompanying notebook that you can follow along with:

GitHub
Colab

Tokenization

The first step in using Gemma 2B is to tokenize the input string. This involves splitting the string into subword tokens and mapping these tokens to numeric IDs.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
input_ids = tokenizer("I want to move").input_ids
# input_ids = [2, 235285, 1938, 577, 3124]

Understanding Tokenization

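Tokenization is defined by a vocabulary, a large set of subword tokens. Each token is represented as an integer in the range [0, 256000). For example, "▁want" is mapped to 1938, where the "▁" prefix marks a preceding space. The tokenizer first splits the input string into these subwords and then maps them to their corresponding numeric IDs. The leading 2 in input_ids is the beginning-of-sequence token, which the tokenizer prepends automatically.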
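To make the split-then-map idea concrete, here is a minimal sketch using a hypothetical five-entry vocabulary. It uses greedy longest-match splitting, which is a simplification and not the algorithm the Gemma tokenizer actually runs. Only the pairing of "▁want" with 1938 comes from the text; the other token-to-ID pairings are assumed from the order of the IDs in the example output.

```python
# Toy vocabulary: "▁" marks a token that begins with a space.
# Only "▁want" -> 1938 is stated in the text; the other pairings are assumed.
TOY_VOCAB = {"<bos>": 2, "I": 235285, "▁want": 1938, "▁to": 577, "▁move": 3124}

def toy_tokenize(text):
    """Greedy longest-match split into subwords, then map each subword to its ID."""
    text = text.replace(" ", "▁")
    ids = [TOY_VOCAB["<bos>"]]  # start with the beginning-of-sequence token
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first, then shrink
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids

print(toy_tokenize("I want to move"))
# [2, 235285, 1938, 577, 3124]
```

A real vocabulary also contains single characters and byte-level fallbacks, so it can encode any input rather than raising an error.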

Embedding Layer

Once we have the token IDs, the next step is to convert them into embeddings. Embeddings are dense vector representations of tokens that capture semantic meaning.

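import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
# The embedding layer expects a tensor with a batch dimension, not a Python list
input_tensor = torch.tensor([input_ids])
# get_input_embeddings() returns the model's token embedding table
embeddings = model.get_input_embeddings()(input_tensor)
# embeddings.shape = (1, 5, 2048): one 2048-dimensional vector per token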


Understanding Embeddings

Embeddings are crucial because they transform discrete token IDs into continuous vector spaces where similar tokens have similar representations. This allows the model to capture nuanced relationships between words.
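One common way to measure whether two embeddings are "similar" is cosine similarity. The sketch below uses a hypothetical three-dimensional embedding table with hand-picked values, purely for illustration; the vectors a trained model learns are much larger and are not chosen by hand.

```python
import math

# Hypothetical 3-dimensional embedding table (real models use far larger vectors).
TOY_EMBEDDINGS = {
    "move": [0.9, 0.1, 0.3],
    "relocate": [0.8, 0.2, 0.4],
    "banana": [-0.5, 0.9, -0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

similar = cosine_similarity(TOY_EMBEDDINGS["move"], TOY_EMBEDDINGS["relocate"])
different = cosine_similarity(TOY_EMBEDDINGS["move"], TOY_EMBEDDINGS["banana"])
print(f"move vs relocate: {similar:.3f}")
print(f"move vs banana:   {different:.3f}")
```

With these made-up values, the related pair scores close to 1 while the unrelated pair scores below 0.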

Attention Mechanism

The heart of a transformer is its attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions.

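```python
import math

# Simplified example of self-attention (scaled dot-product, no masking)
def self_attention(Q, K, V):
    # Similarity of every query with every key, scaled by sqrt of the key dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(K.size(-1))
    # Softmax over the last dimension makes each row of weights sum to 1
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output

# Example with embeddings: a real layer first applies learned linear
# projections to obtain Q, K, and V; here they are the raw embeddings
Q = K = V = embeddings
output = self_attention(Q, K, V)
```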

Understanding Attention

The dot product between the query (Q) and key (K) vectors measures the similarity between tokens, and dividing by the square root of the key dimension keeps the scores in a stable range. A softmax then turns each row of scores into attention weights that sum to 1. Finally, these weights are used to compute a weighted sum of the value (V) vectors, producing the output.
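The same three steps can be traced by hand on a tiny case. The sketch below uses plain Python and hypothetical two-dimensional vectors for a single query attending over two tokens; `softmax` and `attend` are illustrative helper names, not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention output for one query vector over a list of key/value vectors."""
    scale = math.sqrt(len(query))
    # Step 1: scaled dot product of the query with each key
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Step 2: normalize the scores into attention weights
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy vectors: the query matches the first key exactly
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attend(query, keys, values)
print("weights:", weights, "sum:", sum(weights))
print("output:", output)
```

Because the query points in the same direction as the first key, the first token receives the larger weight, and the output is pulled toward the first value vector.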

Feed-Forward Network

After the attention mechanism, the model passes the output through a feed-forward network (FFN), which consists of linear layers and activation functions.

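# Simplified example of FFN (Gemma's actual FFN is a gated GELU variant
# with a wider hidden layer; ReLU keeps the sketch short)
# The layers are created once so their weights persist across calls
up_proj = torch.nn.Linear(2048, 2048)
down_proj = torch.nn.Linear(2048, 2048)

def feed_forward(x):
    hidden = up_proj(x)
    activated = torch.nn.functional.relu(hidden)
    output = down_proj(activated)
    return output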

# Example