Logit Prisms: A New Tool for Understanding Transformer Decision-Making

Models & Research

The Engineer

20 Jun 2024 · 4 min read

Logit prisms offer a novel approach to dissecting transformer models' decision-making processes, revealing new insights into how each component influences outcomes without altering network activations.

In a recent breakthrough, researchers have extended the logit lens approach to provide deeper insights into how transformer models make decisions. This new method, called "logit prisms," offers a mathematically rigorous and effective way to decompose the final logit output into individual component contributions. By treating certain parts of the network activations as constants, we can leverage the linear properties within the network to break down the logit output and understand how different parts of the model influence the final decision.

Key Components of Logit Prisms

Residual Stream: The core data flow in transformer models that is iteratively refined by a sequence of attention and MLP layers.
Attention Layers: Responsible for capturing dependencies between input tokens.
MLP Layers: Perform nonlinear transformations on the residual stream.
Input Embedding (Embed): Converts input tokens into dense vector representations.
Unembedding Layer (Unembed): Maps the final hidden state back to the output logits.

How Logit Prisms Work

The logit prism approach can be thought of as applying a series of prisms to the transformer network. Each prism splits the logits from the previous layer into separate components, allowing us to see how different parts of the model contribute to the final output. Here’s a breakdown:

Residual Stream Prism: Breaks down the contributions from the residual stream.
Attention Layer Prism: Decomposes the influence of attention heads.
MLP Layer Prism: Analyzes the impact of MLP neurons.

By treating nonlinear activations as constants, we can isolate and calculate the contribution of each component. This method provides a clear view of how information flows through the network and how different parts interact to produce the final output.

Illustrative Examples

Example 1: Factual Retrieval Task

In one example, researchers used the gemma-2b model to perform a simple factual retrieval task-retrieving a capital city from a country name. The findings suggest that the model learns to encode information about country names and their capital cities in a way that allows for easy conversion of country embeddings into capital city unembeddings through a linear projection.

Example 2: Number Addition

The second example explores how the gemma-2b model adds two small numbers (ranging from 1 to 9). This study uncovers interesting insights into the workings of MLP layers. The network predicts output numbers using interpretable templates learned by MLP neurons. When multiple neurons are activated simultaneously, their predictions can interfere with each other, ultimately producing a final prediction that peaks at the correct number.

Methodology

A typical decoder-only transformer network consists of several key components:

Input Embedding (Embed): Converts input tokens into dense vector representations.
Normalization Layers (Norm): Ensure stable and efficient training by normalizing the activations.
Attention Layers: Capture dependencies between input tokens.
MLP Layers: Perform nonlinear transformations on the residual stream.
Unembedding Layer (Unembed): Maps the final hidden state back to the output logits.

The logit prism approach involves:

Treating Nonlinear Activations as Constants: This allows us to leverage linear properties within the network.
Decomposing Logits: Using prisms to break down the final logit output into individual component contributions.
Analyzing Contributions: Calculating how much each part of the model (e.g., attention heads, MLP neurons) contributes to the final decision.

Conclusion

Logit prisms provide a powerful tool for understanding the inner workings of transformer models. By decomposing the final logit output into individual component contributions, researchers can gain deeper insights into how different parts of the model influence the final decision. This approach not only enhances interpretability but also opens new avenues for optimizing and debugging complex neural networks.