Extracting Monosemantic Features from Transformers Using Sparse Autoencoders

The Engineer

30 Oct 2023

Researchers at Anthropic have used sparse autoencoders to isolate individually meaningful features inside a language model, making the internal workings of transformers more transparent.

In a significant step towards making large language models (LLMs) more interpretable, researchers at Anthropic have developed a method to extract monosemantic features using a sparse autoencoder. This approach aims to decompose the complex functions of transformers into simpler, more understandable components.

What Changed and Why It Matters

Individual neurons in neural networks are often polysemantic: they respond to multiple, seemingly unrelated inputs. For example, a single neuron in a vision model might activate for both cat faces and car fronts, and a single neuron in a language model might activate for academic citations, English dialogue, HTTP requests, and Korean text. This polysemanticity makes it hard to understand a network's behavior by analyzing its neurons one at a time.

By using a sparse autoencoder, the researchers extracted features that are far closer to monosemantic: each feature responds to a single, specific type of input. This matters for mechanistic interpretability, which seeks to break neural networks down into understandable components. Monosemantic features make it easier to reason about how the network processes information and to identify and analyze individual functions.

Key Technical Details

Sparse Autoencoder:
- A sparse autoencoder is a neural network trained to reconstruct its input through a hidden layer, minimizing reconstruction error while keeping most hidden activations at or near zero for any given input.
- The sparsity constraint pushes each hidden unit to respond to a narrow set of inputs, which encourages monosemantic representations.
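
One common way to write the objective described in the bullets above is shown below. This is a sketch under assumed notation: the symbols x (the vector being reconstructed), f (the hidden feature activations), W_e, W_d, b_e, b_d (encoder and decoder weights and biases), and λ (the sparsity coefficient) are chosen here for illustration and do not appear in the article.

```latex
\begin{aligned}
f(x) &= \mathrm{ReLU}(W_e x + b_e) \\
\hat{x} &= W_d\, f(x) + b_d \\
\mathcal{L}(x) &= \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1
\end{aligned}
```

The first term of the loss rewards accurate reconstruction, and the L1 term penalizes every nonzero feature activation, so a larger λ trades reconstruction quality for sparsity.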
Feature Extraction Process:
- The researchers applied the sparse autoencoder to the MLP activations of a one-layer transformer model.
- They trained the autoencoder to reconstruct those activations while ensuring that its hidden layer activations were sparse.
- This resulted in a large number of interpretable features, each capturing a specific aspect of the input.
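
The training process described above can be illustrated with a toy example. The sketch below is not the researchers' implementation: it uses plain stochastic gradient descent with hand-written gradients on three made-up 2-dimensional inputs, and every name, size, and hyperparameter (`sgd_step`, `lr`, `l1`, the initial weights) is an assumption made for illustration.

```python
# Toy sparse-autoencoder training in pure Python (illustrative assumptions only).

def forward(x, W_e, b_e, W_d, b_d):
    """Return (features, reconstruction) for one input vector x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_e, b_e)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [sum(w * fj for w, fj in zip(row, f)) + b for row, b in zip(W_d, b_d)]
    return f, x_hat

def loss(x, f, x_hat, l1):
    """Squared reconstruction error plus an L1 penalty (f >= 0, so L1 is a sum)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) + l1 * sum(f)

def sgd_step(x, W_e, b_e, W_d, b_d, lr=0.02, l1=0.01):
    """Update all parameters in place with one gradient step on one example."""
    f, x_hat = forward(x, W_e, b_e, W_d, b_d)
    d_xhat = [2.0 * (h - a) for h, a in zip(x_hat, x)]
    # Gradient at each feature: decoder path plus the L1 term, gated by the
    # ReLU (zero where the feature is inactive). Computed before W_d changes.
    d_pre = [
        (sum(W_d[i][j] * d_xhat[i] for i in range(len(x))) + l1) if f[j] > 0 else 0.0
        for j in range(len(f))
    ]
    for i in range(len(x)):
        for j in range(len(f)):
            W_d[i][j] -= lr * d_xhat[i] * f[j]
        b_d[i] -= lr * d_xhat[i]
    for j in range(len(f)):
        for i in range(len(x)):
            W_e[j][i] -= lr * d_pre[j] * x[i]
        b_e[j] -= lr * d_pre[j]

def mean_loss(data, params, l1=0.01):
    """Average loss over the dataset for the current parameters."""
    total = 0.0
    for x in data:
        f, x_hat = forward(x, *params)
        total += loss(x, f, x_hat, l1)
    return total / len(data)

# 2-dimensional inputs and 4 features: more features than input dimensions.
data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_e = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, -0.05]]
b_e = [0.0, 0.0, 0.0, 0.0]
W_d = [[0.1, 0.0, 0.1, 0.05], [0.0, 0.1, 0.1, -0.05]]
b_d = [0.0, 0.0]
params = (W_e, b_e, W_d, b_d)

before = mean_loss(data, params)
for _ in range(300):
    for x in data:
        sgd_step(x, *params)
after = mean_loss(data, params)
print(f"mean loss before: {before:.4f}, after: {after:.4f}")
```

A real run would train on a large set of model activations with an automatic-differentiation library and a tuned optimizer; the hand-written gradients here only make the two competing terms of the loss visible.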
Interpreting Features:
- The monosemantic features can be visualized and analyzed to understand their specific functions.
- For example, one feature might respond only to academic citations, while another might respond to HTTP requests.
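
A simple way to inspect a feature is to list the inputs on which it activates most strongly. The sketch below assumes the feature activations have already been computed for each token; the function name `top_activating`, the tokens, and the activation values are all made up for illustration.

```python
def top_activating(tokens, activations, feature, k=3):
    """Return the k (token, activation) pairs where `feature` fires most strongly."""
    scored = [(tok, acts[feature]) for tok, acts in zip(tokens, activations)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up data: 5 tokens, 2 features (one activation row per token).
tokens = ["GET", "/index", "et", "al.", "HTTP/1.1"]
activations = [
    [0.9, 0.0],
    [0.4, 0.0],
    [0.0, 0.7],
    [0.0, 0.8],
    [0.6, 0.0],
]

print(top_activating(tokens, activations, feature=0))
print(top_activating(tokens, activations, feature=1, k=2))
```

In this toy data, feature 0 fires only on the HTTP-like tokens and feature 1 only on the citation-like tokens, which is the pattern a monosemantic feature would be expected to show.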

Implementation and Results

Architecture Details:
- The sparse autoencoder consists of an encoder and a decoder.
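- The encoder maps the input into a latent space under a sparsity constraint. Unlike a compressing autoencoder, the latent layer here can be wider than the input, so that each latent unit can specialize in one feature.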
- The decoder reconstructs the input from the latent representation.
- The model is trained to minimize reconstruction loss while keeping the activations in the latent layer sparse.
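
To make the shapes concrete, the sketch below counts the parameters of an encoder and decoder pair for a given input width and number of latent features. The class name `SAEShape` and the sizes used (512 inputs, 4,096 features) are illustrative assumptions, not figures taken from this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SAEShape:
    d_input: int     # width of the activation vector being reconstructed
    n_features: int  # width of the sparse latent layer

    @property
    def expansion(self) -> float:
        """How many times wider the latent layer is than the input."""
        return self.n_features / self.d_input

    @property
    def n_parameters(self) -> int:
        """Weights plus biases for one linear encoder and one linear decoder."""
        encoder = self.n_features * self.d_input + self.n_features
        decoder = self.d_input * self.n_features + self.d_input
        return encoder + decoder

shape = SAEShape(d_input=512, n_features=4096)
print(shape.expansion)     # 8.0
print(shape.n_parameters)  # 4198912
```

An expansion factor above 1 means the latent layer is overcomplete: sparsity, not a narrow bottleneck, is what forces the autoencoder to learn a useful representation.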

Benchmarks:
- The researchers presented evidence that the extracted features are monosemantic by examining and visualizing their responses to a wide range of inputs.
- They found that features tend to respond consistently to one specific type of input, which makes them easier to interpret and analyze than individual neurons.
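
One rough way to quantify "responds consistently to a specific type of input" is to ask what fraction of a feature's strong activations fall on inputs of the hypothesized type. The sketch below is a hypothetical simplification, not the evaluation used by the researchers; the function name `activation_precision`, the threshold, and the data are assumptions.

```python
def activation_precision(activations, labels, threshold=0.5):
    """Fraction of above-threshold activations that land on positively labeled inputs.

    Returns None if the feature never exceeds the threshold.
    """
    fired = [label for act, label in zip(activations, labels) if act > threshold]
    if not fired:
        return None
    return sum(fired) / len(fired)

# Made-up activations of one feature on five inputs, with hand labels saying
# whether each input belongs to the hypothesized category (e.g. a citation).
acts = [0.9, 0.7, 0.0, 0.6, 0.1]
is_citation = [True, True, False, False, False]

print(activation_precision(acts, is_citation))
```

A value near 1 would suggest the feature rarely fires outside its hypothesized category. It says nothing about how often the feature misses inputs that do belong to the category, so a fuller check would also measure coverage.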

Visualizations

The researchers published interactive browsers for exploring the extracted features:

- A/1 Features
- All Features

Implications for Practitioners

This work matters for mechanistic interpretability because it offers a concrete method for extracting monosemantic features, opening new avenues for understanding and debugging large language models. Practitioners may be able to use these techniques to see which features a model relies on when processing an input, and to identify potential issues or biases.

Authors and Affiliations

The research was conducted by Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. They are affiliated with Anthropic.

Conclusion

The ability to extract monosemantic features from transformers using sparse autoencoders is a promising step towards making large language models more interpretable. The work improves our understanding of these complex systems and gives practitioners a starting point for analyzing and improving their own models.