Open-Sourcing Sparse Autoencoders for Llama 3.1 8B and Llama 3.3 70B

Models & Research

The Engineer

13 Jan 2025

The release of open-source sparse autoencoders for Llama 3.1 8B and Llama 3.3 70B enhances model interpretability, offering developers tools to explore and control language models' internal workings through Ember's new API/SDK.

Following the recent announcement of Goodfire Ember, we’re excited to release state-of-the-art, open-source sparse autoencoders (SAEs) for Llama 3.1 8B and Llama 3.3 70B. SAEs are interpreter models that help us understand how language models process and represent information internally. These models power Ember’s interpretability API/SDK and have been crucial in enabling feature discovery and programmatic control over LLM internals.

What’s Being Released

We’re releasing SAEs for:

Llama 3.1 8B: Available on Hugging Face
Llama 3.3 70B: Available on Hugging Face

These models build on our earlier work with Llama-3-8B, where we demonstrated the effectiveness of training an SAE on the LMSYS-Chat-1M dataset [2]. Our SAEs are designed to decompose complex neural activations into interpretable features, making it possible to understand and steer model behavior at a granular level.
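To make "decomposing activations into features" concrete, here is a minimal sketch of the general SAE idea in pure Python. It is not the released implementation: the weights, dimensions, and the helper names `encode`, `decode`, and `steer` are hypothetical, chosen only to show the shape of the computation. An activation vector is encoded into non-negative feature activations, decoded back as a weighted sum of per-feature directions, and steered by adding one feature's direction to the activation.

```python
# Toy sparse autoencoder (SAE). All weights and sizes are made up for
# illustration; real SAEs are trained and have many thousands of features.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def encode(x, w_enc, b_enc):
    """Map an activation vector to non-negative feature activations (ReLU)."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def decode(f, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for f_i, direction in zip(f, w_dec):  # w_dec[i] is feature i's direction
        out = [o + f_i * d for o, d in zip(out, direction)]
    return out

def steer(x, w_dec, feature_index, strength):
    """Nudge an activation along one feature's decoder direction."""
    return [a + strength * d for a, d in zip(x, w_dec[feature_index])]

# Hypothetical 2-dimensional activation space with 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per feature
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]  # one row per feature
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
features = encode(x, W_ENC, B_ENC)               # only feature 0 is active
reconstruction = decode(features, W_DEC, B_DEC)  # recovers x in this toy case
steered = steer(x, W_DEC, feature_index=1, strength=3.0)

print(features, reconstruction, steered)
```

In this toy setup a single feature explains the input, which is the kind of sparse, readable decomposition an SAE is trained to find. In a real pipeline the steered activation would be written back into the model's forward pass at the layer the SAE was trained on.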

Key Features and Implementation Details

Interpretable Features: The SAEs break down the internal representations of LLMs into meaningful components. This allows researchers and developers to identify specific patterns and behaviors the model has learned.
Programmatic Control: By steering these features, you can control the model's output more precisely. For example, you can steer the model to "talk like a pirate" or exhibit "melancholy" across various prompts.
Evaluation Metrics:
- Sparsity: Measures how many SAE features are active for a given input; fewer active features are generally easier to interpret.
- Fidelity: Measures how well the SAE's reconstruction of the activations preserves the original LLM's behavior.
- Feature Quality: Assessed through an LLM-as-a-judge scoring system, testing the model’s ability to exhibit specific "steered" behaviors across diverse prompts.
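The first two metrics can be sketched in a few lines. The exact formulas used for these SAEs are not given here, so the example assumes two common choices: sparsity as the average number of non-zero features per input (often called L0), and fidelity as the fraction of activation variance explained by the reconstruction. The function names and sample data are hypothetical.

```python
def l0_sparsity(feature_batch):
    """Average number of non-zero features per input."""
    counts = [sum(1 for f in feats if f != 0.0) for feats in feature_batch]
    return sum(counts) / len(counts)

def fraction_variance_explained(originals, reconstructions):
    """1 - (squared reconstruction error / variance of the originals)."""
    n = len(originals)
    dim = len(originals[0])
    mean = [sum(x[j] for x in originals) / n for j in range(dim)]
    error = sum(
        (a - b) ** 2
        for x, r in zip(originals, reconstructions)
        for a, b in zip(x, r)
    )
    variance = sum((a - m) ** 2 for x in originals for a, m in zip(x, mean))
    return 1.0 - error / variance

# Made-up feature activations and activations for two inputs.
feature_batch = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.5]]
originals = [[2.0, 0.0], [0.0, 2.0]]
reconstructions = [[2.0, 0.0], [0.0, 1.0]]

sparsity = l0_sparsity(feature_batch)
fidelity = fraction_variance_explained(originals, reconstructions)
print(sparsity, fidelity)
```

Reconstruction-based fidelity is only a proxy. A stricter check, not shown here, is to substitute the reconstruction into the model and compare its next-token loss against the unmodified model.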

Model Implementation

Parameterization Strategy: Our starting point was the Anthropic April update [3], which describes a parameterization and training setup for sparse autoencoders.
Training Data: We used the LMSYS-Chat-1M dataset, a large-scale real-world conversation dataset that captures a wide range of interactions and contexts.
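The training objective is not spelled out above. As an illustration only, the sketch below assumes a formulation common in this line of work: squared reconstruction error plus an L1-style sparsity penalty in which each feature's activation is weighted by the norm of its decoder direction. The function name `sae_loss`, the coefficient value, and the sample numbers are hypothetical, and the released SAEs may have been trained differently.

```python
import math

def sae_loss(x, x_hat, features, w_dec, l1_coefficient):
    """Squared reconstruction error plus a decoder-norm-weighted L1 penalty."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    penalty = sum(
        f_i * math.sqrt(sum(d * d for d in direction))
        for f_i, direction in zip(features, w_dec)  # w_dec[i]: feature i's direction
    )
    return reconstruction + l1_coefficient * penalty

# Made-up single training example with two features.
loss = sae_loss(
    x=[2.0, 0.0],
    x_hat=[1.0, 0.0],
    features=[1.0, 0.0],
    w_dec=[[3.0, 4.0], [1.0, 0.0]],
    l1_coefficient=0.1,
)
print(loss)
```

Weighting the penalty by decoder norms means the optimizer cannot lower the sparsity cost just by shrinking feature activations and scaling up the matching decoder directions.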

Why It Matters

For practitioners, these SAEs offer several benefits:

Enhanced Interpretability: Gain deeper insights into how your models make decisions.
Fine-grained Control: Implement specific behaviors or styles in model outputs.
Research Opportunities: Explore new avenues for understanding and improving LLMs.

How to Use

To get started with these SAEs, you can:

Download the Models from Hugging Face.
Integrate with Ember’s API/SDK for seamless interpretability and steering capabilities.
Explore Documentation for detailed implementation notes and examples.

Conclusion

The release of these open-source SAEs marks a significant step forward in the field of model interpretability and control. By providing researchers and developers with powerful tools to understand and steer LLMs, we aim to foster innovation and responsible AI development.