Enhancing Model Interpretability with Sparse-Autoencoder Latent Attribution

The Engineer

9 Dec 2025 · 3 min read

Researchers are refining techniques to pinpoint and address unexpected behaviors in AI models, using sparse autoencoders to dissect complex latent spaces and offering new ways to check model alignment.

Dec 1, 2025 · Tom Dupre la Tour and Dan Mossing, in collaboration with the Interpretability team

In language models, misalignment (where a model behaves unexpectedly or undesirably) is a significant concern. Recent work from our team has focused on using interpretability tools to understand and debug these issues, in particular by using sparse autoencoders (SAEs) for latent attribution.

Background: Model Diffing with SAEs

Previously, we explored the mechanism of emergent misalignment using a model diffing approach (Wang et al., 2025; Betley et al., 2025). This method involves comparing two models: one before and one after problematic fine-tuning. The process is divided into two steps:

Step 1: Select Latents with Significant Differences
- We identify a subset of SAE latents that show the largest activation differences between the two models.
Step 2: Activation Steering and Grading
- We sample many completions from the model while steering specific activations (Panickssery et al., 2023).
- An LLM judge then grades these completions to measure the causal link between each latent and the unexpected behavior.
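
As a rough illustration of Step 1, the sketch below ranks latents by how much their mean activation changes between the two models. The function name, the data layout (one row of latent activations per token or prompt), and the choice to rank by absolute difference are assumptions made for this example, not details of our pipeline.

```python
from statistics import fmean

def top_latents_by_activation_diff(acts_before, acts_after, k):
    """Return the k latents whose mean activation changed most, plus all diffs.

    acts_before / acts_after: lists of rows, one row per token or prompt,
    each row holding one activation per SAE latent (hypothetical layout).
    """
    n_latents = len(acts_before[0])
    diffs = []
    for j in range(n_latents):
        mean_before = fmean(row[j] for row in acts_before)
        mean_after = fmean(row[j] for row in acts_after)
        diffs.append(mean_after - mean_before)
    # Assumption: rank by absolute change, so decreases count as much as increases.
    ranked = sorted(range(n_latents), key=lambda j: abs(diffs[j]), reverse=True)
    return ranked[:k], diffs

# Toy data with 3 latents; latent 2 changes the most after fine-tuning.
before = [[0.0, 1.0, 0.1], [0.2, 1.0, 0.0]]
after = [[0.1, 1.1, 2.0], [0.1, 0.9, 1.8]]
top, diffs = top_latents_by_activation_diff(before, after, k=1)
print(top)  # indices of the selected latents
```

A large difference only flags a latent as a candidate; Step 2 is what tests whether it actually drives the behavior.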

However, this approach has limitations. The latents with the largest activation differences are not necessarily causally relevant to the behavior of interest. Additionally, model diffing requires a pair of closely related models, one exhibiting the undesired behavior and the other not.

Introducing Latent Attribution

To overcome these limitations, we introduce a new technique: latent attribution. This method helps us identify SAE latents that are likely causally linked to a given behavior by approximating the effect of each activation on the output with a first-order Taylor expansion. Attribution is widely used in studying language models, particularly for circuit discovery (Nanda, 2024; Marks et al., 2024; Syed et al., 2024; Jafari et al., 2025; Arora et al., 2024).
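
One common way to write such a first-order approximation is shown below. The notation is an assumption introduced for this illustration: $m$ is a scalar metric of the model output (for example, the log-probability of a completion), $a_i$ is the activation of latent $i$, and $\tilde{a}_i$ is a baseline activation, often taken to be zero.

```latex
% Taylor expansion of m around the observed activations a, evaluated at the baseline \tilde{a}:
m(\tilde{a}) \approx m(a) + \sum_i (\tilde{a}_i - a_i)\,\frac{\partial m}{\partial a_i}\bigg|_{a}

% Rearranged, the change in the metric splits into one term per latent:
m(a) - m(\tilde{a}) \approx \sum_i \underbrace{(a_i - \tilde{a}_i)\,\frac{\partial m}{\partial a_i}\bigg|_{a}}_{\text{attribution of latent } i}

% With a zero baseline, the attribution reduces to activation times gradient:
\mathrm{attr}_i = a_i \,\frac{\partial m}{\partial a_i}\bigg|_{a}
```

Under this reading, a latent receives a large attribution only if it is both active and has a gradient that moves the metric, which is what separates it from a raw activation difference.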

Implementation of Latent Attribution

Here’s how we implement latent attribution:

Single Model Analysis
- We use a single model in isolation, rather than comparing two models.
Data Collection
- We collect multiple completions for the same prefix, categorizing them as positive (showing the behavior of interest) and negative (not showing the behavior).
Attribution Calculation
- For each completion, we compute the attribution scores for the SAE latents.
- We then calculate the difference in attribution between positive and negative completions.

This approach allows us to focus on latents that are more likely to be causally relevant to the behavior of interest, without the need for a closely related model for comparison.
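
The steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not our implementation: each completion is reduced to a single vector of latent activations, the metric is a made-up linear function, the baseline is zero, and a finite-difference gradient stands in for backpropagation through a real model. All function names are hypothetical.

```python
from statistics import fmean

def grad_fd(metric, acts, eps=1e-5):
    """Central finite-difference gradient; a stand-in for backpropagation."""
    grad = []
    for i in range(len(acts)):
        up = acts[:i] + [acts[i] + eps] + acts[i + 1:]
        down = acts[:i] + [acts[i] - eps] + acts[i + 1:]
        grad.append((metric(up) - metric(down)) / (2 * eps))
    return grad

def attribution(metric, acts):
    """First-order attribution per latent with a zero baseline: a_i * dm/da_i."""
    return [a * g for a, g in zip(acts, grad_fd(metric, acts))]

def attribution_difference(metric, positive, negative):
    """Mean attribution over positive completions minus mean over negative ones."""
    pos = [attribution(metric, acts) for acts in positive]
    neg = [attribution(metric, acts) for acts in negative]
    n_latents = len(positive[0])
    return [
        fmean(row[i] for row in pos) - fmean(row[i] for row in neg)
        for i in range(n_latents)
    ]

# Hypothetical toy metric over 3 latents: latent 0 pushes the metric up,
# latent 1 pushes it down, latent 2 has no effect.
def toy_metric(acts):
    return 2.0 * acts[0] - 1.0 * acts[1] + 0.0 * acts[2]

positive = [[1.0, 0.1, 0.5], [0.8, 0.0, 0.4]]  # activations, behavior present
negative = [[0.1, 0.2, 0.5], [0.0, 0.1, 0.6]]  # activations, behavior absent
diff = attribution_difference(toy_metric, positive, negative)
print(diff)  # per-latent difference in mean attribution
```

In this toy setup latent 2 is active in every completion but has zero gradient, so its attribution difference is zero even though its activations are not.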

Case Study: Debugging Misaligned Completions

To illustrate how latent attribution can be applied, consider a scenario where a language model produces biased completions. Applying our method involves three steps:

Step 1: Collect Data
- We gather positive and negative completions for a given prefix.
Step 2: Compute Attribution
- We compute the attribution scores for each SAE latent across these completions.
Step 3: Identify Causal Latents
- We identify latents with significant differences in attribution between positive and negative completions.
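
What counts as a "significant" difference is left open above. One possible way to make it concrete, shown here purely as an assumption, is to rank latents by Welch's t statistic computed on their attribution scores across positive and negative completions. The scores below are made up, and the function names are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import fmean, variance

def welch_t(pos, neg):
    """Welch's t statistic for two samples with possibly unequal variances.

    Each sample needs at least two values for the variance to be defined.
    """
    mean_diff = fmean(pos) - fmean(neg)
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    if se == 0:
        # Degenerate case: no spread in either sample.
        return 0.0 if mean_diff == 0 else copysign(inf, mean_diff)
    return mean_diff / se

def rank_latents(pos_attr, neg_attr, k):
    """Rank latents by |t| of their attribution scores, positive vs. negative.

    pos_attr / neg_attr: one row of per-latent attribution scores per completion.
    """
    n_latents = len(pos_attr[0])
    scores = [
        welch_t([row[i] for row in pos_attr], [row[i] for row in neg_attr])
        for i in range(n_latents)
    ]
    order = sorted(range(n_latents), key=lambda i: abs(scores[i]), reverse=True)
    return order[:k], scores

# Made-up attribution scores: 3 latents, 3 positive and 3 negative completions.
pos_attr = [[1.9, 0.1, 0.3], [2.1, -0.1, 0.5], [2.0, 0.0, 0.1]]
neg_attr = [[0.1, 0.0, 0.4], [0.3, 0.1, 0.2], [0.2, -0.1, 0.3]]
top, scores = rank_latents(pos_attr, neg_attr, k=1)
print(top)  # indices of the highest-ranked latents
```

A standardized score like this discounts latents whose attribution varies widely from one completion to the next. A plain difference of means would be a simpler alternative.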

The identified latents can then be analyzed further to understand and potentially correct the misaligned behavior. For example, if a specific latent is found to strongly influence biased completions, we can explore techniques to modify or steer that latent during inference to mitigate the bias.
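
A minimal sketch of what steering a latent could look like, in the spirit of activation steering (Panickssery et al., 2023): a multiple of the latent's SAE decoder direction is added to an activation vector. The vectors, the steering strength, and the function names are hypothetical. In a real model this would be applied inside the forward pass at a chosen layer and token position.

```python
def steer_activation(resid, decoder_dir, alpha):
    """Add alpha times a latent's decoder direction to an activation vector.

    resid: activation vector at one token position.
    decoder_dir: SAE decoder vector for the chosen latent (toy values here).
    alpha: steering strength; negative values push away from the latent.
    """
    if len(resid) != len(decoder_dir):
        raise ValueError("resid and decoder_dir must have the same length")
    return [r + alpha * d for r, d in zip(resid, decoder_dir)]

def projection(vec, direction):
    """Scalar projection of vec onto a direction (assumed unit-norm)."""
    return sum(v * d for v, d in zip(vec, direction))

resid = [0.5, 2.0, -1.0]
decoder_dir = [0.0, 1.0, 0.0]  # toy unit-norm direction
# Choosing alpha as minus the current projection removes the component along the latent.
alpha = -projection(resid, decoder_dir)
steered = steer_activation(resid, decoder_dir, alpha)
print(projection(steered, decoder_dir))  # component along the latent after steering
```

Whether such an intervention reduces the behavior has to be checked by sampling and grading completions; this sketch covers only the vector arithmetic.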

Conclusion

Latent attribution using sparse autoencoders provides a useful tool for understanding and debugging misaligned behaviors in language models. By focusing on latents that are more likely to be causally relevant, and by requiring only a single model, this method offers a more targeted approach than model diffing. As we continue to refine these tools, we move closer to building more reliable and interpretable AI systems.