
Researchers are refining techniques to pinpoint and address unexpected behaviors in AI models, using sparse autoencoders to dissect complex latent spaces and offering new ways to debug model misalignment.
Dec 1, 2025 · Tom Dupre la Tour and Dan Mossing, in collaboration with the Interpretability team
In language models, misalignment (a model behaving in unexpected or undesirable ways) is a significant concern. Recent work from our team has focused on using interpretability tools to understand and debug these issues, particularly by leveraging sparse autoencoders (SAEs) for latent attribution.
Previously, we explored the mechanism of emergent misalignment using a model diffing approach (Wang et al., 2025; Betley et al., 2025). This method involves comparing two models: one before and one after problematic fine-tuning. The process is divided into two steps:
Step 1: Select Latents with Significant Differences
Step 2: Activation Steering and Grading
However, this approach has limitations. The latents with the largest activation differences may not necessarily be causally relevant to the behavior of interest. Additionally, the model diffing method is limited to comparing closely related models, one exhibiting the undesired behavior and the other not.
To overcome these limitations, we introduce a new technique: latent attribution. This method helps us identify SAE latents that are likely causally linked to a given behavior by approximating the causal relationship between activations and outputs using a first-order Taylor expansion. Attribution is widely used in studying language models, particularly for circuit discovery (Nanda, 2024; Marks et al., 2024; Syed et al., 2024; Jafari et al., 2025; Arora et al., 2024).
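One way to write the first-order approximation down, using notation that is ours rather than the source's: let m be a scalar metric of the behavior, a the vector of SAE latent activations on a given input, and b a baseline activation (for example zero). Expanding m around a and changing only latent i from a_i to b_i gives:

```latex
% Assumed notation: m = behavior metric, a = SAE latent activations,
% b = baseline activations, a_{\setminus i} = a with a_i replaced by b_i.
m(a) - m\bigl(a_{\setminus i}\bigr)
  \;\approx\;
  \frac{\partial m}{\partial a_i}(a)\,\bigl(a_i - b_i\bigr)
```

The right-hand side is the attribution score for latent i. Because it is only a first-order estimate, a large score marks a latent as a candidate for causal relevance and does not prove it.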

Here’s how we implement latent attribution:
Single Model Analysis
Data Collection
Attribution Calculation
This approach allows us to focus on latents that are more likely to be causally relevant to the behavior of interest, without the need for a closely related model for comparison.
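The attribution calculation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_metric`, `decoder`, and `readout` are hypothetical stand-ins for the model downstream of the SAE, the baseline is assumed to be zero, and the gradient is taken by finite differences so the example needs only NumPy (a real setup would use automatic differentiation).

```python
import numpy as np

def numerical_gradient(metric, acts, eps=1e-5):
    """Central-difference gradient of a scalar metric w.r.t. latent activations."""
    grad = np.zeros_like(acts)
    for i in range(acts.size):
        step = np.zeros_like(acts)
        step[i] = eps
        grad[i] = (metric(acts + step) - metric(acts - step)) / (2 * eps)
    return grad

def latent_attribution(metric, acts, baseline=None):
    """First-order estimate of each latent's effect on the metric:
    attribution_i = d(metric)/d(acts_i) * (acts_i - baseline_i)."""
    if baseline is None:
        baseline = np.zeros_like(acts)  # assumption: "latent off" means zero
    return numerical_gradient(metric, acts) * (acts - baseline)

# Hypothetical toy stand-in for everything downstream of the SAE latents.
rng = np.random.default_rng(0)
decoder = rng.normal(size=(8, 4))   # 8 latents -> 4-dim activation
readout = rng.normal(size=4)        # maps activation to a behavior score

def toy_metric(acts):
    return float(np.tanh(acts @ decoder) @ readout)

acts = np.abs(rng.normal(size=8))   # SAE activations are non-negative
attr = latent_attribution(toy_metric, acts)

# Rank latents by absolute attribution; these are candidates, not proof.
top_latents = np.argsort(-np.abs(attr))[:3]
print("top latents:", top_latents)
```

In practice the scores would be aggregated over many prompts and completions before ranking, and the top candidates checked by direct intervention.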
To illustrate the effectiveness of latent attribution, consider a scenario where a language model exhibits biased completions. By applying our method:
Step 1: Collect Data
Step 2: Compute Attribution
Step 3: Identify Causal Latents
These identified latents can then be further analyzed to understand and potentially correct the misaligned behavior. For example, if a specific latent is found to strongly influence biased completions, we can explore techniques to modify or steer that latent during inference to mitigate the bias.
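As a rough sketch of what steering a latent at inference time could look like, the snippet below shifts or removes a latent's decoder direction in a batch of activations. Everything here is an assumption for illustration: `resid`, `bias_dir`, and the helper names are hypothetical, and projecting out the decoder direction is a common simplification rather than the same operation as zeroing the SAE latent itself.

```python
import numpy as np

def steer_latent(resid, decoder_dir, strength):
    """Add `strength` units of a latent's (normalized) decoder direction
    to every token's activation. Negative strength suppresses it."""
    unit = decoder_dir / np.linalg.norm(decoder_dir)
    return resid + strength * unit

def ablate_latent(resid, decoder_dir):
    """Remove the component of each activation along the decoder direction."""
    unit = decoder_dir / np.linalg.norm(decoder_dir)
    return resid - np.outer(resid @ unit, unit)

# Hypothetical data: 5 token positions, 16-dim activations.
rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 16))
bias_dir = rng.normal(size=16)      # decoder direction of the flagged latent

steered = steer_latent(resid, bias_dir, strength=-4.0)
ablated = ablate_latent(resid, bias_dir)
```

Whether either intervention actually reduces the biased completions would still need to be measured by re-running the model and grading its outputs.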
Latent attribution with sparse autoencoders gives us a practical tool for understanding and debugging misaligned behaviors in language models. Because it ranks latents by their estimated causal effect on a behavior rather than by activation difference alone, it is more targeted than model diffing and does not require a second, closely related model. As we continue to refine these tools, we move closer to building more reliable and interpretable AI systems.
Original Sources
↗ https://alignment.openai.com/sae-latent-attribution/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.