
While sliding window attention should in theory let deep models reach distant context through stacked layers, in practice it hits a bottleneck that limits the model's ability to capture long-range dependencies in long texts.
Modern language models like GPT-OSS, Mistral, and Gemma 3 have made significant strides in handling long texts efficiently by leveraging a technique called sliding window attention (SWA). Instead of allowing each token to attend to all previous tokens, SWA restricts the attention to only the last ( W ) words, effectively creating a sliding window over the input sequence. This approach dramatically reduces computational complexity while maintaining local context.
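To make the windowing concrete, here is a minimal sketch of the attention mask it implies. The function name `sliding_window_mask` is illustrative, and the sketch assumes the window of size ( W ) counts the current position itself; real implementations differ in that detail.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: the window includes the current position, so position i sees
    positions i - window + 1 through i (causal, never looking ahead).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed (query, key) pair, "." a masked one
    print("".join("x" if allowed else "." for allowed in row))

# Each row has at most `window` allowed entries, so the number of scored pairs
# grows like seq_len * window instead of seq_len squared.
print(sum(sum(row) for row in mask))
```

The printed band of `x` marks along the diagonal is the "sliding window": every row is the previous one shifted right by one position.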
On paper, stacking ( L ) layers of SWA should allow the model to see ( L \times W ) words back. Each layer can theoretically hop backward by ( W ) positions, creating a receptive field that grows linearly with depth. For instance, a 100-layer model with a window size of 1,000 words should be able to access up to 100,000 words of context.
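The linear growth of the receptive field can be checked by brute force on a small case. The sketch below is a toy, not any model's actual code: it only tracks which positions could influence the last one, and it assumes the window counts the current position, so each layer adds ( W - 1 ) positions rather than ( W ). For large windows the two are nearly the same.

```python
def reachable_positions(num_layers: int, window: int, seq_len: int) -> set[int]:
    """Positions whose input can influence the last position after num_layers layers.

    Assumption: each layer lets position i read positions i - window + 1 .. i.
    """
    target = seq_len - 1
    reach = {target}  # before any layer, a position only depends on itself
    for _ in range(num_layers):
        # Every reachable position pulls in everything inside its own window.
        reach = {
            j
            for i in reach
            for j in range(max(0, i - window + 1), i + 1)
        }
    return reach


reach = reachable_positions(num_layers=4, window=3, seq_len=50)
max_distance = (50 - 1) - min(reach)
print(max_distance)        # 8, i.e. num_layers * (window - 1)

# The same arithmetic at the scale used in the text:
print(100 * (1000 - 1))    # 99900, close to the quoted 100,000
```

This only establishes that a path exists. It says nothing about how much signal survives along that path, which is where the theory and the practice diverge.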
However, in practice, SWA models struggle to use information from more than about 1,500 words ago, far less than the theoretical 100,000. This gap between theory and practice is critical for understanding the limitations of these models.
The StreamingLLM paper provides empirical evidence of this discrepancy. The task formulation involves answering questions about specific numbers that appear several lines before in the context. Performance results show that the accuracy of StreamingLLM (a pure SWA model) drops drastically once the queried information is no longer in the sliding window cache, despite the theoretical capability to propagate this information through the network layers.
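To explain this gap, we need to consider two key effects that limit how far back these models can effectively access information: information dilution, where the signal from distant tokens weakens as it passes through successive layers, and residual connections, which bias each layer's output toward recent information.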

Let's dive deeper into these effects with a mathematical model to understand why the effective memory of SWA models has fundamental limitations.
In each layer, the attention output is computed as: [ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V ] where the softmax term gives the attention weights and ( d_k ) is the key dimension.
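As a rough illustration of that formula under a sliding window, the sketch below computes the output for a single query over only the last `window` keys, in plain Python. The function name and the toy inputs are made up for the example, and it assumes the key list ends at the query's own position and that `window` is at least 1.

```python
import math


def windowed_attention(query, keys, values, window):
    """Attention output for the newest position, restricted to the last `window` keys.

    query: list of d_k floats; keys: list of d_k-float lists; values: list of float lists.
    Returns (weights, output), where weights covers only the keys inside the window.
    """
    d_k = len(query)
    recent_keys = keys[-window:]
    recent_values = values[-window:]

    # Scaled dot-product scores: q . k / sqrt(d_k)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in recent_keys
    ]

    # Numerically stable softmax over the window
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors
    dim = len(recent_values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, recent_values))
        for i in range(dim)
    ]
    return weights, output


# Toy input: five positions with identical keys, values 0.0 .. 4.0
keys = [[1.0, 0.0]] * 5
values = [[float(i)] for i in range(5)]
weights, output = windowed_attention([1.0, 0.0], keys, values, window=3)
print(weights)  # equal weights over the three positions in the window
print(output)   # average of the values at positions 2, 3 and 4
```

With identical keys the softmax spreads its weight evenly, so each of the ( W ) positions in the window gets 1/W of the attention. Positions 0 and 1 fall outside the window and contribute nothing directly.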
As information propagates through multiple layers, the attention weights for distant tokens become increasingly diluted. This can be modeled using a decay factor ( \alpha ) with ( 0 < \alpha < 1 ), where the signal strength of information from ( t ) steps back is scaled by ( \alpha^t ). The smaller ( \alpha ) is, the faster the signal diminishes, making it difficult to capture long-range dependencies.
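To get a feel for how quickly ( \alpha^t ) shrinks, the sketch below finds the first distance at which the signal drops under a chosen threshold. The values of `alpha` and `threshold` are illustrative assumptions, not measurements from any model.

```python
def signal_strength(alpha: float, t: int) -> float:
    """Relative strength of information from t steps back under the decay model."""
    return alpha ** t


def effective_horizon(alpha: float, threshold: float) -> int:
    """Smallest t for which alpha**t falls below threshold (requires 0 < alpha < 1)."""
    t = 0
    strength = 1.0
    while strength >= threshold:
        strength *= alpha
        t += 1
    return t


for alpha in (0.5, 0.9, 0.99):
    horizon = effective_horizon(alpha, threshold=0.01)
    print(alpha, horizon, signal_strength(alpha, horizon))
```

Even with ( \alpha = 0.99 ), a loss of only 1% per step, the signal falls below 1% of its original strength after a few hundred steps. Under this model the usable horizon depends on ( \alpha ), not on the theoretical ( L \times W ) reach.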
Residual connections add the input of a layer to its output: [ \text{Output} = \text{Layer}(x) + x ]
This helps in training deep networks by allowing gradients to flow more easily. However, it also means that recent information is given more weight, because each layer's output carries its own input forward unchanged alongside whatever the layer newly computes from the window. This can be seen as an exponential barrier: the influence of distant information decreases exponentially with depth.
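A toy linear model gives a sense of how the residual path and the window interact. Everything in this sketch is an assumption made for illustration: attention is uniform over the window, and each layer mixes its input with the windowed average using a fixed `residual_weight`, a normalised stand-in for the residual sum above. Real models learn their attention weights and are not linear.

```python
def influence_profile(num_layers: int, window: int, residual_weight: float) -> list[float]:
    """Share of the final position's output that traces back to each distance.

    Toy assumptions: each layer computes
        h_i <- residual_weight * h_i + (1 - residual_weight) * mean(h over i's window)
    with uniform attention and a window that includes position i itself.
    influence[d] is the share contributed by the input token d steps back.
    """
    max_dist = num_layers * (window - 1)
    influence = [0.0] * (max_dist + 1)
    influence[0] = 1.0  # before any layer, the output is just the token itself
    spread = (1.0 - residual_weight) / window
    for _ in range(num_layers):
        # Residual path keeps a fixed share in place...
        updated = [residual_weight * w for w in influence]
        # ...and the attention path pushes the rest up to window - 1 steps further back.
        for d, w in enumerate(influence):
            if w == 0.0:
                continue
            for step in range(window):
                updated[d + step] += spread * w
        influence = updated
    return influence


profile = influence_profile(num_layers=4, window=3, residual_weight=0.5)
print(profile[0])    # share kept by the current token
print(profile[-1])   # share of the token at the theoretical edge (distance 8)
print(sum(profile))  # the shares sum to 1 (up to rounding)
```

In this toy, the token at the theoretical edge contributes exactly ((1 - residual_weight) / window) raised to the power `num_layers`, so its share shrinks exponentially as layers are added, while the nearest positions keep most of the weight.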
While sliding window attention (SWA) theoretically allows deep models to access vast amounts of context, practical limitations such as information dilution and residual connections create significant barriers. These effects explain why SWA models struggle to use information from more than about 1,500 words ago, despite the potential for much deeper memory.
Understanding these limitations is crucial for developing more effective long-context models in the future. By addressing these issues, researchers can push the boundaries of what language models are capable of handling.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
26 August 2025