
BERT once redefined language understanding. Explore how its encoder-only structure evolved, and why newer models such as T5 pair denoising objectives with encoder-decoder or PrefixLM designs to tackle today’s complex text challenges.
In the rapidly evolving landscape of large language models (LLMs), it’s easy to get lost among the many architectures and paradigms. If you were deep into NLP a few years ago, you might be wondering where all the encoder models went. After all, BERT worked so well, so why not just scale it up? Let's dive into what happened to encoder-only models like BERT and how they relate to the current state of LLMs.
To set the stage, let’s quickly review the three main types of model architectures that have dominated the NLP landscape over the past few years: encoder-only models such as BERT, encoder-decoder models such as T5, and decoder-only models.
One common misconception is that encoder-decoder models are not autoregressive. In reality, the decoder in these models is fundamentally a causal decoder, just like in decoder-only models. The key difference is that some of the input context can be offloaded to the encoder, which then passes its representations to the decoder via cross-attention. Because the encoder is not bound by a causal mask, it can read that context in both directions before generation begins.
A variant of the encoder-decoder architecture is the Prefix Language Model (PrefixLM). These models operate similarly to encoder-decoders but without the cross-attention mechanism. Instead, a single stack with shared weights plays both the encoder and decoder roles, so there is no encoder bottleneck. PrefixLMs are sometimes referred to as non-causal decoders because the prefix portion of the sequence is attended to bidirectionally, while the tokens that follow it are still generated autoregressively.
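The difference between a causal decoder and a PrefixLM comes down to the attention mask. The following is a minimal sketch in plain Python, not taken from any particular model's code; the function names `causal_mask`, `prefix_lm_mask`, and `render`, and the parameter `prefix_len`, are hypothetical names chosen for illustration.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j.

    In a causal decoder, each position sees itself and earlier positions only.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def prefix_lm_mask(seq_len, prefix_len):
    """PrefixLM-style mask (illustrative).

    Every position may attend to the whole prefix, so attention inside the
    prefix is bidirectional. Positions after the prefix remain causal.
    """
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw a mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(render(causal_mask(5)))
    print("prefix LM (prefix_len=3):")
    print(render(prefix_lm_mask(5, 3)))
```

With `prefix_len=0` the PrefixLM mask reduces to the ordinary causal mask, which is one way to see why the two designs can share a single set of weights.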

BERT and other encoder-only models use a denoising objective, where the model is trained to predict masked tokens within an input sequence. This approach helps the model learn rich contextual representations. However, these models typically require an additional "task head" for each downstream task, which makes it harder for one model to serve many tasks and leaves them poorly suited to open-ended generation.
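To make the in-place masking idea concrete, here is a simplified sketch of how training pairs could be built. It is an assumption-laden illustration: the helper `mask_tokens` and its parameters are hypothetical, and BERT's full recipe (which sometimes substitutes a random token or leaves the selected token unchanged) is deliberately left out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] (simplified sketch).

    Returns (corrupted, labels). labels[i] holds the original token at every
    masked position and None elsewhere, so a loss would only be computed on
    the masked positions.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    corrupted, labels = mask_tokens(sentence, mask_prob=0.3)
    print(corrupted)
    print(labels)
```

Note that the output has exactly the same length as the input: the prediction happens in place, which is why a separate task head is needed to turn these representations into anything other than per-token guesses.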
T5, on the other hand, recasts the denoising objective in a sequence-to-sequence format that fits its encoder-decoder architecture. Instead of predicting masked tokens in place, T5 uses the encoder to build a representation of the corrupted input and the decoder to generate the missing parts as an output sequence. This lets T5 combine the bidirectional context of encoder-only models with the generative ability of decoder-only models.
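A sequence-to-sequence denoising example can be sketched as follows. This is an illustrative sketch, not T5's actual preprocessing code: the function `span_corrupt` is hypothetical, the spans are chosen by hand rather than sampled, and the `<extra_id_N>` sentinel naming is assumed here for readability.

```python
def span_corrupt(tokens, spans):
    """Build an (inputs, targets) pair in the style of span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Each dropped span is replaced by one sentinel in the inputs; the targets
    list each sentinel followed by the tokens it replaced.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # decoder must produce the span
        cursor = end
    inputs.extend(tokens[cursor:])           # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel ends the target
    return inputs, targets


if __name__ == "__main__":
    sentence = "Thank you for inviting me to your party last week .".split()
    inputs, targets = span_corrupt(sentence, [(2, 4), (8, 9)])
    print(" ".join(inputs))
    print(" ".join(targets))
```

Unlike in-place masking, the target here is a separate and usually much shorter sequence, so the same decoder that fills in spans can also produce arbitrary text for downstream tasks without a dedicated task head.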
Despite BERT's success, there has been a significant shift towards decoder-only and encoder-decoder models in recent years. Here are a few reasons why:
The evolution from encoder-only models like BERT to more complex architectures like T5 and PrefixLMs reflects the ongoing advancements in NLP. While BERT laid a strong foundation, newer models have built upon this by addressing scalability, generative capabilities, and task flexibility. As we continue to push the boundaries of what LLMs can do, it's essential to understand these architectural shifts and their implications for practitioners.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
17 July 2024