Understanding the Evolution of BERT and T5: Encoders, PrefixLMs, and Denoising Objectives

The Engineer

17 Jul 2024 · 4 min read

BERT once redefined language understanding. This post looks at how its encoder-only design evolved, why newer models like T5 moved to sequence-to-sequence denoising objectives, and where PrefixLMs fit in.

In the rapidly evolving landscape of large language models (LLMs), it’s easy to get lost among the many architectures and paradigms. If you were deep into NLP a few years ago, you might be wondering where all the encoder models went. After all, BERT worked so well, so why not just scale it up? Let's dive into what happened to encoder-only models like BERT and how they relate to the current state of LLMs.

A Quick Primer on Model Architectures

To set the stage, let’s quickly review the three main types of model architectures that have dominated the NLP landscape over the past few years:

Encoder-Only Models (e.g., BERT): These models focus on understanding context within a given input sequence. They are typically used for tasks like text classification, named entity recognition, and question answering.
Encoder-Decoder Models (e.g., T5): These models use an encoder to process the input and a decoder to generate output. They are commonly used for translation, summarization, and other generative tasks.
Decoder-Only Models (e.g., GPT series): These models generate text directly from a given prompt without an explicit encoder. They excel in tasks like text completion and generation.

The Role of Encoder-Decoder Models

One common misconception is that encoder-decoder models are not autoregressive. In reality, the decoder in these models is fundamentally a causal decoder, just like in decoder-only models. The key difference is that part of the input context is offloaded to the encoder, which reads it bidirectionally and sends its representations to the decoder via cross-attention. Every generated token can therefore condition on a fully contextualized view of the input, in addition to the tokens generated so far.
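To make the attention pattern concrete, here is a minimal sketch in plain Python. The function names (`causal_mask`, `cross_attention_mask`, `render`) are illustrative and not taken from any library; the masks only show which positions are visible, not how a real model computes attention.

```python
def causal_mask(n):
    """mask[i][j] is True when decoder position i may attend to decoder position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cross_attention_mask(n_target, n_source):
    """Every decoder position may attend to every encoder output position."""
    return [[True] * n_source for _ in range(n_target)]


def render(mask):
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


decoder_self = causal_mask(4)               # decoder self-attention: lower triangular
decoder_cross = cross_attention_mask(4, 3)  # decoder-to-encoder: fully visible

print(render(decoder_self))
# x...
# xx..
# xxx.
# xxxx
print(render(decoder_cross))
# xxx
# xxx
# xxx
# xxx
```

The decoder's self-attention is the same lower-triangular pattern a decoder-only model uses; the only addition is the fully visible block over the encoder outputs.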

Introducing Prefix Language Models (PrefixLMs)

A close relative of the encoder-decoder architecture is the Prefix Language Model (PrefixLM). These models operate similarly to encoder-decoders but without the cross-attention mechanism. Instead, a single set of weights is shared between the encoder and decoder roles, and there's no encoder bottleneck. PrefixLMs are sometimes referred to as non-causal decoders because the prefix portion of the sequence is attended to bidirectionally, while the tokens that follow it are still generated autoregressively.
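The PrefixLM attention pattern can be sketched as a single mask over one sequence. This is an illustrative plain-Python sketch; `prefix_lm_mask` and `render` are hypothetical helper names, and `prefix_len` is assumed to mark where the input ends and the generated continuation begins.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    Prefix positions see the whole prefix (bidirectional). Positions after
    the prefix see the whole prefix plus earlier positions (causal).
    """
    return [[j < prefix_len or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def render(mask):
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


mask = prefix_lm_mask(seq_len=5, prefix_len=3)
print(render(mask))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

Setting `prefix_len=0` recovers an ordinary causal mask, which is one way to see a PrefixLM as a decoder-only model with a relaxed mask over its input.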

Denoising Objectives in BERT and T5

BERT and other encoder-only models use a denoising objective, where the model is trained to predict masked tokens within an input sequence. This approach helps the model learn rich contextual representations. However, these models typically require additional "task heads" for specific downstream tasks, which can be limiting.
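A simplified sketch of this masked-token objective is shown below. The helper `mask_tokens` is hypothetical, and the 15% default mirrors the masking rate reported for BERT. The real recipe also sometimes swaps in a random token or leaves the chosen token unchanged, which this sketch omits.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for a masked-token denoising example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so the loss is computed only on masked positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

Note that predictions are made in place, at the same positions as the masked tokens, which is why the output length always equals the input length.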

T5, on the other hand, adapted the denoising objective to a sequence-to-sequence format that fits its encoder-decoder architecture. Instead of predicting masked tokens in place, the encoder builds a representation of the corrupted input and the decoder generates the missing parts as output text. This approach allows T5 to combine the strengths of encoder-only and decoder-only models: bidirectional reading of the input and autoregressive generation of the output.
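The sequence-to-sequence form of denoising can be sketched as follows. This is a minimal illustration with hand-picked spans rather than T5's actual random span sampling; `span_corrupt` is a hypothetical helper, and the `<extra_id_N>` sentinel names follow the convention used in commonly distributed T5 vocabularies.

```python
def span_corrupt(tokens, spans):
    """Build (inputs, targets) for seq2seq denoising.

    spans is a sorted list of non-overlapping (start, end) index pairs,
    end exclusive. Each dropped span is replaced by one sentinel in the
    inputs; the targets list each sentinel followed by the tokens it hid.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Unlike in-place masking, the target here is a short sequence of its own, so the same decoder that fills in spans during pretraining can later generate free-form output.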

Why the Shift Away from Encoder-Only Models?

Despite BERT's success, there has been a significant shift towards decoder-only and encoder-decoder models in recent years. Here are a few reasons why:

Scalability: Decoder-only models like GPT can be scaled more easily to handle larger datasets and longer sequences.
Generative Capabilities: Encoder-decoder models and PrefixLMs can generate text autoregressively, which encoder-only models were not designed to do, making them suitable for a wider range of tasks.
Flexibility: The sequence-to-sequence format used in T5 and similar models lets a single model handle many NLP tasks as text in, text out, without a separate task head for each one.
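As a rough illustration of the flexibility point, the sketch below casts three different tasks into the same input-text and target-text shape. The helper `to_text_to_text` is hypothetical; the task prefixes resemble those used in T5's text-to-text setup, and the example sentences and targets are made up for illustration.

```python
def to_text_to_text(task_prefix, source, target):
    """Cast any task as a pair of strings: prefixed input text and target text."""
    return {"input": f"{task_prefix}: {source}", "target": target}


examples = [
    # Translation: the target is the translated sentence.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    # Summarization: the target is a shorter text.
    to_text_to_text("summarize", "The meeting ran long and covered budgets.", "Long budget meeting."),
    # Classification: the label itself is emitted as text, so no task head is needed.
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
]

for ex in examples:
    print(ex["input"], "->", ex["target"])
```

Because every task shares one format, the same model, loss, and decoding procedure apply to all of them.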

Conclusion

The evolution from encoder-only models like BERT to architectures like T5 and PrefixLMs reflects a shift in what practitioners need from a model. While BERT laid a strong foundation, newer models have built upon it by addressing scalability, generative capabilities, and task flexibility. As we continue to push the boundaries of what LLMs can do, it's essential to understand these architectural shifts and their implications for practitioners.