
MosaicML's MosaicBERT cuts BERT pretraining time by integrating FlashAttention and ALiBi, among other changes, and offers a blueprint for accelerating other large transformer models such as MPT-7B and MPT-30B.
MosaicBERT, a custom BERT architecture developed by MosaicML in collaboration with Databricks, is designed to speed up pretraining. It combines several architectural modifications with changes to the training recipe, and many of them carry over to other transformer models, including MosaicML's own MPT-7B and MPT-30B. Here’s what makes MosaicBERT tick and why it matters for practitioners, starting with the architectural changes:
FlashAttention
ALiBi (Attention with Linear Biases)
Gated Linear Units (GLUs)
Unpadding Inputs
Low Precision LayerNorm
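
Two of the architectural items listed above, ALiBi and unpadding, are easy to illustrate without a framework. The sketch below is plain Python and is not taken from the MosaicBERT code; the function names and the `boundaries` list are hypothetical. It assumes the number of heads is a power of two and uses a symmetric distance penalty (the absolute distance between positions), which is a common adaptation for bidirectional encoders, since the original ALiBi formulation targets causal attention.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence.

    Assumes num_heads is a power of two; other head counts need
    extra handling that this sketch leaves out.
    """
    base = 2 ** (-8 / num_heads)
    return [base ** (head + 1) for head in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias added to attention scores for one head.

    Entry [i][j] penalises attention from position i to position j
    in proportion to their distance. Using abs(i - j) is an assumption
    that suits an encoder, where tokens attend in both directions.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]


def unpad(input_ids, attention_mask):
    """Concatenate only the real tokens of a padded batch.

    Returns the flat token list plus cumulative sequence boundaries,
    so each sequence can still be located after padding is removed.
    """
    tokens = []
    boundaries = [0]
    for ids, mask in zip(input_ids, attention_mask):
        tokens.extend(t for t, keep in zip(ids, mask) if keep)
        boundaries.append(len(tokens))
    return tokens, boundaries


if __name__ == "__main__":
    print(alibi_slopes(8))
    print(symmetric_alibi_bias(3, 0.5))
    print(unpad([[5, 6, 0], [7, 0, 0]], [[1, 1, 0], [1, 0, 0]]))
```

Because the bias depends only on the distance between positions, no learned position embeddings are needed in this sketch. The unpadding function shows why the technique saves compute: in the toy batch, six token slots shrink to three real tokens.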

Alongside the architectural changes, the training recipe is adjusted:
Increase Masked Language Modeling (MLM) Ratio
Remove Dropout in Attention Module
Use bfloat16 for Training
Optimize Vocabulary Size
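
Two of the training settings above reduce to simple arithmetic. The following sketch, again in plain Python with hypothetical function names, rounds a vocabulary size up to a multiple of 64 (an assumed alignment value, chosen because such multiples tend to map well onto GPU matrix-multiply tiles) and shows how the MLM ratio controls how many positions are selected for prediction. The ratios 0.15 and 0.30 are example values only, and the selection step is simplified: it picks positions but does not perform any token replacement.

```python
import random


def pad_vocab_size(vocab_size, multiple=64):
    """Round vocab_size up to the nearest multiple (64 is an assumption)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def choose_mlm_positions(num_tokens, mlm_ratio, seed=0):
    """Pick which token positions the model must predict.

    Simplified: returns positions only. A full MLM pipeline would also
    decide how each selected token is replaced before training.
    """
    rng = random.Random(seed)
    num_to_mask = round(num_tokens * mlm_ratio)
    return sorted(rng.sample(range(num_tokens), num_to_mask))


if __name__ == "__main__":
    # The standard BERT WordPiece vocabulary has 30522 entries.
    print(pad_vocab_size(30522))
    # A higher ratio means more prediction targets per sequence.
    print(len(choose_mlm_positions(128, 0.15)))
    print(len(choose_mlm_positions(128, 0.30)))
```

A higher masking ratio gives the model more prediction targets per sequence, so each pass over the data yields more training signal. The extra vocabulary entries added by padding are simply unused embedding rows.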
Pretraining transformers from scratch is often prohibitively expensive and time-consuming. MosaicBERT's optimizations address these challenges by reducing computational requirements without compromising model performance. These modifications are not limited to BERT-style encoder models; many can be applied to decoder architectures like GPT and MPT as well.
For ML practitioners, adopting these techniques can lead to more efficient pretraining workflows, enabling faster experimentation and deployment of transformer models. Whether you're working on a small-scale project or scaling up to large datasets, the optimizations introduced by MosaicBERT are worth considering.
Original Sources
↗ https://mosaicbert.github.io/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
3 January 2024