Monarch Mixer Powers Long-Context Retrieval Models with M2-BERT Up to 32K Tokens


The Engineer

12 Jan 2024

Researchers at Stanford's Hazy Research group have released a preview of M2-BERT retrieval models built on the Monarch Mixer architecture. The models extend BERT-style embeddings to context lengths of up to 32K tokens for long-document analysis and retrieval.

Text embeddings are a cornerstone of many modern applications, from search engines and RAG (Retrieval-Augmented Generation) systems to vector databases. However, most embedding models are BERT/Transformer-based and have short context lengths, typically around 512 tokens. That is roughly two pages of text, while real-world documents are often much longer and can span tens of thousands of tokens. To address this gap, researchers at Stanford's Hazy Research group are building long-context retrieval models.
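To make the gap concrete, here is a minimal sketch of how many separate embedding calls a single document needs at different context lengths. The function name `chunks_needed` and the optional overlap parameter are illustrative assumptions, not part of any M2-BERT tooling.

```python
import math


def chunks_needed(doc_tokens: int, max_context: int, overlap: int = 0) -> int:
    """Count the windows needed to cover a document of doc_tokens tokens.

    Each window holds at most max_context tokens, and consecutive windows
    share `overlap` tokens. Hypothetical helper for illustration only.
    """
    if not 0 <= overlap < max_context:
        raise ValueError("overlap must satisfy 0 <= overlap < max_context")
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= max_context:
        return 1
    stride = max_context - overlap
    # One full window, then one more window per stride of remaining tokens.
    return 1 + math.ceil((doc_tokens - max_context) / stride)


# A 20,000-token document under a 512-token limit versus a 32K-token limit.
short_context_chunks = chunks_needed(20_000, 512)
long_context_chunks = chunks_needed(20_000, 32_768)
```

With a 512-token model the document is split into dozens of fragments that are embedded independently, whereas a 32K-token model can embed it in a single pass.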

Introducing Monarch Mixer (M2)

The foundation for these long-context models is the Monarch Mixer (M2) family. M2 replaces the attention and dense MLP layers of a standard Transformer with sub-quadratic operators built on Monarch matrices, which lets it handle much longer contexts efficiently. Today, the team is releasing a preview of several M2-BERT models with context lengths up to 32K tokens, fine-tuned for long-context retrieval.

Key Changes: Data Mixtures and Loss Functions

To enable these new models, the researchers had to make significant adjustments to both the data mixtures and loss functions. Here’s a breakdown:

Data Mixture Adjustments:
- Diverse Document Sources: The team curated a diverse set of long documents from various domains, including books, legal cases, TV screenplays, and code repositories.
- Balanced Training Data: The training data is balanced across document lengths and content types to avoid bias toward shorter documents or specific document types.

Loss Function Innovations:
- Contrastive Learning: The models use a contrastive loss function to improve the quality of embeddings for long-context retrieval. This helps in distinguishing relevant from irrelevant documents more effectively.
- Hierarchical Losses: For very long contexts, hierarchical losses are employed to ensure that the model can capture both local and global context.
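
As a rough illustration of the contrastive idea, the sketch below computes an InfoNCE-style loss with in-batch negatives. The article does not say which contrastive formulation the team used, so the loss form, the cosine similarity, and the temperature value here are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where queries[i] is paired with documents[i].

    Every other document in the batch acts as a negative for query i.
    The temperature of 0.05 is an illustrative value, not a reported one.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each query is most similar to its own document and grows when a negative scores higher than the positive, which is the signal that pushes relevant and irrelevant documents apart.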

Model Releases

The team has released the following models on Hugging Face:

- M2-BERT-80M-2k-retrieval (up to 2K tokens)
- M2-BERT-80M-8k-retrieval (up to 8K tokens)
- M2-BERT-80M-32k-retrieval (up to 32K tokens)
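
One way to choose among the three checkpoints is to pick the shortest context length that still fits the document. The sketch below assumes that "2K", "8K", and "32K" mean 2048, 8192, and 32768 tokens, and `pick_model` is a hypothetical helper rather than part of any released library.

```python
# Context limits in tokens. Assumption: "K" means 1024 tokens.
MODEL_CONTEXT_LIMITS = {
    "M2-BERT-80M-2k-retrieval": 2048,
    "M2-BERT-80M-8k-retrieval": 8192,
    "M2-BERT-80M-32k-retrieval": 32768,
}


def pick_model(doc_tokens: int) -> str:
    """Return the shortest-context model whose limit covers doc_tokens.

    Falls back to the longest-context model when the document exceeds
    every limit; the caller would then need to truncate or chunk it.
    """
    by_limit = sorted(MODEL_CONTEXT_LIMITS.items(), key=lambda item: item[1])
    for name, limit in by_limit:
        if doc_tokens <= limit:
            return name
    return by_limit[-1][0]


choice_for_contract = pick_model(6_500)
```

Preferring the shortest sufficient context is a design choice made here on the assumption that shorter sequences are cheaper to embed; measured retrieval quality on your own data may point to a different choice.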

These models are also available through Together AI's new embedding service. They have already been beta-tested at a MongoDB hackathon and integrated into RAG frameworks such as LangChain and LlamaIndex.

Long-Context Retrieval Benchmark: LoCo

To evaluate the performance of these long-context retrieval models, the team has introduced LoCo (Long-Context), a new benchmark. LoCo includes a variety of retrieval tasks with long documents, though it is still in its early stages. The researchers are actively seeking feedback and contributions to expand the benchmark.
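
For readers unfamiliar with how retrieval benchmarks are typically scored, the sketch below computes nDCG@k for a single query. The article does not state which metric LoCo reports, so treating nDCG@10 as the metric is an assumption, and the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance scores in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    relevance: dict mapping document id to a graded relevance score;
               documents missing from the dict count as irrelevant.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with one relevant document, "case_17".
perfect = ndcg_at_k(["case_17", "case_02"], {"case_17": 1})
second_place = ndcg_at_k(["case_02", "case_17"], {"case_17": 1})
```

A score of 1.0 means the retriever placed the relevant documents in the ideal order, and the score drops as relevant documents slip further down the ranking.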

Community Engagement

The team is eager for community feedback on these models and the LoCo benchmark:

- Real-World Performance: If you have long-context retrieval tasks, share how the M2-BERT retrieval models perform in your applications.
- Dataset Contributions: If you have public long-context retrieval datasets or tasks that could enhance LoCo, please let the team know. They aim to make the benchmark more comprehensive and representative.

Looking Forward

A full paper detailing these developments will be released next month. For now, the team is excited to share this preview and gather valuable insights from the community.