Anthropic Introduces Contextual Retrieval for Enhanced RAG Performance

Models & Research

The Engineer

23 Sept 2024 · 3 min read

Anthropic's new Contextual Retrieval technique boosts RAG systems, ensuring AI models can access precise, relevant data for better performance in specialized fields like customer support and law.

For an AI model to excel in specific contexts, it often needs access to relevant background knowledge. This is particularly true for applications like customer support chatbots and legal analysts, where the model must understand the nuances of a particular business or legal landscape. Traditionally, developers have used Retrieval-Augmented Generation (RAG) to enhance an AI model's knowledge by retrieving relevant information from a knowledge base and appending it to the user's prompt. However, traditional RAG methods often struggle with context, leading to failed retrievals.

In this post, we introduce Contextual Retrieval, a method that significantly improves the retrieval step in RAG. This technique leverages two sub-techniques: Contextual Embeddings and Contextual BM25. According to Anthropic, Contextual Retrieval can reduce the number of failed retrievals by 49%, and when combined with reranking, this improvement jumps to 67%. These enhancements directly translate to better performance in downstream tasks.

How Contextual Retrieval Works

Contextual Embeddings

Definition: Unlike traditional embeddings that encode text independently of context, Contextual Embeddings generate vector representations of text chunks based on the context provided by the user's query.
Implementation: The model first processes the user's query to understand its context. Then, it generates embeddings for the knowledge base chunks, taking this context into account. This ensures that the retrieved information is more relevant to the specific query.

Contextual BM25

Definition: BM25 (Best Matching 25) is a ranking function used in information retrieval systems. Contextual BM25 modifies this function to consider the user's query context.
Implementation: The system ranks the retrieved chunks based on their relevance to both the query and the context provided by the user. This improves the accuracy of the retrieval process.

A Note on Using a Longer Prompt

For smaller knowledge bases (less than 200,000 tokens or about 500 pages), you can bypass RAG entirely by including the entire knowledge base in the model's prompt. Anthropic recently introduced prompt caching for Claude, which makes this approach faster and more cost-effective:

Latency Reduction: Caching frequently used prompts between API calls reduces latency by more than 2x.
Cost Savings: Prompt caching can reduce costs by up to 90%.

You can learn more about prompt caching in Anthropic's prompt caching cookbook.

A Primer on RAG: Scaling to Larger Knowledge Bases

For larger knowledge bases that exceed the context window, RAG is the go-to solution. The process involves:

Chunking: Breaking down the knowledge base into smaller text chunks (usually a few hundred tokens each).
Embedding: Using an embedding model to convert these chunks into vector embeddings that capture their semantic meaning.
Storing: Storing these embeddings in a vector database for efficient retrieval based on semantic similarity.

At runtime, when a user submits a query:

The system generates a contextual embedding for the query.
It searches the vector database for the most relevant chunks.
These chunks are appended to the user's prompt and fed into the AI model.

Deploying Contextual Retrieval with Claude

You can easily deploy your own Contextual Retrieval solution using Claude with Anthropic's cookbook. This guide provides step-by-step instructions and best practices for integrating Contextual Retrieval into your applications.

Conclusion

Contextual Retrieval represents a significant advancement in RAG, making it easier to build AI models that can effectively leverage large knowledge bases. By reducing failed retrievals and improving accuracy, this method enhances the performance of downstream tasks, ultimately leading to more useful and reliable AI applications.