Claude 2.1 Enhances Long Context Retrieval with Simple Prompting Tweaks

The Engineer

8 Dec 2023 · 3 min read

A minor change to the prompt raised Claude 2.1's accuracy on a long-context sentence-retrieval evaluation from 27% to 98%.

Claude 2.1, the latest iteration of Anthropic's AI model, has a 200K token context window, equivalent to around 500 pages of text, and performs well on real-world retrieval tasks over long documents. It struggled, however, when asked to retrieve a specific sentence embedded in such a large context. A simple prompting adjustment raised its accuracy on that evaluation from 27% to 98%.

Key Improvements in Claude 2.1

Enhanced Retrieval Accuracy: Claude 2.1 now retrieves information with high precision across its full 200K token context window.
Reduced Incorrect Answers: Compared to Claude 2.0, the new model shows a 30% reduction in incorrect answers and a 3-4x lower rate of mistakenly concluding that a document supports a claim when it does not.

Training and Data

Claude 2.1 was trained on large amounts of user feedback on long-document tasks, such as summarizing an S-1-length document. This training data included real-world tasks performed on actual documents, which helps Claude 2.1 make fewer mistakes and avoid unsupported claims. The model's improved memory over very long contexts is attributed to this focused training.

Debugging Long Context Recall

Despite its capabilities, Claude 2.1 initially struggled to retrieve specific sentences from long documents. A recent evaluation highlighted the issue using Paul Graham’s essays about startups as the test material. The sentence "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day." was embedded in the essays, and the model was then asked, "What is the most fun thing to do in San Francisco?" Claude 2.1 often responded with variations of, "Unfortunately, the essay does not provide a definitive answer about the most fun thing to do in San Francisco."
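The setup of such a test can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation's actual harness: the function name `insert_needle` and the `depth` parameter are assumptions introduced here, and the sketch places the sentence at the nearest preceding sentence boundary so it does not land mid-sentence.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at a relative depth.

    depth = 0.0 places the needle at the start of the text, 1.0 at the end.
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")

    position = int(len(haystack) * depth)

    # At full depth, append the needle after the existing text.
    if position >= len(haystack):
        return haystack.rstrip() + " " + needle

    # Step back to the end of the previous sentence (". ") so the needle
    # starts on a sentence boundary; fall back to the very start if none.
    boundary = haystack.rfind(". ", 0, position)
    cut = boundary + 2 if boundary != -1 else 0

    return haystack[:cut] + needle + " " + haystack[cut:]
```

Calling `insert_needle(essays, sentence, 0.5)` would place the sentence roughly halfway through the combined essay text; repeating the call over several depths and document lengths gives a grid of test cases.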

This reluctance to retrieve specific sentences was due to the model's cautious approach, which can be beneficial in avoiding unsupported claims but problematic when the information is indeed present. To address this, Anthropic introduced a minor prompting edit:

Prompting Adjustment: Tweaking the prompt to guide Claude 2.1 more explicitly toward locating and extracting the relevant information improved the model’s performance significantly.

Implementation Details

The adjustment did not change the question itself. Anthropic added the sentence "Here is the most relevant sentence in the context:" to the start of Claude's response, which directs the model to locate the relevant sentence in the document before answering. The question, "What is the most fun thing to do in San Francisco?", stayed the same.
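A minimal sketch of assembling a prompt along these lines follows. It only builds a string and calls no API. The function name `build_retrieval_prompt`, the `<document>` tags, and the `response_prefix` parameter are assumptions made for illustration; the "Human:" and "Assistant:" turn markers follow the plain-text prompt format used with Claude 2.x models.

```python
def build_retrieval_prompt(document: str, question: str, response_prefix: str = "") -> str:
    """Assemble a single-string prompt: document first, question last.

    If `response_prefix` is given, it is placed after the "Assistant:" marker
    so the model continues from that text instead of starting its reply
    from scratch. Hypothetical helper for illustration only.
    """
    human_turn = (
        f"<document>\n{document.strip()}\n</document>\n\n"
        f"{question.strip()}"
    )
    prompt = f"\n\nHuman: {human_turn}\n\nAssistant:"

    # Seed the beginning of the reply, for example with a sentence that
    # points the model at the most relevant passage in the document.
    if response_prefix:
        prompt += " " + response_prefix.strip()

    return prompt
```

The caller controls both the wording of the question and whether the reply is seeded, so the same helper can be used to compare a baseline prompt against an adjusted one on identical documents.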

This simple change had a profound impact:

Accuracy Improvement: From 27% to 98% accuracy in retrieving the correct sentence.
Consistency: The model consistently identified and recalled the relevant information across various test cases.
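How an accuracy figure of this kind is computed depends on the grader. The sketch below assumes a simple key-phrase check, which is a simplification and not necessarily how the evaluation was graded; the function names and the choice of key phrases are hypothetical.

```python
def is_retrieved(response: str, key_phrases: list[str]) -> bool:
    """Return True if every key phrase appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(phrase.lower() in lowered for phrase in key_phrases)


def retrieval_accuracy(responses: list[str], key_phrases: list[str]) -> float:
    """Fraction of responses that contain all key phrases."""
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(is_retrieved(response, key_phrases) for response in responses)
    return hits / len(responses)
```

With key phrases such as "sandwich" and "Dolores Park", a reply that declines to answer counts as a miss and a reply that restates the embedded sentence counts as a hit. A substring check can misjudge paraphrases, so a stricter or model-based grader may be preferable in practice.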

Practical Implications

For practitioners, this improvement means that Claude 2.1 can now be more reliably used for tasks that require precise retrieval of specific information from large documents. Whether you're summarizing legal contracts, extracting key points from scientific papers, or analyzing financial reports, the enhanced performance in long-context recall makes Claude 2.1 a powerful tool.

Conclusion

Claude 2.1's advancements in handling long contexts, and the effectiveness of a simple prompting adjustment, highlight the ongoing progress in AI research. By focusing on real-world tasks and user feedback, Anthropic has created a model that gives more accurate and reliable results on long documents.