
A minor tweak in prompt design has sharply improved Claude 2.1's ability to retrieve a specific sentence from a very long document, raising its score on one long-context evaluation from 27% to 98%.
Claude 2.1, the latest iteration of Anthropic's AI model, has made significant strides in handling long-context tasks. With a 200K-token context window, equivalent to around 500 pages of text, Claude 2.1 excels at real-world retrieval tasks that involve extensive documents. It struggled, however, when asked to retrieve a specific sentence from within such a large context. A simple prompting adjustment dramatically improved its performance on that task, raising accuracy from 27% to 98%.
Claude 2.1 was trained using large amounts of user feedback on long-document tasks, such as summarizing S-1 length documents. This training data included real-world tasks performed on actual documents, helping Claude 2.1 make fewer mistakes and avoid unsupported claims. The model's improved memory over very long contexts is a direct result of this focused training.
Despite its capabilities, Claude 2.1 initially struggled to retrieve specific sentences from long documents. A recent evaluation highlighted the issue, using Paul Graham’s essays about startups as the test material. The following sentence was embedded in one of these essays: "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day." When asked, "What is the most fun thing to do in San Francisco?" Claude 2.1 often responded with variations of, "Unfortunately, the essay does not provide a definitive answer about the most fun thing to do in San Francisco."
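An evaluation of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation's actual harness: the function names, the `depth` parameter, and the substring-based scoring are all assumptions made for the example, and a real evaluation may grade responses quite differently.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a sentence boundary roughly `depth` (0.0 to 1.0)
    of the way through `haystack`. Hypothetical helper for illustration."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    target = int(len(haystack) * depth)
    # Move forward to the end of the current sentence so the needle
    # is not spliced into the middle of another sentence.
    boundary = haystack.find(". ", target)
    if boundary == -1:
        # No later sentence boundary: append the needle at the end.
        return haystack.rstrip() + " " + needle
    split_at = boundary + 2
    return haystack[:split_at] + needle + " " + haystack[split_at:]


def mentions_needle(response: str, key_phrases: list[str]) -> bool:
    """Crude scoring (an assumption): a response counts as a hit only if
    it contains every key phrase, ignoring case."""
    lowered = response.lower()
    return all(phrase.lower() in lowered for phrase in key_phrases)


essays = "First sentence. Second sentence. Third sentence."
needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
document = insert_needle(essays, needle, depth=0.5)

refusal = ("Unfortunately, the essay does not provide a definitive answer "
           "about the most fun thing to do in San Francisco.")
print(mentions_needle(refusal, ["sandwich", "Dolores Park"]))
```

Repeating the insertion at several depths and document lengths, then averaging the hits, would give an accuracy figure comparable in spirit to the percentages quoted in this article.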
Claude 2.1's reluctance to retrieve the sentence stemmed from the model's cautious approach, which helps it avoid unsupported claims but becomes a problem when the information is in fact present. To address this, Anthropic introduced a minor prompting edit.

The prompting adjustment left the question itself unchanged. Instead, Anthropic added the sentence "Here is the most relevant sentence in the context:" to the start of Claude's response. Beginning the answer this way directs the model to first locate the relevant sentence in the document, making it clear that the answer should be based on the document's content rather than withheld for lack of supporting context.
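A prompt adjustment of this kind can be sketched as a small prompt builder. The sketch below assumes the Human/Assistant text format used by Claude 2.1-era models and optionally begins the model's reply with a fixed sentence, such as "Here is the most relevant sentence in the context:". The function name, its parameters, and the document placeholder are hypothetical, and no API call is made; the resulting string would still need to be sent to the model by whatever client you use.

```python
# Turn markers for the Human/Assistant text prompt format (assumed here).
HUMAN_PROMPT = "\n\nHuman:"
AI_PROMPT = "\n\nAssistant:"

# Sentence placed at the start of the model's reply to steer it toward retrieval.
RELEVANT_SENTENCE_PREFIX = "Here is the most relevant sentence in the context:"


def build_prompt(document: str, question: str,
                 response_prefix: str = RELEVANT_SENTENCE_PREFIX) -> str:
    """Build a long-context prompt; hypothetical helper for illustration.

    If `response_prefix` is non-empty, it is written after the Assistant
    marker so the model continues from that sentence instead of starting
    its answer from scratch.
    """
    prompt = f"{HUMAN_PROMPT} {document}\n\n{question}{AI_PROMPT}"
    if response_prefix:
        prompt += " " + response_prefix
    return prompt


prompt = build_prompt(
    document="<long document text goes here>",
    question="What is the most fun thing to do in San Francisco?",
)
print(prompt)
```

Passing `response_prefix=""` yields the unadjusted prompt, which makes it straightforward to compare the two variants on the same documents and questions.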
This simple change had a profound impact: on the same evaluation, Claude 2.1's accuracy rose from 27% to 98%.
For practitioners, this improvement means that Claude 2.1 can be used more reliably for tasks that require precise retrieval of specific information from large documents. Whether you're summarizing legal contracts, extracting key points from scientific papers, or analyzing financial reports, the improved long-context recall makes Claude 2.1 a powerful tool.
Claude 2.1's advancements in handling long contexts and the effectiveness of simple prompting adjustments highlight the ongoing progress in AI research. By focusing on real-world tasks and user feedback, Anthropic has created a model that not only performs well but also provides more accurate and reliable results.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
8 December 2023