
Researchers at Google have developed Michelangelo, a new evaluation built on the Latent Structure Queries framework, to test how well large language models reason over long contexts rather than how well they retrieve a single fact.
In a recent paper titled "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries," a team of researchers from Google introduces a novel framework to evaluate long-context reasoning capabilities in large language models (LLMs). This framework, called Latent Structure Queries (LSQ), aims to move beyond simple information retrieval and assess the model's ability to reason over extended contexts.
The key innovation here is the LSQ framework. Traditional evaluations focus on finding a single piece of information within a large context (the "needle in a haystack" setup). LSQ tasks instead require models to identify and manipulate latent structures within the data, a skill that matters for summarization, question answering, and logical inference over long texts.
For practitioners, this new framework offers several advantages:
The core of the LSQ framework involves constructing tasks where models must "chisel away" irrelevant information to find and manipulate latent structures. Here are the key components:
Task Construction: Each task is designed to require multiple steps of reasoning, such as:
Scoring Mechanism: The tasks are designed to be easily scoreable. For example:
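As a rough illustration of both components, here is a toy task in the spirit of the description above. It is a sketch under stated assumptions, not a task from the paper: the function names `build_task` and `score`, the list-operation format, and the exact-match scoring rule are all hypothetical choices made for this example. The answer depends on a latent list that is never stated directly, and the operations that matter are buried among operations on an unrelated list.

```python
import ast
import random


def build_task(num_relevant=5, num_distractors=20, seed=0):
    """Build a toy prompt whose answer is the final state of a hidden list.

    Hypothetical format: relevant lines mutate `items`, distractor lines
    mutate an unrelated list called `scratch`.
    """
    rng = random.Random(seed)
    state = []      # the latent structure a model would have to track
    relevant = []
    for _ in range(num_relevant):
        if state and rng.random() < 0.3:
            relevant.append("items.pop()")
            state.pop()
        else:
            value = rng.randint(0, 99)
            relevant.append(f"items.append({value})")
            state.append(value)

    distractors = [
        f"scratch.append({rng.randint(0, 99)})" for _ in range(num_distractors)
    ]

    # Scatter the relevant lines among the distractors, preserving their order.
    total = num_relevant + num_distractors
    slots = set(rng.sample(range(total), num_relevant))
    rel_iter, dis_iter = iter(relevant), iter(distractors)
    body = [next(rel_iter) if i in slots else next(dis_iter) for i in range(total)]

    prompt = "\n".join(
        ["items = []", "scratch = []", *body, "What is the final value of items?"]
    )
    return prompt, state


def score(response, expected):
    """Return 1.0 if the response parses to exactly the expected list, else 0.0."""
    try:
        parsed = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError, TypeError):
        return 0.0
    return 1.0 if parsed == expected else 0.0
```

Calling `build_task(num_distractors=2000)` lengthens the context without changing how hard the underlying reasoning is, and `score(model_output, expected)` grades a reply without a human or a judge model. A real evaluation would likely need more forgiving answer parsing than this exact-match check.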

The researchers provide several examples to illustrate how LSQ tasks can be constructed:
Example Task 1: Summarization with Constraints
Example Task 2: Logical Inference Over Time
The paper includes preliminary results using LSQ to evaluate several state-of-the-art LLMs. Key findings include:
The researchers outline several future directions for improving and expanding LSQ:
The introduction of the Latent Structure Queries framework marks a significant step forward in evaluating long-context reasoning capabilities in LLMs. By moving beyond simple information retrieval, LSQ provides a robust and flexible tool for researchers and practitioners to better understand and improve these models.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
23 September 2024