Michelangelo: A New Framework for Long-Context Reasoning in Large Language Models


The Engineer

23 Sept 2024

Researchers at Google have developed Michelangelo, a new framework using Latent Structure Queries to evaluate large language models' long-context reasoning skills, moving beyond basic information retrieval.

In a recent paper titled "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries," a team of researchers from Google introduces Michelangelo, an evaluation of long-context reasoning in large language models (LLMs). It is built on a framework called Latent Structure Queries (LSQ), which aims to move beyond simple information retrieval and assess a model's ability to reason over extended contexts.

What Changed Technically?

The key innovation here is the LSQ framework. Traditional evaluations focus on finding a single piece of information within a large context (think "needle in a haystack"). LSQ tasks instead require the model to identify and manipulate latent structures that are spread across the context, a capability that also underlies summarization, question answering, and logical inference over long texts.

Why It Matters

For practitioners, this new framework offers several advantages:

Comprehensive Evaluation: LSQ provides a more holistic view of an LLM's reasoning capabilities.
Minimal and Synthetic: The tasks are designed to be minimal and synthetic, making them easy to generate and score automatically.
No Leaks: Fixed benchmarks can be memorized or gamed once their contents appear in training data. LSQ tasks are constructed to avoid this kind of leakage, which supports fair evaluation.

How It Works

The core of the LSQ framework involves constructing tasks where models must "chisel away" irrelevant information to find and manipulate latent structures. Here are the key components:

Task Construction: Each task is designed to require multiple steps of reasoning, such as:
- Identifying Relevant Information: The model must distinguish between relevant and irrelevant parts of the context.
- Manipulating Structures: Once the relevant information is identified, the model must perform operations like aggregation, filtering, or transformation.
- Generating Output: The final output should demonstrate that the model has correctly reasoned over the entire context.
Scoring Mechanism: The tasks are designed to be easy to score:
- Automated Scoring: A scoring function can automatically evaluate the correctness of the model's output based on predefined criteria.
- Flexibility: The framework allows for both binary and graded scoring, depending on the task requirements.
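
The construction and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy task (a hidden list updated by "OP" lines that are buried in filler text), the function names, and both scoring rules are hypothetical and chosen only to show the shape of a synthetic, automatically scoreable task.

```python
import random

# Hypothetical filler sentences: the irrelevant material to be "chiseled away".
FILLER = [
    "The weather report mentioned light rain.",
    "A note about lunch was left on the desk.",
    "Someone hummed a tune in the hallway.",
]

def build_task(num_ops, filler_per_op, seed=0):
    """Return (context, expected_list) for a toy latent-structure task."""
    rng = random.Random(seed)
    latent = []   # the hidden structure the model must track
    lines = []
    for _ in range(num_ops):
        # Relevant line: an operation that updates the hidden list.
        if latent and rng.random() < 0.3:
            item = rng.choice(latent)
            latent.remove(item)
            lines.append(f"OP remove {item}")
        else:
            item = rng.randint(0, 99)
            latent.append(item)
            lines.append(f"OP append {item}")
        # Irrelevant lines: filler the model has to ignore.
        for _ in range(filler_per_op):
            lines.append(rng.choice(FILLER))
    return "\n".join(lines), latent

def score_binary(predicted, expected):
    """1.0 only if the predicted list matches exactly."""
    return 1.0 if predicted == expected else 0.0

def score_graded(predicted, expected):
    """Fraction of matching positions, normalized by the longer list."""
    if not predicted and not expected:
        return 1.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / max(len(predicted), len(expected))

def reference_solver(context):
    """Replays the operations; stands in for a model's answer in this sketch."""
    state = []
    for line in context.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "OP":
            value = int(parts[2])
            if parts[1] == "append":
                state.append(value)
            elif parts[1] == "remove" and value in state:
                state.remove(value)
    return state

if __name__ == "__main__":
    context, expected = build_task(num_ops=20, filler_per_op=3, seed=1)
    answer = reference_solver(context)
    print(score_binary(answer, expected), score_graded(answer, expected))
```

Raising `filler_per_op` lengthens the context without changing the answer, while raising `num_ops` makes the latent structure itself harder to track. Keeping those two knobs separate is a design choice of this sketch.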

Implementation Details

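The paper's own evaluations are Latent List, Multi-Round Co-reference Resolution (MRCR), and IDK. The two simplified examples here illustrate how tasks in the LSQ style can be constructed: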

Example Task 1: Summarization with Constraints
- Context: A long document containing multiple sections.
- Task: Generate a summary that includes specific key points while excluding irrelevant details.
- Evaluation: The model's output is scored based on the presence of required key points and the absence of irrelevant information.
Example Task 2: Logical Inference Over Time
- Context: A series of events described in chronological order.
- Task: Determine the outcome of a specific event based on the sequence of prior events.
- Evaluation: The model's output is scored based on the logical consistency with the given context.
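
The evaluation rule in Example Task 1 (reward required key points, penalize irrelevant details) can be sketched as follows. This is a hedged illustration: the function name `score_summary`, the case-insensitive substring matching, and the "coverage minus penalty" formula are all assumptions made for this sketch, and substring matching is a crude stand-in for whatever matching a real evaluation would use.

```python
def score_summary(summary, required_points, forbidden_details):
    """Return (score, hits, leaks) with score in [0, 1].

    score = fraction of required points present
            minus fraction of forbidden details present, floored at 0.
    Matching is a case-insensitive substring check (an assumption of this sketch).
    """
    text = summary.lower()
    hits = [p for p in required_points if p.lower() in text]
    leaks = [d for d in forbidden_details if d.lower() in text]
    coverage = len(hits) / len(required_points) if required_points else 1.0
    penalty = len(leaks) / len(forbidden_details) if forbidden_details else 0.0
    return max(0.0, coverage - penalty), hits, leaks

if __name__ == "__main__":
    required = ["revenue grew 12%", "berlin office"]
    forbidden = ["cafeteria menu"]
    good = "Revenue grew 12% and the Berlin office opened."
    bad = "Revenue grew 12%. The cafeteria menu changed."
    print(score_summary(good, required, forbidden)[0])
    print(score_summary(bad, required, forbidden)[0])
```

Returning the matched and leaked items alongside the score makes it easier to inspect why a given output was penalized.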

Benchmarks and Results

The paper includes preliminary results using LSQ to evaluate several state-of-the-art LLMs. Key findings include:

Performance Variability: Models differ noticeably in how well they perform on the same tasks, which suggests that retrieval-style benchmarks alone do not capture long-context reasoning ability.
Scalability: Because the tasks are synthetic, they can be generated at increasing context lengths, which makes the framework suitable for evaluating long-context reasoning.
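
One way to picture an evaluation across context lengths is a small sweep harness like the one below. Everything in it is hypothetical: the toy task (report the final value assigned to `x` among filler lines), the function names, and the `truncating_model`, which is a deliberately limited stand-in that reads only the last `window` lines. It is not any of the LLMs evaluated in the paper, and the sketch reproduces no numbers from it.

```python
import random

def make_context(num_lines, num_updates=3, seed=0):
    """Toy task: the answer is the last value assigned to x in reading order."""
    rng = random.Random(seed)
    lines = [f"note {i}: nothing to see here" for i in range(num_lines)]
    expected = None
    # Overwrite a few random positions with assignments, visited in order.
    for pos in sorted(rng.sample(range(num_lines), num_updates)):
        expected = rng.randint(0, 999)
        lines[pos] = f"set x = {expected}"
    return lines, expected

def truncating_model(lines, window=200):
    """Stand-in 'model' that only reads the last `window` lines."""
    answer = None
    for line in lines[-window:]:
        if line.startswith("set x = "):
            answer = int(line.split("=")[1])
    return answer

def sweep(model, lengths, trials=20):
    """Return {context_length: accuracy} for a callable model(lines) -> answer."""
    results = {}
    for n in lengths:
        correct = 0
        for t in range(trials):
            lines, expected = make_context(n, seed=t)
            correct += model(lines) == expected
        results[n] = correct / trials
    return results

if __name__ == "__main__":
    for length, accuracy in sweep(truncating_model, [100, 1000, 10000]).items():
        print(length, accuracy)
```

Because the same generator produces every context length, any change in accuracy across the sweep can be attributed to length rather than to a change in task content.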

Future Directions

The researchers outline several future directions for improving and expanding LSQ:

Task Diversity: Introducing a wider variety of tasks to cover different types of reasoning.
Model Adaptation: Exploring how models can be fine-tuned or adapted specifically for LSQ tasks.
Cross-Domain Evaluation: Applying LSQ to evaluate models across different domains, such as legal documents, scientific papers, and more.

Conclusion

The introduction of the Latent Structure Queries framework marks a significant step forward in evaluating long-context reasoning capabilities in LLMs. By moving beyond simple information retrieval, LSQ provides a robust and flexible tool for researchers and practitioners to better understand and improve these models.