PaperBench: A New Benchmark for Evaluating AI's Replication of Research Papers

The Engineer

3 Apr 2025 · 3 min read

PaperBench challenges AI systems to replicate cutting-edge research from scratch, breaking each replication into smaller, gradable parts to assess how well agents understand and execute the work.

April 2, 2025

OpenAI has introduced PaperBench, a novel benchmark designed to evaluate the ability of AI agents to replicate state-of-the-art AI research. This benchmark focuses on replicating 20 ICML 2024 Spotlight and Oral papers from scratch, encompassing tasks such as understanding paper contributions, developing a codebase, and successfully executing experiments.

Key Technical Features

Comprehensive Task Breakdown: PaperBench decomposes each replication task into smaller, individually gradable sub-tasks. This hierarchical approach lets each part of a replication attempt be graded against explicit criteria rather than judged as a whole.
LLM-Based Judge: An LLM-based judge is developed to automatically grade replication attempts against predefined rubrics. This facilitates scalable and consistent evaluation.
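
To make the hierarchical grading idea concrete, here is a minimal sketch of how a rubric tree could be scored. It assumes that leaf sub-tasks are graded pass/fail and that each parent's score is the weighted average of its children; the article does not spell out the aggregation rule, so treat that as an assumption. The names RubricNode, score, and the example tree are hypothetical and not taken from the PaperBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (illustrative, not PaperBench's API)."""
    name: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)
    passed: Optional[bool] = None  # set by a grader on leaf nodes only


def score(node: RubricNode) -> float:
    """Return a score in [0, 1].

    Assumption: leaves are pass/fail, and a parent's score is the
    weight-normalised average of its children's scores.
    """
    if not node.children:
        # An ungraded leaf (passed is None) counts as a fail.
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


# Made-up rubric for a single paper, purely for illustration.
example = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("implements proposed method", passed=True),
        RubricNode("implements baseline", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=True),
    RubricNode("results match paper", weight=1.0, passed=False),
])

print(f"replication score: {score(example):.1%}")
```

Under these assumptions, partial progress earns partial credit: an agent that writes half of the code and runs the experiments, but fails to reproduce the results, still receives a non-zero score.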

Why It Matters

For AI practitioners, PaperBench represents a significant step forward in evaluating the capabilities of AI agents in real-world, complex tasks. Here’s why:

Realistic Evaluation: By co-developing rubrics with the original authors of each ICML paper, PaperBench ensures that the evaluation criteria are both accurate and realistic.
Scalability: The LLM-based judge allows for large-scale, automated evaluation, making it feasible to test multiple models and agents efficiently.
Benchmarking Models: Initial evaluations show that even the best-performing model, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of only 21.0%. This highlights the current limitations and areas for improvement in AI research replication.

Technical Details

Task Decomposition:
- Each ICML paper is broken down into smaller tasks.
- These tasks are further divided into sub-tasks with clear grading criteria.
- The benchmark contains a total of 8,316 individually gradable tasks.
Rubric Development:
- Rubrics are co-developed with the original authors to ensure accuracy and realism.
- This collaboration ensures that the evaluation criteria align with the intended contributions of each paper.
LLM-Based Judge:
- An LLM-based judge automatically grades replication attempts against the rubrics.
- A separate benchmark for judges is created to assess the performance of the LLM-based judge.
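
One way to picture the judge benchmark is as a comparison between the judge's pass/fail decisions and reference grades from human experts on the same sub-tasks. The sketch below computes accuracy, precision, recall, and F1 for such a comparison. The choice of metrics, the function name judge_agreement, and the sample labels are all assumptions made for illustration; the article does not say how judge performance is measured.

```python
from typing import Dict, Sequence


def judge_agreement(judge: Sequence[bool], human: Sequence[bool]) -> Dict[str, float]:
    """Compare judge pass/fail grades with human reference grades.

    Hypothetical helper: treats the human grade as ground truth and
    "pass" as the positive class.
    """
    if len(judge) != len(human):
        raise ValueError("judge and human grade lists must have equal length")

    pairs = list(zip(judge, human))
    tp = sum(1 for j, h in pairs if j and h)          # both say pass
    fp = sum(1 for j, h in pairs if j and not h)      # judge too lenient
    fn = sum(1 for j, h in pairs if not j and h)      # judge too strict
    tn = sum(1 for j, h in pairs if not j and not h)  # both say fail

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up grades for five sub-tasks, purely for illustration.
judge_grades = [True, True, False, False, True]
human_grades = [True, False, False, True, True]
metrics = judge_agreement(judge_grades, human_grades)
print(metrics)
```

Separating precision from recall shows whether a judge tends to be too lenient (false passes) or too strict (false fails), which a single accuracy figure would hide.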

Initial Results

Model Performance: The best-performing agent tested, Claude 3.5 Sonnet (New) with open-source scaffolding, achieved an average replication score of 21.0%.
Human Baseline: Top ML PhDs were recruited to attempt a subset of PaperBench tasks. The results indicate that current AI models do not yet outperform this human baseline.

Open-Sourcing

To facilitate future research and development, OpenAI has open-sourced the code for PaperBench. This includes:

Code Repository: Available on GitHub.
Paper: The full paper is available on arXiv.

Conclusion

PaperBench sets a new standard for evaluating AI's ability to replicate complex research tasks. By providing a comprehensive, realistic, and scalable benchmark, it offers valuable insights into the current capabilities and limitations of AI agents in this domain. For researchers and practitioners, PaperBench is a crucial tool for advancing the field of AI research replication.