Real-World Accounting Tasks Reveal LLM Limitations Despite Promising Benchmarks


The Steward

21 Jul 2025

While AI shows promise in standardized tests, a new study reveals LLMs falter on intricate real-world accounting tasks, underscoring the technology's limitations in handling complex financial duties with precision.

As artificial intelligence (AI) is integrated into more industries, AI-driven accounting and financial analysis has drawn significant attention. However, a recent study using real-world data from a SaaS business shows that even the most advanced large language models (LLMs) struggle with complex, long-horizon tasks like closing the books.

The Human Stakes

For accountants and financial analysts, accuracy is paramount. Errors in financial records can lead to significant legal and financial repercussions, affecting not only the company but also its stakeholders. If AI could handle these tasks reliably, it could free human professionals to focus on higher-value strategic work. However, as this study shows, we are not quite there yet.

Understanding the Study

The research team at Penrose.com introduced AccountingBench, a new evaluation framework designed to test LLMs on real-world financial tasks. Unlike synthetic benchmarks that simulate controlled environments, AccountingBench uses one year of actual financial data from a SaaS business generating millions in revenue. This dataset includes a wide range of transactions and activities, providing a comprehensive challenge for AI models.

The Challenge: Closing the Books

Closing the books is a critical accounting process: every source transaction, reconciliation, and supporting document is worked through systematically so that the financial records are accurate and up to date. It is a task that requires precision and attention to detail over an extended period.

In the study, the LLM was tasked with closing the books for March 2021. The work began with processing source transactions and reconciling accounts. However, the model ran into a significant discrepancy between the ledger and the actual statement:

Ledger Balance: $2,787,366.07
Statement Balance (Actual): $2,494,508.43
Difference: $292,857.64

A gap of nearly $293,000 between the ledger and the actual statement highlights the model's limitations in handling financial data accurately over a long horizon.
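The reconciliation check behind those three figures is simple subtraction, and it can be sketched in a few lines of Python. This is a minimal illustration only, not code from the study: the function name `reconciliation_difference` is hypothetical, and `Decimal` is used on the assumption that currency amounts should not be handled as binary floats.

```python
from decimal import Decimal


def reconciliation_difference(ledger_balance: str, statement_balance: str) -> Decimal:
    """Return ledger minus statement; zero means the account reconciles.

    Hypothetical helper for illustration. Amounts are passed as strings
    (thousands separators allowed) so no float rounding is introduced.
    """
    ledger = Decimal(ledger_balance.replace(",", ""))
    statement = Decimal(statement_balance.replace(",", ""))
    return ledger - statement


# Figures as reported in the article.
difference = reconciliation_difference("2,787,366.07", "2,494,508.43")
print(f"Difference: ${difference:,.2f}")
```

Any nonzero result means the ledger and the statement disagree and the account cannot be considered reconciled.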

What Went Wrong?

The study revealed that while LLMs excel at tasks that don't alter the underlying environment (such as answering questions, writing code, and researching sources), they struggle with "butterfly" tasks. These are tasks where each action has lasting consequences, and errors can compound over time. In accounting, a small mistake in one transaction can lead to significant discrepancies down the line.
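To make the compounding effect concrete, here is a rough Python sketch. Every figure in it is invented for illustration and none comes from the study, and `roll_forward` is a hypothetical helper. It assumes the simplest possible model: each month's closing balance becomes the next month's opening balance, so a mistake is never left behind.

```python
from decimal import Decimal


def roll_forward(opening, monthly_entries):
    """Return the closing balance after each month.

    Each month's closing balance is the next month's opening balance,
    so an error in an early month stays in every later balance.
    """
    balances = []
    balance = opening
    for entries in monthly_entries:
        balance += sum(entries, Decimal("0"))
        balances.append(balance)
    return balances


# Hypothetical entries for three months (not data from the study).
correct = [
    [Decimal("1000.00"), Decimal("-250.00")],
    [Decimal("400.00")],
    [Decimal("-100.00")],
]
# Same books with two mistakes: the first entry of month 1 is posted
# twice, and the month 3 entry is never recorded.
flawed = [
    [Decimal("1000.00"), Decimal("1000.00"), Decimal("-250.00")],
    [Decimal("400.00")],
    [],
]

correct_balances = roll_forward(Decimal("0"), correct)
flawed_balances = roll_forward(Decimal("0"), flawed)
gaps = [f - c for f, c in zip(flawed_balances, correct_balances)]

for month, gap in enumerate(gaps, start=1):
    print(f"Month {month}: balance is off by {gap}")
```

In this toy model the duplicate posting from month 1 is still in the balance at month 3, and the omitted entry adds to it rather than replacing it. That is the sense in which errors accumulate unless someone goes back and corrects the original entry.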

The Human Expert Baseline

To provide a point of comparison, the study also included a human expert baseline performed by a Certified Public Accountant (CPA). The CPA was able to accurately close the books and identify and correct the discrepancies that the LLM missed. This highlights the current gap between AI capabilities and human expertise in complex financial tasks.

Implications for the Workforce

The findings from AccountingBench have important implications for the accounting and finance workforce. While AI can automate routine tasks and assist with data entry, it is not yet a reliable replacement for human professionals on high-stakes, long-horizon tasks. Accountants and financial analysts will therefore continue to play a crucial role in ensuring the accuracy and integrity of financial records.

Moving Forward

The study underscores the need for continued research and development in AI to bridge this gap. While LLMs have shown promise in many areas, their limitations in handling complex, real-world tasks like closing the books highlight the importance of human oversight and intervention. As AI technology evolves, it is crucial to balance automation with the expertise and judgment of human professionals.