
While AI shows promise in standardized tests, a new study reveals LLMs falter on intricate real-world accounting tasks, underscoring the technology's limitations in handling complex financial duties with precision.
In an era where artificial intelligence (AI) is increasingly integrated into various industries, the promise of AI-driven solutions in accounting and financial analysis has captured significant attention. However, a recent study using real-world data from a SaaS business shows that even the most advanced large language models (LLMs) struggle with complex, long-horizon tasks like closing the books.
For accountants and financial analysts, accuracy is paramount. Errors in financial records can lead to significant legal and financial repercussions, affecting not only the company but also its stakeholders. The ability of AI to handle these tasks reliably could potentially revolutionize the industry by freeing up human professionals to focus on higher-value strategic work. However, as this study shows, we are not quite there yet.
The research team at Penrose.com introduced AccountingBench, a new evaluation framework designed to test LLMs on real-world financial tasks. Unlike synthetic benchmarks that simulate controlled environments, AccountingBench uses one year of actual financial data from a SaaS business generating millions in revenue. This dataset includes a wide range of transactions and activities, providing a comprehensive challenge for AI models.
Closing the books is a critical process in accounting that involves systematically handling all source transactions, reconciliations, and documentation to ensure the financial records are accurate and up-to-date. It's a task that requires precision and attention to detail over an extended period.
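To make the reconciliation step concrete, here is a minimal sketch in Python of checking a ledger total against a statement balance. The function name `reconcile` and all figures are hypothetical and chosen only for illustration; they do not come from the study or from AccountingBench.

```python
from decimal import Decimal


def reconcile(ledger_entries, statement_balance):
    """Return ledger total minus statement balance (zero means they agree)."""
    # Decimal avoids the rounding drift that binary floats introduce with cents.
    ledger_total = sum((Decimal(amount) for amount in ledger_entries), Decimal("0"))
    return ledger_total - Decimal(statement_balance)


# Hypothetical figures, for illustration only.
ledger = ["1200.00", "-350.25", "89.99"]
difference = reconcile(ledger, "939.74")
is_reconciled = difference == 0
```

A non-zero difference is the signal that some transaction is missing, duplicated, or misposted and needs to be traced before the period can be closed.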
In the study, the LLM was tasked with closing the books for March 2021. The process began with processing source transactions and reconciling accounts. However, the model's results contained significant discrepancies.
These discrepancies highlight the model's limitations in handling financial data accurately over a longer period.

The study revealed that while LLMs excel at tasks that don't alter the underlying environment (such as answering questions, writing code, and researching sources), they struggle with "butterfly" tasks. These are tasks where each action has lasting consequences, and errors can compound over time. In accounting, a small mistake in one transaction can lead to significant discrepancies down the line.
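A small sketch in Python can illustrate why an early mistake has lasting consequences. It assumes a simplified model in which each month's closing balance is the previous balance plus that month's net change; the function `closing_balances` and every number are hypothetical, not taken from the study.

```python
from decimal import Decimal


def closing_balances(opening, monthly_net_changes):
    """Roll an opening balance forward, returning each month's closing balance."""
    balances = []
    balance = Decimal(opening)
    for change in monthly_net_changes:
        balance += Decimal(change)
        balances.append(balance)
    return balances


# Hypothetical figures, for illustration only.
correct = closing_balances("10000.00", ["500.00", "-200.00", "300.00", "150.00"])
# Same months, but the first month's net change is misposted by 45.00.
misposted = closing_balances("10000.00", ["545.00", "-200.00", "300.00", "150.00"])

# Difference between the misposted and the correct closing balance, per month.
drift = [wrong - right for wrong, right in zip(misposted, correct)]
```

In this simplified model the single misposting leaves every later closing balance off by the same amount. In a real close, later reconciliations and adjustments are made against those already-wrong balances, which is how one error can grow into larger discrepancies.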
To provide a point of comparison, the study also included a human expert baseline performed by a Certified Public Accountant (CPA). The CPA was able to accurately close the books and identify and correct the discrepancies that the LLM missed. This highlights the current gap between AI capabilities and human expertise in complex financial tasks.
The findings from AccountingBench have important implications for the accounting and finance workforce. While AI can automate routine tasks and assist with data entry, it is not yet a reliable replacement for human professionals when it comes to high-stakes, long-horizon tasks. This means that accountants and financial analysts will continue to play a crucial role in ensuring the accuracy and integrity of financial records.
The study underscores the need for continued research and development in AI to bridge this gap. While LLMs have shown promise in many areas, their limitations in handling complex, real-world tasks like closing the books highlight the importance of human oversight and intervention. As AI technology evolves, it is crucial to balance automation with the expertise and judgment of human professionals.
Original Sources
https://accounting.penrose.com/?utm_source=tldrai
About the author
Amara's entry point into AI was an epidemiology role at a London research hospital, where she spent five years studying how digital health tools reached — or conspicuously failed to reach — underserved communities. Watching early algorithmic systems in healthcare quietly entrench existing inequalities, she redirected her career toward the systemic consequences of AI at scale. She covers AI through an unflinching lens: who benefits, who bears the cost, and what evidence actually says versus what the press release claims. Her writing is calm and precise, but she doesn't mistake balance for neutrality.