
The rapid growth in language model inference revenue reveals a surge in demand for speed and efficiency, pushing companies like OpenAI and Anthropic to innovate beyond traditional limits.
As AI models continue to evolve, the economics of language model inference have become a critical area of focus for major players like OpenAI and Anthropic. Despite the trend towards smaller and more cost-effective models, inference revenue at these companies has been growing at an impressive rate of 3x per year or more. This growth underscores the increasing demand for faster and more efficient AI inference, particularly as models are tasked with handling complex problems and operating within sophisticated agentic loops.
The move away from "human reading speed" as the performance target marks a significant change in how language models are used. Previously, generating 10 tokens per second was considered sufficient for user interactions. Applications that have models work through complex problems or run inside agentic loops now need output far faster than a person can read. This trend has important implications for both the technical and economic aspects of AI inference.
Faster inference has clear benefits, but the shift also carries several risks.
To address these challenges, researchers at Epoch AI have developed a model for understanding the economics of language model inference. This model decomposes the time taken during the forward pass of a Transformer into four key components, which include arithmetic time, memory read-write time, and network communication latency.

Separating these components gives a detailed view of how each factor contributes to inference speed. For example, NVIDIA's Collective Communication Library (NCCL) has a base latency of 30 microseconds on a DGX H100 machine, even for small tensors, so every communication step costs at least that much regardless of how little data it moves.
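To see why a fixed 30-microsecond cost matters, the rough sketch below turns it into a ceiling on generation speed. It assumes a tensor-parallel setup with two all-reduce operations per Transformer layer and treats the quoted base latency as the only cost. The layer count and the function names are illustrative assumptions, not figures from the Epoch AI model.

```python
# Base NCCL latency on a DGX H100, as quoted in the text above.
NCCL_BASE_LATENCY_S = 30e-6


def allreduce_latency_floor(num_layers: int,
                            allreduces_per_layer: int = 2,
                            base_latency_s: float = NCCL_BASE_LATENCY_S) -> float:
    """Minimum time per forward pass spent on communication latency alone.

    Assumption: every layer issues `allreduces_per_layer` all-reduces and
    none of them can be hidden behind other work.
    """
    return num_layers * allreduces_per_layer * base_latency_s


def max_tokens_per_second(latency_floor_s: float) -> float:
    """Upper bound on sequential tokens per second for a single sequence."""
    return 1.0 / latency_floor_s


# Hypothetical 80-layer model: 160 all-reduces per generated token.
floor_s = allreduce_latency_floor(num_layers=80)
print(f"latency floor: {floor_s * 1e3:.2f} ms per token")
print(f"speed ceiling: {max_tokens_per_second(floor_s):.0f} tokens/s")
```

Under these assumptions the latency floor alone is a few milliseconds per token, which caps single-sequence speed no matter how fast the arithmetic is.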
The model allows researchers and engineers to compute the forward pass time for specific inputs, given the past context length and batch size. This computation is not straightforward, as some components can overlap with others. For instance, memory read-write time can be overlapped with arithmetic time. By making reasonable assumptions about these overlaps, the model provides a final estimate of inference speed.
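The overlap reasoning above can be sketched in a few lines. This is a simplified illustration, not Epoch AI's actual model. It assumes about 2 FLOPs per parameter per generated token, weights read once per pass, one KV-cache read per sequence, and communication time that does not overlap with anything. The class names, field names, and the `comm_s` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops_per_s: float      # peak arithmetic throughput (assumed value)
    mem_bytes_per_s: float  # memory bandwidth (assumed value)


@dataclass
class Model:
    n_params: float            # parameter count
    bytes_per_param: float     # e.g. 2 for 16-bit weights
    kv_bytes_per_token: float  # KV-cache bytes stored per token of context


def forward_pass_time(model: Model, hw: Hardware, batch_size: int,
                      context_len: int, comm_s: float = 0.0,
                      overlap: bool = True) -> float:
    """Estimate seconds for one decoding forward pass.

    Simplification: attention FLOPs are ignored and communication time
    (`comm_s`) is assumed not to overlap with compute or memory traffic.
    """
    # Roughly one multiply and one add per parameter per generated token.
    arithmetic_s = 2.0 * model.n_params * batch_size / hw.flops_per_s
    # Weights are read once per pass; each sequence reads its own KV cache.
    weight_bytes = model.n_params * model.bytes_per_param
    kv_bytes = model.kv_bytes_per_token * context_len * batch_size
    memory_s = (weight_bytes + kv_bytes) / hw.mem_bytes_per_s
    if overlap:
        # Perfect overlap: the slower of the two hides the faster one.
        compute_s = max(arithmetic_s, memory_s)
    else:
        # No overlap: the two costs simply add up.
        compute_s = arithmetic_s + memory_s
    return compute_s + comm_s


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
hw = Hardware(flops_per_s=1e15, mem_bytes_per_s=1e12)
model = Model(n_params=1e9, bytes_per_param=2, kv_bytes_per_token=1e5)
for batch in (1, 64, 4096):
    t = forward_pass_time(model, hw, batch_size=batch, context_len=1000)
    print(f"batch={batch:5d}  time={t * 1e3:.3f} ms  tokens/s={batch / t:,.0f}")
```

With perfect overlap the estimate is the larger of the two costs; without overlap it is their sum. Real systems fall somewhere between the two, which is why the overlap assumptions matter for the final figure.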
This approach helps in understanding how to optimize models for faster performance while considering the trade-offs involved. It also highlights areas where further research and innovation can lead to significant improvements in efficiency and cost-effectiveness.
The economics of language model inference is a rapidly evolving field with significant implications for both technical and business strategies. As demand for high-speed inference continues to grow, companies must navigate the complex landscape of performance optimization and cost management. The model developed by Epoch AI provides valuable insights into these challenges, offering a framework for making informed decisions in this critical area.
About the author
Marcus began tracking AI's market implications in 2016, noticing AI-related patent filings accelerating ahead of earnings upgrades before most of the sell-side had caught on. A former fixed-income quantitative analyst, he spent two decades building models that priced risk across emerging markets before pivoting to cover the economic impact of AI full-time. His writing translates opaque technical developments into clear risk/reward terms — and he's rarely diplomatic about the gap between AI valuations and underlying fundamentals. He believes most market participants still underestimate AI's long-run deflationary effect on knowledge work.
20 June 2025