
Share
Researchers at TensorZero unveil a groundbreaking method that slashes costs by up to 30x and speeds up inference times fourfold for large language models, making them more accessible without compromising performance.
July 29, 2025 · Andrew Jesson, Gabriel Bianconi, Aaron Hill, Viraj Mehta
In the world of large language models (LLMs), there's a constant tug-of-war between performance and cost. Large models like GPT-4.1 or Claude Sonnet 4 offer top-notch results but come with hefty price tags, making them impractical for many production workloads. On the other hand, smaller models are budget-friendly but often fall short in terms of accuracy and response quality.
However, recent research from TensorZero shows a promising solution: fine-tuning small models on programmatically curated high-quality outputs from large models can bridge this gap. This approach not only matches or exceeds the performance of large models but also reduces inference costs by up to 30x and speeds up response times by up to 4x.
For LLM application builders, the choice is clear: do you prioritize top-tier performance at a high cost, or opt for affordability with potential trade-offs in quality? Consider a customer service agent handling thousands of conversations daily. Using GPT-4.1 or Claude Sonnet 4 might provide excellent responses, but at $2-$15 per million tokens, the costs can add up quickly. Switch to a smaller model, and while your budget stays intact, so might your customer satisfaction.
Our research demonstrates that fine-tuning small models on programmatically curated data from large models can break this tradeoff. Here's how it works:
This method leverages the strengths of both worlds: the high-quality outputs from large models and the cost-efficiency of small models. Let's dive into the details:

We benchmarked this approach using a mix of closed-source (OpenAI, Google) and open-source (Qwen) models on several tasks:
To reproduce this workflow, you can use open-source tools like 11.2KTensorZero and other LLMOps tools. Here’s a step-by-step guide:
By leveraging distillation with programmatic data curation, you can achieve significant cost savings without compromising on performance. This approach not only makes LLMs more accessible for a broader range of applications
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
5 August 2025
88 articles
Related Articles
Related Articles
More Stories