Distillation and Programmatic Data Curation: Achieving 30x Cost Reduction and 4x Faster Inference in LLMs

Models & Research

The Engineer

5 Aug 2025 · 3 min read

Researchers at TensorZero unveil a groundbreaking method that slashes costs by up to 30x and speeds up inference times fourfold for large language models, making them more accessible without compromising performance.

Distillation with Programmatic Data Curation: Smarter LLMs, 5-30x Cheaper Inference

July 29, 2025 · Andrew Jesson, Gabriel Bianconi, Aaron Hill, Viraj Mehta

In the world of large language models (LLMs), there's a constant tug-of-war between performance and cost. Large models like GPT-4.1 or Claude Sonnet 4 offer top-notch results but come with hefty price tags, making them impractical for many production workloads. On the other hand, smaller models are budget-friendly but often fall short in terms of accuracy and response quality.

However, recent research from TensorZero shows a promising solution: fine-tuning small models on programmatically curated high-quality outputs from large models can bridge this gap. This approach not only matches or exceeds the performance of large models but also reduces inference costs by up to 30x and speeds up response times by up to 4x.

The Performance-Cost Dilemma

For LLM application builders, the choice is clear: do you prioritize top-tier performance at a high cost, or opt for affordability with potential trade-offs in quality? Consider a customer service agent handling thousands of conversations daily. Using GPT-4.1 or Claude Sonnet 4 might provide excellent responses, but at $2-$15 per million tokens, the costs can add up quickly. Switch to a smaller model, and while your budget stays intact, so might your customer satisfaction.

A New Approach: Distillation with Programmatic Data Curation

Our research demonstrates that fine-tuning small models on programmatically curated data from large models can break this tradeoff. Here's how it works:

Programmatic Curation: We use a large model to generate high-quality outputs for specific tasks.
Fine-Tuning: These outputs are then used to fine-tune smaller models, which learn to replicate the performance of the larger model.

This method leverages the strengths of both worlds: the high-quality outputs from large models and the cost-efficiency of small models. Let's dive into the details:

Benchmarks and Results

We benchmarked this approach using a mix of closed-source (OpenAI, Google) and open-source (Qwen) models on several tasks:

Data Extraction (CoNLL++ NER): Named Entity Recognition
Multi-turn Maze Navigation (BabyAI): Complex navigation tasks
Agentic RAG (Multi-Hop): Retrieval-Augmented Generation for multi-hop reasoning
Agentic Tool Use (τ-bench): Using tools in a simulated environment

Key Findings:

Cost Reduction: Fine-tuned small models achieved up to 30x cost reduction compared to their large counterparts.
Response Time: These models also delivered up to 4x faster response times, enhancing user experience.
Performance Parity: In many cases, fine-tuned small models matched or exceeded the performance of large models on specific tasks.

Implementation Details

To reproduce this workflow, you can use open-source tools like 11.2KTensorZero and other LLMOps tools. Here’s a step-by-step guide:

Data Collection: Use a large model to generate high-quality outputs for your specific tasks.
Dataset Preparation: Curate these outputs into a training dataset suitable for fine-tuning.
Model Fine-Tuning: Train smaller models on this curated dataset using frameworks like PyTorch or TensorFlow.
Evaluation and Deployment: Evaluate the performance of the fine-tuned model and deploy it in your production environment.

Best Practices for Production Deployments

Regular Updates: Continuously update your training data with new, high-quality outputs to maintain performance.
Monitoring: Implement monitoring tools to track the performance and cost efficiency of your models in real-time.
Scalability: Design your infrastructure to handle increasing workloads efficiently.

Conclusion

By leveraging distillation with programmatic data curation, you can achieve significant cost savings without compromising on performance. This approach not only makes LLMs more accessible for a broader range of applications