
A non-technical founder slashed his LLM API costs by 80% after discovering that default models aren't always the most economical choice, highlighting the need for custom benchmarking to match tasks with the right model.
Last month, I helped a non-technical founder reduce his Large Language Model (LLM) API bill by 80%. He was using GPT-5 because it’s the default choice for many: it is easy to set up, well-benchmarked, and widely adopted. However, as usage grew, so did his costs, reaching $1,500/month just for API calls.
By benchmarking his actual prompts against 100+ models, we discovered that while GPT-5 is a solid choice, it was rarely the most cost-effective option among models of comparable quality. This process saved him thousands of dollars. Here’s how we did it.
When selecting an LLM, most people pick a model from their favorite provider, such as Anthropic’s Opus, Sonnet, or Haiku. If you’re more sophisticated, you might check public benchmarks such as GPQA Diamond, AIME, SWE-bench, MATH-500, Humanity’s Last Exam, ARC-AGI, and MMLU.
However, these benchmarks are at best a rough indicator of performance and do not account for costs. A model that excels in reasoning tasks might be mediocre at damage cost estimation or customer support in specific languages. The only way to know which model works best for your task is to test it on your actual prompts and consider quality, cost, and latency.
To optimize the LLM selection, we built our own benchmarks. Let’s walk through a use case: customer support.
We extracted actual support chats using WHAPI. Each chat provided the conversation history, the customer's latest message, and the response my friend actually sent. My friend also shared the prompts he used to generate responses manually and within his chat tool. We selected around 50 chats, including frequently asked questions and edge cases where we wanted the LLM to behave in a specific way.
For each example, we used my friend’s actual response as the expected output. We also defined ranking criteria, such as:

We created a simple dataset with the prompt (conversation + instructions) and the expected response. This format is generic and can be applied to various use cases. For every prompt:
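A minimal sketch of what one record in such a dataset might look like, assuming a JSON Lines file with one example per line. The names `BenchmarkExample`, `build_prompt`, and `to_jsonl`, and the exact way the prompt is assembled, are illustrative assumptions rather than the format we actually used.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BenchmarkExample:
    """One benchmark row: the full prompt and the reply a human actually sent."""
    prompt: str             # instructions + conversation history + latest message
    expected_response: str  # reference reply used later when judging model output


def build_prompt(instructions: str, history: list[str], latest_message: str) -> str:
    """Join the instructions, prior turns, and newest customer message into one prompt."""
    turns = "\n".join(history)
    return (
        f"{instructions}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Customer: {latest_message}"
    )


def to_jsonl(examples: list[BenchmarkExample]) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


if __name__ == "__main__":
    # Made-up chat content, purely for illustration.
    example = BenchmarkExample(
        prompt=build_prompt(
            "Answer briefly and politely.",
            ["Customer: Hi", "Agent: Hello, how can I help?"],
            "Where is my order?",
        ),
        expected_response="It ships tomorrow.",
    )
    print(to_jsonl([example]))
```

Keeping the record this generic means the same file can be replayed against any model, whatever the use case behind the prompts.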
We tested this benchmark dataset on over 100 models, including GPT-5, Anthropic’s models, and others. We evaluated the responses based on quality, cost, and latency:
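The cost and latency side of that evaluation can be sketched roughly as below. This assumes per-token pricing quoted in USD per million tokens and a model call wrapped as a plain Python function; `Pricing`, `request_cost`, and `timed_call` are hypothetical helper names, and the prices in the usage example are placeholders, not real provider rates. Quality scoring is left out because it depends on the ranking criteria chosen for the task.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pricing:
    """Token prices in USD per one million tokens (placeholder values only)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of a single request, given its token counts and the model's pricing."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def timed_call(call_model: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Run one model call and return (response, wall-clock latency in seconds)."""
    start = time.perf_counter()
    response = call_model(prompt)
    return response, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for a real API client; swap in an actual call in practice.
    def fake_model(prompt: str) -> str:
        return "Thanks for reaching out!"

    reply, latency = timed_call(fake_model, "Where is my order?")
    cost = request_cost(1_000, 500, Pricing(input_per_million=2.0, output_per_million=10.0))
    print(f"reply={reply!r} latency={latency:.4f}s cost=${cost:.4f}")
```

Multiplying the per-request cost by expected monthly volume gives a figure that can be compared directly against the current bill.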
The results were eye-opening. While GPT-5 performed well in many benchmarks, it was not always the most cost-effective option. We found several models that offered comparable quality at significantly lower costs. For example:
By switching to one of these models, my friend saved thousands of dollars in API costs while maintaining high-quality customer support.
Benchmarking LLMs on your specific task is crucial for optimizing both performance and cost. Default choices like GPT-5 are often overkill and can lead to unnecessary expenses. By creating custom benchmarks and testing multiple models, you can find the best balance of quality, cost, and latency for your application.
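One way to turn that balance into a decision rule, offered as a sketch rather than the exact procedure we followed: keep every model whose quality score is within a tolerance of the current default, optionally cap latency, and pick the cheapest survivor. `ModelResult`, `cheapest_comparable`, the tolerance value, and the model names and numbers in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    """Aggregated benchmark results for one model."""
    name: str
    quality: float           # mean quality score on the benchmark, higher is better
    cost_per_request: float  # average USD per request
    latency_s: float         # average latency in seconds


def cheapest_comparable(
    results: list[ModelResult],
    baseline_name: str,
    quality_tolerance: float = 0.05,
    max_latency_s: Optional[float] = None,
) -> Optional[ModelResult]:
    """Return the cheapest model whose quality is within `quality_tolerance`
    of the baseline and whose latency is under the optional cap.
    Returns None if no model qualifies."""
    baseline = next(r for r in results if r.name == baseline_name)
    quality_floor = baseline.quality - quality_tolerance
    candidates = [
        r for r in results
        if r.quality >= quality_floor
        and (max_latency_s is None or r.latency_s <= max_latency_s)
    ]
    return min(candidates, key=lambda r: r.cost_per_request, default=None)


if __name__ == "__main__":
    # Invented numbers for illustration; not measured results.
    results = [
        ModelResult("model-a", quality=0.90, cost_per_request=0.010, latency_s=2.0),
        ModelResult("model-b", quality=0.88, cost_per_request=0.002, latency_s=1.0),
        ModelResult("model-c", quality=0.70, cost_per_request=0.0005, latency_s=0.5),
    ]
    choice = cheapest_comparable(results, baseline_name="model-a")
    print(choice.name if choice else "no model qualifies")
```

The tolerance is the judgment call here: a tighter value keeps you closer to the default model's quality, while a looser one trades some quality for a lower bill.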
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
21 January 2026