Cutting LLM API Costs by 80% Through Custom Benchmarking

Tools & Engineering

The Engineer

21 Jan 2026 · 4 min read

A non-technical founder slashed his LLM API costs by 80% after discovering that default models aren't always the most economical choice, highlighting the need for custom benchmarking to match tasks with the right model.

Last month, I helped a non-technical founder reduce his Large Language Model (LLM) API bill by 80%. He was using GPT-5 because it’s the default choice for many: easy to set up, well-benchmarked, and widely adopted. However, as usage grew, so did his costs, reaching $1,500/month for API calls alone.

By benchmarking his actual prompts against 100+ models, we discovered that while GPT-5 is a solid choice, it was rarely the cheapest option at comparable quality. This process saved him thousands of dollars. Here’s how we did it.

The Problem: Benchmarks Don’t Predict Performance on Your Task

When selecting an LLM, most people choose a model from their favorite provider, like Anthropic’s Opus, Sonnet, or Haiku. If you’re more sophisticated, you might check benchmarks such as GPQA Diamond, AIME, SWE-bench, MATH-500, Humanity’s Last Exam, ARC-AGI, and MMLU.

However, these benchmarks are at best a rough indicator of performance and do not account for costs. A model that excels in reasoning tasks might be mediocre at damage cost estimation or customer support in specific languages. The only way to know which model works best for your task is to test it on your actual prompts and consider quality, cost, and latency.

Building Custom Benchmarks

To optimize the LLM selection, we built our own benchmarks. Let’s walk through a use case: customer support.

Step 1: Collect Real Examples

We extracted actual support chats using WHAPI. Each chat provided the conversation history, the customer's latest message, and the response my friend actually sent. My friend also shared the prompts he used to generate responses manually and within his chat tool. We selected around 50 chats, including frequently asked questions and edge cases where we wanted the LLM to behave in a specific way.

Step 2: Define the Expected Output

For each example, we used my friend’s actual response as the expected output. We also defined ranking criteria, such as:

A good answer tells the customer that this product costs $5.99 and offers to take an order right now.
A good answer explains that the return policy gives customers 30 days to send back an order, and notes that this customer sent theirs back more than two months after receiving it.
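The article does not say how these criteria were applied to each candidate reply. One possibility, sketched here in Python, is to store them as plain statements and assemble a grading prompt that a person or a judge model can score against. The names `Criterion` and `build_judge_prompt`, the 0 to 10 scale, and the sample replies are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One plain-language statement of what a good answer must do."""
    description: str


def build_judge_prompt(criteria, expected, candidate):
    """Assemble a grading prompt for a judge (a person or another LLM)."""
    lines = ["Score the candidate reply from 0 to 10 against these criteria:"]
    lines += [f"{i}. {c.description}" for i, c in enumerate(criteria, start=1)]
    lines += ["", "Reference reply:", expected, "", "Candidate reply:", candidate]
    return "\n".join(lines)


# Hypothetical criteria, split out of the first example above.
criteria = [
    Criterion("Tells the customer that this product costs $5.99."),
    Criterion("Offers to take an order right now."),
]
prompt = build_judge_prompt(
    criteria,
    expected="It is $5.99. Shall I place the order for you?",
    candidate="It costs $5.99.",
)
```

Splitting a criterion into separate statements makes it easier to see which part a reply missed: the candidate above states the price but never offers to take the order.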

Step 3: Create the Benchmark Dataset

We created a simple dataset with the prompt (conversation + instructions) and the expected response. This format is generic and can be applied to various use cases. Each record has two fields:

Prompt: The conversation history and any additional instructions.
Expected Response: The actual response my friend sent.
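A minimal sketch of such a dataset in Python, assuming one JSON object per line (JSONL) as the storage format. The article does not name a file format, and the class name `BenchmarkExample`, the field names, and the sample conversation are illustrative.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkExample:
    prompt: str             # conversation history plus any instructions
    expected_response: str  # the reply that was actually sent


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


def from_jsonl(text):
    """Parse JSONL text back into examples, skipping blank lines."""
    return [
        BenchmarkExample(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


# Hypothetical record; a real one would carry the full chat history.
examples = [
    BenchmarkExample(
        prompt="Customer: How much is it?\nInstructions: Reply briefly.",
        expected_response="It is $5.99. Would you like to order now?",
    )
]
serialized = to_jsonl(examples)
restored = from_jsonl(serialized)
```

JSON escapes the newline inside the prompt, so each record stays on a single line and the file can be appended to or diffed record by record.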

Testing and Evaluation

We tested this benchmark dataset on over 100 models, including GPT-5, Anthropic’s models, and others. We evaluated the responses based on quality, cost, and latency:

Quality: How well the model’s response matched the expected output.
Cost: The API call costs for each model.
Latency: The time it took for the model to generate a response.
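The three measurements can be collected in one loop. The sketch below is an illustration under several assumptions rather than the harness we actually ran: `call_model` is a hypothetical function that sends a prompt to some model and returns its reply, quality is approximated by string similarity to the expected reply, and cost is estimated per 1,000 characters, whereas real providers usually bill per token.

```python
import time
from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean


@dataclass
class ModelResult:
    quality: float    # mean similarity to the expected replies, 0..1
    cost_usd: float   # estimated total cost across the dataset
    latency_s: float  # mean seconds per call


def similarity(expected, actual):
    """Crude quality proxy: character-level similarity ratio."""
    return SequenceMatcher(None, expected, actual).ratio()


def run_benchmark(call_model, dataset, usd_per_1k_chars):
    """Run every (prompt, expected) pair through one model and aggregate."""
    scores, latencies, chars = [], [], 0
    for prompt, expected in dataset:
        start = time.perf_counter()
        reply = call_model(prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(similarity(expected, reply))
        chars += len(prompt) + len(reply)
    cost = chars / 1000 * usd_per_1k_chars
    return ModelResult(mean(scores), cost, mean(latencies))


def stub_model(prompt):
    """Stand-in for a real API call so the sketch runs offline."""
    return "It is $5.99. Would you like to order now?"


dataset = [("How much is it?", "It is $5.99. Would you like to order now?")]
result = run_benchmark(stub_model, dataset, usd_per_1k_chars=0.01)
```

String similarity rewards wording rather than correctness, so for a real run it is worth replacing `similarity` with a rubric-based score while keeping the rest of the loop unchanged.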

Results

The results were eye-opening. While GPT-5 performed well in many benchmarks, it was not always the most cost-effective option. We found several models that offered comparable quality at significantly lower costs. For example:

Model A: 70% of the quality of GPT-5 but at 30% of the cost.
Model B: 90% of the quality of GPT-5 but at 50% of the cost.
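Which of these is the right pick depends on how much quality you are willing to give up. A small Python sketch of that decision, using the relative figures above with GPT-5 normalized to 1.0 for both quality and cost; the function name `cheapest_meeting_bar` and the threshold values are assumptions for illustration.

```python
def cheapest_meeting_bar(models, min_quality):
    """Return the name of the cheapest model at or above the quality bar."""
    eligible = [m for m in models if m["quality"] >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m["cost"])["name"]


# Quality and cost relative to GPT-5, taken from the figures above.
models = [
    {"name": "GPT-5", "quality": 1.00, "cost": 1.00},
    {"name": "Model A", "quality": 0.70, "cost": 0.30},
    {"name": "Model B", "quality": 0.90, "cost": 0.50},
]

strict_pick = cheapest_meeting_bar(models, min_quality=0.90)
lenient_pick = cheapest_meeting_bar(models, min_quality=0.70)
```

With a bar of 0.90 the pick is Model B; lowering the bar to 0.70 makes Model A the cheapest eligible option.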

By switching to one of these models, my friend saved thousands of dollars in API costs while maintaining high-quality customer support.
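To put a number on the savings, here is the arithmetic for the $1,500/month bill and the 80% reduction mentioned earlier. The annual figure assumes usage and prices stay flat for twelve months, which is a projection and not a measured result.

```python
monthly_bill_usd = 1500   # bill before switching, from the article
reduction_percent = 80    # reduction achieved, from the article

# Integer arithmetic keeps the dollar amounts exact.
monthly_saving_usd = monthly_bill_usd * reduction_percent // 100
new_monthly_bill_usd = monthly_bill_usd - monthly_saving_usd

# Assumption: usage and prices stay flat for twelve months.
annual_saving_usd = monthly_saving_usd * 12
```

That works out to $1,200 saved per month, a new bill of $300 per month, and $14,400 over a year under the flat-usage assumption.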

Conclusion

Benchmarking LLMs on your specific task is crucial for optimizing both performance and cost. Default choices like GPT-5 are often overkill and can lead to unnecessary expenses. By creating custom benchmarks and testing multiple models, you can find the best balance of quality, cost, and latency for your application.