Early Impressions of GPT-4 Fine-Tuning: A 50% Performance Boost for Natural Language Queries

Models & Research

The Engineer

22 Mar 2024 · 3 min read

Early access to GPT-4's fine-tuning capabilities reveals a significant leap in performance, surpassing GPT-3.5 by over 50% for natural language tasks, setting new standards in AI model customization.

A few weeks ago, we gained early access to the GPT-4 fine-tuning API and were eager to see how it stacks up against its predecessors. As long-time users of OpenAI’s fine-tuned models, starting from the original GPT-3 Davinci model, we had high expectations. The results did not disappoint-fine-tuned GPT-4 outperformed fine-tuned GPT-3.5 by more than 50% for our specific use case.

Models Compared

To provide a comprehensive comparison, we evaluated the following models:

Fine-Tuned (FT) GPT-3 Davinci Model: Our initial choice when fine-tuning became available.
GPT-3.5 and GPT-4 Base Models: The standard versions without any additional training.
GPT-3.5 and GPT-4, Fine-Tuned Models: Custom-trained using our proprietary data set.

These models were fine-tuned for a domain-specific use case: natural language queries to generate reports and underlying database queries. Evaluations were conducted using our internal test data set, with GPT-3 Davinci’s performance serving as the baseline.

Performance Comparison

The improvements in GPT-4 are significant:

Accuracy: Fine-tuned GPT-4 demonstrated a more than 50% improvement over fine-tuned GPT-3.5 for our use case.
Latency: While both models showed acceptable latency, GPT-4 had a slight edge, especially in complex queries.
Cost: The cost of using GPT-4 is slightly higher per token, but the performance gains often justify the additional expense.

Context and LLM Usage at Supersimple

Supersimple is a data analytics platform designed to help users dive deep into their data quickly. Our platform allows users to ask natural language (plain English) questions and receive answers in the form of tables and visualizations. The AI provides explanations using no-code steps, and users can further explore the data with additional queries or by interacting with our data platform.

How We Use LLMs

The primary role of LLMs at Supersimple is to interpret natural language queries and generate appropriate reports and underlying database queries. Here’s a breakdown of the process:

Input: User's natural language question
Context: Relevant parts of the user’s semantic data model, existing reports, and dashboards
Output: A report or visualization that answers the user’s question with as much context as possible

Natural Language Querying Demo

To give you a better idea of how this works, here’s a demo video:

Implementation Details

Fine-Tuning Process: We used a custom proprietary data set to fine-tune the models. This involved preparing and annotating a large dataset of natural language queries and their corresponding reports.
Evaluation Metrics: Accuracy was measured by comparing the generated reports against ground truth, while latency was assessed using average response times for various query complexities.
Cost Considerations: The cost per token for GPT-4 is slightly higher than GPT-3.5, but the performance gains often make it a worthwhile investment.

Conclusion

The early access to GPT-4 fine-tuning has been a game-changer for us at Supersimple. The significant performance improvements, especially in accuracy and latency, make it a compelling choice for natural language processing tasks. While the cost is slightly higher, the benefits are clear, particularly for complex use cases like ours.