
Share
Researchers explore how fine-tuning and optimized prompts can elevate the chess-solving abilities of chatbots to match those of specialized completion models, bridging the gap in performance.
Published on August 24, 2024
By Robert Washbourne
@rawsh0
It turns out that GPT completion models are pretty good at chess, with gpt-3.5-turbo-instruct playing around 1800 Elo. However, chat models typically struggle with the same tasks. To see if we could bring the performance of chat models up to a level competitive with completion models, I ran several experiments combining fine-tuning and prompt optimization.
The dataset used for these experiments consists of chess puzzles from various sources, ensuring a diverse range of difficulty levels and positions. This diversity is crucial for training models that can handle different types of problems effectively.
Completion models like gpt-3.5-turbo-instruct have shown promising results in solving chess puzzles, achieving an Elo rating around 1800. This baseline performance sets a high bar for chat models to match or exceed.
To improve the performance of chat models, I used DSPy (a framework for algorithmically optimizing LM prompts and weights) to automatically optimize the LLM chess puzzle-solving prompt.
The DSPy program was designed to:
The optimized prompt was compiled using a series of steps:
The final compiled prompt included:

Using the optimized prompt, chat models showed significant improvement in solving chess puzzles.
To further enhance performance, I fine-tuned the models on the optimized LLM chain of thought outputs from gpt-4o. This involved:
The key to effective fine-tuning is creating high-quality training data. I focused on:
The combination of prompt optimization and fine-tuning significantly improved the performance of both chat and completion models.
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
26 August 2024
88 articles
Related Articles
Related Articles
More Stories