Fine-Tuning and Prompt Optimization Boost Chat Models for Chess Puzzles

Models & Research

The Engineer

26 Aug 2024 · 3 min read

Researchers explore how fine-tuning and optimized prompts can elevate the chess-solving abilities of chatbots to match those of specialized completion models, bridging the gap in performance.

Teaching Chat Models to Solve Chess Puzzles

Published on August 24, 2024

By Robert Washbourne

@rawsh0

It turns out that GPT completion models are pretty good at chess, with gpt-3.5-turbo-instruct playing around 1800 Elo. However, chat models typically struggle with the same tasks. To see if we could bring the performance of chat models up to a level competitive with completion models, I ran several experiments combining fine-tuning and prompt optimization.

Dataset

The dataset used for these experiments consists of chess puzzles from various sources, ensuring a diverse range of difficulty levels and positions. This diversity is crucial for training models that can handle different types of problems effectively.

Can Completion Models Solve Chess Puzzles?

Completion models like gpt-3.5-turbo-instruct have shown promising results in solving chess puzzles, achieving an Elo rating around 1800. This baseline performance sets a high bar for chat models to match or exceed.

Completion Models Baseline

Model: gpt-3.5-turbo-instruct
Elo Rating: ~1800

Prompt Optimization with DSPy

To improve the performance of chat models, I used DSPy (a framework for algorithmically optimizing LM prompts and weights) to automatically optimize the LLM chess puzzle-solving prompt.

DSPy Program

The DSPy program was designed to:

Generate few-shot examples that guide the model through the reasoning process.
Refine the prompt structure to enhance clarity and effectiveness.

Compiling

The optimized prompt was compiled using a series of steps:

Data Collection: Gathered a set of chess puzzles with solutions.
Example Selection: Chose high-quality few-shot examples that effectively demonstrate the problem-solving process.
Prompt Refinement: Iteratively refined the prompt to improve model performance.

Compiled Prompt

The final compiled prompt included:

A clear problem statement
Few-shot examples that break down the reasoning steps
Instructions for generating step-by-step solutions

DSPy Compiled Results

Using the optimized prompt, chat models showed significant improvement in solving chess puzzles.

Chat Models + DSPy

Model: gpt-4o-mini (chat model)
Elo Rating: ~1600 (after optimization)

Fine-Tuning

To further enhance performance, I fine-tuned the models on the optimized LLM chain of thought outputs from gpt-4o. This involved:

Constructing high-quality examples that capture the step-by-step reasoning process.
Using OpenAI's 2 million tokens of free fine-tuning per day to train the models.

Constructing Good Examples

The key to effective fine-tuning is creating high-quality training data. I focused on:

Step-by-Step Solutions: Ensuring each example includes a clear chain of thought.
Diverse Puzzles: Using puzzles from different difficulty levels and positions.

gpt-4o-mini

Model: gpt-4o-mini (chat model)
Elo Rating: ~1700 (after fine-tuning)

gpt-4o

Model: gpt-4o (completion model)
Elo Rating: ~1850 (after fine-tuning)

davinci

Model: davinci (completion model)
Elo Rating: ~2000 (after fine-tuning)

Finetuning Results

The combination of prompt optimization and fine-tuning significantly improved the performance of both chat and completion models.

Completion Models

gpt-4o-mini: ~1700 Elo
gpt-4o: ~1850 Elo
davinci: ~2000 Elo

Chat Models + DSPy

gpt-4o-mini: ~1600 Elo (after prompt optimization)
gpt-4o-mini: ~1700 Elo (after fine-tuning)

Notes

The results demonstrate that combining fine-tuning and prompt optimization can bring chat models closer to the performance of completion models in solving chess puzzles.