Teaching Language Models to Solve Sudoku with Reinforcement Learning

The Engineer

11 Mar 2025

Sharma's experiment uses reinforcement learning to teach a language model structured problem-solving, a contrast with the open-ended tasks, such as essay writing and coding, where these models usually do well.

In the world of AI and language models, we often hear about systems that can write essays, generate code, or answer complex questions. But what about teaching them to solve puzzles that require structured thinking, spatial reasoning, and logical deduction? This is where a recent experiment by Hrishabh Sharma comes in: teaching language models to solve Sudoku puzzles through reinforcement learning.

The Challenge of Structured Problem-Solving

Sudoku presents a fascinating challenge for language models. Unlike open-ended text generation, solving a Sudoku puzzle requires:

Following strict rules: Each row, column, and box must contain numbers 1-9 without repetition.
Maintaining a consistent grid format: The model needs to keep track of the entire grid's state.
Applying step-by-step logical reasoning: The solution involves a series of deductions based on the current state of the grid.
Understanding spatial relationships: The model must grasp how numbers in one part of the grid affect others.
Arriving at a single correct solution: A well-formed puzzle has exactly one valid completed grid.
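
The row, column, and box rules above are easy to state in code, which is also what makes them usable as an automatic check on a model's output. The following is a minimal sketch, assuming a 9x9 list-of-lists grid in which 0 marks an empty cell; the function names are illustrative and not taken from Sharma's code.

```python
from typing import List

Grid = List[List[int]]  # 9x9 grid; 0 marks an empty cell (an assumption of this sketch)


def is_valid_placement(grid: Grid, row: int, col: int, digit: int) -> bool:
    """Return True if `digit` can go in the empty cell (row, col) without
    repeating in that cell's row, column, or 3x3 box."""
    if grid[row][col] != 0:
        return False  # the cell is already filled
    if digit in grid[row]:
        return False  # repeats in the row
    if any(grid[r][col] == digit for r in range(9)):
        return False  # repeats in the column
    top, left = 3 * (row // 3), 3 * (col // 3)
    for r in range(top, top + 3):
        for c in range(left, left + 3):
            if grid[r][c] == digit:
                return False  # repeats in the 3x3 box
    return True


def is_solved(grid: Grid) -> bool:
    """Return True if every row, column, and box holds exactly the digits 1-9."""
    target = set(range(1, 10))
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all({grid[r][c] for r in range(9)} == target for c in range(9))
    boxes_ok = all(
        {grid[r][c] for r in range(top, top + 3) for c in range(left, left + 3)} == target
        for top in (0, 3, 6)
        for left in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok
```

Note that `is_valid_placement` only checks for rule conflicts; a digit can pass this check and still be wrong for the puzzle's unique solution.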

What makes this particularly interesting is that language models aren’t designed for structured problem-solving. They are trained to predict text, not to follow logical rules or maintain grid structures. Yet with the right approach, they can learn these skills.

The Experiment

Model and Algorithm

The experiment used Qwen, a family of large language models developed by Alibaba Cloud. To teach it how to solve Sudoku puzzles, Sharma employed Group Relative Policy Optimization (GRPO). GRPO is a policy-gradient method related to PPO: for each prompt it samples a group of responses, scores them with a reward function, and uses each response's reward relative to the group average as its advantage, in place of a separately learned value model.
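
In GRPO as introduced in the DeepSeekMath work, the advantage of each sampled completion is computed relative to the other completions sampled for the same prompt. The sketch below shows only that normalization step, not the full training loop. The function name is hypothetical, and the use of the population standard deviation and the `eps` value are choices of this sketch.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` avoids division by
    zero when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one puzzle prompt: two solved it (reward 1), two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are discouraged. If every completion scores the same, all advantages are zero and that group contributes no learning signal.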

Training Process

Environment Setup: The environment was designed to simulate a 9x9 Sudoku grid. Each cell in the grid could be one of the numbers 1-9 or empty (represented by 0).
Action Space: The model could choose to place any number from 1-9 in any empty cell.
Reward Function: Valid placements earned a reward, and placements that violated Sudoku rules incurred a penalty. A large positive reward was given when a puzzle was solved correctly.
Training Phases:
- Initial Exploration: The model explored the action space randomly to gather initial data.
- Policy Optimization: Using GRPO, the model learned to maximize rewards by refining its policy.
- Fine-Tuning: The model was fine-tuned on a dataset of solved Sudoku puzzles to improve its performance.
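
The environment, action space, and reward function described above can be sketched as a small class. This is an illustration under stated assumptions, not Sharma's implementation: the class name, method names, and all three reward magnitudes are hypothetical, since the article does not give concrete values.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 9x9 grid; 0 marks an empty cell

# Hypothetical reward magnitudes; the article does not state the real ones.
VALID_MOVE_REWARD = 1.0
INVALID_MOVE_PENALTY = -1.0
SOLVED_BONUS = 10.0


def conflicts(grid: Grid, row: int, col: int, digit: int) -> bool:
    """True if `digit` already appears in the row, column, or 3x3 box of (row, col)."""
    if digit in grid[row]:
        return True
    if any(grid[r][col] == digit for r in range(9)):
        return True
    top, left = 3 * (row // 3), 3 * (col // 3)
    return any(
        grid[r][c] == digit
        for r in range(top, top + 3)
        for c in range(left, left + 3)
    )


class SudokuEnv:
    """Minimal Sudoku environment: the state is the grid, an action is (row, col, digit)."""

    def __init__(self, puzzle: Grid) -> None:
        self.grid = [row[:] for row in puzzle]  # copy so the caller's puzzle is untouched

    def empty_cells(self) -> List[Tuple[int, int]]:
        return [(r, c) for r in range(9) for c in range(9) if self.grid[r][c] == 0]

    def step(self, row: int, col: int, digit: int) -> Tuple[float, bool]:
        """Apply one placement and return (reward, done)."""
        illegal = (
            not 1 <= digit <= 9
            or self.grid[row][col] != 0
            or conflicts(self.grid, row, col, digit)
        )
        if illegal:
            return INVALID_MOVE_PENALTY, False  # the grid is left unchanged
        self.grid[row][col] = digit
        # If the givens were consistent and every placement was conflict-free,
        # a full grid is a solved grid.
        done = not self.empty_cells()
        return VALID_MOVE_REWARD + (SOLVED_BONUS if done else 0.0), done
```

One design consequence worth noting: a per-move reward like this pays out for any conflict-free digit, even one that later leads to a dead end, which is why a large terminal bonus for a fully solved grid matters.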

Results

After training, the Qwen model showed significant improvement in solving Sudoku puzzles. Key findings include:

Accuracy: The model achieved an accuracy rate of over 95% on a test set of 10,000 Sudoku puzzles.
Speed: On average, the model solved puzzles in under 1 second, demonstrating efficient reasoning and decision-making.
Generalization: The model was able to generalize well to new, unseen puzzles, indicating robust learning.
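
The article does not say how accuracy was scored. One common choice for puzzles with a unique solution is exact match: a prediction counts only if the whole grid equals the reference solution. A minimal sketch of that metric, with a hypothetical function name:

```python
from typing import List, Sequence

Grid = List[List[int]]


def exact_match_accuracy(predictions: Sequence[Grid], solutions: Sequence[Grid]) -> float:
    """Fraction of puzzles whose predicted grid equals the reference grid exactly."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have the same length")
    if not solutions:
        return 0.0
    correct = sum(1 for pred, ref in zip(predictions, solutions) if pred == ref)
    return correct / len(solutions)
```

Under this metric a grid with a single wrong cell scores zero, so it is stricter than per-cell accuracy.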

Architecture Details

Model Architecture: Qwen is a transformer-based model with multiple layers of self-attention. It leverages the attention mechanism to focus on relevant parts of the grid during each step.
State Representation: The state of the Sudoku grid was represented as a 9x9 matrix, where each cell contained either a number or a zero (for empty cells).
Action Selection: The model used a policy network to predict the next action (number and position) based on the current state of the grid.
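Value Function: A separate value function was reportedly trained to estimate the expected reward for each state. GRPO as originally described does not require one, since it uses the average reward of a group of sampled outputs as its baseline, so this component would be an addition to the base algorithm.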
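
Because a language model reads and writes text, the 9x9 matrix has to be serialized into the prompt and the chosen action has to be parsed back out of the completion. The sketch below shows one way to do both. The text layout and the `row=..., col=..., digit=...` move format are assumptions made for illustration; the article does not specify the actual prompt or output format.

```python
import re
from typing import List, Optional, Tuple

Grid = List[List[int]]  # 9x9 grid; 0 marks an empty cell


def grid_to_text(grid: Grid) -> str:
    """Serialize the grid as nine lines of space-separated digits."""
    return "\n".join(" ".join(str(value) for value in row) for row in grid)


# Hypothetical move format using 1-based coordinates, e.g. "row=3, col=7, digit=5".
MOVE_PATTERN = re.compile(r"row\s*=\s*(\d)\s*,\s*col\s*=\s*(\d)\s*,\s*digit\s*=\s*(\d)")


def parse_move(completion: str) -> Optional[Tuple[int, int, int]]:
    """Extract the first move from a completion as (row, col, digit).

    Row and column are converted to 0-based indices. Returns None if no
    move is found or any number falls outside 1-9.
    """
    match = MOVE_PATTERN.search(completion)
    if match is None:
        return None
    row, col, digit = (int(group) for group in match.groups())
    if not all(1 <= n <= 9 for n in (row, col, digit)):
        return None
    return row - 1, col - 1, digit
```

Returning `None` for unparseable output gives the training loop a clear signal it can penalize, separate from moves that parse but break Sudoku rules.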

Implementation Notes

Data Efficiency: The training process was data-efficient, requiring a relatively small dataset of solved puzzles to achieve high performance.
Scalability: The approach can be scaled to larger Sudoku grids (e.g., 16x16) with minor adjustments to the action space and reward function.
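
To make the scalability point concrete, the conflict check can be written so that the box size is derived from the grid size rather than hard-coded as 3. This is a minimal sketch assuming square grids whose side is a perfect square (4, 9, 16, ...); both function names are hypothetical, and `pattern_grid` exists only to produce a valid filled grid for trying the check out.

```python
from math import isqrt
from typing import List

Grid = List[List[int]]  # size x size grid; 0 marks an empty cell


def has_conflict(grid: Grid, row: int, col: int, digit: int) -> bool:
    """True if `digit` already appears in the row, column, or box of (row, col).

    Works for any grid whose side is a perfect square. Assumes the target
    cell itself is empty.
    """
    size = len(grid)
    box = isqrt(size)
    if box * box != size:
        raise ValueError("grid side must be a perfect square, e.g. 9 or 16")
    if digit in grid[row]:
        return True
    if any(grid[r][col] == digit for r in range(size)):
        return True
    top, left = box * (row // box), box * (col // box)
    return any(
        grid[r][c] == digit
        for r in range(top, top + box)
        for c in range(left, left + box)
    )


def pattern_grid(size: int) -> Grid:
    """Build one valid filled grid with a standard shifting pattern (for demos only)."""
    box = isqrt(size)
    return [
        [(box * (r % box) + r // box + c) % size + 1 for c in range(size)]
        for r in range(size)
    ]
```

With a 16x16 grid the same function checks 4x4 boxes and digits 1-16. The rule check itself scales trivially; the action space, however, grows from 9 x 81 to 16 x 256 possible placements.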

Conclusion

This experiment demonstrates that language models, when combined with reinforcement learning techniques like GRPO, can tackle structured problems like Sudoku. It opens up new possibilities for applying AI to tasks that require logical reasoning and step-by-step problem-solving, potentially extending beyond puzzles to more complex real-world scenarios.