Chain of Preference Optimization Boosts Chain-of-Thought Reasoning in LLMs Without Inference Overhead

Models & Research

The Engineer

18 Jun 2024

Researchers introduce Chain of Preference Optimization, a technique that enhances the logical reasoning abilities of large language models without adding computational burden, surpassing traditional CoT and ToT methods.

Recent advancements in large language models (LLMs) have introduced chain-of-thought (CoT) decoding, enabling these models to generate explicit logical reasoning paths for complex problem-solving. However, research has shown that a single CoT path is not always the best one available. The tree-of-thought (ToT) method addresses this by using tree-search algorithms to explore a broader reasoning space, but it significantly increases inference complexity.

In their latest paper, "Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs," Xuan Zhang and colleagues from the Singapore University of Technology and Design (SUTD) present a novel approach that leverages ToT to fine-tune LLMs, thereby improving CoT performance without the computational overhead. This method, called Chain of Preference Optimization (CPO), aligns each step of the CoT reasoning paths with those generated by ToT, using the inherent preference information from the tree-search process.

Key Technical Changes and Their Impact

Tree-of-Thought (ToT) Method:
- What Changed: ToT uses a tree-search algorithm to explore multiple reasoning paths and identify the best ones.
- Why It Matters: While CoT can generate logical steps, it often misses out on better solutions. ToT helps in finding these more effective paths but at a higher computational cost.

Chain of Preference Optimization (CPO):
- What Changed: CPO fine-tunes LLMs by aligning their CoT reasoning with the optimal paths identified by ToT.
- Why It Matters: By using the preference information from ToT, CPO enables LLMs to generate more accurate and efficient CoT reasoning without the need for extensive tree-search during inference.
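
To make the cost gap concrete, here is a rough back-of-the-envelope sketch of how many LLM calls a breadth-first ToT search might need compared with a single CoT pass. The counting model is an assumption for illustration only: one generation call and one evaluation call per candidate thought, with a fixed number of states kept at each level. The function name and parameters are hypothetical and do not come from the paper.

```python
def tot_bfs_calls(depth: int, breadth: int, candidates: int) -> int:
    """Estimate LLM calls for a breadth-first ToT search (illustrative model).

    depth      -- number of reasoning steps in the tree
    breadth    -- states kept after each level
    candidates -- thoughts proposed per kept state
    """
    states = 1  # the search starts from the question alone
    total = 0
    for _ in range(depth):
        generated = states * candidates
        total += generated  # assumed: one generation call per candidate thought
        total += generated  # assumed: one evaluation call per candidate thought
        states = min(breadth, generated)  # prune to the best `breadth` states
    return total


COT_CALLS = 1  # a CoT answer is decoded in a single pass

print(tot_bfs_calls(depth=3, breadth=5, candidates=5))  # 110
print(COT_CALLS)                                        # 1
```

Under these assumptions a three-step search already costs on the order of a hundred calls per question, which is the overhead CPO aims to move from inference time to training time.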

Implementation Details

Fine-Tuning Process:
- The researchers first use ToT to construct a search tree of potential reasoning paths.
- Each node in the tree represents a step in the reasoning process, with edges indicating transitions between steps.
- The tree is annotated with preference scores based on how well each path leads to a correct solution.
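
The tree described above can be sketched as a small data structure. The example below is a minimal illustration, not the authors' code: it assumes each node carries a single evaluator score, follows the highest-scoring child at each level, and treats the rejected siblings at that level as dispreferred. All names (ThoughtNode, preference_pairs) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ThoughtNode:
    """One reasoning step in a hypothetical ToT search tree."""
    text: str
    score: float = 0.0  # assumed evaluator score; higher is better
    children: List["ThoughtNode"] = field(default_factory=list)


def preference_pairs(root: ThoughtNode) -> List[Tuple[str, str, str]]:
    """Return (prefix, preferred, dispreferred) triples along the best path.

    At every level the highest-scoring child is kept as the preferred
    step, and each sibling it beat yields one preference pair.
    """
    pairs = []
    prefix = [root.text]
    node = root
    while node.children:
        chosen = max(node.children, key=lambda c: c.score)
        for sibling in node.children:
            if sibling is not chosen:
                pairs.append((" ".join(prefix), chosen.text, sibling.text))
        prefix.append(chosen.text)
        node = chosen
    return pairs


# A toy tree for one arithmetic question (scores are made up).
root = ThoughtNode("Q: What is 3 * (4 + 2)?", children=[
    ThoughtNode("Step 1: 4 + 2 = 6.", score=0.9, children=[
        ThoughtNode("Step 2: 3 * 6 = 18.", score=0.95),
        ThoughtNode("Step 2: 3 * 6 = 16.", score=0.10),
    ]),
    ThoughtNode("Step 1: 3 * 4 = 12.", score=0.2),
])

pairs = preference_pairs(root)
for prefix, preferred, dispreferred in pairs:
    print(f"{prefix!r}: prefer {preferred!r} over {dispreferred!r}")
```

Each triple records the reasoning so far, the step the search kept, and a step it discarded, which is the kind of per-step preference signal the article says CPO draws from the tree.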

Training LLMs:
- LLMs are fine-tuned using the reasoning paths generated by ToT as the reference.
- During training, the model learns to align each of its reasoning steps with the corresponding step on the preferred paths identified in the tree.
- This alignment is achieved through a preference-based loss function that rewards the preferred reasoning steps over the alternatives the search set aside.
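
The article does not spell out the loss, so the sketch below assumes a Direct Preference Optimization (DPO) style objective applied to one pair of reasoning steps. It takes the summed log-probabilities of the preferred and dispreferred step under the model being trained and under a frozen reference model. The function name, argument names, and the default beta are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_preferred: float,
             policy_logp_dispreferred: float,
             ref_logp_preferred: float,
             ref_logp_dispreferred: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for a single (preferred, dispreferred) step pair.

    The margin measures how much more the policy favors the preferred
    step than the reference model does; the loss is -log(sigmoid(margin)).
    """
    margin = beta * (
        (policy_logp_preferred - ref_logp_preferred)
        - (policy_logp_dispreferred - ref_logp_dispreferred)
    )
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With policy == reference the loss is log(2); it drops as the policy
# shifts probability toward the preferred step.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(baseline > improved)  # True
```

In a real training loop the log-probabilities would come from the model and the loss would be averaged over all step-level pairs in a batch; scalar floats are used here only to keep the example self-contained.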

Experimental Results

Benchmarks:
- The researchers evaluated CPO on a variety of tasks, including question answering, fact verification, and arithmetic reasoning.
- They compared the performance of LLMs fine-tuned with CPO against those using traditional CoT and ToT methods.

Performance Improvements:
- CPO significantly improved the accuracy of LLMs in all tested tasks.
- For example, on a complex question answering dataset, models fine-tuned with CPO achieved 10% higher accuracy than those using standard CoT.
- The inference time for CPO-fine-tuned models was comparable to that of CoT, making it a practical solution for real-world applications.

Conclusion

Chain of Preference Optimization (CPO) offers a promising approach to enhance the reasoning capabilities of LLMs without the inference-time overhead associated with tree-search methods. By fine-tuning models to align their reasoning paths with those generated by ToT, CPO helps LLMs generate more accurate solutions to complex problems at the cost of ordinary CoT decoding.

If you're working on improving the logical reasoning of your LLMs, CPO is definitely worth exploring. The researchers have made their code available for further experimentation and development.