Contrastive Preference Optimization Boosts LLM Performance in Machine Translation

The Engineer

24 Jan 2024

Researchers have developed Contrastive Preference Optimization, a technique that enhances moderate-sized language models for machine translation, bridging the performance gap with larger models without requiring extensive computational resources.

In a recent study, researchers from Johns Hopkins University and Microsoft introduced Contrastive Preference Optimization (CPO), a training method that substantially improves the performance of moderate-sized large language models (LLMs) in machine translation (MT). The paper, titled "Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation," was accepted at ICML 2024 and targets the gap between moderate-sized LLMs (7B or 13B parameters) and state-of-the-art conventional encoder-decoder models or larger LLMs such as GPT-4.

What Changed Technically?

The key change is a move from supervised fine-tuning (SFT) to CPO. SFT, the standard approach for adapting LLMs to a specific task, trains the model to reproduce human-written reference translations. This has two limitations:

Quality Issues: Although the reference data is human-generated, it sometimes contains errors, and a reference is not always the best available translation of its source sentence.
Ceiling at Reference Quality: SFT teaches the model to mimic the references and gives it no signal for rejecting flawed translations, so it tends to produce adequate rather than optimal output.

CPO addresses these issues by training models to avoid generating suboptimal translations. This is achieved through a contrastive learning framework where the model learns from pairs of translations, one better than the other. The goal is to push the model towards generating higher-quality translations by penalizing it for producing lower-quality ones.
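As a rough sketch of what this looks like as a training objective, the CPO loss described in the paper can be written as a preference term plus a likelihood term. The notation is chosen here for illustration: $x$ is the source sentence, $y_w$ and $y_l$ are the preferred and dispreferred translations, $\pi_\theta$ is the model being trained, $\sigma$ is the logistic function, and $\beta$ is a scaling hyperparameter.

```latex
\mathcal{L}_{\mathrm{CPO}}(\theta)
  = \underbrace{-\,\mathbb{E}_{(x,\,y_w,\,y_l)}
      \Big[\log \sigma\big(\beta \log \pi_\theta(y_w \mid x)
                         - \beta \log \pi_\theta(y_l \mid x)\big)\Big]}_{\text{preference term}}
    \;\underbrace{-\,\mathbb{E}_{(x,\,y_w)}
      \big[\log \pi_\theta(y_w \mid x)\big]}_{\text{likelihood term}}
```

The preference term is large when the model rates the dispreferred translation as likely as, or more likely than, the preferred one. The likelihood term keeps probability mass on the preferred translation.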

Key Findings and Implementation Details

Model Architecture: The researchers applied CPO to ALMA (Advanced Language Model-based trAnslator), a 13B-parameter LLM fine-tuned for translation.
Data and Parameter Efficiency: Training used only 22K parallel sentences and updated only 12M parameters, a small fraction of the model's weights.
Performance Gains:
- The resulting model, ALMA-R, achieved performance on par with or better than the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.
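- The paper's evaluation relies mainly on neural metrics from the COMET family, including reference-free ones, rather than on lexical-overlap metrics such as BLEU, which the authors consider insufficient for judging high-quality translations.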
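To put the 12M-parameter figure above in perspective, here is a back-of-the-envelope calculation. It assumes that the 12M figure counts trainable parameters and that the base model has roughly 13B weights. Both numbers are rounded, and the variable names are only illustrative.

```python
# Rounded figures from the article; treated here as approximate.
trainable_params = 12e6   # parameters updated during CPO training (assumed)
total_params = 13e9       # size of the base model (assumed)

trainable_fraction = trainable_params / total_params
print(f"Share of weights updated: {trainable_fraction:.4%}")
```

Under these assumptions, less than a tenth of a percent of the model's weights is updated.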

How It Works

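CPO operates by:

Contrastive Pairs: For each source sentence, several candidate translations are collected (in the paper, the human reference plus outputs from GPT-4 and ALMA) and scored with reference-free quality metrics. The highest-scoring candidate becomes the preferred translation and the lowest-scoring one the dispreferred translation.
Preference Learning: No separate reward model is trained. As in Direct Preference Optimization (DPO), from which CPO is derived, the preference loss is computed directly from the likelihoods the model assigns to the two translations. CPO also drops DPO's frozen reference model, which saves memory and compute.
Optimization: Training minimizes the preference loss together with a negative log-likelihood term on the preferred translation, so the model learns to favor the better translation while still assigning it high probability.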
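The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the DPO-style formulation in which the loss is computed directly from the model's own sequence log-probabilities, with no separate reward model. The names `Candidate`, `pick_pair`, and `cpo_loss`, the `beta` default, and all example scores are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    quality: float   # hypothetical score from a quality metric; higher is better
    logprob: float   # hypothetical sum of token log-probs under the model being trained


def pick_pair(candidates):
    """Return (preferred, dispreferred): the best- and worst-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: c.quality)
    return ranked[-1], ranked[0]


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cpo_loss(logp_preferred, logp_dispreferred, beta=0.1):
    """Preference term plus negative log-likelihood of the preferred translation."""
    prefer = -log_sigmoid(beta * (logp_preferred - logp_dispreferred))
    nll = -logp_preferred
    return prefer + nll


# Made-up candidates for a single source sentence.
candidates = [
    Candidate("reference translation", quality=0.82, logprob=-14.0),
    Candidate("system A translation", quality=0.91, logprob=-18.5),
    Candidate("system B translation", quality=0.74, logprob=-12.0),
]

preferred, dispreferred = pick_pair(candidates)
loss = cpo_loss(preferred.logprob, dispreferred.logprob)
print(preferred.text, "|", dispreferred.text, "|", round(loss, 3))
```

In a real training loop the log-probabilities would come from a forward pass of the LLM, and the loss would be averaged over a batch and backpropagated. Here they are fixed numbers so the arithmetic is easy to follow.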

Why It Matters

For practitioners in the field of machine translation and natural language processing (NLP), CPO offers several advantages:

Data Efficiency: Reduces the need for large, high-quality reference datasets.
Quality Improvements: Enhances the quality of translations by focusing on generating better outputs rather than just mimicking references.
Generality: The objective is not tied to one model, so in principle it can be applied to other LLMs and MT tasks.

Conclusion

The introduction of Contrastive Preference Optimization marks a significant step forward in improving the performance of moderate-sized LLMs in machine translation. By addressing the limitations of SFT and leveraging contrastive learning, CPO demonstrates that with the right approach, smaller models can achieve state-of-the-art results, making it a valuable addition to the NLP toolkit.