New Techniques for Better LLM Alignment: CLAIR and APO

The Engineer

16 Aug 2024

Researchers introduce CLAIR and APO to enhance Large Language Model alignment, addressing limitations in current preference-based training methods and improving AI adherence to human preferences.

Large Language Models (LLMs) have made significant strides in natural language processing, but aligning them with human preferences remains difficult. The paper "Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment" by D'Oosterlinck et al. introduces two techniques to improve the alignment process: Contrastive Learning from AI Revisions (CLAIR) and Anchored Preference Optimization (APO).

What Changed Technically?

The authors argue that aligning LLMs on preference-pair datasets often produces subpar results because the process is underspecified, both in the training data and in the training objective. To address this, they introduced:

Contrastive Learning from AI Revisions (CLAIR):
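- Data Creation Method: CLAIR creates preference pairs by revising a model output, so the preferred and rejected responses differ mainly in what the revision changed.
- Why It Matters: When the two responses in a pair are closely matched, the difference between them gives the model a clearer learning signal, leading to better alignment.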
Anchored Preference Optimization (APO):
- Controllable and Stable Objective: APO is an alignment objective that provides more control over the model during training.
- Why It Matters: More control leads to more stable and consistent performance, reducing the variability in alignment results.
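
To make the CLAIR idea concrete, here is a minimal sketch of how one contrastive pair might be assembled. It assumes a `revise` callable standing in for whatever model performs the revision; `PreferencePair`, `build_clair_pair`, and `toy_revise` are hypothetical names for illustration, not taken from the paper's code. The toy reviser only fixes one known error so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the revised answer, treated as preferred
    rejected: str  # the original model answer


def build_clair_pair(
    prompt: str,
    draft: str,
    revise: Callable[[str, str], str],
) -> PreferencePair:
    """Pair a draft with its revision; the revision becomes the preferred side."""
    revised = revise(prompt, draft)
    return PreferencePair(prompt=prompt, chosen=revised, rejected=draft)


def toy_revise(prompt: str, draft: str) -> str:
    # Hypothetical stand-in for a reviser model: corrects a single slip
    # and leaves the rest of the draft untouched.
    return draft.replace("Sydney", "Canberra")


pair = build_clair_pair(
    "What is the capital of Australia?",
    "The capital of Australia is Sydney.",
    toy_revise,
)
print(pair.chosen)
print(pair.rejected)
```

Because the reviser changes only what needs fixing, the two sides of the pair differ in exactly the part that makes one answer better.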

Key Findings

Contrastive Data Improves Learning Signals:
- Preference data gives a better learning signal when the underlying responses are contrastive.
- When the two responses differ only in what makes one of them better, the model can more easily tell which change to learn from.
Controllable Objectives Enhance Performance:
- Alignment objectives that specify more control over the model during training lead to better performance.
- APO, in particular, outperforms less controllable objectives consistently.
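
One rough way to see what "contrastive" means is to measure how much the chosen and rejected answers overlap. The sketch below uses Python's `difflib.SequenceMatcher` over whitespace tokens as an assumed proxy for that overlap; this article does not say which metric the authors used, and the example strings are made up.

```python
from difflib import SequenceMatcher


def token_overlap(chosen: str, rejected: str) -> float:
    """Similarity ratio in [0, 1] over whitespace tokens; higher means more shared text."""
    return SequenceMatcher(None, chosen.split(), rejected.split()).ratio()


rejected = "The capital of Australia is Sydney."
minimal_revision = "The capital of Australia is Canberra."
unrelated_answer = "Australia is a large country with many famous cities."

close = token_overlap(minimal_revision, rejected)
far = token_overlap(unrelated_answer, rejected)

# A minimally revised answer shares most of its tokens with the original,
# so the remaining difference points at what was actually wrong.
print(f"minimal revision overlap: {close:.2f}")
print(f"unrelated answer overlap: {far:.2f}")
```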

Experimental Setup

The authors aligned Llama-3-8B-Instruct using various datasets and alignment objectives. They measured performance on MixEval-Hard, a benchmark whose scores correlate highly with human judgments.

Datasets:
- CLAIR preferences
- Standard preference pairs
- Randomly generated pairs
Alignment Objectives:
- APO
- Traditional contrastive alignment
- Baseline objectives

Results

CLAIR Preferences Lead to Strongest Performance:
- Models trained on CLAIR preferences outperformed models trained on any of the other datasets.
APO Outperforms Less Controllable Objectives:
- APO consistently delivered better results than less controllable alignment objectives.
Best Model Performance:
- The best model, trained on 32K CLAIR preferences with APO, improved on Llama-3-8B-Instruct by 7.65% on MixEval-Hard.
- This improvement closed 45% of the gap with GPT4-turbo.
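
The two headline numbers can be cross-checked with simple arithmetic. Assuming both are measured on the same score scale, a gain of 7.65 that closes 45% of the gap implies the original gap to GPT4-turbo was about 17 points. This is a back-of-the-envelope inference from the figures above, not a number reported in this article.

```python
improvement = 7.65  # reported gain over Llama-3-8B-Instruct
gap_closed = 0.45   # reported fraction of the gap to GPT4-turbo that was closed

# Assumption: both figures refer to the same score scale.
implied_gap = improvement / gap_closed
remaining_gap = implied_gap - improvement

print(f"implied original gap: {implied_gap:.1f}")
print(f"implied remaining gap: {remaining_gap:.2f}")
```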

Implementation Details

CLAIR:
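- Starts from a model output and revises it, producing a pair in which the revision is the preferred response and the original is the rejected one.
- Because both responses share a starting point, the pair is more contrastive than two independently written answers, which makes it more useful for training.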
APO:
- Provides a stable and controllable objective during training.
- Reduces variability in alignment results, leading to more consistent performance.
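
This article does not give APO's formula, so the following is only a sketch of what an anchored objective can look like. It follows two variants that are commonly described under the names APO-zero and APO-down; treat the exact form as an assumption and check the paper before relying on it. Each answer is scored by its log-probability ratio against a fixed reference model (the anchor), and the loss states explicitly whether each answer's likelihood should rise or fall. The function names, `beta`, and the input values are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def apo_zero_loss(chosen_logratio: float, rejected_logratio: float, beta: float = 0.1) -> float:
    """Assumed APO-zero form: raise the chosen answer's likelihood, lower the rejected one's.

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x) for one answer.
    """
    return (1.0 - sigmoid(beta * chosen_logratio)) + sigmoid(beta * rejected_logratio)


def apo_down_loss(chosen_logratio: float, rejected_logratio: float, beta: float = 0.1) -> float:
    """Assumed APO-down form: lower both likelihoods, the rejected answer's by more."""
    margin = chosen_logratio - rejected_logratio
    return sigmoid(beta * chosen_logratio) + (1.0 - sigmoid(beta * margin))


# At initialization the policy equals the reference, so both log-ratios are 0.
start = apo_zero_loss(0.0, 0.0)
# Chosen answer became more likely, rejected answer less likely.
after_zero = apo_zero_loss(2.0, -2.0)
# Both answers became less likely, the rejected one by more.
after_down = apo_down_loss(-1.0, -3.0)

print(start, after_zero, after_down)
```

In this sketch the two variants differ only in which direction they push the chosen answer, which is the kind of control a purely relative objective leaves unspecified.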

Conclusion

The introduction of CLAIR and APO represents a significant step forward in aligning LLMs with human preferences. By generating more contrastive data and using a more controllable alignment objective, these techniques can help improve the performance and reliability of large language models.