Starling-7B: Advancing LLM Helpfulness and Harmlessness with RLAIF

The Engineer

28 Nov 2023

Starling-7B uses Reinforcement Learning from AI Feedback (RLAIF) to improve helpfulness and safety. It outscores comparable open models in GPT-4-judged evaluations and is trained on Nectar, a GPT-4-labeled ranking dataset.

Starling-LM-7B is a new open-source large language model (LLM) trained with Reinforcement Learning from AI Feedback (RLAIF). Developed by researchers at UC Berkeley, this 7-billion-parameter model uses Nectar, a new GPT-4-labeled ranking dataset, together with a new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 on MT Bench with GPT-4 as the judge, outperforming every model evaluated to date except OpenAI’s GPT-4 and GPT-4 Turbo.

Key Technical Changes and Why They Matter

Reinforcement Learning from AI Feedback (RLAIF)
- RLAIF is a variant of Reinforcement Learning from Human Feedback (RLHF), but it uses AI-generated feedback instead of human labels.
- This approach allows for scaling up the training process by leveraging the vast amount of data that can be generated by powerful AI models like GPT-4.
Nectar Dataset
- Nectar is a high-quality ranking dataset specifically designed for chat applications.
- It consists of 183K chat prompts, each with 7 responses drawn from models such as GPT-4, GPT-3.5-instruct, GPT-3.5-turbo, Mistral-7B-Instruct, and Llama2-7B, for a total of roughly 3.8M pairwise comparisons.
- To reduce positional bias in the rankings, both the ranking prompts and the order in which responses are presented were designed with care.
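
The 3.8M figure follows from the numbers above: 7 responses give 21 unordered pairs per prompt, and 183K prompts times 21 is about 3.84M. The sketch below checks that arithmetic and shows one plausible way to expand a single ranking into preference pairs. The helper `ranking_to_pairs` and the response labels are hypothetical and are not taken from the Nectar release.

```python
from itertools import combinations
from math import comb

NUM_PROMPTS = 183_000          # "183K" from the article, treated as a round figure
RESPONSES_PER_PROMPT = 7

pairs_per_prompt = comb(RESPONSES_PER_PROMPT, 2)   # 7 choose 2 = 21
total_pairs = NUM_PROMPTS * pairs_per_prompt       # 3,843,000, about 3.8M


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    Hypothetical helper for illustration only.
    """
    # combinations() keeps input order, so the first element of every pair
    # is the response that was ranked higher.
    return list(combinations(ranked_responses, 2))


# Made-up labels standing in for seven ranked responses to one prompt.
example_ranking = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
example_pairs = ranking_to_pairs(example_ranking)
```

Each ranked list therefore yields 21 training comparisons, which is why a K-wise ranking carries more signal than a single pairwise label.
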
Reward Model and Policy Tuning Pipeline
- A reward model (Starling-RM-7B-alpha) was trained using Nectar to guide the policy tuning process.
- The pipeline steers the model toward helpfulness, harmlessness, and the other qualities reflected in GPT-4's rankings.
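
The article does not state which objective was used to train the reward model. As a rough illustration of how ranked comparisons can become a training signal, the sketch below uses a pairwise Bradley-Terry loss, a common choice for reward models. This is an assumption, not a description of the Starling pipeline, which may use a different (for example K-wise) objective. The function name is hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the preferred response under Bradley-Terry.

    loss = -log(sigmoid(r_preferred - r_rejected))
         = log(1 + exp(-(r_preferred - r_rejected)))
    """
    margin = reward_preferred - reward_rejected
    # Two algebraically equal forms, chosen so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the reward model scores the preferred response higher.
loss_when_tied = pairwise_preference_loss(0.0, 0.0)      # equals log(2)
loss_when_correct = pairwise_preference_loss(2.0, 0.0)
loss_when_wrong = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many pairs pushes the reward model to assign higher scores to responses that the judge ranked higher.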

Implementation Details

Model Architecture:
- Starling-7B is a transformer-based model with 7 billion parameters.
- The architecture is similar to that of other large language models; the difference lies in the RLAIF fine-tuning.
Training Process:
- The initial model was pre-trained on a large corpus of text data.
- It was then fine-tuned using the Nectar dataset and the reward model.
- The training process involved multiple iterations of policy optimization to ensure the model aligns with the desired qualities.
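
The article does not name the policy optimization algorithm. Many RLHF-style pipelines subtract a KL-style penalty from the reward model score so that the policy does not drift too far from its starting point. The sketch below shows that reward shaping under stated assumptions: the function name and the default `beta` are hypothetical, and nothing here is confirmed as the Starling training recipe.

```python
def kl_shaped_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.1):
    """Reward model score minus a penalty for diverging from the reference model.

    logprob_policy and logprob_reference are the log-probabilities that the
    current policy and the frozen reference model assign to the same sampled
    response. Their difference is a single-sample estimate of the KL divergence.
    """
    kl_estimate = logprob_policy - logprob_reference
    return reward_model_score - beta * kl_estimate


# No penalty when the policy still agrees with the reference model.
unchanged = kl_shaped_reward(1.0, logprob_policy=-2.0, logprob_reference=-2.0)

# A response the policy now favors more than the reference model is penalized.
drifted = kl_shaped_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0, beta=0.5)
```

A larger `beta` keeps the tuned model closer to the initial model, while a smaller one lets the reward model dominate.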

Benchmarks:
- Starling-7B-alpha achieved a score of 8.09 on MT Bench with GPT-4 as the judge, behind only GPT-4 and GPT-4 Turbo.
- It outperforms leading SFT models such as OpenHermes 2.5 and Openchat 3.5 in the same evaluation.
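
For readers unfamiliar with how a single MT Bench number comes about: the judge rates each answer on a 1 to 10 scale, and the headline score is the average of those ratings. The sketch below illustrates the aggregation only. The ratings are made up for the example and are not Starling results, and `mt_bench_style_score` is a hypothetical helper rather than part of any official tooling.

```python
from statistics import mean


def mt_bench_style_score(ratings_by_question):
    """Average the judge's 1-10 ratings over every turn of every question."""
    all_ratings = [
        rating
        for turn_ratings in ratings_by_question.values()
        for rating in turn_ratings
    ]
    if not all_ratings:
        raise ValueError("no ratings to average")
    return mean(all_ratings)


# Made-up ratings: two questions, two conversation turns each.
illustrative_ratings = {"question_1": [9, 8], "question_2": [7, 8]}
illustrative_score = mt_bench_style_score(illustrative_ratings)
```

Because the score is a mean of judge ratings, small differences between models are best read alongside other evaluations.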

Why This Matters to Practitioners

Enhanced Model Performance:
- By using RLAIF, Starling-7B demonstrates that AI-generated feedback can significantly enhance model performance.
- This opens up new possibilities for scaling up training processes and improving model quality without the need for extensive human labeling.
High-Quality Dataset:
- The Nectar dataset provides a valuable resource for researchers and practitioners working on chat applications.
- It can be used to train and evaluate other models, contributing to the overall advancement of the field.
Open Source Contributions:
- The release of Starling-7B, Starling-RM-7B-alpha, and the Nectar dataset on Hugging Face makes these resources accessible to the wider AI community.
- This fosters collaboration and further innovation in the development of language models.

Conclusion

Starling-7B shows how far AI-generated feedback can take a 7-billion-parameter model. By combining RLAIF with a high-quality ranking dataset, it reaches an MT Bench score of 8.09, behind only GPT-4 and GPT-4 Turbo, while remaining open-source and accessible to the community. The work showcases the potential of AI-generated feedback and provides resources for future research and development.