
Researchers explore how contrastive activation additions can guide Llama-2 to align with human values more effectively, surpassing traditional fine-tuning and few-shot prompting techniques in achieving better generalization and reduced harmful outputs.
Steering large language models (LLMs) like Llama-2 to align better with human values and reduce harmful outputs is a critical challenge in AI research. A recent paper from the Alignment Forum explores how contrastive activation additions can be used to steer Llama-2 effectively, offering improvements over traditional methods like few-shot prompting and fine-tuning.
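Contrastive activation addition (CAA) modifies the hidden activations of a pre-trained model during inference. A steering vector is computed from the difference in activations between contrasting examples of a behavior, and that vector is then added to the model's activations as it generates text, shifting the output toward or away from the behavior.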
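The idea can be sketched on toy numbers. This is an illustration only, not the study's code: the function names `steering_vector` and `add_steering`, the three-dimensional "activations", and the coefficient are all assumptions made for the example. In practice the activations would be read from a chosen layer of the model.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Mean of (positive - negative) over paired activation vectors."""
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need equally many, non-empty positive and negative activations")
    dim = len(pos_acts[0])
    return [
        fmean(p[i] - n[i] for p, n in zip(pos_acts, neg_acts))
        for i in range(dim)
    ]


def add_steering(hidden, vector, coeff):
    """Return hidden + coeff * vector without mutating the inputs."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


# Hypothetical activations for two contrastive pairs (toy values).
pos = [[1.0, 0.0, 2.0], [3.0, 1.0, 2.0]]  # examples showing the behavior
neg = [[0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]  # examples showing the opposite

vec = steering_vector(pos, neg)
steered = add_steering([0.5, 0.5, 0.5], vec, coeff=2.0)
print(vec, steered)
```

A positive coefficient pushes the hidden state along the vector and a negative one pushes it the other way, which is how a single vector can be used to strengthen or weaken a behavior.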
One of the key findings is that these steering vectors generalize well beyond the data used to create them. For example, a vector derived to reduce hallucinations can reduce false statements across a variety of contexts. This generalization is particularly valuable for open-ended tasks, where inputs and outputs vary widely.
The study compares CAA with two common methods for steering LLMs: few-shot prompting and fine-tuning.

According to the study, CAA outperforms both methods on generalization: it steers the model more effectively across a broader range of tasks, and it does so without retraining.
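It's important to note that activation addition is fundamentally different from fine-tuning: fine-tuning updates the model's weights through further training, whereas activation addition leaves the weights unchanged and only adjusts activations at inference time.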
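A toy example makes the contrast concrete. The sketch below assumes a single linear layer standing in for a model, with hypothetical names (`linear`, `forward`) and made-up numbers. It shows that steering changes the layer's output while the weights stay exactly as they were, and that a coefficient of zero recovers the original behavior.

```python
import copy


def linear(weights, x):
    """Plain matrix-vector product for a toy layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def forward(weights, x, steer=None, coeff=0.0):
    """Run the toy layer, optionally adding coeff * steer to its output."""
    hidden = linear(weights, x)
    if steer is not None:
        hidden = [h + coeff * s for h, s in zip(hidden, steer)]
    return hidden


weights = [[1.0, 2.0], [0.0, 1.0]]  # toy "model" parameters
snapshot = copy.deepcopy(weights)   # kept to check the weights never change
x = [1.0, 1.0]
steer = [0.5, -1.0]                 # hypothetical steering vector

baseline = forward(weights, x)
steered = forward(weights, x, steer, coeff=2.0)
restored = forward(weights, x, steer, coeff=0.0)
print(baseline, steered, restored, weights == snapshot)
```

Fine-tuning, by contrast, would overwrite `weights` itself, so undoing it would mean restoring a saved copy of the parameters.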
One concern with steering techniques is that they might degrade the model's overall performance. However, the study shows that CAA has minimal impact on the model's capabilities. The vectors effectively steer the model towards desired behaviors without significantly compromising its general language skills.
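One simple way to check for this kind of degradation is to score the model on the same benchmark questions with and without steering and compare accuracy. The sketch below is only an illustration of that comparison: the answers and predictions are invented toy data, not results from the study, and `accuracy` is a hypothetical helper.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Invented toy data: reference answers and two sets of model predictions.
answers        = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
baseline_preds = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "C"]
steered_preds  = ["A", "C", "B", "D", "A", "B", "C", "A", "A", "C"]

baseline_acc = accuracy(baseline_preds, answers)
steered_acc = accuracy(steered_preds, answers)
drop = baseline_acc - steered_acc
print(f"baseline={baseline_acc:.2f} steered={steered_acc:.2f} drop={drop:.2f}")
```

A small drop on a general benchmark alongside a large shift in the targeted behavior is the pattern the article describes.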
The research highlights ongoing progress in activation engineering, a field focused on developing techniques to modify and control the internal activations of neural networks. This includes advancements in creating more effective steering vectors and better understanding their impact on model behavior.
Contrastive activation additions offer a promising approach to aligning LLMs like Llama-2 with human values while largely preserving their capabilities. By building on these techniques, researchers can develop more reliable and trustworthy AI systems that are less prone to harmful outputs.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
8 January 2024