Steering Llama-2 with Contrastive Activation Additions for Better Alignment and Generalization

The Engineer

8 Jan 2024

Researchers explore how contrastive activation additions can guide Llama-2 to align with human values more effectively, surpassing traditional fine-tuning and few-shot prompting techniques in achieving better generalization and reduced harmful outputs.

Steering large language models (LLMs) like Llama-2 toward human values and away from harmful outputs is a central challenge in AI research. A recent paper shared on the Alignment Forum explores how contrastive activation addition can steer Llama-2, and reports improvements over traditional methods like few-shot prompting and fine-tuning.

How Contrastive Activation Addition Works

Contrastive activation addition (CAA) involves modifying the hidden activations of a pre-trained model during inference. The process works as follows:

Contrastive Pairs: You collect the model's activations on paired examples, one showing the desirable behavior (positive) and one showing the undesirable behavior (negative). Averaging the difference between the positive and negative activations gives a single steering vector.
Addition During Inference: The steering vector, scaled by a positive or negative coefficient, is added to the model's hidden states at a chosen layer. This shifts the model's output toward (or away from) the target behavior.
Layer Selection: The choice of which layers to modify is crucial. Experimentation shows that adding vectors to earlier layers (e.g., layer 10 in Llama-2) often yields better results.
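
The steps above can be sketched with toy data. This is a minimal illustration, not the paper's code: plain Python lists stand in for hidden states at one layer, and the names `mean_difference`, `add_steering`, and `coefficient` are made up for the example. It assumes the positive and negative activations are combined into one steering vector by averaging their differences, which is then scaled by a coefficient before being added.

```python
from statistics import fmean


def mean_difference(positive, negative):
    """Average (positive - negative) per dimension over paired activations."""
    if len(positive) != len(negative) or not positive:
        raise ValueError("need equal, non-empty lists of paired activations")
    dim = len(positive[0])
    return [
        fmean(p[i] - n[i] for p, n in zip(positive, negative))
        for i in range(dim)
    ]


def add_steering(hidden_states, vector, coefficient=1.0):
    """Return new hidden states with coefficient * vector added at each position."""
    return [
        [h + coefficient * v for h, v in zip(token_state, vector)]
        for token_state in hidden_states
    ]


# Toy activations (hypothetical numbers) for two contrastive pairs at one layer.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
negative_acts = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
steering_vector = mean_difference(positive_acts, negative_acts)

# Hidden states for a two-token sequence at the same layer.
hidden = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
steered = add_steering(hidden, steering_vector, coefficient=2.0)
```

In a real model the same addition would typically happen inside a hook on the chosen layer. A negative `coefficient` pushes the hidden states in the opposite direction.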

Vectors Generalize to Open-Ended Generation

One of the key findings is that these steering vectors generalize well beyond the examples used to create them. For example, a vector built to reduce hallucinations can also reduce false statements in contexts that look quite different from the original examples. This generalization is particularly valuable for open-ended tasks where the input and output can be highly varied.

Specific Vectors: Hallucination and Anti-Discrimination

Hallucination Vector: This vector aims to reduce the generation of false or contradictory information. It was derived from pairs of inputs where one version contained a known hallucination and the other did not.
Anti-Discrimination Vector: This vector is designed to prevent biased or discriminatory outputs. It was derived from examples that highlight different forms of discrimination, such as gender or racial bias.
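
To make the idea of paired inputs concrete, here is a hypothetical sketch of how one contrastive pair might be represented. The article does not show the dataset format, so the `ContrastPair` class, its field names, and the example text are all assumptions for illustration. The point is that both texts share the same prompt and differ only in the completion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastPair:
    prompt: str
    positive: str  # completion showing the behavior we want more of
    negative: str  # completion showing the behavior we want less of

    def texts(self):
        """Return the (positive, negative) full texts to run through the model."""
        return (
            f"{self.prompt} {self.positive}",
            f"{self.prompt} {self.negative}",
        )


# Hypothetical example; the negative completion is deliberately false.
pairs = [
    ContrastPair(
        prompt="Question: What is the capital of France? Answer:",
        positive="Paris.",
        negative="Lyon, which became the capital in 1850.",
    ),
]

positive_text, negative_text = pairs[0].texts()
```

Because only the completion changes, the difference between the activations for the two texts should mostly reflect the behavior being contrasted rather than the topic of the prompt.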

Better Generalization Than Few-Shot Prompting and Fine-Tuning

The study compares CAA with two common methods for steering LLMs: few-shot prompting and fine-tuning.

Few-Shot Prompting: This method involves providing the model with a few examples of desired behavior. While effective in some cases, it often requires carefully crafted prompts and can be brittle.
Supervised Fine-Tuning: This involves training the model on a labeled dataset to adjust its weights. However, this can be computationally expensive and may overfit to the specific data used.

According to the study, CAA generalizes better than both methods: it steers the model across a broader range of tasks without any retraining.

Activation Addition vs. Fine-Tuning

It's important to note that activation addition is fundamentally different from fine-tuning:

Non-Destructive: Unlike fine-tuning, which modifies the model's weights, CAA only adds vectors during inference. This means the original model remains unchanged, and you can easily experiment with different steering vectors without retraining.
Flexibility: You can apply multiple vectors simultaneously or switch between them on the fly, which makes the approach more flexible for real-world applications.
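
The two properties above can be illustrated with a small sketch. The `SteeringBank` class and its methods are hypothetical names, and hidden states are again plain lists. It assumes multiple vectors are combined by a weighted sum; whether summed vectors compose cleanly in a real model is an empirical question.

```python
class SteeringBank:
    """Holds named steering vectors and the coefficients currently applied."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}  # name -> vector
        self.active = {}   # name -> coefficient

    def register(self, name, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected a vector of length {self.dim}")
        self.vectors[name] = list(vector)

    def enable(self, name, coefficient=1.0):
        if name not in self.vectors:
            raise KeyError(name)
        self.active[name] = coefficient

    def disable(self, name):
        self.active.pop(name, None)

    def combined(self):
        """Weighted sum of all currently enabled vectors."""
        total = [0.0] * self.dim
        for name, coefficient in self.active.items():
            for i, value in enumerate(self.vectors[name]):
                total[i] += coefficient * value
        return total

    def apply(self, hidden_state):
        """Return a steered copy; the input (and any model weights) stay untouched."""
        return [h + d for h, d in zip(hidden_state, self.combined())]


bank = SteeringBank(dim=3)
bank.register("hallucination", [1.0, 0.0, 0.0])
bank.register("anti_discrimination", [0.0, 2.0, 0.0])

bank.enable("hallucination", coefficient=-1.0)
bank.enable("anti_discrimination", coefficient=0.5)
both = bank.apply([1.0, 1.0, 1.0])

bank.disable("hallucination")
one = bank.apply([1.0, 1.0, 1.0])
```

Disabling every vector returns the hidden state unchanged, which is the sense in which the approach is non-destructive.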

Steering Vectors Don't Hurt Capabilities Much

One concern with steering techniques is that they might degrade the model's overall performance. However, the study shows that CAA has minimal impact on the model's capabilities. The vectors effectively steer the model towards desired behaviors without significantly compromising its general language skills.
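
One way to check a claim like this is to score the same benchmark questions with and without steering and compare the results. The sketch below uses toy stand-ins: `baseline` and `steered` are hypothetical functions in place of real model calls, and the questions are invented for the example, not taken from the study.

```python
def accuracy(predict, questions):
    """Fraction of (prompt, expected) items that predict answers correctly."""
    correct = sum(1 for prompt, expected in questions if predict(prompt) == expected)
    return correct / len(questions)


def capability_delta(baseline_predict, steered_predict, questions):
    """Steered accuracy minus baseline accuracy (negative means degradation)."""
    return accuracy(steered_predict, questions) - accuracy(baseline_predict, questions)


# Toy benchmark and toy "models" (hypothetical stand-ins for real inference).
questions = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
answers = dict(questions)


def baseline(prompt):
    return answers[prompt]


def steered(prompt):
    # Pretend steering breaks exactly one answer.
    return "?" if prompt == "5+5" else answers[prompt]


delta = capability_delta(baseline, steered, questions)
```

A delta close to zero on a broad benchmark would support the claim that steering leaves general capabilities largely intact.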

Progress in Activation Additions

The research highlights ongoing progress in activation engineering, a field focused on developing techniques to modify and control the internal activations of neural networks. This includes advancements in creating more effective steering vectors and better understanding their impact on model behavior.

Conclusion

Contrastive activation additions offer a promising approach to aligning LLMs like Llama-2 with human values while largely preserving their capabilities. By building on these techniques, researchers can develop more reliable and trustworthy AI systems that are less prone to harmful outputs.