LoRA vs. Full Finetuning: A Deep Dive into Performance and Catastrophic Forgetting

The Engineer

20 May 2024

Researchers compare Low-Rank Adaptation (LoRA) with full finetuning and find that LoRA saves memory and forgets less of the base model's capabilities, but learns less in the target domain.

In a recent study titled "LoRA Learns Less and Forgets Less," researchers compared Low-Rank Adaptation (LoRA) with full finetuning of large language models. LoRA is a parameter-efficient method: it freezes the pretrained weights and trains only low-rank perturbations to selected weight matrices, which significantly reduces memory usage. This efficiency comes with a trade-off. LoRA learns less in the target domain than full finetuning, although it better maintains the base model's performance on tasks outside that domain.
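To make the memory claim concrete, here is a minimal sketch of the parameter arithmetic. It assumes the standard LoRA parameterization, in which a frozen d x k weight matrix receives a trainable update B A, with B of shape d x r and A of shape r x k. The function names and the 4096 x 4096 layer size are illustrative assumptions, not figures from the study.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted d x k weight matrix.

    Assumes the update is delta_W = B @ A with B: d x r and A: r x k,
    so only r * (d + k) values are trained instead of d * k.
    """
    if r < 1 or r > min(d, k):
        raise ValueError("rank must be between 1 and min(d, k)")
    return r * (d + k)


def full_trainable_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is finetuned."""
    return d * k


# Hypothetical layer size, chosen only for illustration.
d, k = 4096, 4096
full = full_trainable_params(d, k)
for r in (4, 8, 16, 64, 256):
    lora = lora_trainable_params(d, k, r)
    print(f"rank {r:>3}: {lora:>9,} trainable vs {full:,} ({100 * lora / full:.2f}%)")
```

The trainable fraction grows linearly with the rank, which is why low ranks are attractive for memory but also limit how much the update can express.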

Key Findings

Performance Comparison: In standard low-rank settings, LoRA underperforms full finetuning in both the programming and mathematics domains. The comparison covers two data regimes:
- Instruction Finetuning: Approximately 100K prompt-response pairs
- Continued Pretraining: Approximately 20B unstructured tokens
Catastrophic Forgetting: LoRA preserves the base model's performance on tasks outside the target domain better than full finetuning does.
- Regularization Techniques: LoRA mitigates forgetting more effectively than common techniques such as weight decay and dropout.
Diverse Generations: LoRA helps maintain more diverse generated text, which matters for creative and varied outputs.

Technical Details

Performance Metrics

Programming Domain:
- LoRA: Lower target-domain performance than full finetuning.
- Full Finetuning: Higher target-domain performance, but with greater forgetting of base model capabilities.
Mathematics Domain:
- LoRA: The same trend as in the programming domain, with lower target-domain performance but better retention of base model capabilities.

Catastrophic Forgetting

Base Model Retention: LoRA retains the base model's performance on out-of-domain tasks more effectively than full finetuning.
- Regularization Techniques: Weight decay and dropout are less effective than LoRA at mitigating forgetting.
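One simple way to quantify forgetting is the average drop in accuracy on out-of-domain benchmarks relative to the base model. The sketch below assumes that definition. The benchmark names and all scores are hypothetical placeholders invented for illustration, not results from the study.

```python
from statistics import mean
from typing import Dict


def forgetting(base_scores: Dict[str, float], tuned_scores: Dict[str, float]) -> float:
    """Average drop in accuracy on out-of-domain benchmarks after finetuning.

    Positive values mean the finetuned model is worse than the base model.
    Only benchmarks present in both dictionaries are compared.
    """
    shared = sorted(base_scores.keys() & tuned_scores.keys())
    if not shared:
        raise ValueError("no benchmarks in common")
    return mean(base_scores[name] - tuned_scores[name] for name in shared)


# Hypothetical accuracies, invented for illustration only.
base = {"benchmark_a": 0.60, "benchmark_b": 0.70}
after_full = {"benchmark_a": 0.50, "benchmark_b": 0.62}
after_lora = {"benchmark_a": 0.58, "benchmark_b": 0.69}

print(f"full finetuning forgetting: {forgetting(base, after_full):.3f}")
print(f"LoRA forgetting:            {forgetting(base, after_lora):.3f}")
```

Tracking a number like this alongside the target-domain score makes the learning-versus-forgetting trade-off visible for each run.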

Diverse Generations

Text Diversity: LoRA generates more diverse text, which is beneficial for applications requiring varied outputs (e.g., creative writing, dialogue systems).
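Generation diversity can be estimated from a set of samples for the same prompt. The sketch below uses two common proxies, the fraction of unique generations and distinct-n (the share of n-grams that are unique). These are stand-in metrics chosen for illustration and are not necessarily how the study measured diversity. The sample texts are made up.

```python
from typing import List


def distinct_n(texts: List[str], n: int = 2) -> float:
    """Fraction of n-grams across all texts that are unique (distinct-n)."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # whitespace tokenization, a simplification
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def unique_fraction(texts: List[str]) -> float:
    """Fraction of generations that are distinct strings."""
    return len(set(texts)) / len(texts) if texts else 0.0


# Made-up samples: one model repeats itself, the other varies its output.
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran", "birds fly south"]

for name, samples in (("repetitive", repetitive), ("varied", varied)):
    print(name, round(unique_fraction(samples), 3), round(distinct_n(samples, 2), 3))
```

Higher values on either metric indicate more varied output; a model that has collapsed onto a few phrasings scores low on both.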

Implementation Insights

Rank of Perturbations:
- LoRA Configurations: Typically use low-rank perturbations (e.g., rank 4 or 8).
- Full Finetuning: Learns perturbations with a rank 10-100 times greater than that of typical LoRA configurations, which may explain part of the performance gap.
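The rank gap follows from a basic linear-algebra fact: a product of a d x r matrix and an r x k matrix has rank at most r, while an unconstrained d x k update can reach rank min(d, k). The sketch below checks this on toy integer matrices using exact Gaussian elimination. It is not the study's analysis of learned weights, and the sizes and helper names are assumptions made for illustration.

```python
import random
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matrix_rank(m):
    """Exact rank via Gaussian elimination over fractions."""
    rows = [[Fraction(v) for v in row] for row in m]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


rng = random.Random(0)
d, k, r = 12, 10, 2  # toy sizes, far smaller than a real layer

B = [[rng.randint(-3, 3) for _ in range(r)] for _ in range(d)]
A = [[rng.randint(-3, 3) for _ in range(k)] for _ in range(r)]
lora_update = matmul(B, A)  # d x k matrix, rank at most r
dense_update = [[rng.randint(-3, 3) for _ in range(k)] for _ in range(d)]

print("LoRA-style update rank:", matrix_rank(lora_update))
print("dense update rank:     ", matrix_rank(dense_update))
```

However the factors B and A are trained, the resulting update cannot exceed rank r, so a low-rank adapter cannot represent a high-rank change to the weights.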

Best Practices for LoRA

Choose Appropriate Rank:
- Start with standard low-rank settings (e.g., rank 4 or 8) and adjust based on specific task requirements.
Monitor Base Model Performance:
- Regularly evaluate the model's performance on out-of-domain tasks to ensure minimal forgetting.
Combine with Regularization Techniques:
- Weight decay and dropout can be used alongside LoRA, but the findings above treat them as weaker alternatives to LoRA for mitigating forgetting, so verify empirically whether combining them adds any benefit.
Experiment with Data Regimes:
- Test both instruction finetuning and continued pretraining data regimes, since the balance between target-domain performance and forgetting can differ between them.
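The practices above can be combined into a simple selection rule: sweep several ranks, record the target-domain score and the forgetting for each run, then keep the best-scoring run whose forgetting stays within a budget. The sketch below is one hypothetical way to do that. The `RunResult` fields, the `pick_rank` helper, the budget, and every number in the sweep are assumptions for illustration, not values from the study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunResult:
    rank: int            # LoRA rank used for this run
    target_score: float  # accuracy on the target-domain benchmark
    forgetting: float    # average out-of-domain accuracy drop vs. the base model


def pick_rank(results: List[RunResult], max_forgetting: float) -> Optional[RunResult]:
    """Return the best-scoring run whose forgetting stays within the budget.

    Ties on target score are broken in favor of the lower rank.
    Returns None when no run satisfies the budget.
    """
    allowed = [run for run in results if run.forgetting <= max_forgetting]
    if not allowed:
        return None
    return max(allowed, key=lambda run: (run.target_score, -run.rank))


# Hypothetical sweep results, invented for illustration only.
sweep = [
    RunResult(rank=4, target_score=0.30, forgetting=0.010),
    RunResult(rank=8, target_score=0.33, forgetting=0.015),
    RunResult(rank=16, target_score=0.35, forgetting=0.020),
    RunResult(rank=64, target_score=0.38, forgetting=0.050),
]

choice = pick_rank(sweep, max_forgetting=0.03)
print(choice)
```

Making the forgetting budget explicit forces the trade-off to be decided up front rather than after looking at target-domain scores alone.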

Conclusion

While LoRA is a powerful tool for parameter-efficient finetuning, it trades away some performance on target tasks. In return, it better preserves base model performance and output diversity, which makes it a valuable technique when retaining general capabilities matters. Researchers and practitioners should weigh these factors when choosing between LoRA and full finetuning for a specific use case.