Accelerate LLM Fine-Tuning with Unsloth and Hugging Face TRL: 2x Speed, -40% Memory Usage


The Engineer

11 Jan 2024 · 3 min read

Unsloth roughly halves fine-tuning time for large language models and cuts memory usage by about 40%, which makes it a practical addition for developers working within the Hugging Face ecosystem.

If you've been frustrated by the painfully slow process of fine-tuning large language models (LLMs), you're not alone. Enter Unsloth, a lightweight library developed by the community to significantly speed up LLM fine-tuning while reducing memory usage and maintaining accuracy. This tool is fully compatible with the Hugging Face ecosystem, including the Hub, transformers, PEFT, and TRL libraries.

What is Unsloth?

Unsloth is a library designed to optimize the fine-tuning process of LLMs by overwriting certain parts of the modeling code with highly optimized operations. The key improvements are:

2x Faster Fine-Tuning: By manually deriving backpropagation steps and rewriting PyTorch modules into Triton kernels, Unsloth roughly halves training time in the benchmarks below.
-40% Memory Usage: The optimized operations reduce VRAM usage, making it feasible to fine-tune larger models on smaller GPUs.
0% Accuracy Degradation: Despite the optimizations, there is no loss in accuracy compared to standard QLoRA (Quantized Low-Rank Adaptation) methods.

Compatibility and Supported Architectures

Unsloth supports a wide range of NVIDIA GPUs, from GTX 1070 to H100, making it accessible for both hobbyists and professionals. It also integrates seamlessly with the entire trainer suite from the TRL library, including:

SFTTrainer (Supervised Fine-Tuning)
DPOTrainer (Direct Preference Optimization)
PPOTrainer (Proximal Policy Optimization)

At the time of writing, Unsloth supports the following model architectures:

Llama (CodeLlama, Yi, etc.)
Mistral
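
To make the TRL integration more concrete, here is a minimal sketch of supervised fine-tuning with `SFTTrainer` on an Unsloth-loaded Mistral model. It assumes `unsloth`, `trl`, and `transformers` are installed on a machine with a supported NVIDIA GPU, and that the dataset has a `text` column. The helper name `build_sft_trainer`, the LoRA settings, and the training hyperparameters are illustrative choices rather than recommendations, and the `SFTTrainer` keyword arguments follow the TRL API from around the time of writing; newer TRL releases may expect some of them elsewhere.

```python
MAX_SEQ_LENGTH = 2048  # illustrative context length

# Projection layers that receive LoRA adapters (a common choice for Llama/Mistral).
LORA_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]


def build_sft_trainer(train_dataset, output_dir="outputs"):
    """Hypothetical helper: return an SFTTrainer backed by an Unsloth model."""
    # Imported lazily so this file can still be imported without a GPU stack.
    from unsloth import FastLanguageModel
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # Load a 4-bit quantized base model through Unsloth's patched loader.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/mistral-7b-bnb-4bit",
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,  # let Unsloth pick the dtype for the GPU
        load_in_4bit=True,
    )

    # Attach LoRA adapters; only these small matrices are trained.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=LORA_TARGET_MODULES,
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing=True,
        random_state=3407,
        max_seq_length=MAX_SEQ_LENGTH,
    )

    return SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        dataset_text_field="text",  # assumes a "text" column
        max_seq_length=MAX_SEQ_LENGTH,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            max_steps=60,
            learning_rate=2e-4,
            logging_steps=1,
            output_dir=output_dir,
            optim="adamw_8bit",
        ),
    )
```

With a dataset loaded through `datasets.load_dataset`, calling `build_sft_trainer(dataset).train()` would start the run. The only Unsloth-specific part is how the model is loaded and wrapped; the trainer itself is used as usual.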

How It Works

Unsloth achieves its performance gains through a combination of manual backpropagation and Triton kernel optimization. Here’s a breakdown:

Manual Backpropagation: By manually deriving the backpropagation steps, Unsloth can avoid the overhead associated with automatic differentiation.
Triton Kernels: Rewriting PyTorch modules into Triton kernels allows for more efficient execution on GPUs, reducing both computation time and memory usage.
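
The idea of manual backpropagation can be sketched in plain Python. The example below is not Unsloth's code and does not use Triton; it assumes a LoRA-style linear layer `Y = XW + s(XA)B` with a toy loss (half the sum of squared outputs), writes the gradients for `A` and `B` by hand, and checks them against finite differences. All shapes, names, and the scaling factor `s` are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scale(m, s):
    return [[s * v for v in row] for row in m]


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, w, a, b, s):
    """Y = X W + s (X A) B: frozen weight W plus a low-rank update."""
    return add(matmul(x, w), scale(matmul(matmul(x, a), b), s))


def loss(x, w, a, b, s):
    """Toy scalar loss: half the sum of squared outputs."""
    y = lora_forward(x, w, a, b, s)
    return 0.5 * sum(v * v for row in y for v in row)


def lora_backward(x, w, a, b, s):
    """Hand-derived gradients of the toy loss with respect to A and B."""
    d_y = lora_forward(x, w, a, b, s)             # dL/dY = Y for this loss
    d_b = scale(matmul(transpose(matmul(x, a)), d_y), s)             # s (XA)^T dY
    d_a = scale(matmul(transpose(x), matmul(d_y, transpose(b))), s)  # s X^T dY B^T
    return d_a, d_b


def numeric_grad(f, m, eps=1e-6):
    """Central finite differences of f() with respect to each entry of m."""
    grad = [[0.0] * len(m[0]) for _ in m]
    for i in range(len(m)):
        for j in range(len(m[0])):
            old = m[i][j]
            m[i][j] = old + eps
            up = f()
            m[i][j] = old - eps
            down = f()
            m[i][j] = old
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def max_abs_diff(p, q):
    return max(abs(u - v) for rp, rq in zip(p, q) for u, v in zip(rp, rq))


rng = random.Random(0)


def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4 tokens, 3 input features, 2 output features, rank 2.
x, w, a, b, s = rand(4, 3), rand(3, 2), rand(3, 2), rand(2, 2), 0.5

d_a, d_b = lora_backward(x, w, a, b, s)
err_a = max_abs_diff(d_a, numeric_grad(lambda: loss(x, w, a, b, s), a))
err_b = max_abs_diff(d_b, numeric_grad(lambda: loss(x, w, a, b, s), b))
print("max gradient error for A:", err_a)
print("max gradient error for B:", err_b)
```

Because the backward pass is written out explicitly, it is clear which intermediate results it needs (here only `XA` and `dY`), which is the kind of control a hand-derived gradient gives over both computation and memory.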

Benchmarking

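Let’s look at some benchmarks to see how Unsloth performs compared to standard Hugging Face methods and other optimizations like Flash Attention 2. Speeds are expressed as multiples of the plain Hugging Face baseline (1x), and the last column is the change in VRAM usage with Unsloth.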

A100 40GB GPU Benchmarks

| Model | Dataset | 🤗 Hugging Face | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM Reduction |
| --- | --- | --- | --- | --- | --- |
| Code Llama 34b | Slim Orca | 1x | 1.01x | 1.94x | -22.7% |
| Llama-2 7b | Slim Orca | 1x | 0.96x | 1.87x | -39.3% |
| Mistral 7b | Slim Orca | 1x | 1.17x | 1.88x | -65.9% |
| Tiny Llama 1.1b | Alpaca | 1x | 1.55x | 2.74x | -57.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.24x | 1.88x | -11.6% |
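
To read these multipliers in everyday terms, the short sketch below converts the A100 figures into the fraction of baseline time and VRAM that remains. The speedups and VRAM changes are copied from the table; the 10-hour baseline run is a hypothetical number used only for illustration.

```python
# (model, Unsloth speedup vs. Hugging Face, VRAM change in percent), from the A100 table.
A100_ROWS = [
    ("Code Llama 34b", 1.94, -22.7),
    ("Llama-2 7b", 1.87, -39.3),
    ("Mistral 7b", 1.88, -65.9),
    ("Tiny Llama 1.1b", 2.74, -57.8),
    ("DPO with Zephyr", 1.88, -11.6),
]


def time_fraction(speedup):
    """Fraction of the baseline wall-clock time left after a given speedup."""
    return 1.0 / speedup


def vram_fraction(change_percent):
    """Fraction of the baseline VRAM left after a change such as -22.7 (%)."""
    return 1.0 + change_percent / 100.0


BASELINE_HOURS = 10.0  # hypothetical baseline run length, not a measured value

for model, speedup, vram_change in A100_ROWS:
    hours = BASELINE_HOURS * time_fraction(speedup)
    print(f"{model:16s} {hours:5.2f} h instead of {BASELINE_HOURS:.0f} h, "
          f"{vram_fraction(vram_change):.0%} of baseline VRAM")
```

A 1.94x speedup leaves about 52% of the original training time, and a -22.7% VRAM change leaves 77.3% of the original memory footprint.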

Free Colab T4 GPU Benchmarks

| Model | Dataset | 🤗 Hugging Face | 🤗 + PyTorch 2.1.1 | 🦥 Unsloth | 🦥 VRAM Reduction |
| --- | --- | --- | --- | --- | --- |
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7