HEADLINE: Llama 3 Finetuning with Unsloth: 2x Faster, 68% Less VRAM, and 6x Longer Context

Tools & Engineering

The Engineer

26 Apr 2024 · 3 min read

Unsloth's update for Meta’s Llama 3 models slashes finetuning time and VRAM usage, making large-scale language model training more accessible on limited hardware.

Unsloth has just released a significant update for finetuning Meta’s latest Llama 3 models, bringing substantial improvements in speed, memory efficiency, and context length. These enhancements are particularly noteworthy for practitioners working with large-scale language models (LLMs) on limited hardware resources.

Key Improvements:

2x Faster Finetuning: For the 8B model, Unsloth finetunes at roughly 2x the speed (reported as 205%) of a Flash Attention 2 (FA2) + Hugging Face (HF) setup. The 70B model runs at roughly 1.8x the baseline speed (reported as 183%).
68% Less VRAM: The 70B model requires 68% less video RAM, while the 8B model uses 63% less VRAM. This reduction allows for more efficient use of GPU resources.
6x Longer Context Lengths: On an A100 80GB GPU, Llama-3 70B can now fit up to 48K total tokens (enough for a batch of 5 sequences at 8,192 tokens each), compared to just 7K tokens without Unsloth.

Performance Benchmarks:

Here’s a detailed comparison between Unsloth and the Hugging Face + FA2 setup:

| Model | GPU VRAM | Unsloth Speed | VRAM Reduction | Context Length | Hugging Face + FA2 Speed |
|-------------|--------|-------------------|--------------------|--------------------|------------------------|
| Llama-3 8B | 24GB | 2x | 63% | 3x longer | 1x |
| Llama-3 70B | 80GB | 1.8x | 68% | 6x longer | 1x |

Implementation Details:

To achieve these improvements, Unsloth leverages several optimizations:

Memory-Efficient Attention: By using a more memory-efficient attention mechanism, Unsloth reduces the VRAM footprint without sacrificing performance.
Quantization and Low-Rank Adaptation (LoRA): Quantizing models to 4-bit precision and applying LoRA with a rank of 32 helps in reducing both computational and memory requirements. This is particularly useful for fine-tuning large models on consumer-grade GPUs.
Long Context Support: Unsloth’s latest update includes support for longer context lengths, allowing the model to process more tokens at once. This is crucial for tasks requiring extensive context, such as summarization or document understanding.
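
To make the memory argument for 4-bit quantization plus LoRA concrete, here is a rough back-of-the-envelope sketch. It is not Unsloth code. The layer shapes are those of the public Llama-3 8B configuration, the set of adapted projections is an assumption (a commonly used choice, not confirmed by this article), and the weight-size estimate ignores quantization constants, activations, and optimizer state.

```python
# Back-of-the-envelope sizing for 4-bit weights plus rank-32 LoRA adapters.
# Shapes follow the public Llama-3 8B config; TARGETS is an assumed choice
# of adapted projections, not something this article specifies.

HIDDEN = 4096         # hidden size
INTERMEDIATE = 14336  # MLP intermediate size
KV_DIM = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) for each adapted linear layer in one block.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, targets=TARGETS, num_layers=NUM_LAYERS):
    """Trainable parameters added by LoRA.

    Each adapted (d_in x d_out) weight gains two low-rank factors,
    A (rank x d_in) and B (d_out x rank), i.e. rank * (d_in + d_out) values.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets.values())
    return per_layer * num_layers


def quantized_weight_gb(n_params, bits=4):
    """Approximate size of the frozen weights in decimal GB at the given bit width."""
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    adapters = lora_params(rank=32)
    print(f"LoRA rank 32 trainable parameters: {adapters:,}")
    # 8e9 is a round-number stand-in for the 8B model's parameter count.
    print(f"4-bit weights for ~8e9 parameters: {quantized_weight_gb(8e9):.1f} GB")
    print(f"16-bit weights for ~8e9 parameters: {quantized_weight_gb(8e9, bits=16):.1f} GB")
```

Under these assumptions the adapters add on the order of 1% of the base model's parameter count, which is why only a small fraction of the model needs gradients and optimizer state during finetuning.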

Practical Examples:

8B Model on Tesla T4:
- Colab Notebook: A Colab notebook is available for finetuning Llama-3 8B on a free Tesla T4 GPU: Llama-3 8b Notebook.
- LoRA Fine-Tuning: A community member tested LoRA fine-tuning of the 8B model in bf16 precision, achieving a VRAM usage of only 16GB.
70B Model on A100 80GB:
- Max Context Lengths: On an A100 80GB SXM machine, Unsloth allows for a maximum context length of about 48K tokens, compared to about 7.4K tokens with Hugging Face + FA2.
- VRAM vs Context Length Data (maximum context length in tokens):

| GPU VRAM | Unsloth (New) | Unsloth (Old) | Hugging Face+FA2 |
|----------|---------------|---------------|------------------|
| 48 GB | 7,698 | 2,875 | OOM |
| 80 GB | 48,053 | 18,332 | 7,433 |
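
The headline "6x longer context" figure can be checked directly against the table above. The short sketch below only restates the table's numbers and divides them; the dictionary layout and function name are illustrative choices, not part of any Unsloth API.

```python
# Maximum context lengths in tokens, copied from the table above.
# None marks a configuration that ran out of memory (OOM).
max_context = {
    "48 GB": {"unsloth_new": 7_698, "unsloth_old": 2_875, "hf_fa2": None},
    "80 GB": {"unsloth_new": 48_053, "unsloth_old": 18_332, "hf_fa2": 7_433},
}


def ratio(new, baseline):
    """Return new / baseline, or None when the baseline hit OOM."""
    if baseline is None:
        return None
    return new / baseline


def describe(value):
    """Format a ratio for display, handling the OOM case."""
    return "n/a (baseline OOM)" if value is None else f"{value:.2f}x"


if __name__ == "__main__":
    for vram, row in max_context.items():
        vs_old = ratio(row["unsloth_new"], row["unsloth_old"])
        vs_hf = ratio(row["unsloth_new"], row["hf_fa2"])
        print(f"{vram}: vs old Unsloth {describe(vs_old)}, vs HF+FA2 {describe(vs_hf)}")
```

On the 80 GB card the ratio against Hugging Face + FA2 comes out at roughly 6.5x, consistent with the rounded "6x" claim. On the 48 GB card no ratio can be computed because the baseline does not fit at all.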
