Unsloth's Gemma 7B: 2.43x Faster and 58% Less VRAM on A100 GPUs

Tools & Engineering

The Engineer

4 Mar 2024 · 3 min read

Unsloth's Gemma 7B outperforms Hugging Face models by 2.43 times in training speed while using 58% less VRAM, making it a more efficient choice for AI workloads on A100 GPUs.

Unsloth has announced significant performance improvements for their Gemma 7B and 2B models, making them faster and more efficient in both training and inference. These enhancements are particularly notable given the growing demand for powerful yet resource-efficient language models.

Key Performance Gains

Gemma 7B on 1x A100 80GB GPU:
- Training Speed: 243% faster than Hugging Face (HF) + Flash Attention 2 (FA2)
- VRAM Usage: 58% less VRAM
Gemma 2B on 1x A100 80GB GPU:
- Training Speed: 200% faster than HF + FA2
- VRAM Usage: 55% less VRAM

Benchmarks and Practical Implications

When compared to vanilla Hugging Face, Unsloth's Gemma models show even more impressive gains:

Gemma 7B:
- Training Speed: 2.53x faster
- VRAM Usage: 70% less VRAM
Token Capacity on A100 80GB GPU:
- Unsloth: Fits 40K total tokens (8192 * batch size of 5)
- FA2: Fits ~15K tokens
- Vanilla HF: Fits 9K tokens

These improvements are crucial for practitioners who need to train large models on limited hardware. The reduced VRAM usage allows for larger batch sizes, which can lead to better model convergence and faster training times.

Technical Breakdown

To achieve these performance gains, Unsloth had to tackle several technical challenges:

GeGLU vs Swiglu Activation Function:
- Gemma uses the GeGLU activation function, which is different from the more common Swiglu used in models like Llama. This required rewriting the manual autograd engine to support other activation functions. The team even had to derive the derivative for GeGLU using Wolfram Alpha.

Tied Embeddings:
- Unlike Mistral and Llama, Gemma's embedding layer and language model head (lm_head) share the same weights. This optimization reduces memory usage but required adjustments to the training pipeline to ensure correct weight updates.
256K Vocab Size:
- Gemma has a significantly larger vocabulary size compared to models like Llama and Mistral, which have vocab sizes of 32K. The increased vocab size necessitated rewriting the Cross Entropy Kernel to handle all vocab sizes, as the original kernel was limited by CUDA's max blocksize of 65536.
MLP Size:
- Gemma's MLP (Multi-Layer Perceptron) is much larger at 24576 units compared to Llama’s 11008 and Mistral’s 14336. This larger MLP contributes to the model's increased VRAM usage but also enhances its expressive power.

Additional Features

Unsloth has also made several other improvements:

Faster Inference:
- Unsloth now supports 2x faster inference, including for Gemma models.
Chat Templates:
- New chat templates have been added to support finetuning on conversational datasets. This includes formats like ChatML, Vicuna, and Zephyr.
Colab Notebooks:
- Unsloth has provided Colab notebooks for finetuning Gemma 7B and 2B on free Tesla T4 GPUs. These notebooks include pre-quantized 4-bit models for faster downloading.

Unsloth's Gemma 7B: 2.43x Faster and 58% Less VRAM on A100 GPUs

Key Performance Gains

Benchmarks and Practical Implications

Technical Breakdown

Additional Features

Source