FSDP and QLoRA Enable 70B LLM Training on Desktop GPUs

Tools & Engineering

The Engineer

11 Mar 2024 · 3 min read

Using FSDP and QLoRA, Answer.AI has developed a system that trains massive 70B-parameter LLMs on desktop GPUs, making advanced AI capabilities accessible to researchers and developers worldwide.

Making Large Language Model Training Accessible with FSDP-QLoRA

Today, Answer.AI is proud to announce the release of a groundbreaking open-source system that can efficiently train a 70-billion-parameter large language model (LLM) on a regular desktop computer equipped with two or more standard gaming GPUs (RTX 3090 or 4090). This achievement, made possible through a collaboration with Tim Dettmers from the University of Washington and Hugging Face’s Titus von Koeller and Sourab Mangrulkar, marks a significant step towards democratizing AI.

What Changed Technically?

The key innovation lies in the combination of Fully Sharded Data Parallel (FSDP) and Quantized LoRA (QLoRA). Here's a breakdown:

Fully Sharded Data Parallel (FSDP):
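- FSDP is a distributed training technique that shards a model's parameters, gradients, and optimizer state across multiple GPUs, so no single GPU has to hold the full model.
- Each GPU stores only its own shard. The full parameters of a layer are gathered just before that layer runs and freed again afterwards, which keeps peak memory low on GPUs with limited memory.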
Quantized LoRA (QLoRA):
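- QLoRA combines quantization with Low-Rank Adaptation (LoRA): the base model's weights are frozen and stored at reduced precision (4 bits in the original QLoRA work), and only small low-rank adapter matrices are trained on top of them, with little loss in model quality.
- Because the frozen weights need roughly a quarter of the memory they would take at 16 bits, and the adapters are small by comparison, training large models on consumer-grade hardware becomes feasible.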
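
The two halves of QLoRA described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the released system's code: QLoRA itself uses a 4-bit NormalFloat (NF4) data type, whereas this sketch uses plain symmetric absmax rounding. The function names, the example block of weights, the 8192 x 8192 layer shape, and the LoRA rank of 16 are all hypothetical and chosen only for illustration.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats.

    Simplified stand-in for 4-bit quantization (NOT the NF4 scheme QLoRA uses).
    Returns integer codes plus the scale needed to map them back to floats.
    """
    levels = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = max(abs(w) for w in weights) / levels or 1.0
    codes = [round(w / scale) for w in weights]       # integers in [-levels, levels]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in a rank-r update B @ A, with B: d_out x r and A: r x d_in."""
    return rank * (d_out + d_in)


# Hypothetical block of frozen base weights.
block = [0.12, -0.50, 0.33, 0.07]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)

# Illustrative layer shape and rank (assumptions, not values from the article).
full = 8192 * 8192                                    # 67,108,864 frozen weights
adapter = lora_trainable_params(8192, 8192, 16)       # 262,144 trainable weights
print(codes, restored)
print(adapter / full)                                 # fraction of weights that is trained
```

The frozen weights are stored as small integer codes and only the adapter receives gradients, which is why both the weight memory and the optimizer state shrink.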

Why It Matters

This system is a significant development for the open-source community and small labs. Traditionally, training LLMs of this size has required data center GPUs like the NVIDIA A100 or H100, and multi-GPU systems built from them can cost hundreds of thousands of dollars. In contrast, a desktop setup with dual RTX 4090 GPUs costs under $10,000 (or even less if using second-hand parts).

Despite the significant price difference, gaming GPUs like the RTX 4090 offer compute performance comparable to their data center counterparts. The primary limitation has been memory capacity: while data center GPUs can have up to 80GB of memory, consumer GPUs are capped at 24GB. The two techniques address this from different directions: QLoRA shrinks the model's weights through quantization, and FSDP splits what remains across the available GPUs.
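
A quick back-of-the-envelope calculation shows why both techniques are needed. The Python sketch below counts weight storage only; it deliberately ignores quantization constants, LoRA adapters, activations, gradients, and optimizer state, so real usage is higher. The function name is hypothetical, and the even two-way split is an assumption matching the dual-GPU setup described above.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 70e9      # 70-billion-parameter model
GPU_MEMORY_GB = 24   # RTX 3090 / RTX 4090
NUM_GPUS = 2         # assumed dual-GPU desktop

fp16_gb = weight_memory_gb(N_PARAMS, 16)   # 140.0 GB at 16 bits
int4_gb = weight_memory_gb(N_PARAMS, 4)    # 35.0 GB at 4 bits
per_gpu_gb = int4_gb / NUM_GPUS            # 17.5 GB per GPU when sharded evenly

print(f"16-bit weights: {fp16_gb} GB")
print(f"4-bit weights:  {int4_gb} GB (fits on one GPU: {int4_gb <= GPU_MEMORY_GB})")
print(f"4-bit, sharded: {per_gpu_gb} GB per GPU (fits: {per_gpu_gb <= GPU_MEMORY_GB})")
```

Quantization alone still leaves 35GB, more than a single 24GB card can hold, while sharding those weights across two cards leaves headroom on each for everything else training needs.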

Technical Details

Architecture:
- The system uses PyTorch’s Fully Sharded Data Parallel (FSDP) module to distribute the model across multiple GPUs. Unlike Distributed Data Parallel (DDP), which keeps a full copy of the model on every GPU, FSDP gives each GPU only a shard.
- QLoRA is integrated to quantize the frozen base weights, reducing the memory footprint of the model during training.
- The combination ensures that each GPU stores only a portion of the model's parameters, gradients, and optimizer state, making it feasible to train large models.
Benchmarks:
- Initial tests show that the system can train a 70-billion-parameter LLM on a dual RTX 4090 setup, with performance comparable to that of data center GPUs.
- Training times are competitive, and the memory efficiency allows for scaling up to larger models without requiring expensive hardware upgrades.
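
The sharding behaviour described under Architecture can be sketched as a small pure-Python simulation. This is a toy model under stated assumptions, not PyTorch's implementation: real FSDP flattens and pads parameters, overlaps communication with computation, and handles gradients and optimizer state as well. All function names and layer sizes here are hypothetical.

```python
def shard(flat_params, world_size):
    """Split a flat parameter list into world_size contiguous shards."""
    chunk = -(-len(flat_params) // world_size)        # ceiling division
    return [flat_params[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(shards):
    """Stand-in for the collective that rebuilds a layer's full parameters."""
    return [p for s in shards for p in s]


def peak_params_per_rank(layer_sizes, world_size):
    """Peak parameter count resident on one rank during a forward pass.

    Each rank always holds its shard of every layer, and temporarily holds
    one full layer at a time while that layer is being computed.
    """
    shard_sizes = [-(-n // world_size) for n in layer_sizes]
    resident = sum(shard_sizes)
    peak = 0
    for n, s in zip(layer_sizes, shard_sizes):
        peak = max(peak, resident - s + n)            # swap the shard for the full layer
    return peak


# Hypothetical model: 8 layers of 1000 parameters each, on 2 GPUs.
layers = [1000] * 8
unsharded = sum(layers)                               # 8000 parameters on every GPU
sharded_peak = peak_params_per_rank(layers, world_size=2)
print(unsharded, sharded_peak)

# Gathering the shards of a layer reproduces its full parameters.
params = list(range(10))
assert all_gather(shard(params, 3)) == params
```

The simulated peak stays well below the unsharded total because only one layer is ever materialised in full, and the gap widens as more GPUs are added.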

Community Impact

Teknium, known for creating the popular OpenHermes models and datasets (with over half a million downloads), has already embraced this new capability:

“With this capability, we can take huge models to new heights locally, and gigantic, hundreds of billions of parameter models are now accessible by small labs.”

At Answer.AI, our mission is to make useful AI available to everyone. While using pre-trained models from others is valuable, the ability to create personalized models empowers users to control their own AI systems.

Conclusion

The release of this open-source system combining FSDP and QLoRA represents a significant leap forward in making large language model training accessible. It opens up new possibilities for researchers, developers, and small labs, ensuring that cutting-edge AI technology is no longer limited to those with deep pockets.