Intel Gaudi 2 Delivers Strong LLM Training and Inference Performance on Databricks

The Engineer

5 Jan 2024 · 3 min read

Databricks reports that Intel's Gaudi 2 performs strongly in large language model training and inference, offering a competitive alternative to established ML hardware in both performance and cost-efficiency.

At Databricks, we're always looking for ways to help our customers build and deploy generative AI applications efficiently while maintaining data privacy and control. One key area of focus is optimizing machine learning (ML) hardware, and today we're excited to share our findings on Intel's Gaudi 2 AI accelerators.

Overview of Intel Gaudi 2

Intel's Gaudi family of AI accelerators offers a compelling alternative for training and inference workloads. These accelerators are available via AWS (first-generation Gaudi), the Intel Developer Cloud (Gaudi 2), and on-premises through Supermicro and WiWynn. In our tests, Gaudi 2 trailed only NVIDIA's H100 in per-chip training performance and matched it in decoding latency, making it a strong contender in the AI hardware market.

LLM Training Performance

We evaluated the Intel Gaudi 2 for large language model (LLM) training using our open-source LLM Foundry. Here’s what we found:

Per-Chip Performance: The Gaudi 2 achieved over 260 TFLOP/s/device when training MPT-7B on an 8 x Gaudi 2 system. This makes it the second-best performing chip we've tested, behind only NVIDIA's H100.
Multi-Node Scaling: For larger-scale training, we had access to a cluster of 160 Intel Gaudi 2 accelerators and observed near-linear scaling across the cluster. This matters because per-device throughput that holds steady as devices are added keeps large training runs efficient.
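
To make the two training metrics above concrete, here is a minimal Python sketch of how a TFLOP/s/device figure and a scaling-efficiency figure can be derived from measured token throughput. It assumes the common 6 x parameters x tokens approximation for dense transformer training, which may differ from the exact FLOP accounting LLM Foundry uses. The function names and the throughput numbers are hypothetical illustrations, not measurements from our runs.

```python
def model_tflops_per_device(n_params: float, tokens_per_sec: float, n_devices: int) -> float:
    """Approximate achieved model TFLOP/s per device for dense transformer training.

    Assumption: ~6 FLOPs per parameter per token (forward + backward pass),
    ignoring attention FLOPs and any activation recomputation.
    """
    total_flops_per_sec = 6.0 * n_params * tokens_per_sec
    return total_flops_per_sec / n_devices / 1e12


def scaling_efficiency(base_tokens_per_sec: float, base_devices: int,
                       scaled_tokens_per_sec: float, scaled_devices: int) -> float:
    """Ratio of per-device throughput on the larger cluster to the baseline.

    A value of 1.0 is perfectly linear scaling; "near-linear" means close to 1.0.
    """
    per_device_base = base_tokens_per_sec / base_devices
    per_device_scaled = scaled_tokens_per_sec / scaled_devices
    return per_device_scaled / per_device_base


# Hypothetical inputs, for illustration only.
example_tflops = model_tflops_per_device(n_params=6.7e9, tokens_per_sec=50_000, n_devices=8)
example_efficiency = scaling_efficiency(50_000, 8, 950_000, 160)
print(f"{example_tflops:.1f} TFLOP/s/device, scaling efficiency {example_efficiency:.2f}")
```

With these made-up inputs the sketch reports roughly 251 TFLOP/s/device and a scaling efficiency of about 0.95. Substituting throughput measured on a real cluster gives the corresponding figures for that run.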

LLM Inference Performance

For inference, we used the open-source Optimum Habana library to profile the performance of the LLaMa2-70B model on an 8 x Gaudi 2 system. The results were impressive:

Decoding Latency: The 8 x Gaudi 2 system matched the decoding latency of an 8 x NVIDIA H100 system. This is particularly significant because decoding, the token-by-token generation that follows prompt processing, is usually the most expensive phase of LLM inference.
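
For readers who want to separate decoding latency from prompt processing in their own benchmarks, the following is a minimal, framework-agnostic Python sketch. It assumes you can wrap your generation call as an iterable that yields one item per output token. The `measure_latency` helper and its result keys are hypothetical names chosen for illustration and are not part of Optimum Habana.

```python
import time
from typing import Callable, Dict, Iterable


def measure_latency(token_stream: Iterable,
                    clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Split generation time into prefill and per-token decode latency.

    token_stream: any iterable yielding one item per generated token
    (an assumption about how the generation call is wrapped).
    """
    start = clock()
    stamps = []
    for _ in token_stream:
        stamps.append(clock())  # timestamp the arrival of each token
    if not stamps:
        raise ValueError("token_stream produced no tokens")

    # Time to first token covers prompt processing (prefill) plus one decode step.
    time_to_first_token = stamps[0] - start
    # Remaining tokens are pure decode steps; average their spacing.
    decode_steps = len(stamps) - 1
    time_per_output_token = (stamps[-1] - stamps[0]) / decode_steps if decode_steps else 0.0
    return {
        "time_to_first_token_s": time_to_first_token,
        "time_per_output_token_s": time_per_output_token,
        "tokens": float(len(stamps)),
    }
```

Comparing `time_per_output_token_s` across two systems, with the same model, batch size, and prompt and output lengths, is one way to reproduce the kind of decoding-latency comparison described above.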

Performance-per-Dollar

Since the Intel Gaudi 2 is available via the Intel Developer Cloud (IDC), we could also estimate performance per dollar. Based on public, on-demand pricing from Lambda and Intel, the Gaudi 2 stands out as a cost-effective option for both training and inference workloads.
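
As a rough illustration of how a performance-per-dollar estimate can be formed, the Python sketch below converts per-device throughput and an hourly on-demand price into dollars per exaFLOP of training compute. The function name is hypothetical, and the price used in the example is a placeholder rather than an actual IDC or cloud list price.

```python
def dollars_per_exaflop(tflops_per_device: float, price_per_device_hour: float) -> float:
    """Cost in dollars to perform 1e18 FLOPs on one device at the given throughput.

    tflops_per_device: achieved (not peak) TFLOP/s on the workload of interest.
    price_per_device_hour: on-demand price for one accelerator-hour, in dollars.
    """
    if tflops_per_device <= 0:
        raise ValueError("throughput must be positive")
    flops_per_hour = tflops_per_device * 1e12 * 3600.0
    return price_per_device_hour / flops_per_hour * 1e18


# Placeholder price (NOT a real quote); 260 TFLOP/s is the BF16 figure reported above.
example_cost = dollars_per_exaflop(tflops_per_device=260.0, price_per_device_hour=2.00)
print(f"${example_cost:.2f} per exaFLOP")
```

Lower is better. Because the result depends directly on the price entered, it should be recomputed whenever on-demand pricing changes.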

Future Enhancements with SynapseAI 1.13

All our results were measured using SynapseAI 1.12 and BF16 mixed-precision training. We are looking forward to SynapseAI 1.13, which will introduce support for FP8 training. This should be a significant improvement:

FP8 Training: In its MLPerf Training 3.1 GPT-3 submission, Intel demonstrated FP8 training on clusters of 256 and 384 Gaudi 2 accelerators, achieving 379 TFLOP/s/device and 368 TFLOP/s/device, respectively. That is roughly 1.4x to 1.5x the per-device throughput of our BF16 results, although the model and cluster sizes differ, so the comparison is only indicative.
Inference Performance: SynapseAI 1.13 is also expected to bring a performance boost for LLM inference.
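
The FP8 comparison above is simple arithmetic, sketched here in Python using only the figures quoted in this post. Since our BF16 result is stated as "over 260" TFLOP/s/device, the ratios are upper bounds, and because the workloads differ (GPT-3 versus MPT-7B) they are indicative rather than a like-for-like speedup.

```python
# Figures quoted in this post (TFLOP/s/device).
BF16_TFLOPS = 260.0                      # our MPT-7B result, stated as "over 260"
FP8_TFLOPS = {256: 379.0, 384: 368.0}    # Intel's MLPerf Training 3.1 GPT-3 runs, by cluster size

# Ratio of FP8 per-device throughput to our BF16 baseline.
speedups = {n: tflops / BF16_TFLOPS for n, tflops in FP8_TFLOPS.items()}

# How much per-device throughput is retained going from 256 to 384 accelerators.
fp8_retention = FP8_TFLOPS[384] / FP8_TFLOPS[256]

for n, ratio in speedups.items():
    print(f"{n} accelerators: {ratio:.2f}x our BF16 per-device throughput")
print(f"Per-device throughput retained from 256 to 384: {fp8_retention:.1%}")
```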

Summary of Results

Training MPT-7B on 8 x Gaudi 2: Achieved over 260 TFLOP/s/device.
Multi-Node Training (160x Gaudi 2): Near-linear scaling across the cluster.
LLaMa2-70B Inference on 8 x Gaudi 2: Matched decoding latency of 8 x NVIDIA H100.

Conclusion

The Intel Gaudi 2 AI accelerators offer robust performance for both LLM training and inference, making them a viable option for organizations looking to optimize their AI workloads. With the upcoming enhancements in SynapseAI 1.13, we expect even better results, particularly with FP8 support.

Stay tuned for future updates as we continue to explore optimizations on various hardware platforms, including NVIDIA H100 with FP8 support.