Groq Joins Hugging Face Inference Providers, Boosting LLM Performance

Tools & Engineering

The Engineer

17 Jun 2025 · 3 min read

Groq's integration with Hugging Face brings cutting-edge LPU technology to the platform, offering users faster and more efficient large language model inferences.

We're excited to announce that Groq is now a supported Inference Provider on the Hugging Face Hub! This integration enhances our serverless inference capabilities directly on the Hub’s model pages and seamlessly integrates into our client SDKs (for both JavaScript and Python). This means you can easily leverage a wide variety of models with your preferred providers, including Groq's powerful Language Processing Units (LPUs).

What Changed Technically

New Inference Provider: Groq is now available as an inference provider on the Hugging Face Hub.
- LPU™ Technology: At the core of Groq’s offering is the LPU, a new type of end-to-end processing unit designed for computationally intensive applications like Large Language Models (LLMs).
- Performance Gains:
  - Lower Latency: LPUs offer significantly lower latency compared to GPUs.
  - Higher Throughput: They provide higher throughput, making them ideal for real-time AI applications.

Why It Matters

Model Support: Groq supports a wide range of text and conversational models, including the latest open-source models such as:
- Meta's Llama 4 (e.g., Llama-4-Maverick-17B-128E-Instruct)
- Qwen's QWQ-32B (e.g., QwQ-32B)
Ease of Use: Groq’s Inference API is designed to be developer-friendly, allowing easy integration into applications.
- API Access: On-demand and pay-as-you-go model for accessing a wide range of openly-available LLMs.

How It Works

Integration with Hugging Face Hub:
- Model Pages: You can now select Groq as an inference provider directly on the model pages.
- Client SDKs: Both JavaScript and Python SDKs support Groq, making it straightforward to use in your applications.

LPU™ Architecture:
- Sequential Processing: LPUs are optimized for sequential processing tasks, which is crucial for LLMs where each token depends on the previous ones.
- Efficient Memory Access: They feature efficient memory access patterns, reducing the latency often associated with GPU-based inference.

Implementation Notes

Benchmarks:
- Groq’s LPUs have been shown to outperform GPUs in terms of both latency and throughput for LLMs. For example, they can achieve lower latency on models like Llama 4 and Qwen's QWQ-32B.
Use Cases:
- Real-Time Applications: Ideal for applications requiring real-time responses, such as chatbots, virtual assistants, and interactive AI systems.

Getting Started

To start using Groq as an inference provider on Hugging Face:

Select a Model: Visit the model page of your choice on the Hugging Face Hub.
Choose Inference Provider: Select "Groq" from the list of available inference providers.
Integrate into Your Application: Use the provided API documentation to integrate Groq’s inference capabilities into your application.

We're excited to see what you'll build with this new provider and look forward to hearing about your projects!