xAI Aims for 50 Million H100-Equivalent GPUs Within Five Years, Already Has 230k GPUs Operational

The Engineer

24 Jul 2025 · 3 min read

xAI's plan to grow from 230,000 GPUs to 50 million H100-equivalents within five years would be a more than 200-fold increase in compute, far beyond any AI cluster operating today.

xAI, Elon Musk’s AI company (which merged with X, formerly Twitter, in March 2025), has set an ambitious goal: bringing 50 million H100-equivalent units of AI compute online within five years. The scale-up is meant to power its language models and other AI applications. xAI reports that 230,000 GPUs are already operational, including 30,000 GB200s, for training Grok, its in-house large language model.
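To put the target in perspective, here is a minimal sketch of the compound growth it implies. It assumes, as a simplification, that each of today's 230,000 GPUs counts as one H100-equivalent (the GB200s in the fleet are faster, so the real starting point is somewhat higher). The function and constant names are illustrative only.

```python
def required_annual_growth(start: float, target: float, years: float) -> float:
    """Return the constant yearly multiplier that turns `start` into `target` in `years`."""
    if start <= 0 or target <= 0 or years <= 0:
        raise ValueError("start, target, and years must all be positive")
    return (target / start) ** (1.0 / years)


# Figures from the article. Treating each current GPU as exactly one
# H100-equivalent is a simplifying assumption, not a reported number.
CURRENT_UNITS = 230_000
TARGET_UNITS = 50_000_000
YEARS = 5

overall_multiple = TARGET_UNITS / CURRENT_UNITS
yearly_multiple = required_annual_growth(CURRENT_UNITS, TARGET_UNITS, YEARS)

print(f"Overall scale-up: {overall_multiple:.0f}x")
print(f"Required growth:  {yearly_multiple:.2f}x per year for {YEARS} years")
```

Under these assumptions the fleet would have to grow about 217-fold overall, which works out to nearly tripling every year for five years in a row.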

Why This Matters

For practitioners, the scale at which xAI operates today is already large, and the planned expansion is without precedent. Here’s a breakdown of what this means:

H100 GPUs: The H100 is NVIDIA’s Hopper-generation data center GPU and the common yardstick for AI compute. Its Transformer Engine adds FP8 support that speeds up both training and inference of transformer models. Note that "H100-equivalent" measures compute, not physical units, so newer and faster chips would reach the target with fewer devices.
Energy Consumption: An H100 SXM module is rated at up to 700 W, so 50 million of them would draw roughly 35 GW before cooling and networking overhead. A build-out on this scale depends on newer, more efficient chips and on advances in power delivery and cooling.
Data Center Infrastructure: Managing such a vast number of GPUs requires robust data center infrastructure. This includes not only the hardware but also the software stack for orchestration, monitoring, and scaling.
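The energy point can be made concrete with a back-of-the-envelope calculation. The sketch below multiplies a device count by a per-device power rating and a facility overhead factor (PUE). The 700 W figure is the H100 SXM's rated maximum; the PUE of 1.3 and the efficiency gain of 3x are hypothetical placeholders chosen only to show how the estimate moves, not reported values.

```python
def fleet_power_gw(units: float, watts_per_unit: float,
                   pue: float = 1.0, efficiency_gain: float = 1.0) -> float:
    """Estimate facility power in gigawatts.

    units            number of H100-equivalents
    watts_per_unit   power per H100-equivalent at H100 efficiency
    pue              power usage effectiveness (facility power / IT power)
    efficiency_gain  how many times more work per watt newer chips deliver
    """
    if pue < 1.0 or efficiency_gain <= 0:
        raise ValueError("pue must be >= 1.0 and efficiency_gain must be positive")
    return units * watts_per_unit * pue / efficiency_gain / 1e9


TARGET_UNITS = 50_000_000
H100_SXM_WATTS = 700        # rated maximum for the H100 SXM module

gpu_only = fleet_power_gw(TARGET_UNITS, H100_SXM_WATTS)
with_overhead = fleet_power_gw(TARGET_UNITS, H100_SXM_WATTS, pue=1.3)  # hypothetical PUE
with_better_chips = fleet_power_gw(TARGET_UNITS, H100_SXM_WATTS,
                                   pue=1.3, efficiency_gain=3.0)        # hypothetical gain

print(f"GPUs alone:            {gpu_only:.1f} GW")
print(f"With facility overhead: {with_overhead:.1f} GW")
print(f"With 3x better perf/W:  {with_better_chips:.1f} GW")
```

The takeaway is less the exact numbers than the sensitivity: the total scales linearly with per-device power and overhead, and inversely with any gain in performance per watt.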

Current State

xAI’s current setup is already impressive:

230,000 GPUs Operational: This includes 30,000 GB200s. The GB200 is not an H100 variant: it is NVIDIA’s Grace Blackwell superchip, which pairs one Grace CPU with two Blackwell B200 GPUs and offers far more memory and bandwidth than the H100.
Training Grok: Grok, xAI’s large language model, is being trained on this infrastructure. Training in-house gives xAI significant control over the development process and reduces dependency on third-party services.

Technical Details

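H100 GPU Specifications (SXM variant):
- Architecture: Hopper
- Memory: 80 GB of HBM3 (the PCIe card uses HBM2e)
- Bandwidth: 3.35 TB/s
- FP64 Performance: 34 TFLOPS (67 TFLOPS with Tensor Cores)
- FP32 Performance: 67 TFLOPS
- Tensor Core Performance: 3,958 TFLOPS (FP8, with sparsity; about half that for dense math)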

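GB200 Specifications (per superchip, approximate, based on NVIDIA's published figures):
- Architecture: Grace Blackwell (one Grace CPU plus two Blackwell B200 GPUs)
- Memory: Up to 384 GB of HBM3e
- Bandwidth: Up to 16 TB/s
- FP64 Performance: Roughly 80 to 90 TFLOPS, depending on datasheet revision
- FP32 Performance: Roughly 160 to 180 TFLOPS, depending on datasheet revision
- Tensor Core Performance: Up to 20,000 TFLOPS (FP8, with sparsity)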
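Because the target is stated in H100-equivalents rather than physical devices, mixed fleets have to be converted with a relative performance factor. The sketch below shows that conversion. Two things in it are assumptions and not reported facts: the 200,000 non-GB200 GPUs are treated as H100-class, and `GB200_FACTOR` is an illustrative placeholder, since the right ratio depends on the workload and on which precision is compared.

```python
import math


def h100_equivalents(fleet: dict[str, tuple[int, float]]) -> float:
    """Sum a fleet's compute in H100-equivalents.

    `fleet` maps a device name to (device count, performance relative to one H100).
    """
    return sum(count * factor for count, factor in fleet.values())


def devices_needed(target_equivalents: float, factor: float) -> int:
    """Number of devices of a given relative performance needed to hit a target."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return math.ceil(target_equivalents / factor)


GB200_FACTOR = 5.0  # hypothetical placeholder, not a measured ratio

# Assumed split: the article only says 30,000 of the 230,000 GPUs are GB200s.
current_fleet = {
    "H100-class": (200_000, 1.0),
    "GB200": (30_000, GB200_FACTOR),
}

current_equivalents = h100_equivalents(current_fleet)
gb200_count_for_target = devices_needed(50_000_000, GB200_FACTOR)

print(f"Current fleet: about {current_equivalents:,.0f} H100-equivalents")
print(f"GB200-class devices needed for 50M equivalents: {gb200_count_for_target:,}")
```

With a different factor the counts change proportionally, which is why a headline figure in H100-equivalents says little about how many physical chips will actually be installed.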

Challenges and Considerations

Energy Efficiency: With such a large number of GPUs, energy consumption becomes a critical issue. xAI will need to invest in green data center technologies, such as liquid cooling and renewable energy sources.
Scalability: Managing 50 million GPUs requires sophisticated orchestration tools and robust monitoring systems. Kubernetes and other containerization technologies will likely play a crucial role.
Cost: The financial investment required for this scale-up is enormous. xAI will need to balance the costs of hardware, data center infrastructure, and ongoing operational expenses.

Future Implications

If xAI achieves its goal, it could have far-reaching implications:

Advancements in AI Research: With such powerful resources at its disposal, xAI could push the boundaries of what’s possible in AI research and development.
Competition in the Market: Other tech giants will need to respond with similar investments to stay competitive.
Impact on Smaller Players: The high barrier to entry created by this scale-up could make it difficult for smaller companies to compete, potentially leading to a more consolidated market.

Conclusion

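xAI’s ambitious plan to deploy 50 million H100-equivalent GPUs over the next five years would multiply its compute more than 200-fold. Whether it gets there will depend on how it handles the energy, infrastructure, and cost challenges outlined above.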