INT4 QAT Pipeline Enables 1TB Model Rollout on a Single H200 GPU

The Engineer

28 Jan 2026

This QAT pipeline lowers the hardware requirements for deploying 1TB-scale RL models: rollout runs on a single H200 GPU, which removes cross-node communication from the rollout path and improves efficiency and stability.

The SGLang RL team, in collaboration with the InfiXAI Team, Ant Group Asystem & AQ Infra Team, slime Team, and RadixArk Team, has been optimizing large-scale reinforcement learning (RL) training. Their latest result is an INT4 Quantization-Aware Training (QAT) pipeline that allows 1TB-scale models to be deployed on a single NVIDIA H200 GPU. This eliminates cross-node communication bottlenecks and significantly improves rollout efficiency and stability.

Technical Overview

The key innovation lies in combining fake quantization during training with real quantization at inference (W4A16), achieving performance comparable to BF16 full-precision training. Here are the technical details:

INT4 QAT End-to-End Training: The team implemented a complete QAT INT4 closed-loop solution from training to inference. This includes:
- Fake Quantization During Training: Simulating the rounding error of INT4 quantization during the training phase, so that the model learns weights that stay stable and consistent once they are actually quantized.
- Real Quantization at Inference (W4A16): Storing weights as 4-bit integers (W4) while keeping activations in 16-bit precision (A16) during inference, which significantly reduces weight memory and memory bandwidth requirements.
Unified Multi-Turn VLM/LLM Training: They provided an implementation for the VLM multi-turn sampling paradigm. This allows developers to easily start multi-turn RL for vision-language models (VLMs) by writing a customized rollout function, similar to training large language models (LLMs).
Rollout Router Replay: The team implemented the Rollout Router Replay mechanism, which significantly improves the stability of Mixture of Experts (MoE) models during RL training.
FP8 End-to-End Training: They successfully implemented end-to-end FP8 training and sampling in RL scenarios, further enhancing hardware performance.
Speculative Decoding in RL: The team also applied speculative decoding in RL scenarios, achieving lossless acceleration for large-scale training.
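
The fake-quantization idea above can be illustrated with a minimal pure-Python sketch. It is not the team's implementation: the symmetric per-group scaling, the group size, the toy linear model, and the function names `fake_quantize` and `sgd_step_with_ste` are all assumptions made for illustration. The sketch rounds weights onto an INT4 grid in the forward pass and uses a straight-through estimator, a common QAT choice, to pass gradients back to the high-precision weights.

```python
from typing import List

INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit integer range


def fake_quantize(weights: List[float], group_size: int = 4) -> List[float]:
    """Quantize then dequantize: snap each group to an INT4 grid, return floats.

    Assumption: symmetric per-group scaling with a hypothetical group size.
    """
    out: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Fall back to a scale of 1.0 for an all-zero group.
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
        for w in group:
            q = max(INT4_MIN, min(INT4_MAX, round(w / scale)))
            out.append(q * scale)
    return out


def sgd_step_with_ste(
    weights: List[float],
    inputs: List[float],
    target: float,
    lr: float = 0.1,
    group_size: int = 4,
) -> List[float]:
    """One SGD step on y = sum(w_q * x) with a squared-error loss.

    The forward pass uses fake-quantized weights. The backward pass treats
    rounding as the identity (straight-through estimator), so the gradient
    computed for w_q is applied directly to the high-precision weights.
    """
    w_q = fake_quantize(weights, group_size)
    y = sum(wq * x for wq, x in zip(w_q, inputs))
    grad_y = 2.0 * (y - target)
    return [w - lr * grad_y * x for w, x in zip(weights, inputs)]
```

Because the high-precision weights are never overwritten by their rounded values, small gradient updates accumulate across steps even when a single step is too small to move a weight to the next INT4 level.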

Implementation Details

The INT4 QAT pipeline was built on the slime framework, inspired by the Kimi K2 team’s technical report. Here are some implementation specifics:

Training Phase:
- Fake Quantization: During training, weights stay in high precision, but the forward pass quantizes and immediately dequantizes them, so the model sees the rounding error it will face at inference without actually losing precision. This helps maintain stability and train-infer consistency.
- Loss Function: A custom loss function was used to ensure that the model remains robust during the quantization process.
Inference Phase:
- Real Quantization (W4A16): During inference, weights are quantized to 4 bits (W4) while activations stay in 16-bit precision (A16). This reduces weight memory and memory traffic while maintaining performance.
- Optimized Inference: The team optimized the inference pipeline to ensure that it runs efficiently on a single H200 GPU.
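
To make the W4A16 inference path concrete, here is a minimal pure-Python sketch of weight-only INT4 storage. It is an illustration under stated assumptions rather than the team's kernel: the two-values-per-byte layout, the single scale per group, and the names `quantize_int4`, `dequantize_int4`, and `w4a16_dot` are hypothetical. Weights are stored as packed 4-bit integers and rescaled on the fly, while activations are left unquantized.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[bytes, float]:
    """Quantize one group of weights to signed INT4 and pack two per byte."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:  # pad to an even count so values pair up
        q.append(0)
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Each value becomes a 4-bit two's-complement nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed), scale


def dequantize_int4(packed: bytes, scale: float, count: int) -> List[float]:
    """Unpack nibbles, sign-extend them, and rescale to floating point."""
    values: List[float] = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            values.append(signed * scale)
    return values[:count]


def w4a16_dot(packed: bytes, scale: float, activations: List[float]) -> float:
    """Dot product with 4-bit stored weights and unquantized activations."""
    weights = dequantize_int4(packed, scale, len(activations))
    return sum(w * a for w, a in zip(weights, activations))
```

For example, `quantize_int4([0.7, -0.3, 0.2, -0.7, 0.1])` stores five weights in three bytes plus one scale. A production kernel would fuse the unpacking into the matrix multiply instead of materializing the dequantized weights.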

Benchmarks and Performance

The team reports the following results for the INT4 QAT pipeline:

Model Size: 1TB-scale models can now be deployed on a single H200 GPU.
Efficiency: Rollout efficiency improves significantly because single-GPU deployment eliminates cross-node communication bottlenecks.
Stability: The model's stability and train-infer consistency are comparable to BF16 full-precision training.
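
As a rough guide to where the memory saving comes from, the sketch below computes the storage cost per weight for group-wise INT4 quantization relative to BF16. The group size of 32 and the 16-bit scale per group are illustrative assumptions, not values taken from the team's recipe, and the function names are hypothetical.

```python
def bits_per_weight(
    weight_bits: int = 4, scale_bits: int = 16, group_size: int = 32
) -> float:
    """Average storage per weight: the quantized value plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


def compression_vs_bf16(
    weight_bits: int = 4, scale_bits: int = 16, group_size: int = 32
) -> float:
    """How many times smaller the quantized weights are than 16-bit BF16 weights."""
    return 16 / bits_per_weight(weight_bits, scale_bits, group_size)


print(bits_per_weight())                 # 4.5 bits per weight under these assumptions
print(round(compression_vs_bf16(), 2))   # 3.56
```

Larger groups amortize the scale over more weights and push the ratio toward 4x, at the cost of coarser scaling within each group.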

Open Source Contribution

To pay tribute to the pioneers and give back to the community, the SGLang RL team has open-sourced their implementation. Developers can find the detailed technical recipe and scripts on GitHub:

INT4 QAT Technical Recipe: GitHub Link
End-to-End INT4 QAT Script: GitHub Link

Conclusion

The INT4 Q