
This INT4 QAT pipeline lowers the hardware requirements for deploying 1TB-scale RL models: rollout fits on a single H200 GPU, which removes cross-node communication and improves efficiency and stability.
The SGLang RL team, in collaboration with the InfiXAI Team, Ant Group Asystem & AQ Infra Team, slime Team, and RadixArk Team, has built an INT4 Quantization-Aware Training (QAT) pipeline for large-scale reinforcement learning (RL) that allows 1TB-scale models to be deployed on a single NVIDIA H200 GPU. This eliminates cross-node communication bottlenecks and significantly improves rollout efficiency and stability.
The key idea is to combine fake quantization during training with real quantization at inference (W4A16: 4-bit weights, 16-bit activations), achieving performance comparable to BF16 full-precision training. Here are the technical details:
INT4 QAT End-to-End Training: The team implemented a complete INT4 QAT closed-loop solution from training to inference.
Unified Multi-Turn VLM/LLM Training: They provided an implementation for the VLM multi-turn sampling paradigm. This allows developers to easily start multi-turn RL for vision-language models (VLMs) by writing a customized rollout function, similar to training large language models (LLMs).
Rollout Router Replay: The team implemented the Rollout Router Replay mechanism, which significantly improves the stability of Mixture of Experts (MoE) models during RL training.
FP8 End-to-End Training: They successfully implemented end-to-end FP8 training and sampling in RL scenarios, further enhancing hardware performance.
Speculative Decoding in RL: The team also applied speculative sampling in RL scenarios, achieving lossless acceleration for large-scale training.
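The "lossless" property of speculative sampling comes from its accept/reject rule: a cheap draft model proposes a token, and the target model either accepts it or resamples from a corrected distribution, so the output still follows the target distribution exactly. The sketch below shows the standard single-token rule in plain Python. The function names and the toy distributions `p` (target) and `q` (draft) are assumptions for illustration, not the team's implementation.

```python
import random
from typing import Dict

Dist = Dict[str, float]  # token -> probability, same vocabulary for p and q


def speculative_step(p: Dist, q: Dist, rng: random.Random) -> str:
    """Draw one token that is distributed according to the target p."""
    tokens = list(q)
    draft = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    # Accept the draft token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p[t] - q[t]) for t in tokens]
    return rng.choices(tokens, weights=residual)[0]


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of speculative_step's output (should equal p)."""
    accepted = {t: min(p[t], q[t]) for t in p}  # q[t] * min(1, p[t] / q[t])
    rejected_mass = 1.0 - sum(accepted.values())
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    norm = sum(residual.values())
    return {
        t: accepted[t] + (rejected_mass * residual[t] / norm if norm > 0 else 0.0)
        for t in p
    }


p = {"a": 0.5, "b": 0.3, "c": 0.2}  # target model (toy values)
q = {"a": 0.2, "b": 0.5, "c": 0.3}  # draft model (toy values)
exact = output_distribution(p, q)
```

Here `exact` matches `p` up to floating-point error even though the draft distribution differs, which is why the speedup does not change what the policy samples.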
The INT4 QAT pipeline was inspired by the Kimi K2 team’s technical report and built on the slime framework. Here are some implementation specifics:

Training Phase:
Inference Phase:
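As a rough illustration of the two phases, the sketch below contrasts fake quantization (training) with real INT4 storage (inference) in plain Python. It assumes symmetric per-group quantization with one scale per group; the function names, the scale rule, and the nibble packing order are assumptions for illustration, not the team's implementation, and Python floats stand in for 16-bit values.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit range


def group_scale(group: List[float]) -> float:
    """Symmetric scale so the largest magnitude maps to INT4_MAX."""
    max_abs = max(abs(w) for w in group)
    return max_abs / INT4_MAX if max_abs > 0 else 1.0


def quantize(group: List[float], scale: float) -> List[int]:
    """Round to the nearest INT4 code and clamp to the representable range."""
    return [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in group]


def fake_quantize(group: List[float]) -> List[float]:
    """Training side: quantize, then immediately dequantize.

    Weights stay in floating point but carry INT4 rounding error. A real
    framework would typically pass gradients through round() unchanged
    (straight-through estimator); that part is not shown here.
    """
    scale = group_scale(group)
    return [q * scale for q in quantize(group, scale)]


def pack_int4(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte, low nibble first (assumed layout)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    nibbles = [c & 0xF for c in codes]  # two's-complement nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))


def unpack_int4(packed: bytes, n: int) -> List[int]:
    """Recover the first n signed codes from packed nibbles."""
    codes = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            codes.append(nib - 16 if nib >= 8 else nib)
    return codes[:n]


def real_quantize(group: List[float]) -> Tuple[bytes, float]:
    """Inference side: keep only packed INT4 codes plus the group scale."""
    scale = group_scale(group)
    return pack_int4(quantize(group, scale)), scale


def dequantize(packed: bytes, scale: float, n: int) -> List[float]:
    """W4A16-style use: expand 4-bit weights back to floats before the matmul."""
    return [q * scale for q in unpack_int4(packed, n)]


weights = [0.7, -0.32, 0.11, 0.0]
trained_view = fake_quantize(weights)
packed, scale = real_quantize(weights)
served_view = dequantize(packed, scale, len(weights))
```

Because both paths apply the same rounding and scale, `trained_view` equals `served_view`: the training forward pass sees the same weight values the inference engine would serve, while the served copy occupies half a byte per weight.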
The INT4 QAT pipeline has shown impressive results:
To pay tribute to the pioneers and give back to the community, the SGLang RL team has open-sourced their implementation. Developers can find the detailed technical recipe and scripts on GitHub:
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
28 January 2026