
TRL's integration with vLLM optimizes GPU usage for GRPO, streamlining the online learning process and significantly reducing the computational overhead typically associated with LLM generation.
TRL, a library by Hugging Face, supports training large language models (LLMs) using GRPO (Group Relative Policy Optimization), an online learning algorithm introduced in the DeepSeekMath paper. In GRPO, the model learns from its own outputs: it generates responses during training, receives feedback, and uses that feedback to improve itself over time. This makes generation a critical step in the training loop, and also a major bottleneck.
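To make the "learns from its own outputs" step concrete, here is a minimal sketch of the group-relative scoring that gives GRPO its name: several completions are sampled for one prompt, each gets a reward, and each reward is normalized against the rest of its group. The function name, the example rewards, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; this is not TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to the others sampled for the same prompt.

    Illustrative only: the name, eps, and the choice of population standard
    deviation are assumptions, not TRL's code.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions generated from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group average get a positive advantage and are reinforced; the rest are pushed down. Every one of those completions has to be generated before the update can run, which is why generation speed dominates the loop.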
To address this bottleneck, TRL integrates with vLLM, a high-performance inference engine for LLMs. However, before TRL v0.18.0, vLLM was only supported in server mode, which introduced inefficiencies. Let’s dive into the problem and the solution.
Before TRL v0.18.0, vLLM ran as a separate process on different GPUs from the training job, communicating with the training script over HTTP. This setup had several advantages:
However, it also introduced significant inefficiencies:
In TRL v0.18.0, Hugging Face introduced co-located vLLM, which runs on the same GPUs as the training job. This change significantly improves GPU utilization and reduces latency, making the training process more efficient.
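As a rough sketch of how the two modes might be selected from a training script, the helper below builds keyword arguments for `trl.GRPOConfig`. The helper names and the `0.3` memory fraction are hypothetical, and the `use_vllm`, `vllm_mode`, and `vllm_gpu_memory_utilization` field names reflect my understanding of TRL v0.18.0; check them against the documentation for your installed version.

```python
def vllm_generation_kwargs(mode="colocate", gpu_memory_utilization=0.3):
    """Build vLLM-related keyword arguments for trl.GRPOConfig.

    Hypothetical helper. The field names are assumed from TRL v0.18.0 and
    should be verified against the installed version.
    """
    if mode not in ("server", "colocate"):
        raise ValueError(f"unknown vLLM mode: {mode!r}")
    kwargs = {"use_vllm": True, "vllm_mode": mode}
    if mode == "colocate":
        # Training and generation share each GPU, so vLLM is given only a
        # fraction of device memory. 0.3 is an illustrative value, not a
        # recommendation.
        if not 0.0 < gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        kwargs["vllm_gpu_memory_utilization"] = gpu_memory_utilization
    return kwargs


def build_grpo_config(output_dir, mode="colocate"):
    """Create a GRPOConfig; needs trl >= 0.18.0, imported lazily."""
    from trl import GRPOConfig

    return GRPOConfig(output_dir=output_dir, **vllm_generation_kwargs(mode))


print(vllm_generation_kwargs("colocate"))
print(vllm_generation_kwargs("server"))
```

In server mode the script would also need a separately launched vLLM process to talk to; in co-located mode no second process is involved, which is the point of the change.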

To quantify the improvements, Hugging Face conducted benchmarks comparing the performance of TRL with and without co-located vLLM. The results were impressive:
Co-located vLLM is particularly beneficial for:
The introduction of co-located vLLM in TRL v0.18.0 marks a significant step forward in optimizing the training of large language models. By improving GPU utilization and reducing latency, this update makes it easier and more efficient to train powerful models using online learning algorithms like GRPO.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
4 June 2025