Co-located vLLM in TRL: Boosting GPU Efficiency for Online Learning

Tools & Engineering

The Engineer

4 Jun 2025 · 3 min read

TRL's integration with vLLM optimizes GPU usage for GRPO, streamlining the online learning process and significantly reducing the computational overhead typically associated with LLM generation.

Introduction

TRL, a library by Hugging Face, supports training large language models (LLMs) using GRPO (Group Relative Policy Optimization), an online learning algorithm introduced in the DeepSeekMath paper. In GRPO, the model learns from its own outputs: it generates responses during training, receives feedback, and uses that feedback to improve itself over time. This makes generation a critical step in the training loop, and also a major bottleneck.
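To make the "feedback" step concrete: GRPO scores each completion relative to the other completions sampled for the same prompt. The snippet below is a minimal sketch of that group-wise normalization. The function name, the reward values, the epsilon, and the use of the population standard deviation are all assumptions of this example, not TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by a hypothetical reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. Every training step therefore needs a fresh batch of generations, which is why generation speed matters so much.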

To address this bottleneck, TRL integrates with vLLM, a high-performance inference engine for LLMs. However, before TRL v0.18.0, vLLM was only supported in server mode, which introduced inefficiencies. Let’s dive into the problem and the solution.

The Problem

Before TRL v0.18.0, vLLM ran as a separate process on different GPUs from the training job, communicating with the training script over HTTP. This setup had several advantages:

Modularity: Easy to set up and use.
Scalability: Independent scaling of generation and training resources.

However, it also introduced significant inefficiencies:

GPU Utilization: During training, the model frequently needs to generate completions.
- The trainer sends a request to the vLLM server running on its own GPUs.
- While vLLM generates responses, the training GPUs sit idle and wait for the results.
Latency: HTTP communication adds latency, further slowing down the training process.
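The idle time described above can be estimated with simple arithmetic. The sketch below assumes the two phases are strictly sequential (training GPUs idle during generation, vLLM GPUs idle during training) and ignores HTTP overhead. The timings and GPU counts are made-up illustrative values, not measurements.

```python
def server_mode_utilization(gen_s, train_s, train_gpus, vllm_gpus):
    """Fraction of total GPU-seconds doing useful work in one step.

    Assumes generation and training never overlap: each group of GPUs
    is busy only during its own phase.
    """
    step_s = gen_s + train_s
    busy_gpu_seconds = train_gpus * train_s + vllm_gpus * gen_s
    total_gpu_seconds = (train_gpus + vllm_gpus) * step_s
    return busy_gpu_seconds / total_gpu_seconds


# Hypothetical step: 40 s of generation, 20 s of training, 6 + 2 GPUs.
util = server_mode_utilization(gen_s=40.0, train_s=20.0, train_gpus=6, vllm_gpus=2)
# 200 busy GPU-seconds out of 480, about 0.42
```

Under this simplified model, a large share of the hardware is idle at any moment, regardless of how the GPUs are divided between the two roles.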

The Solution: Co-located vLLM

In TRL v0.18.0, Hugging Face introduced co-located vLLM, which runs on the same GPUs as the training job. This change significantly improves GPU utilization and reduces latency, making the training process more efficient.

Key Benefits

GPU Utilization: By running vLLM on the same GPUs as the training job, generation and training share the same hardware.
- Training and generation take turns on those GPUs, so no device is reserved for inference and left idle while the other phase runs.
Reduced Latency: Eliminating HTTP communication reduces latency, speeding up the overall training process.
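Switching modes is a configuration change. The sketch below collects the relevant `GRPOConfig` keyword arguments as documented for TRL v0.18.0; check the documentation of your installed version, since argument names can change between releases. The output directory and the memory fraction are illustrative assumptions, and `build_config` is a hypothetical helper that needs `trl` installed.

```python
# Keyword arguments for trl.GRPOConfig (names per the TRL v0.18.0 docs).
colocate_kwargs = {
    "output_dir": "grpo-colocate-demo",   # hypothetical path
    "use_vllm": True,                     # generate with vLLM instead of model.generate
    "vllm_mode": "colocate",              # "server" selects the separate-process setup
    # Fraction of each GPU's memory given to vLLM; the rest is left for training.
    # 0.3 is an illustrative value to tune for your model and hardware.
    "vllm_gpu_memory_utilization": 0.3,
}


def build_config(kwargs):
    """Hypothetical helper: build the config only where trl is installed."""
    from trl import GRPOConfig  # requires trl >= 0.18.0

    return GRPOConfig(**kwargs)
```

The resulting config is passed to `GRPOTrainer` as usual. Because training and vLLM now share each GPU's memory, the memory fraction is the main setting to adjust if either side runs out of memory.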

Implementation Details

Shared Memory: Co-located vLLM runs in the same process as the training script, so prompts and completions are exchanged in memory, minimizing overhead.
- Both sides access the data directly, without the need for network communication.
Synchronization: The system uses synchronization mechanisms to ensure that the training loop waits only when necessary.
- For example, if vLLM is still generating a response, the training step that depends on those completions waits until generation finishes.
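The control flow can be pictured with a toy loop. This sketch assumes generation and training alternate on the same device and that the trainer calls the engine directly as a Python object. `ToyEngine` and `run_steps` are hypothetical stand-ins, not the vLLM or TRL APIs.

```python
class ToyEngine:
    """Hypothetical stand-in for an in-process generation engine (not the vLLM API)."""

    def generate(self, prompts):
        return [p + " -> completion" for p in prompts]


def run_steps(engine, prompts, num_steps):
    """Record the order of phases when generation is a direct function call."""
    timeline = []
    for step in range(num_steps):
        # Direct call: no HTTP round trip; the trainer waits here for completions.
        completions = engine.generate(prompts)
        timeline.append(("generate", step, len(completions)))
        # Placeholder for the optimizer update that consumes the completions.
        timeline.append(("train", step, len(completions)))
    return timeline


timeline = run_steps(ToyEngine(), ["2+2=?", "3*3=?"], num_steps=2)
```

The only wait in this sketch is the one the algorithm requires: a training step cannot start before its completions exist.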

Benchmarks

To quantify the improvements, Hugging Face conducted benchmarks comparing the performance of TRL with and without co-located vLLM. The results were impressive:

Training Time: Co-located vLLM reduced the total training time by up to 30% in some cases.
GPU Utilization: GPU utilization increased from an average of 50% to over 85%, making better use of available resources.

Use Cases

Co-located vLLM is particularly beneficial for:

Online Learning: GRPO and other online learning algorithms that require frequent generation during training.
Resource-Constrained Environments: Scenarios where maximizing GPU utilization is crucial due to limited hardware resources.

Conclusion

The introduction of co-located vLLM in TRL v0.18.0 marks a significant step forward in optimizing the training of large language models. By improving GPU utilization and reducing latency, this update makes it easier and more efficient to train powerful models using online learning algorithms like GRPO.