Ray Data LLM Doubles Throughput Over vLLM for Production-Scale Batch Inference

The Engineer

25 Mar 2026 · 3 min read

Ray Data LLM delivers up to 2x the throughput of vLLM's synchronous engine on large-scale batch inference, which matters for production workloads such as synthetic data generation and model evaluation.

When it comes to large-scale batch inference with Large Language Models (LLMs), throughput is often more critical than per-request latency. This is especially true in workflows like synthetic data generation, data curation, and model evaluation, where processing massive datasets efficiently is paramount. Enter Ray Data LLM, a library designed for high-throughput, scalable, and fault-tolerant batch inference. In this article, we'll dive into how Ray Data LLM achieves up to 2x throughput over vLLM's synchronous LLM engine, making it a robust choice for production-scale workloads.

Why Ray Data LLM?

Naive Approach: vLLM’s Offline Inference API

To understand the advantages of Ray Data LLM, let's first look at a naive approach using vLLM’s Offline Inference API. This method involves loading the entire dataset into CPU memory and running forward passes directly:

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="facebook/opt-125m")

# Define prompts
prompts = [
    "What is machine learning?",
    "Explain neural networks.",
    "How does backpropagation work?"
]

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=100
)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

While this approach works for small datasets, it quickly hits limitations in production environments:

Memory Constraints: Large datasets can easily exceed available CPU memory, making it impossible to process them in a single batch.
Lack of Streaming Execution: The naive method doesn't support streaming execution, which is crucial for handling large datasets efficiently.
No Fault Tolerance: If an error occurs during processing, the entire job may fail without any recovery mechanism.

Ray Data LLM: Optimized for Production-Scale

Ray Data LLM addresses these limitations with an architecture optimized for batch inference at scale. Here’s how it achieves up to 2x the throughput of vLLM’s synchronous engine:

Distributed Execution: Ray Data LLM leverages distributed computing to process large datasets efficiently. It can split the dataset into smaller chunks and distribute them across multiple nodes, ensuring that no single node becomes a bottleneck.
Streaming Execution: Instead of loading the entire dataset into memory, Ray Data LLM processes data in streams. This allows it to handle datasets much larger than available CPU RAM.
Fault Tolerance: Ray Data LLM includes built-in fault tolerance mechanisms. If a node fails during processing, it can automatically recover and resume from the last successful checkpoint.
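The streaming and fault-tolerance ideas above can be illustrated with a minimal single-process sketch in plain Python. This is not Ray Data LLM's implementation or API: `iter_batches`, `process_stream`, and `flaky_infer` are hypothetical names invented for illustration, and the "model" is a stand-in that upper-cases its input. The sketch only shows the pattern of pulling one batch at a time from a lazy source and re-running a failed batch instead of failing the whole job.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield fixed-size batches lazily, so only one batch is held in memory."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_stream(
    rows: Iterable[str],
    infer: Callable[[List[str]], List[str]],
    batch_size: int = 2,
    max_retries: int = 2,
) -> Iterator[str]:
    """Run `infer` batch by batch, retrying a failed batch up to `max_retries` times."""
    for batch in iter_batches(rows, batch_size):
        for attempt in range(max_retries + 1):
            try:
                outputs = infer(batch)
                break
            except RuntimeError:
                # Give up only after the last allowed retry.
                if attempt == max_retries:
                    raise
        yield from outputs


# Hypothetical stand-in for a model call: fails once, on its second invocation.
calls = {"n": 0}


def flaky_infer(batch: List[str]) -> List[str]:
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated worker failure")
    return [prompt.upper() for prompt in batch]


# The source is a generator, so the full dataset is never materialized up front.
prompts = (f"prompt {i}" for i in range(5))
results = list(process_stream(prompts, flaky_infer))
print(results)
```

In this sketch the second batch fails once and is retried, while the batches before and after it are processed exactly once. A distributed system applies the same idea across many workers and nodes rather than inside one loop.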

Performance Benchmarks

To illustrate the potential gains, consider a scenario where you need to process a large dataset of prompts. With vLLM’s synchronous engine, throughput may be limited by the single-node architecture. Ray Data LLM can instead distribute the workload across multiple nodes, significantly increasing throughput. Illustrative figures for such a scenario:

vLLM: 1000 prompts/minute
Ray Data LLM: 2000 prompts/minute

This 2x improvement in throughput is crucial for production-scale workloads where time and resource efficiency are paramount.
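To make the difference concrete, the short calculation below converts the two throughput figures above into wall-clock time. The dataset size of one million prompts is a hypothetical value chosen for illustration, not a number from any benchmark.

```python
def minutes_to_process(num_prompts: int, prompts_per_minute: float) -> float:
    """Wall-clock minutes needed at a constant throughput."""
    return num_prompts / prompts_per_minute


num_prompts = 1_000_000  # hypothetical dataset size (assumption)

baseline_minutes = minutes_to_process(num_prompts, 1000)  # vLLM figure above
ray_minutes = minutes_to_process(num_prompts, 2000)       # Ray Data LLM figure above

speedup = baseline_minutes / ray_minutes
hours_saved = (baseline_minutes - ray_minutes) / 60

print(f"baseline: {baseline_minutes:.0f} min, Ray Data LLM: {ray_minutes:.0f} min")
print(f"speedup: {speedup:.1f}x, hours saved: {hours_saved:.1f}")
```

At a fixed dataset size, doubling throughput halves the processing time, so the absolute time saved grows linearly with the number of prompts.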

Implementation Details

Here’s a simplified example of how you can use Ray Data LLM to process a large dataset:

import ray
from ray.data import read_parquet
from ray.data.extensions import TensorDtype
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize Ray
ray.init()

# Load the dataset
dataset = read_parquet("s3://path/to/your/dataset.parquet")

# Define the model and tokenizer
model_name = "facebook/opt-125m"
# Decoder-only models should be left-padded for batched generation
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

# Function to generate outputs
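def generate_outputs(batch):
    # Ray Data passes columns as NumPy arrays by default; the tokenizer expects a list of strings
    encoded = tokenizer(list(batch["prompts"]), return_tensors="pt", padding=True, truncation=True)
    # Pass the attention mask along with input_ids; do_sample=True is needed for temperature to apply
    outputs = model.generate(**encoded, max_new_tokens=100, do_sample=True, temperature=0.7)
    return {"outputs": tokenizer.batch_decode(outputs, skip_special_tokens=True)}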

# Apply the function to the dataset
results = dataset.map_batches(g