
Ray Data LLM reaches up to 2x the throughput of vLLM's synchronous LLM engine on large-scale batch inference, which matters for tasks like synthetic data generation and model evaluation where efficiency in production is the priority.
When it comes to large-scale batch inference with Large Language Models (LLMs), throughput is often more critical than per-request latency. This is especially true in workflows like synthetic data generation, data curation, and model evaluation, where processing massive datasets efficiently is paramount. Enter Ray Data LLM, a library designed for high-throughput, scalable, and fault-tolerant batch inference. In this article, we'll dive into how Ray Data LLM achieves up to 2x throughput over vLLM's synchronous LLM engine, making it a robust choice for production-scale workloads.
To understand the advantages of Ray Data LLM, let's first look at a naive approach using vLLM’s Offline Inference API. This method involves loading the entire dataset into CPU memory and running forward passes directly:
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="facebook/opt-125m")

# Define prompts
prompts = [
    "What is machine learning?",
    "Explain neural networks.",
    "How does backpropagation work?",
]

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=100,
)

# Generate outputs (one result object per prompt)
outputs = llm.generate(prompts, sampling_params)
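While this approach works for small datasets, it quickly hits limitations in production environments: the entire dataset has to fit in CPU memory, and throughput is bounded by what a single node can deliver.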
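One constraint of the naive approach, holding the entire dataset in CPU memory, can be illustrated with a minimal sketch that feeds prompts to an engine in fixed-size batches instead. The helpers `iter_batches` and `run_in_batches`, and the `fake_generate` stand-in for a real engine call, are hypothetical names introduced here for illustration; they are not part of vLLM or Ray Data.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_in_batches(
    prompts: Iterable[str],
    generate: Callable[[List[str]], List[str]],
    batch_size: int,
) -> Iterator[str]:
    """Stream prompts through `generate` one batch at a time.

    Only one batch of prompts is materialized at any moment, so `prompts`
    can be a lazy source such as a file reader or a generator.
    """
    for batch in iter_batches(prompts, batch_size):
        yield from generate(batch)


def fake_generate(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a real engine call; it just echoes the prompt.
    return [f"echo: {prompt}" for prompt in batch]


# Usage: the prompt source is a generator, so it is never fully loaded into memory.
lazy_prompts = (f"prompt {i}" for i in range(5))
for completion in run_in_batches(lazy_prompts, fake_generate, batch_size=2):
    print(completion)
```

This only addresses memory pressure on one machine. It does nothing about fault tolerance or spreading work over several nodes, which is the part a distributed data library is meant to handle.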
Ray Data LLM addresses these limitations with an architecture optimized for batch inference at scale, which is what lets it reach up to 2x the throughput of vLLM's synchronous engine.

To illustrate the performance gains, consider a scenario where you need to process a large dataset of prompts. Using vLLM’s synchronous engine, throughput may be limited by the single-node architecture. In contrast, Ray Data LLM can distribute the workload across multiple nodes, significantly increasing throughput.
An improvement of up to 2x in throughput is crucial for production-scale workloads where time and resource efficiency are paramount.
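To make concrete what a 2x throughput gain means for wall-clock time, a back-of-the-envelope calculation helps. The dataset size and baseline rate below are hypothetical numbers chosen only for illustration; they are not measurements from any benchmark, and `wall_clock_hours` is a helper name introduced here.

```python
def wall_clock_hours(num_prompts: int, prompts_per_second: float) -> float:
    """Hours needed to process `num_prompts` at a steady `prompts_per_second`."""
    if prompts_per_second <= 0:
        raise ValueError("prompts_per_second must be positive")
    return num_prompts / prompts_per_second / 3600.0


NUM_PROMPTS = 1_000_000  # hypothetical dataset size
BASELINE_RATE = 50.0     # hypothetical baseline throughput, in prompts per second

baseline_hours = wall_clock_hours(NUM_PROMPTS, BASELINE_RATE)
doubled_hours = wall_clock_hours(NUM_PROMPTS, 2 * BASELINE_RATE)

print(f"baseline: {baseline_hours:.2f} h, at 2x throughput: {doubled_hours:.2f} h")
```

Because run time is inversely proportional to throughput, doubling throughput halves the run time for the same dataset. On the same hardware billed by the hour, that also halves the compute cost of the job.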
Here’s a simplified example of how you can use Ray Data LLM to process a large dataset:
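import ray
from ray.data import read_parquet
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize Ray
ray.init()

# Load the dataset
dataset = read_parquet("s3://path/to/your/dataset.parquet")

# Define the model and tokenizer.
# Note: this simplified example runs Hugging Face transformers through plain Ray Data.
# Left padding keeps each prompt adjacent to its generated tokens in a decoder-only model.
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

# Function to generate outputs for one batch (a dict mapping column name to a NumPy array)
def generate_outputs(batch):
    prompts = [str(p) for p in batch["prompts"]]  # the tokenizer expects a list of str
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    # Pass the attention mask along with input_ids; temperature only applies when sampling
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
    return {"outputs": tokenizer.batch_decode(outputs, skip_special_tokens=True)}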
# Apply the function to the dataset
results = dataset.map_batches(g
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
25 March 2026