Optimizing Low-Latency Generative AI Model Serving with Ray, NVIDIA Triton, and TensorRT-LLM

Tools & Engineering

The Engineer

26 Mar 2024 · 3 min read

Anyscale teams up with NVIDIA to enhance AI model serving, combining Ray Serve’s ease of use with NVIDIA’s advanced hardware optimizations for更低延迟和更高效的大型语言模型推理。

In a previous article, we discussed how Ray Serve can significantly improve hardware utilization and streamline the deployment of AI applications to production. With the growing complexity and size of AI models, optimizing model inference to reduce GPU costs has become essential. Now, Anyscale is collaborating with NVIDIA to integrate the developer-friendly features of Ray Serve and RayLLM with the advanced optimizations provided by NVIDIA Triton Inference Server and NVIDIA TensorRT-LLM.

Model Serving with Ray Serve and RayLLM

Ray Serve is a powerful, scalable model-serving library built on top of Ray. It offers a simple Python API for deploying everything from deep learning models (e.g., PyTorch) to custom business logic. Key features include:

Model Composition: Ray Serve excels at composing multiple ML models into a single service, allowing each model to auto-scale independently for optimal hardware utilization.
Multi-Model Serving: It supports serving numerous models simultaneously, making it ideal for complex inference services.

In 2023, Anyscale made significant investments in enhancing user experience and price-performance. This led to notable adoption by companies like LinkedIn, Samsara, and DoorDash, resulting in a 10x growth in Ray Serve usage as both startups and enterprises sought faster, more cost-effective ways to serve AI models.

RayLLM is an LLM-serving solution built on Ray Serve, designed to simplify the deployment and management of various open-source LLMs. Key features include:

Pre-configured Models: A wide range of pre-configured open-source LLMs with optimized settings.
Bring Your Own Models (BYOM): Support for deploying custom models.
OpenAI-Compatible API: An OpenAI-compatible API for seamless integration with existing LLM tooling like LangChain and LlamaIndex.

Since RayLLM is built on Ray Serve, it inherits features such as auto-scaling, multi-GPU support, and multi-node inference, making it a robust solution for large-scale deployments.

AI Deployment with NVIDIA Triton Inference Server and TensorRT-LLM

NVIDIA Triton Inference Server and TensorRT-LLM are key components in optimizing AI model serving. Here’s how they integrate with Ray Serve:

NVIDIA Triton Inference Server:
- Model Optimization: Provides advanced optimizations to reduce inference latency and improve throughput.
- Multi-GPU Support: Efficiently handles multi-GPU setups, ensuring optimal performance across multiple GPUs.
- Versatile Model Formats: Supports a wide range of model formats, including TensorFlow, PyTorch, and ONNX.
NVIDIA TensorRT-LLM:
- LLM Optimization: Specifically designed to optimize large language models (LLMs), enhancing inference speed and efficiency.
- Customizable Pipelines: Allows for the creation of custom inference pipelines tailored to specific use cases.
- Integration with Triton: Seamlessly integrates with NVIDIA Triton Inference Server, providing a unified solution for model serving.

Combining Ray Serve, RayLLM, and NVIDIA Technologies

The integration of Ray Serve, RayLLM, NVIDIA Triton Inference Server, and TensorRT-LLM creates a powerful ecosystem for low-latency AI model serving. Here’s how it works:

Development Efficiency: Ray Serve and RayLLM provide an intuitive development experience, allowing developers to focus on building and deploying models rather than infrastructure.
Optimized Performance: NVIDIA Triton Inference Server and TensorRT-LLM ensure that models run efficiently, reducing latency and GPU costs.
Scalability and Flexibility: The combination supports auto-scaling, multi-GPU setups, and complex model compositions, making it suitable for a wide range of applications.

By leveraging these technologies, organizations can achieve faster, more cost-effective AI deployments, enabling them to stay competitive in the rapidly evolving landscape of generative AI.