Character.ai Achieves 2x Inference Performance with DigitalOcean and AMD GPUs

Tools & Engineering

The Engineer

15 Jan 2026

By partnering with DigitalOcean and running on AMD GPUs, Character.ai doubled its inference throughput, a gain that matters for a platform whose roughly 20 million users expect seamless, low-latency interactions.

In a technical collaboration, Character.ai, DigitalOcean, and AMD optimized GPU workloads to achieve up to a 2x improvement in production inference throughput for the AI entertainment platform. Character.ai serves around 20 million users worldwide and requires low-latency performance at scale.

The Challenge

Character.ai's application needs high-performance GPUs to handle large-scale, low-latency inference. To meet these requirements, the company partnered with DigitalOcean and AMD to optimize the Qwen3-235B Instruct FP8 model on a cluster of AMD Instinct™ MI325X GPUs.

The Solution

The teams focused on several key areas to achieve this performance boost:

Platform-Level Optimizations: This included optimizing large Mixture-of-Experts (MoE) models, efficient FP8 execution paths, and topology-aware GPU allocation.
Parallelization Strategies: Parallelization techniques were chosen to suit the structure of large MoE models, where work must be split across experts and GPUs.
Optimized Kernels with AITER: Custom kernels were developed to enhance performance.
Kubernetes Orchestration: DigitalOcean Kubernetes (DOKS) was used for production-ready orchestration.
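To make the parallelization item above more concrete, here is a minimal sketch of how tokens in an MoE layer might be routed to experts that live on different GPUs. It assumes top-k routing with a softmax over the selected scores and a round-robin expert placement (`expert_id % num_gpus`); the source does not describe the routing or placement the teams actually used, and the function names and toy scores are hypothetical.

```python
import math
from collections import defaultdict


def top_k_route(logits, k):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


def dispatch(token_logits, k, num_gpus):
    """Group (token, expert, weight) assignments by the GPU hosting each expert.

    Placement is round-robin (expert_id % num_gpus), a hypothetical choice
    made only for this illustration.
    """
    per_gpu = defaultdict(list)
    for token_id, logits in enumerate(token_logits):
        for expert_id, weight in top_k_route(logits, k):
            per_gpu[expert_id % num_gpus].append((token_id, expert_id, weight))
    return dict(per_gpu)


# Toy router scores (invented): 3 tokens, 8 experts.
token_logits = [
    [2.0, 0.1, 0.0, 1.5, -1.0, 0.3, 0.2, 0.0],
    [0.0, 3.0, 0.1, 0.0, 2.5, 0.0, 0.1, 0.2],
    [0.1, 0.0, 0.0, 0.2, 0.0, 0.1, 4.0, 3.5],
]
plan = dispatch(token_logits, k=2, num_gpus=4)
for gpu in sorted(plan):
    print(gpu, [(t, e, round(w, 3)) for t, e, w in plan[gpu]])
```

Each GPU only processes the tokens routed to its own experts, which is why balanced routing and placement matter for throughput in a setup like this.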

Technical Deep Dive

Model and Workload

Character.ai uses multiple models, including Qwen and Mistral. This deep dive focuses on optimizing the Qwen3-235B Instruct FP8 model on a cluster of DigitalOcean Droplets featuring AMD GPUs.

The primary objective was to run the Qwen3-235B model on AMD Instinct™ MI325X GPUs with a workload of 5600 / 140 (ISL / OSL), that is, an input sequence length of 5600 tokens and an output sequence length of 140 tokens per request. The goal was to maximize request throughput (QPS) per 8-GPU MI325X server while maintaining strict latency and concurrency constraints.
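The shape of this workload can be worked through with some simple arithmetic. The sketch below uses the ISL and OSL figures from the article; the QPS value is a hypothetical placeholder, since the article reports only a relative improvement and no absolute throughput.

```python
ISL = 5600  # input sequence length (prompt tokens per request), from the article
OSL = 140   # output sequence length (generated tokens per request), from the article


def token_rates(qps, isl=ISL, osl=OSL):
    """Return (prompt_tokens_per_s, generated_tokens_per_s) at a given QPS."""
    return qps * isl, qps * osl


def prompt_share(isl=ISL, osl=OSL):
    """Fraction of all tokens in a request that belong to the prompt."""
    return isl / (isl + osl)


example_qps = 10.0  # hypothetical value for illustration, not a measured result
prefill, decode = token_rates(example_qps)
print(f"prompt tokens/s:    {prefill:.0f}")
print(f"generated tokens/s: {decode:.0f}")
print(f"prompt share:       {prompt_share():.1%}")
```

With 40 prompt tokens for every generated token, most of the tokens in each request are processed during prompt handling rather than generation, which is one reason request throughput (QPS) is a natural metric for this workload.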

Key Optimizations

Mixture-of-Experts (MoE) Models: MoE models scale efficiently by routing each token to a small subset of expert sub-networks, so only a fraction of the model's parameters is active for any given token. The team optimized these models so that the workload is spread effectively across experts.
- Parallelization: By parallelizing the execution of experts, they reduced the overall inference time and increased throughput.
FP8 Execution Paths: FP8 is an 8-bit floating-point format that can significantly speed up computation and reduce memory traffic while maintaining acceptable accuracy. The team implemented efficient FP8 execution paths to take advantage of this.
- Custom Kernels with AITER: AITER (AI Tensor Engine for ROCm), AMD's library of high-performance AI operators, was used to build optimized kernels that take full advantage of the FP8 format.
Topology-Aware GPU Allocation: Ensuring that GPUs were allocated in a topology-aware manner helped minimize data transfer latency and maximize performance.
- Kubernetes Orchestration: DOKS was used to manage the deployment and scaling of GPU resources, ensuring that the system could handle high request density while maintaining low latency.
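The accuracy trade-off behind FP8 can be illustrated in a few lines. The sketch below simulates rounding to an E4M3-style format (4 exponent bits, 3 mantissa bits) with a per-tensor scale. It assumes the OCP E4M3 variant with a maximum finite value of 448; some hardware uses variants with a different maximum, which is why `max_value` is a parameter. This is a software model for intuition only, not how AITER or the production kernels are implemented.

```python
import math

E4M3_MAX = 448.0      # largest finite value in OCP FP8 E4M3 (assumed variant)
MANTISSA_BITS = 3     # explicit mantissa bits in E4M3
MIN_NORMAL_EXP = -6   # exponent of the smallest normal E4M3 value


def round_to_e4m3(x, max_value=E4M3_MAX):
    """Round x to the nearest value representable in an E4M3-style format."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), max_value)             # saturate instead of overflowing
    _, e = math.frexp(mag)                   # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)         # floor(log2(mag)), clamped for subnormals
    step = 2.0 ** (exp - MANTISSA_BITS)      # spacing between neighbouring values
    return sign * min(round(mag / step) * step, max_value)


def quantize_per_tensor(values, max_value=E4M3_MAX):
    """Scale values so the largest magnitude maps to max_value, then round."""
    amax = max(abs(v) for v in values)
    scale = amax / max_value if amax > 0 else 1.0
    quantized = [round_to_e4m3(v / scale, max_value) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map rounded values back to the original range."""
    return [q * scale for q in quantized]


values = [0.02, -1.3, 7.5, 0.4]              # toy tensor, invented for illustration
quantized, scale = quantize_per_tensor(values)
restored = dequantize(quantized, scale)
for v, r in zip(values, restored):
    print(f"{v:+.4f} -> {r:+.4f}  (relative error {abs(r - v) / abs(v):.2%})")
```

With 3 mantissa bits, the relative rounding error for normal values is at most 1/16 (6.25%), which gives a sense of the accuracy the format trades for speed.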

Performance Gains

The optimizations resulted in:

2x Improvement in Request Throughput (QPS): Under strict latency and concurrency constraints, the optimized setup achieved up to a 2x improvement in QPS.
High Request Density: DigitalOcean delivered high request density per node while maintaining strong p90 responsiveness for the initial token (time to first token) and sustained token-generation throughput.
Predictable Scaling: The optimizations allowed Character.ai to scale inference predictably without increasing operational burden.
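One way to read "throughput under latency constraints" is as the highest request rate at which a latency percentile still meets its budget. The sketch below shows that calculation using a nearest-rank p90. The latency samples, the 600 ms budget, and the function names are all invented for illustration; the article does not publish its measurements or thresholds.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def max_qps_within_budget(runs, p90_budget_ms):
    """Return the highest tested QPS whose p90 latency meets the budget, or None."""
    passing = [
        qps for qps, latencies in runs.items()
        if percentile(latencies, 90) <= p90_budget_ms
    ]
    return max(passing) if passing else None


# Synthetic time-to-first-token samples in ms, keyed by offered QPS.
# These numbers are made up for the example and are not from the article.
runs = {
    2.0: [310, 320, 335, 340, 350, 355, 360, 370, 390, 420],
    4.0: [400, 410, 430, 450, 470, 480, 500, 520, 560, 640],
    6.0: [600, 650, 700, 760, 820, 900, 980, 1100, 1300, 1800],
}
best = max_qps_within_budget(runs, p90_budget_ms=600)
print("highest QPS within p90 budget:", best)
```

Framed this way, a 2x gain means the same latency budget is met at twice the request rate per server.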

Impact

These performance gains have not only improved the user experience on Character.ai but also resulted in significant cost savings. The collaboration between Character.ai, DigitalOcean, and AMD has led to a multi-year, eight-figure annual agreement for GPU infrastructure, reflecting the success of this technical partnership.