Cerebras CS-3 Powers AWS for Ultra-Fast AI Inference and Disaggregated Architecture

The Engineer

16 Mar 2026 · 3 min read

Cerebras Systems and AWS are bringing Cerebras CS-3 systems to the AWS cloud to speed up inference for large language models, and are co-developing a disaggregated inference architecture.

Cerebras Systems, which specializes in high-speed AI inference, is teaming up with Amazon Web Services (AWS) to bring its inference hardware to cloud-based AI models. Starting today, AWS customers will have access to Cerebras CS-3 systems via Amazon Bedrock, enabling them to run leading open-source large language models (LLMs) and Amazon’s Nova models at what the companies describe as the industry's highest inference speeds.

The Technical Shift: Why It Matters

AI is rapidly transforming software development, with AI agents increasingly taking over tasks that were traditionally done by human developers. This shift changes the computational requirements for AI inference. Compared with conversational chat, agentic coding generates approximately 15 times more tokens per query, and it needs those tokens delivered quickly to keep developers productive. As a result, demand for faster inference is growing across the industry.

Cerebras has focused on this problem, serving models from OpenAI, Cognition, and Meta at speeds of up to 3,000 tokens per second. By bringing this technology to AWS, one of the world’s leading cloud providers, the collaboration aims to meet the growing demand for fast inference on a global scale.
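To put these figures in perspective, the back-of-the-envelope sketch below compares streaming times at two decode rates. The 15x multiplier and the 3,000 tokens-per-second rate come from the article; the 500-token chat reply and the 100 tokens-per-second baseline are hypothetical values chosen only for illustration, and prefill and network latency are ignored.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream `output_tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


CHAT_REPLY_TOKENS = 500   # hypothetical size of a conversational reply
AGENTIC_MULTIPLIER = 15   # from the article: ~15x more tokens per query
BASELINE_RATE = 100.0     # hypothetical slower service, in tokens per second
FAST_RATE = 3000.0        # from the article: up to 3,000 tokens per second

agentic_tokens = CHAT_REPLY_TOKENS * AGENTIC_MULTIPLIER

for label, rate in (("baseline", BASELINE_RATE), ("fast", FAST_RATE)):
    chat_s = generation_seconds(CHAT_REPLY_TOKENS, rate)
    agent_s = generation_seconds(agentic_tokens, rate)
    print(f"{label:>8}: chat {chat_s:.2f} s, agentic {agent_s:.2f} s")
```

Under these assumptions, the agentic query takes 75 seconds at the baseline rate and 2.5 seconds at 3,000 tokens per second, which is the difference between waiting on the model and working alongside it.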

Disaggregated Inference: A New Approach

To achieve even higher performance, AWS and Cerebras are collaborating on a disaggregated architecture that pairs AWS Trainium with the Cerebras Wafer-Scale Engine (WSE). According to the companies, this approach plays to the strengths of each chip and delivers 5 times more high-speed token capacity in the same hardware footprint.

How Disaggregated Inference Works

Prefill and Decode: Every AI query involves two distinct phases:
- Prefill: Processes the full input prompt in a single pass.
- Decode: Generates the output tokens one at a time.
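The two phases can be illustrated with a toy sketch. Nothing here is a real model: `toy_kv`, `KVCache`, and the modular arithmetic are hypothetical stand-ins for a transformer's key and value projections and its attention step. The point is only the control flow: prefill walks the whole prompt once and fills a cache, and each decode step reads that cache, emits one token, and appends to it.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy cache: one key and one value per token seen so far."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


def toy_kv(token: int) -> tuple:
    """Hypothetical stand-in for a model's key/value projections."""
    return (token * 2) % 97, (token * 3) % 97


def prefill(prompt_tokens: list) -> KVCache:
    """Process the whole prompt and build the cache.

    Real hardware handles all prompt tokens in parallel; the loop is
    only for readability.
    """
    cache = KVCache()
    for token in prompt_tokens:
        key, value = toy_kv(token)
        cache.keys.append(key)
        cache.values.append(value)
    return cache


def decode_step(cache: KVCache) -> int:
    """Produce one token from the full cache, then extend the cache."""
    next_token = (sum(cache.keys) + sum(cache.values)) % 50  # toy "attention"
    key, value = toy_kv(next_token)
    cache.keys.append(key)
    cache.values.append(value)
    return next_token


def generate(prompt_tokens: list, max_new_tokens: int) -> list:
    cache = prefill(prompt_tokens)  # phase 1: runs once per query
    # phase 2: runs once per output token, strictly in sequence
    return [decode_step(cache) for _ in range(max_new_tokens)]


print(generate([1, 2, 3], max_new_tokens=4))
```

Because decode steps depend on one another, they cannot be parallelized across output tokens the way prefill can be parallelized across prompt tokens.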

Trainium for Prefill:
- AWS Trainium is a purpose-built AI chip designed for scalable performance and cost efficiency.
- It excels in compute-bound tasks like prefill, where it computes the key-value (KV) cache.

Cerebras WSE for Decode:
- The Cerebras CS-3 system stores all model weights on-chip in SRAM, providing thousands of times greater memory bandwidth than the fastest GPUs.
- This makes it well suited to the decode phase, which is highly bandwidth-intensive because the model's weights must be read from memory for every token generated.
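A rough way to see why bandwidth dominates decode: if every generated token requires reading all the weights once, the token rate for a single sequence cannot exceed memory bandwidth divided by the size of the weights. The sketch below applies that bound. The 70-billion-parameter model, the 16-bit weights, and the 3 TB/s baseline bandwidth are hypothetical figures chosen for illustration, not specifications of any product named in this article, and the bound ignores batching, KV-cache reads, and compute time.

```python
def decode_rate_upper_bound(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s for one sequence when each token reads all weights once."""
    if weight_bytes <= 0:
        raise ValueError("weight_bytes must be positive")
    return bandwidth_bytes_per_s / weight_bytes


PARAMS = 70e9               # hypothetical model size, in parameters
BYTES_PER_PARAM = 2         # assumes 16-bit weights
BASELINE_BANDWIDTH = 3e12   # hypothetical off-chip memory bandwidth: 3 TB/s

weight_bytes = PARAMS * BYTES_PER_PARAM  # 140 GB of weights read per token

for multiplier in (1, 10, 1000):
    bound = decode_rate_upper_bound(weight_bytes, BASELINE_BANDWIDTH * multiplier)
    print(f"{multiplier:>5}x bandwidth: at most {bound:,.0f} tokens/s per sequence")
```

The bound scales linearly with bandwidth, which is why moving weights into on-chip SRAM targets decode speed directly.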

Implementation Details

Disaggregated Configuration:
- In this setup, Trainium handles the prefill work exclusively, computing the KV cache and sending it to the WSE over AWS's Elastic Fabric Adapter (EFA) interconnect.
- The Cerebras WSE then handles only the decode phase, generating output tokens from the KV cache it receives.

Performance Benchmarks:
- According to the companies, this disaggregated architecture delivers a 5x increase in high-speed token capacity compared to traditional monolithic systems in the same hardware footprint.
- Each phase runs on hardware matched to its bottleneck: compute for prefill and memory bandwidth for decode.
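Splitting prefill and decode across machines adds one cost: the KV cache has to cross the interconnect. The sketch below estimates that cost with the standard size formula for a transformer KV cache (one key tensor and one value tensor per layer). Every number in it is a hypothetical assumption for illustration: the model shape (80 layers, 8 KV heads, head dimension 128), the 8,192-token prompt, the 16-bit cache entries, and the 400 Gbit/s link rate, which is not a stated EFA specification.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The leading factor 2 counts the key tensor plus the value tensor
    stored at every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def transfer_seconds(num_bytes: int, link_gbit_per_s: float) -> float:
    """Ideal transfer time over a link, ignoring protocol overhead and latency."""
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Hypothetical model shape and prompt length.
cache_size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)

# Hypothetical link rate, assumed only for this estimate.
seconds = transfer_seconds(cache_size, link_gbit_per_s=400)

print(f"KV cache: {cache_size / 2**30:.2f} GiB, ideal transfer: {seconds * 1000:.1f} ms")
```

With these assumed values the cache is 2.5 GiB and the ideal transfer takes roughly 54 ms. The handoff is paid once per query, while the decode speedup applies to every output token.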

Impact on the Industry

The collaboration between AWS and Cerebras is a notable step for AI inference technology. By assigning prefill to Trainium and decode to the WSE, the disaggregated architecture is designed to meet the growing demand for fast and efficient AI processing. For developers and businesses relying on AI-driven applications, the companies expect this to mean more productive workflows, faster development cycles, and ultimately, better end-user experiences.