Understanding Prompt Caching: Paged Attention and Prefix Caching for Efficient LLM Inference

Tools & Engineering

The Engineer

1 Dec 2025 · 3 min read

This article explores the nuances of prompt caching in LLMs, focusing on paged attention and prefix caching techniques to boost efficiency and cut costs in real-world applications.

Introduction

Recently, I had to build a chat feature with tool integration under a tight deadline. Initially, I didn't pay much attention to prompt caching, but as I optimized the system, I realized some critical mistakes. This article delves into how prompt caching works, particularly focusing on paged attention and prefix caching, and provides practical tips for improving cache hits.

Why Prompt Caching Matters

Prompt caching is crucial for optimizing large language model (LLM) inference. It reduces latency and computational costs by reusing previously computed results. However, achieving consistent cache hits can be tricky. Here are some tips to improve your cache hit rate:

Keep the System Prompt Stable: Ensure that the system prompt remains unchanged across conversations to maximize reuse.
Minimize User-Specific Data: Avoid adding user-specific data at the end of the system prompt. Instead, place it in the conversation history where it can be managed more flexibly.
Leverage Shared Prompts: Recognize that your system prompt can be shared across all users within the same API key organization.

LLM Inference Basics

To understand prompt caching, let's review the basics of LLM inference:

Prefill Stage: The model processes the input prompt and generates the initial hidden states and key-value (KV) pairs.
Decode Stage: The model uses the KV cache to generate subsequent tokens efficiently.
KV Caching: Storing the KV pairs allows the model to avoid redundant computations during decoding.

The Memory Problem

Traditional KV caching faces significant memory challenges:

Fixed Allocation: Most implementations allocate a fixed amount of memory for the KV cache, which can be inefficient and limit scalability.
Memory Fragmentation: As different prompts vary in length, managing memory efficiently becomes difficult, leading to fragmentation and wasted resources.

Paged Attention

To address these issues, vLLM (a library for efficient LLM inference) introduces paged attention, inspired by operating system principles:

Blocks and Block Tables: The KV cache is divided into blocks, each of which can be allocated or deallocated independently. A block table keeps track of the allocation status.
Dynamic Memory Management: This approach allows for dynamic memory allocation, optimizing resource usage and reducing fragmentation.

Prefix Caching

Prefix caching further enhances efficiency by leveraging shared prefixes:

Block Hashing: Each block in the KV cache is hashed to create a unique identifier.
Longest Cache Hit: The system identifies the longest prefix that matches previously cached blocks, allowing for efficient reuse.
Full Picture: By combining paged attention and prefix caching, vLLM can handle diverse prompts while maintaining high cache hit rates.

Practical Example

Consider a typical conversation array:

0. [system prompt + tool definitions]

1. user:        what's up. please build this feature for me
2. assistant:   can you tell me where to look, it's a big codebase
3. user:        look into kv_caching folder
4. assistant:   you're absolutely right! i will look there
5. tool output: *greps* *reads*
6. assistant:   llm gets output for observation
7. user:        ...
8. assistant:   ...

Initially, I expected to hit the cache at point 4 for this session because points 0-3 repeat. However, I overlooked that cache hits can start at point 0 across different users. The system prompt is shared across all conversations from the same API key organization.

Conclusion

Understanding and optimizing prompt caching is essential for efficient LLM inference. By keeping your system prompts stable, minimizing user-specific data, and leveraging shared prompts, you can improve cache hit rates. Techniques like paged attention and prefix caching further enhance performance by dynamically managing memory and reusing shared prefixes.