
This article explores the nuances of prompt caching in LLMs, focusing on paged attention and prefix caching techniques to boost efficiency and cut costs in real-world applications.
I recently had to build a chat feature with tool integration under a tight deadline. At first I paid little attention to prompt caching, but while optimizing the system I found I had made some critical mistakes. This article covers how prompt caching works, with a focus on paged attention and prefix caching, and gives practical tips for improving cache hit rates.
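Prompt caching is central to efficient large language model (LLM) inference. It reduces latency and compute cost by reusing results already computed for an earlier prompt instead of recomputing them. Consistent cache hits are harder to get than they sound, though. The tips that matter most are keeping the system prompt stable, minimizing user-specific data, and leveraging prompts that are shared across conversations.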
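The "keep the prompt stable" advice can be pictured with a small sketch. Everything here is hypothetical: `build_prompt`, the tool dictionaries, and the plain-text layout are illustrative stand-ins, not any provider's API. The idea is only that the shared part is serialized deterministically and placed first, while per-user content goes last.

```python
import json

def build_prompt(system_prompt, tools, history, user_message):
    """Hypothetical prompt builder: stable content first, dynamic content last."""
    # Sort tools and keys so the same tool set always serializes to the same text,
    # regardless of the order the caller happened to pass them in.
    tools_json = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    stable_prefix = f"{system_prompt}\n{tools_json}\n"
    # Anything user- or session-specific goes after the shared prefix.
    dynamic_suffix = "".join(f"{role}: {text}\n" for role, text in history)
    return stable_prefix + dynamic_suffix + f"user: {user_message}\n"

def common_prefix_len(a, b):
    """Number of leading characters two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

tools_a = [{"name": "grep", "description": "search files"},
           {"name": "read", "description": "read a file"}]
tools_b = list(reversed(tools_a))  # same tools, different order

p1 = build_prompt("You are a coding assistant.", tools_a, [], "what's up")
p2 = build_prompt("You are a coding assistant.", tools_b, [], "fix the test")
shared = common_prefix_len(p1, p2)  # covers the whole system prompt + tool block
```

If a timestamp or user name were interpolated into `system_prompt`, the two prompts would diverge within the first few characters and nothing after that point could be reused.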
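To understand prompt caching, it helps to review the basics of LLM inference. A request is processed in two phases: prefill, where the model processes the whole prompt and computes key and value tensors for every token, and decode, where it generates output one token at a time while reusing those stored tensors (the KV cache) rather than recomputing them.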
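Traditional KV caching faces significant memory challenges. The cache grows with every token, and when each request's cache lives in one contiguous buffer sized for the maximum possible length, much of the reserved GPU memory ends up unused or fragmented.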
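To get a feel for the scale of the memory problem, here is a back-of-the-envelope calculation. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 8192-token reservation are assumed numbers for illustration, not figures from the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading 2 counts one key vector plus one value vector
    per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical model shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=1)     # 131072 bytes = 128 KiB
reserved = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=8192)   # 2**30 bytes = 1 GiB
used = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=500)

# If a contiguous buffer is reserved for 8192 tokens but the request only
# ever reaches 500 tokens, most of the reservation is never touched.
wasted_fraction = 1 - used / reserved  # about 0.94
```

Under these assumptions a single short request would hold a full gibibyte while using roughly 6% of it, which is the kind of waste that block-based allocation is meant to avoid.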

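To address these issues, vLLM (a library for efficient LLM inference) introduces paged attention, inspired by virtual memory paging in operating systems. The KV cache is split into fixed-size blocks that need not be contiguous in memory, and a per-request block table maps logical blocks to physical ones, so memory is allocated on demand as a sequence grows.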
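The paging idea can be sketched in a few lines. This is a toy model, not vLLM's implementation: `BlockAllocator`, `Sequence`, and the block size of 16 are names and values assumed for illustration, and integer block ids stand in for real GPU memory.

```python
BLOCK_SIZE = 16  # tokens per block; assumed value for illustration

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id):
        self.free_blocks.append(block_id)

class Sequence:
    """One request; its block table maps logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the previous one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence are ever left unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release_all(self):
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table = []
        self.num_tokens = 0

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(20):
    seq.append_token()
# 20 tokens occupy 2 blocks (16 + 4); the other 6 blocks stay free for other requests.
```

The physical ids in `block_table` do not have to be adjacent, which is what lets many sequences of unpredictable length share one pool without reserving worst-case space up front.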
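Prefix caching further improves efficiency by reusing KV blocks across requests that share a prefix. Each block is identified by a hash of its own tokens together with everything before it, so a new request whose prompt begins with already-cached blocks can skip prefill for that part.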
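A minimal sketch of block-level prefix caching follows. It is similar in spirit to hash-based prefix caching in inference servers, but `PrefixCache`, `block_hashes`, the block size of 4, and the use of SHA-256 are assumptions made for this example; integer ids stand in for stored KV tensors.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; kept tiny so the example is easy to follow

def block_hashes(token_ids):
    """Hash each full block together with the hash of the block before it.

    Chaining means a block's hash identifies the entire prefix up to and
    including that block, not just the block's own tokens.
    """
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore a partial last block
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> stand-in id for the cached KV block

    def insert(self, token_ids):
        for h in block_hashes(token_ids):
            self.blocks.setdefault(h, len(self.blocks))

    def lookup(self, token_ids):
        """Return how many leading tokens already have cached KV blocks."""
        hit = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break  # first miss ends the reusable prefix
            hit += BLOCK_SIZE
        return hit

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])        # caches two full blocks
same_prefix = cache.lookup([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
first_token_differs = cache.lookup([0, 2, 3, 4, 5, 6, 7, 8])
second_block_differs = cache.lookup([1, 2, 3, 4, 9, 9, 9, 9])
```

Because of the chaining, a change in the very first token invalidates every later block, even when the later tokens are identical. That is why stable content belongs at the start of the prompt.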
Consider a typical conversation array:
0. [system prompt + tool definitions]
1. user: what's up. please build this feature for me
2. assistant: can you tell me where to look, it's a big codebase
3. user: look into kv_caching folder
4. assistant: you're absolutely right! i will look there
5. tool output: *greps* *reads*
6. assistant: llm gets output for observation
7. user: ...
8. assistant: ...
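Initially, I expected cache hits to begin only at point 4 within this session, because points 0-3 are resent unchanged with each new request. What I overlooked is that cache hits can start at point 0 across different users: the system prompt and tool definitions are identical in every conversation, and the cache is shared across all conversations from the organization that the API key belongs to.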
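The two kinds of hit can be contrasted with a toy comparison. This is a simplification: real caches match on tokens or token blocks rather than whole messages, and the names `alice_*` and `bob_*` and the second user's message are invented for the example.

```python
def shared_prefix_len(a, b):
    """Number of leading items two message lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "[system prompt + tool definitions]"  # point 0, identical for every user

# Same session, two consecutive requests: the second resends the first in full.
alice_request_1 = [SYSTEM, "user: what's up. please build this feature for me"]
alice_request_2 = alice_request_1 + [
    "assistant: can you tell me where to look, it's a big codebase",
    "user: look into kv_caching folder",
]

# A different user's first request (hypothetical message).
bob_request_1 = [SYSTEM, "user: fix the failing test"]

within_session = shared_prefix_len(alice_request_1, alice_request_2)  # 2: whole earlier request
across_users = shared_prefix_len(alice_request_2, bob_request_1)      # 1: just point 0
```

The cross-user overlap is only one item long here, but that item is the system prompt plus tool definitions, which is often the largest single part of the request.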
Understanding and optimizing prompt caching is essential for efficient LLM inference. By keeping your system prompts stable, minimizing user-specific data, and leveraging shared prompts, you can improve cache hit rates. Techniques like paged attention and prefix caching further enhance performance by dynamically managing memory and reusing shared prefixes.
Original Sources
↗ https://sankalp.bearblog.dev/how-prompt-caching-works/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
1 December 2025