
ParaLLM uses batched KV caching to speed up parallel LLM inference on MacBooks with MLX, reaching 1600+ tokens/sec of total throughput on small models when generating many outputs at once.
June 23, 2024
If you've been working with Large Language Models (LLMs) on your MacBook and found that parallel inference is lacking, there's good news. I recently developed ParaLLM, a solution for fast parallel LLM inference using MLX. This approach leverages batched key-value (KV) caching to significantly boost throughput, especially useful for generating multiple outputs at once.
For single-stream applications like chat interfaces, tools like llama.cpp and MLXServer perform well on Apple devices. However, when you need to sample a large number of outputs simultaneously (for example, to evaluate training runs or develop "agent-flavored" applications), these tools fall short in terms of total throughput.
On CUDA machines, solutions like vLLM offer high tok/s throughput with parallel requests, but they don't work on Macs. This is where ParaLLM comes in.
The main feature enabling this speedup is batched key-value caching. By extending the generate method from the existing mlx_lm library, I introduced a BatchedKVCache object and a batch_generate method to handle multiple decoding channels.
- **`load` function**: Loads the model and tokenizer.
- **`batch_generate` method**: Generates multiple outputs in parallel using batched KV caching.

Here's a sample code snippet to illustrate how it works:
```python
import random
import string

from mlx_parallm.utils import load, generate, batch_generate

# All unordered pairs of distinct capital letters: C(26, 2) = 325 pairs.
capital_letters = string.ascii_uppercase
distinct_pairs = [(a, b) for i, a in enumerate(capital_letters) for b in capital_letters[i + 1:]]

prompt_template = "Think of a real word containing both the letters {l1} and {l2}. Then, say 3 sentences which use the word."
# One prompt per pair, in random order.
prompts_raw = [prompt_template.format(l1=p[0], l2=p[1]) for p in random.sample(distinct_pairs, 325)]

# Load the model, then generate all 325 responses in one batched call.
model, tokenizer = load("google/gemma-1.1-2b-it")
responses = batch_generate(model, tokenizer, prompts=prompts_raw, max_tokens=100, verbose=True, temp=0.0)
```
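To make the idea of batched key-value caching more concrete, here is a minimal pure-Python sketch. It is not the `BatchedKVCache` from mlx_parallm: the class name `ToyBatchedKVCache`, the `update_and_fetch` method, the nested-list layout, and the assumption that every prompt in the batch has the same length are all illustrative choices made for this example. It only shows the core bookkeeping: one cache object holds the keys and values for every sequence in the batch, and each decoding step appends one position per sequence.

```python
class ToyBatchedKVCache:
    """Toy stand-in for a batched KV cache (hypothetical, not the mlx_parallm class).

    keys[b][t] holds the key vector of sequence b at position t, so a single
    object tracks every decoding channel in the batch.
    """

    def __init__(self, batch_size):
        self.keys = [[] for _ in range(batch_size)]
        self.values = [[] for _ in range(batch_size)]
        self.offset = 0  # number of cached positions, shared by all rows

    def update_and_fetch(self, new_keys, new_values):
        """Append new positions for every row, then return the full cache."""
        if len(new_keys) != len(self.keys) or len(new_values) != len(self.values):
            raise ValueError("expected one entry per sequence in the batch")
        steps = {len(row) for row in new_keys} | {len(row) for row in new_values}
        if len(steps) != 1:
            raise ValueError("every sequence must add the same number of positions")
        for b, (k_row, v_row) in enumerate(zip(new_keys, new_values)):
            self.keys[b].extend(k_row)
            self.values[b].extend(v_row)
        self.offset += steps.pop()
        return self.keys, self.values


# Prefill: two prompts, assumed to be 3 positions long each, with 2-dim vectors.
cache = ToyBatchedKVCache(batch_size=2)
prompt_k = [[[0.1, 0.2]] * 3, [[0.3, 0.4]] * 3]
prompt_v = [[[1.0, 1.0]] * 3, [[2.0, 2.0]] * 3]
cache.update_and_fetch(prompt_k, prompt_v)

# Decode: each step adds exactly one new position per sequence.
step_k = [[[0.5, 0.6]], [[0.7, 0.8]]]
step_v = [[[3.0, 3.0]], [[4.0, 4.0]]]
keys, values = cache.update_and_fetch(step_k, step_v)
print(cache.offset, len(keys[0]), len(keys[1]))  # prints: 4 4 4
```

A real implementation would store these as tensors with a batch dimension so that attention for all sequences runs as one array operation, which is where the throughput gain comes from.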
### Performance Benchmarks
For "small" models like Gemma-2B, ParaLLM achieves **1600+ tokens/sec** in total throughput on a 128GB M3 Max MacBook. This is a significant improvement over single-stream generation.
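If you want to check total throughput on your own machine, a small timing helper along these lines may be enough. This is a sketch, not part of mlx_parallm: `total_throughput`, `generate_fn`, and `count_tokens` are hypothetical names, and the stand-in generator below only exists so the example runs without a model.

```python
import time


def total_throughput(generate_fn, prompts, count_tokens):
    """Time one batched call and return (responses, tokens_per_second).

    generate_fn and count_tokens are caller-supplied (hypothetical) callables:
    generate_fn(prompts) -> list of strings, count_tokens(text) -> int.
    """
    start = time.perf_counter()
    responses = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    generated = sum(count_tokens(r) for r in responses)
    tokens_per_second = generated / elapsed if elapsed > 0 else float("inf")
    return responses, tokens_per_second


def fake_generate(prompts):
    """Stand-in for a real batched generator, so the sketch runs anywhere."""
    time.sleep(0.01)
    return [p + " ok ok" for p in prompts]


# Whitespace splitting is a crude token count used only for this demo.
responses, tps = total_throughput(fake_generate, ["a", "b"], lambda t: len(t.split()))
print(f"{len(responses)} responses at {tps:.0f} tokens/sec (stand-in numbers)")
```

With a real model, `generate_fn` could wrap the call shown earlier, for example `lambda ps: batch_generate(model, tokenizer, prompts=ps, max_tokens=100)`, and `count_tokens` would ideally use the model's tokenizer. As a rough sanity check on the scale involved: 325 prompts at up to 100 tokens each is at most 32,500 generated tokens, which would take about 20 seconds if the 1600 tokens/sec figure held for that workload.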
### Supported Models and Future Work
I've tested ParaLLM with the following models:
- **Gemma-2B**
- **Phi-3-mini**
- **Llama3-8B**
All of these models show substantial throughput gains, especially as you increase the number of parallel requests.
### Additional Features
While features like repetition penalties and streaming outputs are not yet supported, I plan to contribute a `batch_generate` PR for mlx_lm once it reaches a stable, non-breaking state. In the meantime, adding other models is straightforward:
- **Copy Architecture Files**: Copy the necessary architecture files from `mlx_lm/models` into `mlx_parallm/models`.
- **Replace KVCache References**: Replace any `KVCache` references with `BatchedKVCache`.
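The two porting steps above can be sketched as a short script. This is an assumption-laden illustration rather than a tool shipped with mlx_parallm: `port_model_file` is a hypothetical helper, `toy_model.py` and its contents are made up for the demo, and the paths are created in a temporary directory. After a substitution like this, the import lines in the ported file would still need a manual check to confirm that `BatchedKVCache` is importable from wherever the file expects it.

```python
import re
import tempfile
from pathlib import Path


def port_model_file(src: Path, dst_dir: Path) -> Path:
    """Copy one architecture file, swapping KVCache for BatchedKVCache."""
    source = src.read_text()
    # \b restricts the match to the standalone name, so an existing
    # "BatchedKVCache" is left alone (no word boundary between "d" and "K").
    ported = re.sub(r"\bKVCache\b", "BatchedKVCache", source)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    dst.write_text(ported)
    return dst


# Demo on a made-up file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "mlx_lm" / "models" / "toy_model.py"
    src.parent.mkdir(parents=True)
    src.write_text("from .base import KVCache\n\ncache: KVCache = None\n")
    dst = port_model_file(src, Path(tmp) / "mlx_parallm" / "models")
    ported_text = dst.read_text()

print(ported_text)
```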
### Conclusion
ParaLLM brings high-throughput parallel LLM inference to Mac users, making it a valuable tool for evaluating training runs and developing [complex](/articles/sglang-and-radixattention-accelerating-complex-llm-workloads-with-efficient-kv-cache-reuse) applications. The code is available on GitHub at [mlx_parallm](https://github.com/willccbb/mlx_parallm/tree/main), and I encourage you to try it out and provide feedback.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.