ParaLLM: Achieving 1600+ Tokens/Sec on a MacBook with Batched KV Caching

The Engineer

25 Jun 2024 · 3 min read

ParaLLM uses batched KV caching to speed up parallel inference on MacBooks, reaching 1600+ tokens/sec of total throughput when generating many outputs at once.

June 23, 2024

If you've been working with Large Language Models (LLMs) on your MacBook and found that parallel inference is lacking, there's good news. I recently developed ParaLLM, a solution for fast parallel LLM inference using MLX. It uses batched key-value (KV) caching to raise total throughput, which is especially useful when generating many outputs at once.

The Problem with Single-Stream Inference

For single-stream applications like chat interfaces, tools like llama.cpp and MLXServer perform well on Apple devices. However, when you need to sample a large number of outputs simultaneously, for example when evaluating training runs or developing "agent-flavored" applications, these tools fall short in total throughput.

On CUDA machines, solutions like vLLM offer high tok/s throughput with parallel requests, but they don't work on Macs. This is where ParaLLM comes in.

Batched KV Caching: The Key to Parallel Inference

The main feature enabling this speedup is batched key-value caching. By extending the generate method from the existing mlx_lm library, I introduced a BatchedKVCache object and a batch_generate method to handle multiple decoding channels.
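To make the idea concrete, here is a minimal sketch of what a batched KV cache for a single attention layer might look like. It is not ParaLLM's actual `BatchedKVCache`: the class name `ToyBatchedKVCache`, its constructor arguments, and the `step` growth size are assumptions made for illustration, and NumPy stands in for MLX arrays. The point is only that one buffer with a leading batch dimension lets every decoding channel append its new keys and values in a single call.

```python
import numpy as np


class ToyBatchedKVCache:
    """Illustrative batched KV cache for one attention layer (hypothetical, not ParaLLM's class)."""

    def __init__(self, batch_size, n_kv_heads, head_dim, step=256):
        self.batch_size = batch_size
        self.n_kv_heads = n_kv_heads
        self.head_dim = head_dim
        self.step = step    # buffers grow in chunks of this many positions
        self.offset = 0     # number of positions cached so far
        self.keys = None
        self.values = None

    def update_and_fetch(self, keys, values):
        """Append arrays of shape (batch, heads, new_len, head_dim) and
        return views over everything cached so far."""
        needed = self.offset + keys.shape[2]
        if self.keys is None or needed > self.keys.shape[2]:
            # Round capacity up to a multiple of `step` so we do not reallocate on every token.
            capacity = -(-needed // self.step) * self.step
            shape = (self.batch_size, self.n_kv_heads, capacity, self.head_dim)
            grown_k = np.zeros(shape, dtype=keys.dtype)
            grown_v = np.zeros(shape, dtype=values.dtype)
            if self.keys is not None:
                grown_k[:, :, :self.offset] = self.keys[:, :, :self.offset]
                grown_v[:, :, :self.offset] = self.values[:, :, :self.offset]
            self.keys, self.values = grown_k, grown_v
        # One write covers every sequence in the batch.
        self.keys[:, :, self.offset:needed] = keys
        self.values[:, :, self.offset:needed] = values
        self.offset = needed
        return self.keys[:, :, :needed], self.values[:, :, :needed]


# Prefill 4 sequences with a 5-token prompt, then decode one token for all 4 at once.
cache = ToyBatchedKVCache(batch_size=4, n_kv_heads=2, head_dim=8, step=4)
prompt_kv = np.ones((4, 2, 5, 8), dtype=np.float32)
step_kv = np.full((4, 2, 1, 8), 2.0, dtype=np.float32)
cache.update_and_fetch(prompt_kv, prompt_kv)
k, v = cache.update_and_fetch(step_kv, step_kv)
print(k.shape)  # (4, 2, 6, 8)
```

This sketch assumes all sequences in the batch share the same length at every step; handling prompts of different lengths (for example by padding) is outside what it shows.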

Implementation Details

- Model Loading: The load function loads the model and tokenizer, using the same model loading mechanism as MLX.
- Batch Generation: The batch_generate method generates multiple outputs in parallel using batched KV caching.

Here's a sample code snippet to illustrate how it works:

```python
from mlx_parallm.utils import load, generate, batch_generate

# Fun trick for generating workloads
import random
import string

capital_letters = string.ascii_uppercase
# All 325 unordered pairs of distinct capital letters
distinct_pairs = [(a, b) for i, a in enumerate(capital_letters) for b in capital_letters[i + 1:]]
prompt_template = "Think of a real word containing both the letters {l1} and {l2}. Then, say 3 sentences which use the word."
# Sampling 325 of the 325 pairs yields every pair once, in random order
prompts_raw = [prompt_template.format(l1=p[0], l2=p[1]) for p in random.sample(distinct_pairs, 325)]

model, tokenizer = load("google/gemma-1.1-2b-it")
responses = batch_generate(model, tokenizer, prompts=prompts_raw, max_tokens=100, verbose=True, temp=0.0)
```


### Performance Benchmarks

For "small" models like Gemma-2B, ParaLLM achieves **1600+ tokens/sec** in total throughput on a 128GB M3 Max MacBook. This is a significant improvement over single-stream generation.
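For clarity on what "total throughput" means here, the following sketch shows the arithmetic: tokens generated across all parallel responses, divided by wall-clock time. The helper `total_throughput` is hypothetical and not part of ParaLLM. The numbers below are a back-of-the-envelope check that assumes every one of the 325 sample prompts runs to the `max_tokens=100` limit in a single batch, which the benchmark above does not state.

```python
def total_throughput(tokens_per_response, elapsed_seconds):
    """Tokens generated across all parallel responses per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sum(tokens_per_response) / elapsed_seconds


# Assumption: all 325 responses reach the 100-token limit.
batch_size = 325
total_tokens = batch_size * 100            # 32,500 tokens
seconds_at_1600 = total_tokens / 1600      # about 20.3 s at 1600 tokens/sec

# Assumption: the 1600 tokens/sec is shared evenly by 325 streams in one batch.
per_stream_rate = 1600 / batch_size        # about 4.9 tokens/sec per stream

print(total_throughput([100] * batch_size, seconds_at_1600))
```

Under those assumptions each individual stream would be slow, which is the trade-off: batching optimizes aggregate tokens per second rather than the latency of any one response.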

### Supported Models and Future Work

I've tested ParaLLM with the following models:
- **Gemma-2B**
- **Phi-3-mini**
- **Llama3-8B**

All of these models show substantial throughput gains, especially as you increase the number of parallel requests.

### Additional Features

While features like repetition penalties and streaming outputs are not yet supported, I plan to contribute a `batch_generate` PR for mlx_lm once it reaches a stable, non-breaking state. In the meantime, adding other models is straightforward:
- **Copy Architecture Files**: Copy the necessary architecture files from `mlx_lm/models` into `mlx_parallm/models`.
- **Replace KVCache References**: Replace any `KVCache` references with `BatchedKVCache`.
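The two steps above can be sketched as a small script. This is only an illustration under assumptions: `port_model_file` is a hypothetical helper, not something shipped with mlx_parallm, and a blind text substitution will also rewrite import lines, so the ported file still needs a manual check that `BatchedKVCache` is imported from wherever the repository defines it.

```python
import re
from pathlib import Path


def port_model_file(src: Path, dst_dir: Path) -> Path:
    """Copy one architecture file and rename KVCache references (hypothetical helper)."""
    source = src.read_text()
    # \b keeps the pattern from matching inside an existing "BatchedKVCache",
    # so running the helper twice does not produce "BatchedBatchedKVCache".
    ported = re.sub(r"\bKVCache\b", "BatchedKVCache", source)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    dst.write_text(ported)
    return dst


# Example call (paths are illustrative; adjust them to your checkout):
# port_model_file(Path("mlx_lm/models/llama.py"), Path("mlx_parallm/models"))
```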

### Conclusion

ParaLLM brings high-throughput parallel LLM inference to Mac users, making it a valuable tool for evaluating training runs and developing complex applications. The code is available on GitHub at [mlx_parallm](https://github.com/willccbb/mlx_parallm/tree/main), and I encourage you to try it out and provide feedback.