CLLMs: Efficient Parallel Decoding for Faster LLM Inference

The Engineer

10 May 2024

Researchers at Hao AI Lab have developed CLLMs, a technique that speeds up large language model inference by decoding multiple tokens in parallel, cutting latency without auxiliary models or extra memory.

Researchers from Hao AI Lab have introduced Consistency Large Language Models (CLLMs), a new approach to parallel decoding for large language models (LLMs). LLMs traditionally decode tokens sequentially, one at a time, which leads to high latency for longer responses. CLLMs are instead finetuned to decode multiple tokens in parallel, reducing inference time without additional memory or architectural modifications.

What Changed Technically?

Parallel Decoding: Unlike conventional autoregressive (AR) decoding, where each forward pass produces a single token, CLLMs update all tokens of an n-token sequence in each inference step and settle on the final sequence in fewer than n steps. This approach mimics how humans form complete sentences in their minds before speaking.
Finetuning for Consistency: Pretrained LLMs are finetuned to map any randomly initialized n-token sequence to the same result as AR decoding, but in fewer steps. The training objective is to minimize a global consistency loss, ensuring that the parallel-generated sequence aligns with the sequential one.

Why It Matters

Inference Speed: CLLMs achieve 2.4× to 3.4× faster generation than sequential AR decoding. This is on par with, or even better than, other fast inference techniques such as Medusa2 and Eagle.
No Additional Memory Cost: Unlike some acceleration techniques that require auxiliary models or components, CLLMs maintain the same memory footprint as their pretrained counterparts during inference.
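
Where the speedup comes from can be sketched with simple arithmetic: if a parallel step costs about as much as one AR forward pass, the gain is roughly the ratio of forward passes needed. The function name and all numbers below are hypothetical illustrations, not measurements from the CLLM experiments.

```python
def estimated_speedup(n_tokens: int, parallel_steps: int,
                      ar_pass_ms: float, parallel_pass_ms: float) -> float:
    """Rough latency ratio of AR decoding to parallel decoding.

    AR decoding needs one forward pass per token; parallel decoding
    needs `parallel_steps` passes for the same n_tokens.
    """
    if parallel_steps <= 0:
        raise ValueError("parallel_steps must be positive")
    ar_latency = n_tokens * ar_pass_ms
    parallel_latency = parallel_steps * parallel_pass_ms
    return ar_latency / parallel_latency


# Hypothetical numbers: 16 tokens decoded in 5 parallel steps,
# with both kinds of pass assumed to take 20 ms.
print(estimated_speedup(16, 5, 20.0, 20.0))
```

Real measurements also depend on hardware, batch size, and the slightly higher cost of a pass over n positions, so this ratio is only an upper-level intuition.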

Technical Details

Jacobi Decoding:
- Concept: Jacobi decoding is inspired by the Jacobi fixed-point iteration method for solving systems of nonlinear equations. It reformulates the sequential generation of n tokens as a system of n nonlinear equations, one per token position, which can be solved in parallel.
- Process:
  - Initialization: Start with a random guess for the next n tokens.
  - Iteration: Feed this sequence, together with the prompt, into the LLM and repeat until the sequence stops changing between iterations. Each iteration might predict more than one correct token (correct here means alignment with the AR decoding result under a greedy sampling strategy).
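
The process above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_next_token` is a hypothetical stand-in for the greedy argmax of a real LLM, and `VOCAB_SIZE`, `ar_decode`, and `jacobi_decode` are names invented for this example. In a real model, the list comprehension inside the loop would be a single batched forward pass over all n positions.

```python
import random
from typing import List, Sequence, Tuple

VOCAB_SIZE = 50  # hypothetical toy vocabulary size


def toy_next_token(prefix: Sequence[int]) -> int:
    """Stand-in for greedy decoding: a deterministic function of the prefix."""
    a = prefix[-1] if len(prefix) >= 1 else 0
    b = prefix[-2] if len(prefix) >= 2 else 0
    return (3 * a + 5 * b + 7) % VOCAB_SIZE


def ar_decode(prompt: Sequence[int], n: int) -> List[int]:
    """Sequential baseline: one call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: Sequence[int], n: int, seed: int = 0) -> Tuple[List[int], int]:
    """Refine a random n-token guess until it stops changing."""
    rng = random.Random(seed)
    guess = [rng.randrange(VOCAB_SIZE) for _ in range(n)]
    iterations = 0
    while True:
        iterations += 1
        # Every position is updated from the PREVIOUS guess, so the n
        # updates are independent and could run in parallel.
        new_guess = [toy_next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new_guess == guess:  # fixed point reached
            return guess, iterations
        guess = new_guess


prompt = [1, 2, 3]
tokens, iters = jacobi_decode(prompt, n=8)
print(tokens == ar_decode(prompt, n=8), iters)
```

The fixed point is exactly the greedy AR output, because position i can only stay unchanged once every earlier position is already correct. With this toy function the loop gains roughly one correct token per iteration, which is no faster than AR decoding; the finetuning described next is what aims to make the model converge in fewer iterations.
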
Training:
- Objective: Minimize the global consistency loss, which measures the difference between the parallel-generated sequence and the AR-generated sequence.
- Steps:
  - Randomly initialize an n-token sequence.
  - Use Jacobi iteration to refine this sequence until it aligns with the AR result.
  - Finetune the model to perform this mapping in fewer iterations.
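
One way to picture the training objective is as a per-position distance between two sets of next-token distributions: those the model produces from an intermediate (not yet converged) sequence, and those it produces from the converged, AR-matching sequence. The sketch below assumes KL divergence as the distance and plain Python lists as distributions; both choices, and the names `kl_divergence` and `consistency_loss`, are illustrative assumptions rather than the authors' exact formulation.

```python
import math
from typing import List

Dist = List[float]  # probabilities over a shared vocabulary


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(target_dists: List[Dist], student_dists: List[Dist]) -> float:
    """Sum over the n positions of KL(target_i || student_i).

    target_dists:  distributions conditioned on the converged sequence
                   (treated as fixed targets, an assumption of this sketch).
    student_dists: distributions conditioned on an intermediate sequence.
    """
    if len(target_dists) != len(student_dists):
        raise ValueError("both inputs must cover the same n positions")
    return sum(kl_divergence(t, s) for t, s in zip(target_dists, student_dists))


# Hypothetical 2-position example over a 3-token vocabulary.
target = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
student = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3]]
print(consistency_loss(target, student))
```

The loss is zero when the model already predicts the converged distributions from the intermediate sequence, which is the behavior that lets decoding finish in fewer iterations.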

Experiment Results

Speedup: CLLM-ABEL-7B-001, a specific implementation of CLLMs, demonstrated a ∼3× speedup compared to its baseline ABEL-7B-001 when tested on GSM8K using Jacobi decoding.
Memory Efficiency: The model achieved these performance gains without increasing memory usage, maintaining the same resource requirements as the pretrained model.

Implications for Practitioners

For developers and researchers working with LLMs, CLLMs offer a straightforward way to enhance inference speed without compromising on accuracy or memory efficiency. This approach can be particularly beneficial in applications requiring real-time responses, such as chatbots, content generation, and interactive AI systems.

By leveraging the principles of Jacobi decoding and finetuning for consistency, CLLMs provide a practical solution to one of the most pressing challenges in LLM deployment: reducing latency while maintaining high-quality output.