Transformer-Lite: High-Efficiency Deployment of Large Language Models on Mobile GPUs

The Engineer

2 Apr 2024

Transformer-Lite is a mobile inference engine that speeds up large language model inference on phone GPUs, making on-device applications such as text summarization and translation more responsive.

Large Language Models (LLMs) have become a cornerstone of applications such as intelligent assistants, text summarization, translation, and multimodal tasks. Deploying these models on mobile devices remains difficult, however, because slow on-device inference leads to a poor user experience. A recent paper by Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, and Qin Xie introduces Transformer-Lite, a mobile inference engine designed to run LLMs efficiently on device GPUs.

Key Technical Changes

The authors propose four key optimization techniques to enhance the performance of LLMs on mobile GPUs:

Dynamic Shape Inference:
- Symbolic Expression-Based Approach: Tensor dimensions that are not known ahead of time, such as the prompt length, are represented as symbolic expressions. This lets the engine run dynamic-shape models on inputs of varying size instead of being tied to one fixed shape.
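
To make the idea of symbolic shapes concrete, here is a minimal Python sketch. It is not Transformer-Lite's implementation: the `Sym` class, the `matmul_shape` rule, and the hidden size of 4096 are all hypothetical, chosen only to show how a shape can be derived once with the sequence length left as a symbol and then resolved cheaply for each input.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Sym:
    """Hypothetical symbolic dimension of the form coeff * name + offset."""
    name: str
    coeff: int = 1
    offset: int = 0

    def __add__(self, k: int) -> "Sym":
        return Sym(self.name, self.coeff, self.offset + k)

    def __mul__(self, k: int) -> "Sym":
        return Sym(self.name, self.coeff * k, self.offset * k)

    def resolve(self, bindings: Dict[str, int]) -> int:
        """Substitute a concrete value for the symbol."""
        return self.coeff * bindings[self.name] + self.offset


Dim = Union[int, Sym]
Shape = Tuple[Dim, ...]


def matmul_shape(a: Shape, b: Shape) -> Shape:
    """Shape rule for (..., m, k) @ (k, n) -> (..., m, n)."""
    if a[-1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a[-1]} vs {b[0]}")
    return a[:-1] + (b[1],)


def resolve_shape(shape: Shape, bindings: Dict[str, int]) -> Tuple[int, ...]:
    """Turn a shape with symbolic entries into concrete integers."""
    return tuple(d.resolve(bindings) if isinstance(d, Sym) else d for d in shape)


# Shapes are derived once, with the prompt length left symbolic.
seq = Sym("seq_len")
hidden = 4096  # illustrative hidden size, not taken from the paper
x_shape: Shape = (1, seq, hidden)
qkv_shape = matmul_shape(x_shape, (hidden, 3 * hidden))
kv_len = seq + 1  # cache length after one more decoded token

# At run time only the symbol is bound; the shape rules are not re-run.
print(resolve_shape(qkv_shape, {"seq_len": 17}))
print(kv_len.resolve({"seq_len": 17}))
```

The design point is that shape derivation is separated from shape evaluation, so a new prompt length only requires substituting a number into expressions that already exist.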
Operator Optimizations and Execution Priority Setting:
- Enhanced Inference Speed: Optimized operators shorten inference time, and setting the execution priority of the inference workload reduces the phone lag users would otherwise notice while the model is running.
- Specific Optimizations: These include custom kernels for common operations such as matrix multiplication and attention.
FP4 Quantization (M0E4):
- Reduced Dequantization Overhead: M0E4 is a 4-bit floating-point format designed to make dequantization cheap at inference time, so the memory savings of 4-bit weights are not offset by slower kernels.
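
The article does not describe the M0E4 bit layout, so the sketch below does not attempt to reproduce it. Instead it shows ordinary group-wise 4-bit integer quantization, a different and simpler scheme, to illustrate where dequantization overhead comes from: every weight has to be unpacked and rescaled before it can be used. The function names and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[bytes, float, float]:
    """Map one group of floats to 4-bit codes (0..15), packed two per byte.

    Plain integer quantization for illustration only; this is NOT M0E4.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up into bytes
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_group(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack and rescale: the per-weight work done at inference time."""
    out: List[float] = []
    for byte in packed:
        out.append((byte & 0x0F) * scale + lo)  # low nibble
        out.append((byte >> 4) * scale + lo)    # high nibble
    return out[:n]


weights = [-0.5, -0.1, 0.0, 0.2, 0.5, 0.3, -0.3, 0.1]  # made-up sample group
packed, scale, lo = quantize_group(weights)
restored = dequantize_group(packed, scale, lo, len(weights))

print(len(packed), "bytes for", len(weights), "weights")
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each stored weight takes half a byte, but recovering it costs a mask or shift, a multiply, and an add. A format that makes this step cheaper keeps more of the memory benefit as real speedup.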
Sub-Tensor-Based Technique:
- Elimination of KV Cache Copying: This technique removes the need to copy the key-value (KV) cache after each inference step, saving memory traffic and time on every generated token.
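
The KV cache point can be illustrated with a small Python sketch. It assumes a preallocated buffer and uses `memoryview` slices as stand-ins for sub-tensors; the `KVCache` class and its `[max_len, head_dim]` layout are hypothetical and are not the engine's actual data structures. New entries are written in place, and readers receive a view of the filled prefix rather than a fresh copy.

```python
from array import array
from typing import Sequence


class KVCache:
    """Preallocated cache with a hypothetical [max_len, head_dim] layout."""

    def __init__(self, max_len: int, head_dim: int) -> None:
        self.max_len = max_len
        self.head_dim = head_dim
        self.length = 0  # number of tokens currently stored
        self._buf = array("f", [0.0]) * (max_len * head_dim)
        self._view = memoryview(self._buf)

    def append(self, vec: Sequence[float]) -> None:
        """Write one token's vector directly into its slot in the buffer."""
        if len(vec) != self.head_dim:
            raise ValueError("vector length must equal head_dim")
        if self.length == self.max_len:
            raise IndexError("cache is full")
        start = self.length * self.head_dim
        self._view[start:start + self.head_dim] = array("f", vec)
        self.length += 1

    def valid(self) -> memoryview:
        """Zero-copy view of the filled prefix (the 'sub-tensor')."""
        return self._view[: self.length * self.head_dim]


cache = KVCache(max_len=4, head_dim=2)
cache.append([1.0, 2.0])
cache.append([3.0, 4.0])

view = cache.valid()
print(list(view))
print(view.obj is cache._buf)  # the view shares the original buffer
```

The alternative, concatenating the old cache with the new entry into a newly allocated tensor at every step, copies the whole cache once per generated token, and that cost grows with sequence length.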

Implementation Details

Transformer-Lite is implemented as a mobile inference engine that is compatible with both Qualcomm and MediaTek (MTK) processors. The authors evaluated it on LLMs of varied architectures, with sizes ranging from 2B to 14B parameters.

Prefill and Decoding Speeds:
- For the ChatGLM2 6B model, Transformer-Lite achieved prefill speeds of 121 tokens/second and decoding speeds of 14 tokens/second.
- For the smaller Gemma 2B model, it achieved prefill speeds of 330 tokens/second and decoding speeds of 30 tokens/second.
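
To translate these throughput figures into something a user would feel, the following Python sketch estimates end-to-end response time as prefill time plus decoding time. The prompt length of 512 tokens and the output length of 128 tokens are hypothetical values chosen for illustration; only the tokens-per-second figures come from the results above.

```python
def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Seconds to process the prompt and then generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# (prefill tokens/s, decoding tokens/s) as reported in the article
REPORTED = {
    "ChatGLM2 6B": (121.0, 14.0),
    "Gemma 2B": (330.0, 30.0),
}

PROMPT_TOKENS = 512   # hypothetical prompt length
OUTPUT_TOKENS = 128   # hypothetical response length

for model, (prefill_tps, decode_tps) in REPORTED.items():
    seconds = response_time(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, decode_tps)
    print(f"{model}: {seconds:.1f} s")
```

For workloads of this shape, decoding accounts for most of the wait, which is why prefill and decoding speeds are reported separately.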

Performance Benchmarks

Compared to other inference engines:

Prefill Speed: Transformer-Lite attains over a 10x speedup compared with CPU-based FastLLM and GPU-based MLC-LLM.
Decoding Speed: It achieves a 2-3x speedup over the same two engines.

Why This Matters

For mobile developers and practitioners, these optimizations mean:

Improved User Experience: Faster inference times lead to more responsive applications.
Resource Efficiency: 4-bit weights and a KV cache that is never copied reduce memory use and memory traffic, making it more feasible to run larger models on resource-constrained devices.
Versatility: Compatibility with both Qualcomm and MediaTek (MTK) processors means the engine is not tied to a single chip vendor.

Conclusion

Transformer-Lite is a meaningful step forward for deploying LLMs on mobile devices. By addressing bottlenecks in dynamic shape inference, operator performance, quantization, and KV cache handling, it offers a practical option for developers who want faster on-device LLM inference in their mobile applications.