
Transformer-Lite optimizes large language models for mobile GPUs with faster inference speeds, enhancing user experience in applications like text summarization and translation without sacrificing functionality.
Large Language Models (LLMs) have become a cornerstone of applications such as intelligent assistants, text summarization, translation, and multimodal tasks. Deploying these models on mobile devices remains difficult, however, because slow on-device inference leads to a poor user experience. A recent paper by Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, and Qin Xie introduces Transformer-Lite, a mobile inference engine designed to run LLMs more efficiently on mobile phone GPUs.
The authors propose four key optimization techniques to enhance the performance of LLMs on mobile GPUs:
Dynamic Shape Inference
Operator Optimizations and Execution Priority Setting
FP4 Quantization (M0E4)
Sub-Tensor-Based Technique
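To make the last item in the list above more concrete: the paper describes the sub-tensor technique as a way to avoid copying the KV cache after inference. The sketch below is only a rough CPU-side illustration of that general idea, not Transformer-Lite's GPU implementation. It assumes a single preallocated buffer and hands out views (sub-tensors) into it, so appending a token writes in place instead of building a new, longer cache. The names `kv_buffer`, `token_slot`, `append_token`, and `cached`, along with the sizes, are hypothetical.

```python
from array import array

MAX_TOKENS = 8   # hypothetical cache capacity, in tokens
HEAD_DIM = 4     # hypothetical width of one token's cached vector

# One float32 buffer allocated up front and never reallocated.
kv_buffer = array("f", [0.0] * (MAX_TOKENS * HEAD_DIM))
kv_view = memoryview(kv_buffer)


def token_slot(index):
    """Return a writable view of one token's slot; no data is copied."""
    start = index * HEAD_DIM
    return kv_view[start:start + HEAD_DIM]


def append_token(used, values):
    """Write a new token's vector into the next free slot, in place.

    Returns the new number of cached tokens.
    """
    if used >= MAX_TOKENS:
        raise ValueError("cache is full")
    if len(values) != HEAD_DIM:
        raise ValueError("expected %d values" % HEAD_DIM)
    token_slot(used)[:] = array("f", values)
    return used + 1


def cached(used):
    """View over the filled prefix of the cache, still backed by kv_buffer."""
    return kv_view[: used * HEAD_DIM]


used = 0
used = append_token(used, [1.0, 2.0, 3.0, 4.0])
used = append_token(used, [0.5, 0.25, 0.0, -1.0])
print(cached(used).tolist())
```

Because every view aliases the same storage, the cost of adding a token stays proportional to that token's data rather than to the whole cache. How the engine maps this idea onto GPU memory objects is not covered here.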
Transformer-Lite is implemented as a mobile inference engine compatible with both Qualcomm and MTK processors. The authors evaluated its performance using LLMs with varied architectures and sizes ranging from 2B to 14B parameters.

Compared to other inference engines:
For mobile developers and practitioners, these optimizations mean:
Transformer-Lite represents a significant step forward in the deployment of LLMs on mobile devices. By addressing key bottlenecks such as dynamic shape inference, operator optimization, quantization, and memory management, it offers a practical solution for developers looking to enhance the performance of their mobile applications.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
2 April 2024