ExecuTorch Alpha: Bringing LLMs and AI to Edge Devices with Advanced Quantization and Broad Support

Tools & Engineering

The Engineer

1 May 2024

ExecuTorch alpha simplifies deploying large language models to edge devices, with advanced quantization techniques, support for Meta’s Llama 2, and early access to Llama 3.

We're excited to announce the release of ExecuTorch alpha, a significant step forward in deploying large language models (LLMs) and other machine learning (ML) models to edge devices. This release focuses on stabilizing the API surface, improving installation processes, and providing robust support for Meta’s Llama 2 and early access to Llama 3. Let's dive into the technical details and what this means for practitioners.

Large Language Models on Mobile

Deploying LLMs on mobile devices is challenging because compute, memory, and power are all constrained. ExecuTorch alpha addresses these constraints with advanced quantization techniques and optimizations:

4-bit Post-Training Quantization: Using GPTQ, a post-training quantization method designed for generative pre-trained transformers, we've achieved 4-bit post-training quantization. This substantially reduces model size without a significant loss in accuracy.
Dynamic Shape Support: We’ve added dynamic shape support to XNNPack, enabling efficient execution on CPU across a variety of devices.
New Dtypes: New data types have been introduced in XNNPack to enhance performance and reduce memory overhead.
Export and Lowering Improvements: Significant improvements in the export and lowering processes ensure smoother transitions from training to deployment.

These optimizations allow Llama 2 7B to run efficiently on devices like the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22, S23, and S24. Early support for Llama 3 8B is also available. For the latest performance numbers, check out the GitHub repository.
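
To make the size reduction concrete, here is a minimal sketch of group-wise 4-bit weight quantization and the storage arithmetic that follows from it. It uses plain round-to-nearest rather than GPTQ (GPTQ additionally adjusts the not-yet-quantized weights to compensate for rounding error). The function names, the group size of 128, the 16-bit scales, and the nominal 7 billion parameter count are assumptions for illustration, not ExecuTorch's implementation.

```python
from typing import List, Tuple

QMIN, QMAX = -8, 7  # range of a signed 4-bit integer


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group of weights to signed 4-bit codes sharing one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto code 7; an all-zero group gets scale 1.0.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from 4-bit codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row: List[float], group_size: int = 128) -> List[Tuple[List[int], float]]:
    """Split a weight row into fixed-size groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def weight_footprint_gb(num_params: int, bits_per_weight: int,
                        group_size: int = 0, scale_bits: int = 16) -> float:
    """Estimate weight storage in GB (10**9 bytes), optionally with per-group scales."""
    total_bits = num_params * bits_per_weight
    if group_size:
        # Assumption: one scale of `scale_bits` bits is stored per group.
        total_bits += (num_params // group_size) * scale_bits
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    nominal_params = 7_000_000_000  # assumed round figure for a "7B" model
    print(weight_footprint_gb(nominal_params, 16))
    print(weight_footprint_gb(nominal_params, 4, group_size=128))
```

Under these assumptions, the weights of a 7B model take about 14 GB at 16 bits and about 3.6 GB at 4 bits with per-group scales, roughly a fourfold reduction. Smaller groups track the weight distribution more closely but spend more bits on scales.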

Supported Models

ExecuTorch alpha supports a wide range of models across natural language processing (NLP), vision, and speech. Since the preview release, we've expanded our tested models significantly:

NLP: Llama 2, Llama 3, and other popular NLP models.
Vision: Models like ResNet, MobileNet, and EfficientNet.
Speech: Models for speech recognition and synthesis.

We are committed to continuously expanding this list. If you encounter any issues, please open a GitHub issue.

Performance Enhancements

To maximize performance on edge devices, ExecuTorch alpha leverages hardware acceleration through various backends:

Apple Devices: Core ML and Metal Performance Shaders (MPS) for GPU and NPU delegation.
Arm Devices: TOSA (Tensor Operator Set Architecture) for optimized execution.
Qualcomm Devices: Qualcomm AI Stack for leveraging the Hexagon Tensor Processor (HTP).

These backends ensure that models run efficiently, even on resource-constrained devices.
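
As a rough illustration of what delegation involves, the sketch below splits a linear sequence of operators into contiguous segments that an accelerator backend could take over and segments that fall back to the CPU. The operator names, the `supported` set, and the `partition_for_delegate` helper are all hypothetical; this is a simplified model of the idea, not ExecuTorch's partitioner API.

```python
from typing import List, Set, Tuple


def partition_for_delegate(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into 'delegate' or 'cpu' segments.

    An op goes to the delegate only if the (hypothetical) backend lists it
    as supported; everything else stays on the CPU fallback path.
    """
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        target = "delegate" if op in supported else "cpu"
        if segments and segments[-1][0] == target:
            # Extend the current segment to avoid extra hand-offs.
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments


if __name__ == "__main__":
    ops = ["linear", "relu", "custom_op", "linear", "softmax"]
    supported = {"linear", "relu", "softmax"}
    for target, segment in partition_for_delegate(ops, supported):
        print(target, segment)
```

Each boundary between segments implies a hand-off between the accelerator and the CPU, so fewer and larger delegated segments generally mean less overhead.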

Productivity Tools

Getting a model to perform well on a specific platform often requires tools for inspecting and debugging it in depth. ExecuTorch alpha provides:

Model Visualization: Tools to visualize the model architecture and identify bottlenecks.
Profiling: Detailed profiling capabilities to optimize performance.
Debugging: Enhanced debugging tools to help resolve issues during deployment.

These tools are designed to make the development process smoother and more efficient.

Community and Partnerships

We've been working closely with partners like Arm, Apple, and Qualcomm Technologies to ensure that ExecuTorch alpha is robust and performant. Their contributions have been crucial in enabling GPU and NPU delegation, which significantly boosts performance on edge devices.

Conclusion

ExecuTorch alpha represents a significant milestone in bringing LLMs and other ML models to edge devices. With advanced quantization techniques, broad model support, and hardware acceleration, it opens up new possibilities for deploying AI in resource-constrained environments. We look forward to the community's feedback and contributions as we continue to improve ExecuTorch.