MobileNet-V4: Next-Gen Efficiency for Edge Devices

The Engineer

27 May 2024

MobileNet-V4 optimizes computer vision tasks for edge devices, pushing the boundaries of efficiency with runtime optimization tailored for modern hardware, from tiny CPUs to advanced accelerators.

MobileNet-V4 has landed in timm, the PyTorch Image Models library, and it's a significant step forward for efficient computer vision on edge devices. This new model is designed to be runtime optimal on today’s mobile and edge hardware, from small DSP/CPU devices to modest accelerators like Google’s EdgeTPU found in modern smartphones.

Background

Five years ago, MobileNet-V3 and EfficientNet were introduced by Google researchers. These models leveraged the Inverted Residual Block (IR), a key innovation that placed the wide part of the block at the depthwise convolution rather than at the start or end. The IR consists of:

A 1x1 pointwise expansion convolution
A depthwise convolution (3x3 or 5x5)
A 1x1 pointwise linear (PWL) convolution in the residual path with no activation
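
To make the "wide in the middle" shape concrete, the following is a minimal sketch that counts weights and multiply-accumulates (MACs) for one stride-1 block built from the three layers listed above. The function name and the example sizes (32 channels, expansion ratio 4, 56x56 feature map) are assumptions chosen for illustration; biases and normalization layers are ignored.

```python
def inverted_residual_cost(c_in, c_out, expand_ratio, kernel_size, height, width):
    """Count weights and MACs for one stride-1 inverted residual block.

    Illustrative only: biases and normalization layers are ignored.
    """
    c_mid = c_in * expand_ratio  # the wide part of the block
    pixels = height * width
    layers = {
        # 1x1 pointwise expansion: c_in -> c_mid
        "expand_1x1": c_in * c_mid,
        # depthwise conv: one k x k filter per expanded channel
        "depthwise_kxk": kernel_size * kernel_size * c_mid,
        # 1x1 pointwise linear projection: c_mid -> c_out
        "project_1x1": c_mid * c_out,
    }
    params = sum(layers.values())
    # At stride 1 with "same" padding, every weight is used once per output pixel.
    macs = params * pixels
    return {"mid_channels": c_mid, "layers": layers, "params": params, "macs": macs}


cost = inverted_residual_cost(
    c_in=32, c_out=32, expand_ratio=4, kernel_size=3, height=56, width=56
)
print(cost)
```

In this example the depthwise layer runs on all 128 expanded channels yet holds far fewer weights than either pointwise layer, which is why widening the block at that point is comparatively cheap.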

Since then, timm has become the go-to repository for these architectures. It includes all officially released TensorFlow weights and numerous related models like MNasNet, FBNet v1/v2/v3, LCNet, TinyNet, and MixNet. Many of these weights are trained purely in PyTorch with PyTorch-friendly convolution padding.

Introducing MobileNet-V4

MobileNet-V4 aims to push the boundaries further by optimizing for today's hardware. The key innovations include two new block types:

Universal Inverted Bottleneck (UIB)
Multi Query Attention (MQA)

Universal Inverted Bottleneck (UIB)

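The UIB is a superset of the original Inverted Residual Block, designed to be more flexible and efficient across different hardware configurations. It adds two optional depthwise convolutions, one before the expansion layer and one between the expansion and projection layers, so a single block definition can be configured differently at each layer. It allows for: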

Dynamic expansion ratios
Adaptive kernel sizes
Efficient skip connections

These features enable the model to better adapt to the computational constraints of edge devices while maintaining high accuracy.
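
One way to picture the UIB as a superset is as a block with two optional depthwise convolutions, one before the expansion and one after it, as described in the MobileNet-V4 paper. The sketch below is not the timm implementation: `UIBConfig`, its field names, and the variant labels are hypothetical names used only to enumerate the four layouts that the two switches produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIBConfig:
    """Hypothetical config for one UIB block; a kernel size of 0 disables that conv."""

    start_dw_kernel: int  # depthwise conv before the 1x1 expansion
    mid_dw_kernel: int    # depthwise conv between expansion and projection
    expand_ratio: int = 4

    def variant(self):
        # Keyed by (has start depthwise, has middle depthwise).
        names = {
            (False, True): "InvertedBottleneck",  # the classic IR layout
            (True, False): "ConvNeXt-like",
            (True, True): "ExtraDW",
            (False, False): "FFN",
        }
        return names[(self.start_dw_kernel > 0, self.mid_dw_kernel > 0)]

    def layers(self, c_in, c_out):
        """Return a readable list of the layers this configuration enables."""
        c_mid = c_in * self.expand_ratio
        ops = []
        if self.start_dw_kernel:
            k = self.start_dw_kernel
            ops.append(f"depthwise {k}x{k} ({c_in} ch)")
        ops.append(f"pointwise 1x1 expand ({c_in} -> {c_mid})")
        if self.mid_dw_kernel:
            k = self.mid_dw_kernel
            ops.append(f"depthwise {k}x{k} ({c_mid} ch)")
        ops.append(f"pointwise 1x1 project ({c_mid} -> {c_out}, no activation)")
        return ops


for cfg in (UIBConfig(0, 3), UIBConfig(3, 0), UIBConfig(3, 5), UIBConfig(0, 0)):
    print(cfg.variant(), cfg.layers(c_in=32, c_out=32))
```

Setting only the middle kernel reproduces the original Inverted Residual layout, which is the sense in which the UIB contains it as a special case.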

Multi Query Attention (MQA)

The MQA block uses an attention variant that is cheaper than standard multi-head attention. It keeps several query heads but reduces computational and memory overhead by:

Using a single key head and a single value head instead of one pair per query head
Sharing those key and value projections across all query heads

This makes MQA particularly suitable for resource-constrained environments where every operation counts.
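
The saving from shared keys and values can be estimated with simple arithmetic. The sketch below assumes the standard multi-query formulation, in which all query heads attend over one shared key/value head, and counts only the projection weights. The function name and the sizes (256-dimensional input, 4 heads of width 64) are illustrative assumptions, not values taken from MobileNet-V4.

```python
def attention_projection_params(dim, num_heads, head_dim, shared_kv):
    """Count projection weights (biases ignored) for an attention layer.

    shared_kv=False: multi-head attention, one K/V head per query head.
    shared_kv=True:  multi-query attention, one K/V head shared by all queries.
    """
    kv_heads = 1 if shared_kv else num_heads
    q = dim * num_heads * head_dim    # query projection, one slice per head
    k = dim * kv_heads * head_dim     # key projection
    v = dim * kv_heads * head_dim     # value projection
    out = num_heads * head_dim * dim  # output projection back to dim
    return {"q": q, "k": k, "v": v, "out": out, "total": q + k + v + out}


mha = attention_projection_params(dim=256, num_heads=4, head_dim=64, shared_kv=False)
mqa = attention_projection_params(dim=256, num_heads=4, head_dim=64, shared_kv=True)
print("multi-head :", mha)
print("multi-query:", mqa)
print("K/V weights kept:", (mqa["k"] + mqa["v"]) / (mha["k"] + mha["v"]))
```

The key and value tensors produced at inference time shrink by the same factor as their projections, and that reduction in memory traffic tends to matter most on accelerators.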

Implementation Details

MobileNet-V4 has been integrated into timm, making it accessible to a wide range of practitioners. The implementation includes:

Model Variants: Multiple variants are available, each optimized for different hardware and performance requirements.
Training Techniques: Advanced training techniques such as RandAug/AutoAug, AdvProp, and Noisy Student are supported to improve robustness and accuracy.
PyTorch Compatibility: All models are trained in PyTorch with native padding, ensuring compatibility with existing PyTorch workflows.
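
As a rough sketch of how the variants might be discovered and instantiated, the helpers below wrap timm's `list_models` and `create_model`. The helper names are hypothetical, and the `mobilenetv4*` wildcard is an assumption about how the models are named in the installed timm release, so check the names it returns before relying on any of them.

```python
def list_mobilenetv4_variants(pretrained_only=False):
    """Return the model names timm registers under the assumed 'mobilenetv4*' pattern."""
    import timm  # imported lazily so this file loads even without timm installed

    return timm.list_models("mobilenetv4*", pretrained=pretrained_only)


def build_model(name, pretrained=False, num_classes=1000):
    """Create a timm model by name and switch it to inference mode."""
    import timm

    model = timm.create_model(name, pretrained=pretrained, num_classes=num_classes)
    return model.eval()
```

A typical session would call `list_mobilenetv4_variants()` first, pick one of the returned names, and pass it to `build_model`. Setting `pretrained=True` downloads weights where the installed release provides them.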

Performance Benchmarks

Initial benchmarks show that MobileNet-V4 outperforms its predecessors on both accuracy and inference speed. Key highlights include:

Accuracy: Competitive with or better than state-of-the-art models on standard datasets like ImageNet.
Inference Speed: Up to 2x faster than MobileNet-V3 on edge devices, thanks to the optimized UIB and MQA blocks.

Conclusion

MobileNet-V4 represents a significant advancement in efficient computer vision for edge devices. By introducing the Universal Inverted Bottleneck and Multi Query Attention, it addresses the unique challenges of modern hardware while maintaining high performance. For practitioners working with resource-constrained environments, this model is a must-try.