Apple Reveals Technical Details of Its New AI Models at WWDC25


The Engineer

22 Jul 2025

Apple’s technical report offers a deep dive into the intricacies of its new AI models, revealing a novel architecture for its on-device solution and insights into training and optimization processes.

During WWDC25, Apple unveiled new versions of its on-device and cloud-based foundation models. Now, the company has released a detailed technical report, "Apple Intelligence Foundation Language Models – Tech Report 2025," which delves into how these models were trained, optimized, and evaluated. Here are some key highlights that will be particularly interesting for practitioners.

Local Model Architecture: Split into Two Blocks

One of the most intriguing aspects of Apple's on-device model is its architecture. The model, which has around 3 billion parameters, is split into two distinct blocks:

Block 1: Contains 62.5% of the total transformer layers.
Block 2: Contains the remaining 37.5% of the transformer layers, but with the key and value projections removed; these layers instead reuse the key-value cache computed by Block 1.
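As a rough illustration of this layout, the sketch below assigns each layer to a block and records which layer's keys and values it attends over. It is a minimal sketch, not Apple's code: the `LayerSpec` and `plan_layers` names are hypothetical, the 16-layer stack is arbitrary, and it assumes, as a simplification, that every Block 2 layer reads the cache produced by the final Block 1 layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    index: int         # position in the transformer stack
    block: int         # 1 or 2
    has_kv_proj: bool  # True if the layer computes its own keys and values
    kv_source: int     # index of the layer whose KV cache this layer attends over


def plan_layers(num_layers: int, block1_fraction: float = 0.625) -> list[LayerSpec]:
    """Split a layer stack into Block 1 (own K/V) and Block 2 (no K/V projections).

    Hypothetical helper; the sharing scheme is an assumption for illustration.
    """
    n1 = round(num_layers * block1_fraction)
    if not 0 < n1 < num_layers:
        raise ValueError("each block needs at least one layer")
    specs = []
    for i in range(num_layers):
        if i < n1:
            # Block 1 layers behave like standard transformer layers.
            specs.append(LayerSpec(index=i, block=1, has_kv_proj=True, kv_source=i))
        else:
            # Block 2 layers compute queries but no keys/values; they reuse
            # the cache of the last Block 1 layer (assumed here).
            specs.append(LayerSpec(index=i, block=2, has_kv_proj=False, kv_source=n1 - 1))
    return specs


specs = plan_layers(16)
kv_layers = sum(s.has_kv_proj for s in specs)
print(f"{kv_layers} of {len(specs)} layers keep a KV cache")  # 10 of 16 layers keep a KV cache
```

Only the Block 1 layers ever write to the cache, which is where the memory saving described next comes from.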

This design decision has two practical benefits:

Memory Efficiency: The key-value (KV) cache needs 37.5% less memory, because the Block 2 layers do not store keys and values of their own. This is crucial for on-device performance.
Latency Reduction: The time to first token (a token is a word or a fragment of one) also drops by about 37.5%, making the model more responsive.
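The 37.5% memory figure follows from simple arithmetic: KV cache size grows linearly with the number of layers that store keys and values, so removing them from 3/8 of the layers removes 3/8 of the cache. The sketch below checks this; the layer count, sequence length, head count, head size, and 16-bit precision are hypothetical values chosen for illustration, not Apple's published configuration.

```python
def kv_cache_bytes(num_kv_layers: int, seq_len: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 counts the K tensor and the V tensor.
    """
    return 2 * num_kv_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration, for illustration only.
NUM_LAYERS = 56
SEQ_LEN = 4096
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_FP16 = 2

# Baseline: every layer keeps its own KV cache.
baseline = kv_cache_bytes(NUM_LAYERS, SEQ_LEN, NUM_KV_HEADS, HEAD_DIM, BYTES_FP16)

# Split model: only the Block 1 layers (5/8 = 62.5% of the stack) store keys/values.
block1_layers = NUM_LAYERS * 5 // 8
split = kv_cache_bytes(block1_layers, SEQ_LEN, NUM_KV_HEADS, HEAD_DIM, BYTES_FP16)

saving = 1 - split / baseline
print(f"baseline: {baseline / 2**20:.0f} MiB, split: {split / 2**20:.0f} MiB, saving: {saving:.1%}")
# baseline: 896 MiB, split: 560 MiB, saving: 37.5%
```

The absolute numbers depend entirely on the assumed configuration; the 37.5% ratio does not, since every other factor cancels.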

Despite these optimizations, Apple says the model's overall performance and output quality are preserved. If that holds, the split architecture cuts memory use and latency at little cost in quality, a valuable trade-off on resource-constrained devices.

Data Sources and Pre-Training

Apple's foundation models are trained on a diverse range of data sources to ensure they can handle various tasks effectively. The company leverages:

Web Text: A vast corpus of text from the web, including articles, blogs, and forums.
Books: A collection of books in multiple languages.
Code Repositories: Code snippets and documentation from public repositories.

The training process involves two main stages:

Self-Supervised Pre-Training: The models are first trained on large amounts of unannotated data, learning general language patterns by predicting the next token.
Supervised Fine-Tuning: After pre-training, the models are fine-tuned on specific tasks using labeled datasets.
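The two stages differ mainly in how training targets are built. The sketch below is a minimal, hypothetical illustration on integer token IDs: pre-training uses plain next-token prediction, while supervised fine-tuning commonly masks the prompt so that only the labeled response contributes to the loss. The function names are made up for this example, and the `-100` ignore value mirrors a common convention (the default `ignore_index` of PyTorch's cross-entropy loss); nothing here is taken from Apple's report.

```python
IGNORE = -100  # sentinel target: positions with this value are excluded from the loss


def pretraining_pairs(tokens: list[int]) -> tuple[list[int], list[int]]:
    """Next-token prediction on unlabeled text: position i predicts token i + 1."""
    return tokens[:-1], tokens[1:]


def sft_pairs(prompt: list[int], response: list[int]) -> tuple[list[int], list[int]]:
    """Supervised fine-tuning: same shift, but targets inside the prompt are masked."""
    tokens = prompt + response
    inputs, targets = tokens[:-1], tokens[1:]
    # targets[i] is tokens[i + 1], which belongs to the prompt when i + 1 < len(prompt)
    masked = [IGNORE if i + 1 < len(prompt) else t for i, t in enumerate(targets)]
    return inputs, masked


print(pretraining_pairs([1, 2, 3, 4, 5]))  # ([1, 2, 3, 4], [2, 3, 4, 5])
print(sft_pairs([1, 2, 3], [4, 5]))        # ([1, 2, 3, 4], [-100, -100, 4, 5])
```

Note that the last prompt token is still an input: it is the position that learns to predict the first response token.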

Training Tooling and Optimizations

Apple also describes the tooling it built to train and deploy these models:

Custom Training Frameworks: Apple uses custom frameworks optimized for its hardware, including M1 and M2 chips, to accelerate training.
Distributed Training: The company employs distributed training techniques to scale up the model size and improve convergence speed.
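The article does not say which distributed techniques Apple uses. Data parallelism is one common form, and the sketch below simulates it on a single machine: each "worker" computes a gradient on an equal shard of the batch, and averaging those gradients (the job of an all-reduce on real hardware) reproduces the full-batch gradient. The 1-D linear model, data, and function names are hypothetical, chosen only to keep the example self-contained.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)**2) with respect to w for a 1-D linear model."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w: float, xs: list[float], ys: list[float], num_workers: int) -> float:
    """Simulate synchronous data parallelism.

    Each simulated worker computes a gradient on an equal-sized shard of the batch;
    averaging the local gradients stands in for an all-reduce across devices.
    """
    if len(xs) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[k * shard:(k + 1) * shard], ys[k * shard:(k + 1) * shard])
        for k in range(num_workers)
    ]
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
full = grad_mse(0.5, xs, ys)
parallel = data_parallel_grad(0.5, xs, ys, num_workers=4)
print(full, parallel)  # equal up to floating-point rounding
```

Data parallelism mainly raises throughput; training models too large for one device additionally requires splitting the model itself across devices, which this sketch does not cover.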

In terms of optimizations:

Quantization: Models are quantized to reduce memory usage and inference time without significant loss in accuracy.
Pruning: Unnecessary weights are pruned to further optimize the models for on-device use.
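Both optimizations can be illustrated on a small weight vector. The sketch below uses textbook variants, symmetric per-tensor integer quantization and unstructured magnitude pruning, as stand-ins; the article does not specify Apple's bit widths, quantization scheme, or pruning criterion, and the 4-bit setting, 50% sparsity, and function names are assumptions for illustration.

```python
def quantize_symmetric(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to signed integers with `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)            # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


w = [0.1, -0.9, 0.05, 0.7, -0.2, 0.3]
q, scale = quantize_symmetric(w, bits=4)
print(q, [round(x, 3) for x in dequantize(q, scale)])
print(magnitude_prune(w, sparsity=0.5))  # [0.0, -0.9, 0.0, 0.7, 0.0, 0.3]
```

Storing 4-bit codes plus one scale instead of 16-bit floats cuts weight memory roughly fourfold, at the cost of a rounding error of at most half a quantization step per weight.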

Benchmarks and Performance

Apple has provided benchmarks comparing its on-device model with external cloud-based models. The results show that:

On-Device Model: Performs competitively on a range of tasks, including text generation, translation, and summarization.
Cloud-Based Model: Offers higher performance, but requests must leave the device, which adds network latency and raises data-privacy considerations.

Historical Context

It's worth noting that Apple has been exploring various techniques to optimize large language models (LLMs) for on-device use. In late 2023, Apple researchers published a study on swapping parts of an LLM between RAM and flash storage as needed. While this approach wasn't used in the current models, it shows that fitting LLMs into limited device memory has been a sustained research focus for the company.

Conclusion

Apple's new foundation models represent significant advances in both on-device and cloud-based AI. The technical details in the report offer valuable insights for practitioners looking to understand and use these models. By splitting the local model into two blocks and applying optimizations such as quantization and pruning, Apple has built an efficient, high-performing model that runs smoothly on user devices.