Superior Intent Extraction with Small Models Through Decomposition

The Engineer

23 Jan 2026

Cohen and Halpern's technique breaks complex interactions down into simpler tasks, enabling small models to predict user intent accurately, which could make on-device AI assistance more practical.

January 22, 2026

Danielle Cohen and Yoni Halpern, Software Engineers at Google, have introduced an approach to understanding user intent from UI interaction trajectories using small models. The method is reported to outperform significantly larger models, which makes it well suited to on-device applications.

The Challenge of User Intent Understanding

As AI technologies advance, the goal is to create agents that can better anticipate and assist with user needs. For mobile devices, this means understanding what users are doing or trying to do when they interact with apps. This context helps predict potential next actions, enhancing user experience. For instance, if a user has been searching for music festivals in Europe and then looks for flights to London, an intelligent agent could suggest festivals in London on the specific dates of interest.

Large multimodal language models (MLLMs) are already adept at understanding user intent from UI trajectories. However, using large models for this task often involves sending data to a server, which can be slow and costly and may expose sensitive information. This is where small models come in: lightweight, efficient, and capable of running on-device.

The Decomposition Approach

In their paper "Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition," presented at EMNLP 2025, Cohen and Halpern propose a two-stage approach to make user intent understanding more tractable for small models:

Stage 1: Summarize Each Screen Separately
- The first stage generates a summary of each screen in the UI trajectory. This is done with a small multimodal language model (MLLM) that processes and summarizes one screen at a time.
Stage 2: Extract Intent from Summaries
- In the second stage, another small MLLM takes the sequence of generated summaries as input and extracts the overall user intent. By breaking down the task into manageable parts, the approach reduces the complexity for each model.
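
The two stages above can be sketched as a short pipeline. This is a minimal illustration and not the authors' implementation: the `Screen` structure, the function names, and the toy stand-ins for the two models are all hypothetical, chosen only to show how the screens are summarized one by one before a second step sees the summaries.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Screen:
    """One step of a UI trajectory (hypothetical structure)."""
    screenshot_path: str  # where the screen image is stored
    action: str           # what the user did on this screen


# Stand-ins for the two small models; these signatures are assumptions.
ScreenSummarizer = Callable[[Screen], str]
IntentExtractor = Callable[[Sequence[str]], str]


def extract_intent(trajectory: Sequence[Screen],
                   summarize_screen: ScreenSummarizer,
                   extract_from_summaries: IntentExtractor) -> str:
    # Stage 1: summarize every screen independently of the others.
    summaries = [summarize_screen(screen) for screen in trajectory]
    # Stage 2: only the ordered text summaries reach the second model.
    return extract_from_summaries(summaries)


# Toy stand-ins so the sketch runs without any model.
def toy_summarizer(screen: Screen) -> str:
    return f"User performed '{screen.action}' on {screen.screenshot_path}"


def toy_extractor(summaries: Sequence[str]) -> str:
    return f"Intent inferred from {len(summaries)} screen summaries"


trajectory = [
    Screen("screen_0.png", "search 'music festivals in Europe'"),
    Screen("screen_1.png", "search 'flights to London'"),
]
result = extract_intent(trajectory, toy_summarizer, toy_extractor)
print(result)  # prints: Intent inferred from 2 screen summaries
```

Passing the two models in as callables keeps the stages independent, so either one could be swapped or evaluated on its own, which mirrors the idea of giving each model a narrower task.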

Key Benefits

Efficiency: Small models are faster and more resource-efficient, making them ideal for on-device processing.
Privacy: Since data is processed locally, there's no need to send sensitive information to a server.
Performance: Despite their size, these small models achieve results comparable to much larger models.

Implementation Details

The researchers formalized metrics to evaluate model performance, ensuring that the approach could be rigorously tested. Here are some key implementation details:

Model Architecture:
- Summarization Model: A lightweight MLLM trained on a diverse dataset of UI screens and user interactions.
- Intent Extraction Model: Another small MLLM fine-tuned to extract intents from sequences of summaries.
Training Data:
- The models were trained on a large corpus of anonymized user interaction data, with the aim of generalizing well to real-world scenarios.
Benchmarks:
- The decomposed approach was tested against several state-of-the-art LLMs and small models. It outperformed larger models in terms of accuracy while maintaining low latency and resource usage.
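
A comparison along the two axes mentioned above, accuracy and latency, could be set up roughly as follows. This is only a sketch under stated assumptions: the article does not specify the metrics the researchers formalized, so case-insensitive exact match is used here purely as a stand-in, and `evaluate`, `toy_predict`, and the example data are hypothetical.

```python
import time
from typing import Callable, Dict, Sequence, Tuple

# An example pairs a sequence of screen summaries with a reference intent.
Example = Tuple[Sequence[str], str]


def evaluate(predict: Callable[[Sequence[str]], str],
             examples: Sequence[Example]) -> Dict[str, float]:
    """Return exact-match accuracy and mean per-example latency in seconds.

    Exact match is a placeholder metric (an assumption), not the
    metric used in the paper.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    total_seconds = 0.0
    for summaries, reference in examples:
        start = time.perf_counter()
        prediction = predict(summaries)
        total_seconds += time.perf_counter() - start
        # Compare after trimming whitespace and lowercasing.
        correct += int(prediction.strip().lower() == reference.strip().lower())
    n = len(examples)
    return {"accuracy": correct / n, "mean_latency_s": total_seconds / n}


# Toy predictor standing in for any system under test.
def toy_predict(summaries: Sequence[str]) -> str:
    if any("flight" in s.lower() for s in summaries):
        return "book a flight to London"
    return "unknown"


examples = [
    (["Searched for flights to London"], "Book a flight to London"),
    (["Opened a music festival listing"], "Find a music festival in Europe"),
]
report = evaluate(toy_predict, examples)
print(report["accuracy"])
```

Running the same harness over a large model and a decomposed small-model pipeline would give directly comparable numbers, although free-text intents generally need a more forgiving metric than exact match.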

Practical Applications

This approach has significant implications for on-device applications, particularly in mobile devices and wearables where resource constraints are common. By enabling more efficient and private user intent understanding, it can enhance the user experience across a range of applications, from personal assistants to e-commerce platforms.

Future Directions

The researchers suggest several avenues for future work, including:

Model Optimization: Further optimizing the models to reduce latency and improve accuracy.
Multimodal Integration: Exploring how other modalities (e.g., voice, images) can be integrated into the decomposed approach.
User Feedback: Incorporating user feedback mechanisms to continuously improve model performance.

Conclusion

The decomposition approach presented by Cohen and Halpern offers a promising solution for on-device user intent understanding. By breaking down the task into manageable stages, small models can achieve results that rival those of much larger models, making them an attractive option for resource-constrained environments.