
While multimodal models integrate diverse data types to mimic human cognition, they overlook essential aspects like physical presence and environmental engagement, making true AGI unattainable for now.
In recent years, generative AI models have achieved impressive feats, leading some to speculate that Artificial General Intelligence (AGI) is just around the corner. These models, particularly large language models (LLMs), seem to capture aspects of human intelligence, but they fall short in crucial areas. The multimodal approach, which combines various sensory inputs into a single framework, has gained traction as a potential path to AGI. However, I argue that this strategy is unlikely to succeed because it fails to address the fundamental role of embodiment and interaction with the environment.
Multimodal models are designed to handle multiple types of data (text, images, audio, and so on) by using massive modular networks. While these models appear general, they are actually a patchwork of specialized components that do not truly integrate into a cohesive understanding of the world. Here’s why:
True AGI must be capable of solving problems across all domains, including those that originate in physical reality. This requires a form of intelligence that is fundamentally situated in a physical world model. Here’s what this means:

LLMs have been lauded for their apparent understanding of language and the world. However, this understanding is often superficial:
To achieve AGI, we need to shift our focus from multimodal models to approaches that prioritize embodiment and interaction with the environment:
While multimodal models have made significant strides, they fall short of achieving true AGI. To build systems that can solve problems across all domains, we need to focus on embodiment and interaction with the physical world. This approach will lead to more robust and capable AI systems that can reason about and act in complex environments.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
5 June 2025