
A paper at AAAI 2025 presents GoHD, a system that generates lifelike talking head animations by disentangling gaze direction and facial expressions, giving finer control over how audio drives the visual output.
In audio-driven talking head generation, integrating audio and visual data seamlessly is no small feat. The difficulty comes mainly from two sources: the diversity of input portraits and the intricate correlations between audio and facial motion. A recent AAAI 2025 paper introduces GoHD (Gaze-oriented and Highly Disentangled Portrait Animation), a framework designed to produce realistic, expressive, and controllable portrait videos from any reference identity with any motion.
GoHD is built around three key components that address the core challenges in this domain:
Animation Module with Latent Navigation
Conformer-Structured Conditional Diffusion Model
Two-Stage Training Strategy
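As a rough illustration of the general idea behind latent navigation (not GoHD's actual implementation), a latent code can be moved along a set of direction vectors, with a separate group of directions per motion factor. The sketch below assumes a toy 4-dimensional latent space. The `navigate` function, the factor names, and the direction vectors are all hypothetical, chosen only to show how keeping factors on separate directions lets one be changed without touching the others.

```python
from typing import Dict, List, Sequence


def navigate(
    source_latent: Sequence[float],
    directions: Dict[str, Sequence[Sequence[float]]],
    magnitudes: Dict[str, Sequence[float]],
) -> List[float]:
    """Move a latent code along per-factor direction vectors.

    target = source + sum over factors f and directions i of
             magnitudes[f][i] * directions[f][i]

    Factors missing from `magnitudes` are left unchanged.
    """
    target = list(source_latent)
    for factor, factor_dirs in directions.items():
        factor_mags = magnitudes.get(factor, [0.0] * len(factor_dirs))
        if len(factor_mags) != len(factor_dirs):
            raise ValueError(f"expected {len(factor_dirs)} magnitudes for {factor!r}")
        for magnitude, direction in zip(factor_mags, factor_dirs):
            if len(direction) != len(target):
                raise ValueError("direction and latent dimensions differ")
            for k, component in enumerate(direction):
                target[k] += magnitude * component
    return target


# Hypothetical toy setup: one gaze direction and two expression directions,
# all mutually orthogonal so that each factor edits its own coordinates.
directions = {
    "gaze": [[1.0, 0.0, 0.0, 0.0]],
    "expression": [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
}
source = [0.5, 0.5, 0.5, 0.5]

# Shift only the gaze factor; expression coordinates stay where they were.
gaze_only = navigate(source, directions, {"gaze": [2.0]})
print(gaze_only)
```

In a real system the directions would be learned and the magnitudes predicted by a network from audio or a driving signal; here they are hard-coded to keep the arithmetic easy to follow.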

The authors report extensive experiments to validate GoHD's generalization capabilities. The results indicate that the framework can generate realistic talking face videos for arbitrary subjects, maintaining fidelity and naturalness across different input styles.
For practitioners in computer vision and pattern recognition, GoHD represents a significant step forward in audio-driven talking head generation. Its ability to handle diverse input styles, maintain high disentanglement, and incorporate gaze orientation addresses many of the limitations of previous models. This makes it a valuable tool for applications ranging from virtual assistants and video conferencing to entertainment and gaming.
GoHD's approach to portrait animation raises the bar for the field. By addressing key challenges with well-designed modules and a practical training strategy, it paves the way for more realistic and controllable talking head generation. As this line of research evolves, we can expect more sophisticated and versatile models to follow.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
17 December 2024