GoHD: Gaze-Oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression

The Engineer

17 Dec 2024 · 3 min read

A paper at AAAI 2025 presents GoHD, a system that generates lifelike talking head animations from audio by disentangling motion from identity and explicitly correcting gaze orientation.

Audio-driven talking head generation has to align audio with visual motion, which is difficult when input portraits vary widely and the correlations between audio and facial motion are intricate. A paper presented at AAAI 2025 introduces GoHD (Gaze-oriented and Highly Disentangled Portrait Animation), a framework designed to produce realistic, expressive, and controllable portrait videos from any reference identity with any motion.

Key Technical Innovations

GoHD combines three components, two modules and a training strategy, that address the core challenges in this domain:

Animation Module with Latent Navigation:
- Generalization Across Styles: This module leverages latent navigation to improve generalization across unseen input styles. By navigating through a high-dimensional latent space, it can adapt to various portrait characteristics and maintain consistency.
- High Disentanglement of Motion and Identity: The animation module ensures that the motion and identity are highly disentangled, meaning changes in one do not affect the other. This is crucial for generating natural-looking animations where the facial expressions and movements remain consistent with the reference identity.
- Gaze Orientation Rectification: A novel feature of this module is its ability to rectify unnatural eye movements, a common issue in previous models. By incorporating gaze orientation, GoHD ensures that the eyes move naturally, enhancing the realism of the generated videos.
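
As a rough illustration of the latent navigation idea, the sketch below treats motion as a displacement along a small set of orthonormal directions that is added to an identity code. This is a toy linear model in NumPy, not GoHD's actual animation module, which this article does not describe at that level of detail. The names `make_motion_basis`, `navigate`, and `read_motion` are hypothetical, and the random basis stands in for directions that a real model would learn.

```python
import numpy as np

def make_motion_basis(latent_dim, num_directions, seed=0):
    """Build an orthonormal set of motion directions (random here, learned in practice)."""
    rng = np.random.default_rng(seed)
    raw = rng.standard_normal((latent_dim, num_directions))
    q, _ = np.linalg.qr(raw)   # q has orthonormal columns, shape (latent_dim, num_directions)
    return q.T                 # one direction per row

def navigate(identity_code, motion_coeffs, basis):
    """Move away from the identity code along the motion directions."""
    return identity_code + motion_coeffs @ basis

def read_motion(latent, identity_code, basis):
    """Recover the motion coefficients by projecting the displacement onto the basis."""
    return (latent - identity_code) @ basis.T

# Apply the same motion to two different (hypothetical) identity codes.
rng = np.random.default_rng(1)
basis = make_motion_basis(latent_dim=64, num_directions=8)
identity_a = rng.standard_normal(64)
identity_b = rng.standard_normal(64)
motion = rng.standard_normal(8)
latent_a = navigate(identity_a, motion, basis)
latent_b = navigate(identity_b, motion, basis)
```

In this sketch the same coefficients produce the same displacement for any identity code, so motion transfers between identities unchanged. That is the property the disentanglement bullet above describes, shown here in its simplest linear form.
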
Conformer-Structured Conditional Diffusion Model:
- Prosody-Aware Head Poses: This module uses a conformer, an architecture that combines convolution with self-attention, to generate head poses that follow prosody (the rhythm and intonation of speech). Aligning head movements with the natural flow of speech makes the animations more realistic.
- Conditional Diffusion for Fine Control: The conditional diffusion model allows for fine-grained control over the generated poses, ensuring they are both realistic and contextually appropriate.
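
To show where the audio condition enters a diffusion sampler, here is a minimal DDPM-style reverse loop in NumPy. Several things are assumptions made for illustration: the linear noise schedule, the 6-dimensional pose vector, and above all `oracle_noise_predictor`, which computes the exact noise for a pose defined by the hypothetical `audio_to_pose` mapping. In GoHD that role would be played by the trained conformer network, and the article does not say which sampler or schedule the authors use.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used default)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def audio_to_pose(audio_features, projection):
    """Hypothetical stand-in for the audio-to-pose relation a network would learn."""
    return audio_features @ projection

def oracle_noise_predictor(x_t, t, audio_features, projection, alpha_bars):
    """Stand-in for the trained denoiser: returns the exact noise given the condition."""
    x0 = audio_to_pose(audio_features, projection)
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

def sample_poses(audio_features, projection, num_steps=50, seed=0):
    """Reverse diffusion: start from Gaussian noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    shape = (audio_features.shape[0], projection.shape[1])  # (frames, pose_dim)
    x = rng.standard_normal(shape)
    for t in reversed(range(num_steps)):
        # The condition (audio features) is passed to the denoiser at every step.
        eps = oracle_noise_predictor(x, t, audio_features, projection, alpha_bars)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Hypothetical inputs: 20 frames of 16-dim audio features, 6-dim head poses.
rng = np.random.default_rng(2)
audio = rng.standard_normal((20, 16))
projection = 0.1 * rng.standard_normal((16, 6))
poses = sample_poses(audio, projection)
```

Because the stand-in denoiser is exact, the loop ends on the pose sequence implied by the audio. A trained network would only approximate the noise, so its samples would vary from run to run while still following the condition.
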
Two-Stage Training Strategy:
- Decoupling Lip Motion from Other Movements: To estimate lip-synchronized and realistic expressions from input audio within limited training data, GoHD employs a two-stage training strategy. The first stage focuses on distilling frequent and frame-wise lip motion, while the second stage handles more temporally dependent but less audio-related motions like blinks and frowns.
- Efficient Training with Limited Data: This approach allows for efficient training even when data is scarce, making it a practical solution for real-world applications.
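
The two-stage idea can be sketched with a toy regression problem. Stage one fits a frame-wise map from audio to lip coefficients; stage two leaves that map untouched and fits a separate model that sees a window of neighbouring frames for the remaining motions. Everything here is an assumption for illustration: least squares stands in for network training, the "other" motions are synthesized from audio context purely so the example has something to fit, and `fit_linear`, `stack_context`, `train_two_stage`, and `predict` are hypothetical names rather than anything from the GoHD code.

```python
import numpy as np

def fit_linear(inputs, targets):
    """Least-squares fit; a stand-in for gradient-based training."""
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights

def stack_context(features, half_window):
    """Concatenate each frame with its neighbours (edge frames are repeated)."""
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    windows = [padded[i:i + num_frames] for i in range(2 * half_window + 1)]
    return np.concatenate(windows, axis=1)

def train_two_stage(audio, lip_targets, other_targets, half_window=2):
    # Stage 1: frame-wise lip motion, one audio frame in, one lip vector out.
    lip_weights = fit_linear(audio, lip_targets)
    # Stage 2: lip weights are left as they are; only the temporal model is fitted.
    context = stack_context(audio, half_window)
    other_weights = fit_linear(context, other_targets)
    return lip_weights, other_weights

def predict(audio, lip_weights, other_weights, half_window=2):
    """Combine both stages into one expression vector per frame."""
    lip = audio @ lip_weights
    other = stack_context(audio, half_window) @ other_weights
    return np.concatenate([lip, other], axis=1)

# Synthetic data: 200 frames, 8-dim audio, 4 lip and 2 other coefficients.
rng = np.random.default_rng(3)
audio = rng.standard_normal((200, 8))
true_lip = rng.standard_normal((8, 4))
lip_targets = audio @ true_lip
other_targets = stack_context(audio, 2) @ rng.standard_normal((40, 2))
lip_weights, other_weights = train_two_stage(audio, lip_targets, other_targets)
expressions = predict(audio, lip_weights, other_weights)
```

Separating the stages keeps the lip model simple and frame-local, which is plausibly why it can be learned from limited data, while the temporally dependent motions get their own model with access to context.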

Experimental Validation

The authors report extensive experiments to validate GoHD's generalization. According to those results, the framework generates realistic talking face videos for arbitrary subjects and maintains fidelity and naturalness across different input styles. The reported strengths fall into three areas:

- Realism: The generated videos are reported to closely resemble real footage in facial expressions and movements.
- Generalization: GoHD performs well even with unseen input styles, which indicates robustness and adaptability.
- Control: Users have fine-grained control over the generated animations, allowing for customization and personalization.

Why It Matters

For practitioners in computer vision and pattern recognition, GoHD represents a significant step forward in audio-driven talking head generation. Its ability to handle diverse input styles, maintain high disentanglement, and incorporate gaze orientation addresses many of the limitations of previous models. This makes it a valuable tool for applications ranging from virtual assistants and video conferencing to entertainment and gaming.

Conclusion

GoHD addresses key challenges in portrait animation with well-designed modules and a practical training strategy, pointing the way toward more realistic and controllable talking head generation. As this line of research evolves, more sophisticated and versatile models are likely to follow.