
NVIDIA researchers unveil EquiVDM, a diffusion model that generates coherent video frames by ensuring noise is temporally consistent, thereby promoting equivariance to spatial transformations and improving motion accuracy.
In a recent paper, NVIDIA researchers Chao Liu and Arash Vahdat introduce EquiVDM (Equivariant Video Diffusion Model), a novel approach to generating temporally consistent video frames using diffusion models. The key innovation lies in leveraging temporally consistent noise, which inherently encourages the model to be equivariant to spatial transformations. This method not only simplifies the generation process but also enhances motion alignment and 3D consistency, making it particularly useful for applications like sim-to-real, style transfer, and video upsampling.
Traditionally, achieving temporal consistency in video diffusion models (VDMs) has required specialized modules or additional constraints. These methods often rely on 3D convolution layers or attention mechanisms to capture spatiotemporal information effectively. While these techniques can improve temporal coherence, they typically need extensive training on large-scale datasets and introduce complexity into the model architecture.
EquiVDM takes a different approach by focusing on the noise input itself. By using temporally consistent noise, the model is naturally encouraged to be equivariant to spatial transformations in both the input video and the noise. This means that the motion patterns from the input video are better preserved, leading to more aligned and high-fidelity frames without the need for extra modules or constraints.
Equivariance as an Inherent Property: The standard training objective of diffusion models, when applied with temporally consistent noise, inherently promotes equivariance. This means that the model can generate coherent video frames without additional guidance or regularization strategies.
Temporal Consistency Without Extra Cost: Unlike other methods that require specialized modules or constraints to achieve temporal consistency, EquiVDM accomplishes this as a natural byproduct of its training process. This simplifies the architecture and reduces the computational overhead.
3D Consistency for Sim-to-Real Applications: The researchers extend their approach to 3D-consistent video generation by attaching noise as textures on 3D meshes. This ensures that the generated frames maintain consistency in 3D space, which is crucial for sim-to-real applications where realistic motion and alignment are essential.

Standard VDMs generate high-quality video frames by iteratively denoising a sequence of noisy inputs. However, achieving temporal consistency, where the generated frames align with each other and follow coherent motion patterns, remains challenging. Most existing methods introduce 3D convolution layers or attention mechanisms to capture spatiotemporal information, which can be computationally expensive and require large datasets for training.
EquiVDM addresses this challenge by using temporally consistent noise. Rather than drawing unrelated noise for every frame, the noise pattern is kept consistent across consecutive frames, so the model learns to generate frames that are coherent in both space and time. The training objective is the standard one for diffusion models; the only twist is that the noise is designed to be temporally consistent.
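To make the idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the article does not say how the noise is constructed, so this toy assumes the motion is a known integer translation per frame and simply shifts one Gaussian noise field along with it. All function names (`consistent_noise`, `shift_frame`, `add_noise`) are hypothetical. The noising step is the standard forward diffusion formula.

```python
import math
import random

def gaussian_frame(height, width, rng):
    """One frame of i.i.d. standard Gaussian noise, stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]

def shift_frame(frame, dy, dx):
    """Circularly shift a frame by (dy, dx) pixels; a toy stand-in for motion."""
    height, width = len(frame), len(frame[0])
    return [[frame[(y - dy) % height][(x - dx) % width] for x in range(width)]
            for y in range(height)]

def consistent_noise(num_frames, height, width, motion, seed=0):
    """Noise video in which frame k is frame 0 moved by k * motion (assumed known)."""
    rng = random.Random(seed)
    base = gaussian_frame(height, width, rng)
    dy, dx = motion
    return [shift_frame(base, k * dy, k * dx) for k in range(num_frames)]

def add_noise(clean, noise, alpha_bar):
    """Standard forward diffusion step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[[a * c + b * n for c, n in zip(crow, nrow)]
             for crow, nrow in zip(cframe, nframe)]
            for cframe, nframe in zip(clean, noise)]

# Toy clip: a bright 2x2 square moving one pixel to the right per frame.
T, H, W = 4, 8, 8
motion = (0, 1)
first = [[1.0 if 3 <= y < 5 and 1 <= x < 3 else 0.0 for x in range(W)]
         for y in range(H)]
video = [shift_frame(first, k * motion[0], k * motion[1]) for k in range(T)]

noise = consistent_noise(T, H, W, motion)
noisy_video = add_noise(video, noise, alpha_bar=0.5)

# Content and noise move together, so every noisy frame is the first one, shifted.
follows_motion = all(
    noisy_video[k] == shift_frame(noisy_video[0], k * motion[0], k * motion[1])
    for k in range(T)
)
print(follows_motion)
```

Each individual frame of `noise` is still ordinary Gaussian noise; only the relationship between frames changes. With independent per-frame noise the final check would fail, because the noise would no longer travel with the square.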
The key insight is that by using temporally consistent noise, the model is naturally encouraged to be equivariant to spatial transformations. In other words, if the input video and its noise undergo a spatial transformation (e.g., rotation or translation), the generated frames follow the same transformation. This property is crucial for generating videos with aligned motion and high fidelity.
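The equivariance property can be stated as a simple check: transforming the inputs and then running the model should give the same result as running the model and then transforming its output. The sketch below illustrates that check with a hypothetical `toy_denoiser` (a circular 3x3 box blur) that is translation-equivariant by construction. It is only a stand-in; the article describes EquiVDM as acquiring this behaviour through training, which this toy does not reproduce.

```python
import math
import random

def shift(frame, dy, dx):
    """Circularly translate a frame by (dy, dx) pixels."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]

def toy_denoiser(frame, noise):
    """Hypothetical stand-in for a network: 3x3 circular box blur of frame + 0.1 * noise."""
    h, w = len(frame), len(frame[0])
    mixed = [[frame[y][x] + 0.1 * noise[y][x] for x in range(w)] for y in range(h)]
    return [[sum(mixed[(y + j) % h][(x + i) % w]
                 for j in (-1, 0, 1) for i in (-1, 0, 1)) / 9.0
             for x in range(w)] for y in range(h)]

def frames_close(a, b, tol=1e-9):
    """True if two frames agree pixel by pixel up to a small tolerance."""
    return all(math.isclose(p, q, abs_tol=tol)
               for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))

rng = random.Random(0)
H, W = 6, 6
frame = [[rng.random() for _ in range(W)] for _ in range(H)]
noise = [[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]
dy, dx = 1, 2

# Reference: run the model first, then move its output.
moved_output = shift(toy_denoiser(frame, noise), dy, dx)

# Case 1: move the input and the noise together, then run the model.
with_moved_noise = toy_denoiser(shift(frame, dy, dx), shift(noise, dy, dx))

# Case 2: move only the input and leave the noise where it was.
with_fixed_noise = toy_denoiser(shift(frame, dy, dx), noise)

equivariant_with_moved_noise = frames_close(moved_output, with_moved_noise)
equivariant_with_fixed_noise = frames_close(moved_output, with_fixed_noise)
print(equivariant_with_moved_noise, equivariant_with_fixed_noise)
```

The second case shows why the noise matters: even a perfectly equivariant function produces misaligned output when the noise does not follow the same transformation as the video.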
For applications requiring 3D consistency, such as sim-to-real scenarios, EquiVDM attaches noise as textures on 3D meshes. This ensures that the generated video frames maintain consistency in 3D space, which is essential for realistic simulations and real-world applications.
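A rough way to picture noise attached to geometry: give every surface point a fixed noise value, then project the points into each camera view. The sketch below is an assumption-laden toy, not the paper's renderer. It uses loose 3D points instead of a textured mesh, a pinhole camera that only translates along x, and hypothetical names (`project`, `render_noise`). Pixels that no point covers are filled with fresh noise, which is also an assumption.

```python
import random

FOCAL, WIDTH, HEIGHT = 8.0, 16, 8

def project(point, cam_x):
    """Pinhole projection of a 3D point for a camera translated by cam_x along x."""
    x, y, z = point
    u = round(FOCAL * (x - cam_x) / z + WIDTH / 2)
    v = round(FOCAL * y / z + HEIGHT / 2)
    return u, v

def render_noise(points, texture, cam_x, rng):
    """Render one noise frame: each visible point writes its fixed noise value."""
    frame = [[rng.gauss(0.0, 1.0) for _ in range(WIDTH)] for _ in range(HEIGHT)]
    depth = [[float("inf")] * WIDTH for _ in range(HEIGHT)]
    for point, value in zip(points, texture):
        u, v = project(point, cam_x)
        # Simple z-buffer: the nearest point wins the pixel.
        if 0 <= u < WIDTH and 0 <= v < HEIGHT and point[2] < depth[v][u]:
            depth[v][u] = point[2]
            frame[v][u] = value
    return frame

rng = random.Random(0)

# Two rows of surface points at different depths (near: z=4, far: z=8).
near = [(float(x), -1.0, 4.0) for x in range(-2, 3)]
far = [(float(x), 1.0, 8.0) for x in range(-4, 5)]
points = near + far

# The "texture": one noise value per surface point, sampled once and reused.
texture = [rng.gauss(0.0, 1.0) for _ in points]

camera_positions = [0.0, 1.0, 2.0]
frames = [render_noise(points, texture, cam_x, rng) for cam_x in camera_positions]

# Every surface point carries the same noise value in every frame.
consistent_in_3d = True
for cam_x, frame in zip(camera_positions, frames):
    for point, value in zip(points, texture):
        u, v = project(point, cam_x)
        if frame[v][u] != value:
            consistent_in_3d = False
print(consistent_in_3d)

# Parallax: near points move further across the image than far points.
near_shift = project(near[2], 0.0)[0] - project(near[2], 1.0)[0]
far_shift = project(far[4], 0.0)[0] - project(far[4], 1.0)[0]
print(near_shift, far_shift)
```

Because near and far points move by different amounts in the image, no single 2D shift of a noise frame could reproduce this behaviour, which is the motivation for tying the noise to the 3D surface instead.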
The researchers conducted extensive experiments to evaluate the performance of EquiVDM. They compared their method against state-of-the-art baselines in terms of motion alignment, 3D consistency, and video quality. The results demonstrate that EquiVDM surpasses these baselines in all metrics while requiring only a few sampling steps in practice.
Original Sources
↗ https://research.nvidia.com/labs/genair/equivdm/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
17 April 2025