EquiVDM: Achieving Temporal Consistency in Video Diffusion with Inherent Equivariance

The Engineer

17 Apr 2025

NVIDIA researchers unveil EquiVDM, a video diffusion model that generates coherent frames by making its input noise temporally consistent, which promotes equivariance to spatial transformations and improves motion alignment.

EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

In a recent paper, NVIDIA researchers Chao Liu and Arash Vahdat introduce EquiVDM (Equivariant Video Diffusion Model), an approach to generating temporally consistent video frames with diffusion models. The key idea is to feed the model temporally consistent noise, which inherently encourages it to be equivariant to spatial transformations. This removes the need for dedicated consistency modules and improves motion alignment and 3D consistency, which matters for applications such as sim-to-real, style transfer, and video upsampling.

What Changed Technically

Traditionally, achieving temporal consistency in video diffusion models (VDMs) has required specialized modules or additional constraints. These methods often rely on 3D convolution layers or attention mechanisms to capture spatiotemporal information effectively. While these techniques can improve temporal coherence, they typically need extensive training on large-scale datasets and introduce complexity into the model architecture.

EquiVDM takes a different approach by focusing on the noise input itself. By using temporally consistent noise, the model is naturally encouraged to be equivariant to spatial transformations in both the input video and the noise. This means that the motion patterns from the input video are better preserved, leading to more aligned and high-fidelity frames without the need for extra modules or constraints.

Key Features of EquiVDM

Equivariance as an Inherent Property: The standard training objective of diffusion models, when applied with temporally consistent noise, inherently promotes equivariance. This means that the model can generate coherent video frames without additional guidance or regularization strategies.
Temporal Consistency Without Extra Cost: Unlike other methods that require specialized modules or constraints to achieve temporal consistency, EquiVDM accomplishes this as a natural byproduct of its training process. This simplifies the architecture and reduces the computational overhead.
3D Consistency for Sim-to-Real Applications: The researchers extend their approach to 3D-consistent video generation by attaching noise as textures on 3D meshes. This ensures that the generated frames maintain consistency in 3D space, which is crucial for sim-to-real applications where realistic motion and alignment are essential.
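
For reference, the standard objective mentioned above is usually written in its noise-prediction form as shown below. The notation is an assumption made for illustration, since the article gives no formulas: $x_0$ is a clean video, $c$ is the input (conditioning) video, $\epsilon$ is the noise, $\alpha_t$ and $\sigma_t$ define the noise schedule, and $\epsilon_\theta$ is the network.

```latex
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon}
  \left[ \left\| \epsilon - \epsilon_\theta\!\left(\alpha_t x_0 + \sigma_t \epsilon,\; t,\; c\right) \right\|_2^2 \right]
```

In ordinary training, $\epsilon$ is drawn independently for every frame. The claim here is that keeping this loss unchanged and drawing $\epsilon$ so that it is temporally consistent is enough to push the network toward equivariance.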

Method Overview

Standard VDMs and Temporal Consistency

Standard VDMs generate high-quality video frames by iteratively denoising a sequence of noisy inputs. However, achieving temporal consistency, where the generated frames align with each other and follow coherent motion patterns, remains challenging. The 3D convolution layers and attention mechanisms that most existing methods add to capture spatiotemporal information are computationally expensive and need large training datasets.

Temporally Consistent Noise

EquiVDM addresses this challenge by using temporally consistent noise. Rather than sampling independent noise for every frame, the same noise pattern is carried across consecutive frames, so the noise stays aligned with the video content over time. The training objective is the standard one for diffusion models; only the noise fed to it changes.
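
One way to picture temporally consistent noise is a single noise image that moves with the scene. The sketch below is a toy illustration, not the paper's construction: it assumes the motion between frames is a known integer translation, and the function name `translated_noise` and the `shifts` values are hypothetical. Integer circular shifts only permute pixels, so every frame is still i.i.d. unit Gaussian noise on its own.

```python
import numpy as np

def translated_noise(height, width, shifts, rng):
    """Toy temporally consistent noise.

    Samples one base noise image, then builds frame k by circularly shifting
    that image by the cumulative (dy, dx) motion up to frame k.
    Returns an array of shape (len(shifts) + 1, height, width).
    """
    base = rng.standard_normal((height, width))
    frames = [base]
    dy_total, dx_total = 0, 0
    for dy, dx in shifts:
        dy_total += dy
        dx_total += dx
        frames.append(np.roll(base, shift=(dy_total, dx_total), axis=(0, 1)))
    return np.stack(frames)

rng = np.random.default_rng(0)
shifts = [(0, 1), (0, 1), (1, 0)]  # hypothetical per-frame motion, in pixels
noise = translated_noise(8, 8, shifts, rng)
```

Real video motion is not a global integer shift, so a practical version would need a warping scheme that follows per-pixel motion while keeping each frame's noise Gaussian. That part is outside this sketch.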

Equivariance to Spatial Transformations

The key insight is that temporally consistent noise naturally encourages the model to be equivariant to spatial transformations. In other words, if the input video and its noise undergo a spatial transformation (e.g., rotation or translation), the generated frames follow the same transformation. This property is what lets the model produce videos with aligned motion and high fidelity.
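
The equivariance property can be stated as a check: transforming the inputs and then running the model should give the same result as running the model and then transforming its output. The sketch below tests this for circular translations using two toy stand-ins for a denoiser. Both functions are hypothetical illustrations and are unrelated to the EquiVDM network.

```python
import numpy as np

def shift(frames, dy, dx):
    """Spatial transformation T: circular translation of every frame."""
    return np.roll(frames, shift=(dy, dx), axis=(-2, -1))

def blur_denoiser(video, noise):
    """Toy model: 5-point circular average of video + noise.
    It depends only on relative pixel positions, so it commutes with T."""
    x = video + noise
    return (x + np.roll(x, 1, -1) + np.roll(x, -1, -1)
              + np.roll(x, 1, -2) + np.roll(x, -1, -2)) / 5.0

def positional_denoiser(video, noise):
    """Counter-example: a fixed position-dependent bias breaks equivariance."""
    h, w = video.shape[-2:]
    bias = np.linspace(0.0, 1.0, h * w).reshape(h, w)
    return blur_denoiser(video, noise) + bias

def equivariance_error(f, video, noise, dy, dx):
    """Max |f(T video, T noise) - T f(video, noise)|; zero means equivariant."""
    lhs = f(shift(video, dy, dx), shift(noise, dy, dx))
    rhs = shift(f(video, noise), dy, dx)
    return float(np.max(np.abs(lhs - rhs)))

rng = np.random.default_rng(0)
video = rng.standard_normal((3, 8, 8))
noise = rng.standard_normal((3, 8, 8))
err_blur = equivariance_error(blur_denoiser, video, noise, 2, 3)
err_pos = equivariance_error(positional_denoiser, video, noise, 2, 3)
```

Here `err_blur` is zero and `err_pos` is not. Note that the check transforms the video and the noise together, which is why the noise needs to move consistently with the video for the property to be meaningful.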

3D Consistency

For applications requiring 3D consistency, such as sim-to-real scenarios, EquiVDM attaches noise as textures on 3D meshes. This ensures that the generated video frames maintain consistency in 3D space, which is essential for realistic simulations and real-world applications.
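
The idea of attaching noise to a surface can be sketched with a texture lookup: noise is sampled once per texel, and each frame reads the noise value of whichever texel is visible at each pixel. The code below is a minimal sketch under strong assumptions (integer texel coordinates, one texel per pixel, no occlusion or perspective). The name `render_noise_from_texture` and the sliding-window `uv_maps` are hypothetical stand-ins for what a rasteriser would provide.

```python
import numpy as np

def render_noise_from_texture(noise_texture, uv_maps):
    """Look up per-pixel noise from a texture attached to a surface.

    noise_texture: (Ht, Wt) Gaussian noise, one value per texel.
    uv_maps: (F, H, W, 2) integer texel coordinates (row, col) visible at
             each pixel of each frame.
    Returns noise frames of shape (F, H, W).
    """
    rows = uv_maps[..., 0]
    cols = uv_maps[..., 1]
    return noise_texture[rows, cols]

rng = np.random.default_rng(0)
texture = rng.standard_normal((16, 16))

# Hypothetical camera motion: the visible 8x8 window slides one texel per frame.
ys, xs = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
uv_maps = np.stack([np.stack([ys, xs + k], axis=-1) for k in range(3)])
noise_frames = render_noise_from_texture(texture, uv_maps)
```

Because the noise belongs to the surface rather than to the image plane, a given surface point carries the same noise value in every frame where it is visible. A real renderer would also have to handle texels that cover more or less than one pixel without distorting the noise statistics, which this sketch does not attempt.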

Experimental Results

The researchers compared EquiVDM against state-of-the-art baselines on motion alignment, 3D consistency, and video quality. They report that EquiVDM surpasses these baselines on all three while requiring only a few sampling steps in practice.

Motion Alignment: EquiVDM generates frames with better-aligned motion patterns compared to other methods.
**3D Cons