Diffusion Models Without Attention: A Scalable State Space Approach for High-Resolution Image Generation

The Engineer

4 Dec 2023

This paper presents DiffuSSM, a novel approach that sidesteps attention mechanisms to generate high-resolution images efficiently, offering a promising solution to the scalability issues faced by current DDPMs.

High-fidelity image generation now leans heavily on Denoising Diffusion Probabilistic Models (DDPMs), but applying them at high resolutions is computationally expensive. Common workarounds such as patchifying speed up UNet and Transformer architectures at the cost of representational capacity. A new paper from Jing Nathan Yan, Jiatao Gu, and Alexander M. Rush introduces the Diffusion State Space Model (DiffuSSM), which replaces attention mechanisms with a more scalable state space model backbone. The approach handles higher resolutions without global compression, preserving detailed image representation throughout the diffusion process.

Key Technical Changes

State Space Model Backbone: The core innovation in DiffuSSM is the use of a state space model (SSM) instead of attention mechanisms. SSMs are efficient sequence models, and they can be applied to 2D images by flattening the image representation into a sequence.
Scalability and Efficiency: The cost of self-attention grows quadratically with sequence length, while SSM layers scale close to linearly. This is what lets DiffuSSM cut the compute needed for high-resolution image generation, where sequences become long, on large-scale datasets like ImageNet and LSUN.
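
For reference, a linear state space layer can be written in a generic discretized form. This is the textbook formulation, not necessarily the exact parameterization used in the paper; the symbols (input u, hidden state x, output y, and matrices A-bar, B-bar, C-bar) are notation introduced here for illustration.

```latex
\begin{aligned}
x_k &= \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad x_{-1} = 0,\\
y_k &= \bar{C}\,x_k = \sum_{j=0}^{k} \bar{C}\,\bar{A}^{\,j}\,\bar{B}\,u_{k-j}.
\end{aligned}
```

The recurrence does a fixed amount of work per position, so a sequence of length L costs O(L) steps. The unrolled sum is a causal convolution, which can be evaluated in O(L log L) with an FFT. Self-attention compares every pair of positions, which costs O(L^2).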

Implementation Details

Model Architecture:
- Input Representation: The input is flattened into a sequence of tokens. Patch-based transformers shorten this sequence by merging neighboring positions into coarse patches; DiffuSSM aims to avoid that kind of global compression.
- State Space Model Layer: The token sequence passes through a stack of SSM layers. Each layer maintains a hidden state that is updated as it scans along the sequence, so compute and memory grow roughly linearly with sequence length rather than quadratically as in self-attention.
- Denoising Process: At each diffusion step, the network takes the current noisy input and the timestep and predicts the noise to remove. Repeating this over many steps turns pure noise into an image.
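
The sequence-scanning idea behind an SSM layer can be sketched in a few lines of NumPy. This is a minimal, single-channel illustration with a diagonal state matrix, not the layer used in DiffuSSM; the function names `ssm_scan` and `ssm_conv`, the 8x8 grid, and the state size of 4 are all hypothetical choices made for the example. It shows that the step-by-step recurrence and the equivalent causal convolution give the same output.

```python
import numpy as np

def ssm_scan(u, a, b, c):
    """Diagonal linear SSM run as a recurrence over the sequence.

    u: (L,) input sequence; a, b, c: (N,) diagonal state parameters.
    x_k = a * x_{k-1} + b * u_k,  y_k = sum(c * x_k)
    """
    x = np.zeros_like(a)          # hidden state starts at zero
    y = np.empty(len(u))
    for k, u_k in enumerate(u):   # one constant-cost update per token
        x = a * x + b * u_k
        y[k] = np.dot(c, x)
    return y

def ssm_conv(u, a, b, c):
    """Same SSM evaluated as a causal convolution with an explicit kernel."""
    L = len(u)
    # kernel[j] = sum_n c_n * a_n**j * b_n
    powers = a[None, :] ** np.arange(L)[:, None]   # shape (L, N)
    kernel = powers @ (c * b)                      # shape (L,)
    return np.convolve(u, kernel)[:L]              # keep the causal part

rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8))   # hypothetical 8x8 single-channel feature map
u = grid.reshape(-1)                 # row-major flatten: sequence of 64 tokens
a = rng.uniform(0.5, 0.95, size=4)   # stable decay rates (|a| < 1)
b = rng.standard_normal(4)
c = rng.standard_normal(4)

y_scan = ssm_scan(u, a, b, c)
y_conv = ssm_conv(u, a, b, c)
print(np.allclose(y_scan, y_conv))
```

The recurrent form touches each token once, which is where the near-linear scaling in sequence length comes from. Practical SSM layers add multiple channels, gating, and learned discretization on top of this core.
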
Training and Inference:
- FLOP Efficiency: The authors focus on creating FLOP-efficient architectures. This is crucial for practical deployment, especially in resource-constrained environments.
- Loss Function: As in standard DDPM training, the core objective is a mean squared error (MSE) loss between the noise the model predicts and the noise that was actually added to the input.
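
The MSE term can be illustrated with the standard DDPM noise-prediction setup. The sketch below assumes the common linear beta schedule (1e-4 to 0.02 over 1000 steps) from the original DDPM formulation; the paper's actual schedule and hyperparameters may differ, and `dummy_model` is a placeholder standing in for the real denoiser.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, eps, alpha_bar):
    """Forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def noise_mse(eps_pred, eps):
    """MSE between predicted noise and the noise that was actually added."""
    return float(np.mean((eps_pred - eps) ** 2))

def dummy_model(x_t, t):
    """Placeholder denoiser that always predicts zero noise (illustration only)."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 16, 16))    # hypothetical batch of clean inputs
eps = rng.standard_normal(x0.shape)      # Gaussian noise to add
t = 500                                  # an arbitrary timestep
x_t = q_sample(x0, t, eps, alpha_bar)

loss = noise_mse(dummy_model(x_t, t), eps)
print(loss)
```

In real training the timestep is sampled at random for each example and the loss is backpropagated through the denoiser; both are omitted here to keep the sketch short.
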

Performance Benchmarks

Datasets: The model was evaluated on the ImageNet and LSUN datasets at two resolutions: 256x256 and 512x512.
Metrics:
- FID Score: DiffuSSM achieves FID scores on par with, or better than, existing diffusion models that use attention.
- Inception Score: The Inception Score, which measures both the quality and diversity of generated images, also shows competitive results.
- FLOP Usage: Notably, DiffuSSM significantly reduces the total FLOP usage compared to attention-based models, making it a more efficient choice for high-resolution image generation.
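
To see why the FLOP gap widens with resolution, a back-of-envelope comparison of the token-mixing cost helps. The formulas below are rough estimates that count only the sequence-mixing step (not projections or MLPs), and the grid sizes, channel width of 512, and state size of 64 are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_mixing_flops(seq_len, dim):
    """Rough cost of self-attention mixing: QK^T plus weights @ V.

    Each is an (L x L x d) matrix product, about 2 * L * L * d FLOPs.
    """
    return 4 * seq_len * seq_len * dim

def ssm_mixing_flops(seq_len, dim, state_size):
    """Rough cost of a recurrent SSM scan.

    Per token and channel: about 2 * N for the state update, 2 * N for the readout.
    """
    return 4 * seq_len * dim * state_size

dim, state_size = 512, 64            # hypothetical layer sizes
for side in (16, 32, 64):            # hypothetical token grids (side x side)
    seq_len = side * side
    attn = attention_mixing_flops(seq_len, dim)
    ssm = ssm_mixing_flops(seq_len, dim, state_size)
    print(f"{side}x{side} tokens: attention/SSM FLOP ratio = {attn / ssm:.0f}")
```

Under these assumptions the ratio is simply sequence length divided by state size, so every doubling of the grid side multiplies attention cost by 16 but SSM cost by only 4.
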

Practical Implications

Resource Efficiency: The reduced computational requirements make DiffuSSM a viable option for real-world applications where resources are limited.
Image Quality: Despite the efficiency gains, the model maintains or improves upon the quality of generated images, which is crucial for tasks like photo editing and content creation.
Research Directions: This work opens up new avenues for research in scalable diffusion models, particularly in exploring other efficient architectures that can handle high-dimensional data.

Conclusion

The introduction of DiffuSSM marks a significant step forward in the field of image generation. By replacing attention mechanisms with state space models, the authors have created a model that is both computationally efficient and capable of generating high-quality images at high resolutions. This approach not only addresses the limitations of current methods but also paves the way for future advancements in scalable diffusion models.