Stability AI Launches Stable Cascade: A Three-Stage Text-to-Image Model for Efficient Training and Inference

The Engineer

14 Feb 2024

Stability AI has unveiled Stable Cascade, a text-to-image model whose three-stage architecture makes training and fine-tuning practical on consumer hardware without giving up output quality or flexibility.

Stability AI has announced the research preview release of Stable Cascade, a new text-to-image model that builds on the Würstchen architecture. This model is particularly noteworthy for its ease of training and fine-tuning on consumer hardware, thanks to its innovative three-stage approach. Released under a non-commercial license, Stable Cascade aims to make advanced text-to-image generation more accessible while maintaining high quality and flexibility.

Key Features

Three-Stage Architecture: Stable Cascade uses a pipeline of three distinct models (Stages A, B, and C) that together perform hierarchical image compression, following the Würstchen design.
Efficient Training: The text-conditional model, Stage C, can be trained at roughly 16x lower cost than a similar-sized Stable Diffusion model, according to the original Würstchen paper.
Non-Commercial License: The model is available under a non-commercial license, permitting use only in non-commercial projects.
Comprehensive Code Release: Stability AI has released training and inference code on its GitHub page, along with scripts for fine-tuning, ControlNet, and LoRA training.

Technical Details

Stable Cascade's architecture differs from the Stable Diffusion lineup in its three-stage pipeline. The stages run in the order C, B, A:

Stage C (Latent Generator): This stage transforms user inputs into compact 24x24 latents. These latents are then passed to Stage B and then Stage A for decoding.
Stage B (Latent Decoder): This stage decodes the latents from Stage C, expanding them into a higher-resolution latent space.
Stage A (Pixel Decoder): The final stage converts the high-resolution latents into the final pixel space image.
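
To make the order of the stages concrete, here is a minimal sketch that tracks only the spatial size of the data as it moves from Stage C through Stage B to Stage A. It is not Stability AI's code and runs no model. The names Grid, stage_c, stage_b, stage_a, and generate are hypothetical, and the 256x256 intermediate size and the 4x pixel upscale are illustrative assumptions chosen so that a 24x24 latent ends up as a 1024x1024 image.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Grid:
    """Spatial size of the data at one point in the pipeline (channels omitted)."""
    name: str
    height: int
    width: int


def stage_c(prompt: str, latent_size: int = 24) -> Grid:
    """Latent generator: text prompt -> compact latent grid (24x24 per the article)."""
    if not prompt:
        raise ValueError("prompt must be non-empty")
    return Grid("stage_c_latent", latent_size, latent_size)


def stage_b(latent: Grid, target_size: int = 256) -> Grid:
    """Latent decoder: expand the compact latent to a higher-resolution latent.

    target_size=256 is an illustrative assumption, not a documented value.
    """
    if target_size < latent.height or target_size < latent.width:
        raise ValueError("Stage B is expected to increase resolution")
    return Grid("stage_b_latent", target_size, target_size)


def stage_a(latent: Grid, upscale: int = 4) -> Grid:
    """Pixel decoder: map the high-resolution latent to pixel space.

    upscale=4 is an illustrative assumption, not a documented value.
    """
    return Grid("pixels", latent.height * upscale, latent.width * upscale)


def generate(prompt: str) -> List[Grid]:
    """Run the stages in the order C -> B -> A and return every intermediate size."""
    c = stage_c(prompt)
    b = stage_b(c)
    a = stage_a(b)
    return [c, b, a]


if __name__ == "__main__":
    for grid in generate("a lighthouse at dusk"):
        print(f"{grid.name}: {grid.height}x{grid.width}")
```

Only stage_c takes the prompt in this sketch, which mirrors the article's point that text-conditional generation is confined to Stage C while Stages B and A handle decoding.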

Hierarchical Compression

The hierarchical compression approach allows for efficient use of a highly compressed latent space. This is particularly beneficial for reducing computational requirements and improving training efficiency. By decoupling the text-conditional generation (Stage C) from the decoding to the high-resolution pixel space (Stages A and B), Stable Cascade achieves significant cost savings.

Cost Reduction: Training Stage C alone results in a 16x cost reduction compared to training a similar-sized Stable Diffusion model, as demonstrated in the original Würstchen paper.
Fine-Tuning Flexibility: Users can fine-tune Stage C independently for tasks like ControlNet and LoRA training. Stages A and B can also be fine-tuned, but this is generally less necessary and more resource-intensive.
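
As a rough illustration of how compact the Stage C latent space is, the arithmetic below compares a 24x24 latent with the latent grid that an 8x-downsampling autoencoder, as used by Stable Diffusion, would produce for the same image. The 1024x1024 output size is an assumption made for this example, and the function names are hypothetical. The resulting ratio counts spatial positions only. It is not the 16x training cost figure quoted above, which comes from the Würstchen paper.

```python
def spatial_factor(image_size: int, latent_size: int) -> float:
    """Per-side compression factor between pixel space and a square latent grid."""
    return image_size / latent_size


def position_ratio(larger_latent: int, smaller_latent: int) -> float:
    """How many times more spatial positions the larger square latent grid has."""
    return (larger_latent / smaller_latent) ** 2


IMAGE_SIZE = 1024            # assumed output resolution for this example
CASCADE_LATENT = 24          # Stage C latent side length, from the article
SD_LATENT = IMAGE_SIZE // 8  # 8x-downsampling autoencoder -> 128 per side

if __name__ == "__main__":
    print(f"Stage C per-side factor: {spatial_factor(IMAGE_SIZE, CASCADE_LATENT):.1f}")
    print(f"8x autoencoder per-side factor: {spatial_factor(IMAGE_SIZE, SD_LATENT):.1f}")
    print(f"Position ratio (128x128 vs 24x24): {position_ratio(SD_LATENT, CASCADE_LATENT):.1f}")
```

Under these assumptions the text-conditional model works on a grid with roughly 28 times fewer spatial positions, which gives a sense of why confining training to Stage C is cheaper.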

Model Variants

Stage C of Stable Cascade is released in two sizes:

1B Parameters: Suitable for users with limited computational resources.
3.6B Parameters: Offers higher quality outputs at the cost of increased computational requirements.
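
When choosing between variants, a quick back-of-the-envelope estimate of weight memory can help. The sketch below multiplies a parameter count by the bytes per parameter for a given numeric precision. The function name weights_gib is hypothetical, and the estimate covers weights only: activations, the other stages, and any optimizer state during training add to it.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weights_gib(num_params: int, dtype: str = "float16") -> float:
    """Approximate memory for model weights alone, in GiB (2**30 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    one_billion = 1_000_000_000
    for dtype in ("float32", "float16"):
        print(f"1B parameters in {dtype}: {weights_gib(one_billion, dtype):.2f} GiB")
```

The same function applies to the larger variant by passing its parameter count. Memory scales linearly with parameter count, so halving the precision halves the footprint.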

Getting Started

To get started with Stable Cascade, you can access the model and associated scripts on the Stability AI GitHub page. The repository includes:

Checkpoints and Inference Scripts: For running inference and generating images.
Fine-Tuning Scripts: To customize the model for specific tasks or datasets.
ControlNet and LoRA Training Scripts: For advanced users looking to explore additional capabilities.

Conclusion

Stable Cascade offers a balance of quality, flexibility, and efficiency in text-to-image generation. Because its three-stage architecture keeps training and fine-tuning within reach of consumer hardware, researchers and enthusiasts can experiment with an advanced model directly, although the non-commercial license restricts use to non-commercial projects.