Style Aligned Image Generation: Consistent Styles Without Fine-Tuning

The Engineer

6 Dec 2023 · 3 min read

Researchers present a novel method for generating stylistically consistent images without the need for fine-tuning, streamlining the process for artists and designers.

CVPR 2024, Oral Presentation

By Amir Hertz*, Andrey Voynov*, Shlomi Fruchter†, and Daniel Cohen-Or†
Google Research, Tel Aviv University
*Indicates Equal Contribution, †Indicates Equal Advising

Overview

Large-scale Text-to-Image (T2I) models have become a cornerstone of creative work, generating visually compelling images from text prompts. However, keeping the style consistent across multiple images remains a significant challenge: existing methods often require fine-tuning and manual intervention to disentangle content from style. In the paper "Style Aligned Image Generation via Shared Attention," researchers from Google Research and Tel Aviv University introduce StyleAligned, a technique that generates a set of images in a consistent style from a pretrained diffusion model, with no fine-tuning.

Key Technical Contributions

Minimal Attention Sharing: The core idea is to share a minimal amount of attention between images during the diffusion process, so that each generated image stays aligned with the style of a reference image.
Reference Image Inversion: By inverting a reference image into the diffusion model's noise space, StyleAligned can generate new images that share its stylistic characteristics.
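
To make the inversion idea concrete, here is a minimal NumPy sketch. It assumes deterministic DDIM inversion, which the text above does not name, and it replaces the pretrained noise predictor with a hypothetical `toy_eps_model` (the closed-form optimum when the clean data are unit Gaussian) so that the script runs on its own. The schedule `alphas` and all function names are illustrative and are not taken from the StyleAligned code.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update from signal level alpha_from to alpha_to."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that estimate to the new signal level using the same eps.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def toy_eps_model(x, alpha):
    # Hypothetical stand-in for a trained noise predictor (assumption):
    # exact for unit-Gaussian data; a real T2I model would use a network.
    return np.sqrt(1.0 - alpha) * x

def ddim_invert(x0, alphas, eps_model):
    """Walk a clean sample toward noise (alphas ordered from ~1 down to ~0)."""
    x = x0
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x

def ddim_sample(x_noisy, alphas, eps_model):
    """Walk a noisy latent back toward a clean sample (reverse order)."""
    x = x_noisy
    rev = alphas[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x

# Illustrative schedule: cumulative signal levels from nearly clean to nearly noise.
alphas = np.cos(np.linspace(0.05, 1.5, 200)) ** 2

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))           # stand-in for a reference latent
x_noisy = ddim_invert(x0, alphas, toy_eps_model)
x_rec = ddim_sample(x_noisy, alphas, toy_eps_model)

rel_err = np.linalg.norm(x_rec - x0) / np.linalg.norm(x0)
print(f"relative reconstruction error: {rel_err:.4f}")
```

With this toy predictor, sampling from the inverted latent recovers the input up to a small discretization error that shrinks as more steps are used. With a real model, the recovered noise trajectory is what would let a reference image take part in generation.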

Problem Statement

State-of-the-art T2I models often produce images that diverge significantly in their interpretations of the same stylistic descriptor. For example, given the style description "minimal origami," standard T2I generation might output images with vastly different styles (left). StyleAligned addresses this by making the model's generation style persistent (right).

How It Works

Reference Image Selection: Choose a reference image that embodies the desired style.
Inversion Operation: Invert the reference image into the model's noise space so that the diffusion process can reproduce it alongside the new images.
Attention Sharing:
- During each denoising step, every target image performs shared self-attention with the reference image.
- Before attending, each target's queries and keys are normalized with Adaptive Instance Normalization (AdaIN) so that their statistics match those of the reference queries and keys.
- Shared attention then updates the target features using both the target values (V_t) and the reference values (V_r).
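
The attention-sharing step above can be sketched for a single attention head in NumPy. This is an illustration under assumptions rather than the authors' implementation: features are flattened to (tokens, channels), AdaIN statistics are taken over the token axis, the reference and target keys and values are concatenated before attention, and all function names are hypothetical.

```python
import numpy as np

def adain(x, ref, eps=1e-6):
    """Shift and scale x so its per-channel mean/std over tokens match ref."""
    mu_x, sd_x = x.mean(axis=0, keepdims=True), x.std(axis=0, keepdims=True)
    mu_r, sd_r = ref.mean(axis=0, keepdims=True), ref.std(axis=0, keepdims=True)
    return sd_r * (x - mu_x) / (sd_x + eps) + mu_r

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_t, k_t, v_t, q_r, k_r, v_r):
    """Target tokens attend to reference and target tokens together."""
    # Align the target's query/key statistics with the reference's.
    q_hat = adain(q_t, q_r)
    k_hat = adain(k_t, k_r)
    # Keys and values from both images form one shared attention context.
    k_cat = np.concatenate([k_r, k_hat], axis=0)
    v_cat = np.concatenate([v_r, v_t], axis=0)
    scores = q_hat @ k_cat.T / np.sqrt(q_t.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v_cat, weights

rng = np.random.default_rng(0)
# Toy sizes (assumption): 6 target tokens, 5 reference tokens, 4 channels.
q_t, k_t, v_t = (rng.standard_normal((6, 4)) for _ in range(3))
q_r, k_r, v_r = (rng.standard_normal((5, 4)) for _ in range(3))

out, weights = shared_attention(q_t, k_t, v_t, q_r, k_r, v_r)
print(out.shape, weights.shape)  # (6, 4) (6, 11)
```

Each row of `weights` spans the 5 reference tokens followed by the 6 target tokens, so every updated target feature is a mixture of reference values and the target's own values.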

Results

StyleAligned enables style-consistent content generation across different prompts without fine-tuning. Here are some key findings:

High-Quality Synthesis: The method demonstrates high-quality image synthesis and fidelity.
Diverse Styles and Prompts: It works well with a variety of styles and textual prompts, maintaining consistency in the generated images.
Real Reference Images: StyleAligned can transfer style from real reference images without requiring additional training or model personalization.

Integration with Other Methods

StyleAligned is versatile and can be easily combined with other methods to enhance its capabilities:

ControlNet + StyleAligned: Combining ControlNet, which guides the generation process using additional inputs like segmentation maps, with StyleAligned ensures both style consistency and controlled content.
DreamBooth + StyleAligned: Integrating DreamBooth, a method for personalizing T2I models to specific subjects, with StyleAligned allows for personalized style-consistent generation.

Why It Matters

For practitioners in the field of generative AI, StyleAligned offers several advantages:

Efficiency: No fine-tuning is required, which avoids the time and compute cost of training a model for each style.
Flexibility: Works with a wide range of styles and prompts, expanding its applicability.
Quality: High-quality synthesis ensures that generated images are visually compelling and consistent.

Conclusion

"Style Aligned Image Generation via Shared Attention" is a significant step forward in text-to-image generation. By combining minimal attention sharing with reference image inversion, StyleAligned achieves style consistency without fine-tuning, which could make style-consistent image sets much easier to produce in creative workflows.