GLM-Image: A Hybrid Auto-regressive and Diffusion Model for Dense-Knowledge and High-Fidelity Image Generation

The Engineer

14 Jan 2026 · 3 min read

GLM-Image combines an auto-regressive module with a diffusion decoder to generate images that are both accurate on dense-knowledge tasks and high in visual fidelity, targeting industrial-grade image generation.

Today, we're excited to introduce GLM-Image, an open-source model that combines the strengths of auto-regressive and diffusion architectures. This industrial-grade discrete auto-regressive image generation model is designed for tasks that require precise semantic understanding and complex information expression, while still generating high-fidelity, fine-grained detail.

What Changed Technically

GLM-Image introduces a hybrid architecture that uses an auto-regressive module for low-frequency semantic signals and a diffusion decoder for high-frequency detail refinement. Here’s a breakdown of the key components:

Auto-regressive Module:
- Initialized from GLM-4-9B-0414, a 9-billion-parameter model.
- Generates tokens that capture low-frequency semantic information, which is essential for following complex instructions and handling knowledge-intensive scenarios.

Diffusion Decoder:
- Follows the CogView4 architecture with a single-stream DiT (Diffusion Transformer) structure containing 7 billion parameters.
- Refines high-frequency details so that the final output is a visually rich, high-fidelity image.
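
To make the data flow between the two components concrete, here is a minimal toy sketch in Python. It is not GLM-Image's implementation or API: `ToyAutoRegressiveModule`, `ToyDiffusionDecoder`, and `generate_image` are hypothetical names, the "image" is a flat list of floats, and the decoder simply blends noise toward a token-derived target instead of running a learned denoiser. Only the two parameter counts come from the description above.

```python
import random
from dataclasses import dataclass

# Parameter counts as stated in the text above.
AR_PARAMS = 9_000_000_000
DECODER_PARAMS = 7_000_000_000
TOTAL_PARAMS = AR_PARAMS + DECODER_PARAMS  # 16 billion across both stages


@dataclass
class ToyAutoRegressiveModule:
    """Hypothetical stand-in for the auto-regressive module."""
    vocab_size: int = 16  # toy codebook size (assumption)

    def generate(self, prompt: str, num_tokens: int) -> list[int]:
        # Seed from the prompt so the same prompt yields the same tokens.
        rng = random.Random(prompt)
        tokens: list[int] = []
        for _ in range(num_tokens):
            prev = tokens[-1] if tokens else 0
            # Each token depends on the one before it (the auto-regressive part).
            tokens.append((prev + rng.randrange(self.vocab_size)) % self.vocab_size)
        return tokens


@dataclass
class ToyDiffusionDecoder:
    """Hypothetical stand-in for the diffusion decoder."""
    steps: int = 4

    def decode(self, tokens: list[int], upscale: int = 2) -> list[float]:
        # Token-conditioned target: each token value repeated `upscale` times.
        target = [float(t) for t in tokens for _ in range(upscale)]
        rng = random.Random(0)
        x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
        for step in range(1, self.steps + 1):
            alpha = step / self.steps
            # A real decoder runs a learned denoiser here; this toy just
            # blends the sample toward the target a little more each step.
            x = [(1 - alpha) * xi + alpha * ti for xi, ti in zip(x, target)]
        return x


def generate_image(prompt: str, num_tokens: int = 8) -> list[float]:
    """Stage 1 produces discrete tokens; stage 2 decodes them into 'pixels'."""
    tokens = ToyAutoRegressiveModule().generate(prompt, num_tokens)
    return ToyDiffusionDecoder().decode(tokens)


pixels = generate_image("a labelled diagram of the water cycle")
print(len(pixels), f"{TOTAL_PARAMS / 1e9:.0f}B parameters in total")
```

The point of the sketch is the ordering: the prompt only reaches the decoder through the tokens, so whatever semantic structure the image has must already be present in the auto-regressive output.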

Why It Matters

General Image Generation

Alignment with Mainstream Approaches: GLM-Image performs on par with leading latent diffusion models in general image generation tasks.
Advantages in Specific Scenarios: It excels in text-rendering and knowledge-intensive generation, making it particularly useful for tasks that require precise semantic understanding and complex information expression.

Text-to-Image Generation

Robust Semantic Understanding: The auto-regressive generator ensures that the model can handle intricate instructions and detailed descriptions.
High-Fidelity Details: The diffusion decoder refines the output to maintain high visual quality, making it suitable for creative work that demands both artistic aesthetics and information precision.

Image-to-Image Tasks

Versatile Capabilities: GLM-Image supports a wide range of image-to-image tasks, including:
- Image Editing: Modify specific aspects of an image while preserving overall coherence.
- Style Transfer: Apply different styles to images while maintaining the original content.
- Identity-Preserving Generation: Generate new images that maintain the identity of the subjects in the input image.
- Multi-Subject Consistency: Ensure consistency across multiple subjects in a single image.

Background

Diffusion models have become the go-to choice for image generation due to their training stability and strong generalization capabilities. However, they often fall short in complex instruction following and knowledge-intensive scenarios, where both information expression and semantic alignment suffer. On the other hand, some high-quality auto-regressive models have shown outstanding performance in these areas, producing visually rich details while maintaining robust semantic understanding.

Techniques

Visual Token Selection

In previous visual auto-regressive generation models, token types typically fell into three categories:

Visual Codes: Obtained via discrete reconstruction techniques.
Latent Variables: Derived from continuous latent spaces.
Hybrid Tokens: Combining both discrete and continuous representations to capture a broader range of information.
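
The difference between these three token types can be illustrated with a generic nearest-neighbour vector quantizer. This is a minimal sketch under stated assumptions: the 2-D `codebook`, the sample `latents`, and the `quantize` helper are all made up for illustration, and nothing here reflects the tokenizer GLM-Image actually uses.

```python
def quantize(latent, codebook):
    """Return the index of the codebook entry nearest to `latent` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


# Toy 2-D codebook with four entries (assumption, for illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Latent variables: continuous vectors, one per image patch.
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

# Visual codes: each continuous vector replaced by a discrete codebook index.
codes = [quantize(z, codebook) for z in latents]

# Hybrid tokens: the discrete index plus the continuous residual it discards.
hybrid = [
    (k, tuple(zi - ci for zi, ci in zip(z, codebook[k])))
    for z, k in zip(latents, codes)
]

print(codes)  # [0, 3, 2]
```

Discrete codes are what an auto-regressive model can predict with an ordinary next-token objective; the residual in the hybrid form is the information that quantization throws away.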

GLM-Image's hybrid architecture leverages the strengths of these token types by using an auto-regressive generator to produce tokens with low-frequency semantic signals, which are then refined by the diffusion decoder to add high-frequency details. This approach ensures that the model can handle both complex instructions and high-fidelity image generation effectively.
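
As an analogy for the low-frequency / high-frequency split, the following sketch decomposes a 1-D signal into a block-averaged coarse part and a residual detail part. It is only an illustration of the idea, not how GLM-Image computes its tokens or its decoder output; the `low_pass` helper and the sample `signal` are assumptions.

```python
def low_pass(signal, window):
    """Block-average: replace each block of `window` samples with its mean."""
    out = []
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        mean = sum(block) / len(block)
        out.extend([mean] * len(block))
    return out


signal = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 11.0, 13.0]

# Low-frequency part: the overall layout (analogous to what the tokens carry).
coarse = low_pass(signal, 4)

# High-frequency part: what is left over (analogous to what the decoder adds).
detail = [s - c for s, c in zip(signal, coarse)]

# Coarse structure plus detail recovers the original signal.
reconstructed = [c + d for c, d in zip(coarse, detail)]

print(coarse)  # [2.5, 2.5, 2.5, 2.5, 11.5, 11.5, 11.5, 11.5]
print(detail)  # [-1.5, 0.5, -0.5, 1.5, -1.5, 0.5, -0.5, 1.5]
```

The coarse part is cheap to describe (two numbers here) yet fixes where the large structures sit, which is why it suits a token-by-token generator, while the detail part is small in magnitude but dense, which suits an iterative refinement stage.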

Conclusion

GLM-Image represents a significant step forward in image generation, combining the robust semantic understanding of auto-regressive models with the high-fidelity detail refinement capabilities of diffusion decoders. Whether you're working on creative projects that demand intricate knowledge representation or general image generation tasks, GLM-Image is a powerful tool to have in your arsenal.