
From surreal 80s TV commercials to today’s hyper-realistic visuals, AI image generators like DALL-E and ChatGPT have rapidly evolved, showcasing the breathtaking pace of technological advancement in visual creativity.
When I first got access to DALL-E's beta version in the summer of 2022, it felt like stepping into a new world. For months, I had been on the waitlist, hearing whispers about this revolutionary tool that could transform any text description into a matching image. One of my early prompts was "80s TV commercial showing a hippo fighting a pegasus." The result was both surreal and fascinating.
Fast forward to today, less than two years later, and the same prompt in ChatGPT 4 yields this:
Despite these persistent flaws and occasional hallucinations, the progress is astounding. We can now dream up anything with a text description, and a machine will generate a matching image in seconds. But how does this technology work, and what has driven its rapid evolution?
At their core, AI image generators are deep learning models trained on vast datasets of images and corresponding text descriptions. They use various architectures to map text inputs to visual outputs. Here’s a breakdown of the key components:
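Transformer Models: Many modern image generators build on transformer architectures. The original DALL-E, for example, used a transformer to generate an image as a sequence of discrete tokens following the text prompt. Transformers excel at handling sequential data (like text) and have been adapted to images by splitting them into patches or tokens, the approach taken by vision transformers (ViTs). Midjourney has not published details of its architecture.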
Latent Space Manipulation: These models map input descriptions to points in a high-dimensional latent space, a compact numerical representation in which similar concepts tend to sit close together. A decoder then translates these points into images.
Training Data and Augmentation: High-quality training data is crucial. Models are trained on large datasets of images paired with descriptive text.
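The path from text to latent point to image can be illustrated with a toy sketch in Python. Nothing below comes from DALL-E, Midjourney, or Stable Diffusion: encode_text, make_decoder, and interpolate are hypothetical names, the hash-based "encoder" and the random linear "decoder" are stand-ins for learned networks, and the 8-dimensional latent and 4x4 grayscale output are arbitrary choices made for readability.

```python
import hashlib
import math
import random

LATENT_DIM = 8   # assumption: real models use far more dimensions
IMAGE_SIDE = 4   # assumption: a toy 4x4 grayscale "image"


def encode_text(prompt: str, dim: int = LATENT_DIM) -> list[float]:
    """Toy text encoder: hash each word into a vector, then average.

    A stand-in for a learned text encoder. It only illustrates the idea
    that a prompt becomes a single point in latent space.
    """
    words = prompt.lower().split()
    latent = [0.0] * dim
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(dim):
            # Scale each byte (0..255) into the range [-1, 1].
            latent[i] += digest[i] / 127.5 - 1.0
    count = max(len(words), 1)
    return [value / count for value in latent]


def make_decoder(dim: int = LATENT_DIM, side: int = IMAGE_SIDE, seed: int = 0):
    """Toy decoder: a fixed random linear map followed by a sigmoid.

    A trained decoder would have learned weights; these are random, so the
    output is noise that merely has the right shape and value range.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for _ in range(side * side)]

    def decode(latent: list[float]) -> list[list[float]]:
        pixels = []
        for row in weights:
            activation = sum(w * z for w, z in zip(row, latent))
            pixels.append(1.0 / (1.0 + math.exp(-activation)))  # squash to (0, 1)
        # Reshape the flat pixel list into side x side rows.
        return [pixels[r * side:(r + 1) * side] for r in range(side)]

    return decode


def interpolate(a: list[float], b: list[float], t: float) -> list[float]:
    """Linear blend between two latent points (t=0 gives a, t=1 gives b)."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


decode = make_decoder()
hippo = encode_text("80s TV commercial showing a hippo")
pegasus = encode_text("80s TV commercial showing a pegasus")
halfway = decode(interpolate(hippo, pegasus, 0.5))
for row in halfway:
    print(" ".join(f"{pixel:.2f}" for pixel in row))
```

In this toy the blended image is meaningless noise. The point is the shape of the pipeline: a prompt becomes a vector, vectors can be manipulated arithmetically, and a decoder turns any such vector into pixels.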

DALL-E (2021): OpenAI’s first major image generation model, which brought transformer-based text-to-image synthesis to wide attention.
Midjourney (2022): A competitor to DALL-E, known for its more artistic and stylized outputs.
Stable Diffusion (2022): An open-source model that democratized access to high-quality image generation.
Despite the impressive advancements, there are still significant challenges:
Hallucinations: Models sometimes generate images that do not accurately reflect the input description (e.g., the three-legged hippo).
Bias and Ethics: AI models can perpetuate biases present in their training data, leading to problematic or offensive content.
The future of AI image generation is promising. We can expect:
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
11 June 2024