What Is Latent Diffusion and How Does It Work?

Latent diffusion is a technique for generating images by running the core creation process in a compressed version of the image rather than on the full-size image itself. This compression makes the whole system dramatically faster and less memory-hungry, which is why latent diffusion powers the most widely used AI image generators today, including Stable Diffusion. The key insight is simple: you don’t need every pixel to capture the essential structure of an image, so why waste computing power working at full resolution?

How It Differs From Standard Diffusion

Standard diffusion models work directly on images at their full resolution. They learn to generate pictures by starting with pure static (random noise) and gradually cleaning it up, step by step, until a coherent image emerges. This works remarkably well, but it’s painfully slow. Every denoising step has to process every single pixel, and for a high-resolution image, that means millions of calculations repeated dozens or hundreds of times.

Latent diffusion solves this by splitting the job into two stages. First, a separate network compresses images into a much smaller representation called a “latent.” Then the diffusion process (the noising and denoising) happens entirely in that compressed space. Once the model finishes generating a clean latent, a decoder expands it back into a full-resolution image. The original 2022 paper from Robin Rombach and colleagues reported a gap of at least 38 FID points between a pixel-based diffusion baseline and their latent approach after the same amount of training, meaning the latent version produced noticeably better images for the same computational budget.
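The two-stage data flow can be sketched in a few lines. This is a toy illustration of the shapes involved, not a real model: the “encoder” here is just 8× average pooling and the “decoder” nearest-neighbor upsampling, standing in for a trained VAE.

```python
import numpy as np

def encode(image, f=8):
    """Toy 'encoder': compress an HxWx3 image by factor f per spatial axis."""
    h, w, c = image.shape
    return image.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def decode(latent, f=8):
    """Toy 'decoder': expand a latent back to full resolution."""
    return latent.repeat(f, axis=0).repeat(f, axis=1)

image = np.random.rand(512, 512, 3)
latent = encode(image)      # (64, 64, 3): 64x fewer spatial positions
# ... the diffusion model would operate entirely on tensors this size ...
restored = decode(latent)   # (512, 512, 3)
```

A real VAE learns what to keep and what to discard, so its reconstructions are far better than this pooling trick, but the shapes and the encode/generate/decode flow are the same.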

The Compression Step

The compression relies on a type of neural network called a variational autoencoder, or VAE. A VAE has two halves: an encoder that squishes a high-resolution image down into a compact code, and a decoder that reconstructs the image from that code. The encoder maps the image from a high-dimensional space (all those pixels) into a lower-dimensional “latent space.” Think of it like saving a photo as a very efficient thumbnail that still captures the important structure, colors, and layout.

The VAE is trained separately, before the diffusion model ever touches it. Its goal is to minimize the difference between the original image and the reconstructed version. Once trained, the encoder and decoder act as translators: the encoder converts real images into latents for training, and the decoder converts generated latents back into viewable images at the end. The diffusion model never sees a real pixel. It only ever works with these compressed codes.

The compression factor matters. The original paper tested several ratios and found that compressing by a factor of 4 to 16 struck the best balance between speed and image quality. Compress too little and you don’t save much compute. Compress too much and the decoder can’t reconstruct fine details.
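The savings are easy to quantify: downsampling each spatial axis by a factor f means each denoising step touches f² fewer positions. A quick back-of-the-envelope calculation for a 512×512 image (ignoring the small, fixed latent channel count):

```python
# Spatial positions processed per denoising step for a 512x512 image,
# at several compression factors f (f=1 is pixel-space diffusion).
for f in (1, 4, 8, 16):
    positions = (512 // f) ** 2
    print(f"f={f:2d}: {positions:7d} spatial positions per step")
```

At f=8 (the factor used by Stable Diffusion), each step processes 4,096 positions instead of 262,144, a 64× reduction repeated across every one of the dozens of denoising steps.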

The Denoising Process in Latent Space

Once images are represented as latents, the diffusion model learns through a two-phase process: a forward phase that adds noise, and a reverse phase that removes it.

In the forward phase, the model takes a clean latent and progressively adds random noise to it across many steps, following a fixed schedule. By the final step, the original latent is completely buried in static, indistinguishable from pure randomness. This phase isn’t learned. It’s just math applied mechanically.
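Because the forward phase is fixed math rather than learned behavior, it can be written down directly. The standard closed form gives the noisy latent at any step t in one shot: z_t = √(ᾱ_t)·z₀ + √(1 − ᾱ_t)·ε. A minimal numpy sketch, using a simple linear noise schedule (real models tune the schedule more carefully):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # fixed, hand-chosen noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative fraction of signal kept

def add_noise(z0, t, rng):
    """Jump straight to step t of the forward process (closed form)."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((64, 64, 4))   # a clean latent
z_early, _ = add_noise(z0, 10, rng)     # still mostly signal
z_late, _ = add_noise(z0, T - 1, rng)   # essentially pure noise
```

By the final step, alpha_bar is vanishingly small, so almost none of the original latent survives; this is the “completely buried in static” state the text describes.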

The reverse phase is where the learning happens. A neural network (typically a U-Net architecture) is trained to predict the noise that was added at each step, then subtract it. During generation, the model starts with a latent-sized block of pure random noise and runs the reverse process step by step. At each step, it estimates the noise present, removes a portion of it, and passes the slightly cleaner result to the next step. After all the steps are complete, you’re left with a clean latent that the VAE decoder translates into a full image.
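The reverse loop described above can be sketched as follows. The `predict_noise` function is a stub standing in for the trained U-Net (which would also receive the text conditioning); the update rule is the standard DDPM step, shown here so the control flow runs end to end.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(z_t, t):
    """Stub denoiser: a trained, text-conditioned U-Net goes here."""
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
z = rng.standard_normal((64, 64, 4))    # start from a latent-sized block of noise

for t in reversed(range(T)):
    eps_hat = predict_noise(z, t)
    # DDPM update: subtract the predicted noise contribution at step t.
    z = (z - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        # All but the final step re-inject a small amount of fresh noise.
        z += np.sqrt(betas[t]) * rng.standard_normal(z.shape)

# `z` is now the generated latent; the VAE decoder maps it to a full image.
```

Note that the loop only ever manipulates a 64×64×4 array, never the 512×512×3 pixel grid, which is exactly where the efficiency gain comes from.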

Because these operations happen on the small latent rather than the full image, each step is far cheaper. This is what makes it practical to run on consumer hardware rather than requiring massive data center GPUs.

How Text Prompts Guide Generation

Latent diffusion becomes a text-to-image system through a mechanism called cross-attention. The original paper introduced cross-attention layers into the denoising network, turning it into what the authors described as “a powerful and flexible generator for general conditioning inputs such as text or bounding boxes.”

Here’s how it works in practice. Your text prompt is first processed by a text encoder (CLIP, in Stable Diffusion’s case) that converts the words into a numerical representation capturing their meaning. During each denoising step, the U-Net doesn’t just look at the noisy latent. It also “attends to” the text representation through these cross-attention layers, which let the network focus on different parts of the prompt as it refines different parts of the image. If your prompt says “a red barn in a snowy field,” the cross-attention mechanism helps the model associate “red” with the barn region and “snowy” with the surrounding area.
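The core of a cross-attention layer fits in a few lines: queries come from the image latent, while keys and values come from the text encoding. A minimal numpy sketch with illustrative dimensions (the weight matrices are random here; in a real model they are learned):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, d=64):
    """Queries from the latent, keys/values from the text (random weights)."""
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((latent_tokens.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((text_tokens.shape[-1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((text_tokens.shape[-1], d)) / np.sqrt(d)
    Q = latent_tokens @ Wq                    # (num_positions, d)
    K = text_tokens @ Wk                      # (num_words, d)
    V = text_tokens @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d))   # each latent position weighs each word
    return weights @ V

latent_tokens = np.random.rand(4096, 320)  # flattened 64x64 latent features
text_tokens = np.random.rand(7, 768)       # e.g. "a red barn in a snowy field"
out = cross_attention(latent_tokens, text_tokens)  # (4096, 64)
```

The attention weights form a (positions × words) matrix, which is precisely what lets one region of the latent lean on “red” while another leans on “snowy.”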

This conditioning approach is flexible. The same architecture can accept other types of input beyond text: depth maps, edge outlines, segmentation masks, or even other images. That flexibility is why latent diffusion models have been adapted to so many different creative tasks.

What It’s Used For

Text-to-image generation is the most visible application, but latent diffusion powers a wider range of tasks. Image super-resolution uses diffusion priors to add realistic detail when upscaling low-resolution photos, going well beyond simple sharpening. Inpainting lets you select a region of an existing image and have the model fill it in with new content that matches the surrounding context. Image editing allows targeted changes to specific parts of an image while leaving the rest untouched.
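Inpainting illustrates how naturally these tasks fall out of the architecture. One common approach is to blend at every denoising step: the masked region takes the model’s output, while the known region is pinned to a re-noised copy of the original latent. A sketch of that blending step, with stand-in arrays for the real quantities:

```python
import numpy as np

def inpaint_blend(model_output, noisy_original, mask):
    """mask is 1 where new content should be generated, 0 where the
    original image must be preserved; applied once per denoising step."""
    return mask * model_output + (1 - mask) * noisy_original

mask = np.zeros((64, 64, 1))
mask[16:48, 16:48] = 1.0                  # regenerate only the center patch
model_output = np.random.rand(64, 64, 4)  # stand-in for the denoiser's step output
noisy_original = np.random.rand(64, 64, 4)  # original latent, noised to this step
z = inpaint_blend(model_output, noisy_original, mask)
```

Because cross-attention still sees the full prompt and the denoiser sees the full latent, the generated patch stays consistent with both the text and the untouched surroundings.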

The technique has also expanded beyond still images into text-to-video generation, 3D object creation, and even structural design in engineering. Because the core architecture separates compression from generation, researchers can swap in different VAEs optimized for different data types while keeping the diffusion process largely the same.

Stable Diffusion and Real-World Scale

The most prominent implementation of latent diffusion is the Stable Diffusion family of models released by Stability AI. These models took the architecture from the original 2022 paper and trained it on billions of image-text pairs, making high-quality image generation available to anyone with a decent graphics card.

The series has grown substantially. Stable Diffusion 3.5, released in October 2024, comes in multiple sizes: a Large version with 8.1 billion parameters, a Large Turbo variant optimized for speed, and a Medium version with 2.5 billion parameters. The range reflects different tradeoffs between image quality and the hardware needed to run the model. The Medium version can run on laptops, while the Large version produces higher-fidelity output but needs more powerful GPUs.

Other companies and research groups have built their own latent diffusion systems. The architecture has become the default approach for image generation because it hits a practical sweet spot: image quality competitive with the best alternatives, but achievable with realistic computing resources and training times.