Foundation of Diffusion Models
- What are Diffusion Models?
- Mathematical Foundation
- Variants of Diffusion Models
- Flow Matching
- Differences between Score Matching, Diffusion Models and Flow Matching
- Noise scheduling
- Guided Diffusion
- Latent Diffusion
- Conditional Diffusion
- Diffusion Transformers
- Diffusion Flux
- Image Inpainting with Diffusion Models
- Accelerating Diffusion Models
- Vision-Language Models
(Work in progress: I will gradually add more content when I have more time :D Please stay tuned :D)
What are Diffusion Models?
Diffusion models are a class of generative models that generate data by progressively denoising a sample from pure noise. They are inspired by non-equilibrium thermodynamics and are based on a forward and reverse diffusion process:
- Forward Process (Diffusion Process): A data sample (e.g., an image) is gradually corrupted by adding Gaussian noise over multiple timesteps until it becomes nearly pure noise.
- Reverse Process (Denoising Process): A neural network learns to reverse this corruption by gradually removing noise step by step, reconstructing the original data distribution.
Analogy: Ink Dissolving in Water
Imagine dropping a blob of ink into a glass of water:
- Forward process (Diffusion Process): Initially, the ink is concentrated in one place (structured data). Over time, it spreads out randomly, blending with the water (adding noise). Eventually, the entire glass becomes a uniformly colored mixture, losing its original structure (complete noise).
- Reverse process (Denoising Process): If we had a way to perfectly reverse time, we could watch the ink particles retrace their paths, reassembling into the original drop (generating the original data from noise). Diffusion models learn to perform this “reverse process” step by step using machine learning.
Non-Equilibrium Thermodynamics
Thermodynamics studies how energy moves and changes in a system. In equilibrium thermodynamics, systems are in balance—nothing is changing. Non-equilibrium thermodynamics, on the other hand, deals with systems that are constantly evolving, moving between states of disorder and order.
In diffusion models, the forward process (adding noise to data) and the reverse process (removing noise) resemble a non-equilibrium thermodynamic system because they describe an evolving state that moves from order (structured data) to disorder (pure noise) and back to order (reconstructed data).
Brownian Motion
Brownian motion describes the random movement of tiny particles (like pollen grains in water) due to collisions with molecules. This randomness is similar to how noise is added in diffusion models.
Advantages of Diffusion Models and the Trade-offs
Diffusion models offer several key advantages over traditional generative models like GANs and VAEs:
- High-Fidelity Samples: Unlike VAEs and GANs, which generate samples in one step, diffusion models create samples gradually by denoising. This step-by-step process allows the model to first establish coarse image structure before refining fine details, resulting in higher-quality outputs.
- Training Stability: Diffusion models are easier to train than GANs because they use a single tractable likelihood-based loss. They do not suffer from training instabilities like the mode collapse that often plagues GANs.
- Sample Diversity: Similar to VAEs, diffusion models maximize likelihood, which ensures coverage of all modes of the training dataset. This leads to more diverse outputs than GANs, which can suffer from mode collapse.
- Guidance Ability: The generation process of diffusion models can be guided by external conditions, such as text prompts, images, or other modalities. While conditional generation is also possible with VAEs or GANs, the guidance ability of diffusion models is more flexible, thanks to (1) the multi-step nature of the generation process, which generates samples in a coarse-to-fine manner, and (2) the separability of the control signal (through the cross-attention layers) from the data diversity (through the prior Gaussian distribution), making the concept space more detachable from the data manifold. You can unlearn (machine unlearning) a specific concept or inject a new concept (personalization) without hurting the other concepts too much or changing the model architecture.
However, diffusion models still have some trade-offs, though these have been greatly mitigated by the community's massive research efforts.
The main trade-off is generation speed: diffusion models require multiple neural network passes to generate samples, making them slower than single-pass models like GANs and VAEs. However, various sampling optimization techniques, such as diffusion distillation and Consistency Models, have been developed to significantly reduce this computational overhead and can generate samples even in a single step.
Another trade-off is training complexity: training must cover all diffusion timesteps, which is several times more expensive than training a single-step model.
A Brief History of Diffusion Models and Their Variants
Before diffusion models took over, the field of generative modeling was dominated by two major families, VAEs and GANs, whose glory era spanned from 2014 until 2020.
Variational Autoencoders (VAEs), introduced by Kingma and Welling (2013), were the first deep generative models with a solid probabilistic foundation. They use variational inference and the reparameterization trick to learn a latent space that allows smooth interpolation and sampling:
\[z \sim q_\phi(z|x), \quad x \sim p_\theta(x|z)\]VAEs are elegant and interpretable, but their samples often appeared blurry due to the Gaussian decoder assumption.
Generative Adversarial Networks (GANs), proposed by Goodfellow et al. (2014), took a completely different path — framing generation as a two-player game between a generator \(G\) and discriminator \(D\):
\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log (1 - D(G(z)))]\]GANs quickly dominated image generation, producing strikingly sharp and realistic results. Yann LeCun famously called GANs “the most interesting idea in the last 10 years in machine learning.”
However, GANs were notoriously hard to train (mode collapse, instability), and VAEs struggled with sample quality.
The Roots: Score-Based Generative Modeling
The foundation of diffusion models lies in score-based generative modeling, which began with Hyvärinen's “Estimation of Non-Normalized Statistical Models by Score Matching” (2005), which first proposed the score matching loss to estimate the score function \(\nabla_x \log p(x)\) of the data distribution rather than modeling the data distribution \(p(x)\) directly.
\[\mathcal{L}_{ScoreMatching}(\theta) = \mathbb{E}_{x} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right]\]This idea shifted focus from learning explicit probabilities to learning how to move data toward higher density regions.
Denoising Score Matching (DSM)
Building on this, Vincent (2011) proposed Denoising Score Matching (DSM), observing that learning from noisy data is more stable.
By perturbing the data with a known noise distribution and then learning how to reverse the noise to recover the clean data, the goal becomes estimating the score of the perturbed data:
\[\mathcal{L}_{DSM}(\theta) = \mathbb{E}_{x_0, x_t} \left[ \| s_\theta(x_t) - \nabla_{x_t} \log q(x_t \mid x_0) \|^2 \right]\]where \(q(x_t \mid x_0)\) is the noise distribution and \(x_0\) is the clean data. This denoising process later became the core mechanism of modern diffusion models.
Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015)
Sohl-Dickstein et al., 2015 – “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” introduced Diffusion Probabilistic Models (DPM), inspired by nonequilibrium thermodynamics, with a forward diffusion process that gradually adds Gaussian noise to the data and a reverse diffusion process that learns to denoise step by step. This was the first formalization of iterative denoising as generation. However, at the time, these models did not work well on large-scale datasets and did not get much attention from the community.
The Modern Rebirth: DDPM (Ho, Jain, Abbeel, 2020)
Diffusion models were reborn in 2020 with the introduction of DDPM (Denoising Diffusion Probabilistic Models) (Ho, Jain, and Abbeel) at NeurIPS 2020, reintroducing and refining Sohl-Dickstein's idea with modern deep learning tools. Crucial improvements include a simplified training objective equivalent to denoising score matching, a reparameterization trick to predict the noise with a neural network, and a demonstration on large-scale datasets with image quality comparable to GANs.
One year earlier, at NeurIPS 2019, Song and Ermon had also revived the ideas of Hyvärinen (2005) and Vincent (2011), with a sampling algorithm based on Langevin dynamics that samples from the data distribution using only the score function.
\[x_{t-1} = x_t + \epsilon \nabla_x \log p_{t}(x_t) + \sqrt{2\epsilon} z\]where \(\epsilon\) is a small positive constant and \(z \sim \mathcal{N}(0, I)\) is a standard Gaussian noise.
The Unified View: Score-Based SDEs (Song et al., 2021)
At ICLR 2021, Song et al. introduced “Score-Based Generative Modeling through Stochastic Differential Equations”, which unified all previous methods — score matching, DDPM, and DDIM — under one continuous-time framework.
They showed that diffusion models correspond to solving an SDE:
\[dx_t = f(x_t, t)\,dt + g(t)\,dw_t\]and its reverse-time counterpart uses the learned score \(\nabla_x \log p_t(x)\).
This framework established that:
- DDPM is a discrete stochastic version.
- DDIM is a deterministic ODE version.
- Score-based models are equivalent up to reparameterization.
Diffusion models succeeded where VAEs were blurry and GANs were unstable. They combined the best of both worlds - the training stability of VAEs and the sample quality of GANs. Together, these advances reshaped the landscape of generative modeling — leading to today’s foundation models like Stable Diffusion and DALL·E 3, which trace their roots back to score matching in 2005.
Mathematical Foundation
Diffusion models are built on a deep interplay between differential equations, probability theory, and variational inference. To understand why the model works, we need to trace how these ideas connect: from describing how systems evolve over time, to modeling probability densities, to designing trainable objectives. While it seems complicated at first glance, the more I dive into it, the more I realize the beauty and elegance of Diffusion Models.
Differential Equations: ODEs and SDEs
We start with ordinary differential equations (ODEs), which describe how a system changes deterministically over time based on its current state.
\[\frac{dx}{dt} = f(x, t)\]where \(x(t)\) is the state of the system - the function we want to solve -and \(t\) is time. \(f(x, t)\) defines how \(x\) changes over time.
This is a useful starting point, but in real-world data generation, we must account for randomness. That brings us to stochastic differential equations (SDEs), which incorporate random fluctuations into the system.
\[dx = f(x, t) dt + g(x, t) dW_t\]where the drift term \(f(x, t) dt\) captures the deterministic trends, while the diffusion term \(g(x, t) dW_t\) captures the random fluctuations via a Wiener process \(W_t\).
👉 Motivation: Diffusion models inject noise step by step, so SDEs provide the natural language to describe this stochastic corruption process. More specifically, the drift term \(f(x, t)\) is the shift of the mean of the distribution, and the diffusion term \(g(x, t)\) is the spread of the distribution - injecting Gaussian noise.
Forward and Reverse Diffusion Processes
Forward Process (Adding Noise)
The forward diffusion process transforms a data sample \(x_0\) into pure noise \(x_T\) over time:
\[dx = f(x, t)dt + g(t) dW_t\]Intuitively, the drift term \(f(x, t) dt\) shifts the mean of the distribution by a deterministic amount \(f(x,t)\) (a function of \(x\) and the current time \(t\)) toward a zero-mean distribution. The diffusion term \(g(t) dW_t\) spreads the distribution by injecting Gaussian noise, increasing its variance.
Note that \(g(t)\) is a function of the current time \(t\) only, which guarantees that each noised distribution remains Gaussian with a known mean and variance. This makes the forward process simple and tractable, so that we have an exact sampling formula for each time step \(t\), i.e., \(q (x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t)I)\).
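As a minimal sketch of this tractable sampling formula (the function and argument names here are illustrative, assuming `alpha_bar_t` is \(\bar{\alpha}_t\), either a float or a tensor broadcastable against `x0`):
import torch

def q_sample(x0, alpha_bar_t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one shot."""
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
    x_t = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * eps
    return x_t, eps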
Reverse Process (Removing Noise)
In order to generate data from pure noise \(x_T\), we need to reverse the diffusion process using the reverse-time SDE (Anderson, 1982).
\[dx = \left[ f(x,t) - \frac{1}{2} g^2(t) \nabla_x \log p_t(x) \right] dt + g(t) d\tilde{W}_t\]where \(\nabla_x \log p_t(x)\) is the score function, which estimates the structure of data at time \(t\) - how likely different data points are at each step. \(d\tilde{W}_t\) is another Wiener process but in the reverse direction.
👉 Motivation: Since \(f(x,t)\) and \(g(t)\) are known, to reverse the noise we must know the score function. So we train a neural network to approximate the score function \(\nabla_x \log p_t(x)\). This is the core of the diffusion model.
Euler Method for Numerical Integration
To simulate the reverse process, we need a numerical method to integrate the SDE backward through time. The Euler–Maruyama method (a stochastic generalization of the Euler method) provides a simple, first-order approximation.
For the variance-preserving (VP) SDE used in diffusion models, the deterministic part of the reverse process (ignoring noise sampling) can be expressed as an ODE:
\[\frac{dx}{dt} = f(x,t) - \frac{1}{2} g^2(t) \nabla_x \log p_t(x)\]Now, using the Euler method to integrate this ODE backward in time from \(t_{n+1}\) to \(t_n\):
\[x_{t_n} = x_{t_{n+1}} + (t_{n} - t_{n+1}) \frac{dx}{dt} \bigg|_{x=x_{t_{n+1}}, t=t_{n+1}}\]Substituting the ODE into the Euler method, we get:
\[x_{t_n} = x_{t_{n+1}} + (t_{n} - t_{n+1}) \left[ f(x_{t_{n+1}}, t_{n+1}) - \frac{1}{2} g^2(t_{n+1}) \nabla_x \log p_{t_{n+1}}(x_{t_{n+1}}) \right]\]This stepwise update forms the foundation of diffusion sampling algorithms like DDPM and DDIM, where the score term is replaced by the neural network's prediction.
Implementation of the Euler method
In the context of consistency models, the update rule is derived from the Probability Flow ODE (PF-ODE):
\[\frac{dx}{dt} = -t s_{\phi}(x,t)\]where \(s_{\phi}(x,t)\) is a learned score function related to the model output \(f_{\theta}(x,t)\). From the definition of the consistency function (please refer to the Consistency Models section), we have:
\[s_{\phi}(x,t) = \frac{f_{\theta}(x,t) - x}{t^2}\]Substituting this into the PF-ODE Euler update:
\[\hat{x}_{t_n} = x_{t_{n+1}} - (t_n - t_{n+1}) \, t_{n+1} \, s_{\phi}(x_{t_{n+1}}, t_{n+1})\]We get: \(\hat{x}_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1}) \frac{x_{t_{n+1}} - f_{\theta}(x_{t_{n+1}}, t_{n+1})}{t_{n+1}}\)
This formula defines a single-step explicit Euler integration: it can predict the next step \(x_{t_n}\) from the current step \(x_{t_{n+1}}\).
Python implementation:
import torch as th

def append_dims(x, target_dims):
    """Append singleton dims to `x` so it broadcasts against a `target_dims`-dim tensor."""
    return x[(...,) + (None,) * (target_dims - x.ndim)]

@th.no_grad()
def euler_solver(samples, t, next_t, x0):
    x = samples
    dims = x.ndim
    if teacher_model is None:
        denoiser = x0  # distillation from ground truth: use the clean data directly
    else:
        denoiser = teacher_denoise_fn(x, t)  # f_{\theta}(x, t) - consistency/teacher model output
    d = (x - denoiser) / append_dims(t, dims)        # dx/dt of the PF-ODE
    samples = x + d * append_dims(next_t - t, dims)  # explicit Euler step (next_t < t)
    return samples
👉 Interpretation:
- `d` is the ODE slope \(dx/dt\): the difference between the current sample and the denoiser output, divided by \(t\).
- `next_t - t` is the time step size (negative for reverse-time integration).
- The Euler method uses only the slope at the start of the interval, making it simple but potentially less accurate for large steps.
Heun Method (Improved Euler / Predictor–Corrector)
The Heun method is a second-order numerical scheme that improves on Euler by estimating the derivative twice — at the beginning and at the end of the time interval — and then averaging them. This yields better stability and smaller integration error, especially for stiff or nonlinear dynamics like those in diffusion models.
The update consists of two stages:
- Predictor (Euler step):
\[\tilde{x}_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1}) \, d_{t_{n+1}}\]where \(d_{t_{n+1}} = \frac{x_{t_{n+1}} - f_{\theta}(x_{t_{n+1}}, t_{n+1})}{t_{n+1}}\).
- Corrector (second-order correction): Evaluate a new derivative \(d_{t_n} = \frac{\tilde{x}_{t_n} - f_{\theta}(\tilde{x}_{t_n}, t_n)}{t_n}\) at the predicted point \(\tilde{x}_{t_n}\), then average the two slopes to get the final point:
\[x_{t_n} = x_{t_{n+1}} + \frac{t_n - t_{n+1}}{2} \left( d_{t_{n+1}} + d_{t_n} \right)\]
@th.no_grad()
def heun_solver(samples, t, next_t, x0):
    x = samples
    dims = x.ndim
    if teacher_model is None:
        denoiser = x0
    else:
        denoiser = teacher_denoise_fn(x, t)  # f_{\theta}(x, t)
    # Predictor: plain Euler step using the slope at (x, t)
    d = (x - denoiser) / append_dims(t, dims)
    samples = x + d * append_dims(next_t - t, dims)
    if teacher_model is None:
        denoiser = x0
    else:
        denoiser = teacher_denoise_fn(samples, next_t)  # f_{\theta}(x~, next_t) at the predicted point
    # Corrector: re-evaluate the slope at the predicted point and average the two slopes
    next_d = (samples - denoiser) / append_dims(next_t, dims)
    samples = x + (d + next_d) * append_dims((next_t - t) / 2, dims)
    return samples
Fokker-Planck Equation: From Trajectories to Distributions
SDEs describe how individual trajectories of a system evolve over time, but what about the distribution of data as noise accumulates? The Fokker-Planck equation bridges the gap between trajectories and distributions, explaining how noise pushes the data distribution \(p_t(x)\) toward an isotropic Gaussian.
\[\frac{\partial p_t(x)}{\partial t} = -\nabla_x \cdot (f(x,t) p_t(x)) + \frac{1}{2} \nabla_x \cdot (g(t)^2 \nabla_x p_t(x))\]where \(p_t(x)\) is the distribution of the data at time \(t\).
The first term \(-\nabla_x \cdot (f(x,t) p_t(x))\) describes the change of the probability density \(p_t(x)\) with the drift term \(f(x,t)\) (as the velocity of that mass). The divergence operator \(\nabla_x \cdot\) measures how much the mass is spreading out (positive divergence) or converging/concentrating (negative divergence) at any given point \(x\). The whole term \(- \nabla_x \cdot (f(x,t) p_t(x))\) describes a rate of change of the probability density \(p_t(x)\) at any given point \(x\), where the positive value means the mass is flowing away from \(x\), causing \(p_t(x)\) to decrease (hence the negative sign), and vice versa.
The second term \(\frac{1}{2} \nabla_x \cdot (g(t)^2 \nabla_x p_t(x))\) represents the spreading and smoothing of the probability density \(p_t(x)\) over time due to the random fluctuations. More specifically, \(g(t)\) controls the magnitude of the random noise, and \(\nabla_x p_t(x)\) describes the steepness or slope of \(p_t(x)\) at any given point \(x\): the larger the gradient, the more sharply the density rises around \(x\). As in the first term, \(\nabla_x \cdot\) measures how much the mass is spreading out (positive divergence) or converging/concentrating (negative divergence) at any given point \(x\), with two differences:
- It contains the random fluctuation term \(g(t)^2\) instead of the drift term \(f(x,t)\), introducing randomness into the system.
- It is proportional to the gradient \(\nabla_x p_t(x)\), meaning that steep/sharp regions are spread out more than flat regions (which have a smaller gradient \(\nabla_x p_t(x)\)).
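As a quick worked example, take the VP drift and diffusion introduced later (\(f(x,t) = -\frac{1}{2}\beta_t x\), \(g(t) = \sqrt{\beta_t}\)) and plug a Gaussian ansatz \(p_t = \mathcal{N}(\mu_t, \sigma_t^2 I)\) into the Fokker-Planck equation. Matching the first and second moments gives
\[\frac{d\mu_t}{dt} = -\frac{1}{2}\beta_t \mu_t, \qquad \frac{d\sigma_t^2}{dt} = \beta_t (1 - \sigma_t^2)\]so the mean decays toward \(0\) and the variance is pulled toward \(1\): the forward process drives any initial distribution toward the isotropic Gaussian \(\mathcal{N}(0, I)\), exactly as described above.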
Score Matching and Denoising
Since the reverse SDE depends on the score function \(\nabla_x \log p_t(x)\) (which is intractable), we need a training objective that approximates this function. Vincent (2011) proposed denoising score matching: train a neural network \(s_{\theta}(x_t, t)\) to approximate the conditional score function \(\nabla_{x_t} \log q(x_t \mid x_0)\) (where \(q\) is the tractable forward process, \(dx = f(x, t)dt + g(t)dW_t\)), under the assumption that \(q(x_t \mid x_0) \approx p_t(x_t)\) at time \(t\) (which is a reasonable assumption).
This objective aims to minimize the difference:
\[\mathbb{E}_{p(x_0), \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| s_{\theta}(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \right\|^2 \right]\]
Variational Perspective and KL Minimization
Another way to frame diffusion models is as a variational inference problem. The forward process \(q(x_{0:T})\) is a known noising chain of distributions, and the reverse process \(p_{\theta}(x_{0:T})\) is learned. Therefore, we can use the variational lower bound (ELBO) to train the model.
\[\mathbb{E}_{q(x_{0:T})} \left[ D_{KL} \left( q(x_{t-1} \mid x_t, x_0) \parallel p_{\theta}(x_{t-1} \mid x_t) \right) \right]\]
ELBO
Evidence lower bound (ELBO) is a key concept in variational inference, which is used in VAEs to approximate the log-likelihood of the data.
Let \(X\) and \(Z\) be random variables, jointly distributed with distribution \(p_\theta\). For example, \(p_\theta(X)\) is the marginal distribution of \(X\), and \(p_\theta(Z \mid X)\) is the conditional distribution of \(Z\) given \(X\). Then, for a sample \(x \sim p_{\text{data}}\), and any distribution \(q_\phi\), the ELBO is defined as
\[L(\phi, \theta; x) := \mathbb{E}_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right].\]The ELBO can equivalently be written as
\[\begin{aligned} L(\phi, \theta; x) &= \mathbb{E}_{z\sim q_\phi(\cdot|x)}[\ln p_\theta(x,z)] + H[q_\phi(z \mid x)] \\ &= \ln p_\theta(x) - D_{KL}(q_\phi(z \mid x) || p_\theta(z \mid x)). \end{aligned}\]In the first line, \(H[q_\phi(z \mid x)]\) is the entropy of \(q_\phi\), which relates the ELBO to the Helmholtz free energy. In the second line, \(\ln p_\theta(x)\) is called the evidence for \(x\), and \(D_{KL}(q_\phi(z \mid x) \mid\mid p_\theta(z \mid x))\) is the Kullback-Leibler divergence between \(q_\phi\) and \(p_\theta\). Since the Kullback-Leibler divergence is non-negative, \(L(\phi, \theta; x)\) forms a lower bound on the evidence (ELBO inequality)
\[\ln p_\theta(x) \geq \mathbb{E}_{z\sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right].\]Deep-dive topics about VAEs might include:
- Reparameterization Trick: How to sample from a distribution in a differentiable way - Wiki
- The problem of KL divergence: mode seeking vs mode covering by Andy Jones
- A nice property of VAEs: Disentanglement Representation Learning
Tweedie’s formula
Finally, Tweedie’s formula gives a neat probabilistic justification for the denoising score matching objective:
\[\mathbb{E} [ x_0 \mid x_t] = x_t + \sigma_t^2 s_{\theta}(x_t, t)\]where \(s_{\theta}(x_t, t)\) is the score function to be learned by the neural network. It shows that the posterior mean of clean data given a noisy data is just the noisy sample plus a correction term proportional to the score function.
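A minimal sketch of this formula in code (the function name and arguments are illustrative, assuming a trained `score_model` and noise level \(\sigma_t\)):
import torch

def tweedie_denoise(x_t, sigma_t, score_model, t):
    """Posterior mean of the clean data: E[x_0 | x_t] = x_t + sigma_t^2 * s_theta(x_t, t)."""
    return x_t + sigma_t ** 2 * score_model(x_t, t)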
In some papers such as ESD (Gandikota et al. 2023), where a pretrained model is fine-tuned by matching the score functions of the original and fine-tuned models, Tweedie's formula is used to justify the matching term.
Variants of Diffusion Models
The original formulation of diffusion models can be implemented in several ways. Two of the most influential variants are Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs). Both share the same forward noising process but differ in how they perform the reverse (denoising) process during inference.
DDPM
DDPM (Ho et al. 2020) is the classic diffusion model:
- The reverse process is defined as a Markov chain, where each step \(x_t \to x_{t-1}\) involves sampling from a Gaussian distribution conditioned on \(x_t\).
- Sampling is stochastic: even with the same starting noise \(x_T\), the generated data \(x_0\) differs from run to run.
- While highly effective and stable (compared to GANs), DDPMs require hundreds to thousands of steps to slowly add/remove noise, which makes inference slow.
Read more about DDPM in another blog post here
DDIM
DDIM (Song et al. 2020) builds on DDPM but introduces a non-Markovian reverse process, enabling faster sampling. It also allows us to use the same training process as DDPM, e.g., we can use pretrained DDPM models to generate data.
The sampling process of DDIM is as follows:
\[x_{t-1} = \sqrt{\alpha_{t-1}} \left(\frac{x_t - \sqrt{1-\alpha_t}\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \epsilon_\theta^{(t)}(x_t) + \sigma_t\epsilon_t\]where the first term represents the “predicted \(x_0\)”, the second term is the “direction pointing to \(x_t\)”, and the last term is random noise.
By setting \(\sigma_t = 0\) for all \(t\), DDIM becomes a deterministic process: the intermediate steps \(x_{T-1}, x_{T-2}, \ldots, x_1\) and the output \(x_0\) are fully determined by the starting noise \(x_T\).
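A minimal sketch of one DDIM update implementing the formula above (here `alpha_t` and `alpha_prev` denote the cumulative \(\alpha\) values at the current and previous timesteps, and the names are illustrative):
import torch

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, sigma_t=0.0):
    """One DDIM update x_t -> x_{t-1}; fully deterministic when sigma_t == 0."""
    x0_pred = (x_t - (1 - alpha_t) ** 0.5 * eps_pred) / alpha_t ** 0.5  # "predicted x_0"
    dir_xt = (1 - alpha_prev - sigma_t ** 2) ** 0.5 * eps_pred          # "direction pointing to x_t"
    noise = sigma_t * torch.randn_like(x_t) if sigma_t > 0 else 0.0
    return alpha_prev ** 0.5 * x0_pred + dir_xt + noise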
Read more about DDIM in another blog post here
Score Matching
Score-based generative models (Song and Ermon 2019; Song et al. 2021, published at ICLR 2021) are a family of generative models that learn to estimate the score function — the gradient of the log-density of the data distribution, \(s_\theta(x, t) \approx \nabla_x \log p_t(x)\). Intuitively, the score function tells us which direction in the input space increases the likelihood of the data. By learning this function at different noise levels (at different time steps \(t\)), the model learns how to denoise a sample toward realistic data.
Training Objective
The ideal objective of score matching is:
\[\min_{\theta} \mathbb{E}_{p_t(x)} \left[ \| s_\theta(x, t) - \nabla_x \log p_t(x) \|^2 \right]\]which can be shown to be equivalent to
\[\min_{\theta} \mathbb{E}_{p_t(x)} \left[ \text{trace}(\nabla_x s_{\theta}(x, t)) + \frac{1}{2} \| s_{\theta}(x, t) \|^2 \right]\]With deep networks and high-dimensional data, this expectation over the data distribution is difficult to estimate, especially the \(\text{trace}(\nabla_x s_{\theta}(x, t))\) term.
One popular way to overcome this is Denoising Score Matching (DSM) (Vincent 2011). The idea is to first perturb the data \(x_0\) with a known Gaussian noise process \(q(x_t \mid x_0)\), then train the score network \(s_\theta(x_t, t)\) to denoise the perturbed data \(x_t\) back toward the original data \(x_0\).
\[\min_{\theta} \mathbb{E}_{x_0 \sim p(x_0), \, x_t \sim q(x_t \mid x_0)} \left[ \| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \|^2 \right]\]For Gaussian perturbations:
\[\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \alpha_t x_0}{\sigma_t^2}\]Hence, we can train \(s_\theta(x_t, t)\) to match this target directly.
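A minimal sketch of this objective (assuming the Gaussian perturbation \(x_t = \alpha_t x_0 + \sigma_t \epsilon\); `score_model` and the argument names are illustrative):
import torch

def dsm_loss(score_model, x0, alpha_t, sigma_t, t):
    """Regress s_theta(x_t, t) onto the conditional score -(x_t - alpha_t x_0) / sigma_t^2."""
    eps = torch.randn_like(x0)
    x_t = alpha_t * x0 + sigma_t * eps
    target = -eps / sigma_t  # equals -(x_t - alpha_t * x0) / sigma_t**2
    return ((score_model(x_t, t) - target) ** 2).mean()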
Sampling
In Song and Ermon 2019, Langevin dynamics is used to sample data with the score function \(s_\theta(x, t)\). Given a fixed step size \(\epsilon\) and an initial value \(x_T \sim \mathcal{N}(0, I)\), Langevin dynamics recursively updates the sample as follows:
\[x_{t-1} = x_t + \frac{\epsilon}{2} s_{\theta}(x_t, t) + \sqrt{\epsilon} z_t\]where \(z_t \sim \mathcal{N}(0, I)\).
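A minimal sketch of this sampler (in practice Song and Ermon anneal the noise level from large to small; the single fixed level here is a simplification, and all names are illustrative):
import torch

@torch.no_grad()
def langevin_sample(score_model, shape, t, n_steps=1000, eps=2e-5):
    """Langevin dynamics: x <- x + (eps / 2) * s_theta(x, t) + sqrt(eps) * z."""
    x = torch.randn(shape)  # start from x_T ~ N(0, I)
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_model(x, t) + eps ** 0.5 * z
    return x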
\(\epsilon\)-prediction
Instead of predicting the score directly, DDPM variants predict the noise \(\epsilon\) added to the data, which is equivalent to predicting the score but more numerically stable.
The data corruption process is:
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\]Then the score function relates to the noise predictor as:
\[s_\theta(x_t, t) = -\frac{\epsilon_{\theta}(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}\]The training loss becomes:
\[\mathcal{L}_{DDPM}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon_\theta(x_t, t) - \epsilon \|^2 \right]\]
From DDPM to Score Matching
Both DDPM and SMLD share the same key idea: perturb the data with a known noise distribution, then denoise back to the original data. In follow-up work (Song et al. 2021), the authors unified the two frameworks into a single score-based framework with an infinite number of noise scales.
First, the diffusion process can be defined as an SDE:
\[dx = f(x, t) dt + g(t) dW_t\]where \(f(x, t)\) is the drift term and \(g(t)\) is the diffusion term and \(W_t\) is the Wiener process.
The reverse diffusion process, which denoises the data from \(x_T \sim \mathcal{N}(0, I)\) to \(x_0 \sim p(x)\), can be defined as a reverse-time SDE:
\[dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)] dt + g(t) d\bar{W}_t\]where \(\bar{W}_t\) is the reverse-time Wiener process.
Different diffusion model families have different choices of \(f(x, t)\) and \(g(t)\):
- Variance Exploding (VE) SDE: \(f(x, t) = 0, g(t) = \sqrt{\frac{d\sigma_t^2}{dt}}\) as in SMLD
- Variance Preserving (VP) SDE: \(f(x, t) = -\frac{1}{2}\beta_t x, g(t) = \sqrt{\beta_t}\) as in DDPM
Sampling with the Euler-Maruyama method
Once trained, the score network defines the reverse-time dynamics that transform Gaussian noise \(x_T \sim \mathcal{N}(0, I)\) to the data sample \(x_0 \sim p(x)\). We can simulate the reverse SDE numerically using the Euler-Maruyama method:
\[x_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1}) \frac{dx}{dt} \bigg|_{x=x_{t_{n+1}}, t = t_{n+1}}\] \[x_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1}) \left[ f(x_{t_{n+1}}, t_{n+1}) - g(t_{n+1})^2 \nabla_x \log p_{t_{n+1}}(x_{t_{n+1}}) \right] + g(t_{n+1}) \sqrt{t_{n+1} - t_n} \, z\]where \(z \sim \mathcal{N}(0, I)\).
This adds both a deterministic drift (using the learned score function) and a stochastic noise term to preserve the sample diversity.
If we remove the noise term (set \(z_t = 0\)), we obtain the Probability Flow ODE, a deterministic trajectory equivalent in distribution to the reverse-time SDE, which is the foundation of DDIM deterministic sampling.
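Before looking at the official code, here is a compact sketch of that Euler-Maruyama loop (assuming an `sde` object exposing `sde.sde(x, t)` as in the implementation below, a 4-D image batch, and a trained `score_model`; all names are illustrative):
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, sde, shape, n_steps=1000):
    """Integrate the reverse-time SDE from t=1 down to t~0 with Euler-Maruyama."""
    x = torch.randn(shape)  # x_T ~ N(0, I)
    ts = torch.linspace(1.0, 1e-3, n_steps + 1)
    for i in range(n_steps):
        dt = float(ts[i] - ts[i + 1])  # positive step size, moving backward in time
        t_batch = torch.full((shape[0],), float(ts[i]))
        drift, diffusion = sde.sde(x, t_batch)  # f(x, t), g(t)
        score = score_model(x, t_batch)
        g2 = diffusion[:, None, None, None] ** 2
        x = x - (drift - g2 * score) * dt \
            + diffusion[:, None, None, None] * (dt ** 0.5) * torch.randn_like(x)
    return x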
Implementation of Score Matching
The official implementation of Score Matching is https://github.com/yang-song/score_sde_pytorch.
In the Diffusers lib, the Score SDE VP and VE schedulers can be found here and here.
Notable functions/methods in the official implementation (PyTorch version):
def discretize(self, x, t):
  """Discretize the SDE in the form: x_{i+1} = x_i + f_i(x_i) + G_i z_i.

  Useful for reverse diffusion sampling and probability flow sampling.
  Defaults to Euler-Maruyama discretization.

  Args:
    x: a torch tensor
    t: a torch float representing the time step (from 0 to `self.T`)

  Returns:
    f, G
  """
  dt = 1 / self.N
  drift, diffusion = self.sde(x, t)
  f = drift * dt
  G = diffusion * torch.sqrt(torch.tensor(dt, device=t.device))
  return f, G
def reverse(self, score_fn, probability_flow=False):
  """Create the reverse-time SDE/ODE.

  Args:
    score_fn: A time-dependent score-based model that takes x and t and returns the score.
    probability_flow: If `True`, create the reverse-time ODE used for probability flow sampling.
  """
  N = self.N
  T = self.T
  sde_fn = self.sde
  discretize_fn = self.discretize

  # Build the class for reverse-time SDE.
  class RSDE(self.__class__):
    def __init__(self):
      self.N = N
      self.probability_flow = probability_flow

    @property
    def T(self):
      return T

    def sde(self, x, t):
      """Create the drift and diffusion functions for the reverse SDE/ODE."""
      drift, diffusion = sde_fn(x, t)
      score = score_fn(x, t)
      drift = drift - diffusion[:, None, None, None] ** 2 * score * (0.5 if self.probability_flow else 1.)
      # Set the diffusion function to zero for ODEs.
      diffusion = 0. if self.probability_flow else diffusion
      return drift, diffusion

    def discretize(self, x, t):
      """Create discretized iteration rules for the reverse diffusion sampler."""
      f, G = discretize_fn(x, t)
      rev_f = f - G[:, None, None, None] ** 2 * score_fn(x, t) * (0.5 if self.probability_flow else 1.)
      rev_G = torch.zeros_like(G) if self.probability_flow else G
      return rev_f, rev_G

  return RSDE()
IMPORTANT NOTE: For models trained with the VP SDE, the marginal distribution of \(x_t\) given clean data \(x_0\) is Gaussian: \(x_t \sim \mathcal{N}(\sqrt{\bar{\alpha}_t} x_0, \sigma_t^2 I)\), or in general SDE notation: \(x_t = \text{mean}(x_0, t) + \text{std}(t) \cdot z\) where \(z \sim \mathcal{N}(0, I)\).
Therefore, during training, we sample
z = torch.randn_like(batch)
mean, std = sde.marginal_prob(batch, t)
perturbed_data = mean + std * z
class VPSDE(SDE):
  def __init__(self, beta_min=0.1, beta_max=20, N=1000):
    """Construct a Variance Preserving SDE.

    Args:
      beta_min: value of beta(0)
      beta_max: value of beta(1)
      N: number of discretization steps
    """
    super().__init__(N)
    self.beta_0 = beta_min
    self.beta_1 = beta_max
    self.N = N
    self.discrete_betas = torch.linspace(beta_min / N, beta_max / N, N)
    self.alphas = 1. - self.discrete_betas
    self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
    self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
    self.sqrt_1m_alphas_cumprod = torch.sqrt(1. - self.alphas_cumprod)

  @property
  def T(self):
    return 1

  def sde(self, x, t):
    beta_t = self.beta_0 + t * (self.beta_1 - self.beta_0)
    drift = -0.5 * beta_t[:, None, None, None] * x
    diffusion = torch.sqrt(beta_t)
    return drift, diffusion

  def marginal_prob(self, x, t):
    log_mean_coeff = -0.25 * t ** 2 * (self.beta_1 - self.beta_0) - 0.5 * t * self.beta_0
    mean = torch.exp(log_mean_coeff[:, None, None, None]) * x
    std = torch.sqrt(1. - torch.exp(2. * log_mean_coeff))
    return mean, std

  def prior_sampling(self, shape):
    return torch.randn(*shape)

  def prior_logp(self, z):
    shape = z.shape
    N = np.prod(shape[1:])
    logps = -N / 2. * np.log(2 * np.pi) - torch.sum(z ** 2, dim=(1, 2, 3)) / 2.
    return logps

  def discretize(self, x, t):
    """DDPM discretization."""
    timestep = (t * (self.N - 1) / self.T).long()
    beta = self.discrete_betas.to(x.device)[timestep]
    alpha = self.alphas.to(x.device)[timestep]
    sqrt_beta = torch.sqrt(beta)
    f = torch.sqrt(alpha)[:, None, None, None] * x - x
    G = sqrt_beta
    return f, G
The SDE loss function. IMPORTANT: in the implementation, losses = (score * std + z)**2 is used. Why is that?
It is because of the Gaussian property where \(p(x_t \mid x_0) = \mathcal{N}(\mu_t, \sigma_t^2 I)\).
\[\nabla_x \log p(x_t \mid x_0) = -\frac{1}{\sigma_t^2} (x_t - \mu_t) = - \frac{z}{\sigma_t}\]Therefore, the final loss function is:
\[\mathcal{L}_{SDE}(\theta) = \mathbb{E}_{t, x_0, z} \left[ \| s_\theta(x_t, t) + \frac{z}{\sigma_t} \|^2 \right]\]
def get_sde_loss_fn(sde, train, reduce_mean=True, continuous=True, likelihood_weighting=True, eps=1e-5):
  """Create a loss function for training with arbitrary SDEs.

  Args:
    sde: An `sde_lib.SDE` object that represents the forward SDE.
    train: `True` for training loss and `False` for evaluation loss.
    reduce_mean: If `True`, average the loss across data dimensions. Otherwise sum the loss across data dimensions.
    continuous: `True` indicates that the model is defined to take continuous time steps. Otherwise it requires
      ad-hoc interpolation to take continuous time steps.
    likelihood_weighting: If `True`, weight the mixture of score matching losses
      according to https://arxiv.org/abs/2101.09258; otherwise use the weighting recommended in our paper.
    eps: A `float` number. The smallest time step to sample from.

  Returns:
    A loss function.
  """
  reduce_op = torch.mean if reduce_mean else lambda *args, **kwargs: 0.5 * torch.sum(*args, **kwargs)

  def loss_fn(model, batch):
    """Compute the loss function.

    Args:
      model: A score model.
      batch: A mini-batch of training data.

    Returns:
      loss: A scalar that represents the average loss value across the mini-batch.
    """
    score_fn = mutils.get_score_fn(sde, model, train=train, continuous=continuous)
    t = torch.rand(batch.shape[0], device=batch.device) * (sde.T - eps) + eps
    z = torch.randn_like(batch)
    mean, std = sde.marginal_prob(batch, t)
    perturbed_data = mean + std[:, None, None, None] * z
    score = score_fn(perturbed_data, t)

    if not likelihood_weighting:
      losses = torch.square(score * std[:, None, None, None] + z)
      losses = reduce_op(losses.reshape(losses.shape[0], -1), dim=-1)
    else:
      g2 = sde.sde(torch.zeros_like(batch), t)[1] ** 2
      losses = torch.square(score + z / std[:, None, None, None])
      losses = reduce_op(losses.reshape(losses.shape[0], -1), dim=-1) * g2
    loss = torch.mean(losses)
    return loss

  return loss_fn
The score function
def get_score_fn(sde, model, train=False, continuous=False):
  """Wraps `score_fn` so that the model output corresponds to a real time-dependent score function.

  Args:
    sde: An `sde_lib.SDE` object that represents the forward SDE.
    model: A score model.
    train: `True` for training and `False` for evaluation.
    continuous: If `True`, the score-based model is expected to directly take continuous time steps.

  Returns:
    A score function.
  """
  model_fn = get_model_fn(model, train=train)

  if isinstance(sde, sde_lib.VPSDE) or isinstance(sde, sde_lib.subVPSDE):
    def score_fn(x, t):
      # Scale neural network output by standard deviation and flip sign
      if continuous or isinstance(sde, sde_lib.subVPSDE):
        # For VP-trained models, t=0 corresponds to the lowest noise level.
        # The maximum value of the time embedding is assumed to be 999 for
        # continuously-trained models.
        labels = t * 999
        score = model_fn(x, labels)
        std = sde.marginal_prob(torch.zeros_like(x), t)[1]
      else:
        # For VP-trained models, t=0 corresponds to the lowest noise level
        labels = t * (sde.N - 1)
        score = model_fn(x, labels)
        std = sde.sqrt_1m_alphas_cumprod.to(labels.device)[labels.long()]

      score = -score / std[:, None, None, None]
      return score

  elif isinstance(sde, sde_lib.VESDE):
    def score_fn(x, t):
      if continuous:
        labels = sde.marginal_prob(torch.zeros_like(x), t)[1]
      else:
        # For VE-trained models, t=0 corresponds to the highest noise level
        labels = sde.T - t
        labels *= sde.N - 1
        labels = torch.round(labels).long()
      score = model_fn(x, labels)
      return score

  else:
    raise NotImplementedError(f"SDE class {sde.__class__.__name__} not yet supported.")

  return score_fn
Implementation of SMLD
From the same score_sde_pytorch repository:
def get_smld_loss_fn(vesde, train, reduce_mean=False):
  """Legacy code to reproduce previous results on SMLD (NCSN). Not recommended for new work."""
  assert isinstance(vesde, VESDE), "SMLD training only works for VESDEs."

  # Previous SMLD models assume descending sigmas
  smld_sigma_array = torch.flip(vesde.discrete_sigmas, dims=(0,))
  reduce_op = torch.mean if reduce_mean else lambda *args, **kwargs: 0.5 * torch.sum(*args, **kwargs)

  def loss_fn(model, batch):
    model_fn = mutils.get_model_fn(model, train=train)
    labels = torch.randint(0, vesde.N, (batch.shape[0],), device=batch.device)
    sigmas = smld_sigma_array.to(batch.device)[labels]
    noise = torch.randn_like(batch) * sigmas[:, None, None, None]
    perturbed_data = noise + batch
    score = model_fn(perturbed_data, labels)
    target = -noise / (sigmas ** 2)[:, None, None, None]
    losses = torch.square(score - target)
    losses = reduce_op(losses.reshape(losses.shape[0], -1), dim=-1) * sigmas ** 2
    loss = torch.mean(losses)
    return loss

  return loss_fn
Implementation of DDPM
def get_ddpm_loss_fn(vpsde, train, reduce_mean=True):
  """Legacy code to reproduce previous results on DDPM. Not recommended for new work."""
  assert isinstance(vpsde, VPSDE), "DDPM training only works for VPSDEs."
  reduce_op = torch.mean if reduce_mean else lambda *args, **kwargs: 0.5 * torch.sum(*args, **kwargs)

  def loss_fn(model, batch):
    model_fn = mutils.get_model_fn(model, train=train)
    labels = torch.randint(0, vpsde.N, (batch.shape[0],), device=batch.device)
    sqrt_alphas_cumprod = vpsde.sqrt_alphas_cumprod.to(batch.device)
    sqrt_1m_alphas_cumprod = vpsde.sqrt_1m_alphas_cumprod.to(batch.device)
    noise = torch.randn_like(batch)
    perturbed_data = sqrt_alphas_cumprod[labels, None, None, None] * batch + \
                     sqrt_1m_alphas_cumprod[labels, None, None, None] * noise
    score = model_fn(perturbed_data, labels)
    losses = torch.square(score - noise)
    losses = reduce_op(losses.reshape(losses.shape[0], -1), dim=-1)
    loss = torch.mean(losses)
    return loss

  return loss_fn
Flow Matching
Fundamental Concepts in Flow Matching
Normalizing Flow: A class of generative models that learns a transformation (or “flow”) to map a known prior distribution \(p_0\) to a target distribution \(p_1\) through a family of intermediate marginal distributions \(p_t\), where \(t \in [0, 1]\). A key requirement is that the transformation must be invertible (bijective).
Continuous Normalizing Flow: Uses an ordinary differential equation (ODE) to define continuous-time transformations between distributions.
Flow and Velocity Field:
- The flow \(\psi_t(x)\) describes the trajectory of a point \(x\) over time.
- The velocity field \(u_t(x)\) specifies the instantaneous direction and speed of movement
- These are related by the ODE: \(\frac{d}{dt} \psi_t(x) = u_t (\psi_t (x))\)
- The induced density \(p_t(x)\) evolves according to the continuity equation: \(\frac{\partial p_t(x)}{\partial t} + \nabla_x \cdot \big( u_t(x) \, p_t(x) \big) = 0.\)
The flow ODE shows how a point moves along the flow path: \(x_t \rightarrow x_{t+dt} = x_t + dt * u_t(x_t)\) at time \(t\).
Key insight: The velocity field \(u_t(x)\) is the only component necessary to sample from \(p_t\) by solving the ODE. Therefore, flow matching aims to learn the velocity field \(u_t(x)\).
Derivation of the Flow Matching Objective
Starting Objective: Approximate the velocity field \(u_t(x)\) with the learned velocity field \(v_{\theta}(t, x)\).
\[\mathcal{L}_{FM} (\theta) = \mathbb{E}_{x_t \sim p_t(x)} \left[ \| v_{\theta}(t, x_t) - u_t(x_t) \|^2 \right]\]Step 1: Expand the squared norm:
\[\mathcal{L}_{FM} (\theta) = \mathbb{E}_{x_t \sim p_t(x)} \left[ \| v_{\theta}(t, x_t) - u_t(x_t) \|^2 \right] = \mathbb{E}_{x_t \sim p_t(x)} \left[ \| v_{\theta}(t, x_t) \|^2 - 2 \langle v_{\theta}(t, x_t), u_t(x_t) \rangle + \| u_t(x_t) \|^2 \right]\]Step 2: Express the velocity field as a conditional expectation:
\[u_t(x_t) = \int u_t(x_t \mid x_1) \frac{p_t (x_t \mid x_1) q(x_1)}{p_t(x_t)} dx_1\]Interpretation: The velocity at \(x_t\) is a weighted average of conditional velocities \(u_t(x_t \mid x_1)\) from all possible data points \(x_1\). Point \(x_1\) that are “closer” to \(x_t\) (higher probability \(p_t (x_t \mid x_1)\)) contribute more to the velocity at \(x_t\).
Step 3: Substitute into the cross-term expectation (correlation between \(v_{\theta}(t, x_t)\) and \(u_t(x_t)\))
\[\mathbb{E}_{x_t \sim p_t(x)} \left[ \langle v_{\theta}(t, x_t), u_t(x_t) \rangle \right] = \int p_t(x_t) v_{\theta}(t, x_t) \cdot u_t(x_t) dx_t\]Substitute \(u_t(x_t)\) to the above equation:
\[= \int \int v_{\theta}(t, x_t) \cdot u_t(x_t \mid x_1) \cdot p_t(x_t \mid x_1) \cdot q(x_1) dx_1 dx_t\] \[= \mathbb{E}_{x_t \sim p_t(x_t \mid x_1), x_1 \sim q(x_1)} \left[ v_{\theta}(t, x_t) \cdot u_t(x_t \mid x_1) \right]\]Step 4: Rewrite the full objective using conditional expectation:
\[\mathcal{L}_{FM} (\theta) = \mathbb{E}_{x_1 \sim q(x_1), x_t \sim p_t(x_t \mid x_1)} \left[ \| v_{\theta}(t, x_t) \|^2 - 2 \langle v_{\theta}(t, x_t), u_t(x_t \mid x_1) \rangle + \| u_t(x_t) \|^2 \right]\]Step 5: Add and subtract the term \(\| u_t(x_t \mid x_1) \|^2\)
\[\mathcal{L}_{FM} (\theta) = \mathbb{E}_{x_1 \sim q(x_1), x_t \sim p_t(x_t \mid x_1)} \left[ \| v_{\theta}(t, x_t) - u_t(x_t \mid x_1) \|^2 \right] + \mathbb{E}_{x_1 \sim q(x_1), x_t \sim p_t(x_t \mid x_1)} \left[ \| u_t(x_t) \|^2 - \| u_t(x_t \mid x_1) \|^2\right]\]Step 6: Drop constant terms that are independent of \(\theta\):
\[\mathcal{L}_{FM} (\theta) = \mathbb{E}_{x_1 \sim q(x_1), x_t \sim p_t(x_t \mid x_1)} \left[ \| v_{\theta}(t, x_t) - u_t(x_t \mid x_1) \|^2 \right]\]Practical Implementation:
Simply choose the linear interpolation path \(X_t = (1 - t) X_0 + t X_1\); then the conditional velocity field is:
\[u_t(x_t \mid x_1) = \frac{d}{dt} X_t = X_1 - X_0\]This gives us a tractable training objective where we sample:
- A time step \(t \sim \mathcal{U}(0, 1)\)
- A data point \(x_1 \sim q(x_1)\)
- A noise sample \(x_0 \sim \mathcal{N}(0, I)\)
- Construct the interpolated sample \(x_t = (1 - t) x_0 + t x_1\)
- Train to predict the velocity field \(v_{\theta}(t, x_t) = x_1 - x_0\)
How to Sample from a Flow Matching Model
Sampling from a flow matching model is similar to sampling from a diffusion model: we start from a noise sample \(x_0 \sim \mathcal{N}(0, I)\) and iteratively compute the next step \(x_{t+dt}\), here with a midpoint (second-order Euler) update:
\[x_{t+dt} = x_t + dt * v_{\theta}(t+dt/2, x_t+dt/2 * v_{\theta}(t, x_t))\]where \(v_{\theta}(t, x_t)\) is the velocity field predicted by the neural network.
Flow Matching Code Example
A standalone Flow Matching example, from [4]:
import torch
from torch import nn, Tensor
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Define the Flow
class Flow(nn.Module):
    def __init__(self, dim: int = 2, h: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, h), nn.ELU(),
            nn.Linear(h, h), nn.ELU(),
            nn.Linear(h, h), nn.ELU(),
            nn.Linear(h, dim))

    def forward(self, t: Tensor, x_t: Tensor) -> Tensor:
        # Velocity field v_theta(t, x_t)
        return self.net(torch.cat((t, x_t), -1))

    def step(self, x_t: Tensor, t_start: Tensor, t_end: Tensor) -> Tensor:
        # Midpoint ODE step from t_start to t_end
        t_start = t_start.view(1, 1).expand(x_t.shape[0], 1)
        return x_t + (t_end - t_start) * self(t=t_start + (t_end - t_start) / 2,
                                              x_t=x_t + self(x_t=x_t, t=t_start) * (t_end - t_start) / 2)

# Training
flow = Flow()
optimizer = torch.optim.Adam(flow.parameters(), 1e-2)
loss_fn = nn.MSELoss()

for _ in range(10000):
    x_1 = Tensor(make_moons(256, noise=0.15)[0])
    x_0 = torch.randn_like(x_1)
    t = torch.rand(len(x_1), 1)
    x_t = (1 - t) * x_0 + t * x_1  # linear interpolation path
    dx_t = x_1 - x_0               # conditional velocity target
    optimizer.zero_grad()
    loss_fn(flow(t=t, x_t=x_t), dx_t).backward()
    optimizer.step()

# Sampling
x = torch.randn(300, 2)
n_steps = 8
fig, axes = plt.subplots(1, n_steps + 1, figsize=(30, 4), sharex=True, sharey=True)
time_steps = torch.linspace(0, 1.0, n_steps + 1)

axes[0].scatter(x.detach()[:, 0], x.detach()[:, 1], s=10)
axes[0].set_title(f't = {time_steps[0]:.2f}')
axes[0].set_xlim(-3.0, 3.0)
axes[0].set_ylim(-3.0, 3.0)

for i in range(n_steps):
    x = flow.step(x_t=x, t_start=time_steps[i], t_end=time_steps[i + 1])
    axes[i + 1].scatter(x.detach()[:, 0], x.detach()[:, 1], s=10)
    axes[i + 1].set_title(f't = {time_steps[i + 1]:.2f}')

plt.tight_layout()
plt.show()
In the above code, the forward function computes the velocity field \(v_{\theta}(t, x)\), and the step function advances from the current step \(X_t\) to the next step \(X_{t+dt}\) with the midpoint method.
Conditional Flow Matching
(Note that the “Conditional” in the name Conditional Flow Matching means that a condition \(c\) is given, not the conditional vector field \(u_t(x \mid x_1)\) from the previous section.)
In conditional flow matching, we incorporate a condition \(c\) into the velocity field \(v_{\theta}(t, x, c)\). In practice, the three inputs \(t, x, c\) are concatenated together as the input to the neural network.
Sampling function:
\[X_{t+dt} = X_t + dt * v_{\theta}(t+dt/2, \, X_t + dt/2 * v_{\theta}(t, X_t, c), \, c)\]
def forward(self, t: Tensor, c: Tensor, x_t: Tensor) -> Tensor:
    return self.net(torch.cat((t, c, x_t), -1))
import torch
from torch import nn, Tensor
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Define the Flow
class Flow(nn.Module):
    def __init__(self, dim: int = 2, h: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 2, h), nn.ELU(),
            nn.Linear(h, h), nn.ELU(),
            nn.Linear(h, h), nn.ELU(),
            nn.Linear(h, dim))

    def forward(self, t: Tensor, c: Tensor, x_t: Tensor) -> Tensor:
        return self.net(torch.cat((t, c, x_t), -1))

    def step(self, x_t: Tensor, t_start: Tensor, t_end: Tensor, c: Tensor) -> Tensor:
        t_start = t_start.view(1, 1).expand(x_t.shape[0], 1)
        return x_t + (t_end - t_start) * self(t=t_start + (t_end - t_start) / 2, c=c,
                                              x_t=x_t + self(c=c, x_t=x_t, t=t_start) * (t_end - t_start) / 2)

# Training
flow = Flow()
optimizer = torch.optim.Adam(flow.parameters(), 1e-2)
loss_fn = nn.MSELoss()

for _ in range(10000):
    x_1, c = make_moons(256, noise=0.15)
    x_1 = Tensor(x_1)
    c = Tensor(c).view(-1, 1)  # class label as the condition
    x_0 = torch.randn_like(x_1)
    t = torch.rand(len(x_1), 1)
    x_t = (1 - t) * x_0 + t * x_1
    dx_t = x_1 - x_0
    optimizer.zero_grad()
    loss_fn(flow(t=t, x_t=x_t, c=c), dx_t).backward()
    optimizer.step()

# Sampling
# --- evaluation / visualisation section --------------------------
n_samples = 256
sigma = 1.0
x = torch.randn(n_samples, 2) * sigma  # (n_samples, 2)
# if you just want random labels -- otherwise load real labels here
c_eval = torch.randint(0, 2, (n_samples, 1), dtype=torch.float32)  # (n_samples, 1)
# colours for the scatter (same length as x)
colors = ['blue' if lbl == 0 else 'orange' for lbl in c_eval.squeeze().tolist()]
# -----------------------------------------------------------------
n_steps = 100
plot_every = 20
plot_indices = list(range(0, n_steps + 1, plot_every))
if plot_indices[-1] != n_steps:
    plot_indices.append(n_steps)

fig, axes = plt.subplots(1, len(plot_indices), figsize=(4 * len(plot_indices), 4),
                         sharex=True, sharey=True)
time_steps = torch.linspace(0, 1.0, n_steps + 1)

# initial frame
axes[0].scatter(x[:, 0], x[:, 1], s=10, c=colors)
axes[0].set_title(f't = {time_steps[0]:.2f}')
axes[0].set_xlim(-3.0, 3.0)
axes[0].set_ylim(-3.0, 3.0)

plot_count = 0
with torch.no_grad():  # no gradients while sampling
    for i in range(n_steps):
        x = flow.step(x_t=x,
                      t_start=time_steps[i],
                      t_end=time_steps[i + 1],
                      c=c_eval)  # use the same-sized label tensor
        if (i + 1) in plot_indices:
            plot_count += 1
            axes[plot_count].scatter(x[:, 0], x[:, 1], s=10, c=colors)
            axes[plot_count].set_title(f't = {time_steps[i + 1]:.2f}')
            axes[plot_count].set_xlim(-3.0, 3.0)
            axes[plot_count].set_ylim(-3.0, 3.0)

plt.tight_layout()
plt.show()
References:
- [1] Flow Matching for Generative Modeling paper
- [2] A cool explanation of Flow Matching
- [3] Diffusion Meets Flow Matching: Two Sides of the Same Coin
- [4] A NeurIPS 2024 tutorial on Flow Matching
Differences between Score Matching, Diffusion Models and Flow Matching
Summary of Main Differences
All three methods learn generative models by establishing a connection between a simple noise distribution and a complex data distribution, but they differ fundamentally in their formulation and training approach:
| Aspect | Diffusion Models (DDPM/DDIM) | Score Matching (SDE) | Flow Matching (CFM) |
|---|---|---|---|
| Core Learning Target | Learn to predict the noise \(\epsilon\) or the clean data \(x_0\) from the noisy sample \(x_t\) | Learn score function \(\nabla_x \log p_t(x)\) | Learn velocity field \(u_t(x)\) |
| Process Type | Discrete Markov chain | Continuous SDE | Continuous ODE |
| Forward Process | Add Gaussian noise step-by-step | Stochastic diffusion (SDE with drift + noise) | Deterministic interpolation path |
| Backward Process | Reverse Markov chain | Reverse SDE | ODE integration |
| Tractability | Forward process tractable | Forward SDE tractable | Conditional paths tractable |
| Training Paradigm | Denoising autoencoder | Denoising score matching | Conditional flow matching |
| Sampling | Iterative denoising (stochastic or deterministic) | SDE/ODE integration | ODE integration (typically straight paths) |
| Path Geometry | Curved noising trajectory | Stochastic curved paths | Straight/optimal transport paths |
| Key Advantage | Simple, well-understood | Theoretically grounded, flexible | Fast sampling, simple training |
Relationship:
- DDIM can be viewed as a discretization of a probability flow ODE derived from the Score Matching SDE
- Flow Matching with Gaussian probability paths recovers Score Matching formulations
- Flow Matching with linear interpolation paths gives the deterministic version similar to DDIM
- All three can be unified under a common framework of learning to transform distributions
Detailed Differences
1. Time Convention
Understanding time conventions is crucial for comparing these methods:
Flow Matching:
- Continuous time \(t \in [0, 1]\)
- \(t = 0\): noise distribution \(p_0(x) = \mathcal{N}(0, I)\)
- \(t = 1\): data distribution \(p_1(x) = q(x)\)
- Forward in time moves from noise → data
Diffusion Models (DDPM, DDIM):
- Discrete time with steps \(t \in \{0, 1, 2, ..., T\}\)
- \(t = 0\): data distribution \(q(x_0)\)
- \(t = T\): noise distribution \(\mathcal{N}(0, I)\)
- Forward in time moves from data → noise (opposite of Flow Matching!)
- To align with Flow Matching convention, we use \(r = T - t\), so:
- \(r = 0\) corresponds to noise
- \(r = T\) corresponds to data
Score Matching (SDE):
- Continuous time \(t \in [0, T]\) (often \(T = 1\))
- \(t = 0\): data distribution \(p_0(x) = q(x)\)
- \(t = T\): noise distribution \(p_T(x) \approx \mathcal{N}(0, \sigma^2 I)\)
- Forward in time moves from data → noise
- Using \(r = T - t\) for consistency: \(r = 0\) is noise, \(r = T\) is data
2. Forward Process vs. Probability Paths
Diffusion Models (DDPM, DDIM):
The forward process progressively corrupts data by adding Gaussian noise through a Markov chain:
\[q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)\]With reparameterization:
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]where \(\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)\).
- Discrete steps: Each transition is a single Gaussian convolution
- Tractable: \(q(x_t \mid x_0)\) has closed form
- Markovian: Each step depends only on the previous state
Score Matching (SDE):
The forward process is a continuous stochastic differential equation (SDE):
\[dx = f(x, t)dt + g(t)dW_t\]where:
- \(f(x, t)\) is the drift coefficient (deterministic component)
- \(g(t)\) is the diffusion coefficient (stochastic component)
- \(W_t\) is the Wiener process (Brownian motion)
Common example (Variance Exploding - VE): \(dx = 0 \cdot dt + \sqrt{\frac{d\sigma_t^2}{dt}} dW_t\)
Common example (Variance Preserving - VP): \(dx = -\frac{1}{2}\beta_t x \, dt + \sqrt{\beta_t} dW_t\)
- Continuous time: Infinitesimal noise additions
- Stochastic: Includes random Brownian motion
- Non-Markovian in discrete time but Markovian in continuous time
Flow Matching:
The forward process defines probability paths that interpolate between distributions:
\[p_t(x) = \int p_t(x \mid x_1) q(x_1) dx_1\]Where the conditional probability path is often chosen as:
\[p_t(x_t \mid x_1) = \mathcal{N}(x_t; \mu_t(x_1), \sigma_t^2(x_1) I)\]For linear interpolation (Optimal Transport path): \(x_t = (1-t)x_0 + t x_1, \quad x_0 \sim \mathcal{N}(0, I)\)
This gives: \(\mu_t(x_1) = t x_1, \quad \sigma_t = 1 - t\)
- Deterministic paths (no stochastic component in the ODE)
- Conditional paths are tractable by design
- Straight trajectories (shortest path in many metrics)
Connections:
- The velocity field \(u_t(x)\) in Flow Matching corresponds to the drift term \(f(x, t)\) in Score Matching
- The conditional probability path \(p_t(x_t \mid x_1)\) in Flow Matching corresponds to the SDE solution initialized at \(x_1\)
- The marginal probability path \(p_t(x)\) in Flow Matching corresponds to the SDE marginal when initialized from data \(x_0 \sim q(x)\)
Special Cases:
- Flow Matching with Gaussian probability paths (with appropriate \(\mu_t, \sigma_t\)) recovers the forward SDE from Score Matching
- Flow Matching with linear interpolation gives the deterministic probability flow ODE, similar to DDIM
3. Training Objective
Diffusion Models (DDPM):
Train a neural network to predict the noise added in the forward process:
\[\mathcal{L}_{DDPM}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon_\theta(x_t, t) - \epsilon \|^2 \right]\]where \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\) (\(\epsilon\)-prediction).
Alternative formulation (\(x_0\)-prediction):
\[\mathcal{L}_{DDPM}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \hat{x}_\theta(x_t, t) - x_0 \|^2 \right]\]- Target: Noise \(\epsilon\) or clean data \(x_0\)
- Simple: Direct regression on known quantities
- Weighted MSE: Can add time-dependent weighting
Score Matching (DSM - Denoising Score Matching):
The reverse SDE requires the score function \(\nabla_x \log p_t(x)\), which is intractable. Vincent (2011) proposed training a network \(s_\theta(x_t, t)\) to approximate the conditional score:
\[\mathcal{L}_{DSM}(\theta) = \mathbb{E}_{t, x_0, x_t \mid x_0} \left[ \| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \|^2 \right]\]Under the assumption that \(q(x_t \mid x_0) \approx p_t(x_t)\) (reasonable for small noise), this approximates the true score.
For Gaussian perturbations \(q(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 I)\):
\[\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \alpha_t x_0}{\sigma_t^2} = -\frac{\epsilon}{\sigma_t}\]So the objective becomes:
\[\mathcal{L}_{DSM}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| s_\theta(\alpha_t x_0 + \sigma_t \epsilon, t) + \frac{\epsilon}{\sigma_t} \right\|^2 \right]\]- Target: Score function (gradient of log probability)
- Theoretical: Grounded in score-based generative modeling theory
- Flexible: Works with any forward SDE
Flow Matching (CFM):
Train a network to predict the velocity field:
\[\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, x_1, x_t \mid x_1} \left[ \| v_\theta(t, x_t) - u_t(x_t \mid x_1) \|^2 \right]\]For linear interpolation \(x_t = (1-t)x_0 + t x_1\):
\[u_t(x_t \mid x_1) = \frac{d x_t}{dt} = x_1 - x_0\]So:
\[\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, x_0, x_1} \left[ \| v_\theta(t, x_t) - (x_1 - x_0) \|^2 \right]\]- Target: Velocity (direction and magnitude of flow)
- Direct: Straightforward regression on vector field
- Efficient: Often requires fewer sampling steps
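A minimal sketch of the CFM objective with the linear (OT) path (the `v_model` signature is an assumption):

```python
import torch

def cfm_loss(v_model, x1):
    # x0 ~ N(0, I) is the noise endpoint, x1 is a data sample
    b = x1.shape[0]
    x0 = torch.randn_like(x1)
    t = torch.rand(b, device=x1.device)
    t_ = t.view(b, *([1] * (x1.ndim - 1)))
    # Linear (OT) interpolation between noise and data
    x_t = (1.0 - t_) * x0 + t_ * x1
    # Conditional velocity u_t(x_t | x_1) = x1 - x0
    return ((v_model(t, x_t) - (x1 - x0)) ** 2).mean()
```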
Equivalence:
The training objectives are closely related through reparameterizations:
- Score to Noise: \(s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}\)
- Velocity to Noise: \(v_\theta(t, x_t) = \frac{\alpha_t}{\sigma_t} \epsilon_\theta(x_t, t)\) (approximately, for certain schedulers)
- Score Matching with Gaussian paths is equivalent to Flow Matching with appropriate probability path parameterization
4. Sampling Process
Diffusion Models:
DDPM (Stochastic Sampling):
Iteratively denoise by reversing the forward Markov chain:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z\]where \(z \sim \mathcal{N}(0, I)\) and \(\sigma_t\) is the noise variance.
- Stochastic: Adds noise at each step
- Many steps: Typically 1000 steps (can be reduced with techniques)
DDIM (Deterministic Sampling):
Use a deterministic update rule:
\[x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1 - \bar{\alpha}_{t-1}} \epsilon_\theta(x_t, t)\]- Deterministic: No added noise
- Fewer steps: 10-50 steps often sufficient
- Equivalent to probability flow ODE
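A minimal sketch of one deterministic DDIM update following the formula above (the `eps_model` signature and scalar timestep indices are assumptions):

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alphas_cumprod):
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device)
    eps = eps_model(x_t, t_batch)
    # Predicted x_0 (the underbraced term above)
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
    # Deterministic update toward t_prev: no noise is added
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps
```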
Score Matching:
Reverse-time SDE:
\[dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)] dt + g(t) d\bar{W}_t\]where \(\bar{W}_t\) is a reverse-time Brownian motion.
- Stochastic: Includes diffusion term
- Continuous: Integrated numerically (Euler-Maruyama, etc.)
Probability Flow ODE (Deterministic):
\[\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)\]- Deterministic: No stochastic component
- Same marginals as the SDE
- Flexible solvers: Use any ODE solver (Runge-Kutta, adaptive methods)
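A minimal Euler-Maruyama sketch of the reverse-time SDE (here `f`, `g`, and `score_model` are assumed callables for the drift, diffusion coefficient, and learned score):

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, x, f, g, T=1.0, steps=1000):
    dt = -T / steps  # negative: integrate backward from t = T to t = 0
    for i in range(steps):
        t = T + i * dt
        drift = f(x, t) - g(t) ** 2 * score_model(x, t)
        # Euler-Maruyama step: drift * dt plus a scaled Gaussian increment
        x = x + drift * dt + g(t) * abs(dt) ** 0.5 * torch.randn_like(x)
    return x
```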
Flow Matching:
Integrate the learned velocity field ODE:
\[\frac{dx}{dt} = v_\theta(t, x)\]Starting from \(x_0 \sim \mathcal{N}(0, I)\) at \(t=0\), integrate to \(t=1\):
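A minimal Euler-integration sketch of this sampling loop (the `v_model` signature is an assumption):

```python
import torch

@torch.no_grad()
def sample_flow(v_model, shape, steps=50, device="cpu"):
    x = torch.randn(shape, device=device)  # x_0 ~ N(0, I) at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + v_model(t, x) * dt  # Euler step along dx/dt = v_theta(t, x)
    return x
```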
Noise scheduling
Noise scheduling in diffusion models refers to how noise is gradually added to data in the forward process and how it is removed in the reverse process. The choice of noise schedule significantly impacts the model’s performance, sample quality, and training efficiency.
We follow the DDIM convention, where \(0 < \bar{\alpha}_t < 1\) is the per-step noise level, \(\beta_t = 1 - \bar{\alpha}_t\) is the noise level at time \(t\), and \(\alpha_t = \prod_{i=1}^{t} \bar{\alpha}_i\) is the cumulative noise level at time \(t\). With this convention, \(x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, \epsilon\), so \(\alpha_t \approx 0\) when \(t \rightarrow T\) (pure noise) while \(\alpha_t \approx 1\) when \(t \rightarrow 0\) (clean data).
Common principles of noise scheduling:
- Add large amount of noise at \(t\) large while small amount of noise at \(t\) small. \(t=0\) means clean data, \(t=T\) means pure noise.
- The rate of change \(\frac{d\beta_t}{dt}\) should also be kept at a proper speed (but I am not sure :D)
Common noise schedules (a code sketch follows this list):
- Linear: \(1 - \alpha_t = \frac{t}{T}\) or \(\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min})\frac{t}{T}\). Issue: early timesteps do not add enough noise, and late timesteps can add too much noise.
- Cosine: \(\beta_t = \beta_{\min} + 0.5 (\beta_{\max} - \beta_{\min}) ( 1 + \cos(\frac{t}{T} \pi))\). The intuition is to add noise more gradually at the start and faster at the end.
- Exponential: \(\beta_t = \beta_{\max} (\beta_{\min} / \beta_{\max})^{\frac{t}{T}}\)
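A small sketch computing these three schedules as written above (the function name and default values are illustrative, not from a specific library):

```python
import torch

def make_beta_schedule(kind, T, beta_min=1e-4, beta_max=0.02):
    t = torch.arange(T, dtype=torch.float32) / T
    if kind == "linear":
        return beta_min + (beta_max - beta_min) * t
    if kind == "cosine":
        return beta_min + 0.5 * (beta_max - beta_min) * (1 + torch.cos(t * torch.pi))
    if kind == "exponential":
        return beta_max * (beta_min / beta_max) ** t
    raise ValueError(f"unknown schedule: {kind}")
```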
Guidanced Diffusion
Resources:
- A great blog from Sander Dieleman: Guidance: a cheat code for diffusion models and the geometry of diffusion guidance.
Why Guidance?
Guidance is a method to control the generation process so that the output is sampled from a conditional distribution \(p(x \mid y)\), where \(y\) is a condition, such as a text prompt, rather than from the generic \(p(x)\).
Classifier Guidance
In order to get the conditional score function \(\nabla_x \ln p(x \mid y)\), we can use Bayes rule to decompose the score function into an unconditional component and a conditional one:
\[p(x \mid y) = \frac{p(y \mid x) p(x)}{p(y)}\] \[\log p(x \mid y) = \log p(y \mid x) + \log p(x) - \log p(y)\] \[\nabla_x \log p(x \mid y) = \nabla_x \log p(y \mid x) + \nabla_x \log p(x) - \nabla_x \log p(y)\]where \(\nabla_x \log p(x)\) is the score function of the unconditional model. \(\nabla_x \log p(y) = 0\) since \(p(y)\) is independent of \(x\).
The term \(\nabla_x \log p(y \mid x)\) means the direction pointing to \(y\) given \(x\).
- At the beginning of the inference process, i.e., large \(t\), when \(x_t\) still contains a lot of noise, \(\nabla_x \log p(y \mid x)\) is close to \(0\), meaning there is no clear information about \(y\).
- In the later stages, i.e., small \(t\), when \(x_t\) is less noisy and closer to \(x_0\), \(\nabla_x \log p(y \mid x)\) is larger, meaning \(x_t\) carries more information about \(y\), i.e., larger \(p(y \mid x)\).
How to obtain \(\nabla_x \log p(y \mid x)\)?
\(p(y \mid x)\) means the probability of a condition \(y\) given \(x\).
In a simple case, where \(y\) is just an image class, like a cat, the probability \(p(y=\text{cat} \mid x)\) can simply be obtained from a pre-trained classifier.
However, in a more complex case, where \(y\) is a text prompt like a black cat with red eyes and blue fur, a pre-trained classifier is not expressive enough: it cannot distinguish between \(y_1\) a black cat with red eyes and blue fur and \(y_2\) a white cat with blue eyes and red fur, i.e., it fails to produce \(p(y_1 \mid x) \neq p(y_2 \mid x)\).
In other words, the quality - diversity of the generated image \(x\) strongly depends on the capability of the conditional model \(p(y \mid x)\). For example:
- If \(p_\phi(y \mid x)\) is a binary classifier `hot dog` or `not hot dog`, then the output image \(x \sim p_\theta(x \mid y)\) can be either `hot dog` or `not hot dog` only, even though \(p_\theta(x)\) was trained on a massive dataset with many more classes than just these two.
- If you want to generate an image \(x\) from a complex prompt \(y\), you need a powerful model like CLIP as the conditional model \(p_\phi(y \mid x)\).
To balance between specificity (i.e., high \(p(y \mid x)\)) and diversity/quality (i.e., \(p(x \mid y) \approx p(x)\)), we use a guidance scale \(\gamma\) to control the trade-off between the two.
\[\nabla_x \log p_{\textcolor{red}{\gamma}}(x \mid y) = \nabla_x \log p(x) + \gamma \nabla_x \log p(y \mid x)\]where \(\gamma\) is the guidance scale. A big \(\gamma\) means the model is less creative but more following the condition \(y\).
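A minimal sketch of this guided score (the `score_model` and `classifier_log_prob` callables are assumptions; the classifier gradient is obtained via autograd):

```python
import torch

def guided_score(score_model, classifier_log_prob, x, t, y, gamma=1.0):
    # Gradient of the external classifier's log p_phi(y | x_t) w.r.t. x_t
    x = x.detach().requires_grad_(True)
    log_p = classifier_log_prob(x, t, y).sum()
    grad = torch.autograd.grad(log_p, x)[0]
    # Guided score: unconditional score + gamma * classifier gradient
    return score_model(x, t) + gamma * grad
```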
Classifier-Free Guidance (CFG)
Classifier guidance has two key limitations:
- It requires a powerful external classifier \(p_\phi(y \mid x)\).
- The generative model \(p_\theta(x)\) may not match the domain of interest (e.g., trained on ImageNet but asked to generate medical images).
Classifier-free guidance solves both by eliminating the explicit classifier. Instead, we train the diffusion model itself in two modes:
- Unconditional: modeling \(p_\theta(x)\)
- Conditional: modeling \(p_\theta(x \mid y)\)
This is achieved simply by randomly dropping out the condition \(y\) during training (with some probability, e.g., 10–20%). Rather than implementing truly unconditional generation \(p_\theta(x)\), which can be complicated in practice, we can use a null condition \(\emptyset\) such as an empty string, i.e., \(p_\theta(x) = p_\theta(x \mid \emptyset)\). However, my intuition is that this might implicitly imply the null concept lies in a low-density region of the data manifold. More specifically, without the null condition, the interpolation between \(p_\theta(x \mid y_1)\) and \(p_\theta(x \mid y_2)\), such as \(p_\theta(x \mid (1-t)y_1 + ty_2)\), might be a smooth transition between the two concepts \(y_1\) and \(y_2\); with the null condition, the interpolation might instead jump from one concept to the other.
At inference time, we can combine these two models into a guided score. The derivation is as follows:
We start with Bayes rule for the conditional score that we want to approximate:
\[p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}\]Taking the logarithm and differentiating w.r.t. \(x\):
\[\nabla_x \log p(y \mid x) = \nabla_x \log p(x \mid y) + \nabla_x \log p(y) - \nabla_x \log p(x)\]Dropping the term \(\nabla_x \log p(y)\) since \(p(y)\) is independent of \(x\):
\[\nabla_x \log p(y \mid x) = \nabla_x \log p(x \mid y) - \nabla_x \log p(x)\]Replacing this into the Classifier-guidance formula:
\[\nabla_x \log p_{\textcolor{red}{\gamma}}(x \mid y) = \nabla_x \log p(x) + \gamma (\nabla_x \log p(x \mid y) - \nabla_x \log p(x))\]that is:
\[\nabla_x \log p_{\textcolor{red}{\gamma}}(x \mid y) = (1 - \gamma) \nabla_x \log p(x) + \gamma \nabla_x \log p(x \mid y)\]where \(\gamma \geq 0\) is the guidance scale.
- \(\gamma = 0\) → purely unconditional generation \(p(x \mid \emptyset)\)
- \(\gamma = 1\) → purely conditional generation \(p(x \mid y)\).
- \(\gamma > 1\) → amplifies the effect of the condition but less creative, trading off diversity for fidelity.
- Geometrically, CFG interpolates between the unconditional and conditional score vectors, pushing the sample further in the direction that aligns with \(y\).
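A minimal sketch of CFG applied to the noise prediction, which carries the same interpolation as the score formula above (the `eps_model` signature and the `null_cond` embedding are assumptions):

```python
import torch

@torch.no_grad()
def cfg_eps(eps_model, x_t, t, cond, null_cond, gamma=7.5):
    eps_uncond = eps_model(x_t, t, null_cond)  # p(x | null) branch
    eps_cond = eps_model(x_t, t, cond)         # p(x | y) branch
    # (1 - gamma) * uncond + gamma * cond, written as an extrapolation
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```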
Why does Classifier-Free Guidance work better than Classifier Guidance?
(Extending the intuition from Sander Dieleman) — the key lies in the difference between the gradient from a standard/external classifier \(\phi\) and the gradient from the generative model itself \(\theta\).
- Classifier Guidance relies on \(\nabla_x \log p_{\phi}(y \mid x)\), where the classifier is trained independently from the generative process.
- Classifier-Free Guidance (CFG) instead leverages the implicit classifier inside the generative model: \(\nabla_x \log p_{\theta}(y \mid x) = \nabla_x \log p_{\theta}(x \mid y) - \nabla_x \log p_{\theta}(x)\).
A well-known phenomenon of discriminative classifiers is shortcut learning (Geirhos et al., 2020): gradient-descent-trained classifiers often find “shortcuts” that optimize the loss but fail to align with human perception. For example, they may overfit to local texture cues rather than global shape, producing gradients that push generation toward features humans do not find semantically meaningful.
By contrast, the generative classifier (implicit in CFG) is trained to model and reconstruct the data distribution itself conditioned on human-meaningful labels/prompts. Its gradients are therefore better aligned with human perceptual semantics.
👉 In short: CFG works better under the human perspective (more semantically aligned generations), whereas Classifier Guidance works better under the classifier perspective (gradients aligned with a discriminative model that may exploit spurious correlations).
When is Classifier Guidance better?
I’ve explored a proposal linking this to Machine Unlearning (although not yet published). The idea is to guide unlearning with the gradient of an external classifier, rather than relying solely on the generative classifier as in the [ESD paper]. This approach can be particularly beneficial in two cases:
- Unlearning rare concepts: where the generative model assigns very low probability \(p_{\theta}(x \mid y)\), making CFG ineffective.
- Unlearning ambiguous or multi-expressed concepts: e.g., “nudity” vs “naked”. A discriminative classifier can unify these expressions under a shared semantic decision boundary, while the generative model may treat them as distinct.
Thus, while CFG dominates in general generation tasks, Classifier Guidance can provide unique advantages for targeted unlearning.
Why \(\gamma > 1\) but not between \(0\) and \(1\)?
In typical interpolation scenarios, we expect \(\gamma \in [0,1]\) to balance unconditional and conditional influences. Surprisingly, in CFG, values of \(\gamma > 1\) (e.g., 7 or 7.5) are commonly used — and empirically yield sharper, more faithful generations.
As demonstrated in Dhariwal & Nichol, 2021, larger \(\gamma\) produces outputs with higher fidelity and stronger alignment to prompts, albeit at the cost of reduced diversity. The intuition is that \(\log p_{\gamma}(y \mid x)\) becomes sharper than \(\log p_{1}(y \mid x)\) when \(\gamma > 1\), effectively amplifying conditional gradients and biasing the generation toward features that match the conditioning signal more strongly.
This explains why CFG practitioners often “turn up the guidance dial” above 1 — it helps the model stay on track with the prompt, even if some creativity/diversity is sacrificed.
Intuition Recap
- Classifier guidance: adds an external force from a classifier \(p(y \mid x)\).
- Classifier-free guidance: reuses the generative model itself, trained with and without condition, to simulate that force.
- Both balance a trade-off between diversity and conditional alignment, controlled by the guidance scale \(\gamma\).
Latent Diffusion
Conditional Diffusion
ControlNet
ControlNet Block
Suppose \(\mathcal{F}_{\theta}(x)\) is the original U-Net block that transforms an input feature map \(x\) to an output feature map \(y = \mathcal{F}_{\theta}(x)\):
The ControlNet block is then defined as:
\[y_c = \mathcal{F}_{\theta}(x) + \mathcal{Z}_{\phi_{z_2}}\left( \mathcal{F}_{\theta_c} \left( x + \mathcal{Z}_{\phi_{z_1}}(c) \right) \right)\]where \(c\) is the conditioning input (e.g., an edge map), \(\mathcal{F}_{\theta_c}\) is a trainable copy of the block, and \(\mathcal{Z}_{\phi_{z_1}}\), \(\mathcal{Z}_{\phi_{z_2}}\) are the zero convolution blocks parameterized by \(\phi_{z_1}\) and \(\phi_{z_2}\), respectively. At the first training step, \(\phi_{z_1}\) and \(\phi_{z_2}\) are initialized to the zero matrix, resulting in \(y_c = y = \mathcal{F}_{\theta}(x)\), so the ControlNet block has no influence on the generation process.
Why Zero Convolution?
First, by initializing the zero convolution blocks to the zero matrix, the ControlNet block has no influence on the generation process at the first training step. However, during backpropagation, the gradient with respect to the zero-convolution parameters is not zero, so the zero convolution blocks do get updated. More specifically,
\[\frac{\partial L}{\partial \phi_{z_2}} = \frac{\partial L}{\partial y_c} \cdot \frac{\partial y_c}{\partial \phi_{z_2}} = \frac{\partial L}{\partial y_c} \cdot \frac{\partial \left( \mathcal{F}_{\theta}(x) + \mathcal{Z}_{\phi_{z_2}}(z_1) \right)}{\partial \phi_{z_2}}\]Assuming the convolution layer is just a simple linear layer, i.e., \(\mathcal{Z}_{\phi_{z_2}}(z_1) = \phi_{z_2} \cdot z_1\), then:
\[\frac{\partial L}{\partial \phi_{z_2}} = \frac{\partial L}{\partial y_c} \cdot z_1\]where \(z_1 = \mathcal{F}_{\theta_c}(x + \mathcal{Z}_{\phi_{z_1}}(c))\) is the input to the second zero convolution, which is generally not zero at the first training step.
By using zero convolution rather than a standard convolution layer (initialized with non-zero weights), we avoid injecting noise (even if small) from random initial values into the pre-trained backbone at the start of training. Moreover, by freezing the backbone U-Net branch, the model forces the additional control information to be encoded into the new ControlNet branch.
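A minimal sketch of the zero-initialization idea (a `zero_module`-style helper; the name follows common implementations and is illustrative):

```python
import torch.nn as nn

def zero_module(module):
    # Zero-initialize all parameters so the branch is a no-op at step 0,
    # while its gradients remain non-zero and trainable
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

zero_conv = zero_module(nn.Conv2d(320, 320, kernel_size=1))
```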
What if the backbone is not U-Net?
Recent diffusion models such as DiT or Flux use a transformer architecture as the backbone instead of a U-Net. However, the ControlNet idea is still applicable to these models.
For example, in ControlNet Flux, source code is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/controlnets/controlnet_flux.py
# ControlNet Block for Flux Transformer Architecture
# Unlike U-Net which has encoder/decoder structure, Flux uses transformer blocks
def flux_controlnet_forward(
    self, # the FluxControlNetModel instance (simplified sketch)
hidden_states, # x: input latent features
controlnet_cond, # conditioning image (e.g., canny edge, depth map)
conditioning_scale, # scale factor for controlnet influence
encoder_hidden_states, # text embeddings
timestep, # diffusion timestep
# ... other parameters
):
"""
Main difference from U-Net ControlNet:
- U-Net: Applies ControlNet on encoder blocks + middle block
- Flux: Applies ControlNet on transformer blocks + single transformer blocks
"""
# ===== STEP 1: Embed the input latents =====
# Original backbone: x_embedded = X_embedder(x)
hidden_states = self.x_embedder(hidden_states) # F_θ pathway starts
# ===== STEP 2: Process conditioning input =====
# Z_φ_z1(c): First zero-conv on conditioning
if self.input_hint_block is not None:
# Process conditioning image through conv blocks
controlnet_cond = self.input_hint_block(controlnet_cond)
# Reshape to match latent spatial dimensions
controlnet_cond = reshape_to_patches(controlnet_cond)
# ===== STEP 3: Apply ControlNet injection =====
# This is where: y_c = F_θ(x) + Z_φ_z2(x + F_θ_c(Z_φ_z1(c)))
# Simplified in Flux as: hidden_states = hidden_states + controlnet_x_embedder(cond)
# controlnet_x_embedder is initialized as zero_module (zero convolution)
# At step 0: φ_z2 = 0, so this adds nothing
# During training: φ_z2 learns to inject control information
hidden_states = hidden_states + self.controlnet_x_embedder(controlnet_cond)
# ===== STEP 4: Prepare time and text embeddings =====
timestep = timestep.to(hidden_states.dtype) * 1000
temb = self.time_text_embed(timestep, pooled_projections)
encoder_hidden_states = self.context_embedder(encoder_hidden_states)
# ===== STEP 5: Forward through transformer blocks (F_θ_c pathway) =====
# Store intermediate outputs for ControlNet residuals
block_samples = ()
for index_block, block in enumerate(self.transformer_blocks):
# F_θ_c: ControlNet's copy of transformer blocks
encoder_hidden_states, hidden_states = block(
hidden_states=hidden_states,
encoder_hidden_states=encoder_hidden_states,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
# Collect intermediate features
block_samples = block_samples + (hidden_states,)
# ===== STEP 6: Forward through single transformer blocks =====
single_block_samples = ()
for index_block, block in enumerate(self.single_transformer_blocks):
encoder_hidden_states, hidden_states = block(
hidden_states=hidden_states,
encoder_hidden_states=encoder_hidden_states,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
single_block_samples = single_block_samples + (hidden_states,)
# ===== STEP 7: Apply zero convolutions to create residuals =====
# Z_φ_z2: Second zero-conv on ControlNet features
# These are initialized to zero, so at step 0: output = 0
controlnet_block_samples = ()
for block_sample, controlnet_block in zip(block_samples, self.controlnet_blocks):
# controlnet_block is zero_module(Linear): implements Z_φ_z2
block_sample = controlnet_block(block_sample)
controlnet_block_samples = controlnet_block_samples + (block_sample,)
controlnet_single_block_samples = ()
for single_block_sample, controlnet_block in zip(single_block_samples, self.controlnet_single_blocks):
single_block_sample = controlnet_block(single_block_sample)
controlnet_single_block_samples = controlnet_single_block_samples + (single_block_sample,)
# ===== STEP 8: Scale the residuals =====
# Apply conditioning_scale to control the strength of ControlNet
controlnet_block_samples = [sample * conditioning_scale for sample in controlnet_block_samples]
controlnet_single_block_samples = [sample * conditioning_scale for sample in controlnet_single_block_samples]
# ===== STEP 9: Return residuals to be added in main pipeline =====
# These residuals will be added to the original Flux transformer outputs
# In the main pipeline: y_c = F_θ(x) + controlnet_residuals
return FluxControlNetOutput(
controlnet_block_samples=controlnet_block_samples,
controlnet_single_block_samples=controlnet_single_block_samples,
)
U-Net ControlNet:
- Applies to: Encoder blocks + Middle block
- Connection: Skip connections in U-Net architecture
- Formula: \(y_c = \mathcal{F}_{\theta}(x) + \mathcal{Z}_{\phi_{z_2}}\left( \mathcal{F}_{\theta_c} \left( x + \mathcal{Z}_{\phi_{z_1}}(c) \right) \right)\)
Flux ControlNet:
- Applies to: Transformer blocks + Single transformer blocks
- Connection: Residual addition to transformer outputs
- Formula: Same concept, but adapted for transformer architecture
Image Prompt
Beyond controlling the generation process with a text prompt, a hot topic in the community is controlling generation with image information/layout/prompts, which has huge potential in applications such as image inpainting and image-to-image generation. In standard Stable Diffusion, the condition embedding is just a text embedding \(c_t = E_t(y)\), where \(y\) is the text prompt and \(E_t\) is a pre-trained text encoder such as CLIP. IP-Adapter [1] proposes to use an additional image encoder to extract an image embedding from a reference image, \(c_i = E_i(x)\), and then project it into the original condition space. The objective function for IP-Adapter is:
\[\mathcal{L}_{IP} = \mathbb{E}_{z, c, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(z_t, c_i, c_t, t) \|_2^2 \right]\]The cross-attention layer is also modified from the one in Stable Diffusion to include the image embedding \(c_i\) as a condition.
\[\text{Attention}(Q, K_i, K_t, V_i, V_t) = \text{softmax}\left(\frac{QK_t^T}{\sqrt{d}}\right)V_t + \lambda\, \text{softmax}\left(\frac{QK_i^T}{\sqrt{d}}\right)V_i\]where \(Q=z W_Q\), \(K_i = c_i W_K^i\), \(K_t = c_t W_K^t\), \(V_i = c_i W_V^i\), \(V_t = c_t W_V^t\), and \(W_Q\), \(W_K^i\), \(W_K^t\), \(W_V^i\), \(W_V^t\) are the weights of the linear layers. The model reduces to the original Stable Diffusion when \(\lambda = 0\).
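A minimal sketch of this decoupled cross-attention (function and argument names are illustrative):

```python
import torch.nn.functional as F

def decoupled_cross_attention(q, k_t, v_t, k_i, v_i, lam=1.0):
    d = q.shape[-1]
    # Text cross-attention (original Stable Diffusion path)
    attn_t = F.softmax(q @ k_t.transpose(-2, -1) / d**0.5, dim=-1) @ v_t
    # Image cross-attention (new IP-Adapter path), scaled by lambda
    attn_i = F.softmax(q @ k_i.transpose(-2, -1) / d**0.5, dim=-1) @ v_i
    return attn_t + lam * attn_i
```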
References:
- [1] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
- [2] MS-DIFFUSION: MULTI-SUBJECT ZERO-SHOT IMAGE PERSONALIZATION WITH LAYOUT GUIDANCE
Diffusion Transformers
Diffusion Transformers (DiTs) are a class of diffusion models that replace the traditional U-Net convolutional architecture with a Vision Transformer (ViT)-style backbone.
Data Processing in DiT
Similar to the Latent Diffusion model, the diffusion process in DiT operates in the latent space. Therefore, the first step is to use a pre-trained convolutional Variational Autoencoder (VAE), as in LDM, to convert the spatial input into the latent space (e.g., \(256 \times 256 \times 3\) to \(32 \times 32 \times 4\)).
Patchifying converts the (latent) spatial input into a sequence of \(T\) tokens/patches, each of dimension \(d\), by linearly embedding each patch of the input.
Positional Encoding the standard sinusoidal positional embeddings are added to the token embeddings to provide the model with the positional information.
Besides the visual tokens, DiT also uses conditional information such as the timestep \(t\) and the textual prompt \(c\) associated with the input image. This information is added to the DiT block through an embedding layer.
The DiT Architecture
Three conditioning designs have been studied in the DiT paper:
In-Context Conditioning (The far right in the above figure) Append the vector embedding of \(t\) and \(c\) in the input sequence, treating them as additional visual tokens. This is similar to the cls tokens in ViT.
Cross-Attention Concatenate the conditional embeddings of \(t\) and \(c\) into a length-two sequence, separate from the image token sequence. Then a cross-attention layer injects this conditioning information into the visual path.
Adaptive layer norm (adaLN) block Following the widespread success of adaptive normalization layers in diffusion models with U-Net backbones, DiT also replaces the standard layer norm in transformer blocks with an adaptive layer norm. Rather than directly learning dimension-wise scale and shift parameters \(\gamma\) and \(\beta\), the adaLN block regresses them from the sum of the embeddings of the conditioning information \(t\) and \(c\).
adaLN-Zero block Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. This version uses the same adaptive layer norm as the adaLN block but with zero initialization.
Transformer Decoder After the final DiT block, we need to decode the sequence of image tokens into an output latent noise prediction and an output diagonal covariance prediction (two outputs). This is done with a standard linear layer with output dimension \(p \times p \times 2C\), where \(p\) is the patch size and \(C\) is the number of channels of the latent input.
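A minimal sketch of an adaLN-Zero-style block (a simplified single-modulation variant; the actual DiT block regresses six parameters, a shift/scale/gate pair each for the attention and MLP sub-layers):

```python
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim, inner):
        super().__init__()
        self.inner = inner  # e.g., an attention or MLP sub-layer
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = nn.Linear(dim, 3 * dim)
        # Zero init: the gate starts at 0, so the block is the identity at init
        nn.init.zeros_(self.mod.weight)
        nn.init.zeros_(self.mod.bias)

    def forward(self, x, c):
        # Regress shift/scale/gate from the conditioning embedding c = emb(t) + emb(y)
        shift, scale, gate = self.mod(c).unsqueeze(1).chunk(3, dim=-1)
        return x + gate * self.inner(self.norm(x) * (1 + scale) + shift)
```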
References:
- DIT paper: Scalable Diffusion Models with Transformers
- Official implementation: https://github.com/facebookresearch/DiT
Diffusion Flux
References:
- Demystifying Flux Architecture
- Flux official implementation: https://github.com/black-forest-labs/flux
Image Inpainting with Diffusion Models
Training Pipeline
Training data for inpainting is a combination of three components: the original image as ground truth, the masked image as input, and a prompt as condition to provide the context of the missing region. The ground truth image can come from large-scale real datasets such as LAION-5B or MSCOCO 2017. More specifically, in LAION-5B, the text near each image is treated as a caption describing that image. Images and captions are filtered and embedded using OpenAI's CLIP model, which checks how well the text matches the image (using cosine similarity). Only pairs with high image-text similarity are kept.
To make the model robust, a variety of mask shapes and sizes can be used, including rectangular masks, free-form masks, and arbitrary shapes. More advanced methods use masks based on actual objects in the image (e.g., using segmentation masks); it has been noted that this technique can help improve mask-prompt alignment.
Loss function for inpainting is a combination of pixel-wise reconstruction loss and perceptual loss (or Style loss). If using GANs, the adversarial loss is used to ensure the inpainted regions are perceptually realistic under the discriminator perspective.
Tricks in Training Image Inpainting
Initialization of the model
Single small mask at a time
Consistency between Mask-Prompt-Context How do we ensure consistency between the masked region, the prompt, and the context? The ground truth image can be generated from the original model rather than taken from a real image, which can correlate better with the prompt. Data augmentation, such as adding the target object to the background, can also help to create controllable training data.
Challenges in Image Inpainting
Semantic and Structural Consistency: A primary challenge for generative models is to fill in missing regions in a way that is not only visually plausible but also semantically and structurally consistent with the rest of the image.
Semantic ambiguity means that the missing region can be filled in multiple ways; e.g., filling a gap in a street scene could mean extending a road, adding a pedestrian, or adding a vehicle. Even when an input prompt is given, the task remains difficult when concept leaking occurs, e.g., “a black cat on a white background” vs “a white cat on a black background”.
Long-range dependency and global structure: Another significant hurdle. While generative models excel at local details, they can struggle with the broader context, lighting, and perspective.
Perceptual realism: Another key challenge. Even if the inpainted regions are visually consistent, they may not align with human perception. For example, an inpainting might produce unrealistic shadows or reflections, overly smooth textures, or artificial artifacts.
Large missing regions: The size of missing area is directly proportional to the difficulty of the task.
HD-PAINTER: HIGH-RESOLUTION AND PROMPT-FAITHFUL TEXT-GUIDED IMAGE INPAINTING WITH DIFFUSION MODELS
In this section, we discuss a recent work, accepted at ICLR 2025, on image inpainting with diffusion models.
The challenge this work addresses is prompt neglect, which means the inpainting model ignores the user’s prompt and instead fills the masked region based on the surrounding visual context. Prompt neglect exhibits in two specific ways:
- Background dominance: The model fills the masked area with a continuation of the background, essentially ignoring the object or concept described in the prompt.
- Nearby object dominance: The model completes a nearby object that is partially covered by the mask, rather than generating the new object requested by the prompt.
Root cause of prompt neglect While weak image-text alignment is a well-acknowledged problem in the community, previous works have attributed it to the random masking strategy and the misalignment between global prompts and the local context of the masked region during training. In this work, however, the authors hypothesize that the standard self-attention layers contribute to the problem. These layers are “prompt-free” and reinforce local contextual similarity between the new pixels and the existing background pixels, thus undermining the prompt’s instructions.
Proposed solution
The authors introduce Prompt-Aware Introverted Attention, a training-free mechanism that modifies the self-attention layer to integrate the prompt information. Specifically, if a pixel in the “known region” (unmasked) is semantically close to the prompt, the generated pixel should be semantically influenced by its surrounding context, an outside-in effect. If the “known region” is not relevant to the prompt, its attention score is scaled down, reducing the influence of the surrounding context on the generated pixel.
Note that the input prompt is to describe the desired generated region only, not the entire image.
Accelerating Diffusion Models
Consistency Models
The core idea behind Consistency Models (CMs) is elegantly simple yet powerful:
“Points on the same trajectory should map to the same initial point.”
Concept and Mathematical Definition
Formally, consider a solution trajectory \(\{ x_t \}_{t \in [\epsilon, T]}\) of the Probability Flow ODE:
\[\frac{dx}{dt} = \mu(x, t) - \frac{1}{2} \sigma(t)^2 \nabla_x \log p_t(x)\]We define the consistency function as
\[f: (x_t, t) \mapsto x_{\epsilon}\]Intuitively, this function maps any point along a diffusion trajectory to its corresponding starting point at time \(\epsilon\).
A valid consistency function must satisfy the self-consistency property:
\[f(x_t, t) = f(x_{t'}, t') \quad \forall t, t' \in [\epsilon, T]\]That is, any two points on the same trajectory—no matter when they occur—should yield the same mapped output.
The objective of a consistency model \(f_{\theta}\) is to learn this mapping from data while enforcing this self-consistency constraint.
Determinism and Relation to Probability Flow ODE
Unlike the stochastic nature of the diffusion SDE, the Probability Flow ODE is deterministic.
Given a fixed starting point \(x_T\), the trajectory and its corresponding final point \(x_{\epsilon}\) are uniquely determined for all \(t \in [\epsilon, T]\).
Sampling with Consistency Models
Once trained, a consistency model \(f_{\theta}\) can generate samples in a single step:
- Sample a random latent point \(x_T \sim \mathcal{N}(0, I)\)
- Map it to data space with
\(x_{\epsilon} = f_{\theta}(x_T, T)\)
This one-step sampling process is deterministic and efficient.
Alternatively, we can perform multi-step sampling by injecting small amounts of noise at each step, introducing stochasticity to improve sample diversity.
Training Consistency Models
There are two main strategies to train consistency models:
- Distillation from Pre-Trained Diffusion Models uses knowledge from a pre-trained diffusion model.
- Training from Scratch relies on an unbiased estimator of the score function.
In summary, the key step is obtaining two adjacent points on the PF-ODE trajectory, then enforcing the consistency function according to its definition. In distillation, we leverage the pre-trained score model \(s_{\phi}(x,t)\) to approximate the ground-truth score function \(\nabla_x \log p_t(x)\). In training from scratch, we leverage the following unbiased estimator
\[\nabla_x \log p_t(x) = \mathbb{E} \left[ \frac{x_t - x}{t^2} \mid x_t \right]\]where \(x \sim \mathcal{D}\) and \(x_t \sim \mathcal{N}(x; t^2 I)\). At the end, we approximate \(x_{t_n} = x + t_n z\) and \(x_{t_{n+1}} = x + t_{n+1} z\) where \(z \sim \mathcal{N}(0, I)\).
Consistency Distillation from a Pre-Trained Diffusion Model
This approach leverages an existing diffusion model to generate adjacent points \((\hat{x}_{t_n}, x_{t_{n+1}})\) along a Probability Flow ODE trajectory.
The goal is to enforce:
\[f_{\theta}(\hat{x}_{t_n}, t_n) = f_{\theta}(x_{t_{n+1}}, t_{n+1})\]so that \(f_{\theta}\) behaves as a true consistency function.
Step-by-step process:
Step 1 — Obtain point \(x_{t_{n+1}}\)
Sample from the SDE transition density:
\(x_{t_{n+1}} \sim \mathcal{N}(x; t_{n+1}^2 I), \quad x \sim \mathcal{D}\)
Step 2 — Estimate the adjacent point \(\hat{x}_{t_n}\)
Using an ODE solver \(\Phi(x, t, \phi)\) parameterized by the pre-trained diffusion model:
If the Euler method is used,
\(\Phi(x, t, \phi) = -t\, s_{\phi}(x, t)\)
where \(s_{\phi}(x, t)\) is the score function.
Hence,
\(\hat{x}_{t_n} = x_{t_{n+1}} - (t_n - t_{n+1}) t_{n+1} s_{\phi}(x_{t_{n+1}}, t_{n+1})\)
Step 3 — Define the loss
The consistency loss ensures outputs from adjacent points match:
\[\mathcal{L}(\theta, \theta^-) = \mathbb{E}\left[ \lambda(t_n)\, d\left( f_{\theta}(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{x}_{t_n}, t_n) \right) \right]\]where \(d(\cdot,\cdot)\) is a distance metric (e.g., L2 norm) and \(\lambda(\cdot)\) is a positive weighting function.
Step 4 — Update with EMA
Rather than standard gradient descent on both networks, the paper updates the online parameters \(\theta\) with gradient descent while the target parameters \(\theta^-\) follow an Exponential Moving Average (EMA) of \(\theta\), i.e., \(\theta^- \leftarrow \text{stopgrad}(\mu \theta^- + (1 - \mu) \theta)\), as shown in Algorithm 2.
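A minimal sketch of this EMA target update (assuming \(\mu\) close to 1, e.g., 0.999):

```python
import torch

@torch.no_grad()
def ema_update(target_params, online_params, mu=0.999):
    # theta_minus <- mu * theta_minus + (1 - mu) * theta
    for p_t, p_o in zip(target_params, online_params):
        p_t.mul_(mu).add_(p_o, alpha=1.0 - mu)
```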
Training from scratch
When no pre-trained diffusion model is available, we can directly estimate the score function using an unbiased estimator:
\[\nabla_x \log p_t(x) = \mathbb{E}\left[\frac{x_t - x}{t^2} \mid x_t\right]\]where
- \(x \sim \mathcal{D}\) (data distribution)
- \(x_t \sim \mathcal{N}(x; t^2 I)\) (noisy version of data)
We can then approximate: \(x_{t_n} = x + t_n z, \quad x_{t_{n+1}} = x + t_{n+1} z, \quad z \sim \mathcal{N}(0, I)\)
This formulation allows consistency models to be trained entirely from noise-perturbed data samples.
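A minimal sketch of drawing such a pair (the function and argument names are illustrative):

```python
import torch

def adjacent_points(x, t_n, t_next):
    # Both points share the same noise draw z, so they lie on the
    # same (approximate) PF-ODE trajectory
    z = torch.randn_like(x)
    return x + t_n * z, x + t_next * z
```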
Implementation of the Euler method
Implementation of the Euler method (from here). Note that this version estimates the score from a denoiser output \(f(x,t)\) via \(s_{\phi}(x,t) = \frac{f(x,t) - x}{t^2}\); in consistency training (no teacher), \(f(x,t)\) is replaced by the ground-truth \(x_0\) (matching the unbiased estimator above), so we don't need a pre-trained diffusion model to estimate the score function.
Substituting this into the PF-ODE Euler update
\[\hat{x}_{t_n} = x_{t_{n+1}} - (t_n - t_{n+1})\, t_{n+1}\, s_{\phi}(x_{t_{n+1}}, t_{n+1})\]we get:
\[\hat{x}_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1}) \frac{x_{t_{n+1}} - f(x_{t_{n+1}}, t_{n+1})}{t_{n+1}}\]@th.no_grad()
def euler_solver(samples, t, next_t, x0):
x = samples
if teacher_model is None:
denoiser = x0
else:
        denoiser = teacher_denoise_fn(x, t)  # teacher (pre-trained) model's denoised output
d = (x - denoiser) / append_dims(t, dims)
samples = x + d * append_dims(next_t - t, dims)
return samples
Heun method:
@th.no_grad()
def heun_solver(samples, t, next_t, x0):
x = samples
if teacher_model is None:
denoiser = x0
else:
        denoiser = teacher_denoise_fn(x, t)  # teacher (pre-trained) model's denoised output
    # IMPORTANT - first Euler half-step of Heun's method
    d = (x - denoiser) / append_dims(t, dims)
samples = x + d * append_dims(next_t - t, dims)
if teacher_model is None:
denoiser = x0
else:
        denoiser = teacher_denoise_fn(samples, next_t)  # teacher model's denoised output at next_t
next_d = (samples - denoiser) / append_dims(next_t, dims)
samples = x + (d + next_d) * append_dims((next_t - t) / 2, dims)
return samples
Implementation of Consistency Models
The official implementation is available here.
The train loop is defined in the train_util.py file.
- `diffusion.progdist_losses` is the loss function for progressive distillation.
- `diffusion.consistency_losses` is the loss function for consistency distillation; it takes `target_model`, `teacher_model`, and `teacher_diffusion` as arguments.
forward_backward method
def forward_backward(self, batch, cond):
self.mp_trainer.zero_grad()
for i in range(0, batch.shape[0], self.microbatch):
micro = batch[i : i + self.microbatch].to(dist_util.dev())
micro_cond = {
k: v[i : i + self.microbatch].to(dist_util.dev())
for k, v in cond.items()
}
last_batch = (i + self.microbatch) >= batch.shape[0]
t, weights = self.schedule_sampler.sample(micro.shape[0], dist_util.dev())
ema, num_scales = self.ema_scale_fn(self.global_step)
if self.training_mode == "progdist":
if num_scales == self.ema_scale_fn(0)[1]:
compute_losses = functools.partial(
self.diffusion.progdist_losses,
self.ddp_model,
micro,
num_scales,
target_model=self.teacher_model,
target_diffusion=self.teacher_diffusion,
model_kwargs=micro_cond,
)
else:
compute_losses = functools.partial(
self.diffusion.progdist_losses,
self.ddp_model,
micro,
num_scales,
target_model=self.target_model,
target_diffusion=self.diffusion,
model_kwargs=micro_cond,
)
elif self.training_mode == "consistency_distillation":
compute_losses = functools.partial(
self.diffusion.consistency_losses,
self.ddp_model,
micro,
num_scales,
target_model=self.target_model,
teacher_model=self.teacher_model,
teacher_diffusion=self.teacher_diffusion,
model_kwargs=micro_cond,
)
elif self.training_mode == "consistency_training":
compute_losses = functools.partial(
self.diffusion.consistency_losses,
self.ddp_model,
micro,
num_scales,
target_model=self.target_model,
model_kwargs=micro_cond,
)
else:
raise ValueError(f"Unknown training mode {self.training_mode}")
if last_batch or not self.use_ddp:
losses = compute_losses()
else:
with self.ddp_model.no_sync():
losses = compute_losses()
The consistency_losses:
- `distiller = denoise_fn(x_t, t)` and `distiller_target = target_denoise_fn(x_t2, t2)` are the consistency model output and the target model output, respectively.
- `t2` is the adjacent time step of `t`.
- `x_t2` is the predicted adjacent point of `x_t`.
def consistency_losses(
self,
model,
x_start,
num_scales,
model_kwargs=None,
target_model=None,
teacher_model=None,
teacher_diffusion=None,
noise=None,
):
if model_kwargs is None:
model_kwargs = {}
if noise is None:
noise = th.randn_like(x_start)
dims = x_start.ndim
# IMPORTANT - f_{\theta}(x,t) - consistency model output
def denoise_fn(x, t):
return self.denoise(model, x, t, **model_kwargs)[1]
if target_model:
@th.no_grad()
def target_denoise_fn(x, t):
return self.denoise(target_model, x, t, **model_kwargs)[1]
else:
raise NotImplementedError("Must have a target model")
if teacher_model:
@th.no_grad()
def teacher_denoise_fn(x, t):
return teacher_diffusion.denoise(teacher_model, x, t, **model_kwargs)[1]
@th.no_grad()
def heun_solver(samples, t, next_t, x0):
x = samples
if teacher_model is None:
denoiser = x0
else:
denoiser = teacher_denoise_fn(x, t)
            # IMPORTANT - first Euler half-step of Heun's method
d = (x - denoiser) / append_dims(t, dims)
samples = x + d * append_dims(next_t - t, dims)
if teacher_model is None:
denoiser = x0
else:
denoiser = teacher_denoise_fn(samples, next_t)
next_d = (samples - denoiser) / append_dims(next_t, dims)
samples = x + (d + next_d) * append_dims((next_t - t) / 2, dims)
return samples
@th.no_grad()
def euler_solver(samples, t, next_t, x0):
x = samples
if teacher_model is None:
denoiser = x0
else:
denoiser = teacher_denoise_fn(x, t)
d = (x - denoiser) / append_dims(t, dims)
samples = x + d * append_dims(next_t - t, dims)
return samples
indices = th.randint(
0, num_scales - 1, (x_start.shape[0],), device=x_start.device
)
t = self.sigma_max ** (1 / self.rho) + indices / (num_scales - 1) * (
self.sigma_min ** (1 / self.rho) - self.sigma_max ** (1 / self.rho)
)
t = t**self.rho
t2 = self.sigma_max ** (1 / self.rho) + (indices + 1) / (num_scales - 1) * (
self.sigma_min ** (1 / self.rho) - self.sigma_max ** (1 / self.rho)
)
t2 = t2**self.rho
x_t = x_start + noise * append_dims(t, dims)
dropout_state = th.get_rng_state()
distiller = denoise_fn(x_t, t)
if teacher_model is None:
x_t2 = euler_solver(x_t, t, t2, x_start).detach()
else:
x_t2 = heun_solver(x_t, t, t2, x_start).detach()
th.set_rng_state(dropout_state)
distiller_target = target_denoise_fn(x_t2, t2)
distiller_target = distiller_target.detach()
snrs = self.get_snr(t)
weights = get_weightings(self.weight_schedule, snrs, self.sigma_data)
if self.loss_norm == "l1":
diffs = th.abs(distiller - distiller_target)
loss = mean_flat(diffs) * weights
elif self.loss_norm == "l2":
diffs = (distiller - distiller_target) ** 2
loss = mean_flat(diffs) * weights
elif self.loss_norm == "l2-32":
distiller = F.interpolate(distiller, size=32, mode="bilinear")
distiller_target = F.interpolate(
distiller_target,
size=32,
mode="bilinear",
)
diffs = (distiller - distiller_target) ** 2
loss = mean_flat(diffs) * weights
elif self.loss_norm == "lpips":
if x_start.shape[-1] < 256:
distiller = F.interpolate(distiller, size=224, mode="bilinear")
distiller_target = F.interpolate(
distiller_target, size=224, mode="bilinear"
)
loss = (
self.lpips_loss(
(distiller + 1) / 2.0,
(distiller_target + 1) / 2.0,
)
* weights
)
else:
raise ValueError(f"Unknown loss norm {self.loss_norm}")
terms = {}
terms["loss"] = loss
return terms
Diffusion Distillation
Progressive Distillation
The idea of progressive distillation is to repeatedly halve the number of sampling steps: a student model is trained to match, in a single step, the output of two steps of a deterministic (DDIM) teacher, and the procedure is then repeated with the student acting as the new teacher.
References:
- PROGRESSIVE DISTILLATION FOR FAST SAMPLING OF DIFFUSION MODELS, ICLR 2022
- On Distillation of Guided Diffusion Models, CVPR 2023
- InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation, ICLR 2024
- Adversarial Diffusion Distillation
- Improved Distribution Matching Distillation for Fast Image Synthesis
Rectified Diffusion
Rectified Flows define the forward process as straight paths between the data distribution and a standard normal distribution [2], i.e.,
\[z_t = (1 - t) x_0 + t \epsilon\]where \(\epsilon\) is a standard normal random variable and \(t\) is the time step in [0, 1].
References:
- [1] Flow straight and fast: Learning to generate and transfer data with rectified flow
- [2] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Caching in Diffusion Models
This technique takes advantage of the U-Net architecture used in diffusion models, particularly its skip connections, which transfer intermediate features from the encoder to the decoder.
The core idea:
👉 Store intermediate results from step \(t\) (e.g., decoder features) and reuse them at step \(t-1\) instead of recomputing the entire U-Net.
U-Net Refresher
A U-Net has two main components:
- Encoder (Down Blocks) — progressively downsamples the input to a compact high-level representation.
- Decoder (Up Blocks) — upsamples the features to reconstruct the image.
Each pair of down and up blocks \(D_i, U_i\) connects through:
- a main path: \(D_1 \to D_d \to U_d \to U_1\)
- skip connections: \(D_i \to U_i\)
At each layer, the output combines both paths: \(U_i = \text{Concat}(D_i, U_{i+1})\)
Observation: Feature Reuse Across Timesteps
During denoising, adjacent timesteps produce very similar high-level features: \(U_i^{(t)} \approx U_i^{(t-1)}\)
So instead of recomputing these expensive decoder features every step, we can cache them:
\[F_c^{(t)} \leftarrow U_i^{(t)} \\ U_i^{(t-1)} = \text{Concat}(D_i^{(t-1)}, F_c^{(t)})\]This simple reuse cuts redundant computation and significantly speeds up inference.
Implementation Overview
I was more curious about the implementation details than the idea. You can find the full implementation in the DeepCache repository. First, we need to modify the Stable Diffusion pipeline, specifically the denoising loop, where cached features are passed to the U-Net.
# predict the noise residual
noise_pred, prv_features = self.unet(
latent_model_input,
t,
encoder_hidden_states=prompt_embeds,
cross_attention_kwargs=cross_attention_kwargs,
replicate_prv_feature=prv_features,
    quick_replicate=cache_interval > 1,
cache_layer_id=cache_layer_id,
cache_block_id=cache_block_id,
return_dict=False,
)
The U-Net model is defined in the unet_2d_condition.py file, where forward method has been modified to support the caching feature.
Note that the caching is applied to the cross-attention layers only.
if quick_replicate and replicate_prv_feature is not None:
# Downsampling - nothing change
# Middle - No middle
# Upsampling
sample = replicate_prv_feature
#down_block_res_samples = down_block_res_samples[:-1]
if cache_block_id == len(self.down_blocks[cache_layer_id].attentions) :
cache_block_id = 0
cache_layer_id += 1
else:
cache_block_id += 1
for i, upsample_block in enumerate(self.up_blocks):
# Skip the blocks that are not the cache layer # IMPORTANT - This is where speed is gained
if i < len(self.up_blocks) - 1 - cache_layer_id:
continue
if i == len(self.up_blocks) - 1 - cache_layer_id:
trunc_upsample_block = cache_block_id + 1
else:
trunc_upsample_block = len(upsample_block.resnets)
is_final_block = i == len(self.up_blocks) - 1
res_samples = down_block_res_samples[-trunc_upsample_block:]
down_block_res_samples = down_block_res_samples[: -trunc_upsample_block]
# if we have not reached the final block and need to forward the
# upsample size, we do it here
if not is_final_block and forward_upsample_size:
upsample_size = down_block_res_samples[-1].shape[2:]
if hasattr(upsample_block, "has_cross_attention") and upsample_block.has_cross_attention:
#print(sample.shape, [res_sample.shape for res_sample in res_samples])
sample, _ = upsample_block(
hidden_states=sample,
temb=emb,
res_hidden_states_tuple=res_samples,
encoder_hidden_states=encoder_hidden_states,
cross_attention_kwargs=cross_attention_kwargs,
upsample_size=upsample_size,
attention_mask=attention_mask,
encoder_attention_mask=encoder_attention_mask,
enter_block_number=cache_block_id if i == len(self.up_blocks) - 1 - cache_layer_id else None,
)
else:
sample = upsample_block(
hidden_states=sample,
temb=emb,
res_hidden_states_tuple=res_samples,
upsample_size=upsample_size,
scale=lora_scale,
)
prv_f = replicate_prv_feature
Vision-Language Models
A rapidly growing branch of generative AI focuses on **Vision-Language Models (VLMs)** — a class of multimodal models that can process and generate information across different input types (modalities), such as text, image, video, and audio. While the input may come from multiple modalities, the output is often text (e.g., in Visual Question Answering, or VQA, and image captioning), but can also be images (e.g., in text-to-image generation) or other modalities depending on the task.
Representation Learning in Vision-Language Models
In VLMs, the encoder pathway typically includes separate encoders for each modality, followed by a fusion layer that integrates features from the different modalities. However, as discussed in [2], there are multiple approaches for learning joint multimodal representations that capture cross-modal interactions:
- Representation Fusion – integrating information from two or more modalities, effectively reducing the number of separate representations.
- Representation Coordination – exchanging information across modalities to enrich each modality’s context while maintaining the same number of representations.
- Representation Fission – generating a new, decoupled set of representations (often more than the input count) that captures internal structure such as clusters or latent factors.
Pretraining Vision-Language Models
There are various strategies to pretrain Vision-Language Models. The key idea is to align image and text representations and feed the fused representation into a text decoder for generation tasks. A common architecture consists of three main components:
- Image Encoder – processes raw visual data into a sequence of fixed-length embeddings.
- Multimodal Projector – aligns image and text representations using a dense neural network.
- Text Decoder – generates text output from the fused multimodal representation, usually derived from a pre-trained LLM.
Because the text decoder (or text encoder-decoder) is often initialized from a pretrained LLM, it is typically kept frozen during pretraining. The focus instead is on fine-tuning the multimodal projector, which learns to map the visual features into the same embedding space as the textual features.
Qwen VL
Qwen1-VL consists of:
- a Vision Transformer (ViT) as the vision encoder (initialized with pre-trained weights from OpenCLIP’s ViT-bigG),
- a Large Language Model (LLM) serving as both text encoder and decoder (the Qwen1 model), and
- a Position-Aware Vision-Language Adapter (VL-Adapter) bridging the visual and textual spaces. The VL-Adapter is a single-layer cross-attention module initialized randomly. It employs a set of trainable query vectors and uses visual features as keys for cross-attention, compressing the image feature sequence to a fixed length (typically 256). To preserve spatial information, 2D absolute positional encodings are added to the query-key pairs.
Image Input Images are processed through the visual encoder and adapter, yielding fixed-length sequences of image features. To differentiate between image feature input and text feature input, two special tokens (<img> and </img>) are appended to the beginning and end of the image feature sequence respectively, signifying the start and end of image content.
Bounding Box Input and Output Qwen1-VL also supports bounding box reasoning, e.g., for object detection tasks where the query can be “Can you find the dog in the image?”. To enable this capability, input bounding box coordinates are transformed into a specified string format:
<box>(X-top-left, Y-top-left), (X-bottom-right, Y-bottom-right)</box>
This design allows bounding boxes to be tokenized like text, requiring no extra positional embeddings.
To link textual descriptions with corresponding regions, additional tokens <ref> and </ref> are introduced, e.g.,
"<box>(X-topleft, Y-topleft), (X-bottomright, Y-bottomright)</box><ref>a dog chasing a cat</ref>"
Training Qwen1-VL consists of three stages: two stages of pre-training and a final stage of instruction fine-tuning.
- Pre-training: The goal of this stage is to align the visual understanding (through the visual encoder and adapter) with the text understanding of the LLM. Therefore, in this stage, only the ViT encoder and the adapter are fine-tuned, while the LLM is frozen. The training set is a large-scale, weakly labeled, web-crawled set of image-text pairs. The objective is to predict the text description of the image.
- Multitask Pre-training: The goal of this stage is to introduce vision-language capabilities to the VLM (after it has acquired a basic visual understanding through pre-training) using high-quality, fine-grained VL annotation data with a larger input resolution, as well as interleaved image-text data. Qwen1-VL was trained on 7 tasks simultaneously, including image captioning, VQA, OCR, grounding, etc.
- Supervised Fine-tuning: The goal of the last stage is to enhance instruction-following and dialogue capabilities so that the model can work in chatbot mode. The multi-modal instruction tuning data primarily comes from caption data or dialogue data generated through LLM self-instruction, which often only addresses single-image dialogue and reasoning and is limited to image content comprehension. To enhance multi-image comprehension, an additional manually annotated dataset, consisting of multiple images and sets of dialogues, was introduced to train the model. In this stage, the language model and adapter module are optimized, while the visual encoder is frozen.
Resources:
- [1] Multimodal machine learning (MMML) Course from CMU https://cmu-mmml.github.io/
- [2] Liang, P. P., Zadeh, A., & Morency, L. P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10), 1-42.
- [3] Vision Language Models Explained https://huggingface.co/blog/vlms
- [4] What are vision language models (VLMs)? by IBM Research https://www.ibm.com/think/topics/vision-language-models