ML papers: one a week

Diffusion Models

Latent diffusion, Stable Diffusion

A diffusion model is a parametrized Markov chain trained using variational inference to generate sample images from noise.
What is variational inference? Approximate guessing: instead of working with a complex, intractable true posterior distribution, we use a simpler approximate posterior distribution.
To make the approximation good, the KL divergence between the two distributions is minimized. Minimizing the KL divergence is equivalent to maximizing the ELBO, and vice versa.
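
As a quick check of that last statement (derived step by step in the VAE section below), for an approximate posterior \(q_\phi(z|x)\) and true posterior \(p_\theta(z|x)\) the log-likelihood splits as
\begin{equation} \text{log }p_\theta(x) = \text{ELBO} + D_{KL}(q_\phi(z|x) || p_\theta(z|x)) \end{equation}
Since \(\text{log }p_\theta(x)\) is fixed with respect to \(q_\phi\), pushing the ELBO up pushes the KL divergence down by the same amount.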

Denoising diffusion models have two processes (sketched in the equations below):

  1. Forward diffusion process
  2. Reverse denoising process
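
Concretely, in the standard DDPM formulation (the noise-schedule notation \(\beta_t\) is an assumption here; it is not defined elsewhere in these notes), each forward step adds a little Gaussian noise and each reverse step is a learned Gaussian denoiser:
\begin{equation} q(x_t|x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I) \end{equation}
\begin{equation} p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)) \end{equation}
Running the forward chain for enough steps turns any image into (approximately) pure noise; the reverse chain, trained by maximizing the ELBO, walks back from noise to an image.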

[Figure: Denoising diffusion]

Each of the intermediate steps is a latent variable.

[Figure: Denoising diffusion]

References:

  • Thermodynamics-inspired original paper

GPT

Generative models

GANs

Generative Adversarial Networks, Generative models, Latent variable models

What if I want to create new images from a very complex latent space \(z\)? Since it is very difficult to model it, I use the encoder part of a VAE to generate this \(z\).

[Figure: GAN]

Now the Generator (Decoder) tries to synthesize new images and the Discriminator (Encoder) tries to identify if that image came from the real world (\(x\)) or is fake (\(x'\)). The generator gets better at creating fake images which are very similar to real images and the discriminator gets better at identifying the important minute details which can help identify a fake. So we have competing objective functions for the generator and discriminator.

The discriminator’s loss function is
\begin{equation} \text{arg max}_D \mathbb{E}_{z,x}[\underbrace{log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}] \label{eq:loss_discriminator} \end{equation} It tries to maximize the probability that the discriminator identifies a fake input as fake and a real input as real.

The generator’s loss function is
\begin{equation} \text{arg min}_G \mathbb{E}_{z,x}[\underbrace{log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}] \label{eq:loss_generator} \end{equation} It tries to minimize the probability that the discriminator identifies a fake input as fake and a real input as real.

Putting this together, the objective function of a GAN is
\begin{equation} \text{arg min}_G \text{max}_D \mathbb{E}_{z,x}[\underbrace{log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}] \label{eq:loss_gan} \end{equation} This gives us a Generator that creates fake instances which fool even the best Discriminator.

This final trained Generator is what we use to sample from noise and create new images.
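
As a concrete illustration of the alternating min/max optimization, here is a minimal PyTorch-style training step (a sketch, not code from these notes; the MLP architectures, `latent_dim`, `img_dim` and learning rates are assumptions). Note that, following the common convention, `D` here outputs the probability that its input is real, which is the complement of the "estimate that this is fake" used in the equations above.

```python
import torch
import torch.nn as nn

# Illustrative sizes -- assumptions, not from the notes.
latent_dim, img_dim = 64, 28 * 28

# Generator: noise z -> fake image. Discriminator: image -> P(real).
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                          # real: (batch, img_dim) tensor
    batch = real.size(0)
    fake = G(torch.randn(batch, latent_dim))   # sample z from noise, synthesize fakes

    # Discriminator update (the max_D part): push D(real) -> 1 and D(fake) -> 0.
    opt_D.zero_grad()
    loss_D = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_D.backward()
    opt_D.step()

    # Generator update (the min_G part, in the common non-saturating form):
    # push D(fake) -> 1, i.e. try to fool the discriminator.
    opt_G.zero_grad()
    loss_G = bce(D(fake), torch.ones(batch, 1))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```

After training, only `G` is kept: sampling `z = torch.randn(1, latent_dim)` and calling `G(z)` is exactly the "sample from noise to create new images" step described above.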

Style GANs
Transfer style from one distribution to another. Example: ageing/de-aging by transferring style from a source of old/young face data respectively onto my face.

Conditional GANs
To control the type of output which can be synthesized. Example: train the GAN on pairs of (street scene, semantic map of the scene) to then create synthesized street scenes given a semantic map. Or train the GAN on pairs of (hand drawn image, photo) to then create realistic photos from newly drawn scribbles.

Cycle GANs
To transfer the style between domains without paired data. Example: transform summer photos to winter photos or transform a photo to a specific painting style.

Quality vs Coverage in generative models
GANs make high quality images but can suffer from mode collapse. This is the phenomenon where the generator gets stuck at creating only specific images (of some modes/classes) which fool the discriminator and thus fails to create diverse samples across the entire data distribution. Compared to this, VAEs have a broader coverage but the samples may not be of high quality.
Autoregressive models like Transformers are a good compromise, but they have no latent variables. In comparison, diffusion models also offer good performance with stable training, and their latent variables allow efficient sample editing.


VAE

Variational Auto Encoders, Discriminative modeling, Generative modeling

Generative modeling is a type of unsupervised learning that creates a model of the probability distribution of the training data given to it. It can be used for:

  1. Density estimation - Learning the distribution allows us to identify outliers, and hence handle unpredictable behaviour.
  2. Sample generation - Learning the underlying distribution of the training data allows us to uncover biases in the data and hence create better datasets.

Generative modeling is finding \(p(x)\). Conditional generative modeling is finding \(p(x|y)\). Latent variable models are a type of generative model; they include AEs, VAEs and GANs.

An Auto Encoder (AE) is used for compressing information into a latent layer or for recreating an original image from a corrupted one.

[Figure: Auto encoder application]

An encoder maps the input \(x\) to a low dimensional latent space \(z\). This is an unsupervised problem => we do not have labels associated with the training images. The decoder tries to reconstruct the original image, producing \(x'\). The objective is to minimize the distance between \(x\) and \(x'\).
\begin{equation} L(x,x') = ||x-x'||^2 \label{eq:loss_ae} \end{equation}
The latent layer introduces a probability distribution on \(z\).

[Figure: Auto encoder]

This equation along with the low dimensional nature of \(z\) introduces an information bottleneck, which tries to compress as much information as possible about \(x\) into \(z\).
This is deterministic. As long as the green, yellow and pink boxes remain the same (no change to the NN weights), we will get the same \(x'\) for the same input \(x\). Meaning we can reproduce the output every time.
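
A minimal sketch of this deterministic encoder-decoder pair in PyTorch (the 784-dimensional input and the layer sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Encoder: x (784-d) -> low dimensional latent z; Decoder: z -> reconstruction x'.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

def reconstruction_loss(x):
    z = encoder(x)                       # information bottleneck: 784 -> 16 dims
    x_hat = decoder(z)                   # deterministic: same x always gives the same x_hat
    return ((x - x_hat) ** 2).mean()     # ||x - x'||^2, the AE loss above
```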

Variational Auto Encoder (VAE)
VAEs make this yellow box (\(z\)) stochastic. For each variable in \(z\) you learn an associated μ and σ. This is what allows us to get new images, by sampling from the distribution of a \(z(\mu, \sigma, \epsilon)\).

[Figure: Variational auto encoder]

Now the encoder tries to learn the probability distribution of \(z\) given \(x\), and the decoder tries to learn a new probability distribution of \(x'\) given \(z\) i.e.
\begin{equation} \label{eq:vae_encoder} \text{Encoder computes } q_\Phi(z|x) \end{equation}
\begin{equation} \label{eq:vae_decoder} \text{Decoder computes } p_\theta(x|z) \end{equation}
\(\Phi\) and \(\theta\) are the weights of the NNs. So the loss function for VAEs is
\begin{equation} L(\Phi, \theta, x) = (\text{reconstruction loss}) + (\text{regularization term}) \label{eq:loss_vae} \end{equation} The reconstruction loss is the same as the AE's \eqref{eq:loss_ae}, while the regularization term places a prior on \(z\), trying to force all the \(z\)'s we learn to follow this prior \(p(z)\). This way we keep the learned \(q_\Phi(z|x)\) (the encoder, \eqref{eq:vae_encoder}) as close to the prior \(p(z)\) as possible, i.e. we minimize
\begin{equation} D(\underbrace{q_\phi(z|x)}_{\text{learned z}} || \underbrace{p(z)}_{\text{fixed prior on z}}) \label{eq:loss_vae_regularzn} \end{equation}
This prevents the NN from overfitting. Equation \eqref{eq:loss_vae_regularzn} is actually the KL divergence between the two distributions.
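
When the prior is the standard Normal \(p(z)=\mathcal{N}(0, I)\) and the encoder outputs a diagonal Gaussian \(q_\phi(z|x)=\mathcal{N}(\mu, \sigma^2)\), this KL term has a well-known closed form (stated here without derivation):
\begin{equation} D(q_\phi(z|x) || p(z)) = \frac{1}{2}\sum_{i}\left(\sigma_i^2 + \mu_i^2 - 1 - \text{log }\sigma_i^2\right) \end{equation}
This is the term added to the reconstruction loss in practice, and it is what nudges every latent dimension towards mean 0 and variance 1.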

Why regularize?
We want the following properties in the latent space:

  1. Continuity - closer points in \(z\) should create similar images in \(x'\)
  2. Completeness - sampling from \(z\) must give a meaningful \(x'\)
[Figure: Non-regularized latent space]
[Figure: Regularized latent space]

How to select a prior? Gaussian FTW!

[Figure: Regularization with a Gaussian prior]

Regularization with a standard Normal (Gaussian) prior helps enforce a smooth gradient of information in the latent space. Points and distances in \(z\) then have some meaning.

Proof for ELBO:
We know that the expectation of a function of a random variable is
\begin{equation} \label{eq:elbo_one} \mathbb{E}_{p(x)} [f(x)]=\int f(x)p(x)dx \end{equation}
By chain rule, we have
\begin{equation} \label{eq:elbo_two} P(x,y)=P(x|y)P(y) \end{equation}
From Bayes theorem, we have
\begin{equation} \label{eq:elbo_three} P(x|y)=\frac{P(y|x)P(x)}{P(y)} \end{equation}
KL divergence is given by
\begin{equation} \label{eq:elbo_four} D_{KL} (P||Q)=\int p(x) log(\frac{p(x)}{q(x)})dx \end{equation}
The likelihood of our data is
\(p(x)=\int p(x,z)dz\), i.e. the marginal of the joint probability. But this is intractable since we need to integrate over all \(z\)'s.
\(p(x)=\frac{p(x,z)}{p(z|x)}\) using \eqref{eq:elbo_two} but we don’t have the denominator. So, using an approximation for it,
\(\underbrace{p_{\theta}(z|x)}_{\text{TRUE posterior}} \approx \underbrace{q_{\phi}(z|x)}_{\text{APPROX posterior}}\)
This is the encoder (see \eqref{eq:vae_encoder}). Take the log, so log-likelihood of our data is
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot 1\)
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot \int q_{\phi}(z|x)dz\)
since the integral of a probability distribution over its domain is always 1. Bring \(\text{log }p_\theta(x)\) inside the integral, since the integral is over \(dz\) and this term does not depend on \(z\)
\(\text{log }p_\theta(x)= \int\text{log }p_\theta(x) \cdot q_{\phi}(z|x)dz\)
is of the form \eqref{eq:elbo_one}
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_{\theta}(x)]\)
Using \eqref{eq:elbo_two} inside the expectation,
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)}]\)
Multiplying by the same quantity,
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)} \cdot \frac{q_{\phi}(z|x)}{q_{\phi}(z|x)}]\)
Reordering
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)} \cdot \frac{q_{\phi}(z|x)}{p_\theta(z|x)}]\)
Since \(log(ab)=log(a)+log(b)\)
\(\text{log }p_\theta(x)= \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}]}_{\text{ELBO}} + \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{q_{\phi}(z|x)}{p_\theta(z|x)}]}_{\text{KL Divergence}}\)
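
Since the KL divergence is always \(\geq 0\), the first term really is a lower bound on the log-likelihood (hence the name Evidence Lower BOund):
\begin{equation} \text{log }p_\theta(x) \geq \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}]}_{\text{ELBO}} \end{equation}
And because \(\text{log }p_\theta(x)\) does not depend on \(\phi\), maximizing the ELBO over \(\phi\) is exactly the same as minimizing the KL divergence between the approximate and true posteriors, which is the equivalence quoted in the diffusion section above.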

Reparametrization trick

Why can’t we backpropagate with a stochastic variable?
Technically we can, it’s just that it is very computationally intensive and what we compute may have a very high variance, so we need to do it multiple times, without any guarantee that we will converge to a correct answer at the end of it.

Sampling introduces randomness => \(z\) is no longer a deterministic function of the parameters \(\mu\) and \(\sigma\), so gradients cannot flow through the sampling step directly.

[Figure: Reparametrization trick]

Reparametrization converts \(z \sim \mathcal{N}(\mu, \sigma^2)\) into \(z_i=\mu_i + \sigma_i \epsilon_i\), which is still an unbiased estimator, but with a smaller variance (see the end of the SGD post for intuition). If \(z\) is assumed to have a Gaussian prior, then \(z_i=\mu_i + \sigma_i \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0,1)\). Then \(\frac{\partial z_i}{\partial \mu_i} = 1\) and \(\frac{\partial z_i}{\partial \sigma_i} = \epsilon_i\). \(\mu\) and \(\sigma\) are learned, so they can be backpropagated through, while \(\epsilon\) is sampled from a fixed distribution. Because this distribution is a standard Gaussian, we know how to sample from it and can compute the estimate relatively easily.
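
A minimal PyTorch sketch of this trick (the encoder producing `mu` and `log_var`, and the decoder, are assumed and only hinted at in the usage comment):

```python
import torch

def reparametrize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Gradients flow through mu and sigma (deterministic paths);
    the randomness is isolated in eps, which needs no gradient.
    """
    sigma = torch.exp(0.5 * log_var)   # sigma from the learned log-variance
    eps = torch.randn_like(sigma)      # eps ~ N(0, I), sampled outside the learned parameters
    return mu + sigma * eps

# Hypothetical usage inside a VAE forward pass:
# mu, log_var = encoder(x)
# z = reparametrize(mu, log_var)
# x_hat = decoder(z)
```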

β - VAEs
Allows control over the latent space \(z\) by controlling the strength of the regularization term in the loss function \eqref{eq:loss_vae}. \begin{equation} L(\Phi, \theta, x, z, \beta) = (\text{reconstruction loss}) + \beta(\text{regularization term}) \label{eq:loss_betavae} \end{equation} This manifests as disentanglement, i.e. creating latent space variables which are not correlated with each other. The image on the left shows that head rotation and smile are entangled.

[Figure: Disentanglement]

This is done mathematically by assuming a Diagonal prior on \(z\).
What is the use of this? By looking at the latent variables in a dataset, we can find out if it is fair and representative.


Autograd

Momentum, RMSProp

TBD
