02 Apr 2024
Latent diffusion, stable diffusion
A diffusion model is a parametrized Markov chain trained using variational inference to
generate sample images from noise.
What is variational inference? Approximate guessing. Variational inference is the process of
using a tractable approximate posterior distribution in place of a complex true posterior distribution.
To make the approximation good, the KL divergence between the 2 distributions is minimized.
Minimizing this KL divergence is equivalent to maximizing the ELBO and vice versa.
Denoising diffusion models have 2 processes:
- Forward diffusion process
- Reverse denoising process
Each intermediate step is a latent variable.
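To make the forward process concrete, here is a minimal NumPy sketch. It assumes the commonly used Gaussian step \(x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon\) with a linear noise schedule; the schedule values, the toy input and the function name `forward_diffusion` are illustrative assumptions, not something these notes specify.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Return the latents x_1..x_T obtained by gradually noising x0."""
    x = x0
    latents = []
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # One Markov step: shrink the signal slightly and add Gaussian noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        latents.append(x)
    return latents

rng = np.random.default_rng(0)
x0 = np.full(10_000, 5.0)               # toy "image": 10,000 pixels, all equal to 5
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
latents = forward_diffusion(x0, betas, rng)

# After the last step the signal is almost gone: mean close to 0, std close to 1.
print(latents[-1].mean(), latents[-1].std())
```

Every entry of `latents` is one of the intermediate latent variables; the reverse denoising process would be a learned model that walks this chain backwards.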
References:
- Thermodynamics inspired original paper
27 Feb 2024
Generative Adversarial Networks, Generative models, Latent variable models
What if I want to create new images from a very complex latent space \(z\)?
Since it is very difficult to model it, I use the encoder part of a VAE to generate this \(z\).
Now the Generator (Decoder) tries to synthesize new images and the Discriminator (Encoder)
tries to identify if that image came from the real world (\(x\)) or is fake (\(x'\)).
The generator gets better at creating fake images which are very similar to real images
and the discriminator gets better at identifying the important minute details which can help
identify a fake. So we have competing objective functions for the generator and discriminator.
The discriminator’s loss function is
\begin{equation}
\arg\max_D \mathbb{E}_{z,x}[\underbrace{\log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{\log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}]
\label{eq:loss_generator}
\end{equation}
It tries to maximize the probability that the discriminator identifies a fake input as fake
and a real input as real.
The generator’s loss function is
\begin{equation}
\arg\min_G \mathbb{E}_{z,x}[\underbrace{\log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{\log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}]
\label{eq:loss_discriminator}
\end{equation}
It tries to minimize the probability that the discriminator identifies a fake input as fake.
The second term does not depend on \(G\), so only the first term affects the generator.
Putting this together, the objective function of a GAN is
\begin{equation}
\arg\min_G \max_D \mathbb{E}_{z,x}[\underbrace{\log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{\log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}]
\label{eq:loss_gan}
\end{equation}
This gives us a Generator that creates fake instances which fool the best Discriminator.
This trained Generator is what we use to create new images from sampled noise.
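The objective above can be evaluated numerically for a batch of discriminator outputs. The sketch below follows the convention of these equations, where \(D\) outputs the probability that its input is fake; the function name `gan_value` and the example probabilities are made up for illustration.

```python
import numpy as np

def gan_value(d_on_fake, d_on_real):
    """Monte Carlo estimate of E[log D(G(z)) + log(1 - D(x))].

    Convention used in these notes: D outputs the probability
    that its input is FAKE.
    """
    d_on_fake = np.asarray(d_on_fake, dtype=float)
    d_on_real = np.asarray(d_on_real, dtype=float)
    return np.mean(np.log(d_on_fake)) + np.mean(np.log(1.0 - d_on_real))

# Hypothetical discriminator outputs on a batch of 3 fake and 3 real images.
sharp = gan_value(d_on_fake=[0.99, 0.98, 0.99], d_on_real=[0.01, 0.02, 0.01])
fooled = gan_value(d_on_fake=[0.5, 0.5, 0.5], d_on_real=[0.5, 0.5, 0.5])
print(sharp, fooled)
```

A discriminator that separates real from fake well pushes the value towards its maximum of 0, while a generator that fools it pulls the value down; when the discriminator can only guess (0.5 everywhere) the value is \(2\log 0.5 \approx -1.386\).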
Style GANs
Transfer style from one distribution to another. Example: aging/de-aging by transferring
from a source of old/young data respectively to my face.
Conditional GANs
To control the type of output which can be synthesized. Example: train the GAN on pairs of
(street scene, semantic map of the scene) to then create synthesized street scenes given a
semantic map. Or train the GAN on pairs of (hand drawn image, photo) to then create realistic
photos from newly drawn scribbles.
Cycle GANs
To transfer the style between domains without paired data. Example: transform summer photos
to winter photos or transform a photo to a specific painting style.
Quality vs Coverage in generative models
GANs make high quality images but can suffer from mode collapse. This is the phenomenon where
the generator gets stuck at creating only specific images (of some modes/classes) which fool the
discriminator and thus fails to create diverse samples across the entire data distribution.
Compared to this, VAEs have a broader coverage but the samples may not be of high quality.
Autoregressive models like Transformers are a good compromise but they have no latent variables.
In comparison, diffusion models also offer good performance with stable training and have
latent variables to perform efficient sample editing.
13 Feb 2024
Variational Auto Encoders, Discriminative modeling, Generative modeling
Generative modeling is a type of unsupervised learning to create a model that learns the probability distribution of the training data given to it.
It can be used for:
- Density estimation - Learning the distribution allows us to identify outliers, and hence handle unpredictable behaviour.
- Sample generation - Learning the underlying distribution of the training data allows us to uncover biases in the data and hence create better datasets.
Generative modeling is finding p(x). Conditional generative modeling is p(x|y).
Latent variable models are a type of generative model. They include AEs, VAEs and GANs.
An Auto Encoder (AE) is used for compressing information in a latent layer or recreating an
original image from a corrupted image.
An encoder maps the input \(x\) to a low dimensional latent space \(z\). This is an unsupervised problem =>
we do not have labels associated with the training images. The decoder tries to reconstruct the original image \(x'\).
The objective is to minimize the distance between \(x\) and \(x'\).
\begin{equation}
L(x,x') = \|x-x'\|^2
\label{eq:loss_ae}
\end{equation}
The latent layer introduces a probability distribution on \(z\).
This equation along with the low dimensional nature of \(z\) introduces an information bottleneck,
which tries to compress as much information as possible about \(x\) into \(z\).
This is deterministic. As long as the green, yellow and pink boxes remain the same (no
change to the NN weights), we will get the same \(x'\) for the same input \(x\), meaning we can
reproduce the output every time.
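The deterministic behaviour and the reconstruction loss \eqref{eq:loss_ae} can be sketched in a few lines of NumPy. The layer sizes, the `tanh` non-linearity and the random (untrained) weights are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((2, 8))   # encoder weights: 8-dim input -> 2-dim latent
W_dec = rng.standard_normal((8, 2))   # decoder weights: 2-dim latent -> 8-dim output

def encode(x):
    """Map the input x to the low dimensional latent z."""
    return np.tanh(W_enc @ x)

def decode(z):
    """Map the latent z back to a reconstruction x'."""
    return W_dec @ z

def reconstruction_loss(x, x_rec):
    """Squared L2 distance ||x - x'||^2."""
    return float(np.sum((x - x_rec) ** 2))

x = rng.standard_normal(8)
z = encode(x)
x_rec_first = decode(z)
x_rec_second = decode(encode(x))      # same weights, same input

print(z.shape, reconstruction_loss(x, x_rec_first))
print(np.array_equal(x_rec_first, x_rec_second))
```

With fixed weights the two reconstructions are identical, which is exactly what the VAE changes by making \(z\) stochastic. Training would adjust `W_enc` and `W_dec` to reduce the loss.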
Variational Auto Encoder (VAE)
VAEs make this yellow box (\(z\)) stochastic. For each variable in \(z\) you learn an associated
\(\mu\) and \(\sigma\). This is what allows us to get new images, by sampling \(z\) from the
distribution defined by \(\mu\) and \(\sigma\), i.e. \(z(\mu, \sigma, \epsilon)\).
Now the encoder tries to learn the probability distribution of \(z\) given \(x\),
and the decoder tries to learn a new probability distribution of \(x'\) given \(z\) i.e.
\begin{equation} \label{eq:vae_encoder} \text{Encoder computes } q_\phi(z|x) \end{equation}
\begin{equation} \label{eq:vae_decoder} \text{Decoder computes } p_\theta(x|z) \end{equation}
\(\phi\) and \(\theta\) are the weights of the NNs. So the loss function for VAEs is
\begin{equation}
L(\phi, \theta, x) = (\text{reconstruction loss}) + (\text{regularization term})
\label{eq:loss_vae}
\end{equation}
This reconstruction loss is the same as the AE’s \eqref{eq:loss_ae}, while the regularization term
places a prior \(p(z)\) on \(z\) and pushes all the z’s we learn to follow this prior.
This way we learn \(q_\phi(z|x)\) (the encoder, \eqref{eq:vae_encoder}) to be as close to the prior \(p(z)\)
as possible i.e. we minimize
\begin{equation}
D(\underbrace{q_\phi(z|x)}_{\text{learned z}} || \underbrace{p(z)}_{\text{fixed prior on z}})
\label{eq:loss_vae_regularzn}
\end{equation}
This prevents the NN from overfitting. Equation \eqref{eq:loss_vae_regularzn} is actually the KL divergence between the
two distributions.
Why regularize?
We want the following properties in the latent space:
- Continuity - closer points in \(z\) should create similar images in \(x'\)
- Completeness - any point sampled from \(z\) must decode to a meaningful \(x'\)
How to select a prior? Gaussian FTW!
Regularization with a standard normal (Gaussian) prior helps enforce an information gradient in the latent space,
so that points and distances in \(z\) have some meaning.
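With a Gaussian prior the regularization term has a closed form. The sketch below assumes the encoder outputs a diagonal Gaussian \(\mathcal{N}(\mu, \sigma^2)\) and the prior is \(\mathcal{N}(0, I)\), for which the KL divergence is \(\frac{1}{2}\sum_i(\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2)\). The function name is illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma)))

# The encoder output already matches the prior: no penalty.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # 0.0

# Means pushed away from the origin are penalized.
print(kl_to_standard_normal([1.0, -1.0], [1.0, 1.0]))  # 1.0
```

The penalty grows as the learned \(\mu\) drifts from 0 or \(\sigma\) drifts from 1, which is what keeps the encodings of different inputs close together and overlapping in the latent space.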
Proof for ELBO:
We know that the expectation of a function \(g\) of a random variable \(x\) with density \(f(x)\) is
\begin{equation} \label{eq:elbo_one} \mathbb{E} [g(x)]=\int g(x)f(x)dx \end{equation}
By chain rule, we have
\begin{equation} \label{eq:elbo_two} P(x,y)=P(x|y)P(y) \end{equation}
From Bayes theorem, we have
\begin{equation} \label{eq:elbo_three} P(x|y)=\frac{P(y|x)P(x)}{P(y)} \end{equation}
KL divergence is given by
\begin{equation} \label{eq:elbo_four} D_{KL} (P||Q)=\int p(x) \log\left(\frac{p(x)}{q(x)}\right)dx \end{equation}
The likelihood of our data is
\(p(x)=\int p(x,z)dz\), the joint probability marginalized over \(z\). But this is intractable since we need to
calculate this over all z’s.
\(p(x)=\frac{p(x,z)}{p(z|x)}\) using \eqref{eq:elbo_two} but we don’t have the denominator.
So, using an approximation for it,
\(\underbrace{p_{\theta}(z|x)}_{\text{TRUE posterior}} \approx \underbrace{q_{\phi}(z|x)}_{\text{APPROX posterior}}\)
This is the encoder (see \eqref{eq:vae_encoder}). Take the log, so log-likelihood of our data is
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot 1\)
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot \int q_{\phi}(z|x)dz\)
since a probability density integrates to 1 over its domain. Bring \(\text{log }p_\theta(x)\) inside the integral, since it does not depend on \(z\)
\(\text{log }p_\theta(x)= \int\text{log }p_\theta(x) \cdot q_{\phi}(z|x)dz\)
which is of the form \eqref{eq:elbo_one}
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_{\theta}(x)]\)
Using \eqref{eq:elbo_two} inside the expectation,
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)}]\)
Multiplying by the same quantity,
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)} \cdot \frac{q_{\phi}(z|x)}{q_{\phi}(z|x)}]\)
Reordering
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)} \cdot \frac{q_{\phi}(z|x)}{p_\theta(z|x)}]\)
Since \(\log(ab)=\log(a)+\log(b)\)
\(\text{log }p_\theta(x)= \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}]}_{\text{ELBO}} + \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{q_{\phi}(z|x)}{p_\theta(z|x)}]}_{\text{KL Divergence}}\)
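Two consequences of this last line are worth writing out. This is a short sketch that additionally assumes the usual factorization \(p_\theta(x,z)=p_\theta(x|z)\,p(z)\), which the steps above do not state explicitly.
Since the KL divergence is never negative, dropping it gives a lower bound on the log-likelihood, hence the name Evidence Lower BOund:
\(\text{log }p_\theta(x) \geq \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}]\)
For a fixed \(\theta\) the left-hand side does not depend on \(\phi\), so raising the ELBO with respect to \(\phi\) must lower the KL divergence between the approximate and the true posterior by the same amount.
Using the factorization inside the ELBO,
\(\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x|z)\,p(z)}{q_{\phi}(z|x)}] = \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q_{\phi}(z|x)||p(z))}_{\text{regularization}}\)
so maximizing the ELBO is the same as minimizing a reconstruction loss plus the KL regularization term, which is the structure of the VAE loss \eqref{eq:loss_vae}.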
Reparametrization trick
Why can’t we backpropagate with a stochastic variable?
Technically we can, it’s just that it is very computationally intensive and what we compute may
have a very high variance, so we need to do it multiple times, without any guarantee that we will
converge to a correct answer at the end of it.
Sampling introduces randomness.
=> \(z\) is no longer a deterministic function of the parameters \(f(\mu, \sigma)\)
Reparametrization converts \(z \sim \mathcal{N}(\mu, \sigma^2)\) into
\(z_i=\mu_i + \sigma_i \epsilon_i\), which still gives an unbiased gradient estimator, but with a
smaller variance (see end of SGD post for intuition).
If \(z\) is assumed to be Gaussian, then \(z_i=\mu_i + \sigma_i \epsilon_i\) where
\(\epsilon_i \sim \mathcal{N}(0,1)\).
Then, \(\frac{ \partial z_i}{\partial \mu_i} = 1\) and
\(\frac{ \partial z_i}{\partial \sigma_i} = \epsilon_i\). \(\mu\) and \(\sigma\) are learned,
so gradients can be backpropagated to them, while \(\epsilon\) is sampled from a fixed distribution. Because this
distribution is Gaussian, we know an estimator and can calculate it relatively easily.
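The two partial derivatives can be checked numerically. The sketch below uses a toy objective \(f(z)=z^2\), an assumption made only because its expectation \(\mu^2+\sigma^2\) has known gradients \(2\mu\) and \(2\sigma\) to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5
eps = rng.standard_normal(200_000)     # noise, independent of mu and sigma

z = mu + sigma * eps                   # reparametrized sample of N(mu, sigma^2)

# Toy objective f(z) = z^2, so E[f(z)] = mu^2 + sigma^2.
# Chain rule through the deterministic map z(mu, sigma, eps):
df_dz = 2.0 * z
grad_mu = np.mean(df_dz * 1.0)         # dz/dmu = 1
grad_sigma = np.mean(df_dz * eps)      # dz/dsigma = eps

print(grad_mu, 2 * mu)         # Monte Carlo estimate next to the exact value 2*mu
print(grad_sigma, 2 * sigma)   # Monte Carlo estimate next to the exact value 2*sigma
```

All the randomness lives in `eps`, so the path from \(\mu\) and \(\sigma\) to the loss is an ordinary differentiable function and backpropagation works as usual.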
β - VAEs
Allows control over the latent space \(z\) by controlling the strength of the regularization
term in the loss function \eqref{eq:loss_vae}.
\begin{equation}
L(\phi, \theta, x, z, \beta) = (\text{reconstruction loss}) + \beta(\text{regularization term})
\label{eq:loss_betavae}
\end{equation}
This manifests in disentanglement, i.e. creating latent space variables which are not
correlated with each other. The image on the left shows head rotation and smile are entangled.
Disentanglement is encouraged mathematically by assuming a prior with diagonal covariance on \(z\).
What is the use of this? By looking at the latent variables in a dataset, we can find out if
it is fair and representative.
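A minimal sketch of the β-weighted loss, using the same sign convention as the VAE loss \eqref{eq:loss_vae} (the regularization term is added). It assumes a squared-error reconstruction loss and a standard normal prior with a diagonal Gaussian encoder; the function name and the toy numbers are illustrative.

```python
import numpy as np

def beta_vae_loss(x, x_rec, mu, sigma, beta=1.0):
    """Reconstruction loss plus beta times the KL regularization term."""
    x, x_rec = np.asarray(x, dtype=float), np.asarray(x_rec, dtype=float)
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    reconstruction = np.sum((x - x_rec) ** 2)
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))
    return float(reconstruction + beta * kl)

x, x_rec = [1.0, 2.0], [1.0, 1.0]      # reconstruction error = 1.0
mu, sigma = [1.0, -1.0], [1.0, 1.0]    # KL term = 1.0

plain = beta_vae_loss(x, x_rec, mu, sigma, beta=1.0)    # ordinary VAE loss
strong = beta_vae_loss(x, x_rec, mu, sigma, beta=4.0)   # stronger regularization
print(plain, strong)
```

Setting `beta=1.0` recovers the ordinary VAE loss, while larger values make deviations from the prior more expensive relative to reconstruction error.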
References: