VAE
13 Feb 2024
Generative modeling is a type of unsupervised learning in which a model learns the probability distribution of its training data. It can be used for:
- Density estimation - Learning the distribution allows us to identify outliers, and hence handle unpredictable behaviour.
- Sample generation - Learning the underlying distribution of the training data allows us to uncover biases in the data and hence create better datasets.
Generative modeling amounts to learning \(p(x)\); conditional generative modeling learns \(p(x|y)\). Latent variable models are one type of generative model; they include AEs, VAEs and GANs.
An autoencoder (AE) is used for compressing information into a latent layer or for recreating an original image from a corrupted image.
An encoder maps the input \(x\) to a low dimensional latent space \(z\). This is an unsupervised problem, i.e.
we do not have labels associated with the training images. The decoder tries to reconstruct the original image as \(x'\).
The objective is to minimize the distance between \(x\) and \(x'\).
\begin{equation}
L(x,x') = \|x-x'\|^2
\label{eq:loss_ae}
\end{equation}
The latent layer introduces a probability distribution on \(z\).
The loss \eqref{eq:loss_ae}, along with the low dimensional nature of \(z\), introduces an information bottleneck,
which forces the network to compress as much information as possible about \(x\) into \(z\).
The mapping is deterministic. As long as the green, yellow and pink boxes remain the same (no
change to the NN weights), we get the same \(x'\) for the same input \(x\), so the
result is reproducible every time.
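The encoder, decoder and loss \eqref{eq:loss_ae} can be sketched in a few lines of NumPy. This is only an illustration under assumed details: the layer sizes, the `tanh` nonlinearity and the random untrained weights are all hypothetical and do not come from any particular architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: an 8-dimensional input compressed to a 2-dimensional latent.
INPUT_DIM, LATENT_DIM = 8, 2

# Fixed (untrained) random weights stand in for the encoder and decoder networks.
W_enc = rng.normal(size=(LATENT_DIM, INPUT_DIM))
W_dec = rng.normal(size=(INPUT_DIM, LATENT_DIM))

def encode(x):
    """Map the input x to the low dimensional latent code z."""
    return np.tanh(W_enc @ x)

def decode(z):
    """Map the latent code z back to a reconstruction x'."""
    return W_dec @ z

def reconstruction_loss(x, x_prime):
    """Squared Euclidean distance ||x - x'||^2."""
    return float(np.sum((x - x_prime) ** 2))

x = rng.normal(size=INPUT_DIM)

# Two passes with unchanged weights give the same reconstruction.
x_prime_1 = decode(encode(x))
x_prime_2 = decode(encode(x))
loss = reconstruction_loss(x, x_prime_1)
```

Training would adjust `W_enc` and `W_dec` to reduce `loss`; the sketch only shows the forward pass and that it is repeatable.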
Variational Auto Encoder (VAE)
VAEs make this yellow box (\(z\)) stochastic. For each variable in \(z\) we learn an associated
\(\mu\) and \(\sigma\). This is what allows us to get new images: we sample
\(z(\mu, \sigma, \epsilon)\) from the learned distribution.
Now the encoder tries to learn the probability distribution of \(z\) given \(x\),
and the decoder tries to learn a new probability distribution of \(x'\) given \(z\), i.e.
\begin{equation} \label{eq:vae_encoder} \text{Encoder computes } q_\phi(z|x) \end{equation}
\begin{equation} \label{eq:vae_decoder} \text{Decoder computes } p_\theta(x|z) \end{equation}
\(\phi\) and \(\theta\) are the weights of the encoder and decoder NNs respectively. So the loss function for VAEs is
\begin{equation}
L(\phi, \theta, x) = (\text{reconstruction loss}) + (\text{regularization term})
\label{eq:loss_vae}
\end{equation}
The reconstruction loss is the same as the AE's \eqref{eq:loss_ae}, while the regularization term
places a prior \(p(z)\) on \(z\) and pushes the latent distributions we learn to follow this prior.
This way the encoder distribution \(q_\phi(z|x)\) \eqref{eq:vae_encoder} is learned to be as close to the prior \(p(z)\)
as possible, i.e. we minimize
\begin{equation}
D(\underbrace{q_\phi(z|x)}_{\text{learned z}} || \underbrace{p(z)}_{\text{fixed prior on z}})
\label{eq:loss_vae_regularzn}
\end{equation}
This prevents the NN from overfitting. The distance \(D\) in \eqref{eq:loss_vae_regularzn} is the KL divergence between the
two distributions.
Why regularize?
We want the following properties in the latent space:
- Continuity - points that are close in \(z\) should decode to similar images \(x'\)
- Completeness - any sample from \(z\) must decode to a meaningful \(x'\)
How to select a prior? Gaussian FTW!
Regularization with a standard normal (Gaussian) prior helps enforce an information gradient in the latent space, so that points and distances in \(z\) have some meaning.
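With a Gaussian prior the regularization term has a well-known closed form. The sketch below assumes two things that are not spelled out above: the encoder outputs a diagonal Gaussian \(\mathcal{N}(\mu, \sigma^2)\), parameterized by a log-variance for numerical convenience, and the prior is \(\mathcal{N}(0, I)\). The function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2)).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# The penalty vanishes only when the learned distribution equals the prior.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Moving the mean away from 0 (or the variance away from 1) is penalized.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

The further \(\mu\) drifts from 0 or \(\sigma\) from 1, the larger the penalty, which is what keeps the encodings of different inputs from scattering to distant, disconnected regions of the latent space.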
Proof for ELBO :
We know that the expectation of a function \(f(x)\) of a random variable \(x\) with density \(p(x)\) is
\begin{equation} \label{eq:elbo_one} \mathbb{E}_{p(x)} [f(x)]=\int f(x)p(x)dx \end{equation}
By chain rule, we have
\begin{equation} \label{eq:elbo_two} P(x,y)=P(x|y)P(y) \end{equation}
From Bayes theorem, we have
\begin{equation} \label{eq:elbo_three} P(x|y)=\frac{P(y|x)P(x)}{P(y)} \end{equation}
KL divergence is given by
\begin{equation} \label{eq:elbo_four} D_{KL} (P||Q)=\int p(x) \log\left(\frac{p(x)}{q(x)}\right)dx \end{equation}
The likelihood of our data is
\(p(x)=\int p(x,z)dz\), which is the joint probability marginalized over \(z\). But this is intractable since we need to
calculate it over all \(z\)'s.
\(p(x)=\frac{p(x,z)}{p(z|x)}\) using \eqref{eq:elbo_two}, but we don't have the denominator.
So, using an approximation for it,
\(\underbrace{p_{\theta}(z|x)}_{\text{TRUE posterior}} \approx \underbrace{q_{\phi}(z|x)}_{\text{APPROX posterior}}\)
This is the encoder (see \eqref{eq:vae_encoder}). Taking the log, the log-likelihood of our data is
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot 1\)
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot \int q_{\phi}(z|x)dz\)
since a probability density integrates to 1 over its domain. Bring \(\text{log }p_\theta(x)\) inside the integral, since it does not depend on \(z\):
\(\text{log }p_\theta(x)= \int\text{log }p_\theta(x) \cdot q_{\phi}(z|x)dz\)
which is of the form \eqref{eq:elbo_one}, so
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_{\theta}(x)]\)
Using \eqref{eq:elbo_two} inside the expectation,
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)}]\)
Multiplying and dividing by the same quantity \(q_{\phi}(z|x)\),
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)} \cdot \frac{q_{\phi}(z|x)}{q_{\phi}(z|x)}]\)
Reordering
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)} \cdot \frac{q_{\phi}(z|x)}{p_\theta(z|x)}]\)
Since \(\log(ab)=\log(a)+\log(b)\)
\(\text{log }p_\theta(x)= \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}]}_{\text{ELBO}} + \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{q_{\phi}(z|x)}{p_\theta(z|x)}]}_{\text{KL Divergence}}\)
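Two short steps, sketched here in the same notation, connect this identity to the loss \eqref{eq:loss_vae}. First, a KL divergence is never negative, so the first term is a lower bound on the log-likelihood (hence the name evidence lower bound):
\begin{equation} \text{log }p_\theta(x) \geq \mathbb{E}_{q_{\phi}(z|x)} \left[\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}\right] \end{equation}
Second, splitting the joint as \(p_\theta(x,z)=p_\theta(x|z)p(z)\) using \eqref{eq:elbo_two}, the ELBO becomes
\begin{equation} \text{ELBO} = \mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_\theta(x|z)] - D_{KL}(q_{\phi}(z|x)||p(z)) \end{equation}
Maximizing the ELBO is the same as minimizing its negative, which is a reconstruction term plus the regularization term \eqref{eq:loss_vae_regularzn}. If we additionally assume a Gaussian decoder with fixed variance (an assumption, not something derived above), \(-\text{log }p_\theta(x|z)\) reduces to the squared error \eqref{eq:loss_ae} up to constants.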
Reparametrization trick
Why can’t we backpropagate with a stochastic variable?
Technically we can, it’s just that it is very computationally intensive and what we compute may have a very high variance, so we need to do it multiple times, without any guarantee that we will converge to a correct answer at the end of it.
Sampling introduces randomness,
so \(z\) is no longer a deterministic function \(f(\mu, \sigma)\) of the parameters and gradients cannot flow through the sampling step.
If \(z\) is assumed to be Gaussian, reparametrization rewrites \(z \sim \mathcal{N}(\mu, \sigma^2)\) as
\(z_i=\mu_i + \sigma_i \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0,1)\), which still gives an unbiased estimator, but with a
smaller variance (see end of SGD post for intuition).
Then, \(\frac{ \partial z_i}{\partial \mu_i} = 1\) and
\(\frac{ \partial z_i}{\partial \sigma_i} = \epsilon_i\). \(\mu\) and \(\sigma\) are learned,
so gradients can be backpropagated through them, while \(\epsilon\) is sampled from a fixed distribution. Because this
distribution is Gaussian, we know an estimator and can calculate it relatively easily.
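The trick is small enough to check numerically. The NumPy sketch below uses illustrative values for \(\mu\) and \(\sigma\) and a hypothetical function name; it draws \(\epsilon\) separately and builds \(z\) from it, then verifies the two partial derivatives by finite differences with \(\epsilon\) held fixed.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """Build z = mu + sigma * eps; all randomness lives in eps."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])     # illustrative values, not learned here
sigma = np.array([0.5, 2.0])

# Drawing eps ~ N(0, 1) and transforming it yields samples from N(mu, sigma^2).
eps_batch = rng.standard_normal(size=(20000, 2))
samples = reparameterize(mu, sigma, eps_batch)
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)

# With eps held fixed, z is an ordinary differentiable function of mu and sigma.
eps = rng.standard_normal(size=2)
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps) - reparameterize(mu, sigma, eps)) / h
dz_dsigma = (reparameterize(mu, sigma + h, eps) - reparameterize(mu, sigma, eps)) / h
```

Here `dz_dmu` comes out close to 1 and `dz_dsigma` close to `eps`, matching the derivatives above. An autodiff framework would compute the same gradients exactly.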
β - VAEs
β-VAEs allow control over the latent space \(z\) by controlling the strength of the regularization
term in the loss function \eqref{eq:loss_vae}.
\begin{equation}
L(\phi, \theta, x, z, \beta) = (\text{reconstruction loss}) + \beta(\text{regularization term})
\label{eq:loss_betavae}
\end{equation}
This manifests as disentanglement, i.e. latent space variables which are not
correlated with each other. The image on the left shows head rotation and smile being entangled.
Mathematically, this is done by assuming a prior on \(z\) with diagonal covariance.
What is the use of this? By looking at the latent variables learned from a dataset, we can find out if
it is fair and representative.
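As a rough sketch of how \(\beta\) enters the loss, the function below adds a weighted KL penalty to the squared reconstruction error. It assumes a diagonal Gaussian encoder parameterized by a log-variance and a standard normal prior, so that the KL term has a closed form; the function and argument names are hypothetical.

```python
import numpy as np

def beta_vae_loss(x, x_prime, mu, log_var, beta=1.0):
    """Reconstruction loss plus beta times the KL regularization term."""
    x, x_prime = np.asarray(x, dtype=float), np.asarray(x_prime, dtype=float)
    mu, log_var = np.asarray(mu, dtype=float), np.asarray(log_var, dtype=float)
    # Squared error between the input and its reconstruction.
    reconstruction = np.sum((x - x_prime) ** 2)
    # Closed-form KL( N(mu, exp(log_var)) || N(0, I) ) for a diagonal Gaussian.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return float(reconstruction + beta * kl)

x = np.array([1.0, 2.0])
x_prime = np.array([1.0, 1.0])
mu = np.array([1.0])
log_var = np.array([0.0])

loss_vae = beta_vae_loss(x, x_prime, mu, log_var, beta=1.0)       # plain VAE
loss_beta_vae = beta_vae_loss(x, x_prime, mu, log_var, beta=4.0)  # stronger regularization
```

Setting `beta=1` recovers the plain VAE loss, while `beta > 1` weights the prior more heavily at the cost of reconstruction quality.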