ML papers: one a week

VAE

Variational Auto Encoders, Discriminative modeling, Generative modeling

Generative modeling is a type of unsupervised learning in which a model learns the probability distribution of the training data given to it. It can be used for:

  1. Density estimation - Learning the distribution allows us to identify outliers, and hence handle unpredictable behaviour.
  2. Sample generation - Learning the underlying distribution of the training data allows us to uncover biases in the data and hence create better datasets.

Generative modeling is finding \(p(x)\). Conditional generative modeling is finding \(p(x|y)\). Latent variable models are a class of generative models; they include AEs, VAEs and GANs.

An Auto Encoder (AE) is used for compressing information into a latent layer, or for recreating an original image from a corrupted one.

Auto encoder application

An encoder maps the input \(x\) to a low dimensional latent space \(z\). This is an unsupervised problem => we do not have labels associated with the training images. The decoder tries to reconstruct the original image \(x'\). The objective is to minimize the distance between \(x\) and \(x'\).
\begin{equation} L(x,x') = ||x-x'||^2 \label{eq:loss_ae} \end{equation}
The latent layer introduces a probability distribution on \(z\).

Auto encoder

This equation, along with the low dimensional nature of \(z\), introduces an information bottleneck that forces the network to compress as much information as possible about \(x\) into \(z\).
This is deterministic: as long as the green, yellow and pink boxes remain the same (no change to the NN weights), we will get the same \(x'\) for the same input \(x\), meaning we can reproduce the output every time.
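
To make the setup concrete, here is a minimal sketch of such an auto encoder in PyTorch (the framework and layer sizes are my own assumptions, not from the figure above); the training objective is exactly \eqref{eq:loss_ae}:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Minimal fully-connected AE: x -> z -> x'. Sizes are illustrative."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps the input x to a low dimensional latent z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))
        # Decoder: tries to reconstruct x' from z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)          # deterministic bottleneck
        return self.decoder(z)       # reconstruction x'

x = torch.rand(16, 784)              # dummy batch of flattened images
model = AutoEncoder()
loss = F.mse_loss(model(x), x)       # L(x, x') = ||x - x'||^2
loss.backward()
```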

Variational Auto Encoder (VAE)
VAEs make this yellow box (\(z\)) stochastic. For each variable in \(z\) you learn an associated \(\mu\) and \(\sigma\). This is what allows us to generate new images: we sample \(z(\mu, \sigma, \epsilon)\) from the learned distribution.

Variational auto encoder

Now the encoder tries to learn the probability distribution of \(z\) given \(x\), and the decoder tries to learn the probability distribution of \(x'\) given \(z\), i.e.
\begin{equation} \label{eq:vae_encoder} \text{Encoder computes } q_\phi(z|x) \end{equation}
\begin{equation} \label{eq:vae_decoder} \text{Decoder computes } p_\theta(x|z) \end{equation}
\(\phi\) and \(\theta\) are the weights of the encoder and decoder NNs respectively. So the loss function for VAEs is
\begin{equation} L(\phi, \theta, x) = (\text{reconstruction loss}) + (\text{regularization term}) \label{eq:loss_vae} \end{equation} The reconstruction loss is the same as the AE's \eqref{eq:loss_ae}, while the regularization term places a prior on \(z\), so that every \(z\) we learn is encouraged to follow this prior \(p(z)\). This way the learned \(q_\phi(z|x)\) (the encoder, \eqref{eq:vae_encoder}) stays as close to the prior \(p(z)\) as possible, i.e. we minimize
\begin{equation} D(\underbrace{q_\phi(z|x)}_{\text{learned z}} || \underbrace{p(z)}_{\text{fixed prior on z}}) \label{eq:loss_vae_regularzn} \end{equation}
This prevents the NN from overfitting. Equation \eqref{eq:loss_vae_regularzn} is actually the KL divergence between the two distributions.
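
As a sketch of how the two terms of \eqref{eq:loss_vae} could be computed (assuming a diagonal Gaussian \(q_\phi(z|x)\) and a standard normal prior, the choice discussed below; PyTorch and the function name are my own assumptions):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def vae_loss(x, x_recon, mu, sigma):
    """Reconstruction loss + regularization term, as in the VAE loss above.

    mu, sigma: encoder outputs, one pair per latent dimension.
    """
    recon = F.mse_loss(x_recon, x, reduction="sum")              # same form as the AE loss
    q_z = Normal(mu, sigma)                                      # learned q_phi(z|x)
    p_z = Normal(torch.zeros_like(mu), torch.ones_like(sigma))   # fixed prior p(z)
    reg = kl_divergence(q_z, p_z).sum()                          # D(q_phi(z|x) || p(z))
    return recon + reg
```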

Why regularize?
We want the following properties in the latent space:

  1. Continuity - points that are close together in \(z\) should decode to similar images \(x'\)
  2. Completeness - any point sampled from \(z\) should decode to a meaningful \(x'\)
Non regularized latent space
Regularized latent space

How to select a prior? Gaussian FTW!

Regularization with gaussian

Regularization with a standard Gaussian prior helps enforce a smooth information gradient in the latent space, so that points and distances in \(z\) have meaning.
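
For completeness (this is a standard result, not derived in this post): when \(q_\phi(z|x)\) is a diagonal Gaussian \(\mathcal{N}(\mu, \sigma^2)\) and the prior is a standard normal, the regularization term \eqref{eq:loss_vae_regularzn} has a simple closed form, which is what gets implemented in practice:
\begin{equation} D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,||\, \mathcal{N}(0, 1)\big) = \frac{1}{2} \sum_i \left( \mu_i^2 + \sigma_i^2 - \text{log }\sigma_i^2 - 1 \right) \end{equation}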

Proof for ELBO:
We know that the expectation of a function \(f\) of a random variable with density \(p\) is
\begin{equation} \label{eq:elbo_one} \mathbb{E}_{p(x)} [f(x)]=\int f(x)\,p(x)\,dx \end{equation}
By the chain rule of probability, we have
\begin{equation} \label{eq:elbo_two} P(x,y)=P(x|y)P(y) \end{equation}
From Bayes theorem, we have
\begin{equation} \label{eq:elbo_three} P(x|y)=\frac{P(y|x)P(x)}{P(y)} \end{equation}
KL divergence is given by
\begin{equation} \label{eq:elbo_four} D_{KL} (P||Q)=\int p(x) log(\frac{p(x)}{q(x)})dx \end{equation}
The likelihood of our data is
\(p(x)=\int p(x,z)dz\), i.e. the joint probability marginalized over \(z\). But this is intractable, since we would need to integrate over all possible \(z\)'s.
\(p(x)=\frac{p(x,z)}{p(z|x)}\) using \eqref{eq:elbo_two}, but we do not have the denominator. So, we use an approximation for it:
\(\underbrace{p_{\theta}(z|x)}_{\text{TRUE posterior}} \approx \underbrace{q_{\phi}(z|x)}_{\text{APPROX posterior}}\)
This is the encoder (see \eqref{eq:vae_encoder}). Take the log, so log-likelihood of our data is
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot 1\)
\(\text{log }p_\theta(x)=\text{log }p_\theta(x) \cdot \int q_{\phi}(z|x)dz\)
since the integral of a probability density over its domain is always 1. Bring \(\text{log }p_\theta(x)\) inside the integral, since the integral is over \(dz\), not \(dx\):
\(\text{log }p_\theta(x)= \int\text{log }p_\theta(x) \cdot q_{\phi}(z|x)dz\)
This is of the form \eqref{eq:elbo_one}, so
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_{\theta}(x)]\)
Using \eqref{eq:elbo_two} inside the expectation,
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)}]\)
Multiplying and dividing by the same quantity \(q_{\phi}(z|x)\),
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{p_\theta(z|x)} \cdot \frac{q_{\phi}(z|x)}{q_{\phi}(z|x)}]\)
Reordering
\(\text{log }p_\theta(x)= \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)} \cdot \frac{q_{\phi}(z|x)}{p_\theta(z|x)}]\)
Since \(log(ab)=log(a)+log(b)\)
\(\text{log }p_\theta(x)= \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}]}_{\text{ELBO}} + \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{q_{\phi}(z|x)}{p_\theta(z|x)}]}_{\text{KL Divergence}}\)
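
To connect this back to the loss \eqref{eq:loss_vae}: the KL divergence term on the right is always non-negative, so the ELBO is a lower bound on \(\text{log }p_\theta(x)\), and maximizing it pushes up the data likelihood. Applying the chain rule \eqref{eq:elbo_two} once more inside the ELBO (a standard step, added here for completeness) recovers the two familiar pieces:
\begin{equation} \text{ELBO} = \mathbb{E}_{q_{\phi}(z|x)} [\text{log }\frac{p_\theta(x,z)}{q_{\phi}(z|x)}] = \underbrace{\mathbb{E}_{q_{\phi}(z|x)} [\text{log }p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q_{\phi}(z|x) || p(z))}_{\text{regularization}} \end{equation}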

Reparametrization trick

Why can’t we backpropagate with a stochastic variable?
Technically we can, but it is computationally intensive and the gradient estimates have very high variance, so we would need to sample many times, without any guarantee of converging to a correct answer at the end of it.

Sampling introduces randomness.
=> \(z\) is no longer a deterministic function \(f(\mu, \sigma)\) of the parameters, so gradients cannot flow through the sampling step directly.

Reparametrization trick

Reparametrization converts \(z \sim \mathcal{N}(\mu, \sigma^2)\) into \(z_i=\mu_i + \sigma_i \epsilon_i\), which is still an unbiased estimator, but with a smaller variance (see the end of the SGD post for intuition). If \(z\) is assumed to have a Gaussian prior, then \(z_i=\mu_i + \sigma_i \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0,1)\). Then \(\frac{ \partial z_i}{\partial \mu_i} = 1\) and \(\frac{ \partial z_i}{\partial \sigma_i} = \epsilon_i\). \(\mu\) and \(\sigma\) are learned, so gradients can be backpropagated through them, while \(\epsilon\) is simply sampled from a fixed distribution. Because this distribution is a standard Gaussian, sampling and differentiating through it is straightforward.
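
A minimal sketch of the trick in code (PyTorch assumed; the function name and the log-variance parametrization are my own choices, commonly used to keep \(\sigma\) positive):

```python
import torch

def reparameterize(mu, log_var):
    """z_i = mu_i + sigma_i * eps_i, with eps_i ~ N(0, 1).

    Gradients flow through mu and sigma; eps is just a sampled constant.
    """
    sigma = torch.exp(0.5 * log_var)     # sigma > 0 by construction
    eps = torch.randn_like(sigma)        # eps_i ~ N(0, 1)
    return mu + sigma * eps

mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()
print(mu.grad)       # all ones: dz_i/dmu_i = 1
print(log_var.grad)  # proportional to eps_i, since dz_i/dsigma_i = eps_i
```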

β - VAEs
β-VAEs allow control over the latent space \(z\) by controlling the strength of the regularization term in the loss function \eqref{eq:loss_vae}. \begin{equation} L(\phi, \theta, x, z, \beta) = (\text{reconstruction loss}) + \beta(\text{regularization term}) \label{eq:loss_betavae} \end{equation} This manifests as disentanglement, i.e. latent space variables that are not correlated with each other. The image on the left shows that head rotation and smile are entangled.

Disentanglement

This is done mathematically by assuming a diagonal (factorized) prior on \(z\).
What is the use of this? By looking at the latent variables learned on a dataset, we can find out whether the dataset is fair and representative.
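
In code, the change from \eqref{eq:loss_vae} to \eqref{eq:loss_betavae} is a single extra factor (a sketch with the same assumptions as the earlier hypothetical `vae_loss` example):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def beta_vae_loss(x, x_recon, mu, sigma, beta=4.0):
    """beta = 1 recovers the plain VAE loss; beta > 1 weights the
    regularization term more heavily, encouraging disentangled latents."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    q_z = Normal(mu, sigma)
    p_z = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
    return recon + beta * kl_divergence(q_z, p_z).sum()
```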

References: