GANs
27 Feb 2024What if I want to create new images from a very complex latent space \(z\)? Since it is very difficult to model it, I use the encoder part of a VAE to generate this \(z\).
Now the Generator (Decoder) tries to synthesize new images and the Discriminator (Encoder) tries to identify if that image came from the real world (\(x\)) or is fake (\(x'\)). The generator gets better at creating fake images which are very similar to real images and the discriminator gets better at identifying the important minute details which can help identify a fake. So we have competing objective functions for the generator and discriminator.
The discriminator’s loss function is
\begin{equation}
\text{arg max}_D \mathbb{E}_{z,x}[\underbrace{log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}]
\label{eq:loss_generator}
\end{equation}
It tries to maximize the probability that the discriminator identifies a fake input as fake
and a real input as real.
The generator’s loss function is
\begin{equation}
\text{arg min}_G \mathbb{E}_{z,x}[\underbrace{log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}]
\label{eq:loss_discriminator}
\end{equation}
It tries to minimize the probability that the discriminator identifies a fake input as fake
and a real input as real.
Putting this together, the objective function of a GAN is
\begin{equation}
\text{arg min}_G \text{max}_D \mathbb{E}_{z,x}[\underbrace{log D(\underbrace{G(z)}_{\text{generator’s output}})}_{\text{Discriminator’s estimate that this is fake}} + \underbrace{log(1-\underbrace{D(\underbrace{x}_{\text{real instance}})}_{\text{Discriminator’s estimate that this is fake}})}_{\text{Discriminator’s estimate that this is real}}]
\label{eq:loss_gan}
\end{equation}
Whereby we get a Generator that creates fake instances that fools the best Discriminator.
This finally created Generator is what we use to sample from noise to create new images.
Style GANs
Transfer style from one distribution to another. Example: ageing/ de-aging by transferring
from a source of old/ young data respectively to my face.
Conditional GANs
To control the type of output which can be synthesized. Example: train the GAN on pairs of
(street scene, semantic map of the scene) to then create synthesized street scenes given a
semantic map. Or train the GAN on pairs of (hand drawn image, photo) to then create realistic
photos from newly drawn scribbles.
Cycle GANs
To transfer the style between domains without paired data. Example: transform summer photos
to winter photos or transform a photo to a specific painting style.
Quality vs Coverage in generative models
GANs make high quality images but can suffer from mode collapse. This is the phenomenon where
the generator gets stuck at creating only specific images (of some modes/classes) which fool the
discriminator and thus fails to create diverse samples across the entire data distribution.
Compared to this, VAEs have a broader coverage but the samples may not be of high quality.
Autoregressive models like Transformers are a good compromise but they have no latent variables.
In comparison, diffusion models also offer good performance with a stable training and have
latent variables to perform efficient sample editing.