VAEs-GANs: What happens when we combine them?

Abhishek Kalra
Oct 27, 2021


The term VAE-GAN was introduced by Larsen et al. (2016) in their paper “Autoencoding beyond pixels using a learned similarity metric”. Noting how critical the choice of similarity metric is in generative models such as the Variational Autoencoder (VAE), the authors explored alternative architectures and similarity metrics. Their experiments showed that a combination of a VAE and a generative adversarial network (GAN) outperforms traditional VAEs. The following sections discuss the architectures of VAEs and GANs and the improvements gained by combining them.

Variational Autoencoders (VAEs):

Variational Autoencoders (VAEs) are powerful generative models with applications as diverse as generating fake human faces and producing purely synthetic music. A common textbook description of VAEs is that they “provide probabilistic descriptions of observations in latent spaces.” Intuitively, this means VAEs store latent attributes as probability distributions. This ability to encode the latent attributes of the input probabilistically (as a distribution) rather than deterministically (as a single value), as a standard autoencoder does, gives them a superior ability for reconstruction and generation.

A typical variational autoencoder is nothing other than a cleverly designed deep neural network consisting of a pair of networks: the encoder and the decoder. The encoder can be described as a variational inference network responsible for mapping an input x to a posterior distribution q_θ(z|x) over the latent variables. The likelihood p(x|z) is then parametrized by the decoder, a generative network that takes latent variables z as input and projects them back to a data distribution p_ϕ(x|z).
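The probabilistic encoding step can be sketched with the reparameterization trick. A minimal NumPy sketch (the linear encoder and all weights here are toy assumptions of mine, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "encoder": maps an input x to the parameters
# (mean and log-variance) of a diagonal Gaussian posterior q(z|x).
def encode(x, w_mu, w_logvar):
    mu = x @ w_mu
    logvar = x @ w_logvar
    return mu, logvar

# Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I),
# so the sampling step stays differentiable with respect to mu and logvar.
def sample_z(mu, logvar):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x = rng.standard_normal((4, 8))          # batch of 4 inputs, 8 features
w_mu = rng.standard_normal((8, 2))       # toy weights, latent dimension 2
w_logvar = rng.standard_normal((8, 2))

mu, logvar = encode(x, w_mu, w_logvar)
z = sample_z(mu, logvar)                 # one stochastic latent code per input
print(z.shape)                           # (4, 2)
```

Each forward pass draws a different z from the same posterior, which is exactly the “distribution instead of single value” behavior described above.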

Figure 1: VAE Architecture

The VAE regularizes the encoder by imposing a prior p(z) over the latent distribution; typically z ∼ N(0, I) is chosen. The VAE loss is the sum of the expected negative log-likelihood (the reconstruction error) and a prior regularization term:

Equation-1 VAE loss Function
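As a concrete sketch of the two loss terms (my own illustration, assuming a diagonal-Gaussian posterior and a fixed-variance Gaussian decoder, so the likelihood term reduces to a squared error):

```python
import numpy as np

# Closed-form KL divergence between the diagonal Gaussian posterior
# q(z|x) = N(mu, diag(exp(logvar))) and the standard-normal prior p(z) = N(0, I).
def kl_prior(mu, logvar):
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

# Pixel-wise reconstruction error: the negative expected log-likelihood
# reduces to a squared error under a fixed-variance Gaussian decoder.
def reconstruction_error(x, x_recon):
    return np.sum((x - x_recon) ** 2, axis=1)

def vae_loss(x, x_recon, mu, logvar):
    return np.mean(reconstruction_error(x, x_recon) + kl_prior(mu, logvar))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
loss = vae_loss(x, x_recon=np.zeros_like(x),
                mu=np.zeros((4, 2)), logvar=np.zeros((4, 2)))
print(loss)  # a non-negative scalar
```

Note that when the posterior already matches the prior (mu = 0, logvar = 0), the KL term vanishes and only the reconstruction error remains.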

Generative Adversarial Networks (GANs):

GANs, just like VAEs, belong to a class of generative algorithms used in unsupervised machine learning. A GAN architecture (Figure 2) consists of two neural networks, a generator and a discriminator, engaged in a ‘zero-sum’ game. The generator takes noise as input and produces samples. The discriminator is then asked to evaluate these samples and distinguish them from training data. Much like a VAE decoder, the generator maps latent variables and parameters to data distributions.

Figure 2: GAN Architecture

The GAN objective is to find the binary classifier that gives the best possible discrimination between true and generated data while simultaneously encouraging the generator to fit the true data distribution. This is formulated as a minimax game over the binary cross-entropy, which the discriminator maximizes and the generator minimizes:

Equation-2: GAN Loss Function
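The two sides of this minimax objective can be sketched as follows (the function names and the non-saturating generator variant are my choices, not from the original post):

```python
import numpy as np

# Binary cross-entropy terms of the GAN minimax objective
# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].
# d_real, d_fake are the discriminator's outputs in (0, 1).

def discriminator_loss(d_real, d_fake, eps=1e-12):
    # The discriminator *maximizes* V, i.e. minimizes its negation.
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    # Non-saturating variant: the generator maximizes log D(G(z)).
    return -np.mean(np.log(d_fake + eps))

# A confident, correct discriminator incurs low loss on this toy batch.
print(discriminator_loss(np.array([0.99, 0.98]), np.array([0.01, 0.02])))
# A generator that fools the discriminator incurs low generator loss.
print(generator_loss(np.array([0.99])))
```

The non-saturating generator loss is the variant Goodfellow et al. (2014) recommend in practice, since the literal minimax form gives weak gradients early in training.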

Combining the VAE and GAN Architectures:

A major drawback of VAEs is the unrealistic and blurry outputs they generate (Dosovitskiy & Brox, 2016). This has to do with the way data distributions are recovered and loss functions are calculated in VAEs. VAEs use element-wise measures like the squared error (Equation 1), which, although simple, are not suitable for complex data such as images. Larsen et al. therefore recommend using a higher-level and sufficiently invariant representation to measure image similarity.

The authors accomplish this by jointly training a VAE and a generative adversarial network (GAN) (Goodfellow et al., 2014) and using the GAN discriminator’s learned feature representations as a similarity metric.

Figure 3: VAE-GAN Combined Architecture

The assimilated VAE-GAN architecture merges the VAE decoder and the GAN generator by letting them share parameters and training them jointly. Simultaneously, the VAE’s pixel-wise reconstruction error is replaced with a feature-wise error measured inside the GAN discriminator, which gives the VAE a more abstract reconstruction target and transfers the properties of images learned by the discriminator. The pixel-wise term L_pixel-llike in Equation 1 is therefore replaced with L_Disl-llike, a reconstruction error expressed in the l-th layer of the discriminator.

The combined model is therefore trained using the following loss function:

Equation 3: VAE-GAN Loss
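For reference, the feature-wise reconstruction term and the combined objective from Larsen et al. (2016) can be written (my transcription, so treat it as a sketch rather than the paper's exact typography) as:

```latex
% Feature-wise reconstruction error in the l-th discriminator layer,
% where Dis_l(x) denotes the l-th layer activations of the discriminator
\mathcal{L}^{\mathrm{Dis}_l}_{\mathrm{llike}}
  = -\,\mathbb{E}_{q(z\mid x)}\!\left[\log p\!\left(\mathrm{Dis}_l(x)\mid z\right)\right]

% Combined VAE-GAN training objective
\mathcal{L} = \mathcal{L}_{\mathrm{prior}}
            + \mathcal{L}^{\mathrm{Dis}_l}_{\mathrm{llike}}
            + \mathcal{L}_{\mathrm{GAN}}
```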
Figure 4: VAE-GAN Information Flow

Figure 4 shows the flow of information through the VAE-GAN architecture.
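The learned similarity metric amounts to comparing discriminator features instead of raw pixels. A toy NumPy sketch (the linear-ReLU “discriminator layer” here is an assumption of mine, standing in for a trained convolutional layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discriminator feature extractor: Dis_l(x) returns the
# activations of the discriminator's l-th layer (here a toy linear + ReLU map).
w_dis = rng.standard_normal((8, 16))
def dis_features(x):
    return np.maximum(x @ w_dis, 0.0)

# Feature-wise reconstruction error: instead of comparing pixels,
# compare the discriminator's l-th-layer responses to the original
# input x and to its reconstruction x_recon (a learned similarity metric).
def feature_reconstruction_error(x, x_recon):
    return np.mean((dis_features(x) - dis_features(x_recon)) ** 2)

x = rng.standard_normal((4, 8))
perfect = feature_reconstruction_error(x, x)   # 0.0 for an exact copy
noisy = feature_reconstruction_error(x, x + 0.5 * rng.standard_normal(x.shape))
print(perfect, noisy)  # error grows as the reconstruction degrades
```

Because the features come from a discriminator trained to tell real images from fakes, errors in this space penalize perceptually salient differences rather than per-pixel deviations, which is what reduces the characteristic VAE blurriness.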

The end result is a model that combines the advantage of a GAN as a high-quality generative model with that of a VAE as a method that produces an encoder of data into the latent space z. The architecture provides the following benefits:

Limiting error signals to the relevant parts of the network: although the VAE and the GAN train simultaneously, not all network parameters are updated with respect to the full combined loss. The discriminator minimizes only the GAN loss and does not minimize L_Disl-llike (the feature-wise reconstruction loss), as doing so would collapse the discriminator to 0. Furthermore, better results are achieved by not backpropagating the error signal from L_GAN to the encoder.
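This selective credit assignment can be sketched as a routing table, following my reading of the training algorithm in Larsen et al. (2016); the γ weight and its toy value here are illustrative assumptions:

```python
# Sketch (assumptions mine): each sub-network descends only on the
# loss terms relevant to it, rather than on the full combined loss.
def update_rules(l_prior, l_feat_recon, l_gan, gamma=1e-3):
    """Return the per-network objective each parameter set descends on.

    l_prior      : KL regularization term
    l_feat_recon : feature-wise reconstruction term L_Disl-llike
    l_gan        : GAN binary cross-entropy term
    gamma        : toy weight trading off reconstruction vs. fooling the GAN
    """
    return {
        # Encoder: shape the posterior and reconstruct in feature space;
        # it receives NO error signal from the GAN loss.
        "encoder": l_prior + l_feat_recon,
        # Decoder/generator: reconstruct well AND fool the discriminator
        # (descending on -l_gan means ascending the GAN objective).
        "decoder": gamma * l_feat_recon - l_gan,
        # Discriminator: only the GAN loss; minimizing the reconstruction
        # term here would collapse its features to 0.
        "discriminator": l_gan,
    }

objectives = update_rules(l_prior=0.5, l_feat_recon=2.0, l_gan=0.7)
print(objectives["encoder"])  # 2.5
```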

References:

Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016, June). Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning (pp. 1558–1566). PMLR.

Dosovitskiy, A., & Brox, T. (2016). Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
