<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Generative Models |</title><link>https://tiao.io/tags/generative-models/</link><atom:link href="https://tiao.io/tags/generative-models/index.xml" rel="self" type="application/rss+xml"/><description>Generative Models</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 15 Jun 2019 13:00:00 +0000</lastBuildDate><image><url>https://tiao.io/media/icon_hu_9c2a75fde2335590.png</url><title>Generative Models</title><link>https://tiao.io/tags/generative-models/</link></image><item><title>Tech Talk: Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/events/amazon-ml-tech-talk-2019/</link><pubDate>Sat, 15 Jun 2019 13:00:00 +0000</pubDate><guid>https://tiao.io/events/amazon-ml-tech-talk-2019/</guid><description/></item><item><title>Density Ratio Estimation for KL Divergence Minimization between Implicit Distributions</title><link>https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/</link><pubDate>Mon, 27 Aug 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/</guid><description>&lt;!-- TODO: Clarify that optimal classifier refers to the classifier that minimizes the Bayes risk --&gt;
&lt;p&gt;The Kullback-Leibler (KL) divergence between distributions $p$ and $q$ is
defined as&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] :=
\mathbb{E}_{p(x)} \left [ \log \left ( \frac{p(x)}{q(x)} \right ) \right ].
$$&lt;p&gt;It can be expressed more succinctly as&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] = \mathbb{E}_{p(x)} [ \log r^{*}(x) ],
$$&lt;p&gt;where $r^{*}(x)$ is defined to be the ratio between the densities $p(x)$ and
$q(x)$,&lt;/p&gt;
$$
r^{*}(x) := \frac{p(x)}{q(x)}.
$$&lt;p&gt;This density ratio is crucial for computing not only the KL divergence but for
all $f$-divergences, defined as&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
$$
\mathcal{D}_f[p(x) || q(x)] :=
\mathbb{E}_{q(x)} \left [ f \left ( \frac{p(x)}{q(x)} \right ) \right ].
$$&lt;p&gt;Rarely can this expectation (i.e. integral) be calculated analytically&amp;mdash;in
most cases, we must resort to Monte Carlo approximation methods, which
explicitly require the density ratio.
In the more severe case where this density ratio is itself unavailable, because one
or both of the densities $p(x)$ and $q(x)$ cannot be evaluated, we must resort to methods for
&lt;em&gt;density ratio estimation&lt;/em&gt;.
In this post, we illustrate how to perform density ratio estimation by
exploiting its tight correspondence to &lt;em&gt;probabilistic classification&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id="example-univariate-gaussians"&gt;Example: Univariate Gaussians&lt;/h3&gt;
&lt;p&gt;Let us consider the following univariate Gaussian distributions as the running
example for this post,&lt;/p&gt;
$$
p(x) = \mathcal{N}(x \mid 1, 1^2),
\qquad
\text{and}
\qquad
q(x) = \mathcal{N}(x \mid 0, 2^2).
$$&lt;p&gt;We will be using &lt;em&gt;TensorFlow&lt;/em&gt;, &lt;em&gt;TensorFlow Probability&lt;/em&gt;, and &lt;em&gt;Keras&lt;/em&gt; in the
code snippets throughout this post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_probability&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We first instantiate the distributions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Their densities are shown below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Univariate Gaussian densities"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/gaussian_1d_densities.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;For any pair of distributions, we can implement their density ratio function $r$
as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Let&amp;rsquo;s create the density ratio function for the Gaussian distributions we just
instantiated:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This density ratio function is plotted as the orange dotted line below,
alongside the individual densities shown in the previous plot:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Ratio of Gaussian densities"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/gaussian_1d_density_ratios.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h2 id="analytical-form"&gt;Analytical Form&lt;/h2&gt;
&lt;p&gt;For our running example, we picked $p(x)$ and $q(x)$ to be Gaussians so that
it is possible to integrate out $x$ and compute the KL divergence &lt;em&gt;analytically&lt;/em&gt;.
When we introduce the approximate methods later, this will provide us with a &amp;ldquo;gold
standard&amp;rdquo; to benchmark against.&lt;/p&gt;
&lt;p&gt;In general, for Gaussian distributions&lt;/p&gt;
$$
p(x) = \mathcal{N}(x \mid \mu_p, \sigma_p^2),
\qquad
\text{and}
\qquad
q(x) = \mathcal{N}(x \mid \mu_q, \sigma_q^2),
$$&lt;p&gt;
it is easy to verify that
&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{KL}}[ p(x) || q(x) ]
= \log \sigma_q - \log \sigma_p - \frac{1}{2}
\left [
1 - \left ( \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{\sigma_q^2} \right )
\right ].
$$&lt;p&gt;This is implemented below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_kl_divergence_gaussians&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can use this to compute the KL divergence between $p(x)$ and $q(x)$
&lt;em&gt;exactly&lt;/em&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_kl_divergence_gaussians&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.44314718&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Equivalently, we could also use &lt;code&gt;kl_divergence&lt;/code&gt; from &lt;em&gt;TensorFlow
Probability&amp;ndash;Distributions&lt;/em&gt; (&lt;code&gt;tfp.distributions&lt;/code&gt;), which implements the
analytical closed-form expression of the KL divergence between distributions
when one exists.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kl_divergence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.44314718&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="monte-carlo-estimation--prescribed-distributions"&gt;Monte Carlo Estimation &amp;mdash; prescribed distributions&lt;/h2&gt;
&lt;p&gt;For distributions where their KL divergence is not analytically tractable, we
may appeal to Monte Carlo (MC) estimation:&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)]
&amp; = \mathbb{E}_{p(x)} [ \log r^{*}(x) ] \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} \log r^{*}(x_p^{(i)}),
\quad x_p^{(i)} \sim p(x).
\end{align*}
$$&lt;p&gt;Clearly, this requires the density ratio $r^{*}(x)$ and, in turn, the densities
$p(x)$ and $q(x)$ to be analytically tractable. Distributions for which the
density function can be readily evaluated are sometimes referred to as
&lt;strong&gt;prescribed distributions&lt;/strong&gt;. As before, we &lt;em&gt;prescribed&lt;/em&gt; Gaussian distributions
in our running example so that the Monte Carlo estimate can later be compared against the exact value.
We approximate their KL divergence using $M = 5000$ Monte Carlo samples as
follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;true_log_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.44670376&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or equivalently, using the &lt;code&gt;expectation&lt;/code&gt; function from &lt;em&gt;TensorFlow
Probability&amp;ndash;Monte Carlo&lt;/em&gt; (&lt;code&gt;tfp.monte_carlo&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;true_log_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.4581419&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;More generally, we can approximate any $f$-divergence with MC estimation:&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_f[p(x) || q(x)]
&amp; = \mathbb{E}_{q(x)} [ f(r^{*}(x)) ] \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} f(r^{*}(x_q^{(i)})),
\quad x_q^{(i)} \sim q(x).
\end{align*}
$$&lt;p&gt;This can be done using the &lt;code&gt;monte_carlo_csiszar_f_divergence&lt;/code&gt; function from
&lt;em&gt;TensorFlow Probability&amp;ndash;Variational Inference&lt;/em&gt; (&lt;code&gt;tfp.vi&lt;/code&gt;).
One simply needs to specify the appropriate convex function $f$.
The convex function that instantiates the (forward) KL divergence is provided
in &lt;code&gt;tfp.vi&lt;/code&gt; as &lt;code&gt;kl_forward&lt;/code&gt;, alongside many other common $f$-divergences.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo_csiszar_f_divergence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kl_forward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;p_log_prob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;num_draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.4430853&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="density-ratio-estimation--implicit-distributions"&gt;Density Ratio Estimation &amp;mdash; implicit distributions&lt;/h2&gt;
&lt;p&gt;When either density $p(x)$ or $q(x)$ is unavailable, things become trickier,
which brings us to the topic of this post. Suppose we only have samples from
$p(x)$ and $q(x)$&amp;mdash;these could be natural images, outputs from a neural
network with stochastic inputs, or in the case of our running example, i.i.d.
samples drawn from Gaussians, etc.
Distributions for which we are only able to observe their samples are known as
&lt;strong&gt;implicit distributions&lt;/strong&gt;, since their samples &lt;em&gt;imply&lt;/em&gt; some underlying true
density which we may not have direct access to.&lt;/p&gt;
&lt;p&gt;Density ratio estimation is concerned with estimating the ratio of densities
$r^{*}(x) = p(x) / q(x)$ given access only to samples from $p(x)$ and $q(x)$.
Moreover, density ratio estimation usually encompasses methods that achieve this
without resorting to direct &lt;em&gt;density estimation&lt;/em&gt; of the individual densities
$p(x)$ or $q(x)$, since any error in the estimation of the denominator $q(x)$
can be greatly magnified in the ratio.&lt;/p&gt;
&lt;p&gt;Of the many density ratio estimation methods that now
flourish&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, the classical approach of &lt;em&gt;probabilistic
classification&lt;/em&gt; remains dominant, due in no small part to its simplicity.&lt;/p&gt;
&lt;h3 id="reducing-density-ratio-estimation-to-probabilistic-classification"&gt;Reducing Density Ratio Estimation to Probabilistic Classification&lt;/h3&gt;
&lt;p&gt;We now demonstrate that density ratio estimation can be reduced to probabilistic
classification. We shall do this by highlighting the one-to-one correspondence
between the density ratio of $p(x)$ and $q(x)$ and the optimal probabilistic
classifier that discriminates between their samples.
Specifically, suppose we have a collection of samples from both $p(x)$ and $q(x)$,
where each sample is assigned a class label indicating which distribution it was
drawn from. Then, from an estimator of the class-membership probabilities, it is
straightforward to recover an estimator of the density ratio.&lt;/p&gt;
&lt;p&gt;Suppose we have $N_p$ and $N_q$ samples drawn from $p(x)$ and $q(x)$,
respectively,&lt;/p&gt;
$$
x_p^{(1)}, \dotsc, x_p^{(N_p)} \sim p(x),
\qquad \text{and} \qquad
x_q^{(1)}, \dotsc, x_q^{(N_q)} \sim q(x).
$$&lt;p&gt;Then, we form the dataset $\{ (x_n, y_n) \}_{n=1}^N$, where $N = N_p + N_q$
and&lt;/p&gt;
$$
\begin{align*}
(x_1, \dotsc, x_N) &amp; = (x_p^{(1)}, \dotsc, x_p^{(N_p)},
x_q^{(1)}, \dotsc, x_q^{(N_q)}), \newline
(y_1, \dotsc, y_N) &amp; = (\underbrace{1, \dotsc, 1}_{N_p},
\underbrace{0, \dotsc, 0}_{N_q}).
\end{align*}
$$&lt;p&gt;In other words, we label samples drawn from $p(x)$ as 1 and those drawn from
$q(x)$ as 0. In code, this looks like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;q_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q_samples&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This dataset is visualized below. The blue squares in the top row are samples
$x_p^{(i)} \sim p(x)$ with label 1; red squares in the bottom row are samples
$x_q^{(j)} \sim q(x)$ with label 0.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Classification dataset"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/dataset.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Now, by construction, we have&lt;/p&gt;
$$
p(x) = \mathcal{P}(x \mid y = 1),
\qquad
\text{and}
\qquad
q(x) = \mathcal{P}(x \mid y = 0).
$$&lt;p&gt;Using Bayes&amp;rsquo; rule, we can write&lt;/p&gt;
$$
\mathcal{P}(x \mid y) =
\frac{\mathcal{P}(y \mid x) \mathcal{P}(x)}
{\mathcal{P}(y)}.
$$&lt;p&gt;Hence, we can express the density ratio $r^{*}(x)$ as&lt;/p&gt;
$$
\begin{align*}
r^{*}(x) &amp; = \frac{p(x)}{q(x)}
= \frac{\mathcal{P}(x \mid y = 1)}
{\mathcal{P}(x \mid y = 0)} \newline
&amp; = \left ( \frac{\mathcal{P}(y = 1 \mid x) \mathcal{P}(x)}
{\mathcal{P}(y = 1)} \right )
\left ( \frac{\mathcal{P}(y = 0 \mid x) \mathcal{P}(x)}
{\mathcal{P}(y = 0)} \right ) ^ {-1} \newline
&amp; = \frac{\mathcal{P}(y = 0)}{\mathcal{P}(y = 1)}
\frac{\mathcal{P}(y = 1 \mid x)}
{\mathcal{P}(y = 0 \mid x)}.
\end{align*}
$$&lt;p&gt;Let us approximate the ratio of class-prior probabilities by the ratio of sample sizes,&lt;/p&gt;
$$
\frac{\mathcal{P}(y = 0)}
{\mathcal{P}(y = 1)}
\approx
\frac{N_q}{N_p + N_q}
\left ( \frac{N_p}{N_p + N_q} \right )^{-1}
= \frac{N_q}{N_p}.
$$&lt;p&gt;To avoid notational clutter, let us assume from now on that $N_q = N_p$.
We can then write $r^{*}(x)$ in terms of class-posterior probabilities,&lt;/p&gt;
$$
\begin{align*}
r^{*}(x) = \frac{\mathcal{P}(y = 1 \mid x)}
{\mathcal{P}(y = 0 \mid x)}.
\end{align*}
$$&lt;h4 id="recovering-the-density-ratio-from-the-class-probability"&gt;Recovering the Density Ratio from the Class Probability&lt;/h4&gt;
&lt;p&gt;This yields a one-to-one correspondence between the density ratio $r^{*}(x)$
and the class-posterior probability $\mathcal{P}(y = 1 \mid x)$.
Namely,&lt;/p&gt;
$$
\begin{align*}
r^{*}(x) = \frac{\mathcal{P}(y = 1 \mid x)}
{\mathcal{P}(y = 0 \mid x)}
&amp; = \frac{\mathcal{P}(y = 1 \mid x)}
{1 - \mathcal{P}(y = 1 \mid x)} \newline
&amp; = \exp
\left [
\log \frac{\mathcal{P}(y = 1 \mid x)}
{1 - \mathcal{P}(y = 1 \mid x)} \right ] \newline
&amp; = \exp[ \sigma^{-1}(\mathcal{P}(y = 1 \mid x)) ],
\end{align*}
$$&lt;p&gt;where $\sigma^{-1}$ is the &lt;em&gt;logit&lt;/em&gt; function, or inverse sigmoid function, given
by $\sigma^{-1}(\rho) = \log \left ( \frac{\rho}{1-\rho} \right )$.&lt;/p&gt;
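&lt;p&gt;To make this correspondence concrete, here is a minimal sketch of a helper that
turns any function returning the class-posterior probability $\mathcal{P}(y = 1 \mid x)$
into a density ratio function, by applying the logit function and exponentiating
(the name &lt;code&gt;ratio_from_class_probability&lt;/code&gt; is purely illustrative):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def ratio_from_class_probability(class_probability):

    def ratio(x):
        rho = class_probability(x)
        # r(x) = exp(logit(rho)) = rho / (1 - rho)
        return tf.exp(tf.log(rho) - tf.log(1. - rho))

    return ratio
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Applied to the exact class-posterior probability (implemented in the next
subsection), this recovers the exact density ratio $r^{*}(x) = p(x) / q(x)$.&lt;/p&gt;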
&lt;h4 id="recovering-the-class-probability-from-the-density-ratio"&gt;Recovering the Class Probability from the Density Ratio&lt;/h4&gt;
&lt;p&gt;By inverting this relationship, we can also recover
the exact class-posterior probability as a function of the density ratio,&lt;/p&gt;
$$
\mathcal{P}(y=1 \mid x) = \sigma(\log r^{*}(x)) = \frac{p(x)}{p(x) + q(x)}.
$$
&lt;p&gt;This is the &lt;em&gt;optimal classifier&lt;/em&gt;, in the sense that it minimizes the Bayes risk, and is implemented below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;optimal_classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;truediv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the figure below, the class-posterior probability $\mathcal{P}(y=1 \mid x)$
is plotted against the dataset visualized earlier.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Optimal classifier&amp;mdash;class-posterior probabilities"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/optimal_classifier.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="probabilistic-classification-with-logistic-regression"&gt;Probabilistic Classification with Logistic Regression&lt;/h3&gt;
&lt;p&gt;The class-posterior probability $\mathcal{P}(y = 1 \mid x)$ can be approximated
using a parameterized function $D_{\theta}(x)$ with parameters $\theta$. This
function takes as input samples from $p(x)$ and $q(x)$ and outputs a &lt;em&gt;score&lt;/em&gt;,
or probability, in the range $[0, 1]$, that the input was drawn from $p(x)$.
Hence, we refer to $D_{\theta}(x)$ as the probabilistic classifier.&lt;/p&gt;
&lt;p&gt;From before, it is easy to see how an estimator of the density ratio
$r_{\theta}(x)$ might be constructed as a function of the probabilistic classifier
$D_{\theta}(x)$. Namely,&lt;/p&gt;
$$
\begin{align*}
r_{\theta}(x) &amp; = \exp[ \sigma^{-1}(D_{\theta}(x)) ] \newline
&amp; \approx \exp[ \sigma^{-1}(\mathcal{P}(y = 1 \mid x)) ] = r^{*}(x),
\end{align*}
$$&lt;p&gt;
and &lt;em&gt;vice versa&lt;/em&gt;,
&lt;/p&gt;
$$
\begin{align*}
D_{\theta}(x) &amp; = \sigma(\log r_{\theta}(x)) \newline
&amp; \approx \sigma(\log r^{*}(x)) = \mathcal{P}(y = 1 \mid x).
\end{align*}
$$&lt;p&gt;Instead of $D_{\theta}(x)$, we usually specify the parameterized function
$\log r_{\theta}(x)$. This is also referred to as the &lt;em&gt;log-odds&lt;/em&gt;, or &lt;em&gt;logits&lt;/em&gt;,
since it is equivalent to the unnormalized output of the classifier before being
fed through the logistic sigmoid function.&lt;/p&gt;
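&lt;p&gt;The remaining snippets use a handful of Keras classes. Assuming the standalone
Keras API with a TensorFlow backend (the &lt;code&gt;tf.keras&lt;/code&gt; equivalents work the same
way), the required imports would be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras.layers import Dense, Input
from keras.models import Model, Sequential
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;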
&lt;p&gt;We define a small fully-connected neural network with two hidden layers and ReLU
activations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;log_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This simple architecture is visualized in the diagram below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Log Density Ratio Architecture"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/log_ratio_architecture.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
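&lt;p&gt;Since the network parameterizes $\log r_{\theta}(x)$ directly, the classifier
probability $D_{\theta}(x)$ and the density ratio estimate $r_{\theta}(x)$ can be
recovered from its output by composing it with the sigmoid and exponential
functions, respectively. A minimal sketch, assuming the &lt;code&gt;log_ratio&lt;/code&gt; model
defined above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def classifier_estimator(x):
    # D_theta(x) = sigmoid(log r_theta(x))
    return tf.sigmoid(log_ratio(x))

def ratio_estimator(x):
    # r_theta(x) = exp(log r_theta(x))
    return tf.exp(log_ratio(x))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These are the quantities visualized in the animation and figures further below.&lt;/p&gt;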
&lt;p&gt;We learn the optimal class probability estimator by optimizing it with respect
to a &lt;em&gt;proper scoring rule&lt;/em&gt;&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt; that yields well-calibrated probabilistic predictions, such as the &lt;em&gt;binary cross-entropy loss&lt;/em&gt;,&lt;/p&gt;
$$
\begin{align*}
\mathcal{L}(\theta) &amp; :=
-\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ]
-\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ] \newline
&amp; =
-\mathbb{E}_{p(x)} [ \log \sigma ( \log r_{\theta} (x) ) ]
-\mathbb{E}_{q(x)} [ \log(1 - \sigma ( \log r_{\theta} (x) )) ].
\end{align*}
$$&lt;p&gt;An implementation optimized for numerical stability is given below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;loss_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigmoid_cross_entropy_with_logits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;loss_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigmoid_cross_entropy_with_logits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_p&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now we can build a Keras &lt;code&gt;Model&lt;/code&gt;, where the two inputs are the tensors of
samples from $p(x)$ and $q(x)$, respectively.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;log_ratio_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The model can now be compiled and finalized. Since we&amp;rsquo;re using a custom loss
that takes the two sets of log-ratios as input, we specify &lt;code&gt;loss=None&lt;/code&gt; and
define it instead through the &lt;code&gt;add_loss&lt;/code&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_q&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As a sanity-check, the loss evaluated on a random batch can be obtained like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;1.3765026330947876&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can now fit our estimator, recording the loss at the end of each epoch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps_per_epoch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The following animation shows how the predictions of the probabilistic
classifier, the density ratio, and the log density ratio evolve after every epoch:&lt;/p&gt;
&lt;p&gt;&lt;video controls autoplay src="https://giant.gfycat.com/FrighteningThunderousFlicker.webm"&gt;&lt;/video&gt;&lt;/p&gt;
&lt;p&gt;It is overlaid on top of their exact, analytical counterparts, which are only
available since we prescribed them to be Gaussian distributions.
For implicit distributions, these won&amp;rsquo;t be accessible at all.&lt;/p&gt;
&lt;p&gt;Below is the final plot of how the binary cross-entropy loss converges:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Binary Cross-entropy Loss"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/binary_crossentropy.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Below is a plot of the probabilistic classifier $D_{\theta}(x)$ (&lt;em&gt;dotted green&lt;/em&gt;),
plotted against the optimal classifier, which is the class-posterior probability
$\mathcal{P}(y=1 \mid x) = \frac{p(x)}{p(x) + q(x)}$ (&lt;em&gt;solid blue&lt;/em&gt;):&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Class Probability Estimator"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/class_probability_estimation.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Below is a plot of the density ratio estimator $r_{\theta}(x)$
(&lt;em&gt;dotted green&lt;/em&gt;), plotted against the exact density ratio function
$r^{*}(x) = \frac{p(x)}{q(x)}$ (&lt;em&gt;solid blue&lt;/em&gt;):&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Density Ratio Estimator"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/density_ratio_estimation.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;And finally, the previous plot on a logarithmic scale:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Log Density Ratio Estimator"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/log_density_ratio_estimation.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;While it may appear that we are simply performing regression on the latent
function $r^{*}(x)$ (which is not wrong&amp;mdash;we are), it is important to emphasize that
we do this without ever having observed values of $r^{*}(x)$.
Instead, we only ever observe samples from $p(x)$ and $q(x)$.
This has profound implications and potential for a great number of applications
that we shall explore later on.&lt;/p&gt;
&lt;h3 id="back-to-monte-carlo-estimation"&gt;Back to Monte Carlo estimation&lt;/h3&gt;
&lt;p&gt;Having obtained an estimate of the log density ratio, it is now feasible to
perform Monte Carlo estimation:&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)]
&amp; = \mathbb{E}_{p(x)} [ \log r^{*}(x) ] \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} \log r^{*}(x_p^{(i)}),
\quad x_p^{(i)} \sim p(x) \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} \log r_{\theta}(x_p^{(i)}),
\quad x_p^{(i)} \sim p(x).
\end{align*}
$$&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.4570999&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In other words, we draw MC samples from $p(x)$ as before. But instead of taking
the mean of the function $\log r^{*}(x)$ evaluated on these samples (which is
unavailable for implicit distributions), we do so on a proxy function
$\log r_{\theta}(x)$ that is estimated through probabilistic classification as
described above.&lt;/p&gt;
&lt;h2 id="learning-in-implicit-generative-models"&gt;Learning in Implicit Generative Models&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s take a look at where these ideas are being used in practice.
Consider a collection of natural images, such as the MNIST handwritten
digits shown below, which are assumed to be samples drawn from some implicit
distribution $q(\mathbf{x})$:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/MnistExamples.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;MNIST hand-written digits&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Directly estimating the density of $q(\mathbf{x})$ may not always be feasible&amp;mdash;in
some cases, it may not even exist.
Instead, consider defining a parametric function $G_{\phi}: \mathbf{z} \mapsto
\mathbf{x}$ with parameters $\phi$, that takes as input $\mathbf{z}$ drawn from
some fixed distribution $p(\mathbf{z})$.
The outputs $\mathbf{x}$ of this generative process are assumed to be samples
following some implicit distribution $p_{\phi}(\mathbf{x})$. In other words,
we can write&lt;/p&gt;
$$
\mathbf{x} \sim p_{\phi}(\mathbf{x}) \quad
\Leftrightarrow \quad
\mathbf{x} = G_{\phi}(\mathbf{z}),
\quad \mathbf{z} \sim p(\mathbf{z}).
$$&lt;p&gt;By optimizing parameters $\phi$, we can make $p_{\phi}(\mathbf{x})$ close to
the real data distribution $q(\mathbf{x})$. This is a compelling alternative to
density estimation since there are many situations where being able to generate
samples is more important than being able to calculate the numerical value of
the density. Some examples of these include &lt;em&gt;image super-resolution&lt;/em&gt; and
&lt;em&gt;semantic segmentation&lt;/em&gt;.&lt;/p&gt;
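&lt;p&gt;To make this concrete, below is a minimal sketch of such a generative process in
Keras. The architecture, dimensions, and batch size are placeholders of my own
choosing, not anything prescribed by the method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

from keras.models import Sequential
from keras.layers import Dense

latent_dim, data_dim = 64, 784  # placeholder dimensions

# generator G_phi: z -&gt; x, here a simple fully-connected network
generator = Sequential([
    Dense(128, input_dim=latent_dim, activation='relu'),
    Dense(data_dim, activation='sigmoid')
])

# sampling from the implicit distribution p_phi(x):
# draw z ~ p(z) = N(0, I) and push it through G_phi
z = np.random.randn(32, latent_dim)
x = generator.predict(z)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;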
&lt;p&gt;One approach might be to introduce a classifier $D_{\theta}$ that discriminates
between real and synthetic samples.
Then we optimize $G_{\phi}$ to synthesize samples that are indistinguishable,
to the classifier $D_{\theta}$, from the real samples. This can be achieved by
optimizing the binary cross-entropy objective adversarially: the classifier maximizes
it with respect to $\theta$ while the generator minimizes it with respect to $\phi$,
resulting in the saddle-point objective,&lt;/p&gt;
$$
\begin{align*}
&amp; \min_{\phi} \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log D_{\theta} (\mathbf{x}) ] +
\mathbb{E}_{p_{\phi}(\mathbf{x})} [ \log(1-D_{\theta} (\mathbf{x})) ] \newline =
&amp; \min_{\phi} \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log D_{\theta} (\mathbf{x}) ] +
\mathbb{E}_{p(\mathbf{z})} [ \log(1-D_{\theta} (G_{\phi}(\mathbf{z}))) ].
\end{align*}
$$&lt;p&gt;This is, of course, none other than the groundbreaking &lt;em&gt;generative adversarial
network (GAN)&lt;/em&gt;&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;.
You can read more about the density ratio estimation perspective of GANs in
the paper by Uehara et al. 2016&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;. For an even more general and complete treatment of learning in implicit models, I recommend the paper
from Mohamed and Lakshminarayanan, 2016&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;, which partially inspired this post.&lt;/p&gt;
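&lt;p&gt;As a rough sketch of the discriminator side of this saddle-point objective
(continuing from the toy generator above; the real-data minibatch below is just a
random stand-in for actual samples from $q(\mathbf{x})$):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# discriminator D_theta: x -&gt; P(x is real); the architecture is again a placeholder
discriminator = Sequential([
    Dense(128, input_dim=data_dim, activation='relu'),
    Dense(1, activation='sigmoid')
])
# minimizing binary cross-entropy over theta is equivalent to maximizing
# E_q[log D(x)] + E_{p_phi}[log(1 - D(x))]
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

batch_size = 32
x_real = np.random.rand(batch_size, data_dim)  # stand-in for a minibatch from q(x)
x_fake = generator.predict(np.random.randn(batch_size, latent_dim))  # from p_phi(x)

# one discriminator update: real samples are labelled 1, synthetic samples 0
discriminator.train_on_batch(
    np.vstack([x_real, x_fake]),
    np.concatenate([np.ones(batch_size), np.zeros(batch_size)]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;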
&lt;p&gt;For the remainder of this section, I want to highlight a variant of this
approach that specifically aims to minimize the KL divergence w.r.t. parameters
$\phi$,&lt;/p&gt;
$$
\min_{\phi} \mathcal{D}_{\mathrm{KL}}[p_{\phi}(\mathbf{x}) || q(\mathbf{x})].
$$&lt;p&gt;To overcome the fact that the densities of both $p_{\phi}(\mathbf{x})$ and
$q(\mathbf{x})$ are unknown, we can readily adopt the density ratio estimation
approach outlined in this post.
Namely, by maximizing the following objective,&lt;/p&gt;
$$
\begin{align*}
&amp; \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log D_{\theta} (\mathbf{x}) ] +
\mathbb{E}_{p(\mathbf{z})} [ \log(1-D_{\theta} (G_{\phi}(\mathbf{z}))) ] \newline
= &amp; \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log \sigma ( \log r_{\theta} (\mathbf{x}) ) ] +
\mathbb{E}_{p(\mathbf{z})} [ \log(1 - \sigma ( \log r_{\theta} (G_{\phi}(\mathbf{z})) )) ],
\end{align*}
$$&lt;p&gt;which attains its maximum at&lt;/p&gt;
$$
r_{\theta}(\mathbf{x}) = \frac{q(\mathbf{x})}{p_{\phi}(\mathbf{x})}.
$$&lt;p&gt;Concurrently, we also minimize the current best estimate of the KL divergence,&lt;/p&gt;
$$
\begin{align*}
\min_{\phi} \mathcal{D}_{\mathrm{KL}}[p_{\phi}(\mathbf{x}) || q(\mathbf{x})]
&amp; =
\min_{\phi} \mathbb{E}_{p_{\phi}(\mathbf{x})} \left [ \log \frac{p_{\phi}(\mathbf{x})}{q(\mathbf{x})} \right ] \newline
&amp; \approx
\min_{\phi} \mathbb{E}_{p_{\phi}(\mathbf{x})} [ - \log r_{\theta}(\mathbf{x}) ] \newline
&amp; =
\min_{\phi} \mathbb{E}_{p(\mathbf{z})} [ - \log r_{\theta}(G_{\phi}(\mathbf{z})) ].
\end{align*}
$$&lt;p&gt;In addition to being more stable than the vanilla GAN approach (it alleviates
the problem of saturating gradients), this formulation is especially important in
contexts where there is a specific need to minimize the KL divergence, such as in
&lt;em&gt;variational inference (VI)&lt;/em&gt;.&lt;/p&gt;
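&lt;p&gt;Continuing the same toy sketch, the corresponding generator update for this
KL-minimizing variant might look as follows. We recover $\log r_{\theta}(\mathbf{x})$
from the classifier output via the logit function and minimize its negation over
$\phi$ with the discriminator weights held fixed. This is only a sketch under the
placeholder names and architectures above, not a full training loop.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras.models import Model
from keras.layers import Input, Lambda
from keras import backend as K

# freeze the discriminator while updating the generator parameters phi
discriminator.trainable = False

z_input = Input(shape=(latent_dim,))
x_synth = generator(z_input)
d_synth = discriminator(x_synth)

# log r_theta(x) = logit(D_theta(x)) = log D_theta(x) - log(1 - D_theta(x))
log_ratio = Lambda(lambda d: K.log(d) - K.log(1. - d))(d_synth)

# minimize E_{p(z)}[ -log r_theta(G_phi(z)) ], the current estimate of the KL
kl_surrogate = Model(z_input, log_ratio)
kl_surrogate.compile(optimizer='adam', loss=lambda y_true, y_pred: -y_pred)

z = np.random.randn(batch_size, latent_dim)
kl_surrogate.train_on_batch(z, np.zeros((batch_size, 1)))  # targets are unused
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;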
&lt;p&gt;This was first used in &lt;em&gt;AffGAN&lt;/em&gt; by Sønderby et al. 2016&lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;,
and has since been incorporated in many papers that deal with implicit
distributions in variational inference, such as
(Mescheder et al. 2017&lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;,
Huszar 2017&lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;,
Tran et al. 2017&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;,
Pu et al. 2017&lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;,
Chen et al. 2018&lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;,
Tiao et al. 2018&lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt;), and many others.&lt;/p&gt;
&lt;h2 id="bound-on-the-jensen-shannon-divergence"&gt;Bound on the Jensen-Shannon Divergence&lt;/h2&gt;
&lt;p&gt;Before we wrap things up, let us take another look at the plot of the
binary-cross entropy loss recorded at the end of each epoch.
We see that it converges quickly to some value.
It is natural to wonder: what is the significance, if any, of this value?&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Binary cross-entropy loss converges to Jensen Shannon divergence (up to constants)"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/binary_crossentropy_vs_jensen_shannon.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;It is in fact the (negative) Jensen-Shannon (JS) divergence, up to constants,&lt;/p&gt;
$$
-2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] + \log 4.
$$&lt;p&gt;Recall the Jensen-Shannon divergence is defined as&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{JS}}[p(x) || q(x)]
= \frac{1}{2} \mathcal{D}_{\mathrm{KL}}[p(x) || m(x)] +
\frac{1}{2} \mathcal{D}_{\mathrm{KL}}[q(x) || m(x)],
$$&lt;p&gt;where $m$ is the mixture density&lt;/p&gt;
$$
m(x) = \frac{p(x) + q(x)}{2}.
$$&lt;p&gt;With our running example, this cannot be evaluated exactly since the KL
divergence between a Gaussian and a mixture of Gaussians is analytically
intractable.
However, like the KL, we can still estimate their JS divergence with Monte
Carlo estimation&lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo_csiszar_f_divergence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jensen_shannon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;p_log_prob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This value is shown as the horizontal black line in the plot above. Along the
right margin, we also plot a histogram of the binary cross-entropy loss
values over epochs. We can see that this value indeed coincides with the mode of
the histogram.&lt;/p&gt;
&lt;p&gt;It is straightforward to show that the loss is in fact bounded from below by this value,&lt;/p&gt;
$$
\inf_{\theta} \mathcal{L}(\theta) \geq - 2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] + \log 4.
$$&lt;p&gt;Firstly, taking the supremum over &lt;em&gt;all&lt;/em&gt; classifiers $D$ (not just those
representable by our parametric family), which is attained by the Bayes-optimal
classifier, we have&lt;/p&gt;
$$
\begin{align*}
\sup_{D} &amp;
\mathbb{E}_{p(x)} [ \log D (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D (x)) ] \newline
&amp; =
\mathbb{E}_{p(x)} [ \log \mathcal{P}(y=1 \mid x) ] +
\mathbb{E}_{q(x)} [ \log \mathcal{P}(y=0 \mid x) ] \newline
&amp; =
\mathbb{E}_{p(x)} \left [ \log \frac{p(x)}{p(x) + q(x)} \right ] +
\mathbb{E}_{q(x)} \left [ \log \frac{q(x)}{p(x) + q(x)} \right ] \newline
&amp; =
\mathbb{E}_{p(x)} \left [ \log \frac{1}{2} \frac{p(x)}{m(x)} \right ] +
\mathbb{E}_{q(x)} \left [ \log \frac{1}{2} \frac{q(x)}{m(x)} \right ] \newline
&amp; =
\mathbb{E}_{p(x)} \left [ \log \frac{p(x)}{m(x)} \right ] +
\mathbb{E}_{q(x)} \left [ \log \frac{q(x)}{m(x)} \right ] - 2 \log 2 \newline
&amp; = 2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4.
\end{align*}
$$&lt;p&gt;Since the parametric family $\{ D_{\theta} \}$ can do no better than the
Bayes-optimal classifier, it follows that&lt;/p&gt;
$$
2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4
\geq
\sup_{\theta} \left\{
\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ] \right\}.
$$&lt;p&gt;Negating both sides, we get&lt;/p&gt;
$$
\begin{align*}
-2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] + \log 4
\leq &amp;
-\sup_{\theta} \left\{
\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ] \right\} \newline
= &amp; \inf_{\theta} \left\{
-\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ]
-\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ] \right\} \newline
= &amp; \inf_{\theta} \mathcal{L}(\theta),
\end{align*}
$$&lt;p&gt;as required.&lt;/p&gt;
&lt;p&gt;In short, this tells us that the binary cross-entropy loss is &lt;em&gt;itself&lt;/em&gt; an
approximation (up to constants) to the Jensen-Shannon divergence.
This raises the question: is it possible to construct a more general loss that bounds any given $f$-divergence?&lt;/p&gt;
&lt;h2 id="teaser-lower-bound-on-any--divergence"&gt;Teaser: Lower Bound on any $f$-divergence&lt;/h2&gt;
&lt;p&gt;Using convex analysis, one can actually show that for any $f$-divergence, we
have the lower bound&lt;sup id="fnref:15"&gt;&lt;a href="#fn:15" class="footnote-ref" role="doc-noteref"&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
$$
\mathcal{D}_f[p(x) || q(x)]
\geq
\sup_{\theta}
\mathbb{E}_{p(x)} [ f'(r_{\theta}(x)) ] -
\mathbb{E}_{q(x)} [ f^{\star}(f'(r_{\theta}(x))) ],
$$&lt;p&gt;with equality exactly when $r_{\theta}(x) = r^{*}(x)$.
Importantly, this lower bound can be computed without requiring the densities of
$p(x)$ or $q(x)$&amp;mdash;only their samples are needed.&lt;/p&gt;
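&lt;p&gt;To illustrate how this bound can be estimated from samples alone, here is a
hypothetical helper in plain NumPy. The function and argument names are my own, and
&lt;code&gt;log_r&lt;/code&gt; stands in for whatever estimate of $\log r_{\theta}$ is available
(for example, the logits of a trained classifier):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

def f_divergence_lower_bound(f_prime, f_conj, log_r, x_p, x_q):
    """Monte Carlo estimate of the variational lower bound
        E_p[f'(r_theta(x))] - E_q[f*(f'(r_theta(x)))],
    where f_prime is f', f_conj is the convex conjugate f*, log_r(x)
    returns log r_theta(x), and x_p ~ p(x), x_q ~ q(x) are sample arrays.
    """
    r_p = np.exp(log_r(x_p))
    r_q = np.exp(log_r(x_q))
    return np.mean(f_prime(r_p)) - np.mean(f_conj(f_prime(r_q)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;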
&lt;p&gt;In the special case of $f(u) = u \log u - (u + 1) \log (u + 1)$, we recover the
binary cross-entropy loss and the previous result, as expected,&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_f[p(x) || q(x)]
&amp; = 2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4 \newline
&amp; \geq \sup_{\theta}
\mathbb{E}_{p(x)} [ \log \sigma ( \log r_{\theta} (x) ) ] +
\mathbb{E}_{q(x)} [ \log(1 - \sigma ( \log r_{\theta} (x) )) ] \newline
&amp; = \sup_{\theta}
\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ].
\end{align*}
$$&lt;p&gt;Alternatively, in the special case of $f(u) = u \log u$, we get&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_f[p(x) || q(x)]
&amp; = \mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] \newline
&amp; \geq \sup_{\theta}
\mathbb{E}_{p(x)} [ \log r_{\theta} (x) ] -
\mathbb{E}_{q(x)} [ r_{\theta} (x) - 1 ].
\end{align*}
$$&lt;p&gt;This gives us &lt;em&gt;yet&lt;/em&gt; another way to estimate the KL divergence between
implicit distributions, in the form of a direct lower bound on the KL divergence
itself.
As it turns out, this lower bound is closely related to the objective of the
&lt;em&gt;KL Importance Estimation Procedure (KLIEP)&lt;/em&gt;&lt;sup id="fnref:16"&gt;&lt;a href="#fn:16" class="footnote-ref" role="doc-noteref"&gt;16&lt;/a&gt;&lt;/sup&gt;, and will be
the topic of our next post in this series.&lt;/p&gt;
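&lt;p&gt;As a small preview, the bound in this special case is easy to estimate from
samples. Below is a sketch using the running univariate Gaussian example, with the
&lt;em&gt;exact&lt;/em&gt; log-ratio standing in for a learned $\log r_{\theta}$; at the exact
ratio the bound is tight, so the estimate should land near the true KL divergence of
roughly $0.44$, up to Monte Carlo error.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.stats import norm

def kl_lower_bound(log_r, x_p, x_q):
    """Monte Carlo estimate of E_p[log r(x)] - E_q[r(x) - 1]."""
    return np.mean(log_r(x_p)) - np.mean(np.exp(log_r(x_q)) - 1.)

# running example: p(x) = N(1, 1^2), q(x) = N(0, 2^2)
x_p = np.random.normal(1., 1., size=5000)
x_q = np.random.normal(0., 2., size=5000)

# exact log-ratio, standing in for a learned log r_theta
exact_log_r = lambda x: norm.logpdf(x, 1., 1.) - norm.logpdf(x, 0., 2.)

print(kl_lower_bound(exact_log_r, x_p, x_q))  # approx. 0.44
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;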
&lt;h1 id="summary"&gt;Summary&lt;/h1&gt;
&lt;p&gt;This post covered how to evaluate the KL divergence, or any $f$-divergence,
between implicit distributions&amp;mdash;distributions which we can only sample from.
First, we underscored the crucial role of the density ratio in the estimation of
$f$-divergences.
Next, we showed the correspondence between the density ratio and the optimal
classifier.
By exploiting this link, we demonstrated how one can use a trained probabilistic classifier to construct a proxy for the exact density ratio, and use this to
enable estimation of any $f$-divergence.
Finally, we provided some context on where this method is used, touching upon
some recent advances in implicit generative models and variational inference.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2018dre,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{D}ensity {R}atio {E}stimation for {KL} {D}ivergence {M}inimization between {I}mplicit {D}istributions&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2018&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h1 id="acknowledgements"&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;I am grateful to
for providing
extensive feedback and insightful discussions. I would also like to thank
Alistair Reid and
for their comments and suggestions.&lt;/p&gt;
&lt;h1 id="links-and-resources"&gt;Links and Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The
used to generate the figures in this post, which you can
.&lt;/li&gt;
&lt;li&gt;The very readable textbook on density ratio estimation by Sugiyama et al.&lt;sup id="fnref1:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, which I highly recommend. (Note: the Gaussian distributions example was borrowed from this book.)&lt;/li&gt;
&lt;li&gt;Shakir Mohamed&amp;rsquo;s blog post
.&lt;/li&gt;
&lt;li&gt;The paper by Menon and Ong, 2016&lt;sup id="fnref:17"&gt;&lt;a href="#fn:17" class="footnote-ref" role="doc-noteref"&gt;17&lt;/a&gt;&lt;/sup&gt;, which gives a generalized treatment of the theoretical link between density ratio estimation and probabilistic classification.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;The (forward) KL divergence can be recovered with
&lt;/p&gt;
$$
f_{\mathrm{KL}}(u) := u \log u.
$$&lt;p&gt;
This is easy to verify,
&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] &amp; :=
\mathbb{E}_{p(x)} \left [ \log \left ( \frac{p(x)}{q(x)} \right ) \right ] \newline
&amp; = \mathbb{E}_{q(x)} \left [ \frac{p(x)}{q(x)} \log \left ( \frac{p(x)}{q(x)} \right ) \right ] \newline
&amp; = \mathbb{E}_{q(x)} \left [ f_{\mathrm{KL}} \left ( \frac{p(x)}{q(x)} \right ) \right ].
\end{align*}
$$&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Sugiyama, M., Suzuki, T., &amp;amp; Kanamori, T. (2012). &lt;em&gt;Density Ratio Estimation in Machine Learning&lt;/em&gt;. Cambridge University Press.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Gneiting, T., &amp;amp; Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt;, 102(477), (pp. 359-378).&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., &amp;hellip; &amp;amp; Bengio, Y. (2014). Generative Adversarial Nets. In Advances in &lt;em&gt;Neural Information Processing Systems&lt;/em&gt; (pp. 2672-2680).&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Uehara, M., Sato, I., Suzuki, M., Nakayama, K., &amp;amp; Matsuo, Y. (2016). Generative Adversarial Nets from a Density Ratio Estimation Perspective. &lt;em&gt;arXiv preprint arXiv:1610.02920&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Mohamed, S., &amp;amp; Lakshminarayanan, B. (2016). Learning in Implicit Generative Models. &lt;em&gt;arXiv preprint arXiv:1610.03483&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;Sønderby, C. K., Caballero, J., Theis, L., Shi, W., &amp;amp; Huszár, F. (2016). Amortised map inference for image super-resolution. &lt;em&gt;arXiv preprint arXiv:1610.04490&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Mescheder, L., Nowozin, S., &amp;amp; Geiger, A. (2017). Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. In &lt;em&gt;International Conference on Machine learning (ICML)&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;Huszár, F. (2017). Variational inference using implicit distributions. &lt;em&gt;arXiv preprint arXiv:1702.08235&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;Tran, D., Ranganath, R., &amp;amp; Blei, D. (2017). Hierarchical implicit models and likelihood-free variational inference. In &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; (pp. 5523-5533).&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;Pu, Y., Wang, W., Henao, R., Chen, L., Gan, Z., Li, C., &amp;amp; Carin, L. (2017). Adversarial symmetric variational autoencoder. In &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; (pp. 4330-4339).&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;Chen, L., Dai, S., Pu, Y., Zhou, E., Li, C., Su, Q., &amp;hellip; &amp;amp; Carin, L. (2018, March). Symmetric variational autoencoder and connections to adversarial learning. In &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt; (pp. 661-669).&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Tiao, L. C., Bonilla, E. V., &amp;amp; Ramos, F. (2018). Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference. &lt;em&gt;arXiv preprint arXiv:1806.01771&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;Note that &lt;code&gt;jensen_shannon&lt;/code&gt; with &lt;code&gt;self_normalized=False&lt;/code&gt; (default), corresponds to $2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4$, while &lt;code&gt;self_normalized=True&lt;/code&gt; corresponds to $\mathcal{D}_{\mathrm{JS}}[p(x) || q(x)]$.&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:15"&gt;
&lt;p&gt;Nguyen, X., Wainwright, M. J., &amp;amp; Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. &lt;em&gt;IEEE Transactions on Information Theory&lt;/em&gt;, 56(11), 5847-5861.&amp;#160;&lt;a href="#fnref:15" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:16"&gt;
&lt;p&gt;Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., &amp;amp; Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems (pp. 1433-1440).&amp;#160;&lt;a href="#fnref:16" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:17"&gt;
&lt;p&gt;Menon, A., &amp;amp; Ong, C. S. (2016, June). Linking Losses for Density Ratio and Class-Probability Estimation. In &lt;em&gt;International Conference on Machine Learning&lt;/em&gt; (pp. 304-313).&amp;#160;&lt;a href="#fnref:17" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Contributed Talk: Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/events/icml2018-tagdm/</link><pubDate>Sat, 14 Jul 2018 15:20:00 +0000</pubDate><guid>https://tiao.io/events/icml2018-tagdm/</guid><description/></item><item><title>Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/publications/cycle-bayes/</link><pubDate>Sun, 01 Jul 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/cycle-bayes/</guid><description/></item><item><title>A Tutorial on Variational Autoencoders with a Concise Keras Implementation</title><link>https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/</link><pubDate>Wed, 20 Apr 2016 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/</guid><description>&lt;p&gt;
Keras is awesome. It is a very well-designed library that clearly abides by
its guiding principles of modularity and extensibility, enabling us to
easily assemble powerful, complex models from primitive building blocks.
This has been demonstrated in numerous blog posts and tutorials, in particular,
the excellent tutorial on &lt;em&gt;Building Autoencoders in Keras&lt;/em&gt;.
As the name suggests, that tutorial provides examples of how to implement
various kinds of autoencoders in Keras, including the variational autoencoder
(VAE)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Like all autoencoders, the variational autoencoder is primarily used for
unsupervised learning of hidden representations.
However, they are fundamentally different to your usual neural network-based
autoencoder in that they approach the problem from a probabilistic perspective.
They specify a joint distribution over the observed and latent variables, and
approximate the intractable posterior conditional density over latent
variables with variational inference, using an &lt;em&gt;inference network&lt;/em&gt;
&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt; (or more classically, a &lt;em&gt;recognition model&lt;/em&gt;
&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;) to amortize the cost of inference.&lt;/p&gt;
&lt;p&gt;While the examples in the aforementioned tutorial do well to showcase the
versatility of Keras on a wide range of autoencoder model architectures,
the example of the variational autoencoder doesn&amp;rsquo;t properly take
advantage of Keras&amp;rsquo; modular design, making it difficult to generalize and
extend in important ways. As we will see, it relies on implementing custom
layers and constructs that are restricted to a specific instance of
variational autoencoders. This is a shame because when combined, Keras&amp;rsquo;
building blocks are powerful enough to encapsulate most variants of the
variational autoencoder and more generally, recognition-generative model
combinations for which the generative model belongs to a large family of
&lt;em&gt;deep latent Gaussian models&lt;/em&gt; (DLGMs)&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;The goal of this post is to propose a clean and elegant alternative
implementation that takes better advantage of Keras&amp;rsquo; modular design.
It is not intended as a tutorial on variational autoencoders &lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;.
Rather, we study variational autoencoders as a special case of variational
inference in deep latent Gaussian models using inference networks, and
demonstrate how we can use Keras to implement them in a modular fashion such
that they can be easily adapted to approximate inference in tasks beyond
unsupervised learning, and with complicated (non-Gaussian) likelihoods.&lt;/p&gt;
&lt;p&gt;This first post will lay the groundwork for a series of future posts that
explore ways to extend this basic modular framework to implement the
cutting-edge methods proposed in the latest research, such as the normalizing
flows for building richer posterior approximations &lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;, importance
weighted autoencoders &lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;, the Gumbel-softmax trick for inference in
discrete latent variables &lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, and even the most recent GAN-based
density-ratio estimation techniques for likelihood-free inference
&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;h1 id="model-specification"&gt;Model specification&lt;/h1&gt;
&lt;p&gt;First, it is important to understand that the variational autoencoder is not,
in and of itself, a generative model.
Rather, the generative model is a component of the variational autoencoder and
is, in general, a deep latent Gaussian model.
In particular, let $\mathbf{x}$ be a local observed variable and
$\mathbf{z}$ its corresponding local latent variable, with joint
distribution&lt;/p&gt;
$$
p_{\theta}(\mathbf{x}, \mathbf{z})
= p_{\theta}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}).
$$&lt;p&gt;In Bayesian modelling, we assume the distribution of observed variables to be
governed by the latent variables. Latent variables are drawn from a prior
density $p(\mathbf{z})$ and related to the observations through the
likelihood $p_{\theta}(\mathbf{x} | \mathbf{z})$.
Deep latent Gaussian models (DLGMs) are a general class of models where the
observed variable is governed by a &lt;em&gt;hierarchy&lt;/em&gt; of latent variables, and the
latent variables at each level of the hierarchy are Gaussian &lt;em&gt;a priori&lt;/em&gt;
&lt;sup id="fnref1:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;In a typical instance of the variational autoencoder, we have only a single
layer of latent variables with a Normal prior distribution,&lt;/p&gt;
$$
p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}).
$$&lt;p&gt;Now, each local latent variable is related to its corresponding observation
through the likelihood $p_{\theta}(\mathbf{x} | \mathbf{z})$, which can
be viewed as a &lt;em&gt;probabilistic&lt;/em&gt; decoder. Given a hidden lower-dimensional
representation (or &amp;ldquo;code&amp;rdquo;) $\mathbf{z}$, it &amp;ldquo;decodes&amp;rdquo; it into a
&lt;em&gt;distribution&lt;/em&gt; over the observation $\mathbf{x}$.&lt;/p&gt;
&lt;h2 id="decoder"&gt;Decoder&lt;/h2&gt;
&lt;p&gt;In this example, we define $p_{\theta}(\mathbf{x} | \mathbf{z})$ to be a
multivariate Bernoulli whose probabilities are computed from $\mathbf{z}$ using
a fully-connected neural network with a single hidden layer,&lt;/p&gt;
$$
\begin{align*}
p_{\theta}(\mathbf{x} | \mathbf{z})
&amp; = \mathrm{Bern}( \sigma( \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2 ) ), \newline
\mathbf{h}
&amp; = h(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1),
\end{align*}
$$&lt;p&gt;where $\sigma$ is the logistic sigmoid function, $h$ is some non-linearity, and
the model parameters
$\theta = \{ \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2 \}$
consist of the weights and biases of this neural network.&lt;/p&gt;
&lt;p&gt;It is straightforward to implement this in Keras with the Sequential model API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can view a summary of the model parameters $\theta$ by calling
&lt;code&gt;decoder.summary()&lt;/code&gt;. Additionally, you can produce a high-level diagram of
the network architecture, and optionally the input and output shapes of each
layer using &lt;code&gt;plot_model&lt;/code&gt; from the
&lt;code&gt;keras.utils.vis_utils&lt;/code&gt; module. Although our architecture is about as
simple as it gets, it is included in the figure below as an example of what
the diagrams look like.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Decoder architecture"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/decoder.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Note that by fixing $\mathbf{W}_1$, $\mathbf{b}_1$ and $h$ to be the identity
matrix, the zero vector, and the identity function, respectively (or
equivalently dropping the first &lt;code&gt;Dense&lt;/code&gt; layer in the snippet above
altogether), we recover &lt;em&gt;logistic factor analysis&lt;/em&gt;.
With similarly minor modifications, we can recover other members from the
family of DLGMs, which include &lt;em&gt;non-linear factor analysis&lt;/em&gt;,
&lt;em&gt;non-linear Gaussian belief networks&lt;/em&gt;, &lt;em&gt;sigmoid belief networks&lt;/em&gt;, and many
others &lt;sup id="fnref2:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Having specified how the probabilities are computed, we can now define the
negative log likelihood of a Bernoulli $- \log p_{\theta}(\mathbf{x}|\mathbf{z})$, which is in fact equivalent to the
binary cross-entropy loss:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# keras.losses.binary_crossentropy gives the mean&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# over the last axis. we require the sum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As we discuss later, this will not be the loss we ultimately minimize, but will
constitute the data-fitting term of our final loss.&lt;/p&gt;
&lt;p&gt;Note this is a valid definition of a Keras loss function,
which is required to compile and optimize a model. It is a symbolic function
that returns a scalar for each data-point in &lt;code&gt;y_true&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt;.
In our example, &lt;code&gt;y_pred&lt;/code&gt; will be the output of our &lt;code&gt;decoder&lt;/code&gt; network, which
are the predicted probabilities, and &lt;code&gt;y_true&lt;/code&gt; will be the true probabilities.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-using-tensorflow-distributions-in-loss"&gt;Side note: Using TensorFlow Distributions in loss&lt;/h4&gt;
&lt;p&gt;If you are using the TensorFlow backend, you can directly use the (negative)
log probability of &lt;code&gt;Bernoulli&lt;/code&gt; from TensorFlow Distributions as a Keras
loss, as I demonstrate in another post.&lt;/p&gt;
&lt;p&gt;Specifically we can define the loss as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is exactly equivalent to the previous definition, but does not call
&lt;code&gt;K.binary_crossentropy&lt;/code&gt; directly.&lt;/p&gt;
&lt;hr&gt;
&lt;h1 id="inference"&gt;Inference&lt;/h1&gt;
&lt;p&gt;Having specified the generative process, we would now like to perform inference
on the latent variables and model parameters $\mathbf{z}$ and $\theta$,
respectively.
In particular, our goal is to compute the posterior
$p_{\theta}(\mathbf{z} | \mathbf{x})$, the conditional density of the latent
variable $\mathbf{z}$ given observed variable $\mathbf{x}$.
Additionally, we wish to optimize the model parameters $\theta$ with respect to
the marginal likelihood $p_{\theta}(\mathbf{x})$.
Both depend on the marginal likelihood, whose calculation requires marginalizing
out the latent variables $\mathbf{z}$. In general, this is computationally
intractable, requiring exponential time to compute, or it is analytically
intractable and cannot be evaluated in closed form. In our case, we suffer from
the latter intractability, since our Gaussian prior is not conjugate to the
Bernoulli likelihood.&lt;/p&gt;
&lt;p&gt;To circumvent this intractability we turn to &lt;em&gt;variational inference&lt;/em&gt;, which
formulates inference as an optimization problem. It seeks an approximate
posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ closest in Kullback-Leibler
(KL) divergence to the true posterior. More precisely, the approximate posterior
is parameterized by &lt;em&gt;variational parameters&lt;/em&gt; $\phi$, and we seek a setting
of these parameters that minimizes the aforementioned KL divergence,&lt;/p&gt;
$$
\phi^* = \mathrm{argmin}_{\phi}
\mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]
$$&lt;p&gt;With the luck we&amp;rsquo;ve had so far, it shouldn&amp;rsquo;t come as a surprise anymore that
&lt;em&gt;this too&lt;/em&gt; is intractable. It also depends on the log marginal likelihood,
whose intractability is the reason we appealed to approximate inference in the
first place. Instead, we &lt;em&gt;maximize&lt;/em&gt; an alternative objective function, the
&lt;em&gt;evidence lower bound&lt;/em&gt; (ELBO), which is expressed as&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(q)
&amp; =
\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [
\log p_{\theta}(\mathbf{x} | \mathbf{z}) +
\log p(\mathbf{z}) -
\log q_{\phi}(\mathbf{z} | \mathbf{x})
] \newline
&amp; =
\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}
[ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]
-\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ].
\end{align*}
$$&lt;p&gt;Importantly, the ELBO is a lower bound to the log marginal likelihood.
Therefore, maximizing it with respect to the model parameters $\theta$
approximately maximizes the log marginal likelihood.
Additionally, maximizing it with respect to variational parameters $\phi$ can
be shown to minimize
$\mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]$.
Also, it turns out that the KL divergence determines the tightness of the lower
bound, where we have equality iff the KL divergence is zero, which happens iff
$q_{\phi}(\mathbf{z} | \mathbf{x}) = p_{\theta}(\mathbf{z} | \mathbf{x})$.
Hence, simultaneously maximizing it with respect to $\theta$ and $\phi$ gets us
two birds with one stone.&lt;/p&gt;
&lt;p&gt;Next we discuss the form of the approximate posterior
$q_{\phi}(\mathbf{z} | \mathbf{x})$, which can be viewed as a
&lt;em&gt;probabilistic&lt;/em&gt; encoder. Its role is opposite to that of the decoder.
Given an observation $\mathbf{x}$, it &amp;ldquo;encodes&amp;rdquo; it into a &lt;em&gt;distribution&lt;/em&gt;
over its hidden lower-dimensional representations.&lt;/p&gt;
&lt;h2 id="encoder"&gt;Encoder&lt;/h2&gt;
&lt;p&gt;For each local observed variable $\mathbf{x}_n$, we wish to approximate
the true posterior distribution $p(\mathbf{z}_n|\mathbf{x}_n)$ over its
corresponding local latent variables $\mathbf{z}_n$. A common approach is to
approximate it using a &lt;em&gt;variational distribution&lt;/em&gt;
$q_{\lambda_n}(\mathbf{z}_n)$, specified as a diagonal
Gaussian, where the &lt;em&gt;local&lt;/em&gt; variational parameters
$\lambda_n = \{ \boldsymbol{\mu}_n, \boldsymbol{\sigma}_n \}$ are the mean and
standard deviation of this approximating distribution,
&lt;/p&gt;
$$
q_{\lambda_n}(\mathbf{z}_n) =
\mathcal{N}(
\mathbf{z}_n |
\boldsymbol{\mu}_n,
\mathrm{diag}(\boldsymbol{\sigma}_n^2)
).
$$&lt;p&gt;
This approach has a number of shortcomings. First, the number of local
variational parameters we need to optimize grows with the size of the dataset.
Second, a new set of local variational parameters need to be optimized for new
unseen test points. This is not to mention the strong factorization assumption
we make by specifying diagonal Gaussian distributions as the family of
approximations. The last is still an active area of research, and the first
two can be addressed by introducing a further approximation using an inference
network.&lt;/p&gt;
&lt;h3 id="inference-network"&gt;Inference network&lt;/h3&gt;
&lt;h1 id="q_phimathbfz_n--mathbfx_n"&gt;We &lt;em&gt;amortize&lt;/em&gt; the cost of inference by introducing an &lt;em&gt;inference network&lt;/em&gt; which
approximates the local variational parameters $\lambda_n$ for a given local
observed variable $\textbf{x}_n$.
For our approximating distribution in particular, given $\textbf{x}_n$ the
inference network yields two vector-valued outputs $\boldsymbol{\mu}_{\phi}(\textbf{x}_n)$ and
$\boldsymbol{\sigma}_{\phi}(\textbf{x}_n)$, which we use to approximate its local
variational parameters $\boldsymbol{\mu}_n$ and $\boldsymbol{\sigma}_n$, respectively.
Our approximate posterior distribution now becomes
$$
q_{\phi}(\mathbf{z}_n | \mathbf{x}_n)&lt;/h1&gt;
&lt;p&gt;\mathcal{N}(\mathbf{z}&lt;em&gt;n
| \boldsymbol{\mu}&lt;/em&gt;{\phi}(\mathbf{x}&lt;em&gt;n),
\mathrm{diag}(\boldsymbol{\sigma}&lt;/em&gt;{\phi}^2(\mathbf{x}_n))
).
$$
Instead of learning &lt;em&gt;local&lt;/em&gt; variational parameters $\lambda_n$ for each data-point,
we now learn a fixed number of &lt;em&gt;global&lt;/em&gt; variational parameters $\phi$ which
constitute the parameters (i.e. weights) of the inference network.
Moreover, this approximation allows statistical strength to be shared across
observed data-points and also generalize to unseen test points.&lt;/p&gt;
&lt;p&gt;We specify the mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and log variance
$\log \boldsymbol{\sigma}_{\phi}^2(\mathbf{x})$ of this distribution as the output of
an inference network. For this post, we keep the architecture of the network
simple, with only a single hidden layer and two fully-connected output layers.
Again, this is simple to define in Keras:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# input layer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# hidden layer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# output layer for mean and log variance&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since this network has multiple outputs, we couldn&amp;rsquo;t use the Sequential model
API as we did for the decoder. Instead, we will resort to the more powerful
functional API,
which allows us to implement complex models with shared layers, multiple
inputs, multiple outputs, and so on.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Inference network"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/inference_network.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Note that we output the log variance instead of the standard deviation because
this is not only more convenient to work with, but also helps with numerical
stability. However, we still require the standard deviation later. To recover
it, we simply implement the appropriate transformation and encapsulate it in a
&lt;code&gt;Lambda&lt;/code&gt; layer.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# normalize log variance to std dev&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Before moving on, we give a few words on nomenclature and context.
In the prelude and title of this section, we characterized the approximate
posterior distribution with an inference network as a probabilistic encoder
(analogously to its counterpart, the probabilistic decoder).
Although this is an accurate interpretation, it is a limited one.
Classically, inference networks are known as &lt;em&gt;recognition models&lt;/em&gt;, and have now
been used for decades in a wide variety of probabilistic methods.
When composed end-to-end, the recognition-generative model combination can be
seen as having an autoencoder structure. Indeed, this structure contains the
variational autoencoder as a special case, and also the now less fashionable
&lt;em&gt;Helmholtz machine&lt;/em&gt; &lt;sup id="fnref1:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;.
Even more generally, this recognition-generative model combination constitutes
a widely-applicable approach currently known as &lt;em&gt;amortized variational inference&lt;/em&gt;,
which can be used to perform approximate inference in models that lie beyond
even the large class of deep latent Gaussian models.&lt;/p&gt;
&lt;p&gt;Having specified all the ingredients necessary to carry out variational
inference (namely, the prior, likelihood and approximate posterior), we next
focus on finalizing the definition of the (negative) ELBO as our loss function
in Keras. As written earlier, the ELBO can be decomposed into two terms,
$\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]$
the expected log likelihood (ELL) over $q_{\phi}(\mathbf{z} | \mathbf{x})$,
and $- \mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ]$
the negative KL divergence between prior $p(\mathbf{z})$ and approximate
posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$. We first turn our attention
to the KL divergence term.&lt;/p&gt;
&lt;h3 id="kl-divergence"&gt;KL Divergence&lt;/h3&gt;
&lt;p&gt;Intuitively, maximizing the negative KL divergence term encourages approximate
posterior densities that place its mass on configurations of the latent
variables which are closest to the prior. Effectively, this regularizes the
complexity of latent space. Now, since both the prior $p(\mathbf{z})$ and
approximate posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ are Gaussian,
the KL divergence can actually be calculated with the closed-form expression,&lt;/p&gt;
$$
\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ]
= - \frac{1}{2} \sum_{k=1}^K \{ 1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2 \}
$$&lt;p&gt;where $\mu_k$ and $\sigma_k$ are the $k$-th components of output vectors
$\mu_{\phi}(\mathbf{x})$ and $\sigma_{\phi}(\mathbf{x})$, respectively.
This is not too difficult to derive, and I would recommend verifying this as an
exercise. You can also find a derivation in the appendix of Kingma and Welling&amp;rsquo;s
(2014) paper &lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Recall that earlier, we defined the expected log likelihood term of the ELBO as
a Keras loss. We were able to do this since the log likelihood is a function of
the network&amp;rsquo;s final output (the predicted probabilities), so it maps nicely to a
Keras loss. Unfortunately, the same does not apply for the KL divergence term,
which is a function of the network&amp;rsquo;s intermediate layer outputs, the mean &lt;code&gt;mu&lt;/code&gt;
and log variance &lt;code&gt;log_var&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We define an auxiliary custom layer
which takes &lt;code&gt;mu&lt;/code&gt; and &lt;code&gt;log_var&lt;/code&gt; as input and simply returns them as output
without modification. We do however explicitly introduce the side-effect of
calculating the KL divergence and adding it to a collection of losses, by
calling the method &lt;code&gt;add_loss&lt;/code&gt; &lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Identity transform layer that adds KL divergence
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; to the final model loss.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;kl_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_var&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kl_batch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next we feed &lt;code&gt;z_mu&lt;/code&gt; and &lt;code&gt;z_log_var&lt;/code&gt; through this layer (this needs to take
place before feeding &lt;code&gt;z_log_var&lt;/code&gt; through the Lambda layer to recover &lt;code&gt;z_sigma&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now when the Keras model is finally compiled, the collection of losses will be
aggregated and added to the specified Keras loss function to form the loss we
ultimately minimize. If we specify the loss as the negative log-likelihood we
defined earlier (&lt;code&gt;nll&lt;/code&gt;), we recover the negative ELBO as the final loss we
minimize, as intended.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-alternative-divergences"&gt;Side note: Alternative divergences&lt;/h4&gt;
&lt;p&gt;A key benefit of encapsulating the divergence in an auxiliary layer is that we
can easily implement and swap in other divergences, such as the
$\chi$-divergence or the $\alpha$-divergence.
Using alternative divergences for variational inference is an active research
topic &lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
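&lt;p&gt;For concreteness, here is a minimal sketch (not part of the original implementation) of how the auxiliary-layer pattern might be generalized: the divergence is supplied as a callable acting on &lt;code&gt;mu&lt;/code&gt; and &lt;code&gt;log_var&lt;/code&gt;, so swapping in another analytically tractable penalty amounts to passing a different function. The names &lt;code&gt;DivergenceLayer&lt;/code&gt; and &lt;code&gt;gaussian_kl&lt;/code&gt; are illustrative only.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras import backend as K
from keras.layers import Layer


def gaussian_kl(mu, log_var):
    """ Closed-form KL between N(mu, exp(log_var)) and N(0, I). """
    return - .5 * K.sum(1 + log_var - K.square(mu) - K.exp(log_var), axis=-1)


class DivergenceLayer(Layer):

    """ Identity transform layer that adds a user-supplied
    divergence penalty to the final model loss.
    """

    def __init__(self, divergence_fn=gaussian_kl, *args, **kwargs):
        self.is_placeholder = True
        self.divergence_fn = divergence_fn
        super(DivergenceLayer, self).__init__(*args, **kwargs)

    def call(self, inputs):

        mu, log_var = inputs

        # add the chosen divergence (averaged over the batch) to the losses
        self.add_loss(K.mean(self.divergence_fn(mu, log_var)), inputs=inputs)

        return inputs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Used this way, it is a drop-in replacement for the layer above:
&lt;code&gt;z_mu, z_log_var = DivergenceLayer()([z_mu, z_log_var])&lt;/code&gt;.&lt;/p&gt;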
&lt;hr&gt;
&lt;h4 id="side-note-implicit-models-and-adversarial-learning"&gt;Side note: Implicit models and adversarial learning&lt;/h4&gt;
&lt;p&gt;Additionally, we could also extend the divergence layer to use an auxiliary
density ratio estimator function, instead of evaluating the KL divergence in
the analytical form above.
This relaxes the requirement on approximate posterior
$q_{\phi}(\mathbf{z}|\mathbf{x})$ (and incidentally also prior $p(\mathbf{z})$)
to yield tractable densities, at the cost of maximizing a cruder estimate of the
ELBO.
This is known as Adversarial Variational Bayes&lt;sup id="fnref1:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;, and is an
important line of recent research that, when taken to its logical conclusion,
can extend the applicability of variational inference to arbitrarily expressive
implicit probabilistic models with intractable likelihoods&lt;sup id="fnref1:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
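&lt;p&gt;As a rough sketch of what this might look like (an assumption on my part, not code from this post), the layer below adds the output of an auxiliary discriminator network, trained separately to estimate $\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z})$, in place of the analytic KL term. The adversarial training loop for &lt;code&gt;discriminator&lt;/code&gt; is omitted.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras import backend as K
from keras.layers import Layer


class EstimatedKLLayer(Layer):

    """ Identity transform layer that adds an estimated KL penalty,
    given by a density ratio estimator network T(x, z).
    """

    def __init__(self, discriminator, *args, **kwargs):
        self.is_placeholder = True
        self.discriminator = discriminator  # Keras model mapping [x, z] to T(x, z)
        super(EstimatedKLLayer, self).__init__(*args, **kwargs)

    def call(self, inputs):

        x, z = inputs

        # E_q[T(x, z)] approximates the KL term when T is well-trained
        self.add_loss(K.mean(self.discriminator([x, z])), inputs=inputs)

        return inputs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;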
&lt;hr&gt;
&lt;h3 id="reparameterization-using-merge-layers"&gt;Reparameterization using Merge Layers&lt;/h3&gt;
&lt;p&gt;To perform gradient-based optimization of the ELBO with respect to model parameters
$\theta$ and variational parameters $\phi$, we require its gradients with
respect to these parameters, which are generally intractable.
Currently, the dominant approach for circumventing this is by Monte Carlo (MC)
estimation of the gradients. The basic idea is to write the gradient of the
ELBO as an expectation of the gradient, approximate it with MC estimates, then
perform stochastic gradient descent with the repeated MC gradient estimates.&lt;/p&gt;
&lt;p&gt;There exist a number of estimators based on different variance reduction
techniques. However, MC gradient estimates based on the reparameterization trick,
known as the &lt;em&gt;reparameterization gradients&lt;/em&gt;, have been shown to have the lowest
variance among competing estimators for continuous latent variables&lt;sup id="fnref3:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.
The reparameterization trick is a straightforward change of variables that
expresses the random variable $\mathbf{z} \sim q_{\phi}(\mathbf{z} | \mathbf{x})$
as a deterministic transformation $g_{\phi}$ of another random variable
$\boldsymbol{\epsilon}$ and input $\mathbf{x}$, with parameters $\phi$,&lt;/p&gt;
$$
\mathbf{z} = g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon}), \quad
\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}).
$$&lt;p&gt;Note that $p(\boldsymbol{\epsilon})$ is a simpler base distribution which is
parameter-free and independent of both $\mathbf{x}$ and $\phi$.
To prevent clutter, we write the ELBO as an expectation of the function
$f(\mathbf{x}, \mathbf{z}) = \log p_{\theta}(\mathbf{x} , \mathbf{z}) -
\log q_{\phi}(\mathbf{z} | \mathbf{x})$ over distribution
$q_{\phi}(\mathbf{z} | \mathbf{x})$.
Now, for any function $f(\mathbf{x}, \mathbf{z})$, taking the gradient of the
expectation with respect to $\phi$, and substituting all occurrences of
$\mathbf{z}$ with $g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})$, we have&lt;/p&gt;
$$
\begin{align*}
\nabla_{\phi} \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}
[ f(\mathbf{x}, \mathbf{z}) ]
&amp; = \nabla_{\phi} \mathbb{E}_{p(\boldsymbol{\epsilon})}
[ f(\mathbf{x}, g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})) ] \newline
&amp; = \mathbb{E}_{p(\mathbf{\epsilon})}
[ \nabla_{\phi} f(\mathbf{x}, g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})) ].
\end{align*}
$$&lt;p&gt;In other words, this simple reparameterization allows the gradient and the
expectation to commute, thereby allowing us to compute unbiased stochastic
estimates of the ELBO gradients by drawing noise samples $\boldsymbol{\epsilon}$
from $p(\boldsymbol{\epsilon})$.&lt;/p&gt;
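&lt;p&gt;A tiny NumPy sanity check (not part of the Keras implementation) makes this concrete: for $f(z) = z^2$ with $q(z) = \mathcal{N}(\mu, \sigma^2)$, the exact gradients are $\nabla_{\mu} \mathbb{E}[f] = 2\mu$ and $\nabla_{\sigma} \mathbb{E}[f] = 2\sigma$, and averaging the pathwise gradients of $f(\mu + \sigma \epsilon)$ over draws of $\epsilon$ recovers them.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

rng = np.random.RandomState(42)
mu, sigma = 1.5, 0.8

eps = rng.randn(100000)   # eps ~ N(0, 1)
z = mu + sigma * eps      # z ~ N(mu, sigma^2) by the location-scale transform

# pathwise (reparameterization) gradient estimates of E[z^2]
grad_mu = np.mean(2. * z)           # true value: 2*mu    = 3.0
grad_sigma = np.mean(2. * z * eps)  # true value: 2*sigma = 1.6

print(grad_mu, grad_sigma)          # approx. 3.0 and 1.6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;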
&lt;hr&gt;
&lt;p&gt;To recover the diagonal Gaussian approximation we specified earlier
$q_{\phi}(\mathbf{z}_n | \mathbf{x}_n) = \mathcal{N}(\mathbf{z}_n |
\boldsymbol{\mu}_{\phi}(\mathbf{x}_n), \mathrm{diag}(\boldsymbol{\sigma}_{\phi}^2(\mathbf{x}_n)))$,
we draw noise from the Normal base distribution, and specify a simple
location-scale transformation&lt;/p&gt;
$$
\mathbf{z}
= g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})
= \boldsymbol{\mu}_{\phi}(\mathbf{x}) +
\boldsymbol{\sigma}_{\phi}(\mathbf{x}) \odot
\boldsymbol{\epsilon}, \quad
\boldsymbol{\epsilon}
\sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
$$&lt;p&gt;where $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and $\boldsymbol{\sigma}_{\phi}(\mathbf{x})$ are the outputs
of the inference network defined earlier with parameters $\phi$, and $\odot$
denotes the elementwise product. In Keras, we explicitly make the noise vector
an input to the model by defining an Input layer for it. We then implement the
above location-scale transformation using Keras merge layers,
namely &lt;code&gt;Add&lt;/code&gt; and &lt;code&gt;Multiply&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Reparameterization with simple location-scale transformation using Keras merge layers.
"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/reparameterization.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-monte-carlo-sample-size"&gt;Side note: Monte Carlo sample size&lt;/h4&gt;
&lt;p&gt;Note that both the inputs for observed variables and noise (&lt;code&gt;x&lt;/code&gt; and &lt;code&gt;eps&lt;/code&gt;) need to be
specified explicitly as inputs to our final model.
Furthermore, the sizes of their first dimensions (i.e. the batch size) are required
to be the same.
This corresponds to using exactly one Monte Carlo sample to approximate the
expected log likelihood, drawing a single sample $\mathbf{z}_n$ from
$q_{\phi}(\mathbf{z}_n | \mathbf{x}_n)$ for each data-point $\mathbf{x}_n$ in
the batch. Although you might find an MC sample size of 1 surprisingly small,
it is actually adequate for a sufficiently large batch size (~100) &lt;sup id="fnref2:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.
In a follow-up post,
I demonstrate how to extend our approach to support larger MC sample sizes using
just a few minor tweaks. This extension is crucial for implementing the
&lt;em&gt;importance weighted autoencoder&lt;/em&gt; &lt;sup id="fnref1:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now, since the noise input is drawn from the Normal distribution, we can avoid
having to feed in values for this input from outside the computation graph
by binding a tensor to this Input layer. Specifically, we bind a tensor created
using &lt;code&gt;K.random_normal&lt;/code&gt; with the required shape,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;While &lt;code&gt;eps&lt;/code&gt; still needs to be explicitly specified as an input to compile the
model, values for this input will no longer be expected by methods such as
&lt;code&gt;fit&lt;/code&gt; and &lt;code&gt;predict&lt;/code&gt;. Instead, samples from this distribution will be lazily
generated inside the computation graph when required. See my related notes
for more details.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Encoder architecture."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/encoder.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In the original Keras example, all of this logic is encapsulated in a single
&lt;code&gt;Lambda&lt;/code&gt; layer, which simultaneously draws samples from a hard-coded base
distribution and also performs the location-scale transformation.
In contrast, this approach achieves a good level of modularity
and reusability.
By decoupling the random noise vector from the layer&amp;rsquo;s internal logic and
explicitly making it a model input, we emphasize the fact that all sources of
stochasticity emanate from this input. It thereby becomes clear that a random
sample drawn from a particular approximating distribution is obtained by feeding
this source of stochasticity through a number of successive deterministic
transformations.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-notes-gumbel-softmax-trick-for-discrete-latent-variables"&gt;Side notes: Gumbel-softmax trick for discrete latent variables&lt;/h4&gt;
&lt;p&gt;As an example, we could provide samples drawn from the Uniform distribution
as noise input. By applying a number of deterministic transformations that
constitute the &lt;em&gt;Gumbel-softmax reparameterization trick&lt;/em&gt; &lt;sup id="fnref1:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, we
are able to obtain samples from the Categorical distribution. This allows us
to perform approximate inference on &lt;em&gt;discrete&lt;/em&gt; latent variables, and can be
implemented in this framework by adding a dozen or so lines of code!&lt;/p&gt;
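&lt;p&gt;As a minimal sketch of those transformations (assuming a &lt;code&gt;logits&lt;/code&gt; tensor and a &lt;code&gt;temperature&lt;/code&gt; hyperparameter, neither of which appear elsewhere in this post), the sampling function might look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras import backend as K


def gumbel_softmax_sample(logits, temperature=0.5, eps=1e-20):
    """ Relaxed one-hot sample from a categorical distribution with
    unnormalized log-probabilities `logits`.
    """
    u = K.random_uniform(K.shape(logits), 0, 1)  # the uniform noise input
    gumbel = -K.log(-K.log(u + eps) + eps)       # transform to Gumbel(0, 1) noise
    return K.softmax((logits + gumbel) / temperature)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;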
&lt;h1 id="putting-it-all-together"&gt;Putting it all together&lt;/h1&gt;
&lt;p&gt;So far, we&amp;rsquo;ve dissected the variational autoencoder into modular components and
discussed the role and implementation of each one at some length.
Now let&amp;rsquo;s compose these components together end-to-end to form the final
autoencoder architecture.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&amp;rsquo;s surprisingly concise, taking up around 20 lines of code.
The diagram of the full model architecture is visualized below.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Variational autoencoder architecture."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Finally, we specify and compile the model, using the negative log likelihood
&lt;code&gt;nll&lt;/code&gt; defined earlier as the loss.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1 id="model-fitting"&gt;Model fitting&lt;/h1&gt;
&lt;h2 id="dataset-mnist-digits"&gt;Dataset: MNIST digits&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Variational autoencoder architecture for the MNIST digits dataset."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full_shapes.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="loss-nelbo-convergence"&gt;Loss (NELBO) convergence&lt;/h2&gt;
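&lt;p&gt;The plotting snippet below assumes a few names that are not defined in the excerpts above: &lt;code&gt;hist&lt;/code&gt;, the History object returned by &lt;code&gt;vae.fit&lt;/code&gt;; &lt;code&gt;ax&lt;/code&gt;, a Matplotlib Axes; and &lt;code&gt;pd&lt;/code&gt;, the pandas module. A minimal setup might look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import pandas as pd
import matplotlib.pyplot as plt

# capture the History object returned by `fit` instead of discarding it
hist = vae.fit(x_train, x_train,
               shuffle=True,
               epochs=epochs,
               batch_size=batch_size,
               validation_data=(x_test, x_test))

fig, ax = plt.subplots(figsize=(8, 6))
ax.set_xlabel('epoch')
ax.set_ylabel('loss (negative ELBO)')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;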
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt=""
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/nelbo.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="model-evaluation"&gt;Model evaluation&lt;/h1&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D plot of the digit classes in the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_8bb4eb676623e380.webp 320w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_64a98c5233df2a9d.webp 480w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_f97f8af14c434d9d.webp 600w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_8bb4eb676623e380.webp"
width="600"
height="500"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D manifold of the digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="c1"&gt;# figure with 15x15 digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;digit_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# linearly spaced coordinates on the unit square were transformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# through the inverse CDF (ppf) of the Gaussian to produce values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# of the latent variables z, since the prior of the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# is Gaussian&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_pred_grid&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e23f379b58eda1c7.webp 320w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e0d7ef0dff27fb2e.webp 480w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_8a0316a94df89cca.webp 500w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e23f379b58eda1c7.webp"
width="500"
height="500"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="recap"&gt;Recap&lt;/h1&gt;
&lt;p&gt;In this post, we covered the basics of amortized variational inference, looking
at variational autoencoders as a specific example. In particular, we&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implemented the decoder and encoder using the Sequential
and functional Model APIs, respectively.&lt;/li&gt;
&lt;li&gt;Augmented the final loss with the KL divergence term by writing an auxiliary
custom layer (&lt;code&gt;KLDivergenceLayer&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Worked with the log variance for numerical stability, and used a
&lt;code&gt;Lambda&lt;/code&gt; layer to transform it to the
standard deviation when necessary.&lt;/li&gt;
&lt;li&gt;Explicitly made the noise an Input layer, and implemented the
reparameterization trick using merge layers (&lt;code&gt;Add&lt;/code&gt; and &lt;code&gt;Multiply&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Bound a tensor created with &lt;code&gt;K.random_normal&lt;/code&gt; to the noise Input layer,
so random samples are generated &lt;em&gt;within&lt;/em&gt; the computation graph.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="whats-next"&gt;What&amp;rsquo;s next&lt;/h1&gt;
&lt;p&gt;Next, we will extend the divergence layer to use an auxiliary density ratio
estimator function, instead of evaluating the KL divergence in the analytical
form above.
This relaxes the requirement on approximate posterior
$q_{\phi}(\mathbf{z}|\mathbf{x})$ (and incidentally also prior $p(\mathbf{z})$)
to yield tractable densities, at the cost of maximizing a cruder estimate of the
ELBO.
This is known as Adversarial Variational Bayes&lt;sup id="fnref2:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;, and is an
important line of recent research that, when taken to its logical conclusion,
can extend the applicability of variational inference to arbitrarily expressive
implicit probabilistic models with intractable likelihoods&lt;sup id="fnref2:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tiao2017vae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;{A} {T}utorial on {V}ariational {A}utoencoders with a {C}oncise {K}eras {I}mplementation&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Tiao, Louis C&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;journal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tiao.io&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;2017&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on
social media!&lt;/p&gt;
&lt;h2 id="links--resources"&gt;Links &amp;amp; Resources&lt;/h2&gt;
&lt;p&gt;Below, you can find:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The accompanying code
used to generate the diagrams and plots in this post.&lt;/li&gt;
&lt;li&gt;The above snippets combined in a single executable Python file:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;784&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;epsilon_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# keras.losses.binary_crossentropy gives the mean&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# over the last axis. we require the sum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Identity transform layer that adds KL divergence
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; to the final model loss.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;kl_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_var&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kl_batch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
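&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# identity layer that registers the KL term as an auxiliary loss via add_loss&lt;/span&gt;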
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
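&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# reparameterization trick: sample eps ~ N(0, epsilon_std^2) and set z = z_mu + z_sigma * eps&lt;/span&gt;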
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stddev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epsilon_std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# train the VAE on MNIST digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D plot of the digit classes in the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D manifold of the digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="c1"&gt;# figure with 15x15 digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;digit_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# linearly spaced coordinates on the unit square were transformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# through the inverse CDF (ppf) of the Gaussian to produce values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# of the latent variables z, since the prior of the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# is Gaussian&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;u_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_decoded&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_decoded&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;D. P. Kingma and M. Welling, &amp;ldquo;Auto-Encoding Variational Bayes,&amp;rdquo; in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;
&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Section &amp;ldquo;Recognition models and amortised inference&amp;rdquo; in
&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, &amp;ldquo;The Helmholtz machine,&amp;rdquo; Neural Computation, vol. 7, no. 5, pp. 889–904, 1995.
&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;D. J. Rezende, S. Mohamed, and D. Wierstra, &amp;ldquo;Stochastic backpropagation and approximate inference in deep generative models,&amp;rdquo; in Proceedings of the 31st International Conference on Machine Learning, 2014, vol. 32, pp. 1278–1286. Beijing, China: PMLR.
&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref3:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;For a complete treatment of variational autoencoders, and variational
inference in general, I highly recommend:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jaan Altosaar&amp;rsquo;s blog post,
&lt;/li&gt;
&lt;li&gt;Diederik P. Kingma&amp;rsquo;s PhD Thesis.&lt;/li&gt;
&lt;/ul&gt;
&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;D. Rezende and S. Mohamed, &amp;ldquo;Variational Inference with Normalizing Flows,&amp;rdquo; in Proceedings of the 32nd International Conference on Machine Learning, 2015, vol. 37, pp. 1530–1538.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Y. Burda, R. Grosse, and R. Salakhutdinov, &amp;ldquo;Importance Weighted Autoencoders,&amp;rdquo; in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;E. Jang, S. Gu, and B. Poole, &amp;ldquo;Categorical Reparameterization with Gumbel-Softmax,&amp;rdquo; in Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;L. Mescheder, S. Nowozin, and A. Geiger, &amp;ldquo;Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks,&amp;rdquo; in Proceedings of the 34th International Conference on Machine Learning, 2017, vol. 70, pp. 2391–2400.&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;D. Tran, R. Ranganath, and D. Blei, &amp;ldquo;Hierarchical Implicit Models and Likelihood-Free Variational Inference,&amp;rdquo; in Advances in Neural Information Processing Systems 30, 2017.&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;To support sample weighting (fine-tuning how much each data-point
contributes to the loss), Keras losses are expected to return a scalar for each
data-point in the batch. In contrast, losses added with the &lt;code&gt;add_loss&lt;/code&gt;
method don&amp;rsquo;t support this and must be a single scalar.
Hence, we compute the KL divergence for every data-point in the batch and
take the mean before passing it to &lt;code&gt;add_loss&lt;/code&gt;.&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
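&lt;p&gt;A rough sketch of the distinction (assuming &lt;code&gt;K&lt;/code&gt; is the Keras backend imported earlier in the post; &lt;code&gt;per_sample_loss&lt;/code&gt; is a hypothetical name, not the model&amp;rsquo;s actual reconstruction loss):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# A Keras loss function returns one value per data-point, shape (batch_size,),
# which is what makes per-sample weighting possible.
def per_sample_loss(y_true, y_pred):
    return K.sum(K.binary_crossentropy(y_true, y_pred), axis=-1)

# A loss registered with add_loss must already be a single scalar,
# hence the K.mean over the batch inside KLDivergenceLayer.call:
#     self.add_loss(K.mean(kl_batch), inputs=inputs)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;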
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Y. Li and R. E. Turner, &amp;ldquo;Rényi Divergence Variational Inference,&amp;rdquo; in Advances in Neural Information Processing Systems 29, 2016.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei, &amp;ldquo;Variational Inference via $\chi$ Upper Bound Minimization,&amp;rdquo; in Advances in Neural Information Processing Systems 30, 2017.&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>