<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TensorFlow Probability |</title><link>https://tiao.io/tags/tensorflow-probability/</link><atom:link href="https://tiao.io/tags/tensorflow-probability/index.xml" rel="self" type="application/rss+xml"/><description>TensorFlow Probability</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 16 Apr 2023 11:16:03 +0000</lastBuildDate><image><url>https://tiao.io/media/icon_hu_9c2a75fde2335590.png</url><title>TensorFlow Probability</title><link>https://tiao.io/tags/tensorflow-probability/</link></image><item><title>Efficient Cholesky decomposition of low-rank updates</title><link>https://tiao.io/posts/efficient-cholesky-decomposition-of-low-rank-updates/</link><pubDate>Sun, 16 Apr 2023 11:16:03 +0000</pubDate><guid>https://tiao.io/posts/efficient-cholesky-decomposition-of-low-rank-updates/</guid><description>&lt;p&gt;Suppose we&amp;rsquo;re given a positive semidefinite (PSD)
matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$
that we wish to update by some low-rank
matrix $\mathbf{U} \mathbf{U}^\top \in \mathbb{R}^{N \times N}$,
$$\mathbf{B} \triangleq \mathbf{A} + \mathbf{U} \mathbf{U}^\top,$$
where the update factor matrix $\mathbf{U} \in \mathbb{R}^{N \times M}$.
To be more precise, the update is rank-$M$ for some $M \ll N$.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What is the best way to calculate the Cholesky decomposition of $\mathbf{B}$?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Given no additional information, the obvious way is to calculate it directly,
which incurs a cost of $\mathcal{O}(N^3)$.
But suppose we&amp;rsquo;ve already calculated the lower-triangular Cholesky factor
$\mathbf{L} \in \mathbb{R}^{N \times N}$ of $\mathbf{A}$
(i.e., $\mathbf{LL}^\top = \mathbf{A}$).
Then, we can use it to calculate the Cholesky decomposition of $\mathbf{B}$
at a reduced cost of $\mathcal{O}(N^2 M)$.
For instance, with $N = 10^4$ and $M = 10$, this amounts to roughly a
thousandfold reduction in the number of operations.
Here&amp;rsquo;s how.&lt;/p&gt;
&lt;h2 id="rank-1-updates"&gt;Rank-1 Updates&lt;/h2&gt;
&lt;p&gt;First, let&amp;rsquo;s consider the simpler case involving just &lt;em&gt;rank-1 updates&lt;/em&gt;,
$$\mathbf{B} \triangleq \mathbf{A} + \mathbf{u} \mathbf{u}^\top,$$
where the update factor vector $\mathbf{u} \in \mathbb{R}^{N}$.
With some clever manipulations&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;, the details of which we won&amp;rsquo;t
get into in this post, we can leverage $\mathbf{L}$ to
calculate the Cholesky decomposition of $\mathbf{B}$
at a reduced cost of $\mathcal{O}(N^2)$.
Such a procedure for rank-1 updates is implemented in LINPACK, the old-school
Fortran linear algebra software library (but unfortunately not in its
successor, LAPACK), and also in modern libraries like TensorFlow Probability
(TFP).&lt;/p&gt;
&lt;p&gt;In TFP, this is implemented in the function named &lt;code&gt;tfp.math.cholesky_update&lt;/code&gt;.
For example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_probability&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update_factor_vector&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3); suppose this is pre-computed and stored&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3), ignores `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^2), uses `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_factor_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here &lt;code&gt;cholesky_update&lt;/code&gt; takes as arguments &lt;code&gt;chol&lt;/code&gt; with shape &lt;code&gt;[B1, ..., Bn, N, N]&lt;/code&gt;
and &lt;code&gt;u&lt;/code&gt; with shape &lt;code&gt;[B1, ..., Bn, N]&lt;/code&gt;, and returns a lower triangular Cholesky
factor of the rank-1 updated matrix &lt;code&gt;chol @ chol.T + u @ u.T&lt;/code&gt; in $\mathcal{O}(N^2)$
time.&lt;/p&gt;
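&lt;p&gt;Since the snippet above uses placeholder tensors, here is a small
self-contained check with concrete (randomly generated) values; the particular
matrix construction below is just one way to produce a PSD test case:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

N = 4
rng = np.random.RandomState(42)

w = rng.randn(N, N)
a = tf.constant(w @ w.T + N * np.eye(N))  # a random PSD matrix
u = tf.constant(rng.randn(N))  # a random update factor vector

a_factor = tf.linalg.cholesky(a)
b = a + u[:, tf.newaxis] * u[tf.newaxis, :]  # rank-1 update

b_factor = tf.linalg.cholesky(b)  # direct, O(N^3)
b_factor_1 = tfp.math.cholesky_update(a_factor, u)  # O(N^2), uses `a_factor`

np.testing.assert_array_almost_equal(b_factor.numpy(), b_factor_1.numpy())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;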
&lt;h2 id="low-rank-updates"&gt;Low-Rank Updates&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s return to rank-$M$ updates.
First let&amp;rsquo;s write the update factor matrix $\mathbf{U}$ in terms of column
vectors $\mathbf{u}_m \in \mathbb{R}^{N}$,
$$
\mathbf{U} \triangleq
\begin{bmatrix}
\mathbf{u}_1 &amp; \cdots &amp; \mathbf{u}_M
\end{bmatrix}.
$$
&lt;/p&gt;
&lt;p&gt;Now we can write the rank-$M$ update matrix as a sum of $M$ rank-1 matrices,
$$
\mathbf{U} \mathbf{U}^\top =
\begin{bmatrix} \mathbf{u}_1 &amp; \cdots &amp; \mathbf{u}_M \end{bmatrix}
\begin{bmatrix} \mathbf{u}_1^\top \\ \vdots \\ \mathbf{u}_M^\top \end{bmatrix} =
\sum_{m=1}^{M} \mathbf{u}_m \mathbf{u}_m^\top.
$$
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, M]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, 1, M] [..., 1, N, M] -&amp;gt; [..., N, N, M] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, M] [..., M, N] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# not exactly equal due to finite precision, but still equal up to high precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Seen this way, a low-rank update is nothing more than a repeated application of
rank-1 updates,
$$
\begin{align}
\mathbf{B} &amp; = \mathbf{A} + \mathbf{U} \mathbf{U}^\top \\ &amp; =
\mathbf{A} + \sum_{m=1}^{M} \mathbf{u}_m \mathbf{u}_m^\top \\ &amp; =
((\mathbf{A} + \mathbf{u}_1 \mathbf{u}_1^\top) + \cdots ) + \mathbf{u}_M \mathbf{u}_M^{\top}.
\end{align}
$$
&lt;/p&gt;
&lt;p&gt;Therefore, we can simply leverage the $\mathcal{O}(N^2)$ procedure for Cholesky
decompositions of rank-1 updates and apply it recursively $M$ times to obtain
an $\mathcal{O}(N^2M)$ procedure for rank-$M$ updates.&lt;/p&gt;
&lt;p&gt;Hence, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, M] [..., M, N] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3), ignores `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^2M), uses `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_factor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where function &lt;code&gt;cholesky_update_iterated&lt;/code&gt; is implemented as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# base case&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can also implement this iteratively.
First we&amp;rsquo;d use &lt;code&gt;tf.unstack&lt;/code&gt; to turn the update factor matrix $\mathbf{U}$
into a list of update factor vectors $\mathbf{u}_m$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;update_factor_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# `update_factor_vectors` is a list&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="c1"&gt;# ... the list contains M vectors&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Bs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# ... and each vector has shape [B1, ..., Bn, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The astute reader will recognize that this is simply a special case of
the &lt;em&gt;fold&lt;/em&gt; or &lt;em&gt;reduce&lt;/em&gt; patterns, where
the &lt;em&gt;binary operator&lt;/em&gt; is &lt;code&gt;tfp.math.cholesky_update&lt;/code&gt;,
the &lt;em&gt;iterable&lt;/em&gt; is &lt;code&gt;tf.unstack(update_factor_matrix, axis=-1)&lt;/code&gt;, and
the &lt;em&gt;initial value&lt;/em&gt; is &lt;code&gt;chol&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Therefore, we can also implement it neatly using the one-liner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;In summary, we showed that to efficiently calculate the Cholesky decomposition
of a matrix perturbed by a low-rank update, one just needs to iteratively
calculate that of the same matrix perturbed by a series of rank-1 updates.
Better yet, all of this can be done with a simple one-liner!&lt;/p&gt;
&lt;p&gt;To receive updates on more posts like this, follow me on social media!&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Seeger, M. (2004). Low rank updates for the Cholesky decomposition.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>An Illustrated Guide to the Knowledge Gradient Acquisition Function</title><link>https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/</link><pubDate>Thu, 18 Feb 2021 19:13:23 +0100</pubDate><guid>https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;Draft &amp;ndash; work in progress.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We provide a short guide to the knowledge-gradient (KG) acquisition
function (Frazier et al., 2009)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; for Bayesian
optimization (BO).
Rather than being a self-contained tutorial, this post is intended to serve as
an illustrated compendium to the paper of Frazier et al., 2009&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;
and the subsequent tutorial by Frazier, 2018&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, authored
nearly a decade later.&lt;/p&gt;
&lt;p&gt;This post assumes a basic level of familiarity with BO and Gaussian processes (GPs),
to the extent provided by the literature survey of Shahriari et al.,
2015&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;, and the acclaimed textbook of Rasmussen and Williams, 2006,
respectively.&lt;/p&gt;
&lt;h2 id="knowledge-gradient"&gt;Knowledge-gradient&lt;/h2&gt;
&lt;p&gt;First, we set up the notation and terminology.
Let $f: \mathcal{X} \to \mathbb{R}$ be the blackbox function we wish to
minimize.
We denote the GP posterior predictive distribution, or &lt;em&gt;predictive&lt;/em&gt; for short,
by $p(y | \mathbf{x}, \mathcal{D})$.
The mean of the predictive, or the &lt;em&gt;predictive mean&lt;/em&gt; for short, is denoted by
&lt;/p&gt;
$$
\mu(\mathbf{x}; \mathcal{D}) = \mathbb{E}[y | \mathbf{x}, \mathcal{D}]
$$&lt;p&gt;
Let $\mathcal{D}_n$ be the set of $n$ input-output
observations $\mathcal{D}_n = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$, where
output $y_i = f(\mathbf{x}_i) + \epsilon$ is assumed to be observed with noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$.
We make the following abbreviation
&lt;/p&gt;
$$
\mu_n(\mathbf{x}) = \mu(\mathbf{x}; \mathcal{D}_n)
$$&lt;p&gt;
Next, we define the minimum of the predictive mean, or &lt;em&gt;predictive minimum&lt;/em&gt; for short,
as
&lt;/p&gt;
$$
\tau(\mathcal{D}) = \min_{\mathbf{x}' \in \mathcal{X}} \mu(\mathbf{x}'; \mathcal{D})
$$&lt;p&gt;
If we view $\mu(\mathbf{x}; \mathcal{D})$ as our fit to the underlying
function $f(\mathbf{x})$ from which the observations $\mathcal{D}$ were
generated, then $\tau(\mathcal{D})$ is our estimate of the minimum of $f(\mathbf{x})$,
given observations $\mathcal{D}$.&lt;/p&gt;
&lt;p&gt;Further, we make the following abbreviations
&lt;/p&gt;
$$
\tau_n = \tau(\mathcal{D}_n),
\qquad
\text{and}
\qquad
\tau_{n+1} = \tau(\mathcal{D}_{n+1}),
$$&lt;p&gt;
where $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{ (\mathbf{x}, y) \}$ is the
set of existing observations, augmented by some input-output pair $(\mathbf{x}, y)$.
Then, the knowledge-gradient is defined as
&lt;/p&gt;
$$
\alpha(\mathbf{x}; \mathcal{D}_n) =
\mathbb{E}_{p(y | \mathbf{x}, \mathcal{D}_n)} [ \tau_n - \tau_{n+1} ]
$$&lt;p&gt;
Crucially, note that $\tau_{n+1}$ is implicitly a function of $(\mathbf{x}, y)$,
and that this expression integrates over all possible outcomes $y$ at the
given $\mathbf{x}$ under the
predictive $p(y | \mathbf{x}, \mathcal{D}_n)$.&lt;/p&gt;
&lt;h3 id="monte-carlo-estimation"&gt;Monte Carlo estimation&lt;/h3&gt;
&lt;p&gt;Not surprisingly, the knowledge-gradient function is analytically intractable.
Therefore, in practice, we compute it using Monte Carlo estimation,
&lt;/p&gt;
$$
\alpha(\mathbf{x}; \mathcal{D}_n) \approx
\frac{1}{M} \sum_{m=1}^M \left ( \tau_n - \tau_{n+1}^{(m)} \right ),
\qquad
y^{(m)} \sim p(y | \mathbf{x}, \mathcal{D}_n),
$$&lt;p&gt;
where $\tau_{n+1}^{(m)} = \tau(\mathcal{D}_{n+1}^{(m)})$
and $\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (\mathbf{x}, y^{(m)}) \}$.&lt;/p&gt;
&lt;p&gt;We refer to $y^{(m)}$ as the $m$th simulated outcome, or the $m$th &lt;em&gt;simulation&lt;/em&gt;
for short.
Then, $\mathcal{D}_{n+1}^{(m)}$ is the $m$th simulation-augmented dataset and,
accordingly, $\tau_{n+1}^{(m)}$ is the $m$th simulation-augmented predictive minimum.&lt;/p&gt;
&lt;p&gt;We see that this approximation to the knowledge-gradient is simply the average
difference between the predictive minimum values &lt;em&gt;based on simulation-augmented
data&lt;/em&gt; $\tau_{n+1}^{(m)}$, and that &lt;em&gt;based on observed data&lt;/em&gt; $\tau_n$,
across $M$ simulations.&lt;/p&gt;
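&lt;p&gt;In code, the estimator might look something like the following minimal
sketch. Here &lt;code&gt;predictive&lt;/code&gt; (a sampler for $p(y | \mathbf{x}, \mathcal{D}_n)$) and
&lt;code&gt;tau&lt;/code&gt; (a routine that computes the predictive minimum for the model fitted to
a given dataset) are hypothetical helpers; Step 2 below sketches the inner
minimization behind &lt;code&gt;tau&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

def knowledge_gradient(x, dataset, predictive, tau, num_samples=5):
    # tau_n: predictive minimum based on the observed data alone
    tau_n = tau(dataset)
    diffs = []
    for _ in range(num_samples):
        # simulate an outcome y^(m) ~ p(y | x, D_n)
        y_sim = predictive(x)
        # augment the observed data with the simulated pair (x, y^(m))
        dataset_sim = dataset + [(x, y_sim)]
        # tau_{n+1}^(m): predictive minimum based on simulation-augmented data
        diffs.append(tau_n - tau(dataset_sim))
    # average the differences across the M simulations
    return np.mean(diffs)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;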
&lt;p&gt;This might take a moment to digest, as there are quite a number of moving parts
to keep track of. To help visualize these parts, we provide an illustration of
each of the steps required to compute KG on a simple one-dimensional synthetic
problem.&lt;/p&gt;
&lt;h2 id="one-dimensional-example"&gt;One-dimensional example&lt;/h2&gt;
&lt;p&gt;As the running example throughout this post, we use a synthetic function
defined as
&lt;/p&gt;
$$
f(x) = \sin(3x) + x^2 - 0.7 x.
$$&lt;p&gt;
We generate $n=10$ observations at locations sampled uniformly at random.
The true function, and the set of noisy observations $\mathcal{D}_n$ are
visualized in the figure below:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/observations_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Latent blackbox function and $n=10$ observations.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
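&lt;p&gt;For concreteness, a dataset like this can be generated as follows. The noise
scale and input bounds below are assumed values for illustration; the exact
settings behind the figures are not stated.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

def f(x):
    # latent blackbox function
    return np.sin(3 * x) + x**2 - 0.7 * x

n = 10
noise_scale = 0.2  # assumed noise level
low, high = -1.0, 2.0  # assumed input bounds

rng = np.random.RandomState(42)
X = rng.uniform(low, high, size=(n, 1))  # locations sampled uniformly at random
Y = f(X) + noise_scale * rng.randn(n, 1)  # noisy observations
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;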
&lt;p&gt;Using the observations $\mathcal{D}_n$ we have collected so far, we wish to
use KG to score a candidate location $x_c$ at which to evaluate next.&lt;/p&gt;
&lt;h2 id="posterior-predictive-distribution"&gt;Posterior predictive distribution&lt;/h2&gt;
&lt;p&gt;The posterior predictive $p(y | \mathbf{x}, \mathcal{D}_n)$ is visualized in
the figure below. In particular, the predictive mean $\mu_n(\mathbf{x})$ is
represented by the solid orange curve.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_mean_before_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Posterior predictive distribution (*before* hyperparameter estimation).&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Clearly, this is a poor fit to the data and an uncalibrated estimate of the
predictive uncertainty.&lt;/p&gt;
&lt;h3 id="step-1-hyperparameter-estimation"&gt;Step 1: Hyperparameter estimation&lt;/h3&gt;
&lt;p&gt;Therefore, the first step is to optimize the hyperparameters of the GP regression
model, i.e. the kernel lengthscale, amplitude, and the observation noise variance.
We do this using type-II maximum likelihood estimation (MLE), or &lt;em&gt;empirical Bayes&lt;/em&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_mean_after_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Posterior predictive distribution (*after* hyperparameter estimation).&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
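&lt;p&gt;As a rough sketch of this step, here is how type-II MLE might be carried out
with GPflow, one of several libraries that support it (the figures in this
post are not necessarily produced with it):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import gpflow

# kernel with lengthscale and amplitude; the model adds an observation noise variance
kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR((X, Y), kernel=kernel)

# maximize the log marginal likelihood with respect to the hyperparameters
opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss, model.trainable_variables)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;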
&lt;h3 id="step-2-determine-the-predictive-minimum"&gt;Step 2: Determine the predictive minimum&lt;/h3&gt;
&lt;p&gt;Next, we compute the predictive minimum $\tau_n = \min_{\mathbf{x}' \in \mathcal{X}} \mu_n(\mathbf{x}')$.
Since $\mu_n$ is end-to-end differentiable with respect to the input $\mathbf{x}$, we can
simply use a multi-started quasi-Newton hill-climber such as L-BFGS.
We visualize this in the figure below, where the value of the predictive
minimum is represented by the orange horizontal dashed line, and its location is
denoted by the orange star and triangle.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_minimum_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Predictive minimum $\tau_n$.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
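&lt;p&gt;A minimal sketch of this minimization uses SciPy&amp;rsquo;s L-BFGS-B with random
restarts. Here &lt;code&gt;predictive_mean&lt;/code&gt; is an assumed callable mapping an input array
to the scalar $\mu_n(\mathbf{x})$ (e.g. a thin wrapper around
&lt;code&gt;model.predict_f&lt;/code&gt; in the GPflow sketch above); for simplicity we let SciPy
estimate gradients numerically, though exact gradients could be supplied via
&lt;code&gt;jac&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.optimize import minimize

def tau(predictive_mean, bounds=(-1.0, 2.0), num_restarts=10, seed=0):
    # multi-started L-BFGS minimization of the predictive mean
    rng = np.random.RandomState(seed)
    results = [
        minimize(predictive_mean, x0=[x0], bounds=[bounds], method="L-BFGS-B")
        for x0 in rng.uniform(*bounds, size=num_restarts)
    ]
    return min(res.fun for res in results)  # value of the predictive minimum
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;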
&lt;h3 id="step-3-compute-simulation-augmented-predictive-means"&gt;Step 3: Compute simulation-augmented predictive means&lt;/h3&gt;
&lt;p&gt;Suppose we are scoring the candidate location $x_c = 0.1$.
For illustrative purposes, let us draw just $M=1$ sample $y_c^{(1)} \sim p(y | x_c, \mathcal{D}_n)$.
In the figure below, the candidate location $x_c$ is represented by the
vertical solid gray line, and the single simulated outcome $y_c^{(1)}$ is
represented by the filled blue dot.&lt;/p&gt;
&lt;p&gt;In general, we denote the simulation-augmented predictive mean as
&lt;/p&gt;
$$
\mu_{n+1}^{(m)}(\mathbf{x}) = \mu(\mathbf{x}; \mathcal{D}_{n+1}^{(m)}),
$$&lt;p&gt;
where
$\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (\mathbf{x}, y^{(m)}) \}$
as defined earlier.&lt;/p&gt;
&lt;p&gt;Here, the simulation-augmented dataset $\mathcal{D}_{n+1}^{(1)}$ is the set
of existing observations $\mathcal{D}_n$, augmented by the simulated
input-output pair $(x_c, y_c^{(1)})$,
&lt;/p&gt;
$$
\mathcal{D}_{n+1}^{(1)} = \mathcal{D}_n \cup \{ (x_c, y_c^{(1)}) \},
$$&lt;p&gt;
and the corresponding simulation-augmented predictive mean $\mu_{n+1}^{(1)}(x)$
is represented in the figure below by the solid blue curve.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/simulated_predictive_mean_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive mean $\mu_{n&amp;#43;1}^{(1)}(x)$ at location $x_c = 0.1$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
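&lt;p&gt;Continuing the GPflow-based sketch from Step 1, simulating an outcome and
conditioning on it might look as follows; note that we keep the optimized
hyperparameters fixed and simply refit on the augmented data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;x_c = np.array([[0.1]])  # candidate location

# draw a simulated outcome y_c^(1) ~ p(y | x_c, D_n) from the Gaussian predictive
y_mean, y_var = model.predict_y(x_c)
y_c = y_mean.numpy() + np.sqrt(y_var.numpy()) * np.random.randn(1, 1)

# simulation-augmented dataset D_{n+1}^(1) ...
X_aug = np.vstack([X, x_c])
Y_aug = np.vstack([Y, y_c])

# ... and the model whose posterior mean is the simulation-augmented
# predictive mean mu_{n+1}^(1)
model_aug = gpflow.models.GPR(
    (X_aug, Y_aug), kernel=model.kernel,
    noise_variance=model.likelihood.variance.numpy())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;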
&lt;h3 id="step-4-compute-simulation-augmented-predictive-minimums"&gt;Step 4: Compute simulation-augmented predictive minimums&lt;/h3&gt;
&lt;p&gt;Next, we compute the simulation-augmented predictive minimum
&lt;/p&gt;
$$
\tau_{n+1}^{(1)} = \min_{\mathbf{x}' \in \mathcal{X}} \mu_{n+1}^{(1)}(\mathbf{x}')
$$&lt;p&gt;
It may not be immediately obvious, but $\mu_{n+1}^{(1)}$ is in fact also
end-to-end differentiable wrt to input $\mathbf{x}$. Therefore, we can again
appeal to an method such as L-BFGS.
We visualize this in the figure below, where the value of the simulation-augmented
predictive minimum is represented by the blue horizontal dashed line, and its
location is denoted by the blue star and triangle.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/simulated_predictive_minimum_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive minimum $\tau_{n&amp;#43;1}^{(1)}$ at location $x_c = 0.1$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Taking the difference between the orange and blue horizontal dashed lines will
give us an unbiased estimate of the knowledge-gradient.
However, this is likely to be a crude one, since it is based on just a single
MC sample.
To obtain a more accurate estimate, one needs to increase $M$, the number of
MC samples.&lt;/p&gt;
&lt;h4 id="samples"&gt;Samples $M &gt; 1$&lt;/h4&gt;
&lt;p&gt;Let us now consider $M=5$ samples. We draw $y_c^{(m)} \sim p(y | x_c, \mathcal{D}_n)$,
for $m = 1, \dotsc, 5$.
As before, the input location $x_c$ is represented by the vertical solid
gray line, and the corresponding simulated outcomes are represented by the
filled dots below, with varying hues from a perceptually uniform color palette
to distinguish between samples.&lt;/p&gt;
&lt;p&gt;Accordingly, the simulation-augmented predictive means
$\mu_{n+1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$ are
represented by the colored curves, with hues set to that of the simulated
outcome on which the predictive distribution is based.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/bar_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive mean $\mu_{n&amp;#43;1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Next we compute the simulation-augmented predictive
minimum $\tau_{n+1}^{(m)}$, which requires minimizing
$\mu_{n+1}^{(m)}(x)$ for $m = 1, \dotsc, 5$.
These values are represented below by the horizontal dashed lines, and their
location is denoted by the stars and triangles.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/baz_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive minimum $\tau_{n&amp;#43;1}^{(m)}$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Finally, taking the average difference between the orange dashed line and every
other dashed line gives us the estimate of the knowledge gradient at
input $x_c$.&lt;/p&gt;
&lt;h2 id="links-and-further-readings"&gt;Links and Further Readings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;In this post, we only showed a (naïve) approach to calculating the KG at a
given location.
Suffice it to say, there is still quite a gap between this and being able to
efficiently maximize KG within a sequential decision-making algorithm.
For a guide on incorporating KG in a modular and fully-fledged framework for
BO, see the tutorials that accompany such frameworks.
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2021knowledge,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{A}n {I}llustrated {G}uide to the {K}nowledge {G}radient {A}cquisition {F}unction&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2021&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on social media!&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Frazier, P., Powell, W., &amp;amp; Dayanik, S. (2009).
The Knowledge-Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing, 21(4), 599-613.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Frazier, P. I. (2018).
A Tutorial on Bayesian Optimization. arXiv preprint arXiv:1807.02811.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015).
Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1), 148-175.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Building Probability Distributions with the TensorFlow Probability Bijector API</title><link>https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/</link><pubDate>Mon, 30 Jul 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/</guid><description>&lt;p&gt;TensorFlow Distributions, now under the broader umbrella of
TensorFlow Probability, is a fantastic TensorFlow library for efficient and
composable manipulation of probability distributions&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Among the many features it has to offer, one of the most powerful in my opinion
is the &lt;code&gt;Bijector&lt;/code&gt; API, which provides the modular building blocks necessary to
construct a broad class of probability distributions.
Instead of describing it any further in the abstract, let&amp;rsquo;s dive right in with
a simple example.&lt;/p&gt;
&lt;h2 id="example-banana-shaped-distribution"&gt;Example: Banana-shaped distribution&lt;/h2&gt;
&lt;p&gt;Consider the &lt;em&gt;banana-shaped distribution&lt;/em&gt;, a commonly-used testbed for adaptive
MCMC methods&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;.
Denote the density of this distribution as $p_{Y}(\mathbf{y})$.
To illustrate, 1k samples randomly drawn from this distribution are shown below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Banana distribution samples"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_samples.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;The underlying process that generates samples
$\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y})$ is simple to describe,
and is of the general form,&lt;/p&gt;
$$
\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y}) \quad
\Leftrightarrow \quad
\tilde{\mathbf{y}} = G(\tilde{\mathbf{x}}),
\quad \tilde{\mathbf{x}} \sim p_{X}(\mathbf{x}).
$$&lt;p&gt;In other words, a sample $\tilde{\mathbf{y}}$ is the output of a transformation
$G$, given a sample $\tilde{\mathbf{x}}$ drawn from some underlying
base distribution $p_{X}(\mathbf{x})$.&lt;/p&gt;
&lt;p&gt;However, it is not as straightforward to compute an analytical expression for
the density $p_{Y}(\mathbf{y})$.
In fact, this is only possible if $G$ is a &lt;em&gt;differentiable&lt;/em&gt; and &lt;em&gt;invertible&lt;/em&gt;
transformation (a &lt;em&gt;diffeomorphism&lt;/em&gt;&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;), and if there is an analytical
expression for $p_{X}(\mathbf{x})$.&lt;/p&gt;
&lt;p&gt;Transformations that fail to satisfy these conditions (which includes something
as simple as a multi-layer perceptron with non-linear activations) give rise to
&lt;em&gt;implicit distributions&lt;/em&gt;, and will be the subject of many posts to come.
But for now, we will restrict our attention to diffeomorphisms.&lt;/p&gt;
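&lt;p&gt;For completeness: when $G$ is such a diffeomorphism, the density of
$\mathbf{y}$ is given by the standard change-of-variables formula,
&lt;/p&gt;
$$
p_{Y}(\mathbf{y}) = p_{X}\left(G^{-1}(\mathbf{y})\right)
\left| \det \frac{\partial G^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right|,
$$&lt;p&gt;
and this is precisely the bookkeeping that the &lt;code&gt;Bijector&lt;/code&gt; API automates for us.&lt;/p&gt;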
&lt;h3 id="base-distribution"&gt;Base distribution&lt;/h3&gt;
&lt;p&gt;Following on with our example, the base distribution $p_{X}(\mathbf{x})$ is
given by a two-dimensional Gaussian with unit variances and covariance
$\rho = 0.95$:&lt;/p&gt;
$$
p_{X}(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \mathbf{0}, \mathbf{\Sigma}),
\qquad
\mathbf{\Sigma} =
\begin{bmatrix}
1 &amp; 0.95 \newline
0.95 &amp; 1
\end{bmatrix}
$$&lt;p&gt;This can be encapsulated by an instance of &lt;code&gt;tfd.MultivariateNormalTriL&lt;/code&gt;,
which is parameterized by a lower-triangular matrix.
First let&amp;rsquo;s import TensorFlow Distributions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.contrib.distributions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then we create the covariance matrix, take its Cholesky factor, and instantiate the distribution:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MultivariateNormalTriL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale_tril&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sigma&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As with all subclasses of &lt;code&gt;tfd.Distribution&lt;/code&gt;, we can evaluate the probability
density function of this distribution by calling the &lt;code&gt;p_x.prob&lt;/code&gt; method.
Evaluating this on a uniformly-spaced grid yields the equiprobability contour
plot below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Base density"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_base_density.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="forward-transformation"&gt;Forward Transformation&lt;/h3&gt;
&lt;p&gt;The required transformation $G$ is defined as:&lt;/p&gt;
$$
G(\mathbf{x}) =
\begin{bmatrix}
x_1 \newline
x_2 - x_1^2 - 1 \newline
\end{bmatrix}
$$&lt;p&gt;We implement this in the &lt;code&gt;_forward&lt;/code&gt; function below&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can now use this to generate samples from $p_{Y}(\mathbf{y})$.
To do this we first sample from the base distribution $p_{X}(\mathbf{x})$ by
calling &lt;code&gt;p_x.sample&lt;/code&gt;. For this illustration, we generate 1k samples, which is
specified through the &lt;code&gt;sample_shape&lt;/code&gt; argument. We then transform these samples
through $G$ by calling &lt;code&gt;_forward&lt;/code&gt; on them.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The figure below contains scatterplots of the 1,000 samples &lt;code&gt;x_samples&lt;/code&gt; (left)
and the transformed &lt;code&gt;y_samples&lt;/code&gt; (right):&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Banana and base samples"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_base_samples.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
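&lt;p&gt;The plotting code is not shown here, but a minimal matplotlib sketch along
the following lines could produce such scatterplots (the figure size and
styling are arbitrary choices):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import matplotlib.pyplot as plt

# side-by-side scatterplots of the base samples and the transformed samples
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))
ax1.scatter(x_samples[:, 0], x_samples[:, 1], alpha=0.4)
ax1.set_title(&amp;#34;base samples&amp;#34;)
ax2.scatter(y_samples[:, 0], y_samples[:, 1], alpha=0.4)
ax2.set_title(&amp;#34;transformed samples&amp;#34;)
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;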
&lt;h3 id="instantiating-a-transformeddistribution-with-a-bijector"&gt;Instantiating a &lt;code&gt;TransformedDistribution&lt;/code&gt; with a &lt;code&gt;Bijector&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Having specified the forward transformation and the underlying distribution, we
have now fully described the sample generation process, which is the bare
minimum necessary to define a probability distribution.&lt;/p&gt;
&lt;p&gt;The forward transformation is also the &lt;em&gt;first&lt;/em&gt; of &lt;strong&gt;three&lt;/strong&gt; operations needed to
fully specify a &lt;code&gt;Bijector&lt;/code&gt;, which can be used to instantiate a
&lt;code&gt;TransformedDistribution&lt;/code&gt; that encapsulates the banana-shaped distribution.&lt;/p&gt;
&lt;h4 id="creating-a-bijector"&gt;Creating a &lt;code&gt;Bijector&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;First, let&amp;rsquo;s subclass &lt;code&gt;Bijector&lt;/code&gt; to define the &lt;code&gt;Banana&lt;/code&gt; bijector and implement
the forward transformation as an instance method:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bijectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bijector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;banana&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inverse_min_event_ndims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that we need to specify either &lt;code&gt;forward_min_event_ndims&lt;/code&gt; or
&lt;code&gt;inverse_min_event_ndims&lt;/code&gt;, the minimum number of dimensions the forward or
inverse transformation operates on (the two can sometimes differ).
In our example, both the forward and inverse transformations operate on vectors
(rank-1 tensors), so we set &lt;code&gt;inverse_min_event_ndims=1&lt;/code&gt;.&lt;/p&gt;
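&lt;p&gt;To make the event-dimension semantics concrete, here is a small sketch of the
shapes involved (the sample and batch sizes below are arbitrary illustrative
choices): only the last dimension is treated as the event being transformed,
while any leading dimensions pass through unchanged.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;bijector = Banana()

# leading dimensions [7, 1000] are batch/sample dimensions;
# the trailing dimension [2] is the event being transformed
x = tf.zeros([7, 1000, 2])
y = bijector.forward(x)

print(y.shape)  # (7, 1000, 2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;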
&lt;p&gt;With an instance of the &lt;code&gt;Banana&lt;/code&gt; bijector, we can call the &lt;code&gt;forward&lt;/code&gt; method on
&lt;code&gt;x_samples&lt;/code&gt; to produce &lt;code&gt;y_samples&lt;/code&gt; as before:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="instantiating-a-transformeddistribution"&gt;Instantiating a &lt;code&gt;TransformedDistribution&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;More importantly, we can now create a &lt;code&gt;TransformedDistribution&lt;/code&gt; with the base
distribution &lt;code&gt;p_x&lt;/code&gt; and an instance of the &lt;code&gt;Banana&lt;/code&gt; bijector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransformedDistribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bijector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This now allows us to sample directly from &lt;code&gt;p_y&lt;/code&gt;, just as we can with &lt;code&gt;p_x&lt;/code&gt;
and any other TensorFlow Probability &lt;code&gt;Distribution&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Neat!&lt;/p&gt;
&lt;h3 id="probability-density-function"&gt;Probability Density Function&lt;/h3&gt;
&lt;p&gt;Although we can now sample from this distribution, we have yet to define the
operations necessary to evaluate its probability density function&amp;mdash;the
remaining &lt;em&gt;two&lt;/em&gt; of &lt;strong&gt;three&lt;/strong&gt; operations needed to fully specify a &lt;code&gt;Bijector&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Indeed, calling &lt;code&gt;p_y.prob&lt;/code&gt; at this stage would simply raise a
&lt;code&gt;NotImplementedError&lt;/code&gt; exception. So what else do we need to define?&lt;/p&gt;
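&lt;p&gt;To see this concretely, here is a minimal sketch using the partially
specified bijector from above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# the Banana bijector so far only implements _forward
p_y = tfd.TransformedDistribution(distribution=p_x, bijector=Banana())

try:
    p_y.prob(y_samples)
except NotImplementedError:
    print(&amp;#34;inverse and log det Jacobian are not yet implemented&amp;#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;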
&lt;p&gt;Recall the probability density of $p_{Y}(\mathbf{y})$ is given by:&lt;/p&gt;
$$
p_{Y}(\mathbf{y}) = p_{X}(G^{-1}(\mathbf{y})) \mathrm{det}
\left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
$$&lt;p&gt;Hence, we need to specify the inverse transformation $G^{-1}(\mathbf{y})$ and its
Jacobian determinant
$\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )$.&lt;/p&gt;
&lt;p&gt;For numerical stability, the &lt;code&gt;Bijector&lt;/code&gt; API requires that this be defined in
log-space. Hence, it is useful to recall that the forward and inverse log
determinant Jacobians differ only in their signs&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;,&lt;/p&gt;
$$
\begin{align}
\log \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
&amp; = - \log \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right ),
\end{align}
$$&lt;p&gt;which gives us the option of implementing either (or both).
However, do note the following from the official
API docs:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Generally its preferable to directly implement the inverse Jacobian
determinant. This should have superior numerical stability and will often share
subgraphs with the &lt;code&gt;_inverse&lt;/code&gt; implementation.&lt;/p&gt;
&lt;/blockquote&gt;
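&lt;p&gt;In other words, either one determines the other by negation; as a sketch
(this is not part of the final class below, which implements the inverse
version directly), a forward log det Jacobian could in principle be derived
from an inverse one like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def _forward_log_det_jacobian(self, x):
    # hypothetical sketch: negate the inverse log det Jacobian,
    # evaluated at y = G(x)
    return -self._inverse_log_det_jacobian(self._forward(x))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;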
&lt;h3 id="inverse-transformation"&gt;Inverse Transformation&lt;/h3&gt;
&lt;p&gt;So let&amp;rsquo;s implement the inverse transform $G^{-1}$, which is given by:&lt;/p&gt;
$$
G^{-1}(\mathbf{y}) =
\begin{bmatrix}
y_1 \newline
y_2 + y_1^2 + 1 \newline
\end{bmatrix}
$$&lt;p&gt;We define this in the &lt;code&gt;_inverse&lt;/code&gt; function below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="jacobian-determinant"&gt;Jacobian determinant&lt;/h3&gt;
&lt;p&gt;Now we compute the log determinant of the Jacobian of the &lt;em&gt;inverse&lt;/em&gt;
transformation.
In this simple example, the transformation is &lt;em&gt;volume-preserving&lt;/em&gt;, meaning its
Jacobian determinant is equal to 1.&lt;/p&gt;
&lt;p&gt;This is easy to verify:&lt;/p&gt;
$$
\begin{align}
\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
&amp; = \mathrm{det}
\begin{pmatrix}
\frac{\partial}{\partial y_1} y_1 &amp; \frac{\partial}{\partial y_2} y_1 \newline
\frac{\partial}{\partial y_1} \left( y_2 + y_1^2 + 1 \right) &amp; \frac{\partial}{\partial y_2} \left( y_2 + y_1^2 + 1 \right) \newline
\end{pmatrix} \newline
&amp; = \mathrm{det}
\begin{pmatrix}
1 &amp; 0 \newline
2 y_1 &amp; 1 \newline
\end{pmatrix}
= 1
\end{align}
$$&lt;p&gt;Hence, the log determinant Jacobian is given by zeros with the shape of the
input &lt;code&gt;y&lt;/code&gt;, excluding its last &lt;code&gt;inverse_min_event_ndims=1&lt;/code&gt; dimensions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse_log_det_jacobian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since the log determinant Jacobian is constant, i.e. independent of the input,
we can simply return a scalar and set the flag &lt;code&gt;is_constant_jacobian=True&lt;/code&gt;&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;,
and the &lt;code&gt;Bijector&lt;/code&gt; class will handle the necessary shape inference for us.&lt;/p&gt;
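&lt;p&gt;As a quick numerical sanity check, TensorFlow&amp;rsquo;s automatic differentiation
can confirm the unit determinant on an arbitrary batch of two-dimensional
inputs (a sketch; the batch size here is incidental):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# compute the Jacobian of the forward transformation with autodiff
x = tf.random.normal([5, 2])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = _forward(x)

jacobian = tape.batch_jacobian(y, x)  # shape (5, 2, 2)
print(tf.linalg.det(jacobian))        # approximately 1.0 in every entry
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;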
&lt;p&gt;Putting it all together in the &lt;code&gt;Banana&lt;/code&gt; bijector subclass, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bijectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bijector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;banana&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inverse_min_event_ndims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;is_constant_jacobian&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse_log_det_jacobian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, we can instantiate the distribution &lt;code&gt;p_y&lt;/code&gt; by calling
&lt;code&gt;tfd.TransformedDistribution&lt;/code&gt; as we did before and, &lt;em&gt;et voilà&lt;/em&gt;,
we can now simply call &lt;code&gt;p_y.prob&lt;/code&gt; to evaluate the probability density function.&lt;/p&gt;
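&lt;p&gt;Concretely, as a minimal sketch (the variable name is illustrative):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&amp;gt;&amp;gt;&amp;gt; p_y = tfd.TransformedDistribution(distribution=p_x, bijector=Banana())
&amp;gt;&amp;gt;&amp;gt; densities = p_y.prob(y_samples)  # no longer raises NotImplementedError
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;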
&lt;p&gt;Evaluating this on the same uniformly-spaced grid as before yields the following
equiprobability contour plot:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Banana density"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_density.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
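&lt;p&gt;As with the scatterplots, the plotting code is not shown here; a rough sketch
of the grid evaluation might look like the following, where the grid bounds and
resolution are arbitrary illustrative choices:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import matplotlib.pyplot as plt

# evaluate the density on a uniformly-spaced grid and draw contours
y1, y2 = np.meshgrid(np.linspace(-4.0, 4.0, 200),
                     np.linspace(-8.0, 2.0, 200))
grid = np.stack([y1, y2], axis=-1).astype(np.float32)  # shape (200, 200, 2)

density = p_y.prob(grid)  # leading grid dimensions are treated as batch dimensions

plt.contour(y1, y2, density)
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;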
&lt;h4 id="inline-bijector"&gt;Inline Bijector&lt;/h4&gt;
&lt;p&gt;Before we conclude, we note that instead of creating a subclass, one can also
opt for a more lightweight and functional approach by creating an
&lt;code&gt;Inline&lt;/code&gt; bijector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;banana&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bijectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;forward_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inverse_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_inverse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inverse_log_det_jacobian_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_inverse_log_det_jacobian&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inverse_min_event_ndims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;is_constant_jacobian&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransformedDistribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bijector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;banana&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h1 id="summary"&gt;Summary&lt;/h1&gt;
&lt;p&gt;In this post, we showed that, using diffeomorphisms (mappings that are
differentiable and invertible), it is possible to transform standard distributions
into interesting and complicated distributions, while still being able to
compute their densities analytically.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Bijector&lt;/code&gt; API provides an interface that encapsulates the basic properties
of a diffeomorphism needed to transform a distribution. These are: the
forward transform itself, its inverse, and the determinant of their Jacobians.&lt;/p&gt;
&lt;p&gt;Using this, &lt;code&gt;TransformedDistribution&lt;/code&gt; &lt;em&gt;automatically&lt;/em&gt; implements perhaps the two
most important methods of a probability distribution: sampling (&lt;code&gt;sample&lt;/code&gt;), and
density evaluation (&lt;code&gt;prob&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Needless to say, this is a very powerful combination.
Through the &lt;code&gt;Bijector&lt;/code&gt; API, the number of possible distributions that can be
implemented and used directly with other functionalities in the TensorFlow
Probability ecosystem effectively becomes &lt;em&gt;endless&lt;/em&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2018bijector,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{B}uilding {P}robability {D}istributions with the {T}ensor{F}low {P}robability {B}ijector {API}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2018&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/building-probability-distributions-with-tensorflow-probability-bijector-api/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="links--resources"&gt;Links &amp;amp; Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Paper: see footnote&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Dillon, J.V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M. and Saurous, R.A., 2017. &lt;em&gt;TensorFlow Distributions.&lt;/em&gt;
arXiv preprint arXiv:1711.10604.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Haario, H., Saksman, E., &amp;amp; Tamminen, J. (1999).
Adaptive Proposal Distribution for Random Walk Metropolis Algorithm. &lt;em&gt;Computational Statistics&lt;/em&gt;, 14(3), 375-396.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;for the transformation to be a diffeomorphism, it also needs to be &lt;em&gt;smooth&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;we implement this for the general case of $K \geq 2$ dimensional inputs since this actually turns out to be easier and cleaner (a phenomenon known as
the inventor&amp;rsquo;s paradox).&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;this is a straightforward consequence of the inverse function theorem,
which says the matrix inverse of the Jacobian of $G$ is the Jacobian of
its inverse $G^{-1}$,
&lt;/p&gt;
$$
\frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) =
\left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1}
$$&lt;p&gt;
Taking the determinant of both sides, we get:
&lt;/p&gt;
$$
\begin{align}
\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
&amp; = \mathrm{det} \left ( \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1} \right ) \newline
&amp; = \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1}
\end{align}
$$&lt;p&gt;
as required.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;See the description of the &lt;code&gt;is_constant_jacobian&lt;/code&gt;
argument for further details.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>