TensorFlow Distributions, now under the broader umbrella of
TensorFlow Probability, is a fantastic TensorFlow library for efficient and
composable manipulation of probability distributions^{1}.

Among the many features it has to offer, one of the most powerful in my opinion
is the `Bijector`

API, which provide the modular building blocks necessary to
construct a broad class of probability distributions.
Instead of describing it any further in the abstract, let’s dive right in with
a simple example.

## Example: Banana-shaped distribution

Consider the *banana-shaped distribution*, a commonly-used testbed for adaptive
MCMC methods^{2}.
Denote the density of this distribution as $p_{Y}(\mathbf{y})$.
To illustrate, 1k samples randomly drawn from this distribution are shown below:

The underlying process that generates samples $\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y})$ is simple to describe, and is of the general form,

$$ \tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y}) \quad \Leftrightarrow \quad \tilde{\mathbf{y}} = G(\tilde{\mathbf{x}}), \quad \tilde{\mathbf{x}} \sim p_{X}(\mathbf{x}). $$

In other words, a sample $\tilde{\mathbf{y}}$ is the output of a transformation $G$, given a sample $\tilde{\mathbf{x}}$ drawn from some underlying base distribution $p_{X}(\mathbf{x})$.

However, it is not as straightforward to compute an analytical expression for
density $p_{Y}(\mathbf{y})$.
In fact, this is only possible if $G$ is a *differentiable* and *invertible*
transformation (a *diffeomorphism*^{3}), and if there is an analytical
expression for $p_{X}(\mathbf{x})$.

Transformations that fail to satisfy these conditions (which includes something
as simple as a multi-layer perceptron with non-linear activations) give rise to
*implicit distributions*, and will be the subject of many posts to come.
But for now, we will restrict our attention to diffeomorphisms.

### Base distribution

Following on with our example, the base distribution $p_{X}(\mathbf{x})$ is given by a two-dimensional Gaussian with unit variances and covariance $\rho = 0.95$:

$$ p_{X}(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \mathbf{0}, \mathbf{\Sigma}), \qquad \mathbf{\Sigma} = \begin{bmatrix} 1 & 0.95 \newline 0.95 & 1 \end{bmatrix} $$

This can be encapsulated by an instance of MultivariateNormalTriL, which is parameterized by a lower-triangular matrix. First let’s import TensorFlow Distributions:

```
import tensorflow.contrib.distributions as tfd
```

Then we create the lower-triangular matrix and the instantiate the distribution:

```
>>> rho = 0.95
>>> Sigma = np.float32(np.eye(N=2) + rho * np.eye(N=2)[::-1])
>>> Sigma
array([[1. , 0.95],
[0.95, 1. ]], dtype=float32)
>>> p_x = tfd.MultivariateNormalTriL(scale_tril=tf.cholesky(Sigma))
```

As with all subclasses of `tfd.Distribution`

, we can evaluated the probability
density function of this distribution by calling the `p_x.prob`

method.
Evaluating this on an uniformly-spaced grid yields the equiprobability contour
plot below:

### Forward Transformation

The required transformation $G$ is defined as:

$$ G(\mathbf{x}) = \begin{bmatrix} x_1 \newline x_2 - x_1^2 - 1 \newline \end{bmatrix} $$

We implement this in the `_forward`

function below^{4}:

```
def _forward(x):
y_0 = x[..., 0:1]
y_1 = x[..., 1:2] - y_0**2 - 1
y_tail = x[..., 2:-1]
return tf.concat([y_0, y_1, y_tail], axis=-1)
```

We can now use this to generate samples from $p_{Y}(\mathbf{y})$.
To do this we first sample from the base distribution $p_{X}(\mathbf{x})$ by
calling `p_x.sample`

. For this illustration, we generate 1k samples, which is
specified through the `sample_shape`

argument. We then transform these samples
through $G$ by calling `_forward`

on them.

```
>>> x_samples = p_x.sample(1000)
>>> y_samples = _forward(x_samples)
```

The figure below contains scatterplots of the 1k samples `x_samples`

(left)
and the transformed `y_samples`

(right):

### Instantiating a `TransformedDistribution`

with a `Bijector`

Having specified the forward transformation and the underlying distribution, we have now fully described the sample generation process, which is the bare minimum necessary to define a probability distribution.

The forward transformation is also the *first* of **three** operations needed to
fully specify a `Bijector`

, which can be used to instantiate a
`TransformedDistribution`

that encapsulates the banana-shaped distribution.

#### Creating a `Bijector`

First, let’s subclass `Bijector`

to define the `Banana`

bijector and implement
the forward transformation as an instance method:

```
class Banana(tfd.bijectors.Bijector):
def __init__(self, name="banana"):
super(Banana, self).__init__(inverse_min_event_ndims=1,
name=name)
def _forward(self, x):
y_0 = x[..., 0:1]
y_1 = x[..., 1:2] - y_0**2 - 1
y_tail = x[..., 2:-1]
return tf.concat([y_0, y_1, y_tail], axis=-1)
```

Note that we need to specify either `forward_min_event_ndims`

or
`inverse_min_event_ndims`

, the number of dimensions the forward or inverse
transformation operate on (which can sometimes differ).
In our example, both the inverse and forward transformation operate on vectors
(rank 1 tensors), so we set `inverse_min_event_ndims=1`

.

With an instance of the `Banana`

bijector, we can call the `forward`

method on
`x_samples`

to produce `y_samples`

as before:

```
>>> y_samples = Banana().forward(x_samples)
```

#### Instantiating a `TransformedDistribution`

More importantly, we can now create a `TransformedDistribution`

with the base
distribution `p_x`

and an instance of the `Banana`

bijector:

```
>>> p_y = tfd.TransformedDistribution(distribution=p_x, bijector=Banana())
```

This now allows us to directly sample from `p_y`

just as we could with `p_x`

,
and any other TensorFlow Probability `Distribution`

:

```
>>> y_samples = p_y.sample(1000)
```

Neat!

### Probability Density Function

Although we can now sample from this distribution, we have yet to define the
operations necessary to evaluate its probability density function—the
remaining *two* of **three** operations needed to fully specify a `Bijector`

Indeed, calling `p_y.prob`

at this stage would simply raise a
`NotImplementedError`

exception. So what else do we need to define?

Recall the probability density of $p_{Y}(\mathbf{y})$ is given by:

$$ p_{Y}(\mathbf{y}) = p_{X}(G^{-1}(\mathbf{y})) \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) $$

Hence we need to specify the inverse transformation $G^{-1}(\mathbf{y})$ and its Jacobian determinant $\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )$.

For numerical stability, the `Bijector`

API requires that this be defined in
log-space. Hence, it is useful to recall that the forward and inverse log
determinant Jacobians differ only in their signs^{5},

$$ \begin{align} \log \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) & = - \log \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right ), \end{align} $$

which gives us the option of implementing either (or both). However, do note the following from the official tf.contrib.distributions.bijectors.Bijector API docs:

Generally its preferable to directly implement the inverse Jacobian determinant. This should have superior numerical stability and will often share subgraphs with the

`_inverse`

implementation.

### Inverse Transformation

So let’s implement the inverse transform $G^{-1}$, which is given by:

$$ G^{-1}(\mathbf{y}) = \begin{bmatrix} y_1 \newline y_2 + y_1^2 + 1 \newline \end{bmatrix} $$

We define this in the `_inverse`

function below:

```
def _inverse(y):
x_0 = y[..., 0:1]
x_1 = y[..., 1:2] + x_0**2 + 1
x_tail = y[..., 2:-1]
return tf.concat([x_0, x_1, x_tail], axis=-1)
```

### Jacobian determinant

Now we compute the log determinant of the Jacobian of the *inverse*
transformation.
In this simple example, the transformation is *volume-preserving*, meaning its
Jacobian determinant is equal to 1.

This is easy to verify:

$$ \begin{align} \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) & = \mathrm{det} \begin{pmatrix} \frac{\partial}{\partial y_1} y_1 & \frac{\partial}{\partial y_2} y_1 \newline \frac{\partial}{\partial y_1} y_2 + y_1^2 + 1 & \frac{\partial}{\partial y_2} y_2 + y_1^2 + 1 \newline \end{pmatrix} \newline & = \mathrm{det} \begin{pmatrix} 1 & 0 \newline 2 y_1 & 1 \newline \end{pmatrix} = 1 \end{align} $$

Hence, the log determinant Jacobian is given by zeros shaped like input `y`

, up
to the last `inverse_min_event_ndims=1`

dimensions:

```
def _inverse_log_det_jacobian(y):
return tf.zeros(shape=y.shape[:-1])
```

Since the log determinant Jacobian is constant, i.e. independent of the input,
we can just specify it for one input by setting the flag `is_constant_jacobian=True`

^{6},
and the `Bijector`

class will handle the necessary shape inference for us.

Putting it all together in the `Banana`

bijector subclass, we have:

```
class Banana(tfd.bijectors.Bijector):
def __init__(self, name="banana"):
super(Banana, self).__init__(inverse_min_event_ndims=1,
is_constant_jacobian=True,
name=name)
def _forward(self, x):
y_0 = x[..., 0:1]
y_1 = x[..., 1:2] - y_0**2 - 1
y_tail = x[..., 2:-1]
return tf.concat([y_0, y_1, y_tail], axis=-1)
def _inverse(self, y):
x_0 = y[..., 0:1]
x_1 = y[..., 1:2] + x_0**2 + 1
x_tail = y[..., 2:-1]
return tf.concat([x_0, x_1, x_tail], axis=-1)
def _inverse_log_det_jacobian(self, y):
return tf.zeros(shape=())
```

Finally, we can instantiate distribution `p_y`

by calling
`tfd.TransformedDistribution`

as we did before *et voilĂ *,
we can now simply call `p_y.prob`

to evaluate the probability density function.

Evaluating this on the same uniformly-spaced grid as before yields the following equiprobability contour plot:

#### Inline Bijector

Before we conclude, we note that instead of creating a subclass, one can also opt for a more lightweight and functional approach by creating an Inline bijector:

```
banana = tfd.bijectors.Inline(
forward_fn=_forward,
inverse_fn=_inverse,
inverse_log_det_jacobian_fn=_inverse_log_det_jacobian,
inverse_min_event_ndims=1,
is_constant_jacobian=True,
)
p_y = tfd.TransformedDistribution(distribution=p_x, bijector=banana)
```

# Summary

In this post, we showed that using diffeomorphisms—mappings that are differentiable and invertible, it is possible transform standard distributions into interesting and complicated distributions, while still being able to compute their densities analytically.

The `Bijector`

API provides an interface that encapsulates the basic properties
of a diffeomorphism needed to transform a distribution. These are: the
forward transform itself, its inverse and the determinant of their Jacobians.

Using this, `TransformedDistribution`

*automatically* implements perhaps the two
most important methods of a probability distribution: sampling (`sample`

), and
density evaluation (`prob`

).

Needless to say, this is a very powerful combination.
Through the `Bijector`

API, the number of possible distributions that can be
implemented and used directly with other functionalities in the TensorFlow
Probability ecosystem effectively becomes *endless*.

## Links & Resources

- Try this out yourself in a Colaboratory Notebook.
- Paper: see footnote
^{1} - Blog Post: Introducing TensorFlow Probability
- API Documentation: tf.contrib.distributions.bijectors.Bijector

- Dillon, J.V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M. and Saurous, R.A., 2017.
*TensorFlow Distributions.*arXiv preprint arXiv:1711.10604.^{^} - Haario, H., Saksman, E., & Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm.
*Computational Statistics*, 14(3), 375-396.^{^} - for the transformation to be a diffeomorphism, it also needs to be
*smooth*.^{^} - we implement this for the general case of $K \geq 2$ dimensional inputs since this actually turns out to be easier and cleaner (a phenomenon known as Inventor’s paradox).
^{^} this is a straightforward consequence of the inverse function theorem which says the matrix inverse of the Jacobian of $G$ is the Jacobian of its inverse $G^{-1}$, $$ \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) = \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1} $$ Taking the determinant of both sides, we get: $$ \begin{align} \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) & = \mathrm{det} \left ( \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1} \right ) \newline & = \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1} \end{align} $$ as required.

^{^}- See description of is_constant_jacobian argument for further details.
^{^}