# A Handbook for Sparse Variational Gaussian Processes

A summary of notation, identities, and derivations for the sparse variational Gaussian process (SVGP) framework.

In the sparse variational Gaussian process (SVGP) framework [1], one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with auxiliary variables $\mathbf{u}$ so that the joint becomes $$p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f}, \mathbf{u}).$$ The vector $\mathbf{u} = \begin{bmatrix} u(\mathbf{z}_1) \cdots u(\mathbf{z}_M)\end{bmatrix}^{\top} \in \mathbb{R}^M$ consists of inducing variables, the latent function values corresponding to the inducing input locations contained in the matrix $\mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_M \end{bmatrix}^{\top} \in \mathbb{R}^{M \times D}$.

## Prior

Under the prior, the joint distribution of the latent function values $\mathbf{f}$ and the inducing variables $\mathbf{u}$ is $$p(\mathbf{f}, \mathbf{u}) = \mathcal{N} \left ( \mathbf{0}, \begin{bmatrix} \mathbf{K}_{XX} & \mathbf{K}_{XZ} \\ \mathbf{K}_{XZ}^{\top} & \mathbf{K}_{ZZ} \end{bmatrix} \right ).$$ To reduce clutter, we shall write $\mathbf{K}_{X} = \mathbf{K}_{XX}$ and $\mathbf{K}_{Z} = \mathbf{K}_{ZZ}$ from now on. If we let the joint prior factorize as $$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) p(\mathbf{u}),$$ we can apply the rules of Gaussian conditioning to derive the marginal prior $p(\mathbf{u})$ and the conditional prior $p(\mathbf{f} | \mathbf{u})$.
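For reference, the Gaussian conditioning identity we rely on throughout is the following: if $$\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \sim \mathcal{N} \left ( \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{C}^{\top} & \mathbf{B} \end{bmatrix} \right ),$$ then $$p(\mathbf{a} | \mathbf{b}) = \mathcal{N} \left ( \boldsymbol{\mu}_a + \mathbf{C} \mathbf{B}^{-1} (\mathbf{b} - \boldsymbol{\mu}_b), \; \mathbf{A} - \mathbf{C} \mathbf{B}^{-1} \mathbf{C}^{\top} \right ).$$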

### Marginal prior over inducing variables

The marginal prior over inducing variables is simply given by $$p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{0}, \mathbf{K}_Z).$$

#### Gaussian process notation

We can express the prior over the inducing variable $u(\mathbf{z})$ at inducing input $\mathbf{z}$ as $$p(u(\mathbf{z})) = \mathcal{GP}(0, \kappa_{\theta}(\mathbf{z}, \mathbf{z}')).$$

### Conditional prior

First, let us define the vector-valued function $\lambda: \mathbb{R}^{D} \to \mathbb{R}^{M}$ as $$\lambda(\mathbf{x}) = \mathbf{K}_{Z}^{-1} k_{Z}(\mathbf{x}),$$ where $k_{Z}(\mathbf{x}) = \kappa_{\theta}(\mathbf{Z}, \mathbf{x})$ denotes the vector of covariances between $\mathbf{x}$ and the inducing inputs $\mathbf{Z}$. Further, let $\mathbf{\Lambda} \in \mathbb{R}^{N \times M}$ be the matrix containing the values of the function $\lambda$ applied row-wise to the matrix of inputs $\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \cdots \mathbf{x}_N \end{bmatrix}^{\top} \in \mathbb{R}^{N \times D}$, $$\mathbf{\Lambda} = \begin{bmatrix} \lambda(\mathbf{x}_1) \cdots \lambda(\mathbf{x}_N) \end{bmatrix}^{\top} = \mathbf{K}_{XZ} \mathbf{K}_{Z}^{-1}.$$ Then, we can condition the joint prior distribution on the inducing variables to obtain $$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}),$$ where the mean vector and covariance matrix are $$\mathbf{m} = \mathbf{\Lambda} \mathbf{u}, \quad \text{and} \quad \mathbf{S} = \mathbf{K}_X - \mathbf{\Lambda} \mathbf{K}_Z \mathbf{\Lambda}^{\top},$$ respectively.
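As a sanity check, here is a minimal NumPy sketch of these formulas. The RBF kernel and all variable names are illustrative assumptions, not part of the framework itself:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between row-wise inputs A and B.
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

N, M, D = 5, 3, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))        # inputs
Z = rng.normal(size=(M, D))        # inducing inputs
u = rng.normal(size=M)             # a draw of the inducing variables

Kx = rbf(X, X)
Kz = rbf(Z, Z) + 1e-8 * np.eye(M)  # jitter for numerical stability
Kxz = rbf(X, Z)

Lam = np.linalg.solve(Kz, Kxz.T).T  # Lambda = K_XZ K_Z^{-1}
m = Lam @ u                         # conditional mean
S = Kx - Lam @ Kz @ Lam.T           # conditional covariance
```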

#### Gaussian process notation

We can express the distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, given $\mathbf{u}$, that is, the conditional $p(f(\mathbf{x}) | \mathbf{u})$, as a Gaussian process: $$p(f(\mathbf{x}) | \mathbf{u}) = \mathcal{GP}(m(\mathbf{x}), s(\mathbf{x}, \mathbf{x}')),$$ with mean and covariance functions, $$m(\mathbf{x}) = \lambda(\mathbf{x})^{\top} \mathbf{u}, \quad \text{and} \quad s(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \lambda(\mathbf{x})^{\top} \mathbf{K}_Z \lambda(\mathbf{x}').$$

## Variational Distribution

We specify a joint variational distribution $q_{\mathbf{\phi}}(\mathbf{f},\mathbf{u})$ which factorizes as $$q_{\mathbf{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) q_{\mathbf{\phi}}(\mathbf{u}).$$ For convenience, let us specify a variational distribution that is also Gaussian, $$q_{\mathbf{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{W}\mathbf{W}^{\top}),$$ with variational parameters $\mathbf{\phi} = \{ \mathbf{W}, \mathbf{b} \}$. To obtain the corresponding marginal variational distribution over $\mathbf{f}$, we marginalize out the inducing variables $\mathbf{u}$, leading to $$q_{\mathbf{\phi}}(\mathbf{f}) = \int q_{\mathbf{\phi}}(\mathbf{f}, \mathbf{u}) d\mathbf{u} = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \mathbf{\Sigma}),$$ where $$\boldsymbol{\mu} = \mathbf{\Lambda} \mathbf{b}, \quad \text{and} \quad \mathbf{\Sigma} = \mathbf{K}_X - \mathbf{\Lambda} (\mathbf{K}_Z - \mathbf{W}\mathbf{W}^{\top}) \mathbf{\Lambda}^{\top}.$$
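This follows from the standard identity for marginalizing a Gaussian conditional against a Gaussian, $$\int \mathcal{N}(\mathbf{f} | \mathbf{\Lambda} \mathbf{u}, \mathbf{S}) \, \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{W}\mathbf{W}^{\top}) \, d\mathbf{u} = \mathcal{N}(\mathbf{f} | \mathbf{\Lambda} \mathbf{b}, \mathbf{S} + \mathbf{\Lambda} \mathbf{W}\mathbf{W}^{\top} \mathbf{\Lambda}^{\top});$$ substituting $\mathbf{S} = \mathbf{K}_X - \mathbf{\Lambda} \mathbf{K}_Z \mathbf{\Lambda}^{\top}$ recovers the expression for $\mathbf{\Sigma}$ above.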

#### Gaussian process notation

We can express the variational distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, that is, the marginal $q_{\mathbf{\phi}}(f(\mathbf{x}))$, as a Gaussian process: $$q_{\mathbf{\phi}}(f(\mathbf{x})) = \mathcal{GP}(\mu(\mathbf{x}), \sigma(\mathbf{x}, \mathbf{x}')),$$ with mean and covariance functions, $$\mu(\mathbf{x}) = \lambda(\mathbf{x})^{\top} \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \lambda(\mathbf{x})^{\top} (\mathbf{K}_Z - \mathbf{W}\mathbf{W}^{\top}) \lambda(\mathbf{x}').$$

### Whitened parameterization

Whitening is a powerful trick for stabilizing the learning of variational parameters; it works by reducing correlations in the variational distribution [2, 3]. Let $\mathbf{L}_Z$ be the Cholesky factor of $\mathbf{K}_Z$, i.e. the lower triangular matrix such that $\mathbf{L}_Z \mathbf{L}_Z^{\top} = \mathbf{K}_Z$. Then, the whitened variational parameters are given by $$\tilde{\mathbf{W}} = \mathbf{L}_Z \mathbf{W}, \quad \text{and} \quad \tilde{\mathbf{b}} = \mathbf{L}_Z \mathbf{b},$$ for free parameters $\{ \mathbf{W}, \mathbf{b} \}$. This leads to mean and covariance $$\boldsymbol{\mu} = \tilde{\mathbf{\Lambda}} \mathbf{b}, \quad \text{and} \quad \mathbf{\Sigma} = \mathbf{K}_X - \tilde{\mathbf{\Lambda}} (\mathbf{I}_M - \mathbf{W} \mathbf{W}^{\top}) \tilde{\mathbf{\Lambda}}^{\top},$$ respectively, where $\tilde{\mathbf{\Lambda}} = \mathbf{\Lambda} \mathbf{L}_Z = \mathbf{K}_{XZ} \mathbf{L}_{Z}^{-\top}$.
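To see where the identity matrix comes from, substitute $\tilde{\mathbf{W}} = \mathbf{L}_Z \mathbf{W}$ and $\mathbf{K}_Z = \mathbf{L}_Z \mathbf{L}_Z^{\top}$ into the covariance of the previous section: $$\mathbf{\Sigma} = \mathbf{K}_X - \mathbf{\Lambda} (\mathbf{L}_Z \mathbf{L}_Z^{\top} - \mathbf{L}_Z \mathbf{W}\mathbf{W}^{\top} \mathbf{L}_Z^{\top}) \mathbf{\Lambda}^{\top} = \mathbf{K}_X - \mathbf{\Lambda} \mathbf{L}_Z (\mathbf{I}_M - \mathbf{W}\mathbf{W}^{\top}) \mathbf{L}_Z^{\top} \mathbf{\Lambda}^{\top} = \mathbf{K}_X - \tilde{\mathbf{\Lambda}} (\mathbf{I}_M - \mathbf{W}\mathbf{W}^{\top}) \tilde{\mathbf{\Lambda}}^{\top}.$$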

#### Gaussian process notation

The mean and covariance functions are now $$\mu(\mathbf{x}) = \tilde{\lambda}(\mathbf{x})^{\top} \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \tilde{\lambda}(\mathbf{x})^{\top} (\mathbf{I}_M - \mathbf{W}\mathbf{W}^{\top}) \tilde{\lambda}(\mathbf{x}'),$$ respectively, where $$\tilde{\lambda}(\mathbf{x}) = \mathbf{L}_{Z}^{\top} \lambda(\mathbf{x}) = \mathbf{L}_{Z}^{-1} k_{Z}(\mathbf{x}).$$

## Implementation Details

Here is an efficient and numerically stable way to compute $q_{\mathbf{\phi}}(f(\mathbf{x}))$ for an input $\mathbf{x}$. To reduce notational clutter, let us introduce the following abbreviations: $$\boldsymbol{\lambda} = \lambda(\mathbf{x}), \quad \tilde{\boldsymbol{\lambda}} = \tilde{\lambda}(\mathbf{x}), \quad \text{and} \quad \mathbf{k}_{Z} = k_{Z}(\mathbf{x}).$$ Then, we take the following steps (a NumPy transcription is given after the list):

1. $\mathbf{L}_{Z} = \mathrm{cholesky}(\mathbf{K}_{Z})$ – Cholesky decomposition
2. $\tilde{\boldsymbol{\lambda}} = \mathbf{L}_{Z} \backslash \mathbf{k}_{Z}$ – solve system of linear equations
3. $u = \kappa_{\theta}(\mathbf{x}, \mathbf{x}) - \tilde{\boldsymbol{\lambda}}^{\top} \tilde{\boldsymbol{\lambda}}$
4. If whitened,
   - $\mu = \tilde{\boldsymbol{\lambda}}^{\top} \mathbf{b}$
   - $\mathbf{v} = \mathbf{W}^{\top} \tilde{\boldsymbol{\lambda}}$
5. Otherwise,
   - $\boldsymbol{\lambda} = \mathbf{L}_{Z}^{\top} \backslash \tilde{\boldsymbol{\lambda}}$ – solve system of linear equations
   - $\mu = \boldsymbol{\lambda}^{\top} \mathbf{b}$
   - $\mathbf{v} = \mathbf{W}^{\top} \boldsymbol{\lambda}$
6. $\sigma^2 = u + \mathbf{v}^{\top} \mathbf{v}$
7. Return $\mathcal{N}(f(\mathbf{x}) ; \mu, \sigma^2)$
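For concreteness, here is a direct NumPy/SciPy transcription of these steps. This is a sketch; the function name and signature are illustrative:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def predict_point(kxx, kz, Kz, W, b, whiten=True):
    # kxx: scalar kappa(x, x); kz: (M,) vector k_Z(x); Kz: (M, M) matrix K_Z;
    # W: (M, M) variational scale factor; b: (M,) variational mean.
    Lz = cholesky(Kz, lower=True)                      # step 1
    lam_tilde = solve_triangular(Lz, kz, lower=True)   # step 2
    u = kxx - lam_tilde @ lam_tilde                    # step 3
    if whiten:                                         # step 4
        lam = lam_tilde
    else:                                              # step 5
        lam = solve_triangular(Lz.T, lam_tilde, lower=False)
    mu = lam @ b
    v = W.T @ lam
    sigma2 = u + v @ v                                 # step 6
    return mu, sigma2                                  # step 7
```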

It is simple to extend this to compute $q_{\mathbf{\phi}}(\mathbf{f})$ for an arbitrary number of index points $\mathbf{X}$:

1. $\mathbf{L}_{Z} = \mathrm{cholesky}(\mathbf{K}_{Z})$
2. $\tilde{\mathbf{\Lambda}}^{\top} = \mathbf{L}_{Z} \backslash \mathbf{K}_{XZ}^{\top}$
3. $\mathbf{U} = \mathbf{K}_{X} - \tilde{\mathbf{\Lambda}} \tilde{\mathbf{\Lambda}}^{\top}$
4. If whitened,
   - $\boldsymbol{\mu} = \tilde{\mathbf{\Lambda}} \mathbf{b}$
   - $\mathbf{V} = \tilde{\mathbf{\Lambda}} \mathbf{W}$
5. Otherwise,
   - $\mathbf{\Lambda}^{\top} = \mathbf{L}_{Z}^{\top} \backslash \tilde{\mathbf{\Lambda}}^{\top}$
   - $\boldsymbol{\mu} = \mathbf{\Lambda} \mathbf{b}$
   - $\mathbf{V} = \mathbf{\Lambda} \mathbf{W}$
6. $\mathbf{\Sigma} = \mathbf{U} + \mathbf{V} \mathbf{V}^{\top}$
7. Return $\mathcal{N}(\mathbf{f} ; \boldsymbol{\mu}, \mathbf{\Sigma})$

In TensorFlow, this looks something like:

```python
import tensorflow as tf


def variational(Kxx, Kzz, Kzx, W, b, whiten=True, jitter=1e-8):
    # Add jitter to the diagonal for numerical stability before factorizing.
    eye = tf.eye(tf.shape(Kzz)[-1], dtype=Kzz.dtype)
    Lz = tf.linalg.cholesky(Kzz + jitter * eye)

    # M = L_Z^{-1} K_ZX, i.e. Lambda-tilde transposed.
    M = tf.linalg.triangular_solve(Lz, Kzx, lower=True)
    # K_XX - Lambda-tilde Lambda-tilde^T
    Sigma = Kxx - tf.linalg.matmul(M, M, adjoint_a=True)

    if not whiten:
        # M = L_Z^{-T} L_Z^{-1} K_ZX = K_Z^{-1} K_ZX, i.e. Lambda transposed.
        M = tf.linalg.triangular_solve(Lz, M, lower=True, adjoint=True)

    mu = tf.linalg.matmul(M, b, adjoint_a=True)  # Lambda b
    V = tf.linalg.matmul(M, W, adjoint_a=True)   # Lambda W

    Sigma += tf.linalg.matmul(V, V, adjoint_b=True)  # + V V^T

    return mu, Sigma
```
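A quick usage sketch, building on the function above. The RBF kernel helper and the shapes are assumptions for illustration:

```python
def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between row-wise inputs A and B.
    sq_dists = tf.reduce_sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return tf.exp(-0.5 * sq_dists / lengthscale**2)

N, M, D = 5, 3, 2
X = tf.random.normal([N, D])        # index points
Z = tf.random.normal([M, D])        # inducing inputs
W = 0.1 * tf.random.normal([M, M])  # variational scale factor
b = tf.random.normal([M, 1])        # variational mean

mu, Sigma = variational(rbf(X, X), rbf(Z, Z), rbf(Z, X), W, b, whiten=True)
print(mu.shape, Sigma.shape)  # (5, 1) (5, 5)
```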


Cite as:

```bibtex
@article{tiao2020svgp,
  title   = "{A} {C}heatsheet for {S}parse {V}ariational {G}aussian {P}rocesses",
  author  = "Tiao, Louis C",
  journal = "tiao.io",
  year    = "2020",
  url     = "https://tiao.io/post/sparse-variational-gaussian-processes/"
}
```


## References

1. Titsias, M. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574).

2. Murray, I., & Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740).

3. Hensman, J., Matthews, A. G., Filippone, M., & Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656).

4. Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290).

5. Dezfouli, A., & Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422).

6. Salimbeni, H., Cheng, C. A., Boots, B., & Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720).

7. Shi, J., Titsias, M., & Mnih, A. (2020). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR.
