A Cheatsheet for Sparse Variational Gaussian Processes

A summary of notation, identities and derivations for the sparse variational Gaussian process (SVGP) framework

In the sparse variational Gaussian process (SVGP) framework1, one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with auxiliary variables $\mathbf{u}$ so that the joint becomes $$ p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f}, \mathbf{u}). $$ The vector $\mathbf{u} = \begin{bmatrix} u(\mathbf{z}_1) \cdots u(\mathbf{z}_M)\end{bmatrix}^{\top} \in \mathbb{R}^M$ consists of inducing variables, the latent function values corresponding to the inducing input locations contained in the matrix $\mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_M \end{bmatrix}^{\top} \in \mathbb{R}^{M \times D}$.


The joint distribution of the latent function values $\mathbf{f}$, and the inducing variables $\mathbf{u}$ according to the prior is $$ p(\mathbf{f}, \mathbf{u}) = \mathcal{N} \left ( \mathbf{0}, \begin{bmatrix} \mathbf{K}_{XX} & \mathbf{K}_{XZ} \newline \mathbf{K}_{XZ}^{\top} & \mathbf{K}_{ZZ} \end{bmatrix} \right ). $$ To reduce clutter, we shall write $\mathbf{K}_{X} = \mathbf{K}_{XX}$ and $\mathbf{K}_{Z} = \mathbf{K}_{ZZ}$ from now on. If we let the joint prior factorize as $$ p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) p(\mathbf{u}), $$ we can apply the rules of Gaussian conditioning to derive the marginal prior $p(\mathbf{u})$ and conditional prior $p(\mathbf{f} | \mathbf{u})$.
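Both of the following distributions follow from the standard Gaussian conditioning identity: if $$ \begin{bmatrix} \mathbf{f} \newline \mathbf{u} \end{bmatrix} \sim \mathcal{N} \left ( \begin{bmatrix} \mathbf{a} \newline \mathbf{b} \end{bmatrix}, \begin{bmatrix} \mathbf{A} & \mathbf{C} \newline \mathbf{C}^{\top} & \mathbf{B} \end{bmatrix} \right ), $$ then the marginal is $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{B})$ and the conditional is $$ p(\mathbf{f} | \mathbf{u}) = \mathcal{N} \left ( \mathbf{a} + \mathbf{C} \mathbf{B}^{-1} (\mathbf{u} - \mathbf{b}), \ \mathbf{A} - \mathbf{C} \mathbf{B}^{-1} \mathbf{C}^{\top} \right ). $$ Substituting $\mathbf{a} = \mathbf{b} = \mathbf{0}$, $\mathbf{A} = \mathbf{K}_X$, $\mathbf{B} = \mathbf{K}_Z$ and $\mathbf{C} = \mathbf{K}_{XZ}$ yields the expressions in the next two subsections.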

Marginal prior over inducing variables

The marginal prior over inducing variables is simply given by $$ p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{0}, \mathbf{K}_Z). $$

Gaussian process notation

We can express the prior over the inducing variable $u(\mathbf{z})$ at inducing input $\mathbf{z}$ as $$ p(u(\mathbf{z})) = \mathcal{GP}(0, \kappa_{\theta}(\mathbf{z}, \mathbf{z}')). $$

Conditional prior

First, let us define the vector-valued function $\lambda: \mathbb{R}^{D} \to \mathbb{R}^{M}$ as $$ \lambda(\mathbf{x}) = \mathbf{K}_{Z}^{-1} k_{Z}(\mathbf{x}), $$ where $k_{Z}(\mathbf{x}) = \kappa_{\theta}(\mathbf{Z}, \mathbf{x})$ denotes the vector of covariances between $\mathbf{x}$ and the inducing inputs $\mathbf{Z}$. Further, let $\mathbf{\Lambda} \in \mathbb{R}^{N \times M}$ be the matrix obtained by applying $\lambda$ row-wise to the matrix of inputs $\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \cdots \mathbf{x}_N \end{bmatrix}^{\top} \in \mathbb{R}^{N \times D}$, $$ \mathbf{\Lambda} = \begin{bmatrix} \lambda(\mathbf{x}_1) \cdots \lambda(\mathbf{x}_N) \end{bmatrix}^{\top} = \mathbf{K}_{XZ} \mathbf{K}_{Z}^{-1}. $$ Then, we can condition the joint prior distribution on the inducing variables to give $$ p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}), $$ where the mean vector and covariance matrix are $$ \mathbf{m} = \mathbf{\Lambda} \mathbf{u}, \quad \text{and} \quad \mathbf{S} = \mathbf{K}_X - \mathbf{\Lambda} \mathbf{K}_Z \mathbf{\Lambda}^{\top}, $$ respectively.
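To make the dimensions concrete, here is a minimal NumPy sketch of these conditional moments. The squared-exponential kernel `rbf`, the random inputs, and the jitter term are assumptions made purely for illustration:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel (an assumed choice for illustration).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(42)
X = rng.standard_normal((5, 2))  # N = 5 inputs of dimension D = 2
Z = rng.standard_normal((3, 2))  # M = 3 inducing inputs
u = rng.standard_normal(3)       # one realization of the inducing variables

K_x = rbf(X, X)
K_z = rbf(Z, Z) + 1e-9 * np.eye(3)  # small jitter for numerical stability
K_xz = rbf(X, Z)

Lam = np.linalg.solve(K_z, K_xz.T).T  # Lambda = K_XZ K_Z^{-1}, via a solve

m = Lam @ u                  # m = Lambda u
S = K_x - Lam @ K_z @ Lam.T  # S = K_X - Lambda K_Z Lambda^T
```

Note that the conditional covariance $\mathbf{S}$ does not depend on $\mathbf{u}$; conditioning only shifts the mean.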

Gaussian process notation

We can express the distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, given $\mathbf{u}$, that is, the conditional $p(f(\mathbf{x}) | \mathbf{u})$, as a Gaussian process: $$ p(f(\mathbf{x}) | \mathbf{u}) = \mathcal{GP}(m(\mathbf{x}), s(\mathbf{x}, \mathbf{x}')), $$ with mean and covariance functions, $$ m(\mathbf{x}) = \lambda(\mathbf{x})^{\top} \mathbf{u}, \quad \text{and} \quad s(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \lambda(\mathbf{x})^{\top} \mathbf{K}_Z \lambda(\mathbf{x}'). $$

Variational Distribution

We specify a joint variational distribution $q_{\mathbf{\phi}}(\mathbf{f},\mathbf{u})$ which factorizes as $$ q_{\mathbf{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) q_{\mathbf{\phi}}(\mathbf{u}). $$ For convenience, let us specify a variational distribution that is also Gaussian, $$ q_{\mathbf{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{W}\mathbf{W}^{\top}), $$ with variational parameters $\mathbf{\phi} = \{ \mathbf{W}, \mathbf{b} \}$. To obtain the corresponding marginal variational distribution over $\mathbf{f}$, we marginalize out the inducing variables $\mathbf{u}$, leading to $$ q_{\mathbf{\phi}}(\mathbf{f}) = \int q_{\mathbf{\phi}}(\mathbf{f}, \mathbf{u}) d\mathbf{u} = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \mathbf{\Sigma}), $$ where $$ \boldsymbol{\mu} = \mathbf{\Lambda} \mathbf{b}, \quad \text{and} \quad \mathbf{\Sigma} = \mathbf{K}_X - \mathbf{\Lambda} (\mathbf{K}_Z - \mathbf{W}\mathbf{W}^{\top}) \mathbf{\Lambda}^{\top}. $$
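As a quick sanity check on these expressions (a NumPy sketch with an assumed squared-exponential kernel and assumed sizes): choosing $\mathbf{W}$ such that $\mathbf{W}\mathbf{W}^{\top} = \mathbf{K}_Z$ makes the middle term vanish, so $\mathbf{\Sigma}$ collapses back to the prior covariance $\mathbf{K}_X$:

```python
import numpy as np

def rbf(A, B):
    # Squared-exponential kernel with unit lengthscale (assumed for illustration).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))  # N = 4 inputs
Z = rng.standard_normal((3, 2))  # M = 3 inducing inputs
b = rng.standard_normal(3)       # variational mean parameter

K_x, K_z, K_xz = rbf(X, X), rbf(Z, Z), rbf(X, Z)
Lam = np.linalg.solve(K_z, K_xz.T).T  # Lambda = K_XZ K_Z^{-1}

W = np.linalg.cholesky(K_z + 1e-12 * np.eye(3))  # so that W W^T = K_Z

mu = Lam @ b
Sigma = K_x - Lam @ (K_z - W @ W.T) @ Lam.T  # reduces to K_x for this W
```

In other words, when the variational covariance of $\mathbf{u}$ matches the prior covariance $\mathbf{K}_Z$, no uncertainty is subtracted from the prior over $\mathbf{f}$.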

Gaussian process notation

We can express the variational distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, that is, the marginal $q_{\mathbf{\phi}}(f(\mathbf{x}))$, as a Gaussian process: $$ q_{\mathbf{\phi}}(f(\mathbf{x})) = \mathcal{GP}(\mu(\mathbf{x}), \sigma(\mathbf{x}, \mathbf{x}')), $$ with mean and covariance functions, $$ \mu(\mathbf{x}) = \lambda(\mathbf{x})^{\top} \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \lambda(\mathbf{x})^{\top} (\mathbf{K}_Z - \mathbf{W}\mathbf{W}^{\top}) \lambda(\mathbf{x}'). $$

Whitened parameterization

Whitening is a powerful trick for stabilizing the learning of variational parameters that works by reducing correlations in the variational distribution2 3. Let $\mathbf{L}_Z$ be the Cholesky factor of $\mathbf{K}_Z$, i.e. the lower-triangular matrix such that $\mathbf{L}_Z \mathbf{L}_Z^{\top} = \mathbf{K}_Z$. Instead of parameterizing $q_{\mathbf{\phi}}(\mathbf{u})$ directly, we express its parameters through whitened free parameters $\{ \mathbf{W}, \mathbf{b} \}$ as $$ \tilde{\mathbf{W}} = \mathbf{L}_Z \mathbf{W}, \quad \text{and} \quad \tilde{\mathbf{b}} = \mathbf{L}_Z \mathbf{b}, $$ so that $q_{\mathbf{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \tilde{\mathbf{b}}, \tilde{\mathbf{W}}\tilde{\mathbf{W}}^{\top})$. This leads to mean and covariance $$ \boldsymbol{\mu} = \tilde{\mathbf{\Lambda}} \mathbf{b}, \quad \text{and} \quad \mathbf{\Sigma} = \mathbf{K}_X - \tilde{\mathbf{\Lambda}} (\mathbf{I}_M - \mathbf{W} \mathbf{W}^{\top}) \tilde{\mathbf{\Lambda}}^{\top}, $$ respectively, where $\tilde{\mathbf{\Lambda}} = \mathbf{\Lambda} \mathbf{L}_Z = \mathbf{K}_{XZ} \mathbf{L}_{Z}^{-\top}$.
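The whitened and unwhitened parameterizations describe the same marginal $q_{\mathbf{\phi}}(\mathbf{f})$; the following NumPy sketch (kernel and sizes again assumed for illustration) verifies that the two sets of formulas agree:

```python
import numpy as np

def rbf(A, B):
    # Squared-exponential kernel with unit lengthscale (assumed for illustration).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2))
Z = rng.standard_normal((3, 2))
W = rng.standard_normal((3, 3))  # whitened free parameters
b = rng.standard_normal(3)

K_x, K_xz = rbf(X, X), rbf(X, Z)
K_z = rbf(Z, Z) + 1e-9 * np.eye(3)  # jittered for a stable Cholesky
L_z = np.linalg.cholesky(K_z)
Lam = np.linalg.solve(K_z, K_xz.T).T  # Lambda = K_XZ K_Z^{-1}

# Unwhitened formulas, with parameters (W_tilde, b_tilde) = (L_z W, L_z b).
W_t, b_t = L_z @ W, L_z @ b
mu_unwhite = Lam @ b_t
Sigma_unwhite = K_x - Lam @ (K_z - W_t @ W_t.T) @ Lam.T

# Whitened formulas, using the free parameters (W, b) directly.
Lam_t = Lam @ L_z  # = K_XZ L_z^{-T}
mu_white = Lam_t @ b
Sigma_white = K_x - Lam_t @ (np.eye(3) - W @ W.T) @ Lam_t.T
```

The two pairs of moments coincide, which is the point of the reparameterization: the optimizer works with the better-conditioned $\{ \mathbf{W}, \mathbf{b} \}$ while the model is unchanged.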

Gaussian process notation

The mean and covariance functions are now $$ \mu(\mathbf{x}) = \tilde{\lambda}(\mathbf{x})^{\top} \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \tilde{\lambda}(\mathbf{x})^{\top} (\mathbf{I}_M - \mathbf{W}\mathbf{W}^{\top}) \tilde{\lambda}(\mathbf{x}'), $$ respectively, where $$ \tilde{\lambda}(\mathbf{x}) = \mathbf{L}_{Z}^{\top} \lambda(\mathbf{x}) = \mathbf{L}_{Z}^{-1} k_{Z}(\mathbf{x}). $$

Implementation Details

Here is an efficient and numerically stable way to compute $q_{\mathbf{\phi}}(f(\mathbf{x}))$ for an input $\mathbf{x}$. To reduce notational clutter, let us make the following abbreviations $$ \boldsymbol{\lambda} = \lambda(\mathbf{x}), \quad \tilde{\boldsymbol{\lambda}} = \tilde{\lambda}(\mathbf{x}), \quad \text{and} \quad \mathbf{k}_{Z} = k_{Z}(\mathbf{x}). $$ Then, we take the following steps:

  1. $\mathbf{L}_{Z} = \mathrm{cholesky}(\mathbf{K}_{Z})$ – Cholesky decomposition
  2. $\tilde{\boldsymbol{\lambda}} = \mathbf{L}_{Z} \backslash \mathbf{k}_{Z}$ – solve system of linear equations
  3. $u = \kappa_{\theta}(\mathbf{x}, \mathbf{x}) - \tilde{\boldsymbol{\lambda}}^{\top} \tilde{\boldsymbol{\lambda}}$
  4. If whitened,
    • $\mu = \tilde{\boldsymbol{\lambda}}^{\top} \mathbf{b}$
    • $\mathbf{v} = \tilde{\boldsymbol{\lambda}}^{\top} \mathbf{W}$
  5. otherwise,
    • $\boldsymbol{\lambda} = \mathbf{L}_{Z}^{\top} \backslash \tilde{\boldsymbol{\lambda}}$
    • $\mu = \boldsymbol{\lambda}^{\top} \mathbf{b}$
    • $\mathbf{v} = \boldsymbol{\lambda}^{\top} \mathbf{W}$
  6. $\sigma^2 = u + \mathbf{v}^{\top} \mathbf{v}$
  7. Return $\mathcal{N}(f(\mathbf{x}) ; \mu, \sigma^2)$
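The steps above can be sketched as a small NumPy/SciPy routine; the kernel `rbf` and the example inputs are assumptions for illustration, and `scipy.linalg.solve_triangular` carries out the triangular solves in steps 2 and 5:

```python
import numpy as np
from scipy.linalg import solve_triangular

def rbf(A, B):
    # Squared-exponential kernel with unit lengthscale (assumed for illustration).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists)

def q_marginal(x, Z, W, b, whiten=True, jitter=1e-9):
    """Mean and variance of q(f(x)) at a single input x of shape (D,)."""
    K_z = rbf(Z, Z) + jitter * np.eye(len(Z))
    k_z = rbf(Z, x[None, :])[:, 0]                     # covariances k_Z(x)
    L_z = np.linalg.cholesky(K_z)                      # step 1
    lam = solve_triangular(L_z, k_z, lower=True)       # step 2: lambda_tilde
    u = rbf(x[None, :], x[None, :])[0, 0] - lam @ lam  # step 3
    if not whiten:
        lam = solve_triangular(L_z, lam, lower=True, trans='T')  # step 5: lambda
    mu = lam @ b                                       # steps 4/5: mean
    v = lam @ W                                        # steps 4/5
    sigma2 = u + v @ v                                 # step 6
    return mu, sigma2

# Example usage with assumed shapes: M = 3 inducing points in D = 2.
rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 2))
W = 0.1 * rng.standard_normal((3, 3))
b = rng.standard_normal(3)
x = rng.standard_normal(2)

mu, sigma2 = q_marginal(x, Z, W, b, whiten=True)
```

Note that step 3 can reuse $\tilde{\boldsymbol{\lambda}}$ before it is overwritten in the unwhitened branch, since $\boldsymbol{\lambda}^{\top} \mathbf{K}_Z \boldsymbol{\lambda} = \tilde{\boldsymbol{\lambda}}^{\top} \tilde{\boldsymbol{\lambda}}$.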

It is simple to extend this to compute $q_{\mathbf{\phi}}(\mathbf{f})$ for an arbitrary number of index points $\mathbf{X}$,

  1. $\mathbf{L}_{Z} = \mathrm{cholesky}(\mathbf{K}_{Z})$
  2. $\tilde{\mathbf{\Lambda}}^{\top} = \mathbf{L}_{Z} \backslash \mathbf{K}_{XZ}^{\top}$
  3. $\mathbf{U} = \mathbf{K}_{X} - \tilde{\mathbf{\Lambda}} \tilde{\mathbf{\Lambda}}^{\top}$
  4. If whitened,
    • $\boldsymbol{\mu} = \tilde{\mathbf{\Lambda}} \mathbf{b}$
    • $\mathbf{V} = \tilde{\mathbf{\Lambda}} \mathbf{W}$
  5. otherwise,
    • $\mathbf{\Lambda}^{\top} = \mathbf{L}_{Z}^{\top} \backslash \tilde{\mathbf{\Lambda}}^{\top}$
    • $\boldsymbol{\mu} = \mathbf{\Lambda} \mathbf{b}$
    • $\mathbf{V} = \mathbf{\Lambda} \mathbf{W}$
  6. $\mathbf{\Sigma} = \mathbf{U} + \mathbf{V} \mathbf{V}^{\top}$
  7. Return $\mathcal{N}(\mathbf{f} ; \boldsymbol{\mu}, \mathbf{\Sigma})$

In TensorFlow, this looks something like:

import tensorflow as tf

def variational(Kxx, Kzz, Kzx, W, b, whiten=True, jitter=1e-8):

    # Jitter the diagonal so the Cholesky factorization is numerically stable.
    Kzz = Kzz + jitter * tf.eye(tf.shape(Kzz)[-1], dtype=Kzz.dtype)

    Lz = tf.linalg.cholesky(Kzz)
    M = tf.linalg.triangular_solve(Lz, Kzx, lower=True)  # Lambda_tilde^T = Lz \ Kzx
    Sigma = Kxx - tf.linalg.matmul(M, M, adjoint_a=True)  # Kxx - Lambda_tilde Lambda_tilde^T

    if not whiten:
        M = tf.linalg.triangular_solve(Lz, M, lower=True, adjoint=True)  # Lambda^T = Lz^T \ Lambda_tilde^T

    mu = tf.linalg.matmul(M, b, adjoint_a=True)  # Lambda b (or Lambda_tilde b if whitened)
    V = tf.linalg.matmul(M, W, adjoint_a=True)  # Lambda W (or Lambda_tilde W if whitened)

    Sigma += tf.linalg.matmul(V, V, adjoint_b=True)  # + V V^T

    return mu, Sigma

Cite as:

  @article{tiao2020svgp,
    title   = "{A} {C}heatsheet for {S}parse {V}ariational {G}aussian {P}rocesses",
    author  = "Tiao, Louis C",
    journal = "tiao.io",
    year    = "2020",
    url     = "https://tiao.io/post/sparse-variational-gaussian-processes/"
  }

To receive updates on more posts like this, follow me on Twitter and GitHub!


  1. Titsias, M. (2009, April). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574). ↩︎

  2. Murray, I., & Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740). ↩︎

  3. Hensman, J., Matthews, A. G., Filippone, M., & Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656). ↩︎

  4. Hensman, J., Fusi, N., & Lawrence, N. D. (2013, August). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290). ↩︎

  5. Dezfouli, A., & Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422). ↩︎

  6. Salimbeni, H., Cheng, C. A., Boots, B., & Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720). ↩︎

  7. Shi, J., Titsias, M., & Mnih, A. (2020, June). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR. ↩︎
