# A Handbook for Sparse Variational Gaussian Processes

A summary of notation, identities and derivations for the sparse variational Gaussian process (SVGP) framework

In the sparse variational Gaussian process (SVGP) framework (Titsias, 2009)1, one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with auxiliary variables $\mathbf{u}$ so that the joint becomes $$p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f}, \mathbf{u}).$$ The vector $\mathbf{u} = \begin{bmatrix} u(\mathbf{z}_1) \cdots u(\mathbf{z}_M)\end{bmatrix}^{\top} \in \mathbb{R}^M$ consists of inducing variables, the latent function values corresponding to the inducing input locations contained in the matrix $\mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_M \end{bmatrix}^{\top} \in \mathbb{R}^{M \times D}$.

## Prior

The joint distribution of the latent function values $\mathbf{f}$ and the inducing variables $\mathbf{u}$ under the prior is $$p(\mathbf{f}, \mathbf{u}) = \mathcal{N} \left ( \mathbf{0}, \begin{bmatrix} \mathbf{K}_{XX} & \mathbf{K}_{XZ} \newline \mathbf{K}_{XZ}^{\top} & \mathbf{K}_{ZZ} \end{bmatrix} \right ).$$ To reduce clutter, we shall write $\mathbf{K}_{X} = \mathbf{K}_{XX}$ and $\mathbf{K}_{Z} = \mathbf{K}_{ZZ}$ from now on. If we let the joint prior factorize as $$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) p(\mathbf{u}),$$ we can apply the rules of Gaussian conditioning to derive the marginal prior $p(\mathbf{u})$ and conditional prior $p(\mathbf{f} | \mathbf{u})$.
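For reference, the rules of Gaussian conditioning used throughout are as follows: given a joint Gaussian $$p(\mathbf{a}, \mathbf{b}) = \mathcal{N} \left ( \begin{bmatrix} \boldsymbol{\mu}_a \newline \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \mathbf{A} & \mathbf{C} \newline \mathbf{C}^{\top} & \mathbf{B} \end{bmatrix} \right ),$$ the marginal and conditional distributions are $$p(\mathbf{b}) = \mathcal{N}(\mathbf{b} \mid \boldsymbol{\mu}_b, \mathbf{B}), \quad \text{and} \quad p(\mathbf{a} \mid \mathbf{b}) = \mathcal{N}(\mathbf{a} \mid \boldsymbol{\mu}_a + \mathbf{C} \mathbf{B}^{-1} (\mathbf{b} - \boldsymbol{\mu}_b), \mathbf{A} - \mathbf{C} \mathbf{B}^{-1} \mathbf{C}^{\top}).$$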

### Marginal prior over inducing variables

The marginal prior over inducing variables is simply given by $$p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{0}, \mathbf{K}_Z).$$

#### Gaussian process notation

We can express the prior over the inducing variable $u(\mathbf{z})$ at inducing input $\mathbf{z}$ as $$p(u(\mathbf{z})) = \mathcal{GP}(0, \kappa_{\theta}(\mathbf{z}, \mathbf{z}')).$$

### Conditional prior

First, let us define the vector-valued function $\psi: \mathbb{R}^{D} \to \mathbb{R}^{M}$ as $$\psi(\mathbf{x}) \doteq \mathbf{K}_{Z}^{-1} k_{Z}(\mathbf{x}),$$ where $k_{Z}(\mathbf{x}) = \kappa_{\theta}(\mathbf{Z}, \mathbf{x})$ denotes the vector of covariances between $\mathbf{x}$ and the inducing inputs $\mathbf{Z}$. Further, let $\boldsymbol{\Psi} \in \mathbb{R}^{N \times M}$ be the matrix containing values of function $\psi$ applied row-wise to the matrix of inputs $\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \cdots \mathbf{x}_N \end{bmatrix}^{\top} \in \mathbb{R}^{N \times D}$, $$\boldsymbol{\Psi} \doteq \begin{bmatrix} \psi(\mathbf{x}_1) \cdots \psi(\mathbf{x}_N) \end{bmatrix}^{\top} = \mathbf{K}_{XZ} \mathbf{K}_{Z}^{-1}.$$ Then, we can condition the joint prior distribution on the inducing variables to give $$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}),$$ where the mean vector and covariance matrix are $$\mathbf{m} = \boldsymbol{\Psi} \mathbf{u}, \quad \text{and} \quad \mathbf{S} = \mathbf{K}_X - \boldsymbol{\Psi} \mathbf{K}_Z \boldsymbol{\Psi}^{\top}.$$
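As a concrete sketch of these quantities, here is a minimal NumPy example; the squared-exponential kernel, the input sizes, and all variable names are illustrative placeholders, not part of the framework itself:

```python
import numpy as np


def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel: kappa(a, b) = exp(-||a - b||^2 / (2 l^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # N = 5 inputs
Z = rng.normal(size=(3, 2))  # M = 3 inducing inputs

Kx = rbf(X, X)
Kz = rbf(Z, Z) + 1e-9 * np.eye(3)  # jitter on the diagonal
Kxz = rbf(X, Z)

Psi = Kxz @ np.linalg.inv(Kz)  # Psi = K_XZ K_Z^{-1}; row i is psi(x_i)^T
u = rng.normal(size=3)         # some value of the inducing variables

m = Psi @ u                    # conditional mean m = Psi u
S = Kx - Psi @ Kz @ Psi.T      # conditional covariance S = K_X - Psi K_Z Psi^T
```

In practice one would avoid forming $\mathbf{K}_Z^{-1}$ explicitly; the numerically stable Cholesky-based procedure is given in Appendix II.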

#### Gaussian process notation

We can express the distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, given $\mathbf{u}$, that is, the conditional $p(f(\mathbf{x}) | \mathbf{u})$, as a Gaussian process: $$p(f(\mathbf{x}) | \mathbf{u}) = \mathcal{GP}(m(\mathbf{x}), s(\mathbf{x}, \mathbf{x}')),$$ with mean and covariance functions, $$m(\mathbf{x}) = \psi(\mathbf{x})^{\top} \mathbf{u}, \quad \text{and} \quad s(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \psi(\mathbf{x})^{\top} \mathbf{K}_Z \psi(\mathbf{x}').$$

## Variational Distribution

We specify a joint variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$ which factorizes as $$q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u}).$$ For convenience, let us specify a variational distribution that is also Gaussian, $$q_{\boldsymbol{\phi}}(\mathbf{u}) \doteq \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{W}\mathbf{W}^{\top}),$$ with variational parameters $\boldsymbol{\phi} = \{ \mathbf{W}, \mathbf{b} \}$. To obtain the corresponding marginal variational distribution over $\mathbf{f}$, we marginalize out the inducing variables $\mathbf{u}$, leading to $$q_{\boldsymbol{\phi}}(\mathbf{f}) = \int q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \, \mathrm{d}\mathbf{u} = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \mathbf{\Sigma}),$$ where $$\boldsymbol{\mu} = \boldsymbol{\Psi} \mathbf{b}, \quad \text{and} \quad \mathbf{\Sigma} = \mathbf{K}_X - \boldsymbol{\Psi} (\mathbf{K}_Z - \mathbf{W}\mathbf{W}^{\top}) \boldsymbol{\Psi}^{\top}.$$
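The marginal covariance can equivalently be read off from the law of total covariance, $\mathbf{\Sigma} = \mathbf{S} + \boldsymbol{\Psi} \mathbf{W}\mathbf{W}^{\top} \boldsymbol{\Psi}^{\top}$. A quick numerical sanity check of this identity, using random placeholder kernel matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 3

# Any jointly valid covariance yields consistent K_X, K_XZ, K_Z blocks.
A = rng.normal(size=(N + M, N + M))
K = A @ A.T + 1e-6 * np.eye(N + M)
Kx, Kxz, Kz = K[:N, :N], K[:N, N:], K[N:, N:]

W = rng.normal(size=(M, M))
b = rng.normal(size=M)
Psi = Kxz @ np.linalg.inv(Kz)

S = Kx - Psi @ Kz @ Psi.T                  # conditional covariance
mu = Psi @ b                               # marginal mean
Sigma = Kx - Psi @ (Kz - W @ W.T) @ Psi.T  # marginal covariance

# law of total covariance: Sigma = S + Psi cov[u] Psi^T
assert np.allclose(Sigma, S + Psi @ (W @ W.T) @ Psi.T)
```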

#### Gaussian process notation

We can express the variational distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, that is, the marginal $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$, as a Gaussian process: $$q_{\boldsymbol{\phi}}(f(\mathbf{x})) = \mathcal{GP}(\mu(\mathbf{x}), \sigma(\mathbf{x}, \mathbf{x}')),$$ with mean and covariance functions, $$\mu(\mathbf{x}) = \psi(\mathbf{x})^{\top} \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \psi(\mathbf{x})^{\top} (\mathbf{K}_Z - \mathbf{W}\mathbf{W}^{\top}) \psi(\mathbf{x}').$$

### Whitened parameterization

Whitening is a powerful trick for stabilizing the learning of variational parameters that works by reducing correlations in the variational distribution (Murray & Adams, 2010; Hensman et al., 2015)2 3. Let $\mathbf{L}_Z$ be the Cholesky factor of $\mathbf{K}_Z$, i.e. the lower-triangular matrix such that $\mathbf{L}_Z \mathbf{L}_Z^{\top} = \mathbf{K}_Z$. Then, under the whitened parameterization, the parameters of $q_{\boldsymbol{\phi}}(\mathbf{u})$ are given by $$\mathbf{W}' \doteq \mathbf{L}_Z \mathbf{W}, \quad \text{and} \quad \mathbf{b}' \doteq \mathbf{L}_Z \mathbf{b},$$ where $\{ \mathbf{W}, \mathbf{b} \}$ are now the free parameters. This leads to mean and covariance $$\boldsymbol{\mu} = \boldsymbol{\Lambda} \mathbf{b}, \quad \text{and} \quad \mathbf{\Sigma} = \mathbf{K}_X - \boldsymbol{\Lambda} (\mathbf{I}_M - \mathbf{W} \mathbf{W}^{\top}) \boldsymbol{\Lambda}^{\top},$$ where $\boldsymbol{\Lambda} \doteq \boldsymbol{\Psi} \mathbf{L}_Z = \mathbf{K}_{XZ} (\mathbf{L}_{Z}^{-1})^{\top}$. Refer to Appendix I for derivations.
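A quick numerical check that the whitened and un-whitened parameterizations describe the same marginal distribution (a minimal NumPy sketch with random placeholder matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 3
A = rng.normal(size=(N + M, N + M))
K = A @ A.T + 1e-6 * np.eye(N + M)  # placeholder joint kernel matrix
Kx, Kxz, Kz = K[:N, :N], K[:N, N:], K[N:, N:]

Lz = np.linalg.cholesky(Kz)
Psi = Kxz @ np.linalg.inv(Kz)
Lam = Psi @ Lz                      # Lambda = Psi L_Z = K_XZ (L_Z^{-1})^T

W = rng.normal(size=(M, M))         # free (whitened) parameters
b = rng.normal(size=M)
Wp, bp = Lz @ W, Lz @ b             # W' = L_Z W, b' = L_Z b

# un-whitened formulas applied to (W', b') ...
mu1 = Psi @ bp
Sigma1 = Kx - Psi @ (Kz - Wp @ Wp.T) @ Psi.T
# ... agree with the whitened formulas applied to (W, b)
mu2 = Lam @ b
Sigma2 = Kx - Lam @ (np.eye(M) - W @ W.T) @ Lam.T

assert np.allclose(mu1, mu2) and np.allclose(Sigma1, Sigma2)
```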

#### Gaussian process notation

The mean and covariance functions are now $$\mu(\mathbf{x}) = \lambda(\mathbf{x})^{\top} \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = \kappa_{\theta}(\mathbf{x}, \mathbf{x}') - \lambda(\mathbf{x})^{\top} (\mathbf{I}_M - \mathbf{W}\mathbf{W}^{\top}) \lambda(\mathbf{x}'),$$ where $$\lambda(\mathbf{x}) = \mathbf{L}_{Z}^{\top} \psi(\mathbf{x}) = \mathbf{L}_{Z}^{-1} k_{Z}(\mathbf{x}).$$

For an efficient and numerically stable way to compute and evaluate the variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f})$ at arbitrary input index points $\mathbf{X}$, see Appendix II.

## Inference

Work in progress.

### Preliminaries

We minimize the Kullback-Leibler (KL) divergence between $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$ and $p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$, which is given by \begin{align} \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})] & = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}{p(\mathbf{f},\mathbf{u} \mid \mathbf{y})}}\right] \newline & = \log{p(\mathbf{y})} + \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}{p(\mathbf{f},\mathbf{u}, \mathbf{y})}}\right] \newline & = \log{p(\mathbf{y})} - \mathrm{ELBO}(\boldsymbol{\phi}), \end{align} where the evidence lower bound (ELBO), $\mathrm{ELBO}(\boldsymbol{\phi})$, is defined as $$\mathrm{ELBO}(\boldsymbol{\phi}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{p(\mathbf{f},\mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}}\right].$$ Hence, minimizing the KL divergence above is equivalent to maximizing the ELBO. Importantly, the ELBO is a lower bound on the log marginal likelihood, since $$\log{p(\mathbf{y})} = \mathrm{ELBO}(\boldsymbol{\phi}) + \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})]$$ and the KL divergence is nonnegative. Therefore, we have $\log{p(\mathbf{y})} \geq \mathrm{ELBO}(\boldsymbol{\phi})$ with equality at $\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})] = 0 \Leftrightarrow q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) = p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$.

We can write the ELBO as \begin{align} \require{cancel} \mathrm{ELBO}(\boldsymbol{\phi}) & = \iint \log{\frac{p(\mathbf{f},\mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \,\mathrm{d}\mathbf{f} \,\mathrm{d}\mathbf{u} \newline & = \iint \log{\frac{p(\mathbf{y} | \mathbf{f}) \bcancel{p(\mathbf{f} | \mathbf{u})} p(\mathbf{u})}{\bcancel{p(\mathbf{f} | \mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \,\mathrm{d}\mathbf{f} \,\mathrm{d}\mathbf{u} \newline & = \int \left ( \int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \newline & = \int \left ( \log{G(\mathbf{y}, \mathbf{u})} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \newline & = \int \log{\frac{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u}, \end{align} where we have defined $$G(\mathbf{y}, \mathbf{u}) = \exp{ \left ( \int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f} \right ) }.$$

Taking the functional derivative of the ELBO with respect to $q_{\boldsymbol{\phi}}(\mathbf{u})$, i.e. differentiating the integrand pointwise, we get \begin{align} \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \mathrm{ELBO}(\boldsymbol{\phi}) & = \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left ( \log{\frac{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \right ) \newline & = \log{\frac{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \left ( \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u}) \right ) + q_{\boldsymbol{\phi}}(\mathbf{u}) \left ( \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \log{\frac{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) \newline & = \log{\frac{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} + q_{\boldsymbol{\phi}}(\mathbf{u}) \left ( -\frac{1}{q_{\boldsymbol{\phi}}(\mathbf{u})} \right ) \newline & = \log{G(\mathbf{y}, \mathbf{u})} + \log{p(\mathbf{u})} - \log{q_{\boldsymbol{\phi}}(\mathbf{u})} - 1. \end{align} Setting this expression to zero, we have \begin{align} \log{q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})} & = \log{G(\mathbf{y}, \mathbf{u})} + \log{p(\mathbf{u})} - 1 \newline q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) & \propto G(\mathbf{y}, \mathbf{u}) p(\mathbf{u}). \end{align}

Normalizing, we obtain $$q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) = \frac{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{Z}, \quad \text{where} \quad Z = \int G(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u}.$$

Plugging this back into the ELBO, we get $$\require{cancel} \mathrm{ELBO}(\boldsymbol{\phi}^{\star}) = \int \log{\left (\bcancel{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})} \frac{Z}{\bcancel{G(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}} \right )} q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} = \log{Z}.$$

### Inference for Gaussian Likelihoods

For a Gaussian likelihood, $$p(\mathbf{y} | \mathbf{f}) = \mathcal{N}(\mathbf{y} | \mathbf{f}, \sigma^{2} \mathbf{I}),$$

the expected log-likelihood term has the closed form $$\log{G(\mathbf{y}, \mathbf{u})} = \log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \sigma^{2} \mathbf{I} )} - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S}),$$ where $\mathbf{m}$ and $\mathbf{S}$ are defined as before, i.e. $\mathbf{m} = \boldsymbol{\Psi} \mathbf{u}$ and $\mathbf{S} = \mathbf{K}_X - \boldsymbol{\Psi} \mathbf{K}_Z \boldsymbol{\Psi}^{\top}$. Refer to Appendix III for derivations.

The ELBO then becomes \begin{align} \mathrm{ELBO}(\boldsymbol{\phi}) & = \int \left ( \log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \sigma^{2} \mathbf{I} )} - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S}) + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \newline & = \int \log{\frac{\mathcal{N}(\mathbf{y} | \mathbf{m}, \sigma^{2} \mathbf{I} ) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S}), \end{align} where the trace term can be pulled out of the integral because $\mathbf{S}$ does not depend on $\mathbf{u}$.
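As a sanity check of the closed form for $\log{G(\mathbf{y}, \mathbf{u})}$, we can compare it against a Monte Carlo estimate of the expectation $\int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f}$. A minimal NumPy sketch, with placeholder values standing in for $\mathbf{y}$, $\mathbf{m}$, and $\mathbf{S}$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma = 4, 1.0
y = rng.normal(size=N)
m = rng.normal(size=N)
A = rng.normal(size=(N, N))
S = 0.1 * A @ A.T + 1e-6 * np.eye(N)  # placeholder conditional covariance

# closed form: log G = log N(y | m, sigma^2 I) - tr(S) / (2 sigma^2)
log_norm = -0.5 * N * np.log(2 * np.pi * sigma**2) \
    - 0.5 * ((y - m) ** 2).sum() / sigma**2
log_G = log_norm - np.trace(S) / (2 * sigma**2)

# Monte Carlo estimate of E_{N(f | m, S)}[log N(y | f, sigma^2 I)]
f = rng.multivariate_normal(m, S, size=1_000_000)
log_lik = -0.5 * N * np.log(2 * np.pi * sigma**2) \
    - 0.5 * ((y - f) ** 2).sum(-1) / sigma**2
assert abs(log_lik.mean() - log_G) < 0.05
```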

## Appendix

### I

#### Whitened parameterization

Recall the definition $\boldsymbol{\Lambda} \doteq \boldsymbol{\Psi} \mathbf{L}_Z$. Then, the mean simplifies to $$\boldsymbol{\mu} = \boldsymbol{\Psi} \mathbf{b}' = (\boldsymbol{\Psi} \mathbf{L}_Z) \mathbf{b} = \boldsymbol{\Lambda} \mathbf{b}.$$ Similarly, the covariance simplifies to \begin{align} \mathbf{\Sigma} & = \mathbf{K}_X - \boldsymbol{\Psi} (\mathbf{K}_Z - \mathbf{W}' (\mathbf{W}')^{\top}) \boldsymbol{\Psi}^{\top} \newline & = \mathbf{K}_X - \boldsymbol{\Psi} (\mathbf{L}_Z \mathbf{L}_Z^{\top} - \mathbf{L}_Z (\mathbf{W} \mathbf{W}^{\top}) \mathbf{L}_Z^{\top}) \boldsymbol{\Psi}^{\top} \newline & = \mathbf{K}_X - (\boldsymbol{\Psi} \mathbf{L}_Z) ( \mathbf{I}_M - \mathbf{W} \mathbf{W}^{\top}) (\boldsymbol{\Psi} \mathbf{L}_Z)^{\top} \newline & = \mathbf{K}_X - \boldsymbol{\Lambda} ( \mathbf{I}_M - \mathbf{W} \mathbf{W}^{\top}) \boldsymbol{\Lambda}^{\top}. \end{align}

### II

#### Implementation Details

**Single input index point**

Here is an efficient and numerically stable way to compute $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$ for an input $\mathbf{x}$. To reduce notational clutter, let us make the following abbreviations $$\boldsymbol{\lambda} = \lambda(\mathbf{x}) \in \mathbb{R}^{M}, \quad \text{and} \quad \mathbf{k}_{Z} = k_{Z}(\mathbf{x}) \in \mathbb{R}^{M}.$$ Then, we take the following steps:

1. Cholesky decomposition: $\mathbf{L}_{Z} = \mathrm{cholesky}(\mathbf{K}_{Z})$

Note: $\mathcal{O}(M^3)$ complexity.

2. Solve system of linear equations: $\boldsymbol{\lambda} = \mathbf{L}_{Z} \backslash \mathbf{k}_{Z}$

Note: $\mathcal{O}(M^2)$ complexity since $\mathbf{L}_{Z}$ is lower triangular; $\boldsymbol{\beta} = \mathbf{A} \backslash \mathbf{x}$ denotes the vector $\boldsymbol{\beta}$ such that $\mathbf{A} \boldsymbol{\beta} = \mathbf{x} \Leftrightarrow \boldsymbol{\beta} = \mathbf{A}^{-1} \mathbf{x}$. Hence, $\boldsymbol{\lambda} = \mathbf{L}_{Z}^{-1} \mathbf{k}_{Z}$.

3. $u = \kappa_{\theta}(\mathbf{x}, \mathbf{x}) - \boldsymbol{\lambda}^{\top} \boldsymbol{\lambda}$

Note: $$\boldsymbol{\lambda}^{\top} \boldsymbol{\lambda} = \mathbf{k}_{Z}^{\top} (\mathbf{L}_{Z}^{-1})^{\top} \mathbf{L}_{Z}^{-1} \mathbf{k}_{Z} = \mathbf{k}_{Z}^{\top} \mathbf{K}_Z^{-1} \mathbf{k}_{Z} = \mathbf{k}_{Z}^{\top} \mathbf{K}_Z^{-1} (\mathbf{K}_Z) \mathbf{K}_Z^{-1} \mathbf{k}_{Z}.$$

4. For whitened parameterization:

1. $\mu = \boldsymbol{\lambda}^{\top} \mathbf{b}$

2. $\mathbf{v}^{\top} = \boldsymbol{\lambda}^{\top} \mathbf{W}$

Note: $\mathbf{v}^{\top} \mathbf{v} = \mathbf{k}_{Z}^{\top} (\mathbf{L}_{Z}^{-1})^{\top} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{L}_{Z}^{-1} \mathbf{k}_{Z}$

otherwise:

1. Solve system of linear equations: $\boldsymbol{\psi} = \mathbf{L}_{Z}^{\top} \backslash \boldsymbol{\lambda}$

Note: $\mathcal{O}(M^2)$ complexity since $\mathbf{L}_{Z}^{\top}$ is upper triangular. Further, $$\boldsymbol{\psi} = (\mathbf{L}_{Z}^{\top})^{-1} \boldsymbol{\lambda} = (\mathbf{L}_{Z}^{\top})^{-1} \mathbf{L}_{Z}^{-1} \mathbf{k}_{Z} = \mathbf{K}_Z^{-1} \mathbf{k}_{Z}$$ and $$\boldsymbol{\psi}^{\top} = \mathbf{k}_{Z}^{\top} (\mathbf{K}_Z^{-1})^{\top} = \mathbf{k}_{Z}^{\top} \mathbf{K}_Z^{-1}$$ since $\mathbf{K}_Z$ is symmetric and nonsingular.

2. $\mu = \boldsymbol{\psi}^{\top} \mathbf{b}$

3. $\mathbf{v}^{\top} = \boldsymbol{\psi}^{\top} \mathbf{W}$

Note: $\mathbf{v}^{\top} \mathbf{v} = \mathbf{k}_{Z}^{\top} \mathbf{K}_Z^{-1} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{K}_Z^{-1} \mathbf{k}_{Z}$

5. $\sigma^2 = u + \mathbf{v}^{\top} \mathbf{v}$

6. Return $\mathcal{N}(f(\mathbf{x}) ; \mu, \sigma^2)$
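The steps above can be sketched in NumPy as follows (placeholder kernel matrices; `np.linalg.solve` stands in for a dedicated triangular solve such as `scipy.linalg.solve_triangular`, which would exploit the structure of $\mathbf{L}_Z$):

```python
import numpy as np

rng = np.random.default_rng(4)
M = 3
A = rng.normal(size=(M + 1, M + 1))
K = A @ A.T + 1e-6 * np.eye(M + 1)          # placeholder joint kernel matrix
kxx, kz, Kz = K[0, 0], K[1:, 0], K[1:, 1:]  # kappa(x, x), k_Z(x), K_Z
W = 0.1 * rng.normal(size=(M, M))
b = rng.normal(size=M)

Lz = np.linalg.cholesky(Kz)    # 1. Cholesky, O(M^3)
lam = np.linalg.solve(Lz, kz)  # 2. lambda = L_Z \ k_Z
u = kxx - lam @ lam            # 3. kappa(x, x) - lambda^T lambda

# 4. whitened parameterization
mu = lam @ b                   # 4.1 mu = lambda^T b
v = W.T @ lam                  # 4.2 v^T = lambda^T W

var = u + v @ v                # 5. sigma^2 = u + v^T v

# agrees with the whitened covariance function sigma(x, x) evaluated directly
direct = kxx - lam @ (np.eye(M) - W @ W.T) @ lam
assert np.isclose(var, direct)
```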

**Multiple input index points**

It is simple to extend this to compute $q_{\boldsymbol{\phi}}(\mathbf{f})$ for an arbitrary number of index points $\mathbf{X}$:

1. Cholesky decomposition: $\mathbf{L}_{Z} = \mathrm{cholesky}(\mathbf{K}_{Z})$

Note: $\mathcal{O}(M^3)$ complexity.

2. Solve system of linear equations: $\boldsymbol{\Lambda}^{\top} = \mathbf{L}_{Z} \backslash \mathbf{K}_{XZ}^{\top}$

Note: $\mathcal{O}(M^2 N)$ complexity ($\mathcal{O}(M^2)$ per column of the right-hand side) since $\mathbf{L}_{Z}$ is lower triangular; $\mathbf{B} = \mathbf{A} \backslash \mathbf{X}$ denotes the matrix $\mathbf{B}$ such that $\mathbf{A} \mathbf{B} = \mathbf{X} \Leftrightarrow \mathbf{B} = \mathbf{A}^{-1} \mathbf{X}$. Hence, $\boldsymbol{\Lambda}^{\top} = \mathbf{L}_{Z}^{-1} \mathbf{K}_{XZ}^{\top}$.

3. $\mathbf{U} = \mathbf{K}_{X} - \boldsymbol{\Lambda} \boldsymbol{\Lambda}^{\top}$

Note: $$\boldsymbol{\Lambda} \boldsymbol{\Lambda}^{\top} = \mathbf{K}_{XZ} (\mathbf{L}_{Z}^{-1})^{\top} \mathbf{L}_{Z}^{-1} \mathbf{K}_{XZ}^{\top} = \mathbf{K}_{XZ} \mathbf{K}_{Z}^{-1} \mathbf{K}_{XZ}^{\top} = \mathbf{K}_{XZ} \mathbf{K}_{Z}^{-1} (\mathbf{K}_{Z}) \mathbf{K}_{Z}^{-1} \mathbf{K}_{XZ}^{\top}.$$

4. For whitened parameterization:

1. $\boldsymbol{\mu} = \boldsymbol{\Lambda} \mathbf{b}$

2. $\mathbf{V} = \boldsymbol{\Lambda} \mathbf{W}$

Note: $\mathbf{V} \mathbf{V}^{\top} = \mathbf{K}_{XZ} (\mathbf{L}_{Z}^{-1})^{\top} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{L}_{Z}^{-1} \mathbf{K}_{XZ}^{\top}.$

otherwise:

1. Solve system of linear equations: $\boldsymbol{\Psi}^{\top} = \mathbf{L}_{Z}^{\top} \backslash \boldsymbol{\Lambda}^{\top}$

Note: $\mathcal{O}(M^2 N)$ complexity ($\mathcal{O}(M^2)$ per column of the right-hand side) since $\mathbf{L}_{Z}^{\top}$ is upper triangular. Further,

$$\boldsymbol{\Psi}^{\top} = (\mathbf{L}_{Z}^{\top})^{-1} \boldsymbol{\Lambda}^{\top} = (\mathbf{L}_{Z}^{\top})^{-1} \mathbf{L}_{Z}^{-1} \mathbf{K}_{XZ}^{\top} = \mathbf{K}_Z^{-1} \mathbf{K}_{XZ}^{\top}$$ and $$\boldsymbol{\Psi} = \mathbf{K}_{XZ} (\mathbf{K}_Z^{-1})^{\top} = \mathbf{K}_{XZ} \mathbf{K}_Z^{-1}$$ since $\mathbf{K}_Z$ is symmetric and nonsingular.

2. $\boldsymbol{\mu} = \boldsymbol{\Psi} \mathbf{b}$

3. $\mathbf{V} = \boldsymbol{\Psi} \mathbf{W}$

Note: $\mathbf{V} \mathbf{V}^{\top} = \mathbf{K}_{XZ} \mathbf{K}_Z^{-1} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{K}_Z^{-1} \mathbf{K}_{XZ}^{\top}$.

5. $\mathbf{\Sigma} = \mathbf{U} + \mathbf{V} \mathbf{V}^{\top}$

6. Return $\mathcal{N}(\mathbf{f} ; \boldsymbol{\mu}, \mathbf{\Sigma})$

In TensorFlow, this looks something like:

```python
import tensorflow as tf


def variational(Kxx, Kzz, Kzx, W, b, whiten=True, jitter=1e-8):
    # Add jitter to the diagonal for numerical stability, then factorize
    # K_Z = L_Z L_Z^T.
    eye = tf.eye(tf.shape(Kzz)[-1], dtype=Kzz.dtype)
    Lz = tf.linalg.cholesky(Kzz + jitter * eye)

    M = tf.linalg.triangular_solve(Lz, Kzx, lower=True)   # Lambda^T = L_Z^{-1} K_ZX
    Sigma = Kxx - tf.linalg.matmul(M, M, adjoint_a=True)  # K_X - Lambda Lambda^T

    if not whiten:
        # Psi^T = (L_Z^T)^{-1} Lambda^T = K_Z^{-1} K_ZX
        M = tf.linalg.triangular_solve(Lz, M, lower=True, adjoint=True)

    mu = tf.linalg.matmul(M, b, adjoint_a=True)  # Lambda b (resp. Psi b); b has shape (M, 1)
    V = tf.linalg.matmul(M, W, adjoint_a=True)   # Lambda W (resp. Psi W)

    Sigma += tf.linalg.matmul(V, V, adjoint_b=True)  # ... + V V^T

    return mu, Sigma
```


### III

#### Variational lower bound for Gaussian likelihoods

To carry out this derivation, we will need to recall the following two simple identities. First, we can write the inner product between two vectors as the trace of their outer product, $$\mathbf{a}^\top \mathbf{b} = \mathrm{tr}(\mathbf{a} \mathbf{b}^\top).$$ Second, we use the relationship between the auto-correlation matrix $\mathbb{E}[\mathbf{a}\mathbf{a}^{\top}]$ and the covariance matrix, $$\mathrm{cov}[\mathbf{a}] = \mathbb{E}[\mathbf{a}\mathbf{a}^{\top}] - \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}]^\top \quad\Leftrightarrow\quad \mathbb{E}[\mathbf{a}\mathbf{a}^{\top}] = \mathrm{cov}[\mathbf{a}] + \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}]^\top.$$ These allow us to write \begin{align} \log{G(\mathbf{y}, \mathbf{u})} & = \int \log{\mathcal{N}(\mathbf{y} | \mathbf{f}, \sigma^{2} \mathbf{I})} \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f} \newline & = - \frac{1}{2\sigma^2} \int (\mathbf{y} - \mathbf{f})^{\top} (\mathbf{y} - \mathbf{f}) \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f} - \frac{N}{2}\log{(2\pi\sigma^2)} \newline & = - \frac{1}{2\sigma^2} \int \mathrm{tr} \left (\mathbf{y}\mathbf{y}^{\top} - 2 \mathbf{y}\mathbf{f}^{\top} + \mathbf{f}\mathbf{f}^{\top} \right) \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f} - \frac{N}{2}\log{(2\pi\sigma^2)} \newline & = - \frac{1}{2\sigma^2} \mathrm{tr} \left (\mathbf{y}\mathbf{y}^{\top} - 2 \mathbf{y}\mathbf{m}^{\top} + \mathbf{S} + \mathbf{m} \mathbf{m}^{\top} \right) - \frac{N}{2}\log{(2\pi\sigma^2)} \newline & = - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{m})^{\top} (\mathbf{y} - \mathbf{m}) - \frac{N}{2}\log{(2\pi\sigma^2)} - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S}) \newline & = \log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \sigma^{2} \mathbf{I} )} - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S}). \end{align}
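Both identities are easy to verify numerically (a minimal NumPy sketch with placeholder values; the second identity is checked by Monte Carlo):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4
a, b = rng.normal(size=N), rng.normal(size=N)

# identity 1: a^T b = tr(a b^T)
assert np.isclose(a @ b, np.trace(np.outer(a, b)))

# identity 2: E[f f^T] = cov[f] + E[f] E[f]^T, checked by Monte Carlo
m = rng.normal(size=N)
A = rng.normal(size=(N, N))
S = 0.1 * A @ A.T + 1e-6 * np.eye(N)
f = rng.multivariate_normal(m, S, size=500_000)
second_moment = np.einsum('ni,nj->ij', f, f) / len(f)
assert np.allclose(second_moment, S + np.outer(m, m), atol=0.02)
```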

Cite as:

```bibtex
@article{tiao2020svgp,
  title   = "{A} {H}andbook for {S}parse {V}ariational {G}aussian {P}rocesses",
  author  = "Tiao, Louis C",
  journal = "tiao.io",
  year    = "2020",
  url     = "https://tiao.io/post/sparse-variational-gaussian-processes/"
}
```


1. Titsias, M. (2009, April). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574). ↩︎

2. Murray, I., & Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740). ↩︎

3. Hensman, J., Matthews, A. G., Filippone, M., & Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656). ↩︎

4. Csató, L., & Opper, M. (2002). Sparse On-line Gaussian Processes. Neural Computation, 14(3), 641-668. ↩︎

5. Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations (PhD Thesis). University of Edinburgh. ↩︎

6. Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian Processes using Pseudo-inputs. Advances in Neural Information Processing Systems, 18, 1257-1264. ↩︎

7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). A Unifying View of Sparse Approximate Gaussian Process Regression. The Journal of Machine Learning Research, 6, 1939-1959. ↩︎

8. Lázaro-Gredilla, M., & Figueiras-Vidal, A. R. (2009, December). Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Advances in Neural Information Processing Systems. ↩︎

9. Damianou, A., & Lawrence, N. D. (2013, April). Deep Gaussian Processes. In Artificial Intelligence and Statistics (pp. 207-215). PMLR. ↩︎

10. Hensman, J., Fusi, N., & Lawrence, N. D. (2013, August). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290). ↩︎

11. Dezfouli, A., & Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422). ↩︎

12. Bui, T. D., Yan, J., & Turner, R. E. (2017). A Unifying Framework for Gaussian Process Pseudo-point Approximations using Power Expectation Propagation. The Journal of Machine Learning Research, 18(1), 3649-3720. ↩︎

13. Salimbeni, H., Cheng, C. A., Boots, B., & Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720). ↩︎

14. Shi, J., Titsias, M., & Mnih, A. (2020, June). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR. ↩︎

15. Burt, D., Rasmussen, C. E., & Van Der Wilk, M. (2019, May). Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (pp. 862-871). PMLR. ↩︎
