A Handbook for Sparse Variational Gaussian Processes
A summary of notation, identities and derivations for the sparse variational Gaussian process (SVGP) framework

In the sparse variational Gaussian process (SVGP) framework (Titsias, 2009),
one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with $M$ inducing variables $\mathbf{u} = (u_1, \dots, u_M)^\top$, the values of the latent function at a set of inducing inputs $\mathbf{Z} = \{\mathbf{z}_j\}_{j=1}^M$, giving
$$
p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u}).
$$
Prior
The joint distribution of the latent function values $\mathbf{f} = f(\mathbf{X})$ at the training inputs $\mathbf{X}$ and the inducing variables $\mathbf{u} = f(\mathbf{Z})$ under the GP prior is
$$
p(\mathbf{f}, \mathbf{u}) =
\mathcal{N}\left(
\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix}
\,\middle|\,
\mathbf{0},
\begin{bmatrix} \mathbf{K}_{nn} & \mathbf{K}_{nm} \\ \mathbf{K}_{mn} & \mathbf{K}_{mm} \end{bmatrix}
\right),
$$
where $\mathbf{K}_{nn} = k(\mathbf{X}, \mathbf{X})$, $\mathbf{K}_{mm} = k(\mathbf{Z}, \mathbf{Z})$, and $\mathbf{K}_{nm} = \mathbf{K}_{mn}^\top = k(\mathbf{X}, \mathbf{Z})$.
Marginal prior over inducing variables
The marginal prior over the inducing variables is simply given by
$$
p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{mm}).
$$
Gaussian process notation
We can express the prior over the inducing variables in Gaussian process notation: since $u_j = f(\mathbf{z}_j)$ with $f \sim \mathcal{GP}(0, k)$, the inducing variables are jointly Gaussian with zero mean and covariances $\operatorname{Cov}(u_i, u_j) = k(\mathbf{z}_i, \mathbf{z}_j)$, which is exactly the marginal prior above.
Conditional prior
First, let us define the vector-valued function $\mathbf{k}_{\mathbf{u}} : \mathcal{X} \to \mathbb{R}^M$ by $\mathbf{k}_{\mathbf{u}}(\mathbf{x}) = k(\mathbf{Z}, \mathbf{x})$, the vector of covariances between the inducing variables and the function value at $\mathbf{x}$. Conditioning the joint prior on $\mathbf{u}$ then gives
$$
p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\big(\mathbf{f} \mid \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u},\; \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\big).
$$
Gaussian process notation
We can express the distribution over the function value $f(\mathbf{x})$ at an arbitrary input $\mathbf{x}$, conditioned on $\mathbf{u}$, as a Gaussian process,
$$
f \mid \mathbf{u} \sim \mathcal{GP}\big(\mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{K}_{mm}^{-1}\mathbf{u},\;
k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{K}_{mm}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x}')\big).
$$
Before moving on, we briefly highlight the important quantity
$$
\mathbf{Q}_{nn} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn},
$$
the Nyström approximation to $\mathbf{K}_{nn}$, which appears throughout the derivations that follow.
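Computationally, this quantity is best formed through the Cholesky factor of $\mathbf{K}_{mm}$ rather than an explicit matrix inverse. Here is a minimal TensorFlow sketch; the function name and the `jitter` argument are illustrative conventions, not part of the original derivation:

```python
import tensorflow as tf


def nystrom(Kmm, Kmn, jitter=1e-6):
    """Sketch: form Qnn = Knm Kmm^{-1} Kmn via a Cholesky factor and a triangular solve."""
    m = tf.shape(Kmm)[0]
    L = tf.linalg.cholesky(Kmm + jitter * tf.eye(m, dtype=Kmm.dtype))  # L L^T = Kmm (+ jitter)
    Lambda = tf.linalg.triangular_solve(L, Kmn, lower=True)            # Lambda = L^{-1} Kmn
    return tf.linalg.matmul(Lambda, Lambda, adjoint_a=True)            # Lambda^T Lambda = Qnn
```

This Cholesky-plus-triangular-solve pattern recurs in all of the implementation details in the appendices below.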
Variational Distribution
We specify a joint variational distribution that keeps the conditional prior intact,
$$
q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u}),
\qquad
q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{b}, \mathbf{W}\mathbf{W}^\top),
$$
where the mean $\mathbf{b}$ and the covariance factor $\mathbf{W}$ are free variational parameters.
Gaussian process notation
We can express the variational distribution over the function value $f(\mathbf{x})$, obtained by marginalizing out the inducing variables, as a Gaussian process with mean and covariance functions
$$
\mu(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{K}_{mm}^{-1}\mathbf{b},
\qquad
\Sigma(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{K}_{mm}^{-1}\big(\mathbf{K}_{mm} - \mathbf{W}\mathbf{W}^\top\big)\mathbf{K}_{mm}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x}').
$$
Whitened parameterization
Whitening is a powerful trick for stabilizing the learning of variational
parameters that works by reducing correlations in the variational distribution (Murray & Adams, 2010; Hensman et al., 2015).
Let $\mathbf{L}$ denote the lower Cholesky factor of $\mathbf{K}_{mm}$, so that $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_{mm}$, and define the whitened inducing variables $\mathbf{v} = \mathbf{L}^{-1}\mathbf{u}$, whose prior is simply $p(\mathbf{v}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. We now place the variational distribution on $\mathbf{v}$ instead, $q(\mathbf{v}) = \mathcal{N}(\mathbf{b}, \mathbf{W}\mathbf{W}^\top)$, and recover $\mathbf{u} = \mathbf{L}\mathbf{v}$ deterministically.
Gaussian process notation
The mean and covariance functions are now
$$
\mu(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{L}^{-\top}\mathbf{b},
\qquad
\Sigma(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{L}^{-\top}\big(\mathbf{I} - \mathbf{W}\mathbf{W}^\top\big)\mathbf{L}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x}').
$$
For an efficient and numerically stable way to compute and evaluate the variational distribution, in both the whitened and unwhitened parameterizations, see the SVGP implementation details in Appendix II.
Inference
Preliminaries
We seek to approximate the exact posterior $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$ by the variational distribution $q(\mathbf{f}, \mathbf{u})$, chosen to minimize the Kullback-Leibler (KL) divergence $\mathrm{KL}[q(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})]$, or equivalently, to maximize the evidence lower bound (ELBO).
Let us now focus our attention on the ELBO, which can be written as
$$
\mathrm{ELBO} = \mathbb{E}_{q(\mathbf{f}, \mathbf{u})}\left[\log \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u})}{p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})}\right]
= \mathbb{E}_{q(\mathbf{f})}\left[\log p(\mathbf{y} \mid \mathbf{f})\right] - \mathrm{KL}\left[q(\mathbf{u}) \,\|\, p(\mathbf{u})\right],
$$
where $q(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u}$ is the marginal variational distribution derived earlier; note that the conditional prior $p(\mathbf{f} \mid \mathbf{u})$ cancels.
Gaussian Likelihoods – Sparse Gaussian Process Regression (SGPR)
Let us assume we have a Gaussian likelihood of the form
$$
p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2\mathbf{I}),
$$
where $\sigma^2$ is the observation noise variance.
Now, there are a few key objects of interest. First, the optimal variational distribution $q^{\star}(\mathbf{u})$, for which the ELBO is maximized with all other parameters held fixed.

The optimal variational distribution is given by
$$
q^{\star}(\mathbf{u}) = \mathcal{N}\big(\mathbf{u} \mid \sigma^{-2}\mathbf{K}_{mm}\boldsymbol{\Sigma}\mathbf{K}_{mn}\mathbf{y},\; \mathbf{K}_{mm}\boldsymbol{\Sigma}\mathbf{K}_{mm}\big),
\qquad
\boldsymbol{\Sigma} = \big(\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm}\big)^{-1}.
$$
See Appendix V for the derivation.
This leads to the predictive distribution at a test input $\mathbf{x}_*$,
$$
q(f_*) = \mathcal{N}\big(f_* \mid \sigma^{-2}\mathbf{k}_{\mathbf{u}}(\mathbf{x}_*)^\top\boldsymbol{\Sigma}\mathbf{K}_{mn}\mathbf{y},\;
k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_{\mathbf{u}}(\mathbf{x}_*)^\top\big(\mathbf{K}_{mm}^{-1} - \boldsymbol{\Sigma}\big)\mathbf{k}_{\mathbf{u}}(\mathbf{x}_*)\big).
$$
The ELBO, obtained by substituting the optimal variational distribution back into the bound (the collapsed bound of Titsias, 2009), is given by
$$
\mathrm{ELBO} = \log \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_{nn} + \sigma^2\mathbf{I}\big) - \frac{1}{2\sigma^2}\operatorname{tr}\big(\mathbf{K}_{nn} - \mathbf{Q}_{nn}\big).
$$
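To make the bound concrete, here is a minimal TensorFlow sketch that evaluates the collapsed bound directly from the kernel matrices. It favours clarity over numerical stability (see Appendix VII for a stable version); the function name and argument names are assumptions matching the notation above.

```python
import math

import tensorflow as tf


def sgpr_elbo_naive(Knn, Kmm, Kmn, y, noise_variance, jitter=1e-6):
    """Collapsed (Titsias, 2009) bound, computed directly for clarity, not stability.

    Knn: (n, n), Kmm: (m, m), Kmn: (m, n) kernel matrices; y: (n, 1) targets.
    """
    n = tf.shape(Knn)[0]
    m = tf.shape(Kmm)[0]
    dtype = Knn.dtype

    L = tf.linalg.cholesky(Kmm + jitter * tf.eye(m, dtype=dtype))      # L L^T = Kmm
    Lambda = tf.linalg.triangular_solve(L, Kmn, lower=True)            # L^{-1} Kmn
    Qnn = tf.linalg.matmul(Lambda, Lambda, adjoint_a=True)             # Knm Kmm^{-1} Kmn

    cov = Qnn + noise_variance * tf.eye(n, dtype=dtype)                # Qnn + sigma^2 I
    Lcov = tf.linalg.cholesky(cov)
    alpha = tf.linalg.triangular_solve(Lcov, y, lower=True)            # Lcov^{-1} y

    n_float = tf.cast(n, dtype)
    log_marginal = -0.5 * (
        n_float * math.log(2.0 * math.pi)
        + 2.0 * tf.reduce_sum(tf.math.log(tf.linalg.diag_part(Lcov)))
        + tf.reduce_sum(tf.square(alpha))
    )                                                                  # log N(y | 0, Qnn + sigma^2 I)
    trace_term = tf.linalg.trace(Knn - Qnn) / (2.0 * noise_variance)
    return log_marginal - trace_term
```

Note that this version still builds and factorizes an $n \times n$ matrix; the whole point of the reformulation in Appendix VII is to avoid exactly that.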
Non-Gaussian Likelihoods
Recall from earlier that the ELBO is written as
$$
\mathrm{ELBO} = \mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y} \mid \mathbf{f})] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})]
= \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i \mid f_i)] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})],
$$
where the second equality holds whenever the likelihood factorizes across observations.
Now, the second term in the ELBO is the KL divergence between two multivariate Gaussians, $q(\mathbf{u})$ and $p(\mathbf{u})$, which is available in closed form. The first term, the expected log-likelihood, generally has no closed form for non-Gaussian likelihoods and must be approximated, for example by Gauss-Hermite quadrature or Monte Carlo sampling; each expectation is only one-dimensional, since it involves just the marginal $q(f_i)$.
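As an illustration of the first term (a sketch I am adding here, not part of the original derivation), the following estimates the expected log-likelihood by Monte Carlo for a Bernoulli likelihood with a sigmoid link, given the marginal means `mu` and variances `var` of $q(f_i)$:

```python
import tensorflow as tf


def expected_log_lik_mc(y, mu, var, num_samples=64):
    """Monte Carlo estimate of sum_i E_{q(f_i)}[log p(y_i | f_i)] for a Bernoulli
    likelihood with sigmoid link. y (0/1 labels), mu, var: float tensors of shape (n,)."""
    n = tf.shape(mu)[0]
    eps = tf.random.normal(tf.stack([num_samples, n]), dtype=mu.dtype)  # (S, n) standard normals
    f = mu + tf.sqrt(var) * eps                                         # samples from each marginal q(f_i)
    log_lik = -tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.broadcast_to(y, tf.shape(f)), logits=f)               # log p(y_i | f_i) per sample
    return tf.reduce_sum(tf.reduce_mean(log_lik, axis=0))               # average over samples, sum over data
```

Swapping in a different likelihood only changes the per-sample log-density; the sampling of $f_i \sim q(f_i)$ stays the same.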
Large-Scale Data with Stochastic Optimization
Because the expected log-likelihood term of the ELBO decomposes as a sum over data points, it admits an unbiased estimate computed on a random mini-batch of the data (rescaled by $N / |\mathcal{B}|$). This makes the bound amenable to stochastic gradient-based optimization and is what allows SVGP to scale to very large datasets (Hensman et al., 2013). A sketch of such a training loop is given below.
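A hypothetical mini-batch training loop might look as follows. Here `X`, `y`, `N`, `params`, and `svgp_elbo_minibatch` are placeholders (the full dataset, the number of training points, the trainable kernel and variational variables, and an ELBO estimator that rescales the batch expected log-likelihood by $N / |\mathcal{B}|$ and subtracts the KL term); none of them are defined in this post.

```python
import tensorflow as tf

# X, y: full training set (N examples); params: trainable kernel and variational
# variables; svgp_elbo_minibatch: assumed to return an unbiased ELBO estimate for
# a mini-batch. All of these are placeholders for illustration only.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
dataset = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(N).batch(256).repeat()

for step, (X_batch, y_batch) in enumerate(dataset.take(10_000)):
    with tf.GradientTape() as tape:
        loss = -svgp_elbo_minibatch(X_batch, y_batch)   # negative ELBO estimate
    grads = tape.gradient(loss, params)
    optimizer.apply_gradients(zip(grads, params))
    if step % 1_000 == 0:
        tf.print("step", step, "negative ELBO", loss)
```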
Links and Further Readings
- Papers:
  - Forerunners: Deterministic Training Conditional (DTC; Csató & Opper, 2002; Seeger, 2003); Fully Independent Training Conditional (FITC; Snelson & Ghahramani, 2005; Quinonero-Candela & Rasmussen, 2005)
  - Inter-domain Gaussian processes: Lázaro-Gredilla & Figueiras-Vidal, 2009
  - Deep Gaussian processes: Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017
  - Non-Gaussian likelihoods: Hensman et al., 2013; Dezfouli & Bonilla, 2015
  - Unifying inducing-/pseudo-point approximations: Bui et al., 2017
  - Orthogonal decompositions: Salimbeni et al., 2018; Shi et al., 2020
  - Convergence analysis: Burt et al., 2019
  - Efficient sampling: Wilson et al., 2020
- Technical Reports:
- Notes:
- Blog posts:
  - "Sparse GPs: approximate the posterior, not the model" by J. Hensman
Cite as:
@article{tiao2020svgp,
title = "{A} {H}andbook for {S}parse {V}ariational {G}aussian {P}rocesses",
author = "Tiao, Louis C",
journal = "tiao.io",
year = "2020",
url = "https://tiao.io/post/sparse-variational-gaussian-processes/"
}
To receive updates on more posts like this, follow me on Twitter and GitHub!
Appendix
I. Whitened parameterization
Recall the definition of the whitened inducing variables, $\mathbf{v} = \mathbf{L}^{-1}\mathbf{u}$ with $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_{mm}$, so that the prior on $\mathbf{v}$ is simply $p(\mathbf{v}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $q(\mathbf{u})$ is recovered from $q(\mathbf{v}) = \mathcal{N}(\mathbf{b}, \mathbf{W}\mathbf{W}^\top)$ through the deterministic map $\mathbf{u} = \mathbf{L}\mathbf{v}$.
II. SVGP Implementation Details
Single input index point
Here is an efficient and numerically stable way to compute the variational predictive mean $\mu(\mathbf{x})$ and variance $\Sigma(\mathbf{x}, \mathbf{x})$ at a single input index point $\mathbf{x}$:

1. Cholesky decomposition: $\mathbf{L} = \operatorname{cholesky}(\mathbf{K}_{mm})$, so that $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_{mm}$.
   Note: $\mathcal{O}(M^3)$ complexity.
2. Solve the system of linear equations $\mathbf{L}\boldsymbol{\lambda} = \mathbf{k}_{\mathbf{u}}(\mathbf{x})$ for $\boldsymbol{\lambda} = \mathbf{L}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x})$.
   Note: $\mathcal{O}(M^2)$ complexity since $\mathbf{L}$ is lower triangular; $\mathbf{L}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x})$ denotes the vector $\boldsymbol{\lambda}$ such that $\mathbf{L}\boldsymbol{\lambda} = \mathbf{k}_{\mathbf{u}}(\mathbf{x})$. Hence, $\boldsymbol{\lambda}^\top\boldsymbol{\lambda} = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top\mathbf{K}_{mm}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x})$.
3. For the whitened parameterization, set $\boldsymbol{\phi} = \boldsymbol{\lambda}$; otherwise, solve the system of linear equations $\mathbf{L}^\top\boldsymbol{\phi} = \boldsymbol{\lambda}$ for $\boldsymbol{\phi} = \mathbf{L}^{-\top}\boldsymbol{\lambda}$.
   Note: $\mathcal{O}(M^2)$ complexity since $\mathbf{L}^\top$ is upper triangular. Further, $\boldsymbol{\phi} = \mathbf{L}^{-\top}\mathbf{L}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x}) = \mathbf{K}_{mm}^{-1}\mathbf{k}_{\mathbf{u}}(\mathbf{x})$, since $\mathbf{K}_{mm}$ is symmetric and nonsingular.
4. Return $\mu(\mathbf{x}) = \boldsymbol{\phi}^\top\mathbf{b}$ and $\Sigma(\mathbf{x}, \mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \boldsymbol{\lambda}^\top\boldsymbol{\lambda} + \boldsymbol{\phi}^\top\mathbf{W}\mathbf{W}^\top\boldsymbol{\phi}$.
Multiple input index points
It is simple to extend this to compute the predictive mean and covariance jointly at multiple input index points $\mathbf{X}_* = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$:

1. Cholesky decomposition: $\mathbf{L} = \operatorname{cholesky}(\mathbf{K}_{mm})$.
   Note: $\mathcal{O}(M^3)$ complexity.
2. Solve the system of linear equations $\mathbf{L}\boldsymbol{\Lambda} = \mathbf{K}_{mn}$ for $\boldsymbol{\Lambda} = \mathbf{L}^{-1}\mathbf{K}_{mn}$.
   Note: $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}$ is lower triangular; $\mathbf{L}^{-1}\mathbf{K}_{mn}$ denotes the matrix $\boldsymbol{\Lambda}$ such that $\mathbf{L}\boldsymbol{\Lambda} = \mathbf{K}_{mn}$. Hence, $\boldsymbol{\Lambda}^\top\boldsymbol{\Lambda} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn} = \mathbf{Q}_{nn}$.
3. For the whitened parameterization, set $\boldsymbol{\Phi} = \boldsymbol{\Lambda}$; otherwise, solve the system of linear equations $\mathbf{L}^\top\boldsymbol{\Phi} = \boldsymbol{\Lambda}$ for $\boldsymbol{\Phi} = \mathbf{L}^{-\top}\boldsymbol{\Lambda}$.
   Note: $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}^\top$ is upper triangular. Further, $\boldsymbol{\Phi} = \mathbf{L}^{-\top}\mathbf{L}^{-1}\mathbf{K}_{mn} = \mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$, since $\mathbf{K}_{mm}$ is symmetric and nonsingular.
4. Return $\boldsymbol{\mu} = \boldsymbol{\Phi}^\top\mathbf{b}$ and $\boldsymbol{\Sigma} = \mathbf{K}_{nn} - \boldsymbol{\Lambda}^\top\boldsymbol{\Lambda} + \boldsymbol{\Phi}^\top\mathbf{W}\mathbf{W}^\top\boldsymbol{\Phi}$.
In TensorFlow, this looks something like:
import tensorflow as tf
def variational_predictive(Knn, Kmm, Kmn, W, b, whiten=True, jitter=1e-6):
    m = tf.shape(Kmm)[0]
    L = tf.linalg.cholesky(Kmm + jitter * tf.eye(m, dtype=Kmm.dtype))  # L L^T = Kmm + jitter I_m
    Lambda = tf.linalg.triangular_solve(L, Kmn, lower=True)            # Lambda = L^{-1} Kmn
    S = Knn - tf.linalg.matmul(Lambda, Lambda, adjoint_a=True)         # Knn - Lambda^T Lambda
    # Phi = Lambda for the whitened parameterization;
    # otherwise Phi = L^{-T} L^{-1} Kmn = Kmm^{-1} Kmn
    Phi = Lambda if whiten else tf.linalg.triangular_solve(L, Lambda, adjoint=True, lower=True)
    U = tf.linalg.matmul(Phi, W, adjoint_a=True)                       # U = V^T = Phi^T W
    mu = tf.linalg.matmul(Phi, b, adjoint_a=True)                      # Phi^T b
    Sigma = S + tf.linalg.matmul(U, U, adjoint_b=True)                 # S + U U^T = S + V^T V
    return mu, Sigma
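For example, a hypothetical invocation with random placeholder matrices (stand-ins of the right shapes, not actual kernel evaluations) would be:

```python
m, n = 16, 100
A = tf.random.normal([m, m])
Kmm = tf.matmul(A, A, adjoint_b=True) + tf.eye(m)      # placeholder SPD "kernel" matrix
Kmn = tf.random.normal([m, n])
Knn = tf.matmul(Kmn, Kmn, adjoint_a=True) + tf.eye(n)  # placeholder, consistent shapes only
W = 0.1 * tf.random.normal([m, m])
b = tf.random.normal([m, 1])

mu, Sigma = variational_predictive(Knn, Kmm, Kmn, W, b, whiten=True)
print(mu.shape, Sigma.shape)                           # (100, 1), (100, 100)
```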
III. Optimal variational distribution (in general)
Taking the functional derivative of the ELBO with respect to $q(\mathbf{u})$ and setting it to zero shows that the optimum takes the form
$$
q^{\star}(\mathbf{u}) \propto p(\mathbf{u})\,\exp\big(\mathbb{E}_{p(\mathbf{f} \mid \mathbf{u})}[\log p(\mathbf{y} \mid \mathbf{f})]\big).
$$
IV. Variational lower bound (partial) for Gaussian likelihoods
To carry out this derivation, we will need to recall the following two simple identities. First, we can write the inner product between two vectors as the trace of their outer product, $\mathbf{a}^\top\mathbf{b} = \operatorname{tr}(\mathbf{b}\mathbf{a}^\top)$. Second, the expectation of a quadratic form under a Gaussian is $\mathbb{E}_{\mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})}[\mathbf{f}^\top\mathbf{A}\mathbf{f}] = \operatorname{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}^\top\mathbf{A}\boldsymbol{\mu}$.
V. Optimal variational distribution for Gaussian likelihoods
Firstly, the optimal variational distribution can be found in closed form as
$$
q^{\star}(\mathbf{u}) = \mathcal{N}\big(\mathbf{u} \mid \sigma^{-2}\mathbf{K}_{mm}\boldsymbol{\Sigma}\mathbf{K}_{mn}\mathbf{y},\; \mathbf{K}_{mm}\boldsymbol{\Sigma}\mathbf{K}_{mm}\big),
\qquad
\boldsymbol{\Sigma} = \big(\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm}\big)^{-1}.
$$
VI. Variational lower bound (complete) for Gaussian likelihoods
We have, combining the results of the previous two appendices, the complete variational lower bound for Gaussian likelihoods,
$$
\mathrm{ELBO} = \log \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_{nn} + \sigma^2\mathbf{I}\big) - \frac{1}{2\sigma^2}\operatorname{tr}\big(\mathbf{K}_{nn} - \mathbf{Q}_{nn}\big).
$$
VII. SGPR Implementation Details
Here we provide implementation details that simultaneously minimize the computational demands and avoid numerically unstable calculations.
The difficulty in calculating the ELBO stems from the terms involving the inverse and the determinant of the $n \times n$ matrix $\mathbf{Q}_{nn} + \sigma^2\mathbf{I}$.
First, let’s tackle the inverse term.
Using the Woodbury identity, we can write it as
$$
\big(\mathbf{Q}_{nn} + \sigma^2\mathbf{I}\big)^{-1}
= \sigma^{-2}\mathbf{I} - \sigma^{-4}\mathbf{K}_{nm}\big(\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm}\big)^{-1}\mathbf{K}_{mn}.
$$
Recall that $\mathbf{Q}_{nn} = \boldsymbol{\Lambda}^\top\boldsymbol{\Lambda}$ with $\boldsymbol{\Lambda} = \mathbf{L}^{-1}\mathbf{K}_{mn}$, where $\mathbf{L}$ is the lower Cholesky factor of $\mathbf{K}_{mm}$. Writing $\mathbf{B} = \mathbf{I} + \sigma^{-2}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top$, we have $\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm} = \mathbf{L}\mathbf{B}\mathbf{L}^\top$, so the inverse above reduces to triangular solves against $\mathbf{L}$ and the Cholesky factor of the small $M \times M$ matrix $\mathbf{B}$.
Next, let’s address the determinant term.
To this end, first note that the determinant of $\mathbf{Q}_{nn} + \sigma^2\mathbf{I}$ satisfies, by the matrix determinant lemma,
$$
\det\big(\mathbf{Q}_{nn} + \sigma^2\mathbf{I}\big) = \sigma^{2n}\det(\mathbf{B}),
$$
so that $\log\det\big(\mathbf{Q}_{nn} + \sigma^2\mathbf{I}\big) = n\log\sigma^2 + 2\sum_j \log\,[\mathbf{L}_{\mathbf{B}}]_{jj}$, where $\mathbf{L}_{\mathbf{B}}$ is the lower Cholesky factor of $\mathbf{B}$.
The last non-trivial component of the ELBO is the trace term, which can be calculated as
$$
\operatorname{tr}\big(\mathbf{K}_{nn} - \mathbf{Q}_{nn}\big)
= \operatorname{tr}(\mathbf{K}_{nn}) - \operatorname{tr}\big(\boldsymbol{\Lambda}^\top\boldsymbol{\Lambda}\big)
= \sum_{i=1}^n k(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i,j}\Lambda_{ji}^2,
$$
which requires only the diagonal of $\mathbf{K}_{nn}$.
Finally, let us address the posterior predictive.
Recall that the predictive mean and variance involve $\boldsymbol{\Sigma} = \big(\mathbf{K}_{mm} + \sigma^{-2}\mathbf{K}_{mn}\mathbf{K}_{nm}\big)^{-1} = \mathbf{L}^{-\top}\mathbf{B}^{-1}\mathbf{L}^{-1}$, so both can be evaluated with triangular solves against $\mathbf{L}$ and $\mathbf{L}_{\mathbf{B}}$, at no additional cost beyond the quantities already computed for the ELBO.
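Putting the pieces above together, a minimal TensorFlow sketch of the numerically stable collapsed bound might look as follows. The function name and argument conventions are assumptions on my part; `Knn_diag` denotes the diagonal of $\mathbf{K}_{nn}$.

```python
import math

import tensorflow as tf


def sgpr_elbo_stable(Knn_diag, Kmm, Kmn, y, noise_variance, jitter=1e-6):
    """Collapsed SGPR bound via the m x m system B = I + Lambda Lambda^T / sigma^2.

    Knn_diag: (n,) diagonal of Knn; Kmm: (m, m); Kmn: (m, n); y: (n, 1) targets.
    """
    n = tf.shape(Kmn)[1]
    m = tf.shape(Kmm)[0]
    dtype = Kmm.dtype
    sigma2 = tf.cast(noise_variance, dtype)

    L = tf.linalg.cholesky(Kmm + jitter * tf.eye(m, dtype=dtype))      # L L^T = Kmm
    Lambda = tf.linalg.triangular_solve(L, Kmn, lower=True)            # L^{-1} Kmn
    B = tf.eye(m, dtype=dtype) + tf.linalg.matmul(Lambda, Lambda, adjoint_b=True) / sigma2
    LB = tf.linalg.cholesky(B)                                         # LB LB^T = B

    c = tf.linalg.triangular_solve(LB, tf.linalg.matmul(Lambda, y), lower=True) / sigma2
    # c = LB^{-1} Lambda y / sigma^2, used in the Woodbury quadratic form

    n_float = tf.cast(n, dtype)
    log_det = n_float * tf.math.log(sigma2) \
        + 2.0 * tf.reduce_sum(tf.math.log(tf.linalg.diag_part(LB)))    # log det(Qnn + sigma^2 I)
    quad = tf.reduce_sum(tf.square(y)) / sigma2 - tf.reduce_sum(tf.square(c))
    # y^T (Qnn + sigma^2 I)^{-1} y via the Woodbury identity
    log_marginal = -0.5 * (n_float * math.log(2.0 * math.pi) + log_det + quad)

    trace = (tf.reduce_sum(Knn_diag) - tf.reduce_sum(tf.square(Lambda))) / (2.0 * sigma2)
    return log_marginal - trace
```

In contrast to the naive version given earlier, nothing here requires building or factorizing an $n \times n$ matrix, so the cost is $\mathcal{O}(NM^2)$ time and $\mathcal{O}(NM)$ memory.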
Titsias, M. (2009, April). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574). ↩︎
Murray, I., & Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740). ↩︎
Hensman, J., Matthews, A. G., Filippone, M., & Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656). ↩︎
Csató, L., & Opper, M. (2002). Sparse On-line Gaussian Processes. Neural Computation, 14(3), 641-668. ↩︎
Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations (PhD Thesis). University of Edinburgh. ↩︎
Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian Processes using Pseudo-inputs. Advances in Neural Information Processing Systems, 18, 1257-1264. ↩︎
Quinonero-Candela, J., & Rasmussen, C. E. (2005). A Unifying View of Sparse Approximate Gaussian Process Regression. The Journal of Machine Learning Research, 6, 1939-1959. ↩︎
Lázaro-Gredilla, M., & Figueiras-Vidal, A. R. (2009, December). Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Advances in Neural Information Processing Systems. ↩︎
Damianou, A., & Lawrence, N. D. (2013, April). Deep Gaussian Processes. In Artificial Intelligence and Statistics (pp. 207-215). PMLR. ↩︎
Salimbeni, H., & Deisenroth, M. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. Advances in Neural Information Processing Systems, 30. ↩︎
Hensman, J., Fusi, N., & Lawrence, N. D. (2013, August). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290). ↩︎
Dezfouli, A., & Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422). ↩︎
Bui, T. D., Yan, J., & Turner, R. E. (2017). A Unifying Framework for Gaussian Process Pseudo-point Approximations using Power Expectation Propagation. The Journal of Machine Learning Research, 18(1), 3649-3720. ↩︎
Salimbeni, H., Cheng, C. A., Boots, B., & Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720). ↩︎
Shi, J., Titsias, M., & Mnih, A. (2020, June). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR. ↩︎
Burt, D., Rasmussen, C. E., & Van Der Wilk, M. (2019, May). Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (pp. 862-871). PMLR. ↩︎
Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., & Deisenroth, M. (2020, November). Efficiently Sampling Functions from Gaussian Process Posteriors. In International Conference on Machine Learning (pp. 10292-10302). PMLR. ↩︎