A Handbook for Sparse Variational Gaussian Processes

A summary of notation, identities and derivations for the sparse variational Gaussian process (SVGP) framework

In the sparse variational Gaussian process (SVGP) framework (Titsias, 2009) [1], one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with auxiliary variables $\mathbf{u}$ so that the joint becomes $p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f}, \mathbf{u})$. The vector $\mathbf{u} = [u(\mathbf{z}_1) \cdots u(\mathbf{z}_M)]^\top \in \mathbb{R}^M$ consists of inducing variables, the latent function values corresponding to the inducing input locations contained in the matrix $\mathbf{Z} = [\mathbf{z}_1 \cdots \mathbf{z}_M]^\top \in \mathbb{R}^{M \times D}$.

Prior

The joint distribution of the latent function values $\mathbf{f}$ and the inducing variables $\mathbf{u}$ according to the prior is
$$
p(\mathbf{f}, \mathbf{u}) = \mathcal{N}\left(
\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix};
\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix},
\begin{bmatrix} \mathbf{K}_{\mathbf{ff}} & \mathbf{K}_{\mathbf{fu}} \\ \mathbf{K}_{\mathbf{uf}} & \mathbf{K}_{\mathbf{uu}} \end{bmatrix}
\right).
$$
If we let the joint prior factorize as $p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) \, p(\mathbf{u})$, we can apply the rules of Gaussian conditioning to derive the marginal prior $p(\mathbf{u})$ and the conditional prior $p(\mathbf{f} \mid \mathbf{u})$.
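
For reference, the rule of Gaussian conditioning used here is the standard identity for jointly Gaussian random vectors:
$$
\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \boldsymbol{\mu}_{\mathbf{a}} \\ \boldsymbol{\mu}_{\mathbf{b}} \end{bmatrix},
\begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{C}^\top & \mathbf{B} \end{bmatrix}
\right)
\quad \Longrightarrow \quad
p(\mathbf{a} \mid \mathbf{b}) = \mathcal{N}\big(\mathbf{a} \mid \boldsymbol{\mu}_{\mathbf{a}} + \mathbf{C}\mathbf{B}^{-1}(\mathbf{b} - \boldsymbol{\mu}_{\mathbf{b}}), \; \mathbf{A} - \mathbf{C}\mathbf{B}^{-1}\mathbf{C}^\top\big).
$$
Applying it with $\mathbf{a} = \mathbf{f}$, $\mathbf{b} = \mathbf{u}$, $\mathbf{A} = \mathbf{K}_{\mathbf{ff}}$, $\mathbf{B} = \mathbf{K}_{\mathbf{uu}}$ and $\mathbf{C} = \mathbf{K}_{\mathbf{fu}}$ yields the marginal and conditional priors below.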

Marginal prior over inducing variables

The marginal prior over the inducing variables is simply given by $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{\mathbf{uu}})$.

Gaussian process notation

We can express the prior over the inducing variable $u(\mathbf{z})$ at inducing input $\mathbf{z}$ as $p(u(\mathbf{z})) = \mathcal{GP}(0, k_{\theta}(\mathbf{z}, \mathbf{z}'))$.

Conditional prior

First, let us define the vector-valued function $\psi_{\mathbf{u}}: \mathbb{R}^D \to \mathbb{R}^M$ as $\psi_{\mathbf{u}}(\mathbf{x}) \triangleq \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x})$, where $\mathbf{k}_{\mathbf{u}}(\mathbf{x}) = k_{\theta}(\mathbf{Z}, \mathbf{x})$ denotes the vector of covariances between $\mathbf{x}$ and the inducing inputs $\mathbf{Z}$. Further, let $\boldsymbol{\Psi} \in \mathbb{R}^{M \times N}$ be the matrix containing the values of $\psi_{\mathbf{u}}$ applied row-wise to the matrix of inputs $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_N]^\top \in \mathbb{R}^{N \times D}$,
$$
\boldsymbol{\Psi} \triangleq [\psi_{\mathbf{u}}(\mathbf{x}_1) \cdots \psi_{\mathbf{u}}(\mathbf{x}_N)] = \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uf}}.
$$
Then, we can condition the joint prior distribution on the inducing variables to give $p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}(\mathbf{f} \mid \mathbf{m}, \mathbf{S})$, where the mean vector and covariance matrix are
$$
\mathbf{m} = \boldsymbol{\Psi}^\top \mathbf{u}, \quad \text{and} \quad \mathbf{S} = \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi}.
$$
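
To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the original post) of the conditional prior moments, assuming the kernel matrices `Kff`, `Kuf`, `Kuu` have already been computed:

import numpy as np
from scipy.linalg import cho_factor, cho_solve


def conditional_prior_moments(Kff, Kuf, Kuu, u, jitter=1e-6):
    """Mean m = Psi^T u and covariance S = Kff - Psi^T Kuu Psi of p(f | u)."""
    M = Kuu.shape[0]
    chol = cho_factor(Kuu + jitter * np.eye(M), lower=True)
    Psi = cho_solve(chol, Kuf)       # Psi = Kuu^{-1} Kuf, shape (M, N)
    m = Psi.T @ u                    # conditional mean
    S = Kff - Psi.T @ Kuu @ Psi      # equals Kff - Kfu Kuu^{-1} Kuf
    return m, S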

Gaussian process notation

We can express the distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, given $\mathbf{u}$, that is, the conditional $p(f(\mathbf{x}) \mid \mathbf{u})$, as a Gaussian process: $p(f(\mathbf{x}) \mid \mathbf{u}) = \mathcal{GP}(m(\mathbf{x}), s(\mathbf{x}, \mathbf{x}'))$, with mean and covariance functions
$$
m(\mathbf{x}) = \psi_{\mathbf{u}}(\mathbf{x})^\top \mathbf{u}, \quad \text{and} \quad s(\mathbf{x}, \mathbf{x}') = k_{\theta}(\mathbf{x}, \mathbf{x}') - \psi_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}} \psi_{\mathbf{u}}(\mathbf{x}').
$$

Before moving on, we briefly highlight the important quantity $\mathbf{Q}_{\mathbf{ff}} \triangleq \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi}$, which is sometimes referred to as the Nyström approximation of $\mathbf{K}_{\mathbf{ff}}$. It can be written as $\mathbf{Q}_{\mathbf{ff}} = \mathbf{K}_{\mathbf{fu}} \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uf}}$.
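
In practice, $\mathbf{Q}_{\mathbf{ff}}$ is rarely formed via an explicit inverse; a sketch of the usual Cholesky-based computation (same assumptions as above) is:

import numpy as np
from scipy.linalg import solve_triangular


def nystrom_approximation(Kuf, Kuu, jitter=1e-6):
    """Qff = Kfu Kuu^{-1} Kuf via a triangular solve, without inverting Kuu."""
    L = np.linalg.cholesky(Kuu + jitter * np.eye(Kuu.shape[0]))
    Lam = solve_triangular(L, Kuf, lower=True)   # Lambda = L^{-1} Kuf
    return Lam.T @ Lam                           # Kfu L^{-T} L^{-1} Kuf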

Variational Distribution

We specify a joint variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})$ which factorizes as $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \triangleq p(\mathbf{f} \mid \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u})$. For convenience, let us specify a variational distribution that is also Gaussian, $q_{\boldsymbol{\phi}}(\mathbf{u}) \triangleq \mathcal{N}(\mathbf{u} \mid \mathbf{b}, \mathbf{W}\mathbf{W}^\top)$, with variational parameters $\boldsymbol{\phi} = \{\mathbf{W}, \mathbf{b}\}$. To obtain the corresponding marginal variational distribution over $\mathbf{f}$, we marginalize out the inducing variables $\mathbf{u}$, leading to
$$
q_{\boldsymbol{\phi}}(\mathbf{f}) = \int q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \, \mathrm{d}\mathbf{u} = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),
$$
where
$$
\boldsymbol{\mu} = \boldsymbol{\Psi}^\top \mathbf{b}, \quad \text{and} \quad \boldsymbol{\Sigma} = \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top (\mathbf{K}_{\mathbf{uu}} - \mathbf{W}\mathbf{W}^\top) \boldsymbol{\Psi}.
$$
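
As a minimal sketch (again my own illustration), the marginal moments follow directly from the quantities defined above, with `Psi` computed as in the previous section and free parameters `W`, `b`:

import numpy as np


def variational_marginal(Kff, Kuu, Psi, W, b):
    """mu = Psi^T b,  Sigma = Kff - Psi^T (Kuu - W W^T) Psi."""
    mu = Psi.T @ b
    Sigma = Kff - Psi.T @ (Kuu - W @ W.T) @ Psi
    return mu, Sigma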

Gaussian process notation

We can express the variational distribution over the function value $f(\mathbf{x})$ at input $\mathbf{x}$, that is, the marginal $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$, as a Gaussian process: $q_{\boldsymbol{\phi}}(f(\mathbf{x})) = \mathcal{GP}(\mu(\mathbf{x}), \sigma(\mathbf{x}, \mathbf{x}'))$, with mean and covariance functions
$$
\mu(\mathbf{x}) = \psi_{\mathbf{u}}(\mathbf{x})^\top \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = k_{\theta}(\mathbf{x}, \mathbf{x}') - \psi_{\mathbf{u}}(\mathbf{x})^\top (\mathbf{K}_{\mathbf{uu}} - \mathbf{W}\mathbf{W}^\top) \psi_{\mathbf{u}}(\mathbf{x}').
$$

Whitened parameterization

Whitening is a powerful trick for stabilizing the learning of variational parameters that works by reducing correlations in the variational distribution (Murray & Adams, 2010; Hensman et al., 2015) [2] [3]. Let $\mathbf{L}$ be the Cholesky factor of $\mathbf{K}_{\mathbf{uu}}$, i.e. the lower-triangular matrix such that $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_{\mathbf{uu}}$. Then, the whitened variational parameters are given by
$$
\mathbf{W}' \triangleq \mathbf{L}\mathbf{W}, \quad \text{and} \quad \mathbf{b}' \triangleq \mathbf{L}\mathbf{b},
$$
with free parameters $\{\mathbf{W}, \mathbf{b}\}$, so that $q_{\boldsymbol{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{L}\mathbf{b}, \mathbf{L}\mathbf{W}\mathbf{W}^\top\mathbf{L}^\top)$. This leads to mean and covariance
$$
\boldsymbol{\mu} = \boldsymbol{\Lambda}^\top \mathbf{b}, \quad \text{and} \quad \boldsymbol{\Sigma} = \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Lambda}^\top (\mathbf{I}_M - \mathbf{W}\mathbf{W}^\top) \boldsymbol{\Lambda},
$$
where $\boldsymbol{\Lambda} \triangleq \mathbf{L}^\top \boldsymbol{\Psi} = \mathbf{L}^{-1} \mathbf{K}_{\mathbf{uf}}$. Refer to Appendix I for derivations.
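
A corresponding sketch of the whitened marginal, assuming `Kuf` and the free parameters `W`, `b` are given:

import numpy as np
from scipy.linalg import solve_triangular


def whitened_variational_marginal(Kff, Kuu, Kuf, W, b, jitter=1e-6):
    """mu = Lambda^T b,  Sigma = Kff - Lambda^T (I_M - W W^T) Lambda."""
    M = Kuu.shape[0]
    L = np.linalg.cholesky(Kuu + jitter * np.eye(M))
    Lam = solve_triangular(L, Kuf, lower=True)   # Lambda = L^{-1} Kuf
    mu = Lam.T @ b
    Sigma = Kff - Lam.T @ (np.eye(M) - W @ W.T) @ Lam
    return mu, Sigma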

Gaussian process notation

The mean and covariance functions are now
$$
\mu(\mathbf{x}) = \lambda(\mathbf{x})^\top \mathbf{b}, \quad \text{and} \quad \sigma(\mathbf{x}, \mathbf{x}') = k_{\theta}(\mathbf{x}, \mathbf{x}') - \lambda(\mathbf{x})^\top (\mathbf{I}_M - \mathbf{W}\mathbf{W}^\top) \lambda(\mathbf{x}'),
$$
where $\lambda(\mathbf{x}) \triangleq \mathbf{L}^\top \psi_{\mathbf{u}}(\mathbf{x}) = \mathbf{L}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x})$.

For an efficient and numerically stable way to evaluate the variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f})$ at an arbitrary set of inputs, see Appendix II.

Inference

Preliminaries

We seek to approximate the exact posterior $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$ by a variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})$. To this end, we minimize the Kullback-Leibler (KL) divergence between $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})$ and $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$, which is given by
$$
\begin{aligned}
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})]
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})}\left[\log \frac{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})}{p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})}\right] \\
&= \log p(\mathbf{y}) + \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})}\left[\log \frac{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})}{p(\mathbf{f}, \mathbf{u}, \mathbf{y})}\right] \\
&= \log p(\mathbf{y}) - \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}),
\end{aligned}
$$
where we've defined the evidence lower bound (ELBO) as
$$
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})}\left[\log \frac{p(\mathbf{f}, \mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})}\right].
$$
Notice that minimizing the KL divergence above is equivalent to maximizing the ELBO. Furthermore, the ELBO is a lower bound on the log marginal likelihood, since
$$
\log p(\mathbf{y}) = \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) + \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})],
$$
and the KL divergence is nonnegative. Therefore, we have $\log p(\mathbf{y}) \geq \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z})$, with equality when $\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})] = 0$, i.e. when $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$.

Let us now focus our attention on the ELBO, which can be written as
$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z})
&= \iint \log \frac{p(\mathbf{f}, \mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u})} \, q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \, \mathrm{d}\mathbf{f} \, \mathrm{d}\mathbf{u} \\
&= \iint \log \frac{p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid \mathbf{u}) \, p(\mathbf{u})}{p(\mathbf{f} \mid \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u})} \, q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \, \mathrm{d}\mathbf{f} \, \mathrm{d}\mathbf{u} \\
&= \int \log \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})} \, q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u},
\end{aligned}
$$
where we have made use of the previous definition $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u})$ and also introduced the definition
$$
\Phi(\mathbf{y}, \mathbf{u}) \triangleq \exp\left(\int \log p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid \mathbf{u}) \, \mathrm{d}\mathbf{f}\right).
$$
It is straightforward to verify that the optimal variational distribution, that is, the distribution $q_{\boldsymbol{\phi}}(\mathbf{u})$ at which the ELBO is maximized, satisfies $q_{\boldsymbol{\phi}}(\mathbf{u}) \propto \Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})$. Refer to Appendix III for details. Specifically, after normalization, we have
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) = \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{\mathcal{Z}}, \quad \text{where} \quad \mathcal{Z} \triangleq \int \Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u}) \, \mathrm{d}\mathbf{u}.
$$
Plugging this back into the ELBO, we get
$$
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) = \int \log \left(\frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u}) / \mathcal{Z}}\right) q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u} = \log \mathcal{Z}.
$$

Gaussian Likelihoods – Sparse Gaussian Process Regression (SGPR)

Let us assume we have a Gaussian likelihood of the form $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \beta^{-1}\mathbf{I})$. Then it is straightforward to show that
$$
\log \Phi(\mathbf{y}, \mathbf{u}) = \log \mathcal{N}(\mathbf{y} \mid \mathbf{m}, \beta^{-1}\mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}),
$$
where $\mathbf{m}$ and $\mathbf{S}$ are defined as before, i.e. $\mathbf{m} = \boldsymbol{\Psi}^\top \mathbf{u}$ and $\mathbf{S} = \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi}$. Refer to Appendix IV for derivations.
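
A direct transcription of this expression into NumPy (a sketch for checking purposes only; it forms the dense covariance explicitly):

import numpy as np
from scipy.stats import multivariate_normal


def log_phi_gaussian(y, m, S, beta):
    """log Phi(y, u) = log N(y | m, beta^{-1} I) - (beta / 2) tr(S)."""
    N = y.shape[0]
    return (multivariate_normal.logpdf(y, mean=m, cov=np.eye(N) / beta)
            - 0.5 * beta * np.trace(S))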

Now, there are a few key objects of interest. First, the optimal variational distribution $q_{\boldsymbol{\phi}}(\mathbf{u})$, which is required to compute the predictive distribution $q_{\boldsymbol{\phi}}(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}$, but which may also be of independent interest. Second, the ELBO, the objective with respect to which the inducing input locations $\mathbf{Z}$ are optimized.

The optimal variational distribution is given by
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uf}} \mathbf{y}, \, \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uu}}), \quad \text{where} \quad \mathbf{M} \triangleq \mathbf{K}_{\mathbf{uu}} + \beta \mathbf{K}_{\mathbf{uf}} \mathbf{K}_{\mathbf{fu}}.
$$
This can be verified by reducing the product of two exponential-quadratic functions in $\Phi(\mathbf{y}, \mathbf{u})$ and $p(\mathbf{u})$ into a single exponential-quadratic function up to a constant factor, an operation also known as "completing the square". Refer to Appendix V for the complete derivation.
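
A straightforward (if not maximally efficient) sketch of these formulas, treating `Kuu`, `Kuf`, `y` and `beta` as given:

import numpy as np


def optimal_q_u(Kuu, Kuf, y, beta, jitter=1e-6):
    """q(u) = N(beta Kuu M^{-1} Kuf y, Kuu M^{-1} Kuu) with M = Kuu + beta Kuf Kfu."""
    M = Kuu + beta * Kuf @ Kuf.T + jitter * np.eye(Kuu.shape[0])
    mean = beta * Kuu @ np.linalg.solve(M, Kuf @ y)
    cov = Kuu @ np.linalg.solve(M, Kuu)
    return mean, cov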

This leads to the predictive distribution
$$
\begin{aligned}
q_{\boldsymbol{\phi}}(\mathbf{f})
&= \mathcal{N}(\beta \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi} \mathbf{y}, \, \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top (\mathbf{K}_{\mathbf{uu}} - \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uu}}) \boldsymbol{\Psi}) \\
&= \mathcal{N}(\beta \mathbf{K}_{\mathbf{fu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uf}} \mathbf{y}, \, \mathbf{K}_{\mathbf{ff}} - \mathbf{K}_{\mathbf{fu}} (\mathbf{K}_{\mathbf{uu}}^{-1} - \mathbf{M}^{-1}) \mathbf{K}_{\mathbf{uf}}).
\end{aligned}
$$

The ELBO is given by
$$
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) = \log \mathcal{Z} = \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
$$
This can be verified by applying simple rules for marginalizing Gaussians. Again, refer to Appendix VI for the complete derivation, and to Appendix VII for a numerically efficient and robust method for computing these quantities.
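
For small $N$, the collapsed bound can be sketched naively as follows (forming the dense $N \times N$ covariance directly; the efficient version is described in Appendix VII):

import numpy as np
from scipy.stats import multivariate_normal


def sgpr_elbo_naive(y, Kff, Kuf, Kuu, beta, jitter=1e-6):
    """ELBO = log N(y | 0, Qff + beta^{-1} I) - (beta / 2) tr(Kff - Qff)."""
    N = y.shape[0]
    Qff = Kuf.T @ np.linalg.solve(Kuu + jitter * np.eye(Kuu.shape[0]), Kuf)
    lml = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=Qff + np.eye(N) / beta)
    return lml - 0.5 * beta * np.trace(Kff - Qff)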

Non-Gaussian Likelihoods

Recall from earlier that the ELBO is written as
$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z})
&= \int \log \left(\frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}\right) q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u} \\
&= \int \left(\log \Phi(\mathbf{y}, \mathbf{u}) + \log \frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}\right) q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u} \\
&= \mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) - \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u}) \,\|\, p(\mathbf{u})],
\end{aligned}
$$
where we define $\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z})$, the expected log-likelihood (ELL), as $\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{u})}[\log \Phi(\mathbf{y}, \mathbf{u})]$. This constitutes the first term in the ELBO, and can be written as
$$
\begin{aligned}
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z})
&= \int \log \Phi(\mathbf{y}, \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u} \\
&= \int \left(\int \log p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid \mathbf{u}) \, \mathrm{d}\mathbf{f}\right) q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u} \\
&= \int \log p(\mathbf{y} \mid \mathbf{f}) \left(\int p(\mathbf{f} \mid \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}\right) \mathrm{d}\mathbf{f} \\
&= \int \log p(\mathbf{y} \mid \mathbf{f}) \, q_{\boldsymbol{\phi}}(\mathbf{f}) \, \mathrm{d}\mathbf{f} = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f})}[\log p(\mathbf{y} \mid \mathbf{f})].
\end{aligned}
$$
While this integral is analytically intractable in general, we can nonetheless approximate it efficiently using numerical integration techniques such as Monte Carlo (MC) estimation or quadrature rules. In particular, because $q_{\boldsymbol{\phi}}(\mathbf{f})$ is Gaussian, we can utilize simple yet effective rules such as Gauss-Hermite quadrature.
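
For example, a sketch of the expected log-likelihood via Gauss-Hermite quadrature, given the marginal means `mu` and variances `var` of $q_{\boldsymbol{\phi}}(f(\mathbf{x}_n))$ and a (hypothetical) elementwise `log_likelihood(y, f)` function:

import numpy as np


def expected_log_lik(y, mu, var, log_likelihood, n_points=20):
    """Approximate sum_n E_{q(f_n)}[log p(y_n | f_n)] by Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_points)            # nodes and weights
    # change of variables f = mu + sqrt(2 var) t turns the Gaussian expectation
    # into a weighted sum over the quadrature nodes
    f = mu[:, None] + np.sqrt(2.0 * var)[:, None] * t[None, :]  # shape (N, n_points)
    return np.sum(log_likelihood(y[:, None], f) * w[None, :]) / np.sqrt(np.pi)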

Now, the second term in the ELBO is the KL divergence between $q_{\boldsymbol{\phi}}(\mathbf{u})$ and $p(\mathbf{u})$, which are both multivariate Gaussians,
$$
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u}) \,\|\, p(\mathbf{u})] = \mathrm{KL}[\mathcal{N}(\mathbf{b}, \mathbf{W}\mathbf{W}^\top) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{K}_{\mathbf{uu}})],
$$
and therefore has a closed-form expression. In the case of the whitened parameterization, it simplifies to
$$
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u}) \,\|\, p(\mathbf{u})] = \mathrm{KL}[\mathcal{N}(\mathbf{L}\mathbf{b}, \mathbf{L}\mathbf{W}\mathbf{W}^\top\mathbf{L}^\top) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{K}_{\mathbf{uu}})] = \mathrm{KL}[\mathcal{N}(\mathbf{b}, \mathbf{W}\mathbf{W}^\top) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})].
$$
This comes from the fact that
$$
\mathrm{KL}[\mathcal{N}(\mathbf{A}\boldsymbol{\mu}_0, \mathbf{A}\boldsymbol{\Sigma}_0\mathbf{A}^\top) \,\|\, \mathcal{N}(\mathbf{A}\boldsymbol{\mu}_1, \mathbf{A}\boldsymbol{\Sigma}_1\mathbf{A}^\top)] = \mathrm{KL}[\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \,\|\, \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)],
$$
where we set $\boldsymbol{\mu}_0 = \mathbf{b}$, $\boldsymbol{\Sigma}_0 = \mathbf{W}\mathbf{W}^\top$, $\boldsymbol{\mu}_1 = \mathbf{0}$, $\boldsymbol{\Sigma}_1 = \mathbf{I}$, and $\mathbf{A} = \mathbf{L}$, where $\mathbf{L}$ is the Cholesky factor of $\mathbf{K}_{\mathbf{uu}}$, i.e. the lower-triangular matrix such that $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_{\mathbf{uu}}$.
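
In the whitened case, the closed-form KL reduces to a few array operations; a sketch assuming `W` is square and lower triangular (a common choice):

import numpy as np


def kl_whitened(b, W):
    """KL[N(b, W W^T) || N(0, I)] for square lower-triangular W."""
    M = b.shape[0]
    trace_term = np.sum(W ** 2)                              # tr(W W^T)
    logdet_term = 2.0 * np.sum(np.log(np.abs(np.diag(W))))   # log det(W W^T)
    return 0.5 * (trace_term + b @ b - M - logdet_term)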

Large-Scale Data with Stochastic Optimization

Coming soon.

Cite as:

@article{tiao2020svgp,
  title   = "{A} {H}andbook for {S}parse {V}ariational {G}aussian {P}rocesses",
  author  = "Tiao, Louis C",
  journal = "tiao.io",
  year    = "2020",
  url     = "https://tiao.io/post/sparse-variational-gaussian-processes/"
}

To receive updates on more posts like this, follow me on Twitter and GitHub!

Appendix

I

Whitened parameterization

Recall the definition $\boldsymbol{\Lambda} \triangleq \mathbf{L}^\top \boldsymbol{\Psi}$, and that under the whitened parameterization $\mathbf{b}' = \mathbf{L}\mathbf{b}$ and $\mathbf{W}' = \mathbf{L}\mathbf{W}$. Then, the mean simplifies to
$$
\boldsymbol{\mu} = \boldsymbol{\Psi}^\top \mathbf{b}' = \boldsymbol{\Psi}^\top (\mathbf{L}\mathbf{b}) = (\mathbf{L}^\top \boldsymbol{\Psi})^\top \mathbf{b} = \boldsymbol{\Lambda}^\top \mathbf{b}.
$$
Similarly, the covariance simplifies to
$$
\begin{aligned}
\boldsymbol{\Sigma}
&= \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top (\mathbf{K}_{\mathbf{uu}} - \mathbf{W}'\mathbf{W}'^\top) \boldsymbol{\Psi} \\
&= \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top (\mathbf{L}\mathbf{L}^\top - \mathbf{L}(\mathbf{W}\mathbf{W}^\top)\mathbf{L}^\top) \boldsymbol{\Psi} \\
&= \mathbf{K}_{\mathbf{ff}} - (\mathbf{L}^\top \boldsymbol{\Psi})^\top (\mathbf{I}_M - \mathbf{W}\mathbf{W}^\top) (\mathbf{L}^\top \boldsymbol{\Psi}) \\
&= \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Lambda}^\top (\mathbf{I}_M - \mathbf{W}\mathbf{W}^\top) \boldsymbol{\Lambda}.
\end{aligned}
$$

II

SVGP Implementation Details

Single input index point

Here is an efficient and numerically stable way to compute $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$ for an input $\mathbf{x}$. We take the following steps:

  1. Cholesky decomposition: $\mathbf{L} \leftarrow \mathrm{cholesky}(\mathbf{K}_{\mathbf{uu}})$

    Note: $\mathcal{O}(M^3)$ complexity.

  2. Solve the system of linear equations: $\lambda(\mathbf{x}) \leftarrow \mathbf{L} \backslash \mathbf{k}_{\mathbf{u}}(\mathbf{x})$

    Note: $\mathcal{O}(M^2)$ complexity since $\mathbf{L}$ is lower triangular; $\boldsymbol{\beta} = \mathbf{A} \backslash \mathbf{x}$ denotes the vector $\boldsymbol{\beta}$ such that $\mathbf{A}\boldsymbol{\beta} = \mathbf{x} \Leftrightarrow \boldsymbol{\beta} = \mathbf{A}^{-1}\mathbf{x}$. Hence, $\lambda(\mathbf{x}) = \mathbf{L}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x})$.

  3. $s(\mathbf{x}, \mathbf{x}) \leftarrow k_{\theta}(\mathbf{x}, \mathbf{x}) - \lambda(\mathbf{x})^\top \lambda(\mathbf{x})$

    Note: $\lambda(\mathbf{x})^\top \lambda(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x}) = \psi_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}} \psi_{\mathbf{u}}(\mathbf{x})$.

  4. For the whitened parameterization:

    1. $\mu(\mathbf{x}) \leftarrow \lambda(\mathbf{x})^\top \mathbf{b}$

    2. $\mathbf{v}(\mathbf{x})^\top \leftarrow \lambda(\mathbf{x})^\top \mathbf{W}$

      Note: $\mathbf{v}(\mathbf{x})^\top \mathbf{v}(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{L}^{-\top} (\mathbf{W}\mathbf{W}^\top) \mathbf{L}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x})$

    otherwise:

    1. Solve the system of linear equations: $\psi_{\mathbf{u}}(\mathbf{x}) \leftarrow \mathbf{L}^\top \backslash \lambda(\mathbf{x})$

      Note: $\mathcal{O}(M^2)$ complexity since $\mathbf{L}^\top$ is upper triangular. Further, $\psi_{\mathbf{u}}(\mathbf{x}) = \mathbf{L}^{-\top} \lambda(\mathbf{x}) = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x}) = \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x})$, and $\psi_{\mathbf{u}}(\mathbf{x})^\top = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}}^{-\top} = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}}^{-1}$, since $\mathbf{K}_{\mathbf{uu}}$ is symmetric and nonsingular.

    2. $\mu(\mathbf{x}) \leftarrow \psi_{\mathbf{u}}(\mathbf{x})^\top \mathbf{b}$

    3. $\mathbf{v}(\mathbf{x})^\top \leftarrow \psi_{\mathbf{u}}(\mathbf{x})^\top \mathbf{W}$

      Note: $\mathbf{v}(\mathbf{x})^\top \mathbf{v}(\mathbf{x}) = \mathbf{k}_{\mathbf{u}}(\mathbf{x})^\top \mathbf{K}_{\mathbf{uu}}^{-1} (\mathbf{W}\mathbf{W}^\top) \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{k}_{\mathbf{u}}(\mathbf{x})$

  5. $\sigma^2(\mathbf{x}) \leftarrow s(\mathbf{x}, \mathbf{x}) + \mathbf{v}(\mathbf{x})^\top \mathbf{v}(\mathbf{x})$

  6. Return $\mathcal{N}(f(\mathbf{x}); \mu(\mathbf{x}), \sigma^2(\mathbf{x}))$

Multiple input index points

It is simple to extend this to compute $q_{\boldsymbol{\phi}}(\mathbf{f})$ for an arbitrary number of index points $\mathbf{X}$:

  1. Cholesky decomposition: $\mathbf{L} \leftarrow \mathrm{cholesky}(\mathbf{K}_{\mathbf{uu}})$

    Note: $\mathcal{O}(M^3)$ complexity.

  2. Solve the system of linear equations: $\boldsymbol{\Lambda} \leftarrow \mathbf{L} \backslash \mathbf{K}_{\mathbf{uf}}$

    Note: $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}$ is lower triangular; $\mathbf{B} = \mathbf{A} \backslash \mathbf{X}$ denotes the matrix $\mathbf{B}$ such that $\mathbf{A}\mathbf{B} = \mathbf{X} \Leftrightarrow \mathbf{B} = \mathbf{A}^{-1}\mathbf{X}$. Hence, $\boldsymbol{\Lambda} = \mathbf{L}^{-1} \mathbf{K}_{\mathbf{uf}}$.

  3. $\mathbf{S} \leftarrow \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Lambda}^\top \boldsymbol{\Lambda}$

    Note: $\boldsymbol{\Lambda}^\top \boldsymbol{\Lambda} = \mathbf{K}_{\mathbf{fu}} \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{K}_{\mathbf{uf}} = \mathbf{K}_{\mathbf{fu}} \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uf}} = \mathbf{K}_{\mathbf{fu}} \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uu}} \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uf}} = \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi}$.

  4. For the whitened parameterization:

    1. $\boldsymbol{\mu} \leftarrow \boldsymbol{\Lambda}^\top \mathbf{b}$

    2. $\mathbf{V}^\top \leftarrow \boldsymbol{\Lambda}^\top \mathbf{W}$

      Note: $\mathbf{V}^\top \mathbf{V} = \mathbf{K}_{\mathbf{fu}} \mathbf{L}^{-\top} (\mathbf{W}\mathbf{W}^\top) \mathbf{L}^{-1} \mathbf{K}_{\mathbf{uf}}$.

    otherwise:

    1. Solve the system of linear equations: $\boldsymbol{\Psi} \leftarrow \mathbf{L}^\top \backslash \boldsymbol{\Lambda}$

      Note: $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}^\top$ is upper triangular. Further,

      $\boldsymbol{\Psi} = \mathbf{L}^{-\top} \boldsymbol{\Lambda} = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{K}_{\mathbf{uf}} = (\mathbf{L}\mathbf{L}^\top)^{-1} \mathbf{K}_{\mathbf{uf}} = \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uf}}$, and $\boldsymbol{\Psi}^\top = \mathbf{K}_{\mathbf{fu}} \mathbf{K}_{\mathbf{uu}}^{-\top} = \mathbf{K}_{\mathbf{fu}} \mathbf{K}_{\mathbf{uu}}^{-1}$, since $\mathbf{K}_{\mathbf{uu}}$ is symmetric and nonsingular.

    2. $\boldsymbol{\mu} \leftarrow \boldsymbol{\Psi}^\top \mathbf{b}$

    3. $\mathbf{V}^\top \leftarrow \boldsymbol{\Psi}^\top \mathbf{W}$

      Note: $\mathbf{V}^\top \mathbf{V} = \mathbf{K}_{\mathbf{fu}} \mathbf{K}_{\mathbf{uu}}^{-1} (\mathbf{W}\mathbf{W}^\top) \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{K}_{\mathbf{uf}}$.

  5. $\boldsymbol{\Sigma} \leftarrow \mathbf{S} + \mathbf{V}^\top \mathbf{V}$

  6. Return $\mathcal{N}(\mathbf{f}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$

In TensorFlow, this looks something like:

import tensorflow as tf


def variational_predictive(Knn, Kmm, Kmn, W, b, whiten=True, jitter=1e-6):

    M = tf.shape(Kmm)[-1]  # number of inducing points
    L = tf.linalg.cholesky(Kmm + jitter * tf.eye(M, dtype=Kmm.dtype))  # L L^T = Kmm + jitter I_M
    Lambda = tf.linalg.triangular_solve(L, Kmn, lower=True)  # Lambda = L^{-1} Kmn
    S = Knn - tf.linalg.matmul(Lambda, Lambda, adjoint_a=True)  # Knn - Lambda^T Lambda
    # Phi = L^{-T} L^{-1} Kmn = Kmm^{-1} Kmn (only needed in the unwhitened case)
    Phi = Lambda if whiten else tf.linalg.triangular_solve(L, Lambda, adjoint=True, lower=True)

    U = tf.linalg.matmul(Phi, W, adjoint_a=True)  # U = V^T = Phi^T W

    mu = tf.linalg.matmul(Phi, b, adjoint_a=True)  # mu = Phi^T b
    Sigma = S + tf.linalg.matmul(U, U, adjoint_b=True)  # Sigma = S + U U^T = S + V^T V

    return mu, Sigma
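
A quick usage sketch (my own, with a synthetic positive semi-definite matrix standing in for actual kernel evaluations):

import tensorflow as tf

N, M = 5, 3                                      # numbers of test and inducing points

A = tf.random.normal([M + N, M + N])
K = tf.linalg.matmul(A, A, adjoint_b=True)       # random positive semi-definite matrix
Kmm, Kmn, Knn = K[:M, :M], K[:M, M:], K[M:, M:]

W = tf.eye(M)                                    # variational scale (here: identity)
b = tf.zeros([M, 1])                             # variational mean

mu, Sigma = variational_predictive(Knn, Kmm, Kmn, W, b, whiten=True)
print(mu.shape, Sigma.shape)                     # (5, 1), (5, 5)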

III

Optimal variational distribution (in general)

Taking the functional derivative of the ELBO with respect to $q_{\boldsymbol{\phi}}(\mathbf{u})$, we get
$$
\begin{aligned}
\frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z})
&= \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left(\int \log \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})} \, q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}\right) \\
&= \int \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left(\log \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})} \, q_{\boldsymbol{\phi}}(\mathbf{u})\right) \mathrm{d}\mathbf{u} \\
&= \int \log \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})} \frac{\partial q_{\boldsymbol{\phi}}(\mathbf{u})}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} + q_{\boldsymbol{\phi}}(\mathbf{u}) \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left(\log \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}\right) \mathrm{d}\mathbf{u} \\
&= \int \log \frac{\Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})} + q_{\boldsymbol{\phi}}(\mathbf{u}) \left(-\frac{1}{q_{\boldsymbol{\phi}}(\mathbf{u})}\right) \mathrm{d}\mathbf{u} \\
&= \int \log \Phi(\mathbf{y}, \mathbf{u}) + \log p(\mathbf{u}) - \log q_{\boldsymbol{\phi}}(\mathbf{u}) - 1 \, \mathrm{d}\mathbf{u}.
\end{aligned}
$$
Setting this expression to zero, we have
$$
\log q_{\boldsymbol{\phi}}(\mathbf{u}) = \log \Phi(\mathbf{y}, \mathbf{u}) + \log p(\mathbf{u}) - 1 \quad \Leftrightarrow \quad q_{\boldsymbol{\phi}}(\mathbf{u}) \propto \Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u}).
$$

IV

Variational lower bound (partial) for Gaussian likelihoods

To carry out this derivation, we will need to recall the following two simple identities. First, we can write the inner product between two vectors as the trace of their outer product, $\mathbf{a}^\top \mathbf{b} = \mathrm{tr}(\mathbf{a}\mathbf{b}^\top)$. Second, the relationship between the autocorrelation matrix $\mathbb{E}[\mathbf{a}\mathbf{a}^\top]$ and the covariance matrix,
$$
\mathrm{Cov}[\mathbf{a}] = \mathbb{E}[\mathbf{a}\mathbf{a}^\top] - \mathbb{E}[\mathbf{a}]\mathbb{E}[\mathbf{a}]^\top \quad \Leftrightarrow \quad \mathbb{E}[\mathbf{a}\mathbf{a}^\top] = \mathrm{Cov}[\mathbf{a}] + \mathbb{E}[\mathbf{a}]\mathbb{E}[\mathbf{a}]^\top.
$$
These allow us to write
$$
\begin{aligned}
\log \Phi(\mathbf{y}, \mathbf{u})
&= \int \log \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \beta^{-1}\mathbf{I}) \, \mathcal{N}(\mathbf{f} \mid \mathbf{m}, \mathbf{S}) \, \mathrm{d}\mathbf{f} \\
&= -\frac{\beta}{2} \int (\mathbf{y} - \mathbf{f})^\top (\mathbf{y} - \mathbf{f}) \, \mathcal{N}(\mathbf{f} \mid \mathbf{m}, \mathbf{S}) \, \mathrm{d}\mathbf{f} - \frac{N}{2} \log (2\pi\beta^{-1}) \\
&= -\frac{\beta}{2} \int \mathrm{tr}(\mathbf{y}\mathbf{y}^\top - 2\mathbf{y}\mathbf{f}^\top + \mathbf{f}\mathbf{f}^\top) \, \mathcal{N}(\mathbf{f} \mid \mathbf{m}, \mathbf{S}) \, \mathrm{d}\mathbf{f} - \frac{N}{2} \log (2\pi\beta^{-1}) \\
&= -\frac{\beta}{2} \mathrm{tr}(\mathbf{y}\mathbf{y}^\top - 2\mathbf{y}\mathbf{m}^\top + \mathbf{S} + \mathbf{m}\mathbf{m}^\top) - \frac{N}{2} \log (2\pi\beta^{-1}) \\
&= -\frac{\beta}{2} (\mathbf{y} - \mathbf{m})^\top (\mathbf{y} - \mathbf{m}) - \frac{N}{2} \log (2\pi\beta^{-1}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\
&= \log \mathcal{N}(\mathbf{y} \mid \mathbf{m}, \beta^{-1}\mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{aligned}
$$

V

Optimal variational distribution for Gaussian likelihoods

Firstly, the optimal variational distribution can be found in closed form as
$$
\begin{aligned}
q_{\boldsymbol{\phi}}(\mathbf{u})
&\propto \Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u}) \\
&\propto \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^\top \mathbf{u}, \beta^{-1}\mathbf{I}) \, \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{\mathbf{uu}}) \\
&\propto \exp\left(-\frac{\beta}{2} (\mathbf{y} - \boldsymbol{\Psi}^\top \mathbf{u})^\top (\mathbf{y} - \boldsymbol{\Psi}^\top \mathbf{u}) - \frac{1}{2} \mathbf{u}^\top \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{u}\right) \\
&\propto \exp\left(-\frac{1}{2} \left(\mathbf{u}^\top \mathbf{C} \mathbf{u} - 2\beta (\boldsymbol{\Psi}\mathbf{y})^\top \mathbf{u}\right)\right),
\end{aligned}
$$
where $\mathbf{C} \triangleq \mathbf{K}_{\mathbf{uu}}^{-1} + \beta \boldsymbol{\Psi}\boldsymbol{\Psi}^\top = \mathbf{K}_{\mathbf{uu}}^{-1} (\mathbf{K}_{\mathbf{uu}} + \beta \mathbf{K}_{\mathbf{uf}}\mathbf{K}_{\mathbf{fu}}) \mathbf{K}_{\mathbf{uu}}^{-1}$. By completing the square, we get
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) \propto \exp\left(-\frac{1}{2} (\mathbf{u} - \beta\mathbf{C}^{-1}\boldsymbol{\Psi}\mathbf{y})^\top \mathbf{C} \, (\mathbf{u} - \beta\mathbf{C}^{-1}\boldsymbol{\Psi}\mathbf{y})\right) \propto \mathcal{N}(\mathbf{u} \mid \beta\mathbf{C}^{-1}\boldsymbol{\Psi}\mathbf{y}, \mathbf{C}^{-1}).
$$
We define $\mathbf{M} \triangleq \mathbf{K}_{\mathbf{uu}} + \beta \mathbf{K}_{\mathbf{uf}}\mathbf{K}_{\mathbf{fu}}$ so that $\mathbf{C} = \mathbf{K}_{\mathbf{uu}}^{-1} \mathbf{M} \mathbf{K}_{\mathbf{uu}}^{-1}$, which allows us to write
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uf}} \mathbf{y}, \, \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uu}}).
$$

VI

Variational lower bound (complete) for Gaussian likelihoods

We have
$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) = \log \mathcal{Z}
&= \log \int \Phi(\mathbf{y}, \mathbf{u}) \, p(\mathbf{u}) \, \mathrm{d}\mathbf{u} \\
&= \log \left[\exp\left(-\frac{\beta}{2} \mathrm{tr}(\mathbf{S})\right) \int \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^\top \mathbf{u}, \beta^{-1}\mathbf{I}) \, p(\mathbf{u}) \, \mathrm{d}\mathbf{u}\right] \\
&= \log \int \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^\top \mathbf{u}, \beta^{-1}\mathbf{I}) \, \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{\mathbf{uu}}) \, \mathrm{d}\mathbf{u} - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\
&= \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \beta^{-1}\mathbf{I} + \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\
&= \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{aligned}
$$

VII

SGPR Implementation Details

Here we provide implementation details that simultaneously minimize the computational demands and avoid numerically unstable calculations.

The difficulties in calculating the ELBO stem from the terms involving the inverse and the determinant of $\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I}$. More specifically, we have
$$
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) = -\frac{1}{2}\left(\log \det(\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I}) + \mathbf{y}^\top (\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I})^{-1} \mathbf{y} + N \log 2\pi\right) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
$$
It turns out that many of the required terms can be expressed in terms of the symmetric positive-definite matrix
$$
\mathbf{B} \triangleq \mathbf{U}\mathbf{U}^\top + \mathbf{I}, \quad \text{where} \quad \mathbf{U} \triangleq \beta^{\frac{1}{2}} \boldsymbol{\Lambda}.
$$

First, let's tackle the inverse term. Using the Woodbury identity, we can write it as
$$
(\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I})^{-1} = (\beta^{-1}\mathbf{I} + \boldsymbol{\Psi}^\top \mathbf{K}_{\mathbf{uu}} \boldsymbol{\Psi})^{-1} = \beta\mathbf{I} - \beta^2 \boldsymbol{\Psi}^\top (\mathbf{K}_{\mathbf{uu}}^{-1} + \beta\boldsymbol{\Psi}\boldsymbol{\Psi}^\top)^{-1} \boldsymbol{\Psi} = \beta(\mathbf{I} - \beta\boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi}).
$$

Recall that $\mathbf{C}^{-1} = \mathbf{K}_{\mathbf{uu}} \mathbf{M}^{-1} \mathbf{K}_{\mathbf{uu}}$. We can expand $\mathbf{M}$ as
$$
\mathbf{M} \triangleq \mathbf{K}_{\mathbf{uu}} + \beta\mathbf{K}_{\mathbf{uf}}\mathbf{K}_{\mathbf{fu}} = \mathbf{L}\mathbf{L}^\top + \beta\mathbf{L}\mathbf{L}^{-1}\mathbf{K}_{\mathbf{uf}}\mathbf{K}_{\mathbf{fu}}\mathbf{L}^{-\top}\mathbf{L}^\top = \mathbf{L}(\mathbf{I} + \beta\boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top)\mathbf{L}^\top = \mathbf{L}\mathbf{B}\mathbf{L}^\top,
$$
so its inverse is simply $\mathbf{M}^{-1} = \mathbf{L}^{-\top}\mathbf{B}^{-1}\mathbf{L}^{-1}$. Therefore, we have
$$
\mathbf{C}^{-1} = \mathbf{K}_{\mathbf{uu}}\mathbf{L}^{-\top}\mathbf{B}^{-1}\mathbf{L}^{-1}\mathbf{K}_{\mathbf{uu}} = \mathbf{L}\mathbf{B}^{-1}\mathbf{L}^\top = \mathbf{W}\mathbf{W}^\top, \quad \text{where} \quad \mathbf{W} \triangleq \mathbf{L}\mathbf{L}_{\mathbf{B}}^{-\top},
$$
and $\mathbf{L}_{\mathbf{B}}$ is the Cholesky factor of $\mathbf{B}$, i.e. the lower-triangular matrix such that $\mathbf{L}_{\mathbf{B}}\mathbf{L}_{\mathbf{B}}^\top = \mathbf{B}$. All in all, we now have
$$
(\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I})^{-1} = \beta(\mathbf{I} - \beta\boldsymbol{\Psi}^\top\mathbf{W}\mathbf{W}^\top\boldsymbol{\Psi}),
$$
so we can compute the quadratic term in $\mathbf{y}$ as
$$
\mathbf{y}^\top(\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I})^{-1}\mathbf{y} = \beta(\mathbf{y}^\top\mathbf{y} - \beta\mathbf{y}^\top\boldsymbol{\Psi}^\top\mathbf{W}\mathbf{W}^\top\boldsymbol{\Psi}\mathbf{y}) = \beta\mathbf{y}^\top\mathbf{y} - \mathbf{c}^\top\mathbf{c},
$$
where $\mathbf{c} \triangleq \beta\mathbf{W}^\top\boldsymbol{\Psi}\mathbf{y} = \beta\mathbf{L}_{\mathbf{B}}^{-1}\boldsymbol{\Lambda}\mathbf{y} = \beta^{\frac{1}{2}}\mathbf{L}_{\mathbf{B}}^{-1}\mathbf{U}\mathbf{y}$.

Next, let's address the determinant term. To this end, first note that the determinant of $\mathbf{M}$ is
$$
\det(\mathbf{M}) = \det(\mathbf{L}\mathbf{B}\mathbf{L}^\top) = \det(\mathbf{L})\det(\mathbf{B})\det(\mathbf{L}^\top) = \det(\mathbf{K}_{\mathbf{uu}})\det(\mathbf{B}).
$$
Hence, the determinant of $\mathbf{C}$ is
$$
\det(\mathbf{C}) = \det(\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{M}\mathbf{K}_{\mathbf{uu}}^{-1}) = \frac{\det(\mathbf{M})}{\det(\mathbf{K}_{\mathbf{uu}})^2} = \frac{\det(\mathbf{B})}{\det(\mathbf{K}_{\mathbf{uu}})}.
$$
Therefore, by the matrix determinant lemma, we have
$$
\det(\mathbf{Q}_{\mathbf{ff}} + \beta^{-1}\mathbf{I}) = \det(\beta^{-1}\mathbf{I} + \boldsymbol{\Psi}^\top\mathbf{K}_{\mathbf{uu}}\boldsymbol{\Psi}) = \det(\mathbf{K}_{\mathbf{uu}}^{-1} + \beta\boldsymbol{\Psi}\boldsymbol{\Psi}^\top)\det(\mathbf{K}_{\mathbf{uu}})\det(\beta^{-1}\mathbf{I}) = \det(\mathbf{C})\det(\mathbf{K}_{\mathbf{uu}})\det(\beta^{-1}\mathbf{I}) = \det(\mathbf{B})\det(\beta^{-1}\mathbf{I}).
$$
We can re-use $\mathbf{L}_{\mathbf{B}}$ to calculate $\det(\mathbf{B})$ in linear time, since $\det(\mathbf{B}) = \prod_{m=1}^{M} [\mathbf{L}_{\mathbf{B}}]_{mm}^2$.

The last non-trivial component of the ELBO is the trace term, which can be calculated as
$$
\frac{\beta}{2}\mathrm{tr}(\mathbf{S}) = \frac{\beta}{2}\mathrm{tr}(\mathbf{K}_{\mathbf{ff}}) - \frac{1}{2}\mathrm{tr}(\mathbf{U}\mathbf{U}^\top),
$$
since $\mathrm{tr}(\mathbf{U}\mathbf{U}^\top) = \mathrm{tr}(\mathbf{U}^\top\mathbf{U}) = \beta\,\mathrm{tr}(\boldsymbol{\Lambda}^\top\boldsymbol{\Lambda}) = \beta\,\mathrm{tr}(\boldsymbol{\Psi}^\top\mathbf{K}_{\mathbf{uu}}\boldsymbol{\Psi})$. Again, we can re-use the $\mathbf{U}\mathbf{U}^\top$ computed earlier.

Finally, let us address the posterior predictive. Recall that $q_{\boldsymbol{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta\mathbf{C}^{-1}\boldsymbol{\Psi}\mathbf{y}, \mathbf{C}^{-1})$. Re-writing this in terms of $\mathbf{W}$, we get
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta\mathbf{W}\mathbf{W}^\top\boldsymbol{\Psi}\mathbf{y}, \mathbf{W}\mathbf{W}^\top) = \mathcal{N}(\mathbf{u} \mid \beta\mathbf{L}\mathbf{L}_{\mathbf{B}}^{-\top}\mathbf{W}^\top\boldsymbol{\Psi}\mathbf{y}, \mathbf{L}\mathbf{L}_{\mathbf{B}}^{-\top}\mathbf{L}_{\mathbf{B}}^{-1}\mathbf{L}^\top) = \mathcal{N}(\mathbf{u} \mid \mathbf{L}(\mathbf{L}_{\mathbf{B}}^{-\top}\mathbf{c}), \mathbf{L}\mathbf{B}^{-1}\mathbf{L}^\top).
$$
Hence, we see that the optimal variational distribution is itself in whitened parameterization, with $\mathbf{b} = \mathbf{L}_{\mathbf{B}}^{-\top}\mathbf{c}$ and $\mathbf{W} = \mathbf{L}_{\mathbf{B}}^{-\top}$ (such that $\mathbf{W}\mathbf{W}^\top = \mathbf{B}^{-1}$). Combined with results from a previous section, we can directly write the predictive $q_{\boldsymbol{\phi}}(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u}) \, q_{\boldsymbol{\phi}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}$ as
$$
q_{\boldsymbol{\phi}}(\mathbf{f}) = \mathcal{N}(\boldsymbol{\Lambda}^\top\mathbf{L}_{\mathbf{B}}^{-\top}\mathbf{c}, \, \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Lambda}^\top(\mathbf{I} - \mathbf{B}^{-1})\boldsymbol{\Lambda}).
$$
Alternatively, we can derive this by noting the following simple identity,
$$
\boldsymbol{\Psi}^\top\mathbf{C}^{-1}\boldsymbol{\Psi} = \boldsymbol{\Psi}^\top\mathbf{L}\mathbf{B}^{-1}\mathbf{L}^\top\boldsymbol{\Psi} = \boldsymbol{\Lambda}^\top\mathbf{B}^{-1}\boldsymbol{\Lambda},
$$
and applying the rules for marginalizing Gaussians to obtain
$$
q_{\boldsymbol{\phi}}(\mathbf{f}) = \mathcal{N}(\beta\boldsymbol{\Psi}^\top\mathbf{C}^{-1}\boldsymbol{\Psi}\mathbf{y}, \, \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Psi}^\top\mathbf{K}_{\mathbf{uu}}\boldsymbol{\Psi} + \boldsymbol{\Psi}^\top\mathbf{C}^{-1}\boldsymbol{\Psi}) = \mathcal{N}(\beta\boldsymbol{\Lambda}^\top\mathbf{B}^{-1}\boldsymbol{\Lambda}\mathbf{y}, \, \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Lambda}^\top\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^\top\mathbf{B}^{-1}\boldsymbol{\Lambda}) = \mathcal{N}(\boldsymbol{\Lambda}^\top\mathbf{L}_{\mathbf{B}}^{-\top}\mathbf{c}, \, \mathbf{K}_{\mathbf{ff}} - \boldsymbol{\Lambda}^\top(\mathbf{I} - \mathbf{B}^{-1})\boldsymbol{\Lambda}).
$$
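
Putting the pieces together, a compact NumPy sketch of the efficient bound along the lines described above (my own rendering, not code from the original post; `Kff_diag` denotes the vector of prior variances $k_{\theta}(\mathbf{x}_n, \mathbf{x}_n)$):

import numpy as np
from scipy.linalg import cholesky, solve_triangular


def sgpr_elbo(y, Kff_diag, Kuf, Kuu, beta, jitter=1e-6):
    """Efficient SGPR ELBO using B = U U^T + I with U = sqrt(beta) L^{-1} Kuf."""
    M, N = Kuf.shape
    L = cholesky(Kuu + jitter * np.eye(M), lower=True)
    Lam = solve_triangular(L, Kuf, lower=True)                    # Lambda = L^{-1} Kuf
    U = np.sqrt(beta) * Lam
    B = U @ U.T + np.eye(M)
    LB = cholesky(B, lower=True)
    c = np.sqrt(beta) * solve_triangular(LB, U @ y, lower=True)   # c = sqrt(beta) LB^{-1} U y

    quad = beta * (y @ y) - c @ c                                 # y^T (Qff + beta^{-1} I)^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(LB))) - N * np.log(beta) # log det(Qff + beta^{-1} I)
    trace = 0.5 * beta * np.sum(Kff_diag) - 0.5 * np.sum(U ** 2)  # (beta / 2) tr(S)

    return -0.5 * (quad + logdet + N * np.log(2.0 * np.pi)) - trace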


  1. Titsias, M. (2009, April). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574).

  2. Murray, I., & Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740).

  3. Hensman, J., Matthews, A. G., Filippone, M., & Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656).

  4. Csató, L., & Opper, M. (2002). Sparse On-line Gaussian Processes. Neural Computation, 14(3), 641-668.

  5. Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations (PhD Thesis). University of Edinburgh.

  6. Snelson, E., & Ghahramani, Z. (2005). Sparse Gaussian Processes using Pseudo-inputs. Advances in Neural Information Processing Systems, 18, 1257-1264.

  7. Quinonero-Candela, J., & Rasmussen, C. E. (2005). A Unifying View of Sparse Approximate Gaussian Process Regression. The Journal of Machine Learning Research, 6, 1939-1959.

  8. Lázaro-Gredilla, M., & Figueiras-Vidal, A. R. (2009, December). Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Advances in Neural Information Processing Systems.

  9. Damianou, A., & Lawrence, N. D. (2013, April). Deep Gaussian Processes. In Artificial Intelligence and Statistics (pp. 207-215). PMLR.

  10. Salimbeni, H., & Deisenroth, M. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. Advances in Neural Information Processing Systems, 30.

  11. Hensman, J., Fusi, N., & Lawrence, N. D. (2013, August). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290).

  12. Dezfouli, A., & Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422).

  13. Bui, T. D., Yan, J., & Turner, R. E. (2017). A Unifying Framework for Gaussian Process Pseudo-point Approximations using Power Expectation Propagation. The Journal of Machine Learning Research, 18(1), 3649-3720.

  14. Salimbeni, H., Cheng, C. A., Boots, B., & Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720).

  15. Shi, J., Titsias, M., & Mnih, A. (2020, June). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR.

  16. Burt, D., Rasmussen, C. E., & Van Der Wilk, M. (2019, May). Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (pp. 862-871). PMLR.

  17. Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., & Deisenroth, M. (2020, November). Efficiently Sampling Functions from Gaussian Process Posteriors. In International Conference on Machine Learning (pp. 10292-10302). PMLR.
