<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gaussian Processes |</title><link>https://tiao.io/tags/gaussian-processes/</link><atom:link href="https://tiao.io/tags/gaussian-processes/index.xml" rel="self" type="application/rss+xml"/><description>Gaussian Processes</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://tiao.io/media/icon_hu_9c2a75fde2335590.png</url><title>Gaussian Processes</title><link>https://tiao.io/tags/gaussian-processes/</link></image><item><title>📄 One paper accepted to ICML 2026</title><link>https://tiao.io/posts/one-paper-accepted-to-icml2026/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/one-paper-accepted-to-icml2026/</guid><description>&lt;p&gt;Our paper
was accepted to ICML 2026. This is
joint work with Jihao Andreas Lin and Sebastian Ament (co-first authors), and
David Eriksson, Maximilian Balandat, and Eytan Bakshy.&lt;/p&gt;</description></item><item><title>Empirical Gaussian Processes</title><link>https://tiao.io/publications/empirical-gaussian-processes/</link><pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/empirical-gaussian-processes/</guid><description/></item><item><title>🎓 PhD thesis completed</title><link>https://tiao.io/posts/phd-thesis-completed/</link><pubDate>Fri, 15 Dec 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/phd-thesis-completed/</guid><description>&lt;p&gt;Submitted my PhD thesis, &lt;em&gt;Probabilistic Machine Learning in the Age of Deep
Learning: New Perspectives for Gaussian Processes, Bayesian Optimization and
Beyond&lt;/em&gt;, at the University of Sydney. Supervised by Fabio Ramos and Edwin
Bonilla. The full text and chapter PDFs are available.&lt;/p&gt;</description></item><item><title>Probabilistic Machine Learning in the Age of Deep Learning: New Perspectives for Gaussian Processes, Bayesian Optimization and Beyond (PhD Thesis)</title><link>https://tiao.io/publications/phd-thesis/</link><pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/phd-thesis/</guid><description>&lt;p&gt;The full text is available as a single PDF file &lt;a href="phd-thesis-louis-tiao.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can also find a list of contents and PDFs corresponding to each individual chapter below:&lt;/p&gt;
&lt;h3 id="table-of-contents"&gt;Table of Contents&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chapter 1: Introduction &lt;a href="contents/1 Introduction.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 2: Background &lt;a href="contents/2 Background.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 3: Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features &lt;a href="contents/3 Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 4: Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation &lt;a href="contents/4 Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 5: Bayesian Optimisation by Classification with Deep Learning and Beyond &lt;a href="contents/5 Bayesian Optimisation by Classification with Deep Learning and Beyond.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 6: Conclusion &lt;a href="contents/6 Conclusion.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Appendix A: Numerical Methods for Improved Decoupled Sampling of Gaussian Processes &lt;a href="contents/A Numerical Methods for Improved Decoupled Sampling of Gaussian Processes.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bibliography &lt;a href="contents/Bibliography.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Please find &lt;em&gt;Chapter 1: Introduction&lt;/em&gt; reproduced in full below:&lt;/p&gt;
&lt;h3 id="introduction"&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Artificial intelligence (AI) stands poised to be among the most disruptive technologies of our era. The breakneck pace of recent AI advancements has been spearheaded by machine learning (ML), particularly the resurgence of &lt;em&gt;deep learning&lt;/em&gt;. Deep learning is as old as the first general-purpose electronic computer, with roots tracing back to the 1940s and ’50s. Its revival, beginning in the early 2010s, was catalysed by a series of breakthroughs that shattered previously perceived limitations and captivated the collective imagination. These breakthroughs span various domains, including computer vision, speech recognition, natural language processing, protein folding, generative art and artificial creativity, as well as reinforcement learning for robotics control and achieving superhuman-level gameplay.&lt;/p&gt;
&lt;p&gt;Nevertheless, it is crucial to view these developments as means to an ultimate end rather than an end in themselves. Arguably, the true pinnacle of AI’s capabilities lies in optimal &lt;em&gt;decision-making&lt;/em&gt;, whether that entails offering analyses and insights to aid humans in making better decisions or completely automating the decision-making process altogether. Practically any task directed towards a well-defined objective can be boiled down to a cascade of decisions. At a fundamental level, operating a vehicle involves a continuous stream of decisions involving accelerating, braking, and turning. Financial trading revolves around decisions to buy, sell, or hold various assets. Even complex engineering tasks, such as designing an aerofoil, involve a sequence of decisions about adjusting design variables to achieve desirable aerodynamic characteristics.&lt;/p&gt;
&lt;p&gt;Yet, the intricacies of decision-making surpass what any single advancement in deep learning can address. While convolutional neural networks (CNNs) can facilitate object detection tasks in autonomous vehicles, recurrent neural networks (RNNs) can aid in forecasting market dynamics for systematic trading, and physics-informed NNs can assist in predicting aerodynamic effects, it remains the case that no target or quantity of interest can be entirely known or predictable (indeed, if they were, the pursuit of predictive modelling and ML would be superfluous). Instead, predictions often prove unreliable, or at best, &lt;em&gt;uncertain&lt;/em&gt;, due to the limitations of our knowledge and the complexity and variability inherent in the underlying real-world processes. The impressive power of deep learning models often overshadows their ignorance of the limits of their own knowledge and the extent of uncertainty in their predictions. When these predictions are integrated into a sequential decision-making framework, such uncertainty can amplify, compound, and lead to catastrophic consequences. In the context of aeronautical engineering, this could result in inefficient designs; in quantitative finance, it can lead to devastating capital losses; and in autonomous driving, it can even cost lives.&lt;/p&gt;
&lt;h4 id="probabilistic-machine-learning"&gt;Probabilistic Machine Learning&lt;/h4&gt;
&lt;p&gt;Grounded in the laws of probability and Bayesian statistics, &lt;em&gt;probabilistic&lt;/em&gt; ML provides a consistent framework for systematically reasoning about the unknown. The probabilistic approach to ML acknowledges that the real world is fraught with uncertainty and embraces this uncertainty as an inherent part of decision-making. Unlike traditional methods, including those of deep learning, it recognises model predictions not as absolute truths that can be represented as single &lt;em&gt;point estimates&lt;/em&gt; produced from a deterministic mapping, but as full &lt;em&gt;probability distributions&lt;/em&gt; that capture the potential outcomes of a random variable as it propagates through some underlying data-generating process. In a &lt;em&gt;probabilistic model&lt;/em&gt;, all quantities are treated as random variables governed by probability distributions – the data are observed variables, which are influenced by some underlying hidden variables, e.g., the model parameters. A prior distribution expresses reasonable values for these hidden variables and eliminates implausible ones. The relationship between observed and hidden variables is described by the likelihood, and the process of Bayesian inference amounts to calculating, using the basic laws of probability, a posterior distribution over the hidden variables conditioned on the observed data, which can be seen as a refinement of the prior beliefs in light of new evidence. While the posterior distribution can be useful in and of itself, its primary role lies in facilitating subsequent prediction and decision-making by providing full probability distributions over predicted outcomes. This capability allows the decision-maker to assess the range of possible scenarios and their associated probabilities, enabling a more nuanced understanding of uncertainty and risk, which is indispensable in complex, dynamic environments where the repercussions of incorrect decisions can be severe. In essence, probabilistic ML equips autonomous decision-making systems with a probabilistic worldview, enabling them to navigate ambiguity and make sound decisions in the face of imperfect information.&lt;/p&gt;
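&lt;p&gt;The prior-likelihood-posterior workflow just described can be made concrete with a minimal conjugate example, sketched below in Python (the Beta-Bernoulli model and all numbers here are illustrative choices of ours, not taken from the thesis):&lt;/p&gt;

```python
import numpy as np

# Beta(2, 2) prior over an unknown Bernoulli success probability:
# symmetric around 0.5, mildly favouring moderate values.
alpha_prior, beta_prior = 2.0, 2.0

# Observed data: 7 successes and 3 failures.
data = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
successes = int(data.sum())
failures = int(data.size - successes)

# Conjugate Bayesian update: the posterior is again a Beta distribution,
# with the counts simply added to the prior pseudo-counts.
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

# The posterior mean refines the prior mean (0.5) in light of the evidence.
posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)  # 9/14, roughly 0.643
```

The posterior here is a full distribution, not a point estimate: its spread quantifies the remaining uncertainty about the hidden variable after observing the data.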
&lt;h4 id="probabilistic-ml-vs-deep-learning"&gt;Probabilistic ML vs. Deep Learning&lt;/h4&gt;
&lt;p&gt;While deep learning has dominated recent AI advances, probabilistic ML remains as important as ever and continues to offer valuable tools for addressing AI challenges that cannot be fully resolved by deep learning alone. Although both approaches can be combined to create hybrid methods that leverage their respective strengths, some defining characteristics have traditionally set deep learning apart from probabilistic ML. Perhaps most notably, probabilistic ML approaches can achieve remarkable predictive performance even when data is scarce. In contrast, deep learning models tend to be data-intensive by nature, often demanding datasets of a scale proportional to their size (i.e., their parameter count), which has seen explosive growth in recent years. That said, inference in many probabilistic models poses computational problems that are difficult to scale. On the other hand, deep learning approaches have excelled in scalability, a key factor contributing to their widespread success. This scalability is bolstered by their compatibility with various speed-enhancing mechanisms such as stochastic optimisation, specialised hardware accelerators (GPUs and TPUs), as well as distributed and/or cloud-based computing infrastructure. To bridge this gap, substantial research effort has been devoted to enabling probabilistic ML to benefit from these advantages through optimisation-based approximations to Bayesian inference.&lt;/p&gt;
&lt;p&gt;Moreover, as mentioned earlier, these paradigms are by no means mutually exclusive. Indeed, it is often possible to directly extend existing models with a Bayesian treatment of their parameters, adding a layer of probabilistic reasoning to the model and allowing it not only to make predictions but also to estimate the uncertainty associated with those predictions. An excellent example is the Bayesian neural network (BNN), which treats the weights as hidden variables and leverages posterior inference to provide predictions while estimating associated uncertainties, delivering a more robust and principled approach to deep learning.&lt;/p&gt;
&lt;p&gt;The Bayesian formalism naturally gives rise to many popular methods and paradigms, often in the form of point estimates or other kinds of approximations. The quintessential example is found in linear regression, in particular, in ridge and lasso regression, which correspond variously to maximum &lt;em&gt;a posteriori&lt;/em&gt; (MAP) estimates in Bayesian linear regression (BLR) models with prior distributions possessing different sparsity-inducing characteristics. More broadly, mitigations against over-fitting tend to arise organically in Bayesian methods, which is why they are frequently characterised as being fundamentally more robust against over-fitting. Likewise, the once &lt;em&gt;à la mode&lt;/em&gt; support vector machines (SVMs) can be seen as MAP estimates for a class of nonparametric Bayesian models, dropout in NNs can be seen as a variational approximation to exact inference in BNNs, and unsupervised learning methods such as factor analysis (FA) and principal component analysis (PCA) are instances of a class of latent variable models (LVMs) known as linear-Gaussian factor models, to name just a few examples. Time and again, classical approaches have not only benefitted from being viewed through the Bayesian perspective but have also been enriched and redefined by the depth of insights this framework provides.&lt;/p&gt;
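&lt;p&gt;The ridge-as-MAP correspondence mentioned above can be checked numerically in a few lines (a sketch of ours, assuming a zero-mean isotropic Gaussian prior and Gaussian observation noise; the data and variable names are purely illustrative):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

noise_var, prior_var = 0.1**2, 1.0**2
lam = noise_var / prior_var  # the ridge penalty implied by the Gaussian prior

# Ridge regression: w = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP estimate in Bayesian linear regression with prior N(0, prior_var I)
# and noise N(0, noise_var): the posterior is Gaussian, so its mean is its mode.
w_map = np.linalg.solve(X.T @ X / noise_var + np.eye(3) / prior_var,
                        X.T @ y / noise_var)

print(np.allclose(w_ridge, w_map))  # True
```

Multiplying the MAP normal equations through by the noise variance recovers the ridge system exactly, which is why the two solutions coincide.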
&lt;h3 id="thesis-goals"&gt;Thesis Goals&lt;/h3&gt;
&lt;p&gt;The overarching goal of this thesis is to continue advancing the integration and cross-pollination between deep learning and probabilistic ML. We aim to further the interplay between these two fields, both by incorporating probabilistic interpretations and uncertainty quantification into popular deep learning frameworks, and by leveraging the representational power of deep NNs to improve established Bayesian methods. This dual-pronged approach provides fresh perspectives and taps the complementary strengths of both paradigms, advancing the foundations of AI and facilitating the development of more capable and dependable decision support frameworks. Ultimately, we strive to unlock the potential of deep learning within high-impact probabilistic ML methodologies, and to offer useful Bayesian perspectives on current deep learning techniques.&lt;/p&gt;
&lt;h4 id="gaussian-process-models"&gt;Gaussian Process Models&lt;/h4&gt;
&lt;p&gt;Arguably, no family of probabilistic models embodies the ethos of probabilistic ML and illustrates its nuances and parallels with deep learning quite like the Gaussian process (GP). Accordingly, GPs shall occupy a prominent place in our thesis. In particular, GPs stand out as the ideal choice when dealing with limited data, offer the flexibility to encode prior beliefs through the covariance function, and provide predictive uncertainty estimates with a calibration that is second to none. Conversely, they are challenging to scale to large datasets, a limitation that has spurred extensive research and development efforts. Furthermore, in contrast to deep learning models, which are often lauded for their ability to automatically uncover valuable patterns and features in data, GPs have at times been dismissed as unsophisticated smoothing mechanisms. Despite these apparent disparities, GPs are intricately connected to NNs in numerous ways. Among these, one of the most classical and well-known relationships is the convergence of single-layer NNs with randomly initialised weights toward GPs in the infinite-width limit. Similar links have also been identified between GPs and infinitely wide &lt;em&gt;deep&lt;/em&gt; NNs.&lt;/p&gt;
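&lt;p&gt;As a concrete illustration of the calibrated uncertainty discussed above, the posterior mean and variance of GP regression under a squared-exponential covariance function can be computed in closed form (a self-contained sketch; the kernel hyperparameters, noise level, and data are arbitrary illustrative choices of ours):&lt;/p&gt;

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) covariance function between two point sets.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# Noisy training observations of a smooth function.
X_train = np.array([[-2.0], [-1.0], [0.5], [2.0]])
y_train = np.sin(X_train).ravel()
noise = 1e-2

# GP posterior at test points: mean = K_*^T (K + noise I)^{-1} y,
# cov = K_** - K_*^T (K + noise I)^{-1} K_*.
X_test = np.linspace(-3.0, 3.0, 7)[:, None]
K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
K_star = rbf_kernel(X_train, X_test)
alpha = np.linalg.solve(K, y_train)
mean = K_star.T @ alpha
cov = rbf_kernel(X_test, X_test) - K_star.T @ np.linalg.solve(K, K_star)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

# The predictive standard deviation shrinks near the data
# and grows away from it.
print(mean.shape, std.shape)  # (7,) (7,)
```

Every prediction comes with its own variance, which is precisely the quantity that downstream decision-making frameworks such as Bayesian optimisation exploit.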
&lt;p&gt;In an effort to elevate the representational capabilities of GPs to a level comparable with deep NNs, deep Gaussian processes (DGPs) stack together multiple layers of GPs. Additional efforts to construct efficient sparse GP approximations have leveraged the advantageous properties of computations on the hypersphere, which has led to DGP models in which the propagation of posterior predictive means is equivalent to a forward pass through a deep NN. Notably, as a side effect, this model effectively provides uncertainty estimates for deep NNs through its predictive variance. Among the contributions of our thesis is the further development of this framework, integrating cutting-edge techniques to address some of its practical limitations, thereby narrowing the performance gap between GPs and deep NNs.&lt;/p&gt;
&lt;p&gt;Probabilistic models, serving a crucial role as decision support tools, routinely aid scientific discovery in fields such as physics and astronomy, guiding advancements in areas of medicine and healthcare encompassing bioinformatics, epidemiology, and medical diagnosis. Beyond that, these models have wide-ranging applications in economics, econometrics, and the social sciences. Moreover, they are indispensable in various engineering disciplines, such as robotics and environmental engineering. Among the many probabilistic models, GPs stand out as a powerful driving force behind a number of important sequential decision-making frameworks, including active learning and reinforcement learning, and the broader area of probabilistic numerics at large. Notably, Bayesian optimisation (BO) is one major area that relies heavily on GPs and will feature extensively in our thesis.&lt;/p&gt;
&lt;h4 id="bayesian-optimisation"&gt;Bayesian optimisation&lt;/h4&gt;
&lt;p&gt;BO is a powerful methodology dedicated to the global optimisation of complex and resource-intensive objective functions. In contrast to classical optimisation methods, BO excels even when dealing with functions that lack strong assumptions or guarantees. These functions may be non-convex, may possess no gradients, may lack a well-defined mathematical form, and may be observable only indirectly through noisy measurements.&lt;/p&gt;
&lt;p&gt;At its core, BO is a sequential decision-making algorithm. It relies on observations from past function evaluations to determine the next candidate location for evaluation in pursuit of optimal solutions. BO leverages a probabilistic model, often a GP, to represent its knowledge and beliefs about the unknown function. This model is continuously updated with the acquisition of each new observation, enabling the algorithm to adapt its behaviour and make sound decisions based on the evolving information.&lt;/p&gt;
&lt;p&gt;BO effectively manages the uncertainty inherent in such sequential decision-making processes by making use of the probabilistic model to the fullest, harnessing the entire predictive distribution, particularly the predictive uncertainty, to select promising candidate solutions that bring the most value to the optimisation process. These generally consist not merely of those most likely to optimise the objective function (i.e., &lt;em&gt;exploiting&lt;/em&gt; that which is known), but also those likely to reveal the most knowledge and information about the function itself (i.e., &lt;em&gt;exploring&lt;/em&gt; that which remains unknown).&lt;/p&gt;
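&lt;p&gt;The update-then-propose loop described in the preceding paragraphs can be sketched as follows (a minimal sketch of ours using a toy one-dimensional objective, a small closed-form GP surrogate, and an upper-confidence-bound acquisition that trades off the exploit and explore terms; none of these specific choices are prescribed by the thesis):&lt;/p&gt;

```python
import numpy as np

def objective(x):
    # Expensive black-box function (illustrative stand-in).
    return -np.sin(3.0 * x) - x**2 + 0.7 * x

def gp_posterior(X, y, X_query, lengthscale=0.5, noise=1e-4):
    # Closed-form GP regression with an RBF kernel on 1-D inputs.
    def k(A, B):
        return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / lengthscale**2)
    K = k(X, X) + noise * np.eye(len(X))
    K_star = k(X, X_query)
    mean = K_star.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

rng = np.random.default_rng(1)
X_obs = rng.uniform(-1.0, 2.0, size=3)   # initial design
y_obs = objective(X_obs)
grid = np.linspace(-1.0, 2.0, 200)       # candidate pool

for step in range(10):
    mean, std = gp_posterior(X_obs, y_obs, grid)
    # Upper-confidence-bound acquisition: exploit (mean) plus explore (std).
    ucb = mean + 2.0 * std
    x_next = grid[np.argmax(ucb)]
    # Evaluate the objective and fold the new observation into the model.
    X_obs = np.append(X_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

print(X_obs[np.argmax(y_obs)])  # incumbent: best input found so far
```

Each iteration re-conditions the surrogate on all observations so far, so the acquisition automatically shifts from exploring poorly-covered regions to exploiting the most promising ones.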
&lt;p&gt;This pronounced emphasis on well-calibrated uncertainty distinguishes BO as one of the standout “killer apps” for GPs and a jewel in the crown of probabilistic ML applications. In practice, BO has proven instrumental across science, engineering, and industry, where efficiency and cost-effectiveness are paramount. Its applications include protein engineering, materials discovery, experimental physics (e.g., experiments involving ultra-cold atoms and free-electron lasers), environmental monitoring (sensor placement), and the design of aerodynamic aerofoils, integrated circuits, broadband high-efficiency power amplifiers, and fast-charging protocols for lithium-ion batteries. Notably, it has played a crucial role in automating the hyperparameter tuning of various ML models, especially deep learning models, thus representing yet another way in which probabilistic ML has contributed to the advancement of deep learning.&lt;/p&gt;
&lt;p&gt;However, GPs are not universally suitable for all BO problem scenarios. They are most effective when dealing with smooth, stationary functions with homoscedastic noise and a relatively modest input dimensionality. Additionally, GPs are easiest to work with for functions with a single output and purely continuous inputs. While a surprisingly wide array of real-world challenges satisfy these conditions, many high-impact problems clearly fall outside of this scope: gene and protein design involves sequential inputs; neural architecture search (NAS) involves structured inputs with intricate conditional dependencies; and automotive safety engineering involves numerous constraints and multiple objectives. This is not to say that GPs cannot be extended to such challenging scenarios. However, such extensions almost always come at a cost. Consequently, it makes sense to appeal to alternative modelling paradigms more naturally suited to specific tasks, e.g., employing random forests (RFs) to handle discrete and structured inputs, or deep NNs for capturing nonstationary behaviour and dealing with multiple objectives. A major contribution of this thesis is the introduction of a new formulation of BO that seamlessly accommodates virtually any modelling paradigm, including deep learning, without compromise.&lt;/p&gt;
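&lt;p&gt;The classification-based formulation alluded to here, and developed in detail later in the thesis, replaces the explicit probabilistic surrogate with a binary classifier whose predicted class probability plays the role of the acquisition function. A minimal sketch of the idea (using a hand-rolled logistic-regression class-probability estimator, a toy objective, and an arbitrary quantile gamma, all illustrative choices of ours rather than the thesis's exact setup):&lt;/p&gt;

```python
import numpy as np

def objective(x):
    # Toy black-box objective to minimise (illustrative only).
    return (x - 0.3)**2

def fit_logistic(F, z, lr=0.5, steps=2000):
    # Plain gradient-descent logistic regression (class-probability estimator).
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w -= lr * F.T @ (p - z) / len(z)
    return w

def features(x):
    # Quadratic features so the classifier can model a peaked class probability.
    return np.stack([np.ones_like(x), x, x**2], axis=1)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=40)
y = objective(X)

# Label the best gamma-fraction of observations as the positive class.
gamma = 0.25
z = (y <= np.quantile(y, gamma)).astype(float)

w = fit_logistic(features(X), z)

# The classifier's score stands in for the acquisition function:
# the next candidate is its argmax over a pool of candidates.
grid = np.linspace(-1.0, 1.0, 201)
scores = features(grid) @ w
x_next = grid[np.argmax(scores)]
print(x_next)  # should land near the minimiser 0.3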
&lt;h3 id="thesis-overview"&gt;Thesis Overview&lt;/h3&gt;
&lt;p&gt;The core contributions of our thesis are summarised as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-orthogonal-sparse-spherical-gp" label="item:contrib-orthogonal-sparse-spherical-gp"&gt;&lt;/span&gt; We improve upon the framework for sparse hyperspherical GP approximations that employ nonlinear activations as inter-domain inducing features. This framework serves as a bridge between GPs and NNs, with posterior predictive mean taking the form of single-layer feedforward NNs. Our thesis examines some practical issues associated with this approach and proposes an extension that takes advantage of the orthogonal decoupling of GPs to mitigate these limitations. In particular, we introduce spherical inter-domain features to construct more flexible data-dependent basis functions for both the principal and orthogonal components of the GP approximation. We demonstrate that incorporating orthogonal inducing variables under this framework not only alleviates these shortcomings but also offers superior scalability compared to alternative strategies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-cycle-bayes" label="item:contrib-cycle-bayes"&gt;&lt;/span&gt; We provide a probabilistic perspective on cycle-consistent adversarial networks (CYCLEGANs), a cutting-edge deep generative model for style transfer and image-to-image translation. Specifically, we frame the problem of learning cross-domain correspondences without paired data as Bayesian inference in a latent variable model (LVM), in which the goal is to uncover the hidden representations of entities from one domain as entities in another. First, we introduce implicit LVMs, which allow flexible prior specification over latent representations as implicit distributions. Next, we develop a new variational inference (VI) framework that minimises a symmetrised statistical divergence between the variational and true joint distributions. Finally, we show that CYCLEGANs emerge as a closely-related variant of our framework, providing a useful interpretation as a Bayesian approximation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-bore" label="item:contrib-bore"&gt;&lt;/span&gt; We introduce a model-agnostic formulation of BO based on classification. Building on the established links between class-probability estimation (CPE), density-ratio estimation (DRE), and the improvement-based acquisition functions, we reformulate the acquisition function as a binary classifier over candidate solutions. This approach eliminates the need for an explicit probabilistic model of the objective function and casts aside the limitations of tractability constraints. As a result, our model-agnostic BO approach substantially broadens its applicability across diverse problem scenarios, accommodating flexible and scalable modelling paradigms such as deep learning without necessitating approximations or sacrificing expressive and representational capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Accordingly, our thesis is organised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Chapter 2 (Background) lays the necessary groundwork for our thesis. We begin by outlining the fundamental principles of probability and Bayesian statistics, which form the basis of probabilistic ML. Additionally, we introduce the widely adopted method of approximate Bayesian inference known as VI. Our discussion underscores the central role played by statistical divergences, prompting us to delve into a larger family of divergences and motivating our discussion of DRE. With a solid foundation in place, we shift our focus to GPs, providing an introductory overview and highlighting the most commonly used sparse approximations. Finally, we conclude this background chapter by introducing the basic concepts behind BO.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 3 (Orthogonally-Decoupled Sparse GPs with Spherical Inducing Features) examines orthogonally-decoupled sparse GPs with spherical NN activation features, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 4 (Cycle-Consistent Adversarial Learning as Bayesian Inference) examines cycle-consistent adversarial networks (CYCLEGANs) from the perspective of approximate Bayesian inference, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 5 (Bayesian Optimisation by Density-Ratio Estimation) examines our model-agnostic approach to BO based on binary classification and DRE, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 6 (Conclusion) brings this thesis to a close by reflecting on our main contributions and situating them in the broader landscape of probabilistic methods in ML. Finally, we present our outlook on the avenues for future research and development in this rapidly evolving field.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="references"&gt;References&lt;/h3&gt;
&lt;div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0" line-spacing="2"&gt;
&lt;div id="ref-anil2023palm" class="csl-entry"&gt;
&lt;p&gt;Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. (2023). PaLM 2 technical report. &lt;em&gt;arXiv Preprint arXiv:2305.10403&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-attia2020closed" class="csl-entry"&gt;
&lt;p&gt;Attia, P. M., Grover, A., Jin, N., Severson, K. A., Markov, T. M., Liao, Y.-H., Chen, M. H., Cheong, B., Perkins, N., Yang, Z., et al. (2020). Closed-loop optimization of fast-charging protocols for batteries with machine learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;578&lt;/em&gt;(7795), 397–402.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bartholomew2011latent" class="csl-entry"&gt;
&lt;p&gt;Bartholomew, D. J., Knott, M., &amp;amp; Moustaki, I. (2011). &lt;em&gt;Latent variable models and factor analysis: A unified approach&lt;/em&gt;. John Wiley &amp;amp; Sons.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bayes1763lii" class="csl-entry"&gt;
&lt;p&gt;Bayes, T. (1763). LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F.R.S., communicated by Mr. Price, in a letter to John Canton, A.M.F.R.S. &lt;em&gt;Philosophical Transactions of the Royal Society of London&lt;/em&gt;, &lt;em&gt;53&lt;/em&gt;, 370–418.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-blundell2015weight" class="csl-entry"&gt;
&lt;p&gt;Blundell, C., Cornebise, J., Kavukcuoglu, K., &amp;amp; Wierstra, D. (2015). Weight uncertainty in neural network. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1613–1622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brochu2010tutorial" class="csl-entry"&gt;
&lt;p&gt;Brochu, E., Cora, V. M., &amp;amp; De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1012.2599&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brown2020language" class="csl-entry"&gt;
&lt;p&gt;Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 1877–1901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-chen2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Chen, P., Merrick, B. M., &amp;amp; Brazil, T. J. (2015). Bayesian optimization for broadband high-efficiency power amplifier designs. &lt;em&gt;IEEE Transactions on Microwave Theory and Techniques&lt;/em&gt;, &lt;em&gt;63&lt;/em&gt;(12), 4263–4272.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-damianou2013deep" class="csl-entry"&gt;
&lt;p&gt;Damianou, A., &amp;amp; Lawrence, N. D. (2013). Deep Gaussian processes. &lt;em&gt;Artificial Intelligence and Statistics&lt;/em&gt;, 207–215.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-deisenroth2011pilco" class="csl-entry"&gt;
&lt;p&gt;Deisenroth, M., &amp;amp; Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. &lt;em&gt;Proceedings of the 28th International Conference on Machine Learning (ICML-11)&lt;/em&gt;, 465–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-duris2020bayesian" class="csl-entry"&gt;
&lt;p&gt;Duris, J., Kennedy, D., Hanuka, A., Shtalenkova, J., Edelen, A., Baxevanis, P., Egger, A., Cope, T., McIntire, M., Ermon, S., et al. (2020). Bayesian optimization of a free-electron laser. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;124&lt;/em&gt;(12), 124801.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2020sparse" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Durrande, N., &amp;amp; Hensman, J. (2020). Sparse Gaussian processes with spherical harmonic features. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 2793–2802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2021deep" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Hensman, J., Wilk, M. van der, Ek, C. H., Ghahramani, Z., &amp;amp; Durrande, N. (2021). Deep neural networks as point estimates for deep Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;34&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-forrester2009recent" class="csl-entry"&gt;
&lt;p&gt;Forrester, A. I., &amp;amp; Keane, A. J. (2009). Recent advances in surrogate-based optimization. &lt;em&gt;Progress in Aerospace Sciences&lt;/em&gt;, &lt;em&gt;45&lt;/em&gt;(1-3), 50–79.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gal2016dropout" class="csl-entry"&gt;
&lt;p&gt;Gal, Y., &amp;amp; Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1050–1059.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett_bayesoptbook_2023" class="csl-entry"&gt;
&lt;p&gt;Garnett, R. (2023). &lt;em&gt;&lt;span class="nocase"&gt;Bayesian Optimization&lt;/span&gt;&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett2010bayesian" class="csl-entry"&gt;
&lt;p&gt;Garnett, R., Osborne, M. A., &amp;amp; Roberts, S. J. (2010). Bayesian optimization for sensor set selection. &lt;em&gt;Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks&lt;/em&gt;, 209–219.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gelman2013bayesian" class="csl-entry"&gt;
&lt;p&gt;Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., &amp;amp; Rubin, D. B. (2013). &lt;em&gt;Bayesian data analysis&lt;/em&gt;. CRC Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-girshick2014rich" class="csl-entry"&gt;
&lt;p&gt;Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 580–587.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gonzalez2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Gonzalez, J., Longworth, J., James, D. C., &amp;amp; Lawrence, N. D. (2015). Bayesian optimization for synthetic gene design. &lt;em&gt;arXiv Preprint arXiv:1505.01627&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-goodfellow2014generative" class="csl-entry"&gt;
&lt;p&gt;Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., &amp;amp; Bengio, Y. (2014). Generative adversarial networks. &lt;em&gt;arXiv Preprint arXiv:1406.2661&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-graves2013speech" class="csl-entry"&gt;
&lt;p&gt;Graves, A., Mohamed, A., &amp;amp; Hinton, G. (2013). Speech recognition with deep recurrent neural networks. &lt;em&gt;2013 IEEE International Conference on Acoustics, Speech and Signal Processing&lt;/em&gt;, 6645–6649.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hennig2022probabilistic" class="csl-entry"&gt;
&lt;p&gt;Hennig, P., Osborne, M. A., &amp;amp; Kersting, H. P. (2022). &lt;em&gt;Probabilistic numerics&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hie2022adaptive" class="csl-entry"&gt;
&lt;p&gt;Hie, B. L., &amp;amp; Yang, K. K. (2022). Adaptive machine learning for protein engineering. &lt;em&gt;Current Opinion in Structural Biology&lt;/em&gt;, &lt;em&gt;72&lt;/em&gt;, 145–152.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hinton2012deep" class="csl-entry"&gt;
&lt;p&gt;Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. &lt;em&gt;IEEE Signal Processing Magazine&lt;/em&gt;, &lt;em&gt;29&lt;/em&gt;(6), 82–97.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ho2020denoising" class="csl-entry"&gt;
&lt;p&gt;Ho, J., Jain, A., &amp;amp; Abbeel, P. (2020). Denoising diffusion probabilistic models. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 6840–6851.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hoffmann2022training" class="csl-entry"&gt;
&lt;p&gt;Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). Training compute-optimal large language models. &lt;em&gt;arXiv Preprint arXiv:2203.15556&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-houlsby2011bayesian" class="csl-entry"&gt;
&lt;p&gt;Houlsby, N., Huszár, F., Ghahramani, Z., &amp;amp; Lengyel, M. (2011). Bayesian active learning for classification and preference learning. &lt;em&gt;arXiv Preprint arXiv:1112.5745&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jordan1998introduction" class="csl-entry"&gt;
&lt;p&gt;Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., &amp;amp; Saul, L. K. (1998). An introduction to variational methods for graphical models. &lt;em&gt;Learning in Graphical Models&lt;/em&gt;, 105–161.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jumper2021highly" class="csl-entry"&gt;
&lt;p&gt;Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;596&lt;/em&gt;(7873), 583–589.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-krizhevsky2012imagenet" class="csl-entry"&gt;
&lt;p&gt;Krizhevsky, A., Sutskever, I., &amp;amp; Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lam2018advances" class="csl-entry"&gt;
&lt;p&gt;Lam, R., Poloczek, M., Frazier, P., &amp;amp; Willcox, K. E. (2018). Advances in Bayesian optimization with applications in aerospace engineering. &lt;em&gt;2018 AIAA Non-Deterministic Approaches Conference&lt;/em&gt;, 1656.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-laplace1814theorie" class="csl-entry"&gt;
&lt;p&gt;Laplace, P. S. (1814). &lt;em&gt;Théorie analytique des probabilités&lt;/em&gt;. Courcier.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lee2017deep" class="csl-entry"&gt;
&lt;p&gt;Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., &amp;amp; Sohl-Dickstein, J. (2017). Deep neural networks as Gaussian processes. &lt;em&gt;arXiv Preprint arXiv:1711.00165&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lillicrap2015continuous" class="csl-entry"&gt;
&lt;p&gt;Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., &amp;amp; Wierstra, D. (2015). Continuous control with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1509.02971&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lyu2017efficient" class="csl-entry"&gt;
&lt;p&gt;Lyu, W., Xue, P., Yang, F., Yan, C., Hong, Z., Zeng, X., &amp;amp; Zhou, D. (2017). An efficient Bayesian optimization approach for automated optimization of analog circuits. &lt;em&gt;IEEE Transactions on Circuits and Systems I: Regular Papers&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 1954–1967.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay1992practical" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (1992). A practical Bayesian framework for backpropagation networks. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;4&lt;/em&gt;(3), 448–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay2003information" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (2003). &lt;em&gt;Information theory, inference and learning algorithms&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-marchant2012bayesian" class="csl-entry"&gt;
&lt;p&gt;Marchant, R., &amp;amp; Ramos, F. (2012). Bayesian optimisation for intelligent environmental monitoring. &lt;em&gt;2012 IEEE/RSJ International Conference on Intelligent Robots and Systems&lt;/em&gt;, 2242–2249.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-matthews2018gaussian" class="csl-entry"&gt;
&lt;p&gt;Matthews, A. G. de G., Rowland, M., Hron, J., Turner, R. E., &amp;amp; Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. &lt;em&gt;arXiv Preprint arXiv:1804.11271&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mcculloch1943logical" class="csl-entry"&gt;
&lt;p&gt;McCulloch, W. S., &amp;amp; Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. &lt;em&gt;The Bulletin of Mathematical Biophysics&lt;/em&gt;, &lt;em&gt;5&lt;/em&gt;, 115–133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2013playing" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &amp;amp; Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1312.5602&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2015human" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;518&lt;/em&gt;(7540), 529–533.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-moss2020boss" class="csl-entry"&gt;
&lt;p&gt;Moss, H. B., Beck, D., González, J., Leslie, D. S., &amp;amp; Rayson, P. (2020). BOSS: Bayesian optimization over string spaces. &lt;em&gt;arXiv Preprint arXiv:2010.00979&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-neal1995bayesian" class="csl-entry"&gt;
&lt;p&gt;Neal, R. M. (1995). &lt;em&gt;Bayesian learning for neural networks&lt;/em&gt; [PhD thesis]. University of Toronto.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-openai2023gpt" class="csl-entry"&gt;
&lt;p&gt;OpenAI. (2023). GPT-4 technical report. &lt;em&gt;arXiv Preprint arXiv:2303.08774&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-opper2000gaussian" class="csl-entry"&gt;
&lt;p&gt;Opper, M., &amp;amp; Winther, O. (2000). &lt;em&gt;Gaussian processes and SVM: Mean field results and leave-one-out&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-pearson1901liii" class="csl-entry"&gt;
&lt;p&gt;Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. &lt;em&gt;The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science&lt;/em&gt;, &lt;em&gt;2&lt;/em&gt;(11), 559–572.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rae2021scaling" class="csl-entry"&gt;
&lt;p&gt;Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis &amp;amp; insights from training gopher. &lt;em&gt;arXiv Preprint arXiv:2112.11446&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ramesh2022hierarchical" class="csl-entry"&gt;
&lt;p&gt;Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., &amp;amp; Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. &lt;em&gt;arXiv Preprint arXiv:2204.06125&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt;(2), 3.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-10.7551/mitpress/3206.001.0001" class="csl-entry"&gt;
&lt;p&gt;Rasmussen, C. E., &amp;amp; Williams, C. K. I. (2005). &lt;em&gt;&lt;span class="nocase"&gt;Gaussian Processes for Machine Learning&lt;/span&gt;&lt;/em&gt;. The MIT Press.
&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-redmon2016you" class="csl-entry"&gt;
&lt;p&gt;Redmon, J., Divvala, S., Girshick, R., &amp;amp; Farhadi, A. (2016). You only look once: Unified, real-time object detection. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 779–788.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rombach2022high" class="csl-entry"&gt;
&lt;p&gt;Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp;amp; Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. &lt;em&gt;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 10684–10695.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-romero2013navigating" class="csl-entry"&gt;
&lt;p&gt;Romero, P. A., Krause, A., &amp;amp; Arnold, F. H. (2013). Navigating the protein fitness landscape with Gaussian processes. &lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt;, &lt;em&gt;110&lt;/em&gt;(3), E193–E201.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ronneberger2015u" class="csl-entry"&gt;
&lt;p&gt;Ronneberger, O., Fischer, P., &amp;amp; Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. &lt;em&gt;Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18&lt;/em&gt;, 234–241.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rosenblatt1958perceptron" class="csl-entry"&gt;
&lt;p&gt;Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. &lt;em&gt;Psychological Review&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 386.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-roweis1999unifying" class="csl-entry"&gt;
&lt;p&gt;Roweis, S., &amp;amp; Ghahramani, Z. (1999). A unifying review of linear Gaussian models. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;11&lt;/em&gt;(2), 305–345.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-salimbeni2018orthogonally" class="csl-entry"&gt;
&lt;p&gt;Salimbeni, H., Cheng, C.-A., Boots, B., &amp;amp; Deisenroth, M. (2018). Orthogonally decoupled variational Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;31&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-seko2015prediction" class="csl-entry"&gt;
&lt;p&gt;Seko, A., Togo, A., Hayashi, H., Tsuda, K., Chaput, L., &amp;amp; Tanaka, I. (2015). Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and bayesian optimization. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;115&lt;/em&gt;(20), 205901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shahriari2015taking" class="csl-entry"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015). Taking the human out of the loop: A review of Bayesian optimization. &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, &lt;em&gt;104&lt;/em&gt;(1), 148–175.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shi2020sparse" class="csl-entry"&gt;
&lt;p&gt;Shi, J., Titsias, M., &amp;amp; Mnih, A. (2020). Sparse orthogonal variational inference for Gaussian processes. &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 1932–1942.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shoeybi2019megatron" class="csl-entry"&gt;
&lt;p&gt;Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., &amp;amp; Catanzaro, B. (2019). Megatron-LM: Training multi-billion parameter language models using model parallelism. &lt;em&gt;arXiv Preprint arXiv:1909.08053&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-silver2016mastering" class="csl-entry"&gt;
&lt;p&gt;Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;529&lt;/em&gt;(7587), 484–489.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-snoek2012practical" class="csl-entry"&gt;
&lt;p&gt;Snoek, J., Larochelle, H., &amp;amp; Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;, 2951–2959.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-spearman1904general" class="csl-entry"&gt;
&lt;p&gt;Spearman, C. (1904). &amp;quot;General intelligence,&amp;quot; objectively determined and measured. &lt;em&gt;The American Journal of Psychology&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(2), 201–292.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-srivastava2014dropout" class="csl-entry"&gt;
&lt;p&gt;Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &amp;amp; Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. &lt;em&gt;The Journal of Machine Learning Research&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(1), 1929–1958.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-sun2020neural" class="csl-entry"&gt;
&lt;p&gt;Sun, S., Shi, J., &amp;amp; Grosse, R. B. (2020). Neural networks as inter-domain inducing points. &lt;em&gt;Third Symposium on Advances in Approximate Bayesian Inference&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tibshirani1996regression" class="csl-entry"&gt;
&lt;p&gt;Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. &lt;em&gt;Journal of the Royal Statistical Society Series B: Statistical Methodology&lt;/em&gt;, &lt;em&gt;58&lt;/em&gt;(1), 267–288.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tipping1999probabilistic" class="csl-entry"&gt;
&lt;p&gt;Tipping, M. E., &amp;amp; Bishop, C. M. (1999). Probabilistic principal component analysis. &lt;em&gt;Journal of the Royal Statistical Society: Series B (Statistical Methodology)&lt;/em&gt;, &lt;em&gt;61&lt;/em&gt;(3), 611–622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-torun2018global" class="csl-entry"&gt;
&lt;p&gt;Torun, H. M., Swaminathan, M., Davis, A. K., &amp;amp; Bellaredj, M. L. F. (2018). A global bayesian optimization algorithm and its application to integrated system design. &lt;em&gt;IEEE Transactions on Very Large Scale Integration (VLSI) Systems&lt;/em&gt;, &lt;em&gt;26&lt;/em&gt;(4), 792–802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-touvron2023llama" class="csl-entry"&gt;
&lt;p&gt;Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. &lt;em&gt;arXiv Preprint arXiv:2307.09288&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-turner2021bayesian" class="csl-entry"&gt;
&lt;p&gt;Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., &amp;amp; Guyon, I. (2021). Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. &lt;em&gt;NeurIPS 2020 Competition and Demonstration Track&lt;/em&gt;, 3–26.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-vaswani2017attention" class="csl-entry"&gt;
&lt;p&gt;Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp;amp; Polosukhin, I. (2017). Attention is all you need. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;30&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-wigley2016fast" class="csl-entry"&gt;
&lt;p&gt;Wigley, P. B., Everitt, P. J., Hengel, A. van den, Bastian, J. W., Sooriyabandara, M. A., McDonald, G. D., Hardman, K. S., Quinlivan, C. D., Manju, P., Kuhn, C. C., et al. (2016). Fast machine-learning online optimization of ultra-cold-atom experiments. &lt;em&gt;Scientific Reports&lt;/em&gt;, &lt;em&gt;6&lt;/em&gt;(1), 25890.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-yang2019machine" class="csl-entry"&gt;
&lt;p&gt;Yang, K. K., Wu, Z., &amp;amp; Arnold, F. H. (2019). Machine-learning-guided directed evolution for protein engineering. &lt;em&gt;Nature Methods&lt;/em&gt;, &lt;em&gt;16&lt;/em&gt;(8), 687–694.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>📄 One paper accepted to ICML 2023</title><link>https://tiao.io/posts/one-paper-accepted-to-icml2023/</link><pubDate>Tue, 25 Apr 2023 20:37:43 +0000</pubDate><guid>https://tiao.io/posts/one-paper-accepted-to-icml2023/</guid><description>&lt;p&gt;Our paper
was accepted to ICML 2023 as an Oral Presentation!
This work was largely done during my time at Secondmind Labs as a
Student Researcher, in collaboration with Vincent Dutordoir and Victor Picheny.&lt;/p&gt;</description></item><item><title>Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes</title><link>https://tiao.io/publications/spherical-features-gaussian-process/</link><pubDate>Tue, 25 Apr 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/spherical-features-gaussian-process/</guid><description/></item><item><title>Efficient Cholesky decomposition of low-rank updates</title><link>https://tiao.io/posts/efficient-cholesky-decomposition-of-low-rank-updates/</link><pubDate>Sun, 16 Apr 2023 11:16:03 +0000</pubDate><guid>https://tiao.io/posts/efficient-cholesky-decomposition-of-low-rank-updates/</guid><description>&lt;p&gt;Suppose we&amp;rsquo;re given a positive semidefinite (PSD)
matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$
which we wish to update by some low-rank
matrix $\mathbf{U} \mathbf{U}^\top \in \mathbb{R}^{N \times N}$,
$$\mathbf{B} \triangleq \mathbf{A} + \mathbf{U} \mathbf{U}^\top,$$
where the update factor matrix is $\mathbf{U} \in \mathbb{R}^{N \times M}$.
To be more precise, the low-rank update is rank-$M$ for some $M \ll N$.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What is the best way to calculate the Cholesky decomposition of $\mathbf{B}$?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Given no additional information, the obvious way is to calculate it directly,
which incurs a cost of $\mathcal{O}(N^3)$.
But suppose we&amp;rsquo;ve already calculated the lower-triangular Cholesky factor
$\mathbf{L} \in \mathbb{R}^{N \times N}$ of $\mathbf{A}$ (i.e., $\mathbf{LL}^\top = \mathbf{A}$).
Then, we can use it to calculate the Cholesky decomposition of $\mathbf{B}$
at a reduced cost of $\mathcal{O}(N^2 M)$.
Here&amp;rsquo;s how.&lt;/p&gt;
&lt;h2 id="rank-1-updates"&gt;Rank-1 Updates&lt;/h2&gt;
&lt;p&gt;First, let&amp;rsquo;s consider the simpler case involving just &lt;em&gt;rank-1 updates&lt;/em&gt;,
$$\mathbf{B} \triangleq \mathbf{A} + \mathbf{u} \mathbf{u}^\top,$$
where the update factor vector is $\mathbf{u} \in \mathbb{R}^{N}$.
With some clever manipulations&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;, the details of which we won&amp;rsquo;t
get into in this post, we can leverage $\mathbf{L}$ to
calculate the Cholesky decomposition of $\mathbf{B}$
at a reduced cost of $\mathcal{O}(N^2)$.
Such a procedure for rank-1 updates is implemented in the old-school Fortran
linear algebra software library LINPACK
(but unfortunately not in its successor, LAPACK),
and also in modern libraries like TensorFlow Probability (TFP).&lt;/p&gt;
&lt;p&gt;In TFP, this is implemented in the function &lt;code&gt;tfp.math.cholesky_update&lt;/code&gt;.
For example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_probability&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update_factor_vector&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3); suppose this is pre-computed and stored&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3), ignores `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^2), uses `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_factor_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here &lt;code&gt;cholesky_update&lt;/code&gt; takes as arguments &lt;code&gt;chol&lt;/code&gt; with shape &lt;code&gt;[B1, ..., Bn, N, N]&lt;/code&gt;
and &lt;code&gt;u&lt;/code&gt; with shape &lt;code&gt;[B1, ..., Bn, N]&lt;/code&gt;, and returns a lower triangular Cholesky
factor of the rank-1 updated matrix &lt;code&gt;chol @ chol.T + u @ u.T&lt;/code&gt; in $\mathcal{O}(N^2)$
time.&lt;/p&gt;
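&lt;p&gt;For intuition, here is a minimal self-contained NumPy sketch of the standard rank-1 update procedure (the &lt;code&gt;chol_update&lt;/code&gt; helper below is our own illustrative implementation, not TFP&amp;rsquo;s): it sweeps the factor column by column in a single $\mathcal{O}(N^2)$ pass.&lt;/p&gt;

```python
import numpy as np

def chol_update(chol, u):
    """Illustrative O(N^2) rank-1 update: factor of chol @ chol.T + u @ u.T."""
    L, u = chol.copy(), u.copy()
    n = u.shape[0]
    for k in range(n):
        r = np.hypot(L[k, k], u[k])          # updated diagonal entry
        c, s = r / L[k, k], u[k] / L[k, k]   # elimination coefficients
        L[k, k] = r
        L[k + 1:, k] = (L[k + 1:, k] + s * u[k + 1:]) / c
        u[k + 1:] = c * u[k + 1:] - s * L[k + 1:, k]
    return L

# Verify against a direct O(N^3) re-factorization.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A = X @ X.T + 4.0 * np.eye(4)                # positive-definite A
u = rng.standard_normal(4)
L = np.linalg.cholesky(A)
L_new = chol_update(L, u)
np.testing.assert_allclose(L_new @ L_new.T, A + np.outer(u, u), atol=1e-10)
```

&lt;p&gt;Each sweep step updates the diagonal entry and the remainder of one column in $\mathcal{O}(N)$ time, giving $\mathcal{O}(N^2)$ overall.&lt;/p&gt;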
&lt;h2 id="low-rank-updates"&gt;Low-Rank Updates&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s return to rank-$M$ updates.
First let&amp;rsquo;s write the update factor matrix $\mathbf{U}$ in terms of column
vectors $\mathbf{u}_m \in \mathbb{R}^{N}$,
$$
\mathbf{U} \triangleq
\begin{bmatrix}
\mathbf{u}_1 &amp; \cdots &amp; \mathbf{u}_M
\end{bmatrix}.
$$
&lt;/p&gt;
&lt;p&gt;Now we can write the rank-$M$ update matrix as a sum of $M$ rank-1 matrices,
$$
\mathbf{U} \mathbf{U}^\top =
\begin{bmatrix} \mathbf{u}_1 &amp; \cdots &amp; \mathbf{u}_M \end{bmatrix}
\begin{bmatrix} \mathbf{u}_1^\top \\ \vdots \\ \mathbf{u}_M^\top \end{bmatrix} =
\sum_{m=1}^{M} \mathbf{u}_m \mathbf{u}_m^\top.
$$
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, M]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, 1, M] [..., 1, N, M] -&amp;gt; [..., N, N, M] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, M] [..., M, N] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# not exactly equal due to finite precision, but still equal up to high precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Viewed this way, a low-rank update is nothing more than a repeated application of
rank-1 updates,
$$
\begin{align}
\mathbf{B} &amp; = \mathbf{A} + \mathbf{U} \mathbf{U}^\top \\ &amp; =
\mathbf{A} + \sum_{m=1}^{M} \mathbf{u}_m \mathbf{u}_m^\top \\ &amp; =
((\mathbf{A} + \mathbf{u}_1 \mathbf{u}_1^\top) + \cdots ) + \mathbf{u}_M \mathbf{u}_M^{\top}.
\end{align}
$$
&lt;/p&gt;
&lt;p&gt;Therefore, we can take the $\mathcal{O}(N^2)$ procedure for the Cholesky
factor of a rank-1 update and apply it recursively $M$ times to obtain
an $\mathcal{O}(N^2 M)$ procedure for rank-$M$ updates.&lt;/p&gt;
&lt;p&gt;Hence, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, M] [..., M, N] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3), ignores `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^2M), uses `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_factor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where function &lt;code&gt;cholesky_update_iterated&lt;/code&gt; is implemented as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# base case&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can also implement this iteratively.
First we&amp;rsquo;d use &lt;code&gt;tf.unstack&lt;/code&gt; to turn the update factor matrix $\mathbf{U}$
into a list of update factor vectors $\mathbf{u}_m$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;update_factor_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# `update_factor_vectors` is a list&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="c1"&gt;# ... the list contains M vectors&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Bs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# ... and each vector has shape [B1, ..., Bn, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The astute reader will recognize that this is simply a special case of
the &lt;em&gt;fold&lt;/em&gt; (or &lt;em&gt;reduce&lt;/em&gt;) pattern, where
the &lt;em&gt;binary operator&lt;/em&gt; is &lt;code&gt;tfp.math.cholesky_update&lt;/code&gt;,
the &lt;em&gt;iterable&lt;/em&gt; is &lt;code&gt;tf.unstack(update_factor_matrix, axis=-1)&lt;/code&gt;, and
the &lt;em&gt;initial value&lt;/em&gt; is &lt;code&gt;chol&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Therefore, we can also implement it neatly using the one-liner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;In summary, we showed that to efficiently calculate the Cholesky decomposition
of a matrix perturbed by a low-rank update, one just needs to iteratively
calculate that of the same matrix perturbed by a series of rank-1 updates.
Better yet, all of this can be done with a simple one-liner!&lt;/p&gt;
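&lt;p&gt;As a framework-agnostic sanity check, the same fold pattern can be sketched with NumPy and &lt;code&gt;functools.reduce&lt;/code&gt;. The rank-1 routine below is the classic Givens-based algorithm, restated so the snippet is self-contained; both function names are our own:&lt;/p&gt;

```python
from functools import reduce

import numpy as np


def cholesky_update_rank1(L, u):
    # Classic O(N^2) rank-1 update: returns L' with L' @ L'.T == L @ L.T + u @ u.T.
    L, u = L.copy(), u.copy()
    for k in range(u.shape[0]):
        r = np.hypot(L[k, k], u[k])
        c, s = r / L[k, k], u[k] / L[k, k]
        L[k, k] = r
        L[k + 1:, k] = (L[k + 1:, k] + s * u[k + 1:]) / c
        u[k + 1:] = c * u[k + 1:] - s * L[k + 1:, k]
    return L


def cholesky_update_low_rank(chol, U):
    # Fold the rank-1 update over the M columns of U: O(N^2 M) overall.
    return reduce(cholesky_update_rank1, U.T, chol)


# Usage: verify against the O(N^3) re-factorization of the updated matrix.
rng = np.random.default_rng(0)
N, M = 6, 3
G = rng.standard_normal((N, N))
a = G @ G.T + N * np.eye(N)
U = rng.standard_normal((N, M))
b_factor = cholesky_update_low_rank(np.linalg.cholesky(a), U)
np.testing.assert_array_almost_equal(b_factor @ b_factor.T, a + U @ U.T)
```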
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Seeger, M. (2004). Low rank updates for the Cholesky decomposition.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>GPflux</title><link>https://tiao.io/projects/gpflux/</link><pubDate>Wed, 01 Sep 2021 00:00:00 +0000</pubDate><guid>https://tiao.io/projects/gpflux/</guid><description>&lt;p&gt;
GPflux is a TensorFlow/Keras
framework for Deep Gaussian Processes, developed
at Secondmind Labs. It builds on GPflow and exposes
Deep GP layers as familiar Keras building blocks, making it easier to compose
deep GP models.&lt;/p&gt;
&lt;p&gt;Contributed during my doctoral student researcher appointment at Secondmind
Labs, alongside Vincent Dutordoir, ST John, and other members of the lab.&lt;/p&gt;</description></item><item><title>An Illustrated Guide to the Knowledge Gradient Acquisition Function</title><link>https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/</link><pubDate>Thu, 18 Feb 2021 19:13:23 +0100</pubDate><guid>https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;Draft &amp;ndash; work in progress.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We provide a short guide to the knowledge-gradient (KG) acquisition
function (Frazier et al., 2009)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; for Bayesian
optimization (BO).
Rather than being a self-contained tutorial, this post is intended to serve as
an illustrated compendium to the paper of Frazier et al., 2009&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;
and the subsequent tutorial by Frazier, 2018&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, authored
nearly a decade later.&lt;/p&gt;
&lt;p&gt;This post assumes a basic level of familiarity with BO and Gaussian processes (GPs),
to the extent provided by the literature survey of Shahriari et al.,
2015&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;, and the acclaimed textbook of Rasmussen and Williams, 2006,
respectively.&lt;/p&gt;
&lt;h2 id="knowledge-gradient"&gt;Knowledge-gradient&lt;/h2&gt;
&lt;p&gt;First, we set up the notation and terminology.
Let $f: \mathcal{X} \to \mathbb{R}$ be the blackbox function we wish to
minimize.
We denote the GP posterior predictive distribution, or &lt;em&gt;predictive&lt;/em&gt; for short,
by $p(y | \mathbf{x}, \mathcal{D})$.
The mean of the predictive, or the &lt;em&gt;predictive mean&lt;/em&gt; for short, is denoted by
&lt;/p&gt;
$$
\mu(\mathbf{x}; \mathcal{D}) = \mathbb{E}[y | \mathbf{x}, \mathcal{D}]
$$&lt;p&gt;
Let $\mathcal{D}_n$ be the set of $n$ input-output
observations $\mathcal{D}_n = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$, where
output $y_i = f(\mathbf{x}_i) + \epsilon$ is assumed to be observed with noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$.
We make the following abbreviation
&lt;/p&gt;
$$
\mu_n(\mathbf{x}) = \mu(\mathbf{x}; \mathcal{D}_n)
$$&lt;p&gt;
Next, we define the minimum of the predictive mean, or &lt;em&gt;predictive minimum&lt;/em&gt; for short,
as
&lt;/p&gt;
$$
\tau(\mathcal{D}) = \min_{\mathbf{x}' \in \mathcal{X}} \mu(\mathbf{x}'; \mathcal{D})
$$&lt;p&gt;
If we view $\mu(\mathbf{x}; \mathcal{D})$ as our fit to the underlying
function $f(\mathbf{x})$ from which the observations $\mathcal{D}$ were
generated, then $\tau(\mathcal{D})$ is our estimate of the minimum of $f(\mathbf{x})$,
given observations $\mathcal{D}$.&lt;/p&gt;
&lt;p&gt;Further, we make the following abbreviations
&lt;/p&gt;
$$
\tau_n = \tau(\mathcal{D}_n),
\qquad
\text{and}
\qquad
\tau_{n+1} = \tau(\mathcal{D}_{n+1}),
$$&lt;p&gt;
where $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{ (\mathbf{x}, y) \}$ is the
set of existing observations, augmented by some input-output pair $(\mathbf{x}, y)$.
Then, the knowledge-gradient is defined as
&lt;/p&gt;
$$
\alpha(\mathbf{x}; \mathcal{D}_n) =
\mathbb{E}_{p(y | \mathbf{x}, \mathcal{D}_n)} [ \tau_n - \tau_{n+1} ]
$$&lt;p&gt;
Crucially, note that $\tau_{n+1}$ is implicitly a function of $(\mathbf{x}, y)$,
and that this expression integrates over all possible input-output observation
pairs $(\mathbf{x}, y)$ for the given $\mathbf{x}$ under the
predictive $p(y | \mathbf{x}, \mathcal{D}_n)$.&lt;/p&gt;
&lt;h3 id="monte-carlo-estimation"&gt;Monte Carlo estimation&lt;/h3&gt;
&lt;p&gt;Not surprisingly, the knowledge-gradient function is analytically intractable.
Therefore, in practice, we compute it using Monte Carlo estimation,
&lt;/p&gt;
$$
\alpha(\mathbf{x}; \mathcal{D}_n) \approx
\frac{1}{M} \sum_{m=1}^M \left ( \tau_n - \tau_{n+1}^{(m)} \right ),
\qquad
y^{(m)} \sim p(y | \mathbf{x}, \mathcal{D}_n),
$$&lt;p&gt;
where $\tau_{n+1}^{(m)} = \tau(\mathcal{D}_{n+1}^{(m)})$
and $\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (\mathbf{x}, y^{(m)}) \}$.&lt;/p&gt;
&lt;p&gt;We refer to $y^{(m)}$ as the $m$th simulated outcome, or the $m$th &lt;em&gt;simulation&lt;/em&gt;
for short.
Then, $\mathcal{D}_{n+1}^{(m)}$ is the $m$th simulation-augmented dataset and,
accordingly, $\tau_{n+1}^{(m)}$ is the $m$th simulation-augmented predictive minimum.&lt;/p&gt;
&lt;p&gt;We see that this approximation to the knowledge-gradient is simply the average
difference between the predictive minimum values &lt;em&gt;based on simulation-augmented
data&lt;/em&gt; $\tau_{n+1}^{(m)}$, and that &lt;em&gt;based on observed data&lt;/em&gt; $\tau_n$,
across $M$ simulations.&lt;/p&gt;
&lt;p&gt;This might take a moment to digest, as there are quite a number of moving parts
to keep track of. To help visualize these parts, we provide an illustration of
each of the steps required to compute KG on a simple one-dimensional synthetic
problem.&lt;/p&gt;
&lt;h2 id="one-dimensional-example"&gt;One-dimensional example&lt;/h2&gt;
&lt;p&gt;As the running example throughout this post, we use a synthetic function
defined as
&lt;/p&gt;
$$
f(x) = \sin(3x) + x^2 - 0.7 x.
$$&lt;p&gt;
We generate $n=10$ observations at locations sampled uniformly at random.
The true function, and the set of noisy observations $\mathcal{D}_n$ are
visualized in the figure below:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/observations_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Latent blackbox function and $n=10$ observations.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
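&lt;p&gt;This set-up can be reproduced in a few lines of NumPy. The domain bounds, noise scale, and random seed below are assumptions for illustration only; the post does not specify them:&lt;/p&gt;

```python
import numpy as np


def f(x):
    # Latent blackbox function from the running example.
    return np.sin(3.0 * x) + x ** 2 - 0.7 * x


rng = np.random.default_rng(42)        # hypothetical seed, for reproducibility
X = rng.uniform(-1.0, 2.0, size=10)    # assumed domain bounds
eps = rng.normal(scale=0.2, size=10)   # assumed noise scale sigma = 0.2
y = f(X) + eps                         # noisy observations D_n
```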
&lt;p&gt;Using the observations $\mathcal{D}_n$ we have collected so far, we wish to
use KG to score a candidate location $x_c$ at which to evaluate next.&lt;/p&gt;
&lt;h2 id="posterior-predictive-distribution"&gt;Posterior predictive distribution&lt;/h2&gt;
&lt;p&gt;The posterior predictive $p(y | \mathbf{x}, \mathcal{D}_n)$ is visualized in
the figure below. In particular, the predictive mean $\mu_n(\mathbf{x})$ is
represented by the solid orange curve.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_mean_before_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Posterior predictive distribution (*before* hyperparameter estimation).&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Clearly, this is a poor fit to the data and an uncalibrated estimate of the
predictive uncertainty.&lt;/p&gt;
&lt;h3 id="step-1-hyperparameter-estimation"&gt;Step 1: Hyperparameter estimation&lt;/h3&gt;
&lt;p&gt;Therefore, the first step is to optimize the hyperparameters of the GP regression
model, i.e. the kernel lengthscale, the amplitude, and the observation noise variance.
We do this using type-II maximum likelihood estimation (MLE), or &lt;em&gt;empirical Bayes&lt;/em&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_mean_after_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Posterior predictive distribution (*after* hyperparameter estimation).&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 id="step-2-determine-the-predictive-minimum"&gt;Step 2: Determine the predictive minimum&lt;/h3&gt;
&lt;p&gt;Next, we compute the predictive minimum $\tau_n = \min_{\mathbf{x}' \in \mathcal{X}} \mu_n(\mathbf{x}')$.
Since $\mu_n$ is end-to-end differentiable with respect to the input $\mathbf{x}$, we can
simply use a multi-started quasi-Newton hill-climber such as L-BFGS.
We visualize this in the figure below, where the value of the predictive
minimum is represented by the orange horizontal dashed line, and its location is
denoted by the orange star and triangle.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_minimum_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Predictive minimum $\tau_n$.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
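&lt;p&gt;The multi-started L-BFGS step can be sketched with SciPy. Here the objective is a stand-in for $\mu_n$ (we simply reuse the synthetic $f$), and the helper name, restart count, bounds, and seed are all our own illustrative choices:&lt;/p&gt;

```python
import numpy as np
from scipy.optimize import minimize


def minimize_multi_start(fun, bounds, n_restarts=8, seed=0):
    # Run L-BFGS-B from several uniformly sampled starting points
    # and keep the best local optimum found across restarts.
    rng = np.random.default_rng(seed)
    (lo, hi), best = bounds, None
    for _ in range(n_restarts):
        res = minimize(fun, x0=[rng.uniform(lo, hi)],
                       bounds=[(lo, hi)], method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best


# Stand-in for the predictive mean mu_n: the synthetic f(x) = sin(3x) + x^2 - 0.7x,
# minimized over an assumed domain [-1, 2].
res = minimize_multi_start(lambda x: np.sin(3 * x[0]) + x[0] ** 2 - 0.7 * x[0],
                           bounds=(-1.0, 2.0))
```
&lt;p&gt;Multiple restarts matter here because the surrogate mean is generally multi-modal, so a single L-BFGS run may settle in a local basin.&lt;/p&gt;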
&lt;h3 id="step-3-compute-simulation-augmented-predictive-means"&gt;Step 3: Compute simulation-augmented predictive means&lt;/h3&gt;
&lt;p&gt;Suppose we are scoring the candidate location $x_c = 0.1$.
For illustrative purposes, let us draw just $M=1$ sample $y_c^{(1)} \sim p(y | x_c, \mathcal{D}_n)$.
In the figure below, the candidate location $x_c$ is represented by the
vertical solid gray line, and the single simulated outcome $y_c^{(1)}$ is
represented by the filled blue dot.&lt;/p&gt;
&lt;p&gt;In general, we denote the simulation-augmented predictive mean as
&lt;/p&gt;
$$
\mu_{n+1}^{(m)}(\mathbf{x}) = \mu(\mathbf{x}; \mathcal{D}_{n+1}^{(m)}),
$$&lt;p&gt;
where
$\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (\mathbf{x}, y^{(m)}) \}$
as defined earlier.&lt;/p&gt;
&lt;p&gt;Here, the simulation-augmented dataset $\mathcal{D}_{n+1}^{(1)}$ is the set
of existing observations $\mathcal{D}_n$, augmented by the simulated
input-output pair $(x_c, y_c^{(1)})$,
&lt;/p&gt;
$$
\mathcal{D}_{n+1}^{(1)} = \mathcal{D}_n \cup \{ (x_c, y_c^{(1)}) \},
$$&lt;p&gt;
and the corresponding simulation-augmented predictive mean $\mu_{n+1}^{(1)}(x)$
is represented in the figure below by the solid blue curve.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/simulated_predictive_mean_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive mean $\mu_{n&amp;#43;1}^{(1)}(x)$ at location $x_c = 0.1$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 id="step-4-compute-simulation-augmented-predictive-minimums"&gt;Step 4: Compute simulation-augmented predictive minimums&lt;/h3&gt;
&lt;p&gt;Next, we compute the simulation-augmented predictive minimum
&lt;/p&gt;
$$
\tau_{n+1}^{(1)} = \min_{\mathbf{x}' \in \mathcal{X}} \mu_{n+1}^{(1)}(\mathbf{x}')
$$&lt;p&gt;
It may not be immediately obvious, but $\mu_{n+1}^{(1)}$ is in fact also
end-to-end differentiable with respect to the input $\mathbf{x}$. Therefore, we can again
appeal to a method such as L-BFGS.
We visualize this in the figure below, where the value of the simulation-augmented
predictive minimum is represented by the blue horizontal dashed line, and its
location is denoted by the blue star and triangle.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/simulated_predictive_minimum_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive minimum $\tau_{n&amp;#43;1}^{(1)}$ at location $x_c = 0.1$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Taking the difference between the orange and blue horizontal dashed line will
give us an unbiased estimate of the knowledge-gradient.
However, this is likely to be a crude one, since it is based on just a single
MC sample.
To obtain a more accurate estimate, one needs to increase $M$, the number of
MC samples.&lt;/p&gt;
&lt;h4 id="samples"&gt;Samples $M &gt; 1$&lt;/h4&gt;
&lt;p&gt;Let us now consider $M=5$ samples. We draw $y_c^{(m)} \sim p(y | x_c, \mathcal{D}_n)$,
for $m = 1, \dotsc, 5$.
As before, the input location $x_c$ is represented by the vertical solid
gray line, and the corresponding simulated outcomes are represented by the
filled dots below, with varying hues from a perceptually uniform color palette
to distinguish between samples.&lt;/p&gt;
&lt;p&gt;Accordingly, the simulation-augmented predictive means
$\mu_{n+1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$ are
represented by the colored curves, with hues set to that of the simulated
outcome on which the predictive distribution is based.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/bar_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive mean $\mu_{n&amp;#43;1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Next we compute the simulation-augmented predictive
minimum $\tau_{n+1}^{(m)}$, which requires minimizing
$\mu_{n+1}^{(m)}(x)$ for $m = 1, \dotsc, 5$.
These values are represented below by the horizontal dashed lines, and their
location is denoted by the stars and triangles.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/baz_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive minimum $\tau_{n&amp;#43;1}^{(m)}$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Finally, averaging the differences between the orange dashed line and each of
the other dashed lines gives us the MC estimate of the knowledge gradient at
input $x_c$.&lt;/p&gt;
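&lt;p&gt;To make the procedure concrete, here is a minimal NumPy sketch of the MC estimator described above. The RBF kernel, its lengthscale, the toy data, and the evaluation grid are all illustrative assumptions, not the model used to produce the figures:&lt;/p&gt;

```python
import numpy as np

def rbf(A, B, lengthscale=0.2):
    """Squared-exponential kernel matrix between row vectors of A and B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / lengthscale**2)

def posterior(X, y, X_test, noise=1e-3):
    """GP posterior mean and covariance at X_test, given noisy data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, X_test)
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(X_test, X_test) - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

def knowledge_gradient(X, y, x_c, X_grid, n_samples=5, noise=1e-3, seed=0):
    """MC estimate of KG at candidate x_c, with minima taken over a dense grid."""
    rng = np.random.default_rng(seed)
    mu_n, _ = posterior(X, y, X_grid, noise)
    tau_n = mu_n.min()  # current predictive minimum
    # simulate outcomes y_c^(m) ~ p(y | x_c, D_n)
    m_c, v_c = posterior(X, y, x_c[None, :], noise)
    y_c = m_c[0] + np.sqrt(v_c[0, 0] + noise) * rng.standard_normal(n_samples)
    taus = []
    for y_sim in y_c:
        X_aug = np.vstack([X, x_c[None, :]])
        y_aug = np.append(y, y_sim)
        mu_next, _ = posterior(X_aug, y_aug, X_grid, noise)
        taus.append(mu_next.min())  # simulation-augmented predictive minimum
    return tau_n - np.mean(taus)

X = np.array([[0.3], [0.5], [0.9]])
y = np.sin(6 * X[:, 0])
X_grid = np.linspace(0, 1, 200)[:, None]
kg = knowledge_gradient(X, y, np.array([0.1]), X_grid, n_samples=50)
```

Increasing `n_samples` reduces the variance of the estimate at a linear increase in cost, since each sample requires refitting the posterior on the augmented dataset.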
&lt;h2 id="links-and-further-readings"&gt;Links and Further Readings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;In this post, we only showed a (naïve) approach to calculating the KG at a
given location.
Suffice it to say, there is still quite a gap between this and being able to
efficiently maximize the KG within a sequential decision-making algorithm.
For a guide on incorporating KG in a modular and fully-fledged framework for
BO (namely
) see
&lt;/li&gt;
&lt;li&gt;Another introduction to KG:
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2021knowledge,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{A}n {I}llustrated {G}uide to the {K}nowledge {G}radient {A}cquisition {F}unction&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2021&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on
and
!&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Frazier, P., Powell, W., &amp;amp; Dayanik, S. (2009).
. INFORMS Journal on Computing, 21(4), 599-613.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Frazier, P. I. (2018).
. arXiv preprint arXiv:1807.02811.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015).
. Proceedings of the IEEE, 104(1), 148-175.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Model-based Asynchronous Hyperparameter and Neural Architecture Search</title><link>https://tiao.io/publications/async-multi-fidelity-hpo/</link><pubDate>Sun, 01 Mar 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/async-multi-fidelity-hpo/</guid><description/></item><item><title>A Handbook for Sparse Variational Gaussian Processes</title><link>https://tiao.io/posts/sparse-variational-gaussian-processes/</link><pubDate>Fri, 13 Sep 2019 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/sparse-variational-gaussian-processes/</guid><description>
&lt;details class="print:hidden xl:hidden" &gt;
&lt;summary&gt;Table of Contents&lt;/summary&gt;
&lt;div class="text-sm"&gt;
&lt;nav id="TableOfContents"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#prior"&gt;Prior&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#marginal-prior-over-inducing-variables"&gt;Marginal prior over inducing variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conditional-prior"&gt;Conditional prior&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#variational-distribution"&gt;Variational Distribution&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#whitened-parameterization"&gt;Whitened parameterization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference"&gt;Inference&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#preliminaries"&gt;Preliminaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gaussian-likelihoods--sparse-gaussian-process-regression-sgpr"&gt;Gaussian Likelihoods &amp;ndash; Sparse Gaussian Process Regression (SGPR)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#non-gaussian-likelihoods"&gt;Non-Gaussian Likelihoods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#large-scale-data-with-stochastic-optimization"&gt;Large-Scale Data with Stochastic Optimization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#links-and-further-readings"&gt;Links and Further Readings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#appendix"&gt;Appendix&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#i"&gt;I&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ii"&gt;II&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#iii"&gt;III&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#iv"&gt;IV&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v"&gt;V&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#vi"&gt;VI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#vii"&gt;VII&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/nav&gt;
&lt;/div&gt;
&lt;/details&gt;
&lt;p&gt;In the sparse variational Gaussian process (SVGP) framework (Titsias, 2009)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;,
one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with auxiliary
variables $\mathbf{u}$ so that the joint becomes
&lt;/p&gt;
$$
p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f}, \mathbf{u}).
$$&lt;p&gt;
The vector $\mathbf{u} = \begin{bmatrix} u(\mathbf{z}_1) \cdots u(\mathbf{z}_M)\end{bmatrix}^{\top} \in \mathbb{R}^M$
consists of &lt;em&gt;inducing variables&lt;/em&gt;, the latent function values corresponding
to the &lt;em&gt;inducing input&lt;/em&gt; locations contained in the matrix
$\mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_M \end{bmatrix}^{\top} \in \mathbb{R}^{M \times D}$.&lt;/p&gt;
&lt;h2 id="prior"&gt;Prior&lt;/h2&gt;
&lt;p&gt;The joint distribution of the latent function values $\mathbf{f}$, and the
inducing variables $\mathbf{u}$ according to the prior is
&lt;/p&gt;
$$
p(\mathbf{f}, \mathbf{u}) =
\mathcal{N} \left (
\begin{bmatrix}
\mathbf{f} \newline
\mathbf{u}
\end{bmatrix}
;
\begin{bmatrix}
\mathbf{0} \newline
\mathbf{0}
\end{bmatrix},
\begin{bmatrix}
\mathbf{K}_\mathbf{ff} &amp; \mathbf{K}_\mathbf{uf}^\top \newline
\mathbf{K}_\mathbf{uf} &amp; \mathbf{K}_\mathbf{uu}
\end{bmatrix}
\right ).
$$&lt;p&gt;
If we let the joint prior factorize as
&lt;/p&gt;
$$
p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) p(\mathbf{u}),
$$&lt;p&gt;
we can apply the rules of Gaussian conditioning to derive the marginal prior
$p(\mathbf{u})$ and conditional prior $p(\mathbf{f} | \mathbf{u})$.&lt;/p&gt;
&lt;h3 id="marginal-prior-over-inducing-variables"&gt;Marginal prior over inducing variables&lt;/h3&gt;
&lt;p&gt;The marginal prior over inducing variables is simply given by
&lt;/p&gt;
$$
p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{0}, \mathbf{K}_\mathbf{uu}).
$$
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;We can express the prior over the inducing variable $u(\mathbf{z})$ at
inducing input $\mathbf{z}$ as
&lt;/p&gt;
$$
p(u(\mathbf{z})) = \mathcal{GP}(0, k_{\theta}(\mathbf{z}, \mathbf{z}')).
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3 id="conditional-prior"&gt;Conditional prior&lt;/h3&gt;
&lt;p&gt;First, let us define the vector-valued function $\boldsymbol{\psi}_\mathbf{u}: \mathbb{R}^{D} \to \mathbb{R}^{M}$ as
&lt;/p&gt;
$$
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) \triangleq \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}),
$$&lt;p&gt;
where $\mathbf{k}_\mathbf{u}(\mathbf{x}) = k_{\theta}(\mathbf{Z}, \mathbf{x})$ denotes the
vector of covariances between $\mathbf{x}$ and the inducing inputs $\mathbf{Z}$.
Further, let $\boldsymbol{\Psi} \in \mathbb{R}^{M \times N}$ be the matrix
whose $n$-th column is $\boldsymbol{\psi}_\mathbf{u}$ evaluated at the $n$-th
row of the matrix of inputs
$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \cdots \mathbf{x}_N \end{bmatrix}^{\top} \in \mathbb{R}^{N \times D}$,
&lt;/p&gt;
$$
\boldsymbol{\Psi} \triangleq
\begin{bmatrix}
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}_1)
\cdots
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}_N)
\end{bmatrix} = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf}.
$$&lt;p&gt;
Then, we can condition the joint prior distribution on the inducing
variables to give
&lt;/p&gt;
$$
p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}),
$$&lt;p&gt;
where the mean vector and covariance matrix are
&lt;/p&gt;
$$
\mathbf{m} = \boldsymbol{\Psi}^{\top} \mathbf{u},
\quad
\text{and}
\quad
\mathbf{S} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi}.
$$
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;We can express the distribution over the function value $f(\mathbf{x})$ at
input $\mathbf{x}$, given $\mathbf{u}$, that is, the conditional
$p(f(\mathbf{x}) | \mathbf{u})$, as a Gaussian process:
&lt;/p&gt;
$$
p(f(\mathbf{x}) | \mathbf{u}) = \mathcal{GP}(m(\mathbf{x}), s(\mathbf{x}, \mathbf{x}')),
$$&lt;p&gt;
with mean and covariance functions,
&lt;/p&gt;
$$
m(\mathbf{x}) = \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{u},
\quad
\text{and}
\quad
s(\mathbf{x}, \mathbf{x}') = k_{\theta}(\mathbf{x}, \mathbf{x}') - \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu} \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}').
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
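&lt;p&gt;As a quick numerical sanity check of the definitions above, the following NumPy snippet (assuming an RBF kernel and arbitrary toy inputs) computes $\boldsymbol{\Psi}$, the conditional mean $\mathbf{m}$ and covariance $\mathbf{S}$, and verifies that $\boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf}$:&lt;/p&gt;

```python
import numpy as np

def rbf(A, B, lengthscale=0.25):
    """Squared-exponential kernel matrix between row vectors of A and B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(7, 1))   # N = 7 inputs
Z = rng.uniform(size=(3, 1))   # M = 3 inducing inputs

K_ff = rbf(X, X)
K_uf = rbf(Z, X)
K_uu = rbf(Z, Z) + 1e-9 * np.eye(3)  # small jitter for numerical stability

Psi = np.linalg.solve(K_uu, K_uf)    # Psi = K_uu^{-1} K_uf, shape (M, N)

u = rng.standard_normal(3)           # an arbitrary value of the inducing variables
m = Psi.T @ u                        # conditional mean  m = Psi^T u
S = K_ff - Psi.T @ K_uu @ Psi        # conditional cov   S = K_ff - Psi^T K_uu Psi

# Psi^T K_uu Psi coincides with K_fu K_uu^{-1} K_uf (the Nystrom approximation)
Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)
assert np.allclose(Psi.T @ K_uu @ Psi, Q_ff)
```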
&lt;p&gt;Before moving on, we briefly highlight the important
quantity,
&lt;/p&gt;
$$
\mathbf{Q}_\mathbf{ff} \triangleq \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi},
$$&lt;p&gt;
which is sometimes referred to as the &lt;em&gt;Nyström approximation&lt;/em&gt; of $\mathbf{K}_\mathbf{ff}$.
It can be written as
&lt;/p&gt;
$$
\mathbf{Q}_\mathbf{ff} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf}.
$$&lt;h2 id="variational-distribution"&gt;Variational Distribution&lt;/h2&gt;
&lt;p&gt;We specify a joint variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$
which factorizes as
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \triangleq p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u}).
$$&lt;p&gt;
For convenience, let us specify a variational distribution that is also Gaussian,
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) \triangleq \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{W}\mathbf{W}^{\top}),
$$&lt;p&gt;
with variational parameters $\boldsymbol{\phi} = \{ \mathbf{W}, \mathbf{b} \}$.
To obtain the corresponding marginal variational distribution over $\mathbf{f}$,
we marginalize out the inducing variables $\mathbf{u}$, leading to
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(\mathbf{f}) =
\int q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \, \mathrm{d}\mathbf{u} =
\mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \mathbf{\Sigma}),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Psi}^\top \mathbf{b},
\quad
\text{and}
\quad
\mathbf{\Sigma} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^\top (\mathbf{K}_\mathbf{uu} - \mathbf{W}\mathbf{W}^{\top}) \boldsymbol{\Psi}.
$$
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;We can express the variational distribution over the function value $f(\mathbf{x})$ at
input $\mathbf{x}$, that is, the marginal $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$,
as a Gaussian process:
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(f(\mathbf{x})) = \mathcal{GP}(\mu(\mathbf{x}), \sigma(\mathbf{x}, \mathbf{x}')),
$$&lt;p&gt;
with mean and covariance functions,
&lt;/p&gt;
$$
\begin{aligned}
\mu(\mathbf{x}) &amp;= \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{b}, \\
\sigma(\mathbf{x}, \mathbf{x}') &amp;= k_{\theta}(\mathbf{x}, \mathbf{x}') - \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) (\mathbf{K}_\mathbf{uu} - \mathbf{W}\mathbf{W}^{\top}) \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}').
\end{aligned}
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
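&lt;p&gt;The marginal $q_{\boldsymbol{\phi}}(\mathbf{f})$ is cheap to compute once the kernel matrices are in hand. The following sketch (again with an assumed RBF kernel and toy inputs) computes $\boldsymbol{\mu}$ and $\mathbf{\Sigma}$, and checks that setting $\mathbf{W}\mathbf{W}^{\top} = \mathbf{K}_\mathbf{uu}$ recovers the prior covariance $\mathbf{K}_\mathbf{ff}$:&lt;/p&gt;

```python
import numpy as np

def rbf(A, B, lengthscale=0.25):
    """Squared-exponential kernel matrix between row vectors of A and B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / lengthscale**2)

rng = np.random.default_rng(1)
X = rng.uniform(size=(6, 1))
Z = rng.uniform(size=(3, 1))

K_ff = rbf(X, X)
K_uf = rbf(Z, X)
K_uu = rbf(Z, Z) + 1e-9 * np.eye(3)
Psi = np.linalg.solve(K_uu, K_uf)

b = rng.standard_normal(3)                 # variational mean parameter
W = np.tril(rng.standard_normal((3, 3)))   # variational square-root parameter

mu = Psi.T @ b
Sigma = K_ff - Psi.T @ (K_uu - W @ W.T) @ Psi

# Sanity check: if q(u) has the prior covariance, q(f) recovers the prior
L = np.linalg.cholesky(K_uu)
Sigma_prior = K_ff - Psi.T @ (K_uu - L @ L.T) @ Psi
assert np.allclose(Sigma_prior, K_ff)
```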
&lt;h3 id="whitened-parameterization"&gt;Whitened parameterization&lt;/h3&gt;
&lt;p&gt;Whitening is a powerful trick for stabilizing the learning of variational
parameters that works by reducing correlations in the variational distribution (Murray &amp;amp; Adams, 2010; Hensman et al, 2015)&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;.
Let $\mathbf{L}$ be the Cholesky factor of $\mathbf{K}_\mathbf{uu}$, i.e. the
lower triangular matrix such that $\mathbf{L} \mathbf{L}^{\top} = \mathbf{K}_\mathbf{uu}$.
Then, the whitened variational parameters are given by
&lt;/p&gt;
$$
\mathbf{W} \triangleq \mathbf{L} \mathbf{W}',
\quad
\text{and}
\quad
\mathbf{b} \triangleq \mathbf{L} \mathbf{b}',
$$&lt;p&gt;
with free parameters $\{ \mathbf{W}', \mathbf{b}' \}$.
This leads to mean and covariance
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Lambda}^\top \mathbf{b}',
\quad
\text{and}
\quad
\mathbf{\Sigma} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top (\mathbf{I}_M - {\mathbf{W}'} {\mathbf{W}'}^{\top}) \boldsymbol{\Lambda},
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\Lambda} \triangleq \mathbf{L}^\top \boldsymbol{\Psi} = \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf}.
$$&lt;p&gt;
Refer to
for derivations.&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;The mean and covariance functions are now
&lt;/p&gt;
$$
\begin{aligned}
\mu(\mathbf{x}) &amp;= \boldsymbol{\lambda}^\top(\mathbf{x}) \mathbf{b}', \\
\sigma(\mathbf{x}, \mathbf{x}') &amp;= k_{\theta}(\mathbf{x}, \mathbf{x}') - \boldsymbol{\lambda}^\top(\mathbf{x}) (\mathbf{I}_M - \mathbf{W}' {\mathbf{W}'}^{\top}) \boldsymbol{\lambda}(\mathbf{x}'),
\end{aligned}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\begin{aligned}
\boldsymbol{\lambda}(\mathbf{x}) &amp;\triangleq \mathbf{L}^{\top} \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) \\
&amp;= \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}).
\end{aligned}
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For an efficient and numerically stable way to compute and evaluate the
variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f})$ at an arbitrary
set of inputs, see
.&lt;/p&gt;
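&lt;p&gt;A short numerical check of the whitened parameterization (with an assumed RBF kernel and toy inputs): computing $\boldsymbol{\Lambda} = \mathbf{L}^{-1}\mathbf{K}_\mathbf{uf}$ by a solve against the Cholesky factor, the whitened mean and covariance should coincide with the unwhitened ones under $\mathbf{b} = \mathbf{L}\mathbf{b}'$ and $\mathbf{W} = \mathbf{L}\mathbf{W}'$:&lt;/p&gt;

```python
import numpy as np

def rbf(A, B, lengthscale=0.25):
    """Squared-exponential kernel matrix between row vectors of A and B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / lengthscale**2)

rng = np.random.default_rng(2)
X = rng.uniform(size=(6, 1))
Z = rng.uniform(size=(3, 1))

K_ff, K_uf = rbf(X, X), rbf(Z, X)
K_uu = rbf(Z, Z) + 1e-8 * np.eye(3)
L = np.linalg.cholesky(K_uu)

b_w = rng.standard_normal(3)                # whitened ("free") mean b'
W_w = np.tril(rng.standard_normal((3, 3)))  # whitened square-root W'

Lam = np.linalg.solve(L, K_uf)              # Lambda = L^{-1} K_uf

mu = Lam.T @ b_w
Sigma = K_ff - Lam.T @ (np.eye(3) - W_w @ W_w.T) @ Lam

# Consistency with the unwhitened parameterization: b = L b', W = L W'
b, W = L @ b_w, L @ W_w
Psi = np.linalg.solve(K_uu, K_uf)
mu_ref = Psi.T @ b
Sigma_ref = K_ff - Psi.T @ (K_uu - W @ W.T) @ Psi
```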
&lt;h2 id="inference"&gt;Inference&lt;/h2&gt;
&lt;h3 id="preliminaries"&gt;Preliminaries&lt;/h3&gt;
&lt;p&gt;We seek to approximate the exact posterior $p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$
by a variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$.
To this end, we minimize the Kullback-Leibler (KL) divergence
between $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$
and $p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$, which is given by
&lt;/p&gt;
$$
\begin{align*}
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})] &amp; =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}{p(\mathbf{f},\mathbf{u} \mid \mathbf{y})}}\right] \newline &amp; =
\log{p(\mathbf{y})} + \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}{p(\mathbf{f},\mathbf{u}, \mathbf{y})}}\right] \newline &amp; =
\log{p(\mathbf{y})} - \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}),
\end{align*}
$$&lt;p&gt;
where we&amp;rsquo;ve defined the &lt;em&gt;evidence lower bound (ELBO)&lt;/em&gt; as
&lt;/p&gt;
$$
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{p(\mathbf{f},\mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}}\right].
$$&lt;p&gt;
Notice that minimizing the KL divergence above is equivalent to maximizing the ELBO.
Furthermore, the ELBO is a lower bound on the log marginal likelihood, since
&lt;/p&gt;
$$
\log{p(\mathbf{y})} = \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) + \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})],
$$&lt;p&gt;
and the KL divergence is nonnegative.
Therefore, we have $\log{p(\mathbf{y})} \geq \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z})$
with equality at $\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})] = 0 \Leftrightarrow q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) = p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$.&lt;/p&gt;
&lt;p&gt;Let us now focus our attention on the ELBO, which can be written as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) &amp; = \iint \log{\frac{p(\mathbf{f},\mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \,\mathrm{d}\mathbf{f} \mathrm{d}\mathbf{u} \newline &amp; =
\iint \log{\frac{p(\mathbf{y} | \mathbf{f}) \bcancel{p(\mathbf{f} | \mathbf{u})} p(\mathbf{u})}{\bcancel{p(\mathbf{f} | \mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \,\mathrm{d}\mathbf{f} \mathrm{d}\mathbf{u} \newline &amp; =
\int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u},
\end{align*}
$$&lt;p&gt;
where we have made use of the previous
definition $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u})$
and also introduced the definition
&lt;/p&gt;
$$
\Phi(\mathbf{y}, \mathbf{u}) \triangleq \exp{ \left ( \int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f} \right ) }.
$$&lt;p&gt;
It is straightforward to verify that the optimal variational distribution, that
is, the distribution $q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})$ at which the
ELBO is maximized, satisfies
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \propto \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}).
$$&lt;p&gt;
Refer to
for details.
Specifically, after normalization, we have
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) = \frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{\mathcal{Z}},
$$&lt;p&gt;
where $\mathcal{Z} \triangleq \int \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u}$.
Plugging this back into the ELBO, we get
&lt;/p&gt;
$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z})
&amp;= \int \log{\left (\bcancel{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})} \frac{\mathcal{Z}}{\bcancel{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}} \right )} q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\
&amp;= \log{\mathcal{Z}}.
\end{aligned}
$$&lt;h3 id="gaussian-likelihoods--sparse-gaussian-process-regression-sgpr"&gt;Gaussian Likelihoods &amp;ndash; Sparse Gaussian Process Regression (SGPR)&lt;/h3&gt;
&lt;p&gt;Let us assume we have a Gaussian likelihood of the form
&lt;/p&gt;
$$
p(\mathbf{y} | \mathbf{f}) = \mathcal{N}(\mathbf{y} | \mathbf{f}, \beta^{-1} \mathbf{I}).
$$&lt;p&gt;
Then it is straightforward to show that
&lt;/p&gt;
$$
\log{\Phi(\mathbf{y}, \mathbf{u})} =
\log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \beta^{-1} \mathbf{I} )} - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}),
$$&lt;p&gt;
where $\mathbf{m}$ and $\mathbf{S}$ are defined as before, i.e. $\mathbf{m} = \boldsymbol{\Psi}^{\top} \mathbf{u}$ and
$\mathbf{S} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi}$.
Refer to
for derivations.&lt;/p&gt;
&lt;p&gt;Now, there are a few key objects of interest.
First, the
optimal variational distribution $q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})$,
which is required to compute the predictive distribution $q_{\boldsymbol{\phi}^{\star}}(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}) q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}$,
but which may also be of independent interest.
Second, the ELBO, the objective with respect to which the inducing input
locations $\mathbf{Z}$ are optimized.&lt;/p&gt;
&lt;p&gt;The optimal variational distribution is given by
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) =
\mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu}),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{M} \triangleq \mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu}.
$$&lt;p&gt;
This can be verified by reducing the product of two exponential-quadratic
functions in $\Phi(\mathbf{y}, \mathbf{u})$ and $p(\mathbf{u})$ into a single
exponential-quadratic function up to a constant factor,
an operation also known as &amp;ldquo;completing the square&amp;rdquo;.
Refer to
for complete derivations.&lt;/p&gt;
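&lt;p&gt;One way to test these expressions numerically: when the inducing inputs coincide with the training inputs, $\mathbf{Z} = \mathbf{X}$, SGPR should reproduce exact GP regression. The sketch below (assuming an RBF kernel and synthetic data) verifies this for the predictive mean:&lt;/p&gt;

```python
import numpy as np

def rbf(A, B, lengthscale=0.3):
    """Squared-exponential kernel matrix between row vectors of A and B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / lengthscale**2)

rng = np.random.default_rng(3)
beta = 10.0                                  # noise precision
X = np.linspace(0.0, 1.0, 6)[:, None]
y = np.sin(4.0 * X[:, 0]) + rng.normal(scale=beta**-0.5, size=6)

Z = X.copy()                                 # place inducing inputs at the data
K_ff = rbf(X, X)
K_uf, K_uu = rbf(Z, X), rbf(Z, Z)

M_mat = K_uu + beta * K_uf @ K_uf.T          # M = K_uu + beta K_uf K_fu
mean_u = beta * K_uu @ np.linalg.solve(M_mat, K_uf @ y)
cov_u = K_uu @ np.linalg.solve(M_mat, K_uu)

# Predictive mean of q(f) at the training inputs
mean_f = beta * K_uf.T @ np.linalg.solve(M_mat, K_uf @ y)

# With Z = X, SGPR coincides with exact GP regression
mean_exact = K_ff @ np.linalg.solve(K_ff + np.eye(6) / beta, y)
```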
&lt;p&gt;This leads to the predictive distribution
&lt;/p&gt;
$$
\begin{aligned}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{f})
&amp;= \mathcal{N}\bigl(\mathbf{f} \mid \beta \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} \mathbf{y}, \\
&amp;\qquad\qquad \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^\top (\mathbf{K}_\mathbf{uu} - \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu} ) \boldsymbol{\Psi} \bigr) \\
&amp;= \mathcal{N}\bigl(\mathbf{f} \mid \beta \mathbf{K}_\mathbf{fu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \\
&amp;\qquad\qquad \mathbf{K}_\mathbf{ff} - \mathbf{K}_\mathbf{fu} (\mathbf{K}_\mathbf{uu}^{-1} - \mathbf{M}^{-1}) \mathbf{K}_\mathbf{uf} \bigr).
\end{aligned}
$$&lt;p&gt;The ELBO is given by
&lt;/p&gt;
$$
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z}) =
\log \mathcal{Z} =
\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
$$&lt;p&gt;
This can be verified by applying simple rules for marginalizing Gaussians.
Again, refer to
for complete derivations.
Refer to
for a numerically efficient and
robust method for computing these quantities.&lt;/p&gt;
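&lt;p&gt;The collapsed bound is also easy to check numerically: it should never exceed the exact log marginal likelihood, and should match it when $\mathbf{Z} = \mathbf{X}$. A sketch in NumPy (the RBF kernel and synthetic data are illustrative assumptions):&lt;/p&gt;

```python
import numpy as np

def rbf(A, B, lengthscale=0.3):
    """Squared-exponential kernel matrix between row vectors of A and B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-0.5 * d2.sum(-1) / lengthscale**2)

def mvn_logpdf(y, cov):
    """log N(y | 0, cov), evaluated via slogdet for numerical stability."""
    n = len(y)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

def sgpr_elbo(X, y, Z, beta):
    K_ff = rbf(X, X)
    K_uf = rbf(Z, X)
    K_uu = rbf(Z, Z) + 1e-9 * np.eye(len(Z))
    Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)  # Nystrom approximation
    S = K_ff - Q_ff                              # conditional covariance
    return mvn_logpdf(y, Q_ff + np.eye(len(X)) / beta) - 0.5 * beta * np.trace(S)

rng = np.random.default_rng(4)
beta = 10.0
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(4.0 * X[:, 0]) + rng.normal(scale=beta**-0.5, size=8)

lml = mvn_logpdf(y, rbf(X, X) + np.eye(8) / beta)  # exact log marginal likelihood
elbo_sparse = sgpr_elbo(X, y, X[::2], beta)        # M = 4 inducing inputs
elbo_full = sgpr_elbo(X, y, X, beta)               # Z = X recovers the exact model
```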
&lt;h3 id="non-gaussian-likelihoods"&gt;Non-Gaussian Likelihoods&lt;/h3&gt;
&lt;p&gt;Recall from earlier that the ELBO is written as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) &amp; =
\int \log{\left(\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}\right)} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\int \left(\log{\Phi(\mathbf{y}, \mathbf{u})} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}}\ \right) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) - \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u})||p(\mathbf{u})],
\end{align*}
$$&lt;p&gt;
where we define $\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z})$, the &lt;em&gt;expected log-likelihood (ELL)&lt;/em&gt;, as
&lt;/p&gt;
$$
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{u})}\left[\log{\Phi(\mathbf{y}, \mathbf{u})}\right].
$$&lt;p&gt;
This constitutes the first term in the ELBO, and can be written as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) &amp; =
\int \log{\Phi(\mathbf{y}, \mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\int \left(\int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f}\right) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\int \log{p(\mathbf{y} | \mathbf{f})} \left(\int p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \right) \,\mathrm{d}\mathbf{f} \\\\ &amp; =
\int \log{p(\mathbf{y} | \mathbf{f})} q(\mathbf{f}) \,\mathrm{d}\mathbf{f} \\\\ &amp; =
\mathbb{E}_{q(\mathbf{f})}[\log{p(\mathbf{y} | \mathbf{f})}].
\end{align*}
$$&lt;p&gt;
While this integral is analytically intractable in general, we can nonetheless
approximate it efficiently using numerical integration techniques such as
Monte Carlo (MC) estimation or quadrature rules.
In particular, because $q(\mathbf{f})$ is Gaussian, we can utilize simple yet
effective rules such as
.&lt;/p&gt;
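&lt;p&gt;For instance, Gauss-Hermite quadrature approximates each one-dimensional expectation $\mathbb{E}_{f \sim \mathcal{N}(\mu, \sigma^2)}[\log p(y | f)]$ with a weighted sum over a handful of nodes. A sketch, validated against the closed form available for a Gaussian likelihood (the helper names here are illustrative):&lt;/p&gt;

```python
import numpy as np

def expected_log_lik(y, mu, var, log_lik, deg=20):
    """Per-datapoint E_{f ~ N(mu, var)}[log p(y | f)] via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(deg)        # nodes/weights for e^{-x^2}
    f = mu[:, None] + np.sqrt(2.0 * var)[:, None] * x  # change of variables
    return (log_lik(y[:, None], f) * w).sum(axis=-1) / np.sqrt(np.pi)

# Check against the closed form for a Gaussian likelihood with precision beta
beta = 4.0
rng = np.random.default_rng(5)
y = rng.standard_normal(6)
mu = rng.standard_normal(6)
var = rng.uniform(0.1, 1.0, size=6)

gauss_log_lik = lambda y, f: -0.5 * np.log(2.0 * np.pi / beta) - 0.5 * beta * (y - f) ** 2
ell = expected_log_lik(y, mu, var, gauss_log_lik)
ell_exact = -0.5 * np.log(2.0 * np.pi / beta) - 0.5 * beta * ((y - mu) ** 2 + var)
```

Since the integrand is quadratic in $f$ for a Gaussian likelihood, a degree-20 rule is exact here to machine precision; non-Gaussian likelihoods (e.g. Bernoulli with a probit link) simply swap in a different `log_lik`.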
&lt;p&gt;Now, the second term in the ELBO is the KL divergence between $q_{\boldsymbol{\phi}}(\mathbf{u})$ and $p(\mathbf{u})$, which are both multivariate Gaussians,
&lt;/p&gt;
$$
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u})||p(\mathbf{u})] =
\mathrm{KL}[\mathcal{N}(\mathbf{b}, \mathbf{W} {\mathbf{W}}^\top) || \mathcal{N}(\mathbf{0}, \mathbf{K}_\mathbf{uu})],
$$&lt;p&gt;
and has a
.
In the case of the whitened parameterization, it can be simplified as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u})||p(\mathbf{u})] &amp; =
\mathrm{KL}[\mathcal{N}(\mathbf{b}, \mathbf{W} {\mathbf{W}}^\top) || \mathcal{N}(\mathbf{0}, \mathbf{K}_\mathbf{uu})] \\\\ &amp; =
\mathrm{KL}[\mathcal{N}(\mathbf{b}', \mathbf{W}' {\mathbf{W}'}^\top) || \mathcal{N}(\mathbf{0}, \mathbf{I})].
\end{align*}
$$&lt;p&gt;
This comes from the fact that
&lt;/p&gt;
$$
\begin{aligned}
&amp;\mathrm{KL}\left[\mathcal{N}(\mathbf{A} \boldsymbol{\mu}_0, \mathbf{A} \boldsymbol{\Sigma}_0 \mathbf{A}^\top) \,\|\, \mathcal{N}(\mathbf{A} \boldsymbol{\mu}_1, \mathbf{A} \boldsymbol{\Sigma}_1 \mathbf{A}^\top) \right] \\
&amp;\qquad = \mathrm{KL}\left[\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \,\|\, \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \right]
\end{aligned}
$$&lt;p&gt;
where we set $\boldsymbol{\mu}_0 = \mathbf{b}', \boldsymbol{\Sigma}_0 = \mathbf{W}' {\mathbf{W}'}^\top, \boldsymbol{\mu}_1 = \mathbf{0}, \boldsymbol{\Sigma}_1 = \mathbf{I}$, and $\mathbf{A} = \mathbf{L}$, where $\mathbf{L}$ is the Cholesky factor of $\mathbf{K}_\mathbf{uu}$, i.e. the lower triangular matrix such that $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_\mathbf{uu}$, so that $\mathbf{A} \boldsymbol{\mu}_0 = \mathbf{b}$ and $\mathbf{A} \boldsymbol{\Sigma}_0 \mathbf{A}^\top = \mathbf{W} \mathbf{W}^\top$.&lt;/p&gt;
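&lt;p&gt;This invariance is easy to verify numerically with a generic Gaussian KL routine (a sketch; the function below is an illustrative helper, not a library API):&lt;/p&gt;

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    """KL[N(mu0, S0) || N(mu1, S1)] for full-covariance Gaussians."""
    k = len(mu0)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(np.linalg.solve(S1, S0))
                  + d @ np.linalg.solve(S1, d) - k + logdet1 - logdet0)

rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + M * np.eye(M)  # an invertible "Cholesky-like" map
mu0, mu1 = rng.standard_normal(M), rng.standard_normal(M)
W = rng.standard_normal((M, M))
S0 = W @ W.T + 1e-6 * np.eye(M)
S1 = np.eye(M)

# KL is unchanged under the joint invertible linear transformation by A
kl_a = gauss_kl(mu0, S0, mu1, S1)
kl_b = gauss_kl(A @ mu0, A @ S0 @ A.T, A @ mu1, A @ S1 @ A.T)
```

In the whitened case this means the KL term of the ELBO can be computed entirely in the whitened coordinates, without ever forming $\mathbf{K}_\mathbf{uu}$ explicitly in the divergence.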
&lt;h3 id="large-scale-data-with-stochastic-optimization"&gt;Large-Scale Data with Stochastic Optimization&lt;/h3&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-orange-100 dark:bg-orange-900 border-orange-500"
data-callout="warning"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-orange-600 dark:text-orange-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m-9.303 3.376c-.866 1.5.217 3.374 1.948 3.374h14.71c1.73 0 2.813-1.874 1.948-3.374L13.949 3.378c-.866-1.5-3.032-1.5-3.898 0zM12 15.75h.007v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Warning&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;Coming soon.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;!-- \int \left ( \int \log{p(\mathbf{y} \| \mathbf{f})} p(\mathbf{f} \| \mathbf{u}) \\,\mathrm{d}\mathbf{f} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \\,\mathrm{d}\mathbf{u} \newline &amp; = --&gt;
&lt;!-- \int \left ( \log{\Phi(\mathbf{y}, \mathbf{u})} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \\,\mathrm{d}\mathbf{u} \newline &amp; = --&gt;
&lt;!-- Therefore,
$$
q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu})
$$
since $\mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} = \mathbf{K}_\mathbf{uf}$.
$$
\exp \left ( - \frac{1}{2} \left ( \mathbf{u}^\top (\beta \boldsymbol{\Psi}\boldsymbol{\Psi}^\top) \mathbf{u} - 2 \beta \mathbf{y}^\top \boldsymbol{\Psi}^\top \mathbf{u} + \mathbf{u}^\top \mathbf{K}_\mathbf{uu}^{-1} \mathbf{u} \right ) \right )
$$
$$
\exp \left ( - \frac{1}{2} \left ( \mathbf{u}^\top ( \mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi}\boldsymbol{\Psi}^\top) \mathbf{u} - 2 \beta (\boldsymbol{\Psi} \mathbf{y})^\top \mathbf{u} \right ) \right )
$$ --&gt;
&lt;h2 id="links-and-further-readings"&gt;Links and Further Readings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Papers:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forerunners:&lt;/strong&gt; Deterministic Training Conditional (DTC; Csató &amp;amp; Opper, 2002&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;; Seeger, 2003&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;); Fully Independent Training Conditional (FITC; Snelson &amp;amp; Ghahramani, 2005&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;; Quinonero-Candela &amp;amp; Rasmussen, 2005&lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inter-domain Gaussian processes:&lt;/strong&gt; Lázaro-Gredilla &amp;amp; Figueiras-Vidal, 2009&lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deep Gaussian processes:&lt;/strong&gt; Damianou &amp;amp; Lawrence, 2013&lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, Salimbeni et al, 2017&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-Gaussian likelihoods:&lt;/strong&gt; Hensman et al, 2013&lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;; Dezfouli &amp;amp; Bonilla, 2015&lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unifying inducing-/pseudo-point approximations:&lt;/strong&gt; Bui et al, 2017&lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orthogonal decompositions:&lt;/strong&gt; Salimbeni et al, 2018&lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;; Shi et al, 2020&lt;sup id="fnref:15"&gt;&lt;a href="#fn:15" class="footnote-ref" role="doc-noteref"&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Convergence analysis:&lt;/strong&gt; Burt et al, 2019&lt;sup id="fnref:16"&gt;&lt;a href="#fn:16" class="footnote-ref" role="doc-noteref"&gt;16&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient sampling:&lt;/strong&gt; Wilson et al, 2020&lt;sup id="fnref:17"&gt;&lt;a href="#fn:17" class="footnote-ref" role="doc-noteref"&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Technical Reports:
&lt;ul&gt;
&lt;li&gt;
by M. Titsias&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Notes:
&lt;ul&gt;
&lt;li&gt;
by T. Bui and R. Turner&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Blog posts:
&lt;ul&gt;
&lt;li&gt;
by J. Hensman&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tiao2020svgp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;{A} {H}andbook for {S}parse {V}ariational {G}aussian {P}rocesses&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Tiao, Louis C&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;journal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tiao.io&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;2020&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://tiao.io/post/sparse-variational-gaussian-processes/&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on social media!&lt;/p&gt;
&lt;h2 id="appendix"&gt;Appendix&lt;/h2&gt;
&lt;h3 id="i"&gt;I&lt;/h3&gt;
&lt;h4 id="whitened-parameterization-1"&gt;Whitened parameterization&lt;/h4&gt;
&lt;p&gt;Recall the definition $\boldsymbol{\Lambda} \triangleq \mathbf{L}^\top \boldsymbol{\Psi}$.
Then, the mean simplifies to
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Psi}^\top \mathbf{b} = \boldsymbol{\Psi}^\top (\mathbf{L} \mathbf{b}') = (\mathbf{L}^\top \boldsymbol{\Psi})^\top \mathbf{b}' = \boldsymbol{\Lambda}^\top \mathbf{b}'.
$$&lt;p&gt;
Similarly, the covariance simplifies to
&lt;/p&gt;
$$
\begin{align*}
\mathbf{\Sigma} &amp; = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} (\mathbf{K}_\mathbf{uu} - \mathbf{W} \mathbf{W}^{\top}) \boldsymbol{\Psi} \newline &amp; =
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} (\mathbf{L} \mathbf{L}^{\top} - \mathbf{L} ({\mathbf{W}'}{\mathbf{W}'}^{\top}) \mathbf{L}^{\top}) \boldsymbol{\Psi} \newline &amp; =
\mathbf{K}_\mathbf{ff} - (\mathbf{L}^{\top} \boldsymbol{\Psi})^{\top} ( \mathbf{I}_M - {\mathbf{W}'}{\mathbf{W}'}^{\top}) (\mathbf{L}^{\top} \boldsymbol{\Psi}) \newline &amp; =
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^{\top} ( \mathbf{I}_M - {\mathbf{W}'}{\mathbf{W}'}^{\top}) \boldsymbol{\Lambda}.
\end{align*}
$$&lt;h3 id="ii"&gt;II&lt;/h3&gt;
&lt;h4 id="svgp-implementation-details"&gt;SVGP Implementation Details&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Single input index point&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here is an efficient and numerically stable way to compute $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$
for an input $\mathbf{x}$.
We take the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Cholesky decomposition: $\mathbf{L} \triangleq \mathrm{cholesky}(\mathbf{K}_\textbf{uu})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^3)$ complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\lambda}(\mathbf{x}) \triangleq \mathbf{L} \backslash \mathbf{k}_\mathbf{u}(\mathbf{x})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2)$ complexity since $\mathbf{L}$ is lower triangular; $\boldsymbol{\beta} = \mathbf{A} \backslash \mathbf{x}$ denotes the vector $\boldsymbol{\beta}$ such that $\mathbf{A} \boldsymbol{\beta} = \mathbf{x} \Leftrightarrow \boldsymbol{\beta} = \mathbf{A}^{-1} \mathbf{x}$.
Hence, $\boldsymbol{\lambda}(\mathbf{x}) = \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$s(\mathbf{x}, \mathbf{x}) \triangleq k_{\theta}(\mathbf{x}, \mathbf{x}) - \boldsymbol{\lambda}^\top(\mathbf{x}) \boldsymbol{\lambda}(\mathbf{x})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;
&lt;/p&gt;
$$
\begin{aligned}
\boldsymbol{\lambda}^\top(\mathbf{x}) \boldsymbol{\lambda}(\mathbf{x})
&amp;= \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}) \\
&amp;= \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}) \\
&amp;= \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu} \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}).
\end{aligned}
$$&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For &lt;strong&gt;whitened parameterization&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;$\mu(\mathbf{x}) \triangleq \boldsymbol{\lambda}^\top(\mathbf{x}) \mathbf{b}'$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{v}^\top(\mathbf{x}) \triangleq \boldsymbol{\lambda}^\top(\mathbf{x}) {\mathbf{W}'}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{v}^\top(\mathbf{x}) \mathbf{v}(\mathbf{x}) = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{L}^{-\top} ({\mathbf{W}'} {\mathbf{W}'}^{\top}) \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;otherwise:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) \triangleq \mathbf{L}^\top \backslash \boldsymbol{\lambda}(\mathbf{x})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2)$ complexity since $\mathbf{L}^{\top}$ is upper triangular. Further,
&lt;/p&gt;
$$
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) = \mathbf{L}^{-\top} \boldsymbol{\lambda}(\mathbf{x}) = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}) = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})
$$&lt;p&gt;
and
&lt;/p&gt;
$$
\boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-\top} = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-1}
$$&lt;p&gt;
since $\mathbf{K}_\mathbf{uu}$ is symmetric and nonsingular.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mu(\mathbf{x}) \triangleq \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{b}$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{v}^\top(\mathbf{x}) \triangleq \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{W}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{v}^\top(\mathbf{x}) \mathbf{v}(\mathbf{x}) = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-1} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\sigma^2(\mathbf{x}) \triangleq s(\mathbf{x}, \mathbf{x}) + \mathbf{v}^\top(\mathbf{x}) \mathbf{v}(\mathbf{x})$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Return $\mathcal{N}(f(\mathbf{x}) ; \mu(\mathbf{x}), \sigma^2(\mathbf{x}))$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
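&lt;p&gt;The six steps above can be sketched in a few lines of NumPy (a minimal illustration, not a library implementation; the function and argument names are hypothetical, and &lt;code&gt;np.linalg.solve&lt;/code&gt; stands in for a dedicated triangular solve routine):&lt;/p&gt;

```python
import numpy as np


def single_point_predictive(k_xx, k_ux, Kuu, W, b, whiten=True):
    """Compute q(f(x)) = N(mu, sigma^2) for a single input x.

    k_xx: scalar k(x, x); k_ux: (M,) vector k_u(x); Kuu: (M, M);
    W, b: variational parameters of shape (M, M) and (M,).
    """
    L = np.linalg.cholesky(Kuu)            # step 1: L L^T = Kuu, O(M^3)
    lam = np.linalg.solve(L, k_ux)         # step 2: lambda(x) = L^{-1} k_u(x)
    s = k_xx - lam @ lam                   # step 3: k(x, x) - lambda^T lambda
    if whiten:
        phi = lam                          # step 4 (whitened): pair lambda with (W', b')
    else:
        phi = np.linalg.solve(L.T, lam)    # step 4 (otherwise): psi = Kuu^{-1} k_u(x)
    mu = phi @ b                           # predictive mean
    v = phi @ W                            # v^T = phi^T W
    sigma2 = s + v @ v                     # step 5: predictive variance
    return mu, sigma2                      # step 6
```

&lt;p&gt;With the correspondence $\mathbf{b} = \mathbf{L} \mathbf{b}'$ and $\mathbf{W} = \mathbf{L} \mathbf{W}'$, the whitened and unwhitened paths return identical results.&lt;/p&gt;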
&lt;p&gt;&lt;em&gt;Multiple input index points&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is simple to extend this to compute $q_{\boldsymbol{\phi}}(\mathbf{f})$ for an
arbitrary number of index points $\mathbf{X}$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Cholesky decomposition: $\mathbf{L} = \mathrm{cholesky}(\mathbf{K}_\textbf{uu})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^3)$ complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\Lambda} = \mathbf{L} \backslash \mathbf{K}_\mathbf{uf}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}$ is lower triangular and $\mathbf{K}_\mathbf{uf}$ has $N$ columns; $\mathbf{B} = \mathbf{A} \backslash \mathbf{X}$ denotes the matrix $\mathbf{B}$ such that $\mathbf{A} \mathbf{B} = \mathbf{X} \Leftrightarrow \mathbf{B} = \mathbf{A}^{-1} \mathbf{X}$.
Hence, $\boldsymbol{\Lambda} = \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf}$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{S} \triangleq \mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;
&lt;/p&gt;
$$
\begin{aligned}
\boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda}
&amp;= \mathbf{K}_\mathbf{fu} \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf} \\
&amp;= \mathbf{K}_\mathbf{fu} \mathbf{K}_\textbf{uu}^{-1} \mathbf{K}_\mathbf{uf} \\
&amp;= \mathbf{K}_\mathbf{fu} \mathbf{K}_\textbf{uu}^{-1} (\mathbf{K}_\textbf{uu}) \mathbf{K}_\textbf{uu}^{-1} \mathbf{K}_\mathbf{uf} \\
&amp;= \boldsymbol{\Psi}^\top \mathbf{K}_\textbf{uu} \boldsymbol{\Psi}.
\end{aligned}
$$&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For &lt;strong&gt;whitened parameterization&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;$\boldsymbol{\mu} \triangleq \boldsymbol{\Lambda}^\top \mathbf{b}'$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{V}^\top \triangleq \boldsymbol{\Lambda}^\top {\mathbf{W}'}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{V}^\top \mathbf{V} = \mathbf{K}_\mathbf{fu} \mathbf{L}^{-\top} ({\mathbf{W}'} {\mathbf{W}'}^{\top}) \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf}.$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;otherwise:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\Psi} = \mathbf{L}^{\top} \backslash \boldsymbol{\Lambda}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}^{\top}$ is upper triangular and $\boldsymbol{\Lambda}$ has $N$ columns. Further,&lt;/p&gt;
$$
\boldsymbol{\Psi} = \mathbf{L}^{-\top} \boldsymbol{\Lambda} = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf} = (\mathbf{L}\mathbf{L}^\top)^{-1} \mathbf{K}_\mathbf{uf} = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf},
$$&lt;p&gt;
and
&lt;/p&gt;
$$
\boldsymbol{\Psi}^\top = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-\top} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-1},
$$&lt;p&gt;
since $\mathbf{K}_\mathbf{uu}$ is symmetric and nonsingular.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\boldsymbol{\mu} \triangleq \boldsymbol{\Psi}^\top \mathbf{b}$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{V}^\top \triangleq \boldsymbol{\Psi}^\top \mathbf{W}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{V}^\top \mathbf{V} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-1} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf}$.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{\Sigma} \triangleq \mathbf{S} + \mathbf{V}^\top \mathbf{V}$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Return $\mathcal{N}(\mathbf{f} ; \boldsymbol{\mu}, \mathbf{\Sigma})$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In TensorFlow, this looks something like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;variational_predictive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Knn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Kmm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Kmn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;whiten&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Kmm&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# number of inducing points, inferred from Kmm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Kmm&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# L L^T = Kmm + jitter I_m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triangular_solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Kmn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Lambda = L^{-1} Kmn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Knn&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Knn - Lambda^T Lambda&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Phi = L^{-T} L^{-1} Kmn = Kmm^{-1} Kmn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;whiten&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triangular_solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# U = V^T = Phi^T W&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Phi^T b&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# S + UU^T = S + V^T V&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="iii"&gt;III&lt;/h3&gt;
&lt;h4 id="optimal-variational-distribution-in-general"&gt;Optimal variational distribution (in general)&lt;/h4&gt;
&lt;p&gt;Taking the functional derivative of the ELBO with respect to $q_{\boldsymbol{\phi}}(\mathbf{u})$, we get
&lt;/p&gt;
$$
\begin{align*}
\frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) &amp; =
\frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left ( \int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \right ) \newline &amp; =
\int \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left ( \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \right ) \,\mathrm{d}\mathbf{u} \newline &amp; =
\begin{split}
&amp; \int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \left ( \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u}) \right ) + \newline
&amp; \qquad q_{\boldsymbol{\phi}}(\mathbf{u}) \left ( \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) \,\mathrm{d}\mathbf{u}
\end{split}
\newline &amp; =
\int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} +
q_{\boldsymbol{\phi}}(\mathbf{u}) \left ( -\frac{1}{q_{\boldsymbol{\phi}}(\mathbf{u})} \right ) \,\mathrm{d}\mathbf{u}
\newline &amp; =
\int \log{\Phi(\mathbf{y}, \mathbf{u})} + \log{p(\mathbf{u})} - \log{q_{\boldsymbol{\phi}}(\mathbf{u})} - 1 \,\mathrm{d}\mathbf{u}.
\end{align*}
$$&lt;p&gt;
Setting this expression to zero, we have
&lt;/p&gt;
$$
\begin{align*}
\log{q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})} &amp; = \log{\Phi(\mathbf{y}, \mathbf{u})} + \log{p(\mathbf{u})} - 1 \\\\
\Rightarrow \qquad
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) &amp; \propto \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}).
\end{align*}
$$&lt;h3 id="iv"&gt;IV&lt;/h3&gt;
&lt;h4 id="variational-lower-bound-partial-for-gaussian-likelihoods"&gt;Variational lower bound (partial) for Gaussian likelihoods&lt;/h4&gt;
&lt;p&gt;To carry out this derivation, we will need to recall the following two simple
identities. First, we can write the inner product between two vectors as the
trace of their outer product,
&lt;/p&gt;
$$
\mathbf{a}^\top \mathbf{b} = \mathrm{tr}(\mathbf{a} \mathbf{b}^\top).
$$&lt;p&gt;
Second, the relationship between the auto-correlation matrix $\mathbb{E}[\mathbf{a}\mathbf{a}^{\top}]$
and the covariance matrix,
&lt;/p&gt;
$$
\begin{align*}
\mathrm{Cov}[\mathbf{a}] &amp; = \mathbb{E}[\mathbf{a}\mathbf{a}^{\top}] - \mathbb{E}[\mathbf{a}] \, \mathbb{E}[\mathbf{a}]^\top \\\\
\Leftrightarrow \quad
\mathbb{E}[\mathbf{a}\mathbf{a}^{\top}] &amp; = \mathrm{Cov}[\mathbf{a}] + \mathbb{E}[\mathbf{a}] \, \mathbb{E}[\mathbf{a}]^\top
\end{align*}
$$&lt;p&gt;
These allow us to write
&lt;/p&gt;
$$
\begin{align*}
\log{\Phi(\mathbf{y}, \mathbf{u})} &amp; =
\int \log{\mathcal{N}(\mathbf{y} | \mathbf{f}, \beta^{-1} \mathbf{I})} \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f}
\newline &amp; = - \frac{\beta}{2} \int (\mathbf{y} - \mathbf{f})^{\top} (\mathbf{y} - \mathbf{f}) \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f}
\newline &amp; \quad - \frac{N}{2}\log{(2\pi\beta^{-1})}
\newline &amp; = - \frac{\beta}{2} \int \mathrm{tr} \left (\mathbf{y}\mathbf{y}^{\top} - 2 \mathbf{y}\mathbf{f}^{\top} + \mathbf{f}\mathbf{f}^{\top} \right) \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f}
\newline &amp; \quad - \frac{N}{2}\log{(2\pi\beta^{-1})}
\newline &amp; = - \frac{\beta}{2} \mathrm{tr} \left (\mathbf{y}\mathbf{y}^{\top} - 2 \mathbf{y}\mathbf{m}^{\top} + \mathbf{S} + \mathbf{m} \mathbf{m}^{\top} \right)
\newline &amp; \quad - \frac{N}{2}\log{(2\pi\beta^{-1})}
\newline &amp; = - \frac{\beta}{2} (\mathbf{y} - \mathbf{m})^{\top} (\mathbf{y} - \mathbf{m}) - \frac{N}{2}\log{(2\pi\beta^{-1})}
\newline &amp; \quad - \frac{\beta}{2} \mathrm{tr}(\mathbf{S})
\newline &amp; = \log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \beta^{-1} \mathbf{I} )} - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{align*}
$$&lt;h3 id="v"&gt;V&lt;/h3&gt;
&lt;h4 id="optimal-variational-distribution-for-gaussian-likelihoods"&gt;Optimal variational distribution for Gaussian likelihoods&lt;/h4&gt;
&lt;p&gt;Firstly, the optimal variational distribution can be found in closed-form as
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) &amp; \propto \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \\\\
&amp; \propto \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^\top \mathbf{u}, \beta^{-1} \mathbf{I}) \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_\mathbf{uu}) \\\\ &amp; \propto
\exp \left ( - \frac{\beta}{2} (\mathbf{y} - \boldsymbol{\Psi}^\top \mathbf{u})^\top
(\mathbf{y} - \boldsymbol{\Psi}^\top \mathbf{u}) - \frac{1}{2} \mathbf{u}^\top \mathbf{K}_\mathbf{uu}^{-1} \mathbf{u} \right ) \\\\ &amp; \propto
\exp \left ( - \frac{1}{2} \left ( \mathbf{u}^\top \mathbf{C} \mathbf{u} - 2 \beta (\boldsymbol{\Psi} \mathbf{y})^\top \mathbf{u} \right ) \right ),
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{C} \triangleq \mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi} \boldsymbol{\Psi}^\top =
\mathbf{K}_\mathbf{uu}^{-1} (\mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu} ) \mathbf{K}_\mathbf{uu}^{-1}.
$$&lt;p&gt;
By completing the square, we get
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) &amp; \propto
\exp \left ( - \frac{1}{2} (\mathbf{u} - \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y})^\top \mathbf{C} (\mathbf{u} - \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y}) \right ) \\\\ &amp; \propto
\mathcal{N}(\mathbf{u} \mid \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y}, \mathbf{C}^{-1}).
\end{align*}
$$&lt;p&gt;
We define
&lt;/p&gt;
$$
\mathbf{M} \triangleq \mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu}
$$&lt;p&gt;
so that
&lt;/p&gt;
$$
\mathbf{C} = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{M} \mathbf{K}_\mathbf{uu}^{-1},
$$&lt;p&gt;
which allows us to write
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) =
\mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu}).
$$&lt;h3 id="vi"&gt;VI&lt;/h3&gt;
&lt;h4 id="variational-lower-bound-complete-for-gaussian-likelihoods"&gt;Variational lower bound (complete) for Gaussian likelihoods&lt;/h4&gt;
&lt;p&gt;We have
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z}) &amp; =
\log \mathcal{Z} \\\\ &amp; =
\log \int \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\log \biggl[ \exp{\left(-\frac{\beta}{2} \mathrm{tr}(\mathbf{S})\right)}
\newline &amp; \qquad \cdot \int \mathcal{N}(\mathbf{y} | \boldsymbol{\Psi}^{\top} \mathbf{u}, \beta^{-1} \mathbf{I}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u} \biggr] \\\\ &amp; =
\log \int \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^{\top} \mathbf{u}, \beta^{-1} \mathbf{I}) \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_\mathbf{uu}) \,\mathrm{d}\mathbf{u} - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\\\ &amp; =
\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \beta^{-1} \mathbf{I} + \boldsymbol{\Psi}^{\top} \mathbf{K}_\textbf{uu} \boldsymbol{\Psi}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\\\ &amp; =
\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{align*}
$$&lt;h3 id="vii"&gt;VII&lt;/h3&gt;
&lt;h4 id="sgpr-implementation-details"&gt;SGPR Implementation Details&lt;/h4&gt;
&lt;p&gt;Here we provide implementation details that simultaneously minimize the
computational demands and avoid numerically unstable calculations.&lt;/p&gt;
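&lt;p&gt;To preview where this is headed, here is a minimal NumPy sketch of the resulting computation (the toy kernel and data are made up for illustration; $\boldsymbol{\Lambda}$, $\mathbf{B}$, $\mathbf{L}_\mathbf{B}$ and $\mathbf{c}$ are the quantities derived in the remainder of this section):&lt;/p&gt;

```python
import numpy as np

# Toy problem: N observations, M inducing points (all quantities synthetic).
rng = np.random.default_rng(0)
N, M, beta = 50, 5, 4.0
X = rng.uniform(0.0, 5.0, size=(N, 1))
Z = rng.uniform(0.0, 5.0, size=(M, 1))


def kernel(A, B):
    return np.exp(-0.5 * (A - B.T) ** 2)   # squared-exponential kernel


Kuu = kernel(Z, Z) + 1e-8 * np.eye(M)      # jittered for numerical stability
Kuf = kernel(Z, X)
y = rng.normal(size=N)

L = np.linalg.cholesky(Kuu)
Lam = np.linalg.solve(L, Kuf)              # Lambda = L^{-1} Kuf
U = np.sqrt(beta) * Lam                    # U = beta^{1/2} Lambda
B = U @ U.T + np.eye(M)                    # B = I + beta Lambda Lambda^T
L_B = np.linalg.cholesky(B)                # L_B L_B^T = B
c = np.sqrt(beta) * np.linalg.solve(L_B, U @ y)   # c = beta^{1/2} L_B^{-1} U y

# Quadratic term: y^T (Qff + beta^{-1} I)^{-1} y = beta y^T y - c^T c
quad = beta * (y @ y) - c @ c
# Determinant term: log det(Qff + beta^{-1} I) = log det(B) - N log(beta),
# by Sylvester's determinant identity.
logdet = 2.0 * np.sum(np.log(np.diag(L_B))) - N * np.log(beta)

# Sanity check against the direct O(N^3) computation via Qff = Lambda^T Lambda.
A_direct = Lam.T @ Lam + np.eye(N) / beta
assert np.isclose(quad, y @ np.linalg.solve(A_direct, y))
assert np.isclose(logdet, np.linalg.slogdet(A_direct)[1])
```

&lt;p&gt;Aside from the final sanity check, every step costs at most $\mathcal{O}(M^2 N)$ or $\mathcal{O}(M^3)$, so nothing of size $N \times N$ is ever decomposed or inverted.&lt;/p&gt;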
&lt;p&gt;The difficulty in calculating the ELBO stems from terms involving
the &lt;em&gt;inverse&lt;/em&gt; and the &lt;em&gt;determinant&lt;/em&gt; of $\mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}$.
More specifically, we have
&lt;/p&gt;
$$
\begin{split}
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z}) &amp; = - \frac{1}{2} \Bigl( \log \det \left ( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right ) \\\\
&amp; \qquad + \mathbf{y}^\top \left ( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right )^{-1} \mathbf{y} + N \log {2\pi} \Bigr) \\\\
&amp; \qquad - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{split}
$$&lt;p&gt;
It turns out that many of the required terms can be expressed in terms of the
symmetric positive definite matrix
&lt;/p&gt;
$$
\mathbf{B} \triangleq \mathbf{U} \mathbf{U}^\top + \mathbf{I},
$$&lt;p&gt;
where $\mathbf{U} \triangleq \beta^{\frac{1}{2}} \boldsymbol{\Lambda}$.&lt;/p&gt;
&lt;p&gt;First, let&amp;rsquo;s tackle the inverse term.
Using the Woodbury identity, we can write it as
&lt;/p&gt;
$$
\begin{align*}
\left(\mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}\right)^{-1}
&amp; = \left(\beta^{-1} \mathbf{I} + \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi}\right)^{-1} \\\\
&amp; = \beta \mathbf{I} - \beta^2 \boldsymbol{\Psi}^\top \left(\mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi} \boldsymbol{\Psi}^\top \right)^{-1} \boldsymbol{\Psi} \\\\
&amp; = \beta \left(\mathbf{I} - \beta \boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi}\right).
\end{align*}
$$&lt;p&gt;Recall that $\mathbf{C}^{-1} = \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu}$.
We can expand $\mathbf{M}$ as
&lt;/p&gt;
$$
\begin{align*}
\mathbf{M} &amp; \triangleq \mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu} \\\\
&amp; = \mathbf{L} \mathbf{L}^\top + \beta \mathbf{L} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu} \mathbf{L}^{-\top} \mathbf{L}^\top \\\\
&amp; = \mathbf{L} \left( \mathbf{I} + \beta \boldsymbol{\Lambda} \boldsymbol{\Lambda}^\top \right) \mathbf{L}^\top \\\\
&amp; = \mathbf{L} \mathbf{B} \mathbf{L}^{\top},
\end{align*}
$$&lt;p&gt;
so its inverse is simply
&lt;/p&gt;
$$
\mathbf{M}^{-1} = \mathbf{L}^{-\top} \mathbf{B}^{-1} \mathbf{L}^{-1}.
$$&lt;p&gt;
Therefore, we have
&lt;/p&gt;
$$
\begin{align*}
\mathbf{C}^{-1}
&amp; = \mathbf{K}_\mathbf{uu} \mathbf{L}^{-\top} \mathbf{B}^{-1} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uu} \\\\
&amp; = \mathbf{L} \mathbf{B}^{-1} \mathbf{L}^\top \\\\
&amp; = \mathbf{W} \mathbf{W}^\top
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{W} \triangleq \mathbf{L} \mathbf{L}_\mathbf{B}^{-\top}
$$&lt;p&gt;
and $\mathbf{L}_\mathbf{B}$ is the Cholesky factor of $\mathbf{B}$,
i.e. the lower triangular matrix such
that $\mathbf{L}_\mathbf{B}\mathbf{L}_\mathbf{B}^\top = \mathbf{B}$.
All in all, we now have
&lt;/p&gt;
$$
\begin{align*}
\left(\mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}\right)^{-1}
&amp; = \beta \left(\mathbf{I} - \beta \boldsymbol{\Psi}^\top \mathbf{W} \mathbf{W}^\top \boldsymbol{\Psi}\right),
\end{align*}
$$&lt;p&gt;
so we can compute the quadratic term in $\mathbf{y}$ as
&lt;/p&gt;
$$
\begin{align*}
\mathbf{y}^\top \left ( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right )^{-1} \mathbf{y}
&amp; = \beta \left( \mathbf{y}^\top \mathbf{y} - \beta \mathbf{y}^\top \boldsymbol{\Psi}^\top \mathbf{W} \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y} \right) \\\\
&amp; = \beta \mathbf{y}^\top \mathbf{y} - \mathbf{c}^\top \mathbf{c},
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{c} \triangleq \beta \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y} = \beta \mathbf{L}_\mathbf{B}^{-1} \boldsymbol{\Lambda} \mathbf{y} = \beta^{\frac{1}{2}} \mathbf{L}_\mathbf{B}^{-1} \mathbf{U} \mathbf{y}.
$$&lt;p&gt;Next, let&amp;rsquo;s address the determinant term.
To this end, first note that the determinant of $\mathbf{M}$ is
&lt;/p&gt;
$$
\begin{align*}
\det \left( \mathbf{M} \right) &amp; = \det \left( \mathbf{L} \mathbf{B} \mathbf{L}^{\top} \right) \\\\ &amp; =
\det \left( \mathbf{L} \right) \det \left( \mathbf{B} \right) \det \left( \mathbf{L}^{\top} \right) \\\\ &amp; =
\det \left( \mathbf{K}_\mathbf{uu} \right) \det \left( \mathbf{B} \right).
\end{align*}
$$&lt;p&gt;
Hence, the determinant of $\mathbf{C}$ is
&lt;/p&gt;
$$
\begin{align*}
\det \left( \mathbf{C} \right) &amp; =
\det \left( \mathbf{K}_\mathbf{uu}^{-1} \mathbf{M} \mathbf{K}_\mathbf{uu}^{-1} \right) \\\\ &amp; =
\frac{\det \left( \mathbf{M} \right)}{\det \left( \mathbf{K}_\mathbf{uu} \right )^2} \\\\ &amp; =
\frac{\det \left( \mathbf{B} \right)}{\det \left( \mathbf{K}_\mathbf{uu} \right )}.
\end{align*}
$$&lt;p&gt;
Therefore, by the matrix determinant lemma, we have
&lt;/p&gt;
$$
\begin{align*}
\det \left( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right) &amp; =
\det \left( \beta^{-1} \mathbf{I} + \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} \right) \\\\ &amp; =
\det \left( \mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi} \boldsymbol{\Psi}^\top \right)
\det \left( \mathbf{K}_\mathbf{uu} \right)
\det \left( \beta^{-1} \mathbf{I} \right) \\\\ &amp; =
\det \left( \mathbf{C} \right)
\det \left( \mathbf{K}_\mathbf{uu} \right)
\det \left( \beta^{-1} \mathbf{I} \right) \\\\ &amp; =
\det \left( \mathbf{B} \right) \det \left( \beta^{-1} \mathbf{I} \right).
\end{align*}
$$&lt;p&gt;
We can re-use $\mathbf{L}_\mathbf{B}$ to calculate $\det \left( \mathbf{B} \right)$
in linear time.&lt;/p&gt;
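&lt;p&gt;As a concrete illustration, here is a minimal NumPy/SciPy sketch of the quadratic and log-determinant terms derived above, using only $m \times m$ Cholesky factors. The function and array names (&lt;code&gt;quadratic_and_logdet&lt;/code&gt;, &lt;code&gt;K_uu&lt;/code&gt;, &lt;code&gt;K_uf&lt;/code&gt;, &lt;code&gt;jitter&lt;/code&gt;) are illustrative placeholders mirroring the notation here, not part of any particular library.&lt;/p&gt;

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def quadratic_and_logdet(K_uu, K_uf, y, beta, jitter=1e-6):
    """Compute y^T (Q_ff + beta^{-1} I)^{-1} y and log det(Q_ff + beta^{-1} I)
    using only m x m Cholesky factorizations, following the derivation above."""
    m, n = K_uu.shape[0], y.shape[0]
    L = cholesky(K_uu + jitter * np.eye(m), lower=True)   # K_uu = L L^T
    Lam = solve_triangular(L, K_uf, lower=True)           # Lambda = L^{-1} K_uf
    U = np.sqrt(beta) * Lam                               # U = beta^{1/2} Lambda
    L_B = cholesky(U @ U.T + np.eye(m), lower=True)       # B = U U^T + I = L_B L_B^T
    c = np.sqrt(beta) * solve_triangular(L_B, U @ y, lower=True)
    quad = beta * y @ y - c @ c                           # quadratic term
    # det(Q_ff + beta^{-1} I) = det(B) det(beta^{-1} I)
    logdet = 2.0 * np.log(np.diag(L_B)).sum() - n * np.log(beta)
    return quad, logdet
```

&lt;p&gt;Note that $\mathbf{B} = \mathbf{U}\mathbf{U}^\top + \mathbf{I} \succeq \mathbf{I}$ is always well-conditioned, so only the Cholesky of $\mathbf{K}_\mathbf{uu}$ needs jitter.&lt;/p&gt;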
&lt;p&gt;The last non-trivial component of the ELBO is the trace term, which can be
calculated as
&lt;/p&gt;
$$
\frac{\beta}{2} \mathrm{tr}(\mathbf{S}) = \frac{\beta}{2} \mathrm{tr}\left(\mathbf{K}_\mathbf{ff}\right) - \frac{1}{2} \mathrm{tr}\left(\mathbf{U} \mathbf{U}^\top \right),
$$&lt;p&gt;
since
&lt;/p&gt;
$$
\begin{align*}
\mathrm{tr}\left(\mathbf{U} \mathbf{U}^\top\right) &amp; =
\mathrm{tr}\left(\mathbf{U}^\top \mathbf{U}\right) \\\\ &amp; =
\beta \cdot \mathrm{tr}\left(\boldsymbol{\Lambda} \boldsymbol{\Lambda}^\top\right) \\\\ &amp; =
\beta \cdot \mathrm{tr}\left( \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} \right).
\end{align*}
$$&lt;p&gt;
Again, we can re-use $\mathbf{U} \mathbf{U}^\top$ computed earlier.&lt;/p&gt;
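&lt;p&gt;Putting the pieces together, the collapsed bound can be evaluated end-to-end with a single pass over these quantities. The sketch below assumes a zero-mean prior and a Gaussian likelihood with precision $\beta$; the name &lt;code&gt;collapsed_elbo&lt;/code&gt; and the &lt;code&gt;K_ff_diag&lt;/code&gt; argument (only the diagonal of $\mathbf{K}_\mathbf{ff}$ is ever needed) are illustrative choices, not library API.&lt;/p&gt;

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def collapsed_elbo(K_ff_diag, K_uu, K_uf, y, beta, jitter=1e-6):
    """Titsias' collapsed bound: log N(y | 0, Q_ff + beta^{-1} I)
    minus the trace term, evaluated via m x m factorizations only."""
    m, n = K_uu.shape[0], y.shape[0]
    L = cholesky(K_uu + jitter * np.eye(m), lower=True)
    U = np.sqrt(beta) * solve_triangular(L, K_uf, lower=True)  # beta^{1/2} Lambda
    L_B = cholesky(U @ U.T + np.eye(m), lower=True)
    c = np.sqrt(beta) * solve_triangular(L_B, U @ y, lower=True)
    quad = beta * y @ y - c @ c
    logdet = 2.0 * np.log(np.diag(L_B)).sum() - n * np.log(beta)
    # trace term: (beta/2) tr(K_ff) - (1/2) tr(U U^T)
    trace = 0.5 * beta * K_ff_diag.sum() - 0.5 * np.sum(U**2)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad) - trace
```

&lt;p&gt;The overall cost is dominated by forming $\mathbf{U}\mathbf{U}^\top$, i.e. $\mathcal{O}(nm^2)$ time and $\mathcal{O}(nm)$ memory, which is the whole point of the collapsed formulation.&lt;/p&gt;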
&lt;p&gt;Finally, let us address the posterior predictive.
Recall that
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y}, \mathbf{C}^{-1}).
$$&lt;p&gt;
Re-writing this in terms of $\mathbf{W}$, we get
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})
&amp; = \mathcal{N}\left(\mathbf{u} \mid \beta \mathbf{W} \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y}, \mathbf{W} \mathbf{W}^\top \right) \\\\
&amp; = \mathcal{N}\left(\mathbf{u} \mid \beta \mathbf{L} \mathbf{L}_\mathbf{B}^{-\top} \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y}, \mathbf{L} \mathbf{L}_\mathbf{B}^{-\top} \mathbf{L}_\mathbf{B}^{-1} \mathbf{L}^\top\right) \\\\
&amp; = \mathcal{N}\left(\mathbf{u} \mid \mathbf{L} \left(\mathbf{L}_\mathbf{B}^{-\top} \mathbf{c}\right), \mathbf{L} \mathbf{B}^{-1} \mathbf{L}^\top\right).
\end{align*}
$$&lt;p&gt;
Hence, we see that the optimal variational distribution is itself a
whitened parameterization with $\mathbf{b}' = \mathbf{L}_\mathbf{B}^{-\top} \mathbf{c}$
and $\mathbf{W}' = \mathbf{L}_\mathbf{B}^{-\top}$ (such that ${\mathbf{W}'} {\mathbf{W}'}^\top = \mathbf{B}^{-1}$).
Combined with results from a previous post,
we can directly write the predictive $q_{\boldsymbol{\phi}^{\star}}(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}) q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}$ as
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{f}) =
\mathcal{N}\left(\boldsymbol{\Lambda}^\top \mathbf{L}_\mathbf{B}^{-\top} \mathbf{c},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top \left( \mathbf{I} - \mathbf{B}^{-1} \right) \boldsymbol{\Lambda} \right).
$$&lt;p&gt;
Alternatively, we can derive this by noting the following simple identity,
&lt;/p&gt;
$$
\boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi} = \boldsymbol{\Psi}^\top \mathbf{L} \mathbf{B}^{-1} \mathbf{L}^\top \boldsymbol{\Psi} = \boldsymbol{\Lambda}^\top \mathbf{B}^{-1} \boldsymbol{\Lambda},
$$&lt;p&gt;
and applying the rules for marginalizing Gaussians to obtain
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{f})
&amp; = \mathcal{N}\left(\beta \boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} + \boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi} \right) \\\\
&amp; = \mathcal{N}\left(\beta \boldsymbol{\Lambda}^\top \mathbf{B}^{-1} \boldsymbol{\Lambda} \mathbf{y},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top \boldsymbol{\Lambda} + \boldsymbol{\Lambda}^\top \mathbf{B}^{-1} \boldsymbol{\Lambda} \right) \\\\
&amp; = \mathcal{N}\left(\boldsymbol{\Lambda}^\top \mathbf{L}_\mathbf{B}^{-\top} \mathbf{c},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top \left( \mathbf{I} - \mathbf{B}^{-1} \right) \boldsymbol{\Lambda} \right).
\end{align*}
$$&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Titsias, M. (2009, April). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574).&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Murray, I., &amp;amp; Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740).&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Hensman, J., Matthews, A. G., Filippone, M., &amp;amp; Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656).&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Csató, L., &amp;amp; Opper, M. (2002). Sparse On-line Gaussian Processes. Neural Computation, 14(3), 641-668.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations (PhD Thesis). University of Edinburgh.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Snelson, E., &amp;amp; Ghahramani, Z. (2005). Sparse Gaussian Processes using Pseudo-inputs. Advances in Neural Information Processing Systems, 18, 1257-1264.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;Quinonero-Candela, J., &amp;amp; Rasmussen, C. E. (2005). A Unifying View of Sparse Approximate Gaussian Process Regression. The Journal of Machine Learning Research, 6, 1939-1959.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Lázaro-Gredilla, M., &amp;amp; Figueiras-Vidal, A. R. (2009, December). Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Advances in Neural Information Processing Systems.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;Damianou, A., &amp;amp; Lawrence, N. D. (2013, April). Deep Gaussian Processes. In Artificial Intelligence and Statistics (pp. 207-215). PMLR.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;Salimbeni, H., &amp;amp; Deisenroth, M. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. Advances in Neural Information Processing Systems, 30.&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;Hensman, J., Fusi, N., &amp;amp; Lawrence, N. D. (2013, August). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290).&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;Dezfouli, A., &amp;amp; Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422).&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Bui, T. D., Yan, J., &amp;amp; Turner, R. E. (2017). A Unifying Framework for Gaussian Process Pseudo-point Approximations using Power Expectation Propagation. The Journal of Machine Learning Research, 18(1), 3649-3720.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;Salimbeni, H., Cheng, C. A., Boots, B., &amp;amp; Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720).&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:15"&gt;
&lt;p&gt;Shi, J., Titsias, M., &amp;amp; Mnih, A. (2020, June). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR.&amp;#160;&lt;a href="#fnref:15" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:16"&gt;
&lt;p&gt;Burt, D., Rasmussen, C. E., &amp;amp; Van Der Wilk, M. (2019, May). Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (pp. 862-871). PMLR.&amp;#160;&lt;a href="#fnref:16" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:17"&gt;
&lt;p&gt;Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., &amp;amp; Deisenroth, M. (2020, November). Efficiently Sampling Functions from Gaussian Process Posteriors. In International Conference on Machine Learning (pp. 10292-10302). PMLR.&amp;#160;&lt;a href="#fnref:17" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>