<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning |</title><link>https://tiao.io/tags/machine-learning/</link><atom:link href="https://tiao.io/tags/machine-learning/index.xml" rel="self" type="application/rss+xml"/><description>Machine Learning</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 01 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://tiao.io/media/icon_hu_9c2a75fde2335590.png</url><title>Machine Learning</title><link>https://tiao.io/tags/machine-learning/</link></image><item><title>Empirical Gaussian Processes</title><link>https://tiao.io/publications/empirical-gaussian-processes/</link><pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/empirical-gaussian-processes/</guid><description/></item><item><title>Ax: A Platform for Adaptive Experimentation</title><link>https://tiao.io/publications/ax-platform/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/ax-platform/</guid><description/></item><item><title>Probabilistic Machine Learning in the Age of Deep Learning: New Perspectives for Gaussian Processes, Bayesian Optimization and Beyond (PhD Thesis)</title><link>https://tiao.io/publications/phd-thesis/</link><pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/phd-thesis/</guid><description>&lt;p&gt;The full text is available as a single PDF file &lt;a href="phd-thesis-louis-tiao.pdf" target="_blank" rel="noopener"&gt;
here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can also find a list of contents and PDFs corresponding to each individual chapter below:&lt;/p&gt;
&lt;h3 id="table-of-contents"&gt;Table of Contents&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chapter 1: Introduction &lt;a href="contents/1 Introduction.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 2: Background &lt;a href="contents/2 Background.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 3: Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features &lt;a href="contents/3 Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 4: Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation &lt;a href="contents/4 Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 5: Bayesian Optimisation by Classification with Deep Learning and Beyond &lt;a href="contents/5 Bayesian Optimisation by Classification with Deep Learning and Beyond.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 6: Conclusion &lt;a href="contents/6 Conclusion.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Appendix A: Numerical Methods for Improved Decoupled Sampling of Gaussian Processes &lt;a href="contents/A Numerical Methods for Improved Decoupled Sampling of Gaussian Processes.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bibliography &lt;a href="contents/Bibliography.pdf" target="_blank" rel="noopener"&gt;
[PDF]&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Please find &lt;em&gt;Chapter 1: Introduction&lt;/em&gt; reproduced in full below:&lt;/p&gt;
&lt;h3 id="introduction"&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Artificial intelligence (AI) stands poised to be among the most disruptive technologies of our era. The breakneck pace of recent AI advancements has been spearheaded by machine learning (ML), particularly the resurgence of &lt;em&gt;deep learning&lt;/em&gt;. Deep learning is as old as the first general-purpose electronic computer; with roots tracing back to the 1940s and ’50s (
McCulloch &amp;amp; Pitts, 1943; Rosenblatt, 1958), the revival of deep learning, beginning in the early 2010s, was catalysed by a series of breakthroughs that shattered previously perceived limitations and captivated the collective imagination. These breakthroughs span various domains, including computer vision (Krizhevsky et al., 2012; Girshick et al., 2014; Ronneberger et al., 2015; Redmon et al., 2016), speech recognition (Hinton et al., 2012; Graves et al., 2013), natural language processing (Vaswani et al., 2017; Brown et al., 2020), protein folding (Jumper et al., 2021), generative art and artificial creativity (Goodfellow et al., 2014; Ho et al., 2020; Ramesh et al., 2022; Rombach et al., 2022), as well as reinforcement learning for robotics control and achieving superhuman-level gameplay.&lt;/p&gt;
&lt;p&gt;Nevertheless, it is crucial to view these developments as means to an ultimate end rather than an end in themselves. Arguably, the true pinnacle of AI’s capabilities lies in optimal &lt;em&gt;decision-making&lt;/em&gt;, whether that entails offering analyses and insights to aid humans in making better decisions or completely automating the decision-making process altogether. Practically any task directed towards a well-defined objective can be boiled down to a cascade of decisions. At a fundamental level, operating a vehicle involves a continuous stream of decisions involving accelerating, braking, and turning. Financial trading revolves around decisions to buy, sell, or hold various assets. Even complex engineering tasks, such as designing an aerofoil, involve a sequence of decisions about adjusting design variables to achieve desirable aerodynamic characteristics.&lt;/p&gt;
&lt;p&gt;Yet, the intricacies of decision-making surpass what any single advancement in deep learning can address. While convolutional neural networks (CNNs) can facilitate object detection tasks in autonomous vehicles, recurrent neural networks (RNNs) can aid in forecasting market dynamics for systematic trading, and physics-informed NNs can assist in predicting aerodynamic effects, it remains the case that no target or quantity of interest can be entirely known or predictable (indeed, if they were, the pursuit of predictive modelling and ML would be superfluous). Instead, predictions often prove unreliable, or at best, &lt;em&gt;uncertain&lt;/em&gt;, due to the limitations of our knowledge and the complexity and variability inherent in the underlying real-world processes. The impressive power of deep learning models often overshadows their ignorance of the limits of their own knowledge and the extent of uncertainty in their predictions. When these predictions are integrated into a sequential decision-making framework, such uncertainty can amplify, compound, and lead to catastrophic consequences. In the context of aeronautical engineering, this could result in inefficient designs; in quantitative finance, it can lead to devastating capital losses; and in autonomous driving, it can even cost lives.&lt;/p&gt;
&lt;h4 id="probabilistic-machine-learning"&gt;Probabilistic Machine Learning&lt;/h4&gt;
&lt;p&gt;Grounded in the laws of probability and Bayesian statistics (
Bayes, 1763; Laplace, 1814
), &lt;em&gt;probabilistic&lt;/em&gt; ML provides a consistent framework for systematically reasoning about the unknown. The probabilistic approach to ML acknowledges that the real world is fraught with uncertainty and embraces this uncertainty as an inherent part of decision-making. Unlike traditional methods, including those of deep learning, it recognises model predictions not as absolute truths that can be represented as single &lt;em&gt;point estimates&lt;/em&gt; produced from a deterministic mapping, but as full &lt;em&gt;probability distributions&lt;/em&gt; that capture the potential outcomes of a random variable as it propagates through some underlying data-generating process. In a &lt;em&gt;probabilistic model&lt;/em&gt;, all quantities are treated as random variables governed by probability distributions – the data are treated as observed variables, which are influenced by some underlying hidden variables, e.g., the model parameters. A prior distribution is used to express reasonable values for these hidden variables and to eliminate implausible ones. The relationship between observed and hidden variables is described using the likelihood, and the process of Bayesian inference amounts to calculating, using basic laws of probability, a posterior distribution over the hidden factors conditioned on the observed data, which can be seen as a refinement of the prior beliefs in light of new evidence. While the posterior distribution can be useful in and of itself, its primary role lies in facilitating subsequent prediction and decision-making by providing full probability distributions over predicted outcomes. This capability allows the decision-maker to assess the range of possible scenarios and their associated probabilities, enabling a more nuanced understanding of uncertainty and risk, which is indispensable in complex, dynamic environments where the repercussions of incorrect decisions can be severe. In essence, probabilistic ML equips autonomous decision-making systems with a probabilistic worldview, enabling them to navigate ambiguity and make sound decisions in the face of imperfect information.&lt;/p&gt;
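&lt;p&gt;To make this concrete in symbols (generic notation, not tied to any particular model): writing $\mathcal{D}$ for the observed data and $\boldsymbol{\theta}$ for the hidden variables, Bayesian inference combines the prior $p(\boldsymbol{\theta})$ with the likelihood $p(\mathcal{D} \mid \boldsymbol{\theta})$ to form the posterior, which in turn yields the posterior predictive distribution over a new outcome $y_*$:
$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})},
\qquad
p(y_* \mid \mathcal{D}) = \int p(y_* \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, \mathrm{d}\boldsymbol{\theta}.$$
It is the latter integral, averaging predictions over every plausible setting of $\boldsymbol{\theta}$ weighted by its posterior probability, that delivers the full distribution over outcomes just described.&lt;/p&gt;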
&lt;h4 id="probabilistic-ml-vs-deep-learning"&gt;Probabilistic ML vs. Deep Learning&lt;/h4&gt;
&lt;p&gt;While deep learning has dominated recent AI advances, probabilistic ML remains as important as ever and continues to offer valuable tools for addressing AI challenges that can not be fully resolved by deep learning alone. Although both approaches can be combined to create hybrid methods that leverage their respective strengths, some defining characteristics have traditionally set deep learning apart from probabilistic ML. Perhaps most notably, probabilistic ML approaches can achieve remarkable predictive performance even when data is scarce. In contrast, deep learning models tend to be data-intensive by nature, often demanding datasets of a scale proportional to their size (i.e., their parameter count) (
Hoffmann et al., 2022), which has seen explosive growth in recent years (
Shoeybi et al., 2019; Rae et al., 2021; Anil et al., 2023; OpenAI, 2023; Touvron et al., 2023
). With that being said, inference in many probabilistic models poses computational problems that are difficult to scale. On the other hand, deep learning approaches have excelled in scalability, a key factor contributing to their widespread success. This scalability is bolstered by their compatibility with various speed-enhancing mechanisms such as stochastic optimisation, specialised hardware accelerators (GPUs and TPUs), as well as distributed and/or cloud-based computing infrastructure. To bridge this gap, substantial research effort has been devoted to enabling probabilistic ML to benefit from these advantages through optimisation-based approximations to Bayesian inference (
Jordan et al., 1998).&lt;/p&gt;
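&lt;p&gt;To give a flavour of what such optimisation-based approximations involve (a standard identity, stated here for illustration): variational inference replaces the exact posterior with a tractable distribution $q(\boldsymbol{\theta})$ chosen to maximise the evidence lower bound
$$\log p(\mathcal{D}) \geq \mathbb{E}_{q(\boldsymbol{\theta})}\big[\log p(\mathcal{D} \mid \boldsymbol{\theta})\big] - \mathrm{KL}\big[q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta})\big],$$
an objective amenable to the same stochastic optimisation, hardware acceleration, and distributed computation that underpin modern deep learning.&lt;/p&gt;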
&lt;p&gt;Moreover, as mentioned earlier, these paradigms are by no means mutually exclusive. Indeed, it is often possible to directly extend existing models with a Bayesian treatment of their parameters, adding a layer of probabilistic reasoning to the model, and allowing it to not only make predictions but also estimate the uncertainty associated with those predictions. An excellent example is the BNN, which treats the weights as hidden variables and leverages posterior inference to provide predictions while estimating associated uncertainties, delivering a more robust and principled approach to deep learning (
MacKay, 1992; Neal, 1995; Blundell et al., 2015
).&lt;/p&gt;
&lt;p&gt;The Bayesian formalism naturally gives rise to many popular methods and paradigms, often in the form of point estimates or other kinds of approximations. The quintessential example of this is found in linear regression, in particular, in ridge and lasso regression (
Tibshirani, 1996), which correspond variously to maximum &lt;em&gt;a posteriori&lt;/em&gt; (MAP) estimates in Bayesian linear regression (BLR) models with prior distributions possessing different sparsity-inducing characteristics – more broadly, mitigations against over-fitting tend to arise organically in Bayesian methods, which is why they are frequently characterised as being fundamentally more robust against over-fitting. Likewise, the once &lt;em&gt;à la mode&lt;/em&gt; support vector machines (SVMs) can be seen as MAP estimates for a class of nonparametric Bayesian models (Opper &amp;amp; Winther, 2000), dropout (Srivastava et al., 2014) in NNs can be seen as a variational approximation to exact inference in BNNs (Gal &amp;amp; Ghahramani, 2016), and unsupervised learning methods such as factor analysis (FA) (Spearman, 1904) and principal component analysis (PCA) (Pearson, 1901) are instances of a class of latent variable models (LVMs) known as linear-Gaussian factor models (Roweis &amp;amp; Ghahramani, 1999), to name just a few examples. Time and again, classical approaches have not only benefitted from being viewed through the Bayesian perspective but have also been enriched and redefined by the depth of insights this framework provides.&lt;/p&gt;
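&lt;p&gt;To spell out the first of these correspondences (in generic notation): for a linear model with Gaussian observation noise, the MAP estimate
$$\hat{\boldsymbol{\beta}}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\beta}} \big\{ \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) + \log p(\boldsymbol{\beta}) \big\}$$
reduces to ridge regression, $\arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \boldsymbol{\beta} \rVert_2^2$, when the prior on $\boldsymbol{\beta}$ is an isotropic Gaussian, and to the lasso, with an $\ell_1$ penalty in place of the squared $\ell_2$ penalty, when the prior is Laplace.&lt;/p&gt;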
&lt;h3 id="thesis-goals"&gt;Thesis Goals&lt;/h3&gt;
&lt;p&gt;The over-arching goal of this thesis is to continue advancing the integration and cross-pollination between deep learning and probabilistic ML. We aim to further the interplay between these two fields, both by incorporating probabilistic interpretations and uncertainty quantification into popular deep learning frameworks, and by leveraging the representational power of deep NNs to improve established Bayesian methods. This dual-pronged approach provides fresh perspectives and taps the complementary strengths of both paradigms, advancing the foundations of AI and facilitating the development of more capable and dependable decision support frameworks. Ultimately, we strive to unlock the potential of deep learning within high-impact probabilistic ML methodologies, and to lend useful Bayesian perspectives on current deep learning techniques.&lt;/p&gt;
&lt;h4 id="gaussian-process-models"&gt;Gaussian Process Models&lt;/h4&gt;
&lt;p&gt;Arguably, no family of probabilistic models embodies the ethos of probabilistic ML and illustrates its nuances and parallels with deep learning quite like the GP. Accordingly, GPs shall occupy a prominent place in our thesis. In particular, GPs stand out as the ideal choice when dealing with limited data, offer the flexibility to encode prior beliefs through the covariance function, and provide predictive uncertainty estimates with a fine calibration that is second to none. Conversely, they are challenging to scale to large datasets, a limitation that has spurred extensive research and development efforts. Furthermore, in contrast to deep learning models, which are often lauded for their ability to automatically uncover valuable patterns and features in data, GPs have at times been dismissed as unsophisticated smoothing mechanisms. Despite these apparent disparities, GPs are intricately connected to NNs in numerous ways. Among these, one of the most classical and well-known relationships is the convergence of single-layer NNs with randomly initialised weights toward GPs in the infinite-width limit (Neal, 1995). Similar links have also been identified between GPs and infinitely wide &lt;em&gt;deep&lt;/em&gt; NNs (Lee et al., 2017; Matthews et al., 2018
).&lt;/p&gt;
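&lt;p&gt;For reference, the construction underlying this classical limit can be stated schematically (in notation introduced here purely for illustration): a single-layer network $f(\mathbf{x}) = b + \sum_{j=1}^{J} v_j \, h(\mathbf{x}; \mathbf{u}_j)$ with independent random parameters and output-weight variance scaled as $\sigma_v^2 / J$ has, by the central limit theorem, jointly Gaussian function values at any finite set of inputs as $J \to \infty$, and therefore converges to a GP with covariance function
$$k(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 \, \mathbb{E}_{\mathbf{u}}\big[ h(\mathbf{x}; \mathbf{u}) \, h(\mathbf{x}'; \mathbf{u}) \big].$$&lt;/p&gt;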
&lt;p&gt;In an effort to elevate the representational capabilities of GPs to a level comparable with deep NNs, deep GPs (DGPs) (Damianou &amp;amp; Lawrence, 2013) stack together multiple layers of GPs. Additional efforts to construct efficient sparse GP approximations have leveraged the advantageous properties of computations on the hypersphere (Dutordoir et al., 2020), which has led to DGP models in which the propagation of posterior predictive means is equivalent to a forward pass through a deep neural network (NN) (Sun et al., 2020; Dutordoir et al., 2021). Notably, as a side effect, this model effectively provides uncertainty estimates for deep NNs through its predictive variance. Among the contributions of our thesis is the further development of this framework, integrating cutting-edge techniques (Salimbeni et al., 2018; Shi et al., 2020
) to address some of its practical limitations, thereby narrowing the performance gap between GPs and deep NNs.&lt;/p&gt;
&lt;p&gt;Probabilistic models, serving a crucial role as decision support tools, routinely aid scientific discovery in fields such as physics and astronomy, guiding advancements in areas of medicine and healthcare encompassing bioinformatics, epidemiology, and medical diagnosis. Beyond that, these models have wide-ranging applications in economics, econometrics, and the social sciences. Moreover, they are indispensable in various engineering disciplines, such as robotics and environmental engineering. Among the many probabilistic models, GPs stand out as a powerful driving force behind a number of important sequential decision-making frameworks, including active learning (
Houlsby et al., 2011) and reinforcement learning (Deisenroth &amp;amp; Rasmussen, 2011), and the broader area of probabilistic numerics at large (Hennig et al., 2022). Notably, Bayesian optimisation (BO) (Brochu et al., 2010; Shahriari et al., 2015; Garnett, 2023
) is one major area that relies heavily on GPs and will feature extensively in our thesis.&lt;/p&gt;
&lt;h4 id="bayesian-optimisation"&gt;Bayesian optimisation&lt;/h4&gt;
&lt;p&gt;BO is a powerful methodology dedicated to the global optimisation of complex and resource-intensive objective functions. In contrast to classical optimisation methods, BO excels even when dealing with functions about which few strong assumptions can be made and for which no guarantees are available. These functions may be non-convex, possess no gradients, lack a well-defined mathematical form, and be observable only indirectly through noisy measurements.&lt;/p&gt;
&lt;p&gt;At its core, BO is a sequential decision-making algorithm.&lt;/p&gt;
&lt;p&gt;It relies on observations from past function evaluations to determine the next candidate location for evaluation in pursuit of optimal solutions. BO leverages a probabilistic model, often a GP, to represent its knowledge and beliefs about the unknown function. This model is continuously updated with the acquisition of each new observation, enabling the algorithm to adapt its behaviour and make sound decisions based on the evolving information.&lt;/p&gt;
&lt;p&gt;BO effectively manages uncertainty inherent in such sequential decision-making processes by making use of the probabilistic model to the fullest, harnessing the entire predictive distribution, particularly, the predictive uncertainty, to select promising candidate solutions that bring the most value to the optimisation process. This generally consists not merely of those most likely to optimise the objective function (i.e., &lt;em&gt;exploiting&lt;/em&gt; that which is known), but also those likely to reveal the most knowledge and information about the function itself (i.e., &lt;em&gt;exploring&lt;/em&gt; that which remains unknown).&lt;/p&gt;
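&lt;p&gt;The loop just described can be summarised in a few lines of code. The following is a minimal, self-contained sketch, not anything prescribed by this thesis: the toy objective, the grid of candidates, and the choice of a scikit-learn GP surrogate with the expected improvement acquisition function are all illustrative placeholders.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # Stand-in for an expensive, black-box objective (minimisation).
    return np.sin(3.0 * x) + 0.1 * x ** 2


rng = np.random.default_rng(0)
candidates = np.linspace(-3.0, 3.0, 500)[:, None]  # dense grid standing in for the search space
X = rng.uniform(-3.0, 3.0, size=(3, 1))            # a handful of initial design points
y = objective(X).ravel()

for _ in range(10):  # the sequential decision-making loop
    # Re-fit the probabilistic surrogate to all observations gathered so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement: exploits low predicted means and explores high uncertainty.
    incumbent = y.min()
    improvement = incumbent - mu
    z = improvement / np.maximum(sigma, 1e-12)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)

    # Evaluate the objective at the most promising candidate and update the data.
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, objective(x_next).item())

print("best observed value:", y.min())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each iteration re-fits the surrogate, scores every candidate by its expected improvement over the incumbent, and evaluates the objective where that score is highest, which is precisely the balance between exploitation and exploration described above.&lt;/p&gt;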
&lt;p&gt;This pronounced emphasis on well-calibrated uncertainty distinguishes BO as one of the standout “killer apps” for GPs and a jewel in the crown of probabilistic ML applications. In practice, BO has proven instrumental across science, engineering, and industry, where efficiency and cost-effectiveness are paramount. Its applications include protein engineering (
Romero et al., 2013; Hie &amp;amp; Yang, 2022), material discovery (Seko et al., 2015), experimental physics (e.g., experiments involving ultra-cold atoms (Wigley et al., 2016) and free-electron lasers (Duris et al., 2020)), environmental monitoring (sensor placement) (Garnett et al., 2010; Marchant &amp;amp; Ramos, 2012), and the design of aerodynamic aerofoils (Forrester &amp;amp; Keane, 2009; Lam et al., 2018), integrated circuits (Lyu et al., 2017; Torun et al., 2018), broadband high-efficiency power amplifiers (Chen et al., 2015), and fast-charging protocols for lithium-ion batteries (Attia et al., 2020). Notably, it has played a crucial role in automating the hyperparameter tuning of various ML models (Snoek et al., 2012; Turner et al., 2021
), especially deep learning models, thus representing yet another way in which probabilistic ML has contributed to the advancement of deep learning.&lt;/p&gt;
&lt;p&gt;However, GPs are not universally suitable for all BO problem scenarios. They are most effective when dealing with smooth, stationary functions with homoscedastic noise and a relatively modest input dimensionality. Additionally, GPs are easiest to work with for functions with a single output and purely continuous inputs. While a surprisingly wide array of real-world challenges satisfy these conditions, many high-impact problems, such as gene and protein design, which involves sequential inputs (
Romero et al., 2013; Gonzalez et al., 2015; Yang et al., 2019; Moss et al., 2020; Hie &amp;amp; Yang, 2022); neural architecture search (NAS), which involves structured inputs with intricate conditional dependencies; and automotive safety engineering, which involves numerous constraints and multiple objectives, clearly fall outside of this scope. This is not to say that GPs cannot be extended to such challenging scenarios. However, such extensions almost always come at a cost. Consequently, it makes sense to appeal to alternative modelling paradigms more naturally suited to specific tasks, e.g., employing random forests (RFs) to handle discrete and structured inputs, or deep NNs for capturing nonstationary behaviour and dealing with multiple objectives. A major contribution of this thesis is the introduction of a new formulation of BO that seamlessly accommodates virtually any modelling paradigm, including deep learning, without any compromise.&lt;/p&gt;
&lt;h3 id="thesis-overview"&gt;Thesis Overview&lt;/h3&gt;
&lt;p&gt;The core contributions of our thesis are summarised as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-orthogonal-sparse-spherical-gp" label="item:contrib-orthogonal-sparse-spherical-gp"&gt;&lt;/span&gt; We improve upon the framework for sparse hyperspherical GP approximations that employ nonlinear activations as inter-domain inducing features. This framework serves as a bridge between GPs and NNs, with posterior predictive mean taking the form of single-layer feedforward NNs. Our thesis examines some practical issues associated with this approach and proposes an extension that takes advantage of the orthogonal decoupling of GPs to mitigate these limitations. In particular, we introduce spherical inter-domain features to construct more flexible data-dependent basis functions for both the principal and orthogonal components of the GP approximation. We demonstrate that incorporating orthogonal inducing variables under this framework not only alleviates these shortcomings but also offers superior scalability compared to alternative strategies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-cycle-bayes" label="item:contrib-cycle-bayes"&gt;&lt;/span&gt; We provide a probabilistic perspective on cycle-consistent adversarial networks (CYCLEGANs), a cutting-edge deep generative model for style transfer and image-to-image translation. Specifically, we frame the problem of learning cross-domain correspondences without paired data as Bayesian inference in a latent variable model (LVM), in which the goal is to uncover the hidden representations of entities from one domain as entities in another. First, we introduce implicit LVMs, which allow flexible prior specification over latent representations as implicit distributions. Next, we develop a new variational inference (VI) framework that minimises a symmetrised statistical divergence between the variational and true joint distributions. Finally, we show that CYCLEGANs emerge as a closely-related variant of our framework, providing a useful interpretation as a Bayesian approximation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-bore" label="item:contrib-bore"&gt;&lt;/span&gt; We introduce a model-agnostic formulation of BO based on classification. Building on the established links between class-probability estimation (CPE), density-ratio estimation (DRE), and the improvement-based acquisition functions, we reformulate the acquisition function as a binary classifier over candidate solutions. This approach eliminates the need for an explicit probabilistic model of the objective function and casts aside the limitations of tractability constraints. As a result, our model-agnostic BO approach substantially broadens its applicability across diverse problem scenarios, accommodating flexible and scalable modelling paradigms such as deep learning without necessitating approximations or sacrificing expressive and representational capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Accordingly, our thesis is organised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Chapter 2 (Background) lays the necessary groundwork for our thesis. We begin by outlining the fundamental principles of probability and Bayesian statistics, which form the basis of probabilistic ML. Additionally, we introduce the widely-adopted method of approximate Bayesian inference known as VI. Our discussion underscores the central role played by statistical divergences, prompting us to delve into a larger family of divergences and motivating our discussion of DRE. With a solid foundation in place, we shift our focus to GPs, providing an introductory overview and highlighting the most commonly-used sparse approximations. Finally, we conclude this background chapter by introducing the basic concepts behind BO.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 3 (Orthogonally-Decoupled Sparse GPs with Spherical Inducing Features) examines orthogonally-decoupled sparse GPs with spherical NN activation features, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 4 (Cycle-Consistent Adversarial Learning as Bayesian Inference) examines cycle-consistent adversarial networks (CYCLEGANs) from the perspective of approximate Bayesian inference, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 5 (Bayesian Optimization by Density-Ratio Estimation) examines our model-agnostic approach to BO based on binary classification and DRE, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 6 (Conclusion) brings this thesis to a close by reflecting on our main contributions and situating them in the broader landscape of probabilistic methods in ML. Finally, we conclude by presenting our outlook on the avenues for future research and development in this rapidly evolving field.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="references"&gt;References&lt;/h3&gt;
&lt;div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0" line-spacing="2"&gt;
&lt;div id="ref-anil2023palm" class="csl-entry"&gt;
&lt;p&gt;Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. (2023). Palm 2 technical report. &lt;em&gt;arXiv Preprint arXiv:2305.10403&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-attia2020closed" class="csl-entry"&gt;
&lt;p&gt;Attia, P. M., Grover, A., Jin, N., Severson, K. A., Markov, T. M., Liao, Y.-H., Chen, M. H., Cheong, B., Perkins, N., Yang, Z., et al. (2020). Closed-loop optimization of fast-charging protocols for batteries with machine learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;578&lt;/em&gt;(7795), 397–402.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bartholomew2011latent" class="csl-entry"&gt;
&lt;p&gt;Bartholomew, D. J., Knott, M., &amp;amp; Moustaki, I. (2011). &lt;em&gt;Latent variable models and factor analysis: A unified approach&lt;/em&gt;. John Wiley &amp;amp; Sons.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bayes1763lii" class="csl-entry"&gt;
&lt;p&gt;Bayes, T. (1763). LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F.R.S., communicated by Mr. Price, in a letter to John Canton, A.M.F.R.S. &lt;em&gt;Philosophical Transactions of the Royal Society of London&lt;/em&gt;, &lt;em&gt;53&lt;/em&gt;, 370–418.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-blundell2015weight" class="csl-entry"&gt;
&lt;p&gt;Blundell, C., Cornebise, J., Kavukcuoglu, K., &amp;amp; Wierstra, D. (2015). Weight uncertainty in neural network. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1613–1622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brochu2010tutorial" class="csl-entry"&gt;
&lt;p&gt;Brochu, E., Cora, V. M., &amp;amp; De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1012.2599&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brown2020language" class="csl-entry"&gt;
&lt;p&gt;Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 1877–1901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-chen2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Chen, P., Merrick, B. M., &amp;amp; Brazil, T. J. (2015). Bayesian optimization for broadband high-efficiency power amplifier designs. &lt;em&gt;IEEE Transactions on Microwave Theory and Techniques&lt;/em&gt;, &lt;em&gt;63&lt;/em&gt;(12), 4263–4272.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-damianou2013deep" class="csl-entry"&gt;
&lt;p&gt;Damianou, A., &amp;amp; Lawrence, N. D. (2013). Deep gaussian processes. &lt;em&gt;Artificial Intelligence and Statistics&lt;/em&gt;, 207–215.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-deisenroth2011pilco" class="csl-entry"&gt;
&lt;p&gt;Deisenroth, M., &amp;amp; Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. &lt;em&gt;Proceedings of the 28th International Conference on Machine Learning (ICML-11)&lt;/em&gt;, 465–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-duris2020bayesian" class="csl-entry"&gt;
&lt;p&gt;Duris, J., Kennedy, D., Hanuka, A., Shtalenkova, J., Edelen, A., Baxevanis, P., Egger, A., Cope, T., McIntire, M., Ermon, S., et al. (2020). Bayesian optimization of a free-electron laser. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;124&lt;/em&gt;(12), 124801.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2020sparse" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Durrande, N., &amp;amp; Hensman, J. (2020). Sparse Gaussian processes with spherical harmonic features. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 2793–2802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2021deep" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Hensman, J., Wilk, M. van der, Ek, C. H., Ghahramani, Z., &amp;amp; Durrande, N. (2021). Deep neural networks as point estimates for deep Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;34&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-forrester2009recent" class="csl-entry"&gt;
&lt;p&gt;Forrester, A. I., &amp;amp; Keane, A. J. (2009). Recent advances in surrogate-based optimization. &lt;em&gt;Progress in Aerospace Sciences&lt;/em&gt;, &lt;em&gt;45&lt;/em&gt;(1-3), 50–79.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gal2016dropout" class="csl-entry"&gt;
&lt;p&gt;Gal, Y., &amp;amp; Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1050–1059.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett_bayesoptbook_2023" class="csl-entry"&gt;
&lt;p&gt;Garnett, R. (2023). &lt;em&gt;&lt;span class="nocase"&gt;Bayesian Optimization&lt;/span&gt;&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett2010bayesian" class="csl-entry"&gt;
&lt;p&gt;Garnett, R., Osborne, M. A., &amp;amp; Roberts, S. J. (2010). Bayesian optimization for sensor set selection. &lt;em&gt;Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks&lt;/em&gt;, 209–219.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gelman2013bayesian" class="csl-entry"&gt;
&lt;p&gt;Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., &amp;amp; Rubin, D. B. (2013). &lt;em&gt;Bayesian data analysis&lt;/em&gt;. CRC press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-girshick2014rich" class="csl-entry"&gt;
&lt;p&gt;Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 580–587.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gonzalez2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Gonzalez, J., Longworth, J., James, D. C., &amp;amp; Lawrence, N. D. (2015). Bayesian optimization for synthetic gene design. &lt;em&gt;arXiv Preprint arXiv:1505.01627&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-goodfellow2014generative" class="csl-entry"&gt;
&lt;p&gt;Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., &amp;amp; Bengio, Y. (2014). Generative adversarial networks. &lt;em&gt;arXiv Preprint arXiv:1406.2661&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-graves2013speech" class="csl-entry"&gt;
&lt;p&gt;Graves, A., Mohamed, A., &amp;amp; Hinton, G. (2013). Speech recognition with deep recurrent neural networks. &lt;em&gt;2013 IEEE International Conference on Acoustics, Speech and Signal Processing&lt;/em&gt;, 6645–6649.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hennig2022probabilistic" class="csl-entry"&gt;
&lt;p&gt;Hennig, P., Osborne, M. A., &amp;amp; Kersting, H. P. (2022). &lt;em&gt;Probabilistic numerics&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hie2022adaptive" class="csl-entry"&gt;
&lt;p&gt;Hie, B. L., &amp;amp; Yang, K. K. (2022). Adaptive machine learning for protein engineering. &lt;em&gt;Current Opinion in Structural Biology&lt;/em&gt;, &lt;em&gt;72&lt;/em&gt;, 145–152.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hinton2012deep" class="csl-entry"&gt;
&lt;p&gt;Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. &lt;em&gt;IEEE Signal Processing Magazine&lt;/em&gt;, &lt;em&gt;29&lt;/em&gt;(6), 82–97.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ho2020denoising" class="csl-entry"&gt;
&lt;p&gt;Ho, J., Jain, A., &amp;amp; Abbeel, P. (2020). Denoising diffusion probabilistic models. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 6840–6851.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hoffmann2022training" class="csl-entry"&gt;
&lt;p&gt;Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). Training compute-optimal large language models. &lt;em&gt;arXiv Preprint arXiv:2203.15556&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-houlsby2011bayesian" class="csl-entry"&gt;
&lt;p&gt;Houlsby, N., Huszár, F., Ghahramani, Z., &amp;amp; Lengyel, M. (2011). Bayesian active learning for classification and preference learning. &lt;em&gt;arXiv Preprint arXiv:1112.5745&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jordan1998introduction" class="csl-entry"&gt;
&lt;p&gt;Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., &amp;amp; Saul, L. K. (1998). An introduction to variational methods for graphical models. &lt;em&gt;Learning in Graphical Models&lt;/em&gt;, 105–161.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jumper2021highly" class="csl-entry"&gt;
&lt;p&gt;Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;596&lt;/em&gt;(7873), 583–589.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-krizhevsky2012imagenet" class="csl-entry"&gt;
&lt;p&gt;Krizhevsky, A., Sutskever, I., &amp;amp; Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lam2018advances" class="csl-entry"&gt;
&lt;p&gt;Lam, R., Poloczek, M., Frazier, P., &amp;amp; Willcox, K. E. (2018). Advances in bayesian optimization with applications in aerospace engineering. &lt;em&gt;2018 AIAA Non-Deterministic Approaches Conference&lt;/em&gt;, 1656.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-laplace1814theorie" class="csl-entry"&gt;
&lt;p&gt;Laplace, P. S. (1814). &lt;em&gt;Théorie analytique des probabilités&lt;/em&gt;. Courcier.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lee2017deep" class="csl-entry"&gt;
&lt;p&gt;Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., &amp;amp; Sohl-Dickstein, J. (2017). Deep neural networks as gaussian processes. &lt;em&gt;arXiv Preprint arXiv:1711.00165&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lillicrap2015continuous" class="csl-entry"&gt;
&lt;p&gt;Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., &amp;amp; Wierstra, D. (2015). Continuous control with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1509.02971&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lyu2017efficient" class="csl-entry"&gt;
&lt;p&gt;Lyu, W., Xue, P., Yang, F., Yan, C., Hong, Z., Zeng, X., &amp;amp; Zhou, D. (2017). An efficient bayesian optimization approach for automated optimization of analog circuits. &lt;em&gt;IEEE Transactions on Circuits and Systems I: Regular Papers&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 1954–1967.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay1992practical" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (1992). A practical bayesian framework for backpropagation networks. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;4&lt;/em&gt;(3), 448–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay2003information" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (2003). &lt;em&gt;Information theory, inference and learning algorithms&lt;/em&gt;. Cambridge university press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-marchant2012bayesian" class="csl-entry"&gt;
&lt;p&gt;Marchant, R., &amp;amp; Ramos, F. (2012). Bayesian optimisation for intelligent environmental monitoring. &lt;em&gt;2012 IEEE/RSJ International Conference on Intelligent Robots and Systems&lt;/em&gt;, 2242–2249.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-matthews2018gaussian" class="csl-entry"&gt;
&lt;p&gt;Matthews, A. G. de G., Rowland, M., Hron, J., Turner, R. E., &amp;amp; Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. &lt;em&gt;arXiv Preprint arXiv:1804.11271&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mcculloch1943logical" class="csl-entry"&gt;
&lt;p&gt;McCulloch, W. S., &amp;amp; Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. &lt;em&gt;The Bulletin of Mathematical Biophysics&lt;/em&gt;, &lt;em&gt;5&lt;/em&gt;, 115–133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2013playing" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &amp;amp; Riedmiller, M. (2013). Playing atari with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1312.5602&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2015human" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;518&lt;/em&gt;(7540), 529–533.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-moss2020boss" class="csl-entry"&gt;
&lt;p&gt;Moss, H. B., Beck, D., González, J., Leslie, D. S., &amp;amp; Rayson, P. (2020). BOSS: Bayesian optimization over string spaces. &lt;em&gt;arXiv Preprint arXiv:2010.00979&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-neal1995bayesian" class="csl-entry"&gt;
&lt;p&gt;Neal, R. M. (1995). &lt;em&gt;Bayesian learning for neural networks&lt;/em&gt; [PhD thesis]. University of Toronto.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-openai2023gpt" class="csl-entry"&gt;
&lt;p&gt;OpenAI. (2023). GPT-4 technical report. &lt;em&gt;arXiv Preprint arXiv:2303.08774&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-opper2000gaussian" class="csl-entry"&gt;
&lt;p&gt;Opper, M., &amp;amp; Winther, O. (2000). &lt;em&gt;Gaussian processes and SVM: Mean field results and leave-one-out&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-pearson1901liii" class="csl-entry"&gt;
&lt;p&gt;Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. &lt;em&gt;The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science&lt;/em&gt;, &lt;em&gt;2&lt;/em&gt;(11), 559–572.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rae2021scaling" class="csl-entry"&gt;
&lt;p&gt;Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis &amp;amp; insights from training gopher. &lt;em&gt;arXiv Preprint arXiv:2112.11446&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ramesh2022hierarchical" class="csl-entry"&gt;
&lt;p&gt;Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., &amp;amp; Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. &lt;em&gt;arXiv Preprint arXiv:2204.06125&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt;(2), 3.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-10.7551/mitpress/3206.001.0001" class="csl-entry"&gt;
&lt;p&gt;Rasmussen, C. E., &amp;amp; Williams, C. K. I. (2005). &lt;em&gt;&lt;span class="nocase"&gt;Gaussian Processes for Machine Learning&lt;/span&gt;&lt;/em&gt;. The MIT Press.
&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-redmon2016you" class="csl-entry"&gt;
&lt;p&gt;Redmon, J., Divvala, S., Girshick, R., &amp;amp; Farhadi, A. (2016). You only look once: Unified, real-time object detection. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 779–788.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rombach2022high" class="csl-entry"&gt;
&lt;p&gt;Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp;amp; Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. &lt;em&gt;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 10684–10695.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-romero2013navigating" class="csl-entry"&gt;
&lt;p&gt;Romero, P. A., Krause, A., &amp;amp; Arnold, F. H. (2013). Navigating the protein fitness landscape with gaussian processes. &lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt;, &lt;em&gt;110&lt;/em&gt;(3), E193–E201.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ronneberger2015u" class="csl-entry"&gt;
&lt;p&gt;Ronneberger, O., Fischer, P., &amp;amp; Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. &lt;em&gt;Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18&lt;/em&gt;, 234–241.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rosenblatt1958perceptron" class="csl-entry"&gt;
&lt;p&gt;Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. &lt;em&gt;Psychological Review&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 386.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-roweis1999unifying" class="csl-entry"&gt;
&lt;p&gt;Roweis, S., &amp;amp; Ghahramani, Z. (1999). A unifying review of linear gaussian models. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;11&lt;/em&gt;(2), 305–345.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-salimbeni2018orthogonally" class="csl-entry"&gt;
&lt;p&gt;Salimbeni, H., Cheng, C.-A., Boots, B., &amp;amp; Deisenroth, M. (2018). Orthogonally decoupled variational Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;31&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-seko2015prediction" class="csl-entry"&gt;
&lt;p&gt;Seko, A., Togo, A., Hayashi, H., Tsuda, K., Chaput, L., &amp;amp; Tanaka, I. (2015). Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and bayesian optimization. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;115&lt;/em&gt;(20), 205901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shahriari2015taking" class="csl-entry"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015). Taking the human out of the loop: A review of bayesian optimization. &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, &lt;em&gt;104&lt;/em&gt;(1), 148–175.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shi2020sparse" class="csl-entry"&gt;
&lt;p&gt;Shi, J., Titsias, M., &amp;amp; Mnih, A. (2020). Sparse orthogonal variational inference for Gaussian processes. &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 1932–1942.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shoeybi2019megatron" class="csl-entry"&gt;
&lt;p&gt;Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., &amp;amp; Catanzaro, B. (2019). Megatron-lm: Training multi-billion parameter language models using model parallelism. &lt;em&gt;arXiv Preprint arXiv:1909.08053&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-silver2016mastering" class="csl-entry"&gt;
&lt;p&gt;Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;529&lt;/em&gt;(7587), 484–489.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-snoek2012practical" class="csl-entry"&gt;
&lt;p&gt;Snoek, J., Larochelle, H., &amp;amp; Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;, 2951–2959.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-spearman1904general" class="csl-entry"&gt;
&lt;p&gt;Spearman, C. (1904). &amp;quot;General intelligence,&amp;quot; objectively determined and measured. &lt;em&gt;The American Journal of Psychology&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(2), 201–292.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-srivastava2014dropout" class="csl-entry"&gt;
&lt;p&gt;Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &amp;amp; Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. &lt;em&gt;The Journal of Machine Learning Research&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(1), 1929–1958.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-sun2020neural" class="csl-entry"&gt;
&lt;p&gt;Sun, S., Shi, J., &amp;amp; Grosse, R. B. (2020). Neural networks as inter-domain inducing points. &lt;em&gt;Third Symposium on Advances in Approximate Bayesian Inference&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tibshirani1996regression" class="csl-entry"&gt;
&lt;p&gt;Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. &lt;em&gt;Journal of the Royal Statistical Society Series B: Statistical Methodology&lt;/em&gt;, &lt;em&gt;58&lt;/em&gt;(1), 267–288.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tipping1999probabilistic" class="csl-entry"&gt;
&lt;p&gt;Tipping, M. E., &amp;amp; Bishop, C. M. (1999). Probabilistic principal component analysis. &lt;em&gt;Journal of the Royal Statistical Society: Series B (Statistical Methodology)&lt;/em&gt;, &lt;em&gt;61&lt;/em&gt;(3), 611–622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-torun2018global" class="csl-entry"&gt;
&lt;p&gt;Torun, H. M., Swaminathan, M., Davis, A. K., &amp;amp; Bellaredj, M. L. F. (2018). A global bayesian optimization algorithm and its application to integrated system design. &lt;em&gt;IEEE Transactions on Very Large Scale Integration (VLSI) Systems&lt;/em&gt;, &lt;em&gt;26&lt;/em&gt;(4), 792–802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-touvron2023llama" class="csl-entry"&gt;
&lt;p&gt;Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. &lt;em&gt;arXiv Preprint arXiv:2307.09288&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-turner2021bayesian" class="csl-entry"&gt;
&lt;p&gt;Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., &amp;amp; Guyon, I. (2021). Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. &lt;em&gt;NeurIPS 2020 Competition and Demonstration Track&lt;/em&gt;, 3–26.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-vaswani2017attention" class="csl-entry"&gt;
&lt;p&gt;Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp;amp; Polosukhin, I. (2017). Attention is all you need. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;30&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-wigley2016fast" class="csl-entry"&gt;
&lt;p&gt;Wigley, P. B., Everitt, P. J., Hengel, A. van den, Bastian, J. W., Sooriyabandara, M. A., McDonald, G. D., Hardman, K. S., Quinlivan, C. D., Manju, P., Kuhn, C. C., et al. (2016). Fast machine-learning online optimization of ultra-cold-atom experiments. &lt;em&gt;Scientific Reports&lt;/em&gt;, &lt;em&gt;6&lt;/em&gt;(1), 25890.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-yang2019machine" class="csl-entry"&gt;
&lt;p&gt;Yang, K. K., Wu, Z., &amp;amp; Arnold, F. H. (2019). Machine-learning-guided directed evolution for protein engineering. &lt;em&gt;Nature Methods&lt;/em&gt;, &lt;em&gt;16&lt;/em&gt;(8), 687–694.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>📄 One paper accepted to ICML 2023</title><link>https://tiao.io/posts/one-paper-accepted-to-icml2023/</link><pubDate>Tue, 25 Apr 2023 20:37:43 +0000</pubDate><guid>https://tiao.io/posts/one-paper-accepted-to-icml2023/</guid><description>&lt;p&gt;Our paper
&lt;em&gt;Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes&lt;/em&gt; was accepted to ICML 2023 as an oral presentation!
This work was largely done during my time at Secondmind Labs as a
Student Researcher, in collaboration with Vincent Dutordoir and Victor Picheny.&lt;/p&gt;</description></item><item><title>Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes</title><link>https://tiao.io/publications/spherical-features-gaussian-process/</link><pubDate>Tue, 25 Apr 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/spherical-features-gaussian-process/</guid><description/></item><item><title>Efficient Cholesky decomposition of low-rank updates</title><link>https://tiao.io/posts/efficient-cholesky-decomposition-of-low-rank-updates/</link><pubDate>Sun, 16 Apr 2023 11:16:03 +0000</pubDate><guid>https://tiao.io/posts/efficient-cholesky-decomposition-of-low-rank-updates/</guid><description>&lt;p&gt;Suppose we&amp;rsquo;re given a positive semidefinite (PSD)
matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$
which we wish to update by some low-rank
matrix $\mathbf{U} \mathbf{U}^\top \in \mathbb{R}^{N \times N}$
,
$$\mathbf{B} \triangleq \mathbf{A} + \mathbf{U} \mathbf{U}^\top,$$
where the update factor matrix $\mathbf{U} \in \mathbb{R}^{N \times M}$
.
To be more precise, the low-rank update is rank-$M$ for some $M \ll N$.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What is the best way to calculate the Cholesky decomposition of $\mathbf{B}$
?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Given no additional information the obvious way is to calculate it directly,
which incurs a cost of $\mathcal{O}(N^3)$
.
But suppose we&amp;rsquo;ve already calculated the lower-triangular Cholesky factor
$\mathbf{L} \in \mathbb{R}^{N \times N}$
of $\mathbf{A}$
(i.e., $\mathbf{LL}^\top = \mathbf{A}$
).
Then, we can use it to calculate the Cholesky decomposition
of $\mathbf{B}$
at a reduced cost
of $\mathcal{O}(N^2M)$
.
Here&amp;rsquo;s how.&lt;/p&gt;
&lt;h2 id="rank-1-updates"&gt;Rank-1 Updates&lt;/h2&gt;
&lt;p&gt;First, let&amp;rsquo;s consider the simpler case involving just &lt;em&gt;rank-1 updates&lt;/em&gt;
$$\mathbf{B} \triangleq \mathbf{A} + \mathbf{u} \mathbf{u}^\top,$$
where update factor vector $\mathbf{u} \in \mathbb{R}^{N}$
.
With some clever manipulations&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;, the details of which we won&amp;rsquo;t
get into in this post, we can leverage $\mathbf{L}$
to
calculate the Cholesky decomposition of $\mathbf{B}$
at a reduced cost of $\mathcal{O}(N^2)$
.
Such a procedure for rank-1 updates is implemented in the old-school Fortran
linear algebra software library LINPACK (but unfortunately not in its successor, LAPACK), and also in modern libraries like TensorFlow Probability (TFP).&lt;/p&gt;
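&lt;p&gt;For readers curious about the shape of those manipulations, here is a plain-NumPy sketch of the textbook rank-1 updating recurrence; it is illustrative only, and omits the downdating case, batching, and the numerical safeguards that a library implementation would provide.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;
import numpy as np


def rank1_cholesky_update(chol, u):
    """Return the Cholesky factor of A + u u^T, given chol, the factor of A, in O(N^2)."""
    chol = chol.copy()
    u = u.copy()
    n = u.shape[0]
    for k in range(n):
        r = np.hypot(chol[k, k], u[k])                    # updated diagonal entry
        c, s = r / chol[k, k], u[k] / chol[k, k]          # coefficients of the elementary rotation
        chol[k, k] = r
        chol[k + 1:, k] = (chol[k + 1:, k] + s * u[k + 1:]) / c
        u[k + 1:] = c * u[k + 1:] - s * chol[k + 1:, k]   # uses the freshly updated column
    return chol


# Quick check against a direct O(N^3) factorisation:
rng = np.random.default_rng(42)
X = rng.standard_normal((5, 5))
a = X @ X.T + 5.0 * np.eye(5)                             # a PSD matrix
u = rng.standard_normal(5)
a_factor = np.linalg.cholesky(a)
b_factor = rank1_cholesky_update(a_factor, u)
print(np.allclose(b_factor @ b_factor.T, a + np.outer(u, u)))  # True
&lt;/code&gt;&lt;/pre&gt;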
&lt;p&gt;In TFP, this is implemented in the function named
&lt;code&gt;tfp.math.cholesky_update&lt;/code&gt;.
For example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_probability&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update_factor_vector&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3); suppose this is pre-computed and stored&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3), ignores `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^2), uses `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_factor_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here &lt;code&gt;cholesky_update&lt;/code&gt; takes as arguments &lt;code&gt;chol&lt;/code&gt; with shape &lt;code&gt;[B1, ..., Bn, N, N]&lt;/code&gt;
and &lt;code&gt;u&lt;/code&gt; with shape &lt;code&gt;[B1, ..., Bn, N]&lt;/code&gt;, and returns a lower triangular Cholesky
factor of the rank-1 updated matrix &lt;code&gt;chol @ chol.T + u @ u.T&lt;/code&gt; in $\mathcal{O}(N^2)$
time.&lt;/p&gt;
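&lt;p&gt;For intuition, here is a minimal, non-batched NumPy sketch of the classical
rank-1 update recurrence (an illustrative reimplementation of the idea, not the
actual TFP code; the helper name &lt;code&gt;cholesky_update_rank1&lt;/code&gt; is made up for this post):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def cholesky_update_rank1(chol, u):
    # Given a lower-triangular factor `chol` of A and a vector `u`, return the
    # lower-triangular factor of A + u u^T in O(N^2) time.
    L = np.array(chol, dtype=float)
    u = np.array(u, dtype=float)
    N = u.shape[0]
    for k in range(N):
        r = np.hypot(L[k, k], u[k])        # new k-th diagonal entry
        c, s = r / L[k, k], u[k] / L[k, k]
        L[k, k] = r
        L[k + 1:, k] = (L[k + 1:, k] + s * u[k + 1:]) / c
        u[k + 1:] = c * u[k + 1:] - s * L[k + 1:, k]
    return L
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;tfp.math.cholesky_update&lt;/code&gt; performs the same $\mathcal{O}(N^2)$ rank-1 update in a
batched, TensorFlow-native fashion.&lt;/p&gt;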
&lt;h2 id="low-rank-updates"&gt;Low-Rank Updates&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s return to rank-$M$ updates.
First let&amp;rsquo;s write the update factor matrix $\mathbf{U}$ in terms of column
vectors $\mathbf{u}_m \in \mathbb{R}^{N}$,
$$
\mathbf{U} \triangleq
\begin{bmatrix}
\mathbf{u}_1 &amp; \cdots &amp; \mathbf{u}_M
\end{bmatrix}.
$$
&lt;/p&gt;
&lt;p&gt;Now we can write the rank-$M$ update matrix as a sum of $M$ rank-1 matrices,
$$
\mathbf{U} \mathbf{U}^\top =
\begin{bmatrix} \mathbf{u}_1 &amp; \cdots &amp; \mathbf{u}_M \end{bmatrix}
\begin{bmatrix} \mathbf{u}_1^\top \\ \vdots \\ \mathbf{u}_M^\top \end{bmatrix} =
\sum_{m=1}^{M} \mathbf{u}_m \mathbf{u}_m^\top.
$$
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, M]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, 1, M] [..., 1, N, M] -&amp;gt; [..., N, N, M] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newaxis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, M] [..., M, N] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# not exactly equal due to finite precision, but still equal up to high precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Seen this way, a low-rank update is nothing more than a repeated application of
rank-1 updates,
$$
\begin{align}
\mathbf{B} &amp; = \mathbf{A} + \mathbf{U} \mathbf{U}^\top \\ &amp; =
\mathbf{A} + \sum_{m=1}^{M} \mathbf{u}_m \mathbf{u}_m^\top \\ &amp; =
((\mathbf{A} + \mathbf{u}_1 \mathbf{u}_1^\top) + \cdots ) + \mathbf{u}_M \mathbf{u}_M^{\top}.
\end{align}
$$
&lt;/p&gt;
&lt;p&gt;Therefore, we can simply leverage the $\mathcal{O}(N^2)$ procedure for Cholesky
decompositions of rank-1 updates and apply it recursively $M$ times to obtain
an $\mathcal{O}(N^2 M)$ procedure for rank-$M$ updates, which beats the direct
$\mathcal{O}(N^3)$ approach whenever $M \ll N$.&lt;/p&gt;
&lt;p&gt;Hence, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# [..., N, M] [..., M, N] -&amp;gt; [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="c1"&gt;# Tensor; shape [..., N, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^3), ignores `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# O(N^2M), uses `a_factor`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assert_array_almost_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_factor_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_factor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where the function &lt;code&gt;cholesky_update_iterated&lt;/code&gt; is implemented as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# base case&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can also implement this iteratively.
First we&amp;rsquo;d use &lt;code&gt;tf.unstack&lt;/code&gt; to turn the update factor matrix $\mathbf{U}$
into a list of update factor vectors $\mathbf{u}_m$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;update_factor_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# `update_factor_vectors` is a list&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="c1"&gt;# ... the list contains M vectors&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;update_factor_vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Bs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# ... and each vector has shape [B1, ..., Bn, N]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_chol&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The astute reader will recognize that this is simply a special case of
the &lt;em&gt;fold&lt;/em&gt; or &lt;em&gt;reduce&lt;/em&gt; patterns, where
the &lt;em&gt;binary operator&lt;/em&gt; is &lt;code&gt;tfp.math.cholesky_update&lt;/code&gt;,
the &lt;em&gt;iterable&lt;/em&gt; is &lt;code&gt;tf.unstack(update_factor_matrix, axis=-1)&lt;/code&gt; and
the &lt;em&gt;initial value&lt;/em&gt; is &lt;code&gt;chol&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Therefore, we can also implement it neatly using the one-liner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cholesky_update_iterated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky_update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update_factor_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;In summary, we showed that to efficiently calculate the Cholesky decomposition
of a matrix perturbed by a low-rank update, one just needs to iteratively
calculate that of the same matrix perturbed by a series of rank-1 updates.
Better yet, all of this can be done with a simple one-liner!&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Seeger, M. (2004). Low rank updates for the Cholesky decomposition.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Batch Bayesian Optimisation via Density-ratio Estimation with Guarantees</title><link>https://tiao.io/publications/batch-bore-guarantees/</link><pubDate>Thu, 01 Dec 2022 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/batch-bore-guarantees/</guid><description/></item><item><title>Long Talk: BORE — Bayesian Optimization by Density-Ratio Estimation</title><link>https://tiao.io/events/icml2021-bore/</link><pubDate>Wed, 21 Jul 2021 14:00:00 +0000</pubDate><guid>https://tiao.io/events/icml2021-bore/</guid><description/></item><item><title>Invited Talk: BORE — Bayesian Optimization by Density-Ratio Estimation</title><link>https://tiao.io/events/ellis-automl-seminars-2021/</link><pubDate>Wed, 12 May 2021 16:00:00 +0000</pubDate><guid>https://tiao.io/events/ellis-automl-seminars-2021/</guid><description/></item><item><title>BORE: Bayesian Optimization by Density-Ratio Estimation</title><link>https://tiao.io/publications/bore-2/</link><pubDate>Sat, 08 May 2021 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/bore-2/</guid><description>&lt;p&gt;&lt;strong&gt;B&lt;/strong&gt;ayesian &lt;strong&gt;O&lt;/strong&gt;ptimization (BO) by Density-&lt;strong&gt;R&lt;/strong&gt;atio &lt;strong&gt;E&lt;/strong&gt;stimation (DRE),
or &lt;strong&gt;BORE&lt;/strong&gt;, is a simple, yet effective framework for the optimization of
blackbox functions.
BORE is built upon the correspondence between &lt;em&gt;expected improvement (EI)&lt;/em&gt;&amp;mdash;arguably
the predominant &lt;em&gt;acquisition function&lt;/em&gt; used in BO&amp;mdash;and the &lt;em&gt;density-ratio&lt;/em&gt;
between two unknown distributions.&lt;/p&gt;
&lt;p&gt;One of the far-reaching consequences of this correspondence is that we can
reduce the computation of EI to a &lt;em&gt;probabilistic classification&lt;/em&gt; problem&amp;mdash;a
problem we are well-equipped to tackle, as evidenced by the broad range of
streamlined, easy-to-use and, perhaps most importantly, battle-tested
tools and frameworks at our disposal for applying a variety of approaches,
ranging from the major Deep Learning frameworks and Gradient Tree Boosting
libraries to general-purpose machine learning toolkits.
The BORE framework lets us take direct advantage of these tools.&lt;/p&gt;
&lt;h2 id="code-example"&gt;Code Example&lt;/h2&gt;
&lt;p&gt;We provide a simple example with Keras to give you a taste of how BORE can
be implemented using a feed-forward &lt;em&gt;neural network (NN)&lt;/em&gt; classifier.
A useful class that the &lt;code&gt;bore&lt;/code&gt; package provides is
&lt;code&gt;MaximizableSequential&lt;/code&gt;, a subclass of the &lt;code&gt;Sequential&lt;/code&gt; model from
Keras that inherits all of its existing functionalities, and provides just
one additional method.
We can build and compile a feed-forward NN classifier as usual:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bore.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MaximizableSequential&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# build model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MaximizableSequential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sigmoid&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# compile model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;adam&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;binary_crossentropy&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See the Keras documentation if this seems unfamiliar to you.&lt;/p&gt;
&lt;p&gt;The additional method provided is &lt;code&gt;argmax&lt;/code&gt;, which returns the &lt;em&gt;maximizer&lt;/em&gt; of
the network, i.e. the input $\mathbf{x}$ that maximizes the final output of
the network:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_argmax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;L-BFGS-B&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_start_points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since the network is differentiable end-to-end with respect to the input $\mathbf{x}$, this
method can be implemented efficiently using a &lt;em&gt;multi-started quasi-Newton
hill-climber&lt;/em&gt; such as L-BFGS-B.
We will see the pivotal role this method plays in the next section.&lt;/p&gt;
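&lt;p&gt;To make the idea concrete, here is a rough sketch of what such a multi-started
maximization could look like using &lt;code&gt;scipy.optimize.minimize&lt;/code&gt;; this is a
simplified illustration (relying on finite-difference gradients), not the actual
implementation in the &lt;code&gt;bore&lt;/code&gt; package, which can exploit the analytic
gradients of the network:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.optimize import minimize


def argmax_multistart(predict_fn, bounds, num_start_points=3, seed=None):
    # Maximize `predict_fn` over a box by minimizing its negation with L-BFGS-B
    # from several random starting points, returning the best solution found.
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds)  # shape [D, 2]; rows are (lower, upper) pairs
    results = []
    for _ in range(num_start_points):
        x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
        results.append(minimize(lambda x: -predict_fn(x), x0,
                                method='L-BFGS-B', bounds=bounds))
    best = min(results, key=lambda res: res.fun)
    return best.x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;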
&lt;hr&gt;
&lt;p&gt;Using this classifier, the BO loop in BORE looks as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# initialize design&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features_initial_design&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targets_initial_design&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# construct classification problem&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tau&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;less&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# update classifier&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# suggest new candidate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;L-BFGS-B&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_start_points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# evaluate blackbox&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blackbox&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# update dataset&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;p&gt;Let&amp;rsquo;s break this down a bit:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;At the start of the loop, we construct the classification problem&amp;mdash;by labeling
instances $\mathbf{x}$ whose corresponding target value $y$ falls below the
&lt;code&gt;q=0.25&lt;/code&gt; quantile of all target values (i.e. the best-performing instances,
since we are minimizing) as &lt;em&gt;positive&lt;/em&gt;, and the rest as &lt;em&gt;negative&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Next, we train the classifier to discriminate between these instances. This
classifier should converge towards
&lt;/p&gt;
$$
\pi^{*}(\mathbf{x}) = \frac{\gamma \ell(\mathbf{x})}{\gamma \ell(\mathbf{x}) + (1-\gamma) g(\mathbf{x})},
$$&lt;p&gt;
where $\ell(\mathbf{x})$ and $g(\mathbf{x})$ are the unknown distributions of
instances belonging to the positive and negative classes, respectively, and
$\gamma$ is the class balance-rate and, by construction, simply the quantile
we specified (i.e. $\gamma=0.25$).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once the classifier is a decent approximation to $\pi^{*}(\mathbf{x})$, we
propose the maximizer of this classifier as the next input to evaluate.
In other words, we are now using the classifier &lt;em&gt;itself&lt;/em&gt; as the acquisition
function.&lt;/p&gt;
&lt;p&gt;How is it justifiable to use this in lieu of EI, or some other acquisition
function we&amp;rsquo;re used to?
And what is so special about $\pi^{*}(\mathbf{x})$?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Well, as it turns out, $\pi^{*}(\mathbf{x})$ is equivalent to EI, up to some
constant factors.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The remainder of the loop should now be self-explanatory. Namely, we&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;evaluate the blackbox function at the suggested point, and&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;update the dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="step-by-step-illustration"&gt;Step-by-step Illustration&lt;/h3&gt;
&lt;p&gt;Here is a step-by-step animation of six iterations of this loop in action,
using the &lt;em&gt;Forrester&lt;/em&gt; synthetic function as an example.
The noise-free function is shown as the solid gray curve in the main pane.
This procedure is warm-started with four random initial designs.&lt;/p&gt;
&lt;p&gt;The right pane shows the empirical CDF (ECDF) of the observed $y$ values.
The vertical dashed black line in this pane is located at $\Phi(y) = \gamma$,
where $\gamma = 0.25$.
The horizontal dashed black line is located at $\tau$, the value of $y$ such
that $\Phi(y) = 0.25$, i.e. $\tau = \Phi^{-1}(0.25)$.&lt;/p&gt;
&lt;p&gt;The instances below this horizontal line are assigned binary label $z=1$, while
those above are assigned $z=0$. This is visualized in the bottom pane,
alongside the probabilistic classifier $\pi_{\boldsymbol{\theta}}(\mathbf{x})$
represented by the solid gray curve, which is trained to discriminate between
these instances.&lt;/p&gt;
&lt;p&gt;Finally, the maximizer of the classifier is represented by the vertical solid
green line.
This is the location that the BO procedure suggests evaluating next.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Animation"
srcset="https://tiao.io/publications/bore-2/paper_1500x5562_hu_bf54a19b8bc6fbf5.webp 205w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/publications/bore-2/paper_1500x5562_hu_bf54a19b8bc6fbf5.webp"
width="205"
height="760"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;We see that the procedure converges toward the global minimum of the blackbox
function after half a dozen iterations.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;To understand how and why this works in more detail, please read our paper!
If you only have 15 minutes to spare, please watch the video recording of our
talk!&lt;/p&gt;
&lt;h2 id="video"&gt;Video&lt;/h2&gt;
&lt;div id="presentation-embed-38942425"&gt;&lt;/div&gt;
&lt;script src='https://slideslive.com/embed_presentation.js'&gt;&lt;/script&gt;
&lt;script&gt;
embed = new SlidesLiveEmbed('presentation-embed-38942425', {
presentationId: '38942425',
autoPlay: false, // change to true to autoplay the embedded presentation
verticalEnabled: true
});
&lt;/script&gt;</description></item><item><title>Simulation-based Scoring for Model-based Asynchronous Hyperparameter and Neural Architecture Search</title><link>https://tiao.io/publications/simulation-based-scoring/</link><pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/simulation-based-scoring/</guid><description/></item><item><title>A Primer on Pólya-gamma Random Variables - Part II: Bayesian Logistic Regression</title><link>https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/</link><pubDate>Tue, 20 Apr 2021 17:20:53 +0100</pubDate><guid>https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;This is &lt;strong&gt;Part II&lt;/strong&gt; of a three-part series on Pólya-Gamma random variables.
Part I (Basic Relationships) and Part III (Local Variational Methods) are
in preparation.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;details class="print:hidden xl:hidden" &gt;
&lt;summary&gt;Table of Contents&lt;/summary&gt;
&lt;div class="text-sm"&gt;
&lt;nav id="TableOfContents"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#binary-classification"&gt;Binary Classification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#model--bayesian-logistic-regression"&gt;Model &amp;ndash; Bayesian Logistic Regression&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#likelihood"&gt;Likelihood&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prior"&gt;Prior&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference-and-prediction"&gt;Inference and Prediction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#augmented-model"&gt;Augmented Model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#likelihood-conditioned-on-auxiliary-variables"&gt;Likelihood conditioned on auxiliary variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prior-over-auxiliary-variables"&gt;Prior over auxiliary variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference-gibbs-sampling"&gt;Inference (Gibbs sampling)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#implementation-weight-space-view"&gt;Implementation (Weight-space view)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#synthetic-one-dimensional-classification-problem"&gt;Synthetic one-dimensional classification problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prior-1"&gt;Prior&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conditional-likelihood"&gt;Conditional likelihood&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference-and-prediction-1"&gt;Inference and Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#code"&gt;Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#bonus-gibbs-sampling-with-mutual-recursion-and-generator-delegation"&gt;Bonus: Gibbs sampling with mutual recursion and generator delegation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#links-and-further-readings"&gt;Links and Further Readings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#appendix"&gt;Appendix&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#i"&gt;I&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ii"&gt;II&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#iii"&gt;III&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/nav&gt;
&lt;/div&gt;
&lt;/details&gt;
&lt;h2 id="binary-classification"&gt;Binary Classification&lt;/h2&gt;
&lt;p&gt;Consider the usual set-up for a binary classification problem:
for some input $\mathbf{x} \in \mathbb{R}^{D}$,
predict its binary label $y \in \{ 0, 1 \}$ given observations consisting of a
feature matrix $\mathbf{X} = [ \mathbf{x}_1 \cdots \mathbf{x}_N ]^{\top} \in \mathbb{R}^{N \times D}$
and a target vector $\mathbf{y} = [ y_1 \cdots y_N ]^{\top} \in \{ 0, 1 \}^N$.&lt;/p&gt;
&lt;h2 id="model--bayesian-logistic-regression"&gt;Model &amp;ndash; Bayesian Logistic Regression&lt;/h2&gt;
&lt;p&gt;Recall the standard &lt;em&gt;Bayesian logistic regression&lt;/em&gt; model:&lt;/p&gt;
&lt;h3 id="likelihood"&gt;Likelihood&lt;/h3&gt;
&lt;p&gt;Let $f: \mathbb{R}^{D} \to \mathbb{R}$ denote the real-valued latent function,
sometimes referred to as the &lt;em&gt;nuisance function&lt;/em&gt;, and let $f_n = f(\mathbf{x}_n)$
be the function value corresponding to observed input $\mathbf{x}_n$.
The distribution over the observed variable $y_n$ is assumed to be governed
by the latent variable $f_n$.
In particular, the observed target vectors $\mathbf{y}$
are related to $\mathbf{f}$, the column vector of latent
variables $\mathbf{f} = [f_1, \dotsc, f_N]^{\top}$,
through the likelihood, or observation model, defined as
&lt;/p&gt;
$$
p(\mathbf{y} | \mathbf{f}) \doteq \prod_{n=1}^N p(y_n | f_n),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
p(y_n | f_n) = \mathrm{Bern}(y_n | \sigma(f_n)) =
\sigma(f_n)^{y_n} \left (1 - \sigma(f_n) \right )^{1 - y_n},
$$&lt;p&gt;
and $\sigma(u) = \left ( 1 + \exp(-u) \right )^{-1}$ is the logistic sigmoid
function.&lt;/p&gt;
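&lt;p&gt;In code, this likelihood is straightforward to evaluate; a small NumPy sketch,
using the equivalent and numerically stable form
$\log p(y_n | f_n) = y_n f_n - \log(1 + e^{f_n})$ (the compact expression for
$p(y_n | f_n)$ is derived later in the post):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def log_likelihood(y, f):
    # log p(y | f) = sum_n [ y_n f_n - log(1 + exp(f_n)) ]
    return np.sum(y * f - np.logaddexp(0.0, f))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;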
&lt;h3 id="prior"&gt;Prior&lt;/h3&gt;
&lt;p&gt;For the sake of generality we discuss both the &lt;em&gt;weight-space&lt;/em&gt;
and &lt;em&gt;function-space&lt;/em&gt; views of Bayesian logistic regression.
In both cases, we consider a prior distribution in the form of a
multivariate Gaussian $\mathcal{N}(\mathbf{m}, \mathbf{S}^{-1})$,
whether it be over the weights or the function values themselves.&lt;/p&gt;
&lt;h4 id="weight-space"&gt;Weight-space&lt;/h4&gt;
&lt;p&gt;In the weight-space view, sometimes referred to as &lt;em&gt;linear&lt;/em&gt; logistic
regression, we assume &lt;em&gt;a priori&lt;/em&gt; that the latent function takes the form&lt;br&gt;
&lt;/p&gt;
$$
f(\mathbf{x}) = \boldsymbol{\beta}^{\top} \mathbf{x},
\qquad
\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{m}, \mathbf{S}^{-1}).
$$&lt;p&gt;
In this case, we express the vector of latent function
values as $\mathbf{f} = \mathbf{X} \boldsymbol{\beta}$
and the prior over the weights
as $p(\boldsymbol{\beta}) = \mathcal{N}(\mathbf{m}, \mathbf{S}^{-1})$.&lt;/p&gt;
&lt;h4 id="function-space"&gt;Function-space&lt;/h4&gt;
&lt;p&gt;In the function-space view, we assume the function is distributed according
to a Gaussian process (GP) with mean function $m(\mathbf{x})$ and covariance
function $k(\mathbf{x}, \mathbf{x}')$
&lt;/p&gt;
$$
f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right)
$$&lt;p&gt;
In this case, we express the prior over latent function values
as $p(\mathbf{f} | \mathbf{X}) = \mathcal{N}(\mathbf{m}, \mathbf{K}_X)$,
where $\mathbf{m} = m(\mathbf{X})$ and $\mathbf{K}_X = k(\mathbf{X}, \mathbf{X})$.&lt;/p&gt;
&lt;h3 id="inference-and-prediction"&gt;Inference and Prediction&lt;/h3&gt;
&lt;p&gt;Given some test input $\mathbf{x}_*$, we are interested in producing a probability
distribution over predictions $p(y_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*)$.
As we shall see, the procedure for computing this distribution is rife with
intractabilities.&lt;/p&gt;
&lt;p&gt;Specifically, we first marginalize out the uncertainty about the associated
latent function value $f_*$,
&lt;/p&gt;
$$
p(y_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*) =
\int \sigma(f_*) p(f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*) \mathrm{d}f_*
$$&lt;p&gt;
where $p(f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*)$ is the posterior
predictive distribution.
Solving this integral is intractable, but since it is one-dimensional, it can
be approximated efficiently using numerical quadrature (e.g. Gauss&amp;ndash;Hermite),
assuming $p(f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*)$
is Gaussian.&lt;/p&gt;
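&lt;p&gt;As a quick illustration of this step, here is a minimal NumPy sketch of
Gauss&amp;ndash;Hermite quadrature for $\int \sigma(f) \mathcal{N}(f | \mu, s^2) \mathrm{d}f$,
assuming the predictive over $f_*$ has been summarized by a mean $\mu$ and variance $s^2$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def predictive_prob(mu, s2, num_points=20):
    # Approximate E[sigmoid(f)] for f ~ N(mu, s2) using Gauss-Hermite quadrature.
    x, w = np.polynomial.hermite.hermgauss(num_points)
    f = mu + np.sqrt(2.0 * s2) * x   # change of variables for the Gaussian weight
    return np.sum(w / (1.0 + np.exp(-f))) / np.sqrt(np.pi)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;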
&lt;p&gt;But herein lies the real difficulty: the predictive $p(f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*)$ is
computed as
&lt;/p&gt;
$$
p(f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_*) =
\int p(f_* | \mathbf{X}, \mathbf{x}_*, \mathbf{f}) p(\mathbf{f} | \mathbf{X}, \mathbf{y}) \mathrm{d}\mathbf{f},
$$&lt;p&gt;
where $p(\mathbf{f} | \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} | \mathbf{f}) p(\mathbf{f} | \mathbf{X})}{p(\mathbf{y} | \mathbf{X})} \propto p(\mathbf{y} | \mathbf{f}) p(\mathbf{f} | \mathbf{X})$ is the posterior over latent function values at the
observed points, which is analytically intractable because a Gaussian prior is
not conjugate
to the Bernoulli likelihood.&lt;/p&gt;
&lt;p&gt;To overcome this intractability, one must typically resort to approximate
inference methods such as the
Laplace approximation&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;,
variational inference (VI)&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;,
expectation propagation (EP)&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;
and sampling-based approximations such as Markov Chain Monte Carlo (MCMC).&lt;/p&gt;
&lt;h2 id="augmented-model"&gt;Augmented Model&lt;/h2&gt;
&lt;p&gt;Instead of appealing to approximate inference methods, let us consider
an augmentation strategy that works by introducing &lt;em&gt;auxiliary variables&lt;/em&gt; to
the model&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;In particular, we introduce auxiliary variables $\boldsymbol{\omega}$ and
define the &lt;em&gt;augmented&lt;/em&gt; or &lt;em&gt;joint likelihood&lt;/em&gt; that factorizes as
&lt;/p&gt;
$$
p(\mathbf{y}, \boldsymbol{\omega} | \mathbf{f}) = p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega}) p(\boldsymbol{\omega}),
$$&lt;p&gt;
where $p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega})$ is the &lt;em&gt;conditional likelihood&lt;/em&gt;,
a likelihood that is conditioned on the auxiliary variables $\boldsymbol{\omega}$,
and $p(\boldsymbol{\omega})$ is the prior.
Specifically, we wish to define $p(\boldsymbol{\omega})$
and $p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega})$ for which the following
two properties hold:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Marginalizing out $\boldsymbol{\omega}$ recovers the original observation model
$$
\int \underbrace{p(\mathbf{y}, \boldsymbol{\omega} | \mathbf{f})}_\text{joint likelihood} d\boldsymbol{\omega} =
\int \underbrace{p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega})}_\text{conditional likelihood} p(\boldsymbol{\omega}) d\boldsymbol{\omega} =
\underbrace{p(\mathbf{y} | \mathbf{f})}_\text{original likelihood}
$$&lt;/li&gt;
&lt;li&gt;A Gaussian prior $p(\mathbf{f})$ is conjugate to the conditional likelihood $p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega})$.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="likelihood-conditioned-on-auxiliary-variables"&gt;Likelihood conditioned on auxiliary variables&lt;/h3&gt;
&lt;p&gt;First, let us define a conditional likelihood that factorizes as
&lt;/p&gt;
$$
p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega}) =
\prod_{n=1}^{N} p(y_n | f_n, \omega_n),
$$&lt;p&gt;
where each factor is defined as
&lt;/p&gt;
$$
p(y_n | f_n, \omega_n) \doteq
\frac{1}{2} \exp{\left \{ - \frac{\omega_n}{2} \left ( f_n^2 -
2 f_n \frac{\kappa_n}{\omega_n} \right ) \right \}}
$$&lt;p&gt;
for $\kappa_n = y_n - \frac{1}{2}$.&lt;/p&gt;
&lt;h3 id="prior-over-auxiliary-variables"&gt;Prior over auxiliary variables&lt;/h3&gt;
&lt;p&gt;Second, let us define a prior over auxiliary variables $\boldsymbol{\omega}$ that
factorizes as
&lt;/p&gt;
$$
p(\boldsymbol{\omega}) = \prod_{n=1}^N p(\omega_n)
$$&lt;p&gt;
where each factor $p(\omega_n)$ is a Pólya-gamma density
&lt;/p&gt;
$$
p(\omega_n) = \mathrm{PG}(\omega_n | 1, 0),
$$&lt;p&gt;
defined as an infinite convolution of gamma distributions:&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="pólya-gamma-density-polson-et-al-2013"&gt;Pólya-gamma density (Polson et al. 2013)&lt;/h4&gt;
&lt;p&gt;A random variable $\omega$ has a Pólya-gamma distribution with parameters $b &gt; 0$
and $c \in \mathbb{R}$, denoted $\omega \sim \mathrm{PG}(b, c)$, if
&lt;/p&gt;
$$
\omega \overset{D}{=} \frac{1}{2 \pi^2} \sum_{k=1}^{\infty}
\frac{g_k}{\left (k - \frac{1}{2} \right )^2 + \left ( \frac{c}{2\pi} \right )^2}
$$&lt;p&gt;
where the $g_k \sim \mathrm{Ga}(b, 1)$ are independent gamma random variables
(and where $\overset{D}{=}$ denotes equality in distribution).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
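&lt;p&gt;To build some intuition for this definition, below is a minimal sketch of a naive sampler that simply truncates the infinite sum above after a fixed number of terms (the helper name &lt;code&gt;pg_sample_truncated&lt;/code&gt; and the truncation level are our own illustrative choices; this is not the efficient sampler of Polson et al. that we use later in this post):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def pg_sample_truncated(b, c, num_terms=1000, size=1, random_state=None):
    # Approximate a draw from PG(b, c) by truncating the infinite weighted
    # sum of independent Gamma(b, 1) random variables after `num_terms` terms.
    random_state = np.random.RandomState() if random_state is None else random_state
    k = np.arange(1, num_terms + 1)
    weights = 1.0 / ((k - 0.5)**2 + (c / (2.0 * np.pi))**2)
    g = random_state.gamma(shape=b, scale=1.0, size=(size, num_terms))
    return (g * weights).sum(axis=-1) / (2.0 * np.pi**2)


# Sanity check: the mean of PG(1, 0) is 1/4.
omegas = pg_sample_truncated(1.0, 0.0, size=10000, random_state=np.random.RandomState(42))
print(omegas.mean())  # approximately 0.25
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;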
&lt;h4 id="property-i-recovering-the-original-model"&gt;Property I: Recovering the original model&lt;/h4&gt;
&lt;p&gt;First we show that we can recover the original likelihood $p(y_n | f_n)$
by integrating out $\boldsymbol{\omega}$.
Before we proceed, note that $p(y_n | f_n)$ can be expressed more
succinctly as
&lt;/p&gt;
$$
p(y_n | f_n) = \frac{e^{y_n f_n}}{1 + e^{f_n}}.
$$&lt;p&gt;
This follows from writing $p(y_n | f_n) = \sigma(f_n)^{y_n} \left (1 - \sigma(f_n) \right )^{1 - y_n}$ and simplifying.
Next, note the following property of Pólya-gamma variables:&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="laplace-transform-of-the-pólya-gamma-density-polson-et-al-2013"&gt;Laplace transform of the Pólya-gamma density (Polson et al. 2013)&lt;/h4&gt;
&lt;p&gt;Based on the Laplace transform of the Pólya-gamma density function, we can derive the following relationship:
&lt;/p&gt;
$$
\frac{\left (e^{u} \right )^a}{\left (1 + e^{u} \right )^b} =
\frac{1}{2^b} \exp{(\kappa u)} \
\int_0^\infty \exp{\left ( - \frac{u^2}{2} \omega \right )}
p(\omega) d\omega,
$$&lt;p&gt;
where $\kappa = a - \frac{b}{2}$ and $p(\omega) = \mathrm{PG}(\omega | b, 0)$.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Therefore, by substituting $\kappa = \kappa_n, a = y_n, b = 1$ and $u = f_n$
we get
&lt;/p&gt;
$$
\begin{align*}
\int p(y_n, \omega_n | f_n) d\omega_n &amp;=
\int p(y_n | f_n, \omega_n) p(\omega_n) d\omega_n \newline &amp;=
\frac{1}{2} \int \exp{\left \{ - \frac{\omega_n}{2} \left (f_n^2 -
2 f_n \frac{\kappa_n}{\omega_n} \right ) \right \}} p(\omega_n) d\omega_n \newline &amp;=
\frac{1}{2} \exp{(\kappa_n f_n)}
\int \exp{\left ( - \frac{f_n^2}{2} \omega_n \right )} p(\omega_n) d\omega_n \newline &amp;=
\frac{\left (e^{f_n} \right )^{y_n}}{1 + e^{f_n}} = p(y_n | f_n)
\end{align*}
$$&lt;p&gt;
as required.&lt;/p&gt;
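&lt;p&gt;As a quick numerical sanity check of this property, we can approximate the integral above by Monte Carlo using draws $\omega_n \sim \mathrm{PG}(1, 0)$ and compare the result against the original likelihood. The sketch below assumes the &lt;code&gt;pypolyagamma&lt;/code&gt; package (introduced later in this post) and its scalar &lt;code&gt;pgdraw&lt;/code&gt; method:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from pypolyagamma import PyPolyaGamma

pg = PyPolyaGamma(seed=42)
f_n, y_n = 1.3, 1.0
kappa_n = y_n - 0.5

# Monte Carlo estimate of the integral of p(y_n | f_n, omega_n) PG(omega_n | 1, 0)
omegas = np.array([pg.pgdraw(1.0, 0.0) for _ in range(50000)])
estimate = 0.5 * np.exp(kappa_n * f_n) * np.mean(np.exp(-0.5 * f_n**2 * omegas))

# Original likelihood p(y_n | f_n) = exp(y_n f_n) / (1 + exp(f_n))
exact = np.exp(y_n * f_n) / (1.0 + np.exp(f_n))
print(estimate, exact)  # the two values should agree to a few decimal places
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;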
&lt;h4 id="property-ii-gaussian-gaussian-conjugacy"&gt;Property II: Gaussian-Gaussian conjugacy&lt;/h4&gt;
&lt;p&gt;Let us define the diagonal matrix $\boldsymbol{\Omega} = \mathrm{diag}(\omega_1 \cdots \omega_N)$ and the vector $\mathbf{z} = \boldsymbol{\Omega}^{-1} \boldsymbol{\kappa}$.
More simply, $\mathbf{z}$ is the vector with $n$th element $z_n = {\kappa_n} / {\omega_n}$.
Hence, by completing the square,
the per-datapoint conditional likelihood $p(y_n | f_n, \omega_n)$ above can be written as
&lt;/p&gt;
$$
\begin{align*}
p(y_n | f_n, \omega_n) &amp; \propto
\exp{\left \{ - \frac{\omega_n}{2} \left (f_n - \frac{\kappa_n}{\omega_n} \right )^2 \right \}} \newline &amp; = \exp{\left \{ - \frac{\omega_n}{2} \left (f_n - z_n \right )^2 \right \}}
\end{align*}
$$&lt;p&gt;
Importantly, this implies that the conditional likelihood over all
variables $p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega})$ is simply a
multivariate Gaussian distribution up to a constant factor
&lt;/p&gt;
$$
p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega}) \propto \mathcal{N}\left (\boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} | \mathbf{f}, \boldsymbol{\Omega}^{-1} \right ).
$$&lt;p&gt;
This follows by collecting the $N$ per-datapoint factors above into a single multivariate Gaussian density in $\mathbf{f}$, up to a constant factor.
Therefore, a Gaussian prior $p(\mathbf{f})$ is conjugate to the
conditional likelihood $p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega})$, which
leads to $p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega})$, the posterior
over $\mathbf{f}$ conditioned on the auxiliary latent
variables $\boldsymbol{\omega}$, also being a Gaussian&amp;mdash;a property that will
prove crucial to us in the next section.&lt;/p&gt;
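&lt;p&gt;It is also easy to verify numerically that $p(y_n | f_n, \omega_n)$ is Gaussian in $f_n$ up to a constant factor; here is a small sketch in which the particular values of $y_n$ and $\omega_n$ are arbitrary:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.stats import norm

y_n, omega_n = 1.0, 0.7
kappa_n = y_n - 0.5
z_n = kappa_n / omega_n


def log_cond_likelihood(f):
    # log p(y_n | f_n, omega_n) as defined above
    return np.log(0.5) - 0.5 * omega_n * (f**2 - 2.0 * f * kappa_n / omega_n)


# Difference with the Gaussian log-density log N(z_n | f_n, 1 / omega_n);
# it should be constant as a function of f_n.
f_grid = np.linspace(-3.0, 3.0, 7)
diff = log_cond_likelihood(f_grid) - norm.logpdf(z_n, loc=f_grid, scale=1.0 / np.sqrt(omega_n))
print(np.allclose(diff, diff[0]))  # True
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;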
&lt;h3 id="inference-gibbs-sampling"&gt;Inference (Gibbs sampling)&lt;/h3&gt;
&lt;p&gt;We wish to compute the posterior
distribution $p(\mathbf{f}, \boldsymbol{\omega} | \mathbf{y})$, the
distribution over the hidden variables $(\mathbf{f}, \boldsymbol{\omega})$
conditioned on the observed variables $\mathbf{y}$.
To produce samples from this distribution
&lt;/p&gt;
$$
(\mathbf{f}^{(t)}, \boldsymbol{\omega}^{(t)}) \sim p(\mathbf{f}, \boldsymbol{\omega} | \mathbf{y}),
$$&lt;p&gt;
we can readily apply Gibbs sampling&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;, an MCMC
algorithm that can be seen as a special case of the Metropolis-Hastings algorithm.&lt;/p&gt;
&lt;p&gt;Each step of the Gibbs sampling procedure involves replacing the value of one
of the variables by a value drawn from the distribution of that variable
conditioned on the values of the remaining variables.
Specifically, we proceed as follows.
At step $t$, we have values $\mathbf{f}^{(t-1)}, \boldsymbol{\omega}^{(t-1)}$
sampled from the previous step.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We first replace $\mathbf{f}^{(t-1)}$ by a new
value $\mathbf{f}^{(t)}$ by sampling from the conditional distribution $p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega}^{(t-1)})$,
$$
\mathbf{f}^{(t)} \sim p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega}^{(t-1)}).
$$&lt;/li&gt;
&lt;li&gt;Then we replace $\boldsymbol{\omega}^{(t-1)}$ by $\boldsymbol{\omega}^{(t)}$ by sampling
from the conditional distribution $p(\boldsymbol{\omega}| \mathbf{f}^{(t)})$,
$$
\boldsymbol{\omega}^{(t)} \sim p(\boldsymbol{\omega}| \mathbf{f}^{(t)}),
$$
where we&amp;rsquo;ve used $\mathbf{f}^{(t)}$, the new value for $\mathbf{f}$ from step 1,
straight away in the current step. Note that we&amp;rsquo;ve dropped the conditioning
on $\mathbf{y}$, since $\boldsymbol{\omega}$ does not &lt;em&gt;a posteriori&lt;/em&gt; depend
on this variable.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We then proceed in like manner, cycling between the two variables in turn until
some convergence criterion is met.&lt;/p&gt;
&lt;p&gt;Suffice it to say, this requires us to first compute the conditional
posteriors $p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega})$
and $p(\boldsymbol{\omega}| \mathbf{f})$, the calculation of which will be the
subject of the next two subsections.&lt;/p&gt;
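&lt;p&gt;Schematically, the procedure amounts to the simple loop sketched below, where &lt;code&gt;sample_f_given_omega&lt;/code&gt; and &lt;code&gt;sample_omega_given_f&lt;/code&gt; are hypothetical placeholders for the conditional posterior samplers derived in the next two subsections:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def gibbs_sampler(y, f_init, omega_init, num_iterations,
                  sample_f_given_omega, sample_omega_given_f):
    # Two-block Gibbs sampler: alternately draw f | y, omega and omega | f.
    f, omega = f_init, omega_init
    samples = []
    for t in range(num_iterations):
        f = sample_f_given_omega(y, omega)   # f^(t) ~ p(f | y, omega^(t-1))
        omega = sample_omega_given_f(f)      # omega^(t) ~ p(omega | f^(t))
        samples.append((f, omega))
    return samples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;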
&lt;h4 id="posterior-over-latent-function-values"&gt;Posterior over latent function values&lt;/h4&gt;
&lt;p&gt;The posterior over the latent function values $\mathbf{f}$ conditioned on the
auxiliary latent variables $\boldsymbol{\omega}$ is
&lt;/p&gt;
$$
p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \boldsymbol{\Sigma}),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Sigma} \left ( \mathbf{S} \mathbf{m} + \boldsymbol{\kappa} \right )
\quad
\text{and}
\quad
\boldsymbol{\Sigma} = \left (\mathbf{S} + \boldsymbol{\Omega} \right )^{-1}.
$$&lt;p&gt;We readily obtain $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ by noting,
as alluded to earlier, that
&lt;/p&gt;
$$
p(\mathbf{f}) = \mathcal{N}(\mathbf{m}, \mathbf{S}^{-1}),
\qquad
\text{and}
\qquad
p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega}) \propto \mathcal{N}\left (\boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} | \mathbf{f}, \boldsymbol{\Omega}^{-1} \right ).
$$&lt;p&gt;
Thereafter, we can appeal to the following elementary properties of Gaussian
conditioning and perform some pattern-matching substitutions:&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="marginal-and-conditional-gaussians-bishop-section-233-pg-93"&gt;Marginal and Conditional Gaussians (Bishop, Section 2.3.3, pg. 93)&lt;/h4&gt;
&lt;p&gt;Given a marginal Gaussian distribution for $\mathbf{b}$ and a conditional Gaussian
distribution for $\mathbf{a}$ given $\mathbf{b}$ in the form&lt;/p&gt;
$$
\begin{align*}
p(\mathbf{b}) &amp; =
\mathcal{N}(\mathbf{b} | \mathbf{m}, \mathbf{S}^{-1}) \newline
p(\mathbf{a} | \mathbf{b}) &amp; =
\mathcal{N}(\mathbf{a} | \mathbf{W} \mathbf{b}, \boldsymbol{\Psi}^{-1})
\end{align*}
$$&lt;p&gt;
the marginal distribution of $\mathbf{a}$ and the conditional distribution
of $\mathbf{b}$ given $\mathbf{a}$ are given by
&lt;/p&gt;
$$
\begin{align*}
p(\mathbf{a}) &amp; =
\mathcal{N}(\mathbf{a} | \mathbf{W} \mathbf{m}, \boldsymbol{\Psi}^{-1} + \mathbf{W} \mathbf{S}^{-1} \mathbf{W}^{\top}) \newline
p(\mathbf{b} | \mathbf{a}) &amp; =
\mathcal{N}(\mathbf{b} | \boldsymbol{\mu}, \boldsymbol{\Sigma})
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Sigma} \left ( \mathbf{W}^{\top} \boldsymbol{\Psi} \mathbf{a} + \mathbf{S} \mathbf{m} \right ),
\quad
\text{and}
\quad
\boldsymbol{\Sigma} = \left (\mathbf{S} + \mathbf{W}^{\top} \boldsymbol{\Psi} \mathbf{W}\right )^{-1}.
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that we also could have derived this directly without resorting to
the formulae above by reducing the product of two exponential-quadratic
functions in $p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega}) \propto p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega}) p(\mathbf{f})$ into a single exponential-quadratic function
up to a constant factor.
It would, however, have been rather tedious and mundane.&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="example-gaussian-process-prior"&gt;Example: Gaussian process prior&lt;/h4&gt;
&lt;p&gt;To make this more concrete, let us revisit the Gaussian process prior we
discussed earlier, namely,
&lt;/p&gt;
$$
p(\mathbf{f} | \mathbf{X}) = \mathcal{N}(\mathbf{m}, \mathbf{K}_X).
$$&lt;p&gt;
By substituting $\mathbf{S}^{-1} = \mathbf{K}_X$ from before, we obtain
&lt;/p&gt;
$$
p(\mathbf{f} | \mathbf{y}, \boldsymbol{\omega}) =
\mathcal{N}(\mathbf{f} | \boldsymbol{\Sigma} \left ( \mathbf{K}_X^{-1} \mathbf{m} + \boldsymbol{\kappa} \right ), \boldsymbol{\Sigma}),
$$&lt;p&gt;
where $\boldsymbol{\Sigma} = \left (\mathbf{K}_X^{-1} + \boldsymbol{\Omega} \right )^{-1}.$&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
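&lt;p&gt;For completeness, below is a minimal sketch of how the mean and covariance of this conditional posterior might be computed under the Gaussian process prior. The explicit inverse of $\mathbf{K}_X$ is formed here only to mirror the expressions above; in practice one would work with Cholesky factorizations instead:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def gp_conditional_posterior(K_X, m, kappa, omega):
    # Mean and covariance of p(f | y, omega) under the prior N(m, K_X),
    # where kappa = y - 0.5 and omega is the vector of auxiliary variables.
    N = K_X.shape[0]
    K_inv = np.linalg.solve(K_X, np.eye(N))
    Sigma = np.linalg.solve(K_inv + np.diag(omega), np.eye(N))
    mu = Sigma @ (K_inv @ m + kappa)
    return mu, Sigma
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;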
&lt;h4 id="posterior-over-auxiliary-variables"&gt;Posterior over auxiliary variables&lt;/h4&gt;
&lt;p&gt;The posterior over the auxiliary latent variables $\boldsymbol{\omega}$
conditioned on the latent function values $\mathbf{f}$ factorizes as
&lt;/p&gt;
$$
p(\boldsymbol{\omega}| \mathbf{f}) = \prod_{n=1}^{N} p(\omega_n | f_n),
$$&lt;p&gt;
where each factor
&lt;/p&gt;
$$
p(\omega_n | f_n) =
\frac{p(f_n, \omega_n)}{\int p(f_n, \omega_n) d\omega_n}.
$$&lt;p&gt;
Now, the joint factorizes as $p(f_n, \omega_n) = p(f_n | \omega_n) p(\omega_n)$ where
&lt;/p&gt;
$$
p(f_n | \omega_n) = \exp{\left (-\frac{f_n^2}{2}\omega_n \right )},
\quad
\text{and}
\quad
p(\omega_n) = \mathrm{PG}(\omega_n | 1, 0).
$$&lt;p&gt;
Hence, by the exponential tilting property of the Pólya-gamma distribution, we have
&lt;/p&gt;
$$
p(\omega_n | f_n) = \mathrm{PG}(\omega_n | 1, f_n) \propto
\mathrm{PG}(\omega_n | 1, 0) \times
\exp{\left (-\frac{f_n^2}{2}\omega_n \right )} =
p(f_n, \omega_n).
$$&lt;p&gt;
We have omitted the normalizing constant $\int p(f_n, \omega_n) d\omega_n$
from our discussion for the sake of brevity.
If you&amp;rsquo;re interested in calculating it, note that it can be obtained from the Laplace transform identity given above (with $b = 1$ and $u = f_n$).&lt;/p&gt;
&lt;h2 id="implementation-weight-space-view"&gt;Implementation (Weight-space view)&lt;/h2&gt;
&lt;p&gt;Having presented the general form of an augmented model for Bayesian logistic
regression, we now derive a simple instance of this model to tackle a synthetic
one-dimensional classification problem.
In this particular implementation, we make the following choices:
(a) we incorporate a basis function to project inputs into a higher-dimensional feature space, and
(b) we consider an isotropic Gaussian prior on the weights.&lt;/p&gt;
&lt;h3 id="synthetic-one-dimensional-classification-problem"&gt;Synthetic one-dimensional classification problem&lt;/h3&gt;
&lt;p&gt;First we synthesize a one-dimensional classification problem for which
the &lt;em&gt;true&lt;/em&gt; class-membership probability $p(y = 1 | x)$ is both known and easy
to compute.
To this end, let us introduce the following one-dimensional Gaussians,
&lt;/p&gt;
$$
p(x) = \mathcal{N}(1, 1^2),
\qquad
\text{and}
\qquad
q(x) = \mathcal{N}(0, 2^2).
$$&lt;p&gt;In code we can specify these as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We draw a total of $N$ samples, split evenly between the two distributions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;draw_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where the function &lt;code&gt;draw_samples&lt;/code&gt; is defined as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;draw_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;num_top&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_bot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X_top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_bot&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The densities of both distributions, along with the samples drawn from each, are shown in the
figure below.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/figures/density_paper_1500x927.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Densities of two Gaussians and samples drawn from each.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;From these samples, let us now construct a classification dataset by assigning
label $y = 1$ to inputs $x \sim p(x)$, and $y = 0$ to inputs $x \sim q(x)$.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where the function &lt;code&gt;make_dataset&lt;/code&gt; is defined as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_neg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;X_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_neg&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_pos&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_neg&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Crucially, the true class-membership probability is given exactly by
&lt;/p&gt;
$$
p(y = 1 | x) = \frac{p(x)}{p(x) + q(x)},
$$&lt;p&gt;
thus providing a ground-truth yardstick by which to measure the quality of our
resulting predictions.&lt;/p&gt;
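&lt;p&gt;Using the frozen distributions defined above, this quantity can be evaluated directly; the helper name &lt;code&gt;class_posterior_true&lt;/code&gt; below is our own:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from scipy.stats import norm

p = norm(loc=1.0, scale=1.0)  # as defined earlier
q = norm(loc=0.0, scale=2.0)


def class_posterior_true(x):
    # p(y = 1 | x) = p(x) / (p(x) + q(x))
    return p.pdf(x) / (p.pdf(x) + q.pdf(x))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;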
&lt;p&gt;The class-membership probability $p(y = 1 | x)$ is shown in the figure below as
the black curve, along with the dataset $\mathcal{D}_N = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$
where positive instances are colored red and negative instances are colored blue.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/figures/class_prob_true_paper_1500x927.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Classification dataset $\mathcal{D}_N = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$ and the true class-posterior probability.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 id="prior-1"&gt;Prior&lt;/h3&gt;
&lt;p&gt;To increase the flexibility of our model, we introduce a basis
function $\phi: \mathbb{R}^{D} \to \mathbb{R}^{K}$ that projects
$D$-dimensional input vectors into a $K$-dimensional vector space.
Accordingly, we introduce matrix $\boldsymbol{\Phi} \in \mathbb{R}^{N \times K}$
such that the $n$th column of $\boldsymbol{\Phi}^{\top}$ consists of the
vector $\phi(\mathbf{x}_n)$.
Hence, we assume &lt;em&gt;a priori&lt;/em&gt; that the latent function is of the form&lt;br&gt;
&lt;/p&gt;
$$
f(\mathbf{x}) = \boldsymbol{\beta}^{\top} \phi(\mathbf{x}),
$$&lt;p&gt;
and express the vector of latent function values
as $\mathbf{f} = \boldsymbol{\Phi} \boldsymbol{\beta}$.
In this example, we consider a simple polynomial basis function,
&lt;/p&gt;
$$
\phi(x) = \begin{bmatrix} 1 &amp; x &amp; x^2 &amp; \cdots &amp; x^{K-1} \end{bmatrix}^{\top}.
$$&lt;p&gt;Therefore, we call:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;basis_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where the function &lt;code&gt;basis_function&lt;/code&gt; is defined as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;basis_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Let us define
the prior over weights as a simple isotropic Gaussian with
precision $\alpha &gt; 0$,
&lt;/p&gt;
$$
p(\boldsymbol{\beta}) = \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I}),
$$&lt;p&gt;
and the prior over each local auxiliary latent variable as before,
&lt;/p&gt;
$$
p(\omega_n) = \mathrm{PG}(\omega_n | 1, 0).
$$&lt;p&gt;
Since we have analytic forms for the conditional posteriors, we don&amp;rsquo;t need to
implement the priors explicitly.
However, in order to initialize the Gibbs sampler, we may want to be able to
sample from the prior.
Let us do this using the prior over weights:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="c1"&gt;# prior precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;S_inv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# initialize `beta`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;multivariate_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S_inv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;or more simply:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="c1"&gt;# prior precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# initialize `beta`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="conditional-likelihood"&gt;Conditional likelihood&lt;/h3&gt;
&lt;p&gt;The conditional likelihood is defined like before, except we instead
condition on weights $\boldsymbol{\beta}$ and substitute occurrences
of $\mathbf{f}$ with $\boldsymbol{\Phi} \boldsymbol{\beta}$,
&lt;/p&gt;
$$
p(\mathbf{y} | \boldsymbol{\beta}, \boldsymbol{\omega}) \propto \mathcal{N}\left (\boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} | \boldsymbol{\Phi} \boldsymbol{\beta}, \boldsymbol{\Omega}^{-1} \right ).
$$&lt;h3 id="inference-and-prediction-1"&gt;Inference and Prediction&lt;/h3&gt;
&lt;h4 id="posterior-over-latent-function-values-1"&gt;Posterior over latent function values&lt;/h4&gt;
&lt;p&gt;The posterior over the latent weights $\boldsymbol{\beta}$ conditioned on the
auxiliary latent variables $\boldsymbol{\omega}$ is
&lt;/p&gt;
$$
p(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\omega}) = \mathcal{N}(\boldsymbol{\beta} | \boldsymbol{\Sigma} \boldsymbol{\Phi}^{\top} \boldsymbol{\kappa}, \boldsymbol{\Sigma}),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\Sigma} = \left (\boldsymbol{\Phi}^{\top} \boldsymbol{\Omega} \boldsymbol{\Phi} + \alpha \mathbf{I} \right )^{-1}.
$$&lt;p&gt;Let us implement the function that computes the mean and covariance
of $p(\boldsymbol{\beta} | \mathbf{y}, \boldsymbol{\omega})$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conditional_posterior_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Sigma_inv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sigma_inv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sigma_inv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;and a function to return samples from the multivariate Gaussian parameterized
by this mean and covariance:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gassian_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_random_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;multivariate_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="posterior-over-auxiliary-variables-1"&gt;Posterior over auxiliary variables&lt;/h4&gt;
&lt;p&gt;The conditional posterior over the local auxiliary variable $\omega_n$ is
defined as before, except we instead condition on weights $\boldsymbol{\beta}$
and substitute occurrences of $f_n$ with $\boldsymbol{\beta}^{\top} \phi(\mathbf{x}_n)$,
&lt;/p&gt;
$$
p(\omega_n | \boldsymbol{\beta}) \propto
\mathrm{PG}(\omega_n | 1, \boldsymbol{\beta}^{\top} \phi(\mathbf{x}_n)).
$$&lt;p&gt;Let us implement a function to compute the parameters of the posterior
Polya-gamma distribution:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conditional_posterior_auxiliary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;and accordingly a function to return samples from this distribution:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;polya_gamma_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PyPolyaGamma&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;shape mismatch&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pgdrawv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where we have imported the &lt;code&gt;PyPolyaGamma&lt;/code&gt; object from
the &lt;code&gt;pypolyagamma&lt;/code&gt; package:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;from pypolyagamma import PyPolyaGamma
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;pypolyagamma&lt;/code&gt; package can be installed via &lt;code&gt;pip&lt;/code&gt; as usual:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip install pypolyagamma
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To provide some context, this package is a port, created by S. Linderman, of the
original R package authored by J. Windle, which implements the method described in their paper on the efficient sampling
of Pólya-gamma variables&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
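&lt;p&gt;As a quick usage example, the &lt;code&gt;PyPolyaGamma&lt;/code&gt; object also exposes a scalar &lt;code&gt;pgdraw&lt;/code&gt; method alongside the vectorized &lt;code&gt;pgdrawv&lt;/code&gt; used above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from pypolyagamma import PyPolyaGamma

pg = PyPolyaGamma(seed=42)

# A single draw from PG(1, 0); since the mean of PG(b, 0) is b / 4, averaging
# many such draws should give a value close to 0.25.
omega = pg.pgdraw(1.0, 0.0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;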
&lt;h4 id="gibbs-sampling"&gt;Gibbs sampling&lt;/h4&gt;
&lt;p&gt;With these functions defined, we can implement the Gibbs sampling procedure with the
simple for-loop below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# preprocessing&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;basis_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# initialize `beta`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conditional_posterior_auxiliary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;polya_gamma_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conditional_posterior_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gassian_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We now visualize the samples $(\boldsymbol{\beta}^{(t)}, \boldsymbol{\omega}^{(t)})$
produced by this procedure.
In the figures that follow, we set the hues to be proportional to the step
counter $t$ along a perceptually uniform colormap.&lt;/p&gt;
&lt;p&gt;First, we show the sampled weight vector $\boldsymbol{\beta}^{(t)} \in \mathbb{R}^K$
where we have set $K = 3$.
We plot the $i$th entry $\beta_i^{(t)}$ against the $j$th entry $\beta_j^{(t)}$
for all $0 \leq i &lt; j &lt; K$.
&lt;figure&gt;&lt;img src="https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/figures/beta_paper_600x600.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Parameter $\boldsymbol{\beta}^{(t)}$ samples as Gibbs sampling iteration $t$ increases.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
We find a strong correlation between $\beta_1$ and $\beta_2$, the
coefficients associated with the linear and quadratic terms of our augmented
feature representation, respectively.
Furthermore, we find $\beta_1$ to consistently have a relatively large
magnitude.&lt;/p&gt;
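&lt;p&gt;A figure along these lines can be produced with a sketch like the following, which reuses the functions and variables defined above, re-runs the Gibbs loop while storing every draw of $\boldsymbol{\beta}$, and colors the points by iteration using the perceptually uniform &lt;em&gt;viridis&lt;/em&gt; colormap. For brevity, the grid of panels is reduced to a single $(\beta_1, \beta_2)$ scatter plot:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
import matplotlib.pyplot as plt

# Re-run the Gibbs loop above, this time storing each draw of `beta`.
betas = []
for i in range(num_iterations):
    b, c = conditional_posterior_auxiliary(Phi, beta)
    omega = polya_gamma_sample(b, c, pg=pg)
    mu, Sigma = conditional_posterior_weights(Phi, kappa, alpha, omega)
    beta = gassian_sample(mu, Sigma, random_state=random_state)
    betas.append(beta)
betas = np.vstack(betas)

# Scatter plot of (beta_1, beta_2), with hue proportional to the iteration t.
fig, ax = plt.subplots()
ax.scatter(betas[:, 1], betas[:, 2], c=np.arange(len(betas)), cmap="viridis", s=8.0)
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;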
&lt;p&gt;Second, we show the sampled auxiliary latent variables $\boldsymbol{\omega}^{(t)}$ by
plotting the pairs $(x_n, \omega_n^{(t)})$.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/figures/omega_paper_1500x927.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Auxiliary variable $\omega_n^{(t)}$ samples as Gibbs sampling iteration $t$ increases. For visualization purposes, each $\omega_n^{(t)}$ is placed at its corresponding input location $x_n$ along the horizontal axis.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;As expected, we find longer-tailed distributions in the variables $\omega_n$
that are associated with negative examples.&lt;/p&gt;
&lt;p&gt;Finally, we plot the sampled class-membership probability predictions
&lt;/p&gt;
$$
\pi^{(t)}(\mathbf{x}) = \sigma(f^{(t)}(\mathbf{x})),
\quad
\text{where}
\quad
f^{(t)}(\mathbf{x}) = {\boldsymbol{\beta}^{(t)}}^{\top} \phi(\mathbf{x}),
$$&lt;p&gt;
in the figure below:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/polya-gamma-bayesian-logistic-regression/figures/class_prob_pred_paper_1500x927.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Predicted class-membership probability $\pi^{(t)}(\mathbf{x})$ as Gibbs sampling iteration $t$ increases.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;At least qualitatively, we find that the sampling procedure produces
predictions that fit the true class-membership probability reasonably well.&lt;/p&gt;
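&lt;p&gt;The prediction curves in the figure above can be reproduced with a sketch like the following, assuming the draws have been collected in &lt;code&gt;betas&lt;/code&gt; as in the earlier sketch; the grid bounds and the thinning factor are arbitrary choices:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
import matplotlib.pyplot as plt

from scipy.special import expit  # logistic sigmoid

# Evaluate pi^(t)(x) = sigmoid(phi(x)^T beta^(t)) on a dense grid of inputs.
x_grid = np.linspace(-6.0, 6.0, 512)
Phi_grid = basis_function(x_grid.reshape(-1, 1), degree=degree)  # shape (512, K)
pi_samples = expit(Phi_grid @ betas.T)                           # shape (512, T)

fig, ax = plt.subplots()
ax.plot(x_grid, pi_samples[:, ::50], alpha=0.4)  # a thinned subset of the sampled curves
ax.plot(x_grid, p.pdf(x_grid) / (p.pdf(x_grid) + q.pdf(x_grid)), c="k")  # true p(y=1 | x)
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$\pi(x)$")
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;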
&lt;h3 id="code"&gt;Code&lt;/h3&gt;
&lt;p&gt;The full code is reproduced below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pypolyagamma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPolyaGamma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draw_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;make_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;basis_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;conditional_posterior_auxiliary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;polya_gamma_sample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;conditional_posterior_weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gassian_sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# constants&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;num_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;num_iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;degree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="c1"&gt;# prior precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8888&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PyPolyaGamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# generate dataset&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;X_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;draw_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# preprocessing&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;basis_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# initialize `beta`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conditional_posterior_auxiliary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;polya_gamma_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conditional_posterior_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gassian_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;where the module &lt;code&gt;utils.py&lt;/code&gt; contains:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check_random_state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pypolyagamma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPolyaGamma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;draw_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;num_top&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X_top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_bot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X_top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_bot&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_neg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;X_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_neg&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_pos&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_neg&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;basis_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;polya_gamma_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PyPolyaGamma&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;shape mismatch&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pgdrawv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gassian_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_random_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;multivariate_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cov&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conditional_posterior_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;eye&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Sigma_inv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;eye&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sigma_inv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sigma_inv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conditional_posterior_auxiliary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="bonus-gibbs-sampling-with-mutual-recursion-and-generator-delegation"&gt;Bonus: Gibbs sampling with mutual recursion and generator delegation&lt;/h3&gt;
&lt;p&gt;The Gibbs sampling procedure naturally lends itself to implementations based
on generators.
Combining this with the &lt;code&gt;yield from&lt;/code&gt; expression for generator
delegation, we can succinctly replace the for-loop with the following mutually
recursive functions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gibbs_sampler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conditional_posterior_auxiliary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;polya_gamma_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield from&lt;/span&gt; &lt;span class="n"&gt;gibbs_sampler_helper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gibbs_sampler_helper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conditional_posterior_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gassian_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield from&lt;/span&gt; &lt;span class="n"&gt;gibbs_sampler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now you can use &lt;code&gt;gibbs_sampler&lt;/code&gt; as a generator,
for example, to explicitly iterate over it in a for-loop:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omega&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gibbs_sampler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stop_predicate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# do something&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;or by making use of &lt;code&gt;itertools.islice&lt;/code&gt; and other
&lt;code&gt;itertools&lt;/code&gt; primitives:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;islice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# example: collect beta and omega samples into respective lists&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;omegas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;islice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gibbs_sampler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;There are a few obvious drawbacks to this implementation.
First, while it may be a lot of fun to write, it will probably not be as much fun to
read when you revisit it later down the line.
Second, you may occasionally hit the maximum recursion depth before completing
enough iterations for the warm-up, or &amp;ldquo;burn-in&amp;rdquo;, phase.
Needless to say, the latter alone can make this implementation a non-starter.&lt;/p&gt;
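&lt;p&gt;If you want to keep the convenience of consuming the sampler as a generator
while avoiding the recursion limit altogether, you can fall back to an ordinary
loop inside a single generator function.
The following is a minimal sketch (not part of the original implementation) that
reuses the helper functions from &lt;code&gt;utils.py&lt;/code&gt; and acts as a drop-in
replacement for the mutually recursive pair:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def gibbs_sampler_iterative(beta, Phi, kappa, alpha, pg, random_state):
    # infinite generator over (beta, omega) samples; equivalent to the
    # mutually recursive version above, but without growing the call stack
    while True:
        b, c = conditional_posterior_auxiliary(Phi, beta)
        omega = polya_gamma_sample(b, c, pg=pg)
        mu, Sigma = conditional_posterior_weights(Phi, kappa, alpha, omega)
        beta = gassian_sample(mu, Sigma, random_state=random_state)
        yield beta, omega
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This can be consumed with &lt;code&gt;itertools.islice&lt;/code&gt; exactly as above,
with no risk of hitting the maximum recursion depth.&lt;/p&gt;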
&lt;h2 id="links-and-further-readings"&gt;Links and Further Readings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Papers:
&lt;ul&gt;
&lt;li&gt;Original paper (Polson et al., 2013)&lt;sup id="fnref1:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Extended to GP classification (Wenzel et al., 2019)&lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Few-shot classification with GPs and the one-vs-each likelihood (Snell et al., 2020)&lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Blog posts:
&lt;ul&gt;
&lt;li&gt;A blog post on Pólya-Gamma augmentation by G. Gundersen&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Code:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pypolyagamma&lt;/code&gt;: A Python package by S. Linderman&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BayesLogit&lt;/code&gt;: An R package by J. Windle&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2021polyagamma,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{A} {P}rimer on {P}ólya-gamma {R}andom {V}ariables - {P}art II: {B}ayesian {L}ogistic {R}egression&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2021&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/polya-gamma-bayesian-logistic-regression/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on social media!&lt;/p&gt;
&lt;h2 id="appendix"&gt;Appendix&lt;/h2&gt;
&lt;h3 id="i"&gt;I&lt;/h3&gt;
&lt;p&gt;First, note that the logistic function can be written as
&lt;/p&gt;
$$
\sigma(u) = \frac{1}{1+e^{-u}} = \frac{e^u}{1+e^u}
$$&lt;p&gt;
Therefore, we have
&lt;/p&gt;
$$
\begin{align*}
p(y_n | f_n) &amp;=
\left ( \frac{e^{f_n}}{1+e^{f_n}} \right )^{y_n}
\left ( \frac{\left (1+e^{f_n} \right ) - e^{f_n}}{1+e^{f_n}} \right )^{1-y_n} \newline &amp;=
\left ( \frac{e^{f_n}}{1+e^{f_n}} \right )^{y_n}
\left ( \frac{1}{1+e^{f_n}} \right )^{1-y_n} \newline &amp;=
\left (e^{f_n} \right )^{y_n} \left ( \frac{1}{1+e^{f_n}} \right )^{y_n}
\left ( \frac{1}{1+e^{f_n}} \right )^{1-y_n} \newline &amp;=
\frac{e^{y_n f_n}}{1 + e^{f_n}}
\end{align*}
$$&lt;h3 id="ii"&gt;II&lt;/h3&gt;
&lt;p&gt;The conditional likelihood factorizes as
&lt;/p&gt;
$$
\begin{align*}
p(\mathbf{y} | \mathbf{f}, \boldsymbol{\omega}) &amp;=
\prod_{n} p(y_n | f_n, \omega_n) \newline &amp;\propto
\prod_{n} \exp{\left ( - \frac{\omega_n}{2} \left (f_n - z_n \right )^2 \right )} \newline &amp;=
\exp{\left ( - \frac{1}{2} \sum_{n} \omega_n \left (f_n - z_n \right )^2 \right )} \newline &amp;=
\exp{\left \{ - \frac{1}{2} (\mathbf{f} - \mathbf{z})^{\top} \boldsymbol{\Omega} (\mathbf{f} - \mathbf{z}) \right \}} \newline &amp;\propto
\mathcal{N}\left (\boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} | \mathbf{f}, \boldsymbol{\Omega}^{-1} \right )
\end{align*}
$$&lt;h3 id="iii"&gt;III&lt;/h3&gt;
&lt;p&gt;We have omitted the normalizing constant $\int p(f_n, \omega_n) d\omega_n$
from our discussion for the sake of brevity since it is not required to carry
out inference using Gibbs sampling.
However, this is easy to compute, simply by referring to the Laplace transform
of the $\mathrm{PG}(1, 0)$ distribution:&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="laplace-transform-of-the--distribution-polson-et-al-2013"&gt;Laplace transform of the $\mathrm{PG}(1, 0)$ distribution (Polson et al. 2013)&lt;/h4&gt;
&lt;p&gt;The Laplace transform of the $\mathrm{PG}(1, 0)$ distribution is
&lt;/p&gt;
$$
\mathbb{E}_{\omega \sim \mathrm{PG}(1, 0)}[\exp(-\omega t)] =
\frac{1}{\cosh{\left(\sqrt{\frac{t}{2}}\right)}}.
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Hence, by making the substitution $t = \frac{f_n^2}{2}$, we obtain
&lt;/p&gt;
$$
\int p(f_n, \omega_n) d\omega_n =
\int \exp{\left (-\frac{f_n^2}{2}\omega_n \right )} \mathrm{PG}(\omega_n | 1, 0) d\omega_n =
\frac{1}{\cosh{\left(\frac{f_n}{2}\right)}}.
$$&lt;p&gt;
Therefore, we have
&lt;/p&gt;
$$
\begin{align*}
p(\omega_n | f_n) &amp; =
\frac{p(f_n, \omega_n)}{\int p(f_n, \omega_n) d\omega_n} \newline &amp; =
\cosh{\left(\frac{f_n}{2}\right)} \exp{\left (-\frac{f_n^2}{2}\omega_n \right )}
\mathrm{PG}(\omega_n | 1, 0) \newline &amp; = \mathrm{PG}(\omega_n | 1, f_n).
\end{align*}
$$&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;MacKay, D. J. (1992).
The Evidence Framework Applied to Classification Networks. Neural Computation, 4(5), 720-736.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Jaakkola, T. S., &amp;amp; Jordan, M. I. (2000).
Bayesian Parameter Estimation via Variational Methods. Statistics and Computing, 10(1), 25-37.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Minka, T. P. (2001, August).
Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (pp. 362-369).&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Polson, N. G., Scott, J. G., &amp;amp; Windle, J. (2013).
Bayesian Inference for Logistic Models Using Pólya-Gamma Latent Variables. Journal of the American Statistical Association, 108(504), 1339-1349.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Geman, S., &amp;amp; Geman, D. (1984).
Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6), 721-741.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Windle, J., Polson, N. G., &amp;amp; Scott, J. G. (2014).
Sampling Pólya-Gamma Random Variates: Alternate and Approximate Techniques. arXiv preprint arXiv:1405.0506.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., &amp;amp; Opper, M. (2019, July).
Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 5417-5424).&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Snell, J., &amp;amp; Zemel, R. (2020).
Bayesian Few-Shot Classification with One-vs-Each Pólya-Gamma Augmented Gaussian Processes. arXiv preprint arXiv:2007.10417.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>An Illustrated Guide to the Knowledge Gradient Acquisition Function</title><link>https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/</link><pubDate>Thu, 18 Feb 2021 19:13:23 +0100</pubDate><guid>https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;Draft &amp;ndash; work in progress.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We provide a short guide to the knowledge-gradient (KG) acquisition
function (Frazier et al., 2009)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; for Bayesian
optimization (BO).
Rather than being a self-contained tutorial, this post is intended to serve as
an illustrated compendium to the paper of Frazier et al., 2009&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;
and the subsequent tutorial by Frazier, 2018&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, authored
nearly a decade later.&lt;/p&gt;
&lt;p&gt;This post assumes a basic level of familiarity with BO and Gaussian processes (GPs),
to the extent provided by the literature survey of Shahriari et al.,
2015&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;, and the acclaimed textbook of Rasmussen and Williams, 2006,
respectively.&lt;/p&gt;
&lt;h2 id="knowledge-gradient"&gt;Knowledge-gradient&lt;/h2&gt;
&lt;p&gt;First, we set up the notation and terminology.
Let $f: \mathcal{X} \to \mathbb{R}$ be the blackbox function we wish to
minimize.
We denote the GP posterior predictive distribution, or &lt;em&gt;predictive&lt;/em&gt; for short,
by $p(y | \mathbf{x}, \mathcal{D})$.
The mean of the predictive, or the &lt;em&gt;predictive mean&lt;/em&gt; for short, is denoted by
&lt;/p&gt;
$$
\mu(\mathbf{x}; \mathcal{D}) = \mathbb{E}[y | \mathbf{x}, \mathcal{D}]
$$&lt;p&gt;
Let $\mathcal{D}_n$ be the set of $n$ input-output
observations $\mathcal{D}_n = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$, where
output $y_i = f(\mathbf{x}_i) + \epsilon$ is assumed to be observed with noise
$\epsilon \sim \mathcal{N}(0, \sigma^2)$.
We make the following abbreviation
&lt;/p&gt;
$$
\mu_n(\mathbf{x}) = \mu(\mathbf{x}; \mathcal{D}_n)
$$&lt;p&gt;
Next, we define the minimum of the predictive mean, or &lt;em&gt;predictive minimum&lt;/em&gt; for short,
as
&lt;/p&gt;
$$
\tau(\mathcal{D}) = \min_{\mathbf{x}' \in \mathcal{X}} \mu(\mathbf{x}'; \mathcal{D})
$$&lt;p&gt;
If we view $\mu(\mathbf{x}; \mathcal{D})$ as our fit to the underlying
function $f(\mathbf{x})$ from which the observations $\mathcal{D}$ were
generated, then $\tau(\mathcal{D})$ is our estimate of the minimum of $f(\mathbf{x})$,
given observations $\mathcal{D}$.&lt;/p&gt;
&lt;p&gt;Further, we make the following abbreviations
&lt;/p&gt;
$$
\tau_n = \tau(\mathcal{D}_n),
\qquad
\text{and}
\qquad
\tau_{n+1} = \tau(\mathcal{D}_{n+1}),
$$&lt;p&gt;
where $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{ (\mathbf{x}, y) \}$ is the
set of existing observations, augmented by some input-output pair $(\mathbf{x}, y)$.
Then, the knowledge-gradient is defined as
&lt;/p&gt;
$$
\alpha(\mathbf{x}; \mathcal{D}_n) =
\mathbb{E}_{p(y | \mathbf{x}, \mathcal{D}_n)} [ \tau_n - \tau_{n+1} ]
$$&lt;p&gt;
Crucially, note that $\tau_{n+1}$ is implicitly a function of $(\mathbf{x}, y)$,
and that this expectation integrates over all possible outcomes $y$ at the
given input $\mathbf{x}$ under the
predictive $p(y | \mathbf{x}, \mathcal{D}_n)$.&lt;/p&gt;
&lt;h3 id="monte-carlo-estimation"&gt;Monte Carlo estimation&lt;/h3&gt;
&lt;p&gt;Not surprisingly, the knowledge-gradient function is analytically intractable.
Therefore, in practice, we compute it using Monte Carlo estimation,
&lt;/p&gt;
$$
\alpha(\mathbf{x}; \mathcal{D}_n) \approx
\frac{1}{M} \sum_{m=1}^M \left ( \tau_n - \tau_{n+1}^{(m)} \right ),
\qquad
y^{(m)} \sim p(y | \mathbf{x}, \mathcal{D}_n),
$$&lt;p&gt;
where $\tau_{n+1}^{(m)} = \tau(\mathcal{D}_{n+1}^{(m)})$
and $\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (\mathbf{x}, y^{(m)}) \}$.&lt;/p&gt;
&lt;p&gt;We refer to $y^{(m)}$ as the $m$th simulated outcome, or the $m$th &lt;em&gt;simulation&lt;/em&gt;
for short.
Then, $\mathcal{D}_{n+1}^{(m)}$ is the $m$th simulation-augmented dataset and,
accordingly, $\tau_{n+1}^{(m)}$ is the $m$th simulation-augmented predictive minimum.&lt;/p&gt;
&lt;p&gt;We see that this approximation to the knowledge-gradient is simply the average
difference between the predictive minimum values &lt;em&gt;based on simulation-augmented
data&lt;/em&gt; $\tau_{n+1}^{(m)}$, and that &lt;em&gt;based on observed data&lt;/em&gt; $\tau_n$,
across $M$ simulations.&lt;/p&gt;
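&lt;p&gt;In code, the estimator itself is just an average of differences; all of the
work lies in producing the simulation-augmented predictive
minimums $\tau_{n+1}^{(m)}$, which the remainder of this post illustrates step
by step (a concrete end-to-end sketch appears at the end of the worked example):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def knowledge_gradient_estimate(tau_n, tau_next):
    # tau_n: scalar predictive minimum based on the observed data
    # tau_next: sequence of M simulation-augmented predictive minimums
    return np.mean(tau_n - np.asarray(tau_next))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;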
&lt;p&gt;This might take a moment to digest, as there are quite a number of moving parts
to keep track of. To help visualize these parts, we provide an illustration of
each of the steps required to compute KG on a simple one-dimensional synthetic
problem.&lt;/p&gt;
&lt;h2 id="one-dimensional-example"&gt;One-dimensional example&lt;/h2&gt;
&lt;p&gt;As the running example throughout this post, we use a synthetic function
defined as
&lt;/p&gt;
$$
f(x) = \sin(3x) + x^2 - 0.7 x.
$$&lt;p&gt;
We generate $n=10$ observations at locations sampled uniformly at random.
The true function, and the set of noisy observations $\mathcal{D}_n$ are
visualized in the figure below:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/observations_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Latent blackbox function and $n=10$ observations.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
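&lt;p&gt;For readers following along in code, the synthetic function and the noisy
observations can be generated along the following lines; this is a sketch, and
the noise scale and sampling range are assumptions rather than the exact values
used to produce the figures:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def f(x):
    # latent blackbox function
    return np.sin(3.0 * x) + x**2 - 0.7 * x


num_observations = 10
noise_scale = 0.2  # assumed, not stated in the text

rng = np.random.RandomState(42)
X = rng.uniform(low=-1.0, high=2.0, size=(num_observations, 1))  # assumed range
y = f(X).squeeze() + noise_scale * rng.randn(num_observations)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;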
&lt;p&gt;Using the observations $\mathcal{D}_n$ we have collected so far, we wish to
use KG to score a candidate location $x_c$ at which to evaluate next.&lt;/p&gt;
&lt;h2 id="posterior-predictive-distribution"&gt;Posterior predictive distribution&lt;/h2&gt;
&lt;p&gt;The posterior predictive $p(y | \mathbf{x}, \mathcal{D}_n)$ is visualized in
the figure below. In particular, the predictive mean $\mu_n(\mathbf{x})$ is
represented by the solid orange curve.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_mean_before_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Posterior predictive distribution (*before* hyperparameter estimation).&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Clearly, this is a poor fit to the data, and it provides a poorly-calibrated
estimate of the predictive uncertainty.&lt;/p&gt;
&lt;h3 id="step-1-hyperparameter-estimation"&gt;Step 1: Hyperparameter estimation&lt;/h3&gt;
&lt;p&gt;Therefore, the first step is to optimize the hyperparameters of the GP regression
model, i.e. the kernel lengthscale, amplitude, and the observation noise variance.
We do this using type-II maximum likelihood estimation (MLE), or &lt;em&gt;empirical Bayes&lt;/em&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_mean_after_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Posterior predictive distribution (*after* hyperparameter estimation).&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
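&lt;p&gt;As a concrete illustration (not the code used to produce these figures),
type-II MLE of the kernel lengthscale, amplitude, and observation noise
variance can be carried out with scikit-learn, which maximizes the log marginal
likelihood when fitting:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# amplitude * RBF(lengthscale) + observation noise variance
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, random_state=42)
gpr.fit(X, y)  # type-II MLE of the hyperparameters

print(gpr.kernel_)  # fitted kernel hyperparameters
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;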
&lt;h3 id="step-2-determine-the-predictive-minimum"&gt;Step 2: Determine the predictive minimum&lt;/h3&gt;
&lt;p&gt;Next, we compute the predictive minimum $\tau_n = \min_{\mathbf{x}' \in \mathcal{X}} \mu_n(\mathbf{x}')$.
Since $\mu_n$ is end-to-end differentiable with respect to the input $\mathbf{x}$, we can
simply use a multi-start quasi-Newton method such as L-BFGS.
We visualize this in the figure below, where the value of the predictive
minimum is represented by the orange horizontal dashed line, and its location is
denoted by the orange star and triangle.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/predictive_minimum_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Predictive minimum $\tau_n$.&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
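&lt;p&gt;Continuing the scikit-learn sketch above, a simple multi-start L-BFGS-B
search over the predictive mean might look like this; the search bounds are an
assumption, and the gradients are approximated numerically here for simplicity:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from scipy.optimize import minimize


def predictive_mean(x, model):
    # scalar predictive mean mu_n(x) of the fitted GP at a single input x
    return model.predict(np.atleast_2d(x)).item()


def predictive_minimum(model, bounds=(-1.0, 2.0), num_starts=10, seed=0):
    # multi-start L-BFGS-B minimization of the predictive mean
    rng = np.random.RandomState(seed)
    results = [
        minimize(predictive_mean, x0=[x0], args=(model,),
                 bounds=[bounds], method=&amp;#34;L-BFGS-B&amp;#34;)
        for x0 in rng.uniform(*bounds, size=num_starts)
    ]
    best = min(results, key=lambda res: res.fun)
    return best.fun, best.x.item()


tau_n, x_min = predictive_minimum(gpr)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;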
&lt;h3 id="step-3-compute-simulation-augmented-predictive-means"&gt;Step 3: Compute simulation-augmented predictive means&lt;/h3&gt;
&lt;p&gt;Suppose we are scoring the candidate location $x_c = 0.1$.
For illustrative purposes, let us draw just $M=1$ sample $y_c^{(1)} \sim p(y | x_c, \mathcal{D}_n)$.
In the figure below, the candidate location $x_c$ is represented by the
vertical solid gray line, and the single simulated outcome $y_c^{(1)}$ is
represented by the filled blue dot.&lt;/p&gt;
&lt;p&gt;In general, we denote the simulation-augmented predictive mean as
&lt;/p&gt;
$$
\mu_{n+1}^{(m)}(\mathbf{x}) = \mu(\mathbf{x}; \mathcal{D}_{n+1}^{(m)}),
$$&lt;p&gt;
where
$\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (\mathbf{x}, y^{(m)}) \}$
as defined earlier.&lt;/p&gt;
&lt;p&gt;Here, the simulation-augmented dataset $\mathcal{D}_{n+1}^{(1)}$ is the set
of existing observations $\mathcal{D}_n$, augmented by the simulated
input-output pair $(x_c, y_c^{(1)})$,
&lt;/p&gt;
$$
\mathcal{D}_{n+1}^{(1)} = \mathcal{D}_n \cup \{ (x_c, y_c^{(1)}) \},
$$&lt;p&gt;
and the corresponding simulation-augmented predictive mean $\mu_{n+1}^{(1)}(x)$
is represented in the figure below by the solid blue curve.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/simulated_predictive_mean_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive mean $\mu_{n&amp;#43;1}^{(1)}(x)$ at location $x_c = 0.1$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
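&lt;p&gt;Continuing the sketch, drawing a simulated outcome and conditioning on the
augmented dataset can be done by refitting the model with its hyperparameters
held fixed (via &lt;code&gt;optimizer=None&lt;/code&gt;); with the &lt;code&gt;WhiteKernel&lt;/code&gt;
included in the kernel, the predictive standard deviation below includes the
observation noise:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;x_c = 0.1
rng = np.random.RandomState(0)

# draw a single simulated outcome y_c^(1) ~ p(y | x_c, D_n)
mean_c, std_c = gpr.predict(np.array([[x_c]]), return_std=True)
y_c = rng.normal(mean_c.item(), std_c.item())

# augment the observed data with the simulated pair (x_c, y_c^(1))
X_aug = np.vstack([X, [[x_c]]])
y_aug = np.append(y, y_c)

# refit with the hyperparameters held fixed
gpr_aug = GaussianProcessRegressor(kernel=gpr.kernel_, optimizer=None)
gpr_aug.fit(X_aug, y_aug)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;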
&lt;h3 id="step-4-compute-simulation-augmented-predictive-minimums"&gt;Step 4: Compute simulation-augmented predictive minimums&lt;/h3&gt;
&lt;p&gt;Next, we compute the simulation-augmented predictive minimum
&lt;/p&gt;
$$
\tau_{n+1}^{(1)} = \min_{\mathbf{x}' \in \mathcal{X}} \mu_{n+1}^{(1)}(\mathbf{x}')
$$&lt;p&gt;
It may not be immediately obvious, but $\mu_{n+1}^{(1)}$ is in fact also
end-to-end differentiable with respect to the input $\mathbf{x}$. Therefore, we can again
appeal to a method such as L-BFGS.
We visualize this in the figure below, where the value of the simulation-augmented
predictive minimum is represented by the blue horizontal dashed line, and its
location is denoted by the blue star and triangle.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/simulated_predictive_minimum_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive minimum $\tau_{n&amp;#43;1}^{(1)}$ at location $x_c = 0.1$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
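&lt;p&gt;In the running sketch, the simulation-augmented predictive minimum is
obtained by reusing the same multi-start routine on the refitted model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;tau_next, _ = predictive_minimum(gpr_aug)

# single-sample (M=1) Monte Carlo estimate of the knowledge-gradient at x_c
kg_single = tau_n - tau_next
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;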
&lt;p&gt;Taking the difference between the orange and blue horizontal dashed lines
gives us an unbiased estimate of the knowledge-gradient.
However, this is likely to be a crude one, since it is based on just a single
MC sample.
To obtain a more accurate estimate, one needs to increase $M$, the number of
MC samples.&lt;/p&gt;
&lt;h4 id="samples"&gt;Samples $M &gt; 1$&lt;/h4&gt;
&lt;p&gt;Let us now consider $M=5$ samples. We draw $y_c^{(m)} \sim p(y | x_c, \mathcal{D}_n)$,
for $m = 1, \dotsc, 5$.
As before, the input location $x_c$ is represented by the vertical solid
gray line, and the corresponding simulated outcomes are represented by the
filled dots below, with varying hues from a perceptually uniform color palette
to distinguish between samples.&lt;/p&gt;
&lt;p&gt;Accordingly, the simulation-augmented predictive means
$\mu_{n+1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$ are
represented by the colored curves, with hues set to that of the simulated
outcome on which the predictive distribution is based.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/bar_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive mean $\mu_{n&amp;#43;1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Next we compute the simulation-augmented predictive
minimum $\tau_{n+1}^{(m)}$, which requires minimizing
$\mu_{n+1}^{(m)}(x)$ for $m = 1, \dotsc, 5$.
These values are represented below by the horizontal dashed lines, and their
location is denoted by the stars and triangles.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/figures/baz_paper_1800x1112.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Simulation-augmented predictive minimum $\tau_{n&amp;#43;1}^{(m)}$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Finally, taking the average difference between the orange dashed line and every
other dashed line gives us the estimate of the knowledge gradient at
input $x_c$.&lt;/p&gt;
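&lt;p&gt;Putting the pieces of the sketch together, the $M$-sample estimate at a
candidate location simply averages these differences:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def knowledge_gradient(x_c, model, X, y, tau_n, num_samples=5, seed=0):
    # Monte Carlo estimate of KG at candidate x_c, reusing the helpers above
    rng = np.random.RandomState(seed)
    mean_c, std_c = model.predict(np.array([[x_c]]), return_std=True)

    diffs = []
    for m in range(num_samples):
        y_c = rng.normal(mean_c.item(), std_c.item())  # m-th simulated outcome
        X_aug = np.vstack([X, [[x_c]]])                # augment inputs with x_c
        y_aug = np.append(y, y_c)                      # ... and outputs with y_c^(m)

        model_aug = GaussianProcessRegressor(kernel=model.kernel_, optimizer=None)
        model_aug.fit(X_aug, y_aug)                    # hyperparameters held fixed

        tau_next, _ = predictive_minimum(model_aug)
        diffs.append(tau_n - tau_next)

    return np.mean(diffs)


kg_estimate = knowledge_gradient(0.1, gpr, X, y, tau_n, num_samples=5)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;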
&lt;h2 id="links-and-further-readings"&gt;Links and Further Readings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;In this post, we only showed a (naïve) approach to calculating the KG at a
given location.
Suffice it to say, there is still quite a gap between this and being able to
efficiently maximize KG within a sequential decision-making algorithm.
For a guide on incorporating KG in a modular and fully-fledged framework for
BO, see the documentation of a library such as BoTorch.&lt;/li&gt;
&lt;li&gt;Another introduction to KG:
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2021knowledge,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{A}n {I}llustrated {G}uide to the {K}nowledge {G}radient {A}cquisition {F}unction&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2021&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on
and
!&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Frazier, P., Powell, W., &amp;amp; Dayanik, S. (2009).
The Knowledge-Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing, 21(4), 599-613.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Frazier, P. I. (2018).
A Tutorial on Bayesian Optimization. arXiv preprint arXiv:1807.02811.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015).
Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1), 148-175.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Contributed Talk: BORE — Bayesian Optimization by Density-Ratio Estimation</title><link>https://tiao.io/events/neurips2020-meta-learning/</link><pubDate>Fri, 11 Dec 2020 15:00:00 +0000</pubDate><guid>https://tiao.io/events/neurips2020-meta-learning/</guid><description/></item><item><title>Bayesian Optimization by Density Ratio Estimation</title><link>https://tiao.io/publications/bore-1/</link><pubDate>Tue, 01 Dec 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/bore-1/</guid><description/></item><item><title>📄 One paper accepted to NeurIPS 2020</title><link>https://tiao.io/posts/one-paper-accepted-to-neurips2020/</link><pubDate>Fri, 25 Sep 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/one-paper-accepted-to-neurips2020/</guid><description>&lt;p&gt;Our paper
was accepted to NeurIPS 2020 as a
&lt;strong&gt;Spotlight Presentation&lt;/strong&gt; (awarded to the top 3% of submissions). This is
joint work with Pantelis Elinas and Edwin Bonilla.&lt;/p&gt;</description></item><item><title>Variational Inference for Graph Convolutional Networks in the Absence of Graph Data and Adversarial Settings</title><link>https://tiao.io/publications/vi-gcn-2/</link><pubDate>Mon, 01 Jun 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/vi-gcn-2/</guid><description>&lt;p&gt;This paper is a follow-up to our
, previously
presented at the NeurIPS2019 Graph Representation Learning Workshop, now with
significantly expanded experimental analyses.&lt;/p&gt;
&lt;div id="presentation-embed-38937946"&gt;&lt;/div&gt;
&lt;script src='https://slideslive.com/embed_presentation.js'&gt;&lt;/script&gt;
&lt;script&gt;
embed = new SlidesLiveEmbed('presentation-embed-38937946', {
presentationId: '38937946',
autoPlay: false, // change to true to autoplay the embedded presentation
verticalEnabled: true
});
&lt;/script&gt;</description></item><item><title>Model-based Asynchronous Hyperparameter and Neural Architecture Search</title><link>https://tiao.io/publications/async-multi-fidelity-hpo/</link><pubDate>Sun, 01 Mar 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/async-multi-fidelity-hpo/</guid><description/></item><item><title>Variational Graph Convolutional Networks</title><link>https://tiao.io/publications/vi-gcn-1/</link><pubDate>Sun, 01 Dec 2019 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/vi-gcn-1/</guid><description/></item><item><title>A Handbook for Sparse Variational Gaussian Processes</title><link>https://tiao.io/posts/sparse-variational-gaussian-processes/</link><pubDate>Fri, 13 Sep 2019 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/sparse-variational-gaussian-processes/</guid><description>
&lt;details class="print:hidden xl:hidden" &gt;
&lt;summary&gt;Table of Contents&lt;/summary&gt;
&lt;div class="text-sm"&gt;
&lt;nav id="TableOfContents"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#prior"&gt;Prior&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#marginal-prior-over-inducing-variables"&gt;Marginal prior over inducing variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conditional-prior"&gt;Conditional prior&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#variational-distribution"&gt;Variational Distribution&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#whitened-parameterization"&gt;Whitened parameterization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference"&gt;Inference&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#preliminaries"&gt;Preliminaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gaussian-likelihoods--sparse-gaussian-process-regression-sgpr"&gt;Gaussian Likelihoods &amp;ndash; Sparse Gaussian Process Regression (SGPR)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#non-gaussian-likelihoods"&gt;Non-Gaussian Likelihoods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#large-scale-data-with-stochastic-optimization"&gt;Large-Scale Data with Stochastic Optimization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#links-and-further-readings"&gt;Links and Further Readings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#appendix"&gt;Appendix&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#i"&gt;I&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ii"&gt;II&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#iii"&gt;III&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#iv"&gt;IV&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v"&gt;V&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#vi"&gt;VI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#vii"&gt;VII&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/nav&gt;
&lt;/div&gt;
&lt;/details&gt;
&lt;p&gt;In the sparse variational Gaussian process (SVGP) framework (Titsias, 2009)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;,
one augments the joint distribution $p(\mathbf{y}, \mathbf{f})$ with auxiliary
variables $\mathbf{u}$ so that the joint becomes
&lt;/p&gt;
$$
p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} | \mathbf{f}) p(\mathbf{f}, \mathbf{u}).
$$&lt;p&gt;
The vector $\mathbf{u} = \begin{bmatrix} u(\mathbf{z}_1) \cdots u(\mathbf{z}_M)\end{bmatrix}^{\top} \in \mathbb{R}^M$
consists of &lt;em&gt;inducing variables&lt;/em&gt;, the latent function values corresponding
to the &lt;em&gt;inducing input&lt;/em&gt; locations contained in the matrix
$\mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_M \end{bmatrix}^{\top} \in \mathbb{R}^{M \times D}$.&lt;/p&gt;
&lt;h2 id="prior"&gt;Prior&lt;/h2&gt;
&lt;p&gt;The joint distribution of the latent function values $\mathbf{f}$, and the
inducing variables $\mathbf{u}$ according to the prior is
&lt;/p&gt;
$$
p(\mathbf{f}, \mathbf{u}) =
\mathcal{N} \left (
\begin{bmatrix}
\mathbf{f} \newline
\mathbf{u}
\end{bmatrix}
;
\begin{bmatrix}
\mathbf{0} \newline
\mathbf{0}
\end{bmatrix},
\begin{bmatrix}
\mathbf{K}_\mathbf{ff} &amp; \mathbf{K}_\mathbf{uf}^\top \newline
\mathbf{K}_\mathbf{uf} &amp; \mathbf{K}_\mathbf{uu}
\end{bmatrix}
\right ).
$$&lt;p&gt;
If we let the joint prior factorize as
&lt;/p&gt;
$$
p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) p(\mathbf{u}),
$$&lt;p&gt;
we can apply the rules of Gaussian conditioning to derive the marginal prior
$p(\mathbf{u})$ and conditional prior $p(\mathbf{f} | \mathbf{u})$.&lt;/p&gt;
&lt;h3 id="marginal-prior-over-inducing-variables"&gt;Marginal prior over inducing variables&lt;/h3&gt;
&lt;p&gt;The marginal prior over inducing variables is simply given by
&lt;/p&gt;
$$
p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{0}, \mathbf{K}_\mathbf{uu}).
$$
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;We can express the prior over the inducing variable $u(\mathbf{z})$ at
inducing input $\mathbf{z}$ as
&lt;/p&gt;
$$
p(u(\mathbf{z})) = \mathcal{GP}(0, k_{\theta}(\mathbf{z}, \mathbf{z}')).
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3 id="conditional-prior"&gt;Conditional prior&lt;/h3&gt;
&lt;p&gt;First, let us define the vector-valued function $\boldsymbol{\psi}_\mathbf{u}: \mathbb{R}^{D} \to \mathbb{R}^{M}$ as
&lt;/p&gt;
$$
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) \triangleq \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}),
$$&lt;p&gt;
where $\mathbf{k}_\mathbf{u}(\mathbf{x}) = k_{\theta}(\mathbf{Z}, \mathbf{x})$ denotes the
vector of covariances between $\mathbf{x}$ and the inducing inputs $\mathbf{Z}$.
Further, let $\boldsymbol{\Psi} \in \mathbb{R}^{M \times N}$ be the matrix
whose columns are the values of $\boldsymbol{\psi}_\mathbf{u}$ evaluated at each row of the matrix of inputs
$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \cdots \mathbf{x}_N \end{bmatrix}^{\top} \in \mathbb{R}^{N \times D}$,
&lt;/p&gt;
$$
\boldsymbol{\Psi} \triangleq
\begin{bmatrix}
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}_1)
\cdots
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}_N)
\end{bmatrix} = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf}.
$$&lt;p&gt;
Then, we can condition the joint prior distribution on the inducing
variables to give
&lt;/p&gt;
$$
p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}),
$$&lt;p&gt;
where the mean vector and covariance matrix are
&lt;/p&gt;
$$
\mathbf{m} = \boldsymbol{\Psi}^{\top} \mathbf{u},
\quad
\text{and}
\quad
\mathbf{S} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi}.
$$
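&lt;p&gt;As a quick numerical sanity check, the conditional moments above can be
computed directly from the kernel matrices. Below is a minimal NumPy/SciPy
sketch (not tied to any particular GP library; the small jitter term is added
purely for numerical stability), assuming &lt;code&gt;Kff&lt;/code&gt;, &lt;code&gt;Kuu&lt;/code&gt;,
&lt;code&gt;Kuf&lt;/code&gt; and a vector &lt;code&gt;u&lt;/code&gt; are given as arrays:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.linalg import cho_factor, cho_solve


def conditional_prior_moments(Kff, Kuu, Kuf, u, jitter=1e-6):
    # Psi = Kuu^{-1} Kuf, computed via a Cholesky factorization of Kuu
    M = Kuu.shape[0]
    cho = cho_factor(Kuu + jitter * np.eye(M), lower=True)
    Psi = cho_solve(cho, Kuf)   # shape (M, N)
    m = Psi.T @ u               # conditional mean  m = Psi^T u
    S = Kff - Kuf.T @ Psi       # covariance S = Kff - Psi^T Kuu Psi
    return m, S
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;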
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;We can express the distribution over the function value $f(\mathbf{x})$ at
input $\mathbf{x}$, given $\mathbf{u}$, that is, the conditional
$p(f(\mathbf{x}) | \mathbf{u})$, as a Gaussian process:
&lt;/p&gt;
$$
p(f(\mathbf{x}) | \mathbf{u}) = \mathcal{GP}(m(\mathbf{x}), s(\mathbf{x}, \mathbf{x}')),
$$&lt;p&gt;
with mean and covariance functions,
&lt;/p&gt;
$$
m(\mathbf{x}) = \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{u},
\quad
\text{and}
\quad
s(\mathbf{x}, \mathbf{x}') = k_{\theta}(\mathbf{x}, \mathbf{x}') - \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu} \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}').
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Before moving on, we briefly highlight the important
quantity,
&lt;/p&gt;
$$
\mathbf{Q}_\mathbf{ff} \triangleq \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi},
$$&lt;p&gt;
which is sometimes referred to as the &lt;em&gt;Nyström approximation&lt;/em&gt; of $\mathbf{K}_\mathbf{ff}$.
It can be written as
&lt;/p&gt;
$$
\mathbf{Q}_\mathbf{ff} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\textbf{uu}^{-1} \mathbf{K}_\mathbf{uf}.
$$&lt;h2 id="variational-distribution"&gt;Variational Distribution&lt;/h2&gt;
&lt;p&gt;We specify a joint variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$
which factorizes as
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \triangleq p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u}).
$$&lt;p&gt;
For convenience, let us specify a variational distribution that is also Gaussian,
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(\mathbf{u}) \triangleq \mathcal{N}(\mathbf{u} | \mathbf{b}, \mathbf{W}\mathbf{W}^{\top}),
$$&lt;p&gt;
with variational parameters $\boldsymbol{\phi} = \{ \mathbf{W}, \mathbf{b} \}$.
To obtain the corresponding marginal variational distribution over $\mathbf{f}$,
we marginalize out the inducing variables $\mathbf{u}$, leading to
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(\mathbf{f}) =
\int q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) \, \mathrm{d}\mathbf{u} =
\mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \mathbf{\Sigma}),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Psi}^\top \mathbf{b},
\quad
\text{and}
\quad
\mathbf{\Sigma} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^\top (\mathbf{K}_\mathbf{uu} - \mathbf{W}\mathbf{W}^{\top}) \boldsymbol{\Psi}.
$$
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;We can express the variational distribution over the function value $f(\mathbf{x})$ at
input $\mathbf{x}$, that is, the marginal $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$,
as a Gaussian process:
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}}(f(\mathbf{x})) = \mathcal{GP}(\mu(\mathbf{x}), \sigma(\mathbf{x}, \mathbf{x}')),
$$&lt;p&gt;
with mean and covariance functions,
&lt;/p&gt;
$$
\begin{aligned}
\mu(\mathbf{x}) &amp;= \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{b}, \\
\sigma(\mathbf{x}, \mathbf{x}') &amp;= k_{\theta}(\mathbf{x}, \mathbf{x}') - \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) (\mathbf{K}_\mathbf{uu} - \mathbf{W}\mathbf{W}^{\top}) \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}').
\end{aligned}
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3 id="whitened-parameterization"&gt;Whitened parameterization&lt;/h3&gt;
&lt;p&gt;Whitening is a powerful trick for stabilizing the learning of variational
parameters that works by reducing correlations in the variational distribution (Murray &amp;amp; Adams, 2010; Hensman et al, 2015)&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;.
Let $\mathbf{L}$ be the Cholesky factor of $\mathbf{K}_\mathbf{uu}$, i.e. the
lower triangular matrix such that $\mathbf{L} \mathbf{L}^{\top} = \mathbf{K}_\mathbf{uu}$.
Then, the whitened variational parameters are given by
&lt;/p&gt;
$$
\mathbf{W} \triangleq \mathbf{L} \mathbf{W}',
\quad
\text{and}
\quad
\mathbf{b} \triangleq \mathbf{L} \mathbf{b}',
$$&lt;p&gt;
with free parameters $\{ \mathbf{W}', \mathbf{b}' \}$.
This leads to mean and covariance
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Lambda}^\top \mathbf{b}',
\quad
\text{and}
\quad
\mathbf{\Sigma} = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top (\mathbf{I}_M - {\mathbf{W}'} {\mathbf{W}'}^{\top}) \boldsymbol{\Lambda},
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\boldsymbol{\Lambda} \triangleq \mathbf{L}^\top \boldsymbol{\Psi} = \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf}.
$$&lt;p&gt;
Refer to
for derivations.&lt;/p&gt;
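&lt;p&gt;In code, the whitened parameterization amounts to little more than a
Cholesky factorization and a triangular solve. Below is a minimal NumPy/SciPy
sketch of the resulting marginal moments, where &lt;code&gt;W_prime&lt;/code&gt; and
&lt;code&gt;b_prime&lt;/code&gt; denote the free parameters $\mathbf{W}'$ and $\mathbf{b}'$,
and the jitter term is only added for numerical stability:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.linalg import cholesky, solve_triangular


def whitened_marginal(Kff, Kuu, Kuf, W_prime, b_prime, jitter=1e-6):
    # mu = Lambda^T b',  Sigma = Kff - Lambda^T (I - W' W'^T) Lambda,
    # where Lambda = L^{-1} Kuf and L L^T = Kuu.
    M = Kuu.shape[0]
    L = cholesky(Kuu + jitter * np.eye(M), lower=True)
    Lam = solve_triangular(L, Kuf, lower=True)
    mu = Lam.T @ b_prime
    Sigma = Kff - Lam.T @ (np.eye(M) - W_prime @ W_prime.T) @ Lam
    return mu, Sigma
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;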
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;h4 id="gaussian-process-notation"&gt;Gaussian process notation&lt;/h4&gt;
&lt;p&gt;The mean and covariance functions are now
&lt;/p&gt;
$$
\begin{aligned}
\mu(\mathbf{x}) &amp;= \boldsymbol{\lambda}^\top(\mathbf{x}) \mathbf{b}', \\
\sigma(\mathbf{x}, \mathbf{x}') &amp;= k_{\theta}(\mathbf{x}, \mathbf{x}') - \boldsymbol{\lambda}^\top(\mathbf{x}) (\mathbf{I}_M - \mathbf{W}' {\mathbf{W}'}^{\top}) \boldsymbol{\lambda}(\mathbf{x}'),
\end{aligned}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\begin{aligned}
\boldsymbol{\lambda}(\mathbf{x}) &amp;\triangleq \mathbf{L}^{\top} \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) \\
&amp;= \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}).
\end{aligned}
$$&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;For an efficient and numerically stable way to compute and evaluate the
variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f})$ at an arbitrary
set of inputs, see
.&lt;/p&gt;
&lt;h2 id="inference"&gt;Inference&lt;/h2&gt;
&lt;h3 id="preliminaries"&gt;Preliminaries&lt;/h3&gt;
&lt;p&gt;We seek to approximate the exact posterior $p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$
by a variational distribution $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$.
To this end, we minimize the Kullback-Leibler (KL) divergence
between $q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})$
and $p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$, which is given by
&lt;/p&gt;
$$
\begin{align*}
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})] &amp; =
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}{p(\mathbf{f},\mathbf{u} \mid \mathbf{y})}}\right] \newline &amp; =
\log{p(\mathbf{y})} + \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}{p(\mathbf{f},\mathbf{u}, \mathbf{y})}}\right] \newline &amp; =
\log{p(\mathbf{y})} - \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}),
\end{align*}
$$&lt;p&gt;
where we&amp;rsquo;ve defined the &lt;em&gt;evidence lower bound (ELBO)&lt;/em&gt; as
&lt;/p&gt;
$$
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}\left[\log{\frac{p(\mathbf{f},\mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}}\right].
$$&lt;p&gt;
Notice that minimizing the KL divergence above is equivalent to maximizing the ELBO.
Furthermore, the ELBO is a lower bound on the log marginal likelihood, since
&lt;/p&gt;
$$
\log{p(\mathbf{y})} = \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) + \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})],
$$&lt;p&gt;
and the KL divergence is nonnegative.
Therefore, we have $\log{p(\mathbf{y})} \geq \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z})$
with equality at $\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \mid\mid p(\mathbf{f},\mathbf{u} \mid \mathbf{y})] = 0 \Leftrightarrow q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) = p(\mathbf{f},\mathbf{u} \mid \mathbf{y})$.&lt;/p&gt;
&lt;p&gt;Let us now focus our attention on the ELBO, which can be written as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) &amp; = \iint \log{\frac{p(\mathbf{f},\mathbf{u}, \mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \,\mathrm{d}\mathbf{f} \mathrm{d}\mathbf{u} \newline &amp; =
\iint \log{\frac{p(\mathbf{y} | \mathbf{f}) \bcancel{p(\mathbf{f} | \mathbf{u})} p(\mathbf{u})}{\bcancel{p(\mathbf{f} | \mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{f},\mathbf{u}) \,\mathrm{d}\mathbf{f} \mathrm{d}\mathbf{u} \newline &amp; =
\int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u},
\end{align*}
$$&lt;p&gt;
where we have made use of the previous
definition $q_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u})$
and also introduced the definition
&lt;/p&gt;
$$
\Phi(\mathbf{y}, \mathbf{u}) \triangleq \exp{ \left ( \int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f} \right ) }.
$$&lt;p&gt;
It is straightforward to verify that the optimal variational distribution, that
is, the distribution $q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})$ at which the
ELBO is maximized, satisfies
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \propto \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}).
$$&lt;p&gt;
Refer to
for details.
Specifically, after normalization, we have
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) = \frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{\mathcal{Z}},
$$&lt;p&gt;
where $\mathcal{Z} \triangleq \int \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u}$.
Plugging this back into the ELBO, we get
&lt;/p&gt;
$$
\begin{aligned}
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z})
&amp;= \int \log{\left (\bcancel{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})} \frac{\mathcal{Z}}{\bcancel{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}} \right )} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\
&amp;= \log{\mathcal{Z}}.
\end{aligned}
$$&lt;h3 id="gaussian-likelihoods--sparse-gaussian-process-regression-sgpr"&gt;Gaussian Likelihoods &amp;ndash; Sparse Gaussian Process Regression (SGPR)&lt;/h3&gt;
&lt;p&gt;Let us assume we have a Gaussian likelihood of the form
&lt;/p&gt;
$$
p(\mathbf{y} | \mathbf{f}) = \mathcal{N}(\mathbf{y} | \mathbf{f}, \beta^{-1} \mathbf{I}).
$$&lt;p&gt;
Then it is straightforward to show that
&lt;/p&gt;
$$
\log{\Phi(\mathbf{y}, \mathbf{u})} =
\log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \beta^{-1} \mathbf{I} )} - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}),
$$&lt;p&gt;
where $\mathbf{m}$ and $\mathbf{S}$ are defined as before, i.e. $\mathbf{m} = \boldsymbol{\Psi}^{\top} \mathbf{u}$ and
$\mathbf{S} = \mathbf{K}_\textbf{ff} - \boldsymbol{\Psi}^{\top} \mathbf{K}_\textbf{uu} \boldsymbol{\Psi}$.
Refer to
for derivations.&lt;/p&gt;
&lt;p&gt;Now, there are a few key objects of interest.
First, the
optimal variational distribution $q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})$,
which is required to compute the predictive distribution $q_{\boldsymbol{\phi}^{\star}}(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}) q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}$,
but which may also be of independent interest.
Second, the ELBO, the objective with respect to which the inducing input
locations $\mathbf{Z}$ are optimized.&lt;/p&gt;
&lt;p&gt;The optimal variational distribution is given by
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) =
\mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu}),
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{M} \triangleq \mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu}.
$$&lt;p&gt;
This can be verified by reducing the product of two exponential-quadratic
functions in $\Phi(\mathbf{y}, \mathbf{u})$ and $p(\mathbf{u})$ into a single
exponential-quadratic function up to a constant factor,
an operation also known as &amp;ldquo;completing the square&amp;rdquo;.
Refer to
for complete derivations.&lt;/p&gt;
&lt;p&gt;This leads to the predictive distribution
&lt;/p&gt;
$$
\begin{aligned}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{f})
&amp;= \mathcal{N}\bigl(\beta \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} \mathbf{y}, \\
&amp;\qquad\qquad \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^\top (\mathbf{K}_\mathbf{uu} - \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu} ) \boldsymbol{\Psi} \bigr) \\
&amp;= \mathcal{N}\bigl(\beta \mathbf{K}_\mathbf{fu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \\
&amp;\qquad\qquad \mathbf{K}_\mathbf{ff} - \mathbf{K}_\mathbf{fu} (\mathbf{K}_\mathbf{uu}^{-1} - \mathbf{M}^{-1}) \mathbf{K}_\mathbf{uf} \bigr).
\end{aligned}
$$&lt;p&gt;The ELBO is given by
&lt;/p&gt;
$$
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z}) =
\log \mathcal{Z} =
\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
$$&lt;p&gt;
This can be verified by applying simple rules for marginalizing Gaussians.
Again, refer to
for complete derivations.
Refer to
for a numerically efficient and
robust method for computing these quantities.&lt;/p&gt;
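&lt;p&gt;As a sanity check, the collapsed bound can also be evaluated directly from
its definition. The following NumPy/SciPy sketch does exactly that; it costs
$\mathcal{O}(N^3)$ and is meant only for illustration, not as a substitute for
the numerically efficient formulation referred to above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal


def sgpr_elbo(y, Kff, Kuu, Kuf, beta, jitter=1e-6):
    # Collapsed SGPR bound, computed directly from its definition:
    #   ELBO = log N(y | 0, Qff + beta^{-1} I) - (beta / 2) tr(Kff - Qff)
    M, N = Kuf.shape
    cho = cho_factor(Kuu + jitter * np.eye(M), lower=True)
    Qff = Kuf.T @ cho_solve(cho, Kuf)  # Nystrom approximation of Kff
    log_lik = multivariate_normal(mean=np.zeros(N), cov=Qff + np.eye(N) / beta).logpdf(y)
    return log_lik - 0.5 * beta * np.trace(Kff - Qff)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;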
&lt;h3 id="non-gaussian-likelihoods"&gt;Non-Gaussian Likelihoods&lt;/h3&gt;
&lt;p&gt;Recall from earlier that the ELBO is written as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) &amp; =
\int \log{\left(\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}\right)} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\int \left(\log{\Phi(\mathbf{y}, \mathbf{u})} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}}\ \right) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) - \mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u})|p(\mathbf{u})],
\end{align*}
$$&lt;p&gt;
where we define $\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z})$, the &lt;em&gt;expected log-likelihood (ELL)&lt;/em&gt;, as
&lt;/p&gt;
$$
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) \triangleq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{u})}\left[\log{\Phi(\mathbf{y}, \mathbf{u})}\right].
$$&lt;p&gt;
This constitutes the first term in the ELBO, and can be written as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELL}(\boldsymbol{\phi}, \mathbf{Z}) &amp; =
\int \log{\Phi(\mathbf{y}, \mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\int \left(\int \log{p(\mathbf{y} | \mathbf{f})} p(\mathbf{f} | \mathbf{u}) \,\mathrm{d}\mathbf{f}\right) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\int \log{p(\mathbf{y} | \mathbf{f})} \left(\int p(\mathbf{f} | \mathbf{u}) q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \right) \,\mathrm{d}\mathbf{f} \\\\ &amp; =
\int \log{p(\mathbf{y} | \mathbf{f})} q(\mathbf{f}) \,\mathrm{d}\mathbf{f} \\\\ &amp; =
\mathbb{E}_{q(\mathbf{f})}[\log{p(\mathbf{y} | \mathbf{f})}].
\end{align*}
$$&lt;p&gt;
While this integral is analytically intractable in general, we can nonetheless
approximate it efficiently using numerical integration techniques such as
Monte Carlo (MC) estimation or quadrature rules.
In particular, because $q(\mathbf{f})$ is Gaussian, we can utilize simple yet
effective rules such as
.&lt;/p&gt;
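&lt;p&gt;For instance, when the likelihood factorizes over data points, the ELL
reduces to a sum of one-dimensional Gaussian expectations, each of which is
easy to estimate by simple MC. Here is a minimal sketch, in which
&lt;code&gt;log_likelihood(y, f)&lt;/code&gt; is a user-supplied function assumed to act
elementwise, and &lt;code&gt;f_mean&lt;/code&gt;, &lt;code&gt;f_var&lt;/code&gt; are the marginal
means and variances of $q(\mathbf{f})$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def expected_log_likelihood_mc(y, f_mean, f_var, log_likelihood,
                               n_samples=64, seed=0):
    # E_{q(f)}[log p(y | f)], estimated by simple Monte Carlo with
    # n_samples draws per data point.
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, len(y)))
    f_samples = f_mean + np.sqrt(f_var) * eps  # f^(s) ~ q(f), shape (S, N)
    return log_likelihood(y, f_samples).mean(axis=0).sum()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;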
&lt;p&gt;Now, the second term in the ELBO is the KL divergence between $q_{\boldsymbol{\phi}}(\mathbf{u})$ and $p(\mathbf{u})$, which are both multivariate Gaussians,
&lt;/p&gt;
$$
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u})|p(\mathbf{u})] =
\mathrm{KL}[\mathcal{N}(\mathbf{b}, \mathbf{W} {\mathbf{W}}^\top) || \mathcal{N}(\mathbf{0}, \mathbf{K}_\mathbf{uu})],
$$&lt;p&gt;
and has a
.
In the case of the whitened parameterization, it can be simplified as
&lt;/p&gt;
$$
\begin{align*}
\mathrm{KL}[q_{\boldsymbol{\phi}}(\mathbf{u})|p(\mathbf{u})] &amp; =
\mathrm{KL}[\mathcal{N}(\mathbf{b}, \mathbf{W} {\mathbf{W}}^\top) || \mathcal{N}(\mathbf{0}, \mathbf{K}_\mathbf{uu})] \\\\ &amp; =
\mathrm{KL}[\mathcal{N}(\mathbf{b}', \mathbf{W}' {\mathbf{W}'}^\top) || \mathcal{N}(\mathbf{0}, \mathbf{I})].
\end{align*}
$$&lt;p&gt;
This comes from the fact that
&lt;/p&gt;
$$
\begin{aligned}
&amp;\mathrm{KL}\left[\mathcal{N}(\mathbf{A} \boldsymbol{\mu}_0, \mathbf{A} \boldsymbol{\Sigma}_0 \mathbf{A}^\top) \,\|\, \mathcal{N}(\mathbf{A} \boldsymbol{\mu}_1, \mathbf{A} \boldsymbol{\Sigma}_1 \mathbf{A}^\top) \right] \\
&amp;\qquad = \mathrm{KL}\left[\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \,\|\, \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \right]
\end{aligned}
$$&lt;p&gt;
where we set $\boldsymbol{\mu}_0 = \mathbf{b}', \boldsymbol{\Sigma}_0 = \mathbf{W}' {\mathbf{W}'}^\top, \boldsymbol{\mu}_1 = \mathbf{0}, \boldsymbol{\Sigma}_1 = \mathbf{I}$ and $\mathbf{A} = \mathbf{L}$, where $\mathbf{L}$ is the Cholesky factor of $\mathbf{K}_\mathbf{uu}$, i.e. the lower triangular matrix such that $\mathbf{L}\mathbf{L}^\top = \mathbf{K}_\mathbf{uu}$; with these choices, $\mathbf{A}\boldsymbol{\mu}_0 = \mathbf{b}$, $\mathbf{A}\boldsymbol{\Sigma}_0\mathbf{A}^\top = \mathbf{W}\mathbf{W}^\top$ and $\mathbf{A}\boldsymbol{\Sigma}_1\mathbf{A}^\top = \mathbf{K}_\mathbf{uu}$.&lt;/p&gt;
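&lt;p&gt;Because the prior becomes a standard Gaussian under the whitened
parameterization, this KL term has a particularly cheap closed form. Below is a
minimal sketch, assuming &lt;code&gt;W_prime&lt;/code&gt; is a lower-triangular $M \times M$
matrix and &lt;code&gt;b_prime&lt;/code&gt; a length-$M$ vector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np


def whitened_kl(W_prime, b_prime):
    # KL[ N(b', W' W'^T) || N(0, I) ]
    #   = 0.5 * ( tr(W' W'^T) + b'^T b' - M - log det(W' W'^T) )
    M = b_prime.shape[0]
    trace_term = np.sum(W_prime ** 2)  # tr(W' W'^T)
    logdet_term = 2.0 * np.sum(np.log(np.abs(np.diag(W_prime))))  # W' triangular
    return 0.5 * (trace_term + b_prime @ b_prime - M - logdet_term)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;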
&lt;h3 id="large-scale-data-with-stochastic-optimization"&gt;Large-Scale Data with Stochastic Optimization&lt;/h3&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-orange-100 dark:bg-orange-900 border-orange-500"
data-callout="warning"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-orange-600 dark:text-orange-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m-9.303 3.376c-.866 1.5.217 3.374 1.948 3.374h14.71c1.73 0 2.813-1.874 1.948-3.374L13.949 3.378c-.866-1.5-3.032-1.5-3.898 0zM12 15.75h.007v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Warning&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;Coming soon.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;!-- \int \left ( \int \log{p(\mathbf{y} \| \mathbf{f})} p(\mathbf{f} \| \mathbf{u}) \\,\mathrm{d}\mathbf{f} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \\,\mathrm{d}\mathbf{u} \newline &amp; = --&gt;
&lt;!-- \int \left ( \log{\Phi(\mathbf{y}, \mathbf{u})} + \log{\frac{p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) q_{\boldsymbol{\phi}}(\mathbf{u}) \\,\mathrm{d}\mathbf{u} \newline &amp; = --&gt;
&lt;!-- Therefore,
$$
q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu})
$$
since $\mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} = \mathbf{K}_\mathbf{uf}$.
$$
\exp \left ( - \frac{1}{2} \left ( \mathbf{u}^\top (\beta \boldsymbol{\Psi}\boldsymbol{\Psi}^\top) \mathbf{u} - 2 \beta \mathbf{y}^\top \boldsymbol{\Psi}^\top \mathbf{u} + \mathbf{u}^\top \mathbf{K}_\mathbf{uu}^{-1} \mathbf{u} \right ) \right )
$$
$$
\exp \left ( - \frac{1}{2} \left ( \mathbf{u}^\top ( \mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi}\boldsymbol{\Psi}^\top) \mathbf{u} - 2 \beta (\boldsymbol{\Psi} \mathbf{y})^\top \mathbf{u} \right ) \right )
$$ --&gt;
&lt;h2 id="links-and-further-readings"&gt;Links and Further Readings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Papers:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forerunners:&lt;/strong&gt; Deterministic Training Conditional (DTC; Csató &amp;amp; Opper, 2002&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;; Seeger, 2003&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;); Fully Independent Training Conditional (FITC; Snelson &amp;amp; Ghahramani, 2005&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;; Quinonero-Candela &amp;amp; Rasmussen, 2005&lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inter-domain Gaussian processes:&lt;/strong&gt; Lázaro-Gredilla &amp;amp; Figueiras-Vidal, 2009&lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deep Gaussian processes:&lt;/strong&gt; Damianou &amp;amp; Lawrence, 2013&lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, Salimbeni et al, 2017&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-Gaussian likelihoods:&lt;/strong&gt; Hensman et al, 2013&lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;; Dezfouli &amp;amp; Bonilla, 2015&lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unifying inducing-/pseudo-point approximations:&lt;/strong&gt; Bui et al, 2017&lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Orthogonal decompositions:&lt;/strong&gt; Salimbeni et al, 2018&lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;; Shi et al, 2020&lt;sup id="fnref:15"&gt;&lt;a href="#fn:15" class="footnote-ref" role="doc-noteref"&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Convergence analysis:&lt;/strong&gt; Burt et al, 2019&lt;sup id="fnref:16"&gt;&lt;a href="#fn:16" class="footnote-ref" role="doc-noteref"&gt;16&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient sampling:&lt;/strong&gt; Wilson et al, 2020&lt;sup id="fnref:17"&gt;&lt;a href="#fn:17" class="footnote-ref" role="doc-noteref"&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Technical Reports:
&lt;ul&gt;
&lt;li&gt;
by M. Titsias&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Notes:
&lt;ul&gt;
&lt;li&gt;
by T. Bui and R. Turner&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Blog posts:
&lt;ul&gt;
&lt;li&gt;
by J. Hensman&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tiao2020svgp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;{A} {H}andbook for {S}parse {V}ariational {G}aussian {P}rocesses&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Tiao, Louis C&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;journal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tiao.io&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;2020&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://tiao.io/post/sparse-variational-gaussian-processes/&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on
and
!&lt;/p&gt;
&lt;h2 id="appendix"&gt;Appendix&lt;/h2&gt;
&lt;h3 id="i"&gt;I&lt;/h3&gt;
&lt;h4 id="whitened-parameterization-1"&gt;Whitened parameterization&lt;/h4&gt;
&lt;p&gt;Recall the definition $\boldsymbol{\Lambda} \triangleq \mathbf{L}^\top \boldsymbol{\Psi}$.
Then, the mean simplifies to
&lt;/p&gt;
$$
\boldsymbol{\mu} = \boldsymbol{\Psi}^\top \mathbf{b} = \boldsymbol{\Psi}^\top (\mathbf{L} \mathbf{b}') = (\mathbf{L}^\top \boldsymbol{\Psi})^\top \mathbf{b}' = \boldsymbol{\Lambda}^\top \mathbf{b}'.
$$&lt;p&gt;
Similarly, the covariance simplifies to
&lt;/p&gt;
$$
\begin{align*}
\mathbf{\Sigma} &amp; = \mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} (\mathbf{K}_\mathbf{uu} - \mathbf{W} \mathbf{W}^{\top}) \boldsymbol{\Psi} \newline &amp; =
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^{\top} (\mathbf{L} \mathbf{L}^{\top} - \mathbf{L} ({\mathbf{W}'}{\mathbf{W}'}^{\top}) \mathbf{L}^{\top}) \boldsymbol{\Psi} \newline &amp; =
\mathbf{K}_\mathbf{ff} - (\mathbf{L}^{\top} \boldsymbol{\Psi})^{\top} ( \mathbf{I}_M - {\mathbf{W}'}{\mathbf{W}'}^{\top}) (\mathbf{L}^{\top} \boldsymbol{\Psi}) \newline &amp; =
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^{\top} ( \mathbf{I}_M - {\mathbf{W}'}{\mathbf{W}'}^{\top}) \boldsymbol{\Lambda}.
\end{align*}
$$&lt;h3 id="ii"&gt;II&lt;/h3&gt;
&lt;h4 id="svgp-implementation-details"&gt;SVGP Implementation Details&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Single input index point&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here is an efficient and numerically stable way to compute $q_{\boldsymbol{\phi}}(f(\mathbf{x}))$
for an input $\mathbf{x}$.
We take the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Cholesky decomposition: $\mathbf{L} \triangleq \mathrm{cholesky}(\mathbf{K}_\textbf{uu})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^3)$ complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\lambda}(\mathbf{x}) \triangleq \mathbf{L} \backslash \mathbf{k}_\mathbf{u}(\mathbf{x})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2)$ complexity since $\mathbf{L}$ is lower triangular; $\boldsymbol{\beta} = \mathbf{A} \backslash \mathbf{x}$ denotes the vector $\boldsymbol{\beta}$ such that $\mathbf{A} \boldsymbol{\beta} = \mathbf{x} \Leftrightarrow \boldsymbol{\beta} = \mathbf{A}^{-1} \mathbf{x}$.
Hence, $\boldsymbol{\lambda}(\mathbf{x}) = \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$s(\mathbf{x}, \mathbf{x}) \triangleq k_{\theta}(\mathbf{x}, \mathbf{x}) - \boldsymbol{\lambda}^\top(\mathbf{x}) \boldsymbol{\lambda}(\mathbf{x})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;
&lt;/p&gt;
$$
\begin{aligned}
\boldsymbol{\lambda}^\top(\mathbf{x}) \boldsymbol{\lambda}(\mathbf{x})
&amp;= \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}) \\
&amp;= \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}) \\
&amp;= \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu} \boldsymbol{\psi}_\mathbf{u}(\mathbf{x}).
\end{aligned}
$$&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For &lt;strong&gt;whitened parameterization&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;$\mu \triangleq \boldsymbol{\lambda}^\top(\mathbf{x}) \mathbf{b}'$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{v}^\top(\mathbf{x}) \triangleq \boldsymbol{\lambda}^\top(\mathbf{x}) {\mathbf{W}'}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{v}^\top(\mathbf{x}) \mathbf{v}(\mathbf{x}) = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{L}^{-\top} ({\mathbf{W}'} {\mathbf{W}'}^{\top}) \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;otherwise:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) \triangleq \mathbf{L}^\top \backslash \boldsymbol{\lambda}(\mathbf{x})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2)$ complexity since $\mathbf{L}^{\top}$ is upper triangular. Further,
&lt;/p&gt;
$$
\boldsymbol{\psi}_\mathbf{u}(\mathbf{x}) = \mathbf{L}^{-\top} \boldsymbol{\lambda}(\mathbf{x}) = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x}) = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})
$$&lt;p&gt;
and
&lt;/p&gt;
$$
\boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-\top} = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-1}
$$&lt;p&gt;
since $\mathbf{K}_\mathbf{uu}$ is symmetric and nonsingular.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mu(\mathbf{x}) \triangleq \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{b}$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{v}^\top(\mathbf{x}) \triangleq \boldsymbol{\psi}_\mathbf{u}^\top(\mathbf{x}) \mathbf{W}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{v}^\top(\mathbf{x}) \mathbf{v}(\mathbf{x}) = \mathbf{k}_\mathbf{u}^\top(\mathbf{x}) \mathbf{K}_\mathbf{uu}^{-1} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{K}_\mathbf{uu}^{-1} \mathbf{k}_\mathbf{u}(\mathbf{x})$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\sigma^2(\mathbf{x}) \triangleq s(\mathbf{x}, \mathbf{x}) + \mathbf{v}^\top(\mathbf{x}) \mathbf{v}(\mathbf{x})$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Return $\mathcal{N}(f(\mathbf{x}) ; \mu(\mathbf{x}), \sigma^2(\mathbf{x}))$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Multiple input index points&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is simple to extend this to compute $q_{\boldsymbol{\phi}}(\mathbf{f})$ for an
arbitrary number of index points $\mathbf{X}$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Cholesky decomposition: $\mathbf{L} = \mathrm{cholesky}(\mathbf{K}_\textbf{uu})$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^3)$ complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\Lambda} = \mathbf{L} \backslash \mathbf{K}_\mathbf{uf}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}$ is lower triangular and there are $N$ right-hand sides; $\mathbf{B} = \mathbf{A} \backslash \mathbf{X}$ denotes the matrix $\mathbf{B}$ such that $\mathbf{A} \mathbf{B} = \mathbf{X} \Leftrightarrow \mathbf{B} = \mathbf{A}^{-1} \mathbf{X}$.
Hence, $\boldsymbol{\Lambda} = \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf}$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{S} \triangleq \mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;
&lt;/p&gt;
$$
\begin{aligned}
\boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda}
&amp;= \mathbf{K}_\mathbf{fu} \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf} \\
&amp;= \mathbf{K}_\mathbf{fu} \mathbf{K}_\textbf{uu}^{-1} \mathbf{K}_\mathbf{uf} \\
&amp;= \mathbf{K}_\mathbf{fu} \mathbf{K}_\textbf{uu}^{-1} (\mathbf{K}_\textbf{uu}) \mathbf{K}_\textbf{uu}^{-1} \mathbf{K}_\mathbf{uf} \\
&amp;= \boldsymbol{\Psi}^\top \mathbf{K}_\textbf{uu} \boldsymbol{\Psi}.
\end{aligned}
$$&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For &lt;strong&gt;whitened parameterization&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;$\boldsymbol{\mu} \triangleq \boldsymbol{\Lambda}^\top \mathbf{b}'$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{V}^\top \triangleq \boldsymbol{\Lambda}^\top {\mathbf{W}'}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{V}^\top \mathbf{V} = \mathbf{K}_\mathbf{fu} \mathbf{L}^{-\top} ({\mathbf{W}'} {\mathbf{W}'}^{\top}) \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf}.$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;otherwise:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Solve system of linear equations: $\boldsymbol{\Psi} = \mathbf{L}^{\top} \backslash \boldsymbol{\Lambda}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathcal{O}(M^2 N)$ complexity since $\mathbf{L}^{\top}$ is upper triangular and there are $N$ right-hand sides. Further,&lt;/p&gt;
$$
\boldsymbol{\Psi} = \mathbf{L}^{-\top} \boldsymbol{\Lambda} = \mathbf{L}^{-\top} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf} = (\mathbf{L}\mathbf{L}^\top)^{-1} \mathbf{K}_\mathbf{uf} = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf},
$$&lt;p&gt;
and
&lt;/p&gt;
$$
\boldsymbol{\Psi}^\top = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-\top} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-1},
$$&lt;p&gt;
since $\mathbf{K}_\mathbf{uu}$ is symmetric and nonsingular.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\boldsymbol{\mu} \triangleq \boldsymbol{\Psi}^\top \mathbf{b}$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{V}^\top \triangleq \boldsymbol{\Psi}^\top \mathbf{W}$&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; $\mathbf{V}^\top \mathbf{V} = \mathbf{K}_\mathbf{fu} \mathbf{K}_\mathbf{uu}^{-1} (\mathbf{W} \mathbf{W}^{\top}) \mathbf{K}_\mathbf{uu}^{-1} \mathbf{K}_\mathbf{uf}$.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\mathbf{\Sigma} \triangleq \mathbf{S} + \mathbf{V}^\top \mathbf{V}$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Return $\mathcal{N}(\mathbf{f} ; \boldsymbol{\mu}, \mathbf{\Sigma})$&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In TensorFlow, this looks something like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;variational_predictive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Knn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Kmm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Kmn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;whiten&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Kmm&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# L L^T = Kmm + jitter I_m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triangular_solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Kmn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Lambda = L^{-1} Kmn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Knn&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Knn - Lambda^T Lambda&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Phi = L^{-T} L^{-1} Kmn = Kmm^{-1} Kmn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;whiten&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triangular_solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# U = V^T = Phi^T W&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Phi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Phi^T b&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adjoint_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# S + UU^T = S + V^T V&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="iii"&gt;III&lt;/h3&gt;
&lt;h4 id="optimal-variational-distribution-in-general"&gt;Optimal variational distribution (in general)&lt;/h4&gt;
&lt;p&gt;Taking the functional derivative of the ELBO wrt to $q_{\boldsymbol{\phi}}(\mathbf{u})$, we get
&lt;/p&gt;
$$
\begin{align*}
\frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \mathrm{ELBO}(\boldsymbol{\phi}, \mathbf{Z}) &amp; =
\frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left ( \int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \,\mathrm{d}\mathbf{u} \right ) \newline &amp; =
\int \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \left ( \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} q_{\boldsymbol{\phi}}(\mathbf{u}) \right ) \,\mathrm{d}\mathbf{u} \newline &amp; =
\begin{split}
&amp; \int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \left ( \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} q_{\boldsymbol{\phi}}(\mathbf{u}) \right ) + \newline
&amp; \qquad q_{\boldsymbol{\phi}}(\mathbf{u}) \left ( \frac{\partial}{\partial q_{\boldsymbol{\phi}}(\mathbf{u})} \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} \right ) \,\mathrm{d}\mathbf{u}
\end{split}
\newline &amp; =
\int \log{\frac{\Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u})}{q_{\boldsymbol{\phi}}(\mathbf{u})}} +
q_{\boldsymbol{\phi}}(\mathbf{u}) \left ( -\frac{1}{q_{\boldsymbol{\phi}}(\mathbf{u})} \right ) \,\mathrm{d}\mathbf{u}
\newline &amp; =
\int \log{\Phi(\mathbf{y}, \mathbf{u})} + \log{p(\mathbf{u})} - \log{q_{\boldsymbol{\phi}}(\mathbf{u})} - 1 \,\mathrm{d}\mathbf{u}.
\end{align*}
$$&lt;p&gt;
Setting this expression to zero, we have
&lt;/p&gt;
$$
\begin{align*}
\log{q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})} &amp; = \log{\Phi(\mathbf{y}, \mathbf{u})} + \log{p(\mathbf{u})} - 1 \\\\
\Rightarrow \qquad
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) &amp; \propto \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}).
\end{align*}
$$&lt;h3 id="iv"&gt;IV&lt;/h3&gt;
&lt;h4 id="variational-lower-bound-partial-for-gaussian-likelihoods"&gt;Variational lower bound (partial) for Gaussian likelihoods&lt;/h4&gt;
&lt;p&gt;To carry out this derivation, we will need to recall the following two simple
identities. First, we can write the inner product between two vectors as the
trace of their outer product,
&lt;/p&gt;
$$
\mathbf{a}^\top \mathbf{b} = \mathrm{tr}(\mathbf{a} \mathbf{b}^\top).
$$&lt;p&gt;
Second, the relationship between the auto-correlation matrix $\mathbb{E}[\mathbf{a}\mathbf{a}^{\top}]$
and the covariance matrix,
&lt;/p&gt;
$$
\begin{align*}
\mathrm{Cov}[\mathbf{a}] &amp; = \mathbb{E}[\mathbf{a}\mathbf{a}^{\top}] - \mathbb{E}[\mathbf{a}] \, \mathbb{E}[\mathbf{a}]^\top \\\\
\Leftrightarrow \quad
\mathbb{E}[\mathbf{a}\mathbf{a}^{\top}] &amp; = \mathrm{Cov}[\mathbf{a}] + \mathbb{E}[\mathbf{a}] \, \mathbb{E}[\mathbf{a}]^\top
\end{align*}
$$&lt;p&gt;
These allow us to write (writing $\sigma^2 \triangleq \beta^{-1}$ for brevity)
&lt;/p&gt;
$$
\begin{align*}
\log{\Phi(\mathbf{y}, \mathbf{u})} &amp; =
\int \log{\mathcal{N}(\mathbf{y} | \mathbf{f}, \beta^{-1} \mathbf{I})} \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f}
\newline &amp; = - \frac{1}{2\sigma^2} \int (\mathbf{y} - \mathbf{f})^{\top} (\mathbf{y} - \mathbf{f}) \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f}
\newline &amp; \quad - \frac{N}{2}\log{(2\pi\sigma^2)}
\newline &amp; = - \frac{1}{2\sigma^2} \int \mathrm{tr} \left (\mathbf{y}\mathbf{y}^{\top} - 2 \mathbf{y}\mathbf{f}^{\top} + \mathbf{f}\mathbf{f}^{\top} \right) \mathcal{N}(\mathbf{f} | \mathbf{m}, \mathbf{S}) \,\mathrm{d}\mathbf{f}
\newline &amp; \quad - \frac{N}{2}\log{(2\pi\sigma^2)}
\newline &amp; = - \frac{1}{2\sigma^2} \mathrm{tr} \left (\mathbf{y}\mathbf{y}^{\top} - 2 \mathbf{y}\mathbf{m}^{\top} + \mathbf{S} + \mathbf{m} \mathbf{m}^{\top} \right)
\newline &amp; \quad - \frac{N}{2}\log{(2\pi\sigma^2)}
\newline &amp; = - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{m})^{\top} (\mathbf{y} - \mathbf{m}) - \frac{N}{2}\log{(2\pi\sigma^2)}
\newline &amp; \quad - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S})
\newline &amp; = \log{\mathcal{N}(\mathbf{y} | \mathbf{m}, \beta^{-1} \mathbf{I} )} - \frac{1}{2\sigma^2} \mathrm{tr}(\mathbf{S}).
\end{align*}
$$&lt;h3 id="v"&gt;V&lt;/h3&gt;
&lt;h4 id="optimal-variational-distribution-for-gaussian-likelihoods"&gt;Optimal variational distribution for Gaussian likelihoods&lt;/h4&gt;
&lt;p&gt;Firstly, the optimal variational distribution can be found in closed-form as
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) &amp; \propto \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \\\\
&amp; \propto \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^\top \mathbf{u}, \beta^{-1} \mathbf{I}) \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_\mathbf{uu}) \\\\ &amp; \propto
\exp \left ( - \frac{\beta}{2} (\mathbf{y} - \boldsymbol{\Psi}^\top \mathbf{u})^\top
(\mathbf{y} - \boldsymbol{\Psi}^\top \mathbf{u}) - \frac{1}{2} \mathbf{u}^\top \mathbf{K}_\mathbf{uu}^{-1} \mathbf{u} \right ) \\\\ &amp; \propto
\exp \left ( - \frac{1}{2} \left ( \mathbf{u}^\top \mathbf{C} \mathbf{u} - 2 \beta (\boldsymbol{\Psi} \mathbf{y})^\top \mathbf{u} \right ) \right ),
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{C} \triangleq \mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi} \boldsymbol{\Psi}^\top =
\mathbf{K}_\mathbf{uu}^{-1} (\mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu} ) \mathbf{K}_\mathbf{uu}^{-1}.
$$&lt;p&gt;
By completing the square, we get
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) &amp; \propto
\exp \left ( - \frac{1}{2} (\mathbf{u} - \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y})^\top \mathbf{C} (\mathbf{u} - \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y}) \right ) \\\\ &amp; \propto
\mathcal{N}(\mathbf{u} \mid \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y}, \mathbf{C}^{-1}).
\end{align*}
$$&lt;p&gt;
We define
&lt;/p&gt;
$$
\mathbf{M} \triangleq \mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu}
$$&lt;p&gt;
so that
&lt;/p&gt;
$$
\mathbf{C} = \mathbf{K}_\mathbf{uu}^{-1} \mathbf{M} \mathbf{K}_\mathbf{uu}^{-1},
$$&lt;p&gt;
which allows us to write
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) =
\mathcal{N}(\mathbf{u} \mid \beta \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{y}, \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu}).
$$&lt;h3 id="vi"&gt;VI&lt;/h3&gt;
&lt;h4 id="variational-lower-bound-complete-for-gaussian-likelihoods"&gt;Variational lower bound (complete) for Gaussian likelihoods&lt;/h4&gt;
&lt;p&gt;We have
&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z}) &amp; =
\log \mathcal{Z} \\\\ &amp; =
\log \int \Phi(\mathbf{y}, \mathbf{u}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u} \\\\ &amp; =
\log \biggl[ \exp{\left(-\frac{\beta}{2} \mathrm{tr}(\mathbf{S})\right)}
\newline &amp; \qquad \cdot \int \mathcal{N}(\mathbf{y} | \boldsymbol{\Psi}^{\top} \mathbf{u}, \beta^{-1} \mathbf{I}) p(\mathbf{u}) \,\mathrm{d}\mathbf{u} \biggr] \\\\ &amp; =
\log \int \mathcal{N}(\mathbf{y} \mid \boldsymbol{\Psi}^{\top} \mathbf{u}, \beta^{-1} \mathbf{I}) \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_\mathbf{uu}) \,\mathrm{d}\mathbf{u} - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\\\ &amp; =
\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \beta^{-1} \mathbf{I} + \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}) \\\\ &amp; =
\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}) - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{align*}
$$&lt;h3 id="vii"&gt;VII&lt;/h3&gt;
&lt;h4 id="sgpr-implementation-details"&gt;SGPR Implementation Details&lt;/h4&gt;
&lt;p&gt;Here we provide implementation details that simultaneously minimize the
computational demands and avoid numerically unstable calculations.&lt;/p&gt;
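&lt;p&gt;For reference, the collapsed bound derived in the previous section can be
evaluated directly at $\mathcal{O}(N^3)$ cost. The following is a minimal
sketch (the names are illustrative, not any particular library&amp;rsquo;s API),
assuming dense kernel matrices &lt;code&gt;Kff&lt;/code&gt;, &lt;code&gt;Kuf&lt;/code&gt;, &lt;code&gt;Kuu&lt;/code&gt;
and a scalar noise precision &lt;code&gt;beta&lt;/code&gt; with the same dtype as the
targets &lt;code&gt;y&lt;/code&gt;; the remainder of this section shows how to avoid ever
forming an $N \times N$ matrix:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import tensorflow as tf
import tensorflow_probability as tfp


def naive_collapsed_elbo(y, Kff, Kuf, Kuu, beta):
    # Q_ff = K_fu K_uu^{-1} K_uf, formed densely -- only viable for small N.
    L = tf.linalg.cholesky(Kuu)
    Lambda = tf.linalg.triangular_solve(L, Kuf, lower=True)  # Lambda = L^{-1} K_uf
    Qff = tf.linalg.matmul(Lambda, Lambda, adjoint_a=True)
    N = tf.shape(y)[0]
    marginal = tfp.distributions.MultivariateNormalTriL(
        loc=tf.zeros_like(y),
        scale_tril=tf.linalg.cholesky(Qff + tf.eye(N, dtype=y.dtype) / beta))
    # Trace penalty: (beta/2) tr(S) = (beta/2) (tr(K_ff) - tr(Q_ff)),
    # using the trace-term identity derived later in this section.
    trace_term = 0.5 * beta * (tf.linalg.trace(Kff) - tf.linalg.trace(Qff))
    return marginal.log_prob(y) - trace_term
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;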
&lt;p&gt;The difficulty in calculating the ELBO stems from terms involving
the &lt;em&gt;inverse&lt;/em&gt; and the &lt;em&gt;determinant&lt;/em&gt; of $\mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}$.
More specifically, we have
&lt;/p&gt;
$$
\begin{split}
\mathrm{ELBO}(\boldsymbol{\phi}^{\star}, \mathbf{Z}) &amp; = - \frac{1}{2} \Bigl( \log \det \left ( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right ) \\\\
&amp; \qquad + \mathbf{y}^\top \left ( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right )^{-1} \mathbf{y} + N \log {2\pi} \Bigr) \\\\
&amp; \qquad - \frac{\beta}{2} \mathrm{tr}(\mathbf{S}).
\end{split}
$$&lt;p&gt;
It turns out that many of the required terms can be expressed in terms of the
symmetric positive definite matrix
&lt;/p&gt;
$$
\mathbf{B} \triangleq \mathbf{U} \mathbf{U}^\top + \mathbf{I},
$$&lt;p&gt;
where $\mathbf{U} \triangleq \beta^{\frac{1}{2}} \boldsymbol{\Lambda}$.&lt;/p&gt;
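&lt;p&gt;In code, all of these quantities follow from a single Cholesky
factorization of $\mathbf{K}_\mathbf{uu}$. Below is a sketch under the same
assumptions as the naive snippet above (dense &lt;code&gt;Kuu&lt;/code&gt; and
&lt;code&gt;Kuf&lt;/code&gt;, scalar &lt;code&gt;beta&lt;/code&gt;); a small jitter is added to the
diagonal for numerical stability:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import tensorflow as tf


def compute_factors(Kuu, Kuf, beta, jitter=1e-6):
    M = tf.shape(Kuu)[0]
    eye = tf.eye(M, dtype=Kuu.dtype)
    L = tf.linalg.cholesky(Kuu + jitter * eye)               # K_uu = L L^T
    Lambda = tf.linalg.triangular_solve(L, Kuf, lower=True)  # Lambda = L^{-1} K_uf
    U = tf.sqrt(beta) * Lambda                               # U = beta^{1/2} Lambda
    B = eye + tf.linalg.matmul(U, U, adjoint_b=True)         # B = I + U U^T
    L_B = tf.linalg.cholesky(B)                              # B = L_B L_B^T
    return L, U, L_B
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;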
&lt;p&gt;First, let&amp;rsquo;s tackle the inverse term.
Using the Woodbury identity, we can write it as
&lt;/p&gt;
$$
\begin{align*}
\left(\mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}\right)^{-1}
&amp; = \left(\beta^{-1} \mathbf{I} + \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi}\right)^{-1} \\\\
&amp; = \beta \mathbf{I} - \beta^2 \boldsymbol{\Psi}^\top \left(\mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi} \boldsymbol{\Psi}^\top \right)^{-1} \boldsymbol{\Psi} \\\\
&amp; = \beta \left(\mathbf{I} - \beta \boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi}\right).
\end{align*}
$$&lt;p&gt;Recall that $\mathbf{C}^{-1} = \mathbf{K}_\mathbf{uu} \mathbf{M}^{-1} \mathbf{K}_\mathbf{uu}$.
We can expand $\mathbf{M}$ as
&lt;/p&gt;
$$
\begin{align*}
\mathbf{M} &amp; \triangleq \mathbf{K}_\mathbf{uu} + \beta \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu} \\\\
&amp; = \mathbf{L} \mathbf{L}^\top + \beta \mathbf{L} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uf} \mathbf{K}_\mathbf{fu} \mathbf{L}^{-\top} \mathbf{L}^\top \\\\
&amp; = \mathbf{L} \left( \mathbf{I} + \beta \boldsymbol{\Lambda} \boldsymbol{\Lambda}^\top \right) \mathbf{L}^\top \\\\
&amp; = \mathbf{L} \mathbf{B} \mathbf{L}^{\top},
\end{align*}
$$&lt;p&gt;
so its inverse is simply
&lt;/p&gt;
$$
\mathbf{M}^{-1} = \mathbf{L}^{-\top} \mathbf{B}^{-1} \mathbf{L}^{-1}.
$$&lt;p&gt;
Therefore, we have
&lt;/p&gt;
$$
\begin{align*}
\mathbf{C}^{-1}
&amp; = \mathbf{K}_\mathbf{uu} \mathbf{L}^{-\top} \mathbf{B}^{-1} \mathbf{L}^{-1} \mathbf{K}_\mathbf{uu} \\\\
&amp; = \mathbf{L} \mathbf{B}^{-1} \mathbf{L}^\top \\\\
&amp; = \mathbf{W} \mathbf{W}^\top
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{W} \triangleq \mathbf{L} \mathbf{L}_\mathbf{B}^{-\top}
$$&lt;p&gt;
and $\mathbf{L}_\mathbf{B}$ is the Cholesky factor of $\mathbf{B}$,
i.e. the lower triangular matrix such
that $\mathbf{L}_\mathbf{B}\mathbf{L}_\mathbf{B}^\top = \mathbf{B}$.
All in all, we now have
&lt;/p&gt;
$$
\begin{align*}
\left(\mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I}\right)^{-1}
&amp; = \beta \left(\mathbf{I} - \beta \boldsymbol{\Psi}^\top \mathbf{W} \mathbf{W}^\top \boldsymbol{\Psi}\right),
\end{align*}
$$&lt;p&gt;
so we can compute the quadratic term in $\mathbf{y}$ as
&lt;/p&gt;
$$
\begin{align*}
\mathbf{y}^\top \left ( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right )^{-1} \mathbf{y}
&amp; = \beta \left( \mathbf{y}^\top \mathbf{y} - \beta \mathbf{y}^\top \boldsymbol{\Psi}^\top \mathbf{W} \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y} \right) \\\\
&amp; = \beta \mathbf{y}^\top \mathbf{y} - \mathbf{c}^\top \mathbf{c},
\end{align*}
$$&lt;p&gt;
where
&lt;/p&gt;
$$
\mathbf{c} \triangleq \beta \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y} = \beta \mathbf{L}_\mathbf{B}^{-1} \boldsymbol{\Lambda} \mathbf{y} = \beta^{\frac{1}{2}} \mathbf{L}_\mathbf{B}^{-1} \mathbf{U} \mathbf{y}.
$$&lt;p&gt;Next, let&amp;rsquo;s address the determinant term.
To this end, first note that the determinant of $\mathbf{M}$ is
&lt;/p&gt;
$$
\begin{align*}
\det \left( \mathbf{M} \right) &amp; = \det \left( \mathbf{L} \mathbf{B} \mathbf{L}^{\top} \right) \\\\ &amp; =
\det \left( \mathbf{L} \right) \det \left( \mathbf{B} \right) \det \left( \mathbf{L}^{\top} \right) \\\\ &amp; =
\det \left( \mathbf{K}_\mathbf{uu} \right) \det \left( \mathbf{B} \right).
\end{align*}
$$&lt;p&gt;
Hence, the determinant of $\mathbf{C}$ is
&lt;/p&gt;
$$
\begin{align*}
\det \left( \mathbf{C} \right) &amp; =
\det \left( \mathbf{K}_\mathbf{uu}^{-1} \mathbf{M} \mathbf{K}_\mathbf{uu}^{-1} \right) \\\\ &amp; =
\frac{\det \left( \mathbf{M} \right)}{\det \left( \mathbf{K}_\mathbf{uu} \right )^2} \\\\ &amp; =
\frac{\det \left( \mathbf{B} \right)}{\det \left( \mathbf{K}_\mathbf{uu} \right )}.
\end{align*}
$$&lt;p&gt;
Therefore, by the matrix determinant lemma, we have
&lt;/p&gt;
$$
\begin{align*}
\det \left( \mathbf{Q}_\mathbf{ff} + \beta^{-1} \mathbf{I} \right) &amp; =
\det \left( \beta^{-1} \mathbf{I} + \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} \right) \\\\ &amp; =
\det \left( \mathbf{K}_\mathbf{uu}^{-1} + \beta \boldsymbol{\Psi} \boldsymbol{\Psi}^\top \right)
\det \left( \mathbf{K}_\mathbf{uu} \right)
\det \left( \beta^{-1} \mathbf{I} \right) \\\\ &amp; =
\det \left( \mathbf{C} \right)
\det \left( \mathbf{K}_\mathbf{uu} \right)
\det \left( \beta^{-1} \mathbf{I} \right) \\\\ &amp; =
\det \left( \mathbf{B} \right) \det \left( \beta^{-1} \mathbf{I} \right).
\end{align*}
$$&lt;p&gt;
We can re-use $\mathbf{L}_\mathbf{B}$ to calculate $\det \left( \mathbf{B} \right)$
in linear time.&lt;/p&gt;
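&lt;p&gt;Concretely, since $\det(\mathbf{B}) = \det(\mathbf{L}_\mathbf{B})^2$, its
log-determinant is just twice the sum of the log-diagonal of the Cholesky
factor. A one-line sketch, reusing &lt;code&gt;L_B&lt;/code&gt; from the earlier snippet:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# log det(Q_ff + beta^{-1} I) = log det(B) - N log(beta)
log_det_B = 2.0 * tf.reduce_sum(tf.math.log(tf.linalg.diag_part(L_B)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;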
&lt;p&gt;The last non-trivial component of the ELBO is the trace term, which can be
calculated as
&lt;/p&gt;
$$
\frac{\beta}{2} \mathrm{tr}(\mathbf{S}) = \frac{\beta}{2} \mathrm{tr}\left(\mathbf{K}_\mathbf{ff}\right) - \frac{1}{2} \mathrm{tr}\left(\mathbf{U} \mathbf{U}^\top \right),
$$&lt;p&gt;
since
&lt;/p&gt;
$$
\begin{align*}
\mathrm{tr}\left(\mathbf{U} \mathbf{U}^\top\right) &amp; =
\mathrm{tr}\left(\mathbf{U}^\top \mathbf{U}\right) \\\\ &amp; =
\beta \cdot \mathrm{tr}\left(\boldsymbol{\Lambda} \boldsymbol{\Lambda}^\top\right) \\\\ &amp; =
\beta \cdot \mathrm{tr}\left( \boldsymbol{\Psi}^{\top} \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} \right).
\end{align*}
$$&lt;p&gt;
Again, we can re-use $\mathbf{U} \mathbf{U}^\top$ computed earlier.&lt;/p&gt;
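&lt;p&gt;Putting these pieces together, the collapsed bound can be evaluated without
ever forming an $N \times N$ matrix. The sketch below reuses &lt;code&gt;U&lt;/code&gt; and
&lt;code&gt;L_B&lt;/code&gt; from the earlier snippet and only needs the diagonal of
$\mathbf{K}_\mathbf{ff}$; as before, the names are illustrative rather than any
particular library&amp;rsquo;s API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import math

import tensorflow as tf


def collapsed_elbo(y, Kff_diag, U, L_B, beta):
    # y: [N] targets; Kff_diag: [N] diagonal of K_ff; beta: scalar noise precision.
    N = tf.cast(tf.size(y), y.dtype)
    # c = beta^{1/2} L_B^{-1} U y
    Uy = tf.linalg.matvec(U, y)
    c = tf.sqrt(beta) * tf.linalg.triangular_solve(L_B, Uy[:, None], lower=True)
    # Quadratic term: y^T (Q_ff + beta^{-1} I)^{-1} y = beta y^T y - c^T c
    quad = beta * tf.reduce_sum(tf.square(y)) - tf.reduce_sum(tf.square(c))
    # Log-determinant: log det(Q_ff + beta^{-1} I) = log det(B) - N log(beta)
    log_det = (2.0 * tf.reduce_sum(tf.math.log(tf.linalg.diag_part(L_B)))
               - N * tf.math.log(beta))
    # Trace term: (beta/2) tr(S) = (beta/2) tr(K_ff) - (1/2) tr(U U^T)
    trace = 0.5 * beta * tf.reduce_sum(Kff_diag) - 0.5 * tf.reduce_sum(tf.square(U))
    return -0.5 * (quad + log_det + N * math.log(2.0 * math.pi)) - trace
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;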
&lt;p&gt;Finally, let us address the posterior predictive.
Recall that
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \beta \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y}, \mathbf{C}^{-1}).
$$&lt;p&gt;
Re-writing this in terms of $\mathbf{W}$, we get
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{u})
&amp; = \mathcal{N}\left(\mathbf{u} \mid \beta \mathbf{W} \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y}, \mathbf{W} \mathbf{W}^\top \right) \\\\
&amp; = \mathcal{N}\left(\mathbf{u} \mid \beta \mathbf{L} \mathbf{L}_\mathbf{B}^{-\top} \mathbf{W}^\top \boldsymbol{\Psi} \mathbf{y}, \mathbf{L} \mathbf{L}_\mathbf{B}^{-\top} \mathbf{L}_\mathbf{B}^{-1} \mathbf{L}^\top\right) \\\\
&amp; = \mathcal{N}\left(\mathbf{u} \mid \mathbf{L} \left(\mathbf{L}_\mathbf{B}^{-\top} \mathbf{c}\right), \mathbf{L} \mathbf{B}^{-1} \mathbf{L}^\top\right).
\end{align*}
$$&lt;p&gt;
Hence, we see that the optimal variational distribution is itself a
whitened parameterization with $\mathbf{b}' = \mathbf{L}_\mathbf{B}^{-\top} \mathbf{c}$
and $\mathbf{W}' = \mathbf{L}_\mathbf{B}^{-\top}$ (such that ${\mathbf{W}'} {\mathbf{W}'}^\top = \mathbf{B}^{-1}$).
Combined with results from a previous section,
we can directly write the predictive $q_{\boldsymbol{\phi}^{\star}}(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}) q_{\boldsymbol{\phi}^{\star}}(\mathbf{u}) \, \mathrm{d}\mathbf{u}$ as
&lt;/p&gt;
$$
q_{\boldsymbol{\phi}^{\star}}(\mathbf{f}) =
\mathcal{N}\left(\boldsymbol{\Lambda}^\top \mathbf{L}_\mathbf{B}^{-\top} \mathbf{c},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top \left( \mathbf{I} - \mathbf{B}^{-1} \right) \boldsymbol{\Lambda} \right).
$$&lt;p&gt;
Alternatively, we can derive this by noting the following simple identity,
&lt;/p&gt;
$$
\boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi} = \boldsymbol{\Psi}^\top \mathbf{L} \mathbf{B}^{-1} \mathbf{L}^\top \boldsymbol{\Psi} = \boldsymbol{\Lambda}^\top \mathbf{B}^{-1} \boldsymbol{\Lambda},
$$&lt;p&gt;
and applying the rules for marginalizing Gaussians to obtain
&lt;/p&gt;
$$
\begin{align*}
q_{\boldsymbol{\phi}^{\star}}(\mathbf{f})
&amp; = \mathcal{N}\left(\beta \boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi} \mathbf{y},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Psi}^\top \mathbf{K}_\mathbf{uu} \boldsymbol{\Psi} + \boldsymbol{\Psi}^\top \mathbf{C}^{-1} \boldsymbol{\Psi} \right) \\\\
&amp; = \mathcal{N}\left(\beta \boldsymbol{\Lambda}^\top \mathbf{B}^{-1} \boldsymbol{\Lambda} \mathbf{y},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top \boldsymbol{\Lambda} + \boldsymbol{\Lambda}^\top \mathbf{B}^{-1} \boldsymbol{\Lambda} \right) \\\\
&amp; = \mathcal{N}\left(\boldsymbol{\Lambda}^\top \mathbf{L}_\mathbf{B}^{-\top} \mathbf{c},
\mathbf{K}_\mathbf{ff} - \boldsymbol{\Lambda}^\top \left( \mathbf{I} - \mathbf{B}^{-1} \right) \boldsymbol{\Lambda} \right).
\end{align*}
$$&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Titsias, M. (2009, April). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics (pp. 567-574).&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Murray, I., &amp;amp; Adams, R. P. (2010). Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. In Advances in Neural Information Processing Systems (pp. 1732-1740).&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Hensman, J., Matthews, A. G., Filippone, M., &amp;amp; Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 1648-1656).&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Csató, L., &amp;amp; Opper, M. (2002). Sparse On-line Gaussian Processes. Neural Computation, 14(3), 641-668.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations (PhD Thesis). University of Edinburgh.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Snelson, E., &amp;amp; Ghahramani, Z. (2005). Sparse Gaussian Processes using Pseudo-inputs. Advances in Neural Information Processing Systems, 18, 1257-1264.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;Quinonero-Candela, J., &amp;amp; Rasmussen, C. E. (2005). A Unifying View of Sparse Approximate Gaussian Process Regression. The Journal of Machine Learning Research, 6, 1939-1959.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Lázaro-Gredilla, M., &amp;amp; Figueiras-Vidal, A. R. (2009, December). Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Advances in Neural Information Processing Systems.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;Damianou, A., &amp;amp; Lawrence, N. D. (2013, April). Deep Gaussian Processes. In Artificial Intelligence and Statistics (pp. 207-215). PMLR.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;Salimbeni, H., &amp;amp; Deisenroth, M. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. Advances in Neural Information Processing Systems, 30.&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;Hensman, J., Fusi, N., &amp;amp; Lawrence, N. D. (2013, August). Gaussian Processes for Big Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (pp. 282-290).&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;Dezfouli, A., &amp;amp; Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In Advances in Neural Information Processing Systems (pp. 1414-1422).&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Bui, T. D., Yan, J., &amp;amp; Turner, R. E. (2017). A Unifying Framework for Gaussian Process Pseudo-point Approximations using Power Expectation Propagation. The Journal of Machine Learning Research, 18(1), 3649-3720.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;Salimbeni, H., Cheng, C. A., Boots, B., &amp;amp; Deisenroth, M. (2018). Orthogonally Decoupled Variational Gaussian Processes. In Advances in Neural Information Processing Systems (pp. 8711-8720).&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:15"&gt;
&lt;p&gt;Shi, J., Titsias, M., &amp;amp; Mnih, A. (2020, June). Sparse Orthogonal Variational Inference for Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (pp. 1932-1942). PMLR.&amp;#160;&lt;a href="#fnref:15" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:16"&gt;
&lt;p&gt;Burt, D., Rasmussen, C. E., &amp;amp; Van Der Wilk, M. (2019, May). Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (pp. 862-871). PMLR.&amp;#160;&lt;a href="#fnref:16" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:17"&gt;
&lt;p&gt;Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., &amp;amp; Deisenroth, M. (2020, November). Efficiently Sampling Functions from Gaussian Process Posteriors. In International Conference on Machine Learning (pp. 10292-10302). PMLR.&amp;#160;&lt;a href="#fnref:17" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Tech Talk: Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/events/amazon-ml-tech-talk-2019/</link><pubDate>Sat, 15 Jun 2019 13:00:00 +0000</pubDate><guid>https://tiao.io/events/amazon-ml-tech-talk-2019/</guid><description/></item><item><title>Density Ratio Estimation for KL Divergence Minimization between Implicit Distributions</title><link>https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/</link><pubDate>Mon, 27 Aug 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/</guid><description>&lt;!-- TODO: Clarify that optimal classifier refers to the classifier that minimizes the Bayes risk --&gt;
&lt;p&gt;The Kullback-Leibler (KL) divergence between distributions $p$ and $q$ is
defined as&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] :=
\mathbb{E}_{p(x)} \left [ \log \left ( \frac{p(x)}{q(x)} \right ) \right ].
$$&lt;p&gt;It can be expressed more succinctly as&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] = \mathbb{E}_{p(x)} [ \log r^{*}(x) ],
$$&lt;p&gt;where $r^{*}(x)$ is defined to be the ratio between the densities $p(x)$ and
$q(x)$,&lt;/p&gt;
$$
r^{*}(x) := \frac{p(x)}{q(x)}.
$$&lt;p&gt;This density ratio is crucial for computing not only the KL divergence but for
all $f$-divergences, defined as&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
$$
\mathcal{D}_f[p(x) || q(x)] :=
\mathbb{E}_{q(x)} \left [ f \left ( \frac{p(x)}{q(x)} \right ) \right ].
$$&lt;p&gt;Rarely can this expectation (i.e. integral) be calculated analytically&amp;mdash;in
most cases, we must resort to Monte Carlo approximation methods, which
explicitly require the density ratio.
In the more severe case where this density ratio is itself unavailable, because
either or both of $p(x)$ and $q(x)$ cannot be evaluated, we must resort to
methods for &lt;em&gt;density ratio estimation&lt;/em&gt;.
In this post, we illustrate how to perform density ratio estimation by
exploiting its tight correspondence to &lt;em&gt;probabilistic classification&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id="example-univariate-gaussians"&gt;Example: Univariate Gaussians&lt;/h3&gt;
&lt;p&gt;Let us consider the following univariate Gaussian distributions as the running
example for this post,&lt;/p&gt;
$$
p(x) = \mathcal{N}(x \mid 1, 1^2),
\qquad
\text{and}
\qquad
q(x) = \mathcal{N}(x \mid 0, 2^2).
$$&lt;p&gt;We will be using &lt;em&gt;TensorFlow&lt;/em&gt;, &lt;em&gt;TensorFlow Probability&lt;/em&gt;, and &lt;em&gt;Keras&lt;/em&gt; in the
code snippets throughout this post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_probability&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We first instantiate the distributions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Their densities are shown below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Univariate Gaussian densities"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/gaussian_1d_densities.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;For any pair of distributions, we can implement their density ratio function $r$
as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Let&amp;rsquo;s create the density ratio function for the Gaussian distributions we just
instantiated:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This density ratio function is plotted as the orange dotted line below,
alongside the individual densities shown in the previous plot:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Ratio of Gaussian densities"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/gaussian_1d_density_ratios.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h2 id="analytical-form"&gt;Analytical Form&lt;/h2&gt;
&lt;p&gt;For our running example, we picked $p(x)$ and $q(x)$ to be Gaussians so that
it is possible to integrate out $x$ and compute the KL divergence &lt;em&gt;analytically&lt;/em&gt;.
When we introduce the approximate methods later, this will provide us a &amp;ldquo;gold
standard&amp;rdquo; to benchmark against.&lt;/p&gt;
&lt;p&gt;In general, for Gaussian distributions&lt;/p&gt;
$$
p(x) = \mathcal{N}(x \mid \mu_p, \sigma_p^2),
\qquad
\text{and}
\qquad
q(x) = \mathcal{N}(x \mid \mu_q, \sigma_q^2),
$$&lt;p&gt;
it is easy to verify that
&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{KL}}[ p(x) || q(x) ]
= \log \sigma_q - \log \sigma_p - \frac{1}{2}
\left [
1 - \left ( \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{\sigma_q^2} \right )
\right ].
$$&lt;p&gt;This is implemented below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_kl_divergence_gaussians&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can use this to compute the KL divergence between $p(x)$ and $q(x)$
&lt;em&gt;exactly&lt;/em&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_kl_divergence_gaussians&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.44314718&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Equivalently, we could also use &lt;code&gt;kl_divergence&lt;/code&gt; from &lt;em&gt;TensorFlow
Probability&amp;ndash;Distributions&lt;/em&gt; (&lt;code&gt;tfp.distributions&lt;/code&gt;), which implements the
analytical closed-form expression of the KL divergence between distributions
when one exists.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kl_divergence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.44314718&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="monte-carlo-estimation--prescribed-distributions"&gt;Monte Carlo Estimation &amp;mdash; prescribed distributions&lt;/h2&gt;
&lt;p&gt;For distributions where their KL divergence is not analytically tractable, we
may appeal to Monte Carlo (MC) estimation:&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)]
&amp; = \mathbb{E}_{p(x)} [ \log r^{*}(x) ] \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} \log r^{*}(x_p^{(i)}),
\quad x_p^{(i)} \sim p(x).
\end{align*}
$$&lt;p&gt;Clearly, this requires the density ratio $r^{*}(x)$ and, in turn, the densities
$p(x)$ and $q(x)$ to be analytically tractable. Distributions for which the
density function can be readily evaluated are sometimes referred to as
&lt;strong&gt;prescribed distributions&lt;/strong&gt;. As before, we &lt;em&gt;prescribed&lt;/em&gt; Gaussian distributions
in our running example so that the Monte Carlo estimate can later be compared against the exact value.
We approximate their KL divergence using $M = 5000$ Monte Carlo samples as
follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;true_log_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_density_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.44670376&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or equivalently, using the &lt;code&gt;expectation&lt;/code&gt; function from &lt;em&gt;TensorFlow
Probability&amp;ndash;Monte Carlo&lt;/em&gt; (&lt;code&gt;tfp.monte_carlo&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;true_log_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.4581419&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;More generally, we can approximate any $f$-divergence with MC estimation:&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_f[p(x) || q(x)]
&amp; = \mathbb{E}_{q(x)} [ f(r^{*}(x)) ] \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} f(r^{*}(x_q^{(i)})),
\quad x_q^{(i)} \sim q(x).
\end{align*}
$$&lt;p&gt;This can be done using the &lt;code&gt;monte_carlo_csiszar_f_divergence&lt;/code&gt; function from
&lt;em&gt;TensorFlow Probability&amp;ndash;Variational Inference&lt;/em&gt; (&lt;code&gt;tfp.vi&lt;/code&gt;).
One simply needs to specify the appropriate convex function $f$.
The convex function that instantiates the (forward) KL divergence is provided
in &lt;code&gt;tfp.vi&lt;/code&gt; as &lt;code&gt;kl_forward&lt;/code&gt;, alongside many other common $f$-divergences.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo_csiszar_f_divergence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kl_forward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;p_log_prob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;num_draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.4430853&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="density-ratio-estimation--implicit-distributions"&gt;Density Ratio Estimation &amp;mdash; implicit distributions&lt;/h2&gt;
&lt;p&gt;When either density $p(x)$ or $q(x)$ is unavailable, things become more tricky.
Which brings us to the topic of this post. Suppose we only have samples from
$p(x)$ and $q(x)$&amp;mdash;these could be natural images, outputs from a neural
network with stochastic inputs, or in the case of our running example, i.i.d.
samples drawn from Gaussians, etc.
Distributions for which we are only able to observe their samples are known as
&lt;strong&gt;implicit distributions&lt;/strong&gt;, since their samples &lt;em&gt;imply&lt;/em&gt; some underlying true
density which we may not have direct access to.&lt;/p&gt;
&lt;p&gt;Density ratio estimation is concerned with estimating the ratio of densities
$r^{*}(x) = p(x) / q(x)$ given access only to samples from $p(x)$ and $q(x)$.
Moreover, density ratio estimation usually encompasses methods that achieve this
without resorting to direct &lt;em&gt;density estimation&lt;/em&gt; of the individual densities
$p(x)$ or $q(x)$, since any error in the estimation of the denominator $q(x)$
is magnified exponentially.&lt;/p&gt;
&lt;p&gt;Of the many density ratio estimation methods that now
flourish&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, the classical approach of &lt;em&gt;probabilistic
classification&lt;/em&gt; remains dominant, due in no small part to its simplicity.&lt;/p&gt;
&lt;h3 id="reducing-density-ratio-estimation-to-probabilistic-classification"&gt;Reducing Density Ratio Estimation to Probabilistic Classification&lt;/h3&gt;
&lt;p&gt;We now demonstrate that density ratio estimation can be reduced to probabilistic
classification. We shall do this by highlighting the one-to-one correspondence
between the density ratio of $p(x)$ and $q(x)$ and the optimal probabilistic
classifier that discriminates between their samples.
Specifically, suppose we have a collection of samples from both $p(x)$ and $q(x)$,
where each sample is assigned a class label indicating which distribution it was
drawn from. Then, from an estimator of the class-membership probabilities, it is
straightforward to recover an estimator of the density ratio.&lt;/p&gt;
&lt;p&gt;Suppose we have $N_p$ and $N_q$ samples drawn from $p(x)$ and $q(x)$,
respectively,&lt;/p&gt;
$$
x_p^{(1)}, \dotsc, x_p^{(N_p)} \sim p(x),
\qquad \text{and} \qquad
x_q^{(1)}, \dotsc, x_q^{(N_q)} \sim q(x).
$$&lt;p&gt;Then, we form the dataset $\{ (x_n, y_n) \}_{n=1}^N$, where $N = N_p + N_q$
and&lt;/p&gt;
$$
\begin{align*}
(x_1, \dotsc, x_N) &amp; = (x_p^{(1)}, \dotsc, x_p^{(N_p)},
x_q^{(1)}, \dotsc, x_q^{(N_q)}), \newline
(y_1, \dotsc, y_N) &amp; = (\underbrace{1, \dotsc, 1}_{N_p},
\underbrace{0, \dotsc, 0}_{N_q}).
\end{align*}
$$&lt;p&gt;In other words, we label samples drawn from $p(x)$ as 1 and those drawn from
$q(x)$ as 0. In code, this looks like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;q_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q_samples&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This dataset is visualized below. The blue squares in the top row are samples
$x_p^{(i)} \sim p(x)$ with label 1; red squares in the bottom row are samples
$x_q^{(j)} \sim q(x)$ with label 0.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Classification dataset"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/dataset.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Now, by construction, we have&lt;/p&gt;
$$
p(x) = \mathcal{P}(x \mid y = 1),
\qquad
\text{and}
\qquad
q(x) = \mathcal{P}(x \mid y = 0).
$$&lt;p&gt;Using Bayes&amp;rsquo; rule, we can write&lt;/p&gt;
$$
\mathcal{P}(x \mid y) =
\frac{\mathcal{P}(y \mid x) \mathcal{P}(x)}
{\mathcal{P}(y)}.
$$&lt;p&gt;Hence, we can express the density ratio $r^{*}(x)$ as&lt;/p&gt;
$$
\begin{align*}
r^{*}(x) &amp; = \frac{p(x)}{q(x)}
= \frac{\mathcal{P}(x \mid y = 1)}
{\mathcal{P}(x \mid y = 0)} \newline
&amp; = \left ( \frac{\mathcal{P}(y = 1 \mid x) \mathcal{P}(x)}
{\mathcal{P}(y = 1)} \right )
\left ( \frac{\mathcal{P}(y = 0 \mid x) \mathcal{P}(x)}
{\mathcal{P}(y = 0)} \right ) ^ {-1} \newline
&amp; = \frac{\mathcal{P}(y = 0)}{\mathcal{P}(y = 1)}
\frac{\mathcal{P}(y = 1 \mid x)}
{\mathcal{P}(y = 0 \mid x)}.
\end{align*}
$$&lt;p&gt;Let us approximate the ratio of marginal class probabilities by the ratio of sample sizes,&lt;/p&gt;
$$
\frac{\mathcal{P}(y = 0)}
{\mathcal{P}(y = 1)}
\approx
\frac{N_q}{N_p + N_q}
\left ( \frac{N_p}{N_p + N_q} \right )^{-1}
= \frac{N_q}{N_p}.
$$&lt;p&gt;To avoid notational clutter, let us assume from now on that $N_q = N_p$.
We can then write $r^{*}(x)$ in terms of class-posterior probabilities,&lt;/p&gt;
$$
\begin{align*}
r^{*}(x) = \frac{\mathcal{P}(y = 1 \mid x)}
{\mathcal{P}(y = 0 \mid x)}.
\end{align*}
$$&lt;h4 id="recovering-the-density-ratio-from-the-class-probability"&gt;Recovering the Density Ratio from the Class Probability&lt;/h4&gt;
&lt;p&gt;This yields a one-to-one correspondence between the density ratio $r^{*}(x)$
and the class-posterior probability $\mathcal{P}(y = 1 \mid x)$.
Namely,&lt;/p&gt;
$$
\begin{align*}
r^{*}(x) = \frac{\mathcal{P}(y = 1 \mid x)}
{\mathcal{P}(y = 0 \mid x)}
&amp; = \frac{\mathcal{P}(y = 1 \mid x)}
{1 - \mathcal{P}(y = 1 \mid x)} \newline
&amp; = \exp
\left [
\log \frac{\mathcal{P}(y = 1 \mid x)}
{1 - \mathcal{P}(y = 1 \mid x)} \right ] \newline
&amp; = \exp[ \sigma^{-1}(\mathcal{P}(y = 1 \mid x)) ],
\end{align*}
$$&lt;p&gt;where $\sigma^{-1}$ is the &lt;em&gt;logit&lt;/em&gt; function, or inverse sigmoid function, given
by $\sigma^{-1}(\rho) = \log \left ( \frac{\rho}{1-\rho} \right )$.&lt;/p&gt;
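&lt;p&gt;As a quick sanity check, a predicted class probability can be mapped back to
a density-ratio estimate by applying the logit and exponentiating. A minimal
sketch, where &lt;code&gt;prob&lt;/code&gt; is a hypothetical tensor of
$\mathcal{P}(y = 1 \mid x)$ values:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import tensorflow as tf


def ratio_from_class_prob(prob):
    # r(x) = exp(logit(prob)) = prob / (1 - prob)
    logit = tf.math.log(prob) - tf.math.log1p(-prob)
    return tf.exp(logit)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Working on the log scale in this way is slightly more forgiving than computing
&lt;code&gt;prob / (1. - prob)&lt;/code&gt; directly when the probabilities approach 0 or 1.&lt;/p&gt;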
&lt;h4 id="recovering-the-class-probability-from-the-density-ratio"&gt;Recovering the Class Probability from the Density Ratio&lt;/h4&gt;
&lt;p&gt;By rearranging this relationship, we can also recover
the exact class-posterior probability as a function of the density ratio,&lt;/p&gt;
$$
\mathcal{P}(y=1 \mid x) = \sigma(\log r^{*}(x)) = \frac{p(x)}{p(x) + q(x)}.
$$
&lt;p&gt;This is implemented below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;optimal_classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;truediv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the figure below, the class-posterior probability $\mathcal{P}(y=1 \mid x)$
is plotted against the dataset visualized earlier.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Optimal classifier&amp;mdash;class-posterior probabilities"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/optimal_classifier.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="probabilistic-classification-with-logistic-regression"&gt;Probabilistic Classification with Logistic Regression&lt;/h3&gt;
&lt;p&gt;The class-posterior probability $\mathcal{P}(y = 1 \mid x)$ can be approximated
using a parameterized function $D_{\theta}(x)$ with parameters $\theta$. This
function takes as input samples from $p(x)$ and $q(x)$ and outputs a &lt;em&gt;score&lt;/em&gt;,
or probability, in the range $[0, 1]$ that it was drawn from $p(x)$.
Hence, we refer to $D_{\theta}(x)$ as the probabilistic classifier.&lt;/p&gt;
&lt;p&gt;From before, it is clear how an estimator of the density ratio
$r_{\theta}(x)$ might be constructed as a function of the probabilistic classifier
$D_{\theta}(x)$. Namely,&lt;/p&gt;
$$
\begin{align*}
r_{\theta}(x) &amp; = \exp[ \sigma^{-1}(D_{\theta}(x)) ] \newline
&amp; \approx \exp[ \sigma^{-1}(\mathcal{P}(y = 1 \mid x)) ] = r^{*}(x),
\end{align*}
$$&lt;p&gt;
and &lt;em&gt;vice versa&lt;/em&gt;,
&lt;/p&gt;
$$
\begin{align*}
D_{\theta}(x) &amp; = \sigma(\log r_{\theta}(x)) \newline
&amp; \approx \sigma(\log r^{*}(x)) = \mathcal{P}(y = 1 \mid x).
\end{align*}
$$&lt;p&gt;Instead of $D_{\theta}(x)$, we usually specify the parameterized function
$\log r_{\theta}(x)$. This is also referred to as the &lt;em&gt;log-odds&lt;/em&gt;, or &lt;em&gt;logits&lt;/em&gt;,
since it is equivalent to the unnormalized output of the classifier before being
fed through the logistic sigmoid function.&lt;/p&gt;
&lt;p&gt;We define a small fully-connected neural network with two hidden layers and ReLU
activations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;log_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This simple architecture is visualized in the diagram below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Log Density Ratio Architecture"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/log_ratio_architecture.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;We learn the class-probability estimator $D_{\theta}(x)$ by optimizing it with respect
to a &lt;em&gt;proper scoring rule&lt;/em&gt;&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt; that yields well-calibrated probabilistic predictions, such as the &lt;em&gt;binary cross-entropy loss&lt;/em&gt;,&lt;/p&gt;
$$
\begin{align*}
\mathcal{L}(\theta) &amp; :=
-\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ]
-\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ] \newline
&amp; =
-\mathbb{E}_{p(x)} [ \log \sigma ( \log r_{\theta} (x) ) ]
-\mathbb{E}_{q(x)} [ \log(1 - \sigma ( \log r_{\theta} (x) )) ].
\end{align*}
$$&lt;p&gt;An implementation optimized for numerical stability is given below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;loss_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigmoid_cross_entropy_with_logits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;loss_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigmoid_cross_entropy_with_logits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss_p&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;loss_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now we can build a Keras &lt;code&gt;Model&lt;/code&gt; with two inputs, where the &lt;code&gt;Input&lt;/code&gt; layers are instantiated directly from the tensors &lt;code&gt;p_samples&lt;/code&gt; and &lt;code&gt;q_samples&lt;/code&gt;&amp;mdash;samples from
$p(x)$ and $q(x)$, respectively.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;log_ratio_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The model can now be compiled and finalized. Since we&amp;rsquo;re using a custom loss
that takes the two sets of log-ratios as input, we specify &lt;code&gt;loss=None&lt;/code&gt; and
define the loss instead through the &lt;code&gt;add_loss&lt;/code&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_q&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ratio_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ratio_q&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As a sanity check, the loss evaluated on a random batch can be obtained like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;1.3765026330947876&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can now fit our estimator, recording the loss at the end of each epoch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps_per_epoch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The following animation shows how the predictions of the probabilistic
classifier, the density ratio, and the log density ratio evolve after every epoch:&lt;/p&gt;
&lt;p&gt;&lt;video controls autoplay src="https://giant.gfycat.com/FrighteningThunderousFlicker.webm"&gt;&lt;/video&gt;&lt;/p&gt;
&lt;p&gt;The predictions are overlaid on top of their exact, analytical counterparts, which are
available only because we prescribed the distributions to be Gaussians.
For implicit distributions, these would not be accessible at all.&lt;/p&gt;
&lt;p&gt;Below is the final plot of how the binary cross-entropy loss converges:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Binary Cross-entropy Loss"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/binary_crossentropy.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Below is a plot of the probabilistic classifier $D_{\theta}(x)$ (&lt;em&gt;dotted green&lt;/em&gt;),
plotted against the optimal classifier, which is the class-posterior probability
$\mathcal{P}(y=1 \mid x) = \frac{p(x)}{p(x) + q(x)}$ (&lt;em&gt;solid blue&lt;/em&gt;):&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Class Probability Estimator"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/class_probability_estimation.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Below is a plot of the density ratio estimator $r_{\theta}(x)$
(&lt;em&gt;dotted green&lt;/em&gt;), plotted against the exact density ratio function
$r^{*}(x) = \frac{p(x)}{q(x)}$ (&lt;em&gt;solid blue&lt;/em&gt;):&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Density Ratio Estimator"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/density_ratio_estimation.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;And finally, the previous plot in logarithmic scale:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Log Density Ratio Estimator"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/log_density_ratio_estimation.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;While it may appear that we are simply performing regression on the latent
function $r^{*}(x)$ (which is not wrong&amp;mdash;we are), it is important to emphasize that
we do this without ever having observed values of $r^{*}(x)$.
Instead, we only ever observe samples from $p(x)$ and $q(x)$.
This has profound implications and opens up a great number of applications
that we shall explore later on.&lt;/p&gt;
&lt;h3 id="back-to-monte-carlo-estimation"&gt;Back to Monte Carlo estimation&lt;/h3&gt;
&lt;p&gt;Having obtained an estimate of the log density ratio, it is now feasible to
perform Monte Carlo estimation:&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)]
&amp; = \mathbb{E}_{p(x)} [ \log r^{*}(x) ] \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} \log r^{*}(x_p^{(i)}),
\quad x_p^{(i)} \sim p(x) \newline
&amp; \approx \frac{1}{M} \sum_{i=1}^{M} \log r_{\theta}(x_p^{(i)}),
\quad x_p^{(i)} \sim p(x).
\end{align*}
$$&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;0.4570999&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In other words, we draw MC samples from $p(x)$ as before. But instead of averaging
the function $\log r^{*}(x)$ evaluated on these samples (which is
unavailable for implicit distributions), we average the proxy function
$\log r_{\theta}(x)$, estimated through probabilistic classification as
described above.&lt;/p&gt;
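&lt;p&gt;Equivalently, given a plain NumPy array &lt;code&gt;x_p&lt;/code&gt; of held-out samples from $p(x)$ (a hypothetical variable, not defined in the snippets above), a rough sketch of the same estimate using the trained model would be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

# x_p: hypothetical array of samples drawn from p(x), with shape (num_samples, 1)
kl_estimate = np.mean(log_ratio.predict(x_p))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;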
&lt;h2 id="learning-in-implicit-generative-models"&gt;Learning in Implicit Generative Models&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s take a look at where these ideas are being used in practice.
Consider a collection of natural images, such as the MNIST handwritten
digits shown below, which are assumed to be samples drawn from some implicit
distribution $q(\mathbf{x})$:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/MnistExamples.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;MNIST hand-written digits&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Directly estimating the density of $q(\mathbf{x})$ may not always be feasible&amp;mdash;in
some cases, it may not even exist.
Instead, consider defining a parametric function $G_{\phi}: \mathbf{z} \mapsto
\mathbf{x}$ with parameters $\phi$, that takes as input $\mathbf{z}$ drawn from
some fixed distribution $p(\mathbf{z})$.
The outputs $\mathbf{x}$ of this generative process are assumed to be samples
following some implicit distribution $p_{\phi}(\mathbf{x})$. In other words,
we can write&lt;/p&gt;
$$
\mathbf{x} \sim p_{\phi}(\mathbf{x}) \quad
\Leftrightarrow \quad
\mathbf{x} = G_{\phi}(\mathbf{z}),
\quad \mathbf{z} \sim p(\mathbf{z}).
$$&lt;p&gt;By optimizing parameters $\phi$, we can make $p_{\phi}(\mathbf{x})$ close to
the real data distribution $q(\mathbf{x})$. This is a compelling alternative to
density estimation since there are many situations where being able to generate
samples is more important than being able to calculate the numerical value of
the density. Some examples of these include &lt;em&gt;image super-resolution&lt;/em&gt; and
&lt;em&gt;semantic segmentation&lt;/em&gt;.&lt;/p&gt;
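&lt;p&gt;As a rough sketch of what such a generator might look like (the import paths, layer sizes, latent dimensionality, and choice of a standard Gaussian $p(\mathbf{z})$ below are arbitrary illustrative assumptions, not a prescribed architecture):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

latent_dim = 64  # dimensionality of z (arbitrary choice)

# G_phi: maps z ~ p(z) to synthetic samples x ~ p_phi(x)
generator = Sequential([
    Dense(128, input_dim=latent_dim, activation='relu'),
    Dense(128, activation='relu'),
    Dense(784, activation='sigmoid'),  # e.g. flattened 28x28 pixel intensities
])

z = np.random.normal(size=(64, latent_dim))  # a batch of z ~ p(z), here N(0, I)
x_synthetic = generator.predict(z)           # samples from p_phi(x)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;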
&lt;p&gt;One approach might be to introduce a classifier $D_{\theta}$ that discriminates
between real and synthetic samples.
We then optimize $G_{\phi}$ to synthesize samples that are indistinguishable,
to the classifier $D_{\theta}$, from the real samples. This can be achieved by
optimizing the binary cross-entropy objective simultaneously over $\theta$
(maximizing) and $\phi$ (minimizing), resulting in the saddle-point objective,&lt;/p&gt;
$$
\begin{align*}
&amp; \min_{\phi} \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log D_{\theta} (\mathbf{x}) ] +
\mathbb{E}_{p_{\phi}(\mathbf{x})} [ \log(1-D_{\theta} (\mathbf{x})) ] \newline =
&amp; \min_{\phi} \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log D_{\theta} (\mathbf{x}) ] +
\mathbb{E}_{p(\mathbf{z})} [ \log(1-D_{\theta} (G_{\phi}(\mathbf{z}))) ].
\end{align*}
$$&lt;p&gt;This is, of course, none other than the groundbreaking &lt;em&gt;generative adversarial
network (GAN)&lt;/em&gt;&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;.
You can read more about the density ratio estimation perspective of GANs in
the paper by Uehara et al. 2016&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;. For an even more general and complete treatment of learning in implicit models, I recommend the paper
from Mohamed and Lakshminarayanan, 2016&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;, which partially inspired this post.&lt;/p&gt;
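&lt;p&gt;A minimal sketch of this saddle-point objective, using the same numerically stable primitive as the loss above; here &lt;code&gt;logits_real&lt;/code&gt; and &lt;code&gt;logits_fake&lt;/code&gt; are assumed to be the discriminator logits on real data and on generated samples, respectively:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import tensorflow as tf


def gan_losses(logits_real, logits_fake):
    # discriminator: maximize E_q[log D(x)] + E_{p(z)}[log(1 - D(G(z)))],
    # i.e. minimize the binary cross-entropy with real data labelled 1 and
    # generated samples labelled 0
    d_loss = (
        tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=logits_real, labels=tf.ones_like(logits_real))) +
        tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=logits_fake, labels=tf.zeros_like(logits_fake))))

    # generator (minimax form): minimize E_{p(z)}[log(1 - D(G(z)))]
    g_loss = -tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=logits_fake, labels=tf.zeros_like(logits_fake)))

    return d_loss, g_loss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;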
&lt;p&gt;For the remainder of this section, I want to highlight a variant of this
approach that specifically aims to minimize the KL divergence w.r.t. parameters
$\phi$,&lt;/p&gt;
$$
\min_{\phi} \mathcal{D}_{\mathrm{KL}}[p_{\phi}(\mathbf{x}) || q(\mathbf{x})].
$$&lt;p&gt;To overcome the fact that the densities of both $p_{\phi}(\mathbf{x})$ and
$q(\mathbf{x})$ are unknown, we can readily adopt the density ratio estimation
approach outlined in this post.
Namely, by maximizing the following objective,&lt;/p&gt;
$$
\begin{align*}
&amp; \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log D_{\theta} (\mathbf{x}) ] +
\mathbb{E}_{p(\mathbf{z})} [ \log(1-D_{\theta} (G_{\phi}(\mathbf{z}))) ] \newline
= &amp; \max_{\theta}
\mathbb{E}_{q(\mathbf{x})} [ \log \sigma ( \log r_{\theta} (\mathbf{x}) ) ] +
\mathbb{E}_{p(\mathbf{z})} [ \log(1 - \sigma ( \log r_{\theta} (G_{\phi}(\mathbf{z})) )) ],
\end{align*}
$$&lt;p&gt;which attains its maximum at&lt;/p&gt;
$$
r_{\theta}(\mathbf{x}) = \frac{q(\mathbf{x})}{p_{\phi}(\mathbf{x})}.
$$&lt;p&gt;Concurrently, we also minimize the current best estimate of the KL divergence,&lt;/p&gt;
$$
\begin{align*}
\min_{\phi} \mathcal{D}_{\mathrm{KL}}[p_{\phi}(\mathbf{x}) || q(\mathbf{x})]
&amp; =
\min_{\phi} \mathbb{E}_{p_{\phi}(\mathbf{x})} \left [ \log \frac{p_{\phi}(\mathbf{x})}{q(\mathbf{x})} \right ] \newline
&amp; \approx
\min_{\phi} \mathbb{E}_{p_{\phi}(\mathbf{x})} [ - \log r_{\theta}(\mathbf{x}) ] \newline
&amp; =
\min_{\phi} \mathbb{E}_{p(\mathbf{z})} [ - \log r_{\theta}(G_{\phi}(\mathbf{z})) ].
\end{align*}
$$&lt;p&gt;In addition to being more stable than the vanilla GAN approach (it alleviates
the problem of saturating gradients), this is especially important in contexts where there is
a specific need to minimize the KL divergence, such as in &lt;em&gt;variational inference
(VI)&lt;/em&gt;.&lt;/p&gt;
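&lt;p&gt;Sketched in the same style, and assuming &lt;code&gt;log_ratio_real&lt;/code&gt; and &lt;code&gt;log_ratio_fake&lt;/code&gt; denote the outputs of the log-ratio network on real data $\mathbf{x} \sim q(\mathbf{x})$ and on generated samples $G_{\phi}(\mathbf{z})$, respectively, the two concurrent updates might look like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# ratio estimator: the same binary cross-entropy helper as before, with real
# data q(x) labelled 1 and generated samples p_phi(x) labelled 0, so that at
# the optimum r_theta(x) approximates q(x) / p_phi(x)
ratio_loss = _binary_crossentropy(log_ratio_real, log_ratio_fake)

# generator: minimize the current Monte Carlo estimate of the KL divergence,
# KL[p_phi(x) || q(x)] ~= E_{p(z)}[-log r_theta(G_phi(z))]
generator_loss = tf.reduce_mean(-log_ratio_fake)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;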
&lt;p&gt;This was first used in &lt;em&gt;AffGAN&lt;/em&gt; by Sønderby et al. 2016&lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;,
and has since been incorporated in many papers that deal with implicit
distributions in variational inference, such as
(Mescheder et al. 2017&lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;,
Huszar 2017&lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;,
Tran et al. 2017&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;,
Pu et al. 2017&lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;,
Chen et al. 2018&lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;,
Tiao et al. 2018&lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt;), and many others.&lt;/p&gt;
&lt;h2 id="bound-on-the-jensen-shannon-divergence"&gt;Bound on the Jensen-Shannon Divergence&lt;/h2&gt;
&lt;p&gt;Before we wrap things up, let us take another look at the plot of the
binary cross-entropy loss recorded at the end of each epoch.
We see that it converges quickly to some value.
It is natural to wonder: what is the significance, if any, of this value?&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Binary cross-entropy loss converges to Jensen Shannon divergence (up to constants)"
src="https://tiao.io/posts/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/binary_crossentropy_vs_jensen_shannon.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;It is in fact the (negative) Jensen-Shannon (JS) divergence, up to constants,&lt;/p&gt;
$$
-2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] + \log 4.
$$&lt;p&gt;Recall the Jensen-Shannon divergence is defined as&lt;/p&gt;
$$
\mathcal{D}_{\mathrm{JS}}[p(x) || q(x)]
= \frac{1}{2} \mathcal{D}_{\mathrm{KL}}[p(x) || m(x)] +
\frac{1}{2} \mathcal{D}_{\mathrm{KL}}[q(x) || m(x)],
$$&lt;p&gt;where $m$ is the mixture density&lt;/p&gt;
$$
m(x) = \frac{p(x) + q(x)}{2}.
$$&lt;p&gt;With our running example, this cannot be evaluated exactly since the KL
divergence between a Gaussian and a mixture of Gaussians is analytically
intractable.
However, like the KL, we can still estimate their JS divergence with Monte
Carlo estimation&lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monte_carlo_csiszar_f_divergence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tfp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jensen_shannon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;p_log_prob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This value is shown as the horizontal black line in the plot above. Along the
right margin, we also plot a histogram of the binary cross-entropy loss
values over epochs. We can see that this value indeed coincides with the mode of
the histogram.&lt;/p&gt;
&lt;p&gt;It is straightforward to show that the loss can never fall below this quantity; that is,&lt;/p&gt;
$$
\inf_{\theta} \mathcal{L}(\theta) \geq - 2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] + \log 4.
$$&lt;p&gt;First, taking the supremum over all possible classifiers $D$ (a family that contains every $D_{\theta}$), we have&lt;/p&gt;
$$
\begin{align*}
\sup_{D} &amp;
\mathbb{E}_{p(x)} [ \log D (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D (x)) ] \newline
&amp; =
\mathbb{E}_{p(x)} [ \log \mathcal{P}(y=1 \mid x) ] +
\mathbb{E}_{q(x)} [ \log \mathcal{P}(y=0 \mid x) ] \newline
&amp; =
\mathbb{E}_{p(x)} \left [ \log \frac{p(x)}{p(x) + q(x)} \right ] +
\mathbb{E}_{q(x)} \left [ \log \frac{q(x)}{p(x) + q(x)} \right ] \newline
&amp; =
\mathbb{E}_{p(x)} \left [ \log \frac{1}{2} \frac{p(x)}{m(x)} \right ] +
\mathbb{E}_{q(x)} \left [ \log \frac{1}{2} \frac{q(x)}{m(x)} \right ] \newline
&amp; =
\mathbb{E}_{p(x)} \left [ \log \frac{p(x)}{m(x)} \right ] +
\mathbb{E}_{q(x)} \left [ \log \frac{q(x)}{m(x)} \right ] - 2 \log 2 \newline
&amp; = 2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4.
\end{align*}
$$&lt;p&gt;Therefore, since the parameterized classifiers $D_{\theta}$ form a subset of all possible classifiers,&lt;/p&gt;
$$
2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4
\geq
\sup_{\theta}
\left \{
\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ]
\right \}.
$$&lt;p&gt;Negating both sides, we get&lt;/p&gt;
$$
\begin{align*}
-2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] + \log 4
\leq &amp;
-\sup_{\theta}
\left \{
\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ]
\right \} \newline
= &amp; \inf_{\theta}
\left \{
-\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ]
-\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ]
\right \} \newline
= &amp; \inf_{\theta} \mathcal{L}(\theta),
\end{align*}
$$&lt;p&gt;as required.&lt;/p&gt;
&lt;p&gt;In short, this tells us that the binary cross-entropy loss is &lt;em&gt;itself&lt;/em&gt; an
approximation (up to sign and constants) to the Jensen-Shannon divergence.
This raises the question: is it possible to construct a more general loss that bounds any given $f$-divergence?&lt;/p&gt;
&lt;h2 id="teaser-lower-bound-on-any--divergence"&gt;Teaser: Lower Bound on any $f$-divergence&lt;/h2&gt;
&lt;p&gt;Using convex analysis, one can actually show that for any $f$-divergence, we
have the lower bound&lt;sup id="fnref:15"&gt;&lt;a href="#fn:15" class="footnote-ref" role="doc-noteref"&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
$$
\mathcal{D}_f[p(x) || q(x)]
\geq
\sup_{\theta}
\mathbb{E}_{p(x)} [ f'(r_{\theta}(x)) ] -
\mathbb{E}_{q(x)} [ f^{\star}(f'(r_{\theta}(x))) ],
$$&lt;p&gt;with equality exactly when $r_{\theta}(x) = r^{*}(x)$.
Importantly, this lower bound can be computed without requiring the densities of
$p(x)$ or $q(x)$&amp;mdash;only their samples are needed.&lt;/p&gt;
&lt;p&gt;In the special case of $f(u) = u \log u - (u + 1) \log (u + 1)$, we recover the
binary cross-entropy loss and the previous result, as expected,&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_f[p(x) || q(x)]
&amp; = 2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4 \newline
&amp; \geq \sup_{\theta}
\mathbb{E}_{p(x)} [ \log \sigma ( \log r_{\theta} (x) ) ] +
\mathbb{E}_{q(x)} [ \log(1 - \sigma ( \log r_{\theta} (x) )) ] \newline
&amp; = \sup_{\theta}
\mathbb{E}_{p(x)} [ \log D_{\theta} (x) ] +
\mathbb{E}_{q(x)} [ \log(1-D_{\theta} (x)) ].
\end{align*}
$$&lt;p&gt;Alternatively, in the special case of $f(u) = u \log u$, we get&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_f[p(x) || q(x)]
&amp; = \mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] \newline
&amp; \geq \sup_{\theta}
\mathbb{E}_{p(x)} [ \log r_{\theta} (x) ] -
\mathbb{E}_{q(x)} [ r_{\theta} (x) - 1 ].
\end{align*}
$$&lt;p&gt;This gives us &lt;em&gt;yet&lt;/em&gt; another way to estimate the KL divergence between
implicit distributions, in the form of a direct lower bound on the KL divergence
itself.
As it turns out, this lower bound is closely related to the objective of the
&lt;em&gt;KL Importance Estimation Procedure (KLIEP)&lt;/em&gt;&lt;sup id="fnref:16"&gt;&lt;a href="#fn:16" class="footnote-ref" role="doc-noteref"&gt;16&lt;/a&gt;&lt;/sup&gt;, and will be
the topic of our next post in this series.&lt;/p&gt;
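&lt;p&gt;As a teaser, a minimal sketch of a Monte Carlo estimate of this lower bound, again assuming &lt;code&gt;log_ratio_p&lt;/code&gt; and &lt;code&gt;log_ratio_q&lt;/code&gt; are the log-ratio network outputs on samples from $p(x)$ and $q(x)$:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import tensorflow as tf


def kl_lower_bound(log_ratio_p, log_ratio_q):
    # Monte Carlo estimate of E_p[log r_theta(x)] - E_q[r_theta(x) - 1];
    # maximizing this w.r.t. theta tightens the lower bound on KL[p(x) || q(x)]
    return (tf.reduce_mean(log_ratio_p) -
            tf.reduce_mean(tf.exp(log_ratio_q) - 1.0))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;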
&lt;h1 id="summary"&gt;Summary&lt;/h1&gt;
&lt;p&gt;This post covered how to evaluate the KL divergence, or any $f$-divergence,
between implicit distributions&amp;mdash;distributions which we can only sample from.
First, we underscored the crucial role of the density ratio in the estimation of
$f$-divergences.
Next, we showed the correspondence between the density ratio and the optimal
classifier.
By exploiting this link, we demonstrated how one can use a trained probabilistic classifier to construct a proxy for the exact density ratio, and use this to
enable estimation of any $f$-divergence.
Finally, we provided some context on where this method is used, touching upon
some recent advances in implicit generative models and variational inference.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2018dre,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{D}ensity {R}atio {E}stimation for {KL} {D}ivergence {M}inimization between {I}mplicit {D}istributions&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2018&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h1 id="acknowledgements"&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;I am grateful to
for providing
extensive feedback and insightful discussions. I would also like to thank
Alistair Reid and
for their comments and suggestions.&lt;/p&gt;
&lt;h1 id="links-and-resources"&gt;Links and Resources&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;The code used to generate the figures in this post.&lt;/li&gt;
&lt;li&gt;The very readable textbook on density ratio estimation&lt;sup id="fnref1:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;, which I highly recommend. (Note: the Gaussian distributions example was borrowed from this book.)&lt;/li&gt;
&lt;li&gt;Shakir Mohamed&amp;rsquo;s blog post on the density ratio trick.&lt;/li&gt;
&lt;li&gt;The paper by Menon and Ong, 2016&lt;sup id="fnref:17"&gt;&lt;a href="#fn:17" class="footnote-ref" role="doc-noteref"&gt;17&lt;/a&gt;&lt;/sup&gt;, which gives a generalized treatment of the theoretical link between density ratio estimation and probabilistic classification.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;The (forward) KL divergence can be recovered with
&lt;/p&gt;
$$
f_{\mathrm{KL}}(u) := u \log u.
$$&lt;p&gt;
This is easy to verify,
&lt;/p&gt;
$$
\begin{align*}
\mathcal{D}_{\mathrm{KL}}[p(x) || q(x)] &amp; :=
\mathbb{E}_{p(x)} \left [ \log \left ( \frac{p(x)}{q(x)} \right ) \right ] \newline
&amp; = \mathbb{E}_{q(x)} \left [ \frac{p(x)}{q(x)} \log \left ( \frac{p(x)}{q(x)} \right ) \right ] \newline
&amp; = \mathbb{E}_{q(x)} \left [ f_{\mathrm{KL}} \left ( \frac{p(x)}{q(x)} \right ) \right ].
\end{align*}
$$&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Sugiyama, M., Suzuki, T., &amp;amp; Kanamori, T. (2012). &lt;em&gt;Density Ratio Estimation in Machine Learning&lt;/em&gt;. Cambridge University Press.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Gneiting, T., &amp;amp; Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt;, 102(477), (pp. 359-378).&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., &amp;hellip; &amp;amp; Bengio, Y. (2014). Generative Adversarial Nets. In Advances in &lt;em&gt;Neural Information Processing Systems&lt;/em&gt; (pp. 2672-2680).&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Uehara, M., Sato, I., Suzuki, M., Nakayama, K., &amp;amp; Matsuo, Y. (2016). Generative Adversarial Nets from a Density Ratio Estimation Perspective. &lt;em&gt;arXiv preprint arXiv:1610.02920&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Mohamed, S., &amp;amp; Lakshminarayanan, B. (2016). Learning in Implicit Generative Models. &lt;em&gt;arXiv preprint arXiv:1610.03483&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;Sønderby, C. K., Caballero, J., Theis, L., Shi, W., &amp;amp; Huszár, F. (2016). Amortised map inference for image super-resolution. &lt;em&gt;arXiv preprint arXiv:1610.04490&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Mescheder, L., Nowozin, S., &amp;amp; Geiger, A. (2017). Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. In &lt;em&gt;International Conference on Machine learning (ICML)&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;Huszár, F. (2017). Variational inference using implicit distributions. &lt;em&gt;arXiv preprint arXiv:1702.08235&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;Tran, D., Ranganath, R., &amp;amp; Blei, D. (2017). Hierarchical implicit models and likelihood-free variational inference. In &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; (pp. 5523-5533).&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;Pu, Y., Wang, W., Henao, R., Chen, L., Gan, Z., Li, C., &amp;amp; Carin, L. (2017). Adversarial symmetric variational autoencoder. In &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; (pp. 4330-4339).&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;Chen, L., Dai, S., Pu, Y., Zhou, E., Li, C., Su, Q., &amp;hellip; &amp;amp; Carin, L. (2018, March). Symmetric variational autoencoder and connections to adversarial learning. In &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt; (pp. 661-669).&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Tiao, L. C., Bonilla, E. V., &amp;amp; Ramos, F. (2018). Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference. &lt;em&gt;arXiv preprint arXiv:1806.01771&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;Note that &lt;code&gt;jensen_shannon&lt;/code&gt; with &lt;code&gt;self_normalized=False&lt;/code&gt; (default), corresponds to $2 \cdot \mathcal{D}_{\mathrm{JS}}[p(x) || q(x)] - \log 4$, while &lt;code&gt;self_normalized=True&lt;/code&gt; corresponds to $\mathcal{D}_{\mathrm{JS}}[p(x) || q(x)]$.&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:15"&gt;
&lt;p&gt;Nguyen, X., Wainwright, M. J., &amp;amp; Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. &lt;em&gt;IEEE Transactions on Information Theory&lt;/em&gt;, 56(11), 5847-5861.&amp;#160;&lt;a href="#fnref:15" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:16"&gt;
&lt;p&gt;Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., &amp;amp; Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems (pp. 1433-1440).&amp;#160;&lt;a href="#fnref:16" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:17"&gt;
&lt;p&gt;Menon, A., &amp;amp; Ong, C. S. (2016, June). Linking Losses for Density Ratio and Class-Probability Estimation. In &lt;em&gt;International Conference on Machine Learning&lt;/em&gt; (pp. 304-313).&amp;#160;&lt;a href="#fnref:17" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Building Probability Distributions with the TensorFlow Probability Bijector API</title><link>https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/</link><pubDate>Mon, 30 Jul 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/</guid><description>&lt;p&gt;TensorFlow Distributions, now under the broader umbrella of
TensorFlow Probability, is a fantastic TensorFlow library for efficient and
composable manipulation of probability distributions&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Among the many features it has to offer, one of the most powerful in my opinion
is the &lt;code&gt;Bijector&lt;/code&gt; API, which provides the modular building blocks necessary to
construct a broad class of probability distributions.
Instead of describing it further in the abstract, let&amp;rsquo;s dive right in with
a simple example.&lt;/p&gt;
&lt;h2 id="example-banana-shaped-distribution"&gt;Example: Banana-shaped distribution&lt;/h2&gt;
&lt;p&gt;Consider the &lt;em&gt;banana-shaped distribution&lt;/em&gt;, a commonly-used testbed for adaptive
MCMC methods&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;.
Denote the density of this distribution as $p_{Y}(\mathbf{y})$.
To illustrate, 1k samples randomly drawn from this distribution are shown below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Banana distribution samples"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_samples.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;The underlying process that generates samples
$\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y})$ is simple to describe,
and is of the general form,&lt;/p&gt;
$$
\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y}) \quad
\Leftrightarrow \quad
\tilde{\mathbf{y}} = G(\tilde{\mathbf{x}}),
\quad \tilde{\mathbf{x}} \sim p_{X}(\mathbf{x}).
$$&lt;p&gt;In other words, a sample $\tilde{\mathbf{y}}$ is the output of a transformation
$G$, given a sample $\tilde{\mathbf{x}}$ drawn from some underlying
base distribution $p_{X}(\mathbf{x})$.&lt;/p&gt;
&lt;p&gt;However, it is not as straightforward to compute an analytical expression for
density $p_{Y}(\mathbf{y})$.
In fact, this is only possible if $G$ is a &lt;em&gt;differentiable&lt;/em&gt; and &lt;em&gt;invertible&lt;/em&gt;
transformation (a &lt;em&gt;diffeomorphism&lt;/em&gt;&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;), and if there is an analytical
expression for $p_{X}(\mathbf{x})$.&lt;/p&gt;
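&lt;p&gt;Specifically, when $G$ is a diffeomorphism, the density of $\mathbf{y} = G(\mathbf{x})$ is given by the change-of-variables formula,&lt;/p&gt;
$$
p_{Y}(\mathbf{y}) =
p_{X}\left(G^{-1}(\mathbf{y})\right)
\left| \det \frac{\partial G^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right|,
$$&lt;p&gt;which requires both the inverse transformation $G^{-1}$ (together with its Jacobian) and an analytical expression for the base density $p_{X}(\mathbf{x})$.&lt;/p&gt;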
&lt;p&gt;Transformations that fail to satisfy these conditions (which includes something
as simple as a multi-layer perceptron with non-linear activations) give rise to
&lt;em&gt;implicit distributions&lt;/em&gt;, and will be the subject of many posts to come.
But for now, we will restrict our attention to diffeomorphisms.&lt;/p&gt;
&lt;h3 id="base-distribution"&gt;Base distribution&lt;/h3&gt;
&lt;p&gt;Following on with our example, the base distribution $p_{X}(\mathbf{x})$ is
given by a two-dimensional Gaussian with unit variances and covariance
$\rho = 0.95$:&lt;/p&gt;
$$
p_{X}(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \mathbf{0}, \mathbf{\Sigma}),
\qquad
\mathbf{\Sigma} =
\begin{bmatrix}
1 &amp; 0.95 \newline
0.95 &amp; 1
\end{bmatrix}
$$&lt;p&gt;This can be encapsulated by an instance of &lt;code&gt;tfd.MultivariateNormalTriL&lt;/code&gt;,
which is parameterized by a lower-triangular scale matrix.
First, let&amp;rsquo;s import TensorFlow Distributions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.contrib.distributions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tfd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then we create the covariance matrix and instantiate the distribution, passing in its lower-triangular Cholesky factor:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eye&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sigma&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MultivariateNormalTriL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale_tril&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cholesky&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sigma&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As with all subclasses of &lt;code&gt;tfd.Distribution&lt;/code&gt;, we can evaluate the probability
density function of this distribution by calling the &lt;code&gt;p_x.prob&lt;/code&gt; method.
Evaluating this on a uniformly-spaced grid yields the equiprobability contour
plot below:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Base density"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_base_density.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="forward-transformation"&gt;Forward Transformation&lt;/h3&gt;
&lt;p&gt;The required transformation $G$ is defined as:&lt;/p&gt;
$$
G(\mathbf{x}) =
\begin{bmatrix}
x_1 \newline
x_2 - x_1^2 - 1 \newline
\end{bmatrix}
$$&lt;p&gt;We implement this in the &lt;code&gt;_forward&lt;/code&gt; function below&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can now use this to generate samples from $p_{Y}(\mathbf{y})$.
To do this we first sample from the base distribution $p_{X}(\mathbf{x})$ by
calling &lt;code&gt;p_x.sample&lt;/code&gt;. For this illustration, we generate 1k samples, which is
specified through the &lt;code&gt;sample_shape&lt;/code&gt; argument. We then transform these samples
through $G$ by calling &lt;code&gt;_forward&lt;/code&gt; on them.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The figure below contains scatterplots of the 1k samples &lt;code&gt;x_samples&lt;/code&gt; (left)
and the transformed &lt;code&gt;y_samples&lt;/code&gt; (right):&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Banana and base samples"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_base_samples.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="instantiating-a-transformeddistribution-with-a-bijector"&gt;Instantiating a &lt;code&gt;TransformedDistribution&lt;/code&gt; with a &lt;code&gt;Bijector&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Having specified the forward transformation and the underlying distribution, we
have now fully described the sample generation process, which is the bare
minimum necessary to define a probability distribution.&lt;/p&gt;
&lt;p&gt;The forward transformation is also the &lt;em&gt;first&lt;/em&gt; of &lt;strong&gt;three&lt;/strong&gt; operations needed to
fully specify a &lt;code&gt;Bijector&lt;/code&gt;, which can be used to instantiate a
&lt;code&gt;TransformedDistribution&lt;/code&gt; that encapsulates the banana-shaped distribution.&lt;/p&gt;
&lt;h4 id="creating-a-bijector"&gt;Creating a &lt;code&gt;Bijector&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;First, let&amp;rsquo;s subclass &lt;code&gt;Bijector&lt;/code&gt; to define the &lt;code&gt;Banana&lt;/code&gt; bijector and implement
the forward transformation as an instance method:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bijectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bijector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;banana&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inverse_min_event_ndims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that we need to specify either &lt;code&gt;forward_min_event_ndims&lt;/code&gt; or
&lt;code&gt;inverse_min_event_ndims&lt;/code&gt;, the minimum number of dimensions the forward or
inverse transformation operates on (these can sometimes differ).
In our example, both the forward and inverse transformations operate on vectors
(rank-1 tensors), so we set &lt;code&gt;inverse_min_event_ndims=1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;With an instance of the &lt;code&gt;Banana&lt;/code&gt; bijector, we can call the &lt;code&gt;forward&lt;/code&gt; method on
&lt;code&gt;x_samples&lt;/code&gt; to produce &lt;code&gt;y_samples&lt;/code&gt; as before:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="instantiating-a-transformeddistribution"&gt;Instantiating a &lt;code&gt;TransformedDistribution&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;More importantly, we can now create a &lt;code&gt;TransformedDistribution&lt;/code&gt; with the base
distribution &lt;code&gt;p_x&lt;/code&gt; and an instance of the &lt;code&gt;Banana&lt;/code&gt; bijector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransformedDistribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bijector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This now allows us to directly sample from &lt;code&gt;p_y&lt;/code&gt; just as we could with &lt;code&gt;p_x&lt;/code&gt;,
and any other TensorFlow Probability &lt;code&gt;Distribution&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;y_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Neat!&lt;/p&gt;
&lt;h3 id="probability-density-function"&gt;Probability Density Function&lt;/h3&gt;
&lt;p&gt;Although we can now sample from this distribution, we have yet to define the
operations necessary to evaluate its probability density function&amp;mdash;the
remaining &lt;em&gt;two&lt;/em&gt; of &lt;strong&gt;three&lt;/strong&gt; operations needed to fully specify a &lt;code&gt;Bijector&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Indeed, calling &lt;code&gt;p_y.prob&lt;/code&gt; at this stage would simply raise a
&lt;code&gt;NotImplementedError&lt;/code&gt; exception. So what else do we need to define?&lt;/p&gt;
&lt;p&gt;Recall the probability density of $p_{Y}(\mathbf{y})$ is given by:&lt;/p&gt;
$$
p_{Y}(\mathbf{y}) = p_{X}(G^{-1}(\mathbf{y})) \mathrm{det}
\left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
$$&lt;p&gt;Hence we need to specify the inverse transformation $G^{-1}(\mathbf{y})$ and its
Jacobian determinant
$\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )$.&lt;/p&gt;
&lt;p&gt;For numerical stability, the &lt;code&gt;Bijector&lt;/code&gt; API requires that this be defined in
log-space. Hence, it is useful to recall that the forward and inverse log
determinant Jacobians differ only in their signs&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;,&lt;/p&gt;
$$
\begin{align}
\log \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
&amp; = - \log \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right ),
\end{align}
$$&lt;p&gt;which gives us the option of implementing either (or both).
However, do note the following from the official
API docs:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Generally its preferable to directly implement the inverse Jacobian
determinant. This should have superior numerical stability and will often share
subgraphs with the &lt;code&gt;_inverse&lt;/code&gt; implementation.&lt;/p&gt;
&lt;/blockquote&gt;
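&lt;p&gt;As a quick sanity check of this sign relationship (a small sketch that is not part of the
original example, using the built-in &lt;code&gt;Exp&lt;/code&gt; bijector since our &lt;code&gt;Banana&lt;/code&gt; does not
implement these methods yet):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;exp = tfd.bijectors.Exp()
x = tf.constant([0.5, 1.0, 2.0])

fldj = exp.forward_log_det_jacobian(x, event_ndims=0)
ildj = exp.inverse_log_det_jacobian(exp.forward(x), event_ndims=0)

# fldj and -ildj agree elementwise; for Exp both equal x,
# since log |d/dx exp(x)| = x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;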
&lt;h3 id="inverse-transformation"&gt;Inverse Transformation&lt;/h3&gt;
&lt;p&gt;So let&amp;rsquo;s implement the inverse transform $G^{-1}$, which is given by:&lt;/p&gt;
$$
G^{-1}(\mathbf{y}) =
\begin{bmatrix}
y_1 \newline
y_2 + y_1^2 + 1 \newline
\end{bmatrix}
$$&lt;p&gt;We define this in the &lt;code&gt;_inverse&lt;/code&gt; function below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="jacobian-determinant"&gt;Jacobian determinant&lt;/h3&gt;
&lt;p&gt;Now we compute the log determinant of the Jacobian of the &lt;em&gt;inverse&lt;/em&gt;
transformation.
In this simple example, the transformation is &lt;em&gt;volume-preserving&lt;/em&gt;, meaning its
Jacobian determinant is equal to 1.&lt;/p&gt;
&lt;p&gt;This is easy to verify:&lt;/p&gt;
$$
\begin{align}
\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
&amp; = \mathrm{det}
\begin{pmatrix}
\frac{\partial}{\partial y_1} y_1 &amp; \frac{\partial}{\partial y_2} y_1 \newline
\frac{\partial}{\partial y_1} \left ( y_2 + y_1^2 + 1 \right ) &amp; \frac{\partial}{\partial y_2} \left ( y_2 + y_1^2 + 1 \right ) \newline
\end{pmatrix} \newline
&amp; = \mathrm{det}
\begin{pmatrix}
1 &amp; 0 \newline
2 y_1 &amp; 1 \newline
\end{pmatrix}
= 1
\end{align}
$$&lt;p&gt;Hence, the log determinant Jacobian is simply a tensor of zeros shaped like the
input &lt;code&gt;y&lt;/code&gt;, with the last &lt;code&gt;inverse_min_event_ndims=1&lt;/code&gt; dimensions dropped:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse_log_det_jacobian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since the log determinant Jacobian is constant, i.e. independent of the input,
we can just specify it for one input by setting the flag &lt;code&gt;is_constant_jacobian=True&lt;/code&gt;&lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;,
and the &lt;code&gt;Bijector&lt;/code&gt; class will handle the necessary shape inference for us.&lt;/p&gt;
&lt;p&gt;Putting it all together in the &lt;code&gt;Banana&lt;/code&gt; bijector subclass, we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bijectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bijector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;banana&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Banana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inverse_min_event_ndims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;is_constant_jacobian&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;y_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_tail&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_inverse_log_det_jacobian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, we can instantiate the distribution &lt;code&gt;p_y&lt;/code&gt; by calling
&lt;code&gt;tfd.TransformedDistribution&lt;/code&gt; as we did before, &lt;em&gt;et voilà&lt;/em&gt;:
we can now simply call &lt;code&gt;p_y.prob&lt;/code&gt; to evaluate the probability density function.&lt;/p&gt;
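&lt;p&gt;As a minimal sketch of how this might look (not taken from the original code; the grid
bounds and resolution below are purely illustrative):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

p_y = tfd.TransformedDistribution(distribution=p_x, bijector=Banana())

# uniformly-spaced grid over a region covering the banana-shaped support
y1, y2 = np.meshgrid(np.linspace(-4.0, 4.0, 200),
                     np.linspace(-8.0, 2.0, 200))
grid = np.dstack([y1, y2]).astype('float32')  # shape (200, 200, 2)

density = p_y.prob(grid)  # shape (200, 200); ready for a contour plot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;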
&lt;p&gt;Evaluating this on the same uniformly-spaced grid as before yields the following
equiprobability contour plot:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Banana density"
src="https://tiao.io/posts/building-probability-distributions-with-tensorflow-probability-bijector-api/banana_density.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 id="inline-bijector"&gt;Inline Bijector&lt;/h4&gt;
&lt;p&gt;Before we conclude, we note that instead of creating a subclass, one can also
opt for a more lightweight and functional approach by creating an &lt;code&gt;Inline&lt;/code&gt;
bijector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;banana&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bijectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;forward_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_forward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inverse_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_inverse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inverse_log_det_jacobian_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_inverse_log_det_jacobian&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inverse_min_event_ndims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;is_constant_jacobian&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;p_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tfd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransformedDistribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bijector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;banana&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;!-- ### Swiss roll distribution
$$
\begin{align}
y_1 &amp; = r \cos x_1 \newline
y_2 &amp; = r \sin x_1
\end{align}
$$
where
$$
r = a x_1 + b x_2
$$
for $a = \frac{2}{5}$ and $b = 1$
for $x_1$ in range 5 to 10 and $x_2 = 0$
### Pinwheel distribution --&gt;
&lt;h1 id="summary"&gt;Summary&lt;/h1&gt;
&lt;p&gt;In this post, we showed that using diffeomorphisms (mappings that are
differentiable and invertible), it is possible to transform standard distributions
into interesting and complicated distributions, while still being able to
compute their densities analytically.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Bijector&lt;/code&gt; API provides an interface that encapsulates the basic properties
of a diffeomorphism needed to transform a distribution. These are: the
forward transform itself, its inverse and the determinant of their Jacobians.&lt;/p&gt;
&lt;p&gt;Using this, &lt;code&gt;TransformedDistribution&lt;/code&gt; &lt;em&gt;automatically&lt;/em&gt; implements perhaps the two
most important methods of a probability distribution: sampling (&lt;code&gt;sample&lt;/code&gt;), and
density evaluation (&lt;code&gt;prob&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Needless to say, this is a very powerful combination.
Through the &lt;code&gt;Bijector&lt;/code&gt; API, the number of possible distributions that can be
implemented and used directly with other functionalities in the TensorFlow
Probability ecosystem effectively becomes &lt;em&gt;endless&lt;/em&gt;.&lt;/p&gt;
&lt;!-- And I haven't even mentioned the fact that you can easily *parameterize* and
*compose* `Bijector`s to implement *normalizing flows* such as the
*autoregressive flows*!
--&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{tiao2018bijector,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title = &amp;#34;{B}uilding {P}robability {D}istributions with the {T}ensor{F}low {P}robability {B}ijector {API}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author = &amp;#34;Tiao, Louis C&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; journal = &amp;#34;tiao.io&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year = &amp;#34;2018&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; url = &amp;#34;https://tiao.io/post/building-probability-distributions-with-tensorflow-probability-bijector-api/&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on social media!&lt;/p&gt;
&lt;h2 id="links--resources"&gt;Links &amp;amp; Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Try this out yourself in an interactive notebook.&lt;/li&gt;
&lt;li&gt;Paper: see footnote&lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Blog Post:
&lt;/li&gt;
&lt;li&gt;API Documentation:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Dillon, J.V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M. and Saurous, R.A., 2017. &lt;em&gt;TensorFlow Distributions.&lt;/em&gt;
.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Haario, H., Saksman, E., &amp;amp; Tamminen, J. (1999).
&lt;em&gt;Adaptive Proposal Distribution for Random Walk Metropolis Algorithm&lt;/em&gt;. &lt;em&gt;Computational Statistics&lt;/em&gt;, 14(3), 375-396.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;for the transformation to be a diffeomorphism, it also needs to be &lt;em&gt;smooth&lt;/em&gt;.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;we implement this for the general case of $K \geq 2$ dimensional inputs since this actually turns out to be easier and cleaner (a phenomenon known as
the inventor&amp;rsquo;s paradox).&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;this is a straightforward consequence of the inverse function theorem,
which says the matrix inverse of the Jacobian of $G$ is the Jacobian of
its inverse $G^{-1}$,
&lt;/p&gt;
$$
\frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) =
\left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1}
$$&lt;p&gt;
Taking the determinant of both sides, we get:
&lt;/p&gt;
$$
\begin{align}
\mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
&amp; = \mathrm{det} \left ( \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1} \right ) \newline
&amp; = \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right )^{-1}
\end{align}
$$&lt;p&gt;
as required.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;See the description of the &lt;code&gt;is_constant_jacobian&lt;/code&gt;
argument for further details.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Contributed Talk: Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/events/icml2018-tagdm/</link><pubDate>Sat, 14 Jul 2018 15:20:00 +0000</pubDate><guid>https://tiao.io/events/icml2018-tagdm/</guid><description/></item><item><title>Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/publications/cycle-bayes/</link><pubDate>Sun, 01 Jul 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/cycle-bayes/</guid><description/></item><item><title>A Tutorial on Variational Autoencoders with a Concise Keras Implementation</title><link>https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/</link><pubDate>Wed, 20 Apr 2016 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/</guid><description>&lt;p&gt;
Keras is awesome. It is a very well-designed library that clearly abides by
its guiding principles of modularity and extensibility, enabling us to
easily assemble powerful, complex models from primitive building blocks.
This has been demonstrated in numerous blog posts and tutorials, in particular,
the excellent tutorial &lt;em&gt;Building Autoencoders in Keras&lt;/em&gt;.
As the name suggests, that tutorial provides examples of how to implement
various kinds of autoencoders in Keras, including the variational autoencoder
(VAE)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Like all autoencoders, variational autoencoders are primarily used for
unsupervised learning of hidden representations.
However, they are fundamentally different to your usual neural network-based
autoencoder in that they approach the problem from a probabilistic perspective.
They specify a joint distribution over the observed and latent variables, and
approximate the intractable posterior conditional density over latent
variables with variational inference, using an &lt;em&gt;inference network&lt;/em&gt;
&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt; (or more classically, a &lt;em&gt;recognition model&lt;/em&gt;
&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;) to amortize the cost of inference.&lt;/p&gt;
&lt;p&gt;While the examples in the aforementioned tutorial do well to showcase the
versatility of Keras on a wide range of autoencoder model architectures,
the example covering the variational autoencoder doesn&amp;rsquo;t properly take
advantage of Keras&amp;rsquo; modular design, making it difficult to generalize and
extend in important ways. As we will see, it relies on implementing custom
layers and constructs that are restricted to a specific instance of
variational autoencoders. This is a shame because when combined, Keras&amp;rsquo;
building blocks are powerful enough to encapsulate most variants of the
variational autoencoder and more generally, recognition-generative model
combinations for which the generative model belongs to a large family of
&lt;em&gt;deep latent Gaussian models&lt;/em&gt; (DLGMs)&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;The goal of this post is to propose a clean and elegant alternative
implementation that takes better advantage of Keras&amp;rsquo; modular design.
It is not intended as a tutorial on variational autoencoders &lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;.
Rather, we study variational autoencoders as a special case of variational
inference in deep latent Gaussian models using inference networks, and
demonstrate how we can use Keras to implement them in a modular fashion such
that they can be easily adapted to approximate inference in tasks beyond
unsupervised learning, and with complicated (non-Gaussian) likelihoods.&lt;/p&gt;
&lt;p&gt;This first post will lay the groundwork for a series of future posts that
explore ways to extend this basic modular framework to implement the
cutting-edge methods proposed in the latest research, such as the normalizing
flows for building richer posterior approximations &lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;, importance
weighted autoencoders &lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;, the Gumbel-softmax trick for inference in
discrete latent variables &lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, and even the most recent GAN-based
density-ratio estimation techniques for likelihood-free inference
&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;h1 id="model-specification"&gt;Model specification&lt;/h1&gt;
&lt;p&gt;First, it is important to understand that the variational autoencoder is not,
in itself, a generative model.
Rather, the generative model is a component of the variational autoencoder and
is, in general, a deep latent Gaussian model.
In particular, let $\mathbf{x}$ be a local observed variable and
$\mathbf{z}$ its corresponding local latent variable, with joint
distribution&lt;/p&gt;
$$
p_{\theta}(\mathbf{x}, \mathbf{z})
= p_{\theta}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}).
$$&lt;p&gt;In Bayesian modelling, we assume the distribution of observed variables to be
governed by the latent variables. Latent variables are drawn from a prior
density $p(\mathbf{z})$ and related to the observations through the
likelihood $p_{\theta}(\mathbf{x} | \mathbf{z})$.
Deep latent Gaussian models (DLGMs) are a general class of models where the
observed variable is governed by a &lt;em&gt;hierarchy&lt;/em&gt; of latent variables, and the
latent variables at each level of the hierarchy are Gaussian &lt;em&gt;a priori&lt;/em&gt;
&lt;sup id="fnref1:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;In a typical instance of the variational autoencoder, we have only a single
layer of latent variables with a Normal prior distribution,&lt;/p&gt;
$$
p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}).
$$&lt;p&gt;Now, each local latent variable is related to its corresponding observation
through the likelihood $p_{\theta}(\mathbf{x} | \mathbf{z})$, which can
be viewed as a &lt;em&gt;probabilistic&lt;/em&gt; decoder. Given a hidden lower-dimensional
representation (or &amp;ldquo;code&amp;rdquo;) $\mathbf{z}$, it &amp;ldquo;decodes&amp;rdquo; it into a
&lt;em&gt;distribution&lt;/em&gt; over the observation $\mathbf{x}$.&lt;/p&gt;
&lt;h2 id="decoder"&gt;Decoder&lt;/h2&gt;
&lt;p&gt;In this example, we define $p_{\theta}(\mathbf{x} | \mathbf{z})$ to be a
multivariate Bernoulli whose probabilities are computed from $\mathbf{z}$ using
a fully-connected neural network with a single hidden layer,&lt;/p&gt;
$$
\begin{align*}
p_{\theta}(\mathbf{x} | \mathbf{z})
&amp; = \mathrm{Bern}( \sigma( \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2 ) ), \newline
\mathbf{h}
&amp; = h(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1),
\end{align*}
$$&lt;p&gt;where $\sigma$ is the logistic sigmoid function, $h$ is some non-linearity, and
the model parameters
$\theta = \{ \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2 \}$
consist of the weights and biases of this neural network.&lt;/p&gt;
&lt;p&gt;It is straightforward to implement this in Keras with the
&lt;code&gt;Sequential&lt;/code&gt; model API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can view a summary of the model parameters $\theta$ by calling
&lt;code&gt;decoder.summary()&lt;/code&gt;. Additionally, you can produce a high-level diagram of
the network architecture, and optionally the input and output shapes of each
layer using &lt;code&gt;plot_model&lt;/code&gt; from the
&lt;code&gt;keras.utils.vis_utils&lt;/code&gt; module. Although our architecture is about as
simple as it gets, it is included in the figure below as an example of what
the diagrams look like.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Decoder architecture"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/decoder.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
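&lt;p&gt;As a minimal sketch of those two calls (the output file name here is illustrative and
not taken from the original post):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras.utils.vis_utils import plot_model

# print a summary of the decoder's layers and parameter counts
decoder.summary()

# write an architecture diagram, annotated with layer input/output shapes
plot_model(decoder, to_file='decoder.png', show_shapes=True)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;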
&lt;p&gt;Note that by fixing $\mathbf{W}_1$, $\mathbf{b}_1$ and $h$ to be the identity
matrix, the zero vector, and the identity function, respectively (or
equivalently dropping the first &lt;code&gt;Dense&lt;/code&gt; layer in the snippet above
altogether), we recover &lt;em&gt;logistic factor analysis&lt;/em&gt;.
With similarly minor modifications, we can recover other members from the
family of DLGMs, which include &lt;em&gt;non-linear factor analysis&lt;/em&gt;,
&lt;em&gt;non-linear Gaussian belief networks&lt;/em&gt;, &lt;em&gt;sigmoid belief networks&lt;/em&gt;, and many
others &lt;sup id="fnref2:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
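&lt;p&gt;As a hedged illustration of that remark (a sketch, not code from the original post; the
name &lt;code&gt;lfa_decoder&lt;/code&gt; is made up), dropping the hidden layer leaves only a linear map
followed by the sigmoid:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# logistic factor analysis: the decoder above without its hidden layer
lfa_decoder = Sequential([
    Dense(original_dim, input_dim=latent_dim, activation='sigmoid')
])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;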
&lt;p&gt;Having specified how the probabilities are computed, we can now define the
negative log likelihood of a Bernoulli $- \log p_{\theta}(\mathbf{x}|\mathbf{z})$, which is in fact equivalent to the
binary cross-entropy loss:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# keras.losses.binary_crossentropy gives the mean&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# over the last axis. we require the sum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As we discuss later, this will not be the loss we ultimately minimize, but will
constitute the data-fitting term of our final loss.&lt;/p&gt;
&lt;p&gt;Note this is a valid definition of a
Keras loss function,
which is required to compile and optimize a model. It is a symbolic function
that returns a scalar for each data-point in &lt;code&gt;y_true&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt;.
In our example, &lt;code&gt;y_pred&lt;/code&gt; will be the output of our &lt;code&gt;decoder&lt;/code&gt; network, which
are the predicted probabilities, and &lt;code&gt;y_true&lt;/code&gt; will be the true probabilities.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-using-tensorflow-distributions-in-loss"&gt;Side note: Using TensorFlow Distributions in loss&lt;/h4&gt;
&lt;p&gt;If you are using the TensorFlow backend, you can directly use the (negative)
log probability of &lt;code&gt;Bernoulli&lt;/code&gt; from TensorFlow Distributions as a Keras
loss, as I demonstrate in another of my posts.&lt;/p&gt;
&lt;p&gt;Specifically we can define the loss as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is exactly equivalent to the previous definition, but does not call
&lt;code&gt;K.binary_crossentropy&lt;/code&gt; directly.&lt;/p&gt;
&lt;hr&gt;
&lt;h1 id="inference"&gt;Inference&lt;/h1&gt;
&lt;p&gt;Having specified the generative process, we would now like to perform inference
on the latent variables and model parameters $\mathbf{z}$ and $\theta$,
respectively.
In particular, our goal is to compute the posterior
$p_{\theta}(\mathbf{z} | \mathbf{x})$, the conditional density of the latent
variable $\mathbf{z}$ given observed variable $\mathbf{x}$.
Additionally, we wish to optimize the model parameters $\theta$ with respect to
the marginal likelihood $p_{\theta}(\mathbf{x})$.
Both depend on the marginal likelihood, whose calculation requires marginalizing
out the latent variables $\mathbf{z}$. In general, this is either computationally
intractable, requiring exponential time to compute, or analytically
intractable, meaning it cannot be evaluated in closed form. In our case, we suffer from
the latter intractability, since our Gaussian prior is not conjugate to the
Bernoulli likelihood.&lt;/p&gt;
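&lt;p&gt;For concreteness, the quantity in question is&lt;/p&gt;
$$
p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{x} | \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z},
$$
&lt;p&gt;an integral over the entire latent space, which in our setting has no closed form.&lt;/p&gt;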
&lt;p&gt;To circumvent this intractability we turn to &lt;em&gt;variational inference&lt;/em&gt;, which
formulates inference as an optimization problem. It seeks an approximate
posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ closest in Kullback-Leibler
(KL) divergence to the true posterior. More precisely, the approximate posterior
is parameterized by &lt;em&gt;variational parameters&lt;/em&gt; $\phi$, and we seek a setting
of these parameters that minimizes the aforementioned KL divergence,&lt;/p&gt;
$$
\phi^* = \mathrm{argmin}_{\phi}
\mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]
$$&lt;p&gt;With the luck we&amp;rsquo;ve had so far, it shouldn&amp;rsquo;t come as a surprise anymore that
&lt;em&gt;this too&lt;/em&gt; is intractable. It also depends on the log marginal likelihood,
whose intractability is the reason we appealed to approximate inference in the
first place. Instead, we &lt;em&gt;maximize&lt;/em&gt; an alternative objective function, the
&lt;em&gt;evidence lower bound&lt;/em&gt; (ELBO), which is expressed as&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(q)
&amp; =
\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [
\log p_{\theta}(\mathbf{x} | \mathbf{z}) +
\log p(\mathbf{z}) -
\log q_{\phi}(\mathbf{z} | \mathbf{x})
] \newline
&amp; =
\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}
[ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]
-\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ].
\end{align*}
$$&lt;p&gt;Importantly, the ELBO is a lower bound to the log marginal likelihood.
Therefore, maximizing it with respect to the model parameters $\theta$
approximately maximizes the log marginal likelihood.
Additionally, maximizing it with respect to variational parameters $\phi$ can
be shown to minimize
$\mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]$.
Also, it turns out that the KL divergence determines the tightness of the lower
bound, where we have equality iff the KL divergence is zero, which happens iff
$q_{\phi}(\mathbf{z} | \mathbf{x}) = p_{\theta}(\mathbf{z} | \mathbf{x})$.
Hence, simultaneously maximizing it with respect to $\theta$ and $\phi$ gets us
two birds with one stone.&lt;/p&gt;
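&lt;p&gt;The relationship underlying these claims can be summarized in a single identity,&lt;/p&gt;
$$
\log p_{\theta}(\mathbf{x}) =
\mathrm{ELBO}(q) +
\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]
\geq \mathrm{ELBO}(q),
$$
&lt;p&gt;with equality exactly when the KL term vanishes.&lt;/p&gt;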
&lt;p&gt;Next we discuss the form of the approximate posterior
$q_{\phi}(\mathbf{z} | \mathbf{x})$, which can be viewed as a
&lt;em&gt;probabilistic&lt;/em&gt; encoder. Its role is opposite to that of the decoder.
Given an observation $\mathbf{x}$, it &amp;ldquo;encodes&amp;rdquo; it into a &lt;em&gt;distribution&lt;/em&gt;
over its hidden lower-dimensional representations.&lt;/p&gt;
&lt;h2 id="encoder"&gt;Encoder&lt;/h2&gt;
&lt;p&gt;For each local observed variable $\mathbf{x}_n$, we wish to approximate
the true posterior distribution $p(\mathbf{z}_n|\mathbf{x}_n)$ over its
corresponding local latent variables $\mathbf{z}_n$. A common approach is to
approximate it using a &lt;em&gt;variational distribution&lt;/em&gt;
$q_{\lambda_n}(\mathbf{z}_n)$, specified as a diagonal
Gaussian, where the &lt;em&gt;local&lt;/em&gt; variational parameters
$\lambda_n = \{ \boldsymbol{\mu}_n, \boldsymbol{\sigma}_n \}$ are the mean and
standard deviation of this approximating distribution,
&lt;/p&gt;
$$
q_{\lambda_n}(\mathbf{z}_n) =
\mathcal{N}(
\mathbf{z}_n |
\boldsymbol{\mu}_n,
\mathrm{diag}(\boldsymbol{\sigma}_n^2)
).
$$&lt;p&gt;
This approach has a number of shortcomings. First, the number of local
variational parameters we need to optimize grows with the size of the dataset.
Second, a new set of local variational parameters need to be optimized for new
unseen test points. This is not to mention the strong factorization assumption
we make by specifying diagonal Gaussian distributions as the family of
approximations. The last is still an active area of research, and the first
two can be addressed by introducing a further approximation using an inference
network.&lt;/p&gt;
&lt;h3 id="inference-network"&gt;Inference network&lt;/h3&gt;
&lt;p&gt;We &lt;em&gt;amortize&lt;/em&gt; the cost of inference by introducing an &lt;em&gt;inference network&lt;/em&gt; which
approximates the local variational parameters $\lambda_n$ for a given local
observed variable $\mathbf{x}_n$.
For our approximating distribution in particular, given $\mathbf{x}_n$ the
inference network yields two vector-valued outputs $\boldsymbol{\mu}_{\phi}(\mathbf{x}_n)$ and
$\boldsymbol{\sigma}_{\phi}(\mathbf{x}_n)$, which we use to approximate its local
variational parameters $\boldsymbol{\mu}_n$ and $\boldsymbol{\sigma}_n$, respectively.
Our approximate posterior distribution now becomes
&lt;/p&gt;
$$
q_{\phi}(\mathbf{z}_n | \mathbf{x}_n) =
\mathcal{N}(
\mathbf{z}_n |
\boldsymbol{\mu}_{\phi}(\mathbf{x}_n),
\mathrm{diag}(\boldsymbol{\sigma}_{\phi}^2(\mathbf{x}_n))
).
$$&lt;p&gt;
Instead of learning &lt;em&gt;local&lt;/em&gt; variational parameters $\lambda_n$ for each data-point,
we now learn a fixed number of &lt;em&gt;global&lt;/em&gt; variational parameters $\phi$ which
constitute the parameters (i.e. weights) of the inference network.
Moreover, this approximation allows statistical strength to be shared across
observed data-points and also generalizes to unseen test points.&lt;/p&gt;
&lt;p&gt;We specify the mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and log variance
$\log \boldsymbol{\sigma}_{\phi}^2(\mathbf{x})$ of this distribution as the output of
an inference network. For this post, we keep the architecture of the network
simple, with only a single hidden layer and two fully-connected output layers.
Again, this is simple to define in Keras:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# input layer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# hidden layer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# output layer for mean and log variance&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since this network has multiple outputs, we couldn&amp;rsquo;t use the Sequential model
API as we did for the decoder. Instead, we resort to the more powerful
functional Model API, which allows us to implement complex models with shared
layers, multiple inputs, multiple outputs, and so on.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Inference network"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/inference_network.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Note that we output the log variance instead of the standard deviation because
this is not only more convenient to work with, but also helps with numerical
stability. However, we still require the standard deviation later. To recover
it, we simply implement the appropriate transformation and encapsulate it in a
&lt;code&gt;Lambda&lt;/code&gt; layer.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# normalize log variance to std dev&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Before moving on, we give a few words on nomenclature and context.
In the prelude and title of this section, we characterized the approximate
posterior distribution with an inference network as a probabilistic encoder
(analogously to its counterpart, the probabilistic decoder).
Although this is an accurate interpretation, it is a limited one.
Classically, inference networks are known as &lt;em&gt;recognition models&lt;/em&gt;, and have now
been used for decades in a wide variety of probabilistic methods.
When composed end-to-end, the recognition-generative model combination can be
seen as having an autoencoder structure. Indeed, this structure contains the
variational autoencoder as a special case, and also the now less fashionable
&lt;em&gt;Helmholtz machine&lt;/em&gt; &lt;sup id="fnref1:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;.
Even more generally, this recognition-generative model combination constitutes
a widely-applicable approach currently known as &lt;em&gt;amortized variational inference&lt;/em&gt;,
which can be used to perform approximate inference in models that lie beyond
even the large class of deep latent Gaussian models.&lt;/p&gt;
&lt;p&gt;Having specified all the ingredients necessary to carry out variational
inference (namely, the prior, likelihood and approximate posterior), we next
focus on finalizing the definition of the (negative) ELBO as our loss function
in Keras. As written earlier, the ELBO can be decomposed into two terms,
$\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]$
the expected log likelihood (ELL) over $q_{\phi}(\mathbf{z} | \mathbf{x})$,
and $- \mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ]$
the negative KL divergence between prior $p(\mathbf{z})$ and approximate
posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$. We first turn our attention
to the KL divergence term.&lt;/p&gt;
&lt;h3 id="kl-divergence"&gt;KL Divergence&lt;/h3&gt;
&lt;p&gt;Intuitively, maximizing the negative KL divergence term encourages approximate
posterior densities that place their mass on configurations of the latent
variables which are closest to the prior. Effectively, this regularizes the
complexity of the latent space. Now, since both the prior $p(\mathbf{z})$ and
approximate posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ are Gaussian,
the KL divergence can actually be calculated with the closed-form expression,&lt;/p&gt;
$$
\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ]
= - \frac{1}{2} \sum_{k=1}^K \{ 1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2 \}
$$&lt;p&gt;where $\mu_k$ and $\sigma_k$ are the $k$-th components of output vectors
$\mu_{\phi}(\mathbf{x})$ and $\sigma_{\phi}(\mathbf{x})$, respectively.
This is not too difficult to derive, and I would recommend verifying this as an
exercise. You can also find a derivation in the appendix of Kingma and Welling&amp;rsquo;s
(2014) paper &lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
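&lt;p&gt;To make the closed-form expression concrete, here is a minimal numerical
sanity check in plain NumPy (not part of the model code; the variable names and
values are ours) that compares it against a Monte Carlo estimate of the same
KL divergence.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])       # mean output by the inference network
log_var = np.array([0.1, -0.3])  # log variance output
sigma = np.exp(0.5 * log_var)

# closed-form KL[q || N(0, I)] for a diagonal Gaussian q
kl_closed_form = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo estimate: average of log q(z) - log p(z) over samples z ~ q
z = mu + sigma * rng.standard_normal(size=(100000, 2))
log_q = -0.5 * np.sum(np.log(2 * np.pi) + log_var + (z - mu)**2 / np.exp(log_var), axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_monte_carlo = np.mean(log_q - log_p)

print(kl_closed_form, kl_monte_carlo)  # the two values should agree closely
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;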
&lt;p&gt;Recall that earlier, we defined the expected log likelihood term of the ELBO as
a Keras loss. We were able to do this since the log likelihood is a function of
the network&amp;rsquo;s final output (the predicted probabilities), so it maps nicely to a
Keras loss. Unfortunately, the same does not apply for the KL divergence term,
which is a function of the network&amp;rsquo;s intermediate layer outputs, the mean &lt;code&gt;mu&lt;/code&gt;
and log variance &lt;code&gt;log_var&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We define an auxiliary custom layer
which takes &lt;code&gt;mu&lt;/code&gt; and &lt;code&gt;log_var&lt;/code&gt; as input and simply returns them as output
without modification. We do, however, explicitly introduce the side-effect of
calculating the KL divergence and adding it to a collection of losses, by
calling the method &lt;code&gt;add_loss&lt;/code&gt; &lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Identity transform layer that adds KL divergence
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; to the final model loss.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;kl_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_var&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kl_batch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next we feed &lt;code&gt;z_mu&lt;/code&gt; and &lt;code&gt;z_log_var&lt;/code&gt; through this layer (this needs to take
place before feeding &lt;code&gt;z_log_var&lt;/code&gt; through the Lambda layer to recover &lt;code&gt;z_sigma&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now when the Keras model is finally compiled, the collection of losses will be
aggregated and added to the specified Keras loss function to form the loss we
ultimately minimize. If we specify the loss as the negative log-likelihood we
defined earlier (&lt;code&gt;nll&lt;/code&gt;), we recover the negative ELBO as the final loss we
minimize, as intended.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-alternative-divergences"&gt;Side note: Alternative divergences&lt;/h4&gt;
&lt;p&gt;A key benefit of encapsulating the divergence in an auxiliary layer is that we
can easily implement and swap in other divergences, such as the
$\chi$-divergence or the $\alpha$-divergence.
Using alternative divergences for variational inference is an active research
topic &lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
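&lt;p&gt;As a minimal sketch of this idea (not taken from any particular reference
implementation), the divergence-specific logic can be factored out into a
function that is passed to a generic layer; swapping in another divergence then
amounts to passing a different function. The names &lt;code&gt;DivergenceLayer&lt;/code&gt; and
&lt;code&gt;divergence_fn&lt;/code&gt; are hypothetical.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras import backend as K
from keras.layers import Layer


def gaussian_kl(mu, log_var):
    # closed-form KL[q || N(0, I)] for a diagonal Gaussian q, per data-point
    return - .5 * K.sum(1 + log_var - K.square(mu) - K.exp(log_var), axis=-1)


class DivergenceLayer(Layer):
    # identity transform layer that adds a user-supplied divergence
    # penalty to the final model loss

    def __init__(self, divergence_fn=gaussian_kl, *args, **kwargs):
        self.is_placeholder = True
        self.divergence_fn = divergence_fn
        super(DivergenceLayer, self).__init__(*args, **kwargs)

    def call(self, inputs):
        mu, log_var = inputs
        self.add_loss(K.mean(self.divergence_fn(mu, log_var)), inputs=inputs)
        return inputs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using it is identical to the &lt;code&gt;KLDivergenceLayer&lt;/code&gt; above:
&lt;code&gt;z_mu, z_log_var = DivergenceLayer()([z_mu, z_log_var])&lt;/code&gt;.&lt;/p&gt;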
&lt;hr&gt;
&lt;h4 id="side-note-implicit-models-and-adversarial-learning"&gt;Side note: Implicit models and adversarial learning&lt;/h4&gt;
&lt;p&gt;Additionally, we could also extend the divergence layer to use an auxiliary
density ratio estimator function, instead of evaluating the KL divergence in
the analytical form above.
This relaxes the requirement on approximate posterior
$q_{\phi}(\mathbf{z}|\mathbf{x})$ (and incidentally also prior $p(\mathbf{z})$)
to yield tractable densities, at the cost of maximizing a cruder estimate of the
ELBO.
This is known as Adversarial Variational Bayes&lt;sup id="fnref1:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;, and is an
important line of recent research that, when taken to its logical conclusion,
can extend the applicability of variational inference to arbitrarily expressive
implicit probabilistic models with intractable likelihoods&lt;sup id="fnref1:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="reparameterization-using-merge-layers"&gt;Reparameterization using Merge Layers&lt;/h3&gt;
&lt;p&gt;To perform gradient-based optimization of the ELBO with respect to model parameters
$\theta$ and variational parameters $\phi$, we require its gradients with
respect to these parameters, which are generally intractable.
Currently, the dominant approach for circumventing this is by Monte Carlo (MC)
estimation of the gradients. The basic idea is to write the gradient of the
ELBO as an expectation of the gradient, approximate it with MC estimates, then
perform stochastic gradient descent with the repeated MC gradient estimates.&lt;/p&gt;
&lt;p&gt;There exist a number of estimators based on different variance reduction
techniques. However, MC gradient estimates based on the reparameterization trick,
known as the &lt;em&gt;reparameterization gradients&lt;/em&gt;, have been shown to have the lowest
variance among competing estimators for continuous latent variables&lt;sup id="fnref3:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.
The reparameterization trick is a straightforward change of variables that
expresses the random variable $\mathbf{z} \sim q_{\phi}(\mathbf{z} | \mathbf{x})$
as a deterministic transformation $g_{\phi}$ of another random variable
$\boldsymbol{\epsilon}$ and input $\mathbf{x}$, with parameters $\phi$,&lt;/p&gt;
$$
z = g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon}), \quad
\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}).
$$&lt;p&gt;Note that $p(\boldsymbol{\epsilon})$ is a simpler base distribution which is
parameter-free and independent of $\mathbf{x}$ or $\phi$.
To prevent clutter, we write the ELBO as an expectation of the function
$f(\mathbf{x}, \mathbf{z}) = \log p_{\theta}(\mathbf{x} , \mathbf{z}) -
\log q_{\phi}(\mathbf{z} | \mathbf{x})$ over distribution
$q_{\phi}(\mathbf{z} | \mathbf{x})$.
Now, for any function $f(\mathbf{x}, \mathbf{z})$, taking the gradient of the
expectation with respect to $\phi$, and substituting all occurrences of
$\mathbf{z}$ with $g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})$, we have&lt;/p&gt;
$$
\begin{align*}
\nabla_{\phi} \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}
[ f(\mathbf{x}, \mathbf{z}) ]
&amp; = \nabla_{\phi} \mathbb{E}_{p(\boldsymbol{\epsilon})}
[ f(\mathbf{x}, g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})) ] \newline
&amp; = \mathbb{E}_{p(\boldsymbol{\epsilon})}
[ \nabla_{\phi} f(\mathbf{x}, g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})) ].
\end{align*}
$$&lt;p&gt;In other words, this simple reparameterization allows the gradient and the
expectation to commute, thereby allowing us to compute unbiased stochastic
estimates of the ELBO gradients by drawing noise samples $\boldsymbol{\epsilon}$
from $p(\boldsymbol{\epsilon})$.&lt;/p&gt;
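&lt;p&gt;To see this in action on a toy example (a self-contained NumPy sketch, separate
from the model code), consider $f(z) = z^2$ with a one-dimensional Gaussian
$q(z) = \mathcal{N}(z | \mu, \sigma^2)$, for which the analytic gradient is
$\nabla_{\mu} \mathbb{E}_q[z^2] = 2 \mu$. The reparameterization gradient
estimate recovers this value.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

rng = np.random.default_rng(42)

mu, sigma = 0.7, 1.3  # variational parameters of q(z) = N(mu, sigma^2)

eps = rng.standard_normal(100000)  # eps ~ N(0, 1), parameter-free
z = mu + sigma * eps               # reparameterized samples z ~ q(z)

# gradient of f(g(mu, eps)) = (mu + sigma * eps)^2 w.r.t. mu is 2 * z,
# so averaging over the noise samples gives an unbiased estimate of 2 * mu
grad_estimate = np.mean(2 * z)

print(grad_estimate)  # close to the analytic gradient 2 * mu = 1.4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;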
&lt;hr&gt;
&lt;p&gt;To recover the diagonal Gaussian approximation we specified earlier
$q_{\phi}(\mathbf{z}_n | \mathbf{x}_n) = \mathcal{N}(\mathbf{z}_n |
\boldsymbol{\mu}_{\phi}(\mathbf{x}_n), \mathrm{diag}(\boldsymbol{\sigma}_{\phi}^2(\mathbf{x}_n)))$,
we draw noise from the Normal base distribution, and specify a simple
location-scale transformation&lt;/p&gt;
$$
\mathbf{z}
= g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})
= \boldsymbol{\mu}_{\phi}(\mathbf{x}) +
\boldsymbol{\sigma}_{\phi}(\mathbf{x}) \odot
\boldsymbol{\epsilon}, \quad
\boldsymbol{\epsilon}
\sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
$$&lt;p&gt;where $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and $\boldsymbol{\sigma}_{\phi}(\mathbf{x})$ are the outputs
of the inference network defined earlier with parameters $\phi$, and $\odot$
denotes the elementwise product. In Keras, we explicitly make the noise vector
an input to the model by defining an Input layer for it. We then implement the
above location-scale transformation using merge layers,
namely &lt;code&gt;Add&lt;/code&gt; and &lt;code&gt;Multiply&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Reparameterization with simple location-scale transformation using Keras merge layers.
"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/reparameterization.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-monte-carlo-sample-size"&gt;Side note: Monte Carlo sample size&lt;/h4&gt;
&lt;p&gt;Note that both the inputs for observed variables and noise (&lt;code&gt;x&lt;/code&gt; and &lt;code&gt;eps&lt;/code&gt;) need to be
specified explicitly as inputs to our final model.
Furthermore, the size of their first dimension (i.e. the batch size) is required
to be the same.
This corresponds to using exactly one Monte Carlo sample to approximate the
expected log likelihood, drawing a single sample $\mathbf{z}_n$ from
$q_{\phi}(\mathbf{z}_n | \mathbf{x}_n)$ for each data-point $\mathbf{x}_n$ in
the batch. Although you might find an MC sample size of 1 surprisingly small,
it is actually adequate for a sufficiently large batch size (~100) &lt;sup id="fnref2:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.
In a follow-up post,
I demonstrate how to extend our approach to support larger MC sample sizes using
just a few minor tweaks. This extension is crucial for implementing the
&lt;em&gt;importance weighted autoencoder&lt;/em&gt; &lt;sup id="fnref1:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now, since the noise input is drawn from the Normal distribution, we can save
ourselves from having to feed in values for this input from outside the
computation graph by binding a tensor to this Input layer. Specifically, we
bind a tensor created using &lt;code&gt;K.random_normal&lt;/code&gt; with the required shape,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;While &lt;code&gt;eps&lt;/code&gt; still needs to be explicitly specified as an input to compile the
model, values for this input will no longer be expected by methods such as
&lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;predict&lt;/code&gt;. Instead, samples from this distribution will be lazily
generated inside the computation graph when required. See my notes on this for
more details.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Encoder architecture."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/encoder.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In the reference Keras implementation, all of this logic is encapsulated in a single
&lt;code&gt;Lambda&lt;/code&gt; layer, which simultaneously draws samples from a hard-coded base
distribution and also performs the location-scale transformation.
In contrast, this approach achieves a better level of modularity and
reusability.
By decoupling the random noise vector from the layer&amp;rsquo;s internal logic and
explicitly making it a model input, we emphasize the fact that all sources of
stochasticity emanate from this input. It thereby becomes clear that a random
sample drawn from a particular approximating distribution is obtained by feeding
this source of stochasticity through a number of successive deterministic
transformations.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-notes-gumbel-softmax-trick-for-discrete-latent-variables"&gt;Side notes: Gumbel-softmax trick for discrete latent variables&lt;/h4&gt;
&lt;p&gt;As an example, we could provide samples drawn from the Uniform distribution
as noise input. By applying a number of deterministic transformations that
constitute the &lt;em&gt;Gumbel-softmax reparameterization trick&lt;/em&gt; &lt;sup id="fnref1:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, we
are able to obtain samples from the Categorical distribution. This allows us
to perform approximate inference on &lt;em&gt;discrete&lt;/em&gt; latent variables, and can be
implemented in this framework by adding a dozen or so lines of code!&lt;/p&gt;
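&lt;p&gt;Here is a rough sketch of what those lines might look like in the style of
this post. It reuses the &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;h&lt;/code&gt; tensors defined earlier, while
&lt;code&gt;num_classes&lt;/code&gt; and &lt;code&gt;tau&lt;/code&gt; are illustrative values of our own choosing; this is
not the exact implementation from the referenced paper.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from keras import backend as K
from keras.layers import Input, Dense, Lambda, Add

num_classes = 10  # number of categories of the discrete latent variable
tau = 0.5         # softmax temperature

# unnormalized log-probabilities (logits) of the categorical latent variable
z_logits = Dense(num_classes)(h)

# uniform noise bound to an Input layer, as we did with the Gaussian noise
u = Input(tensor=K.random_uniform(shape=(K.shape(x)[0], num_classes)))

# transform uniform noise into Gumbel noise: g = -log(-log(u))
g = Lambda(lambda t: -K.log(-K.log(t + 1e-20) + 1e-20))(u)

# Gumbel-softmax sample: softmax((logits + Gumbel noise) / temperature)
z_soft = Lambda(lambda t: K.softmax(t / tau))(Add()([z_logits, g]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;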
&lt;h1 id="putting-it-all-together"&gt;Putting it all together&lt;/h1&gt;
&lt;p&gt;So far, we&amp;rsquo;ve dissected the variational autoencoder into modular components and
discussed the role and implementation of each one at some length.
Now let&amp;rsquo;s compose these components together end-to-end to form the final
autoencoder architecture.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&amp;rsquo;s surprisingly concise, taking up around 20 lines of code.
The diagram of the full model architecture is visualized below.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Variational autoencoder architecture."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Finally, we specify and compile the model, using the negative log likelihood
&lt;code&gt;nll&lt;/code&gt; defined earlier as the loss.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1 id="model-fitting"&gt;Model fitting&lt;/h1&gt;
&lt;h2 id="dataset-mnist-digits"&gt;Dataset: MNIST digits&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Variational autoencoder architecture for the MNIST digits dataset."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full_shapes.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="loss-nelbo-convergence"&gt;Loss (NELBO) convergence&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt=""
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/nelbo.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="model-evaluation"&gt;Model evaluation&lt;/h1&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D plot of the digit classes in the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_8bb4eb676623e380.webp 320w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_64a98c5233df2a9d.webp 480w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_f97f8af14c434d9d.webp 600w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_8bb4eb676623e380.webp"
width="600"
height="500"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D manifold of the digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="c1"&gt;# figure with 15x15 digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;digit_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# linearly spaced coordinates on the unit square were transformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# through the inverse CDF (ppf) of the Gaussian to produce values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# of the latent variables z, since the prior of the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# is Gaussian&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_pred_grid&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e23f379b58eda1c7.webp 320w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e0d7ef0dff27fb2e.webp 480w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_8a0316a94df89cca.webp 500w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e23f379b58eda1c7.webp"
width="500"
height="500"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="recap"&gt;Recap&lt;/h1&gt;
&lt;p&gt;In this post, we covered the basics of amortized variational inference, looking
at variational autoencoders as a specific example. In particular, we&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implemented the decoder and encoder using the Sequential and functional Model
APIs, respectively.&lt;/li&gt;
&lt;li&gt;Augmented the final loss with the KL divergence term by writing an auxiliary
custom layer.&lt;/li&gt;
&lt;li&gt;Worked with the log variance for numerical stability, and used a &lt;code&gt;Lambda&lt;/code&gt; layer
to transform it to the standard deviation when necessary.&lt;/li&gt;
&lt;li&gt;Explicitly made the noise an Input layer, and implemented the
reparameterization trick using merge layers (&lt;code&gt;Add&lt;/code&gt; and &lt;code&gt;Multiply&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Bound a random noise tensor to the noise Input layer,
so random samples are generated &lt;em&gt;within&lt;/em&gt; the computation graph.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="whats-next"&gt;What&amp;rsquo;s next&lt;/h1&gt;
&lt;p&gt;Next, we will extend the divergence layer to use an auxiliary density ratio
estimator function, instead of evaluating the KL divergence in the analytical
form above.
This relaxes the requirement on approximate posterior
$q_{\phi}(\mathbf{z}|\mathbf{x})$ (and incidentally also prior $p(\mathbf{z})$)
to yield tractable densities, at the cost of maximizing a cruder estimate of the
ELBO.
This is known as Adversarial Variational Bayes&lt;sup id="fnref2:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;, and is an
important line of recent research that, when taken to its logical conclusion,
can extend the applicability of variational inference to arbitrarily expressive
implicit probabilistic models with intractable likelihoods&lt;sup id="fnref2:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tiao2017vae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;{A} {T}utorial on {V}ariational {A}utoencoders with a {C}oncise {K}eras {I}mplementation&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Tiao, Louis C&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;journal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tiao.io&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;2017&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on
and
!&lt;/p&gt;
&lt;h2 id="links--resources"&gt;Links &amp;amp; Resources&lt;/h2&gt;
&lt;p&gt;Below, you can find:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The
used to generate the diagrams and plots in this post.&lt;/li&gt;
&lt;li&gt;The above snippets combined in a single executable Python file:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;784&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;epsilon_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# keras.losses.binary_crossentropy gives the mean&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# over the last axis. we require the sum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Identity transform layer that adds KL divergence
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; to the final model loss.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;kl_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_var&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kl_batch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stddev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epsilon_std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# train the VAE on MNIST digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D plot of the digit classes in the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D manifold of the digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="c1"&gt;# figure with 15x15 digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;digit_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# linearly spaced coordinates on the unit square were transformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# through the inverse CDF (ppf) of the Gaussian to produce values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# of the latent variables z, since the prior of the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# is Gaussian&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;u_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_decoded&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_decoded&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;D. P. Kingma and M. Welling, &amp;ldquo;Auto-Encoding Variational Bayes,&amp;rdquo; in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;
&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Section &amp;ldquo;Recognition models and amortised inference&amp;rdquo; in
&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Dayan, P., Hinton, G. E., Neal, R. M., &amp;amp; Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.
&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Rezende, D. J., Mohamed, S., &amp;amp; Wierstra, D. (2014). &amp;ldquo;Stochastic backpropagation and approximate inference in deep generative models,&amp;rdquo; in Proceedings of The 31st International Conference on Machine Learning, 2014, (Vol. 32, pp. 1278–1286). Bejing, China: PMLR.
&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref3:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;For a complete treatment of variational autoencoders, and variational
inference in general, I highly recommend:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jaan Altosaar&amp;rsquo;s blog post,
&amp;ldquo;What is a Variational Autoencoder?&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Diederik P. Kingma&amp;rsquo;s PhD Thesis,
&amp;ldquo;Variational Inference and Deep Learning: A New Synthesis&amp;rdquo;.&lt;/li&gt;
&lt;/ul&gt;
&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;D. Rezende and S. Mohamed, &amp;ldquo;Variational Inference with Normalizing Flows,&amp;rdquo; in Proceedings of the 32nd International Conference on Machine Learning, 2015, vol. 37, pp. 1530–1538.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Y. Burda, R. Grosse, and R. Salakhutdinov, &amp;ldquo;Importance Weighted Autoencoders,&amp;rdquo; in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;E. Jang, S. Gu, and B. Poole, &amp;ldquo;Categorical Reparameterization with Gumbel-Softmax,&amp;rdquo; Nov. 2016. in Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;L. Mescheder, S. Nowozin, and A. Geiger, &amp;ldquo;Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks,&amp;rdquo; in Proceedings of the 34th International Conference on Machine Learning, 2017, vol. 70, pp. 2391–2400.&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;D. Tran, R. Ranganath, and D. Blei, &amp;ldquo;Hierarchical Implicit Models and Likelihood-Free Variational Inference,&amp;rdquo; in Advances in Neural Information Processing Systems 30, 2017.&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;To support sample weighting (fine-tuning how much each data-point
contributes to the loss), Keras losses are expected to return a scalar for each
data-point in the batch. In contrast, losses appended with the &lt;code&gt;add_loss&lt;/code&gt;
method don&amp;rsquo;t support this, and are expected to be a single scalar.
Hence, we calculate the KL divergence for all data-points in the batch and
take the mean before passing it to &lt;code&gt;add_loss&lt;/code&gt;.&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Y. Li and R. E. Turner, &amp;ldquo;Rényi Divergence Variational Inference,&amp;rdquo; in Advances in Neural Information Processing Systems 29, 2016.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei, &amp;ldquo;Variational Inference via χ Upper Bound Minimization,&amp;rdquo; in Advances in Neural Information Processing Systems 30, 2017.&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>