<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Publications |</title><link>https://tiao.io/publications/</link><atom:link href="https://tiao.io/publications/index.xml" rel="self" type="application/rss+xml"/><description>Publications</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 01 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://tiao.io/media/icon_hu_9c2a75fde2335590.png</url><title>Publications</title><link>https://tiao.io/publications/</link></image><item><title>Empirical Gaussian Processes</title><link>https://tiao.io/publications/empirical-gaussian-processes/</link><pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/empirical-gaussian-processes/</guid><description/></item><item><title>Ax: A Platform for Adaptive Experimentation</title><link>https://tiao.io/publications/ax-platform/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/ax-platform/</guid><description/></item><item><title>Probabilistic Machine Learning in the Age of Deep Learning: New Perspectives for Gaussian Processes, Bayesian Optimization and Beyond (PhD Thesis)</title><link>https://tiao.io/publications/phd-thesis/</link><pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/phd-thesis/</guid><description>&lt;p&gt;The full text is available as a single PDF file &lt;a href="phd-thesis-louis-tiao.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can also find the table of contents, with a PDF for each individual chapter, below:&lt;/p&gt;
&lt;h3 id="table-of-contents"&gt;Table of Contents&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chapter 1: Introduction &lt;a href="contents/1 Introduction.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 2: Background &lt;a href="contents/2 Background.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 3: Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features &lt;a href="contents/3 Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 4: Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation &lt;a href="contents/4 Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 5: Bayesian Optimisation by Classification with Deep Learning and Beyond &lt;a href="contents/5 Bayesian Optimisation by Classification with Deep Learning and Beyond.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 6: Conclusion &lt;a href="contents/6 Conclusion.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Appendix A: Numerical Methods for Improved Decoupled Sampling of Gaussian Processes &lt;a href="contents/A Numerical Methods for Improved Decoupled Sampling of Gaussian Processes.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bibliography &lt;a href="contents/Bibliography.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Please find &lt;em&gt;Chapter 1: Introduction&lt;/em&gt; reproduced in full below:&lt;/p&gt;
&lt;h3 id="introduction"&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Artificial intelligence (AI) stands poised to be among the most disruptive technologies of our era. The breakneck pace of recent AI advancements has been spearheaded by machine learning (ML), particularly the resurgence of &lt;em&gt;deep learning&lt;/em&gt;. Deep learning is as old as the first general-purpose electronic computer, with roots tracing back to the 1940s and ’50s. Its revival, beginning in the early 2010s, was catalysed by a series of breakthroughs that shattered previously perceived limitations and captivated the collective imagination. These breakthroughs span various domains, including computer vision, speech recognition, natural language processing, protein folding, generative art and artificial creativity, as well as reinforcement learning for robotics control and superhuman-level gameplay.&lt;/p&gt;
&lt;p&gt;Nevertheless, it is crucial to view these developments as means to an ultimate end rather than an end in themselves. Arguably, the true pinnacle of AI’s capabilities lies in optimal &lt;em&gt;decision-making&lt;/em&gt;, whether that entails offering analyses and insights to aid humans in making better decisions or automating the decision-making process altogether. Practically any task directed towards a well-defined objective can be boiled down to a cascade of decisions. At a fundamental level, operating a vehicle involves a continuous stream of decisions involving accelerating, braking, and turning. Financial trading revolves around decisions to buy, sell, or hold various assets. Even complex engineering tasks, such as designing an aerofoil, involve a sequence of decisions about adjusting design variables to achieve desirable aerodynamic characteristics.&lt;/p&gt;
&lt;p&gt;Yet, the intricacies of decision-making surpass what any single advancement in deep learning can address. While convolutional neural networks (CNNs) can facilitate object detection tasks in autonomous vehicles, recurrent neural networks (RNNs) can aid in forecasting market dynamics for systematic trading, and physics-informed NNs can assist in predicting aerodynamic effects, it remains the case that no target or quantity of interest can be entirely known or predictable (indeed, if they were, the pursuit of predictive modelling and ML would be superfluous). Instead, predictions often prove unreliable, or at best, &lt;em&gt;uncertain&lt;/em&gt;, due to the limitations of our knowledge and the complexity and variability inherent in the underlying real-world processes. The impressive power of deep learning models often overshadows their ignorance of the limits of their own knowledge and the extent of uncertainty in their predictions. When these predictions are integrated into a sequential decision-making framework, such uncertainty can amplify, compound, and lead to catastrophic consequences. In the context of aeronautical engineering, this could result in inefficient designs; in quantitative finance, it can lead to devastating capital losses; and in autonomous driving, it can even cost lives.&lt;/p&gt;
&lt;h4 id="probabilistic-machine-learning"&gt;Probabilistic Machine Learning&lt;/h4&gt;
&lt;p&gt;Grounded in the laws of probability and Bayesian statistics, &lt;em&gt;probabilistic&lt;/em&gt; ML provides a consistent framework for systematically reasoning about the unknown. The probabilistic approach to ML acknowledges that the real world is fraught with uncertainty and embraces this uncertainty as an inherent part of decision-making. Unlike traditional methods, including those of deep learning, it recognises model predictions not as absolute truths that can be represented as single &lt;em&gt;point estimates&lt;/em&gt; produced from a deterministic mapping, but as full &lt;em&gt;probability distributions&lt;/em&gt; that capture the potential outcomes of a random variable as it propagates through some underlying data-generating process. In a &lt;em&gt;probabilistic model&lt;/em&gt;, all quantities are treated as random variables governed by probability distributions: the data are treated as observed variables, which are influenced by some underlying hidden variables, e.g., the model parameters. A prior distribution is used to express reasonable values for these hidden variables and to eliminate implausible ones. The relationship between observed and hidden variables is described by the likelihood, and the process of Bayesian inference amounts to calculating, using the basic laws of probability, a posterior distribution over the hidden variables conditioned on the observed data, which can be seen as a refinement of the prior beliefs in light of new evidence. While the posterior distribution can be useful in and of itself, its primary role lies in facilitating subsequent prediction and decision-making by providing full probability distributions over predicted outcomes. This capability allows the decision-maker to assess the range of possible scenarios and their associated probabilities, enabling a more nuanced understanding of uncertainty and risk, which is indispensable in complex, dynamic environments where the repercussions of incorrect decisions can be severe. In essence, probabilistic ML equips autonomous decision-making systems with a probabilistic worldview, enabling them to navigate ambiguity and make sound decisions in the face of imperfect information.&lt;/p&gt;
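This prior-to-posterior update can be made concrete in the simplest conjugate setting. The following sketch is a hypothetical illustration, not drawn from the thesis: it infers a coin's bias from a handful of flips, producing a full posterior distribution rather than a point estimate.

```python
from scipy import stats

# Prior: Beta(2, 2) encodes a weak belief that the coin is roughly fair.
alpha, beta = 2.0, 2.0

# Observed data: 8 heads and 2 tails out of 10 flips.
heads, tails = 8, 2

# Conjugacy gives the posterior in closed form: Beta(alpha + heads, beta + tails).
posterior = stats.beta(alpha + heads, beta + tails)

# The posterior is a full distribution, not a point estimate, so we can report
# a central value together with a credible interval quantifying uncertainty.
print(posterior.mean())          # posterior mean of the coin's bias
print(posterior.interval(0.95))  # 95% credible interval
```

The credible interval is exactly the kind of quantified uncertainty that downstream decision-making can consume.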
&lt;h4 id="probabilistic-ml-vs-deep-learning"&gt;Probabilistic ML vs. Deep Learning&lt;/h4&gt;
&lt;p&gt;While deep learning has dominated recent AI advances, probabilistic ML remains as important as ever and continues to offer valuable tools for addressing AI challenges that cannot be fully resolved by deep learning alone. Although both approaches can be combined to create hybrid methods that leverage their respective strengths, some defining characteristics have traditionally set deep learning apart from probabilistic ML. Perhaps most notably, probabilistic ML approaches can achieve remarkable predictive performance even when data is scarce. In contrast, deep learning models tend to be data-intensive by nature, often demanding datasets of a scale proportional to their size (i.e., their parameter count), which has seen explosive growth in recent years. That said, inference in many probabilistic models poses computational problems that are difficult to scale. Deep learning approaches, on the other hand, have excelled in scalability, a key factor contributing to their widespread success. This scalability is bolstered by their compatibility with various speed-enhancing mechanisms such as stochastic optimisation, specialised hardware accelerators (GPUs and TPUs), and distributed and/or cloud-based computing infrastructure. To bridge this gap, substantial research effort has been devoted to enabling probabilistic ML to benefit from these advantages through optimisation-based approximations to Bayesian inference.&lt;/p&gt;
&lt;p&gt;Moreover, as mentioned earlier, these paradigms are by no means mutually exclusive. Indeed, it is often possible to directly extend existing models with a Bayesian treatment of their parameters, adding a layer of probabilistic reasoning to the model and allowing it not only to make predictions but also to estimate the uncertainty associated with those predictions. An excellent example is the Bayesian neural network (BNN), which treats the weights as hidden variables and leverages posterior inference to provide predictions while estimating associated uncertainties, delivering a more robust and principled approach to deep learning.&lt;/p&gt;
&lt;p&gt;The Bayesian formalism naturally gives rise to many popular methods and paradigms, often in the form of point estimates or other kinds of approximations. The quintessential example is found in linear regression, in particular in ridge and lasso regression, which correspond to maximum &lt;em&gt;a posteriori&lt;/em&gt; (MAP) estimates in Bayesian linear regression (BLR) models under prior distributions with different sparsity-inducing characteristics. More broadly, mitigations against over-fitting tend to arise organically in Bayesian methods, which is why they are frequently characterised as being fundamentally more robust against over-fitting. Likewise, the once &lt;em&gt;à la mode&lt;/em&gt; support vector machines (SVMs) can be seen as MAP estimates for a class of nonparametric Bayesian models, dropout in NNs can be seen as a variational approximation to exact inference in BNNs, and unsupervised learning methods such as factor analysis (FA) and principal component analysis (PCA) are instances of a class of latent variable models (LVMs) known as linear-Gaussian factor models, to name just a few examples. Time and again, classical approaches have not only benefitted from being viewed through the Bayesian perspective but have also been enriched and redefined by the depth of insights this framework provides.&lt;/p&gt;
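The correspondence between ridge regression and MAP estimation in BLR can be checked numerically. In this illustrative sketch (the data and variable names are our own, not the thesis's), an isotropic Gaussian prior with variance prior_var induces a ridge penalty equal to noise_var / prior_var.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

noise_var, prior_var = 0.1 ** 2, 1.0
lam = noise_var / prior_var  # ridge penalty implied by the two variances

# Ridge regression: minimise squared error plus lam times the squared weight norm.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian linear regression with prior w ~ N(0, prior_var * I): the posterior
# over w is Gaussian, so its mean coincides with the MAP estimate.
S_inv = X.T @ X / noise_var + np.eye(3) / prior_var  # posterior precision
w_map = np.linalg.solve(S_inv, X.T @ y / noise_var)  # posterior mean (= MAP)

assert np.allclose(w_ridge, w_map)  # identical up to numerical precision
```

Multiplying the posterior-mean linear system through by noise_var recovers the ridge normal equations exactly, which is why the assertion holds.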
&lt;h3 id="thesis-goals"&gt;Thesis Goals&lt;/h3&gt;
&lt;p&gt;The overarching goal of this thesis is to continue advancing the integration and cross-pollination between deep learning and probabilistic ML. We aim to further the interplay between these two fields, both by incorporating probabilistic interpretations and uncertainty quantification into popular deep learning frameworks, and by leveraging the representational power of deep NNs to improve established Bayesian methods. This dual-pronged approach provides fresh perspectives and taps the complementary strengths of both paradigms, advancing the foundations of AI and facilitating the development of more capable and dependable decision support frameworks. Ultimately, we strive to unlock the potential of deep learning within high-impact probabilistic ML methodologies, and to lend useful Bayesian perspectives to current deep learning techniques.&lt;/p&gt;
&lt;h4 id="gaussian-process-models"&gt;Gaussian Process Models&lt;/h4&gt;
&lt;p&gt;Arguably, no family of probabilistic models embodies the ethos of probabilistic ML and illustrates its nuances and parallels with deep learning quite like the Gaussian process (GP). Accordingly, GPs shall occupy a prominent place in our thesis. In particular, GPs stand out as the ideal choice when dealing with limited data, offer the flexibility to encode prior beliefs through the covariance function, and provide predictive uncertainty estimates with a fine calibration that is second to none. Conversely, they are challenging to scale to large datasets, a limitation that has spurred extensive research and development efforts. Furthermore, in contrast to deep learning models, which are often lauded for their ability to automatically uncover valuable patterns and features in data, GPs have at times been dismissed as unsophisticated smoothing mechanisms. Despite these apparent disparities, GPs are intricately connected to NNs in numerous ways. Among these, one of the most classical and well-known relationships is the convergence of single-layer NNs with randomly initialised weights toward GPs in the infinite-width limit. Similar links have also been identified between GPs and infinitely wide &lt;em&gt;deep&lt;/em&gt; NNs.&lt;/p&gt;
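The hallmark GP properties mentioned above, closed-form posteriors and calibrated predictive uncertainty, can be seen in a few lines of linear algebra. This is a minimal sketch of exact GP regression; the kernel, inputs, and jitter value are illustrative assumptions, not settings from the thesis.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential covariance: k(a, b) = exp(-(a - b)^2 / (2 l^2)).
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale ** 2)

X = np.array([-2.0, -1.0, 0.0, 1.5])  # training inputs
y = np.sin(X)                          # noise-free observations
Xs = np.linspace(-3.0, 3.0, 7)         # test inputs
jitter = 1e-6                          # for numerical stability

K = rbf(X, X) + jitter * np.eye(len(X))
Ks = rbf(X, Xs)
Kss = rbf(Xs, Xs)

# Standard GP posterior equations via a Cholesky factorisation of K.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks.T @ alpha                    # posterior predictive mean
V = np.linalg.solve(L, Ks)
var = np.diag(Kss - V.T @ V)           # posterior predictive variance

# Uncertainty collapses at the training inputs and grows away from the data.
print(np.round(mean, 3))
print(np.round(var, 3))
```

The cubic cost of the Cholesky factorisation in the number of training points is precisely the scaling bottleneck that sparse approximations target.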
&lt;p&gt;In an effort to elevate the representational capabilities of GPs to a level comparable with deep NNs, deep Gaussian processes (DGPs) stack together multiple layers of GPs. Additional efforts to construct efficient sparse GP approximations have leveraged the advantageous properties of computations on the hypersphere, which has led to DGP models in which the propagation of posterior predictive means is equivalent to a forward pass through a deep NN. Notably, as a side effect, this model effectively provides uncertainty estimates for deep NNs through its predictive variance. Among the contributions of our thesis is the further development of this framework, integrating cutting-edge techniques to address some of its practical limitations, thereby narrowing the performance gap between GPs and deep NNs.&lt;/p&gt;
&lt;p&gt;Probabilistic models, serving a crucial role as decision support tools, routinely aid scientific discovery in fields such as physics and astronomy, guiding advancements in areas of medicine and healthcare encompassing bioinformatics, epidemiology, and medical diagnosis. Beyond that, these models have wide-ranging applications in economics, econometrics, and the social sciences. Moreover, they are indispensable in various engineering disciplines, such as robotics and environmental engineering. Among the many probabilistic models, GPs stand out as a powerful driving force behind a number of important sequential decision-making frameworks, including active learning and reinforcement learning, and the broader area of probabilistic numerics at large. Notably, Bayesian optimisation (BO) is one major area that relies heavily on GPs and will feature extensively in our thesis.&lt;/p&gt;
&lt;h4 id="bayesian-optimisation"&gt;Bayesian optimisation&lt;/h4&gt;
&lt;p&gt;BO is a powerful methodology dedicated to the global optimisation of complex and resource-intensive objective functions. In contrast to classical optimisation methods, BO excels even when dealing with functions that defy strong assumptions or guarantees: functions that may not be convex, possess no gradients, lack a well-defined mathematical form, and are observable only indirectly through noisy measurements.&lt;/p&gt;
&lt;p&gt;At its core, BO is a sequential decision-making algorithm. It relies on observations from past function evaluations to determine the next candidate location for evaluation in pursuit of optimal solutions. BO leverages a probabilistic model, often a GP, to represent its knowledge and beliefs about the unknown function. This model is continuously updated with the acquisition of each new observation, enabling the algorithm to adapt its behaviour and make sound decisions based on the evolving information.&lt;/p&gt;
&lt;p&gt;BO effectively manages the uncertainty inherent in such sequential decision-making processes by making use of the probabilistic model to the fullest: it harnesses the entire predictive distribution, in particular the predictive uncertainty, to select promising candidate solutions that bring the most value to the optimisation process. These generally include not merely the candidates most likely to optimise the objective function (i.e., &lt;em&gt;exploiting&lt;/em&gt; that which is known), but also those likely to reveal the most knowledge and information about the function itself (i.e., &lt;em&gt;exploring&lt;/em&gt; that which remains unknown).&lt;/p&gt;
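The loop described above can be sketched schematically. This hypothetical toy example (not an algorithm from the thesis) pairs an exact GP surrogate with an upper-confidence-bound acquisition, one common choice among many, to trade off exploitation against exploration on a one-dimensional problem.

```python
import numpy as np

def objective(x):
    # Cheap stand-in for an expensive black-box objective, maximised at x = 0.6.
    return -(x - 0.6) ** 2

def rbf(A, B, lengthscale=0.2):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # Exact GP posterior mean and variance at the test inputs Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.maximum(var, 0.0)  # clamp tiny negative values

grid = np.linspace(0.0, 1.0, 201)  # candidate solutions
X = np.array([0.0, 1.0])           # initial design
y = objective(X)

for _ in range(10):
    mean, var = gp_posterior(X, y, grid)
    ucb = mean + 2.0 * np.sqrt(var)  # optimism in the face of uncertainty
    x_next = grid[np.argmax(ucb)]    # exploit high means or explore high variance
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(X[np.argmax(y)])  # best input found so far
```

Early iterations are dominated by the variance term and explore the gaps between observations; later iterations concentrate evaluations around the emerging optimum.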
&lt;p&gt;This pronounced emphasis on well-calibrated uncertainty distinguishes BO as one of the standout “killer apps” for GPs and a jewel in the crown of probabilistic ML applications. In practice, BO has proven instrumental across science, engineering, and industry, where efficiency and cost-effectiveness are paramount. Its applications include protein engineering, materials discovery, experimental physics (e.g., experiments involving ultra-cold atoms and free-electron lasers), environmental monitoring (sensor placement), and the design of aerodynamic aerofoils, integrated circuits, broadband high-efficiency power amplifiers, and fast-charging protocols for lithium-ion batteries. Notably, it has played a crucial role in automating the hyperparameter tuning of various ML models, especially deep learning models, thus representing yet another way in which probabilistic ML has contributed to the advancement of deep learning.&lt;/p&gt;
&lt;p&gt;However, GPs are not universally suitable for all BO problem scenarios. They are most effective when dealing with smooth, stationary functions with homoscedastic noise and a relatively modest input dimensionality. Additionally, GPs are easiest to work with for functions with a single output and purely continuous inputs. While a surprisingly wide array of real-world challenges satisfy these conditions, many high-impact problems clearly fall outside this scope: gene and protein design, which involves sequential inputs; neural architecture search (NAS), which involves structured inputs with intricate conditional dependencies; and automotive safety engineering, which involves numerous constraints and multiple objectives. This is not to say that GPs cannot be extended to such challenging scenarios. However, such extensions almost always come at a cost. Consequently, it makes sense to appeal to alternative modelling paradigms more naturally suited to specific tasks, e.g., employing random forests (RFs) to handle discrete and structured inputs, or deep NNs for capturing nonstationary behaviour and dealing with multiple objectives. A major contribution of this thesis is the introduction of a new formulation of BO that seamlessly accommodates virtually any modelling paradigm, including deep learning, without any compromise.&lt;/p&gt;
&lt;h3 id="thesis-overview"&gt;Thesis Overview&lt;/h3&gt;
&lt;p&gt;The core contributions of our thesis are summarised as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-orthogonal-sparse-spherical-gp" label="item:contrib-orthogonal-sparse-spherical-gp"&gt;&lt;/span&gt; We improve upon the framework for sparse hyperspherical GP approximations that employ nonlinear activations as inter-domain inducing features. This framework serves as a bridge between GPs and NNs, with posterior predictive mean taking the form of single-layer feedforward NNs. Our thesis examines some practical issues associated with this approach and proposes an extension that takes advantage of the orthogonal decoupling of GPs to mitigate these limitations. In particular, we introduce spherical inter-domain features to construct more flexible data-dependent basis functions for both the principal and orthogonal components of the GP approximation. We demonstrate that incorporating orthogonal inducing variables under this framework not only alleviates these shortcomings but also offers superior scalability compared to alternative strategies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-cycle-bayes" label="item:contrib-cycle-bayes"&gt;&lt;/span&gt; We provide a probabilistic perspective on cycle-consistent adversarial networks (CYCLEGANs), a cutting-edge deep generative model for style transfer and image-to-image translation. Specifically, we frame the problem of learning cross-domain correspondences without paired data as Bayesian inference in a latent variable model (LVM), in which the goal is to uncover the hidden representations of entities from one domain as entities in another. First, we introduce implicit LVMs, which allow flexible prior specification over latent representations as implicit distributions. Next, we develop a new variational inference (VI) framework that minimises a symmetrised statistical divergence between the variational and true joint distributions. Finally, we show that CYCLEGANs emerge as a closely-related variant of our framework, providing a useful interpretation as a Bayesian approximation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-bore" label="item:contrib-bore"&gt;&lt;/span&gt; We introduce a model-agnostic formulation of BO based on classification. Building on the established links between class-probability estimation (CPE), density-ratio estimation (DRE), and the improvement-based acquisition functions, we reformulate the acquisition function as a binary classifier over candidate solutions. This approach eliminates the need for an explicit probabilistic model of the objective function and casts aside the limitations of tractability constraints. As a result, our model-agnostic BO approach substantially broadens its applicability across diverse problem scenarios, accommodating flexible and scalable modelling paradigms such as deep learning without necessitating approximations or sacrificing expressive and representational capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
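The classification-based formulation summarised in the final item above can be illustrated with a toy sketch. Everything here is an assumption for illustration: a simple k-nearest-neighbour class-probability estimator stands in for the flexible classifiers the formulation admits, and the labelling fraction gamma, candidate sampling, and all names are our own.

```python
import numpy as np

def objective(x):
    # Cheap stand-in for an expensive black-box objective, maximised at x = 0.6.
    return -(x - 0.6) ** 2

gamma = 0.25   # fraction of observations labelled as "good"
k = 5          # neighbours used by the class-probability estimator
rng = np.random.default_rng(0)

X = rng.uniform(0.0, 1.0, size=8)  # initial design
y = objective(X)

for _ in range(15):
    tau = np.quantile(y, 1.0 - gamma)  # improvement threshold
    z = (y >= tau).astype(float)       # binary labels: good vs. not good

    # Class-probability estimate at random candidates: the fraction of
    # positives among each candidate's k nearest past observations.
    cand = rng.uniform(0.0, 1.0, size=128)
    d = np.abs(cand[:, None] - X[None, :])
    nearest = np.argsort(d, axis=1)[:, :k]
    proba = z[nearest].mean(axis=1)

    # The classifier output itself acts as the acquisition function.
    x_next = cand[np.argmax(proba)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(X[np.argmax(y)])  # best input found so far
```

No probabilistic model of the objective ever appears: the loop only needs a classifier that scores candidates, which is what makes the formulation model-agnostic.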
&lt;p&gt;Accordingly, our thesis is organised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Chapter 2 (Background) lays the necessary groundwork for our thesis. We begin by outlining the fundamental principles of probability and Bayesian statistics, which form the basis of probabilistic ML. Additionally, we introduce the widely-adopted method of approximate Bayesian inference known as VI. Our discussion underscores the central role played by statistical divergences, prompting us to delve into a larger family of divergences and motivating our discussion of DRE. With a solid foundation in place, we shift our focus to GPs, providing an introductory overview and highlighting the most commonly-used sparse approximations. Finally, we conclude this background chapter by introducing the basic concepts behind BO.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 3 (Orthogonally-Decoupled Sparse GPs with Spherical Inducing Features) examines orthogonally-decoupled sparse GPs with spherical NN activation features, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 4 (Cycle-Consistent Adversarial Learning as Bayesian Inference) examines cycle-consistent adversarial learning from the perspective of approximate Bayesian inference, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 5 (Bayesian Optimization by Density-Ratio Estimation) examines our model-agnostic approach to BO based on binary classification and DRE, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 6 (Conclusion) brings this thesis to a close by reflecting on our main contributions and situating them in the broader landscape of probabilistic methods in ML. Finally, we conclude by presenting our outlook on the avenues for future research and development in this rapidly evolving field.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="references"&gt;References&lt;/h3&gt;
&lt;div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0" line-spacing="2"&gt;
&lt;div id="ref-anil2023palm" class="csl-entry"&gt;
&lt;p&gt;Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. (2023). PaLM 2 technical report. &lt;em&gt;arXiv Preprint arXiv:2305.10403&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-attia2020closed" class="csl-entry"&gt;
&lt;p&gt;Attia, P. M., Grover, A., Jin, N., Severson, K. A., Markov, T. M., Liao, Y.-H., Chen, M. H., Cheong, B., Perkins, N., Yang, Z., et al. (2020). Closed-loop optimization of fast-charging protocols for batteries with machine learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;578&lt;/em&gt;(7795), 397–402.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bartholomew2011latent" class="csl-entry"&gt;
&lt;p&gt;Bartholomew, D. J., Knott, M., &amp;amp; Moustaki, I. (2011). &lt;em&gt;Latent variable models and factor analysis: A unified approach&lt;/em&gt;. John Wiley &amp;amp; Sons.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bayes1763lii" class="csl-entry"&gt;
&lt;p&gt;Bayes, T. (1763). LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F.R.S., communicated by Mr. Price, in a letter to John Canton, A.M.F.R.S. &lt;em&gt;Philosophical Transactions of the Royal Society of London&lt;/em&gt;, &lt;em&gt;53&lt;/em&gt;, 370–418.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-blundell2015weight" class="csl-entry"&gt;
&lt;p&gt;Blundell, C., Cornebise, J., Kavukcuoglu, K., &amp;amp; Wierstra, D. (2015). Weight uncertainty in neural network. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1613–1622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brochu2010tutorial" class="csl-entry"&gt;
&lt;p&gt;Brochu, E., Cora, V. M., &amp;amp; De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1012.2599&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brown2020language" class="csl-entry"&gt;
&lt;p&gt;Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 1877–1901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-chen2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Chen, P., Merrick, B. M., &amp;amp; Brazil, T. J. (2015). Bayesian optimization for broadband high-efficiency power amplifier designs. &lt;em&gt;IEEE Transactions on Microwave Theory and Techniques&lt;/em&gt;, &lt;em&gt;63&lt;/em&gt;(12), 4263–4272.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-damianou2013deep" class="csl-entry"&gt;
&lt;p&gt;Damianou, A., &amp;amp; Lawrence, N. D. (2013). Deep Gaussian processes. &lt;em&gt;Artificial Intelligence and Statistics&lt;/em&gt;, 207–215.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-deisenroth2011pilco" class="csl-entry"&gt;
&lt;p&gt;Deisenroth, M., &amp;amp; Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. &lt;em&gt;Proceedings of the 28th International Conference on Machine Learning (ICML-11)&lt;/em&gt;, 465–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-duris2020bayesian" class="csl-entry"&gt;
&lt;p&gt;Duris, J., Kennedy, D., Hanuka, A., Shtalenkova, J., Edelen, A., Baxevanis, P., Egger, A., Cope, T., McIntire, M., Ermon, S., et al. (2020). Bayesian optimization of a free-electron laser. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;124&lt;/em&gt;(12), 124801.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2020sparse" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Durrande, N., &amp;amp; Hensman, J. (2020). Sparse Gaussian processes with spherical harmonic features. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 2793–2802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2021deep" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Hensman, J., Wilk, M. van der, Ek, C. H., Ghahramani, Z., &amp;amp; Durrande, N. (2021). Deep neural networks as point estimates for deep Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;34&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-forrester2009recent" class="csl-entry"&gt;
&lt;p&gt;Forrester, A. I., &amp;amp; Keane, A. J. (2009). Recent advances in surrogate-based optimization. &lt;em&gt;Progress in Aerospace Sciences&lt;/em&gt;, &lt;em&gt;45&lt;/em&gt;(1-3), 50–79.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gal2016dropout" class="csl-entry"&gt;
&lt;p&gt;Gal, Y., &amp;amp; Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1050–1059.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett_bayesoptbook_2023" class="csl-entry"&gt;
&lt;p&gt;Garnett, R. (2023). &lt;em&gt;&lt;span class="nocase"&gt;Bayesian Optimization&lt;/span&gt;&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett2010bayesian" class="csl-entry"&gt;
&lt;p&gt;Garnett, R., Osborne, M. A., &amp;amp; Roberts, S. J. (2010). Bayesian optimization for sensor set selection. &lt;em&gt;Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks&lt;/em&gt;, 209–219.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gelman2013bayesian" class="csl-entry"&gt;
&lt;p&gt;Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., &amp;amp; Rubin, D. B. (2013). &lt;em&gt;Bayesian data analysis&lt;/em&gt;. CRC Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-girshick2014rich" class="csl-entry"&gt;
&lt;p&gt;Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 580–587.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gonzalez2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Gonzalez, J., Longworth, J., James, D. C., &amp;amp; Lawrence, N. D. (2015). Bayesian optimization for synthetic gene design. &lt;em&gt;arXiv Preprint arXiv:1505.01627&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-goodfellow2014generative" class="csl-entry"&gt;
&lt;p&gt;Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., &amp;amp; Bengio, Y. (2014). Generative adversarial networks. &lt;em&gt;arXiv Preprint arXiv:1406.2661&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-graves2013speech" class="csl-entry"&gt;
&lt;p&gt;Graves, A., Mohamed, A., &amp;amp; Hinton, G. (2013). Speech recognition with deep recurrent neural networks. &lt;em&gt;2013 IEEE International Conference on Acoustics, Speech and Signal Processing&lt;/em&gt;, 6645–6649.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hennig2022probabilistic" class="csl-entry"&gt;
&lt;p&gt;Hennig, P., Osborne, M. A., &amp;amp; Kersting, H. P. (2022). &lt;em&gt;Probabilistic numerics&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hie2022adaptive" class="csl-entry"&gt;
&lt;p&gt;Hie, B. L., &amp;amp; Yang, K. K. (2022). Adaptive machine learning for protein engineering. &lt;em&gt;Current Opinion in Structural Biology&lt;/em&gt;, &lt;em&gt;72&lt;/em&gt;, 145–152.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hinton2012deep" class="csl-entry"&gt;
&lt;p&gt;Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. &lt;em&gt;IEEE Signal Processing Magazine&lt;/em&gt;, &lt;em&gt;29&lt;/em&gt;(6), 82–97.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ho2020denoising" class="csl-entry"&gt;
&lt;p&gt;Ho, J., Jain, A., &amp;amp; Abbeel, P. (2020). Denoising diffusion probabilistic models. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 6840–6851.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hoffmann2022training" class="csl-entry"&gt;
&lt;p&gt;Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). Training compute-optimal large language models. &lt;em&gt;arXiv Preprint arXiv:2203.15556&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-houlsby2011bayesian" class="csl-entry"&gt;
&lt;p&gt;Houlsby, N., Huszár, F., Ghahramani, Z., &amp;amp; Lengyel, M. (2011). Bayesian active learning for classification and preference learning. &lt;em&gt;arXiv Preprint arXiv:1112.5745&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jordan1998introduction" class="csl-entry"&gt;
&lt;p&gt;Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., &amp;amp; Saul, L. K. (1998). An introduction to variational methods for graphical models. &lt;em&gt;Learning in Graphical Models&lt;/em&gt;, 105–161.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jumper2021highly" class="csl-entry"&gt;
&lt;p&gt;Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;596&lt;/em&gt;(7873), 583–589.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-krizhevsky2012imagenet" class="csl-entry"&gt;
&lt;p&gt;Krizhevsky, A., Sutskever, I., &amp;amp; Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lam2018advances" class="csl-entry"&gt;
&lt;p&gt;Lam, R., Poloczek, M., Frazier, P., &amp;amp; Willcox, K. E. (2018). Advances in Bayesian optimization with applications in aerospace engineering. &lt;em&gt;2018 AIAA Non-Deterministic Approaches Conference&lt;/em&gt;, 1656.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-laplace1814theorie" class="csl-entry"&gt;
&lt;p&gt;Laplace, P. S. (1814). &lt;em&gt;Théorie analytique des probabilités&lt;/em&gt;. Courcier.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lee2017deep" class="csl-entry"&gt;
&lt;p&gt;Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., &amp;amp; Sohl-Dickstein, J. (2017). Deep neural networks as Gaussian processes. &lt;em&gt;arXiv Preprint arXiv:1711.00165&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lillicrap2015continuous" class="csl-entry"&gt;
&lt;p&gt;Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., &amp;amp; Wierstra, D. (2015). Continuous control with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1509.02971&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lyu2017efficient" class="csl-entry"&gt;
&lt;p&gt;Lyu, W., Xue, P., Yang, F., Yan, C., Hong, Z., Zeng, X., &amp;amp; Zhou, D. (2017). An efficient Bayesian optimization approach for automated optimization of analog circuits. &lt;em&gt;IEEE Transactions on Circuits and Systems I: Regular Papers&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 1954–1967.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay1992practical" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (1992). A practical Bayesian framework for backpropagation networks. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;4&lt;/em&gt;(3), 448–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay2003information" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (2003). &lt;em&gt;Information theory, inference and learning algorithms&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-marchant2012bayesian" class="csl-entry"&gt;
&lt;p&gt;Marchant, R., &amp;amp; Ramos, F. (2012). Bayesian optimisation for intelligent environmental monitoring. &lt;em&gt;2012 IEEE/RSJ International Conference on Intelligent Robots and Systems&lt;/em&gt;, 2242–2249.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-matthews2018gaussian" class="csl-entry"&gt;
&lt;p&gt;Matthews, A. G. de G., Rowland, M., Hron, J., Turner, R. E., &amp;amp; Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. &lt;em&gt;arXiv Preprint arXiv:1804.11271&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mcculloch1943logical" class="csl-entry"&gt;
&lt;p&gt;McCulloch, W. S., &amp;amp; Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. &lt;em&gt;The Bulletin of Mathematical Biophysics&lt;/em&gt;, &lt;em&gt;5&lt;/em&gt;, 115–133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2013playing" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &amp;amp; Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1312.5602&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2015human" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;518&lt;/em&gt;(7540), 529–533.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-moss2020boss" class="csl-entry"&gt;
&lt;p&gt;Moss, H. B., Beck, D., González, J., Leslie, D. S., &amp;amp; Rayson, P. (2020). BOSS: Bayesian optimization over string spaces. &lt;em&gt;arXiv Preprint arXiv:2010.00979&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-neal1995bayesian" class="csl-entry"&gt;
&lt;p&gt;Neal, R. M. (1995). &lt;em&gt;Bayesian learning for neural networks&lt;/em&gt; [PhD thesis]. University of Toronto.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-openai2023gpt" class="csl-entry"&gt;
&lt;p&gt;OpenAI. (2023). GPT-4 technical report. &lt;em&gt;arXiv Preprint arXiv:2303.08774&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-opper2000gaussian" class="csl-entry"&gt;
&lt;p&gt;Opper, M., &amp;amp; Winther, O. (2000). &lt;em&gt;Gaussian processes and SVM: Mean field results and leave-one-out&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-pearson1901liii" class="csl-entry"&gt;
&lt;p&gt;Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. &lt;em&gt;The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science&lt;/em&gt;, &lt;em&gt;2&lt;/em&gt;(11), 559–572.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rae2021scaling" class="csl-entry"&gt;
&lt;p&gt;Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis &amp;amp; insights from training Gopher. &lt;em&gt;arXiv Preprint arXiv:2112.11446&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ramesh2022hierarchical" class="csl-entry"&gt;
&lt;p&gt;Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., &amp;amp; Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. &lt;em&gt;arXiv Preprint arXiv:2204.06125&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt;(2), 3.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-10.7551/mitpress/3206.001.0001" class="csl-entry"&gt;
&lt;p&gt;Rasmussen, C. E., &amp;amp; Williams, C. K. I. (2005). &lt;em&gt;&lt;span class="nocase"&gt;Gaussian Processes for Machine Learning&lt;/span&gt;&lt;/em&gt;. The MIT Press.
&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-redmon2016you" class="csl-entry"&gt;
&lt;p&gt;Redmon, J., Divvala, S., Girshick, R., &amp;amp; Farhadi, A. (2016). You only look once: Unified, real-time object detection. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 779–788.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rombach2022high" class="csl-entry"&gt;
&lt;p&gt;Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp;amp; Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. &lt;em&gt;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 10684–10695.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-romero2013navigating" class="csl-entry"&gt;
&lt;p&gt;Romero, P. A., Krause, A., &amp;amp; Arnold, F. H. (2013). Navigating the protein fitness landscape with Gaussian processes. &lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt;, &lt;em&gt;110&lt;/em&gt;(3), E193–E201.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ronneberger2015u" class="csl-entry"&gt;
&lt;p&gt;Ronneberger, O., Fischer, P., &amp;amp; Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. &lt;em&gt;Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18&lt;/em&gt;, 234–241.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rosenblatt1958perceptron" class="csl-entry"&gt;
&lt;p&gt;Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. &lt;em&gt;Psychological Review&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 386.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-roweis1999unifying" class="csl-entry"&gt;
&lt;p&gt;Roweis, S., &amp;amp; Ghahramani, Z. (1999). A unifying review of linear Gaussian models. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;11&lt;/em&gt;(2), 305–345.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-salimbeni2018orthogonally" class="csl-entry"&gt;
&lt;p&gt;Salimbeni, H., Cheng, C.-A., Boots, B., &amp;amp; Deisenroth, M. (2018). Orthogonally decoupled variational Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;31&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-seko2015prediction" class="csl-entry"&gt;
&lt;p&gt;Seko, A., Togo, A., Hayashi, H., Tsuda, K., Chaput, L., &amp;amp; Tanaka, I. (2015). Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and Bayesian optimization. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;115&lt;/em&gt;(20), 205901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shahriari2015taking" class="csl-entry"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015). Taking the human out of the loop: A review of Bayesian optimization. &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, &lt;em&gt;104&lt;/em&gt;(1), 148–175.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shi2020sparse" class="csl-entry"&gt;
&lt;p&gt;Shi, J., Titsias, M., &amp;amp; Mnih, A. (2020). Sparse orthogonal variational inference for Gaussian processes. &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 1932–1942.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shoeybi2019megatron" class="csl-entry"&gt;
&lt;p&gt;Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., &amp;amp; Catanzaro, B. (2019). Megatron-LM: Training multi-billion parameter language models using model parallelism. &lt;em&gt;arXiv Preprint arXiv:1909.08053&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-silver2016mastering" class="csl-entry"&gt;
&lt;p&gt;Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;529&lt;/em&gt;(7587), 484–489.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-snoek2012practical" class="csl-entry"&gt;
&lt;p&gt;Snoek, J., Larochelle, H., &amp;amp; Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;, 2951–2959.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-spearman1904general" class="csl-entry"&gt;
&lt;p&gt;Spearman, C. (1904). &amp;quot;General intelligence,&amp;quot; objectively determined and measured. &lt;em&gt;The American Journal of Psychology&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(2), 201–292.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-srivastava2014dropout" class="csl-entry"&gt;
&lt;p&gt;Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &amp;amp; Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. &lt;em&gt;The Journal of Machine Learning Research&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(1), 1929–1958.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-sun2020neural" class="csl-entry"&gt;
&lt;p&gt;Sun, S., Shi, J., &amp;amp; Grosse, R. B. (2020). Neural networks as inter-domain inducing points. &lt;em&gt;Third Symposium on Advances in Approximate Bayesian Inference&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tibshirani1996regression" class="csl-entry"&gt;
&lt;p&gt;Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. &lt;em&gt;Journal of the Royal Statistical Society Series B: Statistical Methodology&lt;/em&gt;, &lt;em&gt;58&lt;/em&gt;(1), 267–288.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tipping1999probabilistic" class="csl-entry"&gt;
&lt;p&gt;Tipping, M. E., &amp;amp; Bishop, C. M. (1999). Probabilistic principal component analysis. &lt;em&gt;Journal of the Royal Statistical Society: Series B (Statistical Methodology)&lt;/em&gt;, &lt;em&gt;61&lt;/em&gt;(3), 611–622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-torun2018global" class="csl-entry"&gt;
&lt;p&gt;Torun, H. M., Swaminathan, M., Davis, A. K., &amp;amp; Bellaredj, M. L. F. (2018). A global Bayesian optimization algorithm and its application to integrated system design. &lt;em&gt;IEEE Transactions on Very Large Scale Integration (VLSI) Systems&lt;/em&gt;, &lt;em&gt;26&lt;/em&gt;(4), 792–802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-touvron2023llama" class="csl-entry"&gt;
&lt;p&gt;Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. &lt;em&gt;arXiv Preprint arXiv:2307.09288&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-turner2021bayesian" class="csl-entry"&gt;
&lt;p&gt;Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., &amp;amp; Guyon, I. (2021). Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. &lt;em&gt;NeurIPS 2020 Competition and Demonstration Track&lt;/em&gt;, 3–26.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-vaswani2017attention" class="csl-entry"&gt;
&lt;p&gt;Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp;amp; Polosukhin, I. (2017). Attention is all you need. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;30&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-wigley2016fast" class="csl-entry"&gt;
&lt;p&gt;Wigley, P. B., Everitt, P. J., Hengel, A. van den, Bastian, J. W., Sooriyabandara, M. A., McDonald, G. D., Hardman, K. S., Quinlivan, C. D., Manju, P., Kuhn, C. C., et al. (2016). Fast machine-learning online optimization of ultra-cold-atom experiments. &lt;em&gt;Scientific Reports&lt;/em&gt;, &lt;em&gt;6&lt;/em&gt;(1), 25890.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-yang2019machine" class="csl-entry"&gt;
&lt;p&gt;Yang, K. K., Wu, Z., &amp;amp; Arnold, F. H. (2019). Machine-learning-guided directed evolution for protein engineering. &lt;em&gt;Nature Methods&lt;/em&gt;, &lt;em&gt;16&lt;/em&gt;(8), 687–694.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes</title><link>https://tiao.io/publications/spherical-features-gaussian-process/</link><pubDate>Tue, 25 Apr 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/spherical-features-gaussian-process/</guid><description/></item><item><title>Batch Bayesian Optimisation via Density-ratio Estimation with Guarantees</title><link>https://tiao.io/publications/batch-bore-guarantees/</link><pubDate>Thu, 01 Dec 2022 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/batch-bore-guarantees/</guid><description/></item><item><title>BORE: Bayesian Optimization by Density-Ratio Estimation</title><link>https://tiao.io/publications/bore-2/</link><pubDate>Sat, 08 May 2021 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/bore-2/</guid><description>&lt;p&gt;&lt;strong&gt;B&lt;/strong&gt;ayesian &lt;strong&gt;O&lt;/strong&gt;ptimization (BO) by Density-&lt;strong&gt;R&lt;/strong&gt;atio &lt;strong&gt;E&lt;/strong&gt;stimation (DRE),
or &lt;strong&gt;BORE&lt;/strong&gt;, is a simple, yet effective framework for the optimization of
blackbox functions.
BORE is built upon the correspondence between &lt;em&gt;expected improvement (EI)&lt;/em&gt;&amp;mdash;arguably
the predominant &lt;em&gt;acquisition function&lt;/em&gt; used in BO&amp;mdash;and the &lt;em&gt;density-ratio&lt;/em&gt;
between two unknown distributions.&lt;/p&gt;
&lt;p&gt;One of the far-reaching consequences of this correspondence is that we can
reduce the computation of EI to a &lt;em&gt;probabilistic classification&lt;/em&gt; problem&amp;mdash;a
problem we are well-equipped to tackle, as evidenced by the broad range of
streamlined, easy-to-use and, perhaps most importantly, battle-tested
tools and frameworks at our disposal, from the popular deep learning and
gradient tree boosting libraries to general-purpose machine learning toolkits.
The BORE framework lets us take direct advantage of these tools.&lt;/p&gt;
&lt;h2 id="code-example"&gt;Code Example&lt;/h2&gt;
&lt;p&gt;We provide a simple example with Keras to give you a taste of how BORE can
be implemented using a feed-forward &lt;em&gt;neural network (NN)&lt;/em&gt; classifier.
A useful class that the &lt;code&gt;bore&lt;/code&gt; package provides is
&lt;code&gt;MaximizableSequential&lt;/code&gt;, a subclass of the &lt;code&gt;Sequential&lt;/code&gt; model from
Keras that inherits all of its existing functionality and provides just
one additional method.
We can build and compile a feed-forward NN classifier as usual:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bore.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MaximizableSequential&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# build model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MaximizableSequential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sigmoid&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# compile model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;adam&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;binary_crossentropy&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See the Keras documentation if this seems unfamiliar to you.&lt;/p&gt;
&lt;p&gt;The additional method provided is &lt;code&gt;argmax&lt;/code&gt;, which returns the &lt;em&gt;maximizer&lt;/em&gt; of
the network, i.e. the input $\mathbf{x}$ that maximizes the final output of
the network:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_argmax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;L-BFGS-B&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_start_points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since the network is differentiable end-to-end with respect to the input $\mathbf{x}$, this
method can be implemented efficiently using a &lt;em&gt;multi-started quasi-Newton
hill-climber&lt;/em&gt; such as L-BFGS-B.
We will see the pivotal role this method plays in the next section.&lt;/p&gt;
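&lt;p&gt;To make the idea concrete, here is a minimal sketch of such a multi-started maximizer, written against SciPy&amp;rsquo;s L-BFGS-B routine; the helper &lt;code&gt;multi_start_argmax&lt;/code&gt; and the toy objective are purely illustrative, not part of the &lt;code&gt;bore&lt;/code&gt; API.&lt;/p&gt;

```python
import numpy as np
from scipy.optimize import minimize

def multi_start_argmax(func, bounds, num_start_points=3, seed=0):
    """Maximize `func` over a box by minimizing its negation with the
    quasi-Newton method L-BFGS-B from several random starting points."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds).T
    results = [
        minimize(lambda x: -func(x),
                 x0=rng.uniform(low, high),  # random restart inside the box
                 method="L-BFGS-B", bounds=bounds)
        for _ in range(num_start_points)
    ]
    # keep the restart that achieved the highest value of `func`
    return min(results, key=lambda res: res.fun).x

# toy objective with a unique maximum at (0.5, -0.25)
func = lambda x: -np.sum((x - np.array([0.5, -0.25])) ** 2)
x_max = multi_start_argmax(func, bounds=[(-1.0, 1.0), (-1.0, 1.0)])
```

&lt;p&gt;Restarting from multiple random points guards against getting stuck in a poor local optimum of the (generally non-concave) network output.&lt;/p&gt;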
&lt;hr&gt;
&lt;p&gt;Using this classifier, the BO loop in BORE looks as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# initialize design&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features_initial_design&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targets_initial_design&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# construct classification problem&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tau&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;less&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tau&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# update classifier&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# suggest new candidate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;L-BFGS-B&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_start_points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# evaluate blackbox&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;y_next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blackbox&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# update dataset&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;p&gt;Let&amp;rsquo;s break this down a bit:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;At the start of the loop, we construct the classification problem&amp;mdash;by labeling
instances $\mathbf{x}$ whose corresponding target value $y$ falls in the lowest
&lt;code&gt;q=0.25&lt;/code&gt; quantile of all observed target values (i.e. the best-performing instances, since we are minimizing) as &lt;em&gt;positive&lt;/em&gt;, and the rest as &lt;em&gt;negative&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Next, we train the classifier to discriminate between these instances. This
classifier should converge towards
&lt;/p&gt;
$$
\pi^{*}(\mathbf{x}) = \frac{\gamma \ell(\mathbf{x})}{\gamma \ell(\mathbf{x}) + (1-\gamma) g(\mathbf{x})},
$$&lt;p&gt;
where $\ell(\mathbf{x})$ and $g(\mathbf{x})$ are the unknown distributions of
instances belonging to the positive and negative classes, respectively, and
$\gamma$ is the class-balance rate, which, by construction, is simply the quantile
we specified (i.e. $\gamma=0.25$).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once the classifier is a decent approximation to $\pi^{*}(\mathbf{x})$, we
propose the maximizer of this classifier as the next input to evaluate.
In other words, we are now using the classifier &lt;em&gt;itself&lt;/em&gt; as the acquisition
function.&lt;/p&gt;
&lt;p&gt;How is it justifiable to use this in lieu of EI, or some other acquisition
function we&amp;rsquo;re used to?
And what is so special about $\pi^{*}(\mathbf{x})$?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Well, as it turns out, $\pi^{*}(\mathbf{x})$ is equivalent to EI, up to some
constant factors.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The remainder of the loop should now be self-explanatory. Namely, we&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;evaluate the blackbox function at the suggested point, and&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;update the dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
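&lt;p&gt;The labeling in step 1 is easy to check in isolation. Here is a minimal, self-contained sketch (the target values are made up purely for illustration; &lt;code&gt;tau&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; mirror the corresponding lines of the loop above):&lt;/p&gt;

```python
import numpy as np

# Made-up target values standing in for blackbox observations.
y = np.array([3.1, 0.7, 2.4, 1.8, 0.2, 2.9, 1.1, 0.5])

gamma = 0.25

# Threshold tau is the gamma-quantile of the observed targets,
# exactly as in the `np.quantile(y, q=0.25)` call of the loop.
tau = np.quantile(y, q=gamma)

# Binary labels: z is True for targets strictly below tau,
# i.e. the best-performing instances under minimization.
z = np.less(y, tau)

# By construction, roughly a gamma fraction of instances are positive.
print(tau, z.mean())
```

&lt;p&gt;Here the two smallest targets (0.2 and 0.5) fall below $\tau = 0.65$, so exactly a quarter of the instances are labeled positive.&lt;/p&gt;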
&lt;h3 id="step-by-step-illustration"&gt;Step-by-step Illustration&lt;/h3&gt;
&lt;p&gt;Here is a step-by-step animation of six iterations of this loop in action,
using the &lt;em&gt;Forrester&lt;/em&gt; synthetic function as an example.
The noise-free function is shown as the solid gray curve in the main pane.
This procedure is warm-started with four random initial designs.&lt;/p&gt;
&lt;p&gt;The right pane shows the empirical CDF (ECDF) of the observed $y$ values.
The vertical dashed black line in this pane is located at $\Phi(y) = \gamma$,
where $\gamma = 0.25$.
The horizontal dashed black line is located at $\tau$, the value of $y$ such
that $\Phi(y) = 0.25$, i.e. $\tau = \Phi^{-1}(0.25)$.&lt;/p&gt;
&lt;p&gt;The instances below this horizontal line are assigned binary label $z=1$, while
those above are assigned $z=0$. This is visualized in the bottom pane,
alongside the probabilistic classifier $\pi_{\boldsymbol{\theta}}(\mathbf{x})$
represented by the solid gray curve, which is trained to discriminate between
these instances.&lt;/p&gt;
&lt;p&gt;Finally, the maximizer of the classifier is represented by the vertical solid
green line.
This is the location that the BO procedure suggests evaluating next.&lt;/p&gt;
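&lt;p&gt;The &lt;code&gt;classifier.argmax(...)&lt;/code&gt; call in the loop encapsulates this maximization step. One common way to implement it is multi-start local optimization with L-BFGS-B; here is a rough sketch of that idea, where the acquisition function below is a hypothetical stand-in for the trained classifier&amp;rsquo;s predicted probability:&lt;/p&gt;

```python
import numpy as np
from scipy.optimize import minimize

def acquisition(x):
    # Hypothetical stand-in for the trained classifier's predicted
    # probability pi(x); smooth, bounded in (0, 1], maximized at x = 1.
    return np.exp(-0.5 * (x - 1.0) ** 2).item()

bounds = [(-3.0, 3.0)]
rng = np.random.default_rng(0)

# Multi-start maximization: run L-BFGS-B on the negative acquisition from
# a few random starting points, then keep the best local optimum found.
results = [
    minimize(lambda x: -acquisition(x),
             x0=rng.uniform(-3.0, 3.0, size=1),
             method="L-BFGS-B", bounds=bounds)
    for _ in range(3)
]
best = min(results, key=lambda res: res.fun)
x_next = best.x  # suggested next input to evaluate
print(x_next)
```

&lt;p&gt;The multiple restarts guard against getting stuck in a poor local optimum of the (generally non-concave) classifier surface.&lt;/p&gt;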
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Animation"
srcset="https://tiao.io/publications/bore-2/paper_1500x5562_hu_bf54a19b8bc6fbf5.webp 205w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/publications/bore-2/paper_1500x5562_hu_bf54a19b8bc6fbf5.webp"
width="205"
height="760"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;We see that the procedure converges toward the global minimum of the blackbox
function after half a dozen iterations.&lt;/p&gt;
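&lt;p&gt;To make the connection to $\pi^{*}(\mathbf{x})$ concrete, here is a small numerical sketch that evaluates the optimal classifier probability directly from its defining densities. The Gaussian choices for $\ell(\mathbf{x})$ and $g(\mathbf{x})$ below are made up purely for illustration; in practice these densities are unknown, which is exactly why we train a classifier instead:&lt;/p&gt;

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Univariate Gaussian density.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

gamma = 0.25

def l(x):
    # Hypothetical density of the positive (best-performing) instances.
    return normal_pdf(x, mu=0.0, sigma=1.0)

def g(x):
    # Hypothetical density of the negative (remaining) instances.
    return normal_pdf(x, mu=3.0, sigma=2.0)

def pi_star(x):
    # Optimal classifier probability: gamma * l / (gamma * l + (1 - gamma) * g).
    num = gamma * l(x)
    return num / (num + (1.0 - gamma) * g(x))

x = np.linspace(-4.0, 8.0, 241)
p = pi_star(x)

# pi_star peaks where l(x) dominates g(x); its argmax (here x = -1, where
# the log-density ratio of these two Gaussians is maximized) is the point
# the procedure would suggest next.
x_next = x[np.argmax(p)]
print(x_next)
```

&lt;p&gt;Because $\pi^{*}(\mathbf{x})$ is a monotone transformation of the density ratio $\ell(\mathbf{x}) / g(\mathbf{x})$, maximizing the classifier is equivalent to maximizing the ratio itself.&lt;/p&gt;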
&lt;hr&gt;
&lt;p&gt;To understand how and why this works in more detail, please read our paper!
If you only have 15 minutes to spare, please watch the video recording of our
talk!&lt;/p&gt;
&lt;h2 id="video"&gt;Video&lt;/h2&gt;
&lt;div id="presentation-embed-38942425"&gt;&lt;/div&gt;
&lt;script src='https://slideslive.com/embed_presentation.js'&gt;&lt;/script&gt;
&lt;script&gt;
embed = new SlidesLiveEmbed('presentation-embed-38942425', {
presentationId: '38942425',
autoPlay: false, // change to true to autoplay the embedded presentation
verticalEnabled: true
});
&lt;/script&gt;</description></item><item><title>Simulation-based Scoring for Model-based Asynchronous Hyperparameter and Neural Architecture Search</title><link>https://tiao.io/publications/simulation-based-scoring/</link><pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/simulation-based-scoring/</guid><description/></item><item><title>Bayesian Optimization by Density Ratio Estimation</title><link>https://tiao.io/publications/bore-1/</link><pubDate>Tue, 01 Dec 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/bore-1/</guid><description/></item><item><title>Variational Inference for Graph Convolutional Networks in the Absence of Graph Data and Adversarial Settings</title><link>https://tiao.io/publications/vi-gcn-2/</link><pubDate>Mon, 01 Jun 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/vi-gcn-2/</guid><description>&lt;p&gt;This paper is a follow-up to our
earlier workshop paper, previously
presented at the NeurIPS 2019 Graph Representation Learning Workshop, now with
significantly expanded experimental analyses.&lt;/p&gt;
&lt;div id="presentation-embed-38937946"&gt;&lt;/div&gt;
&lt;script src='https://slideslive.com/embed_presentation.js'&gt;&lt;/script&gt;
&lt;script&gt;
embed = new SlidesLiveEmbed('presentation-embed-38937946', {
presentationId: '38937946',
autoPlay: false, // change to true to autoplay the embedded presentation
verticalEnabled: true
});
&lt;/script&gt;</description></item><item><title>Model-based Asynchronous Hyperparameter and Neural Architecture Search</title><link>https://tiao.io/publications/async-multi-fidelity-hpo/</link><pubDate>Sun, 01 Mar 2020 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/async-multi-fidelity-hpo/</guid><description/></item><item><title>Variational Graph Convolutional Networks</title><link>https://tiao.io/publications/vi-gcn-1/</link><pubDate>Sun, 01 Dec 2019 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/vi-gcn-1/</guid><description/></item><item><title>Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</title><link>https://tiao.io/publications/cycle-bayes/</link><pubDate>Sun, 01 Jul 2018 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/cycle-bayes/</guid><description/></item></channel></rss>