<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deep Learning |</title><link>https://tiao.io/tags/deep-learning/</link><atom:link href="https://tiao.io/tags/deep-learning/index.xml" rel="self" type="application/rss+xml"/><description>Deep Learning</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 01 Sep 2023 00:00:00 +0000</lastBuildDate><image><url>https://tiao.io/media/icon_hu_9c2a75fde2335590.png</url><title>Deep Learning</title><link>https://tiao.io/tags/deep-learning/</link></image><item><title>Probabilistic Machine Learning in the Age of Deep Learning: New Perspectives for Gaussian Processes, Bayesian Optimization and Beyond (PhD Thesis)</title><link>https://tiao.io/publications/phd-thesis/</link><pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate><guid>https://tiao.io/publications/phd-thesis/</guid><description>&lt;p&gt;The full text is available as a single PDF file &lt;a href="phd-thesis-louis-tiao.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can also find a list of contents and PDFs corresponding to each individual chapter below:&lt;/p&gt;
&lt;h3 id="table-of-contents"&gt;Table of Contents&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chapter 1: Introduction &lt;a href="contents/1 Introduction.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 2: Background &lt;a href="contents/2 Background.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 3: Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features &lt;a href="contents/3 Orthogonally-Decoupled Sparse Gaussian Processes with Spherical Neural Network Activation Features.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 4: Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation &lt;a href="contents/4 Cycle-Consistent Generative Adversarial Networks as a Bayesian Approximation.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 5: Bayesian Optimisation by Classification with Deep Learning and Beyond &lt;a href="contents/5 Bayesian Optimisation by Classification with Deep Learning and Beyond.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chapter 6: Conclusion &lt;a href="contents/6 Conclusion.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Appendix A: Numerical Methods for Improved Decoupled Sampling of Gaussian Processes &lt;a href="contents/A Numerical Methods for Improved Decoupled Sampling of Gaussian Processes.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bibliography &lt;a href="contents/Bibliography.pdf" target="_blank" rel="noopener"&gt;
&lt;span class="inline-block pr-1"&gt;
&lt;svg style="height: 1em; transform: translateY(0.1em);" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M3 16.5v2.25A2.25 2.25 0 0 0 5.25 21h13.5A2.25 2.25 0 0 0 21 18.75V16.5M16.5 12L12 16.5m0 0L7.5 12m4.5 4.5V3"/&gt;&lt;/svg&gt;
&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Please find &lt;em&gt;Chapter 1: Introduction&lt;/em&gt; reproduced in full below:&lt;/p&gt;
&lt;h3 id="introduction"&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Artificial intelligence (AI) stands poised to be among the most disruptive technologies of our era. The breakneck pace of recent AI advancements has been spearheaded by machine learning (ML), particularly the resurgence of &lt;em&gt;deep learning&lt;/em&gt;. Deep learning is as old as the first general-purpose electronic computer; with roots tracing back to the 1940s and ’50s (
;
), the revival of deep learning, beginning in the early 2010s, was catalysed by a series of breakthroughs that shattered previously perceived limitations and captivated the collective imagination. These breakthroughs span various domains, including computer vision (
;
;
;
), speech recognition (
;
), natural language processing (
;
), protein folding (
), generative art and artificial creativity  (
;
;
;
), as well as reinforcement learning for robotics control (
;
) and achieving superhuman-level gameplay (
;
).&lt;/p&gt;
&lt;p&gt;Nevertheless, it is crucial to view these developments as means to an ultimate end rather than an end in themselves. Arguably, the true pinnacle of AI’s capabilities lies in optimal &lt;em&gt;decision-making&lt;/em&gt;, whether that entails offering analyses and insights to aid humans in making better decisions or completely automating the decision-making process altogether. Practically any task directed towards a well-defined objective can be boiled down to a cascade of decisions. At a fundamental level, operating a vehicle involves a continuous stream of decisions involving accelerating, braking, and turning. Financial trading revolves around decisions to buy, sell, or hold various assets. Even complex engineering tasks, such as designing an aerofoil, involve a sequence of decisions about adjusting design variables to achieve desirable aerodynamic characteristics.&lt;/p&gt;
&lt;p&gt;Yet, the intricacies of decision-making surpass what any single advancement in deep learning can address. While convolutional neural networks (CNNs) can facilitate object detection tasks in autonomous vehicles, recurrent neural networks (RNNs) can aid in forecasting market dynamics for systematic trading, and physics-informed NNs can assist in predicting aerodynamic effects, it remains the case that no target or quantity of interest can be entirely known or predictable (indeed, if they were, the pursuit of predictive modelling and ML would be superfluous). Instead, predictions often prove unreliable, or at best, &lt;em&gt;uncertain&lt;/em&gt;, due to the limitations of our knowledge and the complexity and variability inherent in the underlying real-world processes. The impressive power of deep learning models often overshadows their ignorance of the limits of their own knowledge and the extent of uncertainty in their predictions. When these predictions are integrated into a sequential decision-making framework, such uncertainty can amplify, compound, and lead to catastrophic consequences. In the context of aeronautical engineering, this could result in inefficient designs; in quantitative finance, it can lead to devastating capital losses; and in autonomous driving, it can even cost lives.&lt;/p&gt;
&lt;h4 id="probabilistic-machine-learning"&gt;Probabilistic Machine Learning&lt;/h4&gt;
&lt;p&gt;Grounded in the laws of probability and Bayesian statistics (
;
), &lt;em&gt;probabilistic&lt;/em&gt; ML provides a consistent framework for systematically reasoning about the unknown. The probabilistic approach to ML acknowledges that the real world is fraught with uncertainty and embraces this uncertainty as an inherent part of decision-making. Unlike traditional methods, including those of deep learning, it recognises model predictions not as absolute truths that can be represented as single &lt;em&gt;point estimates&lt;/em&gt; produced from a deterministic mapping, but as full &lt;em&gt;probability distributions&lt;/em&gt; that capture the potential outcomes of a random variable as it propagates through some underlying data-generating process. In a &lt;em&gt;probabilistic model&lt;/em&gt;, all quantities are treated as random variables governed by probability distributions – the data are treated as observed variables, which are influenced by some underlying hidden variables, e.g., the model parameters. A prior distribution is used to express reasonable values for these hidden variables and to eliminate implausible ones. The relationship between observed and hidden variables is described using the likelihood, and the process of Bayesian inference amounts to calculating, using basic laws of probability, a posterior distribution over the hidden factors conditioned on the observed data, which can be seen as a refinement of the prior beliefs in light of new evidence. While the posterior distribution can be useful in and of itself, its primary role lies in facilitating subsequent prediction and decision-making by providing full probability distributions over predicted outcomes. This capability allows the decision-maker to assess the range of possible scenarios and their associated probabilities, enabling a more nuanced understanding of uncertainty and risk, which is indispensable in complex, dynamic environments where the repercussions of incorrect decisions can be severe. In essence, probabilistic ML equips autonomous decision-making systems with a probabilistic worldview, enabling them to navigate ambiguity and make sound decisions in the face of imperfect information.&lt;/p&gt;
&lt;h4 id="probabilistic-ml-vs-deep-learning"&gt;Probabilistic ML vs. Deep Learning&lt;/h4&gt;
&lt;p&gt;While deep learning has dominated recent AI advances, probabilistic ML remains as important as ever and continues to offer valuable tools for addressing AI challenges that can not be fully resolved by deep learning alone. Although both approaches can be combined to create hybrid methods that leverage their respective strengths, some defining characteristics have traditionally set deep learning apart from probabilistic ML. Perhaps most notably, probabilistic ML approaches can achieve remarkable predictive performance even when data is scarce. In contrast, deep learning models tend to be data-intensive by nature, often demanding datasets of a scale proportional to their size (i.e., their parameter count) (
), which has seen explosive growth in recent years (
;
;
;
;
). With that being said, inference in many probabilistic models poses computational problems that are difficult to scale. On the other hand, deep learning approaches have excelled in scalability, a key factor contributing to their widespread success. This scalability is bolstered by their compatibility with various speed-enhancing mechanisms such as stochastic optimisation, specialised hardware accelerators (GPUs and TPUs), as well as distributed and/or cloud-based computing infrastructure. To bridge this gap, substantial research effort has been devoted to enabling probabilistic ML to benefit from these advantages through optimisation-based approximations to Bayesian inference (
).&lt;/p&gt;
&lt;p&gt;Moreover, as mentioned earlier, these paradigms are by no means mutually exclusive. Indeed, it is often possible to directly extend existing models with a Bayesian treatment of their parameters, adding a layer of probabilistic reasoning to the model, and allowing it to not only make predictions but also estimate the uncertainty associated with those predictions. An excellent example is the BNN, which treats the weights as hidden variables and leverages posterior inference to provide predictions while estimating associated uncertainties, delivering a more robust and principled approach to deep learning (
;
;
).&lt;/p&gt;
&lt;p&gt;The Bayesian formalism naturally gives rise to many popular methods and paradigms, often in the form of point estimates or other kinds of approximations. The quintessential example of this is found in linear regression, in particular, in ridge and lasso regression (
), which correspond variously to maximum &lt;em&gt;a posteriori&lt;/em&gt; (MAP) estimates in Bayesian linear regression (BLR) models with prior distributions possessing different sparsity-inducing characteristics (
) – more broadly, mitigations against over-fitting tend to arise organically in Bayesian methods, which is why they are frequently characterised as being fundamentally more robust against over-fitting (
). Likewise, the once &lt;em&gt;à la mode&lt;/em&gt; support vector machines (SVMs) can be seen as MAP estimates for a class of nonparametric Bayesian models (
), dropout (
) in NNs can be seen as a variational approximation to exact inference in BNNs (
), and unsupervised learning methods such as factor analysis (FA) (
) and principal component analysis (PCA) (
) are instances of a class of LVMs (
;
) known as linear-Gaussian factor models (
), to name just a few examples. Time and again, classical approaches have not only benefitted from being viewed through the Bayesian perspective but have also been enriched and redefined by the depth of insights this framework provides.&lt;/p&gt;
&lt;h3 id="thesis-goals"&gt;Thesis Goals&lt;/h3&gt;
&lt;p&gt;The over-arching goal of this thesis is to continue advancing the integration and cross-pollination between deep learning and probabilistic ML. We aim to further the interplay between these two fields, both by incorporating probabilistic interpretations and uncertainty quantification into popular deep learning frameworks, and by leveraging the representational power of deep NNs to improve established Bayesian methods. This dual-pronged approach provides fresh perspectives and taps the complementary strengths of both paradigms, advancing the foundations of AI and facilitating the development of more capable and dependable decision support frameworks. Ultimately, we strive to unlock the potential of deep learning within high-impact probabilistic ML methodologies, and to lend useful Bayesian perspectives on current deep learning techniques.&lt;/p&gt;
&lt;h4 id="gaussian-process-models"&gt;Gaussian Process Models&lt;/h4&gt;
&lt;p&gt;Arguably, no family of probabilistic models embodies the ethos of probabilistic ML and illustrates its nuances and parallels with deep learning quite like the GP. Accordingly, they shall occupy a prominent place in our thesis. In particular, GPs stand out as the ideal choice when dealing with limited data, offer the flexibility to encode prior beliefs through the covariance function, and provide predictive uncertainty estimates with a fine calibration that is second to none. Conversely, they are challenging to scale to large datasets, a limitation that has spurred extensive research and development efforts. Furthermore, in contrast to deep learning models, which are often lauded for their ability to automatically uncover valuable patterns and features in data, GPs have at times been dismissed as unsophisticated smoothing mechanisms (
). Despite these apparent disparities, GPs are intricately connected to NNs in numerous ways. Among these, one of the most classical and well-known relationships is the convergence of single-layer NNs with randomly initialised weights toward GPs in the infinite-width limit (
). Similar links have also been identified between GPs and infinitely wide &lt;em&gt;deep&lt;/em&gt; NNs (
;
).&lt;/p&gt;
&lt;p&gt;In an effort to elevate the representational capabilities of GPs to a level comparable with deep NNs, DGPs (
) stack together multiple layers of GPs. Additional efforts to construct efficient sparse GP approximations have leveraged the advantageous properties of computations on the hypersphere (
), which has led to deep GP (DGP) models in which the propagation of posterior predictive means is equivalent to a forward pass through a deep neural network (NN) (
;
). Notably, as a side effect, this model effectively provides uncertainty estimates for deep NN through its predictive variance. Among the contributions of our thesis is the further development of this framework, integrating cutting-edge techniques (
;
) to address some of its practical limitations, thereby narrowing the performance gap between GPs and deep NNs.&lt;/p&gt;
&lt;p&gt;Probabilistic models, serving a crucial role as decision support tools, routinely aid scientific discovery in fields such as physics and astronomy, guiding advancements in areas of medicine and healthcare encompassing bioinformatics, epidemiology, and medical diagnosis. Beyond that, these models have wide-ranging applications in economics, econometrics, and the social sciences. Moreover, they are indispensable in various engineering disciplines, such as robotics and environmental engineering. Among the many probabilistic models, GPs stand out as a powerful driving force behind a number of important sequential decision-making frameworks, including active learning (
) and reinforcement learning (
), and the broader area of probabilistic numerics at large (
). Notably, Bayesian optimisation (BO) (
;
;
) is one major area that relies heavily on GPs and will feature extensively in our thesis.&lt;/p&gt;
&lt;h4 id="bayesian-optimisation"&gt;Bayesian optimisation&lt;/h4&gt;
&lt;p&gt;BO is a powerful methodology dedicated to the global optimisation of complex and resource-intensive objective functions. In contrast to classical optimisation methods, BO excels even when dealing with functions that lack strong assumptions or guarantees. These functions may not be convex, possess no gradients, lack a well-defined mathematical form, and observable only indirectly through noisy measurements.&lt;/p&gt;
&lt;p&gt;At its core, BO is a sequential decision-making algorithm.&lt;/p&gt;
&lt;p&gt;It relies on observations from past function evaluations to determine the next candidate location for evaluation in pursuit of optimal solutions. BO leverages a probabilistic model, often a GP, to represent its knowledge and beliefs about the unknown function. This model is continuously updated with the acquisition of each new observation, enabling the algorithm to adapt its behaviour and make sound decisions based on the evolving information.&lt;/p&gt;
&lt;p&gt;BO effectively manages uncertainty inherent in such sequential decision-making processes by making use of the probabilistic model to the fullest, harnessing the entire predictive distribution, particularly, the predictive uncertainty, to select promising candidate solutions that bring the most value to the optimisation process. This generally consists not merely of those most likely to optimise the objective function (i.e., &lt;em&gt;exploiting&lt;/em&gt; that which is known), but also those likely to reveal the most knowledge and information about the function itself (i.e., &lt;em&gt;exploring&lt;/em&gt; that which remains unknown).&lt;/p&gt;
&lt;p&gt;This pronounced emphasis on well-calibrated uncertainty distinguishes BO as one of the standout “killer apps” for GPs and a jewel in the crown of probabilistic ML applications. In practice, BO has proven instrumental across science, engineering, and industry, where efficiency and cost-effectiveness are paramount. Its applications include protein engineering (
;
), material discovery (
), experimental physics (e.g., experiments involving ultra-cold atoms (
) and free-electron lasers (
)), environmental monitoring (sensor placement) (
;
), and the design of aerodynamic aerofoils (
;
), integrated circuits (
;
), broadband high-efficiency power amplifiers (
), and fast-charging protocols for lithium-ion batteries (
). Notably, it has played a crucial role in automating the hyperparameter tuning of various ML models (
;
), especially deep learning models, thus representing yet another way in which probabilistic ML has contributed to the advancement of deep learning.&lt;/p&gt;
&lt;p&gt;However, GPs are not universally suitable for all BO problem scenarios. They are most effective when dealing with smooth, stationary functions with homoscedastic noise and a relatively modest input dimensionality. Additionally, GPs are easiest to work with for functions with a single output and purely continuous inputs. While a surprisingly wide array of real-world challenges satisfy these conditions, many high-impact problems, such as gene and protein design, which involves sequential inputs (
;
;
;
;
); NAS, which involves structured inputs with intricate conditional dependencies; and automotive safety engineering, which involve numerous constraints and multiple objectives, clearly fall outside of this scope. This is not to say that GPs cannot be extended to such challenging scenarios. However, such extensions almost always come at a cost. Consequently, it makes sense to appeal to alternative modelling paradigms more naturally suited to specific tasks, e.g., employing random forests (RFs) to handle discrete and structured inputs, or deep NNs for capturing nonstationary behaviour and dealing with multiple objectives. A major contribution of this thesis is the introduction of a new formulation of BO that seamlessly accommodates virtually any modelling paradigm, including deep learning, without any compromise.&lt;/p&gt;
&lt;h3 id="thesis-overview"&gt;Thesis Overview&lt;/h3&gt;
&lt;p&gt;The core contributions of our thesis are summarised as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-orthogonal-sparse-spherical-gp" label="item:contrib-orthogonal-sparse-spherical-gp"&gt;&lt;/span&gt; We improve upon the framework for sparse hyperspherical GP approximations that employ nonlinear activations as inter-domain inducing features. This framework serves as a bridge between GPs and NNs, with posterior predictive mean taking the form of single-layer feedforward NNs. Our thesis examines some practical issues associated with this approach and proposes an extension that takes advantage of the orthogonal decoupling of GPs to mitigate these limitations. In particular, we introduce spherical inter-domain features to construct more flexible data-dependent basis functions for both the principal and orthogonal components of the GP approximation. We demonstrate that incorporating orthogonal inducing variables under this framework not only alleviates these shortcomings but also offers superior scalability compared to alternative strategies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-cycle-bayes" label="item:contrib-cycle-bayes"&gt;&lt;/span&gt; We provide a probabilistic perspective on cycle-consistent adversarial networks (CYCLEGANs), a cutting-edge deep generative model for style transfer and image-to-image translation. Specifically, we frame the problem of learning cross-domain correspondences without paired data as Bayesian inference in a latent variable model (LVM), in which the goal is to uncover the hidden representations of entities from one domain as entities in another. First, we introduce implicit LVMs, which allow flexible prior specification over latent representations as implicit distributions. Next, we develop a new variational inference (VI) framework that minimises a symmetrised statistical divergence between the variational and true joint distributions. Finally, we show that CYCLEGANs emerge as a closely-related variant of our framework, providing a useful interpretation as a Bayesian approximation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span id="item:contrib-bore" label="item:contrib-bore"&gt;&lt;/span&gt; We introduce a model-agnostic formulation of BO based on classification. Building on the established links between class-probability estimation (CPE), density-ratio estimation (DRE), and the improvement-based acquisition functions, we reformulate the acquisition function as a binary classifier over candidate solutions. This approach eliminates the need for an explicit probabilistic model of the objective function and casts aside the limitations of tractability constraints. As a result, our model-agnostic BO approach substantially broadens its applicability across diverse problem scenarios, accommodating flexible and scalable modelling paradigms such as deep learning without necessitating approximations or sacrificing expressive and representational capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Accordingly, our thesis is organised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Chapter 2 (Background) lays the necessary groundwork for our thesis. We begin by outlining the fundamental principles of probability and Bayesian statistics, which form the basis of probabilistic ML. Additionally, we introduce the widely-adopted method of approximate Bayesian inference known as VI. Our discussion underscores the central role played by statistical divergences, prompting us to delve into a larger family of divergences and motivating our discussion of DRE. With a solid foundation in place, we shift our focus to GPs, providing an introductory overview and highlighting the most commonly-used sparse approximations. Finally, we conclude this background chapter by introducing the basic concepts behind BO.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 3 (Orthogonally-Decoupled Sparse GPs with Spherical Inducing Features) examines orthogonally-decoupled sparse GPs with spherical NN activation features, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 4 (Cycle-Consistent Adversarial Learning as Bayesian Inference) examines from the perspective of approximate Bayesian inference, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 5 (Bayesian Optimization by Density-Ratio Estimation) examines our model-agnostic approach to BO based on binary classification and DRE, as summarised in the corresponding item above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chapter 6 (Conclusion) brings this thesis to a close by reflecting on our main contributions and situating them in the broader landscape of probabilistic methods in ML. Finally, we conclude by presenting our outlook on the avenues for future research and development in this rapidly evolving field.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="references"&gt;References&lt;/h3&gt;
&lt;div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0" line-spacing="2"&gt;
&lt;div id="ref-anil2023palm" class="csl-entry"&gt;
&lt;p&gt;Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. (2023). Palm 2 technical report. &lt;em&gt;arXiv Preprint arXiv:2305.10403&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-attia2020closed" class="csl-entry"&gt;
&lt;p&gt;Attia, P. M., Grover, A., Jin, N., Severson, K. A., Markov, T. M., Liao, Y.-H., Chen, M. H., Cheong, B., Perkins, N., Yang, Z., et al. (2020). Closed-loop optimization of fast-charging protocols for batteries with machine learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;578&lt;/em&gt;(7795), 397–402.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bartholomew2011latent" class="csl-entry"&gt;
&lt;p&gt;Bartholomew, D. J., Knott, M., &amp;amp; Moustaki, I. (2011). &lt;em&gt;Latent variable models and factor analysis: A unified approach&lt;/em&gt;. John Wiley &amp;amp; Sons.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-bayes1763lii" class="csl-entry"&gt;
&lt;p&gt;Bayes, T. (1763). LII. An essay towards solving a problem in the doctrine of chances. By the late rev. Mr. Bayes, FRS communicated by mr. Price, in a letter to john canton, AMFR s. &lt;em&gt;Philosophical Transactions of the Royal Society of London&lt;/em&gt;, &lt;em&gt;53&lt;/em&gt;, 370–418.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-blundell2015weight" class="csl-entry"&gt;
&lt;p&gt;Blundell, C., Cornebise, J., Kavukcuoglu, K., &amp;amp; Wierstra, D. (2015). Weight uncertainty in neural network. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1613–1622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brochu2010tutorial" class="csl-entry"&gt;
&lt;p&gt;Brochu, E., Cora, V. M., &amp;amp; De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1012.2599&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-brown2020language" class="csl-entry"&gt;
&lt;p&gt;Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 1877–1901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-chen2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Chen, P., Merrick, B. M., &amp;amp; Brazil, T. J. (2015). Bayesian optimization for broadband high-efficiency power amplifier designs. &lt;em&gt;IEEE Transactions on Microwave Theory and Techniques&lt;/em&gt;, &lt;em&gt;63&lt;/em&gt;(12), 4263–4272.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-damianou2013deep" class="csl-entry"&gt;
&lt;p&gt;Damianou, A., &amp;amp; Lawrence, N. D. (2013). Deep gaussian processes. &lt;em&gt;Artificial Intelligence and Statistics&lt;/em&gt;, 207–215.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-deisenroth2011pilco" class="csl-entry"&gt;
&lt;p&gt;Deisenroth, M., &amp;amp; Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. &lt;em&gt;Proceedings of the 28th International Conference on Machine Learning (ICML-11)&lt;/em&gt;, 465–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-duris2020bayesian" class="csl-entry"&gt;
&lt;p&gt;Duris, J., Kennedy, D., Hanuka, A., Shtalenkova, J., Edelen, A., Baxevanis, P., Egger, A., Cope, T., McIntire, M., Ermon, S., et al. (2020). Bayesian optimization of a free-electron laser. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;124&lt;/em&gt;(12), 124801.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2020sparse" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Durrande, N., &amp;amp; Hensman, J. (2020). Sparse Gaussian processes with spherical harmonic features. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 2793–2802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-dutordoir2021deep" class="csl-entry"&gt;
&lt;p&gt;Dutordoir, V., Hensman, J., Wilk, M. van der, Ek, C. H., Ghahramani, Z., &amp;amp; Durrande, N. (2021). Deep neural networks as point estimates for deep Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;34&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-forrester2009recent" class="csl-entry"&gt;
&lt;p&gt;Forrester, A. I., &amp;amp; Keane, A. J. (2009). Recent advances in surrogate-based optimization. &lt;em&gt;Progress in Aerospace Sciences&lt;/em&gt;, &lt;em&gt;45&lt;/em&gt;(1-3), 50–79.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gal2016dropout" class="csl-entry"&gt;
&lt;p&gt;Gal, Y., &amp;amp; Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1050–1059.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett_bayesoptbook_2023" class="csl-entry"&gt;
&lt;p&gt;Garnett, R. (2023). &lt;em&gt;&lt;span class="nocase"&gt;Bayesian Optimization&lt;/span&gt;&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-garnett2010bayesian" class="csl-entry"&gt;
&lt;p&gt;Garnett, R., Osborne, M. A., &amp;amp; Roberts, S. J. (2010). Bayesian optimization for sensor set selection. &lt;em&gt;Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks&lt;/em&gt;, 209–219.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gelman2013bayesian" class="csl-entry"&gt;
&lt;p&gt;Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., &amp;amp; Rubin, D. B. (2013). &lt;em&gt;Bayesian data analysis&lt;/em&gt;. CRC press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-girshick2014rich" class="csl-entry"&gt;
&lt;p&gt;Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 580–587.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-gonzalez2015bayesian" class="csl-entry"&gt;
&lt;p&gt;Gonzalez, J., Longworth, J., James, D. C., &amp;amp; Lawrence, N. D. (2015). Bayesian optimization for synthetic gene design. &lt;em&gt;arXiv Preprint arXiv:1505.01627&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-goodfellow2014generative" class="csl-entry"&gt;
&lt;p&gt;Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., &amp;amp; Bengio, Y. (2014). Generative adversarial networks. &lt;em&gt;arXiv Preprint arXiv:1406.2661&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-graves2013speech" class="csl-entry"&gt;
&lt;p&gt;Graves, A., Mohamed, A., &amp;amp; Hinton, G. (2013). Speech recognition with deep recurrent neural networks. &lt;em&gt;2013 IEEE International Conference on Acoustics, Speech and Signal Processing&lt;/em&gt;, 6645–6649.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hennig2022probabilistic" class="csl-entry"&gt;
&lt;p&gt;Hennig, P., Osborne, M. A., &amp;amp; Kersting, H. P. (2022). &lt;em&gt;Probabilistic numerics&lt;/em&gt;. Cambridge University Press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hie2022adaptive" class="csl-entry"&gt;
&lt;p&gt;Hie, B. L., &amp;amp; Yang, K. K. (2022). Adaptive machine learning for protein engineering. &lt;em&gt;Current Opinion in Structural Biology&lt;/em&gt;, &lt;em&gt;72&lt;/em&gt;, 145–152.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hinton2012deep" class="csl-entry"&gt;
&lt;p&gt;Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. &lt;em&gt;IEEE Signal Processing Magazine&lt;/em&gt;, &lt;em&gt;29&lt;/em&gt;(6), 82–97.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ho2020denoising" class="csl-entry"&gt;
&lt;p&gt;Ho, J., Jain, A., &amp;amp; Abbeel, P. (2020). Denoising diffusion probabilistic models. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;33&lt;/em&gt;, 6840–6851.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-hoffmann2022training" class="csl-entry"&gt;
&lt;p&gt;Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). Training compute-optimal large language models. &lt;em&gt;arXiv Preprint arXiv:2203.15556&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-houlsby2011bayesian" class="csl-entry"&gt;
&lt;p&gt;Houlsby, N., Huszár, F., Ghahramani, Z., &amp;amp; Lengyel, M. (2011). Bayesian active learning for classification and preference learning. &lt;em&gt;arXiv Preprint arXiv:1112.5745&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jordan1998introduction" class="csl-entry"&gt;
&lt;p&gt;Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., &amp;amp; Saul, L. K. (1998). An introduction to variational methods for graphical models. &lt;em&gt;Learning in Graphical Models&lt;/em&gt;, 105–161.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-jumper2021highly" class="csl-entry"&gt;
&lt;p&gt;Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;596&lt;/em&gt;(7873), 583–589.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-krizhevsky2012imagenet" class="csl-entry"&gt;
&lt;p&gt;Krizhevsky, A., Sutskever, I., &amp;amp; Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lam2018advances" class="csl-entry"&gt;
&lt;p&gt;Lam, R., Poloczek, M., Frazier, P., &amp;amp; Willcox, K. E. (2018). Advances in bayesian optimization with applications in aerospace engineering. &lt;em&gt;2018 AIAA Non-Deterministic Approaches Conference&lt;/em&gt;, 1656.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-laplace1814theorie" class="csl-entry"&gt;
&lt;p&gt;Laplace, P. S. (1814). &lt;em&gt;Théorie analytique des probabilités&lt;/em&gt;. Courcier.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lee2017deep" class="csl-entry"&gt;
&lt;p&gt;Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., &amp;amp; Sohl-Dickstein, J. (2017). Deep neural networks as gaussian processes. &lt;em&gt;arXiv Preprint arXiv:1711.00165&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lillicrap2015continuous" class="csl-entry"&gt;
&lt;p&gt;Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., &amp;amp; Wierstra, D. (2015). Continuous control with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1509.02971&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-lyu2017efficient" class="csl-entry"&gt;
&lt;p&gt;Lyu, W., Xue, P., Yang, F., Yan, C., Hong, Z., Zeng, X., &amp;amp; Zhou, D. (2017). An efficient bayesian optimization approach for automated optimization of analog circuits. &lt;em&gt;IEEE Transactions on Circuits and Systems I: Regular Papers&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 1954–1967.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay1992practical" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (1992). A practical bayesian framework for backpropagation networks. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;4&lt;/em&gt;(3), 448–472.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mackay2003information" class="csl-entry"&gt;
&lt;p&gt;MacKay, D. J. (2003). &lt;em&gt;Information theory, inference and learning algorithms&lt;/em&gt;. Cambridge university press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-marchant2012bayesian" class="csl-entry"&gt;
&lt;p&gt;Marchant, R., &amp;amp; Ramos, F. (2012). Bayesian optimisation for intelligent environmental monitoring. &lt;em&gt;2012 IEEE/RSJ International Conference on Intelligent Robots and Systems&lt;/em&gt;, 2242–2249.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-matthews2018gaussian" class="csl-entry"&gt;
&lt;p&gt;Matthews, A. G. de G., Rowland, M., Hron, J., Turner, R. E., &amp;amp; Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. &lt;em&gt;arXiv Preprint arXiv:1804.11271&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mcculloch1943logical" class="csl-entry"&gt;
&lt;p&gt;McCulloch, W. S., &amp;amp; Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. &lt;em&gt;The Bulletin of Mathematical Biophysics&lt;/em&gt;, &lt;em&gt;5&lt;/em&gt;, 115–133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2013playing" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &amp;amp; Riedmiller, M. (2013). Playing atari with deep reinforcement learning. &lt;em&gt;arXiv Preprint arXiv:1312.5602&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-mnih2015human" class="csl-entry"&gt;
&lt;p&gt;Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;518&lt;/em&gt;(7540), 529–533.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-moss2020boss" class="csl-entry"&gt;
&lt;p&gt;Moss, H. B., Beck, D., González, J., Leslie, D. S., &amp;amp; Rayson, P. (2020). BOSS: Bayesian optimization over string spaces. &lt;em&gt;arXiv Preprint arXiv:2010.00979&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-neal1995bayesian" class="csl-entry"&gt;
&lt;p&gt;Neal, R. M. (1995). &lt;em&gt;BAYESIAN LEARNING FOR NEURAL NETWORKS&lt;/em&gt; &lt;/p&gt;
\[PhD thesis\]&lt;p&gt;. University of Toronto.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-openai2023gpt" class="csl-entry"&gt;
&lt;p&gt;OpenAI, R. (2023). GPT-4 technical report. &lt;em&gt;arXiv&lt;/em&gt;, 2303–08774.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-opper2000gaussian" class="csl-entry"&gt;
&lt;p&gt;Opper, M., &amp;amp; Winther, O. (2000). &lt;em&gt;Gaussian processes and SVM: Mean field results and leave-one-out&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-pearson1901liii" class="csl-entry"&gt;
&lt;p&gt;Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. &lt;em&gt;The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science&lt;/em&gt;, &lt;em&gt;2&lt;/em&gt;(11), 559–572.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rae2021scaling" class="csl-entry"&gt;
&lt;p&gt;Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis &amp;amp; insights from training gopher. &lt;em&gt;arXiv Preprint arXiv:2112.11446&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ramesh2022hierarchical" class="csl-entry"&gt;
&lt;p&gt;Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., &amp;amp; Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. &lt;em&gt;arXiv Preprint arXiv:2204.06125&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt;(2), 3.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-10.7551/mitpress/3206.001.0001" class="csl-entry"&gt;
&lt;p&gt;Rasmussen, C. E., &amp;amp; Williams, C. K. I. (2005). &lt;em&gt;&lt;span class="nocase"&gt;Gaussian Processes for Machine Learning&lt;/span&gt;&lt;/em&gt;. The MIT Press.
&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-redmon2016you" class="csl-entry"&gt;
&lt;p&gt;Redmon, J., Divvala, S., Girshick, R., &amp;amp; Farhadi, A. (2016). You only look once: Unified, real-time object detection. &lt;em&gt;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 779–788.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rombach2022high" class="csl-entry"&gt;
&lt;p&gt;Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp;amp; Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. &lt;em&gt;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition&lt;/em&gt;, 10684–10695.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-romero2013navigating" class="csl-entry"&gt;
&lt;p&gt;Romero, P. A., Krause, A., &amp;amp; Arnold, F. H. (2013). Navigating the protein fitness landscape with gaussian processes. &lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt;, &lt;em&gt;110&lt;/em&gt;(3), E193–E201.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-ronneberger2015u" class="csl-entry"&gt;
&lt;p&gt;Ronneberger, O., Fischer, P., &amp;amp; Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. &lt;em&gt;Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18&lt;/em&gt;, 234–241.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-rosenblatt1958perceptron" class="csl-entry"&gt;
&lt;p&gt;Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. &lt;em&gt;Psychological Review&lt;/em&gt;, &lt;em&gt;65&lt;/em&gt;(6), 386.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-roweis1999unifying" class="csl-entry"&gt;
&lt;p&gt;Roweis, S., &amp;amp; Ghahramani, Z. (1999). A unifying review of linear gaussian models. &lt;em&gt;Neural Computation&lt;/em&gt;, &lt;em&gt;11&lt;/em&gt;(2), 305–345.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-salimbeni2018orthogonally" class="csl-entry"&gt;
&lt;p&gt;Salimbeni, H., Cheng, C.-A., Boots, B., &amp;amp; Deisenroth, M. (2018). Orthogonally decoupled variational Gaussian processes. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;31&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-seko2015prediction" class="csl-entry"&gt;
&lt;p&gt;Seko, A., Togo, A., Hayashi, H., Tsuda, K., Chaput, L., &amp;amp; Tanaka, I. (2015). Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and bayesian optimization. &lt;em&gt;Physical Review Letters&lt;/em&gt;, &lt;em&gt;115&lt;/em&gt;(20), 205901.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shahriari2015taking" class="csl-entry"&gt;
&lt;p&gt;Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., &amp;amp; De Freitas, N. (2015). Taking the human out of the loop: A review of bayesian optimization. &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, &lt;em&gt;104&lt;/em&gt;(1), 148–175.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shi2020sparse" class="csl-entry"&gt;
&lt;p&gt;Shi, J., Titsias, M., &amp;amp; Mnih, A. (2020). Sparse orthogonal variational inference for Gaussian processes. &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 1932–1942.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-shoeybi2019megatron" class="csl-entry"&gt;
&lt;p&gt;Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., &amp;amp; Catanzaro, B. (2019). Megatron-lm: Training multi-billion parameter language models using model parallelism. &lt;em&gt;arXiv Preprint arXiv:1909.08053&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-silver2016mastering" class="csl-entry"&gt;
&lt;p&gt;Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. &lt;em&gt;Nature&lt;/em&gt;, &lt;em&gt;529&lt;/em&gt;(7587), 484–489.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-snoek2012practical" class="csl-entry"&gt;
&lt;p&gt;Snoek, J., Larochelle, H., &amp;amp; Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;25&lt;/em&gt;, 2951–2959.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-spearman1904general" class="csl-entry"&gt;
&lt;p&gt;Spearman, C. (1904). &amp;quot; general intelligence,&amp;quot; objectively determined and measured. &lt;em&gt;The American Journal of Psychology&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(2), 201–292.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-srivastava2014dropout" class="csl-entry"&gt;
&lt;p&gt;Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &amp;amp; Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. &lt;em&gt;The Journal of Machine Learning Research&lt;/em&gt;, &lt;em&gt;15&lt;/em&gt;(1), 1929–1958.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-sun2020neural" class="csl-entry"&gt;
&lt;p&gt;Sun, S., Shi, J., &amp;amp; Grosse, R. B. (2020). Neural networks as inter-domain inducing points. &lt;em&gt;Third Symposium on Advances in Approximate Bayesian Inference&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tibshirani1996regression" class="csl-entry"&gt;
&lt;p&gt;Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. &lt;em&gt;Journal of the Royal Statistical Society Series B: Statistical Methodology&lt;/em&gt;, &lt;em&gt;58&lt;/em&gt;(1), 267–288.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-tipping1999probabilistic" class="csl-entry"&gt;
&lt;p&gt;Tipping, M. E., &amp;amp; Bishop, C. M. (1999). Probabilistic principal component analysis. &lt;em&gt;Journal of the Royal Statistical Society: Series B (Statistical Methodology)&lt;/em&gt;, &lt;em&gt;61&lt;/em&gt;(3), 611–622.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-torun2018global" class="csl-entry"&gt;
&lt;p&gt;Torun, H. M., Swaminathan, M., Davis, A. K., &amp;amp; Bellaredj, M. L. F. (2018). A global bayesian optimization algorithm and its application to integrated system design. &lt;em&gt;IEEE Transactions on Very Large Scale Integration (VLSI) Systems&lt;/em&gt;, &lt;em&gt;26&lt;/em&gt;(4), 792–802.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-touvron2023llama" class="csl-entry"&gt;
&lt;p&gt;Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. &lt;em&gt;arXiv Preprint arXiv:2307.09288&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-turner2021bayesian" class="csl-entry"&gt;
&lt;p&gt;Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., &amp;amp; Guyon, I. (2021). Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. &lt;em&gt;NeurIPS 2020 Competition and Demonstration Track&lt;/em&gt;, 3–26.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-vaswani2017attention" class="csl-entry"&gt;
&lt;p&gt;Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp;amp; Polosukhin, I. (2017). Attention is all you need. &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, &lt;em&gt;30&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-wigley2016fast" class="csl-entry"&gt;
&lt;p&gt;Wigley, P. B., Everitt, P. J., Hengel, A. van den, Bastian, J. W., Sooriyabandara, M. A., McDonald, G. D., Hardman, K. S., Quinlivan, C. D., Manju, P., Kuhn, C. C., et al. (2016). Fast machine-learning online optimization of ultra-cold-atom experiments. &lt;em&gt;Scientific Reports&lt;/em&gt;, &lt;em&gt;6&lt;/em&gt;(1), 25890.&lt;/p&gt;
&lt;/div&gt;
&lt;div id="ref-yang2019machine" class="csl-entry"&gt;
&lt;p&gt;Yang, K. K., Wu, Z., &amp;amp; Arnold, F. H. (2019). Machine-learning-guided directed evolution for protein engineering. &lt;em&gt;Nature Methods&lt;/em&gt;, &lt;em&gt;16&lt;/em&gt;(8), 687–694.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>GPflux</title><link>https://tiao.io/projects/gpflux/</link><pubDate>Wed, 01 Sep 2021 00:00:00 +0000</pubDate><guid>https://tiao.io/projects/gpflux/</guid><description>&lt;p&gt;
is a TensorFlow/Keras
framework for Deep
, developed
at Secondmind Labs. It builds on
and exposes
Deep GP layers as familiar Keras building blocks, making it easier to compose
deep
.&lt;/p&gt;
&lt;p&gt;Contributed during my doctoral student researcher appointment at Secondmind
Labs, alongside Vincent Dutordoir, ST John, and other members of the lab.&lt;/p&gt;</description></item><item><title>AutoGluon</title><link>https://tiao.io/projects/autogluon/</link><pubDate>Sun, 01 Sep 2019 00:00:00 +0000</pubDate><guid>https://tiao.io/projects/autogluon/</guid><description>&lt;p&gt;
is an open-source
toolkit from AWS that automates ML for tabular, image, and text data. During
my AWS Berlin internship I was a core developer of the
-based
searcher module — described in
and later forming the basis of
.&lt;/p&gt;</description></item><item><title>Aboleth</title><link>https://tiao.io/projects/aboleth/</link><pubDate>Thu, 01 Jun 2017 00:00:00 +0000</pubDate><guid>https://tiao.io/projects/aboleth/</guid><description>&lt;p&gt;
is a minimalistic
TensorFlow framework for scalable
and
approximation, focused on
techniques. Built at CSIRO Data61
with Daniel Steinberg and Lachlan McCalman.&lt;/p&gt;</description></item><item><title>A Tutorial on Variational Autoencoders with a Concise Keras Implementation</title><link>https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/</link><pubDate>Wed, 20 Apr 2016 00:00:00 +0000</pubDate><guid>https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/</guid><description>&lt;p&gt;
is awesome. It is a very well-designed library that clearly abides by
its
of modularity and extensibility, enabling us to
easily assemble powerful, complex models from primitive building blocks.
This has been demonstrated in numerous blog posts and tutorials, in particular,
the excellent tutorial on
.
As the name suggests, that tutorial provides examples of how to implement
various kinds of autoencoders in Keras, including the variational autoencoder
(VAE)&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Like all autoencoders, the variational autoencoder is primarily used for
unsupervised learning of hidden representations.
However, they are fundamentally different to your usual neural network-based
autoencoder in that they approach the problem from a probabilistic perspective.
They specify a joint distribution over the observed and latent variables, and
approximate the intractable posterior conditional density over latent
variables with variational inference, using an &lt;em&gt;inference network&lt;/em&gt;
&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt; (or more classically, a &lt;em&gt;recognition model&lt;/em&gt;
&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;) to amortize the cost of inference.&lt;/p&gt;
&lt;p&gt;While the examples in the aforementioned tutorial do well to showcase the
versatility of Keras on a wide range of autoencoder model architectures,
doesn&amp;rsquo;t properly take
advantage of Keras&amp;rsquo; modular design, making it difficult to generalize and
extend in important ways. As we will see, it relies on implementing custom
layers and constructs that are restricted to a specific instance of
variational autoencoders. This is a shame because when combined, Keras&amp;rsquo;
building blocks are powerful enough to encapsulate most variants of the
variational autoencoder and more generally, recognition-generative model
combinations for which the generative model belongs to a large family of
&lt;em&gt;deep latent Gaussian models&lt;/em&gt; (DLGMs)&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;The goal of this post is to propose a clean and elegant alternative
implementation that takes better advantage of Keras&amp;rsquo; modular design.
It is not intended as tutorial on variational autoencoders &lt;sup id="fnref:6"&gt;&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref"&gt;6&lt;/a&gt;&lt;/sup&gt;.
Rather, we study variational autoencoders as a special case of variational
inference in deep latent Gaussian models using inference networks, and
demonstrate how we can use Keras to implement them in a modular fashion such
that they can be easily adapted to approximate inference in tasks beyond
unsupervised learning, and with complicated (non-Gaussian) likelihoods.&lt;/p&gt;
&lt;p&gt;This first post will lay the groundwork for a series of future posts that
explore ways to extend this basic modular framework to implement the
cutting-edge methods proposed in the latest research, such as the normalizing
flows for building richer posterior approximations &lt;sup id="fnref:7"&gt;&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref"&gt;7&lt;/a&gt;&lt;/sup&gt;, importance
weighted autoencoders &lt;sup id="fnref:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;, the Gumbel-softmax trick for inference in
discrete latent variables &lt;sup id="fnref:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, and even the most recent GAN-based
density-ratio estimation techniques for likelihood-free inference
&lt;sup id="fnref:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;h1 id="model-specification"&gt;Model specification&lt;/h1&gt;
&lt;p&gt;First, it is important to understand that the variational autoencoder
.
Rather, the generative model is a component of the variational autoencoder and
is, in general, a deep latent Gaussian model.
In particular, let $\mathbf{x}$ be a local observed variable and
$\mathbf{z}$ its corresponding local latent variable, with joint
distribution&lt;/p&gt;
$$
p_{\theta}(\mathbf{x}, \mathbf{z})
= p_{\theta}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}).
$$&lt;p&gt;In Bayesian modelling, we assume the distribution of observed variables to be
governed by the latent variables. Latent variables are drawn from a prior
density $p(\mathbf{z})$ and related to the observations through the
likelihood $p_{\theta}(\mathbf{x} | \mathbf{z})$.
Deep latent Gaussian models (DLGMs) are a general class of models where the
observed variable is governed by a &lt;em&gt;hierarchy&lt;/em&gt; of latent variables, and the
latent variables at each level of the hierarchy are Gaussian &lt;em&gt;a priori&lt;/em&gt;
&lt;sup id="fnref1:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;In a typical instance of the variational autoencoder, we have only a single
layer of latent variables with a Normal prior distribution,&lt;/p&gt;
$$
p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}).
$$&lt;p&gt;Now, each local latent variable is related to its corresponding observation
through the likelihood $p_{\theta}(\mathbf{x} | \mathbf{z})$, which can
be viewed as a &lt;em&gt;probabilistic&lt;/em&gt; decoder. Given a hidden lower-dimensional
representation (or &amp;ldquo;code&amp;rdquo;) $\mathbf{z}$, it &amp;ldquo;decodes&amp;rdquo; it into a
&lt;em&gt;distribution&lt;/em&gt; over the observation $\mathbf{x}$.&lt;/p&gt;
&lt;h2 id="decoder"&gt;Decoder&lt;/h2&gt;
&lt;p&gt;In this example, we define $p_{\theta}(\mathbf{x} | \mathbf{z})$ to be a
multivariate Bernoulli whose probabilities are computed from $\mathbf{z}$ using
a fully-connected neural network with a single hidden layer,&lt;/p&gt;
$$
\begin{align*}
p_{\theta}(\mathbf{x} | \mathbf{z})
&amp; = \mathrm{Bern}( \sigma( \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2 ) ), \newline
\mathbf{h}
&amp; = h(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1),
\end{align*}
$$&lt;p&gt;where $\sigma$ is the logistic sigmoid function, $h$ is some non-linearity, and
the model parameters
$\theta = \{ \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2 \}$
consist of the weights and biases of this neural network.&lt;/p&gt;
&lt;p&gt;It is straightforward to implement this in Keras with the
:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can view a summary of the model parameters $\theta$ by calling
&lt;code&gt;decoder.summary()&lt;/code&gt;. Additionally, you can produce a high-level diagram of
the network architecture, and optionally the input and output shapes of each
layer using
from the
&lt;code&gt;keras.utils.vis_utils&lt;/code&gt; module. Although our architecture is about as
simple as it gets, it is included in the figure below as an example of what
the diagrams look like.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Decoder architecture"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/decoder.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Note that by fixing $\mathbf{W}_1$, $\mathbf{b}_1$ and $h$ to be the identity
matrix, the zero vector, and the identity function, respectively (or
equivalently dropping the first &lt;code&gt;Dense&lt;/code&gt; layer in the snippet above
altogether), we recover &lt;em&gt;logistic factor analysis&lt;/em&gt;.
With similarly minor modifications, we can recover other members from the
family of DLGMs, which include &lt;em&gt;non-linear factor analysis&lt;/em&gt;,
&lt;em&gt;non-linear Gaussian belief networks&lt;/em&gt;, &lt;em&gt;sigmoid belief networks&lt;/em&gt;, and many
others &lt;sup id="fnref2:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Having specified how the probabilities are computed, we can now define the
negative log likelihood of a Bernoulli $- \log p_{\theta}(\mathbf{x}|\mathbf{z})$, which is in fact equivalent to the
:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# keras.losses.binary_crossentropy gives the mean&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# over the last axis. we require the sum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As we discuss later, this will not be the loss we ultimately minimize, but will
constitute the data-fitting term of our final loss.&lt;/p&gt;
&lt;p&gt;Note this is a valid definition of a
,
which is required to compile and optimize a model. It is a symbolic function
that returns a scalar for each data-point in &lt;code&gt;y_true&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt;.
In our example, &lt;code&gt;y_pred&lt;/code&gt; will be the output of our &lt;code&gt;decoder&lt;/code&gt; network, which
are the predicted probabilities, and &lt;code&gt;y_true&lt;/code&gt; will be the true probabilities.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-using-tensorflow-distributions-in-loss"&gt;Side note: Using TensorFlow Distributions in loss&lt;/h4&gt;
&lt;p&gt;If you are using the TensorFlow backend, you can directly use the (negative)
log probability of &lt;code&gt;Bernoulli&lt;/code&gt; from TensorFlow Distributions as a Keras
loss, as I demonstrate in my post on
.&lt;/p&gt;
&lt;p&gt;Specifically we can define the loss as,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_prob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is exactly equivalent to the previous definition, but does not call
&lt;code&gt;K.binary_crossentropy&lt;/code&gt; directly.&lt;/p&gt;
&lt;hr&gt;
&lt;h1 id="inference"&gt;Inference&lt;/h1&gt;
&lt;p&gt;Having specified the generative process, we would now like to perform inference
on the latent variables and model parameters $\mathbf{z}$ and $\theta$,
respectively.
In particular, our goal is to compute the posterior
$p_{\theta}(\mathbf{z} | \mathbf{x})$, the conditional density of the latent
variable $\mathbf{z}$ given observed variable $\mathbf{x}$.
Additionally, we wish to optimize the model parameters $\theta$ with respect to
the marginal likelihood $p_{\theta}(\mathbf{x})$.
Both depend on the marginal likelihood, whose calculation requires marginalizing
out the latent variables $\mathbf{z}$. In general, this is computational
intractable, requiring exponential time to compute, or it is analytically
intractable and cannot be evaluated in closed-form. In our case, we suffer from
the latter intractability, since our prior is Gaussian non-conjugate to the
Bernoulli likelihood.&lt;/p&gt;
&lt;p&gt;To circumvent this intractability we turn to &lt;em&gt;variational inference&lt;/em&gt;, which
formulates inference as an optimization problem. It seeks an approximate
posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ closest in Kullback-Leibler
(KL) divergence to the true posterior. More precisely, the approximate posterior
is parameterized by &lt;em&gt;variational parameters&lt;/em&gt; $\phi$, and we seek a setting
of these parameters that minimizes the aforementioned KL divergence,&lt;/p&gt;
$$
\phi^* = \mathrm{argmin}_{\phi}
\mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]
$$&lt;p&gt;With the luck we&amp;rsquo;ve had so far, it shouldn&amp;rsquo;t come as a surprise anymore that
&lt;em&gt;this too&lt;/em&gt; is intractable. It also depends on the log marginal likelihood,
whose intractability is the reason we appealed to approximate inference in the
first place. Instead, we &lt;em&gt;maximize&lt;/em&gt; an alternative objective function, the
&lt;em&gt;evidence lower bound&lt;/em&gt; (ELBO), which is expressed as&lt;/p&gt;
$$
\begin{align*}
\mathrm{ELBO}(q)
&amp; =
\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [
\log p_{\theta}(\mathbf{x} | \mathbf{z}) +
\log p(\mathbf{z}) -
\log q_{\phi}(\mathbf{z} | \mathbf{x})
] \newline
&amp; =
\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}
[ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]
-\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ].
\end{align*}
$$&lt;p&gt;Importantly, the ELBO is a lower bound to the log marginal likelihood.
Therefore, maximizing it with respect to the model parameters $\theta$
approximately maximizes the log marginal likelihood.
Additionally, maximizing it with respect to variational parameters $\phi$ can
be shown to minimize
$\mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p_{\theta}(\mathbf{z} | \mathbf{x}) ]$.
Also, it turns out that the KL divergence determines the tightness of the lower
bound, where we have equality iff the KL divergence is zero, which happens iff
$q_{\phi}(\mathbf{z} | \mathbf{x}) = p_{\theta}(\mathbf{z} | \mathbf{x})$.
Hence, simultaneously maximizing it with respect to $\theta$ and $\phi$ gets us
two birds with one stone.&lt;/p&gt;
&lt;p&gt;Next we discuss the form of the approximate posterior
$q_{\phi}(\mathbf{z} | \mathbf{x})$, which can be viewed as a
&lt;em&gt;probabilistic&lt;/em&gt; encoder. Its role is opposite to that of the decoder.
Given an observation $\mathbf{x}$, it &amp;ldquo;encodes&amp;rdquo; it into a &lt;em&gt;distribution&lt;/em&gt;
over its hidden lower-dimensional representations.&lt;/p&gt;
&lt;h2 id="encoder"&gt;Encoder&lt;/h2&gt;
&lt;p&gt;For each local observed variable $\mathbf{x}_n$, we wish to approximate
the true posterior distribution $p(\mathbf{z}_n|\mathbf{x}_n)$ over its
corresponding local latent variables $\mathbf{z}_n$. A common approach is to
approximate it using a &lt;em&gt;variational distribution&lt;/em&gt;
$q_{\lambda_n}(\mathbf{z}_n)$, specified as a diagonal
Gaussian, where the &lt;em&gt;local&lt;/em&gt; variational parameters
$\lambda_n = \{ \boldsymbol{\mu}_n, \boldsymbol{\sigma}_n \}$ are the mean and
standard deviation of this approximating distribution,
&lt;/p&gt;
$$
q_{\lambda_n}(\mathbf{z}_n) =
\mathcal{N}(
\mathbf{z}_n |
\boldsymbol{\mu}_n,
\mathrm{diag}(\boldsymbol{\sigma}_n^2)
).
$$&lt;p&gt;
This approach has a number of shortcomings. First, the number of local
variational parameters we need to optimize grows with the size of the dataset.
Second, a new set of local variational parameters need to be optimized for new
unseen test points. This is not to mention the strong factorization assumption
we make by specifying diagonal Gaussian distributions as the family of
approximations. The last is still an active area of research, and the first
two can be addressed by introducing a further approximation using an inference
network.&lt;/p&gt;
&lt;h3 id="inference-network"&gt;Inference network&lt;/h3&gt;
&lt;h1 id="q_phimathbfz_n--mathbfx_n"&gt;We &lt;em&gt;amortize&lt;/em&gt; the cost of inference by introducing an &lt;em&gt;inference network&lt;/em&gt; which
approximates the local variational parameters $\lambda_n$ for a given local
observed variable $\textbf{x}_n$.
For our approximating distribution in particular, given $\textbf{x}_n$ the
inference network yields two vector-valued outputs $\boldsymbol{\mu}_{\phi}(\textbf{x}_n)$ and
$\boldsymbol{\sigma}_{\phi}(\textbf{x}_n)$, which we use to approximate its local
variational parameters $\boldsymbol{\mu}_n$ and $\boldsymbol{\sigma}_n$, respectively.
Our approximate posterior distribution now becomes
$$
q_{\phi}(\mathbf{z}_n | \mathbf{x}_n)&lt;/h1&gt;
&lt;p&gt;\mathcal{N}(\mathbf{z}&lt;em&gt;n
| \boldsymbol{\mu}&lt;/em&gt;{\phi}(\mathbf{x}&lt;em&gt;n),
\mathrm{diag}(\boldsymbol{\sigma}&lt;/em&gt;{\phi}^2(\mathbf{x}_n))
).
$$
Instead of learning &lt;em&gt;local&lt;/em&gt; variational parameters $\lambda_n$ for each data-point,
we now learn a fixed number of &lt;em&gt;global&lt;/em&gt; variational parameters $\phi$ which
constitute the parameters (i.e. weights) of the inference network.
Moreover, this approximation allows statistical strength to be shared across
observed data-points and also generalize to unseen test points.&lt;/p&gt;
&lt;p&gt;We specify the mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and log variance
$\log \boldsymbol{\sigma}_{\phi}^2(\mathbf{x})$ of this distribution as the output of
an inference network. For this post, we keep the architecture of the network
simple, with only a single hidden layer and two fully-connected output layers.
Again, this is simple to define in Keras:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# input layer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# hidden layer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# output layer for mean and log variance&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since this network has multiple outputs, we couldn&amp;rsquo;t use the Sequential model
API as we did for the decoder. Instead, we will resort to the more powerful
,
which allows us to implement complex models with shared layers, multiple
inputs, multiple outputs, and so on.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Inference network"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/inference_network.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Note that we output the log variance instead of the standard deviation because
this is not only more convenient to work with, but also helps with numerical
stability. However, we still require the standard deviation later. To recover
it, we simply implement the appropriate transformation and encapsulate it in a
.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# normalize log variance to std dev&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Before moving on, we give a few words on nomenclature and context.
In the prelude and title of this section, we characterized the approximate
posterior distribution with an inference network as a probabilistic encoder
(analogously to its counterpart, the probabilistic decoder).
Although this is an accurate interpretation, it is a limited one.
Classically, inference networks are known as &lt;em&gt;recognition models&lt;/em&gt;, and have now
been used for decades in a wide variety of probabilistic methods.
When composed end-to-end, the recognition-generative model combination can be
seen as having an autoencoder structure. Indeed, this structure contains the
variational autoencoder as a special case, and also the now less fashionable
&lt;em&gt;Helmholtz machine&lt;/em&gt; &lt;sup id="fnref1:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;.
Even more generally, this recognition-generative model combination constitutes
a widely-applicable approach currently known as &lt;em&gt;amortized variational inference&lt;/em&gt;,
which can be used to perform approximate inference in models that lie beyond
even the large class of deep latent Gaussian models.&lt;/p&gt;
&lt;p&gt;Having specified all the ingredients necessary to carry out variational
inference (namely, the prior, likelihood and approximate posterior), we next
focus on finalizing the definition of the (negative) ELBO as our loss function
in Keras. As written earlier, the ELBO can be decomposed into two terms,
$\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})} [ \log p_{\theta}(\mathbf{x} | \mathbf{z}) ]$
the expected log likelihood (ELL) over $q_{\phi}(\mathbf{z} | \mathbf{x})$,
and $- \mathrm{KL} [q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ]$
the negative KL divergence between prior $p(\mathbf{z})$ and approximate
posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$. We first turn our attention
to the KL divergence term.&lt;/p&gt;
&lt;h3 id="kl-divergence"&gt;KL Divergence&lt;/h3&gt;
&lt;p&gt;Intuitively, maximizing the negative KL divergence term encourages approximate
posterior densities that place its mass on configurations of the latent
variables which are closest to the prior. Effectively, this regularizes the
complexity of latent space. Now, since both the prior $p(\mathbf{z})$ and
approximate posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ are Gaussian,
the KL divergence can actually be calculated with the closed-form expression,&lt;/p&gt;
$$
\mathrm{KL} [ q_{\phi}(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ]
= - \frac{1}{2} \sum_{k=1}^K \{ 1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2 \}
$$&lt;p&gt;where $\mu_k$ and $\sigma_k$ are the $k$-th components of output vectors
$\mu_{\phi}(\mathbf{x})$ and $\sigma_{\phi}(\mathbf{x})$, respectively.
This is not too difficult to derive, and I would recommend verifying this as an
exercise. You can also find a derivation in the appendix of Kingma and Welling&amp;rsquo;s
(2014) paper &lt;sup id="fnref1:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Recall that earlier, we defined the expected log likelihood term of the ELBO as
a Keras loss. We were able to do this since the log likelihood is a function of
the network&amp;rsquo;s final output (the predicted probabilities), so it maps nicely to a
Keras loss. Unfortunately, the same does not apply for the KL divergence term,
which is a function of the network&amp;rsquo;s intermediate layer outputs, the mean &lt;code&gt;mu&lt;/code&gt;
and log variance &lt;code&gt;log_var&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We define an auxiliary
which takes &lt;code&gt;mu&lt;/code&gt; and &lt;code&gt;log_var&lt;/code&gt; as input and simply returns them as output
without modification. We do however explicitly introduce the
of
calculating the KL divergence and adding it to a collection of losses, by
calling the method &lt;code&gt;add_loss&lt;/code&gt; &lt;sup id="fnref:12"&gt;&lt;a href="#fn:12" class="footnote-ref" role="doc-noteref"&gt;12&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Identity transform layer that adds KL divergence
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; to the final model loss.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;kl_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_var&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kl_batch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next we feed &lt;code&gt;z_mu&lt;/code&gt; and &lt;code&gt;z_log_var&lt;/code&gt; through this layer (this needs to take
place before feeding &lt;code&gt;z_log_var&lt;/code&gt; through the Lambda layer to recover &lt;code&gt;z_sigma&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now when the Keras model is finally compiled, the collection of losses will be
aggregated and added to the specified Keras loss function to form the loss we
ultimately minimize. If we specify the loss as the negative log-likelihood we
defined earlier (&lt;code&gt;nll&lt;/code&gt;), we recover the negative ELBO as the final loss we
minimize, as intended.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-alternative-divergences"&gt;Side note: Alternative divergences&lt;/h4&gt;
&lt;p&gt;A key benefit of encapsulating the divergence in an auxiliary layer is that we
can easily implement and swap in other divergences, such as the
$\chi$-divergence or the $\alpha$-divergence.
Using alternative divergences for variational inference is an active research
topic &lt;sup id="fnref:13"&gt;&lt;a href="#fn:13" class="footnote-ref" role="doc-noteref"&gt;13&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:14"&gt;&lt;a href="#fn:14" class="footnote-ref" role="doc-noteref"&gt;14&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-implicit-models-and-adversarial-learning"&gt;Side note: Implicit models and adversarial learning&lt;/h4&gt;
&lt;p&gt;Additionally, we could also extend the divergence layer to use an auxiliary
density ratio estimator function, instead of evaluating the KL divergence in
the analytical form above.
This relaxes the requirement on approximate posterior
$q_{\phi}(\mathbf{z}|\mathbf{x})$ (and incidentally also prior $p(\mathbf{z})$)
to yield tractable densities, at the cost of maximizing a cruder estimate of the
ELBO.
This is known as Adversarial Variational Bayes&lt;sup id="fnref1:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;, and is an
important line of recent research that, when taken to its logcal conclusion,
can extend the applicability of variational inference to arbitrarily expressive
implicit probabilistic models with intractable likelihoods&lt;sup id="fnref1:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="reparameterization-using-merge-layers"&gt;Reparameterization using Merge Layers&lt;/h3&gt;
&lt;p&gt;To perform gradient-based optimization of ELBO with respect to model parameters
$\theta$ and variational parameters $\phi$, we require its gradients with
respect to these parameters, which is generally intractable.
Currently, the dominant approach for circumventing this is by Monte Carlo (MC)
estimation of the gradients. The basic idea is to write the gradient of the
ELBO as an expectation of the gradient, approximate it with MC estimates, then
perform stochastic gradient descent with the repeated MC gradient estimates.&lt;/p&gt;
&lt;p&gt;There exist a number of estimators based on different variance reduction
techniques. However, MC gradient estimates based on the reparameterization trick,
known as the &lt;em&gt;reparameterization gradients&lt;/em&gt;, have be shown to have the lowest
variance among competing estimators for continuous latent variables&lt;sup id="fnref3:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;.
The reparameterization trick is a straightforward change of variables that
expresses the random variable $\mathbf{z} \sim q_{\phi}(\mathbf{z} | \mathbf{x})$
as a deterministic transformation $g_{\phi}$ of another random variable
$\boldsymbol{\epsilon}$ and input $\mathbf{x}$, with parameters $\phi$,&lt;/p&gt;
$$
z = g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon}), \quad
\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}).
$$&lt;p&gt;Note that $p(\boldsymbol{\epsilon})$ is simpler base distribution which is
parameter-free and independent of $\mathbf{x}$ or $\phi$.
To prevent clutter, we write the ELBO as an expectation of the function
$f(\mathbf{x}, \mathbf{z}) = \log p_{\theta}(\mathbf{x} , \mathbf{z}) -
\log q_{\phi}(\mathbf{z} | \mathbf{x})$ over distribution
$q_{\phi}(\mathbf{z} | \mathbf{x})$.
Now, for any function $f(\mathbf{x}, \mathbf{z})$, taking the gradient of the
expectation with respect to $\phi$, and substituting all occurrences of
$\mathbf{z}$ with $g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})$, we have&lt;/p&gt;
$$
\begin{align*}
\nabla_{\phi} \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}
[ f(\mathbf{x}, \mathbf{z}) ]
&amp; = \nabla_{\phi} \mathbb{E}_{p(\boldsymbol{\epsilon})}
[ f(\mathbf{x}, g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})) ] \newline
&amp; = \mathbb{E}_{p(\mathbf{\epsilon})}
[ \nabla_{\phi} f(\mathbf{x}, g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})) ].
\end{align*}
$$&lt;p&gt;In other words, this simple reparameterization allows the gradient and the
expectation to commute, thereby allowing us to compute unbiased stochastic
estimates of the ELBO gradients by drawing noise samples $\boldsymbol{\epsilon}$
from $p(\boldsymbol{\epsilon})$.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;To recover the diagonal Gaussian approximation we specified earlier
$q_{\phi}(\mathbf{z}_n | \mathbf{x}_n) = \mathcal{N}(\mathbf{z}_n |
\boldsymbol{\mu}_{\phi}(\mathbf{x}_n), \mathrm{diag}(\boldsymbol{\sigma}_{\phi}^2(\mathbf{x}_n)))$,
we draw noise from the Normal base distribution, and specify a simple
location-scale transformation&lt;/p&gt;
$$
\mathbf{z}
= g_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})
= \mu_{\phi}(\mathbf{x}) +
\sigma_{\phi}(\mathbf{x}) \odot
\boldsymbol{\epsilon}, \quad
\boldsymbol{\epsilon}
\sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
$$&lt;p&gt;where $\mu_{\phi}(\mathbf{x})$ and $\sigma_{\phi}(\mathbf{x})$ are the outputs
of the inference network defined earlier with parameters $\phi$, and $\odot$
denotes the elementwise product. In Keras, we explicitly make the noise vector
an input to the model by defining an Input layer for it. We then implement the
above location-scale transformation using
, namely &lt;code&gt;Add&lt;/code&gt; and &lt;code&gt;Multiply&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Reparameterization with simple location-scale transformation using Keras merge layers.
"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/reparameterization.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-note-monte-carlo-sample-size"&gt;Side note: Monte Carlo sample size&lt;/h4&gt;
&lt;p&gt;Note both the inputs for observed variables and noise (&lt;code&gt;x&lt;/code&gt; and &lt;code&gt;eps&lt;/code&gt;) need to be
specified explicitly as inputs to our final model.
Furthermore, the size of their first dimension (i.e. batch size) are required
to be the same.
This corresponds to using a exactly one Monte Carlo sample to approximate the
expected log likelihood, drawing a single sample $\mathbf{z}_n$ from
$q_{\phi}(\mathbf{z}_n | \mathbf{x}_n)$ for each data-point $\mathbf{x}_n$ in
the batch. Although you might find an MC sample size of 1 surprisingly small,
it is actually adequate for a sufficiently large batch size (~100) &lt;sup id="fnref2:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.
In a
,
I demonstrate how to extend our approach to support larger MC sample sizes using
just a few minor tweaks. This extension is crucial for implementing the
&lt;em&gt;importance weighted autoencoder&lt;/em&gt; &lt;sup id="fnref1:8"&gt;&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref"&gt;8&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now, since the noise input is drawn from the Normal distribution, we can save
from having to feed in values for this input from outside the computation graph
by binding a tensor to this Input layer. Specifically, we bind a tensor created
using &lt;code&gt;K.random_normal&lt;/code&gt; with the required shape,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;While &lt;code&gt;eps&lt;/code&gt; still needs to be explicitly specified as an input to compile the
model, values for this input will no longer be expected by methods such as
&lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;predict&lt;/code&gt;. Instead, samples from this distribution will be lazily
generated inside the computation graph when required. See my notes on
for more
details.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Encoder architecture."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/encoder.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In the
, all of this logic is encapsulated in a single
&lt;code&gt;Lambda&lt;/code&gt; layer, which simultaneously draws samples from a hard-coded base
distribution and also performs the location-scale transformation.
In contrast, this approach achieves a good level of
and
.
By decoupling the random noise vector from the layer&amp;rsquo;s internal logic and
explicitly making it a model input, we emphasize the fact that all sources of
stochasticity emanate from this input. It thereby becomes clear that a random
sample drawn from a particular approximating distribution is obtained by feeding
this source of stochasticity through a number of successive deterministic
transformations.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id="side-notes-gumbel-softmax-trick-for-discrete-latent-variables"&gt;Side notes: Gumbel-softmax trick for discrete latent variables&lt;/h4&gt;
&lt;p&gt;As an example, we could provide samples drawn from the Uniform distribution
as noise input. By applying a number of deterministic transformations that
constitute the &lt;em&gt;Gumbel-softmax reparameterization trick&lt;/em&gt; &lt;sup id="fnref1:9"&gt;&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref"&gt;9&lt;/a&gt;&lt;/sup&gt;, we
are able to obtain samples from the Categorical distribution. This allows us
to perform approximate inference on &lt;em&gt;discrete&lt;/em&gt; latent variables, and can be
implemented in this framework by adding a dozen or so lines of code!&lt;/p&gt;
&lt;h1 id="putting-it-all-together"&gt;Putting it all together&lt;/h1&gt;
&lt;p&gt;So far, we&amp;rsquo;ve dissected the variational autoencoder into modular components and
discussed the role and implementation of each one at some length.
Now let&amp;rsquo;s compose these components together end-to-end to form the final
autoencoder architecture.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&amp;rsquo;s surprisingly concise, taking up around 20 lines of code.
The diagram of the full model architecture is visualized below.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Variational autoencoder architecture."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Finally, we specify and compile the model, using the negative log likelihood
&lt;code&gt;nll&lt;/code&gt; defined earlier as the loss.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1 id="model-fitting"&gt;Model fitting&lt;/h1&gt;
&lt;h2 id="dataset-mnist-digits"&gt;Dataset: MNIST digits&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="Variational autoencoder architecture for the MNIST digits dataset."
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/vae_full_shapes.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="loss-nelbo-convergence"&gt;Loss (NELBO) convergence&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt=""
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/nelbo.svg"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="model-evaluation"&gt;Model evaluation&lt;/h1&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D plot of the digit classes in the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_8bb4eb676623e380.webp 320w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_64a98c5233df2a9d.webp 480w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_f97f8af14c434d9d.webp 600w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_latent_space_hu_8bb4eb676623e380.webp"
width="600"
height="500"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D manifold of the digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="c1"&gt;# figure with 15x15 digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;digit_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# linearly spaced coordinates on the unit square were transformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# through the inverse CDF (ppf) of the Gaussian to produce values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# of the latent variables z, since the prior of the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# is Gaussian&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_pred_grid&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e23f379b58eda1c7.webp 320w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e0d7ef0dff27fb2e.webp 480w, https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_8a0316a94df89cca.webp 500w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://tiao.io/posts/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/result_manifold_hu_e23f379b58eda1c7.webp"
width="500"
height="500"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="recap"&gt;Recap&lt;/h1&gt;
&lt;p&gt;In this post, we covered the basics of amortized variational inference, looking
at variational autoencoders as a specific example. In particular, we&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implemented the decoder and encoder using the
and
respectively.&lt;/li&gt;
&lt;li&gt;Augmented the final loss with the KL divergence term by writing an auxiliary
.&lt;/li&gt;
&lt;li&gt;Worked with the log variance for numerical stability, and used a
to transform it to the
standard deviation when necessary.&lt;/li&gt;
&lt;li&gt;Explicitly made the noise an Input layer, and implemented the
reparameterization trick using
.&lt;/li&gt;
&lt;li&gt;
,
so random samples are generated &lt;em&gt;within&lt;/em&gt; the computation graph.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="whats-next"&gt;What&amp;rsquo;s next&lt;/h1&gt;
&lt;p&gt;Next, we will extend the divergence layer to use an auxiliary density ratio
estimator function, instead of evaluating the KL divergence in the analytical
form above.
This relaxes the requirement on approximate posterior
$q_{\phi}(\mathbf{z}|\mathbf{x})$ (and incidentally also prior $p(\mathbf{z})$)
to yield tractable densities, at the cost of maximizing a cruder estimate of the
ELBO.
This is known as Adversarial Variational Bayes&lt;sup id="fnref2:10"&gt;&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref"&gt;10&lt;/a&gt;&lt;/sup&gt;, and is an
important line of recent research that, when taken to its logcal conclusion,
can extend the applicability of variational inference to arbitrarily expressive
implicit probabilistic models with intractable likelihoods&lt;sup id="fnref2:11"&gt;&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref"&gt;11&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Cite as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tiao2017vae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;{A} {T}utorial on {V}ariational {A}utoencoders with a {C}oncise {K}eras {I}mplementation&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;author&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Tiao, Louis C&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;journal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tiao.io&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;2017&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To receive updates on more posts like this, follow me on
and
!&lt;/p&gt;
&lt;h2 id="links--resources"&gt;Links &amp;amp; Resources&lt;/h2&gt;
&lt;p&gt;Below, you can find:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The
used to generate the diagrams and plots in this post.&lt;/li&gt;
&lt;li&gt;The above snippets combined in a single executable Python file:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;784&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;epsilon_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Negative log likelihood (Bernoulli). &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# keras.losses.binary_crossentropy gives the mean&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# over the last axis. we require the sum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Layer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34; Identity transform layer that adds KL divergence
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; to the final model loss.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;kl_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;log_var&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_var&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kl_batch&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sigmoid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intermediate_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KLDivergenceLayer&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))(&lt;/span&gt;&lt;span class="n"&gt;z_log_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stddev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epsilon_std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;latent_dim&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Multiply&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;()([&lt;/span&gt;&lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_eps&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rmsprop&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# train the VAE on MNIST digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z_mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D plot of the digit classes in the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;z_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# display a 2D manifold of the digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="c1"&gt;# figure with 15x15 digits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;digit_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# linearly spaced coordinates on the unit square were transformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# through the inverse CDF (ppf) of the Gaussian to produce values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# of the latent variables z, since the prior of the latent space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# is Gaussian&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;u_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ppf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u_grid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z_grid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_decoded&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digit_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_decoded&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;D. P. Kingma and M. Welling, &amp;ldquo;Auto-Encoding Variational Bayes,&amp;rdquo; in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;
&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Section &amp;ldquo;Recognition models and amortised inference&amp;rdquo; in
&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Dayan, P., Hinton, G. E., Neal, R. M., &amp;amp; Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.
&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;Rezende, D. J., Mohamed, S., &amp;amp; Wierstra, D. (2014). &amp;ldquo;Stochastic backpropagation and approximate inference in deep generative models,&amp;rdquo; in Proceedings of The 31st International Conference on Machine Learning, 2014, (Vol. 32, pp. 1278–1286). Bejing, China: PMLR.
&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref3:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;For a complete treatment of variational autoencoders, and variational
inference in general, I highly recommend:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jaan Altosaar&amp;rsquo;s blog post,
&lt;/li&gt;
&lt;li&gt;Diederik P. Kingma&amp;rsquo;s PhD Thesis,
.&lt;/li&gt;
&lt;/ul&gt;
&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;D. Rezende and S. Mohamed, &amp;ldquo;Variational Inference with Normalizing Flows,&amp;rdquo; in Proceedings of the 32nd International Conference on Machine Learning, 2015, vol. 37, pp. 1530–1538.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Y. Burda, R. Grosse, and R. Salakhutdinov, &amp;ldquo;Importance Weighted Autoencoders,&amp;rdquo; in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:8" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:9"&gt;
&lt;p&gt;E. Jang, S. Gu, and B. Poole, &amp;ldquo;Categorical Reparameterization with Gumbel-Softmax,&amp;rdquo; Nov. 2016. in Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:9" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10"&gt;
&lt;p&gt;L. Mescheder, S. Nowozin, and A. Geiger, &amp;ldquo;Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks,&amp;rdquo; in Proceedings of the 34th International Conference on Machine Learning, 2017, vol. 70, pp. 2391–2400.&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:10" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:11"&gt;
&lt;p&gt;D. Tran, R. Ranganath, and D. Blei, &amp;ldquo;Hierarchical Implicit Models and Likelihood-Free Variational Inference,&amp;rdquo; in Advances in Neural Information Processing Systems 30, 2017.&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref2:11" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:12"&gt;
&lt;p&gt;To support sample weighting (fined-tuning how much each data-point
contributes to the loss), Keras losses are expected returns a scalar for each
data-point in the batch. In contrast, losses appended with the &lt;code&gt;add_loss&lt;/code&gt;
method don&amp;rsquo;t support this, and are expected to be a single scalar.
Hence, we calculate the KL divergence for all data-points in the batch and
take the mean before passing it to &lt;code&gt;add_loss&lt;/code&gt;.&amp;#160;&lt;a href="#fnref:12" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:13"&gt;
&lt;p&gt;Y. Li and R. E. Turner, &amp;ldquo;Rényi Divergence Variational Inference,&amp;rdquo; in Advances in Neural Information Processing Systems 29, 2016.&amp;#160;&lt;a href="#fnref:13" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:14"&gt;
&lt;p&gt;A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei, &amp;ldquo;Variational Inference via chi Upper Bound Minimization,&amp;rdquo; in Advances in Neural Information Processing Systems 30, 2017.&amp;#160;&lt;a href="#fnref:14" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item></channel></rss>