An Illustrated Guide to the Knowledge Gradient Acquisition Function

Draft – work in progress.

We provide a short guide to the knowledge-gradient (KG) acquisition function (Frazier et al., 2009) [1] for Bayesian optimization (BO). Rather than being a self-contained tutorial, this post is intended to serve as an illustrated compendium to the paper of Frazier et al. (2009) [1] and the subsequent tutorial by Frazier (2018) [2], authored nearly a decade later.

This post assumes a basic level of familiarity with BO and Gaussian processes (GPs), to the extent provided by the literature survey of Shahriari et al. (2015) [3] and the acclaimed textbook of Rasmussen and Williams (2006), respectively.

Knowledge-gradient

First, we set up the notation and terminology. Let $f: \mathcal{X} \to \mathbb{R}$ be the blackbox function we wish to minimize. We denote the GP posterior predictive distribution, or predictive for short, by $p(y | x, \mathcal{D})$. The mean of the predictive, or the predictive mean for short, is denoted by
$$
\mu(x; \mathcal{D}) = \mathbb{E}[y | x, \mathcal{D}].
$$
Let $\mathcal{D}_n = \{ (x_i, y_i) \}_{i=1}^{n}$ be the set of $n$ input-output observations, where each output $y_i = f(x_i) + \epsilon$ is assumed to be observed with noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$. We make the following abbreviation
$$
\mu_n(x) = \mu(x; \mathcal{D}_n).
$$
Next, we define the minimum of the predictive mean, or predictive minimum for short, as
$$
\tau(\mathcal{D}) = \min_{x \in \mathcal{X}} \mu(x; \mathcal{D}).
$$
If we view $\mu(x; \mathcal{D})$ as our fit to the underlying function $f(x)$ from which the observations $\mathcal{D}$ were generated, then $\tau(\mathcal{D})$ is our estimate of the minimum of $f(x)$, given observations $\mathcal{D}$.

Further, we make the following abbreviations
$$
\tau_n = \tau(\mathcal{D}_n), \quad \text{and} \quad \tau_{n+1} = \tau(\mathcal{D}_{n+1}),
$$
where $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{ (x, y) \}$ is the set of existing observations, augmented by some input-output pair $(x, y)$. Then, the knowledge-gradient is defined as
$$
\alpha(x; \mathcal{D}_n) = \mathbb{E}_{p(y | x, \mathcal{D}_n)} [\tau_n - \tau_{n+1}].
$$
Crucially, note that $\tau_{n+1}$ is implicitly a function of $(x, y)$, and that this expectation integrates over all possible input-output pairs $(x, y)$ for the given $x$ under the predictive $p(y | x, \mathcal{D}_n)$.

Monte Carlo estimation

Not surprisingly, the knowledge-gradient function is analytically intractable. Therefore, in practice, we compute it using Monte Carlo estimation,
$$
\alpha(x; \mathcal{D}_n) \approx \frac{1}{M} \sum_{m=1}^{M} \left( \tau_n - \tau_{n+1}^{(m)} \right), \qquad y^{(m)} \sim p(y | x, \mathcal{D}_n),
$$
where $\tau_{n+1}^{(m)} = \tau(\mathcal{D}_{n+1}^{(m)})$ and $\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (x, y^{(m)}) \}$.

We refer to $y^{(m)}$ as the $m$th simulated outcome, or the $m$th simulation for short. Then, $\mathcal{D}_{n+1}^{(m)}$ is the $m$th simulation-augmented dataset and, accordingly, $\tau_{n+1}^{(m)}$ is the $m$th simulation-augmented predictive minimum.

We see that this approximation to the knowledge-gradient is simply the average difference, across the $M$ simulations, between the predictive minimum based on the observed data, $\tau_n$, and the predictive minima based on the simulation-augmented data, $\tau_{n+1}^{(m)}$.
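In other words, once the predictive minimum $\tau_n$ and the $M$ simulation-augmented predictive minima $\tau_{n+1}^{(m)}$ are available (computing them is the subject of the steps below), the estimator itself is a one-liner. The values in the following snippet are made up purely to show the shape of the computation:

```python
import numpy as np

tau_n = -0.55                                              # predictive minimum under D_n (made-up value)
tau_sims = np.array([-0.61, -0.58, -0.55, -0.63, -0.57])   # tau_{n+1}^(m), m = 1, ..., 5 (made-up values)

kg_estimate = np.mean(tau_n - tau_sims)                    # average of tau_n - tau_{n+1}^(m)
```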

This might take a moment to digest, as there are quite a number of moving parts to keep track of. To help visualize these parts, we provide an illustration of each of the steps required to compute KG on a simple one-dimensional synthetic problem.

One-dimensional example

As the running example throughout this post, we use a synthetic function defined as
$$
f(x) = \sin(3x) + x^2 - 0.7 x.
$$
We generate $n = 10$ observations at locations sampled uniformly at random. The true function and the set of noisy observations $\mathcal{D}_n$ are visualized in the figure below:

Latent blackbox function and $n=10$ observations.
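For concreteness, the synthetic function and noisy observations can be generated as follows (Python with NumPy). The noise level, domain bounds, and random seed below are arbitrary choices for illustration, not necessarily those behind the figures.

```python
import numpy as np

def f(x):
    """Latent blackbox function we wish to minimize."""
    return np.sin(3.0 * x) + x**2 - 0.7 * x

bounds = (-1.0, 2.0)    # assumed domain of interest
noise_scale = 0.2       # assumed observation noise standard deviation
rng = np.random.default_rng(42)

n = 10
X_obs = rng.uniform(*bounds, size=(n, 1))                     # inputs sampled uniformly at random
y_obs = f(X_obs).ravel() + noise_scale * rng.normal(size=n)   # noisy outputs y_i = f(x_i) + eps
```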

Using the observations $\mathcal{D}_n$ we have collected so far, we wish to use KG to score a candidate location $x_c$ at which to evaluate next.

Posterior predictive distribution

The posterior predictive $p(y | x, \mathcal{D}_n)$ is visualized in the figure below. In particular, the predictive mean $\mu_n(x)$ is represented by the solid orange curve.

Posterior predictive distribution (*before* hyperparameter estimation).

Clearly, this is a poor fit to the data, with a poorly calibrated estimate of the predictive uncertainty.

Step 1: Hyperparameter estimation

Therefore, the first step is to optimize the hyperparameters of the GP regression model, i.e., the kernel lengthscale, amplitude, and observation noise variance. We do this using type-II maximum likelihood estimation (MLE), or empirical Bayes.

Posterior predictive distribution (*after* hyperparameter estimation).

Step 2: Determine the predictive minimum

Next, we compute the predictive minimum $\tau_n = \min_{x \in \mathcal{X}} \mu_n(x)$. Since $\mu_n$ is end-to-end differentiable with respect to the input $x$, we can simply use a multi-started quasi-Newton hill-climber such as L-BFGS. We visualize this in the figure below, where the value of the predictive minimum is represented by the orange horizontal dashed line, and its location is denoted by the orange star and triangle.

Predictive minimum $\tau_n$.
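A minimal way to compute $\tau_n$ is multi-started L-BFGS-B on the predictive mean. One caveat for the sketch below: scikit-learn does not expose gradients of the predictive mean, so L-BFGS-B falls back on finite differences here; with a framework that supports automatic differentiation, the exact gradients alluded to above could be used instead.

```python
import numpy as np
from scipy.optimize import minimize

def predictive_minimum(gp, bounds, n_restarts=20, seed=0):
    """Minimize the predictive mean with multi-started L-BFGS-B.

    Returns (tau, x_min): the predictive minimum and its location.
    """
    rng = np.random.default_rng(seed)
    mean = lambda x: gp.predict(np.atleast_2d(x))[0]  # mu(x; D) as a scalar objective

    best_val, best_x = np.inf, None
    for x0 in rng.uniform(*bounds, size=(n_restarts, 1)):
        res = minimize(mean, x0, method="L-BFGS-B", bounds=[bounds])
        if res.fun < best_val:
            best_val, best_x = res.fun, res.x
    return best_val, best_x

tau_n, x_min = predictive_minimum(gp, bounds)
```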

Step 3: Compute simulation-augmented predictive means

Suppose we are scoring the candidate location $x_c = 0.1$. For illustrative purposes, let us draw just $M = 1$ sample $y_c^{(1)} \sim p(y | x_c, \mathcal{D}_n)$. In the figure below, the candidate location $x_c$ is represented by the vertical solid gray line, and the single simulated outcome $y_c^{(1)}$ is represented by the filled blue dot.

In general, we denote the simulation-augmented predictive mean by
$$
\mu_{n+1}^{(m)}(x) = \mu(x; \mathcal{D}_{n+1}^{(m)}),
$$
where $\mathcal{D}_{n+1}^{(m)} = \mathcal{D}_n \cup \{ (x, y^{(m)}) \}$, as defined earlier.

Here, the simulation-augmented dataset $\mathcal{D}_{n+1}^{(1)}$ is the set of existing observations $\mathcal{D}_n$, augmented by the simulated input-output pair $(x_c, y_c^{(1)})$,
$$
\mathcal{D}_{n+1}^{(1)} = \mathcal{D}_n \cup \{ (x_c, y_c^{(1)}) \},
$$
and the corresponding simulation-augmented predictive mean $\mu_{n+1}^{(1)}(x)$ is represented in the figure below by the solid blue curve.

Simulation-augmented predictive mean $\mu_{n+1}^{(1)}(x)$ at location $x_c = 0.1$.
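Continuing the sketch from the previous steps, this amounts to drawing $y_c^{(1)}$ from the predictive at $x_c$ and conditioning a copy of the GP on the augmented dataset with the hyperparameters held fixed (passing `optimizer=None` to scikit-learn skips re-estimating them):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

x_c = np.array([[0.1]])  # candidate location

# Draw a single simulated outcome y_c^(1) ~ p(y | x_c, D_n).
y_sim = gp.sample_y(x_c, n_samples=1, random_state=0).ravel()

# Simulation-augmented dataset D_{n+1}^(1): D_n with (x_c, y_c^(1)) appended.
X_aug = np.vstack([X_obs, x_c])
y_aug = np.concatenate([y_obs, y_sim])

# Condition on the augmented data, keeping the fitted hyperparameters fixed.
gp_sim = GaussianProcessRegressor(kernel=gp.kernel_, optimizer=None)
gp_sim.fit(X_aug, y_aug)
```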

Step 4: Compute simulation-augmented predictive minimums

Next, we compute the simulation-augmented predictive minimum
$$
\tau_{n+1}^{(1)} = \min_{x \in \mathcal{X}} \mu_{n+1}^{(1)}(x).
$$
It may not be immediately obvious, but $\mu_{n+1}^{(1)}$ is in fact also end-to-end differentiable with respect to the input $x$. Therefore, we can again appeal to a method such as L-BFGS. We visualize this in the figure below, where the value of the simulation-augmented predictive minimum is represented by the blue horizontal dashed line, and its location is denoted by the blue star and triangle.

Simulation-augmented predictive minimum $\tau_{n+1}^{(1)}$ at location $x_c = 0.1$.

Taking the difference between the orange and blue horizontal dashed lines gives us an unbiased estimate of the knowledge-gradient. However, this is likely to be a crude estimate, since it is based on just a single MC sample. To obtain a more accurate one, we need to increase $M$, the number of MC samples.
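Concretely, this step reuses the `predictive_minimum` routine from Step 2 on the simulation-augmented model `gp_sim`, and the single-sample estimate is just the resulting difference:

```python
# Simulation-augmented predictive minimum tau_{n+1}^(1).
tau_sim, x_sim_min = predictive_minimum(gp_sim, bounds)

# Single-sample (M = 1) Monte Carlo estimate of the knowledge gradient at x_c.
kg_single = tau_n - tau_sim
```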

Samples $M > 1$

Let us now consider $M = 5$ samples. We draw $y_c^{(m)} \sim p(y | x_c, \mathcal{D}_n)$, for $m = 1, \dotsc, 5$. As before, the input location $x_c$ is represented by the vertical solid gray line, and the corresponding simulated outcomes are represented by the filled dots below, with varying hues from a perceptually uniform color palette to distinguish between samples.

Accordingly, the simulation-augmented predictive means $\mu_{n+1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$, are represented by the colored curves, with hues matching those of the simulated outcomes on which the predictive distributions are based.

Simulation-augmented predictive mean $\mu_{n+1}^{(m)}(x)$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$.

Next, we compute the simulation-augmented predictive minima $\tau_{n+1}^{(m)}$, which requires minimizing $\mu_{n+1}^{(m)}(x)$ for $m = 1, \dotsc, 5$. These values are represented below by the horizontal dashed lines, and their locations are denoted by the stars and triangles.

Simulation-augmented predictive minima $\tau_{n+1}^{(m)}$ at location $x_c = 0.1$, for $m = 1, \dotsc, 5$.

Finally, taking the average difference between the orange dashed line and every other dashed line gives us the estimate of the knowledge gradient at input $x_c$.
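Putting the pieces together, the sketch below repeats Steps 3 and 4 for each of the $M$ simulated outcomes and averages the differences, making the earlier one-liner concrete. It builds on the snippets above (the scikit-learn surrogate and the `predictive_minimum` helper from Step 2) and is meant for illustration rather than efficiency, since it re-conditions a GP for every simulated outcome.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def knowledge_gradient(x_cand, gp, X_obs, y_obs, bounds, M=5, seed=0):
    """Monte Carlo estimate of the knowledge gradient at x_cand."""
    x_cand = np.atleast_2d(x_cand)
    tau_n, _ = predictive_minimum(gp, bounds)   # predictive minimum under D_n

    # Simulated outcomes y^(1), ..., y^(M) ~ p(y | x_cand, D_n).
    y_sims = gp.sample_y(x_cand, n_samples=M, random_state=seed).ravel()

    diffs = []
    for y_sim in y_sims:
        X_aug = np.vstack([X_obs, x_cand])      # simulation-augmented dataset D_{n+1}^(m)
        y_aug = np.append(y_obs, y_sim)
        gp_sim = GaussianProcessRegressor(kernel=gp.kernel_, optimizer=None)
        gp_sim.fit(X_aug, y_aug)                # condition with hyperparameters held fixed
        tau_sim, _ = predictive_minimum(gp_sim, bounds)  # tau_{n+1}^(m)
        diffs.append(tau_n - tau_sim)
    return np.mean(diffs)                       # average of tau_n - tau_{n+1}^(m)

kg_value = knowledge_gradient(0.1, gp, X_obs, y_obs, bounds, M=5)
```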


Cite as:

@article{tiao2021knowledge,
  title   = "{A}n {I}llustrated {G}uide to the {K}nowledge {G}radient {A}cquisition {F}unction",
  author  = "Tiao, Louis C",
  journal = "tiao.io",
  year    = "2021",
  url     = "https://tiao.io/post/an-illustrated-guide-to-the-knowledge-gradient-acquisition-function/"
}

To receive updates on more posts like this, follow me on Twitter and GitHub!


  1. Frazier, P., Powell, W., & Dayanik, S. (2009). The Knowledge-Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing, 21(4), 599-613.

  2. Frazier, P. I. (2018). A Tutorial on Bayesian Optimization. arXiv preprint arXiv:1807.02811.

  3. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2015). Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1), 148-175.
