
I am reading slides from a Tutorial for Bayesian Optimization [42/45] and came across this:

"Covariance hyperparameters are often optimized rather than marginalized, typically in the name of convenience and efficiency"

I am trying to understand the significance of marginalization over optimization of covariance hyperparameters. Optimization is an entirely different concept than marginalization, so what is the link between the two?

GENIVI-LEARNER

1 Answer


Let's say we have a density $f(x; \theta, \eta)$ and $\theta$ is a parameter of interest but $\eta$ is a nuisance parameter, i.e. we need to know its value to evaluate the density but we don't actually care about it.

We want to somehow get rid of $\eta$ and end up with something of the form $g(x; \theta)$.

Integration and optimization are two common approaches to this. For integration we could put a prior on $\eta$, say $\pi(\eta)$, and obtain $g_I$ via $$ g_I(x;\theta) = \int f(x;\theta,\eta)\pi(\eta)\,\text d\eta. $$ Intuitively, evaluating $g_I$ is like averaging $f$ over all possible values of $\eta$, weighted by their likeliness.
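For concreteness, here is a toy Monte Carlo sketch of $g_I$; the Gaussian density for $f$ and the Gamma prior for $\pi(\eta)$ are placeholder choices of mine, just to make the integral easy to approximate:

```python
import numpy as np
from scipy import stats

def f(x, theta, eta):
    # toy density: x ~ Normal(theta, eta^2), with eta acting as a nuisance scale
    return stats.norm.pdf(x, loc=theta, scale=eta)

def g_I(x, theta, prior_draws):
    # Monte Carlo approximation of g_I(x; theta) = integral of f(x; theta, eta) pi(eta) d eta
    return np.mean([f(x, theta, eta) for eta in prior_draws])

rng = np.random.default_rng(0)
eta_draws = rng.gamma(shape=2.0, scale=1.0, size=5000)  # draws from the placeholder prior pi(eta)
print(g_I(x=0.5, theta=0.0, prior_draws=eta_draws))
```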

But another option is to plug just one value in for $\eta$. We could find $$ \hat\eta(x, \theta) = \underset{\eta}{\text{argmax}}\, f(x; \theta, \eta) $$ and then get $$ g_p(x; \theta) = f(x; \theta, \hat\eta(x, \theta)). $$

We could arrive at this by using a Dirac delta at $\hat\eta(x, \theta)$ and integrating against that, i.e. we could use a prior (that would need to depend on $x$ and $\theta$, so philosophically not really a prior) that puts a probability of $1$ on this maximum, and then $$ g_p(x; \theta) = \int f(x; \theta, \eta) \delta_{\hat\eta}(\eta)\,\text d\eta. $$

(I'm using a subscript $p$ since this is called "profiling")
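The matching toy sketch of the profiling route (same placeholder Gaussian density; the bounded scalar optimizer is just one convenient way to do the inner maximization):

```python
from scipy import stats, optimize

def f(x, theta, eta):
    # same toy density as above: x ~ Normal(theta, eta^2)
    return stats.norm.pdf(x, loc=theta, scale=eta)

def g_p(x, theta):
    # profile out eta: plug in eta_hat(x, theta) = argmax_eta f(x; theta, eta)
    res = optimize.minimize_scalar(lambda eta: -f(x, theta, eta),
                                   bounds=(1e-6, 100.0), method="bounded")
    return f(x, theta, res.x)

print(g_p(x=0.5, theta=0.0))  # compare with the Monte Carlo g_I above
```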

So this is one way to see how they connect. Optimization is putting all our eggs in one basket and assuming that we can represent $f$ well by using just the most likely value of $\eta$, while integration considers all the possible values, weighting them according to how likely we believe them to be. Optimization is usually much easier computationally, which is a big part of the appeal, although integration can work better. My answer here gives an example of that, and there's also a generally interesting discussion: MLE: Marginal vs Full Likelihood

This paper by Berger et al. is also interesting: https://www2.stat.duke.edu/~berger/papers/brunero.pdf
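To tie this back to the covariance hyperparameters in the question, here is a minimal sketch of the two routes for a GP lengthscale; the squared-exponential kernel, fixed noise variance, and Gamma prior on the lengthscale are all assumptions of mine, not something from the slides:

```python
import numpy as np
from scipy import optimize, stats

def se_kernel(X, lengthscale, noise_var=0.01):
    # squared-exponential covariance plus a fixed observation-noise term
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2) + noise_var * np.eye(len(X))

def log_marginal_likelihood(y, X, lengthscale):
    # log N(y | 0, K(lengthscale)): the GP marginal likelihood of the data
    K = se_kernel(X, lengthscale)
    return stats.multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=K)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 20)
y = np.sin(4.0 * X) + 0.1 * rng.standard_normal(20)

# Optimization (type-II MLE / profiling): plug in the single best lengthscale.
res = optimize.minimize_scalar(lambda ell: -log_marginal_likelihood(y, X, ell),
                               bounds=(0.01, 10.0), method="bounded")
ell_hat = res.x

# Marginalization: average the likelihood over a prior on the lengthscale.
prior_draws = stats.gamma(a=2.0, scale=0.5).rvs(size=2000, random_state=rng)
marginalized = np.mean([np.exp(log_marginal_likelihood(y, X, ell)) for ell in prior_draws])

print("optimized lengthscale:  ", ell_hat)
print("profiled likelihood:    ", np.exp(log_marginal_likelihood(y, X, ell_hat)))
print("marginalized likelihood:", marginalized)
```

In practice the integral is usually handled with MCMC or quadrature rather than naive Monte Carlo, which is part of the extra cost the quoted slide alludes to.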

jld
  • This makes so much sense. I was also going through another paper on Bayesian Optimization and saw an equation that integrates the acquisition function over the predictive distribution of a [Gaussian process, Eq. (2)](http://proceedings.mlr.press/v51/gonzalez16a.pdf). Applying your comments, is the predictive distribution here the nuisance parameter? – GENIVI-LEARNER Oct 31 '19 at 15:58
  • 1
    @GENIVI-LEARNER it doesn't always have to be something that's a nuisance parameter, i just like that term because it emphasizes that we're trying to get rid of it one way or another – jld Oct 31 '19 at 19:38
  • Got it. Also, one more thing: you mentioned that "we need to know its value to evaluate the density but we don't actually care about it." So you are suggesting that once we use the nuisance parameter to determine the density, we can discard it by marginalizing since we don't actually care about it. Right? – GENIVI-LEARNER Oct 31 '19 at 21:15
  • Have I summed up the idea correctly in my previous comment? – GENIVI-LEARNER Nov 01 '19 at 15:51
  • 1
    @GENIVI-LEARNER I think that sort of thing just depends on what you are doing, but also if you are marginalizing it out you aren’t even estimating it, you’re averaging over all possible $\eta$ so there isn’t even a $\hat\eta$ to discard – jld Nov 01 '19 at 16:10
  • I modified a post which is still unanswered. If it makes sense, please do [take a look](https://stats.stackexchange.com/questions/432138/bayesian-optimization) – GENIVI-LEARNER Nov 16 '19 at 15:19