
What are state-of-the-art alternatives to Gaussian Processes (GP) for nonparametric nonlinear regression with prediction uncertainty, when the size of the training set starts becoming prohibitive for vanilla GPs, but it is still not very large?

Details of my problem are:

  • input space is low-dimensional ($\mathcal{X} \subseteq \mathbb{R}^d$, with $2\le d \le 20$)
  • output is real-valued ($\mathcal{Y} \subseteq \mathbb{R}$)
  • the number of training points is $10^3 \lesssim N \lesssim 10^4$, about an order of magnitude larger than what you can handle with standard GPs (without approximations)
  • the function $f: \mathcal{X} \rightarrow \mathcal{Y}$ to approximate is a black-box; we can assume continuity and a certain degree of smoothness (e.g., I would use a Matérn covariance function with $\nu = \frac{5}{2}$ for a GP)
  • for each queried point, the approximation needs to return mean and variance (or analogous measure of uncertainty) of the prediction
  • I need the method to be retrainable relatively fast (on the order of seconds) when one or a few new training points are added to the training set

Any suggestion is welcome (a pointer/mention to a method and why you think it'd work is enough). Thank you!
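
For reference, here is a minimal sketch of the exact-GP baseline described above (Matérn 5/2 kernel, predictive mean and variance per query point), using scikit-learn; the toy data, kernel settings, and noise level are placeholder assumptions. Its $O(N^3)$ training cost is exactly what becomes prohibitive as $N$ approaches $10^4$.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# toy stand-in for the black-box function f: R^d -> R
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 4))
y = np.sin(X).sum(axis=1) + 0.05 * rng.standard_normal(len(X))

# exact GP with a Matern-5/2 kernel: O(N^3) training cost, O(N^2) memory
kernel = Matern(length_scale=np.ones(X.shape[1]), nu=2.5) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, y)

# predictive mean and standard deviation at new query points
X_star = rng.uniform(-3, 3, size=(5, 4))
mean, std = gp.predict(X_star, return_std=True)
```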

lacerbi
  • What about sparse GPs? With good placement of the inducing points and if there is a sparse relationship between inputs and outputs, $10^4$ training points would be a piece of cake on a Xeon workstation. – DeltaIV Aug 05 '16 at 15:12
  • Thanks @DeltaIV. I think that the key point in your answer is "with good placement of the inducing points". Finding good inducing points ($f$ is black-box) seems like a hard problem. Which kind of approximation would you recommend? (e.g., FITC?) Does it work well in practice? – lacerbi Aug 05 '16 at 15:25
  • Of course you learn their position from data. No, FITC is inferior to VFE. Have a look here: http://arxiv.org/pdf/1606.04820v1.pdf. Dimensionality & size of the training data set are similar to yours. – DeltaIV Aug 05 '16 at 15:31
  • I meant, it seems a *computationally* hard problem. Of course you learn their position from the data. But from what I've seen so far, you had to perform an optimization over $\mathbb{R}^{md}$, where $m$ is the number of inducing points. Maybe it's easier than I thought. Thanks for the excellent reference -- it looks spot on and very recent, I had missed it. Maybe this is what I need. – lacerbi Aug 05 '16 at 15:37
  • Do you strictly need nonparametric and nonlinear regression methods? I don't know about your application, but in computational mechanics & fluid dynamics (classic cases where $f$ is a black box), methods similar to orthogonal polynomial regression work remarkably well, i.e., compressed sensing Polynomial Chaos/Stochastic Collocation methods. Otherwise you could try MARS or GAMs (GAMs are additive, though). – DeltaIV Aug 05 '16 at 15:37
  • Finally, I've never used them, but random forests and extreme gradient boosting are both popular nonparametric nonlinear regression methods for high dimensional problems with large training sets. – DeltaIV Aug 05 '16 at 15:41
  • Generalised additive models (GAMs) for location and scale, as implemented in R package BNSP. – papgeo Aug 08 '18 at 22:36
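
To make the sparse-GP suggestion in the comments above concrete, here is a minimal sketch of the variational inducing-point approach (VFE, implemented as the SGPR model), written against GPflow 2.x; the toy data, the number of inducing points, and their initialization by random subsampling are illustrative assumptions, and the exact API may differ between library versions.

```python
import numpy as np
import gpflow

# toy data standing in for the black-box function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5000, 2))
Y = np.sin(X[:, :1]) * np.cos(X[:, 1:2]) + 0.05 * rng.standard_normal((5000, 1))

# initialize inducing points, e.g. on a random subset of the inputs
# (k-means centroids are another common choice); their locations are then optimized
Z = X[rng.choice(len(X), size=200, replace=False)].copy()

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.Matern52(lengthscales=np.ones(X.shape[1])),
    inducing_variable=Z,
)

# maximize the variational (VFE) bound over hyperparameters and inducing locations
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# predictive mean and variance at new query points
X_star = rng.uniform(-3, 3, size=(10, 2))
mean, var = model.predict_f(X_star)
```

With $m$ inducing points the cost per training iteration is $O(N m^2)$ rather than $O(N^3)$, which is what makes $N \sim 10^4$ manageable.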

1 Answer


A Matérn covariance function with $\nu = 5/2$ is already quite smooth and behaves much like a Squared Exponential kernel (the $\nu \to \infty$ limit of the Matérn family).

So I think that a Radial Basis Function (RBF) based approach is perfect in this scenario. It is fast, it works for the kind of black-box function that you have, and you can get measures of uncertainty.

Alternatively, you can use inducing-point approximations for GPs (have a look at FITC in the literature), but then you face the same problem of where to place the inducing points.

ancillary
  • Thanks. I knew of RBFs but didn't know that it is possible/easy to get decent measures of uncertainty for them (I thought that an RBF + uncertainty would pretty much go back to a GP with SE kernel). Could you recommend any starting point for reading on RBFs and how to compute uncertainty with them? – lacerbi Aug 24 '16 at 20:08
  • Well, it is basically Bayesian linear regression using basis functions. And you can choose the basis functions to be the Gaussian ones. So you just need to assign priors on the parameters and you'll get your posterior distribution. Follow the steps in Bishop's book "Pattern recognition", chapter 6.4.1. I also see from your profile that we have a lot of interests in common! Might be nice to keep in touch :-) I'm more than happy to help when I can. – ancillary Aug 24 '16 at 23:09
  • I had a look at Chapter 6.4.1. How is this different/faster than GPs? I understand that for training I could probably just minimize the loss via L-BFGS (and perhaps there are even smarter methods). This is in my understanding why RBFs are faster to fit than GPs (the bottleneck for GPs is matrix inversion). But to compute the predictive uncertainty I need to condition on the observed points -- won't this require an inversion of an $M$-by-$M$ matrix? ($M$ = number of training points) – lacerbi Aug 25 '16 at 14:14
  • Sorry, I should have probably said to have a look at the Bayesian Linear Regression in chapter 3. What you say is correct, a Bayesian linear regression model is equivalent to a GP with a special kernel function, so if you want the variance of the predictive distribution you have to invert the matrix. You can do it in a clever way by solving linear systems of equations backwards/forwards. – ancillary Aug 25 '16 at 19:10
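
To make the last two comments concrete, here is a minimal sketch of Bayesian linear regression with Gaussian basis functions (following the predictive equations in Bishop, PRML, Ch. 3.3); the number of basis centres, their placement on a random subset of the inputs, and the precision hyperparameters `alpha`/`beta` are placeholder assumptions. Note that in this weight-space view the factorized matrix is only $M \times M$ with $M$ the number of basis functions, which is where the savings over an exact GP come from when $M \ll N$.

```python
import numpy as np

def gaussian_design_matrix(X, centers, width):
    # Phi[n, m] = exp(-||x_n - c_m||^2 / (2 * width^2)), plus a constant bias column
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    Phi = np.exp(-0.5 * sq_dists / width ** 2)
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def fit_bayesian_rbf(Phi, y, alpha=1e-2, beta=25.0):
    # Posterior over the weights (Bishop, PRML, eqs. 3.53-3.54):
    #   S_N^{-1} = alpha * I + beta * Phi^T Phi,   m_N = beta * S_N Phi^T y
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi   # M x M, M = number of basis functions
    L = np.linalg.cholesky(A)                                # factorize once at training time
    m_N = beta * np.linalg.solve(A, Phi.T @ y)               # posterior mean of the weights
    return m_N, L, beta

def predict_bayesian_rbf(Phi_star, m_N, L, beta):
    # Predictive mean and variance (Bishop, PRML, eqs. 3.58-3.59):
    #   mean(x) = phi(x)^T m_N,   var(x) = 1/beta + phi(x)^T S_N phi(x)
    mean = Phi_star @ m_N
    V = np.linalg.solve(L, Phi_star.T)           # L^{-1} phi(x) for every query point
    var = 1.0 / beta + (V ** 2).sum(axis=0)      # phi^T S_N phi = ||L^{-1} phi||^2
    return mean, var

# toy usage: centres on a random subset of the training inputs (a common heuristic)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5000, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.05 * rng.standard_normal(len(X))
centers = X[rng.choice(len(X), size=300, replace=False)]
Phi = gaussian_design_matrix(X, centers, width=1.0)
m_N, L, beta = fit_bayesian_rbf(Phi, y)
mean, var = predict_bayesian_rbf(gaussian_design_matrix(X[:5], centers, width=1.0), m_N, L, beta)
```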