3

Consider the training set $\{(x_i, y_i)\}_{i=1}^N$, with $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$.

The goal is to find a regression function of the form $f(x) = \sum_{i=1}^K a_i g_i(x) + a_0$. The least-squares solution for the coefficients $a_i$ is well known once the functions $g_i$ are fixed.
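For concreteness, here is a minimal Python sketch of that least-squares fit when the $g_i$ are fixed. The toy data, centers and widths below are made-up placeholders, not something prescribed by the problem:

```python
import numpy as np

# Toy 1-D data, purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(100)

# Fixed (assumed) basis-function parameters c_i and gamma_i.
centers = np.linspace(-3, 3, 5).reshape(-1, 1)
gammas = np.full(len(centers), 1.0)

def design_matrix(x, centers, gammas):
    # Columns [1, g_1(x), ..., g_K(x)]; the leading ones column carries a_0.
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.ones((len(x), 1)), np.exp(-gammas * sq_dist)])

G = design_matrix(x, centers, gammas)
coef, *_ = np.linalg.lstsq(G, y, rcond=None)  # coef = [a_0, a_1, ..., a_K]
```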

Let's consider kernel regression with radial basis functions (say, Gaussian), so that $g_i(x) = \exp \left( - \gamma_i \|x - c_i \|^2 \right)$.

What is the best choice for the basis-function parameters $\{c_i\}_{i=1}^K$ and $\{ \gamma_i \}_{i=1}^K$?

TheBug
    The answer should be data dependent. Are you asking how to determine them? – JohnRos Oct 29 '11 at 16:59
  • @JohnRos, sure, it must be data dependent. Is there any method to find the parameters based on the training set? – TheBug Oct 29 '11 at 17:28
  • If we knew the actual data distribution $\mathbb{P}$, the least-squares criterion with the $g_i$ fixed would be $\int \left( y - (\sum_{i=1}^K a_i g_i(x) + a_0) \right)^2 \, d\mathbb{P}(x,y) \to \min_{a_i}$. If the basis-function parameters are unknown as well, the problem becomes $\int \left( y - (\sum_{i=1}^K a_i g_i(x) + a_0) \right)^2 \, d\mathbb{P}(x,y) \to \min_{a_i, c_i, \gamma_i}$. Are there methods that solve this problem, or any recommendations for the best choice of parameters? – TheBug Oct 29 '11 at 17:36

2 Answers

3

As others have suggested, the problem with a model of this form is that having so many tunable parameters is a recipe for over-fitting: unless the dataset is very large, you will most likely get better generalisation performance from a simpler model. I would recommend using a single scale parameter $\gamma$, rather than having one for each datapoint, and using regularisation on the model weights $a_i$ (e.g. ridge regression).

To answer your question directly, it is possible to choose the scale parameters $\gamma_i$ and the centers $c_i$ by gradient-descent optimisation of the regularised loss function, using essentially the back-propagation algorithm used in multi-layer perceptron neural networks. ISTR an early paper by Andrew Webb on this topic. However, this brings all the problems associated with MLPs, such as local minima and long training times, and loses the main advantage of radial basis function networks, i.e. an efficient training procedure. I think Prof. Sheng Chen and co-workers at Southampton University have also been working on algorithms for this sort of thing (IIRC the papers were published in IEEE Transactions on Neural Networks). However, I suspect the over-fitting problem is the fatal flaw in this approach, and I didn't find the experimental evaluation very convincing.
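To make that idea concrete, here is a rough sketch (not taken from the papers mentioned above) of jointly optimising the weights, centers and scales by minimising a regularised squared loss with a general-purpose optimiser; scipy's L-BFGS with finite-difference gradients stands in for hand-derived back-propagation, and the initialisation and penalty are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_design(x, centers, log_gammas):
    # g_i(x) = exp(-gamma_i * ||x - c_i||^2); log-parameters keep gamma_i > 0.
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-np.exp(log_gammas) * sq_dist)

def fit_rbf(x, y, K=5, lam=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    n = x.shape[1]
    # Initialise centers on random training points, gammas at 1, weights at 0.
    c0 = x[rng.choice(len(x), K, replace=False)]
    theta0 = np.concatenate([np.zeros(K + 1), c0.ravel(), np.zeros(K)])

    def unpack(theta):
        a = theta[:K + 1]                               # a_0, a_1, ..., a_K
        c = theta[K + 1:K + 1 + K * n].reshape(K, n)    # centers c_i
        log_g = theta[K + 1 + K * n:]                   # log(gamma_i)
        return a, c, log_g

    def loss(theta):
        a, c, log_g = unpack(theta)
        f = a[0] + rbf_design(x, c, log_g) @ a[1:]
        # Regularised squared loss; the weight penalty is a simple illustrative choice.
        return np.mean((y - f) ** 2) + lam * np.sum(a[1:] ** 2)

    res = minimize(loss, theta0, method="L-BFGS-B")  # finite-difference gradients
    return unpack(res.x)                              # fitted weights, centers, log-widths
```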

The usual approach to tuning the regularisation (ridge) and kernel parameters is to minimise the leave-one-out cross-validation error (or GCV), which can be computed efficiently for this sort of model (including gradient information); a sketch of that computation is given after the references below. However, if you have more than two or three parameters to tune this way, you are very likely to over-fit the LOOCV model selection criterion, see

G. C. Cawley and N. L. C. Talbot, Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters, Journal of Machine Learning Research, volume 8, pages 841-861, April 2007 (www)

and

G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, volume 11, pages 2079-2107, July 2010. (www)
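As a sketch of the "computed efficiently" remark above: for a linear-in-the-weights model with a ridge penalty, the leave-one-out residuals follow from the hat matrix without refitting $N$ times, via the standard PRESS identity $e_i^{\text{loo}} = (y_i - \hat{y}_i)/(1 - h_{ii})$. As a simplification, the snippet below penalises every column of the design matrix, including the intercept column:

```python
import numpy as np

def loo_error(G, y, lam):
    """Closed-form leave-one-out MSE for ridge regression on design matrix G.

    Uses e_i = (y_i - yhat_i) / (1 - h_ii) with H = G (G'G + lam I)^{-1} G'.
    """
    A = G.T @ G + lam * np.eye(G.shape[1])
    H = G @ np.linalg.solve(A, G.T)     # hat matrix
    residuals = y - H @ y
    h = np.diag(H)
    return np.mean((residuals / (1.0 - h)) ** 2)

# Tune lam (and, analogously, a single global kernel width) on a small grid:
# best_lam = min(lams, key=lambda l: loo_error(G, y, l))
```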

For choosing the centers there are a number of ways you can go about it: the most effective is to choose a subset of the training data, either randomly or by greedy optimisation of the training criterion, to greedily span the space of the basis functions (e.g. Fine and Scheinberg), or to use the Nyström method. This problem has been well studied in the machine learning literature, and a really thorough empirical study would be very useful. However, again, the key problem is that any time you make a choice about the model or optimise a parameter based on a statistic estimated from a finite sample of data (e.g. a training criterion, a CV estimate or a Bayesian marginal likelihood), you invite over-fitting, and the more choices you make, the greater the risk.
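Here is a sketch of the two simplest of those options, choosing the centers as a subset of the training inputs either at random or by greedy forward selection on a regularised training criterion; the Fine-Scheinberg and Nyström approaches change the selection rule but not the overall shape. The helper names and the brute-force greedy loop are illustrative only:

```python
import numpy as np

def random_centers(x, K, seed=0):
    # Random subset of the training inputs as RBF centers.
    rng = np.random.default_rng(seed)
    return x[rng.choice(len(x), size=K, replace=False)]

def greedy_centers(x, y, K, gamma, lam=1e-2):
    # Greedy forward selection: repeatedly add the training point whose basis
    # function most reduces the regularised training error (brute force, for a sketch).
    chosen, remaining = [], list(range(len(x)))
    for _ in range(K):
        best_i, best_err = None, np.inf
        for i in remaining:
            c = x[np.array(chosen + [i])]
            G = np.exp(-gamma * ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1))
            G = np.hstack([np.ones((len(x), 1)), G])
            a = np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ y)
            err = np.mean((y - G @ a) ** 2)
            if err < best_err:
                best_i, best_err = i, err
        chosen.append(best_i)
        remaining.remove(best_i)
    return x[np.array(chosen)]
```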

The standard approach in RBF neural networks (I think the paper is by Moody and Darken) is to perform a cluster analysis of the data and place the centers on the cluster centers.
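A sketch of that recipe using scikit-learn's KMeans; the per-cluster width heuristic below is just one common choice, not necessarily the one in the original paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_rbf_parameters(x, K, seed=0):
    # Place the RBF centers on the k-means cluster centers.
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(x)
    centers = km.cluster_centers_

    # Heuristic widths: gamma_i = 1 / (2 * s_i^2), with s_i the mean distance of
    # cluster i's points to its center (an assumption, not taken from the paper).
    gammas = np.empty(K)
    for i in range(K):
        d = np.linalg.norm(x[km.labels_ == i] - centers[i], axis=1)
        s = d.mean() if d.size else 1.0
        gammas[i] = 1.0 / (2.0 * max(s, 1e-12) ** 2)
    return centers, gammas
```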

In short, there are a variety of ways this problem can be solved, but it is questionable whether any of them is a good way to approach a practical application.

Dikran Marsupial
2

The knots $\{c_i\}_{i=1}^K$ are usually set to $\{x_i \}_{i=1}^N$ (i.e. $K=N$) to avoid the intractable knot-selection problem, while the model complexity is controlled by a separate regularization term that shrinks the $\{a_j\}_{j=1}^N$ towards zero.

Furthermore, the $\{ \gamma_i \}_{i=1}^N$ are all set to a single global value $\gamma$, which is tuned using resampling (e.g. cross-validation); this requires training the model for a set of $\gamma$ values from a (relatively small) grid. The regularization penalty is tuned in the same way.

Under such simplifications this becomes a generalized ridge regression:

$\min_{\{a_j\}}\sum_{i=1}^N \left( y_i - (\sum_{j=1}^N a_j g_j(x_i,\gamma) + a_0) \right)^2 + \lambda\sum_{j=1}^Na_j^2$

where $g_j(x) = \exp \left( - \gamma \|x - x_j \|^2 \right)$, and $\gamma$ and $\lambda$ are tuned using resampling as mentioned above.

Ridge regression has a closed-form solution which is very similar to that of ordinary least squares (with the OLS $X'X$ replaced by $X'X +\lambda I$).
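A sketch of that closed form, with all $N$ training inputs as knots and a leading column of ones carrying $a_0$ (which, for simplicity, is penalised here along with the other weights); the tiny comment grid stands in for a proper resampling loop:

```python
import numpy as np

def gaussian_design(x, knots, gamma):
    # Columns [1, g_1(x), ..., g_N(x)] with g_j(x) = exp(-gamma * ||x - x_j||^2).
    sq_dist = ((x[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.ones((len(x), 1)), np.exp(-gamma * sq_dist)])

def fit_ridge(G, y, lam):
    # Closed form: a = (G'G + lam I)^{-1} G'y.
    A = G.T @ G + lam * np.eye(G.shape[1])
    return np.linalg.solve(A, G.T @ y)

# Grid-tune gamma and lambda with resampling (cross-validation) -- sketch only:
# for gamma in (0.1, 1.0, 10.0):
#     for lam in (1e-3, 1e-2, 1e-1):
#         fit fit_ridge(gaussian_design(x_tr, x_tr, gamma), y_tr, lam) on each
#         training fold, score on the held-out fold, keep the best (gamma, lam)
```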

Yevgeny
  • Thanks, I know about ridge and lasso regression, where the solution is $f(x) = \sum_{i=1}^N a_i \exp \left( - \gamma \|x - x_i \|^2 \right)$ according to the representer theorem. But I'm looking for a solution when $K < N$. – TheBug Nov 17 '11 at 21:46
  • 1
    I believe your formulation of this problem (i.e. unconstrained $\{c_i\}_{i=1}^K$ and $\{\gamma_i\}_{i=1}^K$) is NP-hard (not to mention non-linear and multimodal). That makes it about as intractable and nasty an optimization problem as it gets. So, unless this is the topic of a research thesis with a big time budget, I would suggest using heuristics plus regularization, which takes you into the realm of generalized additive models and (potentially) boosting/bagging. The original formulation is impractical. – Yevgeny Nov 17 '11 at 22:04
  • 1
    Another problem with the original formulation is that it does not protect against over-fitting: one could drive the training error to zero simply by setting $K$ high enough. – Yevgeny Nov 22 '11 at 19:32