Suppose I have predicted values obtained from several methods such as KNN, a maximum likelihood estimator, $k$-means clustering, etc., say $x_1, x_2, \ldots, x_n$, where column vector $x_i$ holds the predictions from method $i$. I want to combine all these results using least squares, i.e., solving $Xb = y$, where $X = [x_1, x_2, \ldots, x_n]$ and the vector $y$ stores the known values. This least-squares regression then gives me combined predicted values, say $x^*$. Of course, I have a training set to tune the parameters of all methods and to obtain the least-squares coefficients. I wonder whether this combination has any chance of outperforming the best single method. That is, on a testing set I predict with every single method, where the result from the best single method is $x_{tj}$, and then perform least squares to get the combined result $x_t^*$. Can I be sure that $\sum_k (x_{tjk} - x_{tk})^2 \le \sum_k (x_{tk}^* - x_{tk})^2$, where $x_t$ is the known vector?
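For concreteness, here is a minimal sketch of the combination step I mean, in Python/NumPy, with synthetic predictions standing in for the outputs of my real methods:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: m training points, n methods' predictions as the columns of X.
m, n = 100, 3
y = rng.normal(size=m)                      # known values on the training set
X = np.column_stack([y + rng.normal(scale=s, size=m) for s in (0.5, 0.8, 1.2)])

# Ordinary least squares: b = argmin ||X b - y||^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Combined prediction x* = X b
x_star = X @ b
print("combination coefficients:", b)
print("combined SSE:", np.sum((x_star - y) ** 2))
print("best single-method SSE:", min(np.sum((X[:, i] - y) ** 2) for i in range(n)))
```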
I have tested this tens of thousands of times using random samples from my real data with different numbers of data points, and only about $2\%$ of the results show that the error between the combined predictions and the known values is smaller than the error between the best single method's predictions and the known values. So it seems that combining the results of all single methods in this way does not help to improve the prediction accuracy. But how can I prove this?
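The comparison I ran looks roughly like the following sketch (again with synthetic data in place of my real samples): fit the coefficients on a random training split, then compare the test SSE of the combination against the test SSE of the best single method.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_trial(m_train=80, m_test=40, noise=(0.5, 0.8, 1.2)):
    """One random split: fit b on the training part, compare SSEs on the test part."""
    m = m_train + m_test
    y = rng.normal(size=m)
    X = np.column_stack([y + rng.normal(scale=s, size=m) for s in noise])
    Xtr, ytr = X[:m_train], y[:m_train]
    Xte, yte = X[m_train:], y[m_train:]

    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    sse_comb = np.sum((Xte @ b - yte) ** 2)
    # "Best" single method chosen by its test error, as in my comparison above
    sse_best = min(np.sum((Xte[:, i] - yte) ** 2) for i in range(X.shape[1]))
    return sse_comb < sse_best

trials = 10_000
wins = sum(one_trial() for _ in range(trials))
print("fraction of trials where the combination beats the best single method:",
      wins / trials)
```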
What if I constrain the least-squares coefficients to satisfy $b_i > 0$ and $\sum_i b_i = 1$?
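A sketch of this constrained fit, using SciPy's general-purpose SLSQP solver, with non-negativity $b_i \ge 0$ in place of the strict inequality:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Same synthetic stand-in as before
m, n = 100, 3
y = rng.normal(size=m)
X = np.column_stack([y + rng.normal(scale=s, size=m) for s in (0.5, 0.8, 1.2)])

def sse(b):
    """Objective: sum of squared errors of the combined prediction X b."""
    r = X @ b - y
    return r @ r

# Minimize the SSE subject to b_i >= 0 and sum(b_i) = 1 (a convex combination of methods)
res = minimize(
    sse,
    x0=np.full(n, 1.0 / n),                                        # start at equal weights
    bounds=[(0.0, None)] * n,                                      # b_i >= 0
    constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],  # sum(b_i) = 1
    method="SLSQP",
)
print("constrained weights:", res.x)
print("constrained combined SSE:", sse(res.x))
```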