6

I have built different classification models (logistic regression, random forest, and xgboost) for a dataset. I would like to combine the predictions of all the models to reduce variance and increase robustness. I read that simply averaging the predictions of all the models is not the best way to combine them, and that well-performing models should ideally be weighted more heavily than poorly performing ones. Can someone suggest how to assign the weights to each model? Or should I iterate over weights from 0.01 to 0.9 for all 3 models and decide the weights based on the misclassification rate for each combination?

kjetil b halvorsen
Shudharsanan
  • Look up "Bayesian model averaging" and also check out this blog post: http://mlwave.com/kaggle-ensembling-guide/ – shadowtalker Sep 02 '16 at 09:19
  • BTW, if you come up with a solution you're happy with, I would strongly encourage you to write it up as an answer to your own question. It will be helpful to future searchers. – shadowtalker Sep 02 '16 at 09:19
  • Thanks for your responses. I got a suggestion from my friends to run a linear or non-linear model with the predictions from all the models as independent variables to predict the target variable (dependent variable). I will try it out and post the results. – Shudharsanan Sep 02 '16 at 15:09

1 Answer

4

Stacking (Wolpert 1992) is a method for combining multiple base models using a high level model. The output of each base model is provided as an input to the high level model, which is then trained to maximize performance. Using the same data to train the base models and the high level model would result in overfitting, so the high level model is instead trained on predictions the base models make on data they were not fit on: each base model is trained on the training set, and a separate validation set is fed through each base model to obtain inputs for training the high level model. When a single held-out validation set is used rather than cross validation, the technique is sometimes called blending. Stacking can be used for different types of problems (e.g. classification, regression, unsupervised learning). It works well in practice and has become a popular tool in machine learning competitions.
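
As a rough sketch of what this looks like in code (assuming a scikit-learn/xgboost setup; the data below is synthetic and the hyperparameters are placeholders, not recommendations), scikit-learn's `StackingClassifier` handles the cross-validation bookkeeping for you:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models; hyperparameters here are placeholders.
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("xgb", XGBClassifier(n_estimators=300, random_state=0)),
]

# The high level model is trained on out-of-fold predicted class probabilities
# produced by 5-fold cross validation, so it never sees predictions the base
# models made on their own training folds.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)
print("held-out accuracy:", stack.score(X_test, y_test))
```

Here the final `LogisticRegression` plays the role of the high level model; swapping in a different `final_estimator` gives the fancier variants mentioned below.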

In your case, the base models would be logistic regression, a random forest, and xgboost. Each of these models gives predicted class probabilities, which would be used as inputs to the high level model. In general, it's not necessary for base models to output class probabilities, but we can use them when available. A simple high level model might be a weighted average of the predicted class probabilities from each base model. In this case, you'd find the weights that minimize the log loss on the validation set (subject to the constraints that the weights are nonnegative and sum to one). An alternative high level model might be logistic or multinomial logistic regression (which will work even if the base models output scores rather than probabilities, like support vector machines). Fancier high level models are possible too (random forests, boosted classifiers, etc.).
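
As a minimal sketch of the weighted-average version (assuming you already have each base model's predicted class probabilities on a held-out validation set; `p_logreg`, `p_rf`, `p_xgb`, and `y_val` below are illustrative names, not outputs of any code above), the weights can be found with a constrained optimizer such as `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

# p_logreg, p_rf, p_xgb: arrays of shape (n_val, n_classes) holding each base
# model's predicted class probabilities on the validation set; y_val holds the
# true labels. These are assumed to exist already.
probs = np.stack([p_logreg, p_rf, p_xgb])      # shape (3, n_val, n_classes)
n_models = probs.shape[0]

def blended_log_loss(w):
    # Weighted average of the base models' probabilities, then log loss.
    blend = np.tensordot(w, probs, axes=1)     # shape (n_val, n_classes)
    blend /= blend.sum(axis=1, keepdims=True)  # guard against small drift in the sum-to-one constraint
    return log_loss(y_val, blend)

result = minimize(
    blended_log_loss,
    x0=np.full(n_models, 1.0 / n_models),      # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,            # weights are nonnegative
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # and sum to one
)
print("optimal weights:", result.x)
```

Log loss on the validation data is the objective here; you could minimize misclassification rate instead, but log loss is smooth and tends to give a more stable optimum.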

In general, the best high level model will depend on the problem. Particular constraints on the high level model have been found helpful in certain settings; for example, Breiman (1996) found that restricting the weights to be nonnegative was important for stacked regressions. Be mindful that it's possible for the high level model to overfit the validation set.

References:

Wolpert (1992). Stacked generalization.

Breiman (1996). Stacked regressions.

Ting and Witten (1999). Issues in stacked generalization.

Kaggle Ensembling Guide (blog post, 2015)

user20160