35

I am trying to solve a regression task. I found that 3 models work nicely for different subsets of the data: LassoLars, SVR and Gradient Tree Boosting. I noticed that when I make predictions with all 3 models and build a table of 'true output' against the outputs of my 3 models, each time at least one of the models is really close to the true output, though the other 2 can be relatively far away.

When I compute the minimal possible error (taking, for each test example, the prediction of the 'best' predictor), I get an error which is much smaller than the error of any model alone. So I thought about combining the predictions from these 3 different models into some kind of ensemble. The question is how to do this properly. All 3 of my models are built and tuned using scikit-learn; does it provide some kind of method to pack models into an ensemble? The problem here is that I don't want to simply average the predictions from all three models: I want to weight them, where the weighting should be determined based on the properties of the specific example.

Even if scikit-learn does not provide such functionality, it would be nice if someone knew how to properly address this task of figuring out the weighting of each model for each example in the data. I think it might be done by a separate regressor built on top of these 3 models, which would try to output the optimal weights for each of the 3 models, but I am not sure whether this is the best way of doing it.

Maksim Khaitovich

4 Answers

39

Actually, scikit-learn does provide such functionality, though it might be a bit tricky to implement. Here is a complete working example of such an averaging regressor built on top of three models. First of all, let's import all the required packages:

from sklearn.base import TransformerMixin
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

Then, we need to convert our three regressor models into transformers. This will allow us to merge their predictions into a single feature vector using FeatureUnion:

class RidgeTransformer(Ridge, TransformerMixin):

    def transform(self, X, *_):
        # Expose predictions as a feature column so FeatureUnion can merge them.
        return self.predict(X).reshape(len(X), -1)


class RandomForestTransformer(RandomForestRegressor, TransformerMixin):

    def transform(self, X, *_):
        return self.predict(X).reshape(len(X), -1)


class KNeighborsTransformer(KNeighborsRegressor, TransformerMixin):

    def transform(self, X, *_):
        return self.predict(X).reshape(len(X), -1)

Now, let's define a builder function for our Frankenstein model:

def build_model():
    ridge_transformer = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('poly_feats', PolynomialFeatures()),
        ('ridge', RidgeTransformer())
    ])

    pred_union = FeatureUnion(
        transformer_list=[
            ('ridge', ridge_transformer),
            ('rand_forest', RandomForestTransformer()),
            ('knn', KNeighborsTransformer())
        ],
        n_jobs=2
    )

    model = Pipeline(steps=[
        ('pred_union', pred_union),
        ('lin_regr', LinearRegression())
    ])

    return model

Finally, let's fit the model:

print('Build and fit a model...')

model = build_model()

X, y = make_regression(n_features=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print('Done. Score:', score)

Output:

Build and fit a model...
Done. Score: 0.9600413867438636

Why bother complicating things in such a way? Well, this approach allows us to optimize the model's hyperparameters using standard scikit-learn tools such as GridSearchCV or RandomizedSearchCV. Also, it is now possible to easily save a pre-trained model to disk and load it back.
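As a sketch of how such a search could look: the parameter names below follow scikit-learn's nested `step__substep__param` convention for the pipeline defined above, and the grid values are purely illustrative.

from sklearn.model_selection import GridSearchCV

# Illustrative grid: tune the polynomial degree, the ridge penalty,
# the forest size and the number of neighbours through the ensemble.
param_grid = {
    'pred_union__ridge__poly_feats__degree': [2, 3],
    'pred_union__ridge__ridge__alpha': [0.1, 1.0, 10.0],
    'pred_union__rand_forest__n_estimators': [50, 100],
    'pred_union__knn__n_neighbors': [3, 5, 7],
}

grid = GridSearchCV(build_model(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)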

constt
  • When using this approach, is there a simple way to extract which algo is being used when, or what fraction of each algo is used? – David Hagan Nov 06 '17 at 14:47
  • Perhaps looking at the coefficients of the resulting linear model (`model.named_steps['lin_regr'].coef_`) will give you some insight into how much each model in the ensemble contributes to the final solution. – constt Nov 07 '17 at 12:00
  • @constt Wouldn't you need to use cross_val_predict in your base models? It seems like your top-level model would get an overoptimistic signal from your base models as this is currently implemented. – Brian Bien May 16 '18 at 01:04
  • This is just a kind of proof-of-concept example; I didn't address model selection here. I think such models should be optimized as a whole, i.e., optimizing the hyper-parameters of all the built-in models simultaneously using a cross-validation approach. – constt May 16 '18 at 04:57
  • If we set `n_targets=1` (`X, y = make_regression(n_features=10, n_targets=1)`), it gives a dimension error. Can anyone please explain what to do? – Mohit Yadav Apr 12 '19 at 21:36
  • @constt The line `model.fit(X_train, y_train)` gives me an error when I make the data set like this: `X, y = make_regression(n_features=10, n_targets=1)`. – Mohit Yadav Apr 13 '19 at 13:47
  • If I give `n_targets=1` (or do not specify it), I get an error: `ValueError: Expected 2D array, got 1D array instead... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.` Well, I did the suggested reshape on both X and y, separately as well as together, but the error does not go away. Furthermore, when I did `X = np.array(X)`, I got a `Win 32 PermissionError`. How do I fix these errors when my target has only 1 column? – Kristada673 Aug 20 '19 at 03:06
  • Sorry guys, I haven't been here for a while. Please check out the edits I made to the transformer classes' `transform` methods; the model now works for 1D training data. – constt Jun 12 '20 at 12:23
  • It gets stuck during `model.fit(X_train, y_train)`, even with a small number of samples, and it produces deprecation warnings: `DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.` – shahar_m Jul 19 '20 at 14:51
  • @shahar_m Dunno, for me it works like a charm. No warnings. Python 3.6.9, scikit-learn 0.23.1, numpy 1.18.4 – constt Jul 20 '20 at 06:56
  • @constt I'm using scikit-learn version 0.19.2 since it's the only one that supports onnx conversion. – shahar_m Jul 20 '20 at 15:49
11

OK, after spending some time Googling, I found out how I could do the weighting in Python, even with scikit-learn. Consider the following:

I train a set of my regression models (the mentioned SVR, LassoLars and GradientBoostingRegressor). Then I run all of them on the training data (the same data which was used to train each of these 3 regressors). I get predictions for the examples with each of my algorithms and save these 3 results into a pandas dataframe with columns 'predictedSVR', 'predictedLASSO' and 'predictedGBR'. And I add a final column into this dataframe, which I call 'predicted', which holds the true output value.

Then I just train a linear regression on this new dataframe:

 # df - dataframe with the results of the 3 regressors and the true output

 from sklearn import linear_model

 stacker = linear_model.LinearRegression()
 stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])

So when I want to make a prediction for a new example, I just run each of my 3 regressors separately and then call:

 stacker.predict()

on the outputs of my 3 regressors and get the result.
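Concretely, that last step might look like the following sketch, where `svr`, `lasso` and `gbr` are hypothetical names for the three fitted base regressors and `x_new` is a single new example of shape `(1, n_features)`:

import numpy as np

# Stack the three base predictions side by side; the column order must
# match the order used when fitting the stacker above.
base_preds = np.column_stack([
    svr.predict(x_new),
    lasso.predict(x_new),
    gbr.predict(x_new),
])

y_new = stacker.predict(base_preds)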

The problem here is that I am finding the optimal weights for the regressors only 'on average': the weights will be the same for every example on which I try to make a prediction.

If anyone has any ideas on how to do the stacking (weighting) using the features of the current example, it would be nice to hear them.

Maksim Khaitovich
  • Wow, I like this approach very much! But why did you use `LinearRegression()` instead of a `LogisticRegression()` model? – mllamazares Mar 15 '17 at 23:27
  • @harrison4 Because I was doing a regression, not a classification task? So I wanted to 'weight' the output from each model. Anyway, this is a bad approach; a good one is described here: http://stackoverflow.com/a/35170149/3633250 – Maksim Khaitovich Mar 15 '17 at 23:30
  • Yeah, sorry, you're right! Thanks for sharing the link! – mllamazares Mar 15 '17 at 23:35
7

If your data has obvious subsets, you could run a clustering algorithm like k-means and then associate each classifier with the clusters it performs well on. When a new data point arrives, determine which cluster it belongs to and run the associated classifier.

You could also use the inverse distances from the centroids to get a set of weights for each classifier and predict using a linear combination of all of the classifiers.
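A minimal sketch of that inverse-distance weighting, assuming `models` is a hypothetical list of fitted regressors with `models[i]` associated with cluster `i`:

import numpy as np
from sklearn.cluster import KMeans

# Assumption for this sketch: one model per cluster.
kmeans = KMeans(n_clusters=len(models)).fit(X_train)

def predict_weighted(x, models, kmeans, eps=1e-9):
    # Distances from the example to every centroid, turned into weights.
    distances = kmeans.transform(x.reshape(1, -1)).ravel()
    weights = 1.0 / (distances + eps)
    weights /= weights.sum()
    preds = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    return np.dot(weights, preds)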

anthonybell
2

I accomplish a type of weighting by doing the following, once all your models are fully trained and performing well:

  1. Run all your models on a large set of unseen testing data
  2. Store the F1 scores on the test set for each class, for each model
  3. When you predict with the ensemble, each model will give you its most likely class, so weight the confidence or probability by that model's F1 score on that class. If you're dealing with distances (as in an SVM, for example), just normalize the distances to get a general confidence, and then proceed with the per-class F1 weighting (see the sketch after this list).
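A rough sketch of steps 1-3, assuming `models` is a hypothetical list of fitted classifiers that support `predict_proba` and `(X_test, y_test)` is the held-out set:

import numpy as np
from sklearn.metrics import f1_score

classes = np.unique(y_test)

# Per-class F1 score of each model, computed once on the held-out set.
f1_per_model = [
    f1_score(y_test, m.predict(X_test), labels=classes, average=None)
    for m in models
]

def predict_one(x):
    # Each model votes for its most likely class; the vote is its
    # confidence multiplied by its F1 score on that class.
    votes = {}
    for m, f1 in zip(models, f1_per_model):
        probs = m.predict_proba(x.reshape(1, -1))[0]
        best = int(np.argmax(probs))
        cls = m.classes_[best]
        cls_idx = int(np.where(classes == cls)[0][0])
        votes[cls] = votes.get(cls, 0.0) + probs[best] * f1[cls_idx]
    return max(votes, key=votes.get)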

You can further tune your ensemble by measuring the percentage correct over time. Once you have scored a significantly large, new data set, you can plot the threshold (in steps of 0.1, for instance) against the percentage correct when using that threshold to score, to get an idea of what threshold will give you, say, 95% correct for class 1, and so on. You can keep updating the test set and F1 scores as new data come in, keep track of drift, and rebuild the models when thresholds or accuracy fall.

wwwslinger
  • That's interesting, but as far as I can see it only works for classification tasks, while I am trying to solve a regression task, so I can't compute an F1 score. – Maksim Khaitovich Feb 28 '15 at 17:10