Why is my stacked model worse than my base models?

Question

I'm learning stacking and started with the approach outlined in Introduction stacking

I've plotted the data:

I first would like to check if my algorithm is correct (see below):

So I basically performed 5-Fold CV to generate my test predictions for each base model M1 and M2. The output of this step is a matrix called meta_data of size n*L, with L=2 (2 base models).

Is this fine? I see in some books the following (see below, Z is my meta_data):

What is the rationale for doing this? Is it because, once meta_data has been built, we actually have K×L fitted models in total (L=2 fitted models for each fold), but the base models eventually need to be refit on the FULL training data so that the stacked model knows where each base model performs well or poorly?
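
For reference, the procedure described above (out-of-fold predictions for the meta-data, then a refit of each base model on the full training set) can be sketched roughly as follows. This is only a minimal sketch under assumptions: it uses scikit-learn, a synthetic data set standing in for the one plotted in the question, and KNN plus a linear SVM as M1 and M2. Feeding raw class labels to the logistic regression is a simplification; class probabilities are a common alternative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: the original data set is not available).
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Base models M1 and M2 (assumed choices, mirroring the thread).
base_models = [KNeighborsClassifier(n_neighbors=5), SVC(kernel="linear", C=5)]

# Step 1: out-of-fold predictions. Every row of meta_data is predicted by a
# model that did not see that row during fitting. Shape: (n, L).
meta_data = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])

# Step 2: refit each base model on the full training set, so that a single
# fitted copy per model is available at prediction time.
for m in base_models:
    m.fit(X_train, y_train)

# Step 3: fit the meta-model on the out-of-fold predictions only.
meta_model = LogisticRegression(max_iter=1000).fit(meta_data, y_train)

# Step 4: evaluate on held-out data using the refitted base models.
meta_test = np.column_stack([m.predict(X_test) for m in base_models])
stacked_accuracy = meta_model.score(meta_test, y_test)
base_accuracies = [m.score(X_test, y_test) for m in base_models]

print("base accuracies:", base_accuracies)
print("stacked accuracy:", stacked_accuracy)
```

The design point is that the meta-model never sees predictions made on rows the base model was fitted on, which would otherwise make the base models look better than they are.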

My second question is the following:

With my algorithm, I do CV to estimate the test error on 20% of the data. I use a logistic regression for the stacked model and my test error rate is 69%.

However, the test error rate of M1 alone is 79% and for M2 alone it's 70%. I also checked the correlation between the predictions made by M1 and those made by M2: it's roughly 80%.

Given this, and assuming my algorithm is fine, is it normal for the stacked model to perform worse?

You have multiple questions here and I recommend posting them separately; you are more likely to get answers that way. Regarding the first one, it may be considered off topic because it involves some code review. I have answered what you call your second question, but there is a third in between the two as well that is probably on topic. — mkt, Jul 10 '19 at 15:02
When you say an error rate of 69%, do you mean an *accuracy* rate of 69%? Are you using log-loss as your accuracy measure? — Acccumulation, Jul 10 '19 at 18:09

mkt · Answer 1 · 2019-07-12T08:18:30.340

To address your second question (or possibly third, since there's one in the middle you've not counted):

As far as I'm aware, there is no guarantee that stacking will always lead to better performance (i.e. lower prediction error). It tends to improve performance on average, for reasons well explained in this answer. Note that the same answer also implies that stacking is likely to be more useful with more base models.

Also, if the correct predictions of the base models are strongly correlated (as appears to be the case for you), the benefits of stacking are weaker.
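
One way to check this is to correlate the "was this prediction correct?" indicators of the two models rather than the raw predictions. The snippet below is a small illustrative sketch using NumPy only; the function names and the toy label arrays are hypothetical and not taken from the question.

```python
import numpy as np

def correctness_correlation(y_true, pred_a, pred_b):
    """Pearson correlation between the correctness indicators of two models."""
    y_true = np.asarray(y_true)
    correct_a = (np.asarray(pred_a) == y_true).astype(float)
    correct_b = (np.asarray(pred_b) == y_true).astype(float)
    return np.corrcoef(correct_a, correct_b)[0, 1]

def oracle_accuracy(y_true, pred_a, pred_b):
    """Share of samples that at least one model gets right.

    This is the accuracy of an idealised combiner that always picks the
    correct model, so it bounds what a combiner of the two could achieve.
    """
    y_true = np.asarray(y_true)
    either = (np.asarray(pred_a) == y_true) | (np.asarray(pred_b) == y_true)
    return either.mean()

# Toy data (hypothetical): m2 fails mostly where m1 fails, m3 fails elsewhere.
y_true  = np.array([0, 1, 2, 1, 0, 2, 1, 0])
pred_m1 = np.array([0, 1, 2, 0, 0, 2, 2, 0])
pred_m2 = np.array([0, 1, 2, 0, 0, 2, 2, 1])
pred_m3 = np.array([1, 1, 0, 1, 0, 2, 1, 0])

corr_similar = correctness_correlation(y_true, pred_m1, pred_m2)
corr_complementary = correctness_correlation(y_true, pred_m1, pred_m3)
oracle_similar = oracle_accuracy(y_true, pred_m1, pred_m2)
oracle_complementary = oracle_accuracy(y_true, pred_m1, pred_m3)

print(corr_similar, oracle_similar)
print(corr_complementary, oracle_complementary)
```

When two models fail on the same samples, the oracle accuracy stays close to that of the better model, which leaves the meta-model little to gain.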

SebK · Answer 2 · 2019-07-12T11:23:25.950

From the topology of this example, it looks like KNN should be good at distinguishing green from turquoise: KNN is strong when the local relative density of a class is the best predictor and when a given distance has the same meaning at every point in space and in every direction.

The SVM should be strong at separating points that belong to concentric circles (e.g. distinguishing blue from red from green-ish).

However, you are using an SVM with a linear kernel, which means the separators are straight lines in the plane. You want to use what is called the kernel trick: a transformation of the feature space that is equivalent to fitting in a higher-dimensional space. Here, a useful representation adds a third axis z that reflects each sample's distance to the point (0, 0). For example, you could use an SVM with radial basis functions:

from sklearn import svm
M2_svc = svm.SVC(C=5, kernel='rbf')

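But a custom kernel built on the squared distance to the origin would probably suit this data best:

import numpy as np

def custom_kernel(X, Y):
    # Gram matrix of shape (len(X), len(Y)); feature map: squared distance to the origin.
    return np.outer((X**2).sum(axis=1), (Y**2).sum(axis=1))

M2_svc = svm.SVC(C=5, kernel=custom_kernel)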
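
To compare the kernels discussed above on ring-shaped data, here is a minimal sketch. It rests on assumptions: the data is a synthetic two-ring set from scikit-learn's make_circles (centred on the origin), not the data from the question, and radial_kernel is a hypothetical name. Note that scikit-learn calls a callable kernel with two sample matrices and expects the Gram matrix of shape (n_samples_X, n_samples_Y) in return.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def radial_kernel(X, Y):
    """Kernel induced by the feature map phi(x) = squared distance to the origin."""
    rx = (X ** 2).sum(axis=1)   # squared radius of every row of X
    ry = (Y ** 2).sum(axis=1)   # squared radius of every row of Y
    return np.outer(rx, ry)     # Gram matrix, shape (len(X), len(Y))

# Two concentric rings centred on (0, 0); synthetic stand-in data.
X, y = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

# Mean 5-fold cross-validated accuracy for each kernel choice.
scores = {}
for name, kernel in [("linear", "linear"), ("rbf", "rbf"), ("radial", radial_kernel)]:
    clf = SVC(C=5, kernel=kernel)
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name:>6}: {score:.3f}")
```

On data like this, a straight line cannot separate the rings, whereas a threshold on the squared radius can. The radial kernel only works this way because the rings are centred on the origin; the RBF kernel does not depend on that.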

I think you are getting bad results with stacking because KNN already does a better job in the regions where your current SVM performs acceptably, and everywhere else the linear SVM only introduces noise.

The idea of stacking is to merge classifiers that have complementary strengths.
