Using scikit-learn, I've run a grid search on a pipeline made of a StandardScaler and a LinearSVC, with the following code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, balanced_accuracy_score

CLF = LinearSVC()

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('model', CLF)])

# Parameters of a pipeline step need the step-name prefix ('model__').
params = dict(model__C=[10e-3, 10e-2, 10e-1, 10e0, 10e1, 10e2, 10e3],
              model__random_state=[5])

cvalidator = pre_cv_splits(val_idx)
grs = GridSearchCV(pipeline, params, n_jobs=-1, verbose=5, cv=cvalidator,
                   scoring=make_scorer(balanced_accuracy_score),
                   iid=False, refit=True)  # iid requires scikit-learn < 0.24

grs.fit(X, y)

Here, pre_cv_splits is a generator of train/test indices for the grid search (I need a custom function for this because of the way my data is laid out).
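
Essentially, it yields (train_indices, validation_indices) pairs, which GridSearchCV accepts directly as its cv argument. A toy version of the idea (the real function does more subject-specific bookkeeping; the names and signature below are only illustrative):

import numpy as np

def toy_cv_splits(val_folds, n_samples):
    # Yield (train, validation) index pairs, one per predefined fold:
    # validate on one block of indices, train on everything else.
    all_idx = np.arange(n_samples)
    for fold in val_folds:
        fold = np.asarray(fold)
        yield np.setdiff1d(all_idx, fold), fold

# e.g. two folds over 10 samples:
# list(toy_cv_splits([[0, 1, 2], [3, 4]], n_samples=10))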

After the grid search, I'm getting the best model with:

best_model = grs.best_estimator_

and applying it to my test instances:

from tqdm import tqdm
from sklearn.metrics import recall_score, accuracy_score, balanced_accuracy_score

n = test_labels.shape[0]

for i in tqdm(range(n), total=n, desc='Predicting'):
    X_test = test_data[i]
    y_test = test_labels[i]

    p_test = best_model.predict(X_test)

    # Sensitivity: recall of the positive class.
    sens = recall_score(y_test, p_test)
    # Specificity: recall of the negative class (negatives treated as the
    # positive label).
    spec = recall_score(y_test == 0, p_test == 0)
    acc = accuracy_score(y_test, p_test)
    bacc = balanced_accuracy_score(y_test, p_test)

    print(f'Subject {train_idx[i]:03d}')
    print(f'Sens = {sens:0.1%} | Spec = {spec:0.1%} | Acc = {acc:0.1%} | Bacc = {bacc:0.1%} |\n')

My test samples come from different subjects, and thus test_data is an m-sized array of arrays, where m is the number of subjects, and test_data[i] is an (n_i, d) array such that n_i is the number of samples from subject i, and d is the number of features.
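
In code, that layout is equivalent to something like this (the shapes are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: m = 3 subjects, d = 4 features, and a different
# number of samples n_i per subject.
sizes = [120, 80, 200]
test_data = np.array([rng.normal(size=(n_i, 4)) for n_i in sizes], dtype=object)
test_labels = np.array([rng.integers(0, 2, size=n_i) for n_i in sizes], dtype=object)

print(test_data.shape)       # (3,)      -> one entry per subject
print(test_data[0].shape)    # (120, 4)  -> (n_0, d)
print(test_labels[0].shape)  # (120,)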

When I first ran it, I made a small mistake: I was getting best_model = grs.best_estimator_['model'], and thus when I ran p_test = best_model.predict(X_test) I was running only the LinearSVC, without standardizing the test set (see the sketch after these results). The results I got were:

Subject 019
Sens = 69.4% | Spec = 94.4% | Acc = 92.4% | Bacc = 81.9% |

Subject 017
Sens = 77.6% | Spec = 79.2% | Acc = 78.9% | Bacc = 78.4% |

Subject 022
Sens = 93.7% | Spec = 46.8% | Acc = 61.6% | Bacc = 70.2% |

Subject 002
Sens = 94.9% | Spec = 56.2% | Acc = 75.0% | Bacc = 75.5% |

Subject 006
Sens = 92.7% | Spec = 67.3% | Acc = 79.4% | Bacc = 80.0% |
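
To make the mistake concrete: indexing the fitted pipeline by step name returns the bare LinearSVC, so its predict sees the raw features, while calling predict on the whole pipeline first applies the StandardScaler that was fitted on the training data. Side by side (using the same variables as above):

# Buggy version: only the final step, so no scaling is applied.
svc_only = grs.best_estimator_['model']            # bare LinearSVC
p_unscaled = svc_only.predict(X_test)              # raw features go straight in

# Fixed version: the pipeline standardizes with the training-set
# statistics before the SVC sees the data.
p_scaled = grs.best_estimator_.predict(X_test)

# Equivalent to the fixed version, done by hand:
scaler = grs.best_estimator_['scaler']             # StandardScaler fitted on train
p_manual = svc_only.predict(scaler.transform(X_test))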

After changing best_model = grs.best_estimator_['model'] to best_model = grs.best_estimator_, I ran the test again and got the following results:

Subject 019
Sens = 25.0% | Spec = 98.8% | Acc = 92.9% | Bacc = 61.9% |

Subject 017
Sens = 58.2% | Spec = 94.8% | Acc = 88.2% | Bacc = 76.5% |

Subject 022
Sens = 81.1% | Spec = 66.6% | Acc = 71.2% | Bacc = 73.8% |

Subject 002
Sens = 88.3% | Spec = 80.1% | Acc = 84.1% | Bacc = 84.2% |

Subject 006
Sens = 60.3% | Spec = 87.9% | Acc = 74.9% | Bacc = 74.1% |

As you can see, while Specificity improved, Sensitivity dropped a lot, and especially the Balanced Accuracy is worse in most scenarios, which is strange considering that the grid search targeted the Balanced Accuracy score. It is important to add that I'm working with a quite imbalanced dataset, with many more negative samples than positive samples.
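
In case it is relevant, the score the search actually optimized can be read off the fitted GridSearchCV object; the snippet below (just a sanity check, not part of my original code) prints the mean cross-validated balanced accuracy of the selected model and the per-C breakdown:

print(grs.best_params_)
print(f'CV balanced accuracy = {grs.best_score_:0.1%}')

# How flat the search landscape is across the C grid.
for c, score in zip(grs.cv_results_['param_model__C'],
                    grs.cv_results_['mean_test_score']):
    print(f'C = {c:g} -> mean CV bacc = {score:0.1%}')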

What are some reasons why this might be happening? My only guess is that the inter-subject variance of features might be very high (I'm already working on plotting this), and thus not scaling them might end up making the task of classifying the positive instances easier for some specific subjects. Is this a valid guess?
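
To check this, a rough sketch of the comparison I have in mind: the scaler inside the refit pipeline exposes the training-set statistics via its mean_ and scale_ attributes, so each test subject's feature distribution can be compared against what the scaler expects (variable names as above):

import numpy as np

scaler = grs.best_estimator_['scaler']
train_mean, train_scale = scaler.mean_, scaler.scale_

for i in range(test_labels.shape[0]):
    X_sub = test_data[i]
    # Large shifts mean the "standardized" features are not actually
    # centred / unit-variance for this subject.
    mean_shift = (X_sub.mean(axis=0) - train_mean) / train_scale
    scale_ratio = X_sub.std(axis=0) / train_scale
    print(f'subject {i}: |mean shift| up to {np.abs(mean_shift).max():.2f} SDs, '
          f'std ratio in [{scale_ratio.min():.2f}, {scale_ratio.max():.2f}]')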
