Bias / error averaged across folds
In $k$-fold cross validation, you train $k$ times on a subset of $n \frac{k-1}{k}$ observations out of the original $n$. The remaining $\frac{n}{k}$ are tested.
Training on fewer cases will, on average, lead to somewhat worse performance unless the learning curve has already flattened out for the training sample sizes being compared.
In your case it looks like training on only 2/3 vs. 93 % of the cases does make a difference.
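To make those sizes concrete, here is a minimal sketch (assuming $n = 1600$, the size of your `X` mentioned below; the data itself is just a placeholder) that prints the per-fold training and test set sizes for $k = 3$ vs. $k = 15$:

```python
# Minimal sketch: per-fold training-set size n*(k-1)/k for k = 3 vs. k = 15,
# using n = 1600 as in the question's X. The data is a dummy placeholder.
import numpy as np
from sklearn.model_selection import KFold

n = 1600
X_dummy = np.zeros((n, 1))   # only the indices matter here
for k in (3, 15):
    train_idx, test_idx = next(iter(KFold(n_splits=k).split(X_dummy)))
    print(f"k={k:2d}: train on {len(train_idx)} cases ({len(train_idx) / n:.0%}), "
          f"test on {len(test_idx)} cases")
```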
Standard deviation between folds
Here we have several contributions, which act in opposite directions:
- Model instability: the surrogate models trained on fewer training cases are possibly also more unstable. This would lead to an increased standard deviation between folds (= surrogate models) for smaller $k$.
- Uncertainty due to testing only a finite number of cases: On the other hand, we're comparing per-fold estimates of RMSE for different $k$. For small $k$, more test cases enter the per-fold estimate, in your case 1/3 vs. 1/15th of all cases. Thus, for smaller $k$, you get fewer but more certain (lower variance) per-fold estimates.
(This second effect cancels out if you compare the usual pooled RMSE estimate over all $k$ folds, where each of the $n$ cases is tested exactly once regardless of $k$; the sketch below contrasts the two kinds of estimate.)
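A sketch of the difference between the $k$ per-fold estimates and the pooled estimate (data and estimator are placeholders, not your actual setup):

```python
# Sketch contrasting the k per-fold RMSE estimates with the pooled RMSE over
# all n cases; make_regression data and Ridge are placeholders only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)

for k in (3, 15):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # k per-fold RMSE estimates, each based on only n/k test cases
    per_fold_rmse = -cross_val_score(Ridge(), X, y, cv=cv,
                                     scoring="neg_root_mean_squared_error")
    # pooled RMSE: every one of the n cases is predicted exactly once, whatever k is
    pooled_rmse = np.sqrt(np.mean((cross_val_predict(Ridge(), X, y, cv=cv) - y) ** 2))
    print(f"k={k:2d}: per-fold RMSE {per_fold_rmse.mean():.2f} "
          f"+/- {per_fold_rmse.std(ddof=1):.2f}, pooled RMSE {pooled_rmse:.2f}")
```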
Updates:
Errors on `X`, `X_train`, and `X_test`
The behaviour isn't strange at all: what you see is that smaller training sets (`X`: 1600, `X_train`: 1200, and `X_test`: 400 cases) lead to worse models.
Also, cross validation on `X_test` is probably not doing what you think it does: it trains on a 4/5th subset of the data that is presumably reserved for single-split/hold-out testing. But hold-out testing would use the model trained on `X_train` to predict `X_test`.
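As a sketch of what hold-out testing with your 1200 : 400 split looks like (the data and the `Ridge` estimator are assumptions standing in for your pipeline):

```python
# Sketch of hold-out testing with a 1200 : 400 split; data and estimator are
# placeholders, not the question's actual setup.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=400, random_state=0)

# Hold-out testing: fit on X_train, predict X_test -- no cross validation on X_test.
model = Ridge().fit(X_train, y_train)
holdout_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"hold-out RMSE: {holdout_rmse:.2f}")
```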
> When using `cross_validate` instead of `cross_val_score` the statistics indicate that the training data scores way better than the test data

Assuming that "the statistics" means the `train_score` and `test_score` returned by `cross_validate`, this indicates that you are overfitting. That is in good agreement with seeing that small $k$ (fewer training cases at the same model complexity => more overfitting) does worse in cross validation than larger $k$.
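For illustration, a sketch of how that gap shows up in the scores returned by `cross_validate` (placeholder data and an intentionally overfitting estimator, not your model):

```python
# Sketch: train_score vs. test_score from cross_validate for k = 3 and k = 15.
# An unpruned decision tree is used only to make the overfitting gap visible.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)

for k in (3, 15):
    res = cross_validate(DecisionTreeRegressor(random_state=0), X, y, cv=k,
                         scoring="neg_root_mean_squared_error",
                         return_train_score=True)
    print(f"k={k:2d}: train RMSE {-res['train_score'].mean():.1f}, "
          f"test RMSE {-res['test_score'].mean():.1f}")
```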
You say that you want to use the cross validation results for hyperparameter optimization. This means that you need independent data to validate the optimized model, either by another (nested aka double) cross validation, or by a single split.
Having a 1 : 3 split looks as if you'd like to use the single split aka hold-out for this purpose.
In order to achieve independence between training (including the optimization) and the test of the generalization error of the optimized model, you need to restrict your cross validation to `X_train`.
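A sketch of both options (placeholder data, a `Ridge` estimator, and a hypothetical `alpha` grid; only the overall structure is the point):

```python
# Sketch: hyperparameter optimization by cross validation on X_train only,
# validated either on the hold-out X_test or by nested (double) cross validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=400, random_state=0)

inner = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                     scoring="neg_root_mean_squared_error", cv=15)

# Option 1: optimize on X_train only, then test once on the hold-out X_test.
inner.fit(X_train, y_train)
holdout_rmse = np.sqrt(np.mean((inner.predict(X_test) - y_test) ** 2))
print(f"hold-out RMSE of the tuned model: {holdout_rmse:.2f}")

# Option 2: nested (double) cross validation -- the outer loop estimates the
# generalization error of the whole optimization procedure.
outer_rmse = -cross_val_score(inner, X, y, cv=5,
                              scoring="neg_root_mean_squared_error")
print(f"nested CV RMSE: {outer_rmse.mean():.2f} +/- {outer_rmse.std(ddof=1):.2f}")
```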
> when the complete set of predictors is shuffled
Cross validation doesn't shuffle predictors, it shuffles/splits cases.
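A tiny sketch of what the `shuffle` option actually does (using made-up toy data): it permutes the case indices before splitting, and never touches the predictor columns.

```python
# Sketch: KFold's shuffle option permutes the order of the cases (rows) before
# splitting into folds; the predictor columns are never touched.
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(12).reshape(6, 2)   # 6 cases, 2 predictors
for shuffle in (False, True):
    kf = KFold(n_splits=3, shuffle=shuffle, random_state=0 if shuffle else None)
    folds = [test.tolist() for _, test in kf.split(X_small)]
    print(f"shuffle={shuffle}: test-case indices per fold: {folds}")
```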
Choice of $k$
> should I increase 'k'
Yes.
> or would that be doubtful in a statistical sense?
No, that would not be doubtful from a statistics point of view because the lower error here comes from lower bias, i.e. we expect the cross validation with larger $k$ to be less wrong than cross validation with smaller $k$.
So, larger $k$ doesn't hurt anything but your computation time.
(The exception is Leave-One-Out, i.e. $k = n$ which has some undesirable statistical properties.)
The choice of $k = 3$ is IMHO an unusually small $k$. Personally, I rarely consider $k < 5$ unless I have fewer independent groups of cases, and I think $k > 10$ is usually not needed (I rather add iterations/repetitions in order to directly compute model stability).
See also Choice of K in K-fold cross-validation.
In this particular case, I think the information that substantially worse error was observed for $k=3$ is important, though.
So we now know that training on 1333 cases vs. 1867 cases does make a difference. A direct comparison means that the compared models have the same complexity, and in that case the worse performance likely comes from model instability. Thus, you need to check model stability (which you should do anyway, and in particular when doing model optimization); see the sketch below.
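One way to look at model stability directly is repeated cross validation: each case is predicted once per repetition by a different surrogate model, and the spread of those predictions measures instability. A sketch with placeholder data and estimator:

```python
# Sketch: repeated k-fold CV to look at model stability. Each case is predicted
# once per repetition by a different surrogate model; the per-case spread of
# those predictions across repetitions reflects instability.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)

n_splits, n_repeats = 5, 10
preds = np.empty((len(y), n_repeats))

rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    rep = i // n_splits   # repetition this fold belongs to
    model = DecisionTreeRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    preds[test_idx, rep] = model.predict(X[test_idx])

# Per-case standard deviation over the repetitions = instability of the surrogate models.
print(f"mean prediction SD across repetitions: {preds.std(axis=1, ddof=1).mean():.2f}")
```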
> I know that statistics go beyond "right" and "wrong", but would you personally raise an eyebrow when you find that higher k was chosen just for the sake of better model performance?
I would raise an eyebrow if that reason were given, because this argumentation shows a decided lack of understanding of cross validation, which in turn would make me suspicious of how far I could trust the statistical judgment of the authors in other aspects of the modeling/data analysis.
(Fortunately, you asked here and thus gave yourself a chance of improving this understanding! Go ahead with that!)