Bias / error averaged across folds
In $k$-fold cross validation, you train $k$ times on a subset of $n \frac{k-1}{k}$ observations out of the original $n$. The remaining $\frac{n}{k}$ are tested.
Training on fewer cases will, on average, lead to somewhat worse performance unless the learning curve has already flattened out for the training sample sizes being compared.
In your case it looks like training on only 2/3 vs. 93 % of the cases does make a difference.
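To make those sizes concrete, here is a minimal sketch (assuming $n = 1600$, the size of your `X` mentioned below; the data itself is just a placeholder) that prints the per-fold training and test set sizes for $k = 3$ vs. $k = 15$:

```python
# Minimal sketch: per-fold training-set size n*(k-1)/k for k = 3 vs. k = 15,
# using n = 1600 as in the question's X. The data is a dummy placeholder.
import numpy as np
from sklearn.model_selection import KFold

n = 1600
X_dummy = np.zeros((n, 1))   # only the indices matter here
for k in (3, 15):
    train_idx, test_idx = next(iter(KFold(n_splits=k).split(X_dummy)))
    print(f"k={k:2d}: train on {len(train_idx)} cases ({len(train_idx) / n:.0%}), "
          f"test on {len(test_idx)} cases")
```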
Standard deviation between folds
Here we have several contributions, which act in opposite directions:
- Model instability: the surrogate models trained on fewer training cases are possibly also more unstable. This would lead to an increased standard deviation between folds (= surrogate models) for smaller $k$.
- Uncertainty due to testing only a finite number of cases: On the other hand, we're comparing per-fold estimates of RMSE for different $k$. For small $k$, more test cases enter the per-fold estimate, in your case 1/3 vs. 1/15th of all cases. Thus, for smaller $k$, you get fewer but more certain (lower variance) per-fold estimates.
(This second effect cancels out if you compare the usual pooled RMSE estimate over all $k$ folds, where each of the $n$ cases is tested exactly once regardless of $k$; the sketch below contrasts the two kinds of estimate.)
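A sketch of the difference between the $k$ per-fold estimates and the pooled estimate (data and estimator are placeholders, not your actual setup):

```python
# Sketch contrasting the k per-fold RMSE estimates with the pooled RMSE over
# all n cases; make_regression data and Ridge are placeholders only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)

for k in (3, 15):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    # k per-fold RMSE estimates, each based on only n/k test cases
    per_fold_rmse = -cross_val_score(Ridge(), X, y, cv=cv,
                                     scoring="neg_root_mean_squared_error")
    # pooled RMSE: every one of the n cases is predicted exactly once, whatever k is
    pooled_rmse = np.sqrt(np.mean((cross_val_predict(Ridge(), X, y, cv=cv) - y) ** 2))
    print(f"k={k:2d}: per-fold RMSE {per_fold_rmse.mean():.2f} "
          f"+/- {per_fold_rmse.std(ddof=1):.2f}, pooled RMSE {pooled_rmse:.2f}")
```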
Updates:
Errors on `X`, `X_train`, and `X_test`
The behaviour isn't strange at all: what you see is that smaller training sets (`X`: 1600, `X_train`: 1200, and `X_test`: 400 cases) lead to worse models.
Also, cross validation on `X_test` is probably not doing what you think it does: it trains on a 4/5th subset of the data that is presumably reserved for single-split/hold-out testing. But hold-out testing would use the model trained on `X_train` to predict `X_test`.
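As a sketch of what hold-out testing with your 1200 : 400 split looks like (the data and the `Ridge` estimator are assumptions standing in for your pipeline):

```python
# Sketch of hold-out testing with a 1200 : 400 split; data and estimator are
# placeholders, not the question's actual setup.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=400, random_state=0)

# Hold-out testing: fit on X_train, predict X_test -- no cross validation on X_test.
model = Ridge().fit(X_train, y_train)
holdout_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"hold-out RMSE: {holdout_rmse:.2f}")
```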
> When using `cross_validate` instead of `cross_val_score` the statistics indicate that the training data scores way better than the test data

Assuming that "the statistics" means the `train_score` and `test_score` returned by `cross_validate`, this indicates that you are overfitting. That is in good agreement with seeing that small $k$ (fewer training cases at the same model complexity => more overfitting) does worse in cross validation than larger $k$.
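For illustration, a sketch of how that gap shows up in the scores returned by `cross_validate` (placeholder data and an intentionally overfitting estimator, not your model):

```python
# Sketch: train_score vs. test_score from cross_validate for k = 3 and k = 15.
# An unpruned decision tree is used only to make the overfitting gap visible.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)

for k in (3, 15):
    res = cross_validate(DecisionTreeRegressor(random_state=0), X, y, cv=k,
                         scoring="neg_root_mean_squared_error",
                         return_train_score=True)
    print(f"k={k:2d}: train RMSE {-res['train_score'].mean():.1f}, "
          f"test RMSE {-res['test_score'].mean():.1f}")
```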
You say that you want to use the cross validation results for hyperparameter optimization. This means that you need independent data to validate the optimized model, either by another (nested aka double) cross validation, or by a single split.
Having a 1 : 3 split looks as if you'd like to use the single split aka hold-out for this purpose.
In order to achieve independence between training (including the optimization) and the test of the generalization error of the optimized model, you need to restrict your cross validation to `X_train`.
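A sketch of both options (placeholder data, a `Ridge` estimator, and a hypothetical `alpha` grid; only the overall structure is the point):

```python
# Sketch: hyperparameter optimization by cross validation on X_train only,
# validated either on the hold-out X_test or by nested (double) cross validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=400, random_state=0)

inner = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                     scoring="neg_root_mean_squared_error", cv=15)

# Option 1: optimize on X_train only, then test once on the hold-out X_test.
inner.fit(X_train, y_train)
holdout_rmse = np.sqrt(np.mean((inner.predict(X_test) - y_test) ** 2))
print(f"hold-out RMSE of the tuned model: {holdout_rmse:.2f}")

# Option 2: nested (double) cross validation -- the outer loop estimates the
# generalization error of the whole optimization procedure.
outer_rmse = -cross_val_score(inner, X, y, cv=5,
                              scoring="neg_root_mean_squared_error")
print(f"nested CV RMSE: {outer_rmse.mean():.2f} +/- {outer_rmse.std(ddof=1):.2f}")
```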
> when the complete set of predictors is shuffled
Cross validation doesn't shuffle predictors, it shuffles/splits cases.
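A tiny sketch of what the `shuffle` option actually does (using made-up toy data): it permutes the case indices before splitting, and never touches the predictor columns.

```python
# Sketch: KFold's shuffle option permutes the order of the cases (rows) before
# splitting into folds; the predictor columns are never touched.
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(12).reshape(6, 2)   # 6 cases, 2 predictors
for shuffle in (False, True):
    kf = KFold(n_splits=3, shuffle=shuffle, random_state=0 if shuffle else None)
    folds = [test.tolist() for _, test in kf.split(X_small)]
    print(f"shuffle={shuffle}: test-case indices per fold: {folds}")
```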
Choice of $k$
> should I increase 'k'
Yes.
> or would that be doubtful in a statistical sense?
No, that would not be doubtful from a statistics point of view because the lower error here comes from lower bias, i.e. we expect the cross validation with larger $k$ to be less wrong than cross validation with smaller $k$.
So, larger $k$ doesn't hurt anything but your computation time.
(The exception is Leave-One-Out, i.e. $k = n$ which has some undesirable statistical properties.)
The choice of $k = 3$ is IMHO an unusually small $k$. Personally, I rarely consider $k < 5$ unless I have fewer independent groups of cases, and I think $k > 10$ is usually not needed (I rather add iterations/repetitions in order to directly compute model stability).
See also Choice of K in K-fold cross-validation.
In this particular case, I think the information that substantially worse error was observed for $k=3$ is important, though.
So we now know that training on 1333 cases vs. 1867 cases does make a difference. A direct comparison means that the compared models have the same complexity, and in that case the worse performance likely comes from model instability. Thus, you need to check model stability (which you should do anyway, and in particular when doing model optimization); see the sketch below.
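One way to look at model stability directly is repeated cross validation: each case is predicted once per repetition by a different surrogate model, and the spread of those predictions measures instability. A sketch with placeholder data and estimator:

```python
# Sketch: repeated k-fold CV to look at model stability. Each case is predicted
# once per repetition by a different surrogate model; the per-case spread of
# those predictions across repetitions reflects instability.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1600, n_features=20, noise=10.0, random_state=0)

n_splits, n_repeats = 5, 10
preds = np.empty((len(y), n_repeats))

rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    rep = i // n_splits   # repetition this fold belongs to
    model = DecisionTreeRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    preds[test_idx, rep] = model.predict(X[test_idx])

# Per-case standard deviation over the repetitions = instability of the surrogate models.
print(f"mean prediction SD across repetitions: {preds.std(axis=1, ddof=1).mean():.2f}")
```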
> I know that statistics go beyond "right" and "wrong", but would you personally raise an eyebrow when you find that higher k was chosen just for the sake of better model performance?
I would raise an eyebrow if that reason were given, because this argumentation shows a decided lack of understanding of cross validation, which in turn would make me suspicious of how far I could trust the statistical judgment of the authors in other aspects of the modeling/data analysis.
(Fortunately, you asked here and thus gave yourself a chance of improving this understanding! Go ahead with that!)