The result I'm talking about is the mean cross-validation score from scikit-learn. I shuffled my training data and then applied the CV function (shuffle, then CV). I did this several times and each result was different, some higher than others. Is the order of the rows in the training data relevant? If so, how?
Addition: Sorry, I didn't define my problem well. I'm doing supervised binary classification. My train variable contains both the features and the class labels. Here is part of my code:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

np.random.shuffle(train)          # reorder the rows of the training set
clf = LogisticRegression()
X, y = train[:, 1:], train[:, 0]  # column 0 is the label, the rest are features
# no need to call clf.fit first; cross_val_score fits on each fold
mean = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5)).mean()
print(mean)

# sample for train (label first, then features):
# train = np.array([[1, 0.5, 0.3, 0.6],
#                   [0, 0.3, 0.2, 0.1],
#                   [0, 0.1, 0.9, 0.7]])
Here I use 5 folds. If I remove np.random.shuffle(train), my mean is approximately 66% and it stays the same even after running the program a couple of times. However, if I include the shuffle, my mean changes (sometimes it increases and sometimes it decreases). My question is: why does shuffling my training data change my mean? If the order of the rows in train is irrelevant, why does the mean change after shuffling? This is just part of my code, so assume that the train variable is defined somewhere. My train variable is similar to the sample, but with many more features.
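To show what I suspect is happening, here is a minimal sketch in pure NumPy (not the actual scikit-learn internals; `contiguous_folds` is a hypothetical helper): a K-fold splitter that does not shuffle internally assigns contiguous blocks of rows to folds, so reordering the rows changes which samples each fold trains and tests on, and hence the per-fold scores.

```python
import numpy as np

def contiguous_folds(n_samples, n_folds):
    """Return a fold index for each row, mimicking an unshuffled K-fold split:
    fold 0 gets the first block of rows, fold 1 the next block, and so on."""
    fold_sizes = np.full(n_folds, n_samples // n_folds)
    fold_sizes[: n_samples % n_folds] += 1   # spread the remainder
    return np.repeat(np.arange(n_folds), fold_sizes)

rng = np.random.RandomState(0)
ids = np.arange(20)                  # pretend these are the row ids of train
folds = contiguous_folds(len(ids), 5)

shuffled = ids.copy()
rng.shuffle(shuffled)                # same rows, new order

# Fold k now contains the rows that happen to sit in block k after the
# shuffle, so each fold trains/tests on a different subset than before.
print("fold of each original row, before:", dict(zip(ids, folds)))
print("fold of each original row, after: ", dict(zip(shuffled, folds)))
```

So even though the data itself is unchanged, the composition of each train/test split changes with the row order, which would explain why the mean moves after every shuffle.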