
I am not asking how to get the same result every time, which can be done by setting the random state parameter to a fixed number.

I am referring to the fact that one run gives me 99.2, the next 89.3, and another 93.5. If there is so much variation on every random split, how can you rely on the result?

Example:

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

iris = datasets.load_iris()
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])

X = data1[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = data1['target']
# no random_state is set, so a fresh random split is drawn on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = LogisticRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))

If you run this, you will get tremendously different results every time.

  • I'm not quite clear what your question is here, exactly. – jkm Dec 02 '19 at 12:58
  • If you run it, one time you will receive 99% accuracy, which sounds great, but if you run it again you will receive 85%. Therefore, how can you tell if this model is good if it gives so many different results? – user12436030 Dec 02 '19 at 13:08

1 Answer


Cross validation, as you have seen, involves randomization. Therefore, any result derived from it will have some randomness. It is always good practice to repeat cross validation a couple of times (e.g., using different RNG seeds) to see how strong this randomness is.
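
As a quick illustration (using the same iris data and logistic regression as in the question, with 5-fold CV and a handful of arbitrarily chosen seeds), you can rerun CV under different seeds and look at the spread directly:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

X, y = datasets.load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# rerun 5-fold CV under several RNG seeds to see how much the estimate moves
for seed in range(5):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"seed {seed}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")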

If you have a small dataset, or a large model, your randomness will be larger than with a large dataset or a small model. And yes, if the randomness is high, then the "best" model derived from CV is a bit dubious.

One way of dealing with this is the so-called "one standard error rule", which attempts to balance this noise against the complexity of the models under consideration.
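
A rough sketch of that rule (the candidate C values below are arbitrary, purely for illustration): estimate the mean and standard error of the CV score for each candidate, then take the simplest candidate whose mean lies within one standard error of the best one.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
Cs = [0.01, 0.1, 1.0, 10.0]  # smaller C = stronger regularization = simpler model

means, ses = [], []
for C in Cs:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))

best = int(np.argmax(means))
threshold = means[best] - ses[best]
# first (i.e. most regularized) candidate within one SE of the best mean score
chosen = next(i for i, m in enumerate(means) if m >= threshold)
print("best C:", Cs[best], "| C chosen by the one-SE rule:", Cs[chosen])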

If you want more stability, consider rerunning the CV multiple times and averaging the results. Nothing keeps you from using each observation multiple times in different holdouts. (Of course, this only makes sense if the rest of the holdout sample differs each time, but the randomization should ensure this.)
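
One way to do this in scikit-learn is RepeatedStratifiedKFold: each observation ends up in several different holdout folds, and you average over all of them. A minimal sketch, again on the iris data:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = datasets.load_iris(return_X_y=True)
# 5 folds, repeated 10 times with different shuffles = 50 holdout evaluations
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f} (std {scores.std():.3f})")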

Don't overdo it, though. Taken to extremes, this can lead to overfitting to your CV sample.

Incidentally, accuracy is not a good error measure.
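
For example, a proper scoring rule such as the log loss on predicted probabilities is a common alternative; a minimal sketch using scikit-learn's neg_log_loss scorer:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
# log loss scores the predicted probabilities, not just the hard class labels
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring='neg_log_loss')
print("mean negative log loss:", scores.mean())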

Stephan Kolassa
  • It would be best for each try to not differ much from the previous ones, correct? How much is acceptable, though? Is 5% OK? – user12436030 Dec 02 '19 at 14:30
  • "Acceptable" will depend on what you will use your model for; there is no general "maximal acceptable variation in quality". – Stephan Kolassa Dec 02 '19 at 14:46