I have a specific question about validation in machine learning research.
As we know, standard machine learning practice asks researchers to train their models on the training set, select among candidate models using the validation set, and report accuracy on the test set. In a rigorous study, the test set should be used only once. In practice, however, this is rarely the case, because we have to keep improving our method until the test accuracy beats the state-of-the-art results before we can publish (or even submit) a paper.
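For concreteness, here is a minimal sketch of the protocol I mean (the dataset, model family, and split sizes are just illustrative, using scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=0)

# Split once into train / validation / test (60/20/20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Select among candidate models by validation accuracy only.
candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0, 10.0)]
best = max(
    (m.fit(X_train, y_train) for m in candidates),
    key=lambda m: accuracy_score(y_val, m.predict(X_val)),
)

# The test set is touched exactly once, for the final report.
print("test accuracy:", accuracy_score(y_test, best.predict(X_test)))
```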
Now comes the problem. Let's say 50% is the current state-of-the-art result, and my model generally achieves 50--51% test accuracy, which is better on average.
However, the model with the best validation accuracy (52%) yields a very low test accuracy, e.g., 49%. If I can't further improve the validation accuracy, which I see little hope of, then I have to report 49% as my overall performance. This really discourages me from pursuing the problem, yet it makes no difference to my peers, who never see the 52% validation accuracy, which I believe is an outlier.
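To show this isn't just bad luck on my part, here is a toy simulation (all numbers are made up, not from my experiments): when every candidate's true accuracy is the same, picking the one with the best validation score systematically overstates validation accuracy, while its test accuracy stays at the true level.

```python
# Each candidate's true accuracy is 0.50; validation and test scores are
# noisy estimates of it. Selecting the argmax of the validation scores
# inflates the picked validation score but not the picked test score.
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_val, n_test, p_true = 10, 1000, 1000, 0.50

picked_val, picked_test = [], []
for _ in range(5000):
    val = rng.binomial(n_val, p_true, size=n_candidates) / n_val
    test = rng.binomial(n_test, p_true, size=n_candidates) / n_test
    i = val.argmax()                 # model selection by validation score
    picked_val.append(val[i])
    picked_test.append(test[i])

print("mean validation acc of picked model:", np.mean(picked_val))   # > 0.50
print("mean test acc of picked model:      ", np.mean(picked_test))  # ~ 0.50
```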
So, what do people usually do in their research?
P.S. k-fold cross-validation is of no help, because the same situation can still happen.
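To be concrete about why: even the k-fold average is a noisy estimate that shifts with the fold assignment, so a selected model can still be a lucky winner. A small illustration (synthetic data, scikit-learn):

```python
# The fold-averaged score of the same model fluctuates run to run.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, flip_y=0.4, random_state=0)
model = LogisticRegression(max_iter=1000)

for seed in range(3):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"seed {seed}: mean CV acc = {scores.mean():.3f} +/- {scores.std():.3f}")
```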