I have a specific question about validation in machine learning research.
As we know, standard machine learning practice asks researchers to train their models on the training set, select among candidate models using the validation set, and report accuracy on the test set. In a rigorous study, the test set should be used only once. In practice, however, this is rarely the case, because we have to keep improving our method until the test accuracy beats the state-of-the-art results before we can publish (or even submit) a paper.
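For concreteness, here is a minimal sketch of the protocol I mean (the dataset, model family, and split sizes are just illustrative, using scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=0)

# Split once into train / validation / test (60/20/20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Select among candidate models by validation accuracy only.
candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0, 10.0)]
best = max(
    (m.fit(X_train, y_train) for m in candidates),
    key=lambda m: accuracy_score(y_val, m.predict(X_val)),
)

# The test set is touched exactly once, for the final report.
print("test accuracy:", accuracy_score(y_test, best.predict(X_test)))
```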
Now comes the problem. Let's say 50% is the current state-of-the-art result, and my model generally achieves 50--51% test accuracy, which is better on average.
However, the model with the best validation accuracy (52%) yields a very low test accuracy, e.g., 49%. If I can't further improve the validation accuracy, which I see little hope of, then I have to report 49% as my overall performance. This really discourages me from pursuing the problem, yet it makes no difference to my peers, who never see the 52% validation accuracy, which I believe is an outlier.
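To show this isn't just bad luck on my part, here is a toy simulation (all numbers are made up, not from my experiments): when every candidate's true accuracy is the same, picking the one with the best validation score systematically overstates validation accuracy, while its test accuracy stays at the true level.

```python
# Each candidate's true accuracy is 0.50; validation and test scores are
# noisy estimates of it. Selecting the argmax of the validation scores
# inflates the picked validation score but not the picked test score.
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_val, n_test, p_true = 10, 1000, 1000, 0.50

picked_val, picked_test = [], []
for _ in range(5000):
    val = rng.binomial(n_val, p_true, size=n_candidates) / n_val
    test = rng.binomial(n_test, p_true, size=n_candidates) / n_test
    i = val.argmax()                 # model selection by validation score
    picked_val.append(val[i])
    picked_test.append(test[i])

print("mean validation acc of picked model:", np.mean(picked_val))   # > 0.50
print("mean test acc of picked model:      ", np.mean(picked_test))  # ~ 0.50
```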
So, what do people usually do in their research?
P.S. k-fold cross-validation is of no help, because the same situation can still happen.
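To be concrete about why: even the k-fold average is a noisy estimate that shifts with the fold assignment, so a selected model can still be a lucky winner. A small illustration (synthetic data, scikit-learn):

```python
# The fold-averaged score of the same model fluctuates run to run.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, flip_y=0.4, random_state=0)
model = LogisticRegression(max_iter=1000)

for seed in range(3):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"seed {seed}: mean CV acc = {scores.mean():.3f} +/- {scores.std():.3f}")
```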