Possible Duplicate:
How can I help ensure testing data does not leak into training data?
Overfitting is obviously a significant problem in machine learning, perhaps the most significant one. I believe a main reason is that it is easy to overfit accidentally in subtle ways, particularly during model selection.
So suppose we have someone building a predictive model, but that someone is not necessarily well-versed in proper statistical or machine learning principles. Maybe we are helping that person as they are learning, or maybe that person is using some sort of software package that requires minimal knowledge to use.
Now this person might very well recognize that the real test is accuracy (or whatever other metric) on out-of-sample data. However, my concern is that there are still subtleties to worry about, and overfitting remains a danger. In the simple case, they build their model, evaluate it on the training data, and then evaluate it on held-out testing data. Unfortunately, it is all too easy at that point to go back, tweak some modeling parameter, and check the results on that same "testing" data. Once that happens, the data is no longer truly out-of-sample, because choices about the model have been informed by it, and the reported test performance can be optimistically biased.
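To make the concern concrete, here is a minimal simulation sketch, not a real modeling workflow. Everything in it is an assumption for illustration: the labels are coin flips that no model can predict, and each "candidate" is a random lookup table standing in for one setting of a tuning parameter. Picking the candidate that scores best on the test set and then scoring it again on fresh data shows how reusing the test set can inflate the reported accuracy.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
N_FEATURE_VALUES = 64    # hypothetical: one integer feature with 64 possible values


def make_data(n):
    """Return (feature, label) pairs; labels are coin flips unrelated to the feature."""
    return [(rng.randrange(N_FEATURE_VALUES), rng.randrange(2)) for _ in range(n)]


def accuracy(model, data):
    """Fraction of rows where the lookup-table model predicts the label."""
    return sum(model[x] == y for x, y in data) / len(data)


test_data = make_data(50)        # the "held-out" set that gets reused
fresh_data = make_data(10_000)   # data never consulted while choosing a model

# Each candidate stands for one "tweak"; here it is just a random lookup table.
candidates = [
    [rng.randrange(2) for _ in range(N_FEATURE_VALUES)] for _ in range(200)
]

# Selecting on the test set is exactly the leak described above.
best = max(candidates, key=lambda m: accuracy(m, test_data))
best_test_acc = accuracy(best, test_data)
fresh_acc = accuracy(best, fresh_data)

print(f"accuracy on the reused test set: {best_test_acc:.2f}")
print(f"accuracy on fresh data:          {fresh_acc:.2f}")
```

Since the labels are pure noise, the true accuracy of every candidate is 50%, so any gap between the two printed numbers comes entirely from having selected on the test set.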
One potential way to resolve this problem would be to suggest creating many out-of-sample datasets, so that each testing dataset can be discarded after use and never reused. This requires a lot of data management, though, especially since the splitting must be done before the analysis (so you would need to know beforehand how many splits you will need).
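As a rough sketch of what that bookkeeping might look like, the hypothetical helper below (the class name `SingleUseTestSets` and its parameters are mine, not from any library) carves out a fixed number of test sets up front and refuses to hand any of them out twice. It also makes the drawback visible: the number of test sets has to be chosen before any analysis starts.

```python
import random


class SingleUseTestSets:
    """Hold one training set plus several test sets that can each be taken once."""

    def __init__(self, rows, n_test_sets, test_size, seed=0):
        rows = list(rows)
        cut = n_test_sets * test_size
        if cut >= len(rows):
            raise ValueError("not enough rows left over for training")
        random.Random(seed).shuffle(rows)  # shuffle once, before any analysis
        # The first `cut` rows become the disposable test sets.
        self._unused = [rows[i:i + test_size] for i in range(0, cut, test_size)]
        # Everything else is training data and may be reused freely.
        self.train = rows[cut:]

    def remaining(self):
        """Number of test sets that have not been handed out yet."""
        return len(self._unused)

    def next_test_set(self):
        """Return a test set that has never been returned before."""
        if not self._unused:
            raise RuntimeError("all test sets used; no untouched data is left")
        return self._unused.pop()


# Example: 100 rows, three disposable test sets of 10 rows each.
splits = SingleUseTestSets(range(100), n_test_sets=3, test_size=10)
print(len(splits.train), "training rows,", splits.remaining(), "test sets left")
```

Each call to `splits.next_test_set()` uses up one evaluation; once the budget is spent, the helper raises an error instead of silently returning data that has already been seen.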
Perhaps a more conventional approach is k-fold cross-validation. However, in some sense that loses the distinction between a "training" and a "testing" dataset, which I think can be useful, especially to those still learning. Also, I'm not convinced it makes sense for all types of predictive models.
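For reference, this is a minimal sketch of the index bookkeeping behind k-fold cross-validation, written with the standard library only. The function name `k_fold_indices` is hypothetical and no model is fitted. It shows why the train/test distinction gets blurry: within each round the two sets are disjoint, but over all rounds every row is used for both purposes.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training data for round i is every fold except fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


for round_no, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5)):
    print(round_no, "train:", sorted(train_idx), "test:", sorted(test_idx))
```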
Is there some way that I've overlooked to help overcome the problem of overfitting and testing leakage while still remaining somewhat clear to an inexperienced user?