The following excerpt is from Schwager's Hedge Fund Market Wizards (May 2012), an interview with the consistently successful hedge fund manager Jaffray Woodriff:
To the question "What are some of the worst errors people make in data mining?" he answers:
A lot of people think they are okay because they use in-sample data for training and out-of-sample data for testing. Then they sort the models based on how they performed on the in-sample data and choose the best ones to test on the out-of-sample data. The human tendency is to take the models that continue to do well in the out-of-sample data and choose those models for trading. That type of process simply turns the out-of-sample data into part of the training data because it cherry-picks the models that did best in the out-of-sample period. It is one of the most common errors people make and one of the reasons why data mining as it is typically applied yields terrible results.
The interviewer then asks: "What should you be doing instead?"
You can look for patterns where, on average, all the models out-of-sample continue to do well. You know you are doing well if the average for the out-of-sample models is a significant percentage of the in-sample score. Generally speaking, you are really getting somewhere if the out-of-sample results are more than 50 percent of the in-sample. QIM's business model would never have worked if SAS and IBM were building great predictive modeling software.
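To check my reading of the first answer, here is a toy simulation of the cherry-picking error he describes. It is entirely my own illustration (pure-noise "models", made-up sizes, nothing from the book or from QIM): none of the candidate models has any real edge, yet selecting the ones that happened to do well out-of-sample makes them look skilled.

```python
# Toy simulation of the selection bias Woodriff describes: every "model" here
# is pure noise with zero true edge, yet cherry-picking on out-of-sample
# performance manufactures the appearance of one.
import numpy as np

rng = np.random.default_rng(0)

n_models = 10_000                      # candidate models from a data-mining run
n_in, n_out, n_fresh = 250, 250, 250   # in-sample / out-of-sample / truly unseen days

# Each model's daily "returns" are i.i.d. noise: no skill by construction.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_models, n_in + n_out + n_fresh))
in_sample  = returns[:, :n_in].mean(axis=1)
out_sample = returns[:, n_in:n_in + n_out].mean(axis=1)
fresh      = returns[:, n_in + n_out:].mean(axis=1)

# Step 1: keep the 10% of models that look best in-sample (legitimate training).
survivors = np.argsort(in_sample)[-n_models // 10:]

# Step 2 (the error): among the survivors, keep only those that ALSO did well
# out-of-sample.  This quietly turns the out-of-sample data into training data.
cherry_picked = survivors[out_sample[survivors] > np.median(out_sample[survivors])]

print("mean OOS return of cherry-picked models:", out_sample[cherry_picked].mean())
print("mean return of the same models on fresh data:", fresh[cherry_picked].mean())
# The first number looks impressive; the second is ~0, because the models never
# had an edge -- the selection step created the illusion of one.
```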
My questions
Does this make any sense? What exactly does he mean? Is there a name for the method he proposes, and are there references for it? Or did this guy find a holy grail nobody else understands? He even says in this interview that his method could potentially revolutionize science...
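For what it's worth, here is how I currently read his suggested alternative: score the model-generating process by how much of its average in-sample performance the whole population of models retains out-of-sample, rather than promoting individual out-of-sample survivors. This is only my interpretation; the 50 percent figure is his stated rule of thumb, and the function name, score definition, and numbers below are invented for illustration.

```python
# Minimal sketch of the population-level diagnostic I think Woodriff describes:
# average out-of-sample score across ALL candidate models, expressed as a
# fraction of the average in-sample score, with no cherry-picking of survivors.
import numpy as np

def oos_retention(in_sample_scores, out_sample_scores):
    """Average out-of-sample score as a fraction of the average in-sample score,
    computed over every candidate model the search produced."""
    return out_sample_scores.mean() / in_sample_scores.mean()

rng = np.random.default_rng(1)
n_models = 5_000

# Hypothetical scores (e.g., Sharpe ratios) for every model from the search.
in_sample_scores  = rng.normal(1.0, 0.3, n_models)   # what the search optimized
out_sample_scores = 0.4 * in_sample_scores + rng.normal(0.0, 0.3, n_models)

ratio = oos_retention(in_sample_scores, out_sample_scores)
print(f"out-of-sample retention: {ratio:.0%}")
# If this ratio is well below ~50%, the in-sample "edge" is mostly overfitting;
# if the population as a whole holds up out-of-sample, the process -- not any
# single lucky model -- may be finding something real.
```

Is that a fair reading, and does this diagnostic have an established name in the model-selection literature?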