I was reading Faraway's textbook "Linear Models with R" (1st edition) last weekend. It has a chapter called "Statistical Strategy and Model Uncertainty". He describes (page 158) how he artificially generated data from a very complicated model and then asked his students to model it, comparing the students' predictions against the real values. Unfortunately, most students over-fitted the training data, and their predicted values were totally off the mark. To explain this phenomenon, he wrote something that struck me:
" The reason the models were so different was that students applied the various methods in different orders. Some did variable selection before transformation and others, the reverse. Some repeated a method after the model was changed and others did not. I went over the strategies that several of the students used and could not find anything clearly wrong with what they had done. One student made a mistake in computing his or her predicted values, but there was nothing obviously wrong in the remainder. The performance on this assignment did not show any relationship with that in the exams. "
I was taught that predictive accuracy is the 'gold standard' for choosing the best model. If I am not mistaken, it is also the criterion used in Kaggle competitions. But here Faraway observed something of a different nature: a model's predictive performance may have nothing to do with the competence of the statistician who built it. In other words, whether we can build the model with the best predictive power is not really determined by how experienced we are; instead it is determined by a huge 'model uncertainty' (blind luck?).

My question is: is this true in real-life data analysis as well? Or am I confused about something very basic? Because if it is true, the implications for real data analysis are immense: without knowing the 'real model' behind the data, there is no essential difference between the work done by experienced and inexperienced statisticians; both are just making wild guesses from the available training data. A toy simulation of what I mean is sketched below.
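To make the strategy-order effect concrete, here is a minimal sketch in R. It is my own toy simulation, not Faraway's actual exercise: the data-generating model, the coefficients, and the two pipelines (select-then-transform vs. transform-then-select) are all invented for illustration.

```r
## Toy simulation: two defensible modelling strategies, applied in
## different orders, can end up with different models and predictions.
set.seed(1)
n <- 100; p <- 8
X <- as.data.frame(matrix(runif(n * p, 1, 10), n, p))
names(X) <- paste0("x", 1:p)

## "Complicated" truth: nonlinearity, an interaction, multiplicative noise
y <- exp(0.3 * log(X$x1) + 0.1 * X$x2 + 0.02 * X$x2 * X$x3 +
         rnorm(n, sd = 0.3))
dat <- data.frame(y, X)
train <- dat[1:70, ]; test <- dat[71:100, ]

## Strategy A: variable selection first (on the raw response), then
## apply a log transformation to whatever survives the selection
fitA <- step(lm(y ~ ., data = train), trace = 0)
fA   <- lm(update(formula(fitA), log(y) ~ .), data = train)

## Strategy B: transform first, then do variable selection
fB <- step(lm(log(y) ~ ., data = train), trace = 0)

## Compare test-set RMSE (back-transforming with exp, ignoring
## retransformation bias for simplicity)
rmse <- function(fit) sqrt(mean((exp(predict(fit, test)) - test$y)^2))
c(A = rmse(fA), B = rmse(fB))
```

On many seeds the two strategies keep different subsets of variables and give noticeably different test errors, and which one 'wins' flips from seed to seed; neither analyst did anything clearly wrong, which is exactly the model-uncertainty effect Faraway describes.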