Was thinking a bit about model selection earlier, and I ended up getting hung up on the question: “If two models have similar predictive power, which model should I select?”
For example, we often prefer the “simplest” regression model that fits the data, in the sense of having the fewest parameters. This kind of intuition shows up all the time, and is a rough version of the commonly cited Occam’s razor. It certainly feels right to me - if two models make equally good predictions, it would feel strange to prefer the more complex one. This intuition is formalized in the “one standard error rule” popularized by Hastie, Tibshirani, and Friedman, which selects the most parsimonious model whose cross-validation error is within one standard error of the lowest error achieved by any candidate.
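To make that concrete, here is a minimal sketch of the one standard error rule. The setup is hypothetical: I'm assuming the candidate models are polynomial regressions of increasing degree, scored with 10-fold cross-validation on toy data - the data, the degree grid, and the fold count are illustrative choices, not part of the rule itself.

```python
# A minimal sketch of the one-standard-error rule, assuming the candidate
# models are polynomial regressions of increasing degree (hypothetical setup).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))                      # toy inputs
y = 1 + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, size=100)  # quadratic truth + noise

degrees = list(range(1, 11))  # candidate complexities, simplest first
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mean_err, se_err = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    # cross_val_score returns one score per fold; negate to turn
    # "neg_mean_squared_error" back into a plain per-fold MSE.
    fold_mse = -cross_val_score(model, X, y, cv=cv,
                                scoring="neg_mean_squared_error")
    mean_err.append(fold_mse.mean())
    se_err.append(fold_mse.std(ddof=1) / np.sqrt(cv.get_n_splits()))

mean_err = np.array(mean_err)
best = int(mean_err.argmin())
threshold = mean_err[best] + se_err[best]

# The rule: among all models whose CV error is within one standard error
# of the best model's error, pick the simplest one.
chosen = next(d for d, m in zip(degrees, mean_err) if m <= threshold)
print(f"lowest-error degree: {degrees[best]}, one-SE choice: {chosen}")
```

Because the degrees are scanned from simplest to most complex, the first model inside the one-standard-error band is the most parsimonious one - that ordering is what operationalizes the “prefer simplicity” intuition.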
What justification, if any, do we have for this? Are more parsimonious models less likely to overfit? Is there some other provable advantage to this strategy over a different criterion, or over selecting a model at random?