Was thinking a bit about model selection earlier, and I ended up getting hung up on the question: “If two models have similar predictive power, which model should I select?”
For example, we often prefer the “simplest” regression model that fits the data, in the sense of having the fewest parameters. This kind of intuition shows up all the time, and is a rough version of the commonly cited Occam’s razor. It certainly feels right to me - if two models make equally good predictions, it would feel strange to prefer the more complex one. This intuition is formalized in the “one standard error rule” popularized by Hastie, Tibshirani, and Friedman, which selects the most parsimonious model whose cross-validation error is within one standard error of the lowest error achieved by any candidate.
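To make that concrete, here is a minimal sketch of the one standard error rule. The setup is hypothetical: I'm assuming the candidate models are polynomial regressions of increasing degree, scored with 10-fold cross-validation on toy data - the data, the degree grid, and the fold count are illustrative choices, not part of the rule itself.

```python
# A minimal sketch of the one-standard-error rule, assuming the candidate
# models are polynomial regressions of increasing degree (hypothetical setup).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))                      # toy inputs
y = 1 + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, size=100)  # quadratic truth + noise

degrees = list(range(1, 11))  # candidate complexities, simplest first
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mean_err, se_err = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    # cross_val_score returns one score per fold; negate to turn
    # "neg_mean_squared_error" back into a plain per-fold MSE.
    fold_mse = -cross_val_score(model, X, y, cv=cv,
                                scoring="neg_mean_squared_error")
    mean_err.append(fold_mse.mean())
    se_err.append(fold_mse.std(ddof=1) / np.sqrt(cv.get_n_splits()))

mean_err = np.array(mean_err)
best = int(mean_err.argmin())
threshold = mean_err[best] + se_err[best]

# The rule: among all models whose CV error is within one standard error
# of the best model's error, pick the simplest one.
chosen = next(d for d, m in zip(degrees, mean_err) if m <= threshold)
print(f"lowest-error degree: {degrees[best]}, one-SE choice: {chosen}")
```

Because the degrees are scanned from simplest to most complex, the first model inside the one-standard-error band is the most parsimonious one - that ordering is what operationalizes the “prefer simplicity” intuition.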
What justification, if any, do we have for this? Are more parsimonious models less likely to overfit? Is there some other provable advantage to this strategy over a different criterion, or over selecting a model at random?