I'll attempt to answer each question in turn. Contrary to fg nu's comment, I think there are real questions here - and real, although perhaps difficult, answers, too. Throughout, when I refer to the mis-use of statistics, I mean unknowingly using the same statistics for both model selection and model evaluation, as opposed to doing so intentionally. In other words, intellectual honesty is assumed. I self-answer in order to motivate more answers.
Q: Are there any guidelines that producers of models can follow in order to avoid misusing statistics in this way?
One guideline would be to put a considerable amount of time into planning one's research. This seems to sit nicely with both Hendry's and Leamer's methodological standpoints (even though attaching the labels frequentist and Bayesian, respectively, is itself arguable). For example, Leamer suggests three stages of data analysis: planning, criticism, and revision. He says that much time should be dedicated to the planning stage, which he defines as "preparing responses to hypothetical data sets". Having thought about the research process, the decisions faced at each node, and the corresponding responses, one ought to be less inclined to make the mistake of using a statistic for two conflicting purposes. Provided everything remains tractable, the possibility of making this mistake could be eliminated altogether. This is related to the algorithmic research and automatic selection methods mentioned below.
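To give a flavour of what "preparing responses to hypothetical data sets" might look like in practice, here is a minimal sketch in Python. The decision rules, thresholds, and diagnostics are purely hypothetical placeholders of my own, not Leamer's; the point is only that the responses are written down before any data arrive, so no statistic computed later can quietly serve two purposes.

```python
# A minimal sketch of the "planning" stage: responses to hypothetical
# data sets are fixed in advance. The thresholds, diagnostics, and
# criteria below are illustrative assumptions, not part of any
# published methodology.

from dataclasses import dataclass


@dataclass
class Plan:
    # Decision rules fixed at the planning stage.
    max_lags: int = 4                              # largest lag length we will entertain
    selection_criterion: str = "BIC"               # used ONLY to choose among candidates
    evaluation_statistic: str = "out-of-sample RMSE"  # used ONLY to judge the final model


def respond(plan: Plan, diagnostics: dict) -> str:
    """Pre-planned response to a hypothetical data set,
    summarised by a few diagnostic flags."""
    if diagnostics["residual_autocorrelation"]:
        return f"add lags up to {plan.max_lags} and re-estimate"
    if diagnostics["structural_break"]:
        return "split the sample at the documented break date"
    return "keep the baseline specification"


# Rehearsing the plan against hypothetical outcomes.
plan = Plan()
print(respond(plan, {"residual_autocorrelation": True, "structural_break": False}))
print(respond(plan, {"residual_autocorrelation": False, "structural_break": False}))
```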
Q: Which model building strategies are least/most open to this trap?
Without knowing the complete set of strategies, it is difficult to pin down exactly which are most or least exposed to this trap. However, I will classify strategies according to one particular property to try to get a sensible answer. The key property is programmability. Modelling strategies that involve a significant amount of planning and that can be programmed or written out as a recipe - however complex that recipe may be - are the strategies least open to using statistics for the dual purpose of model selection and model evaluation. That is, of course, under the assumption that one of the programmer's goals is to avoid the mis-use of statistics!
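To make the "programmable recipe" idea concrete, here is a minimal sketch in Python with simulated data. The particular choices - BIC for selection on an estimation sample, RMSE on a held-out sample for evaluation - are my own illustrative assumptions, not anything prescribed by Hendry or Leamer; what matters is that the recipe is fixed in advance and that the statistic used for selection is, by construction, never the statistic used for evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends on x1 only; x2 is irrelevant noise.
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 0.5 * x1 + rng.normal(scale=1.0, size=n)

# Programmed recipe, fixed in advance:
#  (1) SELECT among candidate designs by BIC on the estimation sample;
#  (2) EVALUATE the chosen model by RMSE on an untouched held-out sample.
# The statistic used in step (1) is never reused in step (2).
split = n // 2
candidates = {
    "intercept only": np.ones((n, 1)),
    "x1":             np.column_stack([np.ones(n), x1]),
    "x1 + x2":        np.column_stack([np.ones(n), x1, x2]),
}


def bic(y, X):
    """BIC (up to additive constants) of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n_obs, k = X.shape
    return n_obs * np.log(resid @ resid / n_obs) + k * np.log(n_obs)


# Step 1: model selection on the first half of the sample.
chosen = min(candidates, key=lambda name: bic(y[:split], candidates[name][:split]))

# Step 2: model evaluation on the untouched second half.
X = candidates[chosen]
beta, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
rmse = np.sqrt(np.mean((y[split:] - X[split:] @ beta) ** 2))
print(f"selected: {chosen}, held-out RMSE: {rmse:.3f}")
```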
Interestingly, fast-forwarding from Hendry's Dynamic Econometrics to his latest work with Doornik on automatic selection methods, one gets the sense that such programmatic or algorithmic research will increasingly become the norm.
The strategies most susceptible to the mis-use of statistics are unstructured modelling strategies: those that involve ad hockery and that may not be replicable. These unstructured methods are closely related to what Leamer refers to as ad hoc specification searches (which disguise private beliefs) and to the patchwork textbook econometrics that Hendry gives examples of.
Q: What can consumers of models do when they suspect that statistics have been used in this way?
Try to perform a replication study - or have someone skilled enough try to do it for you.
Q: Is the crime to the extent that the model loses its usefulness altogether?
Here, I borrow from Hendry, who says that "how a final model is derived is largely irrelevant; it is either useful or not, and that characteristic is independent of whether it comes purely from whimsy, some precise theory, or a very structured search." Note that it is largely irrelevant, not completely irrelevant; the advice to avoid using the same statistics for model selection and model evaluation still stands. In other words, mis-use would likely result in the model not being the dominant model (hence the advice to avoid it); however, if, for whatever reason, it did turn out to be the best and final model, no decision made during the research process would detract from its usefulness.
References that I found to be useful when answering this question include:
Hendry, D. F., Leamer, E. E., and Poirier, D. J. (1990). "The ET Dialogue: A Conversation on Econometric Methodology". Econometric Theory, 6(2), 171-261.
Hendry, D. F., and Doornik, J. A. Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics.
And a presentation by David Hendry entitled "How Empirical Evidence Does or Does Not Influence Economic Thinking and Theory".