In the context of ARMA models, the model minimizing the AIC or a related statistic has a certain number of parameters defining the order of the model. Each of these parameters may or may not be statistically significant. The literature often shows tables in which the figure in parentheses under each parameter estimate is the standard error of that estimate. With these standard errors, obtained from either the observed or the expected (Fisher) information, one can test for significance using p-values or confidence intervals. For example, p-values are readily obtained from z-ratios, computed by dividing each estimated coefficient by its estimated standard error.
While the paper does not seem to spell out its approach to model selection, I think it is as simple as this: do not select the model minimizing the AIC statistic if at least one of its coefficients is not significantly different from zero. For example, using
print(lh300 <- arima(lh, order = c(3, 0, 0)))
print(lh100 <- arima(lh, order = c(1, 0, 0)))
we first check the significance of the lh300 fit and find that the p-values of the ar2 and ar3 coefficients are too large to consider those coefficients statistically different from zero. The fit with the second-smallest AIC is lh100, all of whose coefficients are significant.
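A minimal sketch of that significance check (the `pvals` helper is my own construction, not something from the paper; `lh` is the luteinizing-hormone series shipped with R's datasets package):

```r
lh300 <- arima(lh, order = c(3, 0, 0))
lh100 <- arima(lh, order = c(1, 0, 0))

# two-sided p-values from z-ratios: estimate divided by its
# asymptotic standard error, taken from the fitted var.coef matrix
pvals <- function(fit) {
  se <- sqrt(diag(fit$var.coef))
  2 * pnorm(-abs(fit$coef / se))
}

round(pvals(lh300), 3)  # ar2 and ar3 come out well above 0.05
round(pvals(lh100), 3)  # every coefficient significant
```

The standard errors here are the square roots of the diagonal of the estimated covariance matrix of the coefficients, which is how the figures in parentheses in the published tables are typically produced.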
This procedure can be questioned from at least two viewpoints.
First, following Chatfield (JRSS, 1995, vol. 158, p. 441),
"least-squares theory is known to not apply when the same data are used to formulate
and fit a model so that estimation follows model selection".
Consequently, the p-values cannot be considered accurate enough to guide the decision to disregard a model. Furthermore, the subjective 5 per cent level mentioned in the question is clearly inconsistent with the objective spirit of the Minimum AIC Estimate (MAICE) procedure. Indeed, on the first page of his most-cited 1974 paper, Akaike wrote that
"By the introduction of MAICE the problem of statistical identification is explicitly
formulated as a problem of estimation and the need of the subjective judgement required
in the hypothesis testing procedure for the decision on the levels of significance is
completely eliminated."
So to answer my own question, introducing subjectivity into what was meant to remove
it is wrong.
Second, as in linear regression modelling, the fact that a coefficient is insignificant does not automatically mean it should be eliminated from the model. Furthermore, it is wrong to regard models whose AIC values lie within a certain distance of the minimum as somewhat inferior to the minimizer. The literature often mentions 2 for that distance, although Chatfield once mentioned 4, and people who worked closely with Akaike have mentioned 1. I remember Gavin Simpson expressing similar ideas on this (or a related) Q&A website. Formal references can be made to Brockwell & Davis, to Ruppert, Wand and Carroll, and to other well-known book-length sources of statistical knowledge. In our example, the data support lh300 and lh100 equally well, but lh100 has the advantage of being simpler.
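That near-equal support can be inspected directly by looking at the AIC difference between the two fits (a sketch; the threshold used in the check is Chatfield's rule-of-thumb distance of 4 mentioned above, not a parameter of any R function):

```r
lh300 <- arima(lh, order = c(3, 0, 0))
lh100 <- arima(lh, order = c(1, 0, 0))

# AIC() works on Arima objects via their logLik method
delta <- AIC(lh100) - AIC(lh300)
delta  # a small difference: the data support both models about equally
```

When the difference falls inside the rule-of-thumb band, AIC alone gives no compelling reason to prefer the minimizer over the simpler model.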
So, to answer my own question from another perspective: it is wrong to consider models other than the AIC minimizer only when the minimizer happens to contain insignificant parameters.