
In a recent thread, use of adjusted $R^2$ ($R^2_{adj.}$) is mentioned in the context of model selection, e.g.

The adjustment was invented as a solution to problems caused by variable selection

Question: Is there any justification for using $R^2_{adj.}$ for model selection? That is, does $R^2_{adj.}$ have any optimality properties in the context of model selection?

For example, AIC is an efficient criterion and BIC is a consistent one, but $R^2_{adj.}$ coincides with neither of them, which makes me wonder whether it can be optimal in some other sense.

Richard Hardy
  • Also discussed here: https://stats.stackexchange.com/questions/222461/how-to-show-that-a-model-is-not-over-fitted – kjetil b halvorsen Jun 20 '19 at 09:59
  • model selection can quite easily be "optimally sub-optimal" and shrinkage/regularisation can be a better method. proceed carefully, particularly if your goal is prediction of new cases.... – probabilityislogic Jun 15 '20 at 13:37
  • @Richard Hardy, isn't that pretty much this thread? https://stats.stackexchange.com/questions/197112/why-information-criterion-not-adjusted-r2-are-used-to-select-appropriate-la/197237#197237 – Christoph Hanck Jun 15 '20 at 13:52
  • @ChristophHanck, my problem is exactly what $R^2_{adj.}$ is optimal for. It is neither efficient (like AIC) nor consistent (like BIC). So is it good for anything? Maybe there is some criterion in addition to efficiency and consistency that makes $R^2_{adj.}$ the measure of choice? If so, is the criterion ever desirable? – Richard Hardy Jun 15 '20 at 14:06
  • @probabilityislogic, good point. Hansen ["A Winner’s Curse for Econometric Models: On the Joint Distribution of In-Sample Fit and Out-of-Sample Fit and its Implications for Model Selection"](http://www.tse-fr.eu/sites/default/files/medias/stories/SEMIN_10_11/ECONOMETRIE/hansen.pdf) (2010) offers some concrete examples of that. So I wonder if there is *any* justification for $R^2_{adj.}$ as a model selection criterion. – Richard Hardy Jun 15 '20 at 14:09
  • +1, I see. To give a moderately useful example, as my answer in the link demonstrates, adjusted $R^2$ amounts to choosing the model with the smallest $\log(\widehat{\sigma}^2)+\frac{K}{n}$. Hence, one might say that it is optimal for someone who has this loss function trading off fit and parsimony. – Christoph Hanck Jun 16 '20 at 04:53

3 Answers


I don't know whether $R^2_{\text{adj.}}$ has any optimality properties for model selection, but it is surely taught (or at least mentioned) in that context. One reason might be that most students have met $R^2$ early on, so there is something to build on.

One example is the following exam paper from the University of Oslo (see problem 1). The text used in that course, *Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models* (second edition) by Eric Vittinghoff, David V. Glidden, Stephen C. Shiboski and Charles E. McCulloch, mentions $R^2_{\text{adj.}}$ early on in its chapter 10 on variable selection (as penalizing less than AIC, for example), but neither it nor AIC is mentioned in the summary/recommendations of section 10.5.

So it is maybe mostly used didactically, as an introduction to the problems of model selection, and not because of any optimality properties.

kjetil b halvorsen

Answer for part 1:

  1. If you add more variables, even totally insignificant ones, R2 can only go up; this is not the case with adjusted R2. You can try running a multiple regression, then adding a random variable, and see what happens to R2 and to the adjusted R2 (a sketch of such a check follows).
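A minimal simulation sketch of that check, assuming NumPy is available (the data-generating process and variable names are invented purely for illustration):

```python
# Minimal sketch: add a pure-noise regressor and compare R2 with adjusted R2.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # y truly depends only on x1
noise = rng.normal(size=n)                # irrelevant regressor

def r2_and_adj(y, X):
    """R2 and adjusted R2 for an OLS fit of y on X (X already contains the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    k = X.shape[1]                        # number of parameters incl. intercept
    adj = 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - k)
    return r2, adj

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise])
print("without noise regressor:", r2_and_adj(y, X_small))
print("with noise regressor:   ", r2_and_adj(y, X_big))
# R2 never decreases when the noise regressor is added;
# adjusted R2 typically (though not always) decreases.
```

Because a pure-noise regressor has |t| > 1 roughly a third of the time, adjusted R2 can still occasionally rise; the point is only that it does not rise automatically the way R2 does.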
  • Thanks! I do not think this constitutes a justification for the use of $R^2_{adj.}$ for model selection. It only justifies the preference of $R^2_{adj.}$ to $R^2$ when estimating the population $R^2$ under a very restrictive assumption, as I explain in the comment in the linked thread. I have edited my post to make my question clearer. – Richard Hardy May 14 '19 at 11:36
  • The adjusted R squared increases only if the variable under investigation improves the model more than a random variable would. – Oren Ben-Harim May 15 '19 at 07:41
  • Thanks once more. Could you be more precise? In which sense exactly does it improve the model? Does the model become better at prediction under a given loss function? Or does a higher $R^2_{adj.}$ suggests the model is more likely to be the true model among a set of candidates? Or does it have any other properties that could justify it as a criterion for model selection? – Richard Hardy May 15 '19 at 07:57
  • R squared is the ratio of explained variance to total variance, and the adjusted R squared is a less biased estimator of the population R squared. So you want higher explained variance in your model. – Oren Ben-Harim May 15 '19 at 08:12
  • The explained variance is the variance that the model accounts for; the unexplained variance is the noise that the model cannot predict. – Oren Ben-Harim May 15 '19 at 12:50
  • OK, I think I get that. But in which way is $R^2_{adj.}$ relevant for model selection? (I do not dispute its relevance for measuring the ratio of explained variance to total variance, even though the simpler $R^2$ may be preferable in some cases, as I explain in the linked thread.) In prediction, it appears to be inferior to AIC (which is an efficient model selector) as it penalizes too little (less than AIC) and thus would select too rich a model. In search for the true model, this deficiency is even greater, as an even larger penalty is needed for selection consistency (compare BIC to AIC). – Richard Hardy May 15 '19 at 12:53
  • There are several algorithms, like stepwise selection, that start with all the variables (or with none) and remove or add variables one by one, with the final goal of maximizing the adjusted R squared (this is good only as a screening method). – Oren Ben-Harim May 15 '19 at 12:59
  • My question is not whether such methods exist (they do), but whether they have a justification. There are a number of historically popular methods in statistics that have been proven over and over again to have poor properties; stepwise model selection is one of them, regardless of which criterion (AIC, BIC, $R^2_{adj.}$) is employed inside. These methods do not get extinct; once they are created, they exist. My question is, do they have a justification to be used. More specifically, does $R^2_{adj.}$ have a justification as a model selection criterion. – Richard Hardy May 15 '19 at 13:12
  • As I wrote, this is good only as a "screening method". It does not replace the expert who builds the model based on logic, knowledge and other research, but it helps as a supporting tool. – Oren Ben-Harim May 16 '19 at 11:51
  • @RichardHardy I see you say "it appears to be inferior to AIC (which is an efficient model selector) as it penalizes too little (less than AIC) and thus would select too rich a model" however AIC weighting is considered suboptimal and arbitrary itself (the 2s in the formula) and sometimes too severe; compare WAIC or DIC. If we are interested in collecting true factor effects, one logical approach is p<0.5 (more likely true than not). AR2 generally optimizes around .3 which can be considered more optimal in the sense of closer to 0.5 compared to AIC - optimal shouldn't mean restrictive. – John Vandivier Oct 13 '21 at 18:20
  • @JohnVandivier, interesting. Consider contributing these points as an answer in the linked thread. The justification and optimality of AIC is clear -- under appropriate assumptions. The 2s have no effect on model selection as they apply equally to every model. – Richard Hardy Oct 13 '21 at 18:21
  • @RichardHardy Will do! Thanks. I was hoping to have my error cleared up prior to downvote, but we shall take the plunge :) – John Vandivier Oct 13 '21 at 18:23
  • @JohnVandivier, not sure what downvote you are talking about. – Richard Hardy Oct 13 '21 at 18:24
  • I just mean that if I am mistaken I would prefer to be corrected in a comment rather than receiving a hypothetical downvote - hopefully no such vote obtains – John Vandivier Oct 13 '21 at 18:29

I would propose six optimality properties.

  1. Overfit Mitigation
  2. Simplicity and Parsimony
  3. General Shared Understanding
  4. Semi-Efficient Factor Identification
  5. Robustness to Sample Size Change
  6. Explanatory Utility

Overfit Mitigation

What kind of model is overfit? In part, this depends on the model's use case. Suppose we are using a model to test whether a hypothesized factor-level relationship exists. In that case a model which tends to allow spurious relations is overfit.

"The use of an adjusted R2...is an attempt to account for the phenomenon of the R2 automatically and spuriously increasing when extra explanatory variables are added to the model." Wikipedia.

Simplicity and Parsimony

Parsimony is valued on normative and economic grounds. Occam's Razor is an example of a norm, and depending on what we mean by "justification," it might pass or fail.

The economic rationale for simplicity and parsimony is harder to dismiss:

  1. Complex models with many factors are expensive to gather data for.
  2. Complex models can be more expensive to execute.
  3. Complex models are hard to communicate and think through. Business and legal risks can result from this, as well as plain time spent communicating from one person to another.

Given two models with equal explanatory power (R2), then, AR2 selects the simpler, more parsimonious model (a small numeric illustration follows).
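A small numeric illustration, using only the usual adjusted-R2 formula (the R2 value, sample size and model sizes below are made up):

```python
# Adjusted R2 from the usual formula, with p predictors (intercept excluded)
# and n observations. The numbers are invented for illustration.
def adj_r2(r2, n, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n, r2 = 100, 0.60
print(adj_r2(r2, n, p=3))    # about 0.587 for the 3-predictor model
print(adj_r2(r2, n, p=10))   # about 0.555 for the 10-predictor model
```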

General Shared Understanding

Justification involves shared understanding. Consider a peer-review situation. If the reviewer and the reviewed lack a shared understanding of model selection, questions or rejections may occur.

R2 is an elementary statistical concept, and even those only familiar with elementary statistics generally understand that R2 is gameable and that AR2 is preferred to R2 for the above reasons.

Sure, there may be better choices than AR2, such as AIC and BIC, but if the reviewer is unfamiliar with these, then their use may not succeed as a justification. What's worse, the reviewer may have a misunderstanding themselves and require AIC or BIC when they aren't required; that itself is unjustified.

My limited understanding indicates that AIC is now considered rather arbitrary by many - specifically the 2s in the formula. WAIC, DIC, and LOO-CV have been suggested as preferred, see here.

I hope by "justified" we don't mean "no better parameter exists", because it seems to me that some better parameter might always exist unbeknownst to us, so this style of justification always fails. Instead, "justified" ought to mean "satisfies the requirement at hand", in my view.

Semi-Efficient Factor Identification

Caveat: I made up this term and I could be using it wrong :)

Basically, if we are interested in identifying true factor relations, we should expect p < 0.5, ie P(B) > P'(B) (the relation is more likely present than not). AR2 maximization satisfies this, as adding a factor with p >= 0.5 will reduce AR2. Now this isn't an exact match, because AR2 generally penalizes factors with p above roughly 0.35 (see the simulation sketch at the end of this subsection).

It's true AIC penalizes more in general but I'm not sure that's a good thing if the goal is to identify all observed features that have an identifiable relation, say at least directionally, in a given data set.
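To make the threshold intuition concrete, here is a simulation sketch (assuming NumPy; the data-generating process is invented) of the standard algebraic fact that adding a single regressor raises AR2 exactly when that regressor's |t| statistic exceeds 1, which for large samples corresponds to a two-sided p-value of roughly 0.32, close to the figure mentioned above:

```python
# Check by simulation that "adding a regressor raises adjusted R2" coincides
# with "that regressor's |t| exceeds 1". Data-generating process is made up.
import numpy as np

rng = np.random.default_rng(1)
n, trials, agree = 500, 200, 0

def adj_r2_and_resid(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    k = X.shape[1]
    adj = 1.0 - (1.0 - ss_res / ss_tot) * (n - 1) / (n - k)
    return beta, ss_res, adj

for _ in range(trials):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 0.5 * x1 + 0.1 * x2 + rng.normal(size=n)  # x2 has a weak effect

    X_small = np.column_stack([np.ones(n), x1])
    X_big = np.column_stack([np.ones(n), x1, x2])
    _, _, adj_small = adj_r2_and_resid(y, X_small)
    beta, ss_res, adj_big = adj_r2_and_resid(y, X_big)

    # t statistic of x2 in the augmented model
    sigma2 = ss_res / (n - X_big.shape[1])
    var_beta = sigma2 * np.linalg.inv(X_big.T @ X_big)
    t_x2 = beta[2] / np.sqrt(var_beta[2, 2])

    agree += (adj_big > adj_small) == (abs(t_x2) > 1.0)

print(f"agreement: {agree} out of {trials}")  # expected: agreement in every trial
```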

Robustness to Sample Size Change

In the comments of this post, Scortchi - Reinstate Monica notes that it "makes no sense to compare likelihoods (or therefore AICs) of models fitted on different nos. [of] observations." In contrast, r-squared and adjusted r-squared are absolute measures that can be compared across samples of different sizes.

This might be useful in the case of a questionnaire that includes some optional questions and partial responses. It's of course important to be mindful of issues like response bias in such cases.

Explanatory Utility

Here, we are told that "R2 and AIC are answering two different questions...R2 is saying something to the effect of how well your model explains the observed data...AIC, on the other hand, is trying to explain how well the model will predict on new data."

So if the use case is non-predictive, such as in the case of theory-driven, factor-level hypothesis testing, AIC may be considered inappropriate.

  • Note that under some assumptions, LOOCV is asymptotically equivalent to AIC. – Richard Hardy Oct 13 '21 at 19:30
  • When you say "factor relations", what exactly do you mean? – Richard Hardy Oct 13 '21 at 19:31
  • Here I am trying to generally express coefficient hypothesis testing. Usually I care about direction of effect, significance, and importance. This general term leaves room to explore an independent factor, lags, interactions, marginal effects, and other computed, derived, or engineered features. Possibly including causal testing. I am mainly thinking about OLS at the moment. – John Vandivier Oct 13 '21 at 20:40
  • Sometimes I’m also interested in how dependent variables relate to each other as a secondary concern. Eg does X2 “partial out” X1 – John Vandivier Oct 13 '21 at 20:41
  • I cannot find the relevant thread anymore to post this comment to but I will do it here. Just wanted to let you know I watched Ben Lambert's YouTube video about AIC, DIC, WAIC and LOOCV and did not find it convincing enough to drop AIC in favor of DIC or WAIC. Statements in the video (e.g. about popularity or approximation quality of AIC) are too general to be correct even if they may hold in special cases (though I understand this is intentional due to the format of the lecture series). DIC and WAIC may function OK for Bayesian modeling but I do not see if they are applicable outside of that. – Richard Hardy Oct 16 '21 at 16:44
  • Link for the audience [here](https://www.youtube.com/watch?v=xS4jDHQfP2o). The key statement in the video as I understand it wasn't to do with popularity; that's a result not a cause of utility. He says "The big step forward which DIC makes over AIC is applying a more general and useful measure of penalty." DIC penalty is related to data variance in contrast to the arbitrary 2k AIC penalty. It's hard for me to see how AIC can be preferred to either since it is a special case of either DIC or WAIC - a special case with little retrospective or comparative optimality justification. – John Vandivier Oct 16 '21 at 19:12
  • Now, if we abandon "I must use the ideal optimality justification" for "I must use a satisfactory optimality justification," then sure AIC seems fine - the gains from others will often be unimportant and AIC is relatively widely used, understood, and supported in code libraries. However, on exactly parallel grounds, we can appreciate adjusted r-squared. – John Vandivier Oct 16 '21 at 19:13
  • I do not remember Ben Lambert saying the 2k penalty is arbitrary. I hope he did not say that as that would simply be wrong. There is a clear theoretical justification for it in Akaike's original paper and in later elaborations. I cannot comment on the relationship between AIC and DIC or WAIC as I do not know the latter two well enough. I am still looking for any optimality justification for $R^2_{adj.}$. The arguments you have provided are not quite what I am looking for. Nor are they on the level of concreteness that would lend itself for constructive criticism given my understanding. Sorry. – Richard Hardy Oct 16 '21 at 19:24
  • @RichardHardy I just added items 5 and 6 - I don't expect these are what you are looking for either, but just in case! – John Vandivier Oct 18 '21 at 00:42