How do you select variables in a regression model?

Question

The traditional approach to variable selection is to find the variables that contribute the most to predicting a new response. Recently I learned of an alternative. When modeling variables that determine the effect of a treatment, as for example in a clinical trial of a pharmaceutical, a variable is said to interact qualitatively with treatment if, with other things held fixed, a change in that variable can change which treatment is most effective. These variables are not always strong predictors of the effect but may be important for a physician when deciding on treatment for individual patients. In her PhD thesis Lacey Gunter developed a method for selecting these qualitatively interacting variables, which could be missed by algorithms that base selection on prediction. Recently I have worked with her on extending these methods to other models, including logistic regression and Cox proportional hazards regression models.

I have two questions:

1. What do you think about the value of these new methods?
2. In the case of the traditional methods, what approach do you prefer? Criteria such as AIC, BIC, Mallows' Cp, F tests for entering or dropping variables in stepwise, forward and backward...

The first paper on this was Gunter, L., Zhu, J. and Murphy, S. A. (2009). Variable selection for qualitative interactions. Statistical Methodology, doi:10.1016/j.stamet.2009.05.003.

The next paper was Gunter, L., Zhu, J. and Murphy, S. A. (2011). Variable selection of qualitative interactions in personalized medicine while controlling the familywise error rate. Journal of Biopharmaceutical Statistics 21, 1063-1078.

The next one appeared in a special issue on variable selection: Gunter, L., Chernick, M. R. and Sun, J. (2011). A simple method for variable selection in regression with respect to treatment selection. Pakistan Journal of Statistics and Operations Research 7, 363-380.

You can find the papers at the journal websites. You may have to purchase the article. I might have the pdf files for these articles. Lacey and I have just completed a monograph on this topic which will be published as a SpringerBrief later this year.

Maybe I'm not following - if there is an _a priori_ reason to suspect effect modification, then how do these new methods differ from, for example, including interaction terms in the list of "candidate" variables for model selection? — Macro, May 03 '12 at 17:13
(1) One or more lines seem to have been lost in this question. I guess it might continue "stepwise, forward and backward, ..." (2) Model identification and variable selection have been extensively discussed here. E.g., searching on [+model +variable +selection](http://stats.stackexchange.com/search?q=%2Bmodel+%2Bselection+%2Bvariable&submit=search) presents 145 threads at this point. Narrowing that search will likely answer the second question. (3) To facilitate answers to the first question, could you provide a link or explicit references to this research? — whuber, May 03 '12 at 17:31
This is a matter of including a variable that interacts with the treatment. But it is a qualitative interaction not just a simple interaction. To interact the two lines must not be parallel. To qualitatively interact they must cross in the interval in which the variable is defined. So the idea is to find a variable that qualitatively interacts. This is different from picking variables and interaction terms that improve the fit or prediction. — Michael R. Chernick, May 03 '12 at 19:25
Thanks for taking the opportunity to respond, Michael. Perhaps a key point to bring up is that this site is *not* a discussion site, but rather a Q&A site. With that comes some slightly different modalities of communication. The FAQ covers this in some detail. Occasionally the threading can get a bit lost, but it's actually surprisingly rare I find, once one gets a little more experience with the general scheme of things. Cheers. — cardinal, May 03 '12 at 20:40
@cardinal I am not sure that I like this system and the silly downvoting because one person's answer is another person's comment. The basis of this is to start with a question but there is plenty of discussion. I notice that many questions are ill-posed and require answers to questions of clarification that must be asked through comments. Also even in the questions and answers there is plenty of discussion. Some questions have no answers such as the one where the question was "where are my t-values in the multiple comparison test?" — Michael R. Chernick, May 04 '12 at 05:47
@cardinal The Bonferroni bound involves p-values but not t-values so there is no correct way to explain where the t-values are. — Michael R. Chernick, May 04 '12 at 05:48
Michael, yes, the SE system takes some getting used to and is not perfect. But it does make sense and it is consistent. One thing we aim for is ongoing *improvement*: unlike list servers and bulletin boards, questions (and answers) can be modified; this is expected. Ultimately, we would like a thread to start with a single, well stated, complete question that stands on its own without reference to the comment thread; then it should continue with one or more well-written, well-attributed canonical answers. With this ideal in mind, @cardinal's suggestions may make more sense to you. — whuber, May 04 '12 at 14:21
Someone downvoted this question. Does anyone have an explanation why, or why anyone would downvote my answer to Bill Huber's question? That answer is direct, to the point, and complete. — Michael R. Chernick, May 07 '12 at 03:03
My answer was deleted by a moderator, probably because it was not appropriate for me to reply to someone else's question through an answer. So I have attached the information to my question as requested. — Michael R. Chernick, Jun 13 '12 at 14:36
I don't understand the question or the response to @Macro's comment. If "leaving other things fixed, a change in that variable can create a change in which treatment is most effective" wouldn't that suggest the candidate explanatory variable is useful in "predicting" the response variable? Or is it that this variable is only useful for a small percentage of individuals (in which case, yes, identifying such variables through statistics will be a daunting task)? — Peter Ellis, Feb 06 '13 at 09:44
I think that some of this discussion misses the point of Lacey's method. She is saying that the variables to be included in a regression model need not all be among the best predictors of outcome. In the medical context where the model predicts the outcome of a treatment, a variable that relates to which treatment is best can be valuable even though it may not help the prediction that much. — Michael R. Chernick, Nov 25 '16 at 21:56
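
The crossing condition described in the comments (lines that are not parallel interact; lines that cross inside the range of the variable interact qualitatively) can be illustrated with a small sketch. It assumes the expected response is linear in a single covariate under each of two treatments, A and B. The function names and the example coefficients are hypothetical, and this is only a check of the definition, not the selection method from the cited papers.

```python
def crossing_point(a0, a1, b0, b1):
    """Return x where a0 + a1*x equals b0 + b1*x, or None if the lines are parallel."""
    if a1 == b1:
        return None
    return (b0 - a0) / (a1 - b1)


def classify_interaction(a0, a1, b0, b1, lo, hi):
    """Classify the treatment-by-covariate interaction on the interval (lo, hi).

    Treatment A has expected response a0 + a1*x, treatment B has b0 + b1*x.
    """
    x_star = crossing_point(a0, a1, b0, b1)
    if x_star is None:
        return "none"             # parallel lines: no interaction
    if lo < x_star < hi:
        return "qualitative"      # the better treatment changes inside the range
    return "non-qualitative"      # lines differ in slope but do not cross in range


# Hypothetical coefficients: A is 1 + 0.5*x, B is flat at 2, so they cross at x = 2.
wide_range = classify_interaction(1.0, 0.5, 2.0, 0.0, lo=0.0, hi=10.0)
narrow_range = classify_interaction(1.0, 0.5, 2.0, 0.0, lo=5.0, hi=10.0)
parallel = classify_interaction(1.0, 0.5, 2.0, 0.5, lo=0.0, hi=10.0)
```

With these made-up numbers the same pair of lines is a qualitative interaction when the covariate ranges over 0 to 10, but not when it only ranges over 5 to 10, because treatment A is then better everywhere. In practice the coefficients are estimated, so whether a crossing is real has to be assessed with uncertainty taken into account.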

score 2 · Answer 1 · answered Nov 03 '14 at 21:04

See Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, p. 69; they have a section on model selection. She is using a question-based approach, which is completely fine, but in her paper she needs to justify why she included what she did in the model. As you said, "These variables are not always strong predictors of the effect but may be important for a physician when deciding on treatment for individual patients." So as long as she justifies why these predictors should be included, it is fine. Personally, I prefer these methods. So here comes my answer to 2.
Stepwise, forward, and backward selection are, I think, black boxes. When you run a model through all three you will not necessarily arrive at the same predictors, so I don't have a clear answer as to which of them to use. AIC or BIC is fine for comparing models.
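
As a rough illustration of comparing models by AIC or BIC, here is a minimal sketch using only the Python standard library. It assumes a Gaussian linear model, for which both criteria can be written in terms of the residual sum of squares up to an additive constant shared by all models fitted to the same data. The data are made up for illustration, the helper names are hypothetical, and `k` counts regression coefficients only (counting the error variance as well would add the same amount to every model).

```python
import math


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rss(y, fitted):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def aic_bic(rss_value, n, k):
    """Gaussian AIC and BIC, dropping constants that are equal across models."""
    base = n * math.log(rss_value / n)
    return base + 2 * k, base + k * math.log(n)


# Made-up illustrative data, roughly y = 2*x plus noise.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(y)

# Model 1: intercept only (k = 1).
mean_y = sum(y) / n
null_aic, null_bic = aic_bic(rss(y, [mean_y] * n), n, k=1)

# Model 2: intercept plus slope on x (k = 2).
b0, b1 = fit_simple_ols(x, y)
slope_aic, slope_bic = aic_bic(rss(y, [b0 + b1 * xi for xi in x]), n, k=2)

# Lower is better for both criteria.
preferred_by_aic = "slope" if slope_aic < null_aic else "intercept-only"
preferred_by_bic = "slope" if slope_bic < null_bic else "intercept-only"
```

Both criteria reward fit and penalize the number of parameters; BIC's penalty grows with the sample size, so it tends to favor smaller models than AIC. Neither criterion is designed to detect a variable that changes which treatment is best, which is the distinction the question draws.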
