
I'm trying to generate an R function that keeps relevant variables based on their absolute t-value (or p, whichever is easier in code).

Basically what I want is to run one regression (1), retain all variables that are significant (based on the t-value, or p). Then run another regression (2), retain all variables that are significant (based on t-value, or p). Then take all retained variables from the regressions (1 & 2) above and combine them in a final regression (3).

Currently I am struggling with the step of retaining variables based on their t-value: I can extract the t-values of the variables, but I don't know how to code that they should be "retained" and kept for the later regression.

What I have currently (example edited for better understanding):

summaryI1 <- summary(lm(y~x1+x2+x3+x4)) #x1-x4 are the vars included in regression 1
(coefI1 <- coef(summaryI1))
tI1 <- coefI1[, "t value"] #extracts the t-values for all the x's

cutoff <- which(abs(tI1) > cut, arr.ind=TRUE) #finds which t-stats are above the cutoff level `cut`

Suppose x1 and x2 are significant and I would like to keep them "aside" for the next regression. How would I code that R should create a new matrix of x1 and x2, based on x1 and x2 having t-values higher than my cutoff?
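To illustrate, here is a minimal sketch of the selection step being described (the data frame `dat` and the cutoff of 2 are placeholders, not part of the question):

# Placeholder data frame holding the response and the candidate predictors
dat <- data.frame(y, x1, x2, x3, x4)

fit1  <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
tvals <- coef(summary(fit1))[, "t value"]
cut   <- 2  # example cutoff on the absolute t-value

# names of the predictors whose |t| exceeds the cutoff (intercept dropped)
keep <- names(which(abs(tvals[-1]) > cut))

# matrix of just the retained predictors, ready for the next regression
x_kept <- as.matrix(dat[, keep, drop = FALSE])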

I'm not that experienced in R programming yet but learning. Can anyone help?

felix
  • You say "I1 are the variables included in regression 1". Is this a placeholder for multiple variables? Isn't `x` a variable too? Please provide a reproducible example. – Sven Hohenstein Sep 19 '12 at 11:18
  • Such stepwise approaches do not work as advertised. – Frank Harrell Sep 19 '12 at 12:17
  • Hi, I've changed the example, so now I have variables x1...x4 in regression 1, and I would like to retain those if their t-value is sufficiently high. I'm not concerned about the effectiveness of this stepwise approach, it is basically a smaller part of a bigger algorithm that I am working on, thank you! – felix Sep 19 '12 at 12:39
  • See [Joris Meys' answer](http://stackoverflow.com/a/3701896/583830) to a similar question on SO. – jthetzel Sep 19 '12 at 14:00
  • Generally, the `update()` function is useful when modifying models. – jthetzel Sep 19 '12 at 14:03
  • @FrankHarrell But isn't there the function `fastbw` in your `rms` package that indeed does exactly that even using *p*-values? – Henrik Sep 19 '12 at 15:20
  • I agree with @FrankHarrell, that such an approach is likely to lead you astray, unfortunately. If this doesn't make much sense / you want to understand why, I wrote an answer to another question: [algorithms-for-automatic-model-selection](http://stats.stackexchange.com/questions/20836//20856#20856) that discusses these issues. – gung - Reinstate Monica Sep 19 '12 at 16:29
  • Although I am not completely opposed to stepwise procedures the way Frank is, I don't like the criterion you use. The individual t-tests at any particular stage can be very misleading about the importance of a variable in the model. As has been noted in several other questions, the magnitude of a coefficient and its statistical significance can change when the other variables in the model are changed. Your second step allows you to throw out variables that were significant in the first round, but it does not allow you to reintroduce variables. – Michael R. Chernick Sep 19 '12 at 16:36
  • Your procedure is a stepdown or backwards selection with only two steps. Stepwise at least allows for many steps and can add back as well as take away. So if you think stepwise is bad, this is worse. – Michael R. Chernick Sep 19 '12 at 16:38
  • Just to summarize a fine sequence of comments: **you really, really do not want to do this.** There are far better solutions readily available (and extensively discussed on our site under the [tag:model-selection] tag). – whuber Sep 19 '12 at 19:24
  • Thank you for all the comments that this is bad, I am well aware of the shortcomings of stepwise. The procedure that I am trying to code is part of a larger project and in particular this type of selection will in the end only be applied to orthogonal variables, the method this algorithm aims to replicate works. My question was only trying to find a simple example illustrating my problem rather than being adequate in itself. All I need is the programming steps to select based on some criterion. – felix Sep 20 '12 at 08:03
  • @felix, did you work through my second example? The same thing can happen even with orthogonal variables (the first example is much less likely with orthogonal). If you still insist on going forward, the 2nd example automatically finds the variables meeting a specific criteria and uses them in a follow-up model. You would need to modify it a bit, but does that get you started? – Greg Snow Sep 20 '12 at 16:00
  • Hi Greg, thank you - I missed it when I checked last time! I will give it a try! I appreciate all the concerns of it going wrong, the way it's used in the end though will make sure its fine. – felix Sep 20 '12 at 20:22

2 Answers


Maybe some simulations will help you understand why these types of approaches can go bad.

First we will create some data. In this case x1 and x2 are correlated with each other (positively), and y is a function of both of them (more specifically, of their difference). We first look at the regressions on each of the x's individually, then at the regression that combines them:

set.seed(1)
x1 <- rnorm(25)
x2 <- rnorm(25, x1)
y <- x1-x2 + rnorm(25)
pairs( cbind(y,x1,x2) )
cor( cbind(y,x1,x2) )
summary(lm(y~x1))
summary(lm(y~x2))
summary(lm(y~x1+x2))

This shows that if some type of pre-screening is done on individual predictor variables (or groups of them), and only the "significant" predictors are then used in further analyses, then very important variables can be screened out.
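To make the screening step explicit, here is a sketch continuing the simulated data above (the 0.05 threshold is just an example):

# p-values from the individual regressions above
p1 <- coef(summary(lm(y ~ x1)))["x1", "Pr(>|t|)"]
p2 <- coef(summary(lm(y ~ x2)))["x2", "Pr(>|t|)"]

# a marginal screen at 0.05 can drop one or both predictors...
c(x1 = p1, x2 = p2) < 0.05

# ...even though both are clearly important in the joint model
coef(summary(lm(y ~ x1 + x2)))[, "Pr(>|t|)"]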

Now let's go the other way and simulate some data where there is no relationship between the response and any of the predictors. We first screen the potential predictors and put just the "likely" predictors into a new model:

set.seed(1)
y <- rnorm(100)
x <- matrix( rnorm(50*100), ncol=50 )
mydat <- data.frame( y,x )

fit1 <- lm( y~., data=mydat )
fit0 <- lm( y~1, data=mydat )
(out <- add1( fit0, fit1, test='F' ))

form <- as.formula( paste('y ~', 
    paste0('X', which(out$`Pr(>F)` <= 0.1)-1, collapse=' + ') ) )

fit2 <- lm( form, data=mydat )
summary(fit2)
summary(fit1)

The overall F-test on fit1 shows that there is nothing going on, which is the correct answer since y was simulated without any influence from the x's. But the F-test on fit2 is significant (the wrong answer) because it does not take into account the screening that was done.

If simple simulations can show cases where stepwise-type methods leave out important variables and include useless ones, just think of the risks when using more complicated real-world data.

Greg Snow

So maybe it's not a good idea, but here's a start:

    #first, some simulated data
    y = seq(from = 1, to = 10, by = 0.5)
    x1 = y + rnorm(length(y))                    #closely tracks y, so likely significant
    x2 = y * runif(length(y), min = -5, max = 5) #essentially noise, so likely insignificant
    x3 = y + runif(length(y), min = -5, max = 5) #loosely tracks y

    model = lm(y ~ x1 + x2 + x3)
    summary(model)

    p_value = coef(summary(model))[, 4]

    #create a new data frame to store y and only the significant regressors
    alpha = 0.05
    new_x = data.frame(model$model)
    for (i in 2:length(coef(model))) {
       if (p_value[i] > alpha) {
          new_x[, i] = NA    #mark insignificant regressors for removal
       }
    }

    #function to remove the NA-marked columns
    remove.column.NA = function(data) {
       i = 2
       while (i <= ncol(data)) {
          if (is.na(data[1, i])) {
             data = data[, -i]    #drop the column; the next one shifts into position i
          } else {
             i = i + 1
          }
       }
       return(data)
    }

    new_x = remove.column.NA(new_x)

Now you have a data frame (new_x) containing the response and only the columns of the significant regressors. I was trying to turn this into a complete function, but the names of the variables are lost in the lm() function.
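One way to carry the retained names into a follow-up fit (a sketch, not part of the original answer) is to rebuild the formula from the columns of new_x with `reformulate()`:

    #first column of new_x is the response; the rest are the retained regressors
    form2  <- reformulate(names(new_x)[-1], response = names(new_x)[1])
    model2 <- lm(form2, data = new_x)
    summary(model2)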

wcampbell