
I am building linear models by adding one variable at a time. I am interested in studying the effect of each variable and how those effects change as each new variable is added to the model. Essentially, this is a stepwise regression that ignores significance.

Here is the code for my approach, using the mtcars data as an example of what I am trying to accomplish:

library(broom)
library(tidyverse)
data("mtcars")

# creating empty list to store the models
models <- vector("list")

# getting the independent variables to put into the model
# (mpg is the response, so it is dropped here as well)
# also removing some variables due to collinearity
predictors <- setdiff(colnames(mtcars), c("wt", "mpg", "disp", "cyl", "drat"))

# adding one independent variable to the model at a time
for (i in seq_along(predictors)) {
  f <- as.formula(paste("mpg ~", paste(predictors[1:i], collapse = " + ")))
  model <- lm(f, data = mtcars)
  models[[i]] <- model
}

# naming the models
names(models) <- paste0("MODEL", seq_along(predictors))

# getting the coefficients
all_coefs <- plyr::ldply(models, tidy, .id = "model")
coefs <- all_coefs %>% select(-(std.error:p.value)) %>%
  spread(model, estimate)

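# getting the adjusted r2
# (keeping only the model id and adj.r.squared, so that any extra
# columns returned by glance() do not end up in the result)
all_r2 <- plyr::ldply(models, glance, .id = "model")
r2 <- all_r2 %>% select(model, adj.r.squared) %>%
  spread(model, adj.r.squared) %>%
  mutate(term = "adj.rsquared")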

# gather the p-values for each variable
p.value <- all_coefs %>% select(-(estimate:statistic)) %>%
  spread(model, p.value)

# combining r-squared and coefficients
model_results <- bind_rows(coefs, r2)
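
For comparison, the same coefficient table can probably be built with base R alone. The following is only a minimal sketch under that assumption, not a replacement for the broom-based code above; the names `fits`, `coef_table`, and `results` are made up for illustration.

```r
# Same predictors as above: everything except the response and the
# variables dropped for collinearity
predictors <- setdiff(colnames(mtcars), c("wt", "mpg", "disp", "cyl", "drat"))

# Fit the nested models mpg ~ x1, mpg ~ x1 + x2, ...
fits <- lapply(seq_along(predictors), function(i) {
  lm(reformulate(predictors[1:i], response = "mpg"), data = mtcars)
})
names(fits) <- paste0("MODEL", seq_along(predictors))

# One row per term, one column per model;
# NA marks a term that is not yet in that model
all_terms <- c("(Intercept)", predictors)
coef_table <- sapply(fits, function(m) coef(m)[all_terms])
rownames(coef_table) <- all_terms

# Append the adjusted R-squared of each model as a final row
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
results <- rbind(coef_table, adj.rsquared = adj_r2)
```

Reading across a row of `results` shows how one coefficient moves as later variables enter the model.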

My question is whether this is an appropriate approach for studying many different models. If not, what approaches would others suggest?

roarkz
  • +1 Nice primary question. Do note that the secondary question (the one on how to store/create a specific data frame in R) might be considered off-topic here on CV. If people here do not or cannot answer it, you might want to post that subquestion on stackoverflow.com. – IWS Aug 08 '17 at 09:52
  • @IWS, thanks for the tip. I went ahead and removed that because it is off-topic here and I am not too worried about that piece of the puzzle. – roarkz Aug 08 '17 at 10:02
  • Have you seen: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 ? – Tim Aug 08 '17 at 10:04
  • @Tim I also thought about linking that specific question. However, if I understand the OP correctly, this question does not concern automatic feature/variable selection (through p-values). Instead, I understand that the OP wants to know whether adding features one by one is a valid way of assessing the influence a specific addition has on the variables already in the model. As this depends on the order in which variables are added, I doubt this is the way to do such an analysis. As such, this is a slightly different question, where a specific plan of analysis might apply. – IWS Aug 08 '17 at 12:20
  • @IWS I *don't* say this is a duplicate. I say that this thread may be interesting for OP. – Tim Aug 08 '17 at 12:23
  • @Tim fair enough. I see how my comment might suggest you had. My apologies, I did not mean it that way. I just wanted to highlight what I think is different (as I'm curious about any answers as well). – IWS Aug 08 '17 at 12:26
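
The order dependence raised in the comments can be illustrated with a small sketch. It assumes just two of the predictors from the question (hp and qsec), and the object names `path_a` and `path_b` are hypothetical. Both orderings end at the same final model, but the intermediate coefficients, and therefore the apparent "change when a variable is added", differ by path.

```r
# Path A adds hp first, then qsec; path B adds qsec first, then hp
path_a <- list(lm(mpg ~ hp, data = mtcars),
               lm(mpg ~ hp + qsec, data = mtcars))
path_b <- list(lm(mpg ~ qsec, data = mtcars),
               lm(mpg ~ qsec + hp, data = mtcars))

# The final models agree regardless of the order of addition
final_a <- coef(path_a[[2]])[c("hp", "qsec")]
final_b <- coef(path_b[[2]])[c("hp", "qsec")]
same_final <- isTRUE(all.equal(final_a, final_b))

# Shift in the hp coefficient when qsec is added (only visible on path A)
hp_shift <- coef(path_a[[2]])["hp"] - coef(path_a[[1]])["hp"]

# Shift in the qsec coefficient when hp is added (only visible on path B)
qsec_shift <- coef(path_b[[2]])["qsec"] - coef(path_b[[1]])["qsec"]
```

Each path reveals only one of the two shifts, so a table built from a single ordering shows just one of several possible histories for each coefficient.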
