
I have a regression model that looks like the following:

glm.nb(formula = y ~ Gender + Age + x1 + x2 + x3, data = df)

In my problem, there are 20 possible choices of variable for x1, 20 possible choices for x2, and 20 possible choices for x3. Gender and Age must be in the model. This leaves me with 20*20*20 = 8,000 possible regressions. I was able to write a program that ran all of these regressions and returned the model with the lowest AIC, but I was wondering whether there is a library that already does this.
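
For reference, here is a minimal sketch of the kind of brute-force loop I wrote (the vectors x1_vars, x2_vars, and x3_vars holding the candidate column names are placeholders):

    library(MASS)  # provides glm.nb

    ## Brute-force search over all 20*20*20 combinations.
    ## x1_vars, x2_vars, x3_vars are placeholder character vectors holding the
    ## 20 candidate column names for each slot; y, Gender and Age are columns of df.
    combos <- expand.grid(x1 = x1_vars, x2 = x2_vars, x3 = x3_vars,
                          stringsAsFactors = FALSE)

    aics <- apply(combos, 1, function(v) {
      f <- reformulate(c("Gender", "Age", v), response = "y")
      AIC(glm.nb(f, data = df))
    })

    combos[which.min(aics), ]  # combination giving the lowest AIC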

I do not consider what I will find to be the "best" model in any statistical sense, but I do find this exercise useful for exploring my data.

I have already tried bestglm and leaps. I do not believe these packages allow for specifying the choice of variable from multiple buckets of variables.

  • That is data dredging. What if you identify the best model purely because of some feature of the particular sample of data you collected? – Gavin Simpson Apr 12 '13 at 16:07
  • I agree with @GavinSimpson, and fail to see how this is helping you explore your data meaningfully (although I do think data exploration is very important and generally undervalued). To better understand the problems with the approach you're using, it may help you to read my answer here: [algorithms-for-automatic-model-selection](http://stats.stackexchange.com/questions/20836//20856#20856). – gung - Reinstate Monica Apr 12 '13 at 16:49
  • Gavin: are your $x_1, x_2, x_3$ drawn from the same 20 variables or from 20 distinct variables each? In response to others' comments: the p-values, and consequently the model coefficients, will not have any useful interpretation after model selection, but for exploratory analyses model selection is often used as a hypothesis-generating mechanism. There is still the risk of pursuing false-positive associations, as in inference, though. – AdamO Apr 12 '13 at 16:54
  • I would need to be convinced that this is sensible. *Why* are you forced to include 1 and only 1 variable from each of three sets of variables? – Peter Flom Apr 12 '13 at 18:32
  • Here's an exercise you should try. Generate random samples for the 20 variables for each of the three positions (but with roughly the same mean and variance as would be typical for the originals, if you like). Now do the same thing as you're proposing: try the 20x20x20 combinations in order to optimize the AIC. Repeat the whole exercise a number of times (100 at least; you'll want to automate the whole thing). Since the generated variables should come out essentially unrelated to the response, what do you notice about the actual outcome of such a procedure? – Glen_b Jul 15 '13 at 01:43 (a scaled-down sketch of this exercise follows the comment thread)
  • I agree with @PeterFlom; it seems strange that you are forced to include only one variable from each of those three sets. Why not throw everything in there and do some kind of shrinkage analysis, such as ridge regression or the lasso? – bdeonovic Feb 12 '14 at 13:15

2 Answers


Since you have three categorical variables with $20$ categories each, plus gender and age, you have a total of $3 \times 19 + 1 = 58$ binary variables and one continuous variable. If you are willing to proceed without interaction effects, that gives you a model with $60$ coefficients (including an intercept term). That is a relatively manageable number of terms; with a reasonably sized data set you should have an adequate number of residual degrees of freedom and should get reasonable estimates of your parameters.
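
For concreteness, here is a minimal sketch of that single fit, assuming x1, x2 and x3 are coded as factors with 20 levels each (and Gender as a two-level factor) in df:

    library(MASS)  # provides glm.nb

    ## One model containing every dummy at once, rather than 8,000 separate fits.
    fit <- glm.nb(y ~ Gender + Age + x1 + x2 + x3, data = df)
    length(coef(fit))  # 60: intercept + Gender + Age + 3 * 19 factor dummies
    summary(fit)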

Your problem emerges if you decide you want to include interaction effects between the categorical variables. Adding the two- and three-way interactions between your three categorical variables gives you $20^3 = 8000$ parameters for those variables instead of $57$. (If you also interact with gender and age, this quadruples again.) That is a large number of parameters, and you would need a large amount of data to retain an adequate number of residual degrees of freedom. Even if you have sufficient data for this, it is dubious to cherry-pick interaction terms in the manner you are describing; that is a classic example of data-dredging. Instead, you should either include the whole set of interaction terms for a categorical interaction or remove the whole set. It is not legitimate to cherry-pick individual interaction terms from among the indicator variables of a broad categorical variable.
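
If you do want to assess interactions, the legitimate comparison is between whole models, for example (a sketch, under the same assumptions as above):

    library(MASS)

    ## Either no interactions among the three categorical variables ...
    fit_main <- glm.nb(y ~ Gender + Age + x1 + x2 + x3, data = df)
    ## ... or the full set of two- and three-way interaction terms.
    fit_int  <- glm.nb(y ~ Gender + Age + x1 * x2 * x3, data = df)
    AIC(fit_main, fit_int)  # compare complete models, not cherry-picked terms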

Ben

If you have 60 possible covariates and just want to use the model to make predictions, without being too concerned about interpretability, you might build a random forest on a training set of your data and see what kind of predictive power you can get from the model it builds. The randomForest package in R can help you with this.
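
A rough sketch of that approach (the 80/20 split and other settings are arbitrary; it assumes y and all candidate covariates, plus Gender and Age, are columns of df):

    library(randomForest)

    set.seed(1)                                  # reproducible train/test split
    train <- sample(nrow(df), floor(0.8 * nrow(df)))
    rf <- randomForest(y ~ ., data = df[train, ], importance = TRUE)
    pred <- predict(rf, newdata = df[-train, ])  # predictions on the held-out rows
    varImpPlot(rf)                               # which covariates the forest relies on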

user27008
  • As far as I know, randomForest in R *doesn't* fit a negative binomial model. – Gavin Simpson Apr 13 '13 at 01:24
  • Thank you for all your responses. While I agree with your concerns about the statistical issues with this sort of exploration, I was tasked with this request, and that is the primary reason I am pursuing it. To be clearer: each of these buckets contains a different type of variable, and it is necessary, for non-predictive reasons, that all of them be included in the model. – user1634075 Apr 15 '13 at 18:22