
So I have a data set consisting of roughly 100 independent variables and one dependent binary variable (the outcome), with roughly 1000 cases per variable. Obviously I wish to fit a logistic regression to this to get a predictive model, and I know there are several ways to reduce the number of variables, e.g. LASSO, VIF, even PCA. The problem is that I need to report the logistic equation so it can be used by others, and I'm not sure PCA is good in that regard, since it creates one or more 'new variables' depending on the number of components needed. Or have I misunderstood PCA?

In addition, roughly 50 of the variables are highly collinear, and I know that only one of them needs to be present, or none (if they don't make the model better). So it's not like it would make sense to have 2 or 3 of these variables in the model, since they represent something similar. Would it then be an idea, in order to at least reduce these variables, to first settle on the other variables, then create 50 different models, each containing one of the collinear variables, compare e.g. the AIC values, and pick the best one? Or is that the wrong approach?

Denver Dang
  • Am I correct that you have values for all 100 independent variables for all 1000 cases? What is the breakdown of outcome classes among the 1000 cases? Will others who use the model have access to values of all 100 independent variables for predicting future cases? – EdM Oct 09 '19 at 17:56
  • Yes, I have values for all IVs. I think there was one IV with 5 NAs, which I have imputed. Regarding the last question, I'm not fully sure I understand what you mean? – Denver Dang Oct 09 '19 at 17:58
  • You have 2 classes, say A and B. How many of the 1000 cases fall into each class? The answer might differ between a 50/950 and a 500/500 breakdown. With respect to my last question, you want to build a model "so it can be used by others." If some of your predictor variables might not be readily available to others who would like to use your model (because of cost, difficulty of assessment, etc), you might want to consider that as a criterion in predictor selection. – EdM Oct 09 '19 at 18:03
  • Ah, you mean the frequency of the outcome? It's around 150 out of the 1000. And yes, your last point is duly noted. – Denver Dang Oct 09 '19 at 18:06

1 Answer


One possibility that you didn't mention is to use ridge regression. That has a conceptual relationship to principal components regression (PCR), but instead of making an all-or-none choice of which principal components to include it weights the principal components differentially to avoid overfitting. The result is a separate penalized ridge regression coefficient for each of your original predictors. See page 79 of ESLII for details of this relationship between ridge and PCR. Correlated predictors are handled well by ridge regression as each set of such predictors tends to be contained in the same principal components. If your main interest is in prediction and all of your predictors will be readily available in the future, ridge has the advantage of not throwing away any potentially useful information.
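To make the ridge idea concrete, here is a minimal sketch in Python with scikit-learn (my choice of library, not something from the thread), using synthetic data shaped roughly like yours: ~1000 cases, ~100 predictors, a minority class around 15%. The key point it illustrates is that ridge returns a penalized coefficient for every original predictor, so you can still report a full logistic equation.

```python
# Illustrative sketch only; library, parameters, and data are my assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data: ~1000 cases, 100 predictors,
# minority class around 15%.
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, weights=[0.85],
                           random_state=0)

# Standardize first so the L2 penalty treats all predictors on the same
# scale; C is the inverse of the penalty strength and would normally be
# tuned by cross-validation.
ridge = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", C=1.0, max_iter=5000))
ridge.fit(X, y)

# One penalized coefficient per original predictor: nothing is thrown away.
coefs = ridge.named_steps["logisticregression"].coef_.ravel()
print(coefs.shape)
```

Note that the coefficients are on the standardized scale; to report an equation in the original units you would divide each coefficient by the corresponding predictor's standard deviation.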

LASSO represents the other extreme, selecting a subset of predictors while it penalizes their coefficients to avoid overfitting. From your data, with about 150 members of the smaller class, it might select about 10 predictors if you choose the penalty value that minimizes the cross-validation deviance (an appropriate measure of logistic regression quality unlike, say, accuracy). From among a set of correlated predictors it will tend to choose one or a few most strongly associated with outcome in your particular data set, so you will notice some instability in the set of predictors selected if you repeat LASSO on multiple bootstrap samples of your data. That doesn't necessarily pose a problem with respect to predictions, as choosing any of those correlated predictors might do about as well, but you should be aware that the predictors chosen aren't necessarily the "best" in any general sense.
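A hedged sketch of the LASSO version, again with scikit-learn on synthetic data (the thread itself names no software): the penalty is chosen by cross-validated log loss, which is proportional to the deviance, rather than by accuracy.

```python
# Illustrative sketch only; library and parameter choices are my assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, weights=[0.85],
                           random_state=0)
Xs = StandardScaler().fit_transform(X)

# Pick the L1 penalty by cross-validated log loss (proportional to the
# deviance), not accuracy; liblinear supports the L1 penalty.
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear",
                             Cs=20, cv=5, scoring="neg_log_loss",
                             max_iter=5000)
lasso.fit(Xs, y)

# Predictors with nonzero coefficients are the ones LASSO "selected".
selected = np.flatnonzero(lasso.coef_.ravel())
print(len(selected))
```

Re-running the fit on bootstrap resamples of `(X, y)` and comparing the `selected` sets would show the instability among correlated predictors described above.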

Elastic net combines LASSO and ridge regression in a way that might work well with your data set.
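In the same hedged scikit-learn sketch, the elastic net is just a different penalty choice; `l1_ratio` interpolates between ridge (0) and LASSO (1), and both it and `C` would normally be tuned by cross-validation.

```python
# Illustrative sketch only; penalty settings here are arbitrary assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, weights=[0.85],
                           random_state=0)
Xs = StandardScaler().fit_transform(X)

# saga is the scikit-learn solver that supports the elastic-net penalty;
# l1_ratio=0.5 mixes the L1 and L2 penalties equally.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.1, max_iter=10000)
enet.fit(Xs, y)

# Some coefficients may be shrunk exactly to zero (the LASSO part),
# while the rest are shrunk but retained (the ridge part).
n_kept = np.count_nonzero(enet.coef_)
print(n_kept)
```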

Your idea of comparing 50 different models, each including only one of the 50 highly correlated predictors, and then choosing the best one, would tend to give a result highly dependent on your particular data set, and the final model would not incorporate the fact that you used the data itself to select the predictor. Thus your model would tend to overfit and might not work well on other data samples. The penalization of coefficient values imposed by LASSO, ridge, or elastic net provides a better choice.

Finally, a warning about any of these approaches when you have categorical predictors, not just continuous predictors. PCR, ridge, LASSO, etc typically normalize predictors at the start so that the original scale of measurement (e.g., miles versus millimeters) doesn't influence the result. But what is the best way to "normalize" a binary predictor variable, or a multi-level categorical variable? Your knowledge of the subject matter might need to come into play with respect to that issue. See this page and its links for further discussion.
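The point that normalization is optional per predictor can be sketched as follows (again in scikit-learn, as an assumed example): standardize only the continuous columns and let binary indicators pass through on their natural 0/1 scale.

```python
# Illustrative sketch only; column layout and library choice are my assumptions.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy design matrix: three continuous predictors plus two 0/1 indicators.
X = np.column_stack([rng.normal(size=(1000, 3)),
                     rng.integers(0, 2, size=(1000, 2)).astype(float)])

# Standardize only columns 0-2; remainder="passthrough" leaves the binary
# indicators (columns 3-4) untouched, on their natural 0/1 scale.
pre = ColumnTransformer([("scale", StandardScaler(), [0, 1, 2])],
                        remainder="passthrough")
Xt = pre.fit_transform(X)

print(np.allclose(Xt[:, 3:], X[:, 3:]))  # binary columns unchanged
```

The whole transformer can then be placed in front of a penalized logistic regression in a pipeline, so the same selective normalization is applied consistently when refitting on bootstrap samples.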

EdM
  • I really appreciate this. I do however have no idea how to proceed with the categorical problem, because there are predictors like: "Did this patient have this pre-treatment: yes or no?" I can't really think of any way this can be normalized. But maybe that's just how it is... – Denver Dang Oct 09 '19 at 19:20
  • @DenverDang you are in very good company on this problem with categorical predictors; many just ignore it. Note that normalization, although often the default, is not required; software programs typically allow you to specify predictors that are not to be pre-normalized. You might want to try both ways for important binary predictors, repeating model-building on multiple bootstrap samples while testing model performance on the full original data set. – EdM Oct 09 '19 at 19:31