Splitting large number of variables into three separate logistic models for variable selection

Question

Is it appropriate to split 100+ variables into three groups, run each group through a separate decision tree, and then feed the newly created features into their own separate logistic models to help determine the most significant features for a final model? An example outcome is the likelihood that you or your girlfriend will buy something at either of two stores.

Further, after fitting the first three logistic models and identifying all of the significant features from them, is it correct to then use those features in three more logistic models that have somewhat different binary events? For example: (a) you or your girlfriend buy something at the first store, (b) you buy something at the second store, (c) your girlfriend buys something at the second store. It's important to note that the first store has different attributes from the second store: it is much smaller, in a different location, etc.

Does this method of variable selection introduce bias, and could it lead to overfitting? It seems incorrect to me. I feel like omitted-variable bias is one issue that can arise.

score 1 · Answer 1 · answered Jun 06 '19 at 04:06

Omitted-variable bias is an important issue in logistic regression, as noted on this page with an analytic result presented for the related probit regression. Omitting any predictor associated with the outcome will bias the coefficient estimates for the other predictors, even if the omitted predictor is uncorrelated with the included ones. So best practice is to include all predictors reasonably expected to be related to the outcome as you develop your model.
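To make the point concrete, here is a minimal simulation sketch, not part of the original answer. It assumes two independent standard-normal predictors `x1` and `x2` with made-up true coefficients (1.0 and 2.0) and uses a small hand-written Newton-Raphson fitter (`fit_logistic`, a hypothetical helper) so that it runs with the standard library alone. It fits the model once with both predictors and once with `x2` omitted, then compares the estimated coefficient on `x1`.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fit_logistic(X, y, iters=25):
    # Maximum-likelihood fit by Newton-Raphson.
    # Each row of X must already include a leading 1.0 for the intercept.
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * row[a]
                for c in range(k):
                    hess[a][c] += w * row[a] * row[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta


# Assumed data-generating process (illustrative values only).
random.seed(0)
n = 20000
b0, b1, b2 = 0.0, 1.0, 2.0
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]  # drawn independently of x1
y = [1 if random.random() < sigmoid(b0 + b1 * a + b2 * c) else 0
     for a, c in zip(x1, x2)]

full = fit_logistic([[1.0, a, c] for a, c in zip(x1, x2)], y)
reduced = fit_logistic([[1.0, a] for a in x1], y)  # x2 omitted

print("x1 coefficient, both predictors:", round(full[1], 3))
print("x1 coefficient, x2 omitted:     ", round(reduced[1], 3))
```

Under these assumptions the full model should recover a coefficient near the true value of 1.0, while the model that omits `x2` should give a visibly smaller estimate, even though `x1` and `x2` were generated independently. In ordinary linear regression, omitting an uncorrelated predictor would not bias the remaining coefficient in this way.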

EdM

Thanks. It also seems very confusing to use features selected by fitting the entire binary event when they will then be used to fit a subset of that event, especially when that subset is a small fraction of the original training data and comes from a different population than the whole. – shj997 Jun 06 '19 at 04:25
