
I'm running a multinomial logistic regression and I'm torn about which variable selection method to apply...

The methods I know are backward elimination, forward selection, and chi-square feature selection, and that's pretty much it. But I haven't found a general consensus or rule that explains why backward elimination would be better or worse than chi-square selection, for example. I'm not asking only about those; if you have others to recommend, they're welcome.

With this I'm looking to predict the probabilities of a 3-class nominal outcome. The independent variables are the classic demographics, ranging from continuous to nominal (age, income, years at job, gender, residential status), plus some variables recording how long (in months) the client has been in a certain state. Out of 17,316 observations in total, 4,599 belong to class 0, 976 to class 1, and 11,741 to class 2.

I was wondering if anyone had any resource that could help me with this?

amestrian
  • Please look at the extensive discussion on [this page](https://stats.stackexchange.com/q/20836/28500) about problems with variable selection in general. Then please say more about the specific situation with _your_ multinomial logistic regression. How many predictors, how many classes, how many cases in each class, and how you intend to use the model (for example, will the model be used for prediction) matter a lot with respect to putting together a useful answer. In some situations you might not even need to do variable selection. Please provide the extra information by editing your question. – EdM Aug 09 '20 at 20:41
  • Does this answer your question? [Algorithms for automatic model selection](https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection) – kurtosis Aug 09 '20 at 20:43
  • Hi - relevant question on feature selection. Take a look at the links: [feature selection](https://stats.stackexchange.com/questions/477601/should-feature-selection-always-give-observable-patterns/477645#477645) and [peaking - with too many features](https://stats.stackexchange.com/questions/25524/feature-selection-for-the-text-mining/477321#477321) – Match Maker EE Aug 09 '20 at 20:47
  • @EdM edited the original post to include a bit more info on my work :) Even though the page you linked is partly useful, it didn't really help with my underlying problem... I just know I _definitely_ shouldn't use stepwise selection. If you had some info on other selection methods it would be great. – amestrian Aug 09 '20 at 23:50

1 Answer


If your interest is prediction, then there is seldom a need to select features unless your model is in danger of overfitting. Why just throw away all the information available from a feature by removing it from your prediction model?

Even if your model is in danger of overfitting because of a low ratio of cases to predictors, a good strategy can be to keep all the features while penalizing them in some way to minimize overfitting. Ridge regression, which avoids overfitting by down-weighting regression coefficients, is directly applicable to a multinomial logistic regression model. Methods that learn slowly, like boosted trees, serve a similar function and take advantage of all the available information.
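One concrete way to do this (not the only one) is ridge-penalized multinomial regression via the `glmnet` package, which the answer's reasoning points to but does not name; here is a minimal sketch, assuming a data frame `dat` whose outcome and predictor column names are illustrative placeholders rather than anything from the question:

```r
library(glmnet)

# `dat` and its column names are hypothetical placeholders for the questioner's data.
X <- model.matrix(~ age + income + years_at_job + gender + residential_status,
                  data = dat)[, -1]          # predictor matrix, intercept column dropped
y <- dat$outcome                             # factor with levels 0, 1, 2

# alpha = 0 gives the ridge penalty: coefficients are shrunk toward zero but no
# predictor is dropped; cv.glmnet picks the penalty strength by cross-validation.
cvfit <- cv.glmnet(X, y, family = "multinomial", alpha = 0)

# Predicted class probabilities at the cross-validated penalty.
probs <- predict(cvfit, newx = X, s = "lambda.min", type = "response")
```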

For classification schemes like logistic or multinomial regression, a useful rule of thumb to avoid overfitting without penalization is to have at least 15 or so cases in the smallest class per predictor you are evaluating. With over 900 cases in your smallest class, you probably have no need for predictor selection or penalization at all unless you have more than about 60 predictors total (including levels of categorical predictors above the first, and interaction terms).
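As a rough check of that heuristic against the numbers in the question (a sketch of the arithmetic, not a hard rule):

```r
# ~15 cases in the smallest class per candidate predictor (heuristic, not a hard rule)
n_smallest_class <- 976
n_smallest_class / 15   # roughly 65, hence "about 60" candidate predictors,
                        # counting dummy levels and interaction terms
```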

In situations in which you need to cut down on features rather than penalize them, it's best to use knowledge of the subject matter to pre-select features or to combine multiple related features into a single feature before looking at outcomes. Frank Harrell's class notes and book provide a wealth of information on such strategies. Of the feature-selection approaches noted in the question, Harrell does say (page 4-48, class notes):

> Do limited backwards step-down variable selection if parsimony is more important than accuracy. But confidence limits, etc., must account for variable selection (e.g., bootstrap).

So in that context backward elimination is the least objectionable, as you are taking into account all the available information with a more comprehensive model before you decide to start throwing information away. But it's often best to avoid outcome-driven feature selection in the first place.
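If parsimony really is required, Harrell's `rms` package illustrates the workflow he describes: `fastbw()` performs the limited backward step-down, and `validate(..., bw = TRUE)` repeats that step-down inside each bootstrap resample so the optimism estimates account for the selection. Note that `rms::lrm` fits binary or ordinal logistic models rather than multinomial ones, so the sketch below, with placeholder variable names, only shows the general pattern:

```r
library(rms)

# Placeholder formula and data frame; lrm() handles binary/ordinal outcomes and is
# used here only to illustrate the fastbw()/validate() pattern.
fit <- lrm(binary_outcome ~ age + income + years_at_job + gender + residential_status,
           data = dat, x = TRUE, y = TRUE)   # x, y stored for validate()

fastbw(fit)                        # limited backward step-down for a parsimonious model
validate(fit, B = 200, bw = TRUE)  # bootstrap validation that redoes the step-down in
                                   # every resample, so optimism reflects the selection
```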

EdM
  • So basically in my case it would be okay not to do any feature selection at all? But then I suppose the variables that are not significant, according to their p-value, should be removed, is that correct? – amestrian Aug 11 '20 at 19:43
  • also, what if I do feature selection and then apply regularization to the "final" regression? Would that make sense? Or maybe I should just apply the regularization at once? – amestrian Aug 11 '20 at 19:45
  • @amestrian I agree that you might not need feature selection at all _in this case_ unless there are a large number of interactions. There is no need to remove predictors that are "not significant" from a model. Their presence might well be helping to document the "significance" of the other predictors. That's particularly true if your goal is prediction: keep all information you can while avoiding overfitting. If you selected features based on subject-matter knowledge and still fear overfitting, regularization would help. If you're not overfitting, there might be little need. – EdM Aug 11 '20 at 20:13
  • Okay, maybe not remove those insignificant variables and run the regression again with just the others, but when I "input" new cases (or a holdout sample) I need to leave those coefficients out because they're not significant, right? – amestrian Aug 11 '20 at 21:14
  • @amestrian there's no need whatsoever to remove "not significant" predictors from a model that has been validated. "Not significant" can say more about the training sample size than about the importance of the predictor as part of an overall model. For prediction on a holdout sample or on new cases, keep all the predictors unless there is overfitting. If there's overfitting, redo the model e.g. with penalization. See extensive discussion [here](https://stats.stackexchange.com/q/66448/28500). – EdM Aug 11 '20 at 22:22
  • what do you mean by "validated"? Sorry I keep asking questions about this, I was taught my whole university education that I had to remove the non-significant variables from regressions so it's hard for me to wrap my head around it – amestrian Aug 12 '20 at 16:26
  • Oh, and you mentioned something about interactions: if I add a few, will the same principle apply to the interactions as to the other variables? – amestrian Aug 12 '20 at 17:29
  • @amestrian in terms of dangers of overfitting, removing "insignificant" predictors and so forth, each interaction is essentially just another predictor so the principles are the same. You shouldn't remove an individual "fixed" effect when you include it with an interaction, though. The apparent "significance" (in terms of difference from zero, the usual reported test) of an individual effect also in an interaction can depend on how the interacting variable is coded. See [this page](https://stats.stackexchange.com/a/417159/28500) for example. – EdM Aug 12 '20 at 19:24
  • @amestrian validation is estimating how well your model might work on a new sample. With a fairly large data set like yours, you might set aside 1/3 of your data as a test set, develop the model on the rest of the data (training set), and then see how well it performs on the held-out test set. If the model is properly fit, performance on the test set should not be much worse than what you had on the training set. Bootstrapping and cross-validation are other approaches. See the Harrell references linked in the answer, and his `rms` package in R, which provides validation and calibration tools. – EdM Aug 12 '20 at 19:32
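A minimal sketch of that kind of holdout check, assuming a data frame `dat` with the same placeholder column names as above and using `nnet::multinom` (one common choice, not mentioned in the thread) for the multinomial fit:

```r
library(nnet)

# `dat` and its column names are hypothetical placeholders for the questioner's data.
set.seed(1)
n <- nrow(dat)
test_idx <- sample(n, size = floor(n / 3))   # hold out ~1/3 of the data as a test set
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

fit <- multinom(outcome ~ age + income + years_at_job + gender + residential_status,
                data = train, trace = FALSE)

# Predicted class probabilities on the held-out data.
probs <- predict(fit, newdata = test, type = "probs")

# Multinomial log-loss on the test set; compare with the same quantity on the
# training set -- a much worse test value suggests overfitting.
idx <- cbind(seq_len(nrow(test)), as.integer(test$outcome))
-mean(log(probs[idx]))
```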