
This is probably a very simple question. Let's say we use some metric to remove features, whether that be AIC, regularization like the lasso, variable importance, t-tests, and so on.

Assuming we use the same technique again as we continue to refine the model, would it be safe to assume that any removed features would not be significant again? My understanding is that some features may be correlated with other features, but if that's the case, you really only need to include one of them, especially if they are highly correlated.

Other than that, I can't think of a reason why you would add a feature back into a model once it has been removed.

The reason I am asking is that I am trying to build a large model with many features. Because I have so much data and limited computing resources, I am hoping to build the model sequentially: train the model on, say, 10 features, remove the unimportant ones, then rebuild it by adding in 10 more, remove the unimportant ones, and continue until I reach my computer's capacity.
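
To make the idea concrete, here is roughly the kind of loop I have in mind (purely illustrative; the data, model, and importance cutoff are placeholders, with scikit-learn used only as an example):

```python
# Illustrative sketch of the batched screening idea: fit on a block of
# features, drop the apparently unimportant ones, add the next block, repeat.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))          # placeholder data: 100 candidate features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)

batch_size = 10
importance_cutoff = 0.05                  # arbitrary threshold, for illustration only
kept = []                                 # indices of features kept so far

for start in range(0, X.shape[1], batch_size):
    candidates = kept + list(range(start, min(start + batch_size, X.shape[1])))
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[:, candidates], y)
    # keep only the features whose importance clears the cutoff
    kept = [f for f, imp in zip(candidates, model.feature_importances_)
            if imp >= importance_cutoff]

print("features kept:", sorted(kept))
```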

Are there any issues with that process?

Thanks!

confused
  • [Algorithms for automatic model selection](https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856) – user2974951 Jun 08 '20 at 10:02
  • I'm not really looking for a feature selection method, but wondering whether, once a feature is removed, I can remove it for good and not worry that it'll become important again once other features are added to the model. – confused Jun 08 '20 at 10:06
  • I think without context this is probably impossible to answer. How many features do you have? Do you have enough computing power to perform dimension reduction on the features? – jcken Jun 08 '20 at 10:15
  • I do not, so I was going to build my model up slowly and remove any glaringly unimportant features. For example, if I have 100 features, build a model using 50 and remove the unimportant ones. Then build the same model on the other 50 and remove the unimportant ones. Then combine the leftover features. If I have computing power left over, maybe add back in some of the more significant ones I took out. I may not end up with all the important features, but hopefully I end up with the most important ones. – confused Jun 08 '20 at 10:22
  • I probably have maybe 100 or so, I think, but the problem isn't the number of features but the number of observations. I would prefer to keep as many observations as I can. And unfortunately, I don't think the model I want to build allows for batch processing. – confused Jun 08 '20 at 10:24
  • This is answered in the thread mentioned by @user2974951. By "removing unimportant features" you are doing feature selection. Moreover, as mentioned in the linked thread, you are doing this in a way that is likely to give bad results. – Tim Jun 08 '20 at 10:35
  • Yes, I realize I am doing feature selection; my OP mentioned that. My question is explicitly: once you remove a feature, is there any reason I would need to consider it again? My question was not about which algorithm to use for feature selection. – confused Jun 08 '20 at 10:48
  • There is intuition behind why adding a feature may decrease the importance of another (correlation), but is there intuition for why, once a feature is removed, there would be reason to add it back in? This has nothing to do with a specific way to do feature selection. – confused Jun 08 '20 at 10:55

2 Answers


No, you cannot safely assume that. The reason is that conditional independence does not imply independence and vice versa (wiki).
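
For a concrete toy version of this, here is a small simulation (data and variable names are made up): a feature that looks useless on its own, and would therefore be dropped by a marginal screen, becomes highly significant once another feature enters the model.

```python
# A "suppressor" example: x2 is uncorrelated with y on its own, but together
# with x1 it explains y almost perfectly (y is roughly x1 - x2).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
x1 = signal + noise                  # observed feature: signal plus contamination
x2 = noise                           # the contamination itself; unrelated to y marginally
y = signal + 0.1 * rng.normal(size=n)

# x2 alone: essentially no effect, so a marginal screen would drop it
print(sm.OLS(y, sm.add_constant(x2)).fit().pvalues)

# x1 and x2 together: x2 is now extremely significant
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(full.params)    # roughly (0, 1, -1)
print(full.pvalues)
```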

Moreover, the forward-selection-style approach you describe suffers from a fundamental problem: model-selection criteria like these usually rely on the p-values/t-statistics/... being based on the "correct" underlying model. That, however, can't be true if you do forward selection and a 'correct' feature is only included later in the process. That's why you should usually at least do backward selection, if you do any stepwise selection at all; that way the 'true' model is at least nested in the starting model for the selection.

As has been mentioned in the comments above, there are (much) better ways to do feature selection than a stepwise algorithm. At least try a LASSO approach.
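
For illustration only, here is a minimal LASSO-style selection sketch with scikit-learn (synthetic data and default settings, not a tuned recipe):

```python
# Fit a LASSO on all candidate features at once, with the penalty strength
# chosen by cross-validation, and keep the features with non-zero coefficients.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=500)

X_std = StandardScaler().fit_transform(X)    # the LASSO penalty is scale-sensitive
lasso = LassoCV(cv=5).fit(X_std, y)

selected = np.flatnonzero(lasso.coef_ != 0)
print("selected features:", selected)
```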

Georg M. Goerg
  • +1 From the master: https://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/ "Stepwise regression is...a bit of a joke." – Dave Jun 08 '20 at 11:25
  • So even as you get to more complicated models beyond regression, say boosted trees and NNs: suppose you have a feature set of 100, and it has been reduced to, let's say, 80 through some feature-selection technique, and you have a final model. Then you think of a new feature, or get access to a new feature, that you want to add to the model. You would want to re-create the model with 101 features, as opposed to 81, and then do feature selection all over again? – confused Jun 08 '20 at 11:34
  • Yes. Feature 101 might be the one that makes feature 87 the stronger, conditionally dependent variable (e.g., say the true relationship is $u = f(x_{87}, x_{101})$). If you are limited in practice by time/computational constraints and you are following iterative development, then I'd suggest at least every once in a while running the full model again, to make sure you are not a victim of this conditional-dependence behaviour and missing out on previously dropped features. – Georg M. Goerg Jun 08 '20 at 11:44
  • Ok thanks, that makes sense now. – confused Jun 08 '20 at 11:57

You seem to be assuming that models work in an additive fashion, so that adding a feature to the model just "adds" some stuff related to that feature alone and does not influence the rest of the model, and the same with removing a feature. That is not the case. If machine learning models worked like this, then to build a model with $k$ features you would only need to build $k$ single-feature models and find a way of combining them. Here you can find a recent thread, with links to many other questions like it, where including a new feature in a regression model affects how the model uses the other features. This happens for linear regression, but it will also be true for other machine learning algorithms.
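
As a tiny, made-up illustration of this point: adding a second, correlated feature changes the coefficient the model assigns to the first one.

```python
# The coefficient on x1 changes once x2 enters the model, because the model
# re-distributes the explained variation between the two features.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                 # correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

fit_x1_only = sm.OLS(y, sm.add_constant(x1)).fit()
fit_both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("x1 coefficient, x1 only:  ", fit_x1_only.params[1])   # roughly 1 + 2 * 0.8 = 2.6
print("x1 coefficient, x1 and x2:", fit_both.params[1])      # roughly 1.0
```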

You say that you would assume this to be an issue "only" when the variables are correlated, but with real-life data there will always be some degree of correlation between the variables. Moreover, it is not only about correlations between pairs of variables, but also about relations among all of the variables, and those relations can be non-linear as well. You should rather be talking about independence, and seeing all the variables independent is even less likely than seeing them all uncorrelated.

More than this, by adding a new feature your algorithm needs to adapt. Imagine that you have a decision tree with the constraint of having at least five samples in each terminal node. You cannot just add a new feature to such a tree without re-building it, because a terminal node that is already at that limit cannot be split any further. In such a case, you would need to re-build the whole tree, using different splits, or combinations of splits, so it would use your data in a different way than the initial tree. This would be the case even if the new feature you are adding were independent of the other features.
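
A toy version of this, using scikit-learn's decision tree with a minimum leaf size of five on made-up data (the depth limit is only there to keep the printout short):

```python
# Refitting the tree after adding one feature does not give "the old tree plus
# a few extra splits": the whole structure changes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
n = 200
x_old = rng.normal(size=(n, 2))
x_new = rng.normal(size=n)                               # the feature being added
y = x_old[:, 0] + np.where(x_new > 0, 2.0, -2.0) + 0.1 * rng.normal(size=n)

tree_before = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(x_old, y)
tree_after = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5,
                                   random_state=0).fit(np.column_stack([x_old, x_new]), y)

print(export_text(tree_before, feature_names=["x0", "x1"]))
print(export_text(tree_after, feature_names=["x0", "x1", "x_new"]))
```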

What you propose is also partially answered in the [Algorithms for automatic model selection](https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856) thread, which discusses stepwise feature-selection algorithms; the most upvoted answer there shows how proceeding in such a stepwise fashion, adding (or removing) variables, leads to ending up with bad models. It simply doesn't work, for the reasons discussed above.

Tim
  • Good point on independence/dependence being a better word to describe relationships between predictors than correlation. – confused Jun 09 '20 at 06:28