Variable Selection on an imbalanced data set

Question

Suppose I want to perform variable selection on a highly imbalanced data set. Do I have to balance the data set, either by downsampling the majority class or by upsampling the minority class, before I apply the variable selection method?

Of likely interest: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he — Dave, May 27 '21 at 12:06
@Dave bear in mind that the example given there does not reveal a class imbalance problem (and hence is a bit misleading) because the dataset is far too large for imbalance to be an issue. It is a shame it is closed to answers. Class imbalance is an estimation problem, and vast amounts of data resolve most estimation problems. — Dikran Marsupial, Sep 27 '21 at 17:13
@teotjunk what is the reason for performing feature selection? If it is to improve generalisation performance, I would strongly recommend against it, and suggest just using regularisation instead. — Dikran Marsupial, Sep 27 '21 at 17:14
This post ignores hundreds of relevant posts on the site. Stepwise variable selection is invalid. Removal of high quality data in order to "balance" is invalid. — Frank Harrell, Jan 29 '22 at 23:35

score 0 · Answer 1 · answered Nov 23 '19 at 12:32

I know of no general guidelines that link variable selection with narrow distributions (I believe that is a helpful term for a single variable that shows a marked imbalance, within a larger dataset). Decisions about the best ways to address narrow distributions are often predicated on the nature of the research and the analytic methods one aims to use. For example, Gary King has written extensively on the use of logistic regression in the presence of binary outcomes that involve "rare events." There are other common guidelines involving case-control studies. The literature on survey research contains much material on sample weights. And so on.
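
To make the rare-events and case-control point a little more concrete, here is a minimal sketch of the "prior correction" of the intercept described in King and Zeng's work on rare events. It assumes a correctly specified logistic model fitted on a sample where the majority class was downsampled: under that assumption the slope estimates remain consistent and only the intercept needs adjusting. The function names and all numbers below are hypothetical, chosen purely for illustration.

```python
import math


def prior_corrected_intercept(b0_sample, tau, ybar):
    """Adjust a logistic-regression intercept fitted on an outcome-sampled data set.

    b0_sample: intercept estimated on the resampled (e.g. balanced) data
    tau:       fraction of events in the population (assumed known)
    ybar:      fraction of events in the resampled data
    """
    if not (0 < tau < 1 and 0 < ybar < 1):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    correction = math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
    return b0_sample - correction


def predicted_probability(intercept, slope, x):
    """Logistic prediction for a single predictor value x."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))


# Hypothetical fit on a 50/50 downsampled data set; the population event rate is 1%.
b0_balanced = -0.20
slope = 0.80
tau = 0.01
ybar = 0.50

b0_corrected = prior_corrected_intercept(b0_balanced, tau, ybar)
p_naive = predicted_probability(b0_balanced, slope, 1.0)
p_corrected = predicted_probability(b0_corrected, slope, 1.0)

print(f"corrected intercept: {b0_corrected:.3f}")
print(f"P(y=1 | x=1), uncorrected: {p_naive:.3f}")
print(f"P(y=1 | x=1), corrected:   {p_corrected:.4f}")
```

The sketch only shows that balancing changes the baseline probability the model reports. It says nothing about which variables a selection procedure would pick, and that still depends on your goals and method.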

Variable selection (feature selection) is its own extremely broad and often contentious sub-field within statistics or machine learning. You'll find hundreds if not thousands of related posts on this site. If you explore them you'll come to find that a "do I have to?" stance is better replaced by one of "What approach will best achieve my goals? (which you'll need to specify)" or "What are the likely consequences of trying ___ ?"

@Frank Harrell - If my post, which mentions hundreds of others, ignores them, then so does yours. You also comment on my post calling invalid two things that I never mentioned. — rolando2, Jan 28 '22 at 13:54
Oh, ok, thanks. Suppose we delete these comments and you repost one above, just after the original, to refer explicitly to it. — rolando2, Jan 29 '22 at 15:05
