I am working on a binary classification problem where one of the most interesting features has a distribution that looks more or less bimodal. Here is the distribution plot of that feature:

[Figure: distribution plot of the feature]

The two modes seem to correspond to the two classes. When I look at the distribution of this feature for each class separately, this is what I get:

[Figures: distribution of the feature for each class]

Clearly, one of them looks more like a log-normal distribution and the other more like a normal distribution, and the two peaks in the original distribution seem to correspond to the two classes. My question is: how do I deal with this kind of bimodality in logistic regression? Also, would other machine learning algorithms be more suitable for this kind of problem?

matttree
  • What does it mean to “deal with” the feature? Bimodality of the distribution isn’t an obstacle for logistic regression. – Sycorax Dec 17 '21 at 20:25
  • The features in a logistic regression do not have distribution assumptions (except that constant features are unhelpful), so what problem do you see with your bimodal feature? – Dave Dec 17 '21 at 20:29
  • If you know the two classes then you can incorporate them into your logistic regression – Henry Dec 17 '21 at 20:50
  • @Henry I take that comment to mean that the two classes are the classes being predicted. – Dave Dec 17 '21 at 20:55
  • @Dave you may be correct. I had guessed these were credit scores, that the two classes were something like "does not have a formal job" and "has a formal job" and the prediction was "will default" or "will not default" – Henry Dec 17 '21 at 20:58
  • @Henry These are credit scores, which is the feature of a classification problem like "will default" or "will not default". – matttree Dec 17 '21 at 21:02
  • Nothing there looks lognormal to me. There is some fine structure in the distributions, which is probably secondary. – Nick Cox Feb 05 '22 at 15:28
  • What would you do with dummy variables? The variables with zero and ones values. You use them in logistic regression. – Aksakal Feb 05 '22 at 17:33
  • @Aksakal I think I get that your point is not to worry about features lacking normal distributions, but I’m not sure that’s clear to someone who doesn’t already know not to be concerned about a lack of feature normality. – Dave Feb 05 '22 at 17:45
  • It puzzles me why people think features should be normally distributed. It is not implied anywhere in regression – Aksakal Feb 05 '22 at 18:30

1 Answer

Stick that feature in your regression like you would any other feature. Logistic regression makes no assumption that the features have a particular distribution. I suspect this misconception comes from the same confusion that people have about OLS linear regression.
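To make that concrete, here is a minimal sketch on simulated data. It is not the asker's data: the class means and spreads (600 ± 40 and 720 ± 30, loosely imitating credit scores), the sample size, and the variable names are all assumptions chosen for illustration. The pooled feature is clearly bimodal, yet a plain logistic regression fits without any special handling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: each class has its own mode, so the pooled feature is bimodal.
rng = np.random.default_rng(0)
n = 2000
x_class0 = rng.normal(600, 40, n)  # assumed distribution for class 0
x_class1 = rng.normal(720, 30, n)  # assumed distribution for class 1

X = np.concatenate([x_class0, x_class1]).reshape(-1, 1)
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling only helps the optimizer; it is not a distributional requirement.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

The model only describes how the probability of the class changes with the feature; it says nothing about how the feature itself is distributed.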

Since the distribution of that feature for each $y$ category is not just a shift in location, you might benefit from using some nonlinear functions of this feature, such as a spline, but the distributions are so different that I expect this feature to give considerable discriminative ability on its own.

Dave