Let's say I have two variables, X and Y, and want to build the model Y ~ X. Y can be a boolean or a continuous variable; X is continuous. I want to turn X into a categorical variable, either by making it a boolean or by binning it.

However, I do not want to bin it arbitrarily (based on even intervals, quantiles, etc.) or just pick a single threshold (to create a boolean). Instead, I want stats or ML to pick the thresholds for me.

I've heard you can look at inflection points using a partial dependence graph, but that doesn't make sense to me as I don't want to spend the time building a complex machine learning model just to determine thresholds.

Is there a technique based on ANOVA that we could use to do this? For example, find the threshold that maximizes the difference in group means. This should automatically factor in the sample size of the two groups (to get statistical significance), and if that doesn't work, I'll just set a constraint on the sizes of the two groups.

For example, let's say I just want to divide X into two categories. This is basically a one-way ANOVA problem: I want to find the two categories so that the statistical significance of the difference in group means is maximized, i.e. find the threshold that maximizes the F-value. Is this even a statistically sound method to use? Is there a name for this? I can probably code something up to find the threshold.

If there are no major flaws with the method, I could just create a grid of candidate thresholds [0, 0.1, 0.2, etc.], run an ANOVA at each one and log the F-values, then pick the threshold with the highest F-value.
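
A rough sketch of the grid search I have in mind, in plain Python. The data here is synthetic (a step at x = 0.6 plus noise), and `f_statistic`, `best_threshold` and `min_group_size` are names I made up for illustration. The F-value it returns is only a ranking score for candidate thresholds; since the threshold is chosen by looking at the data, it should not be read as an ordinary ANOVA test result.

```python
import random
from statistics import fmean


def f_statistic(left, right):
    """One-way ANOVA F statistic for exactly two groups (df_between = 1)."""
    n1, n2 = len(left), len(right)
    m1, m2 = fmean(left), fmean(right)
    grand = (sum(left) + sum(right)) / (n1 + n2)
    ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ss_within = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
    if ss_within == 0:
        return float("inf")
    return ss_between / (ss_within / (n1 + n2 - 2))


def best_threshold(x, y, grid, min_group_size=5):
    """Return (threshold, F) for the grid value with the largest F, or None."""
    best = None
    for t in grid:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        # Skip splits that leave one group too small (the size constraint).
        if len(left) < min_group_size or len(right) < min_group_size:
            continue
        f = f_statistic(left, right)
        if best is None or f > best[1]:
            best = (t, f)
    return best


# Synthetic example (an assumption, not real data): Y jumps by 1 when X > 0.6.
rng = random.Random(0)
x = [rng.random() for _ in range(300)]
y = [(1.0 if xi > 0.6 else 0.0) + rng.gauss(0, 0.5) for xi in x]
grid = [i / 10 for i in range(1, 10)]

threshold, f_value = best_threshold(x, y, grid)
print(threshold, round(f_value, 1))
```

With two groups the F statistic is just the square of the two-sample t statistic, so maximizing one is the same as maximizing the other.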

confused
  • 3
    Such binning is discouraged in basically all situations. https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable https://stats.stackexchange.com/questions/390705/why-should-binning-be-avoided-at-all-costs https://twitter.com/f2harrell/status/949700082781310976?lang=en https://stats.stackexchange.com/questions/41227/justification-for-low-high-or-tertiary-splits-in-anova/41233#41233 https://discourse.datamethods.org/t/categorizing-continuous-variables/3402 Given these issues, why do binning? – Dave Jul 13 '21 at 20:04
  • Are you talking about regression trees? – BigBendRegion Jul 13 '21 at 20:04
  • @Dave I do binning because we have constraints against building a complex model when it comes to implementation. Also people like bins because it makes nice pretty tables as opposed to formulas. – confused Jul 13 '21 at 20:06
  • 1
    Part of your job is to keep people who know less about statistics than you from making poor decisions about statistics. – Dave Jul 13 '21 at 20:07
  • Oh, I come from more of the perspective of selling an idea that can be sold and is reasonable. It's much easier to say, if A > B, then we do C, as opposed to: take A, multiply by a coefficient, and then if that's greater than B, we do C. But I also work in an environment where as long as it's better than before, that's what matters. – confused Jul 13 '21 at 20:41
  • If a complex black box ML model produces great results, that's awesome, but if my binning approach produces results that are better than before, it's sellable, as people would more likely give me the green light (as they trust the method) to implement it. But the ML model may sit in the dust since no one understands or trusts it, or we don't have the ability to implement it. And it makes my life easy, as I don't need to do anything complex and I'd get greater credit. – confused Jul 13 '21 at 20:44
  • 1
    Re "no major flaws:" There's a glaring one. Almost any such binning procedure will be deeply and subtly flawed due to the temptation to apply standard procedures, such as ANOVA or chi-squared tests, to the binned data, without the needed modifications to account for the dependency between the bin cutpoints and the data. For an example of the problems in a simple setting (a chi-squared test of fit) see my account at https://stats.stackexchange.com/a/17148/919. – whuber Jul 13 '21 at 21:06
  • Binning *is* more complex: if you bin into, say, five categories, you will need to estimate five parameters. You can model nonlinear behavior for far fewer degrees of freedom by using splines, or just use a straightforward linear model with just two parameter estimates. – Stephan Kolassa Jul 14 '21 at 05:41
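
The dependency between the cutpoint and the data raised in the comments above can be illustrated with a small simulation. This is only a sketch under stated assumptions: X and Y are generated independently (so any "significant" split is a false positive), the grid is 0.1 to 0.9, and `CRITICAL_F` is a hard-coded approximation of the 5% critical value of F(1, 98) rather than a value computed from a library. The function names are hypothetical.

```python
import random
from statistics import fmean

CRITICAL_F = 3.94  # approximate 5% critical value of F(1, 98); hard-coded assumption
FIXED_T = 0.5      # a threshold chosen in advance, without looking at the data


def f_statistic(left, right):
    """One-way ANOVA F statistic for exactly two groups."""
    n1, n2 = len(left), len(right)
    m1, m2 = fmean(left), fmean(right)
    grand = (sum(left) + sum(right)) / (n1 + n2)
    ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ss_within = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
    return ss_between / (ss_within / (n1 + n2 - 2))


def simulate(n_sims=1000, n=100, seed=1):
    """Share of null datasets declared 'significant' at a fixed vs. a data-chosen cutpoint."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(1, 10)]  # includes FIXED_T
    fixed_hits = 0
    max_hits = 0
    for _ in range(n_sims):
        x = [rng.random() for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # Y is independent of X
        f_values = {}
        for t in grid:
            left = [yi for xi, yi in zip(x, y) if xi <= t]
            right = [yi for xi, yi in zip(x, y) if xi > t]
            if len(left) >= 5 and len(right) >= 5:
                f_values[t] = f_statistic(left, right)
        fixed_hits += f_values.get(FIXED_T, 0.0) > CRITICAL_F
        max_hits += max(f_values.values(), default=0.0) > CRITICAL_F
    return fixed_hits / n_sims, max_hits / n_sims


rate_fixed, rate_max = simulate()
print("fixed cutpoint:", rate_fixed, "data-chosen cutpoint:", rate_max)
```

The fixed cutpoint should reject at roughly the nominal 5% rate, while the cutpoint picked to maximize F rejects more often, because the largest of several F-values is being compared against a critical value meant for a single test.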

0 Answers