Are there any recent research papers or state-of-the-art methods on how to categorize/dichotomize a continuous explanatory variable in regression analysis?

Stephan Kolassa
irudnyts
  • Yes, there is. [Don't do it.](http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous) – Stephan Kolassa Apr 09 '18 at 16:06
  • As @StephanKolassa writes, categorizing is usually counterproductive. If you want to ask for commentary on how it might be unusually appropriate for your situation, just let us know what that situation entails. – rolando2 Apr 09 '18 at 16:19
  • @rolando2 I have a GLM with a binary outcome variable (using the canonical logit link function). For simplicity, assume the right-hand side is a single continuous variable. However, the relationship between the log odds of the outcome and the explanatory variable is not linear but rather quadratic. I do not want to step into GAMs because my focus is on explaining, not predicting. One way of transforming the variable would be to use abs(x - mean(x)), i.e. how far the value is from its mean. But I also want to try a categorical equivalent. – irudnyts Apr 10 '18 at 15:28
  • @StephanKolassa I fully agree with most points on this list, even though it is possible to find counterexamples. Do you know of any references in statistical journals on this topic? – irudnyts Apr 10 '18 at 15:39
  • I think I have seen a few papers purporting to offer reasons for categorization, but to be honest, I found them all very underwhelming. You may want to look at [What is the benefit of breaking up a continuous predictor variable?](https://stats.stackexchange.com/q/68834/1352) If your relationship is nonlinear, have you looked at splines? These can be very flexible, without inducing ecologically dubious jumps where discretization bins would meet. – Stephan Kolassa Apr 10 '18 at 16:24
  • @StephanKolassa In general I agree with you, but I'm curious to hear what you think of the hypothetical scenario suggested below – Richard Border Apr 11 '18 at 16:16
  • @RichardBorder: I already upvoted your answer a few days ago, and I agree that your specific example makes sense. – Stephan Kolassa Apr 12 '18 at 18:09

1 Answer


In general, there is just one universally accepted, advisable way to categorize continuous data. And that is... Floating point numbers! Computers are incapable of representing the vast majority of real numbers, let alone all of the rational numbers. Even if we didn't rely on computers, we can't measure anything with infinite precision, so we still have to round at some point!

But, floating point aside, there is no general reason to categorize a continuous variable, except perhaps under specialized circumstances. You end up just ignoring real variability in your data and further increase the wrongness of your already wrong, but perhaps useful, model.
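
To make the "ignoring real variability" point concrete, here is a minimal simulation sketch, not a general proof. Everything in it is an assumption for illustration: a standard normal predictor `x`, a linear effect of 0.5 on `y`, and unit-variance noise. It compares the correlation of `y` with the continuous `x` against the correlation with a median-split version of `x`. For a normally distributed predictor, the expected attenuation factor is $\sqrt{2/\pi} \approx 0.80$.

```python
import math
import random
import statistics


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000

# Hypothetical data: y depends linearly on a standard normal x, plus noise.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

# Median split: 1 if above the sample median, 0 otherwise.
cut = statistics.median(x)
x_split = [1.0 if xi > cut else 0.0 for xi in x]

r_cont = pearson(x, y)
r_split = pearson(x_split, y)
ratio = r_split / r_cont

print(f"correlation with continuous x:   {r_cont:.3f}")
print(f"correlation with median-split x: {r_split:.3f}")
print(f"ratio (split / continuous):      {ratio:.3f}")
```

The split predictor carries the same sign of association but a visibly weaker one, which is the information loss in question.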

Some references explaining why one (unfortunately) commonplace technique for dichotomizing continuous predictors (median splits) is particularly problematic:

One hypothetical scenario where dichotomization might be okay:

Imagine you have a measurement $W$ that does a decent job of distinguishing between $X=0$ and $X>0$ but is extremely imprecise given $X>0$. Perhaps you know, for a given year, the total number of days an individual spent incarcerated, on parole, on probation, or in court, but you don't know how those days are split between incarceration, parole, probation, etc. It doesn't make sense to treat days in court the same way that you'd treat days in prison, so something like a zero-inflated count model or a hurdle model might be inappropriate or exceedingly difficult to estimate. However, you might instead create a dichotomous "legal contact" variable

$$W = \begin{cases} 0 & \text{if } X = 0\\ 1 & \text{if } X > 0 \end{cases}$$

and employ a logistic model. This doesn't solve your real problem, which is that your measurement is imprecise, but it might allow you to get some information from your data.

Again, if you were instead to have, for example, a measure of days in prison specifically, then something like a hurdle model would likely be more useful and more powerful.
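
A rough sketch of the "legal contact" idea, under loudly stated assumptions: the data are simulated, the covariate `z` and the coefficients -1.0 and 0.8 are hypothetical, and the positive day counts are drawn uniformly only to mimic a poorly resolved measurement. The code dichotomizes the day count and fits a one-covariate logistic model with a hand-written Newton-Raphson loop. In practice you would use a standard GLM routine rather than this loop.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 5_000

# Hypothetical covariate z and hypothetical total days of legal contact.
z = [rng.gauss(0, 1) for _ in range(n)]
days = []
for zi in z:
    p_any = 1 / (1 + math.exp(-(-1.0 + 0.8 * zi)))  # chance of any contact
    if rng.random() < p_any:
        days.append(rng.randint(1, 365))  # positive, but poorly resolved
    else:
        days.append(0)

# Dichotomize: 1 if any legal contact in the year, 0 otherwise.
w = [1 if d > 0 else 0 for d in days]


def fit_logistic(x, y, iters=25):
    """Newton-Raphson for logit P(y = 1) = b0 + b1 * x (one covariate)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            r = yi - p       # score contribution
            v = p * (1 - p)  # information weight
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(z, w)
print(f"intercept: {b0:.2f}, slope: {b1:.2f}")
```

The fit only uses whether any contact occurred, so the unreliable split of days among prison, parole, probation and court never enters the model.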

Richard Border