I am doing research and want to analyze factors associated with dental erosion using binary logistic regression. Dental erosion (the dependent variable) was graded on a 5-point scale (0 = no erosion, 1 = mild erosion, 2 = moderate erosion, 3 = severe erosion, 4 = very severe erosion). No one in the sample had grade zero. Because severe and very severe erosion (grades 3 and 4) are the most important clinically, I am interested in which factors are associated with their occurrence. The data were collected for three age groups (5-, 13- and 18-year-old children). I have the following question:

  • I dichotomized the dependent variable into two groups: one with grades (0, 1) and one with grades (3, 4), excluding grade 2 to create well-contrasted groups. This caused a massive drop in sample size but produced high odds ratios. Is this correct, or should I include grade 2 with the first group, i.e., (0, 1, 2)?
mdewey
Imad Saga
  • It would have been better if you had simply edited [your earlier question](http://stats.stackexchange.com/q/184905/1352), or at least linked to it. By posting a new question, this thread loses the information that has been accumulating at the earlier one. – Stephan Kolassa Dec 04 '15 at 16:22

4 Answers

You should not dichotomize your dependent variable. You should use ordinal logistic regression, at least as a starting point.

You should not remove data.

Peter Flom
To expand on @Peter Flom's answer:

There is almost always more statistical and explanatory power in an analysis that keeps continuous and ordinal variables as such. The loss is even greater if you penalize yourself in terms of sample size at the same time.

So, let's say you were to ignore this advice and dichotomize your data anyway; what should you do? The answer could depend on what is considered best practice in your field, what split makes the most sense from the standpoint of your theory and/or a contrasting theory, or what cut point results in the most equal group sizes. If you select a cut point based on 'creat(ing) good contrasting groups', then it sounds like your cut point is the result of p-value fishing of a sort, and you should probably be suspicious of your resulting p-values and effect sizes (not that an effect size based on an arbitrary cut point is particularly revealing in the first place).
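To see concretely how much the choice of cut point matters, here is a minimal sketch in pure Python. The exposure-by-grade counts below are entirely hypothetical (invented for illustration; the thread contains no data): the same table yields quite different odds ratios depending on where you cut, and the largest one comes from dropping the middle grade.

```python
# Hypothetical counts of a binary exposure by erosion grade 1-4
# (no one had grade 0, as in the question). All numbers are invented.
unexposed = {1: 50, 2: 30, 3: 15, 4: 5}
exposed   = {1: 25, 2: 30, 3: 30, 4: 15}

def odds_ratio(high_grades, low_grades):
    """2x2 odds ratio for 'severe' (high) vs 'non-severe' (low) grades."""
    a = sum(exposed[g] for g in high_grades)    # exposed, severe
    b = sum(exposed[g] for g in low_grades)     # exposed, non-severe
    c = sum(unexposed[g] for g in high_grades)  # unexposed, severe
    d = sum(unexposed[g] for g in low_grades)   # unexposed, non-severe
    return (a * d) / (b * c)

print(odds_ratio([2, 3, 4], [1]))     # cut after grade 1      -> 3.0
print(odds_ratio([3, 4], [1, 2]))     # (3,4) vs (1,2), 2 kept -> about 3.27
print(odds_ratio([3, 4], [1]))        # (3,4) vs (1), 2 dropped -> 4.5
```

Dropping grade 2 here inflates the odds ratio from roughly 3.3 to 4.5 while also discarding 60 of the 200 subjects, which mirrors the "massive drop in sample size but high odds ratios" the question describes.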

russellpierce
This simulation, Median Split, graphically demonstrates the noxious effect of dichotomizing a continuous variable.

FTF
  • Yes, and also see https://github.com/harrelfe/rscripts/blob/master/catgNoise.r for an interactive demonstration you can run in `RStudio`. – Frank Harrell Dec 07 '15 at 14:54
You have stated that you are interested in factors associated with severe and very severe erosion, so it is appropriate to dichotomize so that the interpretation of the model results matches your hypothesis of interest. (Maybe this is the point at which treatment starts to get expensive?) There are no moral imperatives here, no "should never"s.

As for dropping the 2s, it depends on what you would like the interpretation of your resulting odds ratio to be. To me, the odds of being severe (3, 4) vs non-severe (0, 1, 2) make more sense than the odds of being severe (3, 4) vs non-severe (0, 1) among those who are not moderate. The second interpretation is not useful.

As a starting point I would go with the simplest model that does not throw away data, and from there add more levels or parameters if they statistically improve the fit. Ordinal regression answers a much broader question about the chances of moving up a category, which may not be of interest to you.

The more complicated ordinal logistic regression makes the proportional odds assumption: that a factor's odds ratio is the same for (0) vs (1,2,3,4), (0,1) vs (2,3,4), (0,1,2) vs (3,4), and (0,1,2,3) vs (4). This is a strong assumption, not easy to understand and often not valid in practice, so the model may not fit well. It is akin to treating the categories as a continuous variable and running a linear regression, which assumes a linear relationship between the categories and the factors. If you don't know how the categories were derived (i.e., how is no erosion different from mild erosion? Is it quantified in terms of enamel thickness or some other continuous measurement?), that is a risky assumption.
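To make the proportional odds assumption concrete, here is a small pure-Python sketch with hypothetical category probabilities (the numbers and the assumed odds ratio of 2 are invented for illustration). Under proportional odds, an exposed group's cumulative log-odds are the reference group's shifted by one common log odds ratio at every cut point:

```python
import math

# Hypothetical probabilities of grades 1-4 in a reference group:
p_ref = [0.40, 0.30, 0.20, 0.10]
log_or = math.log(2.0)  # assumed common odds ratio of 2 for an exposed group

def cumulative_logits(p):
    """Log-odds of being above each cut point: (1) vs (2,3,4), etc."""
    logits = []
    cum = 0.0
    for prob in p[:-1]:
        cum += prob
        logits.append(math.log((1 - cum) / cum))
    return logits

ref_logits = cumulative_logits(p_ref)
# Proportional odds = the same shift at every cut point:
exp_logits = [l + log_or for l in ref_logits]

for i, (r, e) in enumerate(zip(ref_logits, exp_logits), start=1):
    print(f"cut after grade {i}: ref {r:+.3f}, exposed {e:+.3f}, "
          f"difference {e - r:.3f}")
```

Every "difference" printed is log(2) by construction; the assumption fails when the real data's cut-point-specific log odds ratios are not parallel like this. Checking whether they are roughly parallel is one informal way to assess the assumption.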

  • Ordinal regression does not require estimating more parameters that "count" the more levels of $Y$ you have. It is not good practice to pool levels unless they are out of order, in general. – Frank Harrell Dec 04 '15 at 21:37
  • Ordinal logistic regression fits extra intercepts for each dichotomization in the model (in this case 3 extra parameters). It makes unnecessary assumptions (possibly leading to bias) and generates additional odds ratios for comparisons that do not address the OP's question: "Therefore, because severe and very severe erosion are the most important clinically I am interested to see which factors are associated with occurrence of severe and very severe erosion (grades 3 and 4)." There is no interest in the other comparisons generated by the ordinal regression model. – Derrick Kaufman Dec 04 '15 at 22:59
  • No. Even though there is one intercept per unique value of $Y$ (save one), there is an order restriction on these, so they effectively do not add degrees of freedom to the model. The limiting case is the Wilcoxon test, which is a special case of the proportional odds model, which handles continuous $Y$ beautifully. You answer the original question by keeping $Y$ intact with all its categories. – Frank Harrell Dec 04 '15 at 23:23
  • I see. Thank you for that explanation. What are the consequences if the proportional odds assumption does not hold? What is the sample size required to have reasonable power to detect a deviation from this assumption? Does the proportional odds assumption need to hold for each factor of interest in order to make valid statements about the associations? – Derrick Kaufman Dec 04 '15 at 23:38
  • I don't know how to answer the second question, but the p.o. assumption would have to be pretty badly violated for a binary logistic model to work better. And as Stephen Senn has said, "Clearly, the dependence of the proportional odds model on the assumption of proportionality can be overstressed. Suppose that two different statisticians would cut the same three-point scale at different cut points. It is hard to see how anybody who could accept either dichotomy could object to the compromise answer produced by the proportional odds model." (Stat in Med 28:3189, 2009). – Frank Harrell Dec 04 '15 at 23:41
  • I'm reading something along the lines of "As with the proportional hazards assumption in the Cox model, when this "model based assumption check" fails, it does not mean the model results are entirely invalid, it's just that the effect estimates are "averaged" over their inconsistent proportionality" here: http://stats.stackexchange.com/questions/76379/ordinal-regression-proportional-odds-assumption – Derrick Kaufman Dec 04 '15 at 23:49
  • Well put. And whenever someone abandons a model they need to make sure that their alternate choice isn't worse. On a different note, many people criticize the inexactness of the bootstrap for confidence intervals then turn around and use a more poorly performing parametric method. – Frank Harrell Dec 04 '15 at 23:50
  • Thank you so much, Derrick Kaufman. I am totally convinced by your opinion. – Imad Saga Dec 05 '15 at 13:19
  • +1; there are no moral imperatives and practical requirements for dichotomizing trump statistical preferences – russellpierce Dec 05 '15 at 16:52
  • If I were a skeptic (or health authority or journal editor) and the interest was in the clinically significant comparison (0,1,2) vs (3,4), would I want you to use a model that depended on an unverifiable proportional odds assumption to borrow information from other, non-clinically-significant comparisons (i.e., 1 vs 2,3,4) to draw conclusions? Wouldn't dichotomization be more conservative? – Derrick Kaufman Dec 05 '15 at 18:31
  • You have missed all the points above, but you are right that dichotomization is conservative in that it throws away information and loses power. Someone who emphasizes type II error more than type I error would say that dichotomization is _really_ a bad idea. And the fact that some categories are more important clinically has nothing to do with collapsing categories. Note that the prop. odds assumption is verifiable but, as stated above, one can often get a better analysis than dichotomization even when PO is violated. – Frank Harrell Dec 07 '15 at 14:53