I'm running an ordered logistic regression with three classes, and I can predict the probability that each observation belongs to each class. For some reason, the predicted probabilities for one class (top40) never rise above a certain threshold. I am using the polr function from the R package MASS. I have also balanced the data set so that each class has the same number of observations. Does anyone know what this says about my data? Is it useful to balance one's data set?

Edit: I have modified my data set so that it is no longer a balanced subsample. The plot has also been updated, and the ceiling behaviour is still present in the class top40.
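
One possible reading of the ceiling, sketched numerically rather than taken from the fitted model: in a proportional-odds model with a logistic link (polr's default), the middle category's probability is the difference of two logistic curves that share the same slope, so it is bounded above by tanh((θ2 − θ1)/4), where θ1 and θ2 are the two cutpoints. The sketch below assumes top40 is the middle category; the cutpoint values are hypothetical and chosen only for illustration, and Python is used purely as a calculator.

```python
import math

def logistic_cdf(x):
    """Standard logistic CDF, the default link in polr."""
    return 1.0 / (1.0 + math.exp(-x))

def class_probs(eta, theta1, theta2):
    """Three-class proportional-odds probabilities for linear predictor eta.

    Uses P(Y <= k) = logistic_cdf(theta_k - eta), the parameterisation polr uses.
    """
    p_low = logistic_cdf(theta1 - eta)
    p_mid = logistic_cdf(theta2 - eta) - p_low
    p_high = 1.0 - logistic_cdf(theta2 - eta)
    return p_low, p_mid, p_high

def middle_class_ceiling(theta1, theta2):
    """Largest middle-class probability attainable for any eta."""
    return math.tanh((theta2 - theta1) / 4.0)

# Hypothetical cutpoints (assumption, not estimates from the question's data).
theta1, theta2 = -0.5, 1.0

# Scan the linear predictor over a wide grid and record the largest
# middle-class probability that ever occurs.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_mid = max(class_probs(eta, theta1, theta2)[1] for eta in grid)

print(max_mid, middle_class_ceiling(theta1, theta2))
```

If this is what is happening, the ceiling depends only on the distance between the two estimated cutpoints, not on how the classes are sampled, which would be consistent with it persisting after the balancing was removed.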

polr

Froblinkin
  • What is the graph? What are "typ" and "value"? – Peter Flom Jul 21 '18 at 12:49
  • What do you mean by balancing the data? Sounds like cheating. – Frank Harrell Jul 21 '18 at 13:00
  • @PeterFlom the graph represents the probabilities (value) for each class (top10, top40, and bot60) given some observation with some measurement (typ). The typ score is one of many variables used to construct the model. – Froblinkin Jul 27 '18 at 17:09
  • @FrankHarrell I'm balancing the data so that each target variable class has the same number of occurrences. Previously, the data was unbalanced, and I found the predicted probabilities to be skewed towards one class (more so than the imbalance alone would suggest). Even without balancing the data set, I still noticed this behavior, where certain predicted probabilities would hit a wall. – Froblinkin Jul 27 '18 at 17:10
  • Balancing is statistically invalid – Frank Harrell Jul 27 '18 at 19:03
  • @FrankHarrell I've changed the dataset, so now I'm not attempting to balance classes, but I was wondering if you could clarify about balancing being inappropriate. Say you have an unbalanced dataset where one class skews your model heavily. Some factor of interest is deemed significant under this model, but the model is junk. Wouldn't it make sense to balance your dataset, so you can get a more accurate read on the importance of your factors at distinguishing classes? – Froblinkin Jul 27 '18 at 20:26
  • You have to understand probabilities, and understand that probabilities can be unequal. Study proper probability accuracy scores and the use of rank correlations to measure discrimination ability. The lrm and orm functions in my R rms package give you several useful indexes. Never tamper with the sample sizes. – Frank Harrell Jul 27 '18 at 20:49
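
As a rough illustration of the "proper probability accuracy scores" mentioned in the last comment, here is a minimal sketch of two common ones, the multiclass Brier score and the logarithmic score. It is not the output of lrm or orm; the function names and the toy probabilities are assumptions made up for the example, and Python is used only to show the arithmetic.

```python
import math

def brier_score(probs, outcomes):
    """Mean multiclass Brier score: squared error against one-hot outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2
                     for k, pk in enumerate(p))
    return total / len(probs)

def log_score(probs, outcomes):
    """Mean negative log probability assigned to the observed class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, outcomes)) / len(probs)

# Toy predicted probabilities for three observations (hypothetical values);
# columns stand for the three ordered classes, outcomes are class indices.
probs = [[0.7, 0.2, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.3, 0.6]]
outcomes = [0, 1, 2]

# Reference forecast that ignores the predictors entirely.
uniform = [[1.0 / 3.0] * 3 for _ in outcomes]

print(brier_score(probs, outcomes), brier_score(uniform, outcomes))
print(log_score(probs, outcomes), log_score(uniform, outcomes))
```

Lower is better for both scores. Both are computed on the probabilities themselves, so they can be evaluated on the original, unbalanced sample.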

0 Answers