
I'm trying to do an ordered logit regression. I'm running the model like so (just a dumb little model estimating number of firms in a market from income and population measures). My question is about predictions.

nfirm.opr <- polr(y ~ pop0 + inc0, Hess = TRUE)
pr_out    <- predict(nfirm.opr)

When I run `predict` (which I'm trying to use to get the predicted y), the outputs are either 0, 3, or 27, which in no way reflects what the predictions should be based on my manual calculations from the coefficient estimates and intercepts. Does anyone know how to get "accurate" predictions for my ordered logit model?

EDIT

To clarify my concern, my response data has observations across all the levels

> head(table(y))
y
0  1  2  3  4  5 
29 21 19 27 15 16 

whereas my predicted values seem to be bunching up:

> head(table(pr_out))
pr_out
  0   1   2   3   4   5 
117   0   0 114   0   0 
Steffen Moritz
prototoast
    This is quite vague. How do the values returned by the `predict` function differ from the ones you generated manually? What is the structure of your dependent variable? Please provide a reproducible example. – Sven Hohenstein Oct 23 '12 at 08:07
  • 1
    I think you would want to see this-http://stats.stackexchange.com/questions/18119/how-do-you-predict-a-response-category-given-an-ordinal-logistic-regression-mode – Blain Waan Oct 23 '12 at 09:46
  • I bet most of your response variables are equal to 0, 3, or 27. If I remember rightly `predict` by default returns the response with the highest probability for each data point. Perhaps you want the actual probabilities for each possible response? If so, check the documentation for options for the `predict` function, I know it exists, but don't remember the syntax. – Aaron left Stack Overflow Oct 23 '12 at 12:21
  • If you *only* want to know how to get R to do this, your question is off-topic for CV (& belongs on [Stack Overflow](http://stackoverflow.com/) instead). Are you wondering about the statistical issues associated w/ this, or just how to get R to give you the numbers? – gung - Reinstate Monica Oct 23 '12 at 13:04
  • 2
    I don't quite follow your situation. You say that you are using an ordinal regression model, but you also say, as best I understand, that your response variable is the number of firms in a market. That's a *count*, it is ordinal, but OLR is not the right way to model that; you want to use some variant of Poisson regression. – gung - Reinstate Monica Oct 23 '12 at 13:06
  • 2
    @gung Yes, I understand the point about count vs ordinal. At the moment, I am trying to replicate the paper http://ideas.repec.org/a/ucp/jpolec/v99y1991i5p977-1009.html and they use an ordinal regression. I have also estimated count models, but that doesn't help me with this particular task. Also, no, it's not that I just want R to do this, I'm trying to understand where the behavior is deviating from my expectations (because I suspect the error is on my part, not R). – prototoast Oct 23 '12 at 15:52
  • 1
    Did you verify `polr()` against other functions? You could try `lrm()` from package `rms`: `lrmFit – caracal Oct 23 '12 at 16:19
  • Thanks for the suggestion. I'll work with those (and thanks for the answer below, I'm working through that as well--trying to figure out what's going on). – prototoast Oct 23 '12 at 16:30

1 Answer


To manually verify the predictions derived from using polr() from package MASS, assume a situation with a categorical dependent variable $Y$ with ordered categories $1, \ldots, g, \ldots, k$ and predictors $X_{1}, \ldots, X_{j}, \ldots, X_{p}$. polr() assumes the proportional odds model

$$ \text{logit}(p(Y \leqslant g)) = \ln \frac{p(Y \leqslant g)}{p(Y > g)} = \beta_{0_g} - (\beta_{1} X_{1} + \dots + \beta_{p} X_{p}) $$

For possible choices implemented in other functions, see this answer. The logistic function is the inverse of the logit function, so the predicted probabilities $\hat{p}(Y \leqslant g)$ are

$$ \hat{p}(Y \leqslant g) = \frac{e^{\hat{\beta}_{0_{g}} - (\hat{\beta}_{1} X_{1} + \dots + \hat{\beta}_{p} X_{p})}}{1 + e^{\hat{\beta}_{0_{g}} - (\hat{\beta}_{1} X_{1} + \dots + \hat{\beta}_{p} X_{p})}} $$

The predicted category probabilities are $\hat{p}(Y = g) = \hat{p}(Y \leqslant g) - \hat{p}(Y \leqslant g-1)$. Here is a reproducible example in R with two predictors $X_{1}, X_{2}$. For an ordinal $Y$ variable, I cut a simulated continuous variable into 4 categories.

set.seed(1.234)
N     <- 100                                    # number of observations
X1    <- rnorm(N, 5, 7)                         # predictor 1
X2    <- rnorm(N, 0, 8)                         # predictor 2
Ycont <- 0.5*X1 - 0.3*X2 + 10 + rnorm(N, 0, 6)  # continuous dependent variable
Yord  <- cut(Ycont, breaks=quantile(Ycont), include.lowest=TRUE,
             labels=c("--", "-", "+", "++"), ordered=TRUE)    # ordered factor

Now fit the proportional odds model using polr() and get the matrix of predicted category probabilities using predict(polr(), type="probs").

> library(MASS)                              # for polr()
> polrFit <- polr(Yord ~ X1 + X2)            # ordinal regression fit
> Phat    <- predict(polrFit, type="probs")  # predicted category probabilities
> head(Phat, n=3)
         --         -         +        ++
1 0.2088456 0.3134391 0.2976183 0.1800969
2 0.1967331 0.3068310 0.3050066 0.1914293
3 0.1938263 0.3051134 0.3067515 0.1943088

To manually verify these results, we need to extract the parameter estimates, from these calculate the predicted logits, from these logits calculate the predicted probabilities $\hat{p}(Y \leqslant g)$, and then bind the predicted category probabilities to a matrix.

ce <- polrFit$coefficients         # coefficients b1, b2
ic <- polrFit$zeta                 # intercepts b0.1, b0.2, b0.3
logit1 <- ic[1] - (ce[1]*X1 + ce[2]*X2)
logit2 <- ic[2] - (ce[1]*X1 + ce[2]*X2)
logit3 <- ic[3] - (ce[1]*X1 + ce[2]*X2)
pLeq1  <- 1 / (1 + exp(-logit1))   # p(Y <= 1)
pLeq2  <- 1 / (1 + exp(-logit2))   # p(Y <= 2)
pLeq3  <- 1 / (1 + exp(-logit3))   # p(Y <= 3)
pMat   <- cbind(p1=pLeq1, p2=pLeq2-pLeq1, p3=pLeq3-pLeq2, p4=1-pLeq3)  # matrix p(Y = g)
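The three per-threshold steps above can also be written as a single vectorized sweep (a sketch reusing `ce`, `ic`, `X1`, and `X2` from the code above; `plogis()` is base R's logistic CDF):

```r
eta    <- ce[1]*X1 + ce[2]*X2              # linear predictor without intercepts
logits <- outer(-eta, ic, "+")             # N x 3 matrix: ic[g] - eta, one column per threshold
pLeq   <- plogis(logits)                   # p(Y <= g) for g = 1, 2, 3
pMat2  <- cbind(pLeq, 1) - cbind(0, pLeq)  # difference adjacent columns to get p(Y = g)
```

This produces the same matrix as `pMat`, up to column names.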

Compare to the result from polr().

> all.equal(pMat, Phat, check.attributes=FALSE)
[1] TRUE

For the predicted categories, predict(polr(), type="class") just picks, for each observation, the category with the highest probability.

> categHat <- levels(Yord)[max.col(Phat)]   # category with highest probability
> head(categHat)
[1] "-"  "-"  "+"  "++" "+"  "--"

Compare to result from polr().

> facHat <- predict(polrFit, type="class")  # predicted categories
> head(facHat)
[1] -  -  +  ++ +  --
Levels: -- - + ++

> all.equal(factor(categHat), facHat, check.attributes=FALSE)  # manual verification
[1] TRUE
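Following caracal's comment under the question, the same fit can also be cross-checked against `lrm()` from package `rms`. This is a sketch: it reuses `Yord`, `X1`, `X2`, and `Phat` from above, and assumes `rms` is installed; `type="fitted.ind"` requests the per-category probabilities.

```r
library(rms)                                     # for lrm()
lrmFit   <- lrm(Yord ~ X1 + X2)                  # same proportional odds model
Phat.lrm <- predict(lrmFit, type="fitted.ind")   # predicted category probabilities
all.equal(unname(Phat), unname(Phat.lrm), tolerance=1e-4, check.attributes=FALSE)
```

The two functions maximize the same likelihood with different optimizers, so the probabilities should agree up to numerical tolerance.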
caracal