The latent variable motivation for logistic regression goes as follows. There exists a continuous variable $Y^*=\beta^tX+\epsilon$. We do not observe $Y^*$ itself, only which interval between thresholds it falls in, say $Y^*\leq \alpha_1$, $\alpha_1<Y^*\leq \alpha_2$, $\alpha_2<Y^*\leq \alpha_3$ and $\alpha_3<Y^*\leq \alpha_4$, with $Y=1, 2, 3, 4$ respectively.

Therefore $P(Y\leq j)=P(Y^*\leq \alpha_j)=P(\beta^tX+\epsilon\leq \alpha_j)=P(\epsilon\leq \alpha_j-\beta^tX)$

Assuming that $\epsilon$ follows a standard logistic distribution, we have: logit$(P(Y\leq j))=\alpha_j-\beta^tX$
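The derivation above can be checked numerically. The following is only a minimal sketch: the values of `beta`, `alphas` and `x` are made up for illustration, and $\epsilon$ is assumed to be standard logistic (location 0, scale 1).

```python
import numpy as np

def logistic_cdf(z):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds, the inverse of logistic_cdf."""
    return np.log(p / (1.0 - p))

def cumulative_probs(x, beta, alphas):
    """P(Y <= j | x) for each threshold alpha_j, i.e. P(eps <= alpha_j - beta'x)."""
    eta = np.dot(beta, x)
    return logistic_cdf(np.asarray(alphas) - eta)

# Hypothetical values, chosen only for illustration.
beta = np.array([0.8, -0.5])
alphas = np.array([-1.0, 0.0, 1.5])
x = np.array([1.0, 2.0])

cum = cumulative_probs(x, beta, alphas)

# Monte Carlo check: simulate Y* = beta'x + eps and compare P(Y* <= alpha_j).
rng = np.random.default_rng(0)
y_star = np.dot(beta, x) + rng.logistic(size=200_000)
empirical = np.array([(y_star <= a).mean() for a in alphas])

print("model    :", np.round(cum, 4))
print("simulated:", np.round(empirical, 4))
print("logit    :", logit(cum), "vs alpha_j - beta'x:", alphas - np.dot(beta, x))
```

The logits of the cumulative probabilities reproduce $\alpha_j-\beta^tX$ exactly, and the simulated proportions should agree with the model probabilities up to Monte Carlo error.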

My question is this: with only one threshold, say $\alpha$, we get a binary logistic model of the form

logit$(P(Y=1))=\alpha-\beta^tX$.

With $n$ thresholds, for large $n$, we observe almost all of $Y^*$, so we can use ordinary least squares regression for modeling. Can someone illustrate in a nice way what we gain and what we lose for large $n$ and for small $n$, where $n$ refers to the number of thresholds?
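To make the large-$n$ case concrete, here is a rough simulation sketch. It assumes a single covariate, a made-up slope `beta = 1.5`, and, importantly, equally spaced thresholds of known width on a fixed range; none of this comes from a real data set, and real thresholds need not be equally spaced.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 50_000
beta = 1.5                                   # hypothetical true slope
x = rng.normal(size=n_obs)
y_star = beta * x + rng.logistic(size=n_obs)  # latent response

def ols_slope(x, y):
    """Slope of the simple least squares regression of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def coarsen(y_star, n_thresholds, lo=-12.0, hi=12.0):
    """Cut y_star at equally spaced thresholds; return codes 1..n+1 and the spacing."""
    thresholds = np.linspace(lo, hi, n_thresholds)
    width = thresholds[1] - thresholds[0]
    # Y = j when alpha_{j-1} < Y* <= alpha_j
    codes = np.searchsorted(thresholds, y_star, side="left") + 1
    return codes, width

slope_latent = ols_slope(x, y_star)
results = {}
for n_thresholds in (3, 10, 200):
    codes, width = coarsen(y_star, n_thresholds)
    # OLS on the integer codes, rescaled back to the latent scale.
    results[n_thresholds] = ols_slope(x, codes) * width
    print(n_thresholds, "thresholds: rescaled OLS slope =", round(results[n_thresholds], 3))

print("OLS on Y* itself:", round(slope_latent, 3))
```

With many thresholds the rescaled slope should be close to the slope from regressing $Y^*$ directly, but only because the spacing is equal and known here. The raw OLS slope on the integer codes is in units of "categories", not in units of $Y^*$.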

mdewey
Chamberlain Mbah
  • You have an issue of scale. You are essentially cutting up the real line into $n$ intervals, possibly of unequal length. There's no reason to believe that the first interval corresponds to 1 on the latent scale, even if the intervals are thin. – dimitriy May 14 '14 at 15:26
  • Well, I do not see a problem with that: the model transforms whatever real values you assign to the categories onto the latent scale. The intercepts you get are on the latent scale. – Chamberlain Mbah May 14 '14 at 16:03
  • So if you increase $n$ your model will give you $n-1$ values of the latent response scale, that is the intercepts. – Chamberlain Mbah May 14 '14 at 16:05
  • If your variable is a "coarsened" latent variable, then I would agree. But take bond ratings. There's an underlying latent variable called creditworthiness that some agency has divided into many bins, which range from AAA, AA, A, BBB, and so on to D. You can imagine coding these as 12, 11, 10,.... But you could just as well use another coding scheme. Which one should you use? – dimitriy May 14 '14 at 17:45
  • Also, AAA is better than AA, and AA is better than A, but the two differences are not equivalent. When you use OLS on the integer scale, you are imposing the assumption that the distances between categories are all equal, which can be problematic. – dimitriy May 14 '14 at 17:45
  • I see your point and agree with you. So the problem of scale only arises when you use OLS. With the proportional odds model and large $n$ there is no scale problem, but then the number of parameters increases drastically. Do you agree? – Chamberlain Mbah May 14 '14 at 19:13
  • That sounds right to me. – dimitriy May 14 '14 at 20:49
  • Ordinal semiparametric models have a very direct statement without needing to resort to latent variables. Latent variables can be useful for extending to more complex situations though. – Frank Harrell May 07 '17 at 13:11
  • https://stats.stackexchange.com/questions/218645/logistic-regression-and-latent-data – kjetil b halvorsen May 07 '17 at 13:15
