Interpret this observed-vs-fitted plot

Question

I have this OvF plot with the outcome being a ln-transformed continuous variable (length of stay in days).

This plot is the result of a survey-adjusted, weighted multilevel linear regression (one level of random intercepts) run in Stata 14.

I don't know how to interpret this plot. There seems to be a weird horizontal pattern across the o=f line that I cannot understand. Please advise.

Also, if you know another method of assessing model fit for generalized linear regression (continuous outcomes), please let me know (Stata commands --> extra kudos!)

Your observed values are presumably integer days, which leads to those unequally spaced values when you take their natural logarithms; nothing can fill the gaps – Henry, Oct 11 '16 at 20:43
As Henry says, this is what you would expect when the dependent variable is discrete - you don't need to fix it. There are several answers concerning diagnostics for log-linear/Poisson models. For example, have a look at [this](http://stats.stackexchange.com/questions/70558/diagnostic-plots-for-count-regression), [this](http://stats.stackexchange.com/questions/25068/interpreting-plot-of-residuals-vs-fitted-values-from-poisson-regression), or [this](http://stats.stackexchange.com/questions/99052/residuals-in-poisson-regression). – matteo, Oct 11 '16 at 20:56
@MatteoLisi I appreciate your answer, but none of the 3 sources helps me. Number of days should be treated as a continuous discrete variable, not as ordinal categorical. Do you suggest I run a Poisson regression instead? I'm not sure how these sources help me here. Do I need a different kind of plot? The above sources point to residuals vs fitted instead. – Paris Char, Oct 11 '16 at 21:29
What you are doing looks similar to a Poisson regression, which is used when the outcome is a _count variable_, as it is in your case. A count variable is not ordinal-categorical; it consists of non-negative integer values {0, 1, 2, ...} that come from _counting_ rather than _ranking_. It is different from ordinal data because the scale of its values is not arbitrary. It is also not continuous, which is why your plot looks like that. – matteo, Oct 12 '16 at 13:04
Plotting residuals vs fitted is just another way to check model fit. The answer to the 1st link I suggested provides a summary of methods used to assess the fit of these models – matteo, Oct 12 '16 at 13:09
"Continuous discrete variable" is a contradiction in terms: by definition, "discrete" is as far from "continuous" as one can get. – whuber, Oct 12 '16 at 19:40

score 3 · Answer 1 · edited Oct 12 '16 at 19:29

Your variable is measured in days, so the lowest value is $1$ day or $\ln 1=0$, the second lowest value is $2$ days or $\ln 2\approx0.69$, the third is $3$ days or $\ln 3\approx1.10$, etc. That is what you see in the graph; the bottom three horizontal lines of points happen at about the values $0$, $0.69$, and $1.10$. So the graph accurately represents what is happening in your data.
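To make the arithmetic concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself concerns Stata). The list `days` is a made-up set of integer stays, not the asker's data. It prints the logged values that integer days can take and the gaps between consecutive levels.

```python
import math

# Hypothetical lengths of stay in whole days (illustrative values only).
days = [1, 2, 3, 4, 5, 10, 30]

# Each integer day maps to exactly one height on the observed (vertical) axis.
log_levels = [round(math.log(d), 2) for d in days]
print(log_levels)

# Gap between the logs of consecutive integer days d and d + 1, for d = 1..5.
# The gaps shrink as d grows, so the bands are furthest apart at the bottom.
gaps = [round(math.log(d + 1) - math.log(d), 2) for d in range(1, 6)]
print(gaps)
```

Because no observation can fall between two consecutive levels, the points line up in horizontal bands that crowd together as stays get longer.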

You mentioned you wanted to treat length of stay as continuous. However, what would that mean in terms of your length of stay variable? In principle we could have measured length of stay in hours. This would not solve the problem completely, as the variable will still be granular to some extent, but the graph would look much prettier. More importantly, does that extra detail contain information you care about? Probably not, as the hour of the day that you leave a hospital is largely determined by organizational concerns, e.g. when is the doctor who does the final check-up available? So a finer measure of length of stay would probably only add random noise to your variable.

Having said all that, I probably would not use a linear model for such duration data. There is a whole set of statistical models that are explicitly designed for such data, typically referred to as survival analysis. See here, here, here, or here.

edited Oct 12 '16 at 19:29

Nick Cox


answered Oct 12 '16 at 08:32

Maarten Buis


I appreciate your answer and I am aware of survival analysis, but in this case I think Poisson regression or negative binomial regression (which doesn't work well with the survey specification because many clusters are used) would be a better idea. – Paris Char Oct 12 '16 at 17:21
That sounds wrong to me, but it is your research, so also your decision. – Maarten Buis Oct 12 '16 at 17:33
I wouldn't use a negative binomial model in this case, but in Stata you can use the svy prefix before nbreg, so there shouldn't be a big problem with including design information in an nbreg model. – Maarten Buis Oct 12 '16 at 17:38
What about Poisson regression? Also, a GLM with ln(LOS) as the outcome has been studied and reported to fit well ([link](http://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-12-68)) – Paris Char Oct 12 '16 at 19:33
GLMs with a log link function (Poisson regression is a special case) are not designed for that purpose, while survival analysis is. – Maarten Buis Oct 12 '16 at 20:09
I understand what you are saying; however, increased LOS is a bad thing for a hospitalized patient, and I am excluding the dead patients from this analysis as well, which makes survival analysis not so useful... For survival analysis you need outcomes (e.g. death), but I am excluding dead patients here... In many papers I found online, LOS was treated as count data, not time-to-event data. – Paris Char Oct 12 '16 at 20:42
I think you got confused by the (rather morbid) terminology. It is straightforward to apply survival analysis to length of stay: The event is in your case leaving the hospital. – Maarten Buis Oct 12 '16 at 20:51
Just removing dead patients sounds problematic to me. Treating them as censored would be well worth considering, though not without problems of its own. – Maarten Buis Oct 12 '16 at 20:53
The problem with that is that there are 2 ways of leaving the hospital... dead or alive. So you have to exclude one of the two. – Paris Char Oct 12 '16 at 22:04
That is what censoring is for. That is the main strength of survival analysis. – Maarten Buis Oct 12 '16 at 22:10
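As a rough illustration of the censoring idea in the comments above, here is a small Kaplan-Meier sketch in plain Python. The event is discharge alive, and in-hospital deaths are kept in the data as censored rather than dropped. The records in `stays` and the helper `kaplan_meier` are hypothetical, written only for this example. Treating deaths as censored assumes the censoring is unrelated to the chance of discharge, which is one of the "problems of its own" mentioned above, so read this as a sketch of the mechanics and not as a recommended analysis.

```python
# Hypothetical records: (length of stay in days, discharged_alive).
# discharged_alive = False marks an in-hospital death, treated here as censored.
stays = [(1, True), (2, True), (2, False), (3, True),
         (5, False), (5, True), (8, True)]


def kaplan_meier(records):
    """Return [(t, S(t))], where S(t) estimates P(still in hospital after day t)."""
    # Only days with at least one live discharge change the estimate.
    event_times = sorted({t for t, event in records if event})
    surv, curve = 1.0, []
    for t in event_times:
        # Patients still in hospital just before day t (censored ones included).
        at_risk = sum(1 for u, _ in records if u >= t)
        # Live discharges on day t.
        events = sum(1 for u, e in records if u == t and e)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve


curve = kaplan_meier(stays)
for t, s in curve:
    print(f"day {t}: P(still in hospital) = {s:.3f}")
```

A patient who dies on day 5 still counts in the risk set for days 1 to 5, so their stay contributes information up to the point of death without being counted as a discharge.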
