
I understand that Poisson regression is used for count data because, among other reasons, it accommodates the count variable's mode being zero and the fact that it cannot be negative. What about continuous variables, for example C-reactive protein, in clinical contexts? It is heavily positively skewed, cannot be negative, and can have a mode of 0 in many scenarios (values below the lower limit of detection, reported as, for example, <3, are considered 0 for analysis). Transformations are inadequate. The same is true of many biomarkers.

How should one approach regression with these variables? Can Poisson regression be used?

bobmcpop
    "Considered 0 for analysis" is usually thought to be a fundamental error and has to be considered one of the roots of your problem (as well as being a hint about how to resolve it). See, for instance, Helsel's book "Nondetects and Data Analysis." – whuber Oct 20 '17 at 23:34

1 Answer


The Poisson model is not just skew and non-negative; for example it has a particular variance specification (the variance is equal to the mean). This will almost never be true for variables that are not counts (what happens if you change units? You change the shape of your Poisson model! That makes no sense).

While it would in some situations be possible to consider a quasi-Poisson model (variance proportional to the mean), that would still not usually be a suitable model for non-negative measurements of physical quantities.

You would instead typically expect them to have standard deviation proportional to mean - you don't expect the distributional characteristics, aside from a known change to a scale parameter, to change when you change scale.
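To make the units argument concrete, here is a minimal sketch. The readings are made-up illustrative values (not real CRP data) and `summarize` is a hypothetical helper. Re-expressing the same measurements in different units changes the variance-to-mean ratio by the conversion factor, while the ratio of standard deviation to mean stays the same.

```python
import statistics

# Hypothetical CRP readings in mg/L (illustrative values, not real data).
crp_mg_per_l = [0.5, 1.2, 2.0, 3.5, 8.0, 15.0, 40.0]

def summarize(values):
    """Return (variance / mean, standard deviation / mean)."""
    mean = statistics.fmean(values)
    var = statistics.variance(values)
    return var / mean, var ** 0.5 / mean

# The same measurements expressed in mg/dL (1 mg/dL = 10 mg/L).
crp_mg_per_dl = [v / 10 for v in crp_mg_per_l]

ratio_l, cv_l = summarize(crp_mg_per_l)
ratio_dl, cv_dl = summarize(crp_mg_per_dl)

print(f"variance/mean: {ratio_l:.3f} (mg/L) vs {ratio_dl:.3f} (mg/dL)")
print(f"sd/mean:       {cv_l:.3f} (mg/L) vs {cv_dl:.3f} (mg/dL)")
```

A model that fixes variance equal to (or proportional to) the mean therefore describes a different distributional shape in each unit system, whereas a constant sd/mean specification does not care which units you pick.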

If you're looking for a GLM, the obvious candidate with the characteristics non-negative, right skew, standard deviation proportional to mean is the gamma. (There are other commonly used standard-deviation-proportional-to-mean models -- the lognormal, the Weibull and so forth).
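As a rough illustration of what a gamma GLM with log link does for the mean, here is a sketch with a single covariate, using only the standard library. The function name `fit_gamma_log_glm` and the data are hypothetical. It relies on the fact that, for a gamma family with log link, the IRLS weights are constant, so each iteration is just ordinary least squares on a working response. It estimates the mean parameters only (no dispersion, no standard errors); in practice you would use an established GLM routine.

```python
import math

def fit_gamma_log_glm(x, y, iterations=25):
    """Fit log E[y] = b0 + b1 * x by IRLS for a gamma GLM with log link.

    With a log link and variance proportional to mean^2, the IRLS weights
    are constant, so each step is an ordinary least-squares fit to the
    working response z = eta + (y - mu) / mu.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    eta = [math.log(yi) for yi in y]  # start at mu = y (requires y > 0)
    b0 = b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        z_bar = sum(z) / n
        b1 = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
        b0 = z_bar - b1 * x_bar
        eta = [b0 + b1 * xi for xi in x]
    return b0, b1

# Noise-free check: responses generated exactly from b0 = 0.5, b1 = 0.3.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]
b0, b1 = fit_gamma_log_glm(x, y)
print(b0, b1)  # should be very close to 0.5 and 0.3
```

Note the requirement that every response be strictly positive, which is exactly why exact zeros need separate handling.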

However, if you have exact zeros you might consider a zero-inflated or a hurdle model (gamma with a proportion of zeros).
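A sketch of the hurdle idea without covariates, assuming the zeros are genuine zeros rather than censored values. The function names and the data are hypothetical. An observation is zero with probability `p_zero`; otherwise it follows a gamma distribution.

```python
import math

def gamma_logpdf(y, shape, scale):
    """Log density of a gamma distribution at y > 0."""
    return ((shape - 1) * math.log(y) - y / scale
            - math.lgamma(shape) - shape * math.log(scale))

def hurdle_gamma_loglik(values, p_zero, shape, scale):
    """Log-likelihood: exact zero with probability p_zero, else gamma."""
    total = 0.0
    for v in values:
        if v == 0:
            total += math.log(p_zero)
        else:
            total += math.log(1 - p_zero) + gamma_logpdf(v, shape, scale)
    return total

# Hypothetical data: 4 exact zeros out of 10 observations.
values = [0, 0, 0, 1.5, 2.0, 4.0, 9.0, 0, 3.0, 12.0]
p_hat = sum(1 for v in values if v == 0) / len(values)

# For any fixed gamma parameters, the observed zero fraction maximizes
# the zero part of the likelihood.
for p in (0.2, p_hat, 0.6):
    print(p, hurdle_gamma_loglik(values, p, shape=1.0, scale=5.0))
```

Because the log-likelihood splits into a zero/non-zero part and a gamma part for the positive values, the two pieces can be fitted separately (for example a logistic regression for the zeros and a gamma GLM for the positives).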

On the other hand, if the low end is really censored (as with an instrument that cannot detect below some threshold and just records 0 there), then you may be better off explicitly dealing with that censoring. Since censored data are standard in survival models, programs for modelling survival often have what you need built right in (and offer multiple choices for survival time distributions, which you can parlay into models for C-reactive protein levels by simply putting that where survival time would go in the left-censored survival model).
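To show what "explicitly dealing with the censoring" means at the likelihood level, here is a minimal sketch assuming a lognormal model with no covariates and a single fixed detection limit. The lognormal choice, the function names, and the data are all assumptions for illustration. Detected values contribute the density; each nondetect contributes the probability of falling below the limit, rather than being replaced by 0.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def lognormal_logpdf(y, mu, sigma):
    """Log density of a lognormal with log-scale mean mu and sd sigma."""
    z = (math.log(y) - mu) / sigma
    return -0.5 * z * z - math.log(y * sigma * math.sqrt(2 * math.pi))

def left_censored_loglik(observed, n_censored, limit, mu, sigma):
    """Detected values contribute the density; each nondetect contributes
    log P(Y < limit), the probability of being below the detection limit."""
    ll = sum(lognormal_logpdf(y, mu, sigma) for y in observed)
    ll += n_censored * math.log(normal_cdf((math.log(limit) - mu) / sigma))
    return ll

# Hypothetical data: six detected values and four readings reported as "<3".
detected = [3.4, 4.1, 5.0, 7.5, 12.0, 30.0]
print(left_censored_loglik(detected, n_censored=4, limit=3.0, mu=1.2, sigma=1.0))
print(left_censored_loglik(detected, n_censored=4, limit=3.0, mu=3.0, sigma=1.0))
```

Maximizing this function over `mu` and `sigma` (and, with covariates, over regression coefficients in place of `mu`) is what a parametric survival routine with left censoring does for you.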

Glen_b
  • I'm as big a fan of gamma regression as anyone, but I think this answer is too hard on the quasi-Poisson model. Why is it not "suitable"? The estimator is consistent, assuming you have the right functional form for the conditional mean, and you can use a robust VCE to get consistent estimates of the standard errors. It seems suitable to me; it's just probably not as good in all respects (in particular, not as efficient) as a gamma-log model. – The Laconic Oct 21 '17 at 20:34
  • @Laconic the variance specification really doesn't make much sense for physical measurements (indeed, aside from things that are counts or multiples of counts, the variance proportional to mean specification would fairly rarely make sense), so intervals - especially prediction intervals - and tests based on them will often be misleading. Modeling is not just about means! At the least such models should be applied (outside the arena of count data) with forethought, care and a skeptical eye. If you can give some reason to anticipate var. $\propto$ mean in this sort of context, I'm keen to hear it. – Glen_b Oct 21 '17 at 23:57
  • @Glen_b Thank you, that is very helpful. Schoolboy question: if I only wanted to run a univariate linear regression with CRP as IV (lower end censored) and a continuous DV such as weight, how would I do this? In my undergraduate mind survival models and linear regression are separate things? – bobmcpop Oct 25 '17 at 02:26
  • Well, not so separate; parametric survival models are basically GLM-like models with censoring (but not necessarily natural-exponential family). Sure, if you stick to thinking of the response as necessarily *time*, they're separate things, but you can use survival models to fit regression / glm-like models (with or without censoring) to responses that aren't time. Numbers are just numbers, they don't know they're not times; as long as the model is the model you want, it's just fitting censored data by maximum likelihood. Left censoring can sometimes be a bit tricky, as can ...ctd – Glen_b Oct 25 '17 at 08:29
  • ctd... a fixed censoring point. There's an example of using a parametric survival model (to fit data without censoring) here -- https://stats.stackexchange.com/questions/91762/analysis-of-variance-with-weibull-or-gamma-distributions/91786#91786 . Also see [this](https://stats.stackexchange.com/questions/124217/inconsistency-between-r-and-sas-for-mle-on-weibull/124224#124224) and [this](https://stats.stackexchange.com/questions/161891/how-to-estimate-the-parameters-of-frechet-distribution-in-r/161901#161901) ... ctd – Glen_b Oct 25 '17 at 08:43
  • and [this](https://stats.stackexchange.com/questions/230937/how-to-find-initial-values-for-weibull-mle-in-r/230946#230946) for other "nonstandard" uses of survreg. – Glen_b Oct 25 '17 at 09:04